Introduction
Let's face it: monitoring is essential, but it can get expensive fast. If you're using Datadog for your observability needs, you've probably experienced that moment of sticker shock when reviewing your monthly bill. With its powerful features and comprehensive monitoring capabilities, Datadog has become a go-to solution for DevOps teams worldwide. However, this robust platform comes with costs that can quickly escalate if not properly managed.
At DevOps Horizon, we've helped countless teams optimize their monitoring strategies while keeping costs under control. In this guide, we'll dive deep into practical strategies for managing Datadog expenses, setting up effective cost monitoring, and implementing alerts to prevent budget overruns.
Understanding Datadog's Pricing Model
Before we jump into cost-saving strategies, it's crucial to understand how Datadog charges for its services. Datadog primarily bills based on:
- Host monitoring: Per host/agent deployed
- Container monitoring: Based on container count
- Custom metrics: Number of unique combinations of metric name and tag values
- Log management: Volume of logs ingested and retained
- APM and tracing: Spans ingested and indexed
- Synthetic monitoring: Number of API tests and browser tests
The primary cost drivers typically include high-cardinality custom metrics, excessive log ingestion, and underutilized host agents. Understanding these factors is the first step toward effective cost management.
Strategies to Save on Datadog Costs
1. Optimize Data Volume
One of the most effective ways to reduce Datadog costs is to be selective about what data you send to the platform:
- Filter logs before ingestion: Use Datadog's filtering capabilities to exclude non-essential logs. For example, you might filter out debug logs in production or exclude high-volume, low-value logs like access logs from certain endpoints.
- Sample high-volume logs: Instead of sending every single log, consider sampling high-volume logs. For instance, if you have millions of similar API request logs, sending a representative sample (e.g., 10-20%) can provide sufficient visibility while dramatically reducing costs.
- Adjust metric collection intervals: Not every metric needs to be collected at 1-second intervals. For less critical metrics, consider increasing collection intervals to 15, 30, or 60 seconds.
```yaml
# Example log filtering configuration in datadog-agent.yaml
logs:
  - type: file
    path: /var/log/application/*.log
    service: my-application
    source: custom
    log_processing_rules:
      - type: exclude_at_match
        name: exclude_debug_logs
        pattern: 'level=DEBUG'
```
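The sampling idea can also live in your application before logs ever reach the agent. Here is a minimal Python sketch of a client-side sampler; the function name and levels are illustrative assumptions, not a Datadog API:

```python
import random

def should_ship(level: str, rng: random.Random, sample_rate: float = 0.1) -> bool:
    """Always ship WARN/ERROR logs; sample everything else at sample_rate."""
    if level in ("WARN", "ERROR"):
        return True
    return rng.random() < sample_rate

# Seeded RNG for a reproducible illustration.
rng = random.Random(42)
kept = sum(should_ship("INFO", rng) for _ in range(10_000))
# With sample_rate=0.1, roughly 10% of INFO logs survive, while every
# WARN/ERROR log is kept.
```

The key design choice is that sampling applies only to high-volume, low-severity streams, so error visibility is never reduced.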
2. Manage Custom Metrics Wisely
Custom metrics can quickly become a major cost driver due to their cardinality (unique combinations of metrics and tag values):
- Audit existing custom metrics: Regularly review your custom metrics to identify those that aren't actively used in dashboards or alerts. These are prime candidates for elimination.
- Reduce tag cardinality: Tags like user IDs, session IDs, or timestamps can create millions of unique metric combinations. Instead, consider bucketing or categorizing high-cardinality values.
- Use distribution metrics: For metrics where you only need statistical summaries (like p90, p95, p99 latencies), consider using distribution metrics instead of sending each individual data point.
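To make the bucketing idea concrete, here is a small Python sketch that collapses a continuous latency value into a handful of tag values instead of tagging with the raw number. The thresholds and bucket names are arbitrary assumptions for illustration:

```python
def latency_bucket(ms: float) -> str:
    """Map a raw latency to one of four tag values.

    Tagging with the raw millisecond value would create a new metric
    series per distinct value; four buckets cap cardinality at four.
    """
    if ms < 100:
        return "latency_bucket:fast"
    if ms < 500:
        return "latency_bucket:normal"
    if ms < 2000:
        return "latency_bucket:slow"
    return "latency_bucket:very_slow"

tags = [latency_bucket(834)]  # -> ["latency_bucket:slow"]
```

The same pattern applies to user IDs (bucket by customer tier or plan) or timestamps (bucket by hour of day).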
3. Implement Efficient Tagging and Filtering
A well-planned tagging strategy is essential for both effective monitoring and cost control:
- Standardize tags across resources: Consistent tagging across your infrastructure helps in more efficient filtering and querying.
- Limit tag values: Establish conventions that limit the potential values for each tag to reduce cardinality.
- Use composite tags strategically: Instead of adding multiple separate tags that might create high cardinality, consider using composite tags for certain dimensions.
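One way to read the composite-tag advice: emit a single tag for a dimension pair, so only combinations that actually occur in traffic exist as tag values. A hypothetical Python sketch (the tag naming scheme here is an assumption, not a Datadog convention):

```python
def route_tag(method: str, path: str) -> str:
    """Build one composite route tag instead of separate method/path tags."""
    normalized = path.strip("/").replace("/", "_") or "root"
    return f"route:{method.lower()}_{normalized}"

route_tag("GET", "/api/users")  # -> "route:get_api_users"
```

Standardizing on a helper like this also enforces the tag-value conventions mentioned above, since no one hand-writes tag strings.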
Monitoring Datadog Usage and Costs
1. Leverage Datadog's Usage Metrics
Datadog provides built-in metrics to help you track your usage:
- Usage metrics: Datadog automatically collects metrics about your usage, such as `datadog.estimated_usage.logs.ingested_bytes` and `datadog.estimated_usage.apm.indexed_spans`.
- Usage attribution: Enable usage attribution to understand which services or teams are driving your Datadog costs.
```json
{
  "widget": {
    "title": "Log Ingestion by Service",
    "type": "timeseries",
    "requests": [
      {
        "q": "sum:datadog.estimated_usage.logs.ingested_bytes{*} by {service}",
        "display_type": "line"
      }
    ]
  }
}
```
2. Create Cost Monitoring Dashboards
Build dedicated dashboards to visualize and track your Datadog usage and associated costs:
- Usage trends dashboard: Track usage patterns over time to identify anomalies or gradual increases.
- Cost allocation dashboard: Visualize costs by team, application, or environment to drive accountability.
- Cost vs. value dashboard: Compare costs against business metrics to ensure your monitoring investment delivers appropriate ROI.
3. Implement Regular Cost Reviews
Establish a routine process for reviewing Datadog usage and costs:
- Weekly usage reviews: Quick checks to catch sudden spikes or anomalies.
- Monthly cost analysis: Deeper analysis of trends, new cost drivers, and opportunities for optimization.
- Quarterly optimization initiatives: Focused projects to implement cost-saving measures based on your analysis.
Setting Up Cost Monitors and Alerts
1. Essential Cost Monitors to Implement
Here are specific monitors you should set up to prevent unexpected cost increases:
a) Log Volume Monitors
Set up monitors to alert when log volume exceeds expected thresholds:
```
sum(last_1d):sum:datadog.estimated_usage.logs.ingested_bytes{*} > 1000000000000
```
This alert triggers when your daily log ingestion exceeds 1TB, which might indicate an issue or unexpected behavior.
b) Custom Metric Cardinality Monitors
Monitor the growth of custom metrics to prevent unexpected billing increases:
```
avg(last_1d):sum:datadog.estimated_usage.metrics.custom_metrics{*} > threshold
```
Set appropriate thresholds based on your plan and expected usage.
c) Container Count Monitors
For container-heavy environments, monitor container counts to prevent unexpected scaling issues:
```
avg(last_1d):sum:datadog.estimated_usage.containers{*} > expected_max_containers * 1.2
```
This alerts when container count exceeds your expected maximum by 20%.
d) APM Trace Volume Monitors
Monitor trace volume to catch unexpected increases in application traffic or tracing configuration changes:
```
sum(last_1d):sum:datadog.estimated_usage.apm.ingested_spans{*} > threshold
```
2. Implement Usage Anomaly Detection
Beyond simple threshold-based alerts, set up anomaly detection to identify unusual patterns:
```
anomalies(avg(last_2w):sum:datadog.estimated_usage.logs.ingested_bytes{*}, 'basic', 2)
```
This detects when log ingestion deviates significantly from the baseline established over the past two weeks.
3. Create Budget Burn-Rate Alerts
Set up alerts based on the rate at which you're consuming your Datadog budget:
- Monthly budget burn rate: Alert when you're on track to exceed your monthly budget.
- Sudden spike alerts: Detect when costs increase dramatically over a short period.
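The burn-rate arithmetic is simple enough to sketch. This hypothetical Python helper projects month-end spend from the pace so far; the function name and inputs are assumptions for illustration, not a Datadog feature:

```python
import calendar
import datetime

def monthly_burn_status(spend_to_date: float, budget: float,
                        today: datetime.date) -> tuple[float, bool]:
    """Project month-end spend from the average daily rate so far.

    Returns (projected_spend, on_track_to_exceed_budget).
    """
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_rate = spend_to_date / today.day
    projected = daily_rate * days_in_month
    return projected, projected > budget

# $6,000 spent by March 15 against a $10,000 budget:
# 6000 / 15 = $400/day, * 31 days = $12,400 projected -> over budget.
projected, over = monthly_burn_status(6000.0, 10000.0,
                                      datetime.date(2024, 3, 15))
```

A sudden-spike alert is the same idea over a shorter window: compare today's daily rate against the trailing average and alert when the ratio exceeds a chosen factor.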
Real-World Implementation Example
Let's look at how a mid-sized SaaS company optimized their Datadog costs:
Initial situation: $12,000/month Datadog bill with 200 hosts, heavy APM usage, and extensive custom metrics.
Optimization steps:
- Identified and removed unused custom metrics, cutting the custom metric count by 40%
- Implemented log filtering to reduce ingestion by 35%
- Configured service-level tagging for better cost attribution
- Set up weekly usage reviews and cost anomaly alerts
Results:
- Reduced monthly costs to $7,200 (40% savings)
- Improved visibility into cost drivers
- Established proactive cost management processes
- Maintained comprehensive monitoring coverage for critical systems
Best Practices and Recommendations
Based on our experience working with numerous DevOps teams, here are our top recommendations:
1. Align Monitoring with Business Objectives
- Define monitoring tiers: Not all systems require the same level of monitoring. Define tiers based on business criticality.
- Match retention to requirements: Adjust retention periods based on actual needs rather than using defaults.
2. Integrate with FinOps Practices
- Implement showback/chargeback: Make teams aware of their Datadog usage costs.
- Set clear budgets: Establish monitoring budgets for each team or application.
- Regular optimization cycles: Build cost optimization into your regular operations.
3. Leverage Datadog's Cost Control Features
- Use the Usage Attribution feature: Track costs by team, application, or environment.
- Utilize usage metrics: Take advantage of Datadog's built-in usage tracking.
- Configure exclusion filters: Use Datadog's filtering capabilities to exclude unnecessary data.
Conclusion
Effective Datadog cost management doesn't mean compromising on monitoring quality. By implementing the strategies outlined in this guide, you can achieve comprehensive visibility into your systems while keeping costs under control.
Remember that cost optimization is an ongoing process, not a one-time effort. Regular reviews, continuous refinement of your monitoring strategy, and fostering a cost-conscious culture will help you maintain the right balance between observability and budget.
At DevOps Horizon, we're committed to helping teams implement effective monitoring strategies that deliver value without breaking the bank. If you need personalized guidance on optimizing your Datadog implementation, contact us for a consultation.
Have you implemented any of these strategies in your organization? What challenges did you face? Share your experiences in the comments below!