Your Datadog setup is working exactly as designed. That's the problem.
Anomaly detection is firing on a memory metric that spikes every Tuesday during a scheduled job. The APM latency monitor pages your team at 2 AM for a blip that self-resolved in 40 seconds. Three engineers acknowledge the same alert because nobody's sure who owns it. By the time a real incident lands, your team has learned to wait and see before treating any alert as urgent.
Alert fatigue is not a Datadog problem. It's a routing problem. Datadog generates signals — your job is to turn those signals into the right action for the right person at the right time. The two systems are doing different work, and conflating them is where things go wrong.
The Root Cause: Every Alert Takes the Same Path
Most teams start by wiring Datadog monitors directly to PagerDuty or sending every notification to a shared Slack channel. Both approaches collapse over time. The shared channel becomes background noise. PagerDuty fatigue kicks in and engineers start silencing their phones.
The underlying issue is that you have one routing path for signals that warrant very different responses:
- A flapping Redis connection that's self-healing doesn't need a phone call at 3 AM.
- A payment processing error that's been open for 10 minutes absolutely does.
The fix is to give these different signals different paths.
Strategy 1: Route by Priority, Not by Severity
Datadog lets you assign each monitor a priority from P1 (most critical) through P5 (least critical). Most teams set these and then ignore them when configuring notifications. Don't.
In Alert24, you can create routing rules that inspect the incoming webhook payload from Datadog and branch based on priority. Low-priority monitors (P3-P5) go to a Slack channel your team actually reads. High-priority monitors (P1-P2) trigger the on-call escalation policy.
A basic Alert24 routing rule for this looks like:
{
  "rules": [
    {
      "name": "Low-priority to Slack only",
      "conditions": [
        { "field": "priority", "operator": "in", "value": ["P3", "P4", "P5"] }
      ],
      "actions": [
        { "type": "notify", "target": "slack-channel-ops-noise" }
      ],
      "stop_processing": true
    },
    {
      "name": "High-priority to on-call",
      "conditions": [
        { "field": "priority", "operator": "in", "value": ["P1", "P2"] }
      ],
      "actions": [
        { "type": "escalation_policy", "target": "default-oncall" }
      ]
    }
  ]
}
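The stop_processing: true flag on the first rule matters as soon as your rule set grows. With only the two rules above, a P4 alert cannot match the second rule, because its condition lists only P1 and P2. But rules are evaluated top to bottom, so if you later add a broader rule, such as a catch-all that pages on-call, a P4 alert would match the first rule, post to Slack, and then fall through to page the on-call engineer anyway. Setting the flag ends evaluation at the first match.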
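To make the fall-through behavior concrete, here is a minimal sketch of top-to-bottom rule evaluation. It assumes rules are checked in order and that stop_processing ends evaluation at the first match; the `route` and `matches` functions and the catch-all rule are hypothetical and only illustrate the idea, not how Alert24 is implemented.

```python
from typing import Any

# Hypothetical rule set: the low-priority rule from above, followed by a
# catch-all rule (an assumption) that would page on-call for every alert.
RULES: list[dict[str, Any]] = [
    {
        "name": "Low-priority to Slack only",
        "conditions": [
            {"field": "priority", "operator": "in", "value": ["P3", "P4", "P5"]}
        ],
        "actions": [{"type": "notify", "target": "slack-channel-ops-noise"}],
        "stop_processing": True,
    },
    {
        "name": "Catch-all to on-call",
        "conditions": [],  # no conditions, so it matches every alert
        "actions": [{"type": "escalation_policy", "target": "default-oncall"}],
    },
]


def matches(condition: dict[str, Any], alert: dict[str, Any]) -> bool:
    """Evaluate one condition; only the 'in' operator is sketched here."""
    if condition["operator"] == "in":
        return alert.get(condition["field"]) in condition["value"]
    raise ValueError(f"unsupported operator: {condition['operator']}")


def route(alert: dict[str, Any], rules: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collect actions from matching rules, honoring stop_processing."""
    actions: list[dict[str, Any]] = []
    for rule in rules:
        if all(matches(c, alert) for c in rule["conditions"]):
            actions.extend(rule["actions"])
            if rule.get("stop_processing", False):
                break  # first match wins; later rules are never evaluated
    return actions
```

With the flag set, `route({"priority": "P4"}, RULES)` returns only the Slack action. Remove the flag from the first rule and the same alert also collects the escalation action from the catch-all.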
The discipline this requires is keeping your Datadog priorities accurate. If everything is P2, nothing is P2. Do a quick audit of your monitors and ask whether the human response to each one is "wake someone up" or "look at it in the morning." That answer should drive the priority tag.
Strategy 2: Deduplicate Flapping Monitors
A flapping monitor is one that oscillates between OK and ALERT states — often because the threshold is too tight or the check interval is too short. A single flapping monitor can generate dozens of pages in an hour, each one technically a unique alert event.
Datadog has a renotify setting that controls how often a monitor re-notifies in a sustained alert state, but it doesn't prevent the initial burst when a monitor starts flapping.
In Alert24, deduplication works by defining a dedup key — a string that identifies "this is the same underlying problem." Alerts that share a dedup key within a configurable window get collapsed into a single incident instead of generating a new one each time.
For Datadog webhooks, the dedup key is typically the monitor ID. Configure it in your integration settings:
Deduplication key: {{ monitor_id }}
Deduplication window: 30 minutes
With this in place, if your Redis connection monitor fires 15 times in 20 minutes, your on-call engineer gets one page with a note that the monitor has triggered 15 times — not 15 pages. They can look at the incident timeline in Alert24 and immediately see the flapping pattern rather than piecing it together from 15 separate Slack threads.
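The collapsing behavior can be sketched in a few lines. This is an illustration under stated assumptions, not Alert24's implementation: the `Deduplicator` and `Incident` names are hypothetical, and the window is measured from the first alert of an incident (a sliding window measured from the most recent alert would be an equally plausible design).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    dedup_key: str
    first_seen: datetime
    last_seen: datetime
    trigger_count: int = 1


class Deduplicator:
    """Collapse alerts that share a dedup key within a fixed window."""

    def __init__(self, window: timedelta = timedelta(minutes=30)) -> None:
        self.window = window
        self._open: dict[str, Incident] = {}

    def ingest(self, dedup_key: str, at: datetime) -> tuple[Incident, bool]:
        """Return (incident, is_new); only a new incident should page."""
        incident = self._open.get(dedup_key)
        if incident is not None and at - incident.first_seen <= self.window:
            # Same underlying problem: record the trigger, do not page again.
            incident.last_seen = at
            incident.trigger_count += 1
            return incident, False
        # First alert for this key, or the previous window has expired.
        incident = Incident(dedup_key, first_seen=at, last_seen=at)
        self._open[dedup_key] = incident
        return incident, True
```

Feeding this 15 alerts for the same monitor ID over 20 minutes yields one new incident whose `trigger_count` is 15, which is the number the on-call engineer would see on the single page.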
One additional step that pays dividends: set a minimum incident duration before escalating. If a monitor recovers within two minutes, it probably didn't need a page. Alert24's escalation delay setting lets you hold a new incident for N minutes before notifying anyone — if it auto-resolves before the delay expires, it never pages.
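The hold-then-notify decision reduces to a small check. The sketch below is an assumption about how such a delay could work, not Alert24's code; `should_page` and its parameters are hypothetical names, and the two-minute default simply mirrors the example above.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_page(
    triggered_at: datetime,
    resolved_at: Optional[datetime],
    now: datetime,
    delay: timedelta = timedelta(minutes=2),
) -> bool:
    """Return True once an incident has stayed open for the full delay."""
    if resolved_at is not None and resolved_at - triggered_at < delay:
        return False  # recovered inside the hold window, so never page
    return now - triggered_at >= delay
```

A blip that resolves after 40 seconds never pages, while an incident still open at the two-minute mark does.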
Strategy 3: Time-Based Routing for Business Hours vs. After-Hours
Not every alert that warrants a response at 2 PM warrants a phone call at 2 AM. This is especially true for infrastructure alerts that affect internal tooling, non-revenue systems, or issues that require a business decision rather than an immediate technical fix.
Time-based routing lets you define different escalation paths depending on when the alert arrives. During business hours, you can route more aggressively, sending more alerts to more people, since the cost of interruption is lower. After hours, you narrow the funnel: only P1 and P2 page anyone, P3 drops to email, and P4 and P5 wait until morning.
A sample routing matrix:
| Priority | Business hours (9 AM – 6 PM local) | After hours |
|---|---|---|
| P1 | On-call phone + Slack | On-call phone + Slack |
| P2 | On-call phone + Slack | On-call phone + Slack |
| P3 | Email + Slack | Email only |
| P4 | Slack only | Suppressed until morning |
| P5 | Slack only | Suppressed until morning |
In Alert24, time-based conditions use your team's configured timezone and support day-of-week filtering. You can also define "override windows" for planned maintenance, deployments, or on-call handoff periods where the normal routing rules are suspended or modified.
The suppressed-until-morning behavior deserves a note: alerts don't disappear. They're held and delivered as a digest at the start of business hours. Your team still sees them — they just don't get woken up for them.
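The routing matrix above translates directly into a lookup. This sketch assumes business hours mean Monday through Friday, 9 AM to 6 PM in the team's timezone (the weekday restriction is an assumption; the table only gives the hours). The `ROUTING` table, `is_business_hours`, and `channels_for` are hypothetical names, and an empty channel list stands for "hold for the morning digest".

```python
from datetime import datetime, time, tzinfo

BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)

# priority -> (business-hours channels, after-hours channels)
ROUTING: dict[str, tuple[list[str], list[str]]] = {
    "P1": (["phone", "slack"], ["phone", "slack"]),
    "P2": (["phone", "slack"], ["phone", "slack"]),
    "P3": (["email", "slack"], ["email"]),
    "P4": (["slack"], []),  # empty list: held for the morning digest
    "P5": (["slack"], []),
}


def is_business_hours(at: datetime, tz: tzinfo) -> bool:
    """Check a timezone-aware timestamp against the team's local hours."""
    local = at.astimezone(tz)
    is_weekday = local.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and BUSINESS_START <= local.time() < BUSINESS_END


def channels_for(priority: str, at: datetime, tz: tzinfo) -> list[str]:
    """Pick notification channels from the matrix for this moment."""
    business, after_hours = ROUTING[priority]
    return business if is_business_hours(at, tz) else after_hours
```

In practice `tz` would be something like `zoneinfo.ZoneInfo("America/New_York")`. Converting the alert timestamp into the team's zone before comparing is what keeps the 9 AM boundary correct when alerts arrive stamped in UTC.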
Putting It Together in Datadog
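On the Datadog side, the integration is straightforward. Create a webhook integration pointing at your Alert24 ingest URL, and include the priority and monitor ID in the payload. Use $ALERT_PRIORITY for the priority field: it carries the monitor's priority (P1 through P5), whereas $PRIORITY is the event priority (normal or low) and would never match the routing rules.
POST https://app.alert24.com/ingest/YOUR_API_KEY
Body (JSON):
{
  "title": "$EVENT_TITLE",
  "monitor_id": "$ALERT_ID",
  "priority": "$ALERT_PRIORITY",
  "alert_type": "$ALERT_TYPE",
  "tags": "$TAGS",
  "url": "$LINK"
}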
Add this webhook to your Datadog monitors as a notification channel, the same way you'd add an email or Slack notification. You can use it alongside existing Datadog notification channels — Alert24 handles routing and escalation, while Datadog continues sending raw notifications wherever you already have them going.
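For a sense of what the receiving side does with that payload, here is a hedged sketch of normalizing the webhook body before routing. It is not Alert24's ingest code: `normalize` is a hypothetical function, and the choices to treat an unrecognized priority as `None` and to split tags on commas are assumptions based on the payload template above.

```python
import json
from typing import Any, Optional

VALID_PRIORITIES = {"P1", "P2", "P3", "P4", "P5"}


def normalize(raw_body: str) -> dict[str, Any]:
    """Turn the raw webhook body into fields the routing layer can use."""
    payload = json.loads(raw_body)

    # Assumption: anything outside P1-P5 is treated as "no priority set"
    # so a routing rule can decide what to do with it explicitly.
    priority: Optional[str] = payload.get("priority")
    if priority not in VALID_PRIORITIES:
        priority = None

    # Assumption: tags arrive as one comma-separated string.
    raw_tags = payload.get("tags") or ""
    tags = [t.strip() for t in raw_tags.split(",") if t.strip()]

    return {
        "dedup_key": str(payload["monitor_id"]),  # monitor ID doubles as dedup key
        "title": payload.get("title", ""),
        "priority": priority,
        "alert_type": payload.get("alert_type", ""),
        "tags": tags,
        "url": payload.get("url", ""),
    }
```

A payload whose priority field arrives as "normal" comes out with `priority` set to `None`, which makes a misconfigured template visible instead of silently matching no rule.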
Next Steps
Start with strategy two. Deduplication has the most immediate impact on page volume and requires no changes to how your Datadog monitors are configured — it's purely an Alert24 setting.
Then audit your monitor priorities. This takes an afternoon and makes strategies one and three effective. Without accurate priorities, routing rules based on priority are just theater.
Finally, set up time-based routing once your team has a week of data on what's actually paging after hours. The goal is not to suppress real incidents — it's to make sure that when your engineer's phone rings at midnight, everyone, including the engineer, treats it as something that genuinely couldn't wait until morning.
That trust is what gets rebuilt when alert fatigue gets fixed.