The Problem You Already Know You Have
Your on-call rotation is getting paged at 3 AM for a service that recovered on its own thirty seconds after the alert fired. Your Slack channel dedicated to "low severity" alerts has been muted by everyone on the team. You have alerts with names like "copy of copy of high latency — DO NOT DELETE" because nobody knows which one is authoritative anymore.
Grafana makes it genuinely easy to create alert rules. That ease is a trap. Most teams end up with dozens of overlapping rules, no coherent severity model, and a notification pipeline that treats every threshold breach as equally urgent. The result is alert fatigue — the slow erosion of trust in your alerting system until engineers start ignoring pages the way they ignore cookie consent banners.
The fix is not just fewer alert rules. It is smarter alert configuration. Here is how to work through it systematically.
Step 1: Stop Alerting on Transient Spikes with Pending Periods
The most common source of false positives in Grafana is alerting on a single evaluation that crosses a threshold. CPU hits 92% for fifteen seconds during a garbage collection pass. A health check times out once during a deploy. A single slow query spikes p99 latency.
Grafana's pending period setting exists specifically for this. When you set a pending period, an alert must stay in a "pending" state — above the threshold — for the entire duration before it fires. A five-minute pending period on a CPU alert means the CPU has to stay above your threshold for five consecutive minutes before anyone gets paged.
```yaml
# Grafana alert rule (exported YAML)
apiVersion: 1
groups:
  - name: infrastructure
    interval: 1m
    rules:
      - uid: cpu-high
        title: High CPU Usage
        condition: C  # refers to a threshold expression that is not shown in this excerpt
        data:
          - refId: A
            queryType: ''
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus
            model:
              # Fraction of non-idle CPU time per instance (0 to 1)
              expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
        noDataState: NoData
        execErrState: Error
        for: 5m  # <-- pending period: must stay above threshold for 5 minutes
        annotations:
          summary: "CPU above 85% for 5+ minutes on {{ $labels.instance }}"
        labels:
          severity: warning
```
The for: 5m field is the pending period. Without it, a single evaluation above your threshold pages someone. With it, a transient spike that resolves on its own never generates an incident.
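To make the pending-period behavior concrete, here is a minimal Python sketch of the state transitions described above. It is a simplified model, not Grafana's implementation: the function name `evaluate_states` and the three state strings are chosen for illustration, and it assumes one evaluation per interval with no missing data.

```python
from typing import List


def evaluate_states(breaches: List[bool], interval_s: int, pending_s: int) -> List[str]:
    """Return the rule state after each evaluation.

    breaches[i] is True when evaluation i is above the threshold.
    A rule is "Pending" while the breach has lasted less than pending_s,
    and "Alerting" once it has lasted pending_s or longer.
    """
    states: List[str] = []
    pending_since = None  # time of the first breach in the current streak
    for i, breached in enumerate(breaches):
        now = i * interval_s
        if not breached:
            pending_since = None  # any healthy evaluation resets the clock
            states.append("Normal")
            continue
        if pending_since is None:
            pending_since = now
        elapsed = now - pending_since
        states.append("Alerting" if elapsed >= pending_s else "Pending")
    return states


# One-minute interval, five-minute pending period:
# a single spike never fires, a sustained breach does.
spike_then_sustained = [True, False, True, True, True, True, True, True]
print(evaluate_states(spike_then_sustained, interval_s=60, pending_s=300))
```

In this model the first spike only ever reaches "Pending", while the sustained breach reaches "Alerting" on the sixth consecutive breaching evaluation. Setting `pending_s=0` reproduces the noisy behavior where the first breach fires immediately.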
A reasonable starting point for pending periods by alert type:
| Alert Type | Suggested Pending Period |
|---|---|
| CPU / Memory usage | 5 minutes |
| HTTP error rate | 2 minutes |
| Service health check | 1 minute |
| Disk space | 10 minutes |
| Latency (p99) | 3 minutes |
Adjust based on how quickly your services normally recover. If your service genuinely takes eight minutes to restart after a crash, a five-minute pending period on a health check will delay your page unnecessarily.
Step 2: Consolidate with Evaluation Groups
Grafana evaluates alert rules on a per-group schedule. If you have forty alert rules each configured in their own group with different evaluation intervals, you are running forty separate query loops against your data source. That is expensive, and it means related alerts are evaluated at different times, which can create confusing sequences of firings.
Move related alerts into shared evaluation groups. All your database alerts in one group on a one-minute interval. All your application error rate alerts in another. All your infrastructure capacity alerts in a third. This reduces load on your data source and makes the evaluation behavior predictable.
More importantly, consolidating groups forces you to audit your rules. You will find duplicates — two teams each created a "high memory" alert for the same hosts. You will find orphaned rules monitoring services that were decommissioned. You will find thresholds that were copy-pasted from a template and never adjusted for the actual service.
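One way to start that audit is to look for rules whose queries are textually the same. The sketch below is a rough heuristic under stated assumptions: it expects you to have already extracted each rule's title and query expression into plain dictionaries (the `title` and `expr` keys are hypothetical, not a Grafana export format), and it treats two expressions as duplicates when they match after whitespace is removed.

```python
import re
from collections import defaultdict
from typing import Dict, List


def find_duplicate_rules(rules: List[dict]) -> Dict[str, List[str]]:
    """Group rule titles by whitespace-normalized query expression.

    Only expressions shared by more than one rule are returned.
    This is a textual heuristic: semantically equivalent queries that are
    written differently will not be detected.
    """
    by_expr: Dict[str, List[str]] = defaultdict(list)
    for rule in rules:
        key = re.sub(r"\s+", "", rule["expr"])
        by_expr[key].append(rule["title"])
    return {expr: titles for expr, titles in by_expr.items() if len(titles) > 1}


rules = [
    {"title": "High memory (team A)", "expr": "avg by (instance) (mem_used_ratio)"},
    {"title": "High memory (team B)", "expr": "avg by(instance)(mem_used_ratio)"},
    {"title": "Disk almost full", "expr": "disk_free_ratio"},
]
for expr, titles in find_duplicate_rules(rules).items():
    print(expr, "->", titles)
```

Rules flagged this way still need a human look, since two identical queries can legitimately carry different thresholds or severities.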
Step 3: Route by Severity, Not by Alert Existence
Not every alert needs to page someone. This sounds obvious, but most Grafana configurations do not reflect it. The default notification policy routes everything to the same contact point.
Grafana's notification policy tree supports label-based routing. Use it. The label severity is a standard convention — add it to your alert rules during the consolidation pass above, and then build a routing tree that reflects what you actually want to happen:
| Severity | Action |
|---|---|
| critical | Page on-call immediately, escalate if no ack in 10 minutes |
| warning | Post to Slack #alerts channel, no page |
| info | Log to incident tracking, no notification |
```yaml
# Grafana notification policy (exported YAML)
apiVersion: 1
policies:
  - receiver: default-email
    group_by: ['grafana_folder', 'alertname']
    routes:
      - receiver: alert24-oncall
        matchers:
          - severity = critical
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 4h
      - receiver: slack-warnings
        matchers:
          - severity = warning
        group_wait: 1m
        group_interval: 10m
        repeat_interval: 12h
```
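The repeat_interval settings here matter as much as the routing. A four-hour repeat on critical alerts means that if the problem is still firing while someone is actively working it, they are re-paged at most every four hours rather than every few minutes. A twelve-hour repeat on warnings means an alert that stays in the firing state is re-posted to the Slack channel at most twice a day. Note that repeat_interval only throttles reminders for alerts that remain firing; an alert that resolves and then fires again still produces a new notification.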
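The routing decision in the policy tree above can be sketched in a few lines of Python. This is an illustrative model only: the `POLICY` dictionary is a hand-written mirror of the YAML, `pick_receiver` is a hypothetical helper, and it supports only equality matchers with first-match-wins semantics.

```python
from typing import Dict

# Hand-written mirror of the notification policy above (equality matchers only).
POLICY = {
    "receiver": "default-email",
    "routes": [
        {"receiver": "alert24-oncall", "matchers": {"severity": "critical"}},
        {"receiver": "slack-warnings", "matchers": {"severity": "warning"}},
    ],
}


def pick_receiver(labels: Dict[str, str], policy: dict = POLICY) -> str:
    """Return the receiver of the first route whose matchers all match.

    Alerts that match no route fall back to the root policy's receiver.
    """
    for route in policy["routes"]:
        if all(labels.get(name) == value for name, value in route["matchers"].items()):
            return route["receiver"]
    return policy["receiver"]


print(pick_receiver({"alertname": "High CPU Usage", "severity": "critical"}))
print(pick_receiver({"alertname": "High CPU Usage", "severity": "warning"}))
print(pick_receiver({"alertname": "Deploy finished", "severity": "info"}))
```

The last call shows a detail worth checking in your own tree: with the policy as written, an alert labeled `severity: info` matches neither route and falls through to `default-email`, so a separate route is needed if info-level alerts should produce no notification.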
Step 4: Deduplicate with Alert24 Routing Rules
Even with good Grafana configuration, you will encounter flapping. A service bounces between healthy and degraded repeatedly over twenty minutes — maybe it is an auto-scaling group adding capacity, or a deployment doing a rolling restart. Each transition from "firing" to "resolved" and back generates a new alert.
If each alert creates a new incident, you end up with fifty incidents for one event. Your incident history is polluted, your on-call engineer is getting paged repeatedly for the same underlying issue, and your metrics on incident frequency are meaningless.
This is where Alert24's deduplication handles what Grafana cannot. When Grafana sends an alert to Alert24 via webhook, Alert24 groups incoming alerts by configurable criteria — by alert name, by the labels you define, or by a fingerprint you provide. If a matching incident is already open, Alert24 updates the existing incident rather than creating a new one. When the alert resolves and re-fires, Alert24 reopens the same incident and logs the transition rather than creating a fresh one.
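The grouping idea can be illustrated with a small Python sketch. This is not Alert24's implementation or API: the `fingerprint` function, the `IncidentStore` class, and the simplified alert dictionaries (a `status` plus a `labels` map) are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def fingerprint(alert: dict, group_labels: Tuple[str, ...] = ("alertname", "service")) -> str:
    """Derive a stable key from the labels chosen for grouping."""
    parts = [f"{name}={alert['labels'].get(name, '')}" for name in group_labels]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]


@dataclass
class Incident:
    key: str
    status: str = "open"
    timeline: List[str] = field(default_factory=list)


class IncidentStore:
    """Keeps one incident per fingerprint and records every transition."""

    def __init__(self) -> None:
        self.incidents: Dict[str, Incident] = {}

    def ingest(self, alert: dict) -> Incident:
        key = fingerprint(alert)
        incident = self.incidents.get(key)
        if incident is None:
            incident = Incident(key=key)  # first time this fingerprint is seen
            self.incidents[key] = incident
        # A re-fire reopens the same incident instead of creating a new one.
        incident.status = "open" if alert["status"] == "firing" else "resolved"
        incident.timeline.append(alert["status"])
        return incident


store = IncidentStore()
labels = {"alertname": "High latency", "service": "checkout"}
for status in ["firing", "resolved", "firing", "resolved", "firing"]:
    store.ingest({"status": status, "labels": labels})
print(len(store.incidents), "incident(s)")
```

Five transitions from a flapping service land in a single incident with a five-entry timeline, which is the behavior the paragraph above describes.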
For the warning-severity alerts routed to Slack, Alert24 routing rules let you define logic like: if this alert fires more than three times in thirty minutes, without staying resolved for at least ten minutes between firings, escalate it to critical. The alert started as a warning, but its behavior pattern of repeated flapping indicates something that needs human attention. Alert24 promotes it automatically.
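The escalation rule described above can be expressed as a short Python function. This is a sketch of the logic only, not Alert24's rule syntax: the function name `should_escalate`, its parameters, and the event tuple format are hypothetical, and the defaults simply encode the numbers from the example (more than three firings, thirty-minute window, ten-minute quiet period).

```python
from typing import List, Tuple


def should_escalate(
    events: List[Tuple[int, str]],
    window_s: int = 1800,
    max_firings: int = 3,
    quiet_s: int = 600,
) -> bool:
    """Decide whether a flapping warning should be promoted to critical.

    events is a chronologically sorted list of (timestamp_seconds, status),
    where status is "firing" or "resolved". A streak of firings is reset
    whenever the alert stayed resolved for quiet_s seconds or longer.
    """
    streak: List[int] = []  # firing timestamps in the current flap streak
    last_resolved = None
    for ts, status in events:
        if status == "resolved":
            last_resolved = ts
            continue
        if last_resolved is not None and ts - last_resolved >= quiet_s:
            streak = []  # the alert was quiet long enough; start over
        streak.append(ts)
        streak = [t for t in streak if ts - t <= window_s]  # trailing window
        if len(streak) > max_firings:
            return True
    return False


flapping = [
    (0, "firing"), (120, "resolved"),
    (300, "firing"), (400, "resolved"),
    (600, "firing"), (700, "resolved"),
    (900, "firing"),
]
print(should_escalate(flapping))
```

The fourth firing at 900 seconds arrives within the thirty-minute window and without a ten-minute quiet gap, so this sequence is escalated. Widely spaced firings reset the streak and stay at warning.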
Before and After: A Noisy Grafana Config
Here is what a typical noisy configuration looks like versus a cleaned-up one:
Before
- 47 alert rules across 23 separate evaluation groups
- All rules route to the same contact point (email to the whole team)
- No pending periods — every threshold breach fires immediately
- No severity labels
- 6 duplicate rules for the same database host
- repeat_interval: 5m on all rules
An engineer on this team receives 200+ notifications on a busy day, most of which are transient spikes or duplicates. Response rate to genuine incidents is low because the signal-to-noise ratio is terrible.
After
- 31 alert rules in 6 evaluation groups (removed duplicates, consolidated)
- Routing by severity: 8 critical rules page on-call via Alert24, 23 warning rules post to Slack
- Pending periods added: 1–10 minutes depending on alert type
- Alert24 deduplication groups flapping alerts by service name
- repeat_interval: 4h on critical, 12h on warning
The same infrastructure now generates roughly 15–20 actionable pages per week instead of 200+ notifications per day. The incidents that do fire are real.
Concrete Next Steps
Start with a one-hour audit. Export your Grafana alert rules to YAML and, using each rule's state history, identify anything that has never fired or has fired more than ten times in the past week. Alerts that never fire are either wrong (threshold too high) or redundant (covered by something else). Alerts that fire constantly are misconfigured.
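The triage step of that audit is simple enough to script. The sketch below assumes you have already collected, per rule, the timestamps at which it fired (for example from your alert history); the `triage` function and its input shape are hypothetical, and the cutoff of ten firings per week is the one suggested above.

```python
from typing import Dict, List

WEEK_S = 7 * 24 * 3600


def triage(firings: Dict[str, List[int]], now_s: int, noisy_over: int = 10) -> Dict[str, List[str]]:
    """Split rules into those that never fired and those that fire constantly.

    firings maps a rule title to the Unix timestamps at which it fired.
    A rule is "noisy" when it fired more than noisy_over times in the past week.
    """
    report: Dict[str, List[str]] = {"never_fired": [], "noisy": []}
    for title, stamps in sorted(firings.items()):
        if not stamps:
            report["never_fired"].append(title)
        elif sum(1 for t in stamps if now_s - t <= WEEK_S) > noisy_over:
            report["noisy"].append(title)
    return report


now = 1_700_000_000
history = {
    "Disk almost full": [],
    "High CPU Usage": [now - 3600 * h for h in range(1, 13)],  # 12 firings this week
    "Replica lag": [now - 3 * 24 * 3600],
}
print(triage(history, now))
```

Rules in neither bucket are the ones most likely to be healthy; the two lists are where review time pays off first.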
Add pending periods to every rule that does not have one. Even a sixty-second pending period eliminates a large fraction of transient false positives.
Set up severity labels and build a routing tree that actually reflects your team's priorities. Not everything should page someone.
If flapping is your main problem, connect Grafana to Alert24 via webhook and configure deduplication rules. Alert24 gives you the incident grouping, escalation policies, and routing logic that Grafana's notification system was not built to handle. Grafana evaluates your rules and fires reliably — Alert24 handles what happens after the webhook lands, including making sure a flapping service generates one incident with a clear timeline rather than fifty separate noise events.
Your monitoring coverage does not change. Your team's ability to trust the alerts that do fire improves significantly.