The Problem You Already Know You Have

Your on-call rotation is getting paged at 3 AM for a service that recovered on its own thirty seconds after the alert fired. Your Slack channel dedicated to "low severity" alerts has been muted by everyone on the team. You have alerts with names like "copy of copy of high latency — DO NOT DELETE" because nobody knows which one is authoritative anymore.

Grafana makes it genuinely easy to create alert rules. That ease is a trap. Most teams end up with dozens of overlapping rules, no coherent severity model, and a notification pipeline that treats every threshold breach as equally urgent. The result is alert fatigue — the slow erosion of trust in your alerting system until engineers start ignoring pages the way they ignore cookie consent banners.

The fix is not less coverage. It is smarter alert configuration. Here is how to work through it systematically.

Step 1: Stop Alerting on Transient Spikes with Pending Periods

The most common source of false positives in Grafana is alerting on a single evaluation that crosses a threshold. CPU hits 92% for fifteen seconds during a garbage collection pass. A health check times out once during a deploy. A single slow query spikes p99 latency.

Grafana's pending period setting exists specifically for this. When you set a pending period, an alert must stay in a "pending" state — above the threshold — for the entire duration before it fires. A five-minute pending period on a CPU alert means the CPU has to stay above your threshold for five consecutive minutes before anyone gets paged.

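# Grafana alert rule (exported YAML)
apiVersion: 1
groups:
  - orgId: 1
    name: infrastructure
    folder: Infrastructure    # placeholder folder name, use your own
    interval: 1m              # every rule in this group is evaluated once a minute
    rules:
      - uid: cpu-high
        title: High CPU Usage
        condition: C          # refId of the expression that decides the alert state
        data:
          - refId: A          # CPU utilization per instance, as a fraction from 0 to 1
            queryType: ''
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus
            model:
              expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
              instant: true
              refId: A
          - refId: B          # reduce each series to its latest value
            datasourceUid: __expr__
            model:
              type: reduce
              expression: A
              reducer: last
              refId: B
          - refId: C          # threshold: true when utilization is above 85%
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [0.85]
              refId: C
        noDataState: NoData
        execErrState: Error
        for: 5m               # pending period: must stay above threshold for 5 minutes
        annotations:
          summary: "CPU above 85% for 5+ minutes on {{ $labels.instance }}"
        labels:
          severity: warning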

The for: 5m field is the pending period. Without it, a single evaluation above your threshold pages someone. With it, a transient spike that resolves on its own never generates an incident.
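
To make the effect concrete, here is a minimal Python sketch of the pending-period logic. It is a simplified model for illustration, not Grafana's implementation: the function name `evaluate_states` and the list-of-booleans input are assumptions, and it ignores NoData and Error states.

```python
from datetime import timedelta


def evaluate_states(breaches, interval, pending_for):
    """Return the alert state after each evaluation.

    breaches    -- list of booleans, one per evaluation (True = above threshold)
    interval    -- time between evaluations (the group's interval)
    pending_for -- the pending period (the rule's `for` value)
    """
    states = []
    breached_for = None  # how long the condition has been continuously true
    for breached in breaches:
        if not breached:
            breached_for = None
            states.append("Normal")
            continue
        # The first breach starts the clock at zero; later ones add one interval.
        breached_for = timedelta(0) if breached_for is None else breached_for + interval
        states.append("Alerting" if breached_for >= pending_for else "Pending")
    return states


minute = timedelta(minutes=1)

# A two-evaluation spike never leaves Pending, so nobody is notified.
spike = [False, True, True, False, False, False, False, False]
print(evaluate_states(spike, minute, 5 * minute))

# A sustained breach reaches Alerting once it has lasted the full five minutes.
sustained = [False] + [True] * 7
print(evaluate_states(sustained, minute, 5 * minute))
```

Passing `timedelta(0)` as the pending period makes the very first breaching evaluation return "Alerting", which is the no-pending-period behavior described above.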

A reasonable starting point for pending periods by alert type:

Alert Type	Suggested Pending Period
CPU / Memory usage	5 minutes
HTTP error rate	2 minutes
Service health check	1 minute
Disk space	10 minutes
Latency (p99)	3 minutes

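Adjust based on how quickly your services normally recover, and remember that a pending period delays every page, including the real ones. If a crashed service takes eight minutes to restart, a five-minute pending period on its health check means the first page arrives five minutes into an outage that was never going to clear in thirty seconds.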

Step 2: Consolidate with Evaluation Groups

Grafana evaluates alert rules on a per-group schedule: every rule in a group shares that group's evaluation interval. If you have forty alert rules, each in its own group with its own interval, you have forty independent schedules to reason about. Related alerts are evaluated at different times, which can create confusing sequences of firings, and a handful of needlessly short intervals can put real query load on your data source.

Move related alerts into shared evaluation groups. All your database alerts in one group on a one-minute interval. All your application error rate alerts in another. All your infrastructure capacity alerts in a third. This makes the evaluation behavior predictable, and settling on a sensible interval per group cuts the load from rules that were being evaluated far more often than they needed to be.

More importantly, consolidating groups forces you to audit your rules. You will find duplicates — two teams each created a "high memory" alert for the same hosts. You will find orphaned rules monitoring services that were decommissioned. You will find thresholds that were copy-pasted from a template and never adjusted for the actual service.
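
The duplicate hunt can be partly automated. The sketch below assumes you have already parsed an exported rule file into Python dicts and lists (with a YAML or JSON parser of your choice) and that queries live under `data[].model.expr`, as in the export shown earlier. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def normalize(expr):
    """Collapse whitespace so trivially reformatted queries compare equal."""
    return " ".join(expr.split())


def find_duplicate_rules(groups):
    """Map each query expression used by more than one rule to those rules.

    `groups` mirrors the `groups:` list of an exported rule file after parsing.
    """
    by_expr = defaultdict(list)
    for group in groups:
        for rule in group.get("rules", []):
            for query in rule.get("data", []):
                expr = query.get("model", {}).get("expr")
                if expr:
                    by_expr[normalize(expr)].append((group["name"], rule["title"]))
    return {expr: rules for expr, rules in by_expr.items() if len(rules) > 1}


# Hypothetical sample: two teams created the same memory alert.
groups = [
    {"name": "team-a", "rules": [
        {"title": "High memory", "data": [{"model": {"expr": "node_memory_used  > 0.9"}}]},
    ]},
    {"name": "team-b", "rules": [
        {"title": "Memory high (copy)", "data": [{"model": {"expr": "node_memory_used > 0.9"}}]},
        {"title": "Disk full", "data": [{"model": {"expr": "node_disk_used > 0.95"}}]},
    ]},
]

duplicates = find_duplicate_rules(groups)
for expr, rules in duplicates.items():
    print(expr, "->", rules)
```

Matching on the query text only catches exact copies. Two rules that watch the same hosts with slightly different queries or thresholds still need a human to decide which one is authoritative.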

Step 3: Route by Severity, Not by Alert Existence

Not every alert needs to page someone. This sounds obvious, but most Grafana configurations do not reflect it. The default notification policy routes everything to the same contact point.

Grafana's notification policy tree supports label-based routing. Use it. The label severity is a standard convention — add it to your alert rules during the consolidation pass above, and then build a routing tree that reflects what you actually want to happen:

Severity	Action
critical	Page on-call immediately, escalate if no ack in 10 minutes
warning	Post to Slack #alerts channel, no page
info	Log to incident tracking, no notification

# Grafana notification policy (exported YAML)
apiVersion: 1
policies:
  - receiver: default-email
    group_by: ['grafana_folder', 'alertname']
    routes:
      - receiver: alert24-oncall
        matchers:
          - severity = critical
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 4h
      - receiver: slack-warnings
        matchers:
          - severity = warning
        group_wait: 1m
        group_interval: 10m
        repeat_interval: 12h
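
To check your understanding of how the tree resolves, here is a deliberately simplified Python model of the policy above. It is a sketch under stated assumptions, not Grafana's matching code: it handles only equality matchers on a single level of routes, and the `match` dict and `pick_receiver` name are invented for illustration.

```python
def pick_receiver(labels, policy):
    """Return the receiver of the first route whose matchers all match.

    Falls back to the policy's default receiver when no route matches.
    """
    for route in policy["routes"]:
        if all(labels.get(name) == value for name, value in route["match"].items()):
            return route["receiver"]
    return policy["receiver"]


policy = {
    "receiver": "default-email",
    "routes": [
        {"receiver": "alert24-oncall", "match": {"severity": "critical"}},
        {"receiver": "slack-warnings", "match": {"severity": "warning"}},
    ],
}

print(pick_receiver({"alertname": "High CPU Usage", "severity": "critical"}, policy))
print(pick_receiver({"alertname": "High CPU Usage", "severity": "warning"}, policy))
print(pick_receiver({"alertname": "Cert expiring", "severity": "info"}, policy))
```

The third call returns `default-email`: an alert labeled `severity=info` matches neither route and falls through to the default receiver. If info alerts should produce no notification at all, they need a route of their own or a default receiver that is quiet by design.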

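The repeat_interval settings here matter as much as the routing. repeat_interval controls how often a notification is re-sent for alerts that are still firing. A four-hour repeat on critical alerts means that if the problem is still firing but someone is actively working it, they do not get re-paged every five minutes. A twelve-hour repeat on warnings means a warning that stays firing is re-posted to Slack at most twice a day. It does not suppress flapping: an alert that resolves and fires again generally produces a fresh notification regardless of repeat_interval, which is the problem Step 4 addresses.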
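
A quick back-of-the-envelope calculation shows how much the repeat setting changes notification volume. The helper below is a rough model, assuming one notification when the group starts firing and one more per elapsed repeat_interval. It ignores group_wait and group_interval, and the function name is made up for this example.

```python
import math
from datetime import timedelta


def notifications_while_firing(firing_for, repeat_interval):
    """Count notifications for one alert group that stays firing throughout.

    Counts the notification at the start plus one per full repeat_interval
    that elapses before the alert stops firing.
    """
    return math.ceil(firing_for / repeat_interval)


day = timedelta(hours=24)

print(notifications_while_firing(day, timedelta(minutes=5)))   # 288
print(notifications_while_firing(day, timedelta(hours=4)))     # 6
print(notifications_while_firing(day, timedelta(hours=12)))    # 2
```

One alert left firing for a day produces 288 notifications at a five-minute repeat and 6 at a four-hour repeat.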

Step 4: Deduplicate with Alert24 Routing Rules

Even with good Grafana configuration, you will encounter flapping. A service bounces between healthy and degraded repeatedly over twenty minutes — maybe it is an auto-scaling group adding capacity, or a deployment doing a rolling restart. Each transition from "firing" to "resolved" and back generates a new alert.

If each alert creates a new incident, you end up with fifty incidents for one event. Your incident history is polluted, your on-call engineer is getting paged repeatedly for the same underlying issue, and your metrics on incident frequency are meaningless.

This is where Alert24's deduplication handles what Grafana cannot. When Grafana sends an alert to Alert24 via webhook, Alert24 groups incoming alerts by configurable criteria — by alert name, by the labels you define, or by a fingerprint you provide. If a matching incident is already open, Alert24 updates the existing incident rather than creating a new one. When the alert resolves and re-fires, Alert24 reopens the same incident and logs the transition rather than creating a fresh one.
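
The grouping idea is easy to sketch in Python. What follows is an illustration of the concept only, not Alert24's API or data model: the `Deduplicator` and `Incident` classes are hypothetical, and the event shape (a `status` of "firing" or "resolved" plus a `labels` dict) is loosely modeled on the alerts in a Grafana webhook payload.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    key: tuple
    status: str = "open"
    timeline: list = field(default_factory=list)


class Deduplicator:
    def __init__(self, group_by):
        self.group_by = group_by  # label names that identify "the same problem"
        self.incidents = {}       # grouping key -> Incident

    def handle(self, alert):
        """Apply one alert event and return the incident it was folded into."""
        key = tuple(alert["labels"].get(name) for name in self.group_by)
        incident = self.incidents.get(key)
        if incident is None:
            if alert["status"] == "resolved":
                return None  # nothing open to resolve
            incident = self.incidents[key] = Incident(key)
        # A new firing reopens the existing incident instead of creating one.
        incident.status = "open" if alert["status"] == "firing" else "resolved"
        incident.timeline.append(alert["status"])
        return incident


dedup = Deduplicator(group_by=["alertname", "service"])
labels = {"alertname": "HighLatency", "service": "checkout", "pod": "checkout-1"}

# A service that flaps five times in a row.
for _ in range(5):
    dedup.handle({"status": "firing", "labels": labels})
    dedup.handle({"status": "resolved", "labels": labels})

print(len(dedup.incidents))  # 1
```

Five firing and resolved pairs end up as one incident with a ten-entry timeline. The choice of grouping labels is the important design decision: grouping on `service` but not `pod` folds every pod of a flapping service into the same incident.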

For the warning-severity Slack-routed alerts, Alert24 routing rules let you define logic like: if this alert fires more than three times in thirty minutes and never stays resolved for ten minutes between firings, escalate it to critical. The alert started as a warning, but its behavior pattern (repeated flapping) indicates something that needs human attention. Alert24 promotes it automatically.
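
Spelled out as code, a flap-escalation rule of that kind might look like the sketch below. This is one possible reading of the rule written in plain Python, not Alert24's rule syntax. The function name, the parameter names, and the representation of episodes as `(fired_at, resolved_at)` pairs in minutes are all assumptions.

```python
def should_escalate(episodes, max_firings=3, window=30, min_quiet=10):
    """Decide whether a flapping warning should be promoted to critical.

    episodes -- list of (fired_at, resolved_at) pairs in minutes, sorted by
                fired_at; resolved_at is None while the alert is still firing.
    Returns True when more than `max_firings` firings fall within `window`
    minutes without a resolved gap of at least `min_quiet` minutes between them.
    """
    run = []                  # firing times in the current flapping run
    previous_resolved = None
    for fired_at, resolved_at in episodes:
        if previous_resolved is not None and fired_at - previous_resolved >= min_quiet:
            run = []          # a long enough quiet gap ends the run
        run.append(fired_at)
        # Keep only the firings inside the trailing window.
        run = [t for t in run if fired_at - t <= window]
        if len(run) > max_firings:
            return True
        previous_resolved = resolved_at
    return False


# Four firings in fifteen minutes with three-minute gaps: escalate.
flapping = [(0, 2), (5, 7), (10, 12), (15, None)]
print(should_escalate(flapping))

# Four firings spread out with eighteen-minute quiet gaps: leave as a warning.
occasional = [(0, 2), (20, 22), (40, 42), (60, 62)]
print(should_escalate(occasional))
```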

Before and After: A Noisy Grafana Config

Here is what a typical noisy configuration looks like versus a cleaned-up one:

Before

47 alert rules across 23 separate evaluation groups
All rules route to the same contact point (email to the whole team)
No pending periods — every threshold breach fires immediately
No severity labels
6 duplicate rules for the same database host
repeat_interval: 5m on all rules

An engineer on this team receives 200+ notifications on a busy day, most of which are transient spikes or duplicates. Response rate to genuine incidents is low because the signal-to-noise ratio is terrible.

After

31 alert rules in 6 evaluation groups (removed duplicates, consolidated)
Routing by severity: 8 critical rules page on-call via Alert24, 23 warning rules post to Slack
Pending periods added: 1–10 minutes depending on alert type
Alert24 deduplication groups flapping alerts by service name
repeat_interval: 4h on critical, 12h on warning

The same infrastructure now generates roughly 15–20 actionable pages per week instead of 200+ notifications per day. The incidents that do fire are real.

Concrete Next Steps

Start with a one-hour audit. Export your Grafana alert rules to YAML so you have a complete inventory, then check each rule's state history for anything that has never fired or has fired more than ten times in the past week. Alerts that never fire deserve a second look: the threshold may be too high, or the rule may be redundant with something else. Alerts that fire constantly are almost always misconfigured.
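
If you can get firing timestamps per rule into a script, the triage itself is a few lines. The sketch below assumes you have collected them yourself into a dict keyed by rule title. How you collect them depends on your setup and is not shown, and the `audit` function and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def audit(rule_titles, firings, now, noisy_threshold=10, lookback=timedelta(days=7)):
    """Split rules into those that never fired and those that fire too often.

    firings -- dict mapping rule title to a list of datetimes when it fired
    """
    never_fired, noisy = [], []
    for title in rule_titles:
        history = firings.get(title, [])
        recent = [t for t in history if now - t <= lookback]
        if not history:
            never_fired.append(title)
        elif len(recent) > noisy_threshold:
            noisy.append(title)
    return never_fired, noisy


now = datetime(2024, 1, 8, 12, 0)
rule_titles = ["High CPU Usage", "Disk full", "Old service health"]
firings = {
    # Fired every six hours over the past three days: twelve times in a week.
    "High CPU Usage": [now - timedelta(hours=6 * i) for i in range(12)],
    # Fired once, two days ago.
    "Disk full": [now - timedelta(days=2)],
}

never_fired, noisy = audit(rule_titles, firings, now)
print("never fired:", never_fired)
print("noisy:", noisy)
```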

Add pending periods to every rule that does not have one. Even a sixty-second pending period eliminates a large fraction of transient false positives.

Set up severity labels and build a routing tree that actually reflects your team's priorities. Not everything should page someone.

If flapping is your main problem, connect Grafana to Alert24 via webhook and configure deduplication rules. Alert24 gives you the incident grouping, escalation policies, and routing logic that Grafana's notification system was not built to handle. Grafana evaluates your rules and fires reliably — Alert24 handles what happens after the webhook lands, including making sure a flapping service generates one incident with a clear timeline rather than fifty separate noise events.

Your monitoring coverage does not change. Your team's ability to trust the alerts that do fire improves significantly.

How to Reduce Grafana Alert Noise Without Reducing Coverage