How to Deduplicate Prometheus Alerts Across Microservices

The Alert Storm Problem

Your database host goes down at 2 AM. Within thirty seconds your phone is lighting up: payment-service is down, cart-service is down, user-service is down, order-service is down. You have seventeen open incidents before you have time to open a single one. Every service team has a separate alert rule watching the same upstream dependency, and Prometheus fired all of them.

This is an alert storm, and it is one of the most common reasons on-call engineers start ignoring pages. When you cannot trust that a page represents a unique, actionable problem, you start treating all pages as noise — which is exactly when a real, isolated incident slips through.

There are two places to address this: upstream in your alerting pipeline, and downstream in your incident management layer. Both have their place, and understanding where each fits will help you build a system that stays quiet when things are fine and specific when things are broken.

What Alertmanager's inhibit_rules Actually Do

Alertmanager's inhibition feature lets you suppress certain alerts when another alert is already firing. The typical use case is exactly the database scenario above: define a "source" alert for the database host being unreachable, and inhibit all downstream service alerts while that source is active.

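inhibit_rules:
  # source_matchers/target_matchers replace the deprecated
  # source_match/target_match fields (Alertmanager 0.22 and later).
  - source_matchers:
      - alertname = DatabaseHostDown
      - severity = critical
    target_matchers:
      - severity = warning
    equal:
      - cluster
      - namespace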

When DatabaseHostDown is firing, Alertmanager suppresses notifications for every warning-severity alert whose cluster and namespace labels match those on the source alert. Your on-call engineer gets one page, works the actual problem, and the downstream alerts clear on their own when the database recovers.
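
To make the matching logic concrete, here is a simplified Python model of how a rule like the one above is evaluated. It is a sketch only, not Alertmanager's implementation: it handles plain equality matchers, and the alert labels (including the shop namespace) are invented for illustration.

```python
def matches(labels, matchers):
    """True if every matcher key has exactly the given value on the alert."""
    return all(labels.get(key) == value for key, value in matchers.items())


def is_inhibited(target, firing_alerts, rule):
    """True if some firing source alert suppresses the target under this rule."""
    if not matches(target, rule["target"]):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, rule["source"]):
            continue
        # Every label listed under `equal` must agree on source and target.
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


RULE = {
    "source": {"alertname": "DatabaseHostDown", "severity": "critical"},
    "target": {"severity": "warning"},
    "equal": ["cluster", "namespace"],
}

db = {"alertname": "DatabaseHostDown", "severity": "critical",
      "cluster": "prod-us-east", "namespace": "shop"}
payment = {"alertname": "ServiceDatabaseErrors", "severity": "warning",
           "cluster": "prod-us-east", "namespace": "shop"}
elsewhere = {"alertname": "ServiceDatabaseErrors", "severity": "warning",
             "cluster": "prod-eu-west", "namespace": "shop"}

firing = [db, payment, elsewhere]
print(is_inhibited(payment, firing, RULE))    # True: same cluster and namespace
print(is_inhibited(elsewhere, firing, RULE))  # False: different cluster
```

The second result shows why the equal list matters: if the database alert and the service alerts do not carry the same values for those labels, nothing is suppressed.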

This works well when:

  • You control the Alertmanager configuration centrally
  • The relationship between source and downstream alerts is stable and well-understood
  • Your inhibition rules can be expressed in terms of label equality

Where it breaks down is at the edges. Real microservice architectures rarely have clean one-to-one dependency graphs. A degraded network segment might cause partial failures across fifty services in ways that do not fit neatly into a source/target label pair. Inhibition rules also require you to anticipate the failure modes ahead of time — they do not generalize to novel incident patterns.

More practically: inhibition happens before notifications reach your incident management system. If you want a record that twenty services were affected during an incident, your incident history will not have it, because the suppressed alerts never left Alertmanager.

Alias-Based Deduplication at the Incident Layer

A different approach is to let all the alerts fire and route normally, but group them into a single incident using a shared deduplication key. Alert24 calls this an alias — a string you attach to related alerts that tells the system "all of these belong to the same event."

The mechanism is straightforward. When an alert fires, you include an alias field in the payload. Alert24 checks whether an open incident already exists with that alias. If one does, the new alert increments that incident's occurrence count and updates its metadata rather than opening a new incident. Your on-call engineer sees one incident with context showing that it has fired thirty-seven times across twelve services, which is more useful than thirty-seven separate pages.
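
The grouping behaviour described above can be modelled in a few lines of Python. This is a hypothetical in-memory sketch of the idea, not Alert24's implementation; the Incident and IncidentStore names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    alias: str
    occurrences: int = 0
    services: set = field(default_factory=set)


class IncidentStore:
    """Hypothetical model: one open incident per alias."""

    def __init__(self):
        self.open_incidents = {}  # alias -> Incident

    def ingest(self, alias, service):
        # Reuse the open incident for this alias, or open a new one.
        incident = self.open_incidents.get(alias)
        if incident is None:
            incident = Incident(alias=alias)
            self.open_incidents[alias] = incident
        incident.occurrences += 1
        incident.services.add(service)
        return incident


store = IncidentStore()
for service in ["payment", "cart", "payment"]:
    incident = store.ingest("db-outage-prod-us-east", service)

print(len(store.open_incidents), incident.occurrences, sorted(incident.services))
# 1 3 ['cart', 'payment']
```

Three fires from two services produce one incident with an occurrence count of three, which is the shape of record the on-call engineer ends up looking at.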

Sending Alerts with an Alias

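If you are sending alerts from Alertmanager via webhook, set the alias as an annotation on the alert rule. Alertmanager's webhook body has a fixed format that carries each alert's labels and annotations, so the alias travels with the alert without any extra templating: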

# alertmanager.yml
receivers:
  - name: alert24
    webhook_configs:
      - url: https://api.alert24.com/v1/alerts
        http_config:
          authorization:
            credentials: your-api-key
        send_resolved: true

# prometheus/rules/database.yml
groups:
  - name: database
    rules:
      - alert: DatabaseHostDown
        expr: up{job="postgres"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Postgres host {{ $labels.instance }} is unreachable"
          alias: "db-outage-{{ $labels.cluster }}"

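The alias here is scoped to the cluster label. Every alert firing because of a database outage in the prod-us-east cluster will carry the alias db-outage-prod-us-east. One incident opens on the first fire; everything after increments it. This assumes cluster is a label on the series itself. If you set it through external_labels in the Prometheus config, it is not available in $labels at rule evaluation time, and the template should read {{ $externalLabels.cluster }} instead.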
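
For a sense of what arrives at the receiving end, the sketch below reads the alias out of an Alertmanager-style webhook body. The payload is a trimmed example of the standard webhook format with an invented instance value, and aliases_in is a hypothetical helper; how Alert24 parses the body internally is not something this post documents.

```python
import json

# Trimmed example of an Alertmanager webhook body (values are illustrative).
SAMPLE_PAYLOAD = json.loads("""
{
  "version": "4",
  "status": "firing",
  "receiver": "alert24",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "DatabaseHostDown",
        "cluster": "prod-us-east",
        "severity": "critical"
      },
      "annotations": {
        "summary": "Postgres host db-1:5432 is unreachable",
        "alias": "db-outage-prod-us-east"
      }
    }
  ]
}
""")


def aliases_in(payload):
    """Return (alias, alertname, status) for each alert carrying an alias annotation."""
    found = []
    for alert in payload.get("alerts", []):
        alias = alert.get("annotations", {}).get("alias")
        if alias:
            found.append((alias, alert.get("labels", {}).get("alertname"), alert.get("status")))
    return found


print(aliases_in(SAMPLE_PAYLOAD))
# [('db-outage-prod-us-east', 'DatabaseHostDown', 'firing')]
```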

When the root cause clears and Alertmanager sends resolved events, Alert24 closes the incident automatically if all fires carrying that alias have resolved.
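
One way to picture the close-on-full-resolution rule is a set of still-firing alerts per alias, where the incident stays open until the set is empty. The AliasTracker class and the fingerprint strings below are hypothetical, a sketch of the behaviour described here rather than Alert24's code.

```python
class AliasTracker:
    """Hypothetical model: an incident stays open while any alert with its alias fires."""

    def __init__(self):
        self.firing = {}  # alias -> set of alert fingerprints still firing

    def handle(self, alias, fingerprint, status):
        """Apply one firing/resolved event and return the incident state for the alias."""
        active = self.firing.setdefault(alias, set())
        if status == "firing":
            active.add(fingerprint)
        elif status == "resolved":
            active.discard(fingerprint)
        if active:
            return "open"
        # Nothing left firing: close, so a later fire starts a fresh incident.
        del self.firing[alias]
        return "closed"


tracker = AliasTracker()
alias = "db-outage-prod-us-east"
print(tracker.handle(alias, "db-host", "firing"))           # open
print(tracker.handle(alias, "payment-errors", "firing"))    # open
print(tracker.handle(alias, "db-host", "resolved"))         # open
print(tracker.handle(alias, "payment-errors", "resolved"))  # closed
```

Note that the incident stays open after the database alert resolves, because the payment alert is still firing under the same alias.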

Downstream Service Alerts

For the services that depend on the database, wire them with the same alias pattern:

# prometheus/rules/services.yml
groups:
  - name: service-health
    rules:
      - alert: ServiceDatabaseErrors
        expr: rate(http_requests_total{status=~"5..", service=~"payment|cart|user|order"}[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }} experiencing elevated errors"
          alias: "db-outage-{{ $labels.cluster }}"

Now the payment-service errors and cart-service errors both increment the same incident as the database host alert. Your incident timeline shows the sequence: database went down at 02:14, payment errors started at 02:15, cart errors at 02:16. That is a richer picture than seventeen separate incidents with no visible relationship.

Choosing Between the Two Approaches

Scenario                                       | Prefer inhibit_rules | Prefer alias deduplication
-----------------------------------------------|----------------------|----------------------------------
You want zero noise for dependent alerts       | Yes                  | No (dependent alerts still page)
You want audit trail of all affected services  | No                   | Yes
Failure modes are well-defined and static      | Yes                  | Either works
Novel or complex blast radius                  | No                   | Yes
You manage Alertmanager config centrally       | Yes                  | Either works
Teams own their own alert rules independently  | Harder               | Easier

These approaches are not mutually exclusive. A reasonable architecture uses inhibition to silence truly redundant alerts (a flapping instance triggering both a warning and a critical version of the same rule) while using alias deduplication to correlate alerts across service boundaries that represent the same underlying failure.

Practical Considerations

Keep aliases scoped tightly. If your alias is too broad — say, just "database-problems" — unrelated incidents will get merged. Scope to the failure domain: cluster, region, or specific resource identifier.

Set an alias expiry strategy. If an incident closes and the same alias fires again six hours later, you probably want a new incident rather than a reopened one. Alert24 treats resolved incidents as closed; a new fire after resolution opens fresh.

Test your alias logic before an incident. Send a synthetic alert with the alias and verify the incident appears as expected. Then send a second synthetic with the same alias and confirm it increments rather than duplicates. This is much easier to validate in a staging environment than to debug at 2 AM.
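
One way to run that test is to post synthetic alerts to a staging Alertmanager through its v2 alerts API and let the normal webhook route carry them onward. The sketch below assumes an Alertmanager listening on localhost:9093; the alert name, service labels, and alias are invented for the test. The two alerts differ in their service label so that Alertmanager treats them as distinct alerts sharing one alias.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumption: a staging Alertmanager reachable at this address.
ALERTMANAGER_URL = "http://localhost:9093/api/v2/alerts"


def build_synthetic_alert(alias, cluster, service, minutes=5):
    """Build one alert in the shape Alertmanager's v2 API accepts."""
    now = datetime.now(timezone.utc)
    return {
        "labels": {
            "alertname": "SyntheticAliasTest",  # hypothetical test alert name
            "cluster": cluster,
            "service": service,
            "severity": "warning",
        },
        "annotations": {"summary": "Synthetic alias test", "alias": alias},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }


def build_request(alerts, url=ALERTMANAGER_URL):
    """Wrap a list of alerts in a JSON POST request."""
    body = json.dumps(alerts).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def send(alerts, url=ALERTMANAGER_URL):
    """Post the alerts and return the HTTP status code."""
    with urllib.request.urlopen(build_request(alerts, url), timeout=10) as resp:
        return resp.status


first = build_synthetic_alert("alias-test-staging", "staging", "synthetic-a")
second = build_synthetic_alert("alias-test-staging", "staging", "synthetic-b")
```

Call send([first]), confirm one incident appears, then call send([second]) and check that the same incident increments instead of a second one opening.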

Document your alias conventions. If five teams are independently writing alert rules that are supposed to correlate, they need to know the alias naming scheme. A short internal wiki page listing the alias patterns for common shared infrastructure dependencies will save you from discovering inconsistencies during an outage.

Next Steps

Start with your noisiest alert storm. Pull the last week of incident history and find the event that generated the most pages. Look at what those alerts have in common — shared upstream dependency, same cluster, same deployment — and model an alias that would have grouped them.
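
If you can export the label sets of the alerts from that event, finding what they have in common is a small set intersection. The sketch below uses invented alert data, and shared_labels and suggest_alias are hypothetical helpers; treat the suggested alias as a starting point for discussion, not a decision.

```python
def shared_labels(alerts):
    """Return the label pairs present, with identical values, on every alert."""
    if not alerts:
        return {}
    common = set(alerts[0].items())
    for labels in alerts[1:]:
        common &= set(labels.items())
    return dict(common)


def suggest_alias(prefix, common, scope_keys=("cluster", "region")):
    """Build an alias from a prefix plus whichever scope labels the alerts share."""
    parts = [common[key] for key in scope_keys if key in common]
    return "-".join([prefix] + parts)


# Illustrative label sets from one alert storm.
storm = [
    {"alertname": "DatabaseHostDown", "cluster": "prod-us-east", "severity": "critical"},
    {"alertname": "ServiceDatabaseErrors", "cluster": "prod-us-east",
     "service": "payment", "severity": "warning"},
    {"alertname": "ServiceDatabaseErrors", "cluster": "prod-us-east",
     "service": "cart", "severity": "warning"},
]

common = shared_labels(storm)
print(common)                              # {'cluster': 'prod-us-east'}
print(suggest_alias("db-outage", common))  # db-outage-prod-us-east
```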

Wire that alias into your Alertmanager webhook payload and deploy it to a non-production environment first. Trigger a synthetic failure, watch the alerts fire, and confirm Alert24 groups them correctly. Once you have validated the pattern, roll it to production and measure the change in incident volume over the next two weeks.

If you are not yet routing Prometheus alerts through Alert24, the webhook integration setup takes about ten minutes and works with any Alertmanager version from 0.22 onward.