How to Reduce Nagios Alert Fatigue Without Missing Real Incidents

Your phone buzzed at 2 AM. Disk usage on a non-critical log server hit 78%. Same alert at 2:05. Same at 2:10. By the time a genuine production database failure fired at 2:47, you'd already silenced your phone.

That's the core problem with a default Nagios setup: it treats a transient CPU spike and a full application outage as equally urgent, sends both to the same channel, and re-notifies every five minutes until someone acknowledges. The result is a team that learns to ignore pages — which is the single worst outcome monitoring can produce.

This post walks through the concrete Nagios configuration changes that reduce noise without creating blind spots, and shows how pairing Nagios with a routing layer that understands severity gets you the rest of the way.

The Root Causes of Nagios Alert Fatigue

Nagios is a check engine. Its job is to run checks and report state. It was never designed to be an intelligent routing layer, and the default configuration reflects that. Three patterns drive most of the noise:

Soft-state spam. Nagios only notifies on HARD states, but when max_check_attempts is set to 1, the very first failed check is already HARD. A failure that a single retry would have cleared never gets the chance to stay SOFT, so one blip on a flaky network check fires a page before Nagios has any reason to believe the failure is real.

Flat notification intervals. A 5-minute re-notification interval makes sense for a down host. It makes no sense for a warning-level disk check that will resolve itself when a log rotation runs at 3 AM.

No severity differentiation. The same notification command fires for CRITICAL and WARNING states. Your on-call engineer gets paged at the same urgency for a 78% disk warning and a database that refuses connections.

Fix 1: Require HARD States Before Notifying

The max_check_attempts and notification_options settings let you filter out soft-state noise without changing how Nagios detects problems.

# /etc/nagios/conf.d/templates.cfg

define service {
    name                    standard-service
    max_check_attempts      4          ; Must fail 4 consecutive checks before HARD
    check_interval          1          ; Check every minute
    retry_interval          1          ; Re-check every minute during soft state
    notification_interval   60         ; Re-notify every 60 min, not every 5
    notification_options    c,r        ; Only notify on CRITICAL and RECOVERY
    register                0
}

define service {
    name                    warning-service
    use                     standard-service
    notification_options    w,r        ; Warn-level checks: WARNING + RECOVERY only
    notification_interval   240        ; Re-notify every 4 hours for warnings
    register                0
}

With max_check_attempts 4 and a 1-minute retry, a service has to fail four consecutive checks (the initial failure plus three retries, roughly three minutes) before Nagios calls the state HARD and sends a notification. Transient blips that clear inside that window never notify anyone.

Setting notification_options c,r on your critical services means a WARNING state on those services never pages anyone; only CRITICAL and RECOVERY do. Bumping notification_interval to 60 minutes stops the repeat-page loop while still reminding you that a problem is open.

Fix 2: Separate Warning and Critical Contacts

Nagios allows contact groups per service. Use that to create a deliberate routing split.

define contactgroup {
    contactgroup_name    p1-oncall
    alias                On-call Engineers (P1 only)
    members              engineer-alice, engineer-bob
}

define contactgroup {
    contactgroup_name    warning-email
    alias                Warning-level email recipients
    members              ops-team-email
}

define service {
    use                  standard-service
    host_name            db-prod-01
    service_description  PostgreSQL Connections
    check_command        check_pgsql_connections
    contact_groups       p1-oncall       ; Pages on-call for this one
}

define service {
    use                  warning-service
    host_name            web-01
    service_description  Disk Usage /var/log
    check_command        check_disk_warn_only
    contact_groups       warning-email   ; Email only, never pages
}

This is the right boundary: Nagios decides what is broken and how severely, then routes to the appropriate contact group. Each contact's notification commands determine how the notification is delivered, but those commands are blunt instruments. A "notify by phone" command is a script that calls an API; it doesn't understand on-call schedules, escalation paths, or whether your engineer just acknowledged a related incident 10 minutes ago.
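To make the "blunt instrument" point concrete, here is a minimal sketch of what a paging command and the contact that uses it might look like. The script path /usr/local/bin/page_oncall.sh and its flags are hypothetical, the pager number is a placeholder, and the 24x7 time period and notify-host-by-email command are assumed to exist as in the stock sample configuration. The $...$ tokens are standard Nagios macros.

```cfg
# Hypothetical paging script: the path and flags are assumptions, not a real tool.
define command {
    command_name    notify-service-by-page
    command_line    /usr/local/bin/page_oncall.sh --to "$CONTACTPAGER$" --host "$HOSTNAME$" --service "$SERVICEDESC$" --state "$SERVICESTATE$" --output "$SERVICEOUTPUT$"
}

define contact {
    contact_name                    engineer-alice
    alias                           Alice (on-call)
    pager                           +15555550100            ; Placeholder number
    host_notifications_enabled      1
    service_notifications_enabled   1
    host_notification_period        24x7                    ; Assumed from the sample config
    service_notification_period     24x7
    host_notification_options       d,r
    service_notification_options    c,r                     ; Contact-level filter: CRITICAL + RECOVERY
    host_notification_commands      notify-host-by-email    ; Assumed from the sample config
    service_notification_commands   notify-service-by-page
}
```

Everything the script knows about the incident arrives through those macros. Whether Alice is actually on call tonight, or already working a related problem, is invisible to it.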

Fix 3: Deduplication and Escalation Belong Outside Nagios

Here's where most Nagios configurations hit a ceiling. You can tune HARD states and contact groups, but you can't easily express rules like:

  • "If this alert isn't acknowledged in 15 minutes, escalate to the secondary on-call."
  • "Don't page me again for the same failing service if I already have an open incident for it."
  • "This PostgreSQL alert and the app server alert at the same time are probably the same root cause."

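Nagios has escalation directives (define serviceescalation), but they trigger on notification count, which in practice means elapsed time. They have no concept of on-call schedules or of who has already engaged with the incident; an acknowledgement simply stops further notifications. Nagios also has no real deduplication model. Unless you have modeled host parents or service dependencies, five services that fail because a network switch went down produce five separate notification threads.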
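For reference, this is roughly what a native escalation looks like. It is a sketch under assumptions: the secondary-oncall contact group is hypothetical, and the timing assumes the 60-minute notification_interval from the standard-service template earlier in this post.

```cfg
define serviceescalation {
    host_name               db-prod-01
    service_description     PostgreSQL Connections
    first_notification      2                           ; Starts at the 2nd notification
    last_notification       0                           ; 0 = stays in effect until recovery
    notification_interval   15                          ; Re-notify every 15 min while escalated
    contact_groups          p1-oncall,secondary-oncall  ; secondary-oncall is hypothetical
}
```

The trigger is "the second notification went out", not "nobody responded". With a 60-minute interval on the service, that second notification lands about an hour after the first. The escalation's contact_groups also replace the service's own list while it is active, which is why p1-oncall is repeated here.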

This is the gap a purpose-built routing layer fills.

Connecting Nagios to a Routing Layer

Alert24 accepts alerts from Nagios via webhook or the Nagios notification command. Once the alert arrives, routing rules handle the logic that Nagios configuration can't express cleanly.
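On the Nagios side, the notification-command route can be sketched as a curl call attached to a dedicated contact. The endpoint URL and the form field names below are placeholders, not Alert24's documented API; take the real values from your Alert24 integration settings. The alert24-webhook contact name is also hypothetical, and the 24x7 time period and notify-host-by-email command are assumed from the stock sample configuration.

```cfg
# Placeholder endpoint and field names: nothing here is Alert24's documented API.
define command {
    command_name    notify-service-by-webhook
    command_line    /usr/bin/curl -sS --max-time 10 --data-urlencode "host=$HOSTNAME$" --data-urlencode "service=$SERVICEDESC$" --data-urlencode "state=$SERVICESTATE$" --data-urlencode "state_type=$SERVICESTATETYPE$" --data-urlencode "type=$NOTIFICATIONTYPE$" --data-urlencode "output=$SERVICEOUTPUT$" https://alert24.example.invalid/nagios-webhook
}

define contact {
    contact_name                    alert24-webhook         ; Hypothetical contact name
    alias                           Alert24 webhook
    host_notifications_enabled      1
    service_notifications_enabled   1
    host_notification_period        24x7                    ; Assumed from the sample config
    service_notification_period     24x7
    host_notification_options       d,r
    service_notification_options    w,u,c,r                 ; Send every state; let routing rules filter
    host_notification_commands      notify-host-by-email    ; Assumed from the sample config
    service_notification_commands   notify-service-by-webhook
}
```

Using --data-urlencode keeps check output that contains spaces or punctuation from corrupting the request. Adding this contact to a service's contact group sends that service's notifications to the webhook, including the RECOVERY notification.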

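| Nagios State   | Alert24 Severity | Routing Rule                                                          |
|----------------|------------------|-----------------------------------------------------------------------|
| CRITICAL, HARD | P1               | SMS + phone call to on-call; escalate in 15 min if unacknowledged     |
| WARNING, HARD  | P3               | Email to ops channel; no phone call                                   |
| UNKNOWN        | P2               | Slack + SMS; no escalation                                            |
| RECOVERY       | (none)           | Auto-resolve open incident; notify same contacts who received the alert |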

A typical routing rule in Alert24 looks like this:

  • Condition: severity == P1 AND source == nagios
  • Channel: Phone call, then SMS if no answer within 2 minutes
  • Escalation: If unacknowledged after 15 minutes, notify secondary on-call
  • Deduplication key: host + service_description

The deduplication key is what stops repeat pages for an ongoing incident. Once Alert24 opens an incident for db-prod-01 / PostgreSQL Connections, every subsequent Nagios notification for that host/service combination updates the existing incident rather than firing a new page. Your engineer gets one phone call, works the problem, and acknowledges it. The duplicate notifications from Nagios's re-notification interval are absorbed silently.

When Nagios sends a RECOVERY notification, Alert24 auto-resolves the incident and sends a recovery summary to the same people who were paged — no separate configuration needed.

Before and After

Before these changes, a typical noisy hour during a disk-fill event looks like: 14 pages across SMS, email, and Slack — most of them repeating the same WARNING state, some catching your team mid-sleep for a problem that resolved itself.

After implementing HARD-state thresholds, severity-split contact groups, and routing through Alert24: one email when the disk warning first goes HARD, one page if it crosses the CRITICAL threshold (this assumes a second, critical-level disk check routed to p1-oncall, since the warning-service template above only emails), and one auto-resolved incident when log rotation clears the space. Your on-call engineer slept through it appropriately.

Concrete Next Steps

  1. Audit your notification_options across all service templates. If you're notifying on w (WARNING) and routing that to the same contact group as CRITICAL alerts, fix that first — it's the single highest-impact change.

  2. Set max_check_attempts to at least 3 on any service that has transient failures. Network checks, external HTTP checks, and anything that touches shared infrastructure all benefit.

  3. Create two contact groups — one for phone/SMS escalation (P1) and one for email-only (warnings). Assign them intentionally based on the actual business impact of each service failing.

  4. If you're not already routing Nagios through an incident management layer, set up the webhook integration. The Nagios configuration handles detection; the routing layer handles the human side of incident response — schedules, escalation, deduplication, and status communication.

Nagios is good at what it does. Alert fatigue isn't a Nagios failure; it's a signal that the routing and response layer around it needs as much attention as the check configuration itself.