Your phone buzzed at 2 AM. Disk usage on a non-critical log server hit 78%. Same alert at 2:05. Same at 2:10. By the time a genuine production database failure fired at 2:47, you'd already silenced your phone.
That's the core problem with a default Nagios setup: it treats a transient CPU spike and a full application outage as equally urgent, sends both to the same channel, and re-notifies every five minutes until someone acknowledges. The result is a team that learns to ignore pages — which is the single worst outcome monitoring can produce.
This post walks through the concrete Nagios configuration changes that reduce noise without creating blind spots, and shows how pairing Nagios with a routing layer that understands severity gets you the rest of the way.
The Root Causes of Nagios Alert Fatigue
Nagios is a check engine. Its job is to run checks and report state. It was never designed to be an intelligent routing layer, and the default configuration reflects that. Three patterns drive most of the noise:
Soft-state spam. Nagios only notifies once a problem reaches a HARD state, but with max_check_attempts set to 1 the very first failed check is already HARD. There is no SOFT phase in which the failure has to be confirmed by consecutive re-checks, so a single blip on a flaky network check fires a page before Nagios has any reason to believe the failure is real.
Flat notification intervals. A 5-minute re-notification interval makes sense for a down host. It makes no sense for a warning-level disk check that will resolve itself when a log rotation runs at 3 AM.
No severity differentiation. The same notification command fires for CRITICAL and WARNING states. Your on-call engineer gets paged at the same urgency for a 78% disk warning and a database that refuses connections.
Fix 1: Require HARD States Before Notifying
The max_check_attempts and notification_options settings let you filter out soft-state noise without changing how Nagios detects problems.
# /etc/nagios/conf.d/templates.cfg
define service {
    name                   standard-service
    max_check_attempts     4      ; Must fail 4 consecutive checks before HARD
    check_interval         1      ; Check every minute
    retry_interval         1      ; Re-check every minute during soft state
    notification_interval  60     ; Re-notify every 60 min, not every 5
    notification_options   c,r    ; Only notify on CRITICAL and RECOVERY
    register               0
}

define service {
    name                   warning-service
    use                    standard-service
    notification_options   w,r    ; Warn-level checks: WARNING + RECOVERY only
    notification_interval  240    ; Re-notify every 4 hours for warnings
    register               0
}
With max_check_attempts 4 and a 1-minute retry, Nagios needs four consecutive failed checks (the initial failure plus three retries, so roughly three minutes) before it calls the state HARD and sends a notification. A blip that clears within those retries never produces a page.
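To make the SOFT/HARD distinction concrete, here is a small Python sketch of the counting logic. It is a simplified model rather than Nagios's actual scheduler: the function names are made up for illustration, and it ignores recoveries, flapping, and notification periods.

```python
from typing import List, Optional, Tuple


def simulate(results: List[bool], max_check_attempts: int) -> List[Tuple[int, str]]:
    """Return (check_index, state_type) for every failed check.

    results: True means the check returned OK, False means it failed.
    Simplified model: consecutive failures are counted, and the state
    becomes HARD once the count reaches max_check_attempts.
    """
    timeline = []
    attempt = 0
    for index, ok in enumerate(results):
        if ok:
            attempt = 0  # any OK result resets the attempt counter
            continue
        attempt = min(attempt + 1, max_check_attempts)
        state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
        timeline.append((index, state_type))
    return timeline


def first_notification(results: List[bool], max_check_attempts: int) -> Optional[int]:
    """Index of the check where the first problem notification would fire."""
    for index, state_type in simulate(results, max_check_attempts):
        if state_type == "HARD":
            return index
    return None


blip = [True, False, False, True, True]            # two failures, then OK
outage = [True, False, False, False, False, False]  # sustained failure

print(first_notification(blip, 4))    # no HARD state is ever reached
print(first_notification(outage, 4))  # fires on the fourth consecutive failure
print(first_notification(outage, 1))  # fires on the very first failure
```

With max_check_attempts set to 1 the model pages on the first failed check, which is the noisy behaviour this fix removes.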
Setting notification_options c,r on your critical services means a WARNING state on those services never pages anyone; only CRITICAL and RECOVERY do. Bumping notification_interval to 60 minutes stops the repeat-page loop while still reminding you that something is open.
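The effect of the interval change is easy to estimate. The sketch below assumes an idealized schedule (one notification when the problem goes HARD, then one per full interval while it stays open), and the helper name is hypothetical. Real Nagios timing also depends on check scheduling, so treat the numbers as approximate.

```python
import math


def notifications_sent(problem_minutes: int, notification_interval: int) -> int:
    """Approximate number of problem notifications for one unacknowledged incident.

    notification_interval is in minutes; 0 means "notify once, never re-notify",
    which matches how Nagios treats a zero interval.
    """
    if problem_minutes <= 0:
        return 0
    if notification_interval == 0:
        return 1
    # Notifications at t = 0, interval, 2 * interval, ... while t < problem_minutes
    return math.ceil(problem_minutes / notification_interval)


for interval in (5, 60, 240):
    print(interval, notifications_sent(240, interval))
```

For a four-hour unacknowledged problem, the model gives 48 notifications at a 5-minute interval, 4 at 60 minutes, and 1 at 240 minutes.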
Fix 2: Separate Warning and Critical Contacts
Nagios allows contact groups per service. Use that to create a deliberate routing split.
define contactgroup {
    contactgroup_name    p1-oncall
    alias                On-call Engineers (P1 only)
    members              engineer-alice, engineer-bob
}

define contactgroup {
    contactgroup_name    warning-email
    alias                Warning-level email recipients
    members              ops-team-email
}

define service {
    use                  standard-service
    host_name            db-prod-01
    service_description  PostgreSQL Connections
    check_command        check_pgsql_connections
    contact_groups       p1-oncall        ; Pages on-call for this one
}

define service {
    use                  warning-service
    host_name            web-01
    service_description  Disk Usage /var/log
    check_command        check_disk_warn_only
    contact_groups       warning-email    ; Email only, never pages
}
This is the right boundary: Nagios decides what is broken and how severely, then routes to the appropriate contact group. The contact group itself determines how notifications are delivered — but Nagios's notification commands are blunt instruments. A "notify by phone" command is a script that calls an API; it doesn't understand on-call schedules, escalation paths, or whether your engineer just acknowledged a related incident 10 minutes ago.
Fix 3: Deduplication and Escalation Belong Outside Nagios
Here's where most Nagios configurations hit a ceiling. You can tune HARD states and contact groups, but you can't easily express rules like:
- "If this alert isn't acknowledged in 15 minutes, escalate to the secondary on-call."
- "Don't page me again for the same failing service if I already have an open incident for it."
- "This PostgreSQL alert and the app server alert at the same time are probably the same root cause."
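Nagios has escalation directives (define serviceescalation), but they key on notification counts and intervals, which makes them effectively time-based. The only human signal they see is an acknowledgement entered in Nagios itself, not whether someone has engaged with the incident anywhere else. Nagios also has no incident-level deduplication model. If five services fail because a network switch went down, you get five separate notification threads.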
This is the gap a purpose-built routing layer fills.
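As an illustration of the first rule in the list above, here is a minimal Python sketch of acknowledgement-aware escalation. The Incident class and who_to_page function are hypothetical names for this example, not part of Nagios or any specific product, and times are plain minute counters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    opened_at: int                         # minute the incident was opened
    acknowledged_at: Optional[int] = None  # minute a human acknowledged it, if any


def who_to_page(incident: Incident, now: int, escalate_after: int = 15) -> str:
    """Decide who should be paged at minute `now`."""
    acked = incident.acknowledged_at is not None and incident.acknowledged_at <= now
    if acked:
        return "nobody"      # a human is engaged, so stop paging
    if now - incident.opened_at >= escalate_after:
        return "secondary"   # primary never acknowledged in time
    return "primary"


print(who_to_page(Incident(opened_at=0), now=5))
print(who_to_page(Incident(opened_at=0), now=15))
print(who_to_page(Incident(opened_at=0, acknowledged_at=10), now=20))
```

The key difference from a purely time-based escalation is the acknowledgement check, which runs before the elapsed-time check.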
Connecting Nagios to a Routing Layer
Alert24 accepts alerts from Nagios via webhook or the Nagios notification command. Once the alert arrives, routing rules handle the logic that Nagios configuration can't express cleanly.
| Nagios State | Alert24 Severity | Routing Rule |
|---|---|---|
| CRITICAL, HARD | P1 | SMS + phone call to on-call; escalate in 15 min if unacknowledged |
| WARNING, HARD | P3 | Email to ops channel; no phone call |
| UNKNOWN | P2 | Slack + SMS; no escalation |
| RECOVERY | — | Auto-resolve open incident; notify same contacts who received the alert |
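The table can be read as a small lookup function. The Python sketch below mirrors its rows; the function name and the "open" / "resolve" / "ignore" action labels are assumptions made for this example, not Alert24's actual API.

```python
from typing import Optional, Tuple


def route(state: str, state_type: str = "HARD") -> Tuple[str, Optional[str]]:
    """Map a Nagios state to (action, severity), following the table above."""
    state = state.upper()
    state_type = state_type.upper()
    if state == "RECOVERY":
        return ("resolve", None)   # auto-resolve the open incident
    if state == "UNKNOWN":
        return ("open", "P2")      # the table does not qualify UNKNOWN by state type
    if state_type != "HARD":
        return ("ignore", None)    # SOFT states are not listed in the table
    if state == "CRITICAL":
        return ("open", "P1")
    if state == "WARNING":
        return ("open", "P3")
    return ("ignore", None)


print(route("CRITICAL", "HARD"))
print(route("WARNING", "SOFT"))
print(route("RECOVERY"))
```

Treating SOFT states as ignored is a choice made here because the table only lists HARD rows for CRITICAL and WARNING.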
A typical routing rule in Alert24 looks like this:
- Condition: severity == P1 AND source == nagios
- Channel: Phone call, then SMS if no answer within 2 minutes
- Escalation: If unacknowledged after 15 minutes, notify secondary on-call
- Deduplication key: host + service_description
The deduplication key is what stops repeat pages for an ongoing incident. Once Alert24 opens an incident for db-prod-01 / PostgreSQL Connections, every subsequent Nagios notification for that host/service combination updates the existing incident rather than firing a new page. Your engineer gets one phone call, works the problem, and acknowledges it. The duplicate notifications from Nagios's re-notification interval are absorbed silently.
When Nagios sends a RECOVERY notification, Alert24 auto-resolves the incident and sends a recovery summary to the same people who were paged — no separate configuration needed.
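The deduplication behaviour described above can be sketched in a few lines of Python. This is an illustrative in-memory model, not Alert24's implementation; the IncidentStore class and its return labels are hypothetical.

```python
from typing import Dict, Tuple


class IncidentStore:
    """Tracks open incidents keyed by (host, service_description)."""

    def __init__(self) -> None:
        self.open: Dict[Tuple[str, str], int] = {}  # key -> notifications seen
        self.pages_sent = 0

    def handle(self, host: str, service: str, notification_type: str) -> str:
        key = (host, service)
        if notification_type == "RECOVERY":
            if key in self.open:
                del self.open[key]   # auto-resolve the matching incident
                return "resolved"
            return "ignored"         # recovery with no open incident
        if key in self.open:
            self.open[key] += 1      # repeat notification: update, do not page
            return "updated"
        self.open[key] = 1           # first notification: open incident and page
        self.pages_sent += 1
        return "paged"


store = IncidentStore()
for _ in range(3):
    print(store.handle("db-prod-01", "PostgreSQL Connections", "PROBLEM"))
print(store.handle("db-prod-01", "PostgreSQL Connections", "RECOVERY"))
print(store.pages_sent)
```

Three problem notifications for the same host/service pair produce one page and two silent updates, and the recovery closes the incident so a later failure pages again.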
Before and After
Before these changes, a typical noisy hour during a disk-fill event looks like: 14 pages across SMS, email, and Slack — most of them repeating the same WARNING state, some catching your team mid-sleep for a problem that resolved itself.
After implementing HARD-state thresholds, severity-split contact groups, and routing through Alert24: one email when the disk warning first goes HARD, one page if it crosses the CRITICAL threshold, one auto-resolved incident when log rotation clears the space. Your on-call engineer slept through it appropriately.
Concrete Next Steps
- Audit your notification_options across all service templates. If you're notifying on w (WARNING) and routing that to the same contact group as CRITICAL alerts, fix that first; it's the single highest-impact change.
- Set max_check_attempts to at least 3 on any service that has transient failures. Network checks, external HTTP checks, and anything that touches shared infrastructure all benefit.
- Create two contact groups: one for phone/SMS escalation (P1) and one for email-only (warnings). Assign them intentionally based on the actual business impact of each service failing.
- If you're not already routing Nagios through an incident management layer, set up the webhook integration. The Nagios configuration handles detection; the routing layer handles the human side of incident response: schedules, escalation, deduplication, and status communication.
Nagios is good at what it does. Alert fatigue isn't a Nagios failure; it's a signal that the routing and response layer around it needs as much attention as the check configuration itself.