The Problem With Nagios Notifications
Your monitoring caught the outage. Nagios fired off emails at 2:47 AM. The on-call engineer slept through them. By the time someone noticed at 6:15 AM, customers had been hitting a broken checkout page for over three hours.
This is the gap Nagios cannot close on its own. Nagios has a notification escalation feature — you can configure it to CC additional contacts after a certain number of notification attempts. But that is not escalation in any meaningful sense. It is just a slightly wider CC list on the same email that nobody read the first time.
Real escalation is a timer. Someone receives an alert, and if they do not acknowledge it within a defined window, a second person gets paged. If that person also does not respond, a third contact or an entire team gets pulled in. This is how on-call rotations actually work in organizations that take uptime seriously. Nagios cannot do this natively, but you can build it by connecting Nagios event handlers to a dedicated incident management system.
What Nagios Event Handlers Actually Do
Nagios runs two types of handlers: service event handlers and host event handlers. They execute a command whenever a monitored service or host changes state. The key distinction from notifications is that event handlers fire on state changes — OK to CRITICAL, WARNING to CRITICAL, recovery — rather than on notification intervals.
This makes event handlers the right hook for triggering an external incident management system. Instead of relying on Nagios to manage escalation timing, you hand the incident off to a system designed for that purpose.
Here is a minimal service event handler script:
#!/bin/bash
# /usr/local/nagios/libexec/eventhandlers/alert24_notify.sh
# Arguments arrive in the order set by command_line in commands.cfg
SERVICESTATE="$1"
SERVICESTATETYPE="$2"
SERVICEATTEMPT="$3"
HOSTNAME="$4"
SERVICEDESC="$5"
SERVICEOUTPUT="$6"
ALERT24_API_KEY="your_integration_key_here"
ALERT24_URL="https://api.alert24.com/v1/incidents"

# Escape backslashes and double quotes so plugin output cannot break the JSON body
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Only fire on HARD state changes to avoid alerting on transient soft failures
if [ "$SERVICESTATETYPE" != "HARD" ]; then
    exit 0
fi

# The same key is sent on trigger and resolve so both refer to one incident
DEDUP_KEY="$(json_escape "nagios-${HOSTNAME}-${SERVICEDESC}")"

if [ "$SERVICESTATE" = "CRITICAL" ] || [ "$SERVICESTATE" = "WARNING" ]; then
    # Open an incident for the failing service
    curl -s -X POST "$ALERT24_URL" \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $ALERT24_API_KEY" \
        -d "{
            \"title\": \"$(json_escape "$SERVICEDESC on $HOSTNAME")\",
            \"body\": \"$(json_escape "$SERVICEOUTPUT")\",
            \"severity\": \"$(printf '%s' "$SERVICESTATE" | tr '[:upper:]' '[:lower:]')\",
            \"source\": \"nagios\",
            \"dedup_key\": \"$DEDUP_KEY\"
        }"
elif [ "$SERVICESTATE" = "OK" ]; then
    # Auto-resolve the incident when Nagios sees recovery
    curl -s -X POST "$ALERT24_URL/resolve" \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $ALERT24_API_KEY" \
        -d "{\"dedup_key\": \"$DEDUP_KEY\"}"
fi
Make this script executable (chmod +x) and owned by the Nagios user. The dedup_key field is important — it ties the resolution event back to the original incident so you do not end up with zombie open incidents in your dashboard after Nagios recovers.
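Before pointing the handler at a live endpoint, it can help to look at the JSON it would send. The following is a minimal sketch, not part of Nagios or Alert24: `build_payload` and `json_escape` are hypothetical helper names, and the field names are simply the ones used in the handler script above. It prints the payload instead of posting it, so you can check that plugin output containing quotes or backslashes still produces valid JSON.

```shell
#!/bin/bash
# Sketch: build the incident payload as a string so it can be inspected
# before any request is sent. build_payload and json_escape are
# hypothetical helpers, not part of Nagios or Alert24.

# Escape backslashes and double quotes for use inside a JSON string
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Arguments: state, host name, service description, plugin output
build_payload() {
    local state="$1" host="$2" service="$3" output="$4"
    local severity
    severity="$(printf '%s' "$state" | tr '[:upper:]' '[:lower:]')"
    printf '{"title": "%s", "body": "%s", "severity": "%s", "source": "nagios", "dedup_key": "%s"}\n' \
        "$(json_escape "$service on $host")" \
        "$(json_escape "$output")" \
        "$severity" \
        "$(json_escape "nagios-${host}-${service}")"
}

# Plugin output with embedded quotes, as a check might plausibly return
build_payload "CRITICAL" "web-prod-01" "HTTP" 'HTTP CRITICAL - "Connection refused"'
```

The trigger and resolve calls must derive the dedup key from the same host name and service description, which is why the sketch builds it in one place.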
Wiring the Handler Into Nagios
Define the command in your Nagios commands configuration:
# /usr/local/nagios/etc/objects/commands.cfg
# Macros that can contain spaces are quoted so the script's arguments stay aligned
define command {
    command_name    alert24_event_handler
    command_line    /usr/local/nagios/libexec/eventhandlers/alert24_notify.sh \
        $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ \
        "$HOSTNAME$" "$SERVICEDESC$" "$SERVICEOUTPUT$"
}
Then attach it to a service template or individual service definition:
define service {
    use                     generic-service
    host_name               web-prod-01
    service_description     HTTP
    check_command           check_http
    event_handler           alert24_event_handler
    event_handler_enabled   1
    max_check_attempts      3
    ; Keep Nagios notifications if you want, or disable them
    ; once Alert24 is handling the paging
    notifications_enabled   0
}
Setting max_check_attempts 3 means Nagios enters a HARD state after three consecutive failed checks. Nagios also runs the event handler on each SOFT retry along the way, but the script exits early for those, so you are not paging people for a single failed check that immediately recovers.
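To see how the HARD-state filter behaves across retries, here is a small sketch that replays the three invocations Nagios would make for a service with max_check_attempts 3. `should_page` is a hypothetical helper that mirrors the conditions in the handler script; the simulated events are illustrative, not captured Nagios output.

```shell
#!/bin/bash
# Sketch: which handler invocations pass the HARD-state filter.
# should_page is a hypothetical helper mirroring the handler's conditions.
should_page() {
    local state="$1" state_type="$2"
    [ "$state_type" = "HARD" ] || return 1
    [ "$state" = "CRITICAL" ] || [ "$state" = "WARNING" ]
}

# Simulated invocations, one per failed check: "STATE STATETYPE ATTEMPT"
for event in "CRITICAL SOFT 1" "CRITICAL SOFT 2" "CRITICAL HARD 3"; do
    # Intentional word splitting: load the three fields into $1 $2 $3
    set -- $event
    if should_page "$1" "$2"; then
        echo "attempt $3: $1/$2 -> open incident"
    else
        echo "attempt $3: $1/$2 -> ignored"
    fi
done
```

Only the third invocation opens an incident. If the service recovers after the first or second failure, the handler runs with a SOFT recovery and nothing is sent.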
Configuring Escalation Policies in Alert24
Once Alert24 is receiving incidents from Nagios, you configure escalation entirely on the Alert24 side. This is where you define the actual timer-based escalation that Nagios cannot provide.
A typical three-step escalation policy looks like this:
| Step | Contacts | Notify After | Channel |
|---|---|---|---|
| 1 | Primary on-call | Immediately | SMS + voice call |
| 2 | Secondary on-call | 10 minutes (if unacknowledged) | SMS + voice call |
| 3 | Engineering manager | 25 minutes (if unacknowledged) | Voice call |
The "if unacknowledged" condition is what makes this real escalation. Alert24 starts the timer the moment the incident opens. If the on-call engineer acknowledges the alert — either by replying to the SMS, pressing a key on the voice call, or clicking in the dashboard — the timer stops and escalation does not proceed. If they do not acknowledge it, Alert24 pages the next contact in the chain automatically.
You can attach different escalation policies to different services. A storage warning on a dev environment might escalate slowly over an hour. A payment processor health check might escalate to a second responder within five minutes and pull in a third contact at ten.
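The acknowledgment timer can be sketched in a few lines. This is an illustration of the logic only, using the delays from the table above (0, 10, and 25 minutes); `notified_steps` is a hypothetical function, not an Alert24 API, and the tie rule (an acknowledgment at exactly the step's minute does not stop that step) is an assumption.

```shell
#!/bin/bash
# Sketch of timer-based escalation using the delays from the policy table.
# notified_steps is a hypothetical helper, not an Alert24 API.
# Arguments: minutes since the incident opened, and the minute at which it
# was acknowledged (empty string if nobody has acknowledged yet).
notified_steps() {
    local elapsed="$1" acked_at="$2"
    local step=0 delay steps=""
    for delay in 0 10 25; do
        step=$((step + 1))
        # A step only fires once its delay has elapsed
        [ "$elapsed" -ge "$delay" ] || break
        # Stop escalating if the incident was acknowledged before this step
        if [ -n "$acked_at" ] && [ "$acked_at" -lt "$delay" ]; then
            break
        fi
        steps="$steps $step"
    done
    echo "${steps# }"
}

notified_steps 12 ""   # unacknowledged after 12 minutes: prints "1 2"
notified_steps 30 4    # acknowledged at minute 4: prints "1"
```

The second call shows the point of the acknowledgment loop: once the primary responds at minute 4, steps 2 and 3 never fire, no matter how long the incident stays open.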
The Difference Between Nagios Escalation and Incident Escalation
Nagios does have a serviceescalation object definition. It looks like this:
define serviceescalation {
    host_name               web-prod-01
    service_description     HTTP
    first_notification      3
    last_notification       10
    notification_interval   30
    contact_groups          senior-engineers
}
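This tells Nagios to send notifications 3 through 10 to the senior-engineers contact group, repeating every 30 minutes while the escalation is active. The original contacts keep receiving those notifications only if their group is listed in the escalation too. There are two problems with this for real escalation: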
First, Nagios escalation is based on notification count, not time. With a 60-minute notification interval, the third notification goes out two hours after the first, so that is how long the senior engineers wait. Check intervals, retry intervals, and notification intervals are all configured separately, which makes the real time to escalation hard to reason about.
Second, and more importantly, there is no acknowledgment loop. Nagios does not know whether anyone actually read the alert. It sends the email and moves on. A timer-based system that tracks acknowledgment gives you a meaningful guarantee: if nobody responds, someone else gets pulled in. That is operationally different from a wider CC list.
Handling Recovery and Avoiding Alert Fatigue
The dedup key in the event handler script serves double duty: it prevents duplicate incidents when the handler fires more than once for the same problem, for example when a service moves from WARNING to CRITICAL, and it lets the resolution call close the right incident automatically.
Without auto-resolution, your team will accumulate open incidents in Alert24 even after Nagios shows everything green. That erodes trust in the dashboard quickly. Make sure your handler sends the resolve payload on OK state transitions and test it by manually triggering a failure and recovery on a non-production service.
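One way to trigger a failure and recovery by hand is to submit passive check results through the Nagios external command file. The sketch below only formats the command lines, using the standard `PROCESS_SERVICE_CHECK_RESULT` external command. `format_check_result` is a hypothetical helper, and `web-staging-01` is a placeholder for whichever non-production host you test against.

```shell
#!/bin/bash
# Sketch: format passive check results for the Nagios external command file.
# format_check_result is a hypothetical helper; web-staging-01 is a placeholder host.
# Arguments: unix timestamp, host, service description, return code, plugin output
format_check_result() {
    local ts="$1" host="$2" service="$3" code="$4" output="$5"
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$ts" "$host" "$service" "$code" "$output"
}

now="$(date +%s)"
# Return code 2 = CRITICAL, 0 = OK
format_check_result "$now" "web-staging-01" "HTTP" 2 "simulated failure"
format_check_result "$now" "web-staging-01" "HTTP" 0 "simulated recovery"
```

To actually submit a result, redirect the line into the command file, which on a default source install is typically `/usr/local/nagios/var/rw/nagios.cmd` (an assumption; check the `command_file` setting in your nagios.cfg). This requires external commands to be enabled and the service to accept passive checks. Depending on your configuration, you may need to submit the failing result more than once before the service reaches a HARD state.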
If you find that certain services generate noisy alerts that almost always self-resolve, tune max_check_attempts upward for those services rather than disabling event handlers entirely. You want the handler to fire on genuine HARD failures without flooding the on-call rotation with transient blips.
Next Steps
Start with one critical service — probably your primary web endpoint or database health check — and wire up the event handler there. Validate that incidents open and close correctly before rolling it out broadly.
Then build your escalation policy in Alert24 to reflect how your team actually operates. If your on-call rotation changes weekly, set that up as a schedule so alerts go to the right person automatically. If you have a secondary on-call, define the escalation timer at a threshold your team agrees on; ten minutes is a reasonable default for high-severity services.
Once you have seen it work through a real incident, adding additional services is a matter of attaching event_handler alert24_event_handler to more service definitions. The escalation logic stays centralized in Alert24 and applies consistently across everything Nagios monitors.