The Problem With Nagios Notifications

Your monitoring caught the outage. Nagios fired off emails at 2:47 AM. The on-call engineer slept through them. By the time someone noticed at 6:15 AM, customers had been hitting a broken checkout page for over three hours.

This is the gap Nagios cannot close on its own. Nagios has a notification escalation feature: you can configure it to notify additional contacts once a certain number of notifications have gone out. But that is not escalation in any meaningful sense. It is just a slightly wider CC list on the same email that nobody read the first time.

Real escalation is a timer. Someone receives an alert, and if they do not acknowledge it within a defined window, a second person gets paged. If that person also does not respond, a third contact or an entire team gets pulled in. This is how on-call rotations actually work in organizations that take uptime seriously. Nagios cannot do this natively, but you can build it by connecting Nagios event handlers to a dedicated incident management system.

What Nagios Event Handlers Actually Do

Nagios runs two types of handlers: service event handlers and host event handlers. They execute a command whenever a monitored service or host changes state, including the SOFT state changes that occur while Nagios is still retrying a check. The key distinction from notifications is that event handlers fire on state changes (OK to CRITICAL, WARNING to CRITICAL, recovery) rather than on notification intervals.

This makes event handlers the right hook for triggering an external incident management system. Instead of relying on Nagios to manage escalation timing, you hand the incident off to a system designed for that purpose.

Here is a minimal service event handler script:

#!/bin/bash
# /usr/local/nagios/libexec/eventhandlers/alert24_notify.sh
# Arguments arrive positionally from the Nagios command definition.

SERVICESTATE="$1"
SERVICESTATETYPE="$2"
SERVICEATTEMPT="$3"   # passed for completeness; not used below
HOSTNAME="$4"
SERVICEDESC="$5"
SERVICEOUTPUT="$6"
ALERT24_API_KEY="your_integration_key_here"
ALERT24_URL="https://api.alert24.com/v1/incidents"

# Escape backslashes and double quotes so plugin output cannot break the JSON body
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Only fire on HARD state changes to avoid alerting on transient soft failures
if [ "$SERVICESTATETYPE" != "HARD" ]; then
  exit 0
fi

HOST_JSON="$(json_escape "$HOSTNAME")"
DESC_JSON="$(json_escape "$SERVICEDESC")"
OUTPUT_JSON="$(json_escape "$SERVICEOUTPUT")"
DEDUP_KEY="nagios-${HOST_JSON}-${DESC_JSON}"
SEVERITY="$(printf '%s' "$SERVICESTATE" | tr '[:upper:]' '[:lower:]')"

# UNKNOWN and any other state fall through without calling the API
if [ "$SERVICESTATE" = "CRITICAL" ] || [ "$SERVICESTATE" = "WARNING" ]; then
  curl -s -X POST "$ALERT24_URL" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $ALERT24_API_KEY" \
    -d "{
      \"title\": \"$DESC_JSON on $HOST_JSON\",
      \"body\": \"$OUTPUT_JSON\",
      \"severity\": \"$SEVERITY\",
      \"source\": \"nagios\",
      \"dedup_key\": \"$DEDUP_KEY\"
    }"
elif [ "$SERVICESTATE" = "OK" ]; then
  # Auto-resolve the incident when Nagios sees recovery
  curl -s -X POST "$ALERT24_URL/resolve" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $ALERT24_API_KEY" \
    -d "{\"dedup_key\": \"$DEDUP_KEY\"}"
fi

Make this script executable (chmod +x) and owned by the Nagios user. The dedup_key field is important — it ties the resolution event back to the original incident so you do not end up with zombie open incidents in your dashboard after Nagios recovers.

Wiring the Handler Into Nagios

Define the command in your Nagios commands configuration:

# /usr/local/nagios/etc/objects/commands.cfg

define command {
    command_name    alert24_event_handler
    command_line    /usr/local/nagios/libexec/eventhandlers/alert24_notify.sh \
                    "$SERVICESTATE$" "$SERVICESTATETYPE$" "$SERVICEATTEMPT$" \
                    "$HOSTNAME$" "$SERVICEDESC$" "$SERVICEOUTPUT$"
}

Then attach it to a service template or individual service definition:

define service {
    use                     generic-service
    host_name               web-prod-01
    service_description     HTTP
    check_command           check_http
    event_handler           alert24_event_handler
    event_handler_enabled   1
    max_check_attempts      3
    ; Keep Nagios notifications if you want, or disable them
    ; once Alert24 is handling the paging
    notifications_enabled   0
}

Setting max_check_attempts 3 means Nagios will enter HARD state after three consecutive failures. Because the handler script ignores SOFT states, it only opens an incident on that HARD transition, so you are not paging people for a single failed check that immediately recovers.
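As a rough guide to how long that takes, the delay between the first failed check and the HARD state is about (max_check_attempts - 1) retries. The sketch below assumes retries run every retry_interval minutes (that is, the default interval_length of 60 seconds); the function name and the retry_interval value of 1 are illustrative assumptions, not values from the configuration above.

```shell
#!/bin/bash
# Rough estimate: minutes from the first failed check until Nagios
# reaches the HARD state. Assumes one retry every retry_interval minutes.
minutes_to_hard() {
  local max_check_attempts="$1" retry_interval="$2"
  echo $(( (max_check_attempts - 1) * retry_interval ))
}

# Example: max_check_attempts 3 with a hypothetical retry_interval of 1
minutes_to_hard 3 1
```

With these numbers the handler would open an incident roughly two minutes after the first failure. Raising either value trades faster paging for fewer false alarms.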

Configuring Escalation Policies in Alert24

Once Alert24 is receiving incidents from Nagios, you configure escalation entirely on the Alert24 side. This is where you define the actual timer-based escalation that Nagios cannot provide.

A typical three-step escalation policy looks like this:

Step	Contacts	Notify After	Channel
1	Primary on-call	Immediately	SMS + voice call
2	Secondary on-call	10 minutes (if unacknowledged)	SMS + voice call
3	Engineering manager	25 minutes (if unacknowledged)	Voice call

The "if unacknowledged" condition is what makes this real escalation. Alert24 starts the timer the moment the incident opens. If the on-call engineer acknowledges the alert — either by replying to the SMS, pressing a key on the voice call, or clicking in the dashboard — the timer stops and escalation does not proceed. If they do not acknowledge it, Alert24 pages the next contact in the chain automatically.
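To make the timer logic concrete, here is a minimal sketch of the decision the example policy encodes: given how many minutes an incident has gone unacknowledged, which contacts have been paged so far. This is only an illustration of the table above, not how Alert24 implements it; the function name and contact labels are hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch of the example policy: who has been paged after
# a given number of minutes without acknowledgment.
paged_contacts() {
  local minutes_unacked="$1"
  local paged="primary-on-call"            # step 1: immediately
  if [ "$minutes_unacked" -ge 10 ]; then   # step 2: after 10 minutes
    paged="$paged secondary-on-call"
  fi
  if [ "$minutes_unacked" -ge 25 ]; then   # step 3: after 25 minutes
    paged="$paged engineering-manager"
  fi
  echo "$paged"
}

paged_contacts 12
```

An acknowledgment at any point simply stops the clock, so the list never grows past whoever had been paged at that moment.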

You can attach different escalation policies to different services. A storage warning on a dev environment might escalate slowly over an hour. A payment processor health check might escalate to a second responder within five minutes and pull in a third contact at ten.

The Difference Between Nagios Escalation and Incident Escalation

Nagios does have a serviceescalation object definition. It looks like this:

define serviceescalation {
    host_name               web-prod-01
    service_description     HTTP
    first_notification      3
    last_notification       10
    notification_interval   30
    contact_groups          senior-engineers
}

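This tells Nagios to send notifications 3 through 10 to the senior-engineers contact group, at 30-minute intervals. Contacts from the service definition are only included in that range if the escalation lists them as well. There are two problems with this for real escalation: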

First, Nagios escalation is based on notification count, not time. If your notification interval is 60 minutes, the third notification, which is the first one to reach the senior engineers, goes out two hours after the first. Once you add check intervals and retry intervals on top, the actual time to escalate becomes hard to reason about.

Second, and more importantly, there is no acknowledgment loop. Nagios does not know whether anyone actually read the alert. It sends the email and moves on. A timer-based system that tracks acknowledgment gives you a meaningful guarantee: if nobody responds, someone else gets pulled in. That is operationally different from a wider CC list.
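The count-based timing from the first point can be worked out directly: notification number N goes out (N - 1) notification intervals after the first one. The helper below is a small sketch of that arithmetic; the function name is hypothetical, and it assumes the service's notification_interval stays constant until the escalation range starts.

```shell
#!/bin/bash
# Minutes from the first notification until notification number n is sent,
# assuming notifications repeat every notification_interval minutes.
minutes_until_notification() {
  local n="$1" notification_interval="$2"
  echo $(( (n - 1) * notification_interval ))
}

# first_notification 3 with a 60-minute notification interval
minutes_until_notification 3 60
```

This prints 120: two hours pass before the escalation contacts hear anything, and that delay is a side effect of two unrelated settings rather than a deliberate timer.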

Handling Recovery and Avoiding Alert Fatigue

The dedup key in the event handler script serves double duty: it prevents duplicate incidents when the handler fires again for a service that already has an open incident (for example, on a HARD transition from WARNING to CRITICAL), and it lets the resolution call close the right incident automatically.

Without auto-resolution, your team will accumulate open incidents in Alert24 even after Nagios shows everything green. That erodes trust in the dashboard quickly. Make sure your handler sends the resolve payload on OK state transitions and test it by manually triggering a failure and recovery on a non-production service.
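One way to trigger a failure and recovery by hand is to submit passive check results through the Nagios external command file using PROCESS_SERVICE_CHECK_RESULT. The sketch below only builds the command line. It assumes external commands are enabled and the service accepts passive checks; the helper name, the host name test-host-01, and the command file path in the usage comment are assumptions to adapt to your install. Depending on your configuration, more than one failing result may be needed before the service reaches a HARD state.

```shell
#!/bin/bash
# Build a PROCESS_SERVICE_CHECK_RESULT external command line.
# return_code: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
check_result_command() {
  local host="$1" service="$2" return_code="$3" output="$4"
  printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
    "$(date +%s)" "$host" "$service" "$return_code" "$output"
}

# Usage (commented out; the command file path is an assumption):
# check_result_command test-host-01 HTTP 2 "simulated failure" \
#   > /usr/local/nagios/var/rw/nagios.cmd
# check_result_command test-host-01 HTTP 0 "simulated recovery" \
#   > /usr/local/nagios/var/rw/nagios.cmd
```

After the failing result, confirm that an incident opens; after the recovery result, confirm that the same incident closes rather than a new one appearing.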

If you find that certain services generate noisy alerts that almost always self-resolve, tune max_check_attempts upward for those services rather than disabling event handlers entirely. You want the handler to fire on genuine HARD failures without flooding the on-call rotation with transient blips.

Next Steps

Start with one critical service — probably your primary web endpoint or database health check — and wire up the event handler there. Validate that incidents open and close correctly before rolling it out broadly.

Then build your escalation policy in Alert24 to reflect how your team actually operates. If your on-call rotation changes weekly, set that up as a schedule so alerts go to the right person automatically. If you have a secondary on-call, define the escalation timer at a threshold your team agrees on; ten minutes is a reasonable default for high-severity services.

Once you have seen it work through a real incident, adding additional services is a matter of attaching event_handler alert24_event_handler to more service definitions. The escalation logic stays centralized in Alert24 and applies consistently across everything Nagios monitors.

How to Escalate Nagios Alerts Automatically When Nobody Responds