
How to Route Nagios Alerts to the Right On-Call Engineer

Nagios Knows Something Is Broken. It Just Doesn't Know Who to Tell.

Your Nagios setup is solid. It checks disk usage, monitors service health, catches latency spikes before your users do. The problem surfaces at 2 AM when a CRITICAL alert fires and the notification goes to a flat email list—or worse, to someone who is on vacation, or to a team that has nothing to do with the affected service.

Nagios was designed to detect problems. It was not designed to manage on-call schedules, escalate through tiers of responders, or keep a flapping service from generating 200 pages in an hour. That is the gap this post covers: how to wire Nagios into Alert24 so every alert reaches the right engineer based on who is actually on call right now, which service is affected, and what escalation path applies if the first responder doesn't acknowledge.

How the Integration Works

The core mechanism is a Nagios event handler—a script that Nagios calls whenever a host or service changes state. Instead of (or in addition to) sending email, your event handler posts a structured payload to the Alert24 ingest API. Alert24 then owns everything downstream: who gets paged, when escalations kick in, and how duplicates are suppressed.

This keeps Nagios doing what it does well and delegates the human-routing problem to a tool built for it.

Step 1: Create the Event Handler Script

Nagios event handlers are shell scripts (or any executable) that receive state information as arguments. Create the following script at /usr/local/nagios/libexec/alert24_handler.sh:

#!/bin/bash

# alert24_handler.sh
# Called by Nagios for service state changes.
# Arguments passed by Nagios:
#   $1 = service state (OK, WARNING, CRITICAL, UNKNOWN)
#   $2 = state type (SOFT, HARD)
#   $3 = attempt number
#   $4 = service description
#   $5 = host name

SERVICESTATE="$1"
SERVICESTATETYPE="$2"
SERVICEATTEMPT="$3"
SERVICEDESC="$4"
HOSTNAME="$5"

ALERT24_API_KEY="your_api_key_here"
ALERT24_ROUTING_KEY="your_routing_key_here"

# Only page on HARD states to avoid noise during soft retries
if [ "$SERVICESTATETYPE" != "HARD" ]; then
  exit 0
fi

if [ "$SERVICESTATE" = "OK" ]; then
  EVENT_ACTION="resolve"
else
  EVENT_ACTION="trigger"
fi

DEDUP_KEY="${HOSTNAME}/${SERVICEDESC}"

# --max-time keeps a slow or unreachable API from stalling the handler
curl -s --max-time 10 -X POST https://api.alert24.io/v1/events \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${ALERT24_API_KEY}" \
  -d "{
    \"routing_key\": \"${ALERT24_ROUTING_KEY}\",
    \"event_action\": \"${EVENT_ACTION}\",
    \"dedup_key\": \"${DEDUP_KEY}\",
    \"payload\": {
      \"summary\": \"${SERVICEDESC} is ${SERVICESTATE} on ${HOSTNAME}\",
      \"source\": \"nagios\",
      \"severity\": \"$(echo "$SERVICESTATE" | tr '[:upper:]' '[:lower:]')\",
      \"custom_details\": {
        \"host\": \"${HOSTNAME}\",
        \"service\": \"${SERVICEDESC}\",
        \"state_type\": \"${SERVICESTATETYPE}\",
        \"attempt\": \"${SERVICEATTEMPT}\"
      }
    }
  }"

Make it executable:

chmod +x /usr/local/nagios/libexec/alert24_handler.sh

The dedup_key field is what prevents alert storms. Nagios runs the event handler on state changes rather than on every check, but a service that keeps re-entering a problem state can still fire the handler many times. Every subsequent call with the same dedup_key updates the existing incident rather than creating a new one. Your on-call engineer gets one page, not hundreds.
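
One caveat with the handler as written: it interpolates Nagios values straight into the JSON body, so a service description containing a double quote or a backslash would produce invalid JSON. The following is a minimal sketch of an escaping helper, not a complete JSON encoder. The name json_escape is hypothetical and not part of the original script, and it handles only backslashes, double quotes, newlines, and tabs.

```shell
#!/bin/bash

# json_escape (hypothetical helper): makes a value safe to place inside a
# JSON string literal. Newlines and tabs are flattened to spaces, then
# backslashes and double quotes are backslash-escaped.
json_escape() {
  printf '%s' "$1" \
    | tr '\n\t' '  ' \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Example: escape a service description before building the payload.
SERVICEDESC_RAW='Disk "root" usage'
SERVICEDESC_SAFE="$(json_escape "$SERVICEDESC_RAW")"
echo "$SERVICEDESC_SAFE"
```

In the handler, each interpolated value (service description, host name) could be passed through this helper before the curl call. If jq is installed on the Nagios host, building the body with it is likely the more robust option.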

Step 2: Register the Handler in Nagios

In your Nagios configuration, define the command and attach it to your services:

# commands.cfg
define command {
  command_name  alert24-service-handler
  command_line  /usr/local/nagios/libexec/alert24_handler.sh "$SERVICESTATE$" "$SERVICESTATETYPE$" "$SERVICEATTEMPT$" "$SERVICEDESC$" "$HOSTNAME$"
}

# services.cfg (example service)
define service {
  host_name             web-prod-01
  service_description   HTTP
  check_command         check_http
  event_handler         alert24-service-handler
  event_handler_enabled 1
  notifications_enabled 0   ; disable native email if Alert24 handles all routing
  ...
}

Setting notifications_enabled 0 on services you've migrated prevents duplicate pages from Nagios's own notification system while Alert24 handles routing.
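
Before reloading Nagios with the new command and service definitions, it is worth running its built-in configuration check. The sketch below assumes a source install under /usr/local/nagios, matching the handler path used earlier. NAGIOS_BIN, NAGIOS_CFG, and verify_nagios_config are illustrative names, so adjust the paths for packaged installs.

```shell
#!/bin/bash

# Assumed default paths for a source install; override via the environment.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"

# verify_nagios_config (hypothetical name): runs the Nagios pre-flight check.
# "nagios -v <main config>" parses every object definition and reports errors.
verify_nagios_config() {
  if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
    echo "config OK"
  else
    echo "config has errors; fix them before reloading" >&2
    return 1
  fi
}

# Usage: verify_nagios_config, then reload Nagios the way your system does.
```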

Step 3: Set Up Routing Keys and On-Call Schedules in Alert24

A routing key in Alert24 maps an inbound alert to a specific service. The routing key you put in the handler script determines which on-call schedule and escalation policy applies. This is where team-based routing happens.

A typical setup for a three-team operation might look like this:

Routing Key | Service               | On-Call Schedule    | Escalation Policy
rk_infra    | Infrastructure checks | Infra team rotation | 5 min → Infra lead → VP Engineering
rk_app      | Application services  | App team rotation   | 10 min → App lead
rk_db       | Database health       | DBA rotation        | 5 min → DBA lead → CTO

Each routing key links to a schedule where you define who is on call for which hours and days. Schedules support weekly rotations, time-zone-aware shifts, and override entries for holidays or planned absences. When an alert arrives, Alert24 evaluates the current time, looks up the active on-call responder, and sends the page—SMS, phone call, push notification, or email, depending on that engineer's notification preferences.

Matching Nagios Services to Alert24 Routing Keys

If you have services spread across multiple teams, you can use different routing keys per service group rather than a single key for everything. Update the handler script to accept a routing key as an argument, or maintain separate handler commands per team:

define command {
  command_name  alert24-db-handler
  command_line  /usr/local/nagios/libexec/alert24_handler.sh "$SERVICESTATE$" "$SERVICESTATETYPE$" "$SERVICEATTEMPT$" "$SERVICEDESC$" "$HOSTNAME$" "rk_db"
}

Then modify the script to read $6 as the routing key override. This keeps a single script but lets you route MySQL health checks to the DBA rotation while disk-space checks go to infrastructure.
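
A minimal sketch of that change, assuming the sixth argument is optional and the script falls back to its built-in key when none is passed. The names DEFAULT_ROUTING_KEY and resolve_routing_key are illustrative and do not appear in the original handler.

```shell
#!/bin/bash

# Fallback used when the Nagios command definition passes no sixth argument.
DEFAULT_ROUTING_KEY="your_routing_key_here"

# resolve_routing_key (hypothetical name): prints the override if one was
# given, otherwise the default key.
resolve_routing_key() {
  printf '%s' "${1:-$DEFAULT_ROUTING_KEY}"
}

# In the handler, this would replace the hard-coded assignment:
ALERT24_ROUTING_KEY="$(resolve_routing_key "$6")"
echo "routing key: ${ALERT24_ROUTING_KEY}"
```

With this in place, the existing alert24-service-handler command keeps working unchanged, and only the team-specific commands need to pass a key.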

Step 4: Configure Escalation Policies

An escalation policy defines what happens when the first on-call responder doesn't acknowledge within your time window. In Alert24, you attach an escalation policy to each service. A basic three-tier policy works like this:

  • Level 1: Page the on-call engineer immediately
  • Level 2: If no acknowledgment in 10 minutes, page the team lead
  • Level 3: If no acknowledgment in another 15 minutes, page the engineering manager

This ensures that a genuine outage doesn't die silently because the primary on-call had their phone on silent. The escalation path is separate from the on-call schedule, so when your team lead rotates out, the escalation chain doesn't break—it automatically resolves to whoever is on the lead schedule at that moment.
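
Because each delay is measured from the previous level, the total time to reach the last tier is the sum of the delays. The short sketch below works through the three-tier example above; escalation_timeline is a hypothetical helper for illustration only and has nothing to do with the Alert24 API.

```shell
#!/bin/bash

# escalation_timeline (hypothetical helper): takes the per-level delays in
# minutes and prints when each level is paged, measured from the trigger.
escalation_timeline() {
  local elapsed=0 level=1 delay
  for delay in "$@"; do
    elapsed=$((elapsed + delay))
    echo "Level ${level}: page at +${elapsed} min"
    level=$((level + 1))
  done
}

# Level 1 immediately, Level 2 after 10 minutes, Level 3 after another 15.
escalation_timeline 0 10 15
```

For the policy above, the engineering manager is paged 25 minutes after the alert fires if nobody acknowledges.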

Step 5: Test the Full Flow

Before trusting this in production, send a test event manually:

curl -X POST https://api.alert24.io/v1/events \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "routing_key": "rk_infra",
    "event_action": "trigger",
    "dedup_key": "test/manual-verification",
    "payload": {
      "summary": "Manual test alert from Nagios integration",
      "source": "nagios",
      "severity": "critical"
    }
  }'

Verify that the alert appears in Alert24, routes to the correct on-call schedule, and generates a page to the expected responder. Then send a resolve event (same dedup_key, event_action: resolve) and confirm the incident closes cleanly.
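
The matching resolve event might look like the sketch below. It assumes the API accepts the same body shape the handler sends on recovery, including a lowercase "ok" severity. The function names resolve_payload and send_resolve are illustrative, and the script expects ALERT24_API_KEY in the environment.

```shell
#!/bin/bash

# resolve_payload (hypothetical name): prints the JSON body for a resolve
# event. $1 = routing key, $2 = dedup key of the incident to close.
resolve_payload() {
  local routing_key="$1" dedup_key="$2"
  cat <<EOF
{
  "routing_key": "${routing_key}",
  "event_action": "resolve",
  "dedup_key": "${dedup_key}",
  "payload": {
    "summary": "Manual test resolve from Nagios integration",
    "source": "nagios",
    "severity": "ok"
  }
}
EOF
}

# send_resolve (hypothetical name): posts the resolve event to the same
# endpoint used for the test trigger.
send_resolve() {
  curl -s --max-time 10 -X POST https://api.alert24.io/v1/events \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${ALERT24_API_KEY}" \
    -d "$(resolve_payload "$1" "$2")"
}

# Usage:
#   send_resolve "rk_infra" "test/manual-verification"
```

The dedup_key must match the one used in the trigger exactly, otherwise the resolve has no incident to attach to.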

Also test the SOFT state filtering by temporarily removing the SERVICESTATETYPE check from your handler and watching what happens when a service flaps. You will quickly understand why filtering on HARD state only is worth the discipline.
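
The filtering decision can also be checked in isolation, without Nagios or the API. The sketch below copies the handler's two conditionals into a function so each state combination can be tried by hand. The name decide_action is hypothetical and not part of the original script.

```shell
#!/bin/bash

# decide_action (hypothetical name): mirrors the handler's logic.
# $1 = service state, $2 = state type. Prints skip, resolve, or trigger.
decide_action() {
  local state="$1" state_type="$2"
  if [ "$state_type" != "HARD" ]; then
    echo "skip"        # soft retries never reach Alert24
  elif [ "$state" = "OK" ]; then
    echo "resolve"     # hard recovery closes the incident
  else
    echo "trigger"     # any hard non-OK state pages
  fi
}

decide_action CRITICAL SOFT   # prints: skip
decide_action CRITICAL HARD   # prints: trigger
decide_action OK HARD         # prints: resolve
```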

What You End Up With

After this setup, Nagios continues doing exactly what it does—executing checks, tracking state, generating data. Alert24 takes the routing problem off your hands. Every alert that fires maps to a specific team's on-call schedule. Engineers only get paged for services in their domain. Escalations happen automatically. Flapping services generate one incident, not a storm.

Next Steps

  • Get your Alert24 API key and create your first routing key from the Alert24 dashboard
  • Map your Nagios service groups to Alert24 routing keys before you go live
  • Set up at least one override in each on-call schedule for upcoming holidays
  • Review your Nagios check_interval, retry_interval, and max_check_attempts settings. A short retry_interval combined with a higher max_check_attempts lets Nagios confirm a failure across several rechecks before it becomes a HARD state and reaches Alert24