
How to Notify Customers During an Outage Detected by Nagios

The Gap Between Detection and Communication

Nagios fires at 2:14 AM. Your event handler pages the on-call engineer. They SSH in, stare at logs, start triaging. Forty minutes later, the root cause is isolated and a fix is deployed.

At 3:00 AM, your CEO gets an email from a customer asking why the API was down for the past hour. Nobody had posted a status update.

This is not a process failure — it's a tooling gap. Nagios is built to detect and alert. It has no concept of a customer-facing status page, no way to send bulk notifications to subscribers, and no structured incident timeline that survives a postmortem. When engineers are heads-down fixing things, customer communication falls through the cracks unless the tooling makes it effortless.

The fix is to wire your Nagios event handlers to create incidents in Alert24 automatically, then handle all customer communication from within the incident response workflow — status page updates, email notifications, and the incident timeline — without switching contexts.


How the Integration Works

Alert24 exposes a REST API for incident management. When Nagios detects a problem, an event handler script calls that API to open an incident. From that point forward, the responding engineer works inside Alert24: they update the incident with customer-facing messages, and Alert24 handles publishing those updates to the public status page and notifying email subscribers.

The Nagios side stays simple. The Alert24 side handles everything customer-facing.

Step 1: Create the Nagios Event Handler Script

Nagios event handlers run when a service or host changes state. You define the handler as a command, then attach it per service with event_handler or globally with global_service_event_handler in nagios.cfg. The script below calls the Alert24 API to create an incident when a service enters a CRITICAL state and resolves it when the service recovers.

#!/bin/bash
# /usr/local/nagios/libexec/alert24_handler.sh
# Opens an Alert24 incident on a hard CRITICAL state and resolves it on recovery.

ALERT24_API_KEY="your-api-key-here"
ALERT24_BASE_URL="https://api.alert24.com/v1"

SERVICE_STATE="$1"       # OK, WARNING, UNKNOWN, or CRITICAL
SERVICE_HOST="$2"        # Hostname from Nagios
SERVICE_NAME="$3"        # Service description
STATE_TYPE="$4"          # SOFT (check is still being retried) or HARD (confirmed)

# One ID file per host/service pair; unsafe filename characters become underscores.
ID_FILE="/tmp/alert24_incident_${SERVICE_HOST}_${SERVICE_NAME//[^A-Za-z0-9_.-]/_}.id"

# Build the request body with python3 so quotes in Nagios values cannot break the JSON.
json_body() {  # args: status, message, [incident name]
  python3 -c '
import json, sys
body = {"status": sys.argv[1], "message": sys.argv[2], "notify_subscribers": True}
if len(sys.argv) > 3:
    body.update(name=sys.argv[3], severity="critical")
print(json.dumps(body))' "$@"
}

api_call() {  # args: HTTP method, path, JSON body
  curl -sf --max-time 10 -X "$1" "$ALERT24_BASE_URL$2" \
    -H "Authorization: Bearer $ALERT24_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$3"
}

create_incident() {
  api_call POST "/incidents" "$(json_body "investigating" \
    "Nagios detected $SERVICE_STATE on $SERVICE_HOST ($SERVICE_NAME). Engineers are investigating." \
    "$SERVICE_NAME degraded on $SERVICE_HOST")"
}

resolve_incident() {
  local incident_id
  incident_id=$(cat "$ID_FILE" 2>/dev/null)
  [ -n "$incident_id" ] || return 0
  api_call PATCH "/incidents/$incident_id" "$(json_body "resolved" \
    "Service has recovered. The issue affecting $SERVICE_NAME on $SERVICE_HOST has been resolved.")" \
    && rm -f "$ID_FILE"
}

# Act only on HARD states; SOFT means Nagios is still retrying the check.
[ "$STATE_TYPE" = "HARD" ] || exit 0

case "$SERVICE_STATE" in
  CRITICAL)
    # An ID file means an incident is already open for this service; do not open another.
    [ -s "$ID_FILE" ] && exit 0
    INCIDENT_ID=$(create_incident | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])" 2>/dev/null)
    [ -n "$INCIDENT_ID" ] && echo "$INCIDENT_ID" > "$ID_FILE"
    ;;
  OK)
    resolve_incident
    ;;
esac

The script persists the incident ID to a file under /tmp so the recovery run knows which incident to close. It acts only on HARD states, so a single failed check that recovers on retry never opens an incident. In production you would store the ID in a more durable location, such as a lightweight key-value store or a dedicated database table.

Step 2: Wire the Handler into Nagios

In your Nagios configuration, define the command and attach it to the service. Notification macros such as $NOTIFICATIONTYPE$ are only populated when Nagios runs a notification command, so the handler receives $SERVICESTATETYPE$ instead and infers problem versus recovery from the state itself:

# commands.cfg
define command {
  command_name  alert24_event_handler
  command_line  /usr/local/nagios/libexec/alert24_handler.sh "$SERVICESTATE$" "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATETYPE$"
}

# services.cfg
define service {
  host_name             web-prod-01
  service_description   HTTP Check
  check_command         check_http
  event_handler         alert24_event_handler
  event_handler_enabled 1
  # ... rest of your service config
}

Set event_handler_enabled 1 on the service, confirm that enable_event_handlers=1 is set in nagios.cfg, and make sure the account that runs Nagios has execute permission on the script.


What Happens on the Alert24 Side

Once the event handler fires, Alert24 creates an incident and — if notify_subscribers is true — sends an immediate email to everyone subscribed to your status page. Subscribers opted in voluntarily; they want to know when something is wrong.

The Incident Timeline

Every update you post to the incident appears as a timestamped entry on your public status page. The timeline gives customers a clear record of what happened and when. This matters during and after an outage.

Timestamp | Status        | Message
02:14 UTC | Investigating | Nagios detected API degradation. Engineers are investigating.
02:31 UTC | Identified    | Root cause identified: database connection pool exhausted. Fix in progress.
02:58 UTC | Monitoring    | Fix deployed. Monitoring for stability.
03:04 UTC | Resolved      | API fully recovered. Connection pool limits increased. Postmortem scheduled.

Your team posts these updates from inside the Alert24 incident — the same screen where they're reading acknowledgment status and managing the escalation chain. There's no separate status page admin panel to context-switch into.

Subscriber Notifications

Customers who subscribe to your status page receive an email for each update you publish. The subscription flow is a simple email input on your public status page — no account required. When you post an update with notify_subscribers: true, Alert24 fans out the notification automatically.

You control the message. The engineer writing the update decides what's customer-safe to say. You're not leaking internal hostnames or stack traces — just the user-facing impact and what you're doing about it.
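
If you ever want to post an update from a terminal or a chat hook rather than the Alert24 screen, the PATCH call the handler already uses for resolution could plausibly carry intermediate updates too. The sketch below rests on assumptions: the status values identified and monitoring are borrowed from the timeline above and may not match what the API accepts, and json_escape, build_update_payload, and post_update are hypothetical helper names.

```shell
# Assumed defaults; override them in the environment.
ALERT24_BASE_URL="${ALERT24_BASE_URL:-https://api.alert24.com/v1}"
ALERT24_API_KEY="${ALERT24_API_KEY:-your-api-key-here}"

# Escape backslashes and double quotes so a message cannot break the JSON.
# Newlines and control characters are not handled; keep messages to one line.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# args: status, message, notify (true|false, defaults to true)
build_update_payload() {
  printf '{"status": "%s", "message": "%s", "notify_subscribers": %s}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")" "${3:-true}"
}

# args: incident ID, status, message, notify
post_update() {
  curl -sf --max-time 10 -X PATCH "$ALERT24_BASE_URL/incidents/$1" \
    -H "Authorization: Bearer $ALERT24_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(build_update_payload "$2" "$3" "$4")"
}
```

Keeping the payload builder separate from the network call means you can print and review exactly what subscribers would receive before sending it.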


Keeping the Signal Clean

A few things to get right so you don't create noise:

Deduplicate on flapping services. If Nagios is configured with aggressive re-check intervals, a flapping service can fire the event handler multiple times in rapid succession. Check for an existing open incident before creating a new one. The Alert24 API returns existing open incidents via GET /incidents?status=open&name=... — query before creating.
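
As a local complement to the API query, a handler on a single Nagios server could claim the per-service ID file atomically before it calls the API, so two handler runs that overlap cannot both open an incident. This is a minimal sketch, not part of Alert24 or Nagios; claim_incident_slot and open_once are hypothetical names, and the command passed to open_once is assumed to print the new incident ID.

```shell
# Create the file only if it does not exist yet. With noclobber (set -C) the
# redirection fails on an existing file, so exactly one caller succeeds.
claim_incident_slot() {  # arg: path to the per-service ID file
  ( set -C; : > "$1" ) 2>/dev/null
}

# args: ID file, then the command that creates the incident and prints its ID
open_once() {
  id_file="$1"; shift
  claim_incident_slot "$id_file" || return 0   # another run already owns this incident
  incident_id=$("$@") || incident_id=""
  if [ -n "$incident_id" ]; then
    echo "$incident_id" > "$id_file"           # record the ID for the recovery run
  else
    rm -f "$id_file"                           # creation failed; release the claim
  fi
}
```

The claim is released when creation fails, so a transient API error does not block the next attempt.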

Use severity appropriately. Not every Nagios CRITICAL is a customer-visible incident. A background job failure might be SEV-3 internal only; a payment API failure is SEV-1 public. You can add logic to the event handler script to map Nagios service names or host groups to Alert24 severity levels and control whether notify_subscribers is true.
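
One way to express that mapping is a case statement keyed on the Nagios service description. The patterns and the severity values below are illustrative assumptions: only critical appears in the handler above, so check which values your Alert24 account accepts, and replace the service name patterns with your own.

```shell
# arg: Nagios service description
# Prints "<severity> <notify_subscribers>" for the handler to consume.
map_service() {
  case "$1" in
    *Payment*|*Checkout*|*API*) echo "critical true" ;;   # customer-facing: notify
    *Backup*|*Cron*|*Batch*)    echo "minor false" ;;     # background work: internal only
    *)                          echo "major false" ;;     # default: open quietly
  esac
}

# Split the result into the two fields.
mapping=$(map_service "Payment API")
severity=${mapping% *}   # text before the space
notify=${mapping#* }     # text after the space
echo "severity=$severity notify_subscribers=$notify"
```

Defaulting to no notification is the conservative choice: an engineer can still decide to notify subscribers once they confirm customer impact.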

Test your handler before production. Nagios can check its configuration without restarting: run nagios -v /etc/nagios/nagios.cfg (adjust the path for your install) to validate the config, then manually invoke the handler script with test arguments to confirm the API calls work before you rely on it at 2 AM.
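
For a first pass that sends nothing to Alert24, you can place a stand-in curl ahead of the real one on PATH and inspect what the handler would have sent. This is a sketch under assumptions: run_handler_dry, STUB_DIR, STUB_LOG, and the canned test-123 ID are hypothetical, and the response shape simply mirrors the id field the handler reads.

```shell
# Directory holding a fake curl that logs its arguments instead of calling the API.
STUB_DIR=$(mktemp -d)
export STUB_LOG="$STUB_DIR/requests.log"

cat > "$STUB_DIR/curl" <<'EOF'
#!/bin/sh
# Record the request and return a canned response with a made-up incident ID.
printf '%s\n' "$*" >> "$STUB_LOG"
echo '{"id": "test-123"}'
EOF
chmod +x "$STUB_DIR/curl"

# args: the same arguments the Nagios command definition passes to the handler
run_handler_dry() {
  PATH="$STUB_DIR:$PATH" "${HANDLER:-/usr/local/nagios/libexec/alert24_handler.sh}" "$@"
}
```

Call run_handler_dry with the arguments from your command definition, then read the file at $STUB_LOG to review the method, URL, headers, and body. The handler still writes its ID file during a dry run, so remove that file before testing against the real API.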


Next Steps

If you're running Nagios today, the event handler approach gets you customer communication without replacing any part of your existing monitoring stack. Nagios keeps doing what it's good at; Alert24 handles the communication layer that Nagios was never designed for.

To get started:

  1. Generate an API key from your Alert24 account settings under Integrations.
  2. Deploy the event handler script to your Nagios server and configure the command definition.
  3. Set up your public status page and add the subscriber widget — it's a single embed snippet.
  4. Run a test incident by manually triggering the script, post a few updates, and verify the timeline and subscriber emails look right.

The goal is that the next time Nagios wakes your team at 2 AM, customer communication is already happening before anyone has to think about it.