The Gap Between Nagios Alerts and Incident Records
Nagios is good at one thing: detecting that something is wrong. It watches your hosts and services, fires off notifications when thresholds are crossed, and logs the event. What it does not do is tell you how long that problem lasted, whether anyone acknowledged it, what actions were taken, or what the final resolution was. The moment the state returns to OK, Nagios moves on.
This matters when something breaks at 2 AM and your manager asks for a post-mortem on Monday. Your Nagios log has a line saying the check went CRITICAL at 02:14 and OK at 02:47. That is not an incident record — it is a timestamp pair. There is no acknowledgment timestamp, no notes about what you did to fix it, no severity classification, and no way to see how this compares to the last three times this same service failed.
If you want incident lifecycle tracking, MTTR reporting, and a timeline that is useful for post-mortems, you need to close this gap yourself. The good news is that Nagios's event handler mechanism makes this straightforward.
How Nagios Event Handlers Work
Nagios event handlers are scripts that run in response to state changes. You define a command in your Nagios configuration, attach it to a host or service, and Nagios executes it whenever that object changes state. The handler receives arguments you specify — typically the state type, state name, and attempt number — which lets you filter on exactly the transitions you care about.
For incident tracking, you want to trigger on HARD state changes only. Nagios has two state types: SOFT (the check has failed but has not yet been retried max_check_attempts times) and HARD (the check has failed on every retry and is treated as a real problem). Posting to an incident API on every SOFT state would create noise; HARD states represent confirmed failures worth tracking.
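To make the SOFT/HARD distinction concrete, here is a minimal sketch of how a check that keeps failing is classified as attempts accumulate. It is a simplified model rather than Nagios's own logic: the function name state_type_for is hypothetical, and the example assumes max_check_attempts is set to 3.

```shell
# Simplified model (assumption): a check that keeps failing stays SOFT
# until the attempt counter reaches max_check_attempts, then becomes HARD.
# Arguments: current attempt number, max_check_attempts.
state_type_for() {
    if [ "$1" -lt "$2" ]; then
        echo "SOFT"
    else
        echo "HARD"
    fi
}

# Walk through three consecutive failures with max_check_attempts=3.
for attempt in 1 2 3; do
    echo "attempt ${attempt}: $(state_type_for "$attempt" 3)"
done
```

Under this model only the third failure would pass a HARD-only filter, which is the point at which an incident should be opened.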
Writing the Event Handler Script
Create a script at /usr/local/nagios/libexec/create_incident.sh. This script will receive the state information from Nagios and post it to Alert24's incident API.
#!/bin/bash
# Alert24 incident creation handler for Nagios
# Called by Nagios on host and service state changes; acts only on HARD states

STATE_TYPE="$1"     # HARD or SOFT
STATE="$2"          # OK, WARNING, CRITICAL, UNKNOWN, UP, DOWN, UNREACHABLE
HOSTNAME="$3"
SERVICE_DESC="$4"   # Empty for host checks
ALIAS="$5"          # Unique identifier for deduplication

ALERT24_API_KEY="your-api-key-here"
ALERT24_API_URL="https://api.alert24.com/v1/incidents"

# Only act on HARD state changes
if [ "$STATE_TYPE" != "HARD" ]; then
    exit 0
fi

# Map Nagios states to Alert24 severity
case "$STATE" in
    CRITICAL|DOWN|UNREACHABLE)
        SEVERITY="critical"
        ACTION="open"
        ;;
    WARNING)
        SEVERITY="warning"
        ACTION="open"
        ;;
    OK|UP)
        SEVERITY="info"
        ACTION="resolve"
        ;;
    *)
        # UNKNOWN or any unexpected state: do nothing
        exit 0
        ;;
esac

# Build a human-readable title; use the alias passed by Nagios,
# falling back to one derived from the host and service names
if [ -n "$SERVICE_DESC" ]; then
    TITLE="${SERVICE_DESC} is ${STATE} on ${HOSTNAME}"
    ALIAS_KEY="${ALIAS:-${HOSTNAME}/${SERVICE_DESC}}"
else
    TITLE="Host ${HOSTNAME} is ${STATE}"
    ALIAS_KEY="${ALIAS:-host/${HOSTNAME}}"
fi

# Escape backslashes and double quotes so values are safe inside JSON strings
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Post to Alert24 (the timeout keeps a slow API from stalling the handler)
curl -s --max-time 10 -X POST "$ALERT24_API_URL" \
    -H "Authorization: Bearer $ALERT24_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{
        \"title\": \"$(json_escape "$TITLE")\",
        \"severity\": \"$SEVERITY\",
        \"action\": \"$ACTION\",
        \"alias\": \"$(json_escape "$ALIAS_KEY")\",
        \"source\": \"nagios\",
        \"tags\": [\"nagios\", \"$(json_escape "$HOSTNAME")\"]
    }"
Make the script executable:
chmod +x /usr/local/nagios/libexec/create_incident.sh
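The handler does not check whether the API accepted the request, so a rejected or timed-out POST would fail silently. A minimal sketch of one way to surface such failures follows. The helper names http_status_ok and post_incident are hypothetical, and the sketch assumes the incident API answers a successful request with a 2xx status code, which you should confirm against the Alert24 documentation.

```shell
# Return success for 2xx HTTP status codes, failure for everything else.
# curl reports 000 when no HTTP response was received at all.
http_status_ok() {
    case "$1" in
        2[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: POST a JSON payload and log non-2xx responses.
# Arguments: URL, API key, JSON payload, log file path.
post_incident() {
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
        -X POST "$1" \
        -H "Authorization: Bearer $2" \
        -H "Content-Type: application/json" \
        -d "$3")
    if ! http_status_ok "$status"; then
        printf '%s incident POST failed, HTTP status %s\n' \
            "$(date '+%Y-%m-%d %H:%M:%S')" "$status" >> "$4"
        return 1
    fi
    return 0
}
```

Writing failures to a log file gives you something to look at when an expected incident never shows up; the log path must be writable by the user Nagios runs as.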
Wiring It Into Nagios
Define the event handler command in your Nagios commands configuration:
define command {
    command_name    alert24_incident_handler
    command_line    /usr/local/nagios/libexec/create_incident.sh \
        "$SERVICESTATETYPE$" "$SERVICESTATE$" \
        "$HOSTNAME$" "$SERVICEDESC$" \
        "$HOSTNAME$/$SERVICEDESC$"
}
For host checks, define a separate command using host macros:
define command {
    command_name    alert24_host_incident_handler
    command_line    /usr/local/nagios/libexec/create_incident.sh \
        "$HOSTSTATETYPE$" "$HOSTSTATE$" \
        "$HOSTNAME$" "" \
        "host/$HOSTNAME$"
}
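Service descriptions often contain spaces, and the handler reads its inputs by position, so it matters whether each macro reaches the script as a single argument. Nagios substitutes macro values into the command line as text before running it, so a value with spaces should behave like the unquoted case in the sketch below. The helper show_args and the description "Disk Usage" are illustrative only.

```shell
# Print how many arguments were received and what each one contains.
show_args() {
    printf '%d args:' "$#"
    printf ' [%s]' "$@"
    printf '\n'
}

service_desc="Disk Usage"   # example description containing a space

# Unquoted: the shell splits the description into two arguments,
# shifting everything after it by one position.
show_args HARD CRITICAL web-01 $service_desc

# Quoted: the description arrives intact as a single argument.
show_args HARD CRITICAL web-01 "$service_desc"
```

In the unquoted case the script would see "Disk" as the service description and "Usage" as the alias, which is why each macro in command_line is best wrapped in double quotes.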
Then attach the handler to your service template or individual service definitions:
define service {
    use                     generic-service
    event_handler           alert24_incident_handler
    event_handler_enabled   1
    ; ... rest of your service definition
}
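Host definitions accept the same event_handler and event_handler_enabled directives; point them at alert24_host_incident_handler. Event handlers also need to be enabled globally with enable_event_handlers=1 in nagios.cfg. After making these changes, validate the configuration and reload Nagios (adjust the path if your nagios.cfg lives elsewhere, such as /usr/local/nagios/etc/nagios.cfg on a source install):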
nagios -v /etc/nagios/nagios.cfg && systemctl reload nagios
Why the Alias Field Matters
The alias field in the API payload is how Alert24 handles deduplication. Nagios runs event handlers on state changes rather than on every failed check, so a second problem event for the same service usually means the state has changed while the incident is still open (for example, a service that moves from WARNING to CRITICAL). Alert24 uses the alias to recognize that this is the same incident, not a new one. It adds a timeline entry rather than opening a duplicate.
When the state returns to OK and your handler posts with "action": "resolve", Alert24 matches on the same alias and closes the incident, recording the resolution time. This is how MTTR gets calculated accurately: the open timestamp comes from the first CRITICAL event, the close timestamp comes from the OK event, and the alias ties them together.
A good alias scheme uses enough specificity to avoid collisions across your infrastructure. Using hostname/service works for most setups. If you run Nagios across multiple data centers or clusters, prepend a datacenter identifier: dc1/web-01/HTTP.
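The alias scheme can be sketched as a small helper. The function name build_alias is hypothetical, and the dc1/host/web-01 form for host checks with a datacenter prefix is an assumption; the text above only specifies the service form.

```shell
# Build a deduplication alias.
# Arguments: datacenter (may be empty), hostname, service description (may be empty).
build_alias() {
    if [ -n "$3" ]; then
        alias_key="$2/$3"
    else
        alias_key="host/$2"
    fi
    # Prepend the datacenter identifier only when one is given.
    if [ -n "$1" ]; then
        alias_key="$1/$alias_key"
    fi
    printf '%s\n' "$alias_key"
}

build_alias ""    web-01 HTTP   # web-01/HTTP
build_alias "dc1" web-01 HTTP   # dc1/web-01/HTTP
build_alias "dc1" web-01 ""     # dc1/host/web-01
```

Whatever scheme you choose, the open and resolve events for one object must produce the same string, or the resolve will not match the open incident.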
What You Get in the Incident Record
Once the integration is running, every HARD state change in Nagios creates or updates an incident in Alert24 with a structured record:
| Field | Source |
|---|---|
| Title | Constructed from hostname and service name |
| Severity | Mapped from Nagios state (CRITICAL, DOWN, and UNREACHABLE to critical; WARNING to warning) |
| Open time | Timestamp of first HARD failure |
| Acknowledge time | Set when on-call engineer acknowledges in Alert24 |
| Resolution time | Timestamp of OK/UP HARD state |
| MTTR | Calculated automatically from open and resolution times |
| Timeline | Each state change appended as a timeline event |
| Tags | Nagios hostname, plus any custom tags you add |
The timeline is particularly useful. If a database service goes CRITICAL, gets acknowledged, briefly recovers and goes CRITICAL again, then finally resolves, all of those transitions appear in chronological order in the incident record. When you are writing the post-mortem, you have an accurate sequence of events without reconstructing it from log files.
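The MTTR figure itself is simple arithmetic over open and resolution times. The sketch below works through the 02:14 to 02:47 example from earlier and then averages it with two other durations, 45 and 12 minutes, which are made-up values for illustration. The function names are hypothetical, and this is not a description of how Alert24 computes the number internally.

```shell
# Convert HH:MM to minutes since midnight. One leading zero is stripped
# so that values like "08" are not misread as octal numbers.
to_minutes() {
    hours=${1%%:*}
    mins=${1##*:}
    hours=${hours#0}
    mins=${mins#0}
    echo $(( ${hours:-0} * 60 + ${mins:-0} ))
}

# Duration of one incident in minutes (both times on the same day).
# Arguments: open time, resolution time.
incident_minutes() {
    echo $(( $(to_minutes "$2") - $(to_minutes "$1") ))
}

# Mean time to resolve over a list of durations, in whole minutes.
mttr_minutes() {
    [ "$#" -gt 0 ] || return 1
    total=0
    count=0
    for duration in "$@"; do
        total=$(( total + duration ))
        count=$(( count + 1 ))
    done
    echo $(( total / count ))
}

incident_minutes 02:14 02:47   # 33
mttr_minutes 33 45 12          # 30
```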
On-Call Routing From Nagios Alerts
By posting to Alert24, you also bring Nagios alerts into Alert24's on-call routing. You can define escalation policies that determine who gets paged based on severity, time of day, and rotation schedules. A WARNING-severity incident might send a Slack notification; a CRITICAL incident pages the on-call engineer immediately and escalates to their backup after ten minutes if there is no acknowledgment.
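The example policy in the paragraph above amounts to a small decision table. The sketch below spells it out for illustration only: real escalation policies are configured inside Alert24, not in a script, and the function name and the returned labels are hypothetical.

```shell
# Illustrative decision logic for the example policy described above.
# Arguments: severity, minutes since the incident opened, acknowledged (yes/no).
next_notification() {
    case "$1" in
        warning)
            echo "slack"
            ;;
        critical)
            if [ "$3" = "yes" ]; then
                echo "none"            # acknowledged: stop escalating
            elif [ "$2" -ge 10 ]; then
                echo "page-backup"     # ten minutes without acknowledgment
            else
                echo "page-on-call"
            fi
            ;;
        *)
            echo "none"
            ;;
    esac
}

next_notification warning 0 no     # slack
next_notification critical 3 no    # page-on-call
next_notification critical 12 no   # page-backup
```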
This means you can simplify your Nagios notification configuration. Instead of maintaining per-contact notification commands in Nagios for SMS, phone calls, and Slack, Nagios sends the alert to Alert24, and Alert24 handles the routing. Your Nagios config becomes more maintainable, and your notification logic lives in one place.
Next Steps
To get started, you need an Alert24 account and an API key from the integrations page. From there:
- Create the event handler script and set your API key.
- Define the Nagios commands and attach the handler to a single non-critical service for testing.
- Manually force a service into a CRITICAL state by submitting a passive check result through the Nagios external command file, using the external command PROCESS_SERVICE_CHECK_RESULT;hostname;service;2;forced critical for testing, and verify the incident appears in Alert24.
- Check that the OK recovery creates a resolve action on the same incident (same alias, closed timestamp, MTTR calculated).
- Roll the handler out to the rest of your services.
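One common way to force a test state is to write a passive check result to the Nagios external command file. The sketch below assumes external commands are enabled (check_external_commands=1), that the service accepts passive checks, and that the command file lives at /usr/local/nagios/var/rw/nagios.cmd; check the command_file setting in your nagios.cfg for the real path. The function names are hypothetical.

```shell
# Format a passive service check result in the external command syntax:
# [epoch] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;plugin_output
# Arguments: epoch timestamp, host, service description, return code, output text.
format_passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Write a result to the command file. Arguments: command file, host,
# service description, return code (0=OK, 1=WARNING, 2=CRITICAL), output text.
submit_passive_result() {
    format_passive_result "$(date +%s)" "$2" "$3" "$4" "$5" > "$1"
}

# Example invocation (path is an assumption, see above):
# submit_passive_result /usr/local/nagios/var/rw/nagios.cmd \
#     web-01 HTTP 2 "forced critical for testing"
```

Depending on your configuration, a single result may not produce a HARD state straight away, and the next scheduled active check can overwrite the forced state, so run the test against a service you can watch closely. Submitting a result with return code 0 afterwards exercises the resolve path.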
Your Nagios installation tells you what broke. Pairing it with Alert24 tells you how long it took to fix it, who responded, and what happened along the way — which is the data you need to actually improve your reliability over time.