The Problem with Monitoring Cron Jobs in Nagios
Your Nagios setup is solid. Services are checked on a schedule, alerts go out when something goes down, and your team has a handle on infrastructure health. Then someone asks: "Are our cron jobs actually running?"
Nagios is built around active checks — it reaches out to something and tests it. Cron jobs don't work that way. A cron job runs, does something, and exits. There's no port to probe, no HTTP endpoint to hit, no service to query. Nagios can't knock on a cron job's door.
The traditional workaround is passive checks. You configure Nagios to accept externally submitted results for a service. Your cron job then reports its result at the end of each run, either via send_nsca or by writing to Nagios's external command file. Finally, you set a freshness threshold so Nagios complains if no result arrives within a certain window. It works, but the setup involves multiple moving parts: NSCA daemon configuration, shared encryption keys, freshness thresholds per service, and a Nagios config reload every time you add a job. When a developer adds a new cron job, they're unlikely to go through that process, so the job goes unmonitored.
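For comparison, here is roughly what the reporting half of a passive check looks like when the job writes directly to Nagios's external command file instead of going through NSCA. This is a minimal sketch. The command-file path, host name, and service description are placeholders, and the service description must exactly match a passive service already defined in your Nagios config. The `PROCESS_SERVICE_CHECK_RESULT` line format is Nagios's own external command syntax.

```python
import time

# Placeholder path: the real location is the command_file setting in nagios.cfg,
# and the writing user needs permission to open it.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"

# Standard Nagios plugin return codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def format_passive_result(host, service, return_code, output, timestamp=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if timestamp is None:
        timestamp = int(time.time())
    # A newline terminates the command, so flatten multi-line output
    output = output.replace("\n", " ")
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{output}\n"


def submit_passive_result(host, service, return_code, output, command_file=COMMAND_FILE):
    # The command file is a named pipe read by the Nagios process, so this only
    # works on the Nagios server itself; remote hosts need NSCA or similar.
    with open(command_file, "w") as f:
        f.write(format_passive_result(host, service, return_code, output))
```

Each job also needs a matching passive service definition with its own freshness threshold in the Nagios config, which is the per-job reload step described above.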
There's a simpler model that gets you the same outcome: heartbeat monitoring.
How Heartbeat Monitoring Works
A heartbeat monitor flips the relationship. Instead of Nagios polling your job, your job pings a URL when it finishes. The monitoring system tracks those pings and opens an incident if one doesn't arrive within the expected window.
The flow looks like this:
- You create a heartbeat monitor with a schedule (every hour, every day at 3am, whatever matches your cron expression).
- Your cron job runs and, on success, sends a GET or POST request to the heartbeat URL.
- If the ping doesn't arrive within the configured grace period, the monitoring system treats it as a missed check and opens an incident.
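The missed-window rule above reduces to a deadline comparison. The sketch below is a simplified model for a fixed-interval schedule, not Alert24's actual implementation. The function name and the assumption that the next ping is due one interval after the previous one are illustrative.

```python
from datetime import datetime, timedelta


def is_missed(last_ping: datetime, interval: timedelta, grace: timedelta, now: datetime) -> bool:
    """Return True if no ping has arrived by the expected time plus the grace period.

    Simplified model: assumes a fixed interval, so the next ping is expected
    one interval after the previous one.
    """
    deadline = last_ping + interval + grace
    return now > deadline


last = datetime(2024, 1, 1, 2, 0)  # job pinged at 02:00
daily = timedelta(days=1)
grace = timedelta(minutes=30)

print(is_missed(last, daily, grace, datetime(2024, 1, 2, 2, 20)))  # False: still inside the grace period
print(is_missed(last, daily, grace, datetime(2024, 1, 2, 2, 45)))  # True: the deadline was 02:30
```

Anchoring on the last ping keeps the model short. A scheduler tied to a cron expression would instead compute the expected time from the schedule itself, so a job that runs late doesn't push every later deadline back.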
No agents. No NSCA. No shared keys. No Nagios config reload.
Setting Up a Heartbeat in Alert24
In Alert24, go to Monitors and create a new Heartbeat monitor. You'll specify:
- Name — something that identifies the job clearly ("Nightly invoice export", "Hourly cache warmer")
- Schedule — how often the job should run
- Grace period — how long Alert24 waits past the expected ping before declaring the job missed
- Escalation policy — who gets paged if the job misses its window
Once saved, Alert24 gives you a unique heartbeat URL that looks like:
https://app.alert24.com/api/v1/heartbeat/YOUR_UNIQUE_TOKEN
That URL is all your cron job needs.
Wiring the Ping Into Your Cron Job
Bash
The simplest case — append the curl call at the end of your cron entry, or inside your shell script after the main work completes:
```bash
# crontab entry
0 2 * * * /opt/scripts/export-invoices.sh && curl -fsS --retry 3 https://app.alert24.com/api/v1/heartbeat/YOUR_UNIQUE_TOKEN > /dev/null
```
The && ensures the ping only fires if the script exits cleanly. If the script fails, no ping is sent, and Alert24 opens an incident once the window passes. In -fsS, -s hides curl's progress meter, -S still prints errors to stderr, and -f makes curl exit non-zero on an HTTP error response instead of treating it as success. Redirecting stdout to /dev/null discards the response body. --retry 3 retries transient failures (timeouts and responses such as 429 or 503) without any extra code on your end.
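If you only want to confirm that the job ran, regardless of whether it succeeded, use ; instead of &&. The heartbeat then can't tell a failed run from a successful one: a missed ping only catches jobs that didn't run at all (or couldn't reach Alert24).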
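The difference between && and ; comes down to whether the ping depends on the job's exit status. Here is a sketch of the same logic in Python. The `run_and_ping` helper and its `ping_on_failure` flag are hypothetical names, and `ping` is injectable so the logic can be exercised without network access.

```python
import subprocess
import urllib.request
from typing import Callable, Sequence

HEARTBEAT_URL = "https://app.alert24.com/api/v1/heartbeat/YOUR_UNIQUE_TOKEN"


def send_ping(url: str = HEARTBEAT_URL) -> None:
    # Plain GET with a timeout so an unreachable endpoint can't hang the job
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()


def run_and_ping(cmd: Sequence[str], ping: Callable[[], None] = send_ping,
                 ping_on_failure: bool = False) -> int:
    """Run cmd, then decide whether to ping.

    ping_on_failure=False mirrors `cmd && curl ...`: ping only on exit status 0.
    ping_on_failure=True mirrors `cmd ; curl ...`: ping whenever the job ran.
    """
    result = subprocess.run(cmd)
    if result.returncode == 0 or ping_on_failure:
        ping()
    # Pass the job's exit status through so cron still sees failures as failures
    return result.returncode
```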
Python
If your job is a Python script, send the ping at the end of a successful run:
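```python
import sys

import requests

HEARTBEAT_URL = "https://app.alert24.com/api/v1/heartbeat/YOUR_UNIQUE_TOKEN"


def run_job():
    # your job logic here
    pass


if __name__ == "__main__":
    try:
        run_job()
    except Exception as e:
        # Job failed: exit non-zero without pinging, so the missed heartbeat opens an incident
        print(f"Job failed: {e}", file=sys.stderr)
        sys.exit(1)

    # Job succeeded: send the ping. A ping error is reported separately so it
    # isn't mistaken for a job failure.
    try:
        requests.get(HEARTBEAT_URL, timeout=10).raise_for_status()
    except requests.RequestException as e:
        print(f"Heartbeat ping failed: {e}", file=sys.stderr)
```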
Using requests with an explicit timeout prevents your job from hanging if Alert24 is temporarily unreachable — the job finishes and exits, and the missing ping will trigger the incident via the normal missed-window path.
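The Python example sends a single ping. If you want behavior closer to curl's --retry 3, a small retry loop with exponential backoff is enough. This is a sketch using only the standard library; `ping_with_retry`, its parameters, and the backoff values are illustrative choices, not anything Alert24 requires. Unlike curl, it retries every failure, including 4xx responses.

```python
import time
import urllib.request
from typing import Callable

HEARTBEAT_URL = "https://app.alert24.com/api/v1/heartbeat/YOUR_UNIQUE_TOKEN"


def _get(url: str) -> None:
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()


def ping_with_retry(url: str = HEARTBEAT_URL, retries: int = 3, backoff: float = 1.0,
                    send: Callable[[str], None] = _get,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try the ping once, then up to `retries` more times.

    Waits backoff, 2*backoff, 4*backoff, ... seconds between attempts.
    Returns True on success, False if every attempt failed.
    """
    for attempt in range(retries + 1):
        try:
            send(url)
            return True
        except OSError:
            # URLError, HTTPError and socket timeouts are all OSError subclasses
            if attempt == retries:
                return False
            sleep(backoff * 2 ** attempt)
    return False
```

`send` and `sleep` are parameters only so the retry logic can be tested without a network; in a real job you would call `ping_with_retry()` after `run_job()` succeeds.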
Ruby
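The same pattern in Ruby. Net::HTTP.get takes no timeout argument, so this version opens the connection explicitly with open_timeout and read_timeout:

```ruby
require 'net/http'
require 'uri'

HEARTBEAT_URL = "https://app.alert24.com/api/v1/heartbeat/YOUR_UNIQUE_TOKEN"

def run_job
  # your job logic here
end

begin
  run_job
rescue => e
  # Job failed: exit non-zero without pinging, so the missed heartbeat opens an incident
  $stderr.puts "Job failed: #{e.message}"
  exit 1
end

# Job succeeded: ping with explicit timeouts; ping errors are reported separately
begin
  uri = URI(HEARTBEAT_URL)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https",
                  open_timeout: 10, read_timeout: 10) do |http|
    http.get(uri.request_uri).value # raises unless the response is 2xx
  end
rescue StandardError => e
  $stderr.puts "Heartbeat ping failed: #{e.message}"
end
```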
Nagios Passive Checks vs. Alert24 Heartbeats
Here's a direct comparison of the two approaches for the common case of monitoring a cron job:
| Concern | Nagios Passive Checks | Alert24 Heartbeats |
|---|---|---|
| Initial setup | NSCA daemon, encryption keys, Nagios service definition, freshness threshold | Create monitor in UI, copy URL |
| Adding a new job | Nagios config change + reload | Add one curl line to cron job |
| Missed-job detection | Freshness threshold (requires tuning per job) | Built-in per-monitor schedule |
| Alert routing | Nagios contacts/escalations config | Alert24 escalation policies |
| On-call schedules | Third-party or manual | Native in Alert24 |
| Status page integration | Requires separate tooling | Native in Alert24 |
The passive check approach isn't wrong; it's the correct Nagios primitive for this use case. But the operational overhead compounds as you add jobs. Alert24 heartbeats don't replace Nagios for active service checks; they handle the push-style case that Nagios's active-check model fits poorly.
Where Alert24 Picks Up
Nagios is good at detecting that something is wrong. What happens next — who gets called, what the escalation path looks like, whether there's a status page to update — typically requires additional tooling or significant Nagios configuration.
Alert24 handles the incident management side: escalation policies with on-call rotations, acknowledgment tracking, status page updates, and post-incident timelines. When a heartbeat misses its window, Alert24 opens an incident, routes it to the right person based on your current on-call schedule, and tracks the response through to resolution.
If you're already using Nagios, you don't have to change anything about your active check setup. Add Alert24 for heartbeat monitoring and route all your Nagios alerts through Alert24 as well if you want centralized incident management — or keep them separate and use Alert24 only for the heartbeat cases where Nagios passive checks would be painful.
Next Steps
Start with one cron job that matters. Create a heartbeat monitor in Alert24, add the curl ping to the job, and let it run through a few cycles to confirm the pings are landing. Once you've seen it work, rolling it out to the rest of your scheduled jobs takes minutes per job.
If you want to send Nagios alerts through Alert24 as well, the integration uses a Nagios event broker or a webhook — Alert24's documentation covers the setup, and it doesn't require changes to your existing Nagios check configuration.
The goal is simple: if a job was supposed to run and didn't, someone should know within minutes, not the next morning when a customer reports a problem.