Your nightly database backup job failed at 2 AM. Nobody noticed. You found out three weeks later when you needed to restore from backup and discovered you had nothing. The job had been exiting with a non-zero status the entire time, logs were written, and that was the end of it.
This is the defining problem with Kubernetes CronJobs: the platform treats job completion, successful or otherwise, as a terminal state. There is no built-in alerting, no retry-with-notification, no "did anyone check this?" mechanism. A failed run simply shows up as a failed Job and pod in your cluster history, waiting for someone to go looking, and by default the CronJob keeps only the most recent failed Job (failedJobsHistoryLimit: 1).
Two approaches actually work in production. The first uses Prometheus and kube-state-metrics to watch job status from the outside. The second flips the model and uses a heartbeat: the job itself signals success, and you get alerted when the signal stops arriving.
Why CronJobs Fail Without Anyone Knowing
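Before picking an approach, it helps to understand what Kubernetes actually does when a job fails. The CronJob controller only creates a Job object at each scheduled time; from there the Job controller takes over. When the job's pod exits non-zero, the failure is recorded and, depending on your restartPolicy and backoffLimit settings, the work is retried: with restartPolicy: OnFailure the kubelet restarts the container in place, and with restartPolicy: Never the Job controller creates a replacement pod. Once backoffLimit is exhausted (the default is 6), the Job is marked as Failed.
The relevant status fields are status.failed and status.succeeded on the Job resource, which count failed and succeeded pods, together with the Failed and Complete entries in status.conditions. Nothing in the default Kubernetes distribution watches those fields and sends a notification. Your log aggregator might capture the pod output, but unless someone is actively watching job logs, the failure is invisible.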
Approach 1: Prometheus and kube-state-metrics
If you already run a Prometheus stack, kube-state-metrics exposes job metrics you can alert on. The three most useful are:
- kube_job_status_failed: gauge, the number of the job's pods that reached the Failed phase
- kube_job_status_succeeded: gauge, the number of the job's pods that reached the Succeeded phase
- kube_job_complete: gauge with a condition label
A basic alerting rule for job failures looks like this:
groups:
  - name: cronjob-alerts
    rules:
      - alert: KubernetesCronJobFailed
        expr: kube_job_status_failed > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Job {{ $labels.job_name }} in {{ $labels.namespace }} failed"
          description: "{{ $labels.job_name }} has {{ $value }} failed pods."
      - alert: KubernetesCronJobNotRunning
        expr: |
          time() - kube_cronjob_status_last_schedule_time{cronjob="your-backup-job"} > 90000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CronJob {{ $labels.cronjob }} has not run in 25 hours"
The second rule catches a different failure mode: a job that never runs at all. This happens when the previous job is still running (and concurrencyPolicy: Forbid is set), when the cluster scheduler has an issue, or when the CronJob itself is suspended. A job that never starts produces zero failures — it just produces no successes either, which is equally bad for your backup.
What kube-state-metrics gets you
This approach is comprehensive if you already have the infrastructure. You get visibility into job history, you can track trends, and you can write rules for any job-level condition kube-state-metrics exposes. The kube_job_owner metric lets you correlate jobs back to their parent CronJobs.
The downside is the setup cost. Running Prometheus with kube-state-metrics requires cluster-level access, ongoing maintenance, and a working Alertmanager pipeline to actually route those alerts somewhere actionable. If you are setting this up from scratch, you are looking at a meaningful investment before your first alert fires.
Approach 2: Heartbeat Monitoring
The heartbeat model inverts the responsibility: instead of an external system watching for failures, your job signals that it completed successfully. You configure an expected check-in interval, and if the signal stops arriving, you get alerted.
Add a curl call at the end of your job script — after the real work is done and only if it succeeded:
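#!/bin/bash
# Exit on any error. pipefail makes a pg_dump failure fail the whole pipeline;
# without it the pipeline reports gzip's status and the script would carry on.
set -eo pipefail

# Your actual job work here
pg_dump -h "$DB_HOST" -U "$DB_USER" mydb | gzip > "/backup/$(date +%Y%m%d).sql.gz"

# Only reached if the above succeeds (set -e exits on error)
curl -fsS --retry 3 "https://pulse.alert24.io/v1/heartbeat/YOUR-HEARTBEAT-TOKEN" \
  --data-urlencode "msg=backup completed successfully"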
In your CronJob manifest, the structure stays exactly the same:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: your-backup-image:latest
              command: ["/scripts/backup.sh"]
              env:
                - name: DB_HOST
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: host
Because set -e is active, any command that fails causes the script to exit immediately and the curl call is never reached. Alert24 registers the missed check-in and routes the alert according to your on-call schedule.
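One caveat applies to pipelines such as pg_dump | gzip: by default bash reports only the exit status of the last command in a pipeline, so a failing pg_dump would go unnoticed by set -e unless set -o pipefail is also enabled. The sketch below demonstrates the difference in isolation; pipeline_status is a hypothetical helper, and false stands in for a failing pg_dump.

```shell
#!/bin/bash
# Print the exit status bash reports for `false | gzip`,
# with pipefail either enabled ("pipefail") or disabled (anything else).
pipeline_status() {
  (
    # Subshell, so the option change does not leak into the caller.
    case "$1" in
      pipefail) set -o pipefail ;;
      *)        set +o pipefail ;;
    esac
    status=0
    # `false` stands in for a failing pg_dump; gzip itself succeeds.
    false | gzip > /dev/null || status=$?
    echo "$status"
  )
}

pipeline_status default    # prints 0: the failure is masked by gzip
pipeline_status pipefail   # prints 1: the failure propagates
```

With pipefail off, the script would continue to the curl call and report a successful backup that never happened.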
You configure the heartbeat period to match your job schedule. For a job that runs daily, you might set a 25-hour window to give yourself an hour of slack for jobs that run slightly late due to cluster load.
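The window arithmetic can be sketched as follows. This is an illustration of the general idea under stated assumptions, not Alert24's actual logic: heartbeat_overdue is a hypothetical function, and timestamps are Unix epoch seconds.

```shell
#!/bin/bash
# heartbeat_overdue LAST_PING NOW PERIOD GRACE
# All arguments are in seconds (LAST_PING and NOW as Unix epoch timestamps).
# Returns 0 (true) when more than PERIOD + GRACE seconds have passed
# since the last check-in, and 1 (false) otherwise.
heartbeat_overdue() {
  local last_ping="$1" now="$2" period="$3" grace="$4"
  [ $(( now - last_ping )) -gt $(( period + grace )) ]
}

# Usage for a daily job with one hour of slack (86400 + 3600 = 90000 s = 25 h):
#   if heartbeat_overdue "$last_ping" "$(date +%s)" 86400 3600; then
#     echo "backup heartbeat missed" >&2
#   fi
```

The 90000-second total matches the threshold used in the Prometheus rule earlier, so both approaches tolerate the same amount of lateness.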
What heartbeats get you
The heartbeat approach works regardless of your observability stack. There is nothing to install on the cluster, no Prometheus to maintain, no Alertmanager to configure. The job itself is the monitor.
This also catches partial failures that job-level metrics might miss. If your backup script completes but writes a corrupt file, the exit code might still be zero. If you add a validation step before the curl call, a corrupt backup fails validation, the script exits, and the heartbeat is never sent.
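A validation step of this kind might look like the sketch below. It is a minimal check under stated assumptions: validate_backup is a hypothetical helper, the minimum size is a value you would choose for your own database, and gzip -t only verifies the compressed stream, not the SQL inside it.

```shell
#!/bin/bash
# validate_backup FILE [MIN_BYTES]
# Returns 0 only if FILE exists, is at least MIN_BYTES long (default 1),
# and passes gzip's integrity test. Prints the reason to stderr on failure.
validate_backup() {
  local file="$1" min_bytes="${2:-1}" size
  [ -f "$file" ] || { echo "backup missing: $file" >&2; return 1; }
  # Strip whitespace because some wc implementations pad their output.
  size=$(wc -c < "$file" | tr -d '[:space:]')
  [ "$size" -ge "$min_bytes" ] || { echo "backup too small: $size bytes" >&2; return 1; }
  gzip -t "$file" 2>/dev/null || { echo "backup is not valid gzip: $file" >&2; return 1; }
}

# Usage in the backup script, between the dump and the heartbeat ping
# (the 1024-byte floor is an arbitrary example):
#   validate_backup "/backup/$(date +%Y%m%d).sql.gz" 1024
```

The size floor matters because gzip of an empty input is still a valid archive, so an empty dump would pass the integrity test on its own.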
Comparing the Two Approaches
| | Prometheus / kube-state-metrics | Heartbeat monitoring |
|---|---|---|
| Infrastructure required | Prometheus + kube-state-metrics + Alertmanager | HTTP endpoint you can reach |
| Cluster access needed | Yes (ClusterRole for kube-state-metrics) | No |
| Catches "job never ran" | Yes, with time-based rules | Yes, same mechanism |
| Catches partial failures | Only if they produce non-zero exit | Yes, if you add validation before the ping |
| Setup time | Hours to days | Minutes |
| Maintenance overhead | Ongoing (upgrades, rule tuning) | Near zero |
| Works across clusters | Requires federation or remote_write | Native |
The two approaches are not mutually exclusive. In a mature cluster you might run Prometheus for general cluster health and use heartbeats specifically for the jobs that carry business-critical workloads. The heartbeat adds a single HTTP request to your job and gives you a direct signal that is independent of your metrics infrastructure.
Next Steps
Pick the job that would hurt most if it silently failed — backups, invoice generation, data sync, compliance reports — and add a heartbeat to it today. The change is small: add a curl call at the end of your script and set a check-in window in Alert24 that matches your schedule.
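For a job whose script you would rather not edit, one option is a small wrapper that runs the existing command and pings only on success. This is a sketch, not a prescribed pattern: run_with_heartbeat is a hypothetical name, and the curl flags simply mirror the ones used in the backup script.

```shell
#!/bin/bash
# run_with_heartbeat URL COMMAND [ARGS...]
# Runs COMMAND; sends the heartbeat to URL only if COMMAND exits 0.
run_with_heartbeat() {
  local heartbeat_url="$1"
  shift
  # Run the real job; on failure, return its exit status without pinging.
  "$@" || return $?
  # Job succeeded: send the heartbeat.
  curl -fsS --retry 3 "$heartbeat_url" > /dev/null
}

# Usage:
#   run_with_heartbeat "https://pulse.alert24.io/v1/heartbeat/YOUR-HEARTBEAT-TOKEN" /scripts/backup.sh
```

Because the wrapper returns the job's own exit status on failure, Kubernetes retry behaviour is unchanged.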
If you are already running Prometheus, add the kube-state-metrics alerting rules above for cluster-wide coverage, and reserve heartbeats for the jobs where business impact justifies explicit end-to-end validation.
Once your critical jobs are sending heartbeats, set up Alert24 to route missed heartbeats to the right person on call. Failure detection is only useful if it reaches someone who can act on it, at the right hour, through the right channel.