How to Get Paged When a Kubernetes Pod Goes Down

Kubernetes is self-healing by design. When a pod crashes, the kubelet restarts it. When a node goes down, the scheduler reschedules the pod elsewhere. Most of the time this happens quietly and your users never notice.

The problem is when the automatic recovery stops working. A pod stuck in CrashLoopBackOff is restarting every few minutes, your application is throwing errors, and Kubernetes is dutifully doing exactly what it was told — restart the pod — without anyone on your team knowing anything is wrong. By the time someone notices a user complaint, the issue has been festering for an hour.

The fix is a proper alerting pipeline. Prometheus watches cluster state via kube-state-metrics, fires alerts when pods are unhealthy past a threshold, AlertManager routes those alerts to an incident management system, and that system pages whoever is on call. Here is how to wire it together.

The Monitoring Stack

The standard Kubernetes observability stack for this use case has three layers:

| Layer | Tool | What it does |
|---|---|---|
| Metrics collection | Prometheus + kube-state-metrics | Exposes pod phase, container state, and restart counts as metrics |
| Alert evaluation | Prometheus alerting rules | Fires alerts when metrics cross defined thresholds |
| Alert routing | AlertManager | Deduplicates, groups, and routes alerts to receivers |

If you are running the kube-prometheus-stack Helm chart, the first two layers are already partially configured. What most teams are missing are the precise alerting rules that distinguish a real problem from normal Kubernetes churn, and the routing from AlertManager into an on-call system.

Writing the Prometheus Alert Rules

The two conditions you care about are containers stuck in CrashLoopBackOff and pods that are not Ready. kube-state-metrics exposes both:

  • kube_pod_container_status_waiting_reason with label reason="CrashLoopBackOff"
  • kube_pod_status_ready with label condition="false"

Here are the alert rules you actually want:

groups:
  - name: kubernetes-pod-health
    interval: 1m
    rules:

      - alert: PodCrashLooping
        expr: |
          kube_pod_container_status_waiting_reason{
            reason="CrashLoopBackOff",
            namespace!~"kube-system|monitoring"
          } == 1
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Pod {{ $labels.pod }} is in CrashLoopBackOff"
          description: >
            Container {{ $labels.container }} in pod {{ $labels.pod }}
            (namespace {{ $labels.namespace }}) has been in CrashLoopBackOff
            for more than 5 minutes.
          runbook_url: "https://wiki.example.com/runbooks/crashloopbackoff"

      - alert: PodNotReady
        expr: |
          kube_pod_status_ready{
            condition="false",
            namespace!~"kube-system|monitoring"
          } == 1
          unless on(pod, namespace)
          kube_pod_status_phase{phase=~"Succeeded|Failed"} == 1
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Pod {{ $labels.pod }} has been not-ready for 10+ minutes"
          description: >
            Pod {{ $labels.pod }} in namespace {{ $labels.namespace }}
            has not been in a ready state for more than 10 minutes.

The for clause is doing important work here. Without it, every rolling deployment would trigger an alert the moment new pods spin up before passing their readiness probes. Setting for: 5m on CrashLoopBackOff and for: 10m on PodNotReady filters out the noise from normal deployments, which typically complete in under two minutes.

The unless clause on PodNotReady excludes pods in Succeeded or Failed phase. Completed Jobs and one-shot pods should not page anyone at 3am.
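
Both behaviours can be checked offline with promtool's rule unit tests before the rules reach a cluster. The following is a minimal sketch, assuming the rule group above is saved as pod-health-rules.yaml (a hypothetical file name) and the test file is run with promtool test rules. The pod and namespace names are made up for illustration.

```yaml
# pod-health-rules.test.yaml (hypothetical file name)
rule_files:
  - pod-health-rules.yaml   # assumed to contain the rule group above

evaluation_interval: 1m

tests:
  # Case 1: a pod that is not ready for two minutes during a rollout.
  - interval: 1m
    input_series:
      - series: 'kube_pod_status_ready{condition="false",namespace="staging",pod="web-1"}'
        values: '1 1 0+0x13'   # not ready at 0m and 1m, ready afterwards
    alert_rule_test:
      - eval_time: 12m
        alertname: PodNotReady
        exp_alerts: []         # the 10m "for" window was never satisfied

  # Case 2: a completed Job pod that reports not-ready indefinitely.
  - interval: 1m
    input_series:
      - series: 'kube_pod_status_ready{condition="false",namespace="batch",pod="report-job-x"}'
        values: '1+0x15'
      - series: 'kube_pod_status_phase{phase="Succeeded",namespace="batch",pod="report-job-x"}'
        values: '1+0x15'
    alert_rule_test:
      - eval_time: 12m
        alertname: PodNotReady
        exp_alerts: []         # excluded by the "unless" clause
```

Both cases assert that nothing fires. A positive case, asserting that an alert does fire, would additionally list the expected labels and annotations under exp_alerts.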

Filtering by Namespace

The namespace!~"kube-system|monitoring" filter keeps system pods out of your alerts. If you want to alert on kube-system pods separately with a different routing path or a higher threshold, add a second rule group rather than complicating the first one.
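
A second rule group for system pods might look like the sketch below. The group name, alert name, 15-minute window, and the scope label are all assumptions chosen for illustration, not values from an existing setup.

```yaml
groups:
  - name: kubernetes-system-pod-health   # hypothetical group name
    rules:
      - alert: SystemPodCrashLooping     # hypothetical alert name
        expr: |
          kube_pod_container_status_waiting_reason{
            reason="CrashLoopBackOff",
            namespace="kube-system"
          } == 1
        for: 15m              # assumed higher threshold; tune to your cluster
        labels:
          severity: warning
          team: platform
          scope: system       # hypothetical label a separate route could match on
        annotations:
          summary: "System pod {{ $labels.pod }} is in CrashLoopBackOff"
```

Keeping this in its own group leaves the application rules untouched, and the extra label gives AlertManager something to match on if these alerts should follow a different routing path.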

Configuring AlertManager to Route to Alert24

Once Prometheus fires an alert, AlertManager receives it and decides where it goes. Your alertmanager.yaml receiver config for Alert24 looks like this:

global:
  resolve_timeout: 5m

route:
  group_by: ["alertname", "namespace"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "alert24-critical"
  routes:
    - matchers:
        - severity = "critical"
      receiver: "alert24-critical"
      continue: false
    - matchers:
        - severity = "warning"
      receiver: "alert24-warning"
      continue: false

receivers:
  - name: "alert24-critical"
    webhook_configs:
      - url: "https://api.alert24.com/v1/alerts/ingest/YOUR_INTEGRATION_KEY"
        send_resolved: true
        http_config:
          authorization:
            credentials: "YOUR_API_KEY"

  - name: "alert24-warning"
    webhook_configs:
      - url: "https://api.alert24.com/v1/alerts/ingest/YOUR_INTEGRATION_KEY"
        send_resolved: true
        http_config:
          authorization:
            credentials: "YOUR_API_KEY"

inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["namespace", "pod"]

The inhibit_rules block suppresses PodNotReady warnings for a pod that is already generating a PodCrashLooping critical alert. Without this, a single bad pod generates two separate alerts and your on-call gets two pages for the same thing.
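
As written, the rule matches on severity alone, so any critical alert also silences any warning that shares the same namespace and pod labels. If that is broader than intended, a narrower variant can match on alert names instead. A sketch, using the two alert names defined earlier:

```yaml
inhibit_rules:
  - source_matchers:
      - alertname = "PodCrashLooping"   # the alert that does the suppressing
    target_matchers:
      - alertname = "PodNotReady"       # the alert that gets suppressed
    equal: ["namespace", "pod"]         # only when both refer to the same pod
```

The trade-off is that each new pair of overlapping alerts then needs its own inhibit rule.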

The group_by: ["alertname", "namespace"] setting means AlertManager bundles alerts of the same type in the same namespace into a single notification rather than sending one page per pod. If you have a bad deployment that takes down fifteen pods at once, your on-call gets one page, not fifteen.

Where Alert24 Fits In

AlertManager is good at routing. It is not an on-call management system. It does not know who is on call this week, it does not escalate if the first responder does not acknowledge, it does not log incident timelines, and it does not give your team a status page to post updates to.

Alert24 fills that gap. When AlertManager sends the webhook, Alert24 receives it, looks up your on-call schedule, and calls or texts the right person. If they do not acknowledge within your configured timeout, it escalates to the next person in the rotation. The alert stays active until someone resolves it, and the full timeline — when it fired, who was paged, when they acknowledged, when it resolved — is recorded automatically.

This matters more than it might seem during a real incident. When you are woken up at 3am and three services are behaving oddly, knowing that a CrashLoopBackOff alert fired 47 minutes ago while you were asleep tells you something important about the incident timeline.

Testing the Pipeline End to End

Before you trust this for production, confirm the pipeline works by deploying a pod that will immediately fail:

apiVersion: v1
kind: Pod
metadata:
  name: crash-test
  namespace: staging
spec:
  containers:
    - name: crash-test
      image: busybox
      command: ["sh", "-c", "exit 1"]

Apply it with kubectl apply -f crash-test.yaml (the staging namespace must already exist). Within a minute or so the pod will be in CrashLoopBackOff. Wait for the for: 5m window, plus the 30-second group_wait, to expire, then verify:

  1. The alert appears in the Prometheus Alerts UI (check /alerts on your Prometheus instance)
  2. AlertManager shows it as active and routed (check /#/alerts on your AlertManager instance)
  3. Alert24 shows the incoming alert and has paged the on-call engineer

Clean up with kubectl delete pod crash-test -n staging and confirm the resolved notification arrives as well.
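
The test above only exercises the critical path. To exercise the PodNotReady warning path as well, one option is a pod that keeps running but never passes its readiness probe. A minimal sketch, with a hypothetical pod name and an assumed probe period:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: not-ready-test        # hypothetical name
  namespace: staging
spec:
  containers:
    - name: not-ready-test
      image: busybox
      command: ["sh", "-c", "sleep 3600"]   # container stays Running
      readinessProbe:
        exec:
          command: ["/bin/false"]  # always fails, so the pod never becomes Ready
        periodSeconds: 10          # assumed probe period
```

After the for: 10m window this pod should surface as a PodNotReady alert routed to the alert24-warning receiver. Delete it afterwards in the same way as crash-test.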

Next Steps

Once this baseline is working, two additions make the system more useful. First, add a KubeDeploymentReplicasMismatch rule that fires when available replicas stay below desired replicas for an extended period. This catches situations where new pods fail to schedule at all, which neither of the rules above will catch. Second, attach a runbook_url annotation to every alert; the PodNotReady rule above does not have one yet. The person woken up at 3am will have the alert in front of them, and a direct link to the relevant runbook cuts minutes off the time to resolution.
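
The replicas rule could be sketched as follows, using the kube_deployment_spec_replicas and kube_deployment_status_replicas_available metrics from kube-state-metrics. The 15-minute window, the severity, and the runbook URL are assumptions to adjust.

```yaml
groups:
  - name: kubernetes-deployment-health   # hypothetical group name
    rules:
      - alert: KubeDeploymentReplicasMismatch
        expr: |
          kube_deployment_spec_replicas{namespace!~"kube-system|monitoring"}
            > on(namespace, deployment)
          kube_deployment_status_replicas_available
        for: 15m               # assumed value for "an extended period"
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Deployment {{ $labels.deployment }} has fewer available replicas than desired"
          description: >
            Deployment {{ $labels.deployment }} in namespace {{ $labels.namespace }}
            has had fewer available replicas than desired for more than 15 minutes.
          runbook_url: "https://wiki.example.com/runbooks/replicas-mismatch"  # hypothetical URL
```

Because the comparison is made per deployment, this fires even when the missing pods never got far enough to crash or to report a readiness state.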

Start with the two rules above, validate the full pipeline from CrashLoopBackOff to page, then layer in more rules as you understand where your blind spots are.