How to Get Paged When a GitHub Actions Deployment Fails in Production

The Problem with Email Notifications at 11pm

GitHub Actions will send you an email when a workflow fails. You probably already know this. You probably also know that if your production deployment breaks at 11pm on a Friday, you are not going to see that email until the next morning—at which point your users have spent eight hours staring at a broken app, your on-call rotation has no idea anything is wrong, and you are starting your weekend with an incident retrospective.

Email is a pull medium. You check it when you want to. An incident in production is a push situation. You need something that reaches out to the right person, right now, and keeps escalating until someone acknowledges it. That is not what email does.

The fix is straightforward: add a final step to your deployment workflow that fires an incident alert when anything upstream fails. Two minutes of YAML, one stored secret, and you have a path from broken deployment to paged engineer.

How if: failure() Works

GitHub Actions evaluates a step's if condition against the current job context. By default, a step only runs if all previous steps succeeded. The special expression failure() inverts that: the step runs only when at least one prior step has failed and the job has not been explicitly cancelled.

This is exactly the hook you need. Your deployment steps run normally. If they succeed, the alert step is skipped entirely. If anything fails—a failed test, a bad Docker build, a Kubernetes rollout that times out—the alert step fires.

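GitHub evaluates if: failure() at the moment the step is about to run, against the outcome of every earlier step in the job. That means it catches a failure in any prior step, including steps that were themselves conditional. One exception: a step marked continue-on-error: true does not count as a failure, so it will not trigger the alert.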

There is a related expression worth knowing: always(). That one runs regardless of success or failure, which is useful for cleanup steps but not for incident alerts—you do not want a page every time a deployment succeeds.
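To make the difference concrete, here is a minimal sketch of a throwaway workflow that forces a failure and shows which later steps run. The workflow name, job name, and step names are illustrative and not part of the deployment workflow in this post.

```yaml
# Minimal sketch: how step-level conditions react to an earlier failure.
name: Condition demo  # hypothetical workflow name

on: workflow_dispatch  # run it manually from the Actions tab

jobs:
  demo:
    runs-on: ubuntu-latest
    steps:
      - name: Simulate a failing deploy
        run: exit 1

      - name: Default condition
        # No "if:" means an implicit success() check, so this step is skipped.
        run: echo "never printed, because a previous step failed"

      - name: Runs only on failure
        if: failure()
        run: echo "a previous step failed"

      - name: Runs on success, failure, or cancellation
        if: always()
        run: echo "cleanup"
```

Running this once and reading the step list in the run summary is a quick way to confirm the behavior before wiring a real alert to it.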

Adding the Alert Step

Here is a complete deployment workflow with an incident notification step wired in. The interesting part is the final step.

name: Deploy to Production

on:
  push:
    branches:
      - main

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install dependencies
        run: npm ci

      - name: Run tests
        run: npm test

      - name: Build
        run: npm run build

      - name: Deploy to production
        run: ./scripts/deploy.sh
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}

      - name: Notify Alert24 on failure
        if: failure()
        run: |
          curl -s -X POST https://api.alert24.app/v1/incidents \
            -H "Authorization: Bearer ${{ secrets.ALERT24_API_KEY }}" \
            -H "Content-Type: application/json" \
            -d '{
              "title": "Production deployment failed",
              "description": "GitHub Actions workflow failed on branch main. Commit: ${{ github.sha }}. Run: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}",
              "severity": "critical",
              "source": "github-actions"
            }'

The description field pulls in the commit SHA and a direct link to the failed run. When your on-call engineer gets paged, they do not need to go hunting for context—the alert already tells them which commit and links them straight to the logs.

Storing the API Key as a Secret

Never put credentials directly in your workflow YAML. GitHub provides encrypted secrets for exactly this purpose, and they are straightforward to set up.

Go to your repository on GitHub, then navigate to Settings > Secrets and variables > Actions. Click New repository secret. Name it ALERT24_API_KEY and paste in the API key from your Alert24 account. Save it.

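From that point on, ${{ secrets.ALERT24_API_KEY }} in your workflow file will resolve to the key at runtime. GitHub automatically redacts the exact secret value from log output. It cannot recognize a transformed copy, such as a base64-encoded version, so avoid echoing or reformatting the key in your steps.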

If you are managing many repositories and want a single key to work across all of them, you can set an organization-level secret instead. The same flow applies under your organization's settings rather than a specific repository.

One practical note: create a dedicated API key for your CI/CD systems rather than reusing a personal key. That way, if you need to rotate it or revoke it, you can do so without affecting anything tied to your personal credentials.
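One optional hardening of the notification step is to hand the secret to the shell as an environment variable instead of interpolating it into the script text. The sketch below assumes the same endpoint and payload fields used earlier in this post. The description field is left out for brevity, and the curl flags are a suggestion rather than anything Alert24 requires.

```yaml
      - name: Notify Alert24 on failure
        if: failure()
        env:
          # The secret is exposed to this step only, as an environment variable.
          ALERT24_API_KEY: ${{ secrets.ALERT24_API_KEY }}
        run: |
          # --fail makes curl exit non-zero on an HTTP error response,
          # so a rejected alert shows up as a failed step instead of passing silently.
          curl --silent --show-error --fail -X POST https://api.alert24.app/v1/incidents \
            -H "Authorization: Bearer $ALERT24_API_KEY" \
            -H "Content-Type: application/json" \
            -d '{"title": "Production deployment failed", "severity": "critical", "source": "github-actions"}'
```

Scoping the variable to a single step keeps the key out of the environment of every other step in the job.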

What Happens on the Alert24 Side

When the curl request reaches the Alert24 incidents API, it creates a new incident and immediately runs it through your on-call routing rules.

Alert24 looks at the incident's severity, the time of day, and your current on-call schedule to determine who gets paged. If the primary on-call does not acknowledge within your configured escalation window, it moves to the secondary. If that person does not respond either, it can escalate further or notify a Slack channel, depending on how you have set up your policy.

The incident stays open until someone explicitly resolves it. That matters: you get a full timeline of when the alert fired, when it was acknowledged, and when it was resolved. That data is useful for incident retrospectives and for tracking whether your on-call response times are meeting your team's goals.

Alert24 also handles status page updates if you are running one. You can configure it so that a critical incident automatically posts a status page entry, which means your users get informed at the same time your engineer does.

Covering Multi-Job Workflows

If your workflow has multiple jobs (separate jobs for testing, building, and deploying), a step-level if: failure() only sees failures within its own job. When an upstream job fails, every job that depends on it is skipped entirely, so a notification step inside the skipped job never runs.

For multi-job workflows, add a dedicated notification job with a needs clause and a job-level condition:

  notify-on-failure:
    runs-on: ubuntu-latest
    needs: [test, build, deploy]
    if: failure()

    steps:
      - name: Notify Alert24
        run: |
          curl -s -X POST https://api.alert24.app/v1/incidents \
            -H "Authorization: Bearer ${{ secrets.ALERT24_API_KEY }}" \
            -H "Content-Type: application/json" \
            -d '{
              "title": "Production deployment failed",
              "description": "One or more jobs failed. Run: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}",
              "severity": "critical",
              "source": "github-actions"
            }'

The needs array lists every job this one waits for. With if: failure() at the job level, the notification job runs when any of those jobs failed, even if later jobs in the chain were skipped as a result. This pattern catches failures anywhere in your pipeline.
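For orientation, here is a skeleton of how the whole jobs section might fit together. The test, build, and deploy job bodies are placeholders: a real pipeline would also pass build artifacts between jobs, and the final echo stands in for the curl call shown above.

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test

  build:
    runs-on: ubuntu-latest
    needs: test  # skipped if test fails
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run build

  deploy:
    runs-on: ubuntu-latest
    needs: build  # skipped if build fails or is skipped
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh

  notify-on-failure:
    runs-on: ubuntu-latest
    needs: [test, build, deploy]
    if: failure()  # true when any job listed in needs has failed
    steps:
      - run: echo "send the Alert24 request here"
```

Listing all three jobs in needs, rather than only deploy, keeps the intent obvious to the next person reading the file.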

A Quick Reference

| Scenario | Condition to use |
| --- | --- |
| Single-job workflow, notify on any step failure | if: failure() on the final step |
| Multi-job workflow, notify if any job fails | Separate job with needs: [...] and if: failure() |
| Notify on failure but not on cancellation | if: failure() (cancellation does not trigger failure()) |
| Notify regardless of outcome | if: always() (use with care) |

Next Steps

The YAML above is enough to get paged when a deployment breaks. To get more out of it:

First, set up your on-call schedule in Alert24 if you have not already. The API call does the work of creating the incident, but routing it to the right person requires a schedule with at least one layer.

Second, consider whether you want different severity levels for different failure types. A failed deployment is critical. A failed lint check in a non-blocking workflow probably is not. You can pass different severity values in the API call depending on which workflow is calling it.

Third, look at the Alert24 runbook field. You can include a URL to your deployment runbook in the incident payload, which gives your on-call engineer a starting point for diagnosis without needing to remember where the documentation lives at 11pm.
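As a rough sketch of the second and third suggestions, the step below sends a lower severity and a runbook link from a non-blocking lint workflow. Several things here are assumptions to check against the Alert24 API documentation: the severity value low, the field name runbook_url, and the example.com runbook address. It also relies on jq, which GitHub's hosted Ubuntu runners include.

```yaml
      - name: Notify Alert24 about a lint failure
        if: failure()
        env:
          ALERT24_API_KEY: ${{ secrets.ALERT24_API_KEY }}
          # Assumed values: confirm accepted severities and field names for your account.
          SEVERITY: low
          RUNBOOK_URL: https://example.com/runbooks/deploy  # placeholder URL
        run: |
          # jq builds the JSON body so every value is quoted and escaped correctly.
          jq -n \
            --arg severity "$SEVERITY" \
            --arg runbook "$RUNBOOK_URL" \
            '{title: "Lint check failed", severity: $severity, source: "github-actions", runbook_url: $runbook}' \
          | curl --silent --show-error --fail -X POST https://api.alert24.app/v1/incidents \
              -H "Authorization: Bearer $ALERT24_API_KEY" \
              -H "Content-Type: application/json" \
              --data @-
```

Keeping severity and the runbook link in env makes them easy to change per workflow without touching the request itself.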

The goal is not just to get paged—it is to get paged with enough context that you can start fixing the problem immediately.