
How to Page the On-Call Engineer from a CloudWatch Alarm

The Gap CloudWatch Alarms Leave Open

You have CloudWatch alarms set up. When something breaks, an SNS notification fires and an email lands in a shared inbox — or maybe a text goes to one person's phone. That person may or may not be available. There's no rotation, no escalation if they don't acknowledge, and no record of who was paged and when.

AWS gives you powerful telemetry and alerting primitives, but on-call scheduling isn't one of them. SNS delivers a message and considers its job done. Whether anyone actually acts on that message is your problem.

The fix is straightforward: intercept the SNS notification with a Lambda function, translate it into an incident, and hand it off to a system that understands on-call schedules. Here's how to do it end to end.

How the Pattern Works

The full flow looks like this:

Step | Service    | What happens
1    | CloudWatch | Alarm transitions to ALARM state
2    | SNS        | Publishes alarm payload to a topic
3    | Lambda     | Receives SNS event, calls Alert24 incident API
4    | Alert24    | Creates incident, pages the on-call engineer
5    | Alert24    | Escalates if unacknowledged within your policy window

You already have steps 1 and 2 — CloudWatch alarms natively publish to SNS. The work here is steps 3 and 4.
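Before writing step 3, it helps to see what the Lambda function actually receives. The sketch below uses a trimmed, hand-written sample event (the alarm name, account ID, and reason text are illustrative values, and extract_alarms is a hypothetical helper name). The point it shows is that the alarm fields arrive as a JSON string inside Records[].Sns.Message and have to be decoded a second time.

```python
import json

# Trimmed-down example of the event Lambda receives from SNS. The alarm
# fields live inside Records[].Sns.Message as a JSON *string*, not a dict.
sample_event = {
    "Records": [
        {
            "EventSource": "aws:sns",
            "Sns": {
                "Subject": 'ALARM: "high-error-rate" in US East (N. Virginia)',
                "Message": json.dumps({
                    "AlarmName": "high-error-rate",
                    "NewStateValue": "ALARM",
                    "OldStateValue": "OK",
                    "NewStateReason": "Threshold Crossed: 1 datapoint was greater than the threshold.",
                    "Region": "US East (N. Virginia)",
                    "AWSAccountId": "123456789012",
                }),
            },
        }
    ]
}


def extract_alarms(event):
    """Return one decoded alarm dict per SNS record in the event."""
    return [json.loads(r["Sns"]["Message"]) for r in event.get("Records", [])]


alarms = extract_alarms(sample_event)
print(alarms[0]["AlarmName"], alarms[0]["NewStateValue"])
```

A real message carries more fields (the trigger definition, timestamps, the alarm ARN), but these are the ones the handler in this post reads.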

Create the SNS Topic

If you don't already have a dedicated SNS topic for infrastructure alerts, create one:

aws sns create-topic --name infra-alerts --region us-east-1

Note the TopicArn from the response. You'll need it when configuring CloudWatch alarms and when subscribing the Lambda function.

Write the Lambda Handler

Create a Python function that receives the SNS event, extracts the alarm details, and posts them to Alert24's incident API.

import json
import os
import urllib.request
import urllib.error

ALERT24_API_KEY = os.environ["ALERT24_API_KEY"]
ALERT24_INTEGRATION_KEY = os.environ["ALERT24_INTEGRATION_KEY"]
ALERT24_API_URL = "https://api.alert24.com/v1/incidents"


def lambda_handler(event, context):
    for record in event.get("Records", []):
        sns_message = json.loads(record["Sns"]["Message"])

        alarm_name = sns_message.get("AlarmName", "Unknown Alarm")
        new_state = sns_message.get("NewStateValue", "UNKNOWN")
        reason = sns_message.get("NewStateReason", "")
        region = sns_message.get("Region", "")
        account_id = sns_message.get("AWSAccountId", "")

        # Only page on ALARM state; INSUFFICIENT_DATA and OK can be filtered here
        if new_state != "ALARM":
            print(f"Skipping state {new_state} for {alarm_name}")
            continue

        title = f"CloudWatch: {alarm_name}"
        body = f"{reason}\n\nRegion: {region}\nAccount: {account_id}"

        payload = json.dumps({
            "integration_key": ALERT24_INTEGRATION_KEY,
            "event_type": "trigger",
            "description": title,
            "details": body,
            "severity": "critical",
        }).encode("utf-8")

        req = urllib.request.Request(
            ALERT24_API_URL,
            data=payload,
            headers={
                "Authorization": f"Bearer {ALERT24_API_KEY}",
                "Content-Type": "application/json",
            },
            method="POST",
        )

        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                print(f"Alert24 response: {resp.status} for alarm {alarm_name}")
        except urllib.error.HTTPError as e:
            print(f"Alert24 HTTP error {e.code}: {e.read().decode()}")
            raise
        except urllib.error.URLError as e:
            print(f"Alert24 connection error: {e.reason}")
            raise

    return {"statusCode": 200}

A few things worth noting about this handler:

The ALERT24_INTEGRATION_KEY is the key tied to a specific Alert24 service. It tells Alert24 which on-call schedule and escalation policy to use when creating the incident. You configure that once in Alert24 under your service settings.

The function filters out non-ALARM states by default. You probably don't want a page when an alarm returns to OK; acknowledgment and resolution are handled separately in Alert24. If you want auto-resolve on OK, add a branch for new_state == "OK" ahead of the skip that posts an event_type: resolve payload.
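One way to structure that extra branch is to map the alarm state to an event type first and build the payload from the result. This is a minimal sketch, not Alert24's documented contract: the helper names event_type_for_state and build_payload are hypothetical, and it assumes a resolve request accepts the same fields as a trigger minus severity. How Alert24 matches a resolve to the open incident is not covered in this post, so check the API documentation before relying on it.

```python
def event_type_for_state(new_state):
    """Map a CloudWatch alarm state to an incident event type, or None to skip."""
    if new_state == "ALARM":
        return "trigger"
    if new_state == "OK":
        return "resolve"
    return None  # INSUFFICIENT_DATA and anything unexpected is ignored


def build_payload(sns_message, integration_key):
    """Build the request body for one alarm message, or None if it should be skipped."""
    event_type = event_type_for_state(sns_message.get("NewStateValue", "UNKNOWN"))
    if event_type is None:
        return None
    alarm_name = sns_message.get("AlarmName", "Unknown Alarm")
    payload = {
        "integration_key": integration_key,
        "event_type": event_type,
        "description": f"CloudWatch: {alarm_name}",
        "details": sns_message.get("NewStateReason", ""),
    }
    # Assumption: severity only makes sense when opening an incident.
    if event_type == "trigger":
        payload["severity"] = "critical"
    return payload
```

In the handler, a None return takes the place of the current continue, and any other return value is JSON-encoded and posted as before.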

Error handling re-raises exceptions intentionally. SNS invokes the function asynchronously, so Lambda retries a failed invocation (twice by default), and you'd rather have a duplicate page than a missed one.
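If duplicate pages from retries become noisy, one option is to attach a stable key to each state change so the receiving side can recognise a repeat. The sketch below derives such a key from fields CloudWatch includes in the alarm message (AlarmArn, NewStateValue, StateChangeTime). Whether Alert24 accepts a deduplication field, and what it would be called, is an assumption here; dedup_key is a hypothetical name.

```python
import hashlib


def dedup_key(sns_message):
    """Derive a stable key for one alarm state change.

    A Lambda retry redelivers the same SNS message, so these fields are
    identical across retries of the same transition and differ between
    separate transitions of the same alarm.
    """
    raw = "|".join([
        sns_message.get("AlarmArn", sns_message.get("AlarmName", "")),
        sns_message.get("NewStateValue", ""),
        sns_message.get("StateChangeTime", ""),
    ])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:32]
```

The key could be sent as an extra payload field or simply logged, which at least makes duplicates easy to spot in CloudWatch Logs.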

Deploy the Lambda Function

Package and deploy:

# Create a deployment package
zip function.zip lambda_function.py

# Create the function
aws lambda create-function \
  --function-name cloudwatch-to-alert24 \
  --runtime python3.12 \
  --role arn:aws:iam::YOUR_ACCOUNT_ID:role/lambda-basic-execution \
  --handler lambda_function.lambda_handler \
  --zip-file fileb://function.zip \
  --environment "Variables={ALERT24_API_KEY=your_key,ALERT24_INTEGRATION_KEY=your_integration_key}" \
  --timeout 30 \
  --region us-east-1

Then subscribe the Lambda function to your SNS topic:

aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:infra-alerts \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:us-east-1:YOUR_ACCOUNT_ID:function:cloudwatch-to-alert24

Finally, grant SNS permission to invoke the Lambda:

aws lambda add-permission \
  --function-name cloudwatch-to-alert24 \
  --statement-id sns-invoke \
  --action lambda:InvokeFunction \
  --principal sns.amazonaws.com \
  --source-arn arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:infra-alerts

Wire a CloudWatch Alarm to the SNS Topic

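If you have an existing alarm you want to route through this pipeline, update its actions. Note that put-metric-alarm replaces the entire alarm definition, so repeat the alarm's current metric settings alongside the new actions:

# Placeholder metric settings: copy the real ones from
# `aws cloudwatch describe-alarms --alarm-names high-error-rate`
aws cloudwatch put-metric-alarm \
  --alarm-name "high-error-rate" \
  --namespace AWS/Lambda --metric-name Errors --statistic Sum --period 60 \
  --evaluation-periods 1 --threshold 5 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:infra-alerts \
  --ok-actions arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:infra-alerts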

For new alarms, include --alarm-actions at creation time. You can attach the same SNS topic to as many alarms as you want — all of them will flow through the same Lambda and create incidents in Alert24 with the correct alarm name and reason in the description.

Set Up the On-Call Side in Alert24

Before this produces an actual page, you need three things configured in Alert24:

A service that represents your infrastructure or application. This is what the integration key maps to.

An on-call schedule attached to that service. Put your engineers in a rotation — daily, weekly, follow-the-sun, whatever your team uses. Alert24 will evaluate the schedule at the moment the incident arrives and route the page to whoever is currently on duty.

An escalation policy that defines what happens if the first responder doesn't acknowledge within N minutes. You can escalate to a secondary, notify a Slack channel, or loop in a manager. This is the part that raw SNS delivery simply cannot do.

Once those are in place, an alarm firing in CloudWatch produces a page within seconds. If the on-call engineer acknowledges, the incident moves to acknowledged and the escalation timer stops. If they don't, Alert24 escalates per your policy.
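To make the schedule and escalation ideas concrete, here is a small conceptual sketch of the two lookups involved: who is on call at a given moment, and how far an unacknowledged incident has escalated. It is an illustration only and says nothing about how Alert24 implements either; the names, rotation start date, and escalation delays are all made up.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical data: names, start date, and timings are made up for illustration.
ROTATION = ["alice", "bob", "carol"]
ROTATION_START = datetime(2024, 1, 1, tzinfo=timezone.utc)  # a Monday
SHIFT_LENGTH = timedelta(weeks=1)
# (minutes without acknowledgment, who gets notified at that point)
ESCALATION_STEPS = [(0, "primary"), (10, "secondary"), (30, "manager")]


def on_call_at(when):
    """Return who holds the weekly shift that contains `when`."""
    shifts_elapsed = (when - ROTATION_START) // SHIFT_LENGTH
    return ROTATION[shifts_elapsed % len(ROTATION)]


def escalation_level(minutes_unacknowledged):
    """Return the furthest escalation target reached after this many minutes."""
    reached = [
        target
        for delay, target in ESCALATION_STEPS
        if minutes_unacknowledged >= delay
    ]
    return reached[-1]
```

The schedule is evaluated at the moment the incident arrives, which is why the same alarm can page different people on different days without any change on the AWS side.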

What You've Built

Your CloudWatch alarms still fire exactly as before. Nothing about your existing observability setup changes. You've added a thin Lambda adapter that translates SNS messages into structured incidents, and Alert24 takes it from there — handling the scheduling, paging, escalation, and incident history.

The operational cost is low: one Lambda function, one SNS topic, a few IAM permissions. The benefit is that every alarm now follows a defined process rather than landing in an inbox and hoping someone sees it.

Next Steps

  • Deploy the Lambda and run a test by manually setting a CloudWatch alarm to the ALARM state with aws cloudwatch set-alarm-state
  • Review the Alert24 incident it creates and confirm the description and severity look right
  • Adjust the severity field in the Lambda payload — you can map different alarms to different severity levels by inspecting the alarm name or dimensions in the SNS payload
  • Keep --ok-actions pointed at the SNS topic and implement the auto-resolve branch in the Lambda if you want incidents to close automatically when alarms recover
  • Consider configuring a dead-letter queue (DLQ) on the Lambda so invocations that still fail after Lambda's retries are captured instead of silently dropping pages
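For the severity step above, a simple starting point is a list of name patterns checked in order. This is a sketch under assumptions: the patterns and the "warning" level are hypothetical (this post only shows "critical" as a severity value), and severity_for is an illustrative helper name.

```python
import fnmatch

# Hypothetical rules: alarm-name glob pattern -> severity. First match wins.
SEVERITY_RULES = [
    ("prod-*", "critical"),
    ("staging-*", "warning"),
]
DEFAULT_SEVERITY = "critical"  # the value the handler hardcodes today


def severity_for(sns_message):
    """Pick a severity from the alarm name, falling back to the default."""
    name = sns_message.get("AlarmName", "")
    for pattern, severity in SEVERITY_RULES:
        if fnmatch.fnmatchcase(name, pattern):
            return severity
    return DEFAULT_SEVERITY
```

In the handler, "severity": severity_for(sns_message) would replace the hardcoded "critical". Defaulting unknown alarms to the highest level errs on the side of paging rather than missing something.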