The Gap CloudWatch Alarms Leave Open
You have CloudWatch alarms set up. When something breaks, an SNS notification fires and an email lands in a shared inbox — or maybe a text goes to one person's phone. That person may or may not be available. There's no rotation, no escalation if they don't acknowledge, and no record of who was paged and when.
AWS gives you powerful telemetry and alerting primitives, but on-call scheduling isn't one of them. SNS delivers a message and considers its job done. Whether anyone actually acts on that message is your problem.
The fix is straightforward: intercept the SNS notification with a Lambda function, translate it into an incident, and hand it off to a system that understands on-call schedules. Here's how to do it end to end.
How the Pattern Works
The full flow looks like this:
| Step | Service | What happens |
|---|---|---|
| 1 | CloudWatch | Alarm transitions to ALARM state |
| 2 | SNS | Publishes alarm payload to a topic |
| 3 | Lambda | Receives SNS event, calls Alert24 incident API |
| 4 | Alert24 | Creates incident, pages the on-call engineer |
| 5 | Alert24 | Escalates if unacknowledged within your policy window |
You already have steps 1 and 2 — CloudWatch alarms natively publish to SNS. The work here is steps 3 and 4.
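For orientation, here is a rough sketch of the event the Lambda receives in step 3. The values are made up and a real alarm message carries more fields; `extract_alarm` is a hypothetical helper name, not part of any AWS library. The point it illustrates is that the alarm arrives as a JSON string nested inside the SNS record.

```python
import json

# Illustrative SNS event; every value here is made up. A real CloudWatch
# alarm message carries additional fields (for example, Trigger).
SAMPLE_EVENT = {
    "Records": [
        {
            "EventSource": "aws:sns",
            "Sns": {
                "Subject": 'ALARM: "high-error-rate" in US East (N. Virginia)',
                # The alarm itself arrives as a JSON *string*, not a dict.
                "Message": json.dumps({
                    "AlarmName": "high-error-rate",
                    "NewStateValue": "ALARM",
                    "OldStateValue": "OK",
                    "NewStateReason": "Threshold Crossed: 1 datapoint was greater than the threshold.",
                    "Region": "US East (N. Virginia)",
                    "AWSAccountId": "123456789012",
                }),
            },
        }
    ]
}


def extract_alarm(record):
    """Decode the alarm JSON embedded in one SNS record (hypothetical helper)."""
    return json.loads(record["Sns"]["Message"])


alarm = extract_alarm(SAMPLE_EVENT["Records"][0])
print(alarm["AlarmName"], alarm["NewStateValue"])
```

Note that the Region field in the alarm message is a display name rather than a region code such as us-east-1, which matters if you plan to build console links from it.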
Create the SNS Topic
If you don't already have a dedicated SNS topic for infrastructure alerts, create one:
aws sns create-topic --name infra-alerts --region us-east-1
Note the TopicArn from the response. You'll need it when configuring CloudWatch alarms and when subscribing the Lambda function.
Write the Lambda Handler
Create a Python function that receives the SNS event, extracts the alarm details, and posts them to Alert24's incident API.
import json
import os
import urllib.request
import urllib.error

ALERT24_API_KEY = os.environ["ALERT24_API_KEY"]
ALERT24_INTEGRATION_KEY = os.environ["ALERT24_INTEGRATION_KEY"]
ALERT24_API_URL = "https://api.alert24.com/v1/incidents"


def lambda_handler(event, context):
    for record in event.get("Records", []):
        # The alarm payload is a JSON string inside the SNS record
        sns_message = json.loads(record["Sns"]["Message"])

        alarm_name = sns_message.get("AlarmName", "Unknown Alarm")
        new_state = sns_message.get("NewStateValue", "UNKNOWN")
        reason = sns_message.get("NewStateReason", "")
        region = sns_message.get("Region", "")
        account_id = sns_message.get("AWSAccountId", "")

        # Only page on ALARM state; INSUFFICIENT_DATA and OK can be filtered here
        if new_state != "ALARM":
            print(f"Skipping state {new_state} for {alarm_name}")
            continue

        title = f"CloudWatch: {alarm_name}"
        body = f"{reason}\n\nRegion: {region}\nAccount: {account_id}"

        payload = json.dumps({
            "integration_key": ALERT24_INTEGRATION_KEY,
            "event_type": "trigger",
            "description": title,
            "details": body,
            "severity": "critical",
        }).encode("utf-8")

        req = urllib.request.Request(
            ALERT24_API_URL,
            data=payload,
            headers={
                "Authorization": f"Bearer {ALERT24_API_KEY}",
                "Content-Type": "application/json",
            },
            method="POST",
        )

        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                print(f"Alert24 response: {resp.status} for alarm {alarm_name}")
        except urllib.error.HTTPError as e:
            print(f"Alert24 HTTP error {e.code}: {e.read().decode()}")
            raise  # re-raise so the invocation is marked failed
        except urllib.error.URLError as e:
            print(f"Alert24 connection error: {e.reason}")
            raise

    return {"statusCode": 200}
A few things worth noting about this handler:
- ALERT24_INTEGRATION_KEY is the key tied to a specific Alert24 service. It tells Alert24 which on-call schedule and escalation policy to apply when creating the incident. You configure it once in Alert24 under your service settings.
- The function skips every state except ALARM. You probably don't want a page when an alarm returns to OK; you want the incident acknowledged or auto-resolved in Alert24, which the platform handles separately. To auto-resolve on OK, handle new_state == "OK" before the skip check and post a payload with event_type set to resolve.
- Error handling re-raises exceptions intentionally. SNS invokes Lambda asynchronously, so a failed invocation is retried (twice by default), and you'd rather have a duplicate page than a missed one.
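The auto-resolve variant mentioned above could look roughly like this. It is a sketch, not the handler's actual code: `STATE_TO_EVENT_TYPE` and `build_payload` are hypothetical names, and it assumes a resolve event takes the same fields as a trigger minus severity. How Alert24 matches a resolve event to an open incident is not covered here, so confirm that before relying on it.

```python
import json

# Map CloudWatch alarm states to incident event types. "trigger" and
# "resolve" are the values used in this article; the mapping is a sketch.
STATE_TO_EVENT_TYPE = {
    "ALARM": "trigger",
    "OK": "resolve",
}


def build_payload(sns_message, integration_key):
    """Return the request body as bytes, or None for states we ignore."""
    event_type = STATE_TO_EVENT_TYPE.get(sns_message.get("NewStateValue"))
    if event_type is None:
        return None  # e.g. INSUFFICIENT_DATA: neither page nor resolve

    alarm_name = sns_message.get("AlarmName", "Unknown Alarm")
    body = {
        "integration_key": integration_key,
        "event_type": event_type,
        "description": f"CloudWatch: {alarm_name}",
        "details": sns_message.get("NewStateReason", ""),
    }
    if event_type == "trigger":
        # Assumption: severity only matters when opening an incident.
        body["severity"] = "critical"
    return json.dumps(body).encode("utf-8")
```

In the handler, a None return would take the place of the existing skip-and-continue branch, and any other return value would be posted exactly as the trigger payload is today.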
Deploy the Lambda Function
Package and deploy:
# Create a deployment package
zip function.zip lambda_function.py
# Create the function
aws lambda create-function \
--function-name cloudwatch-to-alert24 \
--runtime python3.12 \
--role arn:aws:iam::YOUR_ACCOUNT_ID:role/lambda-basic-execution \
--handler lambda_function.lambda_handler \
--zip-file fileb://function.zip \
--environment "Variables={ALERT24_API_KEY=your_key,ALERT24_INTEGRATION_KEY=your_integration_key}" \
--timeout 30 \
--region us-east-1
Then subscribe the Lambda function to your SNS topic:
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:infra-alerts \
--protocol lambda \
--notification-endpoint arn:aws:lambda:us-east-1:YOUR_ACCOUNT_ID:function:cloudwatch-to-alert24
Finally, grant SNS permission to invoke the Lambda:
aws lambda add-permission \
--function-name cloudwatch-to-alert24 \
--statement-id sns-invoke \
--action lambda:InvokeFunction \
--principal sns.amazonaws.com \
--source-arn arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:infra-alerts
Wire a CloudWatch Alarm to the SNS Topic
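If you have an existing alarm you want to route through this pipeline, update its actions. Note that put-metric-alarm replaces the alarm's entire definition, so repeat the alarm's existing metric, threshold, and evaluation settings alongside the action flags; the command below shows only the flags that change: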
aws cloudwatch put-metric-alarm \
--alarm-name "high-error-rate" \
--alarm-actions arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:infra-alerts \
--ok-actions arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:infra-alerts
For new alarms, include --alarm-actions at creation time. You can attach the same SNS topic to as many alarms as you want. All of them flow through the same Lambda and create incidents in Alert24 with the alarm name in the description and the state-change reason in the details.
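Because put-metric-alarm overwrites an alarm's whole definition, one way to attach the topic to an existing alarm without retyping its settings is to read the alarm back and re-put it. The sketch below assumes boto3 and valid credentials and has not been run against a live account; `with_topic` and `route_alarm` are hypothetical helper names.

```python
# Keys from a describe_alarms entry that put_metric_alarm accepts back.
# Read-only fields such as AlarmArn and StateValue must be left out.
WRITABLE_KEYS = {
    "AlarmName", "AlarmDescription", "ActionsEnabled", "OKActions",
    "AlarmActions", "InsufficientDataActions", "MetricName", "Namespace",
    "Statistic", "ExtendedStatistic", "Dimensions", "Period", "Unit",
    "EvaluationPeriods", "DatapointsToAlarm", "Threshold",
    "ComparisonOperator", "TreatMissingData",
    "EvaluateLowSampleCountPercentile", "Metrics", "ThresholdMetricId",
}


def with_topic(alarm, topic_arn):
    """Build put_metric_alarm kwargs that keep the alarm and add the topic."""
    kwargs = {k: v for k, v in alarm.items() if k in WRITABLE_KEYS}
    for key in ("AlarmActions", "OKActions"):
        actions = list(kwargs.get(key, []))
        if topic_arn not in actions:
            actions.append(topic_arn)
        kwargs[key] = actions
    return kwargs


def route_alarm(alarm_name, topic_arn):
    """Re-put one existing metric alarm with the topic attached."""
    import boto3  # imported here so with_topic stays usable without AWS access

    cloudwatch = boto3.client("cloudwatch")
    found = cloudwatch.describe_alarms(AlarmNames=[alarm_name])["MetricAlarms"]
    if not found:
        raise ValueError(f"No metric alarm named {alarm_name!r}")
    cloudwatch.put_metric_alarm(**with_topic(found[0], topic_arn))
```

Existing actions are preserved and the topic is appended only if it is missing, so running it twice leaves the alarm unchanged.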
Set Up the On-Call Side in Alert24
Before this produces an actual page, you need three things configured in Alert24:
- A service that represents your infrastructure or application. This is what the integration key maps to.
- An on-call schedule attached to that service. Put your engineers in a rotation (daily, weekly, follow-the-sun, whatever your team uses). Alert24 will evaluate the schedule at the moment the incident arrives and route the page to whoever is currently on duty.
- An escalation policy that defines what happens if the first responder doesn't acknowledge within N minutes. You can escalate to a secondary, notify a Slack channel, or loop in a manager. This is the part that raw SNS delivery simply cannot do.
Once those are in place, an alarm firing in CloudWatch produces a page within seconds. If the on-call engineer acknowledges, the incident moves to acknowledged and the escalation timer stops. If they don't, Alert24 escalates per your policy.
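To make the escalation behavior concrete, here is a toy model of a three-step policy. It is purely illustrative and is not how Alert24 implements escalation; the step timings, targets, and names (`EscalationStep`, `notified_so_far`) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes since the incident was created
    target: str         # who or what gets notified at this step


# Hypothetical policy: page the primary immediately, the secondary after
# 10 minutes, then a manager after 25 minutes.
POLICY = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(10, "secondary on-call"),
    EscalationStep(25, "engineering manager"),
]


def notified_so_far(policy, minutes_elapsed, acknowledged_at=None):
    """Return the targets notified by minutes_elapsed.

    Steps scheduled at or after the acknowledgment time never fire,
    which models the escalation timer stopping on acknowledge.
    """
    targets = []
    for step in sorted(policy, key=lambda s: s.after_minutes):
        if step.after_minutes > minutes_elapsed:
            break
        if acknowledged_at is not None and step.after_minutes >= acknowledged_at:
            break
        targets.append(step.target)
    return targets


print(notified_so_far(POLICY, 12))                     # unacknowledged incident
print(notified_so_far(POLICY, 30, acknowledged_at=4))  # acknowledged at minute 4
```

The first call reaches the secondary because nobody acknowledged within ten minutes; the second stops at the primary because the acknowledgment arrived before the next step was due.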
What You've Built
Your CloudWatch alarms still fire exactly as before. Nothing about your existing observability setup changes. You've added a thin Lambda adapter that translates SNS messages into structured incidents, and Alert24 takes it from there — handling the scheduling, paging, escalation, and incident history.
The operational cost is low: one Lambda function, one SNS topic, a few IAM permissions. The benefit is that every alarm now follows a defined process rather than landing in an inbox where someone may or may not see it.
Next Steps
- Deploy the Lambda and run a test by manually setting a CloudWatch alarm to the ALARM state with aws cloudwatch set-alarm-state
- Review the Alert24 incident it creates and confirm the description and severity look right
- Adjust the severity field in the Lambda payload: you can map different alarms to different severity levels by inspecting the alarm name or dimensions in the SNS payload
- Add the --ok-actions SNS notification and implement the auto-resolve branch in the Lambda if you want incidents to close automatically when alarms recover
- Consider adding Dead Letter Queue (DLQ) configuration to your Lambda so failed invocations don't silently drop pages
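The severity mapping suggested in the list above could be sketched as a small pattern table keyed on alarm name. The naming convention (prod-*, staging-*) and the "warning" severity value are assumptions for illustration; use whatever alarm names your team has and whatever severity values your Alert24 account accepts.

```python
import fnmatch

# Hypothetical naming convention: the first matching pattern wins.
# Severity names other than "critical" are assumptions.
SEVERITY_RULES = [
    ("prod-*", "critical"),
    ("staging-*", "warning"),
]
DEFAULT_SEVERITY = "critical"  # fail loud for alarms that match no rule


def severity_for(alarm_name):
    """Pick a severity from the alarm name using shell-style patterns."""
    for pattern, severity in SEVERITY_RULES:
        if fnmatch.fnmatchcase(alarm_name, pattern):
            return severity
    return DEFAULT_SEVERITY
```

In the handler, the hard-coded "severity": "critical" entry would become "severity": severity_for(alarm_name). Defaulting unmatched alarms to critical is a deliberate choice here: a misnamed alarm pages loudly instead of being quietly downgraded.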