The Problem with CloudWatch Notifications
You set up a CloudWatch alarm, wired it to an SNS topic, and subscribed your on-call engineer's email or phone number. The alarm fires at 2am. The engineer is in a dead zone, their phone is on silent, or they're just burned out and missing alerts. Nobody acknowledges the incident. CloudWatch sent its notification once, when the alarm changed state, to that one person. As far as CloudWatch is concerned, the job is done.
CloudWatch has no concept of "if this person doesn't acknowledge in 10 minutes, try someone else." It knows how to send notifications. It does not know how to escalate them. That gap is where production incidents quietly turn into outages.
This post shows you how to close that gap: use CloudWatch to detect the problem, route it through SNS and a small Lambda function to Alert24, and let Alert24's escalation policies handle the on-call logic — engineer, then team lead, then manager — with the chain resetting the moment someone actually acknowledges.
How CloudWatch Alarm Routing Actually Works
When a CloudWatch alarm transitions to ALARM state, it publishes a message to an SNS topic. From there, you have a few options:
- Email or SMS directly to a person (no acknowledgment tracking, no escalation)
- A webhook to a Slack channel (same problem — no ack, no escalation)
- A Lambda function (this is where you can do something useful)
The Lambda approach is the bridge. Your Lambda receives the CloudWatch alarm payload, transforms it into an Alert24 incident via the Alert24 API, and from that point on, Alert24 owns the notification and escalation lifecycle.
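To make the payload concrete, here is a trimmed sketch of the event a Lambda receives from SNS and how the alarm is unpacked from it. The field names are the ones the handler later in this post reads; the values are invented for illustration, and extract_alarms is a hypothetical helper name rather than part of any SDK.

```python
import json

# Trimmed example of an SNS-to-Lambda event. Values are made up.
SAMPLE_EVENT = {
    "Records": [
        {
            "EventSource": "aws:sns",
            "Sns": {
                # The alarm itself arrives as a JSON *string* in "Message".
                "Message": json.dumps({
                    "AlarmName": "high-error-rate",
                    "AlarmDescription": "Example alarm",
                    "NewStateValue": "ALARM",
                    "OldStateValue": "OK",
                    "NewStateReason": "Threshold Crossed: example reason",
                    "Region": "US East (N. Virginia)",
                }),
            },
        }
    ]
}


def extract_alarms(event):
    """Return the decoded CloudWatch alarm dict from each SNS record."""
    return [json.loads(r["Sns"]["Message"]) for r in event.get("Records", [])]


for alarm in extract_alarms(SAMPLE_EVENT):
    print(alarm["AlarmName"], alarm["NewStateValue"])
```

The detail that tends to trip people up is the double encoding: the SNS envelope is already a dict by the time Lambda hands it over, but Sns.Message still needs its own json.loads.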
Setting Up the Pipeline
Step 1: Create an SNS Topic for CloudWatch
If you don't have one already, create an SNS topic for your CloudWatch alarms:
aws sns create-topic --name cloudwatch-alerts
Note the ARN — you'll need it for both the CloudWatch alarm and the Lambda subscription.
Step 2: Write the Lambda Function
This Lambda receives the SNS event, parses the CloudWatch alarm payload, and opens an incident in Alert24.
import json
import os
import urllib.request

# Credentials come from Lambda environment variables, never from source.
ALERT24_API_KEY = os.environ["ALERT24_API_KEY"]
ALERT24_INTEGRATION_KEY = os.environ["ALERT24_INTEGRATION_KEY"]
ALERT24_API_URL = "https://api.alert24.co/v1/incidents"


def lambda_handler(event, context):
    for record in event["Records"]:
        # SNS delivers the CloudWatch alarm as a JSON string inside the envelope.
        message = json.loads(record["Sns"]["Message"])
        alarm_name = message.get("AlarmName", "Unknown Alarm")
        # AlarmDescription can be null when the alarm has no description.
        alarm_description = message.get("AlarmDescription") or ""
        new_state = message.get("NewStateValue", "ALARM")
        reason = message.get("NewStateReason", "")
        region = message.get("Region", "")

        if new_state != "ALARM":
            # Only open incidents on ALARM transitions
            continue

        payload = json.dumps({
            "integration_key": ALERT24_INTEGRATION_KEY,
            "summary": f"CloudWatch: {alarm_name}",
            "details": f"{alarm_description}\n\nReason: {reason}\nRegion: {region}",
            "source": "cloudwatch",
            "severity": "critical"
        }).encode("utf-8")

        req = urllib.request.Request(
            ALERT24_API_URL,
            data=payload,
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {ALERT24_API_KEY}"
            },
            method="POST"
        )

        # urlopen raises on 4xx/5xx responses, so a rejected request fails
        # the invocation instead of passing silently.
        with urllib.request.urlopen(req) as resp:
            print(f"Alert24 response: {resp.status} for alarm {alarm_name}")
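Set the Lambda environment variables ALERT24_API_KEY and ALERT24_INTEGRATION_KEY rather than hardcoding credentials. Subscribe this Lambda to the SNS topic, and point your CloudWatch alarms at that topic. The console grants SNS permission to invoke the function when you add the trigger; if you create the subscription with the CLI or infrastructure-as-code, add that permission yourself (aws lambda add-permission with principal sns.amazonaws.com).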
Step 3: Configure the CloudWatch Alarm
Attach the SNS topic to your alarm. Via the CLI (the load balancer dimension and the account ID in the ARNs are placeholders to replace with your own):
aws cloudwatch put-metric-alarm \
  --alarm-name "high-error-rate" \
  --alarm-description "50 or more 5xx responses from targets in 5 minutes" \
  --metric-name HTTPCode_Target_5XX_Count \
  --namespace AWS/ApplicationELB \
  --dimensions Name=LoadBalancer,Value=app/my-load-balancer/50dc6c495c0c9188 \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 50 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:cloudwatch-alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:cloudwatch-alerts
Note the --ok-actions — you can extend the Lambda to auto-resolve the Alert24 incident when the alarm returns to OK state, which keeps your incident history clean.
Configuring Escalation in Alert24
Once the incident lands in Alert24, the escalation policy takes over. Here is where you get the behavior CloudWatch cannot provide on its own.
An escalation policy defines a sequence of targets and timeouts. A typical three-tier setup looks like this:
| Level | Target | Notify after |
|---|---|---|
| 1 | On-call engineer | Immediately |
| 2 | Team lead | 10 minutes with no acknowledgment |
| 3 | Engineering manager | 20 minutes with no acknowledgment |
You configure this in the Alert24 dashboard under Escalation Policies. Each level specifies who to contact (a user, a team, or a schedule) and how long to wait before moving to the next level. You can mix notification channels per level — SMS and phone call for the engineer, email and SMS for the lead, phone call for the manager.
When someone acknowledges the incident, the escalation chain stops immediately. No redundant page to the manager at 2:20am because the engineer finally picked up at 2:14am. If the incident is resolved, it's marked resolved and the chain terminates. If nobody acknowledges through all levels, Alert24 can repeat the full cycle or notify a catch-all group.
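The timing rules are easy to reason about with a small model. The sketch below is not Alert24's implementation or API; it is a hypothetical simulation of the three-tier table above, where Level and paged_targets are illustrative names and a level is assumed to fire only if its delay elapses before the acknowledgment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Level:
    target: str
    notify_after_min: int  # minutes after incident creation


# Mirrors the three-tier table above.
POLICY = [
    Level("on-call engineer", 0),
    Level("team lead", 10),
    Level("engineering manager", 20),
]


def paged_targets(policy: List[Level], acked_at_min: Optional[float]) -> List[str]:
    """Return who gets paged, given when (if ever) the incident was acknowledged.

    acked_at_min is minutes after creation; None means nobody acknowledged.
    """
    if acked_at_min is None:
        return [level.target for level in policy]
    return [level.target for level in policy if level.notify_after_min < acked_at_min]


print(paged_targets(POLICY, acked_at_min=4))     # ['on-call engineer']
print(paged_targets(POLICY, acked_at_min=14))    # ['on-call engineer', 'team lead']
print(paged_targets(POLICY, acked_at_min=None))  # all three levels
```

An acknowledgment at minute 14 means the team lead was already paged at minute 10, but the manager's page at minute 20 never goes out.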
Why the Acknowledgment Reset Matters
The escalation reset on acknowledgment is not just a nice-to-have. Without it, you create two problems:
First, the manager gets woken up for incidents that the engineer already handled. That erodes trust in your alerting system and leads to alert fatigue — eventually people start ignoring pages because they assume someone else already dealt with it.
Second, without acknowledgment tracking, you have no audit trail. You can't answer "who was paged, who responded, and how long did it take?" after the fact. That data matters for incident reviews and for right-sizing your on-call rotation.
Alert24 tracks the full lifecycle: when the incident was created, which escalation level was active when it was acknowledged, who acknowledged it, and when it was resolved. That history is visible in the incident timeline and exportable for post-mortems.
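As an illustration of what that history makes possible, the sketch below computes time-to-acknowledge and time-to-resolve from a timeline. The export format shown here is an assumption made up for the example (one dict per event with an ISO 8601 timestamp), as is the helper name minutes_between; the real export may look different.

```python
from datetime import datetime

# HYPOTHETICAL export format: one dict per timeline event.
timeline = [
    {"event": "created", "at": "2024-05-01T02:00:00+00:00"},
    {"event": "escalated", "at": "2024-05-01T02:10:00+00:00", "level": 2},
    {"event": "acknowledged", "at": "2024-05-01T02:14:00+00:00", "level": 2},
    {"event": "resolved", "at": "2024-05-01T02:40:00+00:00"},
]


def minutes_between(timeline, start_event, end_event):
    """Minutes from the first start_event to the first end_event, or None."""
    first_seen = {}
    for entry in timeline:
        # Keep only the earliest occurrence of each event type.
        if entry["event"] not in first_seen:
            first_seen[entry["event"]] = datetime.fromisoformat(entry["at"])
    if start_event not in first_seen or end_event not in first_seen:
        return None
    return (first_seen[end_event] - first_seen[start_event]).total_seconds() / 60


print(minutes_between(timeline, "created", "acknowledged"))  # 14.0
print(minutes_between(timeline, "created", "resolved"))      # 40.0
```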
Handling the OK Transition
Extend the Lambda to resolve the incident automatically when CloudWatch returns to OK. Place this branch before the new_state != "ALARM" check in the handler; otherwise the continue there skips OK messages and this code never runs:
if new_state == "OK":
    # Look up the open incident by alarm name and resolve it
    # Alert24 supports incident resolution via PATCH /v1/incidents/{id}
    print(f"Alarm {alarm_name} returned to OK — resolve corresponding incident")
In practice, you'll want to store the Alert24 incident ID somewhere (DynamoDB works well) keyed by alarm name so the Lambda can look it up on the OK transition and send the resolve call. This keeps your Alert24 incident list clean and prevents responders from investigating incidents that resolved themselves before anyone had a chance to look.
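A minimal sketch of that lookup-and-resolve step follows, under several assumptions. A plain dict stands in for the DynamoDB table, the function names are hypothetical, and the PATCH body ({"status": "resolved"}) is a guess, since this post only names the endpoint. It also assumes you saved the incident ID from the create response, whose shape is not shown here.

```python
import json
import urllib.request

ALERT24_INCIDENTS_URL = "https://api.alert24.co/v1/incidents"


def build_resolve_request(incident_id, api_key):
    """Build the PATCH /v1/incidents/{id} request."""
    # ASSUMPTION: the body shape is a guess; check the Alert24 API reference.
    body = json.dumps({"status": "resolved"}).encode("utf-8")
    return urllib.request.Request(
        f"{ALERT24_INCIDENTS_URL}/{incident_id}",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="PATCH",
    )


def send_request(req):
    """Send the request; urlopen raises on 4xx/5xx responses."""
    with urllib.request.urlopen(req) as resp:
        return resp.status


def resolve_on_ok(alarm_name, store, api_key, send=send_request):
    """Resolve the open incident recorded for alarm_name, if any.

    store maps alarm name -> incident ID (a dict here, DynamoDB in practice).
    Returns the resolved incident ID, or None if nothing was open.
    """
    incident_id = store.get(alarm_name)
    if incident_id is None:
        return None
    send(build_resolve_request(incident_id, api_key))
    # Forget the mapping only after the resolve call succeeded.
    del store[alarm_name]
    return incident_id
```

Passing send as a parameter keeps the network call swappable, so the lookup logic can be exercised locally with a stub before the Lambda is deployed. Deleting the mapping only after a successful call means a failed resolve can be retried on the next invocation.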
What You've Built
The complete pipeline:
- CloudWatch detects a threshold breach and transitions to ALARM
- SNS publishes the alarm payload
- Lambda receives the payload and opens an Alert24 incident
- Alert24 notifies the on-call engineer immediately
- If no acknowledgment in 10 minutes, Alert24 escalates to the team lead
- If still no acknowledgment in 20 minutes, Alert24 escalates to the manager
- The moment anyone acknowledges, the escalation stops
- When CloudWatch returns to OK, the Lambda resolves the incident
CloudWatch handles detection. Alert24 handles everything that comes after.
Next Steps
Start with a single low-severity alarm to validate the pipeline end to end before wiring up your critical alarms. Confirm the Lambda is receiving the SNS payload correctly, the incident appears in Alert24, and the escalation policy triggers on schedule. Then expand to your full alarm inventory.
If you don't have an Alert24 account, you can start a free trial at alert24.co — the API integration key you need for the Lambda is generated during onboarding.