The Gap Between CloudWatch Alarms and Real Incident Management
CloudWatch alarms are good at one thing: telling you when a metric crosses a threshold. What they do not do is track whether anyone on your team saw the alert, acknowledged it, started investigating, or resolved the underlying issue. You get state history — ALARM, OK, INSUFFICIENT_DATA — but no lifecycle. No acknowledgment timestamps. No resolution notes. No way to know if the same alarm has fired twelve times this month without digging through logs.
If your team is routing CloudWatch notifications directly to email or Slack, you already know the failure mode. An alarm fires at 2am. Someone glances at it on their phone. They assume someone else is handling it. By morning, nobody has actually looked at the database, and you have an incident that lasted six hours with zero documentation.
The fix is not a different alerting strategy. It is adding a proper incident layer between CloudWatch and your team. This post walks through a Lambda function that translates CloudWatch alarm payloads into structured incidents in Alert24, complete with severity mapping, deduplication, and automatic resolution when the alarm clears.
How CloudWatch Alarm Notifications Work
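When a CloudWatch alarm changes state, it publishes a JSON payload to every SNS topic configured as an action for the new state. Alarm, OK, and insufficient-data actions are configured separately, so attach the topic to all three if you want incidents to resolve automatically. The payload contains everything you need: the alarm name, the new state, the previous state, the metric, and the reason for the state change.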
A trimmed example of the payload looks like this (real messages carry additional fields, such as the account ID and alarm ARN, and report the region by its display name):
{
  "AlarmName": "prod-api-error-rate-high",
  "AlarmDescription": "API error rate exceeded 5% for 3 consecutive minutes",
  "NewStateValue": "ALARM",
  "NewStateReason": "Threshold Crossed: 1 out of the last 1 datapoints was 7.3 >= 5.0.",
  "OldStateValue": "OK",
  "StateChangeTime": "2026-05-28T03:14:22.123+0000",
  "Region": "US East (N. Virginia)",
  "Trigger": {
    "MetricName": "5XXError",
    "Namespace": "AWS/ApiGateway",
    "Statistic": "Average",
    "Period": 180,
    "Threshold": 5.0
  }
}
Your SNS topic delivers this to subscribers — email, Lambda, HTTP endpoints. The goal is to wire up a Lambda function as a subscriber that takes this payload and creates or resolves an incident.
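One detail that is easy to miss: the alarm JSON is not the top-level Lambda event. SNS wraps it in a `Records` envelope, and the alarm arrives as a JSON-encoded string under `Sns.Message`. The sketch below shows that unwrapping in isolation. The helper name `extract_alarms` and the hand-built `sample_event` are illustrative, and the event is trimmed to the fields used here.

```python
import json


def extract_alarms(event: dict) -> list[dict]:
    """Return the decoded CloudWatch alarm payloads from an SNS-triggered event."""
    alarms = []
    for record in event.get("Records", []):
        # The alarm is a JSON string inside Sns.Message, so it needs its own
        # json.loads call after Lambda has already decoded the envelope.
        alarms.append(json.loads(record["Sns"]["Message"]))
    return alarms


# Hand-built event for local testing, trimmed to the fields used here.
sample_event = {
    "Records": [
        {
            "EventSource": "aws:sns",
            "Sns": {
                "Message": json.dumps({
                    "AlarmName": "prod-api-error-rate-high",
                    "NewStateValue": "ALARM",
                    "OldStateValue": "OK",
                    "NewStateReason": "Threshold Crossed: 1 out of the last 1 datapoints was 7.3 >= 5.0.",
                }),
            },
        }
    ]
}

alarms = extract_alarms(sample_event)
print(alarms[0]["AlarmName"], alarms[0]["NewStateValue"])
# prints: prod-api-error-rate-high ALARM
```

Building the event by hand like this makes it possible to exercise the parsing path locally before any SNS subscription exists.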
Mapping CloudWatch States to Incident Severity
A CloudWatch alarm is always in one of three states. Here is how each one maps to an incident action:
| CloudWatch State | Meaning | Incident Action |
|---|---|---|
| ALARM | Threshold crossed, metric is bad | Create or reopen incident — severity depends on alarm |
| INSUFFICIENT_DATA | Not enough data to evaluate | Create low-priority incident or skip, depending on context |
| OK | Metric returned to normal | Resolve the open incident |
For severity within ALARM state, you have two options. You can embed severity in the alarm name using a convention like prod-api-error-rate-high-critical, or you can maintain a lookup table in your Lambda configuration. The convention approach is simpler and works well for teams that already have a consistent alarm naming scheme.
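As a rough sketch of the second option, the lookup table can live in an environment variable and take precedence over the naming convention. Everything named here is an assumption for illustration: `SEVERITY_OVERRIDES` is a hypothetical variable holding a JSON object of exact alarm names to severities, and `load_overrides` and `severity_for` are made-up helper names.

```python
import json
import os

VALID_SEVERITIES = {"critical", "high", "medium", "low"}
DEFAULT_SEVERITY = "high"


def load_overrides(raw: str) -> dict[str, str]:
    """Parse a JSON object of alarm name -> severity, dropping unknown severities."""
    table = json.loads(raw) if raw else {}
    return {
        name: sev
        for name, sev in table.items()
        if isinstance(sev, str) and sev in VALID_SEVERITIES
    }


# SEVERITY_OVERRIDES is a hypothetical environment variable, for example:
#   {"prod-db-cpu": "critical", "staging-queue-depth": "low"}
OVERRIDES = load_overrides(os.environ.get("SEVERITY_OVERRIDES", ""))


def severity_for(alarm_name: str, overrides: dict[str, str] = OVERRIDES) -> str:
    """Check the lookup table first, then fall back to the name suffix."""
    if alarm_name in overrides:
        return overrides[alarm_name]
    suffix = alarm_name.lower().rsplit("-", 1)[-1]
    return suffix if suffix in VALID_SEVERITIES else DEFAULT_SEVERITY
```

With this shape, `severity_for("prod-db-cpu", {"prod-db-cpu": "critical"})` returns `"critical"` from the table, while an alarm that is absent from the table and has no severity suffix falls through to the default.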
The Lambda Adapter
The following function handles the full lifecycle. It uses the alarm name as a deduplication alias, which means repeated firings of the same alarm update the existing incident rather than creating duplicates.
import json
import os
import urllib.error
import urllib.request
from typing import Optional

ALERT24_API_KEY = os.environ["ALERT24_API_KEY"]
ALERT24_BASE_URL = "https://api.alert24.com/v1"

SEVERITY_MAP = {
    "critical": "critical",
    "high": "high",
    "medium": "medium",
    "low": "low",
}


def get_severity_from_alarm_name(alarm_name: str) -> str:
    # Read only the trailing token (e.g. "...-critical"). A substring search
    # would misread a name such as "slow-query-count" as severity "low".
    suffix = alarm_name.lower().rsplit("-", 1)[-1]
    return SEVERITY_MAP.get(suffix, "high")  # safe default


def call_alert24(method: str, path: str, body: Optional[dict] = None):
    url = f"{ALERT24_BASE_URL}{path}"
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {ALERT24_API_KEY}",
            "Content-Type": "application/json",
        },
    )
    try:
        # Time out well inside the Lambda timeout so a hung request fails fast.
        with urllib.request.urlopen(req, timeout=10) as resp:
            raw = resp.read()
            return json.loads(raw) if raw else None
    except urllib.error.HTTPError as exc:
        # Log the API response, then re-raise so the failed invocation is retried.
        detail = exc.read().decode(errors="replace")
        print(f"Alert24 {method} {path} failed: {exc.code} {detail}")
        raise


def handler(event, context):
    for record in event.get("Records", []):
        # SNS delivers the alarm payload as a JSON string in Sns.Message.
        message = json.loads(record["Sns"]["Message"])
        alarm_name = message["AlarmName"]
        new_state = message["NewStateValue"]
        reason = message["NewStateReason"]
        description = message.get("AlarmDescription", "")
        region = message.get("Region", "unknown")

        # Stable alias: every transition of this alarm maps to the same incident.
        alias = f"cloudwatch-{alarm_name}"

        if new_state == "OK":
            # Resolve any open incident with this alias
            call_alert24("POST", "/incidents/resolve-by-alias", {
                "alias": alias,
                "resolution_note": f"CloudWatch alarm returned to OK. {reason}",
            })
        elif new_state == "ALARM":
            severity = get_severity_from_alarm_name(alarm_name)
            call_alert24("POST", "/incidents", {
                "title": f"CloudWatch: {alarm_name}",
                "description": f"{description}\n\nReason: {reason}\nRegion: {region}",
                "severity": severity,
                "alias": alias,
                "source": "cloudwatch",
            })
        elif new_state == "INSUFFICIENT_DATA":
            call_alert24("POST", "/incidents", {
                "title": f"CloudWatch: {alarm_name} - Insufficient Data",
                "description": f"Alarm entered INSUFFICIENT_DATA state. {reason}",
                "severity": "medium",
                "alias": alias,
                "source": "cloudwatch",
            })
Deploy this function with the ALERT24_API_KEY environment variable set. Then add it as a subscriber to your SNS topic — either through the console or with a single AWS CLI command:
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:123456789012:cloudwatch-alarms \
--protocol lambda \
--notification-endpoint arn:aws:lambda:us-east-1:123456789012:function:cloudwatch-to-alert24
Do not forget to add permission for SNS to invoke the function:
aws lambda add-permission \
--function-name cloudwatch-to-alert24 \
--statement-id sns-invoke \
--action lambda:InvokeFunction \
--principal sns.amazonaws.com \
--source-arn arn:aws:sns:us-east-1:123456789012:cloudwatch-alarms
Why Alias-Based Deduplication Matters
The alias is the key to this entire approach. Without deduplication, an alarm that flaps (crossing the threshold, recovering, then crossing it again within minutes) leaves you with multiple incidents for the same problem. Your on-call engineer gets paged repeatedly. The incident list becomes noise.
By passing a stable alias derived from the alarm name, Alert24 treats subsequent ALARM firings as updates to the same incident rather than new ones. The incident stays open and accumulates timeline entries. When CloudWatch sends OK, the resolve-by-alias call closes exactly that incident, so the Lambda never has to store or look up an incident ID.
This is especially important for alarms that have brief recovery windows. A database connection pool alarm that oscillates during a traffic spike should produce one incident with a full timeline, not six separate incidents with no relationship to each other.
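To make the flapping case concrete, here is a toy in-memory model of alias-keyed deduplication. It is not Alert24's implementation. `IncidentStore` is a hypothetical stand-in, and it assumes that an ALARM arriving after a resolve reopens the existing incident rather than creating a new one, which is worth confirming against the real API.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    alias: str
    status: str = "open"
    timeline: list[str] = field(default_factory=list)


class IncidentStore:
    """Toy in-memory stand-in for an alias-keyed incident backend."""

    def __init__(self) -> None:
        self.by_alias: dict[str, Incident] = {}

    def trigger(self, alias: str, note: str) -> Incident:
        # Reuse the incident for this alias if one exists; otherwise create it.
        incident = self.by_alias.setdefault(alias, Incident(alias))
        if incident.status == "resolved":
            incident.status = "open"  # assumed reopen-on-retrigger behavior
            note = f"reopened: {note}"
        incident.timeline.append(note)
        return incident

    def resolve(self, alias: str, note: str) -> None:
        # Resolving needs only the alias, never an incident ID.
        incident = self.by_alias.get(alias)
        if incident and incident.status == "open":
            incident.status = "resolved"
            incident.timeline.append(f"resolved: {note}")


store = IncidentStore()
alias = "cloudwatch-prod-db-connection-pool-high"
for state in ["ALARM", "OK", "ALARM", "OK", "ALARM"]:
    if state == "ALARM":
        store.trigger(alias, "threshold crossed")
    else:
        store.resolve(alias, "alarm returned to OK")

incident = store.by_alias[alias]
print(len(store.by_alias), incident.status, len(incident.timeline))
# prints: 1 open 5
```

Five state transitions collapse into a single incident with five timeline entries, which is the outcome the alias is meant to guarantee.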
Handling Multiple Alarms and Environments
If you run multiple environments — production, staging, development — you will likely want separate SNS topics and separate Alert24 routing rules. A clean approach is to embed the environment in the alarm name convention: prod-api-latency-high-critical, staging-worker-queue-depth-medium. Your Lambda can then parse the environment prefix and attach it as metadata, or route to a different escalation policy.
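A minimal sketch of that parsing step, assuming alarm names follow an `<environment>-<subject>-<severity>` shape. The helper `parse_alarm_name`, the `dev` prefix, and the fallback values are illustrative choices rather than anything Alert24 requires.

```python
KNOWN_ENVIRONMENTS = {"prod", "staging", "dev"}  # "dev" is an assumed prefix
KNOWN_SEVERITIES = {"critical", "high", "medium", "low"}


def parse_alarm_name(alarm_name: str) -> dict:
    """Split '<env>-<subject>-<severity>' into parts, tolerating missing pieces."""
    parts = alarm_name.lower().split("-")
    has_env = parts[0] in KNOWN_ENVIRONMENTS
    has_severity = parts[-1] in KNOWN_SEVERITIES

    # Strip the prefix and suffix only when they were actually recognized.
    start = 1 if has_env else 0
    end = len(parts) - 1 if has_severity else len(parts)

    return {
        "environment": parts[0] if has_env else "unknown",
        "severity": parts[-1] if has_severity else "high",
        "subject": "-".join(parts[start:end]),
    }


print(parse_alarm_name("staging-worker-queue-depth-medium"))
# prints: {'environment': 'staging', 'severity': 'medium', 'subject': 'worker-queue-depth'}
```

The returned environment can be attached to the incident as metadata or used to pick an escalation policy. Note that a name ending in a descriptive word such as `-high` is read as a severity under this convention, so the convention works best when the severity suffix is always present.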
For larger teams, consider maintaining a JSON configuration file in S3 that maps alarm name patterns to severity overrides and team assignments. Your Lambda reads this at cold start and caches it. This lets non-engineers adjust routing rules without a code deploy, though a cached copy only picks up edits on the next cold start unless you also refresh it on a timer.
What You Get on the Alert24 Side
Once the Lambda is in place, every CloudWatch alarm transition is recorded against a structured incident with a full lifecycle. Your on-call engineer gets paged through whatever channel your escalation policy specifies: phone call, SMS, or push notification. They can acknowledge the incident from their phone, add notes, and loop in teammates. When the alarm clears, the incident closes automatically and the resolution is logged with the CloudWatch reason string.
Over time, you build a searchable history of every production incident, how long they lasted, who responded, and how they were resolved. That history is what CloudWatch alarm state history was never designed to provide.
Next Steps
If you already have CloudWatch alarms in place, you can have this adapter running in under an hour. Start with one SNS topic covering your highest-priority alarms, verify that incidents are created and resolved correctly, then expand from there.
If you do not yet have Alert24 set up, start at alert24.com to create an account and generate your API key. The API is the same regardless of which plan you are on, so you can build and test the integration on a free account before connecting your production escalation policies.
The CloudWatch state machine already knows when things are broken. This adapter makes sure your team does too.