Why Every Incident Needs a Postmortem
An incident postmortem template gives your team a repeatable structure for analyzing failures. Without one, postmortems become inconsistent, blame-heavy, or skipped entirely.
Google, Netflix, and Etsy all run blameless postmortems after significant incidents. They do this not because they have more outages than other companies, but because they learn faster from each one. That learning is what separates teams that repeat the same failures from teams that build increasingly resilient systems.
When to Write a Postmortem
Not every blip needs a full postmortem. Run one when:
- Customer-facing downtime exceeded 5 minutes
- Data loss occurred (any amount)
- The on-call engineer had to be paged outside business hours
- A security incident was detected
- Revenue was directly impacted
- The same type of failure happened twice in 30 days
Set a clear threshold so the decision isn't subjective. "We'll write a postmortem for any incident that affects more than 1% of users for more than 5 minutes" removes ambiguity.
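That rule is simple enough to encode directly in your incident tooling so nobody has to argue about it in the moment. A minimal sketch, where the `Incident` fields and the helper name are hypothetical rather than from any particular incident tool:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    pct_users_affected: float  # 1.0 means 1% of users
    duration_minutes: float
    data_loss: bool = False
    security_related: bool = False

def needs_postmortem(incident: Incident) -> bool:
    # Data loss and security incidents always qualify, regardless of size.
    if incident.data_loss or incident.security_related:
        return True
    # Otherwise apply the written threshold: >1% of users for >5 minutes.
    return incident.pct_users_affected > 1.0 and incident.duration_minutes > 5

# The March 10 outage: all authenticated users, 23 minutes.
assert needs_postmortem(Incident(pct_users_affected=100.0, duration_minutes=23))
```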
The Complete Postmortem Template
Section 1: Incident Summary
Keep this to 3-4 sentences. Anyone in the company should be able to read this and understand what happened.
Example: "On March 10, 2026 at 14:23 UTC, our primary database experienced connection pool exhaustion. The API returned 503 errors for 23 minutes, affecting all authenticated users. Checkout was completely unavailable. The issue was resolved at 14:46 UTC by scaling the connection pool and restarting the application servers."
Section 2: Impact
Quantify the damage. Vague impact statements produce vague action items. (The arithmetic behind the SLA row is worked through after the table.)
| Metric | Value |
|---|---|
| Duration | 23 minutes |
| Users affected | ~4,200 (all authenticated users) |
| Failed API requests | 12,847 |
| Revenue impact | ~$890 (estimated lost transactions) |
| Support tickets filed | 67 |
| SLA impact | 99.95% monthly target (21.6 min error budget); this incident consumed 23 min, breaching it |
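The SLA row follows from simple error-budget arithmetic. A worked version, assuming a 30-day (43,200-minute) month:

```python
SLO = 0.9995                    # 99.95% monthly availability target
month_minutes = 30 * 24 * 60    # 43,200 minutes in a 30-day month

budget_minutes = month_minutes * (1 - SLO)   # downtime allowed per month
best_case = 1 - 23 / month_minutes           # if March has no further downtime

print(f"budget: {budget_minutes:.1f} min, best-case March uptime: {best_case:.4%}")
# budget: 21.6 min, best-case March uptime: 99.9468% -> the 99.95% target is breached
```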
Section 3: Timeline
Reconstruct events in chronological order. Include timestamps, who did what, and what signals were observed.
- 14:18 - Database connection count begins climbing (visible in Grafana but no alert configured)
- 14:23 - First 503 errors hit the API. PagerDuty pages the on-call engineer.
- 14:25 - On-call acknowledges. Begins investigating.
- 14:28 - Status page updated to "Investigating." Customer-facing notification sent.
- 14:32 - Root cause identified: connection pool max (50) reached due to a long-running migration query holding connections.
- 14:35 - Migration query killed. Connection count begins dropping.
- 14:38 - Connection pool max increased to 100 as a safety measure (see the configuration sketch after this timeline).
- 14:41 - Application servers restarted to clear stale connections.
- 14:46 - All services confirmed healthy. Status page updated to "Resolved."
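The 14:38 mitigation is a one-line configuration change in most pool implementations. A sketch assuming SQLAlchemy (the post doesn't name the stack, and the DSN is made up):

```python
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://app@db-primary/prod",  # hypothetical DSN
    pool_size=100,       # post-incident ceiling (was 50, set at launch)
    max_overflow=0,      # never open unbounded extra connections
    pool_timeout=5,      # seconds to wait for a free connection before raising
    pool_pre_ping=True,  # discard stale connections left over from restarts
)
```

A short `pool_timeout` turns silent queuing into a visible error that alerting can key on.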
Section 4: Root Cause Analysis
Use the Five Whys technique. Start with the failure and ask "why" until you reach a systemic cause.
Why did the API return 503 errors? Because the database connection pool was exhausted.
Why was the connection pool exhausted? Because a migration query held 38 connections for over 10 minutes.
Why did the migration hold that many connections? Because it ran as a batch operation against 2M rows without connection limits.
Why was it allowed to run without connection limits? Because our migration framework doesn't enforce connection budgets, and the runbook doesn't mention this risk.
Why wasn't there an alert on connection pool usage? Because our monitoring dashboard shows the metric but we never set an alert threshold.
The root cause is not "the migration query was bad." The root cause is the absence of safeguards: no connection budgets for migrations, no alerting on pool saturation, and no runbook covering this scenario.
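The missing safeguard called out in the last "why" is cheap to add. A minimal in-process sketch, assuming the SQLAlchemy engine from earlier and a hypothetical `page()` callback into your paging system; in production you would export this metric to Prometheus or Grafana and alert there, but the threshold logic is the same:

```python
import time

ALERT_THRESHOLD = 0.80  # per the P0 action item: page at 80% of capacity

def watch_pool(engine, page, interval_seconds=15):
    """Poll the SQLAlchemy connection pool and page when it nears saturation."""
    while True:
        pool = engine.pool
        in_use = pool.checkedout()   # connections currently lent to the app
        capacity = pool.size()       # configured pool_size (we run max_overflow=0)
        if in_use / capacity >= ALERT_THRESHOLD:
            page(f"DB connection pool at {in_use}/{capacity} ({in_use / capacity:.0%})")
        time.sleep(interval_seconds)
```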
Section 5: Contributing Factors
List everything that made the incident worse or slower to resolve.
- No alert on database connection count (monitoring existed but no threshold was set)
- On-call engineer was unfamiliar with the migration framework
- Status page update happened 5 minutes after the first alert, not immediately
- The connection pool max of 50 was set during initial deployment and never revisited as traffic grew
Section 6: What Went Well
This matters. Acknowledge what worked so the team doesn't only associate postmortems with criticism.
- PagerDuty alert fired within 2 minutes of the first errors
- On-call responded and began investigating within 2 minutes
- Root cause was identified in under 10 minutes
- Status page was updated and customers were notified proactively
- No data loss occurred
Section 7: Action Items
Every action item needs an owner and a due date. Action items without owners don't get done.
| Action | Owner | Due Date | Priority |
|---|---|---|---|
| Add alert for DB connection pool > 80% capacity | Sarah | March 14 | P0 |
| Add connection budget enforcement to migration framework | James | March 21 | P0 |
| Update migration runbook with connection management section | Sarah | March 17 | P1 |
| Increase default connection pool to 100 (permanent) | James | March 14 | P1 |
| Add postmortem link to incident timeline on status page | Lisa | March 15 | P2 |
Section 8: Lessons Learned
Capture the broader takeaways, not just the specific fixes.
- Monitoring without alerting provides false confidence. Every metric on a dashboard should have a corresponding alert threshold.
- Migration operations need the same operational safeguards as production deployments: connection limits, rollback plans, and scheduling during low-traffic windows (a batching sketch follows this list)
- Our connection pool was sized for launch-day traffic, not current traffic. Infrastructure defaults should be reviewed quarterly.
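The second lesson is concrete enough to sketch. One common shape for a connection-safe backfill, again assuming SQLAlchemy and Postgres; the table and column names are made up, and the point is the structure: bounded batches, one connection, a commit after every batch, so no migration can pin dozens of connections for minutes at a time.

```python
from sqlalchemy import text

BATCH_SIZE = 10_000  # small enough that each transaction finishes in seconds

def backfill_in_batches(engine, batch_size=BATCH_SIZE):
    """Run a large backfill as many short transactions on a single connection."""
    while True:
        with engine.begin() as conn:  # one connection, one short transaction
            rows = conn.execute(text(
                "UPDATE accounts SET migrated = true "
                "WHERE id IN (SELECT id FROM accounts "
                "WHERE migrated = false LIMIT :n)"
            ), {"n": batch_size}).rowcount
        if rows < batch_size:
            break  # a short batch means we have reached the end
```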
Running the Postmortem Meeting
Hold the meeting within 48 hours of resolution while details are fresh.
Facilitator role: One person runs the meeting. Their job is to keep the conversation blameless and productive. If someone says "John should have caught this," the facilitator redirects: "What process or tool could have caught this automatically?"
Duration: 30-60 minutes. If it runs longer, the incident was complex enough to warrant a follow-up session.
Attendees: Everyone involved in the incident response, plus the engineering lead and someone from customer support.
Output: A published document that the entire team can read. Use a tool like alert24.net, Statuspage, or Instatus to link postmortems directly to resolved incidents on your status page.
Share Publicly When Appropriate
Some companies publish postmortems publicly. Cloudflare, GitLab, and Incident.io all do this regularly. Public postmortems demonstrate transparency and operational maturity.
You don't need to share every internal detail. A public version covers: what happened, how long it lasted, what you're doing to prevent it, and an apology. Keep the Five Whys and internal action items in the private version.
