--- a/Incident-2024-04-11-API-Outage.md
+++ b/Incident-2024-04-11-API-Outage.md
@@ -1,23 +1,54 @@
 ---
-title:
+title: API Outage Retro — 2024-04-11
+date: 2024-04-11
 tags:
+  - incident
+  - postmortem
+  - api
 aliases:
+  - April 2024 API Outage Retro
 ---
 
 # API Outage Retro — 2024-04-11
 
 ## What Happened
 
-TODO: summarize the outage
+A config deployment at 06:12 UTC removed the `max_pool_size` parameter from the `api-config` ConfigMap, causing database connection pool exhaustion. Postgres hit its 500-connection limit within 19 minutes, resulting in a ==47-minute full outage== of all public API endpoints. Approximately 2,300 requests failed across 180 customers, with an estimated $4,200 in lost transactions.
+
+[[On-Call Rotation|On-call engineer]] [[Alex Rivera]] acknowledged the PagerDuty alert at 06:25 and identified the root cause by 06:38. The config was reverted and service restored by 06:59 UTC.
 
 ## Timeline
 
-TODO: embed timeline
+![[Incident Timeline#Key Events]]
 
 ## Root Cause
 
-TODO: fill in
+The deployment pipeline applied a ConfigMap update that omitted the `max_pool_size` parameter. Without an explicit pool size limit, each API pod opened connections without constraint, exhausting the Postgres `max_connections=500` ceiling.
+
+> [!question] Why was the missing parameter not caught?
+> - No validation exists in the deployment pipeline to enforce required config keys
+> - The config change was not reviewed by the API team before merge
+> - Staging environment did not surface the issue due to lower traffic volume
+
+The [[API Runbook#Database Connection Pool Exhaustion]] procedure was followed for resolution: idle connections were killed, pods were restarted via `kubectl rollout restart`, and pool recovery was verified in Grafana.
+
+## Impact
+
+- **Duration:** 47 minutes (06:12–06:59 UTC)
+- **Affected services:** All public API endpoints
+- **Customer impact:** ~2,300 failed requests, 180 unique customers
+- **Revenue impact:** ~$4,200 in failed transactions
 
 ## Action Items
 
-- [ ] TODO
+- [ ] Add ConfigMap schema validation to the CI pipeline to reject deploys missing required keys (`max_pool_size`, `rate_limit`, etc.)
+- [ ] Require API team approval for changes to `api-config` ConfigMap
+- [ ] Add a connection pool utilization alert at 80% capacity (before hitting hard limit)
+- [ ] Update [[API Runbook]] with pre-deploy config checklist
+- [ ] Run a staging load test with production-level traffic for config changes touching connection parameters
+
+## Related
+
+- [[Incident Timeline]]
+- [[API Runbook]]
+- [[On-Call Rotation]]
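
The first action item in the diff above (ConfigMap schema validation in CI) could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the team's actual pipeline: the names `REQUIRED_KEYS`, `missing_required_keys`, and `check` are hypothetical, the sample manifests and their values are invented for the example, and only `max_pool_size` and `rate_limit` are named in the retro as required keys. It assumes the manifest has already been parsed into a dict, for example from the output of `kubectl get configmap api-config -o json`.

```python
"""Sketch of a CI check that rejects a ConfigMap missing required keys."""

# Only these two keys are named in the retro; the full list is an assumption.
REQUIRED_KEYS = frozenset({"max_pool_size", "rate_limit"})


def missing_required_keys(manifest, required=REQUIRED_KEYS):
    """Return the sorted required keys absent from a parsed ConfigMap manifest."""
    # A ConfigMap with no `data` field is treated as having no keys at all.
    data = manifest.get("data") or {}
    return sorted(key for key in required if key not in data)


def check(manifest):
    """Return a process exit code: 0 if every required key exists, 1 otherwise."""
    missing = missing_required_keys(manifest)
    if missing:
        print("api-config is missing required keys: " + ", ".join(missing))
        return 1
    return 0


# Hypothetical manifests; the values "20" and "100" are placeholders.
GOOD = {
    "kind": "ConfigMap",
    "metadata": {"name": "api-config"},
    "data": {"max_pool_size": "20", "rate_limit": "100"},
}
BAD = {
    "kind": "ConfigMap",
    "metadata": {"name": "api-config"},
    "data": {"rate_limit": "100"},  # max_pool_size omitted, as in the incident
}
```

A CI step could pass the return value of `check(...)` to `sys.exit`, so a manifest shaped like `BAD` fails the job before the deploy is applied.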