--- a/postmortem-draft.md
+++ b/postmortem-draft.md
@@ -1,17 +1,80 @@
-# Postmortem — April 12, 2026 Customer-Facing Service Degradation Event
+# Postmortem — April 12, 2026 Service Degradation
-Author: Alex Nguyen (Backend Platform Lead)
-Date: April 13, 2026
-Status: DRAFT — Pending review
+
+**Author:** Alex Nguyen (Backend Platform Lead)
+**Date:** April 13, 2026
+**Status:** DRAFT — Pending review
+**Severity:** High — customer-facing errors for 1h 57m
+**Duration:** 06:14 – 08:11 UTC (1 hour 57 minutes)
+
-So basically what happened was that at approximately 06:14 UTC on Saturday April 12 the primary PostgreSQL connection pool managed by PgBouncer on prod-db-01 became exhausted due to the fact that a configuration change that was deployed as part of the v3.8.2 release on Friday evening at 22:30 UTC inadvertently reduced the max_client_conn parameter from 200 to 20 which is obviously way too low for our production traffic patterns and this was not caught during the canary deployment phase because the canary instance was only receiving approximately 5% of total traffic which meant that even with the reduced connection limit the canary never actually hit the ceiling and so the issue was promoted to the full fleet without anyone realizing there was a problem.
+---
+
-The impact of this was that starting at 06:14 UTC when traffic began ramping up for the Saturday morning window (which as everyone knows is actually our second-highest traffic period after Tuesday evenings) users began experiencing connection timeout errors that manifested as HTTP 503 responses from the API gateway and this affected the /api/v2/orders endpoint and the /api/v2/inventory/sync endpoint and also the /api/v2/customers/search endpoint and the webhook delivery pipeline which relies on the same connection pool was also impacted meaning that partner integrations through Shopify and WooCommerce were not receiving order update webhooks during this window. The error rate went from our normal baseline of approximately 0.02% to a peak of 34.7% at 07:22 UTC which represents a very significant increase in customer-facing errors. Our monitoring system (Datadog) did actually fire an alert at 06:18 UTC but because the on-call engineer (myself) had a notification routing rule that was sending PgBouncer alerts to a low-priority Slack channel rather than paging, I didn't actually see the alert until I happened to check Slack at 07:45 UTC at which point things were already quite degraded.
+## Summary
+
-In terms of the timeline of events what happened was: the v3.8.2 config was merged by Jamie at 21:15 UTC on April 11, the canary was deployed at 22:30 UTC and appeared healthy, full fleet rollout happened at 23:45 UTC, everything looked fine through the overnight low-traffic period, then at 06:14 UTC connections started getting exhausted, at 06:18 UTC Datadog fired the PgBouncer saturation alert, at 07:22 UTC the error rate peaked at 34.7%, at 07:45 UTC I noticed the Slack alert and began investigating, at 07:52 UTC I identified the max_client_conn misconfiguration, at 07:58 UTC I pushed an emergency config rollback via our Ansible playbook, at 08:03 UTC PgBouncer restarted with the correct connection limit and connections began draining normally, at 08:11 UTC error rates returned to baseline, and at 08:15 UTC I confirmed full recovery and sent the all-clear to the #incidents channel.
+A PgBouncer configuration change shipped in the v3.8.2 release reduced the `max_client_conn` parameter on prod-db-01 from 200 to 20. When Saturday morning traffic ramped up, the connection pool was exhausted, causing HTTP 503 errors across multiple API endpoints and delaying webhook deliveries to partner integrations.
+
-The root cause when you really get down to it was that the PR for v3.8.2 (#4271) included a change to the pgbouncer.ini template that was generated by our new infra-as-code tooling (Pulumi) and the diff showed the max_client_conn change but it was buried in a 847-line diff that also included TLS cert rotation configs and log format changes and the reviewer (Dana) approved it after reviewing the TLS and logging changes but didn't notice the connection pool parameter change because it was a single line change on line 643 of the diff. We don't currently have any automated validation or policy checks on PgBouncer configuration parameters which means this kind of thing can slip through review. Also the canary deployment validation only checks HTTP health endpoints and error rates but doesn't validate infrastructure-level configuration parameters like database connection pool sizes.
+---
+
-For the resolution and the things we are going to do to make sure this doesn't happen again, there are several action items that we need to work on. First we need to add automated policy checks using Open Policy Agent (OPA) to validate PgBouncer configuration parameters against known-good baselines before any deployment can proceed (owner: Jamie, due: April 25). Second we need to update the canary deployment validation to include infrastructure-level checks specifically connection pool utilization as a promotion gate (owner: Alex, due: April 22). Third we need to fix the on-call notification routing so that all database-related alerts go through PagerDuty rather than Slack regardless of the specific subsystem (owner: Alex, due: April 14 — already done actually). Fourth we need to break up large infrastructure PRs so that connection pool changes, TLS changes, and logging changes are separate PRs that can be reviewed independently (owner: Dana, due: ongoing process change). Fifth we need to add a PgBouncer connection utilization dashboard panel to our existing Grafana service health board so that connection pool saturation is visible during routine monitoring (owner: Priya, due: April 20).
+## Customer Impact
+
-In terms of customer impact, during the 1 hour 57 minute window from 06:14 to 08:11 UTC approximately 12,400 API requests failed with 503 errors across 2,847 unique customer accounts and 340 webhook deliveries to partner integrations were delayed (all were successfully retried and delivered by 08:45 UTC after recovery). We have received 23 support tickets so far related to this incident and the support team has been responding with our standard incident acknowledgment template. No data was lost or corrupted during this event — the failures were all connection-level timeouts and the application-level retry logic in our SDKs handled most cases transparently for end users, though users on older SDK versions (pre-v2.1) without built-in retry would have seen raw errors.
+| Metric | Value |
+|---|---|
+| Duration | 06:14 – 08:11 UTC (1h 57m) |
+| Failed API requests | ~12,400 (503 errors) |
+| Affected customer accounts | 2,847 |
+| Delayed webhook deliveries | 340 (all retried and delivered by 08:45 UTC) |
+| Support tickets received | 23 |
+| Error rate (baseline to peak) | 0.02% to 34.7% (at 07:22 UTC) |
+| Data loss | None |
+
+**Affected endpoints:**
+- `/api/v2/orders`
+- `/api/v2/inventory/sync`
+- `/api/v2/customers/search`
+- Webhook delivery pipeline (Shopify and WooCommerce partner integrations)
+
+Users on SDK v2.1+ experienced transparent retries. Users on older SDK versions (pre-v2.1) without built-in retry saw raw errors. The support team is responding to tickets with the standard incident acknowledgment template.
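+
+For reference, the transparent recovery that v2.1+ SDK users saw comes from client-side retry with backoff. The sketch below is illustrative only; it is not the SDK's actual implementation, and the endpoint, timeout, and backoff values are placeholders:
+
+```python
+import time
+
+import requests  # assumed HTTP client for this sketch; the real SDKs may differ
+
+
+def get_with_retry(url, attempts=4, base_delay=0.5):
+    """Retry GETs on 503s and connection errors with exponential backoff (illustrative)."""
+    last_resp = None
+    for attempt in range(attempts):
+        try:
+            last_resp = requests.get(url, timeout=5)
+            if last_resp.status_code != 503:
+                return last_resp  # success, or an error that retrying will not fix
+        except requests.ConnectionError:
+            last_resp = None      # connection-level timeout; retry
+        if attempt < attempts - 1:
+            time.sleep(base_delay * (2 ** attempt))  # exponential backoff between attempts
+    if last_resp is None:
+        raise requests.ConnectionError(f"no response from {url} after {attempts} attempts")
+    return last_resp              # retries exhausted; caller sees the final 503
+
+# Callers on pre-v2.1 SDKs had no equivalent wrapper, so they saw the raw 503s directly.
+```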
+
+---
+
+## Timeline (all times UTC)
+
+| Time | Event |
+|---|---|
+| **Apr 11, 21:15** | Jamie merged v3.8.2 config (PR #4271) |
+| **Apr 11, 22:30** | Canary deployed — appeared healthy |
+| **Apr 11, 23:45** | Full fleet rollout completed |
+| *Apr 11–12, overnight* | *Low traffic — no issues surfaced* |
+| **Apr 12, 06:14** | Connection pool exhausted; 503 errors began |
+| **Apr 12, 06:18** | Datadog fired PgBouncer saturation alert |
+| **Apr 12, 07:22** | Error rate peaked at 34.7% |
+| **Apr 12, 07:45** | On-call engineer (Alex) saw alert in Slack, began investigating |
+| **Apr 12, 07:52** | Identified `max_client_conn` misconfiguration |
+| **Apr 12, 07:58** | Pushed emergency config rollback via Ansible |
+| **Apr 12, 08:03** | PgBouncer restarted with correct connection limit; connections draining |
+| **Apr 12, 08:11** | Error rates returned to baseline |
+| **Apr 12, 08:15** | Full recovery confirmed; all-clear sent to #incidents |
+
+**Detection gap:** The Datadog alert fired at 06:18 but was routed to a low-priority Slack channel instead of PagerDuty. The on-call engineer did not see it until 07:45 — a 1h 27m delay.
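+
+Action item 3 closes this routing gap. As a one-off audit for similar holes, a short script can flag database monitors whose notification message never pages anyone. This is a sketch that assumes the datadogpy client, a `team:database` monitor tag, and `DD_API_KEY`/`DD_APP_KEY` environment variables; the tag, handle prefix, and variable names are illustrative rather than our actual configuration:
+
+```python
+import os
+
+from datadog import initialize, api  # datadogpy
+
+initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])
+
+# Flag monitors owned by the database team whose message contains no paging handle,
+# i.e. alerts that can only ever land in Slack or email.
+for monitor in api.Monitor.get_all():
+    if "team:database" not in monitor.get("tags", []):
+        continue
+    if "@pagerduty-" not in monitor.get("message", ""):
+        print(f"Monitor {monitor['id']} ({monitor['name']}) does not page anyone")
+```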
+
+---
+
+## Root Cause
+
+PR #4271 (v3.8.2) included a change to the `pgbouncer.ini` template generated by our Pulumi infra-as-code tooling. The diff changed `max_client_conn` from 200 to 20 — but this single-line change was on line 643 of an 847-line diff that also included TLS cert rotation and log format changes. The reviewer (Dana) approved after reviewing the TLS and logging sections but did not catch the connection pool parameter change.
+
+**Contributing factors:**
+- **No automated config validation.** No policy checks exist for PgBouncer configuration parameters against known-good baselines (see the sketch after this list).
+- **Canary gap.** Canary validation checks HTTP health endpoints and error rates, but does not validate infrastructure-level parameters like connection pool sizes. The canary received only ~5% of traffic, so the reduced limit of 20 connections was never hit.
+- **Alert routing.** PgBouncer alerts were routed to a low-priority Slack channel instead of PagerDuty.
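+
+The missing guard here is small. Action item 1 implements it as an OPA policy written in Rego; purely to illustrate the check that policy needs to encode, the Python sketch below fails a deploy when the rendered `pgbouncer.ini` drops below a reviewed baseline. The file path and the baseline table are placeholders:
+
+```python
+import configparser
+import sys
+
+# Reviewed floor values; max_client_conn=200 is the value v3.8.2 should have kept.
+# Extend with other pool parameters as they are baselined.
+BASELINES = {"max_client_conn": 200}
+
+
+def check_pgbouncer_config(path):
+    cfg = configparser.ConfigParser(interpolation=None)
+    cfg.read(path)
+    problems = []
+    for param, floor in BASELINES.items():
+        value = cfg.getint("pgbouncer", param, fallback=0)
+        if value < floor:
+            problems.append(f"{param}={value} is below the reviewed baseline of {floor}")
+    return problems
+
+
+if __name__ == "__main__":
+    problems = check_pgbouncer_config(sys.argv[1])  # e.g. the Pulumi-rendered pgbouncer.ini
+    if problems:
+        print("\n".join(problems))
+        sys.exit(1)  # block the rollout before it reaches the fleet
+```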
+
+---
+
+## Action Items
+
+| # | Action | Owner | Due | Status |
+|---|---|---|---|---|
+| 1 | Add OPA policy checks to validate PgBouncer config parameters against known-good baselines before deployment | Jamie | Apr 25 | Pending |
+| 2 | Add connection pool utilization as a canary promotion gate (see sketch below) | Alex | Apr 22 | Pending |
+| 3 | Route all database-related alerts through PagerDuty (not Slack) | Alex | Apr 14 | **Done** |
+| 4 | Split large infra PRs so connection pool, TLS, and logging changes are reviewed independently | Dana | Ongoing | Process change |
+| 5 | Add PgBouncer connection utilization panel to Grafana service health dashboard | Priya | Apr 20 | Pending |
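+
+For action item 2, the promotion gate can read saturation directly from PgBouncer's admin console (`SHOW CONFIG` and `SHOW POOLS`). A minimal sketch follows, assuming psycopg2, a stats-only admin DSN, and an 80% utilization threshold; the DSN and threshold are placeholders to be tuned against real canary traffic:
+
+```python
+import sys
+
+import psycopg2
+
+ADMIN_DSN = "host=canary-pgbouncer port=6432 dbname=pgbouncer user=stats"  # placeholder
+THRESHOLD = 0.80  # placeholder promotion gate
+
+
+def client_connection_utilization(dsn=ADMIN_DSN):
+    """Fraction of max_client_conn currently in use, per PgBouncer's admin console."""
+    conn = psycopg2.connect(dsn)
+    conn.autocommit = True  # the admin console rejects explicit transactions
+    with conn.cursor() as cur:
+        cur.execute("SHOW CONFIG")
+        limits = {key: value for key, value, *_ in cur.fetchall()}
+        cur.execute("SHOW POOLS")
+        columns = [col[0] for col in cur.description]
+        active = sum(dict(zip(columns, row))["cl_active"] for row in cur.fetchall())
+    conn.close()
+    return active / int(limits["max_client_conn"])
+
+
+if __name__ == "__main__":
+    utilization = client_connection_utilization()
+    print(f"PgBouncer client-connection utilization: {utilization:.0%}")
+    sys.exit(1 if utilization > THRESHOLD else 0)  # non-zero exit blocks promotion
+```
+
+A companion check on `cl_waiting` (clients queued for a server connection) would catch the same failure mode even below the utilization threshold.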