# INC-2026-0412-001 — Warehouse routing outage postmortem

## Properties (Engineering Docs database)

| Property      | Value |
|---------------|-------|
| Title         | INC-2026-0412-001 — Warehouse routing outage postmortem |
| Doc Type      | Postmortem |
| Status        | Draft |
| Owner         | Sofia Reyes |
| Team          | Ops |
| Tags          | route-optimizer, OOM, warehouse-routing, SFO, PDX |
| Last Reviewed | 2026-04-12 |
| Related Pages | → Decision: Add prod-scale load profiles to staging |
|               | → Decision: Add memory profiling to CI for route-optimizer |
|               | → Decision: Decouple carrier-api from route-optimizer releases |
|               | → Decision: Create queue drain runbook |
|               | → INC-2026-0328 — Carrier API timeout cascade |
| Source Link   | https://grafana.internal/d/route-optimizer-pods |

---

## Page Body

## Overview

Postmortem for the warehouse routing outage on 2026-04-12 (INC-2026-0412-001, SEV-2). The route-optimizer service OOM-killed repeatedly during the SFO morning peak after deploy v2.14.0 introduced a multi-stop batching algorithm whose memory usage scales quadratically with batch size. ~340 orders were delayed ~90 minutes. Service was restored via hotfix v2.14.1 roughly 92 minutes after the first OOM kills (07:50 → 09:22).

**Incident ID:** INC-2026-0412-001
**Severity:** SEV-2
**Attendees:** Sofia Reyes (EM), Derek Wu (SWE), Priya Sharma (SRE), Carlos Mendez (Ops Lead)

## Timeline

- **07:50** — route-optimizer pods begin OOM-killing in SFO region
- **08:15** — Carlos Mendez notices orders stuck in SFO queue, raises alarm in #incident-warehouse-routing
- **08:17** — Priya Sharma confirms same issue on PDX hub; Grafana shows route-optimizer pods restart-looping
- **08:23** — Priya identifies root cause: v2.14.0 multi-stop batching algorithm consuming 3× expected memory on large order sets
- **08:25** — Derek Wu confirms: his PR assumed max 50 orders/route; SFO peak hits 200+; staging tested up to 80
- **08:27** — Rollback to v2.13.2 considered
- **08:28** — Rollback ruled out — v2.14.0 also contains carrier-api auth migration; rolling back would break carrier handshakes
- **08:31** — Sofia calls hotfix path: Derek on fix, Priya monitors pods, Carlos manually drains SFO queue
- **08:45** — Priya bumps pod memory limit to 4Gi as stopgap; pods stabilize but run slowly
- **09:02** — Derek opens hotfix PR #4821: caps batch at 60, adds chunked processing for large sets, 3Gi soft memory guard
- **09:08** — PR merged, deploying to prod
- **09:22** — v2.14.1 fully rolled out; pods healthy, memory usage down to ~1.2Gi average
- **09:25** — Backlog fully cleared; ops confirms normal order flow
- **09:30** — Sofia schedules retro for 10:00 AM
- **09:47** — Retro completed; 4 action items decided
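The mitigation shape in PR #4821 (batch cap plus chunked processing) can be sketched as follows. This is illustrative only: `plan_routes` and the order representation are placeholders, not the actual service code.

```python
from typing import Iterator, List

MAX_BATCH = 60  # hard cap introduced in v2.14.1


def chunked(orders: List[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Yield successive slices of at most `size` orders."""
    for start in range(0, len(orders), size):
        yield orders[start:start + size]


def plan_routes(orders: List[str]) -> List[List[str]]:
    """Process a large order set in bounded chunks so the quadratic
    batching step never sees more than MAX_BATCH orders at once."""
    routes: List[List[str]] = []
    for chunk in chunked(orders):
        # Placeholder for the real multi-stop batching algorithm.
        routes.append(chunk)
    return routes


# A 200-order SFO peak becomes ceil(200 / 60) = 4 bounded chunks.
peak = [f"order-{i}" for i in range(200)]
chunks = list(chunked(peak))
print(len(chunks), max(len(c) for c in chunks))  # 4 60
```

Capping the chunk size bounds the quadratic term regardless of regional peak volume, which is why the fix holds even if SFO peaks grow further.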

## Impact

- **~340 orders delayed** — average delay ~90 minutes
- **3 carrier SLA breaches:** FedEx SFO, UPS PDX, DHL SFO
- **Ops team manually drained queue** for ~45 minutes using the admin tool
- **No data loss**
- **Total time to resolution:** ~92 minutes (07:50 → 09:22)

## Root Cause

The multi-stop batching algorithm (PR #4789, merged 2026-04-11) assumed a maximum of 50 orders per batch. SFO morning peak regularly hits 150–200 orders. The algorithm's memory usage scales **quadratically** with batch size — a 200-order batch needs ~6Gi vs. the 2Gi pod memory limit.

The staging environment's synthetic load topped out at 80 orders, so the OOM condition was never triggered pre-deploy.
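A back-of-envelope check on the numbers above (a sketch — the constant is calibrated from the ~6Gi estimate for 200 orders, not measured):

```python
# Quadratic memory model: mem(n) = k * n^2, with k calibrated so that
# mem(200) ~= 6 GiB, matching the root-cause estimate.
K = 6.0 / 200**2  # GiB per squared order count

POD_LIMIT_GIB = 2.0  # pre-incident pod memory limit


def est_memory_gib(batch_size: int) -> float:
    """Estimated memory footprint of one batch, in GiB."""
    return K * batch_size**2


for n in (50, 80, 200):
    mem = est_memory_gib(n)
    status = "OOM" if mem > POD_LIMIT_GIB else "ok"
    print(f"{n:>3} orders -> ~{mem:.2f} GiB ({status})")
```

Under this model the assumed 50-order batch needs ~0.38 GiB and staging's 80-order load ~0.96 GiB — both comfortably under the 2Gi limit, which is consistent with the OOM only appearing at prod peak volumes.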

A secondary factor was **deploy coupling**: the carrier-api auth migration was bundled into the same release (v2.14.0), which blocked rollback as a mitigation path.

## Decisions Made

1. **Add prod-scale load profiles to staging** — Mirror real peak volumes (200+ orders). Owner: Derek Wu. Due: 2026-04-19.
2. **Add memory profiling to CI for route-optimizer** — CI step that profiles memory under load, alerts if >2× baseline. Owner: Priya Sharma. Due: 2026-04-26.
3. **Decouple carrier-api from route-optimizer releases** — Separate release trains via RFC. Owners: Carlos Mendez + Derek Wu. RFC due: 2026-05-03.
4. **Create queue drain runbook** — Document manual drain procedure. Owner: Carlos Mendez. Due: 2026-04-16.
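Decision 2 could take a shape like the following CI check — a sketch using Python's stdlib `tracemalloc`; the baseline value, alert factor wiring, and toy workload are placeholders, not the real CI setup:

```python
import tracemalloc

BASELINE_PEAK_MIB = 100.0   # hypothetical recorded baseline for this workload
ALERT_FACTOR = 2.0          # alert if peak exceeds 2x baseline, per Decision 2


def peak_memory_mib(workload, *args) -> float:
    """Run `workload` and return its peak traced allocation in MiB."""
    tracemalloc.start()
    try:
        workload(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)


def check_memory(workload, *args) -> bool:
    """True if the workload's peak stays under ALERT_FACTOR x baseline."""
    return peak_memory_mib(workload, *args) <= ALERT_FACTOR * BASELINE_PEAK_MIB


# Example: a toy workload allocating ~8 MiB passes against a 100 MiB baseline.
print(check_memory(lambda: bytearray(8 * 1024 * 1024)))  # True
```

A real CI step would run the batching algorithm against the prod-scale load profiles from Decision 1 and fail the build (or page) when the check returns False.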

## Open Questions

- Should we set hard memory limits per algorithm variant? *(Derek Wu to investigate)*
- Do we need a circuit breaker on batch size at the API gateway level? *(Priya Sharma to evaluate)*
- Can we get carrier SLA breach cost numbers from Finance for the final report? *(Sofia Reyes to follow up)*

## Related

- Slack thread: #incident-warehouse-routing (2026-04-12)
- Hotfix PR: https://github.com/acme-logistics/route-optimizer/pull/4821
- Original PR: https://github.com/acme-logistics/route-optimizer/pull/4789
- Grafana dashboard: https://grafana.internal/d/route-optimizer-pods
- Related prior incident: INC-2026-0328 — Carrier API timeout cascade (same carrier-api coupling issue)

## Changelog

| Date       | Author      | Change                      |
|------------|-------------|-----------------------------|
| 2026-04-12 | Sofia Reyes | Initial creation from retro |
