# Add memory profiling to CI for route-optimizer

## Properties (Decision Log database)

| Property      | Value |
|---------------|-------|
| Title         | Add memory profiling to CI for route-optimizer |
| Decision Date | 2026-04-12 |
| Status        | Decided |
| Category      | Process |
| Severity      | SEV-2 |
| Owner         | Priya Sharma |
| Tags          | CI, memory-profiling, route-optimizer, observability |
| Due Date      | 2026-04-26 |
| Incident ID   | INC-2026-0412-001 |
| Source Link   | https://grafana.internal/d/route-optimizer-pods |
| Related Pages | → INC-2026-0412-001 — Warehouse routing outage postmortem |
|               | → Decision: Add prod-scale load profiles to staging |
|               | → Decision: Decouple carrier-api from route-optimizer releases |
|               | → Decision: Create queue drain runbook |

---

## Page Body

## Summary

Add a CI pipeline step that profiles memory usage for the route-optimizer service under load and fails the build if usage exceeds 2× the established baseline, so OOM risks are caught before a production deploy.

## Context

v2.14.0's multi-stop batching algorithm consumed 3× the expected memory (~6Gi against a 2Gi pod limit), but there was no automated memory profiling anywhere in the CI pipeline, so the regression stayed invisible until it hit production. Memory regressions can be introduced by any change to the routing algorithms, and the team currently has no guardrail against them.

## Timeline

- 2026-04-12 08:23 — Priya identifies 3× memory usage as root cause
- 2026-04-12 08:45 — Pod memory bumped to 4Gi as stopgap
- 2026-04-12 09:22 — Hotfix v2.14.1 brings memory to ~1.2Gi
- 2026-04-12 09:47 — Retro decision: add memory profiling gate in CI

## Impact

Without memory profiling, any algorithm change risks OOM in production. The outage delayed ~340 orders and caused 3 SLA breaches.

## Root Cause

No automated memory profiling in CI. Developers had no signal about memory impact of their changes before merging.

## Decision

Add a memory profiling step to the route-optimizer CI pipeline. The step runs the service under prod-scale load profiles (from the related decision "Add prod-scale load profiles to staging") and fails the build if memory usage exceeds 2× the established baseline. Baseline thresholds will be derived from current production metrics.
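
A minimal sketch of what the gate step could look like, assuming the load-test harness writes the service's peak RSS to a JSON report and the baseline lives in a file checked into the repo; the file paths, the `peak_rss_mib` key, and the harness itself are placeholders, not existing tooling:

```python
#!/usr/bin/env python3
"""CI memory gate: fail the build if peak memory exceeds 2x the baseline.

Assumes the load-test run has already written a JSON report containing
the service's peak RSS, and that a baseline file is checked into the
repo. File names and JSON keys here are illustrative placeholders.
"""
import json
import sys

BASELINE_FILE = "ci/memory-baseline.json"     # e.g. {"peak_rss_mib": 2048}
REPORT_FILE = "artifacts/memory-report.json"  # written by the load-test run
THRESHOLD_MULTIPLIER = 2.0                    # fail above 2x baseline


def load_peak_mib(path: str) -> float:
    """Read the peak RSS figure (MiB) out of a JSON file."""
    with open(path) as f:
        return float(json.load(f)["peak_rss_mib"])


def main() -> int:
    baseline = load_peak_mib(BASELINE_FILE)
    observed = load_peak_mib(REPORT_FILE)
    limit = baseline * THRESHOLD_MULTIPLIER

    print(f"baseline={baseline:.0f} MiB observed={observed:.0f} MiB limit={limit:.0f} MiB")
    if observed > limit:
        print(f"FAIL: peak memory {observed:.0f} MiB exceeds {THRESHOLD_MULTIPLIER}x baseline")
        return 1
    print("PASS: memory usage within threshold")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

How peak RSS gets measured is a separate concern (e.g. polling the pod's cgroup memory stats during the load run); the gate itself only needs the final number.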

## Alternatives Considered

- **Manual memory review during PR review** — Rejected; error-prone, doesn't scale, easy to skip under deadline pressure.
- **Runtime memory alerts only (no CI gate)** — Rejected; it catches the problem too late, when the service is already in production and users are already affected.
- **Memory profiling for all services at once** — Rejected as initial scope; start with route-optimizer, expand to other services based on learnings.

## Action Items

- [ ] Implement memory profiling CI step for route-optimizer — Owner: Priya Sharma, Due: 2026-04-26
- [ ] Define baseline memory thresholds from current production metrics (see the sketch after this list) — Owner: Priya Sharma, Due: 2026-04-26
- [ ] Document the profiling approach so other teams can adopt it — Owner: Priya Sharma, Due: 2026-04-26
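
For the baseline item, one possible way to derive the threshold, assuming production memory usage is scraped into Prometheus; the server URL, label selector, lookback window, and output path below are placeholders for illustration:

```python
#!/usr/bin/env python3
"""Derive the CI memory baseline from production metrics.

Assumes production memory usage is available in Prometheus; the server
URL, metric selector, and output path are placeholders for this sketch.
"""
import json

import requests

PROM_URL = "https://prometheus.internal/api/v1/query"  # placeholder
# p95 of working-set memory across route-optimizer pods over 14 days
QUERY = (
    'max(quantile_over_time(0.95, '
    'container_memory_working_set_bytes{pod=~"route-optimizer-.*"}[14d]))'
)

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=30)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if not result:
    raise SystemExit("no production data returned; cannot derive baseline")

# Prometheus returns the sample as [timestamp, "value-as-string"]
peak_bytes = float(result[0]["value"][1])
baseline = {"peak_rss_mib": round(peak_bytes / 2**20)}

# Check this file into the repo; the CI gate reads it on every build.
with open("ci/memory-baseline.json", "w") as f:
    json.dump(baseline, f, indent=2)
print(f"wrote baseline: {baseline}")
```

Pinning the baseline to a recorded production percentile rather than the raw pod limit keeps the 2× gate meaningful even if limits are bumped again, as they were during the incident.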

## Related

- Postmortem: INC-2026-0412-001 — Warehouse routing outage
- Grafana dashboard: https://grafana.internal/d/route-optimizer-pods
- Depends on: Decision — Add prod-scale load profiles to staging (profiles needed for meaningful CI memory tests)
