# Incident Review Memo — Latency Spike (2026-04-12)

**Prepared for:** 14:00 UTC Incident Review  
**Date:** 2026-04-12  
**Author:** (draft)  
**Incident Severity:** P2 | **On-call:** Marcus Chen

---

## 1. Problem Statement

Following the v3.8.0 production deploy (2026-04-11 18:00 UTC), p99 API latency degraded from a ~220 ms baseline to 812 ms peak at 02:17 UTC, breaching the 500 ms SLO. The degradation correlated with a Redis cache hit-ratio collapse from 94% to 61% and an eviction-rate surge from ~50 keys/s to ~4,200 keys/s.
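
For context on how these headline numbers are derived, here is a minimal sketch (assuming `redis-py` and direct access to the cluster; the endpoint below is a placeholder) that computes hit ratio and eviction rate over a sampling window from `INFO stats`:

```python
import time

import redis

# Placeholder endpoint -- substitute the real catalog-cache host/port.
r = redis.Redis(host="catalog-cache.internal", port=6379, decode_responses=True)

def snapshot() -> tuple[int, int, int]:
    stats = r.info("stats")
    return stats["keyspace_hits"], stats["keyspace_misses"], stats["evicted_keys"]

# Two samples 60 s apart: INFO counters are cumulative since server start,
# so deltas give the rate over the window rather than a lifetime average.
h0, m0, e0 = snapshot()
time.sleep(60)
h1, m1, e1 = snapshot()

dh, dm = h1 - h0, m1 - m0
hit_ratio = dh / (dh + dm) if (dh + dm) else float("nan")
print(f"hit ratio: {hit_ratio:.1%}, evictions/sec: {(e1 - e0) / 60:.0f}")
```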

---

## 2. Scientific Framework

This analysis follows hypothesis-driven reasoning: each candidate cause is stated as a falsifiable hypothesis with specific predictions, evaluated against the metrics snapshot and release changelog.

---

## 3. Hypotheses

### H1: Memory-pressure cache thrashing from the new `user-preferences` namespace

**Claim:** The `uprefs:{user_id}` keyspace, introduced in v3.8.0 within the `catalog-cache` cluster, consumes enough memory to push the cluster past its `maxmemory` threshold, triggering aggressive `allkeys-lru` evictions of existing high-value keys (catalog, session).

**Predictions (testable):**
- P1.1: `uprefs_keys` count should grow monotonically after deploy, correlating with falling hit ratio. **Supported.** The metrics snapshot shows uprefs keys growing from 0 to 845,200 while the hit ratio falls from 94% to 61%.
- P1.2: At estimated peak (1.2M keys x 840 B avg), the namespace alone would consume ~1.0 GB. Combined with existing working set in an 8 GB cluster, this plausibly exceeds headroom. **Supported.** The 845K keys already at 06:00 UTC represent ~710 MB, and the namespace is still growing toward the 1.2M estimate.
- P1.3: Eviction rate should inflect sharply once `maxmemory` is reached, not gradually. **Supported.** Evictions rose from 51/s to 4,210/s — an 80x increase — consistent with a threshold breach, not gradual degradation.

**Falsification criteria:** If `uprefs` keys are isolated to a dedicated Redis instance (not sharing `catalog-cache` memory), this hypothesis is falsified. Also falsified if disabling `ENABLE_UPREFS_CACHE` does not materially recover hit ratio.
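
A direct test of P1.2 is to sample the namespace and project its footprint. Below is a sketch using `SCAN` plus `MEMORY USAGE` (the endpoint and sample size are placeholders; the 1.2M-key peak is the estimate from P1.2):

```python
import itertools

import redis

r = redis.Redis(host="catalog-cache.internal", port=6379)  # placeholder endpoint

# Sample up to 10,000 uprefs keys instead of walking the whole keyspace.
# Against a cluster, run this per node (a plain client only scans one node).
sample = list(itertools.islice(r.scan_iter(match="uprefs:*", count=1000), 10_000))
if not sample:
    raise SystemExit("no uprefs:* keys found on this node")

sizes = [r.memory_usage(k) or 0 for k in sample]
avg = sum(sizes) / len(sizes)

# Project the footprint at the estimated 1.2M-key peak.
projected_gb = avg * 1_200_000 / 2**30
print(f"avg key size: {avg:.0f} B, projected peak footprint: {projected_gb:.2f} GB")
```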

**Current assessment: Strong support.**

---

### H2: Thundering-herd reload from session key-schema migration

**Claim:** The session key change from `sess:{user_id}` to `sess:v2:{user_id}` means all active sessions miss on first request, causing a simultaneous wave of PostgreSQL reads and new cache writes.
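
The mechanism here is ordinary cache-aside fallback; a simplified sketch of the session lookup (the `load_session_from_db` helper is illustrative, not the actual service code):

```python
import json
from typing import Callable

import redis

r = redis.Redis(host="catalog-cache.internal", port=6379, decode_responses=True)

SESSION_TTL = 24 * 3600  # 24h session TTL

def get_session(user_id: str, load_session_from_db: Callable[[str], dict]) -> dict:
    """Cache-aside session lookup. After the v3.8.0 schema change, every
    active session misses once under the new prefix and falls through to
    PostgreSQL -- one DB read per returning user, then re-cached."""
    key = f"sess:v2:{user_id}"  # new schema; old sess:{user_id} keys are never read
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    session = load_session_from_db(user_id)          # the DB read in question
    r.set(key, json.dumps(session), ex=SESSION_TTL)  # write-through on first request
    return session
```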

**Predictions (testable):**
- P2.1: DB pool utilization should spike immediately after deploy, then decay as sessions are re-cached. **Partially supported.** DB pool usage rose from 34% to 82%, but the rise was gradual (hours), not a sharp spike at deploy time.
- P2.2: Session-related cache writes should show a burst at 18:00–19:00 UTC. **Not yet evaluated** — requires per-namespace write-rate metrics.
- P2.3: The reload burst should be bounded by active-session count at deploy time (~129K). Write-through on first request means each user hits DB once. **Plausible** but the gradual timeline weakens the "thundering herd" framing — it's more of a rolling migration as users return.

**Falsification criteria:** If session cache writes show no spike in the first 2 hours post-deploy, or if the DB pool rise tracks the eviction surge rather than the deploy itself, the herd explanation doesn't hold as a primary cause.

**Current assessment: Contributing factor, not primary driver.** The gradual ramp is more consistent with steady memory accumulation (H1) than a stampede.

---

### H3: TTL mismatch amplifying evictions

**Claim:** The `user-preferences` namespace has a 7-day TTL, far longer than the 24h session TTL. Under LRU eviction pressure, shorter-TTL keys that are frequently accessed get evicted in favor of longer-TTL keys that are rarely re-read, inverting the intended caching priority.

**Predictions (testable):**
- P3.1: Evicted keys should disproportionately be high-access-frequency `sess:v2:*` and `catalog:*` keys, not `uprefs:*` keys. **Not yet evaluated** — requires eviction-key sampling.
- P3.2: If uprefs keys have low read-after-write rates (write-once, read-rarely for many users), they become "dead weight" under LRU. **Plausible** given the nature of display preferences.

**Falsification criteria:** If eviction sampling shows uprefs keys are evicted at the same or higher rate than session keys, TTL mismatch is not the amplifier.
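
The required sampling could be done with keyspace notifications; a sketch, assuming `notify-keyspace-events` can be set to `Ee` on the node (this adds some publish overhead under heavy eviction, and must be run against each node in the cluster):

```python
from collections import Counter

import redis

r = redis.Redis(host="catalog-cache.internal", port=6379, decode_responses=True)
r.config_set("notify-keyspace-events", "Ee")  # enable evicted-key events

counts: Counter = Counter()
p = r.pubsub()
p.psubscribe("__keyevent@0__:evicted")  # evicted keys are published on this channel

for msg in p.listen():
    if msg["type"] != "pmessage":  # skip the subscription confirmation
        continue
    namespace = msg["data"].split(":", 1)[0]  # e.g. "sess", "catalog", "uprefs"
    counts[namespace] += 1
    if sum(counts.values()) >= 10_000:  # sample 10K evictions, then report
        break

print(counts.most_common())
```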

**Current assessment: Plausible amplifier of H1, not an independent root cause.**

---

### H4: Consistent-hash ring hotspot shift

**Claim:** The new key patterns (`sess:v2:*`, `uprefs:*`) redistributed slot assignments, creating hotspots on specific Redis nodes.

**Predictions (testable):**
- P4.1: Per-node memory and eviction rates should be asymmetric. **Not yet evaluated** — requires per-node metrics.
- P4.2: Aggregate cluster metrics would look identical with or without hotspots, so this hypothesis is underdetermined by the current data.

**Falsification criteria:** If per-node memory utilization is uniform (within 10%), hash-ring skew is not a factor.
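
A quick uniformity check, given the node list (the addresses below are placeholders):

```python
import redis

# Placeholder node addresses for the catalog-cache cluster.
NODES = [("cache-node-1.internal", 6379),
         ("cache-node-2.internal", 6379),
         ("cache-node-3.internal", 6379)]

usage = {host: redis.Redis(host=host, port=port).info("memory")["used_memory"]
         for host, port in NODES}

# Spread between the hottest and coldest node, as a fraction of the hottest.
spread = (max(usage.values()) - min(usage.values())) / max(usage.values())
print(usage)
print(f"node memory spread: {spread:.1%}")  # >10% would suggest hash-ring skew
```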

**Current assessment: Insufficient data to evaluate.**

---

## 4. Evidence Summary

| Signal                   | Pre-deploy (17:00 UTC) | Alert (02:17 UTC) | Post-mitigation (06:00 UTC) | Interpretation                          |
|--------------------------|---------------------|---------------|--------------------------|------------------------------------------|
| p99 latency (ms)         | 215                 | 812           | 488                      | 2.3x above baseline even after +4 GB RAM |
| Cache hit ratio (%)      | 94.2                | 60.8          | 71.8                     | Partial recovery — memory pressure eased, not resolved |
| Evictions/sec            | 48                  | 4,210         | 1,980                    | Still 40x baseline — new namespace still growing |
| DB pool utilization (%)  | 34                  | 82            | 62                       | Falls with eviction rate, confirms cache-miss pass-through |
| uprefs keys              | 0                   | 819,000       | 845,200                  | Approaching 1.2M peak estimate, ~710 MB consumed |

Key observation: increasing `maxmemory` from 8 to 12 GB reduced latency from 812 to ~490 ms but did **not** restore baseline. This is consistent with H1 — the extra 4 GB bought headroom but the uprefs namespace continues to grow toward its 1.2M-key working set.

---

## 5. Causal Model (most parsimonious explanation)

```
v3.8.0 deploy
  |
  +---> new uprefs namespace fills catalog-cache memory (H1, primary)
  |       |
  |       +---> allkeys-lru evicts session + catalog keys
  |               |
  |               +---> cache misses fall through to PostgreSQL
  |                       |
  |                       +---> DB pool saturation -> latency spike
  |
  +---> session key migration causes rolling cache misses (H2, secondary)
  |       adds ~129K write-through DB reads over the first few hours
  |
  +---> 7-day uprefs TTL may shield low-value keys from eviction (H3, amplifier)
```

---

## 6. Recommended Next Steps (ordered by information value)

1. **Test H1 directly:** Disable `ENABLE_UPREFS_CACHE` via feature flag in a canary group. If hit ratio recovers to >90% within 1 TTL cycle, H1 is confirmed.
2. **Size the namespace properly:** 1.2M keys x 840 B = ~1.0 GB. Either allocate a dedicated Redis instance for uprefs, or increase `catalog-cache` maxmemory to accommodate the combined working set (current ~7 GB + 1 GB uprefs + headroom = 10 GB minimum).
3. **Reduce uprefs TTL:** 7 days is excessive for display preferences. A 24h TTL with lazy reload would reduce steady-state memory by ~6x (only daily-active users cached).
4. **Evaluate H3 (TTL inversion):** Sample evicted keys via keyspace notifications (`evicted` events) to confirm whether uprefs keys are displacing higher-value keys; note that `OBJECT FREQ` requires an LFU `maxmemory-policy` and is unavailable under the current `allkeys-lru` (use `OBJECT IDLETIME` instead).
5. **Evaluate H4:** Pull per-node memory and eviction metrics from the cluster to rule out hash-ring skew.
6. **Monitor session migration:** Track `sess:v2:*` key count against the expiration curve of the legacy `sess:{user_id}` keys to confirm the old keys age out cleanly within 24h; note the glob `sess:*` also matches `sess:v2:*`, so legacy counts must subtract the v2 keys (see the sketch after this list).
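
For step 6, a sketch that tallies the two session namespaces (the endpoint is a placeholder; note the subtraction, since the `sess:*` glob also matches `sess:v2:*`):

```python
import redis

r = redis.Redis(host="catalog-cache.internal", port=6379)  # placeholder endpoint

def count_keys(pattern: str) -> int:
    # Full SCAN of the keyspace -- cheap enough for a periodic monitoring job.
    return sum(1 for _ in r.scan_iter(match=pattern, count=1000))

v2 = count_keys("sess:v2:*")
legacy = count_keys("sess:*") - v2  # sess:* includes sess:v2:*, so subtract
print(f"sess:v2 keys: {v2:,}, legacy sess keys: {legacy:,}")
# The legacy count should decay to ~0 within the 24h session TTL.
```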

---

## 7. Conclusion

The evidence most strongly supports **H1 (memory-pressure cache thrashing)** as the primary cause: the new `user-preferences` namespace is consuming ~710 MB and growing within a shared 8 GB Redis cluster, triggering aggressive LRU evictions of existing high-value keys. The session key-schema migration (H2) is a secondary contributor, and TTL mismatch (H3) is a plausible amplifier. Hash-ring skew (H4) cannot yet be evaluated.

The immediate mitigation (+4 GB RAM) was correct but insufficient — a durable fix requires either isolating the uprefs namespace to its own instance or right-sizing the shared cluster for the combined working set.
