---
marp: true
theme: default
paginate: true
footer: "Agent Runtime Latency Reduction | Architecture Review"
style: |
  /* Tech Theme - GitHub-inspired dark */
  section {
    background-color: #0d1117;
    color: #c9d1d9;
    font-family: 'SF Mono', 'Fira Code', 'Cascadia Code', monospace;
    font-size: 28px;
    padding: 40px 60px;
  }
  h1 {
    color: #58a6ff;
    font-size: 2.2em;
    border-bottom: 2px solid #30363d;
    padding-bottom: 12px;
  }
  h2 {
    color: #58a6ff;
    font-size: 1.6em;
    border-bottom: 1px solid #30363d;
    padding-bottom: 8px;
  }
  h2::before {
    content: "# ";
    color: #8b949e;
    font-weight: 300;
  }
  h3 {
    color: #7ee787;
    font-size: 1.2em;
  }
  a {
    color: #58a6ff;
  }
  strong {
    color: #f0f6fc;
  }
  em {
    color: #f0883e;
    font-style: normal;
  }
  code {
    background-color: #161b22;
    color: #7ee787;
    padding: 2px 6px;
    border-radius: 4px;
    font-size: 0.9em;
  }
  pre {
    background-color: #161b22;
    border: 1px solid #30363d;
    border-radius: 8px;
    padding: 16px;
  }
  table {
    width: 100%;
    border-collapse: collapse;
    margin: 16px 0;
  }
  th {
    background-color: #161b22;
    color: #58a6ff;
    border: 1px solid #30363d;
    padding: 10px 16px;
    text-align: left;
  }
  td {
    border: 1px solid #30363d;
    padding: 10px 16px;
  }
  ul, ol {
    line-height: 1.8;
  }
  blockquote {
    border-left: 4px solid #58a6ff;
    padding-left: 16px;
    color: #8b949e;
    font-style: italic;
  }
  section.lead {
    display: flex;
    flex-direction: column;
    justify-content: center;
    align-items: center;
    text-align: center;
  }
  section.lead h1 {
    border-bottom: none;
    font-size: 2.8em;
  }
  section.lead h2 {
    border-bottom: none;
    color: #8b949e;
    font-size: 1.2em;
  }
  section.lead h2::before {
    content: "";
  }
  section.lead h3 {
    color: #7ee787;
    font-size: 1.0em;
  }
  footer {
    color: #484f58;
    font-size: 14px;
  }
  header {
    color: #484f58;
    font-size: 14px;
  }
  section.highlight-box td:nth-child(3) {
    color: #7ee787;
    font-weight: bold;
  }
  section.highlight-box td:nth-child(4) {
    color: #7ee787;
    font-weight: bold;
  }
---

<!-- _class: lead -->
<!-- _paginate: false -->
<!-- _footer: "" -->

# Agent Runtime Latency Reduction

## Architecture Review - April 12, 2026

### Alex

---

## The Problem

**p99 latency is 3x over SLO** and driving customer escalations

- p99 round-trip latency hit **2.4s** — SLO target is **800ms**
- Tail latency spikes correlate with **memory pressure** on inference pods
- **3 P1 escalations** from enterprise accounts in the last 2 weeks
- Planner step alone accounts for **~40%** of total latency (avg 960ms)

> The status quo is untenable — we are breaching the SLO across the entire latency tail.

---

## Current Architecture

Synchronous chain with **200-400ms serialization overhead per hop**:

```
  router  -->  planner  -->  executor  -->  tool-call  -->  synthesizer
          200ms        400ms         300ms          200ms
```

### Key bottlenecks

- Each stage **blocks** until the previous stage fully completes *(sketch on the next slide)*
- JSON serialization between every hop
- No parallelism — planner must finish before executor can start
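
---

## Sketch: The Blocking Chain

A minimal asyncio sketch of the blocking behavior above. The stage names, uniform per-hop latency, and payload shape are illustrative stand-ins, not the production services:

```python
import asyncio, json

async def stage(name: str, payload: str) -> str:
    """Hypothetical stand-in for one hop (planner, executor, ...)."""
    await asyncio.sleep(0.3)                  # stand-in for model/tool latency
    return json.dumps({"stage": name, "input": json.loads(payload)})

async def handle(request: dict) -> str:
    out = json.dumps(request)
    for name in ("router", "planner", "executor", "tool-call", "synthesizer"):
        # Each hop blocks until the previous one fully completes,
        # with a JSON round-trip at every boundary.
        out = await stage(name, out)
    return out

print(asyncio.run(handle({"query": "example"})))
```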

---

## Proposed Architecture

Five changes to move from a blocking chain to a **streaming pipeline**:

1. **Pipeline planner + executor** — stream tokens into the executor as the planner emits them *(sketched on the next slides)*
2. **Speculative execution** — prefetch tool calls while the planner is still running
3. **FlatBuffers over JSON** — 6x faster serialization (validated in prototype)
4. **Adaptive batching** — synthesizer flushes every 50ms instead of waiting for all tool results
5. **Pull-based backpressure** — slow consumers no longer block the pipeline
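
---

## Sketch: Pipelined Planner + Executor

A minimal sketch of changes 1 and 5, assuming an asyncio runtime. `planner` and `executor` here are hypothetical stand-ins; the bounded queue is what provides pull-based backpressure:

```python
import asyncio

async def planner(out: asyncio.Queue) -> None:
    """Emits plan tokens as they are generated, not as one final plan."""
    for token in ("search", "fetch", "summarize"):
        await out.put(token)        # blocks when the queue is full -> backpressure
        await asyncio.sleep(0.1)    # stand-in for token generation latency
    await out.put(None)             # end-of-stream sentinel

async def executor(inp: asyncio.Queue) -> None:
    """Starts acting on each plan step as soon as it arrives."""
    while (token := await inp.get()) is not None:
        print(f"executing {token!r} while the planner is still running")

async def main() -> None:
    # A bounded queue is the pull-based backpressure: a slow executor
    # stalls the planner instead of letting work pile up without limit.
    q: asyncio.Queue = asyncio.Queue(maxsize=8)
    await asyncio.gather(planner(q), executor(q))

asyncio.run(main())
```

---

## Sketch: Adaptive Batching

One way change 4 could look (Python 3.11+ for `asyncio.timeout`). The queue and driver are assumptions; only the 50ms cadence comes from the proposal:

```python
import asyncio

FLUSH_INTERVAL_S = 0.05            # the 50ms cadence from the proposal

async def synthesizer(results: asyncio.Queue) -> None:
    """Flush whatever arrived each interval instead of waiting for all tools."""
    batch, done = [], False
    while not done:
        try:
            async with asyncio.timeout(FLUSH_INTERVAL_S):   # collect for <=50ms
                while True:
                    if (item := await results.get()) is None:
                        done = True
                        break
                    batch.append(item)
        except TimeoutError:
            pass                   # interval elapsed: emit what we have
        if batch:
            print(f"flushing {len(batch)} result(s)")       # partial answer out
            batch.clear()

async def main() -> None:
    q: asyncio.Queue = asyncio.Queue()

    async def tools() -> None:     # stand-in tool results trickling in
        for i in range(10):
            await q.put(f"result-{i}")
            await asyncio.sleep(0.02)
        await q.put(None)

    await asyncio.gather(tools(), synthesizer(q))

asyncio.run(main())
```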

---

<!-- _class: highlight-box -->

## Prototype Results

Staging environment, **10k request sample** — all targets met:

| Metric | Before | After | Impact |
|---|---|---|---|
| p50 latency | 1.1s | **0.38s** | **-65%** |
| p99 latency | 2.4s | **0.72s** | **-70%** |
| Throughput | 420 req/s | **680 req/s** | **+62%** |
| Memory / pod | 2.1 GB | **1.8 GB** | **-14%** |
| Error rate | 0.02% | 0.02% | No change |

> p99 now **within SLO** at 0.72s. Error rate unchanged.

---

## Risks & Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Speculative execution wastes GPU cycles if planner diverges | Cost / resource pressure | **Cancellation budget** — cap speculative work per request *(sketch on next slide)* |
| FlatBuffers touches **14 internal services** | Rollout complexity | **Incremental migration** — top-5 hottest paths first |
| Streaming loses full-plan validation before execution | Correctness risk | **Transparent fallback** — revert to sync chain on pipeline failure |
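
---

## Sketch: Cancellation Budget

One shape the speculative-execution cap could take. Tool names, latencies, and the 500ms budget are illustrative, not production values:

```python
import asyncio

SPECULATION_BUDGET_S = 0.5      # hypothetical per-request cap on speculative work

async def speculative_prefetch(tool: str, latency: float) -> str:
    await asyncio.sleep(latency)            # stand-in for a real tool call
    return f"{tool}: prefetched"

async def handle_request() -> None:
    # Prefetch tools the planner will *probably* request...
    tasks = {asyncio.create_task(speculative_prefetch(t, s))
             for t, s in (("web_search", 0.2), ("code_exec", 2.0))}
    # ...but never let speculation run past the budget.
    done, pending = await asyncio.wait(tasks, timeout=SPECULATION_BUDGET_S)
    for task in pending:
        task.cancel()                       # reclaim GPU/CPU from diverged work
    print(f"kept {len(done)} result(s), cancelled {len(pending)}")

asyncio.run(handle_request())
```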

---

## Rollout Timeline

### Phase 1 — Weeks 1-3
**Planner streaming + executor pipelining** — captures the biggest latency win

### Phase 2 — Weeks 4-5
**FlatBuffers migration** — top-5 hottest serialization paths

### Phase 3 — Weeks 6-8
**Speculative execution + adaptive batching** — final optimization layer

**Observability:** New Grafana dashboard tracking per-hop latency breakdown
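
---

## Sketch: Per-Hop Latency Metric

Assuming `prometheus_client` feeds the Grafana dashboard, a sketch of the per-hop instrumentation. The metric name, buckets, and helper are illustrative, not an existing service API:

```python
import time
from prometheus_client import Histogram

# Hypothetical metric behind the per-hop dashboard panels.
HOP_LATENCY = Histogram(
    "agent_hop_latency_seconds",
    "Wall-clock latency of each pipeline hop",
    ["hop"],
    buckets=(0.05, 0.1, 0.2, 0.4, 0.8, 1.6),   # 0.8s = the SLO boundary
)

def observe_hop(hop: str, started_at: float) -> None:
    """Record one hop's duration; the dashboard derives p50/p99 per hop."""
    HOP_LATENCY.labels(hop=hop).observe(time.perf_counter() - started_at)

start = time.perf_counter()
# ... run the planner hop ...
observe_hop("planner", start)
```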

---

<!-- _class: lead -->
<!-- _paginate: false -->

# The Ask

## Sign-off on architecture and rollout plan

### What we need

- Staff eng spike on FlatBuffers migration plan *(1 week)*
- SRE to provision canary cluster for Phase 1
- Weekly latency review cadence during rollout
