--- a/slides.md
+++ b/slides.md
@@ -1,50 +1,457 @@
 ---
 theme: default
-title: Agent Runtime Architecture Review
+title: "Agent Runtime Architecture Review — doany.ai"
+info: Architecture review covering orchestration, sandboxing, streaming pipeline, and fault tolerance.
+drawings:
+  enabled: false
+transition: slide-left
+mdc: true
 ---
 
 # Agent Runtime Architecture Review
 
 doany.ai — Platform Team
-April 2026
+April 2026 · Engineering Leadership Review
+
 ---
 
 # Agenda
 
-- Orchestrator overview
-- Sandbox runtime
-- Q&A
-
+- **System Overview** — What the agent runtime does
+- **End-to-End Data Flow** — Full path from user input to rendered result
+- **Orchestrator** — ExecutionGraph, DAG scheduling, retry logic
+- **Sandbox Runtime** — Isolation model with Firecracker / Docker
+- **Streaming Pipeline** *(new)* — Ring buffer, backpressure, SSE protocol
+- **Fault Tolerance** *(new)* — Circuit breakers, checkpoint/resume, DLQ
+- **Streaming + Checkpoints** — How the new systems interact
+- **Infrastructure & Observability**
+
+---
+layout: section
+---
+
+# System Overview
+
+---
+
+# What the Agent Runtime Does
+
+The runtime is the execution engine behind every skill invocation on doany.ai.
+
+- Receives **user intents** (natural language or structured)
+- **Selects and composes** skills via the Skill Registry
+- **Orchestrates** execution as a DAG of sandboxed steps
+- **Streams** results back to the frontend in real time
+- **Recovers** from failures without losing progress
+
+**Key invariant:** Every skill step runs in an isolated sandbox. The runtime never executes user-triggered code in the host process.
+
+---
+
+# End-to-End Data Flow
+
+```mermaid {scale: 0.85}
+flowchart LR
+    A["User Input"] --> B["Intent Router"]
+    B --> C["Skill Registry"]
+    C --> D["Orchestrator"]
+    D --> E["ExecutionGraph\n(DAG)"]
+    E --> F["Sandbox Runtime\n(Firecracker / Docker)"]
+    F --> G["Streaming Pipeline\n(Ring Buffer + SSE)"]
+    G --> H["Result Aggregator\n(Vercel Blob)"]
+    H --> I["Frontend\n(SSE Consumer)"]
+
+    style B fill:#e0f2fe,stroke:#0284c7
+    style D fill:#e0f2fe,stroke:#0284c7
+    style F fill:#fef3c7,stroke:#d97706
+    style G fill:#dcfce7,stroke:#16a34a
+    style H fill:#dcfce7,stroke:#16a34a
+```
+
+**Routing & Scheduling** · **Isolated Execution** · **Streaming & Delivery**
+
+---
+layout: section
+---
+
 # Orchestrator
 
-- Receives a SkillPlan, builds an ExecutionGraph (DAG)
-- Parallel fan-out for independent steps
-- Retry: exponential backoff, max 3 attempts
-
-```typescript
+---
+layout: two-cols
+---
+
+# Orchestrator Deep-Dive
+
+Receives a `SkillPlan`, builds an `ExecutionGraph` (DAG)
+
+- **Parallel fan-out** for independent skill steps
+- **Typed channels** pass data between steps
+- **Retry**: exponential backoff + jitter, max 3 attempts
+- **Timeouts**: 120s per step, 600s per graph
+- **Checkpoints**: graph state → Redis every 10s
+
+::right::
+
+```typescript {all|3-4|5|6|7}
+interface SkillPlan {
+  intentId: string;
+  skills: SkillStep[];
+  constraints: ExecutionConstraints;
+}
+
+interface SkillStep {
+  skillId: string;
+  inputs: Record<string, unknown>;
+  dependsOn: string[];
+  timeout: number;
+  retryPolicy: RetryPolicy;
+}
+
 interface ExecutionGraph {
   id: string;
   steps: Map<string, SkillStep>;
   edges: [string, string][];
-  status: 'pending' | 'running' | 'completed' | 'failed';
+  status: 'pending' | 'running'
+    | 'completed' | 'failed';
+  checkpointKey?: string;
 }
 ```
 
 ---
 
 # Sandbox Runtime
 
-- Firecracker microVMs (prod) / Docker (dev)
-- Resource limits: 2 vCPU, 4GB RAM, 10GB disk
-- Network deny-by-default
-- Read-only FS except `/workspace` and `/output`
-
----
-
-# Thank You
-
-Questions?
+Firecracker microVMs in production, Docker containers in dev/staging.
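The retry policy on the Orchestrator slide above (exponential backoff plus jitter, max 3 attempts) can be sketched as a small wrapper. A minimal illustration under assumed base delays, not the runtime's actual code:

```typescript
// Illustrative retry wrapper: exponential backoff with jitter, capped at
// 3 attempts. The 500ms base and 250ms jitter window are assumptions.
async function withRetry<T>(
  step: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await step();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break;
      // Exponential backoff (500ms, 1s, 2s, ...) plus up to 250ms of jitter.
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

The jitter matters when many steps fail at once: it spreads the retries out instead of hammering a recovering dependency in lockstep.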
+
+### Resource Limits
+
+- **2 vCPU** per sandbox
+- **4 GB RAM**
+- **10 GB ephemeral disk**
+- Network **deny-by-default**
+- Skills must declare required domains
+
+### Filesystem Model
+
+- Root FS: **read-only**
+- `/workspace`: read-write (skill working dir)
+- `/output`: read-write (results written here)
+- Mounts destroyed after step completes
+
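The limits and mount model above could be written down as a declarative sandbox spec. A hedged sketch: `SandboxSpec` and its field names are invented for illustration and are not the runtime's real schema.

```typescript
// Hypothetical sandbox spec mirroring the limits on this slide.
// Field names are illustrative, not the actual configuration format.
interface SandboxSpec {
  vcpus: number;
  memoryMb: number;
  diskGb: number;
  networkPolicy: { default: "deny"; allowedDomains: string[] };
  mounts: { path: string; mode: "ro" | "rw" }[];
}

const defaultSpec: SandboxSpec = {
  vcpus: 2,
  memoryMb: 4096,
  diskGb: 10,
  // Deny-by-default: a skill must declare every domain it needs.
  networkPolicy: { default: "deny", allowedDomains: [] },
  mounts: [
    { path: "/", mode: "ro" },          // read-only root FS
    { path: "/workspace", mode: "rw" }, // skill working dir
    { path: "/output", mode: "rw" },    // results written here
  ],
};
```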
+
+**Why Firecracker?** Sub-second boot time (~125ms), strong isolation via KVM, smaller attack surface than containers. Docker is used in dev for faster iteration.
+
+---
+layout: section
+---
+
+# Streaming Pipeline
+
+*Shipped 2026-04-09 — replaces the batch-then-send model*
+
+---
+layout: two-cols
+---
+
+# Streaming Architecture
+
+- **Token-level streaming** replaces old batch model
+- **Ring buffer** per active session (64 KB capacity)
+- **Backpressure**: if client falls behind, buffer drops intermediate tokens but preserves final state
+- **Inline extractors** detect code blocks, tables, and diagrams as they stream
+
+**Impact:** Time-to-first-token dropped from ~3s (batch) to ~180ms (streaming).
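The drop-intermediate, keep-final backpressure rule can be sketched with a toy buffer. Capacity is counted in tokens here for readability; the real buffer is a 64 KB byte ring, and `TokenRingBuffer` is an illustrative name, not the production class.

```typescript
// Toy sketch of the backpressure rule: when the buffer is full, drop the
// oldest *intermediate* token, but never a token marked as final state.
class TokenRingBuffer {
  private tokens: { text: string; final: boolean }[] = [];
  constructor(private capacity: number) {}

  push(text: string, final = false): void {
    if (this.tokens.length >= this.capacity) {
      // Drop the oldest non-final token to make room.
      const idx = this.tokens.findIndex((t) => !t.final);
      if (idx >= 0) this.tokens.splice(idx, 1);
    }
    this.tokens.push({ text, final });
  }

  // Hand everything buffered so far to the SSE emitter.
  drain(): string[] {
    const out = this.tokens.map((t) => t.text);
    this.tokens = [];
    return out;
  }
}
```

A slow client therefore sees a thinned-out stream of intermediate tokens, but the final state always arrives intact.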
+
+::right::
+
+```mermaid {scale: 0.75}
+flowchart TB
+    S["Sandbox Output"] --> RB["Ring Buffer\n(64 KB per session)"]
+    RB --> EX["Inline Extractors\n(code, tables, diagrams)"]
+    EX --> SSE["SSE Emitter"]
+    SSE --> C["Client"]
+
+    RB -->|"backpressure"| BP["Drop intermediate\ntokens, keep final"]
+    BP --> SSE
+
+    style RB fill:#dcfce7,stroke:#16a34a
+    style EX fill:#dcfce7,stroke:#16a34a
+    style SSE fill:#dcfce7,stroke:#16a34a
+```
+
+---
+
+# SSE Event Protocol
+
+All events use `text/event-stream` with a monotonic sequence number for ordering.
+
+| Event | Payload | When |
+|-------|---------|------|
+| `token` | `{ text: string }` | Each token from the skill |
+| `artifact` | `{ type, url, meta }` | File/image/code generated |
+| `status` | `{ stepId, state }` | Step state transitions |
+| `error` | `{ code, message, stepId }` | Recoverable errors |
+| `done` | `{ resultId }` | Skill execution complete |
+
+```typescript
+interface StreamEvent {
+  type: 'token' | 'artifact' | 'status' | 'error' | 'done';
+  sessionId: string;
+  stepId: string;
+  payload: unknown;
+  seq: number; // monotonic — client uses this to reorder if needed
+}
+```
+
+---
+layout: section
+---
+
+# Fault Tolerance
+
+*Circuit breakers + checkpoint/resume — live since 2026-04-03*
+
+---
+
+# Circuit Breaker
+
+One breaker per external dependency: LLM provider, storage, search.
+
+```mermaid {scale: 0.9}
+stateDiagram-v2
+    [*] --> CLOSED
+    CLOSED --> OPEN : 5 failures in 30s
+    OPEN --> HALF_OPEN : After cooldown
+    HALF_OPEN --> CLOSED : 1 success
+    HALF_OPEN --> OPEN : 1 failure
+```
+
+- **CLOSED** — Normal operation, requests pass through
+- **OPEN** — Failing dependency, reject fast (no waiting for timeouts)
+- **HALF-OPEN** — Probe with a single request to test recovery
+- **Thresholds**: 5 failures in 30s triggers OPEN; 1 success in HALF-OPEN restores CLOSED
+
+---
+layout: two-cols
+---
+
+# Checkpoint / Resume & DLQ
+
+### Checkpoint/Resume
+- Graph state serialized to **Redis every 10s**
+- On failure, resume from **last completed step**
+- Eliminates re-running expensive steps (LLM calls, data processing)
+
+### Dead-Letter Queue
+- Permanently failed steps → **DLQ**
+- Surfaced in **admin dashboard**
+- Operators can inspect, retry, or discard
+
+### Graceful Degradation
+- Non-critical skill failure → **partial result + warning**
+- User sees what succeeded, not a blank error
+
+::right::
+
+```mermaid {scale: 0.7}
+flowchart TB
+    G["ExecutionGraph\n(running)"] --> CP["Checkpoint\n(Redis, every 10s)"]
+    G --> |"step fails"| R{"Retries\nexhausted?"}
+    R --> |"no"| G
+    R --> |"yes, critical"| DLQ["Dead-Letter Queue\n(admin dashboard)"]
+    R --> |"yes, non-critical"| PR["Partial Result\n+ warning"]
+
+    CP --> |"on resume"| RS["Resume from last\ncompleted step"]
+    RS --> G
+
+    style CP fill:#e0f2fe,stroke:#0284c7
+    style DLQ fill:#fef2f2,stroke:#dc2626
+    style PR fill:#fef3c7,stroke:#d97706
+```
+
+---
+
+# How Streaming + Checkpoints Interact
+
+```mermaid {scale: 0.85}
+sequenceDiagram
+    participant O as Orchestrator
+    participant S as Sandbox
+    participant SP as Streaming Pipeline
+    participant R as Redis
+    participant C as Client
+
+    O->>S: Execute step 3
+    S->>SP: Token stream
+    SP->>C: SSE events (real-time)
+    O->>R: Checkpoint (graph state, step 3 running)
+    Note over R: Captures graph state,<br/>NOT stream state
+    S->>O: Step 3 complete
+    O->>R: Checkpoint (step 3 done)
+    O->>S: Execute step 4
+    Note over O,C: If crash here → resume from step 4<br/>Stream restarts from step 4 output
+```
+
+**Key design decision:** Checkpoints capture graph state, not stream state. On resume, the stream restarts from the last completed step's output. This keeps the checkpoint small and avoids replaying token history.
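Under that design decision, a checkpoint record and the resume rule can be sketched as follows. The shapes are illustrative assumptions, not the actual Redis schema.

```typescript
// Hypothetical checkpoint shape: graph state only, no token history.
interface Checkpoint {
  graphId: string;
  completedSteps: string[];            // step ids that finished
  stepOutputs: Record<string, string>; // outputs downstream steps need
  savedAt: number;
}

// On resume, run every step that has not completed and whose
// dependencies have all completed: resume from the last completed step.
function nextRunnableSteps(
  steps: { id: string; dependsOn: string[] }[],
  cp: Checkpoint,
): string[] {
  const done = new Set(cp.completedSteps);
  return steps
    .filter((s) => !done.has(s.id))
    .filter((s) => s.dependsOn.every((d) => done.has(d)))
    .map((s) => s.id);
}
```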
+
+---
+
+# Infrastructure
+
+### Runtime & Compute
+- **Node.js 22** on Fly.io
+- **4 regions**: iad, cdg, nrt, syd
+- **Firecracker** microVMs (prod)
+- **Docker** containers (dev/staging)
+
+### State & Storage
+- **Redis Cluster** — checkpoints + sessions
+- **Vercel Blob** — artifacts (images, files, code)
+- **Supabase Postgres** — metadata
+
+### Observability
+- **OpenTelemetry** → Grafana Cloud
+- Traces, metrics, and logs unified
+- Per-step trace spans for debugging
+
+### Deployment
+- **GitHub Actions** CI/CD
+- Canary rollout: 10% → 50% → 100%
+- Rollback on error-rate spike
+
+---
+layout: center
+class: text-center
+---
+
+# Questions?
+
+Agent Runtime Architecture Review — doany.ai