--- a/slides.md
+++ b/slides.md
@@ -1,50 +1,457 @@
 ---
 theme: default
-title: Agent Runtime Architecture Review
+title: "Agent Runtime Architecture Review — doany.ai"
+info: Architecture review covering orchestration, sandboxing, streaming pipeline, and fault tolerance.
+drawings:
+  enabled: false
+transition: slide-left
+mdc: true
 ---
 
 # Agent Runtime Architecture Review
 
 doany.ai — Platform Team
-April 2026
+April 2026 · Engineering Leadership Review
+
 ---
 
 # Agenda
 
-- Orchestrator overview
-- Sandbox runtime
-- Q&A
-
+- **System Overview** — What the agent runtime does
+- **End-to-End Data Flow** — Full path from user input to rendered result
+- **Orchestrator** — ExecutionGraph, DAG scheduling, retry logic
+- **Sandbox Runtime** — Isolation model with Firecracker / Docker
+- **Streaming Pipeline** *(new)* — Ring buffer, backpressure, SSE protocol
+- **Fault Tolerance** *(new)* — Circuit breakers, checkpoint/resume, DLQ
+- **Streaming + Checkpoints** — How the new systems interact
+- **Infrastructure & Observability**
+
+---
+layout: section
+---
+
+# System Overview
+
+---
+
+# What the Agent Runtime Does
+
+The runtime is the execution engine behind every skill invocation on doany.ai.
+
+- Receives **user intents** (natural language or structured)
+- **Selects and composes** skills via the Skill Registry
+- **Orchestrates** execution as a DAG of sandboxed steps
+- **Streams** results back to the frontend in real time
+- **Recovers** from failures without losing progress
+
+**Key invariant:** Every skill step runs in an isolated sandbox. The runtime never executes user-triggered code in the host process.
+
+---
+
+# End-to-End Data Flow
+
+```mermaid {scale: 0.85}
+flowchart LR
+    A["User Input"] --> B["Intent Router"]
+    B --> C["Skill Registry"]
+    C --> D["Orchestrator"]
+    D --> E["ExecutionGraph\n(DAG)"]
+    E --> F["Sandbox Runtime\n(Firecracker / Docker)"]
+    F --> G["Streaming Pipeline\n(Ring Buffer + SSE)"]
+    G --> H["Result Aggregator\n(Vercel Blob)"]
+    H --> I["Frontend\n(SSE Consumer)"]
+
+    style B fill:#e0f2fe,stroke:#0284c7
+    style D fill:#e0f2fe,stroke:#0284c7
+    style F fill:#fef3c7,stroke:#d97706
+    style G fill:#dcfce7,stroke:#16a34a
+    style H fill:#dcfce7,stroke:#16a34a
+```
+
+**Routing & Scheduling** · **Isolated Execution** · **Streaming & Delivery**
+
+---
+layout: section
+---
+
 # Orchestrator
 
-- Receives a SkillPlan, builds an ExecutionGraph (DAG)
-- Parallel fan-out for independent steps
-- Retry: exponential backoff, max 3 attempts
-
-```typescript
+---
+layout: two-cols
+---
+
+# Orchestrator Deep-Dive
+
+Receives a `SkillPlan`, builds an `ExecutionGraph` (DAG)
+
+- **Parallel fan-out** for independent skill steps
+- **Typed channels** pass data between steps
+- **Retry**: exponential backoff + jitter, max 3 attempts
+- **Timeouts**: 120s per step, 600s per graph
+- **Checkpoints**: graph state → Redis every 10s
+
+::right::
+
+```typescript {all|3-4|5|6|7}
+interface SkillPlan {
+  intentId: string;
+  skills: SkillStep[];
+  constraints: ExecutionConstraints;
+}
+
+interface SkillStep {
+  skillId: string;
+  inputs: Record<string, unknown>;
+  dependsOn: string[];
+  timeout: number;
+  retryPolicy: RetryPolicy;
+}
+
 interface ExecutionGraph {
   id: string;
   steps: Map<string, SkillStep>;
   edges: [string, string][];
-  status: 'pending' | 'running' | 'completed' | 'failed';
+  status: 'pending' | 'running'
+    | 'completed' | 'failed';
+  checkpointKey?: string;
 }
 ```
 
 ---
 
 # Sandbox Runtime
 
-- Firecracker microVMs (prod) / Docker (dev)
-- Resource limits: 2 vCPU, 4GB RAM, 10GB disk
-- Network deny-by-default
-- Read-only FS except `/workspace` and `/output`
-
----
-
-# Thank You
-
-Questions?
+Firecracker microVMs in production, Docker containers in dev/staging.
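The retry policy on the Orchestrator slide above (exponential backoff plus jitter, max 3 attempts) can be sketched as a small wrapper. A minimal illustration under assumed base delays, not the runtime's actual code:

```typescript
// Illustrative retry wrapper: exponential backoff with jitter, capped at
// 3 attempts. The 500ms base and 250ms jitter window are assumptions.
async function withRetry<T>(
  step: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await step();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break;
      // Exponential backoff (500ms, 1s, 2s, ...) plus up to 250ms of jitter.
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

The jitter matters when many steps fail at once: it spreads the retries out instead of hammering a recovering dependency in lockstep.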
+
+### Resource Limits
+
+- **2 vCPU** per sandbox
+- **4 GB RAM**
+- **10 GB ephemeral disk**
+- Network **deny-by-default**
+- Skills must declare required domains
+
+### Filesystem Model
+
+- Root FS: **read-only**
+- `/workspace`: read-write (skill working dir)
+- `/output`: read-write (results written here)
+- Mounts destroyed after step completes
+
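The limits and mount model above could be written down as a declarative sandbox spec. A hedged sketch: `SandboxSpec` and its field names are invented for illustration and are not the runtime's real schema.

```typescript
// Hypothetical sandbox spec mirroring the limits on this slide.
// Field names are illustrative, not the actual configuration format.
interface SandboxSpec {
  vcpus: number;
  memoryMb: number;
  diskGb: number;
  networkPolicy: { default: "deny"; allowedDomains: string[] };
  mounts: { path: string; mode: "ro" | "rw" }[];
}

const defaultSpec: SandboxSpec = {
  vcpus: 2,
  memoryMb: 4096,
  diskGb: 10,
  // Deny-by-default: a skill must declare every domain it needs.
  networkPolicy: { default: "deny", allowedDomains: [] },
  mounts: [
    { path: "/", mode: "ro" },          // read-only root FS
    { path: "/workspace", mode: "rw" }, // skill working dir
    { path: "/output", mode: "rw" },    // results written here
  ],
};
```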
+
+**Why Firecracker?** Sub-second boot time (~125ms), strong isolation via KVM, smaller attack surface than containers. Docker is used in dev for faster iteration.
+
+---
+layout: section
+---
+
+# Streaming Pipeline
+
+*Shipped 2026-04-09 — replaces the batch-then-send model*
+
+---
+layout: two-cols
+---
+
+# Streaming Architecture
+
+- **Token-level streaming** replaces old batch model
+- **Ring buffer** per active session (64 KB capacity)
+- **Backpressure**: if client falls behind, buffer drops intermediate tokens but preserves final state
+- **Inline extractors** detect code blocks, tables, and diagrams as they stream
+
+**Impact:** Time-to-first-token dropped from ~3s (batch) to ~180ms (streaming).
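The drop-intermediate, keep-final backpressure rule can be sketched with a toy buffer. Capacity is counted in tokens here for readability; the real buffer is a 64 KB byte ring, and `TokenRingBuffer` is an illustrative name, not the production class.

```typescript
// Toy sketch of the backpressure rule: when the buffer is full, drop the
// oldest *intermediate* token, but never a token marked as final state.
class TokenRingBuffer {
  private tokens: { text: string; final: boolean }[] = [];
  constructor(private capacity: number) {}

  push(text: string, final = false): void {
    if (this.tokens.length >= this.capacity) {
      // Drop the oldest non-final token to make room.
      const idx = this.tokens.findIndex((t) => !t.final);
      if (idx >= 0) this.tokens.splice(idx, 1);
    }
    this.tokens.push({ text, final });
  }

  // Hand everything buffered so far to the SSE emitter.
  drain(): string[] {
    const out = this.tokens.map((t) => t.text);
    this.tokens = [];
    return out;
  }
}
```

A slow client therefore sees a thinned-out stream of intermediate tokens, but the final state always arrives intact.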
+
+::right::
+
+```mermaid {scale: 0.75}
+flowchart TB
+    S["Sandbox Output"] --> RB["Ring Buffer\n(64 KB per session)"]
+    RB --> EX["Inline Extractors\n(code, tables, diagrams)"]
+    EX --> SSE["SSE Emitter"]
+    SSE --> C["Client"]
+
+    RB -->|"backpressure"| BP["Drop intermediate\ntokens, keep final"]
+    BP --> SSE
+
+    style RB fill:#dcfce7,stroke:#16a34a
+    style EX fill:#dcfce7,stroke:#16a34a
+    style SSE fill:#dcfce7,stroke:#16a34a
+```
+
+---
+
+# SSE Event Protocol
+
+All events use `text/event-stream` with a monotonic sequence number for ordering.
+
+| Event | Payload | When |
+|-------|---------|------|
+| `token` | `{ text: string }` | Each token from the skill |
+| `artifact` | `{ type, url, meta }` | File/image/code generated |
+| `status` | `{ stepId, state }` | Step state transitions |
+| `error` | `{ code, message, stepId }` | Recoverable errors |
+| `done` | `{ resultId }` | Skill execution complete |
+
+```typescript
+interface StreamEvent {
+  type: 'token' | 'artifact' | 'status' | 'error' | 'done';
+  sessionId: string;
+  stepId: string;
+  payload: unknown;
+  seq: number; // monotonic — client uses this to reorder if needed
+}
+```
+
+---
+layout: section
+---
+
+# Fault Tolerance
+
+*Circuit breakers + checkpoint/resume — live since 2026-04-03*
+
+---
+
+# Circuit Breaker
+
+One breaker per external dependency: LLM provider, storage, search.
+
+```mermaid {scale: 0.9}
+stateDiagram-v2
+    [*] --> CLOSED
+    CLOSED --> OPEN : 5 failures in 30s
+    OPEN --> HALF_OPEN : After cooldown
+    HALF_OPEN --> CLOSED : 1 success
+    HALF_OPEN --> OPEN : 1 failure
+```
+
+- **CLOSED** — Normal operation, requests pass through
+- **OPEN** — Failing dependency, reject fast (no waiting for timeouts)
+- **HALF-OPEN** — Probe with a single request to test recovery
+- **Thresholds**: 5 failures in 30s triggers OPEN; 1 success in HALF-OPEN restores CLOSED
+
+---
+layout: two-cols
+---
+
+# Checkpoint / Resume & DLQ
+
+### Checkpoint/Resume
+- Graph state serialized to **Redis every 10s**
+- On failure, resume from **last completed step**
+- Eliminates re-running expensive steps (LLM calls, data processing)
+
+### Dead-Letter Queue
+- Permanently failed steps → **DLQ**
+- Surfaced in **admin dashboard**
+- Operators can inspect, retry, or discard
+
+### Graceful Degradation
+- Non-critical skill failure → **partial result + warning**
+- User sees what succeeded, not a blank error
+
+::right::
+
+```mermaid {scale: 0.7}
+flowchart TB
+    G["ExecutionGraph\n(running)"] --> CP["Checkpoint\n(Redis, every 10s)"]
+    G --> |"step fails"| R{"Retries\nexhausted?"}
+    R --> |"no"| G
+    R --> |"yes, critical"| DLQ["Dead-Letter Queue\n(admin dashboard)"]
+    R --> |"yes, non-critical"| PR["Partial Result\n+ warning"]
+
+    CP --> |"on resume"| RS["Resume from last\ncompleted step"]
+    RS --> G
+
+    style CP fill:#e0f2fe,stroke:#0284c7
+    style DLQ fill:#fef2f2,stroke:#dc2626
+    style PR fill:#fef3c7,stroke:#d97706
+```
+
+---
+
+# How Streaming + Checkpoints Interact
+
+```mermaid {scale: 0.85}
+sequenceDiagram
+    participant O as Orchestrator
+    participant S as Sandbox
+    participant SP as Streaming Pipeline
+    participant R as Redis
+    participant C as Client
+
+    O->>S: Execute step 3
+    S->>SP: Token stream
+    SP->>C: SSE events (real-time)
+    O->>R: Checkpoint (graph state, step 3 running)
+    Note over R: Captures graph state,<br/>NOT stream state
+    S->>O: Step 3 complete
+    O->>R: Checkpoint (step 3 done)
+    O->>S: Execute step 4
+    Note over O,C: If crash here → resume from step 4<br/>Stream restarts from step 4 output
+```
+
+**Key design decision:** Checkpoints capture graph state, not stream state. On resume, the stream restarts from the last completed step's output. This keeps the checkpoint small and avoids replaying token history.
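Under that design decision, a checkpoint record and the resume rule can be sketched as follows. The shapes are illustrative assumptions, not the actual Redis schema.

```typescript
// Hypothetical checkpoint shape: graph state only, no token history.
interface Checkpoint {
  graphId: string;
  completedSteps: string[];            // step ids that finished
  stepOutputs: Record<string, string>; // outputs downstream steps need
  savedAt: number;
}

// On resume, run every step that has not completed and whose
// dependencies have all completed: resume from the last completed step.
function nextRunnableSteps(
  steps: { id: string; dependsOn: string[] }[],
  cp: Checkpoint,
): string[] {
  const done = new Set(cp.completedSteps);
  return steps
    .filter((s) => !done.has(s.id))
    .filter((s) => s.dependsOn.every((d) => done.has(d)))
    .map((s) => s.id);
}
```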
+
+---
+
+# Infrastructure
+
+### Runtime & Compute
+- **Node.js 22** on Fly.io
+- **4 regions**: iad, cdg, nrt, syd
+- **Firecracker** microVMs (prod)
+- **Docker** containers (dev/staging)
+
+### State & Storage
+- **Redis Cluster** — checkpoints + sessions
+- **Vercel Blob** — artifacts (images, files, code)
+- **Supabase Postgres** — metadata
+
+### Observability
+- **OpenTelemetry** → Grafana Cloud
+- Traces, metrics, and logs unified
+- Per-step trace spans for debugging
+
+### Deployment
+- **GitHub Actions** CI/CD
+- Canary rollout: 10% → 50% → 100%
+- Rollback on error-rate spike
+
+---
+layout: center
+class: text-center
+---
+
+# Questions?
+
+Agent Runtime Architecture Review — doany.ai