;
edges: [string, string][];
status: 'pending' | 'running' | 'completed' | 'failed';
+ checkpointKey?: string;
}
```
+
+
---
# Sandbox Runtime
-- Firecracker microVMs (prod) / Docker (dev)
-- Resource limits: 2 vCPU, 4GB RAM, 10GB disk
-- Network deny-by-default
-- Read-only FS except `/workspace` and `/output`
-
----
-
-# Thank You
-
-Questions?
+Firecracker microVMs in production, Docker containers in dev/staging.
+
+
+
+
+### Resource Limits
+
+
+
+- **2 vCPU** per sandbox
+- **4 GB RAM**
+- **10 GB ephemeral disk**
+- Network **deny-by-default**
+- Skills must declare required domains
+
+### Filesystem Model
+
+
+
+- Root FS: **read-only**
+- `/workspace`: read-write (skill working dir)
+- `/output`: read-write (results written here)
+- Mounts destroyed after step completes
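Taken together, the limits and filesystem rules form a small policy surface. A sketch of a skill manifest that could express them (field names here are illustrative assumptions, not the shipped schema):

```typescript
// Illustrative manifest shape; field names are assumptions, not the shipped schema.
interface SandboxLimits {
  vcpus: number;    // 2 per sandbox
  memoryMb: number; // 4096
  diskMb: number;   // 10240, ephemeral
}

interface SkillManifest {
  name: string;
  allowedDomains: string[]; // network is deny-by-default; skills declare what they need
  writablePaths: string[];  // root FS is read-only; only these paths are writable
}

const defaultLimits: SandboxLimits = { vcpus: 2, memoryMb: 4096, diskMb: 10240 };

// Deny-by-default check: a request is allowed only if its host was declared.
function isAllowed(manifest: SkillManifest, host: string): boolean {
  return manifest.allowedDomains.includes(host);
}
```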
+
+Why Firecracker? Sub-second boot time (~125ms), strong isolation via KVM, smaller attack surface than containers. Docker is used in dev for faster iteration.
+
+
+
+
+---
+layout: section
+---
+
+# Streaming Pipeline
+
+*Shipped 2026-04-09 — replaces the batch-then-send model*
+
+---
+layout: two-cols
+---
+
+# Streaming Architecture
+
+
+
+- **Token-level streaming** replaces old batch model
+- **Ring buffer** per active session (64 KB capacity)
+- **Backpressure**: if the client falls behind, the buffer drops intermediate tokens but preserves the final state
+- **Inline extractors** detect code blocks, tables, and diagrams as they stream
+
+
+
+
+Impact: Time-to-first-token dropped from ~3s (batch) to ~180ms (streaming).
+
+
+::right::
+
+```mermaid {scale: 0.75}
+flowchart TB
+ S["Sandbox Output"] --> RB["Ring Buffer\n(64 KB per session)"]
+ RB --> EX["Inline Extractors\n(code, tables, diagrams)"]
+ EX --> SSE["SSE Emitter"]
+ SSE --> C["Client"]
+
+ RB -->|"backpressure"| BP["Drop intermediate\ntokens, keep final"]
+ BP --> SSE
+
+ style RB fill:#dcfce7,stroke:#16a34a
+ style EX fill:#dcfce7,stroke:#16a34a
+ style SSE fill:#dcfce7,stroke:#16a34a
+```
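A toy sketch of the drop-on-backpressure policy described on the left; the capacity and API are simplified assumptions (the real buffer is a 64 KB ring per session):

```typescript
// Simplified backpressure buffer: when full, drop the oldest intermediate
// tokens, but always keep the final token once it arrives.
class TokenBuffer {
  private tokens: string[] = [];
  private final: string | null = null;

  constructor(private readonly capacity: number) {}

  push(token: string, isFinal = false): void {
    if (isFinal) { this.final = token; return; }          // final state is never dropped
    if (this.tokens.length >= this.capacity) this.tokens.shift(); // drop oldest intermediate
    this.tokens.push(token);
  }

  drain(): string[] {
    const out = this.final !== null ? [...this.tokens, this.final] : [...this.tokens];
    this.tokens = [];
    return out;
  }
}
```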
+
+
+
+---
+
+# SSE Event Protocol
+
+All events use `text/event-stream` with a monotonic sequence number for ordering.
+
+| Event | Payload | When |
+|-------|---------|------|
+| `token` | `{ text: string }` | Each token from the skill |
+| `artifact` | `{ type, url, meta }` | File/image/code generated |
+| `status` | `{ stepId, state }` | Step state transitions |
+| `error` | `{ code, message, stepId }` | Recoverable errors |
+| `done` | `{ resultId }` | Skill execution complete |
+
+
+
+```typescript
+interface StreamEvent {
+ type: 'token' | 'artifact' | 'status' | 'error' | 'done';
+ sessionId: string;
+ stepId: string;
+ payload: unknown;
+ seq: number; // monotonic — client uses this to reorder if needed
+}
+```
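One way a client might use the monotonic `seq` field to repair out-of-order delivery; the buffering strategy below is an illustrative assumption, not the shipped client:

```typescript
// Buffers out-of-order events and releases them in seq order.
class SeqReorderer<E extends { seq: number }> {
  private nextSeq = 0;
  private pending = new Map<number, E>();

  push(event: E, deliver: (e: E) => void): void {
    this.pending.set(event.seq, event);
    // Release the longest contiguous run starting at the next expected seq.
    while (this.pending.has(this.nextSeq)) {
      deliver(this.pending.get(this.nextSeq)!);
      this.pending.delete(this.nextSeq);
      this.nextSeq += 1;
    }
  }
}
```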
+
+
+
+
+
+---
+layout: section
+---
+
+# Fault Tolerance
+
+*Circuit breakers + checkpoint/resume — live since 2026-04-03*
+
+---
+
+# Circuit Breaker
+
+One breaker per external dependency: LLM provider, storage, search.
+
+```mermaid {scale: 0.9}
+stateDiagram-v2
+ [*] --> CLOSED
+ CLOSED --> OPEN : 5 failures in 30s
+ OPEN --> HALF_OPEN : After cooldown
+ HALF_OPEN --> CLOSED : 1 success
+ HALF_OPEN --> OPEN : 1 failure
+```
+
+
+
+- **CLOSED** — Normal operation, requests pass through
+- **OPEN** — Failing dependency, reject fast (no waiting for timeouts)
+- **HALF_OPEN** — Probe with a single request to test recovery
+- **Thresholds**: 5 failures in 30s triggers OPEN; 1 success in HALF_OPEN restores CLOSED
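The state machine above can be sketched as follows; the failure and success thresholds come from this slide, while the cooldown length and class shape are assumptions:

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  private state: BreakerState = 'CLOSED';
  private failures: number[] = []; // timestamps (ms) of recent failures
  private openedAt = 0;

  constructor(
    private readonly maxFailures = 5,     // failures within the window that trip the breaker
    private readonly windowMs = 30_000,   // sliding failure window
    private readonly cooldownMs = 10_000, // assumed cooldown before probing
    private readonly now: () => number = Date.now,
  ) {}

  getState(): BreakerState {
    // OPEN transitions to HALF_OPEN once the cooldown elapses.
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'HALF_OPEN';
    }
    return this.state;
  }

  canRequest(): boolean {
    return this.getState() !== 'OPEN'; // reject fast while OPEN
  }

  recordSuccess(): void {
    // A single success while probing restores CLOSED.
    if (this.getState() === 'HALF_OPEN') this.state = 'CLOSED';
    this.failures = [];
  }

  recordFailure(): void {
    const t = this.now();
    if (this.getState() === 'HALF_OPEN') {
      this.state = 'OPEN'; // one failure while probing re-opens
      this.openedAt = t;
      return;
    }
    this.failures = this.failures.filter(f => t - f < this.windowMs);
    this.failures.push(t);
    if (this.failures.length >= this.maxFailures) {
      this.state = 'OPEN'; // 5 failures in 30s trips the breaker
      this.openedAt = t;
      this.failures = [];
    }
  }
}
```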
+
+
+
+
+
+---
+layout: two-cols
+---
+
+# Checkpoint / Resume & DLQ
+
+
+
+### Checkpoint/Resume
+- Graph state serialized to **Redis every 10s**
+- On failure, resume from **last completed step**
+- Eliminates re-running expensive steps (LLM calls, data processing)
+
+### Dead-Letter Queue
+- Permanently failed steps → **DLQ**
+- Surfaced in **admin dashboard**
+- Operators can inspect, retry, or discard
+
+### Graceful Degradation
+- Non-critical skill failure → **partial result + warning**
+- User sees what succeeded, not a blank error
+
+
+
+::right::
+
+```mermaid {scale: 0.7}
+flowchart TB
+ G["ExecutionGraph\n(running)"] --> CP["Checkpoint\n(Redis, every 10s)"]
+ G --> |"step fails"| R{"Retries\nexhausted?"}
+ R --> |"no"| G
+ R --> |"yes, critical"| DLQ["Dead-Letter Queue\n(admin dashboard)"]
+ R --> |"yes, non-critical"| PR["Partial Result\n+ warning"]
+
+ CP --> |"on resume"| RS["Resume from last\ncompleted step"]
+ RS --> G
+
+ style CP fill:#e0f2fe,stroke:#0284c7
+ style DLQ fill:#fef2f2,stroke:#dc2626
+ style PR fill:#fef3c7,stroke:#d97706
+```
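A minimal sketch of the 10s checkpoint loop and the resume lookup; the key scheme and store interface are assumptions, not the production code:

```typescript
interface GraphCheckpoint {
  sessionId: string;
  completedSteps: string[]; // ordered; the last entry is the resume point
}

// Assumed store shape; in production this would be a Redis client.
interface CheckpointStore {
  set(key: string, value: string): void;
}

// Serialize the running graph's state on a fixed interval (10s in prod).
function startCheckpointing(
  store: CheckpointStore,
  getState: () => GraphCheckpoint,
  intervalMs = 10_000,
): () => void {
  const timer = setInterval(() => {
    const s = getState();
    store.set(`checkpoint:${s.sessionId}`, JSON.stringify(s));
  }, intervalMs);
  return () => clearInterval(timer); // call to stop checkpointing
}

// On resume: skip straight past the last completed step.
function resumePoint(raw: string | null): string | null {
  if (raw === null) return null;
  const s: GraphCheckpoint = JSON.parse(raw);
  return s.completedSteps[s.completedSteps.length - 1] ?? null;
}
```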
+
+
+
+---
+
+# How Streaming + Checkpoints Interact
+
+
+
+```mermaid {scale: 0.85}
+sequenceDiagram
+ participant O as Orchestrator
+ participant S as Sandbox
+ participant SP as Streaming Pipeline
+ participant R as Redis
+ participant C as Client
+
+ O->>S: Execute step 3
+ S->>SP: Token stream
+ SP->>C: SSE events (real-time)
+ O->>R: Checkpoint (graph state, step 3 running)
+    Note over R: Captures graph state,<br/>NOT stream state
+ S->>O: Step 3 complete
+ O->>R: Checkpoint (step 3 done)
+ O->>S: Execute step 4
+    Note over O,C: If crash here → resume from step 4<br/>Stream restarts from step 4 output
+```
+
+
+
+
+Key design decision: Checkpoints capture graph state, not stream state. On resume, the stream restarts from the last completed step's output. This keeps the checkpoint small and avoids replaying token history.
+
+
+
+
+---
+
+# Infrastructure
+
+
+
+
+### Runtime & Compute
+- **Node.js 22** on Fly.io
+- **4 regions**: iad, cdg, nrt, syd
+- **Firecracker** microVMs (prod)
+- **Docker** containers (dev/staging)
+
+### State & Storage
+- **Redis Cluster** — checkpoints + sessions
+- **Vercel Blob** — artifacts (images, files, code)
+- **Supabase Postgres** — metadata
+
+
+
+
+### Observability
+- **OpenTelemetry** → Grafana Cloud
+- Traces, metrics, and logs unified
+- Per-step trace spans for debugging
+
+### Deployment
+- **GitHub Actions** CI/CD
+- Canary rollout: 10% → 50% → 100%
+- Rollback on error-rate spike
+
+
+---
+layout: center
+class: text-center
+---
+
+# Questions?
+
+
+
+Agent Runtime Architecture Review — doany.ai
+
+
+
+