Agent Eval Pipeline
Automated Discovery, Testing & Publishing for 850+ AI Agent Skills
850+ Skills
87% Correctness
<2m Publish
01
DISCOVER
Crawl registries & build unified skill catalog
GitHub skills.sh Community
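The discovery stage above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: skill names are normalized and the first registry in priority order wins on collision. `SkillRef` and the source names are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SkillRef:
    name: str
    source: str   # e.g. "github", "skills.sh", "community"
    url: str

def dedupe_catalog(refs):
    """Merge skill refs from multiple registries into one catalog,
    keeping the first occurrence of each normalized name
    (list order = source priority)."""
    catalog = {}
    for ref in refs:
        key = ref.name.strip().lower().replace("_", "-")
        catalog.setdefault(key, ref)
    return list(catalog.values())

refs = [
    SkillRef("PDF-Extract", "github", "https://github.com/a/pdf-extract"),
    SkillRef("pdf_extract", "skills.sh", "https://skills.sh/pdf-extract"),
    SkillRef("csv-merge", "community", "https://example.org/csv-merge"),
]
unique = dedupe_catalog(refs)
# "PDF-Extract" and "pdf_extract" collapse to one entry; the GitHub copy wins
```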
02
ANALYZE
Parse SKILL.md, classify & score complexity
Metadata Dependencies
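A rough sketch of the analyze stage, assuming a conventional SKILL.md layout (frontmatter between `---` fences, then a markdown body); real files and the real complexity heuristic may differ:

```python
import re

def parse_skill_md(text):
    """Split a SKILL.md into (metadata dict, body). Assumes simple
    'key: value' frontmatter between '---' fences."""
    m = re.match(r"---\n(.*?)\n---\n(.*)", text, re.S)
    meta, body = {}, text
    if m:
        for line in m.group(1).splitlines():
            k, _, v = line.partition(":")
            meta[k.strip()] = v.strip()
        body = m.group(2)
    return meta, body

def complexity_score(meta, body):
    """Toy heuristic: more dependencies and longer instructions
    mean higher complexity."""
    deps = [d for d in meta.get("dependencies", "").split(",") if d.strip()]
    return len(deps) * 2 + len(body.split()) // 100

sample = """---
name: pdf-extract
dependencies: poppler, pillow
---
Extract text from scanned PDF documents."""
meta, body = parse_skill_md(sample)
```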
03
EVALUATE
Run sandboxed tests in Claude Code & capture metrics
3,200+ Scenarios 5D Scoring
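The metrics captured per sandboxed session might roll up like this. A minimal sketch under assumed field names (`tool_calls`, `tokens`, `passed`), not the pipeline's real schema:

```python
from dataclasses import dataclass

@dataclass
class EvalRun:
    scenario: str
    tool_calls: int
    tokens: int
    passed: bool

def aggregate(runs):
    """Roll the sandbox runs for one skill up into summary metrics."""
    n = len(runs)
    return {
        "scenarios": n,
        "pass_rate": sum(r.passed for r in runs) / n,
        "avg_tool_calls": sum(r.tool_calls for r in runs) / n,
    }

runs = [
    EvalRun("extract-basic", 4, 1800, True),
    EvalRun("extract-scanned", 9, 5200, True),
    EvalRun("extract-encrypted", 6, 3100, False),
    EvalRun("extract-multilang", 5, 2400, True),
]
metrics = aggregate(runs)
```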
04
PUBLISH
Push to CDN & update search index in <2 min
Vercel Blob Embed Cards
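The fast-publish pattern implied above (immutable versioned artifacts plus an atomically swapped index) can be sketched with a local directory standing in for a real blob store such as Vercel Blob. The paths and payload shape are illustrative assumptions:

```python
import json, time, pathlib

def publish(skill_id, payload, root="cdn"):
    """Write an immutable versioned artifact, then atomically swap the
    search-index pointer so readers never see a partial index."""
    root = pathlib.Path(root)
    root.mkdir(exist_ok=True)
    version = int(time.time())
    artifact = root / f"{skill_id}-{version}.json"
    artifact.write_text(json.dumps(payload))

    index_path = root / "index.json"
    index = json.loads(index_path.read_text()) if index_path.exists() else {}
    index[skill_id] = artifact.name
    tmp = root / "index.json.tmp"
    tmp.write_text(json.dumps(index))
    tmp.replace(index_path)   # atomic rename on the same filesystem
    return artifact

artifact = publish("pdf-extract", {"correctness": 87}, root="._cdn_demo")
```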
850+ Skills Indexed
3,200+ Eval Scenarios
~45m Full Pipeline
<2m Publish Latency
87% Average Correctness Score
CORRECTNESS 87%: Meets specification
COMPLETENESS 82%: All deliverables present
QUALITY 85%: Production-grade output
EFFICIENCY 78%: Token & tool economy
SAFETY 91%: No leaks or destructive ops
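Combining the five dimensions into one headline number could look like the sketch below. The equal weights are an assumption for illustration; the pipeline's actual weighting is not stated here.

```python
# Assumed equal weights across the five scoring dimensions.
WEIGHTS = {
    "correctness": 0.2, "completeness": 0.2, "quality": 0.2,
    "efficiency": 0.2, "safety": 0.2,
}

def composite(scores):
    """Weighted average of per-dimension scores, rounded for display."""
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 1)

scores = {"correctness": 87, "completeness": 82, "quality": 85,
          "efficiency": 78, "safety": 91}
```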

Multi-Source Auto-Discovery

No manual curation — crawls & deduplicates across registries

Sandboxed Claude Code Eval

Real isolated sessions with full tool-call capture

Continuous Re-Evaluation

Auto-retests as models improve — always fresh scores

CDN-First <2 min Publishing

Vercel Blob push with instant search index updates

Embed-Ready Partner Cards

Drop-in integration for partner ecosystems