Abstract
We introduce DoAny-Bench, a holistic benchmark suite for evaluating autonomous AI agents across 1,247 real-world tasks spanning software engineering, data analysis, research synthesis, and creative production. Unlike existing benchmarks that focus on isolated capabilities, DoAny-Bench measures end-to-end task completion with human-validated acceptance criteria. We evaluate 12 leading agent frameworks and find that while frontier models achieve 67% on constrained sub-tasks, performance drops to 34% on multi-step workflows requiring tool orchestration. We release the full evaluation harness, task corpus, and human annotations under an Apache 2.0 license to accelerate progress on general-purpose AI agents.
1. Introduction
Autonomous AI agents are rapidly advancing beyond single-turn question answering toward multi-step, tool-augmented workflows. However, the evaluation landscape remains fragmented: existing benchmarks either test narrow capabilities in isolation (e.g., code generation, web browsing) or rely on synthetic tasks that fail to capture the complexity of real-world work.
DoAny-Bench addresses this gap with three design principles:
- Real-world grounding -- Every task is sourced from actual workflows performed by knowledge workers, not synthetically generated.
- End-to-end evaluation -- Tasks require planning, tool selection, multi-step execution, and self-correction, not just a single API call.
- Human-validated acceptance -- Each task includes acceptance criteria written and verified by domain experts, ensuring evaluation fidelity.
Our benchmark spans four domains that represent the breadth of autonomous agent deployment: software engineering (412 tasks), data analysis (298 tasks), research synthesis (287 tasks), and creative production (250 tasks).
2. Background
2.1 Agent Evaluation Landscape
Prior benchmarks have made important contributions but exhibit key limitations:
| Benchmark | Tasks | Domains | Multi-Step | Human-Validated |
|---|---|---|---|---|
| SWE-bench | 2,294 | SWE only | Partial | No |
| WebArena | 812 | Web only | Yes | No |
| GAIA | 466 | General | Yes | Yes |
| AgentBench | 8 datasets | Mixed | Yes | Partial |
| DoAny-Bench | 1,247 | 4 domains | Yes | Yes |
2.2 Task Complexity Taxonomy
We define three tiers of task complexity:
- Tier 1 -- Constrained (single tool, 1-3 steps): e.g., "Write a unit test for this function."
- Tier 2 -- Compositional (multi-tool, 4-8 steps): e.g., "Analyze this CSV, generate a chart, and write a summary."
- Tier 3 -- Open-ended (orchestration, 9+ steps): e.g., "Debug this CI pipeline failure, fix the root cause, and open a PR."
3. Approach
3.1 Task Collection
Tasks were collected through a three-phase pipeline:
- Sourcing: Partnered with 48 knowledge workers across 12 organizations to record real workflows over a 6-month period.
- Annotation: Each task was decomposed into structured fields -- input context, available tools, reference solution steps, and binary acceptance criteria.
- Validation: Two independent annotators verified each task. Inter-annotator agreement (Cohen's kappa) was 0.87.
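
The structured fields above can be pictured as a simple per-task record. The sketch below is a hypothetical schema for illustration only; the `Task` dataclass and its field names are our own, not the released data format.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Hypothetical DoAny-Bench task record (illustrative field names only)."""
    task_id: str
    domain: str                      # e.g. "software_engineering"
    tier: int                        # 1 = constrained, 2 = compositional, 3 = open-ended
    input_context: str               # files, prompts, or data the agent starts from
    available_tools: list[str]       # tools the sandbox exposes for this task
    reference_steps: list[str]       # expert-written reference solution steps
    acceptance_criteria: list[str]   # binary checks; all must pass for credit

example = Task(
    task_id="swe-0042",
    domain="software_engineering",
    tier=2,
    input_context="repo snapshot with a failing test suite",
    available_tools=["shell", "editor", "git"],
    reference_steps=["reproduce failure", "patch function", "rerun tests", "commit"],
    acceptance_criteria=["test suite passes", "no unrelated files modified"],
)
```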
3.2 Evaluation Harness
The DoAny-Bench harness provides:
- Sandboxed execution -- Docker-based environments with pre-configured tool access per task.
- Deterministic replay -- Recorded API responses for reproducible evaluation without live service dependencies.
- Automated + human scoring -- Automated checks for objective criteria, with human judges for subjective quality dimensions.
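
The harness interface is not spelled out here, so the following is a minimal sketch of how sandboxed execution with recorded-response replay might be wired up. The `run_task` wrapper, the container entrypoint, the `agent_image` and `replay_dir` arguments, and the JSON trace on stdout are all assumptions for illustration, not the released API.

```python
import json
import subprocess

def run_task(task_id: str, agent_image: str, replay_dir: str, max_steps: int = 50) -> dict:
    """Hypothetical wrapper: launch a per-task Docker sandbox and collect the agent's trace."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",                     # no live services; replay recorded API responses
            "-v", f"{replay_dir}:/replay:ro",        # deterministic replay fixtures, read-only
            "-e", f"TASK_ID={task_id}",
            "-e", f"MAX_STEPS={max_steps}",
            agent_image,
        ],
        capture_output=True, text=True, timeout=600,  # 10-minute wall-clock budget per task
    )
    # Assumes the container prints a JSON trace (steps taken, tool calls, final outputs).
    return json.loads(result.stdout)
```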
3.3 Metrics
We report three primary metrics:
- Task Success Rate (TSR): Binary pass/fail against acceptance criteria.
- Step Completion Rate (SCR): Fraction of reference solution steps completed.
- Efficiency Score (ES): Ratio of optimal steps to actual steps taken (penalizes unnecessary actions).
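
In code, the three metrics reduce to a few lines. The sketch below assumes the harness reports, per task, a pass/fail verdict, the number of reference steps completed, and the step counts; the function names are our own.

```python
def task_success_rate(passed: list[bool]) -> float:
    """TSR: fraction of tasks whose acceptance criteria all passed."""
    return sum(passed) / len(passed)

def step_completion_rate(completed_steps: int, reference_steps: int) -> float:
    """SCR: fraction of reference solution steps the agent completed."""
    return completed_steps / reference_steps

def efficiency_score(optimal_steps: int, actual_steps: int) -> float:
    """ES: optimal / actual steps, capped at 1.0; drops as the agent takes unnecessary actions."""
    return min(1.0, optimal_steps / actual_steps) if actual_steps else 0.0

# Example: a Tier 2 task whose reference solution needs 5 steps, solved in 7
print(efficiency_score(optimal_steps=5, actual_steps=7))  # ~0.71
```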
4. Experimental Setup
4.1 Evaluated Frameworks
We evaluate 12 agent frameworks across three model families:
| Framework | Model Backend | Agent Architecture |
|---|---|---|
| ReAct | GPT-4o, Claude 3.5 | Reason + Act loop |
| AutoGPT | GPT-4o | Goal-driven planner |
| OpenHands | Claude 3.5 Sonnet | Code-act agent |
| LangGraph | GPT-4o, Claude 3.5 | Graph-based orchestration |
| CrewAI | GPT-4o | Multi-agent crew |
| AIDE | Claude 3.5 | Data science specialist |
| SWE-Agent | GPT-4o, Claude 3.5 | SWE-specific agent |
| Devin | Proprietary | Full-stack agent |
| Claude Code | Claude 3.5 Sonnet | Terminal-native agent |
| Cursor Agent | GPT-4o, Claude 3.5 | IDE-integrated agent |
| Copilot Workspace | GPT-4o | Plan-and-execute |
| Custom Baseline | GPT-4o | Zero-shot chain-of-thought |
4.2 Configuration
- Each agent was given identical tool access and context windows per task.
- Maximum budget: 50 steps or 10 minutes wall-clock per task.
- Three independent runs per (framework, task) pair; majority vote determines pass/fail.
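
Under this protocol, a (framework, task) pair counts as a pass if at least two of its three runs pass. A minimal sketch of that aggregation (helper name is ours):

```python
from collections import Counter

def majority_pass(run_results: list[bool]) -> bool:
    """Majority vote over the three independent runs for one (framework, task) pair."""
    assert len(run_results) == 3, "protocol specifies three runs per pair"
    return Counter(run_results)[True] >= 2

print(majority_pass([True, False, True]))   # True  -> counts as a pass
print(majority_pass([True, False, False]))  # False -> counts as a failure
```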
5. Results & Analysis
5.1 Headline Results
| Tier | Best TSR | Average TSR | Human Baseline |
|---|---|---|---|
| Tier 1 (Constrained) | 72% | 67% | 94% |
| Tier 2 (Compositional) | 51% | 43% | 91% |
| Tier 3 (Open-ended) | 38% | 34% | 88% |
| Overall | 54% | 48% | 91% |
Key finding: Frontier models achieve 67% on constrained sub-tasks but drop to 34% on multi-step workflows requiring tool orchestration -- a 33-point gap that highlights the challenge of sustained, multi-step reasoning.
5.2 Domain Breakdown
| Domain | Top Agent | TSR | SCR | ES |
|---|---|---|---|---|
| Software Engineering | Claude Code | 58% | 71% | 0.64 |
| Data Analysis | AIDE | 52% | 68% | 0.59 |
| Research Synthesis | ReAct + Claude 3.5 | 41% | 55% | 0.48 |
| Creative Production | LangGraph + GPT-4o | 39% | 52% | 0.45 |
5.3 Failure Mode Analysis
The most common failure modes across all frameworks:
- Premature termination (28%) -- Agent declares task complete before meeting all acceptance criteria.
- Tool misselection (22%) -- Agent selects an inappropriate tool for the current step.
- Context loss (19%) -- Agent loses track of earlier results in long sequences.
- Error non-recovery (16%) -- Agent fails to recover from tool errors or unexpected outputs.
- Scope creep (15%) -- Agent performs unnecessary actions that waste budget.
5.4 Scaling Analysis
We observe log-linear scaling between model compute and Tier 1 performance, but diminishing returns on Tier 3 tasks -- suggesting that multi-step orchestration requires architectural improvements beyond simply scaling inference compute.
6. Conclusion
DoAny-Bench reveals a clear gap between current agent capabilities on isolated tasks versus realistic, multi-step workflows. Our key contributions:
- A benchmark of 1,247 human-sourced, human-validated tasks across four professional domains.
- Comprehensive evaluation of 12 frameworks, establishing the first cross-framework comparison on identical real-world tasks.
- Identification of the "orchestration gap" -- the 33-point TSR drop from constrained to open-ended tasks.
- A fully open evaluation harness released under Apache 2.0 to enable reproducible agent research.
We believe closing the orchestration gap is the critical frontier for autonomous AI agents, and we hope DoAny-Bench serves as a reliable measuring stick for progress.
Resources
| Resource | Link |
|---|---|
| Paper | arxiv.org/abs/2504.02495 |
| Paper Page | huggingface.co/papers/2504.02495 |
| Model | doany-ai/doany-bench-agent-v1 |
| Dataset | doany-ai/doany-bench-data |
| License | Apache 2.0 |
References
- Chen, A., Sharma, P., Liu, M., & Reyes, S. (2026). DoAny-Bench: A Comprehensive Benchmark for Evaluating Autonomous AI Agents on Real-World Tasks. arXiv preprint arXiv:2504.02495.
- Jimenez, C. E., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.
- Zhou, S., et al. (2024). WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR 2024.
- Mialon, G., et al. (2024). GAIA: A Benchmark for General AI Assistants. ICLR 2024.
- Liu, X., et al. (2024). AgentBench: Evaluating LLMs as Agents. ICLR 2024.
- Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
@article{chen2026doanybench,
title={DoAny-Bench: A Comprehensive Benchmark for Evaluating Autonomous AI Agents on Real-World Tasks},
author={Chen, Alex and Sharma, Priya and Liu, Marcus and Reyes, Sofia},
journal={arXiv preprint arXiv:2504.02495},
year={2026}
}