Abstract
We introduce DoAny-Bench, a holistic benchmark suite for evaluating autonomous AI agents across 1,247 real-world tasks spanning software engineering, data analysis, research synthesis, and creative production. Unlike existing benchmarks that focus on isolated capabilities, DoAny-Bench measures end-to-end task completion with human-validated acceptance criteria. We evaluate 12 leading agent frameworks and find that while frontier models achieve 67% on constrained sub-tasks, performance drops to 34% on multi-step workflows requiring tool orchestration. We release the full evaluation harness, task corpus, and human annotations under an Apache 2.0 license to accelerate progress on general-purpose AI agents.
1. Introduction
Autonomous AI agents are rapidly advancing beyond single-turn question answering toward multi-step, tool-augmented workflows. However, the evaluation landscape remains fragmented: existing benchmarks either test narrow capabilities in isolation (e.g., code generation, web browsing) or rely on synthetic tasks that fail to capture the complexity of real-world work.
DoAny-Bench addresses this gap with three design principles:
- Real-world grounding -- Every task is sourced from actual workflows performed by knowledge workers, not synthetically generated.
- End-to-end evaluation -- Tasks require planning, tool selection, multi-step execution, and self-correction, not just a single API call.
- Human-validated acceptance -- Each task includes acceptance criteria written and verified by domain experts, ensuring evaluation fidelity.
Our benchmark spans four domains that represent the breadth of autonomous agent deployment: software engineering (412 tasks), data analysis (298 tasks), research synthesis (287 tasks), and creative production (250 tasks).
2. Background
2.1 Agent Evaluation Landscape
Prior benchmarks have made important contributions but exhibit key limitations:
| Benchmark | Tasks | Domains | Multi-Step | Human-Validated |
|---|---|---|---|---|
| SWE-bench | 2,294 | SWE only | Partial | No |
| WebArena | 812 | Web only | Yes | No |
| GAIA | 466 | General | Yes | Yes |
| AgentBench | 8 datasets | Mixed | Yes | Partial |
| DoAny-Bench | 1,247 | 4 domains | Yes | Yes |
2.2 Task Complexity Taxonomy
We define three tiers of task complexity:
- Tier 1 -- Constrained (single tool, 1-3 steps): e.g., "Write a unit test for this function."
- Tier 2 -- Compositional (multi-tool, 4-8 steps): e.g., "Analyze this CSV, generate a chart, and write a summary."
- Tier 3 -- Open-ended (orchestration, 9+ steps): e.g., "Debug this CI pipeline failure, fix the root cause, and open a PR."
3. Approach
3.1 Task Collection
Tasks were collected through a three-phase pipeline:
- Sourcing: Partnered with 48 knowledge workers across 12 organizations to record real workflows over a 6-month period.
- Annotation: Each task was decomposed into structured fields -- input context, available tools, reference solution steps, and binary acceptance criteria.
- Validation: Two independent annotators verified each task. Inter-annotator agreement (Cohen's kappa) was 0.87.
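
The structured fields above can be pictured as a simple per-task record. The sketch below is a hypothetical schema for illustration only; the `Task` dataclass and its field names are our own, not the released data format.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Hypothetical DoAny-Bench task record (illustrative field names only)."""
    task_id: str
    domain: str                      # e.g. "software_engineering"
    tier: int                        # 1 = constrained, 2 = compositional, 3 = open-ended
    input_context: str               # files, prompts, or data the agent starts from
    available_tools: list[str]       # tools the sandbox exposes for this task
    reference_steps: list[str]       # expert-written reference solution steps
    acceptance_criteria: list[str]   # binary checks; all must pass for credit

example = Task(
    task_id="swe-0042",
    domain="software_engineering",
    tier=2,
    input_context="repo snapshot with a failing test suite",
    available_tools=["shell", "editor", "git"],
    reference_steps=["reproduce failure", "patch function", "rerun tests", "commit"],
    acceptance_criteria=["test suite passes", "no unrelated files modified"],
)
```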
3.2 Evaluation Harness
The DoAny-Bench harness provides:
- Sandboxed execution -- Docker-based environments with pre-configured tool access per task.
- Deterministic replay -- Recorded API responses for reproducible evaluation without live service dependencies.
- Automated + human scoring -- Automated checks for objective criteria, with human judges for subjective quality dimensions.
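
The harness interface is not spelled out here, so the following is a minimal sketch of how sandboxed execution with recorded-response replay might be wired up. The `run_task` wrapper, the container entrypoint, the `agent_image` and `replay_dir` arguments, and the JSON trace on stdout are all assumptions for illustration, not the released API.

```python
import json
import subprocess

def run_task(task_id: str, agent_image: str, replay_dir: str, max_steps: int = 50) -> dict:
    """Hypothetical wrapper: launch a per-task Docker sandbox and collect the agent's trace."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",                     # no live services; replay recorded API responses
            "-v", f"{replay_dir}:/replay:ro",        # deterministic replay fixtures, read-only
            "-e", f"TASK_ID={task_id}",
            "-e", f"MAX_STEPS={max_steps}",
            agent_image,
        ],
        capture_output=True, text=True, timeout=600,  # 10-minute wall-clock budget per task
    )
    # Assumes the container prints a JSON trace (steps taken, tool calls, final outputs).
    return json.loads(result.stdout)
```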
3.3 Metrics
We report three primary metrics:
- Task Success Rate (TSR): Binary pass/fail against acceptance criteria.
- Step Completion Rate (SCR): Fraction of reference solution steps completed.
- Efficiency Score (ES): Ratio of optimal steps to actual steps taken (penalizes unnecessary actions).
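
In code, the three metrics reduce to a few lines. The sketch below assumes the harness reports, per task, a pass/fail verdict, the number of reference steps completed, and the step counts; the function names are our own.

```python
def task_success_rate(passed: list[bool]) -> float:
    """TSR: fraction of tasks whose acceptance criteria all passed."""
    return sum(passed) / len(passed)

def step_completion_rate(completed_steps: int, reference_steps: int) -> float:
    """SCR: fraction of reference solution steps the agent completed."""
    return completed_steps / reference_steps

def efficiency_score(optimal_steps: int, actual_steps: int) -> float:
    """ES: optimal / actual steps, capped at 1.0; drops as the agent takes unnecessary actions."""
    return min(1.0, optimal_steps / actual_steps) if actual_steps else 0.0

# Example: a Tier 2 task whose reference solution needs 5 steps, solved in 7
print(efficiency_score(optimal_steps=5, actual_steps=7))  # ~0.71
```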
4. Experimental Setup
4.1 Evaluated Frameworks
We evaluate 12 agent frameworks across three model families:
| Framework | Model Backend | Agent Architecture |
|---|---|---|
| ReAct | GPT-4o, Claude 3.5 | Reason + Act loop |
| AutoGPT | GPT-4o | Goal-driven planner |
| OpenHands | Claude 3.5 Sonnet | Code-act agent |
| LangGraph | GPT-4o, Claude 3.5 | Graph-based orchestration |
| CrewAI | GPT-4o | Multi-agent crew |
| AIDE | Claude 3.5 | Data science specialist |
| SWE-Agent | GPT-4o, Claude 3.5 | SWE-specific agent |
| Devin | Proprietary | Full-stack agent |
| Claude Code | Claude 3.5 Sonnet | Terminal-native agent |
| Cursor Agent | GPT-4o, Claude 3.5 | IDE-integrated agent |
| Copilot Workspace | GPT-4o | Plan-and-execute |
| Custom Baseline | GPT-4o | Zero-shot chain-of-thought |
4.2 Configuration
- Each agent was given identical tool access and context windows per task.
- Maximum budget: 50 steps or 10 minutes wall-clock per task.
- Three independent runs per (framework, task) pair; majority vote determines pass/fail.
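
Under this protocol, a (framework, task) pair counts as a pass if at least two of its three runs pass. A minimal sketch of that aggregation (helper name is ours):

```python
from collections import Counter

def majority_pass(run_results: list[bool]) -> bool:
    """Majority vote over the three independent runs for one (framework, task) pair."""
    assert len(run_results) == 3, "protocol specifies three runs per pair"
    return Counter(run_results)[True] >= 2

print(majority_pass([True, False, True]))   # True  -> counts as a pass
print(majority_pass([True, False, False]))  # False -> counts as a failure
```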
5. Results & Analysis
5.1 Headline Results
| Tier | Best TSR | Average TSR | Human Baseline |
|---|---|---|---|
| Tier 1 (Constrained) | 72% | 67% | 94% |
| Tier 2 (Compositional) | 51% | 43% | 91% |
| Tier 3 (Open-ended) | 38% | 34% | 88% |
| Overall | 54% | 48% | 91% |
Key finding: Frontier models achieve 67% on constrained sub-tasks but drop to 34% on multi-step workflows requiring tool orchestration -- a 33-point gap that highlights the challenge of sustained, multi-step reasoning.
5.2 Domain Breakdown
| Domain | Top Agent | TSR | SCR | ES |
|---|---|---|---|---|
| Software Engineering | Claude Code | 58% | 71% | 0.64 |
| Data Analysis | AIDE | 52% | 68% | 0.59 |
| Research Synthesis | ReAct + Claude 3.5 | 41% | 55% | 0.48 |
| Creative Production | LangGraph + GPT-4o | 39% | 52% | 0.45 |
5.3 Failure Mode Analysis
The most common failure modes across all frameworks:
- Premature termination (28%) -- Agent declares task complete before meeting all acceptance criteria.
- Tool misselection (22%) -- Agent selects an inappropriate tool for the current step.
- Context loss (19%) -- Agent loses track of earlier results in long sequences.
- Error non-recovery (16%) -- Agent fails to recover from tool errors or unexpected outputs.
- Scope creep (15%) -- Agent performs unnecessary actions that waste budget.
5.4 Scaling Analysis
We observe log-linear scaling between model compute and Tier 1 performance, but diminishing returns on Tier 3 tasks -- suggesting that multi-step orchestration requires architectural improvements beyond simply scaling inference compute.
6. Conclusion
DoAny-Bench reveals a clear gap between current agent capabilities on isolated tasks versus realistic, multi-step workflows. Our key contributions:
- A benchmark of 1,247 human-sourced, human-validated tasks across four professional domains.
- Comprehensive evaluation of 12 frameworks, establishing the first cross-framework comparison on identical real-world tasks.
- Identification of the "orchestration gap" -- the 33-point TSR drop from constrained to open-ended tasks.
- A fully open evaluation harness released under Apache 2.0 to enable reproducible agent research.
We believe closing the orchestration gap is the critical frontier for autonomous AI agents, and we hope DoAny-Bench serves as a reliable measuring stick for progress.
Resources
| Resource | Link |
|---|---|
| Paper | arxiv.org/abs/2504.02495 |
| Paper Page | huggingface.co/papers/2504.02495 |
| Model | doany-ai/doany-bench-agent-v1 |
| Dataset | doany-ai/doany-bench-data |
| License | Apache 2.0 |
References
- Chen, A., Sharma, P., Liu, M., & Reyes, S. (2026). DoAny-Bench: A Comprehensive Benchmark for Evaluating Autonomous AI Agents on Real-World Tasks. arXiv preprint arXiv:2504.02495.
- Jimenez, C. E., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.
- Zhou, S., et al. (2024). WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR 2024.
- Mialon, G., et al. (2024). GAIA: A Benchmark for General AI Assistants. ICLR 2024.
- Liu, X., et al. (2024). AgentBench: Evaluating LLMs as Agents. ICLR 2024.
- Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
@article{chen2026doanybench,
title={DoAny-Bench: A Comprehensive Benchmark for Evaluating Autonomous AI Agents on Real-World Tasks},
author={Chen, Alex and Sharma, Priya and Liu, Marcus and Reyes, Sofia},
journal={arXiv preprint arXiv:2504.02495},
year={2026}
}