Abstract

We introduce DoAny-Bench, a holistic benchmark suite for evaluating autonomous AI agents across 1,247 real-world tasks spanning software engineering, data analysis, research synthesis, and creative production. Unlike existing benchmarks that focus on isolated capabilities, DoAny-Bench measures end-to-end task completion with human-validated acceptance criteria. We evaluate 12 leading agent frameworks and find that while frontier models achieve 67% on constrained sub-tasks, performance drops to 34% on multi-step workflows requiring tool orchestration. We release the full evaluation harness, task corpus, and human annotations under an Apache 2.0 license to accelerate progress on general-purpose AI agents.

1. Introduction

Autonomous AI agents are rapidly advancing beyond single-turn question answering toward multi-step, tool-augmented workflows. However, the evaluation landscape remains fragmented: existing benchmarks either test narrow capabilities in isolation (e.g., code generation, web browsing) or rely on synthetic tasks that fail to capture the complexity of real-world work.

DoAny-Bench addresses this gap with three design principles:

  1. Real-world grounding -- Every task is sourced from actual workflows performed by knowledge workers, not synthetically generated.
  2. End-to-end evaluation -- Tasks require planning, tool selection, multi-step execution, and self-correction, not just a single API call (a minimal sketch of such a loop follows this list).
  3. Human-validated acceptance -- Each task includes acceptance criteria written and verified by domain experts, ensuring evaluation fidelity.
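
To make principle 2 concrete, here is a minimal sketch of the kind of plan-act-observe loop these tasks demand. It is not DoAny-Bench's harness or any evaluated framework's implementation; the `Tool` protocol, the stopping condition, and the `decide` interface are all assumptions for illustration.

```python
from typing import Callable, Protocol

class Tool(Protocol):
    name: str
    def __call__(self, argument: str) -> str: ...

def run_agent(
    decide: Callable[[list[str]], tuple[str, str]],  # history -> (tool name, argument)
    tools: dict[str, Tool],
    max_steps: int = 20,
) -> list[str]:
    """Generic multi-step loop: plan a step, pick a tool, execute, observe, repeat."""
    history: list[str] = []
    for _ in range(max_steps):
        tool_name, argument = decide(history)
        if tool_name == "finish":      # the agent decides the task is complete
            break
        if tool_name not in tools:     # tool misselection: record it so the agent
            history.append(f"error: unknown tool {tool_name}")  # can self-correct
            continue
        observation = tools[tool_name](argument)
        history.append(f"{tool_name}({argument}) -> {observation}")
    return history
```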

Our benchmark spans four domains that represent the breadth of autonomous agent deployment: software engineering (412 tasks), data analysis (298 tasks), research synthesis (287 tasks), and creative production (250 tasks).

2. Background

2.1 Agent Evaluation Landscape

Prior benchmarks have made important contributions but exhibit key limitations:

| Benchmark | Tasks | Domains | Multi-Step | Human-Validated |
|---|---|---|---|---|
| SWE-bench | 2,294 | SWE only | Partial | No |
| WebArena | 812 | Web only | Yes | No |
| GAIA | 466 | General | Yes | Yes |
| AgentBench | 8 datasets | Mixed | Yes | Partial |
| DoAny-Bench | 1,247 | 4 domains | Yes | Yes |

2.2 Task Complexity Taxonomy

We define three tiers of task complexity:

  1. Tier 1 (Constrained) -- Single, well-scoped sub-tasks with a narrow tool set and a clear target outcome.
  2. Tier 2 (Compositional) -- Multi-step workflows that require orchestrating several tools in sequence.
  3. Tier 3 (Open-ended) -- Loosely specified tasks where the agent must plan, select tools, execute many steps, and self-correct without a prescribed solution path.

3. Approach

3.1 Task Collection

Tasks were collected through a three-phase pipeline:

  1. Sourcing: Partnered with 48 knowledge workers across 12 organizations to record real workflows over a 6-month period.
  2. Annotation: Each task was decomposed into structured fields -- input context, available tools, reference solution steps, and binary acceptance criteria (a sketch of this schema follows the list).
  3. Validation: Two independent annotators verified each task. Inter-annotator agreement (Cohen's kappa) was 0.87.
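
The structured fields described in the annotation step map naturally onto a simple record type. The sketch below is illustrative only: the `Task` dataclass, its field names, and the success check are assumptions, not the released schema.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Hypothetical record for one DoAny-Bench task (field names are illustrative)."""
    task_id: str
    domain: str                    # e.g. "software_engineering"
    tier: int                      # 1 = constrained, 2 = compositional, 3 = open-ended
    input_context: str             # the prompt and starting materials given to the agent
    available_tools: list[str] = field(default_factory=list)
    reference_steps: list[str] = field(default_factory=list)      # expert solution trace
    acceptance_criteria: list[str] = field(default_factory=list)  # each is pass/fail

def is_successful(task: Task, criteria_passed: list[bool]) -> bool:
    """A task counts as solved only if every binary acceptance criterion passes."""
    return len(criteria_passed) == len(task.acceptance_criteria) and all(criteria_passed)
```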

3.2 Evaluation Harness

The DoAny-Bench harness provides:

3.3 Metrics

We report three primary metrics:

4. Experimental Setup

4.1 Evaluated Frameworks

We evaluate 12 agent frameworks across three model families:

| Framework | Model Backend | Agent Architecture |
|---|---|---|
| ReAct | GPT-4o, Claude 3.5 | Reason + Act loop |
| AutoGPT | GPT-4o | Goal-driven planner |
| OpenHands | Claude 3.5 Sonnet | Code-act agent |
| LangGraph | GPT-4o, Claude 3.5 | Graph-based orchestration |
| CrewAI | GPT-4o | Multi-agent crew |
| AIDE | Claude 3.5 | Data science specialist |
| SWE-Agent | GPT-4o, Claude 3.5 | SWE-specific agent |
| Devin | Proprietary | Full-stack agent |
| Claude Code | Claude 3.5 Sonnet | Terminal-native agent |
| Cursor Agent | GPT-4o / Claude | IDE-integrated agent |
| Copilot Workspace | GPT-4o | Plan-and-execute |
| Custom Baseline | GPT-4o | Zero-shot chain-of-thought |
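
The paper does not spell out the custom baseline's prompt, so the following is only a minimal sketch of what a zero-shot chain-of-thought baseline on GPT-4o might look like; the prompt wording and the `run_baseline` helper are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_baseline(task_description: str) -> str:
    """Single-call, zero-shot chain-of-thought baseline (no tools, no planning loop)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an autonomous agent solving a task."},
            # Zero-shot CoT: ask the model to reason step by step before answering.
            {"role": "user", "content": f"{task_description}\n\nLet's think step by step."},
        ],
    )
    return response.choices[0].message.content
```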

4.2 Configuration

5. Results & Analysis

5.1 Headline Results

| Tier | Best TSR | Average TSR | Human Baseline |
|---|---|---|---|
| Tier 1 (Constrained) | 72% | 67% | 94% |
| Tier 2 (Compositional) | 51% | 43% | 91% |
| Tier 3 (Open-ended) | 38% | 34% | 88% |
| Overall | 54% | 48% | 91% |

Key finding: Frontier models achieve 67% average TSR on Tier 1 (constrained) sub-tasks but only 34% on Tier 3 (open-ended) workflows requiring tool orchestration -- a 33-point gap that highlights the challenge of sustained, multi-step reasoning.
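
As a rough illustration of how the per-tier averages above aggregate (assuming TSR abbreviates task success rate, i.e. the fraction of tasks whose acceptance criteria are all met), the sketch below computes average TSR per tier from hypothetical per-task records.

```python
from collections import defaultdict

def average_tsr_by_tier(results: list[dict]) -> dict[int, float]:
    """results: hypothetical records such as {"tier": 1, "success": True}."""
    totals, successes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["tier"]] += 1
        successes[r["tier"]] += int(r["success"])
    return {tier: successes[tier] / totals[tier] for tier in totals}

# Example: a 67% Tier 1 average versus a 34% Tier 3 average gives the 33-point
# gap reported above (0.67 - 0.34 = 0.33).
```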

5.2 Domain Breakdown

| Domain | Top Agent | TSR | SCR | ES |
|---|---|---|---|---|
| Software Engineering | Claude Code | 58% | 71% | 0.64 |
| Data Analysis | AIDE | 52% | 68% | 0.59 |
| Research Synthesis | ReAct + Claude 3.5 | 41% | 55% | 0.48 |
| Creative Production | LangGraph + GPT-4o | 39% | 52% | 0.45 |

5.3 Failure Mode Analysis

The most common failure modes across all frameworks:

  1. Premature termination (28%) -- Agent declares task complete before meeting all acceptance criteria.
  2. Tool misselection (22%) -- Agent selects an inappropriate tool for the current step.
  3. Context loss (19%) -- Agent loses track of earlier results in long sequences.
  4. Error non-recovery (16%) -- Agent fails to recover from tool errors or unexpected outputs.
  5. Scope creep (15%) -- Agent performs unnecessary actions that waste budget.

5.4 Scaling Analysis

We observe log-linear scaling between model compute and Tier 1 performance, but diminishing returns on Tier 3 tasks -- suggesting that multi-step orchestration requires architectural improvements beyond simply scaling inference compute.
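
The log-linear relationship can be checked with an ordinary least-squares fit of success rate against log compute. The sketch below uses NumPy; the input arrays are placeholders for whatever compute estimates and per-tier scores are available, not the paper's data.

```python
import numpy as np

def fit_log_linear(compute: np.ndarray, success_rate: np.ndarray) -> tuple[float, float]:
    """Fit success_rate ~ slope * log(compute) + intercept and return (slope, intercept)."""
    slope, intercept = np.polyfit(np.log(compute), success_rate, deg=1)
    return float(slope), float(intercept)

# Diminishing returns on Tier 3 would show up as a much smaller slope for Tier 3
# scores than for Tier 1 scores fit over the same compute range.
```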

6. Conclusion

DoAny-Bench reveals a clear gap between current agent capabilities on isolated tasks versus realistic, multi-step workflows. Our key contributions:

  1. A corpus of 1,247 human-validated, real-world tasks spanning software engineering, data analysis, research synthesis, and creative production.
  2. An end-to-end evaluation of 12 agent frameworks, exposing a 33-point gap between constrained sub-tasks and open-ended, tool-orchestrated workflows.
  3. An open release of the evaluation harness, task corpus, and human annotations under an Apache 2.0 license.

We believe closing the orchestration gap is the critical frontier for autonomous AI agents, and we hope DoAny-Bench serves as a reliable measuring stick for progress.

Resources

| Resource | Link |
|---|---|
| Paper | arxiv.org/abs/2504.02495 |
| Paper Page | huggingface.co/papers/2504.02495 |
| Model | doany-ai/doany-bench-agent-v1 |
| Dataset | doany-ai/doany-bench-data |
| License | Apache 2.0 |
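
If the task corpus is published on the Hugging Face Hub under the ID listed above, it should be loadable with the `datasets` library; the split names and column layout below are assumptions about how the data is organized.

```python
from datasets import load_dataset

# The repository ID comes from the Resources table above; everything else is hypothetical.
bench = load_dataset("doany-ai/doany-bench-data")
print(bench)                              # shows the available splits and their sizes
first_split = next(iter(bench))
print(bench[first_split][0])              # inspect the first task record
```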

References

  1. Chen, A., Sharma, P., Liu, M., & Reyes, S. (2026). DoAny-Bench: A Comprehensive Benchmark for Evaluating Autonomous AI Agents on Real-World Tasks. arXiv preprint arXiv:2504.02495.

  2. Jimenez, C. E., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.

  3. Zhou, S., et al. (2024). WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR 2024.

  4. Mialon, G., et al. (2024). GAIA: A Benchmark for General AI Assistants. ICLR 2024.

  5. Liu, X., et al. (2024). AgentBench: Evaluating LLMs as Agents. ICLR 2024.

  6. Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.

@article{chen2026doanybench,
  title={DoAny-Bench: A Comprehensive Benchmark for Evaluating Autonomous AI Agents on Real-World Tasks},
  author={Chen, Alex and Sharma, Priya and Liu, Marcus and Reyes, Sofia},
  journal={arXiv preprint arXiv:2504.02495},
  year={2026}
}