---
title: "DoAny-Bench: A Comprehensive Benchmark for Evaluating Autonomous AI Agents on Real-World Tasks"
authors: Alex Chen, Priya Sharma, Marcus Liu, Sofia Reyes
affiliations: DoAny AI Research
date: 2026-04-13
arxiv: 2504.02495
tags: [autonomous-agents, benchmark, evaluation, tool-use, multi-step-reasoning]
---

# Abstract

We introduce **DoAny-Bench**, a holistic benchmark suite for evaluating autonomous AI agents across **1,247 real-world tasks** spanning software engineering, data analysis, research synthesis, and creative production. Unlike existing benchmarks that focus on isolated capabilities, DoAny-Bench measures end-to-end task completion with human-validated acceptance criteria. We evaluate 12 leading agent frameworks and find that while frontier models achieve **67% on constrained sub-tasks**, performance drops to **34% on multi-step workflows** requiring tool orchestration. We release the full evaluation harness, task corpus, and human annotations under an Apache 2.0 license to accelerate progress on general-purpose AI agents.

# 1. Introduction

Autonomous AI agents are rapidly advancing beyond single-turn question answering toward multi-step, tool-augmented workflows. However, the evaluation landscape remains fragmented: existing benchmarks either test narrow capabilities in isolation (e.g., code generation, web browsing) or rely on synthetic tasks that fail to capture the complexity of real-world work.

**DoAny-Bench** addresses this gap with three design principles:

1. **Real-world grounding** -- Every task is sourced from actual workflows performed by knowledge workers, not synthetically generated.
2. **End-to-end evaluation** -- Tasks require planning, tool selection, multi-step execution, and self-correction, not just a single API call.
3. **Human-validated acceptance** -- Each task includes acceptance criteria written and verified by domain experts, ensuring evaluation fidelity.

Our benchmark spans four domains that represent the breadth of autonomous agent deployment: software engineering (412 tasks), data analysis (298 tasks), research synthesis (287 tasks), and creative production (250 tasks).

# 2. Background

## 2.1 Agent Evaluation Landscape

Prior benchmarks have made important contributions but exhibit key limitations:

| Benchmark | Tasks | Domains | Multi-Step | Human-Validated |
|-----------|-------|---------|------------|-----------------|
| SWE-bench | 2,294 | SWE only | Partial | No |
| WebArena | 812 | Web only | Yes | No |
| GAIA | 466 | General | Yes | Yes |
| AgentBench | 8 datasets | Mixed | Yes | Partial |
| **DoAny-Bench** | **1,247** | **4 domains** | **Yes** | **Yes** |

## 2.2 Task Complexity Taxonomy

We define three tiers of task complexity (see the classification sketch after this list):

- **Tier 1 -- Constrained** (single tool, 1-3 steps): e.g., "Write a unit test for this function."
- **Tier 2 -- Compositional** (multi-tool, 4-8 steps): e.g., "Analyze this CSV, generate a chart, and write a summary."
- **Tier 3 -- Open-ended** (orchestration, 9+ steps): e.g., "Debug this CI pipeline failure, fix the root cause, and open a PR."
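
To make the boundaries concrete, here is a minimal Python sketch of how a task's reference solution could be mapped to a tier. The `Tier` enum and `classify_tier` helper are illustrative assumptions, not part of the released harness; the paper defines the tiers only by the step and tool counts above.

```python
from enum import Enum


class Tier(Enum):
    """Task complexity tiers (labels are illustrative)."""
    CONSTRAINED = 1     # single tool, 1-3 steps
    COMPOSITIONAL = 2   # multi-tool, 4-8 steps
    OPEN_ENDED = 3      # orchestration, 9+ steps


def classify_tier(num_steps: int, num_tools: int) -> Tier:
    """Map a reference solution's step and tool counts to a tier."""
    if num_tools <= 1 and num_steps <= 3:
        return Tier.CONSTRAINED
    if num_steps <= 8:
        return Tier.COMPOSITIONAL
    return Tier.OPEN_ENDED


print(classify_tier(num_steps=2, num_tools=1))   # Tier.CONSTRAINED
print(classify_tier(num_steps=12, num_tools=4))  # Tier.OPEN_ENDED
```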

# 3. Approach

## 3.1 Task Collection

Tasks were collected through a three-phase pipeline:

1. **Sourcing**: We partnered with 48 knowledge workers across 12 organizations to record real workflows over a 6-month period.
2. **Annotation**: Each task was decomposed into structured fields -- input context, available tools, reference solution steps, and binary acceptance criteria (a sketch of one such record follows this list).
3. **Validation**: Two independent annotators verified each task. Inter-annotator agreement (Cohen's kappa) was 0.87.
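
The annotation fields above suggest a simple record structure. The following is a minimal sketch of one annotated task, assuming a flat schema; the `TaskRecord` dataclass and its field names are hypothetical and may not match the released format.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """One annotated DoAny-Bench task (illustrative schema, not the released format)."""
    task_id: str
    domain: str                     # e.g. "software_engineering"
    input_context: str              # files, prompts, or data handed to the agent
    available_tools: list[str]      # tools the sandbox exposes for this task
    reference_steps: list[str]      # expert solution steps, used for SCR
    acceptance_criteria: list[str]  # binary checks, used for TSR
    tier: int                       # 1 = constrained, 2 = compositional, 3 = open-ended


example = TaskRecord(
    task_id="swe-0042",
    domain="software_engineering",
    input_context="Repository snapshot with a failing CI run attached.",
    available_tools=["bash", "editor", "git"],
    reference_steps=["reproduce failure", "identify root cause", "patch", "open PR"],
    acceptance_criteria=["CI passes on the patched branch", "PR links the failing job"],
    tier=3,
)
```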

## 3.2 Evaluation Harness

The DoAny-Bench harness provides:

- **Sandboxed execution** -- Docker-based environments with pre-configured tool access per task.
- **Deterministic replay** -- Recorded API responses for reproducible evaluation without live service dependencies (see the sketch after this list).
- **Automated + human scoring** -- Automated checks for objective criteria, with human judges for subjective quality dimensions.
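
The deterministic-replay component is worth illustrating: tool and API calls are fingerprinted and answered from a recording instead of hitting live services. The sketch below shows one way such a cache could work; the `ReplayCache` class and its keying scheme are assumptions, not the harness's actual implementation.

```python
import hashlib
import json


class ReplayCache:
    """Serve recorded tool/API responses so evaluation runs are reproducible offline."""

    def __init__(self, recorded: dict[str, str]):
        self._recorded = recorded  # call fingerprint -> recorded response

    @staticmethod
    def fingerprint(tool: str, args: dict) -> str:
        # Canonical JSON keeps the key stable across argument orderings.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, tool: str, args: dict) -> str:
        key = self.fingerprint(tool, args)
        if key not in self._recorded:
            raise KeyError(f"no recorded response for {tool} with {args}")
        return self._recorded[key]


# Record once against live services, then replay during evaluation.
recorded = {ReplayCache.fingerprint("web_search", {"q": "pandas groupby"}): "...cached results..."}
cache = ReplayCache(recorded)
print(cache.call("web_search", {"q": "pandas groupby"}))
```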

## 3.3 Metrics

We report three primary metrics (a computation sketch follows the list):

- **Task Success Rate (TSR)**: Binary pass/fail against acceptance criteria.
- **Step Completion Rate (SCR)**: Fraction of reference solution steps completed.
- **Efficiency Score (ES)**: Ratio of optimal steps to actual steps taken (penalizes unnecessary actions).
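
All three metrics reduce to simple ratios over a scored run. A minimal sketch, assuming per-task inputs like the ones named in the arguments below; the cap at 1.0 is an assumption, since the text above only defines ES as a ratio that penalizes unnecessary actions.

```python
def task_success_rate(passed_criteria: list[bool]) -> float:
    """TSR for one task: pass only if every acceptance criterion is met."""
    return 1.0 if all(passed_criteria) else 0.0


def step_completion_rate(completed_steps: int, reference_steps: int) -> float:
    """SCR: fraction of reference solution steps the agent completed."""
    return completed_steps / reference_steps


def efficiency_score(optimal_steps: int, actual_steps: int) -> float:
    """ES: optimal-to-actual step ratio; capped at 1.0 here (assumption)."""
    return min(1.0, optimal_steps / actual_steps)


# Example: 5 of 7 reference steps completed, 12 actions taken where 7 would suffice.
print(task_success_rate([True, True, False]))  # 0.0
print(step_completion_rate(5, 7))              # ~0.714
print(efficiency_score(7, 12))                 # ~0.583
```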

# 4. Experimental Setup

## 4.1 Evaluated Frameworks

We evaluate 12 agent frameworks across three model families:

| Framework | Model Backend | Agent Architecture |
|-----------|--------------|-------------------|
| ReAct | GPT-4o, Claude 3.5 | Reason + Act loop |
| AutoGPT | GPT-4o | Goal-driven planner |
| OpenHands | Claude 3.5 Sonnet | Code-act agent |
| LangGraph | GPT-4o, Claude 3.5 | Graph-based orchestration |
| CrewAI | GPT-4o | Multi-agent crew |
| AIDE | Claude 3.5 | Data science specialist |
| SWE-Agent | GPT-4o, Claude 3.5 | SWE-specific agent |
| Devin | Proprietary | Full-stack agent |
| Claude Code | Claude 3.5 Sonnet | Terminal-native agent |
| Cursor Agent | GPT-4o / Claude | IDE-integrated agent |
| Copilot Workspace | GPT-4o | Plan-and-execute |
| Custom Baseline | GPT-4o | Zero-shot chain-of-thought |

## 4.2 Configuration

- Each agent was given identical tool access and context windows per task.
- Maximum budget: 50 steps or 10 minutes wall-clock per task.
- Three independent runs per (framework, task) pair; majority vote determines pass/fail (see the aggregation sketch below).
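
The aggregation rule in the last item is a straightforward 2-of-3 vote over boolean run outcomes; a minimal sketch:

```python
def majority_pass(run_outcomes: list[bool]) -> bool:
    """Collapse three independent run results into one pass/fail verdict."""
    assert len(run_outcomes) == 3, "DoAny-Bench uses three runs per (framework, task) pair"
    return sum(run_outcomes) >= 2


print(majority_pass([True, False, True]))   # True: 2 of 3 runs passed
print(majority_pass([False, False, True]))  # False
```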

# 5. Results & Analysis

## 5.1 Headline Results

| Tier | Best TSR | Average TSR | Human Baseline |
|------|----------|-------------|----------------|
| Tier 1 (Constrained) | **72%** | **67%** | 94% |
| Tier 2 (Compositional) | **51%** | **43%** | 91% |
| Tier 3 (Open-ended) | **38%** | **34%** | 88% |
| **Overall** | **54%** | **48%** | **91%** |

**Key finding**: Frontier models achieve 67% on constrained sub-tasks but drop to 34% on multi-step workflows requiring tool orchestration -- a **33-point gap** that highlights the challenge of sustained, multi-step reasoning.

## 5.2 Domain Breakdown

| Domain | Top Agent | TSR | SCR | ES |
|--------|-----------|-----|-----|----|
| Software Engineering | Claude Code | 58% | 71% | 0.64 |
| Data Analysis | AIDE | 52% | 68% | 0.59 |
| Research Synthesis | ReAct + Claude 3.5 | 41% | 55% | 0.48 |
| Creative Production | LangGraph + GPT-4o | 39% | 52% | 0.45 |

## 5.3 Failure Mode Analysis

The most common failure modes across all frameworks were:

1. **Premature termination** (28%) -- Agent declares task complete before meeting all acceptance criteria.
2. **Tool misselection** (22%) -- Agent selects an inappropriate tool for the current step.
3. **Context loss** (19%) -- Agent loses track of earlier results in long sequences.
4. **Error non-recovery** (16%) -- Agent fails to recover from tool errors or unexpected outputs.
5. **Scope creep** (15%) -- Agent performs unnecessary actions that waste budget.

## 5.4 Scaling Analysis

We observe log-linear scaling between model compute and Tier 1 performance, but **diminishing returns on Tier 3 tasks** -- suggesting that multi-step orchestration requires architectural improvements beyond simply scaling inference compute.

# 6. Conclusion

DoAny-Bench reveals a clear gap between current agent capabilities on isolated tasks and on realistic, multi-step workflows. Our key contributions:

- **A benchmark of 1,247 human-sourced, human-validated tasks** across four professional domains.
- **Comprehensive evaluation of 12 frameworks**, establishing the first cross-framework comparison on identical real-world tasks.
- **Identification of the "orchestration gap"** -- the 33-point TSR drop from constrained to open-ended tasks.
- **A fully open evaluation harness** released under Apache 2.0 to enable reproducible agent research.

We believe closing the orchestration gap is the critical frontier for autonomous AI agents, and we hope DoAny-Bench serves as a reliable measuring stick for progress.

## Resources

| Resource | Link |
|----------|------|
| Paper | [arxiv.org/abs/2504.02495](https://arxiv.org/abs/2504.02495) |
| Paper Page | [huggingface.co/papers/2504.02495](https://huggingface.co/papers/2504.02495) |
| Model | [doany-ai/doany-bench-agent-v1](https://huggingface.co/doany-ai/doany-bench-agent-v1) |
| Dataset | [doany-ai/doany-bench-data](https://huggingface.co/datasets/doany-ai/doany-bench-data) |
| License | Apache 2.0 |
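
As a starting point, the task corpus should be loadable with the Hugging Face `datasets` library. A minimal sketch, assuming the dataset id from the table above; the split name and field layout are assumptions, so check the dataset card before building an evaluation loop.

```python
from datasets import load_dataset

# Dataset id comes from the Resources table; the "test" split name is an assumption.
bench = load_dataset("doany-ai/doany-bench-data", split="test")

print(len(bench))  # number of task records in the split
print(bench[0])    # one task record; inspect its fields before wiring up an agent
```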

# References

1. Chen, A., Sharma, P., Liu, M., & Reyes, S. (2026). DoAny-Bench: A Comprehensive Benchmark for Evaluating Autonomous AI Agents on Real-World Tasks. *arXiv preprint arXiv:2504.02495*.

2. Jimenez, C. E., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? *ICLR 2024*.

3. Zhou, S., et al. (2024). WebArena: A Realistic Web Environment for Building Autonomous Agents. *ICLR 2024*.

4. Mialon, G., et al. (2024). GAIA: A Benchmark for General AI Assistants. *ICLR 2024*.

5. Liu, X., et al. (2024). AgentBench: Evaluating LLMs as Agents. *ICLR 2024*.

6. Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. *ICLR 2023*.

```bibtex
@article{chen2026doanybench,
  title={DoAny-Bench: A Comprehensive Benchmark for Evaluating Autonomous AI Agents on Real-World Tasks},
  author={Chen, Alex and Sharma, Priya and Liu, Marcus and Reyes, Sofia},
  journal={arXiv preprint arXiv:2504.02495},
  year={2026}
}
```
