---
language:
  - en
license: apache-2.0
task_categories:
  - text-generation
  - question-answering
tags:
  - autonomous-agents
  - benchmark
  - evaluation
  - tool-use
  - multi-step-reasoning
  - arxiv:2504.02495
size_categories:
  - 1K<n<10K
---

# DoAny-Bench Data

Official task corpus and human annotations for [DoAny-Bench: A Comprehensive Benchmark for Evaluating Autonomous AI Agents on Real-World Tasks](https://arxiv.org/abs/2504.02495).

## Dataset Description

DoAny-Bench contains 1,247 real-world tasks with human-validated acceptance criteria, spanning four domains:

| Domain | Tasks | Description |
|--------|-------|-------------|
| Software Engineering | 412 | Code generation, debugging, refactoring |
| Data Analysis | 298 | Statistical analysis, visualization, ETL |
| Research Synthesis | 287 | Literature review, summarization, fact-checking |
| Creative Production | 250 | Writing, design briefs, content planning |

Each task includes structured inputs, tool specifications, step-by-step reference solutions, and binary human acceptance judgments.
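
## Usage

The corpus can be loaded with the `datasets` library. Below is a minimal sketch; the repository id (`doany-ai/doany-bench-data`), the split name, and the field names (`domain`, `inputs`, `tool_spec`, `reference_solution`, `accepted`) are illustrative assumptions, not the confirmed schema — check the dataset viewer for the actual column names before use.

```python
# Minimal sketch: load DoAny-Bench tasks and inspect one record.
# NOTE: the repo id, split name, and field names below are assumptions
# for illustration; verify them against the actual dataset schema.
from datasets import load_dataset

ds = load_dataset("doany-ai/doany-bench-data", split="test")  # hypothetical id/split

# Filter to one of the four domains, e.g. Software Engineering.
swe_tasks = ds.filter(lambda ex: ex["domain"] == "Software Engineering")
print(f"{len(swe_tasks)} software-engineering tasks")

# Inspect the structure of a single task record.
example = swe_tasks[0]
for field in ("inputs", "tool_spec", "reference_solution", "accepted"):
    print(field, "->", type(example.get(field)))
```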

## Paper

- [DoAny-Bench (arXiv:2504.02495)](https://arxiv.org/abs/2504.02495)
- [Paper Page on Hugging Face](https://huggingface.co/papers/2504.02495)

## Linked Resources

| Resource | Link |
|----------|------|
| Paper | [huggingface.co/papers/2504.02495](https://huggingface.co/papers/2504.02495) |
| Model | [doany-ai/doany-bench-agent-v1](https://huggingface.co/doany-ai/doany-bench-agent-v1) |

## Citation

```bibtex
@article{chen2025doanybench,
  title={DoAny-Bench: A Comprehensive Benchmark for Evaluating Autonomous AI Agents on Real-World Tasks},
  author={Chen, Alex and Sharma, Priya and Liu, Marcus and Reyes, Sofia},
  journal={arXiv preprint arXiv:2504.02495},
  year={2025}
}
```
