---
language:
  - en
license: apache-2.0
tags:
  - autonomous-agents
  - benchmark
  - evaluation
  - tool-use
  - multi-step-reasoning
  - arxiv:2504.02495
library_name: transformers
pipeline_tag: text-generation
---

# DoAny-Bench Agent v1

Official agent checkpoint evaluated in [DoAny-Bench: A Comprehensive Benchmark for Evaluating Autonomous AI Agents on Real-World Tasks](https://arxiv.org/abs/2504.02495).

## Model Description

This is the reference agent model evaluated across the DoAny-Bench suite of 1,247 real-world tasks spanning software engineering, data analysis, research synthesis, and creative production. It serves as the baseline agent checkpoint for reproducible benchmark evaluation.
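As a quick sanity check, the checkpoint can be loaded through the standard `transformers` API, consistent with the `library_name` and `pipeline_tag` declared above. This is only a minimal sketch: the repository id and prompt below are placeholders, not published names, so substitute the actual Hub path of this checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- replace with the actual Hub path of this checkpoint.
model_id = "doany-ai/doany-bench-agent-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Plain single-turn generation; the full multi-step tool-use loop is driven by
# the benchmark harness, not shown here.
prompt = "Outline the steps needed to reproduce a data-analysis task end to end."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```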

## Paper

- [DoAny-Bench (arXiv:2504.02495)](https://arxiv.org/abs/2504.02495)
- [Paper Page on Hugging Face](https://huggingface.co/papers/2504.02495)

## Linked Resources

| Resource | Link |
|----------|------|
| Paper | [huggingface.co/papers/2504.02495](https://huggingface.co/papers/2504.02495) |
| Dataset | [doany-ai/doany-bench-data](https://huggingface.co/datasets/doany-ai/doany-bench-data) |
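For evaluation, the benchmark tasks can be pulled with the `datasets` library. The repo id below comes from the table above; the splits, configs, and field names are not documented in this card, so this is a sketch and the dataset page should be consulted for the exact layout.

```python
from datasets import load_dataset

# Repo id taken from the Linked Resources table; split/config names are assumptions,
# so check the dataset card for the actual structure.
bench = load_dataset("doany-ai/doany-bench-data")
print(bench)
```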

## Citation

```bibtex
@article{chen2025doanybench,
  title={DoAny-Bench: A Comprehensive Benchmark for Evaluating Autonomous AI Agents on Real-World Tasks},
  author={Chen, Alex and Sharma, Priya and Liu, Marcus and Reyes, Sofia},
  journal={arXiv preprint arXiv:2504.02495},
  year={2025}
}
```
