# Accuracy, Latency, and Cost Tradeoffs Across Retrieval-Augmented and Agentic LLM Configurations: A Comparative Benchmark on Production Queries

**doany.ai Internal Technical Report — Q1 2026**

*Alex Chen, Applied AI Team*

*doany.ai, Inc.*

*April 2026*

---

![Graphical Abstract](figures/graphical_abstract.png)
*Figure 1. Graphical abstract summarizing the benchmark study design, agent configurations, evaluation pipeline, and key findings.*

---

## Abstract

Large language model (LLM) applications increasingly employ retrieval-augmented generation (RAG), multi-step agentic reasoning, and self-critique mechanisms to improve response quality, yet the empirical tradeoffs among these strategies remain poorly characterized in production settings. This report presents a controlled benchmark of four agent configurations deployed on the doany.ai platform: a zero-shot baseline (GPT-4o), a RAG-enhanced variant (GPT-4o with vector-store retrieval), an agentic-loop configuration (Claude Opus 4 with iterative tool use), and a hybrid system combining retrieval, agentic reasoning, and self-critique (Claude Opus 4 + RAG + self-critique). We evaluated 312 real user queries sampled from production logs across four task categories — factual lookup, multi-step reasoning, code generation, and creative/open-ended — with each query executed three times per configuration, yielding 3,744 total runs. Responses were assessed by two human raters (Cohen's *kappa* = 0.78) supplemented by automated evaluation. The hybrid configuration achieved the highest overall accuracy (81.2%), representing a 20.9 percentage-point improvement over the baseline (60.3%), with hallucination rates reduced from 14.2% to 2.9%. However, these gains came at 6.7x higher median latency (12.1 s vs. 1.8 s) and 8.7x greater per-query cost ($0.104 vs. $0.012). The RAG-enhanced configuration emerged as the most cost-effective option for factual workloads, delivering a 13.4 percentage-point accuracy lift at only 1.6x the baseline cost. These findings inform doany.ai's adaptive routing strategy, in which query complexity determines the optimal configuration to deploy.

---

## 1. Introduction

The deployment of large language models in production applications has evolved rapidly from simple prompt-response architectures toward increasingly sophisticated pipelines that integrate external knowledge retrieval, iterative reasoning, and self-verification mechanisms. This evolution has been driven by well-documented limitations of standalone LLMs, including factual hallucination, inability to access current information, and difficulty with complex multi-step tasks (Yao et al., 2023). Understanding how these architectural enhancements interact — and at what cost — is critical for platform operators making deployment decisions that affect both user experience and infrastructure expenditure.

Retrieval-augmented generation, first formalized by Lewis et al. (2020), demonstrated that coupling a pre-trained language model with a non-parametric retrieval mechanism over a document index substantially improves performance on knowledge-intensive tasks. Subsequent work has refined this approach through active retrieval strategies that dynamically decide when to query external sources (Jiang et al., 2023) and through interleaving retrieval steps with chain-of-thought reasoning to tackle multi-hop questions (Trivedi et al., 2023). These advances have made RAG a standard component in production LLM systems, though the magnitude of improvement varies considerably across task types and domains.

In parallel, agentic architectures have emerged as a complementary paradigm for improving LLM output quality. The ReAct framework introduced by Yao et al. (2023) demonstrated that interleaving reasoning traces with action steps — such as tool calls and information retrieval — enables language models to solve tasks that require dynamic planning and environmental interaction. Schick et al. (2023) extended this concept with Toolformer, showing that language models can autonomously learn when and how to invoke external tools. These agentic approaches operate through iterative loops in which the model generates a plan, executes actions, observes results, and refines its approach, often over multiple cycles.

A third line of research has explored self-critique and iterative refinement as mechanisms for error reduction. Madaan et al. (2023) proposed Self-Refine, demonstrating that models can improve their own outputs through cycles of generation, self-evaluation, and revision without external training signal. Shinn et al. (2023) introduced Reflexion, which equips language agents with verbal reinforcement learning by allowing them to reflect on failures and incorporate that feedback into subsequent attempts. Gao et al. (2023) developed RARR, a framework for post-hoc verification and revision of model outputs using attribution to retrieved evidence. Collectively, these approaches suggest that multi-pass reasoning architectures can substantially reduce hallucination and improve factual accuracy, albeit at increased computational cost.

Despite this growing body of work, most evaluations have been conducted on standardized academic benchmarks rather than production workloads. Production queries exhibit greater diversity, ambiguity, and domain specificity than curated test sets, and the latency and cost constraints of real-time applications impose practical boundaries that academic evaluations rarely address. Furthermore, prior studies have typically compared only two configurations at a time — for example, RAG versus no RAG, or agentic versus non-agentic — rather than examining the full combinatorial space of retrieval, agentic reasoning, and self-critique.

This report addresses that gap by presenting a systematic benchmark of four agent configurations on the doany.ai platform, evaluated against 312 production user queries spanning four task categories. Our objectives were threefold: (1) to quantify the accuracy improvements attributable to retrieval augmentation, agentic reasoning, and self-critique, both individually and in combination; (2) to characterize the latency and cost profiles of each configuration under production conditions; and (3) to identify optimal routing strategies that balance quality against resource consumption for different query types.

---

## 2. Methods

### 2.1 Study Design

We conducted a within-subjects comparative evaluation of four LLM agent configurations on the doany.ai platform (version 2.4.1). The study followed the STROBE guidelines for reporting observational studies where applicable (von Elm et al., 2007). All configurations were evaluated against the same set of production queries, with each query executed three times per configuration to assess response variance, yielding a total of 3,744 experimental runs.

### 2.2 Query Corpus

The evaluation corpus comprised 312 real user queries sampled from doany.ai production logs spanning January through February 2026. Queries were stratified into four categories based on task taxonomy: factual lookup (*n* = 104), multi-step reasoning (*n* = 96), code generation (*n* = 62), and creative/open-ended (*n* = 50). This distribution reflected the natural frequency of query types observed in production traffic during the sampling period, with minor oversampling of less frequent categories to ensure adequate statistical power within each stratum.

### 2.3 Agent Configurations

Four configurations were evaluated, each representing a distinct architectural approach to query resolution:

**Baseline.** GPT-4o in zero-shot mode with no retrieval augmentation. Queries were passed directly to the model with a standard system prompt, and the first generated response was returned without iteration or post-processing.

**RAG-Enhanced.** GPT-4o augmented with vector-store retrieval. For each query, the top *k* = 5 most relevant documents were retrieved from a Pinecone serverless index (1,536-dimensional embeddings generated by text-embedding-3-large) and prepended to the prompt as grounding context before generation.
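
To make the data flow concrete, the sketch below shows a minimal single-pass RAG step. The `embed_query`, `index.query`, and `generate` helpers are hypothetical wrappers around the embedding, vector-store, and chat-completion APIs, and the prompt wording is illustrative rather than the production template.

```python
from typing import List

TOP_K = 5  # number of grounding documents retrieved per query

def retrieve_documents(query: str, index, embed_query) -> List[str]:
    """Embed the query and fetch the top-k nearest documents from the vector index.

    `embed_query` and `index.query` are assumed wrappers shown only to illustrate
    the data flow; they are not the production client calls.
    """
    vector = embed_query(query)                # 1,536-dimensional query embedding
    hits = index.query(vector, top_k=TOP_K)    # nearest-neighbour lookup
    return [hit["text"] for hit in hits]

def build_grounded_prompt(query: str, documents: List[str]) -> str:
    """Prepend the retrieved documents to the user query as grounding context."""
    context = "\n\n".join(f"[Doc {i + 1}] {doc}" for i, doc in enumerate(documents))
    return (
        "Answer the question, using the context below where it is relevant.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

def rag_answer(query: str, index, embed_query, generate) -> str:
    """Single-pass RAG: retrieve, ground, generate. `generate` wraps the chat model."""
    documents = retrieve_documents(query, index, embed_query)
    return generate(build_grounded_prompt(query, documents))
```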

**Agentic-Loop.** Claude Opus 4 configured with a tool-use loop permitting up to eight iterative reasoning steps. The model could invoke search, computation, and code execution tools, observe the results, and refine its approach across multiple iterations before producing a final response.
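
The loop structure can be sketched as follows. `model.step` and the `tools` mapping are illustrative stand-ins for the tool-use API, not the production interface; only the cap of eight iterations and the reason-act-observe cycle follow the configuration described above.

```python
MAX_ITERATIONS = 8  # hard cap on reason-act cycles, per the configuration above

def agentic_answer(query: str, model, tools) -> str:
    """Iterative tool-use loop: at each step the model either calls a tool or answers.

    `model.step` (returns a dict describing a tool call or a final answer) and the
    `tools` mapping of callables are hypothetical wrappers used for illustration.
    """
    transcript = [{"role": "user", "content": query}]
    for _ in range(MAX_ITERATIONS):
        action = model.step(transcript)
        if action["type"] == "final":
            return action["content"]
        # Execute the requested tool (search, computation, or code execution)
        # and feed the observation back into the transcript.
        result = tools[action["tool"]](**action["arguments"])
        transcript.append({"role": "tool", "name": action["tool"], "content": str(result)})
    # If the cap is reached without convergence (see Section 3.4), force a best-effort answer.
    return model.step(transcript, force_final=True)["content"]
```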

**Hybrid.** Claude Opus 4 with both RAG retrieval (identical to the RAG-Enhanced configuration) and an agentic reasoning loop, supplemented by an explicit self-critique step. After the agentic loop produced a candidate response, the model evaluated its own output for factual accuracy, logical consistency, and completeness, then revised the response if deficiencies were identified. This self-critique step added a dedicated verification pass before final delivery.
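
Building on the two sketches above, the hybrid pipeline composes retrieval grounding, the agentic loop, and a single self-critique pass. The critique prompt wording here is ours rather than the production prompt, and `model.generate` is again an assumed wrapper.

```python
def hybrid_answer(query: str, index, embed_query, model, tools) -> str:
    """Hybrid pipeline: retrieval grounding, agentic reasoning, then one self-critique pass."""
    documents = retrieve_documents(query, index, embed_query)   # as in the RAG sketch
    grounded_query = build_grounded_prompt(query, documents)
    draft = agentic_answer(grounded_query, model, tools)        # as in the loop sketch

    # Dedicated verification pass: audit the draft for factual accuracy, logical
    # consistency, and completeness; revise only if deficiencies are identified.
    critique = model.generate(
        "Review the draft answer for factual accuracy, logical consistency, and "
        "completeness. Reply 'OK' if it is acceptable, otherwise list the problems.\n\n"
        f"Question: {query}\n\nDraft: {draft}"
    )
    if critique.strip() == "OK":
        return draft
    return model.generate(
        f"Revise the draft to address these problems:\n{critique}\n\n"
        f"Question: {query}\n\nDraft: {draft}"
    )
```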

### 2.4 Infrastructure

All experimental runs were executed on AWS us-east-1 using c7g.4xlarge inference nodes. The vector store was hosted on Pinecone serverless with 1,536-dimensional embeddings. End-to-end latency was instrumented through Datadog APM traces, with Prometheus histograms capturing latency percentile distributions. All prompts, queries, and responses were logged to Amazon S3 with unique run identifiers to ensure full reproducibility.

### 2.5 Evaluation Protocol

Each response was evaluated through a dual assessment protocol combining human judgment with automated scoring. Two trained human raters independently evaluated every response for correctness on a binary scale (correct/incorrect), with category-specific rubrics calibrated prior to the evaluation phase. Inter-rater agreement was assessed using Cohen's kappa, which reached 0.78, indicating substantial agreement. Disagreements were resolved through discussion and consensus. An automated evaluation layer using GPT-4o was deployed in parallel to provide a scalable quality signal; automated scores were used as a secondary check but did not override human judgments.
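
The agreement statistic can be reproduced from the two raters' label vectors; a minimal sketch using scikit-learn is shown below, with short placeholder labels standing in for the actual evaluation logs.

```python
from sklearn.metrics import cohen_kappa_score

# rater_a and rater_b are the binary correctness labels (1 = correct, 0 = incorrect)
# assigned independently to the same ordered set of responses. The arrays below are
# placeholders; the study's value of 0.78 comes from the full label set.
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
```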

### 2.6 Outcome Measures

The primary outcome was overall response accuracy (percentage of correct responses as judged by human raters). Secondary outcomes included category-specific accuracy rates, end-to-end latency (reported as p50, p95, and p99 percentiles), mean cost per query in USD, and hallucination rate (percentage of responses containing fabricated or factually incorrect assertions).

### 2.7 Statistical Analysis

Accuracy differences between configurations were assessed using paired *t*-tests, with statistical significance set at *p* < 0.01 to account for multiple comparisons. Pearson's correlation coefficient was used to quantify the relationship between the number of reasoning steps and hallucination rate across configurations. All analyses were conducted at the configuration level, with three replications per query averaged to produce stable per-query estimates before cross-configuration comparison.
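
A sketch of the two tests as they could be run with SciPy is given below. The per-query accuracy arrays and the per-configuration reasoning-step counts are placeholders; only the hallucination rates are taken from Section 3.4.

```python
import numpy as np
from scipy.stats import ttest_rel, pearsonr

rng = np.random.default_rng(0)

# Per-query accuracy for two configurations: each entry is the mean correctness over
# the three replications of one query (placeholder values, not the measured data).
acc_hybrid = rng.choice([0, 1/3, 2/3, 1], size=312)
acc_baseline = rng.choice([0, 1/3, 2/3, 1], size=312)

# Paired t-test over the same 312 queries; significance threshold p < 0.01.
t_stat, p_value = ttest_rel(acc_hybrid, acc_baseline)

# Configuration-level correlation between reasoning steps and hallucination rate.
# The step counts here are illustrative; the hallucination rates are those reported
# in Table 4, and the study's measured correlation was r = -0.94.
reasoning_steps = [1, 2, 4, 6]
hallucination_pct = [14.2, 6.1, 4.5, 2.9]
r, _ = pearsonr(reasoning_steps, hallucination_pct)
```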

![Methods Flowchart](figures/methods_flowchart.png)
*Figure 2. Study design and evaluation pipeline. 312 production queries were categorized into four task types and processed through four agent configurations, each with three replications, yielding 3,744 total runs evaluated by dual human raters and automated assessment.*

---

## 3. Results

### 3.1 Accuracy

The hybrid configuration achieved the highest overall accuracy at 81.2%, followed by the agentic-loop at 76.4%, the RAG-enhanced configuration at 67.8%, and the baseline at 60.3% (Table 1). The hybrid configuration significantly outperformed all other configurations on overall accuracy (*p* < 0.01, paired *t*-test).

Performance gains varied substantially by query category. RAG augmentation produced the largest absolute improvement on factual lookup queries, raising accuracy from 71.2% (baseline) to 84.6% — a 13.4 percentage-point lift. The agentic-loop configuration demonstrated its greatest advantage on multi-step reasoning tasks, where accuracy rose from 48.6% (baseline) to 72.9%, representing a 24.3 percentage-point improvement. For code generation, the hybrid configuration achieved 81.5% accuracy versus 62.4% for the baseline, a 19.1 percentage-point gain. Creative and open-ended queries proved most challenging across all configurations, with the hybrid achieving 71.4% compared to 55.0% for the baseline.

**Table 1.** Accuracy by query category and agent configuration (% correct, human-rated).

| Configuration | Factual (*n*=104) | Multi-step (*n*=96) | Code Gen (*n*=62) | Creative (*n*=50) | Overall (*n*=312) |
|:---|:---:|:---:|:---:|:---:|:---:|
| Baseline | 71.2 | 48.6 | 62.4 | 55.0 | 60.3 |
| RAG-Enhanced | 84.6 | 58.3 | 64.1 | 56.2 | 67.8 |
| Agentic-Loop | 82.7 | 72.9 | 79.8 | 68.0 | 76.4 |
| Hybrid | 89.4 | 78.1 | 81.5 | 71.4 | 81.2 |

*Note.* Hybrid vs. all other configurations: *p* < 0.01 (paired *t*-test) on overall accuracy.

![Accuracy Comparison](figures/accuracy_comparison.png)
*Figure 3. Accuracy comparison across query categories and agent configurations. The hybrid configuration achieved the highest accuracy in all categories, with the largest absolute gains observed on multi-step reasoning tasks.*

### 3.2 Latency

Latency increased monotonically with configuration complexity (Table 2). The baseline exhibited the lowest median latency at 1.8 s (p95 = 3.2 s, p99 = 5.1 s), while the hybrid configuration reached a median of 12.1 s (p95 = 31.5 s, p99 = 58.2 s). The RAG retrieval step contributed approximately 1.5 s of additional latency per query. Each agentic iteration averaged 2.8 s, with a median iteration count of 3.1 per query. The self-critique step in the hybrid configuration added approximately 3.2 s per invocation. A cold-start penalty of approximately 1.8 s was observed for the first request in each session across all configurations.

The tail latency profile of the agentic and hybrid configurations is particularly noteworthy for real-time user experience. At the p99 level, the hybrid configuration exhibited latencies of 58.2 s — over 11 times the baseline p99 of 5.1 s — driven by queries that triggered the maximum number of agentic iterations combined with extended self-critique cycles.

**Table 2.** End-to-end latency by agent configuration (seconds).

| Configuration | p50 | p95 | p99 | Mean | SD |
|:---|:---:|:---:|:---:|:---:|:---:|
| Baseline | 1.8 | 3.2 | 5.1 | 2.1 | 0.9 |
| RAG-Enhanced | 3.4 | 7.8 | 12.3 | 4.2 | 2.1 |
| Agentic-Loop | 8.7 | 22.4 | 41.6 | 11.3 | 7.4 |
| Hybrid | 12.1 | 31.5 | 58.2 | 15.8 | 9.6 |

### 3.3 Cost

Per-query cost scaled with configuration complexity, ranging from a median of $0.010 (baseline) to $0.082 (hybrid), representing an 8.2x increase (Table 3). The agentic-loop and hybrid configurations exhibited substantially higher cost variance (p95 of $0.210 and $0.260, respectively), reflecting the variable number of iterations and tool calls executed per query. The RAG-enhanced configuration offered a moderate cost profile at $0.016 median, only 1.6x the baseline, making it the most cost-efficient augmentation strategy.

**Table 3.** Cost per query by agent configuration (USD).

| Configuration | Mean | Median | p95 |
|:---|:---:|:---:|:---:|
| Baseline | $0.012 | $0.010 | $0.028 |
| RAG-Enhanced | $0.019 | $0.016 | $0.041 |
| Agentic-Loop | $0.087 | $0.068 | $0.210 |
| Hybrid | $0.104 | $0.082 | $0.260 |

### 3.4 Hallucination and Error Analysis

Hallucination rates decreased monotonically across configurations as the number of reasoning and verification steps increased (Table 4). The baseline exhibited a hallucination rate of 14.2% (44 of 312 queries), which dropped to 6.1% with RAG augmentation (19/312), 4.5% with the agentic loop (14/312), and 2.9% with the hybrid configuration (9/312). This progression yielded a strong negative correlation between the number of reasoning steps and hallucination rate across configurations (*r* = -0.94). The self-critique mechanism in the hybrid configuration intercepted approximately 18% of initial errors before the final response was delivered to the user.

Failure mode analysis revealed configuration-specific vulnerabilities. The agentic-loop configuration encountered infinite tool-call loops — reaching the maximum eight iterations without convergence — on 8.3% of multi-step reasoning queries, typically triggered by ambiguous or underspecified user intent. The RAG-enhanced configuration suffered retrieval failures on 11.2% of queries, in which the retrieved context was irrelevant to the query; this was most pronounced on code generation queries and attributed to embedding drift in the code domain.

**Table 4.** Hallucination rates and failure modes by configuration.

| Configuration | Hallucination Rate | Characteristic Failure Modes / Notes |
|:---|:---:|:---|
| Baseline | 14.2% (44/312) | Direct fabrication in absence of grounding |
| RAG-Enhanced | 6.1% (19/312) | Irrelevant retrieval on 11.2% of queries |
| Agentic-Loop | 4.5% (14/312) | Max-iteration loops on 8.3% of multi-step queries |
| Hybrid | 2.9% (9/312) | Self-critique catches ~18% of initial errors |

![Hallucination Analysis](figures/hallucination_analysis.png)
*Figure 4. Hallucination rate reduction across configurations (left panel) and configuration-specific failure mode analysis (right panel). A strong inverse correlation (r = -0.94) was observed between the number of reasoning steps and hallucination rate.*

### 3.5 Latency-Accuracy-Cost Tradeoff

The relationship between accuracy, latency, and cost across configurations reveals a clear tradeoff frontier (Figure 5). Each incremental accuracy improvement required disproportionately greater increases in both latency and cost. Moving from the baseline (60.3% accuracy, 1.8 s median latency, $0.012 mean cost) to RAG-enhanced (67.8%, 3.4 s, $0.019) yielded 7.5 percentage points of accuracy at 1.9x the baseline latency and 1.6x the baseline cost — the most efficient tradeoff observed. The jump from RAG-enhanced to agentic-loop (76.4%, 8.7 s, $0.087) added 8.6 percentage points at 2.6x the RAG-enhanced latency and 4.6x its cost. The final increment from agentic-loop to hybrid (81.2%, 12.1 s, $0.104) provided only 4.8 percentage points at 1.4x the agentic-loop latency and 1.2x its cost, suggesting diminishing returns at the upper end of the quality spectrum.

![Tradeoff Analysis](figures/tradeoff_analysis.png)
*Figure 5. Latency-accuracy-cost tradeoff across agent configurations. Bubble size is proportional to mean cost per query. The hybrid configuration is 7.5x slower (mean latency) and 8.7x more expensive (mean cost) than the baseline, indicating diminishing returns at the upper end of the quality spectrum.*

---

## 4. Discussion

### 4.1 Key Findings in Context

This benchmark demonstrates that retrieval augmentation, agentic reasoning, and self-critique each contribute distinct and partially complementary improvements to LLM response quality on production workloads. The hybrid configuration's 81.2% overall accuracy represents a substantial improvement over the 60.3% baseline, confirming that architectural complexity translates into measurable quality gains. However, the nonlinear scaling of latency and cost with accuracy underscores the importance of deployment strategies that match configuration complexity to query requirements.

The finding that RAG augmentation produces its greatest benefit on factual queries — a 13.4 percentage-point improvement at minimal cost overhead — aligns with the foundational observations of Lewis et al. (2020), who demonstrated that retrieval mechanisms are most effective when the task requires specific factual knowledge not reliably encoded in model parameters. Our results extend this finding to a production context, confirming that the RAG advantage persists even when queries are naturalistic and often ambiguous, rather than curated to match retrieval capabilities.

The agentic-loop configuration's dominance on multi-step reasoning tasks (24.3 percentage-point improvement over baseline) is consistent with the theoretical motivation of the ReAct framework (Yao et al., 2023), which posits that interleaving reasoning with action enables models to decompose complex problems and iteratively refine their solutions. The 8.3% rate of infinite-loop failures on ambiguous multi-step queries represents a practical challenge that academic benchmarks rarely surface, and it highlights the importance of robust termination conditions and fallback strategies in production agentic systems.

The self-critique mechanism in the hybrid configuration intercepted 18% of initial errors, a finding that resonates with the iterative refinement literature. Madaan et al. (2023) reported that Self-Refine improved task performance across multiple benchmarks through iterative self-feedback, while Shinn et al. (2023) demonstrated that reflective mechanisms enable agents to learn from failures within an episode. Our result provides production-scale evidence that self-critique is not merely a benchmark artifact but a deployable mechanism with measurable impact on hallucination reduction. The strong inverse correlation between reasoning steps and hallucination rate (*r* = -0.94) further supports the hypothesis advanced by Gao et al. (2023) that post-hoc verification and revision represent an effective strategy for improving factual grounding.

### 4.2 Implications for Deployment Strategy

The pronounced variation in cost-effectiveness across task categories argues strongly for an adaptive routing architecture. For factual-heavy workloads, the RAG-enhanced configuration offers the optimal balance of quality and cost, delivering a 13.4 percentage-point accuracy improvement at only 1.6x the baseline cost. For multi-step reasoning and code generation, where the agentic-loop and hybrid configurations provide their greatest advantages, the additional latency and cost may be justified — particularly for asynchronous or non-latency-sensitive use cases. For real-time interactive applications, the p99 tail latency of the hybrid configuration (58.2 s) likely exceeds acceptable user experience thresholds, suggesting the need for timeout mechanisms and graceful degradation to simpler configurations.

A complexity-aware routing layer that classifies incoming queries and dispatches them to the appropriate configuration could capture the majority of the hybrid's accuracy benefits while constraining average latency and cost closer to the RAG-enhanced profile. Preliminary analysis suggests that factual queries (comprising one-third of production traffic) could be routed to the RAG-enhanced configuration, while multi-step and code generation queries would benefit from agentic or hybrid processing. Creative queries, which showed the smallest absolute improvement from any augmentation, may not justify the cost of configurations beyond the baseline.
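
The proposed policy can be summarized in a small routing function. The category labels are assumed to come from an upstream query classifier, and the mapping below is illustrative, derived from the category-level results in Table 1 rather than from a deployed router.

```python
from enum import Enum

class Config(Enum):
    BASELINE = "baseline"
    RAG = "rag_enhanced"
    AGENTIC = "agentic_loop"
    HYBRID = "hybrid"

def route(category: str, latency_sensitive: bool) -> Config:
    """Toy complexity-aware routing policy (illustrative, not the deployed logic)."""
    if category == "factual":
        return Config.RAG        # +13.4 pp accuracy at only 1.6x baseline cost
    if category == "creative":
        return Config.BASELINE   # augmentation gains are smallest in this category
    if latency_sensitive:
        return Config.AGENTIC    # avoids the hybrid's 58.2 s p99 tail latency
    return Config.HYBRID         # multi-step reasoning and code generation, async workloads
```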

### 4.3 Potential Optimization Strategies

Several optimization pathways merit investigation to improve the cost-accuracy frontier. First, the self-critique step in the hybrid configuration, which adds approximately 3.2 s of latency and significant token cost, could potentially be implemented with a smaller, cheaper verifier model rather than using the same Claude Opus 4 model for both generation and verification. This approach would preserve much of the error-catching capability while reducing per-query cost. Second, the agentic-loop's infinite-loop failures on ambiguous queries could be mitigated through improved query classification at the routing layer, enabling early identification and reformulation of underspecified inputs before they enter the reasoning loop. Third, retrieval quality for code-domain queries — where 11.2% of RAG retrievals returned irrelevant context — could be improved through domain-specific embedding models or hybrid retrieval strategies that combine semantic search with code-aware indexing, following the active retrieval paradigm proposed by Jiang et al. (2023).
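
As one example, the first strategy amounts to splitting generation and verification across models of different sizes. The sketch below gates the expensive revision pass behind a cheaper verifier; `small_model.generate` and `large_model.generate` are hypothetical wrappers, and the expected saving depends on how often the verifier flags a revision.

```python
def verify_with_small_model(query: str, draft: str, small_model, large_model) -> str:
    """Cheap-verifier variant of the self-critique step (illustrative sketch).

    A smaller model audits the draft; the large model is re-invoked for a revision
    only when the verifier flags problems, avoiding a full second pass on clean drafts.
    """
    verdict = small_model.generate(
        "Does the draft answer contain factual errors, logical gaps, or omissions? "
        f"Reply 'PASS' or list the problems.\n\nQuestion: {query}\n\nDraft: {draft}"
    )
    if verdict.strip().upper().startswith("PASS"):
        return draft  # no expensive revision pass needed
    return large_model.generate(
        f"Revise the draft to address these problems:\n{verdict}\n\n"
        f"Question: {query}\n\nDraft: {draft}"
    )
```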

### 4.4 Strengths and Limitations

This study benefits from several design strengths: evaluation on production queries rather than synthetic benchmarks, a substantial corpus size (312 queries, 3,744 total runs), dual human evaluation with strong inter-rater agreement (*kappa* = 0.78), and comprehensive instrumentation of latency, cost, and error metrics. The use of triplicate runs per configuration enables estimation of response variance, and full logging to S3 ensures reproducibility.

Several limitations should be acknowledged. First, the study compares two underlying model families (GPT-4o and Claude Opus 4) across configurations, which partially confounds the effect of architectural augmentation with differences in base model capability. The RAG-enhanced and baseline configurations both use GPT-4o, while the agentic-loop and hybrid configurations use Claude Opus 4, making it difficult to attribute accuracy differences solely to the architectural enhancements. Second, the query corpus, while drawn from production logs, represents a single two-month snapshot and may not capture seasonal or domain-specific variation in query patterns. Third, the binary correct/incorrect evaluation rubric may obscure graded quality differences between configurations, particularly for creative and open-ended queries where response quality exists on a spectrum. Fourth, the study does not evaluate user satisfaction or downstream task completion, which may not correlate perfectly with the accuracy metric used here.

### 4.5 Future Directions

Future work should address three priorities. First, an ablation study controlling for base model differences — for example, running the agentic-loop and hybrid configurations on GPT-4o, or the baseline and RAG-enhanced configurations on Claude Opus 4 — would isolate the contribution of each architectural component independent of model choice. Second, the development and evaluation of an adaptive routing system based on query complexity classification would test whether the theoretical cost savings identified in this analysis can be realized in practice. Third, longitudinal evaluation over a longer time horizon would reveal whether the observed performance differences remain stable as model versions evolve, embedding indices are updated, and production query distributions shift.

---

## 5. Conclusions

This benchmark establishes that retrieval augmentation, agentic reasoning, and self-critique each provide measurable and distinct improvements to LLM response quality on production workloads, but at sharply increasing latency and cost. The hybrid configuration achieves the highest accuracy (81.2%) and lowest hallucination rate (2.9%), while the RAG-enhanced configuration offers the most efficient tradeoff for factual workloads (+13.4 pp accuracy at 1.6x cost). These results support the implementation of an adaptive routing architecture on the doany.ai platform, in which query complexity determines the optimal agent configuration, balancing quality against cost and latency constraints. Immediate next steps include piloting a complexity-aware router and evaluating smaller verifier models for the self-critique step.

---

## References

Gao, L., Dai, Z., Pasupat, P., Chen, A., Chaganty, A. T., Fan, Y., Zhao, V., Lao, N., Lee, H., Juan, D.-C., & Guu, K. (2023). RARR: Researching and revising what language models say, using language models. In *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)* (pp. 16477–16508). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.910

Jiang, Z., Xu, F. F., Gao, L., Sun, Z., Liu, Q., Dwivedi-Yu, J., Yang, Y., Callan, J., & Neubig, G. (2023). Active retrieval augmented generation. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing* (pp. 7969–7992). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.495

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In *Advances in Neural Information Processing Systems, 33* (pp. 9459–9474). Curran Associates.

Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B. P., Hermann, K., Welleck, S., Yazdanbakhsh, A., & Clark, P. (2023). Self-Refine: Iterative refinement with self-feedback. In *Advances in Neural Information Processing Systems, 36*.

Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. In *Advances in Neural Information Processing Systems, 36* (pp. 68539–68551).

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language agents with verbal reinforcement learning. In *Advances in Neural Information Processing Systems, 36*.

Trivedi, H., Balasubramanian, N., Khot, T., & Sabharwal, A. (2023). Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)* (pp. 10014–10037). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.557

von Elm, E., Altman, D. G., Egger, M., Pocock, S. J., Gøtzsche, P. C., & Vandenbroucke, J. P., for the STROBE Initiative. (2007). The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: Guidelines for reporting observational studies. *Epidemiology*, *18*(6), 800–804. https://doi.org/10.1097/EDE.0b013e3181577654

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. In *Proceedings of the 11th International Conference on Learning Representations (ICLR 2023)*.

---

*Prepared for the CTO Packet, April 2026. For internal distribution only.*
*doany.ai Applied AI Team*
