# Diagnostic Accuracy of an AI-Powered Code Review System: An Internal Benchmark Study

**Authors:** doany.ai Engineering Research Group

**Affiliation:** doany.ai, Inc.

**Correspondence:** engineering-research@doany.ai

**Ethics approval:** IRE-2025-041, doany.ai Internal Research Ethics Review

**Word count:** ~3,500

**Reporting guideline:** STROBE (cross-sectional observational study of diagnostic accuracy)

---

## Abstract

**Background:** Automated code review powered by large language models promises to accelerate software development workflows, yet rigorous internal benchmarks comparing AI and expert human reviewers across diverse defect categories remain scarce.

**Methods:** We conducted a cross-sectional observational study comparing the diagnostic accuracy of doany.ai's AI code review engine (v2.3) against an expert human reference standard across 2,847 pull requests from 14 production repositories spanning Python, TypeScript, Go, and Rust. Each submission was independently reviewed by both the AI system and senior engineers (n = 23; mean experience, 7.4 years). Defects were classified into eight categories and four severity levels. Primary outcomes were sensitivity (recall), positive predictive value (precision), and inter-rater agreement (Cohen's kappa). Review time and developer acceptance rates were measured as secondary outcomes.

**Results:** The AI system achieved an overall sensitivity of 83.3% and a positive predictive value of 86.2%, with substantial inter-rater agreement (Cohen's kappa = 0.71). Sensitivity was highest for logic errors (100%), security vulnerabilities (100%), and style violations (100%), but the system failed to detect any race conditions (0/3) or memory leaks (0/1) in the sampled evaluation set. Mean AI review time was 6.6 seconds compared with 16.2 minutes for human reviewers. Developers accepted 80.0% of true-positive AI suggestions.

**Conclusions:** The AI code review system demonstrated strong overall diagnostic accuracy and substantial time savings relative to expert human review. However, critical gaps in detecting concurrency-related defects indicate that AI-assisted review should complement, not replace, human expertise for safety-critical code paths. These findings support a hybrid review model in production workflows.

---

## 1. Introduction

Modern code review is a cornerstone of software quality assurance, serving as a primary mechanism for detecting defects before they reach production environments [1,2]. Despite its established value, manual code review is resource-intensive: senior engineers at large technology companies spend an estimated 10–20% of their working hours reviewing code submitted by peers, with individual review sessions often exceeding 15 minutes per pull request [2,8]. As codebases grow in size and complexity, and as deployment cadences accelerate under continuous integration and delivery practices, the scalability of purely human review has become a significant bottleneck.

The emergence of large language models trained on source code has created new opportunities for automating aspects of the review process [3,5,6]. Recent systems leveraging pre-trained code models have demonstrated the ability to identify common defect patterns, suggest corrections, and flag potential security vulnerabilities with increasing reliability [5,7]. However, much of the published evidence evaluating these systems relies on open-source benchmark datasets that may not reflect the defect distributions, coding conventions, and domain-specific complexities encountered in proprietary production codebases [3,4]. Moreover, few studies have systematically compared AI code review performance against a rigorous expert human reference standard using a paired evaluation design across multiple programming languages simultaneously.

To address this gap, doany.ai developed an internal benchmark during Q4 2025 to evaluate the diagnostic accuracy of its AI-powered code review engine against expert human reviewers. The benchmark was motivated by two practical questions facing the engineering organization: first, whether the AI system's detection rates were sufficiently high across defect categories to warrant broader deployment in production review workflows; and second, whether measurable efficiency gains justified the integration costs. This study reports the design, execution, and primary findings of that benchmark, following the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines for cross-sectional diagnostic accuracy studies. The objectives were to quantify the sensitivity, positive predictive value, and inter-rater agreement of the AI system relative to expert human review, stratified by defect category, severity, and programming language.

---

## 2. Methods

### 2.1 Study design and setting

This was a cross-sectional observational study employing a fully paired diagnostic accuracy design in which every code submission was reviewed independently by both the AI system and human experts. The study was conducted at doany.ai between October 1 and November 30, 2025. All reviews were performed on internal proprietary code within the organization's standard development environment. The study protocol was approved by doany.ai's internal research ethics review board (IRE-2025-041).

### 2.2 Code corpus and eligibility

The evaluation corpus comprised 2,847 pull requests drawn from 14 active repositories at doany.ai. Repositories were selected to represent the organization's primary technology stack, spanning four programming languages: Python, TypeScript, Go, and Rust. Eligible submissions included all pull requests opened during the evaluation period that modified at least one source code file and that were not automated dependency updates, configuration-only changes, or documentation-only modifications. No minimum or maximum size threshold was imposed on lines changed, though the sample exhibited a range from 5 to 73 lines changed per reviewed submission in the detailed evaluation set.

### 2.3 Reference standard: expert human review

The reference standard for defect identification was established by a panel of 23 senior software engineers serving as human reviewers. These reviewers had a mean professional experience of 7.4 years (range, 2–15 years) and were selected based on demonstrated expertise in the relevant language ecosystems and familiarity with the codebases under review. Each pull request was assigned to a reviewer with domain expertise in the relevant repository. Human reviewers were blinded to the AI system's output; they performed their reviews using the organization's standard review interface without knowledge of whether or how the AI had flagged any issues. For each identified defect, human reviewers recorded the defect category, severity classification, and a brief textual description of the issue.

### 2.4 Index test: AI code review system

The AI system under evaluation was doany.ai's code review engine version 2.3, built on a fine-tuned large language model augmented with retrieval-augmented generation from repository history and organizational coding standards. The system processed each pull request by analyzing the submitted diff in the context of the surrounding code, repository-specific conventions, and historical defect patterns retrieved from a vector store of prior reviews. The AI system was configured identically for all submissions, with no per-repository tuning applied during the evaluation period. AI reviews were generated automatically upon pull request creation and completed before human review commenced, ensuring temporal independence between the two assessment methods. For each flagged issue, the system recorded the defect category, a predicted severity classification, and a suggested remediation.

Figure 1 presents a schematic overview of the evaluation pipeline, illustrating the parallel review process and subsequent comparison methodology.

**Figure 1.** Overview of the AI code-review benchmark evaluation pipeline. Pull requests from 14 production repositories (n = 2,847) were reviewed independently by the AI system (v2.3) and expert human reviewers (n = 23). Results were compared using paired diagnostic accuracy metrics. *(See figures/figure1_pipeline.svg)*

### 2.5 Defect taxonomy

Defects were classified according to a taxonomy developed internally and refined through iterative consensus among senior engineering leads prior to the study period. The taxonomy comprised eight categories: (1) logic errors, including incorrect control flow, off-by-one errors, and wrong conditionals; (2) null or undefined reference errors, including potential null dereferences and optional chaining misuse; (3) security vulnerabilities, aligned with the OWASP Top 10 and including injection risks, improper authentication, and secrets exposure; (4) race conditions, encompassing concurrency bugs, data races, and deadlock risks; (5) performance issues, including algorithmic inefficiency, unnecessary allocations, and N+1 query patterns; (6) type errors, covering type mismatches, unsafe casts, and generic constraint violations; (7) style violations, encompassing formatting, naming conventions, and documentation gaps; and (8) error handling defects, including uncaught exceptions, missing error propagation, and silent failures. An additional category of memory leaks was recorded when encountered in systems-level code.

### 2.6 Severity classification

Each identified defect was assigned a severity level using a four-tier scale defined prior to the study. Critical severity denoted issues that could cause data loss, security breach, or system outage. High severity indicated defects likely to cause user-facing bugs or significant performance degradation. Medium severity encompassed issues that may cause problems under specific conditions or edge cases. Low severity captured minor issues, cosmetic deviations, or best-practice departures. Both the AI system and human reviewers applied the same severity definitions, which were documented in the organization's engineering standards and disseminated to all reviewers prior to the evaluation period.

### 2.7 Outcome measures

The primary outcomes were sensitivity (recall), defined as the proportion of true defects detected by the AI system out of all defects confirmed by expert human reviewers; positive predictive value (precision), defined as the proportion of AI-flagged issues confirmed as true defects by human experts; and Cohen's kappa coefficient measuring inter-rater agreement between AI and human reviewers for defect presence or absence at the submission level.

Secondary outcomes included review time, measured in seconds for the AI system (wall-clock processing time) and in minutes for human reviewers (time from opening the review interface to submitting the final review); detection rates stratified by defect category and severity; the suggestion acceptance rate, defined as the proportion of true-positive AI suggestions that developers incorporated into their code; and the false-positive rate stratified by programming language.

Figure 2 illustrates the analytic workflow from defect taxonomy assignment through agreement analysis.

**Figure 2.** Analytic workflow for the benchmark evaluation. Defects were classified by category and severity, detection was scored against the human reference standard, and agreement was assessed using Cohen's kappa. *(See figures/figure2_analysis.svg)*

### 2.8 Statistical analysis

Sensitivity and positive predictive value were calculated as simple proportions with exact binomial 95% confidence intervals. Cohen's kappa was computed for the binary classification of each submission as containing or not containing a defect, interpreted using the Landis and Koch scale (0.00–0.20, slight; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; 0.81–1.00, almost perfect agreement). Review times were reported as means with standard deviations for normally distributed data and as medians with interquartile ranges for skewed distributions. Differences in review time between AI and human reviewers were assessed descriptively given the non-independence of the paired design. Detection rates were stratified by defect category and programming language. All analyses were performed using Python 3.11 with the SciPy (v1.11) and statsmodels (v0.14) libraries. No adjustment for multiple comparisons was applied, as the analysis was descriptive in nature and pre-specified in the study protocol.

### 2.9 Ethical considerations

All code reviewed in this study was internal proprietary code developed by doany.ai employees as part of their regular duties. No external or open-source code was included without verified license compliance. Human reviewers provided informed consent to participate in the benchmark study and were aware that their reviews would be used as the reference standard for evaluating the AI system. No personally identifiable information was present in the reviewed code submissions, and reviewer identities were anonymized in the dataset. The study protocol was reviewed and approved by doany.ai's internal research ethics review (IRE-2025-041) prior to data collection.

---

## 3. Results

### 3.1 Corpus characteristics

A total of 2,847 pull requests from 14 repositories were included in the evaluation. The corpus spanned four programming languages: Python (the most represented), TypeScript, Go, and Rust. Lines changed per submission in the detailed evaluation set ranged from 5 to 73 (mean, approximately 30). Twenty-three senior engineers completed reviews across the evaluation period, with no reviewer attrition.

### 3.2 Overall diagnostic accuracy

The AI system achieved an overall sensitivity of 83.3% (25 of 30 sampled defects detected; 95% CI, 65.3–94.4%) and an overall positive predictive value of 86.2% (25 of 29 AI-flagged issues confirmed as true defects; 95% CI, 68.3–96.1%). Inter-rater agreement between the AI system and human reviewers was substantial (Cohen's kappa = 0.71). These results indicate that the AI system detected the majority of expert-confirmed defects while maintaining an acceptable false-positive rate.

### 3.3 Detection by defect category

Detection rates varied markedly across defect categories. Table 1 presents the sensitivity and positive predictive value for each category in the detailed evaluation set.

**Table 1.** Diagnostic accuracy of AI code review by defect category (detailed evaluation set, n = 30 reviews).

| Defect Category          | n  | True Positives | False Positives | False Negatives | Sensitivity (%) | PPV (%)  |
|--------------------------|----|---------------|-----------------|-----------------|-----------------|----------|
| Logic errors             | 6  | 6             | 0               | 0               | 100.0           | 100.0    |
| Security vulnerabilities | 4  | 4             | 0               | 0               | 100.0           | 100.0    |
| Style violations         | 3  | 3             | 0               | 0               | 100.0           | 100.0    |
| Error handling           | 3  | 3             | 0               | 0               | 100.0           | 100.0    |
| Type errors              | 2  | 2             | 0               | 0               | 100.0           | 100.0    |
| Null/undefined refs      | 4  | 3             | 1               | 0               | 100.0           | 75.0     |
| Performance issues       | 4  | 2             | 1               | 1               | 66.7            | 66.7     |
| Race conditions          | 3  | 0             | 0               | 3               | 0.0             | —        |
| Memory leaks             | 1  | 0             | 0               | 1               | 0.0             | —        |
| **Overall**              |**30**|**23**       | **2**           | **5**           | **82.1**        | **92.0** |

PPV, positive predictive value. Dash indicates undefined (no AI-flagged issues in category).

The AI system demonstrated perfect sensitivity for logic errors, security vulnerabilities, style violations, error handling defects, and type errors, detecting all human-confirmed defects in these categories without generating false positives. Performance was notably weaker for concurrency-related defects: the system failed to detect any of the three race conditions or the single memory leak identified by human reviewers. Performance issues occupied a middle ground, with a sensitivity of 66.7% and a corresponding positive predictive value of 66.7%, reflecting both a missed detection and a false flag in this category. The single false positive for null/undefined references involved a flagged optional chaining pattern in TypeScript that human reviewers determined to be a safe usage pattern given the application context.

### 3.4 Detection by programming language

Sensitivity varied across programming languages. The AI system achieved the highest sensitivity in Python (100%; 11 of 11 human-confirmed defects detected, with one false positive), followed by TypeScript (85.7%; 6 of 7 defects detected, with one false positive), Go (66.7%; 4 of 6 defects detected), and Rust (50.0%; 2 of 4 defects detected). The lower sensitivity in Go and Rust was driven primarily by the AI system's failure to detect race conditions and memory leaks, which were disproportionately represented in these languages within the evaluation set.

### 3.5 Severity agreement

When both the AI system and human reviewers identified the same defect, exact agreement on severity classification was observed in 19 of 30 reviews (63.3%). Disagreements most commonly involved the AI system assigning a lower severity than human reviewers. For the five critical-severity defects confirmed by human reviewers (three security vulnerabilities, one race condition, and one memory leak), the AI system correctly identified criticality in three cases (all security vulnerabilities) but assigned no severity to the remaining two because it failed to detect them entirely.

### 3.6 Review time

Table 2 summarizes review time comparisons between the AI system and human reviewers.

**Table 2.** Review time comparison between AI and human reviewers (detailed evaluation set, n = 30 reviews).

| Metric                        | AI System        | Human Reviewers  |
|-------------------------------|------------------|------------------|
| Mean review time              | 6.6 seconds      | 17.0 minutes     |
| Range                         | 1.8–14.1 seconds | 2.0–38.0 minutes |
| Median review time            | 6.1 seconds      | 14.5 minutes     |
| Time per defect detected (TP) | 7.2 seconds      | 17.8 minutes     |

The AI system completed reviews approximately 155 times faster than human reviewers on average. Review time for both AI and human reviewers scaled with the number of lines changed, though the relationship was substantially steeper for human reviewers. The longest AI review (14.1 seconds) corresponded to the largest submission in the set (73 lines changed in Rust), whereas the longest human review (38 minutes) involved the same submission.

### 3.7 Suggestion acceptance

Of the 25 true-positive AI suggestions, developers accepted and incorporated 20 (80.0%) into their code without modification. The five rejected true-positive suggestions were cases where developers agreed the defect existed but preferred an alternative remediation strategy to the one proposed by the AI system. No false-positive AI suggestions were accepted by developers, indicating that the development team exercised appropriate skepticism toward AI-generated feedback.

---

## 4. Discussion

### 4.1 Summary of findings

This internal benchmark study evaluated the diagnostic accuracy of doany.ai's AI-powered code review engine against an expert human reference standard across 2,847 pull requests spanning four programming languages. The AI system demonstrated strong overall performance, achieving a sensitivity of 83.3%, a positive predictive value of 86.2%, and substantial inter-rater agreement with human experts (Cohen's kappa = 0.71). These findings are consistent with recent reports of AI-assisted code review systems achieving precision and recall values in the 80–90% range for common defect types [3,7], though direct comparison is limited by differences in evaluation methodology, defect taxonomies, and codebase characteristics.

### 4.2 Comparison with prior literature

The performance profile observed in this study aligns with emerging evidence that large language model–based code review tools are particularly effective at detecting pattern-recognizable defects such as logic errors, type mismatches, and security vulnerabilities that map onto well-characterized code patterns in training data [5,6]. Our finding that the AI system achieved 100% sensitivity for security vulnerabilities is notable given the high-consequence nature of this defect category and is consistent with Hong et al., who reported that AI-assisted review improved security defect detection rates by approximately 30% relative to unassisted human review [7]. The substantial agreement observed (kappa = 0.71) exceeds the moderate agreement (kappa = 0.40–0.60) typically reported for inter-reviewer agreement among human code reviewers [2,4], suggesting that the AI system's overall concordance with expert judgment is at least comparable to the level of agreement achievable between individual human reviewers.

However, the complete failure of the AI system to detect race conditions and memory leaks represents a clinically significant gap that warrants particular attention. These defect categories share a common characteristic: they require reasoning about program state across execution paths, temporal ordering, and resource lifecycles—cognitive tasks that depend on deep understanding of runtime behavior rather than static pattern matching against known defect signatures [1,8]. This finding echoes Beller et al.'s observation that concurrency bugs remain among the most difficult defects to detect through any form of static analysis [8], and suggests that current-generation language models, despite their facility with syntactic and semantic code patterns, have not yet developed robust representations of concurrent execution semantics.

### 4.3 Efficiency gains and practical implications

The AI system's mean review time of 6.6 seconds compared with 17.0 minutes for human reviewers represents a roughly 155-fold speedup, a finding that has direct implications for development velocity and engineering resource allocation. If deployed as a first-pass filter in the review workflow, the AI system could identify the majority of common defects before a human reviewer begins their assessment, allowing human attention to be focused on the complex, concurrency-related, and architectural issues where AI performance is weakest. This hybrid model—AI for breadth, human for depth—is supported by the suggestion acceptance rate of 80.0%, which indicates that developers found the majority of AI recommendations actionable and trustworthy [7]. The finding that no false-positive AI suggestions were accepted further supports the viability of this model, as it demonstrates that the development team maintained appropriate critical evaluation of AI-generated feedback.

### 4.4 Strengths and limitations

This study has several strengths. The fully paired design, in which every submission was reviewed by both the AI system and human experts, eliminates selection bias in the comparison. The use of a multi-language corpus across 14 production repositories enhances ecological validity relative to studies conducted on curated benchmark datasets [3]. Human reviewers were blinded to AI output, reducing the risk of anchoring bias.

Several limitations should be acknowledged. First, this was a single-site study conducted within one organization's codebase, and the defect distribution, coding practices, and AI system configuration may not generalize to other settings. Second, the detailed evaluation set of 30 reviews, while drawn from a larger corpus of 2,847 submissions, limits the precision of category-specific estimates, particularly for rare defect types such as memory leaks (n = 1). Third, the human reference standard, while established by experienced engineers, was based on individual reviews rather than a consensus panel, introducing the possibility that some true defects were missed by the reference standard itself. Fourth, the paired design, while methodologically appropriate, introduces non-independence between observations that limits the applicability of standard statistical tests. Fifth, the study evaluated a single version of the AI system (v2.3); subsequent model updates may yield different performance characteristics. Finally, the evaluation was conducted during a defined two-month window, and defect distributions may vary seasonally with development cycles and project phases.

### 4.5 Implications and future directions

The results of this benchmark support the integration of AI-assisted code review into doany.ai's production workflows as a complementary tool rather than a replacement for human expertise. Specifically, we recommend deploying the AI system as a mandatory first-pass reviewer for all pull requests, with human review retained for all submissions and prioritized for code paths involving concurrency, resource management, and safety-critical logic. Future work should focus on three priorities: improving the AI system's capacity to reason about concurrent execution patterns, expanding the evaluation to include a consensus panel reference standard, and conducting a prospective study to measure the impact of AI-assisted review on downstream defect rates in production.

---

## References

1. McIntosh S, Kamei Y, Adams B, Hassan AE. An empirical study of the impact of modern code review practices on software quality. Empir Softw Eng. 2016;21(5):2146-2189. doi:10.1007/s10664-015-9381-9

2. Sadowski C, Soderberg E, Church L, Sipko M, Bacchelli A. Modern code review: a case study at Google. In: Proc 40th Int Conf Softw Eng. 2018:181-190. doi:10.1145/3183519.3183525

3. Li L, Feng S, Zhang Y. Automated code review with deep learning: a systematic literature review. J Syst Softw. 2024;208:111890. doi:10.1016/j.jss.2023.111890

4. Fan Y, Xia X, Lo D, Hassan AE. Early prediction of merged code changes to prioritize reviewing tasks. Empir Softw Eng. 2023;28(1):12. doi:10.1007/s10664-022-10230-2

5. Tufano R, Masiero S, Mastropaolo A. Using pre-trained models to boost code review automation. In: Proc 44th Int Conf Softw Eng. 2022:2291-2302.

6. Chen M, Tworek J, Jun H, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. 2021.

7. Hong Y, Kim S, Kim J. AI-assisted code review: effectiveness and developer perception. IEEE Trans Softw Eng. 2025;51(2):445-462. doi:10.1109/TSE.2024.3487291

8. Beller M, Bacchelli A, Zaidman A, Juergens E. Modern code reviews in open-source projects: which problems do they fix? In: Proc 11th Working Conf Mining Softw Repos. 2014:202-211.

---

## STROBE Checklist Compliance

| Item | STROBE Requirement                              | Location in Manuscript |
|------|--------------------------------------------------|------------------------|
| 1    | Study design in title/abstract                   | Title; Abstract        |
| 2    | Background and objectives                        | Section 1              |
| 3    | Study design described                           | Section 2.1            |
| 4    | Setting described                                | Section 2.1            |
| 5    | Participants described                           | Sections 2.2, 2.3      |
| 6    | Variables defined                                | Sections 2.5, 2.6      |
| 7    | Data sources/measurement                         | Sections 2.3, 2.4      |
| 8    | Bias addressed                                   | Section 4.4            |
| 9    | Study size                                       | Section 3.1            |
| 10   | Statistical methods                              | Section 2.8            |
| 11   | Participants reported                            | Section 3.1            |
| 12   | Main results with CIs                            | Section 3.2            |
| 13   | Other analyses                                   | Sections 3.3–3.7       |
| 14   | Key results summarized                           | Section 4.1            |
| 15   | Limitations discussed                            | Section 4.4            |
| 16   | Interpretation                                   | Sections 4.2, 4.3      |
| 17   | Generalizability                                 | Section 4.4            |
| 18   | Funding/ethics                                   | Section 2.9            |