# FORMAL PEER REVIEW

**Manuscript Title:** Emergency Department Sepsis Detection with a Gradient-Boosted Clinical AI Model: A Multicenter Validation Study

**Reviewer Recommendation:** MAJOR REVISION REQUIRED

**Date:** April 14, 2026

---

## SUMMARY ASSESSMENT

This retrospective study develops an XGBoost model for predicting sepsis within 48 hours of ED arrival across three hospitals. While the research question is clinically relevant and the reported AUROC (0.91) appears promising, the manuscript contains significant methodological flaws, statistical errors, inadequate reporting, and overstated conclusions that preclude publication in its current form. Major revisions addressing fundamental issues in study design, analysis, and interpretation are required.

---

## 1. METHODS ASSESSMENT

### Critical Issues:

**1.1 Study Design Mislabeling**
- The authors claim this is a "multicenter validation study" but this is actually a multicenter *development* study. True external validation would require developing the model on one set of hospitals and validating on completely separate institutions not used in any phase of development.
- The 70/15/15 split across all three hospitals means the model was trained on data from all sites, making this internal validation only.

**1.2 Sample Size and Power**
- Total sample (n=842, 155 events) is modest for machine learning, particularly with 18.4% prevalence
- The test set contains only ~126 encounters (~23 sepsis cases), which is insufficient for stable performance estimation
- Authors acknowledge no power analysis was performed, which is problematic for drawing definitive conclusions
- No justification for the 70/15/15 split ratio

**1.3 Temporal Validation Absent**
- No mention of temporal splitting (training on earlier dates, testing on later dates)
- This is critical for clinical ML models to assess temporal drift and generalizability
- Random splitting can lead to optimistic bias if practice patterns change over time
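A temporal hold-out is straightforward to implement. As a minimal sketch (the column name `arrival_time` and the 15% test fraction are illustrative, not taken from the manuscript):

```python
# Illustrative temporal split: train on earlier encounters, test on later ones.
# "arrival_time" and the 15% test fraction are hypothetical choices.
import pandas as pd

def temporal_split(df, time_col="arrival_time", test_frac=0.15):
    """Hold out the most recent fraction of encounters as the test set."""
    df = df.sort_values(time_col)
    cut = int(len(df) * (1 - test_frac))
    return df.iloc[:cut], df.iloc[cut:]

# Synthetic stand-in data
df = pd.DataFrame({
    "arrival_time": pd.date_range("2023-01-01", periods=100, freq="D"),
    "lactate": range(100),
})
train, test = temporal_split(df)
# Every test encounter occurs after every training encounter
assert train["arrival_time"].max() < test["arrival_time"].min()
```

Reporting performance on such a temporally held-out slice, alongside the random split, would directly address the drift concern.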

**1.4 Operating Threshold Selection (CRITICAL FLAW)**
- Line 68: "The operating threshold was chosen by maximizing Youden's J statistic on the test set"
- This is a fundamental methodological error constituting data leakage
- The test set must remain completely untouched until final evaluation
- Threshold selection should occur on the validation set only
- This invalidates the reported performance metrics
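The fix is mechanical: derive the cutoff on the validation set and apply it unchanged to the test set. A sketch on synthetic data (all variable names are illustrative):

```python
# Hypothetical sketch: choose the operating threshold on the VALIDATION set,
# then touch the test set exactly once, at final evaluation.
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_prob):
    """Return the probability cutoff maximizing Youden's J = sens + spec - 1."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    return thresholds[np.argmax(tpr - fpr)]

rng = np.random.default_rng(0)
# Toy validation/test outcomes and model scores standing in for real outputs
y_val = rng.integers(0, 2, 200)
p_val = np.clip(y_val * 0.4 + rng.uniform(0, 0.6, 200), 0, 1)
y_test = rng.integers(0, 2, 200)
p_test = np.clip(y_test * 0.4 + rng.uniform(0, 0.6, 200), 0, 1)

thr = youden_threshold(y_val, p_val)      # chosen on validation data only
test_preds = (p_test >= thr).astype(int)  # test set used once, at the end
```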

**1.5 Missing Data Handling**
- Median imputation is overly simplistic and can introduce bias
- No reporting of missingness patterns or percentages by feature
- No sensitivity analysis for imputation strategy
- Missing lab values may be informative (missing not at random)

**1.6 Feature Engineering and Selection**
- No discussion of feature selection methodology
- Winsorization at 1st/99th percentiles is reasonable but not justified
- No mention of multicollinearity assessment
- Treatment variables (antibiotics, vasopressors) may introduce reverse causality if administered based on suspected sepsis

### Moderate Issues:

**1.7 Outcome Adjudication**
- Chart adjudication by residents (not attendings) may introduce variability
- No inter-rater reliability metrics reported (kappa, agreement rate)
- No discussion of adjudication protocol or training
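Reporting agreement is inexpensive once both adjudicators' labels are available. A sketch with invented labels:

```python
# Inter-rater reliability via Cohen's kappa; the two label lists are invented
# stand-ins for two adjudicators' sepsis determinations on the same charts.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
rater_b = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]  # disagrees on one chart
kappa = cohen_kappa_score(rater_a, rater_b)  # ~0.78 for these toy labels
```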

**1.8 Inclusion/Exclusion Criteria**
- Requiring 6-hour ED stay may exclude rapid admissions or transfers
- This creates selection bias toward less acute presentations
- Generalizability to typical ED populations is questionable

---

## 2. STATISTICAL ANALYSIS

### Critical Issues:

**2.1 Confidence Intervals Absent**
- No confidence intervals for AUROC, sensitivity, specificity, or other metrics
- Bootstrapping should be used to estimate uncertainty
- Without CIs, the precision and stability of the 0.91 AUROC estimate cannot be assessed
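Percentile-bootstrap CIs are simple to produce. A minimal sketch on synthetic data sized like the reported test set:

```python
# Percentile bootstrap for an AUROC confidence interval; all data here are
# synthetic stand-ins for the manuscript's test-set labels and scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_prob, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # resample must contain both classes
            continue
        scores.append(roc_auc_score(y_true[idx], y_prob[idx]))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 126)  # roughly the reported test-set size
p = np.clip(y * 0.3 + rng.normal(0.4, 0.2, 126), 0, 1)
lo, hi = bootstrap_auroc_ci(y, p)
```

With ~23 events, the resulting interval will likely be wide, which is exactly the point the manuscript needs to convey.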

**2.2 Calibration Problems Not Addressed**
- Calibration slope of 0.64 (ideal ≈1.0) indicates predictions are too extreme, i.e., overconfident
- The intercept of 0.09 is close to zero, but with a slope this far below 1 the predicted probabilities remain systematically miscalibrated
- Brier score of 0.19 is weak: a no-information model predicting the 18.4% prevalence for every patient would score approximately 0.15
- Authors do not discuss these calibration issues or their clinical implications
- Poor calibration means predicted probabilities are not reliable for clinical decision-making
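For reference, the calibration slope and intercept can be estimated by regressing observed outcomes on the logit of predicted probabilities (slope ≈1 and intercept ≈0 for a well-calibrated model). A sketch using synthetic, deliberately overconfident predictions:

```python
# Estimating calibration slope/intercept: logistic regression of outcomes on
# the logit of predicted probabilities. Synthetic data; an overconfident
# model (outcomes less extreme than predictions) yields slope < 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibration_slope_intercept(y_true, y_prob, eps=1e-6):
    p = np.clip(y_prob, eps, 1 - eps)
    logit = np.log(p / (1 - p))
    # Large C approximates an unpenalized fit
    lr = LogisticRegression(C=1e6).fit(logit.reshape(-1, 1), y_true)
    return lr.coef_[0, 0], lr.intercept_[0]

rng = np.random.default_rng(2)
p = rng.uniform(0.05, 0.95, 1000)
true_p = 1 / (1 + np.exp(-0.6 * np.log(p / (1 - p))))  # truth is less extreme
y = rng.binomial(1, true_p)
slope, intercept = calibration_slope_intercept(y, p)  # slope near 0.6
```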

**2.3 Multiple Comparisons**
- 12 subgroup analyses performed without multiple testing correction
- Risk of false positive findings
- Should apply Bonferroni correction or report this limitation
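At 12 tests, the correction is consequential. A toy illustration (the p-values are invented):

```python
# Bonferroni correction for 12 subgroup comparisons; p-values are invented.
p_values = [0.01, 0.04, 0.20, 0.03] + [0.5] * 8   # 12 hypothetical tests
alpha = 0.05
adjusted_alpha = alpha / len(p_values)            # 0.05 / 12 ~= 0.0042
significant = [p < adjusted_alpha for p in p_values]
# Even a nominal p = 0.01 does not survive the corrected threshold here
```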

**2.4 Statistical Testing Issues**
- Comparison of AUROCs (0.91 vs 0.71, p<0.001) method not specified
- Should use DeLong test or similar paired comparison
- T-tests and chi-square tests assume independence; unclear if clustering by hospital/patient was addressed
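If a DeLong implementation is unavailable, a paired bootstrap that resamples the same indices for both models is one reasonable alternative. A sketch on synthetic scores (not the authors' method):

```python
# Paired bootstrap for comparing two AUROCs on the same test set: resample
# identical indices for both models to preserve pairing. Synthetic data.
import numpy as np
from sklearn.metrics import roc_auc_score

def paired_auroc_pvalue(y, p_a, p_b, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        if len(np.unique(y[idx])) < 2:
            continue
        diffs.append(roc_auc_score(y[idx], p_a[idx]) - roc_auc_score(y[idx], p_b[idx]))
    diffs = np.asarray(diffs)
    # Crude two-sided p-value: how often the bootstrapped difference crosses zero
    return 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())

rng = np.random.default_rng(3)
y = rng.integers(0, 2, 300)
p_model = np.clip(y * 0.5 + rng.uniform(0, 0.5, 300), 0, 1)  # strong model
p_qsofa = np.clip(y * 0.1 + rng.uniform(0, 0.9, 300), 0, 1)  # weak baseline
pval = paired_auroc_pvalue(y, p_model, p_qsofa)
```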

**2.5 Class Imbalance Handling**
- `scale_pos_weight=4` is mentioned but not justified
- Should match the imbalance ratio (approximately 4.4:1 for 18.4% prevalence)
- No discussion of alternative approaches (SMOTE, stratified sampling)
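The conventional choice, per the XGBoost documentation, is the negative-to-positive ratio, which at the reported cohort numbers is not a round 4:

```python
# Deriving scale_pos_weight from the reported cohort rather than a round number.
n_total, prevalence = 842, 0.184     # figures reported in the manuscript
n_pos = round(n_total * prevalence)  # 155 sepsis encounters
n_neg = n_total - n_pos              # 687 non-sepsis encounters
scale_pos_weight = n_neg / n_pos     # ~= 4.43, not exactly 4
```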

### Moderate Issues:

**2.6 Hyperparameter Tuning**
- Grid search details not provided (search space, cross-validation folds)
- Risk of overfitting to validation set if extensive tuning performed
- No nested cross-validation mentioned

**2.7 Model Interpretation**
- Feature importance mentioned briefly but not shown
- No SHAP values or other explainability analysis presented
- Critical for clinical adoption and trust

---

## 3. REPRODUCIBILITY

### Critical Issues:

**3.1 Data Availability**
- "Available upon reasonable request" is insufficient for reproducibility
- Data should be deposited in a public repository (with appropriate de-identification)
- "Institutional and vendor constraints" suggests data will not actually be shared
- This violates modern open science standards

**3.2 Code Availability**
- Code is not publicly available
- Should be deposited on GitHub/Zenodo with DOI
- Preprocessing, training, and evaluation code all necessary

**3.3 Insufficient Implementation Details**
- Random seed not specified
- Train/validation/test split methodology not detailed
- Preprocessing pipeline not fully described
- Software versions provided but environment not fully specified

### Moderate Issues:

**3.4 Model Sharing**
- Trained model weights not available
- Model card or documentation not provided
- Deployment specifications absent

---

## 4. ETHICS AND GOVERNANCE

### Critical Issues:

**4.1 Conflicts of Interest**
- Two authors are advisors to SepsiSense, a company developing ED decision-support software
- This is directly related to the manuscript topic
- Potential for bias in interpretation and presentation
- Should disclose which authors and nature of financial relationships

**4.2 Funding Source**
- MedAxis Analytics is a commercial entity
- Potential influence on study design and reporting
- Need transparency about funder role in analysis and manuscript preparation

**4.3 Overstated Claims Without Evidence**
- Lines 91-92: "could reduce sepsis mortality by approximately 20%"
- This is pure speculation without any intervention data
- Irresponsible claim that could mislead stakeholders
- Prediction model performance ≠ intervention effectiveness

### Moderate Issues:

**4.4 Patient and Public Involvement**
- Authors acknowledge no PPI involvement
- For a clinical decision tool, patient perspectives on acceptable false positive/negative rates would be valuable
- Should justify this omission

**4.5 Equity and Fairness**
- No discussion of potential algorithmic bias
- No subgroup analysis by race/ethnicity
- Fairness metrics (equalized odds, demographic parity) not assessed
- Critical for clinical AI deployment
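Even a basic equalized-odds check (true and false positive rates per subgroup) would strengthen the manuscript. A minimal sketch on synthetic labels:

```python
# Equalized-odds sketch: compare TPR/FPR across a demographic subgroup.
# All arrays are synthetic; "group" stands in for a sensitive attribute.
import numpy as np

def rates_by_group(y_true, y_pred, group):
    """Return {group: (TPR, FPR)} for binary labels and predictions."""
    out = {}
    for g in np.unique(group):
        m = group == g
        tpr = y_pred[m & (y_true == 1)].mean()
        fpr = y_pred[m & (y_true == 0)].mean()
        out[g] = (tpr, fpr)
    return out

y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 1, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
rates = rates_by_group(y_true, y_pred, group)
# Large TPR/FPR gaps between groups would flag an equalized-odds violation
```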

**4.6 Deployment Considerations**
- No discussion of implementation challenges
- No consideration of alert fatigue
- No workflow integration planning
- Premature to recommend "immediate progression toward clinical use" (line 90)

---

## 5. REPORTING STANDARDS

### Critical Issues:

**5.1 Wrong Reporting Guideline**
- Authors claim to follow CONSORT (line 99)
- CONSORT is for randomized controlled trials
- Should follow TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis)
- This is a fundamental misunderstanding of study type

**5.2 Missing TRIPOD Elements**
- No clear statement of intended use and target population
- Sample size justification absent
- Handling of missing data inadequately described
- Model specification incomplete
- No discussion of model updating or recalibration needs
- Risk groups and clinical interpretation not provided

**5.3 PROBAST Risk of Bias**
- Assessed with this tool, the study would likely be rated at high risk of bias due to:
  - Inappropriate analysis (threshold selection on test set)
  - Inadequate sample size for test set
  - Poor calibration not addressed
  - No external validation

### Moderate Issues:

**5.4 Figure and Table References**
- Figure 1 and Table 1 referenced but not provided
- Table S1 referenced but supplement not included
- Cannot fully assess cohort characteristics

**5.5 Incomplete Methods**
- qSOFA calculation details sparse
- Chart adjudication protocol not described
- De-identification process not detailed

---

## 6. STRENGTHS

Despite the significant concerns, the manuscript has notable strengths:

1. **Clinically Important Question**: Early sepsis detection in ED is a genuine clinical need with potential for meaningful impact

2. **Multi-Site Data**: Using three hospitals (even if not true external validation) is better than single-center studies

3. **Patient-Level Splitting**: Splitting at patient level prevents data leakage from repeated encounters

4. **Appropriate Outcome Definition**: Sepsis-3 criteria with chart adjudication is rigorous and clinically relevant

5. **Comparator Benchmark**: Including qSOFA comparison provides clinical context

6. **Secondary Outcome**: 30-day mortality association provides clinical validity evidence

7. **Reasonable Model Choice**: XGBoost is appropriate for tabular clinical data

8. **Transparency About Limitations**: Authors acknowledge the retrospective design and EHR-derived labels (though they dismiss these concerns too quickly)

---

## 7. MAJOR COMMENTS

### Major Comment 1: Fundamental Methodological Flaw - Test Set Contamination
The selection of the operating threshold on the test set (line 68) is a critical error that invalidates the reported performance metrics. This must be corrected by:
- Selecting threshold on validation set only
- Re-running all test set evaluations with the validation-derived threshold
- Reporting both threshold-independent (AUROC) and threshold-dependent metrics
- This will likely result in lower reported performance

### Major Comment 2: This is Not External Validation
The title and framing claim "multicenter validation" but this is misleading. The model was trained on data from all three hospitals. True external validation requires:
- Either: Develop on hospitals 1-2, validate on hospital 3
- Or: Develop on all three hospitals, validate on completely separate institutions
- The current design is multicenter development with internal validation
- Title and conclusions must be revised accordingly

### Major Comment 3: Inadequate Sample Size for Definitive Conclusions
With only ~126 test encounters (~23 sepsis cases), the performance estimates are unstable. The confidence intervals would likely be wide (e.g., AUROC 0.91 might have 95% CI of 0.83-0.97). Authors should:
- Report confidence intervals for all metrics
- Acknowledge uncertainty in performance estimates
- Tone down definitive language about clinical deployment
- Consider pooling more data or using cross-validation for more stable estimates

### Major Comment 4: Calibration Failure Not Addressed
A calibration slope of 0.64 means predicted probabilities are too extreme. Applying the reported slope (0.64) and intercept (0.09) on the logit scale, for example, a predicted risk of 10% would correspond to an observed risk of roughly 21%, and a predicted risk of 90% to roughly 82%. This is critical for clinical use. Authors must:
- Present calibration plots
- Discuss clinical implications of miscalibration
- Consider recalibration methods (Platt scaling, isotonic regression)
- Acknowledge this as a major limitation
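Platt scaling, for instance, fits a logistic model to validation-set logits and applies it to test predictions. A hedged sketch on synthetic data (not the authors' pipeline):

```python
# Platt-scaling recalibration sketch: fit on validation predictions, apply to
# test predictions. All inputs are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_recalibrate(p_val, y_val, p_test, eps=1e-6):
    """Logistic recalibration fitted on validation data, applied to test data."""
    def to_logit(p):
        p = np.clip(p, eps, 1 - eps)
        return np.log(p / (1 - p))
    # Large C approximates an unpenalized fit
    lr = LogisticRegression(C=1e6).fit(to_logit(p_val).reshape(-1, 1), y_val)
    return lr.predict_proba(to_logit(p_test).reshape(-1, 1))[:, 1]

rng = np.random.default_rng(4)
p_val = rng.uniform(0.05, 0.95, 500)
y_val = rng.binomial(1, 1 / (1 + np.exp(-0.6 * np.log(p_val / (1 - p_val)))))
p_test = rng.uniform(0.05, 0.95, 100)
p_recal = platt_recalibrate(p_val, y_val, p_test)
```

Isotonic regression is a non-parametric alternative, but with ~126 validation-sized samples a two-parameter logistic recalibration is less prone to overfitting.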

### Major Comment 5: Overstated and Unsupported Claims
The claim of "20% mortality reduction" (lines 91-92) and recommendation for "immediate progression toward clinical use" (line 90) are irresponsible:
- No intervention has been tested
- Prediction ≠ intervention effectiveness
- Implementation science challenges not addressed
- These statements must be removed or heavily qualified

### Major Comment 6: Conflicts of Interest and Bias
Two authors advising a commercial ED decision-support company represents a significant conflict. The manuscript should:
- Specify which authors have this relationship
- Describe the nature and extent of financial relationships
- Discuss how conflicts were managed in study conduct and interpretation
- Consider independent data analysis verification

### Major Comment 7: Wrong Reporting Standard
Using CONSORT instead of TRIPOD demonstrates fundamental misunderstanding. Authors must:
- Complete the TRIPOD checklist
- Restructure manuscript to include all TRIPOD elements
- Add required sections (intended use, risk stratification, model updating)

### Major Comment 8: Reproducibility Insufficient
"Available upon reasonable request" is not acceptable for a clinical AI model. Authors should:
- Deposit de-identified data in a public repository (or explain specific legal barriers)
- Share all code on GitHub with clear documentation
- Provide trained model weights
- Create a model card documenting specifications, limitations, and intended use

---

## 8. MINOR COMMENTS

### Minor Comment 1: Abstract
- Line 9: "Data were split into train/validation/test sets" - specify this was done at patient level
- Line 12: Report confidence intervals for AUROC values
- Line 13: "At the selected operating threshold" - specify how threshold was selected

### Minor Comment 2: Introduction
- Line 19: "Existing bedside scores show limited sensitivity" - provide specific values
- Line 22: The hypothesis of AUROC >0.85 is reasonable, but justifying this threshold would strengthen the manuscript

### Minor Comment 3: Methods
- Line 32: "ED stay >=6 hours" - justify this cutoff
- Line 52: Winsorization percentiles - justify choice
- Line 63: `scale_pos_weight=4` - justify this value
- Line 68: Statistical analysis section should specify software versions for all packages

### Minor Comment 4: Results
- Line 78: "Median age 59 years" - provide mean ± SD as well for completeness
- Line 81: Provide confidence intervals for all performance metrics
- Line 84: "12 total" subgroups - list all subgroups analyzed
- Line 84: Calibration results should be in a dedicated subsection with plots

### Minor Comment 5: Discussion
- Line 90-92: Remove or heavily qualify speculative claims about mortality reduction
- Line 93: "Larger cohort" - 842 encounters is not particularly large for ML
- Line 94: Feature importance should be shown in results, not just mentioned
- Line 95: "Unlikely to materially affect conclusions" - this dismisses limitations too quickly

### Minor Comment 6: References
- Only 4 references is insufficient
- Should cite relevant ML in sepsis literature
- Should cite TRIPOD guidelines
- Should cite calibration methodology papers

### Minor Comment 7: Terminology
- "AI model" vs "machine learning model" - be consistent
- "Chart-adjudicated" - define this process clearly
- "High-risk alerts" - define the threshold used

### Minor Comment 8: Missing Information
- No discussion of computational requirements
- No discussion of inference time (critical for ED deployment)
- No discussion of model updating frequency needed
- No discussion of performance monitoring in deployment

---

## 9. SPECIFIC RECOMMENDATIONS FOR REVISION

### Must Address (Required for Acceptance):
1. Re-analyze with threshold selection on validation set only
2. Report confidence intervals for all performance metrics
3. Change title and framing from "validation" to "development" study
4. Address calibration issues with plots and discussion
5. Remove unsupported claims about mortality reduction and immediate deployment
6. Switch from CONSORT to TRIPOD reporting
7. Provide detailed conflict of interest disclosures
8. Commit to public data and code sharing (or provide detailed justification if impossible)

### Should Address (Strongly Recommended):
9. Add temporal validation analysis
10. Perform external validation on separate institutions
11. Report inter-rater reliability for outcome adjudication
12. Add fairness and equity analysis
13. Expand discussion of limitations
14. Add feature importance plots and SHAP analysis
15. Discuss deployment considerations and implementation challenges
16. Expand references to relevant literature

### Could Address (Would Strengthen):
17. Increase sample size if possible
18. Compare to additional baselines (SIRS, NEWS, etc.)
19. Add cost-effectiveness analysis
20. Include patient/clinician perspectives on acceptable performance

---

## 10. CONCLUSION

This manuscript addresses an important clinical problem and reports promising preliminary results. However, it contains fundamental methodological flaws (test set contamination, miscalibration), inadequate reporting (wrong guidelines, missing details), and overstated conclusions that preclude publication without major revision.

The most critical issue is the selection of the operating threshold on the test set, which invalidates the reported performance metrics. Additionally, the poor calibration, small test set, lack of true external validation, and unsupported claims about clinical impact represent serious concerns.

With substantial revision addressing the methodological errors, proper reporting following TRIPOD guidelines, toned-down conclusions, and commitment to reproducibility, this work could make a meaningful contribution to the literature. However, in its current form, it is not suitable for publication.

**Recommendation: MAJOR REVISION REQUIRED**

The authors should address all critical issues and most major comments before resubmission. A revised manuscript would require careful re-review to ensure methodological soundness.

---

**Reviewer Signature:** [Peer Reviewer]  
**Date:** April 14, 2026  
