April 4, 2026

AI tools enhance pediatric diagnostic accuracy

Fact checked by: Kelly King

Key Takeaways

  • Advanced LLMs outperformed human clinicians in diagnosing real-world pediatric cases, particularly when identifying rare diseases.
  • The highest diagnostic accuracy (94.3%) was achieved when AI tools were provided with extended clinical information and used in conjunction with human expertise.

Research shows that combining AI models with human expertise significantly boosts diagnostic performance, particularly for rare diseases.

Investigators from Hospital Sant Joan de Déu have found improved pediatric diagnosis when artificial intelligence (AI) and human clinicians work together, publishing their findings in Pediatric Investigation.1

Challenges of diagnosing diseases in pediatric patients include rare conditions with subtle or overlapping symptoms, leading to uncertainty that may delay diagnosis and cause adverse outcomes. AI models have been considered for improving health care, but most studies have been based on simplified or curated cases.1

“Rather than replacing clinicians, these tools may help broaden the differential diagnosis and reduce the likelihood of missed diagnoses—as long as outputs are interpreted critically and within robust oversight frameworks,” said Cristian Launes, MD, MSc, PhD, clinical professor at Hospital Sant Joan de Déu.1

LLMs vs clinicians

The cross-sectional study was conducted to obtain real-world data about the diagnostic accuracy, consistency, and clinical usability of large language models (LLMs) as diagnostic support tools in pediatric medicine.2 There were 4 LLMs included in the study, as follows:

  • DxGPT/GPT-4 (0613)
  • Claude-3.5 Sonnet
  • GPT-4o (0513)
  • o1-preview

Knowledge cutoffs for these models were September 2021, early 2024, October 2023, and October 2023, respectively. The 4 models were selected based on application programming interface accessibility, benchmark performance, latency, pricing, and architectural diversity.2

These models were compared with 78 pediatric clinicians of varying experience levels across 50 real-world cases, 25 involving rare diseases and 25 involving common conditions. Each case was run through a single LLM 3 times.2

Data extracted for each case included patient sex and age, main symptoms and signs, relevant medical history, physical examination findings, and results from the initial complementary tests. Through discussion, investigators classified scenarios as low, medium, or high complexity.2

Performance was evaluated using top 1 and top 5 diagnostic accuracy, response consistency, and qualitative evaluation. Extended clinical information was provided for 20 cases, bringing the total to 70 unique clinical scenarios.2
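To make the top 1 and top 5 metrics concrete, here is a minimal sketch in Python. It assumes each responder returns a ranked list of diagnoses and that a match is an exact string match. The study likely relied on clinical judgment to decide whether a diagnosis matched, and the function name and case data below are hypothetical, not taken from the paper.

```python
def top_k_accuracy(ranked_lists, correct, k):
    """Fraction of cases whose correct diagnosis is among the first k ranked entries."""
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_lists, correct)
    )
    return hits / len(correct)


# Hypothetical ranked differentials for 4 invented cases (not study data).
ranked_lists = [
    ["kawasaki disease", "scarlet fever", "measles"],
    ["viral gastroenteritis", "appendicitis", "intussusception"],
    ["asthma", "bronchiolitis", "foreign body aspiration"],
    ["migraine", "tension headache", "sinusitis"],
]
# Hypothetical reference diagnoses for the same 4 cases.
correct = ["kawasaki disease", "intussusception", "bronchiolitis", "brain tumor"]

top1 = top_k_accuracy(ranked_lists, correct, k=1)  # only the first case is a top 1 hit
top5 = top_k_accuracy(ranked_lists, correct, k=5)  # the first three cases are top 5 hits
print(f"top 1: {top1:.2f}, top 5: {top5:.2f}")
```

Top 5 accuracy can never be lower than top 1 accuracy for the same responder, because any top 1 hit is also a top 5 hit.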

Accuracy findings

Significant improvements in diagnostic accuracy were reported from advanced LLMs vs clinicians, with top 1 accuracies of 60%, 59%, and 48.2% for o1-preview, Claude-3.5 Sonnet, and clinicians, respectively. This indicated ORs of 2.99 for o1-preview and 2.75 for Claude-3.5 vs clinicians.2
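For readers unfamiliar with odds ratios, the sketch below shows how a crude (unadjusted) OR is computed from two accuracy rates. The article does not say how the reported ORs were estimated, so the crude values here are not expected to match them. The function names are illustrative only.

```python
def odds(p):
    """Convert a probability (0 < p < 1) into odds."""
    return p / (1 - p)


def crude_odds_ratio(p_a, p_b):
    """Unadjusted odds ratio of group A vs group B from two proportions."""
    return odds(p_a) / odds(p_b)


# Top 1 accuracies quoted in the article.
or_o1 = crude_odds_ratio(0.60, 0.482)      # o1-preview vs clinicians, roughly 1.61
or_claude = crude_odds_ratio(0.59, 0.482)  # Claude-3.5 Sonnet vs clinicians, roughly 1.55
print(f"crude OR o1-preview: {or_o1:.2f}, Claude-3.5: {or_claude:.2f}")
```

The gap between these crude values and the reported ORs of 2.99 and 2.75 suggests the study's figures come from a statistical model with additional adjustments, although that is an assumption and is not confirmed by the article.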

LLMs also showed significant improvements in top 5 accuracy. The greatest rates were 78.1% for o1-preview and 77.6% for Claude-3.5. Mid-tier performance was reported for GPT-4o, while the lowest performance was found in the DxGPT model.2

The most significant improvement in LLMs was reported for rare diseases. In this assessment, o1-preview showed a 6-fold improvement in top 5 diagnostic odds vs clinicians (OR, 6).2

This model also had the greatest top 1 accuracy among LLMs for rare diseases, at 50%. In comparison, Claude-3.5 had the greatest top 1 accuracy for common diseases, at 77.1%.2

Synergy between AI and clinical insight

Both LLMs and clinicians showed improved accuracy with extended clinical information. With this information, o1-preview reached a 94.3% union accuracy when combined with clinicians. This rate was 10% greater than clinician accuracy alone. Additionally, clinicians rated DxGPT favorably, with a mean score of 3.9 out of 5 overall and 4.1 out of 5 for rare case support.2
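The article does not define union accuracy. The sketch below assumes it counts a case as correct when the model, the clinician, or both reached the right diagnosis. That definition, the function name, and the per-case outcomes are all hypothetical and are not drawn from the study.

```python
def union_accuracy(model_correct, clinician_correct):
    """Share of cases solved by the model, the clinician, or both (assumed definition)."""
    if len(model_correct) != len(clinician_correct):
        raise ValueError("both lists must cover the same cases")
    hits = sum(m or c for m, c in zip(model_correct, clinician_correct))
    return hits / len(model_correct)


# Hypothetical per-case outcomes for 5 invented cases (True = correct diagnosis).
model_correct = [True, True, False, False, True]
clinician_correct = [True, False, True, False, True]

model_alone = sum(model_correct) / len(model_correct)
clinician_alone = sum(clinician_correct) / len(clinician_correct)
combined = union_accuracy(model_correct, clinician_correct)
print(f"model: {model_alone:.1f}, clinician: {clinician_alone:.1f}, union: {combined:.1f}")
```

Under this assumed definition, the union is at least as high as either responder alone, and it rises most when the model and the clinician miss different cases.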

Overall, the results indicated improved performance from LLMs vs prior models and human clinicians. However, investigators stressed the need for addressing variability, establishing regulatory frameworks, and maintaining human oversight in order to implement these models.2

“AI systems perform best when they are part of a continuous clinical process, where clinicians iteratively gather, verify, and curate the evolving clinical picture to feed the model,” said Launes.1

References

  1. Pediatric Investigation study finds AI and clinicians together improve pediatric diagnosis. News release. Pediatric Investigation. March 31, 2026. Accessed April 3, 2026. https://www.eurekalert.org/news-releases/1122121
  2. Launes C, Esteller-Cucala, Alvarez-Estape M, et al. Large-language-models for pediatric diagnosis: Performance evaluation using real-world clinical notes from common and rare cases. Pediatr Investig. Published online March 25, 2026. doi:10.1002/ped4.70053