AI Surpasses Harvard Docs on Clinical Reasoning Test

TOPLINE:

A study comparing the clinical reasoning of an artificial intelligence (AI) model with that of physicians found that the AI outperformed both residents and attending physicians on simulated clinical cases. The AI produced more instances of incorrect reasoning than the physicians did but scored higher overall.

METHODOLOGY:

  • The study involved 39 physicians from two academic medical centers in Boston and the generative AI model GPT-4.
  • Participants were presented with 20 simulated clinical cases involving common problems such as pharyngitis, headache, abdominal pain, cough, and chest pain. Each case included sections describing the triage presentation, review of systems, physical examination, and diagnostic testing.
  • The primary outcome was the Revised-IDEA (R-IDEA) score, a 10-point scale evaluating clinical reasoning documentation across four domains: interpretive summary, differential diagnosis, explanation of the lead diagnosis, and alternative diagnoses.

TAKEAWAY:

  • The AI achieved a median R-IDEA score of 10, higher than attending physicians (median score, 9) and residents (median score, 8).
  • The chatbot had a significantly higher estimated probability of achieving a high R-IDEA score of 8-10 (0.99) than attending physicians (0.76) and residents (0.56).
  • The AI provided more responses containing instances of incorrect clinical reasoning (13.8%) than residents (2.8%) and attending physicians (12.5%) did. It performed similarly to physicians in diagnostic accuracy and inclusion of cannot-miss diagnoses.

IN PRACTICE:

“Future research should assess clinical reasoning of the LLM-physician interaction, as LLMs will more likely augment, not replace, the human reasoning process,” the authors of the study wrote.

SOURCE:

Adam Rodman, MD, MPH, of Beth Israel Deaconess Medical Center, Boston, was the corresponding author of the study, which was published online in JAMA Internal Medicine.

LIMITATIONS:

Simulated clinical cases may not replicate performance in real-world scenarios. Further training could enhance the performance of the AI, so the study may underestimate its capabilities, the researchers noted.

DISCLOSURES:

The study was supported by the Harvard Clinical and Translational Science Center and Harvard University. Authors disclosed financial ties to publishing companies and Solera Health. Dr. Rodman received funding from the Gordon and Betty Moore Foundation.

This article was created using several editorial tools, including AI, as part of the process. Human editors reviewed this content before publication. A version of this article appeared on Medscape.com.
