The artificial intelligence chatbot ChatGPT shows promise for assessing patients who present to the emergency department (ED) with acute ulcerative colitis (UC).
In a small study, ChatGPT version 4 (GPT-4) accurately gauged disease severity and made decisions about the need for hospitalization that were largely in line with expert gastroenterologists.
“Our findings suggest that GPT-4 has potential as a clinical decision-support tool in assessing UC severity and recommending suitable settings for further treatment,” say the authors, led by Asaf Levartovsky, MD, department of gastroenterology, Sheba Medical Center, Tel Aviv University.
The study was published online in the American Journal of Gastroenterology.
Assessing its potential
UC is a chronic inflammatory bowel disease known for episodes of flare-ups and remissions. Flare-ups often result in a trip to the ED, where staff must rapidly assess disease severity and need for hospital admission.
Dr. Levartovsky and colleagues explored how helpful GPT-4 could be in 20 distinct presentations of acute UC in the ED. They assessed the chatbot’s ability to determine the severity of disease and whether a specific presentation warranted hospital admission for further treatment.
They fed GPT-4 case summaries that included crucial data such as symptoms, vital signs, and laboratory results. For each case, they asked the chatbot to assess disease severity based on established criteria and recommend hospital admission or outpatient care.
The GPT-4 answers were compared with assessments made by gastroenterologists and the actual decision regarding hospitalization made by the physician in the ED.
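For readers curious what this kind of prompting looks like in practice, the minimal sketch below illustrates the general approach described above. It is not the authors' code: it assumes the OpenAI Python SDK, a hypothetical case summary, and illustrative prompt wording, and is offered only as an example of how a structured case summary might be passed to GPT-4 for a severity assessment and admission recommendation.

```python
# Illustrative sketch only -- not the study authors' code.
# Assumes the OpenAI Python SDK ("pip install openai") and an API key
# available in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Hypothetical case summary; the details and format are placeholders,
# not taken from the study.
case_summary = (
    "45-year-old with known ulcerative colitis. Symptoms: 8 bloody stools/day, "
    "abdominal pain. Vitals: temperature 38.1 C, heart rate 102 bpm. "
    "Labs: hemoglobin 10.2 g/dL, CRP 45 mg/L, ESR 35 mm/h."
)

# Illustrative prompt asking for a severity classification and a disposition
# recommendation, mirroring the kind of questions described in the article.
prompt = (
    "Based on established criteria for ulcerative colitis severity, classify this "
    "presentation as mild, moderate, or severe, describe the severity of each "
    "variable, and recommend hospital admission or outpatient care.\n\n"
    f"Case summary: {case_summary}"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```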
Overall, ChatGPT categorized acute UC as severe in 12 patients, moderate in 7, and mild in 1. In each case, the chatbot provided a detailed answer describing the severity of each variable in the criteria and an overall severity classification.
ChatGPT’s assessments were consistent with gastroenterologists’ assessments 80% of the time, with a “high degree of reliability” between the two assessments, the study team reports.
The average correlation between ChatGPT and physician ratings was 0.839 (P < .001). Inconsistencies in four cases stemmed largely from inaccurate cut-off values for systemic variables (such as hemoglobin and tachycardia).
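As a rough illustration of how agreement figures like these can be computed, the sketch below maps severity labels to ordinal scores and calculates percent agreement and a rank correlation. It is not the study's analysis code; the correlation method and the example labels are assumptions chosen for illustration, not the study's data.

```python
# Illustrative sketch only -- not the study's analysis code.
# Requires scipy ("pip install scipy"). The choice of Spearman correlation
# and the example labels below are assumptions for demonstration.
from scipy.stats import spearmanr

SEVERITY = {"mild": 1, "moderate": 2, "severe": 3}

def agreement_and_correlation(chatbot_labels, physician_labels):
    # Convert categorical severity labels to ordinal scores.
    chatbot = [SEVERITY[label] for label in chatbot_labels]
    physician = [SEVERITY[label] for label in physician_labels]
    # Percent of cases where the two assessments match exactly.
    matches = sum(c == p for c, p in zip(chatbot, physician))
    percent_agreement = 100 * matches / len(chatbot)
    # Rank correlation between the two sets of ratings.
    rho, p_value = spearmanr(chatbot, physician)
    return percent_agreement, rho, p_value

# Hypothetical example labels (placeholders, not the study's cases):
chatbot = ["severe", "severe", "moderate", "mild", "moderate"]
physician = ["severe", "moderate", "moderate", "mild", "moderate"]
print(agreement_and_correlation(chatbot, physician))
```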
Following severity assessment, ChatGPT leaned toward hospital admission for 16 patients, whereas in actual clinical practice, only 12 patients were hospitalized. In one patient with moderate UC who was discharged, ChatGPT was in favor of hospitalization, based on the patient’s age and comorbid conditions. For two moderate UC cases, the chatbot recommended consultation with a health care professional for further evaluation and management, which the researchers deemed to be an indecisive response.
Based on their findings, the researchers say ChatGPT could serve as a real-time decision support tool, one that is not meant to replace physicians but rather to enhance human decision-making.
They note that the small sample size is a limitation and that ChatGPT’s accuracy rate requires further validation across larger samples and diverse clinical scenarios.
The study had no specific funding. The authors report no relevant financial relationships.
A version of this article first appeared on Medscape.com.