TOPLINE:
ChatGPT (version 3.5) provides relatively poor and inconsistent responses when asked about appropriate colorectal cancer (CRC) screening and surveillance, a new study showed.
METHODOLOGY:
- Three board-certified gastroenterologists with 10+ years of clinical experience developed five CRC screening and five CRC surveillance clinical vignettes (with multiple-choice answers), which were fed to ChatGPT version 3.5.
- ChatGPT’s responses were recorded over four separate sessions and assessed for accuracy and consistency to determine the reliability of the tool.
- The average number of correct answers was compared with that of 238 gastroenterologists and colorectal surgeons answering the same questions with and without the help of a previously validated CRC screening mobile app.
TAKEAWAY:
- ChatGPT’s average overall performance was 45% (4.5 of 10 questions correct); the average number of correct answers was 2.75 for screening and 1.75 for surveillance.
- ChatGPT’s responses were inconsistent across sessions in a large proportion of questions: for four of the 10 questions, the tool gave different answers in different sessions.
- ChatGPT’s average number of total correct answers was significantly lower (P < .001) than that of physicians both with and without the mobile app (7.71 and 5.62 correct answers, respectively).
IN PRACTICE:
“The use of validated mobile apps with decision-making algorithms could serve as more reliable assistants until large language models developed with AI are further refined,” the authors concluded.
SOURCE:
The study, with first author Lisandro Pereyra, MD, Department of Gastroenterology, Hospital Alemán of Buenos Aires, Argentina, was published online on February 7, 2024, in the Journal of Clinical Gastroenterology.
LIMITATIONS:
The 10 clinical vignettes represented a relatively small sample size for assessing accuracy. The study did not use the latest version of ChatGPT. No fine-tuning with diverse prompts, instructions, or relevant data was attempted, which could potentially have improved the chatbot’s performance.
DISCLOSURES:
The study had no specific funding. The authors declared no conflicts of interest.
A version of this article appeared on Medscape.com.