In an experiment that pitted the wits of pediatric dermatologists against ChatGPT on board examination–style and case-based questions, the clinicians outperformed the artificial intelligence chatbot, results from a small single-center study showed.

“We were relieved to find that the pediatric dermatologists in our study performed better than ChatGPT on both multiple choice and case-based questions; however, the latest iteration of ChatGPT (4.0) was very close,” one of the study’s first authors, Charles Huang, a fourth-year medical student at Thomas Jefferson University, Philadelphia, said in an interview. “Something else that was interesting in our data was that the pediatric dermatologists performed much better than ChatGPT on questions related to procedural dermatology/surgical techniques, perhaps indicating that knowledge/reasoning gained through practical experience isn’t easily replicated in AI tools such as ChatGPT.”
For the study, which was published on May 9 in Pediatric Dermatology, Mr. Huang; co-first author Esther Zhang, BS, a medical student at the University of Pennsylvania, Philadelphia; and coauthors from the Department of Dermatology, Children’s Hospital of Philadelphia, asked five pediatric dermatologists to answer 24 text-based questions: 16 single-answer multiple-choice questions and two multiple-answer questions drawn from the American Board of Dermatology 2021 Certification Sample Test, and six free-response, case-based questions drawn from the “Photoquiz” section of Pediatric Dermatology published between July 2022 and July 2023. The researchers then ran the same set of questions through ChatGPT versions 3.5 and 4.0 and used statistical analysis to compare the responses of the pediatric dermatologists and ChatGPT. Replies to the case-based questions were scored on a 5-point scale adapted from current AI tools.
On average, the study participants had 5.6 years of clinical experience. The pediatric dermatologists performed significantly better than ChatGPT version 3.5 on multiple-choice and multiple-answer questions (91.4% vs 76.2%; P = .021) but not significantly better than ChatGPT version 4.0 (90.5%; P = .44). On the case-based questions, the average score on the 5-point scale was 3.81 for the pediatric dermatologists and 3.53 for ChatGPT overall. Mean scores were significantly higher for the pediatric dermatologists than for ChatGPT version 3.5 (P = .039) but not ChatGPT version 4.0 (P = .43).
The researchers acknowledged certain limitations of the analysis, including the evolving nature of AI tools, which may affect the reproducibility of results with subsequent model updates. And, while participating pediatric dermatologists said they were unfamiliar with the questions and cases used in the study, “there is potential for prior exposure through other dermatology board examination review processes,” they wrote.
“AI tools such as ChatGPT and similar large language models can be a valuable tool in your clinical practice, but be aware of potential pitfalls such as patient privacy, medical inaccuracies, [and] intrinsic biases in the tools,” Mr. Huang told this news organization. “As these technologies continue to advance, it is essential for all of us as medical clinicians to gain familiarity and stay abreast of new developments, just as we adapted to electronic health records and the use of the Internet.”
Maria Buethe, MD, PhD, a pediatric dermatology fellow at Rady Children’s Hospital–San Diego, who was asked to comment on the study, said she found it “interesting” that ChatGPT version 4.0 started to produce results comparable to clinician responses in some of the tested scenarios.
“The authors propose a set of best practices for pediatric dermatology clinicians using ChatGPT and other AI tools,” said Dr. Buethe, senior author of a recent literature review on AI and its application to pediatric dermatology, which was published in SKIN The Journal of Cutaneous Medicine. “One interesting recommended use for AI tools is to utilize it to generate differential diagnosis, which can broaden the list of pathologies previously considered.”
Asked to comment on the study, Erum Ilyas, MD, who practices dermatology in King of Prussia, Pennsylvania, and is a member of the Society for Pediatric Dermatology, said she was not surprised that ChatGPT “can perform fairly well on multiple-choice questions as we find available in testing circumstances,” as presented in the study. “Just as board questions only support testing a base of medical knowledge and facts for clinicians to master, they do not necessarily provide real-life circumstances that apply to caring for patients, which is inherently nuanced.”
In addition, the study “highlights that ChatGPT can be an aid to support thinking through differentials based on data entered by a clinician who understands how to phrase queries, especially if provided with enough data while respecting patient privacy, in the context of fact checking responses,” Dr. Ilyas said. “This underscores the fact that AI tools can be helpful to clinicians in assimilating various data points entered. However, ultimately, the tool is only able to support an output based on the information it has access to.” She added, “ChatGPT cannot be relied on to provide a single diagnosis with the clinician still responsible for making a final diagnosis. The tool is not definitive and cannot assimilate data that is not entered correctly.”
The study was not funded, and the study authors reported having no disclosures. Dr. Buethe and Dr. Ilyas, who were not involved with the study, had no disclosures.
A version of this article appeared on Medscape.com.