AI in Medicine: Are Large Language Models Ready for the Exam Room?

Publish date: October 29, 2024

In seconds, Ravi Parikh, MD, an oncologist at the Emory University School of Medicine in Atlanta, had a summary of his patient’s entire medical history. Normally, Parikh skimmed the cumbersome files before seeing a patient. However, the artificial intelligence (AI) tool his institution was testing could list the highlights he needed in a fraction of the time.

“On the whole, I like it ... it saves me time,” Parikh said of the tool. “But I’d be lying if I told you it was perfect all the time. It’s interpreting the [patient] history in some ways that may be inaccurate,” he said.

Within the first week of testing the tool, Parikh started to notice that the large language model (LLM) made a particular mistake in his patients with prostate cancer. If their prostate-specific antigen test results came back slightly elevated — which is part of normal variation — the LLM recorded it as disease progression. Because Parikh reviews all his notes — with or without using an AI tool — after a visit, he easily caught the mistake before it was added to the chart. “The problem, I think, is if these mistakes go under the hood,” he said.

In the data science world, these mistakes are called hallucinations. And a growing body of research suggests they’re happening more frequently than is safe for healthcare. The industry promised LLMs would alleviate administrative burden and reduce physician burnout. But so far, studies show these AI-tool mistakes often create more work for doctors, not less. To truly help physicians and be safe for patients, some experts say healthcare needs to build its own LLMs from the ground up. And all agree that the field desperately needs a way to vet these algorithms more thoroughly.

Prone to Error

Right now, “I think the industry is focused on taking existing LLMs and forcing them into usage for healthcare,” said Nigam H. Shah, MBBS, PhD, chief data scientist for Stanford Health. However, the value of deploying general LLMs in the healthcare space is questionable. “People are starting to wonder if we’re using these tools wrong,” he told this news organization.

In 2023, Shah and his colleagues evaluated seven LLMs on their ability to answer electronic health record–based questions. For realistic tasks, the error rate in the best cases was about 35%, he said. “To me, that rate seems a bit high ... to adopt for routine use.”

A study earlier this year by the UC San Diego School of Medicine showed that using LLMs to respond to patient messages increased the time doctors spent on messages. And this summer, a study by the clinical AI firm Mendel found that when GPT-4o or Llama-3 was used to summarize patient medical records, almost every summary contained at least one type of hallucination.

“We’ve seen cases where a patient does have drug allergies, but the system says ‘no known drug allergies’ ” in the medical history summary, said Wael Salloum, PhD, cofounder and chief science officer at Mendel. “That’s a serious hallucination.” And if physicians have to constantly verify what the system is telling them, that “defeats the purpose [of summarization],” he said.
