Changed: Tue, 10/29/2024 - 10:52

In seconds, Ravi Parikh, MD, an oncologist at the Emory University School of Medicine in Atlanta, had a summary of his patient’s entire medical history. Normally, Parikh skimmed the cumbersome files before seeing a patient. However, the artificial intelligence (AI) tool his institution was testing could list the highlights he needed in a fraction of the time.

“On the whole, I like it ... it saves me time,” Parikh said of the tool. “But I’d be lying if I told you it was perfect all the time. It’s interpreting the [patient] history in some ways that may be inaccurate,” he said.

Within the first week of testing the tool, Parikh started to notice that the large language model (LLM) made a particular mistake in his patients with prostate cancer. If their prostate-specific antigen test results came back slightly elevated — which is part of normal variation — the LLM recorded it as disease progression. Because Parikh reviews all his notes — with or without using an AI tool — after a visit, he easily caught the mistake before it was added to the chart. “The problem, I think, is if these mistakes go under the hood,” he said.
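
The mistake Parikh describes amounts to treating any PSA rise as progression. As a rough illustration of the kind of deterministic guardrail that could catch it, the sketch below flags a rise only when it clears both a relative and an absolute threshold. The thresholds and the `flag_progression` helper are hypothetical, chosen for illustration; real progression criteria are more involved and depend on clinical context.

```python
from dataclasses import dataclass

@dataclass
class PSAReading:
    value_ng_ml: float  # PSA concentration in ng/mL

def flag_progression(prior: PSAReading, current: PSAReading,
                     rel_threshold: float = 0.25,
                     abs_threshold: float = 2.0) -> bool:
    """Flag a PSA rise as possible progression only when it exceeds
    both a relative and an absolute threshold; smaller rises are
    treated as normal assay variation."""
    rise = current.value_ng_ml - prior.value_ng_ml
    if rise <= 0:
        return False
    return (rise / prior.value_ng_ml) >= rel_threshold and rise >= abs_threshold

# A 4.0 -> 4.3 ng/mL bump is within normal variation, not progression.
print(flag_progression(PSAReading(4.0), PSAReading(4.3)))  # False
print(flag_progression(PSAReading(4.0), PSAReading(7.0)))  # True
```

A rule like this would not replace the LLM; it would sit alongside it, vetoing a "progression" label that the structured lab values do not support.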

In the data science world, these mistakes are called hallucinations. And a growing body of research suggests they’re happening more frequently than is safe for healthcare. The industry promised LLMs would alleviate administrative burden and reduce physician burnout. But so far, studies show these AI-tool mistakes often create more work for doctors, not less. To truly help physicians and be safe for patients, some experts say healthcare needs to build its own LLMs from the ground up. And all agree that the field desperately needs a way to vet these algorithms more thoroughly.
 

Prone to Error

Right now, “I think the industry is focused on taking existing LLMs and forcing them into usage for healthcare,” said Nigam H. Shah, MBBS, PhD, chief data scientist for Stanford Health. However, the value of deploying general LLMs in the healthcare space is questionable. “People are starting to wonder if we’re using these tools wrong,” he told this news organization.

In 2023, Shah and his colleagues evaluated seven LLMs on their ability to answer electronic health record–based questions. For realistic tasks, the error rate in the best cases was about 35%, he said. “To me, that rate seems a bit high ... to adopt for routine use.”

A study earlier this year by the UC San Diego School of Medicine showed that using LLMs to respond to patient messages increased the time doctors spent on messages. And this summer, a study by the clinical AI firm Mendel found that when GPT-4o or Llama-3 were used to summarize patient medical records, almost every summary contained at least one type of hallucination.

“We’ve seen cases where a patient does have drug allergies, but the system says ‘no known drug allergies’ ” in the medical history summary, said Wael Salloum, PhD, cofounder and chief science officer at Mendel. “That’s a serious hallucination.” And if physicians have to constantly verify what the system is telling them, that “defeats the purpose [of summarization],” he said.
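
The contradiction Salloum describes can be caught without another model: a deterministic cross-check against the chart's structured fields suffices. A minimal sketch, with a hypothetical function name and field layout invented for illustration:

```python
def check_allergy_claim(summary_text: str, allergy_list: list[str]) -> list[str]:
    """Return warnings when a generated summary contradicts the
    structured allergy list in the chart."""
    warnings = []
    claims_none = "no known drug allergies" in summary_text.lower()
    if claims_none and allergy_list:
        warnings.append(
            f"Summary claims no drug allergies, but chart lists: {allergy_list}"
        )
    return warnings

issues = check_allergy_claim(
    "Patient stable. No known drug allergies.",
    ["penicillin"],
)
print(issues)  # one warning, flagged for clinician review
```

Checks like this only cover claims that map cleanly onto structured data, but those are exactly the hallucinations with the clearest safety consequences.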
 

 

 

A Higher Quality Diet

Part of the trouble with LLMs is that there’s just not enough high-quality information to feed them. The algorithms are insatiable, requiring vast swaths of data for training. GPT-3.5, for instance, was trained on 570 GB of data from the internet, more than 300 billion words. And to train GPT-4o, OpenAI reportedly transcribed more than 1 million hours of YouTube content.

However, the strategies that built these general LLMs don’t always translate well to healthcare. The internet is full of low-quality or misleading health information from wellness sites and supplement advertisements. And even data that are trustworthy, like the millions of clinical studies and the US Food and Drug Administration (FDA) statements, can be outdated, Salloum said. And “an LLM in training can’t distinguish good from bad,” he added.

The good news is that clinicians don’t rely on controversial information in the real world. Medical knowledge is standardized. “Healthcare is a domain rich with explicit knowledge,” Salloum said. So there’s potential to build a more reliable LLM that is guided by robust medical standards and guidelines.

Healthcare might also turn to small language models, the pocket-sized cousins of LLMs, which handle narrower tasks with bite-sized datasets, fewer computing resources, and easier fine-tuning, according to Microsoft’s website. Shah said training these smaller models on real medical data might be an option; an LLM meant to respond to patient messages, for example, could be trained on real messages sent by physicians.
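
As a sketch of what such training data might look like, message threads could be flattened into prompt/completion pairs for fine-tuning. The field names here are invented for illustration, and any real dataset of this kind would need de-identification and appropriate consent before use.

```python
def to_training_pairs(threads: list[dict]) -> list[dict]:
    """Turn (patient message, physician reply) threads into
    prompt/completion pairs for fine-tuning a small model."""
    pairs = []
    for thread in threads:
        pairs.append({
            "prompt": thread["patient_message"],
            "completion": thread["physician_reply"],
        })
    return pairs

threads = [{
    "patient_message": "Is mild redness at the injection site normal?",
    "physician_reply": "Yes, mild redness for a day or two is common.",
}]
print(to_training_pairs(threads))
```

The point is less the transformation itself than the provenance: every completion comes from a real clinician, not the general internet.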

Several groups are already working on databases of standardized human medical knowledge or real physician responses. “Perhaps that will work better than using LLMs trained on the general internet. Those studies need to be done,” Shah said.

Jon Tamir, assistant professor of electrical and computer engineering and co-lead of the AI Health Lab at The University of Texas at Austin, said, “The community has recognized that we are entering a new era of AI where the dataset itself is the most important aspect. We need training sets that are highly curated and highly specialized.

“If the dataset is highly specialized, it will definitely help reduce hallucinations,” he said.
 

Cutting Overconfidence

A major problem with LLM mistakes is that they are often hard to detect. Hallucinations can be highly convincing even if they’re highly inaccurate, according to Tamir.

When Shah, for instance, was recently testing an LLM on de-identified patient data, he asked the LLM which blood test the patient last had. The model responded with “complete blood count [CBC].” But when he asked for the results, the model gave him a white blood cell count and other values. “Turns out that record did not have a CBC done at all! The result was entirely made up,” he said.

Making healthcare LLMs safer and more reliable will mean training AI to acknowledge potential mistakes and uncertainty. Existing LLMs are trained to project confidence and to produce an answer even when there isn’t a good one, Salloum said. They rarely respond with “I don’t know,” even when their prediction has low confidence, he added.

Healthcare stands to benefit from a system that highlights uncertainty and potential errors. For instance, a patient’s history might show that they smoked, stopped smoking, vaped, and then started smoking again. The LLM might call them a smoker but flag the label as uncertain because the chronology is complicated, Salloum said.
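
Salloum's example can be made concrete: instead of emitting a bare label, the system returns a status plus an uncertainty flag whenever the history keeps changing direction. A toy sketch, with event names and the review threshold invented for illustration:

```python
def classify_smoking(events: list[str]) -> dict:
    """Derive a smoking status from a chronological event list, and
    flag the result as uncertain when the history flips repeatedly."""
    status = "never smoker"
    changes = 0
    for event in events:
        new_status = status
        if event in ("started_smoking", "resumed_smoking"):
            new_status = "current smoker"
        elif event == "stopped_smoking":
            new_status = "former smoker"
        elif event == "started_vaping":
            new_status = "vaping"
        if new_status != status:
            changes += 1
            status = new_status
    # A history with several reversals is complicated enough to
    # warrant clinician review rather than a confident label.
    return {"status": status, "uncertain": changes > 2}

history = ["started_smoking", "stopped_smoking",
           "started_vaping", "resumed_smoking"]
print(classify_smoking(history))
# {'status': 'current smoker', 'uncertain': True}
```

The flag does not make the model smarter; it routes the hard cases to the person best placed to resolve them.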

Tamir added that this strategy could improve collaboration between LLMs and doctors by homing in on where human expertise is needed most.
 

 

 

Too Little Evaluation

For any improvement strategy to work, LLMs — and all AI-assisted healthcare tools — first need a better evaluation framework. So far, LLMs have “been used in really exciting ways but not really well-vetted ways,” Tamir said.
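
The simplest building block of such a framework is a measured hallucination rate over human-graded outputs, similar in spirit to the Mendel study described above. A minimal sketch, with a hypothetical grading format:

```python
def hallucination_rate(graded: list[dict]) -> float:
    """Fraction of model outputs that a human reviewer marked as
    containing at least one unsupported claim."""
    flagged = sum(1 for g in graded if g["hallucinated"])
    return flagged / len(graded)

reviews = [
    {"case": "summary_001", "hallucinated": False},
    {"case": "summary_002", "hallucinated": True},
    {"case": "summary_003", "hallucinated": False},
    {"case": "summary_004", "hallucinated": True},
]
print(hallucination_rate(reviews))  # 0.5
```

A real evaluation framework would need far more than this one number, such as severity grading, inter-rater agreement, and task-specific benchmarks, but a tracked error rate is the minimum needed to answer Shah's question of whether one model is better or worse than another.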

While some AI-assisted tools, particularly in medical imaging, have undergone rigorous FDA evaluations and earned approval, most haven’t. And because the FDA only regulates algorithms that are considered medical devices, Parikh said that most LLMs used for administrative tasks and efficiency don’t fall under the regulatory agency’s purview.

But these algorithms still have access to patient information and can directly influence patient and doctor decisions. Third-party regulatory agencies are expected to emerge, but it’s still unclear who those will be. Before developers can build a safer and more efficient LLM for healthcare, they’ll need better guidelines and guardrails. “Unless we figure out evaluation, how would we know whether the healthcare-appropriate large language models are better or worse?” Shah asked.
 

A version of this article appeared on Medscape.com.
