A Higher-Quality Diet
Part of the trouble with LLMs is that there’s just not enough high-quality information to feed them. The algorithms are insatiable, requiring vast swaths of data for training. GPT-3.5, for instance, was trained on 570 GB of data from the internet, more than 300 billion words. And to train GPT-4o, OpenAI reportedly transcribed more than 1 million hours of YouTube content.
However, the strategies that built these general LLMs don’t always translate well to healthcare. The internet is full of low-quality or misleading health information from wellness sites and supplement advertisements. Even data that are trustworthy, such as the millions of clinical studies and statements from the US Food and Drug Administration (FDA), can be outdated, Salloum said. “An LLM in training can’t distinguish good from bad,” he added.
The good news is that clinicians don’t rely on controversial information in the real world. Medical knowledge is standardized. “Healthcare is a domain rich with explicit knowledge,” Salloum said. So there’s potential to build a more reliable LLM that is guided by robust medical standards and guidelines.
It’s possible that healthcare could use small language models, the pocket-sized cousins of LLMs, which perform narrower tasks, train on bite-sized datasets, demand fewer resources, and are easier to fine-tune, according to Microsoft’s website. Shah said training these smaller models on real medical data might be an option — for example, an LLM meant to respond to patient messages could be trained on real messages sent by physicians.
Several groups are already working on databases of standardized human medical knowledge or real physician responses. “Perhaps that will work better than using LLMs trained on the general internet. Those studies need to be done,” Shah said.
Jon Tamir, assistant professor of electrical and computer engineering and co-lead of the AI Health Lab at The University of Texas at Austin, said, “The community has recognized that we are entering a new era of AI where the dataset itself is the most important aspect. We need training sets that are highly curated and highly specialized.
“If the dataset is highly specialized, it will definitely help reduce hallucinations,” he said.
Cutting Overconfidence
A major problem with LLM mistakes is that they are often hard to detect. Hallucinations can be highly convincing even if they’re highly inaccurate, according to Tamir.
When Shah, for instance, was recently testing an LLM on de-identified patient data, he asked the LLM which blood test the patient last had. The model responded with “complete blood count [CBC].” But when he asked for the results, the model gave him white blood count and other values. “Turns out that record did not have a CBC done at all! The result was entirely made up,” he said.
Making healthcare LLMs safer and more reliable will mean training AI to acknowledge potential mistakes and uncertainty. Existing LLMs are trained to project confidence and to produce an answer even when there isn’t one, Salloum said. They rarely respond with “I don’t know,” even when their prediction has low confidence, he added.
Healthcare stands to benefit from a system that highlights uncertainty and potential errors. For instance, if a patient’s history shows they have smoked, stopped smoking, vaped, and started smoking again, the LLM might call them a smoker but flag that label as uncertain because the chronology is complicated, Salloum said.
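The idea Salloum describes — surfacing a prediction’s confidence rather than hiding it — can be sketched in a few lines. Note that this is a minimal illustration, not any real system’s design: the confidence scores, thresholds, and function name are all hypothetical assumptions for the sake of the example.

```python
# Hypothetical sketch of confidence-aware output: instead of always
# asserting an answer, the system abstains or flags uncertainty when
# the model's own confidence score is low. Scores and thresholds here
# are illustrative assumptions, not values from any production system.

def answer_with_uncertainty(answer: str, confidence: float,
                            flag_below: float = 0.7,
                            abstain_below: float = 0.4) -> str:
    """Return the answer, an uncertainty-flagged answer, or an abstention."""
    if confidence < abstain_below:
        return "I don't know"             # too uncertain: abstain outright
    if confidence < flag_below:
        return f"{answer} [uncertain]"    # surface the doubt to the clinician
    return answer                         # confident enough to state plainly

# A complicated smoking history might yield a low-confidence label,
# which the system flags rather than asserts:
print(answer_with_uncertainty("current smoker", 0.55))
# A clear-cut record passes through unchanged:
print(answer_with_uncertainty("nonsmoker", 0.95))
```

In practice, the confidence score would come from the model itself (e.g., token-level probabilities or a calibrated secondary classifier); the point of the sketch is only that routing low-confidence answers to a flagged or abstaining path is a small mechanical change once such a score exists.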
Tamir added that this strategy could improve collaboration between LLMs and doctors by homing in on where human expertise is needed most.