Too Little Evaluation
For any improvement strategy to work, LLMs — and all AI-assisted healthcare tools — first need a better evaluation framework. So far, LLMs have “been used in really exciting ways but not really well-vetted ways,” Tamir said.
While some AI-assisted tools, particularly in medical imaging, have undergone rigorous FDA evaluations and earned approval, most haven't. And because the FDA regulates only algorithms that are classified as medical devices, Parikh said that most LLMs used for administrative and efficiency tasks don't fall under the regulatory agency's purview.
But these algorithms still have access to patient information and can directly influence patient and doctor decisions. Third-party regulatory bodies are expected to emerge to fill the gap, but it's still unclear which organizations those will be. Before developers can build safer and more efficient LLMs for healthcare, they'll need better guidelines and guardrails. "Unless we figure out evaluation, how would we know whether the healthcare-appropriate large language models are better or worse?" Shah asked.
A version of this article appeared on Medscape.com.