
Large language models (LLMs) such as ChatGPT have shown mixed results in the quality of their responses to consumer questions about cancer.

One recent study found that AI chatbots churn out incomplete, inaccurate, or even nonsensical cancer treatment recommendations, while another found that they generate largely accurate — if technical — responses to the most common cancer questions.

While researchers have seen success with purpose-built chatbots created to address patient concerns about specific cancers, the consensus to date has been that generalized models like ChatGPT remain works in progress and that physicians should avoid pointing patients to them for now.

Yet new findings suggest that these chatbots may do better than individual physicians, at least on some measures, when it comes to answering queries about cancer. For research published May 16 in JAMA Oncology (doi: 10.1001/jamaoncol.2024.0836), David Chen, a medical student at the University of Toronto, and his colleagues isolated a random sample of 200 questions related to cancer care addressed to doctors on the public online forum Reddit. They then compared responses from oncologists with responses generated by three different AI chatbots. The blinded responses were rated for quality, readability, and empathy by six physicians, including oncologists and palliative and supportive care specialists.

Mr. Chen and colleagues’ research was modeled after a 2023 study that compared the quality of physician and chatbot responses to general medicine questions posed to doctors on Reddit. That study found that the chatbots produced more empathetic-sounding answers, something Mr. Chen’s study also found. The best-performing chatbot in Mr. Chen and colleagues’ study, Claude AI, scored significantly higher than the Reddit physicians on all the domains evaluated: quality, empathy, and readability.
 

Q&A With Author of New Research

Mr. Chen discussed his new study’s implications during an interview with this news organization.

Question: What is novel about this study?

Mr. Chen: We’ve seen many evaluations of chatbots that test for medical accuracy, but this study occurs in the domain of oncology care, where there are unique psychosocial and emotional considerations that are not precisely reflected in a general medicine setting. In effect, this study is putting these chatbots through a harder challenge.



Question: Why would chatbot responses seem more empathetic than those of physicians?

Mr. Chen: In the physician responses we observed in our sample data set, we saw very high variation in the amount of apparent effort. Some physicians would put a lot of time and effort into thinking through their response, and others wouldn’t do so as much. These chatbots don’t face fatigue or burnout the way humans do, so they’re able to consistently provide responses with less variation in empathy.



Question: Do chatbots just seem empathetic because they are chattier?

Mr. Chen: We did think of verbosity as a potential confounder in this study. So we set a word count limit for the chatbot responses to keep it in the range of the physician responses. That way, verbosity was no longer a significant factor.



Question: How were quality and empathy measured by the reviewers?

Mr. Chen: For our study we used two teams of readers, each composed of three physicians. In terms of the actual metrics, they were pilot metrics; there are no well-defined measurement scales or checklists that we could use to measure empathy. This is an emerging field of research, so we came up with our own set of ratings by consensus, and we feel this is an area where future research should define a standardized set of guidelines.

Another novel aspect of this study is that we separated out different dimensions of quality and empathy. A quality response didn’t just mean it was medically accurate — quality also had to do with the focus and completeness of the response.

With empathy there are cognitive and emotional dimensions. Cognitive empathy involves using critical thinking to understand the person’s emotions and thoughts and then adjusting a response to fit them. A patient may not want the best medically indicated treatment for their condition because they want to preserve their quality of life. The chatbot may be able to adjust its recommendation to account for some of those humanistic elements the patient is presenting with.

Emotional empathy is more about being supportive of the patient’s emotions by using expressions like ‘I understand where you’re coming from’ or ‘I can see how that makes you feel.’



Question: Why would physicians, not patients, be the best evaluators of empathy?

Mr. Chen: We’re actually very interested in evaluating patient ratings of empathy. We are conducting a follow-up study that evaluates patient ratings of empathy for the same set of chatbot and physician responses, to see if there are differences.



Question: Should cancer patients go ahead and consult chatbots?

Mr. Chen: Although we did observe increases in all of the metrics compared with physicians, this is a very specialized evaluation scenario where we’re using these Reddit questions and responses.

Naturally, we would need to do a trial, a head-to-head randomized comparison of physicians versus chatbots.

This pilot study does highlight the promising potential of these chatbots to suggest responses. But we can’t fully recommend that they be used as standalone clinical tools without physicians.

This Q&A was edited for clarity.


FROM JAMA ONCOLOGY
