Welcome back to AI in the Real World for another look at AI applications creating real-world value for businesses and consumers.

In this iteration of the blog, I am doing a deep dive into AI usage in the healthcare industry, as I’ve seen a lot of progress there over the years. I’ve gathered some studies that explore whether generative AI can help doctors in their daily work and whether AI can be more empathetic than doctors. We have known for a while now that AI-driven analysis can reduce cost, reduce errors, and enhance patient outcomes in medical imaging use cases, but here the focus is on a different dimension: communication.

I’d be the first to acknowledge that when it comes to bedside manner, not all doctors are created equal – compare, for example, the affected irritability of Bones McCoy, the (potentially alcohol-driven) geniality of Hawkeye Pierce, and the manifest narcissism of the prototypical “mad scientist”, Victor Frankenstein. But putting that variability aside for a moment, let’s ask the question: “Can AI be more empathetic than doctors?” Without wanting to make this sound like a “clickbait” article, the answer might shock you!

AI assistant vs. physician responses to patient questions

A recent study that measured the empathy of physicians’ responses showed that replies to patients’ messages generated by ChatGPT were preferred over those written by qualified physicians.

I imagine that this raises as many questions for you as it did for me. Who conducted the study? Who preferred these responses? How often did they prefer them? Well, it was published in JAMA Internal Medicine (a journal of the American Medical Association), and it doesn’t get much more prestigious or respected than that. The study presented patients’ questions, along with randomized responses from physicians and the chatbot, to a team of licensed healthcare professionals, who evaluated the empathy or bedside manner of each response. They found an almost 10-times-higher prevalence of “empathetic” or “very empathetic” ratings for the chatbot’s responses.

At this point, armed with the knowledge that generative AI has earned a reputation for bias and hallucination, you are probably thinking that this turned out to be a triumph of “style over substance”, of “form over function”. Me too. But no, it turns out that the panel of expert evaluators also rated the chatbot responses as being of significantly higher quality than the physician responses – they preferred the ChatGPT responses in 78% of cases.

Does this indicate that we should all be ditching expensive medical appointments in favor of online solutions? I think that is a resounding “no”. While the study tells us that the AI responses were consistently (with statistical significance) preferred for both accuracy and empathy, it doesn’t tell us anything in absolute terms about the instances where the physician responses were better. Would any of the less accurate ChatGPT responses, for example, have been life-threatening?

If my new digital doctor is getting it wrong, what I want to know is just how wrong that is. The stakes are high in this domain, and few among us would be willing to play medical roulette as a trade-off for a more empathetic doctor-patient experience. But this potential for errors does not render the technology useless – in their conclusion, the authors suggest that AI assistance in generating draft responses that physicians can then edit might be a practical way to apply the technology. They suggest that, pending the results of careful clinical studies, this could improve the quality of responses, reduce clinician burnout, and improve patient outcomes and experiences.

Empathetic, sincere, and considerate scripts for clinicians

In a Forbes article from last summer, Robert Pearl, M.D. also explored the question of whether doctors or ChatGPT were more empathetic, and the results aligned with those of the JAMA study. One of the examples shared came from a New York Times article that reported on the University of Texas at Austin’s experience with generative AI.

The Chair of Internal Medicine needed a script that clinicians could use to speak more compassionately and engage better with patients in a behavioral therapy treatment program for alcoholism. At the time, no one on the team took the assignment seriously. So, the department head turned to ChatGPT for help, and the results amazed him. The app created an excellent letter that was considered “sincere, considerate, even touching.”

Following this success, others at the university continued to use generative AI to create additional versions, rewritten for a fifth-grade reading level and translated into Spanish. The resulting scripts, in both languages, were notably clearer and more appropriate for their audiences.

Clinical notes on par with those written by senior internal medicine residents

Referencing his recent study, Ashwin Nayak of Stanford University told MedPage Today that “Large language models like ChatGPT seem to be advanced enough to draft clinical notes at a level that we would want as a clinician reviewing the charts and interpreting the clinical situation. That is pretty exciting because it opens up a whole lot of doors for ways to automate some of the more menial tasks and the documentation tasks that clinicians don’t love to do.”

As with the JAMA study, Nayak is not expecting generative AI to replace doctors, but he did report that ChatGPT could generate clinical notes comparable to those written by senior internal medicine residents. Although the study found minimal qualitative differences in ‘history of present illness’ (HPI) reporting between residents and ChatGPT, attending physicians could identify whether the source was human or AI with only 61% accuracy.

So, does generative AI have a promising future with healthcare professionals? Can it be more empathetic than doctors themselves?

Looking at these studies and use cases, I’d say that the deck is stacked heavily in favor of YES. We are still in the infancy of the technology, and the experiments reported here were carried out using general-purpose models – it seems all but inevitable that once more specialized models become available, the results will be even more compelling.