ChatGPT struggles to evaluate heart risk—but it could still help cardiologists
ChatGPT-4 fails to deliver consistent, reliable heart risk assessments when evaluating simulated patient cases, according to new research published in PLOS ONE.[1]
ChatGPT-4 is a recent version of ChatGPT, OpenAI’s massively popular dialogue-based artificial intelligence (AI) model. Healthcare researchers all over the world have been assessing ChatGPT in recent years, hoping to see whether it offers any potential for assisting physicians, educating trainees or even counseling patients, with mixed results so far.
For this latest analysis, Thomas Heston, MD, an AI specialist with Washington State University’s Elson S. Floyd College of Medicine, and colleagues put the large language model (LLM) to the test by asking it to calculate risk based on computer-simulated patient cases.
Heston et al. created three datasets, each containing 10,000 randomized patient cases. The first dataset focused on the seven variables used to produce TIMI scores, which were designed to evaluate patients with unstable angina or non-ST-segment elevation myocardial infarction (NSTEMI). The second dataset focused on the five variables used to produce HEART scores, which were designed to predict an emergency department patient’s six-week risk of major adverse cardiovascular events. Finally, the third dataset included a total of 44 health variables.
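For context, both instruments are simple additive scores: the TIMI score for unstable angina/NSTEMI awards one point for each of its seven binary items (range 0 to 7), while the HEART score sums five components that are each graded 0 to 2 (range 0 to 10). The Python sketch below illustrates how randomized cases of this kind could be generated and scored; the variable names and the uniform random sampling are illustrative assumptions, not the authors’ actual code.

```python
import random

def random_timi_case():
    """Randomly sample the seven binary TIMI (UA/NSTEMI) variables. Illustrative only."""
    keys = [
        "age_65_or_older", "three_or_more_cad_risk_factors", "known_cad_stenosis_50pct",
        "aspirin_use_past_7_days", "severe_angina_2plus_episodes_24h",
        "st_deviation_0_5mm", "elevated_cardiac_markers",
    ]
    return {k: random.randint(0, 1) for k in keys}

def timi_score(case):
    """TIMI score: one point per positive item, total range 0-7."""
    return sum(case.values())

def random_heart_case():
    """Randomly sample the five HEART components, each graded 0-2. Illustrative only."""
    keys = ["history", "ecg", "age", "risk_factors", "troponin"]
    return {k: random.randint(0, 2) for k in keys}

def heart_score(case):
    """HEART score: sum of the five components, total range 0-10."""
    return sum(case.values())

if __name__ == "__main__":
    timi_cases = [random_timi_case() for _ in range(10_000)]
    heart_cases = [random_heart_case() for _ in range(10_000)]
    print(timi_score(timi_cases[0]), heart_score(heart_cases[0]))
```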
The authors had ChatGPT-4 review each randomized case five separate times, asking it to deliver a risk assessment based on the patient variables provided. They wanted to learn how closely ChatGPT’s responses would correlate with the TIMI and HEART scores, and how consistent its answers would be when reviewing the same case repeatedly.
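As a rough illustration of that repeat-query protocol, the sketch below asks a chat model to classify the same case five times. It assumes the official OpenAI Python SDK’s chat completions interface and an API key in the environment; the prompt wording, model name and answer parsing are illustrative choices, not the study’s actual implementation.

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_for_risk(case: dict, n_repeats: int = 5, model: str = "gpt-4") -> list[str]:
    """Ask the model to rate the same case several times. Prompt and parsing are illustrative."""
    prompt = (
        "Given the following patient variables, classify the cardiac risk as "
        f"low, intermediate, or high. Answer with one word.\n{case}"
    )
    answers = []
    for _ in range(n_repeats):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(resp.choices[0].message.content.strip().lower())
    return answers
```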
Overall, the team found that ChatGPT-4 “showed high correlation” with the two risk scores. However, the LLM frequently delivered different risk assessments when reviewing the same patient case multiple times. The same inconsistency appeared with the third, 44-variable dataset: when asked to review identical data more than once, ChatGPT-4 often disagreed with its own previous responses.
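Quantifying those two findings is straightforward once each case has been scored repeatedly. The sketch below assumes answers collected by a function like the one above, and uses a Spearman correlation against the reference TIMI or HEART scores plus a simple “all five answers identical” criterion for consistency; the paper’s own statistical methods may differ.

```python
from scipy.stats import spearmanr

# Map the model's categorical answers onto numbers for correlation (illustrative choice).
RISK_TO_NUMBER = {"low": 0, "intermediate": 1, "high": 2}

def correlation_with_score(chatgpt_answers, reference_scores):
    """Spearman correlation between the model's first answer per case and the TIMI/HEART score."""
    first_answers = [RISK_TO_NUMBER[answers[0]] for answers in chatgpt_answers]
    rho, p_value = spearmanr(first_answers, reference_scores)
    return rho, p_value

def consistency_rate(chatgpt_answers):
    """Fraction of cases where all five repeated answers were identical."""
    agree = sum(1 for answers in chatgpt_answers if len(set(answers)) == 1)
    return agree / len(chatgpt_answers)
```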
“ChatGPT was not acting in a consistent manner,” Heston said in a statement. “Given the exact same data, ChatGPT would give a score of low risk, then next time an intermediate risk, and occasionally, it would go as far as giving a high risk.”
According to the authors, that kind of variability can be an asset in other settings where ChatGPT is used. In medicine, however, consistent answers are vital.
“We found there was a lot of variation, and that variation in approach can be dangerous,” Heston said. “It can be a useful tool, but I think the technology is going a lot faster than our understanding of it, so it’s critically important that we do a lot of research, especially in these high-stakes clinical situations.”
Despite those shortcomings, the group concluded their study with a positive perspective on the potential of LLMs such as ChatGPT.
“ChatGPT could be excellent at creating a differential diagnosis and that’s probably one of its greatest strengths,” Heston said. “If you don’t quite know what’s going on with a patient, you could ask it to give the top five diagnoses and the reasoning behind each one. So it could be good at helping you think through a problem, but it’s not good at giving the answer.”
Lawrence M. Lewis, MD, an emergency medicine specialist with Washington University in St. Louis, co-authored the study.
ChatGPT’s latest update promises improvements in ‘logical reasoning’
In April, OpenAI announced the launch of a new GPT-4 Turbo with “improved capabilities in writing, math, logical reasoning and coding.” Would this latest iteration deliver better heart risk assessments? Researchers are likely already working toward an answer.
Paid ChatGPT users now have access to GPT-4 Turbo.