ChatGPT, the advanced artificial intelligence (AI) model capable of writing text and engaging in detailed conversations, could make a significant impact in the field of thoracic surgery, according to a new analysis from the Khalpey AI Lab.
GPT-4, the most recent version of the ChatGPT model, is much more accurate than previous versions, suggesting this technology is improving at a rapid rate.
The Khalpey AI Lab, located in Scottsdale, Arizona, is led by veteran cardiothoracic surgeon Zain Khalpey, MD, PhD. The lab is primarily focused on how AI technology can improve the prevention, diagnosis, treatment and management of cardiovascular disease.
For this study, Khalpey’s team tested ChatGPT’s ability to answer Self-Education and Self-Assessment in Thoracic Surgery (SESATS) board questions from the American Board of Thoracic Surgery.
“Large language models such as ChatGPT, released by OpenAI, have shown exceptional performance in various fields, including medicine, law, and management,” wrote the study’s authors. “The successful performance of ChatGPT on board exam questions in the field of general surgery has been reported previously, indicating its potential in surgical education and training.”
The GPT-3.5 and GPT-4 models of ChatGPT were both put to the test, answering 400 SESATS exam questions from the years 2016 to 2021. While 55% of questions focused on adult cardiac surgery, 35% focused on general thoracic surgery, 5% focused on congenital cardiac surgery and another 5% focused on critical care. None of the questions in the dataset included clinical images.
Overall, GPT-3.5 achieved an accuracy of 52%. GPT-4, on the other hand, did much better, achieving an accuracy of 81.3%. Looking closer at the data, GPT-4 achieved accuracies of 87.3% in the adult cardiac surgery category, 90.2% in the general thoracic surgery category, 68.9% in the congenital cardiac surgery category and 80% in the critical care category. GPT-4 outperformed GPT-3.5 in every category, though the difference in critical care accuracy was not statistically significant.
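To see how decisive the overall gap is, the reported percentages can be checked with a standard two-proportion z-test. The sketch below is illustrative only: the correct-answer counts (208 and 325 of 400) are inferred from the article's rounded percentages, not taken from the study itself, and the study's authors may have used a different test.

```python
import math

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Two-proportion z-statistic for the difference between two accuracies."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    # Pooled proportion under the null hypothesis of equal accuracy
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Counts inferred from the reported 52% and 81.3% on 400 questions
z = two_proportion_z(208, 400, 325, 400)
print(f"z = {z:.2f}")
```

A z-statistic this far above conventional thresholds (roughly 1.96 for p < 0.05) indicates the overall GPT-3.5 vs. GPT-4 difference is highly significant; the critical care subgroup, with only about 20 questions, is too small for a similar gap to reach significance.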
“The results of our study demonstrate that ChatGPT, particularly the GPT-4 model, shows a remarkable ability to understand complex thoracic surgical clinical information, achieving an accuracy rate of 81.3% on the SESATS board questions,” the authors wrote. “The GPT-4 model consistently outperformed GPT-3.5 across all subspecialties of thoracic surgery, indicating its potential for application in surgical education and training in this field.”
Khalpey et al. wrote that this strong performance provides new evidence that large language models could “potentially revolutionize surgical education and training” by building personalized learning platforms for students and trainees. In addition, these models could also help practicing surgeons keep up with the field and earn continuing medical education credits.
ChatGPT and other large language models still have significant limitations, the researchers explained. They can be swayed by incorrect or misleading information, for instance, and it is possible that surgeons could "become overly dependent" on their assistance.
“The advent of advanced AI models such as ChatGPT has generated both excitement and concern within the medical community, particularly in the field of surgery,” the authors concluded. “This study has demonstrated that ChatGPT, specifically the GPT-4 model, can significantly reduce the number of errors made by surgeons by improving the quality of surgical education. This controversial aspect has led to heated debates on the future role of AI in medicine.”