AI models for predicting heart failure still far from perfect
Advanced artificial intelligence (AI) models designed to detect early signs of heart failure may be much less effective when evaluating older patients, according to new research published in Circulation: Heart Failure.[1] The study’s authors identified other disparities as well, highlighting the need for caution when developing or using such algorithms.
“Heart failure may be underdiagnosed at higher rates in Black patients and women in the outpatient setting,” wrote first author Dhamanpreet Kaur, BS, an AI researcher with Stanford University, and colleagues. “Earlier detection and closer monitoring of high-risk individuals may aid in reducing the occurrence and improving the prognosis of the disease to ultimately combat these disparities.”
The group focused on a convolutional neural network model trained to predict heart failure within five years based on 12-lead electrocardiograms (ECGs) performed from 2008 to 2018 on patients referred to a high-volume facility. Once that model had been developed, the researchers trained four additional models that followed the same basic design.
This retrospective study included more than 325,000 ECGs. The average patient age was 59 years, and nearly 50% of patients were women. While 56.5% of patients were non-Hispanic white, 14.2% were Asian, 12.3% were Hispanic, 4% were Black, 1.2% were Hawaiian/Pacific Islander and 0.2% were American Indian or Alaska Native.
Overall, there were nearly 60,000 heart failure incidents within five years of ECG collection among the study’s population. While 9% of patients 40 years old or younger went on to develop heart failure over the course of the study, that number skyrocketed to 36.6% when looking at patients over the age of 80. Heart failure within five years was more common among Black patients (23.5%) than white patients (19%), Asian patients (18.6%), Hispanic patients (17.5%) or any other racial subgroup.
Kaur et al. found that there were no noteworthy differences between racial groups when it came to the performance of their primary AI model. However, its performance “declined significantly” with age, dropping from an area under the ROC curve (AUC) of 0.80 for patients 40 years old and younger to an AUC of 0.66 for patients over the age of 80.
Another key takeaway was that the AI model was much less effective when evaluating 12-lead ECGs from the specific subgroup of younger Black patients. The AUC was just 0.69 among these patients, much lower than the AUC of 0.80 documented for non-Hispanic white patients.
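For readers who want to see what a subgroup comparison like this involves, the sketch below computes AUC separately for each demographic group. It is a minimal illustration, not the authors' code: the function names `auc` and `auc_by_group`, the group labels and the toy records are all assumptions made for this example, and the numbers are invented rather than taken from the study.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the chance that a random positive case scores higher than a
    random negative case, with ties counted as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both outcomes present
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_group(records):
    """records: iterable of (group, label, score) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for group, label, score in records:
        grouped[group][0].append(label)
        grouped[group][1].append(score)
    return {g: auc(ys, ss) for g, (ys, ss) in grouped.items()}


# Hypothetical toy data: (age group, developed heart failure?, model score)
records = [
    ("40 and under", 1, 0.9), ("40 and under", 0, 0.2),
    ("40 and under", 1, 0.7), ("40 and under", 0, 0.4),
    ("over 80", 1, 0.6), ("over 80", 0, 0.7),
    ("over 80", 1, 0.8), ("over 80", 0, 0.5),
]

result = auc_by_group(records)
for group, value in result.items():
    print(group, value)
```

In this toy data, the model ranks every younger patient who developed heart failure above every younger patient who did not, while one pair in the older group is ranked the wrong way around, so the older group's AUC comes out lower. A real analysis would use far larger groups and report confidence intervals.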
“The calibration curves indicate that the model is best calibrated for Asian and non-Hispanic white patients,” the authors wrote. “The observed fraction of cases with heart failure exceeds the probability predicted by the model among Black patients, indicating greater underdiagnosis in comparison to other racial groups, especially among Black women.”
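A calibration check of the kind described here compares the probability a model predicts with the fraction of patients who actually develop the condition. The sketch below shows one common way to do that with equal-width probability bins. It is an illustration under stated assumptions: the function name `calibration_table`, the bin count and the toy values are hypothetical and are not drawn from the paper.

```python
def calibration_table(labels, probs, n_bins=5):
    """Group predictions into equal-width probability bins and return
    (mean predicted probability, observed event fraction, count) per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_pred = sum(p for _, p in members) / len(members)
        observed = sum(y for y, _ in members) / len(members)
        rows.append((mean_pred, observed, len(members)))
    return rows


# Hypothetical toy data for a single subgroup
labels = [0, 1, 1, 1]
probs = [0.1, 0.1, 0.3, 0.3]

rows = calibration_table(labels, probs)
for mean_pred, observed, count in rows:
    # observed > mean_pred means the model underestimates risk in that bin
    print(mean_pred, observed, count)
```

When the observed fraction sits above the predicted probability in most bins for one group, the model is systematically underestimating that group's risk, which is the pattern the authors describe for Black patients.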
The team also noted that they experimented with using different subsets of training data to see if this could improve the AI model’s performance and reduce its apparent bias.
“Using a data set with equal racial representation did not eliminate disparities between Black patients and patients of other racial groups in the zero to 40 age group,” the authors wrote. “The AUC values did not improve from the primary model. Similarly, there was no improvement in performance among the different race and ethnic subgroups compared using models that were trained on the same race and ethnicity as the test set. Moreover, there was no improvement in age-related disparities when the models were trained and tested on data from separate age groups.”
Reviewing these findings, the group said more research is still needed to ensure advanced AI models do not “perpetuate existing disparities” in patient outcomes.
“Our findings warrant caution in using this ECG deep learning model for heart failure prognosis among certain demographic subgroups,” they added.