Introduction

Heart disease is a prominent topic in public health debates, and for good reason: the Centers for Disease Control and Prevention found it responsible for more deaths in 2021 than any other cause it assessed. A common misconception, however, is that heart disease is synonymous with cardiovascular disease, which obscures how broad cardiovascular disease actually is. Cardiovascular disease is a collection of conditions; heart disease is primary among them, but the category also includes stroke, the fifth leading cause of mortality in 2021, as well as peripheral arterial disease and aortic disease.

Common risk factors for cardiovascular disease include high blood pressure, high cholesterol, smoking and smoke exposure, obesity, alcohol use, unhealthy diet, and sedentary activity patterns, each of which has received a wealth of attention in the scientific literature on interventions designed to stem the prevalence of this disease. While identifying patients who already meet diagnostic criteria for cardiovascular disease is possible in most clinical settings, preventing others from developing it in the first place has become an important aim for public health.

Machine learning methods allow for the specification of predictive models that evaluate the medical records of patients not yet diagnosed with cardiovascular disease and estimate their likelihood of developing these conditions in the future. Such models are first trained on complete patient records whose diagnostic outcome is known, and their performance is then evaluated on records for new patients whose outcomes are not yet known; indices of model performance describe how well those outcomes are predicted.

Required Packages

library(tidyverse) # Expanded general utility
library(labelled) # Variable labels
library(skimr) # EDA
library(gridExtra) # EDA
library(utils) # Reading in data functions
library(broom) # Model summary  
library(summarytools) # Formal summary table
library(corrplot) # Correlation matrix
library(gtsummary) # Formally presenting tables
library(ROCR) # Classification model performance
library(vip) # Variable importance plot
library(equatiomatic) # Model specification plot

Data Import and Processing

A simulated sample of n = 70,000 patient records was accessed from Kaggle.com and appears to have been generated poorly, with values for systolic and diastolic blood pressure ranging from negative numbers to above 16,000 and 11,000, respectively. Extreme height values ranging from under 2 feet to over 8 feet were also encountered, and certain records indicated weights as low as 22 pounds, even though the minimum age in the data is 29.6 years. These values were processed by substituting more reasonable upper and lower bounds where appropriate. While recoding these extreme values necessarily attenuates their strength of association with other variables in the data set, this was judged an acceptable compromise given the artificial nature of the data.

# Read in raw data file
raw <- read.csv(webaddress, header = TRUE, sep = ';')

# Data cleaning
raw <- raw %>% 
  mutate(age = as.numeric(age / 365.25), # convert to years
         weight = as.numeric(weight * 2.20462), # convert to lbs
         female = if_else(gender == 1, 1, 0), # indicator; source codes gender 1 = female, 2 = male
         height = as.numeric(height / 2.54), # convert to inches
         ap_hi = as.numeric(ap_hi),
         ap_lo = as.numeric(ap_lo),
         cholesterol = as.factor(cholesterol),
         gluc = as.factor(gluc),
         smoke = as.numeric(smoke), # to calculate proportions
         alco = as.numeric(alco), # to calculate proportions
         active = as.numeric(active), # to calculate proportions
         cardio = as.numeric(cardio) # to calculate proportions
         ) %>% dplyr::select(c(-gender, -id))

# Add intuitive variable label names
var_label(raw) <- list(
  age = "Years of Age (Calculated)",  
  weight = "Weight in Lbs (Calculated)", 
  female = "Female Indicator",
  height = "Height in Inches",  
  ap_hi = "Systolic BP",
  ap_lo = "Diastolic BP",
  cholesterol = "Categorized Total Cholesterol",
  gluc = "Categorized Blood Glucose",
  smoke = "Smoking Indicator",
  alco = "Alcohol Indicator",
  active = "Physical Activity Indicator",
  cardio = "Cardiovascular Disease Indicator"
) 

# Re-code extreme values
raw <- raw %>%
  mutate(ap_hi = case_when(ap_hi < 90 ~ 90,
                           TRUE ~ ap_hi),
         ap_hi = case_when(ap_hi > 180 ~ 180,
                           TRUE ~ ap_hi),
         ap_lo = case_when(ap_lo  > 110 ~ 110,
                           TRUE ~ ap_lo),
         ap_lo = case_when(ap_lo < 60 ~ 60,
                           TRUE ~ ap_lo),
         weight = case_when(weight < 110 ~ 110,
                            TRUE ~ weight),
         height = case_when(height < 53 ~ 53,
                            TRUE ~ height),
         height = case_when(height > 84 ~ 84,
                           TRUE ~ height))

# No other data cleaning is warranted 
clean <- as_tibble(raw)

Data Overview

The processed analytic sample is summarized in the table below. As these are simulated data, there are no missing values. A small number of duplicate records (n = 24) results from the large number of observations and the comparatively few features, several of which have a very limited range of possible values; in other words, we would expect a few records to share identical values. The table also reveals a continued lack of nuance in the simulation routine for both blood pressure variables (ap_hi and ap_lo), whose distributions are non-continuous within the expected range. Given this report's emphasis on overall model performance, these issues are deemed minimally impactful.
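
A table of this kind can be produced with the summarytools package loaded above; the exact call is not shown here, so the following is only a sketch (graph.col = FALSE simply omits the inline histogram column).

# Sketch: summarize the cleaned analytic sample with summarytools
print(dfSummary(clean, graph.col = FALSE), method = "render")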

Data Frame Summary

Data frame: clean | Dimensions: 70,000 x 12 | Duplicates: 24 | Missing values: 0 (0.0%) for every variable

Variable      Type      Stats / value frequencies
age           numeric   Mean (SD): 53.3 (6.8); min ≤ med ≤ max: 29.6 ≤ 53.9 ≤ 64.9; IQR (CV): 10 (0.1); 8,076 distinct values
height        numeric   Mean (SD): 64.7 (3.1); min ≤ med ≤ max: 53 ≤ 65 ≤ 84; IQR (CV): 4.3 (0.0); 68 distinct values
weight        numeric   Mean (SD): 163.7 (31.5); min ≤ med ≤ max: 110 ≤ 158.7 ≤ 440.9; IQR (CV): 37.5 (0.2); 258 distinct values
ap_hi         numeric   Mean (SD): 126.9 (16.8); min ≤ med ≤ max: 90 ≤ 120 ≤ 180; IQR (CV): 20 (0.1); 87 distinct values
ap_lo         numeric   Mean (SD): 81.7 (9.8); min ≤ med ≤ max: 60 ≤ 80 ≤ 110; IQR (CV): 10 (0.1); 51 distinct values
cholesterol   factor    1: 52,385 (74.8%); 2: 9,549 (13.6%); 3: 8,066 (11.5%)
gluc          factor    1: 59,479 (85.0%); 2: 5,190 (7.4%); 3: 5,331 (7.6%)
smoke         numeric   0: 63,831 (91.2%); 1: 6,169 (8.8%)
alco          numeric   0: 66,236 (94.6%); 1: 3,764 (5.4%)
active        numeric   0: 13,739 (19.6%); 1: 56,261 (80.4%)
cardio        numeric   0: 35,021 (50.0%); 1: 34,979 (50.0%)
female        numeric   0: 24,470 (35.0%); 1: 45,530 (65.0%)

Generated by summarytools 1.0.1 (R version 4.3.1), 2023-07-26

Correlation Matrix

Next, we present a correlation matrix of the entire data set, with heat mapping applied to convey the strength and direction of each pairwise relationship. The magnitudes and directions are generally what we would expect at the population level for the U.S. Note the strong positive association between the two blood pressure measures, the moderate positive association between systolic blood pressure and the cardiovascular disease indicator, and the moderate negative associations of the female indicator with height and smoking. While the blood pressure values show no relationship in either direction with the physical activity indicator, we would expect a weak negative association at the population level.
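
A matrix of this kind can be drawn with the corrplot package loaded above; the call below is a sketch rather than the exact code used, and it assumes the factor variables are converted back to their numeric codes before computing correlations.

# Sketch: heat-mapped correlation matrix of the full analytic sample
clean %>%
  mutate(across(where(is.factor), ~ as.numeric(as.character(.x)))) %>% # factor codes back to numeric
  cor() %>%
  corrplot(method = "color", type = "upper", tl.col = "black")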

Model Estimation

# Re-code categorical variables for model estimation 
clean <- clean %>% 
  mutate(cholesterol = as.factor(cholesterol),
         gluc = as.factor(gluc),
         smoke = as.factor(smoke), 
         alco = as.factor(alco),  
         active = as.factor(active) 
        )  

# Partition into training / test
index <- sample(nrow(clean), .75*nrow(clean))
train <- clean[index,]
test <- clean[-index,]

# Fit full logit model
model1 <- glm(cardio ~ ., family = binomial, data = train)

# Examine model fit indices
tbl_summary(glance(model1))  

# Variable importance
vip(model1)

After partitioning the analytic sample into training and testing subsets, we estimate a logistic model of the following specification:

\[ \begin{aligned} \log\left[ \frac { \widehat{P( cardio = 1 )} }{ 1 - \widehat{P( cardio = 1 )} } \right] &= -10.82 + 0.05(age) - 0.01(height) + 0.01(weight) + 0.05(ap\_hi) + 0.02(ap\_lo)\ + \\ &\quad 0.39(cholesterol_{2}) + 1.08(cholesterol_{3}) + 0.03(gluc_{2}) - 0.34(gluc_{3}) - 0.14(smoke_{1}) - 0.19(alco_{1})\ - \\ &\quad 0.22(active_{1}) + 0.02(female) \end{aligned} \]
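
The rendered equation above can be generated directly from the fitted model object with the equatiomatic package loaded earlier; the exact arguments are an assumption, so the call below is only a sketch.

# Sketch: render the fitted model as a LaTeX equation with estimated coefficients
extract_eq(model1, use_coefs = TRUE, wrap = TRUE, terms_per_line = 6)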

Variable Importance Plot

We are now able to view the relative importance that each included term has within the estimated model, though again, our emphasis here is on assessing overall model performance.

Model Fit Indices

Having reviewed the individual model terms and their associated impact, we can now turn our attention to overall model fit indices. These values are unitless and are most informative when comparing the relative performance of one model against another.

Statistic        Value
null.deviance    72,780.42
df.null          52,499
logLik           -29,367.69
AIC              58,763.38
BIC              58,887.54
deviance         58,735.38
df.residual      52,486
nobs             52,500
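
Because these indices are chiefly comparative, a reduced model could be contrasted with the full model; the model below is purely illustrative and not part of the original analysis.

# Sketch: compare the full model against an illustrative reduced model
model_reduced <- glm(cardio ~ age + ap_hi + cholesterol, family = binomial, data = train)
AIC(model1, model_reduced) # lower AIC indicates better relative fit
BIC(model1, model_reduced) # lower BIC indicates better relative fit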

Model Performance

The output of a binary classification model such as our binomial logistic model is, for each set of patient record data, the probability that the patient will experience a cardiac event at some point in the future. This probability can range from 0 to 1, but typically falls somewhere between these extremes. A classification threshold (think of this as a cut point) is selected within this probability range, and patients with probabilities above or below the threshold are classified as likely or unlikely, respectively, to experience a future cardiac event.

In practice, considerable attention is given to selecting such a threshold, and the optimal value depends on the specific context. We select a naïve threshold of 0.5, as such tuning is beyond the intended scope of this report, though it is worth noting that the threshold governs the tradeoff between true positives and false positives. For example, if we raised the threshold from 0.5 to 0.65, the model would predict fewer patients to experience a future cardiac event, decreasing the number of both false positives and true positives.
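
Applying the naïve 0.5 threshold to the held-out test set can be sketched as follows, using the model1 fit and test partition defined above.

# Sketch: predicted probabilities on the test set, classified at the 0.5 threshold
test_prob <- predict(model1, newdata = test, type = "response")
test_class <- as.numeric(test_prob > 0.5)
table(Predicted = test_class, Observed = test$cardio) # confusion matrix at threshold 0.5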

For a graphical assessment of model performance, we display a receiver operating characteristic (ROC) curve, which summarizes classification performance across all possible classification thresholds; the ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) as the threshold is varied. The area under the ROC curve (AUC-ROC) is a performance measure equal to the two-dimensional area beneath the ROC curve and can be interpreted as the model's ability to distinguish patients who experience cardiac events from those who do not. The AUC of our fitted model is 0.789, meaning that roughly 79% of the time the model assigns a higher predicted probability to a randomly chosen patient with cardiovascular disease than to a randomly chosen patient without it. ROC curves are well suited to relatively balanced outcome classes such as we have here (~50% vs ~50%).
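
The ROC curve and AUC reported here can be computed with the ROCR package loaded above; the following is one possible approach rather than the exact code used.

# Sketch: ROC curve and AUC on the held-out test set with ROCR
test_prob <- predict(model1, newdata = test, type = "response")
roc_pred <- prediction(test_prob, test$cardio)
roc_perf <- performance(roc_pred, measure = "tpr", x.measure = "fpr")
plot(roc_perf) # ROC curve: TPR vs FPR across all thresholds
performance(roc_pred, measure = "auc")@y.values[[1]] # AUC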

Conclusion

In this report we demonstrated a step-by-step approach to analyzing cardiac outcome patient records in order to build a model that predicts the occurrence of cardiac events in new patients. We began by briefly situating the context for why such an analysis is needed, then illustrated the import of simulated cardiac outcomes data from Kaggle. We walked through identifying and correcting several improper values and summarized the final analytic sample in an intuitive table format. We summarized bivariate associations across the entire data set in a correlation matrix and then presented the model specification with its estimated coefficients. Rather than naïvely relying on the magnitude of these coefficients to judge predictor importance, we presented a formal assessment of variable importance within the model. We followed this with a variety of model fit indices and concluded with a description of the model's classification performance across all threshold values.