
Introduction
Heart disease is a prominent topic in public health debates, and for
good reason: the Centers for Disease Control and Prevention found it
responsible for more deaths in 2021 than any other cause assessed. There
is, however, a common misconception that heart disease is synonymous
with cardiovascular disease, which obscures how broad cardiovascular
disease actually is. Cardiovascular disease is in fact a collection of
conditions, and while heart disease is primary among them, the category
is significantly broader: it also includes stroke, the fifth leading
cause of death in 2021, as well as peripheral arterial disease and
aortic disease.
Common risk factors for cardiovascular disease include high blood
pressure, high cholesterol, smoking and smoke exposure, obesity,
alcohol use, unhealthy diet, and sedentary behavior, each of which has
received extensive attention in the scientific literature on
interventions designed to stem the prevalence of the disease. While
identifying patients who meet the diagnostic criteria for
cardiovascular disease is possible in most clinical settings,
preventing others from developing it in the first place has become an
important aim for public health.
Machine learning methods allow for the specification of predictive
models that can evaluate the medical records of patients not yet
diagnosed with cardiovascular disease and estimate their likelihood of
developing these conditions in the future. These models are first
trained on complete patient records whose diagnosis outcome is known,
and their performance is subsequently evaluated on records for new
patients who have not yet been screened for the disease; indices of
model performance describe how well new patient outcomes are predicted.
Required Packages
library(tidyverse) # Expanded general utility
library(labelled) # Variable labels
library(skimr) # EDA
library(gridExtra) # EDA
library(utils) # Reading in data functions
library(broom) # Model summary
library(summarytools) # Formal summary table
library(corrplot) # Correlation matrix
library(gtsummary) # Formally presenting tables
library(ROCR) # Classification model performance
library(vip) # Variable importance plot
library(equatiomatic) # Render model specification equation
Data Import and Processing
A simulated sample of n = 70,000 patient records was
accessed from Kaggle.com
and appears to have been generated poorly, with values for systolic and
diastolic blood pressure ranging from negative numbers to above
16,000 and 11,000, respectively. Extreme
height values ranging from under 2 feet to over
8 feet were also encountered, and certain records indicated
patients weighing as little as 22 pounds despite a minimum
age of 29.6 years. These values were processed, with more
reasonable upper and lower bounds substituted where appropriate. While
recoding these extreme values necessarily attenuates the strength of
association with other variables in the data set, this was judged to be
an acceptable compromise given the artificial nature of the data in
question.
# Read in raw data file
raw <- read.csv(webaddress, header = TRUE, sep = ';')
# Data cleaning
raw <- raw %>%
  mutate(age = as.numeric(age / 365.25),        # convert to years
         weight = as.numeric(weight * 2.20462), # convert to lbs
         female = as.numeric(recode(gender, "1" = "1", "2" = "0")),
         height = as.numeric(height / 2.54),    # convert to inches
         ap_hi = as.numeric(ap_hi),
         ap_lo = as.numeric(ap_lo),
         cholesterol = as.factor(cholesterol),
         gluc = as.factor(gluc),
         smoke = as.numeric(smoke),   # to calculate proportions
         alco = as.numeric(alco),     # to calculate proportions
         active = as.numeric(active), # to calculate proportions
         cardio = as.numeric(cardio)  # to calculate proportions
  ) %>%
  dplyr::select(-gender, -id)
# Add intuitive variable label names
var_label(raw) <- list(
  age = "Years of Age (Calculated)",
  weight = "Weight in Lbs (Calculated)",
  female = "Female Indicator",
  height = "Height in Inches",
  ap_hi = "Systolic BP",
  ap_lo = "Diastolic BP",
  cholesterol = "Categorized Total Cholesterol",
  gluc = "Categorized Blood Glucose",
  smoke = "Smoking Indicator",
  alco = "Alcohol Indicator",
  active = "Physical Activity Indicator",
  cardio = "Cardiovascular Disease Indicator"
)
# Re-code extreme values
raw <- raw %>%
  mutate(ap_hi = case_when(ap_hi < 90 ~ 90,
                           ap_hi > 180 ~ 180,
                           TRUE ~ ap_hi),
         ap_lo = case_when(ap_lo > 110 ~ 110,
                           ap_lo < 60 ~ 60,
                           TRUE ~ ap_lo),
         weight = case_when(weight < 110 ~ 110,
                            TRUE ~ weight),
         height = case_when(height < 53 ~ 53,
                            height > 84 ~ 84,
                            TRUE ~ height))
# No other data cleaning is warranted
clean <- as_tibble(raw)
Data Overview
The processed analytic sample data appear in the summary table below.
As these are simulated data, there are no missing values present. A
small number of duplicate records (n = 24) result from the large
number of observations and the comparatively few features, several of
which have a very limited range of possible values; in other words, we
would expect a few records to have identical values. We also see in the
table below a continued lack of nuance in the simulation routine for
both blood pressure variables (ap_hi and ap_lo), where plots show a
non-continuous distribution of values within the expected range. Given
this report’s emphasis on overall model performance, these issues are
deemed minimally impactful.
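These counts can be verified directly; a minimal sketch, assuming the
cleaned tibble is named clean as above:
# Check for missing values and fully duplicated records
sum(is.na(clean))      # expected: 0 missing values
sum(duplicated(clean)) # expected: 24 duplicate records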
Correlation Matrix
Next, we see a correlation matrix of the entire data set, with heat
mapping applied to convey the strength and direction of each pairwise
relationship. The magnitudes and directions of these relationships are
generally what we would expect at the population level for the U.S.
Note the strong positive association between the two measures of blood
pressure, the moderate positive association between systolic blood
pressure and cardiac event, and the moderate negative associations of
female with height and smoking. While the blood pressure values are
shown to lack a relationship in either direction with the physical
activity indicator, we would expect a weak negative association at the
population level.
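The matrix itself can be produced with the corrplot package loaded
earlier; a minimal sketch, assuming the factor columns are first
converted back to their numeric codes:
# Convert factors to numeric codes, compute correlations, and plot a heat map
clean %>%
  mutate(across(where(is.factor), ~ as.numeric(as.character(.x)))) %>%
  cor() %>%
  corrplot(method = "color", type = "lower", tl.col = "black")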

Model Estimation
# Re-code categorical variables for model estimation
clean <- clean %>%
  mutate(cholesterol = as.factor(cholesterol),
         gluc = as.factor(gluc),
         smoke = as.factor(smoke),
         alco = as.factor(alco),
         active = as.factor(active))
# Partition into training / test (a fixed seed is assumed here for reproducibility)
set.seed(1)
index <- sample(nrow(clean), 0.75 * nrow(clean))
train <- clean[index, ]
test <- clean[-index, ]
# Fit full logit model
model1 <- glm(cardio ~ ., family = binomial, data = train)
# Examine model fit indices
tbl_summary(glance(model1))
# Variable importance
vip(model1)
After partitioning the analytic sample into training and testing
subsets, we estimate a logistic model of the following
specification:
\[
\begin{aligned}
\log\left[ \frac { \widehat{P( cardio = 1 )} }{ 1 - \widehat{P( cardio =
1 )} } \right] &= -10.82 + 0.05(age) - 0.01(height) + 0.01(weight) +
0.05(ap\_hi) + 0.02(ap\_lo)\ + \\
&\quad 0.39(cholesterol_{2}) + 1.08(cholesterol_{3}) +
0.03(gluc_{2}) - 0.34(gluc_{3}) - 0.14(smoke_{1}) - 0.19(alco_{1})\ - \\
&\quad 0.22(active_{1}) + 0.02(female)
\end{aligned}
\]
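The rendered specification above can be generated directly from the
fitted model object; a minimal sketch using the equatiomatic package
loaded earlier:
# Render the fitted model as a LaTeX equation with estimated coefficients
extract_eq(model1, use_coefs = TRUE, wrap = TRUE)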
Variable Importance Plot
We are now able to view the relative importance that each included
term has within the estimated model, though again, our emphasis here is
on assessing overall model performance.

Model Fit Indices
After having reviewed individual model terms and their associated
impact, we can now turn our attention to assessing overall model fit
indices. These values are informative when comparing the relative
performance of one model against another, and each of them is
unitless.
| Statistic     | Value      |
|---------------|------------|
| null.deviance | 72,780.42  |
| df.null       | 52,499     |
| logLik        | -29,367.69 |
| AIC           | 58,763.38  |
| BIC           | 58,887.54  |
| deviance      | 58,735.38  |
| df.residual   | 52,486     |
| nobs          | 52,500     |
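To illustrate how these indices support model comparison, consider the
minimal sketch below; the reduced model dropping female and gluc is a
hypothetical comparator chosen for illustration, not part of the
original analysis:
# Fit a reduced model and compare information criteria (lower AIC/BIC is better)
model2 <- glm(cardio ~ . - female - gluc, family = binomial, data = train)
AIC(model1, model2)
BIC(model1, model2)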
Model Performance
The output of a binary classification model such as our binomial
logistic model is, for a given set of patient record data, the
probability that the patient will experience a cardiac event at some
point in the future. This probability can range from 0 to
1 but typically falls somewhere between these extremes.
A classification threshold value (think of this as a cut point)
is selected within this probability range, and patients with
probability values above or below the threshold are classified as
likely or unlikely, respectively, to experience a future cardiac event.
In practice, care is given to selecting such a threshold, and
the optimal value depends on the specific context. We select a naïve
threshold of 0.5, as such tuning is beyond the intended
scope of this report, though it is worth noting that this choice
trades off true positives against false positives. For example, if we
raised this threshold from 0.5 to 0.65, the model
would predict fewer patients to experience a future cardiac event,
decreasing the number of both false positives and true positives.
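A minimal sketch of applying the 0.5 threshold to the held-out test
set, assuming model1 and test as defined above:
# Predicted probabilities on the test set, classified at the 0.5 threshold
probs <- predict(model1, newdata = test, type = "response")
pred_class <- as.numeric(probs > 0.5)
table(Predicted = pred_class, Actual = test$cardio) # confusion matrix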
In our graphical assessment of model performance, we display a receiver
operating characteristic (ROC) curve, which shows model
classification performance across all possible classification
thresholds; the ROC curve plots the true positive rate (TPR) against the
false positive rate (FPR) across these varied threshold values. The area
under the ROC curve (AUC-ROC) is a classification model performance
measure of the two-dimensional area beneath the ROC curve and can be
interpreted as the ability of our model to distinguish patients who
experience cardiac events from those who do not. The AUC value of our
fitted model is 0.789, meaning that for a randomly selected pair of
patients, one who experiences a cardiac event and one who does not, the
model assigns the higher predicted probability to the affected patient
roughly 79% of the time. ROC curves are suitable for relatively
balanced classification categories such as we have here (~45% vs.
~55%).
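A minimal sketch of producing the ROC curve and AUC with the ROCR
package loaded earlier, reusing the test-set probabilities from above:
# Build a ROCR prediction object, plot TPR vs. FPR, and extract the AUC
pred_obj <- prediction(probs, test$cardio)
perf <- performance(pred_obj, measure = "tpr", x.measure = "fpr")
plot(perf, colorize = TRUE) # ROC curve colored by threshold value
performance(pred_obj, measure = "auc")@y.values[[1]] # AUC, reported as 0.789 above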

Conclusion
In this report we demonstrated a step-by-step approach to analyzing
cardiac outcomes patient records to build a model that predicts the
occurrence of cardiac events in new patients. We began by briefly
situating the context for why such an analysis is necessary and
followed this by illustrating the import of simulated cardiac outcomes
data from Kaggle. We walked through identifying and correcting several
improper values and summarized the final analytic sample in an
intuitive table format. We summarized bivariate associations across
the entire data set in a correlation matrix and then presented the model
formulation with coefficients. Rather than naïvely relying on the
magnitude of these coefficients to judge predictor importance, we
presented a plot of a formal assessment of variable importance within
the model. We followed this by presenting a variety of model fit indices
and concluded with a description of the classification performance of
the model across all threshold values.