Summary

In this analysis I created a cardiovascular disease (CVD) classification model using logistic regression. The data was processed, I inspected least frequent outcomes and class imbalance, and assessed model fit and model assumptions. It was by no means a comprehensive analysis, though. The model accuracy was ~72% on both the training and test sets, and sensitivity and specificity also remained consistent between testing and training sets.

Data Overview

This dataset (link) was downloaded from Kaggle and the following data description was provided:

There are 3 types of input features:
Objective: factual information;
Examination: results of medical examination;
Subjective: information given by the patient.

Features:
Age | Objective Feature | age | int (days)
Height | Objective Feature | height | int (cm) |
Weight | Objective Feature | weight | float (kg) |
Gender | Objective Feature | gender | categorical code |
Systolic blood pressure | Examination Feature | ap_hi | int |
Diastolic blood pressure | Examination Feature | ap_lo | int |
Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
Smoking | Subjective Feature | smoke | binary |
Alcohol intake | Subjective Feature | alco | binary |
Physical activity | Subjective Feature | active | binary |
Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
All of the dataset values were collected at the moment of medical examination.

As the data description shows, the predictor variables are classified into one of three categories: Objective, Subjective, Examination. The data consists of both categorical and continuous predictors, and the target variable, cardio, is binary.

Importing Data and Loading Packages

The following packages are needed to replicate the analyses in this document.

library(ggplot2); library(dplyr); library(InformationValue)
library(kableExtra); library(gridExtra); library(forcats)
library(esquisse); library(ResourceSelection); library(caret)
library(hrbrthemes); library(papeR); library(generalhoslem)
library(car); library(ROCit); library(effects); library(rmarkdown)

Data Processing

The first thing I’ll do after importing and before processing or analyzing any data is split the data into testing and training sets. Note that I’m using an 80/20 train/test split. This is because I want to ensure that the least frequent outcomes (i.e., the outcomes after cross-tabulating all categorical variables) are well-represented with a sufficient sample size.

data <- read.csv("../.././data/cardio_train.csv", sep = ";")
set.seed(123)

inTrain <- createDataPartition(data$cardio, p = 0.8, list = F)

train <- data[inTrain, ]
test <- data[-inTrain, ]

row.names(train) <- 1:nrow(train)
row.names(test) <- 1:nrow(test)

rm(inTrain)
rm(data)

Let’s take a look at the first few rows of the imported dataset.

And let’s look at the structure of the data.

str(train)
## 'data.frame':    56000 obs. of  13 variables:
##  $ id         : int  1 2 4 8 9 12 13 14 16 18 ...
##  $ age        : int  20228 18857 17474 21914 22113 22584 17668 19834 18815 14791 ...
##  $ gender     : int  1 1 1 1 1 2 1 1 2 2 ...
##  $ height     : int  156 165 156 151 157 178 158 164 173 165 ...
##  $ weight     : num  85 64 56 67 93 95 71 68 60 60 ...
##  $ ap_hi      : int  140 130 100 120 130 130 110 110 120 120 ...
##  $ ap_lo      : int  90 70 60 80 80 90 70 60 80 80 ...
##  $ cholesterol: int  3 3 1 2 3 3 1 1 1 1 ...
##  $ gluc       : int  1 1 1 2 1 3 1 1 1 1 ...
##  $ smoke      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ alco       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ active     : int  1 0 0 0 1 1 1 0 1 0 ...
##  $ cardio     : int  1 1 0 0 0 1 0 0 0 0 ...

After inspecting the structure, it’s clear that some variables need to converted to a factor. Additionally, changing the factor levels from numeric to categories will make exploratory analysis easier. The data description does not clarify which numeric values correspond to males and females. We can calculate the average height for each subset and hopefully the large sample size will provide a clear answer.

# WHAT NUMERIC VALUE REPRESENTS MALES AND WHAT REPRESENTS FEMALES?
# WE CAN CALCULATE THE AVERGAE HEIGHT AND LET LARGE SAMPLE SIZES ANSWER THIS 
train %>% 
  group_by(gender) %>% 
  dplyr::summarise(mean = mean(height))
## # A tibble: 2 x 2
##   gender  mean
##    <int> <dbl>
## 1      1  161.
## 2      2  170.
# CHANGING NUMERIC LABELS TO CATEGORIES
train$cardio <- as.factor(train$cardio)
train$gender <- factor(train$gender, 1:2, c("Female", "Male"))
train$smoke <- factor(train$smoke, 0:1, c("No", "Yes"))
train$active <- factor(train$active, 0:1, c("No", "Yes"))
train$alco <- factor(train$alco, 0:1, c("No", "Yes")) 
train$cholesterol <- 
factor(train$cholesterol, 1:3, c("Normal", "Above Normal", "Well Above Normal"))
train$gluc <- 
factor(train$gluc, 1:3, c("Normal", "Above Normal", "Well Above Normal"))

There will likely be covariance between height and weight and between systolic blood pressure, ap_lo, and diastolic blood pressure, ap_lo. To circumvent this, I created two features: body mass index, bmi, and mean arterial pressure, map.

# FEATURE ENGINEERING
train$bmi <- round((train$weight / (train$height / 100)^2), 0)
train$map <- round(((2*train$ap_lo) + train$ap_hi) / 3, 0)

I live in the United States and we like to be different here, so I’ll convert height and weight metric units to “American” units for interpretability. I’ll also change age from a days to a years scale.

# CHANGING SCALES
train$age <- round(train$age / 365, 0)
train$height <- round(train$height / 2.54, 0)
train$weight <- round(train$weight * 2.2, 0)

Let’s take a look at the distributions of the continuous predictors with boxplots.

par(mfrow = c(2,2))
boxplot(train$age)
boxplot(train$height)
boxplot(train$weight)
boxplot(train$ap_hi)

par(mfrow = c(1,3))
boxplot(train$ap_lo)
boxplot(train$map)
boxplot(train$bmi)

par(mfrow = c(1,1))

Some of the visualizations are squished to save space, but clearly there are some distribution problems. I’ll now remove outliers using clinical knowledge and best judgment.

# REMOVING OUTLIERS AND UNUSUAL VALUES
remove <- row.names(train[train$age < 40 |
                          train$height < 48 |
                          train$height > 80 |
                          train$bmi > 80 |
                          train$bmi < 10 |
                          train$weight < 50 | 
                          train$ap_hi < 80 |
                          train$ap_hi > 220 |
                          train$ap_lo < 20 |
                          train$ap_lo > 200 |
                          train$map > 160, ])

remove <- as.numeric(remove)

I’ll also remove observations where the patient’s diastolic blood pressure was greater than the systolic blood pressure, because the pressure in the arteries should never be greater during relaxation of the heart compared to contraction of the heart.

# REMOVE VALUES WHERE DIASTOLIC PRESSURE IS GREATER THAN SYSTOLIC 
train$ap_lo <- ifelse(train$ap_lo >= train$ap_hi, NA, train$ap_lo)
# CLEANED DATASET
train <- train[-remove, ]
train <- train[!is.na(train$ap_lo), ]
# PERCENT OF DATA REMOVED
round((length(remove)+length(train[is.na(train$ap_lo),]))/ nrow(train) * 100, 2)
## [1] 2.61

Filtering the data by the above criteria removed ~2.6% of the observations. Now let’s take a look at the cleaned dataset.

Summary of predictor variables:

summary(train[ ,-c(1, 3, 8:13)])
summary(train[ ,c(3, 8:13)])
##       age            height          weight        ap_hi           ap_lo       
##  Min.   :40.00   Min.   :49.00   Min.   : 62   Min.   : 80.0   Min.   : 20.00  
##  1st Qu.:49.00   1st Qu.:63.00   1st Qu.:143   1st Qu.:120.0   1st Qu.: 80.00  
##  Median :54.00   Median :65.00   Median :158   Median :120.0   Median : 80.00  
##  Mean   :53.41   Mean   :64.76   Mean   :163   Mean   :126.7   Mean   : 81.31  
##  3rd Qu.:58.00   3rd Qu.:67.00   3rd Qu.:180   3rd Qu.:140.0   3rd Qu.: 90.00  
##  Max.   :65.00   Max.   :78.00   Max.   :440   Max.   :220.0   Max.   :140.00  
##       bmi             map        
##  Min.   :10.00   Min.   : 47.00  
##  1st Qu.:24.00   1st Qu.: 93.00  
##  Median :26.00   Median : 93.00  
##  Mean   :27.43   Mean   : 96.34  
##  3rd Qu.:30.00   3rd Qu.:103.00  
##  Max.   :70.00   Max.   :160.00  
##     gender                 cholesterol                   gluc       smoke      
##  Female:35541   Normal           :40824   Normal           :46299   No :49756  
##  Male  :18982   Above Normal     : 7386   Above Normal     : 4057   Yes: 4767  
##                 Well Above Normal: 6313   Well Above Normal: 4167              
##   alco       active      cardio   
##  No :51611   No :10667   0:27435  
##  Yes: 2912   Yes:43856   1:27088  
## 

Exploratory Analysis

NOTE: Much of the exploratory analysis in this Markdown is not shown because I used the wonderful package, esquisse. This package creates a Shiny app within the user’s R session and allows for rapid exploratory analysis, in sort of a Tableau-esque fashion. Highly recommend!! All it takes is one line of code to get the interface up and running: esquisse::esquisser(train).

Continuous Variables

The first thing I want to do for the continuous variables is assess covariance using a correlation matrix.

cont_vars <- c("age", "height", "weight", "ap_hi", "ap_lo", "bmi", "map")
M1 <- cor(train[ ,cont_vars])
print(round(M1, 2))
##          age height weight ap_hi ap_lo   bmi  map
## age     1.00  -0.09   0.05  0.21  0.15  0.10 0.19
## height -0.09   1.00   0.31  0.02  0.03 -0.20 0.03
## weight  0.05   0.31   1.00  0.27  0.26  0.86 0.28
## ap_hi   0.21   0.02   0.27  1.00  0.73  0.27 0.92
## ap_lo   0.15   0.03   0.26  0.73  1.00  0.24 0.94
## bmi     0.10  -0.20   0.86  0.27  0.24  1.00 0.28
## map     0.19   0.03   0.28  0.92  0.94  0.28 1.00

A visualization may help here:

corrplot::corrplot(M1, "square", "upper")

As suspected, systolic and diastolic blood pressure are correlated. Instead, I’ll use the map, mean arterial pressure. Surprisingly, height and weight are not very correlated (perhaps because the dataset consists of all adults, and the relationship is much stronger during childhood and adolescence). In any case, I’ll still use bmi, body mass index, instead of height and weight separately.

M2 <- cor(train[ ,c("age", "bmi", "map")])
corrplot::corrplot(M2, "square", "upper")

Now I’ll bin the continuous variables to so the counts of 0’s and 1’s for cardiovascular disease can be visualized for each bin.

# BINNING CONTINUOUS VARIABLES FOR VISUALIZATION
train$age_bin <- cut(train$age, breaks = 5)
train$weight_bin <- cut(train$weight, breaks = seq(100, 300, 20))
train$bmi_bin <- cut(train$bmi, breaks = seq(15, 50, 5))
train$height_bin <- cut(train$height, breaks = seq(55, 75, 5))
train$ap_hi_bin <- cut(train$ap_hi, breaks = seq(100, 200, 10))
train$ap_lo_bin <- cut(train$ap_lo, breaks = seq(50, 130, 20))
train$map_bin <- cut(train$map, breaks = seq(60, 140, 10))

Below are some visualizations I found interesting for showing the relationship between the continuous variables and cardiovascular disease

# MAP BIN
ggplot(train) +
 aes(x = map_bin, fill = cardio) +
 geom_bar(position = "dodge") +
 scale_fill_brewer(palette = "Accent") +
 labs(x = "Mean Arterial Pressure", y = "Count", fill = "CVD") +
 theme_minimal()

# BMI BIN
ggplot(train) +
 aes(x = bmi_bin, fill = cardio) +
 geom_bar(position = "dodge") +
 scale_fill_brewer(palette = "Set2") +
 labs(x = "Body Mass Index", y = "Count", fill = "CVD") +
 theme_minimal()

# AGE BIN
ggplot(train) +
 aes(x = age_bin, fill = cardio) +
 geom_bar(position = "dodge") +
 scale_fill_brewer(palette = "Pastel1") +
 labs(x = "Age", y = "Count", fill = "CVD") +
 theme_minimal()

grid.arrange(nrow = 1,
ggplot(train) +
 aes(x = cardio, y = map, fill = cardio) +
 geom_boxplot(show.legend = F) +
 scale_fill_brewer(palette = "Set1") +
 labs(x = "CVD", y = "Mean Arterial Pressure", fill = "CVD") +
 theme_minimal(),
ggplot(train) +
 aes(x = cardio, y = age, fill = cardio) +
 geom_boxplot(show.legend = F) +
 scale_fill_viridis_d(option = "cividis") +
 labs(x = "CVD", y = "Age", fill = "CVD") +
 theme_minimal()
)

grid.arrange(
ggplot(train) +
 aes(x = age, y = map, colour = cardio) +
 geom_point(size = 1L) +
 scale_color_brewer(palette = "Set1") +
 labs(x = "Age", y = "Mean Arterial Pressure", color = "CVD") +
 theme_minimal(),
ggplot(train) +
 aes(x = map, y = bmi, colour = cardio) +
 geom_point(size = 1L) +
 scale_color_brewer(palette = "Set1") +
 labs(x = "Mean Arterial Pressure", y = "Body Mass Index", color = "CVD") +
 theme_minimal()
)

Categorical Variables

ggplot(train) +
 aes(x = gluc, fill = cardio) +
 geom_bar(position = "dodge") +
 scale_fill_brewer(palette = "Pastel1") +
 labs(x = "Glucose", y = "Count", fill = "CVD") +
 theme_minimal()

ggplot(train) +
 aes(x = cholesterol, fill = cardio) +
 geom_bar(position = "dodge") +
 scale_fill_ipsum() +
 labs(x = "Cholesterol", y = "Count", fill = "CVD") +
 theme_minimal()

ggplot(train) +
 aes(x = gender, fill = cardio) +
 geom_bar(position = "dodge") +
 scale_fill_brewer(palette = "Accent") +
 labs(x = "Gender", y = "Count", fill = "CVD") +
 theme_minimal()

ggplot(train) +
 aes(x = alco, fill = cardio) +
 geom_bar(position = "dodge") +
 scale_fill_brewer(palette = "Pastel2") +
 labs(x = "Alcohol Consumption", y = "Cigarette Smoking", fill = "CVD") +
 theme_minimal() +
 facet_grid(vars(smoke), vars(), scales = "free_y")

ggplot(train) +
 aes(x = active, fill = cardio) +
 geom_bar(position = "dodge") +
 scale_fill_brewer(palette = "Set3") +
 labs(x = "Physical Activity", y = "Count", fill = "CVD") +
 theme_minimal()

ggplot(train) +
 aes(x = gluc, fill = cardio) +
 geom_bar(position = "dodge") +
 scale_fill_brewer(palette = "Set2") +
 labs(x = "Glucose ", y = "Cholesterol", fill = "CVD") +
 theme_minimal() +
 facet_grid(vars(cholesterol), vars(), scales = "free_y")

What do all of these visualizations tell us? What do we actually do with this information? Generally, the exploratory analysis should inform the feature engineering and modeling processes. However, this dataset consists of well-known risk factors for cardiovascular disease, and, as such, the exploratory analysis is not very interesting. It’s already well-established in the literature that cholesterol, diabetes, age, blood pressure, etc. will contribute to the development of cardiovascular disease.

Despite this, the EDA still provided some useful information. Smoking, activity status, and alcohol consumption were not very strong predictors. These could potentially be removed from modeling. An interaction term of cholesterol and glucose may be more insightful. Let’s begin preparing the data for modeling by inspecting the least frequent outcomes.

Model Prep

Least Frequent Outcomes

Before modeling, I want to look at the least frequent outcomes to check for adequate sample size. Here are the least frequent outcomes after cross-tabulating all categorical variables:

# ALL VARIABLES
train %>% 
  dplyr::select(gender, cholesterol, smoke, gluc, alco, active) %>%
  group_by_all() %>% 
  dplyr::summarise(count = n()) %>% 
  arrange(count) %>% 
  head()
## # A tibble: 6 x 7
## # Groups:   gender, cholesterol, smoke, gluc, alco [5]
##   gender cholesterol       smoke gluc              alco  active count
##   <fct>  <fct>             <fct> <fct>             <fct> <fct>  <int>
## 1 Female Normal            Yes   Well Above Normal Yes   Yes        1
## 2 Female Well Above Normal Yes   Normal            Yes   No         1
## 3 Female Well Above Normal Yes   Above Normal      Yes   No         1
## 4 Female Well Above Normal Yes   Well Above Normal Yes   No         1
## 5 Female Well Above Normal Yes   Well Above Normal Yes   Yes        1
## 6 Male   Above Normal      No    Well Above Normal Yes   No         1

Removing smoke and gender from the cross-tabulation:

# REMOVE SMOKE AND GENDER
train %>% 
  dplyr::select(cholesterol, gluc, active, alco) %>%
  group_by_all() %>% 
  dplyr::summarise(count = n()) %>% 
  arrange(count) %>% 
  head()
## # A tibble: 6 x 5
## # Groups:   cholesterol, gluc, active [6]
##   cholesterol       gluc              active alco  count
##   <fct>             <fct>             <fct>  <fct> <int>
## 1 Above Normal      Well Above Normal No     Yes       2
## 2 Well Above Normal Above Normal      No     Yes       9
## 3 Normal            Well Above Normal No     Yes      11
## 4 Above Normal      Well Above Normal Yes    Yes      18
## 5 Normal            Above Normal      No     Yes      20
## 6 Well Above Normal Well Above Normal No     Yes      23

Removing smoke, gender, and alcohol:

# REMOVE SMOKE, GENDER AND ALCOHOL
train %>% 
  dplyr::select(cholesterol, gluc, active) %>%
  group_by_all() %>% 
  dplyr::summarise(count = n()) %>% 
  arrange(count) %>% 
  head()
## # A tibble: 6 x 4
## # Groups:   cholesterol, gluc [4]
##   cholesterol       gluc              active count
##   <fct>             <fct>             <fct>  <int>
## 1 Above Normal      Well Above Normal No        54
## 2 Well Above Normal Above Normal      No        71
## 3 Normal            Well Above Normal No       211
## 4 Above Normal      Well Above Normal Yes      242
## 5 Well Above Normal Above Normal      Yes      327
## 6 Normal            Above Normal      No       369

Removing smoke, gender, and alcohol provides us with sufficient sample size. Given that these predictors were not particularly insightful, dropping them seems like a good decision.

Select Variables for Modeling

# PREDICTORS BEING USED FOR MODELING
train <- train[ ,c("age", "cholesterol", "gluc", "active", "bmi", "map", "cardio")]

Modeling: Logistic Regression

The modeling section of this analysis is brief and straightforward. I’m interested in comparing a model with all the predictors listed above to a model with all of the predictors listed above plus a glucose x cholesterol interaction. I’m be using a plain vanilla logistic regression due to speed, simplicity and interpretability.

Model 1

Below I’m creating a model object, mod1, with the following predictors: age, cholesterol, gluc, active, bmi, map, cardio. A model summary is shown below.

mod1 <- glm(cardio ~ ., family = "binomial", data = train)
Estimate Odds Ratio CI (lower) CI (upper) Std. Error z value Pr(>|z|)
(Intercept) -12. 8.7e-06 6.7e-06 1.1e-05 0.14 -85. <0.001 ***
age 0.054 1.1 1.1 1.1 0.0015
<0.001 ***
cholesterol: Above Normal 0.40 1.5 1.4 1.6 0.030
<0.001 ***
cholesterol: Well Above Normal 1.1 3.0 2.7 3.2 0.040
<0.001 ***
gluc: Above Normal 0.0079 1.0 0.93 1.1 0.040 0.20 0.84
gluc: Well Above Normal -0.33 0.72 0.66 0.78 0.044 -7.5 <0.001 ***
active: Yes -0.23 0.79 0.75 0.83 0.024 -9.6 <0.001 ***
bmi 0.028 1.0 1.0 1.0 0.0020
<0.001 ***
map 0.084 1.1 1.1 1.1 0.0011
<0.001 ***

Model 2

Below I’m creating a model object, mod2, with the following predictors: age, cholesterol, gluc, active, bmi, map, cardio, gluc*cholesterol. A model summary is shown below.

mod2 <- update(mod1, .~. + gluc*cholesterol)
Estimate Odds Ratio CI (lower) CI (upper) Std. Error z value Pr(>|z|)
(Intercept) -12. 8.9e-06 6.8e-06 1.2e-05 0.14 -85. <0.001 ***
age 0.053 1.1 1.1 1.1 0.0015
<0.001 ***
cholesterol: Above Normal 0.46 1.6 1.5 1.7 0.034
<0.001 ***
cholesterol: Well Above Normal 1.3 3.6 3.2 4.0 0.050
<0.001 ***
gluc: Above Normal 0.20 1.2 1.1 1.4 0.056 3.6 <0.001 ***
gluc: Well Above Normal -0.051 0.95 0.83 1.1 0.066 -0.77 0.44
active: Yes -0.24 0.79 0.75 0.83 0.024 -9.7 <0.001 ***
bmi 0.028 1.0 1.0 1.0 0.0020
<0.001 ***
map 0.084 1.1 1.1 1.1 0.0011
<0.001 ***
cholesterol: Above Normal:glucAbove Normal -0.36 0.69 0.59 0.82 0.083 -4.4 <0.001 ***
cholesterol: Well Above Normal:glucAbove Normal -0.65 0.52 0.39 0.71 0.15 -4.2 <0.001 ***
cholesterol: Above Normal:glucWell Above Normal -0.32 0.72 0.54 0.97 0.15 -2.2 0.03
cholesterol: Well Above Normal:glucWell Above Normal -0.58 0.56 0.47 0.67 0.094 -6.2 <0.001 ***

Model Comparison

Let’s see if mod2 is significantly better than mod1 using an analysis of variance.

anova(mod1, mod2, test='LR')
## Analysis of Deviance Table
## 
## Model 1: cardio ~ age + cholesterol + gluc + active + bmi + map
## Model 2: cardio ~ age + cholesterol + gluc + active + bmi + map + cholesterol:gluc
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     54514      62138                          
## 2     54510      62077  4   61.213 1.613e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results show that mod2 is in fact a significantly better predictor.

Model Assumptions & Fit

It’s important to check a model’s assumptions and fit. The model assumption of no multicollinearity will be assessed using variance inflation factor, and model fit will be assessed using the Hosmer and Lemeshow test.

Multicollinearity

vif(mod1)
vif(mod2)
##                 GVIF Df GVIF^(1/(2*Df))
## age         1.011257  1        1.005613
## cholesterol 1.489277  2        1.104699
## gluc        1.481988  2        1.103345
## active      1.001492  1        1.000746
## bmi         1.052166  1        1.025751
## map         1.039769  1        1.019691
##                       GVIF Df GVIF^(1/(2*Df))
## age               1.012091  1        1.006028
## cholesterol       3.081504  2        1.324923
## gluc              6.865854  2        1.618727
## active            1.001938  1        1.000968
## bmi               1.055058  1        1.027160
## map               1.040310  1        1.019956
## cholesterol:gluc 13.646748  4        1.386368

Hosmer and Lemeshow test

logitgof(mod1$y, fitted(mod1))
logitgof(mod2$y, fitted(mod2))
## 
##  Hosmer and Lemeshow test (binary model)
## 
## data:  mod1$y, fitted(mod1)
## X-squared = 560.4, df = 8, p-value < 2.2e-16
## 
## 
##  Hosmer and Lemeshow test (binary model)
## 
## data:  mod2$y, fitted(mod2)
## X-squared = 597.28, df = 8, p-value < 2.2e-16

There appears to be multicollinearity for mod2 but not mod1, and both models are a poor fit according to the H-L test. It’s important to keep these things in mind moving forward, as model interpretation should be highly criticized when multicollinearity is present and the model is a poor fit.

Model Predictions

I’ll use these two models to make CVD classification predictions and create a confusion matrix.

Training Set

Model 1

pred_mod1 <- ifelse(predict.glm(mod1, type = "response") < 0.5, 0, 1)
(cm_1 <- caret::confusionMatrix(as.factor(pred_mod1), train$cardio))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 21117  8809
##          1  6318 18279
##                                           
##                Accuracy : 0.7226          
##                  95% CI : (0.7188, 0.7263)
##     No Information Rate : 0.5032          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4448          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7697          
##             Specificity : 0.6748          
##          Pos Pred Value : 0.7056          
##          Neg Pred Value : 0.7431          
##              Prevalence : 0.5032          
##          Detection Rate : 0.3873          
##    Detection Prevalence : 0.5489          
##       Balanced Accuracy : 0.7223          
##                                           
##        'Positive' Class : 0               
## 

Model 2

pred_mod2 <- ifelse(predict.glm(mod2, type = "response") < 0.5, 0, 1)
(cm_2 <- caret::confusionMatrix(as.factor(pred_mod2), train$cardio))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 21174  8863
##          1  6261 18225
##                                           
##                Accuracy : 0.7226          
##                  95% CI : (0.7188, 0.7264)
##     No Information Rate : 0.5032          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4449          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7718          
##             Specificity : 0.6728          
##          Pos Pred Value : 0.7049          
##          Neg Pred Value : 0.7443          
##              Prevalence : 0.5032          
##          Detection Rate : 0.3883          
##    Detection Prevalence : 0.5509          
##       Balanced Accuracy : 0.7223          
##                                           
##        'Positive' Class : 0               
## 

Figures

par(mfrow = c(1,2))

# MODEL 1
fourfoldplot(cm_1$table, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "Accuracy: 72.2%")

# MODEL 2
fourfoldplot(cm_2$table, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "Accuracy: 72.2%")

par(mfrow = c(1,1))
# MODEL 1
plot(rocit(predict(mod1, train, type = "response"), train$cardio))

# MODEL 2
plot(rocit(predict(mod2, train, type = "response"), train$cardio))

Test Set

Time for the moment of truth. Given that the models were very similar in terms of both sensitivity and specificity, I’ll use mod1 to employ on the test set. This seems like a good choice, given that it’s also a slightly more parsimonious model and multicollinearity was not observed. First I need to process the test set in the same way the training set was processed.

# PROCESS TEST SET

# CHANGING NUMERIC LABELS TO CATEGORIES
test$cardio <- as.factor(test$cardio)
test$gender <- factor(test$gender, 1:2, c("Female", "Male"))
test$smoke <- factor(test$smoke, 0:1, c("No", "Yes"))
test$active <- factor(test$active, 0:1, c("No", "Yes"))
test$alco <- factor(test$alco, 0:1, c("No", "Yes")) 
test$cholesterol <- 
  factor(test$cholesterol, 1:3, c("Normal", "Above Normal", "Well Above Normal"))
test$gluc <- 
  factor(test$gluc, 1:3, c("Normal", "Above Normal", "Well Above Normal"))

# FEATURE ENGINEERING
test$bmi <- round((test$weight / (test$height / 100)^2), 0)
test$map <- round(((2*test$ap_lo) + test$ap_hi) / 3, 0)

# CHANGING SCALES
test$age <- round(test$age / 365, 0)
test$height <- round(test$height / 2.54, 0)
test$weight <- round(test$weight * 2.2, 0)
test_pred <- ifelse(predict.glm(mod1, test, type = "response") < 0.5, 0, 1)
(test_cm <- caret::confusionMatrix(as.factor(test_pred), test$cardio))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5355 2284
##          1 1661 4700
##                                           
##                Accuracy : 0.7182          
##                  95% CI : (0.7107, 0.7257)
##     No Information Rate : 0.5011          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4363          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7633          
##             Specificity : 0.6730          
##          Pos Pred Value : 0.7010          
##          Neg Pred Value : 0.7389          
##              Prevalence : 0.5011          
##          Detection Rate : 0.3825          
##    Detection Prevalence : 0.5456          
##       Balanced Accuracy : 0.7181          
##                                           
##        'Positive' Class : 0               
## 
fourfoldplot(test_cm$table, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "Accuracy: 71.8%")

The accuracy is almost as good as the accuracy on the training set, with very similar sensitivity and specificity.

Predicted Probability Plots

Hmm, not sure what’s going on with the predicted probability of glucose levels.. definitely worth investigating.

grid.arrange(nrow = 1,
plot(effect("cholesterol", mod1), 
     ylab = "Probability of CVD",
     xlab = "Cholesterol Levels",
     main = "",
     lines = list(col = "black")),
plot(effect("gluc", mod1), 
     ylab = "Probability of CVD",
     xlab = "Glucose Levels",
     main = "",
     lines = list(col = "black"))
)