In this analysis I created a cardiovascular disease (CVD) classification model using logistic regression. The data was processed, I inspected least frequent outcomes and class imbalance, and assessed model fit and model assumptions. It was by no means a comprehensive analysis, though. The model accuracy was ~72% on both the training and test sets, and sensitivity and specificity also remained consistent between testing and training sets.
This dataset (link) was downloaded from Kaggle and the following data description was provided:
There are 3 types of input features:
Objective: factual information;
Examination: results of medical examination;
Subjective: information given by the patient.
Features:
Age | Objective Feature | age | int (days)
Height | Objective Feature | height | int (cm) |
Weight | Objective Feature | weight | float (kg) |
Gender | Objective Feature | gender | categorical code |
Systolic blood pressure | Examination Feature | ap_hi | int |
Diastolic blood pressure | Examination Feature | ap_lo | int |
Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
Smoking | Subjective Feature | smoke | binary |
Alcohol intake | Subjective Feature | alco | binary |
Physical activity | Subjective Feature | active | binary |
Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
All of the dataset values were collected at the moment of medical examination.
As the data description shows, the predictor variables are classified into one of three categories: Objective, Subjective, Examination. The data consists of both categorical and continuous predictors, and the target variable, cardio, is binary.
The following packages are needed to replicate the analyses in this document.
library(ggplot2); library(dplyr); library(InformationValue)
library(kableExtra); library(gridExtra); library(forcats)
library(esquisse); library(ResourceSelection); library(caret)
library(hrbrthemes); library(papeR); library(generalhoslem)
library(car); library(ROCit); library(effects); library(rmarkdown)
The first thing I’ll do after importing and before processing or analyzing any data is split the data into testing and training sets. Note that I’m using an 80/20 train/test split. This is because I want to ensure that the least frequent outcomes (i.e., the outcomes after cross-tabulating all categorical variables) are well-represented with a sufficient sample size.
data <- read.csv("../.././data/cardio_train.csv", sep = ";")
set.seed(123)
inTrain <- createDataPartition(data$cardio, p = 0.8, list = F)
train <- data[inTrain, ]
test <- data[-inTrain, ]
row.names(train) <- 1:nrow(train)
row.names(test) <- 1:nrow(test)
rm(inTrain)
rm(data)
Let’s take a look at the first few rows of the imported dataset.
And let’s look at the structure of the data.
str(train)
## 'data.frame': 56000 obs. of 13 variables:
## $ id : int 1 2 4 8 9 12 13 14 16 18 ...
## $ age : int 20228 18857 17474 21914 22113 22584 17668 19834 18815 14791 ...
## $ gender : int 1 1 1 1 1 2 1 1 2 2 ...
## $ height : int 156 165 156 151 157 178 158 164 173 165 ...
## $ weight : num 85 64 56 67 93 95 71 68 60 60 ...
## $ ap_hi : int 140 130 100 120 130 130 110 110 120 120 ...
## $ ap_lo : int 90 70 60 80 80 90 70 60 80 80 ...
## $ cholesterol: int 3 3 1 2 3 3 1 1 1 1 ...
## $ gluc : int 1 1 1 2 1 3 1 1 1 1 ...
## $ smoke : int 0 0 0 0 0 0 0 0 0 0 ...
## $ alco : int 0 0 0 0 0 0 0 0 0 0 ...
## $ active : int 1 0 0 0 1 1 1 0 1 0 ...
## $ cardio : int 1 1 0 0 0 1 0 0 0 0 ...
After inspecting the structure, it’s clear that some variables need to converted to a factor. Additionally, changing the factor levels from numeric to categories will make exploratory analysis easier. The data description does not clarify which numeric values correspond to males and females. We can calculate the average height for each subset and hopefully the large sample size will provide a clear answer.
# WHAT NUMERIC VALUE REPRESENTS MALES AND WHAT REPRESENTS FEMALES?
# WE CAN CALCULATE THE AVERGAE HEIGHT AND LET LARGE SAMPLE SIZES ANSWER THIS
train %>%
group_by(gender) %>%
dplyr::summarise(mean = mean(height))
## # A tibble: 2 x 2
## gender mean
## <int> <dbl>
## 1 1 161.
## 2 2 170.
# CHANGING NUMERIC LABELS TO CATEGORIES
train$cardio <- as.factor(train$cardio)
train$gender <- factor(train$gender, 1:2, c("Female", "Male"))
train$smoke <- factor(train$smoke, 0:1, c("No", "Yes"))
train$active <- factor(train$active, 0:1, c("No", "Yes"))
train$alco <- factor(train$alco, 0:1, c("No", "Yes"))
train$cholesterol <-
factor(train$cholesterol, 1:3, c("Normal", "Above Normal", "Well Above Normal"))
train$gluc <-
factor(train$gluc, 1:3, c("Normal", "Above Normal", "Well Above Normal"))
There will likely be covariance between height and weight and between systolic blood pressure, ap_lo, and diastolic blood pressure, ap_lo. To circumvent this, I created two features: body mass index, bmi, and mean arterial pressure, map.
# FEATURE ENGINEERING
train$bmi <- round((train$weight / (train$height / 100)^2), 0)
train$map <- round(((2*train$ap_lo) + train$ap_hi) / 3, 0)
I live in the United States and we like to be different here, so I’ll convert height and weight metric units to “American” units for interpretability. I’ll also change age from a days to a years scale.
# CHANGING SCALES
train$age <- round(train$age / 365, 0)
train$height <- round(train$height / 2.54, 0)
train$weight <- round(train$weight * 2.2, 0)
Let’s take a look at the distributions of the continuous predictors with boxplots.
par(mfrow = c(2,2))
boxplot(train$age)
boxplot(train$height)
boxplot(train$weight)
boxplot(train$ap_hi)
par(mfrow = c(1,3))
boxplot(train$ap_lo)
boxplot(train$map)
boxplot(train$bmi)
par(mfrow = c(1,1))
Some of the visualizations are squished to save space, but clearly there are some distribution problems. I’ll now remove outliers using clinical knowledge and best judgment.
# REMOVING OUTLIERS AND UNUSUAL VALUES
remove <- row.names(train[train$age < 40 |
train$height < 48 |
train$height > 80 |
train$bmi > 80 |
train$bmi < 10 |
train$weight < 50 |
train$ap_hi < 80 |
train$ap_hi > 220 |
train$ap_lo < 20 |
train$ap_lo > 200 |
train$map > 160, ])
remove <- as.numeric(remove)
I’ll also remove observations where the patient’s diastolic blood pressure was greater than the systolic blood pressure, because the pressure in the arteries should never be greater during relaxation of the heart compared to contraction of the heart.
# REMOVE VALUES WHERE DIASTOLIC PRESSURE IS GREATER THAN SYSTOLIC
train$ap_lo <- ifelse(train$ap_lo >= train$ap_hi, NA, train$ap_lo)
# CLEANED DATASET
train <- train[-remove, ]
train <- train[!is.na(train$ap_lo), ]
# PERCENT OF DATA REMOVED
round((length(remove)+length(train[is.na(train$ap_lo),]))/ nrow(train) * 100, 2)
## [1] 2.61
Filtering the data by the above criteria removed ~2.6% of the observations. Now let’s take a look at the cleaned dataset.
Summary of predictor variables:
summary(train[ ,-c(1, 3, 8:13)])
summary(train[ ,c(3, 8:13)])
## age height weight ap_hi ap_lo
## Min. :40.00 Min. :49.00 Min. : 62 Min. : 80.0 Min. : 20.00
## 1st Qu.:49.00 1st Qu.:63.00 1st Qu.:143 1st Qu.:120.0 1st Qu.: 80.00
## Median :54.00 Median :65.00 Median :158 Median :120.0 Median : 80.00
## Mean :53.41 Mean :64.76 Mean :163 Mean :126.7 Mean : 81.31
## 3rd Qu.:58.00 3rd Qu.:67.00 3rd Qu.:180 3rd Qu.:140.0 3rd Qu.: 90.00
## Max. :65.00 Max. :78.00 Max. :440 Max. :220.0 Max. :140.00
## bmi map
## Min. :10.00 Min. : 47.00
## 1st Qu.:24.00 1st Qu.: 93.00
## Median :26.00 Median : 93.00
## Mean :27.43 Mean : 96.34
## 3rd Qu.:30.00 3rd Qu.:103.00
## Max. :70.00 Max. :160.00
## gender cholesterol gluc smoke
## Female:35541 Normal :40824 Normal :46299 No :49756
## Male :18982 Above Normal : 7386 Above Normal : 4057 Yes: 4767
## Well Above Normal: 6313 Well Above Normal: 4167
## alco active cardio
## No :51611 No :10667 0:27435
## Yes: 2912 Yes:43856 1:27088
##
NOTE: Much of the exploratory analysis in this Markdown is not shown because I used the wonderful package,
esquisse. This package creates a Shiny app within the user’s R session and allows for rapid exploratory analysis, in sort of a Tableau-esque fashion. Highly recommend!! All it takes is one line of code to get the interface up and running:esquisse::esquisser(train).
The first thing I want to do for the continuous variables is assess covariance using a correlation matrix.
cont_vars <- c("age", "height", "weight", "ap_hi", "ap_lo", "bmi", "map")
M1 <- cor(train[ ,cont_vars])
print(round(M1, 2))
## age height weight ap_hi ap_lo bmi map
## age 1.00 -0.09 0.05 0.21 0.15 0.10 0.19
## height -0.09 1.00 0.31 0.02 0.03 -0.20 0.03
## weight 0.05 0.31 1.00 0.27 0.26 0.86 0.28
## ap_hi 0.21 0.02 0.27 1.00 0.73 0.27 0.92
## ap_lo 0.15 0.03 0.26 0.73 1.00 0.24 0.94
## bmi 0.10 -0.20 0.86 0.27 0.24 1.00 0.28
## map 0.19 0.03 0.28 0.92 0.94 0.28 1.00
A visualization may help here:
corrplot::corrplot(M1, "square", "upper")
As suspected, systolic and diastolic blood pressure are correlated. Instead, I’ll use the map, mean arterial pressure. Surprisingly, height and weight are not very correlated (perhaps because the dataset consists of all adults, and the relationship is much stronger during childhood and adolescence). In any case, I’ll still use bmi, body mass index, instead of height and weight separately.
M2 <- cor(train[ ,c("age", "bmi", "map")])
corrplot::corrplot(M2, "square", "upper")
Now I’ll bin the continuous variables to so the counts of 0’s and 1’s for cardiovascular disease can be visualized for each bin.
# BINNING CONTINUOUS VARIABLES FOR VISUALIZATION
train$age_bin <- cut(train$age, breaks = 5)
train$weight_bin <- cut(train$weight, breaks = seq(100, 300, 20))
train$bmi_bin <- cut(train$bmi, breaks = seq(15, 50, 5))
train$height_bin <- cut(train$height, breaks = seq(55, 75, 5))
train$ap_hi_bin <- cut(train$ap_hi, breaks = seq(100, 200, 10))
train$ap_lo_bin <- cut(train$ap_lo, breaks = seq(50, 130, 20))
train$map_bin <- cut(train$map, breaks = seq(60, 140, 10))
Below are some visualizations I found interesting for showing the relationship between the continuous variables and cardiovascular disease
# MAP BIN
ggplot(train) +
aes(x = map_bin, fill = cardio) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Accent") +
labs(x = "Mean Arterial Pressure", y = "Count", fill = "CVD") +
theme_minimal()
# BMI BIN
ggplot(train) +
aes(x = bmi_bin, fill = cardio) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set2") +
labs(x = "Body Mass Index", y = "Count", fill = "CVD") +
theme_minimal()
# AGE BIN
ggplot(train) +
aes(x = age_bin, fill = cardio) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Pastel1") +
labs(x = "Age", y = "Count", fill = "CVD") +
theme_minimal()
grid.arrange(nrow = 1,
ggplot(train) +
aes(x = cardio, y = map, fill = cardio) +
geom_boxplot(show.legend = F) +
scale_fill_brewer(palette = "Set1") +
labs(x = "CVD", y = "Mean Arterial Pressure", fill = "CVD") +
theme_minimal(),
ggplot(train) +
aes(x = cardio, y = age, fill = cardio) +
geom_boxplot(show.legend = F) +
scale_fill_viridis_d(option = "cividis") +
labs(x = "CVD", y = "Age", fill = "CVD") +
theme_minimal()
)
grid.arrange(
ggplot(train) +
aes(x = age, y = map, colour = cardio) +
geom_point(size = 1L) +
scale_color_brewer(palette = "Set1") +
labs(x = "Age", y = "Mean Arterial Pressure", color = "CVD") +
theme_minimal(),
ggplot(train) +
aes(x = map, y = bmi, colour = cardio) +
geom_point(size = 1L) +
scale_color_brewer(palette = "Set1") +
labs(x = "Mean Arterial Pressure", y = "Body Mass Index", color = "CVD") +
theme_minimal()
)
ggplot(train) +
aes(x = gluc, fill = cardio) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Pastel1") +
labs(x = "Glucose", y = "Count", fill = "CVD") +
theme_minimal()
ggplot(train) +
aes(x = cholesterol, fill = cardio) +
geom_bar(position = "dodge") +
scale_fill_ipsum() +
labs(x = "Cholesterol", y = "Count", fill = "CVD") +
theme_minimal()
ggplot(train) +
aes(x = gender, fill = cardio) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Accent") +
labs(x = "Gender", y = "Count", fill = "CVD") +
theme_minimal()
ggplot(train) +
aes(x = alco, fill = cardio) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Pastel2") +
labs(x = "Alcohol Consumption", y = "Cigarette Smoking", fill = "CVD") +
theme_minimal() +
facet_grid(vars(smoke), vars(), scales = "free_y")
ggplot(train) +
aes(x = active, fill = cardio) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set3") +
labs(x = "Physical Activity", y = "Count", fill = "CVD") +
theme_minimal()
ggplot(train) +
aes(x = gluc, fill = cardio) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set2") +
labs(x = "Glucose ", y = "Cholesterol", fill = "CVD") +
theme_minimal() +
facet_grid(vars(cholesterol), vars(), scales = "free_y")
What do all of these visualizations tell us? What do we actually do with this information? Generally, the exploratory analysis should inform the feature engineering and modeling processes. However, this dataset consists of well-known risk factors for cardiovascular disease, and, as such, the exploratory analysis is not very interesting. It’s already well-established in the literature that cholesterol, diabetes, age, blood pressure, etc. will contribute to the development of cardiovascular disease.
Despite this, the EDA still provided some useful information. Smoking, activity status, and alcohol consumption were not very strong predictors. These could potentially be removed from modeling. An interaction term of cholesterol and glucose may be more insightful. Let’s begin preparing the data for modeling by inspecting the least frequent outcomes.
Before modeling, I want to look at the least frequent outcomes to check for adequate sample size. Here are the least frequent outcomes after cross-tabulating all categorical variables:
# ALL VARIABLES
train %>%
dplyr::select(gender, cholesterol, smoke, gluc, alco, active) %>%
group_by_all() %>%
dplyr::summarise(count = n()) %>%
arrange(count) %>%
head()
## # A tibble: 6 x 7
## # Groups: gender, cholesterol, smoke, gluc, alco [5]
## gender cholesterol smoke gluc alco active count
## <fct> <fct> <fct> <fct> <fct> <fct> <int>
## 1 Female Normal Yes Well Above Normal Yes Yes 1
## 2 Female Well Above Normal Yes Normal Yes No 1
## 3 Female Well Above Normal Yes Above Normal Yes No 1
## 4 Female Well Above Normal Yes Well Above Normal Yes No 1
## 5 Female Well Above Normal Yes Well Above Normal Yes Yes 1
## 6 Male Above Normal No Well Above Normal Yes No 1
Removing smoke and gender from the cross-tabulation:
# REMOVE SMOKE AND GENDER
train %>%
dplyr::select(cholesterol, gluc, active, alco) %>%
group_by_all() %>%
dplyr::summarise(count = n()) %>%
arrange(count) %>%
head()
## # A tibble: 6 x 5
## # Groups: cholesterol, gluc, active [6]
## cholesterol gluc active alco count
## <fct> <fct> <fct> <fct> <int>
## 1 Above Normal Well Above Normal No Yes 2
## 2 Well Above Normal Above Normal No Yes 9
## 3 Normal Well Above Normal No Yes 11
## 4 Above Normal Well Above Normal Yes Yes 18
## 5 Normal Above Normal No Yes 20
## 6 Well Above Normal Well Above Normal No Yes 23
Removing smoke, gender, and alcohol:
# REMOVE SMOKE, GENDER AND ALCOHOL
train %>%
dplyr::select(cholesterol, gluc, active) %>%
group_by_all() %>%
dplyr::summarise(count = n()) %>%
arrange(count) %>%
head()
## # A tibble: 6 x 4
## # Groups: cholesterol, gluc [4]
## cholesterol gluc active count
## <fct> <fct> <fct> <int>
## 1 Above Normal Well Above Normal No 54
## 2 Well Above Normal Above Normal No 71
## 3 Normal Well Above Normal No 211
## 4 Above Normal Well Above Normal Yes 242
## 5 Well Above Normal Above Normal Yes 327
## 6 Normal Above Normal No 369
Removing smoke, gender, and alcohol provides us with sufficient sample size. Given that these predictors were not particularly insightful, dropping them seems like a good decision.
# PREDICTORS BEING USED FOR MODELING
train <- train[ ,c("age", "cholesterol", "gluc", "active", "bmi", "map", "cardio")]
The modeling section of this analysis is brief and straightforward. I’m interested in comparing a model with all the predictors listed above to a model with all of the predictors listed above plus a glucose x cholesterol interaction. I’m be using a plain vanilla logistic regression due to speed, simplicity and interpretability.
Below I’m creating a model object, mod1, with the following predictors: age, cholesterol, gluc, active, bmi, map, cardio. A model summary is shown below.
mod1 <- glm(cardio ~ ., family = "binomial", data = train)
| Estimate | Odds Ratio | CI (lower) | CI (upper) | Std. Error | z value | Pr(>|z|) | ||
|---|---|---|---|---|---|---|---|---|
| (Intercept) | -12. | 8.7e-06 | 6.7e-06 | 1.1e-05 | 0.14 | -85. | <0.001 | *** |
| age | 0.054 | 1.1 | 1.1 | 1.1 | 0.0015 |
|
<0.001 | *** |
| cholesterol: Above Normal | 0.40 | 1.5 | 1.4 | 1.6 | 0.030 |
|
<0.001 | *** |
| cholesterol: Well Above Normal | 1.1 | 3.0 | 2.7 | 3.2 | 0.040 |
|
<0.001 | *** |
| gluc: Above Normal | 0.0079 | 1.0 | 0.93 | 1.1 | 0.040 | 0.20 | 0.84 | |
| gluc: Well Above Normal | -0.33 | 0.72 | 0.66 | 0.78 | 0.044 | -7.5 | <0.001 | *** |
| active: Yes | -0.23 | 0.79 | 0.75 | 0.83 | 0.024 | -9.6 | <0.001 | *** |
| bmi | 0.028 | 1.0 | 1.0 | 1.0 | 0.0020 |
|
<0.001 | *** |
| map | 0.084 | 1.1 | 1.1 | 1.1 | 0.0011 |
|
<0.001 | *** |
Below I’m creating a model object, mod2, with the following predictors: age, cholesterol, gluc, active, bmi, map, cardio, gluc*cholesterol. A model summary is shown below.
mod2 <- update(mod1, .~. + gluc*cholesterol)
| Estimate | Odds Ratio | CI (lower) | CI (upper) | Std. Error | z value | Pr(>|z|) | ||
|---|---|---|---|---|---|---|---|---|
| (Intercept) | -12. | 8.9e-06 | 6.8e-06 | 1.2e-05 | 0.14 | -85. | <0.001 | *** |
| age | 0.053 | 1.1 | 1.1 | 1.1 | 0.0015 |
|
<0.001 | *** |
| cholesterol: Above Normal | 0.46 | 1.6 | 1.5 | 1.7 | 0.034 |
|
<0.001 | *** |
| cholesterol: Well Above Normal | 1.3 | 3.6 | 3.2 | 4.0 | 0.050 |
|
<0.001 | *** |
| gluc: Above Normal | 0.20 | 1.2 | 1.1 | 1.4 | 0.056 | 3.6 | <0.001 | *** |
| gluc: Well Above Normal | -0.051 | 0.95 | 0.83 | 1.1 | 0.066 | -0.77 | 0.44 | |
| active: Yes | -0.24 | 0.79 | 0.75 | 0.83 | 0.024 | -9.7 | <0.001 | *** |
| bmi | 0.028 | 1.0 | 1.0 | 1.0 | 0.0020 |
|
<0.001 | *** |
| map | 0.084 | 1.1 | 1.1 | 1.1 | 0.0011 |
|
<0.001 | *** |
| cholesterol: Above Normal:glucAbove Normal | -0.36 | 0.69 | 0.59 | 0.82 | 0.083 | -4.4 | <0.001 | *** |
| cholesterol: Well Above Normal:glucAbove Normal | -0.65 | 0.52 | 0.39 | 0.71 | 0.15 | -4.2 | <0.001 | *** |
| cholesterol: Above Normal:glucWell Above Normal | -0.32 | 0.72 | 0.54 | 0.97 | 0.15 | -2.2 | 0.03 |
|
| cholesterol: Well Above Normal:glucWell Above Normal | -0.58 | 0.56 | 0.47 | 0.67 | 0.094 | -6.2 | <0.001 | *** |
Let’s see if mod2 is significantly better than mod1 using an analysis of variance.
anova(mod1, mod2, test='LR')
## Analysis of Deviance Table
##
## Model 1: cardio ~ age + cholesterol + gluc + active + bmi + map
## Model 2: cardio ~ age + cholesterol + gluc + active + bmi + map + cholesterol:gluc
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 54514 62138
## 2 54510 62077 4 61.213 1.613e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show that mod2 is in fact a significantly better predictor.
It’s important to check a model’s assumptions and fit. The model assumption of no multicollinearity will be assessed using variance inflation factor, and model fit will be assessed using the Hosmer and Lemeshow test.
vif(mod1)
vif(mod2)
## GVIF Df GVIF^(1/(2*Df))
## age 1.011257 1 1.005613
## cholesterol 1.489277 2 1.104699
## gluc 1.481988 2 1.103345
## active 1.001492 1 1.000746
## bmi 1.052166 1 1.025751
## map 1.039769 1 1.019691
## GVIF Df GVIF^(1/(2*Df))
## age 1.012091 1 1.006028
## cholesterol 3.081504 2 1.324923
## gluc 6.865854 2 1.618727
## active 1.001938 1 1.000968
## bmi 1.055058 1 1.027160
## map 1.040310 1 1.019956
## cholesterol:gluc 13.646748 4 1.386368
logitgof(mod1$y, fitted(mod1))
logitgof(mod2$y, fitted(mod2))
##
## Hosmer and Lemeshow test (binary model)
##
## data: mod1$y, fitted(mod1)
## X-squared = 560.4, df = 8, p-value < 2.2e-16
##
##
## Hosmer and Lemeshow test (binary model)
##
## data: mod2$y, fitted(mod2)
## X-squared = 597.28, df = 8, p-value < 2.2e-16
There appears to be multicollinearity for mod2 but not mod1, and both models are a poor fit according to the H-L test. It’s important to keep these things in mind moving forward, as model interpretation should be highly criticized when multicollinearity is present and the model is a poor fit.
I’ll use these two models to make CVD classification predictions and create a confusion matrix.
pred_mod1 <- ifelse(predict.glm(mod1, type = "response") < 0.5, 0, 1)
(cm_1 <- caret::confusionMatrix(as.factor(pred_mod1), train$cardio))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21117 8809
## 1 6318 18279
##
## Accuracy : 0.7226
## 95% CI : (0.7188, 0.7263)
## No Information Rate : 0.5032
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4448
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7697
## Specificity : 0.6748
## Pos Pred Value : 0.7056
## Neg Pred Value : 0.7431
## Prevalence : 0.5032
## Detection Rate : 0.3873
## Detection Prevalence : 0.5489
## Balanced Accuracy : 0.7223
##
## 'Positive' Class : 0
##
pred_mod2 <- ifelse(predict.glm(mod2, type = "response") < 0.5, 0, 1)
(cm_2 <- caret::confusionMatrix(as.factor(pred_mod2), train$cardio))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21174 8863
## 1 6261 18225
##
## Accuracy : 0.7226
## 95% CI : (0.7188, 0.7264)
## No Information Rate : 0.5032
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4449
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7718
## Specificity : 0.6728
## Pos Pred Value : 0.7049
## Neg Pred Value : 0.7443
## Prevalence : 0.5032
## Detection Rate : 0.3883
## Detection Prevalence : 0.5509
## Balanced Accuracy : 0.7223
##
## 'Positive' Class : 0
##
par(mfrow = c(1,2))
# MODEL 1
fourfoldplot(cm_1$table, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "Accuracy: 72.2%")
# MODEL 2
fourfoldplot(cm_2$table, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "Accuracy: 72.2%")
par(mfrow = c(1,1))
# MODEL 1
plot(rocit(predict(mod1, train, type = "response"), train$cardio))
# MODEL 2
plot(rocit(predict(mod2, train, type = "response"), train$cardio))
Time for the moment of truth. Given that the models were very similar in terms of both sensitivity and specificity, I’ll use mod1 to employ on the test set. This seems like a good choice, given that it’s also a slightly more parsimonious model and multicollinearity was not observed. First I need to process the test set in the same way the training set was processed.
# PROCESS TEST SET
# CHANGING NUMERIC LABELS TO CATEGORIES
test$cardio <- as.factor(test$cardio)
test$gender <- factor(test$gender, 1:2, c("Female", "Male"))
test$smoke <- factor(test$smoke, 0:1, c("No", "Yes"))
test$active <- factor(test$active, 0:1, c("No", "Yes"))
test$alco <- factor(test$alco, 0:1, c("No", "Yes"))
test$cholesterol <-
factor(test$cholesterol, 1:3, c("Normal", "Above Normal", "Well Above Normal"))
test$gluc <-
factor(test$gluc, 1:3, c("Normal", "Above Normal", "Well Above Normal"))
# FEATURE ENGINEERING
test$bmi <- round((test$weight / (test$height / 100)^2), 0)
test$map <- round(((2*test$ap_lo) + test$ap_hi) / 3, 0)
# CHANGING SCALES
test$age <- round(test$age / 365, 0)
test$height <- round(test$height / 2.54, 0)
test$weight <- round(test$weight * 2.2, 0)
test_pred <- ifelse(predict.glm(mod1, test, type = "response") < 0.5, 0, 1)
(test_cm <- caret::confusionMatrix(as.factor(test_pred), test$cardio))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5355 2284
## 1 1661 4700
##
## Accuracy : 0.7182
## 95% CI : (0.7107, 0.7257)
## No Information Rate : 0.5011
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4363
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7633
## Specificity : 0.6730
## Pos Pred Value : 0.7010
## Neg Pred Value : 0.7389
## Prevalence : 0.5011
## Detection Rate : 0.3825
## Detection Prevalence : 0.5456
## Balanced Accuracy : 0.7181
##
## 'Positive' Class : 0
##
fourfoldplot(test_cm$table, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "Accuracy: 71.8%")
The accuracy is almost as good as the accuracy on the training set, with very similar sensitivity and specificity.
Hmm, not sure what’s going on with the predicted probability of glucose levels.. definitely worth investigating.
grid.arrange(nrow = 1,
plot(effect("cholesterol", mod1),
ylab = "Probability of CVD",
xlab = "Cholesterol Levels",
main = "",
lines = list(col = "black")),
plot(effect("gluc", mod1),
ylab = "Probability of CVD",
xlab = "Glucose Levels",
main = "",
lines = list(col = "black"))
)