This report deals with regression modelling of patient data to predict occurence of cardiovascular disease. The data is obtained from https://archive.ics.uci.edu/ml/datasets/Heart+Disease
We first invoke the required libraries.
library(tidyverse)
library(caret)
library(plotly)
library(ggplot2)
library(data.table)
library(GGally)
library(tidymodels)
library(scales)
library(lmtest)
library(inspectdf)
library(randomForest)
library(e1071)
library(reactable)
library(car)
library(class)
# library(ggcorrplot)
options(scipen = 100, max.print = 1e+06)
Then download the data.
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 70000 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## dbl (12): id, age, gender, height, ap_hi, ap_lo, cholesterol, gluc, smoke, a...
## num (1): weight
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We identify the column names.
colnames(df)
## [1] "id" "age" "gender" "height" "weight"
## [6] "ap_hi" "ap_lo" "cholesterol" "gluc" "smoke"
## [11] "alco" "active" "cardio"
We check the structure of the dataset.
str(df)
## spc_tbl_ [70,000 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:70000] 0 1 2 3 4 8 9 12 13 14 ...
## $ age : num [1:70000] 18393 20228 18857 17623 17474 ...
## $ gender : num [1:70000] 2 1 1 2 1 1 1 2 1 1 ...
## $ height : num [1:70000] 168 156 165 169 156 151 157 178 158 164 ...
## $ weight : num [1:70000] 62 85 64 82 56 67 93 95 71 68 ...
## $ ap_hi : num [1:70000] 110 140 130 150 100 120 130 130 110 110 ...
## $ ap_lo : num [1:70000] 80 90 70 100 60 80 80 90 70 60 ...
## $ cholesterol: num [1:70000] 1 3 3 1 1 2 3 3 1 1 ...
## $ gluc : num [1:70000] 1 1 1 1 1 2 1 3 1 1 ...
## $ smoke : num [1:70000] 0 0 0 0 0 0 0 0 0 0 ...
## $ alco : num [1:70000] 0 0 0 0 0 0 0 0 0 0 ...
## $ active : num [1:70000] 1 1 0 1 0 0 1 1 1 0 ...
## $ cardio : num [1:70000] 0 1 1 1 0 0 0 1 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. age = col_double(),
## .. gender = col_double(),
## .. height = col_double(),
## .. weight = col_number(),
## .. ap_hi = col_double(),
## .. ap_lo = col_double(),
## .. cholesterol = col_double(),
## .. gluc = col_double(),
## .. smoke = col_double(),
## .. alco = col_double(),
## .. active = col_double(),
## .. cardio = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
We describe the data features as below:
‘age’ is an Objective Feature is of data type ‘int’ (days),
‘height’ is an Objective Feature is of data type ‘int’ (cm),
‘weight’ is an Objective Feature is of data type ‘float’ (kg),
‘gender’ is an Objective Feature is of data type ‘categorical’,
‘ap_hi’ identifies ‘Systolic blood pressure’, is an Examination Feature is of data type ‘int’,
‘ap_lo’ identifies ‘Diastolic blood pressure’ is an Examination Feature is of data type ‘int’,
‘cholesterol’ identifies Cholesterol levels, is an Examination Feature, is of data type int (1: normal, 2: above normal, 3: well above normal),
‘gluc’ identifies Glucose levels, is an Examination Feature, is of data type int (1: normal, 2: above normal, 3: well above normal),
‘smoke’ is a Subjective Feature is of data type binary,
‘alco’ identifies Alcohol intake is a Subjective Feature is of data type binary,
‘active’ signifies levels of Physical activity, is a Subjective Feature and of data type binary.
‘cardio’ signifies Presence or absence of cardiovascular disease and is of data type binary.
We check for missing values.
colSums(is.na(df))
## id age gender height weight ap_hi
## 0 0 0 0 0 0
## ap_lo cholesterol gluc smoke alco active
## 0 0 0 0 0 0
## cardio
## 0
The data types of the dataset requires extensive changes namely “num” types to “factor”.
df$cholesterol <- as.factor(df$cholesterol)
df$gluc <- as.factor(df$gluc)
df$smoke<- as.factor(df$smoke)
df$gender <- as.factor(df$gender)
df$alco <- as.factor(df$alco)
df$active <- as.factor(df$active)
df$cardio<- as.factor(df$cardio)
We check that changes have taken effect as required.
str(df)
## spc_tbl_ [70,000 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:70000] 0 1 2 3 4 8 9 12 13 14 ...
## $ age : num [1:70000] 18393 20228 18857 17623 17474 ...
## $ gender : Factor w/ 2 levels "1","2": 2 1 1 2 1 1 1 2 1 1 ...
## $ height : num [1:70000] 168 156 165 169 156 151 157 178 158 164 ...
## $ weight : num [1:70000] 62 85 64 82 56 67 93 95 71 68 ...
## $ ap_hi : num [1:70000] 110 140 130 150 100 120 130 130 110 110 ...
## $ ap_lo : num [1:70000] 80 90 70 100 60 80 80 90 70 60 ...
## $ cholesterol: Factor w/ 3 levels "1","2","3": 1 3 3 1 1 2 3 3 1 1 ...
## $ gluc : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 1 3 1 1 ...
## $ smoke : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ alco : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ active : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 2 2 2 1 ...
## $ cardio : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. age = col_double(),
## .. gender = col_double(),
## .. height = col_double(),
## .. weight = col_number(),
## .. ap_hi = col_double(),
## .. ap_lo = col_double(),
## .. cholesterol = col_double(),
## .. gluc = col_double(),
## .. smoke = col_double(),
## .. alco = col_double(),
## .. active = col_double(),
## .. cardio = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
We check for correlation between the predictors.
ggcorr(df, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(df, label = TRUE, label_size = 2.9, hjust = 1, layout.exp
## = 2): data in column(s) 'gender', 'cholesterol', 'gluc', 'smoke', 'alco',
## 'active', 'cardio' are not numeric and were ignored
There is little correlation between the numeric predictors.
We now compute a summary of the prepared data.
summary(df)
## id age gender height weight
## Min. : 0 Min. :10798 1:45530 Min. : 55.0 Min. : 10.00
## 1st Qu.:25007 1st Qu.:17664 2:24470 1st Qu.:159.0 1st Qu.: 65.00
## Median :50002 Median :19703 Median :165.0 Median : 72.00
## Mean :49972 Mean :19469 Mean :164.4 Mean : 76.46
## 3rd Qu.:74889 3rd Qu.:21327 3rd Qu.:170.0 3rd Qu.: 82.00
## Max. :99999 Max. :23713 Max. :250.0 Max. :7994.00
## ap_hi ap_lo cholesterol gluc smoke alco
## Min. : -150.0 Min. : -70.00 1:52385 1:59479 0:63831 0:66236
## 1st Qu.: 120.0 1st Qu.: 80.00 2: 9549 2: 5190 1: 6169 1: 3764
## Median : 120.0 Median : 80.00 3: 8066 3: 5331
## Mean : 128.8 Mean : 96.63
## 3rd Qu.: 140.0 3rd Qu.: 90.00
## Max. :16020.0 Max. :11000.00
## active cardio
## 0:13739 0:35021
## 1:56261 1:34979
##
##
##
##
We make the following quick observations:
We begin to split the dataset into “train” and “test” sets.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
samplesize <- round(0.8 * nrow(df), 0)
index <- sample(seq_len(nrow(df)), size = samplesize)
df_train <- df[index, ]
df_test <- df[-index, ]
prop.table(table(df_train$cardio))
##
## 0 1
## 0.4994643 0.5005357
The data is fairly well proportioned between the cardio = 0 and cardio =1.
table(df$cardio)
##
## 0 1
## 35021 34979
The size of each class is ~35000.
The Linear Regression (LM) model
This model implements no predictor variable.
library(e1071)
model_1 <- glm(formula = cardio ~ 1,
data = df_train,
family = "binomial")
summary(model_1)
##
## Call:
## glm(formula = cardio ~ 1, family = "binomial", data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.178 -1.178 1.177 1.177 1.177
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.002143 0.008452 0.254 0.8
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 77632 on 55999 degrees of freedom
## Residual deviance: 77632 on 55999 degrees of freedom
## AIC: 77634
##
## Number of Fisher Scoring iterations: 3
This model implements categorical predictor variables ‘gender’, ‘cholesterol’, ‘gluc’, ‘smoke’, ‘alco’ , and ‘active’.
model_2 <- glm(formula = cardio ~ gender + cholesterol + gluc + smoke + alco + active,
data = df_train,
family = "binomial")
summary(model_2)
##
## Call:
## glm(formula = cardio ~ gender + cholesterol + gluc + smoke +
## alco + active, family = "binomial", data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9544 -1.0543 0.5998 1.2498 1.5199
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.11341 0.02116 -5.361 0.00000008284917 ***
## gender2 0.12513 0.01939 6.454 0.00000000010923 ***
## cholesterol2 0.62446 0.02649 23.577 < 0.0000000000000002 ***
## cholesterol3 1.52970 0.03547 43.129 < 0.0000000000000002 ***
## gluc2 0.20804 0.03499 5.946 0.00000000274561 ***
## gluc3 -0.28066 0.03987 -7.039 0.00000000000194 ***
## smoke1 -0.18034 0.03432 -5.254 0.00000014891516 ***
## alco1 -0.14413 0.04158 -3.466 0.000527 ***
## active1 -0.18335 0.02184 -8.394 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 77632 on 55999 degrees of freedom
## Residual deviance: 74613 on 55991 degrees of freedom
## AIC: 74631
##
## Number of Fisher Scoring iterations: 4
This model implements numerical predictor variables “weight”, “age”, “height” , “ap_hi”, and “ap_lo”.
model_3 <- glm(formula = cardio ~ weight + age + height + ap_hi + ap_lo,
data = df_train,
family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_3)
##
## Call:
## glm(formula = cardio ~ weight + age + height + ap_hi + ap_lo,
## family = "binomial", data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.490 -1.013 0.127 1.029 4.694
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.503687956 0.217740308 -39.054 < 0.0000000000000002 ***
## weight 0.001984978 0.000286643 6.925 0.00000000000436 ***
## age 0.000163951 0.000003881 42.248 < 0.0000000000000002 ***
## height -0.001115498 0.001136195 -0.982 0.326
## ap_hi 0.041973403 0.000653753 64.204 < 0.0000000000000002 ***
## ap_lo 0.000393412 0.000087411 4.501 0.00000677203941 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 77632 on 55999 degrees of freedom
## Residual deviance: 66476 on 55994 degrees of freedom
## AIC: 66488
##
## Number of Fisher Scoring iterations: 8
The predictor ‘height’ can be discarded as it has no bearing on the predictive ability of the model (Pr(>|z|)>0.05).
We build prediction set with the regression models.
pred_1 <- predict(object = model_1,
newdata = df_test,
type = "response")
head(pred_1)
## 1 2 3 4 5 6
## 0.5005357 0.5005357 0.5005357 0.5005357 0.5005357 0.5005357
pred_label1 <- ifelse(pred_1 > 0.5, yes = 1, no = 0)
pred_label1 <- as.factor(pred_label1)
tail(pred_label1)
## 13995 13996 13997 13998 13999 14000
## 1 1 1 1 1 1
## Levels: 1
pred_2 <- predict(object = model_2,
newdata = df_test,
type = "response")
head(pred_2)
## 1 2 3 4 5 6
## 0.4571991 0.7743326 0.4571991 0.4571991 0.4263504 0.4263504
pred_label2 <- ifelse(pred_2 > 0.5, yes = 1, no = 0)
pred_label2 <- as.factor(pred_label2)
head(pred_label2)
## 1 2 3 4 5 6
## 0 1 0 0 0 0
## Levels: 0 1
pred_3 <- predict(object = model_3,
newdata = df_test,
type = "response")
head(pred_3)
## 1 2 3 4 5 6
## 0.2882456 0.6500484 0.4477289 0.4712374 0.3792145 0.4593904
pred_label3 <- ifelse(pred_3 > 0.5, yes = 1, no = 0)
pred_label3 <- as.factor(pred_label3)
head(pred_label3)
## 1 2 3 4 5 6
## 0 1 0 0 0 0
## Levels: 0 1
We use ConfusionMatrix() to compute accuracy measures of the regression models.
confusionMatrix(data = pred_label1,
reference = df_test$cardio,
positive = "1")
## Warning in confusionMatrix.default(data = pred_label1, reference =
## df_test$cardio, : Levels are not in the same order for reference and data.
## Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 7051 6949
##
## Accuracy : 0.4964
## 95% CI : (0.488, 0.5047)
## No Information Rate : 0.5036
## P-Value [Acc > NIR] : 0.9584
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.4964
## Neg Pred Value : NaN
## Prevalence : 0.4964
## Detection Rate : 0.4964
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 1
##
confusionMatrix(data = pred_label2,
reference = df_test$cardio,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5591 4193
## 1 1460 2756
##
## Accuracy : 0.5962
## 95% CI : (0.588, 0.6044)
## No Information Rate : 0.5036
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.1901
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.3966
## Specificity : 0.7929
## Pos Pred Value : 0.6537
## Neg Pred Value : 0.5714
## Prevalence : 0.4964
## Detection Rate : 0.1969
## Detection Prevalence : 0.3011
## Balanced Accuracy : 0.5948
##
## 'Positive' Class : 1
##
confusionMatrix(data = pred_label3,
reference = df_test$cardio,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5203 2291
## 1 1848 4658
##
## Accuracy : 0.7044
## 95% CI : (0.6967, 0.7119)
## No Information Rate : 0.5036
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4084
##
## Mcnemar's Test P-Value : 0.000000000006408
##
## Sensitivity : 0.6703
## Specificity : 0.7379
## Pos Pred Value : 0.7160
## Neg Pred Value : 0.6943
## Prevalence : 0.4964
## Detection Rate : 0.3327
## Detection Prevalence : 0.4647
## Balanced Accuracy : 0.7041
##
## 'Positive' Class : 1
##
The regression model “Model 3” gives the best results with an accuracy of 70% with a sensitivity of 67% and specificity of 74%. The next best regression model “Model 2” using categorical values gives an accuracy of 60% with a sensitivity of 40% and a specificity of 79%.
KNN stands for K-Nearest Neighbour is a classification predictive modelling method. KNN algorithm classifies data points based on how similar they are to their neighboring data points. KNN does not make any assumptions about the dataset and can be used to solve classification and regression problems.
We begin to relabel the split data again.
df_train_x <- df_train[,-c(2,7,8,9,10,11,12)]
df_test_x <- df_test[,-c(2,7,8,9,10,11,12)]
df_train_y <- df_train$cardio
df_test_y <- df_test$cardio
We explore the structure of the training set.
str(df_train_x)
## tibble [56,000 × 6] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:56000] 28737 78715 40928 88246 93959 ...
## $ gender: Factor w/ 2 levels "1","2": 1 1 1 1 1 2 1 2 2 1 ...
## $ height: num [1:56000] 165 144 167 152 155 170 159 174 167 158 ...
## $ weight: num [1:56000] 63 87 78 75 62 72 67 85 85 65 ...
## $ ap_hi : num [1:56000] 120 140 120 140 100 110 110 140 200 135 ...
## $ cardio: Factor w/ 2 levels "0","1": 1 1 1 2 1 2 2 2 2 2 ...
We change ‘gender’ and ‘cardio’ into numeric type.
df_train_x$gender=as.numeric(df_train_x$gender)
df_train_x$cardio=as.numeric(df_train_x$cardio)
df_test_x$gender=as.numeric(df_test_x$gender)
df_test_x$cardio=as.numeric(df_test_x$cardio)
We perform a scaling process.
df_train_xs <- scale(df_train_x)
df_test_xs <- scale(df_test_x ,
center = attr(df_train_xs,"scaled:center"),
scale = attr(df_train_xs,"scaled:scale"))
sqrt(nrow(df_train_xs))
## [1] 236.6432
We build the prediction of outcome set “knn_pred” with knn().
knn_pred <- knn(train = df_train_xs,
test = df_test_xs,
cl = df_train_y,
k = 13)
head(knn_pred)
## [1] 0 0 0 0 0 0
## Levels: 0 1
confusionMatrix(data = knn_pred,
reference = df_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7051 3
## 1 0 6946
##
## Accuracy : 0.9998
## 95% CI : (0.9994, 1)
## No Information Rate : 0.5036
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.9996
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 0.9996
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9996
## Prevalence : 0.4964
## Detection Rate : 0.4961
## Detection Prevalence : 0.4961
## Balanced Accuracy : 0.9998
##
## 'Positive' Class : 1
##
We achieved an accuracy of 99% with a sensitivity of 99% and a specificity of 100%.
Out of the two models - Regression Modelling and K-Nearest Neighbour (KNN)- the KNN model provided the most accurate predictive model that can be built from the cardiovascular disease data with an accuracy of 99%. The utility of the KNN stems from its non parametric nature and its ability to build on classification and regression datasets.