Introduction

This report deals with regression modelling of patient data to predict occurence of cardiovascular disease. The data is obtained from https://archive.ics.uci.edu/ml/datasets/Heart+Disease

Data : Cardio-Vascular Disease

Data Inspection

We first invoke the required libraries.

library(tidyverse)
library(caret)
library(plotly)
library(ggplot2)
library(data.table)
library(GGally)
library(tidymodels)
library(scales)
library(lmtest)
library(inspectdf) 
library(randomForest)
library(e1071)
library(reactable)
library(car)
library(class)
# library(ggcorrplot)

options(scipen = 100, max.print = 1e+06)

Then download the data.

## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 70000 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## dbl (12): id, age, gender, height, ap_hi, ap_lo, cholesterol, gluc, smoke, a...
## num  (1): weight
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We identify the column names.

colnames(df)
##  [1] "id"          "age"         "gender"      "height"      "weight"     
##  [6] "ap_hi"       "ap_lo"       "cholesterol" "gluc"        "smoke"      
## [11] "alco"        "active"      "cardio"

We check the structure of the dataset.

str(df)
## spc_tbl_ [70,000 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id         : num [1:70000] 0 1 2 3 4 8 9 12 13 14 ...
##  $ age        : num [1:70000] 18393 20228 18857 17623 17474 ...
##  $ gender     : num [1:70000] 2 1 1 2 1 1 1 2 1 1 ...
##  $ height     : num [1:70000] 168 156 165 169 156 151 157 178 158 164 ...
##  $ weight     : num [1:70000] 62 85 64 82 56 67 93 95 71 68 ...
##  $ ap_hi      : num [1:70000] 110 140 130 150 100 120 130 130 110 110 ...
##  $ ap_lo      : num [1:70000] 80 90 70 100 60 80 80 90 70 60 ...
##  $ cholesterol: num [1:70000] 1 3 3 1 1 2 3 3 1 1 ...
##  $ gluc       : num [1:70000] 1 1 1 1 1 2 1 3 1 1 ...
##  $ smoke      : num [1:70000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ alco       : num [1:70000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ active     : num [1:70000] 1 1 0 1 0 0 1 1 1 0 ...
##  $ cardio     : num [1:70000] 0 1 1 1 0 0 0 1 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   age = col_double(),
##   ..   gender = col_double(),
##   ..   height = col_double(),
##   ..   weight = col_number(),
##   ..   ap_hi = col_double(),
##   ..   ap_lo = col_double(),
##   ..   cholesterol = col_double(),
##   ..   gluc = col_double(),
##   ..   smoke = col_double(),
##   ..   alco = col_double(),
##   ..   active = col_double(),
##   ..   cardio = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Data Features

We describe the data features as below:

  1. ‘age’ is an Objective Feature is of data type ‘int’ (days),

  2. ‘height’ is an Objective Feature is of data type ‘int’ (cm),

  3. ‘weight’ is an Objective Feature is of data type ‘float’ (kg),

  4. ‘gender’ is an Objective Feature is of data type ‘categorical’,

  5. ‘ap_hi’ identifies ‘Systolic blood pressure’, is an Examination Feature is of data type ‘int’,

  6. ‘ap_lo’ identifies ‘Diastolic blood pressure’ is an Examination Feature is of data type ‘int’,

  7. ‘cholesterol’ identifies Cholesterol levels, is an Examination Feature, is of data type int (1: normal, 2: above normal, 3: well above normal),

  8. ‘gluc’ identifies Glucose levels, is an Examination Feature, is of data type int (1: normal, 2: above normal, 3: well above normal),

  9. ‘smoke’ is a Subjective Feature is of data type binary,

  10. ‘alco’ identifies Alcohol intake is a Subjective Feature is of data type binary,

  11. ‘active’ signifies levels of Physical activity, is a Subjective Feature and of data type binary.

  12. ‘cardio’ signifies Presence or absence of cardiovascular disease and is of data type binary.

Data Wrangling

We check for missing values.

colSums(is.na(df))
##          id         age      gender      height      weight       ap_hi 
##           0           0           0           0           0           0 
##       ap_lo cholesterol        gluc       smoke        alco      active 
##           0           0           0           0           0           0 
##      cardio 
##           0

The data types of the dataset requires extensive changes namely “num” types to “factor”.

df$cholesterol <- as.factor(df$cholesterol)
df$gluc <- as.factor(df$gluc)
df$smoke<- as.factor(df$smoke)
df$gender <- as.factor(df$gender)
df$alco <- as.factor(df$alco)
df$active <- as.factor(df$active)
df$cardio<- as.factor(df$cardio)

We check that changes have taken effect as required.

str(df)
## spc_tbl_ [70,000 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id         : num [1:70000] 0 1 2 3 4 8 9 12 13 14 ...
##  $ age        : num [1:70000] 18393 20228 18857 17623 17474 ...
##  $ gender     : Factor w/ 2 levels "1","2": 2 1 1 2 1 1 1 2 1 1 ...
##  $ height     : num [1:70000] 168 156 165 169 156 151 157 178 158 164 ...
##  $ weight     : num [1:70000] 62 85 64 82 56 67 93 95 71 68 ...
##  $ ap_hi      : num [1:70000] 110 140 130 150 100 120 130 130 110 110 ...
##  $ ap_lo      : num [1:70000] 80 90 70 100 60 80 80 90 70 60 ...
##  $ cholesterol: Factor w/ 3 levels "1","2","3": 1 3 3 1 1 2 3 3 1 1 ...
##  $ gluc       : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 1 3 1 1 ...
##  $ smoke      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ alco       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ active     : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 2 2 2 1 ...
##  $ cardio     : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   age = col_double(),
##   ..   gender = col_double(),
##   ..   height = col_double(),
##   ..   weight = col_number(),
##   ..   ap_hi = col_double(),
##   ..   ap_lo = col_double(),
##   ..   cholesterol = col_double(),
##   ..   gluc = col_double(),
##   ..   smoke = col_double(),
##   ..   alco = col_double(),
##   ..   active = col_double(),
##   ..   cardio = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

We check for correlation between the predictors.

ggcorr(df, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(df, label = TRUE, label_size = 2.9, hjust = 1, layout.exp
## = 2): data in column(s) 'gender', 'cholesterol', 'gluc', 'smoke', 'alco',
## 'active', 'cardio' are not numeric and were ignored

There is little correlation between the numeric predictors.

Data : Quick Summary

We now compute a summary of the prepared data.

summary(df)
##        id             age        gender        height          weight       
##  Min.   :    0   Min.   :10798   1:45530   Min.   : 55.0   Min.   :  10.00  
##  1st Qu.:25007   1st Qu.:17664   2:24470   1st Qu.:159.0   1st Qu.:  65.00  
##  Median :50002   Median :19703             Median :165.0   Median :  72.00  
##  Mean   :49972   Mean   :19469             Mean   :164.4   Mean   :  76.46  
##  3rd Qu.:74889   3rd Qu.:21327             3rd Qu.:170.0   3rd Qu.:  82.00  
##  Max.   :99999   Max.   :23713             Max.   :250.0   Max.   :7994.00  
##      ap_hi             ap_lo          cholesterol gluc      smoke     alco     
##  Min.   : -150.0   Min.   :  -70.00   1:52385     1:59479   0:63831   0:66236  
##  1st Qu.:  120.0   1st Qu.:   80.00   2: 9549     2: 5190   1: 6169   1: 3764  
##  Median :  120.0   Median :   80.00   3: 8066     3: 5331                      
##  Mean   :  128.8   Mean   :   96.63                                            
##  3rd Qu.:  140.0   3rd Qu.:   90.00                                            
##  Max.   :16020.0   Max.   :11000.00                                            
##  active    cardio   
##  0:13739   0:35021  
##  1:56261   1:34979  
##                     
##                     
##                     
## 

We make the following quick observations:

  1. Age are counted in terms of days.
  2. Gender 1 means male (45530) and 2 means female (24470).
  3. Cholesterol 1 means ‘normal’,2 means ‘above normal’ and 3 means ‘well above normal’.
  4. Gluc 1 means ‘normal’,2 means ‘above normal’ and 3 means ‘well above normal’.
  5. Less than 10% of respondents are smokers and alcoholics. (1 means ‘yes’ and 0 means ‘no’)
  6. About half of the respondents registered cardiac illness.

Data-Splitting

We begin to split the dataset into “train” and “test” sets.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)

samplesize <- round(0.8 * nrow(df), 0)
index <- sample(seq_len(nrow(df)), size = samplesize)

df_train <- df[index, ]
df_test <- df[-index, ]
prop.table(table(df_train$cardio))
## 
##         0         1 
## 0.4994643 0.5005357

The data is fairly well proportioned between the cardio = 0 and cardio =1.

table(df$cardio)
## 
##     0     1 
## 35021 34979

The size of each class is ~35000.

Model : Linear Regression

The Linear Regression (LM) model

LM Model 1

This model implements no predictor variable.

library(e1071)

model_1 <- glm(formula = cardio ~ 1, 
               data = df_train, 
               family = "binomial")

summary(model_1)
## 
## Call:
## glm(formula = cardio ~ 1, family = "binomial", data = df_train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.178  -1.178   1.177   1.177   1.177  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.002143   0.008452   0.254      0.8
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 77632  on 55999  degrees of freedom
## Residual deviance: 77632  on 55999  degrees of freedom
## AIC: 77634
## 
## Number of Fisher Scoring iterations: 3

LM Model 2

This model implements categorical predictor variables ‘gender’, ‘cholesterol’, ‘gluc’, ‘smoke’, ‘alco’ , and ‘active’.

model_2 <- glm(formula = cardio ~ gender + cholesterol + gluc + smoke + alco + active, 
                 data = df_train, 
                 family = "binomial")

summary(model_2)
## 
## Call:
## glm(formula = cardio ~ gender + cholesterol + gluc + smoke + 
##     alco + active, family = "binomial", data = df_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9544  -1.0543   0.5998   1.2498   1.5199  
## 
## Coefficients:
##              Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)  -0.11341    0.02116  -5.361     0.00000008284917 ***
## gender2       0.12513    0.01939   6.454     0.00000000010923 ***
## cholesterol2  0.62446    0.02649  23.577 < 0.0000000000000002 ***
## cholesterol3  1.52970    0.03547  43.129 < 0.0000000000000002 ***
## gluc2         0.20804    0.03499   5.946     0.00000000274561 ***
## gluc3        -0.28066    0.03987  -7.039     0.00000000000194 ***
## smoke1       -0.18034    0.03432  -5.254     0.00000014891516 ***
## alco1        -0.14413    0.04158  -3.466             0.000527 ***
## active1      -0.18335    0.02184  -8.394 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 77632  on 55999  degrees of freedom
## Residual deviance: 74613  on 55991  degrees of freedom
## AIC: 74631
## 
## Number of Fisher Scoring iterations: 4

LM Model 3

This model implements numerical predictor variables “weight”, “age”, “height” , “ap_hi”, and “ap_lo”.

model_3 <- glm(formula = cardio ~ weight + age + height + ap_hi + ap_lo, 
                 data = df_train, 
                 family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_3)
## 
## Call:
## glm(formula = cardio ~ weight + age + height + ap_hi + ap_lo, 
##     family = "binomial", data = df_train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -8.490  -1.013   0.127   1.029   4.694  
## 
## Coefficients:
##                 Estimate   Std. Error z value             Pr(>|z|)    
## (Intercept) -8.503687956  0.217740308 -39.054 < 0.0000000000000002 ***
## weight       0.001984978  0.000286643   6.925     0.00000000000436 ***
## age          0.000163951  0.000003881  42.248 < 0.0000000000000002 ***
## height      -0.001115498  0.001136195  -0.982                0.326    
## ap_hi        0.041973403  0.000653753  64.204 < 0.0000000000000002 ***
## ap_lo        0.000393412  0.000087411   4.501     0.00000677203941 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 77632  on 55999  degrees of freedom
## Residual deviance: 66476  on 55994  degrees of freedom
## AIC: 66488
## 
## Number of Fisher Scoring iterations: 8

The predictor ‘height’ can be discarded as it has no bearing on the predictive ability of the model (Pr(>|z|)>0.05).

Model : LM Predictions

We build prediction set with the regression models.

LM Model 1

pred_1 <- predict(object = model_1, 
                     newdata = df_test, 
                     type = "response")

head(pred_1)
##         1         2         3         4         5         6 
## 0.5005357 0.5005357 0.5005357 0.5005357 0.5005357 0.5005357
pred_label1 <- ifelse(pred_1 > 0.5, yes = 1, no = 0)

pred_label1 <- as.factor(pred_label1)

tail(pred_label1)
## 13995 13996 13997 13998 13999 14000 
##     1     1     1     1     1     1 
## Levels: 1

LM Model 2

pred_2 <- predict(object = model_2, 
                     newdata = df_test, 
                     type = "response")

head(pred_2)
##         1         2         3         4         5         6 
## 0.4571991 0.7743326 0.4571991 0.4571991 0.4263504 0.4263504
pred_label2 <- ifelse(pred_2 > 0.5, yes = 1, no = 0)

pred_label2 <- as.factor(pred_label2)

head(pred_label2)
## 1 2 3 4 5 6 
## 0 1 0 0 0 0 
## Levels: 0 1

LM Model 3

pred_3 <- predict(object = model_3, 
                      newdata = df_test, 
                      type = "response")

head(pred_3)
##         1         2         3         4         5         6 
## 0.2882456 0.6500484 0.4477289 0.4712374 0.3792145 0.4593904
pred_label3 <- ifelse(pred_3 > 0.5, yes = 1, no = 0)

pred_label3 <- as.factor(pred_label3)

head(pred_label3)
## 1 2 3 4 5 6 
## 0 1 0 0 0 0 
## Levels: 0 1

Model Evaluation

We use ConfusionMatrix() to compute accuracy measures of the regression models.

LM Model 1

confusionMatrix(data = pred_label1, 
                reference = df_test$cardio, 
                positive = "1")
## Warning in confusionMatrix.default(data = pred_label1, reference =
## df_test$cardio, : Levels are not in the same order for reference and data.
## Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0    0    0
##          1 7051 6949
##                                              
##                Accuracy : 0.4964             
##                  95% CI : (0.488, 0.5047)    
##     No Information Rate : 0.5036             
##     P-Value [Acc > NIR] : 0.9584             
##                                              
##                   Kappa : 0                  
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 1.0000             
##             Specificity : 0.0000             
##          Pos Pred Value : 0.4964             
##          Neg Pred Value :    NaN             
##              Prevalence : 0.4964             
##          Detection Rate : 0.4964             
##    Detection Prevalence : 1.0000             
##       Balanced Accuracy : 0.5000             
##                                              
##        'Positive' Class : 1                  
## 

LM Model 2

confusionMatrix(data = pred_label2, 
                reference = df_test$cardio, 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5591 4193
##          1 1460 2756
##                                                
##                Accuracy : 0.5962               
##                  95% CI : (0.588, 0.6044)      
##     No Information Rate : 0.5036               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.1901               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.3966               
##             Specificity : 0.7929               
##          Pos Pred Value : 0.6537               
##          Neg Pred Value : 0.5714               
##              Prevalence : 0.4964               
##          Detection Rate : 0.1969               
##    Detection Prevalence : 0.3011               
##       Balanced Accuracy : 0.5948               
##                                                
##        'Positive' Class : 1                    
## 

LM Model 3

confusionMatrix(data = pred_label3, 
                reference = df_test$cardio, 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5203 2291
##          1 1848 4658
##                                                
##                Accuracy : 0.7044               
##                  95% CI : (0.6967, 0.7119)     
##     No Information Rate : 0.5036               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.4084               
##                                                
##  Mcnemar's Test P-Value : 0.000000000006408    
##                                                
##             Sensitivity : 0.6703               
##             Specificity : 0.7379               
##          Pos Pred Value : 0.7160               
##          Neg Pred Value : 0.6943               
##              Prevalence : 0.4964               
##          Detection Rate : 0.3327               
##    Detection Prevalence : 0.4647               
##       Balanced Accuracy : 0.7041               
##                                                
##        'Positive' Class : 1                    
## 

Summary : Linear Regression Model

The regression model “Model 3” gives the best results with an accuracy of 70% with a sensitivity of 67% and specificity of 74%. The next best regression model “Model 2” using categorical values gives an accuracy of 60% with a sensitivity of 40% and a specificity of 79%.

Model : KNN

KNN stands for K-Nearest Neighbour is a classification predictive modelling method. KNN algorithm classifies data points based on how similar they are to their neighboring data points. KNN does not make any assumptions about the dataset and can be used to solve classification and regression problems.

We begin to relabel the split data again.

df_train_x <- df_train[,-c(2,7,8,9,10,11,12)]
df_test_x <- df_test[,-c(2,7,8,9,10,11,12)]

df_train_y <- df_train$cardio
df_test_y <- df_test$cardio

We explore the structure of the training set.

str(df_train_x)
## tibble [56,000 × 6] (S3: tbl_df/tbl/data.frame)
##  $ id    : num [1:56000] 28737 78715 40928 88246 93959 ...
##  $ gender: Factor w/ 2 levels "1","2": 1 1 1 1 1 2 1 2 2 1 ...
##  $ height: num [1:56000] 165 144 167 152 155 170 159 174 167 158 ...
##  $ weight: num [1:56000] 63 87 78 75 62 72 67 85 85 65 ...
##  $ ap_hi : num [1:56000] 120 140 120 140 100 110 110 140 200 135 ...
##  $ cardio: Factor w/ 2 levels "0","1": 1 1 1 2 1 2 2 2 2 2 ...

We change ‘gender’ and ‘cardio’ into numeric type.

df_train_x$gender=as.numeric(df_train_x$gender)
df_train_x$cardio=as.numeric(df_train_x$cardio)

df_test_x$gender=as.numeric(df_test_x$gender)
df_test_x$cardio=as.numeric(df_test_x$cardio)

We perform a scaling process.

df_train_xs <- scale(df_train_x)

df_test_xs <- scale(df_test_x , 
                        center = attr(df_train_xs,"scaled:center"), 
                        scale = attr(df_train_xs,"scaled:scale"))
sqrt(nrow(df_train_xs))
## [1] 236.6432

We build the prediction of outcome set “knn_pred” with knn().

knn_pred <- knn(train = df_train_xs, 
                   test = df_test_xs, 
                   cl = df_train_y, 
                   k = 13)

head(knn_pred)
## [1] 0 0 0 0 0 0
## Levels: 0 1

Evaluation : KNN

confusionMatrix(data = knn_pred, 
                reference = df_test_y, 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7051    3
##          1    0 6946
##                                              
##                Accuracy : 0.9998             
##                  95% CI : (0.9994, 1)        
##     No Information Rate : 0.5036             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.9996             
##                                              
##  Mcnemar's Test P-Value : 0.2482             
##                                              
##             Sensitivity : 0.9996             
##             Specificity : 1.0000             
##          Pos Pred Value : 1.0000             
##          Neg Pred Value : 0.9996             
##              Prevalence : 0.4964             
##          Detection Rate : 0.4961             
##    Detection Prevalence : 0.4961             
##       Balanced Accuracy : 0.9998             
##                                              
##        'Positive' Class : 1                  
## 

Summary : KNN Model

We achieved an accuracy of 99% with a sensitivity of 99% and a specificity of 100%.

Conclusion

Out of the two models - Regression Modelling and K-Nearest Neighbour (KNN)- the KNN model provided the most accurate predictive model that can be built from the cardiovascular disease data with an accuracy of 99%. The utility of the KNN stems from its non parametric nature and its ability to build on classification and regression datasets.