1 Introduction: Cardiovascular Disease Dataset

This project covers exploratory data analysis, feature engineering, and the application of three predictive models for cardiovascular disease classification. The dataset was obtained from Kaggle.

The dataset consists of 70,000 patient records, with 11 attributes and 1 target variable (cardio).

Features:

Description | Category | Variable | Type / Values
Age | Objective Feature | age | int (days)
Height | Objective Feature | height | int (cm)
Weight | Objective Feature | weight | float (kg)
Gender | Objective Feature | gender | categorical code
Systolic blood pressure | Examination Feature | ap_hi | int
Diastolic blood pressure | Examination Feature | ap_lo | int
Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal
Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal
Smoking | Subjective Feature | smoke | binary
Alcohol intake | Subjective Feature | alco | binary
Physical activity | Subjective Feature | active | binary
Presence or absence of cardiovascular disease | Target Variable | cardio | binary

library(readr)
library(tidyverse)
library(ggplot2)
library(DescTools)
library(caret)

2 EDA, Feature Engineering, Removing Outliers

cdata <- read.csv("cardio.csv", sep=';') # the file is semicolon-separated, so we pass sep=';' when reading it into a data frame
head(cdata)
##   id   age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active
## 1  0 18393      2    168     62   110    80           1    1     0    0      1
## 2  1 20228      1    156     85   140    90           3    1     0    0      1
## 3  2 18857      1    165     64   130    70           3    1     0    0      0
## 4  3 17623      2    169     82   150   100           1    1     0    0      1
## 5  4 17474      1    156     56   100    60           1    1     0    0      0
## 6  8 21914      1    151     67   120    80           2    2     0    0      0
##   cardio
## 1      0
## 2      1
## 3      1
## 4      1
## 5      0
## 6      0
str(cdata)
## 'data.frame':    70000 obs. of  13 variables:
##  $ id         : int  0 1 2 3 4 8 9 12 13 14 ...
##  $ age        : int  18393 20228 18857 17623 17474 21914 22113 22584 17668 19834 ...
##  $ gender     : int  2 1 1 2 1 1 1 2 1 1 ...
##  $ height     : int  168 156 165 169 156 151 157 178 158 164 ...
##  $ weight     : num  62 85 64 82 56 67 93 95 71 68 ...
##  $ ap_hi      : int  110 140 130 150 100 120 130 130 110 110 ...
##  $ ap_lo      : int  80 90 70 100 60 80 80 90 70 60 ...
##  $ cholesterol: int  1 3 3 1 1 2 3 3 1 1 ...
##  $ gluc       : int  1 1 1 1 1 2 1 3 1 1 ...
##  $ smoke      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ alco       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ active     : int  1 1 0 1 0 0 1 1 1 0 ...
##  $ cardio     : int  0 1 1 1 0 0 0 1 0 0 ...
Abstract(cdata) # no missing values
## ------------------------------------------------------------------------------ 
## cdata
## 
## data frame:  70000 obs. of  13 variables
##      70000 complete cases (100.0%)
## 
##   Nr  ColName      Class    NAs  Levels
##   1   id           integer  .          
##   2   age          integer  .          
##   3   gender       integer  .          
##   4   height       integer  .          
##   5   weight       numeric  .          
##   6   ap_hi        integer  .          
##   7   ap_lo        integer  .          
##   8   cholesterol  integer  .          
##   9   gluc         integer  .          
##   10  smoke        integer  .          
##   11  alco         integer  .          
##   12  active       integer  .          
##   13  cardio       integer  .

2.1 Feature Engineering

# dropping id column
cdata <- cdata %>% select(-id)

All of our predictor variables are numerical (including the ordinal ones), which is what we want them to be for this analysis.

  1. Gender is binary but coded as 1 (woman) and 2 (man); we recode it to 0 (man) and 1 (woman).
cdata$gender <- ifelse(cdata$gender == 2, 0, 1)
  2. Patient age is given in days, so we convert it to years, rounded to 1 decimal place.
cdata$age <- round(cdata$age / 365.0, 1)
cdata$age <- as.numeric(cdata$age)
  3. Instead of using height and weight separately, we combine them into BMI (weight in kg divided by height in metres squared).
cdata$height <- cdata$height / 100  # convert height from cm to metres
cdata$bmi <- round(cdata$weight / cdata$height^2, 2)  # BMI = weight (kg) / height (m)^2
# removing height and weight
cdata <- subset(cdata, select=-c(height, weight))
head(cdata)
##    age gender ap_hi ap_lo cholesterol gluc smoke alco active cardio   bmi
## 1 50.4      0   110    80           1    1     0    0      1      0 21.97
## 2 55.4      1   140    90           3    1     0    0      1      1 34.93
## 3 51.7      1   130    70           3    1     0    0      0      1 23.51
## 4 48.3      0   150   100           1    1     0    0      1      1 28.71
## 5 47.9      1   100    60           1    1     0    0      0      0 23.01
## 6 60.0      1   120    80           2    2     0    0      0      0 29.38

2.2 Exploring the Predictor Variables

  1. Alcohol intake - subjective feature
alcohol <- cdata %>%
  group_by(alco) %>%
  summarise(count=n())
alcohol
## # A tibble: 2 × 2
##    alco count
##   <int> <int>
## 1     0 66236
## 2     1  3764
  2. Smoking - subjective feature
smoking <- cdata %>%
  group_by(smoke) %>%
  summarise(count=n())
smoking
## # A tibble: 2 × 2
##   smoke count
##   <int> <int>
## 1     0 63831
## 2     1  6169
  3. Physical Activity - subjective feature
exercise <- cdata %>%
  group_by(active) %>%
  summarise(count=n())
exercise
## # A tibble: 2 × 2
##   active count
##    <int> <int>
## 1      0 13739
## 2      1 56261
  4. Cholesterol - examination feature - 1: normal, 2: above normal, 3: well above normal
chol <- cdata %>%
  group_by(cholesterol) %>%
  summarise(count=n())
chol
## # A tibble: 3 × 2
##   cholesterol count
##         <int> <int>
## 1           1 52385
## 2           2  9549
## 3           3  8066
  5. Glucose - examination feature - 1: normal, 2: above normal, 3: well above normal
glucose <- cdata %>%
  group_by(gluc) %>%
  summarise(count=n())
glucose
## # A tibble: 3 × 2
##    gluc count
##   <int> <int>
## 1     1 59479
## 2     2  5190
## 3     3  5331
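The three binary subjective features can also be summarised in one step as prevalences, as a compact cross-check of the counts above (a sketch; across() assumes the dplyr version bundled with a recent tidyverse):

# share of patients reporting smoking, alcohol intake and physical activity
cdata %>%
  summarise(across(c(smoke, alco, active), mean))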

2.3 Distributions & Removing Outliers

  1. Age
boxplot(cdata$age, main = "Box Plot of Age", ylab = "Age", col = "lightblue", border = "blue")

Q1a <- quantile(cdata$age, 0.25)
Q3a <- quantile(cdata$age, 0.75)
IQRa <- Q3a - Q1a
cdata <- cdata[cdata$age >= Q1a - 1.5 * IQRa & cdata$age <= Q3a + 1.5 * IQRa, ]
boxplot(cdata$age, main = "Box Plot of Age After Outliers", ylab = "Age", col = "lightblue", border = "blue")

  2. Systolic Blood Pressure - ap_hi
boxplot(cdata$ap_hi, main = "Box Plot of Systolic Blood Pressure", ylab = "ap_hi", col = "lightblue", border = "blue")

Q1h <- quantile(cdata$ap_hi, 0.25)
Q3h <- quantile(cdata$ap_hi, 0.75)
IQRh <- Q3h - Q1h
cdata <- cdata[cdata$ap_hi >= Q1h - 1.5 * IQRh & cdata$ap_hi <= Q3h + 1.5 * IQRh, ]
boxplot(cdata$ap_hi, main = "Box Plot of Systolic Blood Pressure After Removing Outliers", ylab = "ap_hi", col = "lightblue", border = "blue")

  3. Diastolic Blood Pressure - ap_lo
boxplot(cdata$ap_lo, main = "Box Plot of Diastolic Blood Pressure", ylab = "ap_lo", col = "lightblue", border = "blue")

Q1l <- quantile(cdata$ap_lo, 0.25)
Q3l <- quantile(cdata$ap_lo, 0.75)
IQRl <- Q3l - Q1l
cdata <- cdata[cdata$ap_lo >= Q1l - 1.5 * IQRl & cdata$ap_lo <= Q3l + 1.5 * IQRl, ]
boxplot(cdata$ap_lo, main = "Box Plot of Diastolic Blood Pressure After Removing Outliers", ylab = "ap_lo", col = "lightblue", border = "blue")

  4. BMI
boxplot(cdata$bmi, main = "Box Plot of BMI", ylab = "bmi", col = "lightblue", border = "blue")

Q1b <- quantile(cdata$bmi, 0.25)
Q3b <- quantile(cdata$bmi, 0.75)
IQRb <- Q3b - Q1b
cdata <- cdata[cdata$bmi >= Q1b - 1.5 * IQRb  & cdata$bmi <= Q3b + 1.5 * IQRb, ]
boxplot(cdata$bmi, main = "Box Plot of BMI After Removing Outliers", ylab = "bmi", col = "lightblue", border = "blue")

dim(cdata)
## [1] 62642    11
# The dataset now has 62,642 observations
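The same 1.5 × IQR filter is applied four times above; a small helper function could remove the repetition (a sketch, not part of the original code):

# keep only rows whose value in column `col` lies within 1.5 * IQR of the quartiles
filter_iqr <- function(df, col, k = 1.5) {
  q <- quantile(df[[col]], c(0.25, 0.75))
  iqr <- q[2] - q[1]
  df[df[[col]] >= q[1] - k * iqr & df[[col]] <= q[2] + k * iqr, ]
}
# equivalent to the four manual steps above:
# for (v in c("age", "ap_hi", "ap_lo", "bmi")) cdata <- filter_iqr(cdata, v)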

2.4 Target Variable - Cardio

cdis <- cdata %>%
  group_by(cardio) %>%
  summarise(count=n())
cdis
## # A tibble: 2 × 2
##   cardio count
##    <int> <int>
## 1      0 31751
## 2      1 30891

The two classes are well balanced.
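The exact proportions can be checked with a one-liner (output not shown in the original report):

round(prop.table(table(cdata$cardio)), 3)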

ggplot(cdis, aes(x = factor(cardio, labels = c("No", "Yes")), y = count)) +
  geom_col(fill = c("lightcoral", "darkred")) +
  geom_text(aes(label = count), position = position_stack(vjust = 0.5), size = 4, color = "white") +
  labs(title = "Presence of Cardiovascular Disease", x = "Cardiovascular Disease", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5))

2.5 Correlation & Feature Selection

library(corrplot)
## corrplot 0.92 loaded
cor_matrix <- cor(cdata)
cor_matrix
##                      age       gender        ap_hi         ap_lo cholesterol
## age          1.000000000  0.030612293  0.202837888  0.1454028837 0.151245475
## gender       0.030612293  1.000000000 -0.052757750 -0.0567983832 0.035261486
## ap_hi        0.202837888 -0.052757750  1.000000000  0.7050659231 0.191687087
## ap_lo        0.145402884 -0.056798383  0.705065923  1.0000000000 0.154803451
## cholesterol  0.151245475  0.035261486  0.191687087  0.1548034510 1.000000000
## gluc         0.096749804  0.021499748  0.084052541  0.0641076228 0.450900512
## smoke       -0.049032745 -0.336786526  0.025541236  0.0245914847 0.010178120
## alco        -0.029464573 -0.171068211  0.030049471  0.0340535312 0.032094209
## active      -0.009197974 -0.005868253  0.003060102  0.0005716486 0.009579959
## cardio       0.234931954 -0.002835732  0.432060170  0.3357314688 0.217905747
## bmi          0.103889606  0.089952844  0.241169212  0.2118325666 0.162060686
##                     gluc        smoke         alco        active       cardio
## age          0.096749804 -0.049032745 -0.029464573 -0.0091979742  0.234931954
## gender       0.021499748 -0.336786526 -0.171068211 -0.0058682533 -0.002835732
## ap_hi        0.084052541  0.025541236  0.030049471  0.0030601018  0.432060170
## ap_lo        0.064107623  0.024591485  0.034053531  0.0005716486  0.335731469
## cholesterol  0.450900512  0.010178120  0.032094209  0.0095799594  0.217905747
## gluc         1.000000000 -0.007255069  0.005080104 -0.0067185467  0.085498122
## smoke       -0.007255069  1.000000000  0.344831756  0.0245634346 -0.018291237
## alco         0.005080104  0.344831756  1.000000000  0.0258486789 -0.010989216
## active      -0.006718547  0.024563435  0.025848679  1.0000000000 -0.036227554
## cardio       0.085498122 -0.018291237 -0.010989216 -0.0362275543  1.000000000
## bmi          0.105082326 -0.026160811  0.019221735 -0.0069647215  0.178538543
##                      bmi
## age          0.103889606
## gender       0.089952844
## ap_hi        0.241169212
## ap_lo        0.211832567
## cholesterol  0.162060686
## gluc         0.105082326
## smoke       -0.026160811
## alco         0.019221735
## active      -0.006964721
## cardio       0.178538543
## bmi          1.000000000
corrplot(cor_matrix, method = "color")

The variables most correlated with cardiovascular disease, and therefore the strongest single predictors in this dataset, are systolic and diastolic blood pressure.

For this project, we will be using logistic regression, kNN and decision trees. For the first two methods, multicollinearity (strong correlation between two or more predictor variables) poses a problem: highly correlated features can make coefficient estimates unstable and effectively double-count the same information.
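caret, which is already loaded, also offers a programmatic check for this; a sketch using findCorrelation() with an illustrative cut-off of 0.7 would be:

# columns that findCorrelation() recommends dropping at a pairwise correlation cut-off of 0.7
pred_cor <- cor(subset(cdata, select = -cardio))
findCorrelation(pred_cor, cutoff = 0.7, names = TRUE)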

Systolic and diastolic blood pressure are strongly correlated (r ≈ 0.71), so we keep only one of them. According to a study published by the NIH, systolic blood pressure is the better indicator of cardiovascular risk, so we keep ap_hi and drop ap_lo.

cd <- subset(cdata, select=-c(ap_lo))

The rest of the variables do not exhibit extremely high correlation that could interfere with our models.

3 CLASSIFICATION METHODS

3.1 Training & Testing Sets

set.seed(444)
part <- createDataPartition(y = cd$cardio, p = 0.80, list = FALSE)
train <- cd[part, ]
test <- cd[-part, ]

3.2 METHOD 1: Logistic Regression

Logistic regression is a widely used tool for binary classification, where the goal is to predict one of two possible outcomes. Unlike linear regression, which predicts continuous values, logistic regression models the probability that an input belongs to a particular class. This makes it particularly useful in fields such as medicine, for example in disease diagnosis.
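Formally, the model estimates the probability of disease as a logistic transformation of a linear combination of the predictors:

$$P(\text{cardio} = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}$$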

# Fit a logistic regression model
logmod <- glm(cardio ~ ., data = train, family = binomial)
summary(logmod)
## 
## Call:
## glm(formula = cardio ~ ., family = binomial, data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.249e+01  1.453e-01 -85.948  < 2e-16 ***
## age          5.086e-02  1.580e-03  32.182  < 2e-16 ***
## gender      -8.072e-03  2.271e-02  -0.355 0.722292    
## ap_hi        6.784e-02  9.009e-04  75.299  < 2e-16 ***
## cholesterol  5.063e-01  1.837e-02  27.558  < 2e-16 ***
## gluc        -1.308e-01  2.078e-02  -6.298 3.02e-10 ***
## smoke       -1.452e-01  4.072e-02  -3.566 0.000362 ***
## alco        -2.562e-01  4.975e-02  -5.150 2.60e-07 ***
## active      -2.375e-01  2.566e-02  -9.257  < 2e-16 ***
## bmi          3.396e-02  2.455e-03  13.830  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 69466  on 50113  degrees of freedom
## Residual deviance: 56359  on 50104  degrees of freedom
## AIC: 56379
## 
## Number of Fisher Scoring iterations: 4
# binding the intercept and coefficients into a dataframe
intercept <- summary(logmod)$coefficients[1, "Estimate"]
coefficients <- summary(logmod)$coefficients[-1, "Estimate"]
coef_df <- data.frame(Variable = names(coefficients), Coefficient = coefficients)
coef_df <- rbind(data.frame(Variable = "(Intercept)", Coefficient = intercept), coef_df)
coef_df
##                Variable   Coefficient
## 1           (Intercept) -12.489608182
## age                 age   0.050857239
## gender           gender  -0.008071702
## ap_hi             ap_hi   0.067839897
## cholesterol cholesterol   0.506293305
## gluc               gluc  -0.130848485
## smoke             smoke  -0.145200565
## alco               alco  -0.256207587
## active           active  -0.237549326
## bmi                 bmi   0.033958580
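Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret (a quick extra step, not in the original output):

# exponentiated coefficients = odds ratios (the first element is the intercept term)
exp(coef(logmod))

For example, each additional cholesterol level multiplies the estimated odds of disease by roughly exp(0.506) ≈ 1.66.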
# Make predictions on the test dataset using cut-off point of 0.5
predicted_probs <- predict(logmod, newdata = test, type = "response")
predicted_classes <- ifelse(predicted_probs >= 0.5, 1, 0)

# Create a confusion matrix
logconf <- confusionMatrix(as.factor(test$cardio), as.factor(predicted_classes), positive = "1")
logconf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5042 1357
##          1 2142 3987
##                                           
##                Accuracy : 0.7207          
##                  95% CI : (0.7128, 0.7285)
##     No Information Rate : 0.5734          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4396          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7461          
##             Specificity : 0.7018          
##          Pos Pred Value : 0.6505          
##          Neg Pred Value : 0.7879          
##              Prevalence : 0.4266          
##          Detection Rate : 0.3182          
##    Detection Prevalence : 0.4892          
##       Balanced Accuracy : 0.7240          
##                                           
##        'Positive' Class : 1               
## 

The model got only about 72% of its predictions correct, which leaves room for improvement. It reports a sensitivity of 74.6% and a specificity of 70.2%. About 17% of the cases were false positives (Type I errors) and 10.8% were false negatives (Type II errors).

Next, we will calculate the ROC curve and AUC for our model. ROC stands for "Receiver Operating Characteristic", a graphical representation of a binary classifier's performance across different discrimination thresholds (cut-off points). It plots the True Positive Rate (sensitivity) on the y-axis against the False Positive Rate (1 − specificity) on the x-axis.

AUC stands for “Area Under the ROC Curve”, and it’s a single value that summarizes the overall performance of a binary classification model.

# ROC curve and AUC
# ROC shows the performance of a classification model at all classification thresholds
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc_obj <- roc(test$cardio, predicted_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Print the ROC curve
plot(roc_obj, main = "ROC Curve")
abline(a = 0, b = 1, lty = 2)  # Add a diagonal reference line

# Calculate and print the AUC
auc_value <- auc(roc_obj)
cat("AUC:", auc_value, "\n") # 0.7837477
## AUC: 0.7837477
# the model has some discriminatory power but it's not perfect
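If a data-driven cut-off were preferred over the default 0.5, pROC can report the threshold that maximises Youden's J statistic (sensitivity + specificity − 1); this is a sketch rather than part of the original analysis:

# threshold maximising sensitivity + specificity - 1 on the test set
coords(roc_obj, "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))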

3.3 METHOD 2: k-Nearest Neighbors

k-NN is a simple algorithm used for both classification and regression tasks. It’s based on the idea that similar data points tend to be close to each other in a feature space. It makes predictions by finding the K nearest data points to a new, unseen data point and then aggregating their labels. In short, ‘K’ stands for the number of neighbors to consider when making a prediction.

library(class)
# Identifying a ‘best guess’ value of k (the square root of the number of training observations)
ceiling(sqrt(nrow(train)))
## [1] 224
# 224 is even, so we pick a nearby odd value (223 or 225) to avoid ties in the majority vote
knn.pred1 <- knn(train = train[ , names(train) != "cardio"], # drop the target variable from the predictors
                test = test[ , names(test) != "cardio"], 
                cl = train$cardio, 
                k = 223)

knn_conf1 <- confusionMatrix(as.factor(knn.pred1), as.factor(test$cardio), positive = "1")
knn_conf1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5731  464
##          1  668 5665
##                                           
##                Accuracy : 0.9096          
##                  95% CI : (0.9045, 0.9146)
##     No Information Rate : 0.5108          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8193          
##                                           
##  Mcnemar's Test P-Value : 1.604e-09       
##                                           
##             Sensitivity : 0.9243          
##             Specificity : 0.8956          
##          Pos Pred Value : 0.8945          
##          Neg Pred Value : 0.9251          
##              Prevalence : 0.4892          
##          Detection Rate : 0.4522          
##    Detection Prevalence : 0.5055          
##       Balanced Accuracy : 0.9100          
##                                           
##        'Positive' Class : 1               
## 
# using 225
knn.pred2 <- knn(train = train[ , names(train) != "cardio"], # drop the target variable from the predictors
                test = test[ , names(test) != "cardio"], 
                cl = train$cardio, 
                k = 225)

knn_conf2 <- confusionMatrix(as.factor(knn.pred2), as.factor(test$cardio), positive = "1")
knn_conf2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5733  464
##          1  666 5665
##                                           
##                Accuracy : 0.9098          
##                  95% CI : (0.9047, 0.9148)
##     No Information Rate : 0.5108          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8196          
##                                           
##  Mcnemar's Test P-Value : 2.24e-09        
##                                           
##             Sensitivity : 0.9243          
##             Specificity : 0.8959          
##          Pos Pred Value : 0.8948          
##          Neg Pred Value : 0.9251          
##              Prevalence : 0.4892          
##          Detection Rate : 0.4522          
##    Detection Prevalence : 0.5053          
##       Balanced Accuracy : 0.9101          
##                                           
##        'Positive' Class : 1               
## 

The differences between the two models are very slight, but k = 225 gives a marginally higher accuracy (90.98% vs. 90.96%). The algorithm detected 92.4% of the positive cases and 89.6% of the negative cases; roughly 5.3% of the test cases were false positives (Type I errors) and 3.7% were false negatives (Type II errors).

3.4 METHOD 3: Decision Trees

A decision tree is a hierarchical structure consisting of nodes. The top node is called the “root,” and it represents the initial decision. The intermediate nodes are called “internal nodes,” and they contain decision criteria. The bottom nodes are called “leaves” or “terminal nodes,” and they provide the final prediction or decision. At each internal node of the tree, a decision is made based on a specific feature (attribute) from the dataset. This decision splits the data into subsets, sending them down different branches of the tree. The disadvantages of decision trees include overfitting and instability. They tend to perform well with balanced datasets like ours.

Decision trees are insensitive to multicollinearity, so we will use the original dataset, which still includes ap_lo.

library(rpart)
library(rpart.plot)

set.seed(444)
part <- createDataPartition(y = cdata$cardio, p = 0.80, list = FALSE)
ttrain <- cdata[part, ]
ttest <- cdata[-part, ]

Building the model:

set.seed(831)
c.rpart <- rpart(formula = cardio ~ ., data = ttrain, method = "class")
c.rpart
## n= 50114 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 50114 24762 0 (0.5058866 0.4941134)  
##   2) ap_hi< 129.5 30103  9696 0 (0.6779059 0.3220941) *
##   3) ap_hi>=129.5 20011  4945 1 (0.2471141 0.7528859) *

Variable Importance:

c.rpart$variable.importance
##       ap_hi       ap_lo cholesterol         bmi         age        gluc 
##  4461.53487  2595.40890   486.70884   404.88468   164.09423    80.48644

Visualizing the rpart object:

prp(x = c.rpart, extra = 2)

The tree stops after a single split at the root node on ap_hi, the variable with the highest importance: with rpart's default complexity parameter (cp = 0.01), no further split improves the fit enough to be kept.
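If a deeper tree were desired, the complexity parameter could be lowered and the tree pruned back using the cross-validated error; this is a sketch under non-default settings, not the model used above:

# grow a deeper tree by relaxing cp, then prune at the cp with the lowest cross-validated error
deep.rpart <- rpart(cardio ~ ., data = ttrain, method = "class",
                    control = rpart.control(cp = 0.001))
printcp(deep.rpart)  # cross-validated error (xerror) for each candidate cp
best_cp <- deep.rpart$cptable[which.min(deep.rpart$cptable[, "xerror"]), "CP"]
pruned.rpart <- prune(deep.rpart, cp = best_cp)
prp(x = pruned.rpart, extra = 2)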

Predictions using the test dataset:

preds <- as.factor(predict(c.rpart, newdata = ttest, type='class'))
testf <- as.factor(ttest$cardio)

c.conf <- confusionMatrix(preds, testf, positive='1')
c.conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5183 2422
##          1 1216 3707
##                                           
##                Accuracy : 0.7096          
##                  95% CI : (0.7016, 0.7175)
##     No Information Rate : 0.5108          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4165          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6048          
##             Specificity : 0.8100          
##          Pos Pred Value : 0.7530          
##          Neg Pred Value : 0.6815          
##              Prevalence : 0.4892          
##          Detection Rate : 0.2959          
##    Detection Prevalence : 0.3930          
##       Balanced Accuracy : 0.7074          
##                                           
##        'Positive' Class : 1               
## 

This model has an accuracy of about 71%, which is not great. It detected only 60.5% of the positive cases while correctly identifying 81% of the negative cases, leaving far more positive patients undetected than either of the other two models.

4 Conclusions

The model with the highest predictive power was k-Nearest Neighbors, with an accuracy of almost 91%. The models could likely be improved further with hyperparameter tuning, but that was not feasible here due to limited computational resources.
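As a sketch of what that tuning could look like, caret's train() can cross-validate k on standardised predictors (the fold count and the grid of k values below are illustrative choices, not from the original analysis):

# 5-fold cross-validation over a grid of odd k values, with predictors centred and scaled
knn_train <- train
knn_train$cardio <- factor(knn_train$cardio, labels = c("No", "Yes"))
ctrl <- trainControl(method = "cv", number = 5)
knn_tuned <- train(cardio ~ ., data = knn_train, method = "knn",
                   preProcess = c("center", "scale"),
                   tuneGrid = data.frame(k = seq(5, 245, by = 30)),
                   trControl = ctrl)
knn_tuned$bestTune  # the value of k with the best cross-validated accuracy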