This project covers exploratory data analysis, feature engineering, and the comparison of three predictive models for cardiovascular disease classification. The dataset was taken from Kaggle and consists of 70,000 patient records with 11 attributes and 1 target variable (cardio).
Features:
Age | Objective Feature | age | int (days)
Height | Objective Feature | height | int (cm)
Weight | Objective Feature | weight | float (kg)
Gender | Objective Feature | gender | categorical code
Systolic blood pressure | Examination Feature | ap_hi | int (mmHg)
Diastolic blood pressure | Examination Feature | ap_lo | int (mmHg)
Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal
Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal
Smoking | Subjective Feature | smoke | binary
Alcohol intake | Subjective Feature | alco | binary
Physical activity | Subjective Feature | active | binary
Presence or absence of cardiovascular disease | Target Variable | cardio | binary
library(readr)
library(tidyverse)
library(ggplot2)
library(DescTools)
library(caret)
cdata <- read.csv("cardio.csv", sep=';') # values are separated by semicolons rather than commas, so we set sep accordingly
head(cdata)
## id age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active
## 1 0 18393 2 168 62 110 80 1 1 0 0 1
## 2 1 20228 1 156 85 140 90 3 1 0 0 1
## 3 2 18857 1 165 64 130 70 3 1 0 0 0
## 4 3 17623 2 169 82 150 100 1 1 0 0 1
## 5 4 17474 1 156 56 100 60 1 1 0 0 0
## 6 8 21914 1 151 67 120 80 2 2 0 0 0
## cardio
## 1 0
## 2 1
## 3 1
## 4 1
## 5 0
## 6 0
str(cdata)
## 'data.frame': 70000 obs. of 13 variables:
## $ id : int 0 1 2 3 4 8 9 12 13 14 ...
## $ age : int 18393 20228 18857 17623 17474 21914 22113 22584 17668 19834 ...
## $ gender : int 2 1 1 2 1 1 1 2 1 1 ...
## $ height : int 168 156 165 169 156 151 157 178 158 164 ...
## $ weight : num 62 85 64 82 56 67 93 95 71 68 ...
## $ ap_hi : int 110 140 130 150 100 120 130 130 110 110 ...
## $ ap_lo : int 80 90 70 100 60 80 80 90 70 60 ...
## $ cholesterol: int 1 3 3 1 1 2 3 3 1 1 ...
## $ gluc : int 1 1 1 1 1 2 1 3 1 1 ...
## $ smoke : int 0 0 0 0 0 0 0 0 0 0 ...
## $ alco : int 0 0 0 0 0 0 0 0 0 0 ...
## $ active : int 1 1 0 1 0 0 1 1 1 0 ...
## $ cardio : int 0 1 1 1 0 0 0 1 0 0 ...
Abstract(cdata) # no missing values
## ------------------------------------------------------------------------------
## cdata
##
## data frame: 70000 obs. of 13 variables
## 70000 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 id integer .
## 2 age integer .
## 3 gender integer .
## 4 height integer .
## 5 weight numeric .
## 6 ap_hi integer .
## 7 ap_lo integer .
## 8 cholesterol integer .
## 9 gluc integer .
## 10 smoke integer .
## 11 alco integer .
## 12 active integer .
## 13 cardio integer .
# dropping id column
cdata <- cdata %>% select(-id)
All of the predictor variables are encoded numerically (including the ordinal ones, cholesterol and gluc), which is the form we need for this analysis.
# recode gender to a 0/1 indicator (original coding uses the categories 1 and 2)
cdata$gender <- ifelse(cdata$gender == 2, 0, 1)
# convert age from days to years
cdata$age <- round(cdata$age / 365.0, 1)
cdata$age <- as.numeric(cdata$age)
# convert height from cm to m and compute BMI = weight (kg) / height (m)^2
cdata$height <- cdata$height / 100
cdata$bmi <- round(cdata$weight / cdata$height^2, 2)
# removing height and weight, which are now captured by bmi
cdata <- subset(cdata, select=-c(height, weight))
head(cdata)
## age gender ap_hi ap_lo cholesterol gluc smoke alco active cardio bmi
## 1 50.4 0 110 80 1 1 0 0 1 0 21.97
## 2 55.4 1 140 90 3 1 0 0 1 1 34.93
## 3 51.7 1 130 70 3 1 0 0 0 1 23.51
## 4 48.3 0 150 100 1 1 0 0 1 1 28.71
## 5 47.9 1 100 60 1 1 0 0 0 0 23.01
## 6 60.0 1 120 80 2 2 0 0 0 0 29.38
alcohol <- cdata %>%
  group_by(alco) %>%
  summarise(count = n())
alcohol
## # A tibble: 2 × 2
## alco count
## <int> <int>
## 1 0 66236
## 2 1 3764
smoking <- cdata %>%
  group_by(smoke) %>%
  summarise(count = n())
smoking
## # A tibble: 2 × 2
## smoke count
## <int> <int>
## 1 0 63831
## 2 1 6169
exercise <- cdata %>%
  group_by(active) %>%
  summarise(count = n())
exercise
## # A tibble: 2 × 2
## active count
## <int> <int>
## 1 0 13739
## 2 1 56261
chol <- cdata %>%
  group_by(cholesterol) %>%
  summarise(count = n())
chol
## # A tibble: 3 × 2
## cholesterol count
## <int> <int>
## 1 1 52385
## 2 2 9549
## 3 3 8066
glucose <- cdata %>%
  group_by(gluc) %>%
  summarise(count = n())
glucose
## # A tibble: 3 × 2
## gluc count
## <int> <int>
## 1 1 59479
## 2 2 5190
## 3 3 5331
boxplot(cdata$age, main = "Box Plot of Age", ylab = "Age", col = "lightblue", border = "blue")
# keep only observations within 1.5 * IQR of the quartiles (the same rule is applied to ap_hi, ap_lo and bmi below)
Q1a <- quantile(cdata$age, 0.25)
Q3a <- quantile(cdata$age, 0.75)
IQRa <- Q3a - Q1a
cdata <- cdata[cdata$age >= Q1a - 1.5 * IQRa & cdata$age <= Q3a + 1.5 * IQRa, ]
boxplot(cdata$age, main = "Box Plot of Age After Outliers", ylab = "Age", col = "lightblue", border = "blue")
boxplot(cdata$ap_hi, main = "Box Plot of Systolic Blood Pressure", ylab = "ap_hi", col = "lightblue", border = "blue")
Q1h <- quantile(cdata$ap_hi, 0.25)
Q3h <- quantile(cdata$ap_hi, 0.75)
IQRh <- Q3h - Q1h
cdata <- cdata[cdata$ap_hi >= Q1h - 1.5 * IQRh & cdata$ap_hi <= Q3h + 1.5 * IQRh, ]
boxplot(cdata$ap_hi, main = "Box Plot of Systolic Blood Pressure After Removing Outliers", ylab = "ap_hi", col = "lightblue", border = "blue")
boxplot(cdata$ap_lo, main = "Box Plot of Diastolic Blood Pressure", ylab = "ap_lo", col = "lightblue", border = "blue")
Q1l <- quantile(cdata$ap_lo, 0.25)
Q3l <- quantile(cdata$ap_lo, 0.75)
IQRl <- Q3l - Q1l
cdata <- cdata[cdata$ap_lo >= Q1l - 1.5 * IQRl & cdata$ap_lo <= Q3l + 1.5 * IQRl, ]
boxplot(cdata$ap_lo, main = "Box Plot of Diastolic Blood Pressure After Removing Outliers", ylab = "ap_lo", col = "lightblue", border = "blue")
boxplot(cdata$bmi, main = "Box Plot of BMI", ylab = "bmi", col = "lightblue", border = "blue")
Q1b <- quantile(cdata$bmi, 0.25)
Q3b <- quantile(cdata$bmi, 0.75)
IQRb <- Q3b - Q1b
cdata <- cdata[cdata$bmi >= Q1b - 1.5 * IQRb & cdata$bmi <= Q3b + 1.5 * IQRb, ]
boxplot(cdata$bmi, main = "Box Plot of BMI After Removing Outliers", ylab = "bmi", col = "lightblue", border = "blue")
dim(cdata)
## [1] 62642 11
# After removing outliers, the dataset has 62,642 rows
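The same interquartile-range filter is applied four times above (age, ap_hi, ap_lo, bmi). As a sketch only, not part of the original analysis, the logic could be factored into a small helper; remove_iqr_outliers is a hypothetical name.
# Hypothetical helper: drop rows of df whose value in column `col` falls outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the same rule used in the blocks above)
remove_iqr_outliers <- function(df, col) {
  q <- quantile(df[[col]], c(0.25, 0.75))
  iqr <- q[2] - q[1]
  df[df[[col]] >= q[1] - 1.5 * iqr & df[[col]] <= q[2] + 1.5 * iqr, ]
}
# Equivalent to the code above when applied in the same order:
# for (v in c("age", "ap_hi", "ap_lo", "bmi")) cdata <- remove_iqr_outliers(cdata, v)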
cdis <- cdata %>%
  group_by(cardio) %>%
  summarise(count = n())
cdis
## # A tibble: 2 × 2
## cardio count
## <int> <int>
## 1 0 31751
## 2 1 30891
The two classes are well balanced.
ggplot(cdis, aes(x = factor(cardio, labels = c("No", "Yes")), y = count)) +
  geom_bar(stat = 'identity', fill = c("lightcoral", "darkred")) +
  geom_text(aes(label = count), position = position_stack(vjust = 0.5), size = 4, color = "white") +
  labs(title = "Presence of Cardiovascular Disease", x = "Cardiovascular Disease", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5))
library(corrplot)
## corrplot 0.92 loaded
cor_matrix <- cor(cdata)
cor_matrix
## age gender ap_hi ap_lo cholesterol
## age 1.000000000 0.030612293 0.202837888 0.1454028837 0.151245475
## gender 0.030612293 1.000000000 -0.052757750 -0.0567983832 0.035261486
## ap_hi 0.202837888 -0.052757750 1.000000000 0.7050659231 0.191687087
## ap_lo 0.145402884 -0.056798383 0.705065923 1.0000000000 0.154803451
## cholesterol 0.151245475 0.035261486 0.191687087 0.1548034510 1.000000000
## gluc 0.096749804 0.021499748 0.084052541 0.0641076228 0.450900512
## smoke -0.049032745 -0.336786526 0.025541236 0.0245914847 0.010178120
## alco -0.029464573 -0.171068211 0.030049471 0.0340535312 0.032094209
## active -0.009197974 -0.005868253 0.003060102 0.0005716486 0.009579959
## cardio 0.234931954 -0.002835732 0.432060170 0.3357314688 0.217905747
## bmi 0.103889606 0.089952844 0.241169212 0.2118325666 0.162060686
## gluc smoke alco active cardio
## age 0.096749804 -0.049032745 -0.029464573 -0.0091979742 0.234931954
## gender 0.021499748 -0.336786526 -0.171068211 -0.0058682533 -0.002835732
## ap_hi 0.084052541 0.025541236 0.030049471 0.0030601018 0.432060170
## ap_lo 0.064107623 0.024591485 0.034053531 0.0005716486 0.335731469
## cholesterol 0.450900512 0.010178120 0.032094209 0.0095799594 0.217905747
## gluc 1.000000000 -0.007255069 0.005080104 -0.0067185467 0.085498122
## smoke -0.007255069 1.000000000 0.344831756 0.0245634346 -0.018291237
## alco 0.005080104 0.344831756 1.000000000 0.0258486789 -0.010989216
## active -0.006718547 0.024563435 0.025848679 1.0000000000 -0.036227554
## cardio 0.085498122 -0.018291237 -0.010989216 -0.0362275543 1.000000000
## bmi 0.105082326 -0.026160811 0.019221735 -0.0069647215 0.178538543
## bmi
## age 0.103889606
## gender 0.089952844
## ap_hi 0.241169212
## ap_lo 0.211832567
## cholesterol 0.162060686
## gluc 0.105082326
## smoke -0.026160811
## alco 0.019221735
## active -0.006964721
## cardio 0.178538543
## bmi 1.000000000
corrplot(cor_matrix, method = "color")
The variables most strongly correlated with cardiovascular disease, and therefore the most promising predictors, are systolic blood pressure (r = 0.43) and diastolic blood pressure (r = 0.34), followed by age and cholesterol.
For this project, we will be using logistic regression, k-NN and decision trees. For the first two methods, multicollinearity (strong correlation between two or more predictor variables) poses a problem: highly correlated features carry redundant information, which inflates the variance of logistic-regression coefficients and double-counts the same signal in k-NN's distance calculations.
Systolic and diastolic blood pressure are highly correlated (r ≈ 0.71), so we keep only one of them. According to a study published by the NIH, systolic blood pressure is the better indicator of cardiovascular risk, so we keep ap_hi and drop ap_lo.
cd <- subset(cdata, select=-c(ap_lo))
The remaining variables do not exhibit correlations high enough to interfere with our models.
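As a quick programmatic check (a sketch, not part of the original workflow), caret's findCorrelation can suggest which predictors to drop when pairwise correlation exceeds a chosen cutoff; with the matrix above, only the ap_hi/ap_lo pair crosses 0.7.
# Sketch: list predictor columns caret would suggest dropping at a 0.7 correlation cutoff
pred_cor <- cor(subset(cdata, select = -cardio))
findCorrelation(pred_cor, cutoff = 0.7, names = TRUE)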
set.seed(444)
# 80/20 train/test split
part <- createDataPartition(y = cd$cardio, p = 0.80, list = FALSE)
train <- cd[part, ]
test <- cd[-part, ]
Logistic regression is a powerful tool for binary classification, where the goal is to predict one of two possible outcomes. Unlike linear regression, which predicts continuous values, logistic regression models the probability that an observation belongs to a particular class. This makes it particularly useful in fields such as medicine, for example in disease diagnosis.
# Fit a logistic regression model
logmod <- glm(cardio ~ ., data = train, family = binomial)
summary(logmod)
##
## Call:
## glm(formula = cardio ~ ., family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.249e+01 1.453e-01 -85.948 < 2e-16 ***
## age 5.086e-02 1.580e-03 32.182 < 2e-16 ***
## gender -8.072e-03 2.271e-02 -0.355 0.722292
## ap_hi 6.784e-02 9.009e-04 75.299 < 2e-16 ***
## cholesterol 5.063e-01 1.837e-02 27.558 < 2e-16 ***
## gluc -1.308e-01 2.078e-02 -6.298 3.02e-10 ***
## smoke -1.452e-01 4.072e-02 -3.566 0.000362 ***
## alco -2.562e-01 4.975e-02 -5.150 2.60e-07 ***
## active -2.375e-01 2.566e-02 -9.257 < 2e-16 ***
## bmi 3.396e-02 2.455e-03 13.830 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 69466 on 50113 degrees of freedom
## Residual deviance: 56359 on 50104 degrees of freedom
## AIC: 56379
##
## Number of Fisher Scoring iterations: 4
# binding the intercept and coefficients into a dataframe
intercept <- summary(logmod)$coefficients[1, "Estimate"]
coefficients <- summary(logmod)$coefficients[-1, "Estimate"]
coef_df <- data.frame(Variable = names(coefficients), Coefficient = coefficients)
coef_df <- rbind(data.frame(Variable = "(Intercept)", Coefficient = intercept), coef_df)
coef_df
## Variable Coefficient
## 1 (Intercept) -12.489608182
## age age 0.050857239
## gender gender -0.008071702
## ap_hi ap_hi 0.067839897
## cholesterol cholesterol 0.506293305
## gluc gluc -0.130848485
## smoke smoke -0.145200565
## alco alco -0.256207587
## active active -0.237549326
## bmi bmi 0.033958580
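The coefficients above are on the log-odds scale; exponentiating them gives odds ratios, which are easier to interpret. This short addition is not part of the original analysis.
# Odds ratios: e.g. moving up one cholesterol category multiplies the odds of
# cardiovascular disease by roughly exp(0.506) = 1.66, holding the other predictors fixed.
round(exp(coef(logmod)), 3)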
# Make predictions on the test dataset using a cut-off point of 0.5
predicted_probs <- predict(logmod, newdata = test, type = "response")
predicted_classes <- ifelse(predicted_probs >= 0.5, 1, 0)
# Create a confusion matrix
logconf <- confusionMatrix(as.factor(predicted_classes), as.factor(test$cardio), positive = "1") # predictions first, then the reference (true) labels
logconf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5042 1357
## 1 2142 3987
##
## Accuracy : 0.7207
## 95% CI : (0.7128, 0.7285)
## No Information Rate : 0.5734
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4396
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7461
## Specificity : 0.7018
## Pos Pred Value : 0.6505
## Neg Pred Value : 0.7879
## Prevalence : 0.4266
## Detection Rate : 0.3182
## Detection Prevalence : 0.4892
## Balanced Accuracy : 0.7240
##
## 'Positive' Class : 1
##
The model got about 72% of the test predictions correct, which leaves room for improvement. Working directly from the predicted classes and the true labels, it recovered 65.0% of the positive cases (sensitivity) and 78.8% of the negative cases (specificity); 10.8% of its predictions were false positives (Type I errors) and 17.1% were false negatives (Type II errors).
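To make those percentages reproducible, here is a short sketch (an addition, not in the original code) that counts the misclassifications directly from the predicted classes and the true test labels.
# False positives = predicted disease but healthy; false negatives = predicted healthy but diseased
fp <- sum(predicted_classes == 1 & test$cardio == 0)
fn <- sum(predicted_classes == 0 & test$cardio == 1)
round(100 * c(type_I = fp, type_II = fn) / length(predicted_classes), 1)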
Next, we will calculate the ROC curve and AUC for our model. ROC stands for "Receiver Operating Characteristic", a graphical representation of a binary classifier's performance across different discrimination thresholds (cut-off points). It plots the True Positive Rate (sensitivity) on the y-axis against the False Positive Rate (1 − specificity) on the x-axis.
AUC stands for "Area Under the ROC Curve"; it is a single value that summarizes the overall performance of a binary classifier, ranging from 0.5 (no better than random guessing) to 1 (perfect discrimination).
# ROC curve and AUC
# ROC shows the performance of a classification model at all classification thresholds
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_obj <- roc(test$cardio, predicted_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot the ROC curve
plot(roc_obj, main = "ROC Curve")
abline(a = 0, b = 1, lty = 2) # Add a diagonal reference line
# Calculate and print the AUC
auc_value <- auc(roc_obj)
cat("AUC:", auc_value, "\n") # 0.7837477
## AUC: 0.7837477
# the model has some discriminatory power but it's not perfect
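Because the ROC curve sweeps over all possible cut-off points, it can also be used to pick a threshold other than the default 0.5. The sketch below (an addition, shown on the test data purely for illustration) uses pROC's coords to find the threshold that maximizes Youden's J statistic (sensitivity + specificity − 1).
# Sketch: threshold on the predicted probabilities that maximizes Youden's J
coords(roc_obj, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))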
k-NN is a simple algorithm used for both classification and regression. It is based on the idea that similar observations lie close to each other in feature space: to classify a new, unseen observation, the algorithm finds its k nearest training points and aggregates their labels, here by majority vote. 'k' is simply the number of neighbours considered when making a prediction.
library(class)
# Identifying a ‘best guess’ value of k (the square root of the number of training observations)
ceiling(sqrt(nrow(train)))
## [1] 224
# 224 is even, so we try the neighbouring odd values 223 and 225 (an odd k avoids tied votes in binary classification)
knn.pred1 <- knn(train = train[, names(train) != "cardio"], # drop the target variable by name
                 test = test[, names(test) != "cardio"],
                 cl = train$cardio,
                 k = 223)
knn_conf1 <- confusionMatrix(as.factor(knn.pred1), as.factor(test$cardio), positive = "1")
knn_conf1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5731 464
## 1 668 5665
##
## Accuracy : 0.9096
## 95% CI : (0.9045, 0.9146)
## No Information Rate : 0.5108
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8193
##
## Mcnemar's Test P-Value : 1.604e-09
##
## Sensitivity : 0.9243
## Specificity : 0.8956
## Pos Pred Value : 0.8945
## Neg Pred Value : 0.9251
## Prevalence : 0.4892
## Detection Rate : 0.4522
## Detection Prevalence : 0.5055
## Balanced Accuracy : 0.9100
##
## 'Positive' Class : 1
##
# using k = 225
knn.pred2 <- knn(train = train[, names(train) != "cardio"], # drop the target variable by name
                 test = test[, names(test) != "cardio"],
                 cl = train$cardio,
                 k = 225)
knn_conf2 <- confusionMatrix(as.factor(knn.pred2), as.factor(test$cardio), positive = "1")
knn_conf2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5733 464
## 1 666 5665
##
## Accuracy : 0.9098
## 95% CI : (0.9047, 0.9148)
## No Information Rate : 0.5108
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8196
##
## Mcnemar's Test P-Value : 2.24e-09
##
## Sensitivity : 0.9243
## Specificity : 0.8959
## Pos Pred Value : 0.8948
## Neg Pred Value : 0.9251
## Prevalence : 0.4892
## Detection Rate : 0.4522
## Detection Prevalence : 0.5053
## Balanced Accuracy : 0.9101
##
## 'Positive' Class : 1
##
The differences between the two fits are very slight, but k = 225 gives a marginally higher accuracy (90.98%). This algorithm detected 92.4% of the positive cases and 89.6% of the negative cases; the Type I error rate was 5.3% and the Type II error rate 3.7%.
A decision tree is a hierarchical structure consisting of nodes. The top node is called the “root,” and it represents the initial decision. The intermediate nodes are called “internal nodes,” and they contain decision criteria. The bottom nodes are called “leaves” or “terminal nodes,” and they provide the final prediction or decision. At each internal node of the tree, a decision is made based on a specific feature (attribute) from the dataset. This decision splits the data into subsets, sending them down different branches of the tree. The disadvantages of decision trees include overfitting and instability. They tend to perform well with balanced datasets like ours.
Decision trees are insensitive to multicollinearity, so for this model we go back to the original dataset, which still contains ap_lo.
library(rpart)
library(rpart.plot)
set.seed(444)
part <- createDataPartition(y = cdata$cardio, p = 0.80, list = FALSE)
ttrain <- cdata[part, ]
ttest <- cdata[-part, ]
Building the model:
set.seed(831)
c.rpart <- rpart(formula = cardio ~ ., data = ttrain, method = "class")
c.rpart
## n= 50114
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 50114 24762 0 (0.5058866 0.4941134)
## 2) ap_hi< 129.5 30103 9696 0 (0.6779059 0.3220941) *
## 3) ap_hi>=129.5 20011 4945 1 (0.2471141 0.7528859) *
Variable Importance:
c.rpart$variable.importance
## ap_hi ap_lo cholesterol bmi age gluc
## 4461.53487 2595.40890 486.70884 404.88468 164.09423 80.48644
Visualizing the rpart object:
prp(x = c.rpart, extra = 2)
The tree stops after a single split at the root node on ap_hi, the variable with the highest importance. With rpart's default complexity parameter (cp = 0.01), no further split improved the fit enough to be retained, i.e. the algorithm determined that additional splits would not meaningfully increase the purity of the leaf nodes.
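If a deeper tree is wanted, the stopping rule can be relaxed. The following sketch (not part of the original analysis) regrows the tree with a smaller complexity parameter and prunes it back to the cp value with the lowest cross-validated error.
# Sketch: allow smaller improvements per split, then prune using the cp table
c.rpart2 <- rpart(cardio ~ ., data = ttrain, method = "class",
                  control = rpart.control(cp = 0.001))
best_cp <- c.rpart2$cptable[which.min(c.rpart2$cptable[, "xerror"]), "CP"]
c.pruned <- prune(c.rpart2, cp = best_cp)
prp(x = c.pruned, extra = 2)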
Predictions using the test dataset:
preds <- as.factor(predict(c.rpart, newdata = ttest, type='class'))
testf <- as.factor(ttest$cardio)
c.conf <- confusionMatrix(preds, testf, positive='1')
c.conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5183 2422
## 1 1216 3707
##
## Accuracy : 0.7096
## 95% CI : (0.7016, 0.7175)
## No Information Rate : 0.5108
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4165
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6048
## Specificity : 0.8100
## Pos Pred Value : 0.7530
## Neg Pred Value : 0.6815
## Prevalence : 0.4892
## Detection Rate : 0.2959
## Detection Prevalence : 0.3930
## Balanced Accuracy : 0.7074
##
## 'Positive' Class : 1
##
This model has an accuracy of roughly 71%, which is not great. It detected only 60.5% of the positive cases while correctly classifying 81% of the negative cases, leaving far more positive patients undetected than either of the other two algorithms.
The model with the highest predictive power was k-Nearest Neighbors, with an accuracy of almost 91%. The models could likely be improved further with hyperparameter tuning, but that is not feasible here due to limited computational resources.
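As an illustration of what such tuning could look like (a sketch only, not run for this report), caret's train function can cross-validate k-NN over a grid of k values while centering and scaling the predictors, which matters for a distance-based method.
# Sketch: 5-fold cross-validated search over k with standardized predictors
# (omitted from the report because of the computational cost)
ctrl <- trainControl(method = "cv", number = 5)
knn_tuned <- train(x = train[, names(train) != "cardio"],
                   y = factor(train$cardio, labels = c("No", "Yes")),
                   method = "knn",
                   preProcess = c("center", "scale"),
                   tuneGrid = data.frame(k = seq(51, 351, by = 50)),
                   trControl = ctrl)
knn_tuned$bestTune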