In this case study, I will try to predict whether heart disease patients in a hospital will predict heart disease or not based on the categories of several supporting variables. The algorithm that I will use is to use logistic regression and k-nearest neighbor which is included in supervised learning.
library(dplyr)
library(gtools)
library(caret)
library(ggplot2)
library(class)
library(tidyr)heart <- read.csv("heart_2020_cleaned.csv")
head(heart)Description : HeartDisease: Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI). BMI: Body Mass Index (BMI). Smoking: Have you smoked at least 100 cigarettes in your entire life? AlcoholDrinking: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week Stroke: (Ever told) (you had) a stroke? PhysicalHealth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? (0-30 days). MentalHealth: Thinking about your mental health, for how many days during the past 30 days was your mental health not good? (0-30 days). DiffWalking: Do you have serious difficulty walking or climbing stairs? Sex: Are you male or female? AgeCategory: Fourteen-level age category. (then calculated the mean) Race: Imputed race/ethnicity value. Diabetic: (Ever told) (you had) diabetes? PhysicalActivity: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job. GenHealth: Would you say that in general your health is… SleepTime: On average, how many hours of sleep do you get in a 24-hour period? Asthma: (Ever told) (you had) asthma? KidneyDisease: Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease? SkinCancer: (Ever told) (you had) skin cancer?
str(heart)## 'data.frame': 319795 obs. of 18 variables:
## $ HeartDisease : chr "No" "No" "No" "No" ...
## $ BMI : num 16.6 20.3 26.6 24.2 23.7 ...
## $ Smoking : chr "Yes" "No" "Yes" "No" ...
## $ AlcoholDrinking : chr "No" "No" "No" "No" ...
## $ Stroke : chr "No" "Yes" "No" "No" ...
## $ PhysicalHealth : num 3 0 20 0 28 6 15 5 0 0 ...
## $ MentalHealth : num 30 0 30 0 0 0 0 0 0 0 ...
## $ DiffWalking : chr "No" "No" "No" "No" ...
## $ Sex : chr "Female" "Female" "Male" "Female" ...
## $ AgeCategory : chr "55-59" "80 or older" "65-69" "75-79" ...
## $ Race : chr "White" "White" "White" "White" ...
## $ Diabetic : chr "Yes" "No" "Yes" "No" ...
## $ PhysicalActivity: chr "Yes" "Yes" "Yes" "No" ...
## $ GenHealth : chr "Very good" "Very good" "Fair" "Good" ...
## $ SleepTime : num 5 7 8 6 8 12 4 9 5 10 ...
## $ Asthma : chr "Yes" "No" "Yes" "No" ...
## $ KidneyDisease : chr "No" "No" "No" "No" ...
## $ SkinCancer : chr "Yes" "No" "No" "Yes" ...
Change data type to factor(because it belongs to the category) -> HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, AgeCategory, Race, Diabetic, PhysicalActivity, GenHealth, Asthma, KidneyDisease, SkinCancer
heart <- heart %>%
mutate_if(is.character, as.factor)
str(heart)## 'data.frame': 319795 obs. of 18 variables:
## $ HeartDisease : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ BMI : num 16.6 20.3 26.6 24.2 23.7 ...
## $ Smoking : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 2 1 1 ...
## $ AlcoholDrinking : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Stroke : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ PhysicalHealth : num 3 0 20 0 28 6 15 5 0 0 ...
## $ MentalHealth : num 30 0 30 0 0 0 0 0 0 0 ...
## $ DiffWalking : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 1 2 1 2 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 1 2 1 1 1 1 1 1 2 ...
## $ AgeCategory : Factor w/ 13 levels "18-24","25-29",..: 8 13 10 12 5 12 11 13 13 10 ...
## $ Race : Factor w/ 6 levels "American Indian/Alaskan Native",..: 6 6 6 6 6 3 6 6 6 6 ...
## $ Diabetic : Factor w/ 4 levels "No","No, borderline diabetes",..: 3 1 3 1 1 1 1 3 2 1 ...
## $ PhysicalActivity: Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 2 1 1 2 ...
## $ GenHealth : Factor w/ 5 levels "Excellent","Fair",..: 5 5 2 3 5 2 2 3 2 3 ...
## $ SleepTime : num 5 7 8 6 8 12 4 9 5 10 ...
## $ Asthma : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 2 1 1 ...
## $ KidneyDisease : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 2 1 ...
## $ SkinCancer : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 2 1 1 1 ...
Good, now the data type is correct
colSums(is.na(heart))## HeartDisease BMI Smoking AlcoholDrinking
## 0 0 0 0
## Stroke PhysicalHealth MentalHealth DiffWalking
## 0 0 0 0
## Sex AgeCategory Race Diabetic
## 0 0 0 0
## PhysicalActivity GenHealth SleepTime Asthma
## 0 0 0 0
## KidneyDisease SkinCancer
## 0 0
Good, we don’t have missing values
Next, we divide the data into train data and test data. Train data is used to train the model you want to use. Then the test data is used to test the model used in the data train
library(rsample)
RNGkind(sample.kind="Rounding")
set.seed(100)
index <- sample(nrow(heart), size = nrow(heart)*0.8)
heart_train <- heart[index,]
heart_test <- heart[-index,]prop.table(table(heart_train$HeartDisease))##
## No Yes
## 0.91363608 0.08636392
RNGkind(sample.kind = "Rounding")
set.seed(100)
heart_train_balance <- upSample(x = heart_train %>%
select(-HeartDisease),
y = heart_train$HeartDisease,
yname = "HeartDisease")
table(heart_train_balance$HeartDisease)##
## No Yes
## 233741 233741
Good, now the proportion of target variables in the data train is balanced
We will do the modeling using logistic regression
model_heart1 <- glm(HeartDisease~., data = heart_train_balance, family="binomial")summary(model_heart1)##
## Call:
## glm(formula = HeartDisease ~ ., family = "binomial", data = heart_train_balance)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.12785 -0.77775 -0.03002 0.80677 2.89334
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.9147554 0.0492635 -79.466 < 2e-16 ***
## BMI 0.0102424 0.0006128 16.713 < 2e-16 ***
## SmokingYes 0.3766601 0.0074723 50.408 < 2e-16 ***
## AlcoholDrinkingYes -0.2482033 0.0161782 -15.342 < 2e-16 ***
## StrokeYes 1.2493232 0.0157347 79.399 < 2e-16 ***
## PhysicalHealth 0.0065735 0.0005011 13.119 < 2e-16 ***
## MentalHealth 0.0077066 0.0004871 15.823 < 2e-16 ***
## DiffWalkingYes 0.2292409 0.0103155 22.223 < 2e-16 ***
## SexMale 0.7427594 0.0075959 97.784 < 2e-16 ***
## AgeCategory25-29 0.0645424 0.0435182 1.483 0.1380
## AgeCategory30-34 0.2963549 0.0400590 7.398 1.38e-13 ***
## AgeCategory35-39 0.4665007 0.0380227 12.269 < 2e-16 ***
## AgeCategory40-44 0.8299386 0.0359920 23.059 < 2e-16 ***
## AgeCategory45-49 1.0568693 0.0349009 30.282 < 2e-16 ***
## AgeCategory50-54 1.5313967 0.0334189 45.824 < 2e-16 ***
## AgeCategory55-59 1.7933823 0.0326930 54.855 < 2e-16 ***
## AgeCategory60-64 2.1081518 0.0322857 65.297 < 2e-16 ***
## AgeCategory65-69 2.3308238 0.0322026 72.380 < 2e-16 ***
## AgeCategory70-74 2.6749217 0.0322812 82.863 < 2e-16 ***
## AgeCategory75-79 2.8972291 0.0328750 88.129 < 2e-16 ***
## AgeCategory80 or older 3.1959081 0.0327605 97.554 < 2e-16 ***
## RaceAsian -0.5135930 0.0411494 -12.481 < 2e-16 ***
## RaceBlack -0.3480615 0.0308863 -11.269 < 2e-16 ***
## RaceHispanic -0.2506542 0.0311479 -8.047 8.47e-16 ***
## RaceOther -0.0733927 0.0341677 -2.148 0.0317 *
## RaceWhite -0.1406597 0.0278449 -5.052 4.38e-07 ***
## DiabeticNo, borderline diabetes 0.1354803 0.0223245 6.069 1.29e-09 ***
## DiabeticYes 0.5053012 0.0096236 52.506 < 2e-16 ***
## DiabeticYes (during pregnancy) 0.2009695 0.0485297 4.141 3.46e-05 ***
## PhysicalActivityYes 0.0034033 0.0087040 0.391 0.6958
## GenHealthFair 1.5571769 0.0155801 99.947 < 2e-16 ***
## GenHealthGood 1.0744250 0.0129301 83.095 < 2e-16 ***
## GenHealthPoor 1.8852860 0.0224097 84.128 < 2e-16 ***
## GenHealthVery good 0.4934495 0.0129408 38.131 < 2e-16 ***
## SleepTime -0.0252031 0.0023724 -10.623 < 2e-16 ***
## AsthmaYes 0.3116445 0.0105316 29.591 < 2e-16 ***
## KidneyDiseaseYes 0.5975351 0.0159261 37.519 < 2e-16 ***
## SkinCancerYes 0.1147830 0.0110262 10.410 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 648068 on 467481 degrees of freedom
## Residual deviance: 461023 on 467444 degrees of freedom
## AIC: 461099
##
## Number of Fisher Scoring iterations: 5
heart_test$prediction <- predict(model_heart1, newdata = heart_test, type = "response")After that, we label the prediction results and change the data type to factor
heart_test$prediction_label <- ifelse(heart_test$prediction > 0.5, "Yes", "No")
heart_test <- heart_test %>% mutate(prediction_label = as.factor(prediction_label))heart_test[c("prediction","prediction_label")]In the prediction above, the yes label indicates that you have heart disease while no does not have heart disease
We will evaluate the model using confusion matrix
confusionMatrix(data = heart_test$prediction_label, reference = heart_test$HeartDisease, positive = "Yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 43859 1131
## Yes 14822 4147
##
## Accuracy : 0.7506
## 95% CI : (0.7472, 0.7539)
## No Information Rate : 0.9175
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2445
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.78571
## Specificity : 0.74741
## Pos Pred Value : 0.21862
## Neg Pred Value : 0.97486
## Prevalence : 0.08252
## Detection Rate : 0.06484
## Detection Prevalence : 0.29658
## Balanced Accuracy : 0.76656
##
## 'Positive' Class : Yes
##
Description : Accuracy = 75 % Sensitivity / Recall = 78,5 % -> 79 % Pos Pred Value / Precision = 21,8 % -> 22 % Specificity = 74,7 % -> 75 %
First, we split the data into train data and test data
heart_x_train <- heart_train_balance %>%
select_if(is.numeric)
heart_y_train <- heart_train_balance %>%
select(HeartDisease)
heart_x_test <- heart_test %>%
select_if(is.numeric)
heart_y_test <- heart_test %>%
select(HeartDisease)After that we do the scaling on the data
heart_x_train_scaled <- scale(heart_x_train)
heart_x_test_scaled <- scale(heart_x_test %>% select(-prediction),
center = attr(heart_x_train_scaled, "scaled:center"),
scale = attr(heart_x_train_scaled, "scaled:scale"))Then we do the selection of the value of k
sqrt(nrow(heart_x_train))## [1] 683.7266
Because the amount of data is too much, then the result of the square of the data is also large and will result in an error if it is selected as k because it is too large. therefore we use the function knn1() in this data.
predict_knn <- knn1(train = heart_x_train_scaled,
test = heart_x_test_scaled,
cl = heart_y_train$HeartDisease)confusionMatrix(data = predict_knn,reference = heart_y_test$HeartDisease, positive = "Yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 46308 3901
## Yes 12373 1377
##
## Accuracy : 0.7456
## 95% CI : (0.7422, 0.7489)
## No Information Rate : 0.9175
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0289
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.26089
## Specificity : 0.78915
## Pos Pred Value : 0.10015
## Neg Pred Value : 0.92230
## Prevalence : 0.08252
## Detection Rate : 0.02153
## Detection Prevalence : 0.21498
## Balanced Accuracy : 0.52502
##
## 'Positive' Class : Yes
##
In this heart disease case study, we will prioritize recall metrics over accuracy. We prioritize recall metrics because we have to anticipate people who actually have heart disease but are predicted to be healthy. From the model that has been made, the logistic regression model is the best model because it has greater sensitivity/recall, which is 75% compared to the KNN model, which is 26%.