1 Intro

In this case study, I will try to predict whether heart disease patients in a hospital will predict heart disease or not based on the categories of several supporting variables. The algorithm that I will use is to use logistic regression and k-nearest neighbor which is included in supervised learning.

1.1 Load Library & Read Data

library(dplyr)
library(gtools)
library(caret)
library(ggplot2)
library(class)
library(tidyr)
heart <- read.csv("heart_2020_cleaned.csv")
head(heart)

Description : HeartDisease: Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI). BMI: Body Mass Index (BMI). Smoking: Have you smoked at least 100 cigarettes in your entire life? AlcoholDrinking: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week Stroke: (Ever told) (you had) a stroke? PhysicalHealth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? (0-30 days). MentalHealth: Thinking about your mental health, for how many days during the past 30 days was your mental health not good? (0-30 days). DiffWalking: Do you have serious difficulty walking or climbing stairs? Sex: Are you male or female? AgeCategory: Fourteen-level age category. (then calculated the mean) Race: Imputed race/ethnicity value. Diabetic: (Ever told) (you had) diabetes? PhysicalActivity: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job. GenHealth: Would you say that in general your health is… SleepTime: On average, how many hours of sleep do you get in a 24-hour period? Asthma: (Ever told) (you had) asthma? KidneyDisease: Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease? SkinCancer: (Ever told) (you had) skin cancer?

2 Data Manipulation

2.1 Data Type

str(heart)
## 'data.frame':    319795 obs. of  18 variables:
##  $ HeartDisease    : chr  "No" "No" "No" "No" ...
##  $ BMI             : num  16.6 20.3 26.6 24.2 23.7 ...
##  $ Smoking         : chr  "Yes" "No" "Yes" "No" ...
##  $ AlcoholDrinking : chr  "No" "No" "No" "No" ...
##  $ Stroke          : chr  "No" "Yes" "No" "No" ...
##  $ PhysicalHealth  : num  3 0 20 0 28 6 15 5 0 0 ...
##  $ MentalHealth    : num  30 0 30 0 0 0 0 0 0 0 ...
##  $ DiffWalking     : chr  "No" "No" "No" "No" ...
##  $ Sex             : chr  "Female" "Female" "Male" "Female" ...
##  $ AgeCategory     : chr  "55-59" "80 or older" "65-69" "75-79" ...
##  $ Race            : chr  "White" "White" "White" "White" ...
##  $ Diabetic        : chr  "Yes" "No" "Yes" "No" ...
##  $ PhysicalActivity: chr  "Yes" "Yes" "Yes" "No" ...
##  $ GenHealth       : chr  "Very good" "Very good" "Fair" "Good" ...
##  $ SleepTime       : num  5 7 8 6 8 12 4 9 5 10 ...
##  $ Asthma          : chr  "Yes" "No" "Yes" "No" ...
##  $ KidneyDisease   : chr  "No" "No" "No" "No" ...
##  $ SkinCancer      : chr  "Yes" "No" "No" "Yes" ...

Change data type to factor(because it belongs to the category) -> HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, AgeCategory, Race, Diabetic, PhysicalActivity, GenHealth, Asthma, KidneyDisease, SkinCancer

heart <- heart %>% 
  mutate_if(is.character, as.factor)

str(heart)
## 'data.frame':    319795 obs. of  18 variables:
##  $ HeartDisease    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ BMI             : num  16.6 20.3 26.6 24.2 23.7 ...
##  $ Smoking         : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 2 1 1 ...
##  $ AlcoholDrinking : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Stroke          : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ PhysicalHealth  : num  3 0 20 0 28 6 15 5 0 0 ...
##  $ MentalHealth    : num  30 0 30 0 0 0 0 0 0 0 ...
##  $ DiffWalking     : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 1 2 1 2 ...
##  $ Sex             : Factor w/ 2 levels "Female","Male": 1 1 2 1 1 1 1 1 1 2 ...
##  $ AgeCategory     : Factor w/ 13 levels "18-24","25-29",..: 8 13 10 12 5 12 11 13 13 10 ...
##  $ Race            : Factor w/ 6 levels "American Indian/Alaskan Native",..: 6 6 6 6 6 3 6 6 6 6 ...
##  $ Diabetic        : Factor w/ 4 levels "No","No, borderline diabetes",..: 3 1 3 1 1 1 1 3 2 1 ...
##  $ PhysicalActivity: Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 2 1 1 2 ...
##  $ GenHealth       : Factor w/ 5 levels "Excellent","Fair",..: 5 5 2 3 5 2 2 3 2 3 ...
##  $ SleepTime       : num  5 7 8 6 8 12 4 9 5 10 ...
##  $ Asthma          : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 2 1 1 ...
##  $ KidneyDisease   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 2 1 ...
##  $ SkinCancer      : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 2 1 1 1 ...

Good, now the data type is correct

2.2 Check Missing Values

colSums(is.na(heart))
##     HeartDisease              BMI          Smoking  AlcoholDrinking 
##                0                0                0                0 
##           Stroke   PhysicalHealth     MentalHealth      DiffWalking 
##                0                0                0                0 
##              Sex      AgeCategory             Race         Diabetic 
##                0                0                0                0 
## PhysicalActivity        GenHealth        SleepTime           Asthma 
##                0                0                0                0 
##    KidneyDisease       SkinCancer 
##                0                0

Good, we don’t have missing values

3 Cross Validation

Next, we divide the data into train data and test data. Train data is used to train the model you want to use. Then the test data is used to test the model used in the data train

library(rsample)
RNGkind(sample.kind="Rounding")
set.seed(100)

index <- sample(nrow(heart), size = nrow(heart)*0.8)

heart_train <- heart[index,]
heart_test <- heart[-index,]

3.1 Balance Target Variable in Train Data

prop.table(table(heart_train$HeartDisease))
## 
##         No        Yes 
## 0.91363608 0.08636392
RNGkind(sample.kind = "Rounding")
set.seed(100)

heart_train_balance <- upSample(x = heart_train %>% 
                                   select(-HeartDisease),
                                 y = heart_train$HeartDisease,
                                 yname = "HeartDisease")
table(heart_train_balance$HeartDisease)
## 
##     No    Yes 
## 233741 233741

Good, now the proportion of target variables in the data train is balanced

4 Logistic Regression

4.1 Build Model

We will do the modeling using logistic regression

model_heart1 <- glm(HeartDisease~., data = heart_train_balance, family="binomial")
summary(model_heart1)
## 
## Call:
## glm(formula = HeartDisease ~ ., family = "binomial", data = heart_train_balance)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.12785  -0.77775  -0.03002   0.80677   2.89334  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -3.9147554  0.0492635 -79.466  < 2e-16 ***
## BMI                              0.0102424  0.0006128  16.713  < 2e-16 ***
## SmokingYes                       0.3766601  0.0074723  50.408  < 2e-16 ***
## AlcoholDrinkingYes              -0.2482033  0.0161782 -15.342  < 2e-16 ***
## StrokeYes                        1.2493232  0.0157347  79.399  < 2e-16 ***
## PhysicalHealth                   0.0065735  0.0005011  13.119  < 2e-16 ***
## MentalHealth                     0.0077066  0.0004871  15.823  < 2e-16 ***
## DiffWalkingYes                   0.2292409  0.0103155  22.223  < 2e-16 ***
## SexMale                          0.7427594  0.0075959  97.784  < 2e-16 ***
## AgeCategory25-29                 0.0645424  0.0435182   1.483   0.1380    
## AgeCategory30-34                 0.2963549  0.0400590   7.398 1.38e-13 ***
## AgeCategory35-39                 0.4665007  0.0380227  12.269  < 2e-16 ***
## AgeCategory40-44                 0.8299386  0.0359920  23.059  < 2e-16 ***
## AgeCategory45-49                 1.0568693  0.0349009  30.282  < 2e-16 ***
## AgeCategory50-54                 1.5313967  0.0334189  45.824  < 2e-16 ***
## AgeCategory55-59                 1.7933823  0.0326930  54.855  < 2e-16 ***
## AgeCategory60-64                 2.1081518  0.0322857  65.297  < 2e-16 ***
## AgeCategory65-69                 2.3308238  0.0322026  72.380  < 2e-16 ***
## AgeCategory70-74                 2.6749217  0.0322812  82.863  < 2e-16 ***
## AgeCategory75-79                 2.8972291  0.0328750  88.129  < 2e-16 ***
## AgeCategory80 or older           3.1959081  0.0327605  97.554  < 2e-16 ***
## RaceAsian                       -0.5135930  0.0411494 -12.481  < 2e-16 ***
## RaceBlack                       -0.3480615  0.0308863 -11.269  < 2e-16 ***
## RaceHispanic                    -0.2506542  0.0311479  -8.047 8.47e-16 ***
## RaceOther                       -0.0733927  0.0341677  -2.148   0.0317 *  
## RaceWhite                       -0.1406597  0.0278449  -5.052 4.38e-07 ***
## DiabeticNo, borderline diabetes  0.1354803  0.0223245   6.069 1.29e-09 ***
## DiabeticYes                      0.5053012  0.0096236  52.506  < 2e-16 ***
## DiabeticYes (during pregnancy)   0.2009695  0.0485297   4.141 3.46e-05 ***
## PhysicalActivityYes              0.0034033  0.0087040   0.391   0.6958    
## GenHealthFair                    1.5571769  0.0155801  99.947  < 2e-16 ***
## GenHealthGood                    1.0744250  0.0129301  83.095  < 2e-16 ***
## GenHealthPoor                    1.8852860  0.0224097  84.128  < 2e-16 ***
## GenHealthVery good               0.4934495  0.0129408  38.131  < 2e-16 ***
## SleepTime                       -0.0252031  0.0023724 -10.623  < 2e-16 ***
## AsthmaYes                        0.3116445  0.0105316  29.591  < 2e-16 ***
## KidneyDiseaseYes                 0.5975351  0.0159261  37.519  < 2e-16 ***
## SkinCancerYes                    0.1147830  0.0110262  10.410  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 648068  on 467481  degrees of freedom
## Residual deviance: 461023  on 467444  degrees of freedom
## AIC: 461099
## 
## Number of Fisher Scoring iterations: 5

4.2 Predict

heart_test$prediction <- predict(model_heart1, newdata = heart_test, type = "response")

After that, we label the prediction results and change the data type to factor

heart_test$prediction_label <- ifelse(heart_test$prediction > 0.5, "Yes", "No")
heart_test <- heart_test %>% mutate(prediction_label = as.factor(prediction_label))
heart_test[c("prediction","prediction_label")]

In the prediction above, the yes label indicates that you have heart disease while no does not have heart disease

4.3 Model Evaluation

We will evaluate the model using confusion matrix

confusionMatrix(data = heart_test$prediction_label, reference = heart_test$HeartDisease, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  43859  1131
##        Yes 14822  4147
##                                           
##                Accuracy : 0.7506          
##                  95% CI : (0.7472, 0.7539)
##     No Information Rate : 0.9175          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2445          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.78571         
##             Specificity : 0.74741         
##          Pos Pred Value : 0.21862         
##          Neg Pred Value : 0.97486         
##              Prevalence : 0.08252         
##          Detection Rate : 0.06484         
##    Detection Prevalence : 0.29658         
##       Balanced Accuracy : 0.76656         
##                                           
##        'Positive' Class : Yes             
## 

Description : Accuracy = 75 % Sensitivity / Recall = 78,5 % -> 79 % Pos Pred Value / Precision = 21,8 % -> 22 % Specificity = 74,7 % -> 75 %

5 K-Nearest Neighbor

5.1 Cross Validation

First, we split the data into train data and test data

heart_x_train <- heart_train_balance %>% 
  select_if(is.numeric)
heart_y_train <- heart_train_balance %>% 
  select(HeartDisease)
  

heart_x_test <- heart_test %>% 
  select_if(is.numeric)
heart_y_test <- heart_test %>% 
  select(HeartDisease)

5.2 Build Model

After that we do the scaling on the data

heart_x_train_scaled <-  scale(heart_x_train)
heart_x_test_scaled <- scale(heart_x_test %>% select(-prediction),
                             center =  attr(heart_x_train_scaled, "scaled:center"),
                             scale = attr(heart_x_train_scaled, "scaled:scale"))

Then we do the selection of the value of k

sqrt(nrow(heart_x_train))
## [1] 683.7266

Because the amount of data is too much, then the result of the square of the data is also large and will result in an error if it is selected as k because it is too large. therefore we use the function knn1() in this data.

predict_knn <- knn1(train = heart_x_train_scaled,
                test = heart_x_test_scaled,
                cl = heart_y_train$HeartDisease)

5.3 Model Evaluation

confusionMatrix(data = predict_knn,reference = heart_y_test$HeartDisease, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  46308  3901
##        Yes 12373  1377
##                                           
##                Accuracy : 0.7456          
##                  95% CI : (0.7422, 0.7489)
##     No Information Rate : 0.9175          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0289          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.26089         
##             Specificity : 0.78915         
##          Pos Pred Value : 0.10015         
##          Neg Pred Value : 0.92230         
##              Prevalence : 0.08252         
##          Detection Rate : 0.02153         
##    Detection Prevalence : 0.21498         
##       Balanced Accuracy : 0.52502         
##                                           
##        'Positive' Class : Yes             
## 

6 Conclusion

In this heart disease case study, we will prioritize recall metrics over accuracy. We prioritize recall metrics because we have to anticipate people who actually have heart disease but are predicted to be healthy. From the model that has been made, the logistic regression model is the best model because it has greater sensitivity/recall, which is 75% compared to the KNN model, which is 26%.