Dataset Description

This dataset contains health survey data from around 300,000 individuals in the US (CDC, 2020). It is used to study which factors—like lifestyle and medical history—are most likely to lead to heart disease. The link from the source: Indicators of Heart Disease.

Variables

Summary of Dataset Variables
Variable Description
HeartDisease Target: Ever had heart disease
BMI Body Mass Index
Smoking Smoked 100+ cigarettes
AlcoholDrinking Heavy drinker status
Stroke History of stroke
PhysicalHealth Bad physical days (last 30)
MentalHealth Bad mental days (last 30)
DiffWalking Difficulty walking
Sex Gender
AgeCategory Age group
Race Ethnicity
Diabetic Diabetes status
PhysicalActivity Recent exercise
GenHealth General health rating
SleepTime Sleep hours/day
Asthma History of asthma
KidneyDisease History of kidney disease
SkinCancer History of skin cancer

The classification of the variables:

  • Dependent Variable: HeartDisease
  • Independent Variable: The rest of the variables (like BMI, Smoking, AgeCategory, etc.)

Dataset upload

#Dataset is loaded. Note that the data set should be in the same location as this rmd file
dataset <- read.csv("heart_2020_cleaned.csv")
dataset$HeartDisease <- as.factor(dataset$HeartDisease)

Preparation

Train and Test subset creation

In this section we will create “train” and “test” subsets

set.seed(936)

#Spliting the dataset
split <- sample.split(dataset$HeartDisease,SplitRatio=0.65)

#Creation of train and displaying the amount
train = subset(dataset,split==TRUE)
nrow(train)
## [1] 207866
#Creation of test and displaying the amount
test = subset(dataset,split==FALSE)
nrow(test)
## [1] 111929

Logistic Regression model creation

HeartLog <- glm(HeartDisease ~ ., data=train,family=binomial)
summary(HeartLog)
## 
## Call:
## glm(formula = HeartDisease ~ ., family = binomial, data = train)
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -6.377908   0.143749 -44.369  < 2e-16 ***
## BMI                              0.008296   0.001421   5.837 5.30e-09 ***
## SmokingYes                       0.365040   0.017834  20.469  < 2e-16 ***
## AlcoholDrinkingYes              -0.219559   0.041588  -5.279 1.30e-07 ***
## StrokeYes                        1.023812   0.028137  36.387  < 2e-16 ***
## PhysicalHealth                   0.003799   0.001065   3.569 0.000359 ***
## MentalHealth                     0.005651   0.001092   5.173 2.30e-07 ***
## DiffWalkingYes                   0.246741   0.022421  11.005  < 2e-16 ***
## SexMale                          0.718773   0.018076  39.765  < 2e-16 ***
## AgeCategory25-29                 0.175032   0.154139   1.136 0.256145    
## AgeCategory30-34                 0.516356   0.139104   3.712 0.000206 ***
## AgeCategory35-39                 0.653792   0.132929   4.918 8.73e-07 ***
## AgeCategory40-44                 0.989534   0.126073   7.849 4.20e-15 ***
## AgeCategory45-49                 1.310078   0.121465  10.786  < 2e-16 ***
## AgeCategory50-54                 1.779564   0.117026  15.207  < 2e-16 ***
## AgeCategory55-59                 2.010254   0.115249  17.443  < 2e-16 ***
## AgeCategory60-64                 2.246983   0.114287  19.661  < 2e-16 ***
## AgeCategory65-69                 2.503246   0.113939  21.970  < 2e-16 ***
## AgeCategory70-74                 2.787636   0.113867  24.481  < 2e-16 ***
## AgeCategory75-79                 2.972605   0.114555  25.949  < 2e-16 ***
## AgeCategory80 or older           3.247133   0.114219  28.429  < 2e-16 ***
## RaceAsian                       -0.495563   0.103956  -4.767 1.87e-06 ***
## RaceBlack                       -0.328249   0.071825  -4.570 4.87e-06 ***
## RaceHispanic                    -0.220821   0.073099  -3.021 0.002521 ** 
## RaceOther                       -0.005139   0.079124  -0.065 0.948215    
## RaceWhite                       -0.051176   0.064179  -0.797 0.425220    
## DiabeticNo, borderline diabetes  0.084732   0.052569   1.612 0.106999    
## DiabeticYes                      0.489951   0.020682  23.690  < 2e-16 ***
## DiabeticYes (during pregnancy)   0.025617   0.134307   0.191 0.848734    
## PhysicalActivityYes              0.035412   0.019983   1.772 0.076369 .  
## GenHealthFair                    1.497764   0.040866  36.651  < 2e-16 ***
## GenHealthGood                    1.037926   0.036786  28.216  < 2e-16 ***
## GenHealthPoor                    1.890887   0.050722  37.280  < 2e-16 ***
## GenHealthVery good               0.473761   0.037729  12.557  < 2e-16 ***
## SleepTime                       -0.024255   0.005374  -4.513 6.39e-06 ***
## AsthmaYes                        0.261537   0.023830  10.975  < 2e-16 ***
## KidneyDiseaseYes                 0.553450   0.030374  18.221  < 2e-16 ***
## SkinCancerYes                    0.106839   0.024249   4.406 1.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 121486  on 207865  degrees of freedom
## Residual deviance:  94254  on 207828  degrees of freedom
## AIC: 94330
## 
## Number of Fisher Scoring iterations: 7

Key findings

The variables that have high statistical significance are:

  • AgeCategory: The older the person, the higher risk of heart disease.
    • coef 0.17 for 25-29.
    • coef 3.24 for those 80 or older.
  • GenHealth: The worse the overall health, the higher risk of heart disease.
    • “Poor” (1.89)
    • “Fair” (1.49)
  • Stroke: The risk is higher for those who had stroke (coef 1.02).
  • Sex: The male population has a higher risk for heart disease (coef 0.71).
  • Race: Asians and Black people are less likely to get heart disease.
  • AlcoholDrinking: thoae who were drinking alcohol had lower ods of hear disease (-0.21).
  • Generally, those who were Diabetic (0.48) and/or Asthma (0.26) and/or Smoking (0.36) and/or KidneyDisease (0.55) and/or SkinCancer (0.10) and/or were sleeping less hours (SleepTime (-0.02)) had higher risk of heart disease.

Prediction

predictTest <-  predict(HeartLog, type='response', newdata=test)
summary(predictTest)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0009019 0.0132206 0.0408723 0.0855506 0.1065490 0.9226723

The above prediction shows:

  • What is the least probability of a person having heart disease (Min).
  • What is the average probability the 1st quarter of people having heart disease (1st Qu.).
  • What is the median probability of people having heart disease (Median).
  • What is the average probability of people having heart disease (Mean).
  • What is the average probability the 3rd quarter of people having heart disease (3rd Qu.).
  • What is the highest probability of a person having heart disease (Min).

Confusion Matrix

temp <- table(test$HeartDisease, predictTest > 0.5)
print(temp)
##      
##        FALSE   TRUE
##   No  101484    864
##   Yes   8549   1032
# 1. Calculate Accuracy
accuracy <- sum(diag(temp)) / sum(temp)
print(accuracy)
## [1] 0.915902
# 2. Calculate Sensitivity (Recall)
# True Positives / (True Positives + False Negatives)
sensitivity <- temp[2,2] / sum(temp[2,])
print(sensitivity)
## [1] 0.1077132
# 3. Calculate Specificity 
# True Negatives / (True Negatives + False Positives)
specificity <- temp[1,1] / sum(temp[1,])
print(specificity)
## [1] 0.9915582
# 4. Calculate Baseline Accuracy
# (Total 'No' cases) / Total
baseline <- sum(temp[1,]) / sum(temp)
print(baseline)
## [1] 0.9144011

Results

  • Accuracy: 91.59%
  • Baseline Accuracy: 91.44%
    • Conclusion: The model barely outperforms a simple “No-heart-disease” guess.
  • Sensitivity: 10.77%
    • Conclusion: Very low. The model misses ~89% of actual heart disease cases.
  • Specificity: 99.15%
    • Conclusion: Excellent. The model almost never incorrectly labels a healthy person as sick.

Model ROCR

ROCRpred <- prediction(predictTest, test$HeartDisease)

Create second group of dataset

#Remove the NA values
train2 = na.omit(train)
test2 = na.omit(test)

#Row count train2
nrow(train2)
## [1] 207866
#Row count test2
nrow(test2)
## [1] 111929

Area Under the Curve (AUC)

HeartLog2 = glm(HeartDisease ~ ., data = train2, family = binomial)
predictTest2 = predict(HeartLog2, type = "response", newdata = test2)

#ROC Curve
ROCRpred2 <- prediction(predictTest2, test2$HeartDisease)
ROCRperf2 <- performance(ROCRpred2, "tpr", "fpr")
plot(ROCRperf2,colorize = TRUE)

# AUC value is:
as.numeric(performance(ROCRpred2, "auc")@y.values)
## [1] 0.8399207