Our group chose the Personal Key Indicators of Heart Disease dataset, which contains mostly categorical and a few numerical variables. In this machine learning project, “HeartDisease” is used as the target variable. The dataset was downloaded from Kaggle.
The original dataset comes from the CDC's 2020 annual survey of 400k adults regarding their health status. Our dataset is a subset of the original, containing only 18 variables (9 booleans, 5 strings, and 4 decimals) and a total of 319,795 records. In this project we use R to do the following:
The R libraries used are:
The entire project is published on RPubs at https://rpubs.com/pravinhere/893585
The following are the variables in our dataset:
To begin, we used the head() function to see what the data looked like and found that a large majority of our variables were health questions with a “Yes” or “No” response. A few variables were health questions with categorical responses, such as sex, age category, race, and general health, and a few were numeric, such as BMI, physical health, mental health, and sleep time. Since these variables all came from a telephone survey, it was critical to check for gaps in the data with the is.na() function. We did not find any NA or missing values, so we moved to the next piece of our cleaning. From here we used the table() function to check for inconsistent values: in survey data, the same category can be entered under different naming conventions, or responses may be left uncategorized. This dataset was already very clean, so we moved on without making adjustments.
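The cleaning code itself is not shown in this report; the following is a minimal sketch of these checks, assuming base R and a hypothetical file name heart_2020_cleaned.csv:
# Load the survey data (file name is an assumption)
heartDataOriginal <- read.csv("heart_2020_cleaned.csv", stringsAsFactors = TRUE)
# First look at the structure of the responses
head(heartDataOriginal)
# Count missing values across the entire data frame (we found none)
sum(is.na(heartDataOriginal))
# Check a categorical column for inconsistent naming conventions
table(heartDataOriginal$Smoking)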
With clean data it is then necessary to understand the dataset qualitatively. We found some open questions within our data, such as “How are people recorded as getting 1 hour or 24 hours of sleep?” (referring to our SleepTime variable). We also noted that the lowest age category starts at 18 years old, so we are not qualified to draw any correlations between minors and heart disease, as we do not have that data. We cannot explain the nature of these unusual data points, but since they come from the CDC, a reliable source, we decided to leave them in; the sheer size of the dataset also naturally smooths out the influence of outliers. Looking further into the data with functions like class(), dim(), summary(), and str(), we can understand the class, dimensions, summary, and structure of the data, respectively. The summary() function in particular gives us the min, max, quartiles, median, and mean of the numeric variables, and lets us see the counts within each categorical grouping. From here we can create dummy variables as factors in order to allow prediction with categorical variables. Dummy variables turn character data like “Yes” or “No” into “1” or “0”, and allow us to run models on a potential respondent to predict whether their collection of responses aligns with or differs from a heart disease diagnosis.
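The recoding code is not shown here; a minimal sketch of the idea using base R (the _dummy naming follows the variable names that appear in the models below):
# Illustrative recoding of "Yes"/"No" answers into 0/1 factors
heartDataOriginal$HeartDisease_dummy <-
  factor(ifelse(heartDataOriginal$HeartDisease == "Yes", 1, 0))
heartDataOriginal$Smoking_dummy <-
  factor(ifelse(heartDataOriginal$Smoking == "Yes", 1, 0))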
In addition, since we want to use this data to find which variables are the best predictors of heart disease, we decided to split our many variables into category groups as follows: bad habits (alcohol, smoking), pre-existing conditions (stroke, diabetic, asthma, kidney disease, skin cancer), demographic data (sex, age category, race), overall health (BMI, difficulty walking), and daily habits (physical activity, sleep time, general health).
Further, the following derived categories were created (see the sketch below):
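The derivation code is not shown; the coefficient names that appear later (e.g. BMI_Category_dummyObese I, SleepTime_dummyOverslept) suggest factors roughly like this sketch. The cut points here are assumptions (standard WHO BMI classes; the sleep breaks are guesses):
# Hypothetical reconstruction of the derived factors; the actual cut points
# and labels are not shown in this report
heartDataOriginal$BMI_Category_dummy <- cut(heartDataOriginal$BMI,
  breaks = c(0, 18.5, 25, 30, 35, 40, Inf),
  labels = c("Underweight", "Normal", "Overweight",
             "Obese I", "Obese II", "Obese III"))
heartDataOriginal$SleepTime_dummy <- cut(heartDataOriginal$SleepTime,
  breaks = c(0, 6, 9, 24),
  labels = c("Lack of Sleep", "Good-quality Sleep", "Overslept"))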
ggplot(heartDataOriginal, aes(x = HeartDisease)) +
geom_bar(fill = "steelblue") +
xlab("Heart Disease") +
ylab("Count in Thousands") +
scale_y_continuous(labels = label_number(suffix = " K", scale = 1e-3)) +
ggtitle("Total Heart Disease Count")
HeartDisease vs Smoking shows a notably higher prevalence of heart disease among people with smoking habits.
ggplot(data = heartDataOriginal, aes(x = Smoking, fill = HeartDisease)) +
geom_bar(position = "dodge") +
ggtitle("HeartDisease vs Smoking")
HeartDisease vs AlcoholDrinking shows a notably higher count of heart disease among people who do not have heavy drinking habits.
ggplot(data = heartDataOriginal, aes(x = AlcoholDrinking, fill = HeartDisease)) +
geom_bar(position = "dodge") +
scale_y_continuous(labels = label_number(suffix = " K", scale = 1e-3)) +
ggtitle("HeartDisease vs AlcoholDrinking")
HeartDisease vs Stroke shows a notably higher prevalence of heart disease among people who have had a stroke.
ggplot(data = heartDataOriginal, aes(x = Stroke, fill = HeartDisease)) +
geom_bar(position = "dodge") +
scale_y_continuous(labels = label_number(suffix = " K", scale = 1e-3)) +
ggtitle("HeartDisease vs Stroke")
Demonstrating that the Stroke variable is an important factor for indicating whether a person has heart disease.
ggplot(heartDataOriginal,
aes(x = Stroke,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("Stroke Vs Percent")
Demonstrating that the DiffWalking variable is an important factor for indicating whether a person has heart disease.
ggplot(heartDataOriginal,
aes(x = DiffWalking,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("DiffWalking Vs Percent")
Demonstrating that men have a higher chance of heart disease.
ggplot(heartDataOriginal,
aes(x = Sex,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("Sex Vs Percent")
HeartDisease vs Age shows a notably higher prevalence of heart disease among older people.
ggplot(data = heartDataOriginal, aes(x = AgeCategory, fill = HeartDisease)) +
geom_bar(position = "dodge") +
ggtitle("Age Vs HeartDisease")
Demonstrating that diabetics have a higher chance of heart disease.
ggplot(heartDataOriginal,
aes(x = Diabetic,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("Diabetic Vs HeartDisease")
Demonstrating that people with no physical activity have a higher chance of heart disease.
ggplot(heartDataOriginal,
aes(x = PhysicalActivity,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("PhysicalActivity Vs HeartDisease")
Demonstrating that people reporting poor or fair general health have a higher chance of heart disease.
ggplot(heartDataOriginal,
aes(x = GenHealth,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("GenHealth Vs HeartDisease")
Demonstrating that people with asthma have a higher chance of heart disease.
ggplot(heartDataOriginal,
aes(x = Asthma,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("Asthma Vs HeartDisease")
Demonstrating that people with kidney disease have a higher chance of heart disease.
ggplot(heartDataOriginal,
aes(x = KidneyDisease,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("KidneyDisease Vs HeartDisease")
Demonstrating that people with skin cancer have a higher chance of heart disease.
ggplot(heartDataOriginal,
aes(x = SkinCancer,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("SkinCancer Vs HeartDisease")
BMI Density Plot.
ggplot(heartDataOriginal, aes(BMI, color = HeartDisease)) +
geom_density(alpha = 0.5) +
ggtitle("BMI Density Plot")
Test Variables
As the target (outcome) variable is binary (HeartDisease = 1 or HeartDisease = 0), the chi-square test was used as the primary method to test variable independence.
H0: The two variables are independent.
H1: The two variables are not independent (they are related).
We will use a 95% confidence level and reject the null hypothesis if the p-value is less than our predetermined significance level of 0.05.
chisq.test(heartDataOriginal$HeartDisease, heartDataOriginal$Stroke, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: heartDataOriginal$HeartDisease and heartDataOriginal$Stroke
## X-squared = 12390, df = 1, p-value < 2.2e-16
chisq.test(heartDataOriginal$Asthma, heartDataOriginal$KidneyDisease, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: heartDataOriginal$Asthma and heartDataOriginal$KidneyDisease
## X-squared = 504.2, df = 1, p-value < 2.2e-16
# chisq.test() compares only two variables at a time; the Race argument in
# the original call was silently ignored, so only Sex vs AgeCategory is tested
chisq.test(heartDataOriginal$Sex, heartDataOriginal$AgeCategory, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: heartDataOriginal$Sex and heartDataOriginal$AgeCategory
## X-squared = 1772.9, df = 12, p-value < 2.2e-16
chisq.test(heartDataOriginal$BMI, heartDataOriginal$DiffWalking, correct=FALSE)
## Warning in chisq.test(heartDataOriginal$BMI, heartDataOriginal$DiffWalking, :
## Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: heartDataOriginal$BMI and heartDataOriginal$DiffWalking
## X-squared = 20833, df = 3603, p-value < 2.2e-16
# As above, only the first two variables are tested; GenHealth was ignored
chisq.test(heartDataOriginal$PhysicalActivity, heartDataOriginal$SleepTime, correct=FALSE)
## Warning in chisq.test(heartDataOriginal$PhysicalActivity,
## heartDataOriginal$SleepTime, : Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: heartDataOriginal$PhysicalActivity and heartDataOriginal$SleepTime
## X-squared = 7807.3, df = 23, p-value < 2.2e-16
Based on the above, we reject the null hypothesis for every pair tested: all p-values are well below 0.05, so the variables are related. (Note the approximation warnings where a numeric variable such as BMI or SleepTime was used; the chi-square test is designed for categorical data, so sparse cells make the approximation unreliable.)
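The same test can be applied systematically; a sketch that runs the chi-square test for each categorical predictor against HeartDisease (the variable list is taken from the dataset description above):
# Run the same independence test for every categorical predictor
cat_vars <- c("Smoking", "AlcoholDrinking", "Stroke", "DiffWalking", "Sex",
              "Diabetic", "PhysicalActivity", "Asthma", "KidneyDisease",
              "SkinCancer")
p_values <- sapply(cat_vars, function(v)
  chisq.test(heartDataOriginal$HeartDisease, heartDataOriginal[[v]],
             correct = FALSE)$p.value)
round(p_values, 4)  # all effectively zero, i.e. below 0.05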
As our target variable is binary (1 for having heart disease and 0 for not having it), we can fit logistic regression models predicting HeartDisease. For each model we follow the same steps: fit the model with glm(), predict probabilities on the test set, build a confusion matrix at a 0.50 cutoff, compute accuracy, and plot the ROC curve to obtain the AUC.
| Confusion matrix | Predicted Negative (0) | Predicted Positive (1) |
|---|---|---|
| Actual Negative (0) | Number of True Negatives | Number of False Positives |
| Actual Positive (1) | Number of False Negatives | Number of True Positives |
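The creation of heartData_train and heartData_test is not shown in this excerpt; the degrees of freedom in the model summaries (223,855) and the 95,939 test records imply a 70/30 split of the 319,795 rows. A minimal sketch, assuming base R sampling (the seed is arbitrary):
# Hypothetical 70/30 train/test split; the actual split code is not shown
set.seed(42)
train_rows <- sample(nrow(heartDataOriginal),
                     size = round(0.7 * nrow(heartDataOriginal)))
heartData_train <- heartDataOriginal[train_rows, ]
heartData_test  <- heartDataOriginal[-train_rows, ]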
Model 1 - Bad habits - Alcohol, Smoking.
mod1 <- glm(HeartDisease_dummy ~ Smoking_dummy + AlcoholDrinking_dummy, family = binomial, data = heartData_train)
summary(mod1)
##
## Call:
## glm(formula = HeartDisease_dummy ~ Smoking_dummy + AlcoholDrinking_dummy,
## family = binomial, data = heartData_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5238 -0.5238 -0.3563 -0.3563 2.6412
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.72532 0.01163 -234.34 <2e-16 ***
## Smoking_dummy1 0.80838 0.01542 52.41 <2e-16 ***
## AlcoholDrinking_dummy1 -0.73148 0.03745 -19.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130776 on 223855 degrees of freedom
## Residual deviance: 127715 on 223853 degrees of freedom
## AIC: 127721
##
## Number of Fisher Scoring iterations: 5
testing_mod1 <- predict(mod1, heartData_test, type = "response")
hist(testing_mod1)
F_hat_mod1 <- as.numeric(testing_mod1 > 0.50)
table(heartData_test$HeartDisease_dummy, F_hat_mod1, dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0
## 0 87715
## 1 8224
mean(heartData_test$HeartDisease_dummy == F_hat_mod1) # 0.9142789 i.e. 91% accuracy
## [1] 0.9142789
roc1 <- roc(heartData_train$HeartDisease_dummy, mod1$fitted.values)
plot.roc(roc1, col='blue', lty=2, lwd = 4)
auc(roc1) # 0.6048
## Area under the curve: 0.6048
Model 2 - Pre-Existing Conditions - Stroke, Diabetic, Asthma, Kidney Disease, Skin Cancer.
mod2 <- glm(HeartDisease_dummy ~ Stroke_dummy + Diabetic_dummy + Asthma_dummy + KidneyDisease_dummy + SkinCancer_dummy, family = binomial, data = heartData_train)
summary(mod2)
##
## Call:
## glm(formula = HeartDisease_dummy ~ Stroke_dummy + Diabetic_dummy +
## Asthma_dummy + KidneyDisease_dummy + SkinCancer_dummy, family = binomial,
## data = heartData_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0378 -0.3677 -0.3230 -0.3230 2.4410
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.92711 0.01083 -270.26 <2e-16 ***
## Stroke_dummy1 1.67014 0.02570 64.99 <2e-16 ***
## Diabetic_dummy1 1.08855 0.01785 61.00 <2e-16 ***
## Asthma_dummy1 0.26700 0.02104 12.69 <2e-16 ***
## KidneyDisease_dummy1 1.09379 0.02792 39.18 <2e-16 ***
## SkinCancer_dummy1 0.74995 0.02163 34.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130776 on 223855 degrees of freedom
## Residual deviance: 118377 on 223850 degrees of freedom
## AIC: 118389
##
## Number of Fisher Scoring iterations: 5
testing_mod2 <- predict(mod2, heartData_test, type = "response")
hist(testing_mod2)
F_hat_mod2 <- as.numeric(testing_mod2 > 0.50)
table(heartData_test$HeartDisease_dummy, F_hat_mod2, dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0 1
## 0 87303 412
## 1 7799 425
mean(heartData_test$HeartDisease_dummy == F_hat_mod2) #0.9144144 i.e. 91% accuracy
## [1] 0.9144144
roc2 <- roc(heartData_train$HeartDisease_dummy, mod2$fitted.values)
plot.roc(roc2, col='red', lty=2, lwd = 4)
auc(roc2) #0.6981
## Area under the curve: 0.6981
Model 3 - Demographic Data - Sex, Age Category, Race.
mod3 <- glm(HeartDisease_dummy ~ Sex_dummy + AgeCategory_dummy + Race_dummy, family = binomial, data = heartData_train)
summary(mod3)
##
## Call:
## glm(formula = HeartDisease_dummy ~ Sex_dummy + AgeCategory_dummy +
## Race_dummy, family = binomial, data = heartData_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7922 -0.4912 -0.3622 -0.1627 3.3640
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.25761 0.05842 -38.642 < 2e-16 ***
## Sex_dummy1 -0.63661 0.01579 -40.308 < 2e-16 ***
## AgeCategory_dummyOld-aged Adults 1.25958 0.01997 63.084 < 2e-16 ***
## AgeCategory_dummyYoung Adults -1.64121 0.04645 -35.332 < 2e-16 ***
## Race_dummyAsian -1.11944 0.09695 -11.547 < 2e-16 ***
## Race_dummyBlack -0.43350 0.06414 -6.759 1.39e-11 ***
## Race_dummyHispanic -0.43400 0.06513 -6.664 2.66e-11 ***
## Race_dummyOther -0.22775 0.07115 -3.201 0.00137 **
## Race_dummyWhite -0.41925 0.05694 -7.363 1.80e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130776 on 223855 degrees of freedom
## Residual deviance: 116902 on 223847 degrees of freedom
## AIC: 116920
##
## Number of Fisher Scoring iterations: 7
testing_mod3 <- predict(mod3, heartData_test, type = "response")
hist(testing_mod3)
F_hat_mod3 <- as.numeric(testing_mod3 > 0.50)
table(heartData_test$HeartDisease_dummy, F_hat_mod3, dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0
## 0 87715
## 1 8224
mean(heartData_test$HeartDisease_dummy == F_hat_mod3) # 0.9142789 i.e. 91% accuracy
## [1] 0.9142789
roc3 <- roc(heartData_train$HeartDisease_dummy, mod3$fitted.values)
plot.roc(roc3, col='green', lty=2, lwd = 4)
auc(roc3) # 0.7408
## Area under the curve: 0.7408
Model 4 - Overall Health - BMI, Difficulty Walking.
mod4 <- glm(HeartDisease_dummy ~ BMI_Category_dummy + DiffWalking_dummy, family = binomial, data = heartData_train)
summary(mod4)
##
## Call:
## glm(formula = HeartDisease_dummy ~ BMI_Category_dummy + DiffWalking_dummy,
## family = binomial, data = heartData_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7514 -0.3849 -0.3753 -0.3281 2.4285
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.84694 0.06520 -43.661 < 2e-16 ***
## BMI_Category_dummyNormal -0.04798 0.06685 -0.718 0.47293
## BMI_Category_dummyOverweight 0.22920 0.06614 3.465 0.00053 ***
## BMI_Category_dummyObese I 0.28134 0.06689 4.206 2.6e-05 ***
## BMI_Category_dummyObese II 0.20490 0.06950 2.948 0.00320 **
## BMI_Category_dummyObese III 0.06888 0.07164 0.962 0.33628
## DiffWalking_dummy1 1.44529 0.01696 85.234 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130776 on 223855 degrees of freedom
## Residual deviance: 123586 on 223849 degrees of freedom
## AIC: 123600
##
## Number of Fisher Scoring iterations: 5
testing_mod4 <- predict(mod4, heartData_test, type = "response")
hist(testing_mod4)
F_hat_mod4 <- as.numeric(testing_mod4 > 0.50)
table(heartData_test$HeartDisease_dummy, F_hat_mod4, dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0
## 0 87715
## 1 8224
mean(heartData_test$HeartDisease_dummy == F_hat_mod4) # 0.9142789 i.e. 91% accuracy
## [1] 0.9142789
roc4 <- roc(heartData_train$HeartDisease_dummy, mod4$fitted.values)
plot.roc(roc4, col='orange', lty=2, lwd = 4)
auc(roc4) #0.6481
## Area under the curve: 0.6481
Model 5 - Daily Habits - Physical Activity, Sleep Time, General Health.
mod5 <- glm(HeartDisease_dummy ~ PhysicalActivity_dummy + SleepTime_dummy + GenHealth_dummy, family = binomial, data = heartData_train)
summary(mod5)
##
## Call:
## glm(formula = HeartDisease_dummy ~ PhysicalActivity_dummy + SleepTime_dummy +
## GenHealth_dummy, family = binomial, data = heartData_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0162 -0.4727 -0.3192 -0.2269 2.7954
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.69504 0.03534 -104.555 < 2e-16 ***
## PhysicalActivity_dummy1 -0.19167 0.01743 -10.995 < 2e-16 ***
## SleepTime_dummyGood-quality Sleep 0.24006 0.01609 14.922 < 2e-16 ***
## SleepTime_dummyOverslept 0.29071 0.05625 5.168 2.37e-07 ***
## GenHealth_dummyFair 2.34453 0.03534 66.345 < 2e-16 ***
## GenHealth_dummyGood 1.55988 0.03370 46.292 < 2e-16 ***
## GenHealth_dummyPoor 3.01250 0.03996 75.390 < 2e-16 ***
## GenHealth_dummyVery good 0.74332 0.03523 21.099 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130776 on 223855 degrees of freedom
## Residual deviance: 117815 on 223848 degrees of freedom
## AIC: 117831
##
## Number of Fisher Scoring iterations: 6
testing_mod5 <- predict(mod5, heartData_test, type = "response")
hist(testing_mod5)
F_hat_mod5 <- as.numeric(testing_mod5 > 0.50)
table(heartData_test$HeartDisease_dummy, F_hat_mod5, dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0
## 0 87715
## 1 8224
mean(heartData_test$HeartDisease_dummy == F_hat_mod5) # 0.9142789 i.e. 91% accuracy
## [1] 0.9142789
roc5 <- roc(heartData_train$HeartDisease_dummy, mod5$fitted.values)
plot.roc(roc5, col='brown', lty=2, lwd = 4)
auc(roc5) #0.732
## Area under the curve: 0.7324
| Model | Accuracy | AUC | Attributes |
|---|---|---|---|
| 1 | 91% | 60% | Bad habits - Alcohol, Smoking |
| 2 | 91% | 70% | Pre-existing conditions - Stroke, Diabetic, Asthma, Kidney Disease, Skin Cancer |
| 3 | 91% | 74% | Demographic data - Sex, Age Category, Race |
| 4 | 91% | 65% | Overall Health - BMI, Difficulty Walking |
| 5 | 91% | 73% | Daily habits - Physical Activity, Sleep Time, General Health |
Note that accuracy is nearly identical across models: at the 0.50 cutoff the models almost never predict the positive class (about 91% of respondents have no heart disease), so accuracy alone cannot separate them and we compare AUC instead. Based on the above table we can reject models 1 and 4, as their AUC is less than 70%.
**The best model is model 3, followed by models 5 and 2.**
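To compare the models visually, the five ROC curves can be overlaid on one plot; a sketch assuming the roc objects from the models above are still in memory (pROC's lines.roc() adds a curve to an existing plot):
plot.roc(roc3, col = 'green', lwd = 2, main = "ROC Comparison")
lines.roc(roc5, col = 'brown', lwd = 2)
lines.roc(roc2, col = 'red', lwd = 2)
lines.roc(roc4, col = 'orange', lwd = 2)
lines.roc(roc1, col = 'blue', lwd = 2)
legend("bottomright",
       legend = c("Model 3 (AUC 0.74)", "Model 5 (AUC 0.73)",
                  "Model 2 (AUC 0.70)", "Model 4 (AUC 0.65)",
                  "Model 1 (AUC 0.60)"),
       col = c("green", "brown", "red", "orange", "blue"), lwd = 2)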
We also introduced a regression tree. This model makes it easy to analyze new data: following the bubbles, you can estimate heart disease risk by answering simple questions about only six variables (Sex, AgeCategory, Race, PhysicalActivity, SleepTime, GenHealth). The blue bubbles with the split criteria are split points (decision nodes), and the orange bubbles are the “leaves”, also called terminal nodes. We decided to “prune” the tree, as too many splits/decision nodes would overfit the model, as shown in the picture to the right.
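The tree object plotted below is not defined in this excerpt; a minimal sketch of how it could have been fit and pruned, assuming the rpart package (the initial cp value is an assumption):
library(rpart)
library(rpart.plot)

# Hypothetical fit on the six variables named above
tree_full <- rpart(HeartDisease ~ Sex + AgeCategory + Race +
                     PhysicalActivity + SleepTime + GenHealth,
                   data = heartData_train, method = "class",
                   control = rpart.control(cp = 0.001))

# Prune at the complexity parameter with the lowest cross-validated error
best_cp <- tree_full$cptable[which.min(tree_full$cptable[, "xerror"]), "CP"]
tree <- prune(tree_full, cp = best_cp)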
prp(tree, type = 1, extra = 100, varlen = 10, digits = 3, space=0, split.cex=1,
nn.border.col = 2, tweak = 1.1, gap = 2, main = "Heart Disease Regression Tree",
branch.col = "dark red", box.col = ifelse(tree$frame$var == "<leaf>", 'dark orange', 'light blue')
)