Our group chose the Personal Key Indicators of Heart Disease dataset, which contains mostly categorical and a few numerical variables. In this machine learning project, “HeartDisease” is used as the target variable. The dataset was downloaded from Kaggle.
The original dataset comes from the CDC's 2020 annual survey of 400k adults regarding their health status. Our dataset is a subset of the original, containing only 18 variables (9 booleans, 5 strings, and 4 decimals) and a total of 319,795 records. In this project we use R to do the following:
The R libraries used are:
The entire project is published on RPubs at https://rpubs.com/pravinhere/893585
The following are the variables in our dataset:
To begin, we used the head() function to see what the data looked like and found that a large majority of our variables were health questions with a “Yes” or “No” response. A few variables were health questions with categorical responses, such as sex, age category, race, and general health, and a few were numeric, such as BMI, physical health, mental health, and sleep time. Since these variables all came from a telephone survey, it was critical to check for gaps in the data with the is.na() function. We did not find any NA or missing values, so we moved to the next piece of our cleaning. From here we used the table() function to check for inconsistent values: in survey data, the same category can be entered under different naming conventions, or responses may be left uncategorized. This dataset was already very clean, so we moved on without making adjustments.
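The cleaning code itself is not shown in this report; the following is a minimal sketch of these checks, assuming base R and a hypothetical file name heart_2020_cleaned.csv:
# Load the survey data (file name is an assumption)
heartDataOriginal <- read.csv("heart_2020_cleaned.csv", stringsAsFactors = TRUE)
# First look at the structure of the responses
head(heartDataOriginal)
# Count missing values across the entire data frame (we found none)
sum(is.na(heartDataOriginal))
# Check a categorical column for inconsistent naming conventions
table(heartDataOriginal$Smoking)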
With clean data it is then necessary to understand the dataset qualitatively. We found some open questions within our data, such as “How are people recorded as getting 1 hour or 24 hours of sleep?” (referring to our SleepTime variable). We also noted that the lowest age category starts at 18 years old, so we are not qualified to draw any correlations between minors and heart disease, as we do not have that data. We cannot explain the nature of these unusual data points, but since they come from the CDC, a reliable source, we decided to leave them in; the sheer size of the dataset also naturally smooths out the influence of outliers. Looking further into the data with functions like class(), dim(), summary(), and str(), we can understand the class, dimensions, summary, and structure of the data, respectively. The summary() function in particular gives us the min, max, quartiles, median, and mean of the numeric variables, and lets us see the counts within each categorical grouping. From here we can create dummy variables as factors in order to allow prediction with categorical variables. Dummy variables turn character data like “Yes” or “No” into “1” or “0”, and allow us to run models on a potential respondent to predict whether their collection of responses aligns with or differs from a heart disease diagnosis.
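The recoding code is not shown here; a minimal sketch of the idea using base R (the _dummy naming follows the variable names that appear in the models below):
# Illustrative recoding of "Yes"/"No" answers into 0/1 factors
heartDataOriginal$HeartDisease_dummy <-
  factor(ifelse(heartDataOriginal$HeartDisease == "Yes", 1, 0))
heartDataOriginal$Smoking_dummy <-
  factor(ifelse(heartDataOriginal$Smoking == "Yes", 1, 0))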
In addition, since we want to use this data to find which variables are the best predictors of heart disease, we decided to split our many variables into category groups as follows: bad habits (alcohol, smoking), pre-existing conditions (stroke, diabetic, asthma, kidney disease, skin cancer), demographic data (sex, age category, race), overall health (BMI, difficulty walking), and daily habits (physical activity, sleep time, general health).
Further, the following derived categories were created (see the sketch below):
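The derivation code is not shown; the coefficient names that appear later (e.g. BMI_Category_dummyObese I, SleepTime_dummyOverslept) suggest factors roughly like this sketch. The cut points here are assumptions (standard WHO BMI classes; the sleep breaks are guesses):
# Hypothetical reconstruction of the derived factors; the actual cut points
# and labels are not shown in this report
heartDataOriginal$BMI_Category_dummy <- cut(heartDataOriginal$BMI,
  breaks = c(0, 18.5, 25, 30, 35, 40, Inf),
  labels = c("Underweight", "Normal", "Overweight",
             "Obese I", "Obese II", "Obese III"))
heartDataOriginal$SleepTime_dummy <- cut(heartDataOriginal$SleepTime,
  breaks = c(0, 6, 9, 24),
  labels = c("Lack of Sleep", "Good-quality Sleep", "Overslept"))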
ggplot(heartDataOriginal, aes(x = HeartDisease)) +
geom_bar(fill = "steelblue") +
xlab("Heart Disease") +
ylab("Count in Thousands") +
scale_y_continuous(labels = label_number(suffix = " K", scale = 1e-3)) +
ggtitle("Total Heart Disease Count")
HeartDisease vs Smoking shows a notably higher prevalence of heart disease among people with smoking habits.
ggplot(data = heartDataOriginal, aes(x = Smoking, fill = HeartDisease)) +
geom_bar(position = "dodge") +
ggtitle("HeartDisease vs Smoking")
HeartDisease vs AlcoholDrinking shows a notably higher count of heart disease among people who do not have heavy drinking habits.
ggplot(data = heartDataOriginal, aes(x = AlcoholDrinking, fill = HeartDisease)) +
geom_bar(position = "dodge") +
scale_y_continuous(labels = label_number(suffix = " K", scale = 1e-3)) +
ggtitle("HeartDisease vs AlcoholDrinking")
HeartDisease vs Stroke shows a notably higher prevalence of heart disease among people who have had a stroke.
ggplot(data = heartDataOriginal, aes(x = Stroke, fill = HeartDisease)) +
geom_bar(position = "dodge") +
scale_y_continuous(labels = label_number(suffix = " K", scale = 1e-3)) +
ggtitle("HeartDisease vs Stroke")
Demonstrating that the Stroke variable is an important factor for indicating whether a person has heart disease.
ggplot(heartDataOriginal,
aes(x = Stroke,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("Stroke Vs Percent")
Demonstrating that the DiffWalking variable is an important factor for indicating whether a person has heart disease.
ggplot(heartDataOriginal,
aes(x = DiffWalking,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("DiffWalking Vs Percent")
Demonstrating that men have a higher chance of heart disease.
ggplot(heartDataOriginal,
aes(x = Sex,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("Sex Vs Percent")
HeartDisease vs Age shows a notably higher prevalence of heart disease among older people.
ggplot(data = heartDataOriginal, aes(x = AgeCategory, fill = HeartDisease)) +
geom_bar(position = "dodge") +
ggtitle("Age Vs HeartDisease")
Demonstrating that diabetics have a higher chance of heart disease.
ggplot(heartDataOriginal,
aes(x = Diabetic,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("Diabetic Vs HeartDisease")
Demonstrating that people with no physical activity have a higher chance of heart disease.
ggplot(heartDataOriginal,
aes(x = PhysicalActivity,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("PhysicalActivity Vs HeartDisease")
Demonstrating that people reporting poor or fair general health have a higher chance of heart disease.
ggplot(heartDataOriginal,
aes(x = GenHealth,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("GenHealth Vs HeartDisease")
Demonstrating that people with asthma have a higher chance of heart disease.
ggplot(heartDataOriginal,
aes(x = Asthma,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("Asthma Vs HeartDisease")
Demonstrating that people with kidney disease have a higher chance of heart disease.
ggplot(heartDataOriginal,
aes(x = KidneyDisease,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("KidneyDisease Vs HeartDisease")
Demonstrating that people with skin cancer have a higher chance of heart disease.
ggplot(heartDataOriginal,
aes(x = SkinCancer,
fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .1), labels = percent) +
labs(y = "Percent") +
ggtitle("SkinCancer Vs HeartDisease")
BMI Density Plot.
ggplot(heartDataOriginal, aes(BMI, color = HeartDisease)) +
geom_density(alpha = 0.5) +
ggtitle("BMI Density Plot")
Test Variables
As the target (outcome) variable is binary (HeartDisease = 1 or HeartDisease = 0), the chi-square test was used as the primary method to test variable independence.
H0: The two variables are independent.
H1: The two variables are not independent (they are related).
We will use a 95% confidence level and reject the null hypothesis if the p-value is less than our predetermined significance level of 0.05.
chisq.test(heartDataOriginal$HeartDisease, heartDataOriginal$Stroke, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: heartDataOriginal$HeartDisease and heartDataOriginal$Stroke
## X-squared = 12390, df = 1, p-value < 2.2e-16
chisq.test(heartDataOriginal$Asthma, heartDataOriginal$KidneyDisease, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: heartDataOriginal$Asthma and heartDataOriginal$KidneyDisease
## X-squared = 504.2, df = 1, p-value < 2.2e-16
# chisq.test() compares only two variables at a time; the Race argument in
# the original call was silently ignored, so only Sex vs AgeCategory is tested
chisq.test(heartDataOriginal$Sex, heartDataOriginal$AgeCategory, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: heartDataOriginal$Sex and heartDataOriginal$AgeCategory
## X-squared = 1772.9, df = 12, p-value < 2.2e-16
chisq.test(heartDataOriginal$BMI, heartDataOriginal$DiffWalking, correct=FALSE)
## Warning in chisq.test(heartDataOriginal$BMI, heartDataOriginal$DiffWalking, :
## Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: heartDataOriginal$BMI and heartDataOriginal$DiffWalking
## X-squared = 20833, df = 3603, p-value < 2.2e-16
# As above, only the first two variables are tested; GenHealth was ignored
chisq.test(heartDataOriginal$PhysicalActivity, heartDataOriginal$SleepTime, correct=FALSE)
## Warning in chisq.test(heartDataOriginal$PhysicalActivity,
## heartDataOriginal$SleepTime, : Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: heartDataOriginal$PhysicalActivity and heartDataOriginal$SleepTime
## X-squared = 7807.3, df = 23, p-value < 2.2e-16
Based on the above, we reject the null hypothesis for every pair tested: all p-values are well below 0.05, so the variables are related. (Note the approximation warnings where a numeric variable such as BMI or SleepTime was used; the chi-square test is designed for categorical data, so sparse cells make the approximation unreliable.)
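The same test can be applied systematically; a sketch that runs the chi-square test for each categorical predictor against HeartDisease (the variable list is taken from the dataset description above):
# Run the same independence test for every categorical predictor
cat_vars <- c("Smoking", "AlcoholDrinking", "Stroke", "DiffWalking", "Sex",
              "Diabetic", "PhysicalActivity", "Asthma", "KidneyDisease",
              "SkinCancer")
p_values <- sapply(cat_vars, function(v)
  chisq.test(heartDataOriginal$HeartDisease, heartDataOriginal[[v]],
             correct = FALSE)$p.value)
round(p_values, 4)  # all effectively zero, i.e. below 0.05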
As our target variable is binary (1 for having heart disease and 0 for not having it), we can fit logistic regression models predicting HeartDisease. For each model we follow the same steps: fit the model with glm(), predict probabilities on the test set, build a confusion matrix at a 0.50 cutoff, compute accuracy, and plot the ROC curve to obtain the AUC.
| Confusion matrix | Predicted Negative (0) | Predicted Positive (1) |
|---|---|---|
| Actual Negative (0) | Number of True Negatives | Number of False Positives |
| Actual Positive (1) | Number of False Negatives | Number of True Positives |
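The creation of heartData_train and heartData_test is not shown in this excerpt; the degrees of freedom in the model summaries (223,855) and the 95,939 test records imply a 70/30 split of the 319,795 rows. A minimal sketch, assuming base R sampling (the seed is arbitrary):
# Hypothetical 70/30 train/test split; the actual split code is not shown
set.seed(42)
train_rows <- sample(nrow(heartDataOriginal),
                     size = round(0.7 * nrow(heartDataOriginal)))
heartData_train <- heartDataOriginal[train_rows, ]
heartData_test  <- heartDataOriginal[-train_rows, ]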
Model 1 - Bad habits - Alcohol, Smoking.
mod1 <- glm(HeartDisease_dummy ~ Smoking_dummy + AlcoholDrinking_dummy, family = binomial, data = heartData_train)
summary(mod1)
##
## Call:
## glm(formula = HeartDisease_dummy ~ Smoking_dummy + AlcoholDrinking_dummy,
## family = binomial, data = heartData_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5238 -0.5238 -0.3563 -0.3563 2.6412
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.72532 0.01163 -234.34 <2e-16 ***
## Smoking_dummy1 0.80838 0.01542 52.41 <2e-16 ***
## AlcoholDrinking_dummy1 -0.73148 0.03745 -19.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130776 on 223855 degrees of freedom
## Residual deviance: 127715 on 223853 degrees of freedom
## AIC: 127721
##
## Number of Fisher Scoring iterations: 5
testing_mod1 <- predict(mod1, heartData_test, type = "response")
hist(testing_mod1)
F_hat_mod1 <- as.numeric(testing_mod1 > 0.50)
table(heartData_test$HeartDisease_dummy, F_hat_mod1, dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0
## 0 87715
## 1 8224
mean(heartData_test$HeartDisease_dummy == F_hat_mod1) # 0.9142789 i.e. 91% accuracy
## [1] 0.9142789
roc1 <- roc(heartData_train$HeartDisease_dummy, mod1$fitted.values)
plot.roc(roc1, col='blue', lty=2, lwd = 4)
auc(roc1) # 0.6048
## Area under the curve: 0.6048
Model 2 - Pre-Existing Conditions - Stroke, Diabetic, Asthma, Kidney Disease, Skin Cancer.
mod2 <- glm(HeartDisease_dummy ~ Stroke_dummy + Diabetic_dummy + Asthma_dummy + KidneyDisease_dummy + SkinCancer_dummy, family = binomial, data = heartData_train)
summary(mod2)
##
## Call:
## glm(formula = HeartDisease_dummy ~ Stroke_dummy + Diabetic_dummy +
## Asthma_dummy + KidneyDisease_dummy + SkinCancer_dummy, family = binomial,
## data = heartData_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0378 -0.3677 -0.3230 -0.3230 2.4410
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.92711 0.01083 -270.26 <2e-16 ***
## Stroke_dummy1 1.67014 0.02570 64.99 <2e-16 ***
## Diabetic_dummy1 1.08855 0.01785 61.00 <2e-16 ***
## Asthma_dummy1 0.26700 0.02104 12.69 <2e-16 ***
## KidneyDisease_dummy1 1.09379 0.02792 39.18 <2e-16 ***
## SkinCancer_dummy1 0.74995 0.02163 34.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130776 on 223855 degrees of freedom
## Residual deviance: 118377 on 223850 degrees of freedom
## AIC: 118389
##
## Number of Fisher Scoring iterations: 5
testing_mod2 <- predict(mod2, heartData_test, type = "response")
hist(testing_mod2)
F_hat_mod2 <- as.numeric(testing_mod2 > 0.50)
table(heartData_test$HeartDisease_dummy, F_hat_mod2, dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0 1
## 0 87303 412
## 1 7799 425
mean(heartData_test$HeartDisease_dummy == F_hat_mod2) #0.9144144 i.e. 91% accuracy
## [1] 0.9144144
roc2 <- roc(heartData_train$HeartDisease_dummy, mod2$fitted.values)
plot.roc(roc2, col='red', lty=2, lwd = 4)
auc(roc2) #0.6981
## Area under the curve: 0.6981
Model 3 - Demographic Data - Sex, Age Category, Race.
mod3 <- glm(HeartDisease_dummy ~ Sex_dummy + AgeCategory_dummy + Race_dummy, family = binomial, data = heartData_train)
summary(mod3)
##
## Call:
## glm(formula = HeartDisease_dummy ~ Sex_dummy + AgeCategory_dummy +
## Race_dummy, family = binomial, data = heartData_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7922 -0.4912 -0.3622 -0.1627 3.3640
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.25761 0.05842 -38.642 < 2e-16 ***
## Sex_dummy1 -0.63661 0.01579 -40.308 < 2e-16 ***
## AgeCategory_dummyOld-aged Adults 1.25958 0.01997 63.084 < 2e-16 ***
## AgeCategory_dummyYoung Adults -1.64121 0.04645 -35.332 < 2e-16 ***
## Race_dummyAsian -1.11944 0.09695 -11.547 < 2e-16 ***
## Race_dummyBlack -0.43350 0.06414 -6.759 1.39e-11 ***
## Race_dummyHispanic -0.43400 0.06513 -6.664 2.66e-11 ***
## Race_dummyOther -0.22775 0.07115 -3.201 0.00137 **
## Race_dummyWhite -0.41925 0.05694 -7.363 1.80e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130776 on 223855 degrees of freedom
## Residual deviance: 116902 on 223847 degrees of freedom
## AIC: 116920
##
## Number of Fisher Scoring iterations: 7
testing_mod3 <- predict(mod3, heartData_test, type = "response")
hist(testing_mod3)
F_hat_mod3 <- as.numeric(testing_mod3 > 0.50)
table(heartData_test$HeartDisease_dummy, F_hat_mod3, dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0
## 0 87715
## 1 8224
mean(heartData_test$HeartDisease_dummy == F_hat_mod3) # 0.9142789 i.e. 91% accuracy
## [1] 0.9142789
roc3 <- roc(heartData_train$HeartDisease_dummy, mod3$fitted.values)
plot.roc(roc3, col='green', lty=2, lwd = 4)
auc(roc3) # 0.7408
## Area under the curve: 0.7408
Model 4 - Overall Health - BMI, Difficulty Walking.
mod4 <- glm(HeartDisease_dummy ~ BMI_Category_dummy + DiffWalking_dummy, family = binomial, data = heartData_train)
summary(mod4)
##
## Call:
## glm(formula = HeartDisease_dummy ~ BMI_Category_dummy + DiffWalking_dummy,
## family = binomial, data = heartData_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7514 -0.3849 -0.3753 -0.3281 2.4285
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.84694 0.06520 -43.661 < 2e-16 ***
## BMI_Category_dummyNormal -0.04798 0.06685 -0.718 0.47293
## BMI_Category_dummyOverweight 0.22920 0.06614 3.465 0.00053 ***
## BMI_Category_dummyObese I 0.28134 0.06689 4.206 2.6e-05 ***
## BMI_Category_dummyObese II 0.20490 0.06950 2.948 0.00320 **
## BMI_Category_dummyObese III 0.06888 0.07164 0.962 0.33628
## DiffWalking_dummy1 1.44529 0.01696 85.234 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130776 on 223855 degrees of freedom
## Residual deviance: 123586 on 223849 degrees of freedom
## AIC: 123600
##
## Number of Fisher Scoring iterations: 5
testing_mod4 <- predict(mod4, heartData_test, type = "response")
hist(testing_mod4)
F_hat_mod4 <- as.numeric(testing_mod4 > 0.50)
table(heartData_test$HeartDisease_dummy, F_hat_mod4, dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0
## 0 87715
## 1 8224
mean(heartData_test$HeartDisease_dummy == F_hat_mod4) # 0.9142789 i.e. 91% accuracy
## [1] 0.9142789
roc4 <- roc(heartData_train$HeartDisease_dummy, mod4$fitted.values)
plot.roc(roc4, col='orange', lty=2, lwd = 4)
auc(roc4) #0.6481
## Area under the curve: 0.6481
Model 5 - Daily Habits - Physical Activity, Sleep Time, General Health.
mod5 <- glm(HeartDisease_dummy ~ PhysicalActivity_dummy + SleepTime_dummy + GenHealth_dummy, family = binomial, data = heartData_train)
summary(mod5)
##
## Call:
## glm(formula = HeartDisease_dummy ~ PhysicalActivity_dummy + SleepTime_dummy +
## GenHealth_dummy, family = binomial, data = heartData_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0162 -0.4727 -0.3192 -0.2269 2.7954
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.69504 0.03534 -104.555 < 2e-16 ***
## PhysicalActivity_dummy1 -0.19167 0.01743 -10.995 < 2e-16 ***
## SleepTime_dummyGood-quality Sleep 0.24006 0.01609 14.922 < 2e-16 ***
## SleepTime_dummyOverslept 0.29071 0.05625 5.168 2.37e-07 ***
## GenHealth_dummyFair 2.34453 0.03534 66.345 < 2e-16 ***
## GenHealth_dummyGood 1.55988 0.03370 46.292 < 2e-16 ***
## GenHealth_dummyPoor 3.01250 0.03996 75.390 < 2e-16 ***
## GenHealth_dummyVery good 0.74332 0.03523 21.099 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130776 on 223855 degrees of freedom
## Residual deviance: 117815 on 223848 degrees of freedom
## AIC: 117831
##
## Number of Fisher Scoring iterations: 6
testing_mod5 <- predict(mod5, heartData_test, type = "response")
hist(testing_mod5)
F_hat_mod5 <- as.numeric(testing_mod5 > 0.50)
table(heartData_test$HeartDisease_dummy, F_hat_mod5, dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0
## 0 87715
## 1 8224
mean(heartData_test$HeartDisease_dummy == F_hat_mod5) # 0.9142789 i.e. 91% accuracy
## [1] 0.9142789
roc5 <- roc(heartData_train$HeartDisease_dummy, mod5$fitted.values)
plot.roc(roc5, col='brown', lty=2, lwd = 4)
auc(roc5) #0.732
## Area under the curve: 0.7324
| Model | Accuracy | AUC | Attributes |
|---|---|---|---|
| 1 | 91% | 60% | Bad habits - Alcohol, Smoking |
| 2 | 91% | 70% | Pre-existing conditions - Stroke, Diabetic, Asthma, Kidney Disease, Skin Cancer |
| 3 | 91% | 74% | Demographic data - Sex, Age Category, Race |
| 4 | 91% | 65% | Overall Health - BMI, Difficulty Walking |
| 5 | 91% | 73% | Daily habits - Physical Activity, Sleep Time, General Health |
Note that accuracy is nearly identical across models: at the 0.50 cutoff the models almost never predict the positive class (about 91% of respondents have no heart disease), so accuracy alone cannot separate them and we compare AUC instead. Based on the above table we can reject models 1 and 4, as their AUC is less than 70%.
**The best model is model 3, followed by models 5 and 2.**
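To compare the models visually, the five ROC curves can be overlaid on one plot; a sketch assuming the roc objects from the models above are still in memory (pROC's lines.roc() adds a curve to an existing plot):
plot.roc(roc3, col = 'green', lwd = 2, main = "ROC Comparison")
lines.roc(roc5, col = 'brown', lwd = 2)
lines.roc(roc2, col = 'red', lwd = 2)
lines.roc(roc4, col = 'orange', lwd = 2)
lines.roc(roc1, col = 'blue', lwd = 2)
legend("bottomright",
       legend = c("Model 3 (AUC 0.74)", "Model 5 (AUC 0.73)",
                  "Model 2 (AUC 0.70)", "Model 4 (AUC 0.65)",
                  "Model 1 (AUC 0.60)"),
       col = c("green", "brown", "red", "orange", "blue"), lwd = 2)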
We also introduced a regression tree. This model makes it easy to analyze new data: following the bubbles, you can estimate heart disease risk by answering simple questions about only six variables (Sex, AgeCategory, Race, PhysicalActivity, SleepTime, GenHealth). The blue bubbles with the split criteria are split points (decision nodes), and the orange bubbles are the “leaves”, also called terminal nodes. We decided to “prune” the tree, as too many splits/decision nodes would overfit the model, as shown in the picture to the right.
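The tree object plotted below is not defined in this excerpt; a minimal sketch of how it could have been fit and pruned, assuming the rpart package (the initial cp value is an assumption):
library(rpart)
library(rpart.plot)

# Hypothetical fit on the six variables named above
tree_full <- rpart(HeartDisease ~ Sex + AgeCategory + Race +
                     PhysicalActivity + SleepTime + GenHealth,
                   data = heartData_train, method = "class",
                   control = rpart.control(cp = 0.001))

# Prune at the complexity parameter with the lowest cross-validated error
best_cp <- tree_full$cptable[which.min(tree_full$cptable[, "xerror"]), "CP"]
tree <- prune(tree_full, cp = best_cp)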
prp(tree, type = 1, extra = 100, varlen = 10, digits = 3, space=0, split.cex=1,
nn.border.col = 2, tweak = 1.1, gap = 2, main = "Heart Disease Regression Tree",
branch.col = "dark red", box.col = ifelse(tree$frame$var == "<leaf>", 'dark orange', 'light blue')
)