This dataset contains health survey data from around 300,000 individuals in the US (CDC, 2020). It is used to study which factors—like lifestyle and medical history—are most likely to lead to heart disease. The link from the source: Indicators of Heart Disease.
| Variable | Description |
|---|---|
| HeartDisease | Target: Ever had heart disease |
| BMI | Body Mass Index |
| Smoking | Smoked 100+ cigarettes |
| AlcoholDrinking | Heavy drinker status |
| Stroke | History of stroke |
| PhysicalHealth | Bad physical days (last 30) |
| MentalHealth | Bad mental days (last 30) |
| DiffWalking | Difficulty walking |
| Sex | Gender |
| AgeCategory | Age group |
| Race | Ethnicity |
| Diabetic | Diabetes status |
| PhysicalActivity | Recent exercise |
| GenHealth | General health rating |
| SleepTime | Sleep hours/day |
| Asthma | History of asthma |
| KidneyDisease | History of kidney disease |
| SkinCancer | History of skin cancer |
#Dataset is loaded. Note that the data set should be in the same location as this rmd file
dataset <- read.csv("heart_2020_cleaned.csv")
dataset$HeartDisease <- as.factor(dataset$HeartDisease)
In this section we will create “train” and “test” subsets
set.seed(936)
#Spliting the dataset
split <- sample.split(dataset$HeartDisease,SplitRatio=0.65)
#Creation of train and displaying the amount
train = subset(dataset,split==TRUE)
nrow(train)
## [1] 207866
#Creation of test and displaying the amount
test = subset(dataset,split==FALSE)
nrow(test)
## [1] 111929
HeartLog <- glm(HeartDisease ~ ., data=train,family=binomial)
summary(HeartLog)
##
## Call:
## glm(formula = HeartDisease ~ ., family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.377908 0.143749 -44.369 < 2e-16 ***
## BMI 0.008296 0.001421 5.837 5.30e-09 ***
## SmokingYes 0.365040 0.017834 20.469 < 2e-16 ***
## AlcoholDrinkingYes -0.219559 0.041588 -5.279 1.30e-07 ***
## StrokeYes 1.023812 0.028137 36.387 < 2e-16 ***
## PhysicalHealth 0.003799 0.001065 3.569 0.000359 ***
## MentalHealth 0.005651 0.001092 5.173 2.30e-07 ***
## DiffWalkingYes 0.246741 0.022421 11.005 < 2e-16 ***
## SexMale 0.718773 0.018076 39.765 < 2e-16 ***
## AgeCategory25-29 0.175032 0.154139 1.136 0.256145
## AgeCategory30-34 0.516356 0.139104 3.712 0.000206 ***
## AgeCategory35-39 0.653792 0.132929 4.918 8.73e-07 ***
## AgeCategory40-44 0.989534 0.126073 7.849 4.20e-15 ***
## AgeCategory45-49 1.310078 0.121465 10.786 < 2e-16 ***
## AgeCategory50-54 1.779564 0.117026 15.207 < 2e-16 ***
## AgeCategory55-59 2.010254 0.115249 17.443 < 2e-16 ***
## AgeCategory60-64 2.246983 0.114287 19.661 < 2e-16 ***
## AgeCategory65-69 2.503246 0.113939 21.970 < 2e-16 ***
## AgeCategory70-74 2.787636 0.113867 24.481 < 2e-16 ***
## AgeCategory75-79 2.972605 0.114555 25.949 < 2e-16 ***
## AgeCategory80 or older 3.247133 0.114219 28.429 < 2e-16 ***
## RaceAsian -0.495563 0.103956 -4.767 1.87e-06 ***
## RaceBlack -0.328249 0.071825 -4.570 4.87e-06 ***
## RaceHispanic -0.220821 0.073099 -3.021 0.002521 **
## RaceOther -0.005139 0.079124 -0.065 0.948215
## RaceWhite -0.051176 0.064179 -0.797 0.425220
## DiabeticNo, borderline diabetes 0.084732 0.052569 1.612 0.106999
## DiabeticYes 0.489951 0.020682 23.690 < 2e-16 ***
## DiabeticYes (during pregnancy) 0.025617 0.134307 0.191 0.848734
## PhysicalActivityYes 0.035412 0.019983 1.772 0.076369 .
## GenHealthFair 1.497764 0.040866 36.651 < 2e-16 ***
## GenHealthGood 1.037926 0.036786 28.216 < 2e-16 ***
## GenHealthPoor 1.890887 0.050722 37.280 < 2e-16 ***
## GenHealthVery good 0.473761 0.037729 12.557 < 2e-16 ***
## SleepTime -0.024255 0.005374 -4.513 6.39e-06 ***
## AsthmaYes 0.261537 0.023830 10.975 < 2e-16 ***
## KidneyDiseaseYes 0.553450 0.030374 18.221 < 2e-16 ***
## SkinCancerYes 0.106839 0.024249 4.406 1.05e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 121486 on 207865 degrees of freedom
## Residual deviance: 94254 on 207828 degrees of freedom
## AIC: 94330
##
## Number of Fisher Scoring iterations: 7
The variables that have high statistical significance are:
predictTest <- predict(HeartLog, type='response', newdata=test)
summary(predictTest)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0009019 0.0132206 0.0408723 0.0855506 0.1065490 0.9226723
The above prediction shows:
temp <- table(test$HeartDisease, predictTest > 0.5)
print(temp)
##
## FALSE TRUE
## No 101484 864
## Yes 8549 1032
# 1. Calculate Accuracy
accuracy <- sum(diag(temp)) / sum(temp)
print(accuracy)
## [1] 0.915902
# 2. Calculate Sensitivity (Recall)
# True Positives / (True Positives + False Negatives)
sensitivity <- temp[2,2] / sum(temp[2,])
print(sensitivity)
## [1] 0.1077132
# 3. Calculate Specificity
# True Negatives / (True Negatives + False Positives)
specificity <- temp[1,1] / sum(temp[1,])
print(specificity)
## [1] 0.9915582
# 4. Calculate Baseline Accuracy
# (Total 'No' cases) / Total
baseline <- sum(temp[1,]) / sum(temp)
print(baseline)
## [1] 0.9144011
ROCRpred <- prediction(predictTest, test$HeartDisease)
#Remove the NA values
train2 = na.omit(train)
test2 = na.omit(test)
#Row count train2
nrow(train2)
## [1] 207866
#Row count test2
nrow(test2)
## [1] 111929
HeartLog2 = glm(HeartDisease ~ ., data = train2, family = binomial)
predictTest2 = predict(HeartLog2, type = "response", newdata = test2)
#ROC Curve
ROCRpred2 <- prediction(predictTest2, test2$HeartDisease)
ROCRperf2 <- performance(ROCRpred2, "tpr", "fpr")
plot(ROCRperf2,colorize = TRUE)
# AUC value is:
as.numeric(performance(ROCRpred2, "auc")@y.values)
## [1] 0.8399207