I chose a data set acquired from Kaggle. Each observation represents one person's survey responses. Most of the columns are binary indicators describing the health of an individual. The goal is to predict whether the person has diabetes, has hypertension, or has ever had a stroke.
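The analysis below relies on a handful of packages; a minimal sketch of the setup (the exact library calls are assumptions, inferred from the functions used later in this report):
library(visdat)    # vis_miss() for visualizing missing data
library(corrplot)  # corrplot() for the correlation plot
library(ggplot2)   # ggplot() for the bar and density plots
library(caTools)   # sample.split() for the train/test split
library(caret)     # confusionMatrix() for evaluating the classifier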
df <- read.csv('health_data.csv')
head(df)
## Age Sex HighChol CholCheck BMI Smoker HeartDiseaseorAttack PhysActivity
## 1 4 1 0 1 26 0 0 1
## 2 12 1 1 1 26 1 0 0
## 3 13 1 0 1 26 0 0 1
## 4 11 1 1 1 28 1 0 1
## 5 8 0 0 1 29 1 0 1
## 6 1 0 0 1 18 0 0 1
## Fruits Veggies HvyAlcoholConsump GenHlth MentHlth PhysHlth DiffWalk Diabetes
## 1 0 1 0 3 5 30 0 0
## 2 1 0 0 3 0 0 0 0
## 3 1 1 0 1 0 10 0 0
## 4 1 1 0 3 0 3 0 0
## 5 1 1 0 2 0 0 0 0
## 6 1 1 0 2 7 0 0 0
## Hypertension Stroke
## 1 1 0
## 2 1 1
## 3 0 0
## 4 1 0
## 5 0 0
## 6 0 0
Let’s look at the number of observations and columns we have; this helps us understand which models will be most effective on this data set. We see around 70k observations and 18 columns. Three of the columns (Diabetes, Hypertension, and Stroke) are target variables and the remaining 15 are predictors.
dim(df)
## [1] 70692 18
Visualizing the missing data tells us whether any imputation is needed. This is a clean data set: every value is present, so no imputation is required.
vis_miss(df, warn_large_data=FALSE)
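As a quick numeric check (a sketch; this output is not part of the original analysis), we could also count missing values per column:
colSums(is.na(df))  # every column should report 0 missing values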
A summary helps us see which variables are binary and, for those that are not, what their summary statistics look like. We see that Age, BMI, GenHlth, MentHlth, and PhysHlth are not binary.
summary(df)
## Age Sex HighChol CholCheck
## Min. : 1.000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 7.000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.0000
## Median : 9.000 Median :0.000 Median :1.0000 Median :1.0000
## Mean : 8.584 Mean :0.457 Mean :0.5257 Mean :0.9753
## 3rd Qu.:11.000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :13.000 Max. :1.000 Max. :1.0000 Max. :1.0000
## BMI Smoker HeartDiseaseorAttack PhysActivity
## Min. :12.00 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:25.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :29.00 Median :0.0000 Median :0.0000 Median :1.000
## Mean :29.86 Mean :0.4753 Mean :0.1478 Mean :0.703
## 3rd Qu.:33.00 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.000
## Max. :98.00 Max. :1.0000 Max. :1.0000 Max. :1.000
## Fruits Veggies HvyAlcoholConsump GenHlth
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:2.000
## Median :1.0000 Median :1.0000 Median :0.00000 Median :3.000
## Mean :0.6118 Mean :0.7888 Mean :0.04272 Mean :2.837
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:4.000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :5.000
## MentHlth PhysHlth DiffWalk Diabetes
## Min. : 0.000 Min. : 0.00 Min. :0.0000 Min. :0.0
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.:0.0
## Median : 0.000 Median : 0.00 Median :0.0000 Median :0.5
## Mean : 3.752 Mean : 5.81 Mean :0.2527 Mean :0.5
## 3rd Qu.: 2.000 3rd Qu.: 6.00 3rd Qu.:1.0000 3rd Qu.:1.0
## Max. :30.000 Max. :30.00 Max. :1.0000 Max. :1.0
## Hypertension Stroke
## Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000
## Median :1.0000 Median :0.00000
## Mean :0.5635 Mean :0.06217
## 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000
A correlation plot shows how strongly our variables are correlated with one another. For prediction we will focus on Stroke. After computing the correlations, we also convert the binary columns to factors so the models treat them as categorical.
M = cor(df)
corrplot(M, method = 'number', tl.cex = 0.5, cl.cex = 0.5)
factor_cols <- c("Sex", "HighChol", "Smoker", "HeartDiseaseorAttack", "PhysActivity",
                 "Fruits", "Veggies", "HvyAlcoholConsump", "DiffWalk", "Stroke")
df[factor_cols] <- lapply(df[factor_cols], as.factor)
Focusing on just what the Stroke variable is correlated with: not surprisingly, the variable with the highest correlation is HeartDiseaseorAttack. DiffWalk and GenHlth follow close behind, which is interesting because GenHlth is simply the person's own opinion of their general health.
M[,18]
## Age Sex HighChol
## 0.123879343 0.003822095 0.099786191
## CholCheck BMI Smoker
## 0.022529381 0.022930877 0.064658397
## HeartDiseaseorAttack PhysActivity Fruits
## 0.223393786 -0.079984782 -0.008996297
## Veggies HvyAlcoholConsump GenHlth
## -0.047601204 -0.023394552 0.189446857
## MentHlth PhysHlth DiffWalk
## 0.087303439 0.164487793 0.192265923
## Diabetes Hypertension Stroke
## 0.125426785 0.129059872 1.000000000
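To rank these at a glance, we could also sort the absolute correlations (a quick sketch; this output is not shown in the original):
sort(abs(M[, "Stroke"]), decreasing = TRUE)  # features most correlated with Stroke first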
ggplot(data = df, aes(x = as.factor(Stroke), y = (..count..) / sum(..count..))) +
  geom_bar(fill = "steelblue") +
  labs(title = "Stroke by Smoking and High Cholesterol",
       subtitle = "(0 means no and 1 means yes)",
       y = "Proportion of observations", x = "Stroke") +
  facet_wrap(~ HighChol + Smoker)
First let’s create training and test data sets. We drop the other two target columns (Diabetes and Hypertension) so Stroke is the only response, then use 70% of the data to train and hold out the remaining 30% for testing.
df <- subset(df, select = -c(Diabetes, Hypertension))
spl = sample.split(df$Stroke, 0.7)
train = subset(df, spl == TRUE)
test = subset(df, spl == FALSE)
c(dim(train), dim(test))
## [1] 49484 16 21208 16
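sample.split() draws a random split (for reproducibility one could call set.seed() beforehand) while preserving the class ratio of Stroke in both subsets. A quick check of that balance might look like this (a sketch; output not shown in the original):
prop.table(table(train$Stroke))  # share of stroke vs. non-stroke in the training set
prop.table(table(test$Stroke))   # should be roughly the same in the test set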
Now let’s fit a logistic regression model on the training set, using all remaining features to predict Stroke.
lrm <- glm(Stroke ~ ., data=train, family='binomial')
summary(lrm)
##
## Call:
## glm(formula = Stroke ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3525 -0.3547 -0.2386 -0.1700 3.3144
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.754338 0.239610 -24.015 < 2e-16 ***
## Age 0.134291 0.009258 14.506 < 2e-16 ***
## Sex1 -0.034374 0.040916 -0.840 0.400842
## HighChol1 0.382832 0.044309 8.640 < 2e-16 ***
## CholCheck 0.269237 0.189546 1.420 0.155483
## BMI -0.010065 0.002859 -3.521 0.000430 ***
## Smoker1 0.166711 0.041092 4.057 4.97e-05 ***
## HeartDiseaseorAttack1 1.057549 0.043014 24.586 < 2e-16 ***
## PhysActivity1 -0.005049 0.042651 -0.118 0.905763
## Fruits1 0.068872 0.042064 1.637 0.101562
## Veggies1 -0.262445 0.045512 -5.766 8.10e-09 ***
## HvyAlcoholConsump1 -0.204559 0.125768 -1.626 0.103847
## GenHlth 0.316009 0.024292 13.009 < 2e-16 ***
## MentHlth 0.009160 0.002155 4.250 2.13e-05 ***
## PhysHlth 0.006976 0.002060 3.386 0.000709 ***
## DiffWalk1 0.682829 0.048014 14.221 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23047 on 49483 degrees of freedom
## Residual deviance: 19451 on 49468 degrees of freedom
## AIC: 19483
##
## Number of Fisher Scoring iterations: 6
We remove the features that are not statistically significant (along with PhysHlth) and fit the model again. The results do not change much.
lrm2 <- glm(Stroke ~ Age + HighChol + BMI + Smoker + HeartDiseaseorAttack +
              Veggies + GenHlth + MentHlth + DiffWalk,
            data = train, family = 'binomial')
summary(lrm2)
##
## Call:
## glm(formula = Stroke ~ Age + HighChol + BMI + Smoker + HeartDiseaseorAttack +
## Veggies + GenHlth + MentHlth + DiffWalk, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3291 -0.3558 -0.2392 -0.1697 3.2541
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.592649 0.151187 -36.992 < 2e-16 ***
## Age 0.136912 0.009174 14.923 < 2e-16 ***
## HighChol1 0.381585 0.044218 8.630 < 2e-16 ***
## BMI -0.009918 0.002844 -3.487 0.000488 ***
## Smoker1 0.155907 0.040468 3.853 0.000117 ***
## HeartDiseaseorAttack1 1.054101 0.042528 24.786 < 2e-16 ***
## Veggies1 -0.241270 0.043923 -5.493 3.95e-08 ***
## GenHlth 0.351981 0.021898 16.074 < 2e-16 ***
## MentHlth 0.010777 0.002094 5.148 2.63e-07 ***
## DiffWalk1 0.725704 0.046033 15.765 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23047 on 49483 degrees of freedom
## Residual deviance: 19472 on 49474 degrees of freedom
## AIC: 19492
##
## Number of Fisher Scoring iterations: 6
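To confirm that dropping those terms does not meaningfully hurt the fit, we could compare the reduced model to the full one with a likelihood-ratio test and AIC (a sketch; this comparison was not part of the original output):
anova(lrm2, lrm, test = "Chisq")  # likelihood-ratio test: reduced vs. full model
AIC(lrm, lrm2)                    # lower AIC is better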
train$Stroke_pred <- predict(lrm2, train, type="response")
test$Stroke_pred <- predict(lrm2, test, type="response")
ggplot(train, aes(Stroke_pred, color = Stroke)) +
  geom_density(size = 1) +
  ggtitle("Training Set's Predicted Score")
Interestingly, as I increase the cutoff the accuracy goes up, but the model gets worse at identifying the people who actually had a stroke, which overall makes the model less useful. The confusion matrices below show this tradeoff at several cutoffs, and the ROC sketch after them summarizes it across all cutoffs.
df1 <- test
df2 <- test
df3 <- test
df4 <- test
df5 <- test
df1$Stroke_pred <- as.factor(ifelse(df1$Stroke_pred > 0.05, 1, 0))
df2$Stroke_pred <- as.factor(ifelse(df2$Stroke_pred > 0.08, 1, 0))
df3$Stroke_pred <- as.factor(ifelse(df3$Stroke_pred > 0.1, 1, 0))
df4$Stroke_pred <- as.factor(ifelse(df4$Stroke_pred > 0.15, 1, 0))
df5$Stroke_pred <- as.factor(ifelse(df5$Stroke_pred > 0.2, 1, 0))
confusionMatrix(df1$Stroke, df1$Stroke_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 13463 6426
## 1 303 1016
##
## Accuracy : 0.6827
## 95% CI : (0.6764, 0.689)
## No Information Rate : 0.6491
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.1412
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9780
## Specificity : 0.1365
## Pos Pred Value : 0.6769
## Neg Pred Value : 0.7703
## Prevalence : 0.6491
## Detection Rate : 0.6348
## Detection Prevalence : 0.9378
## Balanced Accuracy : 0.5573
##
## 'Positive' Class : 0
##
confusionMatrix(df2$Stroke, df2$Stroke_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 15721 4168
## 1 508 811
##
## Accuracy : 0.7795
## 95% CI : (0.7739, 0.7851)
## No Information Rate : 0.7652
## P-Value [Acc > NIR] : 4.013e-07
##
## Kappa : 0.1766
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9687
## Specificity : 0.1629
## Pos Pred Value : 0.7904
## Neg Pred Value : 0.6149
## Prevalence : 0.7652
## Detection Rate : 0.7413
## Detection Prevalence : 0.9378
## Balanced Accuracy : 0.5658
##
## 'Positive' Class : 0
##
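One way to see this accuracy/sensitivity tradeoff across every possible cutoff at once is an ROC curve. A minimal sketch, assuming the pROC package is available (it is not used elsewhere in this analysis):
library(pROC)

roc_obj <- roc(response = test$Stroke, predictor = test$Stroke_pred)
plot(roc_obj)  # sensitivity vs. specificity across all cutoffs
auc(roc_obj)   # single-number summary of how well the scores separate the classes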
Next we run a PCA to understand how much variance is accounted for by the components. We can see the first 3 principal components account for over 99% of the variance.
df <- read.csv('health_data.csv')
df_cov <- cov(df)
#Reduce to principle components
pca_df <- prcomp(df_cov)
cumsum(pca_df$sdev^2) / sum(pca_df$sdev^2)
## [1] 0.7341011 0.8708795 0.9967436 0.9999437 0.9999796 0.9999848 0.9999884
## [8] 0.9999912 0.9999930 0.9999948 0.9999963 0.9999976 0.9999987 0.9999993
## [15] 0.9999998 0.9999999 1.0000000 1.0000000
plot(pca_df)
pca_df$rotation
## PC1 PC2 PC3 PC4
## Age -5.152739e-05 -0.0565642550 0.0725871486 0.989864604
## Sex 3.313550e-03 -0.0041519573 0.0011779547 -0.005415128
## HighChol -6.200164e-03 -0.0047326501 -0.0039689200 0.040692431
## CholCheck -2.964072e-04 -0.0011606404 -0.0002132847 0.005010380
## BMI -1.429584e-01 -0.4896837351 -0.8585568393 0.030960975
## Smoker -5.446945e-03 0.0010210641 0.0019886520 0.015296027
## HeartDiseaseorAttack -5.662476e-03 -0.0034638748 0.0020500992 0.024140049
## PhysActivity 1.009743e-02 0.0058959299 0.0049925238 -0.018363369
## Fruits 3.156759e-03 0.0006731043 0.0057018086 0.006447254
## Veggies 2.913322e-03 0.0009338350 0.0022364034 -0.005545991
## HvyAlcoholConsump 4.936663e-04 0.0018460608 0.0006818879 -0.004310319
## GenHlth -5.635948e-02 -0.0214131301 -0.0088165663 0.049045689
## MentHlth -4.879332e-01 0.7892418913 -0.3655816052 0.070363841
## PhysHlth -8.588168e-01 -0.3649335137 0.3515381464 -0.050780548
## DiffWalk -1.904763e-02 -0.0092299276 -0.0027889610 0.025424131
## Diabetes -9.178769e-03 -0.0129976373 -0.0114497200 0.047465074
## Hypertension -7.124716e-03 -0.0111579084 -0.0086754546 0.057811815
## Stroke -3.414798e-03 -0.0006489302 0.0014717033 0.008809910
## PC5 PC6 PC7 PC8
## Age 0.071347032 0.0182351493 0.054635425 1.717275e-02
## Sex 0.005743038 -0.5164046494 0.342861141 6.328796e-01
## HighChol -0.078294596 -0.1965756489 -0.599595701 -9.539092e-02
## CholCheck -0.001915329 0.0044206429 -0.023635343 3.682678e-03
## BMI 0.035043762 0.0062451037 0.015344885 9.364329e-05
## Smoker -0.038213798 -0.4042372794 0.409496162 -6.753330e-01
## HeartDiseaseorAttack -0.059948039 -0.0748107928 -0.010481627 1.033537e-02
## PhysActivity 0.121914982 0.1624802458 0.102474481 2.949231e-01
## Fruits 0.108207030 0.5867861894 0.083530844 1.475160e-03
## Veggies 0.085820483 0.2913937602 0.083356095 -2.998151e-02
## HvyAlcoholConsump 0.012332619 -0.0206511732 0.025255133 -3.463939e-02
## GenHlth -0.947527450 0.1839531127 0.180444883 7.402932e-02
## MentHlth 0.018902352 -0.0007183078 0.001808247 6.892785e-03
## PhysHlth 0.053751530 -0.0031821932 -0.002604814 4.654706e-03
## DiffWalk -0.093305851 0.0218879395 -0.031831991 -1.750208e-01
## Diabetes -0.163166371 -0.1134658059 -0.338638902 8.164000e-02
## Hypertension -0.107085207 -0.1480807699 -0.421690496 4.008675e-02
## Stroke -0.022808425 -0.0101598957 -0.014414921 -1.843031e-02
## PC9 PC10 PC11 PC12
## Age -0.0137216155 0.028316025 -0.023436265 0.0061537140
## Sex 0.0350209517 -0.390124161 -0.081562454 -0.0120103470
## HighChol -0.5705027098 -0.337114492 -0.346173694 0.0967539307
## CholCheck 0.0106270984 0.017520010 -0.001590248 0.0076801918
## BMI -0.0091114594 0.002616755 -0.005720937 -0.0015007357
## Smoker -0.2033783834 -0.220935699 0.317235721 0.0195297300
## HeartDiseaseorAttack 0.0982257133 0.033054683 -0.134918438 -0.0125876571
## PhysActivity -0.6091654101 0.359085652 0.417239179 0.0144664851
## Fruits 0.0687915092 -0.708322963 0.141507837 0.0054359119
## Veggies -0.1196601775 -0.063621536 -0.025340205 -0.0141004093
## HvyAlcoholConsump -0.0005524860 0.041976914 -0.032060480 -0.0401384936
## GenHlth -0.1336894757 -0.005586969 -0.020243254 -0.0472662663
## MentHlth 0.0006503514 -0.001530293 0.001472338 0.0005266919
## PhysHlth -0.0043176702 0.001400982 0.005185694 0.0006990573
## DiffWalk 0.3241529567 0.179758880 -0.285402954 0.0068939306
## Diabetes 0.2682314708 -0.049230785 0.501225805 0.6960718954
## Hypertension 0.1795553095 -0.091995241 0.469387022 -0.7077582160
## Stroke 0.0679408072 0.058539844 -0.069693307 -0.0093030424
## PC13 PC14 PC15 PC16
## Age 0.001186777 -0.0026885471 0.0340499747 -0.0072392548
## Sex 0.068646016 -0.1453043756 -0.1449379382 0.0375958513
## HighChol -0.022547584 -0.0705032975 -0.0897328762 0.0195670739
## CholCheck -0.005250046 0.0021240612 0.0994655120 -0.0919823521
## BMI -0.002727483 0.0051234113 0.0056213931 0.0002420837
## Smoker -0.007079805 -0.0210695961 -0.1236132822 0.0429185130
## HeartDiseaseorAttack -0.099968345 0.8015861173 -0.5337560439 -0.1205793003
## PhysActivity -0.234184894 -0.0983876162 -0.3220483198 0.0481182131
## Fruits -0.297841471 -0.0122835282 -0.1127304932 0.0149810211
## Veggies 0.885371807 -0.0293840415 -0.2783605551 0.0727009757
## HvyAlcoholConsump 0.007229332 -0.0052129075 0.2317774485 -0.4066951986
## GenHlth 0.017099396 -0.0030132617 0.0560943242 -0.0113421173
## MentHlth 0.001023249 0.0009595718 0.0007321967 -0.0003112121
## PhysHlth 0.001576188 0.0044889294 0.0093039567 -0.0017848400
## DiffWalk -0.184490613 -0.5494927701 -0.6158018939 0.0055328815
## Diabetes 0.110156716 -0.0140674463 -0.0808192984 0.0024731838
## Hypertension 0.078462773 -0.0438933815 -0.1080323924 0.0185070698
## Stroke -0.067734075 0.1267877010 0.1067885165 0.8941877429
## PC17 PC18
## Age -0.0038797245 0.0135682504
## Sex 0.0380880417 -0.0722994335
## HighChol 0.0349304910 -0.0378139307
## CholCheck -0.6110679283 -0.7791703180
## BMI 0.0001422267 0.0012434537
## Smoker 0.0013607573 -0.0478348317
## HeartDiseaseorAttack 0.0442511038 -0.0832618400
## PhysActivity 0.0715343449 -0.1037507655
## Fruits 0.0567882886 -0.0733722885
## Veggies 0.0570094056 -0.0992890046
## HvyAlcoholConsump 0.7280151406 -0.4939154167
## GenHlth -0.0036574654 0.0074913255
## MentHlth -0.0009732207 0.0003555294
## PhysHlth -0.0015482105 0.0029815024
## DiffWalk 0.0800188344 -0.1324830752
## Diabetes 0.0719117972 -0.0486505186
## Hypertension 0.0245037506 -0.0305975719
## Stroke 0.2610754674 -0.2931752390
plot(pca_df$rotation[, 1], pca_df$rotation[, 2])  # PC1 vs. PC2 loadings for each variable
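Note that the PCA above is computed on the covariance matrix itself (prcomp(cov(df))). A more conventional alternative, sketched here for comparison and not run as part of the original analysis, is to apply prcomp directly to the data with centering and scaling so that wide-ranged columns like BMI and PhysHlth do not dominate:
pca_scaled <- prcomp(df, center = TRUE, scale. = TRUE)
summary(pca_scaled)$importance["Cumulative Proportion", 1:5]  # variance explained by the first 5 PCs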
The logistic regression model didn't do a great job of predicting whether someone had a stroke. In the future I would try a random forest. PCA showed that most of the variance is accounted for by the first 3 components.
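A minimal sketch of the random forest follow-up mentioned above, assuming the randomForest package is installed (this was not run as part of this analysis):
library(randomForest)

# Drop the logistic-regression scores so they don't leak into the forest.
rf_train <- subset(train, select = -Stroke_pred)
rf_test  <- subset(test,  select = -Stroke_pred)

rf <- randomForest(Stroke ~ ., data = rf_train, ntree = 200, importance = TRUE)
rf_pred <- predict(rf, rf_test)

confusionMatrix(rf_pred, rf_test$Stroke)  # predicted classes vs. actual Stroke
varImpPlot(rf)                            # which predictors the forest relies on most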