Data Exploration

I chose to use a data set acquired from kaggle. Each observation represents survey results. Most of the columns are binary and describe the health of an individual. The goal is to predict if the person has diabetes, hypertension, or ever has had a stroke.

df <- read.csv('health_data.csv')
head(df)
##   Age Sex HighChol CholCheck BMI Smoker HeartDiseaseorAttack PhysActivity
## 1   4   1        0         1  26      0                    0            1
## 2  12   1        1         1  26      1                    0            0
## 3  13   1        0         1  26      0                    0            1
## 4  11   1        1         1  28      1                    0            1
## 5   8   0        0         1  29      1                    0            1
## 6   1   0        0         1  18      0                    0            1
##   Fruits Veggies HvyAlcoholConsump GenHlth MentHlth PhysHlth DiffWalk Diabetes
## 1      0       1                 0       3        5       30        0        0
## 2      1       0                 0       3        0        0        0        0
## 3      1       1                 0       1        0       10        0        0
## 4      1       1                 0       3        0        3        0        0
## 5      1       1                 0       2        0        0        0        0
## 6      1       1                 0       2        7        0        0        0
##   Hypertension Stroke
## 1            1      0
## 2            1      1
## 3            0      0
## 4            1      0
## 5            0      0
## 6            0      0

Let’s look at the number of observations and features we have. This will help us understand which models will be most effective in handling this data set. We see around 70k observations and 18 features. 3 of the columns are target variables and the others are feature predictors.

dim(df)
## [1] 70692    18

Visualizing if there is any missing data lets us know if there is any imputation needed. This is a clean data set, so all the variables are present and no imputation is needed.

vis_miss(df, warn_large_data=FALSE)

A summary helps us to understand which variables are binary and for the ones that are not what are their summary statistics. We see Age, BMI, GenHlth, MentHlth, and PhysHlth are not binary.

summary(df)
##       Age              Sex           HighChol        CholCheck     
##  Min.   : 1.000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 7.000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median : 9.000   Median :0.000   Median :1.0000   Median :1.0000  
##  Mean   : 8.584   Mean   :0.457   Mean   :0.5257   Mean   :0.9753  
##  3rd Qu.:11.000   3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :13.000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
##       BMI            Smoker       HeartDiseaseorAttack  PhysActivity  
##  Min.   :12.00   Min.   :0.0000   Min.   :0.0000       Min.   :0.000  
##  1st Qu.:25.00   1st Qu.:0.0000   1st Qu.:0.0000       1st Qu.:0.000  
##  Median :29.00   Median :0.0000   Median :0.0000       Median :1.000  
##  Mean   :29.86   Mean   :0.4753   Mean   :0.1478       Mean   :0.703  
##  3rd Qu.:33.00   3rd Qu.:1.0000   3rd Qu.:0.0000       3rd Qu.:1.000  
##  Max.   :98.00   Max.   :1.0000   Max.   :1.0000       Max.   :1.000  
##      Fruits          Veggies       HvyAlcoholConsump    GenHlth     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:2.000  
##  Median :1.0000   Median :1.0000   Median :0.00000   Median :3.000  
##  Mean   :0.6118   Mean   :0.7888   Mean   :0.04272   Mean   :2.837  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :5.000  
##     MentHlth         PhysHlth        DiffWalk         Diabetes  
##  Min.   : 0.000   Min.   : 0.00   Min.   :0.0000   Min.   :0.0  
##  1st Qu.: 0.000   1st Qu.: 0.00   1st Qu.:0.0000   1st Qu.:0.0  
##  Median : 0.000   Median : 0.00   Median :0.0000   Median :0.5  
##  Mean   : 3.752   Mean   : 5.81   Mean   :0.2527   Mean   :0.5  
##  3rd Qu.: 2.000   3rd Qu.: 6.00   3rd Qu.:1.0000   3rd Qu.:1.0  
##  Max.   :30.000   Max.   :30.00   Max.   :1.0000   Max.   :1.0  
##   Hypertension        Stroke       
##  Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :1.0000   Median :0.00000  
##  Mean   :0.5635   Mean   :0.06217  
##  3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000

A correlation plot shows how closely correlated our variables are. For the purposes of predicting we will be using Stroke.

M = cor(df)
corrplot(M, method = 'number', tl.cex = 0.5, cl.cex = 0.5) 

df$Sex <- as.factor(df$Sex)
df$HighChol <- as.factor(df$HighChol)
df$Smoker <- as.factor(df$Smoker)
df$HeartDiseaseorAttack <- as.factor(df$HeartDiseaseorAttack)
df$PhysActivity <- as.factor(df$PhysActivity)
df$Fruits <- as.factor(df$Fruits)
df$Veggies <- as.factor(df$Veggies)
df$HvyAlcoholConsump <- as.factor(df$HvyAlcoholConsump)
df$DiffWalk <- as.factor(df$DiffWalk)
df$Stroke <- as.factor(df$Stroke)

Focusing on just what the Stroke variable is correlated with. Not suprisingly the vairable with the highest correlation is Heart Disease or Attack. Interestingly, GenHlth is the second highest correlated which is a the person’s opinion on their general health.

M[,18]
##                  Age                  Sex             HighChol 
##          0.123879343          0.003822095          0.099786191 
##            CholCheck                  BMI               Smoker 
##          0.022529381          0.022930877          0.064658397 
## HeartDiseaseorAttack         PhysActivity               Fruits 
##          0.223393786         -0.079984782         -0.008996297 
##              Veggies    HvyAlcoholConsump              GenHlth 
##         -0.047601204         -0.023394552          0.189446857 
##             MentHlth             PhysHlth             DiffWalk 
##          0.087303439          0.164487793          0.192265923 
##             Diabetes         Hypertension               Stroke 
##          0.125426785          0.129059872          1.000000000
ggplot(data = df, aes(x=as.factor(Stroke), y=(..count..)/sum(..count..))) +
  geom_bar(color = "steelblue") +
  labs(title = "Sroke by Smoking and High Chol",
       subtitle = "(0 means no and 1 means yest)",
       y = "Count of new characters", x = "") + 
  facet_wrap(~ HighChol + Smoker) 

Logistic Regression

First let’s create traning and test data sets. We use 70% of the data set to

df <- subset(df, select = -c(Diabetes, Hypertension))

spl = sample.split(df$Stroke, 0.7)

train = subset(df, spl == TRUE)
test = subset(df, spl == FALSE)

c(dim(train), dim(test))
## [1] 49484    16 21208    16

Now let’s create a logistic regression model

lrm <- glm(Stroke ~ ., data=train, family='binomial')
summary(lrm)
## 
## Call:
## glm(formula = Stroke ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3525  -0.3547  -0.2386  -0.1700   3.3144  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -5.754338   0.239610 -24.015  < 2e-16 ***
## Age                    0.134291   0.009258  14.506  < 2e-16 ***
## Sex1                  -0.034374   0.040916  -0.840 0.400842    
## HighChol1              0.382832   0.044309   8.640  < 2e-16 ***
## CholCheck              0.269237   0.189546   1.420 0.155483    
## BMI                   -0.010065   0.002859  -3.521 0.000430 ***
## Smoker1                0.166711   0.041092   4.057 4.97e-05 ***
## HeartDiseaseorAttack1  1.057549   0.043014  24.586  < 2e-16 ***
## PhysActivity1         -0.005049   0.042651  -0.118 0.905763    
## Fruits1                0.068872   0.042064   1.637 0.101562    
## Veggies1              -0.262445   0.045512  -5.766 8.10e-09 ***
## HvyAlcoholConsump1    -0.204559   0.125768  -1.626 0.103847    
## GenHlth                0.316009   0.024292  13.009  < 2e-16 ***
## MentHlth               0.009160   0.002155   4.250 2.13e-05 ***
## PhysHlth               0.006976   0.002060   3.386 0.000709 ***
## DiffWalk1              0.682829   0.048014  14.221  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23047  on 49483  degrees of freedom
## Residual deviance: 19451  on 49468  degrees of freedom
## AIC: 19483
## 
## Number of Fisher Scoring iterations: 6

We remove all the features that are not statistically significant and run the model again. The results do no change much.

lrm2 <- glm(Stroke ~Age+HighChol+BMI+Smoker+HeartDiseaseorAttack+Veggies+GenHlth+MentHlth+DiffWalk, data=train, family='binomial')
summary(lrm2)
## 
## Call:
## glm(formula = Stroke ~ Age + HighChol + BMI + Smoker + HeartDiseaseorAttack + 
##     Veggies + GenHlth + MentHlth + DiffWalk, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3291  -0.3558  -0.2392  -0.1697   3.2541  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -5.592649   0.151187 -36.992  < 2e-16 ***
## Age                    0.136912   0.009174  14.923  < 2e-16 ***
## HighChol1              0.381585   0.044218   8.630  < 2e-16 ***
## BMI                   -0.009918   0.002844  -3.487 0.000488 ***
## Smoker1                0.155907   0.040468   3.853 0.000117 ***
## HeartDiseaseorAttack1  1.054101   0.042528  24.786  < 2e-16 ***
## Veggies1              -0.241270   0.043923  -5.493 3.95e-08 ***
## GenHlth                0.351981   0.021898  16.074  < 2e-16 ***
## MentHlth               0.010777   0.002094   5.148 2.63e-07 ***
## DiffWalk1              0.725704   0.046033  15.765  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23047  on 49483  degrees of freedom
## Residual deviance: 19472  on 49474  degrees of freedom
## AIC: 19492
## 
## Number of Fisher Scoring iterations: 6
train$Stroke_pred <- predict(lrm2, train, type="response")
test$Stroke_pred <- predict(lrm2, test, type="response")
ggplot( train, aes( Stroke_pred, color = as.factor(Stroke) ) ) + 
geom_density( size = 1 ) +
ggtitle( "Training Set's Predicted Score" ) 

Interestingly, as I increase the cutoff the accuracy goes up, but the worse the model gets at predicting if the person has had a stroke. Which overall makes the model worse.

df1 <- test 
df2 <- test
df3 <- test
df4 <- test
df5 <- test

df1$Stroke_pred <- as.factor(ifelse(df1$Stroke_pred > 0.05, 1, 0))
df2$Stroke_pred <- as.factor(ifelse(df2$Stroke_pred > 0.08, 1, 0))
df3$Stroke_pred <- as.factor(ifelse(df1$Stroke_pred > 0.1, 1, 0))
## Warning in Ops.factor(df1$Stroke_pred, 0.1): '>' not meaningful for factors
df4$Stroke_pred <- as.factor(ifelse(df4$Stroke_pred > 0.15, 1, 0))
df5$Stroke_pred <- as.factor(ifelse(df5$Stroke_pred > 0.2, 1, 0))

confusionMatrix(df1$Stroke, df1$Stroke_pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 13463  6426
##          1   303  1016
##                                          
##                Accuracy : 0.6827         
##                  95% CI : (0.6764, 0.689)
##     No Information Rate : 0.6491         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.1412         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9780         
##             Specificity : 0.1365         
##          Pos Pred Value : 0.6769         
##          Neg Pred Value : 0.7703         
##              Prevalence : 0.6491         
##          Detection Rate : 0.6348         
##    Detection Prevalence : 0.9378         
##       Balanced Accuracy : 0.5573         
##                                          
##        'Positive' Class : 0              
## 
confusionMatrix(df2$Stroke, df2$Stroke_pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 15721  4168
##          1   508   811
##                                           
##                Accuracy : 0.7795          
##                  95% CI : (0.7739, 0.7851)
##     No Information Rate : 0.7652          
##     P-Value [Acc > NIR] : 4.013e-07       
##                                           
##                   Kappa : 0.1766          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9687          
##             Specificity : 0.1629          
##          Pos Pred Value : 0.7904          
##          Neg Pred Value : 0.6149          
##              Prevalence : 0.7652          
##          Detection Rate : 0.7413          
##    Detection Prevalence : 0.9378          
##       Balanced Accuracy : 0.5658          
##                                           
##        'Positive' Class : 0               
## 

Principle Component Analysis

Next we look at a PCA to understand how much variance is account for in the components. We can see the first 3 principle components account for 99% of the variance.

df <- read.csv('health_data.csv')
df_cov <- cov(df)

#Reduce to principle components
pca_df <- prcomp(df_cov)

(cumsum(pca_df$sdev^2) / sum(pca_df$sdev^2))[1:20]
##  [1] 0.7341011 0.8708795 0.9967436 0.9999437 0.9999796 0.9999848 0.9999884
##  [8] 0.9999912 0.9999930 0.9999948 0.9999963 0.9999976 0.9999987 0.9999993
## [15] 0.9999998 0.9999999 1.0000000 1.0000000        NA        NA
plot(pca_df)

pca_df$rotation
##                                PC1           PC2           PC3          PC4
## Age                  -5.152739e-05 -0.0565642550  0.0725871486  0.989864604
## Sex                   3.313550e-03 -0.0041519573  0.0011779547 -0.005415128
## HighChol             -6.200164e-03 -0.0047326501 -0.0039689200  0.040692431
## CholCheck            -2.964072e-04 -0.0011606404 -0.0002132847  0.005010380
## BMI                  -1.429584e-01 -0.4896837351 -0.8585568393  0.030960975
## Smoker               -5.446945e-03  0.0010210641  0.0019886520  0.015296027
## HeartDiseaseorAttack -5.662476e-03 -0.0034638748  0.0020500992  0.024140049
## PhysActivity          1.009743e-02  0.0058959299  0.0049925238 -0.018363369
## Fruits                3.156759e-03  0.0006731043  0.0057018086  0.006447254
## Veggies               2.913322e-03  0.0009338350  0.0022364034 -0.005545991
## HvyAlcoholConsump     4.936663e-04  0.0018460608  0.0006818879 -0.004310319
## GenHlth              -5.635948e-02 -0.0214131301 -0.0088165663  0.049045689
## MentHlth             -4.879332e-01  0.7892418913 -0.3655816052  0.070363841
## PhysHlth             -8.588168e-01 -0.3649335137  0.3515381464 -0.050780548
## DiffWalk             -1.904763e-02 -0.0092299276 -0.0027889610  0.025424131
## Diabetes             -9.178769e-03 -0.0129976373 -0.0114497200  0.047465074
## Hypertension         -7.124716e-03 -0.0111579084 -0.0086754546  0.057811815
## Stroke               -3.414798e-03 -0.0006489302  0.0014717033  0.008809910
##                               PC5           PC6          PC7           PC8
## Age                   0.071347032  0.0182351493  0.054635425  1.717275e-02
## Sex                   0.005743038 -0.5164046494  0.342861141  6.328796e-01
## HighChol             -0.078294596 -0.1965756489 -0.599595701 -9.539092e-02
## CholCheck            -0.001915329  0.0044206429 -0.023635343  3.682678e-03
## BMI                   0.035043762  0.0062451037  0.015344885  9.364329e-05
## Smoker               -0.038213798 -0.4042372794  0.409496162 -6.753330e-01
## HeartDiseaseorAttack -0.059948039 -0.0748107928 -0.010481627  1.033537e-02
## PhysActivity          0.121914982  0.1624802458  0.102474481  2.949231e-01
## Fruits                0.108207030  0.5867861894  0.083530844  1.475160e-03
## Veggies               0.085820483  0.2913937602  0.083356095 -2.998151e-02
## HvyAlcoholConsump     0.012332619 -0.0206511732  0.025255133 -3.463939e-02
## GenHlth              -0.947527450  0.1839531127  0.180444883  7.402932e-02
## MentHlth              0.018902352 -0.0007183078  0.001808247  6.892785e-03
## PhysHlth              0.053751530 -0.0031821932 -0.002604814  4.654706e-03
## DiffWalk             -0.093305851  0.0218879395 -0.031831991 -1.750208e-01
## Diabetes             -0.163166371 -0.1134658059 -0.338638902  8.164000e-02
## Hypertension         -0.107085207 -0.1480807699 -0.421690496  4.008675e-02
## Stroke               -0.022808425 -0.0101598957 -0.014414921 -1.843031e-02
##                                PC9         PC10         PC11          PC12
## Age                  -0.0137216155  0.028316025 -0.023436265  0.0061537140
## Sex                   0.0350209517 -0.390124161 -0.081562454 -0.0120103470
## HighChol             -0.5705027098 -0.337114492 -0.346173694  0.0967539307
## CholCheck             0.0106270984  0.017520010 -0.001590248  0.0076801918
## BMI                  -0.0091114594  0.002616755 -0.005720937 -0.0015007357
## Smoker               -0.2033783834 -0.220935699  0.317235721  0.0195297300
## HeartDiseaseorAttack  0.0982257133  0.033054683 -0.134918438 -0.0125876571
## PhysActivity         -0.6091654101  0.359085652  0.417239179  0.0144664851
## Fruits                0.0687915092 -0.708322963  0.141507837  0.0054359119
## Veggies              -0.1196601775 -0.063621536 -0.025340205 -0.0141004093
## HvyAlcoholConsump    -0.0005524860  0.041976914 -0.032060480 -0.0401384936
## GenHlth              -0.1336894757 -0.005586969 -0.020243254 -0.0472662663
## MentHlth              0.0006503514 -0.001530293  0.001472338  0.0005266919
## PhysHlth             -0.0043176702  0.001400982  0.005185694  0.0006990573
## DiffWalk              0.3241529567  0.179758880 -0.285402954  0.0068939306
## Diabetes              0.2682314708 -0.049230785  0.501225805  0.6960718954
## Hypertension          0.1795553095 -0.091995241  0.469387022 -0.7077582160
## Stroke                0.0679408072  0.058539844 -0.069693307 -0.0093030424
##                              PC13          PC14          PC15          PC16
## Age                   0.001186777 -0.0026885471  0.0340499747 -0.0072392548
## Sex                   0.068646016 -0.1453043756 -0.1449379382  0.0375958513
## HighChol             -0.022547584 -0.0705032975 -0.0897328762  0.0195670739
## CholCheck            -0.005250046  0.0021240612  0.0994655120 -0.0919823521
## BMI                  -0.002727483  0.0051234113  0.0056213931  0.0002420837
## Smoker               -0.007079805 -0.0210695961 -0.1236132822  0.0429185130
## HeartDiseaseorAttack -0.099968345  0.8015861173 -0.5337560439 -0.1205793003
## PhysActivity         -0.234184894 -0.0983876162 -0.3220483198  0.0481182131
## Fruits               -0.297841471 -0.0122835282 -0.1127304932  0.0149810211
## Veggies               0.885371807 -0.0293840415 -0.2783605551  0.0727009757
## HvyAlcoholConsump     0.007229332 -0.0052129075  0.2317774485 -0.4066951986
## GenHlth               0.017099396 -0.0030132617  0.0560943242 -0.0113421173
## MentHlth              0.001023249  0.0009595718  0.0007321967 -0.0003112121
## PhysHlth              0.001576188  0.0044889294  0.0093039567 -0.0017848400
## DiffWalk             -0.184490613 -0.5494927701 -0.6158018939  0.0055328815
## Diabetes              0.110156716 -0.0140674463 -0.0808192984  0.0024731838
## Hypertension          0.078462773 -0.0438933815 -0.1080323924  0.0185070698
## Stroke               -0.067734075  0.1267877010  0.1067885165  0.8941877429
##                               PC17          PC18
## Age                  -0.0038797245  0.0135682504
## Sex                   0.0380880417 -0.0722994335
## HighChol              0.0349304910 -0.0378139307
## CholCheck            -0.6110679283 -0.7791703180
## BMI                   0.0001422267  0.0012434537
## Smoker                0.0013607573 -0.0478348317
## HeartDiseaseorAttack  0.0442511038 -0.0832618400
## PhysActivity          0.0715343449 -0.1037507655
## Fruits                0.0567882886 -0.0733722885
## Veggies               0.0570094056 -0.0992890046
## HvyAlcoholConsump     0.7280151406 -0.4939154167
## GenHlth              -0.0036574654  0.0074913255
## MentHlth             -0.0009732207  0.0003555294
## PhysHlth             -0.0015482105  0.0029815024
## DiffWalk              0.0800188344 -0.1324830752
## Diabetes              0.0719117972 -0.0486505186
## Hypertension          0.0245037506 -0.0305975719
## Stroke                0.2610754674 -0.2931752390
plot(pca_df$rotation[1:18],pca_df$rotation[19:36])

Conclusion

The logisitc regression model didnt do a great job of predicting if someone had heart disease. In the future I would use random forrest. PCA showed that the variance is account for in the first 3 components.