I will compare respondents who self-report being in bad health with those who do not, using sex (gender), health plan coverage, time since last checkup, and race/ethnicity as predictors.
knitr::kable(head(predbrfss2020))
| badhealth | sex | hlthpln1 | checkup1 | raceeth |
|---|---|---|---|---|
| 0 | Female | hp | 0last2yrs | nhwhite |
| 0 | Male | nohp | 1last5yrs | nhblack |
| 0 | Female | hp | 0last2yrs | nhwhite |
| 0 | Female | hp | 0last2yrs | nhwhite |
| 0 | Female | hp | 0last2yrs | nhwhite |
| 0 | Male | hp | 0last2yrs | nhwhite |
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## The following object is masked from 'package:survival':
##
## cluster
set.seed(1014)
train<- createDataPartition(y = predbrfss2020$badhealth, p = .80, list=F)
predbrfss2020train<-predbrfss2020[train,]
prebrfss2020test<-predbrfss2020[-train,]
table(predbrfss2020train$badhealth)
##
## 0 1
## 129371 20421
prop.table(table(predbrfss2020train$badhealth))
##
## 0 1
## 0.863671 0.136329
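To illustrate what an 80/20 split that preserves the outcome balance looks like, here is a minimal base R sketch on simulated data. It samples within each outcome class, which is one simple way to do it and is not meant to reproduce caret's internals. The vector `y` is a made-up stand-in for badhealth.

```r
# Simulated stand-in for the outcome: roughly 14% ones (hypothetical data)
set.seed(1014)
y <- rbinom(10000, size = 1, prob = 0.136)

# Sample 80% of the row indices within each outcome class
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.80 * length(i)))
}), use.names = FALSE)

y_train <- y[train_idx]
y_test  <- y[-train_idx]

# The share of 1s should be nearly identical in the full data and both pieces
round(c(full = mean(y), train = mean(y_train), test = mean(y_test)), 4)
```

If the three shares are close, any difference in model performance between the training and test sets is unlikely to come from a different mix of 0s and 1s.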
summary(predbrfss2020train)
## badhealth sex hlthpln1 checkup1
## Min. :0.0000 Female:81038 Length:149792 0last2yrs:135134
## 1st Qu.:0.0000 Male :68754 Class :character 1last5yrs: 13941
## Median :0.0000 Mode :character 2never : 717
## Mean :0.1363
## 3rd Qu.:0.0000
## Max. :1.0000
## raceeth
## hispanic: 16225
## nhblack : 15075
## nhmulti : 2717
## nhother : 6981
## nhwhite :108794
##
I had to lower the decision threshold from .5 to .2 because the .5 threshold predicts no 1s at all, as seen in cm0 below. In cm1, with the .2 threshold, the accuracy is 84.6%. However, 97% of the 0s are classified correctly while only 4% of the 1s are (which is really bad), and the balanced accuracy is about 51%.
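The code that builds `trpredcl` is not shown in this section. As a minimal sketch of how class predictions are usually obtained from a logistic regression at a chosen threshold, the example below uses simulated data; the variables `x` and `y`, the simulated probabilities, and the helper `classify()` are all assumptions made for illustration.

```r
# Simulated stand-in data (hypothetical, not the BRFSS file)
set.seed(1014)
n <- 5000
x <- factor(sample(c("hp", "nohp"), n, replace = TRUE, prob = c(0.9, 0.1)))
p_true <- ifelse(x == "nohp", 0.35, 0.12)
y <- rbinom(n, size = 1, prob = p_true)

# Logistic regression and fitted probabilities that y == 1
fit <- glm(y ~ x, family = binomial)
p_hat <- predict(fit, type = "response")

# Hypothetical helper: predict 1 when the fitted probability exceeds the cutoff
classify <- function(p, cutoff) {
  factor(ifelse(p > cutoff, 1, 0), levels = c(0, 1))
}

# With a rare outcome, no fitted probability reaches .5, so every case is a 0
table(predicted = classify(p_hat, 0.5), observed = y)

# Lowering the cutoff lets the higher-risk group be predicted as 1
table(predicted = classify(p_hat, 0.2), observed = y)
```

Fixing the factor levels to `c(0, 1)` keeps the empty "1" row in the table when nothing is predicted as 1, which also avoids level-order mismatches when the result is compared with the observed outcome.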
cm0<-confusionMatrix(data = trpredcl,reference = factor(predbrfss2020train$badhealth))
## Warning in confusionMatrix.default(data = trpredcl, reference =
## factor(predbrfss2020train$badhealth)): Levels are not in the same order for
## reference and data. Refactoring data to match.
cm0
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 129371 20421
## 1 0 0
##
## Accuracy : 0.8637
## 95% CI : (0.8619, 0.8654)
## No Information Rate : 0.8637
## P-Value [Acc > NIR] : 0.5019
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.8637
## Neg Pred Value : NaN
## Prevalence : 0.8637
## Detection Rate : 0.8637
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
cm1<-confusionMatrix(data = trpredcl,reference = factor(predbrfss2020train$badhealth))
cm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 125951 19613
## 1 3420 808
##
## Accuracy : 0.8462
## 95% CI : (0.8444, 0.8481)
## No Information Rate : 0.8637
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0197
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.97356
## Specificity : 0.03957
## Pos Pred Value : 0.86526
## Neg Pred Value : 0.19111
## Prevalence : 0.86367
## Detection Rate : 0.84084
## Detection Prevalence : 0.97177
## Balanced Accuracy : 0.50657
##
## 'Positive' Class : 0
##
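As a check on how these statistics are defined, the sketch below recomputes them by hand from the cm1 counts. Because the 'Positive' class is 0, sensitivity is the share of true 0s predicted as 0 and specificity is the share of true 1s predicted as 1. The no-information rate is taken here as the share of the largest class.

```r
# Counts copied from cm1 (rows = prediction, columns = reference)
cm <- matrix(c(125951, 3420, 19613, 808), nrow = 2,
             dimnames = list(prediction = c("0", "1"),
                             reference  = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # true 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])  # true 1s predicted as 1
balanced    <- (sensitivity + specificity) / 2
nir         <- max(colSums(cm)) / sum(cm)     # always predicting the larger class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced, nir = nir), 4)
```

The accuracy of 0.8462 sits below the no-information rate of 0.8637, so at this threshold the model does slightly worse on overall accuracy than always predicting 0.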
When the threshold is lowered to .1 (cm2 below), the overall accuracy drops to 19% and the share of 0s classified correctly drops to 7%, but the share of 1s classified correctly rises sharply to 96%. The balanced accuracy also goes up, although only by about one percentage point, from 50.7% to 51.5%.
cm2<-confusionMatrix(data = trpredcl,reference = factor(predbrfss2020train$badhealth))
cm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9210 826
## 1 120161 19595
##
## Accuracy : 0.1923
## 95% CI : (0.1903, 0.1943)
## No Information Rate : 0.8637
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0089
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.07119
## Specificity : 0.95955
## Pos Pred Value : 0.91770
## Neg Pred Value : 0.14021
## Prevalence : 0.86367
## Detection Rate : 0.06149
## Detection Prevalence : 0.06700
## Balanced Accuracy : 0.51537
##
## 'Positive' Class : 0
##
Again, I lowered the threshold from .5 to .2 to get predictions of both 0s and 1s. Comparing the .2 threshold with a threshold set at the mean of .1362 (cm3), the .2 threshold has the higher accuracy, 84.6% versus 72%. Moving the threshold from .2 to .1362 lowers the share of 0s classified correctly from 97% to 78%, but raises the share of 1s classified correctly quite a bit, from 4% to almost 32%. The balanced accuracy also goes up when using the mean, by about 4 percentage points, from about 51% to 55%.
cm3<-confusionMatrix(data = trpredcl,reference = factor(predbrfss2020train$badhealth))
cm3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 101362 13940
## 1 28009 6481
##
## Accuracy : 0.72
## 95% CI : (0.7177, 0.7222)
## No Information Rate : 0.8637
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0782
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7835
## Specificity : 0.3174
## Pos Pred Value : 0.8791
## Neg Pred Value : 0.1879
## Prevalence : 0.8637
## Detection Rate : 0.6767
## Detection Prevalence : 0.7697
## Balanced Accuracy : 0.5504
##
## 'Positive' Class : 0
##
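To put the three thresholds side by side, the sketch below rebuilds the key rates from the counts reported in cm1, cm2, and cm3. The column names (`n00`, `correct_0s`, and so on) are made up for this illustration.

```r
# Counts copied from cm1 (.2), cm2 (.1) and cm3 (mean) above
# n00 = true 0 predicted 0, n10 = true 0 predicted 1,
# n01 = true 1 predicted 0, n11 = true 1 predicted 1
cms <- data.frame(
  threshold = c("0.2", "0.1", "mean"),
  n00 = c(125951,   9210, 101362),
  n10 = c(  3420, 120161,  28009),
  n01 = c( 19613,    826,  13940),
  n11 = c(   808,  19595,   6481)
)

total <- cms$n00 + cms$n10 + cms$n01 + cms$n11
cms$accuracy   <- (cms$n00 + cms$n11) / total
cms$correct_0s <- cms$n00 / (cms$n00 + cms$n10)
cms$correct_1s <- cms$n11 / (cms$n01 + cms$n11)
cms$balanced   <- (cms$correct_0s + cms$correct_1s) / 2

cbind(cms["threshold"],
      round(cms[c("accuracy", "correct_0s", "correct_1s", "balanced")], 3))
```

Read together, the rows show the trade-off: lowering the threshold moves correct classifications from the 0s to the 1s, and of the three, the threshold at the mean gives the highest balanced accuracy (about 0.55).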