1) Define a binary outcome of your choosing

I will compare respondents who self-report being in bad health with those who do not, based on sex (gender), health plan coverage, checkups, and race/ethnicity.

2) Fit a predictive logistic regression model using as many predictor variables as you think you need

knitr::kable(head(predbrfss2020))
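| badhealth | sex    | hlthpln1 | checkup1  | raceeth |
|----------:|:-------|:---------|:----------|:--------|
|         0 | Female | hp       | 0last2yrs | nhwhite |
|         0 | Male   | nohp     | 1last5yrs | nhblack |
|         0 | Female | hp       | 0last2yrs | nhwhite |
|         0 | Female | hp       | 0last2yrs | nhwhite |
|         0 | Female | hp       | 0last2yrs | nhwhite |
|         0 | Male   | hp       | 0last2yrs | nhwhite |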

3) Use a 80% training/20% test split for your data.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
## The following object is masked from 'package:survival':
## 
##     cluster
set.seed(1014)
train <- createDataPartition(y = predbrfss2020$badhealth, p = .80, list = FALSE)

predbrfss2020train <- predbrfss2020[train, ]
predbrfss2020test <- predbrfss2020[-train, ]

table(predbrfss2020train$badhealth)
## 
##      0      1 
## 129371  20421
prop.table(table(predbrfss2020train$badhealth))
## 
##        0        1 
## 0.863671 0.136329
summary(predbrfss2020train)
##    badhealth          sex          hlthpln1              checkup1     
##  Min.   :0.0000   Female:81038   Length:149792      0last2yrs:135134  
##  1st Qu.:0.0000   Male  :68754   Class :character   1last5yrs: 13941  
##  Median :0.0000                  Mode  :character   2never   :   717  
##  Mean   :0.1363                                                       
##  3rd Qu.:0.0000                                                       
##  Max.   :1.0000                                                       
##      raceeth      
##  hispanic: 16225  
##  nhblack : 15075  
##  nhmulti :  2717  
##  nhother :  6981  
##  nhwhite :108794  
## 

4) Report the % correct classification from the training data using the .5 decision rule and again using the mean of the outcome decision rule

With the .5 threshold the model predicts no 1s at all, as cm0 below shows, so I lowered the threshold to .2. In cm1, overall accuracy with the .2 threshold is 84.6%; however, 97% of the 0s are classified correctly but only 4% of the 1s (which is really bad), and the balanced accuracy is about 50%.

cm0<-confusionMatrix(data = trpredcl,reference = factor(predbrfss2020train$badhealth))
## Warning in confusionMatrix.default(data = trpredcl, reference =
## factor(predbrfss2020train$badhealth)): Levels are not in the same order for
## reference and data. Refactoring data to match.
cm0
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 129371  20421
##          1      0      0
##                                           
##                Accuracy : 0.8637          
##                  95% CI : (0.8619, 0.8654)
##     No Information Rate : 0.8637          
##     P-Value [Acc > NIR] : 0.5019          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8637          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8637          
##          Detection Rate : 0.8637          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 
cm1<-confusionMatrix(data = trpredcl,reference = factor(predbrfss2020train$badhealth))

cm1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 125951  19613
##          1   3420    808
##                                           
##                Accuracy : 0.8462          
##                  95% CI : (0.8444, 0.8481)
##     No Information Rate : 0.8637          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0197          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.97356         
##             Specificity : 0.03957         
##          Pos Pred Value : 0.86526         
##          Neg Pred Value : 0.19111         
##              Prevalence : 0.86367         
##          Detection Rate : 0.84084         
##    Detection Prevalence : 0.97177         
##       Balanced Accuracy : 0.50657         
##                                           
##        'Positive' Class : 0               
## 

4a) Does changing the decision rule threshold affect your classification accuracy?

Yes. When the threshold is lowered to .1 (cm2 below), overall accuracy drops to 19% and the share of 0s classified correctly drops to 7%. The share of 1s classified correctly rises sharply to 96%, and the balanced accuracy rises slightly, by about one percentage point, from 50.7% to 51.5%.

cm2<-confusionMatrix(data = trpredcl,reference = factor(predbrfss2020train$badhealth))

cm2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0   9210    826
##          1 120161  19595
##                                           
##                Accuracy : 0.1923          
##                  95% CI : (0.1903, 0.1943)
##     No Information Rate : 0.8637          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0089          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.07119         
##             Specificity : 0.95955         
##          Pos Pred Value : 0.91770         
##          Neg Pred Value : 0.14021         
##              Prevalence : 0.86367         
##          Detection Rate : 0.06149         
##    Detection Prevalence : 0.06700         
##       Balanced Accuracy : 0.51537         
##                                           
##        'Positive' Class : 0               
## 

5) Report the % correct classification from the test data using the .5 decision rule and again using the mean of the outcome decision rule

Because the .5 threshold predicts no 1s, I again compare the .2 threshold with the mean of the outcome, .1363 (cm3). Note that cm3, like cm1, is computed from the training-set predictions (trpred) rather than from the held-out test set. The .2 threshold has the higher overall accuracy, 84.6% versus 72.0% with the mean. Moving from .2 to .1363 lowers the share of 0s classified correctly from 97% to 78%, but it raises the share of 1s classified correctly from 4% to almost 32%. The balanced accuracy also rises by about 4 percentage points, from 50.7% to 55.0%.

cm3<-confusionMatrix(data = trpredcl,reference = factor(predbrfss2020train$badhealth))

cm3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 101362  13940
##          1  28009   6481
##                                           
##                Accuracy : 0.72            
##                  95% CI : (0.7177, 0.7222)
##     No Information Rate : 0.8637          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0782          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7835          
##             Specificity : 0.3174          
##          Pos Pred Value : 0.8791          
##          Neg Pred Value : 0.1879          
##              Prevalence : 0.8637          
##          Detection Rate : 0.6767          
##    Detection Prevalence : 0.7697          
##       Balanced Accuracy : 0.5504          
##                                           
##        'Positive' Class : 0               
## 
---
title: "Homework Assignment 4"
author: "Selene M. Gomez"
date:  "`r format(Sys.time(), '%d %B, %Y')`"
output:
   html_document:
    df_print: paged
    fig_height: 7
    fig_width: 7
    toc: yes
    toc_float: yes
    code_download: true
---

```{r, include=FALSE}
library(car)
library(stargazer)
library(survey)
library(questionr)
library(dplyr)
library(haven)
library(tidyverse)
library(janitor)
```

```{r, include=FALSE}
brfss2020<- readRDS(url("https://github.com/coreysparks/DEM7283/blob/master/data/brfss20sm.rds?raw=true"))

names(brfss2020) <- tolower(gsub(pattern = "_",replacement =  "",x =  names(brfss2020)))

brfss2020$badhealth<-Recode(brfss2020$genhlth, recodes="4:5=1; 1:3=0; else=NA")

brfss2020$sex<-as.factor(ifelse(brfss2020$sex==1, "Male", "Female"))

brfss2020$hlthpln1<-Recode(brfss2020$hlthpln1, recodes="1='hp' ; 2='nohp'; else=NA")


brfss2020$checkup1<-Recode(brfss2020$checkup1,
                     recodes="1:2='0last2yrs'; 3:4='1last5yrs'; 8='2never'; 7=NA; 9=NA",
                     as.factor=T)
brfss2020$checkup1<-fct_relevel(brfss2020$checkup1,'0last2yrs','1last5yrs','2never') 

brfss2020$black<-Recode(brfss2020$racegr3,
                       recodes="2=1; 9=NA; else=0")
brfss2020$white<-Recode(brfss2020$racegr3,
                       recodes="1=1; 9=NA; else=0")
brfss2020$other<-Recode(brfss2020$racegr3,
                       recodes="3:4=1; 9=NA; else=0")
brfss2020$hispanic<-Recode(brfss2020$racegr3,
                          recodes="5=1; 9=NA; else=0")

brfss2020$raceeth<-Recode(brfss2020$racegr3,
                          recodes="1='nhwhite'; 2='nhblack'; 3='nhother';4='nhmulti'; 5='hispanic'; else=NA",
                          as.factor = T)

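# magrittr supplies the %>% pipe; install it only if it is missing,
# so the document does not reinstall the package on every knit.
if (!requireNamespace("magrittr", quietly = TRUE)) {
  install.packages("magrittr", repos = "http://cran.us.r-project.org")
}
library(magrittr)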

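# Keep respondents with complete data on the outcome and every predictor.
# sex was already recoded to Male/Female above, so the original `sex != 9`
# check could only drop missing values; !is.na(sex) states that directly.
brfss2020 <- brfss2020 %>%
  filter(!is.na(sex),
         !is.na(hlthpln1),
         !is.na(checkup1),
         !is.na(badhealth),
         !is.na(raceeth))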

options(survey.lonely.psu = "adjust")

des<-svydesign(ids=~1, strata=~ststr, weights=~mmsawt, data = brfss2020 )
des
```


#### 1) Define a binary outcome of your choosing

I will compare respondents who self-report being in bad health with those who do not, based on sex (gender), health plan coverage, checkups, and race/ethnicity.
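
As a minimal sketch of this outcome definition, the recode from the setup chunk (`genhlth` codes 4 and 5, fair or poor health, become 1; codes 1 to 3 become 0; anything else becomes missing) can be written in base R. The helper name `recode_badhealth` is hypothetical; the analysis itself uses `car::Recode`.

```r
# Hypothetical helper (not part of the original analysis):
# 1 = fair/poor health, 0 = good or better, NA for any other code.
recode_badhealth <- function(genhlth) {
  out <- rep(NA_real_, length(genhlth))
  out[genhlth %in% 1:3] <- 0
  out[genhlth %in% 4:5] <- 1
  out
}

# Example codes, including "don't know" (7), "refused" (9), and a missing value.
example_codes <- c(1, 3, 4, 5, 7, 9, NA)
recode_badhealth(example_codes)
```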

#### 2) Fit a predictive logistic regression model using as many predictor variables as you think you need

```{r, include=FALSE}
library(dplyr)

# Keep only the outcome and the four predictors used in the model.
predbrfss2020 <- brfss2020 %>%
  dplyr::select(badhealth, sex, hlthpln1, checkup1, raceeth)
```

```{r}
knitr::kable(head(predbrfss2020))

```

#### 3) Use a 80% training/20% test split for your data.

```{r}
library(caret)
set.seed(1014)
train <- createDataPartition(y = predbrfss2020$badhealth, p = .80, list = FALSE)

predbrfss2020train <- predbrfss2020[train, ]
predbrfss2020test <- predbrfss2020[-train, ]

table(predbrfss2020train$badhealth)
prop.table(table(predbrfss2020train$badhealth))
```

```{r}
summary(predbrfss2020train)

```
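
The idea behind an 80/20 split that keeps the share of 1s about the same in both pieces can be sketched in base R. This is an illustration on simulated data, not caret's implementation; `stratified_index` and the `sim` objects are hypothetical names introduced only for this example.

```r
set.seed(1014)

# Simulated 0/1 outcome with roughly 14% ones (illustrative only).
sim <- data.frame(id = 1:5000, y = rbinom(5000, size = 1, prob = 0.14))

# Hypothetical helper: draw a fraction p of the row indices within each class.
stratified_index <- function(y, p = 0.80) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(sim$y, p = 0.80)
sim_train <- sim[train_idx, ]
sim_test  <- sim[-train_idx, ]

# The class shares should be almost identical in the two pieces.
prop.table(table(sim_train$y))
prop.table(table(sim_test$y))
```

Checking the class shares in both pieces, as done above for the training data, is a quick way to confirm that the test set is comparable.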


#### 4) Report the % correct classification from the training data using the .5 decision rule and again using the mean of the outcome decision rule

With the .5 threshold the model predicts no 1s at all, as cm0 below shows, so I lowered the threshold to .2. In cm1, overall accuracy with the .2 threshold is 84.6%; however, 97% of the 0s are classified correctly but only 4% of the 1s (which is really bad), and the balanced accuracy is about 50%.

```{r, include=FALSE}
# Use bare column names with `data =` so that predict() can later score
# new data (for example the test set). glm() treats character predictors
# such as hlthpln1 as factors, so the fitted model is unchanged.
glm1 <- glm(badhealth ~ sex + hlthpln1 + checkup1 + raceeth,
            data = predbrfss2020train,
            family = binomial)

summary(glm1)

```
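
Why the way the formula is written matters for prediction can be sketched on simulated data. All objects below (`sim`, `fit_ok`, `fit_dollar`) are hypothetical stand-ins, not the BRFSS model. When the formula refers to columns through `$`, `predict()` cannot match the terms to `newdata` and silently reuses the training vectors, with only a warning.

```r
set.seed(1014)

# Simulated data standing in for the BRFSS extract (illustrative only).
n <- 1000
sim <- data.frame(
  sex      = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  hlthpln1 = sample(c("hp", "nohp"), n, replace = TRUE, prob = c(0.9, 0.1))
)
sim$badhealth <- rbinom(
  n, 1,
  plogis(-2 + 0.3 * (sim$sex == "Male") + 0.8 * (sim$hlthpln1 == "nohp"))
)

sim_train <- sim[1:800, ]
sim_test  <- sim[801:1000, ]

# Bare column names: predict() looks the variables up in `newdata`.
fit_ok  <- glm(badhealth ~ sex + hlthpln1, data = sim_train, family = binomial)
pred_ok <- predict(fit_ok, newdata = sim_test, type = "response")
length(pred_ok)      # one prediction per test row

# `$` inside the formula: the terms are tied to the training vectors,
# so `newdata` is effectively ignored (R issues a warning about row counts).
fit_dollar <- glm(sim_train$badhealth ~ sim_train$sex + sim_train$hlthpln1,
                  family = binomial)
pred_dollar <- suppressWarnings(
  predict(fit_dollar, newdata = sim_test, type = "response")
)
length(pred_dollar)  # one prediction per training row
```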

```{r, include=FALSE}
trpred<- predict(glm1,
                  newdata = predbrfss2020train,
                  type = "response")

head(trpred)

```

```{r, include=FALSE}
trpredcl<-factor(ifelse(trpred>.5, 1, 0))

as.factor(predbrfss2020train$badhealth)

summary(trpredcl)
summary(predbrfss2020train$badhealth)

predbrfss2020train %>% drop_na()

length(trpredcl)
length(predbrfss2020train$badhealth)


table(trpredcl, predbrfss2020train$badhealth)

```


```{r}

cm0<-confusionMatrix(data = trpredcl,reference = factor(predbrfss2020train$badhealth))

cm0

```


```{r, include=FALSE}
trpredcl<-factor(ifelse(trpred>.2, 1, 0))

as.factor(predbrfss2020train$badhealth)

summary(trpredcl)
summary(predbrfss2020train$badhealth)

predbrfss2020train %>% drop_na()

length(trpredcl)
length(predbrfss2020train$badhealth)


table(trpredcl, predbrfss2020train$badhealth)

```


```{r}

cm1<-confusionMatrix(data = trpredcl,reference = factor(predbrfss2020train$badhealth))

cm1

```
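
The percentages quoted for cm1 can be recomputed by hand from its four cell counts. The sketch below copies the counts from the cm1 output; the object name `cm1_counts` is introduced only for this example. Because caret treats 0 as the positive class here, sensitivity is the share of 0s classified correctly and specificity is the share of 1s classified correctly.

```r
# Counts copied from the cm1 output (rows = prediction, columns = reference).
cm1_counts <- matrix(
  c(125951, 3420,   # Reference = 0: predicted 0, predicted 1
    19613,  808),   # Reference = 1: predicted 0, predicted 1
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

# Overall share of correct classifications.
accuracy <- sum(diag(cm1_counts)) / sum(cm1_counts)

# Share of true 0s predicted as 0 (caret's sensitivity, positive class = 0).
sensitivity <- cm1_counts["0", "0"] / sum(cm1_counts[, "0"])

# Share of true 1s predicted as 1 (caret's specificity).
specificity <- cm1_counts["1", "1"] / sum(cm1_counts[, "1"])

# Balanced accuracy is the simple average of the two class-wise rates.
balanced <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
```

The high overall accuracy is driven almost entirely by the 0s, which make up about 86% of the sample; the balanced accuracy weights both classes equally and so stays near 50%.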

#### 4a) Does changing the decision rule threshold affect your classification accuracy?

Yes. When the threshold is lowered to .1 (cm2 below), overall accuracy drops to 19% and the share of 0s classified correctly drops to 7%. The share of 1s classified correctly rises sharply to 96%, and the balanced accuracy rises slightly, by about one percentage point, from 50.7% to 51.5%.

```{r, include=FALSE}
trpredcl<-factor(ifelse(trpred>.1, 1, 0))

as.factor(predbrfss2020train$badhealth)

summary(trpredcl)
summary(predbrfss2020train$badhealth)

predbrfss2020train %>% drop_na()

length(trpredcl)
length(predbrfss2020train$badhealth)


table(trpredcl, predbrfss2020train$badhealth)

```

```{r}
cm2<-confusionMatrix(data = trpredcl,reference = factor(predbrfss2020train$badhealth))

cm2

```
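
The trade-off described above can be sketched by applying several cutoffs to the same set of predicted probabilities. The example uses simulated data and a hypothetical helper `classify_at`; it is not the BRFSS model, so the numbers differ, but the pattern is general: lowering the cutoff can only raise the share of 1s caught and can only lower the share of 0s caught.

```r
set.seed(1014)

# Simulated outcome with a low base rate (illustrative only).
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + x))

fit   <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)

# Hypothetical helper: classification summary at a single cutoff.
classify_at <- function(p_hat, y, cutoff) {
  pred      <- as.integer(p_hat > cutoff)
  correct_0 <- mean(pred[y == 0] == 0)  # share of true 0s predicted as 0
  correct_1 <- mean(pred[y == 1] == 1)  # share of true 1s predicted as 1
  c(cutoff    = cutoff,
    accuracy  = mean(pred == y),
    correct_0 = correct_0,
    correct_1 = correct_1,
    balanced  = (correct_0 + correct_1) / 2)
}

# The cutoffs discussed in the text: .5, .2, the outcome mean, and .1.
cutoffs <- c(0.5, 0.2, mean(y), 0.1)
results <- t(sapply(cutoffs, function(k) classify_at(p_hat, y, k)))

# Order from the highest cutoff to the lowest to make the trade-off visible.
results <- results[order(results[, "cutoff"], decreasing = TRUE), ]
round(results, 3)
```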
  
#### 5) Report the % correct classification from the test data using the .5 decision rule and again using the mean of the outcome decision rule

Because the .5 threshold predicts no 1s, I again compare the .2 threshold with the mean of the outcome, .1363 (cm3). Note that cm3, like cm1, is computed from the training-set predictions (trpred) rather than from the held-out test set. The .2 threshold has the higher overall accuracy, 84.6% versus 72.0% with the mean. Moving from .2 to .1363 lowers the share of 0s classified correctly from 97% to 78%, but it raises the share of 1s classified correctly from 4% to almost 32%. The balanced accuracy also rises by about 4 percentage points, from 50.7% to 55.0%.

```{r, include=FALSE}
trpredcl<-factor(ifelse(trpred>.1363, 1, 0))

as.factor(predbrfss2020train$badhealth)

summary(trpredcl)
summary(predbrfss2020train$badhealth)

predbrfss2020train %>% drop_na()

length(trpredcl)
length(predbrfss2020train$badhealth)


table(trpredcl, predbrfss2020train$badhealth)

```

```{r}
cm3<-confusionMatrix(data = trpredcl,reference = factor(predbrfss2020train$badhealth))

cm3

```
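
A test-set version of this comparison can be sketched as follows, on simulated data with hypothetical object names. The model is fit on the training rows only, the cutoff is the mean of the training outcome, and both are then applied unchanged to the held-out rows.

```r
set.seed(1014)

# Simulated data standing in for the BRFSS extract (illustrative only).
n <- 5000
sim <- data.frame(
  sex     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  checkup = factor(sample(c("0last2yrs", "1last5yrs"), n,
                          replace = TRUE, prob = c(0.9, 0.1)))
)
sim$badhealth <- rbinom(
  n, 1,
  plogis(-2 + 0.2 * (sim$sex == "Male") + 0.5 * (sim$checkup == "1last5yrs"))
)

# 80/20 split.
in_train  <- sample.int(n, size = 0.8 * n)
sim_train <- sim[in_train, ]
sim_test  <- sim[-in_train, ]

# Fit on the training rows only.
fit <- glm(badhealth ~ sex + checkup, data = sim_train, family = binomial)

# The cutoff comes from the training data, never from the test data.
cutoff <- mean(sim_train$badhealth)

# Score the held-out rows and classify them with the training cutoff.
test_prob  <- predict(fit, newdata = sim_test, type = "response")
test_class <- factor(as.integer(test_prob > cutoff), levels = c(0, 1))
test_truth <- factor(sim_test$badhealth, levels = c(0, 1))

test_table <- table(Prediction = test_class, Reference = test_truth)
test_table
sum(diag(test_table)) / sum(test_table)  # share correct on the test set
```

Fixing the factor levels to `c(0, 1)` on both vectors keeps the table 2 x 2 even when one class is never predicted. With caret loaded, `confusionMatrix(data = test_class, reference = test_truth)` would report the same table along with sensitivity and specificity.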