1) Define a binary outcome of your choosing
The primary binary outcome is the variable College, coded 0 for respondents who have not attended college and 1 for respondents who have attended college.
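A minimal sketch of this coding, assuming the GSS variable educ (completed years of schooling) and mirroring the car::Recode() call shown in the code at the end of this document; the name College_sketch is only illustrative:
library(car)
# Hypothetical sketch: 0-12 years of schooling = 0 (no college), 13-20 years = 1 (college)
College_sketch <- Recode(gss2021_ZERODraft$educ, recodes = "0:12 = 0; 13:20 = 1; else = NA", as.factor = TRUE)
table(College_sketch, useNA = "ifany")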
2) Fit a predictive logistic regression model using as many predictor variables as you think you need
3) Use an 80% training/20% test split for your data.
Report the % correct classification from the training data using the .5 decision rule and again using the mean-of-the-outcome decision rule
Using the .5 decision rule, the classifications were very one-sided: the model predicted college attendance for nearly every case and flagged only a handful of respondents as not having attended college. In other words, the .5 rule is very good at predicting those who have attended college but has major problems with those who have not.
The confusion matrix reflects this, with high accuracy (.75) but extremely low sensitivity (.02) alongside extremely high specificity (.99). These numbers are misleading; taken together with the reference counts in the confusion matrix, they show that the model fits those who have attended college far better than those who have not.
After calculating the mean of the outcome:
mean(I(model.dat2train$College==1))
## [1] 0.7527911
this mean, approximately .75, was used as the adjusted decision threshold.
With this threshold, the confusion matrix evened out considerably, although predictions still favored those who attended college. Accuracy dropped to about 61%, but sensitivity jumped to about 63% while specificity fell to about 56%, and balanced accuracy rose to .59 from the previous model's essentially coin-flip value of .51.
When this model was applied to the 20% held-out test set, the outputs were fairly similar: overall accuracy dropped from 61% to 57%, sensitivity dropped from 63% to 55%, and balanced accuracy dropped from 59% to 56%, while specificity increased from 56% to 57%.
These outcome measures show that the adjusted threshold yields a better-balanced predictive model than the .5 rule; a short sketch of the calculation follows.
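A minimal sketch of how the two training-set % correct figures can be computed, assuming the fitted model glml, the training frame model.dat2train, and the predicted probabilities tr_pred created in the code at the end of this document; the results should match the accuracies of .755 and .612 that confusionMatrix() reports below:
tr_pred <- predict(glml, newdata = model.dat2train, type = "response")
truth   <- model.dat2train$College == "1"    # TRUE when the respondent attended college
mean((tr_pred > .5) == truth)                # % correct under the .5 decision rule
mean((tr_pred > mean(truth)) == truth)       # % correct under the mean-of-the-outcome rule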
3a) Does changing the decision rule threshold affect your classification accuracy?
Yes. Raising the threshold lowered the overall accuracy, but it produced much more balanced sensitivity and specificity, so the model captures both outcome classes rather than essentially predicting "college" for everyone, making it a better fit for these data.
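For reference, balanced accuracy is simply the average of sensitivity and specificity, so the rise to roughly .59 can be checked directly from the values reported below for the mean-of-the-outcome threshold:
(0.6299 + 0.5570) / 2   # = 0.59345, the balanced accuracy caret reports for that model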
4) Report the % correct classification from the test data using the .5 decision rule and again using the mean-of-the-outcome decision rule
Under the .5 rule, the model correctly classified those who did not attend college only about 32% of the time, while the adjusted-mean rule classified them correctly about 63% of the time.
These estimates show that the adjusted-mean rule captures the data better than the .5 rule.
On the test data set, the adjusted-mean rule achieves a correct classification rate of about 76%.
This further suggests that the adjusted-mean rule sets a more realistic threshold for these data, producing more plausible classifications.
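A minimal sketch of the corresponding test-set comparison, assuming glml and model.dat2test from the code at the end of this document; the mean-of-the-outcome figure should match the .566 accuracy reported by confusionMatrix() on the held-out 20%:
pred_test  <- predict(glml, newdata = model.dat2test, type = "response")
truth_test <- model.dat2test$College == "1"
mean((pred_test > .5) == truth_test)                  # % correct under the .5 decision rule
mean((pred_test > mean(truth_test)) == truth_test)    # % correct under the mean-of-the-outcome rule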
library(haven)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(scales)
library(sur)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(summarytools)
library(Rmisc)
## Loading required package: lattice
library(car)
## Loading required package: carData
##
## Attaching package: 'carData'
## The following objects are masked from 'package:sur':
##
## Anscombe, States
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(forcats)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.6 v purrr 0.3.4
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x plyr::arrange() masks dplyr::arrange()
## x readr::col_factor() masks scales::col_factor()
## x purrr::compact() masks plyr::compact()
## x plyr::count() masks dplyr::count()
## x purrr::discard() masks scales::discard()
## x plyr::failwith() masks dplyr::failwith()
## x dplyr::filter() masks stats::filter()
## x plyr::id() masks dplyr::id()
## x dplyr::lag() masks stats::lag()
## x plyr::mutate() masks dplyr::mutate()
## x car::recode() masks dplyr::recode()
## x plyr::rename() masks dplyr::rename()
## x purrr::some() masks car::some()
## x plyr::summarise() masks dplyr::summarise()
## x plyr::summarize() masks dplyr::summarize()
## x tibble::view() masks summarytools::view()
library(survey)
## Loading required package: grid
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(grid)
library(Matrix)
gss2021_ZERODraft<-read_dta("C:\\Users\\BTP\\Desktop\\STATS 2 FOLDER\\2021_sas\\gss2021.dta")
## Key Predictor
gss2021_ZERODraft %>%
tabyl(hispanic)
## hispanic n percent valid_percent
## 1 3544 0.8789682540 0.8864432216
## 2 245 0.0607638889 0.0612806403
## 3 51 0.0126488095 0.0127563782
## 4 21 0.0052083333 0.0052526263
## 5 9 0.0022321429 0.0022511256
## 6 2 0.0004960317 0.0005002501
## 7 4 0.0009920635 0.0010005003
## 8 2 0.0004960317 0.0005002501
## 9 4 0.0009920635 0.0010005003
## 10 5 0.0012400794 0.0012506253
## 11 3 0.0007440476 0.0007503752
## 15 17 0.0042162698 0.0042521261
## 20 4 0.0009920635 0.0010005003
## 21 8 0.0019841270 0.0020010005
## 22 10 0.0024801587 0.0025012506
## 23 5 0.0012400794 0.0012506253
## 24 1 0.0002480159 0.0002501251
## 25 1 0.0002480159 0.0002501251
## 30 46 0.0114087302 0.0115057529
## 35 1 0.0002480159 0.0002501251
## 41 3 0.0007440476 0.0007503752
## 46 2 0.0004960317 0.0005002501
## 47 7 0.0017361111 0.0017508754
## 50 3 0.0007440476 0.0007503752
## NA 34 0.0084325397 NA
## Key dependent
gss2021_ZERODraft %>%
tabyl(educ)
## educ n percent valid_percent
## 0 9 0.0022321429 0.0022692890
## 1 1 0.0002480159 0.0002521432
## 2 2 0.0004960317 0.0005042864
## 3 3 0.0007440476 0.0007564297
## 4 1 0.0002480159 0.0002521432
## 5 2 0.0004960317 0.0005042864
## 6 15 0.0037202381 0.0037821483
## 7 5 0.0012400794 0.0012607161
## 8 25 0.0062003968 0.0063035804
## 9 32 0.0079365079 0.0080685830
## 10 52 0.0128968254 0.0131114473
## 11 83 0.0205853175 0.0209278870
## 12 829 0.2056051587 0.2090267272
## 13 277 0.0687003968 0.0698436712
## 14 542 0.1344246032 0.1366616238
## 15 208 0.0515873016 0.0524457892
## 16 942 0.2336309524 0.2375189107
## 17 258 0.0639880952 0.0650529501
## 18 351 0.0870535714 0.0885022693
## 19 113 0.0280257937 0.0284921836
## 20 216 0.0535714286 0.0544629349
## NA 66 0.0163690476 NA
## Key Control (stratified variable)
gss2021_ZERODraft %>%
tabyl(sex)
## sex n percent valid_percent
## 1 1736 0.43055556 0.4406091
## 2 2204 0.54662698 0.5593909
## NA 92 0.02281746 NA
## Parental Factors
gss2021_ZERODraft %>%
tabyl(incom16)
## incom16 n percent valid_percent
## 1 421 0.10441468 0.11003659
## 2 1013 0.25124008 0.26476738
## 3 1625 0.40302579 0.42472556
## 4 679 0.16840278 0.17746994
## 5 88 0.02182540 0.02300052
## NA 206 0.05109127 NA
gss2021_ZERODraft %>%
tabyl(born)
## born n percent valid_percent
## 1 3516 0.87202381 0.8878788
## 2 444 0.11011905 0.1121212
## NA 72 0.01785714 NA
## Racialization Factors
gss2021_ZERODraft %>%
tabyl(disrspct)
## disrspct n percent valid_percent
## 1 136 0.03373016 0.05228758
## 2 231 0.05729167 0.08881200
## 3 327 0.08110119 0.12572088
## 4 801 0.19866071 0.30795848
## 5 552 0.13690476 0.21222607
## 6 554 0.13740079 0.21299500
## NA 1431 0.35491071 NA
gss2021_ZERODraft %>%
tabyl(notsmart)
## notsmart n percent valid_percent
## 1 101 0.02504960 0.03881630
## 2 129 0.03199405 0.04957725
## 3 202 0.05009921 0.07763259
## 4 684 0.16964286 0.26287471
## 5 619 0.15352183 0.23789393
## 6 867 0.21502976 0.33320523
## NA 1430 0.35466270 NA
## Other Control Variables
gss2021_ZERODraft %>%
tabyl(age)
## age n percent valid_percent
## 18 4 0.0009920635 0.001081373
## 19 14 0.0034722222 0.003784807
## 20 18 0.0044642857 0.004866180
## 21 24 0.0059523810 0.006488240
## 22 31 0.0076884921 0.008380643
## 23 30 0.0074404762 0.008110300
## 24 38 0.0094246032 0.010273047
## 25 46 0.0114087302 0.012435793
## 26 44 0.0109126984 0.011895107
## 27 35 0.0086805556 0.009462017
## 28 59 0.0146329365 0.015950257
## 29 63 0.0156250000 0.017031630
## 30 61 0.0151289683 0.016490943
## 31 56 0.0138888889 0.015139227
## 32 59 0.0146329365 0.015950257
## 33 82 0.0203373016 0.022168154
## 34 69 0.0171130952 0.018653690
## 35 62 0.0153769841 0.016761287
## 36 61 0.0151289683 0.016490943
## 37 74 0.0183531746 0.020005407
## 38 59 0.0146329365 0.015950257
## 39 65 0.0161210317 0.017572317
## 40 63 0.0156250000 0.017031630
## 41 76 0.0188492063 0.020546094
## 42 77 0.0190972222 0.020816437
## 43 56 0.0138888889 0.015139227
## 44 60 0.0148809524 0.016220600
## 45 52 0.0128968254 0.014057853
## 46 61 0.0151289683 0.016490943
## 47 54 0.0133928571 0.014598540
## 48 47 0.0116567460 0.012706137
## 49 54 0.0133928571 0.014598540
## 50 51 0.0126488095 0.013787510
## 51 69 0.0171130952 0.018653690
## 52 53 0.0131448413 0.014328197
## 53 65 0.0161210317 0.017572317
## 54 53 0.0131448413 0.014328197
## 55 65 0.0161210317 0.017572317
## 56 63 0.0156250000 0.017031630
## 57 79 0.0195932540 0.021357124
## 58 60 0.0148809524 0.016220600
## 59 81 0.0200892857 0.021897810
## 60 70 0.0173611111 0.018924034
## 61 76 0.0188492063 0.020546094
## 62 77 0.0190972222 0.020816437
## 63 77 0.0190972222 0.020816437
## 64 68 0.0168650794 0.018383347
## 65 68 0.0168650794 0.018383347
## 66 58 0.0143849206 0.015679913
## 67 89 0.0220734127 0.024060557
## 68 72 0.0178571429 0.019464720
## 69 73 0.0181051587 0.019735064
## 70 71 0.0176091270 0.019194377
## 71 55 0.0136408730 0.014868883
## 72 52 0.0128968254 0.014057853
## 73 48 0.0119047619 0.012976480
## 74 69 0.0171130952 0.018653690
## 75 57 0.0141369048 0.015409570
## 76 34 0.0084325397 0.009191673
## 77 37 0.0091765873 0.010002703
## 78 32 0.0079365079 0.008650987
## 79 40 0.0099206349 0.010813733
## 80 30 0.0074404762 0.008110300
## 81 31 0.0076884921 0.008380643
## 82 18 0.0044642857 0.004866180
## 83 17 0.0042162698 0.004595837
## 84 21 0.0052083333 0.005677210
## 85 16 0.0039682540 0.004325493
## 86 14 0.0034722222 0.003784807
## 87 5 0.0012400794 0.001351717
## 88 5 0.0012400794 0.001351717
## 89 26 0.0064484127 0.007028927
## NA 333 0.0825892857 NA
gss2021_ZERODraft %>%
tabyl(coninc)
## coninc n percent valid_percent
## 336.0 64 0.015873016 0.018238814
## 1344.0 50 0.012400794 0.014249074
## 2352.0 19 0.004712302 0.005414648
## 3024.0 11 0.002728175 0.003134796
## 3696.0 17 0.004216270 0.004844685
## 4368.0 12 0.002976190 0.003419778
## 5040.0 18 0.004464286 0.005129667
## 6048.0 54 0.013392857 0.015389000
## 7560.0 87 0.021577381 0.024793388
## 9240.0 80 0.019841270 0.022798518
## 10920.0 74 0.018353175 0.021088629
## 12600.0 66 0.016369048 0.018808777
## 14280.0 80 0.019841270 0.022798518
## 15960.0 69 0.017113095 0.019663722
## 18480.0 131 0.032490079 0.037332573
## 21840.0 155 0.038442460 0.044172129
## 25200.0 136 0.033730159 0.038757481
## 30240.0 266 0.065972222 0.075805073
## 36960.0 253 0.062748016 0.072100313
## 45360.0 367 0.091021825 0.104588202
## 55440.0 314 0.077876984 0.089484184
## 67200.0 326 0.080853175 0.092903961
## 80640.0 207 0.051339286 0.058991166
## 94080.0 141 0.034970238 0.040182388
## 107520.0 140 0.034722222 0.039897407
## 168736.3 372 0.092261905 0.106013109
## NA 523 0.129712302 NA
Recode hispanic
gss2021_ZERODraft %>%
tabyl(hispanic)
## hispanic n percent valid_percent
## 1 3544 0.8789682540 0.8864432216
## 2 245 0.0607638889 0.0612806403
## 3 51 0.0126488095 0.0127563782
## 4 21 0.0052083333 0.0052526263
## 5 9 0.0022321429 0.0022511256
## 6 2 0.0004960317 0.0005002501
## 7 4 0.0009920635 0.0010005003
## 8 2 0.0004960317 0.0005002501
## 9 4 0.0009920635 0.0010005003
## 10 5 0.0012400794 0.0012506253
## 11 3 0.0007440476 0.0007503752
## 15 17 0.0042162698 0.0042521261
## 20 4 0.0009920635 0.0010005003
## 21 8 0.0019841270 0.0020010005
## 22 10 0.0024801587 0.0025012506
## 23 5 0.0012400794 0.0012506253
## 24 1 0.0002480159 0.0002501251
## 25 1 0.0002480159 0.0002501251
## 30 46 0.0114087302 0.0115057529
## 35 1 0.0002480159 0.0002501251
## 41 3 0.0007440476 0.0007503752
## 46 2 0.0004960317 0.0005002501
## 47 7 0.0017361111 0.0017508754
## 50 3 0.0007440476 0.0007503752
## NA 34 0.0084325397 NA
gss2021_ZERODraft$subgrouphis <-Recode(gss2021_ZERODraft$hispanic, recodes="1 = 0; 2:50 = 1; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgrouphis)
## subgrouphis n percent valid_percent
## 0 3544 0.87896825 0.8864432
## 1 454 0.11259921 0.1135568
## <NA> 34 0.00843254 NA
subgrouphis_1<-as.factor(ifelse(gss2021_ZERODraft$subgrouphis==1, "Hispanic", "Non Hispanic"))
tabyl(subgrouphis_1)
## subgrouphis_1 n percent valid_percent
## Hispanic 454 0.11259921 0.1135568
## Non Hispanic 3544 0.87896825 0.8864432
## <NA> 34 0.00843254 NA
Recode education: college
gss2021_ZERODraft %>%
tabyl(educ)
## educ n percent valid_percent
## 0 9 0.0022321429 0.0022692890
## 1 1 0.0002480159 0.0002521432
## 2 2 0.0004960317 0.0005042864
## 3 3 0.0007440476 0.0007564297
## 4 1 0.0002480159 0.0002521432
## 5 2 0.0004960317 0.0005042864
## 6 15 0.0037202381 0.0037821483
## 7 5 0.0012400794 0.0012607161
## 8 25 0.0062003968 0.0063035804
## 9 32 0.0079365079 0.0080685830
## 10 52 0.0128968254 0.0131114473
## 11 83 0.0205853175 0.0209278870
## 12 829 0.2056051587 0.2090267272
## 13 277 0.0687003968 0.0698436712
## 14 542 0.1344246032 0.1366616238
## 15 208 0.0515873016 0.0524457892
## 16 942 0.2336309524 0.2375189107
## 17 258 0.0639880952 0.0650529501
## 18 351 0.0870535714 0.0885022693
## 19 113 0.0280257937 0.0284921836
## 20 216 0.0535714286 0.0544629349
## NA 66 0.0163690476 NA
gss2021_ZERODraft$subgroupeducCollege <-Recode(gss2021_ZERODraft$educ, recodes="0:12 = 0; 13:20 = 1; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgroupeducCollege)
## subgroupeducCollege n percent valid_percent
## 0 1059 0.26264881 0.2670197
## 1 2907 0.72098214 0.7329803
## <NA> 66 0.01636905 NA
subgroupEduCollege_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupeducCollege==1, "College", "No College"))
tabyl(subgroupEduCollege_1)
## subgroupEduCollege_1 n percent valid_percent
## College 2907 0.72098214 0.7329803
## No College 1059 0.26264881 0.2670197
## <NA> 66 0.01636905 NA
Education: high school only
gss2021_ZERODraft %>%
tabyl(educ)
## educ n percent valid_percent
## 0 9 0.0022321429 0.0022692890
## 1 1 0.0002480159 0.0002521432
## 2 2 0.0004960317 0.0005042864
## 3 3 0.0007440476 0.0007564297
## 4 1 0.0002480159 0.0002521432
## 5 2 0.0004960317 0.0005042864
## 6 15 0.0037202381 0.0037821483
## 7 5 0.0012400794 0.0012607161
## 8 25 0.0062003968 0.0063035804
## 9 32 0.0079365079 0.0080685830
## 10 52 0.0128968254 0.0131114473
## 11 83 0.0205853175 0.0209278870
## 12 829 0.2056051587 0.2090267272
## 13 277 0.0687003968 0.0698436712
## 14 542 0.1344246032 0.1366616238
## 15 208 0.0515873016 0.0524457892
## 16 942 0.2336309524 0.2375189107
## 17 258 0.0639880952 0.0650529501
## 18 351 0.0870535714 0.0885022693
## 19 113 0.0280257937 0.0284921836
## 20 216 0.0535714286 0.0544629349
## NA 66 0.0163690476 NA
gss2021_ZERODraft$subgroupeducHS <-Recode(gss2021_ZERODraft$educ, recodes="0:11 = 0; 12:12 = 1; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgroupeducHS)
## subgroupeducHS n percent valid_percent
## 0 230 0.05704365 0.217186
## 1 829 0.20560516 0.782814
## <NA> 2973 0.73735119 NA
subgroupEduHS_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupeducHS==1, "Highschool", "No Highschool"))
tabyl(subgroupEduHS_1)
## subgroupEduHS_1 n percent valid_percent
## Highschool 829 0.20560516 0.782814
## No Highschool 230 0.05704365 0.217186
## <NA> 2973 0.73735119 NA
Sex
gss2021_ZERODraft %>%
tabyl(sex)
## sex n percent valid_percent
## 1 1736 0.43055556 0.4406091
## 2 2204 0.54662698 0.5593909
## NA 92 0.02281746 NA
gss2021_ZERODraft$subgroupsex <-Recode(gss2021_ZERODraft$sex, recodes="1:1 = 0; 2:2 = 1; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgroupsex)
## subgroupsex n percent valid_percent
## 0 1736 0.43055556 0.4406091
## 1 2204 0.54662698 0.5593909
## <NA> 92 0.02281746 NA
subgroupsex_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupsex==1, "Women", "Men"))
tabyl(subgroupsex_1)
## subgroupsex_1 n percent valid_percent
## Men 1736 0.43055556 0.4406091
## Women 2204 0.54662698 0.5593909
## <NA> 92 0.02281746 NA
Income of respondent's household at age 16
gss2021_ZERODraft %>%
tabyl(incom16)
## incom16 n percent valid_percent
## 1 421 0.10441468 0.11003659
## 2 1013 0.25124008 0.26476738
## 3 1625 0.40302579 0.42472556
## 4 679 0.16840278 0.17746994
## 5 88 0.02182540 0.02300052
## NA 206 0.05109127 NA
gss2021_ZERODraft$subgroupincom16 <-Recode(gss2021_ZERODraft$incom16, recodes="1:2 = 0; 3:5 = 1; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgroupincom16)
## subgroupincom16 n percent valid_percent
## 0 1434 0.35565476 0.374804
## 1 2392 0.59325397 0.625196
## <NA> 206 0.05109127 NA
subgroupincom16_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupincom16==1, "secure economic resources at 16", "insecure economic resources"))
tabyl(subgroupincom16_1)
## subgroupincom16_1 n percent valid_percent
## insecure economic resources 1434 0.35565476 0.374804
## secure economic resources at 16 2392 0.59325397 0.625196
## <NA> 206 0.05109127 NA
Whether respondent was born in the US
gss2021_ZERODraft %>%
tabyl(born)
## born n percent valid_percent
## 1 3516 0.87202381 0.8878788
## 2 444 0.11011905 0.1121212
## NA 72 0.01785714 NA
gss2021_ZERODraft$subgroupBorn <-Recode(gss2021_ZERODraft$born, recodes="1:1 = 1; 2:2 = 0; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgroupBorn)
## subgroupBorn n percent valid_percent
## 0 444 0.11011905 0.1121212
## 1 3516 0.87202381 0.8878788
## <NA> 72 0.01785714 NA
subgroupborn_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupBorn==1, "Born in US", "Not born in US"))
tabyl(subgroupborn_1)
## subgroupborn_1 n percent valid_percent
## Born in US 3516 0.87202381 0.8878788
## Not born in US 444 0.11011905 0.1121212
## <NA> 72 0.01785714 NA
Whether respondent has ever been disrespected
gss2021_ZERODraft %>%
tabyl(disrspct)
## disrspct n percent valid_percent
## 1 136 0.03373016 0.05228758
## 2 231 0.05729167 0.08881200
## 3 327 0.08110119 0.12572088
## 4 801 0.19866071 0.30795848
## 5 552 0.13690476 0.21222607
## 6 554 0.13740079 0.21299500
## NA 1431 0.35491071 NA
gss2021_ZERODraft$subgroupdisrspct <-Recode(gss2021_ZERODraft$disrspct, recodes="1:5 = 1; 6:6 = 0; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgroupdisrspct)
## subgroupdisrspct n percent valid_percent
## 0 554 0.1374008 0.212995
## 1 2047 0.5076885 0.787005
## <NA> 1431 0.3549107 NA
subgroupdisrspct_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupdisrspct==1, "respondent has been disrespected", "Not being disrespected"))
tabyl(subgroupdisrspct_1)
## subgroupdisrspct_1 n percent valid_percent
## Not being disrespected 554 0.1374008 0.212995
## respondent has been disrespected 2047 0.5076885 0.787005
## <NA> 1431 0.3549107 NA
Whether respondent has ever been called or treated as if they were not smart
gss2021_ZERODraft %>%
tabyl(notsmart)
## notsmart n percent valid_percent
## 1 101 0.02504960 0.03881630
## 2 129 0.03199405 0.04957725
## 3 202 0.05009921 0.07763259
## 4 684 0.16964286 0.26287471
## 5 619 0.15352183 0.23789393
## 6 867 0.21502976 0.33320523
## NA 1430 0.35466270 NA
gss2021_ZERODraft$subgroupnotsmart <-Recode(gss2021_ZERODraft$notsmart, recodes="1:5 = 1; 6:6 = 0; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgroupnotsmart)
## subgroupnotsmart n percent valid_percent
## 0 867 0.2150298 0.3332052
## 1 1735 0.4303075 0.6667948
## <NA> 1430 0.3546627 NA
subgroupnotsmart_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupnotsmart==1, "respondent was told or treated as if they are not smart", "Never experienced that sort of treatment of not being smart"))
tabyl(subgroupnotsmart_1)
## subgroupnotsmart_1 n percent
## Never experienced that sort of treatment of not being smart 867 0.2150298
## respondent was told or treated as if they are not smart 1735 0.4303075
## <NA> 1430 0.3546627
## valid_percent
## 0.3332052
## 0.6667948
## NA
Age cut into intervals
age1<-cut(gss2021_ZERODraft$age,
breaks = c(0,24,39,59,79,99))
Creating model.dat2
library(dplyr)
model.dat2 <- gss2021_ZERODraft %>%
mutate(hispanic = subgrouphis ,
College = subgroupeducCollege ,
HS = subgroupeducHS ,
born = subgroupBorn ,
income16 = subgroupincom16 ,
Disrespect = subgroupdisrspct ,
NotSmart = subgroupnotsmart ,
sex = subgroupsex ,
age = age1) %>%
dplyr::select(id, College, hispanic, born, income16, Disrespect, NotSmart, sex, age)%>% filter(complete.cases(.))
knitr::kable(head(model.dat2))
| id | College | hispanic | born | income16 | Disrespect | NotSmart | sex | age     |
|---:|:--------|:---------|:-----|:---------|:-----------|:---------|:----|:--------|
|  1 | 0       | 0        | 1    | 0        | 1          | 0        | 1   | (59,79] |
|  7 | 1       | 0        | 1    | 1        | 0          | 1        | 1   | (24,39] |
| 12 | 1       | 0        | 0    | 1        | 1          | 1        | 1   | (59,79] |
| 13 | 0       | 0        | 1    | 0        | 1          | 1        | 0   | (24,39] |
| 14 | 1       | 0        | 0    | 1        | 0          | 0        | 0   | (0,24]  |
| 15 | 1       | 0        | 1    | 1        | 1          | 0        | 1   | (59,79] |
table(model.dat2$income16)   # incom16 was renamed to income16 in model.dat2
##
##    0    1
##  877 1474
summary(model.dat2)
## id College hispanic born income16 Disrespect NotSmart
## Min. : 1 0: 581 0:2121 0: 225 0: 877 0: 485 0: 762
## 1st Qu.:1130 1:1770 1: 230 1:2126 1:1474 1:1866 1:1589
## Median :2189
## Mean :2215
## 3rd Qu.:3319
## Max. :4470
## sex age
## 0:1060 (0,24] : 92
## 1:1291 (24,39]:571
## (39,59]:780
## (59,79]:797
## (79,99]:111
##
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
set.seed(1115)
train<-createDataPartition(y=model.dat2$College,
p = .80,
list = F)
model.dat2train<-model.dat2[train,]
model.dat2test<-model.dat2[-train,]
table(model.dat2train$College)
##
## 0 1
## 465 1416
prop.table(table(model.dat2train$College))
##
## 0 1
## 0.2472089 0.7527911
summary(model.dat2train)
## id College hispanic born income16 Disrespect NotSmart
## Min. : 7 0: 465 0:1696 0: 180 0: 691 0: 383 0: 607
## 1st Qu.:1108 1:1416 1: 185 1:1701 1:1190 1:1498 1:1274
## Median :2169
## Mean :2190
## 3rd Qu.:3291
## Max. :4470
## sex age
## 0: 830 (0,24] : 69
## 1:1051 (24,39]:461
## (39,59]:633
## (59,79]:632
## (79,99]: 86
##
glml<-glm(College ~ hispanic + born + income16 + Disrespect + NotSmart + sex + age,
data = model.dat2train,
family = binomial)
summary(glml)
##
## Call:
## glm(formula = College ~ hispanic + born + income16 + Disrespect +
## NotSmart + sex + age, family = binomial, data = model.dat2train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0939 0.4983 0.6451 0.7796 1.3155
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.19211 0.34022 0.565 0.572301
## hispanic1 -0.57906 0.17849 -3.244 0.001178 **
## born1 -0.25113 0.19704 -1.275 0.202472
## income161 0.61754 0.11156 5.536 3.1e-08 ***
## Disrespect1 0.26918 0.15274 1.762 0.078015 .
## NotSmart1 0.05003 0.13634 0.367 0.713672
## sex1 -0.20822 0.11156 -1.866 0.061979 .
## age(24,39] 0.94488 0.27514 3.434 0.000594 ***
## age(39,59] 0.79895 0.26896 2.971 0.002973 **
## age(59,79] 0.58645 0.26977 2.174 0.029715 *
## age(79,99] 0.62754 0.35518 1.767 0.077258 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2103.9 on 1880 degrees of freedom
## Residual deviance: 2039.6 on 1870 degrees of freedom
## AIC: 2061.6
##
## Number of Fisher Scoring iterations: 4
tr_pred<- predict(glml,
newdata = model.dat2train,
type = "response")
head(tr_pred)
## 1 2 3 4 5 6
## 0.7933452 0.8186289 0.7694253 0.6920354 0.7063517 0.5987661
tr_predcl<-factor(ifelse(tr_pred>.5, 1, 0))
library(ggplot2)
pred1<-data.frame(pr=tr_pred,
gr=tr_predcl,
College=model.dat2train$College)
pred1%>%
ggplot()+
geom_histogram(aes(x=pr, color=gr, fill=gr))+
ggtitle(label = "Probability of Hispanics going to College Compared to their Non-Hispanic White Counterparts",
subtitle = "Threshold = .5")+
geom_vline(xintercept=.5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pred1%>%
ggplot()+
geom_histogram(aes(x=pr, color=College, fill=College))+
ggtitle(label = "Probability of attending College between Hispanics and NH Whites",
subtitle = "Truth")+
geom_vline(xintercept=.5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

table( tr_predcl,
model.dat2train$College)
##
## tr_predcl 0 1
## 0 11 6
## 1 454 1410
cm1<-confusionMatrix(data = tr_predcl,
reference = model.dat2train$College )
cm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 11 6
## 1 454 1410
##
## Accuracy : 0.7554
## 95% CI : (0.7354, 0.7747)
## No Information Rate : 0.7528
## P-Value [Acc > NIR] : 0.4066
##
## Kappa : 0.0287
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.023656
## Specificity : 0.995763
## Pos Pred Value : 0.647059
## Neg Pred Value : 0.756438
## Prevalence : 0.247209
## Detection Rate : 0.005848
## Detection Prevalence : 0.009038
## Balanced Accuracy : 0.509709
##
## 'Positive' Class : 0
##
tr_predcl<-factor(ifelse(tr_pred>mean(I(model.dat2train$College==1)), 1, 0)) #mean of response
pred2<-data.frame(pr=tr_pred,
gr=tr_predcl,
College=model.dat2train$College)
pred2%>%
ggplot(aes(x=pr, fill=gr))+
geom_histogram(position="identity",
alpha=.7)+
ggtitle(label = "Probability of attending College between Hispanics and NH Whites",
subtitle = "Threshold = Mean")+
geom_vline(xintercept=mean(I(model.dat2train$College==1)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pred2%>%
ggplot(aes(x=pr, fill=College))+
geom_histogram(position="identity",
alpha=.7)+
ggtitle(label = "Probability of attending College between Hispanics and NH Whites",
subtitle = "Truth")+
geom_vline(xintercept=mean(I(model.dat2train$College==1)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

confusionMatrix(data = tr_predcl,
model.dat2train$College,
positive = "1" )
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 259 524
## 1 206 892
##
## Accuracy : 0.6119
## 95% CI : (0.5895, 0.634)
## No Information Rate : 0.7528
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.152
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6299
## Specificity : 0.5570
## Pos Pred Value : 0.8124
## Neg Pred Value : 0.3308
## Prevalence : 0.7528
## Detection Rate : 0.4742
## Detection Prevalence : 0.5837
## Balanced Accuracy : 0.5935
##
## 'Positive' Class : 1
##
pred_test<-predict(glml,
newdata=model.dat2test,
type="response")
pred_cl<-factor(ifelse(pred_test > mean( I(model.dat2test$College==1)), 1, 0))
table(model.dat2test$College,pred_cl)
## pred_cl
## 0 1
## 0 64 52
## 1 152 202
confusionMatrix(data = pred_cl,model.dat2test$College )
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 64 152
## 1 52 202
##
## Accuracy : 0.566
## 95% CI : (0.5198, 0.6113)
## No Information Rate : 0.7532
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0949
##
## Mcnemar's Test P-Value : 4.167e-12
##
## Sensitivity : 0.5517
## Specificity : 0.5706
## Pos Pred Value : 0.2963
## Neg Pred Value : 0.7953
## Prevalence : 0.2468
## Detection Rate : 0.1362
## Detection Prevalence : 0.4596
## Balanced Accuracy : 0.5612
##
## 'Positive' Class : 0
##