1) Define a binary outcome of your choosing
The primary binary outcome is the variable College, coded 0 for respondents who have not attended college and 1 for respondents who have attended college.
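A minimal sketch of this coding, assuming the GSS variable educ (completed years of schooling) and mirroring the car::Recode() call shown in the code at the end of this document; the name College_sketch is only illustrative:
library(car)
# Hypothetical sketch: 0-12 years of schooling = 0 (no college), 13-20 years = 1 (college)
College_sketch <- Recode(gss2021_ZERODraft$educ, recodes = "0:12 = 0; 13:20 = 1; else = NA", as.factor = TRUE)
table(College_sketch, useNA = "ifany")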
2) Fit a predictive logistic regression model using as many predictor variables as you think you need
3) Use an 80% training/20% test split for your data.
Report the % correct classification from the training data using the .5 decision rule and again using the mean-of-the-outcome decision rule
Using the .5 decision rule, the classifications were very one-sided: the model predicted college attendance for nearly every case and flagged only a handful of respondents as not having attended college. In other words, the .5 rule is very good at predicting those who have attended college but has major problems with those who have not.
The confusion matrix reflects this, with high accuracy (.75) but extremely low sensitivity (.02) alongside extremely high specificity (.99). These numbers are misleading; taken together with the reference counts in the confusion matrix, they show that the model fits those who have attended college far better than those who have not.
After calculating the mean of the outcome:
mean(I(model.dat2train$College==1))
## [1] 0.7527911
this mean, approximately .75, was used as the adjusted decision threshold.
With this threshold, the confusion matrix evened out considerably, although predictions still favored those who attended college. Accuracy dropped to about 61%, but sensitivity jumped to about 63% while specificity fell to about 56%, and balanced accuracy rose to .59 from the previous model's essentially coin-flip value of .51.
When this model was applied to the 20% held-out test set, the outputs were fairly similar: overall accuracy dropped from 61% to 57%, sensitivity dropped from 63% to 55%, and balanced accuracy dropped from 59% to 56%, while specificity increased from 56% to 57%.
These outcome measures show that the adjusted threshold yields a better-balanced predictive model than the .5 rule; a short sketch of the calculation follows.
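A minimal sketch of how the two training-set % correct figures can be computed, assuming the fitted model glml, the training frame model.dat2train, and the predicted probabilities tr_pred created in the code at the end of this document; the results should match the accuracies of .755 and .612 that confusionMatrix() reports below:
tr_pred <- predict(glml, newdata = model.dat2train, type = "response")
truth   <- model.dat2train$College == "1"    # TRUE when the respondent attended college
mean((tr_pred > .5) == truth)                # % correct under the .5 decision rule
mean((tr_pred > mean(truth)) == truth)       # % correct under the mean-of-the-outcome rule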
3a) Does changing the decision rule threshold affect your classification accuracy?
Yes. Raising the threshold lowered the overall accuracy, but it produced much more balanced sensitivity and specificity, so the model captures both outcome classes rather than essentially predicting "college" for everyone, making it a better fit for these data.
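For reference, balanced accuracy is simply the average of sensitivity and specificity, so the rise to roughly .59 can be checked directly from the values reported below for the mean-of-the-outcome threshold:
(0.6299 + 0.5570) / 2   # = 0.59345, the balanced accuracy caret reports for that model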
4) Report the % correct classification from the test data using the .5 decision rule and again using the mean-of-the-outcome decision rule
Under the .5 rule, the model correctly classified those who did not attend college only about 32% of the time, while the adjusted-mean rule classified them correctly about 63% of the time.
These estimates show that the adjusted-mean rule captures the data better than the .5 rule.
On the test data set, the adjusted-mean rule achieves a correct classification rate of about 76%.
This further suggests that the adjusted-mean rule sets a more realistic threshold for these data, producing more plausible classifications.
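A minimal sketch of the corresponding test-set comparison, assuming glml and model.dat2test from the code at the end of this document; the mean-of-the-outcome figure should match the .566 accuracy reported by confusionMatrix() on the held-out 20%:
pred_test  <- predict(glml, newdata = model.dat2test, type = "response")
truth_test <- model.dat2test$College == "1"
mean((pred_test > .5) == truth_test)                  # % correct under the .5 decision rule
mean((pred_test > mean(truth_test)) == truth_test)    # % correct under the mean-of-the-outcome rule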
library(haven)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(scales)
library(sur)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(summarytools)
library(Rmisc)
## Loading required package: lattice
library(car)
## Loading required package: carData
##
## Attaching package: 'carData'
## The following objects are masked from 'package:sur':
##
## Anscombe, States
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(forcats)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.6 v purrr 0.3.4
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x plyr::arrange() masks dplyr::arrange()
## x readr::col_factor() masks scales::col_factor()
## x purrr::compact() masks plyr::compact()
## x plyr::count() masks dplyr::count()
## x purrr::discard() masks scales::discard()
## x plyr::failwith() masks dplyr::failwith()
## x dplyr::filter() masks stats::filter()
## x plyr::id() masks dplyr::id()
## x dplyr::lag() masks stats::lag()
## x plyr::mutate() masks dplyr::mutate()
## x car::recode() masks dplyr::recode()
## x plyr::rename() masks dplyr::rename()
## x purrr::some() masks car::some()
## x plyr::summarise() masks dplyr::summarise()
## x plyr::summarize() masks dplyr::summarize()
## x tibble::view() masks summarytools::view()
library(survey)
## Loading required package: grid
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(grid)
library(Matrix)
gss2021_ZERODraft<-read_dta("C:\\Users\\BTP\\Desktop\\STATS 2 FOLDER\\2021_sas\\gss2021.dta")
## Key Predictor
gss2021_ZERODraft %>%
tabyl(hispanic)
## hispanic n percent valid_percent
## 1 3544 0.8789682540 0.8864432216
## 2 245 0.0607638889 0.0612806403
## 3 51 0.0126488095 0.0127563782
## 4 21 0.0052083333 0.0052526263
## 5 9 0.0022321429 0.0022511256
## 6 2 0.0004960317 0.0005002501
## 7 4 0.0009920635 0.0010005003
## 8 2 0.0004960317 0.0005002501
## 9 4 0.0009920635 0.0010005003
## 10 5 0.0012400794 0.0012506253
## 11 3 0.0007440476 0.0007503752
## 15 17 0.0042162698 0.0042521261
## 20 4 0.0009920635 0.0010005003
## 21 8 0.0019841270 0.0020010005
## 22 10 0.0024801587 0.0025012506
## 23 5 0.0012400794 0.0012506253
## 24 1 0.0002480159 0.0002501251
## 25 1 0.0002480159 0.0002501251
## 30 46 0.0114087302 0.0115057529
## 35 1 0.0002480159 0.0002501251
## 41 3 0.0007440476 0.0007503752
## 46 2 0.0004960317 0.0005002501
## 47 7 0.0017361111 0.0017508754
## 50 3 0.0007440476 0.0007503752
## NA 34 0.0084325397 NA
## Key dependent
gss2021_ZERODraft %>%
tabyl(educ)
## educ n percent valid_percent
## 0 9 0.0022321429 0.0022692890
## 1 1 0.0002480159 0.0002521432
## 2 2 0.0004960317 0.0005042864
## 3 3 0.0007440476 0.0007564297
## 4 1 0.0002480159 0.0002521432
## 5 2 0.0004960317 0.0005042864
## 6 15 0.0037202381 0.0037821483
## 7 5 0.0012400794 0.0012607161
## 8 25 0.0062003968 0.0063035804
## 9 32 0.0079365079 0.0080685830
## 10 52 0.0128968254 0.0131114473
## 11 83 0.0205853175 0.0209278870
## 12 829 0.2056051587 0.2090267272
## 13 277 0.0687003968 0.0698436712
## 14 542 0.1344246032 0.1366616238
## 15 208 0.0515873016 0.0524457892
## 16 942 0.2336309524 0.2375189107
## 17 258 0.0639880952 0.0650529501
## 18 351 0.0870535714 0.0885022693
## 19 113 0.0280257937 0.0284921836
## 20 216 0.0535714286 0.0544629349
## NA 66 0.0163690476 NA
## Key Control (stratified variable)
gss2021_ZERODraft %>%
tabyl(sex)
## sex n percent valid_percent
## 1 1736 0.43055556 0.4406091
## 2 2204 0.54662698 0.5593909
## NA 92 0.02281746 NA
## Parental Factors
gss2021_ZERODraft %>%
tabyl(incom16)
## incom16 n percent valid_percent
## 1 421 0.10441468 0.11003659
## 2 1013 0.25124008 0.26476738
## 3 1625 0.40302579 0.42472556
## 4 679 0.16840278 0.17746994
## 5 88 0.02182540 0.02300052
## NA 206 0.05109127 NA
gss2021_ZERODraft %>%
tabyl(born)
## born n percent valid_percent
## 1 3516 0.87202381 0.8878788
## 2 444 0.11011905 0.1121212
## NA 72 0.01785714 NA
## Racialization Factors
gss2021_ZERODraft %>%
tabyl(disrspct)
## disrspct n percent valid_percent
## 1 136 0.03373016 0.05228758
## 2 231 0.05729167 0.08881200
## 3 327 0.08110119 0.12572088
## 4 801 0.19866071 0.30795848
## 5 552 0.13690476 0.21222607
## 6 554 0.13740079 0.21299500
## NA 1431 0.35491071 NA
gss2021_ZERODraft %>%
tabyl(notsmart)
## notsmart n percent valid_percent
## 1 101 0.02504960 0.03881630
## 2 129 0.03199405 0.04957725
## 3 202 0.05009921 0.07763259
## 4 684 0.16964286 0.26287471
## 5 619 0.15352183 0.23789393
## 6 867 0.21502976 0.33320523
## NA 1430 0.35466270 NA
## Other Control Variables
gss2021_ZERODraft %>%
tabyl(age)
## age n percent valid_percent
## 18 4 0.0009920635 0.001081373
## 19 14 0.0034722222 0.003784807
## 20 18 0.0044642857 0.004866180
## 21 24 0.0059523810 0.006488240
## 22 31 0.0076884921 0.008380643
## 23 30 0.0074404762 0.008110300
## 24 38 0.0094246032 0.010273047
## 25 46 0.0114087302 0.012435793
## 26 44 0.0109126984 0.011895107
## 27 35 0.0086805556 0.009462017
## 28 59 0.0146329365 0.015950257
## 29 63 0.0156250000 0.017031630
## 30 61 0.0151289683 0.016490943
## 31 56 0.0138888889 0.015139227
## 32 59 0.0146329365 0.015950257
## 33 82 0.0203373016 0.022168154
## 34 69 0.0171130952 0.018653690
## 35 62 0.0153769841 0.016761287
## 36 61 0.0151289683 0.016490943
## 37 74 0.0183531746 0.020005407
## 38 59 0.0146329365 0.015950257
## 39 65 0.0161210317 0.017572317
## 40 63 0.0156250000 0.017031630
## 41 76 0.0188492063 0.020546094
## 42 77 0.0190972222 0.020816437
## 43 56 0.0138888889 0.015139227
## 44 60 0.0148809524 0.016220600
## 45 52 0.0128968254 0.014057853
## 46 61 0.0151289683 0.016490943
## 47 54 0.0133928571 0.014598540
## 48 47 0.0116567460 0.012706137
## 49 54 0.0133928571 0.014598540
## 50 51 0.0126488095 0.013787510
## 51 69 0.0171130952 0.018653690
## 52 53 0.0131448413 0.014328197
## 53 65 0.0161210317 0.017572317
## 54 53 0.0131448413 0.014328197
## 55 65 0.0161210317 0.017572317
## 56 63 0.0156250000 0.017031630
## 57 79 0.0195932540 0.021357124
## 58 60 0.0148809524 0.016220600
## 59 81 0.0200892857 0.021897810
## 60 70 0.0173611111 0.018924034
## 61 76 0.0188492063 0.020546094
## 62 77 0.0190972222 0.020816437
## 63 77 0.0190972222 0.020816437
## 64 68 0.0168650794 0.018383347
## 65 68 0.0168650794 0.018383347
## 66 58 0.0143849206 0.015679913
## 67 89 0.0220734127 0.024060557
## 68 72 0.0178571429 0.019464720
## 69 73 0.0181051587 0.019735064
## 70 71 0.0176091270 0.019194377
## 71 55 0.0136408730 0.014868883
## 72 52 0.0128968254 0.014057853
## 73 48 0.0119047619 0.012976480
## 74 69 0.0171130952 0.018653690
## 75 57 0.0141369048 0.015409570
## 76 34 0.0084325397 0.009191673
## 77 37 0.0091765873 0.010002703
## 78 32 0.0079365079 0.008650987
## 79 40 0.0099206349 0.010813733
## 80 30 0.0074404762 0.008110300
## 81 31 0.0076884921 0.008380643
## 82 18 0.0044642857 0.004866180
## 83 17 0.0042162698 0.004595837
## 84 21 0.0052083333 0.005677210
## 85 16 0.0039682540 0.004325493
## 86 14 0.0034722222 0.003784807
## 87 5 0.0012400794 0.001351717
## 88 5 0.0012400794 0.001351717
## 89 26 0.0064484127 0.007028927
## NA 333 0.0825892857 NA
gss2021_ZERODraft %>%
tabyl(coninc)
## coninc n percent valid_percent
## 336.0 64 0.015873016 0.018238814
## 1344.0 50 0.012400794 0.014249074
## 2352.0 19 0.004712302 0.005414648
## 3024.0 11 0.002728175 0.003134796
## 3696.0 17 0.004216270 0.004844685
## 4368.0 12 0.002976190 0.003419778
## 5040.0 18 0.004464286 0.005129667
## 6048.0 54 0.013392857 0.015389000
## 7560.0 87 0.021577381 0.024793388
## 9240.0 80 0.019841270 0.022798518
## 10920.0 74 0.018353175 0.021088629
## 12600.0 66 0.016369048 0.018808777
## 14280.0 80 0.019841270 0.022798518
## 15960.0 69 0.017113095 0.019663722
## 18480.0 131 0.032490079 0.037332573
## 21840.0 155 0.038442460 0.044172129
## 25200.0 136 0.033730159 0.038757481
## 30240.0 266 0.065972222 0.075805073
## 36960.0 253 0.062748016 0.072100313
## 45360.0 367 0.091021825 0.104588202
## 55440.0 314 0.077876984 0.089484184
## 67200.0 326 0.080853175 0.092903961
## 80640.0 207 0.051339286 0.058991166
## 94080.0 141 0.034970238 0.040182388
## 107520.0 140 0.034722222 0.039897407
## 168736.3 372 0.092261905 0.106013109
## NA 523 0.129712302 NA
Recode hispanic
gss2021_ZERODraft %>%
tabyl(hispanic)
## hispanic n percent valid_percent
## 1 3544 0.8789682540 0.8864432216
## 2 245 0.0607638889 0.0612806403
## 3 51 0.0126488095 0.0127563782
## 4 21 0.0052083333 0.0052526263
## 5 9 0.0022321429 0.0022511256
## 6 2 0.0004960317 0.0005002501
## 7 4 0.0009920635 0.0010005003
## 8 2 0.0004960317 0.0005002501
## 9 4 0.0009920635 0.0010005003
## 10 5 0.0012400794 0.0012506253
## 11 3 0.0007440476 0.0007503752
## 15 17 0.0042162698 0.0042521261
## 20 4 0.0009920635 0.0010005003
## 21 8 0.0019841270 0.0020010005
## 22 10 0.0024801587 0.0025012506
## 23 5 0.0012400794 0.0012506253
## 24 1 0.0002480159 0.0002501251
## 25 1 0.0002480159 0.0002501251
## 30 46 0.0114087302 0.0115057529
## 35 1 0.0002480159 0.0002501251
## 41 3 0.0007440476 0.0007503752
## 46 2 0.0004960317 0.0005002501
## 47 7 0.0017361111 0.0017508754
## 50 3 0.0007440476 0.0007503752
## NA 34 0.0084325397 NA
gss2021_ZERODraft$subgrouphis <-Recode(gss2021_ZERODraft$hispanic, recodes="1 = 0; 2:50 = 1; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgrouphis)
## subgrouphis n percent valid_percent
## 0 3544 0.87896825 0.8864432
## 1 454 0.11259921 0.1135568
## <NA> 34 0.00843254 NA
subgrouphis_1<-as.factor(ifelse(gss2021_ZERODraft$subgrouphis==1, "Hispanic", "Non Hispanic"))
tabyl(subgrouphis_1)
## subgrouphis_1 n percent valid_percent
## Hispanic 454 0.11259921 0.1135568
## Non Hispanic 3544 0.87896825 0.8864432
## <NA> 34 0.00843254 NA
Recode education: college
gss2021_ZERODraft %>%
tabyl(educ)
## educ n percent valid_percent
## 0 9 0.0022321429 0.0022692890
## 1 1 0.0002480159 0.0002521432
## 2 2 0.0004960317 0.0005042864
## 3 3 0.0007440476 0.0007564297
## 4 1 0.0002480159 0.0002521432
## 5 2 0.0004960317 0.0005042864
## 6 15 0.0037202381 0.0037821483
## 7 5 0.0012400794 0.0012607161
## 8 25 0.0062003968 0.0063035804
## 9 32 0.0079365079 0.0080685830
## 10 52 0.0128968254 0.0131114473
## 11 83 0.0205853175 0.0209278870
## 12 829 0.2056051587 0.2090267272
## 13 277 0.0687003968 0.0698436712
## 14 542 0.1344246032 0.1366616238
## 15 208 0.0515873016 0.0524457892
## 16 942 0.2336309524 0.2375189107
## 17 258 0.0639880952 0.0650529501
## 18 351 0.0870535714 0.0885022693
## 19 113 0.0280257937 0.0284921836
## 20 216 0.0535714286 0.0544629349
## NA 66 0.0163690476 NA
gss2021_ZERODraft$subgroupeducCollege <-Recode(gss2021_ZERODraft$educ, recodes="0:12 = 0; 13:20 = 1; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgroupeducCollege)
## subgroupeducCollege n percent valid_percent
## 0 1059 0.26264881 0.2670197
## 1 2907 0.72098214 0.7329803
## <NA> 66 0.01636905 NA
subgroupEduCollege_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupeducCollege==1, "College", "No College"))
tabyl(subgroupEduCollege_1)
## subgroupEduCollege_1 n percent valid_percent
## College 2907 0.72098214 0.7329803
## No College 1059 0.26264881 0.2670197
## <NA> 66 0.01636905 NA
Education: high school only
gss2021_ZERODraft %>%
tabyl(educ)
## educ n percent valid_percent
## 0 9 0.0022321429 0.0022692890
## 1 1 0.0002480159 0.0002521432
## 2 2 0.0004960317 0.0005042864
## 3 3 0.0007440476 0.0007564297
## 4 1 0.0002480159 0.0002521432
## 5 2 0.0004960317 0.0005042864
## 6 15 0.0037202381 0.0037821483
## 7 5 0.0012400794 0.0012607161
## 8 25 0.0062003968 0.0063035804
## 9 32 0.0079365079 0.0080685830
## 10 52 0.0128968254 0.0131114473
## 11 83 0.0205853175 0.0209278870
## 12 829 0.2056051587 0.2090267272
## 13 277 0.0687003968 0.0698436712
## 14 542 0.1344246032 0.1366616238
## 15 208 0.0515873016 0.0524457892
## 16 942 0.2336309524 0.2375189107
## 17 258 0.0639880952 0.0650529501
## 18 351 0.0870535714 0.0885022693
## 19 113 0.0280257937 0.0284921836
## 20 216 0.0535714286 0.0544629349
## NA 66 0.0163690476 NA
gss2021_ZERODraft$subgroupeducHS <-Recode(gss2021_ZERODraft$educ, recodes="0:11 = 0; 12:12 = 1; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgroupeducHS)
## subgroupeducHS n percent valid_percent
## 0 230 0.05704365 0.217186
## 1 829 0.20560516 0.782814
## <NA> 2973 0.73735119 NA
subgroupEduHS_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupeducHS==1, "Highschool", "No Highschool"))
tabyl(subgroupEduHS_1)
## subgroupEduHS_1 n percent valid_percent
## Highschool 829 0.20560516 0.782814
## No Highschool 230 0.05704365 0.217186
## <NA> 2973 0.73735119 NA
Sex
gss2021_ZERODraft %>%
tabyl(sex)
## sex n percent valid_percent
## 1 1736 0.43055556 0.4406091
## 2 2204 0.54662698 0.5593909
## NA 92 0.02281746 NA
gss2021_ZERODraft$subgroupsex <-Recode(gss2021_ZERODraft$sex, recodes="1:1 = 0; 2:2 = 1; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgroupsex)
## subgroupsex n percent valid_percent
## 0 1736 0.43055556 0.4406091
## 1 2204 0.54662698 0.5593909
## <NA> 92 0.02281746 NA
subgroupsex_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupsex==1, "Women", "Men"))
tabyl(subgroupsex_1)
## subgroupsex_1 n percent valid_percent
## Men 1736 0.43055556 0.4406091
## Women 2204 0.54662698 0.5593909
## <NA> 92 0.02281746 NA
Income of respondent's household at age 16
gss2021_ZERODraft %>%
tabyl(incom16)
## incom16 n percent valid_percent
## 1 421 0.10441468 0.11003659
## 2 1013 0.25124008 0.26476738
## 3 1625 0.40302579 0.42472556
## 4 679 0.16840278 0.17746994
## 5 88 0.02182540 0.02300052
## NA 206 0.05109127 NA
gss2021_ZERODraft$subgroupincom16 <-Recode(gss2021_ZERODraft$incom16, recodes="1:2 = 0; 3:5 = 1; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgroupincom16)
## subgroupincom16 n percent valid_percent
## 0 1434 0.35565476 0.374804
## 1 2392 0.59325397 0.625196
## <NA> 206 0.05109127 NA
subgroupincom16_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupincom16==1, "secure economic resources at 16", "insecure economic resources"))
tabyl(subgroupincom16_1)
## subgroupincom16_1 n percent valid_percent
## insecure economic resources 1434 0.35565476 0.374804
## secure economic resources at 16 2392 0.59325397 0.625196
## <NA> 206 0.05109127 NA
Whether respondent was born in the US
gss2021_ZERODraft %>%
tabyl(born)
## born n percent valid_percent
## 1 3516 0.87202381 0.8878788
## 2 444 0.11011905 0.1121212
## NA 72 0.01785714 NA
gss2021_ZERODraft$subgroupBorn <-Recode(gss2021_ZERODraft$born, recodes="1:1 = 1; 2:2 = 0; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgroupBorn)
## subgroupBorn n percent valid_percent
## 0 444 0.11011905 0.1121212
## 1 3516 0.87202381 0.8878788
## <NA> 72 0.01785714 NA
subgroupborn_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupBorn==1, "Born in US", "Not born in US"))
tabyl(subgroupborn_1)
## subgroupborn_1 n percent valid_percent
## Born in US 3516 0.87202381 0.8878788
## Not born in US 444 0.11011905 0.1121212
## <NA> 72 0.01785714 NA
Whether respondent has ever been disrespected
gss2021_ZERODraft %>%
tabyl(disrspct)
## disrspct n percent valid_percent
## 1 136 0.03373016 0.05228758
## 2 231 0.05729167 0.08881200
## 3 327 0.08110119 0.12572088
## 4 801 0.19866071 0.30795848
## 5 552 0.13690476 0.21222607
## 6 554 0.13740079 0.21299500
## NA 1431 0.35491071 NA
gss2021_ZERODraft$subgroupdisrspct <-Recode(gss2021_ZERODraft$disrspct, recodes="1:5 = 1; 6:6 = 0; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgroupdisrspct)
## subgroupdisrspct n percent valid_percent
## 0 554 0.1374008 0.212995
## 1 2047 0.5076885 0.787005
## <NA> 1431 0.3549107 NA
subgroupdisrspct_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupdisrspct==1, "respondent has been disrespected", "Not being disrespected"))
tabyl(subgroupdisrspct_1)
## subgroupdisrspct_1 n percent valid_percent
## Not being disrespected 554 0.1374008 0.212995
## respondent has been disrespected 2047 0.5076885 0.787005
## <NA> 1431 0.3549107 NA
Whether respondent has ever been called or treated as if they were not smart
gss2021_ZERODraft %>%
tabyl(notsmart)
## notsmart n percent valid_percent
## 1 101 0.02504960 0.03881630
## 2 129 0.03199405 0.04957725
## 3 202 0.05009921 0.07763259
## 4 684 0.16964286 0.26287471
## 5 619 0.15352183 0.23789393
## 6 867 0.21502976 0.33320523
## NA 1430 0.35466270 NA
gss2021_ZERODraft$subgroupnotsmart <-Recode(gss2021_ZERODraft$notsmart, recodes="1:5 = 1; 6:6 = 0; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgroupnotsmart)
## subgroupnotsmart n percent valid_percent
## 0 867 0.2150298 0.3332052
## 1 1735 0.4303075 0.6667948
## <NA> 1430 0.3546627 NA
subgroupnotsmart_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupnotsmart==1, "respondent was told or treated as if they are not smart", "Never experienced that sort of treatment of not being smart"))
tabyl(subgroupnotsmart_1)
## subgroupnotsmart_1 n percent
## Never experienced that sort of treatment of not being smart 867 0.2150298
## respondent was told or treated as if they are not smart 1735 0.4303075
## <NA> 1430 0.3546627
## valid_percent
## 0.3332052
## 0.6667948
## NA
Age cut into intervals
age1<-cut(gss2021_ZERODraft$age,
breaks = c(0,24,39,59,79,99))
Creating model.dat2
library(dplyr)
model.dat2 <- gss2021_ZERODraft %>%
mutate(hispanic = subgrouphis ,
College = subgroupeducCollege ,
HS = subgroupeducHS ,
born = subgroupBorn ,
income16 = subgroupincom16 ,
Disrespect = subgroupdisrspct ,
NotSmart = subgroupnotsmart ,
sex = subgroupsex ,
age = age1) %>%
dplyr::select(id, College, hispanic, born, income16, Disrespect, NotSmart, sex, age)%>% filter(complete.cases(.))
knitr::kable(head(model.dat2))
| id | College | hispanic | born | income16 | Disrespect | NotSmart | sex | age     |
|---:|:--------|:---------|:-----|:---------|:-----------|:---------|:----|:--------|
|  1 | 0       | 0        | 1    | 0        | 1          | 0        | 1   | (59,79] |
|  7 | 1       | 0        | 1    | 1        | 0          | 1        | 1   | (24,39] |
| 12 | 1       | 0        | 0    | 1        | 1          | 1        | 1   | (59,79] |
| 13 | 0       | 0        | 1    | 0        | 1          | 1        | 0   | (24,39] |
| 14 | 1       | 0        | 0    | 1        | 0          | 0        | 0   | (0,24]  |
| 15 | 1       | 0        | 1    | 1        | 1          | 0        | 1   | (59,79] |
table(model.dat2$income16)   # incom16 was renamed to income16 in model.dat2
##
##    0    1
##  877 1474
summary(model.dat2)
## id College hispanic born income16 Disrespect NotSmart
## Min. : 1 0: 581 0:2121 0: 225 0: 877 0: 485 0: 762
## 1st Qu.:1130 1:1770 1: 230 1:2126 1:1474 1:1866 1:1589
## Median :2189
## Mean :2215
## 3rd Qu.:3319
## Max. :4470
## sex age
## 0:1060 (0,24] : 92
## 1:1291 (24,39]:571
## (39,59]:780
## (59,79]:797
## (79,99]:111
##
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
set.seed(1115)
train<-createDataPartition(y=model.dat2$College,
p = .80,
list = F)
model.dat2train<-model.dat2[train,]
model.dat2test<-model.dat2[-train,]
table(model.dat2train$College)
##
## 0 1
## 465 1416
prop.table(table(model.dat2train$College))
##
## 0 1
## 0.2472089 0.7527911
summary(model.dat2train)
## id College hispanic born income16 Disrespect NotSmart
## Min. : 7 0: 465 0:1696 0: 180 0: 691 0: 383 0: 607
## 1st Qu.:1108 1:1416 1: 185 1:1701 1:1190 1:1498 1:1274
## Median :2169
## Mean :2190
## 3rd Qu.:3291
## Max. :4470
## sex age
## 0: 830 (0,24] : 69
## 1:1051 (24,39]:461
## (39,59]:633
## (59,79]:632
## (79,99]: 86
##
glml<-glm(College ~ hispanic + born + income16 + Disrespect + NotSmart + sex + age,
data = model.dat2train,
family = binomial)
summary(glml)
##
## Call:
## glm(formula = College ~ hispanic + born + income16 + Disrespect +
## NotSmart + sex + age, family = binomial, data = model.dat2train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0939 0.4983 0.6451 0.7796 1.3155
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.19211 0.34022 0.565 0.572301
## hispanic1 -0.57906 0.17849 -3.244 0.001178 **
## born1 -0.25113 0.19704 -1.275 0.202472
## income161 0.61754 0.11156 5.536 3.1e-08 ***
## Disrespect1 0.26918 0.15274 1.762 0.078015 .
## NotSmart1 0.05003 0.13634 0.367 0.713672
## sex1 -0.20822 0.11156 -1.866 0.061979 .
## age(24,39] 0.94488 0.27514 3.434 0.000594 ***
## age(39,59] 0.79895 0.26896 2.971 0.002973 **
## age(59,79] 0.58645 0.26977 2.174 0.029715 *
## age(79,99] 0.62754 0.35518 1.767 0.077258 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2103.9 on 1880 degrees of freedom
## Residual deviance: 2039.6 on 1870 degrees of freedom
## AIC: 2061.6
##
## Number of Fisher Scoring iterations: 4
tr_pred<- predict(glml,
newdata = model.dat2train,
type = "response")
head(tr_pred)
## 1 2 3 4 5 6
## 0.7933452 0.8186289 0.7694253 0.6920354 0.7063517 0.5987661
tr_predcl<-factor(ifelse(tr_pred>.5, 1, 0))
library(ggplot2)
pred1<-data.frame(pr=tr_pred,
gr=tr_predcl,
College=model.dat2train$College)
pred1%>%
ggplot()+
geom_histogram(aes(x=pr, color=gr, fill=gr))+
ggtitle(label = "Probability of Hispanics going to College Compared to their Non-Hispanic White Counterparts",
subtitle = "Threshold = .5")+
geom_vline(xintercept=.5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pred1%>%
ggplot()+
geom_histogram(aes(x=pr, color=College, fill=College))+
ggtitle(label = "Probability of attending College between Hispanics and NH Whites",
subtitle = "Truth")+
geom_vline(xintercept=.5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

table( tr_predcl,
model.dat2train$College)
##
## tr_predcl 0 1
## 0 11 6
## 1 454 1410
cm1<-confusionMatrix(data = tr_predcl,
reference = model.dat2train$College )
cm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 11 6
## 1 454 1410
##
## Accuracy : 0.7554
## 95% CI : (0.7354, 0.7747)
## No Information Rate : 0.7528
## P-Value [Acc > NIR] : 0.4066
##
## Kappa : 0.0287
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.023656
## Specificity : 0.995763
## Pos Pred Value : 0.647059
## Neg Pred Value : 0.756438
## Prevalence : 0.247209
## Detection Rate : 0.005848
## Detection Prevalence : 0.009038
## Balanced Accuracy : 0.509709
##
## 'Positive' Class : 0
##
tr_predcl<-factor(ifelse(tr_pred>mean(I(model.dat2train$College==1)), 1, 0)) #mean of response
pred2<-data.frame(pr=tr_pred,
gr=tr_predcl,
College=model.dat2train$College)
pred2%>%
ggplot(aes(x=pr, fill=gr))+
geom_histogram(position="identity",
alpha=.7)+
ggtitle(label = "Probability of attending College between Hispanics and NH Whites",
subtitle = "Threshold = Mean")+
geom_vline(xintercept=mean(I(model.dat2train$College==1)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pred2%>%
ggplot(aes(x=pr, fill=College))+
geom_histogram(position="identity",
alpha=.7)+
ggtitle(label = "Probability of attending College between Hispanics and NH Whites",
subtitle = "Truth")+
geom_vline(xintercept=mean(I(model.dat2train$College==1)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

confusionMatrix(data = tr_predcl,
model.dat2train$College,
positive = "1" )
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 259 524
## 1 206 892
##
## Accuracy : 0.6119
## 95% CI : (0.5895, 0.634)
## No Information Rate : 0.7528
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.152
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6299
## Specificity : 0.5570
## Pos Pred Value : 0.8124
## Neg Pred Value : 0.3308
## Prevalence : 0.7528
## Detection Rate : 0.4742
## Detection Prevalence : 0.5837
## Balanced Accuracy : 0.5935
##
## 'Positive' Class : 1
##
pred_test<-predict(glml,
newdata=model.dat2test,
type="response")
pred_cl<-factor(ifelse(pred_test > mean( I(model.dat2test$College==1)), 1, 0))
table(model.dat2test$College,pred_cl)
## pred_cl
## 0 1
## 0 64 52
## 1 152 202
confusionMatrix(data = pred_cl,model.dat2test$College )
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 64 152
## 1 52 202
##
## Accuracy : 0.566
## 95% CI : (0.5198, 0.6113)
## No Information Rate : 0.7532
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0949
##
## Mcnemar's Test P-Value : 4.167e-12
##
## Sensitivity : 0.5517
## Specificity : 0.5706
## Pos Pred Value : 0.2963
## Neg Pred Value : 0.7953
## Prevalence : 0.2468
## Detection Rate : 0.1362
## Detection Prevalence : 0.4596
## Balanced Accuracy : 0.5612
##
## 'Positive' Class : 0
##