Assignment#6.2:Regression Analysis(Logit) using Zelig

Introduction

The purpose of the present research was to understand the differences in citizenship status in estimating the probabilities of poverty by the variation in educational attainment among employed and never married female in . The research question was tested for this purpose. 1. Does the estimated probability of poverty vary in terms of the differences in citizenship status among three different groups of educational attainment.

Method

Sampling method:

The sample of this research was consisted of employed and never married female between the age 18 and 45 years. The sample of the present research was nationally representative. The sample was consisted of the American community Survey data 2016, which was available in pumps.org. Although the extracted data consisted of both US census and American community survey from the year 2002 to 2016, only the year 2016 was used for the present analysis.

Data analysis

Recoding variables:

Independent variables

Education: Education was re coded as 3 categories(Less than high school, High school, College) out of existing codes. Race: The variable race was re coded as 5 broad categories(White, Black, American Indian, Asian, Other) of race out of the existing codes. Chinese, Japanese, and other Asian were re coded under Asian category, all the other races and multiracial are re coded under Other category. Citizen: Citizenship status was re-coding as “yes” for those with neutralized citizenship and those who born abroad of American parents, and “no”" for those who were no a citizen and who were not citizen but received the first paper and “NA” for rest of the codes(Foreign born, citizenship status not reported, NA).

Dependent variables

Poverty:Poverty in the data set was recorded as a poverty thresholds. Thresholds range between 1 and 500 was re coded as below poverty thresholds and 501 or more as above poverty thresholds.

Procedures of data analysis

Binary logistic regression analysis was conducted by using 5 different models to get the best fitted model. Zelig 4 was used to estimate the probabilities of poverty by the difference in citizenship status among different groups of educational attainment.

 getwd()
## [1] "/Users/mzaliny/work"
setwd("/Users/mzaliny/work")
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
F1<-read_csv("family_ipums.csv")
## Parsed with column specification:
## cols(
##   .default = col_integer()
## )
## See spec(...) for full column specifications.
F2<-subset(F1,YEAR=="2016")
head(F2)
## # A tibble: 6 x 28
##    YEAR DATANUM SERIAL  HHWT    GQ PERNUM PERWT FAMSIZE NCHILD ELDCH YNGCH
##   <int>   <int>  <int> <int> <int>  <int> <int>   <int>  <int> <int> <int>
## 1  2016       1      3   159     1      3   184       4      0    99    99
## 2  2016       1      6   226     1      1   226       2      0    99    99
## 3  2016       1      6   226     1      2   155       2      0    99    99
## 4  2016       1     17    78     1      2   142       3      0    99    99
## 5  2016       1     17    78     1      3   142       3      0    99    99
## 6  2016       1     18    71     1      1    71       3      2    20    16
## # ... with 17 more variables: SEX <int>, AGE <int>, MARST <int>,
## #   RACE <int>, RACED <int>, CITIZEN <int>, LANGUAGE <int>,
## #   LANGUAGED <int>, EDUC <int>, EDUCD <int>, EMPSTAT <int>,
## #   EMPSTATD <int>, UHRSWORK <int>, INCTOT <int>, INCOTHER <int>,
## #   POVERTY <int>, DISABWRK <int>
library(tidyverse)
F2<-F2[c(1, 8, 12:15, 17, 20, 22, 27)]
library(tidyverse)
F2<-F2 %>%
    mutate(race_r = sjmisc::rec(RACE, rec = "1=1; 2=2; 3=3; 4:6=4; 7:9=5 "))%>%
  mutate(citizen_r=sjmisc::rec(CITIZEN, rec = "0=0; 1:2=1; 3:4=2; 5=0"))%>%
  mutate(poverty_r=sjmisc::rec(POVERTY, rec = "1:500=1; 501=0"))%>%
  mutate(edu_r=sjmisc::rec(EDUC, rec = "0:5=1; 6=2; 7:11=3"))

head(F2)
## # A tibble: 6 x 14
##    YEAR FAMSIZE   SEX   AGE MARST  RACE CITIZEN  EDUC EMPSTAT POVERTY
##   <int>   <int> <int> <int> <int> <int>   <int> <int>   <int>   <int>
## 1  2016       4     1    20     6     1       0     6       3     261
## 2  2016       2     1    25     6     1       0     7       1     154
## 3  2016       2     2    23     6     1       0     8       1     153
## 4  2016       3     2    25     6     2       0     7       1     326
## 5  2016       3     1    24     6     2       0     7       1     326
## 6  2016       3     2    37     6     2       0     7       1     122
## # ... with 4 more variables: race_r <dbl>, citizen_r <dbl>,
## #   poverty_r <dbl>, edu_r <dbl>
library(tidyverse)
F2$SEX<-factor(F2$SEX, levels = c(1, 2), 
               labels = c("Male","Female"))
F2$race_r<-factor(F2$race_r, levels = c(1, 2, 3, 4, 5), 
                  labels = c("White", "Black", "AmericanIndian", "Asian", "Other"))
F2$MARST<-factor(F2$MARST, levels = c(1, 2, 3, 4, 5, 6), 
                 labels = c("Married_SP", "Married_SA","Separated", "Divorced", "Widowed", "NM/single"))
F2$citizen_r<-factor(F2$citizen_r,  levels = c(0, 1, 2), 
                     labels = c("NA", "yes", "no"))
F2$EMPSTAT<-factor(F2$EMPSTAT,  levels = c(0, 1, 2, 3), 
                   labels = c("NA", "Employed", "Unemployed", "Other" ))

F2$edu_r<-factor(F2$edu_r,  levels = c(1, 2, 3), 
                   labels = c("Less than High School", "High School", "College"))

head(F2)
## # A tibble: 6 x 14
##    YEAR FAMSIZE SEX      AGE MARST      RACE CITIZEN  EDUC EMPSTAT POVERTY
##   <int>   <int> <fct>  <int> <fct>     <int>   <int> <int> <fct>     <int>
## 1  2016       4 Male      20 NM/single     1       0     6 Other       261
## 2  2016       2 Male      25 NM/single     1       0     7 Employ…     154
## 3  2016       2 Female    23 NM/single     1       0     8 Employ…     153
## 4  2016       3 Female    25 NM/single     2       0     7 Employ…     326
## 5  2016       3 Male      24 NM/single     2       0     7 Employ…     326
## 6  2016       3 Female    37 NM/single     2       0     7 Employ…     122
## # ... with 4 more variables: race_r <fct>, citizen_r <fct>,
## #   poverty_r <dbl>, edu_r <fct>
library(tidyverse)
F2<-F2[c(3:5,9,11:14)]
head(F2)
## # A tibble: 6 x 8
##   SEX      AGE MARST     EMPSTAT  race_r citizen_r poverty_r edu_r      
##   <fct>  <int> <fct>     <fct>    <fct>  <fct>         <dbl> <fct>      
## 1 Male      20 NM/single Other    White  NA             1.00 High School
## 2 Male      25 NM/single Employed White  NA             1.00 College    
## 3 Female    23 NM/single Employed White  NA             1.00 College    
## 4 Female    25 NM/single Employed Black  NA             1.00 College    
## 5 Male      24 NM/single Employed Black  NA             1.00 College    
## 6 Female    37 NM/single Employed Black  NA             1.00 College
library(dplyr)
Female.data <- subset(F2, SEX=="Female"  & EMPSTAT=="Employed",
                         select=SEX:edu_r)
head(Female.data)
## # A tibble: 6 x 8
##   SEX      AGE MARST     EMPSTAT  race_r citizen_r poverty_r edu_r        
##   <fct>  <int> <fct>     <fct>    <fct>  <fct>         <dbl> <fct>        
## 1 Female    23 NM/single Employed White  NA             1.00 College      
## 2 Female    25 NM/single Employed Black  NA             1.00 College      
## 3 Female    37 NM/single Employed Black  NA             1.00 College      
## 4 Female    28 NM/single Employed Black  NA             1.00 College      
## 5 Female    18 NM/single Employed White  NA             1.00 High School  
## 6 Female    26 NM/single Employed White  NA             1.00 Less than Hi…
library(tidyverse)
Female.data<-Female.data[-which(Female.data$citizen_r == "NA"), ]
head(Female.data)
## # A tibble: 6 x 8
##   SEX      AGE MARST     EMPSTAT  race_r citizen_r poverty_r edu_r  
##   <fct>  <int> <fct>     <fct>    <fct>  <fct>         <dbl> <fct>  
## 1 Female    27 NM/single Employed Asian  yes            1.00 College
## 2 Female    22 NM/single Employed Black  yes           NA    College
## 3 Female    43 NM/single Employed White  yes            1.00 College
## 4 Female    31 NM/single Employed Black  no             1.00 College
## 5 Female    31 NM/single Employed Black  no             1.00 College
## 6 Female    31 NM/single Employed Asian  no             0    College
library(tidyverse)
Female.data<-na.omit(Female.data) 
head(Female.data)
## # A tibble: 6 x 8
##   SEX      AGE MARST     EMPSTAT  race_r citizen_r poverty_r edu_r  
##   <fct>  <int> <fct>     <fct>    <fct>  <fct>         <dbl> <fct>  
## 1 Female    27 NM/single Employed Asian  yes            1.00 College
## 2 Female    43 NM/single Employed White  yes            1.00 College
## 3 Female    31 NM/single Employed Black  no             1.00 College
## 4 Female    31 NM/single Employed Black  no             1.00 College
## 5 Female    31 NM/single Employed Asian  no             0    College
## 6 Female    31 NM/single Employed Black  no             1.00 College

Logistic models

m0f<-glm(poverty_r~citizen_r,family = binomial, data = Female.data)
summary(m0f)
## 
## Call:
## glm(formula = poverty_r ~ citizen_r, family = binomial, data = Female.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0647   0.5026   0.5026   0.8093   0.8093  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.94820    0.02449   38.72   <2e-16 ***
## citizen_rno  1.05691    0.04139   25.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16766  on 16869  degrees of freedom
## Residual deviance: 16067  on 16868  degrees of freedom
## AIC: 16071
## 
## Number of Fisher Scoring iterations: 4
m1f<-glm(poverty_r~citizen_r+edu_r+race_r,family = binomial, data = Female.data)
summary(m1f)
## 
## Call:
## glm(formula = poverty_r ~ citizen_r + edu_r + race_r, family = binomial, 
##     data = Female.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7982   0.2628   0.4821   0.7833   1.0749  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.57031    0.12225  21.025  < 2e-16 ***
## citizen_rno           0.77849    0.04338  17.946  < 2e-16 ***
## edu_rHigh School     -0.74569    0.12781  -5.834 5.40e-09 ***
## edu_rCollege         -1.80087    0.12073 -14.917  < 2e-16 ***
## race_rBlack           0.24212    0.06990   3.464 0.000533 ***
## race_rAmericanIndian  0.87846    0.43395   2.024 0.042938 *  
## race_rAsian          -0.52350    0.04674 -11.200  < 2e-16 ***
## race_rOther           0.54590    0.07068   7.724 1.13e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16766  on 16869  degrees of freedom
## Residual deviance: 14857  on 16862  degrees of freedom
## AIC: 14873
## 
## Number of Fisher Scoring iterations: 6
m2f<-glm(poverty_r~citizen_r*edu_r+race_r,family = binomial, data = Female.data)
summary(m2f)
## 
## Call:
## glm(formula = poverty_r ~ citizen_r * edu_r + race_r, family = binomial, 
##     data = Female.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9568   0.2079   0.5042   0.7740   1.0550  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   1.96568    0.16172  12.155  < 2e-16 ***
## citizen_rno                   1.85784    0.23993   7.743 9.70e-15 ***
## edu_rHigh School             -0.27156    0.17168  -1.582 0.113692    
## edu_rCollege                 -1.16434    0.16248  -7.166 7.73e-13 ***
## race_rBlack                   0.25072    0.06974   3.595 0.000324 ***
## race_rAmericanIndian          0.88967    0.43383   2.051 0.040293 *  
## race_rAsian                  -0.50647    0.04667 -10.853  < 2e-16 ***
## race_rOther                   0.53520    0.07078   7.561 3.99e-14 ***
## citizen_rno:edu_rHigh School -0.74164    0.26227  -2.828 0.004687 ** 
## citizen_rno:edu_rCollege     -1.19594    0.24470  -4.887 1.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16766  on 16869  degrees of freedom
## Residual deviance: 14821  on 16860  degrees of freedom
## AIC: 14841
## 
## Number of Fisher Scoring iterations: 6
m3f<-glm(poverty_r~citizen_r*race_r+edu_r,family = binomial, data = Female.data)
summary(m3f)
## 
## Call:
## glm(formula = poverty_r ~ citizen_r * race_r + edu_r, family = binomial, 
##     data = Female.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8346   0.2614   0.4515   0.7703   1.0624  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       2.51703    0.12532  20.085  < 2e-16 ***
## citizen_rno                       0.84238    0.06816  12.359  < 2e-16 ***
## race_rBlack                       0.31330    0.08292   3.778 0.000158 ***
## race_rAmericanIndian              0.81611    0.54331   1.502 0.133067    
## race_rAsian                      -0.47303    0.05775  -8.191 2.60e-16 ***
## race_rOther                       0.48421    0.08911   5.434 5.52e-08 ***
## edu_rHigh School                 -0.72063    0.12827  -5.618 1.93e-08 ***
## edu_rCollege                     -1.76726    0.12152 -14.543  < 2e-16 ***
## citizen_rno:race_rBlack          -0.24838    0.15287  -1.625 0.104203    
## citizen_rno:race_rAmericanIndian  0.16306    0.90958   0.179 0.857729    
## citizen_rno:race_rAsian          -0.14675    0.09721  -1.510 0.131152    
## citizen_rno:race_rOther           0.15554    0.14717   1.057 0.290562    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16766  on 16869  degrees of freedom
## Residual deviance: 14850  on 16858  degrees of freedom
## AIC: 14874
## 
## Number of Fisher Scoring iterations: 6
m4f<-glm(poverty_r~citizen_r*(edu_r+race_r),family = binomial, data = Female.data)
summary(m4f)
## 
## Call:
## glm(formula = poverty_r ~ citizen_r * (edu_r + race_r), family = binomial, 
##     data = Female.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9752   0.2089   0.4834   0.7589   1.0514  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       1.96181    0.16263  12.063  < 2e-16 ***
## citizen_rno                       1.85216    0.24409   7.588 3.25e-14 ***
## edu_rHigh School                 -0.27555    0.17163  -1.606 0.108377    
## edu_rCollege                     -1.17166    0.16252  -7.210 5.62e-13 ***
## race_rBlack                       0.30724    0.08254   3.722 0.000197 ***
## race_rAmericanIndian              0.83765    0.54056   1.550 0.121240    
## race_rAsian                      -0.48628    0.05751  -8.455  < 2e-16 ***
## race_rOther                       0.49579    0.08866   5.592 2.24e-08 ***
## citizen_rno:edu_rHigh School     -0.71914    0.26261  -2.738 0.006174 ** 
## citizen_rno:edu_rCollege         -1.15428    0.24664  -4.680 2.87e-06 ***
## citizen_rno:race_rBlack          -0.20549    0.15345  -1.339 0.180540    
## citizen_rno:race_rAmericanIndian  0.13892    0.91045   0.153 0.878730    
## citizen_rno:race_rAsian          -0.06093    0.09842  -0.619 0.535853    
## citizen_rno:race_rOther           0.10400    0.14778   0.704 0.481586    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16766  on 16869  degrees of freedom
## Residual deviance: 14818  on 16856  degrees of freedom
## AIC: 14846
## 
## Number of Fisher Scoring iterations: 6

Likelihood Test

lmtest::lrtest(m0f, m1f,m2f, m3f, m4f)
## Likelihood ratio test
## 
## Model 1: poverty_r ~ citizen_r
## Model 2: poverty_r ~ citizen_r + edu_r + race_r
## Model 3: poverty_r ~ citizen_r * edu_r + race_r
## Model 4: poverty_r ~ citizen_r * race_r + edu_r
## Model 5: poverty_r ~ citizen_r * (edu_r + race_r)
##   #Df  LogLik Df    Chisq Pr(>Chisq)    
## 1   2 -8033.4                           
## 2   8 -7428.6  6 1209.637  < 2.2e-16 ***
## 3  10 -7410.3  2   36.610  1.122e-08 ***
## 4  12 -7425.2  2   29.671  3.606e-07 ***
## 5  14 -7408.8  2   32.696  7.946e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Estimating model

library(Zelig)
## Loading required package: survival
## 
## Attaching package: 'Zelig'
## The following object is masked from 'package:purrr':
## 
##     reduce
zfemale <- zelig(poverty_r~citizen_r*edu_r+race_r, model = "logit", data = Female.data, cite = F)
summary(zfemale)
## Model: 
## 
## Call:
## z5$zelig(formula = poverty_r ~ citizen_r * edu_r + race_r, data = Female.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9568   0.2079   0.5042   0.7740   1.0550  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)
## (Intercept)                   1.96568    0.16172  12.155  < 2e-16
## citizen_rno                   1.85784    0.23993   7.743 9.70e-15
## edu_rHigh School             -0.27156    0.17168  -1.582 0.113692
## edu_rCollege                 -1.16434    0.16248  -7.166 7.73e-13
## race_rBlack                   0.25072    0.06974   3.595 0.000324
## race_rAmericanIndian          0.88967    0.43383   2.051 0.040293
## race_rAsian                  -0.50647    0.04667 -10.853  < 2e-16
## race_rOther                   0.53520    0.07078   7.561 3.99e-14
## citizen_rno:edu_rHigh School -0.74164    0.26227  -2.828 0.004687
## citizen_rno:edu_rCollege     -1.19594    0.24470  -4.887 1.02e-06
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16766  on 16869  degrees of freedom
## Residual deviance: 14821  on 16860  degrees of freedom
## AIC: 14841
## 
## Number of Fisher Scoring iterations: 6
## 
## Next step: Use 'setx' method

Effect of citizenship status

x.cy <- setx(zfemale, citizen_r="yes")
x.cn <- setx(zfemale, citizen_r="no")
s.citizen <- sim(zfemale, x = x.cy, x1=x.cn)
summary(s.citizen)
## 
##  sim x :
##  -----
## ev
##           mean          sd       50%      2.5%     97.5%
## [1,] 0.6901251 0.008076961 0.6903511 0.6738736 0.7058212
## pv
##          0     1
## [1,] 0.303 0.697
## 
##  sim x1 :
##  -----
## ev
##           mean          sd       50%     2.5%     97.5%
## [1,] 0.8116545 0.007255409 0.8116289 0.796806 0.8255204
## pv
##          0     1
## [1,] 0.192 0.808
## fd
##           mean          sd       50%      2.5%     97.5%
## [1,] 0.1215293 0.008519054 0.1215659 0.1042979 0.1384006
plot(s.citizen)

Variation of education in the differences of citizenship status

Education: Less than High School

xlessh <- setx(zfemale, citizen_r = "yes", edu_r = "Less than High School")
x1lessh <- setx(zfemale, citizen_r = "no", edu_r = "Less than High School")
slessh <- sim(zfemale, x = xlessh, x1 = x1lessh)
plot(slessh)

Education: High School

xhigh <- setx(zfemale, citizen_r = "yes", edu_r = "High School")
x1high <- setx(zfemale, citizen_r = "no", edu_r = "High School")
shigh <- sim(zfemale, x = xhigh, x1 = x1high)
plot(shigh)

Education: College

xcollege <- setx(zfemale, citizen_r = "yes", edu_r = "College")
x1college <- setx(zfemale, citizen_r = "no", edu_r = "College")
scollege <- sim(zfemale, x = xcollege, x1 = x1college)
plot(scollege)

Putting together

fd1 <- slessh$get_qi(xvalue="x1", qi="fd")
fd2 <- shigh$get_qi(xvalue="x1", qi="fd")
fd3 <- scollege$get_qi(xvalue="x1", qi="fd")
edufd <- as.data.frame(cbind(fd1, fd2, fd3))
head(edufd)
##           V1         V2        V3
## 1 0.09180455 0.10272978 0.1279951
## 2 0.10987893 0.10719154 0.1046025
## 3 0.12549160 0.10324767 0.1222664
## 4 0.07864366 0.08847176 0.1304338
## 5 0.10662803 0.10324674 0.1253150
## 6 0.11110825 0.10268249 0.1150790
library(tidyr)

tidd <- edufd %>% 
  gather(edu_r, simv, 1:3)
head(tidd)
##   edu_r       simv
## 1    V1 0.09180455
## 2    V1 0.10987893
## 3    V1 0.12549160
## 4    V1 0.07864366
## 5    V1 0.10662803
## 6    V1 0.11110825
library(dplyr)

tidd %>% 
  group_by(edu_r) %>% 
  summarise(mean = mean(simv), sd = sd(simv))
## # A tibble: 3 x 3
##   edu_r   mean      sd
##   <chr>  <dbl>   <dbl>
## 1 V1    0.102  0.0175 
## 2 V2    0.0989 0.00960
## 3 V3    0.122  0.00903
library(ggplot2)

ggplot(tidd, aes(simv)) + geom_histogram() + facet_grid(~edu_r)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Results

The results suggested that the probability of poverty can be estimated as 0.102 by the differences between citizenship status within female group of less than high school education. The probabilities were 0.099 and 0.122 for female group with high school education and college education respectively. So, the probability of estimating poverty by the differences in citizenship status was highest among female with college education than other groups of educational attainment.