Homework 1

Load necessary packages.

options(repos="https://cran.rstudio.com" )
library(htmlTable)
library(car)
## Loading required package: carData
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
library(questionr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Load data from IPUMS

I pulled data for Texas only, as an extract of the 2017 5-year ACS.

For more details on using IPUMS data in R, see: https://cran.r-project.org/web/packages/ipumsr/vignettes/ipums.html

# Install the ipumsr package, which reads IPUMS extracts
install.packages("ipumsr")
## 
## The downloaded binary packages are in
##  /var/folders/lr/thl33hwj2jdd01hth_whxlz00000gn/T//RtmpAYPyW6/downloaded_packages
library(ipumsr)
## Registered S3 methods overwritten by 'ipumsr':
##   method                                 from 
##   format.pillar_shaft_haven_labelled_chr haven
##   format.pillar_shaft_haven_labelled_num haven
##   pillar_shaft.haven_labelled            haven
# Note to self: set the working directory to the folder holding the extract files, then read the DDI and pass it to read_ipums_micro()
setwd("~/Desktop/UTSA/DEM7283_StatsII/HW1")
ddi <- read_ipums_ddi("usa_00006.xml")
data <- read_ipums_micro(ddi)
## Use of data from IPUMS-USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.
names(data)
##  [1] "YEAR"     "MULTYEAR" "SAMPLE"   "SERIAL"   "CBSERIAL" "HHWT"    
##  [7] "CLUSTER"  "STATEFIP" "STRATA"   "GQ"       "PERNUM"   "PERWT"   
## [13] "AGE"      "MARST"    "EMPSTAT"  "EMPSTATD" "UHRSWORK" "POVERTY"

a) Define a binary outcome variable of your choosing and define how you recode the original variable.

My binary outcome variable is whether a person is above or below the poverty threshold. Poverty is calculated for families but reported at the person level. The IPUMS POVERTY variable expresses the family’s income as a percentage of the poverty threshold for a family of that size (e.g., a value of 60 means the family’s income is 60% of the poverty threshold, and a value of 100 means the family’s income is at the poverty threshold). A value of 0 is the “N/A” code rather than a ratio, so I set it to missing. I recode the remaining values into a binary factor: values 1 through 100 become “Below Poverty” and all higher values become “Above Poverty.”

data$pov_b <-Recode(data$POVERTY, recodes="000=NA; 1:100='Below Poverty'; else='Above Poverty'", as.factor=T)
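
As a quick sanity check on the recode boundaries, the sketch below applies the same rule to a handful of made-up POVERTY values using base R only (`ifelse` in place of `car::Recode`). The vector `pov_toy` and the helper `recode_pov` are hypothetical and exist only for illustration; they are not part of the IPUMS extract.

```r
# Hypothetical toy values of the IPUMS POVERTY ratio (percent of threshold)
pov_toy <- c(0, 1, 60, 100, 101, 501)

# Base-R version of the rule used above: 0 -> NA, 1-100 -> below, else -> above
recode_pov <- function(x) {
  out <- ifelse(x == 0, NA_character_,
                ifelse(x >= 1 & x <= 100, "Below Poverty", "Above Poverty"))
  factor(out, levels = c("Above Poverty", "Below Poverty"))
}

pov_toy_b <- recode_pov(pov_toy)

# Tabulate the result, keeping the NA category visible
table(pov_toy_b, useNA = "ifany")
```

Note that under this rule a value of exactly 100 falls in the “Below Poverty” group.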

b) State a research question about what factors you believe will affect your outcome variable.

My chosen outcome variable is whether someone is above or below the poverty threshold. My question is how a person’s age (categories: children, working age, and seniors), employment status (categories: employed, unemployed, and not in the labor force), and usual hours worked per week (categories: part time or full time) relate to their poverty status. While poverty is calculated at the family level, it is reported at the person level, so the analysis is at the person level.

c) Define at least 2 predictor variables, based on your research question. For this assignment, it’s best if these are categorical variables.

The predictor variables I’ll examine are age, employment status, and usual hours worked per week. Age will be recoded from a continuous variable into a factor with categories for Children (under 18), Working Age (18-64), and Seniors (65+). Employment status will be recoded into Employed, Unemployed, and Not in the Labor Force, with the “N/A” code (0) set to missing. Usual hours worked per week will also be recoded from a continuous variable into a factor, with fewer than 35 hours per week considered “part time” and 35 or more hours per week considered “full time”; a value of 0 is set to missing.

data$age_c <-Recode(data$AGE, recodes="0:17='0 Child'; 18:64='1 Working Age'; else='2 Senior'", as.factor=T)

data$empl_c <-Recode(data$EMPSTAT, recodes="0=NA; 1='1 Employed'; 2='2 Unemployed'; 3='3 Not in LF'", as.factor=T)

data$fulltime <- Recode(data$UHRSWORK, recodes="0=NA; 1:34='1 Part Time'; else='2 Full Time'", as.factor=T)
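
The part-time/full-time cutoff can be checked on a few made-up values. This is a minimal sketch using base R `cut()` rather than `car::Recode`; the vector `hrs_toy` is hypothetical and is used only to show which side of the boundary 34 and 35 hours land on.

```r
# Hypothetical toy values of UHRSWORK (usual hours worked per week)
hrs_toy <- c(0, 1, 34, 35, 40, 99)

# 0 is treated as missing, as in the recode above
hrs_toy[hrs_toy == 0] <- NA

# Intervals are left-closed: [1, 35) is part time, [35, Inf) is full time
fulltime_toy <- cut(hrs_toy, breaks = c(1, 35, Inf), right = FALSE,
                    labels = c("1 Part Time", "2 Full Time"))

# Show each toy value next to its category
data.frame(hrs_toy, fulltime_toy)
```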

d) Perform a descriptive analysis of the outcome variable by each of the variables you defined in part b. (e.g. 2 x 2 table, 2 x k table). Follow a similar approach to presenting your statistics as presented in Sparks 2009 (in the Google drive). This can be done easily using the tableone package!

a. Calculate descriptive statistics (mean or percentages) for each variable using no weights or survey design, as well as with full survey design and weights.

#Adding in full survey design: Creating survey object
library(survey)
des <- svydesign(id=~CLUSTER, strata=~STRATA, weights=~PERWT, data=data)

#Using TableOne
library(tableone)

#Unweighted
desc1<-CreateTableOne(vars = c("age_c", "empl_c","fulltime", "pov_b"), data = data)
# format = "p" prints percentages only for the categorical variables
print(desc1,format="p")
##                             
##                              Overall
##   n                          1295444
##   age_c (%)                         
##      0 Child                    23.7
##      1 Working Age              60.5
##      2 Senior                   15.9
##   empl_c (%)                        
##      1 Employed                 57.2
##      2 Unemployed                3.3
##      3 Not in LF                39.5
##   fulltime = 2 Full Time (%)    78.2
##   pov_b = Below Poverty (%)     14.8
#Full Survey Design
desc2<-svyCreateTableOne(vars = c("age_c", "empl_c","fulltime", "pov_b"), data = des)
# Same variables, now weighted and using the survey design object
print(desc2,format="p")
##                             
##                              Overall   
##   n                          27419612.0
##   age_c (%)                            
##      0 Child                       26.3
##      1 Working Age                 62.0
##      2 Senior                      11.7
##   empl_c (%)                           
##      1 Employed                    60.8
##      2 Unemployed                   3.7
##      3 Not in LF                   35.5
##   fulltime = 2 Full Time (%)       79.2
##   pov_b = Below Poverty (%)        16.3

b. Calculate percentages, or means, for each of your independent variables for each level of your outcome variable and present this in a table, with appropriate survey-corrected test statistics. (tableone package helps)

#not using survey design
t1<-CreateTableOne(vars = c("age_c", "empl_c","fulltime"), strata = "pov_b", test = T, data = data)
#t1<-print(t1, format="p")
print(t1,format="p")
##                             Stratified by pov_b
##                              Above Poverty Below Poverty p      test
##   n                          1063120       184843                   
##   age_c (%)                                              <0.001     
##      0 Child                    22.5         34.9                   
##      1 Working Age              60.9         54.4                   
##      2 Senior                   16.6         10.7                   
##   empl_c (%)                                             <0.001     
##      1 Employed                 63.2         33.5                   
##      2 Unemployed                2.7          7.9                   
##      3 Not in LF                34.1         58.7                   
##   fulltime = 2 Full Time (%)    81.2         51.7        <0.001
#using survey design
st1<-svyCreateTableOne(vars = c("age_c", "empl_c", "fulltime"), strata = "pov_b", test = T, data = des)
#st1<-print(st1, format="p")
print(st1, format="p")
##                             Stratified by pov_b
##                              Above Poverty Below Poverty p      test
##   n                          22446493.0    4383229.0                
##   age_c (%)                                              <0.001     
##      0 Child                       24.4         38.5                
##      1 Working Age                 63.2         53.9                
##      2 Senior                      12.4          7.7                
##   empl_c (%)                                             <0.001     
##      1 Employed                    66.3         36.2                
##      2 Unemployed                   3.1          8.3                
##      3 Not in LF                   30.7         55.5                
##   fulltime = 2 Full Time (%)       82.1         53.7     <0.001

c. Are there substantive differences in the descriptive results between the analysis using survey design and that not using survey design?

After applying the person weights and full survey design, the weighted data show a lower share of seniors and a higher share of children than the unweighted data, both overall and among those above and below poverty. Overall, seniors fall from 15.9% to 11.7% and children rise from 23.7% to 26.3%. This suggests that sampled children carry larger weights on average and sampled seniors carry smaller weights, which is consistent with seniors being overrepresented and children underrepresented in the raw sample relative to the population. A similar but smaller shift appears for full-time work: the share is higher in the weighted table (79.2%) than in the unweighted table (78.2%), suggesting that full-time workers carry somewhat larger weights. One possible explanation is that seniors and people who do not work full time are more likely to respond to the ACS, though these tables alone cannot confirm that.
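
To illustrate why weighting can shift these shares, here is a minimal sketch with made-up numbers. The data frame `toy` and its weights are hypothetical and are not drawn from the ACS extract; the point is only that a group with larger weights gains share once weights are applied.

```r
# Hypothetical toy sample: 4 seniors and 2 children
toy <- data.frame(
  group  = c("Senior", "Senior", "Senior", "Senior", "Child", "Child"),
  weight = c(10, 10, 10, 10, 40, 40)  # made-up person weights
)

# Unweighted share: each sampled person counts once
unweighted <- prop.table(table(toy$group))

# Weighted share: each sampled person counts as `weight` people
weighted <- tapply(toy$weight, toy$group, sum) / sum(toy$weight)

# Compare the two sets of percentages
round(100 * unweighted, 1)
round(100 * weighted, 1)
```

In this toy sample children are one third of the sampled rows but two thirds of the weighted total, because each sampled child stands in for more people than each sampled senior.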

Overall, both tables show similar patterns. The majority of people below poverty are not in the labor force (55.5% weighted), but roughly one third are employed (36.2% weighted). Children make up a higher share of those below poverty than seniors do, but the majority of those below poverty are working age. Lastly, among people below poverty who reported usual hours worked, about 54% (weighted) work full time.