Homework 2

For this homework, you will perform a logistic regression (or probit regression) of a binary outcome, using the dataset of your choice. I ask you specify a research question for your analysis and generate appropriate predictors in order to examine your question.

Present results from a model with sample weights and design effects, if your data allow for this. Present the results in tabular form, with Parameter estimates, odds ratios (if using the logit model) and confidence intervals for the odds ratios.

If you need help figuring out what your PSU, weight and sample strata variable are called in YOUR data, please ask.

Generate predicted probabilities for some “interesting” cases from your analysis, to highlight the effects from the model and your stated research question.

Publish your homework to Rpubs

Load necessary packages.

options(repos="https://cran.rstudio.com" )
library(htmlTable)
library(car)
## Loading required package: carData
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
library(questionr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Load data from ipums

I pulled data for 2018 5 Year ACS, with hopes to enventually do some time-series analysis. This data is only gathered for Texas, Oklahoma, and Louisiana, which are three likely states to have high concentrations of oil and gas employment. This was also done to limit the file size for my computer’s sake.

For iPUMS in R, more details available here: https://cran.r-project.org/web/packages/ipumsr/vignettes/ipums.html

# 
install.packages("ipumsr")
## 
## The downloaded binary packages are in
##  /var/folders/lr/thl33hwj2jdd01hth_whxlz00000gn/T//RtmpAUJvEs/downloaded_packages
library(ipumsr)
## Registered S3 methods overwritten by 'ipumsr':
##   method                                 from 
##   format.pillar_shaft_haven_labelled_chr haven
##   format.pillar_shaft_haven_labelled_num haven
##   pillar_shaft.haven_labelled            haven
# Note to self: Must set wd to where all files are, then open ddi and feed into read_ipums_micro
setwd("~/Desktop/UTSA/DEM7283_StatsII/HW2")
ddi <- read_ipums_ddi("usa_00011.xml")
data <- read_ipums_micro(ddi)
## Use of data from IPUMS-USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.
names(data)
##  [1] "YEAR"      "MULTYEAR"  "SAMPLE"    "SERIAL"    "CBSERIAL"  "HHWT"     
##  [7] "CLUSTER"   "STATEFIP"  "COUNTYFIP" "MET2013"   "PUMA"      "STRATA"   
## [13] "GQ"        "OWNERSHP"  "OWNERSHPD" "VALUEH"    "PERNUM"    "PERWT"    
## [19] "SEX"       "AGE"       "RACE"      "RACED"     "HISPAN"    "HISPAND"  
## [25] "EMPSTAT"   "EMPSTATD"  "LABFORCE"  "OCC"       "IND"       "CLASSWKR" 
## [31] "CLASSWKRD" "WKSWORK2"  "UHRSWORK"  "INCWAGE"   "POVERTY"   "MIGRATE1" 
## [37] "MIGRATE1D"

Loading BLS Data to Look at Location Quotients

In order to identify where there are counties with high values of oil and gas extraction, I found out I can access BLS data through their website, and read it directly into R! I’ve attempted to create a yes/no categorization for all counties in the U.S. in terms of whether they have a higher or lower value of oil/gas extraction employment (employment location quotient > 1.0).

#Reading in BLS data as csv
locationquot <- read.csv("http://www.bls.gov/cew/data/api/2018/a/industry/211.csv")
#recoding location quotients to >1 = "oil and gas county"
locationquot$LQ_OilGas <- Recode(locationquot$lq_annual_avg_emplvl, recodes="0:1 =0; else=1")
# counting number of oil and gas counties identified = total = 210
sum(locationquot$LQ_OilGas)
## [1] 210
library(dplyr)
LQmerge <- select(locationquot, area_fips, LQ_OilGas)
View(LQmerge)
## Warning in system2("/usr/bin/otool", c("-L", shQuote(DSO)), stdout = TRUE):
## running command ''/usr/bin/otool' -L '/Library/Frameworks/R.framework/Resources/
## modules/R_de.so'' had status 1

Joining “Oil Gas County” identifiers to ACS data from iPums

In order to focus on oil and gas counties only, I’m joining the oil/gas county binary variable to the ACS survey data from ipums. Understandably, there will be data suppression for many counties. However, this will allow us to examine the counties for which there is data, and compare them to counties without high concentrations of oil and gas employment. Joining will be done through the county fips field.

# Annoyingly, iPums doesn't have FIPS coded into 5 character format, need to pad county level fips with zeros and combine with state fips in order to join. Found this formula in google, seems to work easily enough!
data$county_format <- formatC(data$COUNTYFIP, width = 3, format = "d", flag = "0")
data$FIPS <- paste(data$STATEFIP, data$county_format, sep = "")

# and I used to do this with vlookup in excel, or through "joning" in ArcGIS, but here it seems the best option is a merge?
data_merge <- merge(data,LQmerge,by.x="FIPS",by.y="area_fips")

Binary outcome variable: living within an oil/gas county

Cleaning data/recoding variables

For this regression, I’m choosing to look at similar variables as last week’s homework, to determine if people living in oil/gas counties are more or lesslikely to be certain ages, to have moved within the last year, to be employed, to work full time, or to be in poverty.

data_merge$age_c <-Recode(data_merge$AGE, recodes="0:17='0 Child'; 18:64='1 Working Age'; else='2 Senior'", as.factor=T)

data_merge$empl_c <-Recode(data_merge$EMPSTAT, recodes="0=NA; 1='1 Employed'; 2='2 Unemployed'; 3='3 Not in LF'", as.factor=T)

data_merge$fulltime <- Recode(data_merge$UHRSWORK, recodes="0=NA; 1:34='1 Part Time'; else='2 Full Time'", as.factor=T)

data_merge$move <- Recode(data_merge$MIGRATE1, recodes="0=NA; 1='2. Did not Move'; 2:4='1. Moved within last year'; 9=NA", as.factor=T)

data_merge$hisp <- Recode(data_merge$HISPAN, recodes = "0='1. Not Hispanic'; 1:4='2. Hispanic/Latino'; else = NA", as.factor=T)

data_merge$pov_b <-Recode(data_merge$POVERTY, recodes="000=NA; 1:100='Below Poverty'; else='Above Poverty'", as.factor=T)
#Adding in full survey design: Creating survey object
library(survey)
des <- svydesign(id=~CLUSTER, strata=~STRATA, weights=~PERWT, data=data_merge)

#fitting logit model
fit.logit<-svyglm(LQ_OilGas~age_c+empl_c+fulltime+move+hisp+pov_b,
                  design= des,
                  family=binomial)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(fit.logit)
## 
## Call:
## svyglm(formula = LQ_OilGas ~ age_c + empl_c + fulltime + move + 
##     hisp + pov_b, design = des, family = binomial)
## 
## Survey design:
## svydesign(id = ~CLUSTER, strata = ~STRATA, weights = ~PERWT, 
##     data = data_merge)
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.463024   0.032184  45.459  < 2e-16 ***
## age_c1 Working Age     -0.088572   0.030042  -2.948 0.003196 ** 
## age_c2 Senior           0.045030   0.033002   1.364 0.172417    
## empl_c2 Unemployed      0.086997   0.023265   3.739 0.000184 ***
## empl_c3 Not in LF      -0.049679   0.016574  -2.997 0.002722 ** 
## fulltime2 Full Time     0.098910   0.009797  10.096  < 2e-16 ***
## move2. Did not Move    -0.017115   0.013169  -1.300 0.193729    
## hisp2. Hispanic/Latino -0.292685   0.010337 -28.316  < 2e-16 ***
## pov_bBelow Poverty      0.008689   0.016284   0.534 0.593626    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1.028778)
## 
## Number of Fisher Scoring iterations: 4

Percentage Table For All Variables

#using survey design
library(tableone)
st1<-svyCreateTableOne(vars = c("age_c", "empl_c","fulltime","move","hisp","pov_b"), strata = "LQ_OilGas", test = T, data = des)
#st1<-print(st1, format="p")
print(st1, format="p")
                           Stratified by LQ_OilGas
                            0         1          p      test

n 7259702.0 28322713.0
age_c (%) <0.001
0 Child 26.5 25.3
1 Working Age 61.8 61.8
2 Senior 11.7 12.9
empl_c (%) <0.001
1 Employed 61.0 59.9
2 Unemployed 3.4 3.5
3 Not in LF 35.7 36.6
fulltime = 2 Full Time (%) 77.7 79.4 <0.001
move = 2. Did not Move (%) 84.4 84.2 0.343
hisp = 2. Hispanic/Latino (%) 37.6 29.5 <0.001
pov_b = Below Poverty (%) 15.8 16.3 0.003

Interpretation

It seems that the location quotient cut off might be too low - maybe I need to focus on more specialized counties, rather than just a simple yes/no based on a baseline location quotient. There are a larger number of people in the oil/gas counties in Louisiana, Texas, and Oklahoma than those not in those counties, as currently defined. People in these counties are less likely to be Hispanic (not surprisingly), and are slightly more likley to be unemployed (interestingly). Further refinement is needed for this analysis to be useful, but I’m thankful to have gotten the framework this far.