I’ve taken a slightly different direction on the research question. Since this is a multilevel analysis, I’ve chosen to focus more on the individuals who do NOT work in the oil and gas industry in areas with high oil and gas employment, and see if their incomes are higher or lower than workers in areas without high levels of oil and gas employment.

Research Question

Are individuals living in areas with high concentrations of oil and gas employment, but not working within the industry, more or less likely to have higher incomes than workers in other areas?

Data: Instead of looking over time as before, I am now focusing on iPUMS 2014-2018 ACS microdata for Texas, Louisiana, and Oklahoma. Some descriptive statistics on income inequality and other aspects of the oil and gas areas will be provided through 2014-2015 ACS Aggregated Estimates at the PUMA level.

Geography: I think examining changes at the Public Use Microdata level is the correct geography, both because of data availability for more rural areas and for their likely economic ties.

Industry Definition: Due to limitations in the ACS data, this study will focus on all “extrative” industries: not only oil and gas, but also coal, mining, quarrying, and other mineral extraction. The increase of employment in these industries will be key to examine: areas with rapid employment increases or decreases could potentially be more at risk for shifting inequality.

Measuring inequality: For the sake of this early analysis, I am choosing to focus on a simple measure for the time being: the income quintile share ratio. This is a ratio of the total income received by the top 20% of earners compared to the total income recieved by the bottom 20% of earners. These numbers are available from the ACS. This measure was used in a study examining the income inequality in areas of Russia that had seen increases in oil and gas employment (Bucellato & Mickiewicz, 2009).

Other Variables: Since income is inextricably linked to other social factors in the U.S., and since the oil and gas industry is largely dominated by white men, incorporating race, ethnicity, and perhaps class of worker could be key to this analysis.

Load necessary packages.

options(repos="https://cran.rstudio.com" )
library(htmlTable)
library(car)
library(stargazer)
library(survey)
library(questionr)
library(dplyr)
library(tigris)
library(tidycensus)
library(sf)
library(ggplot2)
library(ipumsr)


#Census Loading
#census_api_key("76a8c99503b97d96cf9d7b170c23dd5a9070b289", install = TRUE)
# First time, reload your environment so you can use the key without restarting R.
#readRenviron("~/.Renviron")
# You can check it with:
#Sys.getenv("CENSUS_API_KEY")

#Searching for ACS Variables
v18s <- load_variables(2018, "acs5/subject", cache = TRUE)

v18 <- load_variables(2018, "acs5", cache = TRUE)
#then can filter and view different variables
#View(v18s)

Data Gathering pt. 1: Pulling PUMA Data from ACS

These data will be pulled in aggregate from the ACS 2018 5 Year Estimates by PUMA: * Total workers in Mining, Quarrying, Oil + Gas (includes part time) = S2403_C01_004E * Total workers 16 and older (includes part time) = S2403_C01_001E * Share of aggregate income, lowest quintile = B19082_001E * Share of aggregate income, highest quintile = B19082_005E * Total Population = B03002_001E * White, NonHispanic Population = B03002_003E * Black, NonHispanic Population = B03002_004E * Hispanic/Latino Population, of any race = B03002_012E * Number of Males = B01001_002E * Percent Unemployed = DP03_0009PE * Percent Below Poverty = DP03_0119PE * Median HH Income = DP03_0062E

Calculated values will be: * Percent White, Hispanic, Black, Male * Perent of workers in Oil/Gas * Income Quartile Share (IQS)

State FIPS of Interest (for future): * Louisiana = 22 * Oklahoma = 40 * Texas = 48

ext18_PUMA<-get_acs(geography="Public Use Microdata Area", year=2018,variables="S2403_C01_004")

## Getting data from the 2014-2018 5-year ACS

## Using the ACS Subject Tables

usacs_pumas<-get_acs(geography = "public use microdata area", year = 2018,
                variables=c( "S2403_C01_004", "S2403_C01_001","B19082_001","B19082_005", "B03002_003","B03002_004","B03002_012","B01001_002", "DP03_0062E", "DP03_0009PE", "DP03_0119PE", "B01002_001", "B03002_001","B03002_001E"),
                geometry = F, output = "wide")

## Getting data from the 2014-2018 5-year ACS

## Fetching data by table type ("B/C", "S", "DP") and combining the result.

usacs_pumas<-usacs_pumas%>%
  mutate(totpop= B03002_001E, nwhite=B03002_003E,pwhite=B03002_003E/B03002_001E,nblack=B03002_004E,
         pblack=B03002_004E/B03002_001E , nhisp = B03002_012E,phisp=B03002_012E/B03002_001E, nmale = B01001_002E,
         pmale=B01001_002E/B03002_001E, totwrk = S2403_C01_001E, extwrk = S2403_C01_004E, 
         pextwrk = S2403_C01_004E/S2403_C01_001E, shincLQ = B19082_001E, shincHQ = B19082_005E, 
         incQS = B19082_001E/B19082_005E, 
         statefip = substr(GEOID,start=1,stop=2),punemp=DP03_0009PE/100, medhhinc=DP03_0062E,
         ppov=DP03_0119PE/100)%>%
  select(GEOID, NAME, statefip, totpop, nwhite, pwhite, nblack, pblack, nhisp, phisp, nmale, pmale, totwrk, extwrk, pextwrk, shincLQ, shincHQ, incQS, punemp, medhhinc, ppov)

#Isolating out just Texas, Oklahoma, and Louisiana PUMAs
TXOKLA_pumas <- filter(usacs_pumas, statefip%in%c("22","40","48"))

#Calculating the total share of oil/gas employment for Texas, Oklahoma, and Louisiana to use as reference for location quotients (could have used a national point of comparison, but figured using these three states would select fewer areas as "high extractive industries" concentrated areas). 
TXOKLA_pextwrk <- sum(TXOKLA_pumas$extwrk, na.rm = TRUE)/sum(TXOKLA_pumas$totwrk, na.rm=TRUE)

#Adding a location quotient measure to the dataset. 
#LQ = share of industry employment in small area / share of industry employment in large area
#LQ = share extractive industry employment for PUMA / share extractive industry employment for TX+OK+LA
TXOKLA_pumas$LocQuotExtWrk <- TXOKLA_pumas$pextwrk/TXOKLA_pextwrk

#Selecting out areas with LQ > 1 (areas with a higher share of oil/gas/mining/quarrying employment than the overall share for TX, OK, & LA). Defining these as "Extractive Communities"
TXOKLA_pumas$ExtComm <-  Recode(TXOKLA_pumas$LocQuotExtWrk, recodes="0:1=0; 1:10=1", as.factor=F)

#With 274 PUMAs total in TX, OK, and LA, 85 out of the 274 (or 31%) have higher extractive industry employment than the overall share of the three states, with location quotients greater than 1. 
sum(TXOKLA_pumas$ExtComm)/274

## [1] 0.310219

#Scaling data can allow us to see where they fall in terms of the average of all other PUMAs (here we are generating Z scores)
myscale<-function(x){as.numeric(scale(x, center = TRUE, scale = TRUE))}
TXOKLA_pumas_scale<-TXOKLA_pumas %>%
  mutate_at(c("pwhite", "pblack", "phisp", "pmale", "pextwrk", "incQS", "punemp", "medhhinc", "ppov"),myscale)

to_merge<-TXOKLA_pumas_scale%>%filter(complete.cases(.))
head(to_merge)

## # A tibble: 6 x 23
##   GEOID NAME  statefip totpop nwhite pwhite nblack pblack nhisp phisp nmale
##   <chr> <chr> <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 2200… Coor… 22       123757  51313 -0.193  64818  2.86   4110 -1.16 60019
## 2 2200… Coor… 22       124604  60897  0.136  55984  2.31   2846 -1.20 57809
## 3 2200… Coor… 22       165762 109281  0.895  41463  0.821  9001 -1.08 81591
## 4 2200… Coor… 22       176062  97540  0.427  66211  1.76   4471 -1.19 86559
## 5 2200… Nort… 22       156075  90924  0.554  58313  1.74   3429 -1.21 74853
## 6 2200… Nort… 22       149728  89347  0.616  55451  1.72   3047 -1.21 74947
## # … with 12 more variables: pmale <dbl>, totwrk <dbl>, extwrk <dbl>,
## #   pextwrk <dbl>, shincLQ <dbl>, shincHQ <dbl>, incQS <dbl>, punemp <dbl>,
## #   medhhinc <dbl>, ppov <dbl>, LocQuotExtWrk <dbl>, ExtComm <dbl>

Descriptive Statistics for the “Extractive Communities”

Extracrive communities are henceforth defined as areas with LQ > 1 for oil/gas/mining/quarrying employment shares. With 274 PUMAs total in TX, OK, and LA, 85 out of the 274 (or 31%) have higher extractive industry employment than the overall share of the three states, with location quotients greater than 1.

library(tableone)

TXOKLA_pumas$ExtComm_F <- 
  factor(TXOKLA_pumas$ExtComm,
         levels=c(0,1),
         labels=c("Non-Extractive Communities", # Reference
                  "Extractive Communities"))

TXOKLA_pumas$StateName <- 
  factor(TXOKLA_pumas$statefip,
         levels=c(48,22,40),
         labels=c("Texas", # Reference
                  "Louisiana",
                  "Oklahoma"))

names(TXOKLA_pumas)

##  [1] "GEOID"         "NAME"          "statefip"      "totpop"       
##  [5] "nwhite"        "pwhite"        "nblack"        "pblack"       
##  [9] "nhisp"         "phisp"         "nmale"         "pmale"        
## [13] "totwrk"        "extwrk"        "pextwrk"       "shincLQ"      
## [17] "shincHQ"       "incQS"         "punemp"        "medhhinc"     
## [21] "ppov"          "LocQuotExtWrk" "ExtComm"       "ExtComm_F"    
## [25] "StateName"

library(boot)

## 
## Attaching package: 'boot'

## The following object is masked from 'package:survival':
## 
##     aml

## The following object is masked from 'package:car':
## 
##     logit

install.packages("table1")

## 
## The downloaded binary packages are in
##  /var/folders/lr/thl33hwj2jdd01hth_whxlz00000gn/T//Rtmp09BS1U/downloaded_packages

library(table1)

## 
## Attaching package: 'table1'

## The following objects are masked from 'package:base':
## 
##     units, units<-

table1(~ factor(StateName)+ incQS | ExtComm_F, data=TXOKLA_pumas)

	Non-Extractive Communities (N=189)	Extractive Communities (N=85)	Overall (N=274)
factor(StateName)
Texas	155 (82.0%)	57 (67.1%)	212 (77.4%)
Louisiana	21 (11.1%)	13 (15.3%)	34 (12.4%)
Oklahoma	13 (6.9%)	15 (17.6%)	28 (10.2%)
incQS
Mean (SD)	0.0767 (0.0219)	0.0711 (0.0163)	0.0749 (0.0205)
Median [Min, Max]	0.0738 [0.0234, 0.152]	0.0704 [0.0323, 0.123]	0.0720 [0.0234, 0.152]

t1<-CreateTableOne(vars = c("pwhite","pblack","phisp","pmale","punemp", "medhhinc", "ppov"), strata = c("ExtComm"), test = T, data = TXOKLA_pumas)
print(t1,format="p")

##                       Stratified by ExtComm
##                        0                   1                   p      test
##   n                         189                  85                       
##   pwhite (mean (SD))       0.41 (0.23)         0.55 (0.18)     <0.001     
##   pblack (mean (SD))       0.15 (0.14)         0.12 (0.11)      0.091     
##   phisp (mean (SD))        0.36 (0.26)         0.25 (0.21)      0.001     
##   pmale (mean (SD))        0.49 (0.01)         0.50 (0.01)      0.004     
##   punemp (mean (SD))       0.06 (0.02)         0.06 (0.02)      0.538     
##   medhhinc (mean (SD)) 58974.27 (20227.34) 59285.82 (18714.80)  0.904     
##   ppov (mean (SD))         0.13 (0.07)         0.12 (0.05)      0.048

plot(incQS~pextwrk, TXOKLA_pumas, main ="Income Inequality vs. % Workers in Ext Industries by PUMA")

plot(pwhite~pextwrk, TXOKLA_pumas, main ="Pct White Pop vs. % Workers in Ext Industries by PUMA")

plot(medhhinc~pextwrk, TXOKLA_pumas, main ="Median HH Income vs. % Workers in Ext Industries by PUMA")

plot(punemp~pextwrk, TXOKLA_pumas, main ="Pct Unemployment vs. % Workers in Ext Industries by PUMA")

plot(ppov~pextwrk, TXOKLA_pumas, main ="Pct Pop in Poverty vs. % Workers in Ext Industries by PUMA")

plot(pmale~pextwrk, TXOKLA_pumas, main ="Pct Male Pop vs. % Workers in Ext Industries by PUMA")

Data Gathering pt. 2: iPUMs microdata

setwd("~/Desktop/UTSA/DEM7283_StatsII/HW2")
ddi <- read_ipums_ddi("usa_00011.xml")
data_total <- read_ipums_micro(ddi)

## Use of data from IPUMS-USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.

names(data_total)

##  [1] "YEAR"      "MULTYEAR"  "SAMPLE"    "SERIAL"    "CBSERIAL"  "HHWT"     
##  [7] "CLUSTER"   "STATEFIP"  "COUNTYFIP" "MET2013"   "PUMA"      "STRATA"   
## [13] "GQ"        "OWNERSHP"  "OWNERSHPD" "VALUEH"    "PERNUM"    "PERWT"    
## [19] "SEX"       "AGE"       "RACE"      "RACED"     "HISPAN"    "HISPAND"  
## [25] "EMPSTAT"   "EMPSTATD"  "LABFORCE"  "OCC"       "IND"       "CLASSWKR" 
## [31] "CLASSWKRD" "WKSWORK2"  "UHRSWORK"  "INCWAGE"   "POVERTY"   "MIGRATE1" 
## [37] "MIGRATE1D"

# Need to make a joinerID for PUMAs with the state code added, and pad with leading zeros
data_total$PUMA_pad <- formatC(data_total$PUMA, width = 5, format = "d", flag = "0")
data_total$PUMA_joiner <- paste(data_total$STATEFIP, data_total$PUMA_pad, sep ="")

# Now we want to see the number of observations we have per PUMA (NOT population per PUMA, we have to account for weights in order to have that)
head(data.frame(name=table(data_total$PUMA_joiner),id=unique(data_total$PUMA_joiner)))

##   name.Var1 name.Freq      id
## 1   2200100      6086 2200100
## 2   2200101      4440 2202402
## 3   2200200      7298 2200700
## 4   2200300     11708 2200200
## 5   2200400      7222 2202500
## 6   2200500      9337 2200500

# Thankfully the number of PUMAs (274) matches the previous count from the ACS data. 
length(table(data_total$PUMA_joiner))

## [1] 274

# Now we want to join the data from the PUMA level to this dataset
data_merged<-merge(x=data_total, y=to_merge, by.x="PUMA_joiner", by.y="GEOID", all.x=F)

Data Cleaning: Recoding iPums Variables

# Filtering the data for only those who are employed (to remove those who are underage or who may be on social secruity/retired etc). This leaves us with 771,044 observations of the original 1,715,275. 
data_merged_empl <- filter(data_merged, EMPSTAT==1)

# Recoding Income (to use for mean and accounting for NAs)
data_merged_empl$income <- Recode(data_merged_empl$INCWAGE, recodes="999998:999999=NA")
describe(data_merged_empl$income)

## [771044 obs.] Wage and salary income
## labelled double: 5237 39800 8169 18000 25000 39568 366245 42319 23042 82522 ...
## min: 0 - max: 477388 - NAs: 0 (0%) - 3817 unique values
## 2 value labels: [999998] Missing [999999] N/A

# Outcome Variable = Income Z Score
data_merged_empl$incwage_z <- scale(data_merged_empl$income, center = TRUE, scale = TRUE)

# Recoding poverty status 
data_merged_empl$pov_c <-Recode(data_merged_empl$POVERTY, recodes="000=NA; 1:150='1. Below Poverty or Near Poverty'; else='Above Poverty'", as.factor=T)

# Recoding Workers in Ext Industry
data_merged_empl$ExtrWorker <- Recode(data_merged_empl$IND, recodes="9920=NA; 370:490='1. Extraction (Oil/Gas/Mining/Quarrying)'; else = '0. Non Extractive Industries'", as.factor=T)

data_merged_empl$ExtrWorker_Num <- Recode(data_merged_empl$IND, recodes="9920=NA; 370:490=1; else=2", as.factor=F)

# Recoding Workers in Ext Industry AND Ext Communities
data_merged_empl$ExtrWorker_ExtComm <-as.numeric(paste(data_merged_empl$ExtrWorker_Num,data_merged_empl$ExtComm,sep=""))

data_merged_empl$ExtrWorker_ExtCommC <- Recode(data_merged_empl$ExtrWorker_ExtComm, recodes="11='1. Ext Worker in Ext Community'; 21='2. NonExt Worker in Ext Community'; 20= '3. NonExt Worker in NonExt Community' ; 10= '4. Ext Worker in NonExt community'", as.factor=T)

# Recoding age into groups 
data_merged_empl$age_c <-Recode(data_merged_empl$AGE, recodes="0:15 = '0. Child';16:29 ='1. Young Adult'; 30:55='2. Working Age'; 56:65='3. Pre-Retirement' ; 65:93 = '4. Retirement Age'", as.factor=T)

# Recoding Hispanic/Latino ethnicity
data_merged_empl$hisp <- Recode(data_merged_empl$HISPAN, recodes = "0='1. Not Hispanic'; 1:4='2. Hispanic/Latino'; else = NA", as.factor=T)

# Recoding Race 
data_merged_empl$race_c <- Recode(data_merged_empl$RACE, recodes = "1='1. White'; 2='2. Black'; 3='3. Native American'; 4:6='4. Asian/PI'; else = '5. Other, or more than one race'", as.factor=T)

# Recoding Gender
data_merged_empl$gender_c <- Recode(data_merged_empl$SEX, recodes = "1='1. Male'; 2='2. Female'", as.factor=T)

Collecting Complete Cases and Cleaning final Data Set

#Selecting out only the variables of interest and ensuring complete cases:
merged_emp_final<-data_merged_empl%>%
  select(PUMA_joiner,CLUSTER,STRATA,GQ,PERWT,incwage_z, income, ExtrWorker,ExtrWorker_ExtComm,pov_c,race_c,hisp,age_c,NAME,ExtComm,pwhite,pblack,phisp,pmale,pextwrk,incQS,punemp,medhhinc,ppov,ExtrWorker_ExtCommC)%>%
  filter(complete.cases(.))

#In Filtering for complete cases, 763,652 employed person observations remain out of the 771,044 (a decrease of 7,392 observations, or ~1%).

mean_income<-mean(merged_emp_final$income, na.rm=T)
median_income<-median(merged_emp_final$income, na.rm=T)
sd_income<-sd(merged_emp_final$income, na.rm=T)

mean_incomeZ<-mean(merged_emp_final$incwage_z, na.rm=T)
sd_incomeZ<-sd(merged_emp_final$incwage_z, na.rm=T)

Descriptive Statistics for iPUMS data

#Using Survey Design
names(merged_emp_final)

##  [1] "PUMA_joiner"         "CLUSTER"             "STRATA"             
##  [4] "GQ"                  "PERWT"               "incwage_z"          
##  [7] "income"              "ExtrWorker"          "ExtrWorker_ExtComm" 
## [10] "pov_c"               "race_c"              "hisp"               
## [13] "age_c"               "NAME"                "ExtComm"            
## [16] "pwhite"              "pblack"              "phisp"              
## [19] "pmale"               "pextwrk"             "incQS"              
## [22] "punemp"              "medhhinc"            "ppov"               
## [25] "ExtrWorker_ExtCommC"

library(tableone)
library(survey)

merged_emp_final$ExtComm_F <- 
  factor(merged_emp_final$ExtComm,
         levels=c(0,1),
         labels=c("Non-Extractive Communities",
                  "Extractive Communities"))

des <- svydesign(id=~CLUSTER, strata=~STRATA, weights=~PERWT, data=merged_emp_final)
st1<-svyCreateTableOne(vars = c("income","race_c","hisp","ExtrWorker","ExtrWorker_ExtCommC"), strata = "ExtComm_F", test = T, data = des)

## Warning in svyCreateTableOne(vars = c("income", "race_c", "hisp", "ExtrWorker", : Dropping variable(s) income  due to unsupported class.

#st1<-print(st1, format="p")
print(st1, format="fp")

##                                                            Stratified by ExtComm_F
##                                                             Non-Extractive Communities
##   n                                                         11703635.0                
##   race_c (%)                                                                          
##      1. White                                                8472413.0 (72.4)         
##      2. Black                                                1664771.0 (14.2)         
##      3. Native American                                       122198.0 ( 1.0)         
##      4. Asian/PI                                              540285.0 ( 4.6)         
##      5. Other, or more than one race                          903968.0 ( 7.7)         
##   hisp = 2. Hispanic/Latino (%)                              3836327.0 (32.8)         
##   ExtrWorker = 1. Extraction (Oil/Gas/Mining/Quarrying) (%)   130142.0 ( 1.1)         
##   ExtrWorker_ExtCommC (%)                                                             
##      1. Ext Worker in Ext Community                                0.0 ( 0.0)         
##      2. NonExt Worker in Ext Community                             0.0 ( 0.0)         
##      3. NonExt Worker in NonExt Community                   11573493.0 (98.9)         
##      4. Ext Worker in NonExt community                        130142.0 ( 1.1)         
##                                                            Stratified by ExtComm_F
##                                                             Extractive Communities
##   n                                                         5080996.0             
##   race_c (%)                                                                      
##      1. White                                               3887099.0 (76.5)      
##      2. Black                                                580044.0 (11.4)      
##      3. Native American                                       70670.0 ( 1.4)      
##      4. Asian/PI                                             210093.0 ( 4.1)      
##      5. Other, or more than one race                         333090.0 ( 6.6)      
##   hisp = 2. Hispanic/Latino (%)                             1215269.0 (23.9)      
##   ExtrWorker = 1. Extraction (Oil/Gas/Mining/Quarrying) (%)  284154.0 ( 5.6)      
##   ExtrWorker_ExtCommC (%)                                                         
##      1. Ext Worker in Ext Community                          284154.0 ( 5.6)      
##      2. NonExt Worker in Ext Community                      4796842.0 (94.4)      
##      3. NonExt Worker in NonExt Community                         0.0 ( 0.0)      
##      4. Ext Worker in NonExt community                            0.0 ( 0.0)      
##                                                            Stratified by ExtComm_F
##                                                             p      test
##   n                                                                    
##   race_c (%)                                                <0.001     
##      1. White                                                          
##      2. Black                                                          
##      3. Native American                                                
##      4. Asian/PI                                                       
##      5. Other, or more than one race                                   
##   hisp = 2. Hispanic/Latino (%)                             <0.001     
##   ExtrWorker = 1. Extraction (Oil/Gas/Mining/Quarrying) (%) <0.001     
##   ExtrWorker_ExtCommC (%)                                   <0.001     
##      1. Ext Worker in Ext Community                                    
##      2. NonExt Worker in Ext Community                                 
##      3. NonExt Worker in NonExt Community                              
##      4. Ext Worker in NonExt community

print("Median Income in Extractive vs. NonExtractive Communities")

## [1] "Median Income in Extractive vs. NonExtractive Communities"

tapply(merged_emp_final$income, merged_emp_final$ExtComm_F, median)

## Non-Extractive Communities     Extractive Communities 
##                      35000                      34000

print("Mean Income in Extractive vs. NonExtractive Communities")

## [1] "Mean Income in Extractive vs. NonExtractive Communities"

tapply(merged_emp_final$income, merged_emp_final$ExtComm_F, mean)

## Non-Extractive Communities     Extractive Communities 
##                   50055.49                   49442.31

print("Median Income For Workers In/Out of Ext Communities and Industries")

## [1] "Median Income For Workers In/Out of Ext Communities and Industries"

tapply(merged_emp_final$income, merged_emp_final$ExtrWorker_ExtCommC, median)

##       1. Ext Worker in Ext Community    2. NonExt Worker in Ext Community 
##                                71725                                32000 
## 3. NonExt Worker in NonExt Community    4. Ext Worker in NonExt community 
##                                34913                                70087

print("Mean Income For Workers In/Out of Ext Communities and Industries")

## [1] "Mean Income For Workers In/Out of Ext Communities and Industries"

tapply(merged_emp_final$income, merged_emp_final$ExtrWorker_ExtCommC, mean)

##       1. Ext Worker in Ext Community    2. NonExt Worker in Ext Community 
##                             95643.26                             46664.66 
## 3. NonExt Worker in NonExt Community    4. Ext Worker in NonExt community 
##                             49560.56                             93217.86

Performing ANOVA Tests

In case the PUMAs did not have enough variation, I also wanted to check how the ANOVA would have results for all of the “extraction communties” as a unit. I have a dummy variable (1 = Extraction Community) to use for the test, and examine in terms of descriptive statistics. Both levels of geography are significant in the ANOVA, showing considerable variance among the geographies.

# Treating "extraction communities" as large geography
  fit.an_Ext<-lm(incwage_z~as.factor(ExtComm), merged_emp_final)
  anova(fit.an_Ext)

## Analysis of Variance Table
## 
## Response: incwage_z
##                        Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ExtComm)      1     16 16.0180  15.934 6.561e-05 ***
## Residuals          763650 767701  1.0053                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Treating PUMAs as large geography
  fit.an_Pumas<-lm(incwage_z~as.factor(PUMA_joiner), merged_emp_final)
  anova(fit.an_Pumas)

## Analysis of Variance Table
## 
## Response: incwage_z
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(PUMA_joiner)    273  50839 186.224   198.3 < 2.2e-16 ***
## Residuals              763378 716878   0.939                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multilevel Regression

library(lme4)
library(lmerTest)
library(arm)

Model with Context/Larger Geography Predictors (PUMAs)

The AICs on both a model considering the “extractive community” label as a large geography and on a PUMA model are both rather large. The PUMA model’s AIC is slighely smaller. However, I don’t belive this means I have a perfect model, and likely there are many factors of certain areas that influence whether the area has higher or lower levels of income (rural vs. urban, etc.). I’ll start moving forward with the PUMA geography, I was just curious how the Extractive Communities Label as a whole would affect the analysis.

I’m unsure where to factor in the aspects of the larger area that might be influencing the difference? Is that where I would put the percent of oil/gas/extractive employment? I couldn’t figure out where it would fit best in the model.

#Fitting model for extractive communities
fit_ext<-glmer(incwage_z~(1|ExtComm), data=merged_emp_final)

## Warning in glmer(incwage_z ~ (1 | ExtComm), data = merged_emp_final): calling
## glmer() with family=gaussian (identity link) as a shortcut to lmer() is
## deprecated; please call lmer() directly

arm::display(fit_ext, detail=T)

## lme4::lmer(formula = incwage_z ~ (1 | ExtComm), data = merged_emp_final)
##             coef.est coef.se t value
## (Intercept) 0.00     0.00    0.84   
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  ExtComm  (Intercept) 0.01    
##  Residual             1.00    
## ---
## number of obs: 763652, groups: ExtComm, 2
## AIC = 2.17121e+06, DIC = 2171186
## deviance = 2171195.0

#Fitting model for PUMAs
fit_puma<-glmer(incwage_z~(1|PUMA_joiner), data=merged_emp_final)

## Warning in glmer(incwage_z ~ (1 | PUMA_joiner), data = merged_emp_final):
## calling glmer() with family=gaussian (identity link) as a shortcut to lmer() is
## deprecated; please call lmer() directly

arm::display(fit_puma, detail=T)

## lme4::lmer(formula = incwage_z ~ (1 | PUMA_joiner), data = merged_emp_final)
##             coef.est coef.se t value
## (Intercept) -0.01     0.02   -0.45  
## 
## Error terms:
##  Groups      Name        Std.Dev.
##  PUMA_joiner (Intercept) 0.25    
##  Residual                0.97    
## ---
## number of obs: 763652, groups: PUMA_joiner, 274
## AIC = 2.1206e+06, DIC = 2120580
## deviance = 2120587.0

# Creating predicted income for PUMAs
merged_emp_final$pred_income<-sd_income*(fitted(fit_puma))+mean_income

head(ranef(fit_puma)$PUMA_joiner)

##         (Intercept)
## 2200100 -0.13707210
## 2200101 -0.01831929
## 2200200 -0.14509267
## 2200300 -0.14738772
## 2200400 -0.11760870
## 2200500 -0.19220255

dim(ranef(fit_puma)$PUMA_joiner)

## [1] 274   1

rate<- fixef(fit_puma)+ranef(fit_puma)$PUMA_joiner
est<-data.frame(rate =rate, id=rownames(ranef(fit_puma)$PUMA_joiner))

Maps (an attempt)

To get a better understanding for what exactly we’re looking at, I want to make some maps. But I ran into a LOT of problems in trying to get these maps made and honestly spent way too much time doing this and wound up scrambling to get the actual regression done at the last minute. I did wint up at least being able to show which communities are labeled as “Extractive Communities” base on their share of employment in the oil/gas/mining/quarrying sector.

library(ggplot2)
library(ggthemes)
library(rgeos)

## Loading required package: sp

## rgeos version: 0.5-1, (SVN revision 614)
##  GEOS runtime version: 3.7.2-CAPI-1.11.2 
##  Linking to sp version: 1.3-1 
##  Polygon checking: TRUE

library(sp)
install.packages("tigris")

## 
## The downloaded binary packages are in
##  /var/folders/lr/thl33hwj2jdd01hth_whxlz00000gn/T//Rtmp09BS1U/downloaded_packages

library(tigris)
options(tigris_use_cache = FALSE)

txokla <- c('Texas','Oklahoma','Louisiana')
txokla_pumageo <- rbind_tigris(lapply(txokla, function(x) pumas(state=x, cb=T)))

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==                                                                    |   2%
  |                                                                            
  |===                                                                   |   5%
  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |============                                                          |  18%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |=================                                                     |  25%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |=====================                                                 |  31%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==========================                                            |  38%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |==============================                                        |  42%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |===================================                                   |  49%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |=======================================                               |  55%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |============================================                          |  62%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |=================================================                     |  71%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |=====================================================                 |  75%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |==========================================================            |  84%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |===============================================================       |  89%
  |                                                                            
  |===============================================================       |  91%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |====================================================================  |  96%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================| 100%
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |===================================================================   |  95%
  |                                                                            
  |======================================================================| 100%
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |===================                                                   |  28%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |=====================================                                 |  52%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |===================================================                   |  74%
  |                                                                            
  |========================================================              |  81%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |======================================================================| 100%

pumas_joined<-geo_join(txokla_pumageo, TXOKLA_pumas, "GEOID10", "GEOID", how="inner")

library(RColorBrewer)
library(sp)
library(sf)

spplot(pumas_joined, "ExtComm")

spplot(pumas_joined, "medhhinc")

spplot(pumas_joined, "pmale")

spplot(pumas_joined, "punemp")

Showing Predicted income values on a map and chart

pumas_joined<-merge(pumas_joined, est, by.x="GEOID", by.y="id")
spplot(pumas_joined, "X.Intercept.")

spplot(pumas_joined[is.na(pumas_joined$X.Intercept.)==F,], "X.Intercept.")

head(est)

##         X.Intercept.      id
## 2200100  -0.14395279 2200100
## 2200101  -0.02519998 2200101
## 2200200  -0.15197336 2200200
## 2200300  -0.15426841 2200300
## 2200400  -0.12448939 2200400
## 2200500  -0.19908324 2200500

#Calculating mean income by PUMA vs predicted income
PUMAmeans<-aggregate(cbind(income, pred_income)~PUMA_joiner, merged_emp_final, mean)
head(PUMAmeans, n=10)

##    PUMA_joiner   income pred_income
## 1      2200100 40567.53    40964.54
## 2      2200101 47934.55    48309.38
## 3      2200200 40080.48    40468.47
## 4      2200300 39950.06    40326.53
## 5      2200400 41780.71    42168.35
## 6      2200500 37162.88    37554.74
## 7      2200600 39446.41    39846.69
## 8      2200700 41841.01    42213.94
## 9      2200800 44919.98    45297.22
## 10     2200900 44852.33    45232.96

#shows that we're really predicting income well based on PUMA. Not all that surprising. 
plot(pred_income~income, PUMAmeans)

Individual Level Predictors

For the aspects of the individual that might predict their income, I’m using: * Status as an Extraction Worker (binary outcome, 1 = yes) * Race * Hispanic/Latino Origin * Whether this person lives in an extractive community * Whether this person lives in an extractive community or not, and whether they work in an extractive industry (an attempt at a dummy variable for a multiplication factor?)

fit_ind<-lmer(incwage_z~race_c+hisp+ExtrWorker+ExtrWorker_ExtCommC+(1|PUMA_joiner), data=merged_emp_final,  na.action=na.omit)

## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient

arm::display(fit_ind, detail=T)

## lmer(formula = incwage_z ~ race_c + hisp + ExtrWorker + ExtrWorker_ExtCommC + 
##     (1 | PUMA_joiner), data = merged_emp_final, na.action = na.omit)
##                                                      coef.est coef.se t value
## (Intercept)                                             0.13     0.02    7.98
## race_c2. Black                                         -0.30     0.00  -77.15
## race_c3. Native American                               -0.10     0.01  -10.32
## race_c4. Asian/PI                                      -0.07     0.01  -11.57
## race_c5. Other, or more than one race                  -0.10     0.00  -22.83
## hisp2. Hispanic/Latino                                 -0.34     0.00 -111.34
## ExtrWorker1. Extraction (Oil/Gas/Mining/Quarrying)      0.71     0.03   22.76
## ExtrWorker_ExtCommC2. NonExt Worker in Ext Community   -0.05     0.03   -1.73
## ExtrWorker_ExtCommC4. Ext Worker in NonExt community   -0.02     0.03   -0.54
## 
## Error terms:
##  Groups      Name        Std.Dev.
##  PUMA_joiner (Intercept) 0.23    
##  Residual                0.95    
## ---
## number of obs: 763652, groups: PUMA_joiner, 274
## AIC = 2.0913e+06, DIC = 2091132
## deviance = 2091202.9

#Another ANOVA to conduct a likelihood Ratio Test (to see how much is explained by one model over another)
anova(fit_puma, fit_ind)

## refitting model(s) with ML (instead of REML)

## Data: merged_emp_final
## Models:
## fit_puma: incwage_z ~ (1 | PUMA_joiner)
## fit_ind: incwage_z ~ race_c + hisp + ExtrWorker + ExtrWorker_ExtCommC + 
## fit_ind:     (1 | PUMA_joiner)
##          Df     AIC     BIC   logLik deviance Chisq Chi Df Pr(>Chisq)    
## fit_puma  3 2120593 2120628 -1060293  2120587                            
## fit_ind  11 2091225 2091352 -1045601  2091203 29384      8  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

DEM 7283 - Stats II - Final Research Submission

Sarah Serpas

May 12, 2020