I’ve taken a slightly different direction on the research question. Since this is a multilevel analysis, I’ve chosen to focus more on the individuals who do NOT work in the oil and gas industry in areas with high oil and gas employment, and see if their incomes are higher or lower than workers in areas without high levels of oil and gas employment.
Are individuals living in areas with high concentrations of oil and gas employment, but not working within the industry, more or less likely to have higher incomes than workers in other areas?
Data: Instead of looking over time as before, I am now focusing on iPUMS 2014-2018 ACS microdata for Texas, Louisiana, and Oklahoma. Some descriptive statistics on income inequality and other aspects of the oil and gas areas will be provided through 2014-2015 ACS Aggregated Estimates at the PUMA level.
Geography: I think examining changes at the Public Use Microdata level is the correct geography, both because of data availability for more rural areas and for their likely economic ties.
Industry Definition: Due to limitations in the ACS data, this study will focus on all “extrative” industries: not only oil and gas, but also coal, mining, quarrying, and other mineral extraction. The increase of employment in these industries will be key to examine: areas with rapid employment increases or decreases could potentially be more at risk for shifting inequality.
Measuring inequality: For the sake of this early analysis, I am choosing to focus on a simple measure for the time being: the income quintile share ratio. This is a ratio of the total income received by the top 20% of earners compared to the total income recieved by the bottom 20% of earners. These numbers are available from the ACS. This measure was used in a study examining the income inequality in areas of Russia that had seen increases in oil and gas employment (Bucellato & Mickiewicz, 2009).
Other Variables: Since income is inextricably linked to other social factors in the U.S., and since the oil and gas industry is largely dominated by white men, incorporating race, ethnicity, and perhaps class of worker could be key to this analysis.
options(repos="https://cran.rstudio.com" )
library(htmlTable)
library(car)
library(stargazer)
library(survey)
library(questionr)
library(dplyr)
library(tigris)
library(tidycensus)
library(sf)
library(ggplot2)
library(ipumsr)
#Census Loading
#census_api_key("76a8c99503b97d96cf9d7b170c23dd5a9070b289", install = TRUE)
# First time, reload your environment so you can use the key without restarting R.
#readRenviron("~/.Renviron")
# You can check it with:
#Sys.getenv("CENSUS_API_KEY")
#Searching for ACS Variables
v18s <- load_variables(2018, "acs5/subject", cache = TRUE)
v18 <- load_variables(2018, "acs5", cache = TRUE)
#then can filter and view different variables
#View(v18s)
These data will be pulled in aggregate from the ACS 2018 5 Year Estimates by PUMA: * Total workers in Mining, Quarrying, Oil + Gas (includes part time) = S2403_C01_004E * Total workers 16 and older (includes part time) = S2403_C01_001E * Share of aggregate income, lowest quintile = B19082_001E * Share of aggregate income, highest quintile = B19082_005E * Total Population = B03002_001E * White, NonHispanic Population = B03002_003E * Black, NonHispanic Population = B03002_004E * Hispanic/Latino Population, of any race = B03002_012E * Number of Males = B01001_002E * Percent Unemployed = DP03_0009PE * Percent Below Poverty = DP03_0119PE * Median HH Income = DP03_0062E
Calculated values will be: * Percent White, Hispanic, Black, Male * Perent of workers in Oil/Gas * Income Quartile Share (IQS)
State FIPS of Interest (for future): * Louisiana = 22 * Oklahoma = 40 * Texas = 48
ext18_PUMA<-get_acs(geography="Public Use Microdata Area", year=2018,variables="S2403_C01_004")
## Getting data from the 2014-2018 5-year ACS
## Using the ACS Subject Tables
usacs_pumas<-get_acs(geography = "public use microdata area", year = 2018,
variables=c( "S2403_C01_004", "S2403_C01_001","B19082_001","B19082_005", "B03002_003","B03002_004","B03002_012","B01001_002", "DP03_0062E", "DP03_0009PE", "DP03_0119PE", "B01002_001", "B03002_001","B03002_001E"),
geometry = F, output = "wide")
## Getting data from the 2014-2018 5-year ACS
## Fetching data by table type ("B/C", "S", "DP") and combining the result.
usacs_pumas<-usacs_pumas%>%
mutate(totpop= B03002_001E, nwhite=B03002_003E,pwhite=B03002_003E/B03002_001E,nblack=B03002_004E,
pblack=B03002_004E/B03002_001E , nhisp = B03002_012E,phisp=B03002_012E/B03002_001E, nmale = B01001_002E,
pmale=B01001_002E/B03002_001E, totwrk = S2403_C01_001E, extwrk = S2403_C01_004E,
pextwrk = S2403_C01_004E/S2403_C01_001E, shincLQ = B19082_001E, shincHQ = B19082_005E,
incQS = B19082_001E/B19082_005E,
statefip = substr(GEOID,start=1,stop=2),punemp=DP03_0009PE/100, medhhinc=DP03_0062E,
ppov=DP03_0119PE/100)%>%
select(GEOID, NAME, statefip, totpop, nwhite, pwhite, nblack, pblack, nhisp, phisp, nmale, pmale, totwrk, extwrk, pextwrk, shincLQ, shincHQ, incQS, punemp, medhhinc, ppov)
#Isolating out just Texas, Oklahoma, and Louisiana PUMAs
TXOKLA_pumas <- filter(usacs_pumas, statefip%in%c("22","40","48"))
#Calculating the total share of oil/gas employment for Texas, Oklahoma, and Louisiana to use as reference for location quotients (could have used a national point of comparison, but figured using these three states would select fewer areas as "high extractive industries" concentrated areas).
TXOKLA_pextwrk <- sum(TXOKLA_pumas$extwrk, na.rm = TRUE)/sum(TXOKLA_pumas$totwrk, na.rm=TRUE)
#Adding a location quotient measure to the dataset.
#LQ = share of industry employment in small area / share of industry employment in large area
#LQ = share extractive industry employment for PUMA / share extractive industry employment for TX+OK+LA
TXOKLA_pumas$LocQuotExtWrk <- TXOKLA_pumas$pextwrk/TXOKLA_pextwrk
#Selecting out areas with LQ > 1 (areas with a higher share of oil/gas/mining/quarrying employment than the overall share for TX, OK, & LA). Defining these as "Extractive Communities"
TXOKLA_pumas$ExtComm <- Recode(TXOKLA_pumas$LocQuotExtWrk, recodes="0:1=0; 1:10=1", as.factor=F)
#With 274 PUMAs total in TX, OK, and LA, 85 out of the 274 (or 31%) have higher extractive industry employment than the overall share of the three states, with location quotients greater than 1.
sum(TXOKLA_pumas$ExtComm)/274
## [1] 0.310219
#Scaling data can allow us to see where they fall in terms of the average of all other PUMAs (here we are generating Z scores)
myscale<-function(x){as.numeric(scale(x, center = TRUE, scale = TRUE))}
TXOKLA_pumas_scale<-TXOKLA_pumas %>%
mutate_at(c("pwhite", "pblack", "phisp", "pmale", "pextwrk", "incQS", "punemp", "medhhinc", "ppov"),myscale)
to_merge<-TXOKLA_pumas_scale%>%filter(complete.cases(.))
head(to_merge)
## # A tibble: 6 x 23
## GEOID NAME statefip totpop nwhite pwhite nblack pblack nhisp phisp nmale
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2200… Coor… 22 123757 51313 -0.193 64818 2.86 4110 -1.16 60019
## 2 2200… Coor… 22 124604 60897 0.136 55984 2.31 2846 -1.20 57809
## 3 2200… Coor… 22 165762 109281 0.895 41463 0.821 9001 -1.08 81591
## 4 2200… Coor… 22 176062 97540 0.427 66211 1.76 4471 -1.19 86559
## 5 2200… Nort… 22 156075 90924 0.554 58313 1.74 3429 -1.21 74853
## 6 2200… Nort… 22 149728 89347 0.616 55451 1.72 3047 -1.21 74947
## # … with 12 more variables: pmale <dbl>, totwrk <dbl>, extwrk <dbl>,
## # pextwrk <dbl>, shincLQ <dbl>, shincHQ <dbl>, incQS <dbl>, punemp <dbl>,
## # medhhinc <dbl>, ppov <dbl>, LocQuotExtWrk <dbl>, ExtComm <dbl>
Extracrive communities are henceforth defined as areas with LQ > 1 for oil/gas/mining/quarrying employment shares. With 274 PUMAs total in TX, OK, and LA, 85 out of the 274 (or 31%) have higher extractive industry employment than the overall share of the three states, with location quotients greater than 1.
library(tableone)
TXOKLA_pumas$ExtComm_F <-
factor(TXOKLA_pumas$ExtComm,
levels=c(0,1),
labels=c("Non-Extractive Communities", # Reference
"Extractive Communities"))
TXOKLA_pumas$StateName <-
factor(TXOKLA_pumas$statefip,
levels=c(48,22,40),
labels=c("Texas", # Reference
"Louisiana",
"Oklahoma"))
names(TXOKLA_pumas)
## [1] "GEOID" "NAME" "statefip" "totpop"
## [5] "nwhite" "pwhite" "nblack" "pblack"
## [9] "nhisp" "phisp" "nmale" "pmale"
## [13] "totwrk" "extwrk" "pextwrk" "shincLQ"
## [17] "shincHQ" "incQS" "punemp" "medhhinc"
## [21] "ppov" "LocQuotExtWrk" "ExtComm" "ExtComm_F"
## [25] "StateName"
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:survival':
##
## aml
## The following object is masked from 'package:car':
##
## logit
install.packages("table1")
##
## The downloaded binary packages are in
## /var/folders/lr/thl33hwj2jdd01hth_whxlz00000gn/T//Rtmp09BS1U/downloaded_packages
library(table1)
##
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
##
## units, units<-
table1(~ factor(StateName)+ incQS | ExtComm_F, data=TXOKLA_pumas)
| Non-Extractive Communities (N=189) |
Extractive Communities (N=85) |
Overall (N=274) |
|
|---|---|---|---|
| factor(StateName) | |||
| Texas | 155 (82.0%) | 57 (67.1%) | 212 (77.4%) |
| Louisiana | 21 (11.1%) | 13 (15.3%) | 34 (12.4%) |
| Oklahoma | 13 (6.9%) | 15 (17.6%) | 28 (10.2%) |
| incQS | |||
| Mean (SD) | 0.0767 (0.0219) | 0.0711 (0.0163) | 0.0749 (0.0205) |
| Median [Min, Max] | 0.0738 [0.0234, 0.152] | 0.0704 [0.0323, 0.123] | 0.0720 [0.0234, 0.152] |
t1<-CreateTableOne(vars = c("pwhite","pblack","phisp","pmale","punemp", "medhhinc", "ppov"), strata = c("ExtComm"), test = T, data = TXOKLA_pumas)
print(t1,format="p")
## Stratified by ExtComm
## 0 1 p test
## n 189 85
## pwhite (mean (SD)) 0.41 (0.23) 0.55 (0.18) <0.001
## pblack (mean (SD)) 0.15 (0.14) 0.12 (0.11) 0.091
## phisp (mean (SD)) 0.36 (0.26) 0.25 (0.21) 0.001
## pmale (mean (SD)) 0.49 (0.01) 0.50 (0.01) 0.004
## punemp (mean (SD)) 0.06 (0.02) 0.06 (0.02) 0.538
## medhhinc (mean (SD)) 58974.27 (20227.34) 59285.82 (18714.80) 0.904
## ppov (mean (SD)) 0.13 (0.07) 0.12 (0.05) 0.048
plot(incQS~pextwrk, TXOKLA_pumas, main ="Income Inequality vs. % Workers in Ext Industries by PUMA")
plot(pwhite~pextwrk, TXOKLA_pumas, main ="Pct White Pop vs. % Workers in Ext Industries by PUMA")
plot(medhhinc~pextwrk, TXOKLA_pumas, main ="Median HH Income vs. % Workers in Ext Industries by PUMA")
plot(punemp~pextwrk, TXOKLA_pumas, main ="Pct Unemployment vs. % Workers in Ext Industries by PUMA")
plot(ppov~pextwrk, TXOKLA_pumas, main ="Pct Pop in Poverty vs. % Workers in Ext Industries by PUMA")
plot(pmale~pextwrk, TXOKLA_pumas, main ="Pct Male Pop vs. % Workers in Ext Industries by PUMA")
setwd("~/Desktop/UTSA/DEM7283_StatsII/HW2")
ddi <- read_ipums_ddi("usa_00011.xml")
data_total <- read_ipums_micro(ddi)
## Use of data from IPUMS-USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.
names(data_total)
## [1] "YEAR" "MULTYEAR" "SAMPLE" "SERIAL" "CBSERIAL" "HHWT"
## [7] "CLUSTER" "STATEFIP" "COUNTYFIP" "MET2013" "PUMA" "STRATA"
## [13] "GQ" "OWNERSHP" "OWNERSHPD" "VALUEH" "PERNUM" "PERWT"
## [19] "SEX" "AGE" "RACE" "RACED" "HISPAN" "HISPAND"
## [25] "EMPSTAT" "EMPSTATD" "LABFORCE" "OCC" "IND" "CLASSWKR"
## [31] "CLASSWKRD" "WKSWORK2" "UHRSWORK" "INCWAGE" "POVERTY" "MIGRATE1"
## [37] "MIGRATE1D"
# Need to make a joinerID for PUMAs with the state code added, and pad with leading zeros
data_total$PUMA_pad <- formatC(data_total$PUMA, width = 5, format = "d", flag = "0")
data_total$PUMA_joiner <- paste(data_total$STATEFIP, data_total$PUMA_pad, sep ="")
# Now we want to see the number of observations we have per PUMA (NOT population per PUMA, we have to account for weights in order to have that)
head(data.frame(name=table(data_total$PUMA_joiner),id=unique(data_total$PUMA_joiner)))
## name.Var1 name.Freq id
## 1 2200100 6086 2200100
## 2 2200101 4440 2202402
## 3 2200200 7298 2200700
## 4 2200300 11708 2200200
## 5 2200400 7222 2202500
## 6 2200500 9337 2200500
# Thankfully the number of PUMAs (274) matches the previous count from the ACS data.
length(table(data_total$PUMA_joiner))
## [1] 274
# Now we want to join the data from the PUMA level to this dataset
data_merged<-merge(x=data_total, y=to_merge, by.x="PUMA_joiner", by.y="GEOID", all.x=F)
# Filtering the data for only those who are employed (to remove those who are underage or who may be on social secruity/retired etc). This leaves us with 771,044 observations of the original 1,715,275.
data_merged_empl <- filter(data_merged, EMPSTAT==1)
# Recoding Income (to use for mean and accounting for NAs)
data_merged_empl$income <- Recode(data_merged_empl$INCWAGE, recodes="999998:999999=NA")
describe(data_merged_empl$income)
## [771044 obs.] Wage and salary income
## labelled double: 5237 39800 8169 18000 25000 39568 366245 42319 23042 82522 ...
## min: 0 - max: 477388 - NAs: 0 (0%) - 3817 unique values
## 2 value labels: [999998] Missing [999999] N/A
# Outcome Variable = Income Z Score
data_merged_empl$incwage_z <- scale(data_merged_empl$income, center = TRUE, scale = TRUE)
# Recoding poverty status
data_merged_empl$pov_c <-Recode(data_merged_empl$POVERTY, recodes="000=NA; 1:150='1. Below Poverty or Near Poverty'; else='Above Poverty'", as.factor=T)
# Recoding Workers in Ext Industry
data_merged_empl$ExtrWorker <- Recode(data_merged_empl$IND, recodes="9920=NA; 370:490='1. Extraction (Oil/Gas/Mining/Quarrying)'; else = '0. Non Extractive Industries'", as.factor=T)
data_merged_empl$ExtrWorker_Num <- Recode(data_merged_empl$IND, recodes="9920=NA; 370:490=1; else=2", as.factor=F)
# Recoding Workers in Ext Industry AND Ext Communities
data_merged_empl$ExtrWorker_ExtComm <-as.numeric(paste(data_merged_empl$ExtrWorker_Num,data_merged_empl$ExtComm,sep=""))
data_merged_empl$ExtrWorker_ExtCommC <- Recode(data_merged_empl$ExtrWorker_ExtComm, recodes="11='1. Ext Worker in Ext Community'; 21='2. NonExt Worker in Ext Community'; 20= '3. NonExt Worker in NonExt Community' ; 10= '4. Ext Worker in NonExt community'", as.factor=T)
# Recoding age into groups
data_merged_empl$age_c <-Recode(data_merged_empl$AGE, recodes="0:15 = '0. Child';16:29 ='1. Young Adult'; 30:55='2. Working Age'; 56:65='3. Pre-Retirement' ; 65:93 = '4. Retirement Age'", as.factor=T)
# Recoding Hispanic/Latino ethnicity
data_merged_empl$hisp <- Recode(data_merged_empl$HISPAN, recodes = "0='1. Not Hispanic'; 1:4='2. Hispanic/Latino'; else = NA", as.factor=T)
# Recoding Race
data_merged_empl$race_c <- Recode(data_merged_empl$RACE, recodes = "1='1. White'; 2='2. Black'; 3='3. Native American'; 4:6='4. Asian/PI'; else = '5. Other, or more than one race'", as.factor=T)
# Recoding Gender
data_merged_empl$gender_c <- Recode(data_merged_empl$SEX, recodes = "1='1. Male'; 2='2. Female'", as.factor=T)
#Selecting out only the variables of interest and ensuring complete cases:
merged_emp_final<-data_merged_empl%>%
select(PUMA_joiner,CLUSTER,STRATA,GQ,PERWT,incwage_z, income, ExtrWorker,ExtrWorker_ExtComm,pov_c,race_c,hisp,age_c,NAME,ExtComm,pwhite,pblack,phisp,pmale,pextwrk,incQS,punemp,medhhinc,ppov,ExtrWorker_ExtCommC)%>%
filter(complete.cases(.))
#In Filtering for complete cases, 763,652 employed person observations remain out of the 771,044 (a decrease of 7,392 observations, or ~1%).
mean_income<-mean(merged_emp_final$income, na.rm=T)
median_income<-median(merged_emp_final$income, na.rm=T)
sd_income<-sd(merged_emp_final$income, na.rm=T)
mean_incomeZ<-mean(merged_emp_final$incwage_z, na.rm=T)
sd_incomeZ<-sd(merged_emp_final$incwage_z, na.rm=T)
#Using Survey Design
names(merged_emp_final)
## [1] "PUMA_joiner" "CLUSTER" "STRATA"
## [4] "GQ" "PERWT" "incwage_z"
## [7] "income" "ExtrWorker" "ExtrWorker_ExtComm"
## [10] "pov_c" "race_c" "hisp"
## [13] "age_c" "NAME" "ExtComm"
## [16] "pwhite" "pblack" "phisp"
## [19] "pmale" "pextwrk" "incQS"
## [22] "punemp" "medhhinc" "ppov"
## [25] "ExtrWorker_ExtCommC"
library(tableone)
library(survey)
merged_emp_final$ExtComm_F <-
factor(merged_emp_final$ExtComm,
levels=c(0,1),
labels=c("Non-Extractive Communities",
"Extractive Communities"))
des <- svydesign(id=~CLUSTER, strata=~STRATA, weights=~PERWT, data=merged_emp_final)
st1<-svyCreateTableOne(vars = c("income","race_c","hisp","ExtrWorker","ExtrWorker_ExtCommC"), strata = "ExtComm_F", test = T, data = des)
## Warning in svyCreateTableOne(vars = c("income", "race_c", "hisp", "ExtrWorker", : Dropping variable(s) income due to unsupported class.
#st1<-print(st1, format="p")
print(st1, format="fp")
## Stratified by ExtComm_F
## Non-Extractive Communities
## n 11703635.0
## race_c (%)
## 1. White 8472413.0 (72.4)
## 2. Black 1664771.0 (14.2)
## 3. Native American 122198.0 ( 1.0)
## 4. Asian/PI 540285.0 ( 4.6)
## 5. Other, or more than one race 903968.0 ( 7.7)
## hisp = 2. Hispanic/Latino (%) 3836327.0 (32.8)
## ExtrWorker = 1. Extraction (Oil/Gas/Mining/Quarrying) (%) 130142.0 ( 1.1)
## ExtrWorker_ExtCommC (%)
## 1. Ext Worker in Ext Community 0.0 ( 0.0)
## 2. NonExt Worker in Ext Community 0.0 ( 0.0)
## 3. NonExt Worker in NonExt Community 11573493.0 (98.9)
## 4. Ext Worker in NonExt community 130142.0 ( 1.1)
## Stratified by ExtComm_F
## Extractive Communities
## n 5080996.0
## race_c (%)
## 1. White 3887099.0 (76.5)
## 2. Black 580044.0 (11.4)
## 3. Native American 70670.0 ( 1.4)
## 4. Asian/PI 210093.0 ( 4.1)
## 5. Other, or more than one race 333090.0 ( 6.6)
## hisp = 2. Hispanic/Latino (%) 1215269.0 (23.9)
## ExtrWorker = 1. Extraction (Oil/Gas/Mining/Quarrying) (%) 284154.0 ( 5.6)
## ExtrWorker_ExtCommC (%)
## 1. Ext Worker in Ext Community 284154.0 ( 5.6)
## 2. NonExt Worker in Ext Community 4796842.0 (94.4)
## 3. NonExt Worker in NonExt Community 0.0 ( 0.0)
## 4. Ext Worker in NonExt community 0.0 ( 0.0)
## Stratified by ExtComm_F
## p test
## n
## race_c (%) <0.001
## 1. White
## 2. Black
## 3. Native American
## 4. Asian/PI
## 5. Other, or more than one race
## hisp = 2. Hispanic/Latino (%) <0.001
## ExtrWorker = 1. Extraction (Oil/Gas/Mining/Quarrying) (%) <0.001
## ExtrWorker_ExtCommC (%) <0.001
## 1. Ext Worker in Ext Community
## 2. NonExt Worker in Ext Community
## 3. NonExt Worker in NonExt Community
## 4. Ext Worker in NonExt community
print("Median Income in Extractive vs. NonExtractive Communities")
## [1] "Median Income in Extractive vs. NonExtractive Communities"
tapply(merged_emp_final$income, merged_emp_final$ExtComm_F, median)
## Non-Extractive Communities Extractive Communities
## 35000 34000
print("Mean Income in Extractive vs. NonExtractive Communities")
## [1] "Mean Income in Extractive vs. NonExtractive Communities"
tapply(merged_emp_final$income, merged_emp_final$ExtComm_F, mean)
## Non-Extractive Communities Extractive Communities
## 50055.49 49442.31
print("Median Income For Workers In/Out of Ext Communities and Industries")
## [1] "Median Income For Workers In/Out of Ext Communities and Industries"
tapply(merged_emp_final$income, merged_emp_final$ExtrWorker_ExtCommC, median)
## 1. Ext Worker in Ext Community 2. NonExt Worker in Ext Community
## 71725 32000
## 3. NonExt Worker in NonExt Community 4. Ext Worker in NonExt community
## 34913 70087
print("Mean Income For Workers In/Out of Ext Communities and Industries")
## [1] "Mean Income For Workers In/Out of Ext Communities and Industries"
tapply(merged_emp_final$income, merged_emp_final$ExtrWorker_ExtCommC, mean)
## 1. Ext Worker in Ext Community 2. NonExt Worker in Ext Community
## 95643.26 46664.66
## 3. NonExt Worker in NonExt Community 4. Ext Worker in NonExt community
## 49560.56 93217.86
In case the PUMAs did not have enough variation, I also wanted to check how the ANOVA would have results for all of the “extraction communties” as a unit. I have a dummy variable (1 = Extraction Community) to use for the test, and examine in terms of descriptive statistics. Both levels of geography are significant in the ANOVA, showing considerable variance among the geographies.
# Treating "extraction communities" as large geography
fit.an_Ext<-lm(incwage_z~as.factor(ExtComm), merged_emp_final)
anova(fit.an_Ext)
## Analysis of Variance Table
##
## Response: incwage_z
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ExtComm) 1 16 16.0180 15.934 6.561e-05 ***
## Residuals 763650 767701 1.0053
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Treating PUMAs as large geography
fit.an_Pumas<-lm(incwage_z~as.factor(PUMA_joiner), merged_emp_final)
anova(fit.an_Pumas)
## Analysis of Variance Table
##
## Response: incwage_z
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(PUMA_joiner) 273 50839 186.224 198.3 < 2.2e-16 ***
## Residuals 763378 716878 0.939
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(lme4)
library(lmerTest)
library(arm)
The AICs on both a model considering the “extractive community” label as a large geography and on a PUMA model are both rather large. The PUMA model’s AIC is slighely smaller. However, I don’t belive this means I have a perfect model, and likely there are many factors of certain areas that influence whether the area has higher or lower levels of income (rural vs. urban, etc.). I’ll start moving forward with the PUMA geography, I was just curious how the Extractive Communities Label as a whole would affect the analysis.
I’m unsure where to factor in the aspects of the larger area that might be influencing the difference? Is that where I would put the percent of oil/gas/extractive employment? I couldn’t figure out where it would fit best in the model.
#Fitting model for extractive communities
fit_ext<-glmer(incwage_z~(1|ExtComm), data=merged_emp_final)
## Warning in glmer(incwage_z ~ (1 | ExtComm), data = merged_emp_final): calling
## glmer() with family=gaussian (identity link) as a shortcut to lmer() is
## deprecated; please call lmer() directly
arm::display(fit_ext, detail=T)
## lme4::lmer(formula = incwage_z ~ (1 | ExtComm), data = merged_emp_final)
## coef.est coef.se t value
## (Intercept) 0.00 0.00 0.84
##
## Error terms:
## Groups Name Std.Dev.
## ExtComm (Intercept) 0.01
## Residual 1.00
## ---
## number of obs: 763652, groups: ExtComm, 2
## AIC = 2.17121e+06, DIC = 2171186
## deviance = 2171195.0
#Fitting model for PUMAs
fit_puma<-glmer(incwage_z~(1|PUMA_joiner), data=merged_emp_final)
## Warning in glmer(incwage_z ~ (1 | PUMA_joiner), data = merged_emp_final):
## calling glmer() with family=gaussian (identity link) as a shortcut to lmer() is
## deprecated; please call lmer() directly
arm::display(fit_puma, detail=T)
## lme4::lmer(formula = incwage_z ~ (1 | PUMA_joiner), data = merged_emp_final)
## coef.est coef.se t value
## (Intercept) -0.01 0.02 -0.45
##
## Error terms:
## Groups Name Std.Dev.
## PUMA_joiner (Intercept) 0.25
## Residual 0.97
## ---
## number of obs: 763652, groups: PUMA_joiner, 274
## AIC = 2.1206e+06, DIC = 2120580
## deviance = 2120587.0
# Creating predicted income for PUMAs
merged_emp_final$pred_income<-sd_income*(fitted(fit_puma))+mean_income
head(ranef(fit_puma)$PUMA_joiner)
## (Intercept)
## 2200100 -0.13707210
## 2200101 -0.01831929
## 2200200 -0.14509267
## 2200300 -0.14738772
## 2200400 -0.11760870
## 2200500 -0.19220255
dim(ranef(fit_puma)$PUMA_joiner)
## [1] 274 1
rate<- fixef(fit_puma)+ranef(fit_puma)$PUMA_joiner
est<-data.frame(rate =rate, id=rownames(ranef(fit_puma)$PUMA_joiner))
To get a better understanding for what exactly we’re looking at, I want to make some maps. But I ran into a LOT of problems in trying to get these maps made and honestly spent way too much time doing this and wound up scrambling to get the actual regression done at the last minute. I did wint up at least being able to show which communities are labeled as “Extractive Communities” base on their share of employment in the oil/gas/mining/quarrying sector.
library(ggplot2)
library(ggthemes)
library(rgeos)
## Loading required package: sp
## rgeos version: 0.5-1, (SVN revision 614)
## GEOS runtime version: 3.7.2-CAPI-1.11.2
## Linking to sp version: 1.3-1
## Polygon checking: TRUE
library(sp)
install.packages("tigris")
##
## The downloaded binary packages are in
## /var/folders/lr/thl33hwj2jdd01hth_whxlz00000gn/T//Rtmp09BS1U/downloaded_packages
library(tigris)
options(tigris_use_cache = FALSE)
txokla <- c('Texas','Oklahoma','Louisiana')
txokla_pumageo <- rbind_tigris(lapply(txokla, function(x) pumas(state=x, cb=T)))
##
|
| | 0%
|
|== | 2%
|
|=== | 5%
|
|==== | 5%
|
|===== | 8%
|
|====== | 9%
|
|======== | 12%
|
|========== | 14%
|
|=========== | 15%
|
|============ | 18%
|
|============= | 19%
|
|=============== | 21%
|
|================ | 23%
|
|================= | 25%
|
|=================== | 27%
|
|==================== | 28%
|
|===================== | 31%
|
|======================= | 33%
|
|======================== | 34%
|
|========================= | 36%
|
|========================== | 38%
|
|============================ | 40%
|
|============================== | 42%
|
|============================== | 43%
|
|================================ | 46%
|
|================================= | 47%
|
|=================================== | 49%
|
|==================================== | 52%
|
|===================================== | 53%
|
|======================================= | 55%
|
|======================================= | 56%
|
|========================================= | 59%
|
|=========================================== | 61%
|
|============================================ | 62%
|
|============================================= | 65%
|
|============================================== | 66%
|
|================================================ | 68%
|
|================================================= | 71%
|
|================================================== | 72%
|
|==================================================== | 74%
|
|===================================================== | 75%
|
|====================================================== | 78%
|
|======================================================== | 80%
|
|========================================================= | 81%
|
|========================================================== | 84%
|
|=========================================================== | 85%
|
|============================================================= | 87%
|
|=============================================================== | 89%
|
|=============================================================== | 91%
|
|================================================================= | 93%
|
|================================================================== | 94%
|
|==================================================================== | 96%
|
|===================================================================== | 99%
|
|======================================================================| 100%
##
|
| | 0%
|
|========= | 13%
|
|================== | 26%
|
|=================== | 27%
|
|============================ | 40%
|
|=================================== | 50%
|
|============================================ | 63%
|
|===================================================== | 76%
|
|========================================================== | 82%
|
|=================================================================== | 95%
|
|======================================================================| 100%
##
|
| | 0%
|
|===== | 7%
|
|========== | 14%
|
|=========== | 15%
|
|================ | 22%
|
|=================== | 28%
|
|======================== | 35%
|
|============================= | 42%
|
|================================ | 45%
|
|===================================== | 52%
|
|======================================= | 56%
|
|============================================ | 63%
|
|================================================= | 70%
|
|=================================================== | 74%
|
|======================================================== | 81%
|
|=========================================================== | 84%
|
|================================================================ | 91%
|
|===================================================================== | 98%
|
|======================================================================| 100%
pumas_joined<-geo_join(txokla_pumageo, TXOKLA_pumas, "GEOID10", "GEOID", how="inner")
library(RColorBrewer)
library(sp)
library(sf)
spplot(pumas_joined, "ExtComm")
spplot(pumas_joined, "medhhinc")
spplot(pumas_joined, "pmale")
spplot(pumas_joined, "punemp")
pumas_joined<-merge(pumas_joined, est, by.x="GEOID", by.y="id")
spplot(pumas_joined, "X.Intercept.")
spplot(pumas_joined[is.na(pumas_joined$X.Intercept.)==F,], "X.Intercept.")
head(est)
## X.Intercept. id
## 2200100 -0.14395279 2200100
## 2200101 -0.02519998 2200101
## 2200200 -0.15197336 2200200
## 2200300 -0.15426841 2200300
## 2200400 -0.12448939 2200400
## 2200500 -0.19908324 2200500
#Calculating mean income by PUMA vs predicted income
PUMAmeans<-aggregate(cbind(income, pred_income)~PUMA_joiner, merged_emp_final, mean)
head(PUMAmeans, n=10)
## PUMA_joiner income pred_income
## 1 2200100 40567.53 40964.54
## 2 2200101 47934.55 48309.38
## 3 2200200 40080.48 40468.47
## 4 2200300 39950.06 40326.53
## 5 2200400 41780.71 42168.35
## 6 2200500 37162.88 37554.74
## 7 2200600 39446.41 39846.69
## 8 2200700 41841.01 42213.94
## 9 2200800 44919.98 45297.22
## 10 2200900 44852.33 45232.96
#shows that we're really predicting income well based on PUMA. Not all that surprising.
plot(pred_income~income, PUMAmeans)
For the aspects of the individual that might predict their income, I’m using: * Status as an Extraction Worker (binary outcome, 1 = yes) * Race * Hispanic/Latino Origin * Whether this person lives in an extractive community * Whether this person lives in an extractive community or not, and whether they work in an extractive industry (an attempt at a dummy variable for a multiplication factor?)
fit_ind<-lmer(incwage_z~race_c+hisp+ExtrWorker+ExtrWorker_ExtCommC+(1|PUMA_joiner), data=merged_emp_final, na.action=na.omit)
## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
arm::display(fit_ind, detail=T)
## lmer(formula = incwage_z ~ race_c + hisp + ExtrWorker + ExtrWorker_ExtCommC +
## (1 | PUMA_joiner), data = merged_emp_final, na.action = na.omit)
## coef.est coef.se t value
## (Intercept) 0.13 0.02 7.98
## race_c2. Black -0.30 0.00 -77.15
## race_c3. Native American -0.10 0.01 -10.32
## race_c4. Asian/PI -0.07 0.01 -11.57
## race_c5. Other, or more than one race -0.10 0.00 -22.83
## hisp2. Hispanic/Latino -0.34 0.00 -111.34
## ExtrWorker1. Extraction (Oil/Gas/Mining/Quarrying) 0.71 0.03 22.76
## ExtrWorker_ExtCommC2. NonExt Worker in Ext Community -0.05 0.03 -1.73
## ExtrWorker_ExtCommC4. Ext Worker in NonExt community -0.02 0.03 -0.54
##
## Error terms:
## Groups Name Std.Dev.
## PUMA_joiner (Intercept) 0.23
## Residual 0.95
## ---
## number of obs: 763652, groups: PUMA_joiner, 274
## AIC = 2.0913e+06, DIC = 2091132
## deviance = 2091202.9
#Another ANOVA to conduct a likelihood Ratio Test (to see how much is explained by one model over another)
anova(fit_puma, fit_ind)
## refitting model(s) with ML (instead of REML)
## Data: merged_emp_final
## Models:
## fit_puma: incwage_z ~ (1 | PUMA_joiner)
## fit_ind: incwage_z ~ race_c + hisp + ExtrWorker + ExtrWorker_ExtCommC +
## fit_ind: (1 | PUMA_joiner)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## fit_puma 3 2120593 2120628 -1060293 2120587
## fit_ind 11 2091225 2091352 -1045601 2091203 29384 8 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1