Lab 8: Spatial Analysis and Linear Regression

Preliminaries

Before beginning this lab, make sure you’ve:

created a sub-folder for this assignment within your ECON_210 folder
Within your assignment sub-folder, created another sub-folder called DATA
Downloaded the necessary data from the WordPress site and save it to your assignment-specific subfolder.
Set your working directory. The address of your working directory should look something like /Users/YourName/OneDrive - Middlebury College/ECON_210/LABS/LAB_#/DATA ;

In this lab:

We’ll explore the question: Do higher minimum wages lift people out of poverty?

To do so, we’ll do the following:

Many to One Merge
Generate a Map showing the spatial distribution of poverty in the US
t-test comparing poverty rates between low and high minimum wage areas
CI plots and bar plots to visualize t-tests
Linear Regression

The datasets

We will use two datasets.

The county-level dataset generated in Lab 7 acs_health_election_pviolence (Source: Lab 7)
State Minimum Wages Source: Equitable Growth min_wage_state
Geographic Coordinates for US counties usa_counties Source: US Census Bureau
Boundary Files for each US county usacounties_shp Source: US Census Bureau

Many to One Merge

In previous merges/joins, every observation from both datasets were on the same geographic level (county). This is known as a “1:1 or one-to-one merge”. In this lab, we will be doing a (m:1 or many to one merge).

To start, let’s open our first dataset ACS_citizens_health_countypres_police (from Lab 7 Part I).

ACS_citizens_health_countypres_police <- read_csv("ACS_citizens_health_countypres_police.csv")

## Warning: Missing column names filled in: 'X1' [1]

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   GEOID = col_character(),
##   NAME.x = col_character(),
##   QName = col_character(),
##   STUSAB = col_character(),
##   SUMLEV = col_character(),
##   GEOCOMP = col_character(),
##   FILEID = col_character(),
##   LOGRECNO = col_character(),
##   US = col_logical(),
##   REGION = col_logical(),
##   DIVISION = col_logical(),
##   STATECE = col_logical(),
##   STATE.x = col_character(),
##   COUNTY.x = col_character(),
##   COUSUB = col_logical(),
##   PLACE = col_logical(),
##   PLACESE = col_logical(),
##   TRACT = col_logical(),
##   BLKGRP = col_logical(),
##   CONCIT = col_logical()
##   # ... with 40 more columns
## )
## ℹ Use `spec()` for the full column specifications.

Recall that the unique identifier of the ACS_citizens_health_countypres_police is FIPS with the follow code:

ACS_citizens_health_countypres_police %>% unique_id(FIPS)

## [1] TRUE

And we see “TRUE” if FIPS is the unique identifier.

And if you have problems with thatssorandom the alternative method is:

length(unique(ACS_citizens_health_countypres_police$FIPS))

## [1] 3120

We know that each observation in this dataset is uniquely identified by a FIPS geographic identifier - there are 3210 observations and 3210 unique FIPS code.

The ACS_citizens_health_countypres_police dataset also contains two non-unique identifiers: the STATE.x and COUNTY.x identifiers that comprise a county’s unique FIPS code.

To understand the relationship between STATE.x, COUNTY.x and FIPS, let’s filter out dataset for those three variables (we’ll also including the variable QName for easier readability).

Looking at the first 10 observations:

ACS_citizens_health_countypres_police %>%
  select(FIPS, STATE.x, COUNTY.x, QName) %>%
  head(10)

## # A tibble: 10 x 4
##     FIPS STATE.x COUNTY.x QName                   
##    <dbl> <chr>   <chr>    <chr>                   
##  1  1001 01      001      Autauga County, Alabama 
##  2  1003 01      003      Baldwin County, Alabama 
##  3  1005 01      005      Barbour County, Alabama 
##  4  1007 01      007      Bibb County, Alabama    
##  5  1009 01      009      Blount County, Alabama  
##  6  1011 01      011      Bullock County, Alabama 
##  7  1013 01      013      Butler County, Alabama  
##  8  1015 01      015      Calhoun County, Alabama 
##  9  1017 01      017      Chambers County, Alabama
## 10  1019 01      019      Cherokee County, Alabama

We can see that the first two digits in a FIPS code represent the state the county is in (ex, 01 represents Alabama) , and the last three digits refer to the county.

We’ll need to use the state identifier STATE.x in our many-to-one merge. But in order to do so, we’ll need both datasets to use a common variable name. We’ll also need them to be stored as the same data type (string, numeric, etc). To take care of that, let’s renameSTATE.x as statefips and convert it to numeric.

ACS_citizens_health_countypres_police2 <- ACS_citizens_health_countypres_police %>%
  dplyr::rename(statefips = STATE.x) 

ACS_citizens_health_countypres_police2$statefips <- as.numeric(ACS_citizens_health_countypres_police2$statefips)

Now let’s take a look at the state-level minimum wage data and examine the first 10 observations. This is a STATA file, so we’ll need to use the read_data function from the haven package.

library(haven)
min_wage_state <- read_dta("min_wage_state.dta")

min_wage_state %>%
  head(10)

## # A tibble: 10 x 4
##    statename            statefips mean_mw mean_fed_mw
##    <chr>                    <dbl>   <dbl>       <dbl>
##  1 Alabama                      1    7.25        7.25
##  2 Alaska                       2    9.75        7.25
##  3 Arizona                      4    8.05        7.25
##  4 Arkansas                     5    8           7.25
##  5 California                   6   10           7.25
##  6 Colorado                     8    8.31        7.25
##  7 Connecticut                  9    9.60        7.25
##  8 Delaware                    10    8.25        7.25
##  9 District of Columbia        11   10.5         7.25
## 10 Florida                     12    8.05        7.25

In this dataset, each observation is not a county but a state. Why? Minimum wages are typically set on a state level and not a county level.

When we merge out two datasets together, the state minimum wage of 7.25 for Alabama will be merged to every county in Alabama, hence the many (county-level data) to one (state-level data) merge.

Enough talk of merging, let’s get it done!

To do our many-to-one merge, we’ll be using the left_join function. left_joinreturns all rows from x (ACS_citizens_health_countypres_police2), and all columns from x and y (min_wage_state). If there are multiple matches between x and y, all combinations of the matches are returned.

ACS_citizen_min_wage <- left_join(ACS_citizens_health_countypres_police2, min_wage_state, by = "statefips")

Many-to-one merges can be tricky to do, and there are a few ways to do them. To check our code is doing what we want it to, let’s look at the first 100 observations of the merged dataset selecting for three variables: statefips, mean_mw, and QName

ACS_citizen_min_wage %>%
  select(statefips, mean_mw, QName) %>%
  head(100)

## # A tibble: 100 x 3
##    statefips mean_mw QName                   
##        <dbl>   <dbl> <chr>                   
##  1         1    7.25 Autauga County, Alabama 
##  2         1    7.25 Baldwin County, Alabama 
##  3         1    7.25 Barbour County, Alabama 
##  4         1    7.25 Bibb County, Alabama    
##  5         1    7.25 Blount County, Alabama  
##  6         1    7.25 Bullock County, Alabama 
##  7         1    7.25 Butler County, Alabama  
##  8         1    7.25 Calhoun County, Alabama 
##  9         1    7.25 Chambers County, Alabama
## 10         1    7.25 Cherokee County, Alabama
## # … with 90 more rows

We can see that the mean_mw (mean minimum wage) assigned to each observation corresponds to a specific state (ex. Alaska, statefips = 2, has a minimum wage of 9.75)

Finally let’s load in the geographic coordinates shape file, which we’ll need to build our map. We can do this using the st_read function from the sf library. The .shp file depends on the rest of the files in the cb_2018_us_county_500k folder you downloaded, so we’ll need to have all the files saved together in the same folder.

library(sf)

## Linking to GEOS 3.8.1, GDAL 3.1.4, PROJ 6.3.1

county_shapes <- st_read("cb_2018_us_county_500k.shp")

## Reading layer `cb_2018_us_county_500k' from data source `/Users/egong/Dropbox/ReEnvision_210/LABS/Lab_8_11/R_files/cb_2018_us_county_500k.shp' using driver `ESRI Shapefile'
## Simple feature collection with 3233 features and 9 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -179.1489 ymin: -14.5487 xmax: 179.7785 ymax: 71.36516
## geographic CRS: NAD83

To merge county_shapes with ACS_citizen_min_wage_shp, we’ll need to rename GEOID as FIPS and convert it to numeric

county_shapes2 <- county_shapes %>%
  dplyr::rename(FIPS = GEOID)

county_shapes2$FIPS <- as.numeric(county_shapes2$FIPS)

Finally, we’ll use left_join to merge the shape file with ACS_citizen_min_wage_shp

ACS_citizen_min_wage_shp <- left_join(ACS_citizen_min_wage, county_shapes2, by = "FIPS")

2 Generate Variables

Let’s generate a poverty rate for each county. By referencing the dataset’s respective codebook, we can determined that the variable B13004_002 represents the “Population for Whom Poverty Status Is Determined: Under 1.00 (Doing Poorly)” and variable B13004_001 represents the “Population for Whom Poverty Status Is Determined”.

Codebooks are super useful but can also be tedious- especially in a dataset with as many variables as this one. In a few weeks, we’ll learn a very easy way to do a keyword search on data labels (basically a CTRL + F for “poverty”) in STATA- something that would have been very handy in this situation.

To generate the poverty rate, we’ll mutate a new column dividing B13004_002 by B13004_001

ACS_citizen_min_wage_shp2 <- ACS_citizen_min_wage_shp %>%
  mutate(poverty_rate = B13004_002 / B13004_001)

For the minimum wage, let’s designate all states where the state minimum wage is higher than the federal minimum wage as high minimum wage areas (thus all the counties in the state will be coded as having a high minimum wage).

ACS_citizen_min_wage_shp3 <- ACS_citizen_min_wage_shp2 %>%
  mutate(minwage_high = ifelse(mean_mw > mean_fed_mw, "Above Fed Min", "Below Fed Min"))

Make a Map

We can examine the spatial variation of poverty by creating a map. R provides a variety of ways for us to make really high quality maps without having to use a new program like GIS. Today, we’re going to be using one of the most basic mapping functions in R: geom_sp from ggplot2.

Depending on your available computing power, it might take a minute or so for your map to generate. Don’t fret! Take a stretch break and rest your eyes in the mean time.

We’ve also limited the coordinates of the map to create an approximate “bounding box” for the continental United States (sorry Alaska and Hawaii!)

ggplot(data = ACS_citizen_min_wage_shp3, aes(geometry = geometry)) +
  geom_sf(aes(fill = poverty_rate)) +
  coord_sf(xlim = c(-126, -66), ylim = c(23, 51.5), expand = FALSE)

Awesome! But this map needs a bit of work. For one, R has deciding to fill counties with lower poverty rates with a dark blue, and those with higher poverty rates with light blue. This isn’t super intuitive, and we can correct it by specifying low and high hexidecimal color values for ggplot to use. We can also decrease the thickness of county boundaries by adjusting the lwd argument within geom_sf. We can also add a title to our map using ggtitle. Finally, for the map to really stand out, we’ll use theme_classic, which removes the grey gridlines.

ggplot(data = ACS_citizen_min_wage_shp3, aes(geometry = geometry)) +
  geom_sf(aes(fill = poverty_rate), lwd = 0.025) +
  coord_sf(xlim = c(-126, -66), ylim = c(23, 51.5), expand = FALSE) +
  scale_fill_gradient(low = "#ffffff", high = "#2069a6") +
  ggtitle("County Poverty Rates") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

For 7 lines of code, that’s a pretty decent map!

Questions

What observations can you make about where poverty is concentrated? Why do you think this is the case? Is there anything that is surprising?

See if you can make a map that shows which areas of the US have higher minimum wages.

T-tests

A t-test is simple a test that compares two groups. We will compare poverty rates between counties with low minimum wages and high minimum wages. To do so, we can use the t.test function.

t.test(poverty_rate ~ minwage_high, data = ACS_citizen_min_wage_shp3)

## 
##  Welch Two Sample t-test
## 
## data:  poverty_rate by minwage_high
## t = -8.1993, df = 3048.1, p-value = 3.521e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.02289988 -0.01406121
## sample estimates:
## mean in group Above Fed Min mean in group Below Fed Min 
##                   0.1454535                   0.1639340

Typically it’s better to present a t-test as a figure and report the p-value. We can use our confidence interval plot from last week’s lab as base code.

library(Rmisc)

## Loading required package: lattice

## Loading required package: plyr

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following object is masked from 'package:purrr':
## 
##     compact

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

summary_stats <- summarySE(ACS_citizen_min_wage_shp3, measurevar="poverty_rate", groupvars=c("minwage_high"), na.rm = TRUE)

## Warning in qt(conf.interval/2 + 0.5, datac$N - 1): NaNs produced

#drop missing value row
summary_stats2 <- summary_stats %>%
  na.omit()

#generate a palette for us to pull from 
cbbPalette <- c("#000000", "#E69F00")  

#plot! 
ggplot(summary_stats2, aes(x = minwage_high, y = poverty_rate)) + 
  geom_bar(position=position_dodge(), stat="identity", aes(fill = cbbPalette)) +
  geom_errorbar(aes(ymin=poverty_rate - se, ymax= poverty_rate + se), width=.2) +
  geom_point() +
  scale_y_continuous(limits=c(0.14,0.17),oob = rescale_none) +
  scale_fill_discrete(name = "Minimum Wage Classification", labels = c("Below Federal Minimum", "Above Federal Minimum")) +
  ggtitle("Poverty Rates by High Min Wage Counties") +
  xlab("Minimum Wage Classification")

  ylab("Poverty Rate")

## $y
## [1] "Poverty Rate"
## 
## attr(,"class")
## [1] "labels"

Linear Regression

Regression in R is fairly easy to implement. To regress poverty rates on minimum wage, we can use the lm (linear model) function.

lr_1 <- lm(poverty_rate ~ minwage_high, data = ACS_citizen_min_wage_shp3)
summary(lr_1)

## 
## Call:
## lm(formula = poverty_rate ~ minwage_high, data = ACS_citizen_min_wage_shp3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14090 -0.04453 -0.00759  0.03358  0.40551 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               0.145453   0.001751  83.063  < 2e-16 ***
## minwage_highBelow Fed Min 0.018481   0.002309   8.005 1.67e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06367 on 3111 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.02018,    Adjusted R-squared:  0.01987 
## F-statistic: 64.08 on 1 and 3111 DF,  p-value: 1.672e-15

Question

Compare the results from the regression to the t-test. What are the similarities? What are the differences?

In almost all cases, we want to use heteroskedastic-robust standard errors - the general idea is that your standard errors will be “conservative” or larger than non-adjusted standard errors (except of course in this specific example). STATA making calculating heteroskedastic-robust standard errors a snap. Unfortunately, the same is not true for R. For now, we’ll skip over generating robust errors, but keep your eyes out for information to come.

If we suspect that education might be an omitted variable/confounder, we can control for it. We first generate “college” which is the percentage of college graduates in a count. We can do so with the variables A12002_005, the “Population 25 Years and Over: Bachelor’s Degree or More”, and A12001_001, the “Population 25 Years and Over”

ACS_citizen_min_wage_shp4 <- ACS_citizen_min_wage_shp3 %>%
  mutate(college = A12002_005 /A12001_001 )

Then we can control for it in the regression.

lr_2 <-lm(poverty_rate ~ minwage_high + college, data = ACS_citizen_min_wage_shp4)
summary(lr_2)

## 
## Call:
## lm(formula = poverty_rate ~ minwage_high + college, data = ACS_citizen_min_wage_shp4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14143 -0.03912 -0.00843  0.02978  0.38400 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                0.212742   0.003030  70.215  < 2e-16 ***
## minwage_highBelow Fed Min  0.009903   0.002117   4.677 3.04e-06 ***
## college                   -0.289170   0.011093 -26.068  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05769 on 3110 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.1959, Adjusted R-squared:  0.1954 
## F-statistic: 378.8 on 2 and 3110 DF,  p-value: < 2.2e-16

Finally, let’s save this dataset as an rds file so that we can use it for mapping in the future.

write_rds(ACS_citizen_min_wage_shp4, "ACS_citizen_min_wage_shp4.rds")

Question Are the percentage of college graduates an omitted variable/confounder? What evidence do we have for this?