Are the declines in Union Membership leading to income inequality? We will measure income inequality by using the share of all income capture by the top 1% of income earnings. As this share increases, income inequality increases.
We will examine this question by analyzing inequality and union membership over time, over space, and using a linear regression. Fun stuff!
Before beginning this lab, make sure you’ve:
created a sub-folder for this assignment within your ECON_210 folder
Within your assignment sub-folder, created another sub-folder called DATA
Downloaded the necessary data from the WordPress site and saved it to your assignment-specific subfolder.
Set your working directory. The address of your working directory should look something like `YourName/ECON_210/Assignment/DATA/
We will use six datasets.
We will create two datasets from the above:
A: National-level dataset that tracks income inequality and union membership over time in the US. This is typically known as a time-series dataset.
B: Metro-level dataset that documents income inequality and union membership at a specific point in time (2013). This is typically known as a cross-sectional dataset. A metro area or core-based statistical area (CBSA) consists of one or more counties anchored by an urban center. See here for more on CBSAs.
If you combine both a time-series and cross section dataset, you get a panel dataset. An example of this would be if we had county-level data over multiple years.
The time-series data we are working track the level of variables over time. Let’s look at the first 10 observations in the national level income inequality dataset.
library(haven)
inequality_usa <- read_dta("income_inequality_p1_usa.dta")
inequality_usa %>%
head(10)
## # A tibble: 10 x 2
## Year top1
## <dbl> <dbl>
## 1 1970 0.115
## 2 1971 0.118
## 3 1972 0.118
## 4 1973 0.117
## 5 1974 0.112
## 6 1975 0.111
## 7 1976 0.110
## 8 1977 0.114
## 9 1978 0.111
## 10 1979 0.117
Notice that there are no geographic identifiers. Why? All the data is for the US. This begs the question, “What is the unit of observation? (what does each row of the dataset represent?)”
inequality_usa%>% unique_id(Year)
## [1] TRUE
length(unique(inequality_usa$Year))
## [1] 50
The unit of observation is Year and we have 50 years of income inequality data. You can do the same exercise for the national level union data. To merge these two datasets together, load the union_usa
data
union_usa <- read_dta("union_usa.dta")
inequality_union_usa <- left_join(inequality_usa, union_usa, by = "Year")
The results of the merge tell us that we have 47 years that contain both inequality and union data; there are 3 years are not common between the two datasets. Let’s take a quick peak at the data using a handy function we haven’t used before - summary
, which will give us summary statistics (mean, median, etc.) for all the variables in our dataset. How neat is that?
summary(inequality_usa)
## Year top1
## Min. :1970 Min. :0.1103
## 1st Qu.:1982 1st Qu.:0.1187
## Median :1994 Median :0.1542
## Mean :1994 Mean :0.1577
## 3rd Qu.:2007 3rd Qu.:0.1943
## Max. :2019 Max. :0.2091
Wait! There isn’t a mean for Union_Rate
. What’s going on? Note that Union_Rate
is stored as a character. But why?
inequality_union_usa %>%
select(Union_Rate) %>%
head(15)
## # A tibble: 15 x 1
## Union_Rate
## <chr>
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 "24"
## 5 "23.6"
## 6 "22.2"
## 7 "22.1"
## 8 "23.8"
## 9 "23"
## 10 "24.1"
## 11 "23"
## 12 "21.4"
## 13 " --"
## 14 "20.1"
## 15 "18.8"
Ah, it looks like there’s a symbol “–-” that is meant to be a missing value. To fix that, we’ll need to convert Union_Rate
to numeric. There are two ways for us to do this, one of which you’re already familiar with: using the as.numeric
function. In this case, any values that R
cannot read as numeric (such as a symbol) will be coerced as a missing NA
value.
inequality_union_usa$Union_Rate <- as.numeric(inequality_union_usa$Union_Rate)
## Warning: NAs introduced by coercion
The second way uses the function str_replace
from the stringr
library. stringr
is designed specifically for working with strings (a.k.a. characters). It’s a super helpful library, and about time we introduced it!
str_replace
takes three arguments: - a vector of strings - a pattern to look for (such as “–”) - the replacement (such as "" (nothing))
library(stringr)
inequality_union_usa$Union_Rate <- str_replace(inequality_union_usa$Union_Rate, "--", "")
This gets rid of the “–” but Union_Rate
is still stored as a character. We can now use as.numeric
like we usually do
inequality_union_usa$Union_Rate <- as.numeric(inequality_union_usa$Union_Rate)
“Why bother doing this?” you may ask…good question! The answer is you don’t have to: as.numeric
works just fine for our purposes. But str_replace
is a great function to have in your toolkit for future string manipulation.
Finally, let’s save this dataset for future use
write_csv(inequality_union_usa, "inequality_union_usa.csv")
Let’s load the metro-level top 1% share dataset
inequality_msa <- read_dta("income_inequality_county.dta")
inequality_msa%>% unique_id(FIPS)
## [1] TRUE
length(unique(inequality_msa$FIPS))
## [1] 917
summary(inequality_msa)
## FIPS top10 top5 top1
## Min. :10100 Min. :27.22 Min. :16.81 Min. : 5.597
## 1st Qu.:20020 1st Qu.:39.85 1st Qu.:26.70 1st Qu.:11.346
## Median :29820 Median :42.74 Median :29.21 Median :13.351
## Mean :29793 Mean :43.10 Mean :29.84 Mean :13.990
## 3rd Qu.:39460 3rd Qu.:45.97 3rd Qu.:32.35 3rd Qu.:15.531
## Max. :49820 Max. :82.71 Max. :78.07 Max. :68.266
## NA's :1 NA's :1 NA's :1
## cntyname
## Length:917
## Class :character
## Mode :character
##
##
##
##
Everything looks good. The unit of observation is a metro area and each one has a FIPS. We only have 917 FIPS observations because there are fewer MSAs than counties. MSAs are basically “large” counties that represent major metropolitan areas and are technically known as CBSAs (core-based statistical areas). This dataset is not going to capture a lot of rural areas. You can examine the metro-level union dataset as well and see similar results.
Now merge the metro-level income inequality and union data (I already checked the union data so I know it will merge). However, most times you’ll have to examine each dataset separately to make sure they are cleaned up.
msa_union <-read_dta("msa_union.dta")
inequality_union_msa <- left_join(inequality_msa, msa_union, by = "FIPS")
summary(inequality_union_msa)
## FIPS top10 top5 top1
## Min. :10100 Min. :27.22 Min. :16.81 Min. : 5.597
## 1st Qu.:20020 1st Qu.:39.85 1st Qu.:26.70 1st Qu.:11.346
## Median :29820 Median :42.74 Median :29.21 Median :13.351
## Mean :29793 Mean :43.10 Mean :29.84 Mean :13.990
## 3rd Qu.:39460 3rd Qu.:45.97 3rd Qu.:32.35 3rd Qu.:15.531
## Max. :49820 Max. :82.71 Max. :78.07 Max. :68.266
## NA's :1 NA's :1 NA's :1
## cntyname MetropolitanArea MemTotal MemPrivate
## Length:917 Length:917 Min. : 0.0 Min. : 0.000
## Class :character Class :character 1st Qu.: 4.3 1st Qu.: 2.175
## Mode :character Mode :character Median : 9.1 Median : 5.050
## Mean :10.7 Mean : 6.312
## 3rd Qu.:15.6 3rd Qu.: 9.175
## Max. :39.0 Max. :27.700
## NA's :681 NA's :681
## MemPublic
## Min. : 0.00
## 1st Qu.:12.97
## Median :28.25
## Mean :32.79
## 3rd Qu.:53.62
## Max. :83.90
## NA's :681
From our summary
we see that we have 681 missing values. This means we have 236 counties with both inequality and union membership data. To keep only observations that have both, we can filter
our dataset for observations that are not (!
) missing (is.na
) MetropolitanArea
inequality_union_msa2 <- inequality_union_msa %>%
filter(!is.na(MetropolitanArea))
We also merge in metro-level control variables.
msa_controls <-read_dta("msa_controls.dta")
inequality_union_msa_controls <- left_join(inequality_union_msa2, msa_controls, by = "FIPS")
Since we plan do to spatial analysis for the metro-level data, we need to merge in geographic coordinates for each metro area. Since the metro-level data is for 2013, we’ll use 2013 boundaries. You can find them at the US Census here as well as on the course Lab page
To do this, we need to find the shape file for metro or CBSA areas. Read in the shape file using the st_read
function from the sf
library.
library(sf)
## Linking to GEOS 3.8.1, GDAL 3.1.4, PROJ 6.3.1
metro_shapes <- st_read("cb_2013_us_cbsa_500k.shp")
Just like last week, we’ll be using GEOID
(the FIPS
code) to merge our shapes file with our inequality_union_msa_controls
dataset. First, we’ll need to convert GEOID
from a character to numeric. We’ll also rename
it.
metro_shapes$GEOID <- as.numeric(metro_shapes$GEOID)
metro_shapes2 <- metro_shapes %>%
rename(FIPS = GEOID)
inequality_union_msa_controls_metro_shapes <- left_join(metro_shapes2, inequality_union_msa_controls, by = "FIPS")
summary(inequality_union_msa_controls_metro_shapes)
## CSAFP CBSAFP AFFGEOID FIPS
## Length:929 Length:929 Length:929 Min. :10100
## Class :character Class :character Class :character 1st Qu.:19940
## Mode :character Mode :character Mode :character Median :29780
## Mean :29750
## 3rd Qu.:39460
## Max. :49820
##
## NAME LSAD ALAND AWATER
## Length:929 Length:929 Min. :8.814e+07 Min. :3.091e+03
## Class :character Class :character 1st Qu.:1.535e+09 1st Qu.:1.484e+07
## Mode :character Mode :character Median :2.512e+09 Median :4.262e+07
## Mean :4.676e+09 Mean :3.403e+08
## 3rd Qu.:5.258e+09 3rd Qu.:1.755e+08
## Max. :7.061e+10 Max. :1.533e+10
##
## top10 top5 top1 cntyname
## Min. :34.63 Min. :22.18 Min. : 8.662 Length:929
## 1st Qu.:41.87 1st Qu.:28.71 1st Qu.:13.040 Class :character
## Median :44.76 Median :31.39 Median :15.072 Mode :character
## Mean :44.93 Mean :31.77 Mean :15.606
## 3rd Qu.:47.21 3rd Qu.:33.46 3rd Qu.:16.948
## Max. :72.36 Max. :61.65 Max. :42.507
## NA's :694 NA's :694 NA's :694
## MetropolitanArea MemTotal MemPrivate MemPublic
## Length:929 Min. : 0.0 Min. : 0.000 Min. : 0.00
## Class :character 1st Qu.: 4.3 1st Qu.: 2.175 1st Qu.:12.97
## Mode :character Median : 9.1 Median : 5.050 Median :28.25
## Mean :10.7 Mean : 6.312 Mean :32.79
## 3rd Qu.:15.6 3rd Qu.: 9.175 3rd Qu.:53.62
## Max. :39.0 Max. :27.700 Max. :83.90
## NA's :693 NA's :693 NA's :693
## population median_income per_capita_income college_pct
## Min. : 95349 Min. :33179 Min. :14222 Min. :0.1262
## 1st Qu.: 213280 1st Qu.:44790 1st Qu.:23160 1st Qu.:0.2080
## Median : 399092 Median :49318 Median :25512 Median :0.2590
## Mean : 933087 Mean :50564 Mean :25975 Mean :0.2641
## 3rd Qu.: 818240 3rd Qu.:55070 3rd Qu.:27934 3rd Qu.:0.3031
## Max. :19716880 Max. :90962 Max. :41588 Max. :0.5830
## NA's :693 NA's :693 NA's :693 NA's :693
## unemployed geometry
## Min. :0.0396 MULTIPOLYGON :929
## 1st Qu.:0.0812 epsg:4269 : 0
## Median :0.0954 +proj=long...: 0
## Mean :0.0980
## 3rd Qu.:0.1131
## Max. :0.1851
## NA's :693
Note that we don’t have a state_id variable- but we do have something close. Let’s take a look at the first 5 observations.
inequality_union_msa_controls_metro_shapes %>%
head(5)
## Simple feature collection with 5 features and 21 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -98.95394 ymin: 28.06135 xmax: -78.97446 ymax: 43.64223
## geographic CRS: NAD83
## CSAFP CBSAFP AFFGEOID FIPS NAME LSAD ALAND
## 1 357 12660 310M200US12660 12660 Baraboo, WI M2 2152024914
## 2 430 38300 310M200US38300 38300 Pittsburgh, PA M1 13678950959
## 3 184 17460 310M200US17460 17460 Cleveland-Elyria, OH M1 5174448881
## 4 544 38920 310M200US38920 38920 Port Lavaca, TX M2 1312707302
## 5 <NA> 48660 310M200US48660 48660 Wichita Falls, TX M1 6784779607
## AWATER top10 top5 top1 cntyname
## 1 45882328 NA NA NA <NA>
## 2 157932094 46.30528 33.43989 16.63414 Pittsburgh, PA
## 3 5132097166 45.58507 33.22408 16.92012 Cleveland-Elyria, OH
## 4 1361903536 NA NA NA <NA>
## 5 144176932 NA NA NA <NA>
## MetropolitanArea MemTotal MemPrivate MemPublic population
## 1 <NA> NA NA NA NA
## 2 Pittsburgh, PA 11.8 7.5 49.8 2358746
## 3 Cleveland-Elyria-Mentor, OH 11.0 7.1 43.8 2070965
## 4 <NA> NA NA NA NA
## 5 <NA> NA NA NA NA
## median_income per_capita_income college_pct unemployed
## 1 NA NA NA NA
## 2 50935 29484 0.2994702 0.0776305
## 3 48954 28002 0.2836095 0.1056396
## 4 NA NA NA NA
## 5 NA NA NA NA
## geometry
## 1 MULTIPOLYGON (((-90.3124 43...
## 2 MULTIPOLYGON (((-80.51922 3...
## 3 MULTIPOLYGON (((-82.34807 4...
## 4 MULTIPOLYGON (((-96.38984 2...
## 5 MULTIPOLYGON (((-98.95382 3...
We can see that the variable cntyname
contains the state identifier (two letter abbreviation). We can extract the state identifier from the cntyname
variable using str_sub
from (you guess it) the stringr
library.
The str_sub
function takes three arguments: - a vector of strings - the position of the first character R
should look at - the position of the last character R
should look at
We want R
to start at the second to last character (-2) and end at the second to last character (-1). We’ll then mutate a new column called StateID
that is filled with the subsetted characters.
inequality_union_msa_controls_metro_shapes2 <- inequality_union_msa_controls_metro_shapes %>%
mutate(StateID = str_sub(inequality_union_msa_controls_metro_shapes$cntyname, start = -2, end = -1))
Finally, let’s save this dataset for future use. But we can’t just save it as a .csv like we usually do. In order to preserve geometry
’s storage as a sfc_MULTIPOLYGON
, we’ll need to write a .rds file.
write_rds(inequality_union_msa_controls_metro_shapes2 , "msa_inq_union.rds")
Now we’re ready to do the analysis!
END PART 1