Lab 9 Part I: Analysis in 3 Parts: Time, Space, and Regression

Are the declines in Union Membership leading to income inequality? We will measure income inequality by using the share of all income capture by the top 1% of income earnings. As this share increases, income inequality increases.

We will examine this question by analyzing inequality and union membership over time, over space, and using a linear regression. Fun stuff!

Preliminaries

Before beginning this lab, make sure you’ve:

created a sub-folder for this assignment within your ECON_210 folder
Within your assignment sub-folder, created another sub-folder called DATA
Downloaded the necessary data from the WordPress site and saved it to your assignment-specific subfolder.
Set your working directory. The address of your working directory should look something like `YourName/ECON_210/Assignment/DATA/

The datasets

We will use six datasets.

National-level Top 1% Share of Income (1970-2019) Source: World Inequality Databse
National-level Union Membership (1970-2019) Source: Unionstats.com
Metro-level Top 1% Share of Income (2013) Source: Economic Policy Institute
Metro-level (MSA) Union Membership (2013) Source: Unionstats.com
Metro-level (MSA) Control Variables: population, median income, pct college educated, pct unemployed (2013) Source: Social Explorer/ACS
Boundary Files for each US metro area “cb_2013_us_cbsa_500k.shp” Source: US Census Bureau

Dataset Types (Time Series / Cross Section / Panel)

We will create two datasets from the above:

A: National-level dataset that tracks income inequality and union membership over time in the US. This is typically known as a time-series dataset.

B: Metro-level dataset that documents income inequality and union membership at a specific point in time (2013). This is typically known as a cross-sectional dataset. A metro area or core-based statistical area (CBSA) consists of one or more counties anchored by an urban center. See here for more on CBSAs.

If you combine both a time-series and cross section dataset, you get a panel dataset. An example of this would be if we had county-level data over multiple years.

1: Merge Datasets

National Level Datasets (Time Series)

The time-series data we are working track the level of variables over time. Let’s look at the first 10 observations in the national level income inequality dataset.

library(haven)
inequality_usa <- read_dta("income_inequality_p1_usa.dta")

inequality_usa %>%
  head(10)

## # A tibble: 10 x 2
##     Year  top1
##    <dbl> <dbl>
##  1  1970 0.115
##  2  1971 0.118
##  3  1972 0.118
##  4  1973 0.117
##  5  1974 0.112
##  6  1975 0.111
##  7  1976 0.110
##  8  1977 0.114
##  9  1978 0.111
## 10  1979 0.117

Notice that there are no geographic identifiers. Why? All the data is for the US. This begs the question, “What is the unit of observation? (what does each row of the dataset represent?)”

inequality_usa%>% unique_id(Year)

## [1] TRUE

length(unique(inequality_usa$Year))

## [1] 50

The unit of observation is Year and we have 50 years of income inequality data. You can do the same exercise for the national level union data. To merge these two datasets together, load the union_usa data

union_usa <- read_dta("union_usa.dta")

inequality_union_usa <- left_join(inequality_usa, union_usa, by = "Year")

The results of the merge tell us that we have 47 years that contain both inequality and union data; there are 3 years are not common between the two datasets. Let’s take a quick peak at the data using a handy function we haven’t used before - summary, which will give us summary statistics (mean, median, etc.) for all the variables in our dataset. How neat is that?

summary(inequality_usa)

##       Year           top1       
##  Min.   :1970   Min.   :0.1103  
##  1st Qu.:1982   1st Qu.:0.1187  
##  Median :1994   Median :0.1542  
##  Mean   :1994   Mean   :0.1577  
##  3rd Qu.:2007   3rd Qu.:0.1943  
##  Max.   :2019   Max.   :0.2091

Wait! There isn’t a mean for Union_Rate. What’s going on? Note that Union_Rate is stored as a character. But why?

inequality_union_usa %>%
  select(Union_Rate) %>%
  head(15)

## # A tibble: 15 x 1
##    Union_Rate
##    <chr>     
##  1  <NA>     
##  2  <NA>     
##  3  <NA>     
##  4 "24"      
##  5 "23.6"    
##  6 "22.2"    
##  7 "22.1"    
##  8 "23.8"    
##  9 "23"      
## 10 "24.1"    
## 11 "23"      
## 12 "21.4"    
## 13 "     --" 
## 14 "20.1"    
## 15 "18.8"

Ah, it looks like there’s a symbol “–-” that is meant to be a missing value. To fix that, we’ll need to convert Union_Rate to numeric. There are two ways for us to do this, one of which you’re already familiar with: using the as.numeric function. In this case, any values that R cannot read as numeric (such as a symbol) will be coerced as a missing NA value.

inequality_union_usa$Union_Rate <- as.numeric(inequality_union_usa$Union_Rate)

## Warning: NAs introduced by coercion

The second way uses the function str_replace from the stringr library. stringr is designed specifically for working with strings (a.k.a. characters). It’s a super helpful library, and about time we introduced it!

str_replace takes three arguments: - a vector of strings - a pattern to look for (such as “–”) - the replacement (such as "" (nothing))

library(stringr)

inequality_union_usa$Union_Rate <- str_replace(inequality_union_usa$Union_Rate, "--", "")

This gets rid of the “–” but Union_Rate is still stored as a character. We can now use as.numeric like we usually do

inequality_union_usa$Union_Rate <- as.numeric(inequality_union_usa$Union_Rate)

“Why bother doing this?” you may ask…good question! The answer is you don’t have to: as.numeric works just fine for our purposes. But str_replace is a great function to have in your toolkit for future string manipulation.

Finally, let’s save this dataset for future use

write_csv(inequality_union_usa, "inequality_union_usa.csv")

Metro Level Dataset (Cross Section)

Let’s load the metro-level top 1% share dataset

inequality_msa <- read_dta("income_inequality_county.dta")

inequality_msa%>% unique_id(FIPS)

## [1] TRUE

length(unique(inequality_msa$FIPS))

## [1] 917

summary(inequality_msa)

##       FIPS           top10            top5            top1       
##  Min.   :10100   Min.   :27.22   Min.   :16.81   Min.   : 5.597  
##  1st Qu.:20020   1st Qu.:39.85   1st Qu.:26.70   1st Qu.:11.346  
##  Median :29820   Median :42.74   Median :29.21   Median :13.351  
##  Mean   :29793   Mean   :43.10   Mean   :29.84   Mean   :13.990  
##  3rd Qu.:39460   3rd Qu.:45.97   3rd Qu.:32.35   3rd Qu.:15.531  
##  Max.   :49820   Max.   :82.71   Max.   :78.07   Max.   :68.266  
##                  NA's   :1       NA's   :1       NA's   :1       
##    cntyname        
##  Length:917        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

Everything looks good. The unit of observation is a metro area and each one has a FIPS. We only have 917 FIPS observations because there are fewer MSAs than counties. MSAs are basically “large” counties that represent major metropolitan areas and are technically known as CBSAs (core-based statistical areas). This dataset is not going to capture a lot of rural areas. You can examine the metro-level union dataset as well and see similar results.

Now merge the metro-level income inequality and union data (I already checked the union data so I know it will merge). However, most times you’ll have to examine each dataset separately to make sure they are cleaned up.

msa_union <-read_dta("msa_union.dta") 

inequality_union_msa <- left_join(inequality_msa, msa_union, by = "FIPS")

summary(inequality_union_msa)

##       FIPS           top10            top5            top1       
##  Min.   :10100   Min.   :27.22   Min.   :16.81   Min.   : 5.597  
##  1st Qu.:20020   1st Qu.:39.85   1st Qu.:26.70   1st Qu.:11.346  
##  Median :29820   Median :42.74   Median :29.21   Median :13.351  
##  Mean   :29793   Mean   :43.10   Mean   :29.84   Mean   :13.990  
##  3rd Qu.:39460   3rd Qu.:45.97   3rd Qu.:32.35   3rd Qu.:15.531  
##  Max.   :49820   Max.   :82.71   Max.   :78.07   Max.   :68.266  
##                  NA's   :1       NA's   :1       NA's   :1       
##    cntyname         MetropolitanArea      MemTotal      MemPrivate    
##  Length:917         Length:917         Min.   : 0.0   Min.   : 0.000  
##  Class :character   Class :character   1st Qu.: 4.3   1st Qu.: 2.175  
##  Mode  :character   Mode  :character   Median : 9.1   Median : 5.050  
##                                        Mean   :10.7   Mean   : 6.312  
##                                        3rd Qu.:15.6   3rd Qu.: 9.175  
##                                        Max.   :39.0   Max.   :27.700  
##                                        NA's   :681    NA's   :681     
##    MemPublic    
##  Min.   : 0.00  
##  1st Qu.:12.97  
##  Median :28.25  
##  Mean   :32.79  
##  3rd Qu.:53.62  
##  Max.   :83.90  
##  NA's   :681

From our summary we see that we have 681 missing values. This means we have 236 counties with both inequality and union membership data. To keep only observations that have both, we can filter our dataset for observations that are not (!) missing (is.na) MetropolitanArea

inequality_union_msa2 <- inequality_union_msa %>%
  filter(!is.na(MetropolitanArea))

We also merge in metro-level control variables.

msa_controls <-read_dta("msa_controls.dta") 

inequality_union_msa_controls <- left_join(inequality_union_msa2, msa_controls, by = "FIPS")

Adding in Geographic Coordinates

Since we plan do to spatial analysis for the metro-level data, we need to merge in geographic coordinates for each metro area. Since the metro-level data is for 2013, we’ll use 2013 boundaries. You can find them at the US Census here as well as on the course Lab page

To do this, we need to find the shape file for metro or CBSA areas. Read in the shape file using the st_read function from the sf library.

library(sf)

## Linking to GEOS 3.8.1, GDAL 3.1.4, PROJ 6.3.1

metro_shapes <- st_read("cb_2013_us_cbsa_500k.shp")

Just like last week, we’ll be using GEOID (the FIPS code) to merge our shapes file with our inequality_union_msa_controls dataset. First, we’ll need to convert GEOID from a character to numeric. We’ll also rename it.

metro_shapes$GEOID <- as.numeric(metro_shapes$GEOID)

metro_shapes2 <- metro_shapes %>%
  rename(FIPS = GEOID)

inequality_union_msa_controls_metro_shapes <- left_join(metro_shapes2, inequality_union_msa_controls, by = "FIPS")

summary(inequality_union_msa_controls_metro_shapes)

##     CSAFP              CBSAFP            AFFGEOID              FIPS      
##  Length:929         Length:929         Length:929         Min.   :10100  
##  Class :character   Class :character   Class :character   1st Qu.:19940  
##  Mode  :character   Mode  :character   Mode  :character   Median :29780  
##                                                           Mean   :29750  
##                                                           3rd Qu.:39460  
##                                                           Max.   :49820  
##                                                                          
##      NAME               LSAD               ALAND               AWATER         
##  Length:929         Length:929         Min.   :8.814e+07   Min.   :3.091e+03  
##  Class :character   Class :character   1st Qu.:1.535e+09   1st Qu.:1.484e+07  
##  Mode  :character   Mode  :character   Median :2.512e+09   Median :4.262e+07  
##                                        Mean   :4.676e+09   Mean   :3.403e+08  
##                                        3rd Qu.:5.258e+09   3rd Qu.:1.755e+08  
##                                        Max.   :7.061e+10   Max.   :1.533e+10  
##                                                                               
##      top10            top5            top1          cntyname        
##  Min.   :34.63   Min.   :22.18   Min.   : 8.662   Length:929        
##  1st Qu.:41.87   1st Qu.:28.71   1st Qu.:13.040   Class :character  
##  Median :44.76   Median :31.39   Median :15.072   Mode  :character  
##  Mean   :44.93   Mean   :31.77   Mean   :15.606                     
##  3rd Qu.:47.21   3rd Qu.:33.46   3rd Qu.:16.948                     
##  Max.   :72.36   Max.   :61.65   Max.   :42.507                     
##  NA's   :694     NA's   :694     NA's   :694                        
##  MetropolitanArea      MemTotal      MemPrivate       MemPublic    
##  Length:929         Min.   : 0.0   Min.   : 0.000   Min.   : 0.00  
##  Class :character   1st Qu.: 4.3   1st Qu.: 2.175   1st Qu.:12.97  
##  Mode  :character   Median : 9.1   Median : 5.050   Median :28.25  
##                     Mean   :10.7   Mean   : 6.312   Mean   :32.79  
##                     3rd Qu.:15.6   3rd Qu.: 9.175   3rd Qu.:53.62  
##                     Max.   :39.0   Max.   :27.700   Max.   :83.90  
##                     NA's   :693    NA's   :693      NA's   :693    
##    population       median_income   per_capita_income  college_pct    
##  Min.   :   95349   Min.   :33179   Min.   :14222     Min.   :0.1262  
##  1st Qu.:  213280   1st Qu.:44790   1st Qu.:23160     1st Qu.:0.2080  
##  Median :  399092   Median :49318   Median :25512     Median :0.2590  
##  Mean   :  933087   Mean   :50564   Mean   :25975     Mean   :0.2641  
##  3rd Qu.:  818240   3rd Qu.:55070   3rd Qu.:27934     3rd Qu.:0.3031  
##  Max.   :19716880   Max.   :90962   Max.   :41588     Max.   :0.5830  
##  NA's   :693        NA's   :693     NA's   :693       NA's   :693     
##    unemployed              geometry  
##  Min.   :0.0396   MULTIPOLYGON :929  
##  1st Qu.:0.0812   epsg:4269    :  0  
##  Median :0.0954   +proj=long...:  0  
##  Mean   :0.0980                      
##  3rd Qu.:0.1131                      
##  Max.   :0.1851                      
##  NA's   :693

Note that we don’t have a state_id variable- but we do have something close. Let’s take a look at the first 5 observations.

inequality_union_msa_controls_metro_shapes %>%
  head(5)

## Simple feature collection with 5 features and 21 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -98.95394 ymin: 28.06135 xmax: -78.97446 ymax: 43.64223
## geographic CRS: NAD83
##   CSAFP CBSAFP       AFFGEOID  FIPS                 NAME LSAD       ALAND
## 1   357  12660 310M200US12660 12660          Baraboo, WI   M2  2152024914
## 2   430  38300 310M200US38300 38300       Pittsburgh, PA   M1 13678950959
## 3   184  17460 310M200US17460 17460 Cleveland-Elyria, OH   M1  5174448881
## 4   544  38920 310M200US38920 38920      Port Lavaca, TX   M2  1312707302
## 5  <NA>  48660 310M200US48660 48660    Wichita Falls, TX   M1  6784779607
##       AWATER    top10     top5     top1             cntyname
## 1   45882328       NA       NA       NA                 <NA>
## 2  157932094 46.30528 33.43989 16.63414       Pittsburgh, PA
## 3 5132097166 45.58507 33.22408 16.92012 Cleveland-Elyria, OH
## 4 1361903536       NA       NA       NA                 <NA>
## 5  144176932       NA       NA       NA                 <NA>
##              MetropolitanArea MemTotal MemPrivate MemPublic population
## 1                        <NA>       NA         NA        NA         NA
## 2              Pittsburgh, PA     11.8        7.5      49.8    2358746
## 3 Cleveland-Elyria-Mentor, OH     11.0        7.1      43.8    2070965
## 4                        <NA>       NA         NA        NA         NA
## 5                        <NA>       NA         NA        NA         NA
##   median_income per_capita_income college_pct unemployed
## 1            NA                NA          NA         NA
## 2         50935             29484   0.2994702  0.0776305
## 3         48954             28002   0.2836095  0.1056396
## 4            NA                NA          NA         NA
## 5            NA                NA          NA         NA
##                         geometry
## 1 MULTIPOLYGON (((-90.3124 43...
## 2 MULTIPOLYGON (((-80.51922 3...
## 3 MULTIPOLYGON (((-82.34807 4...
## 4 MULTIPOLYGON (((-96.38984 2...
## 5 MULTIPOLYGON (((-98.95382 3...

We can see that the variable cntyname contains the state identifier (two letter abbreviation). We can extract the state identifier from the cntyname variable using str_sub from (you guess it) the stringr library.

The str_sub function takes three arguments: - a vector of strings - the position of the first character R should look at - the position of the last character R should look at

We want R to start at the second to last character (-2) and end at the second to last character (-1). We’ll then mutate a new column called StateID that is filled with the subsetted characters.

inequality_union_msa_controls_metro_shapes2 <- inequality_union_msa_controls_metro_shapes %>%
  mutate(StateID = str_sub(inequality_union_msa_controls_metro_shapes$cntyname, start = -2, end = -1))

Finally, let’s save this dataset for future use. But we can’t just save it as a .csv like we usually do. In order to preserve geometry’s storage as a sfc_MULTIPOLYGON, we’ll need to write a .rds file.

write_rds(inequality_union_msa_controls_metro_shapes2 , "msa_inq_union.rds")

Now we’re ready to do the analysis!

END PART 1