Lab 7: Cleaning / Merging / Descriptive Analysis (Part 1)

Preliminaries

Before beginning this lab, make sure you’ve:

created a sub-folder for this assignment within your ECON_210 folder
Within your assignment sub-folder, created another sub-folder called DATA
Downloaded the necessary data from the WordPress site and saved it to your assignment-specific subfolder.
Set your working directory. The address of your working directory should look something like /Users/YourName/OneDrive - Middlebury College/ECON_210/LABS/LAB_#/DATA ;

Creating a reproducible lab report

To create your new lab report, in RStudio, go to New File -> R Markdown… You can also download a lab report template from the Labs page of the class WordPress site.

How to get started with empirical research

Once you have a dataset, what’s the next step? This lab will walk you through the common steps we take before data analysis begins.

We will do the following: 1) Clean datasets 2) Merge/Join multiple datasets 3) Simple visual analysis comparing the means of two groups

The question this lab will examine is: “Is voter turnout lower in counties with a higher percentage of Black voters?” One motivation for this question are the claims that voter turnout is being surpressed Who gets to vote in Florida?.

We’re warning you now- this lab is long. But! It contains almost all of the cleaning code you will need for any dataset you come across ever. Maybe not ever, but definitely for the rest of this class. So get yourself a snack, take a stretch, and let’s get into it!

The datasets

We will use five datasets.

The American Community Survey or ACS (Source: Social Explorer)
US Citizen Voting Age Population (Source: Social Explorer)
US County Health Rankings (Source: Social Explorer)
County Presidential Election Returns Source: MIT Election Data and Science Lab
County Level Police Violence Data Source: Mapping Police Violence

Step 1: Clean Datasets

Dataset 1: ACS data

Start by loading the haven package (you may need to install it). The haven package allows us to open data forms from other statistical programs, and we’ll need it to open the ACS data which is stored as a STATA .dta file. After loading haven, open the ACS_raw dataset using the read_dta function.

library(haven)

ACS_raw <- read_dta("ACS_raw.dta")

Now, check to see if the county FIPS is the unique identifier using unique_id. It’s a function that comes from the thatssorandom package and is available on GitHub and not on CRAN. What does this mean? You cannot just install it in the console using install.packages. In the Console, do the following:

install.packages("githubinstall")

library(githubinstall)

githubinstall("thatssorandom")

If that doesn’t work, try the following:

library(devtools)

install_github("edwinth/thatssorandom")

See R packages on GitHub and details on the `unique_id’ function here

If unique_id returns “TRUE” then FIPS is the unique identifier.

## Package that contains the unique_id fucntion
library(thatssorandom)

## unique_id returns "TRUE" if the variable FIPS is a unique identifier
ACS_raw %>% unique_id(FIPS)

## [1] TRUE

For some people, they are having problems installing thatsorandom. An alternative is to do the following:

length(unique(ACS_raw$FIPS))

## [1] 3220

What exactly are we looking for here? We know that there are 3220 observations in our ACS_raw dataset. If FIPS is a unique identifier, than the total number of observations should equal the distinct number of FIPS codes. If we find that there are 3220 unique FIPS, this means that each row of the dataset is a county and each county only appears once. Thus, FIPS uniquely identifies each observation in the dataset. unique will return only unique values of FIPS, and length will measure the length of the vector. If this is less than 3220, then we know some FIPS were duplicated and others didn’t appear. If it’s exactly 3220, then every FIPS appears exactly once.

So if you are having problems, replace unique_id with length(unique(.....)).

Next, we need to convert FIPS to numeric (it’s currently stored as a character). Why do we do this? We will be merging on FIPS, and this variable has to take the same type across all datasets that are to be merged. Note that FIPS when it is a string variable will have leading “0”, while when we destring it, the leading “0” will be eliminated.

How did we know that FIPS was stored as a character? We can quickly check the structure of any variable in a dataset by using the str function.

str(ACS_raw$FIPS)

##  chr [1:3220] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
##  - attr(*, "label")= chr "FIPS"
##  - attr(*, "format.stata")= chr "%50s"

ACS_raw$FIPS <- as.numeric(ACS_raw$FIPS)

Using the $ operator here allows us to specifically alter the structure of the FIPS variable. Now, we can re-check the structure of FIPS to make sure our code did what we intended.

str(ACS_raw$FIPS)

##  num [1:3220] 1001 1003 1005 1007 1009 ...

See how the storage type changes from chr to num? FIPS is now stored as a number. Now, onto our other four datasets!

Dataset 2: Citizens Data

We’ll do the same thing as the ACS dataset here for the citizens dataset. Check to see if the county FIPS is the unique identifier and then convert FIPS from a string to a numeric variable.

citizens_raw <- read_dta("citizens_raw.dta")

## Confirm FIPS is unique identifier
citizens_raw %>% unique_id(FIPS)

## [1] TRUE

We will also rename a few variables to make it easier to find them in the future. Thankfully, citizens_raw comes with something very exciting- variable labels! (You may have noticed that our ACS_raw dataset has them as well). We don’t typically use variable labels in R, but they are pretty common in STATA. Because citizens_raw is a STATA (.dta) file, we get variable labels as an added bonus.

We can read variable labels using the var_lab function from the expss package (you’ll likely need to install the package now). Load the expss package and the view the labels on a few variables. .

library(expss)

var_lab(citizens_raw$T002_001)

## [1] "Total Number of United States Citizens"

var_lab(citizens_raw$T010_002)

## [1] "Total Population: Total Citizens of Voting Age"

var_lab(citizens_raw$T002_003)

## [1] "Total Number of United States Citizens: Not Hispanic or Latino: White Alone"

var_lab(citizens_raw$T002_004)

## [1] "Total Number of United States Citizens: Not Hispanic or Latino: Black or African"

var_lab(citizens_raw$T002_005)

## [1] "Total Number of United States Citizens: Not Hispanic or Latino: American Indian "

var_lab(citizens_raw$T002_006)

## [1] "Total Number of United States Citizens: Not Hispanic or Latino: Asian Alone"

var_lab(citizens_raw$T002_014)

## [1] "Total Number of United States Citizens: Hispanic or Latino"

Now, rename these variables based on their labels.

citizens_raw2 <- citizens_raw %>%
rename(citizens_total = T002_001,
       citizens_voting_age = T010_002,
       citizens_white = T002_003,
       citizens_black = T002_004,
       citizens_hispanic_latino = T002_014,
       citizens_native_american = T002_005,
       citizens_asian = T002_006)

We’ll also need to convert FIPS from a character to numeric variable.

citizens_raw2$FIPS <- as.numeric(citizens_raw2$FIPS)

Finally, to keep things simpler, we’ll just keep the variables that we need from this dataset (it will become clear why we keep these variables in the section below)

citizens <- citizens_raw2 %>%
  select(FIPS,
         NAME,
         citizens_total,
         citizens_voting_age,
         citizens_white,
         citizens_black,
         citizens_hispanic_latino)

Dataset 3: Health Data

Let’s clean up the health dataset. It’s easy. We’re doing the same commands as above—you know the drill.

health_raw <- read_dta("health_raw.dta")

#check that FIPS is the unique identifier
# check that FIPS is the unique identifier
health_raw %>% unique_id(FIPS)

## [1] TRUE

# convert FIPS from character -> numeric
health_raw$FIPS <- as.numeric(health_raw$FIPS)

#save this new dataset as `health`
health <- health_raw

Datset 4: Election data

Only two more datasets to go! Let’s import the election data now and check that FIPS is the unique identifier. We will see that we don’t get “TRUE” back and instead, `unique_id’ will return the duplicates. In this case FIPS is going to show up in multiple rows. Thus each row does NOT represent a county.

county_pres <- read_csv("countypres_2000-2016.csv")

# check that FIPS is the unique identifier
county_pres %>% unique_id(FIPS)

## # A tibble: 50,524 x 11
##     FIPS  year state  state_po county  office   candidate  party  candidatevotes
##    <dbl> <dbl> <chr>  <chr>    <chr>   <chr>    <chr>      <chr>           <dbl>
##  1  1001  2000 Alaba… AL       Autauga Preside… Al Gore    democ…           4942
##  2  1001  2000 Alaba… AL       Autauga Preside… George W.… repub…          11993
##  3  1001  2000 Alaba… AL       Autauga Preside… Ralph Nad… green             160
##  4  1001  2000 Alaba… AL       Autauga Preside… Other      <NA>              113
##  5  1001  2004 Alaba… AL       Autauga Preside… John Kerry democ…           4758
##  6  1001  2004 Alaba… AL       Autauga Preside… George W.… repub…          15196
##  7  1001  2004 Alaba… AL       Autauga Preside… Other      <NA>              127
##  8  1001  2008 Alaba… AL       Autauga Preside… Barack Ob… democ…           6093
##  9  1001  2008 Alaba… AL       Autauga Preside… John McCa… repub…          17403
## 10  1001  2008 Alaba… AL       Autauga Preside… Other      <NA>              145
## # … with 50,514 more rows, and 2 more variables: totalvotes <dbl>,
## #   version <dbl>

Whoa! The county FIPS is not a unique identifier. Now what?

Fear not! With a little data exploration, we can still figure out the unit of observation. Let’s start by arrangeing the dataset by county (FIPS) and year, and then head the first 10 observations for the variables specified.

county_pres %>%
  arrange(FIPS, year) %>%
  head(10)

## # A tibble: 10 x 11
##     year state  state_po county   FIPS office   candidate  party  candidatevotes
##    <dbl> <chr>  <chr>    <chr>   <dbl> <chr>    <chr>      <chr>           <dbl>
##  1  2000 Alaba… AL       Autauga  1001 Preside… Al Gore    democ…           4942
##  2  2000 Alaba… AL       Autauga  1001 Preside… George W.… repub…          11993
##  3  2000 Alaba… AL       Autauga  1001 Preside… Ralph Nad… green             160
##  4  2000 Alaba… AL       Autauga  1001 Preside… Other      <NA>              113
##  5  2004 Alaba… AL       Autauga  1001 Preside… John Kerry democ…           4758
##  6  2004 Alaba… AL       Autauga  1001 Preside… George W.… repub…          15196
##  7  2004 Alaba… AL       Autauga  1001 Preside… Other      <NA>              127
##  8  2008 Alaba… AL       Autauga  1001 Preside… Barack Ob… democ…           6093
##  9  2008 Alaba… AL       Autauga  1001 Preside… John McCa… repub…          17403
## 10  2008 Alaba… AL       Autauga  1001 Preside… Other      <NA>              145
## # … with 2 more variables: totalvotes <dbl>, version <dbl>

This dataset is a bit strange- the unit of observation is a county x year x candidate. Remember the unit of observation for the first three datasets is the county. So if we attempt to do a one-to-one merge with this dataset and one of the others, it won’t work. Let’s clean up this dataset so that it works for our purposes.

Let’s keep the 2016 election year. And also, let’s drop the other party candidate (we’ll focus just on Democrats and Republicans—the two major political parties in the USA).

 county_pres2 <- county_pres %>%
  filter(year == 2016 & party == "democrat" | year == 2016 & party == "republican")

Alright, are we done? Gosh no! Take a stretch break, and then proceed. We need to clean this dataset. Now that we’ve manipulated the data, let’s re-examine whether FIPS is a unique identifier.

county_pres2 %>% unique_id(FIPS)

## # A tibble: 6,316 x 11
##     FIPS  year state  state_po county  office  candidate   party  candidatevotes
##    <dbl> <dbl> <chr>  <chr>    <chr>   <chr>   <chr>       <chr>           <dbl>
##  1  1001  2016 Alaba… AL       Autauga Presid… Hillary Cl… democ…           5936
##  2  1001  2016 Alaba… AL       Autauga Presid… Donald Tru… repub…          18172
##  3  1003  2016 Alaba… AL       Baldwin Presid… Hillary Cl… democ…          18458
##  4  1003  2016 Alaba… AL       Baldwin Presid… Donald Tru… repub…          72883
##  5  1005  2016 Alaba… AL       Barbour Presid… Hillary Cl… democ…           4871
##  6  1005  2016 Alaba… AL       Barbour Presid… Donald Tru… repub…           5454
##  7  1007  2016 Alaba… AL       Bibb    Presid… Hillary Cl… democ…           1874
##  8  1007  2016 Alaba… AL       Bibb    Presid… Donald Tru… repub…           6738
##  9  1009  2016 Alaba… AL       Blount  Presid… Hillary Cl… democ…           2156
## 10  1009  2016 Alaba… AL       Blount  Presid… Donald Tru… repub…          22859
## # … with 6,306 more rows, and 2 more variables: totalvotes <dbl>, version <dbl>

Nope. It looks like each county FIPS is repeated twice based on political party. Let’s see if these two columns (variables) uniquiely identifies each observation.

county_pres2 %>% unique_id(FIPS, party)

## # A tibble: 6 x 11
##    FIPS party    year state   state_po county   office candidate  candidatevotes
##   <dbl> <chr>   <dbl> <chr>   <chr>    <chr>    <chr>  <chr>               <dbl>
## 1    NA democr…  2016 Connec… <NA>     Statewi… Presi… Hillary C…             NA
## 2    NA democr…  2016 Maine   <NA>     Maine U… Presi… Hillary C…           3017
## 3    NA democr…  2016 Rhode … <NA>     Federal… Presi… Hillary C…            637
## 4    NA republ…  2016 Connec… <NA>     Statewi… Presi… Donald Tr…             NA
## 5    NA republ…  2016 Maine   <NA>     Maine U… Presi… Donald Tr…            648
## 6    NA republ…  2016 Rhode … <NA>     Federal… Presi… Donald Tr…             53
## # … with 2 more variables: totalvotes <dbl>, version <dbl>

So we have a county X party dataset. We want to change this to be simply a county-level dataset. Why?

Answer: All the other datasets are county-level. In order to join/merge them, it really helps if they all have the same unit of observation.

To change this dataset so that each county is only represented once, we can use do some reshaping using the pivot_wider function. pivot_wider lets us go from FIPS x candidate observations to just FIPS observations. Before doing this reshaping of the data, let’s just keep the variables we need.

county_pres3 <- county_pres2 %>%
   select(FIPS, party, candidatevotes, totalvotes )

Now, let’s check to see if we have any missing values (“NA”). This is typically not a major problem at this stage, UNLESS there’s a missing value for our identifying variables (FIPS).

## Returns which observations have missing values
which(is.na(county_pres3$FIPS))

## [1] 6229 6230 6231 6232 6233 6234

## See how observation 6229 looks
county_pres3[6229,]

## # A tibble: 1 x 4
##    FIPS party    candidatevotes totalvotes
##   <dbl> <chr>             <dbl>      <dbl>
## 1    NA democrat             NA       5056

## Remove obserations with missing FIPS
county_pres3 <- county_pres3 %>%
   drop_na(FIPS)

## Check to see if any observations have a missing FIPS
which(is.na(county_pres3$FIPS))

## integer(0)

We’re good!

Now, let’s use pivor_wider'.names_from’ will use the column party' and move it to thevalues_from’ columns. It can be hard to explain. The best way to see this is to compare county_pres4' tocounty_pres3’. See what happens?

county_pres4<- county_pres3 %>% 
 pivot_wider(names_from =party, values_from = c(candidatevotes, totalvotes) )

# Check to see if unique number of FIPS equals the number of observations
county_pres4 %>% unique_id(FIPS)

## [1] TRUE

county_pres3 %>%
  arrange(FIPS) %>%
  head(10)

## # A tibble: 10 x 4
##     FIPS party      candidatevotes totalvotes
##    <dbl> <chr>               <dbl>      <dbl>
##  1  1001 democrat             5936      24973
##  2  1001 republican          18172      24973
##  3  1003 democrat            18458      95215
##  4  1003 republican          72883      95215
##  5  1005 democrat             4871      10469
##  6  1005 republican           5454      10469
##  7  1007 democrat             1874       8819
##  8  1007 republican           6738       8819
##  9  1009 democrat             2156      25588
## 10  1009 republican          22859      25588

county_pres4 %>%
  arrange(FIPS) %>%
  head(10)

## # A tibble: 10 x 5
##     FIPS candidatevotes_de… candidatevotes_re… totalvotes_demo… totalvotes_repu…
##    <dbl>              <dbl>              <dbl>            <dbl>            <dbl>
##  1  1001               5936              18172            24973            24973
##  2  1003              18458              72883            95215            95215
##  3  1005               4871               5454            10469            10469
##  4  1007               1874               6738             8819             8819
##  5  1009               2156              22859            25588            25588
##  6  1011               3530               1140             4710             4710
##  7  1013               3726               4901             8732             8732
##  8  1015              13242              32865            47864            47864
##  9  1017               5784               7843            13900            13900
## 10  1019               1547               8953            10733            10733

One more item. The variable totalvotes_democrat' is identical tototalvotes_republican’. We only need one. Let’s remove totalvotes_democrat', and then renametotalvotes_republican’ to `totalvotes’

county_pres5 <- county_pres4 %>%
  select(-totalvotes_democrat) %>% 
  rename(totalvotes = totalvotes_republican)

Great! Now we have out countypres dataset in exactly the form we need.

Dataset 5: police_violence

The original dataset is complicated. The unit of observation is a victim of police violence. In order to clean this so that it represents the number of victims in a county requires extensive code. I have posted this code (in STATA, R code will be available at a later date), but for the purpose of this lab, I provide a cleaned version of the data.

To be clear, the victim in this dataset is someone who has been killed by police.

police_violence <- read_dta("police_violence.dta")

# check that FIPS is the unique identifier
police_violence %>% unique_id(FIPS)

## [1] TRUE

PART 2: Merging Datasets

It will be straightforward to merge all 5 datasets. Typically this process is iterative. You clean the datasets, attempt a merge, fail at the merge, and go back and clean the data to accomplish the merge. As I noted earlier, this takes time, so budget accordingly. Let’s start with the ACS dataset and we will merge the citizens dataset onto it using a 1-to-1 merge based on county FIPS.

ACS_citizens <- merge(ACS_raw, citizens, by = "FIPS")

How do we know that we merged correctly? the ACS_raw dataset has 3220 observations and 621 variables. The citizens dataset has 3220 observations and 7 variables. Our merged dataset, merge1, has 3220 observations and 627 (all the variables from citizens other than FIPS).

Now, let’s merge the rest of the datasets together. Note that when we merge in the police_violence dataset, we are specifying for R to keep all observations from both datasets, regardless of whether they match an observation from the other dataset. We’ll explain why in a second.

ACS_citizens_health <- merge(ACS_citizens, health, by = "FIPS")

ACS_citizens_health_countypres <- merge(ACS_citizens_health, county_pres5, by = "FIPS")

ACS_citizens_health_countypres_police <- merge(ACS_citizens_health_countypres, police_violence, by = "FIPS", all = TRUE)

Why did we keep those missing values? We’re still interested in counties where there is no police violence. Right now, counties with no reported police violence just report *NA*. We can replace those *NA*s with 0s using the following code:

ACS_citizens_health_countypres_police<-ACS_citizens_health_countypres_police %>%
mutate_at(c(677:684), ~replace(., is.na(.), 0))

Note also the difference in how we treat missing values. For FIPS, missing values are a PROBLEM. Why? If you don’t have a FIPS number, you cannot do a merge. Observations with missing FIPS numbers are not helpful because they don’t tell us where we are missing data. If we had additional geographic identifiers (county name, state) that could help us find the FIPS.

However, for police violence, missing values are telling us something - specifically counties where there is NO police violence.

Finally, let’s save this newly merged dataset as a .csv for us to access later using the write.csv (R will save this dataset to your working directory).

write.csv(ACS_citizens_health_countypres_police, file = "ACS_citizens_health_countypres_police.csv")

Whoohoo! You made it to the end of Part I of this lab. Don’t worry, Part II won’t be quite as long. Take a break, get some sunshine, and we’ll see you in Part II!