Before beginning this lab, make sure you’ve:
created a sub-folder for this assignment within your ECON_210 folder;
within your assignment sub-folder, created another sub-folder called DATA;
downloaded the necessary data from the WordPress site and saved it to that DATA sub-folder;
set your working directory. The address of your working directory should look something like /Users/YourName/OneDrive - Middlebury College/ECON_210/LABS/LAB_#/DATA.
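If you prefer to set the working directory with code, here is a minimal sketch you can run in the Console (the path is just the example above, with LAB_# as a placeholder for your actual lab folder; substitute your own path). You can also use Session -> Set Working Directory -> Choose Directory… in RStudio.
setwd("/Users/YourName/OneDrive - Middlebury College/ECON_210/LABS/LAB_#/DATA")
## confirm that the working directory is set where you expect
getwd()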
To create your new lab report, in RStudio, go to New File -> R Markdown… You can also download a lab report template from the Labs page of the class WordPress site.
Once you have a dataset, what’s the next step? This lab will walk you through the common steps we take before data analysis begins.
We will do the following: 1) Clean datasets 2) Merge/Join multiple datasets 3) Simple visual analysis comparing the means of two groups
The question this lab will examine is: “Is voter turnout lower in counties with a higher percentage of Black voters?” One motivation for this question is the claim that voter turnout is being suppressed; see, for example, the article “Who gets to vote in Florida?”.
We’re warning you now: this lab is long. But! It contains almost all of the cleaning code you will need for any dataset you come across ever. Maybe not ever, but definitely for the rest of this class. So get yourself a snack, take a stretch, and let’s get into it!
We will use five datasets.
Start by loading the haven package (you may need to install it). The haven package allows us to open data files from other statistical programs, and we’ll need it to open the ACS data, which is stored as a STATA .dta file. After loading haven, open the ACS_raw dataset using the read_dta function.
library(haven)
ACS_raw <- read_dta("ACS_raw.dta")
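One note before we go further: the pipe operator %>% and the data-cleaning verbs used throughout this lab (rename, select, filter, arrange, drop_na, pivot_wider, read_csv) come from the tidyverse, so load it now if your setup code doesn’t already (install it first with install.packages("tidyverse") if needed).
library(tidyverse)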
Now, check to see if the county FIPS is the unique identifier using unique_id. This function comes from the thatssorandom package, which is available on GitHub but not on CRAN. What does this mean? You cannot install it simply by running install.packages. Instead, in the Console, do the following:
install.packages("githubinstall")
library(githubinstall)
githubinstall("thatssorandom")
If that doesn’t work, try the following:
library(devtools)
install_github("edwinth/thatssorandom")
See R packages on GitHub and details on the `unique_id` function here.
If unique_id returns “TRUE” then FIPS is the unique identifier.
## Package that contains the unique_id function
library(thatssorandom)
## unique_id returns "TRUE" if the variable FIPS is a unique identifier
ACS_raw %>% unique_id(FIPS)
## [1] TRUE
Some people have had problems installing thatssorandom. An alternative is the following:
length(unique(ACS_raw$FIPS))
## [1] 3220
What exactly are we looking for here? We know that there are 3220 observations in our ACS_raw dataset. If FIPS is a unique identifier, then the total number of observations should equal the number of distinct FIPS codes. If we find that there are 3220 unique FIPS codes, this means that each row of the dataset is a county and each county appears only once. Thus, FIPS uniquely identifies each observation in the dataset. unique returns only the unique values of FIPS, and length measures the length of that vector. If this length is less than 3220, then we know some FIPS codes appear more than once. If it’s exactly 3220, then every FIPS appears exactly once.
So if you are having problems, replace unique_id with length(unique(.....)).
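If you want the same TRUE/FALSE answer that unique_id gives, one option (a sketch using only base R) is to compare the number of distinct FIPS codes to the number of rows:
## TRUE if every FIPS appears exactly once, i.e. FIPS uniquely identifies each row
length(unique(ACS_raw$FIPS)) == nrow(ACS_raw)
## [1] TRUE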
Next, we need to convert FIPS to numeric (it’s currently stored as a character). Why do we do this? We will be merging on FIPS, and this variable has to be the same type across all datasets that are to be merged. Note that as a string variable, FIPS has a leading “0”; when we destring it, that leading “0” is dropped.
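For example, converting the string "01001" to a number drops the leading zero:
as.numeric("01001")
## [1] 1001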
How did we know that FIPS was stored as a character? We can quickly check the structure of any variable in a dataset by using the str function.
str(ACS_raw$FIPS)
## chr [1:3220] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
## - attr(*, "label")= chr "FIPS"
## - attr(*, "format.stata")= chr "%50s"
ACS_raw$FIPS <- as.numeric(ACS_raw$FIPS)
Using the $ operator here allows us to specifically alter the structure of the FIPS variable. Now, we can re-check the structure of FIPS to make sure our code did what we intended.
str(ACS_raw$FIPS)
## num [1:3220] 1001 1003 1005 1007 1009 ...
See how the storage type changes from chr to num? FIPS is now stored as a number. Now, onto our other four datasets!
We’ll do the same thing as the ACS dataset here for the citizens dataset. Check to see if the county FIPS is the unique identifier and then convert FIPS from a string to a numeric variable.
citizens_raw <- read_dta("citizens_raw.dta")
## Confirm FIPS is unique identifier
citizens_raw %>% unique_id(FIPS)
## [1] TRUE
We will also rename a few variables to make it easier to find them in the future. Thankfully, citizens_raw comes with something very exciting: variable labels! (You may have noticed that our ACS_raw dataset has them as well.) We don’t typically use variable labels in R, but they are pretty common in STATA. Because citizens_raw is a STATA (.dta) file, we get variable labels as an added bonus.
We can read variable labels using the var_lab function from the expss package (you’ll likely need to install the package now). Load the expss package and then view the labels on a few variables.
library(expss)
var_lab(citizens_raw$T002_001)
## [1] "Total Number of United States Citizens"
var_lab(citizens_raw$T010_002)
## [1] "Total Population: Total Citizens of Voting Age"
var_lab(citizens_raw$T002_003)
## [1] "Total Number of United States Citizens: Not Hispanic or Latino: White Alone"
var_lab(citizens_raw$T002_004)
## [1] "Total Number of United States Citizens: Not Hispanic or Latino: Black or African"
var_lab(citizens_raw$T002_005)
## [1] "Total Number of United States Citizens: Not Hispanic or Latino: American Indian "
var_lab(citizens_raw$T002_006)
## [1] "Total Number of United States Citizens: Not Hispanic or Latino: Asian Alone"
var_lab(citizens_raw$T002_014)
## [1] "Total Number of United States Citizens: Hispanic or Latino"
Now, rename these variables based on their labels.
citizens_raw2 <- citizens_raw %>%
rename(citizens_total = T002_001,
citizens_voting_age = T010_002,
citizens_white = T002_003,
citizens_black = T002_004,
citizens_hispanic_latino = T002_014,
citizens_native_american = T002_005,
citizens_asian = T002_006)
We’ll also need to convert FIPS from a character to numeric variable.
citizens_raw2$FIPS <- as.numeric(citizens_raw2$FIPS)
Finally, to keep things simpler, we’ll keep only the variables that we need from this dataset (it will become clear in the section below why we keep these particular variables).
citizens <- citizens_raw2 %>%
select(FIPS,
NAME,
citizens_total,
citizens_voting_age,
citizens_white,
citizens_black,
citizens_hispanic_latino)
Let’s clean up the health dataset. It’s easy. We’re doing the same commands as above—you know the drill.
health_raw <- read_dta("health_raw.dta")
# check that FIPS is the unique identifier
health_raw %>% unique_id(FIPS)
## [1] TRUE
# convert FIPS from character -> numeric
health_raw$FIPS <- as.numeric(health_raw$FIPS)
#save this new dataset as `health`
health <- health_raw
Only two more datasets to go! Let’s import the election data now and check whether FIPS is the unique identifier. This time we won’t get “TRUE” back; instead, unique_id will return the duplicates. In this case, FIPS shows up in multiple rows, so each row does NOT represent a county.
county_pres <- read_csv("countypres_2000-2016.csv")
# check that FIPS is the unique identifier
county_pres %>% unique_id(FIPS)
## # A tibble: 50,524 x 11
## FIPS year state state_po county office candidate party candidatevotes
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 1001 2000 Alaba… AL Autauga Preside… Al Gore democ… 4942
## 2 1001 2000 Alaba… AL Autauga Preside… George W.… repub… 11993
## 3 1001 2000 Alaba… AL Autauga Preside… Ralph Nad… green 160
## 4 1001 2000 Alaba… AL Autauga Preside… Other <NA> 113
## 5 1001 2004 Alaba… AL Autauga Preside… John Kerry democ… 4758
## 6 1001 2004 Alaba… AL Autauga Preside… George W.… repub… 15196
## 7 1001 2004 Alaba… AL Autauga Preside… Other <NA> 127
## 8 1001 2008 Alaba… AL Autauga Preside… Barack Ob… democ… 6093
## 9 1001 2008 Alaba… AL Autauga Preside… John McCa… repub… 17403
## 10 1001 2008 Alaba… AL Autauga Preside… Other <NA> 145
## # … with 50,514 more rows, and 2 more variables: totalvotes <dbl>,
## # version <dbl>
Whoa! The county FIPS is not a unique identifier. Now what?
Fear not! With a little data exploration, we can still figure out the unit of observation. Let’s start by arranging the dataset by county (FIPS) and year, and then using head to look at the first 10 observations.
county_pres %>%
arrange(FIPS, year) %>%
head(10)
## # A tibble: 10 x 11
## year state state_po county FIPS office candidate party candidatevotes
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 2000 Alaba… AL Autauga 1001 Preside… Al Gore democ… 4942
## 2 2000 Alaba… AL Autauga 1001 Preside… George W.… repub… 11993
## 3 2000 Alaba… AL Autauga 1001 Preside… Ralph Nad… green 160
## 4 2000 Alaba… AL Autauga 1001 Preside… Other <NA> 113
## 5 2004 Alaba… AL Autauga 1001 Preside… John Kerry democ… 4758
## 6 2004 Alaba… AL Autauga 1001 Preside… George W.… repub… 15196
## 7 2004 Alaba… AL Autauga 1001 Preside… Other <NA> 127
## 8 2008 Alaba… AL Autauga 1001 Preside… Barack Ob… democ… 6093
## 9 2008 Alaba… AL Autauga 1001 Preside… John McCa… repub… 17403
## 10 2008 Alaba… AL Autauga 1001 Preside… Other <NA> 145
## # … with 2 more variables: totalvotes <dbl>, version <dbl>
This dataset is a bit strange: the unit of observation is a county x year x candidate. Remember, the unit of observation for the first three datasets is the county. So if we attempt a one-to-one merge between this dataset and one of the others, it won’t work. Let’s clean up this dataset so that it works for our purposes.
Let’s keep only the 2016 election year, and let’s also drop candidates from other parties (we’ll focus just on Democrats and Republicans, the two major political parties in the USA).
county_pres2 <- county_pres %>%
filter(year == 2016 & party == "democrat" | year == 2016 & party == "republican")
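Because & binds more tightly than | in R, the condition above is read as (year == 2016 & party == "democrat") | (year == 2016 & party == "republican"), which is exactly what we want. An equivalent and arguably easier-to-read version (just a sketch) is:
county_pres2 <- county_pres %>%
  filter(year == 2016, party %in% c("democrat", "republican"))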
Alright, are we done? Gosh no! Take a stretch break, and then proceed. We need to clean this dataset. Now that we’ve manipulated the data, let’s re-examine whether FIPS is a unique identifier.
county_pres2 %>% unique_id(FIPS)
## # A tibble: 6,316 x 11
## FIPS year state state_po county office candidate party candidatevotes
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 1001 2016 Alaba… AL Autauga Presid… Hillary Cl… democ… 5936
## 2 1001 2016 Alaba… AL Autauga Presid… Donald Tru… repub… 18172
## 3 1003 2016 Alaba… AL Baldwin Presid… Hillary Cl… democ… 18458
## 4 1003 2016 Alaba… AL Baldwin Presid… Donald Tru… repub… 72883
## 5 1005 2016 Alaba… AL Barbour Presid… Hillary Cl… democ… 4871
## 6 1005 2016 Alaba… AL Barbour Presid… Donald Tru… repub… 5454
## 7 1007 2016 Alaba… AL Bibb Presid… Hillary Cl… democ… 1874
## 8 1007 2016 Alaba… AL Bibb Presid… Donald Tru… repub… 6738
## 9 1009 2016 Alaba… AL Blount Presid… Hillary Cl… democ… 2156
## 10 1009 2016 Alaba… AL Blount Presid… Donald Tru… repub… 22859
## # … with 6,306 more rows, and 2 more variables: totalvotes <dbl>, version <dbl>
Nope. It looks like each county FIPS is repeated twice, once for each political party. Let’s see if these two columns (variables) together uniquely identify each observation.
county_pres2 %>% unique_id(FIPS, party)
## # A tibble: 6 x 11
## FIPS party year state state_po county office candidate candidatevotes
## <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 NA democr… 2016 Connec… <NA> Statewi… Presi… Hillary C… NA
## 2 NA democr… 2016 Maine <NA> Maine U… Presi… Hillary C… 3017
## 3 NA democr… 2016 Rhode … <NA> Federal… Presi… Hillary C… 637
## 4 NA republ… 2016 Connec… <NA> Statewi… Presi… Donald Tr… NA
## 5 NA republ… 2016 Maine <NA> Maine U… Presi… Donald Tr… 648
## 6 NA republ… 2016 Rhode … <NA> Federal… Presi… Donald Tr… 53
## # … with 2 more variables: totalvotes <dbl>, version <dbl>
So we have a county X party dataset. We want to change this to be simply a county-level dataset. Why?
Answer: All the other datasets are county-level. In order to join/merge them, it really helps if they all have the same unit of observation.
To change this dataset so that each county is represented only once, we can do some reshaping using the pivot_wider function. pivot_wider lets us go from FIPS x candidate observations to just FIPS observations. Before reshaping the data, let’s keep only the variables we need.
county_pres3 <- county_pres2 %>%
select(FIPS, party, candidatevotes, totalvotes)
Now, let’s check to see if we have any missing values (“NA”). This is typically not a major problem at this stage, UNLESS there’s a missing value for our identifying variables (FIPS).
## Returns which observations have missing values
which(is.na(county_pres3$FIPS))
## [1] 6229 6230 6231 6232 6233 6234
## See how observation 6229 looks
county_pres3[6229,]
## # A tibble: 1 x 4
## FIPS party candidatevotes totalvotes
## <dbl> <chr> <dbl> <dbl>
## 1 NA democrat NA 5056
## Remove observations with missing FIPS
county_pres3 <- county_pres3 %>%
drop_na(FIPS)
## Check to see if any observations have a missing FIPS
which(is.na(county_pres3$FIPS))
## integer(0)
We’re good!
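As an aside, if drop_na gives you trouble, an equivalent approach (a sketch) keeps only the rows where FIPS is not missing:
## keep only observations with a non-missing FIPS
county_pres3 <- county_pres3 %>%
  filter(!is.na(FIPS))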
Now, let’s use pivot_wider. The names_from argument takes the column party and turns its values into new column names, while values_from gives the columns whose values fill those new columns. It can be hard to explain; the best way to see what happens is to compare county_pres4 to county_pres3 below. See what happens?
county_pres4<- county_pres3 %>%
pivot_wider(names_from = party, values_from = c(candidatevotes, totalvotes))
# Check to see if unique number of FIPS equals the number of observations
county_pres4 %>% unique_id(FIPS)
## [1] TRUE
county_pres3 %>%
arrange(FIPS) %>%
head(10)
## # A tibble: 10 x 4
## FIPS party candidatevotes totalvotes
## <dbl> <chr> <dbl> <dbl>
## 1 1001 democrat 5936 24973
## 2 1001 republican 18172 24973
## 3 1003 democrat 18458 95215
## 4 1003 republican 72883 95215
## 5 1005 democrat 4871 10469
## 6 1005 republican 5454 10469
## 7 1007 democrat 1874 8819
## 8 1007 republican 6738 8819
## 9 1009 democrat 2156 25588
## 10 1009 republican 22859 25588
county_pres4 %>%
arrange(FIPS) %>%
head(10)
## # A tibble: 10 x 5
## FIPS candidatevotes_de… candidatevotes_re… totalvotes_demo… totalvotes_repu…
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 5936 18172 24973 24973
## 2 1003 18458 72883 95215 95215
## 3 1005 4871 5454 10469 10469
## 4 1007 1874 6738 8819 8819
## 5 1009 2156 22859 25588 25588
## 6 1011 3530 1140 4710 4710
## 7 1013 3726 4901 8732 8732
## 8 1015 13242 32865 47864 47864
## 9 1017 5784 7843 13900 13900
## 10 1019 1547 8953 10733 10733
One more item. The variable totalvotes_democrat is identical to totalvotes_republican, so we only need one. Let’s remove totalvotes_democrat, and then rename totalvotes_republican to totalvotes.
county_pres5 <- county_pres4 %>%
select(-totalvotes_democrat) %>%
rename(totalvotes = totalvotes_republican)
Great! Now we have our county_pres dataset in exactly the form we need.
Our last dataset covers police violence. The original dataset is complicated: the unit of observation is a victim of police violence, and cleaning it so that it represents the number of victims in each county requires extensive code. I have posted this code (in STATA; R code will be available at a later date), but for the purpose of this lab, I provide a cleaned version of the data.
To be clear, the victim in this dataset is someone who has been killed by police.
police_violence <- read_dta("police_violence.dta")
# check that FIPS is the unique identifier
police_violence %>% unique_id(FIPS)
## [1] TRUE
Now that everything is cleaned, merging all 5 datasets will be straightforward. Typically, though, this process is iterative: you clean the datasets, attempt a merge, fail at the merge, and go back and clean the data some more. As I noted earlier, this takes time, so budget accordingly. Let’s start with the ACS dataset and merge the citizens dataset onto it using a 1-to-1 merge based on county FIPS.
ACS_citizens <- merge(ACS_raw, citizens, by = "FIPS")
How do we know that we merged correctly? The ACS_raw dataset has 3220 observations and 621 variables. The citizens dataset has 3220 observations and 7 variables. Our merged dataset, ACS_citizens, has 3220 observations and 627 variables (the 621 from ACS_raw plus the 6 variables from citizens other than FIPS).
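If you prefer the tidyverse join functions, a sketch of the same 1-to-1 merge uses inner_join (and the all = TRUE merge below corresponds to full_join):
## equivalent to merge(ACS_raw, citizens, by = "FIPS") here, since every FIPS appears in both datasets
ACS_citizens <- inner_join(ACS_raw, citizens, by = "FIPS")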
Now, let’s merge the rest of the datasets together. Note that when we merge in the police_violence dataset, we are specifying for R to keep all observations from both datasets, regardless of whether they match an observation from the other dataset. We’ll explain why in a second.
ACS_citizens_health <- merge(ACS_citizens, health, by = "FIPS")
ACS_citizens_health_countypres <- merge(ACS_citizens_health, county_pres5, by = "FIPS")
ACS_citizens_health_countypres_police <- merge(ACS_citizens_health_countypres, police_violence, by = "FIPS", all = TRUE)
Why did we keep those missing values? We’re still interested in counties where there is no police violence. Right now, counties with no reported police violence just report NA. We can replace those NAs with 0s using the following code:
## Replace NA with 0 in columns 677 through 684 (the police-violence variables added by the last merge)
ACS_citizens_health_countypres_police <- ACS_citizens_health_countypres_police %>%
  mutate_at(c(677:684), ~replace(., is.na(.), 0))
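In newer versions of dplyr, mutate_at has been superseded by across(). An equivalent sketch (still assuming, as above, that columns 677 through 684 are the variables that came from police_violence) is:
ACS_citizens_health_countypres_police <- ACS_citizens_health_countypres_police %>%
  mutate(across(677:684, ~ replace(., is.na(.), 0)))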
Note also the difference in how we treat missing values. For FIPS, missing values are a PROBLEM. Why? If you don’t have a FIPS number, you cannot do a merge. Observations with missing FIPS numbers are not helpful because they don’t tell us where we are missing data. If we had additional geographic identifiers (county name, state) that could help us find the FIPS.
However, for police violence, missing values are telling us something - specifically counties where there is NO police violence.
Finally, let’s save this newly merged dataset as a .csv for us to access later using the write.csv function (R will save this dataset to your working directory).
write.csv(ACS_citizens_health_countypres_police, file = "ACS_citizens_health_countypres_police.csv")
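One small tip: by default, write.csv adds an extra first column of row numbers to the file. If you don’t want that, add row.names = FALSE:
write.csv(ACS_citizens_health_countypres_police,
          file = "ACS_citizens_health_countypres_police.csv", row.names = FALSE)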
Whoohoo! You made it to the end of Part I of this lab. Don’t worry, Part II won’t be quite as long. Take a break, get some sunshine, and we’ll see you in Part II!