Submit the .Rmd file on Moodle and send me the .html file (the one produced when you click on the "Knit" button) through Slack. If you are having issues knitting your file do get in touch (and knit regularly to make sure everything is working properly).

This is the .Rmd file for you to fill out. Just put your code within the marked regions, and click the Knit button in RStudio, just above this script, to see the generated HTML output. Knit it now, before you add any code, to see how nice the output looks.

In some chunks you will need to replace eval = FALSE by eval = TRUE in the chunk "header". This will make sure that the chunk is 'evaluated' (i.e., that the code runs) when you knit the document.

# code goes here
This midterm project consists of one long analysis of data from the 2020 Presidential elections in the United States.
You will be asked to download the data from its original source, just like you would if you were to undertake this task by yourself. You will use everything we’ve learned so far: importing (un-tidy) data; summarizing, visualizing and tidying it; running regressions for different variables; and thinking about whether these associations can be interpreted causally. In the process, you will also learn 2 useful operations: merging 2 (or more) datasets together, and “reshaping” data from wide to long format. Don’t worry, we will walk you through how to do all of this.
We hope you find this project interesting and useful. Now let’s get going!
Erin Barbusse, Roisin Casey
It’s good practice to load all the packages you will need in the same place.
# Load all the packages you need here
# Don't write the install.packages() code here. Run that code in the console.
library(readxl)
library(readr)
library(tidyverse)
library(janitor)
library(sf)
library(tmap)
library(skimr)
library(jtools)
library(huxtable)
library(rmapshaper)
Download the data from November 2020 US presidential elections from here:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ#
Scroll down, click on the download icon next to the second file (countypres_2000-2020.tab), and select “Comma Separated Values (Original File Format)”.
This is straight from the MIT Election Data + Science lab, which collects and shares election data from the United States. This dataset contains election results aggregated at the county level. Counties are small geographic units in the United States.
1*. Load the file you downloaded into RStudio in an object called elections. You can use the read.csv function.
elections <- read.csv(file = 'countypres_2000-2020.csv')
str(elections)
## 'data.frame': 72617 obs. of 12 variables:
## $ year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ state : chr "ALABAMA" "ALABAMA" "ALABAMA" "ALABAMA" ...
## $ state_po : chr "AL" "AL" "AL" "AL" ...
## $ county_name : chr "AUTAUGA" "AUTAUGA" "AUTAUGA" "AUTAUGA" ...
## $ county_fips : int 1001 1001 1001 1001 1003 1003 1003 1003 1005 1005 ...
## $ office : chr "PRESIDENT" "PRESIDENT" "PRESIDENT" "PRESIDENT" ...
## $ candidate : chr "AL GORE" "GEORGE W. BUSH" "RALPH NADER" "OTHER" ...
## $ party : chr "DEMOCRAT" "REPUBLICAN" "GREEN" "OTHER" ...
## $ candidatevotes: int 4942 11993 160 113 13997 40872 1033 578 5188 5096 ...
## $ totalvotes : int 17208 17208 17208 17208 56480 56480 56480 56480 10395 10395 ...
## $ version : int 20191203 20191203 20191203 20191203 20191203 20191203 20191203 20191203 20191203 20191203 ...
## $ mode : chr "TOTAL" "TOTAL" "TOTAL" "TOTAL" ...
1. Use head to display the first 6 rows of the dataset
head(elections)
| year | state | state_po | county_name | county_fips | office | candidate | party | candidatevotes | totalvotes | version | mode |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | ALABAMA | AL | AUTAUGA | 1001 | PRESIDENT | AL GORE | DEMOCRAT | 4942 | 17208 | 20191203 | TOTAL |
| 2000 | ALABAMA | AL | AUTAUGA | 1001 | PRESIDENT | GEORGE W. BUSH | REPUBLICAN | 11993 | 17208 | 20191203 | TOTAL |
| 2000 | ALABAMA | AL | AUTAUGA | 1001 | PRESIDENT | RALPH NADER | GREEN | 160 | 17208 | 20191203 | TOTAL |
| 2000 | ALABAMA | AL | AUTAUGA | 1001 | PRESIDENT | OTHER | OTHER | 113 | 17208 | 20191203 | TOTAL |
| 2000 | ALABAMA | AL | BALDWIN | 1003 | PRESIDENT | AL GORE | DEMOCRAT | 13997 | 56480 | 20191203 | TOTAL |
| 2000 | ALABAMA | AL | BALDWIN | 1003 | PRESIDENT | GEORGE W. BUSH | REPUBLICAN | 40872 | 56480 | 20191203 | TOTAL |
2. What years are included in our data? What’s inside the enigmatic “mode” variable? Tabulate the variables year and mode separately to find out.
elections_year <- table(elections$year)
elections_year
##
## 2000 2004 2008 2012 2016 2020
## 12628 9474 9474 9474 9474 22093
elections_mode <- table(elections$mode)
elections_mode
##
## 2ND ABSENTEE ABSENTEE ABSENTEE BY MAIL
## 120 1995 1038
## ADVANCED VOTING EARLY EARLY VOTE
## 477 453 450
## EARLY VOTING ELECTION DAY FAILSAFE
## 120 3737 230
## FAILSAFE PROVISIONAL IN-PERSON ABSENTEE MAIL
## 230 230 145
## ONE STOP PROV PROVISIONAL
## 500 477 1832
## TOTAL
## 60583
The years included in the data are 2000, 2004, 2008, 2012, 2016, and 2020. The enigmatic “mode” variable records the different modes of voting used, such as absentee by mail, early voting, election day, etc.
3. What is the unit of observation? In other words, what does each row correspond to?
The unit of observation is a candidate in a county, in a given election year and for a given mode of voting: each row records the number of votes that candidate received in that county through that voting mode.
4. How many variables are there?
num_var <- ncol(elections)
num_var
## [1] 12
There are 12 variables present in the data set.
5. How many rows are there?
num_row <- nrow(elections)
num_row
## [1] 72617
There are 72,617 rows (observations) in the data set.
1*. This data contains the results of all presidential elections from 2000 to 2020. We only want to analyse the 2020 results. Use the filter function to keep only the observations that concern the 2020 election. Create a new object elections_2020 for this task.
elections_2020 <- elections %>%
filter(year == 2020) # year is stored as an integer, so we compare it to a number
2. Our main variable of interest is candidatevotes. In which counties and for which candidates is it missing?
no_candidatevotes <- elections_2020 %>%
filter(is.na(candidatevotes))
no_candidatevotes
| year | state | state_po | county_name | county_fips | office | candidate | party | candidatevotes | totalvotes | version | mode |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020 | CALIFORNIA | CA | SAN JOAQUIN | 6077 | PRESIDENT | OTHER | OTHER | NA | NA | 20210622 | TOTAL |
The only county with missing data for candidatevotes is San Joaquin. The missing value does not concern a specific named candidate: it is the votes for the “Other” candidates of the “Other” parties, i.e. those who were not the main candidates of the Democratic, Republican, Libertarian, or Green parties.
3. Searching online for the official results of those counties, are those missing values the result of an error in data collection, or are they due to the candidates receiving no votes in those counties?
It seems to be an error in data collection: our research shows that 10 candidates received votes in this county, yet the data only reports results for the 4 named parties. The “Other” candidate / “Other” party category should therefore contain the votes of the remaining 6 candidates, which from our calculations amounts to 3,553 votes.
In the rest of this midterm we will disregard those missing values, because there are few enough that they are unlikely to change the results of our analysis. Still, it was important to explore the reason behind missing values, to understand what they mean!
4. We saw previously that there were different modes of voting recorded in the data, and that one category is “Total”. This could be worrying: are we counting ballots twice? Could a ballot appear both in the “in person” and “total” categories? Cross-tabulate the mode variable with state, to understand whether the modes of voting are the same in all States. (Use the table function)
tab_mode_state <- table(elections_2020$state, elections_2020$mode)
tab_mode_state
##
## 2ND ABSENTEE ABSENTEE ABSENTEE BY MAIL ADVANCED VOTING
## ALABAMA 0 0 0 0
## ALASKA 0 0 0 0
## ARIZONA 0 0 0 0
## ARKANSAS 0 375 0 0
## CALIFORNIA 0 0 0 0
## COLORADO 0 0 0 0
## CONNECTICUT 0 0 0 0
## DELAWARE 0 0 0 0
## DISTRICT OF COLUMBIA 0 0 0 0
## FLORIDA 0 0 0 0
## GEORGIA 0 477 0 477
## HAWAII 0 0 0 0
## IDAHO 0 0 0 0
## ILLINOIS 0 0 0 0
## INDIANA 0 0 0 0
## IOWA 0 495 0 0
## KANSAS 0 0 0 0
## KENTUCKY 0 0 0 0
## LOUISIANA 0 0 0 0
## MAINE 0 0 0 0
## MARYLAND 120 120 0 0
## MASSACHUSETTS 0 0 0 0
## MICHIGAN 0 0 0 0
## MINNESOTA 0 0 0 0
## MISSISSIPPI 0 0 0 0
## MISSOURI 0 0 0 0
## MONTANA 0 0 0 0
## NEBRASKA 0 0 0 0
## NEVADA 0 0 0 0
## NEW HAMPSHIRE 0 0 0 0
## NEW JERSEY 0 0 0 0
## NEW MEXICO 0 0 0 0
## NEW YORK 0 0 0 0
## NORTH CAROLINA 0 0 500 0
## NORTH DAKOTA 0 0 0 0
## OHIO 0 0 0 0
## OKLAHOMA 0 0 308 0
## OREGON 0 0 0 0
## PENNSYLVANIA 0 0 0 0
## RHODE ISLAND 0 0 0 0
## SOUTH CAROLINA 0 0 230 0
## SOUTH DAKOTA 0 0 0 0
## TENNESSEE 0 0 0 0
## TEXAS 0 0 0 0
## UTAH 0 0 0 0
## VERMONT 0 0 0 0
## VIRGINIA 0 528 0 0
## WASHINGTON 0 0 0 0
## WEST VIRGINIA 0 0 0 0
## WISCONSIN 0 0 0 0
## WYOMING 0 0 0 0
##
## EARLY EARLY VOTE EARLY VOTING ELECTION DAY FAILSAFE
## ALABAMA 0 0 0 0 0
## ALASKA 0 0 0 0 0
## ARIZONA 0 75 0 75 0
## ARKANSAS 0 375 0 375 0
## CALIFORNIA 0 0 0 0 0
## COLORADO 0 0 0 0 0
## CONNECTICUT 0 0 0 0 0
## DELAWARE 0 0 0 0 0
## DISTRICT OF COLUMBIA 0 0 0 0 0
## FLORIDA 0 0 0 0 0
## GEORGIA 0 0 0 477 0
## HAWAII 0 0 0 0 0
## IDAHO 0 0 0 0 0
## ILLINOIS 0 0 0 0 0
## INDIANA 0 0 0 0 0
## IOWA 0 0 0 495 0
## KANSAS 0 0 0 0 0
## KENTUCKY 0 0 0 480 0
## LOUISIANA 0 0 0 0 0
## MAINE 0 0 0 0 0
## MARYLAND 0 0 120 120 0
## MASSACHUSETTS 0 0 0 0 0
## MICHIGAN 0 0 0 0 0
## MINNESOTA 0 0 0 0 0
## MISSISSIPPI 0 0 0 0 0
## MISSOURI 0 0 0 0 0
## MONTANA 0 0 0 0 0
## NEBRASKA 0 0 0 0 0
## NEVADA 0 0 0 0 0
## NEW HAMPSHIRE 0 0 0 0 0
## NEW JERSEY 0 0 0 0 0
## NEW MEXICO 0 0 0 0 0
## NEW YORK 0 0 0 0 0
## NORTH CAROLINA 0 0 0 500 0
## NORTH DAKOTA 0 0 0 0 0
## OHIO 0 0 0 0 0
## OKLAHOMA 308 0 0 308 0
## OREGON 0 0 0 0 0
## PENNSYLVANIA 0 0 0 0 0
## RHODE ISLAND 0 0 0 0 0
## SOUTH CAROLINA 0 0 0 230 230
## SOUTH DAKOTA 0 0 0 0 0
## TENNESSEE 0 0 0 0 0
## TEXAS 0 0 0 0 0
## UTAH 145 0 0 145 0
## VERMONT 0 0 0 0 0
## VIRGINIA 0 0 0 532 0
## WASHINGTON 0 0 0 0 0
## WEST VIRGINIA 0 0 0 0 0
## WISCONSIN 0 0 0 0 0
## WYOMING 0 0 0 0 0
##
## FAILSAFE PROVISIONAL IN-PERSON ABSENTEE MAIL ONE STOP
## ALABAMA 0 0 0 0
## ALASKA 0 0 0 0
## ARIZONA 0 0 0 0
## ARKANSAS 0 0 0 0
## CALIFORNIA 0 0 0 0
## COLORADO 0 0 0 0
## CONNECTICUT 0 0 0 0
## DELAWARE 0 0 0 0
## DISTRICT OF COLUMBIA 0 0 0 0
## FLORIDA 0 0 0 0
## GEORGIA 0 0 0 0
## HAWAII 0 0 0 0
## IDAHO 0 0 0 0
## ILLINOIS 0 0 0 0
## INDIANA 0 0 0 0
## IOWA 0 0 0 0
## KANSAS 0 0 0 0
## KENTUCKY 0 0 0 0
## LOUISIANA 0 0 0 0
## MAINE 0 0 0 0
## MARYLAND 0 0 0 0
## MASSACHUSETTS 0 0 0 0
## MICHIGAN 0 0 0 0
## MINNESOTA 0 0 0 0
## MISSISSIPPI 0 0 0 0
## MISSOURI 0 0 0 0
## MONTANA 0 0 0 0
## NEBRASKA 0 0 0 0
## NEVADA 0 0 0 0
## NEW HAMPSHIRE 0 0 0 0
## NEW JERSEY 0 0 0 0
## NEW MEXICO 0 0 0 0
## NEW YORK 0 0 0 0
## NORTH CAROLINA 0 0 0 500
## NORTH DAKOTA 0 0 0 0
## OHIO 0 0 0 0
## OKLAHOMA 0 0 0 0
## OREGON 0 0 0 0
## PENNSYLVANIA 0 0 0 0
## RHODE ISLAND 0 0 0 0
## SOUTH CAROLINA 230 230 0 0
## SOUTH DAKOTA 0 0 0 0
## TENNESSEE 0 0 0 0
## TEXAS 0 0 0 0
## UTAH 0 0 145 0
## VERMONT 0 0 0 0
## VIRGINIA 0 0 0 0
## WASHINGTON 0 0 0 0
## WEST VIRGINIA 0 0 0 0
## WISCONSIN 0 0 0 0
## WYOMING 0 0 0 0
##
## PROV PROVISIONAL TOTAL
## ALABAMA 0 0 201
## ALASKA 0 0 205
## ARIZONA 0 75 0
## ARKANSAS 0 375 0
## CALIFORNIA 0 0 290
## COLORADO 0 0 320
## CONNECTICUT 0 0 40
## DELAWARE 0 0 12
## DISTRICT OF COLUMBIA 0 0 5
## FLORIDA 0 0 335
## GEORGIA 477 0 0
## HAWAII 0 0 20
## IDAHO 0 0 176
## ILLINOIS 0 0 510
## INDIANA 0 0 460
## IOWA 0 0 0
## KANSAS 0 0 315
## KENTUCKY 0 0 0
## LOUISIANA 0 0 256
## MAINE 0 0 80
## MARYLAND 0 120 0
## MASSACHUSETTS 0 0 70
## MICHIGAN 0 0 415
## MINNESOTA 0 0 435
## MISSISSIPPI 0 0 410
## MISSOURI 0 0 580
## MONTANA 0 0 168
## NEBRASKA 0 0 279
## NEVADA 0 0 68
## NEW HAMPSHIRE 0 0 30
## NEW JERSEY 0 0 105
## NEW MEXICO 0 0 165
## NEW YORK 0 0 310
## NORTH CAROLINA 0 500 0
## NORTH DAKOTA 0 0 212
## OHIO 0 0 352
## OKLAHOMA 0 0 0
## OREGON 0 0 180
## PENNSYLVANIA 0 0 201
## RHODE ISLAND 0 0 24
## SOUTH CAROLINA 0 230 0
## SOUTH DAKOTA 0 0 198
## TENNESSEE 0 0 285
## TEXAS 0 0 1270
## UTAH 0 0 140
## VERMONT 0 0 70
## VIRGINIA 0 532 0
## WASHINGTON 0 0 195
## WEST VIRGINIA 0 0 220
## WISCONSIN 0 0 360
## WYOMING 0 0 92
As you can see, in some States the mode variable takes exclusively the value “TOTAL” (in Alabama for instance). In those States, we know the total number of votes obtained by candidates in each county. In other States, we do not have the total number of votes, but only a number disaggregated by voting method (in Arizona for instance).
All of those situations are fine, since they mean ballots are not counted twice.
However there is one worrying case: Utah. In this State, we observe both the total number of votes, and the number of votes by mail, election day voting, and early voting. By looking more closely at the data, you can see that in Salt Lake County (code 49035), we have the number of votes received early; on election day; and by mail. For all other counties, the ballot count for these voting modes is zero, and we only observe the “total” ballot count. If you wanted to be convinced more formally that this is true, you could run the following script:
elections_2020 %>%
filter(state=="UTAH") %>% #keep only UTAH observations
mutate(mode_vote_total = (mode=="TOTAL" & candidatevotes!=0), #create a dummy if we observe a ballot in the "total" category
mode_vote_other = (mode!="TOTAL" & candidatevotes!=0)) %>% # create a dummy if we observe a ballot in any other voting category
group_by(county_fips, county_name) %>% # group by county
summarise(mode_vote_total = max(mode_vote_total), #create a dummy if for any candidate in the county we observe a ballot in the "total" category
mode_vote_other = max(mode_vote_other)) %>% #create a dummy if for any candidate in the county we observe a ballot in any other voting category
View() #visualise the resulting dataset
#You can see that in no county do we observe a ballot both in "total" and in another category!
So it holds true that we never count a ballot twice: in every county, either we count it in the “total” category, or we count it in one of the categories associated with a specific voting method. This is good news!
1. In States like Arizona where we don’t have the total number of ballots, it is not clear that the voting modes for which we have data cover all voting methods available.
For instance, in Kentucky we only have the number of ballots cast on election day. A quick search online shows that early voting is possible in this State, although it is heavily restricted. It is hence very likely that some ballots are missing in our data.
Now that we noticed this feature of the data, we want to check how big of an issue this is: how many votes are not counted in our data? Can we still reasonably use it, or is its quality too bad?
To answer this question, we will compare the total number of votes by State in our dataset to the official number provided by the Federal Government.
Create a new object called elections_2020_state_totals, with two variables: state_po, and sum_votes_state (the sum of candidatevotes by State). Hint: you should group your data by state_po, and then use summarise. You should also disregard missing values when you compute the sum of candidatevotes: find out the argument of the sum function which lets you do that.
elections_2020_state_totals <- elections_2020 %>%
group_by(state_po) %>%
summarise(sum_votes_state = sum(candidatevotes, na.rm = TRUE))
elections_2020_state_totals
| state_po | sum_votes_state |
|---|---|
| AK | 357569 |
| AL | 2323282 |
| AR | 1219069 |
| AZ | 3385294 |
| CA | 17498617 |
| CO | 3256980 |
| CT | 1823857 |
| DC | 344356 |
| DE | 504010 |
| FL | 11067456 |
| GA | 4998482 |
| HI | 574457 |
| IA | 1690871 |
| ID | 867361 |
| IL | 6033744 |
| IN | 3033121 |
| KS | 1372303 |
| KY | 2134996 |
| LA | 2148062 |
| MA | 3658005 |
| MD | 3037031 |
| ME | 822534 |
| MI | 5539302 |
| MN | 3277171 |
| MO | 3025962 |
| MS | 1313759 |
| MT | 603640 |
| NC | 5524802 |
| ND | 361819 |
| NE | 951712 |
| NH | 803833 |
| NJ | 4549353 |
| NM | 923965 |
| NV | 1404911 |
| NY | 8661735 |
| OH | 5922202 |
| OK | 1560699 |
| OR | 2374321 |
| PA | 6915283 |
| RI | 517757 |
| SC | 2513329 |
| SD | 422609 |
| TN | 3053851 |
| TX | 11315056 |
| UT | 1495354 |
| VA | 4462600 |
| VT | 370826 |
| WA | 4087631 |
| WI | 3297352 |
| WV | 794652 |
| WY | 278503 |
2. The official election data in the US is provided by the Federal Election Commission, in this document. Notice that it is in pdf format. We already extracted it into an Excel file for you.
Download the file here
Import it into a new object named official_state_votes, using the read_excel function from the readxl package.
official_state_votes <- read_excel('state_votes.xlsx')
official_state_votes
| state_short | official_total_votes |
|---|---|
| AL | 2.32e+06 |
| AK | 3.6e+05 |
| AZ | 3.39e+06 |
| AR | 1.22e+06 |
| CA | 1.75e+07 |
| CO | 3.26e+06 |
| CT | 1.82e+06 |
| DE | 5.04e+05 |
| DC | 3.44e+05 |
| FL | 1.11e+07 |
| GA | 5e+06 |
| HI | 5.74e+05 |
| ID | 8.68e+05 |
| IL | 6.03e+06 |
| IN | 3.03e+06 |
| IA | 1.69e+06 |
| KS | 1.37e+06 |
| KY | 2.14e+06 |
| LA | 2.15e+06 |
| ME | 8.19e+05 |
| MD | 3.04e+06 |
| MA | 3.63e+06 |
| MI | 5.54e+06 |
| MN | 3.28e+06 |
| MS | 1.31e+06 |
| MO | 3.03e+06 |
| MT | 6.04e+05 |
| NE | 9.56e+05 |
| NV | 1.41e+06 |
| NH | 8.06e+05 |
| NJ | 4.55e+06 |
| NM | 9.24e+05 |
| NY | 8.59e+06 |
| NC | 5.52e+06 |
| ND | 3.62e+05 |
| OH | 5.92e+06 |
| OK | 1.56e+06 |
| OR | 2.37e+06 |
| PA | 6.92e+06 |
| RI | 5.18e+05 |
| SC | 2.51e+06 |
| SD | 4.23e+05 |
| TN | 3.05e+06 |
| TX | 1.13e+07 |
| UT | 1.49e+06 |
| VT | 3.67e+05 |
| VA | 4.46e+06 |
| WA | 4.09e+06 |
| WV | 7.95e+05 |
| WI | 3.3e+06 |
| WY | 2.77e+05 |
3. Both your elections_2020_state_totals and official_state_votes should have 51 observations, one per State in the US + Washington DC. But now the official number of votes and the one from our dataset are in two different R objects. You could compare the numbers manually, but this would be a lot of work. Instead, we want to merge them into a single object.
We will do this using the left_join() function from the tidyverse package.
This is how left_join() works: left_join(x,y) will return all rows from x (very important!) and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. Rows in y with no match in x will be discarded.
The following cheatsheet might help you visualise the way the function works: https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf. You can also find more details here.
If nothing is specified, left_join will automatically detect the variables which have the same name in both datasets and use them to join the data. If your variables have different names in the datasets, you can specify left_join(x,y, by=c("var1x" = "var1y", "var2x" = "var2y")). This will match variable var1x in x to var1y in y and var2x in x to var2y in y. (You can keep going, and join on more than two variables!)
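To make this concrete, here is a minimal sketch with two made-up tibbles (the names x and y and the numbers are ours, purely for illustration):

x <- tibble(state_po = c("AL", "AK", "AZ"), sum_votes = c(10, 20, 30))
y <- tibble(state_short = c("AL", "AZ"), official = c(11, 29))
left_join(x, y, by = c("state_po" = "state_short"))
# All 3 rows of x are kept; AK has no match in y, so its `official` value is NA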
Create a new object called elections_2020_state_comparison, which is the left_join of elections_2020_state_totals (1st argument) and official_state_votes (2nd argument). Here, our identifying variables have different names, so we need to add the argument by=c("state_po" = "state_short")
elections_2020_state_comparison <- left_join(elections_2020_state_totals, official_state_votes, by=c("state_po" = "state_short"))
elections_2020_state_comparison
| state_po | sum_votes_state | official_total_votes |
|---|---|---|
| AK | 357569 | 3.6e+05 |
| AL | 2323282 | 2.32e+06 |
| AR | 1219069 | 1.22e+06 |
| AZ | 3385294 | 3.39e+06 |
| CA | 17498617 | 1.75e+07 |
| CO | 3256980 | 3.26e+06 |
| CT | 1823857 | 1.82e+06 |
| DC | 344356 | 3.44e+05 |
| DE | 504010 | 5.04e+05 |
| FL | 11067456 | 1.11e+07 |
| GA | 4998482 | 5e+06 |
| HI | 574457 | 5.74e+05 |
| IA | 1690871 | 1.69e+06 |
| ID | 867361 | 8.68e+05 |
| IL | 6033744 | 6.03e+06 |
| IN | 3033121 | 3.03e+06 |
| KS | 1372303 | 1.37e+06 |
| KY | 2134996 | 2.14e+06 |
| LA | 2148062 | 2.15e+06 |
| MA | 3658005 | 3.63e+06 |
| MD | 3037031 | 3.04e+06 |
| ME | 822534 | 8.19e+05 |
| MI | 5539302 | 5.54e+06 |
| MN | 3277171 | 3.28e+06 |
| MO | 3025962 | 3.03e+06 |
| MS | 1313759 | 1.31e+06 |
| MT | 603640 | 6.04e+05 |
| NC | 5524802 | 5.52e+06 |
| ND | 361819 | 3.62e+05 |
| NE | 951712 | 9.56e+05 |
| NH | 803833 | 8.06e+05 |
| NJ | 4549353 | 4.55e+06 |
| NM | 923965 | 9.24e+05 |
| NV | 1404911 | 1.41e+06 |
| NY | 8661735 | 8.59e+06 |
| OH | 5922202 | 5.92e+06 |
| OK | 1560699 | 1.56e+06 |
| OR | 2374321 | 2.37e+06 |
| PA | 6915283 | 6.92e+06 |
| RI | 517757 | 5.18e+05 |
| SC | 2513329 | 2.51e+06 |
| SD | 422609 | 4.23e+05 |
| TN | 3053851 | 3.05e+06 |
| TX | 11315056 | 1.13e+07 |
| UT | 1495354 | 1.49e+06 |
| VA | 4462600 | 4.46e+06 |
| VT | 370826 | 3.67e+05 |
| WA | 4087631 | 4.09e+06 |
| WI | 3297352 | 3.3e+06 |
| WV | 794652 | 7.95e+05 |
| WY | 278503 | 2.77e+05 |
Congratulations, you have merged your first datasets! Easy right? :)
4. Compute the percentage difference between the official number of votes, and the sum of votes in our data by State: (official votes - sum votes)/official votes*100. What are the min, max and mean of this percentage difference?
elections_2020_state_comparison <- elections_2020_state_comparison %>%
mutate("perc_diff" = ((official_total_votes - sum_votes_state)/official_total_votes)*100)
min_perc <- min(elections_2020_state_comparison$perc_diff)
min_perc
## [1] -0.924807
max_perc <- max(elections_2020_state_comparison$perc_diff)
max_perc
## [1] 0.5454343
mean_perc <- mean(elections_2020_state_comparison$perc_diff)
mean_perc
## [1] -0.0437868
The minimum percentage difference is -0.92%, the maximum is 0.55%, and the mean is -0.04%.
If our election data contained the official number of votes per State, the difference would always be zero. Clearly, this is not the case: in some States our data has too few votes, in others it has too many. However, the difference is always very very small and should not dramatically change the final results obtained by each candidate.
Still, the discrepancy is not easy to explain: the MIT lab collects official results from all the States and puts them together into a national file. It is not clear where the difference is coming from: maybe there was human error in data collection, maybe the results were updated in some States after a recount…
Since the United States government itself does not provide official county-level results for the entire country, the only alternative to using the MIT data would be to start from State level official results. However, this would be a lot of work as we would have to find all the files, deal with format issues (many data files are still provided in pdf, for instance in Kentucky) and clean them all. It would be time consuming, and we would likely make more errors than the MIT lab!
1. Now that we have decided that our data was good enough, we still need to tidy it before we can use it. Make sure that the candidate name is always written with the same format by tabulating the candidate variable.
candidate_check <- table(elections_2020$candidate)
candidate_check
##
## DONALD J TRUMP JO JORGENSEN JOSEPH R BIDEN JR OTHER
## 5117 4955 5117 6904
2*. Remember that we noticed earlier there were missing values in the candidatevotes and totalvotes variables.
Run the following script to fix the issue (replace eval = FALSE by eval = TRUE below, so that the code is also run when you knit). This chunk of code:
- replaces candidatevotes by zero when it is missing
- replaces totalvotes by the sum of votes in the county when it is missing.

Note that we create a new object, elections_2020_clean.
The ifelse function works in the following way: ifelse(test, value if test==TRUE, value if test==FALSE). For instance in the third line below, for each observation, if candidatevotes is missing then it is replaced by 0, otherwise it is not changed.
elections_2020_clean <- elections_2020 %>%
group_by(county_fips) %>%
mutate(candidatevotes = ifelse(is.na(candidatevotes), 0, candidatevotes),
totalvotes = ifelse(is.na(totalvotes), sum(candidatevotes), totalvotes))
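If ifelse still feels abstract, here is a minimal standalone sketch (the toy vector v is ours, not part of the election data):

v <- c(4, NA, 7)
ifelse(is.na(v), 0, v) # returns 4 0 7: the NA is replaced by zero, other values are unchanged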
3*. Clean the elections_2020_clean object some more by:
Summarising the data so that the unit of observation is a candidate in a county. To do this, you need to sum the number of votes obtained for each candidate in each county, across all modes of voting. Note: in order to keep all the variables, you will need to use the following command group_by(state, state_po, county_name, county_fips, totalvotes, candidate, party). If we only grouped by county_fips, candidate, we would “lose” all the other “extra” variables that give us information about the county (its name, its state, the number of votes, …). If this is not clear to you, simply run the code twice, once grouping by all the variables, and once with group_by(county_fips, candidate), and see what happens!
Creating a variable pct_votes which is equal to the percentage of votes obtained by a candidate in a county: (number of votes obtained by a candidate / total number of votes in county)*100.
Grouping the data by county (county_name and county_fips), and then use a combination of the rank() and desc() commands to compute the rank of each candidate within each county (see the small sketch after this list). You should create a variable called candidate_rank which takes the value 1 for the candidate who received the most votes in the county, 2 for the candidate who received the second highest number of votes, etc.
Once you’ve ensured your code is correct, “ungroup” the data by passing your object into the ungroup() function or by piping it. This function does what it says: it removes grouping. This means that it cancels the effect of your last group_by, and ensures that the next functions you use will not be applied by group.
You can pipe all these commands together if you wish or proceed in several steps.
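Before you start, a tiny sketch of how rank() and desc() combine (toy vote counts, ours):

votes <- c(6170, 118, 4170)
rank(desc(votes)) # returns 1 3 2: the highest vote count gets rank 1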
elections_2020_clean <- elections_2020_clean %>%
group_by(state, state_po, county_name, county_fips, totalvotes, candidate, party) %>%
summarise(all_votes = sum(candidatevotes)) %>%
ungroup()
elections_2020_clean <- elections_2020_clean %>%
mutate('pct_votes' = (all_votes/totalvotes)*100)
elections_2020_clean <- elections_2020_clean %>%
group_by(county_name, county_fips) %>%
mutate('candidate_rank' = rank(desc(all_votes))) %>%
ungroup()
4. Check that your variable pct_votes is actually a percentage (that it is comprised within 0 and 100). This is a safety check to make sure that there is no inconsistency in the data.
lowerbound <- 0
upperbound <- 100
pct_check <- between(elections_2020_clean$pct_votes,lowerbound, upperbound)
pct_check <- FALSE %in% pct_check
pct_check
## [1] FALSE
#Due to pct_check returning FALSE, we know that all percentages lie within the range 0-100. The data is consistent!
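An alternative sanity check (a sketch on the same data) is to look at the actual extremes directly:

elections_2020_clean %>%
summarise(min_pct = min(pct_votes, na.rm = TRUE), # smallest vote share observed
max_pct = max(pct_votes, na.rm = TRUE)) # largest vote share; both should lie in [0, 100]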
Election data are probably better visualized using maps.
In this section we will produce two maps displaying:
- the percentage of votes received by Biden and by Trump in each county
- the candidate who received the highest percentage of votes in each county
To produce the maps you will need to install and load the sf, tmap and rmapshaper packages.
Maps are tricky and very complicated objects. But to boil it down to the essential ingredients, you need a “shapefile” which is basically a file that contains the map onto which data will be plotted.
This file is provided by the US census website here. However it is somewhat heavy, so to make your life easier, we have simplified it for you using the following commands. Do NOT touch nor run this chunk, and do NOT replace eval = FALSE by eval = TRUE. We’ve left it so you can see how we created the simplified shapefile you will be loading in the next question.
county_shfl_full <- read_sf("path/cb_2018_us_county_500k/cb_2018_us_county_500k.shp")
# where `path` corresponds to where the `cb_2018_us_county_500k` folder is located.
county_shfl <- st_simplify(county_shfl_full, preserveTopology = FALSE, dTolerance = 1000) #simplify the map to make it less memory-heavy
save(county_shfl, file="path/county_shfl.RData") #saving the object
You can download the resulting R object from this link: https://www.dropbox.com/s/m8zq5e38imp2he1/county_shfl.RData?dl=1
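If you would rather fetch the file from R than through your browser, something like this should work (assuming the Dropbox link stays valid; mode = "wb" keeps the binary .RData file intact on Windows):

download.file("https://www.dropbox.com/s/m8zq5e38imp2he1/county_shfl.RData?dl=1",
destfile = "county_shfl.RData", mode = "wb")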
1. Let’s load the county_shfl.RData file by using the load() function. Replace eval = FALSE by eval = TRUE.
load("county_shfl.RData")
# if the file is not in your working directory, replace the file name with the full path to where county_shfl.RData is located on your laptop.
You should now have an object called county_shfl in your environment.
2. FIPS codes are numbers which uniquely identify geographic areas in the US. Working with FIPS codes is always advisable when identifying areas, as names can be duplicated: there is an Adair county in Kentucky and another one in Iowa.
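You can check this duplication yourself with a quick sketch (assuming county names are stored in upper case in our cleaned data, as they are in the raw file):

elections_2020_clean %>%
filter(county_name == "ADAIR") %>%
distinct(state, county_fips) # same county name, several States, different FIPS codes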
What is the county FIPS code of the “MCDUFFIE” county in the elections_2020_clean data.frame? What are the State FIPS (STATEFP), county FIPS (COUNTYFP) and GEOID of the same county in the county_shfl data? Hint: in county_shfl the county’s name is written “McDuffie” and the variable containing county names is NAME.
mcduffie_fips_elec <- elections_2020_clean %>%
filter(county_name =="MCDUFFIE")
mcduffie_fips_elec
| state | state_po | county_name | county_fips | totalvotes | candidate | party | all_votes | pct_votes | candidate_rank |
|---|---|---|---|---|---|---|---|---|---|
| GEORGIA | GA | MCDUFFIE | 13189 | 1.05e+04 | DONALD J TRUMP | REPUBLICAN | 6.17e+03 | 59 | 1 |
| GEORGIA | GA | MCDUFFIE | 13189 | 1.05e+04 | JO JORGENSEN | LIBERTARIAN | 118 | 1.13 | 3 |
| GEORGIA | GA | MCDUFFIE | 13189 | 1.05e+04 | JOSEPH R BIDEN JR | DEMOCRAT | 4.17e+03 | 39.9 | 2 |
mcduffie_fips_county <- county_shfl %>%
filter(NAME == "McDuffie")
mcduffie_fips_county
| STATEFP | COUNTYFP | COUNTYNS | AFFGEOID | GEOID | NAME | LSAD | ALAND | AWATER | geometry |
|---|---|---|---|---|---|---|---|---|---|
| 13 | 189 | 00348794 | 0500000US13189 | 13189 | McDuffie | 06 | 6.67e+08 | 2.31e+07 | list(c(-82.63813, -82.642723, -82.565563, -82.549373, -82.478057, -82.412063, -82.350299, -82.295688, -82.400473, -82.393244, -82.436814, -82.414914, -82.442846, -82.430915, -82.462662, -82.51171, -82.53992, -82.573511, -82.583603, -82.588003, -82.606452, -82.617599, -82.607769, -82.623897, -82.62787, -82.63813, 33.615055, 33.594259, 33.428891, 33.357748, 33.339022, 33.324309, 33.314801, 33.356766, 33.506422, 33.533083, 33.549658, 33.586413, 33.631561, 33.655076, 33.639874, 33.659172, 33.640058, 33.639505, 33.650477, 33.642008, 33.635061, 33.632429, 33.625673, 33.621661, 33.611516, 33.615055)) |
The county FIPS code of “MCDUFFIE” in the elections_2020_clean data frame is 13189. The State FIPS for the same county is 13, the county FIPS is 189, and the GEOID is 13189.
3. You might have noticed that the county_fips variable in our elections_2020_clean is actually made of the State FIPS code, followed by the county code.
In county_shfl, we have a variable for the State FIPS code and a separate one for the county code. The variable containing the same code as in elections_2020_clean is actually called GEOID. This type of thing is exactly why you always need to look inside your data: it would not have been obvious from the variable names.
What is the type of county_fips in the elections_2020_clean data? What is the type of GEOID in county_shfl? Hint: use the typeof function.
type_countyfips <- typeof(elections_2020_clean$county_fips)
type_countyfips
## [1] "integer"
type_GEOID <- typeof(county_shfl$GEOID)
type_GEOID
## [1] "character"
The type of county_fips in elections_2020_clean is an integer, while the type of GEOID in county_shfl is character.
It is not the same! In order to merge the two datasets, we want county_fips to have the same type as in GEOID and be a character string instead.
Furthermore, because it is stored as a numeric value there is no leading zero (the FIPS code “01001” is stored as 1001).
The following script fixes this, and creates a new variable called county_fips_clean. Run the code and change eval = FALSE to eval = TRUE to make sure the code runs when you knit your document.
We start by using the as.character function to convert the variable to a character string. Then, we use the ifelse function again. What this conditional statement says is: if the number of characters of the variable county_fips_clean is equal to 4, then add a “0” in front of this variable; otherwise do nothing.
elections_2020_clean <- elections_2020_clean %>%
mutate(county_fips_clean = as.character(county_fips), #convert county_fips to a character string
county_fips_clean = ifelse(nchar(county_fips_clean) == 4,
paste0("0", county_fips_clean),
county_fips_clean)) #Add a leading zero when necessary: if there are only four characters to the FIPS code, it means that the leading 0 is missing
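As a side note, stringr (loaded with the tidyverse) offers an equivalent one-liner; this sketch pads every code on the left with zeros up to five characters:

elections_2020_clean %>%
mutate(county_fips_clean = str_pad(as.character(county_fips), width = 5, pad = "0"))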
4. How many different State FIPS codes are there in the county_shfl dataset? Hint: You can use a combination of the length and unique commands to count them.
unique_state_fips <-
length(unique(county_shfl$STATEFP))
unique_state_fips
## [1] 56
There are 56 unique State FIPS codes in the dataset.
5. Our shapefile clearly contains the map for more than just the 50 US States + DC. By looking at the existing State codes here, you can see that codes above 56 correspond to overseas territories.
The final map we want to produce should only contain the 50 States + DC: we want to remove overseas territories. We also exclude Alaska and Hawaii for purely aesthetic reasons: including them zooms out the map too much. We could fix this problem by showing Alaska and Hawaii in a corner of the map but this is beyond the scope of the midterm.
Filter county_shfl to keep only observations such that:
- STATEFP is not equal to “02” (Alaska) or “15” (Hawaii)
- as.numeric(STATEFP) is lower than or equal to 56 (note that because State FIPS is stored as a character, we have to convert it to a numeric before evaluating whether it is below 56)

Create a new object called county_shfl_clean.
county_shfl_clean <- county_shfl %>%
filter(STATEFP != "02") %>%
filter(STATEFP != "15") %>%
filter((as.numeric(STATEFP)) <= 56)
Furthermore, some counties’ FIPS have changed over the years, and they differ in our two datasets. The following script will fix all these inconsistencies for you. The case_when function is similar to ifelse. Here it modifies the variable county_fips_clean, such that:
- if county_fips_clean was equal to “46113”, it now takes the value “46102”
- if state_po is equal to “DC”, county_fips_clean now takes the value “11001”
- otherwise, county_fips_clean keeps the same value.

Remember to replace eval = FALSE with eval = TRUE.
elections_2020_clean <- elections_2020_clean %>%
mutate(county_fips_clean = case_when(
county_fips_clean == "46113" ~ "46102",
state_po == "DC" ~ "11001",
TRUE ~ county_fips_clean))
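If case_when is new to you, here is a minimal standalone sketch (the toy vector x is ours): conditions are evaluated in order, the first match wins, and TRUE acts as the catch-all.

x <- c("46113", "11001", "01001")
case_when(
x == "46113" ~ "46102", # the first condition that matches wins
TRUE ~ x) # otherwise keep the original value
# returns "46102" "11001" "01001"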
6. Use left_join to merge county_shfl_clean (first argument) and elections_2020_clean (second argument). Create a new object called elections_2020_map for this question.
Here, our identifying variables have different names: GEOID in county_shfl_clean (first argument) and county_fips_clean in elections_2020_clean (second argument). Look at the help file or go back to question 3 of the “Checking quality” section to see how to specify the by argument.
elections_2020_map <- left_join(county_shfl_clean, elections_2020_clean, by=c("GEOID" = "county_fips_clean"))
7. Create a heat map of Biden and Trump’s percentage of the vote by county, using the following code (change eval = FALSE to eval = TRUE just below so that the html file displays the map):
tmap_mode("view") # we want an interactive map
elections_2020_map %>%
# only keep Trump and Biden
filter(candidate %in% c("DONALD J TRUMP", "JOSEPH R BIDEN JR")) %>%
tm_shape() +
tm_borders(col="white", lwd = 0.3) + # white and thin (line width) borders
tm_fill(
col = "pct_votes", # variable to be mapped
title = "% of votes", # legend title
id = "county_name", # information to display when mouse hovers over a departement
popup.vars = c("Vote %:" = "pct_votes")) + # variable to display in popup window
tm_facets(by = "candidate") # create one map per selected candidate
Wasn’t that easy (well, sort of ;) ) ! And now you have 2 comparable maps of the vote percentages received by Biden and Trump! Hover over the counties with your mouse. The name of the county will show up. If you click on a county you will get the vote percentage received by the candidate. Isn’t that cool!
8. Create a map displaying the candidate who received the highest percentage of votes in each county. Try to create this map by yourself using the code above (and the comments in it). You don’t need to create any new variables. Don’t hesitate to use Google (or your favorite search engine) to help you out. Don’t forget to change the title of the legend.
Add the following argument in tm_fill() to have a nice color palette for your map: palette = c("red","blue").
tmap_mode("view") # we want an interactive map
elections_2020_map %>%
filter(candidate_rank %in% c("1")) %>%
tm_shape() +
tm_borders(col="white", lwd = 0.3) + # white and thin (line width) borders
tm_fill(
col = "pct_votes", # variable to be mapped
title = "County Winners' Vote%", # legend title
id = "county_name", # information to display when mouse hovers over a departement
palette = c("blue","red"),
popup.vars = c("Winner"="candidate")) + # variable to display in popup window
tmap_mode()
A covariate is simply the name we give to variables containing characteristics of our unit of interest (here, characteristic of counties). A covariate can be of direct interest if it is an independent variable in the model. It can also be an extra characteristic we add to the regression to avoid the omitted variable bias which we will discuss later in the course.
Download county-level social data from the American Community Survey (ACS) here: https://data.census.gov/cedsci/table?q=DP02&g=0100000US%240500000&tid=ACSDP5Y2019.DP02
Click “DOWNLOAD TABLE” and select the year 2019 for “ACS 5-Year Estimates Data Profiles” and download. Unzip the folder once downloaded.
Repeat for ACS economic data: https://data.census.gov/cedsci/table?q=DP03&g=0100000US%240500000&tid=ACSDP5Y2019.DP03
1*. Import these csv files (the ones with data_with_overlays in their name) into objects called acs_soc_2019 and acs_eco_2019 respectively.
Note: the automatic name of the csv file when you download it includes the date and time, so the name of your file will be different to mine! If you work in groups and want the code to run on all your laptops, make sure that your files have the same name.
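One way to make the import robust to those timestamped names is to look the file up by pattern (a sketch; the regular expression assumes the default Census download names):

soc_file <- list.files(pattern = "DP02.*data_with_overlays.*\\.csv$")
acs_soc_2019 <- read.csv(soc_file[1]) # read the first matching file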
acs_soc_2019 <- read.csv(file = 'ACSDP5Y2019.DP02_data_with_overlays_2022-03-06T093954.csv')
acs_eco_2019 <- read.csv(file = "ACSDP5Y2019.DP03_data_with_overlays_2021-11-05T171612.csv")
2. How many variables are there in each dataset?
acs_soc_col <- ncol(acs_soc_2019)
acs_soc_col
## [1] 614
acs_eco_col <- ncol(acs_eco_2019)
acs_eco_col
## [1] 550
There are 614 variables in the acs_soc_2019 data set. There are 550 variables in the acs_eco_2019.
3*. That’s a lot! We need to keep only the variables we are interested in, and tidy the data a little. For each acs data frame:
- Rename the following variables (old_name -> new_name):
  - acs_soc_2019: DP02_0062PE -> pct_high_school
  - acs_eco_2019: DP03_0063E -> mean_income
  - acs_eco_2019: DP03_0030PE -> pct_const_maintenance_occ

  The new name tells you what the variable actually contains. pct_high_school is the share of people in the county who only have a high school degree (no further education). pct_const_maintenance_occ is the share of civilians aged 16 and above who work in the “Natural resources, construction and maintenance” occupation.
- Keep only the renamed variables, as well as NAME and GEO_ID

For this task, create two new objects: acs_soc_2019_clean and acs_eco_2019_clean.
acs_soc_2019_clean <- acs_soc_2019 %>%
rename("pct_high_school" = "DP02_0062PE")
acs_eco_2019_clean <- acs_eco_2019 %>%
rename("mean_income" = "DP03_0063E") %>%
rename("pct_const_maintenance_occ" = "DP03_0030PE")
acs_soc_2019_clean <- acs_soc_2019_clean %>%
select(pct_high_school, NAME, GEO_ID)
acs_eco_2019_clean <- acs_eco_2019_clean %>%
select(mean_income, pct_const_maintenance_occ, NAME, GEO_ID)
4. Display the first five lines of acs_soc_2019_clean and acs_eco_2019_clean
head(acs_eco_2019_clean, 5)
| mean_income | pct_const_maintenance_occ | NAME | GEO_ID |
|---|---|---|---|
| Estimate!!INCOME AND BENEFITS (IN 2019 INFLATION-ADJUSTED DOLLARS)!!Total households!!Mean household income (dollars) | Percent!!OCCUPATION!!Civilian employed population 16 years and over!!Natural resources, construction, and maintenance occupations | Geographic Area Name | id |
| 75326 | 7.8 | Autauga County, Alabama | 0500000US01001 |
| 80986 | 9.3 | Baldwin County, Alabama | 0500000US01003 |
| 47068 | 14.1 | Barbour County, Alabama | 0500000US01005 |
| 60182 | 16.8 | Bibb County, Alabama | 0500000US01007 |
head(acs_soc_2019_clean, 5)
| pct_high_school | NAME | GEO_ID |
|---|---|---|
| Percent!!EDUCATIONAL ATTAINMENT!!Population 25 years and over!!High school graduate (includes equivalency) | Geographic Area Name | id |
| 33.6 | Autauga County, Alabama | 0500000US01001 |
| 27.7 | Baldwin County, Alabama | 0500000US01003 |
| 35.6 | Barbour County, Alabama | 0500000US01005 |
| 44.9 | Bibb County, Alabama | 0500000US01007 |
5. What is in the first line of the datasets?
The first line of each dataset does not contain data: it shows the full descriptive label of each variable. For acs_eco_2019_clean, the labels describe the mean household income estimate (in 2019 inflation-adjusted dollars), the percentage employed in natural resources, construction, and maintenance occupations, the geographic area name, and the geo ID. For acs_soc_2019_clean, they describe the percentage of the population 25 years and over who are high school graduates, the geographic area name, and the geo ID.
6*. Drop the first row of acs_soc_2019_clean and acs_eco_2019_clean. Hint: the negative sign should help you.
acs_soc_2019_clean <- acs_soc_2019_clean[-1,]
acs_eco_2019_clean <- acs_eco_2019_clean[-1,]
7*. Merge the acs_soc_2019_clean and acs_eco_2019_clean together using left_join. You don’t need to specify the by argument here, since the common variables GEO_ID and NAME have the same name in both datasets.
Create a new object called acs_2019_clean.
acs_2019_clean <- left_join(acs_soc_2019_clean, acs_eco_2019_clean)
8. What is the type of the pct_const_maintenance_occ, mean_income and pct_high_school variables?
type_pct_main <- typeof(acs_2019_clean$pct_const_maintenance_occ)
type_pct_main
## [1] "character"
type_mean_income <- typeof(acs_2019_clean$mean_income)
type_mean_income
## [1] "character"
type_pct_hs <- typeof(acs_2019_clean$pct_high_school)
type_pct_hs
## [1] "character"
All three variables are of type “character”.
9*. In the acs_2019_clean dataframe, convert the pct_high_school, pct_const_maintenance_occ and mean_income variables to numeric, using the as.numeric() function. This function simply converts character strings to numbers: for example as.numeric("17.9") = 17.9. Hint: to convert the variable, simply use the dplyr verb to create a new variable and give it the same name as the variable you are modifying.
Ignore the warning about NAs being introduced. They are due to observations for which pct_high_school was already missing (it was equal to “null”)
acs_2019_clean <- acs_2019_clean %>%
mutate(pct_high_school = as.numeric(pct_high_school)) %>%
mutate(pct_const_maintenance_occ = as.numeric(pct_const_maintenance_occ)) %>%
mutate(mean_income = as.numeric(mean_income))
10. What is the GEO_ID of the county whose NAME is “McDuffie County, Georgia” in the acs_2019_clean dataframe?
McDuffie <- acs_2019_clean %>%
filter(NAME == "McDuffie County, Georgia")
geo_id_McDuffie <- McDuffie$GEO_ID
geo_id_McDuffie
## [1] "0500000US13189"
11*. The geographical ID of the county is different to what we had last time! You might have noticed that this time, the 5-digit FIPS code is preceded by a prefix: “0500000US”. We want to extract the FIPS code from this geographic identifier.
Create a new variable, called county_fips_clean which contains only the 5 last characters of the GEO_ID variable. Use the substr() function. This function extracts a substring from a character string. You need to give it the original string, and the position of the first and last character of the substring. Example: substr("abcde", 2, 4) = "bcd", i.e. you get letters 2 to 4. Here we want to extract characters 10 to 14 of the GEO_ID variable. As with as.numeric you have to create the variable using the relevant dplyr verb.
typeof(acs_2019_clean$GEO_ID)
## [1] "character"
# GEO_ID is already a character string, so we can apply substr directly
acs_2019_clean <- acs_2019_clean %>%
mutate(county_fips_clean = substr(GEO_ID,10,14))
head(acs_2019_clean$county_fips_clean) # display the first few cleaned FIPS codes
## [1] "01001" "01003" "01005" "01007" "01009" "01011"
## [1729] "31149" "31151" "31153" "31155" "31157" "31159" "31161" "31163" "31165"
## [1738] "31167" "31169" "31171" "31173" "31175" "31177" "31179" "31181" "31183"
## [1747] "31185" "32001" "32003" "32005" "32007" "32009" "32011" "32013" "32015"
## [1756] "32017" "32019" "32021" "32023" "32027" "32029" "32031" "32033" "32510"
## [1765] "33001" "33003" "33005" "33007" "33009" "33011" "33013" "33015" "33017"
## [1774] "33019" "34001" "34003" "34005" "34007" "34009" "34011" "34013" "34015"
## [1783] "34017" "34019" "34021" "34023" "34025" "34027" "34029" "34031" "34033"
## [1792] "34035" "34037" "34039" "34041" "35001" "35003" "35005" "35006" "35007"
## [1801] "35009" "35011" "35013" "35015" "35017" "35019" "35021" "35023" "35025"
## [1810] "35027" "35028" "35029" "35031" "35033" "35035" "35037" "35039" "35041"
## [1819] "35043" "35045" "35047" "35049" "35051" "35053" "35055" "35057" "35059"
## [1828] "35061" "36001" "36003" "36005" "36007" "36009" "36011" "36013" "36015"
## [1837] "36017" "36019" "36021" "36023" "36025" "36027" "36029" "36031" "36033"
## [1846] "36035" "36037" "36039" "36041" "36043" "36045" "36047" "36049" "36051"
## [1855] "36053" "36055" "36057" "36059" "36061" "36063" "36065" "36067" "36069"
## [1864] "36071" "36073" "36075" "36077" "36079" "36081" "36083" "36085" "36087"
## [1873] "36089" "36091" "36093" "36095" "36097" "36099" "36101" "36103" "36105"
## [1882] "36107" "36109" "36111" "36113" "36115" "36117" "36119" "36121" "36123"
## [1891] "37001" "37003" "37005" "37007" "37009" "37011" "37013" "37015" "37017"
## [1900] "37019" "37021" "37023" "37025" "37027" "37029" "37031" "37033" "37035"
## [1909] "37037" "37039" "37041" "37043" "37045" "37047" "37049" "37051" "37053"
## [1918] "37055" "37057" "37059" "37061" "37063" "37065" "37067" "37069" "37071"
## [1927] "37073" "37075" "37077" "37079" "37081" "37083" "37085" "37087" "37089"
## [1936] "37091" "37093" "37095" "37097" "37099" "37101" "37103" "37105" "37107"
## [1945] "37109" "37111" "37113" "37115" "37117" "37119" "37121" "37123" "37125"
## [1954] "37127" "37129" "37131" "37133" "37135" "37137" "37139" "37141" "37143"
## [1963] "37145" "37147" "37149" "37151" "37153" "37155" "37157" "37159" "37161"
## [1972] "37163" "37165" "37167" "37169" "37171" "37173" "37175" "37177" "37179"
## [1981] "37181" "37183" "37185" "37187" "37189" "37191" "37193" "37195" "37197"
## [1990] "37199" "38001" "38003" "38005" "38007" "38009" "38011" "38013" "38015"
## [1999] "38017" "38019" "38021" "38023" "38025" "38027" "38029" "38031" "38033"
## [2008] "38035" "38037" "38039" "38041" "38043" "38045" "38047" "38049" "38051"
## [2017] "38053" "38055" "38057" "38059" "38061" "38063" "38065" "38067" "38069"
## [2026] "38071" "38073" "38075" "38077" "38079" "38081" "38083" "38085" "38087"
## [2035] "38089" "38091" "38093" "38095" "38097" "38099" "38101" "38103" "38105"
## [2044] "39001" "39003" "39005" "39007" "39009" "39011" "39013" "39015" "39017"
## [2053] "39019" "39021" "39023" "39025" "39027" "39029" "39031" "39033" "39035"
## [2062] "39037" "39039" "39041" "39043" "39045" "39047" "39049" "39051" "39053"
## [2071] "39055" "39057" "39059" "39061" "39063" "39065" "39067" "39069" "39071"
## [2080] "39073" "39075" "39077" "39079" "39081" "39083" "39085" "39087" "39089"
## [2089] "39091" "39093" "39095" "39097" "39099" "39101" "39103" "39105" "39107"
## [2098] "39109" "39111" "39113" "39115" "39117" "39119" "39121" "39123" "39125"
## [2107] "39127" "39129" "39131" "39133" "39135" "39137" "39139" "39141" "39143"
## [2116] "39145" "39147" "39149" "39151" "39153" "39155" "39157" "39159" "39161"
## [2125] "39163" "39165" "39167" "39169" "39171" "39173" "39175" "40001" "40003"
## [2134] "40005" "40007" "40009" "40011" "40013" "40015" "40017" "40019" "40021"
## [2143] "40023" "40025" "40027" "40029" "40031" "40033" "40035" "40037" "40039"
## [2152] "40041" "40043" "40045" "40047" "40049" "40051" "40053" "40055" "40057"
## [2161] "40059" "40061" "40063" "40065" "40067" "40069" "40071" "40073" "40075"
## [2170] "40077" "40079" "40081" "40083" "40085" "40087" "40089" "40091" "40093"
## [2179] "40095" "40097" "40099" "40101" "40103" "40105" "40107" "40109" "40111"
## [2188] "40113" "40115" "40117" "40119" "40121" "40123" "40125" "40127" "40129"
## [2197] "40131" "40133" "40135" "40137" "40139" "40141" "40143" "40145" "40147"
## [2206] "40149" "40151" "40153" "41001" "41003" "41005" "41007" "41009" "41011"
## [2215] "41013" "41015" "41017" "41019" "41021" "41023" "41025" "41027" "41029"
## [2224] "41031" "41033" "41035" "41037" "41039" "41041" "41043" "41045" "41047"
## [2233] "41049" "41051" "41053" "41055" "41057" "41059" "41061" "41063" "41065"
## [2242] "41067" "41069" "41071" "42001" "42003" "42005" "42007" "42009" "42011"
## [2251] "42013" "42015" "42017" "42019" "42021" "42023" "42025" "42027" "42029"
## [2260] "42031" "42033" "42035" "42037" "42039" "42041" "42043" "42045" "42047"
## [2269] "42049" "42051" "42053" "42055" "42057" "42059" "42061" "42063" "42065"
## [2278] "42067" "42069" "42071" "42073" "42075" "42077" "42079" "42081" "42083"
## [2287] "42085" "42087" "42089" "42091" "42093" "42095" "42097" "42099" "42101"
## [2296] "42103" "42105" "42107" "42109" "42111" "42113" "42115" "42117" "42119"
## [2305] "42121" "42123" "42125" "42127" "42129" "42131" "42133" "44001" "44003"
## [2314] "44005" "44007" "44009" "45001" "45003" "45005" "45007" "45009" "45011"
## [2323] "45013" "45015" "45017" "45019" "45021" "45023" "45025" "45027" "45029"
## [2332] "45031" "45033" "45035" "45037" "45039" "45041" "45043" "45045" "45047"
## [2341] "45049" "45051" "45053" "45055" "45057" "45059" "45061" "45063" "45065"
## [2350] "45067" "45069" "45071" "45073" "45075" "45077" "45079" "45081" "45083"
## [2359] "45085" "45087" "45089" "45091" "46003" "46005" "46007" "46009" "46011"
## [2368] "46013" "46015" "46017" "46019" "46021" "46023" "46025" "46027" "46029"
## [2377] "46031" "46033" "46035" "46037" "46039" "46041" "46043" "46045" "46047"
## [2386] "46049" "46051" "46053" "46055" "46057" "46059" "46061" "46063" "46065"
## [2395] "46067" "46069" "46071" "46073" "46075" "46077" "46079" "46081" "46083"
## [2404] "46085" "46087" "46089" "46091" "46093" "46095" "46097" "46099" "46101"
## [2413] "46102" "46103" "46105" "46107" "46109" "46111" "46115" "46117" "46119"
## [2422] "46121" "46123" "46125" "46127" "46129" "46135" "46137" "47001" "47003"
## [2431] "47005" "47007" "47009" "47011" "47013" "47015" "47017" "47019" "47021"
## [2440] "47023" "47025" "47027" "47029" "47031" "47033" "47035" "47037" "47039"
## [2449] "47041" "47043" "47045" "47047" "47049" "47051" "47053" "47055" "47057"
## [2458] "47059" "47061" "47063" "47065" "47067" "47069" "47071" "47073" "47075"
## [2467] "47077" "47079" "47081" "47083" "47085" "47087" "47089" "47091" "47093"
## [2476] "47095" "47097" "47099" "47101" "47103" "47105" "47107" "47109" "47111"
## [2485] "47113" "47115" "47117" "47119" "47121" "47123" "47125" "47127" "47129"
## [2494] "47131" "47133" "47135" "47137" "47139" "47141" "47143" "47145" "47147"
## [2503] "47149" "47151" "47153" "47155" "47157" "47159" "47161" "47163" "47165"
## [2512] "47167" "47169" "47171" "47173" "47175" "47177" "47179" "47181" "47183"
## [2521] "47185" "47187" "47189" "48001" "48003" "48005" "48007" "48009" "48011"
## [2530] "48013" "48015" "48017" "48019" "48021" "48023" "48025" "48027" "48029"
## [2539] "48031" "48033" "48035" "48037" "48039" "48041" "48043" "48045" "48047"
## [2548] "48049" "48051" "48053" "48055" "48057" "48059" "48061" "48063" "48065"
## [2557] "48067" "48069" "48071" "48073" "48075" "48077" "48079" "48081" "48083"
## [2566] "48085" "48087" "48089" "48091" "48093" "48095" "48097" "48099" "48101"
## [2575] "48103" "48105" "48107" "48109" "48111" "48113" "48115" "48117" "48119"
## [2584] "48121" "48123" "48125" "48127" "48129" "48131" "48133" "48135" "48137"
## [2593] "48139" "48141" "48143" "48145" "48147" "48149" "48151" "48153" "48155"
## [2602] "48157" "48159" "48161" "48163" "48165" "48167" "48169" "48171" "48173"
## [2611] "48175" "48177" "48179" "48181" "48183" "48185" "48187" "48189" "48191"
## [2620] "48193" "48195" "48197" "48199" "48201" "48203" "48205" "48207" "48209"
## [2629] "48211" "48213" "48215" "48217" "48219" "48221" "48223" "48225" "48227"
## [2638] "48229" "48231" "48233" "48235" "48237" "48239" "48241" "48243" "48245"
## [2647] "48247" "48249" "48251" "48253" "48255" "48257" "48259" "48261" "48263"
## [2656] "48265" "48267" "48269" "48271" "48273" "48275" "48277" "48279" "48281"
## [2665] "48283" "48285" "48287" "48289" "48291" "48293" "48295" "48297" "48299"
## [2674] "48301" "48303" "48305" "48307" "48309" "48311" "48313" "48315" "48317"
## [2683] "48319" "48321" "48323" "48325" "48327" "48329" "48331" "48333" "48335"
## [2692] "48337" "48339" "48341" "48343" "48345" "48347" "48349" "48351" "48353"
## [2701] "48355" "48357" "48359" "48361" "48363" "48365" "48367" "48369" "48371"
## [2710] "48373" "48375" "48377" "48379" "48381" "48383" "48385" "48387" "48389"
## [2719] "48391" "48393" "48395" "48397" "48399" "48401" "48403" "48405" "48407"
## [2728] "48409" "48411" "48413" "48415" "48417" "48419" "48421" "48423" "48425"
## [2737] "48427" "48429" "48431" "48433" "48435" "48437" "48439" "48441" "48443"
## [2746] "48445" "48447" "48449" "48451" "48453" "48455" "48457" "48459" "48461"
## [2755] "48463" "48465" "48467" "48469" "48471" "48473" "48475" "48477" "48479"
## [2764] "48481" "48483" "48485" "48487" "48489" "48491" "48493" "48495" "48497"
## [2773] "48499" "48501" "48503" "48505" "48507" "49001" "49003" "49005" "49007"
## [2782] "49009" "49011" "49013" "49015" "49017" "49019" "49021" "49023" "49025"
## [2791] "49027" "49029" "49031" "49033" "49035" "49037" "49039" "49041" "49043"
## [2800] "49045" "49047" "49049" "49051" "49053" "49055" "49057" "50001" "50003"
## [2809] "50005" "50007" "50009" "50011" "50013" "50015" "50017" "50019" "50021"
## [2818] "50023" "50025" "50027" "51001" "51003" "51005" "51007" "51009" "51011"
## [2827] "51013" "51015" "51017" "51019" "51021" "51023" "51025" "51027" "51029"
## [2836] "51031" "51033" "51035" "51036" "51037" "51041" "51043" "51045" "51047"
## [2845] "51049" "51051" "51053" "51057" "51059" "51061" "51063" "51065" "51067"
## [2854] "51069" "51071" "51073" "51075" "51077" "51079" "51081" "51083" "51085"
## [2863] "51087" "51089" "51091" "51093" "51095" "51097" "51099" "51101" "51103"
## [2872] "51105" "51107" "51109" "51111" "51113" "51115" "51117" "51119" "51121"
## [2881] "51125" "51127" "51131" "51133" "51135" "51137" "51139" "51141" "51143"
## [2890] "51145" "51147" "51149" "51153" "51155" "51157" "51159" "51161" "51163"
## [2899] "51165" "51167" "51169" "51171" "51173" "51175" "51177" "51179" "51181"
## [2908] "51183" "51185" "51187" "51191" "51193" "51195" "51197" "51199" "51510"
## [2917] "51520" "51530" "51540" "51550" "51570" "51580" "51590" "51595" "51600"
## [2926] "51610" "51620" "51630" "51640" "51650" "51660" "51670" "51678" "51680"
## [2935] "51683" "51685" "51690" "51700" "51710" "51720" "51730" "51735" "51740"
## [2944] "51750" "51760" "51770" "51775" "51790" "51800" "51810" "51820" "51830"
## [2953] "51840" "53001" "53003" "53005" "53007" "53009" "53011" "53013" "53015"
## [2962] "53017" "53019" "53021" "53023" "53025" "53027" "53029" "53031" "53033"
## [2971] "53035" "53037" "53039" "53041" "53043" "53045" "53047" "53049" "53051"
## [2980] "53053" "53055" "53057" "53059" "53061" "53063" "53065" "53067" "53069"
## [2989] "53071" "53073" "53075" "53077" "54001" "54003" "54005" "54007" "54009"
## [2998] "54011" "54013" "54015" "54017" "54019" "54021" "54023" "54025" "54027"
## [3007] "54029" "54031" "54033" "54035" "54037" "54039" "54041" "54043" "54045"
## [3016] "54047" "54049" "54051" "54053" "54055" "54057" "54059" "54061" "54063"
## [3025] "54065" "54067" "54069" "54071" "54073" "54075" "54077" "54079" "54081"
## [3034] "54083" "54085" "54087" "54089" "54091" "54093" "54095" "54097" "54099"
## [3043] "54101" "54103" "54105" "54107" "54109" "55001" "55003" "55005" "55007"
## [3052] "55009" "55011" "55013" "55015" "55017" "55019" "55021" "55023" "55025"
## [3061] "55027" "55029" "55031" "55033" "55035" "55037" "55039" "55041" "55043"
## [3070] "55045" "55047" "55049" "55051" "55053" "55055" "55057" "55059" "55061"
## [3079] "55063" "55065" "55067" "55069" "55071" "55073" "55075" "55077" "55078"
## [3088] "55079" "55081" "55083" "55085" "55087" "55089" "55091" "55093" "55095"
## [3097] "55097" "55099" "55101" "55103" "55105" "55107" "55109" "55111" "55113"
## [3106] "55115" "55117" "55119" "55121" "55123" "55125" "55127" "55129" "55131"
## [3115] "55133" "55135" "55137" "55139" "55141" "56001" "56003" "56005" "56007"
## [3124] "56009" "56011" "56013" "56015" "56017" "56019" "56021" "56023" "56025"
## [3133] "56027" "56029" "56031" "56033" "56035" "56037" "56039" "56041" "56043"
## [3142] "56045" "72001" "72003" "72005" "72007" "72009" "72011" "72013" "72015"
## [3151] "72017" "72019" "72021" "72023" "72025" "72027" "72029" "72031" "72033"
## [3160] "72035" "72037" "72039" "72041" "72043" "72045" "72047" "72049" "72051"
## [3169] "72053" "72054" "72055" "72057" "72059" "72061" "72063" "72065" "72067"
## [3178] "72069" "72071" "72073" "72075" "72077" "72079" "72081" "72083" "72085"
## [3187] "72087" "72089" "72091" "72093" "72095" "72097" "72099" "72101" "72103"
## [3196] "72105" "72107" "72109" "72111" "72113" "72115" "72117" "72119" "72121"
## [3205] "72123" "72125" "72127" "72129" "72131" "72133" "72135" "72137" "72139"
## [3214] "72141" "72143" "72145" "72147" "72149" "72151" "72153"
12*. Now that our covariates data is tidy, we can finally merge it with the elections data. Merge elections_2020_clean and acs_2019_clean using left_join (note: elections_2020_clean needs to be the first argument). Create a new object called elections_2020_covariates.
elections_2020_covariates <- left_join(elections_2020_clean, acs_2019_clean)
# left_join() matches on all columns shared by the two datasets. If the key
# columns had different names, we could specify the match explicitly, e.g.
# by = c("county_name" = "NAME", "county_fips_clean" = "county_fips")
View(elections_2020_covariates)
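It is worth checking that the join neither duplicated nor lost rows. A minimal sketch, assuming the objects above exist in your session (anti_join() is from dplyr and, like left_join(), matches on the shared columns when no by argument is given):
# Sanity-check the merge: the two row counts should be equal, and anti_join()
# lists the counties in the elections data that found no match in the covariates
nrow(elections_2020_clean)
nrow(elections_2020_covariates)
anti_join(elections_2020_clean, acs_2019_clean)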
1*. In the regression analysis, we will only focus on the votes obtained by Donald Trump.
Filter elections_2020_covariates to keep only the observations with the votes for Donald Trump.
Keep only the following variables: state_po, county_name, county_fips_clean, pct_votes, pct_const_maintenance_occ, mean_income, pct_high_school.
Rename pct_votes -> pct_votes_trump.
Create a new object called elections_2020_reg_trump for this task.
elections_2020_reg_trump <- elections_2020_covariates %>%
  # keep only the rows with Donald Trump's votes
  filter(candidate == "DONALD J TRUMP") %>%
  select(state_po, county_name, county_fips_clean, pct_votes,
         pct_const_maintenance_occ, mean_income, pct_high_school) %>%
  rename(pct_votes_trump = pct_votes)
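If the filter unexpectedly returns zero rows, the candidate's name is probably coded differently in the data. A quick way to check the exact spelling (a sketch, assuming the merged object above):
# List the candidate labels present in the 2020 data to confirm the spelling
elections_2020_covariates %>%
  count(candidate)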
Support time: If you are not confident that your answers to the necessary red-starred questions are all correct, you can now download a pre-cleaned dataset here. This file contains the data you should have arrived at after all these questions, which means that even if you made mistakes earlier, you can get a fresh start for the regressions and do them correctly. We will not penalise you for using this data.
If you choose to use this option, you can load the elections_2020_reg_trump.RData file you just downloaded by using the load() function. Replace eval = FALSE by eval = TRUE.
load("path/elections_2020_reg_trump.RData")
# where `path` corresponds to where the elections_2020_reg_trump.RData file is located on your laptop.
2. Use the skim() function from the skimr package to obtain summary statistics for mean_income, pct_high_school and pct_const_maintenance_occ.
skim(elections_2020_reg_trump)
| Name | elections_2020_reg_trump |
| Number of rows | 3155 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state_po | 0 | 1 | 2 | 2 | 0 | 51 | 0 |
| county_name | 0 | 1 | 3 | 21 | 0 | 1865 | 0 |
| county_fips_clean | 1 | 1 | 5 | 5 | 0 | 3154 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| pct_votes_trump | 0 | 1.00 | 64.76 | 16.23 | 5.4 | 55.37 | 68.15 | 77.37 | 96.18 | ▁▂▅▇▅ |
| pct_const_maintenance_occ | 40 | 0.99 | 12.57 | 4.16 | 0.0 | 9.80 | 12.10 | 14.80 | 46.00 | ▂▇▁▁▁ |
| mean_income | 40 | 0.99 | 70083.84 | 17387.93 | 35819.0 | 59194.00 | 66802.00 | 76735.00 | 181261.00 | ▇▇▁▁▁ |
| pct_high_school | 40 | 0.99 | 34.15 | 7.23 | 7.3 | 29.70 | 34.50 | 39.10 | 57.40 | ▁▂▇▅▁ |
We can see that there are some missing values. If you want, you can look at which counties have missing data; otherwise, we'll just ignore them for this midterm.
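If you do want to see which counties are affected, a minimal sketch (assuming, as the skim output suggests, that the three covariates are missing for the same rows):
# Which counties are missing covariate values?
elections_2020_reg_trump %>%
  filter(is.na(mean_income)) %>%
  select(state_po, county_name)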
3. To facilitate plotting we will reshape the data in long format. (The difference between long and wide data is explained in this article.) We are going to use the pivot_longer() function from the tidyverse package to reshape the data. You can find examples of how the pivot_longer() function works by typing ?pivot_longer in your console. You can also find information and examples about pivot_longer and its associated function pivot_wider here
For our purposes we will use the following 4 arguments in pivot_longer():
- data = elections_2020_reg_trump (notice that if you use the %>% pipe you don't need to specify the data argument);
- cols = c("mean_income", "pct_high_school", "pct_const_maintenance_occ"). These are the columns to be reshaped;
- names_to = "covariate". This is the name of a new variable to be created, where we will store the name of the covariate;
- values_to = "value". This is the name of a new variable to be created, where we will store the values of the covariates.
If this is somewhat unclear to you, simply run the code and check how it differs from the previous dataset. This should clear things up.
Create a new object elections_2020_plot_trump for this task. View the created dataframe to ensure it looks as expected.
elections_2020_plot_trump <- elections_2020_reg_trump %>%
  pivot_longer(
    cols = c("mean_income", "pct_high_school", "pct_const_maintenance_occ"),
    names_to = "covariate",
    values_to = "value"
  )
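To convince yourself that no information was lost in the reshape, pivot_wider() reverses it. A minimal sketch using the object just created:
# pivot_wider() is the inverse of pivot_longer(): spreading the covariate/value
# pair back into three columns recovers the original wide layout
elections_2020_plot_trump %>%
  pivot_wider(names_from = covariate, values_from = value)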
4. Produce 3 graphs displaying the relationship between the vote percentage for Donald Trump and our 3 covariates. Each graph should be on a different row. Make sure that your graph has a meaningful y-axis label and that the y-axis is between 0 and 100 (use the scale_y_continuous() layer for this). (Hint: use the facet_wrap layer in ggplot to make these 3 plots in the same graph. Look at the facet_wrap arguments to ensure each graph is on its own row and the scales of each plot change freely.)
plot1 <- elections_2020_plot_trump %>%
  ggplot(aes(x = value, y = pct_votes_trump)) +
  geom_point(size = 0.5, alpha = 1, colour = "#d90502") +
  # one row per covariate, with the x-axis scale free to vary across panels
  facet_wrap(~covariate, scales = "free", nrow = 3) +
  scale_y_continuous(limits = c(0, 100)) +
  labs(title = "Relationship between the Vote % for Donald Trump and Our 3 Covariates",
       subtitle = "Mean income, % of county population with only a high school diploma, and % of county population in Resources/Construction/Maintenance occupations",
       y = "% of Votes for Trump", x = "")
plot1
5. Add a regression line on these plots using the geom_smooth layer to better see the relationship’s sign. Set the argument method to "lm" (for linear model) and se to FALSE (so that it does not display a confidence interval around the regression line; we’ll discuss this towards the end of the course but you can set the argument to TRUE to see what it does).
plot1 <- plot1 +
  geom_smooth(method = "lm",
              se = FALSE)
plot1
6. From the graphs, is the relationship between the different covariates and Donald Trump’s vote share positive or negative? Rather strong or rather weak?
Based on the graphs, the relationship between Donald Trump's vote share and both the share of Resources/Construction/Maintenance employment and the share of high-school-only education is positive, while the relationship with mean income is negative. All three relationships are rather weak, as shown by the large dispersion of points around the regression lines: a trend is visible, but the observations scatter widely around it.
7. Run a simple linear regression of Donald Trump’s vote percentage on each of our covariate variables separately (i.e. 3 simple linear regressions). For this question you should use elections_2020_reg_trump as the data source.
Use the export_summs() function from the jtools package to see the results from the three regressions side by side.
reg_income <- lm(pct_votes_trump ~ mean_income, data = elections_2020_reg_trump)
reg_income
##
## Call:
## lm(formula = pct_votes_trump ~ mean_income, data = elections_2020_reg_trump)
##
## Coefficients:
## (Intercept) mean_income
## 87.528265 -0.000322
reg_pct_occ <- lm(pct_votes_trump ~ pct_const_maintenance_occ, data = elections_2020_reg_trump)
reg_pct_occ
##
## Call:
## lm(formula = pct_votes_trump ~ pct_const_maintenance_occ, data = elections_2020_reg_trump)
##
## Coefficients:
## (Intercept) pct_const_maintenance_occ
## 40.395 1.954
reg_pct_hs <- lm(pct_votes_trump ~ pct_high_school, data = elections_2020_reg_trump)
reg_pct_hs
##
## Call:
## lm(formula = pct_votes_trump ~ pct_high_school, data = elections_2020_reg_trump)
##
## Coefficients:
## (Intercept) pct_high_school
## 27.75 1.09
export_summs(reg_income, reg_pct_occ, reg_pct_hs)
|  | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| (Intercept) | 87.53 *** | 40.39 *** | 27.75 *** |
|  | (1.13) | (0.80) | (1.22) |
| mean_income | -0.00 *** |  |  |
|  | (0.00) |  |  |
| pct_const_maintenance_occ |  | 1.95 *** |  |
|  |  | (0.06) |  |
| pct_high_school |  |  | 1.09 *** |
|  |  |  | (0.03) |
| N | 3115 | 3115 | 3115 |
| R2 | 0.12 | 0.25 | 0.24 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. |  |  |  |
Don’t pay attention to the stars next to the coefficients or to the numbers in parentheses below them; we’ll get to those later in the course.
8. What is the interpretation of the coefficients for mean_income, pct_high_school and pct_const_maintenance_occ? Be precise.
First, the mean_income coefficient is -0.000322. This means that a 1-dollar increase in a county's mean income is associated with a 0.000322 percentage-point decrease in Trump's vote share; the change is tiny because a single dollar is a tiny step on the income scale. The negative sign indicates a negative relationship between income and Trump's vote share: counties with lower mean incomes tended to give Trump a larger share of their votes. Second, the pct_high_school coefficient is 1.09: a 1 percentage-point increase in the share of the county population with only a high school diploma is associated with a 1.09 percentage-point increase in Trump's vote share, suggesting that counties with more high-school-only graduates leaned more towards Trump. Lastly, the coefficient for pct_const_maintenance_occ is 1.95: a 1 percentage-point increase in the share of the county's population working in Resources/Construction/Maintenance occupations is associated with a 1.95 percentage-point increase in Trump's vote share, suggesting that counties with more of these blue-collar workers voted for Trump more heavily.
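To make the mean_income coefficient concrete, it helps to scale it by a more meaningful income difference. A small sketch using the fitted model above:
# A $10,000 difference in county mean income is associated with roughly a
# -0.000322 * 10000 = -3.22 percentage-point change in Trump's vote share
coef(reg_income)["mean_income"] * 10000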
9. Why is the coefficient for mean income so small?
The mean income coefficient is so small because it measures the association with a 1-dollar change in mean income, and one dollar is negligible relative to the spread of county incomes (the standard deviation of mean_income is about $17,400). Over larger income differences the association adds up: across tens of thousands of dollars, the tendency of lower-income counties to vote for Trump becomes visible. The graph supports this: the points are highly dispersed around a gently sloping regression line, showing that income shifts Trump's vote share only gradually.
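A common fix is to rescale the regressor so the coefficient refers to a $1,000 step rather than a $1 step. A minimal sketch (reg_income_k is a name introduced here for illustration):
# Same regression with income measured in thousands of dollars: the slope
# becomes about -0.322 (percentage points per $1,000) instead of -0.000322
reg_income_k <- lm(pct_votes_trump ~ I(mean_income / 1000),
                   data = elections_2020_reg_trump)
reg_income_k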
10. Why does the number of observations not vary for each regression?
The number of observations does not vary because all three regressions use the same dataset, and the 40 rows with missing covariate values are the same rows for each covariate (consistent with those counties failing to match in the merge). lm() drops rows with missing values, so each regression is estimated on the same 3,115 complete observations (3,155 rows minus the 40 with missing covariates).
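You can check that the three covariates are indeed missing for the same counties. A small sketch, assuming the regression dataset above:
# Rows with any missing covariate; if the three variables are always missing
# together, this count equals the 40 rows each covariate is missing on its own
elections_2020_reg_trump %>%
  filter(is.na(mean_income) | is.na(pct_high_school) |
         is.na(pct_const_maintenance_occ)) %>%
  nrow()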
11. Which variable’s variance explains more of the variance in Donald Trump’s vote?
The variable that explains the most variance in Donald Trump's vote share is pct_const_maintenance_occ. This is shown by its R-squared of 0.25, the highest of the three regressions: a quarter of the county-level variance in Trump's vote share is accounted for by variance in this occupation share. (The size of a coefficient is not a good guide here, since the covariates are measured in different units.) Still, an R-squared of 0.25 means that most of the variation remains unexplained.
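R-squared can be pulled directly from each fitted model for a side-by-side comparison; a short sketch using the models estimated above:
# Compare the share of variance explained by each simple regression
summary(reg_income)$r.squared
summary(reg_pct_occ)$r.squared
summary(reg_pct_hs)$r.squared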
12. Can you interpret these estimates causally? Why not? How else could you interpret the regression on pct_high_school? How would you go about trying to assess whether mean income or the share of the Resources/Construction/Maintenance occupation at the county level has a causal impact on the vote percentage for Donald Trump?
These estimates cannot be interpreted causally: a regression only measures an association between two variables, and here many confounding factors (for example region, urbanisation, or age structure) could drive both the covariates and the vote share. The regression on pct_high_school is better read descriptively or predictively: counties with a larger share of high-school-only graduates tended to record higher vote shares for Trump, which is consistent with other evidence that Trump attracted less-educated voters, but the share with only a high school diploma does not capture the full education distribution (those without a diploma and college graduates matter too). To assess whether mean income or the share of Resources/Construction/Maintenance occupations has a causal impact on Trump's vote percentage, we could exploit changes over time: using both the 2016 and 2020 elections, we would relate within-county changes in income or occupational composition to within-county changes in Trump's vote share, which differences out fixed county characteristics.
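As a rough sketch of that idea, a first-difference regression might look as follows. This is purely illustrative: county_changes, d_pct_votes_trump and d_mean_income are hypothetical names for a data frame of within-county 2016-to-2020 changes that you would need to construct first.
# First-difference sketch (hypothetical columns): regressing the change in
# Trump's vote share on the change in mean income differences out fixed
# county characteristics such as region or political history
reg_fd <- lm(d_pct_votes_trump ~ d_mean_income, data = county_changes)
reg_fd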
And that’s it for this midterm! We hope you had fun, or at least some fun, doing it and that you improved your R skills. We also hope you got a sense of what “real” data looks like and how it can be tamed by using the great tools that R provides.