Logistics

  • DEADLINE: The midterm is due before March 6th, 23h59. Penalty for late work is -2 points per hour late.

  • You can do this homework in groups of two at most. Please indicate who you worked with when entering the names (as instructed below.)

  • This homework has a total of 20 points. The number of points for each section is clearly indicated in the section and/or subsection headings.

  • You need to hand in this .Rmd file on Moodle together with the .html file - that is produced when you click on the “Knit” button. If you are having issues knitting your file do get in touch (and knit regularly to make sure everything is working properly). Both files must be handed in the same .zip folder.

  • We left code blocks in this .Rmd file for you to fill out. Just put your code within the marked region, and click the Knit button in RStudio, just above this script, to see the generated HTML output. Knit it now, before you add any code to see how nice the output looks like.

  • When you need to enter a block of code, it will look like the following:

# code goes here
  • Please fill in your answer instead of # code goes here. Do NOT delete the ```{r} and ```; they separate the code chunk from text.

  • Whenever we give you a chunk of code with code already written, you need to replace eval = FALSE by eval = TRUE in the chunk “header”. This will make sure that the chunk is ‘evaluated’ (i.e., that the code runs) when you knit the document.

  • Sometimes, you need to enter text as an answer. In this case, it will say: Answer goes here.

  • You can find a nice intro to Rmarkdown (.Rmd) here.

  • Questions marked with a red star (*) are the ones we will build on progressively to clean the data until it is ready for the regression analysis. Other questions are side quests, which you don’t need to answer to get to the regressions.

  • We have added a link to a clean dataset just before the regression analysis part. If there are necessary questions you have not managed to do, you can start from this clean data instead of using your own.

2020 US Presidential Elections - Analysis of the Results

This midterm project consists in one long analysis of data from the 2020 Presidential elections in the United States.

You will be asked to download the data from its original source, just like you would if you were to undertake this task by yourself. You will use everything we’ve learned so far: importing (un-tidy) data; summarizing, visualizing and tidying it; running regressions for different variables; and thinking about interpreting these associations. In the process, you will also learn 2 useful operations: merging 2 (or more) datasets together, and “reshaping” data from wide to long format. Don’t worry we will work you through how to do all of this.

We hope you find this project interesting and useful. Now let’s get going!

Kate Francke and Tedo Sandroshvili

Loading packages

It’s good practice to load all the packages you will need in the same place.

# Load all the packages you need here
# Don't write the install.packages() code here. Run that code in the console.
library(readxl)
library(readr)
library(tidyverse)
library(janitor)
library(sf)
library(tmap)
library(skimr)
library(jtools)
library(huxtable)
library(rmapshaper)

Loading the Data [0.5 points]

Download the data from November 2020 US presidential elections from here:

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ#

Scroll down, click on the download icon next to the second file (countypres_2000-2020.tab), and select “Comma Separated Values (Original File Format)”.

This is straight from the MIT Election Data + Science lab, which collects and shares election data from the United States. This dataset contains election results aggregated at the county level. Counties are small geographic units in the United States.

1*. Load the file you downloaded into RStudio in an object called elections. You can use the read.csv function.

elections=read.csv("~/Desktop/Sem 4/R/countypres_2000-2020.csv")

Understanding what the dataset contains [1 point]

1. Use tail to display the last 10 rows of the dataset

tail(elections,10)
yearstatestate_pocounty_namecounty_fipsofficecandidatepartycandidatevotestotalvotesversionmode
2020WYOMINGWYUINTA56041US PRESIDENTOTHEROTHER200945920220315TOTAL
2020WYOMINGWYUINTA56041US PRESIDENTDONALD J TRUMPREPUBLICAN7496945920220315TOTAL
2020WYOMINGWYWASHAKIE56043US PRESIDENTJOSEPH R BIDEN JRDEMOCRAT651403220220315TOTAL
2020WYOMINGWYWASHAKIE56043US PRESIDENTJO JORGENSENLIBERTARIAN65403220220315TOTAL
2020WYOMINGWYWASHAKIE56043US PRESIDENTOTHEROTHER71403220220315TOTAL
2020WYOMINGWYWASHAKIE56043US PRESIDENTDONALD J TRUMPREPUBLICAN3245403220220315TOTAL
2020WYOMINGWYWESTON56045US PRESIDENTJOSEPH R BIDEN JRDEMOCRAT360356020220315TOTAL
2020WYOMINGWYWESTON56045US PRESIDENTJO JORGENSENLIBERTARIAN46356020220315TOTAL
2020WYOMINGWYWESTON56045US PRESIDENTOTHEROTHER47356020220315TOTAL
2020WYOMINGWYWESTON56045US PRESIDENTDONALD J TRUMPREPUBLICAN3107356020220315TOTAL

2. What years are included in our data? What’s inside the enigmatic “mode” variable? Tabulate the variables year and mode separately to find out.

table(elections$mode)
## 
##         2ND ABSENTEE             ABSENTEE     ABSENTEE BY MAIL 
##                  120                 1995                 1038 
##      ADVANCED VOTING                EARLY           EARLY VOTE 
##                  477                  453                  450 
##         EARLY VOTING         ELECTION DAY             FAILSAFE 
##                  120                 3737                  230 
## FAILSAFE PROVISIONAL   IN-PERSON ABSENTEE                 MAIL 
##                  230                  230                  145 
##             ONE STOP                 PROV          PROVISIONAL 
##                  500                  477                 1832 
##                TOTAL 
##                60583
table(elections$year)
## 
##  2000  2004  2008  2012  2016  2020 
## 12628  9474  9474  9474  9474 22093

The years 2000, 2004, 2008, 2012, 2016, and 2020 are included in our data. The mode variable describes how people voted.

3. What is the unit of observation? In other words, what does each row correspond to?

Each row corresponds to the mode of voting for each candidate of each county.

4. How many variables are there?

summary(elections)
##       year         state             state_po         county_name       
##  Min.   :2000   Length:72617       Length:72617       Length:72617      
##  1st Qu.:2004   Class :character   Class :character   Class :character  
##  Median :2012   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2011                                                           
##  3rd Qu.:2020                                                           
##  Max.   :2020                                                           
##                                                                         
##   county_fips       office           candidate            party          
##  Min.   : 1001   Length:72617       Length:72617       Length:72617      
##  1st Qu.:18103   Class :character   Class :character   Class :character  
##  Median :29205   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :30264                                                           
##  3rd Qu.:45057                                                           
##  Max.   :56045                                                           
##  NA's   :57                                                              
##  candidatevotes      totalvotes         version             mode          
##  Min.   :      0   Min.   :      0   Min.   :20220315   Length:72617      
##  1st Qu.:    115   1st Qu.:   5175   1st Qu.:20220315   Class :character  
##  Median :   1278   Median :  11194   Median :20220315   Mode  :character  
##  Mean   :  10782   Mean   :  42514   Mean   :20220315                     
##  3rd Qu.:   5848   3rd Qu.:  29855   3rd Qu.:20220315                     
##  Max.   :3028885   Max.   :4264365   Max.   :20220315                     
## 

Twelve variables

5. How many rows are there?

count(elections)
n
72617

72617 rows

Filtering the data and exploring further [1 points]

1*. This data contains the results of all presidential elections from 2000 to 2020. We only want to analyse the 2020 results. Use the filter function to keep only the observations that concern the 2020 election. Create a new object elections_2020 for this task.

elections_2020=filter(elections,year==2020)

2. Our main variable of interest is candidatevotes. In how many counties did a candidate get less than 10 votes (less or equal 10 )? How many different candidates were concerned?

candidatevotesq=elections_2020 %>%
  filter(candidatevotes<=10)
n_distinct(candidatevotesq$candidate)
## [1] 4
n_distinct(candidatevotesq$county_fips)
## [1] 1310
nrow(candidatevotesq[candidatevotesq$candidate == 'OTHER',])
## [1] 2867

1310 counties; 2780>=candidartes>=4; there are 3 named candidates, however 2867 ‘OTHER’ named candidates, so they could either all be the same or all unique, therefore there is a large range of possible values.

3. We saw previously that there were different modes of voting recorded in the data, and that one category is “Total”. This could be worrying: are we counting ballots twice? Could a ballot appear both in the “in person” and “total” categories? Cross-tabulate the mode variable with state, to understand whether the modes of voting are the same in all States. (Use the table function)

xy=table(elections_2020$mode,elections_2020$state)
xy
##                       
##                        ALABAMA ALASKA ARIZONA ARKANSAS CALIFORNIA COLORADO
##   2ND ABSENTEE               0      0       0        0          0        0
##   ABSENTEE                   0      0       0      375          0        0
##   ABSENTEE BY MAIL           0      0       0        0          0        0
##   ADVANCED VOTING            0      0       0        0          0        0
##   EARLY                      0      0       0        0          0        0
##   EARLY VOTE                 0      0      75      375          0        0
##   EARLY VOTING               0      0       0        0          0        0
##   ELECTION DAY               0      0      75      375          0        0
##   FAILSAFE                   0      0       0        0          0        0
##   FAILSAFE PROVISIONAL       0      0       0        0          0        0
##   IN-PERSON ABSENTEE         0      0       0        0          0        0
##   MAIL                       0      0       0        0          0        0
##   ONE STOP                   0      0       0        0          0        0
##   PROV                       0      0       0        0          0        0
##   PROVISIONAL                0      0      75      375          0        0
##   TOTAL                    201    205       0        0        290      320
##                       
##                        CONNECTICUT DELAWARE DISTRICT OF COLUMBIA FLORIDA
##   2ND ABSENTEE                   0        0                    0       0
##   ABSENTEE                       0        0                    0       0
##   ABSENTEE BY MAIL               0        0                    0       0
##   ADVANCED VOTING                0        0                    0       0
##   EARLY                          0        0                    0       0
##   EARLY VOTE                     0        0                    0       0
##   EARLY VOTING                   0        0                    0       0
##   ELECTION DAY                   0        0                    0       0
##   FAILSAFE                       0        0                    0       0
##   FAILSAFE PROVISIONAL           0        0                    0       0
##   IN-PERSON ABSENTEE             0        0                    0       0
##   MAIL                           0        0                    0       0
##   ONE STOP                       0        0                    0       0
##   PROV                           0        0                    0       0
##   PROVISIONAL                    0        0                    0       0
##   TOTAL                         40       12                    5     335
##                       
##                        GEORGIA HAWAII IDAHO ILLINOIS INDIANA IOWA KANSAS
##   2ND ABSENTEE               0      0     0        0       0    0      0
##   ABSENTEE                 477      0     0        0       0  495      0
##   ABSENTEE BY MAIL           0      0     0        0       0    0      0
##   ADVANCED VOTING          477      0     0        0       0    0      0
##   EARLY                      0      0     0        0       0    0      0
##   EARLY VOTE                 0      0     0        0       0    0      0
##   EARLY VOTING               0      0     0        0       0    0      0
##   ELECTION DAY             477      0     0        0       0  495      0
##   FAILSAFE                   0      0     0        0       0    0      0
##   FAILSAFE PROVISIONAL       0      0     0        0       0    0      0
##   IN-PERSON ABSENTEE         0      0     0        0       0    0      0
##   MAIL                       0      0     0        0       0    0      0
##   ONE STOP                   0      0     0        0       0    0      0
##   PROV                     477      0     0        0       0    0      0
##   PROVISIONAL                0      0     0        0       0    0      0
##   TOTAL                      0     20   176      510     460    0    315
##                       
##                        KENTUCKY LOUISIANA MAINE MARYLAND MASSACHUSETTS MICHIGAN
##   2ND ABSENTEE                0         0     0      120             0        0
##   ABSENTEE                    0         0     0      120             0        0
##   ABSENTEE BY MAIL            0         0     0        0             0        0
##   ADVANCED VOTING             0         0     0        0             0        0
##   EARLY                       0         0     0        0             0        0
##   EARLY VOTE                  0         0     0        0             0        0
##   EARLY VOTING                0         0     0      120             0        0
##   ELECTION DAY              480         0     0      120             0        0
##   FAILSAFE                    0         0     0        0             0        0
##   FAILSAFE PROVISIONAL        0         0     0        0             0        0
##   IN-PERSON ABSENTEE          0         0     0        0             0        0
##   MAIL                        0         0     0        0             0        0
##   ONE STOP                    0         0     0        0             0        0
##   PROV                        0         0     0        0             0        0
##   PROVISIONAL                 0         0     0      120             0        0
##   TOTAL                       0       256    80        0            70      415
##                       
##                        MINNESOTA MISSISSIPPI MISSOURI MONTANA NEBRASKA NEVADA
##   2ND ABSENTEE                 0           0        0       0        0      0
##   ABSENTEE                     0           0        0       0        0      0
##   ABSENTEE BY MAIL             0           0        0       0        0      0
##   ADVANCED VOTING              0           0        0       0        0      0
##   EARLY                        0           0        0       0        0      0
##   EARLY VOTE                   0           0        0       0        0      0
##   EARLY VOTING                 0           0        0       0        0      0
##   ELECTION DAY                 0           0        0       0        0      0
##   FAILSAFE                     0           0        0       0        0      0
##   FAILSAFE PROVISIONAL         0           0        0       0        0      0
##   IN-PERSON ABSENTEE           0           0        0       0        0      0
##   MAIL                         0           0        0       0        0      0
##   ONE STOP                     0           0        0       0        0      0
##   PROV                         0           0        0       0        0      0
##   PROVISIONAL                  0           0        0       0        0      0
##   TOTAL                      435         410      580     168      279     68
##                       
##                        NEW HAMPSHIRE NEW JERSEY NEW MEXICO NEW YORK
##   2ND ABSENTEE                     0          0          0        0
##   ABSENTEE                         0          0          0        0
##   ABSENTEE BY MAIL                 0          0          0        0
##   ADVANCED VOTING                  0          0          0        0
##   EARLY                            0          0          0        0
##   EARLY VOTE                       0          0          0        0
##   EARLY VOTING                     0          0          0        0
##   ELECTION DAY                     0          0          0        0
##   FAILSAFE                         0          0          0        0
##   FAILSAFE PROVISIONAL             0          0          0        0
##   IN-PERSON ABSENTEE               0          0          0        0
##   MAIL                             0          0          0        0
##   ONE STOP                         0          0          0        0
##   PROV                             0          0          0        0
##   PROVISIONAL                      0          0          0        0
##   TOTAL                           30        105        165      310
##                       
##                        NORTH CAROLINA NORTH DAKOTA OHIO OKLAHOMA OREGON
##   2ND ABSENTEE                      0            0    0        0      0
##   ABSENTEE                          0            0    0        0      0
##   ABSENTEE BY MAIL                500            0    0      308      0
##   ADVANCED VOTING                   0            0    0        0      0
##   EARLY                             0            0    0      308      0
##   EARLY VOTE                        0            0    0        0      0
##   EARLY VOTING                      0            0    0        0      0
##   ELECTION DAY                    500            0    0      308      0
##   FAILSAFE                          0            0    0        0      0
##   FAILSAFE PROVISIONAL              0            0    0        0      0
##   IN-PERSON ABSENTEE                0            0    0        0      0
##   MAIL                              0            0    0        0      0
##   ONE STOP                        500            0    0        0      0
##   PROV                              0            0    0        0      0
##   PROVISIONAL                     500            0    0        0      0
##   TOTAL                             0          212  352        0    180
##                       
##                        PENNSYLVANIA RHODE ISLAND SOUTH CAROLINA SOUTH DAKOTA
##   2ND ABSENTEE                    0            0              0            0
##   ABSENTEE                        0            0              0            0
##   ABSENTEE BY MAIL                0            0            230            0
##   ADVANCED VOTING                 0            0              0            0
##   EARLY                           0            0              0            0
##   EARLY VOTE                      0            0              0            0
##   EARLY VOTING                    0            0              0            0
##   ELECTION DAY                    0            0            230            0
##   FAILSAFE                        0            0            230            0
##   FAILSAFE PROVISIONAL            0            0            230            0
##   IN-PERSON ABSENTEE              0            0            230            0
##   MAIL                            0            0              0            0
##   ONE STOP                        0            0              0            0
##   PROV                            0            0              0            0
##   PROVISIONAL                     0            0            230            0
##   TOTAL                         201           24              0          198
##                       
##                        TENNESSEE TEXAS UTAH VERMONT VIRGINIA WASHINGTON
##   2ND ABSENTEE                 0     0    0       0        0          0
##   ABSENTEE                     0     0    0       0      528          0
##   ABSENTEE BY MAIL             0     0    0       0        0          0
##   ADVANCED VOTING              0     0    0       0        0          0
##   EARLY                        0     0  145       0        0          0
##   EARLY VOTE                   0     0    0       0        0          0
##   EARLY VOTING                 0     0    0       0        0          0
##   ELECTION DAY                 0     0  145       0      532          0
##   FAILSAFE                     0     0    0       0        0          0
##   FAILSAFE PROVISIONAL         0     0    0       0        0          0
##   IN-PERSON ABSENTEE           0     0    0       0        0          0
##   MAIL                         0     0  145       0        0          0
##   ONE STOP                     0     0    0       0        0          0
##   PROV                         0     0    0       0        0          0
##   PROVISIONAL                  0     0    0       0      532          0
##   TOTAL                      285  1270  140      70        0        195
##                       
##                        WEST VIRGINIA WISCONSIN WYOMING
##   2ND ABSENTEE                     0         0       0
##   ABSENTEE                         0         0       0
##   ABSENTEE BY MAIL                 0         0       0
##   ADVANCED VOTING                  0         0       0
##   EARLY                            0         0       0
##   EARLY VOTE                       0         0       0
##   EARLY VOTING                     0         0       0
##   ELECTION DAY                     0         0       0
##   FAILSAFE                         0         0       0
##   FAILSAFE PROVISIONAL             0         0       0
##   IN-PERSON ABSENTEE               0         0       0
##   MAIL                             0         0       0
##   ONE STOP                         0         0       0
##   PROV                             0         0       0
##   PROVISIONAL                      0         0       0
##   TOTAL                          220       360      92

As you can see, in some States the mode variable takes exclusively the value “TOTAL” (in Alabama for instance). In those States, we know the total number of votes obtained by candidates in each county. In other States, we do not have the total number of votes, but only a number disaggregated by voting method (in Arizona for instance).

All of those situations are fine, since they mean ballots are not counted twice.

However there is one worrying case: Utah. In this State, we observe both the total number of votes, and the number of votes by mail, election day voting, and early voting. By looking more closely at the data, you can see that in the Salt Lake county (code 49035), we have the number of votes received early; on election day; and by mail. For all other counties, the ballot count for these voting modes is zero, and we only observe the “total” ballot count. If you wanted to be convinced more formally that this is true, you could run the following script:

elections_2020 %>% 
  filter(state=="UTAH") %>% #keep only UTAH observations
  mutate(mode_vote_total = (mode=="TOTAL" & candidatevotes!=0), #create a dummy if we observe a ballot in the "total" category
         mode_vote_other = (mode!="TOTAL" & candidatevotes!=0)) %>% # create a dummy if we observe a ballot in any other voting category
  group_by(county_fips, county_name) %>%  # group by county
  summarise(mode_vote_total = max(mode_vote_total), #create a dummy if for any candidate in the county we observe a ballot in the "total" category
            mode_vote_other = max(mode_vote_other)) %>% #create a dummy if for any candidate in the county we observe a ballot in any other voting category
  View() #visualise the resulting dataset

#You can see that in no county do we observe a ballot both in "total" and in another category!

So it holds true that we never count a ballot twice: in every county, either we count it in the “total” category, or we count it in one of the categories associated to specific voting method. This is good news!

Checking quality of the data [2 points]

1. In States like Arizona where we don’t have the total number of ballots, it is not clear that the voting modes for which we have data cover all voting methods available.

For instance, in Kentucky we only have the number of ballots cast on election day. A quick search online shows that early voting is possible in this State, although it is heavily restricted. It is hence very likely that some ballots are missing in our data.

Now that we noticed this feature of the data, we want to check how big of an issue this is: how many votes are not counted in our data? Can we still reasonably use it, or is its quality too bad?

To answer this question, we will compare the total number of votes by State in our dataset to the official number provided by the Federal Government.

Create a new object called elections_2020_state_totals, with two variables: state_po and sum_votes_state: the sum of candidatevotes by State. Hint: you should group your data by state_po, and then use summarise. You should also disregard missing values when you compute the sum of candidatevotes: find out the argument of the sum function which lets you do that.

elections_2020_state_totals=elections_2020 %>%
  group_by(state_po)%>%
  summarise(sum_votes_state=sum(candidatevotes,na.rm=FALSE))

view(elections_2020_state_totals)

2. The official election data in the US is provided by the Federal Election Commission, in this document. Notice that it is in pdf format. We already extracted it into an excel file for you.

Download the file here

Import it into a new object named official_state_votes, using the read_excel function from the readxl package.

official_state_votes <- read_excel("~/Desktop/OneDrive/state_votes.xlsx")

3. Both your elections_2020_state_totals and official_state_votes should have 51 observations, one per State in the US + Washington DC. But now the official number of votes and the one from our dataset are in two different R objects. You could compare the numbers manually, but this would be a lot of work. Instead, we want to merge them into a single object.

We will do this using the left_join() function from the tidyverse package.

This is how left_join() works: left_join(x,y) will return all rows from x (very important!) and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. Rows in y with no match in x will be discarded.

The following cheatsheet might help you visualise the way the function works: https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf. You can also find more details here.

If nothing is specified, left_join will automatically detect the variables which have the same name in both datasets and use them to join the data. If your variables have different names in the datasets, you can specify left_join(x,y, by=c("var1x" = "var1y", "var2x" = "var2y")). This will match variable var1x in x to var1y in y and var2x in x to var2y in y. (You can keep going, and join on more than two variables!)

Create a new object called elections_2020_state_comparison, which is the left_join of elections_2020_state_totals (1st argument) and official_state_votes (2nd argument). Here, our identifying variables have different names, so we need to add the argument by=c("state_po" = "state_short")

elections_2020_state_comparison=left_join(elections_2020_state_totals, official_state_votes, by=c("state_po"="state_short"))

Congratulations, you have merged your first datasets! Easy right? :)

4. Compute the percentage difference between the official number of votes, and the sum of votes in our data by State: (official votes - sum votes)/official votes*100. What are the min, max and mean of this percentage difference?

elections_2020_state_comparison$percentagedifference= ((elections_2020_state_comparison$official_total_votes - elections_2020_state_comparison$sum_votes_state)/elections_2020_state_comparison$official_total_votes)*100
summary(elections_2020_state_comparison$percentagedifference)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.924807  0.000000  0.000000 -0.044041  0.001063  0.545434

Min: -0.924807; Mean: -0.044041; Max: 0.545434

If our election data contained the official number of votes per State, the difference would always be zero. Clearly, this is not the case: in some States our data has too few votes, in others it has too many. However, the difference is always very very small and should not dramatically change the final results obtained by each candidate.

Still, the discrepancy is not easy to explain: the MIT lab collects official results from all the States and puts them together into a national file. It is not clear where the difference is coming from: maybe there was human error in data collection, maybe the results were updated in some States after a recount…

Since the United States government itself does not provide official county-level results for the entire country, the only alternative to using the MIT data would be to start from State level official results. However, this would be a lot of work as we would have to find all the files, deal with format issues (many data files are still provided in pdf, for instance in Kentucky) and clean them all. It would be time consuming, and we would likely make more errors than the MIT lab!

Some more tidying [2.5 points]

1. Now that we have decided that our data was good enough, we still need to tidy it before we can use it. Make sure that the candidate name is always written with the same format by tabulating the candidate variable.

table(elections_2020$candidate)
## 
##    DONALD J TRUMP      JO JORGENSEN JOSEPH R BIDEN JR             OTHER 
##              5117              4955              5117              6904

2*. Remember that we noticed earlier there were missing values in the candidatevotes and totalvotes variables.

Run the following script to fix the issue (replace eval = FALSE by eval = TRUE below, so that the code is also run when you knit). This chunk of code:

  • replaces candidate_votes by zero when it is missing
  • replaces total_votes_county by the sum of votes in the county when it is missing.

Note that we create a new object, elections_2020_clean.

The ifelse function works in the following way: ifelse(test, value if test==TRUE, value if test==FALSE). For instance in the third line below, for each observation, if candidatevotes is missing then it is replaced by 0, otherwise it is not changed.

elections_2020_clean <- elections_2020 %>%
  group_by(county_fips) %>%
  mutate(candidatevotes = ifelse(is.na(candidatevotes), 0, candidatevotes),
         totalvotes = ifelse(is.na(totalvotes), sum(candidatevotes), totalvotes))

3*. Clean some more the elections_2020_clean object by:

  • Summarising the data so that the unit of observation is a candidate in a county. To do this, you need to sum the number of votes obtained for each candidate in each county, across all modes of voting. Note: in order to keep all the variables, you will need to use the following command group_by(state, state_po, county_name, county_fips, totalvotes, candidate). If we only grouped by county_fips, candidate, we would “lose” all the other “extra” variables that give us information about the county (its name, its state, the number of votes, …). If this is not clear to you, simply run the code twice, once grouping by all the variables, and once with group_by(county_fips, candidate), and see what happens!

  • Creating a variable pct_votes which is equal to the percentage of votes obtained by a candidate in a county: (number of votes obtained by a candidate / total number of votes in county)*100.

  • Grouping the data by county (county_fips), and then use a combination of rank() and desc() commands to compute the ranks of each candidate within each county. You should create a variable called candidate_rank which takes the value 1 for the candidate who received the most vote in the county, 2 for the candidate who received the second highest number of votes etc.

  • Once you’ve ensured your code is correct, “ungroup” the data by passing your object into the ungroup() function or by piping it. This function does what it says: it removes grouping. This means that it cancels the effect of your last group_by, and ensures that the next functions you use will not be applied by group.

You can pipe all these commands together if you wish or proceed in several steps.

elections_2020_clean=elections_2020_clean %>%
  group_by(state, state_po, county_name, county_fips, totalvotes, candidate) %>%
  summarise(candidatevotes=sum(candidatevotes))

pctvotes= ((elections_2020_clean$candidatevotes / elections_2020_clean$totalvotes)*100)
elections_2020_clean$pctvotes=pctvotes

#Grouping the data by county (county_fips), and then use a combination of rank() and desc() commands to compute the ranks of each candidate within each county. You should create a variable called candidate_rank which takes the value 1 for the candidate who received the most vote in the county, 2 for the candidate who received the second highest number of votes etc.

elections_2020_clean=elections_2020_clean %>%
  arrange(county_fips, candidatevotes) %>%
    group_by(county_fips) %>%
    mutate(rank = rank(-candidatevotes))
view(elections_2020_clean)

4. Check that your variable pct_votes is actually a percentage (that it is comprised within 0 and 100). This is a safety check to make sure that there is no inconsistency in the data.

range(elections_2020_clean$pctvotes, na.rm = TRUE)
## [1]  0.00000 96.18182

Maps [3 points]

Election data are probably better visualized using maps.

In this section we will produce two maps displaying:

  1. Joe Biden and Donald Trump’s vote percentages in each county,
  2. The leading candidate in each county.

To produce the maps you will need to install and load the sf, tmap and rmapshaper packages.

Maps are tricky and very complicated objects. But to boil it down to the essential ingredients, you need a “shapefile” which is basically a file that contains the map onto which data will be plotted.

This file is provided by the US census website here. However it is somewhat heavy, so to make your life easier, we have simplified it for you using the following commands. Do NOT touch nor run this chunk, and do NOT replace eval = FALSE by eval = TRUE. We’ve left it so you can see how we created the simplified shapefile you will be loading in the next question.

library(sf)
library(tmap)
library(rmapshaper)

county_shfl_full <- read_sf("path/cb_2018_us_county_500k/cb_2018_us_county_500k.shp")
# where `path` corresponds to where the `cb_2018_us_county_500k` folder is located.
county_shfl <- st_simplify(county_shfl_full, preserveTopology = FALSE, dTolerance = 1000) #simplify the map to make it less memory-heavy

save(county_shfl, file="path/county_shfl.RData") #saving the object

You can download the resulting R object from this link: > https://www.dropbox.com/s/m8zq5e38imp2he1/county_shfl.RData?dl=1

1. Let’s load the county_shfl.RData file by using the load() function. Replace eval = FALSE by eval = TRUE.

load("/Users/katherinefrancke/Library/CloudStorage/OneDrive-sciencespo.fr/county_shfl.RData")

You should now have an object called county_shfl in your environment

2. FIPS codes are numbers which uniquely identify geographic areas in the US. Working with FIPS codes is always advisable when identifying areas, as names can be duplicates: there is an Aidar county in Kentucky and another one in Iowa.

What is the county FIPS code of the “FLEMING” county in the elections_2020_clean data.frame? What are the State FIPS (STATEFP), county FIPS (COUNTYFP) and GEOID of the same county in the county_shfl data? Hint: in county_shfl the county’s name is written “Fleming” and the variable containing county names is NAME.

elections_2020_clean %>%
  filter(county_name=="FLEMING")
county_shfl %>%
  filter(NAME=="Fleming")
STATEFPCOUNTYFPCOUNTYNSAFFGEOIDGEOIDNAMELSADALANDAWATERgeometry
21069005168810500000US2106921069Fleming069.03e+087.18e+06list(c(-83.984282, -83.921376, -83.916131, -83.88036, -83.85914, -83.846644, -83.835656, -83.801869, -83.768357, -83.74408, -83.734005, -83.735331, -83.73823, -83.710803, -83.703959, -83.692767, -83.685828, -83.641322, -83.586792, -83.560332, -83.559706, -83.507953, -83.499543, -83.487972, -83.476871, -83.465424, -83.463696, -83.581427, -83.582588, -83.589503, -83.593335, -83.6261, -83.641852, -83.85928, -83.930348, -83.984282, 38.445493, 38.383309, 38.360152, 38.360966, 38.343529, 38.337216, 38.30263,
38.279276, 38.293533, 38.284713, 38.293878, 38.278838, 38.268149, 38.26817, 38.244383, 38.2056, 38.214107, 38.174944, 38.21828, 38.220903, 38.231253, 38.273155, 38.318717, 38.352905, 38.359256, 38.357992, 38.393352, 38.43655, 38.462461, 38.473977, 38.489031, 38.512177, 38.525381, 38.456348, 38.492278, 38.445493))

The county FIPS code of the “FLEMING” county is 21069; State FIPS is 21, county FIPS is 069; and GEOID is 21069.

3. You might have noticed that the county_fips variable in our elections_2020_clean is actually made of the State FIPS code, followed by the county code.

In county_shfl, we have a variable for the State FIPS code and a separate one for the county code. The variable containing the same code as in elections_2020_clean is actually called GEOID. This type of thing is exactly why you always need to look inside your data: it would not have been obvious from the variable names.

What is the type of county_fips in the elections_2020_clean data? What is the type of GEOID in county_shfl? Hint: use the typeof function.

typeof(elections_2020_clean$county_fips)
## [1] "integer"
typeof(county_shfl$GEOID)
## [1] "character"

The county_fips data in elections_2020_clean are integers, the GEOID data in county_shfl are characters.

It is not the same! In order to merge the two datasets, we want county_fips to have the same type as in GEOID and be a character string instead.

Furthermore, because it is stored as a numeric value there is no leading zero (the FIPS code “01001” is stored as 1001).

The following script fixes this, and creates a new variable called county_fips_clean. Run the code and change eval = FALSE to eval = TRUE to make sure the code runs when you knit your document.

We start by using the as.character function to convert the variable to a character string. Then, we use the ifelse function again. What this conditional statement says is: if the number of characters of the variable county_fips_clean is equal to 4, then add a “0” in front of this variable; otherwise do nothing.

elections_2020_clean <- elections_2020_clean %>% 
  mutate(county_fips_clean = as.character(county_fips), #convert county_fips to a character string
         county_fips_clean = ifelse(nchar(county_fips_clean) == 4,
                            paste0("0", county_fips_clean),
                            county_fips_clean)) #Add a leading zero when necessary: if there are only four characters to the FIPS code, it means that the leading 0 is missing

4. How many different State FIPS codes are there in the county_shfl dataset? Hint: You can use a combination of the length and unique commands to count them.

n_distinct(county_shfl$STATEFP)
## [1] 56

56

5. Our shapefile clearly contains the map for more than just the 50 US States + DC. By looking at the existing State codes here, you can see that codes above 56 correspond to overseas territories.

The final map we want to produce should only contain the 50 States + DC: we want to remove overseas territories. We also exclude Alaska and Hawaii for purely aesthetic reasons: including them zooms out the map too much. We could fix this problem by showing Alaska and Hawaii in a corner of the map but this is beyond the scope of the midterm.

Filter county_shfl_clean to keep only observations such that:

  • STATEFP is not equal to “02” (Alaska) or “15” (Hawaii)
  • and as.numeric(STATEFP) is lower or equal to 56 (note that because State FIPS is stored as a character, we have to convert it to a numeric before evaluating whether it is below 56)

Create a new object called county_shfl_clean.

county_shfl_clean=county_shfl %>%
  filter(STATEFP!="02", STATEFP!="15")%>%
  filter(as.numeric(STATEFP)<=56)

Furthermore, some counties’ FIPS have changed over the years, and they differ in our two datasets. The following script will fix all these inconsistencies for you. The case_when function is similar to ifelse. Here it modifies the variable county_fips_clean, such that :

  • if county_fips_clean was equal to “46113”, it now takes the value “46102”
  • if state_po is equal to “DC”, county_fips_clean now takes the value “11001”
  • else county_fips_clean keeps the same value.

Remember to replace eval = FALSE with eval = TRUE.

elections_2020_clean <- elections_2020_clean %>% 
  mutate(county_fips_clean = case_when(
    county_fips_clean == "46113" ~ "46102",
    state_po == "DC" ~ "11001",
    TRUE ~ county_fips_clean))

6. Use left_join to merge county_shfl_clean (first argument) and elections_2020_clean (second argument). Create a new object called elections_2020_map for this question.

Here, our identifying variables have different names: GEOID in county_shfl_clean (first argument) and county_fips_clean in elections_2020_clean (second argument). Look at the help file or go back to question 3 of the “Checking quality” section to see how to specify the by argument.

elections_2020_map=left_join(county_shfl_clean,elections_2020_clean,by=c("GEOID"="county_fips_clean"))

Map 1: Candidates’ vote percentage

7. Create a heat map of Biden and Trump’s percentage of the vote by county, using the following code (change eval = FALSE to eval = TRUE just below so that the html file displays the map):

tmap_mode("view") # we want an interactive map
elections_2020_map %>%
             # only keep Trump and Biden
             filter(candidate %in% c("DONALD J TRUMP", "JOSEPH R BIDEN JR")) %>% 
    tm_shape() +
    tm_borders(col="white", lwd = 0.3) + # white and thin (line width) borders
    tm_fill(
        col = "pctvotes",   # variable to be mapped
        title = "% of votes",   # legend title
        id = "county_name",   # information to display when mouse hovers over a departement
        popup.vars = c("Vote %:" = "pctvotes")) + # variable to display in popup window
    tm_facets(by = "candidate") # create one map per selected candidate

Wasn’t that easy (well, sort of ;) ) ! And now you have 2 comparable maps of the vote percentages received by Biden and Trump! Hover over the counties with your mouse. The name of the county will show up. If you click on a county you will get the vote percentage received by the candidate. Isn’t that cool!

Map 2: Leading candidate

8. Create a map displaying the candidate who received the highest percentage of votes in each county. Try to create this map by yourself using the code above (and the comments in it). You don’t need to create any new variables. Don’t hesitate to use Google (or your favorite search engine) to help you out. Don’t forget to change the title of the legend.

Add the following argument in tm_fill() to have a nice color palette for your map: palette = c("red","blue").

tmap_mode("view") # we want an interactive map
elections_2020_map %>%
            # this chooses the candidate with the highest percentage votes in each county
            group_by(county_fips) %>% slice(which.max(pctvotes)) %>%
  tm_shape() +
  tm_borders(col = "white", lwd = 0.3) + # white and thin (line width) borders
  tm_fill(
      col = "candidate", # variable to be mapped
      title = "2020 US Election, Who Won Each County?", # legend title
      id = "county_name", # information to display when mouse hovers on a departement
      palette = c("red","blue"),
      popup.vars = c("Vote %:" = "pctvotes")) # variable to display in popup window

Tidying covariate data [2 points]

A covariate is simply the name we give to variables containing characteristics of our unit of interest (here, characteristic of counties). A covariate can be of direct interest if it is an independent variable in the model. It can also be an extra characteristic we add to the regression to avoid the omitted variable bias which we will discuss later in the course.

Note: The websites have been known to have issues with Mozilla Firefox. Please use another browser, such as Google Chrome or Microsoft Edge, to download the files. > Download county-level social data from the American Community Survey (ACS) here: > https://data.census.gov/cedsci/table?q=DP02&g=0100000US%240500000&tid=ACSDP5Y2019.DP02

Click “DOWNLOAD TABLE” and select the year 2019 for “ACS 5-Year Estimates Data Profiles” and download. Unzip the folder once downloaded.

Repeat for ACS economic data: https://data.census.gov/cedsci/table?q=DP03&g=0100000US%240500000&tid=ACSDP5Y2019.DP03

1*. Import these csv files (the ones with -Data in their name) into objects called acs_soc_2019 and acs_eco_2019 respectively.

Note: If you work in groups and want the code to run on all your laptops, make sure that your files have the same name.

acs_soc_2019 <- read.csv("~/Desktop/Sem 4/R/ACSDP5Y2019.DP02-Data.csv")
acs_eco_2019 <- read.csv("~/Desktop/Sem 4/R/ACSDP5Y2019.DP03-Data.csv")

2. How many variables are there in each dataset?

ncol(acs_soc_2019)
## [1] 1227
ncol(acs_eco_2019)
## [1] 1099

1227 variables in acs_soc_2019; 1099 variables in acs_eco_2019

3*. That’s a lot! We need to keep only the variables we are interested in, and tidy the data a little. For each acs data frame:

  • Rename the following variables : (old_name -> new_name)
  1. In acs_eco_2019 : DP03_0009PE -> unemp_rate
  2. In acs_eco_2019 : DP03_0062E -> med_income
  3. In acs_soc_2019 : DP02_0152PE -> pct_computer

The new name tells you what the variable actually contains. unemp_rate is the unemployment rate in the county. med_income is the median households income and pct_computer is the percentage of households owning a computer.

  • Keep only the variables you just renamed, NAME and GEO_ID

For this task, create two new objects: acs_soc_2019_clean and acs_eco_2019_clean.

acs_soc_2019_clean=acs_soc_2019 %>%
  rename(pct_computer=DP02_0152PE) %>%
  subset(select= c(NAME, GEO_ID,pct_computer))

acs_eco_2019_clean=acs_eco_2019 %>%
  rename(med_income=DP03_0062E) %>%
  rename(unemp_rate=DP03_0009PE) %>%
  subset(select = c(NAME, GEO_ID, med_income, unemp_rate))

4. Display the first five lines of acs_soc_2019_clean and acs_eco_2019_clean

head(acs_soc_2019_clean,5)
NAMEGEO_IDpct_computer
Geographic Area NameGeographyPercent!!COMPUTERS AND INTERNET USE!!Total households!!With a computer
Autauga County, Alabama0500000US0100189.2
Baldwin County, Alabama0500000US0100390.9
Barbour County, Alabama0500000US0100577.2
Bibb County, Alabama0500000US0100778.1
head(acs_eco_2019_clean,5)
NAMEGEO_IDmed_incomeunemp_rate
Geographic Area NameGeographyEstimate!!INCOME AND BENEFITS (IN 2019 INFLATION-ADJUSTED DOLLARS)!!Total households!!Median household income (dollars)Percent!!EMPLOYMENT STATUS!!Civilian labor force!!Unemployment Rate
Autauga County, Alabama0500000US01001587313.7
Baldwin County, Alabama0500000US01003583204.3
Barbour County, Alabama0500000US01005325259.2
Bibb County, Alabama0500000US01007475427.3

5. What is in the first line of the datasets?

A description of the variables.

6*. Drop the first row of acs_soc_2019_clean and acs_eco_2019_clean. Hint: the negative sign should help you.

acs_soc_2019_clean <- acs_soc_2019_clean[-1,]
acs_eco_2019_clean <- acs_eco_2019_clean[-1,]

7*. Merge the acs_soc_2019_clean and acs_eco_2019_clean together using left_join. You don’t need to specify the by argument here, since the common variabes GEO_ID and NAME should have the same name in both datasets.

Create a new object called acs_2019_clean.

acs_2019_clean=left_join(acs_soc_2019_clean,acs_eco_2019_clean)

8. What is the type of the pct_computer, unemp_rate and med_income variables?

typeof(acs_2019_clean$pct_computer)
## [1] "character"
typeof(acs_2019_clean$unemp_rate)
## [1] "character"
typeof(acs_2019_clean$med_income)
## [1] "character"

The variables of pct_computer, unemp_rate, and med_income are characters.

9*. In the acs_2019_clean dataframe, convert the pct_computer, unemp_rate and med_income variables to numeric, using the as.numeric() function. This function simply converts character strings to numbers: for example as.numeric("17.9") = 17.9. Hint: to convert the variable, simply use the dplyr verb to create a new variable and give it the same name as the variable you are modifying.

Ignore the warning about NAs being introduced. They are due to observations for which pct_computer was already missing (it was equal to “null”)

acs_2019_clean <- acs_2019_clean %>%
  mutate(pct_computer=as.numeric(pct_computer)) %>%
  mutate(unemp_rate= as.numeric(unemp_rate)) %>%
  mutate(med_income=as.numeric(med_income))
typeof(acs_2019_clean$med_income)
## [1] "double"

10. What is the GEO_ID of the county whose NAME is “Fleming County, Kentucky” in the acs_eco_2019_clean and acs_soc_2019_clean dataframes?

acs_2019_clean %>%
  filter(NAME=="Fleming County, Kentucky")
NAMEGEO_IDpct_computermed_incomeunemp_rate
Fleming County, Kentucky0500000US2106981.44.46e+047.4

11*. The geographical ID of the county is different to what we had last time! You might have noticed that this time, the 5-digit FIPS code is preceded by a prefix: “0500000US”. We want to extract the FIPS code from this geographic identifier.

Create a new variable, called county_fips_clean which contains only the 5 last characters of the GEO_ID variable. Use the substr() function. This function extracts a substring from a character string. You need to give it the original string, and the position of the first and last character of the substring. Example: substr("abcde", 2, 4) = "bcd", i.e. you get letters 2 to 4. Here we want to extract characters 10 to 14 of the GEO_ID variable. As with as.numeric you have to create the variable using the relevant dplyr verb.

acs_2019_clean <- acs_2019_clean %>%
  mutate(county_fips_clean=substr(GEO_ID,10, 14))

12*. Now that our covariates data is tidy, we can finally merge it with the elections data. Merge elections_2020_clean and acs_2019_clean using left_join. Note: elections_2020_clean needs to be the first argument). Create a new object called elections_2020_covariates.

elections_2020_covariates=left_join(elections_2020_clean,acs_2019_clean)

Regression analysis [8 points]

1*. In the regression analysis, we will only focus on the votes obtained by Donald Trump.

  • Filter elections_2020_covariates to keep only the observations with the votes for Donald Trump.

  • Keep only the following variables: state_po, county_name, county_fips_clean, pct_votes, unemp_rate, med_income, pct_computer.

  • Rename pct_votes -> pct_votes_trump.

Create a new object called elections_2020_reg_trump for this task.

elections_2020_reg_trump=elections_2020_covariates %>%
  filter(candidate=="DONALD J TRUMP") %>%
  ungroup(county_fips)%>%
  select(c('state_po', 'county_name', 'county_fips_clean', 'pctvotes', 'unemp_rate', 'med_income', 'pct_computer')) %>%
  rename(pct_votes_trump=pctvotes)

Support time: If you are not confident that your answers to the necessary red-starred questions are all correct, you can now download a pre-cleaned dataset here. This file contains the data you should have arrived to after all these questions. This means that even if you made mistakes before, you can get a fresh start for the regressions and do them correctly. We will not penalise you for using this data.

If you choose to use this option, you can load the elections_2020_reg_trump.RData file you just downloaded by using the load() function. Replace eval = FALSE by eval = TRUE.

2. Use the skim() function from the skimr package to obtain summary statistics for pct_computer, unemp_rate and med_income.

elections_2020_reg_trump %>%
  skim(pct_computer,unemp_rate,med_income)
Data summary
Name Piped data
Number of rows 3155
Number of columns 7
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
pct_computer 40 0.99 85.40 6.32 43.2 82.05 86.3 89.70 98.3 ▁▁▁▇▇
unemp_rate 40 0.99 5.26 2.57 0.0 3.60 4.9 6.45 24.9 ▇▇▁▁▁
med_income 40 0.99 53363.62 14147.87 21504.0 44134.00 51679.0 59642.50 142299.0 ▃▇▁▁▁

We can see that are some missing values. If you want you can look at which counties have missing data but otherwise, we’ll just ignore them for this midterm.

3. To facilitate plotting we will reshape the data in long format. (The difference between long and wide data is explained in this article.) We are going to use the pivot_longer() function from the tidyverse package to reshape the data. You can find examples of how the pivot_longer() function works by typing ?pivot_longer in your console. You can also find information and examples about pivot_longer and its associated function pivot_wider here

For our purposes we will use the following 4 arguments in pivot_longer():

  • data = elections_2020_reg_trump (notice that if you use the %>% pipe you don’t need to specify the data argument);
  • cols = c("pct_computer", "unemp_rate", "med_income"). These are the columns to be reshaped.;
  • names_to = "covariate". This is the name of a new variable to be created, where we will store the name of the covariate;
  • values_to = "value" This is the name of a new variable to be created, where we will store the values of the covariates.

If this is somewhat unclear to you, simply run the code and check how it differs from the previous dataset. This should clear things up.

Create a new object elections_2020_plot_trump for this task. View the created dataframe to ensure it looks as expected.

elections_2020_plot_trump=elections_2020_reg_trump %>%
  pivot_longer(cols = c("pct_computer", "unemp_rate", "med_income"),names_to = "covariate", values_to = "values")
view(elections_2020_plot_trump)

4. Produce 3 graphs displaying the relationship between the vote percentage for Donald Trump and our 3 covariates. Each graph should be on a different row. Make sure that your graph has a meaningful y-axis label and that the y-axis is between 0 and 100 (use the scale_y_continuous() layer for this). (Hint: use the facet_wrap layer in ggplot to make these 3 plots in the same graph. Look at the facet_wrap arguments to ensure each graph is on its own row and the scales of each plot change freely.)

elections_2020_plot_trump %>%
  ggplot() + aes(x=values, y=pct_votes_trump) + labs(x="Covariate",y="Percentage of Votes for Trump")+geom_line()+scale_y_continuous()+facet_wrap(~covariate,scales = "free")

5. Add a regression line on these plots using the geom_smooth layer to better see the relationship’s sign. Set the argument method to "lm" (for linear model) and se to FALSE (so that it does not display a confidence interval around the regression line; we’ll discuss this towards the end of the course but you can set the argument to TRUE to see what it does).

elections_2020_plot_trump %>%
  ggplot()+ aes(x=values, y=pct_votes_trump) + labs(x="Covariate",y="Percentage of Votes for Trump")+geom_line()+scale_y_continuous()+facet_wrap(~covariate,scales = "free")+geom_smooth(method=lm,se=FALSE)

6. From the graphs, is the relationship between the different covariates and Donald Trump’s vote share positive or negative? Rather strong or rather weak? The relationship is negative and rather strong

7. Run a simple linear regression of Donald Trump’s vote percentage on each of our covariate variables separately (i.e. 3 simple linear regressions). For this question you should use elections_2020_reg_trump as the data source.
Use the export_summs() function from the jtools package to see the results from the three regressions side by side.

view(elections_2020_reg_trump)
unemp_rateregression=lm(pct_votes_trump ~ unemp_rate,elections_2020_reg_trump)
med_incomeregression=lm(pct_votes_trump ~ med_income,elections_2020_reg_trump)
pct_computerregression=lm(pct_votes_trump~pct_computer, elections_2020_reg_trump)
regressiontabletrump=export_summs(unemp_rateregression,med_incomeregression,pct_computerregression)

Don’t pay attention to the stars next to the coefficients and to the number in parenthesis below the coefficients, we’ll get to them later on in the course.

8. What is the interpretation of the coefficients for pct_computer, unemp_rate and med_income? Be precise.

The negative coefficient for pct_computer signifies that the percentage of people voting for Trump decreases as the percentage of people with a computer increases in a given county. For every 1% increase in the amount of people owning a computer, there is a 0.5% decrease in the rates of vote for Trump. The negative coefficient for unemp_rate signifies that as the percentage of people percentage of people that are unemployed increases in a given county, the percentage of people who vote for Trump decreases in that same county. For every 1% increase in unemployment, the percentage of people voting for Trump decreases by 1.38%. The negative coefficient for med_income signifies that as the median income in a given county increases, the percentage of people who vote for Trump decreases. Still, the relationship between median income and voting decisions is relatively weak: as median income increases by 1%, votes for Trump decrease by 0.0002996%.

9. Why is the coefficient for median income so small?

The coefficient for median income is likely so small because values of income can range at an incredibly large scale, so it is unlikely that changes can simply be measured by median measurements.

10. Why does the number of observations not vary for each regression?

The number of observations does not vary for each regression because the observations all come from the same set of data.

11. Which variable’s variance explains more of the variance in Donald Trump’s vote?

The unemployment rate best explains the variance in Donald Trump’s vote.

12. Can you interpret these estimates causally? Why not? How else could you interpret the regression on pct_computer? How would you go about trying to assess whether median income or the unemployment rate at the county level has a causal impact on the vote percentage for Donald Trump?

We cannot interpret the estimates causally because we have not isolated the variables from outside factors. There could be lurking variables that also impact the relationships, so we can only make conclusions based on correlation. For example, we could interpret the regression on pct_computer as being a function of income or age. We could assess whether median income or unemployment rate at the county level has a causal impact on the vote percentage for Donald Trump by setting the median income to a certain level and finding the unemp_rate that corresponds to that result, and then running a regression with these results and the Trump vote.

Well done!

And that’s it for this midterm! We hope you had fun, or at least some fun, doing it and that you improved your R skills. We also hope you got a sense of what “real” data looks like and how it can be tamed by using the great tools that R provides.