DEADLINE: The midterm is due before March 6th, 23h59. Penalty for late work is -2 points per hour late.
You can do this homework in groups of two at most. Please indicate who you worked with when entering the names (as instructed below.)
This homework has a total of 20 points. The number of points for each section is clearly indicated in the section and/or subsection headings.
You need to hand in this .Rmd file on Moodle
together with the .html file - that is produced when you
click on the “Knit” button. If you are having issues knitting your file
do get in touch (and knit regularly to make sure everything is
working properly). Both files must be handed in the same
.zip folder.
We left code blocks in this .Rmd file for you to
fill out. Just put your code within the marked region, and click the
Knit button in RStudio, just above this script, to see the
generated HTML output. Knit it now, before you add any code to see how
nice the output looks like.
When you need to enter a block of code, it will look like the following:
# code goes here
Please fill in your answer instead of # code goes here. Do NOT delete the ```{r} and ```; they separate the code chunk from text.
Whenever we give you a chunk of code with code already written,
you need to replace eval = FALSE by
eval = TRUE in the chunk “header”. This will make sure that
the chunk is ‘evaluated’ (i.e., that the code runs) when you knit the
document.
Sometimes, you need to enter text as an answer. In this case, it will say: Answer goes here.
You can find a nice intro to Rmarkdown (.Rmd) here.
Questions marked with a red star (*) are the ones we will build on progressively to clean the data until it is ready for the regression analysis. Other questions are side quests, which you don’t need to answer to get to the regressions.
We have added a link to a clean dataset just before the regression analysis part. If there are necessary questions you have not managed to do, you can start from this clean data instead of using your own.
This midterm project consists in one long analysis of data from the 2020 Presidential elections in the United States.
You will be asked to download the data from its original source, just like you would if you were to undertake this task by yourself. You will use everything we’ve learned so far: importing (un-tidy) data; summarizing, visualizing and tidying it; running regressions for different variables; and thinking about interpreting these associations. In the process, you will also learn 2 useful operations: merging 2 (or more) datasets together, and “reshaping” data from wide to long format. Don’t worry we will work you through how to do all of this.
We hope you find this project interesting and useful. Now let’s get going!
Kate Francke and Tedo Sandroshvili
It’s good practice to load all the packages you will need in the same place.
# Load all the packages you need here
# Don't write the install.packages() code here. Run that code in the console.
library(readxl)
library(readr)
library(tidyverse)
library(janitor)
library(sf)
library(tmap)
library(skimr)
library(jtools)
library(huxtable)
library(rmapshaper)
Download the data from November 2020 US presidential elections from here:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ#
Scroll down, click on the download icon next to the second file (countypres_2000-2020.tab), and select “Comma Separated Values (Original File Format)”.
This is straight from the MIT Election Data + Science lab, which collects and shares election data from the United States. This dataset contains election results aggregated at the county level. Counties are small geographic units in the United States.
1*. Load the file you downloaded
into RStudio in an object called elections. You can use the
read.csv function.
elections=read.csv("~/Desktop/Sem 4/R/countypres_2000-2020.csv")
1. Use tail to display the last 10 rows of the
dataset
tail(elections,10)
| year | state | state_po | county_name | county_fips | office | candidate | party | candidatevotes | totalvotes | version | mode |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020 | WYOMING | WY | UINTA | 56041 | US PRESIDENT | OTHER | OTHER | 200 | 9459 | 20220315 | TOTAL |
| 2020 | WYOMING | WY | UINTA | 56041 | US PRESIDENT | DONALD J TRUMP | REPUBLICAN | 7496 | 9459 | 20220315 | TOTAL |
| 2020 | WYOMING | WY | WASHAKIE | 56043 | US PRESIDENT | JOSEPH R BIDEN JR | DEMOCRAT | 651 | 4032 | 20220315 | TOTAL |
| 2020 | WYOMING | WY | WASHAKIE | 56043 | US PRESIDENT | JO JORGENSEN | LIBERTARIAN | 65 | 4032 | 20220315 | TOTAL |
| 2020 | WYOMING | WY | WASHAKIE | 56043 | US PRESIDENT | OTHER | OTHER | 71 | 4032 | 20220315 | TOTAL |
| 2020 | WYOMING | WY | WASHAKIE | 56043 | US PRESIDENT | DONALD J TRUMP | REPUBLICAN | 3245 | 4032 | 20220315 | TOTAL |
| 2020 | WYOMING | WY | WESTON | 56045 | US PRESIDENT | JOSEPH R BIDEN JR | DEMOCRAT | 360 | 3560 | 20220315 | TOTAL |
| 2020 | WYOMING | WY | WESTON | 56045 | US PRESIDENT | JO JORGENSEN | LIBERTARIAN | 46 | 3560 | 20220315 | TOTAL |
| 2020 | WYOMING | WY | WESTON | 56045 | US PRESIDENT | OTHER | OTHER | 47 | 3560 | 20220315 | TOTAL |
| 2020 | WYOMING | WY | WESTON | 56045 | US PRESIDENT | DONALD J TRUMP | REPUBLICAN | 3107 | 3560 | 20220315 | TOTAL |
2. What years are included in our data? What’s inside the enigmatic
“mode” variable? Tabulate the variables year and
mode separately to find out.
table(elections$mode)
##
## 2ND ABSENTEE ABSENTEE ABSENTEE BY MAIL
## 120 1995 1038
## ADVANCED VOTING EARLY EARLY VOTE
## 477 453 450
## EARLY VOTING ELECTION DAY FAILSAFE
## 120 3737 230
## FAILSAFE PROVISIONAL IN-PERSON ABSENTEE MAIL
## 230 230 145
## ONE STOP PROV PROVISIONAL
## 500 477 1832
## TOTAL
## 60583
table(elections$year)
##
## 2000 2004 2008 2012 2016 2020
## 12628 9474 9474 9474 9474 22093
The years 2000, 2004, 2008, 2012, 2016, and 2020 are included in our data. The mode variable describes how people voted.
3. What is the unit of observation? In other words, what does each row correspond to?
Each row corresponds to the mode of voting for each candidate of each county.
4. How many variables are there?
summary(elections)
## year state state_po county_name
## Min. :2000 Length:72617 Length:72617 Length:72617
## 1st Qu.:2004 Class :character Class :character Class :character
## Median :2012 Mode :character Mode :character Mode :character
## Mean :2011
## 3rd Qu.:2020
## Max. :2020
##
## county_fips office candidate party
## Min. : 1001 Length:72617 Length:72617 Length:72617
## 1st Qu.:18103 Class :character Class :character Class :character
## Median :29205 Mode :character Mode :character Mode :character
## Mean :30264
## 3rd Qu.:45057
## Max. :56045
## NA's :57
## candidatevotes totalvotes version mode
## Min. : 0 Min. : 0 Min. :20220315 Length:72617
## 1st Qu.: 115 1st Qu.: 5175 1st Qu.:20220315 Class :character
## Median : 1278 Median : 11194 Median :20220315 Mode :character
## Mean : 10782 Mean : 42514 Mean :20220315
## 3rd Qu.: 5848 3rd Qu.: 29855 3rd Qu.:20220315
## Max. :3028885 Max. :4264365 Max. :20220315
##
Twelve variables
5. How many rows are there?
count(elections)
| n |
|---|
| 72617 |
72617 rows
1*. This data contains the results
of all presidential elections from 2000 to 2020. We only want to analyse
the 2020 results. Use the filter function to keep only the
observations that concern the 2020 election. Create a new object
elections_2020 for this task.
elections_2020=filter(elections,year==2020)
2. Our main variable of interest is candidatevotes. In
how many counties did a candidate get less than 10 votes (less or equal
10 )? How many different candidates were concerned?
candidatevotesq=elections_2020 %>%
filter(candidatevotes<=10)
n_distinct(candidatevotesq$candidate)
## [1] 4
n_distinct(candidatevotesq$county_fips)
## [1] 1310
nrow(candidatevotesq[candidatevotesq$candidate == 'OTHER',])
## [1] 2867
1310 counties; 2780>=candidartes>=4; there are 3 named candidates, however 2867 ‘OTHER’ named candidates, so they could either all be the same or all unique, therefore there is a large range of possible values.
3. We saw previously that there were different modes of voting
recorded in the data, and that one category is “Total”. This could be
worrying: are we counting ballots twice? Could a ballot appear both in
the “in person” and “total” categories? Cross-tabulate the
mode variable with state, to understand
whether the modes of voting are the same in all States. (Use the
table function)
xy=table(elections_2020$mode,elections_2020$state)
xy
##
## ALABAMA ALASKA ARIZONA ARKANSAS CALIFORNIA COLORADO
## 2ND ABSENTEE 0 0 0 0 0 0
## ABSENTEE 0 0 0 375 0 0
## ABSENTEE BY MAIL 0 0 0 0 0 0
## ADVANCED VOTING 0 0 0 0 0 0
## EARLY 0 0 0 0 0 0
## EARLY VOTE 0 0 75 375 0 0
## EARLY VOTING 0 0 0 0 0 0
## ELECTION DAY 0 0 75 375 0 0
## FAILSAFE 0 0 0 0 0 0
## FAILSAFE PROVISIONAL 0 0 0 0 0 0
## IN-PERSON ABSENTEE 0 0 0 0 0 0
## MAIL 0 0 0 0 0 0
## ONE STOP 0 0 0 0 0 0
## PROV 0 0 0 0 0 0
## PROVISIONAL 0 0 75 375 0 0
## TOTAL 201 205 0 0 290 320
##
## CONNECTICUT DELAWARE DISTRICT OF COLUMBIA FLORIDA
## 2ND ABSENTEE 0 0 0 0
## ABSENTEE 0 0 0 0
## ABSENTEE BY MAIL 0 0 0 0
## ADVANCED VOTING 0 0 0 0
## EARLY 0 0 0 0
## EARLY VOTE 0 0 0 0
## EARLY VOTING 0 0 0 0
## ELECTION DAY 0 0 0 0
## FAILSAFE 0 0 0 0
## FAILSAFE PROVISIONAL 0 0 0 0
## IN-PERSON ABSENTEE 0 0 0 0
## MAIL 0 0 0 0
## ONE STOP 0 0 0 0
## PROV 0 0 0 0
## PROVISIONAL 0 0 0 0
## TOTAL 40 12 5 335
##
## GEORGIA HAWAII IDAHO ILLINOIS INDIANA IOWA KANSAS
## 2ND ABSENTEE 0 0 0 0 0 0 0
## ABSENTEE 477 0 0 0 0 495 0
## ABSENTEE BY MAIL 0 0 0 0 0 0 0
## ADVANCED VOTING 477 0 0 0 0 0 0
## EARLY 0 0 0 0 0 0 0
## EARLY VOTE 0 0 0 0 0 0 0
## EARLY VOTING 0 0 0 0 0 0 0
## ELECTION DAY 477 0 0 0 0 495 0
## FAILSAFE 0 0 0 0 0 0 0
## FAILSAFE PROVISIONAL 0 0 0 0 0 0 0
## IN-PERSON ABSENTEE 0 0 0 0 0 0 0
## MAIL 0 0 0 0 0 0 0
## ONE STOP 0 0 0 0 0 0 0
## PROV 477 0 0 0 0 0 0
## PROVISIONAL 0 0 0 0 0 0 0
## TOTAL 0 20 176 510 460 0 315
##
## KENTUCKY LOUISIANA MAINE MARYLAND MASSACHUSETTS MICHIGAN
## 2ND ABSENTEE 0 0 0 120 0 0
## ABSENTEE 0 0 0 120 0 0
## ABSENTEE BY MAIL 0 0 0 0 0 0
## ADVANCED VOTING 0 0 0 0 0 0
## EARLY 0 0 0 0 0 0
## EARLY VOTE 0 0 0 0 0 0
## EARLY VOTING 0 0 0 120 0 0
## ELECTION DAY 480 0 0 120 0 0
## FAILSAFE 0 0 0 0 0 0
## FAILSAFE PROVISIONAL 0 0 0 0 0 0
## IN-PERSON ABSENTEE 0 0 0 0 0 0
## MAIL 0 0 0 0 0 0
## ONE STOP 0 0 0 0 0 0
## PROV 0 0 0 0 0 0
## PROVISIONAL 0 0 0 120 0 0
## TOTAL 0 256 80 0 70 415
##
## MINNESOTA MISSISSIPPI MISSOURI MONTANA NEBRASKA NEVADA
## 2ND ABSENTEE 0 0 0 0 0 0
## ABSENTEE 0 0 0 0 0 0
## ABSENTEE BY MAIL 0 0 0 0 0 0
## ADVANCED VOTING 0 0 0 0 0 0
## EARLY 0 0 0 0 0 0
## EARLY VOTE 0 0 0 0 0 0
## EARLY VOTING 0 0 0 0 0 0
## ELECTION DAY 0 0 0 0 0 0
## FAILSAFE 0 0 0 0 0 0
## FAILSAFE PROVISIONAL 0 0 0 0 0 0
## IN-PERSON ABSENTEE 0 0 0 0 0 0
## MAIL 0 0 0 0 0 0
## ONE STOP 0 0 0 0 0 0
## PROV 0 0 0 0 0 0
## PROVISIONAL 0 0 0 0 0 0
## TOTAL 435 410 580 168 279 68
##
## NEW HAMPSHIRE NEW JERSEY NEW MEXICO NEW YORK
## 2ND ABSENTEE 0 0 0 0
## ABSENTEE 0 0 0 0
## ABSENTEE BY MAIL 0 0 0 0
## ADVANCED VOTING 0 0 0 0
## EARLY 0 0 0 0
## EARLY VOTE 0 0 0 0
## EARLY VOTING 0 0 0 0
## ELECTION DAY 0 0 0 0
## FAILSAFE 0 0 0 0
## FAILSAFE PROVISIONAL 0 0 0 0
## IN-PERSON ABSENTEE 0 0 0 0
## MAIL 0 0 0 0
## ONE STOP 0 0 0 0
## PROV 0 0 0 0
## PROVISIONAL 0 0 0 0
## TOTAL 30 105 165 310
##
## NORTH CAROLINA NORTH DAKOTA OHIO OKLAHOMA OREGON
## 2ND ABSENTEE 0 0 0 0 0
## ABSENTEE 0 0 0 0 0
## ABSENTEE BY MAIL 500 0 0 308 0
## ADVANCED VOTING 0 0 0 0 0
## EARLY 0 0 0 308 0
## EARLY VOTE 0 0 0 0 0
## EARLY VOTING 0 0 0 0 0
## ELECTION DAY 500 0 0 308 0
## FAILSAFE 0 0 0 0 0
## FAILSAFE PROVISIONAL 0 0 0 0 0
## IN-PERSON ABSENTEE 0 0 0 0 0
## MAIL 0 0 0 0 0
## ONE STOP 500 0 0 0 0
## PROV 0 0 0 0 0
## PROVISIONAL 500 0 0 0 0
## TOTAL 0 212 352 0 180
##
## PENNSYLVANIA RHODE ISLAND SOUTH CAROLINA SOUTH DAKOTA
## 2ND ABSENTEE 0 0 0 0
## ABSENTEE 0 0 0 0
## ABSENTEE BY MAIL 0 0 230 0
## ADVANCED VOTING 0 0 0 0
## EARLY 0 0 0 0
## EARLY VOTE 0 0 0 0
## EARLY VOTING 0 0 0 0
## ELECTION DAY 0 0 230 0
## FAILSAFE 0 0 230 0
## FAILSAFE PROVISIONAL 0 0 230 0
## IN-PERSON ABSENTEE 0 0 230 0
## MAIL 0 0 0 0
## ONE STOP 0 0 0 0
## PROV 0 0 0 0
## PROVISIONAL 0 0 230 0
## TOTAL 201 24 0 198
##
## TENNESSEE TEXAS UTAH VERMONT VIRGINIA WASHINGTON
## 2ND ABSENTEE 0 0 0 0 0 0
## ABSENTEE 0 0 0 0 528 0
## ABSENTEE BY MAIL 0 0 0 0 0 0
## ADVANCED VOTING 0 0 0 0 0 0
## EARLY 0 0 145 0 0 0
## EARLY VOTE 0 0 0 0 0 0
## EARLY VOTING 0 0 0 0 0 0
## ELECTION DAY 0 0 145 0 532 0
## FAILSAFE 0 0 0 0 0 0
## FAILSAFE PROVISIONAL 0 0 0 0 0 0
## IN-PERSON ABSENTEE 0 0 0 0 0 0
## MAIL 0 0 145 0 0 0
## ONE STOP 0 0 0 0 0 0
## PROV 0 0 0 0 0 0
## PROVISIONAL 0 0 0 0 532 0
## TOTAL 285 1270 140 70 0 195
##
## WEST VIRGINIA WISCONSIN WYOMING
## 2ND ABSENTEE 0 0 0
## ABSENTEE 0 0 0
## ABSENTEE BY MAIL 0 0 0
## ADVANCED VOTING 0 0 0
## EARLY 0 0 0
## EARLY VOTE 0 0 0
## EARLY VOTING 0 0 0
## ELECTION DAY 0 0 0
## FAILSAFE 0 0 0
## FAILSAFE PROVISIONAL 0 0 0
## IN-PERSON ABSENTEE 0 0 0
## MAIL 0 0 0
## ONE STOP 0 0 0
## PROV 0 0 0
## PROVISIONAL 0 0 0
## TOTAL 220 360 92
As you can see, in some States the mode variable takes
exclusively the value “TOTAL” (in Alabama for instance). In those
States, we know the total number of votes obtained by candidates in each
county. In other States, we do not have the total number of votes, but
only a number disaggregated by voting method (in Arizona for
instance).
All of those situations are fine, since they mean ballots are not counted twice.
However there is one worrying case: Utah. In this State, we observe both the total number of votes, and the number of votes by mail, election day voting, and early voting. By looking more closely at the data, you can see that in the Salt Lake county (code 49035), we have the number of votes received early; on election day; and by mail. For all other counties, the ballot count for these voting modes is zero, and we only observe the “total” ballot count. If you wanted to be convinced more formally that this is true, you could run the following script:
elections_2020 %>%
filter(state=="UTAH") %>% #keep only UTAH observations
mutate(mode_vote_total = (mode=="TOTAL" & candidatevotes!=0), #create a dummy if we observe a ballot in the "total" category
mode_vote_other = (mode!="TOTAL" & candidatevotes!=0)) %>% # create a dummy if we observe a ballot in any other voting category
group_by(county_fips, county_name) %>% # group by county
summarise(mode_vote_total = max(mode_vote_total), #create a dummy if for any candidate in the county we observe a ballot in the "total" category
mode_vote_other = max(mode_vote_other)) %>% #create a dummy if for any candidate in the county we observe a ballot in any other voting category
View() #visualise the resulting dataset
#You can see that in no county do we observe a ballot both in "total" and in another category!
So it holds true that we never count a ballot twice: in every county, either we count it in the “total” category, or we count it in one of the categories associated to specific voting method. This is good news!
1. In States like Arizona where we don’t have the total number of ballots, it is not clear that the voting modes for which we have data cover all voting methods available.
For instance, in Kentucky we only have the number of ballots cast on election day. A quick search online shows that early voting is possible in this State, although it is heavily restricted. It is hence very likely that some ballots are missing in our data.
Now that we noticed this feature of the data, we want to check how big of an issue this is: how many votes are not counted in our data? Can we still reasonably use it, or is its quality too bad?
To answer this question, we will compare the total number of votes by State in our dataset to the official number provided by the Federal Government.
Create a new object called elections_2020_state_totals,
with two variables: state_po and
sum_votes_state: the sum of candidatevotes by
State. Hint: you should group your data by state_po, and
then use summarise. You should also disregard missing
values when you compute the sum of candidatevotes: find out
the argument of the sum function which lets you do
that.
elections_2020_state_totals=elections_2020 %>%
group_by(state_po)%>%
summarise(sum_votes_state=sum(candidatevotes,na.rm=FALSE))
view(elections_2020_state_totals)
2. The official election data in the US is provided by the Federal Election Commission, in this document. Notice that it is in pdf format. We already extracted it into an excel file for you.
Download the file here
Import it into a new object named official_state_votes,
using the read_excel function from the readxl
package.
official_state_votes <- read_excel("~/Desktop/OneDrive/state_votes.xlsx")
3. Both your elections_2020_state_totals and
official_state_votes should have 51 observations, one per
State in the US + Washington DC. But now the official number of votes
and the one from our dataset are in two different R objects. You could
compare the numbers manually, but this would be a lot of work. Instead,
we want to merge them into a single object.
We will do this using the left_join() function from the
tidyverse package.
This is how left_join() works:
left_join(x,y) will return all rows from x
(very important!) and all columns from x and
y. Rows in x with no match in y
will have NA values in the new columns. Rows in
y with no match in x will be discarded.
The following cheatsheet might help you visualise the way the function works: https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf. You can also find more details here.
If nothing is specified, left_join will automatically
detect the variables which have the same name in both datasets and use
them to join the data. If your variables have different names in the
datasets, you can specify
left_join(x,y, by=c("var1x" = "var1y", "var2x" = "var2y")).
This will match variable var1x in x to
var1y in y and var2x in
x to var2y in y. (You can keep
going, and join on more than two variables!)
Create a new object called
elections_2020_state_comparison, which is the
left_join of elections_2020_state_totals (1st
argument) and official_state_votes (2nd argument). Here,
our identifying variables have different names, so we need to add the
argument by=c("state_po" = "state_short")
elections_2020_state_comparison=left_join(elections_2020_state_totals, official_state_votes, by=c("state_po"="state_short"))
Congratulations, you have merged your first datasets! Easy right? :)
4. Compute the percentage difference between the official number of
votes, and the sum of votes in our data by State:
(official votes - sum votes)/official votes*100. What are
the min, max and mean of this percentage difference?
elections_2020_state_comparison$percentagedifference= ((elections_2020_state_comparison$official_total_votes - elections_2020_state_comparison$sum_votes_state)/elections_2020_state_comparison$official_total_votes)*100
summary(elections_2020_state_comparison$percentagedifference)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.924807 0.000000 0.000000 -0.044041 0.001063 0.545434
Min: -0.924807; Mean: -0.044041; Max: 0.545434
If our election data contained the official number of votes per State, the difference would always be zero. Clearly, this is not the case: in some States our data has too few votes, in others it has too many. However, the difference is always very very small and should not dramatically change the final results obtained by each candidate.
Still, the discrepancy is not easy to explain: the MIT lab collects official results from all the States and puts them together into a national file. It is not clear where the difference is coming from: maybe there was human error in data collection, maybe the results were updated in some States after a recount…
Since the United States government itself does not provide official county-level results for the entire country, the only alternative to using the MIT data would be to start from State level official results. However, this would be a lot of work as we would have to find all the files, deal with format issues (many data files are still provided in pdf, for instance in Kentucky) and clean them all. It would be time consuming, and we would likely make more errors than the MIT lab!
1. Now that we have decided that our data was good enough, we still
need to tidy it before we can use it. Make sure that the candidate name
is always written with the same format by tabulating the
candidate variable.
table(elections_2020$candidate)
##
## DONALD J TRUMP JO JORGENSEN JOSEPH R BIDEN JR OTHER
## 5117 4955 5117 6904
2*. Remember that we noticed earlier
there were missing values in the candidatevotes and
totalvotes variables.
Run the following script to fix the issue (replace
eval = FALSE by eval = TRUE below, so that the
code is also run when you knit). This chunk of code:
candidate_votes by zero when it is
missingtotal_votes_county by the sum of votes in the
county when it is missing.Note that we create a new object,
elections_2020_clean.
The ifelse function works in the following way:
ifelse(test, value if test==TRUE, value if test==FALSE).
For instance in the third line below, for each observation, if
candidatevotes is missing then it is replaced by 0,
otherwise it is not changed.
elections_2020_clean <- elections_2020 %>%
group_by(county_fips) %>%
mutate(candidatevotes = ifelse(is.na(candidatevotes), 0, candidatevotes),
totalvotes = ifelse(is.na(totalvotes), sum(candidatevotes), totalvotes))
3*. Clean some more the
elections_2020_clean object by:
Summarising the data so that the unit of observation is a
candidate in a county. To do this, you need to sum the number of votes
obtained for each candidate in each county, across all modes of voting.
Note: in order to keep all the variables, you will need to use the
following command
group_by(state, state_po, county_name, county_fips, totalvotes, candidate).
If we only grouped by county_fips, candidate, we would
“lose” all the other “extra” variables that give us information about
the county (its name, its state, the number of votes, …). If this is not
clear to you, simply run the code twice, once grouping by all the
variables, and once with group_by(county_fips, candidate),
and see what happens!
Creating a variable pct_votes which is equal to the
percentage of votes obtained by a candidate in a county: (number of
votes obtained by a candidate / total number of votes in
county)*100.
Grouping the data by county (county_fips), and then
use a combination of rank() and desc()
commands to compute the ranks of each candidate within each county. You
should create a variable called candidate_rank which takes
the value 1 for the candidate who received the most vote in the county,
2 for the candidate who received the second highest number of votes
etc.
Once you’ve ensured your code is correct, “ungroup” the data by
passing your object into the ungroup() function or by
piping it. This function does what it says: it removes grouping. This
means that it cancels the effect of your last group_by, and
ensures that the next functions you use will not be applied by
group.
You can pipe all these commands together if you wish or proceed in several steps.
elections_2020_clean=elections_2020_clean %>%
group_by(state, state_po, county_name, county_fips, totalvotes, candidate) %>%
summarise(candidatevotes=sum(candidatevotes))
pctvotes= ((elections_2020_clean$candidatevotes / elections_2020_clean$totalvotes)*100)
elections_2020_clean$pctvotes=pctvotes
#Grouping the data by county (county_fips), and then use a combination of rank() and desc() commands to compute the ranks of each candidate within each county. You should create a variable called candidate_rank which takes the value 1 for the candidate who received the most vote in the county, 2 for the candidate who received the second highest number of votes etc.
elections_2020_clean=elections_2020_clean %>%
arrange(county_fips, candidatevotes) %>%
group_by(county_fips) %>%
mutate(rank = rank(-candidatevotes))
view(elections_2020_clean)
4. Check that your variable pct_votes is actually a
percentage (that it is comprised within 0 and 100). This is a safety
check to make sure that there is no inconsistency in the data.
range(elections_2020_clean$pctvotes, na.rm = TRUE)
## [1] 0.00000 96.18182
Election data are probably better visualized using maps.
In this section we will produce two maps displaying:
To produce the maps you will need to install and load the
sf, tmap and rmapshaper
packages.
Maps are tricky and very complicated objects. But to boil it down to the essential ingredients, you need a “shapefile” which is basically a file that contains the map onto which data will be plotted.
This file is provided by the US census website here.
However it is somewhat heavy, so to make your life easier, we have
simplified it for you using the following commands. Do
NOT touch nor run this chunk, and do NOT
replace eval = FALSE by eval = TRUE. We’ve
left it so you can see how we created the simplified shapefile you will
be loading in the next question.
library(sf)
library(tmap)
library(rmapshaper)
county_shfl_full <- read_sf("path/cb_2018_us_county_500k/cb_2018_us_county_500k.shp")
# where `path` corresponds to where the `cb_2018_us_county_500k` folder is located.
county_shfl <- st_simplify(county_shfl_full, preserveTopology = FALSE, dTolerance = 1000) #simplify the map to make it less memory-heavy
save(county_shfl, file="path/county_shfl.RData") #saving the object
You can download the resulting R object from this link: > https://www.dropbox.com/s/m8zq5e38imp2he1/county_shfl.RData?dl=1
1. Let’s load the county_shfl.RData file by using the
load() function. Replace eval = FALSE by
eval = TRUE.
load("/Users/katherinefrancke/Library/CloudStorage/OneDrive-sciencespo.fr/county_shfl.RData")
You should now have an object called county_shfl in your
environment
2. FIPS codes are numbers which uniquely identify geographic areas in the US. Working with FIPS codes is always advisable when identifying areas, as names can be duplicates: there is an Aidar county in Kentucky and another one in Iowa.
What is the county FIPS code of the “FLEMING” county in the
elections_2020_clean data.frame? What are the State FIPS
(STATEFP), county FIPS (COUNTYFP) and
GEOID of the same county in the county_shfl
data? Hint: in county_shfl the county’s name is written
“Fleming” and the variable containing county names is
NAME.
elections_2020_clean %>%
filter(county_name=="FLEMING")
county_shfl %>%
filter(NAME=="Fleming")
| STATEFP | COUNTYFP | COUNTYNS | AFFGEOID | GEOID | NAME | LSAD | ALAND | AWATER | geometry |
|---|---|---|---|---|---|---|---|---|---|
| 21 | 069 | 00516881 | 0500000US21069 | 21069 | Fleming | 06 | 9.03e+08 | 7.18e+06 | list(c(-83.984282, -83.921376, -83.916131, -83.88036, -83.85914, -83.846644, -83.835656, -83.801869, -83.768357, -83.74408, -83.734005, -83.735331, -83.73823, -83.710803, -83.703959, -83.692767, -83.685828, -83.641322, -83.586792, -83.560332, -83.559706, -83.507953, -83.499543, -83.487972, -83.476871, -83.465424, -83.463696, -83.581427, -83.582588, -83.589503, -83.593335, -83.6261, -83.641852, -83.85928, -83.930348, -83.984282, 38.445493, 38.383309, 38.360152, 38.360966, 38.343529, 38.337216, 38.30263, 38.279276, 38.293533, 38.284713, 38.293878, 38.278838, 38.268149, 38.26817, 38.244383, 38.2056, 38.214107, 38.174944, 38.21828, 38.220903, 38.231253, 38.273155, 38.318717, 38.352905, 38.359256, 38.357992, 38.393352, 38.43655, 38.462461, 38.473977, 38.489031, 38.512177, 38.525381, 38.456348, 38.492278, 38.445493)) |
The county FIPS code of the “FLEMING” county is 21069; State FIPS is 21, county FIPS is 069; and GEOID is 21069.
3. You might have noticed that the county_fips variable
in our elections_2020_clean is actually made of the State
FIPS code, followed by the county code.
In county_shfl, we have a variable for the State FIPS
code and a separate one for the county code. The variable containing the
same code as in elections_2020_clean is actually called
GEOID. This type of thing is exactly why you always need to
look inside your data: it would not have been obvious from the variable
names.
What is the type of county_fips in the
elections_2020_clean data? What is the type of
GEOID in county_shfl? Hint: use the
typeof function.
typeof(elections_2020_clean$county_fips)
## [1] "integer"
typeof(county_shfl$GEOID)
## [1] "character"
The county_fips data in elections_2020_clean are integers, the GEOID data in county_shfl are characters.
It is not the same! In order to merge the two datasets, we want
county_fips to have the same type as in GEOID
and be a character string instead.
Furthermore, because it is stored as a numeric value there is no leading zero (the FIPS code “01001” is stored as 1001).
The following script fixes this, and creates a new variable called
county_fips_clean. Run the code and change
eval = FALSE to eval = TRUE to make sure the
code runs when you knit your document.
We start by using the as.character function to convert
the variable to a character string. Then, we use the ifelse
function again. What this conditional statement says is: if the number
of characters of the variable county_fips_clean is equal to
4, then add a “0” in front of this variable; otherwise do nothing.
elections_2020_clean <- elections_2020_clean %>%
mutate(county_fips_clean = as.character(county_fips), #convert county_fips to a character string
county_fips_clean = ifelse(nchar(county_fips_clean) == 4,
paste0("0", county_fips_clean),
county_fips_clean)) #Add a leading zero when necessary: if there are only four characters to the FIPS code, it means that the leading 0 is missing
4. How many different State FIPS codes are there in the
county_shfl dataset? Hint: You can use a combination of the
length and unique commands to count them.
n_distinct(county_shfl$STATEFP)
## [1] 56
56
5. Our shapefile clearly contains the map for more than just the 50 US States + DC. By looking at the existing State codes here, you can see that codes above 56 correspond to overseas territories.
The final map we want to produce should only contain the 50 States + DC: we want to remove overseas territories. We also exclude Alaska and Hawaii for purely aesthetic reasons: including them zooms out the map too much. We could fix this problem by showing Alaska and Hawaii in a corner of the map but this is beyond the scope of the midterm.
Filter county_shfl_clean to keep only observations such
that:
STATEFP is not equal to “02” (Alaska) or “15”
(Hawaii)as.numeric(STATEFP) is lower or equal to 56 (note
that because State FIPS is stored as a character, we have to convert it
to a numeric before evaluating whether it is below 56)Create a new object called county_shfl_clean.
county_shfl_clean=county_shfl %>%
filter(STATEFP!="02", STATEFP!="15")%>%
filter(as.numeric(STATEFP)<=56)
Furthermore, some counties’ FIPS have changed over the years, and
they differ in our two datasets. The following script will fix all these
inconsistencies for you. The case_when function is similar
to ifelse. Here it modifies the variable
county_fips_clean, such that :
county_fips_clean was equal to “46113”, it now takes
the value “46102”state_po is equal to “DC”,
county_fips_clean now takes the value “11001”county_fips_clean keeps the same value.Remember to replace eval = FALSE with
eval = TRUE.
elections_2020_clean <- elections_2020_clean %>%
mutate(county_fips_clean = case_when(
county_fips_clean == "46113" ~ "46102",
state_po == "DC" ~ "11001",
TRUE ~ county_fips_clean))
6. Use left_join to merge county_shfl_clean
(first argument) and elections_2020_clean (second
argument). Create a new object called elections_2020_map
for this question.
Here, our identifying variables have different names:
GEOID in county_shfl_clean (first argument)
and county_fips_clean in elections_2020_clean
(second argument). Look at the help file or go back to question 3 of the
“Checking quality” section to see how to specify the by argument.
elections_2020_map=left_join(county_shfl_clean,elections_2020_clean,by=c("GEOID"="county_fips_clean"))
7. Create a heat map of Biden and Trump’s percentage of the vote by
county, using the following code (change eval = FALSE to
eval = TRUE just below so that the html file displays the
map):
tmap_mode("view") # we want an interactive map
elections_2020_map %>%
# only keep Trump and Biden
filter(candidate %in% c("DONALD J TRUMP", "JOSEPH R BIDEN JR")) %>%
tm_shape() +
tm_borders(col="white", lwd = 0.3) + # white and thin (line width) borders
tm_fill(
col = "pctvotes", # variable to be mapped
title = "% of votes", # legend title
id = "county_name", # information to display when mouse hovers over a departement
popup.vars = c("Vote %:" = "pctvotes")) + # variable to display in popup window
tm_facets(by = "candidate") # create one map per selected candidate
Wasn’t that easy (well, sort of ;) ) ! And now you have 2 comparable maps of the vote percentages received by Biden and Trump! Hover over the counties with your mouse. The name of the county will show up. If you click on a county you will get the vote percentage received by the candidate. Isn’t that cool!
8. Create a map displaying the candidate who received the highest percentage of votes in each county. Try to create this map by yourself using the code above (and the comments in it). You don’t need to create any new variables. Don’t hesitate to use Google (or your favorite search engine) to help you out. Don’t forget to change the title of the legend.
Add the following argument in tm_fill() to have a nice color palette
for your map: palette = c("red","blue").
tmap_mode("view") # we want an interactive map
elections_2020_map %>%
# this chooses the candidate with the highest percentage votes in each county
group_by(county_fips) %>% slice(which.max(pctvotes)) %>%
tm_shape() +
tm_borders(col = "white", lwd = 0.3) + # white and thin (line width) borders
tm_fill(
col = "candidate", # variable to be mapped
title = "2020 US Election, Who Won Each County?", # legend title
id = "county_name", # information to display when mouse hovers on a departement
palette = c("red","blue"),
popup.vars = c("Vote %:" = "pctvotes")) # variable to display in popup window
A covariate is simply the name we give to variables containing characteristics of our unit of interest (here, characteristic of counties). A covariate can be of direct interest if it is an independent variable in the model. It can also be an extra characteristic we add to the regression to avoid the omitted variable bias which we will discuss later in the course.
Note: The websites have been known to have issues with Mozilla Firefox. Please use another browser, such as Google Chrome or Microsoft Edge, to download the files. > Download county-level social data from the American Community Survey (ACS) here: > https://data.census.gov/cedsci/table?q=DP02&g=0100000US%240500000&tid=ACSDP5Y2019.DP02
Click “DOWNLOAD TABLE” and select the year 2019 for “ACS 5-Year Estimates Data Profiles” and download. Unzip the folder once downloaded.
Repeat for ACS economic data: https://data.census.gov/cedsci/table?q=DP03&g=0100000US%240500000&tid=ACSDP5Y2019.DP03
1*. Import these csv files (the ones
with -Data in their name) into objects called
acs_soc_2019 and acs_eco_2019
respectively.
Note: If you work in groups and want the code to run on all your laptops, make sure that your files have the same name.
acs_soc_2019 <- read.csv("~/Desktop/Sem 4/R/ACSDP5Y2019.DP02-Data.csv")
acs_eco_2019 <- read.csv("~/Desktop/Sem 4/R/ACSDP5Y2019.DP03-Data.csv")
2. How many variables are there in each dataset?
ncol(acs_soc_2019)
## [1] 1227
ncol(acs_eco_2019)
## [1] 1099
1227 variables in acs_soc_2019; 1099 variables in acs_eco_2019
3*. That’s a lot! We need to keep only the variables we are interested in, and tidy the data a little. For each acs data frame:
old_name ->
new_name)acs_eco_2019 : DP03_0009PE ->
unemp_rateacs_eco_2019 : DP03_0062E ->
med_incomeacs_soc_2019 : DP02_0152PE ->
pct_computerThe new name tells you what the variable actually contains.
unemp_rate is the unemployment rate in the county.
med_income is the median households income and
pct_computer is the percentage of households owning a
computer.
NAME and
GEO_IDFor this task, create two new objects:
acs_soc_2019_clean and acs_eco_2019_clean.
acs_soc_2019_clean=acs_soc_2019 %>%
rename(pct_computer=DP02_0152PE) %>%
subset(select= c(NAME, GEO_ID,pct_computer))
acs_eco_2019_clean=acs_eco_2019 %>%
rename(med_income=DP03_0062E) %>%
rename(unemp_rate=DP03_0009PE) %>%
subset(select = c(NAME, GEO_ID, med_income, unemp_rate))
4. Display the first five lines of acs_soc_2019_clean
and acs_eco_2019_clean
head(acs_soc_2019_clean,5)
| NAME | GEO_ID | pct_computer |
|---|---|---|
| Geographic Area Name | Geography | Percent!!COMPUTERS AND INTERNET USE!!Total households!!With a computer |
| Autauga County, Alabama | 0500000US01001 | 89.2 |
| Baldwin County, Alabama | 0500000US01003 | 90.9 |
| Barbour County, Alabama | 0500000US01005 | 77.2 |
| Bibb County, Alabama | 0500000US01007 | 78.1 |
head(acs_eco_2019_clean,5)
| NAME | GEO_ID | med_income | unemp_rate |
|---|---|---|---|
| Geographic Area Name | Geography | Estimate!!INCOME AND BENEFITS (IN 2019 INFLATION-ADJUSTED DOLLARS)!!Total households!!Median household income (dollars) | Percent!!EMPLOYMENT STATUS!!Civilian labor force!!Unemployment Rate |
| Autauga County, Alabama | 0500000US01001 | 58731 | 3.7 |
| Baldwin County, Alabama | 0500000US01003 | 58320 | 4.3 |
| Barbour County, Alabama | 0500000US01005 | 32525 | 9.2 |
| Bibb County, Alabama | 0500000US01007 | 47542 | 7.3 |
5. What is in the first line of the datasets?
A description of the variables.
6*. Drop the first row of
acs_soc_2019_clean and acs_eco_2019_clean.
Hint: the negative sign should help you.
acs_soc_2019_clean <- acs_soc_2019_clean[-1,]
acs_eco_2019_clean <- acs_eco_2019_clean[-1,]
7*. Merge the
acs_soc_2019_clean and acs_eco_2019_clean
together using left_join. You don’t need to specify the
by argument here, since the common variabes
GEO_ID and NAME should have the same name in
both datasets.
Create a new object called acs_2019_clean.
acs_2019_clean=left_join(acs_soc_2019_clean,acs_eco_2019_clean)
8. What is the type of the pct_computer,
unemp_rate and med_income variables?
typeof(acs_2019_clean$pct_computer)
## [1] "character"
typeof(acs_2019_clean$unemp_rate)
## [1] "character"
typeof(acs_2019_clean$med_income)
## [1] "character"
The variables of pct_computer, unemp_rate, and med_income are characters.
9*. In the
acs_2019_clean dataframe, convert the
pct_computer, unemp_rate and
med_income variables to numeric, using the
as.numeric() function. This function simply converts
character strings to numbers: for example
as.numeric("17.9") = 17.9. Hint: to convert the variable,
simply use the dplyr verb to create a new variable and give
it the same name as the variable you are modifying.
Ignore the warning about NAs being introduced. They are due to
observations for which pct_computer was already missing (it
was equal to “null”)
acs_2019_clean <- acs_2019_clean %>%
mutate(pct_computer=as.numeric(pct_computer)) %>%
mutate(unemp_rate= as.numeric(unemp_rate)) %>%
mutate(med_income=as.numeric(med_income))
typeof(acs_2019_clean$med_income)
## [1] "double"
10. What is the GEO_ID of the county whose
NAME is “Fleming County, Kentucky” in the
acs_eco_2019_clean and acs_soc_2019_clean
dataframes?
acs_2019_clean %>%
filter(NAME=="Fleming County, Kentucky")
| NAME | GEO_ID | pct_computer | med_income | unemp_rate |
|---|---|---|---|---|
| Fleming County, Kentucky | 0500000US21069 | 81.4 | 4.46e+04 | 7.4 |
11*. The geographical ID of the county is different to what we had last time! You might have noticed that this time, the 5-digit FIPS code is preceded by a prefix: “0500000US”. We want to extract the FIPS code from this geographic identifier.
Create a new variable, called county_fips_clean which
contains only the 5 last characters of the GEO_ID variable.
Use the substr() function. This function extracts a
substring from a character string. You need to give it the original
string, and the position of the first and last character of the
substring. Example: substr("abcde", 2, 4) = "bcd", i.e. you
get letters 2 to 4. Here we want to extract characters 10 to 14 of the
GEO_ID variable. As with as.numeric you have
to create the variable using the relevant dplyr verb.
acs_2019_clean <- acs_2019_clean %>%
mutate(county_fips_clean=substr(GEO_ID,10, 14))
12*. Now that our covariates data is
tidy, we can finally merge it with the elections data. Merge
elections_2020_clean and acs_2019_clean using
left_join. Note: elections_2020_clean needs to
be the first argument). Create a new object called
elections_2020_covariates.
elections_2020_covariates=left_join(elections_2020_clean,acs_2019_clean)
1*. In the regression analysis, we will only focus on the votes obtained by Donald Trump.
Filter elections_2020_covariates to keep only the
observations with the votes for Donald Trump.
Keep only the following variables: state_po,
county_name, county_fips_clean,
pct_votes, unemp_rate,
med_income, pct_computer.
Rename pct_votes ->
pct_votes_trump.
Create a new object called elections_2020_reg_trump for
this task.
elections_2020_reg_trump=elections_2020_covariates %>%
filter(candidate=="DONALD J TRUMP") %>%
ungroup(county_fips)%>%
select(c('state_po', 'county_name', 'county_fips_clean', 'pctvotes', 'unemp_rate', 'med_income', 'pct_computer')) %>%
rename(pct_votes_trump=pctvotes)
Support time: If you are not confident that your answers to the necessary red-starred questions are all correct, you can now download a pre-cleaned dataset here. This file contains the data you should have arrived to after all these questions. This means that even if you made mistakes before, you can get a fresh start for the regressions and do them correctly. We will not penalise you for using this data.
If you choose to use this option, you can load the
elections_2020_reg_trump.RData file you just downloaded by
using the load() function. Replace
eval = FALSE by eval = TRUE.
2. Use the skim() function from the skimr
package to obtain summary statistics for pct_computer,
unemp_rate and med_income.
elections_2020_reg_trump %>%
skim(pct_computer,unemp_rate,med_income)
| Name | Piped data |
| Number of rows | 3155 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| pct_computer | 40 | 0.99 | 85.40 | 6.32 | 43.2 | 82.05 | 86.3 | 89.70 | 98.3 | ▁▁▁▇▇ |
| unemp_rate | 40 | 0.99 | 5.26 | 2.57 | 0.0 | 3.60 | 4.9 | 6.45 | 24.9 | ▇▇▁▁▁ |
| med_income | 40 | 0.99 | 53363.62 | 14147.87 | 21504.0 | 44134.00 | 51679.0 | 59642.50 | 142299.0 | ▃▇▁▁▁ |
We can see that are some missing values. If you want you can look at which counties have missing data but otherwise, we’ll just ignore them for this midterm.
3. To facilitate plotting we will reshape the data in long format.
(The difference between long and wide data is explained in this article.) We are
going to use the pivot_longer() function from the
tidyverse package to reshape the data. You can find
examples of how the pivot_longer() function works by typing
?pivot_longer in your console. You can also find
information and examples about pivot_longer and its
associated function pivot_wider here
For our purposes we will use the following 4 arguments in
pivot_longer():
data = elections_2020_reg_trump (notice that if you use
the %>% pipe you don’t need to specify the data
argument);cols = c("pct_computer", "unemp_rate", "med_income").
These are the columns to be reshaped.;names_to = "covariate". This is the name of a new
variable to be created, where we will store the name of the
covariate;values_to = "value" This is the name of a new variable
to be created, where we will store the values of the covariates.If this is somewhat unclear to you, simply run the code and check how it differs from the previous dataset. This should clear things up.
Create a new object elections_2020_plot_trump for this
task. View the created dataframe to ensure it looks as expected.
elections_2020_plot_trump=elections_2020_reg_trump %>%
pivot_longer(cols = c("pct_computer", "unemp_rate", "med_income"),names_to = "covariate", values_to = "values")
view(elections_2020_plot_trump)
4. Produce 3 graphs displaying the relationship between the vote
percentage for Donald Trump and our 3 covariates. Each graph should be
on a different row. Make sure that your graph has a meaningful y-axis
label and that the y-axis is between 0 and 100 (use the
scale_y_continuous() layer for this). (Hint: use the
facet_wrap layer in ggplot to make these 3
plots in the same graph. Look at the facet_wrap arguments
to ensure each graph is on its own row and the scales of each plot
change freely.)
elections_2020_plot_trump %>%
ggplot() + aes(x=values, y=pct_votes_trump) + labs(x="Covariate",y="Percentage of Votes for Trump")+geom_line()+scale_y_continuous()+facet_wrap(~covariate,scales = "free")
5. Add a regression line on these plots using the
geom_smooth layer to better see the relationship’s sign.
Set the argument method to "lm" (for linear
model) and se to FALSE (so that it does not
display a confidence interval around the regression line; we’ll discuss
this towards the end of the course but you can set the argument to
TRUE to see what it does).
elections_2020_plot_trump %>%
ggplot()+ aes(x=values, y=pct_votes_trump) + labs(x="Covariate",y="Percentage of Votes for Trump")+geom_line()+scale_y_continuous()+facet_wrap(~covariate,scales = "free")+geom_smooth(method=lm,se=FALSE)
6. From the graphs, is the relationship between the different covariates and Donald Trump’s vote share positive or negative? Rather strong or rather weak? The relationship is negative and rather strong
7. Run a simple linear regression of Donald Trump’s vote percentage
on each of our covariate variables separately (i.e. 3 simple linear
regressions). For this question you should use
elections_2020_reg_trump as the data source.
Use the export_summs() function from the
jtools package to see the results from the three
regressions side by side.
view(elections_2020_reg_trump)
unemp_rateregression=lm(pct_votes_trump ~ unemp_rate,elections_2020_reg_trump)
med_incomeregression=lm(pct_votes_trump ~ med_income,elections_2020_reg_trump)
pct_computerregression=lm(pct_votes_trump~pct_computer, elections_2020_reg_trump)
regressiontabletrump=export_summs(unemp_rateregression,med_incomeregression,pct_computerregression)
Don’t pay attention to the stars next to the coefficients and to the number in parenthesis below the coefficients, we’ll get to them later on in the course.
8. What is the interpretation of the coefficients for
pct_computer, unemp_rate and
med_income? Be precise.
The negative coefficient for pct_computer signifies that the percentage of people voting for Trump decreases as the percentage of people with a computer increases in a given county. For every 1% increase in the amount of people owning a computer, there is a 0.5% decrease in the rates of vote for Trump. The negative coefficient for unemp_rate signifies that as the percentage of people percentage of people that are unemployed increases in a given county, the percentage of people who vote for Trump decreases in that same county. For every 1% increase in unemployment, the percentage of people voting for Trump decreases by 1.38%. The negative coefficient for med_income signifies that as the median income in a given county increases, the percentage of people who vote for Trump decreases. Still, the relationship between median income and voting decisions is relatively weak: as median income increases by 1%, votes for Trump decrease by 0.0002996%.
9. Why is the coefficient for median income so small?
The coefficient for median income is likely so small because values of income can range at an incredibly large scale, so it is unlikely that changes can simply be measured by median measurements.
10. Why does the number of observations not vary for each regression?
The number of observations does not vary for each regression because the observations all come from the same set of data.
11. Which variable’s variance explains more of the variance in Donald Trump’s vote?
The unemployment rate best explains the variance in Donald Trump’s vote.
12. Can you interpret these estimates causally? Why not? How else
could you interpret the regression on pct_computer? How
would you go about trying to assess whether median income or the
unemployment rate at the county level has a causal impact on the vote
percentage for Donald Trump?
We cannot interpret the estimates causally because we have not isolated the variables from outside factors. There could be lurking variables that also impact the relationships, so we can only make conclusions based on correlation. For example, we could interpret the regression on pct_computer as being a function of income or age. We could assess whether median income or unemployment rate at the county level has a causal impact on the vote percentage for Donald Trump by setting the median income to a certain level and finding the unemp_rate that corresponds to that result, and then running a regression with these results and the Trump vote.
And that’s it for this midterm! We hope you had fun, or at least
some fun, doing it and that you improved your R
skills. We also hope you got a sense of what “real” data looks like and
how it can be tamed by using the great tools that R
provides.