This exam consists of 16 questions. You must hand in an .rmd file as well as a knitted PDF or Word document. The .rmd source of this exact file is available, and you are welcome to use it as a template for your answers.
You are not allowed to share code with classmates. You may ask the TAs or me clarifying questions. If you are stuck on something and can’t continue to the next part of the assignment, you can ask a TA or me for the code you need to continue, but expect to lose a couple of points. Take the time to add titles, labels, etc. to make your graphs look professional.
For the final exam we are going to be working with election returns from US Senate elections.
Comprehensive results for Senate races have been compiled by the MIT Election Data and Science Lab. We’re going to do some cleaning to get this into a format that we can analyze and display on a map. The changes we make to the data are cumulative, so you should assume that changes you make in one question (filtering, selecting variables, etc.) apply to all subsequent questions. For example, in Q6 you will remove Independent candidates from the data. All subsequent questions use this filtered data with no Independents.
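library(tidyverse) # read_csv, dplyr verbs, tidyr, stringr, ggplot2
library(ggrepel) # geom_label_repel
us_senate <- read_csv("Data/Raw/1976-2020-senate.csv")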
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## year = col_double(),
## state = col_character(),
## state_po = col_character(),
## state_fips = col_double(),
## state_cen = col_double(),
## state_ic = col_double(),
## office = col_character(),
## district = col_character(),
## stage = col_character(),
## special = col_logical(),
## candidate = col_character(),
## party_detailed = col_character(),
## writein = col_logical(),
## mode = col_character(),
## candidatevotes = col_double(),
## totalvotes = col_double(),
## unofficial = col_logical(),
## version = col_double(),
## party_simplified = col_character()
## )
year, state, state_po, stage, candidate, party_detailed, candidatevotes, totalvotes.
us_senate <- us_senate %>%
  select('year', 'state', 'state_po', 'stage', 'candidate', 'party_detailed', 'candidatevotes', 'totalvotes')
clean_us_senate <- us_senate %>%
filter(!is.na(candidate), !is.na(party_detailed))
clean_us_senate <- clean_us_senate %>%
mutate(party3 = "I",
party3 = ifelse(party_detailed == "DEMOCRAT", "D", party3),
party3 = ifelse(party_detailed == "REPUBLICAN", "R", party3),
party3 = ifelse(party_detailed == "I", "I", party3))
clean_us_senate %>%
group_by(party3) %>%
summarise(number = n())
clean_us_senate <- clean_us_senate %>%
filter(party3 != "I", stage == "gen")
over_2_contest <- clean_us_senate %>%
group_by(state, year) %>%
summarise(count = n()) %>%
filter(count > 2)
## `summarise()` has grouped output by 'state'. You can override using the `.groups` argument.
over_2_contest %>%
group_by(state) %>%
summarise(count = n()) %>%
top_n(1)
## Selecting by count
louisiana_votes <- clean_us_senate %>%
filter(state == "LOUISIANA") %>%
select('year', 'candidate', 'candidatevotes', 'totalvotes', 'party3') %>%
mutate(perc_vote = (candidatevotes/totalvotes)*100)
ggplot(louisiana_votes, aes(x = year, y = perc_vote, color = party3, label = candidate)) +
geom_jitter(size = 1) +
scale_x_continuous(breaks = seq(1978, 2020, 1), limits=c(1978,2020)) +
scale_y_continuous(breaks = seq(0, 100, 25), limits=c(0, 100)) +
geom_label_repel(aes(label = paste0("(", candidate, ", ", round(perc_vote, 1), ")")), size = 1, box.padding = 0.5, max.overlaps = Inf) +
ggtitle("Louisiana Voting Trends") +
xlab("Year of Election") +
ylab("Percentage of Votes Per Candidate") +
scale_color_manual(name = "Party", values = c("D" = "blue", "R" = "red"), labels = c("Democrat", "Republican")) +
theme(legend.position = "none") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 7 rows containing missing values (geom_point).
year, state, party3, and the candidate percent you calculated in the previous question. Reshape this data so that there is only one row per state, and two columns that represent the percent of the vote won by the Republican candidate and the percent of the vote won by the Democratic candidate. Note that you will not have 50 rows because not all states have a Senate election in an election year.
votes_2012 <- clean_us_senate %>%
filter(year == 2012) %>%
mutate(perc_vote = (candidatevotes/totalvotes)*100) %>%
select('year', 'state', 'party3', 'perc_vote')
votes_2012_clean <- pivot_wider(votes_2012, names_from = party3, values_from = perc_vote)
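The long-to-wide reshape can be easier to follow on a tiny table. The following is a minimal sketch, assuming a made-up data frame `toy_long`; the state names and percentages are invented for illustration and are not taken from the election data.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: names and values are invented for illustration only
toy_long <- tribble(
  ~year, ~state,    ~party3, ~perc_vote,
  2012,  "STATE A", "D",     55,
  2012,  "STATE A", "R",     45,
  2012,  "STATE B", "D",     40,
  2012,  "STATE B", "R",     60
)

# Each distinct party3 value becomes a column; the remaining columns
# (year, state) identify the rows, so the result has one row per state
toy_wide <- pivot_wider(toy_long, names_from = party3, values_from = perc_vote)
toy_wide
```

If a state has more than one candidate from the same party in a given year, `pivot_wider()` cannot place a single value in each cell and will warn and return list-columns, so that case would need to be resolved before reshaping.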
votes_2012_clean <- votes_2012_clean %>%
mutate(demwin = ifelse(D > R, 1, 0))
votes_2012_clean <- votes_2012_clean %>%
mutate(demdiff = D - R)
states <- map_data("state") %>%
mutate(region = str_to_upper(region))
states_2012_votes <- left_join(votes_2012_clean, states, by = c("state" = "region"))
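Because the join matches on state names, it can be worth checking which states in the vote table have no matching polygons before plotting. The following is a sketch using `anti_join()` on two made-up tables, `toy_votes` and `toy_map`, which are hypothetical stand-ins for the vote and map data; all values are invented.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for the vote table and the map table
toy_votes <- tibble(state = c("OHIO", "HAWAII"), demdiff = c(6, 25))
toy_map <- tibble(
  region = c("OHIO", "OHIO", "TEXAS"),
  long   = c(-84.8, -80.5, -100.0),
  lat    = c(39.1, 41.9, 31.0)
)

# Rows of toy_votes whose state has no matching region in toy_map
unmatched <- anti_join(toy_votes, toy_map, by = c("state" = "region"))
unmatched$state
```

With the real data, `map_data("state")` covers only the contiguous states, so a state such as Hawaii would have no polygon rows and would end up with missing `long` and `lat` values after a `left_join()`.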
ggplot(states_2012_votes) +
geom_polygon(mapping = aes(x = long, y = lat, group = group, fill = demwin), color = "white", lwd = 0.25) +
coord_quickmap() +
scale_fill_gradient(low = "red", high = "blue") +
theme_void() +
theme(legend.position = "none") +
labs(title = "2012 Senate Election Results by State", subtitle = "Blue represents Democrat Win, Red represents Republican Win")
ggplot(states_2012_votes) +
geom_polygon(mapping = aes(x = long, y = lat, group = group, fill = demdiff), color = "white", lwd = 0.25) +
coord_quickmap() +
scale_fill_gradient2(low = "red", mid = "purple", high = "blue", midpoint = 0) +
theme_void() +
theme(legend.position = "none") +
labs(title = "2012 Senate Election Results by State", subtitle = "Bluer represents more Democratic leaning, purple is moderate, redder represents more Republican leaning")