Final Exam

This exam consists of 16 questions. You must hand in an .Rmd file as well as a knitted PDF or Word document. The .Rmd of this exact file is available, and you are welcome to use it as a template for your answers.

You are not allowed to share code with classmates. You may ask clarifying questions to TAs/me. If you are stuck on something and can’t continue to the next part of the assignment, you can ask a TA or me to give you the code to continue, but expect to lose a couple of points. Make sure to take the time to add titles, labels, etc. to make your graphs look professional.

For the final exam we are going to be working with election returns from US Senate elections.

Comprehensive results for Senate races have been compiled by the MIT Election Data and Science Lab. We’re going to do some cleaning to get this data into a format that we can analyze using a map. The changes we make to the data are cumulative, so you should assume that changes you make in one question (filtering, selecting variables, etc.) apply to all subsequent questions. For example, in Q6 you will remove Independent candidates from the data. All subsequent questions use this filtered data with no Independents.

First, import this data using this link. Download the spreadsheet as a csv and load the election results into R.

library(tidyverse) # read_csv(), the dplyr verbs, and ggplot2 all come from here
us_senate <- read_csv("Data/Raw/1976-2020-senate.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   year = col_double(),
##   state = col_character(),
##   state_po = col_character(),
##   state_fips = col_double(),
##   state_cen = col_double(),
##   state_ic = col_double(),
##   office = col_character(),
##   district = col_character(),
##   stage = col_character(),
##   special = col_logical(),
##   candidate = col_character(),
##   party_detailed = col_character(),
##   writein = col_logical(),
##   mode = col_character(),
##   candidatevotes = col_double(),
##   totalvotes = col_double(),
##   unofficial = col_logical(),
##   version = col_double(),
##   party_simplified = col_character()
## )

Our first step to clean this data is removing non-substantive columns. Keep only the variables year, state, state_po, stage, candidate, party_detailed, candidatevotes, totalvotes.

us_senate <- us_senate %>%
  select('year', 'state', 'state_po', 'stage', 'candidate', 'party_detailed', 'candidatevotes', 'totalvotes')

Next, we’re going to remove rows with incomplete data. Remove any rows that have missing data in the “candidate” or “party” (party_detailed) columns.

clean_us_senate <- us_senate %>%
  filter(!is.na(candidate), !is.na(party_detailed))

Next, create a new variable, “party3”, which recodes the “party” (party_detailed) column into “D” for Democrats, “R” for Republicans, and “I” for all other parties. (Hint: you may want to first create this column so that all rows equal “I”. Then use the ifelse() function to recode a row to “R” if it represents a Republican and otherwise leave it equal to its current value. Do the same for Democrats.)

clean_us_senate <- clean_us_senate %>%
  mutate(party3 = "I", # start with every row coded as "I"
         party3 = ifelse(party_detailed == "DEMOCRAT", "D", party3),
         party3 = ifelse(party_detailed == "REPUBLICAN", "R", party3))
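
The same recode can also be written in one step with case_when() from dplyr. The sketch below is only an illustration: toy_parties is a small made-up data frame, not the exam data, and it assumes the party labels are stored in upper case as they are in party_detailed.

```r
library(dplyr)

# Hypothetical toy data standing in for the party_detailed column
toy_parties <- tibble(
  candidate = c("A", "B", "C", "D"),
  party_detailed = c("DEMOCRAT", "REPUBLICAN", "LIBERTARIAN", "DEMOCRAT")
)

# case_when() checks the conditions in order; TRUE is the catch-all
toy_recoded <- toy_parties %>%
  mutate(party3 = case_when(
    party_detailed == "DEMOCRAT" ~ "D",
    party_detailed == "REPUBLICAN" ~ "R",
    TRUE ~ "I"
  ))

# Tally the candidates in each recoded party
party_counts <- count(toy_recoded, party3)
print(party_counts)
```

Either approach should give the same party3 column. The chain of ifelse() calls follows the hint more directly, while case_when() keeps all three rules in one place.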

How many Democrats are in this dataset? Answer: 798

How many Republicans? Answer: 802

How many Independents? Answer: 1430

clean_us_senate %>% 
  group_by(party3) %>%
  summarise(number = n())

Now let’s look at the 2-party vote in these data. First, remove the independent candidates from the data. Next, remove all the rows where “stage” is not equal to “gen”. This ensures that we only get results from the general election.

clean_us_senate <- clean_us_senate %>%
  filter(party3 != "I", stage == "gen")

How many races were contested between more than two candidates? Answer: 23

Which state had the most of these races? Answer: Louisiana

over_2_contest <- clean_us_senate %>%
  group_by(state, year) %>%
  summarise(count = n()) %>%
  filter(count > 2)

## `summarise()` has grouped output by 'state'. You can override using the `.groups` argument.

over_2_contest %>%
  group_by(state) %>%
  summarise(count = n()) %>%
  top_n(1)

## Selecting by count
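
To see how the two-step count works, here is a minimal sketch on made-up data. toy_races and its candidates are hypothetical, and count() and slice_max() are used as shorthand alternatives to the group_by(), summarise(), and top_n() calls above, not as the required approach.

```r
library(dplyr)

# Hypothetical toy data: one row per candidate in a state-year race
toy_races <- tibble(
  state = c(rep("LOUISIANA", 6), rep("OHIO", 2), rep("TEXAS", 3)),
  year = c(2016, 2016, 2016, 2020, 2020, 2020, 2016, 2016, 2018, 2018, 2018),
  candidate = paste("Candidate", 1:11)
)

# Step 1: count candidates per race and keep races with more than two
crowded <- toy_races %>%
  count(state, year, name = "n_candidates") %>%
  filter(n_candidates > 2)

# Step 2: count crowded races per state and keep the state with the most
most_crowded <- crowded %>%
  count(state, name = "n_races") %>%
  slice_max(n_races, n = 1)

print(crowded)
print(most_crowded)
```

Note that grouping by state and year treats two Senate seats contested in the same state and year as a single race, so that is an assumption built into this way of counting.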

For Democratic and Republican candidates, create a figure that displays year on the x-axis and each candidate’s percent of the vote on the y-axis. Be sure to color code each candidate by their respective party. Add two lines (one for each party) that represent the trend in that party’s support over time.

louisiana_votes <- clean_us_senate %>%
  filter(state == "LOUISIANA") %>%
  select('year', 'candidate', 'candidatevotes', 'totalvotes', 'party3') %>%
  mutate(perc_vote = (candidatevotes/totalvotes)*100)

library(ggrepel) # provides geom_label_repel()

ggplot(louisiana_votes, aes(x = year, y = perc_vote, color = party3)) +
  geom_jitter(size = 1) +
  scale_x_continuous(breaks = seq(1978, 2020, 1), limits = c(1978, 2020)) +
  scale_y_continuous(breaks = seq(0, 100, 25), limits = c(0, 100)) +
  # label each point with the candidate's name and rounded vote share
  geom_label_repel(aes(label = paste0("(", candidate, ", ", round(perc_vote, 1), ")")), size = 1, box.padding = 0.5, max.overlaps = Inf) +
  ggtitle("Louisiana Voting Trends") +
  xlab("Year of Election") +
  ylab("Percentage of Votes Per Candidate") +
  scale_color_manual(name = "Party", values = c("D" = "blue", "R" = "red"), labels = c("Democrat", "Republican")) +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  geom_smooth() # one trend line per party, following the color mapping

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 7 rows containing missing values (geom_point).

Let’s take a look at the races from 2012. Filter your dataset so that it only contains the results for 2012, and only the columns year, state, party3, and the candidate percent you calculated in the previous question. Reshape this data so that there is only one row per state, and two columns that represent the percent of the vote won by the Republican candidate and the percent of the vote won by the Democratic candidate. Note that you will not have 50 rows because not every state holds a Senate election in a given election year.

votes_2012 <- clean_us_senate %>%
  filter(year == 2012) %>%
  mutate(perc_vote = (candidatevotes/totalvotes)*100) %>%
  select('year', 'state', 'party3', 'perc_vote')

votes_2012_clean <- pivot_wider(votes_2012, names_from = party3, values_from = perc_vote)
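
As an illustration of what the reshape does, the sketch below applies pivot_wider() to a tiny made-up table. toy_long, the state names, and the percentages are all hypothetical and are not taken from the real 2012 results.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per state and party
toy_long <- tibble(
  year = 2012,
  state = c("STATE A", "STATE A", "STATE B", "STATE B"),
  party3 = c("D", "R", "D", "R"),
  perc_vote = c(55, 45, 40, 60)
)

# names_from supplies the new column names, values_from fills them in
toy_wide <- toy_long %>%
  pivot_wider(names_from = party3, values_from = perc_vote)

print(toy_wide)
```

The result has one row per state with separate D and R columns. If a state had more than one candidate from the same party, pivot_wider() would warn and store the values as a list-column, so it is worth checking for that in the real data.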

Create a variable “demwin” that records if the Democrat received a higher vote share than the Republican in each race in 2012.

votes_2012_clean <- votes_2012_clean %>%
  mutate(demwin = ifelse(D > R, 1, 0))

Create a variable “demdiff” that records the difference between the Democratic and Republican share of the vote in each race in 2012.

votes_2012_clean <- votes_2012_clean %>%
  mutate(demdiff = D - R)
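
The short sketch below works through both variables on made-up numbers. toy_shares is hypothetical, and the third row assumes a race with no Democratic candidate in order to show how missing values carry through.

```r
library(dplyr)

# Hypothetical wide-format vote shares, in percentage points
toy_shares <- tibble(
  state = c("STATE A", "STATE B", "STATE C"),
  D = c(55, 40, NA),
  R = c(45, 60, 70)
)

toy_result <- toy_shares %>%
  mutate(
    demwin = ifelse(D > R, 1, 0), # 1 if the Democrat led, 0 if not, NA if unknown
    demdiff = D - R               # positive values favor the Democrat
  )

print(toy_result)
```

Because comparisons and subtraction involving NA return NA, a state missing one party's vote share ends up with NA in both demwin and demdiff rather than being counted as a win or a loss.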

Next, we’re going to do some analysis to map this data. Load in the state-level mapping data that we’ve worked with from the package mapdata.

states <- map_data("state") %>%
  mutate(region = str_to_upper(region))

Join the 2012 Senate election data to this mapping data. Be cautious about the format of the state names!

# start from the map data so states without a 2012 D-vs-R result stay on the map with NA values
states_2012_votes <- left_join(states, votes_2012_clean, by = c("region" = "state"))
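
One way to check the state-name format before joining is anti_join(), which returns the rows that find no match. The sketch below uses two small made-up tables: toy_votes and toy_map are hypothetical, and the coordinates are rough placeholders rather than real polygon data.

```r
library(dplyr)
library(stringr)

# Hypothetical election data with upper-case state names
toy_votes <- tibble(
  state = c("OHIO", "TEXAS", "HAWAII"),
  demdiff = c(10, -20, 25)
)

# Hypothetical map data with lower-case region names and no Hawaii
toy_map <- tibble(
  region = c("ohio", "texas", "utah"),
  long = c(-83, -99, -111),
  lat = c(40, 31, 39)
)

# Before fixing the case, no state name matches
unmatched_before <- anti_join(toy_votes, toy_map, by = c("state" = "region"))

# After upper-casing the region column, only the state missing from the map remains
toy_map_upper <- toy_map %>% mutate(region = str_to_upper(region))
unmatched_after <- anti_join(toy_votes, toy_map_upper, by = c("state" = "region"))

print(unmatched_before)
print(unmatched_after)
```

Any state still listed after the case fix has no polygons in the map data at all, so it cannot be drawn regardless of how the names are formatted.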

Create a map that shows the winner of each Senate contest in 2012, with Democrats in blue and Republicans in red. If there was no Senate contest in a state (or if a party other than Democrats or Republicans won the seat), leave the state blank.

ggplot(states_2012_votes) +
  geom_polygon(mapping = aes(x = long, y = lat, group = group, fill = demwin), color = "white", lwd = 0.25) +
  coord_quickmap() +
  scale_fill_gradient(low = "red", high = "blue") +
  theme_void() +
  theme(legend.position = "none") +
  labs(title = "2012 Senate Election Results by State", subtitle = "Blue represents Democrat Win, Red represents Republican Win")

Create a map that shades each state by the Democratic vote difference you created above. Again, if there was no Senate contest in a state (or if a party other than Democrats or Republicans won the seat), leave the state blank.

ggplot(states_2012_votes) +
  geom_polygon(mapping = aes(x = long, y = lat, group = group, fill = demdiff), color = "white", lwd = 0.25) +
  coord_quickmap() +
  scale_fill_gradient2(low = "red", mid = "purple", high = "blue", midpoint = 0) + # demdiff is 0 when the parties tie
  theme_void() +
  theme(legend.position = "none") +
  labs(title = "2012 Senate Election Results by State", subtitle = "Bluer represents more Democratic Leaning, Purple is moderate, Redder represents more Republican Leaning")

DATA 101

Due May 11th at 11:59pm