Assignment Details

This homework assignment is due March 29 before midnight. Everyone must complete their own code and assignment, though you may discuss the homework with classmates.

To access the code and data for the assignment, go to RStudio.cloud. To turn in the assignment, upload two files to Canvas: (1) your code file (.Rmd), and (2) a PDF version (click Knit to create PDF).

Question 1: Ecological Inference

Legal cases on voting rights often require making estimates of the turnout rate of a minority group. In most states, however, it is not possible to know the exact turnout rate of any single population group due to the secret ballot. As a result, analysts often use methods of ecological inference to estimate turnout rates.

The data set turnout contains turnout data for North Carolina for the 2012, 2014 and 2016 general elections. Each row is a precint-year combination. This question explores what we can learn about turnout among racial groups from the aggregate precinct data.

Q1(a): For one of the election years (2012, 2014 or 2016) in the data, plot the turnout rate (on the y-axis) against the share of black voters in the precinct. Include a regression line – from the linear regression of turnout share on black voter share – in the plot. Putting aside other factors, what is the general relationship, if any, between turnout rate and black population share in a precinct? Does this tell you anything about turnout among black voters? What pitfalls are there in drawing a conclusion based on the information in this plot?

turnout_2012 <- subset(turnout, year == "2012")

turnout_2012 %>%
  ggplot(aes(x = B, y = T)) +
    geom_point(alpha = 0.3) + 
    geom_smooth(method = "lm", formula = y ~ x) + 
    labs(title = "Relationship between turnout and black voter share in 2012",
         x = "Share of Black Voters",
         y = "Turnout") + 
  theme_minimal()

Based on this regression, there is a negative correlation between the share of black voters and turnout — as the share of black voters increases, the turnout rate decreases. However, this does not necessarily imply that black voters are less likely to turn out. This assumption would fall on the ecological fallacy, which is an incorrect interpretation that makes inferences about individuals based on inferences about the groups that they are a part of.

Q1(b): What is the overall turnout rate (as a proportion of registered voters) in the state of North Carolina for each year in the data? What is the overall black population share for each year in the data?

# Hint: There are several ways to do this. One makes use of the function `weighted.mean()`.
# Type `?weighted.mean()` to see the syntax.

turnout_overall <- turnout %>%
  group_by(year) %>%
  summarize(overall_turnout = weighted.mean(T, registered),
            black_population_share = weighted.mean(B, registered), 
            .groups = "drop")

print(turnout_overall)
## # A tibble: 3 x 3
##   year  overall_turnout black_population_share
##   <chr>           <dbl>                  <dbl>
## 1 2012            0.682                  0.224
## 2 2014            0.444                  0.224
## 3 2016            0.689                  0.221
# The overall turnout rate, weighting the number of registered voters, was 
# 0.681 in 2012, 0.444 in 2014, 0.689 in 2016. Again weighting the number of 
# registered voters, the overall black population share was 0.224 in 2012, 
# 0.224 in 2014, and 0.221 in 2016. 

Q1(c): For the 2012 election, suppose you want to learn what you can about turnout among black voters in precincts where black registrants comprised more than 75 percent of the registered voters in the precinct. One approach is to use the “method of bounds.” Use the function defined above, bounds(), to perform this calculation. Create a plot that illustrates how these bounds vary with the share of black voters in a precinct (so, include black population share on the x-axis and black voter turnout on the y-axis). How much variation is there in black turnout rate across precincts? Does the turnout rate appear to depend on the population share?

q1c <- subset(turnout, B >= 0.75)

turnout_bounds <- bounds(q1c$T, q1c$B)

plot(x = q1c$B, y = q1c$T, type = "n",
     main = "Method of Bounds", ylab = "Black Voter Turnout", 
     xlab = "Share of Black Voters") 
segments(y0 = turnout_bounds$beta_low, y1 = turnout_bounds$beta_hi, 
         x0 = q1c$B)

# I am not confident that the graph below shows the correct visual
# representation of the method of bounds (in fact, I'm fairly sure that what I
# have is wrong, since it looks different from what I have seen in the readings
# for this class, but I can't seem to figure out how to do it correctly), 
# but based on my interpretation of the graph below, it seems like there is 
# significant variation in black turnout rate across precincts. Especially 
# when the share of Black voters is lower, the variation in Black voter 
# turnout is also greater; when the share of Black voters in higher, the 
# variation in Black voter turnout is lower. In general, it seems like a higher
# share of Black voters within a precinct is correlated with higher Black 
# voter turnout. 

Q1(d): Estimate ecological regressions with the goal of determining an estimate for the black and non-black turnout rate among registrants in the state. What are your estimates for 2012, 2014 and 2016? In which year is there the smallest gap between estimates in turnout for black and non-black citizens?

# Hint: To force R to omit the constant from a regression, include a "-1" 
# (i.e., lm(Y ~ -1 + X) would regress Y on X and omit the constant).

turnout_2014 <- subset(turnout, year == "2014")
turnout_2016 <- subset(turnout, year == "2016")

lm(turnout_2012$T ~ -1 + turnout_2012$B + I(1 - turnout_2012$B))
## 
## Call:
## lm(formula = turnout_2012$T ~ -1 + turnout_2012$B + I(1 - turnout_2012$B))
## 
## Coefficients:
##        turnout_2012$B  I(1 - turnout_2012$B)  
##                0.6346                 0.6959
lm(turnout_2014$T ~ -1 + turnout_2014$B + I(1 - turnout_2014$B))
## 
## Call:
## lm(formula = turnout_2014$T ~ -1 + turnout_2014$B + I(1 - turnout_2014$B))
## 
## Coefficients:
##        turnout_2014$B  I(1 - turnout_2014$B)  
##                0.3404                 0.4863
lm(turnout_2016$T ~ -1 + turnout_2016$B + I(1 - turnout_2016$B))
## 
## Call:
## lm(formula = turnout_2016$T ~ -1 + turnout_2016$B + I(1 - turnout_2016$B))
## 
## Coefficients:
##        turnout_2016$B  I(1 - turnout_2016$B)  
##                0.5526                 0.7284
# In the above regression, the estimate for turnout among registrants in 2012 
# was 0.6346 for black registrants and 0.6959 for non-black registrants. In
# 2014, the estimate was 0.3404 for black registrants and 0.4863 for
# non-black registrants. Finally, in 2016, the estimate was 0.5526 for black
# registrants and 0.7284 for non-black registrants. 

# The year with the smallest gap between estimates in turnout for black and
# non-black citizens was 2012, where the difference was 0.0613, compared to
# 0.1459 in 2014 and 0.1758 in 2016. 

Q1(e): The data.frame turnout_by_race includes actual, precinct-level turnout and registration data for NC voters in the 2012, 2014 and 2016 elections. That is, it includes information that allows us to know exactly the turnout rate among black and non-black registrants (rather than having to estimate it through ecological regression). Use this data to create a table that displays turnout as a percentage of registered voters in the state for the racial groups in the ecological regression above (i.e., black versus non-black). How much do your estimates from the ecological regression differ from the true turnout rates for black and non-black voters in North Carolina?

actual_turnout <- turnout_by_race %>%
  mutate(black = ifelse(race == "B", yes = T, no = F)) %>%
  group_by(year, black) %>%
  drop_na() %>%
  summarize(turnout = mean(voted / registered))
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
print(actual_turnout)
## # A tibble: 6 x 3
## # Groups:   year [3]
##   year  black turnout
##   <chr> <lgl>   <dbl>
## 1 2012  FALSE   0.584
## 2 2012  TRUE    0.685
## 3 2014  FALSE   0.323
## 4 2014  TRUE    0.421
## 5 2016  FALSE   0.609
## 6 2016  TRUE    0.637
# The actual precinct-level turnout for NC is shown in the actual_turnout data
# frame. There is a difference between the estimates from the ecological 
# regression and the true turnout rates, but the difference is not massive.

Q1(f): How would you explain differences between the ecological inference estimates and the results from the precinct-level turnout by race data? Is there evidence that key assumptions of ecological regression may not be met?

See the diagnostic plot, which displays residuals of the turnout rate by the black population share in each precinct, as one clue as to what assumptions from ecological regression may be violated. (Hint: Also see the Gelman et al. (2001) article discussed in class for a list of key assumptions that must be met for ecological regression to produce valid estimates).

Residual Plot

According to the Gelman et al (2001) article, ecological regressions rely on either the constancy model or the zero-correlation model. Each of these models relies on a fundamental assumption that might be violated in this case. For example, the constancy model would require that black and non-black voters have approximately the same turnout across precincts. The zero-correlation model relies on the assumption that there is no correlation at all between precinct- level demographics and level of turnout for black and non-black voters.

Question 2: Assessing Gerrymandering

A popular method for calculating seats-votes curves (allowing for estimates of how votes are translated into congressional seats in a state, for example) is uniform swing. This approach makes the assumption that swings in partisanship occur equally across all districts and, as a result, allow analysts to project the translation of votes into seats for counterfactual scenarios that have not occurred. This question explores some of those assumptions. To calculate uniform swing, use the function defined in the setup code chunk at the start of this document uniform_swing() as well as the data.frame svdist. (Note both of these were used in the in-class exercise on redistricting.)

Q2(a): Calculate uniform swing for North Carolina in 2016 and plot a seats-votes curve. What was the observed vote share and seat share (state-wide) for North Carolina in 2018? Plot this point on the graph as well. Is it in line with predictions from the 2016 seats-votes curve?

nc_2016 <- subset(svdist, state == "NC" & year == 2016)
nc_2018 <- subset(svdist, state == "NC" & year == 2018)

out <- uniform_swing(nc_2016$dvotes, nc_2016$totalvotes)

plot(x = out$dv_swing, y = out$dwin_swing, type = "l",
     xlab = "D Vote Share", ylab = "D Seat Share",
     main = "2016 Seats-Votes Curve")
abline(h = 0.5, lty = 2, col = "gray")
abline(v = 0.5, lty = 2, col = "gray")
points(x = sum(nc_2018$dvotes)/sum(nc_2018$totalvotes), 
       y = sum(nc_2018$dwin)/13, col = "red")

# The state-wide observed vote share and seat share in North Carolina in 2018 
# is in line with predictions from the 2016 seats-votes curve. This curve shows
# a gerrymandered district, because when democrats have a vote share of
# nearly 50%, the democratic vote share is just over 20%. 

Q2(b): Now do the same for Ohio, California, Georgia, and Virginia. (Hint: You may want to write a function so that you can do this quickly/easily for multiple states).

sv_curve <- function(state_abr){
  state_2016 <- subset(svdist, state == state_abr & year == 2016)
  state_2018 <- subset(svdist, state == state_abr & year == 2018)
  
  out <- uniform_swing(state_2016$dvotes, state_2016$totalvotes)
  
  plot(x = out$dv_swing, y = out$dwin_swing, type = "l",
     xlab = "D Vote Share", ylab = "D Seat Share")
  abline(h = 0.5, lty = 2, col = "gray")
  abline(v = 0.5, lty = 2, col = "gray")
  points(x = sum(state_2018$dvotes)/sum(state_2018$totalvotes), 
         y = sum(state_2018$dwin)/nrow(state_2018), col = "red")
}

sv_curve(state_abr = "OH")

sv_curve(state_abr = "CA")

sv_curve(state_abr = "GA")

sv_curve(state_abr = "VA")

Q2(c): For those states where the counterfactual predictions of the 2016 seats-votes curve appear different than the actual outcomes in the 2018 election, what do you think happened? What assumption or assumptions underpinning uniform swing likely did not hold up?

In general, the Democratic seat share is higher than predicted given the Democratic vote share. The seats-votes curve and uniform swing rely on several assumptions. For example, if an election prompts a redistricting effort, this would mean that the seat apportionment would not be independent of vote share. It also assumes that only two parties/candidates are involved in the election, thus ignoring third parties and write-in candidates, for example. It also fails to consider uncontested elections.

Question 3 (Optional Bonus Question)

Create a plot that illustrates Seats-Votes curves for the US Congress (i.e., at the national level rather than for a particular state) for every election since 2002. What accounts for any observed differences in the seats-votes curve over time? What accounts for the similarities?

national <- subset(svdist, year >= 2002)

yr_output <- function(yr){
  data_yr <- subset(national, year == yr)
  out_yr <- uniform_swing(data_yr$dvotes, data_yr$totalvotes) %>%
    mutate(year = yr)
  return(out_yr)
}

joined <- full_join(yr_output(2002), yr_output(2004), 
                    by = c("dv_swing", "dwin_swing", "year")) %>%
  full_join(yr_output(2006), by = c("dv_swing", "dwin_swing", "year")) %>%
  full_join(yr_output(2008), by = c("dv_swing", "dwin_swing", "year")) %>%
  full_join(yr_output(2010), by = c("dv_swing", "dwin_swing", "year")) %>%
  full_join(yr_output(2012), by = c("dv_swing", "dwin_swing", "year")) %>%
  full_join(yr_output(2014), by = c("dv_swing", "dwin_swing", "year")) %>%
  full_join(yr_output(2016), by = c("dv_swing", "dwin_swing", "year")) %>%
  full_join(yr_output(2018), by = c("dv_swing", "dwin_swing", "year"))

ggplot(data = joined, aes(x = dv_swing, y = dwin_swing)) +
  geom_line() +
  facet_wrap(~year) + 
  theme_minimal() + 
  labs(title = "US Congress Seats-Votes Curves since 2002",
       x = "Democratic Vote Share", 
       y = "Democratic Seat Share")

joined %>%
  ggplot(aes(x = dv_swing, y = dwin_swing)) +
  geom_line() +
  transition_time(year) + 
  labs(title = 'Year: {frame_time}', x = "Democratic Vote Share", 
       y = "Democratic Seat Share") + 
  theme_classic()

# The animation results in several pages of the graph appearing on the pdf
# file, but it shows up properly on the html version of this document, linked
# here:

# For some reason, the animation includes decimal values for the years in
# addition to the integer values — I couldn't figure out how to resolve this
# issue.