Lab 7 Part II: Descriptive Analysis

Now that you have cleaned and merged multiple datasets, we’re ready to so some analysis. As previously noted, the question we will examine: “Is voter turnout lower in counties with a higher percentage of Black voters?” One motivation for this question are the claims that voter turnout is being suppressed. Who gets to vote in Florida?

Let’s get started by reading in our final merged dataset from Part I of Lab 7

ACS <- read_csv("ACS_citizens_health_countypres_police.csv")

## Warning: Missing column names filled in: 'X1' [1]

1 Define Variables for Analysis

1.1 Voter Turnout Variable

In order to look at voter turnout, we’ll need to create a variable that represents it. We note that we don’t have the number of eligible voters, but we do have total citizens which we can use as a proxy for eligible voters. Due to age restrictions (> 18) as well as the removal of voting rights from certain citizens (i.e. those that are incarcerated), we don’t have the actual number of eligible voters.

ACS_2 <- ACS %>%
  mutate(voter_turnout = totalvotes / citizens_total)

ACS_2 %>%
  summarise(average_turnout = mean(voter_turnout, na.rm = TRUE),
            min_turnout = min(voter_turnout, na.rm = TRUE),
            max_turnout = max(voter_turnout, na.rm = TRUE))

## # A tibble: 1 x 3
##   average_turnout min_turnout max_turnout
##             <dbl>       <dbl>       <dbl>
## 1           0.459      0.0250        2.61

Voter turnout averaged about 45% across counties, where some counties have over 100% voter turnout. This is strange. Let’s investigate.

ACS_2 %>% 
  filter(voter_turnout > 1)

## # A tibble: 2 x 686
##      X1  FIPS GEOID  NAME.x  QName   STUSAB SUMLEV GEOCOMP FILEID LOGRECNO US   
##   <dbl> <dbl> <chr>  <chr>   <chr>   <chr>  <chr>  <chr>   <chr>  <chr>    <lgl>
## 1    68  2013 05000… Aleuti… Aleuti… ak     050    00      ACSSF  0000013  NA   
## 2    69  2016 05000… Aleuti… Aleuti… ak     050    00      ACSSF  0000014  NA   
## # … with 675 more variables: REGION <lgl>, DIVISION <lgl>, STATECE <lgl>,
## #   STATE.x <chr>, COUNTY.x <chr>, COUSUB <lgl>, PLACE <lgl>, PLACESE <lgl>,
## #   TRACT <lgl>, BLKGRP <lgl>, CONCIT <lgl>, AIANHH <lgl>, AIANHHFP <lgl>,
## #   AIHHTLI <lgl>, AITSCE <lgl>, AITS <lgl>, ANRC <lgl>, CBSA <lgl>, CSA <lgl>,
## #   METDIV <lgl>, MACC <lgl>, MEMI <lgl>, NECTA <lgl>, CNECTA <lgl>,
## #   NECTADIV <lgl>, UA <lgl>, UACP <lgl>, CDCURR <lgl>, SLDU <lgl>, SLDL <lgl>,
## #   VTD <lgl>, ZCTA3 <lgl>, ZCTA5 <lgl>, SUBMCD <lgl>, SDELM <lgl>,
## #   SDSEC <lgl>, SDUNI <lgl>, UR <lgl>, PCI <lgl>, TAZ <lgl>, UGA <lgl>,
## #   BTTR <lgl>, BTBG <lgl>, PUMA5 <lgl>, PUMA1 <lgl>, A00001_001 <dbl>,
## #   A00002_001 <dbl>, A00002_002 <dbl>, A00002_003 <dbl>, A02001_001 <dbl>,
## #   A02001_002 <dbl>, A02001_003 <dbl>, A02002_001 <dbl>, A02002_002 <dbl>,
## #   A02002_003 <dbl>, A02002_004 <dbl>, A02002_005 <dbl>, A02002_006 <dbl>,
## #   A02002_007 <dbl>, A02002_008 <dbl>, A02002_009 <dbl>, A02002_010 <dbl>,
## #   A02002_011 <dbl>, A02002_012 <dbl>, A02002_013 <dbl>, A02002_014 <dbl>,
## #   A02002_015 <dbl>, A02002_016 <dbl>, A02002_017 <dbl>, A02002_018 <dbl>,
## #   A02002_019 <dbl>, A02002_020 <dbl>, A02002_021 <dbl>, A02002_022 <dbl>,
## #   A02002_023 <dbl>, A02002_024 <dbl>, A02002_025 <dbl>, A02002_026 <dbl>,
## #   A02002_027 <dbl>, B02002_001 <dbl>, B02002_002 <dbl>, B02002_003 <dbl>,
## #   B02002_004 <dbl>, B02002_005 <dbl>, B02002_006 <dbl>, B02002_007 <dbl>,
## #   B02002_008 <dbl>, B02002_009 <dbl>, B02002_010 <dbl>, B02002_011 <dbl>,
## #   A01001_001 <dbl>, A01001_002 <dbl>, A01001_003 <dbl>, A01001_004 <dbl>,
## #   A01001_005 <dbl>, A01001_006 <dbl>, A01001_007 <dbl>, A01001_008 <dbl>,
## #   A01001_009 <dbl>, A01001_010 <dbl>, …

We can see that these observations are in Alaska. More importantly, there aren’t a lot of these observations. Let’s classify all observations where voter turnout is over 100% as missing for now (again there are alternative ways of dealing with this).

What we’re doing here is telling R is: for the subset of [ACS_citizens_health_countypres_police2$voter_turnout > 1] (observations in which voter_turnout > 1), replace the value with NA (" ").

ACS_2$voter_turnout[ACS_2$voter_turnout > 1] <- ""

#reconvert voter_turnout to numeric
ACS_2$voter_turnout <- as.numeric(ACS_2$voter_turnout)

ACS_2 %>%
  summarise(average_turnout = mean(voter_turnout, na.rm = TRUE),
            min_turnout = min(voter_turnout, na.rm = TRUE),
            max_turnout = max(voter_turnout, na.rm = TRUE))

## # A tibble: 1 x 3
##   average_turnout min_turnout max_turnout
##             <dbl>       <dbl>       <dbl>
## 1           0.458      0.0250           1

We can check that this worked by rerunning our previous code, and see that we now have 0 counties with voter_turnout > 1. R converted to voter_turnout to a character in the process of removing the values of voter_turnout > 1, so we’ll need to convert it back to numeric.

1.2 Black Representation Variables

One way to measure Black voter representation is to see what percentage of total eligible voters are Black.

ACS_3 <- ACS_2 %>%
  mutate(citizens_black_pct = citizens_black/citizens_total)

ACS_3 %>%
  summarise(average_citizens_black_pct = mean(citizens_black_pct, na.rm = TRUE),
            min_citizens_black_pct = min(citizens_black_pct, na.rm = TRUE),
            max_citizens_black_pct = max(citizens_black_pct, na.rm = TRUE))

## # A tibble: 1 x 3
##   average_citizens_black_pct min_citizens_black_pct max_citizens_black_pct
##                        <dbl>                  <dbl>                  <dbl>
## 1                     0.0912                      0                  0.870

It’s interesting to see which counties have many eligible Black voters and none.

Question: Which counties have no Black voters? Do the results seem reasonable or is there reason to suspect a problem with the data?

2 Descriptive Analysis

2.1 Scatterplot

How can we visualize this data? One way is to make a scatter plot where the x-axis is the % of Black voters in a county and the y-axis is voter turnout.

ACS_3 %>%
  ggplot(aes(x = citizens_black_pct, y = voter_turnout)) +
  geom_point()  +
  geom_smooth(method='lm') +
  ggtitle("Vote Turnout by Black Voter Representation") +
  xlab("Pct Black Voters") +
  ylab("Voter Turnout")

## Warning: Removed 8 rows containing non-finite values (stat_smooth).

## Warning: Removed 8 rows containing missing values (geom_point).

Based on the regression line geom_smooth(method='lm'), it looks like as Black Representation increases, voter turnout decreases. But with the scatterplot it is hard to see.

2.2 Confidence Interval and Bar Plots

As an alternative, let’s divide the counties into 4 quartiles based on Black Voter representation. The we can do this by mutating a new column called quartile that divides our dataset into 4 tiles by citizens_black_pct.

ACS_4 <- ACS_3 %>%
    mutate(quartile = ntile(citizens_black_pct, 4))

ACS_4 %>%
  summarise(mean_q = mean(quartile, na.rm = TRUE))

## # A tibble: 1 x 1
##   mean_q
##    <dbl>
## 1   2.50

ACS_4 %>%
  group_by(quartile) %>%
  dplyr::summarise(mean_citizens_black_pct = mean(citizens_black_pct, na.rm = TRUE),
                   mean_voter_turnout = mean(voter_turnout, na.rm = TRUE))

## # A tibble: 5 x 3
##   quartile mean_citizens_black_pct mean_voter_turnout
## *    <int>                   <dbl>              <dbl>
## 1        1                 0.00272              0.486
## 2        2                 0.0115               0.469
## 3        3                 0.0508               0.437
## 4        4                 0.300                0.440
## 5       NA               NaN                  NaN

We now have four groups, and each group represents a quartile. The 1st quartile has virtually no Black voters, while the 4th quartile averages 30% Black voters.

How can we represent the distribution of Black voter representation across the four quartiles? One method is a confidence interval plot.

To create a confidence interval plot, we need to start by calculating the confidence intervals. Unfortunately, base R does not have a function that allows us to do this, and storing the confidence intervals in the way we need them for this graph is surprising tedious (at least Bea thought it was- if you figure out how to do it, please tell her how you did).

Thankfully, we are not the first people to have needed to build a confidence interval plot in R, and someone has written a nice package called Rmisc that allows us to do it easily. Go ahead and install the Rmisc package now, and load its library.

Rmisc contains a package called summarySE. The summarySE function will calculate a variety of summary stats for us, grouped by quartile.

library(Rmisc)

summary_stats <- summarySE(ACS_4, measurevar="voter_turnout", groupvars=c("quartile"), na.rm = TRUE)

## Warning in qt(conf.interval/2 + 0.5, datac$N - 1): NaNs produced

#plot
ggplot(summary_stats, aes(x = quartile, y = voter_turnout)) + 
    geom_errorbar(aes(ymin=voter_turnout - se, ymax=voter_turnout + se), width=.1) +
    geom_point()

## Warning: Removed 1 rows containing missing values (geom_point).

The pattern is much easier to see with this type of graph. In the two top quartiles (3 and 4), voter turnout is lower (hovering around 44%), where in places with fewer Black voters (1 and 2) voter turnout is higher.

Prof. Gong’s preferred graph is a bar graph with confidence intervals:

#we'll need this package to scale in the particular way we want to 
library(scales) 

#we'll also generate a palette for us to pull from (there are tons and tons of palettes to pull- this is one I found online) 
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442")
  
#plot! note that we're plotting geom_bar before geom_errorbar, otherwise the error bars will be hidden by the bars themselves
ggplot(summary_stats, aes(x = quartile, y = voter_turnout)) + 
  geom_bar(position=position_dodge(), stat="identity", aes(fill = cbbPalette )) +
  geom_errorbar(aes(ymin=voter_turnout - se, ymax=voter_turnout + se), width=.2) +
  geom_point() +
  scale_y_continuous(limits=c(0.4,0.5),oob = rescale_none) +
  theme(legend.position = "none")

## Warning: Removed 1 rows containing missing values (geom_bar).

## Warning: Removed 1 rows containing missing values (geom_point).

Let’s add titles and formatting

ggplot(summary_stats, aes(x = quartile, y = voter_turnout)) + 
  geom_bar(position=position_dodge(), stat="identity", aes(fill = cbbPalette )) +
  geom_errorbar(aes(ymin=voter_turnout - se, ymax=voter_turnout + se), width=.2) +
  geom_point(size = .4) +
  scale_y_continuous(limits=c(0.4,0.5), oob = rescale_none) +
  theme(legend.position = "none") +
  ggtitle("Vote Rate - Black Percentage Quartiles") +
  ylab("Avg Vote Rate")

## Warning: Removed 1 rows containing missing values (geom_bar).

## Warning: Removed 1 rows containing missing values (geom_point).

Question: Do this descriptive analysis present evidence that there is inequity in voter turnout? Is there evidence of voter suppression?