This write-up was inspired by a Spring 2022 Stat20 assignment on evaluating election fraud using Benford’s Law. Stat20 is UC Berkeley’s introductory computational statistics course; I joined its course staff in Fall 2022.
Benford’s Law appears to be a well-attested phenomenon whereby the first-digit distributions of “real life” data sets follow a logarithmic decay (Wikipedia). Under the law, the probability of observing a leading digit d is log10(1 + 1/d), so a leading 1 appears about 30% of the time while a leading 9 appears under 5% of the time. This law of anomalous numbers - so named by the physicist Frank Benford in 1938 - has been used in contexts ranging from the evaluation of genomic data to the detection of election fraud. A visualization of the first-digit distribution under Benford’s Law is shown below. (Adapted from “Benford’s Law Graphed in R”.)
In this simple write-up, I conduct exploratory data analyses (EDAs) of both the 2009 Iran election and the 2020 US election. I then generate first-digit distributions and apply Chi-squared goodness-of-fit (GoF) tests against the Benford distribution. Finally, I evaluate these results in a hypothesis testing framework. I end with conclusions on the findings.
#adapted from "Benford's Law Graphed in R" - see References.
library(tidyverse)

digits <- 1:9
ben_law <- function(d) log10(1 + 1/d)
ben_dist <- tibble(digit = digits, prob = ben_law(digits))

ggplot(ben_dist, aes(x = factor(digit), y = prob)) +
  geom_col() +
  labs(title = "First-digit values observed under Benford's Law.",
       x = "First digit",
       y = "Probability of observing")
#see the Notes section on accessing the Iran data.
load("/Users/benelhiguchi/Downloads/iran.rda")
head(iran)
The iran dataframe consists of 366 observations across 9 variables. In this tidy dataframe, the unit of observation is an Iranian city; its province is noted. The dataframe contains the number of legitimate votes cast for each candidate (ahmadinejad, rezai, karrubi, and mousavi), as well as the total_votes_cast by the city (comprised of both voided_votes and legitimate_votes). Mahmoud Ahmadinejad won the 2009 election, though analyses by Mebane (2009) and others suggest fraudulent results.
First, I compute each candidate’s share of the total legitimate vote, then count the number of cities won by each candidate:
#finding the proportion of total legitimate votes cast for each candidate
iran %>%
  summarize(totalA = sum(ahmadinejad),
            totalR = sum(rezai),
            totalK = sum(karrubi),
            totalM = sum(mousavi),
            total = sum(legitimate_votes)) %>%
  summarize(ahmadinejad = totalA/total,
            rezai = totalR/total,
            karrubi = totalK/total,
            mousavi = totalM/total)
#comparing legitimate vote counts to determine winner per city and summing
wins <- iran %>%
  rename(a = ahmadinejad,
         r = rezai,
         k = karrubi,
         m = mousavi) %>%
  mutate(winA = ((a > r) & (a > k) & (a > m)),
         winR = ((r > a) & (r > k) & (r > m)),
         winK = ((k > a) & (k > r) & (k > m)),
         winM = ((m > a) & (m > r) & (m > k)))

wins %>%
  summarize(ahmadinejad = sum(winA),
            rezai = sum(winR),
            karrubi = sum(winK),
            mousavi = sum(winM))
Both EDAs reveal that Rezai and Karrubi won virtually no cities in the 2009 Iran election. Incidentally, the disparity between national vote proportions and city-win proportions (i.e., Ahmadinejad winning 63% of the national vote versus 87% of cities) suggests room for a geographic analysis of vote distributions.
To explore the role of voided_votes in the outcome of the election, I summarize the proportion of total voided_votes against total_votes_cast. Then, I visualize how this proportion varies at the city level. Finally, I summarize this distribution conditioned on the winning candidate:
iran %>%
  summarize(prop_voided = sum(voided_votes)/sum(total_votes_cast))

wins <- wins %>%
  mutate(prop_voided = voided_votes/total_votes_cast)

wins %>%
  ggplot(aes(x = prop_voided)) +
  geom_histogram(bins = 50, color = "white") +
  geom_vline(xintercept = .011, color = "red", linetype = "dashed") +
  annotate("text",
           label = "Average Voided Proportion: .011",
           x = .035, y = 40) +
  labs(title = "Distribution of city-level voided vote proportions.",
       x = "Proportion of voided votes",
       y = "Count")
The distribution of city-wide voided vote proportions is right-skewed; the national voided vote proportion is pulled upward by an outlying observation.
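A quick peek locates that outlier (a hedged sketch: I assume the city and province identifiers are columns named city and province in the iran data):
#locating the outlying observations noted above
wins %>%
  slice_max(prop_voided, n = 3) %>%
  select(city, province, prop_voided)
Now, conditioning on winner: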
#dropping r and k observations
wins_am <- wins %>%
  filter(winA | winM)

wins_am %>%
  group_by(winA) %>%
  summarize(prop_voided_avg = mean(prop_voided),
            prop_voided_sd = sd(prop_voided))

prop_voided_a <- wins_am %>%
  filter(winA) %>%
  ggplot(aes(x = prop_voided)) +
  geom_histogram(bins = 50, color = "white") +
  geom_vline(xintercept = .008, color = "red", linetype = "dashed")

prop_voided_m <- wins_am %>%
  filter(winM) %>%
  ggplot(aes(x = prop_voided)) +
  geom_histogram(bins = 50, color = "white") +
  geom_vline(xintercept = .016, color = "red", linetype = "dashed")

#displaying the two conditional histograms
prop_voided_a
prop_voided_m
The summary table above shows that the proportion of voided votes varies significantly across candidates: it is about 2x higher in cities that voted for Mousavi (winA = FALSE) than in those that voted for Ahmadinejad. The standard deviations of these measures also show higher variability in the cities that voted for Mousavi. (Part of this variability could be explained by the outlying observation noted above, but further analysis would likely show that this is not the whole story.)
The significant difference in voided vote proportions suggests that votes for Mousavi were more often voided than votes for Ahmadinejad, though this claim cannot be substantiated at the granularity of the data provided. (Justifying such a claim would require resolution at the level of individual votes rather than the aggregate voting behavior of an entire city.)
For now, I begin investigating election fraud by generating a first-digit distribution of vote counts for Ahmadinejad across cities (in line with using Benford’s Law):
#get_first() extracts the leading digit of a count; see Notes for its documentation
a_first_dig <- wins %>%
  mutate(first_digit = get_first(a))

a_first_dig %>%
  ggplot(aes(x = first_digit)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:9) +
  labs(title = "First-digit distribution of votes for Ahmadinejad",
       x = "First digit",
       y = "Counts")
It is immediately clear that the first-digit distribution of city-based votes for Ahmadinejad visually differs from that expected by Benford’s Law. Quantifying this difference will be explored under the analysis section.
I now turn to data collected by OpenElections during the 2020 US election cycle. I chose to analyze data collected at the New York precinct level. (It would be interesting to discuss the implications of granularity differences across the US and Iran data analyzed, i.e., how do claims made at the city level compare with those at the precinct level?)
ny_precinct <- read_csv("https://raw.githubusercontent.com/openelections/openelections-data-ny/master/2020/20201103__ny__general__precinct.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
head(ny_precinct)
The ny_precinct data frame consists of 426,582 observations across 13 variables. It is difficult to gauge what the unit of observation is: it seems to be a (precinct, candidate) tuple, but multiple such observations also seem to be repeated throughout the dataset. Time to explore!
First, I’d like to get a sense of how many precincts are
being represented in the dataset:
#counting unique precinct observations
ny_precinct %>%
  distinct(precinct) %>%
  summarize(count = n())
There appear to be 15,402 precincts. But this seems unreasonable: a quick Google search turns up, for instance, only 77 police precincts in all of New York City. In fact, a glance at the dataset suggests that precinct observations are actually smaller units, such as wards!
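That glance can be reproduced directly (the exact labels depend on the OpenElections file):
#sampling some precinct labels to see what the unit actually is
ny_precinct %>%
  distinct(precinct) %>%
  head(10)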
We are getting closer to understanding what the unit of observation is. I imagine that the data was recorded as the count of votes each candidate received for each office in a given precinct’s ward. With that in mind, I now look for (a) an example of the candidates one could choose for a given office, and (b) the number of candidates that ran for each office. Hopefully these counts will yield insight into how 426K observations were made.
#an example of candidate choices for State Senate in Corning Ward 1
ny_precinct %>%
  filter(precinct == "City of Corning Ward 1" & office == "State Senate") %>%
  select(candidate, party)
This sample peek into the candidate choices for State Senate in Corning Ward 1 reveals that votes for a given candidate are tallied separately under each party line on which that candidate appeared (New York permits fusion voting)!
#finding number of candidates that ran for each office
ny_precinct %>%
  group_by(office) %>%
  select(office, candidate) %>%
  unique() %>%
  summarize(count = n())
Now, how should I paint a picture of the data collection process? First, I imagine myself as a New York voter: my ballot implicitly records where I’m voting from (the precinct ward), my one choice for President, my local choice(s) for State Assembly member(s) - some subset of the 342 listed across the state - and likewise down the list of 4 possible offices to vote for.
Now, I imagine myself as an election officer. I go through individual ballots, tallying how many votes each candidate receives by party affiliation; a given candidate’s votes may be recorded over multiple such affiliations. Furthermore, I mark the mode of these votes, such as whether a vote was delivered as an absentee ballot or as a machine_vote. Finally, I also record some “overhead” metrics for these particular office results, such as how many votes were sent in (i.e., the Total) and the number of Voids, as additional candidate values.
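Those overhead rows can be peeked at directly (a hedged illustration: the exact labels, e.g. "Total" or "Void", may vary across counties in the OpenElections file):
#peeking at the "overhead" rows described above
ny_precinct %>%
  filter(str_detect(candidate, "Total|Void")) %>%
  select(precinct, office, candidate, votes) %>%
  head()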
Aha! Each observation in the dataset roughly measures how many votes an individual candidate - faceted by party affiliation - received in a given precinct ward, with each observation also categorizing the type of votes received. Across the 15K precinct wards, the range of local and national candidates to choose from (and their number per office!), and the inclusion of “overhead” metrics (such as the Total votes for a given office), 426K observations were made.
Note that this dataset, unlike the one used in the Iran analysis, doesn’t follow tidy data standards. This fact, and the ad-hoc inclusion of “overhead” metrics as candidate values, made for a lengthy surface exploration of the dataset at hand.
In this EDA I am interested in determining voting behavior within a
precinct ward by vote type. Of particular
interest is the proportion of votes cast in an
absentee ballot, as these votes stirred quite
the controversy during the 2020 General Elections. Being interested in
the behavior of Biden and Trump voters, I use these ward-by-ward
proportions to visualize candidate-specific histograms
detailing the spread of absentee proportions; specifically,
I am interested in these distributions as they pertain to either
DEM or REP party affiliations. I also compare
these proportions to the same statistic at the state-wide level.
Finally, I visualize the first-digit distribution of ward-level votes
for Biden.
In determining absentee proportions by ward, I filter for the candidates (Biden and Trump) and party values of interest:
#finding the prop of absentee votes per ward
#filtering for all candidates containing "Biden" or "Trump"
#filtering for all such "DEM" or "REP" observations
absentee_props <- ny_precinct %>%
  filter(str_detect(candidate, "Biden") |
           str_detect(candidate, "Trump")) %>%
  filter(party %in% c("DEM", "REP")) %>%
  mutate(absentee_prop = absentee / votes) %>%
  select(precinct, candidate, absentee_prop)
(It is interesting to note that filtering for only DEM or REP instances of Biden or Trump votes roughly halves the data set; this suggests that a significant number of votes for either candidate were cast under their alternative party lines.)
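A quick count makes that halving claim concrete (an illustrative check; exact counts depend on the file):
#tallying Biden/Trump rows with vs. without a DEM/REP label
ny_precinct %>%
  filter(str_detect(candidate, "Biden") | str_detect(candidate, "Trump")) %>%
  count(major_party = party %in% c("DEM", "REP"))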
Now, I visualize the candidate-specific voter distributions of absentee_prop:
absentee_props %>%
  drop_na(absentee_prop) %>%
  filter(absentee_prop < 1) %>%
  ggplot(aes(x = absentee_prop)) +
  geom_histogram(bins = 50, color = "white") +
  facet_wrap(vars(candidate)) +
  labs(title = "Absentee voting patterns help distinguish candidates",
       x = "Proportion of Absentee Votes",
       y = "Count")
(Again, note how heavily the data is filtered along our parameters of interest! Starting with 32K ward-wide observations of Republican Trump and Democrat Biden votes, only about 5K such observations had valid, i.e., non-NA, recordings for absentee_prop, and a further 24 had absentee_props greater than or equal to 1! That absentee votes were not counted - or, somehow, constituted more than the total votes for a given candidate - suggests issues with the data collection process.)
The distributions shown above highlight a marked difference in recorded absentee_props: the nearly symmetric distributions center around roughly 10% absentee voting for Republican Trump voters and roughly 26% for Democratic Biden voters.
Now, finding state-wide absentee_props:
ny_precinct %>%
filter(str_detect(candidate, "Biden") |
str_detect(candidate, "Trump")) %>%
filter(party %in% c("DEM", "REP")) %>%
group_by(candidate) %>%
select(precinct, candidate, votes, absentee) %>%
drop_na(absentee) %>%
summarize(absentee_prop_state = sum(absentee) / sum(votes))
These findings corroborate the ward-level claims made above, and justify the use of state-wide absentee proportions in the place of ward-by-ward proportions.
Finally, I visualize the ward-by-ward first-digit distribution of votes for Biden:
biden_first_dig <- ny_precinct %>%
  filter(str_detect(candidate, "Biden")) %>%
  mutate(first_dig = get_first(votes)) %>%
  filter(first_dig > 0)

biden_first_dig %>%
  ggplot(aes(x = first_dig)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:9) +
  labs(title = "First-digit distribution of votes for Biden",
       x = "First digit",
       y = "Counts")
The distribution above seems more similar to that expected by Benford’s Law than the one found across Ahmadinejad’s votes. This is (superficially) good news for Biden!
The central emphasis of this analysis was to determine whether the first-digit distributions of both Ahmadinejad and Biden are consistent with those expected under Benford’s Law. To that end, I am interested in gauging how similar the empirical voter distributions are to the theoretical Benford distribution. A method both well-studied and well-suited to this task is to use a Chi-square goodness-of-fit (GoF) test.
The Chi-square goodness-of-fit test statistic quantifies the aggregated, scaled difference between expected and observed counts across all categories; the lower the test statistic, the more closely the observed (e.g., voter) distribution resembles the expected (e.g., Benford) distribution.
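Before reaching for a package, here is a minimal hand-rolled version of the statistic for the Biden first digits (a sketch reusing ben_law() from the first chunk; the infer workflow below is what the analysis relies on):
#hand-computing the Chi-square GoF statistic against Benford expectations
obs_counts <- table(factor(biden_first_dig$first_dig, levels = 1:9))
exp_counts <- ben_law(1:9) * sum(obs_counts)
sum((obs_counts - exp_counts)^2 / exp_counts)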
I now need to run a Chi-squared hypothesis test. I utilize the highly accessible infer package for conducting statistical inference. Due to a quirk in infer’s interface, I also need to supply the Benford probabilities as a labeled vector.
library(infer)

#generating a labeled vector of Benford probabilities for use in infer
#(the names must match the first-digit factor levels, "1" through "9")
ben_props <- setNames(ben_law(digits), digits)
But, before jumping into the analysis, a quick review of hypothesis testing!
If one were to compute this test statistic over many, many simulated samples - each hypothetical, and each differing from the expected distribution only by chance - one could build a distribution of test statistics representing the range and shape of ways an empirical sample distribution might differ from its expected, theoretical population distribution. This is the Chi-squared sampling distribution of test statistics.
The observed test statistic is then evaluated by how likely it would be to arise under this sampling distribution. If, for example, the observed statistic splits the distribution such that 95% of it lies to the left, then only 5% of test statistics generated under the theoretical distribution would be greater than or equal to the observed one. In other words, the observed statistic’s position within the sampling distribution gauges how plausibly it could have been produced by the theoretical distribution of interest.
In practice, this method of (a) hypothesizing a theoretical distribution, (b) generating a distribution of sample statistics under that hypothesis, and (c) determining how likely our observed test statistic would be within this hypothetical model characterizes hypothesis testing. With a significance, or alpha, level of 5%, I accept a 5% chance of wrongly concluding that our observed test statistic does not come from the hypothesized distribution.
Typically, observing a test statistic whose probability of occurrence under the hypothesized distribution - the p-value - is less than alpha provides reasonable grounds to reject the hypothesized data-generating model. For instance, if a test statistic has a p-value of .001, then, with an alpha of .05, I can feel comfortable rejecting the hypothesized data-generating model. Note, however, that this rejection is qualified; I may simply have observed a very unlikely (in this case, 1 in 1000), but nevertheless possible, test statistic produced under that model. This type of error in judgment - rejecting a hypothesis that in fact should not have been rejected - is a Type 1 error, and engineering the avoidance of such errors forms a rich part of current research in statistical methods.
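To make the alpha logic concrete, here is a small simulation (a side sketch, not part of the analysis): drawing many “fair play” first-digit samples from the Benford distribution and checking how often a 5% test falsely rejects.
#simulating the Type 1 error rate of a 5% Chi-square GoF test under Benford
set.seed(20)
p_vals <- replicate(1000, {
  fake <- sample(1:9, size = 300, replace = TRUE, prob = ben_law(1:9))
  chisq.test(table(factor(fake, levels = 1:9)), p = ben_law(1:9))$p.value
})
mean(p_vals < .05)  #should land near .05, the rate of Type 1 errors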
Equipped with a basic understanding of hypothesis testing and its utility, I apply it to the analysis at hand.
First, I need to determine the Chi-squared test statistic.
#factoring first_dig so that a Chisq test can be run
biden_first_dig <- biden_first_dig %>%
  mutate(first_dig = factor(first_dig))

#calculating the Chisq test stat
#set.seed() for reproducibility in inference
set.seed(1)
biden_obs_stat <- biden_first_dig %>%
  specify(response = first_dig) %>%
  hypothesize(null = "point",
              p = ben_props) %>%
  generate(reps = 1, type = "draw") %>%
  calculate(stat = "Chisq")
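(A caveat worth flagging: generate(reps = 1, type = "draw") simulates one new sample under the null rather than scoring the observed counts. The infer chi-squared vignette cited in the References computes the observed statistic without a generate() step, along these lines:)
#vignette-style observed statistic, computed directly from the data
biden_first_dig %>%
  specify(response = first_dig) %>%
  hypothesize(null = "point",
              p = ben_props) %>%
  calculate(stat = "Chisq")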
Our next goal is to visualize and assess this test statistic relative
to its probability of appearing under the hypothetical Chi-squared
distribution which represents variations in untampered voter
behavior. Using infer, I can generate this distribution and
visualize the placement of the observed test statistic in one fell
swoop:
set.seed(2)
biden_chisq_dist <- biden_first_dig %>%
  specify(response = first_dig) %>%
  hypothesize(null = "point",
              p = ben_props) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "Chisq")

biden_chisq_dist %>%
  visualize() +
  shade_p_value(biden_obs_stat,
                direction = "greater") +
  labs(title = "US 2020: Biden Chi-squared distribution",
       subtitle = "Observed test stat suggests lack of voter fraud.",
       x = "Generated test statistics",
       y = "Count")
As can be seen, under the assumption of no voter fraud (i.e., “fair play”), the observed first_dig distribution for Biden varies from the Benford distribution no more than chance alone would suggest. To quantify how similar the empirical and Benford distributions are, I find a p-value:
p_value <- biden_chisq_dist %>%
  get_p_value(biden_obs_stat,
              direction = "greater")

p_value
The p-value of .791 indicates that about 80% of all other foreseeable first-digit distributions derived under the assumption of fair play would have been quantifiably more different from the Benford distribution than the empirical one is. Using an alpha level of 5%, I conclude that the empirical distribution is “close enough” in shape to the Benford distribution. Does analysis of the 2009 Iran election hold up against these standards?
I follow the same heuristic as above, identifying an observed test statistic and its likelihood of appearing under an assumption of “fair play.”
#factoring first_digit so that a Chisq test can be run
a_first_dig <- a_first_dig %>%
  mutate(first_digit = factor(first_digit))

#calculating the Chisq test stat
#set.seed() for reproducibility in inference
set.seed(11)
a_obs_stat <- a_first_dig %>%
  specify(response = first_digit) %>%
  hypothesize(null = "point",
              p = ben_props) %>%
  generate(reps = 1, type = "draw") %>%
  calculate(stat = "Chisq")

set.seed(22)
a_chisq_dist <- a_first_dig %>%
  specify(response = first_digit) %>%
  hypothesize(null = "point",
              p = ben_props) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "Chisq")
a_chisq_dist %>%
  visualize() +
  shade_p_value(a_obs_stat,
                direction = "greater") +
  labs(title = "Iran 2009: Ahmadinejad Chi-squared distribution",
       subtitle = "Observed test stat suggests lack of voter fraud.",
       x = "Generated test statistics",
       y = "Count")
Interestingly, the visualization above suggests that the empirical first-digit distribution for Ahmadinejad does not differ significantly from the Benford distribution. To be precise, I evaluate a p-value:
p_value <- a_chisq_dist %>%
  get_p_value(a_obs_stat,
              direction = "greater")

p_value
This p-value of .516 suggests that just over 50% of plausibly non-fraudulent first-digit distributions would quantifiably differ more from the Benford distribution than the one at hand. Using an alpha level of 5%, I conclude that the empirical distribution is “close enough” in shape to the Benford distribution. I explore this contrary conclusion in the section below.
Firstly, the fraudulence claims presented here rest on the validity of using Benford’s Law as a proxy for real-world data-generating processes.
An EDA of Iran’s 2009 election revealed a disparity in city-based proportions of voided votes per candidate: cities in which Mousavi won appeared about 2x more likely to void votes than cities in which Ahmadinejad won. Although the stronger claim - i.e., that individual votes for Mousavi were more likely to be voided than those for Ahmadinejad - cannot be substantiated at the granularity of the data at hand, a first-digit distribution of the aggregate city-based vote counts for Ahmadinejad reveals discrepancies with Benford’s Law. Used as a proxy for real-world data-generating phenomena, this discrepancy hints at election fraud in the 2009 Iran election.
In a similar vein, an EDA of precinct-/ward-level vote counts across New York during the 2020 US election cycle revealed fairly regular absentee voting patterns across candidates: the ward-level proportion of absentee votes for Trump was distributed nearly symmetrically around 10%, and that for Biden around 26%. Investigating the claim that absentee voting contributed to voter fraud in 2020, a superficial glance at the first-digit distribution of vote counts for Biden suggested that no foul play was at hand.
Quantifying the extent to which these empirical first-digit distributions differed from the Benford distribution served as a single-pronged proxy for the presence of fraud. To that end, Chi-squared goodness-of-fit (GoF) tests were used to analyze the distributions at hand. The empirical first-digit distribution for Biden yielded a p-value of .791, rendering me unable to reject the notion of a “fairly played” election. Surprisingly, a p-value of .516 on Ahmadinejad’s distribution led to the same conclusion.
This last conclusion seems contrary to a visual inspection of the empirical data at hand; it seems that these first-digit counts would not have been generated under a Benford process. It’s possible that I’m using some infer methods incorrectly - in particular, the caveat flagged in the analysis section (piping generate(reps = 1, type = "draw") into calculate(), which scores a simulated draw under the null rather than the observed counts) could mask exactly this kind of discrepancy. This leaves room for much further analysis and use of the Chi-squared GoF through infer - and for more growth!
On that note, engaging in good data science entails a willingness to cycle through the “data lifecycle.” Besides evaluating my use of infer, a follow-up question of interest would be to explore the use of absentee proportions as a predictive feature of candidate selection in the 2020 US elections. In other words, how well does knowing a ward’s proportion of absentee votes predict which candidate - either Biden or Trump - its votes were ultimately cast for? The use of a one-dimensional numerical predictor for a categorical variable suggests that this question may benefit from a 1-D k-nearest neighbors algorithm, sketched below.
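As a rough sketch of what that follow-up could look like (an illustration only, using class::knn on the absentee_props table built earlier; k = 15 is an arbitrary, untuned choice):
#a minimal 1-D k-NN: predicting candidate from a ward's absentee proportion
library(class)

knn_data <- absentee_props %>%
  drop_na(absentee_prop) %>%
  filter(absentee_prop < 1) %>%
  mutate(candidate = if_else(str_detect(candidate, "Biden"), "Biden", "Trump"))

set.seed(3)
train_idx <- sample(nrow(knn_data), size = floor(0.8 * nrow(knn_data)))
train <- knn_data[train_idx, ]
test  <- knn_data[-train_idx, ]

preds <- knn(train = train["absentee_prop"],
             test  = test["absentee_prop"],
             cl    = factor(train$candidate),
             k     = 15)

mean(preds == test$candidate)  #rough out-of-sample accuracy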
As a final note: I learned a lot while working on this analysis! Besides furthering my R fluency, e.g., finding methods to detect a regex pattern in an observation, I learned about (somewhat convoluted) voting infrastructures. For example, the fact that many observations have NA values for absentee voting suggests that not all wards record such “meta” information, which limits their analysis.
“Benford’s Law Graphed in R.” https://rstudio-pubs-static.s3.amazonaws.com/205239_5c34105c42b643259f2353762f31c330.html.
“Benford’s Law.” Wikipedia, Wikimedia Foundation, 17 Jan. 2023, https://en.wikipedia.org/wiki/Benford%27s_law.
Mebane, Walter. Note on the Presidential Election in Iran, June 2009. https://websites.umich.edu/~wmebane/note29jun2009.pdf.
“Tidy Chi-Squared Tests with Infer.” infer package vignette, CRAN, https://cran.r-project.org/web/packages/infer/vignettes/chi_squared.html.
Notes
The documentation for Andrew Bray’s get_first() method
can be found here.
The public Git repo housing the data I used for my Iran analysis can
be found here.
An important note for reproducibility: I downloaded the
iran .rda data frame file locally.