This write-up was inspired by a Spring 2022 Stat20 assignment on evaluating election fraud using Benford’s Law. Stat20 is UC Berkeley’s introductory computational statistics course; I joined its course staff in Fall 2022.
Benford’s Law appears to be a well-attested phenomenon whereby the first-digit distributions of “real life” data sets follow a logarithmic decay (Wikipedia). Under the law, the probability of observing a leading digit d is log10(1 + 1/d), so a leading 1 appears about 30% of the time while a leading 9 appears under 5% of the time. This law of anomalous numbers - so named by the physicist Frank Benford in 1938 - has been used in contexts ranging from the evaluation of genomic data to the detection of election fraud. A visualization of the first-digit distribution under Benford’s Law is shown below. (Adapted from “Benford’s Law Graphed in R”.)
In this simple write-up, I conduct exploratory data analyses (EDAs) of both the 2009 Iran election and the 2020 US election. I then generate first-digit distributions and apply Chi-squared goodness-of-fit (GoF) tests against the Benford distribution. Finally, I evaluate these results in a hypothesis testing framework. I end with conclusions on the findings.
#adapted from "Benford's Law Graphed in R" - see References.
library(tidyverse)

digits <- 1:9
ben_law <- function(d) log10(1 + 1/d)
ben_dist <- tibble(digit = digits, prob = ben_law(digits))

ggplot(ben_dist, aes(x = factor(digit), y = prob)) +
  geom_col() +
  labs(title = "First-digit values observed under Benford's Law.",
       x = "First digit",
       y = "Probability of observing")
#see the Notes section on accessing the Iran data.
load("/Users/benelhiguchi/Downloads/iran.rda")
head(iran)
The iran dataframe consists of 366 observations across 9 variables. In this tidy dataframe, the unit of observation is an Iranian city; its province is noted. The dataframe contains the number of legitimate votes cast for each candidate (ahmadinejad, rezai, karrubi, and mousavi), as well as the total_votes_cast by the city (comprised of both voided_votes and legitimate_votes). Mahmoud Ahmadinejad won the 2009 election, though analyses by Mebane (2009) and others suggest fraudulent results.
First, I compute each candidate’s share of the total legitimate vote, then count the number of cities won by each candidate:
#finding the proportion of total legitimate votes cast for each candidate
iran %>%
  summarize(totalA = sum(ahmadinejad),
            totalR = sum(rezai),
            totalK = sum(karrubi),
            totalM = sum(mousavi),
            total = sum(legitimate_votes)) %>%
  summarize(ahmadinejad = totalA/total,
            rezai = totalR/total,
            karrubi = totalK/total,
            mousavi = totalM/total)
#comparing legitimate vote counts to determine winner per city and summing
wins <- iran %>%
  rename(a = ahmadinejad,
         r = rezai,
         k = karrubi,
         m = mousavi) %>%
  mutate(winA = ((a > r) & (a > k) & (a > m)),
         winR = ((r > a) & (r > k) & (r > m)),
         winK = ((k > a) & (k > r) & (k > m)),
         winM = ((m > a) & (m > r) & (m > k)))

wins %>%
  summarize(ahmadinejad = sum(winA),
            rezai = sum(winR),
            karrubi = sum(winK),
            mousavi = sum(winM))
Both EDAs reveal that Rezai and Karrubi won virtually no cities in the 2009 Iran election. Incidentally, the disparity between national vote proportions and city-win proportions (i.e., Ahmadinejad winning 63% of the national vote versus 87% of cities) suggests room for a geographic analysis of vote distributions.
To explore the role of voided_votes in the outcome of the election, I summarize the proportion of total voided_votes against total_votes_cast. Then, I visualize how this proportion varies at the city level. Finally, I summarize this distribution conditioned on the winning candidate:
iran %>%
  summarize(prop_voided = sum(voided_votes)/sum(total_votes_cast))

wins <- wins %>%
  mutate(prop_voided = voided_votes/total_votes_cast)

wins %>%
  ggplot(aes(x = prop_voided)) +
  geom_histogram(bins = 50, color = "white") +
  geom_vline(xintercept = .011, color = "red", linetype = "dashed") +
  annotate("text",
           label = "Average Voided Proportion: .011",
           x = .035, y = 40) +
  labs(title = "Distribution of city-level voided vote proportions.",
       x = "Proportion of voided votes",
       y = "Count")
The distribution of city-wide voided vote proportions is right-skewed; the national voided vote proportion is pulled upward by an outlying observation.
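A quick peek locates that outlier (a hedged sketch: I assume the city and province identifiers are columns named city and province in the iran data):
#locating the outlying observations noted above
wins %>%
  slice_max(prop_voided, n = 3) %>%
  select(city, province, prop_voided)
Now, conditioning on winner: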
#dropping r and k observations
wins_am <- wins %>%
  filter(winA | winM)

wins_am %>%
  group_by(winA) %>%
  summarize(prop_voided_avg = mean(prop_voided),
            prop_voided_sd = sd(prop_voided))

prop_voided_a <- wins_am %>%
  filter(winA) %>%
  ggplot(aes(x = prop_voided)) +
  geom_histogram(bins = 50, color = "white") +
  geom_vline(xintercept = .008, color = "red", linetype = "dashed")

prop_voided_m <- wins_am %>%
  filter(winM) %>%
  ggplot(aes(x = prop_voided)) +
  geom_histogram(bins = 50, color = "white") +
  geom_vline(xintercept = .016, color = "red", linetype = "dashed")

#displaying the two conditional histograms
prop_voided_a
prop_voided_m
The summary table above shows that the proportion of voided votes varies significantly across candidates: it is about 2x higher in cities that voted for Mousavi (winA = FALSE) than in those that voted for Ahmadinejad. The standard deviations of these measures also show higher variability in the cities that voted for Mousavi. (Part of this variability could be explained by the outlying observation noted above, but further analysis would likely show that this is not the whole story.)
The significant difference in voided vote proportions suggests that votes for Mousavi were more often voided than votes for Ahmadinejad, though this claim cannot be substantiated at the granularity of the data provided. (Justifying such a claim would require resolution at the level of individual votes rather than the aggregate voting behavior of an entire city.)
For now, I begin investigating election fraud by generating a first-digit distribution of vote counts for Ahmadinejad across cities (in line with using Benford’s Law):
#get_first() extracts the leading digit of a count; see Notes for its documentation
a_first_dig <- wins %>%
  mutate(first_digit = get_first(a))

a_first_dig %>%
  ggplot(aes(x = first_digit)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:9) +
  labs(title = "First-digit distribution of votes for Ahmadinejad",
       x = "First digit",
       y = "Counts")
It is immediately clear that the first-digit distribution of city-based votes for Ahmadinejad visually differs from that expected by Benford’s Law. Quantifying this difference will be explored under the analysis section.
I now turn to data collected by OpenElections during the 2020 US election cycle. I chose to analyze data collected at the New York precinct level. (It would be interesting to discuss the implications of granularity differences across the US and Iran data analyzed, i.e., how do claims made at the city level compare with those at the precinct level?)
ny_precinct <- read_csv("https://raw.githubusercontent.com/openelections/openelections-data-ny/master/2020/20201103__ny__general__precinct.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
head(ny_precinct)
The ny_precinct data frame consists of 426,582 observations across 13 variables. It is difficult to gauge what the unit of observation is: it seems to be a (precinct, candidate) tuple, but multiple such observations also seem to be repeated throughout the dataset. Time to explore!
First, I’d like to get a sense of how many precincts are
being represented in the dataset:
#counting unique precinct observations
ny_precinct %>%
  distinct(precinct) %>%
  summarize(count = n())
There appear to be 15,402 precincts. But this seems unreasonable: a quick Google search turns up, for instance, only 77 police precincts in all of New York City. In fact, a glance at the dataset suggests that precinct observations are actually smaller units, such as wards!
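That glance can be reproduced directly (the exact labels depend on the OpenElections file):
#sampling some precinct labels to see what the unit actually is
ny_precinct %>%
  distinct(precinct) %>%
  head(10)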
We are getting closer to understanding what the unit of observation is. I imagine that the data was recorded as the count of votes each candidate received for each office in a given precinct’s ward. With that in mind, I now look for (a) an example of the candidates one could choose for a given office, and (b) the number of candidates that ran for each office. Hopefully these counts will yield insight into how 426K observations were made.
#an example of candidate choices for State Senate in Corning Ward 1
ny_precinct %>%
  filter(precinct == "City of Corning Ward 1" & office == "State Senate") %>%
  select(candidate, party)
This sample peek into the candidate choices for State Senate in Corning Ward 1 reveals that votes for a given candidate are tallied separately under each party line on which that candidate appeared (New York permits fusion voting)!
#finding number of candidates that ran for each office
ny_precinct %>%
  group_by(office) %>%
  select(office, candidate) %>%
  unique() %>%
  summarize(count = n())
Now, how should I paint a picture of the data collection process? First, I imagine myself as a New York voter: my ballot implicitly records where I’m voting from (the precinct ward), my one choice for President, my local choice(s) for State Assembly member(s) - some subset of the 342 listed across the state - and likewise down the list of 4 possible offices to vote for.
Now, I imagine myself as an election officer. I go through individual ballots, tallying how many votes each candidate receives by party affiliation; a given candidate’s votes may be recorded over multiple such affiliations. Furthermore, I mark the mode of these votes, such as whether a vote was delivered as an absentee ballot or as a machine_vote. Finally, I also record some “overhead” metrics for these particular office results, such as how many votes were sent in (i.e., the Total) and the number of Voids, as additional candidate values.
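Those overhead rows can be peeked at directly (a hedged illustration: the exact labels, e.g. "Total" or "Void", may vary across counties in the OpenElections file):
#peeking at the "overhead" rows described above
ny_precinct %>%
  filter(str_detect(candidate, "Total|Void")) %>%
  select(precinct, office, candidate, votes) %>%
  head()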
Aha! Each observation in the dataset roughly measures how many votes an individual candidate - faceted by party affiliation - received in a given precinct ward, with each observation also categorizing the type of votes received. Across the 15K precinct wards, the range of local and national candidates to choose from (and their number per office!), and the inclusion of “overhead” metrics (such as the Total votes for a given office), 426K observations were made.
Note that this dataset, unlike the one used in the Iran analysis, doesn’t follow tidy data standards. This fact, and the ad-hoc inclusion of “overhead” metrics as candidate values, made for a lengthy surface exploration of the dataset at hand.
In this EDA I am interested in determining voting behavior within a
precinct ward by vote type. Of particular
interest is the proportion of votes cast in an
absentee ballot, as these votes stirred quite
the controversy during the 2020 General Elections. Being interested in
the behavior of Biden and Trump voters, I use these ward-by-ward
proportions to visualize candidate-specific histograms
detailing the spread of absentee proportions; specifically,
I am interested in these distributions as they pertain to either
DEM or REP party affiliations. I also compare
these proportions to the same statistic at the state-wide level.
Finally, I visualize the first-digit distribution of ward-level votes
for Biden.
In determining absentee proportions by ward, I filter for the candidates (Biden and Trump) and party values of interest:
#finding the prop of absentee votes per ward
#filtering for all candidates containing "Biden" or "Trump"
#filtering for all such "DEM" or "REP" observations
absentee_props <- ny_precinct %>%
  filter(str_detect(candidate, "Biden") |
           str_detect(candidate, "Trump")) %>%
  filter(party %in% c("DEM", "REP")) %>%
  mutate(absentee_prop = absentee / votes) %>%
  select(precinct, candidate, absentee_prop)
(It is interesting to note that filtering for only DEM or REP instances of Biden or Trump votes roughly halves the data set; this suggests that a significant number of votes for either candidate were cast under their alternative party lines.)
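A quick count makes that halving claim concrete (an illustrative check; exact counts depend on the file):
#tallying Biden/Trump rows with vs. without a DEM/REP label
ny_precinct %>%
  filter(str_detect(candidate, "Biden") | str_detect(candidate, "Trump")) %>%
  count(major_party = party %in% c("DEM", "REP"))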
Now, I visualize the candidate-specific voter distributions of absentee_prop:
absentee_props %>%
  drop_na(absentee_prop) %>%
  filter(absentee_prop < 1) %>%
  ggplot(aes(x = absentee_prop)) +
  geom_histogram(bins = 50, color = "white") +
  facet_wrap(vars(candidate)) +
  labs(title = "Absentee voting patterns help distinguish candidates",
       x = "Proportion of Absentee Votes",
       y = "Count")
(Again, note how heavily the data is filtered along our parameters of interest! Starting with 32K ward-wide observations of Republican Trump and Democrat Biden votes, only about 5K such observations had valid, i.e., non-NA, recordings for absentee_prop, and a further 24 had absentee_props greater than or equal to 1! That absentee votes were not counted - or, somehow, constituted more than the total votes for a given candidate - suggests issues with the data collection process.)
The distributions shown above highlight a marked difference in recorded absentee_props: the nearly symmetric distributions center around roughly 10% absentee voting for Republican Trump voters and roughly 26% for Democratic Biden voters.
Now, finding state-wide absentee_props:
ny_precinct %>%
filter(str_detect(candidate, "Biden") |
str_detect(candidate, "Trump")) %>%
filter(party %in% c("DEM", "REP")) %>%
group_by(candidate) %>%
select(precinct, candidate, votes, absentee) %>%
drop_na(absentee) %>%
summarize(absentee_prop_state = sum(absentee) / sum(votes))
These findings corroborate the ward-level claims made above, and justify the use of state-wide absentee proportions in the place of ward-by-ward proportions.
Finally, I visualize the ward-by-ward first-digit distribution of votes for Biden:
biden_first_dig <- ny_precinct %>%
  filter(str_detect(candidate, "Biden")) %>%
  mutate(first_dig = get_first(votes)) %>%
  filter(first_dig > 0)

biden_first_dig %>%
  ggplot(aes(x = first_dig)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:9) +
  labs(title = "First-digit distribution of votes for Biden",
       x = "First digit",
       y = "Counts")
The distribution above seems more similar to that expected by Benford’s Law than the one found across Ahmadinejad’s votes. This is (superficially) good news for Biden!
The central emphasis of this analysis was to determine whether the first-digit distributions of both Ahmadinejad and Biden are consistent with those expected under Benford’s Law. To that end, I am interested in gauging how similar the empirical voter distributions are to the theoretical Benford distribution. A method both well-studied and well-suited to this task is to use a Chi-square goodness-of-fit (GoF) test.
The Chi-square goodness-of-fit test statistic quantifies the aggregated, scaled difference between expected and observed counts across all categories; the lower the test statistic, the more closely the observed (e.g., voter) distribution resembles the expected (e.g., Benford) distribution.
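Before reaching for a package, here is a minimal hand-rolled version of the statistic for the Biden first digits (a sketch reusing ben_law() from the first chunk; the infer workflow below is what the analysis relies on):
#hand-computing the Chi-square GoF statistic against Benford expectations
obs_counts <- table(factor(biden_first_dig$first_dig, levels = 1:9))
exp_counts <- ben_law(1:9) * sum(obs_counts)
sum((obs_counts - exp_counts)^2 / exp_counts)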
I now need to run a Chi-squared hypothesis test. I utilize the highly accessible infer package for conducting statistical inference. Due to a quirk in infer’s interface, I also need to supply the Benford probabilities as a labeled vector.
library(infer)

#generating a labeled vector of Benford probabilities for use in infer
#(the names must match the first-digit factor levels, "1" through "9")
ben_props <- setNames(ben_law(digits), digits)
But, before jumping into the analysis, a quick review of hypothesis testing!
If one were to compute this test statistic over many, many simulated samples - each hypothetical, and each differing from the expected distribution only by chance - one could build a distribution of test statistics representing the range and shape of ways an empirical sample distribution might differ from its expected, theoretical population distribution. This is the Chi-squared sampling distribution of test statistics.
The observed test statistic is then evaluated by how likely it would be to arise under this sampling distribution. If, for example, the observed statistic splits the distribution such that 95% of it lies to the left, then only 5% of test statistics generated under the theoretical distribution would be greater than or equal to the observed one. In other words, the observed statistic’s position within the sampling distribution gauges how plausibly it could have been produced by the theoretical distribution of interest.
In practice, this method of (a) hypothesizing a theoretical distribution, (b) generating a distribution of sample statistics under that hypothesis, and (c) determining how likely our observed test statistic would be within this hypothetical model characterizes hypothesis testing. With a significance, or alpha, level of 5%, I accept a 5% chance of wrongly concluding that our observed test statistic does not come from the hypothesized distribution.
Typically, observing a test statistic whose probability of occurrence under the hypothesized distribution - the p-value - is less than alpha provides reasonable grounds to reject the hypothesized data-generating model. For instance, if a test statistic has a p-value of .001, then, with an alpha of .05, I can feel comfortable rejecting the hypothesized data-generating model. Note, however, that this rejection is qualified; I may simply have observed a very unlikely (in this case, 1 in 1000), but nevertheless possible, test statistic produced under that model. This type of error in judgment - rejecting a hypothesis that in fact should not have been rejected - is a Type 1 error, and engineering the avoidance of such errors forms a rich part of current research in statistical methods.
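To make the alpha logic concrete, here is a small simulation (a side sketch, not part of the analysis): drawing many “fair play” first-digit samples from the Benford distribution and checking how often a 5% test falsely rejects.
#simulating the Type 1 error rate of a 5% Chi-square GoF test under Benford
set.seed(20)
p_vals <- replicate(1000, {
  fake <- sample(1:9, size = 300, replace = TRUE, prob = ben_law(1:9))
  chisq.test(table(factor(fake, levels = 1:9)), p = ben_law(1:9))$p.value
})
mean(p_vals < .05)  #should land near .05, the rate of Type 1 errors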
Equipped with a basic understanding of hypothesis testing and its utility, I apply it to the analysis at hand.
First, I need to determine the Chi-squared test statistic.
#factoring first_dig so that a Chisq test can be run
biden_first_dig <- biden_first_dig %>%
  mutate(first_dig = factor(first_dig))

#calculating the Chisq test stat
#set.seed() for reproducibility in inference
set.seed(1)
biden_obs_stat <- biden_first_dig %>%
  specify(response = first_dig) %>%
  hypothesize(null = "point",
              p = ben_props) %>%
  generate(reps = 1, type = "draw") %>%
  calculate(stat = "Chisq")
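(A caveat worth flagging: generate(reps = 1, type = "draw") simulates one new sample under the null rather than scoring the observed counts. The infer chi-squared vignette cited in the References computes the observed statistic without a generate() step, along these lines:)
#vignette-style observed statistic, computed directly from the data
biden_first_dig %>%
  specify(response = first_dig) %>%
  hypothesize(null = "point",
              p = ben_props) %>%
  calculate(stat = "Chisq")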
Our next goal is to visualize and assess this test statistic relative
to its probability of appearing under the hypothetical Chi-squared
distribution which represents variations in untampered voter
behavior. Using infer, I can generate this distribution and
visualize the placement of the observed test statistic in one fell
swoop:
set.seed(2)
biden_chisq_dist <- biden_first_dig %>%
  specify(response = first_dig) %>%
  hypothesize(null = "point",
              p = ben_props) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "Chisq")

biden_chisq_dist %>%
  visualize() +
  shade_p_value(biden_obs_stat,
                direction = "greater") +
  labs(title = "US 2020: Biden Chi-squared distribution",
       subtitle = "Observed test stat suggests lack of voter fraud.",
       x = "Generated test statistics",
       y = "Count")
As can be seen, under the assumption of no voter fraud (i.e., “fair play”), the observed first_dig distribution for Biden varies from the Benford distribution no more than chance alone would suggest. To quantify how similar the empirical and Benford distributions are, I find a p-value:
p_value <- biden_chisq_dist %>%
  get_p_value(biden_obs_stat,
              direction = "greater")

p_value
The p-value of .791 indicates that about 80% of all other foreseeable first-digit distributions derived under the assumption of fair play would have been quantifiably more different from the Benford distribution than the empirical one is. Using an alpha level of 5%, I conclude that the empirical distribution is “close enough” in shape to the Benford distribution. Does analysis of the 2009 Iran election hold up against these standards?
I follow the same heuristic as above, identifying an observed test statistic and its likelihood of appearing under an assumption of “fair play.”
#factoring first_digit so that a Chisq test can be run
a_first_dig <- a_first_dig %>%
  mutate(first_digit = factor(first_digit))

#calculating the Chisq test stat
#set.seed() for reproducibility in inference
set.seed(11)
a_obs_stat <- a_first_dig %>%
  specify(response = first_digit) %>%
  hypothesize(null = "point",
              p = ben_props) %>%
  generate(reps = 1, type = "draw") %>%
  calculate(stat = "Chisq")

set.seed(22)
a_chisq_dist <- a_first_dig %>%
  specify(response = first_digit) %>%
  hypothesize(null = "point",
              p = ben_props) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "Chisq")
a_chisq_dist %>%
  visualize() +
  shade_p_value(a_obs_stat,
                direction = "greater") +
  labs(title = "Iran 2009: Ahmadinejad Chi-squared distribution",
       subtitle = "Observed test stat suggests lack of voter fraud.",
       x = "Generated test statistics",
       y = "Count")
Interestingly, the visualization above suggests that the empirical first-digit distribution for Ahmadinejad does not differ significantly from the Benford distribution. To be precise, I evaluate a p-value:
p_value <- a_chisq_dist %>%
  get_p_value(a_obs_stat,
              direction = "greater")

p_value
This p-value of .516 suggests that just over 50% of plausibly non-fraudulent first-digit distributions would quantifiably differ more from the Benford distribution than the one at hand. Using an alpha level of 5%, I conclude that the empirical distribution is “close enough” in shape to the Benford distribution. I explore this contrary conclusion in the section below.
Firstly, the fraudulence claims presented here rest on the validity of using Benford’s Law as a proxy for real-world data-generating processes.
An EDA of Iran’s 2009 election revealed a disparity in city-based proportions of voided votes per candidate: cities in which Mousavi won appeared about 2x more likely to void votes than cities in which Ahmadinejad won. Although the stronger claim - i.e., that individual votes for Mousavi were more likely to be voided than those for Ahmadinejad - cannot be substantiated at the granularity of the data at hand, a first-digit distribution of the aggregate city-based vote counts for Ahmadinejad reveals discrepancies with Benford’s Law. Used as a proxy for real-world data-generating phenomena, this discrepancy hints at election fraud in the 2009 Iran election.
In a similar vein, an EDA of precinct-/ward-level vote counts across New York during the 2020 US election cycle revealed fairly regular absentee voting patterns across candidates: the ward-level proportion of absentee votes for Trump was distributed nearly symmetrically around 10%, and that for Biden around 26%. Investigating the claim that absentee voting contributed to voter fraud in 2020, a superficial glance at the first-digit distribution of vote counts for Biden suggested that no foul play was at hand.
Quantifying the extent to which these empirical first-digit distributions differed from the Benford distribution served as a single-pronged proxy for the presence of fraud. To that end, Chi-squared goodness-of-fit (GoF) tests were used to analyze the distributions at hand. The empirical first-digit distribution for Biden yielded a p-value of .791, rendering me unable to reject the notion of a “fairly played” election. Surprisingly, a p-value of .516 on Ahmadinejad’s distribution led to the same conclusion.
This last conclusion seems contrary to a visual inspection of the empirical data at hand; it seems that these first-digit counts would not have been generated under a Benford process. It’s possible that I’m using some infer methods incorrectly - in particular, the caveat flagged in the analysis section (piping generate(reps = 1, type = "draw") into calculate(), which scores a simulated draw under the null rather than the observed counts) could mask exactly this kind of discrepancy. This leaves room for much further analysis and use of the Chi-squared GoF through infer - and for more growth!
On that note, engaging in good data science entails a willingness to cycle through the “data lifecycle.” Besides evaluating my use of infer, a follow-up question of interest would be to explore the use of absentee proportions as a predictive feature of candidate selection in the 2020 US elections. In other words, how well does knowing a ward’s proportion of absentee votes predict which candidate - either Biden or Trump - its votes were ultimately cast for? The use of a one-dimensional numerical predictor for a categorical variable suggests that this question may benefit from a 1-D k-nearest neighbors algorithm, sketched below.
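As a rough sketch of what that follow-up could look like (an illustration only, using class::knn on the absentee_props table built earlier; k = 15 is an arbitrary, untuned choice):
#a minimal 1-D k-NN: predicting candidate from a ward's absentee proportion
library(class)

knn_data <- absentee_props %>%
  drop_na(absentee_prop) %>%
  filter(absentee_prop < 1) %>%
  mutate(candidate = if_else(str_detect(candidate, "Biden"), "Biden", "Trump"))

set.seed(3)
train_idx <- sample(nrow(knn_data), size = floor(0.8 * nrow(knn_data)))
train <- knn_data[train_idx, ]
test  <- knn_data[-train_idx, ]

preds <- knn(train = train["absentee_prop"],
             test  = test["absentee_prop"],
             cl    = factor(train$candidate),
             k     = 15)

mean(preds == test$candidate)  #rough out-of-sample accuracy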
As a final note: I learned a lot while working on this analysis! Besides furthering my R fluency, e.g., finding methods to detect a regex pattern in an observation, I learned about (somewhat convoluted) voting infrastructures. For example, the fact that many observations have NA values for absentee voting suggests that not all wards record such “meta” information, which limits their analysis.
“Benford’s Law Graphed in R.” https://rstudio-pubs-static.s3.amazonaws.com/205239_5c34105c42b643259f2353762f31c330.html.
“Benford’s Law.” Wikipedia, Wikimedia Foundation, 17 Jan. 2023, https://en.wikipedia.org/wiki/Benford%27s_law.
Mebane, Walter. Note on the Presidential Election in Iran, June 2009. https://websites.umich.edu/~wmebane/note29jun2009.pdf.
“Tidy Chi-Squared Tests with Infer.” infer package vignette, CRAN, https://cran.r-project.org/web/packages/infer/vignettes/chi_squared.html.
Notes
The documentation for Andrew Bray’s get_first() method
can be found here.
The public Git repo housing the data I used for my Iran analysis can
be found here.
An important note for reproducibility: I downloaded the
iran .rda data frame file locally.