In the 2014 gubermatorial general election in Howard County, Maryland, a subject of interest was relative turnout for Republican vs. Democratic voters, with higher Republican turnout seen as the key for Allan Kittleman’s election as County Executive (not to mention other Republican victories from Larry Hogan on down). A related question is whether turnout was depressed in certain Howard County council districts due to lack of opposition to the incumbent council members. (In particular, in districts 3 and 4 Jen Terrasa and Mary-Kay Sigaty respectively had no declared Republican opponents.)
In this and subsequent parts of this series I take a look at turnout statistics to explore these questions, starting in part 1 with a look at the state-wide turnout datasets for the Maryland 2014 general election.
For this analysis I use the R statistical package run from the RStudio development environment, along with the readxl package to extract data from Excel spreadsheets and the dplyr package to do data manipulation.
library("readxl")
library("dplyr", warn.conflicts = FALSE)
The Maryland State Board of Elections has published a number of turnout-related reports (in both PDF and Microsoft Excel format) as part of its 2014 general election reports. The two reports of most interest for my purpose are the state-wide per-county turnout statistics [Excel] and the state-wide per-precinct turnout statistics [Excel].
First I download the Excel-format data files from the Maryland State Board of Elections containing statewide 2014 general election turnout by county and by precincts, and store a copy of the data in the local files GG14_Turnout_by_party_by_county.xlsx
and GG14_Turnout_by_party_by_precinct.xlsx
.
download.file(
"http://www.elections.state.md.us/elections/2014/turnout/general/GG14_Turnout_by_party_by_county.xlsx",
"GG14_Turnout_by_party_by_county.xlsx",
method = "curl")
download.file(
"http://www.elections.state.md.us/elections/2014/turnout/general/GG14_Turnout_by_party_by_precinct.xlsx",
"GG14_Turnout_by_party_by_precinct.xlsx",
method = "curl")
I then load the Excel spreadsheets into the dataframes county_turnout_raw
and precinct_turnout_raw
respectively.
county_turnout_raw <- read_excel("GG14_Turnout_by_party_by_county.xlsx")
colnames(county_turnout_raw)[7] <- "Percent_Turnout"
str(county_turnout_raw)
## Classes 'tbl_df', 'tbl' and 'data.frame': 193 obs. of 7 variables:
## $ Totals : chr "Allegany" "Anne Arundel" "Baltimore City" "Baltimore County" ...
## $ POLLS : chr "18480" "133124" "107961" "203005" ...
## $ EV : chr "1504" "38654" "25921" "51812" ...
## $ ABS : chr "687" "5784" "3829" "7089" ...
## $ PROV : chr "222" "3913" "4650" "4419" ...
## $ ELIGIBLE_VOTERS: chr "42560" "349313" "373171" "521130" ...
## $ Percent_Turnout: chr "0.49090695488721803" "0.51951974303847837" "0.38148998716406152" "0.51105290426573025" ...
precinct_turnout_raw <- read_excel("GG14_Turnout_by_party_by_precinct.xlsx")
str(precinct_turnout_raw)
## Classes 'tbl_df', 'tbl' and 'data.frame': 12552 obs. of 13 variables:
## $ ELECTION_CODE : chr "GG14" "GG14" "GG14" "GG14" ...
## $ LBE : chr "Allegany" "Allegany" "Allegany" "Allegany" ...
## $ LBE_CODE : num 1 1 1 1 1 1 1 1 1 1 ...
## $ CONGRESSIONAL_DISTRICT_CODE: num 6 6 6 6 6 6 6 6 6 6 ...
## $ LEGISLATIVE_DISTRICT_CODE : chr "01C" "01C" "01C" "01C" ...
## $ PRECINCT : chr "001-000" "001-000" "001-000" "001-000" ...
## $ PARTY : chr "DEMOCRAT" "GREEN" "LIBERTARIAN" "OTHER PARTIES" ...
## $ POLLS : num 73 0 3 3 213 35 73 0 1 2 ...
## $ EARLY_VOTING : num 6 0 0 0 5 0 2 0 0 1 ...
## $ ABSENTEE : num 0 0 0 1 1 0 5 0 0 0 ...
## $ PROVISIONAL : num 0 0 0 0 0 1 2 0 0 0 ...
## $ ELIGIBLE_VOTERS : num 140 0 5 5 359 94 207 1 3 5 ...
## $ Percentage : num 0.564 0 0.6 0.8 0.61 ...
Unfortunately the spreadsheet for the per-county turnout organizes its data in a way that is not optimal for analysis in R: There are separate sections (groups of rows) for each party, each with a header row and a subtotal row, and an initial section for overall turnout (all parties).
For my purposes it is better to identify the party using a separate variable (column), and to eliminate all subtotal and total rows. (They can always be recomputed as needed.) I therefore do the following:
I also convert the various counts from character strings to integers.
parties <- c(rep(NA, 26),
rep("Republican", 24),
rep(NA, 4),
rep("Democrat", 24),
rep(NA, 4),
rep("Libertarian", 24),
rep(NA, 4),
rep("Green", 24),
rep(NA, 2),
rep("Other Parties", 24),
rep(NA, 3),
rep("Unaffiliated", 24),
rep(NA, 6))
county_turnout <- county_turnout_raw %>%
mutate(Party = parties) %>%
filter(!is.na(Party)) %>%
mutate(LBE = Totals,
Polls = as.integer(POLLS),
Early_Voting = as.integer(EV),
Absentee = as.integer(ABS),
Provisional = as.integer(PROV),
Eligible_Voters = as.integer(ELIGIBLE_VOTERS)) %>%
select(LBE,
Polls,
Early_Voting,
Absentee,
Provisional,
Eligible_Voters,
Percent_Turnout,
Party)
str(county_turnout)
## Classes 'tbl_df', 'tbl' and 'data.frame': 144 obs. of 8 variables:
## $ LBE : chr "Allegany" "Anne Arundel" "Baltimore City" "Baltimore County" ...
## $ Polls : int 10306 57411 9457 64647 12476 3855 33027 11056 11488 3990 ...
## $ Early_Voting : int 771 15702 1354 13327 2141 821 4888 2089 1687 698 ...
## $ Absentee : int 361 2012 392 2217 425 87 881 260 376 159 ...
## $ Provisional : int 99 1308 384 1142 198 55 322 110 183 27 ...
## $ Eligible_Voters: int 20200 123998 30156 131289 23930 7740 58969 24692 24352 7237 ...
## $ Percent_Turnout: chr "0.57113861386138609" "0.61640510330811782" "0.38423530972277492" "0.61949592121198271" ...
## $ Party : chr "Republican" "Republican" "Republican" "Republican" ...
Next I do a check to confirm that the county_turnout
dataset has values consistent with the original spreadsheet, by totaling the counts and confirming that they match the corresponding state-wide totals from row 25 of the original spreadsheet.
count_total_county <- county_turnout %>%
summarise(Polls = sum(Polls),
Early_Voting = sum(Early_Voting),
Absentee = sum(Absentee),
Provisional = sum(Provisional),
Eligible_Voters = sum(Eligible_Voters))
print(count_total_county)
## Source: local data frame [1 x 5]
##
## Polls Early_Voting Absentee Provisional Eligible_Voters
## 1 1350930 307665 54651 35063 3701665
stopifnot(count_total_county == county_turnout_raw[25,2:6])
Next I turn to the precinct-level turnout data. This dataset is much easier to work with: Each row contains data for a particular party in a particular precinct. I change the variable names to mixed case for readability, the variable types as appropriate, and the values for the Party
variable to use initial capitals instead of upper case only.
precinct_turnout <- precinct_turnout_raw %>%
mutate(Election_Code = ELECTION_CODE,
LBE_Code = as.integer(LBE_CODE),
Congressional_District_Code = as.integer(CONGRESSIONAL_DISTRICT_CODE),
Legislative_District_Code = LEGISLATIVE_DISTRICT_CODE,
Precinct = PRECINCT,
Party = gsub("(^|[[:space:]])([[:alpha:]])([[:alpha:]]*)",
"\\1\\U\\2\\L\\3",
PARTY,
perl=TRUE),
Polls = as.integer(POLLS),
Early_Voting = as.integer(EARLY_VOTING),
Absentee = as.integer(ABSENTEE),
Provisional = as.integer(PROVISIONAL),
Eligible_Voters = as.integer(ELIGIBLE_VOTERS)) %>%
select(Election_Code,
LBE,
LBE_Code,
Congressional_District_Code,
Legislative_District_Code,
Precinct,
Party,
Polls,
Early_Voting,
Absentee,
Provisional,
Eligible_Voters,
Percentage)
str(precinct_turnout)
## Classes 'tbl_df', 'tbl' and 'data.frame': 12552 obs. of 13 variables:
## $ Election_Code : chr "GG14" "GG14" "GG14" "GG14" ...
## $ LBE : chr "Allegany" "Allegany" "Allegany" "Allegany" ...
## $ LBE_Code : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Congressional_District_Code: int 6 6 6 6 6 6 6 6 6 6 ...
## $ Legislative_District_Code : chr "01C" "01C" "01C" "01C" ...
## $ Precinct : chr "001-000" "001-000" "001-000" "001-000" ...
## $ Party : chr "Democrat" "Green" "Libertarian" "Other Parties" ...
## $ Polls : int 73 0 3 3 213 35 73 0 1 2 ...
## $ Early_Voting : int 6 0 0 0 5 0 2 0 0 1 ...
## $ Absentee : int 0 0 0 1 1 0 5 0 0 0 ...
## $ Provisional : int 0 0 0 0 0 1 2 0 0 0 ...
## $ Eligible_Voters : int 140 0 5 5 359 94 207 1 3 5 ...
## $ Percentage : num 0.564 0 0.6 0.8 0.61 ...
In theory the county-level dataset is simply a summary of the precinct-level dataset. Now that I’ve cleaned the two datasets and made them consistent in terms of variable names, types, and values I check that hypothesis.
I first check that the parties listed in the county turnout dataset are the exact same parties as in the precinct turnout dataset, and vice versa:
sort(unique(county_turnout$Party))
## [1] "Democrat" "Green" "Libertarian" "Other Parties"
## [5] "Republican" "Unaffiliated"
sort(unique(precinct_turnout$Party))
## [1] "Democrat" "Green" "Libertarian" "Other Parties"
## [5] "Republican" "Unaffiliated"
Next I check to see if the counts in the precinct turnout dataset are consistent with the counts in the county turnout dataset, by comparing the total counts for the entire state calculated using each dataset:
count_total_precinct <- precinct_turnout %>%
summarise(Polls = sum(Polls),
Early_Voting = sum(Early_Voting),
Absentee = sum(Absentee),
Provisional = sum(Provisional),
Eligible_Voters = sum(Eligible_Voters))
print(count_total_county)
## Source: local data frame [1 x 5]
##
## Polls Early_Voting Absentee Provisional Eligible_Voters
## 1 1350930 307665 54651 35063 3701665
print(count_total_precinct)
## Source: local data frame [1 x 5]
##
## Polls Early_Voting Absentee Provisional Eligible_Voters
## 1 1350930 307665 54651 35062 3701665
All counts match except for the total provisional votes, for which the precinct-level data shows one less provisional vote than the county-level data.
To narrow down the source of the discrepancy I compare the counts of provisional votes for each county using the two datasets:
provisional_county <- county_turnout %>%
group_by(LBE) %>%
summarise(Provisional_County = sum(Provisional))
provisional_precinct <- precinct_turnout %>%
group_by(LBE) %>%
summarise(Provisional_Precinct = sum(Provisional))
left_join(provisional_county, provisional_precinct) %>%
print.data.frame()
## Joining by: "LBE"
## LBE Provisional_County Provisional_Precinct
## 1 Allegany 222 222
## 2 Anne Arundel 3913 3913
## 3 Baltimore City 4650 4650
## 4 Baltimore County 4419 4419
## 5 Calvert 461 461
## 6 Caroline 121 121
## 7 Carroll 587 587
## 8 Cecil 241 241
## 9 Charles 877 877
## 10 Dorchester 144 144
## 11 Frederick 989 989
## 12 Garrett 83 83
## 13 Harford 1470 1470
## 14 Howard 1718 1718
## 15 Kent 37 37
## 16 Montgomery 6554 6554
## 17 Prince George's 6676 6675
## 18 Queen Anne's 166 166
## 19 Saint Mary's 412 412
## 20 Somerset 172 172
## 21 Talbot 193 193
## 22 Washington 432 432
## 23 Wicomico 336 336
## 24 Worcester 190 190
It appears that the one-vote discrepancy is for Prince George’s County. To narrow things down even further I print the number of provisional voters for each party in Prince George’s County:
pg_provisional_county <- county_turnout %>%
filter(LBE == "Prince George\'s") %>%
mutate(Provisional_County = Provisional) %>%
select(Party, Provisional_County)
pg_provisional_precinct <- precinct_turnout %>%
filter(LBE == "Prince George\'s") %>%
group_by(Party) %>%
summarise(Provisional_Precinct = sum(Provisional))
left_join(pg_provisional_county, pg_provisional_precinct) %>%
print.data.frame()
## Joining by: "Party"
## Party Provisional_County Provisional_Precinct
## 1 Republican 455 455
## 2 Democrat 5425 5425
## 3 Libertarian 21 20
## 4 Green 10 10
## 5 Other Parties 56 56
## 6 Unaffiliated 709 709
The discrepancy is thus due to a difference in the number of provisional votes cast by members of the Libertarian party in Prince George’s County.
Next I check the calculation of the turnout percentage figures (which are actually fractions, not percentages per se). I recalculate the percentages for the precinct-level data, compute the absolute difference for each precinct between the original figure and the recalculated figure, and determine the maximum absolute difference across all precincts.
(The calculation of turnout percentage is not totally straightforward, since I must account for cases in which there are no eligible voters for a given party in a given precinct. In the original dataset the turnout percentage is set to zero in these cases, so I do the same.)
precinct_turnout %>%
mutate(Pct = ifelse(Eligible_Voters <= 0,
0,
(Polls+Early_Voting+Absentee+Provisional)/Eligible_Voters)) %>%
summarise(max(abs(Percentage - Pct)))
## Source: local data frame [1 x 1]
##
## max(abs(Percentage - Pct))
## 1 0
The maximum absolute difference is zero, so the original percentage values were indeed calculated as I expected.
With the exception of the one-vote discrepancy in Prince George’s County noted above, the county-level dataset is simply a summarization of the precinct-level dataset. I therefore choose to keep only the precinct-level dataset, and write it out as a CSV file gg14-turnout-by-party-by-precinct.csv
for use in the subsequent parts of this analysis. (I use a slightly different name than that of the original Excel spreadsheet to emphasize that this file is not simply a CSV export of the spreadsheet.)
write.csv(precinct_turnout,
file = "gg14-turnout-by-party-by-precinct.csv",
quote = TRUE,
row.names = FALSE)
The precinct-level dataset contains one row for each combination of county, precinct, and party; each row contains the following variables:
Election_Code
: The code for the year and type of election; for this dataset it has the single value “GG14” (gubernatorial general election for 2014). (character string)LBE
: The jurisdiction of the local board of elections (e.g., “Howard” or “Baltimore City”). (character string)LBE_Code
: The code for the local board of elections (e.g., 14 for Howard County). (integer)Congressional_District_Code
: The US Congressional district with which the precinct is associated (e.g., 7). (integer)Legislative_District_Code
: The Maryland state legislative district with which the precinct is associated (e.g., “09A” for district 9A). It consists of a two-digit zero-filled value optionally followed by a single letter “A”, “B”, etc. (character string)Precinct
: The designator for the precinct (e.g., “002-006”). It consists of two three-digit zero-filled values separated by a hyphen. A precinct designator is unique only within a given LBE jurisdiction. (character string)Party
: The party affiliation of people voting, or “Unaffiliated” for those who did not indicate a party affiliation when registering to vote. The recognized party affiliations are “Republican”, “Democrat”, “Libertarian”, and “Green”; several small parties are lumped together as “Other Parties”. (character string)Polls
: The number of people who voted at the polls on election day. (integer)Early_Voting
: The number of people who voted during the early voting period. (integer)Absentee
: The number of people who voted via absentee ballot. (integer)Provisional
: The number of people who cast provisional ballots. (integer)Eligible_Voters
: The number of people eligible to vote (i.e., registered to vote and otherwise qualified). (integer)Percentage
: The fraction of eligible voters who actually voted (e.g., 0.5745). It is computed as the sum of the Poll
, Early_Voting
, Absentee
, and Provisional
counts, divided by the Eligible_Voters
count. Note that it must be multiplied by 100 to obtain a true percentage. (numeric)To make the file gg14-turnout-by-party-by-precinct.csv
more accessible to myself and others, I put a copy of the file in my hocodata Github repository; it can be downloaded at the following URL:
I used the following R environment in doing the analysis for this example:
sessionInfo()
## R version 3.1.3 (2015-03-09)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.3 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_0.4.0 readxl_0.1.0 RCurl_1.95-4.3 bitops_1.0-6
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 DBI_0.3.1 digest_0.6.4 evaluate_0.5.5
## [5] formatR_1.0 htmltools_0.2.6 knitr_1.7 lazyeval_0.1.10
## [9] magrittr_1.0.1 parallel_3.1.3 Rcpp_0.11.5 rmarkdown_0.5.1
## [13] stringr_0.6.2 tools_3.1.3 yaml_2.1.13
You can find the source code for this analysis and others at my HoCoData repository on GitHub. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.