This project seeks to explore the similarities and differences of state election systems, and offer descriptive and visual analysis of the data at hand. Utilizing the confidential voter files of U.S. states, I attempt to analyze the different ways in which race, gender, birth identification, voter status, and party identification variables are classified by a sample of states.
Calls for a nationwide, standardized election system have persisted for decades. For example, the disenfranchisement of thousands of registered Florida voters in the 2000 general election debacle stemmed from, among other things, the inept actions of local election officials, inadequate databases, and a lack of uniform statewide systems. Many commissions, organizations, and policy proposals have focused on this last issue and have called for standardized measures across all 50 states and their different election processes. Ultimately, standards and election uniformity promote a more democratic, efficient, and far-reaching system.
This project stems from a larger project overseen by researchers at the University of Florida (UF) in conjunction with VoteShield.
Confidential state voter files and vote history files are gathered through communication with state election officials. .yaml files are then created to label the different headers efficiently. These headers are standardized across state files at the researchers’ discretion. To demonstrate, Alaska’s first header, “Accession_Number”, represents the same concept as the classification “Voter_ID” (the unique identifier given to each registered voter). Thus, “Accession_Number” is recoded as “Voter_ID” for Alaska to match the other sampled, standardized states. Below are the first few lines of Washington state’s .yaml file, for example.
.yaml Example
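To make the header standardization concrete, here is a minimal sketch of how such a mapping might be stored and applied in R. The YAML keys (`columns`, `gender_codes`), the `Gender_Code` header, and the toy data frame are assumptions made for illustration and are not taken from the project's actual files; only the `Accession_Number` to `Voter_ID` pairing comes from the Alaska example above.

```r
library(yaml)

# Hypothetical .yaml contents; the project's real files may be organized differently.
example_yaml_text <- '
state: example
columns:
  Accession_Number: Voter_ID
  Gender_Code: Gender
gender_codes:
  male: "M"
  female: "F"
  unknown: "U"
'

# Parse the YAML text into a nested named list
spec <- yaml.load(example_yaml_text)

# Toy voter file that still uses the raw, state-supplied headers
raw <- data.frame(Accession_Number = c(101, 102),
                  Gender_Code = c("F", "M"))

# Named vector: names are raw headers, values are standardized headers
mapping <- unlist(spec$columns)

# Rename every raw header that has a standardized counterpart
hits <- names(raw) %in% names(mapping)
names(raw)[hits] <- mapping[names(raw)[hits]]

names(raw)
```

Keeping the mapping in the .yaml file rather than in the R script means a header can be re-standardized by editing one line of configuration.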
Given that this is an ongoing project, only 20 states out of 50 (plus the District of Columbia) have supplied the requisite materials as of November 11, 2019. The states analyzed are mapped and graphed in the discussion section using a shapefile created by the Urban Institute [https://medium.com/@urban_institute/how-to-create-state-and-county-maps-easily-in-r-577d29300bb2].
Lastly, the five classifications this project is most concerned with are the “Race”, “Gender”, birth identifier (“Birth_Date”, “Age”, etc.), “Voter_Status”, and party identifiers (“Party_Affiliation”, “Last_Party_Voted”, etc.) as codified by each state.
Before starting, install and load the following packages. The Urban Institute’s own package (“urbnmapr”) is necessary for the U.S. choropleth shapefile. Choropleth maps shade geographic areas according to the value of a variable. Because urbnmapr is installed from GitHub, devtools must be installed first.
# Point R at a CRAN mirror before installing
r <- getOption("repos")
r["CRAN"] <- "http://cran.us.r-project.org"
options(repos = r)
# kable() is part of knitr, and urbnmapr is installed from GitHub below,
# so neither is requested from CRAN. mapproj is needed by coord_map().
install.packages(c("readxl", "ggplot2", "tidyverse", "choroplethrMaps", "devtools", "dplyr", "yaml", "knitr", "mapproj"))
devtools::install_github("UrbanInstitute/urbnmapr")
library(readxl)
library(ggplot2)
library(tidyverse)
library(devtools)
library(choroplethrMaps)
library(urbnmapr)
library(dplyr)
library(yaml)
library(knitr)
We can easily manipulate the shapefile using ggplot2. This example uses Urban’s median household income data to create a heatmap of median household income for each county in the United States. The code below reproduces their base state map and their original county-level U.S. map.
# Base map: state outlines from urbnmapr
ggplot() +
  geom_polygon(data = urbnmapr::states, mapping = aes(x = long, y = lat, group = group),
               fill = "grey", color = "white") +
  coord_map(projection = "albers", lat0 = 39, lat1 = 45)
# State outlines and region names from choroplethrMaps; statedata comes from urbnmapr
data(state.map)
data("state.regions")
data("statedata")
ggplot(state.map, aes(long, lat, group = group)) + geom_polygon()
# County-level heatmap of median household income
household_data <- left_join(countydata, counties, by = "county_fips")
household_data %>%
  ggplot(aes(long, lat, group = group, fill = medhhincome)) +
  geom_polygon(color = NA) +
  coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
  labs(fill = "Median Household Income")
For the purposes of my project, I am interested in state-level rather than county-level patterns. I can map the U.S. states using their “statedata” dataset, as you can see here.
# Example: state-level median household income
household_data <- merge(statedata, states, by = "state_fips")
household_data %>%
  ggplot(aes(long, lat, group = group, fill = medhhincome)) +
  geom_polygon(color = NA) +
  coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
  labs(fill = "Median Household Income")
Once the shapefile is set up, I can merge in any dataset to change what the map displays. The data we’re using come from the .yaml files described earlier. To read in these files, one must use the yaml package in R. Moreover, because this project uses only 20 .yaml files, I created two character vectors: 1) the 20 states analyzed (“state”) and 2) the remaining 30 states plus the District of Columbia, to be added into the analysis once .yaml files are created for them (“np.states”).
ak.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/alaska/alaska.yaml")
co.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/colorado.yaml")
ct.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/connecticut.gender.yaml")
fl.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/florida.yaml")
ga.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/georgia.yaml")
id.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/idaho.yaml")
ia.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/iowa.yaml")
mo.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/missouri.yaml")
mt.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/montana.gender.yaml")
nv.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/nevada.yaml")
nj.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/new jersey.yaml")
ny.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/new york/new york.yaml")
nc.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/north carolina/north carolina.yaml")
oh.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/ohio/ohio.yaml")
ok.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/oklahoma/oklahoma.yaml")
or.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/oregon/oregon.yaml")
pa.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/pennsylvania/pennsylvania.yaml")
ut.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/utah/utah.yaml")
vt.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/vermont/vermont.gender.yaml")
wa.yaml<- read_yaml("/Users/EmilyBoykin/Documents/Voter File. McDonald/washington/washington.yaml")
state<- c("Alaska",
"Colorado",
"Connecticut",
"Florida",
"Georgia",
"Idaho",
"Iowa",
"Missouri",
"Montana",
"Nevada",
"New Jersey",
"New York",
"North Carolina",
"Ohio",
"Oklahoma",
"Oregon",
"Pennsylvania",
"Utah",
"Vermont",
"Washington")
np.states<- c("Alabama",
"Arizona",
"Arkansas",
"California",
"Delaware",
"District of Columbia",
"Hawaii",
"Illinois",
"Indiana",
"Kansas",
"Kentucky",
"Louisiana",
"Maine",
"Maryland",
"Massachusetts",
"Michigan",
"Minnesota",
"Mississippi",
"Nebraska",
"New Hampshire",
"New Mexico",
"North Dakota",
"Rhode Island",
"South Carolina",
"South Dakota",
"Tennessee",
"Texas",
"Virginia",
"West Virginia",
"Wisconsin",
"Wyoming")
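As a hedged alternative to typing the second vector by hand, the states without .yaml files could be derived from the first vector. This sketch assumes R's built-in `state.name` vector (the 50 state names) plus a manually added "District of Columbia"; the names `analyzed`, `all_units`, and `not_provided` are illustrative and do not appear in the project code.

```r
# The 20 states with .yaml files so far (same contents as the `state` vector above)
analyzed <- c("Alaska", "Colorado", "Connecticut", "Florida", "Georgia",
              "Idaho", "Iowa", "Missouri", "Montana", "Nevada",
              "New Jersey", "New York", "North Carolina", "Ohio", "Oklahoma",
              "Oregon", "Pennsylvania", "Utah", "Vermont", "Washington")

# state.name ships with base R and holds the 50 state names
all_units <- c(state.name, "District of Columbia")

# Guard against typos: every analyzed state must be a known name
stopifnot(all(analyzed %in% all_units))

# Everything not yet analyzed
not_provided <- sort(setdiff(all_units, analyzed))
length(not_provided)
```

Deriving the list this way also means it shrinks automatically as new states are added to the analyzed vector.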
Next, I can merge these state lists onto the household data. Because the state dataset supplied by this package includes state names, abbreviations, and FIPS (Federal Information Processing Standards) codes, it is easy to merge in any dataset that includes a matching state name or code. I chose to create five separate data frames, one for each header/variable being analyzed, and merged them on the state names. Note that copies of household_data were made in case we needed the unedited original.
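The merge step can be sketched with toy tables. The data frames below are small, made-up stand-ins for the package's state-level tables, and the `code_key` values are placeholders rather than findings. The sketch shows why the merged state-name column ends up as `state_name.x`: when both inputs carry a `state_name` column and the merge key is `state_fips`, base R's `merge()` adds `.x` and `.y` suffixes.

```r
# Toy stand-ins for the state-level data and the polygon table (values are made up)
toy_statedata <- data.frame(state_fips = c("01", "02"),
                            state_name = c("Alabama", "Alaska"),
                            medhhincome = c(1, 2))
toy_states <- data.frame(state_fips = c("01", "01", "02", "02"),
                         state_name = c("Alabama", "Alabama", "Alaska", "Alaska"),
                         long = c(-87, -86, -150, -149),
                         lat = c(32, 33, 61, 62))

# Both tables contain state_name, so merge() disambiguates with .x / .y
toy_map <- merge(toy_statedata, toy_states, by = "state_fips")
names(toy_map)

# Attach a per-state variable by state name (placeholder values)
toy_codes <- data.frame(state = c("Alabama", "Alaska"),
                        code_key = c("NA", "MFU"))
toy_merged <- merge(toy_map, toy_codes, by.x = "state_name.x", by.y = "state")
nrow(toy_merged)
```

Every polygon row keeps its coordinates and gains the state's code key, which is what the fill aesthetic uses when the map is drawn.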
As explained, this smaller project only analyzes 20 of the U.S.’s 50 states due to constraints in accessible data. The 20 states are mapped below and include: Alaska (AK), Colorado (CO), Connecticut (CT), Florida (FL), Georgia (GA), Idaho (ID), Iowa (IA), Missouri (MO), Montana (MT), Nevada (NV), New Jersey (NJ), New York (NY), North Carolina (NC), Ohio (OH), Oklahoma (OK), Oregon (OR), Pennsylvania (PA), Utah (UT), Vermont (VT), and Washington (WA). For ease in visualization, the R code is supplied only for the first map, Gender.
It is important to note here that for only one variable, “Race”, I looked at a few other states out of curiosity, since I received this data from my advisor or was able to find it on the Secretary of State’s website. As you will see in the discussion, if I had not factored in these extra states, the analysis would have included only three states and would not have been as interesting! As a disclaimer, though, these states are not included in the overall design, as I did not have the necessary data to support other analysis.
The following maps were created to visualize how U.S. states classify five different variables in elections.
## Gender Data Frame
gender <- data.frame(state = c(state, np.states), gender = NA)
# YAML objects listed in the same order as the `state` vector
yamls <- list(ak.yaml, co.yaml, ct.yaml, fl.yaml, ga.yaml, id.yaml, ia.yaml, mo.yaml, mt.yaml, nv.yaml, nj.yaml, ny.yaml, nc.yaml, oh.yaml, ok.yaml, or.yaml, pa.yaml, ut.yaml, vt.yaml, wa.yaml)
# Collapse each state's gender codes into a single key such as "MFU"
for (i in seq_along(yamls)) {
  gender$gender[i] <- paste(as.character(unlist(yamls[[i]]$gender_codes)), collapse = "")
}
# States without a .yaml file are labeled with the string "NA"
gender$gender[is.na(gender$gender)] <- "NA"
gender <- gender[order(gender$state), ]
## Merging Gender Data Frame
copy2.house_data <- household_data
merge.gender.yaml <- merge(copy2.house_data, gender, by.x = "state_name.x", by.y = "state")
## Gender Map
merge.gender.yaml %>%
  ggplot(aes(long, lat, group = group, fill = gender)) +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  ggtitle("Gender") + theme(plot.title = element_text(face = "bold", color = "#555555", hjust = .6, size = 25)) +
  scale_fill_brewer(palette = "Accent", breaks = c("NA", "NP", "MFU", "MF", "MFUNP", "MFNP"),
                    labels = c("NA States", "Not Provided", "Standard: Male/Female/Unknown", "Male/Female", "Standard + Not Provided", "No 'Unknown' Option")) +
  geom_polygon(color = NA) +
  coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
  labs(fill = "Gender Codes")
This first map explores “Gender” and its various attributes. Of the 20 states, 8 did not provide any information about a voter’s gender. Of the 13 states that did supply gender information, 7 had “Female”, “Male”, and “Unknown” classifications from which voters could choose. Two states, New York and Iowa, had only “Female” and “Male” options, while Idaho was the only state that went beyond three attributes, the fourth option being “Not Provided”. New Jersey did not have an “Unknown” option. No states in this sample had options for non-binary voters, who may wish to classify as “Other” at the very least.
Given this, and since Idaho’s fourth attribute could be treated as missing data, the recommended “Gender” standard would be the following three classifications: “Female”, “Male”, and “Unknown”.
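The code-collapsing step behind the map key can be sketched as a small helper. The nested lists below are hypothetical stand-ins for parsed .yaml files, and `collapse_codes` is an illustrative name, not a function from the project or from any package.

```r
# Hypothetical parsed .yaml contents for three states
toy_yamls <- list(
  list(gender_codes = list(male = "M", female = "F", unknown = "U")),
  list(gender_codes = list(male = "M", female = "F")),
  list(gender_codes = list(not_provided = "NP"))
)

# Paste each state's codes for one field into a single key, e.g. "MFU"
collapse_codes <- function(yaml_list, field) {
  vapply(yaml_list,
         function(y) paste(as.character(unlist(y[[field]])), collapse = ""),
         character(1))
}

keys <- collapse_codes(toy_yamls, "gender_codes")
keys

# Tally how many states share each classification scheme
table(keys)
```

Because the field name is an argument, the same helper could be reused for other variables, provided each .yaml file stores its codes under a consistently named field.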
## Status Data Frame
status <- data.frame(state = c(state, np.states), status = NA)
# YAML objects listed in the same order as the `state` vector
yamls <- list(ak.yaml, co.yaml, ct.yaml, fl.yaml, ga.yaml, id.yaml, ia.yaml, mo.yaml, mt.yaml, nv.yaml, nj.yaml, ny.yaml, nc.yaml, oh.yaml, ok.yaml, or.yaml, pa.yaml, ut.yaml, vt.yaml, wa.yaml)
# Collapse each state's status codes into a single key such as "AI"
for (i in seq_along(yamls)) {
  status$status[i] <- paste(as.character(unlist(yamls[[i]]$status_codes)), collapse = "")
}
status$status[is.na(status$status)] <- "NA"
status <- status[order(status$state), ]
# Group state-specific code combinations under shared map categories
status$status[status$status %in% c("AP", "AIP", "AIPR", "AIDRS", "APPR", "ActiveChallengedN-18", "ActiveConfirmation")] <- "Standard + Pending or Pre Reg."
status$status[status$status %in% c("ACTINAPRE", "ACTINAPROLATE", "AIADIDP", "AI17AMAFAPAUP", "ADIRS", "AIACP", "AIIMCCDCFCSCT")] <- "Standard + 3"
## Merging Status Data Frame
copy2.house_data <- household_data
merge.status.yaml <- merge(copy2.house_data, status, by.x = "state_name.x", by.y = "state")
## Status Map
merge.status.yaml %>%
  ggplot(aes(long, lat, group = group, fill = status)) +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  ggtitle("Voter Status") + theme(plot.title = element_text(face = "bold", color = "#555555", hjust = .6, size = 25)) +
  scale_fill_brewer(palette = "Accent", breaks = c("NA", "NP", "AI", "Standard + Pending or Pre Reg.", "Standard + 3"),
                    labels = c("NA States", "Not Provided", "Standard: Active / Inactive", "Standard + Pending or Pre Reg.", "Standard + 3")) +
  geom_polygon(color = NA) +
  coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
  labs(fill = "Status Codes")
This second map explores “Voter Status” as a variable. It is arguably one of the more important variables given the recent controversy over voter roll purging, among other things. A voter’s registration status is most often either “Active” or “Inactive”, so this pair is treated as the standard in this methodology.
I included the R code here to note that, for my key, I needed to merge several state-specific code combinations under one classification (and one ggplot color) for visualization purposes. For example, I merged 6 states under one classification, as these states all had a “Pending” or “Pre-Registered” category in common (“Standard + Pending or Pre Reg.”). I used this approach for both the party and race classifications as well, as you will note later.
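The grouping idea can be shown in isolation. This is a sketch under stated assumptions: the input vector is invented, the two code sets are only a subset of the combinations that appear in the status code, and `recode_status` is an illustrative helper name.

```r
# Invented collapsed keys, one per state
status_key <- c("AI", "AP", "AIPR", "ACTINAPRE", "NP", "NA")

# Subsets of the code combinations grouped in the project code
pending_codes  <- c("AP", "AIP", "AIPR")
extended_codes <- c("ACTINAPRE", "AIACP")

# Replace any key found in a set with that set's shared label
recode_status <- function(x) {
  x[x %in% pending_codes]  <- "Standard + Pending or Pre Reg."
  x[x %in% extended_codes] <- "Standard + 3"
  x
}

recoded <- recode_status(status_key)
recoded
```

Keys that match neither set ("AI", "NP", "NA") pass through unchanged, so the standard and missing categories keep their own colors on the map.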
Of the 20 states, 5 had this standard classification scheme. However, some states offer other classifications in conjunction with these two attributes, predominantly for pre-registered voters who were not yet eligible to vote but were included in the voter files (3 out of 5). Other options included voters who were active but protected from the public voter files (Oregon) and voters who were “pending” active status on the voter rolls under that state’s (Iowa) unique system of purging or adding voters to the rolls. Moreover, some states have many more classifications, up to eleven (Washington). The states with more than three attribute types are Washington, Montana, New York, North Carolina, and New Jersey.
Only 3 states (Alaska, Idaho, and Missouri) did not provide any information on a voter’s status.
These next two maps differ slightly from the first two. For the first of them, instead of analyzing the different attributes under one uniform variable, I want to see how one concept (a voter’s age) is coded. Of the 20 states analyzed, 12 provided a voter’s full date of birth in ‘%m/%d/%Y’ format (month/day/year, e.g., 11/05/1976). This is the most informative birth identifier and the one used by the majority of the sampled states, so it is treated as the standard.
1 state (Idaho) provided only a voter’s age (ex: 43) and 4 states provided birth year only (ex: 1976).
Surprisingly, only Alaska did not provide any information concerning a voter’s date of birth, age, or birth month/year. However, some states do not provide this information publicly, and those who want access must request it through an appeal to the state. Alaska is one such state: it collects this information but does not publicly provide it.
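One reason the full date of birth works as a standard is that the other two identifiers can be derived from it, while the reverse is not possible. A minimal sketch in base R, using the example date from the text and the project's November 11, 2019 cutoff as the reference date; `age_on` is an illustrative helper name.

```r
# Parse a full date of birth in the '%m/%d/%Y' format
birth_raw  <- "11/05/1976"
birth_date <- as.Date(birth_raw, format = "%m/%d/%Y")

# Birth year only, as some states report it
birth_year <- as.integer(format(birth_date, "%Y"))

# Age in whole years on a reference date, as one state reports it
age_on <- function(birth, ref) {
  b <- as.POSIXlt(birth)
  r <- as.POSIXlt(ref)
  # Has the birthday already occurred in the reference year?
  had_birthday <- (r$mon > b$mon) | (r$mon == b$mon & r$mday >= b$mday)
  (r$year - b$year) - ifelse(had_birthday, 0L, 1L)
}

age <- age_on(birth_date, as.Date("2019-11-11"))

birth_year
age
```

An age-only field, by contrast, also depends on when the voter file was exported, so it cannot be converted back to a birth date without additional information.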
Similarly, a single “Party” classification does not exist across the voter files, given that not all states ask voters to disclose their party affiliation. Instead, some states record which party a voter last voted for in the most recent election as an estimate of party affiliation. Regardless, there are still many different parties a voter could belong to or vote for, and this identifier is used to analyze the variation in how many parties a state accounts for, which varies substantially.
Of the 20 states, 5 had no means of supplying party identification; 4 have 5 or fewer parties identified (Republican, Democrat, Not Affiliated, Green Party, etc.); 4 have anywhere from 6 to 9 different parties identified; 4 have between 10 and 14; and 3 have 15 or more different parties identified.
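The party categories amount to binning a count of parties per state. A hedged sketch with base R's `cut()`: the counts below are invented, the bin labels are illustrative, and treating the fourth group as 10 to 14 is an assumption about where that bin ends.

```r
# Assign a number of identified parties to one of the map categories
party_bin <- function(n_parties) {
  cut(n_parties,
      breaks = c(-Inf, 0, 5, 9, 14, Inf),   # right-closed intervals
      labels = c("Not Provided", "1-5", "6-9", "10-14", "15+"))
}

# Invented counts, one per hypothetical state
toy_counts <- c(0, 3, 7, 12, 20)
bins <- party_bin(toy_counts)
bins

# How many states fall in each category
table(bins)
```

Using a factor keeps the categories in a fixed order, which is convenient when they are passed to a discrete fill scale.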
The last variable we’ve identified is “Race”. Although it is possibly the most interesting classification to analyze, few states actually ask voters for or report this demographic. Of the 20 states analyzed, only 3 supplied racial or ethnic information about voters: Florida, Georgia, and North Carolina. For this reason, as previously explained, I added two states (Louisiana and South Carolina) whose race codes were available either online or through my advisor. I had expected to also add race codes for Missouri, Mississippi, Alabama, Arkansas, and Tennessee to this section, but did not receive the data in time for this project’s deadline.
It is important to note here that the word “standard” in this key does not carry the same meaning as “standard” in the other explanations or keys. Because comparable data are scarce and academic understandings of ethnic classification are mixed, this project cannot propose a suitable “standard” for race classifications. That said, of the 3 states that supplied information, Florida had a set of classifications that seemed an appropriate benchmark: American Indian or Alaskan Native, Asian or Pacific Islander, Black, White, Hispanic, Multi-Racial, Other, and Unknown. Georgia included all of these classifications except “Multi-Racial” and also had the option of “Black Non-Hispanic”. Below is a data frame that outlines the variations between states.
onlyrace.states <- c("Florida", "Georgia", "North Carolina", "Louisiana", "South Carolina")
# lo.race and sc.race (Louisiana and South Carolina race codes) are not created in the code shown here
onlyrace.yamls <- list(fl.yaml, ga.yaml, nc.yaml, lo.race, sc.race)
onlyrace.df <- data.frame(states = onlyrace.states, White = FALSE, Black = FALSE, BlackNonHis = FALSE, Asian = FALSE, Amer.Indian = FALSE, Multi = FALSE, Other = FALSE, Unknown = FALSE, Hispanic = FALSE, Sep.Hispanic = FALSE)
# Flag each category by searching the names of the state's race codes
for (i in seq_len(nrow(onlyrace.df))) {
  code_names <- names(onlyrace.yamls[[i]]$race_codes)
  onlyrace.df$White[i] <- any(grepl("white", code_names))
  onlyrace.df$Black[i] <- any(grepl("black", code_names))
  onlyrace.df$Asian[i] <- any(grepl("asian", code_names))
  onlyrace.df$Hispanic[i] <- any(grepl("his", code_names))
  onlyrace.df$Amer.Indian[i] <- any(grepl("american", code_names))
  onlyrace.df$BlackNonHis[i] <- any(grepl("non", code_names))
  # Match either "multi" or "two"; any() reduces the result to a single TRUE/FALSE
  onlyrace.df$Multi[i] <- any(grepl("multi|two", code_names))
  onlyrace.df$Other[i] <- any(grepl("other", code_names))
  onlyrace.df$Unknown[i] <- any(grepl("unknown", code_names))
  onlyrace.df$Sep.Hispanic[i] <- any(grepl("sep", code_names))
}
| states | White | Black | BlackNonHis | Asian | Amer.Indian | Multi | Other | Unknown | Hispanic | Sep.Hispanic |
|---|---|---|---|---|---|---|---|---|---|---|
| Florida | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE |
| Georgia | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | FALSE |
| North Carolina | TRUE | TRUE | FALSE | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE | TRUE |
| Louisiana | TRUE | TRUE | FALSE | TRUE | TRUE | FALSE | TRUE | FALSE | TRUE | FALSE |
| South Carolina | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE |
North Carolina is an important state in this discussion because it had two separate “Race” variables: the first asked for a voter’s racial classification, and a subsequent question asked whether the voter identified with any Hispanic or Latinx classification. The latter is essentially a binary attribute that classifies a voter as either “Hispanic or Latino” or “Not Hispanic or Latino”. In light of this, future research is needed to see whether other states follow North Carolina’s Hispanic distinction, or ask any questions about ethnic origin at all.
I want to take a moment to discuss the challenges I faced in producing this project. First, gathering the data and creating the .xml file from the .yaml files has been almost a year in the making. Along the way, I have learned a great deal about the importance of standardized coding systems and variables, and of collaboration among team members.
Second, the biggest challenge I came across was in reading these .yaml files. The purpose of these .yaml files is to standardize the headers, and it was obvious that we have not fully met that objective. I spent hours going back into the files to update, manipulate, or standardize these codes. This will continue to be the hardest part, as the process does not truly end until all state voter files have been standardized.
Third, I had a tough time manipulating the shapefile. I was surprised to find, after merging in my data, that many states lost their shape due to misplaced keys and “holes” in the data. You can see the initial trouble in the sketch below.
.yaml Example
The error lay in the fact that I had originally misspelled one of the states! It was a silly oversight but an important one nonetheless, and a test of patience and of triple-checking one’s work.
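A quick check before merging can catch this kind of key mismatch. The sketch below uses invented data and an invented misspelling; it compares the state names in a dataset against the names available in the map table with base R's `setdiff()`.

```r
# Stand-in for the state names available in the shapefile table
map_names <- c("Florida", "Georgia", "Pennsylvania")

# Invented dataset with one misspelled state name
my_data <- data.frame(state = c("Florida", "Georgia", "Pensylvania"),
                      value = c(1, 2, 3))

# Names present in the data but absent from the map
unmatched <- setdiff(my_data$state, map_names)
unmatched

# Stop early with a readable message if anything fails to match
if (length(unmatched) > 0) {
  message("Unmatched state names: ", paste(unmatched, collapse = ", "))
}
```

Running the same comparison in the other direction, with the arguments swapped, lists the map states that would receive no data after the merge.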
There are a few aspects of this project that will change in the next few months. As more states send their voter files and vote history files, more .yaml files will be created and more analysis will need to be run. The current shapefile is also limited: it does not allow the northern states to be represented equally by the color codes. A hexagon map of U.S. states, where every state is represented by a hexagon of the same size, may be more appropriate for smaller states like Rhode Island and New Jersey. On top of that, state abbreviations should be added for clearer visualization.
Moreover, given the lack of state data, this project does not yet give a sufficient snapshot of the issue at hand. As more frequencies are run on these five variables within different voter files, the subjective key I created should and will change to accommodate new attributes or data.
This project was produced to better understand how states collect data on voters and under what variable and/or attribute names. It was one of the first waves of exploration of this data within the larger project with my advisor and VoteShield. The hallmark of this project is that it is entirely ongoing; as more state data are acquired, the maps will continue to fill in and paint a more holistic picture of the issue at hand. This project undoubtedly has the potential to be used in other studies, legal analyses, and academic papers. However, a better snapshot of the United States at large is needed; more than 20 states must be analyzed before rigorous analysis with substantive hypotheses can be conducted.
In any case, this project introduced me to the foundations of R and ggplot2, shapefiles, for loops, and the use of a variety of data collection methods to tell a story. This story hopefully conveys that the need for standardization within elections deserves more attention. I’m curious to see what future research and other stories can be told from the similarities and differences among state voter files.
I want to acknowledge Jenna Tingum (University of Florida) and Nathan Morse (Pennsylvania State) not only for their work in creating these .yaml files this past year, but also for their mentorship in helping me troubleshoot parts of this project in R.