Election Data Cleaning Documentation

Summary of Data-Sources
Image Processing
Defining Outcomes
Partitioning Data

Summary of Data-Sources

To begin we summarize the data sources used to construct our election data-frame. There are three data-sources used:

House of Representatives: This is scraped from Google image searches of HOR candidates from 1976 to 2018. In this dataset we have 15085 unique candidates together with their state, party, year of election, electoral district, image-name id, sex, flags for runoff, special and write-in elections, votes received by the candidate and total votes cast in the election, together with the final outcome did-win which is a binary indicator for the candidate that won their election.
Precinct: This is scraped from Google image searches of PRE candidates in 2018 which yields 17593 unique candidates. Here we collect information on the office candidates are running for, their names, district, party, state, sex, candidate and total votes cast in the election, together with the final outcome did-win. Since PRE elections are only scraped from elections in 2018, we include a year variable to indicate this.
Ballotpedia: This is a curated data-set of US local and federal elections from 2018-2020 and includes image urls which are scraped. This yields 49463 unique candidates from 5 election types. There are 663 rows from conventions, 426 from runoffs, 45559 from primaries, 715 from primary runoffs, and 41670 from general elections. We will focus only on general elections, since we want to limit our data to races which include at least one republican and one democrat (which won’t be the case in primaries). We also have data from 2018, 2019, and 2020 which leads to some overlap with the PRE data-set. Since there are no common identifiers between these data-sources, we drop the 2018 data from BAL. These restrictions leave us with 21897 distinct candidates.

Combining Data-Sources

An important note that applies to all three data-sources used in this pipeline is that we don’t necessarily have high quality image data for each valid candidate. To ensure we can perform checks on the actual images themselves, we make use of two hash-code data-sets. For these we take our raw BAL, HOR and PRE image data-sets and compute unique hash-codes for each image. This allows us to track images throughout the cleaning process, apply filters to eliminate repeat images across different candidates, and also remove fully white images (i.e. blanks). These filters are discussed in a later section.
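The hashing idea can be sketched as follows. This is a minimal illustration only: the hash function actually used in the pipeline is not stated here, so SHA-256 over the raw file bytes, the function name image_hash, and the toy file names are all assumptions.

```python
import hashlib


def image_hash(image_bytes: bytes) -> str:
    """Return a hex digest that identifies an image by its exact bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


# Toy stand-ins for raw image files (hypothetical names and contents).
raw_images = {
    "bal_001.jpg": b"\x01\x02\x03",
    "hor_017.jpg": b"\x01\x02\x03",  # same bytes as bal_001.jpg
    "pre_042.jpg": b"\xff\xff\xff",
}

# One hash code per image; identical files map to identical codes.
hash_codes = {name: image_hash(data) for name, data in raw_images.items()}
```

A byte-level hash like this only matches exact copies of a file. Whether the pipeline uses this kind of hash or one that tolerates re-encoding is not specified in this document.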

We append these three data-frames, making sure to track their original data-sources. When a candidate’s unique ID appears in more than one data-frame, the overlap is always between BAL and HOR or between BAL and PRE, and in each of these cases we drop the HOR or PRE candidate. We lose 697 observations through this filter, which ensures that each candidate has a unique ID by which to track them. We now construct a new race-id in order to have a unique identifier for groups of candidates. For this we group on year, state, totalvotes, data_source, runoff, special, and writein. We then use this to drop races with duplicate candidates, and we include a final filter for unique candidates using their personal-id and candidate names. Note that we still allow the same candidate to appear in different races. This final data-set contains 46857 unique candidates across 26583 unique races.
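As a rough sketch of the race-id construction, the snippet below assigns one integer id per distinct combination of the grouping variables listed above. The helper name add_race_id, the integer ids, and the toy rows are assumptions for illustration, not the pipeline's actual code.

```python
RACE_KEYS = ("year", "state", "totalvotes", "data_source",
             "runoff", "special", "writein")


def add_race_id(rows):
    """Attach a race_id to each row; rows sharing all RACE_KEYS get the same id."""
    ids = {}
    tagged = []
    for row in rows:
        key = tuple(row[k] for k in RACE_KEYS)
        # The first time a key is seen it receives the next unused integer.
        race_id = ids.setdefault(key, len(ids))
        tagged.append({**row, "race_id": race_id})
    return tagged


base = {"year": 2018, "state": "OH", "totalvotes": 1000,
        "data_source": "PRE", "runoff": 0, "special": 0, "writein": 0}
rows = [
    {**base, "candidate": "A"},
    {**base, "candidate": "B"},
    {**base, "candidate": "C", "totalvotes": 2500},  # a different race
]
tagged = add_race_id(rows)
```

Candidates A and B share every grouping variable and therefore end up with the same race_id, while C lands in a separate race because totalvotes differs.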

Filtering Covariates

Here we need to deal with the sex and party variables, which we want to mutate into neat factors. We begin by factoring sex into male, female, and unknown groups, with 40026 male rows, 13777 female rows, and 9268 unknown rows. Since we are considering regional races as well, and only restricting to those containing at least one democrat and one republican, we need to group the third parties into a small number of categories. We generate groupings for democrats and republicans with 20817 and 20305 rows respectively, as well as independents, libertarians, non-partisans, and others. These groups contain 3647, 3543, 9305, and 5454 rows respectively. These groupings are important for generating valid-race filters.
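The party grouping might look roughly like the sketch below. The matching rules (substring checks on a lower-cased label) and the function name party_group are hypothetical; the raw party labels in the source data and the exact recoding used are not documented here.

```python
def party_group(raw_party):
    """Collapse a raw party label into one of six groups (illustrative rules)."""
    label = (raw_party or "").strip().lower()
    if "democrat" in label:
        return "democrat"
    if "republican" in label:
        return "republican"
    if "libertarian" in label:
        return "libertarian"
    if "independent" in label:
        return "independent"
    if "nonpartisan" in label or "non-partisan" in label:
        return "non-partisan"
    # Everything else, including missing labels, falls into the catch-all group.
    return "other"


examples = ["Democratic", " REPUBLICAN ", "Libertarian", "Nonpartisan", "Green", None]
groups = [party_group(p) for p in examples]
```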

Defining valid filters

We define three classes of filters which we use to decide whether a specific observation is used for training/regression purposes:

Valid-Candidate filter

We consider a candidate valid when:

The total number of votes received > 0
They are either a democrat or a republican

Applying these filters leaves us with 41097 rows containing valid candidates, and 21974 rows without. This filter is comparatively not very restrictive, since most elections from the HOR and PRE data contain democrats and republicans.
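The two conditions above translate into a simple row-level flag. In this sketch the column names candidatevotes and party_group are assumed, as is the helper name is_valid_candidate.

```python
MAJOR_PARTIES = {"democrat", "republican"}


def is_valid_candidate(row):
    """True when the candidate received votes and belongs to a major party."""
    return row["candidatevotes"] > 0 and row["party_group"] in MAJOR_PARTIES


rows = [
    {"name": "A", "candidatevotes": 1200, "party_group": "democrat"},
    {"name": "B", "candidatevotes": 0, "party_group": "republican"},
    {"name": "C", "candidatevotes": 300, "party_group": "libertarian"},
]
valid_candidate = [is_valid_candidate(r) for r in rows]
```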

Valid-Race filter

We consider a race valid when:

It contains one democrat vs. one republican (no duplicate parties)
It can contain third party candidates
Third party candidates won’t be valid, but the race itself is, so long as there is a single winner
We consider only races in which democrats or republicans won

Applying these, we get 6 separate groupings. There are 6386 single party races, 1142 duplicate party races, 4624 races with missing parties (i.e. not a democrat or republican), 0 races with duplicated candidates, 6 ties, and 14425 valid races.
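One way to express this classification is sketched below. The order in which the checks are applied, the field names, and the treatment of "not exactly one winner" as a tie are assumptions. The source also does not say how a race won by a third-party candidate is counted, so the sketch gives it a separate label rather than guessing which of the six groups it belongs to.

```python
from collections import Counter

MAJOR = ("democrat", "republican")


def classify_race(candidates):
    """Label one race from its candidate rows.

    Each row is a dict with hypothetical keys 'candidate_id',
    'party_group' and 'didwin'.
    """
    ids = [c["candidate_id"] for c in candidates]
    if len(ids) != len(set(ids)):
        return "duplicate_candidates"

    counts = Counter(c["party_group"] for c in candidates)
    n_dem, n_rep = counts["democrat"], counts["republican"]
    if n_dem == 0 and n_rep == 0:
        return "missing_party"
    if n_dem == 0 or n_rep == 0:
        return "single_party"
    if n_dem > 1 or n_rep > 1:
        return "duplicate_party"

    winners = [c for c in candidates if c["didwin"] == 1]
    if len(winners) != 1:
        return "tie"  # assumption: anything other than a single winner
    if winners[0]["party_group"] not in MAJOR:
        return "third_party_winner"  # not one of the six documented groups
    return "valid"


race = [
    {"candidate_id": 1, "party_group": "democrat", "didwin": 1},
    {"candidate_id": 2, "party_group": "republican", "didwin": 0},
    {"candidate_id": 3, "party_group": "independent", "didwin": 0},
]
label = classify_race(race)
```

The example race contains one democrat, one republican, and a third-party candidate who lost, so it is labelled valid even though the independent would not pass the valid-candidate filter.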

Valid-Image filter

The final type of filter we consider is a valid-image filter. Here we want to check the quality of the image-data we scraped. This filter relies on a data-set of hash-codes linked to the images contained in the BAL, HOR and PRE data-sets. These codes are unique to each image and let us look at:

Image duplicates for different candidates and within races (since scraping may lead to race-specific errors like this)
Whether images are blank (since we know the hash code for white images)
Whether we have an image scraped for that candidate (i.e. whether a hash code exists)

We clearly don’t want two different candidates, with different outcomes, to have the same image. We also want to limit errors coming from the scraping pipeline, and thus restrict this quite tightly. We are left with 33176 potentially valid images. This number will be reduced greatly when considering valid races, since the BAL and PRE data contain a lot of images for local races, which will be filtered out through our democrat vs. republican filters.
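The three image checks can be combined into one flag per candidate, as in the sketch below. It assumes a mapping from candidate id to hash code (None when no image was scraped) and a set of known blank-image hashes; the names valid_image_flags, hash_by_candidate and blank_hashes are hypothetical, and dropping every copy of a shared image is one possible reading of "restrict this quite tightly".

```python
from collections import Counter


def valid_image_flags(hash_by_candidate, blank_hashes):
    """Flag a candidate's image as valid when it exists, is not blank,
    and is not shared with any other candidate."""
    counts = Counter(h for h in hash_by_candidate.values() if h is not None)
    return {
        cand: h is not None and h not in blank_hashes and counts[h] == 1
        for cand, h in hash_by_candidate.items()
    }


hash_by_candidate = {
    "cand_1": "aaa",   # unique image
    "cand_2": "bbb",   # shared with cand_3
    "cand_3": "bbb",
    "cand_4": "www",   # known blank image
    "cand_5": None,    # no image scraped
}
flags = valid_image_flags(hash_by_candidate, blank_hashes={"www"})
```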

An observation is considered valid when all three flags are true, i.e. a valid candidate, in a valid race, with a valid image. All combined, this leaves us with 17559 observations.

Image Processing

We apply three steps to the raw ballotpedia, house-of-representatives, and precinct images. These are:

Cropping + Focusing + Scaling

We apply a cropping pipeline to all raw images in script s13. This takes raw images and identifies boundaries where a face could be. It then crops the original image down to only this box. This simultaneously filters out images which don’t contain a valid face, and reduces the potential for non-useful background noise. The second step in the script is a focusing process in which we just black out the background outside a certain margin around the oval where a face was identified. Lastly all these images are scaled to 256x256 while preserving the aspect ratio (so as to not induce any warping).
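The scaling step can be illustrated with the arithmetic below. It assumes that "preserving the aspect ratio" means shrinking or enlarging the crop until its longer side is 256 pixels and then centring it on a 256x256 canvas; the padding and the function name fit_within are assumptions, since script s13 is not reproduced here.

```python
def fit_within(width, height, target=256):
    """Return (new_w, new_h, pad_x, pad_y) for placing an image of the given
    size inside a target x target square without changing its aspect ratio."""
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Padding on the left/top needed to centre the resized image.
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    return new_w, new_h, pad_x, pad_y


wide = fit_within(512, 256)
tall = fit_within(300, 400)
```

For a 512x256 crop the image is resized to 256x128 and centred with 64 pixels of padding above and below, so the face is not stretched.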

Manual inspection

The face-identification algorithm discussed above is not 100% accurate and many images are flagged for containing more than one face. These images are saved with the corresponding -0.jpg, -1.jpg, …, -N.jpg enumerator and need to be inspected manually. We take the images which are non-valid faces and remove them from the -focused directory, which now contains all the focused images which are valid. In the final run of this pipeline we were left with 29728 valid scraped images in total.

Renaming

The final step of the image processing is renaming. Since there is no single common id between all three data sources, we construct one in the main script s14 and then use the separate data-source specific names to rename the images. This is needed in order to feed these images into the CNN from a single directory. A reference feather file is also produced in s14 so that the new names can be matched back to the original ones if that is ever needed.

Defining Outcomes

Outcomes in the election data are straightforward. We have access to final election outcomes in didwin, which is 0 for a candidate that lost their election and 1 for the candidate that won. We also have vote-share, which is candidate-vote/total-votes. These measures consider all candidates in every race, including those which are not democrats or republicans. Since we end up filtering on these, we also consider conditional-vote-share, which is computed over only the democratic and republican candidates in each race. Thus conditional-vote-share is conditional-candidate-votes/conditional-total-votes.

Our primary model is trained to predict didwin in a classification task, but we use the conditional-vote-share feature to construct historical own-party vote share features.
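The two vote-share measures can be computed per race as in this sketch. The field names candidatevotes, totalvotes and party_group, and the helper add_vote_shares, are assumptions; third-party candidates are given no conditional share here because they fall outside the democrat vs. republican comparison.

```python
MAJOR = {"democrat", "republican"}


def add_vote_shares(race):
    """Add vote_share and conditional_vote_share to each candidate in one race."""
    # Denominator restricted to democratic and republican votes in this race.
    cond_total = sum(c["candidatevotes"] for c in race if c["party_group"] in MAJOR)
    out = []
    for c in race:
        share = c["candidatevotes"] / c["totalvotes"]
        if c["party_group"] in MAJOR and cond_total > 0:
            cond_share = c["candidatevotes"] / cond_total
        else:
            cond_share = None
        out.append({**c, "vote_share": share, "conditional_vote_share": cond_share})
    return out


race = [
    {"party_group": "democrat", "candidatevotes": 60, "totalvotes": 100},
    {"party_group": "republican", "candidatevotes": 30, "totalvotes": 100},
    {"party_group": "independent", "candidatevotes": 10, "totalvotes": 100},
]
shares = add_vote_shares(race)
```

In this toy race the democrat has a vote-share of 60/100 but a conditional-vote-share of 60/90, because the independent's 10 votes are excluded from the conditional denominator.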

Partitioning Data

The last step of data-prep is to ensure correct partitioning of the train/val-test portions of the data. We need a partitioning that satisfies the following:

Every candidate in a particular race (i.e. the democrat and the republican) is on the same side of the split
There is no information leakage within a race across the train and test sets
Since outcomes are binary, splitting a race would leak its result: if the democrat and the republican from the same race sit on opposite sides of the split and the model has seen that the democrat won, it knows that the republican must have lost

Ensuring this split is done correctly yields a train/val-test split of: 11136/6423
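A race-level split of this kind can be sketched as follows: races, not candidates, are shuffled and assigned to one side, so both candidates of a race always travel together. The function name split_by_race, the held_out_fraction default and the seed are hypothetical; the default of 0.37 simply approximates the 6423/17559 proportion reported above.

```python
import random


def split_by_race(rows, held_out_fraction=0.37, seed=0):
    """Split rows so that all candidates of a race land on the same side."""
    race_ids = sorted({r["race_id"] for r in rows})
    rng = random.Random(seed)
    rng.shuffle(race_ids)
    n_held_out = round(len(race_ids) * held_out_fraction)
    held_out_races = set(race_ids[:n_held_out])
    train = [r for r in rows if r["race_id"] not in held_out_races]
    held_out = [r for r in rows if r["race_id"] in held_out_races]
    return train, held_out


# Ten toy races with one democrat and one republican each.
rows = [
    {"race_id": race, "party_group": party}
    for race in range(10)
    for party in ("democrat", "republican")
]
train, held_out = split_by_race(rows)
```

Because the assignment is made on race_id, no race can contribute a candidate to both sides, which removes the leakage described in the list above.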