How to load the data and fix the column names.
library(readr)
library(stringr)
library(dplyr)
nnv <- read_tsv("data/all-votes.tsv")
# Clean up variable names
names(nnv) <- names(nnv) %>%
str_to_lower() %>%
str_replace_all("\\.", "") %>%
str_replace_all("\\s", "_")
# Extract the year from the date column and treat that as an integer
nnv <- nnv %>%
mutate(year = str_extract(date, "\\d{4}") %>% as.integer())
There are some many different kinds of election and so many different states or geographies represented in the New Nation Votes data that making an assumption about the relationship between the cities, towns, counties, etc. for every election is perilous. So we are going to take a subset of elections for which we know that we are doing the right thing. Let’s do the election to Virginia’s 15th Congressional district in 1809. (We might be able to generalize from this to Congressional elections as a whole.)
We will look at at a small number of the columns to get a sense of the geographies represented, but we won’t actually throw away that data.
va15 <- nnv %>%
filter(id == "va.uscongress.15.1809")
va15 %>% select(district, county,town, name, vote)
## Source: local data frame [12 x 5]
##
## district county town name vote
## (chr) (chr) (chr) (chr) (int)
## 1 NA NA NA John Randolph 1243
## 2 NA NA NA Jerman Baker 492
## 3 Fifteen NA NA John Randolph 1243
## 4 Fifteen NA NA Jerman Baker 492
## 5 Fifteen Buckingham NA John Randolph 379
## 6 Fifteen Buckingham NA Jerman Baker 192
## 7 Fifteen Charlotte NA John Randolph 332
## 8 Fifteen Charlotte NA Jerman Baker 72
## 9 Fifteen Cumberland NA John Randolph 190
## 10 Fifteen Cumberland NA Jerman Baker 144
## 11 Fifteen Prince Edward NA John Randolph 342
## 12 Fifteen Prince Edward NA Jerman Baker 84
It should be obvious what is happening here. We have the votes counted at three different levels. When the district and county is specified we have the county level votes. When the district is specified but not the county, we have the sum of the votes for the district. And when neither the district nor the county is specified, we have the totals for the entire election. As it happens, there is only one district in the election, so we can right away spot that the district total is equal to the grand total: a dead giveaway that the votes are being counted multiple times.
But let’s double check that the county totals match the district total and the grand total. In each case we are going to group by the name and summarize to get the sum of the votes for each candidate. But we are also going to use filter to specify which geographies we want.
First, the grand total results for the election.
va15 %>%
filter(is.na(district) & is.na(county)) %>%
group_by(name) %>%
summarize(vote = sum(vote, na.rm = TRUE))
## Source: local data frame [2 x 2]
##
## name vote
## (chr) (int)
## 1 Jerman Baker 492
## 2 John Randolph 1243
Second, the district total.
va15 %>%
filter(!is.na(district) & is.na(county)) %>%
group_by(name) %>%
summarize(vote = sum(vote, na.rm = TRUE))
## Source: local data frame [2 x 2]
##
## name vote
## (chr) (int)
## 1 Jerman Baker 492
## 2 John Randolph 1243
Third, the total of the counties.
va15 %>%
filter(!is.na(district) & !is.na(county)) %>%
group_by(name) %>%
summarize(vote = sum(vote, na.rm = TRUE))
## Source: local data frame [2 x 2]
##
## name vote
## (chr) (int)
## 1 Jerman Baker 492
## 2 John Randolph 1243
The totals are exactly the same in each case, so we know for certain that each ballot is being triple counted at different levels of aggregation. This implies that if we need to filter out at least two of the undesired levels of aggregation.
Can we generalize this to Virginia congressional elections in 1809? Here we can assume that if it has a district but not a county (and also, not a city, since Williamsburg appears to be a special case), then it is the district total, which happens to be the same thing as the grand total for the election.
va_congressional <- nnv %>%
filter(year == 1809,
state == "Virginia",
office == "U.S. House of Representatives") %>%
arrange(id) %>%
filter(!is.na(district) & is.na(county) & is.na(city))
va_congressional %>%
select(id, district, name, vote)
## Source: local data frame [20 x 4]
##
## id district name vote
## (chr) (chr) (chr) (int)
## 1 va.uscongress.1.1809 One John G. Jackson 1326
## 2 va.uscongress.1.1809 One Noah Linsley 872
## 3 va.uscongress.12.1809 Twelve Burwell Bassett 1172
## 4 va.uscongress.12.1809 Twelve John Eyre 869
## 5 va.uscongress.15.1809 Fifteen John Randolph 1243
## 6 va.uscongress.15.1809 Fifteen Jerman Baker 492
## 7 va.uscongress.18.1809 Eighteen Peterson Goodwyn 670
## 8 va.uscongress.18.1809 Eighteen John Pegram 280
## 9 va.uscongress.2.1809 Two James Stephenson 876
## 10 va.uscongress.2.1809 Two John Morrow 645
## 11 va.uscongress.20.1809 Twenty Thomas Newton 1051
## 12 va.uscongress.20.1809 Twenty Robert B. Tayler 699
## 13 va.uscongress.3.1809 Three John Smith 1085
## 14 va.uscongress.3.1809 Three Robert Page 163
## 15 va.uscongress.4.1809 Four Jacob Swoope 1369
## 16 va.uscongress.4.1809 Four Daniel Smith 1181
## 17 va.uscongress.6.1809 Six Daniel Sheffey 1662
## 18 va.uscongress.6.1809 Six Francis Preston 835
## 19 va.uscongress.7.1809 Seven Joseph Lewis, Jr. 760
## 20 va.uscongress.7.1809 Seven William Tyler 465
Each candidate appears only once in each district. A quick way of verifying that:
va_congressional %>%
count(district, name)
## Source: local data frame [20 x 3]
## Groups: district [?]
##
## district name n
## (chr) (chr) (int)
## 1 Eighteen John Pegram 1
## 2 Eighteen Peterson Goodwyn 1
## 3 Fifteen Jerman Baker 1
## 4 Fifteen John Randolph 1
## 5 Four Daniel Smith 1
## 6 Four Jacob Swoope 1
## 7 One John G. Jackson 1
## 8 One Noah Linsley 1
## 9 Seven Joseph Lewis, Jr. 1
## 10 Seven William Tyler 1
## 11 Six Daniel Sheffey 1
## 12 Six Francis Preston 1
## 13 Three John Smith 1
## 14 Three Robert Page 1
## 15 Twelve Burwell Bassett 1
## 16 Twelve John Eyre 1
## 17 Twenty Robert B. Tayler 1
## 18 Twenty Thomas Newton 1
## 19 Two James Stephenson 1
## 20 Two John Morrow 1
Note that in Massachusetts the votes are counted by town, not by country. Different states had different ways of doing things.
Given our VA congressional data from above, how do we figure out who won? Assuming that there wasn’t a run-off election, the person who won is the person who got the most votes. Another way of saying that is asking if the vote that the person got was the same thing as the maximum vote for all the candidates. We need to group_by() the district variable, then use mutate() to add a new won column.
va_winners <- va_congressional %>%
group_by(district) %>%
mutate(won = vote == max(vote)) %>%
select(id, district, name, vote, won)
va_winners
## Source: local data frame [20 x 5]
## Groups: district [10]
##
## id district name vote won
## (chr) (chr) (chr) (int) (lgl)
## 1 va.uscongress.1.1809 One John G. Jackson 1326 TRUE
## 2 va.uscongress.1.1809 One Noah Linsley 872 FALSE
## 3 va.uscongress.12.1809 Twelve Burwell Bassett 1172 TRUE
## 4 va.uscongress.12.1809 Twelve John Eyre 869 FALSE
## 5 va.uscongress.15.1809 Fifteen John Randolph 1243 TRUE
## 6 va.uscongress.15.1809 Fifteen Jerman Baker 492 FALSE
## 7 va.uscongress.18.1809 Eighteen Peterson Goodwyn 670 TRUE
## 8 va.uscongress.18.1809 Eighteen John Pegram 280 FALSE
## 9 va.uscongress.2.1809 Two James Stephenson 876 TRUE
## 10 va.uscongress.2.1809 Two John Morrow 645 FALSE
## 11 va.uscongress.20.1809 Twenty Thomas Newton 1051 TRUE
## 12 va.uscongress.20.1809 Twenty Robert B. Tayler 699 FALSE
## 13 va.uscongress.3.1809 Three John Smith 1085 TRUE
## 14 va.uscongress.3.1809 Three Robert Page 163 FALSE
## 15 va.uscongress.4.1809 Four Jacob Swoope 1369 TRUE
## 16 va.uscongress.4.1809 Four Daniel Smith 1181 FALSE
## 17 va.uscongress.6.1809 Six Daniel Sheffey 1662 TRUE
## 18 va.uscongress.6.1809 Six Francis Preston 835 FALSE
## 19 va.uscongress.7.1809 Seven Joseph Lewis, Jr. 760 TRUE
## 20 va.uscongress.7.1809 Seven William Tyler 465 FALSE
As a double-check, we can make sure there is only one winner for each election.
va_winners %>%
filter(won == TRUE) %>%
count(id)
## Source: local data frame [10 x 2]
##
## id n
## (chr) (int)
## 1 va.uscongress.1.1809 1
## 2 va.uscongress.12.1809 1
## 3 va.uscongress.15.1809 1
## 4 va.uscongress.18.1809 1
## 5 va.uscongress.2.1809 1
## 6 va.uscongress.20.1809 1
## 7 va.uscongress.3.1809 1
## 8 va.uscongress.4.1809 1
## 9 va.uscongress.6.1809 1
## 10 va.uscongress.7.1809 1