Loading the data

How to load the data and fix the column names.

library(readr)
library(stringr)
library(dplyr)
nnv <- read_tsv("data/all-votes.tsv")

# Clean up variable names
names(nnv) <- names(nnv) %>% 
  str_to_lower() %>% 
  str_replace_all("\\.", "") %>% 
  str_replace_all("\\s", "_")

# Extract the year from the date column and treat that as an integer
nnv <- nnv %>% 
  mutate(year = str_extract(date, "\\d{4}") %>% as.integer())

Removing double or triple counted votes.

There are some many different kinds of election and so many different states or geographies represented in the New Nation Votes data that making an assumption about the relationship between the cities, towns, counties, etc. for every election is perilous. So we are going to take a subset of elections for which we know that we are doing the right thing. Let’s do the election to Virginia’s 15th Congressional district in 1809. (We might be able to generalize from this to Congressional elections as a whole.)

We will look at at a small number of the columns to get a sense of the geographies represented, but we won’t actually throw away that data.

va15 <- nnv %>% 
  filter(id == "va.uscongress.15.1809") 

va15 %>% select(district, county,town, name, vote)
## Source: local data frame [12 x 5]
## 
##    district        county  town          name  vote
##       (chr)         (chr) (chr)         (chr) (int)
## 1        NA            NA    NA John Randolph  1243
## 2        NA            NA    NA  Jerman Baker   492
## 3   Fifteen            NA    NA John Randolph  1243
## 4   Fifteen            NA    NA  Jerman Baker   492
## 5   Fifteen    Buckingham    NA John Randolph   379
## 6   Fifteen    Buckingham    NA  Jerman Baker   192
## 7   Fifteen     Charlotte    NA John Randolph   332
## 8   Fifteen     Charlotte    NA  Jerman Baker    72
## 9   Fifteen    Cumberland    NA John Randolph   190
## 10  Fifteen    Cumberland    NA  Jerman Baker   144
## 11  Fifteen Prince Edward    NA John Randolph   342
## 12  Fifteen Prince Edward    NA  Jerman Baker    84

It should be obvious what is happening here. We have the votes counted at three different levels. When the district and county is specified we have the county level votes. When the district is specified but not the county, we have the sum of the votes for the district. And when neither the district nor the county is specified, we have the totals for the entire election. As it happens, there is only one district in the election, so we can right away spot that the district total is equal to the grand total: a dead giveaway that the votes are being counted multiple times.

But let’s double check that the county totals match the district total and the grand total. In each case we are going to group by the name and summarize to get the sum of the votes for each candidate. But we are also going to use filter to specify which geographies we want.

First, the grand total results for the election.

va15 %>% 
  filter(is.na(district) & is.na(county)) %>% 
  group_by(name) %>% 
  summarize(vote = sum(vote, na.rm = TRUE))
## Source: local data frame [2 x 2]
## 
##            name  vote
##           (chr) (int)
## 1  Jerman Baker   492
## 2 John Randolph  1243

Second, the district total.

va15 %>% 
  filter(!is.na(district) & is.na(county)) %>% 
  group_by(name) %>% 
  summarize(vote = sum(vote, na.rm = TRUE))
## Source: local data frame [2 x 2]
## 
##            name  vote
##           (chr) (int)
## 1  Jerman Baker   492
## 2 John Randolph  1243

Third, the total of the counties.

va15 %>% 
  filter(!is.na(district) & !is.na(county)) %>% 
  group_by(name) %>% 
  summarize(vote = sum(vote, na.rm = TRUE))
## Source: local data frame [2 x 2]
## 
##            name  vote
##           (chr) (int)
## 1  Jerman Baker   492
## 2 John Randolph  1243

The totals are exactly the same in each case, so we know for certain that each ballot is being triple counted at different levels of aggregation. This implies that if we need to filter out at least two of the undesired levels of aggregation.

Can we generalize this to Virginia congressional elections in 1809? Here we can assume that if it has a district but not a county (and also, not a city, since Williamsburg appears to be a special case), then it is the district total, which happens to be the same thing as the grand total for the election.

va_congressional <- nnv %>% 
  filter(year == 1809,
         state == "Virginia",
         office == "U.S. House of Representatives") %>%
  arrange(id) %>% 
  filter(!is.na(district) & is.na(county) & is.na(city))

va_congressional %>% 
  select(id, district, name, vote)
## Source: local data frame [20 x 4]
## 
##                       id district              name  vote
##                    (chr)    (chr)             (chr) (int)
## 1   va.uscongress.1.1809      One   John G. Jackson  1326
## 2   va.uscongress.1.1809      One      Noah Linsley   872
## 3  va.uscongress.12.1809   Twelve   Burwell Bassett  1172
## 4  va.uscongress.12.1809   Twelve         John Eyre   869
## 5  va.uscongress.15.1809  Fifteen     John Randolph  1243
## 6  va.uscongress.15.1809  Fifteen      Jerman Baker   492
## 7  va.uscongress.18.1809 Eighteen  Peterson Goodwyn   670
## 8  va.uscongress.18.1809 Eighteen       John Pegram   280
## 9   va.uscongress.2.1809      Two  James Stephenson   876
## 10  va.uscongress.2.1809      Two       John Morrow   645
## 11 va.uscongress.20.1809   Twenty     Thomas Newton  1051
## 12 va.uscongress.20.1809   Twenty  Robert B. Tayler   699
## 13  va.uscongress.3.1809    Three        John Smith  1085
## 14  va.uscongress.3.1809    Three       Robert Page   163
## 15  va.uscongress.4.1809     Four      Jacob Swoope  1369
## 16  va.uscongress.4.1809     Four      Daniel Smith  1181
## 17  va.uscongress.6.1809      Six    Daniel Sheffey  1662
## 18  va.uscongress.6.1809      Six   Francis Preston   835
## 19  va.uscongress.7.1809    Seven Joseph Lewis, Jr.   760
## 20  va.uscongress.7.1809    Seven     William Tyler   465

Each candidate appears only once in each district. A quick way of verifying that:

va_congressional %>% 
  count(district, name)
## Source: local data frame [20 x 3]
## Groups: district [?]
## 
##    district              name     n
##       (chr)             (chr) (int)
## 1  Eighteen       John Pegram     1
## 2  Eighteen  Peterson Goodwyn     1
## 3   Fifteen      Jerman Baker     1
## 4   Fifteen     John Randolph     1
## 5      Four      Daniel Smith     1
## 6      Four      Jacob Swoope     1
## 7       One   John G. Jackson     1
## 8       One      Noah Linsley     1
## 9     Seven Joseph Lewis, Jr.     1
## 10    Seven     William Tyler     1
## 11      Six    Daniel Sheffey     1
## 12      Six   Francis Preston     1
## 13    Three        John Smith     1
## 14    Three       Robert Page     1
## 15   Twelve   Burwell Bassett     1
## 16   Twelve         John Eyre     1
## 17   Twenty  Robert B. Tayler     1
## 18   Twenty     Thomas Newton     1
## 19      Two  James Stephenson     1
## 20      Two       John Morrow     1

Note that in Massachusetts the votes are counted by town, not by country. Different states had different ways of doing things.

But who won?

Given our VA congressional data from above, how do we figure out who won? Assuming that there wasn’t a run-off election, the person who won is the person who got the most votes. Another way of saying that is asking if the vote that the person got was the same thing as the maximum vote for all the candidates. We need to group_by() the district variable, then use mutate() to add a new won column.

va_winners <- va_congressional %>% 
  group_by(district) %>% 
  mutate(won = vote == max(vote)) %>% 
  select(id, district, name, vote, won)
va_winners
## Source: local data frame [20 x 5]
## Groups: district [10]
## 
##                       id district              name  vote   won
##                    (chr)    (chr)             (chr) (int) (lgl)
## 1   va.uscongress.1.1809      One   John G. Jackson  1326  TRUE
## 2   va.uscongress.1.1809      One      Noah Linsley   872 FALSE
## 3  va.uscongress.12.1809   Twelve   Burwell Bassett  1172  TRUE
## 4  va.uscongress.12.1809   Twelve         John Eyre   869 FALSE
## 5  va.uscongress.15.1809  Fifteen     John Randolph  1243  TRUE
## 6  va.uscongress.15.1809  Fifteen      Jerman Baker   492 FALSE
## 7  va.uscongress.18.1809 Eighteen  Peterson Goodwyn   670  TRUE
## 8  va.uscongress.18.1809 Eighteen       John Pegram   280 FALSE
## 9   va.uscongress.2.1809      Two  James Stephenson   876  TRUE
## 10  va.uscongress.2.1809      Two       John Morrow   645 FALSE
## 11 va.uscongress.20.1809   Twenty     Thomas Newton  1051  TRUE
## 12 va.uscongress.20.1809   Twenty  Robert B. Tayler   699 FALSE
## 13  va.uscongress.3.1809    Three        John Smith  1085  TRUE
## 14  va.uscongress.3.1809    Three       Robert Page   163 FALSE
## 15  va.uscongress.4.1809     Four      Jacob Swoope  1369  TRUE
## 16  va.uscongress.4.1809     Four      Daniel Smith  1181 FALSE
## 17  va.uscongress.6.1809      Six    Daniel Sheffey  1662  TRUE
## 18  va.uscongress.6.1809      Six   Francis Preston   835 FALSE
## 19  va.uscongress.7.1809    Seven Joseph Lewis, Jr.   760  TRUE
## 20  va.uscongress.7.1809    Seven     William Tyler   465 FALSE

As a double-check, we can make sure there is only one winner for each election.

va_winners %>% 
  filter(won == TRUE) %>% 
  count(id)
## Source: local data frame [10 x 2]
## 
##                       id     n
##                    (chr) (int)
## 1   va.uscongress.1.1809     1
## 2  va.uscongress.12.1809     1
## 3  va.uscongress.15.1809     1
## 4  va.uscongress.18.1809     1
## 5   va.uscongress.2.1809     1
## 6  va.uscongress.20.1809     1
## 7   va.uscongress.3.1809     1
## 8   va.uscongress.4.1809     1
## 9   va.uscongress.6.1809     1
## 10  va.uscongress.7.1809     1