nz_booths

This is how I’d use the tidyverse packages (dplyr, tidyr, purrr, et al.) to import and tidy booth-level data on the 2017 NZ election results. The code produces a tidy data frame in which each row is one observation: the number of votes for an individual candidate at a particular booth.

Load packages

First we load the packages we’ll need. I use rvest for web scraping and the tidyverse for everything else.

library(rvest)

## Loading required package: xml2

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2.9000     ✓ purrr   0.3.4.9000
## ✓ tibble  3.0.3.9000     ✓ dplyr   1.0.2.9000
## ✓ tidyr   1.1.2.9000     ✓ stringr 1.4.0.9000
## ✓ readr   1.3.1.9000     ✓ forcats 0.5.0.9000

## ── Conflicts ────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()         masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag()            masks stats::lag()
## x purrr::pluck()          masks rvest::pluck()

Scrape website

Now we extract the URLs for the individual CSVs from the NZ election results website:

cand_list_url <- "https://archive.electionresults.govt.nz/electionresults_2017/statistics/votes-by-voting-place-electorate-index.html"

cand_csv_urls <- cand_list_url %>%
  read_html() %>%
  html_nodes("td:nth-child(3) a") %>%
  html_attr("href")

base_url <- "https://archive.electionresults.govt.nz/electionresults_2017/statistics/"

cand_csv_urls <- paste0(base_url, cand_csv_urls)

Now we have URLs for the individual CSVs. Here’s what they look like:

https://archive.electionresults.govt.nz/electionresults_2017/statistics/csv/candidate-votes-by-voting-place-1.csv
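
To see what the selector and the URL join are doing without hitting the website, here is a minimal offline sketch. The HTML snippet is a made-up stand-in for the index page (the real markup may differ), and toy_page, toy_hrefs, toy_urls and toy_urls_alt are illustrative names. It also shows xml2::url_absolute() as a possible alternative to paste0() for resolving relative links.

```r
library(rvest)

base_url <- "https://archive.electionresults.govt.nz/electionresults_2017/statistics/"

# Assumption: a made-up two-row stand-in for the index page, with the CSV
# link in the third cell of each row
toy_page <- read_html('
  <table>
    <tr><td>1</td><td>Electorate A</td><td><a href="csv/candidate-votes-by-voting-place-1.csv">CSV</a></td></tr>
    <tr><td>2</td><td>Electorate B</td><td><a href="csv/candidate-votes-by-voting-place-2.csv">CSV</a></td></tr>
  </table>')

# Take the href of every link that sits inside the third cell of a row
toy_hrefs <- toy_page %>%
  html_nodes("td:nth-child(3) a") %>%
  html_attr("href")

# The hrefs are relative, so prepend the directory the index page lives in
toy_urls <- paste0(base_url, toy_hrefs)

# xml2::url_absolute() resolves relative links against a base URL as well
toy_urls_alt <- xml2::url_absolute(toy_hrefs, base_url)
```
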

Download

Now we download the CSVs:

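cand_filenames <- file.path("data", basename(cand_csv_urls))

# download.file() won't create the destination folder, so make sure it exists
dir.create("data", showWarnings = FALSE)

walk2(cand_csv_urls, cand_filenames, download.file)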
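
If you expect to re-run the script, it may be worth skipping files that are already on disk. The sketch below is one way to do that; download_if_missing() is a hypothetical helper, not part of any package, and mode = "wb" is my own choice to keep the bytes unchanged on Windows.

```r
# Hypothetical helper: download `url` to `destfile` unless the file is already there
download_if_missing <- function(url, destfile) {
  # Create the destination folder if needed
  dir.create(dirname(destfile), showWarnings = FALSE, recursive = TRUE)
  if (!file.exists(destfile)) {
    download.file(url, destfile, mode = "wb", quiet = TRUE)
  }
  invisible(destfile)
}

# Possible usage with the vectors created above:
# purrr::walk2(cand_csv_urls, cand_filenames, download_if_missing)
```
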

Import CSVs

Now that we have the CSVs locally, we can import them. The CSVs are formatted in a non-standard way: there don’t appear to be enough commas in the initial rows. To make sure we get all the information we want, I first read each CSV from the third row onwards, then read it again just to get the electorate name from the second row.

cands <- map(cand_filenames, read_csv,
             skip = 2)

# We have to read the file twice because it's formatted incorrectly, with too 
# few commas in the initial rows
get_elec_names <- function(filename) {
  x <- read_csv(filename, skip = 1, n_max = 1) %>%
    names()
  x[1]
}

elec_names <- map_chr(cand_filenames, get_elec_names)

cands <- set_names(cands, elec_names)

Now we have a list (cands), each element of which is a dataframe corresponding to an electorate CSV.
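
The two-pass read can be tried out on a small made-up file. This is only a sketch: the file contents below are invented to mimic the layout described above, and toy_path, toy_table and toy_name are illustrative names. For the second pass it swaps in readr::read_lines() rather than read_csv() plus names(), which reads the row as plain text; that assumes the electorate row carries no quotes or trailing commas that would need cleaning.

```r
library(readr)

# Assumption: a made-up file that mimics the layout described above
# (title row, electorate name row, then the table header and data)
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Hypothetical title row",
  "Auckland Central 1",
  ",,\"EDWARDS, Frank Torrens\",Total Valid Candidate Votes",
  "Auckland City,\"Atrium, Takutai Square\",20,20"
), toy_path)

# Main table: skip the two short rows so the third row becomes the header
toy_table <- read_csv(toy_path, skip = 2, col_types = cols())

# Electorate name: read the second row as plain text instead of parsing it
toy_name <- read_lines(toy_path, skip = 1, n_max = 1)
```
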

Tidy CSVs

Now the fun stuff. We wrangle each dataframe to get it in a tidy format, in which each row is an observation (booth-level number of votes for an individual candidate). Note that I drop the little table at the bottom of each CSV - I’m only reading the main table.

# Create a function to tidy the electorate-candidate dfs
tidy_elec_df <- function(df) {
  df %>%
    rename(area = 1, booth = 2) %>%
    # Move place type ('advance' vs. regular) to its own column
    mutate(place_type = if_else(is.na(booth),
                                area,
                                NA_character_),
           area = if_else(!is.na(place_type),
                          NA_character_,
                          area)) %>%
    # Fill blank cells with the previous non-blank text
    fill(place_type, area, booth, .direction = "down") %>%
    # Drop the weird little table underneath the main table
    filter(!is.na(`Total Valid Candidate Votes`)) %>%
    # Drop booth totals
    select(-`Total Valid Candidate Votes`) %>%
    # Pivot to long/tidy format
    gather(key = candidate, value = votes,
           -place_type, -area, -booth) %>%
    mutate(votes = as.numeric(votes))
}

# Apply the function to each df in the list

cands_df <- map_dfr(cands, tidy_elec_df, .id = "electorate")
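
To make the steps inside tidy_elec_df() easier to follow, here is the same sequence applied to a tiny made-up table. Everything in toy is invented for illustration (the "Ordinary Voting Places" label, the SMITH column and the vote counts are assumptions, not real data); the point is only to show how the place-type rows become a column and how the blanks get filled.

```r
library(dplyr)
library(tidyr)

# Assumption: made-up rows mimicking one electorate's table after read_csv()
toy <- tibble::tribble(
  ~X1,                      ~X2,                ~`EDWARDS, Frank Torrens`, ~`SMITH, Jane`, ~`Total Valid Candidate Votes`,
  "Advance Voting Places",  NA,                 NA,                        NA,             NA,
  "Auckland City",          "Atrium",           20,                        5,              25,
  NA,                       "Liston House",     6,                         4,              10,
  "Ordinary Voting Places", NA,                 NA,                        NA,             NA,
  "Grey Lynn",              "Community Centre", 8,                         2,              10
)

toy_tidy <- toy %>%
  rename(area = 1, booth = 2) %>%
  # Rows with no booth are place-type headings: copy them to their own column
  mutate(place_type = if_else(is.na(booth), area, NA_character_),
         area = if_else(!is.na(place_type), NA_character_, area)) %>%
  # Carry headings and area names down over the blank cells below them
  fill(place_type, area, booth, .direction = "down") %>%
  # Heading rows have no total, so this drops them
  filter(!is.na(`Total Valid Candidate Votes`)) %>%
  select(-`Total Valid Candidate Votes`) %>%
  # One row per booth and candidate
  gather(key = candidate, value = votes, -place_type, -area, -booth) %>%
  mutate(votes = as.numeric(votes))
```

The result has one row per booth and candidate, with "Liston House" inheriting the area "Auckland City" from the row above it.
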

Result

Now you have a tidy dataframe of booth-level results for each electorate. It has 47951 rows, and the first ten look like this:

electorate	area	booth	place_type	candidate	votes
Auckland Central 1	Auckland City	Atrium, Takutai Square	Advance Voting Places	EDWARDS, Frank Torrens	20
Auckland Central 1	Auckland City	Auckland University, AUSA Club Space, The Quad, Alfred Street	Advance Voting Places	EDWARDS, Frank Torrens	11
Auckland Central 1	Auckland City	AUT University, Level 4 Library Foyer, WA Building, 55 Wellesley Street East	Advance Voting Places	EDWARDS, Frank Torrens	23
Auckland Central 1	Auckland City	Liston House, 30-32 Hobson Street	Advance Voting Places	EDWARDS, Frank Torrens	6
Auckland Central 1	Freemans Bay	Victoria Park New World, 2 College Hill	Advance Voting Places	EDWARDS, Frank Torrens	24
Auckland Central 1	Freemans Bay	Auckland Hospital Mobile & Advance Voting	Advance Voting Places	EDWARDS, Frank Torrens	3
Auckland Central 1	Grey Lynn	Grey Lynn Community Centre, Oval Room, 510 Richmond Road	Advance Voting Places	EDWARDS, Frank Torrens	8
Auckland Central 1	Grey Lynn	Grey Lynn Library Hall, 474 Great North Road	Advance Voting Places	EDWARDS, Frank Torrens	3
Auckland Central 1	Mangere	Auckland International Airport Terminal, International Departure Lounge	Advance Voting Places	EDWARDS, Frank Torrens	0
Auckland Central 1	Mt Albert	UNITEC, Te Puna Building, Student Drop in Space, Level 1, 139 Carrington Road	Advance Voting Places	EDWARDS, Frank Torrens	0
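
Once the data is in this shape, summaries are short dplyr pipelines. The sketch below uses a made-up slice (toy_results, with invented electorates, booths, candidates and counts) so that it runs on its own; the same calls should work on cands_df, although I have not shown that here.

```r
library(dplyr)

# Assumption: a made-up slice shaped like cands_df; all values are illustrative
toy_results <- tibble::tribble(
  ~electorate,    ~booth,    ~candidate,    ~votes,
  "Electorate A", "Booth 1", "Candidate X", 20,
  "Electorate A", "Booth 2", "Candidate X", 11,
  "Electorate A", "Booth 1", "Candidate Y", 7,
  "Electorate A", "Booth 2", "Candidate Y", 30
)

# Total votes per candidate within each electorate
candidate_totals <- toy_results %>%
  count(electorate, candidate, wt = votes, name = "total_votes")

# Candidate with the most votes at each booth
booth_winners <- toy_results %>%
  group_by(electorate, booth) %>%
  slice_max(votes, n = 1) %>%
  ungroup()
```
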