Setup

We start with a typical setup by loading any required libraries and setting any environment variables we may need.

library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
working.dir <- "D:/Classes/Election Science/Class 2"

Scraping a File from the Internet

We will work with the North Carolina voter registration file. The North Carolina State Board of Elections (NCSBE) posts a wide range of election data at https://dl.ncsbe.gov/index.html. Every Saturday morning, NCSBE posts a fresh extract of the statewide voter registration file, along with complete past vote history and records of early voting in the current election.

The statewide voter file is rather large, so we will work with Alamance County, the first county alphabetically.

I first need to identify the URL that I’m grabbing the file from. URLs can be long and messy, so I’m going to assign the URL of the file I want to scrape to an R object that I’ll call url.

url <- "https://s3.amazonaws.com/dl.ncsbe.gov/data/ncvoter1.zip"

I next need to tell R where I’m going to save the file that I download and what to call it on my computer. Since this is the destination for the data traveling through the web, I’ll create another R object called dest.file to place this information.

I’m going to take advantage of the working.dir I created earlier: I concatenate working.dir with the file name and store the resulting string in the R object dest.file.

To do this, I use the paste command, which joins (or concatenates) together strings.

dest.file <- paste(working.dir, "/ncvoter1.zip", sep="")
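As an aside, base R offers two other ways to build the same string. The sketch below is only an illustration: it uses tempdir() as a hypothetical stand-in for working.dir so that it runs on any machine, and the object names dest.paste0 and dest.file.path are made up for the comparison.

```r
# Hypothetical stand-in for the working.dir defined earlier; tempdir() always exists
working.dir <- tempdir()

# paste0() is paste() with sep = "" built in
dest.paste0 <- paste0(working.dir, "/ncvoter1.zip")

# file.path() joins path components with "/", so no separator is typed by hand
dest.file.path <- file.path(working.dir, "ncvoter1.zip")

# Check whether both approaches build the same string
identical(dest.paste0, dest.file.path)
```

Using file.path avoids the easy mistake of forgetting the leading slash in the file name.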

Okay! We are ready to put all this together to scrape a file from the internet and place it in the directory and filename of your choosing.

I invoke the download.file command directly, instead of placing it into an R object.

download.file(url, dest.file, method="wininet")
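Note that method = "wininet" is only available on Windows. If you are on another system, one option is to leave method out and let R choose. The sketch below assumes that is acceptable for your setup. The helper name download_if_missing is hypothetical (it is not part of R), and it also skips the download when the file is already on disk, which is handy for a large file that only changes weekly.

```r
# Hypothetical helper: download `url` to `dest` unless the file is already there
download_if_missing <- function(url, dest) {
  if (file.exists(dest)) {
    message("Already downloaded: ", dest)
    return(invisible(FALSE))
  }
  # mode = "wb" writes the file in binary mode, which matters for zip files on Windows
  download.file(url, dest, mode = "wb")
  invisible(TRUE)
}

# Usage (not run here because it needs an internet connection):
# download_if_missing("https://s3.amazonaws.com/dl.ncsbe.gov/data/ncvoter1.zip",
#                     file.path(tempdir(), "ncvoter1.zip"))
```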

download.file takes two pieces of information:

  1. The url that we are scraping the data from
  2. The dest.file where the data will be saved on your computer

And then does its magic.

Zip Files

The data that we downloaded is a zip file, which is a compressed version of the text file we are working with. Compression algorithms transform a file so that it takes up less space. This is desirable when working with large files, such as voter registration files that often contain millions of records.

Zip files are in a binary file format, and R cannot read the text data in the zip file directly. We first need to unzip the file.

Fortunately, R has routines to zip and unzip files as needed. We use the unzip command, which takes a zip file and unzips its contents to a directory. You need to be careful, since a zip file can contain multiple files stored in subdirectories, and these files will be unzipped into the corresponding subdirectories. I set overwrite = T explicitly (this is also the default for unzip), because I know there is a single text file contained in the dest.file zip file I scraped in the previous step. I should have no problems overwriting this file for this exercise, but in other circumstances you should exercise caution. I set exdir = working.dir to unzip the zip file contents into the working.dir.

unzip(dest.file, exdir = working.dir, overwrite = T)
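Because overwrite = T replaces existing files silently, it can help to look inside a zip file before extracting it. The sketch below is one cautious way to do that, not the only one. The helper name unzip_no_clobber is hypothetical. It relies on unzip's list = TRUE argument, which returns the archive's contents without extracting anything.

```r
# Hypothetical helper: unzip only if no file in the archive would be overwritten
unzip_no_clobber <- function(zip.path, exdir) {
  # list = TRUE returns a data frame (Name, Length, Date) without extracting
  contents <- unzip(zip.path, list = TRUE)

  # Ignore directory entries, whose names end in "/"
  files <- contents$Name[!grepl("/$", contents$Name)]

  # Stop if any of the files already exists in the target directory
  targets <- file.path(exdir, files)
  clashes <- targets[file.exists(targets)]
  if (length(clashes) > 0) {
    stop("Would overwrite: ", paste(clashes, collapse = ", "))
  }

  unzip(zip.path, exdir = exdir, overwrite = FALSE)
  invisible(files)
}

# Usage with the objects from this lesson (assumes the download step has run):
# unzip_no_clobber(dest.file, working.dir)
```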

Loading the Voter File into R

We have scraped the zip file from the web and we’ve unzipped it. Now we are ready to load the file into R, just like we would do with any other file.

The text file inside the zip file has a different name, so I first create an R object with the directory and filename of this text file.

voter.file <- paste(working.dir, "/ncvoter1.txt", sep="")

Now I can read it into R. I happen to know that this particular file is a tab-separated text file, so I will use the command read_tsv.

voters.NC <- read_tsv(voter.file, col_names = T)
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   county_id = col_double(),
##   absent_ind = col_logical(),
##   name_prefx_cd = col_logical(),
##   zip_code = col_double(),
##   mail_addr3 = col_logical(),
##   mail_addr4 = col_logical(),
##   birth_age = col_double(),
##   ward_abbrv = col_logical(),
##   ward_desc = col_logical(),
##   nc_senate_abbrv = col_double(),
##   county_commiss_abbrv = col_logical(),
##   county_commiss_desc = col_logical(),
##   township_abbrv = col_logical(),
##   township_desc = col_logical(),
##   school_dist_abbrv = col_logical(),
##   school_dist_desc = col_logical(),
##   fire_dist_abbrv = col_logical(),
##   fire_dist_desc = col_logical(),
##   water_dist_abbrv = col_logical(),
##   water_dist_desc = col_logical()
##   # ... with 10 more columns
## )
## See spec(...) for full column specifications.
## Warning: 15 parsing failures.
##   row        col           expected                actual                                               file
##  3999 mail_addr3 1/0/T/F/TRUE/FALSE CH-2900 PORRENTRUY    'D:/Classes/Election Science/Class 2/ncvoter1.txt'
##  3999 mail_addr4 1/0/T/F/TRUE/FALSE SWITZERLAND           'D:/Classes/Election Science/Class 2/ncvoter1.txt'
##  5637 mail_addr4 1/0/T/F/TRUE/FALSE 28213 BREMEN, GERMANY 'D:/Classes/Election Science/Class 2/ncvoter1.txt'
## 22495 mail_addr3 1/0/T/F/TRUE/FALSE UKYO-KU, KYOTO, JAPAN 'D:/Classes/Election Science/Class 2/ncvoter1.txt'
## 22495 mail_addr4 1/0/T/F/TRUE/FALSE 616-8184              'D:/Classes/Election Science/Class 2/ncvoter1.txt'
## ..... .......... .................. ..................... ..................................................
## See problems(...) for more details.
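The parsing failures reported above arise because read_tsv guesses each column's type from the early rows of the file. Columns such as mail_addr3 are empty there, so they are guessed to be logical, and later text values like a foreign mailing address then fail to parse. One way to avoid this is to declare the column types yourself. The sketch below shows the idea on a tiny made-up file rather than the real voter file: the three rows and the object names demo.file and demo are illustrative assumptions, although the column names are taken from the output above.

```r
library(readr)

# Made-up rows that mimic the problem: mail_addr3 is empty at first, then holds text
demo.file <- tempfile(fileext = ".txt")
writeLines(c("county_id\tmail_addr3\tbirth_age",
             "1\t\t34",
             "1\tCH-2900 PORRENTRUY\t51"), demo.file)

# Read every column as character by default, then override the ones known to be numeric
demo <- read_tsv(demo.file,
                 col_types = cols(.default = col_character(),
                                  birth_age = col_integer()))

# Any rows that still failed to parse would be listed here
problems(demo)
```

The same col_types argument could be passed when reading voter.file. Reading ID-like fields such as zip_code as character also preserves any leading zeros.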

Let’s take some time to explore this file to understand the information contained in it.

A typical voter file has a voter’s name, address, mailing address for a mail ballot (if requested), birth date (with some variants), sex, party registration (if a closed primary state), race (in a few Southern states), date of registration, and so on.

We will save for a later class how to link each voter’s registration record to their records of participation in prior elections, early voting, and other related files. These files are tied to one another through the voter ID number, which serves as a unique key.