How to read data into R

As always, we need to do any required setup, like reading in the R packages we will use:

library(tidyverse)

## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

For some examples, we need to easily see the data we are attempting to import into R. To do this, we can supply R with data to be read into R “inline”, meaning that we’re essentially passing an R string object to R with the data we wish to read. In practice, you will most often read data from a file.

read_csv("a,b,c
1,2,3
4,5,6")

## # A tibble: 2 x 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

What is this code snippet doing? We are passing R a string with the first row “a, b, c” to read_csv(), short for: read comma separated values. read_csv() is a program that, as it’s name implies, reads into R a comma separated values file. read_csv() reads in data in a particular format with these characteristics:

Each column or variable is separated by a ,
Each row ends with a carriage return, or \n
Each row has the same number of columns of data

Although not clearly visible, there is a character return at the end of each line. To see the carriage return more clearly, you can specify a carriage return with \n. This code does the same thing, which is a little hard to visually read until you are comfortable with \n.

read_csv("a,b,c\n1,2,3\n4,5,6")

## # A tibble: 2 x 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

Now that we understand these basics, there are a couple of options that work with any of the various read_ functions that read in various data formats.

Sometimes there is information about the file embedded into the first few lines of the file. This information is often referred to as “metadata”. You can set a read_csv() option to skip reading in the first rows of a data file.

read_csv("The first line of metadata
  The second line of metadata
  x,y,z
  1,2,3", skip = 2)

## # A tibble: 1 x 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3

Or tell R to skip any commented rows, by telling R what the special character comment is. The comment character can be any character, it does not need to be R’s character. Be a little careful here, since you don’t want to say a , or " are comments for read_csv(), and you don’t want to say a character found in a variable name is a special comment character.

read_csv("* A comment I want to skip
  x,y,z
  1,2,3", comment = "*")

## # A tibble: 1 x 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3

The first (non-commented) row usually contains a header, which are the variable names. The variable names are often in a different format than your data - usually characters. You can tell any of the read_ flavors that your data does not have a header.

read_csv("1,2,3\n4,5,6", col_names = F)

## # A tibble: 2 x 3
##      X1    X2    X3
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

Note how R auto-assigns column names X1, X2, ... as variable names.

If you wish to assign your own variable names, you can do so when you import your data.

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))

## # A tibble: 2 x 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

You can also assign column names later.

test.data <- read_csv("1,2,3\n4,5,6", col_names = F)
colnames(test.data) <- c("x", "y", "z")

Let’s also stop a moment to note what c() does. This function sequentially adds strings to an R vector object, and is useful in a variety of contexts, as we shall see.

test <- c("x", "y", "z")
test

## [1] "x" "y" "z"

Parsing

When R reads data into a data frame, in the background it is performing an operation known as “parsing”, which is taking data and placing it into a particular format. In most situations, parsing seems very basic. A 1 is read in as an integer, a a is read in as a string, and so on.

There are situations where you need to thoughtfully contemplate how your data are being read into an R data frame. For example, a id variable, such as used to identify a county or precinct number, may be for a row of data 001. Reading this as an integer may trim the leading zeros, which may be important to the id. In this case, it would be better to read in the column of data as a string.

There are other situations where you may want more control over how R assigns formats to data. For example, date and time formats are not standard across the world, and number formats using . or , differ in the U.S. and Europe.

The basics of learning parsing start from the perspective of data already in an R object. Later, we will learn how to control data formats when data are read into R from an external file.

parse_character("1, 2, 3")

## [1] "1, 2, 3"

parse_integer(c("1", "2", "3"))

## [1] 1 2 3

parse_logical(c("TRUE", "FALSE", "NA"))

## [1]  TRUE FALSE    NA

parse_date(c("2010-01-01", "1979-10-14"))

## [1] "2010-01-01" "1979-10-14"

Parsing Numbers

R will return error messages if you attempt to parse data into an improper format, and the output will be assigned NA. You can use problem() in the tidyverse to get more info on parsing errors.

x <- parse_integer(c("123", "345", "abc", "123.45"))

## Warning: 2 parsing failures.
## row col               expected actual
##   3  -- an integer                abc
##   4  -- no trailing characters    .45

problems(x)

## # A tibble: 2 x 4
##     row   col expected               actual
##   <int> <int> <chr>                  <chr> 
## 1     3    NA an integer             abc   
## 2     4    NA no trailing characters .45

Different parts of the world present decimal numbers with . and ,. The default is ., which you can change.

parse_double("1.23")

## [1] 1.23

parse_double("1,23", locale = locale(decimal_mark = ","))

## [1] 1.23

Number parsers ignore leading and trailing characters, which is helpful when working with currencies and percentages.

parse_double("1.23")

## [1] 1.23

parse_double("1,23", locale = locale(decimal_mark = ","))

## [1] 1.23

parse_number("$100")

## [1] 100

parse_number("20%")

## [1] 20

parse_number("It cost $123.45")

## [1] 123.45

Parsing Strings

Alphanumeric characters actually are represented in computers by binary bits of numbers strung together. You can see the numeric representation of characters with charToRaw(), but this is rarely something you need to do when working in English.

charToRaw("English")

## [1] 45 6e 67 6c 69 73 68

If you need to write in a foreign language, you can’t often represent that with an English keyboard.

You have to know the numeric representation of a special character and the language you are converting to.

x1 <- "El Ni\xf1o was particularly bad this year"
parse_character(x1, locale = locale(encoding = "Latin1"))

## [1] "El NiÃ±o was particularly bad this year"

x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
parse_character(x2, locale = locale(encoding = "Shift-JIS"))

## [1] "<U+3053><U+3093><U+306B><U+3061><U+306F>"

What if a dataset is encoded in another language? You can use R to guess with guess_encoding(). You should do more research on your own if you encounter this issue.

Factors

We saw factors earlier in the class to tell R that you have categorical data. Factors are a data format type, too.

fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)

## Warning: 1 parsing failure.
## row col           expected   actual
##   3  -- value in level set bananana

## [1] apple  banana <NA>  
## attr(,"problems")
## # A tibble: 1 x 4
##     row   col expected           actual  
##   <int> <int> <chr>              <chr>   
## 1     3    NA value in level set bananana
## Levels: apple banana

Dates, Date-Times, and Times

Date and time formates vary across the world. I tend to like leading with the year, then month, and day for file structures, 1900-01-01, but that is not the way we write dates in the U.S., 1/1/1900 or 01-01-1900. Parsing dates and times is a common task but why do it? Keeping a date in a string format won’t let you easily sort and select dates, particularly when working with the 1/1/1900 format.

This is the default date format expected by R

parse_date("2010-10-01")

## [1] "2010-10-01"

If you have different formatted dates, you need to tell R where the day, month, and year are located with %d, %m, and %y. Consider the following:

parse_date("01/02/15", "%m/%d/%y")

## [1] "2015-01-02"

parse_date("01/02/15", "%d/%m/%y")

## [1] "2015-02-01"

parse_date("01/02/15", "%y/%m/%d")

## [1] "2001-02-15"

Reading in a file

Now that we understand how parsing works in the tidyverse we can better control how R reads in data.

The read_ flavor of functions read in the first 1,000 lines of data and make a guess as to what the data format type is. You can play around with guess_parser() directly to see its behavior.

guess_parser("2010-10-01")

## [1] "date"

guess_parser(c("1", "5", "9"))

## [1] "double"

guess_parser(c("12,352,561"))

## [1] "number"

A typical problem with large files is the first 1,000 rows might nt be representative of all the data. This is particularly true with datasets like voter files where rows are sorted by counties. A voter registration number in the first county alphabetically might be a number, but a county higher in the alphabet might use characters. Or, a special code for a rare event, like needing to present voter id for a first-time voter, might be missing in the first 1,000 records but exist later.

I always like to read in as a string any variable that can be used as a key to identify records uniquely or to relate to another file. Keys can have meaningful leading zeros that will cause different datasets to fail to properly relate to one another if the zeros are dropped. Consider this:

working.dir <- "D:/DropBox/Dropbox/Election Science/Class Data Import"
url <- "https://s3.amazonaws.com/dl.ncsbe.gov/data/ncvoter1.zip"
dest.file <- paste(working.dir, "/ncvoter1.zip", sep="")

download.file(url, dest.file)

unzip(dest.file, exdir = working.dir, overwrite = T)

voter.file <- paste(working.dir, "/ncvoter1.txt", sep="")

voters.NC <- read_tsv(voter.file, col_names = T)

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   county_id = col_double(),
##   absent_ind = col_logical(),
##   name_prefx_cd = col_logical(),
##   zip_code = col_double(),
##   mail_addr3 = col_logical(),
##   mail_addr4 = col_logical(),
##   birth_age = col_double(),
##   ward_abbrv = col_logical(),
##   ward_desc = col_logical(),
##   nc_senate_abbrv = col_double(),
##   county_commiss_abbrv = col_logical(),
##   county_commiss_desc = col_logical(),
##   township_abbrv = col_logical(),
##   township_desc = col_logical(),
##   school_dist_abbrv = col_logical(),
##   school_dist_desc = col_logical(),
##   fire_dist_abbrv = col_logical(),
##   fire_dist_desc = col_logical(),
##   water_dist_abbrv = col_logical(),
##   water_dist_desc = col_logical()
##   # ... with 10 more columns
## )

## See spec(...) for full column specifications.

## Warning: 15 parsing failures.
##   row        col           expected                actual                                                                 file
##  4021 mail_addr3 1/0/T/F/TRUE/FALSE CH-2900 PORRENTRUY    'D:/DropBox/Dropbox/Election Science/Class Data Import/ncvoter1.txt'
##  4021 mail_addr4 1/0/T/F/TRUE/FALSE SWITZERLAND           'D:/DropBox/Dropbox/Election Science/Class Data Import/ncvoter1.txt'
##  5671 mail_addr4 1/0/T/F/TRUE/FALSE 28213 BREMEN, GERMANY 'D:/DropBox/Dropbox/Election Science/Class Data Import/ncvoter1.txt'
## 22643 mail_addr3 1/0/T/F/TRUE/FALSE UKYO-KU, KYOTO, JAPAN 'D:/DropBox/Dropbox/Election Science/Class Data Import/ncvoter1.txt'
## 22643 mail_addr4 1/0/T/F/TRUE/FALSE 616-8184              'D:/DropBox/Dropbox/Election Science/Class Data Import/ncvoter1.txt'
## ..... .......... .................. ..................... ....................................................................
## See problems(...) for more details.

Let’s look at how R guessed the column specifications

spec(voters.NC)

## cols(
##   county_id = col_double(),
##   county_desc = col_character(),
##   voter_reg_num = col_character(),
##   status_cd = col_character(),
##   voter_status_desc = col_character(),
##   reason_cd = col_character(),
##   voter_status_reason_desc = col_character(),
##   absent_ind = col_logical(),
##   name_prefx_cd = col_logical(),
##   last_name = col_character(),
##   first_name = col_character(),
##   middle_name = col_character(),
##   name_suffix_lbl = col_character(),
##   res_street_address = col_character(),
##   res_city_desc = col_character(),
##   state_cd = col_character(),
##   zip_code = col_double(),
##   mail_addr1 = col_character(),
##   mail_addr2 = col_character(),
##   mail_addr3 = col_logical(),
##   mail_addr4 = col_logical(),
##   mail_city = col_character(),
##   mail_state = col_character(),
##   mail_zipcode = col_character(),
##   full_phone_number = col_character(),
##   race_code = col_character(),
##   ethnic_code = col_character(),
##   party_cd = col_character(),
##   gender_code = col_character(),
##   birth_age = col_double(),
##   birth_state = col_character(),
##   drivers_lic = col_character(),
##   registr_dt = col_character(),
##   precinct_abbrv = col_character(),
##   precinct_desc = col_character(),
##   municipality_abbrv = col_character(),
##   municipality_desc = col_character(),
##   ward_abbrv = col_logical(),
##   ward_desc = col_logical(),
##   cong_dist_abbrv = col_character(),
##   super_court_abbrv = col_character(),
##   judic_dist_abbrv = col_character(),
##   nc_senate_abbrv = col_double(),
##   nc_house_abbrv = col_character(),
##   county_commiss_abbrv = col_logical(),
##   county_commiss_desc = col_logical(),
##   township_abbrv = col_logical(),
##   township_desc = col_logical(),
##   school_dist_abbrv = col_logical(),
##   school_dist_desc = col_logical(),
##   fire_dist_abbrv = col_logical(),
##   fire_dist_desc = col_logical(),
##   water_dist_abbrv = col_logical(),
##   water_dist_desc = col_logical(),
##   sewer_dist_abbrv = col_logical(),
##   sewer_dist_desc = col_logical(),
##   sanit_dist_abbrv = col_logical(),
##   sanit_dist_desc = col_logical(),
##   rescue_dist_abbrv = col_logical(),
##   rescue_dist_desc = col_logical(),
##   munic_dist_abbrv = col_character(),
##   munic_dist_desc = col_character(),
##   dist_1_abbrv = col_double(),
##   dist_1_desc = col_character(),
##   dist_2_abbrv = col_logical(),
##   dist_2_desc = col_logical(),
##   confidential_ind = col_character(),
##   birth_year = col_double(),
##   ncid = col_character(),
##   vtd_abbrv = col_character(),
##   vtd_desc = col_character()
## )

We got lucky here in that one of the first 1,000 voter_reg_num had a character. Suppose it didn’t. You can have read_ look at a different number of records to make guesses about the column formats.

voters.NC <- read_tsv(voter.file, col_names = T, guess_max = 10000)

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   county_id = col_double(),
##   absent_ind = col_logical(),
##   name_prefx_cd = col_logical(),
##   zip_code = col_double(),
##   birth_age = col_double(),
##   ward_abbrv = col_logical(),
##   ward_desc = col_logical(),
##   nc_senate_abbrv = col_double(),
##   county_commiss_abbrv = col_logical(),
##   county_commiss_desc = col_logical(),
##   township_abbrv = col_logical(),
##   township_desc = col_logical(),
##   school_dist_abbrv = col_logical(),
##   school_dist_desc = col_logical(),
##   fire_dist_abbrv = col_logical(),
##   fire_dist_desc = col_logical(),
##   water_dist_abbrv = col_logical(),
##   water_dist_desc = col_logical(),
##   sewer_dist_abbrv = col_logical(),
##   sewer_dist_desc = col_logical()
##   # ... with 8 more columns
## )

## See spec(...) for full column specifications.

This still won’t work on a good number of the variables, which happened to have a lot of missing value NAs that the parser guessed where logical when they were really something else, like characters.

If I want to force R to read in everything as a character, I can do this:

readcolumns = c(rep("character", 71))
voters.NC <- read_tsv(voter.file, col_names = T, col_types = cols(.default= col_character()))

If you did that on your own computer, it took a long time! Reading everything in as a character is memory intensive and takes up memory and disk space. So what to do?

A trick I’ve learned is to use my read_ function to make guesses, grab the column specifications with spec() and then control read_ behavior by using the same option to control `col_types’.

voters.NC <- read_tsv(voter.file, col_names = T, col_types = cols(
  county_id = col_double(),
  county_desc = col_character(),
  voter_reg_num = col_character(),
  status_cd = col_character(),
  voter_status_desc = col_character(),
  reason_cd = col_character(),
  voter_status_reason_desc = col_character(),
  absent_ind = col_logical(),
  name_prefx_cd = col_logical(),
  last_name = col_character(),
  first_name = col_character(),
  middle_name = col_character(),
  name_suffix_lbl = col_character(),
  res_street_address = col_character(),
  res_city_desc = col_character(),
  state_cd = col_character(),
  zip_code = col_double(),
  mail_addr1 = col_character(),
  mail_addr2 = col_character(),
  mail_addr3 = col_logical(),
  mail_addr4 = col_logical(),
  mail_city = col_character(),
  mail_state = col_character(),
  mail_zipcode = col_character(),
  full_phone_number = col_character(),
  race_code = col_character(),
  ethnic_code = col_character(),
  party_cd = col_character(),
  gender_code = col_character(),
  birth_age = col_double(),
  birth_state = col_character(),
  drivers_lic = col_character(),
  registr_dt = col_character(),
  precinct_abbrv = col_character(),
  precinct_desc = col_character(),
  municipality_abbrv = col_character(),
  municipality_desc = col_character(),
  ward_abbrv = col_logical(),
  ward_desc = col_logical(),
  cong_dist_abbrv = col_character(),
  super_court_abbrv = col_character(),
  judic_dist_abbrv = col_character(),
  nc_senate_abbrv = col_double(),
  nc_house_abbrv = col_character(),
  county_commiss_abbrv = col_logical(),
  county_commiss_desc = col_logical(),
  township_abbrv = col_logical(),
  township_desc = col_logical(),
  school_dist_abbrv = col_logical(),
  school_dist_desc = col_logical(),
  fire_dist_abbrv = col_logical(),
  fire_dist_desc = col_logical(),
  water_dist_abbrv = col_logical(),
  water_dist_desc = col_logical(),
  sewer_dist_abbrv = col_logical(),
  sewer_dist_desc = col_logical(),
  sanit_dist_abbrv = col_logical(),
  rescue_dist_abbrv = col_logical(),
  rescue_dist_desc = col_logical(),
  munic_dist_abbrv = col_character(),
  munic_dist_desc = col_character(),
  dist_1_abbrv = col_double(),
  dist_1_desc = col_character(),
  dist_2_abbrv = col_logical(),
  dist_2_desc = col_logical(),
  confidential_ind = col_character(),
  birth_year = col_double(),
  ncid = col_character(),
  vtd_abbrv = col_character(),
  vtd_desc = col_character()
  )
)

## Warning: 15 parsing failures.
##   row        col           expected                actual                                                                 file
##  4021 mail_addr3 1/0/T/F/TRUE/FALSE CH-2900 PORRENTRUY    'D:/DropBox/Dropbox/Election Science/Class Data Import/ncvoter1.txt'
##  4021 mail_addr4 1/0/T/F/TRUE/FALSE SWITZERLAND           'D:/DropBox/Dropbox/Election Science/Class Data Import/ncvoter1.txt'
##  5671 mail_addr4 1/0/T/F/TRUE/FALSE 28213 BREMEN, GERMANY 'D:/DropBox/Dropbox/Election Science/Class Data Import/ncvoter1.txt'
## 22643 mail_addr3 1/0/T/F/TRUE/FALSE UKYO-KU, KYOTO, JAPAN 'D:/DropBox/Dropbox/Election Science/Class Data Import/ncvoter1.txt'
## 22643 mail_addr4 1/0/T/F/TRUE/FALSE 616-8184              'D:/DropBox/Dropbox/Election Science/Class Data Import/ncvoter1.txt'
## ..... .......... .................. ..................... ....................................................................
## See problems(...) for more details.

You can now edit the column format specifications to your liking.

Writing data

There is a companion set up functions that write data, just like you can read data.

setwd("D:/DropBox/Dropbox/Election Science/Class Data Import")
write_csv(voters.NC, "test.csv")

We’ve only worked with text-based files. There are a wide range of files R can read and write data to, including Excel, dBase, and other file formats.

Election Data Science - Data Import

Michael McDonald

September 23, 2019