readr Vignette Daniel Craig

Readr Introduction

In this vignette, we will focus on the readr package and its ability to parse files into tables from data files such as csv’s and tsv’s.
We will cover the material in two sections; Functions & Column Specifications

You can read more about the readr package here: https://readr.tidyverse.org/.

Main Functions

The main functions from the readr package are focused on the different categories of flat file formats to contain data.

A file and command will be made for each so that the user can view the file structure outside of R and the command associated with it.

Take a look at each of the linked files to get an idea of the structure and run the associated function.

read_csv(): comma-separated values (CSV)

this file has each column separated by commas
CSV Example

## # A tibble: 1 × 3
##   column1 column2 column3
##   <chr>   <chr>   <chr>  
## 1 value1  value2  value3

read_tsv(): tab-separated values (TSV)

this file has each column separated by tabs
TSV Example

read_tsv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/tsv_format.txt")

## # A tibble: 1 × 3
##   column1 column2 column3
##   <chr>   <chr>   <chr>  
## 1 value1  value2  value3

read_csv2(): semicolon-separated values with a comma as the decimal mark for numbers

this file has each column separated by semicolons and any numbers that have decimal values have a comma to replace them
CSV2 Example

read_csv2("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/csv2_format.txt")

## # A tibble: 1 × 3
##   column1 column2 column3
##   <chr>   <chr>     <dbl>
## 1 value1  value2     142.

read_delim(): generic delimited files

this command is a generic version of the read_csv command that allow the user to select the delimiter, in this example we use ‘|’
Generic Delim Example

read_delim("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/delim_format.txt", delim = "|")

## # A tibble: 1 × 3
##   column1 column2 column3
##   <chr>   <chr>   <chr>  
## 1 value1  value2  value3

read_fwf(): fixed-width files

this file has the start and end of columns at specifically stated dimensions, ie. the first column ends on character 9, and the second column ends on the 20th character
note that this function cannot handle urls

For this reason, we will use the readr example.

There are several options to be used with read_fwf.

fwf_sample <- readr_example("fwf-sample.txt")
writeLines(read_lines(fwf_sample))

## John Smith          WA        418-Y11-4111
## Mary Hartford       CA        319-Z19-4341
## Evan Nolan          IL        219-532-c301

fwf_empty() - Guesses based on the positions of empty columns

read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn")))

## # A tibble: 3 × 4
##   first last     state ssn         
##   <chr> <chr>    <chr> <chr>       
## 1 John  Smith    WA    418-Y11-4111
## 2 Mary  Hartford CA    319-Z19-4341
## 3 Evan  Nolan    IL    219-532-c301

fwf_widths() - Supply the widths of each column

Notice we combined first and last names by denoting the column width for the first column to be 20.

read_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn")))

## # A tibble: 3 × 3
##   name          state ssn         
##   <chr>         <chr> <chr>       
## 1 John Smith    WA    418-Y11-4111
## 2 Mary Hartford CA    319-Z19-4341
## 3 Evan Nolan    IL    219-532-c301

fwf_positions() - Supply paired vectors of start and end positions

Notice the vectors passed to fwf_samples are read as fwf_positions(c(start_position_vec1, start_position_vec2), c(end_position_vec1, end_position_vec2)). Do not read them as c(start_position_vec1, end_position_vec2).

read_fwf(fwf_sample, fwf_positions(c(1, 30), c(20, 42), c("name", "ssn")))

## # A tibble: 3 × 2
##   name          ssn         
##   <chr>         <chr>       
## 1 John Smith    418-Y11-4111
## 2 Mary Hartford 319-Z19-4341
## 3 Evan Nolan    219-532-c301

fwf_cols() - Supply column names and their respective lengths

Note that subsequent named columns will determine the start of its column by noting the position the previous column ended. I.E. The name column is the first column and lasts 20 characters. The state column starts at character 21 and continues for 10 more characters. The ssn column starts at character 31 and ends 12 characters later. The user only needs to tell fwf_cols how long each column is.

read_fwf(fwf_sample, fwf_cols(name = 20, state = 10, ssn = 12))

## # A tibble: 3 × 3
##   name          state ssn         
##   <chr>         <chr> <chr>       
## 1 John Smith    WA    418-Y11-4111
## 2 Mary Hartford CA    319-Z19-4341
## 3 Evan Nolan    IL    219-532-c301

read_table(): whitespace separated files

this file designates columns at either ends of whitespace and removes the whitespace afterwards Whitespace Example

read_table("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/whitespace_format.txt")

## # A tibble: 1 × 3
##   column1 column2 column3
##   <chr>   <chr>   <chr>  
## 1 value1  value2  value3

read_log(): web log files

uses a combination of quotation marks and brackets in file structure Log Example

read_log("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/example.log")

## # A tibble: 2 × 7
##   X1           X2    X3                   X4                   X5       X6    X7
##   <chr>        <lgl> <chr>                <chr>                <chr> <dbl> <dbl>
## 1 172.21.13.45 NA    "Microsoft\\JohnDoe" 08/Apr/2001:17:39:0… GET …   200  3401
## 2 127.0.0.1    NA    "frank"              10/Oct/2000:13:55:3… GET …   200  2326

Column Specifications

Readr will naturally make guesses about the best column type to use

by parsing out 1000 values in a particular column. By default, it usually works acceptably well. Sometimes it needs a more delicate touch and the readr package will allow the user to specify these column types during import with one of the read_ functions.

In each of these read functions, readr will allow the user to specify column types by passing their read function a list. There is also an option that allows skipping a column. This is quite useful in the early

stages of data tidying to avoid any column type changes or column drops in the middle of computations/transformations that are type specific that normally pop up in tidying. This allows the user to keep a very clean workflow.

An exhaustive list of types is below with abbreviations in brackets:

col_logical() [l], containing only T, F, TRUE or FALSE
col_integer() [i], integers
col_double() [d], doubles
col_character() [c], everything else
col_factor(levels, ordered) [f], a fixed set of values
col_date(format = ““) [D]: with the locale’s date_format
col_time(format =““) [t]: with the locale’s time_format
col_datetime(format =““) [T]: ISO8601 date times
col_number() [n], numbers containing the grouping_mark
col_skip() [ _ , -], don’t import this column
col_guess() [?], parse using the “best” type based on the input

Let’s use a few of these to gain some experience. In this first

example, we will be as verbose as possible in our code.

In the below example, we follow the standard syntax of read_csv.

read_csv(file,

col_types = list(

column_name = column_type(),

….

)

## # A tibble: 1 × 3
##   column1 column2 column3
##   <chr>   <chr>   <chr>  
## 1 value1  value2  value3

In this example, let’s try to cut down on typing. We can avoid denoting each column and the full type by using the col type’s abbreviation for each column from the list earlier.

read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/csv_format.txt", col_types = list(
  'c',
  'c',
  'c'
))

## # A tibble: 1 × 3
##   column1 column2 column3
##   <chr>   <chr>   <chr>  
## 1 value1  value2  value3

Let’s say we want to skip a column. There are two methods to achieve this. One by using the abbreviation and the other by using the cols_only() option. In this example, lets skip the 2nd column via abbreviation.

read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/csv_format.txt", col_types = list(
  'c',
  '_', # You can also use '-' instead of '_'
  'c'
))

## # A tibble: 1 × 2
##   column1 column3
##   <chr>   <chr>  
## 1 value1  value3

Now with the cols_only option instead of abbreviation. This option is useful when you have many columns and are already aware of the few you’re interested in.

read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/csv_format.txt", col_types = cols_only(
  column1 = 'c',
  column2 = 'c'
))

## # A tibble: 1 × 2
##   column1 column2
##   <chr>   <chr>  
## 1 value1  value2

Specifications for Factor Data

   If you’re unfamiliar with factors in the R world, these can be simplified into categories of a categorical variable. For example, if you have a list of cars, one of the categories may be the manufacturer. Each of these manufacturers are considered a factor. When a column has its type labelled as “factor,” it will only ever accept a value in its rows as one of those “factors” (categories).
   For this example, we will use a slightly modified version of the mtcars data set. The only change made was setting the first column to be the make of the car by printing the mtcars data to csv, opening the file, and typing the column name in.
   Although there a other methods to changing a column name in R, this is an example of how understanding flat file composition can come in handy. We will be using the make as different factors of the categorical variable of manufacturer. This would be a good exercise to follow along with on your end.

Data Prep for Factors (optional)

To create an example using multiple column types. Let’s use the mtcars dataset as if we were importing it for the first time. To make this example fitting, I would like to include a factor variable and a date variable.

Let’s first ‘add’ the factor variable by naming the first column of the mtcars data set. I first print the mtcars dataset with write.csv.

library(stringr)
write.csv(mtcars,"C:\\repos\\DATA607\\TIDYVERSE Create\\mtcarsReadr.csv")

I open this csv file and add a make column by following the file structure and save it as a new file.

From here, I will read the file back in with read_csv and edit the first column so that it only contains the manufacturers.

mtcarsEdit <- read_csv("C:\\repos\\DATA607\\TIDYVERSE Create\\mtcarsMake.csv")

#Removing excess text
mtcarsEdit$make <- str_remove(mtcarsEdit$make, " .*")

I then create a vector of dates and bind them to the mtcarsEdit data frame. I re-write this using another function from readr, write_csv.

#Creating a date column for use in the next example
relDate <- c(sample(seq(as.Date('1970/01/01'), as.Date('1980/01/01'), by="day"), 32))

#Combining the original mtcars dataframe and the new date column
mtcarsEdit <- cbind(mtcarsEdit,relDate)
mtcarsEdit

##        make  mpg cyl  disp  hp drat    wt  qsec vs am gear carb    relDate
## 1     Mazda 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 1974-08-21
## 2     Mazda 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 1975-02-20
## 3    Datsun 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 1972-07-06
## 4    Hornet 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 1979-08-19
## 5    Hornet 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2 1975-01-27
## 6   Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1 1973-02-06
## 7    Duster 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4 1971-07-14
## 8      Merc 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2 1973-06-05
## 9      Merc 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2 1978-03-19
## 10     Merc 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4 1975-11-20
## 11     Merc 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4 1975-12-05
## 12     Merc 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3 1978-01-30
## 13     Merc 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3 1978-06-15
## 14     Merc 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3 1972-05-02
## 15 Cadillac 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4 1970-03-26
## 16  Lincoln 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4 1973-09-17
## 17 Chrysler 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4 1973-05-01
## 18     Fiat 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1 1970-10-14
## 19    Honda 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2 1978-08-15
## 20   Toyota 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 1975-05-12
## 21   Toyota 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1 1977-01-11
## 22    Dodge 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2 1976-09-18
## 23      AMC 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2 1977-02-28
## 24   Camaro 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4 1970-12-14
## 25  Pontiac 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2 1973-11-29
## 26     Fiat 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1 1972-07-31
## 27  Porsche 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2 1970-10-09
## 28    Lotus 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2 1979-12-13
## 29     Ford 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4 1975-08-17
## 30  Ferrari 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6 1971-03-31
## 31 Maserati 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8 1977-02-06
## 32    Volvo 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2 1973-02-09

#I re-printed and pushed to github to make available for those that do not wish to copy the previous steps
write_csv(mtcarsEdit,"C:\\repos\\DATA607\\TIDYVERSE Create\\mtcarsEdit.csv")

From here, we will use the github file we just wrote out above, its path is set to a variable below as ‘url2,’ to use as an example for importing columns as factors with readr.

Importing & Complete Example w/ mtcarsEdit

Let’s say we only want to import the make, mpg, and release date from the modified mtcars dataset just above. We know that the the make should be a factor, as its a categorical variable, the mpg should be a double, and the release date as a date. Recall that integers are whole numbers, doubles can contain decimal points.
Since we know that we’re only interested in certain columns, we will use the col_only() sub option in combination with our col_types to get a clean import.

url2 <- "https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/mtcarsEdit.csv"
read_csv(url2, col_types = cols_only(
  make = 'f',
  mpg = 'd',
  relDate = 'D'))

## # A tibble: 32 × 3
##    make      mpg relDate   
##    <fct>   <dbl> <date>    
##  1 Mazda    21   1977-12-15
##  2 Mazda    21   1977-06-17
##  3 Datsun   22.8 1977-07-18
##  4 Hornet   21.4 1970-03-01
##  5 Hornet   18.7 1972-06-10
##  6 Valiant  18.1 1979-10-28
##  7 Duster   14.3 1975-05-17
##  8 Merc     24.4 1971-07-10
##  9 Merc     22.8 1973-02-26
## 10 Merc     19.2 1976-04-22
## # ℹ 22 more rows

How does readr parse data?

Let’s use a dataset from Kaggle for an example closer to a real-world scenario. You can download the file titled ‘A.csv’ from the following link. This file is a csv from the Yahoo Finance API that has quite a few columns. Let’s take a look at the results with just a normal read_csv. I’ll be accessing this from a file location on github. You can also access it there.

fin <- read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/yahooKaggleAPI.csv")

paged_table(fin)

Here we can see that the date column has a combined date and time field and a few other numeric columns. What does the parsing process look like for the read_ functions?

Parsing Process

Parsing takes place in three steps.

The text is parsed into a matrix
The type of each column is determined
Each column of strings is parsed into a more specific vector type

The easiest way to understand this is in reverse since this is how the package works. It builds from the smallest unit to the largest.

Parsing Vectors

The core parsing functions are as follows. They each take a character vector and return a corresponding vector with attributes and any issues.

parse_integer()
parse_double()
parse_logical()
parse_character()
parse_number()

It’s important to note that parse_integer() and parse_double() are strict and do not take leading or trailing characters. Parse_number(), on the other hand, can handle multiple symbols. Let’s use a few of these on our Yahoo Finance csv from Kaggle.

Notice we are using the first 3 values in each of the respective columns; Open, Volume

parse_double(c("27.761129","26.197654","25.206190"))

## [1] 27.76113 26.19765 25.20619

parse_integer(c("62546380","15234146","6577870"))

## [1] 62546380 15234146  6577870

parse_number(c("1999-11-18 05:00:00","1999-11-19 05:00:00","1999-11-22 05:00:00"))

## [1] 1999 1999 1999

Notice that the parse_number still failed to interpret the date, but was generous enough to atleast give us the few digits and removed what it believed to be excess characters. These parsers are the first step in one of the read_ functions process to creating tibbles in the tidyverse.

To accomodate dates, times, and date-times, readr has the following parsers:

parse_date() : uses the format determined by the locale
parse_time() : uses the format determined by the locale
parse_datetime() : recognizes ISO8601 format (an international standard)
- this command also allows for specific patterns defined by the user

parse_datetime(c("1999-11-18 05:00:00","1999-11-19 05:00:00","1999-11-22 05:00:00"))

## [1] "1999-11-18 05:00:00 UTC" "1999-11-19 05:00:00 UTC"
## [3] "1999-11-22 05:00:00 UTC"

It looks like datetime is the best format, and luckily for us is already in the ISO8601 standard. If you attempt the other date/time parsers, they will fail and return NA’s.

Parsing factors is the last group of parsers. When parsing factors, readr uses:

parse_factor(x, levels): where x is a character vector
- if you include a factor not defined in the levels character vector, the function will fail

parse_factor(c("a", "b", "a"), levels = c("a", "b", "c"))

## [1] a b a
## Levels: a b c

Now that readr can parse for specific data types, the next step is to guess which data types fit each column

Guessing Column Types

When using one of the read_ functions, they are attempting to guess column types with the use of:

guess_parser()

guess_parser(c("27.761129"))

## [1] "double"

guess_parser(c("62546380"))

## [1] "double"

guess_parser(c("1999-11-18 05:00:00"))

## [1] "datetime"

Take note that the guess parser chose double for our second line in our code chunk and successfully picked datetime for the third. You could argue to expect integer instead of double for the second line, but readr tries to be strict where it can be.
An important note is that a factor will never be guessed since it requires input on what the correct levels should be. A factor will always need to be manually set. The guess_parser will also never assume a column should be skipped and must be told to skip a column with the methods discussed earlier in the read_ functions.

Guessing Entire Files

If one wanted to test what column types would be chosen by one of the read_ functions before committing, the following can be used:

spec_csv()
spec_tsv()
spec_()

spec_csv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/yahooKaggleAPI.csv")

## cols(
##   Date = col_datetime(format = ""),
##   Open = col_double(),
##   High = col_double(),
##   Low = col_double(),
##   Close = col_double(),
##   Volume = col_double(),
##   Dividends = col_double(),
##   `Stock Splits` = col_double()
## )

For the most part, it looks like we can be happy with these predictions. As a default, readr uses the first 1000 entries in column to guess column types. So if we happened to know that data later in the column would change this, readr has given the ability to extend this guessing parameter with the guess_max option. Let’s highlight this using the readr example since the Yahoo Finance csv seems pretty normalized.

x <- spec_csv(readr_example("challenge.csv"), guess_max = 1000)
y <- spec_csv(readr_example("challenge.csv"), guess_max = 1001)
x

## cols(
##   x = col_double(),
##   y = col_logical()
## )

## cols(
##   x = col_double(),
##   y = col_date(format = "")
## )

fin2 <- read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/yahooKaggleAPI.csv", guess_max = 2000)
spec(fin2)

## cols(
##   Date = col_datetime(format = ""),
##   Open = col_double(),
##   High = col_double(),
##   Low = col_double(),
##   Close = col_double(),
##   Volume = col_double(),
##   Dividends = col_double(),
##   `Stock Splits` = col_double()
## )

Here we can see that when we changed the guess max from 1000 to 1001 the column type changed for the readr example. When using this option to 2000 on the Yahoo Finance csv, nothing changed.