The Idea

The readr package is part of the tidyverse collection of packages, so it is actively maintained and widely used. The goal of readr is to provide a fast and friendly way to read rectangular data, essentially replacing base-R functions such as read.table() and read.csv(). It parses a flat file into a tibble quickly and with little effort. In essence, the parsing takes place in three steps: the flat file is parsed into a rectangular matrix of strings, the type of each column is determined, and each column of strings is parsed into a vector of a more specific type.

You can find a cheat sheet for the readr package here or on the tidyverse website.

readr

According to the readr documentation, “[to] accurately read a rectangular dataset with readr you combine two pieces: a function that parses the overall file, and a column specification. The column specification describes how each column should be converted from a character vector to the most appropriate data type, and in most cases it’s not necessary because readr will guess it for you automatically.”

There are several file formats supported by readr. The reading functions share the same basic syntax, so mastering one of them allows you to use the others easily.

Command        File Type
read_csv()     comma separated (CSV) files
read_csv2()    semicolon separated files
read_tsv()     tab separated files
read_delim()   general delimited files
read_fwf()     fixed width files
read_table()   tabular files where columns are separated by white-space
read_log()     web log files
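To illustrate how similar these functions are, here is a minimal sketch that reads the same two rows with three different delimiters. The inline data and the variable names (comma_dat, tab_dat, pipe_dat) are made up for this illustration.

```r
library(readr)

# Made-up inline data: the same two rows written with three delimiters.
comma_dat <- read_csv("name,score\nAda,90\nGrace,95")
tab_dat   <- read_tsv("name\tscore\nAda\t90\nGrace\t95")
pipe_dat  <- read_delim("name|score\nAda|90\nGrace|95", delim = "|")

# Compare the parsed columns across the three results.
identical(names(comma_dat), names(pipe_dat))
identical(comma_dat$score, tab_dat$score)
identical(comma_dat$name, pipe_dat$name)
```

Only the function name (or the delim argument) changes between the calls; the rest of the syntax carries over.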

We begin, as always, by loading the tidyverse library, which includes readr (or simply by loading readr itself). A typical read_csv() call looks very similar to a read.csv() call, and for simple files the two can often be swapped. Keep in mind, though, that read_csv() returns a tibble and uses different defaults, as described under Benefits of readr.

library(readr)
dat <- read_csv(readr_example("mtcars.csv"))
head(dat)
## # A tibble: 6 x 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
## 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
## 6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
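As a rough comparison with base R, the following sketch reads the same example file with both functions and inspects what comes back. The variable names (path, base_dat, readr_dat) are chosen only for this illustration.

```r
library(readr)

path <- readr_example("mtcars.csv")

base_dat  <- read.csv(path)   # base R
readr_dat <- read_csv(path)   # readr

class(base_dat)               # a plain data.frame
class(readr_dat)              # a tibble, which is also a data.frame

# Both objects should have the same number of rows and columns.
dim(base_dat)
dim(readr_dat)
```

The data is the same either way; what differs is the class of the result and how it prints.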

One handy bonus of read_csv() is that it prints the column specification it guessed. This lets you check that each column was parsed as you expected. If any were not, you can set the types yourself with the col_types argument, as in the following call.

mtcars <- read_csv(readr_example("mtcars.csv"), 
                   col_types = cols(
                           mpg = col_double(),
                           cyl = col_integer(),
                           disp = col_double(),
                           hp = col_integer(),
                           drat = col_double(),
                           vs = col_integer(),
                           wt = col_double(),
                           qsec = col_double(),
                           am = col_integer(),
                           gear = col_integer(),
                           carb = col_integer()
                           )
                   )
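When most columns share one type, spelling out every column can be tedious. A shorter sketch, assuming you want doubles everywhere except a few integer columns, uses the .default entry of cols() and then checks the result with spec(). The name mtcars_short is made up for this example.

```r
library(readr)

# Treat every column as double unless it is listed explicitly.
mtcars_short <- read_csv(
  readr_example("mtcars.csv"),
  col_types = cols(
    .default = col_double(),
    cyl  = col_integer(),
    gear = col_integer(),
    carb = col_integer()
  )
)

# Print the column specification that was actually used.
spec(mtcars_short)
```

This keeps the call short while still making the exceptions explicit.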

Additionally, you can pass read_csv() an inline CSV string instead of a file path. This is quite useful if you want to construct toy data and examples.

read_csv(
"Col1,Col2,Col3
1,2,3
4,5,6
7,8,9"
)
## # A tibble: 3 x 3
##    Col1  Col2  Col3
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6
## 3     7     8     9

As you can see, the first line of the .csv file (or inline data) is interpreted as the column names. This is a problem if your data has no column names, or if extra lines such as titles sit above the header. The skip argument lets you skip the first few lines of your data. Comment lines can also be marked with a character such as #, and setting comment = "#" tells read_csv() to drop everything after that character on a line. Conversely, if you only want to read the first n rows of data, you can use the n_max = n argument. In the first example below, we skip the first 3 lines; in the second, we drop the same lines as comments.

read_csv(
"This is the title line
This is the subtitle line
This is the author line
Col1,Col2,Col3
1,2,3
4,5,6
7,8,9", skip=3
)
## # A tibble: 3 x 3
##    Col1  Col2  Col3
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6
## 3     7     8     9
read_csv(
"# This is the title line
# This is the subtitle line
# This is the author line
Col1,Col2,Col3
1,2,3
4,5,6
7,8,9", comment = "#"
)
## # A tibble: 3 x 3
##    Col1  Col2  Col3
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6
## 3     7     8     9
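The n_max argument mentioned above can be sketched in the same way. This example reuses the toy data from the earlier examples; the name first_two is made up for illustration.

```r
library(readr)

# Read the header plus only the first two rows of data.
first_two <- read_csv(
"Col1,Col2,Col3
1,2,3
4,5,6
7,8,9", n_max = 2
)

first_two
nrow(first_two)
```

Note that n_max counts rows of data, so the header line is not included in the limit.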

If the data has no column names, you can use col_names = FALSE to tell read_csv() that there are no column headings and that all of the rows are data; the columns are then named X1, X2, and so on. Alternatively, you can name the columns yourself by passing a character vector of names to col_names.

read_csv(
"1,2,3
4,5,6
7,8,9", col_names = FALSE
)
## # A tibble: 3 x 3
##      X1    X2    X3
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6
## 3     7     8     9
read_csv(
"1,2,3
4,5,6
7,8,9", col_names = c("A","B","C")
)
## # A tibble: 3 x 3
##       A     B     C
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6
## 3     7     8     9
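The two ideas can also be combined. As a small sketch, assuming a file whose existing header you want to replace, skip = 1 drops the old header line and col_names supplies the new names. The data and the name renamed are made up for this example.

```r
library(readr)

# Skip the original header line and supply replacement column names.
renamed <- read_csv(
"old1,old2,old3
1,2,3
4,5,6", skip = 1, col_names = c("A", "B", "C")
)

renamed
```

Because col_names is given explicitly, every line that remains after the skip is treated as data.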

If missing values in your data are represented by a placeholder such as ., you can tell read_csv() which strings to treat as NA using the na argument. This is also convenient if you are working with a data set that uses a numerical code for missing values (such as 999).

read_csv(
"Col1,Col2,Col3
1,2,.
4,5,6
.,8,9", na = "."
)
## # A tibble: 3 x 3
##    Col1  Col2  Col3
##   <dbl> <dbl> <dbl>
## 1     1     2    NA
## 2     4     5     6
## 3    NA     8     9
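The na argument accepts a character vector, so several placeholders can be handled at once. The sketch below uses made-up data in which both . and 999 stand for missing values. Supplying na replaces the default of c("", "NA"), so those two strings are listed again here to keep the default behavior.

```r
library(readr)

# Treat ".", "999", empty strings and "NA" as missing values.
coded <- read_csv(
"Col1,Col2
1,999
.,5
7,8", na = c("", "NA", ".", "999")
)

coded
is.na(coded$Col1)
is.na(coded$Col2)
```

The codes are given as strings because readr matches them against the raw text before it converts the column to a number.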

Benefits of readr

An obvious question is: if read.csv() works fine, why do we need read_csv()? Here are some reasons:

  • Typically, read_csv() is much faster (about 10 times faster) than the base-R equivalent. Long-running reads also show a progress bar, so you can see that the job has not stalled.
  • The read_csv() command produces a tibble. It does not convert character vectors into factors, use row names, or alter column names.
  • Base-R functions inherit some behavior from your operating system and environment variables. This dependency makes reproducibility difficult, because code that works on one computer may not work on another.
  • By default, readr functions guess the column types from the first 1000 rows. If the data later contradicts that guess, for example if row 1001 contains a letter after 1000 rows of numbers, the function returns warnings that prompt you to re-check your original data.
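The warnings mentioned in the last point can be inspected with problems(). The sketch below does not build a 1001-row file; instead it forces the same kind of mismatch on made-up data by declaring column x as numeric even though one value is text. The name dat is chosen only for this illustration.

```r
library(readr)

# Made-up data: x is declared as double, but its third value is text.
dat <- read_csv(
  "x,y\n1,a\n2,b\nthree,c",
  col_types = cols(x = col_double(), y = col_character())
)

dat$x          # the value that could not be parsed becomes NA
problems(dat)  # lists each parsing failure that readr recorded
```

If a guess based on the first 1000 rows turns out to be wrong, two common remedies are to set col_types explicitly, as here, or to raise the guess_max argument so that more rows are inspected.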

If you would like to know more about readr, please see the readr documentation or the text R Programming for Data Science.

Citations

“R for Data Science.” Accessed May 3, 2021. Available Here.

Camm, Jeffrey D. Business Analytics. Third edition, Cengage, 2019.

“Introduction to Readr.” Accessed April 30, 2021. Available Here.

Peng, Roger D. R Programming for Data Science. Accessed May 3, 2021. Available Here.

“Read Rectangular Text Data.” Accessed April 30, 2021. Available Here.

Wickham, Hadley and RStudio. Tidyr: Tidy Messy Data (version 1.1.3), 2021. Available here.