The readr package is part of the tidyverse collection of packages, and it is actively maintained and widely used. The goal of readr is to provide a fast and friendly way to read rectangular data, essentially replacing base functions such as read.table() and read.csv(). It parses a flat file into a tibble easily and quickly. In essence, the parsing takes place in three steps: the flat file is parsed into a rectangular matrix of strings, the type of each column is determined, and each column of strings is parsed into a vector of a more specific type.
You can find a cheat sheet for the readr package here or on the tidyverse website.
According to the readr documentation, “[to] accurately read a rectangular dataset with readr you combine two pieces: a function that parses the overall file, and a column specification. The column specification describes how each column should be converted from a character vector to the most appropriate data type, and in most cases it’s not necessary because readr will guess it for you automatically.”
readr supports several file formats. The functions share the same core arguments, so mastering one makes the others easy to use.
| Command | File Type |
|---|---|
| read_csv() | comma separated (CSV) files |
| read_csv2() | semicolon separated files |
| read_tsv() | tab separated files |
| read_delim() | general delimited files |
| read_fwf() | fixed width files |
| read_table() | tabular files where columns are separated by white-space. |
| read_log() | web log files |
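Because these functions share their core arguments, switching formats is mostly a matter of changing the function name. The following is a minimal sketch with made-up inline data. It relies on readr treating a string that contains a newline as literal data rather than as a file path.

```r
library(readr)

# The same two-row table written with three different delimiters.
# A string containing a newline is read as inline data, not as a path.
from_csv   <- read_csv("name,score\nana,90\nben,85")
from_tsv   <- read_tsv("name\tscore\nana\t90\nben\t85")
from_delim <- read_delim("name|score\nana|90\nben|85", delim = "|")

# The parsed columns should match across all three calls.
identical(from_csv$score, from_tsv$score)
identical(from_csv$name, from_delim$name)
```

Only read_delim() needs the extra delim argument, because the other two functions fix the delimiter for you.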
We begin, as always, by loading the tidyverse, which includes readr (or simply by loading readr itself). A typical read_csv() call looks very similar to a read.csv() call. For simple files the two can be used almost interchangeably, although read_csv() returns a tibble rather than a plain data frame.
library(readr)
dat <- read_csv(readr_example("mtcars.csv"))
head(dat)
## # A tibble: 6 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
One handy bonus of read_csv() is that it prints the column specification it used, so you can check that each column was read as the type you expected. If a column was not, you can set the types explicitly with the col_types argument, as in the following call.
mtcars <- read_csv(readr_example("mtcars.csv"),
  col_types = cols(
    mpg = col_double(),
    cyl = col_integer(),
    disp = col_double(),
    hp = col_integer(),
    drat = col_double(),
    wt = col_double(),
    qsec = col_double(),
    vs = col_integer(),
    am = col_integer(),
    gear = col_integer(),
    carb = col_integer()
  )
)
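A full specification is not always needed. As a sketch, assuming the goal is to override only a couple of columns, cols() accepts a .default collector for everything else, and spec() returns the specification that was applied. The object name mtcars_int is chosen here purely for illustration.

```r
library(readr)

# Override only the columns whose type should change;
# every other column falls back to the .default collector.
mtcars_int <- read_csv(
  readr_example("mtcars.csv"),
  col_types = cols(
    .default = col_double(),
    cyl = col_integer(),
    gear = col_integer()
  )
)

# Inspect the column specification that readr applied.
spec(mtcars_int)
```

Printing the result of spec() gives a cols() call that can be copied back into a script and edited.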
Additionally, you can pass read_csv() inline CSV text instead of a file path. This is quite useful for constructing toy data and examples.
read_csv(
"Col1,Col2,Col3
1,2,3
4,5,6
7,8,9"
)
## # A tibble: 3 x 3
## Col1 Col2 Col3
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
## 3 7 8 9
As you can see, the first line of the .csv file (or inline data) is interpreted as the column names. There are ways to change this behavior, for example when your data has no column names or is preceded by title or heading lines. The skip argument skips the first few lines of the input. Comment lines can instead be marked with a character such as #, and comment = "#" tells read_csv() to ignore everything from that character to the end of the line. (Conversely, if you only want to read the first n rows of data, use the n_max = n argument.) In the first example below, we skip the first 3 lines; in the second, the same lines are marked as comments.
read_csv(
"This is the title line
This is the subtitle line
This is the author line
Col1,Col2,Col3
1,2,3
4,5,6
7,8,9", skip=3
)
## # A tibble: 3 x 3
## Col1 Col2 Col3
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
## 3 7 8 9
read_csv(
"# This is the title line
# This is the subtitle line
# This is the author line
Col1,Col2,Col3
1,2,3
4,5,6
7,8,9", comment = "#"
)
## # A tibble: 3 x 3
## Col1 Col2 Col3
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
## 3 7 8 9
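The n_max argument mentioned above works the same way. A minimal sketch on the same toy data, with first_rows as an illustrative name:

```r
library(readr)

# Read the header line plus only the first two rows of data.
first_rows <- read_csv(
  "Col1,Col2,Col3\n1,2,3\n4,5,6\n7,8,9",
  n_max = 2
)

nrow(first_rows)
```

The header line does not count toward n_max, so the result keeps the column names and the first two data rows.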
If the data has no column names, use col_names = FALSE to tell read_csv() that there is no header row and that every row is data; the columns are then named X1, X2, and so on. You can also supply your own names by passing a character vector to col_names.
read_csv(
"1,2,3
4,5,6
7,8,9", col_names = FALSE
)
## # A tibble: 3 x 3
## X1 X2 X3
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
## 3 7 8 9
read_csv(
"1,2,3
4,5,6
7,8,9", col_names = c("A","B","C")
)
## # A tibble: 3 x 3
## A B C
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
## 3 7 8 9
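The two ideas can be combined. As a sketch, assuming the file already has a header line that you would rather replace, skip = 1 drops that line and col_names supplies the new names. The header values old_a, old_b, and old_c are made up for the example.

```r
library(readr)

# The first line is a header we do not want to keep:
# skip it, then supply our own column names.
renamed <- read_csv(
  "old_a,old_b,old_c\n1,2,3\n4,5,6",
  col_names = c("A", "B", "C"),
  skip = 1
)

names(renamed)
```

Without skip = 1, the old header line would be read as a row of data.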
If missing values in your data are represented by a placeholder such as ., you can tell read_csv() which strings to treat as NA using the na argument. This is also convenient when a data set uses a numeric code for missing values (such as 999).
read_csv(
"Col1,Col2,Col3
1,2,.
4,5,6
.,8,9", na = "."
)
## # A tibble: 3 x 3
## Col1 Col2 Col3
## <dbl> <dbl> <dbl>
## 1 1 2 NA
## 2 4 5 6
## 3 NA 8 9
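The na argument accepts more than one placeholder. The sketch below uses made-up data in which both . and the numeric code 999 mean "missing". Because na takes strings, the code is written as "999".

```r
library(readr)

# Treat both "." and the sentinel value 999 as missing.
scores <- read_csv(
  "id,score\n1,999\n2,85\n3,.",
  na = c(".", "999")
)

is.na(scores$score)
```

Note that supplying na replaces the default of c("", "NA"), so add those strings back to the vector if your data also uses them.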
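An obvious question to ask is, if read.csv() works fine, why do we need read_csv()? Commonly cited reasons are that read_csv() is typically much faster on large files, returns a tibble, leaves column names as they are written rather than altering them, and behaves the same way regardless of operating system or locale settings.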
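Some of the differences are easy to check directly. The following sketch reads the same made-up inline data with both functions and compares the column names and the class of the result.

```r
library(readr)

csv_text <- "first name,score\nana,90\nben,85"

base_df  <- read.csv(text = csv_text)  # base R
readr_tb <- read_csv(csv_text)         # readr

names(base_df)   # base R rewrites names into syntactic form
names(readr_tb)  # readr keeps the header as written
class(readr_tb)  # includes "tbl_df", i.e. a tibble
```

With its default settings, read.csv() turns the space in the header into a dot, while read_csv() preserves it.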
If you would like to know more about readr, please see the readr documentation or the text R Programming for Data Science.
Camm, Jeffrey D. Business Analytics. Third edition, Cengage, 2019.
“Introduction to Readr.” Accessed April 30, 2021. Available Here.
Peng, Roger D. R Programming for Data Science. Accessed May 3, 2021. Available Here.
“R for Data Science.” Accessed May 3, 2021. Available Here.
“Read Rectangular Text Data.” Accessed April 30, 2021. Available Here.
Wickham, Hadley and RStudio. Tidyr: Tidy Messy Data (version 1.1.3), 2021. Available Here.