read_delim
Just as read.table() was the “mother function” of the utils package, read_delim() is the main function in readr by Hadley Wickham.
The read_delim() function takes two mandatory arguments:
file: the file that contains the data
delim: the character that separates the values in the data file
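As a minimal sketch of these two arguments, the snippet below writes a small, made-up file to a temporary location (the file contents and the "/" delimiter are assumptions chosen purely for illustration) and reads it back with read_delim():

```r
library(readr)

# Hypothetical sample data, written to a temp file so the example is self-contained
path <- tempfile(fileext = ".txt")
writeLines(c("a/b/c", "1/2/3", "4/5/6"), path)

# file: where the data lives; delim: the character between values
df <- read_delim(file = path, delim = "/")
print(df)
```

The first line of the file is used for the column names, which is the default behaviour.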
read_csv
Time to switch to .csv files! Just as read.csv() in utils is a wrapper around read.table(), read_csv() in the readr package is a special case of read_delim(). If you use read_csv(), however, you are not free to choose the delim argument; it is fixed to a comma.
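To illustrate the relationship, the sketch below reads the same made-up comma-separated file (an assumption for this example) once with read_csv() and once with read_delim() and a comma delimiter, then compares the imported columns:

```r
library(readr)

# Hypothetical comma-separated data in a temp file
path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "2,b"), path)

via_csv   <- read_csv(path)                 # delimiter is fixed to ","
via_delim <- read_delim(path, delim = ",")  # delimiter spelled out by hand

# Both calls should yield the same column values
same_x <- identical(via_csv$x, via_delim$x)
same_y <- identical(via_csv$y, via_delim$y)
print(c(same_x, same_y))
```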
col_types, skip and n_max
Apart from controlling how columns are named, you can also specify the type of each column in your imported data frame. You can do this with col_types. If it is set to NULL, which is the default, functions from the readr package will try to guess the correct types themselves. To set the types manually, you can pass a string with one character for each column in the file. Each character corresponds to a type:
- c to a character
- d to a double
- i to an integer
- l to a logical
- _ skips the column
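A short sketch of the compact string notation, using a made-up four-column file (the column names and values are assumptions for illustration). The string "idl_" asks for an integer, a double and a logical column, and skips the fourth column:

```r
library(readr)

# Hypothetical data: four columns, the last of which we do not want
path <- tempfile(fileext = ".csv")
writeLines(c("id,score,passed,note",
             "1,2.5,TRUE,ok",
             "2,3.0,FALSE,late"), path)

# One character per column: i = integer, d = double, l = logical, _ = skip
typed <- read_csv(path, col_types = "idl_")
print(sapply(typed, class))
```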
But that’s not all: through skip and n_max, which should be set to integers, you can also control which part of your flat file you’re actually importing into R. skip is the number of lines to ignore before reading starts. Keep in mind that skipping any lines also skips the column names in the first line, if the file has them, so you will typically want to supply the names yourself through col_names. n_max is the maximum number of records to read.
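The interaction between skip, n_max and col_names can be sketched on a small made-up tab-delimited file (ten records with a header line; all names and values are assumptions for this example):

```r
library(readr)

# Hypothetical file: a header line followed by records 1..10
path <- tempfile(fileext = ".tsv")
writeLines(c("id\tvalue", paste(1:10, (1:10) * 10, sep = "\t")), path)

# skip = 4 drops the header plus the first 3 records,
# so the names have to be supplied by hand; n_max = 3 keeps records 4, 5 and 6
fragment <- read_tsv(path, col_names = c("id", "value"), skip = 4, n_max = 3)
print(fragment)
```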
You’ll be working with the potatoes.txt file again. It contains tab-delimited records and the first line contains the column names. For a change, use read_tsv() this time; it’s a wrapper around read_delim() that’s easy to use for tab-delimited files. The column names are also available in the properties vector, should you need them.
# Column names
properties <- c("area", "temp", "size", "storage", "method",
"texture", "flavor", "moistness")
# Import 5 observations from potatoes.txt, starting at the 7th: potatoes_fragment
# skip = 7 skips the header line plus the first 6 observations
potatoes_fragment <- read_tsv("potatoes.txt", col_names = properties, skip = 7, n_max = 5)
# Import all data, but force all columns to be character: potatoes_char
potatoes_char <- read_tsv("potatoes.txt", col_types = "cccccccc")
# Display the structure of potatoes_char
str(potatoes_char)
## Classes 'tbl_df', 'tbl' and 'data.frame': 160 obs. of 8 variables:
## $ area : chr "1" "1" "1" "1" ...
## $ temp : chr "1" "1" "1" "1" ...
## $ size : chr "1" "1" "1" "1" ...
## $ storage : chr "1" "1" "1" "1" ...
## $ method : chr "1" "2" "3" "4" ...
## $ texture : chr "2.9" "2.3" "2.5" "2.1" ...
## $ flavor : chr "3.2" "2.5" "2.8" "2.9" ...
## $ moistness: chr "3.0" "2.6" "2.8" "2.4" ...
col_types with collectors
Another way of setting the types of the imported columns is using collectors. Collector functions can be passed in a list() to the col_types argument of read_ functions to tell them how to interpret values in a column.
For a complete list of collector functions, you can take a look at the collector documentation. For this exercise you will need two collector functions:
- col_integer(): the column should be interpreted as an integer
- col_factor(levels, ordered = FALSE): the column should be interpreted as a factor with levels. By default, the values are not seen as ordered values.
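As a small self-contained sketch of these two collectors, the snippet below reads a made-up tab-delimited file without a header (file contents, column names and factor levels are assumptions for illustration) and passes one collector per column in a list():

```r
library(readr)

# Hypothetical data without a header line
path <- tempfile(fileext = ".tsv")
writeLines(c("small\t3", "large\t7", "small\t5"), path)

# One collector per column, in file order
sizes <- read_tsv(path,
                  col_names = c("size", "count"),
                  col_types = list(col_factor(levels = c("small", "large")),
                                   col_integer()))
print(levels(sizes$size))
print(class(sizes$count))
```

Values that are not among the given levels would not be imported as valid factor values, so the levels vector should cover everything that occurs in the column.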
In this exercise, you will work with hotdogs.txt, which is a tab-delimited file without column names in the first row.
# Import without col_types
hotdogs <- read_tsv("hotdogs.txt", col_names = c("type", "calories", "sodium"))
# Display the summary of hotdogs
summary(hotdogs)
##      type              calories         sodium
##  Length:54          Min.   : 86.0   Min.   :144.0
##  Class :character   1st Qu.:132.0   1st Qu.:362.5
##  Mode  :character   Median :145.0   Median :405.0
##                     Mean   :145.4   Mean   :424.8
##                     3rd Qu.:172.8   3rd Qu.:503.5
##                     Max.   :195.0   Max.   :645.0
# The collectors you will need to import the data
fac <- col_factor(levels = c("Beef", "Meat", "Poultry"))
int <- col_integer()
# Edit the col_types argument to import the data correctly: hotdogs_factor
hotdogs_factor <- read_tsv("hotdogs.txt",
col_names = c("type", "calories", "sodium"),
# col_types takes a list of collectors, one per column
col_types = list(fac, int, int))
# Display the summary of hotdogs_factor
summary(hotdogs_factor)
##       type       calories         sodium
##  Beef   :20   Min.   : 86.0   Min.   :144.0
##  Meat   :17   1st Qu.:132.0   1st Qu.:362.5
##  Poultry:17   Median :145.0   Median :405.0
##               Mean   :145.4   Mean   :424.8
##               3rd Qu.:172.8   3rd Qu.:503.5
##               Max.   :195.0   Max.   :645.0
fread
Let’s shift focus to the data.table package. You still remember how to use read.table(), right? Well, fread() is a function in data.table that does the same job with very similar arguments. It is easy to use and fast: fread() detects the separator, the header and the column types automatically, so often simply specifying the path to the file is enough to successfully import your flat file data as a data frame.
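A minimal sketch of that basic use, with a made-up tab-delimited file (contents are an assumption for illustration); only the path is passed:

```r
library(data.table)

# Hypothetical tab-delimited data with a header line
path <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "1\tx\t2.5", "2\ty\t3.5"), path)

# No separator or header argument is given here
dt <- fread(path)
print(dt)
```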
fread: more advanced use
Now that you know the basics about fread(), you should know about two arguments of the function: drop and select.
They enable you to drop or select variables of interest in your flat file. Suppose you have a dataset that contains 5 variables and you want to keep the first and fifth variable, named “a” and “e”. The following options will all do the trick:
- fread("path/to/file.txt", drop = 2:4)
- fread("path/to/file.txt", select = c(1, 5))
- fread("path/to/file.txt", drop = c("b", "c", "d"))
- fread("path/to/file.txt", select = c("a", "e"))
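The four calls above can be tried out on a made-up five-column file (the file and its values are assumptions for this sketch); each call should leave only the columns “a” and “e”:

```r
library(data.table)

# Hypothetical file with five variables a..e
path <- tempfile(fileext = ".txt")
writeLines(c("a,b,c,d,e", "1,2,3,4,5", "6,7,8,9,10"), path)

by_drop_index   <- fread(path, drop = 2:4)
by_select_index <- fread(path, select = c(1, 5))
by_drop_name    <- fread(path, drop = c("b", "c", "d"))
by_select_name  <- fread(path, select = c("a", "e"))

# Collect the resulting column names of all four variants
kept <- lapply(list(by_drop_index, by_select_index, by_drop_name, by_select_name), names)
print(kept)
```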
Let’s stick with potatoes since we’re particularly fond of them here at DataCamp. The data is again available in the file potatoes.txt, containing tab-delimited records.
# Import columns 6, 7 and 8 of potatoes.txt: potatoes
potatoes <- fread("potatoes.txt", select=c(6,7,8))
# Keep only tasty potatoes (flavor > 3): tasty_potatoes
tasty_potatoes <- subset(potatoes, flavor > 3)
# Plot texture (x) and moistness (y) of tasty_potatoes
plot(tasty_potatoes$texture, tasty_potatoes$moistness)

Dedicated classes
You might have noticed that the fread() function produces data frames that look slightly different when you print them out. That’s because another class named data.table is assigned to the resulting data frames. The printout of such data.table objects is different. Does something similar happen with the data frames generated by readr?
In your current working directory, we prepared the potatoes.txt file. The packages data.table and readr are both loaded, so you can experiment straight away.
Which of the following statements is true?
- fread() creates an object whose only class is data.table class. read_tsv() creates an object with class tbl_df.
- The class of the result of fread() is only data.table. That of the result of read_tsv() is both tbl_df and tbl.
- The class of the result of fread() is both data.table and data.frame. read_tsv() creates an object with three classes: tbl_df, tbl and data.frame. (correct answer)
- fread() creates an object of the data.table class, while read_tsv() simply generates a data.frame, nothing more.
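One way to check this yourself is to inspect class() on both results. The sketch below uses a small made-up file instead of potatoes.txt (an assumption, so that it runs anywhere). Because newer readr releases may attach a further class such as spec_tbl_df, it tests membership with inherits() rather than comparing the full class vector:

```r
library(readr)
library(data.table)

# Hypothetical tab-delimited file standing in for potatoes.txt
path <- tempfile(fileext = ".txt")
writeLines(c("a\tb", "1\t2", "3\t4"), path)

from_fread <- fread(path)
from_readr <- read_tsv(path)

# Print the class vectors of both results
print(class(from_fread))
print(class(from_readr))

# Both are still data frames underneath
print(inherits(from_fread, "data.frame"))
print(inherits(from_readr, "data.frame"))
```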