read_delim

Just as read.table() was the “mother function” of the utils package, read_delim() is the main function in readr by Hadley Wickham.

The read_delim() function takes two mandatory arguments:

file: the file that contains the data

delim: the character that separates the values in the data file

The dataset you’ll be working with in the next few exercises is potatoes.txt. It gives information on the impact of storage period and cooking on potatoes’ flavour. The file uses tabs (“\t”) to delimit values and contains column names in its first line. It’s available in your working directory so you can start right away.

# Load the readr package
library("readr")
# Import potatoes.txt using read_delim(): potatoes
potatoes <- read_delim("potatoes.txt", delim="\t")
head(potatoes)
##   area temp size storage method texture flavor moistness
## 1    1    1    1       1      1     2.9    3.2       3.0
## 2    1    1    1       1      2     2.3    2.5       2.6
## 3    1    1    1       1      3     2.5    2.8       2.8
## 4    1    1    1       1      4     2.1    2.9       2.4
## 5    1    1    1       1      5     1.9    2.8       2.2
## 6    1    1    1       2      1     1.8    3.0       1.7
# Create a subset of potatoes: potatoes_sel
potatoes_sel <-  potatoes[, c("texture", "flavor", "moistness")]
head(potatoes_sel)
##   texture flavor moistness
## 1     2.9    3.2       3.0
## 2     2.3    2.5       2.6
## 3     2.5    2.8       2.8
## 4     2.1    2.9       2.4
## 5     1.9    2.8       2.2
## 6     1.8    3.0       1.7
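
The delim argument is not limited to tabs. As a minimal sketch, assuming a small made-up file that uses a pipe as separator (the file contents and the name sample_data are hypothetical, not part of the exercise data):

```r
library(readr)

# Hypothetical pipe-delimited sample written to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("texture|flavor|moistness",
             "2.9|3.2|3.0",
             "2.3|2.5|2.6"), tmp)

# file and delim are the two arguments described above
sample_data <- read_delim(tmp, delim = "|")
print(sample_data)
```

The first line of the file is used for the column names, just as with potatoes.txt.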

read_csv

Time to switch to .csv files! Just as read.csv() wraps read.table() in utils, read_csv() in the readr package is a wrapper around read_delim(). Unlike read_delim(), read_csv() does not let you choose the delimiter; it is fixed to a comma, so readr handles that for you.

Instead of a .txt, the potatoes data comes to you in the form of a .csv file. It uses commas to delimit fields in a record, but does not contain the column names in the first row anymore. The file potatoes.csv is available in your workspace. You’ll have to specify the column names manually; you can use the properties variable that’s already pre-coded.

# Column names
properties <- c("area", "temp", "size", "storage", "method", 
                "texture", "flavor", "moistness")

# Import potatoes.csv with read_csv(): potatoes
potatoes <- read_csv("potatoes.csv", col_names=properties)
potatoes$method
##   [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
##  [36] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
##  [71] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
## [106] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
## [141] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
# Create a copy of potatoes: potatoes2
potatoes2 <- potatoes

# Convert the method column of potatoes2 to a factor
potatoes2$method <- factor(potatoes2$method)
potatoes2$method
##   [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
##  [36] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
##  [71] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
## [106] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
## [141] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
## Levels: 1 2 3 4 5
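
To see why col_names matters for a file without a header, here is a small sketch on a made-up three-record file (the values and the names with_default and with_names are assumptions for illustration):

```r
library(readr)

# Hypothetical header-less csv with three records
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,2.9,3.2",
             "2,2.3,2.5",
             "3,2.5,2.8"), tmp)

# Default col_names = TRUE: the first record is consumed as the header
with_default <- read_csv(tmp)

# Supplying names keeps all three records as data
with_names <- read_csv(tmp, col_names = c("method", "texture", "flavor"))

nrow(with_default)  # 2
nrow(with_names)    # 3
```

Without col_names, one observation silently ends up as column names, which is easy to miss in a larger file.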

col_types, skip and n_max

Apart from controlling how columns are named, you can also specify which types the columns should have in your imported data frame. You can do this with col_types. If set to NULL, which is the default, functions from the readr package will try to find the correct types themselves. If you want to set the types manually, you can pass a quoted string with one character for each column in the file. Each character corresponds to a type:

  • c to a character
  • d to a double
  • i to an integer
  • l to a logical
  • _ skips the column

But that’s not all: through skip and n_max, which should be set to integers, you can also control which part of your flat file you’re actually importing into R. Keep in mind that skipping lines also skips the column names in the first line, if the file has them, so you then have to supply the names yourself. n_max is the maximum number of records to read.

You’ll be working with the potatoes.txt file again. It contains tab-delimited records and the first line contains the column names. For a change, use read_tsv() this time; it’s a wrapper around read_delim() that’s easy to use for tab-delimited files. The column names are available as the properties vector, in case you need them.

# Column names
properties <- c("area", "temp", "size", "storage", "method", 
                "texture", "flavor", "moistness")

# Import 5 observations from potatoes.txt, starting at the 7th: potatoes_fragment
# Skip 7 lines: the header line plus the first 6 observations
potatoes_fragment <- read_tsv("potatoes.txt", col_names = properties, skip = 7, n_max = 5)

# Import all data, but force all columns to be character: potatoes_char
potatoes_char <- read_tsv("potatoes.txt", col_types = "cccccccc")

# Display the structure of potatoes_char
str(potatoes_char)
## Classes 'tbl_df', 'tbl' and 'data.frame':    160 obs. of  8 variables:
##  $ area     : chr  "1" "1" "1" "1" ...
##  $ temp     : chr  "1" "1" "1" "1" ...
##  $ size     : chr  "1" "1" "1" "1" ...
##  $ storage  : chr  "1" "1" "1" "1" ...
##  $ method   : chr  "1" "2" "3" "4" ...
##  $ texture  : chr  "2.9" "2.3" "2.5" "2.1" ...
##  $ flavor   : chr  "3.2" "2.5" "2.8" "2.9" ...
##  $ moistness: chr  "3.0" "2.6" "2.8" "2.4" ...
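
The type characters can be mixed, and _ drops a column entirely. The sketch below uses a made-up four-column file (its contents and the names typed and fragment are hypothetical) to show a mixed col_types string and the combination of skip and n_max:

```r
library(readr)

# Hypothetical tab-delimited sample with a header line
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tlabel\tscore\tflag",
             "1\ta\t2.5\tTRUE",
             "2\tb\t3.0\tFALSE",
             "3\tc\t1.5\tTRUE",
             "4\td\t4.0\tFALSE"), tmp)

# One character per column: integer, skip, double, logical
typed <- read_tsv(tmp, col_types = "i_dl")

# Skip the header plus the first record, then read at most two records.
# Because the header is skipped too, the names are supplied explicitly.
fragment <- read_tsv(tmp, skip = 2, n_max = 2,
                     col_names = c("id", "label", "score", "flag"),
                     col_types = "icdl")

str(typed)
fragment$id
```

Here typed has no label column, and fragment holds the second and third records only.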

col_types with collectors

Another way of setting the types of the imported columns is using collectors. Collector functions can be passed in a list() to the col_types argument of read_ functions to tell them how to interpret values in a column.

For a complete list of collector functions, you can take a look at the collector documentation. For this exercise you will need two collector functions:

  • col_integer(): the column should be interpreted as an integer
  • col_factor(levels, ordered = FALSE): the column should be interpreted as a factor with levels. By default, the values are not seen as ordered values.

In this exercise, you will work with hotdogs.txt, which is a tab-delimited file without column names in the first row.

# Import without col_types
hotdogs <- read_tsv("hotdogs.txt", col_names = c("type", "calories", "sodium"))

# Display the summary of hotdogs
summary(hotdogs)
##      type              calories         sodium     
##  Length:54          Min.   : 86.0   Min.   :144.0  
##  Class :character   1st Qu.:132.0   1st Qu.:362.5  
##  Mode  :character   Median :145.0   Median :405.0  
##                     Mean   :145.4   Mean   :424.8  
##                     3rd Qu.:172.8   3rd Qu.:503.5  
##                     Max.   :195.0   Max.   :645.0
# The collectors you will need to import the data
fac <- col_factor(levels = c("Beef", "Meat", "Poultry"))
int <- col_integer()

# Edit the col_types argument to import the data correctly: hotdogs_factor
hotdogs_factor <- read_tsv("hotdogs.txt", 
                           col_names = c("type", "calories", "sodium"),
                           # One collector per column, in column order
                           col_types = list(fac, int, int))

# Display the summary of hotdogs_factor
summary(hotdogs_factor)
##       type       calories         sodium     
##  Beef   :20   Min.   : 86.0   Min.   :144.0  
##  Meat   :17   1st Qu.:132.0   1st Qu.:362.5  
##  Poultry:17   Median :145.0   Median :405.0  
##               Mean   :145.4   Mean   :424.8  
##               3rd Qu.:172.8   3rd Qu.:503.5  
##               Max.   :195.0   Max.   :645.0
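
If you want to check the effect of the collectors without the exercise file, the following sketch does the same thing on a few made-up records (the values and the name sample_hotdogs are assumptions, not taken from hotdogs.txt):

```r
library(readr)

# Hypothetical header-less, tab-delimited sample
tmp <- tempfile(fileext = ".txt")
writeLines(c("Beef\t150\t400",
             "Meat\t160\t410",
             "Poultry\t120\t390",
             "Beef\t170\t420"), tmp)

fac <- col_factor(levels = c("Beef", "Meat", "Poultry"))
int <- col_integer()

# The collectors are matched to the columns by position
sample_hotdogs <- read_tsv(tmp,
                           col_names = c("type", "calories", "sodium"),
                           col_types = list(fac, int, int))

class(sample_hotdogs$type)
levels(sample_hotdogs$type)
is.integer(sample_hotdogs$calories)
```

Because type is now a factor, summary() can count the observations per level instead of only reporting the length of a character column.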

fread

Let’s shift focus to the data.table package. You still remember how to use read.table(), right? Well, fread() is a function in data.table that does the same job with very similar arguments. It is extremely easy to use and blazingly fast! Often, simply specifying the path to the file is enough to successfully import your flat file data as a data frame.

Don’t take our word for it, try it yourself! You’ll again be working with the potatoes.txt file, which is available in your workspace. It’s in the same format as before; fields are delimited by tabs.

# load the data.table package
library("data.table")
# Import potatoes.txt with fread(): potatoes
potatoes <- fread("potatoes.txt")

# Print out arranged version of potatoes
potatoes[order(potatoes$moistness)]
##      area temp size storage method texture flavor moistness
##   1:    1    1    2       4      1     1.5    2.6       1.3
##   2:    1    1    2       4      2     1.4    2.6       1.3
##   3:    1    2    2       4      1     1.4    2.9       1.4
##   4:    1    1    1       3      1     1.8    2.6       1.5
##   5:    1    2    2       4      5     1.5    2.4       1.5
##  ---                                                       
## 156:    2    2    2       2      3     2.2    2.8       3.1
## 157:    2    2    2       3      5     2.9    3.1       3.1
## 158:    1    1    2       3      3     3.3    3.2       3.2
## 159:    1    2    1       1      3     3.2    2.7       3.2
## 160:    1    2    1       4      3     3.6    3.0       3.3
# Import 20 rows of potatoes.txt with fread(): potatoes_part
potatoes_part <- fread("potatoes.txt", nrows=20)
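
To try this without the exercise file, here is a small sketch on a made-up tab-delimited sample (the contents and the names auto and first_two are hypothetical):

```r
library(data.table)

# Hypothetical tab-delimited sample with a header line
tmp <- tempfile(fileext = ".txt")
writeLines(c("texture\tflavor\tmoistness",
             "2.9\t3.2\t3.0",
             "2.3\t2.5\t2.6",
             "2.5\t2.8\t2.8"), tmp)

# Only the path is given; fread() guesses the separator and the header
auto <- fread(tmp)

# nrows caps the number of records that are read
first_two <- fread(tmp, nrows = 2)

names(auto)
nrow(first_two)
```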

fread: more advanced use

Now that you know the basics of fread(), you should know about two more arguments of the function: drop and select.

They enable you to drop or select variables of interest in your flat file. Suppose you have a dataset that contains 5 variables and you want to keep the first and fifth variable, named “a” and “e”. The following options will all do the trick:

  • fread("path/to/file.txt", drop = 2:4)
  • fread("path/to/file.txt", select = c(1, 5))
  • fread("path/to/file.txt", drop = c("b", "c", "d"))
  • fread("path/to/file.txt", select = c("a", "e"))
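
The four options can be checked against each other on a made-up file with the five variables a to e (the file contents and the result names are assumptions for illustration):

```r
library(data.table)

# Hypothetical file with five variables named a to e
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc\td\te",
             "1\t2\t3\t4\t5",
             "6\t7\t8\t9\t10"), tmp)

by_drop_index   <- fread(tmp, drop = 2:4)
by_select_index <- fread(tmp, select = c(1, 5))
by_drop_name    <- fread(tmp, drop = c("b", "c", "d"))
by_select_name  <- fread(tmp, select = c("a", "e"))

# All four results contain only the columns a and e
names(by_drop_index)
all.equal(by_drop_index, by_select_name)
```

Selecting by name is usually the more robust choice, since it keeps working if the column order in the file changes.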

Let’s stick with potatoes since we’re particularly fond of them here at DataCamp. The data is again available in the file potatoes.txt, containing tab-delimited records.

# Import columns 6, 7 and 8 of potatoes.txt: potatoes
potatoes <- fread("potatoes.txt", select=c(6,7,8))

# Keep only tasty potatoes (flavor > 3): tasty_potatoes
tasty_potatoes <- subset(potatoes, flavor > 3)

# Plot texture (x) and moistness (y) of tasty_potatoes
plot(tasty_potatoes$texture, tasty_potatoes$moistness)

Dedicated classes

You might have noticed that the fread() function produces data frames that look slightly different when you print them out. That’s because another class named data.table is assigned to the resulting data frames. The printout of such data.table objects is different. Does something similar happen with the data frames generated by readr?

In your current working directory, we prepared the potatoes.txt file. The packages data.table and readr are both loaded, so you can experiment straight away.

Which of the following statements is true?

  • fread() creates an object whose only class is data.table. read_tsv() creates an object with class tbl_df.
  • The class of the result of fread() is only data.table. That of the result of read_tsv() is both tbl_df and tbl.
  • The class of the result of fread() is both data.table and data.frame. read_tsv() creates an object with three classes: tbl_df, tbl and data.frame. <-- correct answer
  • fread() creates an object of the data.table class, while read_tsv() simply generates a data.frame, nothing more.
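
You can verify this with class(). The sketch below uses a made-up two-column file (its contents and the names dt_classes and tbl_classes are hypothetical). Note that more recent readr versions may report an additional spec_tbl_df class in front of the three listed above, so the check uses %in% rather than an exact comparison:

```r
library(data.table)
library(readr)

# Hypothetical tab-delimited sample
tmp <- tempfile(fileext = ".txt")
writeLines(c("texture\tflavor",
             "2.9\t3.2",
             "2.3\t2.5"), tmp)

dt_classes  <- class(fread(tmp))
tbl_classes <- class(read_tsv(tmp))

dt_classes
tbl_classes

# Both results are still data frames underneath
"data.frame" %in% dt_classes
c("tbl_df", "tbl", "data.frame") %in% tbl_classes
```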