Prerequisites

In this chapter, we will use the readr package to import rectangular data stored in plain text files. So far, we have only worked with example data already available in R, but at some point we will need to import our own datasets. Unfortunately, such datasets are rarely tidy. Instead, real-world data may contain typos or inconsistent notation, combine multiple variables in one column, spread a single variable across multiple columns, and so on. We will discuss how to tidy up such data in the next chapter.

In addition to readr, the Tidyverse also provides packages for importing data from other statistical software. In particular, the haven package supports SPSS, Stata, and SAS formats, whereas readxl imports data stored in Excel sheets.

Plain text files

Plain text files contain only text (letters, numbers, punctuation, and other special characters). They can be opened by any text editor on almost any platform. Therefore, storing data in a plain text file is often a viable option unless there are hundreds of thousands of rows and columns. Such large datasets would take up a lot of disk space when stored as plain text, so more specialized binary formats (not covered in this workshop) should be preferred.

Let’s start with an example. Here are the contents of mtcars.csv, a rectangular plain text file that comes with readr (you can find its location on your computer with readr_example("mtcars.csv")):

"mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
21,6,160,110,3.9,2.62,16.46,0,1,4,4
21,6,160,110,3.9,2.875,17.02,0,1,4,4
22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
17.3,8,275.8,180,3.07,3.73,17.6,0,0,3,3
15.2,8,275.8,180,3.07,3.78,18,0,0,3,3
10.4,8,472,205,2.93,5.25,17.98,0,0,3,4
10.4,8,460,215,3,5.424,17.82,0,0,3,4
14.7,8,440,230,3.23,5.345,17.42,0,0,3,4
32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1
15.5,8,318,150,2.76,3.52,16.87,0,0,3,2
15.2,8,304,150,3.15,3.435,17.3,0,0,3,2
13.3,8,350,245,3.73,3.84,15.41,0,0,3,4
19.2,8,400,175,3.08,3.845,17.05,0,0,3,2
27.3,4,79,66,4.08,1.935,18.9,1,1,4,1
26,4,120.3,91,4.43,2.14,16.7,0,1,5,2
30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
15.8,8,351,264,4.22,3.17,14.5,0,1,5,4
19.7,6,145,175,3.62,2.77,15.5,0,1,5,6
15,8,301,335,3.54,3.57,14.6,0,1,5,8
21.4,4,121,109,4.11,2.78,18.6,1,1,4,2
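
To inspect the raw file yourself, something along the following lines should work. This is only a minimal sketch: it assumes that readr is installed, and the variable names path and first_lines are chosen purely for illustration.

```r
library(readr)

# Locate the example file that ships with readr
path = readr_example("mtcars.csv")

# Read the first three lines as plain text (no parsing into columns)
first_lines = readLines(path, n=3)

# Print them exactly as they are stored in the file
writeLines(first_lines)
```

Because readLines() does not interpret the contents, the quotes around the variable names and the commas between values are shown unchanged.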

Here, rectangular means that each column has the same number of elements, which is the case for most datasets. As we can see, this file also has a header row containing variable names, and columns are separated by commas. However, other text files might use different column separators (such as tabs or semicolons), different decimal marks (such as a comma used in German-speaking countries as opposed to a point used in English-speaking countries), no header row, comments, and so on.

The readr functions

As usual, we start with activating the package:

library(readr)

The most generic function for reading rectangular text data is called read_delim(). It has parameters to set the column delimiter, decimal mark, missing values, whether or not there is a header row, and so on. Other functions include read_csv(), read_tsv(), and read_csv2(), which are special cases with predefined values for some common file formats. We’ll showcase how readr functions work using read_delim(), because all other functions behave very similarly.
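
As a rough illustration of two of these special cases, the following sketch reads the same toy data once as comma-separated and once as tab-separated values. The data are passed directly as strings (readr treats a string containing a line break as data rather than as a file name), and the names csv and tsv are made up for this example.

```r
library(readr)

# Comma-separated values: read_csv() uses "," as the column separator
csv = read_csv("A,B,C\n1.1,1.3,-2.0\n5,6.3,-1.8")

# Tab-separated values: read_tsv() uses the tab character ("\t") instead
tsv = read_tsv("A\tB\tC\n1.1\t1.3\t-2.0\n5\t6.3\t-1.8")

# Both calls should yield the same column names and values
names(csv)
names(tsv)
```

In both cases, the same result could presumably be obtained with read_delim() by setting delim="," or delim="\t" explicitly.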

The file and delim arguments

The first argument (file) is always the file name you want to import. The second argument (delim) specifies the column separator used in the particular file. For example, if you wanted to load mtcars.csv shown in the previous example, you could use the following function call to import the contents of the file into a data frame:

(df = read_delim(readr_example("mtcars.csv"), delim=","))
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# … with 22 more rows

In practice, data files are often located in the current working directory. In that case, we can specify the file name without the complete path as follows:

df = read_delim("mtcars.csv", delim=",")  # mtcars.csv must be in current working directory
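
If you are unsure whether the file is really in the working directory, a quick check along these lines can help. This is a sketch only: falling back to the copy bundled with readr is an assumption made here so that the example runs on any machine.

```r
library(readr)

# Show the current working directory
getwd()

# Relative file names are resolved against the working directory,
# so we first check that the file is actually there
if (file.exists("mtcars.csv")) {
    df = read_delim("mtcars.csv", delim=",")
} else {
    # Fall back to the example file that ships with readr
    df = read_delim(readr_example("mtcars.csv"), delim=",")
}

dim(df)
```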

We could also use read_csv() to import this file, which saves us passing the column separator argument:

df = read_csv("mtcars.csv")

Instead of a file name, you can also pass the data itself as the first argument, written as a string that contains at least one line break:

read_delim("A,B,C\n1.1,1.3,-2.0\n5,6.3,-1.8", delim=",")
# A tibble: 2 × 3
      A     B     C
  <dbl> <dbl> <dbl>
1   1.1   1.3  -2  
2   5     6.3  -1.8

This is useful to quickly generate some toy data (note that \n corresponds to a line break).

Here is an example of data separated by semicolons. To import it, we need to set delim=";" accordingly.

read_delim("A;B;C\n1.1;1.3;-2.0\n5;6.3;-1.8", delim=";")
# A tibble: 2 × 3
      A     B     C
  <dbl> <dbl> <dbl>
1   1.1   1.3  -2  
2   5     6.3  -1.8

Whatever the format of the file may be, the result is always a data frame.
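
One way to convince yourself of this is to check the class of the result, as in the short sketch below (the toy data are reused from the semicolon example). The readr functions return a tibble, which is a special kind of data frame.

```r
library(readr)

df = read_delim("A;B;C\n1.1;1.3;-2.0\n5;6.3;-1.8", delim=";")

# A tibble has class "tbl_df" in addition to "data.frame"
class(df)

# Therefore, it passes the usual data frame check
is.data.frame(df)
```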

Recent versions of the readr package do not even require that we pass an explicit value for the delim argument. If we do not provide a delimiter, the function tries to guess it from the contents of the file. This works in many (if not most) cases, so we could load the file simply with:

read_delim("mtcars.csv")

In case this does not work, we can always provide the correct delimiter explicitly.
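
The following sketch compares both variants on a small semicolon-separated file. It assumes a recent version of readr (delimiter guessing is not available in older versions), and the temporary file and the variable names are only there to keep the example self-contained.

```r
library(readr)

# Write semicolon-separated toy data to a temporary file
path = tempfile(fileext=".txt")
writeLines(c("A;B;C", "1.1;1.3;-2.0", "5;6.3;-1.8"), path)

# No delim argument: read_delim() tries to guess the separator
guessed = read_delim(path)

# Passing the delimiter explicitly avoids any guessing
explicit = read_delim(path, delim=";")

# Both data frames should have the same columns
names(guessed)
names(explicit)
```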

Other useful arguments

If the file does not contain a header row, we need to specify col_names=FALSE:

read_delim("1.1,1.3,-2.0\n5,6.3,-1.8", ",", col_names=FALSE)
# A tibble: 2 × 3
     X1    X2    X3
  <dbl> <dbl> <dbl>
1   1.1   1.3  -2  
2   5     6.3  -1.8

In this case, read_delim() generates column names automatically (X1, X2, and X3).
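
If the automatically generated names are not descriptive enough, col_names also accepts a character vector of names. The names A, B, and C in the following sketch are arbitrary and chosen only to match the earlier examples.

```r
library(readr)

# Supply our own column names for a file without a header row
df = read_delim(
    "1.1,1.3,-2.0\n5,6.3,-1.8",
    delim=",",
    col_names=c("A", "B", "C")
)

names(df)
```

Because the names are supplied separately, the first line of the data is still treated as a regular row and not as a header.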

English-speaking regions use a point as the decimal mark. This is also the case for most programming languages, including R:

4.1
[1] 4.1

In contrast, other regions (including Austria and Germany) use a comma to separate whole and decimal digits. Therefore, the number 4.1 would be written as 4,1. To import numerical data stored in this format, we need to set the locale argument accordingly:

read_delim("A;B;C\n1,1;1,3;-2,0\n5;6,3;-1,8", ";", locale=locale(decimal_mark=","))
# A tibble: 2 × 3
      A     B     C
  <dbl> <dbl> <dbl>
1   1.1   1.3  -2  
2   5     6.3  -1.8

The resulting data frame is identical to the one from the earlier semicolon example, where we used the regular English number format. It is also worth noting that a file cannot use the same character for both column and decimal separators. This is why the example above uses semicolons to delimit columns (because the comma is already used for the decimal mark).
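
Because this combination of semicolons and decimal commas is so common, read_csv2() uses it by default. The following sketch compares the two approaches; the names data, long, and short are purely illustrative.

```r
library(readr)

# Toy data with semicolons as column separators and commas as decimal marks
data = "A;B;C\n1,1;1,3;-2,0\n5;6,3;-1,8"

# Explicit version: set the delimiter and the locale ourselves
long = read_delim(data, delim=";", locale=locale(decimal_mark=","))

# Shortcut: read_csv2() assumes ";" as separator and "," as decimal mark
short = read_csv2(data)

# Both versions should contain the same numbers
long$A
short$A
```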

Comparison to base R functions

TODO

Parsing a vector

TODO

Parsing a file

TODO

Writing to a file

TODO


© 2022, Clemens Brunner