In this chapter, we will use the readr package to import rectangular data stored in plain text files. So far, we have only worked with example data already available in R, but at some point we need to import our own datasets. Unfortunately, it is rarely the case that these datasets are already tidy. Instead, real-world data may contain typos and/or inconsistent notations, combine multiple variables in one column, use multiple columns for a single variable, and so on. We will discuss how to tidy up such data in the next chapter.
In addition to readr, the Tidyverse also provides packages for importing data from other statistical software. In particular, the haven package supports SPSS, Stata, and SAS formats, whereas readxl imports data stored in Excel sheets.
Plain text files contain only text (letters, numbers, punctuation, and other special characters). They can be opened by any text editor on almost any platform. Therefore, storing data in a plain text file is often a viable option unless there are of hundreds of thousands of rows and columns. Such large datasets would take up a lot of disk space when stored as plain text, so other more specialized binary formats (not covered in this workshop) should be preferred.
Let’s start with an example. Here’s the contents of
mtcars.csv, a rectangular plain text file that comes with readr
(you can find its location on your computer with
readr_example("mtcars.csv")
):
"mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
21,6,160,110,3.9,2.62,16.46,0,1,4,4
21,6,160,110,3.9,2.875,17.02,0,1,4,4
22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
17.3,8,275.8,180,3.07,3.73,17.6,0,0,3,3
15.2,8,275.8,180,3.07,3.78,18,0,0,3,3
10.4,8,472,205,2.93,5.25,17.98,0,0,3,4
10.4,8,460,215,3,5.424,17.82,0,0,3,4
14.7,8,440,230,3.23,5.345,17.42,0,0,3,4
32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1
15.5,8,318,150,2.76,3.52,16.87,0,0,3,2
15.2,8,304,150,3.15,3.435,17.3,0,0,3,2
13.3,8,350,245,3.73,3.84,15.41,0,0,3,4
19.2,8,400,175,3.08,3.845,17.05,0,0,3,2
27.3,4,79,66,4.08,1.935,18.9,1,1,4,1
26,4,120.3,91,4.43,2.14,16.7,0,1,5,2
30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
15.8,8,351,264,4.22,3.17,14.5,0,1,5,4
19.7,6,145,175,3.62,2.77,15.5,0,1,5,6
15,8,301,335,3.54,3.57,14.6,0,1,5,8
21.4,4,121,109,4.11,2.78,18.6,1,1,4,2
Here, rectangular means that each column has the same number of elements, which is the case for most datasets. Apparently, this file also has a header row containing variable names, and columns are separated by commas. However, other text files might use different column separators (such as commas or semicolons), different decimal marks (such as a comma used in German-speaking countries as opposed to a point used in English-speaking countries), no header row, comments, and so on.
readr
functionsAs usual, we start with activating the package:
library(readr)
The most generic function for reading rectangular text data is called
read_delim()
. It has parameters to set the column
delimiter, decimal mark, missing values, whether or not there is a
header row, and so on. Other functions include read_csv()
,
read_tsv()
, and read_csv2()
, which are special
cases with predefined values for some common file formats. We’ll
showcase how readr functions work using read_delim()
,
because all other functions behave very similarly.
file
and delim
argumentsThe first argument (file
) is always the file name you
want to import. The second argument (delim
) specifies the
column separator used in the particular file. For example, if you wanted
to load mtcars.csv shown in the previous example, you could use
the following function call to import the contents of the file into a
data frame:
(df = read_delim(readr_example("mtcars.csv"), delim=","))
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
However, data files are usually located in the current working directory. This allows us to specify the file name without the complete path as follows:
df = read_delim("mtcars.csv", delim=",") # mtcars.csv must be in current working directory
We could also use read_csv()
to import this file, which
saves us passing the column separator argument:
df = read_csv("mtcars.csv")
Instead of a file name, you can also pass a character vector as the first argument:
read_delim("A,B,C\n1.1,1.3,-2.0\n5,6.3,-1.8", delim=",")
# A tibble: 2 × 3
A B C
<dbl> <dbl> <dbl>
1 1.1 1.3 -2
2 5 6.3 -1.8
This is useful to quickly generate some toy data (note that
\n
corresponds to a line break).
Here is an example of data separated by semicolons. To import it, we
need to set delim=";"
accordingly.
read_delim("A;B;C\n1.1;1.3;-2.0\n5;6.3;-1.8", delim=";")
# A tibble: 2 × 3
A B C
<dbl> <dbl> <dbl>
1 1.1 1.3 -2
2 5 6.3 -1.8
Whatever the format of the file may be, the result is always a data frame.
Recent versions of the readr package do not even require that we pass
an explicit value for the delim
argument. If we do not
provide a delimiter, the function tries to guess it from the contents of
the file. This works in many (if not most) cases, so we could load the
file simply with:
read_delim("mtcars.csv")
In case this does not work, we can always provide the correct delimiter explicitly.
If the file does not contain a header row, we need to specify
col_names=FALSE
:
read_delim("1.1,1.3,-2.0\n5,6.3,-1.8", ",", col_names=FALSE)
# A tibble: 2 × 3
X1 X2 X3
<dbl> <dbl> <dbl>
1 1.1 1.3 -2
2 5 6.3 -1.8
In this case, read_delim()
generates column names
automatically (X1
, X2
, and
X3
).
English-speaking regions use a point to denote the decimal mark. This is also the case for most programming languages, for example:
4.1
[1] 4.1
In contrast, other regions (including Austria and Germany) use a
comma to separate whole and decimal digits. Therefore, the number 4.1
would be written as 4,1. To import numerical data stored in this format,
we need to set the locale
argument accordingly:
read_delim("A;B;C\n1,1;1,3;-2,0\n5;6,3;-1,8", ";", locale=locale(decimal_mark=","))
# A tibble: 2 × 3
A B C
<dbl> <dbl> <dbl>
1 1.1 1.3 -2
2 5 6.3 -1.8
The resulting data frame is identical to the previous example where we used the regular English number format. It is also worth noting that a file cannot use the same character for both column and decimal separators. This is why the previous example used semicolons to delimit columns (because the comma is already used for the decimal mark).
TODO
TODO
TODO