I’m pleased to announce that readr is now available on CRAN. Readr makes it easy to read many types of tabular data:
Delimited files with read_delim(), read_csv(), read_tsv(), and read_csv2().
Fixed width files with read_fwf() and read_table().
Web log files with read_log().
You can install it with:
install.packages("readr")
All readr functions work very similarly. There are four important arguments:
file: the path to the file to read in. This can be a URL, or a path to a zipped, bzipped, xzipped, or gzipped file. You can also pass in a connection or a raw vector.
You can also supply literal data - if the input contains a new line, then the data will be read from the string. Thanks to data.table for this great idea!
library(readr)
read_csv("x,y\n1,2\n3,4")
#>   x y
#> 1 1 2
#> 2 3 4
col_names: this replaces the header argument in base R functions. It has three possible values:
TRUE will use the first row of data as column names.
FALSE will number the columns sequentially.
A character vector will be used as the column names.
col_types: this replaces the colClasses argument in base R functions, and allows you to override readr’s automatic guessing of column types. More on that below.
progress: by default, readr will display a progress bar if it’s estimated that it will take more than 5 seconds to load in the data. You can use progress = FALSE to suppress the progress indicator.
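As a minimal sketch of how these arguments fit together, the call below reads literal data that has no header row. The data and the column names id and label are made up for illustration.

```r
library(readr)

# Literal data with no header row; the values are invented for this sketch.
csv <- "1,a\n2,b\n"

df <- read_csv(
  csv,
  col_names = c("id", "label"),  # supply names instead of reading a header
  col_types = "ic",              # integer column, then character column
  progress = FALSE               # never show the progress bar
)

names(df)
```

Because col_names is a character vector here, the first line of the input is treated as data rather than as a header.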
Compared to the equivalent base functions, readr functions tend to be around 10x faster. The performance of read_fwf() is particularly good compared to read.fwf(), because read.fwf() uses a rather inefficient strategy. Readr functions should also have much lower memory overhead, only making copies when absolutely necessary.
The output of readr functions has been designed to make your life easier:
Characters are never automatically converted to factors (i.e. no more stringsAsFactors = FALSE!).
Column names are left as is, not munged into valid R identifiers (i.e. there is no check.names = TRUE). You can always use backticks to refer to variables with unusual names, e.g. df$`Income ($000)`.
The output has class c("tbl_df", "tbl", "data.frame") so if you also use dplyr you’ll get an enhanced display (i.e. you’ll see just the first ten rows, not the first 10,000!).
Row names are never set.
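A small sketch of what these guarantees look like in practice. The data and column names below are invented for illustration, and the exact class vector may differ between readr versions.

```r
library(readr)

# Invented data with a header that is not a valid R identifier.
df <- read_csv("Income ($000),city\n10,Houston\n20,Austin\n")

names(df)            # the header is kept exactly as written
class(df$city)       # character, not factor
df$`Income ($000)`   # backticks reach the unusually named column
class(df)            # includes "tbl_df" and "data.frame"
```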
Readr uses a set of heuristics to determine the type of the input columns: it reads the first 100 rows of your dataset and compares them to common formats. This is not guaranteed to be perfect, but it’s fast and it’s a reasonable guess. Currently, readr automatically recognises the following types of columns:
col_logical() [l], containing only T, F, TRUE or FALSE.
col_integer() [i], integers.
col_double() [d], doubles.
col_euro_double() [e], “Euro” doubles that use , as decimal separator.
col_date() [D]: Y-m-d dates.
col_datetime() [T]: ISO8601 date times.
col_character() [c], everything else.
You can also manually specify other column types:
col_skip() [_], don’t import this column.
col_date(format), dates with given format. Dates and times are rather complex, so they’re described in more detail in the next section.
col_datetime(format, tz), date times with given format. If the timezone is UTC (the default), this is >20x faster than loading then parsing with strptime().
col_numeric() [n], a sloppy numeric parser that ignores everything apart from 0-9, - and . (this is useful for parsing data formatted as currencies).
col_factor(levels, ordered), parse a fixed set of known values into a factor.
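The one-letter codes in brackets can be combined into a single col_types string. The sketch below uses invented data and assumes that, in your installed version of readr, the n code still maps to the sloppy numeric parser and _ still skips a column.

```r
library(readr)

# Invented data: a currency-formatted price, a column to drop, and a count.
df <- read_csv(
  "price,junk,qty\n$1.50,x,3\n",
  col_types = "n_i"  # n: sloppy numeric, _: skip, i: integer
)

names(df)   # the skipped column is not imported
df$price    # the currency symbol is ignored by the numeric parser
```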
There are two ways to override the default choices with the col_types argument:
With a compact string: "dc__d": read first column as double, second as character, skip the next two and read the last column as a double. (There’s no way to use this form with types that need parameters like date time and factor.)
With a (named) list of col objects:
read_csv("iris.csv", col_types = list(
  Sepal.Length = col_double(),
  Sepal.Width = col_double(),
  Petal.Length = col_double(),
  Petal.Width = col_double(),
  Species = col_factor(c("setosa", "versicolor", "virginica"))
))
Any omitted columns will be parsed automatically, so the previous call is equivalent to:
read_csv("iris.csv", col_types = list(
  Species = col_factor(c("setosa", "versicolor", "virginica"))
))
One of the most helpful features of readr is its ability to seamlessly import dates and date times. It can automatically recognise the following formats:
Dates in year-month-day form: 2001-10-20 or 2010/10/15 (or any non-numeric separator). It can’t automatically recognise dates in m/d/y or d/m/y format because they’re ambiguous: is 02/01/2015 the 2nd of January or the 1st of February?
Date times in ISO8601 form: e.g. 2001-02-03 04:05:06.07 -0800, 20010203 040506, 20010203 etc. I don’t yet support every possible variant, so please let me know if it doesn’t work for your data.
If your dates are in another format, don’t despair. You can also use col_date() and col_datetime() to explicitly specify a format string. Readr implements its own strptime() equivalent which supports the following format strings:
Year: %Y (4 digits), %y (2 digits; 00-69 -> 2000-2069, 70-99 -> 1970-1999).
Month: %m (2 digits), %b (abbreviated name in current locale), %B (full name in current locale).
Day: %d (2 digits), %e (optional leading space).
Hour: %H
Minutes: %M
Seconds: %S (integer seconds), %OS (partial seconds)
Time zone: %Z (as name, e.g. America/Chicago), %z (as offset from UTC, e.g. +0800)
Non-digits: %. skips one non-digit character, %* skips any number of non-digit characters.
Shortcuts: %D = %m/%d/%y, %F = %Y-%m-%d, %R = %H:%M, %T = %H:%M:%S, %x = %y/%m/%d.
To practice parsing date times without having to load the file each time, you can use parse_datetime() and parse_date():
parse_date("2015-10-10")
#> [1] "2015-10-10"
parse_datetime("2015-10-10 15:14")
#> [1] "2015-10-10 15:14:00 UTC"
parse_date("02/01/2015", "%m/%d/%Y")
#> [1] "2015-02-01"
parse_date("02/01/2015", "%d/%m/%Y")
#> [1] "2015-01-02"
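Once a format string works with parse_date(), the same string can be handed to col_date() when reading a file. This is a minimal sketch with invented literal data; the column name visited is hypothetical.

```r
library(readr)

# Invented data: one ambiguous m/d/y date under a made-up column name.
df <- read_csv(
  "visited\n02/01/2015\n",
  col_types = list(visited = col_date("%m/%d/%Y"))
)

df$visited   # a Date, interpreted as month/day/year
```

Swapping the format to "%d/%m/%Y" would read the same text as the 2nd of January instead.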
If there are any problems parsing the file, the read_ function will throw a warning telling you how many problems there are. You can then use the problems() function to access a data frame that gives information about each problem:
csv <- "x,y
1,a
b,2
"
df <- read_csv(csv, col_types = "ii")
#> Warning: 2 problems parsing literal data. See problems(...) for more
#> details.
problems(df)
#>   row col   expected actual
#> 1   1   2 an integer      a
#> 2   2   1 an integer      b
df
#>    x  y
#> 1  1 NA
#> 2 NA  2
Readr also provides a handful of other useful functions:
read_lines() works the same way as readLines(), but is a lot faster.
read_file() reads a complete file into a string.
type_convert() attempts to coerce all character columns to their appropriate type. This is useful if you need to do some manual munging (e.g. with regular expressions) to turn strings into numbers.
write_csv() writes a data frame out to a csv file. It’s quite a bit faster than write.csv() and it never writes row.names. It escapes " embedded in strings in a way that read_csv() can read.
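The helpers above can be tried together on a temporary file. This is only a sketch: the data frame and its column names are invented, and the type that type_convert() guesses for id may differ between readr versions.

```r
library(readr)

# Invented example data with a double quote embedded in one string.
df <- data.frame(
  id = c("1", "2"),
  quote = c("plain", "say \"hi\""),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")
write_csv(df, path)         # no row names are written

lines <- read_lines(path)   # character vector, one element per line
text <- read_file(path)     # the whole file as a single string

# Reading the file back should recover the embedded quotes unchanged.
round_trip <- read_csv(path, col_types = "cc")

# type_convert() re-guesses the types of character columns such as id.
converted <- type_convert(df)
```

Comparing round_trip$quote with df$quote is a quick way to check that the escaping survived the round trip.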