I’m pleased to announce that readr is now available on CRAN. Readr makes it easy to read many types of tabular data:

You can install it with:

install.packages("readr")

Input

All readr functions work very similarly. There are four important arguments:

Compared to the equivalent base functions, readr functions tend to be around 10x faster. The performance of read_fwf() is particularly good compared to read.fwf(), because read.fwf() uses a rather inefficient strategy. Readr functions should also have much lower memory overhead, because they only make copies when absolutely necessary.
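If you want to check the speed difference on your own machine, a rough timing comparison can be sketched as follows. The data here is randomly generated purely for illustration, and the timings will vary with the size and shape of the file, so treat this as a starting point rather than a proper benchmark.

```r
library(readr)

# Hypothetical test data: 100,000 rows written to a temporary CSV file.
n <- 1e5
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = seq_len(n), value = runif(n)), path, row.names = FALSE)

# Time the base reader and the readr reader on the same file.
base_time  <- system.time(base_df  <- read.csv(path))
readr_time <- system.time(readr_df <- read_csv(path))

base_time["elapsed"]
readr_time["elapsed"]

# Remove the temporary file.
unlink(path)
```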

Output

The output of readr functions has been designed to make your life easier:

Column types

Readr uses a set of heuristics to determine the type of the input columns: it reads the first 100 rows of your dataset and compares them to common formats. This is not guaranteed to be perfect, but it’s fast and it’s a reasonable guess. Currently, readr automatically recognises the following types of columns:

You can also manually specify other column types:

There are two ways to override the default choices with the col_types argument:
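As a minimal sketch, col_types accepts either a compact string with one character per column, or a list of column objects such as col_integer() and col_character(). The inline dataset below is invented for illustration.

```r
library(readr)

# Hypothetical inline data; a string containing newlines is read as literal data.
csv <- "id,code
1,007
2,042
"

# Compact string: one character per column ("i" = integer, "c" = character).
by_string <- read_csv(csv, col_types = "ic")

# Named list of column objects, matched to columns by name.
by_list <- read_csv(csv, col_types = list(
  id   = col_integer(),
  code = col_character()
))

# Compare the resulting column classes.
sapply(by_string, class)
sapply(by_list, class)
```

Reading code as a character column keeps the leading zeros, which would be lost if the column were parsed as a number.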

Dates and times

One of the most helpful features of readr is its ability to seamlessly import dates and date times. It can automatically recognise the following formats:

  • Dates in year-month-day form: 2001-10-20 or 2010/10/15 (or any non-numeric separator). It can’t automatically recognise dates in m/d/y or d/m/y format because they’re ambiguous: is 02/01/2015 the 2nd of January or the 1st of February?

  • Date times in ISO8601 form: e.g. 2001-02-03 04:05:06.07 -0800, 20010203 040506, 20010203 etc. I don’t yet support every possible variant, so please let me know if it doesn’t work for your data.

If your dates are in another format, don’t despair. You can also use col_date() and col_datetime() to explicitly specify a format string. Readr implements its own strptime() equivalent, which supports the following format strings:

  • Year: %Y (4 digits), %y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.

  • Month: %m (2 digits), %b (abbreviated name in current locale), %B (full name in current locale).

  • Day: %d (2 digits), %e (optional leading space).

  • Hour: %H

  • Minutes: %M

  • Seconds: %S (integer seconds), %OS (partial seconds)

  • Time zone: %Z (as name, e.g. America/Chicago), %z (as offset from UTC, e.g. +0800)

  • Non-digits: %. skips one non-digit character, %* skips any number of non-digit characters.

  • Shortcuts: %D = %m/%d/%y, %F = %Y-%m-%d, %R = %H:%M, %T = %H:%M:%S, %x = %y/%m/%d.
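For example, a column of m/d/y dates could be read by passing a format string to col_date() inside col_types. This is only a sketch: the inline data and column names are made up for illustration.

```r
library(readr)

# Hypothetical inline data with dates in month/day/year order.
csv <- "event,when
launch,02/01/2015
review,10/20/2015
"

# Tell readr how to interpret the ambiguous date column explicitly.
df <- read_csv(csv, col_types = list(
  event = col_character(),
  when  = col_date("%m/%d/%Y")
))

class(df$when)
format(df$when)
```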

To practice parsing date times without having to load the file each time, you can use parse_datetime() and parse_date():

parse_date("2015-10-10")
#> [1] "2015-10-10"
parse_datetime("2015-10-10 15:14")
#> [1] "2015-10-10 15:14:00 UTC"

parse_date("02/01/2015", "%m/%d/%Y")
#> [1] "2015-02-01"
parse_date("02/01/2015", "%d/%m/%Y")
#> [1] "2015-01-02"
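The same functions are handy for trying out the less common format strings. The inputs below are invented examples, and the variable names are only for illustration:

```r
library(readr)

# %. skips exactly one non-digit character between fields.
underscored <- parse_date("2015_10_10", "%Y%.%m%.%d")

# Two-digit years: 00-69 map to 2000-2069, 70-99 map to 1970-1999.
recent <- parse_date("01/02/15", "%m/%d/%y")
older  <- parse_date("01/02/85", "%m/%d/%y")

# %F and %T are shortcuts for %Y-%m-%d and %H:%M:%S.
stamp <- parse_datetime("2015-10-10 15:14:30", "%F %T")

underscored
recent
older
stamp
```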

Problems

If there are any problems parsing the file, the read_ function will throw a warning telling you how many problems there are. You can then use the problems() function to access a data frame that gives information about each problem:

csv <- "x,y
1,a
b,2
"

df <- read_csv(csv, col_types = "ii")
#> Warning: 2 problems parsing literal data. See problems(...) for more
#> details.
problems(df)
#>   row col   expected actual
#> 1   1   2 an integer      a
#> 2   2   1 an integer      b
df
#>    x  y
#> 1  1 NA
#> 2 NA  2
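In a script, you may want to check for problems programmatically rather than relying on spotting the warning. One possible approach, sketched here with the same data, is to count the rows of the problems() data frame:

```r
library(readr)

csv <- "x,y
1,a
b,2
"

# Suppress the parsing warning here because the problems are inspected directly.
df <- suppressWarnings(read_csv(csv, col_types = "ii"))

# problems() returns one row per parsing failure.
probs <- problems(df)
n_problems <- nrow(probs)

if (n_problems > 0) {
  message(n_problems, " parsing problem(s) found; inspect `probs` before using `df`.")
}
```

Whether to stop, warn, or carry on with the NA values is a choice that depends on your data; the message() call above is just one option.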

Helper functions

Readr also provides a handful of other useful functions: