1 Why R/data.table?

Short answer: because life’s short and you’ve got stuff to do.

Longer answer:

data.table allows you to handle large amounts of data
it’s fast
maybe most important: its compact syntax allows you to write clean code. This means that you need less time to write code that is easier to maintain.

This overview shows you many aspects of data.table using an airlines database with 337 thousand flights.

2 How it data.table works - the basics

The basic syntax looks a lot like SQL:

df[i, j, by]

select rows i, do something with variable[s] j, group by

Simple, huh?

3 Get the data and set to data.table

We get 337K flights from a data.frame in the nycflights13 library, and convert that to a data.table.

If you need to read data, then data.table’s fread is much faster than base R. See http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r/15058684#15058684. Here we will just use the flight data from a library.

library(data.table)

data(flights, package = "nycflights13")
df = setDT(flights)

df

##         year month day dep_time dep_delay arr_time arr_delay carrier
##      1: 2013     1   1      517         2      830        11      UA
##      2: 2013     1   1      533         4      850        20      UA
##      3: 2013     1   1      542         2      923        33      AA
##      4: 2013     1   1      544        -1     1004       -18      B6
##      5: 2013     1   1      554        -6      812       -25      DL
##     ---                                                             
## 336772: 2013     9  30       NA        NA       NA        NA      9E
## 336773: 2013     9  30       NA        NA       NA        NA      9E
## 336774: 2013     9  30       NA        NA       NA        NA      MQ
## 336775: 2013     9  30       NA        NA       NA        NA      MQ
## 336776: 2013     9  30       NA        NA       NA        NA      MQ
##         tailnum flight origin dest air_time distance hour minute
##      1:  N14228   1545    EWR  IAH      227     1400    5     17
##      2:  N24211   1714    LGA  IAH      227     1416    5     33
##      3:  N619AA   1141    JFK  MIA      160     1089    5     42
##      4:  N804JB    725    JFK  BQN      183     1576    5     44
##      5:  N668DN    461    LGA  ATL      116      762    5     54
##     ---                                                         
## 336772:           3393    JFK  DCA       NA      213   NA     NA
## 336773:           3525    LGA  SYR       NA      198   NA     NA
## 336774:  N535MQ   3461    LGA  BNA       NA      764   NA     NA
## 336775:  N511MQ   3572    LGA  CLE       NA      419   NA     NA
## 336776:  N839MQ   3531    LGA  RDU       NA      431   NA     NA

The first advantage of data.table is immediately visible. When you type “df”, R does not try to spit out all 337K flights at you. Instead, it prints out head and tail.

As you can see in the str(df) call, the object remains a data.frame. data.table just makes it much more efficient. There are no rownames in a data.table, since they can be identical for different records.

class(df)

## [1] "data.table" "data.frame"

3.1 Select columns

You use the “j” position to define your variables. The “.” is a shortcut for “list”. No quotes are necessary.

df[, .(tailnum, flight, origin)]

data.table discourages the use of column index numbers rather than names since this is bad programming practice. In some situations you will still require numbers, in that case use “with = FALSE” as follows:

delay.cols = grepl("delay", names(df)) # find columns with "delay"
df[, delay.cols, with = FALSE]

3.2 Select rows

# simple select
df[origin == "EWR"]

# setting a key also automatically sorts the data.frame
setkey(df, origin)
df["EWR"]

# select multiple rows
df[origin %in% c("LGA", "EWR")]

# grep style select
df[dest %like% "LA"]  # finds LAS and LAX

# between select
df[distance %between% c(1000, 1200)]  # distance between 1000 and 1200 nautical miles

3.3 Group by

In the example below we find the average delay of flights to IAH that have a delay of at least 30 minutes by carrier. This command uses “.N”, which is the data.table shortcut for “count”.

df[arr_delay > 30 & dest == "IAH",      #  <- select rows
      .(mean_delay = mean(arr_delay),   #  <- mean arrival delay
        count = .N),                    #  <- counter arrival delay
      by = carrier]                     # <- summarize by airline

4 Save time with transparent code

Row and column selection, combined with group by will already save you tons of time. But wait, there’s much more.

4.1 Adding and removing fields

4.1.1 Add one new field

The documentation of the nycflights13 dataset does not specify units for distance or airtime. We assume that distance is measured in the traditional nautical miles per hour and airtime in minutes. The speed is then calculated in km/h.

df[, speed := (distance * 1.852) / (air_time / 60)]

You can update this field in place. For example, set NA values for speed to zero, and blank tail numbers to “Unknown”.

df[is.na(speed), speed := 0]
df[nchar(tailnum) < 2, tailnum := "Unknown"]

This is a real time saver.

4.1.2 Remove a field

df[, speed := NULL]

4.1.3 Add and remove more new fields

You can add more fields at the same time. Note that you don’t have to write “df = df[i, j, group]” - the variables are updated in place. Another time saver.

The example below uses library lubridate, which is very efficient handling of date conversions and calculations.

library(lubridate)

df[, `:=` (
  speed = (distance * 1.852) / (air_time / 60),
  date = ymd(paste(year, month, day, sep = "-"))
)]

df[, c("speed", "date") := NULL]

4.1.4 Calculate in steps

The nested command to get a date field in the code above is not very transparent. There is much excitement about the use of “pipes” from library magrittr in combination with library dplyr.

data.table allows for code evaluation in steps and chaining. There is therefore not much need for pipes as an additional concept. However, if you absolutely want to write “%>%” all the time in data.table nobody will stop you.

df[, date := {
  temp = paste(year, month, day, sep = "-")
  date = ymd(temp)
}]

4.1.5 Chaining

You can calculate more fields in steps simultaneously. You can also chain outputs Linux style. So let’s go wild and find pilots that went through the sound barrier [1234 km/h]:

df[, c("date", "speed") := {
  temp = paste(year, month, day, sep = "-")
  date = ymd(temp)
  
  km = distance * 1.852
  hr = air_time / 60
  speed = km / hr

  .(date = date, speed = speed)
}][speed > 1234, .(date, air_time, distance, speed, carrier, tailnum)]

The first part of the code gets dates and km/h in steps. The results are then piped into a command that both finds supersonic flights and makes a subselection of the fields. This results in transparent code that you will still be able to understand when you revisit your script months later.

5 Fast joins and overlap detection

An important part of your time is spent merging different datasets together. data.table can do this really fast- and as usual, there’s more.

5.1 Left join

5.1.1 How it works

We have found a flight that went through the sound barrier and want to know which plane it was. This requires us to get the manufacturer’s data [which are also included in the nycflights13 library] and join that dataset with the flight data.

data(planes, package = "nycflights13")
setDT(planes)

setkey(planes, tailnum)
setkey(df, tailnum)
dx = df[planes]

dx[speed > 1234, .(carrier, tailnum, origin, dest, speed, manufacturer, model, engine)]

First, we get the plane data and set it to data.table. Second, we set the keys for the datasets we want to merge: both contain the field “tailnum”. Then we do a left join on the datasets, and find the supersonic flight. Delta went through the sound barrier with a Boeing 757!

5.1.2 More join operations

What was the first flight per tail number?

setkey(df, tailnum, date)
df[planes, mult = "first"]  # finds the first flight per tail number
df[planes, mult = "last"]

What is the second flight of each tailnumber per carrier?

dx[, .SD[2], by = .(carrier, tailnum)]

The “.SD” means “subset of data”.

What are the first 3 flights per tailnumber per carrier?

dx[, head(.SD, 3), .(carrier, tailnum)]

What was the minimum speed per carrier?

df[, .SD[which.min(speed)], by = carrier]

5.2 Full join

Instead of of joins in the form : A[B], you can also use merge. This allows you to specify field names for A and B if they have different names [e.g. tailnum and TailNum]. You can also specify left, right and full joins.

# left join
x = merge(df, planes, by = "tailnum", all.x = T)

setkey(df, tailnum)
setkey(planes, tailnum)
y = planes[df]

# full join
z = merge(df, planes, by = "tailnum", all = T)

5.3 Overlaps

data.table offers a unique feature to find overlaps between intervals of two data.tables. The feature is best explained from a simple example [?foverlaps for more details].

# make 2 simple data.tables
x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10)
y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)

# find the overlaps
z = foverlaps(x, y, type="any", which=TRUE) ## return overlap indices

print(x)  # data.table 1

##    start end val2
## 1:     5   8    7
## 2:    31  50    8
## 3:    22  25    9
## 4:    16  18   10

print(y)  # data.table 2

##    start end val1
## 1:    10  15    1
## 2:    20  35    2
## 3:    30  45    3

print(z)  # indices overlaps

##    xid yid
## 1:   1  NA
## 2:   2   2
## 3:   2   3
## 4:   3   2
## 5:   4  NA

There are 2 data.tables with intervals [start, end]. The last table shows the overlap indices.

The first row of x has interval [5, 8] whereas none of the rows in y have that interval; all are > 8. The second row of x has interval [31, 50] and this has overlaps with both rows 2 and 3 of y: [20-35] and [30-45] respectively.

The table below shows the joined data.table.

foverlaps(x, y, type="any")

##    start end val1 i.start i.end val2
## 1:    NA  NA   NA       5     8    7
## 2:    20  35    2      31    50    8
## 3:    30  45    3      31    50    8
## 4:    20  35    2      22    25    9
## 5:    NA  NA   NA      16    18   10

6 Data manipulation: from wide to long and back again

6.1 Convert multiple columns in place

A data.table is, like a data.frame, a list of vectors. Therefore, processing multiple fields requires list processing: lapply. You have to indicate which columns you want to have processed separately. This is a cumbersome operation in data.table; we will work from the easiest case to the most difficult one [which is unfortunately the most frequent one].

# convert times in minutes to hours in the delay columns
df[, lapply(.SD, function(x) x/60), .SDcols = c("dep_delay", "arr_delay")]

As you can see, this has given you the delay columns in hours. However, often you want to do this in place - you want to work in the original data.table. Here, we will find the variable names rather than typing them.

df[, arr.delay.corr := arr_delay - dep_delay] # correction, see below

# which variable names have "delay" in them?
idx.numcols = which(names(df) %like% "delay")
names.numcols = names(df)[idx.numcols]

# convert to minutes in place
df[, (names.numcols) := lapply(.SD, function(x) ifelse(is.na(x), x, x/60)), .SDcols = names.numcols]
print(head(df, 3)[, names.numcols, with = F])

##    dep_delay  arr_delay arr.delay.corr
## 1: 1.1333333 1.51666667     0.38333333
## 2: 0.6666667 0.73333333     0.06666667
## 3: 0.4000000 0.03333333    -0.36666667

6.2 Convert from wide to long format (fast melt)

Often times data.tables are in wide format [many variables], whereas you want to have them in long format. You would require long format, for example, to work in ggplot.

First the melt [wide –> long] operation, this is optimized for speed in data.table (even more optimization is scheduled for upcoming version 1.9.6). We do the melt operation for a few carriers only, for plotting purposes later on.

mycols = c("carrier", names.numcols)
m = melt(df[, mycols, with = F], id.var = "carrier")[carrier %chin% c("AA", "NW", "UA", "US")]
print(head(m, 3))

##    carrier  variable      value
## 1:      US dep_delay -0.1166667
## 2:      US dep_delay -0.1500000
## 3:      US dep_delay  0.6500000

Now you can easily plot in ggplot2. How do the arrival and departure delay compare by carrier?

library(ggplot2)
theme_set(theme_bw())

p = ggplot(m, aes(variable, value)) + 
    geom_boxplot() + 
    facet_grid(.~ carrier) + 
    labs(x = "", y = "hours", title = "delays per carrier") +
    theme_bw()

print(p)

In this plot, 50% of the observations can be found in the “box” [the horizontal dash is the median]. The plot shows that the variance in arrival delay is wider [and often negative]. The outliers are shows as dots; AA is the winner here. Note the large arrival delays - if you start 1 hour to late, you already have clocked 1 hour arrival delay too unless you fly faster than usual. The variable “arr.delay.corr” corrects for this.

6.3 Convert long format to wide (fast dcast)

We had just melted the data.table for delays per carrier.

You can now cast this back - the latest and greatest data.table 1.9.6 has some more functionality for this.

For example, you can run multiple functions while casting. The code below calculates both mean and standard deviation for the delays of the selected carriers.

dc = dcast(m, variable ~ carrier, value.var = "value", fun = list(mean, sd), na.rm = T)

7 Finding duplicates

Data cleaning always involves the removal of duplicates. Suppose that we want to work with unique carriers.

data.table finds duplicates per key. This means that your results will depend on the keys that you have set. In the example below we remove the key, test for duplicates, then repeat the process when a key is set.

key(df) # what is the current key? 
setkey(df) # remove any keys
dedup = df[!duplicated(df),]  # find the unique records
nrow(dedup) # how many unique records?
nrow(unique(df, by ="carrier")) # ditto - difference!

setkey(df, carrier)
dedup = df[!duplicated(df),]
nrow(dedup) # now the unique records found corresponds to what you wanted [unique carriers]

8 House keeping

8.1 Copies of data.tables

data.table avoids to make copies. This means that you can get tricked when you make a new data.table out of another one and do things with it, such as deleting variables.

planes.copy = planes
planes.copy[, type := NULL]
"type" %in% names(planes)  # the deleted variable has gone in the original dataset too

## [1] FALSE

planes.deep.copy = copy(planes)
planes.deep.copy[, model := NULL]
"model" %in% names(planes) # deleted variable still there in the original dataset

## [1] TRUE

8.2 Fast renaming of variables

setnames(df, c("air_time", "speed"), c("air_time.min", "speed.km.hr"))

8.3 Setting the order of columns

You sort columns in 2 ways. First there is the aptly named “setcolorder”. Second, you can use the fact that the columns show up in the order in which you queried them.

dfx = df[, .(year, month, day, date)]  # sample data.table

setcolorder(dfx, c("date", "day", "month", "year"))

dfx[, .(date, day, month, year)]

9 References

r cran location: https://cran.r-project.org/web/packages/data.table/index.html

the mothership: https://github.com/Rdatatable/data.table/wiki

bigvis: https://github.com/hadley/bigvis

10 dplyr: the data.table alternative

You can also work in dplyr if you prefer to use commands for every subaction. It may be easier for some people to remember the syntax of dplyr as it looks more “natural”. The resulting code is more verbose.

As you can see below, you will have the result in data.table before you have finished typing in dplyr. data.table needed two commands (“by” and “[]”), whereas dplyr needed 4 (filter, group_by, summarise and %>%). For me, this gets old really fast. For others, the combination of typing “%>%” combined with a verb for every action is easier to work with.

A discussion of differences can be found here: http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly

library(dplyr)

  datatable.way = df[arr_delay > 30 & origin == "JFK", 
                     mean(arr_delay), 
                     by = carrier]

  dplyr.way = df %>% 
                filter(arr_delay > 30 & origin == "JFK") %>%
                group_by(carrier) %>% 
                summarise(mean(arr_delay))

data crunching with data.table

Henk Harmsen

September 11, 2015