Annotated Examples for data.table

data.table

library("data.table")

The data.table package describes itself with the tagline:

Extension of data.frame: Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread). Offers a natural and flexible syntax, for faster development.

It has been hailed as the eventual replacement of the beloved data.frame. But as a newcomer and casual R user, I encountered a steep learning curve, not least because at the time of writing, the documentation is terse.

There is a lot to read:

In the rest of this document, I heavily annotate the data.table example (example(data.table)) given in the package documentation in order to amplify the comments as a convenient refresher or ready-reference.

data.table is a data.frame

A data.table can be created in the same way that we create a data.frame.

DF = data.frame(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)

Peeking at the objects we have just created:

DF
##   x y v
## 1 a 1 1
## 2 a 3 2
## 3 a 6 3
## 4 b 1 4
## 5 b 3 5
## 6 b 6 6
## 7 c 1 7
## 8 c 3 8
## 9 c 6 9
DT
##    x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
## 4: b 1 4
## 5: b 3 5
## 6: b 6 6
## 7: c 1 7
## 8: c 3 8
## 9: c 6 9

data.table distinguishes itself by adding a : to the row labels. But otherwise, at heart, a datable is a dataframe. DF and DT are:

identical(dim(DT),dim(DF)) # TRUE
## [1] TRUE
identical(DF$a, DT$a)      # TRUE
## [1] TRUE
is.list(DF)                # TRUE
## [1] TRUE
is.list(DT)                # TRUE
## [1] TRUE
is.data.frame(DT)          # TRUE
## [1] TRUE

The tables() function

tables() (not to be confused with the table() function of base R) returns a summary of all the data.table objects currently in the workspace.

tables()
##      NAME NROW MB COLS  KEY
## [1,] DT      9 1  x,y,v    
## Total: 1MB

The summary lists each data.table object and provides its name, number of rows, memory usage, and the column names. If it has a key (or keys) these are also named. The tables() function helps to view all the existing data.tables as the tables in a relational database.

Selecting columns and rows

data.table usage relies heavily on two parameters, i and j, the values of which may be implied (i is the first argument, and j is the second after [). The following examples demonstrate the basic use of i and j, first as positional arguments, then as explicit keywords to [. While the examples here are trivial, it may help a newcomer to visualise the i and j parameters until they become second nature.

DT[2]                          # 2nd row
##    x y v
## 1: a 3 2
DT[i=2]                        # 2nd row, as above
##    x y v
## 1: a 3 2
DT[,v]                         # v column (as vector)
## [1] 1 2 3 4 5 6 7 8 9
DT[j=v]                        # v column (as vector), as above
## [1] 1 2 3 4 5 6 7 8 9
DT[,list(v)]                   # v column (as data.table)
##    v
## 1: 1
## 2: 2
## 3: 3
## 4: 4
## 5: 5
## 6: 6
## 7: 7
## 8: 8
## 9: 9
DT[j=list(v)]                  # v column (as data.table), as above
##    v
## 1: 1
## 2: 2
## 3: 3
## 4: 4
## 5: 5
## 6: 6
## 7: 7
## 8: 8
## 9: 9
DT[2:3, sum(v)]                # sum(v) over rows 2 and 3
## [1] 5
DT[i=2:3,j=sum(v)]             # sum(v) over rows 2 and 3, as above
## [1] 5
DT[2:5, cat(v,"\n")]           # just for j's side effect
## 2 3 4 5
## NULL
DT[i=2:5,j=cat(v,"\n")]        # just for j's side effect, as above
## 2 3 4 5
## NULL
DT[c(FALSE,TRUE), v]           # even rows, column v, (usual recycling)
## [1] 2 4 6 8
DT[i=c(FALSE,TRUE), j = v]     # even rows, column v, (usual recycling), as above
## [1] 2 4 6 8

Sub-select a data.table inside another

By default with is TRUE and in that case, j is evaluated as an expression in the frame of the data.table and column names may be referred to as variables in the expression. A trivial example, not using column names, will show what happens when with is TRUE:

DT[,2^3]                   # simply evaluates to the cube of 2
## [1] 8
DT[,j = 2^3, with = TRUE]  # same as above, only more explicitly
## [1] 8

When with is FALSE, j is interpeted as a vector of names or positions to select. The selection is returned as a new data.table.

DT[,2,with=FALSE]                 # extract 2nd column, default j argument
##    y
## 1: 1
## 2: 3
## 3: 6
## 4: 1
## 5: 3
## 6: 6
## 7: 1
## 8: 3
## 9: 6
colNum = 2
DT[,j = colNum, with = FALSE]     # same as above, more explicit
##    y
## 1: 1
## 2: 3
## 3: 6
## 4: 1
## 5: 3
## 6: 6
## 7: 1
## 8: 3
## 9: 6
DT[,j = 8/4, with = FALSE]        # a convoluted way of selecting column 2
##    y
## 1: 1
## 2: 3
## 3: 6
## 4: 1
## 5: 3
## 6: 6
## 7: 1
## 8: 3
## 9: 6

Indexing data by keys

setkey(DT,x)               # set a 1-column key. No quotes, for convenience.
tables()                   # see result
##      NAME NROW MB COLS  KEY
## [1,] DT      9 1  x,y,v x  
## Total: 1MB
setkeyv(DT,"x")            # same (v in setkeyv stands for vector)
v="x"
setkeyv(DT,v)              # same
tables()                   # identical result, demonstrating equivalence of all above
##      NAME NROW MB COLS  KEY
## [1,] DT      9 1  x,y,v x  
## Total: 1MB

Footnote: always use the setkey() or setkeyv() functions in order to use the faster methods in data.table. The traditional forms (e.g. key(DT) <- "x") would copy the whole table and would be far slower.

Search by key

Now that we have keyed the DT object by the column x, we can use it to conduct fast searches as demonstrated below.

DT["a"]                    # binary search (fast)
##    x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
DT[x=="a"]                 # vector scan (slow)
##    x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3

The code snippet also demonstrates that the vector scan form, the traditional mode of selecting rows from a data.frame by scanning all the values of a vector still works, but over large datasets will be far slower.

Grouping by columns or rows

DT[, j=sum(v), by=x]           # keyed by
##    x V1
## 1: a  6
## 2: b 15
## 3: c 24
DT[, j=sum(v), by=key(DT)]     # same
##    x V1
## 1: a  6
## 2: b 15
## 3: c 24
DT[, j=sum(v), by=y]           # ad hoc by
##    y V1
## 1: 1 12
## 2: 3 15
## 3: 6 18

The last example shows that one can group by any arbitrary column, including one that is not used as a key to the data.table. Below, we show that by can take an arbitrary expression using column names as variables in order to determine the groupings.

DT[, j=sum(y), by=list(v %% 3)]     # by expression (v modulus 3)
##    v V1
## 1: 1  3
## 2: 2  9
## 3: 0 18

We can also sub-select groups for j expressions using key values in i:

DT[i="a", j=sum(v)]             # j for one group
##    x V1
## 1: a  6
DT[i=c("a","b"), j=sum(v)]      # j for two groups
##    x V1
## 1: a  6
## 2: b 15

Since j does not need to return any value, we can exploit its side effects. In the example below, for instance, we plot a scattergram for each group in i:

par(mfrow=c(1,3))
DT[i=c("a", "b", "c"), 
   j=plot(y,v, ylim=c(0,10))]

plot of chunk j_plot

## Empty data.table (0 rows) of 1 col: x

Create a new data.table

X = data.table(c("b","c"),foo=c(4,2))
X
##    V1 foo
## 1:  b   4
## 2:  c   2

Joining data.tables

Joining two datatables implies combining them using a common key. In the example below, DT, keyed on its column x, is joined to the new datatable, X. X does not have a key, so its first column is used as its key.

DT[i = X]                      # join
##    x y v foo
## 1: b 1 4   4
## 2: b 3 5   4
## 3: b 6 6   4
## 4: c 1 7   2
## 5: c 3 8   2
## 6: c 6 9   2

The first column of X has the values “b” and “c”. The join operation on DT retains only those rows with “b” or “c” as their keys. Furthermore, a column foo is created in the joined datatable, importing the relevant value from X$foo. Similar logic follows when the DT is joined onto X:

setkey(X, V1)                  # explicitly set keys when joining
X[i = DT]                      # join
##    V1 foo y v
## 1:  a  NA 1 1
## 2:  a  NA 3 2
## 3:  a  NA 6 3
## 4:  b   4 1 4
## 5:  b   4 3 5
## 6:  b   4 6 6
## 7:  c   2 1 7
## 8:  c   2 3 8
## 9:  c   2 6 9

Operating on Groups during Join

j is evaluated for each row in i, and this behaviour results in split, apply, combine effects as below:

DT[i = X, j = sum(v)]           # join and eval j for each row in i
##    x V1
## 1: b 15
## 2: c 24
DT[i = X, mult="first"]         # extract first row of each group
##    x y v foo
## 1: b 1 4   4
## 2: c 1 7   2
DT[i = X, mult="last"]          # extract last row of each group
##    x y v foo
## 1: b 6 6   4
## 2: c 6 9   2
DT[i = X, j = sum(v)*foo]       # join inherited scope
##    x V1
## 1: b 60
## 2: c 48

In the last example, j is evaluated using variables from both DT and X, in this case v and foo respectively.

Multiple Keys

A data.table may be indexed on more than one column (or key) and the syntax is demonstrated:

setkey(DT,x,y)             # 2-column key
setkeyv(DT,c("x","y"))     # same

Join to First Column of Multiple Keys

The following are equivalent ways of joining to the first column of a multi-keyed object:

## The `i` keyword is omitted from here onwards for convenience

DT["a"]                    # join to 1st column of key; also DT[i = "a"] 
##    x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
DT[J("a")]                 # same. J() stands for Join, an alias for list()
##    x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
DT[list("a")]              # same
##    x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
DT[.("a")]                 # same. In the style of package plyr.
##    x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3

Join to Multiple Columns

The syntax for joining to more than one column is similar:

DT[J("a",3)]               # join to 2 columns
##    x y v
## 1: a 3 2
DT[.("a",3)]               # same
##    x y v
## 1: a 3 2

When one of the columns has missing values, there are three options on how to handle the extra rows:

DT[J("a",3:6)]             # join 4 rows (2 missing)
##    x y  v
## 1: a 3  2
## 2: a 4 NA
## 3: a 5 NA
## 4: a 6  3
DT[J("a",3:6),nomatch=0]   # remove missing
##    x y v
## 1: a 3 2
## 2: a 6 3
DT[J("a",3:6),roll=TRUE]   # rolling join (locf)
##    x y v
## 1: a 3 2
## 2: a 4 2
## 3: a 5 2
## 4: a 6 3

Split, Apply, Combine

DT[,sum(v),by=list(y%%2)]  # by expression
##    y V1
## 1: 1 27
## 2: 0 18
DT[,.SD[2],by=x]           # 2nd row of each group
##    x y v
## 1: a 3 2
## 2: b 3 5
## 3: c 3 8
DT[,tail(.SD,2),by=x]      # last 2 rows of each group
##    x y v
## 1: a 3 2
## 2: a 6 3
## 3: b 3 5
## 4: b 6 6
## 5: c 3 8
## 6: c 6 9
DT[,lapply(.SD,sum),by=x]  # apply through columns by group
##    x  y  v
## 1: a 10  6
## 2: b 10 15
## 3: c 10 24
DT[,list(MySum=sum(v),
         MyMin=min(v),
         MyMax=max(v)),
    by=list(x,y%%2)]       # by 2 expressions
##    x y MySum MyMin MyMax
## 1: a 1     3     1     2
## 2: a 0     3     3     3
## 3: b 1     9     4     5
## 4: b 0     6     6     6
## 5: c 1    15     7     8
## 6: c 0     9     9     9

Compound Queries

The results of the [ operator are themselves a data.table on which more operations can be chained, as in these two examples:

DT[,sum(v),by=x][V1<20]       # compound query
##    x V1
## 1: a  6
## 2: b 15
DT[,sum(v),by=x][order(-V1)]  # ordering results
##    x V1
## 1: c 24
## 2: b 15
## 3: a  6

Add, Delete, Modify Column

print(DT[,z:=42L])         # add new column by reference
##    x y v  z
## 1: a 1 1 42
## 2: a 3 2 42
## 3: a 6 3 42
## 4: b 1 4 42
## 5: b 3 5 42
## 6: b 6 6 42
## 7: c 1 7 42
## 8: c 3 8 42
## 9: c 6 9 42
print(DT[,z:=NULL])        # remove column by reference
##    x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
## 4: b 1 4
## 5: b 3 5
## 6: b 6 6
## 7: c 1 7
## 8: c 3 8
## 9: c 6 9
print(DT["a",v:=42L])      # subassign to existing v column by reference
##    x y  v
## 1: a 1 42
## 2: a 3 42
## 3: a 6 42
## 4: b 1  4
## 5: b 3  5
## 6: b 6  6
## 7: c 1  7
## 8: c 3  8
## 9: c 6  9
print(DT["b",v2:=84L])     # subassign to new column by reference (NA padded)
##    x y  v v2
## 1: a 1 42 NA
## 2: a 3 42 NA
## 3: a 6 42 NA
## 4: b 1  4 84
## 5: b 3  5 84
## 6: b 6  6 84
## 7: c 1  7 NA
## 8: c 3  8 NA
## 9: c 6  9 NA
DT[,m:=mean(v),by=x][]     # add new column by reference by group
##    x y  v v2  m
## 1: a 1 42 NA 42
## 2: a 3 42 NA 42
## 3: a 6 42 NA 42
## 4: b 1  4 84  5
## 5: b 3  5 84  5
## 6: b 6  6 84  5
## 7: c 1  7 NA  8
## 8: c 3  8 NA  8
## 9: c 6  9 NA  8
                           # NB: postfix [] is shortcut to print()

Nested Query by Group

DT[,.SD[which.min(v)],by=x][]  # nested query by group
##    x y  v v2  m
## 1: a 1 42 NA 42
## 2: b 1  4 84  5
## 3: c 1  7 NA  8

Inverse Join

Selecting the inverse rows from the join subset:

DT[!J("a")]                # not join
##    x y v v2 m
## 1: b 1 4 84 5
## 2: b 3 5 84 5
## 3: b 6 6 84 5
## 4: c 1 7 NA 8
## 5: c 3 8 NA 8
## 6: c 6 9 NA 8
DT[!"a"]                   # same
##    x y v v2 m
## 1: b 1 4 84 5
## 2: b 3 5 84 5
## 3: b 6 6 84 5
## 4: c 1 7 NA 8
## 5: c 3 8 NA 8
## 6: c 6 9 NA 8
DT[!2:4]                   # all rows other than 2:4
##    x y  v v2  m
## 1: a 1 42 NA 42
## 2: b 3  5 84  5
## 3: b 6  6 84  5
## 4: c 1  7 NA  8
## 5: c 3  8 NA  8
## 6: c 6  9 NA  8
DT[x!="b" | y!=3]          # multiple vector scanning approach, slow
##    x y  v v2  m
## 1: a 1 42 NA 42
## 2: a 3 42 NA 42
## 3: a 6 42 NA 42
## 4: b 1  4 84  5
## 5: b 6  6 84  5
## 6: c 1  7 NA  8
## 7: c 3  8 NA  8
## 8: c 6  9 NA  8
DT[!J("b",3)]              # same result but much faster
##    x y  v v2  m
## 1: a 1 42 NA 42
## 2: a 3 42 NA 42
## 3: a 6 42 NA 42
## 4: b 1  4 84  5
## 5: b 6  6 84  5
## 6: c 1  7 NA  8
## 7: c 3  8 NA  8
## 8: c 6  9 NA  8