data.table
data.table
library("data.table")
The data.table
package describes itself with the tagline:
Extension of data.frame: Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread). Offers a natural and flexible syntax, for faster development.
It has been hailed as the eventual replacement of the beloved data.frame
. But as a newcomer and casual R user, I encountered a steep learning curve, not least because at the time of writing, the documentation is terse.
There is a lot to read:
data.table
tagIn the rest of this document, I heavily annotate the data.table
example (example(data.table)
) given in the package documentation in order to amplify the comments as a convenient refresher or ready-reference.
data.table
is a data.frame
A data.table
can be created in the same way that we create a data.frame
.
DF = data.frame(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
Peeking at the objects we have just created:
DF
## x y v
## 1 a 1 1
## 2 a 3 2
## 3 a 6 3
## 4 b 1 4
## 5 b 3 5
## 6 b 6 6
## 7 c 1 7
## 8 c 3 8
## 9 c 6 9
DT
## x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
## 4: b 1 4
## 5: b 3 5
## 6: b 6 6
## 7: c 1 7
## 8: c 3 8
## 9: c 6 9
data.table
distinguishes itself by adding a :
to the row labels. But otherwise, at heart, a datable is a dataframe. DF
and DT
are:
list
data.table
subclasses data.frame
, hence it is fully a data,frame
identical(dim(DT),dim(DF)) # TRUE
## [1] TRUE
identical(DF$a, DT$a) # TRUE
## [1] TRUE
is.list(DF) # TRUE
## [1] TRUE
is.list(DT) # TRUE
## [1] TRUE
is.data.frame(DT) # TRUE
## [1] TRUE
tables()
functiontables()
(not to be confused with the table()
function of base R) returns a summary of all the data.table
objects currently in the workspace.
tables()
## NAME NROW MB COLS KEY
## [1,] DT 9 1 x,y,v
## Total: 1MB
The summary lists each data.table
object and provides its name, number of rows, memory usage, and the column names. If it has a key (or keys) these are also named. The tables()
function helps to view all the existing data.tables
as the tables in a relational database.
data.table
usage relies heavily on two parameters, i
and j
, the values of which may be implied (i
is the first argument, and j
is the second after [
). The following examples demonstrate the basic use of i
and j
, first as positional arguments, then as explicit keywords to [
. While the examples here are trivial, it may help a newcomer to visualise the i
and j
parameters until they become second nature.
DT[2] # 2nd row
## x y v
## 1: a 3 2
DT[i=2] # 2nd row, as above
## x y v
## 1: a 3 2
DT[,v] # v column (as vector)
## [1] 1 2 3 4 5 6 7 8 9
DT[j=v] # v column (as vector), as above
## [1] 1 2 3 4 5 6 7 8 9
DT[,list(v)] # v column (as data.table)
## v
## 1: 1
## 2: 2
## 3: 3
## 4: 4
## 5: 5
## 6: 6
## 7: 7
## 8: 8
## 9: 9
DT[j=list(v)] # v column (as data.table), as above
## v
## 1: 1
## 2: 2
## 3: 3
## 4: 4
## 5: 5
## 6: 6
## 7: 7
## 8: 8
## 9: 9
DT[2:3, sum(v)] # sum(v) over rows 2 and 3
## [1] 5
DT[i=2:3,j=sum(v)] # sum(v) over rows 2 and 3, as above
## [1] 5
DT[2:5, cat(v,"\n")] # just for j's side effect
## 2 3 4 5
## NULL
DT[i=2:5,j=cat(v,"\n")] # just for j's side effect, as above
## 2 3 4 5
## NULL
DT[c(FALSE,TRUE), v] # even rows, column v, (usual recycling)
## [1] 2 4 6 8
DT[i=c(FALSE,TRUE), j = v] # even rows, column v, (usual recycling), as above
## [1] 2 4 6 8
data.table
inside anotherBy default with
is TRUE
and in that case, j
is evaluated as an expression in the frame of the data.table
and column names may be referred to as variables in the expression. A trivial example, not using column names, will show what happens when with
is TRUE
:
DT[,2^3] # simply evaluates to the cube of 2
## [1] 8
DT[,j = 2^3, with = TRUE] # same as above, only more explicitly
## [1] 8
When with
is FALSE
, j
is interpeted as a vector of names or positions to select. The selection is returned as a new data.table
.
DT[,2,with=FALSE] # extract 2nd column, default j argument
## y
## 1: 1
## 2: 3
## 3: 6
## 4: 1
## 5: 3
## 6: 6
## 7: 1
## 8: 3
## 9: 6
colNum = 2
DT[,j = colNum, with = FALSE] # same as above, more explicit
## y
## 1: 1
## 2: 3
## 3: 6
## 4: 1
## 5: 3
## 6: 6
## 7: 1
## 8: 3
## 9: 6
DT[,j = 8/4, with = FALSE] # a convoluted way of selecting column 2
## y
## 1: 1
## 2: 3
## 3: 6
## 4: 1
## 5: 3
## 6: 6
## 7: 1
## 8: 3
## 9: 6
setkey(DT,x) # set a 1-column key. No quotes, for convenience.
tables() # see result
## NAME NROW MB COLS KEY
## [1,] DT 9 1 x,y,v x
## Total: 1MB
setkeyv(DT,"x") # same (v in setkeyv stands for vector)
v="x"
setkeyv(DT,v) # same
tables() # identical result, demonstrating equivalence of all above
## NAME NROW MB COLS KEY
## [1,] DT 9 1 x,y,v x
## Total: 1MB
Footnote: always use the setkey()
or setkeyv()
functions in order to use the faster methods in data.table
. The traditional forms (e.g. key(DT) <- "x"
) would copy the whole table and would be far slower.
Now that we have keyed the DT
object by the column x
, we can use it to conduct fast searches as demonstrated below.
DT["a"] # binary search (fast)
## x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
DT[x=="a"] # vector scan (slow)
## x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
The code snippet also demonstrates that the vector scan form, the traditional mode of selecting rows from a data.frame
by scanning all the values of a vector still works, but over large datasets will be far slower.
DT[, j=sum(v), by=x] # keyed by
## x V1
## 1: a 6
## 2: b 15
## 3: c 24
DT[, j=sum(v), by=key(DT)] # same
## x V1
## 1: a 6
## 2: b 15
## 3: c 24
DT[, j=sum(v), by=y] # ad hoc by
## y V1
## 1: 1 12
## 2: 3 15
## 3: 6 18
The last example shows that one can group by any arbitrary column, including one that is not used as a key to the data.table
. Below, we show that by
can take an arbitrary expression using column names as variables in order to determine the groupings.
DT[, j=sum(y), by=list(v %% 3)] # by expression (v modulus 3)
## v V1
## 1: 1 3
## 2: 2 9
## 3: 0 18
We can also sub-select groups for j
expressions using key values in i:
DT[i="a", j=sum(v)] # j for one group
## x V1
## 1: a 6
DT[i=c("a","b"), j=sum(v)] # j for two groups
## x V1
## 1: a 6
## 2: b 15
Since j
does not need to return any value, we can exploit its side effects. In the example below, for instance, we plot a scattergram for each group in i
:
par(mfrow=c(1,3))
DT[i=c("a", "b", "c"),
j=plot(y,v, ylim=c(0,10))]
## Empty data.table (0 rows) of 1 col: x
data.table
X = data.table(c("b","c"),foo=c(4,2))
X
## V1 foo
## 1: b 4
## 2: c 2
data.tables
Joining two datatables implies combining them using a common key. In the example below, DT
, keyed on its column x
, is joined to the new datatable, X
. X
does not have a key, so its first column is used as its key.
DT[i = X] # join
## x y v foo
## 1: b 1 4 4
## 2: b 3 5 4
## 3: b 6 6 4
## 4: c 1 7 2
## 5: c 3 8 2
## 6: c 6 9 2
The first column of X
has the values “b” and “c”. The join operation on DT
retains only those rows with “b” or “c” as their keys. Furthermore, a column foo
is created in the joined datatable, importing the relevant value from X$foo
. Similar logic follows when the DT
is joined onto X
:
setkey(X, V1) # explicitly set keys when joining
X[i = DT] # join
## V1 foo y v
## 1: a NA 1 1
## 2: a NA 3 2
## 3: a NA 6 3
## 4: b 4 1 4
## 5: b 4 3 5
## 6: b 4 6 6
## 7: c 2 1 7
## 8: c 2 3 8
## 9: c 2 6 9
j
is evaluated for each row in i
, and this behaviour results in split, apply, combine effects as below:
DT[i = X, j = sum(v)] # join and eval j for each row in i
## x V1
## 1: b 15
## 2: c 24
DT[i = X, mult="first"] # extract first row of each group
## x y v foo
## 1: b 1 4 4
## 2: c 1 7 2
DT[i = X, mult="last"] # extract last row of each group
## x y v foo
## 1: b 6 6 4
## 2: c 6 9 2
DT[i = X, j = sum(v)*foo] # join inherited scope
## x V1
## 1: b 60
## 2: c 48
In the last example, j
is evaluated using variables from both DT
and X
, in this case v
and foo
respectively.
A data.table
may be indexed on more than one column (or key) and the syntax is demonstrated:
setkey(DT,x,y) # 2-column key
setkeyv(DT,c("x","y")) # same
The following are equivalent ways of joining to the first column of a multi-keyed object:
## The `i` keyword is omitted from here onwards for convenience
DT["a"] # join to 1st column of key; also DT[i = "a"]
## x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
DT[J("a")] # same. J() stands for Join, an alias for list()
## x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
DT[list("a")] # same
## x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
DT[.("a")] # same. In the style of package plyr.
## x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
The syntax for joining to more than one column is similar:
DT[J("a",3)] # join to 2 columns
## x y v
## 1: a 3 2
DT[.("a",3)] # same
## x y v
## 1: a 3 2
When one of the columns has missing values, there are three options on how to handle the extra rows:
NA
DT[J("a",3:6)] # join 4 rows (2 missing)
## x y v
## 1: a 3 2
## 2: a 4 NA
## 3: a 5 NA
## 4: a 6 3
DT[J("a",3:6),nomatch=0] # remove missing
## x y v
## 1: a 3 2
## 2: a 6 3
DT[J("a",3:6),roll=TRUE] # rolling join (locf)
## x y v
## 1: a 3 2
## 2: a 4 2
## 3: a 5 2
## 4: a 6 3
DT[,sum(v),by=list(y%%2)] # by expression
## y V1
## 1: 1 27
## 2: 0 18
DT[,.SD[2],by=x] # 2nd row of each group
## x y v
## 1: a 3 2
## 2: b 3 5
## 3: c 3 8
DT[,tail(.SD,2),by=x] # last 2 rows of each group
## x y v
## 1: a 3 2
## 2: a 6 3
## 3: b 3 5
## 4: b 6 6
## 5: c 3 8
## 6: c 6 9
DT[,lapply(.SD,sum),by=x] # apply through columns by group
## x y v
## 1: a 10 6
## 2: b 10 15
## 3: c 10 24
DT[,list(MySum=sum(v),
MyMin=min(v),
MyMax=max(v)),
by=list(x,y%%2)] # by 2 expressions
## x y MySum MyMin MyMax
## 1: a 1 3 1 2
## 2: a 0 3 3 3
## 3: b 1 9 4 5
## 4: b 0 6 6 6
## 5: c 1 15 7 8
## 6: c 0 9 9 9
The results of the [
operator are themselves a data.table
on which more operations can be chained, as in these two examples:
DT[,sum(v),by=x][V1<20] # compound query
## x V1
## 1: a 6
## 2: b 15
DT[,sum(v),by=x][order(-V1)] # ordering results
## x V1
## 1: c 24
## 2: b 15
## 3: a 6
print(DT[,z:=42L]) # add new column by reference
## x y v z
## 1: a 1 1 42
## 2: a 3 2 42
## 3: a 6 3 42
## 4: b 1 4 42
## 5: b 3 5 42
## 6: b 6 6 42
## 7: c 1 7 42
## 8: c 3 8 42
## 9: c 6 9 42
print(DT[,z:=NULL]) # remove column by reference
## x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
## 4: b 1 4
## 5: b 3 5
## 6: b 6 6
## 7: c 1 7
## 8: c 3 8
## 9: c 6 9
print(DT["a",v:=42L]) # subassign to existing v column by reference
## x y v
## 1: a 1 42
## 2: a 3 42
## 3: a 6 42
## 4: b 1 4
## 5: b 3 5
## 6: b 6 6
## 7: c 1 7
## 8: c 3 8
## 9: c 6 9
print(DT["b",v2:=84L]) # subassign to new column by reference (NA padded)
## x y v v2
## 1: a 1 42 NA
## 2: a 3 42 NA
## 3: a 6 42 NA
## 4: b 1 4 84
## 5: b 3 5 84
## 6: b 6 6 84
## 7: c 1 7 NA
## 8: c 3 8 NA
## 9: c 6 9 NA
DT[,m:=mean(v),by=x][] # add new column by reference by group
## x y v v2 m
## 1: a 1 42 NA 42
## 2: a 3 42 NA 42
## 3: a 6 42 NA 42
## 4: b 1 4 84 5
## 5: b 3 5 84 5
## 6: b 6 6 84 5
## 7: c 1 7 NA 8
## 8: c 3 8 NA 8
## 9: c 6 9 NA 8
# NB: postfix [] is shortcut to print()
DT[,.SD[which.min(v)],by=x][] # nested query by group
## x y v v2 m
## 1: a 1 42 NA 42
## 2: b 1 4 84 5
## 3: c 1 7 NA 8
Selecting the inverse rows from the join subset:
DT[!J("a")] # not join
## x y v v2 m
## 1: b 1 4 84 5
## 2: b 3 5 84 5
## 3: b 6 6 84 5
## 4: c 1 7 NA 8
## 5: c 3 8 NA 8
## 6: c 6 9 NA 8
DT[!"a"] # same
## x y v v2 m
## 1: b 1 4 84 5
## 2: b 3 5 84 5
## 3: b 6 6 84 5
## 4: c 1 7 NA 8
## 5: c 3 8 NA 8
## 6: c 6 9 NA 8
DT[!2:4] # all rows other than 2:4
## x y v v2 m
## 1: a 1 42 NA 42
## 2: b 3 5 84 5
## 3: b 6 6 84 5
## 4: c 1 7 NA 8
## 5: c 3 8 NA 8
## 6: c 6 9 NA 8
DT[x!="b" | y!=3] # multiple vector scanning approach, slow
## x y v v2 m
## 1: a 1 42 NA 42
## 2: a 3 42 NA 42
## 3: a 6 42 NA 42
## 4: b 1 4 84 5
## 5: b 6 6 84 5
## 6: c 1 7 NA 8
## 7: c 3 8 NA 8
## 8: c 6 9 NA 8
DT[!J("b",3)] # same result but much faster
## x y v v2 m
## 1: a 1 42 NA 42
## 2: a 3 42 NA 42
## 3: a 6 42 NA 42
## 4: b 1 4 84 5
## 5: b 6 6 84 5
## 6: c 1 7 NA 8
## 7: c 3 8 NA 8
## 8: c 6 9 NA 8