data.table package intro.

published on Nov 25, 2013

keywords : R, data.table, split, data munging

Introduction

This post takes a brief look at how much faster data.table is than working with a plain data.frame, and at how simple and readable its syntax is. The content below is based on _Introduction to the data.table package in R_.

Usage

require(data.table)

## Loading required package: data.table

## Warning: package 'data.table' was built under R version 3.0.2

First, here is how a data.frame is created.

DF <- data.frame(x = c("b", "b", "b", "a", "a"), v = rnorm(5))
DF

##   x       v
## 1 b  1.3583
## 2 b  0.9240
## 3 b -0.5034
## 4 a -0.2791
## 5 a -1.9055

A data.table is created in the same way.

DT <- data.table(x = c("b", "b", "b", "a", "a"), v = rnorm(5))
DT

##    x       v
## 1: b -0.4539
## 2: b -0.2239
## 3: b  1.0135
## 4: a  0.3524
## 5: a -0.4969

Compared with a data.frame, a data.table prints its row numbers slightly differently (each is followed by a colon). This is a minor detail. An existing data.frame can also be converted to a data.table.

CARS <- data.table(cars)
head(CARS)

##    speed dist
## 1:     4    2
## 2:     4   10
## 3:     7    4
## 4:     7   22
## 5:     8   16
## 6:     9   10
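The conversion above can also be written in other ways. The following is a minimal sketch, assuming a reasonably recent version of data.table: as.data.table() and setDT() are not used in the original text, and the names cars_copy and cars_df are chosen here purely for illustration.

```r
library(data.table)

# as.data.table() returns a new data.table; the built-in `cars` is untouched.
cars_copy <- as.data.table(cars)

# setDT() converts an existing data.frame in place, by reference.
cars_df <- data.frame(cars)   # a plain data.frame we are free to modify
setDT(cars_df)

is.data.table(cars)        # FALSE
is.data.table(cars_copy)   # TRUE
is.data.table(cars_df)     # TRUE
```

For a small table like cars the difference does not matter, but converting in place avoids holding a second copy of a large data set in memory.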

tables() lists the data.tables currently held in memory, which is handy.

tables()

##      NAME NROW MB COLS       KEY
## [1,] CARS   50 1  speed,dist    
## [2,] DT      5 1  x,v           
## Total: 2MB

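The MB column gives a quick read on memory usage, which makes it easy to spot tables that are no longer needed and remove them to free memory.

Column types can be checked with sapply(), just as with a data.frame.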

sapply(DT, class)

##           x           v 
## "character"   "numeric"

You may have noticed that the KEY column in the tables() output above is empty. That is the topic of the next section.

1. Keys

tables()

##      NAME NROW MB COLS       KEY
## [1,] CARS   50 1  speed,dist    
## [2,] DT      5 1  x,v           
## Total: 2MB

DT

##    x       v
## 1: b -0.4539
## 2: b -0.2239
## 3: b  1.0135
## 4: a  0.3524
## 5: a -0.4969

No key has been set yet. A data.table accepts the same syntax as a data.frame.

DT[2, ]

##    x       v
## 1: b -0.2239

DT[DT$x == "b", ]

##    x       v
## 1: b -0.4539
## 2: b -0.2239
## 3: b  1.0135

However, a data.table has no rownames, so the code below fails.

cat(try(DT["b", ], silent = TRUE))

## Error in `[.data.table`(DT, "b", ) : 
##   When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey.

The error message says that setkey() must be called first.

setkey(DT, x)
DT

##    x       v
## 1: a  0.3524
## 2: a -0.4969
## 3: b -0.4539
## 4: b -0.2239
## 5: b  1.0135

The rows of DT have been reordered by the values of x. To check the key, use haskey(), key(), attributes(), or simply run tables().

tables()

##      NAME NROW MB COLS       KEY
## [1,] CARS   50 1  speed,dist    
## [2,] DT      5 1  x,v        x  
## Total: 2MB

We can confirm that DT now has a key.

By default, all rows in the matching group are returned. The _mult_ argument returns only the first or last row of the group. Also, the comma is now optional.
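The key checks mentioned above can be tried on a small table with fixed values, so the result is easy to verify by eye. This is only a sketch: the table dt and its values are made up for illustration and are not the DT used in this post.

```r
library(data.table)

dt <- data.table(x = c("b", "b", "b", "a", "a"), v = c(10, 20, 30, 40, 50))

haskey(dt)      # FALSE: no key yet
setkey(dt, x)   # sorts dt by x in place and marks x as the key
haskey(dt)      # TRUE
key(dt)         # "x"

dt$x            # "a" "a" "b" "b" "b": the rows were physically reordered
dt$v            # 40 50 10 20 30: ties keep their original relative order
```

Note that setkey() modifies dt by reference, so there is no need to assign its result.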

DT["b", ]

##    x       v
## 1: b -0.4539
## 2: b -0.2239
## 3: b  1.0135

DT["b", mult = "first"]

##    x       v
## 1: b -0.4539

DT["b", mult = "last"]

##    x     v
## 1: b 1.013

DT["b"]

##    x       v
## 1: b -0.4539
## 2: b -0.2239
## 3: b  1.0135
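Because the values of v above are random, the effect of mult is easier to see on a table with known values. A minimal sketch, using a made-up table dt:

```r
library(data.table)

dt <- data.table(x = c("b", "b", "b", "a", "a"), v = c(10, 20, 30, 40, 50))
setkey(dt, x)

all_b   <- dt["b"]                   # every row in group "b"
first_b <- dt["b", mult = "first"]   # only the first row of the group
last_b  <- dt["b", mult = "last"]    # only the last row of the group

all_b$v     # 10 20 30
first_b$v   # 10
last_b$v    # 30
```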

Now let's create a new data.frame, large enough to show the difference between a vector scan and a binary search.

grpsize <- ceiling(1e+07/26^2)  # 10 million rows, 676 groups
tt <- system.time(DF <- data.frame(x = rep(LETTERS, each = 26 * grpsize), y = rep(letters, 
    each = grpsize), v = runif(grpsize * 26^2), stringsAsFactors = FALSE))
tt

##    user  system elapsed 
##    3.37    0.76    4.29

head(DF, 3)

##   x y       v
## 1 A a 0.61433
## 2 A a 0.97483
## 3 A a 0.06573

tail(DF, 3)

##          x y      v
## 10000066 Z z 0.8022
## 10000067 Z z 0.6750
## 10000068 Z z 0.7773

dim(DF)

## [1] 10000068        3

Let's extract an arbitrary group from DF.

tt <- system.time(ans1 <- DF[DF$x == "R" & DF$y == "h", ])  # 'vector scan'
tt

##    user  system elapsed 
##    2.67    0.09    3.07

head(ans1, 3)

##         x y      v
## 6642058 R h 0.1721
## 6642059 R h 0.3673
## 6642060 R h 0.5431

dim(ans1)

## [1] 14793     3

Now let's convert it to a data.table and do the same thing.

DT <- data.table(DF)
setkey(DT, x, y)
ss <- system.time(ans2 <- DT[J("R", "h")])  # 'binary search'
ss

##    user  system elapsed 
##       0       0       0

head(ans2, 3)

##    x y      v
## 1: R h 0.1721
## 2: R h 0.3673
## 3: R h 0.5431

dim(ans2)

## [1] 14793     3

identical(ans1$v, ans2$v)

## [1] TRUE

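Wow: the data.frame took 3.07 seconds (elapsed), while the data.table lookup was too fast to register at the displayed precision (0 seconds). Of course, a vector scan is also possible with data.table.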

system.time(ans1 <- DT[x == "R" & y == "h", ])  # works but is using data.table badly

##    user  system elapsed 
##    1.61    0.08    1.78

system.time(ans2 <- DF[DF$x == "R" & DF$y == "h", ])  # the data.frame way

##    user  system elapsed 
##    2.79    0.14    3.00

mapply(identical, ans1, ans2)

##    x    y    v 
## TRUE TRUE TRUE

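We use a key to take advantage of the fact that the table is sorted, so the matching rows can be found with a binary search.

Using DT$x == 'R' scans the entire column x. Subsetting this way through the [ method is very fast internally in R, but it is still a vectorized operation, that is, a vector scan.

The first argument to [ in a data.table is i. Because the key of DT is set to the columns x and y, passing J('R','h') as i asks for the rows where x is 'R' and y is 'h'. data.table('R','h') and J('R','h') give the same result.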
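To make the idea concrete, here is a small base R sketch of a binary search over a sorted vector. It is only an illustration of the technique, not data.table's actual implementation, and the function find_range is hypothetical.

```r
# Return the first and last index of `target` in the sorted vector `keys`,
# or NULL when it is absent. Each bound takes O(log n) comparisons.
find_range <- function(keys, target) {
  # First position whose value is >= target (strict = FALSE)
  # or > target (strict = TRUE).
  bound <- function(strict) {
    lo <- 1L
    hi <- length(keys) + 1L
    while (lo < hi) {
      mid <- (lo + hi) %/% 2L
      go_right <- if (strict) keys[mid] <= target else keys[mid] < target
      if (go_right) lo <- mid + 1L else hi <- mid
    }
    lo
  }
  first <- bound(FALSE)
  last  <- bound(TRUE) - 1L
  if (first > last) NULL else c(first, last)
}

keys <- rep(c("a", "b", "c"), times = c(2, 3, 4))   # already sorted

find_range(keys, "b")   # 3 5, found by halving the search interval
which(keys == "b")      # 3 4 5, found by testing every element
```

The vector scan compares all elements, whereas the binary search needs only a handful of comparisons per bound, which is why sorting the table once with a key pays off on repeated lookups.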

identical(DT[J("R", "h"), ], DT[data.table("R", "h"), ])

## [1] TRUE
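The same equivalence can be checked on a tiny two-key table with known values. This is a sketch with a made-up table dt; passing a plain list() as i is an additional variant that is not shown in the text above.

```r
library(data.table)

dt <- data.table(
  x = c("A", "A", "R", "R", "R"),
  y = c("a", "h", "g", "h", "h"),
  v = 1:5
)
setkey(dt, x, y)

by_J    <- dt[J("R", "h")]            # J() builds the lookup table
by_dt   <- dt[data.table("R", "h")]   # an explicit data.table works the same way
by_list <- dt[list("R", "h")]         # a plain list in i is also treated as a join

by_J$v      # 4 5
by_dt$v     # 4 5
by_list$v   # 4 5
```

The values in i are matched against the key columns in order: the first against x, the second against y.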

data.table supports both vector scanning and binary search, but it is better to stick to one of them, and on a keyed table that should be the binary search.

2. Fast grouping

The second argument to [ is j, which may consist of one or more expressions whose arguments are (unquoted) column names, as if the column names were variables.

DT[, sum(v)]

## [1] 5000875

When we supply a j expression and a 'by' list of expressions, the j expression is repeated for each 'by' group:

DT[, sum(v), by = x]

##     x     V1
##  1: A 191925
##  2: B 192475
##  3: C 192719
##  4: D 192612
##  5: E 192406
##  6: F 192199
##  7: G 192257
##  8: H 192500
##  9: I 192692
## 10: J 192033
## 11: K 192621
## 12: L 192374
## 13: M 192660
## 14: N 192097
## 15: O 192329
## 16: P 192383
## 17: Q 192363
## 18: R 192350
## 19: S 192085
## 20: T 192307
## 21: U 192315
## 22: V 192229
## 23: W 192189
## 24: X 192334
## 25: Y 192262
## 26: Z 192159
##     x     V1
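In the output above the summed column gets the default name V1. A minimal sketch of how j can return several named columns by wrapping the expressions in list(); the table dt and the column names total and n are made up for illustration.

```r
library(data.table)

dt <- data.table(x = c("b", "b", "b", "a", "a"), v = c(10, 20, 30, 40, 50))
setkey(dt, x)

# Each element of the list becomes one column of the result,
# evaluated once per group of x.
res <- dt[, list(total = sum(v), n = length(v)), by = x]

res$x       # "a" "b"
res$total   # 90 60
res$n       # 2 3
```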

Grouping with by in data.table is fast. Let's compare it with tapply.

ttt <- system.time(tt <- tapply(DT$v, DT$x, sum))
ttt

##    user  system elapsed 
##    7.45    0.50    8.44

sss <- system.time(ss <- DT[, sum(v), by = x])
sss

##    user  system elapsed 
##    0.34    0.00    0.35

head(tt)

##      A      B      C      D      E      F 
## 191925 192475 192719 192612 192406 192199

head(ss)

##    x     V1
## 1: A 191925
## 2: B 192475
## 3: C 192719
## 4: D 192612
## 5: E 192406
## 6: F 192199

identical(as.vector(tt), ss$V1)

## [1] TRUE

In the timings above, tapply took 8.44 seconds and data.table's by took 0.35 seconds (elapsed), so by is about 24 times faster and gives the same result.

Next, let's group by two columns.
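The comparison with tapply relies on both results listing the groups in the same order. DT is keyed by x, so its groups come out sorted; on an unkeyed table, by returns groups in order of first appearance, while tapply always sorts them. The sketch below uses keyby, which is not used in the original text, to get sorted groups without calling setkey() first. The seed and the table size are arbitrary choices made here.

```r
library(data.table)

set.seed(1)   # arbitrary seed, only to make the sketch reproducible
n  <- 1e5     # far smaller than the 10 million rows used above
dt <- data.table(x = sample(LETTERS, n, replace = TRUE), v = runif(n))

by_tapply <- tapply(dt$v, dt$x, sum)       # named array, groups sorted A to Z
by_dt     <- dt[, sum(v), keyby = x]       # keyby sorts the groups and keys the result

same_groups <- identical(names(by_tapply), by_dt$x)
same_sums   <- isTRUE(all.equal(as.vector(by_tapply), by_dt$V1))
```

all.equal() is used rather than identical() for the sums, as a precaution against tiny floating-point differences between the two summation routines.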

ttt <- system.time(tt <- tapply(DT$v, list(DT$x, DT$y), sum))
ttt

##    user  system elapsed 
##    9.23    0.70   10.32

sss <- system.time(ss <- DT[, sum(v), by = "x,y"])
sss

##    user  system elapsed 
##    0.34    0.09    0.45

tt[1:5, 1:5]

##      a    b    c    d    e
## A 7342 7349 7310 7403 7426
## B 7368 7370 7414 7469 7304
## C 7380 7417 7395 7461 7440
## D 7386 7461 7406 7379 7364
## E 7461 7357 7386 7397 7439

head(ss)

##    x y   V1
## 1: A a 7342
## 2: A b 7349
## 3: A c 7310
## 4: A d 7403
## 5: A e 7426
## 6: A f 7372

identical(as.vector(t(tt)), ss$V1)

## [1] TRUE

Here too, by is roughly 23 times faster (10.32 seconds versus 0.45 seconds elapsed), and the syntax is simpler and easier to read.
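The t() in the identical() check above is needed because tapply returns a matrix, which R stores column by column, while by returns one row per (x, y) pair ordered by x and then y. A small sketch with made-up values shows the difference; by = c("x", "y") is used here as an equivalent spelling of by = "x,y".

```r
library(data.table)

dt <- data.table(
  x = rep(c("A", "B"), each = 4),
  y = rep(c("a", "b"), times = 4),
  v = 1:8
)
setkey(dt, x, y)

wide <- tapply(dt$v, list(dt$x, dt$y), sum)   # 2 x 2 matrix: rows are x, columns are y
long <- dt[, sum(v), by = c("x", "y")]        # one row per (x, y) group

as.vector(wide)      # 4 12 6 14: column-major, (A,a) (B,a) (A,b) (B,b)
as.vector(t(wide))   # 4 6 12 14: row-major,    (A,a) (A,b) (B,a) (B,b)
long$V1              # 4 6 12 14: same order as the transposed matrix
```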

3. Fast time series join

Reference

'Introduction to the data.table package in R' (http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf)
'CRAN - package {data.table}' (http://cran.r-project.org/web/packages/data.table/index.html)

Hankuk University of Foreign Studies. Dept. of Statistics. Daewoo Choi Lab. Yong Cha.
한국외국어대학교 통계학과 최대우 교수 연구실 차용
e-mail : yong.stat@gmail.com