Short comparison of `unique.data.table` with `unique.data.frame`

This is a very short gist comparing the performance of `unique` on a data.table and on a data.frame. On small data sets the difference is negligible. Once the data reaches about a million rows (1e6), however, it becomes easy to notice.

The comparison here uses the development version, v1.8.11, of data.table. You can get it from here. Let's first create a reasonably sized data.frame:

set.seed(45)
DF <- data.frame(x = as.numeric(sample(10000, 1e+06, TRUE)),
                 y = as.numeric(sample(1000, 1e+06, TRUE)))

Now, call unique on DF.

system.time(u.DF <- unique(DF))

##    user  system elapsed 
##   7.834   0.120   8.060

head(u.DF)

##      x   y
## 1 6334 302
## 2 3176 515
## 3 2410 399
## 4 3785 732
## 5 3522 220
## 6 2978 313
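For context, `unique` on a data.frame in base R amounts to dropping the rows that `duplicated` flags. Here is a minimal sketch of that equivalence on a tiny, made-up data frame (the names `df`, `dup` and `by_hand` are illustrative only and not part of the benchmark):

```r
# Small hypothetical data frame with repeated rows
df <- data.frame(x = c(1, 2, 1, 3, 2), y = c(10, 20, 10, 30, 20))

# duplicated() flags every row that repeats an earlier row
dup <- duplicated(df)

# Keeping only the non-flagged rows reproduces unique(df),
# including the original row names of the first occurrences
by_hand <- df[!dup, , drop = FALSE]
identical(by_hand, unique(df))
```

Rows 3 and 5 repeat rows 1 and 2, so only rows 1, 2 and 4 survive, in their original order.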

Now let's create a data.table, DT, from DF and run the same call on it.

require(data.table)
# Loading required package: data.table data.table 1.8.11 For help type:
# help('data.table')
DT <- data.table(DF)
system.time(u.DT <- unique(DT))

##    user  system elapsed 
##   0.508   0.044   0.555

head(u.DT)

##       x   y
## 1: 6334 302
## 2: 3176 515
## 3: 2410 399
## 4: 3785 732
## 5: 3522 220
## 6: 2978 313
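The two timings can be turned into a speed-up factor with a quick calculation. The numbers below are simply copied from the `system.time` output shown above; they come from a single run on one machine, so treat the ratios as rough:

```r
# Timings in seconds, copied from the system.time() output above
elapsed_df <- 8.060   # unique(DF), elapsed
elapsed_dt <- 0.555   # unique(DT), elapsed
user_df    <- 7.834   # unique(DF), user
user_dt    <- 0.508   # unique(DT), user

# Speed-up = data.frame time divided by data.table time
speedup_elapsed <- elapsed_df / elapsed_dt
speedup_user    <- user_df / user_dt
round(c(elapsed = speedup_elapsed, user = speedup_user), 1)
```

This works out to roughly 14.5x on elapsed time and 15.4x on user time.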

The speed-up is more than 13x (8.060 s vs 0.555 s elapsed, which is roughly 14.5x). Are the results identical?

# identical?
identical(u.DF$x, u.DT$x)

## [1] TRUE

identical(u.DF$y, u.DT$y)

## [1] TRUE
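The column-by-column checks above can also be written as a single comparison over all columns. This is only a sketch on a much smaller, made-up data set (`DF_small`, `DT_small` and the other names are hypothetical), and it assumes data.table is installed and that the table has no key set, as in the example above:

```r
library(data.table)

# Small stand-in data, far smaller than the 1e6-row example
set.seed(45)
DF_small <- data.frame(x = as.numeric(sample(10, 1000, TRUE)),
                       y = as.numeric(sample(5, 1000, TRUE)))
DT_small <- data.table(DF_small)

u_df <- unique(DF_small)
u_dt <- unique(DT_small)

# Row names and classes differ between a data.frame and a data.table,
# so compare only the underlying column vectors, pairwise
same_cols <- all(mapply(identical, as.list(u_df), as.list(u_dt)))
same_cols
```

Comparing the column vectors rather than the objects themselves sidesteps the differences in class and row names, which would otherwise make `identical(u_df, u_dt)` fail even when the data match.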
