Benchmarking merged.stack, reshape, and melt.data.table+dcast.data.table

The required packages

library(data.table)
library(ggplot2)
library(microbenchmark)
library(reshape2)
library(splitstackshape)
library(zoo)
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

The sample data

dates <- seq(as.Date("2004-01-01"), by = 7, len = 52 * 10)
tickers <- letters
my.df <- data.frame(dates = as.Date(rep_len(dates, length(dates) * length(tickers))), 
    tickers = rep(tickers, each = length(dates)), SVI = runif(length(dates) * 
        length(tickers), min = 0, max = 100))
svi <- read.zoo(my.df, index.column = "dates", split = "tickers")
meansvi <- rollmean(svi, 52, align = "right")
my.zoo <- merge(svi, meansvi)

The reshaping functions

AM_1 <- function() {
    dt <- data.table(index = index(my.zoo), coredata(my.zoo), key = "index")
    setnames(dt, gsub("([a-z].*)\\.([a-z].*)", "\\2_\\1", names(dt)))
    reshape(dt, idvar = "index", varying = 2:ncol(dt), sep = "_", direction = "long")
}
AM_2 <- function() {
    merged.stack(data.frame(index = index(my.zoo), my.zoo), id.vars = "index", 
        var.stubs = c("\\.svi$", "\\.meansvi$"), sep = "\\.svi|\\.meansvi")
}
agstudy <- function() {
    A <- melt(data.table(index = index(my.zoo), coredata(my.zoo)), id = "index")
    A <- A[, `:=`(c("tickers", "variables", "variable"), {
        vv <- unlist(strsplit(as.character(variable), ".", fixed = TRUE))
        list(unlist(vv)[c(TRUE, FALSE)], unlist(vv)[c(FALSE, TRUE)], NULL)
    })]
    dcast.data.table(A, index + tickers ~ variables, value.var = "value")
}
orig <- function() {
    df <- data.frame(index = index(my.zoo), my.zoo)
    df <- stack(df, select = -index)
    df$index <- as.Date(rep_len(index(my.zoo), nrow(df)))
    df <- cbind(df, do.call(rbind, strsplit(as.character(df$ind), ".", fixed = TRUE)))
    reshape(df, direction = "wide", idvar = c("index", "1"), timevar = "2", 
        drop = "ind")
}

The benchmarks

x <- microbenchmark(AM_1(), AM_2(), agstudy(), orig(), times = 20)
x
## Unit: milliseconds
##       expr   min     lq median     uq   max neval
##     AM_1() 229.2 236.47 244.66 255.91 334.1    20
##     AM_2()  84.0  86.69  90.08  93.60 176.0    20
##  agstudy()  52.5  52.95  56.09  60.36 133.5    20
##     orig() 268.7 324.58 350.45 440.32 588.6    20
autoplot(x)

plot of chunk unnamed-chunk-4

Summary:

A quick explanation of the merged.stack solution.

The solution looks like this:

merged.stack(data.frame(index = index(my.zoo), my.zoo), 
             id.vars = "index", 
             var.stubs = c("\\.svi$", "\\.meansvi$"), 
             sep = "\\.svi$|\\.meansvi$")

Here, the var.stubs argument is a regular expression or a character string that uniquely identifies the column groups. For this question, I've set it to be “names that end with .svi and names that end with .meansvi”.

Presently, that is what gets used for the resulting column names in the reshaped data, but I may add an argument later to allow that to be overridden.

Generally, the sep argument would take a character. However, since the names are somewhat backwards from what reshape (and merged.stack) expects, we just “strip out” the part that represents the groups and assume the rest to be the “time” variable. (reshape expects names to be in the form of “group+sep+time”, so “varX_1”, “varX_2”, and so on. The names in this dataset, however, are in the form of “1_varX”, “2_varX”, which needs a little more special treatment.)

By the way, I do plan to update merged.stack to benefit from the perks that Arun has brought to “data.table” with respect to reshaping data, but I'm waiting until it's a stable release that's out on CRAN.