I have often discussed with friends and colleagues various problems of operating on large datasets in R, i.e., lists, data frames, tables, matrices, etc. Most of the time, people complained that there is no good way in R to do things such as adding new columns whose values are determined by other column(s), removing certain columns based on some criteria, or other interesting ideas (annoying ideas from the boss).
So, I decided to prepare this short collection and give some examples of similar problems I have met in the data preparation process in R. The purpose here is not to show that R is better than other languages, but to show that R is not bad at all.
library(dplyr)
library(data.table)
N <- 2e6 # size of DT
set.seed(1)
foo <- function() paste(sample(letters, sample(5:9, 1), TRUE), collapse="")
ch <- replicate(1e4, foo())
ch <- unique(ch)
DF <- data.frame(v1 = as.numeric(sample(c(rnorm(1e5)*1e4), N, TRUE)),
v2 = as.numeric(sample(rnorm(1e6), N, TRUE)),
v3 = sample(c(1e4:1e5), N, TRUE),
v4 = sample(c(round(rnorm(100,10,2))), N, TRUE),
d = sample(ch, N, TRUE))
DT = data.table(DF)
head(DF)
## v1 v2 v3 v4 d
## 1 -23281.328 -1.78530782 82030 7 ijmvah
## 2 1523.288 -2.00057071 58461 12 mdsqerwu
## 3 -5520.508 0.79477797 81160 4 jdreyx
## 4 -18413.247 -0.34343814 60370 12 wtefdp
## 5 6847.334 2.14037666 75845 13 lsbwmcwr
## 6 12528.141 0.01408064 27663 12 fuvhaufcd
print(object.size(DF), units="MB")
## 61.6 Mb
head(DT)
## v1 v2 v3 v4 d
## 1: -23281.328 -1.78530782 82030 7 ijmvah
## 2: 1523.288 -2.00057071 58461 12 mdsqerwu
## 3: -5520.508 0.79477797 81160 4 jdreyx
## 4: -18413.247 -0.34343814 60370 12 wtefdp
## 5: 6847.334 2.14037666 75845 13 lsbwmcwr
## 6: 12528.141 0.01408064 27663 12 fuvhaufcd
print(object.size(DT), units="MB")
## 61.6 Mb
The key idea behind the apply family in R is to avoid explicit loops. The apply family belongs to the base R package and is populated with functions that manipulate slices of data from matrices, arrays, lists, and data frames in a repetitive way. These functions let you cross the data in a number of ways while avoiding explicit loop constructs: they act on an input list, matrix, or array and apply a named function with one or several optional arguments.
The apply functions form the basis of more complex combinations and help to perform operations with very few lines of code. The family comprises: apply, lapply, sapply, vapply, mapply, rapply, and tapply.
There are many nice tutorials online about the apply family of functions, such as the R tutorial on the Apply family of functions.
Although apply, sapply, and lapply belong to the same function family, keep in mind that they work on different types of data underneath. sapply and lapply accept a list as input by default, which means that when given a data frame they treat each column as a list element. apply, on the other hand, expects an array (or a matrix, when the array has 2 dimensions), which must have the same type for all elements; when you run it on a data.frame, it simply calls as.matrix first.
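To see this coercion directly, here is a minimal sketch using the DF defined above:
# as.matrix() coerces the mixed-type data.frame to a character matrix,
# which is why apply() reports "character" for every column below
m <- as.matrix(head(DF))
class(m[, "v1"]) # "character" -- even the numeric column was coerced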
Here is an example: I want to check the object class of each column in DF.
system.time(a1 <- apply(DF, 2, class))
## user system elapsed
## 13.61 0.22 13.87
a1
## v1 v2 v3 v4 d
## "character" "character" "character" "character" "character"
system.time(a2 <- apply(DF[,-c(5)], 2, class))
## user system elapsed
## 0.22 0.05 0.26
a2
## v1 v2 v3 v4
## "numeric" "numeric" "numeric" "numeric"
system.time(sa <- sapply(DF, class))
## user system elapsed
## 0 0 0
sa
## v1 v2 v3 v4 d
## "numeric" "numeric" "integer" "numeric" "factor"
system.time(la <- lapply(DF, class))
## user system elapsed
## 0 0 0
la
## $v1
## [1] "numeric"
##
## $v2
## [1] "numeric"
##
## $v3
## [1] "integer"
##
## $v4
## [1] "numeric"
##
## $d
## [1] "factor"
As we can see from the output, directly using apply not only gives wrong results (with no error message!), but also runs much more slowly than sapply or lapply.
Let us consider another scenario: you want to apply some calculation to the columns of your data. In other words, you want to create a new column whose values are the output of a function that takes other columns as inputs. Most of the time, you can use the summarizing functions in R combined with apply to get what you want. But what if you want to use a customized function?
For example, I want to calculate a new variable v_new = v1/(v3 - 2.0).
system.time(DF$v_new <- apply(DF[,c(1,3)], 1, function(x) x[1]/(x[2]-2.0)))
## user system elapsed
## 17.25 0.21 17.59
head(DF)
## v1 v2 v3 v4 d v_new
## 1 -23281.328 -1.78530782 82030 7 ijmvah -0.28382172
## 2 1523.288 -2.00057071 58461 12 mdsqerwu 0.02605738
## 3 -5520.508 0.79477797 81160 4 jdreyx -0.06802174
## 4 -18413.247 -0.34343814 60370 12 wtefdp -0.30501669
## 5 6847.334 2.14037666 75845 13 lsbwmcwr 0.09028300
## 6 12528.141 0.01408064 27663 12 fuvhaufcd 0.45291714
DF$v_new = NULL
system.time(DF[,c("v_new1", "v_new2")] <- sapply(DF[,c(1,2)], function(x, y) x/(y-2.0), y=DF[,3]))
## user system elapsed
## 2.73 0.16 2.90
head(DF)
## v1 v2 v3 v4 d v_new1 v_new2
## 1 -23281.328 -1.78530782 82030 7 ijmvah -0.28382172 -2.176461e-05
## 2 1523.288 -2.00057071 58461 12 mdsqerwu 0.02605738 -3.422177e-05
## 3 -5520.508 0.79477797 81160 4 jdreyx -0.06802174 9.792971e-06
## 4 -18413.247 -0.34343814 60370 12 wtefdp -0.30501669 -5.689076e-06
## 5 6847.334 2.14037666 75845 13 lsbwmcwr 0.09028300 2.822115e-05
## 6 12528.141 0.01408064 27663 12 fuvhaufcd 0.45291714 5.090431e-07
DF$v_new1 = NULL
DF$v_new2 = NULL
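As an aside, for a simple arithmetic transformation like this one, no apply machinery is needed at all: base R arithmetic is vectorized over whole columns and should be the fastest option of the three. A minimal sketch:
# Vectorized base R: operates on the whole columns at once, no per-row function calls
system.time(DF$v_new <- DF$v1 / (DF$v3 - 2.0))
DF$v_new = NULL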
dplyr is a library written by Hadley Wickham. It is a very nice and powerful package; in fact, after you master it, you may never want to use the base functions again.
The dplyr package makes these steps fast and easy:
By constraining your options, it simplifies how you can think about common data manipulation tasks.
It provides simple "verbs", functions that correspond to the most common data manipulation tasks, to help you translate those thoughts into code.
It uses efficient data storage backends, so you spend less time waiting for the computer.
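As a quick illustration (a sketch on the DF defined above, not a benchmark), the core verbs chain together with %>% and read almost like plain English:
# Four of the basic verbs: filter rows, select columns, derive a column, sort
DF %>%
  filter(v4 > 10) %>%     # keep rows matching a condition
  select(v1, v3, d) %>%   # keep only some columns
  mutate(r = v1 / v3) %>% # add a derived column
  arrange(desc(r)) %>%    # sort by it
  head()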
Let's do the same thing again on DF, this time in dplyr.
DFP = tbl_df(DF)
myf <- function(x1, x2) { x1/(x2 - 2.0) }
system.time(DFP <- DFP %>% mutate(v_new = myf(v1, v3)))
## user system elapsed
## 0.03 0.00 0.03
head(DFP)
## Source: local data frame [6 x 6]
##
## v1 v2 v3 v4 d v_new
## (dbl) (dbl) (int) (dbl) (fctr) (dbl)
## 1 -23281.328 -1.78530782 82030 7 ijmvah -0.28382172
## 2 1523.288 -2.00057071 58461 12 mdsqerwu 0.02605738
## 3 -5520.508 0.79477797 81160 4 jdreyx -0.06802174
## 4 -18413.247 -0.34343814 60370 12 wtefdp -0.30501669
## 5 6847.334 2.14037666 75845 13 lsbwmcwr 0.09028300
## 6 12528.141 0.01408064 27663 12 fuvhaufcd 0.45291714
DFP$v_new = NULL
system.time(DFP <- DFP %>% mutate(v_new1 = myf(v1, v3), v_new2 = myf(v2, v3)))
## user system elapsed
## 0.01 0.00 0.02
head(DFP)
## Source: local data frame [6 x 7]
##
## v1 v2 v3 v4 d v_new1 v_new2
## (dbl) (dbl) (int) (dbl) (fctr) (dbl) (dbl)
## 1 -23281.328 -1.78530782 82030 7 ijmvah -0.28382172 -2.176461e-05
## 2 1523.288 -2.00057071 58461 12 mdsqerwu 0.02605738 -3.422177e-05
## 3 -5520.508 0.79477797 81160 4 jdreyx -0.06802174 9.792971e-06
## 4 -18413.247 -0.34343814 60370 12 wtefdp -0.30501669 -5.689076e-06
## 5 6847.334 2.14037666 75845 13 lsbwmcwr 0.09028300 2.822115e-05
## 6 12528.141 0.01408064 27663 12 fuvhaufcd 0.45291714 5.090431e-07
DFP$v_new1 = NULL
DFP$v_new2 = NULL
Let us mention the efficiency again: using the base function apply, ~17 s; using the base function sapply, ~3 s; using dplyr's mutate, ~0.03 s; all on 2,000,000 rows of data!
We can also use the mutate_each function in dplyr to do the same operation on multiple columns.
system.time(DFP <- DFP %>% mutate_each(funs(v_new = myf(., v3)), -d, -v4, -v3))
## user system elapsed
## 0.03 0.00 0.03
head(DFP)
## Source: local data frame [6 x 5]
##
## v1 v2 v3 v4 d
## (dbl) (dbl) (int) (dbl) (fctr)
## 1 -0.28382172 -2.176461e-05 82030 7 ijmvah
## 2 0.02605738 -3.422177e-05 58461 12 mdsqerwu
## 3 -0.06802174 9.792971e-06 81160 4 jdreyx
## 4 -0.30501669 -5.689076e-06 60370 12 wtefdp
## 5 0.09028300 2.822115e-05 75845 13 lsbwmcwr
## 6 0.45291714 5.090431e-07 27663 12 fuvhaufcd
Note that when using mutate_each, the default behavior is to transform the existing columns in place. You can manually specify a new name per column (practical only for a few columns), or you can use a named vector to create additional columns with new names.
DFP = tbl_df(DF)
system.time(DFP <-DFP %>% mutate_each(funs(v_new = myf(., v3)), v_new1 = v1, v_new2 = v2, -d, -v4, -v3))
## user system elapsed
## 0.03 0.00 0.03
head(DFP)
## Source: local data frame [6 x 7]
##
## v1 v2 v3 v4 d v_new1 v_new2
## (dbl) (dbl) (int) (dbl) (fctr) (dbl) (dbl)
## 1 -23281.328 -1.78530782 82030 7 ijmvah -0.28382172 -2.176461e-05
## 2 1523.288 -2.00057071 58461 12 mdsqerwu 0.02605738 -3.422177e-05
## 3 -5520.508 0.79477797 81160 4 jdreyx -0.06802174 9.792971e-06
## 4 -18413.247 -0.34343814 60370 12 wtefdp -0.30501669 -5.689076e-06
## 5 6847.334 2.14037666 75845 13 lsbwmcwr 0.09028300 2.822115e-05
## 6 12528.141 0.01408064 27663 12 fuvhaufcd 0.45291714 5.090431e-07
Or
DFP = tbl_df(DF)
vars <- names(DFP)[1:2]
vars <- setNames(vars, paste0(vars, "_new"))
DFP %>% mutate_each_(funs(v_new = myf(., v3)), vars) %>% head
## Source: local data frame [6 x 7]
##
## v1 v2 v3 v4 d v1_new v2_new
## (dbl) (dbl) (int) (dbl) (fctr) (dbl) (dbl)
## 1 -23281.328 -1.78530782 82030 7 ijmvah -0.28382172 -2.176461e-05
## 2 1523.288 -2.00057071 58461 12 mdsqerwu 0.02605738 -3.422177e-05
## 3 -5520.508 0.79477797 81160 4 jdreyx -0.06802174 9.792971e-06
## 4 -18413.247 -0.34343814 60370 12 wtefdp -0.30501669 -5.689076e-06
## 5 6847.334 2.14037666 75845 13 lsbwmcwr 0.09028300 2.822115e-05
## 6 12528.141 0.01408064 27663 12 fuvhaufcd 0.45291714 5.090431e-07
You can also put more control on which variables the transformation is applied to.
Let's assume I want to put units after the values in v1.
DFP = tbl_df(DF)
DFP %>% mutate_each(funs(paste0(., ".inches")), matches("v1")) %>% head
## Source: local data frame [6 x 5]
##
## v1 v2 v3 v4 d
## (chr) (dbl) (int) (dbl) (fctr)
## 1 -23281.3282688267.inches -1.78530782 82030 7 ijmvah
## 2 1523.28809520714.inches -2.00057071 58461 12 mdsqerwu
## 3 -5520.50827473118.inches 0.79477797 81160 4 jdreyx
## 4 -18413.2474744393.inches -0.34343814 60370 12 wtefdp
## 5 6847.33365361713.inches 2.14037666 75845 13 lsbwmcwr
## 6 12528.141100736.inches 0.01408064 27663 12 fuvhaufcd
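The full-precision labels above are hard to read; a small variation (a sketch, same idea) is to round before pasting:
# Round first so the pasted labels stay readable
DFP %>% mutate_each(funs(paste0(round(., 2), ".inches")), matches("v1")) %>% head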
The data.table package is written by Matt Dowle; it is designed to manipulate large datasets in R at speeds comparable to those of Python, etc. It is not always intuitive to translate traditional R operations into data.table methods. I am still a data.table newbie, but I'd like to use it more and more often in the future, because it is really, really fast. See this benchmark comparing grouping in data.table, dplyr, and pandas (Python).
First, let's repeat the previous task using data.table.
names(DT) = c("v_1", "v_2", "v3", "v4", "d")
head(DT)
## v_1 v_2 v3 v4 d
## 1: -23281.328 -1.78530782 82030 7 ijmvah
## 2: 1523.288 -2.00057071 58461 12 mdsqerwu
## 3: -5520.508 0.79477797 81160 4 jdreyx
## 4: -18413.247 -0.34343814 60370 12 wtefdp
## 5: 6847.334 2.14037666 75845 13 lsbwmcwr
## 6: 12528.141 0.01408064 27663 12 fuvhaufcd
system.time(for(i in grep("_", names(DT), value=TRUE))
  DT[, (i) := get(i)/(v3-2.0)])
## user system elapsed
## 0.03 0.00 0.03
head(DT)
## v_1 v_2 v3 v4 d
## 1: -0.28382172 -2.176461e-05 82030 7 ijmvah
## 2: 0.02605738 -3.422177e-05 58461 12 mdsqerwu
## 3: -0.06802174 9.792971e-06 81160 4 jdreyx
## 4: -0.30501669 -5.689076e-06 60370 12 wtefdp
## 5: 0.09028300 2.822115e-05 75845 13 lsbwmcwr
## 6: 0.45291714 5.090431e-07 27663 12 fuvhaufcd
DT = data.table(DF)
names(DT) = c("v_1", "v_2", "v3", "v4", "d")
system.time( x <- DT[, lapply(.SD, myf, DT[,v3]), .SDcols=c("v_1","v_2")])
## user system elapsed
## 0.03 0.00 0.03
head(x)
## v_1 v_2
## 1: -0.28382172 -2.176461e-05
## 2: 0.02605738 -3.422177e-05
## 3: -0.06802174 9.792971e-06
## 4: -0.30501669 -5.689076e-06
## 5: 0.09028300 2.822115e-05
## 6: 0.45291714 5.090431e-07
Suppose we have another dataset with the two fields v4 and d, plus three new fields: x1, x2, and x3. We want to merge it with our original DF dataset so that we're looking at v1, v2, and v3 only for the rows with v4 = 4 and d = "ijmvah". In regular R, we can do this.
DF.new = DF[DF$v4==4 & DF$d=="ijmvah",c("v4", "d")]
DF.new$x1 = sample(rnorm(10), dim(DF.new)[1] , TRUE)
DF.new$x2 = sample(rnorm(10), dim(DF.new)[1] , TRUE)
DF.new$x3 = sample(rnorm(10), dim(DF.new)[1] , TRUE)
head(DF.new)
## v4 d x1 x2 x3
## 459142 4 ijmvah 1.1209419 -0.6532428 -1.4455367
## 776932 4 ijmvah 0.6662657 0.3131157 -1.4455367
## 777900 4 ijmvah 0.5943943 -0.3405394 0.4608586
## 1277727 4 ijmvah 0.6662657 0.1396439 -1.4455367
system.time(DF.merge <- merge(DF, DF.new, by=c('v4','d')))
## user system elapsed
## 4.49 0.00 4.51
head(DF.merge)
## v4 d v1 v2 v3 x1 x2 x3
## 1 4 ijmvah 23391.483 0.24720625 21661 1.1209419 -0.6532428 -1.4455367
## 2 4 ijmvah 23391.483 0.24720625 21661 0.6662657 0.3131157 -1.4455367
## 3 4 ijmvah 23391.483 0.24720625 21661 0.5943943 -0.3405394 0.4608586
## 4 4 ijmvah 23391.483 0.24720625 21661 0.6662657 0.1396439 -1.4455367
## 5 4 ijmvah 3529.048 0.08768831 74587 1.1209419 -0.6532428 -1.4455367
## 6 4 ijmvah 3529.048 0.08768831 74587 0.6662657 0.3131157 -1.4455367
Now let's check what happens when we use the merge() function on a data.table.
DT = data.table(DF)
DT.new = data.table(DF.new)
system.time(DT.merge <- merge(DT, DT.new, by =c("v4","d")))
## user system elapsed
## 0.15 0.00 0.15
head(DT.merge)
## v4 d v1 v2 v3 x1 x2 x3
## 1: 4 ijmvah 23593.24 -0.9844648 53589 1.1209419 -0.6532428 -1.4455367
## 2: 4 ijmvah 23593.24 -0.9844648 53589 0.6662657 0.3131157 -1.4455367
## 3: 4 ijmvah 23593.24 -0.9844648 53589 0.5943943 -0.3405394 0.4608586
## 4: 4 ijmvah 23593.24 -0.9844648 53589 0.6662657 0.1396439 -1.4455367
## 5: 4 ijmvah -16576.53 -0.7077086 17688 1.1209419 -0.6532428 -1.4455367
## 6: 4 ijmvah -16576.53 -0.7077086 17688 0.6662657 0.3131157 -1.4455367
See, the job goes from ~4 s down to ~0.15 s. Can we do it faster? Let's try. The nice thing about data.table is that you can set key(s) before you do anything on it; most of the time, that makes things faster. This time, it takes ~0.07 s.
setkey(DT, v4, d)
setkey(DT.new, v4, d)
system.time (DT.merge <- merge(DT, DT.new, by =c("v4","d")))
## user system elapsed
## 0.06 0.00 0.07
head(DT.merge)
## v4 d v1 v2 v3 x1 x2 x3
## 1: 4 ijmvah 23593.24 -0.9844648 53589 1.1209419 -0.6532428 -1.4455367
## 2: 4 ijmvah 23593.24 -0.9844648 53589 0.6662657 0.3131157 -1.4455367
## 3: 4 ijmvah 23593.24 -0.9844648 53589 0.5943943 -0.3405394 0.4608586
## 4: 4 ijmvah 23593.24 -0.9844648 53589 0.6662657 0.1396439 -1.4455367
## 5: 4 ijmvah -16576.53 -0.7077086 17688 1.1209419 -0.6532428 -1.4455367
## 6: 4 ijmvah -16576.53 -0.7077086 17688 0.6662657 0.3131157 -1.4455367
So what do you think? But I want to see if I can do it even faster. I'll use data.table's native join syntax for merging. You can do the same as an inner, left, or right join. Since we've already set the keys for both data tables, it knows which fields we want to merge on (v4 and d). Here is the join, finished in no time. :)
system.time(DT.merge <- DT[DT.new])
## user system elapsed
## 0.02 0.00 0.01
head(DT.merge)
## v1 v2 v3 v4 d x1 x2 x3
## 1: 23593.242 -0.98446475 53589 4 ijmvah 1.1209419 -0.6532428 -1.445537
## 2: -16576.533 -0.70770863 17688 4 ijmvah 1.1209419 -0.6532428 -1.445537
## 3: 23391.483 0.24720625 21661 4 ijmvah 1.1209419 -0.6532428 -1.445537
## 4: 3529.048 0.08768831 74587 4 ijmvah 1.1209419 -0.6532428 -1.445537
## 5: 23593.242 -0.98446475 53589 4 ijmvah 0.6662657 0.3131157 -1.445537
## 6: -16576.533 -0.70770863 17688 4 ijmvah 0.6662657 0.3131157 -1.445537
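One caveat: strictly speaking, X[Y] keeps every row of Y (a right join); here every row of DT.new has a match in DT, so the result coincides with an inner join. A sketch of the explicit version:
# Explicit inner join: unmatched rows of DT.new would be dropped via nomatch=0
DT.merge.inner <- DT[DT.new, nomatch=0]
For comparison, here is the same inner join in dplyr.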
system.time(DF.merge <- DF %>% inner_join(DF.new, by=c('v4','d')))
## user system elapsed
## 0.05 0.00 0.04
head(DF.merge)
## v1 v2 v3 v4 d x1 x2 x3
## 1 23593.24 -0.9844648 53589 4 ijmvah 1.1209419 -0.6532428 -1.4455367
## 2 23593.24 -0.9844648 53589 4 ijmvah 0.6662657 0.3131157 -1.4455367
## 3 23593.24 -0.9844648 53589 4 ijmvah 0.5943943 -0.3405394 0.4608586
## 4 23593.24 -0.9844648 53589 4 ijmvah 0.6662657 0.1396439 -1.4455367
## 5 -16576.53 -0.7077086 17688 4 ijmvah 1.1209419 -0.6532428 -1.4455367
## 6 -16576.53 -0.7077086 17688 4 ijmvah 0.6662657 0.3131157 -1.4455367
We have seen that using setkey() in data.table looks like magic. A common task is to calculate some new properties in one table using information from another table. Consider the example datasets below.
set.seed(100)
DT.original = data.table(Weekday=rep(c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday",
"Sunday"),length.out=7e5),
Date=rep(seq(as.Date("2015-01-01"), length.out=365, by='day'),
1e5, TRUE),
Distance = sample(runif(1000, 20, 80), 7e5, replace=TRUE))
head(DT.original)
## Weekday Date Distance
## 1: Monday 2015-01-01 59.70674
## 2: Tuesday 2015-01-01 68.41200
## 3: Wednesday 2015-01-01 42.62981
## 4: Thursday 2015-01-01 29.17710
## 5: Friday 2015-01-01 60.96852
## 6: Saturday 2015-01-01 60.49637
DT.coe = data.table(Weekday=rep(c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday",
"Sunday")),
Coefficient = sample(abs(rnorm(10)), 7 , TRUE))
head(DT.coe)
## Weekday Coefficient
## 1: Monday 1.1696943
## 2: Tuesday 0.2586376
## 3: Wednesday 1.8243576
## 4: Thursday 0.2586376
## 5: Friday 1.0887565
## 6: Saturday 0.4887914
Here we have two datasets. DT.original is the main dataset, which records subjects' walking distances in the year 2015, collected from a wearable device. But we also know that the accuracy of that device is not good enough: each value should be adjusted by multiplying by a coefficient. So, before we do any fancy data mining on the data, we need to calculate the real distance, which is distance*coefficient based on the weekday.
One way to do it is to merge the DT.coe data into DT.original first based on the weekday, then create a new column holding the value of (distance*coefficient), then delete the columns with the old distance and the coefficient. We know data.table is pretty good at merging, but that is not the best way for data.table to do it. Actually, we can finish the merging and the calculation at the same time.
system.time(setkey(DT.original, Weekday)[DT.coe, Distance.adj := Distance*Coefficient][order(Date)])
## user system elapsed
## 0.06 0.00 0.07
head(DT.original)
## Weekday Date Distance Distance.adj
## 1: Friday 2015-01-01 60.96852 66.37988
## 2: Friday 2015-01-01 56.36131 61.36375
## 3: Friday 2015-01-01 65.27107 71.06431
## 4: Friday 2015-01-01 34.26703 37.30845
## 5: Friday 2015-01-01 37.32332 40.63600
## 6: Friday 2015-01-01 53.06470 57.77453
Could you try to do it without using data.table? One possible answer in dplyr is sketched below.
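Here is a hedged sketch using dplyr's left_join() and mutate(), assuming the same two tables converted to plain data frames:
# Join the coefficients on Weekday, compute the adjusted distance, drop the helper column
df.adj <- as.data.frame(DT.original) %>%
  left_join(as.data.frame(DT.coe), by = "Weekday") %>%
  mutate(Distance.adj = Distance * Coefficient) %>%
  select(-Coefficient)
head(df.adj)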
rm(DT)
rm(DF)
N <- 1e7 # size of DT
set.seed(100)
foo <- function() paste(sample(letters, sample(5:9, 1), TRUE), collapse="")
ch <- replicate(1e4, foo())
ch <- unique(ch)
DF <- data.frame(id = sample(ch, N, TRUE),
v1 = as.numeric(sample(c(rnorm(1e5)*1e4), N, TRUE)),
v2 = as.numeric(sample(rnorm(1e6), N, TRUE)),
v3 = as.numeric(sample(rnorm(2e6), N, TRUE)),
v4 = as.numeric(sample(rnorm(3e6), N, TRUE)),
v5 = as.numeric(sample(rnorm(3e6), N, TRUE)))
DT = data.table(DF)
weights = rnorm(5,10)
The problem is that we have a dataset which stores five variables' values for 1e7 observations. The task is to compute the weighted sum of these variables for each observation based on the given weights (Com.V = v1*w1 + v2*w2 + ... + v5*w5), and name the new column Com.V. In dplyr, you can use the mutate() function, but it is not friendly when you need to perform the calculation over many columns. Note that this operation is actually an inner product, i.e., a matrix-vector multiplication.
Here is the data.frame (base R) way.
system.time(DF <- cbind(DF, as.matrix(DF[, -grep("id", names(DF))]) %*% weights))
## user system elapsed
## 0.45 0.05 0.50
names(DF) =c("id","v1","v2","v3","v4","v5","Com.V")
head(DF)
## id v1 v2 v3 v4 v5
## 1 eyrfn -7501.109 -0.7837236 -0.30656884 0.0391261 0.3873146
## 2 cicqweor 10086.533 0.5301808 0.18213895 0.8275374 -0.5357433
## 3 zhmsbe 11855.062 -1.0223668 -0.06560893 0.8832775 -1.4525234
## 4 xcjzhj 4280.910 -1.1538706 -0.85118854 -1.7658080 0.9280722
## 5 qgzxsn 14709.943 2.5478778 0.11462929 0.2285528 -0.8566297
## 6 tmzuiwoo 4966.520 1.3120825 -0.60920175 1.2848112 -0.3376857
## Com.V
## 1 -79798.08
## 2 107303.48
## 3 126088.21
## 4 45507.68
## 5 156494.50
## 6 52847.32
ColList = names(DT)
system.time(DT[, Com.V := .(as.matrix(DT[, .SD, .SDcols=ColList[-1]]) %*% weights)])
## user system elapsed
## 0.57 0.11 0.70
head(DT)
## id v1 v2 v3 v4 v5
## 1: eyrfn -7501.109 -0.7837236 -0.30656884 0.0391261 0.3873146
## 2: cicqweor 10086.533 0.5301808 0.18213895 0.8275374 -0.5357433
## 3: zhmsbe 11855.062 -1.0223668 -0.06560893 0.8832775 -1.4525234
## 4: xcjzhj 4280.910 -1.1538706 -0.85118854 -1.7658080 0.9280722
## 5: qgzxsn 14709.943 2.5478778 0.11462929 0.2285528 -0.8566297
## 6: tmzuiwoo 4966.520 1.3120825 -0.60920175 1.2848112 -0.3376857
## Com.V
## 1: -79798.08
## 2: 107303.48
## 3: 126088.21
## 4: 45507.68
## 5: 156494.50
## 6: 52847.32
For this task, data.table does not outperform data.frame. data.tables are faster relative to data.frames when they operate on the data without unnecessary copying; in the data.table command above, the bulk of the work was copying the data.table into a matrix. We can achieve similar performance using a so-called "in-table" operation, as shown below.
DT[,Com.V:=NULL]
ColList = names(DT)
myp = function(x) { x %*% weights }
system.time(DT[, `:=`(Com.V=myp(as.matrix(.SD))), .SDcols=ColList[-1]])
## user system elapsed
## 0.45 0.08 0.53
head(DT)
## id v1 v2 v3 v4 v5
## 1: eyrfn -7501.109 -0.7837236 -0.30656884 0.0391261 0.3873146
## 2: cicqweor 10086.533 0.5301808 0.18213895 0.8275374 -0.5357433
## 3: zhmsbe 11855.062 -1.0223668 -0.06560893 0.8832775 -1.4525234
## 4: xcjzhj 4280.910 -1.1538706 -0.85118854 -1.7658080 0.9280722
## 5: qgzxsn 14709.943 2.5478778 0.11462929 0.2285528 -0.8566297
## 6: tmzuiwoo 4966.520 1.3120825 -0.60920175 1.2848112 -0.3376857
## Com.V
## 1: -79798.08
## 2: 107303.48
## 3: 126088.21
## 4: 45507.68
## 5: 156494.50
## 6: 52847.32
We can see that we now get slightly better performance with data.table, and that performance is limited by the matrix multiplication operation itself.
system.time(temp <- myp(as.matrix(DF[,c(2:6)])))
## user system elapsed
## 0.42 0.08 0.49
The task is not over yet. After getting the weighted combined variable Com.V, I want to check the mean values of the variables, i.e., v1, v2, ..., for the top N subjects ranked by the value of Com.V.
For example, I want to get the mean value of v1 within the top 0.5% of Com.V.
Using dplyr, I could use the top_n() and summarise() functions.
n = round(dim(DF)[1]*0.005)
system.time(a.F <- DF %>% top_n(n, Com.V) %>% summarise(avg = mean(v1)))
## user system elapsed
## 39.39 0.56 40.13
print(a.F)
## avg
## 1 28469.68
system.time(setorder(DT, -Com.V))
## user system elapsed
## 2.03 0.01 2.06
system.time(a.T <- DT[,.(a.T=mean(head(v1, n)))])
## user system elapsed
## 0 0 0
print(a.T)
## a.T
## 1: 28469.68
n1 = round(dim(DF)[1]*0.01)
system.time(a.T <- DT[,.(a.T.1=mean(head(v1, n)), a.T.2=mean(tail(head(v1, n1),n)))])
## user system elapsed
## 0 0 0
print(a.T)
## a.T.1 a.T.2
## 1: 28469.68 24310.84
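For completeness, a sketch of the equivalent logic in dplyr: sort once with arrange(), then average the slices directly (this should reproduce a.T.1 and a.T.2 above):
# Sort once, then take means over row slices of the sorted frame
DF.sorted <- DF %>% arrange(desc(Com.V))
mean(DF.sorted$v1[1:n])             # top 0.5%
mean(DF.sorted$v1[(n1 - n + 1):n1]) # the next band, rows (n1-n+1)..n1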
There are many more tricks out there for using dplyr and data.table efficiently and smartly; this is just the tip of the iceberg. I will keep updating this page with new, creative ways to manipulate datasets. Thank you!
## [1] "Last Updated on Fri Apr 29 11:51:13 2016"