I have often discussed with friends and colleagues various problems of operating on large datasets in R, i.e., lists, data frames, tables, matrices, etc. Most of the time, people complained that there is no good way in R to do things such as adding new columns whose values are determined by other column(s), removing certain columns based on some criteria, or other interesting ideas (annoying ideas from the boss).
So, I decided to prepare this short collection and give some examples of similar problems I have met in the data preparation process in R. The purpose here is not to show that R is better than other languages, but to show that R is not bad at all.
library(dplyr)
library(data.table)
N <- 2e6 # size of DT
set.seed(1)
foo <- function() paste(sample(letters, sample(5:9, 1), TRUE), collapse="")
ch <- replicate(1e4, foo())
ch <- unique(ch)
DF <- data.frame(v1 = as.numeric(sample(c(rnorm(1e5)*1e4), N, TRUE)),
v2 = as.numeric(sample(rnorm(1e6), N, TRUE)),
v3 = sample(c(1e4:1e5), N, TRUE),
v4 = sample(c(round(rnorm(100,10,2))), N, TRUE),
d = sample(ch, N, TRUE))
DT = data.table(DF)
head(DF)
## v1 v2 v3 v4 d
## 1 -23281.328 -1.78530782 82030 7 ijmvah
## 2 1523.288 -2.00057071 58461 12 mdsqerwu
## 3 -5520.508 0.79477797 81160 4 jdreyx
## 4 -18413.247 -0.34343814 60370 12 wtefdp
## 5 6847.334 2.14037666 75845 13 lsbwmcwr
## 6 12528.141 0.01408064 27663 12 fuvhaufcd
print(object.size(DF), units="MB")
## 61.6 Mb
head(DT)
## v1 v2 v3 v4 d
## 1: -23281.328 -1.78530782 82030 7 ijmvah
## 2: 1523.288 -2.00057071 58461 12 mdsqerwu
## 3: -5520.508 0.79477797 81160 4 jdreyx
## 4: -18413.247 -0.34343814 60370 12 wtefdp
## 5: 6847.334 2.14037666 75845 13 lsbwmcwr
## 6: 12528.141 0.01408064 27663 12 fuvhaufcd
print(object.size(DT), units="MB")
## 61.6 Mb
The key idea behind the apply family in R is to avoid explicit loops. The apply family belongs to the base R package and is populated with functions that manipulate slices of data from matrices, arrays, lists, and data frames in a repetitive way. These functions let you cross the data in a number of ways while avoiding explicit loop constructs: they act on an input list, matrix, or array and apply a named function with one or several optional arguments.
The apply functions form the basis of more complex combinations and help to perform operations with very few lines of code. The family comprises: apply, lapply, sapply, vapply, mapply, rapply, and tapply.
There are many nice tutorials online about the apply family of functions, such as the R tutorial on the Apply family of functions.
Although apply, sapply, and lapply belong to the same function family, keep in mind that they work on different types of data underneath. sapply and lapply accept a list as input by default, which means that when given a data frame they treat each column as a list element. apply, on the other hand, expects an array (or a matrix, when the array has 2 dimensions), which must have the same type for all elements; when you run it on a data.frame, it simply calls as.matrix first.
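To see this coercion directly, here is a minimal sketch using the DF defined above:
# as.matrix() coerces the mixed-type data.frame to a character matrix,
# which is why apply() reports "character" for every column below
m <- as.matrix(head(DF))
class(m[, "v1"]) # "character" -- even the numeric column was coerced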
Here is an example: I want to check the object class of each column in DF.
system.time(a1 <- apply(DF, 2, class))
## user system elapsed
## 13.61 0.22 13.87
a1
## v1 v2 v3 v4 d
## "character" "character" "character" "character" "character"
system.time(a2 <- apply(DF[,-c(5)], 2, class))
## user system elapsed
## 0.22 0.05 0.26
a2
## v1 v2 v3 v4
## "numeric" "numeric" "numeric" "numeric"
system.time(sa <- sapply(DF, class))
## user system elapsed
## 0 0 0
sa
## v1 v2 v3 v4 d
## "numeric" "numeric" "integer" "numeric" "factor"
system.time(la <- lapply(DF, class))
## user system elapsed
## 0 0 0
la
## $v1
## [1] "numeric"
##
## $v2
## [1] "numeric"
##
## $v3
## [1] "integer"
##
## $v4
## [1] "numeric"
##
## $d
## [1] "factor"
As we can see from the output, directly using apply not only gives wrong results (with no error message!), but also runs much more slowly than sapply or lapply.
Let us consider another scenario: you want to apply some calculation to the columns of your data. In other words, you want to create a new column whose values are the output of a function that takes other columns as inputs. Most of the time, you can use the summarizing functions in R combined with apply to get what you want. But what if you want to use a customized function?
For example, I want to calculate a new variable v_new = v1/(v3 - 2.0).
system.time(DF$v_new <- apply(DF[,c(1,3)], 1, function(x) x[1]/(x[2]-2.0)))
## user system elapsed
## 17.25 0.21 17.59
head(DF)
## v1 v2 v3 v4 d v_new
## 1 -23281.328 -1.78530782 82030 7 ijmvah -0.28382172
## 2 1523.288 -2.00057071 58461 12 mdsqerwu 0.02605738
## 3 -5520.508 0.79477797 81160 4 jdreyx -0.06802174
## 4 -18413.247 -0.34343814 60370 12 wtefdp -0.30501669
## 5 6847.334 2.14037666 75845 13 lsbwmcwr 0.09028300
## 6 12528.141 0.01408064 27663 12 fuvhaufcd 0.45291714
DF$v_new = NULL
system.time(DF[,c("v_new1", "v_new2")] <- sapply(DF[,c(1,2)], function(x, y) x/(y-2.0), y=DF[,3]))
## user system elapsed
## 2.73 0.16 2.90
head(DF)
## v1 v2 v3 v4 d v_new1 v_new2
## 1 -23281.328 -1.78530782 82030 7 ijmvah -0.28382172 -2.176461e-05
## 2 1523.288 -2.00057071 58461 12 mdsqerwu 0.02605738 -3.422177e-05
## 3 -5520.508 0.79477797 81160 4 jdreyx -0.06802174 9.792971e-06
## 4 -18413.247 -0.34343814 60370 12 wtefdp -0.30501669 -5.689076e-06
## 5 6847.334 2.14037666 75845 13 lsbwmcwr 0.09028300 2.822115e-05
## 6 12528.141 0.01408064 27663 12 fuvhaufcd 0.45291714 5.090431e-07
DF$v_new1 = NULL
DF$v_new2 = NULL
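As an aside, for a simple arithmetic transformation like this one, no apply machinery is needed at all: base R arithmetic is vectorized over whole columns and should be the fastest option of the three. A minimal sketch:
# Vectorized base R: operates on the whole columns at once, no per-row function calls
system.time(DF$v_new <- DF$v1 / (DF$v3 - 2.0))
DF$v_new = NULL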
dplyr is a library written by Hadley Wickham. It is a very nice and powerful package; in fact, after you master it, you may never want to use the base functions again.
The dplyr package makes these steps fast and easy:
By constraining your options, it simplifies how you can think about common data manipulation tasks.
It provides simple "verbs", functions that correspond to the most common data manipulation tasks, to help you translate those thoughts into code.
It uses efficient data storage backends, so you spend less time waiting for the computer.
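As a quick illustration (a sketch on the DF defined above, not a benchmark), the core verbs chain together with %>% and read almost like plain English:
# Four of the basic verbs: filter rows, select columns, derive a column, sort
DF %>%
  filter(v4 > 10) %>%     # keep rows matching a condition
  select(v1, v3, d) %>%   # keep only some columns
  mutate(r = v1 / v3) %>% # add a derived column
  arrange(desc(r)) %>%    # sort by it
  head()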
Let's do the same thing again on DF, this time in dplyr.
DFP = tbl_df(DF)
myf <- function(x1, x2) { x1/(x2 - 2.0) }
system.time(DFP <- DFP %>% mutate(v_new = myf(v1, v3)))
## user system elapsed
## 0.03 0.00 0.03
head(DFP)
## Source: local data frame [6 x 6]
##
## v1 v2 v3 v4 d v_new
## (dbl) (dbl) (int) (dbl) (fctr) (dbl)
## 1 -23281.328 -1.78530782 82030 7 ijmvah -0.28382172
## 2 1523.288 -2.00057071 58461 12 mdsqerwu 0.02605738
## 3 -5520.508 0.79477797 81160 4 jdreyx -0.06802174
## 4 -18413.247 -0.34343814 60370 12 wtefdp -0.30501669
## 5 6847.334 2.14037666 75845 13 lsbwmcwr 0.09028300
## 6 12528.141 0.01408064 27663 12 fuvhaufcd 0.45291714
DFP$v_new = NULL
system.time(DFP <- DFP %>% mutate(v_new1 = myf(v1, v3), v_new2 = myf(v2, v3)))
## user system elapsed
## 0.01 0.00 0.02
head(DFP)
## Source: local data frame [6 x 7]
##
## v1 v2 v3 v4 d v_new1 v_new2
## (dbl) (dbl) (int) (dbl) (fctr) (dbl) (dbl)
## 1 -23281.328 -1.78530782 82030 7 ijmvah -0.28382172 -2.176461e-05
## 2 1523.288 -2.00057071 58461 12 mdsqerwu 0.02605738 -3.422177e-05
## 3 -5520.508 0.79477797 81160 4 jdreyx -0.06802174 9.792971e-06
## 4 -18413.247 -0.34343814 60370 12 wtefdp -0.30501669 -5.689076e-06
## 5 6847.334 2.14037666 75845 13 lsbwmcwr 0.09028300 2.822115e-05
## 6 12528.141 0.01408064 27663 12 fuvhaufcd 0.45291714 5.090431e-07
DFP$v_new1 = NULL
DFP$v_new2 = NULL
Let us mention the efficiency again: using the base function apply, ~17 s; using the base function sapply, ~3 s; using dplyr's mutate, ~0.03 s; all on 2,000,000 rows of data!
We can also use the mutate_each function in dplyr to do the same operation on multiple columns.
system.time(DFP <- DFP %>% mutate_each(funs(v_new = myf(., v3)), -d, -v4, -v3))
## user system elapsed
## 0.03 0.00 0.03
head(DFP)
## Source: local data frame [6 x 5]
##
## v1 v2 v3 v4 d
## (dbl) (dbl) (int) (dbl) (fctr)
## 1 -0.28382172 -2.176461e-05 82030 7 ijmvah
## 2 0.02605738 -3.422177e-05 58461 12 mdsqerwu
## 3 -0.06802174 9.792971e-06 81160 4 jdreyx
## 4 -0.30501669 -5.689076e-06 60370 12 wtefdp
## 5 0.09028300 2.822115e-05 75845 13 lsbwmcwr
## 6 0.45291714 5.090431e-07 27663 12 fuvhaufcd
Note that when using mutate_each, the default behavior is to transform the existing columns in place. You can manually specify a new name per column (practical only for a few columns), or you can use a named vector to create additional columns with new names.
DFP = tbl_df(DF)
system.time(DFP <-DFP %>% mutate_each(funs(v_new = myf(., v3)), v_new1 = v1, v_new2 = v2, -d, -v4, -v3))
## user system elapsed
## 0.03 0.00 0.03
head(DFP)
## Source: local data frame [6 x 7]
##
## v1 v2 v3 v4 d v_new1 v_new2
## (dbl) (dbl) (int) (dbl) (fctr) (dbl) (dbl)
## 1 -23281.328 -1.78530782 82030 7 ijmvah -0.28382172 -2.176461e-05
## 2 1523.288 -2.00057071 58461 12 mdsqerwu 0.02605738 -3.422177e-05
## 3 -5520.508 0.79477797 81160 4 jdreyx -0.06802174 9.792971e-06
## 4 -18413.247 -0.34343814 60370 12 wtefdp -0.30501669 -5.689076e-06
## 5 6847.334 2.14037666 75845 13 lsbwmcwr 0.09028300 2.822115e-05
## 6 12528.141 0.01408064 27663 12 fuvhaufcd 0.45291714 5.090431e-07
Or
DFP = tbl_df(DF)
vars <- names(DFP)[1:2]
vars <- setNames(vars, paste0(vars, "_new"))
DFP %>% mutate_each_(funs(v_new = myf(., v3)), vars) %>% head
## Source: local data frame [6 x 7]
##
## v1 v2 v3 v4 d v1_new v2_new
## (dbl) (dbl) (int) (dbl) (fctr) (dbl) (dbl)
## 1 -23281.328 -1.78530782 82030 7 ijmvah -0.28382172 -2.176461e-05
## 2 1523.288 -2.00057071 58461 12 mdsqerwu 0.02605738 -3.422177e-05
## 3 -5520.508 0.79477797 81160 4 jdreyx -0.06802174 9.792971e-06
## 4 -18413.247 -0.34343814 60370 12 wtefdp -0.30501669 -5.689076e-06
## 5 6847.334 2.14037666 75845 13 lsbwmcwr 0.09028300 2.822115e-05
## 6 12528.141 0.01408064 27663 12 fuvhaufcd 0.45291714 5.090431e-07
You can also put more control on which variables the transformation is applied to.
Let's assume I want to put units after the values in v1.
DFP = tbl_df(DF)
DFP %>% mutate_each(funs(paste0(., ".inches")), matches("v1")) %>% head
## Source: local data frame [6 x 5]
##
## v1 v2 v3 v4 d
## (chr) (dbl) (int) (dbl) (fctr)
## 1 -23281.3282688267.inches -1.78530782 82030 7 ijmvah
## 2 1523.28809520714.inches -2.00057071 58461 12 mdsqerwu
## 3 -5520.50827473118.inches 0.79477797 81160 4 jdreyx
## 4 -18413.2474744393.inches -0.34343814 60370 12 wtefdp
## 5 6847.33365361713.inches 2.14037666 75845 13 lsbwmcwr
## 6 12528.141100736.inches 0.01408064 27663 12 fuvhaufcd
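The full-precision labels above are hard to read; a small variation (a sketch, same idea) is to round before pasting:
# Round first so the pasted labels stay readable
DFP %>% mutate_each(funs(paste0(round(., 2), ".inches")), matches("v1")) %>% head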
The data.table package is written by Matt Dowle; it is designed to manipulate large datasets in R at speeds comparable to those of Python, etc. It is not always intuitive to translate traditional R operations into data.table methods. I am still a data.table newbie, but I'd like to use it more and more often in the future, because it is really, really fast. See this benchmark comparing grouping in data.table, dplyr, and pandas (Python).
First, let's repeat the previous task using data.table.
names(DT) = c("v_1", "v_2", "v3", "v4", "d")
head(DT)
## v_1 v_2 v3 v4 d
## 1: -23281.328 -1.78530782 82030 7 ijmvah
## 2: 1523.288 -2.00057071 58461 12 mdsqerwu
## 3: -5520.508 0.79477797 81160 4 jdreyx
## 4: -18413.247 -0.34343814 60370 12 wtefdp
## 5: 6847.334 2.14037666 75845 13 lsbwmcwr
## 6: 12528.141 0.01408064 27663 12 fuvhaufcd
system.time(for(i in grep("_", names(DT), value=TRUE))
  DT[, (i) := get(i)/(v3-2.0)])
## user system elapsed
## 0.03 0.00 0.03
head(DT)
## v_1 v_2 v3 v4 d
## 1: -0.28382172 -2.176461e-05 82030 7 ijmvah
## 2: 0.02605738 -3.422177e-05 58461 12 mdsqerwu
## 3: -0.06802174 9.792971e-06 81160 4 jdreyx
## 4: -0.30501669 -5.689076e-06 60370 12 wtefdp
## 5: 0.09028300 2.822115e-05 75845 13 lsbwmcwr
## 6: 0.45291714 5.090431e-07 27663 12 fuvhaufcd
DT = data.table(DF)
names(DT) = c("v_1", "v_2", "v3", "v4", "d")
system.time( x <- DT[, lapply(.SD, myf, DT[,v3]), .SDcols=c("v_1","v_2")])
## user system elapsed
## 0.03 0.00 0.03
head(x)
## v_1 v_2
## 1: -0.28382172 -2.176461e-05
## 2: 0.02605738 -3.422177e-05
## 3: -0.06802174 9.792971e-06
## 4: -0.30501669 -5.689076e-06
## 5: 0.09028300 2.822115e-05
## 6: 0.45291714 5.090431e-07
Suppose we have another dataset with the two fields v4 and d, plus three new fields: x1, x2, and x3. We want to merge it with our original DF dataset so that we're looking at v1, v2, and v3 only for the rows with v4 = 4 and d = "ijmvah". In regular R, we can do this.
DF.new = DF[DF$v4==4 & DF$d=="ijmvah",c("v4", "d")]
DF.new$x1 = sample(rnorm(10), dim(DF.new)[1] , TRUE)
DF.new$x2 = sample(rnorm(10), dim(DF.new)[1] , TRUE)
DF.new$x3 = sample(rnorm(10), dim(DF.new)[1] , TRUE)
head(DF.new)
## v4 d x1 x2 x3
## 459142 4 ijmvah 1.1209419 -0.6532428 -1.4455367
## 776932 4 ijmvah 0.6662657 0.3131157 -1.4455367
## 777900 4 ijmvah 0.5943943 -0.3405394 0.4608586
## 1277727 4 ijmvah 0.6662657 0.1396439 -1.4455367
system.time(DF.merge <- merge(DF, DF.new, by=c('v4','d')))
## user system elapsed
## 4.49 0.00 4.51
head(DF.merge)
## v4 d v1 v2 v3 x1 x2 x3
## 1 4 ijmvah 23391.483 0.24720625 21661 1.1209419 -0.6532428 -1.4455367
## 2 4 ijmvah 23391.483 0.24720625 21661 0.6662657 0.3131157 -1.4455367
## 3 4 ijmvah 23391.483 0.24720625 21661 0.5943943 -0.3405394 0.4608586
## 4 4 ijmvah 23391.483 0.24720625 21661 0.6662657 0.1396439 -1.4455367
## 5 4 ijmvah 3529.048 0.08768831 74587 1.1209419 -0.6532428 -1.4455367
## 6 4 ijmvah 3529.048 0.08768831 74587 0.6662657 0.3131157 -1.4455367
Now let's check what happens when we use the merge() function on a data.table.
DT = data.table(DF)
DT.new = data.table(DF.new)
system.time(DT.merge <- merge(DT, DT.new, by =c("v4","d")))
## user system elapsed
## 0.15 0.00 0.15
head(DT.merge)
## v4 d v1 v2 v3 x1 x2 x3
## 1: 4 ijmvah 23593.24 -0.9844648 53589 1.1209419 -0.6532428 -1.4455367
## 2: 4 ijmvah 23593.24 -0.9844648 53589 0.6662657 0.3131157 -1.4455367
## 3: 4 ijmvah 23593.24 -0.9844648 53589 0.5943943 -0.3405394 0.4608586
## 4: 4 ijmvah 23593.24 -0.9844648 53589 0.6662657 0.1396439 -1.4455367
## 5: 4 ijmvah -16576.53 -0.7077086 17688 1.1209419 -0.6532428 -1.4455367
## 6: 4 ijmvah -16576.53 -0.7077086 17688 0.6662657 0.3131157 -1.4455367
See, the job goes from ~4 s down to ~0.15 s. Can we do it faster? Let's try. The nice thing about data.table is that you can set key(s) before you do anything on it; most of the time, that makes things faster. This time, it takes ~0.07 s.
setkey(DT, v4, d)
setkey(DT.new, v4, d)
system.time (DT.merge <- merge(DT, DT.new, by =c("v4","d")))
## user system elapsed
## 0.06 0.00 0.07
head(DT.merge)
## v4 d v1 v2 v3 x1 x2 x3
## 1: 4 ijmvah 23593.24 -0.9844648 53589 1.1209419 -0.6532428 -1.4455367
## 2: 4 ijmvah 23593.24 -0.9844648 53589 0.6662657 0.3131157 -1.4455367
## 3: 4 ijmvah 23593.24 -0.9844648 53589 0.5943943 -0.3405394 0.4608586
## 4: 4 ijmvah 23593.24 -0.9844648 53589 0.6662657 0.1396439 -1.4455367
## 5: 4 ijmvah -16576.53 -0.7077086 17688 1.1209419 -0.6532428 -1.4455367
## 6: 4 ijmvah -16576.53 -0.7077086 17688 0.6662657 0.3131157 -1.4455367
So what do you think? But I want to see if I can do it even faster. I'll use data.table's native join syntax for merging. You can do the same as an inner, left, or right join. Since we've already set the keys for both data tables, it knows which fields we want to merge on (v4 and d). Here is the join, finished in no time. :)
system.time(DT.merge <- DT[DT.new])
## user system elapsed
## 0.02 0.00 0.01
head(DT.merge)
## v1 v2 v3 v4 d x1 x2 x3
## 1: 23593.242 -0.98446475 53589 4 ijmvah 1.1209419 -0.6532428 -1.445537
## 2: -16576.533 -0.70770863 17688 4 ijmvah 1.1209419 -0.6532428 -1.445537
## 3: 23391.483 0.24720625 21661 4 ijmvah 1.1209419 -0.6532428 -1.445537
## 4: 3529.048 0.08768831 74587 4 ijmvah 1.1209419 -0.6532428 -1.445537
## 5: 23593.242 -0.98446475 53589 4 ijmvah 0.6662657 0.3131157 -1.445537
## 6: -16576.533 -0.70770863 17688 4 ijmvah 0.6662657 0.3131157 -1.445537
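One caveat: strictly speaking, X[Y] keeps every row of Y (a right join); here every row of DT.new has a match in DT, so the result coincides with an inner join. A sketch of the explicit version:
# Explicit inner join: unmatched rows of DT.new would be dropped via nomatch=0
DT.merge.inner <- DT[DT.new, nomatch=0]
For comparison, here is the same inner join in dplyr.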
system.time(DF.merge <- DF %>% inner_join(DF.new, by=c('v4','d')))
## user system elapsed
## 0.05 0.00 0.04
head(DF.merge)
## v1 v2 v3 v4 d x1 x2 x3
## 1 23593.24 -0.9844648 53589 4 ijmvah 1.1209419 -0.6532428 -1.4455367
## 2 23593.24 -0.9844648 53589 4 ijmvah 0.6662657 0.3131157 -1.4455367
## 3 23593.24 -0.9844648 53589 4 ijmvah 0.5943943 -0.3405394 0.4608586
## 4 23593.24 -0.9844648 53589 4 ijmvah 0.6662657 0.1396439 -1.4455367
## 5 -16576.53 -0.7077086 17688 4 ijmvah 1.1209419 -0.6532428 -1.4455367
## 6 -16576.53 -0.7077086 17688 4 ijmvah 0.6662657 0.3131157 -1.4455367
We have seen that using setkey() in data.table looks like magic. A common task is to calculate some new properties in one table using information from another table. Consider the example datasets below.
set.seed(100)
DT.original = data.table(Weekday=rep(c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday",
"Sunday"),length.out=7e5),
Date=rep(seq(as.Date("2015-01-01"), length.out=365, by='day'),
1e5, TRUE),
Distance = sample(runif(1000, 20, 80), 7e5, replace=TRUE))
head(DT.original)
## Weekday Date Distance
## 1: Monday 2015-01-01 59.70674
## 2: Tuesday 2015-01-01 68.41200
## 3: Wednesday 2015-01-01 42.62981
## 4: Thursday 2015-01-01 29.17710
## 5: Friday 2015-01-01 60.96852
## 6: Saturday 2015-01-01 60.49637
DT.coe = data.table(Weekday=rep(c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday",
"Sunday")),
Coefficient = sample(abs(rnorm(10)), 7 , TRUE))
head(DT.coe)
## Weekday Coefficient
## 1: Monday 1.1696943
## 2: Tuesday 0.2586376
## 3: Wednesday 1.8243576
## 4: Thursday 0.2586376
## 5: Friday 1.0887565
## 6: Saturday 0.4887914
Here we have two datasets. DT.original is the main dataset, which records subjects' walking distances in the year 2015, collected from a wearable device. But we also know that the accuracy of that device is not good enough: each value should be adjusted by multiplying by a coefficient. So, before we do any fancy data mining on the data, we need to calculate the real distance, which is distance*coefficient based on the weekday.
One way to do it is to merge the DT.coe data into DT.original first based on the weekday, then create a new column holding the value of (distance*coefficient), then delete the columns with the old distance and the coefficient. We know data.table is pretty good at merging, but that is not the best way for data.table to do it. Actually, we can finish the merging and the calculation at the same time.
system.time(setkey(DT.original, Weekday)[DT.coe, Distance.adj := Distance*Coefficient][order(Date)])
## user system elapsed
## 0.06 0.00 0.07
head(DT.original)
## Weekday Date Distance Distance.adj
## 1: Friday 2015-01-01 60.96852 66.37988
## 2: Friday 2015-01-01 56.36131 61.36375
## 3: Friday 2015-01-01 65.27107 71.06431
## 4: Friday 2015-01-01 34.26703 37.30845
## 5: Friday 2015-01-01 37.32332 40.63600
## 6: Friday 2015-01-01 53.06470 57.77453
Could you try to do it without using data.table? One possible answer in dplyr is sketched below.
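Here is a hedged sketch using dplyr's left_join() and mutate(), assuming the same two tables converted to plain data frames:
# Join the coefficients on Weekday, compute the adjusted distance, drop the helper column
df.adj <- as.data.frame(DT.original) %>%
  left_join(as.data.frame(DT.coe), by = "Weekday") %>%
  mutate(Distance.adj = Distance * Coefficient) %>%
  select(-Coefficient)
head(df.adj)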
rm(DT)
rm(DF)
N <- 1e7 # size of DT
set.seed(100)
foo <- function() paste(sample(letters, sample(5:9, 1), TRUE), collapse="")
ch <- replicate(1e4, foo())
ch <- unique(ch)
DF <- data.frame(id = sample(ch, N, TRUE),
v1 = as.numeric(sample(c(rnorm(1e5)*1e4), N, TRUE)),
v2 = as.numeric(sample(rnorm(1e6), N, TRUE)),
v3 = as.numeric(sample(rnorm(2e6), N, TRUE)),
v4 = as.numeric(sample(rnorm(3e6), N, TRUE)),
v5 = as.numeric(sample(rnorm(3e6), N, TRUE)))
DT = data.table(DF)
weights = rnorm(5,10)
The problem is that we have a dataset which stores five variables' values for 1e7 observations. The task is to compute the weighted sum of these variables for each observation based on the given weights (Com.V = v1*w1 + v2*w2 + ... + v5*w5), and name the new column Com.V. In dplyr, you can use the mutate() function, but it is not friendly when you need to perform the calculation over many columns. Note that this operation is actually an inner product, i.e., a matrix-vector multiplication.
Here is the data.frame (base R) way.
system.time(DF <- cbind(DF, as.matrix(DF[, -grep("id", names(DF))]) %*% weights))
## user system elapsed
## 0.45 0.05 0.50
names(DF) =c("id","v1","v2","v3","v4","v5","Com.V")
head(DF)
## id v1 v2 v3 v4 v5
## 1 eyrfn -7501.109 -0.7837236 -0.30656884 0.0391261 0.3873146
## 2 cicqweor 10086.533 0.5301808 0.18213895 0.8275374 -0.5357433
## 3 zhmsbe 11855.062 -1.0223668 -0.06560893 0.8832775 -1.4525234
## 4 xcjzhj 4280.910 -1.1538706 -0.85118854 -1.7658080 0.9280722
## 5 qgzxsn 14709.943 2.5478778 0.11462929 0.2285528 -0.8566297
## 6 tmzuiwoo 4966.520 1.3120825 -0.60920175 1.2848112 -0.3376857
## Com.V
## 1 -79798.08
## 2 107303.48
## 3 126088.21
## 4 45507.68
## 5 156494.50
## 6 52847.32
ColList = names(DT)
system.time(DT[, Com.V := .(as.matrix(DT[, .SD, .SDcols=ColList[-1]]) %*% weights)])
## user system elapsed
## 0.57 0.11 0.70
head(DT)
## id v1 v2 v3 v4 v5
## 1: eyrfn -7501.109 -0.7837236 -0.30656884 0.0391261 0.3873146
## 2: cicqweor 10086.533 0.5301808 0.18213895 0.8275374 -0.5357433
## 3: zhmsbe 11855.062 -1.0223668 -0.06560893 0.8832775 -1.4525234
## 4: xcjzhj 4280.910 -1.1538706 -0.85118854 -1.7658080 0.9280722
## 5: qgzxsn 14709.943 2.5478778 0.11462929 0.2285528 -0.8566297
## 6: tmzuiwoo 4966.520 1.3120825 -0.60920175 1.2848112 -0.3376857
## Com.V
## 1: -79798.08
## 2: 107303.48
## 3: 126088.21
## 4: 45507.68
## 5: 156494.50
## 6: 52847.32
For this task, data.table does not outperform data.frame. data.tables are faster relative to data.frames when they operate on the data without unnecessary copying; in the data.table command above, the bulk of the work was copying the data.table into a matrix. We can achieve similar performance using a so-called "in-table" operation, as shown below.
DT[,Com.V:=NULL]
ColList = names(DT)
myp = function(x) { x %*% weights }
system.time(DT[, `:=`(Com.V=myp(as.matrix(.SD))), .SDcols=ColList[-1]])
## user system elapsed
## 0.45 0.08 0.53
head(DT)
## id v1 v2 v3 v4 v5
## 1: eyrfn -7501.109 -0.7837236 -0.30656884 0.0391261 0.3873146
## 2: cicqweor 10086.533 0.5301808 0.18213895 0.8275374 -0.5357433
## 3: zhmsbe 11855.062 -1.0223668 -0.06560893 0.8832775 -1.4525234
## 4: xcjzhj 4280.910 -1.1538706 -0.85118854 -1.7658080 0.9280722
## 5: qgzxsn 14709.943 2.5478778 0.11462929 0.2285528 -0.8566297
## 6: tmzuiwoo 4966.520 1.3120825 -0.60920175 1.2848112 -0.3376857
## Com.V
## 1: -79798.08
## 2: 107303.48
## 3: 126088.21
## 4: 45507.68
## 5: 156494.50
## 6: 52847.32
We can see that we now get slightly better performance with data.table, and that performance is limited by the matrix multiplication operation itself.
system.time(temp <- myp(as.matrix(DF[,c(2:6)])))
## user system elapsed
## 0.42 0.08 0.49
The task is not over yet. After getting the weighted combined variable Com.V, I want to check the mean values of the variables, i.e., v1, v2, ..., for the top N subjects ranked by the value of Com.V.
For example, I want to get the mean value of v1 within the top 0.5% of Com.V.
Using dplyr, I could use the top_n() and summarise() functions.
n = round(dim(DF)[1]*0.005)
system.time(a.F <- DF %>% top_n(n, Com.V) %>% summarise(avg = mean(v1)))
## user system elapsed
## 39.39 0.56 40.13
print(a.F)
## avg
## 1 28469.68
system.time(setorder(DT, -Com.V))
## user system elapsed
## 2.03 0.01 2.06
system.time(a.T <- DT[,.(a.T=mean(head(v1, n)))])
## user system elapsed
## 0 0 0
print(a.T)
## a.T
## 1: 28469.68
n1 = round(dim(DF)[1]*0.01)
system.time(a.T <- DT[,.(a.T.1=mean(head(v1, n)), a.T.2=mean(tail(head(v1, n1),n)))])
## user system elapsed
## 0 0 0
print(a.T)
## a.T.1 a.T.2
## 1: 28469.68 24310.84
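For completeness, a sketch of the equivalent logic in dplyr: sort once with arrange(), then average the slices directly (this should reproduce a.T.1 and a.T.2 above):
# Sort once, then take means over row slices of the sorted frame
DF.sorted <- DF %>% arrange(desc(Com.V))
mean(DF.sorted$v1[1:n])             # top 0.5%
mean(DF.sorted$v1[(n1 - n + 1):n1]) # the next band, rows (n1-n+1)..n1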
There are many more tricks out there for using dplyr and data.table efficiently and smartly; this is just the tip of the iceberg. I will keep updating this page with new, creative ways to manipulate datasets. Thank you!
## [1] "Last Updated on Fri Apr 29 11:51:13 2016"