If you’re not using readily available technology like Hadoop or something similar, then you’re probably choosing to work on your data on your own computer. If you’re using R to do that, dataframes might become a bit too large to manipulate easily. That’s where the data.table package comes in handy: a faster alternative to dataframes. Let’s look at it.
You can create a data.table just like a dataframe, even if one vector isn’t the same length as the other: the shorter vector is recycled (repeated from the start) until it matches the length of the longest one. In the example below, x repeats 400 times to fill 2000 rows.
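To make the repeating behaviour described above concrete, here is a minimal sketch with deliberately tiny, made-up vectors. It assumes the data.table package is installed and that the longer length is an exact multiple of the shorter one; when it is not, data.table complains about recycling with a remainder instead of repeating silently.

```r
library(data.table)

# x has 2 elements and y has 6, so x is repeated three times to fill 6 rows
small = data.table(x = 1:2, y = 1:6)
print(small)
nrow(small)
```

The x column should come out as 1, 2, 1, 2, 1, 2 next to y running from 1 to 6.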
library(data.table)
DT = data.table(x=1:5,y=1:2000)
Let’s move on to some fun stuff: let’s take the boys dataset from the mice package and turn it from a dataframe into a data.table.
library(mice)   # provides the boys dataset
library(psych)  # provides describe()
data('boys')
dt = data.table(boys)
class(dt)
## [1] "data.table" "data.frame"
describe(dt)
## vars n mean sd median trimmed mad min max range skew
## age 1 748 9.16 6.89 10.50 9.03 10.14 0.04 21.18 21.14 -0.03
## hgt 2 728 132.15 46.51 147.30 134.31 52.34 50.00 198.00 148.00 -0.35
## wgt 3 744 37.15 26.03 34.65 35.49 34.93 3.14 117.40 114.26 0.38
## bmi 4 727 18.07 3.05 17.45 17.73 2.64 11.77 31.74 19.97 1.15
## hc 5 702 51.51 5.91 53.00 52.18 5.26 33.70 65.00 31.30 -0.88
## gen* 6 245 3.12 1.58 3.00 3.15 2.97 1.00 5.00 4.00 -0.08
## phb* 7 245 3.36 1.88 4.00 3.33 2.97 1.00 6.00 5.00 0.03
## tv 8 226 11.89 7.99 12.00 11.53 11.86 1.00 25.00 24.00 0.25
## reg* 9 745 3.02 1.14 3.00 3.03 1.48 1.00 5.00 4.00 -0.08
## kurtosis se
## age -1.56 0.25
## hgt -1.43 1.72
## wgt -1.03 0.95
## bmi 1.76 0.11
## hc 0.05 0.22
## gen* -1.59 0.10
## phb* -1.54 0.12
## tv -1.31 0.53
## reg* -0.76 0.04
As you can see, we can still run descriptive dataframe statistics even though the object is now a data.table. Let’s see some other things we can do with it.
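The reason this works is visible in the class(dt) output above: a data.table also carries the data.frame class, so functions written for dataframes generally accept it. A small sketch on a made-up table (the toy object and its column names are hypothetical, not taken from boys):

```r
library(data.table)

toy = data.table(height = c(150, 160, 170), weight = c(45, 55, 65))

inherits(toy, "data.frame")   # a data.table is also a data.frame
nrow(toy)                     # base R functions work unchanged
colMeans(toy)                 # column means, as with a dataframe
summary(toy$height)           # columns are ordinary vectors
```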
sampler1 = dt[,median(hgt),by=wgt]
head(sampler1,5)
## wgt V1
## 1: 3.65 50.1
## 2: 3.37 53.5
## 3: 3.14 50.0
## 4: 4.27 54.5
## 5: 5.03 57.5
We now have a data.table that gives the median height for each distinct weight value (the unnamed result column defaults to V1). Awesome, what about something more complex like data aggregation?
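The general pattern behind that call is dt[i, j, by]: pick rows with i, compute j, grouped by by. Here is a minimal sketch on a made-up table (the grp and val columns are hypothetical). It names the result column instead of accepting the default V1, and passes na.rm = TRUE because boys has missing values in most columns (the n column of the describe() output above is below 748 for everything except age), which would otherwise turn a group’s result into NA.

```r
library(data.table)

toy = data.table(grp = c("a", "a", "b", "b", "b"),
                 val = c(1, 3, 10, NA, 30))

# j + by: median of val within each grp, ignoring missing values
res = toy[, .(med_val = median(val, na.rm = TRUE)), by = grp]
print(res)

# i + j + by: keep rows with val > 2 first (rows where the test is NA are dropped),
# then sum val within each grp
tot = toy[val > 2, .(total = sum(val)), by = grp]
print(tot)
```

Naming the columns inside .() tends to pay off as soon as there is more than one computed column, since V1 and V2 say nothing about what they hold.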
Let’s aggregate some data
dt[wgt > 25, .(sum(hc), sum(age)), by=hgt]
## hgt V1 V2
## 1: 128.7 109.2 10.923
## 2: 123.0 52.5 5.338
## 3: 124.8 105.4 14.641
## 4: 128.1 53.3 6.017
## 5: 124.6 51.5 6.113
## ---
## 298: 192.3 57.6 19.926
## 299: 193.1 60.0 19.942
## 300: 189.5 NA 19.978
## 301: 188.7 113.3 40.489
## 302: 189.1 NA 20.761
# Vs. the base R version, with DF being the original dataframe:
# aggregate(cbind(hc,age)~hgt, DF[DF$wgt > 25,],sum)
Conclusion: If you’ve spent your time mastering libraries that work on dataframes, that’s OK. A data.table is still a data.frame, so you can manipulate your data as a table and convert it back to a plain dataframe for other functions that rely on one. Mastering data.table seems easier than dataframes because there is less to write and the syntax is more intuitive. The bonus is that data.table operations are generally faster than their dataframe equivalents, especially on large data. Awesome.
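As a sketch of the round trip mentioned in the conclusion: as.data.table() and as.data.frame() return converted copies, while data.table’s setDT() and setDF() convert an object in place. The table below is made up for illustration, and the toy_ names are hypothetical.

```r
library(data.table)

toy_df = data.frame(id = 1:3, score = c(2.5, 3.5, 4.5))

toy_dt = as.data.table(toy_df)      # copy of toy_df as a data.table
toy_dt[, score2 := score * 2]       # add a column by reference
back = as.data.frame(toy_dt)        # plain dataframe again

class(back)
```

The copy-based functions are the safer default when the original object is still needed; the in-place variants avoid duplicating large data.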