D621B3

Michael Muller

May 24, 2018

When working with sparse or gigantic datasets, one of the biggest hurdles to overcome (other than finding the right algorithms for the data) is managing the data itself.

If you’re not using readily available technology like Hadoop or something similar, then you’re probably choosing to work on your data on your own computer. If you’re using R to do that, data frames can become too large to manipulate easily. That’s where the data.table package comes in handy: it is a faster alternative to data frames. Let’s look at it.

You can create a data.table just like a data frame. If one vector is shorter than the other, the shorter vector is recycled until it matches the length of the longest one. Below, x = 1:5 repeats 400 times to fill all 2000 rows.

library(data.table)
DT = data.table(x=1:5,y=1:2000)
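As a quick check of the recycling described above, here is a minimal sketch that rebuilds the same table and counts how often each value of x appears. It uses only `.N`, data.table's built-in row counter, and assumes the data.table package is installed.

```r
library(data.table)

# The shorter column (x) is recycled to fill all 2000 rows.
DT <- data.table(x = 1:5, y = 1:2000)

nrow(DT)                       # row count equals the longest input
DT[1:7]                        # x cycles through 1..5 and then starts over
counts <- DT[, .N, by = x]     # how many rows each x value occupies
counts
```

Because 2000 is an exact multiple of 5, every value of x ends up with the same number of rows.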

Let’s move on to some fun stuff: let’s take the boys dataset from the mice package and turn it from a data frame into a data.table.

library(mice)   # provides the boys dataset
data('boys')
dt = data.table(boys)
class(dt)

## [1] "data.table" "data.frame"
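The conversion above uses `data.table()`, but the package offers other routes as well. The sketch below compares three of them on a small, made-up data frame (the `df` object and its columns are hypothetical, chosen only for illustration): `data.table()` and `as.data.table()` return a new object, while `setDT()` converts the existing object in place.

```r
library(data.table)

# Hypothetical toy data frame, used only for illustration
df <- data.frame(id = 1:3, hgt = c(50.1, 53.5, 50.0))

dt_a <- data.table(df)      # builds a new data.table from df
dt_b <- as.data.table(df)   # same result, also a new object
setDT(df)                   # converts df itself, without copying

class(dt_a)
class(df)                   # df is now a data.table as well
```

For a large table, `setDT()` may be the better fit because it avoids holding two copies of the data in memory.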

library(psych)  # describe() below comes from the psych package
describe(dt)

##      vars   n   mean    sd median trimmed   mad   min    max  range  skew
## age     1 748   9.16  6.89  10.50    9.03 10.14  0.04  21.18  21.14 -0.03
## hgt     2 728 132.15 46.51 147.30  134.31 52.34 50.00 198.00 148.00 -0.35
## wgt     3 744  37.15 26.03  34.65   35.49 34.93  3.14 117.40 114.26  0.38
## bmi     4 727  18.07  3.05  17.45   17.73  2.64 11.77  31.74  19.97  1.15
## hc      5 702  51.51  5.91  53.00   52.18  5.26 33.70  65.00  31.30 -0.88
## gen*    6 245   3.12  1.58   3.00    3.15  2.97  1.00   5.00   4.00 -0.08
## phb*    7 245   3.36  1.88   4.00    3.33  2.97  1.00   6.00   5.00  0.03
## tv      8 226  11.89  7.99  12.00   11.53 11.86  1.00  25.00  24.00  0.25
## reg*    9 745   3.02  1.14   3.00    3.03  1.48  1.00   5.00   4.00 -0.08
##      kurtosis   se
## age     -1.56 0.25
## hgt     -1.43 1.72
## wgt     -1.03 0.95
## bmi      1.76 0.11
## hc       0.05 0.22
## gen*    -1.59 0.10
## phb*    -1.54 0.12
## tv      -1.31 0.53
## reg*    -0.76 0.04

As you can see, functions written for data frames, such as these descriptive statistics, still work now that the object is a data.table. Let’s see what else we can do with it.

https://github.com/Rdatatable/data.table/wiki

sampler1 = dt[,median(hgt),by=wgt]
head(sampler1,5)

##     wgt   V1
## 1: 3.65 50.1
## 2: 3.37 53.5
## 3: 3.14 50.0
## 4: 4.27 54.5
## 5: 5.03 57.5

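We now have a data.table that holds the median height for each distinct weight. Because wgt is a continuous measurement, many groups are likely to contain only a single row, in which case the median is just that row’s height. Awesome. What about something more complex, like data aggregation?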
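Grouping tends to be more informative when the grouping column is categorical. Here is a minimal sketch on made-up data (the `toy` table and its values are hypothetical, not taken from the boys dataset) that also names the result column instead of accepting the default `V1`, and skips missing heights with `na.rm = TRUE`.

```r
library(data.table)

# Hypothetical toy data (not the boys dataset)
toy <- data.table(
  reg = c("north", "north", "south", "south", "south"),
  hgt = c(50, 54, 60, NA, 70)
)

# Median height per region, with a named result column and the group size
res <- toy[, .(median_hgt = median(hgt, na.rm = TRUE), n = .N), by = reg]
res
```

Wrapping the expressions in `.()` is what lets you return several named columns from a single grouped call.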

Let’s aggregate some data

dt[wgt> 25, .(sum(hc),sum(age)),by=hgt ]

##        hgt    V1     V2
##   1: 128.7 109.2 10.923
##   2: 123.0  52.5  5.338
##   3: 124.8 105.4 14.641
##   4: 128.1  53.3  6.017
##   5: 124.6  51.5  6.113
##  ---                   
## 298: 192.3  57.6 19.926
## 299: 193.1  60.0 19.942
## 300: 189.5    NA 19.978
## 301: 188.7 113.3 40.489
## 302: 189.1    NA 20.761

# vs. base R: aggregate(cbind(hc,age)~hgt, boys[boys$wgt > 25,],sum)
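The NA values in the V1 column above appear because `sum()` returns NA as soon as a group contains a missing hc. The sketch below shows the same filter, aggregate, group pattern on made-up data (the `toy` table is hypothetical), with named output columns and `na.rm = TRUE` to skip the missing value.

```r
library(data.table)

# Hypothetical toy data (not the boys dataset)
toy <- data.table(
  hgt = c(120, 120, 130, 130, 130),
  wgt = c(26, 30, 20, 28, 40),
  hc  = c(50, NA, 51, 52, 53),
  age = c(8, 9, 10, 10, 11)
)

# Filter rows in i, aggregate in j, group with by
agg <- toy[wgt > 25,
           .(sum_hc = sum(hc, na.rm = TRUE), sum_age = sum(age)),
           by = hgt]
agg
```

Whether dropping missing values is appropriate depends on the analysis, so treat `na.rm = TRUE` here as one option rather than a default.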

Conclusion: If you’ve spent your time mastering libraries that work on data frames, that’s OK. You can manipulate your data as a data.table and convert it back to a data frame for other functions that rely on one. Mastering data.table seems easier than mastering data frames because there is less to write and the syntax is more intuitive. The bonus is that data.table operations should run faster than their data frame equivalents. Awesome.
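For the round trip mentioned in the conclusion, a minimal sketch of converting back might look like this. The `dt` object here is a hypothetical toy table; `as.data.frame()` returns a plain data frame and leaves the original alone, while `setDF()` converts the object in place.

```r
library(data.table)

# Hypothetical toy data, used only for illustration
dt <- data.table(id = 1:3, hgt = c(50.1, 53.5, 50.0))

df_copy <- as.data.frame(dt)  # returns a plain data.frame; dt is unchanged
class(df_copy)

setDF(dt)                     # converts dt itself, without copying
class(dt)
```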