If you are a speed junky; you prefer speed over readability of the code and deal with a very large dataset on a daily basis then you are in the right place to learn the right tool. There is a package called data.table in R that extends the functionality of data.frames. The syntax is a little different from regular data.frames, so it will take time to get used to, which is probably the primary reason it has not been near-universal adoption.
And data.table is not the only package that helps in data frame manipulation. There is another excellent package by Hadley Wickham called dplyr. It has its own grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation.
There is always a debate between usability, readability, and speed between these two packages. A StackOverflow discussion is worth following to know more about what one package can do and the other cannot. In general, if you are looking for speed then data.tables is most preferred but the readability of the code is your major concern then dplyr is the way to fo
In this exercise, our focus in on data.table and the secret to the speed is that data.table has an index like a database. This allows faster accessing, group by operation and joins.
Creating data.table is just like creating data.frames , and the two are very similar.
require(data.table)
## Loading required package: data.table
## create a regular data.frame
theDf <- data.frame(A=1:10,B=letters[1:10],C=LETTERS[1:10],D=rep(c("One","Two","Three"),length.out=10))
## create a data.table
theDt = data.table(A=1:10,B=letters[1:10],C=LETTERS[1:10],D=rep(c("One","Two","Three"),length.out=10))
## print and compare
theDt
## A B C D
## 1: 1 a A One
## 2: 2 b B Two
## 3: 3 c C Three
## 4: 4 d D One
## 5: 5 e E Two
## 6: 6 f F Three
## 7: 7 g G One
## 8: 8 h H Two
## 9: 9 i I Three
## 10: 10 j J One
It can be noticed below that by default data.frame turns character data into factors while data.table does not.
class(theDf$B)
## [1] "factor"
class(theDt$B)
## [1] "character"
The data are identical except that data.frame turned B into a factor while data.table did not.
It is also possible to create a data.table out of an existing data.frame as given below.
require(ggplot2)
## Loading required package: ggplot2
diamondsDT <- data.table(diamonds)
diamondsDT
## carat cut color clarity depth table price x y z
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## ---
## 53936: 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50
## 53937: 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
## 53938: 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
## 53939: 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
## 53940: 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
Now let us see how to perform some common and advanced data manipulation using data.table.
Accessing the rows can be done similarly to accessing rows in data.frame. The following code subset the data to the first two rows.
theDt[1:2,]
## A B C D
## 1: 1 a A One
## 2: 2 b B Two
To filter rows based on some condition we can use the following one-liner. Although the one-liner is the valid syntax, it is not necessarily efficient syntax. That line creates a vector of length nrow=10 consisting of True or False entries, which is a vector scan.
After we create a key for the data.tables we can use different syntax to pick rows through binary search, which will be much faster and is covered in the coming section.
theDt[theDt$A >=7,]
## A B C D
## 1: 7 g G One
## 2: 8 h H Two
## 3: 9 i I Three
## 4: 10 j J One
Accessing individual columns must be done a little differently than accessing columns in data.frames. In data.frames to subset, multiple column names should be specified as character vectors. With data.tables the columns should be specified as a list of actual names, not as characters.
theDt[,list(A,C)]
## A C
## 1: 1 A
## 2: 2 B
## 3: 3 C
## 4: 4 D
## 5: 5 E
## 6: 6 F
## 7: 7 G
## 8: 8 H
## 9: 9 I
## 10: 10 J
If we must specify the column names as characters (perhaps because they were passed as arguments to a function), the with argument should be set to FALSE.
theDt[,c("A","C"), with=FALSE]
## A C
## 1: 1 A
## 2: 2 B
## 3: 3 C
## 4: 4 D
## 5: 5 E
## 6: 6 F
## 7: 7 G
## 8: 8 H
## 9: 9 I
## 10: 10 J
This time we used a vector to hold the column names instead of a list. These nuances are important to proper functions of data.table but can lead to a great deal of frustration.
Now that we have a few data.tables in memory, we might be interested in seeing some information about them.
## show tables
tables()
## NAME NROW NCOL MB COLS KEY
## 1: diamondsDT 53,940 10 3 carat,cut,color,clarity,depth,table,...
## 2: theDt 10 4 0 A,B,C,D
## Total: 3MB
This shows for each data.table in memory, the name, the number of rows, the size in megabytes, the column names and the key. We have not assigned keys for any of the tables so that column is blank as of now. The key is used to index the data.table and will provide the extra speed.
We start by adding the key to theDt. We will use the D column to index the data.table . This is done using set.key , which takes the name of the data.table as its first argument and the name of the desired column (without quotes, as is consistent with column section) as the second argument.
## set the key
setkey(theDt, D)
## show the data.table again
theDt
## A B C D
## 1: 1 a A One
## 2: 4 d D One
## 3: 7 g G One
## 4: 10 j J One
## 5: 3 c C Three
## 6: 6 f F Three
## 7: 9 i I Three
## 8: 2 b B Two
## 9: 5 e E Two
## 10: 8 h H Two
The data have been reordered according to column D , which is sorted alphabetically. We can confirm the key was set with key function.
key(theDt)
## [1] "D"
This adds some new functionality to selecting rows from data.tables. In addition to selecting rows by the row number or by some expression that evaluates to TRUE or FALSE, a value of the key column can be specified.
theDt[c("One","Two"),]
## A B C D
## 1: 1 a A One
## 2: 4 d D One
## 3: 7 g G One
## 4: 10 j J One
## 5: 2 b B Two
## 6: 5 e E Two
## 7: 8 h H Two
More than one column can also be set as the key in data.tables as follows.
setkey(diamondsDT, cut, color)
To access rows according to both keys, there is a special function J. It takes multiple arguments, each of which is the vector of values to select.
head(diamondsDT)
## carat cut color clarity depth table price x y z
## 1: 0.75 Fair D SI2 64.6 57 2848 5.74 5.72 3.70
## 2: 0.71 Fair D VS2 56.9 65 2858 5.89 5.84 3.34
## 3: 0.90 Fair D SI2 66.9 57 2885 6.02 5.90 3.99
## 4: 1.00 Fair D SI2 69.3 58 2974 5.96 5.87 4.10
## 5: 1.01 Fair D SI2 64.6 56 3003 6.31 6.24 4.05
## 6: 0.73 Fair D VS1 66.0 54 3047 5.56 5.66 3.70
# diamondsDT[J("Ideal"), c("E","D")]
The primary benefit of indexing is faster aggregation. While aggregate and various d*ply functions will work because data.table are just enhanced data.frames but they will be slower than using the built-in aggregate functionality of the data.tables. Let us quickly compare the syntactic difference between the built-in and external aggregate functionality.
To calculate the mean price of diamonds for each type of cut, let us use the aggregate function which we used in our previous article.
aggregate(price~cut, diamonds, mean)
## cut price
## 1 Fair 4358.758
## 2 Good 3928.864
## 3 Very Good 3981.760
## 4 Premium 4584.258
## 5 Ideal 3457.542
To get the same result using the data.table aggregate functionality we use the following code.
diamondsDT[,mean(price), by=cut]
## cut V1
## 1: Fair 4358.758
## 2: Good 3928.864
## 3: Very Good 3981.760
## 4: Premium 4584.258
## 5: Ideal 3457.542
The only difference between the two results is that columns have different names. In data.tables , the default name V1is assigned to the aggregated column. To specify the name of the resulting column, pass the aggregation function as a named list. To aggregate on multiple columns, specify them as list()
diamondsDT[,list(price=mean(price)), by=cut]
## cut price
## 1: Fair 4358.758
## 2: Good 3928.864
## 3: Very Good 3981.760
## 4: Premium 4584.258
## 5: Ideal 3457.542
## aggreagate on multiple columns
diamondsDT[,list(price=mean(price)), by=list(cut, color)]
## cut color price
## 1: Fair D 4291.061
## 2: Fair E 3682.312
## 3: Fair F 3827.003
## 4: Fair G 4239.255
## 5: Fair H 5135.683
## 6: Fair I 4685.446
## 7: Fair J 4975.655
## 8: Good D 3405.382
## 9: Good E 3423.644
## 10: Good F 3495.750
## 11: Good G 4123.482
## 12: Good H 4276.255
## 13: Good I 5078.533
## 14: Good J 4574.173
## 15: Very Good D 3470.467
## 16: Very Good E 3214.652
## 17: Very Good F 3778.820
## 18: Very Good G 3872.754
## 19: Very Good H 4535.390
## 20: Very Good I 5255.880
## 21: Very Good J 5103.513
## 22: Premium D 3631.293
## 23: Premium E 3538.914
## 24: Premium F 4324.890
## 25: Premium G 4500.742
## 26: Premium H 5216.707
## 27: Premium I 5946.181
## 28: Premium J 6294.592
## 29: Ideal D 2629.095
## 30: Ideal E 2597.550
## 31: Ideal F 3374.939
## 32: Ideal G 3720.706
## 33: Ideal H 3889.335
## 34: Ideal I 4451.970
## 35: Ideal J 4918.186
## cut color price
And to aggregate multiple arguments, pass them as list. Unlike with aggregate function, a different metric can be measured for each column.
diamondsDT[,list(price=mean(price), carat=mean(carat), caratSum=sum(carat)), by=list(cut, color)]
## cut color price carat caratSum
## 1: Fair D 4291.061 0.9201227 149.98
## 2: Fair E 3682.312 0.8566071 191.88
## 3: Fair F 3827.003 0.9047115 282.27
## 4: Fair G 4239.255 1.0238217 321.48
## 5: Fair H 5135.683 1.2191749 369.41
## 6: Fair I 4685.446 1.1980571 209.66
## 7: Fair J 4975.655 1.3411765 159.60
## 8: Good D 3405.382 0.7445166 492.87
## 9: Good E 3423.644 0.7451340 695.21
## 10: Good F 3495.750 0.7759296 705.32
## 11: Good G 4123.482 0.8508955 741.13
## 12: Good H 4276.255 0.9147293 642.14
## 13: Good I 5078.533 1.0572222 551.87
## 14: Good J 4574.173 1.0995440 337.56
## 15: Very Good D 3470.467 0.6964243 1053.69
## 16: Very Good E 3214.652 0.6763167 1623.16
## 17: Very Good F 3778.820 0.7409612 1603.44
## 18: Very Good G 3872.754 0.7667986 1762.87
## 19: Very Good H 4535.390 0.9159485 1670.69
## 20: Very Good I 5255.880 1.0469518 1260.53
## 21: Very Good J 5103.513 1.1332153 768.32
## 22: Premium D 3631.293 0.7215471 1156.64
## 23: Premium E 3538.914 0.7177450 1677.37
## 24: Premium F 4324.890 0.8270356 1927.82
## 25: Premium G 4500.742 0.8414877 2460.51
## 26: Premium H 5216.707 1.0164492 2398.82
## 27: Premium I 5946.181 1.1449370 1634.97
## 28: Premium J 6294.592 1.2930941 1044.82
## 29: Ideal D 2629.095 0.5657657 1603.38
## 30: Ideal E 2597.550 0.5784012 2257.50
## 31: Ideal F 3374.939 0.6558285 2509.20
## 32: Ideal G 3720.706 0.7007146 3422.29
## 33: Ideal H 3889.335 0.7995249 2490.52
## 34: Ideal I 4451.970 0.9130291 1910.97
## 35: Ideal J 4918.186 1.0635937 952.98
## cut color price carat caratSum
Data wrangling is a very important step in any analytics pipeline. Sometimes it is the end goal, and other times it is in preparation for applying more advanced methods. No matter the reason, there are plenty of functions to make it possible. These include aggregate, apply family in R and group by functionality in data.table.