Advanced Data Wrangling

If you are a speed junky; you prefer speed over readability of the code and deal with a very large dataset on a daily basis then you are in the right place to learn the right tool. There is a package called data.table in R that extends the functionality of data.frames. The syntax is a little different from regular data.frames, so it will take time to get used to, which is probably the primary reason it has not been near-universal adoption.

And data.table is not the only package that helps in data frame manipulation. There is another excellent package by Hadley Wickham called dplyr. It has its own grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation.

filter() to select cases based on their values.
arrange() to reorder the cases.
select() and rename() to select variables based on their names.
mutate() and transmute() to add new variables that are functions of existing variables.
summarise() to condense multiple values to a single value.
sample_n() and sample_frac() to take random samples.

There is always a debate between usability, readability, and speed between these two packages. A StackOverflow discussion is worth following to know more about what one package can do and the other cannot. In general, if you are looking for speed then data.tables is most preferred but the readability of the code is your major concern then dplyr is the way to fo

In this exercise, our focus in on data.table and the secret to the speed is that data.table has an index like a database. This allows faster accessing, group by operation and joins.

Creating data.table is just like creating data.frames , and the two are very similar.

require(data.table)

## Loading required package: data.table

## create a regular data.frame
theDf <- data.frame(A=1:10,B=letters[1:10],C=LETTERS[1:10],D=rep(c("One","Two","Three"),length.out=10))

## create a data.table
theDt = data.table(A=1:10,B=letters[1:10],C=LETTERS[1:10],D=rep(c("One","Two","Three"),length.out=10))
## print and compare
theDt

##      A B C     D
##  1:  1 a A   One
##  2:  2 b B   Two
##  3:  3 c C Three
##  4:  4 d D   One
##  5:  5 e E   Two
##  6:  6 f F Three
##  7:  7 g G   One
##  8:  8 h H   Two
##  9:  9 i I Three
## 10: 10 j J   One

It can be noticed below that by default data.frame turns character data into factors while data.table does not.

class(theDf$B)

## [1] "factor"

class(theDt$B)

## [1] "character"

The data are identical except that data.frame turned B into a factor while data.table did not.

It is also possible to create a data.table out of an existing data.frame as given below.

require(ggplot2)

## Loading required package: ggplot2

diamondsDT <- data.table(diamonds)
diamondsDT

##        carat       cut color clarity depth table price    x    y    z
##     1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
##     2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
##     3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
##     4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
##     5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
##    ---                                                               
## 53936:  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
## 53937:  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
## 53938:  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
## 53939:  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
## 53940:  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64

Now let us see how to perform some common and advanced data manipulation using data.table.

Accessing Rows

Accessing the rows can be done similarly to accessing rows in data.frame. The following code subset the data to the first two rows.

theDt[1:2,]

##    A B C   D
## 1: 1 a A One
## 2: 2 b B Two

To filter rows based on some condition we can use the following one-liner. Although the one-liner is the valid syntax, it is not necessarily efficient syntax. That line creates a vector of length nrow=10 consisting of True or False entries, which is a vector scan.

After we create a key for the data.tables we can use different syntax to pick rows through binary search, which will be much faster and is covered in the coming section.

theDt[theDt$A >=7,]

##     A B C     D
## 1:  7 g G   One
## 2:  8 h H   Two
## 3:  9 i I Three
## 4: 10 j J   One

Accessing Columns

Accessing individual columns must be done a little differently than accessing columns in data.frames. In data.frames to subset, multiple column names should be specified as character vectors. With data.tables the columns should be specified as a list of actual names, not as characters.

theDt[,list(A,C)]

##      A C
##  1:  1 A
##  2:  2 B
##  3:  3 C
##  4:  4 D
##  5:  5 E
##  6:  6 F
##  7:  7 G
##  8:  8 H
##  9:  9 I
## 10: 10 J

If we must specify the column names as characters (perhaps because they were passed as arguments to a function), the with argument should be set to FALSE.

theDt[,c("A","C"), with=FALSE]

##      A C
##  1:  1 A
##  2:  2 B
##  3:  3 C
##  4:  4 D
##  5:  5 E
##  6:  6 F
##  7:  7 G
##  8:  8 H
##  9:  9 I
## 10: 10 J

This time we used a vector to hold the column names instead of a list. These nuances are important to proper functions of data.table but can lead to a great deal of frustration.

Keys

Now that we have a few data.tables in memory, we might be interested in seeing some information about them.

## show tables
tables()

##          NAME   NROW NCOL MB                                    COLS KEY
## 1: diamondsDT 53,940   10  3 carat,cut,color,clarity,depth,table,...    
## 2:      theDt     10    4  0                                 A,B,C,D    
## Total: 3MB

This shows for each data.table in memory, the name, the number of rows, the size in megabytes, the column names and the key. We have not assigned keys for any of the tables so that column is blank as of now. The key is used to index the data.table and will provide the extra speed.

We start by adding the key to theDt. We will use the D column to index the data.table . This is done using set.key , which takes the name of the data.table as its first argument and the name of the desired column (without quotes, as is consistent with column section) as the second argument.

## set the key
setkey(theDt, D)

## show the data.table again
theDt

##      A B C     D
##  1:  1 a A   One
##  2:  4 d D   One
##  3:  7 g G   One
##  4: 10 j J   One
##  5:  3 c C Three
##  6:  6 f F Three
##  7:  9 i I Three
##  8:  2 b B   Two
##  9:  5 e E   Two
## 10:  8 h H   Two

The data have been reordered according to column D , which is sorted alphabetically. We can confirm the key was set with key function.

key(theDt)

## [1] "D"

This adds some new functionality to selecting rows from data.tables. In addition to selecting rows by the row number or by some expression that evaluates to TRUE or FALSE, a value of the key column can be specified.

theDt[c("One","Two"),]

##     A B C   D
## 1:  1 a A One
## 2:  4 d D One
## 3:  7 g G One
## 4: 10 j J One
## 5:  2 b B Two
## 6:  5 e E Two
## 7:  8 h H Two

More than one column can also be set as the key in data.tables as follows.

setkey(diamondsDT, cut, color)

To access rows according to both keys, there is a special function J. It takes multiple arguments, each of which is the vector of values to select.

head(diamondsDT)

##    carat  cut color clarity depth table price    x    y    z
## 1:  0.75 Fair     D     SI2  64.6    57  2848 5.74 5.72 3.70
## 2:  0.71 Fair     D     VS2  56.9    65  2858 5.89 5.84 3.34
## 3:  0.90 Fair     D     SI2  66.9    57  2885 6.02 5.90 3.99
## 4:  1.00 Fair     D     SI2  69.3    58  2974 5.96 5.87 4.10
## 5:  1.01 Fair     D     SI2  64.6    56  3003 6.31 6.24 4.05
## 6:  0.73 Fair     D     VS1  66.0    54  3047 5.56 5.66 3.70

# diamondsDT[J("Ideal"), c("E","D")]

Aggregation

The primary benefit of indexing is faster aggregation. While aggregate and various d*ply functions will work because data.table are just enhanced data.frames but they will be slower than using the built-in aggregate functionality of the data.tables. Let us quickly compare the syntactic difference between the built-in and external aggregate functionality.

To calculate the mean price of diamonds for each type of cut, let us use the aggregate function which we used in our previous article.

aggregate(price~cut, diamonds, mean)

##         cut    price
## 1      Fair 4358.758
## 2      Good 3928.864
## 3 Very Good 3981.760
## 4   Premium 4584.258
## 5     Ideal 3457.542

To get the same result using the data.table aggregate functionality we use the following code.

diamondsDT[,mean(price), by=cut]

##          cut       V1
## 1:      Fair 4358.758
## 2:      Good 3928.864
## 3: Very Good 3981.760
## 4:   Premium 4584.258
## 5:     Ideal 3457.542

The only difference between the two results is that columns have different names. In data.tables , the default name V1is assigned to the aggregated column. To specify the name of the resulting column, pass the aggregation function as a named list. To aggregate on multiple columns, specify them as list()

diamondsDT[,list(price=mean(price)), by=cut]

##          cut    price
## 1:      Fair 4358.758
## 2:      Good 3928.864
## 3: Very Good 3981.760
## 4:   Premium 4584.258
## 5:     Ideal 3457.542

## aggreagate on  multiple columns
diamondsDT[,list(price=mean(price)), by=list(cut, color)]

##           cut color    price
##  1:      Fair     D 4291.061
##  2:      Fair     E 3682.312
##  3:      Fair     F 3827.003
##  4:      Fair     G 4239.255
##  5:      Fair     H 5135.683
##  6:      Fair     I 4685.446
##  7:      Fair     J 4975.655
##  8:      Good     D 3405.382
##  9:      Good     E 3423.644
## 10:      Good     F 3495.750
## 11:      Good     G 4123.482
## 12:      Good     H 4276.255
## 13:      Good     I 5078.533
## 14:      Good     J 4574.173
## 15: Very Good     D 3470.467
## 16: Very Good     E 3214.652
## 17: Very Good     F 3778.820
## 18: Very Good     G 3872.754
## 19: Very Good     H 4535.390
## 20: Very Good     I 5255.880
## 21: Very Good     J 5103.513
## 22:   Premium     D 3631.293
## 23:   Premium     E 3538.914
## 24:   Premium     F 4324.890
## 25:   Premium     G 4500.742
## 26:   Premium     H 5216.707
## 27:   Premium     I 5946.181
## 28:   Premium     J 6294.592
## 29:     Ideal     D 2629.095
## 30:     Ideal     E 2597.550
## 31:     Ideal     F 3374.939
## 32:     Ideal     G 3720.706
## 33:     Ideal     H 3889.335
## 34:     Ideal     I 4451.970
## 35:     Ideal     J 4918.186
##           cut color    price

And to aggregate multiple arguments, pass them as list. Unlike with aggregate function, a different metric can be measured for each column.

diamondsDT[,list(price=mean(price), carat=mean(carat), caratSum=sum(carat)), by=list(cut, color)]

##           cut color    price     carat caratSum
##  1:      Fair     D 4291.061 0.9201227   149.98
##  2:      Fair     E 3682.312 0.8566071   191.88
##  3:      Fair     F 3827.003 0.9047115   282.27
##  4:      Fair     G 4239.255 1.0238217   321.48
##  5:      Fair     H 5135.683 1.2191749   369.41
##  6:      Fair     I 4685.446 1.1980571   209.66
##  7:      Fair     J 4975.655 1.3411765   159.60
##  8:      Good     D 3405.382 0.7445166   492.87
##  9:      Good     E 3423.644 0.7451340   695.21
## 10:      Good     F 3495.750 0.7759296   705.32
## 11:      Good     G 4123.482 0.8508955   741.13
## 12:      Good     H 4276.255 0.9147293   642.14
## 13:      Good     I 5078.533 1.0572222   551.87
## 14:      Good     J 4574.173 1.0995440   337.56
## 15: Very Good     D 3470.467 0.6964243  1053.69
## 16: Very Good     E 3214.652 0.6763167  1623.16
## 17: Very Good     F 3778.820 0.7409612  1603.44
## 18: Very Good     G 3872.754 0.7667986  1762.87
## 19: Very Good     H 4535.390 0.9159485  1670.69
## 20: Very Good     I 5255.880 1.0469518  1260.53
## 21: Very Good     J 5103.513 1.1332153   768.32
## 22:   Premium     D 3631.293 0.7215471  1156.64
## 23:   Premium     E 3538.914 0.7177450  1677.37
## 24:   Premium     F 4324.890 0.8270356  1927.82
## 25:   Premium     G 4500.742 0.8414877  2460.51
## 26:   Premium     H 5216.707 1.0164492  2398.82
## 27:   Premium     I 5946.181 1.1449370  1634.97
## 28:   Premium     J 6294.592 1.2930941  1044.82
## 29:     Ideal     D 2629.095 0.5657657  1603.38
## 30:     Ideal     E 2597.550 0.5784012  2257.50
## 31:     Ideal     F 3374.939 0.6558285  2509.20
## 32:     Ideal     G 3720.706 0.7007146  3422.29
## 33:     Ideal     H 3889.335 0.7995249  2490.52
## 34:     Ideal     I 4451.970 0.9130291  1910.97
## 35:     Ideal     J 4918.186 1.0635937   952.98
##           cut color    price     carat caratSum

Data wrangling is a very important step in any analytics pipeline. Sometimes it is the end goal, and other times it is in preparation for applying more advanced methods. No matter the reason, there are plenty of functions to make it possible. These include aggregate, apply family in R and group by functionality in data.table.

Advanced Data Wrangling

Rihad Variawa

11/9/2019

Accessing Rows

Accessing Columns

Keys

Aggregation