Group Manipulation

A general rule of thumb for data analysis is that manipulating the data, or data munging, consumes about 80% of the effort. This often requires repeating the same operation on different sections of the data, a pattern known as split-apply-combine. That is, we split the data into discrete sections based on some metric, apply a transformation of some kind to each section, and then combine all the sections back together. There are many ways to iterate over data in R, and we will look at some of the most convenient ones.
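As a rough illustration of the three steps, here is a minimal base R sketch. The scores data frame and its column names are made up for this example, and lapply is introduced properly later in this section.

```r
# Hypothetical toy data: one value per observation, tagged with a group label
scores <- data.frame(group = c("a", "b", "a", "b", "a"),
                     value = c(1, 2, 3, 6, 5))

# Split: one numeric vector per group
pieces <- split(scores$value, scores$group)

# Apply: compute the mean of each piece
means <- lapply(pieces, mean)

# Combine: collapse the list back into a single named vector
result <- unlist(means)
result
```

Each function covered below bundles some or all of these steps into a single call.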

Apply Family

R has a built-in apply function along with a family of relatives such as tapply, lapply, sapply and mapply. Let’s see how the most commonly used members are applied when manipulating data.

apply

apply is usually the first member of this family that users learn, and it is also the most restrictive. It must be used on a matrix, meaning all of the elements must be of the same type, whether character, numeric or logical. If used on some other object, such as a data.frame, the object is converted to a matrix first.
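The conversion can have surprising effects when a data.frame mixes types. The following sketch uses a made-up data frame named mixed to show that every column reaches the function as character once apply has turned the object into a matrix.

```r
# Hypothetical data frame mixing a numeric and a character column
mixed <- data.frame(id = 1:3, label = c("x", "y", "z"))

# apply() converts the data frame to a matrix first, so even the
# numeric id column reaches the function as character
types <- apply(mixed, 2, class)
types
```

For column-wise work on a data.frame with mixed types, lapply or sapply (covered below) avoids this conversion.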

The first argument to apply is the object we are working with. The second argument is the margin to apply the function over, with 1 meaning to operate over the rows and 2 meaning to operate over the columns. The third argument is the function we want to apply. Any following arguments are passed on to that function.

To illustrate its use we start with a trivial example, summing the rows or columns of a matrix. Notice that this could alternatively be accomplished using the built-in rowSums and colSums, yielding the same results.

theMatrix <- matrix(1:9, nrow=3)
head(theMatrix)

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

apply - row sum

# 1 meaning to operate over the rows
apply(theMatrix, 1, sum)

## [1] 12 15 18

apply - column sum

# 2 meaning operating over the columns
apply(theMatrix, 2, sum)

## [1]  6 15 24
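As noted earlier, rowSums and colSums give the same totals. A short check, rebuilding the same matrix so the snippet stands on its own:

```r
# Same matrix as in the example above
theMatrix <- matrix(1:9, nrow = 3)

# Built-in shortcuts for row and column totals
rowTotals <- rowSums(theMatrix)
colTotals <- colSums(theMatrix)

# Compare the values with the apply() results
all(rowTotals == apply(theMatrix, 1, sum))
all(colTotals == apply(theMatrix, 2, sum))
```

The comparison uses == rather than identical because rowSums and colSums return doubles, while sum on an integer matrix returns integers.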

apply - row sum with missing values

Like most R functions that compute summaries, sum has an na.rm argument to handle missing values (NA) in a matrix or any other data type. Let’s put an NA into theMatrix.

theMatrix[2,1] <- NA
head(theMatrix)

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]   NA    5    8
## [3,]    3    6    9

apply(theMatrix, 1, sum)

## [1] 12 NA 18

Because apply passes any extra arguments on to the function, adding na.rm=TRUE makes sum ignore the missing value when calculating the row sums.

apply(theMatrix, 1, sum, na.rm=TRUE)

## [1] 12 13 18

head(theMatrix)

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]   NA    5    8
## [3,]    3    6    9

lapply - sum operation

lapply works similarly to apply, but it applies the function to each element of a list and returns the results as a list as well.

theList <- list(A=matrix(1:9, 3), B=1:5,C=matrix(1:4, 2), D=2)
lapply(theList, sum)

## $A
## [1] 45
## 
## $B
## [1] 15
## 
## $C
## [1] 10
## 
## $D
## [1] 2

Dealing with lists can feel cumbersome, so to return the result as a vector instead, sapply can be used in exactly the same way as lapply. Both functions also accept a plain vector as input, since it is coerced to a list element by element.

sapply - sum operation

sapply(theList, sum)

##  A  B  C  D 
## 45 15 10  2

# count n of characters in each name
theNames <- c("Jared", "Deb", "Paul")
sapply(theNames, nchar)

## Jared   Deb  Paul 
##     5     3     4

mapply

Perhaps the most overlooked yet useful member of the apply family is mapply, which applies a function to the corresponding elements of multiple lists. When confronted with this scenario, people often resort to a loop, which is not necessary. Let’s build two lists to illustrate mapply with an example. We use the built-in identical function to check, element by element, whether the two lists are identical.

# build two lists
firstList <- list(A=matrix(1:16,4),B=matrix(1:16,2),c(1:5))

secondList <- list(A=matrix(1:16,4),B=matrix(1:16,8),c(15:1))

# test element by element if they are identical
mapply(identical, firstList, secondList)

##     A     B       
##  TRUE FALSE FALSE

mapply can also take a user-defined function in place of a built-in one. Let’s build a simple function that adds the number of rows of each pair of corresponding elements in the two lists.

simpleFunc <- function(x,y) {
              NROW(x) + NROW(y)
              }
mapply(simpleFunc, firstList, secondList)

##  A  B    
##  8 10 20
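To make the comparison with a loop concrete, here is a small sketch on two made-up numeric vectors, xs and ys. It is only meant to show that the two approaches produce the same values.

```r
# Hypothetical inputs: two vectors of equal length
xs <- c(1, 2, 3)
ys <- c(10, 20, 30)

# Explicit loop: preallocate the result, then fill it element by element
loopResult <- numeric(length(xs))
for (i in seq_along(xs)) {
    loopResult[i] <- xs[i] + ys[i]
}

# mapply walks both inputs in parallel and simplifies the result to a vector
mapplyResult <- mapply(function(x, y) x + y, xs, ys)

all(loopResult == mapplyResult)
```

For plain addition the vectorized xs + ys would of course be simpler. mapply earns its place when the function is not vectorized, as with identical above.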

There are many other members of the apply family that either do not get used much or have been superseded by functions in the plyr family. They include:

tapply
rapply
eapply
vapply
by
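For reference, here is a brief sketch of two of these. The values and groups vectors are made up for illustration. tapply applies a function to subsets of a vector defined by a grouping vector, and vapply behaves like sapply but requires the type and length of each result to be declared up front.

```r
# tapply: mean of values within each group (hypothetical data)
values <- c(10, 20, 30, 40)
groups <- c("a", "b", "a", "b")
groupMeans <- tapply(values, groups, mean)
groupMeans

# vapply: like sapply, but each result must be a single integer here
theNames <- c("Jared", "Deb", "Paul")
nameLengths <- vapply(theNames, nchar, FUN.VALUE = integer(1))
nameLengths
```

Declaring FUN.VALUE makes vapply fail loudly if the function returns something unexpected, which can be useful inside other functions.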

aggregate

People who are used to SQL terminology generally want to run a group-by and aggregation as their first R task. The way to do this is to use the aptly named aggregate function. There are multiple ways to call aggregate, and we will look at the most convenient one, which uses formula notation.

Formulas consist of a left side and a right side separated by a tilde (~). This formula notation is similar to the one we used when creating graphics with ggplot2. The left side represents the variable that we want to make a calculation on, and the right side represents one or more variables that we want to group the calculation by. To demonstrate the usage of aggregate we once again turn to the diamonds data in ggplot2.

require(ggplot2)

## Loading required package: ggplot2

data(diamonds)
head(diamonds)

## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

As a first example, we will calculate the average price for each type of cut in the diamonds data. The first argument to aggregate is the formula specifying that price should be broken down by cut. The second argument is the data to use, in this case diamonds. The third argument is the function to apply to each subset of the data.

aggregate(price~cut, diamonds, mean)

##         cut    price
## 1      Fair 4358.758
## 2      Good 3928.864
## 3 Very Good 3981.760
## 4   Premium 4584.258
## 5     Ideal 3457.542
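For comparison, one of the other ways to call aggregate is to pass the values and a list of grouping vectors directly. The sketch below uses a small made-up data frame named sales instead of diamonds, so that it runs without ggplot2.

```r
# Hypothetical data standing in for diamonds
sales <- data.frame(cut = c("Fair", "Good", "Fair", "Good"),
                    price = c(100, 200, 300, 400))

# Formula interface, as used in this section
byFormula <- aggregate(price ~ cut, sales, mean)

# Non-formula interface: values first, grouping given as a named list
byList <- aggregate(sales$price, by = list(cut = sales$cut), FUN = mean)

byFormula
byList
```

Both calls compute the same group means. The non-formula call names its result column x, which is one reason the formula notation tends to be more convenient.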

Notice that we only specified the column name and did not have to identify the data because that is given in the second argument. After the third argument specifying the function, additional named arguments to that function can be passed as follows.

aggregate(price~cut, diamonds, mean, na.rm=TRUE)

##         cut    price
## 1      Fair 4358.758
## 2      Good 3928.864
## 3 Very Good 3981.760
## 4   Premium 4584.258
## 5     Ideal 3457.542
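The result is unchanged here because diamonds has no missing prices. With the formula interface there is a further subtlety: rows containing NA are dropped before the function is called, because the na.action argument defaults to na.omit. The sketch below uses a made-up data frame named toy to show the difference.

```r
# Hypothetical data with one missing price
toy <- data.frame(g = c("a", "a", "b", "b"),
                  price = c(1, NA, 3, 5))

# Default: the row with NA is removed before mean() ever sees it
dropped <- aggregate(price ~ g, toy, mean)

# Keep the NA rows: mean() now returns NA for group "a"
kept <- aggregate(price ~ g, toy, mean, na.action = na.pass)

# Keep the NA rows, but let mean() skip them via na.rm
keptNaRm <- aggregate(price ~ g, toy, mean, na.action = na.pass, na.rm = TRUE)
```

With a single variable on the left side, the first and third calls give the same answer. They can differ when several variables are combined on the left, since the default removes a row if any of those variables is missing.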

To group data by more than one variable, add the additional variable to the right side of the formula, separating it with a plus sign (+).

aggregate(price~cut + color, diamonds, mean)

##          cut color    price
## 1       Fair     D 4291.061
## 2       Good     D 3405.382
## 3  Very Good     D 3470.467
## 4    Premium     D 3631.293
## 5      Ideal     D 2629.095
## 6       Fair     E 3682.312
## 7       Good     E 3423.644
## 8  Very Good     E 3214.652
## 9    Premium     E 3538.914
## 10     Ideal     E 2597.550
## 11      Fair     F 3827.003
## 12      Good     F 3495.750
## 13 Very Good     F 3778.820
## 14   Premium     F 4324.890
## 15     Ideal     F 3374.939
## 16      Fair     G 4239.255
## 17      Good     G 4123.482
## 18 Very Good     G 3872.754
## 19   Premium     G 4500.742
## 20     Ideal     G 3720.706
## 21      Fair     H 5135.683
## 22      Good     H 4276.255
## 23 Very Good     H 4535.390
## 24   Premium     H 5216.707
## 25     Ideal     H 3889.335
## 26      Fair     I 4685.446
## 27      Good     I 5078.533
## 28 Very Good     I 5255.880
## 29   Premium     I 5946.181
## 30     Ideal     I 4451.970
## 31      Fair     J 4975.655
## 32      Good     J 4574.173
## 33 Very Good     J 5103.513
## 34   Premium     J 6294.592
## 35     Ideal     J 4918.186

To aggregate two variables, they must be combined using cbind on the left side of the formula.

aggregate(cbind(price, carat) ~ cut + color, diamonds, mean)

##          cut color    price     carat
## 1       Fair     D 4291.061 0.9201227
## 2       Good     D 3405.382 0.7445166
## 3  Very Good     D 3470.467 0.6964243
## 4    Premium     D 3631.293 0.7215471
## 5      Ideal     D 2629.095 0.5657657
## 6       Fair     E 3682.312 0.8566071
## 7       Good     E 3423.644 0.7451340
## 8  Very Good     E 3214.652 0.6763167
## 9    Premium     E 3538.914 0.7177450
## 10     Ideal     E 2597.550 0.5784012
## 11      Fair     F 3827.003 0.9047115
## 12      Good     F 3495.750 0.7759296
## 13 Very Good     F 3778.820 0.7409612
## 14   Premium     F 4324.890 0.8270356
## 15     Ideal     F 3374.939 0.6558285
## 16      Fair     G 4239.255 1.0238217
## 17      Good     G 4123.482 0.8508955
## 18 Very Good     G 3872.754 0.7667986
## 19   Premium     G 4500.742 0.8414877
## 20     Ideal     G 3720.706 0.7007146
## 21      Fair     H 5135.683 1.2191749
## 22      Good     H 4276.255 0.9147293
## 23 Very Good     H 4535.390 0.9159485
## 24   Premium     H 5216.707 1.0164492
## 25     Ideal     H 3889.335 0.7995249
## 26      Fair     I 4685.446 1.1980571
## 27      Good     I 5078.533 1.0572222
## 28 Very Good     I 5255.880 1.0469518
## 29   Premium     I 5946.181 1.1449370
## 30     Ideal     I 4451.970 0.9130291
## 31      Fair     J 4975.655 1.3411765
## 32      Good     J 4574.173 1.0995440
## 33 Very Good     J 5103.513 1.1332153
## 34   Premium     J 6294.592 1.2930941
## 35     Ideal     J 4918.186 1.0635937

It is important to note from the above example that only one function can be supplied, and it is applied to all of the variables. To apply more than one function, it is easier to use the dplyr or data.table packages, which extend and enhance the capabilities of data.frames.
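If staying in base R is preferred, one possible workaround is to supply a single function that returns several values. This is only a sketch on a made-up data frame named toy, not the approach recommended above.

```r
# Hypothetical data standing in for diamonds
toy <- data.frame(cut = c("Fair", "Fair", "Good", "Good"),
                  price = c(100, 300, 200, 600))

# One function that returns two named summaries per group
stats <- aggregate(price ~ cut, toy,
                   function(x) c(mean = mean(x), max = max(x)))

# The price column is now a matrix with one column per summary
stats$price
```

The matrix-valued column is awkward to work with afterwards, which is part of why dplyr or data.table is usually the more comfortable choice.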

Aggregating data is a very important step in the analysis process. Sometimes it is the end goal and other times it is the preparation for applying more advanced methods. In this exercise we have seen common methodologies to perform group manipulation in R.