A general rule of thumb for data analysis is that manipulating the data, or data munging, consumes 80% of the effort. This often requires repeated operations on different sections of the data, a pattern known as split-apply-combine. That is, we split the data into discrete sections based on some metric, apply a transformation of some kind to each section, and then combine all the sections together. There are many ways to iterate over data in R, and we will look at some of the most convenient ones.
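As a rough illustration of the three steps, here is a minimal base R sketch. The `scores` data frame and its columns are hypothetical, made up only for this example.

```r
# Hypothetical toy data: a few values belonging to two groups
scores <- data.frame(
    group = c("a", "a", "b", "b", "b"),
    value = c(1, 3, 2, 4, 6),
    stringsAsFactors = FALSE
)

# split: break the values into one vector per group
pieces <- split(scores$value, scores$group)

# apply: compute a summary for each piece
means <- lapply(pieces, mean)

# combine: collapse the per-group results into one named vector
combined <- unlist(means)
combined
```

The functions covered in the rest of this section wrap some or all of these steps into a single call.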
R has the built-in apply function and its relatives, such as tapply, lapply, sapply and mapply. Let’s see how each function is used when manipulating data.
apply is the first member of this family that users usually learn, and it is also the most restrictive. It must be used on a matrix, meaning all of the elements must be of the same type, whether character, numeric or logical. If used on some other object, such as a data.frame, the object is converted to a matrix first.
The first argument to apply is the object we are working with. The second argument is the margin to apply the function over, with 1 meaning to operate over the rows and 2 meaning to operate over the columns. The third argument is the function we want to apply. Any following arguments are passed on to that function.
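The conversion to a matrix can have surprising effects on a data.frame with mixed column types. The sketch below uses a hypothetical data.frame named `mixed` to show that every column reaches the function as character.

```r
# Hypothetical data.frame with one integer and one character column
mixed <- data.frame(id = 1:2, label = c("x", "y"), stringsAsFactors = FALSE)

# Column classes as stored in the data.frame
originalClasses <- sapply(mixed, class)
originalClasses

# apply converts the data.frame to a matrix first, and a matrix holds
# a single type, so both columns are seen as character
coercedClasses <- apply(mixed, 2, class)
coercedClasses
```

For this reason apply is usually best reserved for objects that are already matrices or that hold a single type.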
To illustrate its use we start with a trivial example, summing the rows or columns of a matrix. Notice that this could alternatively be accomplished using the built-in rowSums and colSums, yielding the same results.
theMatrix <- matrix(1:9, nrow=3)
head(theMatrix)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
# 1 meaning to operate over the rows
apply(theMatrix, 1, sum)
## [1] 12 15 18
# 2 meaning to operate over the columns
apply(theMatrix, 2, sum)
## [1] 6 15 24
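The equivalence with rowSums and colSums can be checked directly. This is a small sketch that rebuilds the same matrix; all.equal is used rather than identical because apply with sum returns integers here, while rowSums and colSums return doubles.

```r
theMatrix <- matrix(1:9, nrow=3)

# Compare the apply results with the dedicated built-in functions
rowsMatch <- all.equal(apply(theMatrix, 1, sum), rowSums(theMatrix))
colsMatch <- all.equal(apply(theMatrix, 2, sum), colSums(theMatrix))
rowsMatch
colsMatch

# The values agree, but the storage types differ
typeof(apply(theMatrix, 1, sum))
typeof(rowSums(theMatrix))
```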
Like many R summary functions, sum has an na.rm argument for handling missing values (NA). Let’s add an NA to theMatrix to see how it affects apply.
theMatrix[2,1] <- NA
head(theMatrix)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]   NA    5    8
## [3,]    3    6    9
apply(theMatrix, 1, sum)
## [1] 12 NA 18
When na.rm=TRUE is added to the apply call, it is passed on to sum, which then ignores the missing values when calculating the sums over rows or columns.
apply(theMatrix, 1, sum, na.rm=TRUE)
## [1] 12 13 18
head(theMatrix)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]   NA    5    8
## [3,]    3    6    9
lapply works similarly to apply, but it applies the function to each element of a list and returns the results as a list as well.
theList <- list(A=matrix(1:9, 3), B=1:5,C=matrix(1:4, 2), D=2)
lapply(theList, sum)
## $A
## [1] 45
##
## $B
## [1] 15
##
## $C
## [1] 10
##
## $D
## [1] 2
Dealing with lists can feel cumbersome, so to return the result as a vector instead, sapply can be used in exactly the same way as lapply. Both functions also accept an ordinary vector as input, treating each element of the vector as if it were an element of a list.
sapply(theList, sum)
##  A  B  C  D
## 45 15 10  2
# count n of characters in each name
theNames <- c("Jared", "Deb", "Paul")
sapply(theNames, nchar)
## Jared   Deb  Paul
##     5     3     4
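One way to think of sapply is as lapply followed by a simplification step. The sketch below rebuilds theList and checks that idea, then shows that the simplified result is a matrix rather than a vector when each element returns more than one value.

```r
theList <- list(A=matrix(1:9, 3), B=1:5, C=matrix(1:4, 2), D=2)

# For length-one results, sapply matches unlist applied to lapply
viaSapply <- sapply(theList, sum)
viaLapply <- unlist(lapply(theList, sum))
identical(viaSapply, viaLapply)

# range returns two values per element, so sapply builds a matrix
# with one column per list element
ranges <- sapply(theList, range)
dim(ranges)
```

Because the shape of the result depends on what the function returns, it is worth checking the output of sapply before relying on it in later code.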
Perhaps the most overlooked yet useful member of the apply family is mapply, which applies a function to the corresponding elements of multiple lists. When confronted with this scenario, people often resort to a loop, which is not necessary. Let’s build two lists to demonstrate mapply. We use R’s built-in identical function to compare the two lists element by element.
# build two lists
firstList <- list(A=matrix(1:16,4),B=matrix(1:16,2),c(1:5))
secondList <- list(A=matrix(1:16,4),B=matrix(1:16,8),c(15:1))
# test element by element if they are identical
mapply(identical, firstList, secondList)
##     A     B
##  TRUE FALSE FALSE
mapply can also take a user-defined function in place of a built-in one. Let’s build a simple function that adds the number of rows of each pair of corresponding elements in the two lists.
simpleFunc <- function(x,y) {
    NROW(x) + NROW(y)
}
mapply(simpleFunc, firstList, secondList)
##  A  B
##  8 10 20
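For comparison, here is a sketch of the loop that mapply replaces in the identical example. The names loopResult and mapplyResult are introduced only for this illustration.

```r
firstList <- list(A=matrix(1:16,4), B=matrix(1:16,2), c(1:5))
secondList <- list(A=matrix(1:16,4), B=matrix(1:16,8), c(15:1))

# Loop version: preallocate the result, then fill it element by element
loopResult <- logical(length(firstList))
for (i in seq_along(firstList)) {
    loopResult[i] <- identical(firstList[[i]], secondList[[i]])
}

# mapply version: one call, with names carried over from firstList
mapplyResult <- mapply(identical, firstList, secondList)

# Both approaches give the same values
all(loopResult == unname(mapplyResult))
```

The mapply call is shorter and avoids the bookkeeping of preallocating and indexing the result.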
There are many other members of the apply family that either do not get used much or have been superseded by functions in the plyr family.
People accustomed to SQL terminology generally want to run a group-by aggregation as their first R task. The way to do this is to use the aptly named aggregate function. There are multiple ways to call aggregate, and we will use the most convenient one, which is formula notation.
Formulas consist of a left side and a right side separated by a tilde (~). The use of formulas is similar to how we created graphics using ggplot2. The left side represents the variable that we want to make a calculation on, and the right side represents one or more variables that we want to group the calculation by. To demonstrate aggregate we once again turn to the diamonds data in ggplot2.
require(ggplot2)
## Loading required package: ggplot2
data(diamonds)
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
As a first example, we will calculate the average price for each type of cut in the diamonds data. The first argument to aggregate is the formula specifying that price should be broken up by cut. The second argument is the data to use, in this case diamonds. The third argument is the function to apply to each subset of the data.
aggregate(price~cut, diamonds, mean)
## cut price
## 1 Fair 4358.758
## 2 Good 3928.864
## 3 Very Good 3981.760
## 4 Premium 4584.258
## 5 Ideal 3457.542
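The same kind of grouped mean can be cross-checked against tapply, one of the apply relatives mentioned earlier. To keep this sketch runnable without extra packages, it uses the built-in mtcars data instead of diamonds, which is a substitution made only for this example.

```r
# Formula interface: mean mpg for each number of cylinders
byFormula <- aggregate(mpg ~ cyl, mtcars, mean)
byFormula

# tapply takes the values, the grouping vector and the function
byTapply <- tapply(mtcars$mpg, mtcars$cyl, mean)
byTapply

# Same numbers, different containers: a data.frame versus a named array
all.equal(byFormula$mpg, as.vector(byTapply))
```

aggregate returns a data.frame, which is usually more convenient for further processing than the named array returned by tapply.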
Notice that we only specified the column name and did not have to identify the data because that is given in the second argument. After the third argument specifying the function, additional named arguments to that function can be passed as follows.
aggregate(price~cut, diamonds, mean, na.rm=TRUE)
## cut price
## 1 Fair 4358.758
## 2 Good 3928.864
## 3 Very Good 3981.760
## 4 Premium 4584.258
## 5 Ideal 3457.542
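The result is unchanged here because diamonds has no missing prices. There is also a subtlety worth knowing: with the formula interface, aggregate by default drops rows containing NA before the function is called (na.action = na.omit), so na.rm only matters if that default is changed. The sketch below uses a small hypothetical data.frame named `sales` to show the difference.

```r
# Hypothetical data with one missing amount in region "N"
sales <- data.frame(
    region = c("N", "N", "S", "S"),
    amount = c(10, NA, 20, 30),
    stringsAsFactors = FALSE
)

# Default: the row with NA is removed first, so mean never sees it
dropped <- aggregate(amount ~ region, sales, mean)

# Keep the NA rows: mean now returns NA for region "N"
kept <- aggregate(amount ~ region, sales, mean, na.action=na.pass)

# Keep the NA rows, but let mean skip them via na.rm
skipped <- aggregate(amount ~ region, sales, mean,
                     na.rm=TRUE, na.action=na.pass)

dropped
kept
skipped
```

For a single summarised variable the first and third calls agree; they can differ when several variables are aggregated at once, because the default removes a whole row if any of its variables is missing.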
To group data by more than one variable, add the additional variable to the right side of the formula, separating it with a plus sign (+).
aggregate(price~cut + color, diamonds, mean)
## cut color price
## 1 Fair D 4291.061
## 2 Good D 3405.382
## 3 Very Good D 3470.467
## 4 Premium D 3631.293
## 5 Ideal D 2629.095
## 6 Fair E 3682.312
## 7 Good E 3423.644
## 8 Very Good E 3214.652
## 9 Premium E 3538.914
## 10 Ideal E 2597.550
## 11 Fair F 3827.003
## 12 Good F 3495.750
## 13 Very Good F 3778.820
## 14 Premium F 4324.890
## 15 Ideal F 3374.939
## 16 Fair G 4239.255
## 17 Good G 4123.482
## 18 Very Good G 3872.754
## 19 Premium G 4500.742
## 20 Ideal G 3720.706
## 21 Fair H 5135.683
## 22 Good H 4276.255
## 23 Very Good H 4535.390
## 24 Premium H 5216.707
## 25 Ideal H 3889.335
## 26 Fair I 4685.446
## 27 Good I 5078.533
## 28 Very Good I 5255.880
## 29 Premium I 5946.181
## 30 Ideal I 4451.970
## 31 Fair J 4975.655
## 32 Good J 4574.173
## 33 Very Good J 5103.513
## 34 Premium J 6294.592
## 35 Ideal J 4918.186
To aggregate two variables, they must be combined using cbind on the left side of the formula.
aggregate(cbind(price, carat) ~ cut + color, diamonds, mean)
## cut color price carat
## 1 Fair D 4291.061 0.9201227
## 2 Good D 3405.382 0.7445166
## 3 Very Good D 3470.467 0.6964243
## 4 Premium D 3631.293 0.7215471
## 5 Ideal D 2629.095 0.5657657
## 6 Fair E 3682.312 0.8566071
## 7 Good E 3423.644 0.7451340
## 8 Very Good E 3214.652 0.6763167
## 9 Premium E 3538.914 0.7177450
## 10 Ideal E 2597.550 0.5784012
## 11 Fair F 3827.003 0.9047115
## 12 Good F 3495.750 0.7759296
## 13 Very Good F 3778.820 0.7409612
## 14 Premium F 4324.890 0.8270356
## 15 Ideal F 3374.939 0.6558285
## 16 Fair G 4239.255 1.0238217
## 17 Good G 4123.482 0.8508955
## 18 Very Good G 3872.754 0.7667986
## 19 Premium G 4500.742 0.8414877
## 20 Ideal G 3720.706 0.7007146
## 21 Fair H 5135.683 1.2191749
## 22 Good H 4276.255 0.9147293
## 23 Very Good H 4535.390 0.9159485
## 24 Premium H 5216.707 1.0164492
## 25 Ideal H 3889.335 0.7995249
## 26 Fair I 4685.446 1.1980571
## 27 Good I 5078.533 1.0572222
## 28 Very Good I 5255.880 1.0469518
## 29 Premium I 5946.181 1.1449370
## 30 Ideal I 4451.970 0.9130291
## 31 Fair J 4975.655 1.3411765
## 32 Good J 4574.173 1.0995440
## 33 Very Good J 5103.513 1.1332153
## 34 Premium J 6294.592 1.2930941
## 35 Ideal J 4918.186 1.0635937
It is important to note from the above example that only one function can be supplied, and hence applied to the variables. To apply more than one function, it is easier to use the dplyr or data.table packages, which extend and enhance the capabilities of data.frames.
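Within base R, one workaround is to supply a single function that returns several summaries at once. This is a sketch of that idea and not the dplyr or data.table approach; it uses the built-in mtcars data, and summaryStats is a helper name made up for this example.

```r
# Hypothetical helper: one function that returns two named summaries
summaryStats <- function(x) c(mean = mean(x), sd = sd(x))

stats <- aggregate(mpg ~ cyl, mtcars, summaryStats)

# The mpg column is now a matrix with one column per summary
colnames(stats$mpg)

# Flatten the matrix column into ordinary data.frame columns
flat <- do.call(data.frame, stats)
names(flat)
```

The matrix column is easy to overlook when printing, which is one reason the dedicated packages are often more comfortable for this task.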
Aggregating data is a very important step in the analysis process. Sometimes it is the end goal and other times it is the preparation for applying more advanced methods. In this exercise we have seen common methodologies to perform group manipulation in R.