Writing for and while loops is useful when programming, but not particularly convenient when working interactively on the command line. Multi-line expressions with curly braces are awkward to type and edit at the prompt. R has some functions that implement looping in a compact form to make your job as a data scientist easier.
- tapply(): Apply a function over subsets of a vector
- lapply(): Loop over a list and evaluate a function on each element
- sapply(): Same as lapply() but try to simplify the result
- apply(): Apply a function over the margins of an array
- mapply(): Multivariate version of lapply()
Note: The actual looping is done internally in C code for efficiency reasons.
tapply()
tapply() is used to apply a function over subsets of a vector. It can be thought of as a combination of split() and sapply() for vectors only. I’ve been told that the “t” in tapply() refers to “table”, but that is unconfirmed.
## function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
The arguments to tapply() are as follows:
- X is a vector
- INDEX is a factor or a list of factors (or else they are coerced to factors)
- FUN is a function to be applied
- ... contains other arguments to be passed to FUN
- simplify indicates whether the result should be simplified
Example 11: To understand how it works, let’s use the iris dataset, which is very famous in the world of machine learning. It is typically used to predict the species of a flower (setosa, versicolor, or virginica) from four measurements: the length and width of its sepals and petals.
data(iris) # load the iris dataset
tapply(iris$Sepal.Width, iris$Species, mean) # mean sepal width for each species
## setosa versicolor virginica
## 3.428 2.770 2.974
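The description of tapply() as a combination of split() and sapply() can be checked directly. The following is a minimal sketch using the built-in iris data; the variable names by_species, means_split, and means_tapply are chosen here for illustration and do not come from the original text.

```r
# Split the sepal widths into one vector per species, then average each piece.
by_species  <- split(iris$Sepal.Width, iris$Species)
means_split <- sapply(by_species, mean)

# The same computation in a single tapply() call.
means_tapply <- tapply(iris$Sepal.Width, iris$Species, mean)

# Compare the numbers (as.vector() drops names and dimensions first).
all.equal(as.vector(means_split), as.vector(means_tapply))
```

The two results differ only in their container: sapply() returns a named vector, while tapply() returns a one-dimensional array named by the factor levels.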
We can also take the group means without simplifying the result, which will give us a list. For functions that return a single value, usually, this is not what we want, but it can be done.
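The call that produced the list shown below is not included in the text. It was presumably a tapply() call with simplify = FALSE, along these lines (an assumption inferred from the output; species_means is an illustrative name):

```r
# Assumed call: same grouping as before, but keep the result as a list
# instead of simplifying it to a named vector.
species_means <- tapply(iris$Sepal.Width, iris$Species, mean, simplify = FALSE)
species_means
```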
## $setosa
## [1] 3.428
##
## $versicolor
## [1] 2.77
##
## $virginica
## [1] 2.974
lapply()
This function is useful for performing operations on list objects. It returns a list of the same length as the input, each element of which is the result of applying FUN to the corresponding element of the input list.
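Example 12: Apply the mean() function to every element of a list. If the original list has names, the names will be preserved in the output. Here the list is the iris data frame, whose elements are its columns; Species is a factor, so mean() returns NA for it and issues a warning.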
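The call behind the first output below is not shown. It was presumably along these lines (an assumption, inferred from the column names and the warning in the output; col_means is an illustrative name):

```r
# Assumed call: a data frame is a list of columns, so lapply() visits each column.
# The Species column is a factor, so mean() warns and returns NA for it.
col_means <- lapply(iris, mean)
col_means
```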
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## $Sepal.Length
## [1] 5.843333
##
## $Sepal.Width
## [1] 3.057333
##
## $Petal.Length
## [1] 3.758
##
## $Petal.Width
## [1] 1.199333
##
## $Species
## [1] NA
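The second output, which follows, looks like summary() applied to each column of iris. A sketch of a call that would produce it, again an assumption since the original code is not shown (col_summaries is an illustrative name):

```r
# Assumed call: summary() handles both numeric columns (six statistics)
# and the Species factor (a count per level), so no warning is raised.
col_summaries <- lapply(iris, summary)
col_summaries
```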
## $Sepal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
##
## $Sepal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.800 3.000 3.057 3.300 4.400
##
## $Petal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.600 4.350 3.758 5.100 6.900
##
## $Petal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.199 1.800 2.500
##
## $Species
## setosa versicolor virginica
## 50 50 50
Example 13: Convert the strings in a character vector to lower case with the tolower() function. We construct a vector with the names of some famous movies, written in upper case.
movies <- c("FOUNDATION",
"AVENGERS",
"HAMILTON",
"CHINATOWN") # create a vector of famous movies
movies_lower <- lapply(movies, tolower) # convert each title to lower case with `tolower`
str(movies_lower) # let's see the result
## List of 4
## $ : chr "foundation"
## $ : chr "avengers"
## $ : chr "hamilton"
## $ : chr "chinatown"
movies_lower <- unlist(lapply(movies, tolower)) # `unlist()` converts the list into a vector
str(movies_lower)
## chr [1:4] "foundation" "avengers" "hamilton" "chinatown"
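sapply()
The sapply() function behaves similarly to lapply(); the only real difference is in the return value. sapply() will try to simplify the result of lapply() if possible. Essentially, sapply() calls lapply() on its input and then applies the following algorithm:
- If the result is a list where every element has length 1, a vector is returned
- If the result is a list where every element is a vector of the same length (> 1), a matrix is returned
- If it cannot figure things out, a list is returned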
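The first output below appears to come from repeating the earlier mean() example with sapply() instead of lapply(). A sketch of that call, offered as an assumption because the original code is not shown (col_means is an illustrative name):

```r
# Assumed call: every element of the result has length 1,
# so sapply() simplifies the list to a named vector.
col_means <- sapply(iris, mean)
col_means
```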
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 5.843333 3.057333 3.758000 1.199333 NA
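The list output that follows is consistent with sapply(iris, summary). The sketch below shows that assumed call, plus a variant restricted to the numeric columns to illustrate when simplification succeeds; all_summ and num_summ are illustrative names.

```r
# Assumed call behind the list output: the elements have different lengths
# (six statistics for each numeric column, three counts for Species),
# so sapply() cannot simplify and returns a list.
all_summ <- sapply(iris, summary)
is.list(all_summ)

# Restricting to the four numeric columns gives equal-length results,
# which sapply() simplifies to a matrix with one column per variable.
num_summ <- sapply(iris[1:4], summary)
dim(num_summ)
```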
## $Sepal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
##
## $Sepal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.800 3.000 3.057 3.300 4.400
##
## $Petal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.600 4.350 3.758 5.100 6.900
##
## $Petal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.199 1.800 2.500
##
## $Species
## setosa versicolor virginica
## 50 50 50
apply()
The apply() function takes a matrix or array as input (a data frame is first coerced to a matrix) and returns a vector, array, or list. It is mainly used to avoid writing explicit loops, and it is the most basic of the loop functions that operate on matrices.
This function takes 3 arguments:
## function (X, MARGIN, FUN, ...)
The arguments to apply() are:
- X is an array, most commonly a matrix (a data frame is coerced to a matrix first)
- MARGIN is an integer vector indicating which margins should be “retained”
- FUN is a function to be applied
- ... is for other arguments to be passed to FUN
Example 11: Let’s create a 20 by 10 matrix of Normal random numbers and compute the mean of each row. We can also compute the sum of each column.
df <- matrix(rnorm(200), 20, 10) # create a matrix of Normal random numbers
apply(df, 1, mean) # take the mean of each row
## [1] 0.37812656 -0.04697510 -0.12034587 -0.47662701 0.21143608 0.03105009
## [7] -0.20014427 0.01768803 -0.10146497 -0.38899155 -0.41601038 0.17504550
## [13] 0.07270976 0.54403841 0.29829480 -0.30733061 -0.10510197 -0.31849960
## [19] -0.08526944 0.40215902
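The ten values printed next are the column sums, but the call that produced them is not shown. A sketch of what it likely was, assuming MARGIN = 2; the set.seed() call is added here only to make the sketch reproducible, so its numbers will not match the output in the text.

```r
set.seed(1)                        # assumption: a seed, not part of the original
df <- matrix(rnorm(200), 20, 10)   # same shape as the matrix in the example

# MARGIN = 2 keeps the columns, so the function is applied to each column.
col_sums <- apply(df, 2, sum)
col_sums
```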
## [1] -0.7080868 -5.5325139 -6.7019855 -1.4470669 1.2145389 3.3273954
## [7] -3.4342488 -2.9661638 3.0117013 8.8743049
You have probably noticed that the second argument, MARGIN, is either 1 or 2, depending on whether we want row statistics or column statistics. For the special cases of row/column sums and row/column means of matrices, R provides some useful shortcuts: rowSums(), rowMeans(), colSums(), and colMeans().
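As a quick check of the row/column shortcuts in base R, the sketch below compares each one with the equivalent apply() call. The matrix m and the seed are illustrative assumptions.

```r
set.seed(1)                       # assumption: a seed, for reproducibility only
m <- matrix(rnorm(200), 20, 10)

# Each shortcut matches the corresponding apply() call.
all.equal(rowSums(m),  apply(m, 1, sum))
all.equal(rowMeans(m), apply(m, 1, mean))
all.equal(colSums(m),  apply(m, 2, sum))
all.equal(colMeans(m), apply(m, 2, mean))
```

The shortcut functions are optimized for this job, so they are usually preferable to apply() when only sums or means are needed.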
Example 12: Looking back at the normalization problem, we can solve it more simply using apply():
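The code for this example is not shown. Every column of the output below contains exactly one 0 and one 1, which suggests min-max scaling applied column by column. The sketch below is written under that assumption; the helper name normalize and the seed are hypothetical, so the numbers will differ from those printed.

```r
set.seed(1)                        # assumption: a seed, not part of the original
df <- matrix(rnorm(200), 20, 10)

# Hypothetical helper: rescale a vector to the range [0, 1].
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply the helper to each column; the result is again a 20 by 10 matrix.
df_norm <- apply(df, 2, normalize)
df_norm
```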
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.18080867 0.17403879 0.73459541 0.95115289 0.11239496 0.95293687
## [2,] 0.58399623 0.52503311 0.91561740 0.29993595 0.35684653 0.62785641
## [3,] 0.70162041 0.41205415 0.01343519 0.00000000 0.85194426 0.95203787
## [4,] 0.20228303 0.36716141 0.65881663 0.39126386 0.36110459 0.83272590
## [5,] 0.33096442 0.65063722 0.62807406 0.46707176 0.74269133 0.04012659
## [6,] 0.60320110 0.63061982 0.94642765 0.36557589 0.66315671 0.76603052
## [7,] 0.61051899 0.76028796 0.78244427 0.84674166 0.03297843 0.26567673
## [8,] 0.59845094 0.00000000 0.75979144 0.18597771 0.38647851 0.72739541
## [9,] 0.28517928 0.77507834 0.88400938 0.06839108 0.55820173 0.49861063
## [10,] 0.00000000 0.27366293 0.49574653 0.42238002 0.27682369 0.57279132
## [11,] 0.53605765 0.01135163 0.98387815 0.24791585 0.07044870 0.57146991
## [12,] 0.73579815 0.90219351 0.61997177 0.41664957 0.50721270 1.00000000
## [13,] 0.31528780 0.29457945 0.58270419 0.21594491 1.00000000 0.86796961
## [14,] 0.93886348 1.00000000 0.79584621 0.44599058 0.46257792 0.31151128
## [15,] 1.00000000 0.74416613 0.44015351 1.00000000 0.00000000 0.85266802
## [16,] 0.55165968 0.52871169 0.00000000 0.81180503 0.37342425 0.77003841
## [17,] 0.46754917 0.11899053 1.00000000 0.72832518 0.52170323 0.64993031
## [18,] 0.09187965 0.61572930 0.87936793 0.45705383 0.57730896 0.00000000
## [19,] 0.56150004 0.66053041 0.43281322 0.57035087 0.13880386 0.46216378
## [20,] 0.90998238 0.44683973 0.77485489 0.36632792 0.39657184 0.65473567
## [,7] [,8] [,9] [,10]
## [1,] 0.4256441 1.00000000 0.5968190 0.9350251
## [2,] 0.0000000 0.68880691 0.3639969 0.7052689
## [3,] 0.8597050 0.43583086 0.7758602 0.0955294
## [4,] 0.1866723 0.18514326 0.3028439 0.4039993
## [5,] 0.6946319 0.74529672 0.9216672 0.5858581
## [6,] 0.4606115 0.19301888 0.1705919 0.5051513
## [7,] 0.5788022 0.17032499 0.1336596 0.5283064
## [8,] 0.5483426 0.41451479 1.0000000 0.6678390
## [9,] 0.2114875 0.47338071 0.4490086 0.7878694
## [10,] 0.3832997 0.55588216 0.6101156 0.5240210
## [11,] 0.0682909 0.26680581 0.8626053 0.4910204
## [12,] 0.4688284 0.00000000 0.2710585 0.8819621
## [13,] 0.4867323 0.47846611 0.7481975 0.3905076
## [14,] 0.6651239 0.53879498 0.6905854 1.0000000
## [15,] 1.0000000 0.30985569 0.2422383 0.5787552
## [16,] 0.3910689 0.56905442 0.4516361 0.0000000
## [17,] 0.3317161 0.65549300 0.0000000 0.2257495
## [18,] 0.5184556 0.03959854 0.7361196 0.4466048
## [19,] 0.8095626 0.46096612 0.6592886 0.4078566
## [20,] 0.4095363 0.95312445 0.5991016 0.8155328
mapply()
The mapply() function is a multivariate apply of sorts which applies a function in parallel over a set of arguments. Recall that lapply() and friends only iterate over a single R object. What if you want to iterate over multiple R objects in parallel? This is what mapply() is for.
## function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
The arguments to mapply() are:
- FUN is a function to apply
- ... contains R objects to apply over
- SIMPLIFY indicates whether the result should be simplified
The mapply() function has a different argument order from lapply() because the function to apply comes first rather than the object to iterate over. The R objects over which we apply the function are given in the ... argument because we can apply over an arbitrary number of R objects.
Example 13: The list list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1)) is tedious to type by hand. With mapply() the same list can be created in a single call:
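The call itself is missing from the text. A sketch that produces the list printed below, assuming rep() is applied in parallel over the values 1:4 and the repeat counts 4:1 (reps is an illustrative name):

```r
# Pair each value with its repeat count: rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1).
reps <- mapply(rep, 1:4, 4:1)
reps
```

Because the four results have different lengths, mapply() cannot simplify them and returns a list.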
## [[1]]
## [1] 1 1 1 1
##
## [[2]]
## [1] 2 2 2
##
## [[3]]
## [1] 3 3
##
## [[4]]
## [1] 4
Example 14: Simulate several sets of random Normal variables with mapply(). The last output below is a list of five sets with lengths 1 through 5.
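The code behind the three outputs below is not shown. A plausible reconstruction, offered as an assumption: a small wrapper around rnorm() (the name noise is hypothetical) is called once with scalar arguments, once with vector arguments, and finally through mapply(). The names sims and the seed are likewise illustrative, so the numbers will not match those printed.

```r
set.seed(1)   # assumption: a seed, for reproducibility only

# Hypothetical helper: draw n Normal values with the given mean and sd.
noise <- function(n, mean, sd) {
  rnorm(n, mean, sd)
}

noise(5, 1, 2)        # one set of 5 values with mean 1

# Passing vectors does not give 5 separate sets: rnorm() treats a
# length-5 n as "draw 5 values" and recycles the means 1:5.
noise(1:5, 1:5, 2)

# mapply() calls noise(1, 1, 2), noise(2, 2, 2), ..., noise(5, 5, 2).
sims <- mapply(noise, 1:5, 1:5, 2)
sims
```

The third argument, 2, has length 1 and is recycled across all five calls.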
## [1] 0.2692832 -0.8674708 2.5692412 3.0410126 -1.5982717
## [1] 0.933190 2.649508 4.712723 4.176588 1.710444
## [[1]]
## [1] 3.365356
##
## [[2]]
## [1] 7.870619 1.175211
##
## [[3]]
## [1] 2.7029182 0.5991706 0.6066444
##
## [[4]]
## [1] 5.355777 4.964463 2.769435 2.584872
##
## [[5]]
## [1] 5.849386 7.802966 4.308933 10.600763 4.568853