1 Loop Functions

Writing for, while, and similar loops is useful when programming, but it is not particularly convenient when working interactively on the command line. Multi-line expressions with curly braces are awkward to type and sort through at the prompt. R has some functions which implement looping in a compact form to make your job as a data scientist easier.

  • tapply() Apply a function over subsets of a vector
  • lapply() Loop over a list and evaluate a function on each element
  • sapply() Same as lapply but try to simplify the result
  • apply() Apply a function over the margins of an array
  • mapply() Multivariate version of lapply

Note: The actual looping is done internally in C code for efficiency reasons.
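As a minimal sketch of the difference, the snippet below computes the mean of each numeric column of the built-in iris dataset twice: once with an explicit for loop and once with sapply(). The variable names (num_cols, means_loop, means_sapply) are illustrative choices, not part of the text above.

```r
# Keep only the four numeric columns of the built-in iris dataset
num_cols <- iris[, 1:4]

# Explicit loop: pre-allocate the result, then fill it column by column
means_loop <- numeric(ncol(num_cols))
names(means_loop) <- names(num_cols)
for (i in seq_along(num_cols)) {
  means_loop[i] <- mean(num_cols[[i]])
}

# Loop function: the same computation in a single expression
means_sapply <- sapply(num_cols, mean)

all.equal(means_loop, means_sapply)   # TRUE
```

Both versions give the same named vector; the loop function simply fits on one line, which is what makes it convenient at the prompt.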

1.1 tapply()

tapply() is used to apply a function over subsets of a vector. It can be thought of as a combination of split() and sapply() for vectors only. I’ve been told that the “t” in tapply() refers to “table”, but that is unconfirmed.

str(tapply)
## function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)

The arguments to tapply() are as follows:

  • X is a vector
  • INDEX is a factor or a list of factors (or else they are coerced to factors)
  • FUN is a function to be applied
  • ... contains other arguments to be passed to FUN
  • simplify indicates whether the result should be simplified
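The sketch below illustrates two of these arguments that the examples in this section do not otherwise exercise: INDEX given as a list of factors, and an extra argument forwarded through `...`. It assumes the built-in mtcars dataset (not used elsewhere in this section) for the two-factor case, and it introduces a missing value into a copy of the sepal widths purely for illustration.

```r
# mtcars is a built-in dataset; cyl = number of cylinders, am = transmission (0/1).
# With INDEX as a list of two factors, the result is a two-way table of group means.
mpg_by_group <- tapply(mtcars$mpg, list(cyl = mtcars$cyl, am = mtcars$am), mean)
dim(mpg_by_group)        # 3 cylinder levels by 2 transmission levels

# Arguments after FUN are forwarded through `...`. Here na.rm = TRUE tells
# mean() to skip the missing value that we introduce for illustration.
width <- iris$Sepal.Width
width[1] <- NA
species_means <- tapply(width, iris$Species, mean, na.rm = TRUE)
species_means
```

Without na.rm = TRUE, the mean for setosa would be NA, because the modified first observation belongs to that species.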

Example 11: To understand how it works, let’s use the iris dataset. This dataset is very famous in the world of machine learning. It is typically used to predict which of three iris species a flower belongs to: Setosa, Versicolor, or Virginica. For each flower, the dataset records the length and width of the sepal and the petal.

data(iris)                                        # load the `iris` dataset
tapply(iris$Sepal.Width, iris$Species, mean)      # mean sepal width for each species
##     setosa versicolor  virginica 
##      3.428      2.770      2.974

We can also take the group means without simplifying the result, which will give us a list. For functions that return a single value, usually, this is not what we want, but it can be done.

tapply(iris$Sepal.Width, iris$Species, mean, simplify = FALSE)
## $setosa
## [1] 3.428
## 
## $versicolor
## [1] 2.77
## 
## $virginica
## [1] 2.974
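To make the "split() plus sapply()" description concrete, here is a small sketch that reproduces the group means in two steps and compares them with the tapply() result. The intermediate names (widths_by_species, by_split, by_tapply) are made up for this illustration.

```r
# Step 1: split the vector into one piece per species
widths_by_species <- split(iris$Sepal.Width, iris$Species)   # list of 3 numeric vectors

# Step 2: apply the function to each piece and simplify
by_split  <- sapply(widths_by_species, mean)                 # named numeric vector
by_tapply <- tapply(iris$Sepal.Width, iris$Species, mean)    # 1-d array with names

all.equal(as.vector(by_tapply), unname(by_split))   # TRUE: same group means
identical(names(by_tapply), names(by_split))        # TRUE: same group labels
```

The only difference is the container: tapply() returns a one-dimensional array, while sapply() returns a plain named vector.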

1.2 lapply()

This function is useful for performing operations on list objects. It returns a list of the same length as the input, each element of which is the result of applying FUN to the corresponding element of the input list.

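Example 12: Apply the mean() function to all elements of a list. A data frame is a list of columns, so lapply() works on iris column by column. If the original list has names, the names will be preserved in the output. Note that Species is a factor, so mean() returns NA for it with a warning.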

lapply(iris, mean)                                # take the mean of each column of `iris`
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## $Sepal.Length
## [1] 5.843333
## 
## $Sepal.Width
## [1] 3.057333
## 
## $Petal.Length
## [1] 3.758
## 
## $Petal.Width
## [1] 1.199333
## 
## $Species
## [1] NA
lapply(iris, summary)                             # take the summary of each column of `iris`
## $Sepal.Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900 
## 
## $Sepal.Width
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.800   3.000   3.057   3.300   4.400 
## 
## $Petal.Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.600   4.350   3.758   5.100   6.900 
## 
## $Petal.Width
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.300   1.300   1.199   1.800   2.500 
## 
## $Species
##     setosa versicolor  virginica 
##         50         50         50
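The warning in the first call comes from asking for the mean of the Species factor. One possible way to avoid it, sketched below, is to select the numeric columns before calling lapply(). The helper names is_num and col_means are illustrative.

```r
# Logical vector with one entry per column: TRUE for numeric columns
is_num <- sapply(iris, is.numeric)

# Subsetting a data frame with a logical vector selects columns,
# so mean() is only called on the four numeric measurements
col_means <- lapply(iris[is_num], mean)
names(col_means)
```

The result is a list of four means with no NA entry and no warning.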

Example 13: Convert strings to lower case with the tolower() function. We construct a character vector with the names of famous movies, written in upper case.

movies <- c("FOUNDATION",
            "AVENGERS",
            "HAMILTON",
            "CHINATOWN")                          # create a vector of famous movies
movies_lower <- lapply(movies, tolower)           # lower case with `tolower` function
str(movies_lower)                                 # let's see the result
## List of 4
##  $ : chr "foundation"
##  $ : chr "avengers"
##  $ : chr "hamilton"
##  $ : chr "chinatown"
movies_lower <- unlist(lapply(movies, tolower))   # `unlist()` to convert the list into a vector
str(movies_lower)
##  chr [1:4] "foundation" "avengers" "hamilton" "chinatown"
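A side note on this example, offered as a sketch: tolower() is already vectorized, so for this particular task lapply() is optional. lapply() is more valuable when the function handles one element at a time, as with the hypothetical reverse_string() helper below, which is not part of the original example.

```r
movies <- c("FOUNDATION", "AVENGERS", "HAMILTON", "CHINATOWN")

# tolower() accepts a whole character vector, so no loop function is needed
direct <- tolower(movies)
identical(direct, unlist(lapply(movies, tolower)))   # TRUE

# A function written for a single string: split into letters, reverse, re-join
reverse_string <- function(s) paste(rev(strsplit(s, "")[[1]]), collapse = "")

# lapply() calls it once per movie and collects the results in a list
reversed <- lapply(movies, reverse_string)
unlist(reversed)
```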

1.3 sapply()

The sapply() function behaves similarly to lapply(); the only real difference is in the return value. sapply() will try to simplify the result of lapply() if possible. Essentially, sapply() calls lapply() on its input and then applies the following algorithm:

  • If the result is a list where every element is length 1, then a vector is returned.
  • If the result is a list where every element is a vector of the same length \((>1)\), a matrix is returned.
  • If it can’t figure things out, a list is returned.

sapply(iris, mean)                                # take the mean of each column of `iris`
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##     5.843333     3.057333     3.758000     1.199333           NA
sapply(iris, summary)                             # take the summary of each column of `iris`
## $Sepal.Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900 
## 
## $Sepal.Width
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.800   3.000   3.057   3.300   4.400 
## 
## $Petal.Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.600   4.350   3.758   5.100   6.900 
## 
## $Petal.Width
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.300   1.300   1.199   1.800   2.500 
## 
## $Species
##     setosa versicolor  virginica 
##         50         50         50
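The two calls above cover the vector case and the list fallback (the Species summary has a different length from the others). The sketch below shows all three outcomes side by side on small inputs; the names num_cols, maxima, ranges and mixed are illustrative.

```r
num_cols <- iris[, 1:4]              # the four numeric columns only

# Case 1: every result has length 1, so sapply() returns a named vector
maxima <- sapply(num_cols, max)

# Case 2: every result has the same length (> 1), so sapply() returns a matrix
# with one column per input element; range() always returns two values
ranges <- sapply(num_cols, range)
dim(ranges)                          # 2 rows (min, max) by 4 columns

# Case 3: the results differ in length, so sapply() falls back to a list
mixed <- sapply(1:3, function(i) seq_len(i))
is.list(mixed)                       # TRUE
```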

1.4 apply()

The apply() function takes a data frame or matrix as input and returns a vector, list, or array. It is primarily used to avoid writing explicit loops, and it is the most basic of the loop functions that operate on matrices.

This function takes three main arguments, plus ... for anything passed on to FUN:

str(apply)
## function (X, MARGIN, FUN, ...)

The arguments to apply() are:

  • X is an array or matrix (a data frame is first coerced to a matrix with as.matrix())
  • MARGIN is an integer vector indicating which margins should be “retained”; for a matrix, 1 indicates rows and 2 indicates columns.
  • FUN is a function to be applied
  • ... is for other arguments to be passed to FUN
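A tiny deterministic matrix makes the role of MARGIN and `...` easy to check by hand. This is only a sketch; the matrix m and the result names are chosen for illustration.

```r
m <- matrix(1:6, nrow = 2)           # 2 x 3 matrix, filled column by column:
                                     #   1 3 5
                                     #   2 4 6

row_sums <- apply(m, 1, sum)         # MARGIN = 1: one result per row    -> 9 12
col_sums <- apply(m, 2, sum)         # MARGIN = 2: one result per column -> 3 7 11

# Arguments after FUN are forwarded to FUN through `...`;
# here probs is passed on to quantile()
row_quartiles <- apply(m, 1, quantile, probs = c(0.25, 0.75))
row_quartiles                        # one column per row of m
```

Note that when FUN returns more than one value, the results are stored as columns, so row_quartiles has one column for each row of m.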

Example 11: Let’s create a 20 by 10 matrix of Normal random numbers. Then compute the mean of each row. You can also compute the sum of each column.

df <- matrix(rnorm(200), 20, 10)                  # create a matrix of Normal random numbers
apply(df, 1, mean)                                # take the mean of each row
##  [1]  0.37812656 -0.04697510 -0.12034587 -0.47662701  0.21143608  0.03105009
##  [7] -0.20014427  0.01768803 -0.10146497 -0.38899155 -0.41601038  0.17504550
## [13]  0.07270976  0.54403841  0.29829480 -0.30733061 -0.10510197 -0.31849960
## [19] -0.08526944  0.40215902
apply(df, 2, sum)                                 # take the sum of each column
##  [1] -0.7080868 -5.5325139 -6.7019855 -1.4470669  1.2145389  3.3273954
##  [7] -3.4342488 -2.9661638  3.0117013  8.8743049

You have probably noticed that the second argument, MARGIN, is either 1 or 2, depending on whether we want row statistics or column statistics. For the special cases of row/column sums and row/column means of matrices, R provides some useful shortcuts.

  • rowSums(x) is equivalent to apply(x, 1, sum)
  • rowMeans(x) is equivalent to apply(x, 1, mean)
  • colSums(x) is equivalent to apply(x, 2, sum)
  • colMeans(x) is equivalent to apply(x, 2, mean)
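The equivalences can be checked directly. The sketch below uses a fresh random matrix with an arbitrary seed (an assumption made here only for reproducibility) and compares each shortcut with its apply() counterpart.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
x <- matrix(rnorm(200), 20, 10)      # same shape as the matrix in the example above

all.equal(rowSums(x),  apply(x, 1, sum))    # TRUE
all.equal(rowMeans(x), apply(x, 1, mean))   # TRUE
all.equal(colSums(x),  apply(x, 2, sum))    # TRUE
all.equal(colMeans(x), apply(x, 2, mean))   # TRUE
```

The shortcuts are generally the better choice for these four cases, since they are written specifically for them and are faster on large matrices.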

Example 12: Looking back at the normalization problem, we can solve it more simply using apply():

normalize <- function(x){
  norm <- (x-min(x))/(max(x)-min(x))
  return(norm)
}
apply(df, 2, normalize)                           
##             [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
##  [1,] 0.18080867 0.17403879 0.73459541 0.95115289 0.11239496 0.95293687
##  [2,] 0.58399623 0.52503311 0.91561740 0.29993595 0.35684653 0.62785641
##  [3,] 0.70162041 0.41205415 0.01343519 0.00000000 0.85194426 0.95203787
##  [4,] 0.20228303 0.36716141 0.65881663 0.39126386 0.36110459 0.83272590
##  [5,] 0.33096442 0.65063722 0.62807406 0.46707176 0.74269133 0.04012659
##  [6,] 0.60320110 0.63061982 0.94642765 0.36557589 0.66315671 0.76603052
##  [7,] 0.61051899 0.76028796 0.78244427 0.84674166 0.03297843 0.26567673
##  [8,] 0.59845094 0.00000000 0.75979144 0.18597771 0.38647851 0.72739541
##  [9,] 0.28517928 0.77507834 0.88400938 0.06839108 0.55820173 0.49861063
## [10,] 0.00000000 0.27366293 0.49574653 0.42238002 0.27682369 0.57279132
## [11,] 0.53605765 0.01135163 0.98387815 0.24791585 0.07044870 0.57146991
## [12,] 0.73579815 0.90219351 0.61997177 0.41664957 0.50721270 1.00000000
## [13,] 0.31528780 0.29457945 0.58270419 0.21594491 1.00000000 0.86796961
## [14,] 0.93886348 1.00000000 0.79584621 0.44599058 0.46257792 0.31151128
## [15,] 1.00000000 0.74416613 0.44015351 1.00000000 0.00000000 0.85266802
## [16,] 0.55165968 0.52871169 0.00000000 0.81180503 0.37342425 0.77003841
## [17,] 0.46754917 0.11899053 1.00000000 0.72832518 0.52170323 0.64993031
## [18,] 0.09187965 0.61572930 0.87936793 0.45705383 0.57730896 0.00000000
## [19,] 0.56150004 0.66053041 0.43281322 0.57035087 0.13880386 0.46216378
## [20,] 0.90998238 0.44683973 0.77485489 0.36632792 0.39657184 0.65473567
##            [,7]       [,8]      [,9]     [,10]
##  [1,] 0.4256441 1.00000000 0.5968190 0.9350251
##  [2,] 0.0000000 0.68880691 0.3639969 0.7052689
##  [3,] 0.8597050 0.43583086 0.7758602 0.0955294
##  [4,] 0.1866723 0.18514326 0.3028439 0.4039993
##  [5,] 0.6946319 0.74529672 0.9216672 0.5858581
##  [6,] 0.4606115 0.19301888 0.1705919 0.5051513
##  [7,] 0.5788022 0.17032499 0.1336596 0.5283064
##  [8,] 0.5483426 0.41451479 1.0000000 0.6678390
##  [9,] 0.2114875 0.47338071 0.4490086 0.7878694
## [10,] 0.3832997 0.55588216 0.6101156 0.5240210
## [11,] 0.0682909 0.26680581 0.8626053 0.4910204
## [12,] 0.4688284 0.00000000 0.2710585 0.8819621
## [13,] 0.4867323 0.47846611 0.7481975 0.3905076
## [14,] 0.6651239 0.53879498 0.6905854 1.0000000
## [15,] 1.0000000 0.30985569 0.2422383 0.5787552
## [16,] 0.3910689 0.56905442 0.4516361 0.0000000
## [17,] 0.3317161 0.65549300 0.0000000 0.2257495
## [18,] 0.5184556 0.03959854 0.7361196 0.4466048
## [19,] 0.8095626 0.46096612 0.6592886 0.4078566
## [20,] 0.4095363 0.95312445 0.5991016 0.8155328
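One detail worth knowing if you normalize by row instead of by column: apply() stores each result as a column, so a row-wise call comes back transposed. The sketch below, which uses its own random matrix m and an arbitrary seed as assumptions, shows the shapes involved and how t() restores the original layout.

```r
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

set.seed(1)                              # arbitrary seed, only for reproducibility
m <- matrix(rnorm(200), 20, 10)          # 20 rows, 10 columns

by_col <- apply(m, 2, normalize)         # column-wise: shape is preserved
dim(by_col)                              # 20 10
range(by_col[, 1])                       # 0 1

raw_by_row <- apply(m, 1, normalize)     # row-wise: one COLUMN per original row
dim(raw_by_row)                          # 10 20

by_row <- t(raw_by_row)                  # transpose to get back a 20 x 10 matrix
range(by_row[1, ])                       # 0 1
```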

1.5 mapply()

The mapply() function is a multivariate apply of sorts which applies a function in parallel over a set of arguments. Recall that lapply() and friends only iterate over a single R object. What if you want to iterate over multiple R objects in parallel? This is what mapply() is for.

str(mapply)
## function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)

The arguments to mapply() are:

  • FUN is a function to apply
  • ... contains R objects to apply over
  • MoreArgs is a list of other arguments to FUN.
  • SIMPLIFY indicates whether the result should be simplified
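As a small sketch of how these arguments interact, the example below uses a hypothetical three-argument function add3(). The first two arguments are iterated over in parallel, while z is supplied once through MoreArgs and reused in every call.

```r
# Hypothetical helper, defined only for this illustration
add3 <- function(x, y, z) x + y + z

# x and y are taken pairwise from 1:3 and 4:6; z = 100 is fixed for every call
simplified <- mapply(add3, 1:3, 4:6, MoreArgs = list(z = 100))
simplified                           # 105 107 109

# With SIMPLIFY = FALSE the same results are returned as a list
as_list <- mapply(add3, 1:3, 4:6, MoreArgs = list(z = 100), SIMPLIFY = FALSE)
is.list(as_list)                     # TRUE
```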

The mapply() function has a different argument order from lapply() because the function to apply comes first rather than the object to iterate over. The R objects over which we apply the function are given in the ... argument because we can apply over an arbitrary number of R objects.

Example 13: The list list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1)) is tedious to type. With mapply() we can build it in a single call:

mapply(rep, 1:4, 4:1)
## [[1]]
## [1] 1 1 1 1
## 
## [[2]]
## [1] 2 2 2
## 
## [[3]]
## [1] 3 3
## 
## [[4]]
## [1] 4
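To confirm that the compact call really builds the same object, the sketch below compares it with the hand-typed list. It also shows Map(), a base R wrapper around mapply() that never simplifies; the variable names are illustrative.

```r
by_hand   <- list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))
by_mapply <- mapply(rep, 1:4, 4:1)   # calls rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1)

all.equal(by_mapply, by_hand)        # TRUE

# Map() always returns a list, which is convenient when the pieces
# are expected to differ in length
by_map <- Map(rep, 1:4, 4:1)
all.equal(by_map, by_hand)           # TRUE
```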

Example 14: Creating a simulation of random Normal variables.

noise <- function(n, mean, sd) {
  rnorm(n, mean, sd)
}
noise(5, 1, 2)                                    # simulate 5 random numbers
## [1]  0.2692832 -0.8674708  2.5692412  3.0410126 -1.5982717
noise(1:5, 1:5, 2)                                # rnorm() reads a vector n as length(n): 1 set of 5 numbers, not 5 sets
## [1] 0.933190 2.649508 4.712723 4.176588 1.710444
# get 5 sets of random numbers, 
# each with a different length and mean
mapply(noise, 1:5, 1:5, 2)                        
## [[1]]
## [1] 3.365356
## 
## [[2]]
## [1] 7.870619 1.175211
## 
## [[3]]
## [1] 2.7029182 0.5991706 0.6066444
## 
## [[4]]
## [1] 5.355777 4.964463 2.769435 2.584872
## 
## [[5]]
## [1]  5.849386  7.802966  4.308933 10.600763  4.568853
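Because the numbers are random, the output above changes from run to run. The sketch below fixes an arbitrary seed (an assumption added here for reproducibility) and checks that the mapply() call is equivalent to writing out the five noise() calls by hand.

```r
noise <- function(n, mean, sd) {
  rnorm(n, mean, sd)
}

set.seed(42)                               # arbitrary seed for reproducibility
sims <- mapply(noise, 1:5, 1:5, 2)         # sd = 2 is recycled for every call
lengths(sims)                              # 1 2 3 4 5

# The same thing written out call by call, starting from the same seed
set.seed(42)
by_hand <- list(noise(1, 1, 2), noise(2, 2, 2), noise(3, 3, 2),
                noise(4, 4, 2), noise(5, 5, 2))
all.equal(sims, by_hand)                   # TRUE
```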