Chapter 3: Loop functions: The apply Family

As we saw in previous chapters, R's design discourages explicit loops in many situations where a set of instructions must be repeated. We have already seen that we can take advantage of R's native vectorization, but there are also functions that iterate for us in a more convenient way, over both simple and more complex data structures.

The apply() family of functions belongs to the base R package, and has functions to manipulate slices of matrices, arrays, lists and data.frames iteratively.

These functions let you operate on the data in various ways while avoiding explicit loops. Each one takes an input structure and applies a function to it, passing along one or more optional arguments.

The apply() functions form the basis for more complex combinations and help to perform operations with very few lines of code. The family consists of the functions apply(), lapply(), sapply(), vapply(), mapply(), rapply(), and tapply().
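To make the contrast with explicit loops concrete, here is a minimal sketch that squares the numbers 1 to 5 in three ways. The variable names (squares_loop, squares_sapply, squares_vec) are illustrative and do not appear elsewhere in the chapter.

```r
# 1) Explicit loop: preallocate the result, then fill it element by element
squares_loop <- numeric(5)
for (i in 1:5) {
  squares_loop[i] <- i^2
}

# 2) Loop function: sapply() iterates for us and collects the results
squares_sapply <- sapply(1:5, function(i) i^2)

# 3) Native vectorization: the operator works on the whole vector at once
squares_vec <- (1:5)^2

squares_sapply
## [1]  1  4  9 16 25
```

All three approaches give the same values. When a vectorized form exists, as it does here, it is usually the shortest option; the loop functions are most useful when the operation is not vectorized.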

3.1 The apply() function

The apply() function operates on arrays. For simplicity, we will focus on 2-dimensional arrays, which are also known as matrices.

The syntax is:

# apply(X, 
#       MARGIN, 
#       FUN,
#       ...)

Where:

  • X is an array or a matrix; data.frames are also accepted (they are coerced to a matrix first).
  • MARGIN defines how the function is applied: with MARGIN = 1 it is applied to each row (i), and with MARGIN = 2 to each column (j). With MARGIN = c(1, 2) it is applied to every individual element (i, j).
  • FUN is the function you want to apply to these structures.
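The three MARGIN settings can be compared on a small matrix. This is a minimal sketch; the matrix m and the anonymous function are made up for illustration.

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns, filled column by column

apply(m, 1, sum)             # one result per row
## [1]  9 12
apply(m, 2, sum)             # one result per column
## [1]  3  7 11

# MARGIN = c(1, 2): FUN receives one cell at a time, so the shape is kept
apply(m, c(1, 2), function(cell) cell * 10)
##      [,1] [,2] [,3]
## [1,]   10   30   50
## [2,]   20   40   60

# Extra arguments after FUN are passed on to FUN through "..."
apply(m, 1, paste, collapse = "-")
## [1] "1-3-5" "2-4-6"
```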

Now we are going to see how to get the average of columns and the sum of rows of a matrix:

set.seed(1)

x<-matrix(round(rnorm(500),3), # 500 values drawn from a normal distribution;
                               # round() limits them to 3 decimals
          nrow=100,            # arranged in 100 rows
          ncol=5)              # and 5 columns

head(x)
##        [,1]   [,2]   [,3]   [,4]   [,5]
## [1,] -0.626 -0.620  0.409  0.894  1.074
## [2,]  0.184  0.042  1.689 -1.047  1.896
## [3,] -0.836 -0.911  1.587  1.971 -0.603
## [4,]  1.595  0.158 -0.331 -0.384 -0.391
## [5,]  0.330 -0.655 -2.285  1.654 -0.416
## [6,] -0.820  1.767  2.498  1.512 -0.376
round(apply(x, 2, mean),3)    # apply to (x, on columns, the function "mean()")
## [1]  0.109 -0.038  0.030  0.052 -0.039
# We manually check the mean of the first column

round(mean(x[,1]),3)
## [1] 0.109
y<-apply(x, 1, sum)     # apply to (x, on rows, the function "sum()")

# visualize
cbind(head(x),sum=y[1:6])
##                                            sum
## [1,] -0.626 -0.620  0.409  0.894  1.074  1.131
## [2,]  0.184  0.042  1.689 -1.047  1.896  2.764
## [3,] -0.836 -0.911  1.587  1.971 -0.603  1.208
## [4,]  1.595  0.158 -0.331 -0.384 -0.391  0.647
## [5,]  0.330 -0.655 -2.285  1.654 -0.416 -1.372
## [6,] -0.820  1.767  2.498  1.512 -0.376  4.581
# We manually check the sum of the first row

sum(x[1,])
## [1] 1.131

3.2 The lapply() function

The lapply() function is useful for performing operations on lists, and returns a list of the same length as its input. Each element of the result is obtained by applying the function FUN to the corresponding element of X. It also accepts data.frames and vectors, but the result is always a list.

Syntax

# lapply(X,
#        FUN)
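Like apply(), lapply() forwards any further arguments to FUN. The sketch below uses a made-up list called readings to show this with the na.rm argument of mean().

```r
readings <- list(a = c(1, NA, 3), b = c(4, 5))   # hypothetical data

# na.rm = TRUE is not an argument of lapply(); it is forwarded to mean()
means <- lapply(readings, mean, na.rm = TRUE)

str(means)
## List of 2
##  $ a: num 2
##  $ b: num 4.5
```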

Example 1: Convert all uppercase words within the vector movies, to lowercase words.

movies <- c("AVENGERS","JOKER","BRAVE","UP")   

movies_lowercase <-lapply(movies, tolower)

str(movies_lowercase)                                
## List of 4
##  $ : chr "avengers"
##  $ : chr "joker"
##  $ : chr "brave"
##  $ : chr "up"
# We can use the "unlist()" function to transform the result of "lapply()" to a vector.

movies_lowercase <-unlist(lapply(movies,tolower))

str(movies_lowercase)
##  chr [1:4] "avengers" "joker" "brave" "up"

Example 2: Extract the second column of every matrix in the list mat.l and assign the result to mat.l.col2; then extract the second row of every matrix in mat.l and assign the result to mat.l.row2.

mat.l<-list("A"=matrix(1:9,nrow=3,ncol=3),
            "B"=matrix(11:26,nrow=4,ncol=4),
            "C"=matrix(31:34,nrow=2,ncol=2))

mat.l.col2<-lapply(mat.l,function(x) x[,2])

mat.l.row2<-lapply(mat.l,function(x) x[2,])

mat.l
## $A
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## $B
##      [,1] [,2] [,3] [,4]
## [1,]   11   15   19   23
## [2,]   12   16   20   24
## [3,]   13   17   21   25
## [4,]   14   18   22   26
## 
## $C
##      [,1] [,2]
## [1,]   31   33
## [2,]   32   34
str(mat.l.col2)
## List of 3
##  $ A: int [1:3] 4 5 6
##  $ B: int [1:4] 15 16 17 18
##  $ C: int [1:2] 33 34
str(mat.l.row2)
## List of 3
##  $ A: int [1:3] 2 5 8
##  $ B: int [1:4] 12 16 20 24
##  $ C: int [1:2] 32 34
unlist(mat.l.col2)
## A1 A2 A3 B1 B2 B3 B4 C1 C2 
##  4  5  6 15 16 17 18 33 34
unlist(mat.l.row2)
## A1 A2 A3 B1 B2 B3 B4 C1 C2 
##  2  5  8 12 16 20 24 32 34

3.3 The sapply() function

The sapply() function does the same job as the lapply() function, but can return a simpler structure, such as a vector, if the output of the FUN function allows it, i.e. it simplifies the output that could be obtained with lapply().

Syntax

# sapply(X,
#        FUN)
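What "simpler structure" means depends on the length of the individual results. The following sketch, built on small made-up inputs, shows the three possible outcomes.

```r
# Every result has length 1: simplified to a vector
sapply(1:3, function(i) i^2)
## [1] 1 4 9

# Every result has the same length (> 1): simplified to a matrix,
# with one column per element of X
sapply(1:3, function(i) c(i, i^2))
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    1    4    9

# Results of different lengths cannot be simplified: a list is returned
str(sapply(1:3, function(i) seq_len(i)))
## List of 3
##  $ : int 1
##  $ : int [1:2] 1 2
##  $ : int [1:3] 1 2 3
```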

Returning to ‘Example 1’ of the lapply() function: Convert all uppercase words in the movies vector to lowercase words.

movies <- c("AVENGERS","JOKER","BRAVE","UP")   
movies_lowercase <-as.character(sapply(movies, 
                                    tolower))     
                                                  
str(movies_lowercase)                                
##  chr [1:4] "avengers" "joker" "brave" "up"
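The as.character() call is needed because sapply() names its result after a character input. The sketch below shows those names and two ways to avoid them; it repeats the definition of movies so that it can be run on its own.

```r
movies <- c("AVENGERS", "JOKER", "BRAVE", "UP")

# By default sapply() uses a character input as names for the result
names(sapply(movies, tolower))
## [1] "AVENGERS" "JOKER"    "BRAVE"    "UP"

# USE.NAMES = FALSE returns a plain, unnamed character vector
sapply(movies, tolower, USE.NAMES = FALSE)
## [1] "avengers" "joker"    "brave"    "up"

# tolower() is itself vectorized, so no loop function is needed here
tolower(movies)
## [1] "avengers" "joker"    "brave"    "up"
```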

Example 2: Extract the element of the second row and second column from all arrays in mat.l and assign them to mat.l.row2col2.

(mat.l<-list("A"=matrix(1:9,nrow=3,ncol=3),
            "B"=matrix(11:26,nrow=4,ncol=4),
            "C"=matrix(31:34,nrow=2,ncol=2)))
## $A
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## $B
##      [,1] [,2] [,3] [,4]
## [1,]   11   15   19   23
## [2,]   12   16   20   24
## [3,]   13   17   21   25
## [4,]   14   18   22   26
## 
## $C
##      [,1] [,2]
## [1,]   31   33
## [2,]   32   34
mat.l.row2col2<-sapply(mat.l,function(x) x[2,2])

str(mat.l.row2col2)
##  Named int [1:3] 5 16 34
##  - attr(*, "names")= chr [1:3] "A" "B" "C"

3.4 The mapply() function

The mapply() function is a multivariate version of sapply(): it applies FUN to the first elements of each argument, then to the second elements, and so on, i.e. to the elements with index i of each of the n arguments (var 1 ... var n). Shorter arguments are recycled if necessary.

Syntax

# mapply(FUN,
#        var 1...var n)
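The element-wise pairing and the recycling rule are easiest to see on short vectors. This is a minimal sketch with made-up inputs; MoreArgs is the mapply() argument for values that stay constant across calls.

```r
# Element i of every argument is passed to FUN together
mapply(function(x, y) x + y, 1:3, 4:6)
## [1] 5 7 9

# The shorter argument is recycled: 1:2 is used as 1, 2, 1, 2
mapply(function(x, y) x + y, 1:4, 1:2)
## [1] 2 4 4 6

# Results of different lengths are returned as a list
str(mapply(rep, 1:3, 3:1))
## List of 3
##  $ : int [1:3] 1 1 1
##  $ : int [1:2] 2 2
##  $ : int 3

# MoreArgs holds arguments that stay the same in every call
mapply(function(x, y, power) (x + y)^power, 1:3, 4:6, MoreArgs = list(power = 2))
## [1] 25 49 81
```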

Example 1: Sum of the i elements of n vectors:

mapply(sum,    # mapply() adds up 3 vectors, each holding the sequence 1 to 5,
       1:5,    # i.e. 1+1+1, 2+2+2, etc.
       1:5, 
       1:5)
## [1]  3  6  9 12 15
# This is basically the same as:
1:5+1:5+1:5
## [1]  3  6  9 12 15
# If we made a matrix with the same values spread over 3 columns:
x<-matrix(c(rep(1:5,3)),
          nrow=5,
          ncol=3)

x
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    2    2    2
## [3,]    3    3    3
## [4,]    4    4    4
## [5,]    5    5    5
# We would get the same result if:
apply(x,1,sum)
## [1]  3  6  9 12 15

Example 2: Check whether each column of x is identical to the corresponding column of y.

x<-data.frame(V1=c(1:5),
              V2=c(6:10),
              V3=c(11:15))

y<-data.frame(V1=c(1:5),
              V2=c(6:10),
              V3=c(11:14,16))
# Look at the output of the "identical()" function
#   Note: identical() returns TRUE only when both objects are exactly equal,
#          including their dimensions and column names.
identical(x,y)
## [1] FALSE
# We know that in general, x is not identical to y, but we want
# to know which column is different.

mapply(identical,x,y)
##    V1    V2    V3 
##  TRUE  TRUE FALSE
data.frame(identical_x_vs_y=mapply(identical,x,y))
##    identical_x_vs_y
## V1             TRUE
## V2             TRUE
## V3            FALSE

3.5 The vapply() function

The vapply() function is a stricter version of sapply(). It applies FUN to the elements of X, but you must declare in FUN.VALUE the type and length of the result expected from each call. We can think of vapply() as sapply() with output validation.

Syntax

# vapply(X,
#        FUN,
#        FUN.VALUE = template for one result, e.g. numeric(1))

To see how it works, it’s useful to compare it to sapply.

set.seed(100)
dat<-data.frame(X=rnorm(n=100,     # We create a data.frame with 3 columns (X,Y,Z)
                        mean=0,    # Each one composed of 100 obs. with normal 
                        sd=1),     # distribution, different means, same std.
                Y=rnorm(n=100,
                        mean=1,
                        sd=1),
                Z=rnorm(n=100,
                        mean=2,
                        sd=1))

sapply(dat,mean)
##           X           Y           Z 
## 0.002912563 1.011140837 2.012793127
# In this case, we expect the output to be of class "numeric", length=1
vapply(dat,mean,numeric(1))
##           X           Y           Z 
## 0.002912563 1.011140837 2.012793127
# Notice what happens when we define the length of the output to 2.


vapply(dat,mean,numeric(2))
## Error in vapply(dat, mean, numeric(2)): values must be length 2,
##  but FUN(X[[1]]) result is length 1
# R is telling you that a result of length 2 was expected for each column,
# but the function returned a result of length 1

# Now, by applying the summary() function to each column, we expect an output
# of length 6 for each i.

vapply(dat,summary,numeric(6))
##                    X          Y         Z
## Min.    -2.271925486 -1.1364939 -1.020814
## 1st Qu. -0.608846594  0.5681700  1.394084
## Median  -0.059419897  0.9271222  2.159128
## Mean     0.002912563  1.0111408  2.012793
## 3rd Qu.  0.655891078  1.4461908  2.657317
## Max.     2.581958928  3.1686003  4.727888
# The summary() function behaves differently depending on the class of
# its input. Here each column is of class "numeric", so it returns
# a statistical summary of the column.


t(vapply(dat,summary,numeric(6)))
##        Min.    1st Qu.     Median        Mean   3rd Qu.     Max.
## X -2.271925 -0.6088466 -0.0594199 0.002912563 0.6558911 2.581959
## Y -1.136494  0.5681700  0.9271222 1.011140837 1.4461908 3.168600
## Z -1.020814  1.3940839  2.1591282 2.012793127 2.6573168 4.727888
# The "t()" function transposes a matrix, that is, it exchanges
# rows for columns and vice versa.
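The validation done by vapply() matters most in edge cases. The sketch below, with a made-up empty list, shows that sapply() changes its return type when there is nothing to iterate over, while vapply() keeps the declared type and rejects results of the wrong type.

```r
empty <- list()   # e.g. the result of a filter that matched nothing

# sapply() has nothing to simplify, so it falls back to an empty list
class(sapply(empty, length))
## [1] "list"

# vapply() still returns the type promised by FUN.VALUE
class(vapply(empty, length, integer(1)))
## [1] "integer"

# A result of the wrong type raises an error instead of being accepted
res <- tryCatch(vapply(1:3, function(i) i / 2, integer(1)),
                error = function(e) "type mismatch detected")
res
## [1] "type mismatch detected"
```

This predictability is the main reason to prefer vapply() over sapply() inside functions and scripts that run unattended.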

3.6 The tapply() function

The tapply() function is useful when we need to split a vector X into groups defined by a grouping factor INDEX, compute a function FUN on each of the resulting subsets, and return the results in a convenient form.

Syntax

# tapply(X,
#        INDEX,
#        FUN)

Where:

  • X is the variable on which the FUN function must be applied.
  • INDEX is the grouping variable.
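Before the larger example, here is a minimal sketch on six made-up values. It also shows that INDEX can be a list of several grouping variables, in which case the result is a table with one dimension per variable.

```r
score <- c(10, 20, 36, 40, 50, 66)          # hypothetical measurements
group <- c("a", "a", "a", "b", "b", "b")    # first grouping variable
sex   <- c("f", "m", "f", "m", "f", "m")    # second grouping variable

# One grouping variable: one mean per group
by_group <- tapply(score, group, mean)
by_group
##  a  b 
## 22 52 

# Two grouping variables: one mean per combination of group and sex
by_group_sex <- tapply(score, list(group, sex), mean)
by_group_sex
##    f  m
## a 23 20
## b 50 53
```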

Example:

set.seed(200)

# We create a data set, where we simulate the effect of diet and physical training in
# people with morbid obesity.

# We create the variable to control the number of samples
np=4*100 #number of patients

weight_control<-data.frame(patient_id = as.character(paste0("patient_", 1:np)),
                    age_years = as.integer(round(rnorm(np, mean = 50, sd =8),
                                                 digits = 0)),
                    starting_weight_kg=rnorm(np,300,10),
                    final_weight_kg = c(rnorm(np/4,mean=100,sd=10),
                                     rnorm(np/4,mean=70,sd=5),
                                     rnorm(np/4,mean=90,sd=5),
                                     rnorm(np/4,mean=300,sd=10)),
                   height_m=rnorm(np, mean = 1.75, sd =0.08),
                    gender_binary=as.factor(rep_len(c("f","m"),
                                          np)),
                    treatment_category = gl(4, 
                                            np/4,
                                     labels = c("diet",
                                                "diet_and_physical_training",
                                                "physical_training",
                                                "control")))


# The function "gl()" generates a factor from
# (number of levels, number of repetitions per level, level labels)


## Feature engineering

# Starting and final bmi calculations

weight_control$starting_bmi<-weight_control$starting_weight_kg/(weight_control$height_m^2)
weight_control$final_bmi<-weight_control$final_weight_kg/(weight_control$height_m^2)

# Starting and final bmi classification
#  We create a function that helps us do 
#  the classification automatically.

BMI_classifier<-function(x){
  as.factor(ifelse(x<18.5,"low_weight",
                   ifelse(x>=18.5 & 
                            x<25,"normal_weight",
                          ifelse(x>=25 & 
                                   x<30,"overweight",
                                 ifelse(x>=30 &
                                          x<35,"mild_obesity",
                                        ifelse(x>=35 & 
                                                 x<40,"obesity",
                                               "morbid_obesity"))))))
}
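As an aside, the same classification can be sketched with cut(), which bins a numeric vector by a set of break points. BMI_classifier_cut is a hypothetical name introduced only for this illustration; the chapter's results are produced with BMI_classifier above.

```r
# Hypothetical alternative to the nested ifelse() calls, based on cut()
BMI_classifier_cut <- function(x) {
  cut(x,
      breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
      labels = c("low_weight", "normal_weight", "overweight",
                 "mild_obesity", "obesity", "morbid_obesity"),
      right = FALSE)   # intervals are closed on the left: [18.5, 25), ...
}

as.character(BMI_classifier_cut(c(17, 18.5, 27, 40)))
## [1] "low_weight"     "normal_weight"  "overweight"     "morbid_obesity"
```

One difference to keep in mind: cut() orders the factor levels as given in labels, whereas as.factor() orders them alphabetically, so summaries would list the categories in a different order.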

weight_control$starting_bmi_category<-BMI_classifier(weight_control$starting_bmi)


weight_control$final_bmi_category<-BMI_classifier(weight_control$final_bmi)

# Calculation of differences (deltas)

weight_control$delta_weight_kg<-weight_control$starting_weight_kg-weight_control$final_weight_kg

weight_control$delta_bmi<-weight_control$starting_bmi-weight_control$final_bmi


# Generate a data summary
summary(weight_control)
##   patient_id          age_years     starting_weight_kg final_weight_kg 
##  Length:400         Min.   :28.00   Min.   :270.2      Min.   : 59.32  
##  Class :character   1st Qu.:45.00   1st Qu.:293.4      1st Qu.: 81.49  
##  Mode  :character   Median :50.00   Median :299.9      Median : 93.64  
##                     Mean   :50.12   Mean   :299.7      Mean   :140.58  
##                     3rd Qu.:55.00   3rd Qu.:306.2      3rd Qu.:159.68  
##                     Max.   :80.00   Max.   :325.3      Max.   :322.56  
##     height_m     gender_binary                  treatment_category
##  Min.   :1.549   f:200         diet                      :100     
##  1st Qu.:1.697   m:200         diet_and_physical_training:100     
##  Median :1.753                 physical_training         :100     
##  Mean   :1.754                 control                   :100     
##  3rd Qu.:1.805                                                    
##  Max.   :2.005                                                    
##   starting_bmi      final_bmi         starting_bmi_category
##  Min.   : 70.32   Min.   : 17.60   morbid_obesity:400      
##  1st Qu.: 91.34   1st Qu.: 26.21                           
##  Median : 97.55   Median : 30.92                           
##  Mean   : 98.10   Mean   : 45.81                           
##  3rd Qu.:104.60   3rd Qu.: 57.12                           
##  Max.   :130.09   Max.   :124.41                           
##       final_bmi_category delta_weight_kg    delta_bmi     
##  low_weight    :  4      Min.   :-39.08   Min.   :-12.76  
##  mild_obesity  : 84      1st Qu.:139.79   1st Qu.: 40.19  
##  morbid_obesity:103      Median :204.81   Median : 65.62  
##  normal_weight : 77      Mean   :159.15   Mean   : 52.29  
##  obesity       : 33      3rd Qu.:219.04   3rd Qu.: 73.38  
##  overweight    : 99      Max.   :261.37   Max.   :100.78
# We want to EXPLORE if the treatments had any indication of effect

treatment_effects<-tapply(weight_control$final_bmi,           # X
                          weight_control$treatment_category,  # INDEX
                          summary)                            # FUN

treatment_effects
## $diet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24.31   29.54   31.96   32.46   35.49   49.87 
## 
## $diet_and_physical_training
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.60   21.33   23.28   23.43   25.11   30.32 
## 
## $physical_training
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23.66   27.47   29.89   29.99   32.21   39.12 
## 
## $control
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   78.84   91.15   96.83   97.35  103.39  124.41
treatment_effects.df<- do.call(rbind,               # Function
                                 treatment_effects) # Argument list

treatment_effects.df
##                                Min.  1st Qu.   Median     Mean   3rd Qu.
## diet                       24.31047 29.53742 31.96384 32.46227  35.48720
## diet_and_physical_training 17.60446 21.32765 23.28333 23.43037  25.11293
## physical_training          23.66123 27.46616 29.88690 29.98649  32.21126
## control                    78.84499 91.14723 96.83323 97.35062 103.39297
##                                 Max.
## diet                        49.87335
## diet_and_physical_training  30.31577
## physical_training           39.11740
## control                    124.40778
# The "do.call()" function allows you to call any function in R, 
# but instead of writing the arguments one by one, you can use 
# a list to hold the function's arguments.
# Note that do.call() does not loop over the list: it calls rbind() once,
# passing every element of the list as a separate argument.
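The behaviour of do.call() can be checked on a small made-up list: calling rbind() through do.call() gives the same result as writing out the arguments by hand.

```r
parts <- list(first = 1:3, second = 4:6)   # hypothetical list of rows

# do.call() unpacks the list, so these two calls are equivalent
by_list <- do.call(rbind, parts)
by_hand <- rbind(first = 1:3, second = 4:6)

by_list
##        [,1] [,2] [,3]
## first     1    2    3
## second    4    5    6
```

The do.call() form is the practical one when the number of elements is not known in advance, as with the output of tapply() or lapply().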

3.7 The rapply() function

The name rapply() stands for recursive apply and, as the name suggests, the function applies f to all elements of object recursively, descending into nested lists. The classes argument restricts f to the elements whose class is listed in it.

This function has three basic modes. If how = "replace", each element of object that has its class included in classes is replaced by the result of applying f to the element in question.

With how = "list", every element whose class is included in classes is replaced by the result of applying f to it, and every other element is replaced by the default value deflt.

With how = "unlist", the same is done as for how = "list", and then unlist(recursive = TRUE) is called on the result.

Finally, in ... you put any extra arguments that f can use.

Syntax

# rapply(object,
#        f,
#        classes,
#        deflt,
#        how=c("unlist", "replace", "list"),
#        ...)
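The three modes can be compared on a small nested list. This is a minimal sketch; the list nested and its element names are made up for illustration.

```r
nested <- list(a = c(1, 2, 3),
               b = list(c = c(1.5, 2.5), d = "text"))   # hypothetical nested list

# how = "replace": numeric elements are doubled, everything else is kept
doubled <- rapply(nested, function(v) v * 2, classes = "numeric", how = "replace")
str(doubled)
## List of 2
##  $ a: num [1:3] 2 4 6
##  $ b:List of 2
##   ..$ c: num [1:2] 3 5
##   ..$ d: chr "text"

# how = "list": non-matching elements are replaced by deflt
summed <- rapply(nested, sum, classes = "numeric", deflt = NA, how = "list")
summed$b$d
## [1] NA

# how = "unlist": the result is flattened; with the default deflt = NULL
# the non-matching element disappears
totals <- rapply(nested, sum, classes = "numeric", how = "unlist")
totals
##   a b.c 
##   6   4 
```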

Now, let’s normalize the weights, height, BMIs, and deltas (i.e. the numeric columns) of weight_control using y = MinMaxNorm(x).

MinMaxNorm: Min-max normalization linearly transforms each value xi to yi = (xi - min(x)) / (max(x) - min(x)), where min(x) and max(x) are the minimum and maximum values in x, and xi is an individual observed value in x. When xi = min(x), yi = 0, and when xi = max(x), yi = 1. Therefore, the values of y range from 0 to 1.
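A short worked example, using three made-up values, shows the arithmetic and one edge case worth knowing about.

```r
x <- c(2, 4, 10)                  # hypothetical values: min(x) = 2, max(x) = 10

(x - min(x)) / (max(x) - min(x))  # (2-2)/8, (4-2)/8, (10-2)/8
## [1] 0.00 0.25 1.00

# Edge case: if all values are equal, max(x) - min(x) is 0 and the result is NaN
constant <- c(5, 5, 5)
(constant - min(constant)) / (max(constant) - min(constant))
## [1] NaN NaN NaN
```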

In this case, we are going to create a function that does precisely this transformation, and we will call it MinMaxNorm.

# We create the function MinMaxNorm

MinMaxNorm<-function(x){
  (x-min(x))/(max(x)-min(x))   # the value of the last expression is returned
}


weight_control.norm<-rapply(weight_control, # object
                 f=MinMaxNorm,              # function
                 classes = "numeric",       # class limit
                 how = "replace")

head(weight_control.norm)[,c(1:5)]
##   patient_id age_years starting_weight_kg final_weight_kg   height_m
## 1  patient_1        51         0.80689353      0.16327245 0.43820773
## 2  patient_2        52         0.76696811      0.07450938 0.43966621
## 3  patient_3        53         0.28925377      0.14459683 0.08919087
## 4  patient_4        54         0.54982210      0.22830982 0.46257339
## 5  patient_5        50         0.54355485      0.14795106 0.43278732
## 6  patient_6        49         0.01091919      0.15092286 0.54552228
head(weight_control)[,c(1:5)]
##   patient_id age_years starting_weight_kg final_weight_kg height_m
## 1  patient_1        51           314.6868       102.30413 1.749155
## 2  patient_2        52           312.4859        78.93844 1.749820
## 3  patient_3        53           286.1528        97.38802 1.590079
## 4  patient_4        54           300.5162       119.42433 1.760261
## 5  patient_5        50           300.1707        98.27098 1.746685
## 6  patient_6        49           270.8101        99.05326 1.798067
plot(density(weight_control$final_bmi),
     main= "Density for final bmi (untransformed)")

plot(density(weight_control.norm$final_bmi),
     main= "Density for final bmi (min max transformed)")

# If we wanted to obtain a dataframe where the columns are minimum and maximum values,
# and the rows the names of the "numerical" columns of "weight_control":

weight_control.minmax<-data.frame(min= rapply(weight_control,
                                      f=min,
                                      classes = "numeric",
                                      deflt = NULL,
                                      how = "unlist"),
                          max=rapply(weight_control,
                                     f=max,
                                     classes = "numeric",
                                     deflt = NULL,
                                     how = "unlist"))
weight_control.minmax
##                           min        max
## starting_weight_kg 270.208161 325.331426
## final_weight_kg     59.324857 322.561356
## height_m             1.549427   2.005211
## starting_bmi        70.315843 130.094190
## final_bmi           17.604460 124.407777
## delta_weight_kg    -39.078885 261.374147
## delta_bmi          -12.757147 100.778463

3.8 Wrap-up

apply: Apply a function over the margins of an array.

lapply: Loop over a list and evaluate a function on each element.

sapply: Basically the same as lapply but simplifies/reduces the result.

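mapply: Multivariate version of sapply.

vapply: Stricter version of sapply that validates the type and length of each result against FUN.VALUE.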

tapply: Apply a function over subsets of a vector.

rapply: Apply a function to all elements of an object recursively. The classes argument delimits the elements on which the function is going to be applied, based on their class.