apply Family
As we saw in previous chapters, the nature of R discourages the use of explicit loops wherever instructions must be repeated. We have already seen that we can take advantage of R's native vectorization, but there are also functions that iterate over simple or more complex data structures in a more convenient way.
The apply() family of functions belongs to the base R
package, and has functions to manipulate slices of matrices,
arrays, lists and data.frames iteratively.
These functions let you operate on the data in various ways while avoiding explicit loops. Each one takes an input structure and applies a function to it, optionally with one or more extra arguments.
The apply() functions form the basis for more complex
combinations and help to perform operations with very few lines of code.
The family consists of the functions apply(),
lapply(), sapply(), vapply(),
mapply(), rapply(), and
tapply().
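To make the contrast with explicit loops concrete, here is a minimal sketch that computes the same result both ways. The list `scores` and the result names are hypothetical and chosen only for illustration.

```r
# Hypothetical input: a small named list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# Explicit loop: pre-allocate the result, then fill it element by element
means_loop <- numeric(length(scores))
names(means_loop) <- names(scores)
for (i in seq_along(scores)) {
  means_loop[i] <- mean(scores[[i]])
}

# Same computation with sapply(): no index bookkeeping is needed
means_apply <- sapply(scores, mean)

means_loop
means_apply
all.equal(means_loop, means_apply)
```

Both versions produce one mean per list element; the sapply() version simply leaves the iteration and the collection of results to R.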
apply() function
The apply() function operates on arrays.
For simplicity, we will focus on 2-dimensional arrays,
which are also known as matrices.
The syntax is:
# apply(X,
# MARGIN,
# FUN,
# ...)
Where:
X is an array or a matrix; data.frames are also accepted.
MARGIN is an argument that defines how the function is applied: with MARGIN = 1 it is applied over rows (i), while with MARGIN = 2 it is applied over columns (j). With MARGIN = c(1, 2), it is applied over both rows and columns, i.e. to each element (i,j).
FUN is the function you want to apply to these structures.
Now we are going to see how to get the average of the columns and the sum of the rows of a matrix:
set.seed(1)
# Create a matrix of 500 normally distributed values,
# arranged in 100 rows and 5 columns.
# The round() function limits the number of decimals, here to 3.
x<-matrix(round(rnorm(500),3),
nrow=100,
ncol=5)
head(x)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.626 -0.620 0.409 0.894 1.074
## [2,] 0.184 0.042 1.689 -1.047 1.896
## [3,] -0.836 -0.911 1.587 1.971 -0.603
## [4,] 1.595 0.158 -0.331 -0.384 -0.391
## [5,] 0.330 -0.655 -2.285 1.654 -0.416
## [6,] -0.820 1.767 2.498 1.512 -0.376
round(apply(x, 2, mean),3) # apply to (x, on columns, the function "mean()")
## [1] 0.109 -0.038 0.030 0.052 -0.039
# We manually check the mean of one column
round(mean(x[,1]),3)
## [1] 0.109
y<-apply(x, 1, sum) # apply to (x, on rows, the function "sum()")
# visualize
cbind(head(x),sum=y[1:6])
## sum
## [1,] -0.626 -0.620 0.409 0.894 1.074 1.131
## [2,] 0.184 0.042 1.689 -1.047 1.896 2.764
## [3,] -0.836 -0.911 1.587 1.971 -0.603 1.208
## [4,] 1.595 0.158 -0.331 -0.384 -0.391 0.647
## [5,] 0.330 -0.655 -2.285 1.654 -0.416 -1.372
## [6,] -0.820 1.767 2.498 1.512 -0.376 4.581
# We manually check the sum of one row
sum(x[1,])
## [1] 1.131
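The examples above use MARGIN = 1 and MARGIN = 2. As a small complementary sketch, the block below shows the two remaining pieces of the syntax: extra arguments passed through `...`, and MARGIN = c(1, 2). The matrix `m` is a hypothetical toy input.

```r
# Hypothetical 2 x 3 matrix with one missing value
# (filled column by column: col1 = 1,2  col2 = NA,4  col3 = 5,6)
m <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 2, ncol = 3)

# Extra arguments after FUN are passed on to FUN:
# here na.rm = TRUE is handed to mean() for every row
row_means <- apply(m, 1, mean, na.rm = TRUE)

# MARGIN = c(1, 2) calls FUN once per cell and keeps the matrix shape
squared <- apply(m, c(1, 2), function(v) v^2)

row_means
squared
```

For a simple element-wise operation such as squaring, plain vectorization (`m^2`) is shorter; MARGIN = c(1, 2) is mainly useful when FUN is not vectorized.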
lapply() function
The lapply() function is useful for performing
operations on lists, and returns a list with the same length as the
original set. Each element of the resulting list is the result of
applying a function, defined by FUN, to the corresponding
element of the list X. It can also accept
data.frames or vectors, but the result will be
a list.
Syntax
# lapply(X,
# FUN)
Example 1: Convert all uppercase words within the vector
movies, to lowercase words.
movies <- c("AVENGERS","JOKER","BRAVE","UP")
movies_lowercase <-lapply(movies, tolower)
str(movies_lowercase)
## List of 4
## $ : chr "avengers"
## $ : chr "joker"
## $ : chr "brave"
## $ : chr "up"
# We can use the "unlist()" function to transform the result of "lapply()" to a vector.
movies_lowercase <-unlist(lapply(movies,tolower))
str(movies_lowercase)
## chr [1:4] "avengers" "joker" "brave" "up"
Example 2: Extract the second column of every matrix in the
mat.l list and assign the result to mat.l.col2; then
extract the second row of every matrix in the mat.l list
and assign the result to mat.l.row2.
mat.l<-list("A"=matrix(1:9,nrow=3,ncol=3),
"B"=matrix(11:26,nrow=4,ncol=4),
"C"=matrix(31:34,nrow=2,ncol=2))
mat.l.col2<-lapply(mat.l,function(x) x[,2])
mat.l.row2<-lapply(mat.l,function(x) x[2,])
mat.l
## $A
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## $B
## [,1] [,2] [,3] [,4]
## [1,] 11 15 19 23
## [2,] 12 16 20 24
## [3,] 13 17 21 25
## [4,] 14 18 22 26
##
## $C
## [,1] [,2]
## [1,] 31 33
## [2,] 32 34
str(mat.l.col2)
## List of 3
## $ A: int [1:3] 4 5 6
## $ B: int [1:4] 15 16 17 18
## $ C: int [1:2] 33 34
str(mat.l.row2)
## List of 3
## $ A: int [1:3] 2 5 8
## $ B: int [1:4] 12 16 20 24
## $ C: int [1:2] 32 34
unlist(mat.l.col2)
## A1 A2 A3 B1 B2 B3 B4 C1 C2
## 4 5 6 15 16 17 18 33 34
unlist(mat.l.row2)
## A1 A2 A3 B1 B2 B3 B4 C1 C2
## 2 5 8 12 16 20 24 32 34
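The description of lapply() notes that it also accepts data.frames and vectors but always returns a list. The following sketch illustrates both cases; the data.frame `df` and its columns are hypothetical.

```r
# Hypothetical data.frame used only for illustration
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 score = c(2.5, NA, 4.5),
                 stringsAsFactors = FALSE)

# A data.frame is a list of columns, so FUN is called once per column
col_classes <- lapply(df, class)

# A plain vector is processed element by element; the extra argument
# (times = 2) is passed on to rep() for every element
repeated <- lapply(1:3, rep, times = 2)

str(col_classes)
str(repeated)
```

In both cases the result has one list element per input element (three columns, three vector elements), regardless of what FUN returns.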
sapply() function
The sapply() function does the same job as the
lapply() function, but it returns a simpler structure, such
as a vector or a matrix, whenever the output of the FUN function allows it;
i.e. it simplifies the output that would be obtained
with lapply().
Syntax
# sapply(X,
# FUN)
Returning to ‘Example 1’ of the lapply() function:
Convert all uppercase words in the movies vector to
lowercase words.
movies <- c("AVENGERS","JOKER","BRAVE","UP")
movies_lowercase <-as.character(sapply(movies,
tolower))
str(movies_lowercase)
## chr [1:4] "avengers" "joker" "brave" "up"
Example 2: Extract the element in the second row and second column
of every matrix in mat.l and assign the results to
mat.l.row2col2.
(mat.l<-list("A"=matrix(1:9,nrow=3,ncol=3),
"B"=matrix(11:26,nrow=4,ncol=4),
"C"=matrix(31:34,nrow=2,ncol=2)))
## $A
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## $B
## [,1] [,2] [,3] [,4]
## [1,] 11 15 19 23
## [2,] 12 16 20 24
## [3,] 13 17 21 25
## [4,] 14 18 22 26
##
## $C
## [,1] [,2]
## [1,] 31 33
## [2,] 32 34
mat.l.row2col2<-sapply(mat.l,function(x) x[2,2])
str(mat.l.row2col2)
## Named int [1:3] 5 16 34
## - attr(*, "names")= chr [1:3] "A" "B" "C"
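Whether sapply() can simplify depends on what FUN returns for each element. The sketch below, using a hypothetical list `v`, shows the three typical outcomes: a vector, a matrix, and a fallback to a list.

```r
# Hypothetical list of integer vectors
v <- list(a = 1:3, b = 4:6, c = 7:9)

# One value per element: simplified to a named vector
totals <- sapply(v, sum)
totals

# Two values per element: simplified to a matrix (one column per element)
rng <- sapply(v, range)
rng

# Results of different lengths cannot be simplified: a list is returned
uneven <- sapply(v, function(x) x[x > 4])
str(uneven)
```

Because the return type changes with the data, code that relies on a particular shape may prefer vapply(), which states the expected shape explicitly.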
mapply() function
The mapply() function is a multivariate version of
sapply(): it applies the FUN function to the
i-th elements of each of the arguments var 1 to var n. The
arguments are recycled if necessary.
Syntax
# mapply(FUN,
# var 1...var n)
Example 1: Sum of the i-th elements of n
vectors:
mapply(sum, # apply sum() across 3 vectors, each a sequence from 1 to 5
1:5, # i.e. 1+1+1, 2+2+2, etc.
1:5,
1:5)
## [1] 3 6 9 12 15
# This is basically the same as:
1:5+1:5+1:5
## [1] 3 6 9 12 15
# If we made a matrix with the same values spread over 3 columns:
x<-matrix(c(rep(1:5,3)),
nrow=5,
ncol=3)
x
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 2 2 2
## [3,] 3 3 3
## [4,] 4 4 4
## [5,] 5 5 5
# We would get the same result if:
apply(x,1,sum)
## [1] 3 6 9 12 15
Example 2: Evaluate if each column of x is identical to
each column of y.
x<-data.frame(V1=c(1:5),
V2=c(6:10),
V3=c(11:15))
y<-data.frame(V1=c(1:5),
V2=c(6:10),
V3=c(11:14,16))
# Look at the output of the "identical()" function
# Note: identical() returns TRUE only if both tables are exactly equal,
# including their dimensions and column names.
identical(x,y)
## [1] FALSE
# We know that in general, x is not identical to y, but we want
# to know which column is different.
mapply(identical,x,y)
## V1 V2 V3
## TRUE TRUE FALSE
data.frame(identical_x_vs_y=mapply(identical,x,y))
## identical_x_vs_y
## V1 TRUE
## V2 TRUE
## V3 FALSE
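The description of mapply() mentions that arguments are recycled if necessary, which the examples above do not exercise. Here is a minimal sketch of element-wise pairing and of recycling; the anonymous function and variable names are illustrative only.

```r
# Element-wise pairing: rep(1, 3), rep(2, 2), rep(3, 1).
# The results have different lengths, so a list is returned.
reps <- mapply(rep, 1:3, 3:1)
str(reps)

# The shorter argument (length 1) is recycled to match the longer one:
# 2^2, 3^2, 4^2
powers <- mapply(function(base, exponent) base^exponent, 2:4, 2)
powers
```

As with sapply(), the result is simplified to a vector only when every call returns a value of the same length.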
vapply() function
The vapply() function is a stricter version of
sapply(): it applies the FUN function to the
elements of X, but you must declare the type and length of the
output expected from each call in FUN.VALUE. We can
think of vapply() as a sapply() with
output validation.
Syntax
# vapply(X,
# FUN,
# FUN.VALUE=class of output (length of output))
To see how it works, it’s useful to compare it to
sapply.
set.seed(100)
dat<-data.frame(X=rnorm(n=100, # We create a data.frame with 3 columns (X,Y,Z)
mean=0, # Each one composed of 100 obs. with normal
sd=1), # distribution, different means, same std.
Y=rnorm(n=100,
mean=1,
sd=1),
Z=rnorm(n=100,
mean=2,
sd=1))
sapply(dat,mean)
## X Y Z
## 0.002912563 1.011140837 2.012793127
# In this case, we expect the output to be of class "numeric", length=1
vapply(dat,mean,numeric(1))
## X Y Z
## 0.002912563 1.011140837 2.012793127
# Notice what happens when we define the length of the output to 2.
vapply(dat,mean,numeric(2))
## Error in vapply(dat, mean, numeric(2)): values must be length 2,
## but FUN(X[[1]]) result is length 1
# Basically, R is telling you that an output of 2 numbers was expected for each i,
# but the operation returned 1 for each i.
# Now, by applying the summary() function to each column, we expect an output
# of length 6 for each i.
vapply(dat,summary,numeric(6))
## X Y Z
## Min. -2.271925486 -1.1364939 -1.020814
## 1st Qu. -0.608846594 0.5681700 1.394084
## Median -0.059419897 0.9271222 2.159128
## Mean 0.002912563 1.0111408 2.012793
## 3rd Qu. 0.655891078 1.4461908 2.657317
## Max. 2.581958928 3.1686003 4.727888
# The summary() function acts in different ways depending
# on the class of the object it is applied to. In this case, because the class is "numeric",
# it returns a statistical summary of each column.
t(vapply(dat,summary,numeric(6)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## X -2.271925 -0.6088466 -0.0594199 0.002912563 0.6558911 2.581959
## Y -1.136494 0.5681700 0.9271222 1.011140837 1.4461908 3.168600
## Z -1.020814 1.3940839 2.1591282 2.012793127 2.6573168 4.727888
# The "t()" function transposes a matrix: rows become
# columns, and vice versa.
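One practical reason to prefer vapply() over sapply() is that its return type does not depend on the data. The sketch below illustrates this with a hypothetical empty input and with a deliberately inconsistent one.

```r
# Hypothetical empty input, e.g. the result of a filter that matched nothing
empty <- list()

# sapply() cannot know the intended type and falls back to an empty list
from_sapply <- sapply(empty, length)
class(from_sapply)

# vapply() returns an empty vector of the declared type
from_vapply <- vapply(empty, length, integer(1))
class(from_vapply)

# A type mismatch raises an error instead of being silently coerced;
# tryCatch() is used here only so that the example keeps running
res <- tryCatch(vapply(list(1, "a"), function(el) el, numeric(1)),
                error = function(e) "type mismatch caught")
res
```

With sapply(), the same inconsistent input would be coerced to a character vector without any warning.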
tapply() function
The tapply() function is useful when we need to
partition a vector X into groups defined by a grouping
factor INDEX, compute a FUN function on each of the
resulting subsets, and return the results in a
convenient way.
Syntax
# tapply(X,
# INDEX,
# FUN)
Where:
X is the variable on which the FUN function must be applied.
INDEX is the grouping variable.
Example:
set.seed(200)
# We create a data set, where we simulate the effect of diet and physical training in
# people with morbid obesity.
# We create the variable to control the number of samples
np=4*100 #number of patients
weight_control<-data.frame(patient_id = as.character(paste0("patient_", 1:np)),
age_years = as.integer(round(rnorm(np, mean = 50, sd =8),
digits = 0)),
starting_weight_kg=rnorm(np,300,10),
final_weight_kg = c(rnorm(np/4,mean=100,sd=10),
rnorm(np/4,mean=70,sd=5),
rnorm(np/4,mean=90,sd=5),
rnorm(np/4,mean=300,sd=10)),
height_m=rnorm(np, mean = 1.75, sd =0.08),
gender_binary=as.factor(rep_len(c("f","m"),
np)),
treatment_category = gl(4,
np/4,
labels = c("diet",
"diet_and_physical_training",
"physical_training",
"control")))
# The function "gl()" generates a factor defined
# by (number of levels, number of members per level, level labels)
## Feature engineering
# Starting and final bmi calculations
weight_control$starting_bmi<-weight_control$starting_weight_kg/(weight_control$height_m^2)
weight_control$final_bmi<-weight_control$final_weight_kg/(weight_control$height_m^2)
# Starting and final bmi classification
# We create a function that helps us do
# the classification automatically.
BMI_classifier<-function(x){
as.factor(ifelse(x<18.5,"low_weight",
ifelse(x>=18.5 &
x<25,"normal_weight",
ifelse(x>=25 &
x<30,"overweight",
ifelse(x>=30 &
x<35,"mild_obesity",
ifelse(x>=35 &
x<40,"obesity",
"morbid_obesity"))))))
}
weight_control$starting_bmi_category<-BMI_classifier(weight_control$starting_bmi)
weight_control$final_bmi_category<-BMI_classifier(weight_control$final_bmi)
# Calculation of differences (deltas)
weight_control$delta_weight_kg<-weight_control$starting_weight_kg-weight_control$final_weight_kg
weight_control$delta_bmi<-weight_control$starting_bmi-weight_control$final_bmi
# Generate a data summary
summary(weight_control)
## patient_id age_years starting_weight_kg final_weight_kg
## Length:400 Min. :28.00 Min. :270.2 Min. : 59.32
## Class :character 1st Qu.:45.00 1st Qu.:293.4 1st Qu.: 81.49
## Mode :character Median :50.00 Median :299.9 Median : 93.64
## Mean :50.12 Mean :299.7 Mean :140.58
## 3rd Qu.:55.00 3rd Qu.:306.2 3rd Qu.:159.68
## Max. :80.00 Max. :325.3 Max. :322.56
## height_m gender_binary treatment_category
## Min. :1.549 f:200 diet :100
## 1st Qu.:1.697 m:200 diet_and_physical_training:100
## Median :1.753 physical_training :100
## Mean :1.754 control :100
## 3rd Qu.:1.805
## Max. :2.005
## starting_bmi final_bmi starting_bmi_category
## Min. : 70.32 Min. : 17.60 morbid_obesity:400
## 1st Qu.: 91.34 1st Qu.: 26.21
## Median : 97.55 Median : 30.92
## Mean : 98.10 Mean : 45.81
## 3rd Qu.:104.60 3rd Qu.: 57.12
## Max. :130.09 Max. :124.41
## final_bmi_category delta_weight_kg delta_bmi
## low_weight : 4 Min. :-39.08 Min. :-12.76
## mild_obesity : 84 1st Qu.:139.79 1st Qu.: 40.19
## morbid_obesity:103 Median :204.81 Median : 65.62
## normal_weight : 77 Mean :159.15 Mean : 52.29
## obesity : 33 3rd Qu.:219.04 3rd Qu.: 73.38
## overweight : 99 Max. :261.37 Max. :100.78
# We want to EXPLORE if the treatments had any indication of effect
treatment_effects<-tapply(weight_control$final_bmi, # X
weight_control$treatment_category, # INDEX
summary) # FUN
treatment_effects
## $diet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.31 29.54 31.96 32.46 35.49 49.87
##
## $diet_and_physical_training
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.60 21.33 23.28 23.43 25.11 30.32
##
## $physical_training
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.66 27.47 29.89 29.99 32.21 39.12
##
## $control
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 78.84 91.15 96.83 97.35 103.39 124.41
treatment_effects.df<- do.call(rbind, # Function
treatment_effects) # Argument list
treatment_effects.df
## Min. 1st Qu. Median Mean 3rd Qu.
## diet 24.31047 29.53742 31.96384 32.46227 35.48720
## diet_and_physical_training 17.60446 21.32765 23.28333 23.43037 25.11293
## physical_training 23.66123 27.46616 29.88690 29.98649 32.21126
## control 78.84499 91.14723 96.83323 97.35062 103.39297
## Max.
## diet 49.87335
## diet_and_physical_training 30.31577
## physical_training 39.11740
## control 124.40778
# The "do.call()" function allows you to call any function in R,
# but instead of writing the arguments one by one, you can use
# a list to hold the function's arguments.
# Here it calls rbind() once, with every element of the list as an argument.
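In the example above FUN is summary(), which returns several values per group, so tapply() returns a list. When FUN returns a single value per group, the result is a named array instead, and INDEX may also be a list of several grouping factors. The following is a minimal sketch on a hypothetical toy data set `d`.

```r
# Hypothetical measurements with two grouping variables
d <- data.frame(value = c(10, 12, 20, 22, 30, 32),
                group = factor(c("a", "a", "b", "b", "c", "c")),
                sex   = factor(c("f", "m", "f", "m", "f", "m")))

# One grouping factor, scalar FUN: a named 1-d array of group means
group_means <- tapply(d$value, d$group, mean)
group_means

# Two grouping factors passed as a list: a group x sex table of means
cell_means <- tapply(d$value, list(d$group, d$sex), mean)
cell_means
```

The two-factor form gives a cross-tabulated matrix directly, with one row per level of the first factor and one column per level of the second.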
rapply() function
The rapply() function stands for recursive
apply and, as its name suggests, it is used to apply a
function f to all elements of object
recursively. The classes argument restricts the elements to
which f is applied, based on their
class.
This function has three basic modes. If how = "replace",
each element of object that has its class included in
classes is replaced by the result of applying
f to the element in question; all other elements are left unchanged.
With how = "list", all elements that have a
class included in classes are replaced by the result of
applying f to the element, and all others are replaced by
the default result deflt.
With how = "unlist", the same is done as with how = "list", and then
unlist(recursive = TRUE) is invoked on the result.
Finally, in ... you put any extra arguments that
f can use.
Syntax
# rapply(object,
# f,
# classes,
# deflt,
# how=c("unlist", "replace", "list"),
# ...)
Now, let’s normalize the weight, height, bmi, and delta columns (i.e. the columns of class "numeric") of weight_control using y = MinMaxNorm(x). Note that age_years is stored as an integer, so it is left untouched.
MinMaxNorm: Min-max normalization is a normalization
strategy that linearly transforms each xi to
yi = (xi - min(x)) / (max(x) - min(x)), where min(x)
and max(x) are the minimum and maximum values in
x, and xi is an individual observed
value in x. You can easily see that when
xi = min(x), then yi = 0, and when
xi = max(x), then yi = 1. Therefore, the
values of y range from 0 to 1.
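As a quick worked check of the formula, the sketch below applies it directly to a small hypothetical vector, without wrapping it in a function.

```r
# Hypothetical observations
x <- c(2, 4, 6, 10)

# min(x) = 2 and max(x) = 10, so the denominator is 10 - 2 = 8
# and the numerators are 0, 2, 4 and 8
y <- (x - min(x)) / (max(x) - min(x))
y         # 0.00 0.25 0.50 1.00
range(y)  # 0 1
```

The smallest observation maps to 0, the largest to 1, and the values in between keep their relative spacing.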
In this case, we are going to create a function that does precisely
this transformation, and we will call it MinMaxNorm.
# We create the function MinMaxNorm
MinMaxNorm<-function(x){
(x-min(x))/(max(x)-min(x)) # the value of the last expression is returned
}
weight_control.norm<-rapply(weight_control, # object
f=MinMaxNorm, # function
classes = "numeric", # class limit
how = "replace")
head(weight_control.norm)[,c(1:5)]
## patient_id age_years starting_weight_kg final_weight_kg height_m
## 1 patient_1 51 0.80689353 0.16327245 0.43820773
## 2 patient_2 52 0.76696811 0.07450938 0.43966621
## 3 patient_3 53 0.28925377 0.14459683 0.08919087
## 4 patient_4 54 0.54982210 0.22830982 0.46257339
## 5 patient_5 50 0.54355485 0.14795106 0.43278732
## 6 patient_6 49 0.01091919 0.15092286 0.54552228
head(weight_control)[,c(1:5)]
## patient_id age_years starting_weight_kg final_weight_kg height_m
## 1 patient_1 51 314.6868 102.30413 1.749155
## 2 patient_2 52 312.4859 78.93844 1.749820
## 3 patient_3 53 286.1528 97.38802 1.590079
## 4 patient_4 54 300.5162 119.42433 1.760261
## 5 patient_5 50 300.1707 98.27098 1.746685
## 6 patient_6 49 270.8101 99.05326 1.798067
plot(density(weight_control$final_bmi),
main= "Density for final bmi (untransformed)")
plot(density(weight_control.norm$final_bmi),
main= "Density for final bmi (min max transformed)")
# If we wanted to obtain a dataframe where the columns are minimum and maximum values,
# and the rows the names of the "numerical" columns of "weight_control":
weight_control.minmax<-data.frame(min= rapply(weight_control,
f=min,
classes = "numeric",
deflt = NULL,
how = "unlist"),
max=rapply(weight_control,
f=max,
classes = "numeric",
deflt = NULL,
how = "unlist"))
weight_control.minmax
## min max
## starting_weight_kg 270.208161 325.331426
## final_weight_kg 59.324857 322.561356
## height_m 1.549427 2.005211
## starting_bmi 70.315843 130.094190
## final_bmi 17.604460 124.407777
## delta_weight_kg -39.078885 261.374147
## delta_bmi -12.757147 100.778463
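The examples above apply rapply() to a data.frame, which is only one level deep. To illustrate the recursive behaviour and the three how modes side by side, here is a minimal sketch on a hypothetical nested list.

```r
# Hypothetical nested list mixing numeric and character elements
nested <- list(a = c(1, 4, 9),
               b = list(c = "text", d = c(16, 25)))

# how = "replace": numeric leaves are transformed, everything else is kept
replaced <- rapply(nested, f = sqrt, classes = "numeric", how = "replace")
str(replaced)

# how = "list": non-matching leaves are replaced by deflt (here NA)
listed <- rapply(nested, f = sqrt, classes = "numeric",
                 deflt = NA, how = "list")
str(listed)

# how = "unlist": as "list", then flattened with unlist();
# with the default deflt = NULL, non-matching leaves drop out
flat <- rapply(nested, f = sqrt, classes = "numeric", how = "unlist")
flat
```

Note how the character element survives unchanged under "replace", becomes NA under "list", and disappears from the flattened result under "unlist".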
apply: Apply function over the margins of an array.
lapply: Loop over a list and evaluate a function on each element.
sapply: Basically the same as lapply but simplifies/reduces the result.
vapply: A stricter sapply that checks each result against a declared type and length.
mapply: Multivariate version of sapply.
tapply: Apply a function over subsets of a vector.
rapply: Apply a function to all elements of an object recursively. The
classes argument restricts the elements to which the function is applied, based on their class.