The objective of this worksheet is to build familiarity with the apply family of functions.

1. apply function

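apply: applies a function to each row or column of a matrix (more precisely, over the margins of an array). A data frame is not a matrix, but apply first converts it to one with as.matrix, so apply also works on data frames, most reliably when every column has the same type. For example: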

Df <- data.frame(a = 1:10, b = rnorm(10), c = runif(10, 10, 100))
apply(Df, 1, sum) # 1 as the second argument (MARGIN) returns the sum of each row
##  [1]  21.39524  64.37107  98.04714  31.53353  49.88592  37.38125  55.24396
##  [8] 107.99997  87.07680  57.44775
apply(Df, 2, sum) # 2 as the second argument returns the sum of each column
##          a          b          c 
##  55.000000   1.756291 553.626336
# More examples:
#Summary statistics of each column
apply(Df, 2, summary) 
##             a           b        c
## Min.     1.00 -1.32141966 20.77769
## 1st Qu.  3.25 -0.52045449 33.59466
## Median   5.50  0.09476273 47.72110
## Mean     5.50  0.17562908 55.36263
## 3rd Qu.  7.75  0.92338168 75.28305
## Max.    10.00  1.64472951 98.35524
#only mean of each column
apply(Df, 2, mean)
##          a          b          c 
##  5.5000000  0.1756291 55.3626336
#min of each column
apply(Df, 2, min)
##        a        b        c 
##  1.00000 -1.32142 20.77769
# Sum of squares of each column
apply(Df, 2, function(x) sum(x^2))
##            a            b            c 
##   385.000000     9.150518 37433.727216
# Number of NAs in each column
apply(Df, 2, function(x) sum(is.na(x)))
## a b c 
## 0 0 0
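
Because the examples above use random data, the numbers change on every run. The following is a minimal sketch on a small fixed matrix, so the row and column results can be checked by hand. The objects m and df_small are hypothetical and are not part of the worksheet's data.

```r
# A small numeric matrix with known values (hypothetical example data)
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("x", "y", "z")))
#    x y z
# r1 1 3 5
# r2 2 4 6

row_totals <- apply(m, 1, sum)  # MARGIN = 1: one result per row
col_totals <- apply(m, 2, sum)  # MARGIN = 2: one result per column

# A data frame is converted with as.matrix() before the function is called,
# so an all-numeric data frame gives the same column sums as the matrix
df_small  <- data.frame(x = 1:2, y = 3:4, z = 5:6)
df_totals <- apply(df_small, 2, sum)

print(row_totals)
print(col_totals)
```

For plain sums and means, rowSums, colSums, rowMeans and colMeans give the same results and are usually faster; apply is the general tool for any other function.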

Try the following exercises:

# Ex1: Why does the following call fail instead of returning the sums of the first two columns?
# Create the following R object:
Dfm <- data.frame(a = 1:10, b = 10:1, letters[1:10])
apply(Dfm, 2, sum)
## Error in FUN(newX[, i], ...): invalid 'type' (character) of argument
# Ex2: Get the mean and variance of each column with a single call to apply.
Df <- data.frame(a = 1:10, b = rnorm(10), c = runif(10, 10, 100))

# Ex3: Replace the NA values with 0 in the following data frame, in a single line, using apply.
Df <- data.frame(a = c(1:9, NA), b = sample(c(1, 5, NA), size = 10, replace = TRUE))

2. lapply function

lapply applies a function to each element of a list or vector. It always returns a list of the same length as X (the input vector or list).

v <- c(1, 2, 3, 4, 5)

#square root of each element
lapply(v, sqrt)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 2.236068
#square of each element
lapply(v, function(x) x^2)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 16
## 
## [[5]]
## [1] 25
#Function on list
Ls <- list(a = 1:10, b = runif(5), c = rnorm(15))

#length of each element
lapply(Ls, length)
## $a
## [1] 10
## 
## $b
## [1] 5
## 
## $c
## [1] 15
#mean of each element
lapply(Ls, mean)
## $a
## [1] 5.5
## 
## $b
## [1] 0.528497
## 
## $c
## [1] -0.02633585
# Summary statistics of each element of the list
lapply(Ls, summary)
## $a
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.25    5.50    5.50    7.75   10.00 
## 
## $b
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07896 0.20182 0.45567 0.52850 0.91396 0.99208 
## 
## $c
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.08278 -0.44829 -0.12568 -0.02634  0.36911  1.46365
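
Any arguments given to lapply after the function are passed on to that function for every element. A minimal sketch, using a hypothetical list Lna that contains a missing value:

```r
# Hypothetical list with a missing value in one element
Lna <- list(a = c(1, 2, NA), b = c(4, 6))

# Without na.rm, the mean of `a` is NA
plain <- lapply(Lna, mean)

# Arguments after FUN are forwarded to FUN on every call,
# so this runs mean(x, na.rm = TRUE) for each element
clean <- lapply(Lna, mean, na.rm = TRUE)

print(plain)
print(clean)
```

The result is still a list with one entry per input element; only the values change.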

Try the following exercises:

# Ex1: Why does the following code work on a data frame?
Df <- data.frame(a = 1:10, b = 10:1, c = rnorm(10)) 
lapply(Df, mean)
## $a
## [1] 5.5
## 
## $b
## [1] 5.5
## 
## $c
## [1] -0.3396287
# Ex2: Calculate the number of missing values in each column of each data set in the list (Lsd).
data(cars)
data("mtcars")
Lsd <- list(cars = cars, mtcars = mtcars)

# Ex3: Calculate the summary of each column of each data set in Lsd.

# Ex4: Write a function that uses lapply within an lapply.

# Ex5: Attempt all the `apply` exercises using lapply.

3. sapply function

sapply is the same as lapply, but it simplifies the output to a vector or matrix where possible. sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f). The examples below repeat the lapply examples with sapply; compare the output.

v <- c(1, 2, 3, 4, 5)

#square root of each element
sapply(v, sqrt)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
#square of each element
sapply(v, function(x) x^2)
## [1]  1  4  9 16 25
#Function on list
Ls <- list(a = 1:10, b = runif(5), c = rnorm(15))

#length of each element
sapply(Ls, length)
##  a  b  c 
## 10  5 15
#mean of each element
sapply(Ls, mean)
##           a           b           c 
##  5.50000000  0.30585782 -0.03530268
# Summary statistics of each element of the list
sapply(Ls, summary) # a matrix, often easier to read than the list from lapply
##             a         b           c
## Min.     1.00 0.1492890 -1.14370132
## 1st Qu.  3.25 0.1936619 -0.58296544
## Median   5.50 0.2005962 -0.18599022
## Mean     5.50 0.3058578 -0.03530268
## 3rd Qu.  7.75 0.2083238  0.41617079
## Max.    10.00 0.7774181  1.60228060

Attempt all the lapply exercises using sapply and check the differences in the output. If you find yourself typing unlist(lapply(...)), stop and consider sapply.
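
The relationship between the two functions can be sketched as follows. This is an illustration on a small hypothetical vector: it shows the unlist(lapply(...)) pattern, a case where sapply cannot simplify, and the equivalence when simplification and naming are switched off.

```r
v <- c(1, 4, 9)

# Equal-length results: sapply simplifies to a vector, lapply returns a list
as_list   <- lapply(v, sqrt)
as_vector <- sapply(v, sqrt)
same_values <- identical(unlist(as_list), as_vector)

# Results of different lengths cannot be simplified, so sapply returns a list
ragged <- sapply(1:3, function(n) seq_len(n))

# With simplification and naming switched off, sapply matches lapply
same_as_lapply <- identical(
  sapply(v, sqrt, simplify = FALSE, USE.NAMES = FALSE),
  lapply(v, sqrt)
)

print(same_values)
print(is.list(ragged))
print(same_as_lapply)
```

The second case is the reason to be careful with sapply inside functions: the shape of the result depends on the data.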

4. vapply function

vapply is similar to sapply, but the type and length of each return value are specified in advance through FUN.VALUE, so it can be safer (and sometimes faster) to use. For example, using Df from Ex1 of the lapply section:

vapply(Df, FUN = mean, FUN.VALUE = 0) # works: 0 is a numeric template of length 1, and every mean is a single number
##          a          b          c 
##  5.5000000  5.5000000 -0.3396287
vapply(Df, FUN = mean, FUN.VALUE = "a")  # fails: the template is character, but mean returns a double
## Error in vapply(Df, FUN = mean, FUN.VALUE = "a"): values must be type 'character',
##  but FUN(X[[1]]) result is type 'double'

For more information, see ?vapply.
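
A minimal sketch of what the FUN.VALUE template controls, using a hypothetical list Ls2 with fixed values. Writing the template as numeric(1) or integer(1) rather than a literal such as 0 is a common convention, not a requirement.

```r
# Hypothetical list with known values
Ls2 <- list(a = c(1, 2, 6), b = c(2.5, 7.5))

# FUN.VALUE = numeric(1): each call must return exactly one number
means <- vapply(Ls2, mean, FUN.VALUE = numeric(1))

# FUN.VALUE = numeric(2): each call must return two numbers,
# and the result is a matrix with one column per element
ranges <- vapply(Ls2, range, FUN.VALUE = numeric(2))

# On empty input, sapply returns an empty list,
# while vapply keeps the declared type
empty_s <- sapply(list(), length)
empty_v <- vapply(list(), length, FUN.VALUE = integer(1))

print(means)
print(ranges)
print(class(empty_s))
print(class(empty_v))
```

The empty-input case is where the extra safety matters most: code that expects a numeric vector keeps working with vapply, whereas sapply silently hands back a list.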

5. mapply function

mapply is useful when you have several data structures (e.g. vectors, lists) and want to apply a function to the first elements of each, then the second elements of each, and so on, coercing the result to a vector or array as in sapply.

#Sums the 1st elements, the 2nd elements, etc. 
mapply(sum, 1:5, 1:5, 1:5) 
## [1]  3  6  9 12 15
mapply(rep, 1:4, 4:1) # rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1)
## [[1]]
## [1] 1 1 1 1
## 
## [[2]]
## [1] 2 2 2
## 
## [[3]]
## [1] 3 3
## 
## [[4]]
## [1] 4
# Generate random numbers with different means and standard deviations (one column per mean/sd pair)
mapply(FUN = function(x, y) rnorm(5, x, y), 1:5, 5:1)
##            [,1]       [,2]      [,3]     [,4]     [,5]
## [1,] -1.4199575 10.9466578 -1.178147 2.480393 4.231029
## [2,]  2.7047023  9.4547140  3.964918 2.161408 5.467605
## [3,]  0.6912268  0.9505227  2.268016 4.452615 4.822792
## [4,]  9.2267466 -2.5584926 -2.002012 3.025329 5.724015
## [5,] 10.1348753 -0.7810312  2.342527 7.124626 5.635718
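
Two arguments of mapply are worth knowing alongside the examples above: MoreArgs, for arguments that stay fixed across all calls, and SIMPLIFY = FALSE, which always returns a list. A minimal sketch with fixed inputs (the anonymous functions here are illustrative only):

```r
# Element-wise pairing: f(1, 4), f(2, 5), f(3, 6)
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# MoreArgs holds arguments that are the same for every call
joined <- mapply(function(x, y, sep) paste(x, y, sep = sep),
                 c("a", "b"), c("x", "y"),
                 MoreArgs = list(sep = "-"))

# SIMPLIFY = FALSE skips simplification and always returns a list
reps <- mapply(rep, 1:3, 3:1, SIMPLIFY = FALSE)

print(sums)
print(unname(joined))
print(reps)
```

Map(f, ...) is a thin wrapper with the same element-wise behaviour that never simplifies, which some people find easier to reason about.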

I have never felt the need for the mapply function in any of my own problems.