Objective

  1. apply() applies a function across the rows or columns of a data frame or matrix. When the function returns a single value, the result is a vector with one element per row or column, named after that row or column where names exist;
  2. lapply() is used on list objects and returns a list as well;
  3. sapply() works the same way as lapply(): it accepts a list and a function. Instead of always returning a list, it returns the result in the simplest possible format (a vector or matrix where it can);
  4. vapply() is similar to sapply(), but requires you to specify the type of value you expect each call to return;
  5. tapply() applies a function to groups of values, much like group_by() followed by summarize() in dplyr;
  6. mapply() applies a function element by element across several vectors, which makes it handy for creating a new variable.
pacman::p_load(
  dplyr,      # data cleaning 
  tidyverse  # data management and visualization
)

apply()

The general form is apply(X, MARGIN, FUN, ...), where X is a data frame or matrix, MARGIN sets whether the function runs over rows (1) or columns (2), and FUN is the function to apply. Any further arguments, such as na.rm = TRUE below, are passed on to FUN.

cmeans <- apply(mtcars, 2, mean, na.rm = TRUE)
csds <- apply(mtcars, 2, sd, na.rm = TRUE)

data.frame(means = cmeans, stddev = csds)
##           means      stddev
## mpg   20.090625   6.0269481
## cyl    6.187500   1.7859216
## disp 230.721875 123.9386938
## hp   146.687500  68.5628685
## drat   3.596563   0.5346787
## wt     3.217250   0.9784574
## qsec  17.848750   1.7869432
## vs     0.437500   0.5040161
## am     0.406250   0.4989909
## gear   3.687500   0.7378041
## carb   2.812500   1.6152000
# Create data frame
example <- data.frame(indiv = c("A", "B", "C", "D", "E"),
                      height_0 = c(15, 10, 12, 9, 17),
                      height_10 = c(20, 18, 14, 15, 19),
                      height_20 = c(23, 24, 18, 17, 26))

# View the data frame
head(example)
##   indiv height_0 height_10 height_20
## 1     A       15        20        23
## 2     B       10        18        24
## 3     C       12        14        18
## 4     D        9        15        17
## 5     E       17        19        26
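
The apply() calls on this data frame select columns 2 to 4 and leave out indiv. A likely reason is that apply() converts a data frame to a matrix before looping, and a matrix can hold only one data type. The sketch below illustrates this with a hypothetical two-row data frame (df, m and row_means are names made up for the illustration):

```r
# Hypothetical miniature of the data frame used in this section
df <- data.frame(indiv = c("A", "B"),
                 height_0 = c(15, 10),
                 height_10 = c(20, 18))

# apply() works on a matrix; converting a data frame that contains a
# character column turns every cell into a character string
m <- as.matrix(df)
typeof(m)

# Dropping the character column first keeps the values numeric
row_means <- apply(df[, 2:3], 1, mean)
row_means
```

With the character column included, numeric functions such as mean() would no longer receive numbers, which is why the non-numeric column is dropped before calling apply().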
# Calculate the mean for each row in the data frame
row.avg <- apply(X = example[, 2:4], 1, FUN = mean)
row.avg
## [1] 19.33333 17.33333 14.66667 13.66667 20.66667
# Calculate the mean for each column 
col.avg <- apply(example[, 2:4], 2, mean)
col.avg 
##  height_0 height_10 height_20 
##      12.6      17.2      21.6
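# For plain means, rowMeans() and colMeans() return the same values as the apply() calls above and are faster
# Create a function that checks whether the mean height exceeds 15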

is_tall <- function(x) {
  value <- mean(x) > 15
  return(value)
}

apply(example[, 2:4], 2, is_tall)
##  height_0 height_10 height_20 
##     FALSE      TRUE      TRUE
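
For plain row and column means, base R provides rowMeans() and colMeans() as alternatives to apply(). The following sketch rebuilds the height columns from the example (the object name heights is an assumption made for this illustration) and checks that both approaches agree:

```r
# Height columns from the example data frame, rebuilt so the block is self-contained
heights <- data.frame(height_0  = c(15, 10, 12, 9, 17),
                      height_10 = c(20, 18, 14, 15, 19),
                      height_20 = c(23, 24, 18, 17, 26))

# Row means: generic apply() versus the dedicated helper
row_avg_apply <- apply(heights, 1, mean)
row_avg_fast  <- rowMeans(heights)

# Column means: generic apply() versus the dedicated helper
col_avg_apply <- apply(heights, 2, mean)
col_avg_fast  <- colMeans(heights)

# Compare the values only; names are dropped so they cannot affect the check
all.equal(unname(row_avg_apply), unname(row_avg_fast))
all.equal(unname(col_avg_apply), unname(col_avg_fast))
```

apply() remains the more general tool, since it accepts any function (such as is_tall above), while the helpers cover only sums and means.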

lapply()

In lapply(), the “l” in front of “apply” stands for “list”: the function is used on list objects and returns a list as well.

# Set seed so that the randomly-generated numbers are the same each time
set.seed(123)

# Create a list using randomly-generated numbers
plants <- list(height = runif(10, min = 10, max = 20),  # runif() draws random numbers from a uniform distribution
               mass = runif(10, min = 5, max = 10),
               flowers = sample(1:10, 10))  # sample() shuffles the integers 1 to 10 (sampling without replacement)

# View the list
plants
## $height
##  [1] 12.87578 17.88305 14.08977 18.83017 19.40467 10.45556 15.28105 18.92419
##  [9] 15.51435 14.56615
## 
## $mass
##  [1] 9.784167 7.266671 8.387853 7.863167 5.514623 9.499125 6.230439 5.210298
##  [9] 6.639604 9.772518
## 
## $flowers
##  [1]  9 10  1  5  3  2  6  7  8  4
mean(plants$height)
## [1] 15.78248
mean(plants$mass)
## [1] 7.616846
mean(plants$flowers)
## [1] 5.5
# Create an empty list with one slot per element of plants
plant_avgs <- vector("list", length(plants))

# Loop over the list elements and store each mean
for (i in seq_along(plants)) {
  plant_avgs[[i]] <- mean(plants[[i]])
}

# View the list
plant_avgs
## [[1]]
## [1] 15.78248
## 
## [[2]]
## [1] 7.616846
## 
## [[3]]
## [1] 5.5

lapply() doesn’t have the MARGIN argument (1 = rows, 2 = columns) that apply() has, because it always applies the function to every element of the list. The call is simply lapply(X = list, FUN = function.you.want).

lapply(plants, mean)
## $height
## [1] 15.78248
## 
## $mass
## [1] 7.616846
## 
## $flowers
## [1] 5.5

The output of lapply() is also a list, where the means of height, mass, and flowers are saved as list elements of the same name.
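
As with apply(), extra arguments after FUN are passed on to the function, and FUN can be an anonymous function. Below is a minimal sketch using a hypothetical list with a missing value (measurements, means_list and spread_list are illustrative names, not part of the original example):

```r
# Hypothetical list; one element contains a missing value
measurements <- list(height = c(12.5, NA, 14.1),
                     mass   = c(7.2, 8.4, 6.9))

# na.rm = TRUE is forwarded to mean() for every list element
means_list <- lapply(measurements, mean, na.rm = TRUE)
means_list

# An anonymous function computing the spread (max minus min) of each element
spread_list <- lapply(measurements, function(v) {
  max(v, na.rm = TRUE) - min(v, na.rm = TRUE)
})
spread_list
```

Both results are lists that keep the element names height and mass.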

sapply()

sapply() and lapply() work in basically the same way. The only difference is that lapply() always returns a list, whereas sapply() tries to simplify the result into a vector or matrix.

sapply(plants, mean)
##    height      mass   flowers 
## 15.782475  7.616846  5.500000
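
What “simplest possible format” means depends on what the function returns. The sketch below, built on a small hypothetical list (plants_demo and the result names are assumptions for illustration), shows sapply() producing a matrix when every call returns two values, and falling back to a list when the results differ in length:

```r
# Hypothetical stand-in for the plants list, with fixed values
plants_demo <- list(height = c(12, 18, 15),
                    mass   = c(7, 9, 5))

# range() returns two values per element, so the result is a 2 x 2 matrix
# with one column per list element
range_matrix <- sapply(plants_demo, range)
range_matrix

# Results of different lengths cannot be simplified, so a list comes back
big_values <- sapply(list(a = 1:3, b = 1:5), function(v) v[v > 2])
big_values
```

Because the return type can change with the data, sapply() is convenient interactively but less predictable inside functions.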

vapply()

vapply() is similar to sapply(), but it requires you to specify what type of data you are expecting. The call is vapply(X, FUN, FUN.VALUE), where FUN.VALUE is a template for the value each call should return. For the plants list, each element should yield a single numeric mean, so FUN.VALUE = numeric(1).
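
A minimal sketch of the call described above, using a small hypothetical list in place of plants so that the values are fixed (plants_demo, plant_means and bad_call are illustrative names):

```r
# Hypothetical stand-in for the plants list
plants_demo <- list(height = c(12, 18, 15),
                    mass   = c(7, 9, 5))

# Each call to mean() must return exactly one numeric value
plant_means <- vapply(plants_demo, mean, FUN.VALUE = numeric(1))
plant_means

# range() returns two values, which does not match numeric(1),
# so vapply() stops with an error instead of silently changing shape
bad_call <- tryCatch(
  vapply(plants_demo, range, FUN.VALUE = numeric(1)),
  error = function(e) "vapply stopped"
)
bad_call
```

The strictness is the point: a mismatch between expectation and result surfaces immediately rather than propagating through later code.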

tapply()

tapply() applies a function to subsets of a vector defined by a grouping variable, much like group_by() followed by summarize() in the dplyr package.

library(tidyverse)

head(example) # view data in wide format
##   indiv height_0 height_10 height_20
## 1     A       15        20        23
## 2     B       10        18        24
## 3     C       12        14        18
## 4     D        9        15        17
## 5     E       17        19        26
# Pivot the data so that the data are in long format instead of wide format
example <- pivot_longer(example, 
                        cols = 2:4,
                        names_to = "time",
                        values_to = "height")

head(example)
## # A tibble: 6 × 3
##   indiv time      height
##   <chr> <chr>      <dbl>
## 1 A     height_0      15
## 2 A     height_10     20
## 3 A     height_20     23
## 4 B     height_0      10
## 5 B     height_10     18
## 6 B     height_20     24
# Use sub() to get rid of the string "height_" in front of the time values
example$time <- sub("height_","", example$time)

head(example)
## # A tibble: 6 × 3
##   indiv time  height
##   <chr> <chr>  <dbl>
## 1 A     0         15
## 2 A     10        20
## 3 A     20        23
## 4 B     0         10
## 5 B     10        18
## 6 B     20        24

Let’s use tapply() to compute the mean height at each time point across individuals. The function takes an argument called INDEX: tapply(X = vector.to.analyze, INDEX = vector.to.group.by, FUN = function.you.want). In the code below, the height values are grouped by time and summarized with mean().

tapply(example$height, example$time, mean)
##    0   10   20 
## 12.6 17.2 21.6
# The dplyr equivalent of tapply(), with an extra filter() that keeps mean heights above 15
example %>% 
  dplyr::group_by(time) %>% 
  dplyr::summarise(avg_height = mean(height)) %>% 
  ungroup() %>% 
  filter(avg_height > 15)
## # A tibble: 2 × 2
##   time  avg_height
##   <chr>      <dbl>
## 1 10          17.2
## 2 20          21.6
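
INDEX can also be a list of grouping vectors, in which case tapply() returns a table with one dimension per grouping variable. The sketch below assumes a hypothetical data frame with an extra site column (growth, site and mean_by_site_time are made up for this illustration and do not appear in the example data):

```r
# Hypothetical long-format data with two grouping variables
growth <- data.frame(site   = c("north", "north", "south", "south", "north", "south"),
                     time   = c("0", "10", "0", "10", "0", "10"),
                     height = c(15, 20, 10, 18, 13, 16))

# Mean height for every combination of site (rows) and time (columns)
mean_by_site_time <- tapply(growth$height,
                            list(growth$site, growth$time),
                            mean)
mean_by_site_time
```

This corresponds to grouping by two columns with group_by(site, time) in dplyr, except that the result is a matrix rather than a long table.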

mapply()

mapply() applies a function to the first elements of each vector it is given, then to the second elements, and so on. One use is to create a new variable: with the dataset tdata below, dividing one column by another gives a ratio of two variables:

# create dataset
my.matrx <- matrix(c(1:10, 11:20, 21:30), nrow = 10, ncol = 3)
my.matrx
##       [,1] [,2] [,3]
##  [1,]    1   11   21
##  [2,]    2   12   22
##  [3,]    3   13   23
##  [4,]    4   14   24
##  [5,]    5   15   25
##  [6,]    6   16   26
##  [7,]    7   17   27
##  [8,]    8   18   28
##  [9,]    9   19   29
## [10,]   10   20   30
tdata <- as.data.frame(cbind(c(1,1,1,1,1,2,2,2,2,2), my.matrx))
tdata
##    V1 V2 V3 V4
## 1   1  1 11 21
## 2   1  2 12 22
## 3   1  3 13 23
## 4   1  4 14 24
## 5   1  5 15 25
## 6   2  6 16 26
## 7   2  7 17 27
## 8   2  8 18 28
## 9   2  9 19 29
## 10  2 10 20 30
colnames(tdata)
## [1] "V1" "V2" "V3" "V4"
# using mapply to create a new variable 
tdata$V5 <- mapply(function(x, y) x/y, tdata$V2, tdata$V4)
tdata$V5
##  [1] 0.04761905 0.09090909 0.13043478 0.16666667 0.20000000 0.23076923
##  [7] 0.25925926 0.28571429 0.31034483 0.33333333
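
Division is already vectorized in R, so for a simple ratio mapply() gives the same result as dividing the columns directly. It becomes more useful when the function itself is not vectorized over its arguments. A short sketch with hypothetical vectors (all object names here are illustrative):

```r
# Hypothetical numerator and denominator columns
numerator   <- c(1, 2, 3)
denominator <- c(21, 22, 23)

# Element-by-element division with mapply() and with plain vectorized division
ratio_mapply <- mapply(function(x, y) x / y, numerator, denominator)
ratio_vector <- numerator / denominator
all.equal(ratio_mapply, ratio_vector)

# A case where mapply() does real work: repeat each letter a different
# number of times and paste the copies together
repeated <- mapply(function(n, value) paste(rep(value, n), collapse = ""),
                   1:3, c("a", "b", "c"))
repeated
```

For column arithmetic the vectorized form is shorter and faster; mapply() is the better fit when each pair of inputs needs its own function call.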

Using the apply() functions on a real dataset

# The state data sets (state.x77, state.area, state.region) ship with R's built-in
# datasets package, so no extra package such as MASS needs to be loaded
data(state)
head(state.x77)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
str(state.x77)
##  num [1:50, 1:8] 3615 365 2212 2110 21198 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##   ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...

Using apply to get summary data

apply(state.x77, 2, median)
## Population     Income Illiteracy   Life Exp     Murder    HS Grad      Frost 
##   2838.500   4519.000      0.950     70.675      6.850     53.250    114.500 
##       Area 
##  54277.000
apply(state.x77, 2, sd)
##   Population       Income   Illiteracy     Life Exp       Murder      HS Grad 
## 4.464491e+03 6.144699e+02 6.095331e-01 1.342394e+00 3.691540e+00 8.076998e+00 
##        Frost         Area 
## 5.198085e+01 8.532730e+04
apply(state.x77, 2, function(x) c(mean(x), sd(x)))
##      Population    Income Illiteracy  Life Exp  Murder   HS Grad     Frost
## [1,]   4246.420 4435.8000  1.1700000 70.878600 7.37800 53.108000 104.46000
## [2,]   4464.491  614.4699  0.6095331  1.342394 3.69154  8.076998  51.98085
##          Area
## [1,] 70735.88
## [2,] 85327.30
state.summary <- apply(state.x77, 2, function(x) c(mean(x), sd(x)))
state.summary
##      Population    Income Illiteracy  Life Exp  Murder   HS Grad     Frost
## [1,]   4246.420 4435.8000  1.1700000 70.878600 7.37800 53.108000 104.46000
## [2,]   4464.491  614.4699  0.6095331  1.342394 3.69154  8.076998  51.98085
##          Area
## [1,] 70735.88
## [2,] 85327.30
state.range <- apply(state.x77, 2, function(x) c(min(x), median(x), max(x)))
state.range
##      Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## [1,]      365.0   3098       0.50   67.960   1.40   37.80   0.0   1049
## [2,]     2838.5   4519       0.95   70.675   6.85   53.25 114.5  54277
## [3,]    21198.0   6315       2.80   73.600  15.10   67.30 188.0 566432
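
The rows of these summary matrices are labelled [1,], [2,] and so on, which makes them hard to read. One option is to have the function return a named vector, since apply() uses those names as row names. A minimal sketch on a small hypothetical matrix (scores and col_summary are illustrative names):

```r
# Hypothetical numeric matrix with two named columns
scores <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40),
                 ncol = 2,
                 dimnames = list(NULL, c("a", "b")))

# Returning a named vector gives the result matrix labelled rows
col_summary <- apply(scores, 2, function(x) c(mean = mean(x), sd = sd(x)))
col_summary

# Individual statistics can then be looked up by name
col_summary["mean", "b"]
```

The same pattern applies to the state.x77 calls above by naming the elements inside c().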

Using mapply to compute a new variable

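# Population is the first column of state.x77. Indexing with [1:50] only works because
# a matrix is stored column by column, so select the column by name instead
population <- unname(state.x77[, "Population"])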
head(state.area)
## [1]  51609 589757 113909  53104 158693 104247
area <- state.area

pop.dens <- mapply(function(x, y) x/y, population, area)
pop.dens
##  [1] 0.070045922 0.000618899 0.019419010 0.039733353 0.133578671 0.024374802
##  [7] 0.618886005 0.281477880 0.141342213 0.083752293 0.134573643 0.009729885
## [13] 0.198528369 0.146399934 0.050826079 0.027715647 0.083847011 0.078437030
## [19] 0.031853078 0.389713529 0.704129829 0.156503367 0.046640815 0.049061112
## [25] 0.068406854 0.005070070 0.019993008 0.005337434 0.087274291 0.935809086
## [31] 0.009402791 0.364611909 0.103468604 0.009014364 0.260419194 0.038830647
## [37] 0.023551005 0.261619571 0.766886326 0.090677830 0.008838761 0.098783259
## [43] 0.045773344 0.014166941 0.049120616 0.122038466 0.052190873 0.074397254
## [49] 0.081721694 0.003840105

Using tapply to explore population by region

# Summarize population within each region: min, max, mean, median
region.info <- tapply(population, state.region, function(x) c(min(x), max(x), mean(x), median(x)))
region.info
## $Northeast
## [1]   472.000 18076.000  5495.111  3100.000
## 
## $South
## [1]   579.000 12237.000  4208.125  3710.500
## 
## $`North Central`
## [1]   637 11197  4803  4255
## 
## $West
## [1]   365.000 21198.000  2915.308  1144.000
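
When the function returns several values per group, tapply() gives back a list, as above. If a table is easier to work with, one alternative is to split the vector by group and let sapply() simplify the result. The sketch below uses hypothetical population figures and regions (pop, region, summary_fun and by_region are made-up names, and the numbers are not from state.x77):

```r
# Hypothetical values and grouping factor
pop    <- c(10, 20, 30, 40, 50, 60)
region <- factor(c("East", "East", "West", "West", "West", "East"))

# Named summary statistics for one group
summary_fun <- function(x) {
  c(min = min(x), max = max(x), mean = mean(x), median = median(x))
}

# split() makes one vector per region; sapply() returns one column per region,
# and t() flips the matrix so that each region becomes a row
by_region <- t(sapply(split(pop, region), summary_fun))
by_region
```

Each row of by_region is a region and each column a statistic, which is close to what group_by() and summarise() would return in dplyr.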
