apply() applies a function across the rows or columns of a data frame or matrix. When the function returns a single value, the result is a vector with one entry per row or column analysed, carrying the row or column names where they exist;
lapply() is used on list objects and returns a list as well;
sapply() goes hand-in-hand with lapply() and works the same way: it accepts a list and a function as the input, but instead of returning a list, it returns the answers in the simplest possible format;
vapply() is similar to sapply(), but it requires you to specify what type of data you are expecting;
tapply() does the same thing as the group_by() and summarize() functions (dplyr::);
mapply() applies a function to several vectors element by element, for example to create a new variable.
pacman::p_load(
dplyr, # data cleaning
tidyverse # data management and visualization
)
In general, the call is apply(X, MARGIN, FUN, ...), where
X is a data frame or matrix, MARGIN determines
whether you are looping over rows (1) or columns (2),
and FUN is the function you wish to apply.
cmeans <- apply(mtcars, 2, mean, na.rm = TRUE)
csds <- apply(mtcars, 2, sd, na.rm = TRUE)
data.frame(means = cmeans, stddev = csds)
## means stddev
## mpg 20.090625 6.0269481
## cyl 6.187500 1.7859216
## disp 230.721875 123.9386938
## hp 146.687500 68.5628685
## drat 3.596563 0.5346787
## wt 3.217250 0.9784574
## qsec 17.848750 1.7869432
## vs 0.437500 0.5040161
## am 0.406250 0.4989909
## gear 3.687500 0.7378041
## carb 2.812500 1.6152000
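To isolate what MARGIN does, here is a minimal sketch on a small made-up matrix (the object m is an assumption for illustration and is not part of the original example). MARGIN = 1 yields one result per row, MARGIN = 2 one result per column.

```r
# Hypothetical 2 x 3 matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

# MARGIN = 1: apply sum() to each row
row_sums <- apply(m, 1, sum)
row_sums
# returns 9 12

# MARGIN = 2: apply sum() to each column
col_sums <- apply(m, 2, sum)
col_sums
# returns 3 7 11
```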
# Create data frame
example <- data.frame(indiv = c("A", "B", "C", "D", "E"),
height_0 = c(15, 10, 12, 9, 17),
height_10 = c(20, 18, 14, 15, 19),
height_20 = c(23, 24, 18, 17, 26))
# View the data frame
head(example)
## indiv height_0 height_10 height_20
## 1 A 15 20 23
## 2 B 10 18 24
## 3 C 12 14 18
## 4 D 9 15 17
## 5 E 17 19 26
# Calculate the mean for each row in the data frame
row.avg <- apply(X = example[, 2:4], 1, FUN = mean)
row.avg
## [1] 19.33333 17.33333 14.66667 13.66667 20.66667
# Calculate the mean for each column
col.avg <- apply(example[, 2:4], 2, mean)
col.avg
## height_0 height_10 height_20
## 12.6 17.2 21.6
# Use rowMeans() and colMeans() instead of apply()
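The comment above points to rowMeans() and colMeans() as replacements for apply() when the function is mean(). A minimal sketch, assuming a copy of the three height columns (the name heights is introduced here only for illustration):

```r
# Copy of the height columns from the example data frame
heights <- data.frame(height_0 = c(15, 10, 12, 9, 17),
                      height_10 = c(20, 18, 14, 15, 19),
                      height_20 = c(23, 24, 18, 17, 26))

# One mean per row, same values as apply(heights, 1, mean)
row_avg <- rowMeans(heights)

# One mean per column, same values as apply(heights, 2, mean)
col_avg <- colMeans(heights)

# Check that both approaches agree (names are ignored in the comparison)
all.equal(unname(row_avg), unname(apply(heights, 1, mean)))
all.equal(unname(col_avg), unname(apply(heights, 2, mean)))
```

Both helpers are written specifically for means, so they are usually the faster and more readable choice for this particular task.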
# Create a function
is_tall <- function(x) {
value <- mean(x) > 15
return(value)
}
apply(example[, 2:4], 2, is_tall)
## height_0 height_10 height_20
## FALSE TRUE TRUE
lapply() - the “L” in front of “apply” stands for “lists”, because this function is used on list objects and returns a list as well.
# Set seed so that the randomly-generated numbers are the same each time
set.seed(123)
# Create a list using randomly-generated numbers
plants <- list(height = runif(10, min = 10, max = 20), # runif() draws random numbers from a uniform distribution
mass = runif(10, min = 5, max = 10),
flowers = sample(1:10, 10)) # sample() returns the integers 1 to 10 in random order (no replacement)
# View the list
plants
## $height
## [1] 12.87578 17.88305 14.08977 18.83017 19.40467 10.45556 15.28105 18.92419
## [9] 15.51435 14.56615
##
## $mass
## [1] 9.784167 7.266671 8.387853 7.863167 5.514623 9.499125 6.230439 5.210298
## [9] 6.639604 9.772518
##
## $flowers
## [1] 9 10 1 5 3 2 6 7 8 4
mean(plants$height)
## [1] 15.78248
mean(plants$mass)
## [1] 7.616846
mean(plants$flowers)
## [1] 5.5
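# Create an empty list with one slot per element of plants
plant_avgs <- vector("list", length(plants))
# Loop over the elements, compute each mean, and save it in the list
for (i in seq_along(plants)) { # loop over the 3 list elements
plant_avgs[[i]] <- mean(plants[[i]])
}
# View the list
plant_avgs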
## [[1]]
## [1] 15.78248
##
## [[2]]
## [1] 7.616846
##
## [[3]]
## [1] 5.5
lapply() doesn’t have the MARGIN argument (1 for rows, 2 for columns)
that apply() has. Instead, lapply() already knows that it should apply
the specified function to every list element. You can just type
lapply(X = list, FUN = function.you.want):
lapply(plants, mean)
## $height
## [1] 15.78248
##
## $mass
## [1] 7.616846
##
## $flowers
## [1] 5.5
The output of lapply() is also a list, where the means
of height, mass, and flowers are saved as list elements of the same
name.
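Extra arguments for the function can be supplied after FUN, because lapply() forwards them through its `...` argument. A minimal sketch, using a small hypothetical list (measurements) that contains a missing value:

```r
# Hypothetical list; element a contains a missing value
measurements <- list(a = c(1, NA, 3), b = c(2, 4))

# Without na.rm, the mean of a is NA
with_na <- lapply(measurements, mean)

# na.rm = TRUE is passed on to mean() for every element
without_na <- lapply(measurements, mean, na.rm = TRUE)
without_na
```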
sapply() and lapply() work basically the same way. The
only difference is that lapply() always returns a list, whereas sapply()
tries to simplify the result into a vector or matrix.
sapply(plants, mean)
## height mass flowers
## 15.782475 7.616846 5.500000
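What sapply() returns depends on what the function returns for each element. The sketch below uses two small hypothetical lists (sizes and the unnamed list inside the last call) to show the three cases: a vector, a matrix, and a plain list when no simplification is possible.

```r
# Hypothetical list with two elements of equal length
sizes <- list(height = c(12, 15, 18), mass = c(5, 7, 9))

# One value per element: simplified to a named vector
avgs <- sapply(sizes, mean)

# Two values per element (min and max): simplified to a matrix,
# one column per list element
ranges <- sapply(sizes, range)

# Results of different lengths cannot be simplified: a list is returned
uneven <- sapply(list(a = 1:2, b = 1:3), function(x) x * 2)
```

Because the return type can change with the data, sapply() is convenient interactively but less predictable inside functions.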
vapply() is similar to sapply(), but it requires you to
specify what type of data you are expecting. The call is
vapply(X, FUN, FUN.VALUE), where FUN.VALUE is a template for the
value each call should return. Here, I am expecting each item in
the list to return a single numeric value, so FUN.VALUE =
numeric(1).
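A minimal sketch of such a call, assuming a small hypothetical list (sizes) in place of the plants list. The second call shows that vapply() stops with an error when the results do not match FUN.VALUE.

```r
# Hypothetical list of numeric vectors
sizes <- list(height = c(12, 15, 18), mass = c(5, 7, 9))

# Each mean is a single number, so the template is numeric(1)
avgs <- vapply(sizes, mean, FUN.VALUE = numeric(1))
avgs

# Asking for character(1) does not match the numeric results,
# so vapply() raises an error instead of returning something unexpected
failed <- tryCatch({
  vapply(sizes, mean, FUN.VALUE = character(1))
  FALSE
}, error = function(e) TRUE)
failed
```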
tapply() does the same thing as the
group_by() and summarize() functions from the dplyr package:
it applies a function to a vector within groups.
library(tidyverse)
head(example) # view data in wide format
## indiv height_0 height_10 height_20
## 1 A 15 20 23
## 2 B 10 18 24
## 3 C 12 14 18
## 4 D 9 15 17
## 5 E 17 19 26
# Pivot the data so that the data are in long format instead of wide format
example <- pivot_longer(example,
cols = 2:4,
names_to = "time",
values_to = "height")
head(example)
## # A tibble: 6 × 3
## indiv time height
## <chr> <chr> <dbl>
## 1 A height_0 15
## 2 A height_10 20
## 3 A height_20 23
## 4 B height_0 10
## 5 B height_10 18
## 6 B height_20 24
# Use sub() to get rid of the string "height_" in front of the time values
example$time <- sub("height_","", example$time)
head(example)
## # A tibble: 6 × 3
## indiv time height
## <chr> <chr> <dbl>
## 1 A 0 15
## 2 A 10 20
## 3 A 20 23
## 4 B 0 10
## 5 B 10 18
## 6 B 20 24
Let’s use tapply() to look at the individuals’ heights, grouped
by time. The function accepts a new argument called
INDEX: tapply(X = vector.to.analyze, INDEX = vector.to.group.by, FUN = function.you.want).
In the code below, I analyze the height values grouped by
time, using the function mean().
tapply(example$height, example$time, mean)
## 0 10 20
## 12.6 17.2 21.6
# using dplyr instead of tapply
example %>%
dplyr::group_by(time) %>%
dplyr::summarise(avg_height = mean(height)) %>%
ungroup() %>%
filter(avg_height > 15)
## # A tibble: 2 × 2
## time avg_height
## <chr> <dbl>
## 1 10 17.2
## 2 20 21.6
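The filtering step of the dplyr pipeline can also be reproduced in base R, since tapply() returns a named result that can be subset with a logical condition. A minimal sketch, assuming the long-format data are copied into two plain vectors (time and height are recreated here so the block runs on its own):

```r
# Copy of the long-format data: three time points for individuals A to E
time <- rep(c("0", "10", "20"), times = 5)
height <- c(15, 20, 23, 10, 18, 24, 12, 14, 18, 9, 15, 17, 17, 19, 26)

# Mean height per time point (same role as group_by() + summarise())
avg_height <- tapply(height, time, mean)

# Keep only the time points with a mean above 15 (same role as filter())
tall <- avg_height[avg_height > 15]
tall
```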
One use for mapply() is to create a new
variable. For example, using the dataset tdata built below, I can divide one column by
another column to create a new value. This is useful for creating
a ratio of two variables, as shown in the example below:
# create dataset
my.matrx <- matrix(c(1:10, 11:20, 21:30), nrow = 10, ncol = 3)
my.matrx
## [,1] [,2] [,3]
## [1,] 1 11 21
## [2,] 2 12 22
## [3,] 3 13 23
## [4,] 4 14 24
## [5,] 5 15 25
## [6,] 6 16 26
## [7,] 7 17 27
## [8,] 8 18 28
## [9,] 9 19 29
## [10,] 10 20 30
tdata <- as.data.frame(cbind(c(1,1,1,1,1,2,2,2,2,2), my.matrx))
tdata
## V1 V2 V3 V4
## 1 1 1 11 21
## 2 1 2 12 22
## 3 1 3 13 23
## 4 1 4 14 24
## 5 1 5 15 25
## 6 2 6 16 26
## 7 2 7 17 27
## 8 2 8 18 28
## 9 2 9 19 29
## 10 2 10 20 30
colnames(tdata)
## [1] "V1" "V2" "V3" "V4"
# using mapply to create a new variable
tdata$V5 <- mapply(function(x, y) x/y, tdata$V2, tdata$V4)
tdata$V5
## [1] 0.04761905 0.09090909 0.13043478 0.16666667 0.20000000 0.23076923
## [7] 0.25925926 0.28571429 0.31034483 0.33333333
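Division is already vectorized in R, so for a simple ratio the mapply() call gives the same numbers as dividing the columns directly. The sketch below checks this on two vectors that mirror columns V2 and V4 (v2 and v4 are names introduced here for illustration), and then shows a case where mapply() is needed because each call uses a different pair of arguments.

```r
# Vectors mirroring tdata$V2 and tdata$V4
v2 <- 1:10
v4 <- 21:30

# Element-by-element ratio with mapply()
ratio_mapply <- mapply(function(x, y) x / y, v2, v4)

# The same ratio with plain vectorized division
ratio_vector <- v2 / v4
all.equal(ratio_mapply, ratio_vector)

# mapply() pairs the arguments up: rep(1, 3), rep(2, 2), rep(3, 1)
reps <- mapply(rep, 1:3, 3:1)
reps
```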
apply() function in a real dataset
library(MASS)
## Warning: package 'MASS' was built under R version 4.2.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data(state)
head(state.x77)
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
str(state.x77)
## num [1:50, 1:8] 3615 365 2212 2110 21198 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
apply to get summary data
apply(state.x77, 2, median)
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## 2838.500 4519.000 0.950 70.675 6.850 53.250 114.500
## Area
## 54277.000
apply(state.x77, 2, sd)
## Population Income Illiteracy Life Exp Murder HS Grad
## 4.464491e+03 6.144699e+02 6.095331e-01 1.342394e+00 3.691540e+00 8.076998e+00
## Frost Area
## 5.198085e+01 8.532730e+04
apply(state.x77, 2, function(x) c(mean(x), sd(x)))
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## [1,] 4246.420 4435.8000 1.1700000 70.878600 7.37800 53.108000 104.46000
## [2,] 4464.491 614.4699 0.6095331 1.342394 3.69154 8.076998 51.98085
## Area
## [1,] 70735.88
## [2,] 85327.30
state.summary <- apply(state.x77, 2, function(x) c(mean(x), sd(x)))
state.summary
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## [1,] 4246.420 4435.8000 1.1700000 70.878600 7.37800 53.108000 104.46000
## [2,] 4464.491 614.4699 0.6095331 1.342394 3.69154 8.076998 51.98085
## Area
## [1,] 70735.88
## [2,] 85327.30
state.range <- apply(state.x77, 2, function(x) c(min(x), median(x), max(x)))
state.range
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## [1,] 365.0 3098 0.50 67.960 1.40 37.80 0.0 1049
## [2,] 2838.5 4519 0.95 70.675 6.85 53.25 114.5 54277
## [3,] 21198.0 6315 2.80 73.600 15.10 67.30 188.0 566432
mapply to compute a new variable
population <- state.x77[1:50] # first 50 values of the matrix, i.e. the Population column
head(state.area)
## [1] 51609 589757 113909 53104 158693 104247
area <- state.area
pop.dens <- mapply(function(x, y) x/y, population, area)
pop.dens
## [1] 0.070045922 0.000618899 0.019419010 0.039733353 0.133578671 0.024374802
## [7] 0.618886005 0.281477880 0.141342213 0.083752293 0.134573643 0.009729885
## [13] 0.198528369 0.146399934 0.050826079 0.027715647 0.083847011 0.078437030
## [19] 0.031853078 0.389713529 0.704129829 0.156503367 0.046640815 0.049061112
## [25] 0.068406854 0.005070070 0.019993008 0.005337434 0.087274291 0.935809086
## [31] 0.009402791 0.364611909 0.103468604 0.009014364 0.260419194 0.038830647
## [37] 0.023551005 0.261619571 0.766886326 0.090677830 0.008838761 0.098783259
## [43] 0.045773344 0.014166941 0.049120616 0.122038466 0.052190873 0.074397254
## [49] 0.081721694 0.003840105
tapply to explore population by region
region.info <- tapply(population, state.region, function(x) c(min(x), max(x), mean(x), median(x))) # group by state.region
region.info
## $Northeast
## [1] 472.000 18076.000 5495.111 3100.000
##
## $South
## [1] 579.000 12237.000 4208.125 3710.500
##
## $`North Central`
## [1] 637 11197 4803 4255
##
## $West
## [1] 365.000 21198.000 2915.308 1144.000
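Because the function returns four values per region, tapply() gives back a list whose elements are unlabeled vectors. One possible alternative, sketched below, is to split the vector by region and let sapply() simplify the result into a labeled matrix. The helper region_stats and the object region.table are names introduced here for illustration.

```r
# state.x77 and state.region ship with R's datasets package
population <- state.x77[, "Population"]

# Hypothetical helper that labels each statistic
region_stats <- function(x) {
  c(min = min(x), max = max(x), mean = mean(x), median = median(x))
}

# split() creates one vector per region; sapply() binds the results
# into a matrix with one column per region and one row per statistic
region.table <- sapply(split(population, state.region), region_stats)
region.table
```

Individual values can then be looked up by name, for example region.table["max", "West"].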