The goal of this tutorial is to get familiar with the apply family of functions using typical examples.
# First we load ggplot2 (the apply examples below use only base R, so this step is optional here)
library(ggplot2)
# In this example we will use iris, the classic flower-measurement dataset that ships with R.
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# We are going to remove Species, the only non-numeric variable, so that every remaining column is numeric
iris$Species <- NULL
str(iris)
## 'data.frame': 150 obs. of 4 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
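# A possible alternative, sketched on a small made-up data frame (the name `plants` and its
# columns are hypothetical): rather than dropping a column by name, keep every column for
# which is.numeric() returns TRUE.

```r
# Hypothetical toy data frame with one non-numeric column
plants <- data.frame(
  length = c(5.1, 4.9, 4.7),
  width  = c(3.5, 3.0, 3.2),
  kind   = factor(c("a", "b", "a"))
)

# sapply() gives one TRUE/FALSE per column; use it to select the numeric columns
is_num <- sapply(plants, is.numeric)
plants_numeric <- plants[, is_num, drop = FALSE]

str(plants_numeric)
```

# This form keeps working if further non-numeric columns are added later.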
# The apply family of functions is an alternative to explicit loops in R
# Its functions operate on pieces of data from matrices, vectors, data frames, etc.
# The apply function itself takes three main arguments
# The first one is the matrix or array to work with (a data frame is converted with as.matrix first)
# The second one is the margin: 1 applies the function over rows, 2 over columns
# The third one is the function to apply to each row or column vector
# Example: calculating the average of each column
apply(iris, 2, mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
# Example: calculating the average of each row
head(apply(iris, 1, mean))
## [1] 2.550 2.375 2.350 2.350 2.550 2.850
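# The third argument does not have to be a built-in function. A minimal sketch, assuming a
# small made-up matrix `m` (hypothetical, not part of iris), that uses both margins and an
# anonymous function:

```r
# Hypothetical 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# Margin 1: apply sum() to each row
row_sums <- apply(m, 1, sum)

# Margin 2: apply an anonymous function to each column
col_spread <- apply(m, 2, function(x) max(x) - min(x))

print(row_sums)
print(col_spread)
```

# The result keeps the row or column names of the input as the names of the output vector.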
# lapply applies a function to each element of a list (for a data frame, each column) and always returns a list
# Let's calculate the mean of each column of the iris dataset with lapply, which returns each column's mean as an element of a list
lapply(iris, mean)
## $Sepal.Length
## [1] 5.843333
##
## $Sepal.Width
## [1] 3.057333
##
## $Petal.Length
## [1] 3.758
##
## $Petal.Width
## [1] 1.199333
# Extra arguments placed after the function are passed on to it, here na.rm = TRUE for mean
# (iris has no missing values, so the result is unchanged)
lapply(iris, mean, na.rm = TRUE)
## $Sepal.Length
## [1] 5.843333
##
## $Sepal.Width
## [1] 3.057333
##
## $Petal.Length
## [1] 3.758
##
## $Petal.Width
## [1] 1.199333
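# Because iris has no missing values, na.rm = TRUE makes no visible difference above.
# A minimal sketch on a made-up list (the name `measurements` and its values are hypothetical)
# shows what the extra argument changes when an NA is present:

```r
# Hypothetical list of numeric vectors, one of which contains a missing value
measurements <- list(a = c(1, 2, 3), b = c(4, NA, 6))

# Without na.rm, a single NA makes the mean of that element NA
plain <- lapply(measurements, mean)

# Arguments after the function name are forwarded to mean()
cleaned <- lapply(measurements, mean, na.rm = TRUE)

print(plain)
print(cleaned)
```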
# The sapply function has several arguments that define how the output is built
# One of these arguments is "simplify", which decides if the output stays a list (FALSE) or is simplified to a vector or matrix when possible (TRUE, the default)
# Let's calculate the mean of each column of the iris dataset with sapply, which returns a named numeric vector with the result
sapply(iris, mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
# As with lapply, extra arguments placed after the function are passed on to mean
sapply(iris, mean, na.rm = TRUE)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
# With simplify = FALSE, sapply returns a list, the same result that lapply gave above
sapply(iris, mean, na.rm = TRUE, simplify = FALSE)
## $Sepal.Length
## [1] 5.843333
##
## $Sepal.Width
## [1] 3.057333
##
## $Petal.Length
## [1] 3.758
##
## $Petal.Width
## [1] 1.199333
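# The shape of the simplified result depends on what the function returns. A minimal sketch,
# assuming a small made-up data frame `scores` (hypothetical), that also checks the
# equivalence with lapply using identical():

```r
# Hypothetical toy data frame with two numeric columns
scores <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# One value per column: sapply() simplifies to a named numeric vector
means <- sapply(scores, mean)

# Two values per column (min and max): sapply() simplifies to a 2 x 2 matrix
ranges <- sapply(scores, range)

# With simplify = FALSE the result is the same list that lapply() returns
same_as_lapply <- identical(sapply(scores, mean, simplify = FALSE),
                            lapply(scores, mean))

print(means)
print(ranges)
print(same_as_lapply)
```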
In this tutorial we have taken a first look at the apply family of functions, covering apply, lapply and sapply. The family also includes other functions, such as vapply, mapply and tapply, and we encourage you to explore them.