Apply and For Loops in R

Let’s look at the R help guide on apply

??apply

Different types of Apply:

  • base::apply : Apply Functions Over Array Margins
  • base::by : Apply a Function to a Data Frame Split by Factors
  • base::eapply : Apply a Function Over Values in an Environment
  • base::lapply : Apply a Function over a List or Vector
  • base::mapply : Apply a Function to Multiple List or Vector Arguments
  • base::rapply : Recursively Apply a Function to a List
  • base::tapply : Apply a Function Over a Ragged Array

Basic Apply

Description: “Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.” The margin is either the rows (1), the columns (2), or both (1:2). With both, the function is applied to each individual value. Here is an example:

#Create a matrix of 10 rows x 2 columns
m <- matrix(c(1:10, 11:20), nrow = 10, ncol=2)
m
##       [,1] [,2]
##  [1,]    1   11
##  [2,]    2   12
##  [3,]    3   13
##  [4,]    4   14
##  [5,]    5   15
##  [6,]    6   16
##  [7,]    7   17
##  [8,]    8   18
##  [9,]    9   19
## [10,]   10   20
#Find the mean of the rows 
apply(m, 1, mean)
##  [1]  6  7  8  9 10 11 12 13 14 15
#Find the mean of the columns
apply(m, 2, mean)
## [1]  5.5 15.5
#Divide all values in the matrix by 2
apply(m, 1:2, function(x) x/2)
##       [,1] [,2]
##  [1,]  0.5  5.5
##  [2,]  1.0  6.0
##  [3,]  1.5  6.5
##  [4,]  2.0  7.0
##  [5,]  2.5  7.5
##  [6,]  3.0  8.0
##  [7,]  3.5  8.5
##  [8,]  4.0  9.0
##  [9,]  4.5  9.5
## [10,]  5.0 10.0
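For row and column means specifically, base R also has rowMeans() and colMeans(), and element-wise arithmetic works on a matrix directly. Here is a minimal sketch, reusing the same matrix m, that checks these shortcuts against the apply calls above (the names row_means, col_means and halved are only for illustration):

```r
# Same 10 x 2 matrix as above
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)

# apply() over rows (1) and columns (2)
row_means <- apply(m, 1, mean)
col_means <- apply(m, 2, mean)

# Built-in shortcuts that compute the same values
all.equal(row_means, rowMeans(m))
all.equal(col_means, colMeans(m))

# Dividing every value by 2 does not need apply(),
# because arithmetic in R is already element-wise
halved <- m / 2
all.equal(halved, apply(m, 1:2, function(x) x / 2))
```

apply is still the tool to reach for when the function has no ready-made row or column version.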

Here is an example from the help guide:

#Compute a trimmed mean for each column of a matrix (trim = .2 is passed on to mean):
x <- cbind(x1 = 3, x2 = c(4:1, 2:5))
x
##      x1 x2
## [1,]  3  4
## [2,]  3  3
## [3,]  3  2
## [4,]  3  1
## [5,]  3  2
## [6,]  3  3
## [7,]  3  4
## [8,]  3  5
dimnames(x)[[1]] <- letters[1:8]
apply(x, 2, mean, trim = .2)
## x1 x2 
##  3  3
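The same matrix can also be used for row and column sums by swapping in sum as the function. This is a small sketch; the names col_sums, row_sums and trimmed_x2 are just illustrative. The last lines redo the trimmed mean of x2 by hand: with 8 values and trim = .2, one value is dropped from each end before averaging.

```r
x <- cbind(x1 = 3, x2 = c(4:1, 2:5))
dimnames(x)[[1]] <- letters[1:8]

# Column sums (margin 2) and row sums (margin 1)
col_sums <- apply(x, 2, sum)
row_sums <- apply(x, 1, sum)
col_sums
row_sums

# Trimmed mean of x2 by hand: sort, drop the smallest and largest value
trimmed_x2 <- mean(sort(x[, "x2"])[2:7])
trimmed_x2
```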

I’m going to skip “by” for now, because it involves splitting data, which I think is easier to do with the dplyr package. But if you’re curious about base::by, visit this blog post on Apply in R. I have used quite a few examples in this R Markdown exercise from N. Saunders’s blog post, along with some of my own examples from Vince Buffalo’s Bioinformatics Data Skills.

Now that we know the basics of how apply works, let’s look at the other members of the apply family.

lapply()

Description: “lapply returns a list of the same length as X, each element of which is the result of applying the function to the corresponding element of X.”

Here is a simple example:

#Create a list with 2 elements
l <- list(a=1:10, b=11:20)
l
## $a
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $b
##  [1] 11 12 13 14 15 16 17 18 19 20
#Find the mean of values in each element
lapply(l,mean)
## $a
## [1] 5.5
## 
## $b
## [1] 15.5
#Find the sum of the values in each element
lapply(l,sum)
## $a
## [1] 55
## 
## $b
## [1] 155

Here is an example for a list of numeric values:

#list
ll <- list(a=rnorm(6,mean=1), b=rnorm(6,mean=4), c=rnorm(6,mean=6))
ll
## $a
## [1]  1.74128637  0.89289984 -0.09936126  1.20483917  1.01187326  1.08487061
## 
## $b
## [1] 4.242054 3.589783 4.353298 4.858905 1.565346 2.751108
## 
## $c
## [1] 5.682891 6.126494 6.717015 4.210750 6.081511 4.558200
#Calculate the mean for each vector stored in the list
  #First create the empty vector for means
ll_means <- numeric(length(ll))
  #Loop over list element and calc mean
for (i in seq_along(ll)) { ll_means[i] <- mean(ll[[i]]) }
ll_means
## [1] 0.9727347 3.5600824 5.5628102
  #lapply does the same thing with less code:
lapply(ll,mean)
## $a
## [1] 0.9727347
## 
## $b
## [1] 3.560082
## 
## $c
## [1] 5.56281
#Ignoring NA values:
  #First make the function
meanRemoveNA <- function(x) mean(x, na.rm=TRUE)
  #Apply to the list
lapply(ll, meanRemoveNA)
## $a
## [1] 0.9727347
## 
## $b
## [1] 3.560082
## 
## $c
## [1] 5.56281
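The list ll above has no missing values, so meanRemoveNA gives the same answers as mean. Here is a small sketch with a made-up list, ll_na (hypothetical data, not from the examples above), where the NA handling actually matters. It also shows that extra arguments given after the function are passed on to it, so a wrapper function is optional:

```r
# Hypothetical list with a missing value in one element
ll_na <- list(a = c(1, 2, NA), b = c(4, 5, 6))

# Without na.rm, a single NA makes the whole mean NA
with_na <- lapply(ll_na, mean)
with_na

# Arguments after the function name are handed on to mean()
means <- lapply(ll_na, mean, na.rm = TRUE)
means
```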

eapply()

Description: “eapply applies a function to the named values from an environment and returns the results as a list.” This uses environments in R. An environment is a self-contained object with its own variables and functions. Let’s use a simple example to define an environment and run eapply over it to find the mean of the variables:

#Create new environment
e <- new.env()
e
## <environment: 0x7ff71483db20>
#Create the environment variables
e$a <- 1:10
e$b <- 11:20
#Find the mean of the variables
eapply(e, mean)
## $a
## [1] 5.5
## 
## $b
## [1] 15.5

Environments are often used by R packages, such as those from Bioconductor.
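One thing to keep in mind with eapply: an environment is not ordered the way a list is, so the results may not come back in the order the variables were created. A minimal sketch, assuming you want a predictable order, is to sort the result by name (the variable c and the name res are additions for illustration):

```r
# Environment with three variables
e <- new.env()
e$a <- 1:10
e$b <- 11:20
e$c <- 21:30

# Apply mean to every variable in the environment
res <- eapply(e, mean)

# Sort the results by name so the order is predictable
res <- res[order(names(res))]
names(res)
```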

sapply()

Description: “sapply is a user friendly version of lapply by default returning a vector or matrix if appropriate.” This means that where lapply() returns a list with elements a and b, sapply() tries to simplify the result to a vector or a matrix, and falls back to a list when it cannot.

Let’s use the simple list example to use sapply:

#Create list with 2 elements
l <- list(a=1:10, b=11:20)
l
## $a
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $b
##  [1] 11 12 13 14 15 16 17 18 19 20
#Find mean of values
l.mean <- sapply(l, mean)
#What type of object is returned?
class(l.mean)
## [1] "numeric"
#It is a named numeric vector, so we can get element a like this:
l.mean[['a']]
## [1] 5.5
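What sapply returns depends on what the function returns for each element. The sketch below uses the same list l plus one small made-up list to show the three cases: a vector, a matrix, and a plain list when the results cannot be simplified (the names v, r and u are only for illustration):

```r
l <- list(a = 1:10, b = 11:20)

# One value per element -> named numeric vector
v <- sapply(l, mean)
v

# Two values per element -> matrix with one column per list element
r <- sapply(l, range)
r

# Results of different lengths cannot be simplified -> list
u <- sapply(list(a = 1:3, b = 1:5), function(x) x[x > 2])
class(u)
```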

vapply()

Description: “vapply is similar to sapply, but it has a pre-specified type of return value.” The third argument supplied to vapply is a template for the output: it sets the type, length, and names each result must have. The documentation for vapply uses the fivenum function as an example:

l <- list(a=1:10, b=11:20)
l
## $a
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $b
##  [1] 11 12 13 14 15 16 17 18 19 20
#Fivenum of values using vapply
l.fivenum <- vapply(l, fivenum, c(Min.=0, "1st Qu."=0, Median=0, "3rd Qu."=0, Max.=0))
l.fivenum
##            a    b
## Min.     1.0 11.0
## 1st Qu.  3.0 13.0
## Median   5.5 15.5
## 3rd Qu.  8.0 18.0
## Max.    10.0 20.0

Here vapply returns a matrix whose column names come from the original list elements and whose row names come from the output template.
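The template is also a safety check: if a result does not match it, vapply stops with an error instead of quietly returning a different shape. A minimal sketch, using tryCatch only so the example keeps running (the message string is made up for illustration):

```r
l <- list(a = 1:10, b = 11:20)

# Template says "one number per element", which matches mean()
ok <- vapply(l, mean, numeric(1))
ok

# fivenum() returns five numbers, so the same template is rejected
bad <- tryCatch(
  vapply(l, fivenum, numeric(1)),
  error = function(err) "vapply stopped with an error"
)
bad
```

This is why vapply is often preferred over sapply inside functions and scripts, where an unexpected return shape should fail early.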

mapply()

Description: “mapply is a multivariate version of sapply. mapply applies the function to the first elements of each argument.” It then does the same with the second elements, the third elements, and so on.

Here is a simple example:

l1 <- list(a=c(1:10), b=c(11:20))
l2 <- list(c=c(21:30), d=c(31:40))
l1
## $a
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $b
##  [1] 11 12 13 14 15 16 17 18 19 20
l2
## $c
##  [1] 21 22 23 24 25 26 27 28 29 30
## 
## $d
##  [1] 31 32 33 34 35 36 37 38 39 40
#Sum the corresponding elements of l1 and l2
mapply(sum, l1$a, l1$b, l2$c, l2$d)
##  [1]  64  68  72  76  80  84  88  92  96 100
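To make the “first elements, then second elements” idea concrete, here is a small sketch with shorter inputs (pair_sums, reps and totals are illustrative names). It also checks that, for the list example above, mapply with sum gives the same totals as ordinary vectorised addition:

```r
# Pairs up 1 with 4, 2 with 5, 3 with 6
pair_sums <- mapply(function(x, y) x + y, 1:3, 4:6)
pair_sums

# Calls rep(1, 3), rep(2, 2), rep(3, 1); the results differ in length,
# so mapply returns a list instead of a vector
reps <- mapply(rep, 1:3, 3:1)
reps

# Same lists as above
l1 <- list(a = c(1:10), b = c(11:20))
l2 <- list(c = c(21:30), d = c(31:40))
totals <- mapply(sum, l1$a, l1$b, l2$c, l2$d)
all.equal(totals, l1$a + l1$b + l2$c + l2$d)
```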

Here is an example with the genotypes of two individuals. I want to see how many alleles they share at each locus, which I do by calling intersect:

ind_1 <- list(loci_1=c("T", "T"), loci_2=c("T", "G"), loci_3=c("C", "G"))
ind_1
## $loci_1
## [1] "T" "T"
## 
## $loci_2
## [1] "T" "G"
## 
## $loci_3
## [1] "C" "G"
ind_2 <- list(loci_1=c("A", "A"), loci_2=c("G", "G"), loci_3=c("C", "G"))
ind_2
## $loci_1
## [1] "A" "A"
## 
## $loci_2
## [1] "G" "G"
## 
## $loci_3
## [1] "C" "G"
mapply(function(a,b) length(intersect(a,b)), ind_1, ind_2)
## loci_1 loci_2 loci_3 
##      0      1      2

tapply()

tapply is a nice apply function to know before we start thinking about the dplyr package. Description: “Apply a function to each cell of a ragged array, that is to each group of values given by a unique combination of the levels of certain factors.” The usage is “tapply(X, INDEX, FUN, ..., simplify = TRUE),” where X is an object, usually a vector, and INDEX is a factor or a list of factors.

Let’s use the iris data:

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
#Find the mean petal length by species 
tapply(iris$Petal.Length, iris$Species, mean)
##     setosa versicolor  virginica 
##      1.462      4.260      5.552
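One way to think about what tapply does here is “split the values by group, then apply the function to each group.” The sketch below spells that out with split and sapply and checks that the numbers agree (by_species, means_split and means_tapply are illustrative names):

```r
# Split petal lengths into one vector per species
by_species <- split(iris$Petal.Length, iris$Species)

# Apply mean to each group
means_split <- sapply(by_species, mean)
means_split

# tapply does both steps in one call
means_tapply <- tapply(iris$Petal.Length, iris$Species, mean)
all.equal(as.vector(means_tapply), as.vector(means_split))
```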

Glancing at dplyr

Next week we are discussing the dplyr package. tapply loosely resembles dplyr’s grouped summaries. Let’s look at some basic dplyr functions and what you can do with the iris data:

library(dplyr)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

We see from head that the 5 columns are sepal length, sepal width, petal length, petal width, and species. Let’s say for now we just want petal length, petal width, and species. We use select, and dplyr’s pipes to make our lives easier. You can either type the pipe symbol “%>%” by hand or, in RStudio on a Mac, use the shortcut Cmd + Shift + M. Let’s use dplyr pipes with the iris dataset to select certain columns:

iris %>% select(Petal.Width, Petal.Length, Species) %>% head(10)
##    Petal.Width Petal.Length Species
## 1          0.2          1.4  setosa
## 2          0.2          1.4  setosa
## 3          0.2          1.3  setosa
## 4          0.2          1.5  setosa
## 5          0.2          1.4  setosa
## 6          0.4          1.7  setosa
## 7          0.3          1.4  setosa
## 8          0.2          1.5  setosa
## 9          0.2          1.4  setosa
## 10         0.1          1.5  setosa

Now you have dropped the two sepal columns, which you may or may not have wanted, without changing the raw data. I also piped the result through head so you wouldn’t get the entire dataset. You can also arrange the rows by a certain column:

iris %>% arrange(Petal.Length) %>% head(10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           4.6         3.6          1.0         0.2  setosa
## 2           4.3         3.0          1.1         0.1  setosa
## 3           5.8         4.0          1.2         0.2  setosa
## 4           5.0         3.2          1.2         0.2  setosa
## 5           4.7         3.2          1.3         0.2  setosa
## 6           5.4         3.9          1.3         0.4  setosa
## 7           5.5         3.5          1.3         0.2  setosa
## 8           4.4         3.0          1.3         0.2  setosa
## 9           5.0         3.5          1.3         0.3  setosa
## 10          4.5         2.3          1.3         0.3  setosa

Here the rows are sorted in ascending order of petal length. So how is tapply similar to dplyr? With tapply, we calculated the mean petal length grouped by species. In dplyr, we do the same with group_by and summarize, using the mean function:

iris %>% group_by(Species) %>% summarize(mean_petal_length = mean(Petal.Length))
## # A tibble: 3 × 2
##      Species mean_petal_length
##       <fctr>             <dbl>
## 1     setosa             1.462
## 2 versicolor             4.260
## 3  virginica             5.552
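Where tapply applies one function per call, summarize can compute several summaries per group at once. Here is a minimal sketch, assuming dplyr is installed; the column names n, mean_petal_length and max_petal_length and the object petal_summary are chosen here for illustration:

```r
library(dplyr)

# Count, mean, and maximum petal length for each species
petal_summary <- iris %>%
  group_by(Species) %>%
  summarize(
    n = n(),
    mean_petal_length = mean(Petal.Length),
    max_petal_length = max(Petal.Length)
  )
petal_summary
```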