How to (and how not to) loop in R

by Mark R Payne
DTU-Aqua, Charlottenlund, Denmark
http://www.staff.dtu.dk/mpay

Wed Aug 13 08:39:53 2014

This tutorial provides a brief introduction to using loops in R. It gives you an idea of how they work, and demonstrates that they are in fact really slow and should only be used as a last resort. Examples using lapply() and tapply() demonstrate that these are much better alternatives.

For loops

A loop is one of the most fundamental elements in computer programming. It is also one of the elements that separates all of the various languages from each other. While it supposedly simple, mastering its proper use is a key art that you need to learn - I once saw it described as the pirouette of computer programming.

To start with, here’s a basic example of how it works in R:

for(i in 1:10) {   #The counter declaration
  #then do something for each iteration
  print(i)
  #....
} #And finish the enviroment.
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

The structure is simple. Firstly we indicate that we are going into a loop with the for(). The argument to the for() consists of a variable, in this case i that gets assigned a value, and a list of values, in this case the vector 1:10. For each iteration in the for loop, a value is taken from the list and assigned to the variable, and the following code (in the braces, {}) is evaluated. A key point is that the counter in the for loop can be either a list or a vector - it doesn’t matter. The key thing is the the for command loops over each element and assigns it to the indexing variable (i in the above example) e.g

for(lab.member in c("Laurene","Tim","Sofia","Jonatan","Philipp","Katharina")) {
  cat(sprintf("Hello %s...\n",lab.member))
}
## Hello Laurene...
## Hello Tim...
## Hello Sofia...
## Hello Jonatan...
## Hello Philipp...
## Hello Katharina...

The argument can also be a list, as we will see in the next example.

Note that for loops don’t return any values. If you want to get something back, you have to take care of it manually e.g.

x <- 1:10
y <- rep(NA,length(x))
for(i in 1:length(x)) {
  y[i] <- x[i] ^2
}
print(y)
##  [1]   1   4   9  16  25  36  49  64  81 100

Working with lists

One of the more handy features of for loops in R is their ability to work with lists in the same way as they work with vectors. To illustrate this point, lets work with the classic Fisher iris data set

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

This data set is commonly used as an example in R and statistics generally. It gives the traits (!!!) of three iris species (Iris setosa, I. versicolor, I. virginica) that were measured experimentally. This is very typical of the type of data that we might analyse here in the lab, so lets have a bit closer look at it.

Firstly, we can split the dataframe into a list corresponding to each of the species using split()

sp.l <- split(iris,iris$Species)
str(sp.l)
## List of 3
##  $ setosa    :'data.frame':  50 obs. of  5 variables:
##   ..$ Sepal.Length: num [1:50] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##   ..$ Sepal.Width : num [1:50] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##   ..$ Petal.Length: num [1:50] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##   ..$ Petal.Width : num [1:50] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ versicolor:'data.frame':  50 obs. of  5 variables:
##   ..$ Sepal.Length: num [1:50] 7 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 ...
##   ..$ Sepal.Width : num [1:50] 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 ...
##   ..$ Petal.Length: num [1:50] 4.7 4.5 4.9 4 4.6 4.5 4.7 3.3 4.6 3.9 ...
##   ..$ Petal.Width : num [1:50] 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1 1.3 1.4 ...
##   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ virginica :'data.frame':  50 obs. of  5 variables:
##   ..$ Sepal.Length: num [1:50] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 ...
##   ..$ Sepal.Width : num [1:50] 3.3 2.7 3 2.9 3 3 2.5 2.9 2.5 3.6 ...
##   ..$ Petal.Length: num [1:50] 6 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 ...
##   ..$ Petal.Width : num [1:50] 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8 1.8 2.5 ...
##   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 3 3 3 3 3 3 3 3 3 3 ...

We can use now, for example, a for loop to look at the first few lines of each item in the list

for(sp in sp.l) {
  print(head(sp))
}
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 51          7.0         3.2          4.7         1.4 versicolor
## 52          6.4         3.2          4.5         1.5 versicolor
## 53          6.9         3.1          4.9         1.5 versicolor
## 54          5.5         2.3          4.0         1.3 versicolor
## 55          6.5         2.8          4.6         1.5 versicolor
## 56          5.7         2.8          4.5         1.3 versicolor
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 101          6.3         3.3          6.0         2.5 virginica
## 102          5.8         2.7          5.1         1.9 virginica
## 103          7.1         3.0          5.9         2.1 virginica
## 104          6.3         2.9          5.6         1.8 virginica
## 105          6.5         3.0          5.8         2.2 virginica
## 106          7.6         3.0          6.6         2.1 virginica

Looking good. Now lets say that we wanted to calculate some statistics on the traits of each species. We could do this as follows

res <- list()
for(n in names(sp.l)){   
   dat <- sp.l[[n]]   #Extract the data from the list
   res[[n]] <- data.frame(species=n,
                          mean.sepal.length=mean(dat$Sepal.Length),
                          sd.sepal.length=sd(dat$Sepal.Length),
                          n.samples=nrow(dat))
}
print(res)
## $setosa
##   species mean.sepal.length sd.sepal.length n.samples
## 1  setosa             5.006       0.3524897        50
## 
## $versicolor
##      species mean.sepal.length sd.sepal.length n.samples
## 1 versicolor             5.936       0.5161711        50
## 
## $virginica
##     species mean.sepal.length sd.sepal.length n.samples
## 1 virginica             6.588       0.6358796        50

This returns values as a list. However, we often want a data.frame to work with. We can build one from the list as so

res.df <- do.call(rbind,res)
print(res.df)
##               species mean.sepal.length sd.sepal.length n.samples
## setosa         setosa             5.006       0.3524897        50
## versicolor versicolor             5.936       0.5161711        50
## virginica   virginica             6.588       0.6358796        50

Notice that in this example we have worked with the names of the list, rather than directly with the items. This can be a nice trick at times. However, there are much quicker and easier ways to do this though - see, for example, the section on lapply below.

Vectorisation vs for loops

Consider the problem of trying to calculate the sin of a long vector, x. We could do this using a for loop as follows:

x <- rnorm(1e6)
y <- rep(NA,length(x))

The run time for this particular example is

##    user  system elapsed 
##     1.1     0.0     1.1

I would hope that none of you would do it that way though - the natural way to it is just using

y <- sin(x)

The run time for this operation is

##    user  system elapsed 
##   0.026   0.000   0.026

which is an o speed increase of 50x. Why the difference? All of the calculations in a for loop are intrepreted code. However, if you use the sin() operator directly on a vector, the first thing that R does is to shift to compiled code (eg. C++), which is much, much faster. As a rule, you should do everything vectorised that you possibly can, and only use for loops when you cannot solve the problem any other way. We will discuss some examples below of when this is appropriate.

lapply() and `sapply()

For loops work well, but are generally not the most efficient way to program. A lot of our tasks can be handled very nicely by lapply(), which is idealogically similar to a for loop, but which takes care of a lot of the bookkeeping.

Returning to the iris data again, we can do the sample analysis that we did with the for loop, much easier with an lapply

sp.l <- split(iris,iris$Species)
res.l <- sapply(sp.l,function(dat){
  r <- data.frame(mean.sepal.length=mean(dat$Sepal.Length),
                  sd.sepal.length=sd(dat$Sepal.Length),
                  n.samples=nrow(dat))
  return(r)
})
res <- do.call(rbind,res.l)
print(res)
##             [,1]
##  [1,]  5.0060000
##  [2,]  0.3524897
##  [3,] 50.0000000
##  [4,]  5.9360000
##  [5,]  0.5161711
##  [6,] 50.0000000
##  [7,]  6.5880000
##  [8,]  0.6358796
##  [9,] 50.0000000

There are a few things to note here. Firstly, lapply() takes two arguments: a list and a function. The function can be one that you have written yourself to do a custom processing, as in this example, or it can be one of the standard functions in R, e.g.

lapply(sp.l,nrow)
## $setosa
## [1] 50
## 
## $versicolor
## [1] 50
## 
## $virginica
## [1] 50
lapply(sp.l,head,n=2)  #Arguments can also be provided to the function
## $setosa
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 
## $versicolor
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 51          7.0         3.2          4.7         1.4 versicolor
## 52          6.4         3.2          4.5         1.5 versicolor
## 
## $virginica
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 101          6.3         3.3          6.0         2.5 virginica
## 102          5.8         2.7          5.1         1.9 virginica

Secondly, lapply() always returns a list. This can be handy sometimes, other times it be annoying. lapply() also has a sister function called sapply() that simplifies (which is why it is called sapply()) the list to a vector, where possible. e.g.

sapply(sp.l,nrow)
##     setosa versicolor  virginica 
##         50         50         50

sapply doesn’t always do the simplifcation how you expect / hope.

res <- sapply(sp.l,function(dat){
  r <- data.frame(mean.sepal.length=mean(dat$Sepal.Length),
                  sd.sepal.length=sd(dat$Sepal.Length),
                  n.samples=nrow(dat))
  return(r)
})
print(res)
##                   setosa    versicolor virginica
## mean.sepal.length 5.006     5.936      6.588    
## sd.sepal.length   0.3524897 0.5161711  0.6358796
## n.samples         50        50         50

The reason is that simplify returns a matrix, rather than a dataframe which is more common to work with. Check your results.

tapply()

While we’re in the apply familar, one of the most useful is tapply(), takes care of the subsetting for you as well. For example..

tapply(iris$Sepal.Length,iris$Species,mean)
##     setosa versicolor  virginica 
##      5.006      5.936      6.588

…does in one line what lapply takes many to do.

sp.l <- split(iris,iris$Species)
res <- sapply(sp.l, function(d){
  mean(d$Sepal.Length)
})
print(res)
##     setosa versicolor  virginica 
##      5.006      5.936      6.588

tapply has the disadvantage though that it only only work on a single vector at a time. Still, you can work around this

traits <- data.frame(sepal.len.mean=tapply(iris$Sepal.Length,iris$Species,mean),
                     sepal.len.sd=tapply(iris$Sepal.Length,iris$Species,sd),
                     n.obs=tapply(iris$Sepal.Length,iris$Species,length))
print(traits)
##            sepal.len.mean sepal.len.sd n.obs
## setosa              5.006    0.3524897    50
## versicolor          5.936    0.5161711    50
## virginica           6.588    0.6358796    50

replicate()

Finally, it is worth noting the replicate() function, which simply repeats an expression many times. This can be useful if you are bootstrapping data or trying to sample from distributions e.g.

reps <- replicate(1e5,max(rnorm(100)))
hist(reps,breaks=100)

plot of chunk unnamed-33

When should I use a for loop then?

Generally, you can use the two interchangably when dealing with small to moderate amounts of data. I prefer lapply and sapply because they are very efficient ways to write code. However, the choice is generally not important - use what you feel best with.

The place where it can be important is when you are dealing with larger datasets. For example, if you have a data.frame with 100 000 raws, then you probably should be using lapply at all times - also because data.frames are actually quite slow to extract data from (can you use a matrix instead?) Generally, if you are using for loops, or haven’t vectorised your code, and it is running “too slow”, this is a good place to start.

There are some cases where for loops are unavoidable. These are generally situations where the calculations are sequential ie. the results at one time step depend on the results at the previous step. So, an example of this might be a cumulative sum e.g.

x <- rnorm(100)
cum.x <- rep(NA,length(x))
cum.x[1] <- x[1]
for(i in 2:length(x)) {
  cum.x[i] <- cum.x[i-1]+x[i]
}

Fortunately there is the cumsum() function which does the same thing in compiled code. But what about solving an ODE? That can’t be done so easily. Take, for example, integating dy/dx = -ky

dx <-0.01
x <- seq(0,1,by=dx)
y <- x*NA  #Make a blank results vector
y[1] <- 1
for(i in 2:length(x)) {
  y[i] <- y[i-1] + -5*y[i-1]*dx
}
plot(x,y,type="l")

plot of chunk unnamed-37

which is the exponential decay that you would expect.

This work by Mark R Payne is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. For details, see http://creativecommons.org/licenses/by-nc-sa/3.0/deed.en_US Basically, this means that you are free to “share” and “remix” for non-commerical purposes as you see fit, so long as you “attribute” me for my contribution. Derivatives can be distributed under the same or similar license.
This work comes with ABSOLUTELY NO WARRANTY or support.
This work should also be considered as BEER-WARE. For details, see http://en.wikipedia.org/wiki/Beerware