Highlights
Loop functions are some of the most powerful functions in the R language. They make it very easy to work in R, especially in an interactive setting. Writing for and/or while loops is useful when programming, but it is not convenient when working interactively on the command line. There are several functions that use this 'looping' behavior to ease coding.
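As a quick illustrative sketch (my own toy example with made-up values, not from the original text), here is the same computation written first as a for loop and then with a single loop-function call:
# A toy vector (illustrative values only)
x <- c(2, 4, 6, 8)
# The for-loop way: pre-allocate a result, loop, and fill it in
squares <- numeric(length(x))
for (i in seq_along(x)) {
  squares[i] <- x[i]^2
}
squares
# The loop-function way: one line, no bookkeeping
sapply(x, function(v) v^2)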
lapply
function (X, FUN, ...)
{
FUN <- match.fun(FUN)
if (!is.vector(X) || is.object(X))
X <- as.list(X)
.Internal(lapply(X, FUN))
}
<bytecode: 0x000000001453c3e0>
<environment: namespace:base>
It takes three arguments:
- X: a list (or another object that can be coerced to a list)
- FUN: a function (or the name of a function) to be applied to each element
- ...: other arguments passed on to FUN
Caveat: If X is not a list, it gets coerced to a list using the as.list function, if possible. If the coercion is not possible, we get an error.
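For instance (a minimal sketch with made-up values, not from the original post), a plain numeric vector works just as if it had been converted to a list first:
# These two calls give the same result: the vector is treated as a list
lapply(c(10, 20, 30), sqrt)
lapply(as.list(c(10, 20, 30)), sqrt)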
Example:
my_list <- list(number = 5:14, random_number = rnorm(10))
# checking the list I just created
my_list
$number
[1] 5 6 7 8 9 10 11 12 13 14
$random_number
[1] 1.46291209 1.86496805 -0.07995655 1.72696004 1.02501068 1.00613997
[7] 2.34910617 -0.19961932 0.99185697 1.32494760
The first element is a vector of the integers 5 through 14, while the other is a vector of 10 random normal numbers. I am going to calculate the mean of both variables using the lapply function, followed by their standard deviations.
# Applying lapply by calculating mean and standard deviations of the variables
lapply(my_list, mean)
$number
[1] 9.5
$random_number
[1] 1.147233
lapply(my_list, sd)
$number
[1] 3.02765
$random_number
[1] 0.8031849
We received the mean and the standard deviation of both variables in my_list. number and random_number are the names we assigned to the list elements, and R carries them over to the output.
Here's another way we can use the lapply function. I am going to generate uniform random numbers based on a given vector:
# Setting the limit for num1
list1 <- 1:6
# Generating a random uniform variables
lapply(list1, runif)
[[1]]
[1] 0.6354516
[[2]]
[1] 0.8664135 0.9029416
[[3]]
[1] 0.69042616 0.08773895 0.86549211
[[4]]
[1] 0.9306201 0.3898541 0.5268778 0.6004027
[[5]]
[1] 0.51490149 0.75754896 0.01643082 0.18210719 0.52388015
[[6]]
[1] 0.4966569 0.2804625 0.4994748 0.8330982 0.8156403 0.3553473
I assigned the values 1 through 6 to an object called list1. Then I passed it to the lapply function and requested random uniform numbers. As expected, R returned a list of 6 elements that included one value in the first element, all the way up to 6 values in the sixth.
I want to repeat the same call, but this time restrict the range of the uniform numbers to between 2 and 5.
lapply(list1, runif, min = 2, max = 5) # generate random uniform numbers having elements between 2 and 5
[[1]]
[1] 4.077277
[[2]]
[1] 4.534571 3.571611
[[3]]
[1] 2.524963 2.152590 2.031555
[[4]]
[1] 4.367135 3.887520 3.422944 4.571860
[[5]]
[1] 4.367742 4.646184 2.300846 2.581464 2.809755
[[6]]
[1] 4.042313 4.920532 3.703123 2.665831 4.260906 2.837666
The lapply and similar functions often make use of anonymous functions. Anonymous functions are functions not bound to any identifier, i.e., they don't have names. They are created and used on the spot but never assigned to a variable.
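As a quick illustrative sketch (made-up values, not from the original post), an anonymous function is written inline right where it is needed:
# Squaring each element with a throwaway function that never gets a name
lapply(1:3, function(x) x^2)
# Since R 4.1, the backslash shorthand \(x) is equivalent
lapply(1:3, \(x) x^2)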
# Creating a composite list that holds two different matrices
composite_matrix <- list(matrix_1 = matrix(3:6, 2, 2), matrix_2 = matrix(4:9, 3, 2))
composite_matrix
$matrix_1
[,1] [,2]
[1,] 3 5
[2,] 4 6
$matrix_2
[,1] [,2]
[1,] 4 7
[2,] 5 8
[3,] 6 9
# Extracting the second column of both matrices
lapply(composite_matrix, function(c1m) c1m[, 2]) # anonymous function; its argument c1m takes each matrix in turn and we return its second column
$matrix_1
[1] 5 6
$matrix_2
[1] 7 8 9
# Extracting the second row of both matrices
lapply(composite_matrix, function(r1w) r1w[2, ]) # here the argument is r1w, and the anonymous function returns the second row of each matrix
$matrix_1
[1] 4 6
$matrix_2
[1] 5 8
lapply always returns a list, which is one of its less exciting aspects, so sometimes it is not as convenient. It is always good to have a nice clean output. If we want a cleaner output with all the goodies of lapply, we use the sapply function, which tries to simplify the result of lapply as much as possible.
First of all, I am creating a list and using the lapply function to calculate the means. Afterward, I will use sapply and compare the differences.
test_1 <- list(a = 11:14, b = rnorm(8), c = rnorm(11, 1), d = rnorm(80, 7))
lapply(test_1, mean)
$a
[1] 12.5
$b
[1] -0.0321797
$c
[1] 0.7549777
$d
[1] 7.029287
# just for fun!
library(lattice)
densityplot(test_1$d)
We received a list containing the mean of each of the 4 elements. Now I am going to use sapply and compare the differences:
sapply(test_1, mean)
a b c d
12.5000000 -0.0321797 0.7549777 7.0292870
As we can see, the results now appear as a clean named vector, which is much easier to read.
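One caveat worth noting (my own illustrative sketch, reusing the test_1 list from above): when the individual results have different lengths, sapply cannot simplify them and quietly falls back to returning a list, just like lapply:
# Keeping only the positive values: the elements typically yield different
# numbers of values, so no vector or matrix is possible and a list comes back
sapply(test_1, function(x) x[x > 0])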
The apply function is used to evaluate a function (often an anonymous function) over the margins of an array.
Let’s check the arguments required to execute the apply function:
str(apply)
function (X, MARGIN, FUN, ..., simplify = TRUE)
Let's create a matrix with 15 rows and 12 columns and pass it to the apply function.
a <- matrix(rnorm(180), 15, 12)
# Let's calculate the mean of all the columns
apply(a, 2, mean)
[1] 0.07059646 0.07154827 0.09868546 0.73583431 -0.50007538 0.12287220
[7] 0.12612793 -0.41168822 -0.12294147 0.17510980 -0.03373771 0.29036536
In the above call, a is the name of the matrix, the 2 (the MARGIN argument) tells R to work column by column, and mean is the function to apply. All in all, we got 12 values, and they are the means of the 12 columns of matrix a.
Now, let’s add the values in all 15 rows:
apply(a, 1, sum)
[1] 0.08773681 0.38545560 -3.26189971 7.58999169 -0.70135001 4.82322597
[7] 2.61548234 -1.79671431 -0.05051695 8.42282712 -5.39321901 -0.54869049
[13] -4.45708450 3.98775024 -2.36253962
Here we go. We have 15 values, and they are the sums of the 15 rows of matrix a.
There are optimized functions in R that simplify the tasks of calculating row or column sums and means: rowSums, colSums, rowMeans, and colMeans. These functions are much faster and easier to run. Let's check one of them in action:
rowSums(a)
[1] 0.08773681 0.38545560 -3.26189971 7.58999169 -0.70135001 4.82322597
[7] 2.61548234 -1.79671431 -0.05051695 8.42282712 -5.39321901 -0.54869049
[13] -4.45708450 3.98775024 -2.36253962
If we compare the values, we get exactly the same sums. However, the usefulness of the apply function doesn't end there. We can calculate quantiles of the matrix a and even pass additional arguments, like calculating the 25th and 75th percentile values by column or by row. For example:
# Calculating 25th and 75th percentile values by the column
apply(a, 2, quantile, probs = c(0.25, 0.75))
[,1] [,2] [,3] [,4] [,5] [,6]
25% -0.4717277 -0.3182557 -0.7787222 -0.03679574 -1.02652186 -0.7004837
75% 0.7471132 0.3200644 0.6927512 1.08908391 0.09802976 0.6404537
[,7] [,8] [,9] [,10] [,11] [,12]
25% -0.6907739 -1.0129006 -0.619241 -0.6210604 -0.3967204 -0.1000465
75% 0.9615792 0.1518437 0.627524 0.7288684 0.6720800 0.9127678
Here are our values. Now, let's calculate the 10th and 90th percentile values in each row of a.
apply(a, 1, quantile, probs = c(0.10, 0.90))
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
10% -1.182349 -0.8328109 -1.0200503 -0.2481957 -1.1479979 -0.7645003 -0.619589
90% 1.156929 0.7873955 0.4297385 1.3227091 0.9457887 2.1934801 1.029480
[,8] [,9] [,10] [,11] [,12] [,13] [,14]
10% -1.044502 -1.0891163 -0.267020 -1.4440264 -1.148876 -1.9278430 -0.8181829
90% 1.145441 0.6811349 1.853211 0.2923267 1.148196 0.9705034 1.2997735
[,15]
10% -1.275072
90% 1.170902
Looks like we got the values we were looking for. The function went through each row of the matrix and calculated the 10th and 90th percentile values for us. The results were provided in a matrix with 2 rows and 15 columns.
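If a 15-by-2 layout is preferred instead (a small sketch, not from the original post), the result can simply be transposed:
# Same quantiles as above, but now with one row per row of a
t(apply(a, 1, quantile, probs = c(0.10, 0.90)))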
We can use this function not just with a matrix but also with an array. For example:
# rnorm(2*2*10) generates the 40 random normal values that fill the array
# c(2, 2, 10) gives the dimensions: 2 rows, 2 columns, and 10 slices (i.e., ten 2x2 matrices)
b <- array(rnorm(2 * 2 * 10), c(2, 2, 10))
# Let's check how the 1st and 2nd matrices look
b[, , 1:2] # Returns the first and second matrices
, , 1
[,1] [,2]
[1,] 0.8439809 0.03574801
[2,] 1.0188777 0.70858362
, , 2
[,1] [,2]
[1,] -0.6391815 -0.5822599
[2,] 1.2055147 1.7499308
# b[,,7]#7th matrix
# If we want a range of matrices we can use the syntax like this
# b[,,2:8]#print all matrices from 2 through 8
# Calculating the means in array b
apply(b, c(1, 2), mean)
[,1] [,2]
[1,] 0.277740 0.2926438
[2,] 0.421089 0.1764464
# Or
rowMeans(b, dims = 2)
[,1] [,2]
[1,] 0.277740 0.2926438
[2,] 0.421089 0.1764464
We got the exact same outcomes.
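As a further illustrative sketch (not from the original post), applying the function over the third margin instead collapses each 2 x 2 slice to a single number, giving the mean of each of the 10 matrices:
# One mean per slice of the array b
apply(b, 3, mean)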
mapply is a multivariate version of lapply, which applies a function in parallel over a set of arguments. Here's the structure of this function, followed by its arguments:
str(mapply)
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
Where,
- FUN: the function to apply
- ...: the arguments to vectorize over (each one supplies values in parallel)
- MoreArgs: a list of other arguments to pass to FUN
- SIMPLIFY: whether to simplify the result, as in sapply
- USE.NAMES: whether to use the names (or character values) of the first argument for the result
Let's take an example in which I want to create a list containing the values 1 through 6, where 1 is repeated 6 times, 2 is repeated 5 times, and so on. I can use either the list function or mapply.
# Using list function
list(rep(1, 6), rep(2, 5), rep(3, 4), rep(4, 3), rep(5, 2), rep(6, 1))
[[1]]
[1] 1 1 1 1 1 1
[[2]]
[1] 2 2 2 2 2
[[3]]
[1] 3 3 3 3
[[4]]
[1] 4 4 4
[[5]]
[1] 5 5
[[6]]
[1] 6
# Using mapply function
mapply(rep, 1:6, 6:1) # rep is the 'repeat' function
[[1]]
[1] 1 1 1 1 1 1
[[2]]
[1] 2 2 2 2 2
[[3]]
[1] 3 3 3 3
[[4]]
[1] 4 4 4
[[5]]
[1] 5 5
[[6]]
[1] 6
We got exactly the same lists, but the mapply code is much shorter and easier to comprehend. Let's take one more example: I want to generate sets of data points of increasing size and increasing mean, while keeping the standard deviation fixed. For this, I can write a function, but on its own it doesn't work the way I want. Let me try it a couple of different ways:
# 1. Creating a Function
function_increasing_mean <- function(n, mean, sd) { # create a function with 3 arguments
rnorm(n, mean, sd) # use random n numbers to create a list having the mean of 'mean' and the standard deviation of 'sd'
}
# Let's pass the function
function_increasing_mean(4, 1, 2)
[1] 2.961762 -1.744311 4.030102 -0.230550
I got a vector of 4 numbers drawn from a distribution with a mean of 1 and an sd of 2, but this is not what I was looking for. Now, let's make one more attempt:
# Range Method
function_increasing_mean(1:4, 1:4, 2)
[1] 1.906233 1.443013 1.950079 1.954253
Nope. I still got a single vector of four numbers; rnorm simply took the length of 1:4 as n, rather than producing four separate sets. Now, let's try mapply:
# mapply Method
mapply(function_increasing_mean, 1:4, 1:4, 2)
[[1]]
[1] 1.779424
[[2]]
[1] -0.9773194 3.3053830
[[3]]
[1] 5.281450 1.873322 1.442196
[[4]]
[1] 0.98201672 3.32058478 0.08235915 6.28287471
Here we go. I have four sets of numbers. The first one is drawn from a distribution with a mean of 1 and the fourth from one with a mean of 4, while the sd remained the same. It's the result I was looking for.
I could get the same thing using the list function, but I would have to type all of the following code manually:
list(function_increasing_mean(1, 1, 2), function_increasing_mean(2, 2, 2), function_increasing_mean(3, 3, 2), function_increasing_mean(4, 4, 2))
[[1]]
[1] 1.671077
[[2]]
[1] 2.669772 2.078709
[[3]]
[1] -0.4959015 3.8091327 2.1638217
[[4]]
[1] 4.257558 1.664209 2.602976 7.170069
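Another option (a hedged sketch, not something used in the original post) is Vectorize(), which is itself built on top of mapply and gives our function vector-friendly behavior:
# Vectorize() wraps mapply() around function_increasing_mean for us
vec_increasing_mean <- Vectorize(function_increasing_mean)
vec_increasing_mean(1:4, 1:4, 2) # mirrors the mapply call above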
tapply is a very useful function for applying a function over subsets of a vector. Here are the arguments that it takes:
str(tapply)
function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
Where,
- X: a vector
- INDEX: a factor, or a list of factors, identifying the subsets
- FUN: the function to apply to each subset
- default: the value used for empty subsets when the result is simplified
- simplify: whether to simplify the result
# Creating a vector e that has 60 elements (20 standard normals, 20 uniforms, and 20 normals with mean 1)
e <- c(rnorm(20), runif(20), rnorm(20, 1))
# Creating a factor variable f (which will have 3 levels) using the gl function; each level is repeated 20 times
f <- gl(3, 20)
# Now, tapply on e, pass the factor variable and calculate the means of the three groups
tapply(e, f, mean)
1 2 3
0.1964159 0.4536534 1.1806358
The means of all three categories are returned as a nice, clean named vector. This is the default behavior. If we set simplify = FALSE, we get a slightly messier outcome (a list of single values). For example:
tapply(e, f, mean, simplify = FALSE)
$`1`
[1] 0.1964159
$`2`
[1] 0.4536534
$`3`
[1] 1.180636
This is how the simplify argument works. We can also use tapply with slightly more complex functions:
tapply(e, f, range)
$`1`
[1] -1.427012 1.736941
$`2`
[1] 0.01100638 0.86965413
$`3`
[1] -1.095685 3.091873
# tapply(e, f, median)#Gives Median
# tapply(e, f, mode) # Note: base R's mode() returns the storage mode (e.g., 'numeric'), not the statistical mode
# tapply(e, f, sd)#Gives Standard Deviation
We got the smallest and the largest values in each of the three categories.
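As noted above, base R's mode() does not compute a statistical mode; if one is wanted, a tiny hypothetical helper (my own sketch, mainly meaningful for discrete data) could be passed to tapply instead:
# Hypothetical helper: the most frequent value (ties broken by first appearance)
stat_mode <- function(v) {
  ux <- unique(v)
  ux[which.max(tabulate(match(v, ux)))]
}
# With continuous data like e every value is unique, so this is only a toy call
tapply(e, f, stat_mode)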
The split function takes a vector or another object and splits it into groups determined by a factor or a list of factors. It is not a loop function, but it is very handy to use in conjunction with the apply functions. Here are its required arguments:
str(split)
function (x, f, drop = FALSE, ...)
Where,
- x: a vector, list, or data frame to be split
- f: a factor (or a list of factors) defining the grouping
- drop: whether empty factor levels should be dropped
Let’s check a simple example by using the same e and f objects from above:
split(e, f)
$`1`
[1] -0.83118688 -0.11076104 0.61996316 -1.42701199 1.72704684 -1.31064232
[7] -0.05705985 -0.27023149 0.85967397 -0.97481408 -0.89231341 1.29710609
[13] 0.46356776 0.68760333 1.73694056 0.27688548 0.27611748 1.11757794
[19] 1.66987469 -0.93001881
$`2`
[1] 0.69209913 0.65732471 0.60420401 0.81236675 0.54034453 0.86965413
[7] 0.17872125 0.04194962 0.38059585 0.86322323 0.80169873 0.06803396
[13] 0.62265928 0.32890919 0.45383706 0.22381724 0.41185781 0.01100638
[19] 0.23646694 0.27429845
$`3`
[1] 2.1220623 0.4249407 0.7141020 1.2785241 -0.4347269 1.4829627
[7] 3.0918729 0.5845254 -1.0956846 0.7872617 1.6629817 1.1192421
[13] 2.1832100 1.5079100 2.8070939 0.0341233 1.8852808 0.1398690
[19] 0.7719396 2.5452259
As we can see, the split function took the factor we created in object f and applied it to our data set e. We have an equal number of elements in all three categories.
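As a side note (a small sketch, not from the original post), the operation can be reversed with unsplit, which stitches the pieces back together in their original order:
# Should recover the original vector e
identical(unsplit(split(e, f), f), e)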
Once the data is split, we can use other functions like lapply or sapply on the pieces. Here's the most common way of using the split function:
lapply(split(e, f), mean)
$`1`
[1] 0.1964159
$`2`
[1] 0.4536534
$`3`
[1] 1.180636
sapply(split(e, f), sd)
1 2 3
1.0274392 0.2802472 1.0917719
We got the mean and the standard deviation of all three categories. We also know why the second output looks a little cleaner than the first one: it's because we used sapply. For this simple case, though, the split step isn't strictly necessary, because as we saw in the previous example, tapply does the same thing.
However, the nice thing about the split function is that we can use it on much more complex objects. For example, let's use it on the airquality data set.
head(airquality) # checking the first few rows of the data set
I want to take the mean of Ozone, Solar.R, Wind, and Temp by month. How do I do this? I will first split the data frame by month and then calculate the means.
split_airquality <- split(airquality, airquality$Month)
lapply(split_airquality, function(f) colMeans(f[, c("Ozone", "Solar.R", "Wind", "Temp")])) # function(f) is an anonymous function; f stands for each month's sub-data frame
$`5`
Ozone Solar.R Wind Temp
NA NA 11.62258 65.54839
$`6`
Ozone Solar.R Wind Temp
NA 190.16667 10.26667 79.10000
$`7`
Ozone Solar.R Wind Temp
NA 216.483871 8.941935 83.903226
$`8`
Ozone Solar.R Wind Temp
NA NA 8.793548 83.967742
$`9`
Ozone Solar.R Wind Temp
NA 167.4333 10.1800 76.9000
The results give the means of Ozone, Solar.R, Wind, and Temp for months 5 (i.e., May) through 9 (i.e., September). The means of Ozone and Solar.R are not calculated for some months because there are missing data. We can see the temperature rising through August and dropping in September. The results are fine, but let's see if we get a better-looking outcome when we apply the sapply function.
sapply(split_airquality, function(f) colMeans(f[, c("Ozone", "Solar.R", "Wind", "Temp")]))
5 6 7 8 9
Ozone NA NA NA NA NA
Solar.R NA 190.16667 216.483871 NA 167.4333
Wind 11.62258 10.26667 8.941935 8.793548 10.1800
Temp 65.54839 79.10000 83.903226 83.967742 76.9000
The results are neatly arranged in a tabular format. They are easy to read and understand, and it is much easier to compare the findings among the variables.
However, when there are missing values, R doesn't give us the mean for that variable. We can remove the missing values and calculate the means for all of the variables by passing na.rm = TRUE through to colMeans. Here's how we do it:
sapply(split_airquality, function(f) colMeans(f[, c("Ozone", "Solar.R", "Wind", "Temp")], na.rm = TRUE)) # gets rid of NAs
5 6 7 8 9
Ozone 23.61538 29.44444 59.115385 59.961538 31.44828
Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
Wind 11.62258 10.26667 8.941935 8.793548 10.18000
Temp 65.54839 79.10000 83.903226 83.967742 76.90000
In the examples above, we split on a single factor. Sometimes we have multiple factors, and we can still use the split function to compute the values of interest. Let's take an example:
# Creating a data set with 30 random numbers
g <- rnorm(30)
# First categorical variable
grade <- gl(3, 10) # 3 different grades of 10 each
# Second categorical variable
ethnicity <- gl(6, 5) # 6-ethnic groups of 5 each
# Calculating Interaction between grade and ethnicity
interaction(grade, ethnicity)
[1] 1.1 1.1 1.1 1.1 1.1 1.2 1.2 1.2 1.2 1.2 2.3 2.3 2.3 2.3 2.3 2.4 2.4 2.4 2.4
[20] 2.4 3.5 3.5 3.5 3.5 3.5 3.6 3.6 3.6 3.6 3.6
18 Levels: 1.1 2.1 3.1 1.2 2.2 3.2 1.3 2.3 3.3 1.4 2.4 3.4 1.5 2.5 3.5 ... 3.6
There are a total of 18 levels after the interaction. Now, I want to split the vector g according to the categorical variables grade and ethnicity.
# splitting the variable g by the combination of grade and ethnicity, then checking the structure of all 18 distinct categories
str(split(g, list(grade, ethnicity)))
List of 18
$ 1.1: num [1:5] -0.935 -0.758 -0.691 0.204 1.336
$ 2.1: num(0)
$ 3.1: num(0)
$ 1.2: num [1:5] 1.195 -0.121 -0.347 -1.323 -0.472
$ 2.2: num(0)
$ 3.2: num(0)
$ 1.3: num(0)
$ 2.3: num [1:5] 0.8432 -0.0171 0.6319 0.5351 0.8473
$ 3.3: num(0)
$ 1.4: num(0)
$ 2.4: num [1:5] 0.54 -0.444 -0.735 0.937 0.657
$ 3.4: num(0)
$ 1.5: num(0)
$ 2.5: num(0)
$ 3.5: num [1:5] -0.509 -0.42 0.123 -2.009 0.343
$ 1.6: num(0)
$ 2.6: num(0)
$ 3.6: num [1:5] 3.471 1.294 0.595 1.782 1.268
When I use the split function, I don't have to call interaction myself; passing a list of factors does it for us automatically. And obviously, not all interaction levels have values. If we want to get rid of the empty levels, we have the drop option, which returns only the combinations that actually contain values.
str(split(g, list(grade, ethnicity), drop = TRUE))
List of 6
$ 1.1: num [1:5] -0.935 -0.758 -0.691 0.204 1.336
$ 1.2: num [1:5] 1.195 -0.121 -0.347 -1.323 -0.472
$ 2.3: num [1:5] 0.8432 -0.0171 0.6319 0.5351 0.8473
$ 2.4: num [1:5] 0.54 -0.444 -0.735 0.937 0.657
$ 3.5: num [1:5] -0.509 -0.42 0.123 -2.009 0.343
$ 3.6: num [1:5] 3.471 1.294 0.595 1.782 1.268
We got only the six interactions that have values in them.
Let’s wrap up with a fun figure (using lattice package and airquality data):
These functions/packages may be of interest to you; the examples below are just a small taste.
library(lattice)
densityplot(~Temp, groups = Month, data = airquality, plot.points = FALSE, auto.key = TRUE)
library(report)
report_sample(airquality, group_by = "Month")
Final Words: These are basic but essential aspects of using R. Without a knowledge of these functions, we can easily feel lost, so it is well worth learning them!