Highlights
Loop functions are some of the most powerful functions in the R language. They make it very easy to work in R, especially in an interactive setting. Writing for and/or while loops is useful when programming, but it is not convenient when working interactively on the command line. There are several functions that use this 'looping' behavior to ease coding.
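As a quick illustrative sketch (my own toy example with made-up values, not from the original text), here is the same computation written first as a for loop and then with a single loop-function call:
# A toy vector (illustrative values only)
x <- c(2, 4, 6, 8)
# The for-loop way: pre-allocate a result, loop, and fill it in
squares <- numeric(length(x))
for (i in seq_along(x)) {
  squares[i] <- x[i]^2
}
squares
# The loop-function way: one line, no bookkeeping
sapply(x, function(v) v^2)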
lapply
function (X, FUN, ...)
{
FUN <- match.fun(FUN)
if (!is.vector(X) || is.object(X))
X <- as.list(X)
.Internal(lapply(X, FUN))
}
<bytecode: 0x000000001453c3e0>
<environment: namespace:base>
It takes three arguments:
- X: a list (or another object that can be coerced to a list)
- FUN: a function (or the name of a function) to be applied to each element
- ...: other arguments passed on to FUN
Caveat: If X is not a list, it gets coerced to a list using the as.list function, if possible. If the coercion is not possible, we get an error.
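For instance (a minimal sketch with made-up values, not from the original post), a plain numeric vector works just as if it had been converted to a list first:
# These two calls give the same result: the vector is treated as a list
lapply(c(10, 20, 30), sqrt)
lapply(as.list(c(10, 20, 30)), sqrt)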
Example:
my_list <- list(number = 5:14, random_number = rnorm(10))
# checking the list I just created
my_list
$number
[1] 5 6 7 8 9 10 11 12 13 14
$random_number
[1] 1.46291209 1.86496805 -0.07995655 1.72696004 1.02501068 1.00613997
[7] 2.34910617 -0.19961932 0.99185697 1.32494760
The first element is a vector of the integers 5 through 14, while the other is a vector of 10 random normal numbers. I am going to calculate the mean of both variables using the lapply function, followed by their standard deviations.
# Applying lapply by calculating mean and standard deviations of the variables
lapply(my_list, mean)
$number
[1] 9.5
$random_number
[1] 1.147233
lapply(my_list, sd)
$number
[1] 3.02765
$random_number
[1] 0.8031849
We received the mean and the standard deviation of both variables in my_list. number and random_number are the names we assigned to the list elements, and R carries them over to the output.
Here's another way we can use the lapply function. I am going to generate uniform random numbers based on a given vector:
# Setting the limit for num1
list1 <- 1:6
# Generating a random uniform variables
lapply(list1, runif)
[[1]]
[1] 0.6354516
[[2]]
[1] 0.8664135 0.9029416
[[3]]
[1] 0.69042616 0.08773895 0.86549211
[[4]]
[1] 0.9306201 0.3898541 0.5268778 0.6004027
[[5]]
[1] 0.51490149 0.75754896 0.01643082 0.18210719 0.52388015
[[6]]
[1] 0.4966569 0.2804625 0.4994748 0.8330982 0.8156403 0.3553473
I assigned the values 1 through 6 to an object called list1. Then I passed it to the lapply function and requested random uniform numbers. As expected, R returned a list of 6 elements that included one value in the first element, all the way up to 6 values in the sixth.
I want to repeat the same call, but this time restrict the range of the uniform numbers to between 2 and 5.
lapply(list1, runif, min = 2, max = 5) # generate random uniform numbers having elements between 2 and 5
[[1]]
[1] 4.077277
[[2]]
[1] 4.534571 3.571611
[[3]]
[1] 2.524963 2.152590 2.031555
[[4]]
[1] 4.367135 3.887520 3.422944 4.571860
[[5]]
[1] 4.367742 4.646184 2.300846 2.581464 2.809755
[[6]]
[1] 4.042313 4.920532 3.703123 2.665831 4.260906 2.837666
The lapply and similar functions often make use of anonymous functions. Anonymous functions are functions not bound to any identifier, i.e., they don't have names. They are created and used on the spot but never assigned to a variable.
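As a quick illustrative sketch (made-up values, not from the original post), an anonymous function is written inline right where it is needed:
# Squaring each element with a throwaway function that never gets a name
lapply(1:3, function(x) x^2)
# Since R 4.1, the backslash shorthand \(x) is equivalent
lapply(1:3, \(x) x^2)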
# Creating a composite list that holds two different matrices
composite_matrix <- list(matrix_1 = matrix(3:6, 2, 2), matrix_2 = matrix(4:9, 3, 2))
composite_matrix
$matrix_1
[,1] [,2]
[1,] 3 5
[2,] 4 6
$matrix_2
[,1] [,2]
[1,] 4 7
[2,] 5 8
[3,] 6 9
# Extracting the second column of both matrices
lapply(composite_matrix, function(c1m) c1m[, 2]) # anonymous function; its argument c1m takes each matrix in turn and we return its second column
$matrix_1
[1] 5 6
$matrix_2
[1] 7 8 9
# Extracting the second row of both matrices
lapply(composite_matrix, function(r1w) r1w[2, ]) # here the argument is r1w, and the anonymous function returns the second row of each matrix
$matrix_1
[1] 4 6
$matrix_2
[1] 5 8
lapply always returns a list, which is one of its less exciting aspects, so sometimes it is not as convenient. It is always good to have a nice clean output. If we want a cleaner output with all the goodies of lapply, we use the sapply function, which tries to simplify the result of lapply as much as possible.
First of all, I am creating a list and using the lapply function to calculate the means. Afterward, I will use sapply and compare the differences.
test_1 <- list(a = 11:14, b = rnorm(8), c = rnorm(11, 1), d = rnorm(80, 7))
lapply(test_1, mean)
$a
[1] 12.5
$b
[1] -0.0321797
$c
[1] 0.7549777
$d
[1] 7.029287
# just for fun!
library(lattice)
densityplot(test_1$d)
We received a list containing the mean of each of the 4 elements. Now I am going to use sapply and compare the differences:
sapply(test_1, mean)
a b c d
12.5000000 -0.0321797 0.7549777 7.0292870
As we can see, the results now appear as a clean named vector, which is much easier to read.
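One caveat worth noting (my own illustrative sketch, reusing the test_1 list from above): when the individual results have different lengths, sapply cannot simplify them and quietly falls back to returning a list, just like lapply:
# Keeping only the positive values: the elements typically yield different
# numbers of values, so no vector or matrix is possible and a list comes back
sapply(test_1, function(x) x[x > 0])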
The apply function is used to evaluate a function (often an anonymous function) over the margins of an array.
Let’s check the arguments required to execute the apply function:
str(apply)
function (X, MARGIN, FUN, ..., simplify = TRUE)
Let's create a matrix with 15 rows and 12 columns and pass it to the apply function.
a <- matrix(rnorm(180), 15, 12)
# Let's calculate the mean of all the columns
apply(a, 2, mean)
[1] 0.07059646 0.07154827 0.09868546 0.73583431 -0.50007538 0.12287220
[7] 0.12612793 -0.41168822 -0.12294147 0.17510980 -0.03373771 0.29036536
In the above call, a is the name of the matrix, the 2 (the MARGIN argument) tells R to work column by column, and mean is the function to apply. All in all, we got 12 values, and they are the means of the 12 columns of matrix a.
Now, let’s add the values in all 15 rows:
apply(a, 1, sum)
[1] 0.08773681 0.38545560 -3.26189971 7.58999169 -0.70135001 4.82322597
[7] 2.61548234 -1.79671431 -0.05051695 8.42282712 -5.39321901 -0.54869049
[13] -4.45708450 3.98775024 -2.36253962
Here we go. We have 15 values, and they are the sums of the 15 rows of matrix a.
There are optimized functions in R that simplify the tasks of calculating row or column sums and means: rowSums, colSums, rowMeans, and colMeans. These functions are much faster and easier to run. Let's check one of them in action:
rowSums(a)
[1] 0.08773681 0.38545560 -3.26189971 7.58999169 -0.70135001 4.82322597
[7] 2.61548234 -1.79671431 -0.05051695 8.42282712 -5.39321901 -0.54869049
[13] -4.45708450 3.98775024 -2.36253962
If we compare the values, we get exactly the same sums. However, the usefulness of the apply function doesn't end there. We can calculate quantiles of the matrix a and even pass additional arguments, like calculating the 25th and 75th percentile values by column or by row. For example:
# Calculating 25th and 75th percentile values by the column
apply(a, 2, quantile, probs = c(0.25, 0.75))
[,1] [,2] [,3] [,4] [,5] [,6]
25% -0.4717277 -0.3182557 -0.7787222 -0.03679574 -1.02652186 -0.7004837
75% 0.7471132 0.3200644 0.6927512 1.08908391 0.09802976 0.6404537
[,7] [,8] [,9] [,10] [,11] [,12]
25% -0.6907739 -1.0129006 -0.619241 -0.6210604 -0.3967204 -0.1000465
75% 0.9615792 0.1518437 0.627524 0.7288684 0.6720800 0.9127678
Here are our values. Now, let's calculate the 10th and 90th percentile values in each row of a.
apply(a, 1, quantile, probs = c(0.10, 0.90))
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
10% -1.182349 -0.8328109 -1.0200503 -0.2481957 -1.1479979 -0.7645003 -0.619589
90% 1.156929 0.7873955 0.4297385 1.3227091 0.9457887 2.1934801 1.029480
[,8] [,9] [,10] [,11] [,12] [,13] [,14]
10% -1.044502 -1.0891163 -0.267020 -1.4440264 -1.148876 -1.9278430 -0.8181829
90% 1.145441 0.6811349 1.853211 0.2923267 1.148196 0.9705034 1.2997735
[,15]
10% -1.275072
90% 1.170902
Looks like we got the values we were looking for. The function went through each row of the matrix and calculated the 10th and 90th percentile values for us. The results were provided in a matrix with 2 rows and 15 columns.
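If a 15-by-2 layout is preferred instead (a small sketch, not from the original post), the result can simply be transposed:
# Same quantiles as above, but now with one row per row of a
t(apply(a, 1, quantile, probs = c(0.10, 0.90)))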
We can use this function not just with a matrix but also with an array. For example:
# rnorm(2*2*10) generates the 40 random normal values that fill the array
# c(2, 2, 10) gives the dimensions: 2 rows, 2 columns, and 10 slices (i.e., ten 2x2 matrices)
b <- array(rnorm(2 * 2 * 10), c(2, 2, 10))
# Let's check how the 1st and 2nd matrices look
b[, , 1:2] # Returns the first and second matrices
, , 1
[,1] [,2]
[1,] 0.8439809 0.03574801
[2,] 1.0188777 0.70858362
, , 2
[,1] [,2]
[1,] -0.6391815 -0.5822599
[2,] 1.2055147 1.7499308
# b[,,7]#7th matrix
# If we want a range of matrices we can use the syntax like this
# b[,,2:8]#print all matrices from 2 through 8
# Calculating the means in array b
apply(b, c(1, 2), mean)
[,1] [,2]
[1,] 0.277740 0.2926438
[2,] 0.421089 0.1764464
# Or
rowMeans(b, dims = 2)
[,1] [,2]
[1,] 0.277740 0.2926438
[2,] 0.421089 0.1764464
We got the exact same outcomes.
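As a further illustrative sketch (not from the original post), applying the function over the third margin instead collapses each 2 x 2 slice to a single number, giving the mean of each of the 10 matrices:
# One mean per slice of the array b
apply(b, 3, mean)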
mapply is a multivariate version of lapply, which applies a function in parallel over a set of arguments. Here's the structure of this function, followed by its arguments:
str(mapply)
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
Where,
- FUN: the function to apply
- ...: the arguments to vectorize over (each one supplies values in parallel)
- MoreArgs: a list of other arguments to pass to FUN
- SIMPLIFY: whether to simplify the result, as in sapply
- USE.NAMES: whether to use the names (or character values) of the first argument for the result
Let's take an example in which I want to create a list containing the values 1 through 6, where 1 is repeated 6 times, 2 is repeated 5 times, and so on. I can use either the list function or mapply.
# Using list function
list(rep(1, 6), rep(2, 5), rep(3, 4), rep(4, 3), rep(5, 2), rep(6, 1))
[[1]]
[1] 1 1 1 1 1 1
[[2]]
[1] 2 2 2 2 2
[[3]]
[1] 3 3 3 3
[[4]]
[1] 4 4 4
[[5]]
[1] 5 5
[[6]]
[1] 6
# Using mapply function
mapply(rep, 1:6, 6:1) # rep is the 'repeat' function
[[1]]
[1] 1 1 1 1 1 1
[[2]]
[1] 2 2 2 2 2
[[3]]
[1] 3 3 3 3
[[4]]
[1] 4 4 4
[[5]]
[1] 5 5
[[6]]
[1] 6
We got exactly the same lists, but the mapply code is much shorter and easier to comprehend. Let's take one more example: I want to generate sets of data points of increasing size and increasing mean, while keeping the standard deviation fixed. For this, I can write a function, but on its own it doesn't work the way I want. Let me try it a couple of different ways:
# 1. Creating a Function
function_increasing_mean <- function(n, mean, sd) { # create a function with 3 arguments
rnorm(n, mean, sd) # use random n numbers to create a list having the mean of 'mean' and the standard deviation of 'sd'
}
# Let's pass the function
function_increasing_mean(4, 1, 2)
[1] 2.961762 -1.744311 4.030102 -0.230550
I got a vector of 4 numbers drawn from a distribution with a mean of 1 and an sd of 2, but this is not what I was looking for. Now, let's make one more attempt:
# Range Method
function_increasing_mean(1:4, 1:4, 2)
[1] 1.906233 1.443013 1.950079 1.954253
Nope. I still got a single vector of four numbers; rnorm simply took the length of 1:4 as n, rather than producing four separate sets. Now, let's try mapply:
# mapply Method
mapply(function_increasing_mean, 1:4, 1:4, 2)
[[1]]
[1] 1.779424
[[2]]
[1] -0.9773194 3.3053830
[[3]]
[1] 5.281450 1.873322 1.442196
[[4]]
[1] 0.98201672 3.32058478 0.08235915 6.28287471
Here we go. I have four sets of numbers. The first one is drawn from a distribution with a mean of 1 and the fourth from one with a mean of 4, while the sd remained the same. It's the result I was looking for.
I could get the same thing using the list function, but I would have to type all of the following code manually:
list(function_increasing_mean(1, 1, 2), function_increasing_mean(2, 2, 2), function_increasing_mean(3, 3, 2), function_increasing_mean(4, 4, 2))
[[1]]
[1] 1.671077
[[2]]
[1] 2.669772 2.078709
[[3]]
[1] -0.4959015 3.8091327 2.1638217
[[4]]
[1] 4.257558 1.664209 2.602976 7.170069
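Another option (a hedged sketch, not something used in the original post) is Vectorize(), which is itself built on top of mapply and gives our function vector-friendly behavior:
# Vectorize() wraps mapply() around function_increasing_mean for us
vec_increasing_mean <- Vectorize(function_increasing_mean)
vec_increasing_mean(1:4, 1:4, 2) # mirrors the mapply call above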
tapply is a very useful function for applying a function over subsets of a vector. Here are the arguments that it takes:
str(tapply)
function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
Where,
- X: a vector
- INDEX: a factor, or a list of factors, identifying the subsets
- FUN: the function to apply to each subset
- default: the value used for empty subsets when the result is simplified
- simplify: whether to simplify the result
# Creating a vector e that has 60 elements (20 standard normals, 20 uniforms, and 20 normals with mean 1)
e <- c(rnorm(20), runif(20), rnorm(20, 1))
# Creating a factor variable f (which will have 3 levels) using the gl function; each level is repeated 20 times
f <- gl(3, 20)
# Now, tapply on e, pass the factor variable and calculate the means of the three groups
tapply(e, f, mean)
1 2 3
0.1964159 0.4536534 1.1806358
The means of all three categories are returned as a nice, clean named vector. This is the default behavior. If we set simplify = FALSE, we get a slightly messier outcome (a list of single values). For example:
tapply(e, f, mean, simplify = FALSE)
$`1`
[1] 0.1964159
$`2`
[1] 0.4536534
$`3`
[1] 1.180636
This is how the simplify argument works. We can also use tapply with slightly more complex functions:
tapply(e, f, range)
$`1`
[1] -1.427012 1.736941
$`2`
[1] 0.01100638 0.86965413
$`3`
[1] -1.095685 3.091873
# tapply(e, f, median)#Gives Median
# tapply(e, f, mode) # Note: base R's mode() returns the storage mode (e.g., 'numeric'), not the statistical mode
# tapply(e, f, sd)#Gives Standard Deviation
We got the smallest and the largest values in each of the three categories.
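As noted above, base R's mode() does not compute a statistical mode; if one is wanted, a tiny hypothetical helper (my own sketch, mainly meaningful for discrete data) could be passed to tapply instead:
# Hypothetical helper: the most frequent value (ties broken by first appearance)
stat_mode <- function(v) {
  ux <- unique(v)
  ux[which.max(tabulate(match(v, ux)))]
}
# With continuous data like e every value is unique, so this is only a toy call
tapply(e, f, stat_mode)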
The split function takes a vector or another object and splits it into groups determined by a factor or a list of factors. It is not a loop function, but it is very handy to use in conjunction with the apply functions. Here are its required arguments:
str(split)
function (x, f, drop = FALSE, ...)
Where,
- x: a vector, list, or data frame to be split
- f: a factor (or a list of factors) defining the grouping
- drop: whether empty factor levels should be dropped
Let’s check a simple example by using the same e and f objects from above:
split(e, f)
$`1`
[1] -0.83118688 -0.11076104 0.61996316 -1.42701199 1.72704684 -1.31064232
[7] -0.05705985 -0.27023149 0.85967397 -0.97481408 -0.89231341 1.29710609
[13] 0.46356776 0.68760333 1.73694056 0.27688548 0.27611748 1.11757794
[19] 1.66987469 -0.93001881
$`2`
[1] 0.69209913 0.65732471 0.60420401 0.81236675 0.54034453 0.86965413
[7] 0.17872125 0.04194962 0.38059585 0.86322323 0.80169873 0.06803396
[13] 0.62265928 0.32890919 0.45383706 0.22381724 0.41185781 0.01100638
[19] 0.23646694 0.27429845
$`3`
[1] 2.1220623 0.4249407 0.7141020 1.2785241 -0.4347269 1.4829627
[7] 3.0918729 0.5845254 -1.0956846 0.7872617 1.6629817 1.1192421
[13] 2.1832100 1.5079100 2.8070939 0.0341233 1.8852808 0.1398690
[19] 0.7719396 2.5452259
As we can see, the split function took the factor we created in object f and applied it to our data set e. We have an equal number of elements in all three categories.
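As a side note (a small sketch, not from the original post), the operation can be reversed with unsplit, which stitches the pieces back together in their original order:
# Should recover the original vector e
identical(unsplit(split(e, f), f), e)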
Once the data is split, we can use other functions like lapply or sapply on the pieces. Here's the most common way of using the split function:
lapply(split(e, f), mean)
$`1`
[1] 0.1964159
$`2`
[1] 0.4536534
$`3`
[1] 1.180636
sapply(split(e, f), sd)
1 2 3
1.0274392 0.2802472 1.0917719
We got the mean and the standard deviation of all three categories. We also know why the second output looks a little cleaner than the first one: it's because we used sapply. For this simple case, though, the split step isn't strictly necessary, because as we saw in the previous example, tapply does the same thing.
However, the nice thing about the split function is that we can use it on much more complex objects. For example, let's use it on the airquality data set.
head(airquality) # checking the first few rows of the data set
I want to take the mean of Ozone, Solar.R, Wind, and Temp by month. How do I do this? I will first split the data frame by month and then calculate the means.
split_airquality <- split(airquality, airquality$Month)
lapply(split_airquality, function(f) colMeans(f[, c("Ozone", "Solar.R", "Wind", "Temp")])) # function(f) is an anonymous function; f stands for each month's sub-data frame
$`5`
Ozone Solar.R Wind Temp
NA NA 11.62258 65.54839
$`6`
Ozone Solar.R Wind Temp
NA 190.16667 10.26667 79.10000
$`7`
Ozone Solar.R Wind Temp
NA 216.483871 8.941935 83.903226
$`8`
Ozone Solar.R Wind Temp
NA NA 8.793548 83.967742
$`9`
Ozone Solar.R Wind Temp
NA 167.4333 10.1800 76.9000
The results give the means of Ozone, Solar.R, Wind, and Temp for months 5 (i.e., May) through 9 (i.e., September). The means of Ozone and Solar.R are not calculated for some months because there are missing data. We can see the temperature rising through August and dropping in September. The results are fine, but let's see if we get a better-looking outcome when we apply the sapply function.
sapply(split_airquality, function(f) colMeans(f[, c("Ozone", "Solar.R", "Wind", "Temp")]))
5 6 7 8 9
Ozone NA NA NA NA NA
Solar.R NA 190.16667 216.483871 NA 167.4333
Wind 11.62258 10.26667 8.941935 8.793548 10.1800
Temp 65.54839 79.10000 83.903226 83.967742 76.9000
The results are neatly arranged in a tabular format. They are easy to read and understand, and it is much easier to compare the findings among the variables.
However, when there are missing values, R doesn't give us the mean for that variable. We can remove the missing values and calculate the means for all of the variables by passing na.rm = TRUE through to colMeans. Here's how we do it:
sapply(split_airquality, function(f) colMeans(f[, c("Ozone", "Solar.R", "Wind", "Temp")], na.rm = TRUE)) # gets rid of NAs
5 6 7 8 9
Ozone 23.61538 29.44444 59.115385 59.961538 31.44828
Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
Wind 11.62258 10.26667 8.941935 8.793548 10.18000
Temp 65.54839 79.10000 83.903226 83.967742 76.90000
In the examples above, we split on a single factor. Sometimes we have multiple factors, and we can still use the split function to compute the values of interest. Let's take an example:
# Creating a data set with 30 random numbers
g <- rnorm(30)
# First categorical variable
grade <- gl(3, 10) # 3 different grades of 10 each
# Second categorical variable
ethnicity <- gl(6, 5) # 6-ethnic groups of 5 each
# Calculating Interaction between grade and ethnicity
interaction(grade, ethnicity)
[1] 1.1 1.1 1.1 1.1 1.1 1.2 1.2 1.2 1.2 1.2 2.3 2.3 2.3 2.3 2.3 2.4 2.4 2.4 2.4
[20] 2.4 3.5 3.5 3.5 3.5 3.5 3.6 3.6 3.6 3.6 3.6
18 Levels: 1.1 2.1 3.1 1.2 2.2 3.2 1.3 2.3 3.3 1.4 2.4 3.4 1.5 2.5 3.5 ... 3.6
There are a total of 18 levels after the interaction. Now, I want to split the vector g according to the categorical variables grade and ethnicity.
# splitting the variable g by the combination of grade and ethnicity, then checking the structure of all 18 distinct categories
str(split(g, list(grade, ethnicity)))
List of 18
$ 1.1: num [1:5] -0.935 -0.758 -0.691 0.204 1.336
$ 2.1: num(0)
$ 3.1: num(0)
$ 1.2: num [1:5] 1.195 -0.121 -0.347 -1.323 -0.472
$ 2.2: num(0)
$ 3.2: num(0)
$ 1.3: num(0)
$ 2.3: num [1:5] 0.8432 -0.0171 0.6319 0.5351 0.8473
$ 3.3: num(0)
$ 1.4: num(0)
$ 2.4: num [1:5] 0.54 -0.444 -0.735 0.937 0.657
$ 3.4: num(0)
$ 1.5: num(0)
$ 2.5: num(0)
$ 3.5: num [1:5] -0.509 -0.42 0.123 -2.009 0.343
$ 1.6: num(0)
$ 2.6: num(0)
$ 3.6: num [1:5] 3.471 1.294 0.595 1.782 1.268
When I use the split function, I don't have to call interaction myself; passing a list of factors does it for us automatically. And obviously, not all interaction levels have values. If we want to get rid of the empty levels, we have the drop option, which returns only the combinations that actually contain values.
str(split(g, list(grade, ethnicity), drop = TRUE))
List of 6
$ 1.1: num [1:5] -0.935 -0.758 -0.691 0.204 1.336
$ 1.2: num [1:5] 1.195 -0.121 -0.347 -1.323 -0.472
$ 2.3: num [1:5] 0.8432 -0.0171 0.6319 0.5351 0.8473
$ 2.4: num [1:5] 0.54 -0.444 -0.735 0.937 0.657
$ 3.5: num [1:5] -0.509 -0.42 0.123 -2.009 0.343
$ 3.6: num [1:5] 3.471 1.294 0.595 1.782 1.268
We got only the six interactions that have values in them.
Let’s wrap up with a fun figure (using lattice package and airquality data):
These functions/packages may be of interest to you; the examples below are just a small taste.
library(lattice)
densityplot(~Temp, groups = Month, data = airquality, plot.points = FALSE, auto.key = TRUE)
library(report)
report_sample(airquality, group_by = "Month")
Final Words: These are basic but essential aspects of using R. Without a knowledge of these functions, we can easily feel lost, so it is well worth learning them!