Hello! Welcome to this, the final of your hard-core R workshops in LIFE4138.
Last time, we looked at:
for loops
if/else statements
So, today we are going to go back to the for loops, and
talk about a slightly different way of approaching the same problem.
for loops are a fundamental part of coding in any language,
although R was designed so that they can often be avoided, thanks to a
special family of built-in functions, the apply()
family. The three functions from this family that we’ll focus on here
are:
apply()
lapply(), and
sapply()
The apply family of functions allows you to write a piece of code
analogous to a for loop: it iterates over the items in a data
structure and applies a function (or set of functions) to each row,
column or element in turn. You might have a series of dataframes and
want to count the columns within them - you could do this with a for
loop, or you could use the apply family to achieve the same thing
more compactly. The apply functions were developed to reduce the need
for explicit loops in R, and the code is usually shorter and sometimes
faster (not always!) - there are some cases where an apply function
will be slower or considerably more tricky to write, but they’re a
useful thing to learn, which will allow you to further advance your R
coding prowess.
That being said, the apply functions can be quite tricky to
get your head around. Generally, the help files for the
apply functions are a bit rubbish, and it’s tricky to find places where
they are well explained. It took me a while to fully understand exactly
what outcome I’d get when using apply(), and sometimes it
still takes me a couple of goes to get it exactly right. Our aim here
isn’t to develop some super complex functions, but to demystify them and
get you familiar with how they work and where they might be useful.
They’re certainly worth taking the time to understand.
There are functions beyond apply(),
lapply() and sapply(), but these are the three
that we will focus on today:
apply() is used on matrices and dataframes
lapply() is used on lists (and gives a list output)
sapply() is similar to lapply() but with the option to simplify the output - arguably the most versatile?
These functions are incredibly useful to expand your R toolkit, and once you’ve got to grips with using them, you’ll find other complex R functions much easier to understand too!
Right - we’ll start with apply(). We’ll have a quick go
at using apply, then dissect it a little. First up, we need our iris
data - it is one of the in-built R datasets, so make sure it is loaded first,
and then have a look at it with the head() function:
data(iris)
head(iris)
We can see that iris has 5 columns. For now, we’re going to get rid of the Species column, as it would make our life a little more difficult than we’d like, so we’ll create a new object containing the iris dataframe minus the 5th column:
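The line of code for this step does not appear in the text. A minimal sketch, assuming the new object is called my_iris (the name used in the rest of this workshop), would be:

```r
# Negative indexing drops a column: keep every row, remove column 5 (Species).
# The object name my_iris is taken from the examples later in the workshop.
my_iris <- iris[, -5]
```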
Run the head() function on the new dataframe to check that your
code has removed the correct column - you should be left with 4 (petal
and sepal width and length measurements). Once you can see that the
Species column has gone, we can carry on! The reason we
removed the Species column was to give us a nice clean dataset
with only numeric values inside of it, so we can more easily demonstrate
apply().
Now we have a dataset where each column is a set of values pertaining
to 150 observations of something. What if we want to find the means of
each of these columns? There are several ways we can do this without
using apply(), such as:
mean(my_iris[,1])
[1] 5.843333
mean(my_iris[,2])
[1] 3.057333
mean(my_iris[,3])
[1] 3.758
mean(my_iris[,4])
[1] 1.199333
This is manageable with only 4 columns, but is clunky
and repetitive. We want to write a single command that will iterate over
the entire dataset for us. We could also achieve it with a
for loop:
for (x in 1:ncol(my_iris)){
print(mean(my_iris[,x]))
}
[1] 5.843333
[1] 3.057333
[1] 3.758
[1] 1.199333
This loop takes every value between 1 and the number of columns in my_iris, assigns each one to x in turn, and works out the mean of the column at that index. This isn’t the neatest way of doing things, and can be a little complex to construct. It also only prints each mean, rather than storing it for later use.
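If we wanted to keep the results of a loop like this rather than just print them, one common pattern is to create an empty vector first and fill it in as the loop runs. This is only a sketch: iris_num and col_means are made-up names for illustration.

```r
# iris_num is a hypothetical name for the four numeric iris columns
iris_num <- iris[, 1:4]

# Create an empty numeric vector with one slot per column
col_means <- numeric(ncol(iris_num))

# seq_len() counts from 1 up to the number of columns
for (x in seq_len(ncol(iris_num))) {
  # Store the mean of column x in slot x
  col_means[x] <- mean(iris_num[, x])
}

# Label each result with its column name
names(col_means) <- colnames(iris_num)
col_means
```

That is quite a lot of code for a simple task, which is part of the appeal of the apply family.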
Arguably, a quicker and tidier way of achieving this is
by using apply(). This function takes 3
arguments: you tell it what data you would like it to work with,
then which dimension of the data to work along (either rows or columns in this case), and
then the function to perform on each row or column
in turn. Let’s construct our first example,
and then we can dissect it a little more - we will use the
apply() function to achieve what we just did the hard
way:
apply(my_iris, 2, mean)
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.843333 3.057333 3.758000 1.199333
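You can see that apply also gives us a nicer output! But, back to the anatomy of apply. There are three arguments: the data (my_iris), the dimension to work along (1 for rows, 2 for columns) and the function to apply. We used 2 above to work down the columns; swapping it for 1 gives us the mean of each row instead: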
apply(my_iris, 1, mean)
[1] 2.550 2.375 2.350 2.350 2.550 2.850 2.425 2.525 2.225 2.400 2.700 2.500 2.325 2.125 2.800 3.000 2.750 2.575
[19] 2.875 2.675 2.675 2.675 2.350 2.650 2.575 2.450 2.600 2.600 2.550 2.425 2.425 2.675 2.725 2.825 2.425 2.400
[37] 2.625 2.500 2.225 2.550 2.525 2.100 2.275 2.675 2.800 2.375 2.675 2.350 2.675 2.475 4.075 3.900 4.100 3.275
[55] 3.850 3.575 3.975 2.900 3.850 3.300 2.875 3.650 3.300 3.775 3.350 3.900 3.650 3.400 3.600 3.275 3.925 3.550
[73] 3.800 3.700 3.725 3.850 3.950 4.100 3.725 3.200 3.200 3.150 3.400 3.850 3.600 3.875 4.000 3.575 3.500 3.325
[91] 3.425 3.775 3.400 2.900 3.450 3.525 3.525 3.675 2.925 3.475 4.525 3.875 4.525 4.150 4.375 4.825 3.400 4.575
[109] 4.200 4.850 4.200 4.075 4.350 3.800 4.025 4.300 4.200 5.100 4.875 3.675 4.525 3.825 4.800 3.925 4.450 4.550
[127] 3.900 3.950 4.225 4.400 4.550 5.025 4.250 3.925 3.925 4.775 4.425 4.200 3.900 4.375 4.450 4.350 3.875 4.550
[145] 4.550 4.300 3.925 4.175 4.325 3.950
This is somewhat meaningless biologically, but illustrates the point. Anyway, our third argument is the function,
mean(). When you pass a function to apply, R will search
its environment for the function you’re after - you can use custom
functions here too!
So, using a single line of code, we have managed to achieve what took
us much more effort to perform manually or with a loop. That being said,
there is a base R function which will work out the column means of a
dataset without having to use apply - colMeans().
colMeans(my_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.843333 3.057333 3.758000 1.199333
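If you want to convince yourself that the two approaches really do agree, a quick check along these lines should do it. The object names here are made up for illustration, and rowMeans() is the row-wise partner of colMeans().

```r
# iris_num is a hypothetical name for the four numeric iris columns
iris_num <- iris[, 1:4]

# Column means two ways: apply() over dimension 2, and colMeans()
col_by_apply <- apply(iris_num, 2, mean)
col_by_colmeans <- colMeans(iris_num)
all.equal(col_by_apply, col_by_colmeans)

# Row means two ways: apply() over dimension 1, and rowMeans()
row_by_apply <- apply(iris_num, 1, mean)
row_by_rowmeans <- rowMeans(iris_num)

# unname() strips any labels so that only the numbers are compared
all.equal(unname(row_by_apply), unname(row_by_rowmeans))
```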
The point of telling you this isn’t to invalidate the use of apply or any of the other methods, but to get you to think about your use of code. Often, particularly with vectorisation, there is a much simpler and more straightforward way of performing quite complex commands in R - don’t forget it was literally designed to be simple! Note that some of the best coders are lazy: we’re always looking for quicker and easier ways of doing things to make our own lives easy. It is however important to strike a balance between finding a quicker way, and the time that you spend finding that quicker way! Don’t forget your R philosophies, and my favourite thing to say when teaching R - If it works, it works!
We’ve constructed an apply function in its most basic form, but we
can go much further with them. What if, for example, there was a rogue
NA in our dataset? We’ve seen in the past that lots of
functions can’t handle this very well! Let’s pop one into our
my_iris data and run the apply function again:
my_iris[5,2] <- NA
apply(my_iris, 2, mean)
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.843333 NA 3.758000 1.199333
Just as suspected, we’ve got an NA result for our column with the NA
value. We can pass extra arguments to the function inside apply simply by
adding them after the function name, separated by commas. Let’s try
adding the na.rm = TRUE argument to fix the
issue:
apply(my_iris, 2, mean, na.rm = TRUE)
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.843333 3.053691 3.758000 1.199333
Piece of cake. Apply is super flexible, so let’s construct another
example just to really hammer home the point - we’ll try with
sum() this time:
apply(my_iris, 2, sum, na.rm = TRUE)
Sepal.Length Sepal.Width Petal.Length Petal.Width
876.5 455.0 563.7 179.9
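You can pass more than one extra argument in this way. As a sketch, here is quantile() (a base R function not otherwise used in this workshop) being given both probs and na.rm; iris_num and quartiles are made-up names.

```r
# iris_num is a hypothetical copy of the numeric iris columns,
# with the same rogue NA popped into row 5, column 2
iris_num <- iris[, 1:4]
iris_num[5, 2] <- NA

# Everything after the function name is handed on to quantile():
# probs asks for the 25% and 75% quantiles, na.rm drops the NA first
quartiles <- apply(iris_num, 2, quantile, probs = c(0.25, 0.75), na.rm = TRUE)
quartiles
```

Because quantile() returns two values per column here, apply() gives back a small matrix with one row per quantile and one column per measurement.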
apply() is super quick, and very useful. Whilst at first
it takes some getting used to, once you’ve constructed a couple, they
will become second nature.
We can also write our own functions, and use them with
apply(). Let’s create a function to square a series of
values:
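The definition itself does not appear in the text. A minimal version that matches the description that follows (a function called square, with a single argument x) might look like this:

```r
# A function called square with one argument, x.
# The body (inside the {}) raises x to the power 2.
square <- function(x) {
  x^2
}
```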
As a very quick reminder, this bit of code tells R that we want to
create a function called square with a single argument
x. R will assign whatever value we give the
square function to x, and perform the code in
the body of the function (the stuff enclosed in the {}).
Let’s test square to make sure it does what we’re expecting
before we use it with apply()
square(10)
[1] 100
Wonderful, our function is performing as we would expect. Now, we can pop it into apply and square each of the values in the dataset.
apply(my_iris, 2, square)
Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] 26.01 12.25 1.96 0.04
[2,] 24.01 9.00 1.96 0.04
[3,] 22.09 10.24 1.69 0.04
[4,] 21.16 9.61 2.25 0.04
[5,] 25.00 NA 1.96 0.04
[6,] 29.16 15.21 2.89 0.16
[7,] 21.16 11.56 1.96 0.09
[8,] 25.00 11.56 2.25 0.04
[9,] 19.36 8.41 1.96 0.04
[10,] 24.01 9.61 2.25 0.01
[11,] 29.16 13.69 2.25 0.04
[12,] 23.04 11.56 2.56 0.04
[13,] 23.04 9.00 1.96 0.01
[14,] 18.49 9.00 1.21 0.01
[15,] 33.64 16.00 1.44 0.04
[16,] 32.49 19.36 2.25 0.16
[17,] 29.16 15.21 1.69 0.16
[18,] 26.01 12.25 1.96 0.09
[19,] 32.49 14.44 2.89 0.09
[20,] 26.01 14.44 2.25 0.09
[21,] 29.16 11.56 2.89 0.04
[22,] 26.01 13.69 2.25 0.16
[23,] 21.16 12.96 1.00 0.04
[24,] 26.01 10.89 2.89 0.25
[25,] 23.04 11.56 3.61 0.04
[26,] 25.00 9.00 2.56 0.04
[27,] 25.00 11.56 2.56 0.16
[28,] 27.04 12.25 2.25 0.04
[29,] 27.04 11.56 1.96 0.04
[30,] 22.09 10.24 2.56 0.04
[31,] 23.04 9.61 2.56 0.04
[32,] 29.16 11.56 2.25 0.16
[33,] 27.04 16.81 2.25 0.01
[34,] 30.25 17.64 1.96 0.04
[35,] 24.01 9.61 2.25 0.04
[36,] 25.00 10.24 1.44 0.04
[37,] 30.25 12.25 1.69 0.04
[38,] 24.01 12.96 1.96 0.01
[39,] 19.36 9.00 1.69 0.04
[40,] 26.01 11.56 2.25 0.04
[41,] 25.00 12.25 1.69 0.09
[42,] 20.25 5.29 1.69 0.09
[43,] 19.36 10.24 1.69 0.04
[44,] 25.00 12.25 2.56 0.36
[45,] 26.01 14.44 3.61 0.16
[46,] 23.04 9.00 1.96 0.09
[47,] 26.01 14.44 2.56 0.04
[48,] 21.16 10.24 1.96 0.04
[49,] 28.09 13.69 2.25 0.04
[50,] 25.00 10.89 1.96 0.04
[51,] 49.00 10.24 22.09 1.96
[52,] 40.96 10.24 20.25 2.25
[53,] 47.61 9.61 24.01 2.25
[54,] 30.25 5.29 16.00 1.69
[55,] 42.25 7.84 21.16 2.25
[56,] 32.49 7.84 20.25 1.69
[57,] 39.69 10.89 22.09 2.56
[58,] 24.01 5.76 10.89 1.00
[59,] 43.56 8.41 21.16 1.69
[60,] 27.04 7.29 15.21 1.96
[61,] 25.00 4.00 12.25 1.00
[62,] 34.81 9.00 17.64 2.25
[63,] 36.00 4.84 16.00 1.00
[64,] 37.21 8.41 22.09 1.96
[65,] 31.36 8.41 12.96 1.69
[66,] 44.89 9.61 19.36 1.96
[67,] 31.36 9.00 20.25 2.25
[68,] 33.64 7.29 16.81 1.00
[69,] 38.44 4.84 20.25 2.25
[70,] 31.36 6.25 15.21 1.21
[71,] 34.81 10.24 23.04 3.24
[72,] 37.21 7.84 16.00 1.69
[73,] 39.69 6.25 24.01 2.25
[74,] 37.21 7.84 22.09 1.44
[75,] 40.96 8.41 18.49 1.69
[76,] 43.56 9.00 19.36 1.96
[77,] 46.24 7.84 23.04 1.96
[78,] 44.89 9.00 25.00 2.89
[79,] 36.00 8.41 20.25 2.25
[80,] 32.49 6.76 12.25 1.00
[81,] 30.25 5.76 14.44 1.21
[82,] 30.25 5.76 13.69 1.00
[83,] 33.64 7.29 15.21 1.44
[84,] 36.00 7.29 26.01 2.56
[85,] 29.16 9.00 20.25 2.25
[86,] 36.00 11.56 20.25 2.56
[87,] 44.89 9.61 22.09 2.25
[88,] 39.69 5.29 19.36 1.69
[89,] 31.36 9.00 16.81 1.69
[90,] 30.25 6.25 16.00 1.69
[91,] 30.25 6.76 19.36 1.44
[92,] 37.21 9.00 21.16 1.96
[93,] 33.64 6.76 16.00 1.44
[94,] 25.00 5.29 10.89 1.00
[95,] 31.36 7.29 17.64 1.69
[96,] 32.49 9.00 17.64 1.44
[97,] 32.49 8.41 17.64 1.69
[98,] 38.44 8.41 18.49 1.69
[99,] 26.01 6.25 9.00 1.21
[100,] 32.49 7.84 16.81 1.69
[101,] 39.69 10.89 36.00 6.25
[102,] 33.64 7.29 26.01 3.61
[103,] 50.41 9.00 34.81 4.41
[104,] 39.69 8.41 31.36 3.24
[105,] 42.25 9.00 33.64 4.84
[106,] 57.76 9.00 43.56 4.41
[107,] 24.01 6.25 20.25 2.89
[108,] 53.29 8.41 39.69 3.24
[109,] 44.89 6.25 33.64 3.24
[110,] 51.84 12.96 37.21 6.25
[111,] 42.25 10.24 26.01 4.00
[112,] 40.96 7.29 28.09 3.61
[113,] 46.24 9.00 30.25 4.41
[114,] 32.49 6.25 25.00 4.00
[115,] 33.64 7.84 26.01 5.76
[116,] 40.96 10.24 28.09 5.29
[117,] 42.25 9.00 30.25 3.24
[118,] 59.29 14.44 44.89 4.84
[119,] 59.29 6.76 47.61 5.29
[120,] 36.00 4.84 25.00 2.25
[121,] 47.61 10.24 32.49 5.29
[122,] 31.36 7.84 24.01 4.00
[123,] 59.29 7.84 44.89 4.00
[124,] 39.69 7.29 24.01 3.24
[125,] 44.89 10.89 32.49 4.41
[126,] 51.84 10.24 36.00 3.24
[127,] 38.44 7.84 23.04 3.24
[128,] 37.21 9.00 24.01 3.24
[129,] 40.96 7.84 31.36 4.41
[130,] 51.84 9.00 33.64 2.56
[131,] 54.76 7.84 37.21 3.61
[132,] 62.41 14.44 40.96 4.00
[133,] 40.96 7.84 31.36 4.84
[134,] 39.69 7.84 26.01 2.25
[135,] 37.21 6.76 31.36 1.96
[136,] 59.29 9.00 37.21 5.29
[137,] 39.69 11.56 31.36 5.76
[138,] 40.96 9.61 30.25 3.24
[139,] 36.00 9.00 23.04 3.24
[140,] 47.61 9.61 29.16 4.41
[141,] 44.89 9.61 31.36 5.76
[142,] 47.61 9.61 26.01 5.29
[143,] 33.64 7.29 26.01 3.61
[144,] 46.24 10.24 34.81 5.29
[145,] 44.89 10.89 32.49 6.25
[146,] 44.89 9.00 27.04 5.29
[147,] 39.69 6.25 25.00 3.61
[148,] 42.25 9.00 27.04 4.00
[149,] 38.44 11.56 29.16 5.29
[150,] 34.81 9.00 26.01 3.24
R has returned the whole dataset to us. This is because square() returns
one value for every value it is given, rather than collapsing a whole
column down to a single number like sum or mean. Each value within the
dataset is squared!
This is a simple example, but it is always worth knowing that there are alternatives. There are loads of ways to do everything in R, and being aware of them will help you to solve errors and think about the way that your functions are performing in order to get the best possible results. For example, here, we could also run
my_iris^2
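One small difference worth knowing about: apply() hands back a matrix, whereas raising a dataframe to a power hands back a dataframe. The sketch below (with made-up object names) checks that the numbers themselves are the same either way.

```r
# iris_num is a hypothetical name for the four numeric iris columns
iris_num <- iris[, 1:4]

square <- function(x) {
  x^2
}

via_apply <- apply(iris_num, 2, square)  # apply() returns a matrix
via_power <- iris_num^2                  # ^ on a dataframe returns a dataframe

is.matrix(via_apply)
is.data.frame(via_power)

# Compare the values only, ignoring row and column labels
all.equal(via_apply, as.matrix(via_power), check.attributes = FALSE)
```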
The lesson here is that you should be lazy! Look for the quickest and easiest solution to any given task! Knowing the code alternatives, and understanding the way that R processes and behaves with various data structures, will give you a great starting point - you will be able to write really simple and efficient code, and have a variety of tricks up your sleeve for when you can’t quite get the first attempt to work. Of course, here, there are simple alternatives because of the nature of the examples I’m giving you, but you might have to think more carefully about this when it comes to writing your own code to perform much more complicated tasks for you!
Now - we’ve covered the basics of how the apply function works, and
talked about what sorts of functions you can use it with. The final thing
to say about apply() specifically (although this applies to
all of the apply family) is with regards to anonymous functions. There
may be a case where you want to use a custom function, but without
defining and naming it separately in your script. The apply family allows
you to write the function directly inside the apply() call instead. We
do this by putting function(x) followed by a code block (remember the
{}) where the function name would normally go.
Let’s try and square the values of my_iris again, this
time, without explicitly declaring a function to do so:
apply(my_iris, 2, function(x){
x^2
})
Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] 26.01 12.25 1.96 0.04
[2,] 24.01 9.00 1.96 0.04
[3,] 22.09 10.24 1.69 0.04
[4,] 21.16 9.61 2.25 0.04
[5,] 25.00 NA 1.96 0.04
[6,] 29.16 15.21 2.89 0.16
[7,] 21.16 11.56 1.96 0.09
[8,] 25.00 11.56 2.25 0.04
[9,] 19.36 8.41 1.96 0.04
[10,] 24.01 9.61 2.25 0.01
[11,] 29.16 13.69 2.25 0.04
[12,] 23.04 11.56 2.56 0.04
[13,] 23.04 9.00 1.96 0.01
[14,] 18.49 9.00 1.21 0.01
[15,] 33.64 16.00 1.44 0.04
[16,] 32.49 19.36 2.25 0.16
[17,] 29.16 15.21 1.69 0.16
[18,] 26.01 12.25 1.96 0.09
[19,] 32.49 14.44 2.89 0.09
[20,] 26.01 14.44 2.25 0.09
[21,] 29.16 11.56 2.89 0.04
[22,] 26.01 13.69 2.25 0.16
[23,] 21.16 12.96 1.00 0.04
[24,] 26.01 10.89 2.89 0.25
[25,] 23.04 11.56 3.61 0.04
[26,] 25.00 9.00 2.56 0.04
[27,] 25.00 11.56 2.56 0.16
[28,] 27.04 12.25 2.25 0.04
[29,] 27.04 11.56 1.96 0.04
[30,] 22.09 10.24 2.56 0.04
[31,] 23.04 9.61 2.56 0.04
[32,] 29.16 11.56 2.25 0.16
[33,] 27.04 16.81 2.25 0.01
[34,] 30.25 17.64 1.96 0.04
[35,] 24.01 9.61 2.25 0.04
[36,] 25.00 10.24 1.44 0.04
[37,] 30.25 12.25 1.69 0.04
[38,] 24.01 12.96 1.96 0.01
[39,] 19.36 9.00 1.69 0.04
[40,] 26.01 11.56 2.25 0.04
[41,] 25.00 12.25 1.69 0.09
[42,] 20.25 5.29 1.69 0.09
[43,] 19.36 10.24 1.69 0.04
[44,] 25.00 12.25 2.56 0.36
[45,] 26.01 14.44 3.61 0.16
[46,] 23.04 9.00 1.96 0.09
[47,] 26.01 14.44 2.56 0.04
[48,] 21.16 10.24 1.96 0.04
[49,] 28.09 13.69 2.25 0.04
[50,] 25.00 10.89 1.96 0.04
[51,] 49.00 10.24 22.09 1.96
[52,] 40.96 10.24 20.25 2.25
[53,] 47.61 9.61 24.01 2.25
[54,] 30.25 5.29 16.00 1.69
[55,] 42.25 7.84 21.16 2.25
[56,] 32.49 7.84 20.25 1.69
[57,] 39.69 10.89 22.09 2.56
[58,] 24.01 5.76 10.89 1.00
[59,] 43.56 8.41 21.16 1.69
[60,] 27.04 7.29 15.21 1.96
[61,] 25.00 4.00 12.25 1.00
[62,] 34.81 9.00 17.64 2.25
[63,] 36.00 4.84 16.00 1.00
[64,] 37.21 8.41 22.09 1.96
[65,] 31.36 8.41 12.96 1.69
[66,] 44.89 9.61 19.36 1.96
[67,] 31.36 9.00 20.25 2.25
[68,] 33.64 7.29 16.81 1.00
[69,] 38.44 4.84 20.25 2.25
[70,] 31.36 6.25 15.21 1.21
[71,] 34.81 10.24 23.04 3.24
[72,] 37.21 7.84 16.00 1.69
[73,] 39.69 6.25 24.01 2.25
[74,] 37.21 7.84 22.09 1.44
[75,] 40.96 8.41 18.49 1.69
[76,] 43.56 9.00 19.36 1.96
[77,] 46.24 7.84 23.04 1.96
[78,] 44.89 9.00 25.00 2.89
[79,] 36.00 8.41 20.25 2.25
[80,] 32.49 6.76 12.25 1.00
[81,] 30.25 5.76 14.44 1.21
[82,] 30.25 5.76 13.69 1.00
[83,] 33.64 7.29 15.21 1.44
[84,] 36.00 7.29 26.01 2.56
[85,] 29.16 9.00 20.25 2.25
[86,] 36.00 11.56 20.25 2.56
[87,] 44.89 9.61 22.09 2.25
[88,] 39.69 5.29 19.36 1.69
[89,] 31.36 9.00 16.81 1.69
[90,] 30.25 6.25 16.00 1.69
[91,] 30.25 6.76 19.36 1.44
[92,] 37.21 9.00 21.16 1.96
[93,] 33.64 6.76 16.00 1.44
[94,] 25.00 5.29 10.89 1.00
[95,] 31.36 7.29 17.64 1.69
[96,] 32.49 9.00 17.64 1.44
[97,] 32.49 8.41 17.64 1.69
[98,] 38.44 8.41 18.49 1.69
[99,] 26.01 6.25 9.00 1.21
[100,] 32.49 7.84 16.81 1.69
[101,] 39.69 10.89 36.00 6.25
[102,] 33.64 7.29 26.01 3.61
[103,] 50.41 9.00 34.81 4.41
[104,] 39.69 8.41 31.36 3.24
[105,] 42.25 9.00 33.64 4.84
[106,] 57.76 9.00 43.56 4.41
[107,] 24.01 6.25 20.25 2.89
[108,] 53.29 8.41 39.69 3.24
[109,] 44.89 6.25 33.64 3.24
[110,] 51.84 12.96 37.21 6.25
[111,] 42.25 10.24 26.01 4.00
[112,] 40.96 7.29 28.09 3.61
[113,] 46.24 9.00 30.25 4.41
[114,] 32.49 6.25 25.00 4.00
[115,] 33.64 7.84 26.01 5.76
[116,] 40.96 10.24 28.09 5.29
[117,] 42.25 9.00 30.25 3.24
[118,] 59.29 14.44 44.89 4.84
[119,] 59.29 6.76 47.61 5.29
[120,] 36.00 4.84 25.00 2.25
[121,] 47.61 10.24 32.49 5.29
[122,] 31.36 7.84 24.01 4.00
[123,] 59.29 7.84 44.89 4.00
[124,] 39.69 7.29 24.01 3.24
[125,] 44.89 10.89 32.49 4.41
[126,] 51.84 10.24 36.00 3.24
[127,] 38.44 7.84 23.04 3.24
[128,] 37.21 9.00 24.01 3.24
[129,] 40.96 7.84 31.36 4.41
[130,] 51.84 9.00 33.64 2.56
[131,] 54.76 7.84 37.21 3.61
[132,] 62.41 14.44 40.96 4.00
[133,] 40.96 7.84 31.36 4.84
[134,] 39.69 7.84 26.01 2.25
[135,] 37.21 6.76 31.36 1.96
[136,] 59.29 9.00 37.21 5.29
[137,] 39.69 11.56 31.36 5.76
[138,] 40.96 9.61 30.25 3.24
[139,] 36.00 9.00 23.04 3.24
[140,] 47.61 9.61 29.16 4.41
[141,] 44.89 9.61 31.36 5.76
[142,] 47.61 9.61 26.01 5.29
[143,] 33.64 7.29 26.01 3.61
[144,] 46.24 10.24 34.81 5.29
[145,] 44.89 10.89 32.49 6.25
[146,] 44.89 9.00 27.04 5.29
[147,] 39.69 6.25 25.00 3.61
[148,] 42.25 9.00 27.04 4.00
[149,] 38.44 11.56 29.16 5.29
[150,] 34.81 9.00 26.01 3.24
Wonderful! Let’s have a look at the syntax:
apply(my_iris, 2, ...) - the basic function is constructed in the same way as normal
apply(my_iris, 2, function(x){ body of function }) - we use the function(x) argument in a similar way to usual, but we don’t close the apply() parentheses until after we’ve constructed the body of the function
This can be quite tricky to grasp fully the first few times you do it, but essentially, you’re telling R to look at the my_iris data, take each column, call it x, then perform a bit of code on it.
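As another sketch of the same pattern, here is an anonymous function that works out the spread (largest value minus smallest value) of each column. The names iris_num, spread and spread_one_line are made up for illustration.

```r
# iris_num is a hypothetical name for the four numeric iris columns
iris_num <- iris[, 1:4]

# Each column is passed in as x; the body returns a single number per column
spread <- apply(iris_num, 2, function(x) {
  max(x) - min(x)
})
spread

# When the body is a single expression, the {} can be left out
spread_one_line <- apply(iris_num, 2, function(x) max(x) - min(x))
identical(spread, spread_one_line)
```

Both versions do the same thing, so use whichever you find easier to read.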
We already know that this example is perhaps slightly redundant
because we can just raise the entire dataframe to the power 2, but
anonymous functions might be useful elsewhere. Let’s write a more
complex apply with an anonymous function to count the number of
NAs that we have in each column of our data. We’ll start
by getting R to identify where the NAs are, by writing an
anonymous function with is.na() inside apply:
apply(my_iris, 2, function(x){
is.na(x)
})
Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE
[5,] FALSE TRUE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE
[15,] FALSE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE
[19,] FALSE FALSE FALSE FALSE
[20,] FALSE FALSE FALSE FALSE
[21,] FALSE FALSE FALSE FALSE
[22,] FALSE FALSE FALSE FALSE
[23,] FALSE FALSE FALSE FALSE
[24,] FALSE FALSE FALSE FALSE
[25,] FALSE FALSE FALSE FALSE
[26,] FALSE FALSE FALSE FALSE
[27,] FALSE FALSE FALSE FALSE
[28,] FALSE FALSE FALSE FALSE
[29,] FALSE FALSE FALSE FALSE
[30,] FALSE FALSE FALSE FALSE
[31,] FALSE FALSE FALSE FALSE
[32,] FALSE FALSE FALSE FALSE
[33,] FALSE FALSE FALSE FALSE
[34,] FALSE FALSE FALSE FALSE
[35,] FALSE FALSE FALSE FALSE
[36,] FALSE FALSE FALSE FALSE
[37,] FALSE FALSE FALSE FALSE
[38,] FALSE FALSE FALSE FALSE
[39,] FALSE FALSE FALSE FALSE
[40,] FALSE FALSE FALSE FALSE
[41,] FALSE FALSE FALSE FALSE
[42,] FALSE FALSE FALSE FALSE
[43,] FALSE FALSE FALSE FALSE
[44,] FALSE FALSE FALSE FALSE
[45,] FALSE FALSE FALSE FALSE
[46,] FALSE FALSE FALSE FALSE
[47,] FALSE FALSE FALSE FALSE
[48,] FALSE FALSE FALSE FALSE
[49,] FALSE FALSE FALSE FALSE
[50,] FALSE FALSE FALSE FALSE
[51,] FALSE FALSE FALSE FALSE
[52,] FALSE FALSE FALSE FALSE
[53,] FALSE FALSE FALSE FALSE
[54,] FALSE FALSE FALSE FALSE
[55,] FALSE FALSE FALSE FALSE
[56,] FALSE FALSE FALSE FALSE
[57,] FALSE FALSE FALSE FALSE
[58,] FALSE FALSE FALSE FALSE
[59,] FALSE FALSE FALSE FALSE
[60,] FALSE FALSE FALSE FALSE
[61,] FALSE FALSE FALSE FALSE
[62,] FALSE FALSE FALSE FALSE
[63,] FALSE FALSE FALSE FALSE
[64,] FALSE FALSE FALSE FALSE
[65,] FALSE FALSE FALSE FALSE
[66,] FALSE FALSE FALSE FALSE
[67,] FALSE FALSE FALSE FALSE
[68,] FALSE FALSE FALSE FALSE
[69,] FALSE FALSE FALSE FALSE
[70,] FALSE FALSE FALSE FALSE
[71,] FALSE FALSE FALSE FALSE
[72,] FALSE FALSE FALSE FALSE
[73,] FALSE FALSE FALSE FALSE
[74,] FALSE FALSE FALSE FALSE
[75,] FALSE FALSE FALSE FALSE
[76,] FALSE FALSE FALSE FALSE
[77,] FALSE FALSE FALSE FALSE
[78,] FALSE FALSE FALSE FALSE
[79,] FALSE FALSE FALSE FALSE
[80,] FALSE FALSE FALSE FALSE
[81,] FALSE FALSE FALSE FALSE
[82,] FALSE FALSE FALSE FALSE
[83,] FALSE FALSE FALSE FALSE
[84,] FALSE FALSE FALSE FALSE
[85,] FALSE FALSE FALSE FALSE
[86,] FALSE FALSE FALSE FALSE
[87,] FALSE FALSE FALSE FALSE
[88,] FALSE FALSE FALSE FALSE
[89,] FALSE FALSE FALSE FALSE
[90,] FALSE FALSE FALSE FALSE
[91,] FALSE FALSE FALSE FALSE
[92,] FALSE FALSE FALSE FALSE
[93,] FALSE FALSE FALSE FALSE
[94,] FALSE FALSE FALSE FALSE
[95,] FALSE FALSE FALSE FALSE
[96,] FALSE FALSE FALSE FALSE
[97,] FALSE FALSE FALSE FALSE
[98,] FALSE FALSE FALSE FALSE
[99,] FALSE FALSE FALSE FALSE
[100,] FALSE FALSE FALSE FALSE
[101,] FALSE FALSE FALSE FALSE
[102,] FALSE FALSE FALSE FALSE
[103,] FALSE FALSE FALSE FALSE
[104,] FALSE FALSE FALSE FALSE
[105,] FALSE FALSE FALSE FALSE
[106,] FALSE FALSE FALSE FALSE
[107,] FALSE FALSE FALSE FALSE
[108,] FALSE FALSE FALSE FALSE
[109,] FALSE FALSE FALSE FALSE
[110,] FALSE FALSE FALSE FALSE
[111,] FALSE FALSE FALSE FALSE
[112,] FALSE FALSE FALSE FALSE
[113,] FALSE FALSE FALSE FALSE
[114,] FALSE FALSE FALSE FALSE
[115,] FALSE FALSE FALSE FALSE
[116,] FALSE FALSE FALSE FALSE
[117,] FALSE FALSE FALSE FALSE
[118,] FALSE FALSE FALSE FALSE
[119,] FALSE FALSE FALSE FALSE
[120,] FALSE FALSE FALSE FALSE
[121,] FALSE FALSE FALSE FALSE
[122,] FALSE FALSE FALSE FALSE
[123,] FALSE FALSE FALSE FALSE
[124,] FALSE FALSE FALSE FALSE
[125,] FALSE FALSE FALSE FALSE
[126,] FALSE FALSE FALSE FALSE
[127,] FALSE FALSE FALSE FALSE
[128,] FALSE FALSE FALSE FALSE
[129,] FALSE FALSE FALSE FALSE
[130,] FALSE FALSE FALSE FALSE
[131,] FALSE FALSE FALSE FALSE
[132,] FALSE FALSE FALSE FALSE
[133,] FALSE FALSE FALSE FALSE
[134,] FALSE FALSE FALSE FALSE
[135,] FALSE FALSE FALSE FALSE
[136,] FALSE FALSE FALSE FALSE
[137,] FALSE FALSE FALSE FALSE
[138,] FALSE FALSE FALSE FALSE
[139,] FALSE FALSE FALSE FALSE
[140,] FALSE FALSE FALSE FALSE
[141,] FALSE FALSE FALSE FALSE
[142,] FALSE FALSE FALSE FALSE
[143,] FALSE FALSE FALSE FALSE
[144,] FALSE FALSE FALSE FALSE
[145,] FALSE FALSE FALSE FALSE
[146,] FALSE FALSE FALSE FALSE
[147,] FALSE FALSE FALSE FALSE
[148,] FALSE FALSE FALSE FALSE
[149,] FALSE FALSE FALSE FALSE
[150,] FALSE FALSE FALSE FALSE
This prints out a long series of logical values, telling us whether
any given value is missing or not. We can count these, by wrapping
is.na() in the sum function, which will count
the incidences of TRUE in each column:
apply(my_iris, 2, function(x){
sum(is.na(x))
})
Sepal.Length Sepal.Width Petal.Length Petal.Width
0 1 0 0
We already know that only one of our columns should contain a single
NA value (because we put it there earlier!), so we can
quite easily verify that our complicated function is doing exactly what
we wanted it to! This is a really quick way of getting through things
that would take longer to construct with a
for loop. You might never have to use the apply functions
in your own code, but they are incredibly useful tools. You might, for
example, borrow some code from someone else on GitHub that includes
them, or someone might share code with you that has apply functions
included, so they’re a useful thing to be aware of.
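In keeping with the "be lazy" theme, there is also a vectorised way of counting NAs per column, using colSums() on the result of is.na(). A sketch, with made-up object names, that checks it against the apply version:

```r
# iris_num is a hypothetical copy of the numeric iris columns,
# with the same rogue NA popped into row 5, column 2
iris_num <- iris[, 1:4]
iris_num[5, 2] <- NA

# The apply() approach with an anonymous function
na_by_apply <- apply(iris_num, 2, function(x) sum(is.na(x)))

# is.na() on a whole dataframe gives a table of TRUE/FALSE values,
# and colSums() counts the TRUEs in each column
na_by_colsums <- colSums(is.na(iris_num))

all.equal(na_by_apply, na_by_colsums)
```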
lapply()
lapply() is essentially apply() for lists.
It works in much the same way, but on a different data structure.
You can feed lapply() either a list or a dataframe, and it will treat
either one as a list. In order to understand the output of lapply(), we
must be aware of how lists work in R.
A list is a way of storing data structures of different types and lengths in a single object in R, in a way that isn’t possible in a dataframe. Let’s quickly construct one and have a look:
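The line that builds the list does not appear in the text. Working backwards from the printed output, something like the following would produce it, although the exact original code is an assumption:

```r
# A list holding three vectors of different types and lengths:
# the numbers 1 to 50, three colour names, and the text "FALSE"
# (the printed output shows it in quotes, so it is stored as a character)
my_list <- list(1:50, c("red", "yellow", "green"), "FALSE")
```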
my_list
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
[38] 38 39 40 41 42 43 44 45 46 47 48 49 50
[[2]]
[1] "red" "yellow" "green"
[[3]]
[1] "FALSE"
Here we can see that each of the 3 vectors stored within the list is
of a different length (50, 3, and 1). Recall that we access the elements
of a list in a slightly different way to vectors or dataframes. To pull out
an element of the list but have it retain its list class,
we use single square brackets []. If we want the contents of the element
in its original vector form, we use double
square brackets [[]]. For example:
my_list[2]
[[1]]
[1] "red" "yellow" "green"
class(my_list[2])
[1] "list"
my_list[[2]]
[1] "red" "yellow" "green"
class(my_list[[2]])
[1] "character"
If we want to pull out a single item from a vector stored in the list, we use a
combination of double and single square brackets - we first ask R to
return the entire vector with [[]], and then index
it as usual with []. If we wanted to pull out the word
“green”, which is the 3rd item in the 2nd element of the list, we would
use:
my_list[[2]][3]
[1] "green"
Lists are complicated, but really common in R (sorry), particularly so when using functions on your data. Lots of the time, a function will output a series of values or information stored within a list, and it’s really useful to be able to extract that information in a way that means you’re always getting what you think you’re getting!
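As an illustration of a function that returns its results in a list, here is a sketch using t.test() from base R (chosen purely as an example; it is not otherwise part of this workshop). The elements of the result have names, so we can use those names inside [[]] or with $.

```r
# Sepal lengths for two of the iris species
setosa <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

# The test result is stored as a list with named elements
result <- t.test(setosa, versicolor)
is.list(result)
names(result)

# Pull out one element by name with double square brackets...
result[["p.value"]]

# ...or with the $ shortcut, which does the same job
result$p.value

# Then index into an element as usual with single square brackets
result[["conf.int"]][1]
```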
lapply()
Now that we have our list, we can use lapply() to
perform functions on each of the vectors. Let’s start by working out the
length of each of the elements of the list (note that you don’t have to
provide a dimension with lapply()):
lapply(my_list, length)
[[1]]
[1] 50
[[2]]
[1] 3
[[3]]
[1] 1
This gives us our output as a list. So now, we have a list of lengths of elements of our original list. Phew. It’s important to remember lapply() always returns values in a list. There is a way around this, but we’ll talk about that in a moment.
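The same pattern works with any function that makes sense for every element. As a sketch, here is class() applied to a list shaped like ours, followed by one base R option for flattening the result, unlist(). The names example_list and classes are made up for illustration.

```r
# example_list is a hypothetical list with the same shape as my_list
example_list <- list(1:50, c("red", "yellow", "green"), "FALSE")

# Ask for the class of each element - the answer comes back as a list
classes <- lapply(example_list, class)
classes
is.list(classes)

# unlist() flattens a simple list like this into an ordinary vector
unlist(classes)
```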
We don’t have to exclusively use lapply() on lists, we
can also use it on dataframes (which lapply() will coerce
into lists before it does anything with them). Let’s have a look at what
our my_iris data looks like as a list:
as.list(my_iris)
$Sepal.Length
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0
[28] 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
[55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5
[82] 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1
[136] 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
$Sepal.Width
[1] 3.5 3.0 3.2 3.1 NA 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4
[28] 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3
[55] 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5 3.2 2.8 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4
[82] 2.4 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0 3.0 2.5 2.9
[109] 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8 2.7 3.3 3.2 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6
[136] 3.0 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2 3.3 3.0 2.5 3.0 3.4 3.0
$Petal.Length
[1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6
[28] 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
[55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8
[82] 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
[109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6
[136] 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1
$Petal.Width
[1] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 0.2 0.2 0.1 0.1 0.2 0.4 0.4 0.3 0.3 0.3 0.2 0.4 0.2 0.5 0.2 0.2 0.4
[28] 0.2 0.2 0.2 0.2 0.4 0.1 0.2 0.2 0.2 0.2 0.1 0.2 0.2 0.3 0.3 0.2 0.6 0.4 0.3 0.2 0.2 0.2 0.2 1.4 1.5 1.5 1.3
[55] 1.5 1.3 1.6 1.0 1.3 1.4 1.0 1.5 1.0 1.4 1.3 1.4 1.5 1.0 1.5 1.1 1.8 1.3 1.5 1.2 1.3 1.4 1.4 1.7 1.5 1.0 1.1
[82] 1.0 1.2 1.6 1.5 1.6 1.5 1.3 1.3 1.3 1.2 1.4 1.2 1.0 1.3 1.2 1.3 1.3 1.1 1.3 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8
[109] 1.8 2.5 2.0 1.9 2.1 2.0 2.4 2.3 1.8 2.2 2.3 1.5 2.3 2.0 2.0 1.8 2.1 1.8 1.8 1.8 2.1 1.6 1.9 2.0 2.2 1.5 1.4
[136] 2.3 2.4 1.8 1.8 2.1 2.4 2.3 1.9 2.3 2.5 2.3 1.9 2.0 2.3 1.8
This has separated each of our four columns into its own list element (helpfully with the column names retained!). There are 4 elements in the list, each containing 150 observations. lapply() works on list elements, so for a dataframe it automatically applies the function to each column. Note that we haven’t overwritten my_iris as a list - it’s still a dataframe at the minute:
class(my_iris)
[1] "data.frame"
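The column-by-column behaviour can be checked directly. This is a minimal sketch that uses the first four columns of the built-in iris data as a stand-in for my_iris (an assumption, since my_iris was built earlier in the workshop and also contains an NA); the names iris_numeric and column_lengths are just for illustration:

```r
# Stand-in for my_iris: the four numeric columns of the built-in iris data
# (assumption - the workshop's my_iris also contains an NA)
iris_numeric <- iris[, 1:4]

# A dataframe is a list of columns under the hood
is.list(iris_numeric)
length(iris_numeric)        # number of columns = number of list elements

# lapply() therefore visits one column at a time
column_lengths <- lapply(iris_numeric, length)
column_lengths$Sepal.Width  # number of observations in that column
```

Each element of column_lengths is the number of rows, because every column of a dataframe has the same length.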
This was just to get an idea of how lapply() might
coerce our data into list format, and what that might mean for the
function we apply. Let’s use lapply() to find the mean of
each column:
lapply(my_iris, mean)
$Sepal.Length
[1] 5.843333
$Sepal.Width
[1] NA
$Petal.Length
[1] 3.758
$Petal.Width
[1] 1.199333
Oops - we forgot about our NA problem. We can pass extra
arguments on to mean() in the same way that we did in apply():
lapply(my_iris, mean, na.rm = TRUE)
$Sepal.Length
[1] 5.843333
$Sepal.Width
[1] 3.053691
$Petal.Length
[1] 3.758
$Petal.Width
[1] 1.199333
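Anything written after the function name is handed on to that function for every element. Here is a small sketch of that idea on a made-up dataframe (toy and its values are hypothetical, not the workshop data), showing that it is equivalent to spelling the call out in an anonymous function:

```r
# Small made-up dataframe with one missing value (hypothetical data)
toy <- data.frame(a = c(1, 2, 3), b = c(4, NA, 6))

# Extra arguments after the function name are passed on to mean()
means_dots <- lapply(toy, mean, na.rm = TRUE)

# The same thing written out with an anonymous function
means_anon <- lapply(toy, function(column) mean(column, na.rm = TRUE))

# Both approaches give exactly the same list
identical(means_dots, means_anon)

# unlist() flattens the list into a named numeric vector
unlist(means_dots)
```

The first form is shorter; the anonymous function is handy when the extra argument is not the only thing you need to change.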
sapply()
The last of the three apply() functions we’re going to
look at today is sapply(). This works almost identically to
lapply(), but by default it tries to simplify the output from a
list to a vector (or to a matrix, if the function returns more than one
value per element). This is arguably the most versatile of the three, as
it takes a vector, list or dataframe as input, and can return the results
to you in a variety of formats. Let’s start
with feeding sapply() a dataset, and continue with our
theme of finding the column means of my_iris:
sapply(my_iris, mean)
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.843333 NA 3.758000 1.199333
The function is constructed in an identical way to
lapply(), but the output is a named vector, which might be more
useful for downstream analysis. We remove the NA in exactly
the same way as before:
sapply(my_iris, mean, na.rm = TRUE)
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.843333 3.053691 3.758000 1.199333
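What sapply() hands back depends on what the function returns for each element. The sketch below again uses the first four columns of the built-in iris data as a stand-in for my_iris (an assumption); col_means and col_ranges are illustrative names:

```r
# Stand-in for my_iris (assumption): numeric columns of the built-in iris data
iris_numeric <- iris[, 1:4]

# One value per column -> a named vector
col_means <- sapply(iris_numeric, mean)
is.vector(col_means)

# Two values per column (minimum and maximum) -> a matrix,
# with one column per dataframe column and one row per returned value
col_ranges <- sapply(iris_numeric, range)
dim(col_ranges)
```

If the function returned results of different lengths for different elements, sapply() could not simplify them and would give you a list instead.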
We can also feed sapply() lists to work on. Let’s give
it the list we created earlier and ask it to work out the lengths of the
elements:
sapply(my_list, length)
[1] 50 3 1
We can also make sapply() return a list to us, and
almost revert it back to lapply(), by adding the
simplify = FALSE argument. By default, this is set to
TRUE in sapply():
sapply(my_list, length, simplify = FALSE)
[[1]]
[1] 50
[[2]]
[1] 3
[[3]]
[1] 1
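The "almost" matters in one situation. As a sketch, using a hypothetical stand-in for my_list (example_list, with elements of length 50, 3 and 1), the two calls agree exactly for list input, but sapply() names its result when it is given a character vector:

```r
# Hypothetical stand-in for my_list, with elements of length 50, 3 and 1
example_list <- list(1:50, c("a", "b", "c"), TRUE)

# For a list, sapply(simplify = FALSE) and lapply() give the same result
same_for_lists <- identical(
  sapply(example_list, length, simplify = FALSE),
  lapply(example_list, length)
)
same_for_lists

# For a character vector, sapply() uses the input values as names...
names_sapply <- names(sapply(c("x", "y"), nchar, simplify = FALSE))
names_sapply

# ...whereas lapply() leaves the result unnamed
names_lapply <- names(lapply(c("x", "y"), nchar))
names_lapply
```

The naming comes from sapply()'s USE.NAMES argument, which is TRUE by default.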
sapply() in particular, and the apply family in general,
are really useful tools in R that allow you to avoid for loops - we can
use sapply() to do some really substantial things for us in
R, way beyond what we’ve covered here. We can also use anonymous
functions with sapply() in the same way as we did in
apply(). Let’s write one that will take the reciprocal
of each number in the vector 1:10:
sapply(1:10, function(x){
1/x
})
[1] 1.0000000 0.5000000 0.3333333 0.2500000 0.2000000 0.1666667 0.1428571 0.1250000 0.1111111 0.1000000
And if we want the output in a list:
sapply(1:10, function(x){
1/x
}, simplify = FALSE)
[[1]]
[1] 1
[[2]]
[1] 0.5
[[3]]
[1] 0.3333333
[[4]]
[1] 0.25
[[5]]
[1] 0.2
[[6]]
[1] 0.1666667
[[7]]
[1] 0.1428571
[[8]]
[1] 0.125
[[9]]
[1] 0.1111111
[[10]]
[1] 0.1
We are iterating across a numeric vector and applying an anonymous function to each element: we take each value from 1:10, call it x, and return its reciprocal. We can do something similar to build character vectors - imagine we wanted to create a series of names for 24 chromosomes in a genome:
sapply(1:24, function(x){
paste0("chr",x)
})
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13"
[14] "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chr23" "chr24"
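To see what sapply() is saving us from, here is a sketch of the same chromosome-naming job written as a for loop, like the ones from last time (chr_names and chr_names_sapply are illustrative names):

```r
# Pre-allocate an empty character vector with one slot per chromosome
chr_names <- character(24)

# Fill in each slot in turn
for (i in 1:24) {
  chr_names[i] <- paste0("chr", i)
}

# sapply() does the looping and the collecting of results in one call
chr_names_sapply <- sapply(1:24, function(x) paste0("chr", x))

# Both approaches produce exactly the same vector
identical(chr_names, chr_names_sapply)
```

The for loop needs somewhere to store its results before it starts; sapply() builds the output for you.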
paste0() does the same job as paste(), but doesn’t put spaces between the pieces! This will output a series of chromosome names for you, without you having to faff. There is, however, a quicker way of doing both of these jobs, because division and paste0() are vectorised and work on a whole vector at once:
z <- 1:10
1/z
[1] 1.0000000 0.5000000 0.3333333 0.2500000 0.2000000 0.1666667 0.1428571 0.1250000 0.1111111 0.1000000
1/1:10
[1] 1.0000000 0.5000000 0.3333333 0.2500000 0.2000000 0.1666667 0.1428571 0.1250000 0.1111111 0.1000000
paste0("chr", 1:24)
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13"
[14] "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chr23" "chr24"
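A quick sketch to confirm that the shortcuts really do give the same answers as the sapply() versions (the variable names are illustrative). Note that 1/1:10 only works as intended because R evaluates the : before the /, so adding brackets makes the intention clearer:

```r
# Element-by-element version with sapply()
reciprocal_sapply <- sapply(1:10, function(x) 1 / x)

# Vectorised version; the brackets are optional but easier to read
reciprocal_vectorised <- 1 / (1:10)

# The two results are exactly the same
identical(reciprocal_sapply, reciprocal_vectorised)

# ':' is evaluated before '/', so these two are the same expression
identical(1 / 1:10, 1 / (1:10))
```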
It might seem counterintuitive to bother learning the
apply() family when shortcuts like these exist, but I promise there
will not always be an easier way - I just think it’s a really useful
skill to spot when you might be making your own life much harder than
you need to!
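As a sketch of a case with no one-line vectorised shortcut, imagine a list whose elements have different lengths. The read depths below are made up purely for illustration (depths and mean_depth are hypothetical names):

```r
# Made-up read depths for three chromosomes (hypothetical data);
# each chromosome has a different number of measurements
depths <- list(chr1 = c(12, 15, 9), chr2 = c(30, 28), chr3 = 7)

# mean() cannot be called on the whole list at once,
# so we apply it to each element in turn
mean_depth <- sapply(depths, mean)
mean_depth
```

Because the elements differ in length, they can’t be stored as columns of a matrix or dataframe, so this is exactly the kind of job sapply() is for.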