Hello! Welcome to the first of the advanced R workshops on the
LIFE4138 module. Our learning objectives for this session are:
- Learn some advanced programming skills in R, and understand that the
language can be used as a “strict” programming language similar to
Python etc.
- Think about control flow - we’ll introduce the concepts of control
flow, including for loops, if and else statements, and the ifelse()
function.
- We will learn to write our own functions, a vital tool when we are
building scripts. We will discuss the usability of custom functions to
avoid replication in your code.
- Finally, we will begin to have a look at our 4th data structure in R:
lists.
What do we mean by “advanced”?
So, the previous sessions have all been designed to give you an
overall grounding of the way that R works, and to try and understand the
different approaches that R users take to the language (baseR versus the
tidyverse approach for e.g.). For most R users, this is enough to get
by. You’ll be able to do the majority of the stuff that you need to
using only the skills that we’ve been through so far.
That being said, there are times, particularly when undertaking
bioinformatic analysis, when perhaps you might like to have a script
that will run independently without your input. Situations such as this
really highlight the importance of the advanced techniques that we’re
about to learn!
Taking it to the next step
The below will begin to give you greater flexibility and versatility
with your R coding, and allow you to begin to make R work for you in
exactly the way that you want it to. for loops, for example,
are a great principle to learn in general with your code, and learning
both the loops themselves as well as the alternatives available in R
will be incredibly useful to you.
Another nice thing about moving on to more advanced coding principles
in R is that they will help to deepen your understanding of how the
basics work. For example, the way that we handle and access vectors will
become clearer as we use more advanced techniques to work through
them.
But how do I become an “advanced” R user?!
Just following tutorials such as this will not make you into an
advanced R user. These tutorials can only go so far - I can introduce
you to a concept, but really what will deepen and strengthen your
understanding is by using the techniques with real-world data, beyond
the structured format of this tutorial.
The best way to learn R (or any programming language), is by
doing, playing, and experimenting.
Critically, you also need to learn how and where to access help.
Don’t be afraid to use Google, Stack Overflow, or to ask your friends or
colleagues for advice. A fresh pair of eyes is much more likely to spot
a rogue comma than you are after you’ve been staring at your screen for
hours. This might seem a little basic for an advanced R tutorial, but
looking for answers online is a great way to see other examples, look at
other people’s coding styles, get help, and importantly, learn to avoid
future mistakes. There are loads of wonderful R resources out there for
you to use, and they are getting better all the time.
Another thing you should develop as you become a more advanced R
user is your own coding philosophy. My approach is to do what works.
This sounds daft at first, but there are several people whose philosophy
it is to achieve an end result using as few lines of code as possible.
I like to do what works for me, whilst retaining an open mind to new
solutions. It is very easy to get stuck in a coding rut with R, and
continue to use a solution that might quickly become out of date or
impractical/unnecessarily difficult (think about the introduction of the
tidyverse in the 2010s for example!).
The above abilities and ideas will open up your coding abilities, and
allow you to really get to grips with R as both a language and a tool
for data wrangling, plotting, and downstream analysis.
Finally, the importance of your coding environment (I mean physical
environment, not your R environment) cannot be overstated. Find a way
to get “in the zone”, and allow yourself to concentrate solely on what
it is that you’re trying to achieve. Focus is important with code - it’s
so easy to make mistakes and end up with an analysis that hasn’t done what
you think it has. For me, I make a giant cup of tea, grab some biscuits,
get comfy with a warm jumper, and pop my headphones on. I listen to a
playlist that I have personally curated for a mixture of concentration,
positivity, and motivation. It sounds daft, but I promise, it works. If
I really don’t want to be disturbed, I shut my office door, cover the
window in the door, and turn the lights off (and put my lamp on - I’m
aiming for ambient lighting rather than pitch black!).
I know I just said finally, but actually properly finally, don’t
forget to take a break. If you’re tearing your hair out to debug a piece
of code, or if you’re really stuck and just can’t make something work,
put it down. Walk away, make yourself a cup of tea, have some cake, go
for a run, do whatever you need to for some mental space. Even sleep on
it if you have to. When you come back to it, often the answer will
present itself. If it doesn’t, everything will at least feel a little
easier than it did before. Coding can be frustrating, but learning to
manage and harness that frustration is a real skill!
Right. Let’s get to the meat of it.
Control flow
The first concept that we’ll cover today is control flow. You may
well have come across control flow in your forays with Python and Unix,
and the basic idea of how it works is largely similar in R. Control flow
is a fundamental part of the way that any programming language works,
not just R. It essentially provides a way to control the order in which a
set of functions is performed on your dataset.
There are several examples of control flow functions, but the ones we
will concentrate on today are for loops, if
and else statements, and the ifelse()
function. ifelse() isn’t strictly a control flow statement,
but a function from the baseR set of functions.
Before we get started, the way that control flow works is as
follows. Say, for example, you have a condition. You ask whether that
condition is true or false. If the condition is true, you perform a
function.

If it is false, you perform another function. This is easier to
understand with an analogy. We can use if else
statements to write a function to make ourselves a cup of tea. First, we
perform the boil kettle function. Then, we periodically check to see if
the water has boiled. If the outcome of “Has the water boiled” is
FALSE, we perform the wait a bit longer function. We check
again. If the outcome is TRUE, we perform the pour water
function, which then allows us to perform the drink tea function.
Admittedly, this analogy is flawed - it’s missing some vital steps
including adding a tea bag, adding any milk, waiting for the tea to
cool, and cutting yourself a slice of cake! That being said, it
demonstrates the basic concept of control flow nicely.
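As a rough sketch, the tea analogy can be written as real R code. The temperatures below are made up for illustration, and the names kettle_checks and actions are my own invention rather than anything built into R. It uses a for loop to do the "check again" step, which we’ll cover properly in a moment:

```r
# Hypothetical water temperatures (degrees C) at each check of the kettle
kettle_checks <- c(40, 70, 100)

# Empty character vector to record what we did at each check
actions <- character(0)

for (temp in kettle_checks) {
  if (temp >= 100) {
    action <- "pour water"         # condition is TRUE: the water has boiled
  } else {
    action <- "wait a bit longer"  # condition is FALSE: keep waiting
  }
  actions <- c(actions, action)
  print(action)
}
```

The first two checks print "wait a bit longer", and the third prints "pour water". Don’t worry about the details yet - the point is just that the condition decides which bit of code runs.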

A note on brackets
Before we go any further, you may have noticed that we often use
different types of brackets for various purposes in R. This is vitally
important to the language, and your code will not function in the way
that you expect it to if you get this wrong.
- () - rounded brackets are used to call functions, to hold the
condition of a control flow statement such as for or if, and to denote
the order of mathematical calculations
- [] - square brackets are used for indexing and subsetting
- {} - curly brackets are used to define blocks of code in
functions and statements. What I mean by this will become increasingly
apparent as we work through the below.
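Here’s a small sketch that uses all three bracket types in one place. The object names (values, second, total) are just ones I’ve made up for the example:

```r
values <- c(4, 8, 15)      # () call the c() function
second <- values[2]        # [] pick out the 2nd element of the vector
total <- (second + 2) * 3  # () set the order of the calculation

for (v in values) {        # () hold what we loop over, {} hold the body
  print(v)
}

second
# [1] 8
total
# [1] 30
```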
For loops
Despite if else statements being perhaps easier to wrap
our heads around, we’re going to start with for loops.
Essentially, a for loop is a way of iterating through a set
of values, performing a function on each in turn. These loops are
relatively straightforward and intuitive in R, once you get used to
using them.
The syntax is, of course, slightly different to Python, but the basic
concept is the same. Typically, for loops in R are a little
slow. There’s a much quicker way of performing these functions, but we
won’t go into that here - it’s important to understand how the loops work
first.
Let’s try one. We want to create a basic for loop which
will print all of the numbers from 1 to 10. We could of course do this
without a loop, by assigning an individual value to x, and printing them
in turn:
x <- 1
print(x)
[1] 1
x <- 2
print(x)
[1] 2
x <- 3
print(x)
[1] 3
What if we wanted to automate this? Let’s build that basic
for loop…
for (x in 1:10){
print(x)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
Brilliant, but let’s go through just how that works.
for - we start with the for statement. If
you are using an IDE, you might notice that this statement changes colour
to denote that it is one of R’s reserved control flow keywords.
x in 1:10 - we know that 1:10 produces a vector of all
of the numbers in sequence from 1 to 10. The x in bit just
means “for every value in the vector of”. The x could be anything you
want it to be, e.g. for (numbers in 1:10) would yield the
same result, as long as you also use numbers inside the body of the
loop. Note here that different programming languages follow
different conventions - in Python you’ll often see the loop variable
called i, as in for i in range(1, 11).
{} - the curly brackets contain the body of the loop.
What that means is that all of the values we have just specified above
will have the contents of the {} performed on them. Here,
we’ve used the print() function, which just prints our
values to the console.
So, this loop assigns x each value from 1 to 10 in sequence, and prints
it to the console.
This is an example of a super simple and straightforward control
statement. We could replace the 1:10 with anything we liked, and the
loop would cycle through the values presented to it in turn until it
reached the final one.
Now, printing the numbers 1 to 10 doesn’t seem overly useful. What if
we want to perform a mathematical operation on all of the numbers in a
sequence quickly? Let’s expand our for loop so that it adds
10 to every number in the sequence, and prints the result to the
console:
for (x in 1:10){
print(x + 10)
}
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] 18
[1] 19
[1] 20
Easy as that. Again, this is a really basic example of a
for loop, just so you can get to grips with the principle
of how they work in R.
We can also create for loops with vectors consisting of
character strings - let’s create one with 4 components:
years <- c("2019", "2020", "2021", "2022")
Remember that when we create a vector with more than one element, we
have to combine the elements with the c() function.
Let’s pop these in a loop that will print the sentence “The year is”
followed by each of the elements of the vector in turn:
years <- c("2019", "2020", "2021", "2022")
for (x in years){
y <- paste("The year is", x)
print(y)
}
[1] "The year is 2019"
[1] "The year is 2020"
[1] "The year is 2021"
[1] "The year is 2022"
This loops through the four years we have given in the vector, and
uses the paste() function to stick two character strings
together, printing each of the vector elements in turn. Notice that
we’ve created a new object within the loop, y - if you look
in your R environment window, or try and print y to the console with
print(y), you’ll see that y is stored as “The year is
2022”, which is the final iteration of the loop. It’s important to note
that the other iterations are not stored in the wider R environment, as
the loop overwrites itself every time it performs an iteration.
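If you do want to keep the result of every iteration, one common approach is to create an empty vector before the loop and fill it in by position. This is a minimal sketch - the name sentences is my own, and seq_along() simply gives us the positions 1 to 4 to loop over:

```r
years <- c("2019", "2020", "2021", "2022")

# Create an empty character vector with one slot per year
sentences <- character(length(years))

# Loop over the positions rather than the values, so we know where to store each result
for (i in seq_along(years)) {
  sentences[i] <- paste("The year is", years[i])
}

sentences
```

This time all four sentences are kept in sentences, rather than just the final one.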
Just to recap, a for loop runs through values in a vector
(and is quick in these small toy examples!), and does some work for you
inside the statement. Curly brackets {} denote the body of
the loop, which is where we tell R what we want it to do with our input.
Whilst for loops are great, every example that we’ve just run through
can actually be achieved with simpler, and more computationally
efficient code (sorry). Remember to think about your code, and consider
whether there might be easier solutions to a problem!
Vectorisation as a more efficient mechanism?
For our first example, printing 1 - 10 to the screen with a loop, we
could save a lot of time by explicitly popping x into its own vector
(rather than adding this to the conditions of a for loop),
and dealing with it that way:
x <- 1:10
print(x)
[1] 1 2 3 4 5 6 7 8 9 10
As you can see, this ultimately yields the same result, with
considerably less faffing. In these small toy examples, there isn’t a
whole lot of difference in computation time, but in much larger
datasets, this is certainly something worth considering. Generally,
vectorisation is much speedier than for loops.
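If you’d like to see the difference for yourself, here is a rough sketch using the base R system.time() function. The vector size of 100,000 is an arbitrary choice of mine, and the actual timings will vary from machine to machine, so I haven’t shown any output:

```r
n <- 1e5
x <- 1:n

# Loop version: add 10 to each element in turn, storing the results as we go
loop_result <- numeric(n)
loop_time <- system.time(
  for (i in seq_along(x)) {
    loop_result[i] <- x[i] + 10
  }
)

# Vectorised version: add 10 to the whole vector in one go
vec_time <- system.time(vec_result <- x + 10)

# Both approaches give the same answer
all.equal(loop_result, vec_result)

# Compare the elapsed times (in seconds)
loop_time["elapsed"]
vec_time["elapsed"]
```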
For the example where we added 10 to x, we can use the same
vectorisation principle. This is where R differs in its fundamental
makeup from, for example, bash, where you would have to
write a for loop to perform the following:
x <- 1:10 # recreate our 1:10 vector
print(x + 10)
[1] 11 12 13 14 15 16 17 18 19 20
Here, you don’t even need the print() function, just
typing x + 10 would be sufficient!
You don’t even really need to use loops on character vectors in R!
Let’s go back to our “the year is” example. We can avoid a loop by using
the paste() function on the vector explicitly, rather than
having to force it through a loop:
years <- c("2019", "2020", "2021", "2022")
paste("The year is", years)
[1] "The year is 2019" "The year is 2020" "The year is 2021" "The year is 2022"
So why bother with for loops in the first place?
All of this is not to say that for loops are rubbish and
useless in R - in fact there will likely be times where they are
unavoidable, and a vector based solution just won’t cut it. It is
however worth hammering home the point that they are really worth
thinking about quite carefully. There will often be a better way to do
whatever you’re trying to do - slow down when you’re coding. Stop and
think about what it is you’re trying to achieve. Beginners in particular
tend to learn how to create for loops, and then use them
for everything, when often, a vector based solution would be much
quicker and simpler!
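As one example of where a loop is a natural fit, consider a calculation where each value depends on the ones before it. The sketch below builds the first 10 Fibonacci numbers (my own choice of example) - each step needs the results of the previous two steps, so we can’t simply apply one operation to the whole vector at once:

```r
n_terms <- 10

# Empty numeric vector to fill in, with the first two terms set to 1
fib <- numeric(n_terms)
fib[1:2] <- 1

# Each new term is the sum of the two terms before it
for (i in 3:n_terms) {
  fib[i] <- fib[i - 1] + fib[i - 2]
}

fib
# [1]  1  1  2  3  5  8 13 21 34 55
```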
If statements
Now, on to if statements. These are another control flow
coding mechanism you can use in R and further afield to customise the
order that commands are run in, depending on conditions of your data.
Think about the kettle example - IF your water boiled,
then you perform the task. if statements ask whether a
condition has been met, before they perform a function. The best way to
explain this is, as always, with an example:
z <- 1
if (z < 10){
print("z is less than 10")
}
[1] "z is less than 10"
We can see that because we’ve given z the value of 1,
which is less than 10, we get an output telling us so (R is performing
the function!). What happens if we change the assignment of
z and try again?
z <- 45
if (z < 10){
print("z is less than 10")
}
We don’t get an output. Because the condition was not met, no
function was performed on our data.
The way that this is coded in R is that the provided condition (in
()) is run, to give a logical output. If that output is
TRUE, the body of the statement (in {}) is run.
If the output is FALSE, nothing will happen.
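You can see this for yourself by running the condition on its own, outside of the if statement. In this little sketch, condition_met is just a name I’ve made up to hold the logical output:

```r
z <- 45

# The condition on its own gives a single logical value
condition_met <- z < 10
condition_met
# [1] FALSE

# The if statement only runs its body when that value is TRUE
if (condition_met) {
  print("z is less than 10")
}
```

Because condition_met is FALSE, the body is skipped and nothing is printed.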
else statements
It isn’t a huge leap to ask whether we could take the above and
perform a different function if the condition was not met, and the
logical output was FALSE. We can achieve this by pairing
our if statement with an else statement. The
syntax of this can take a little getting used to, because it pairs two
sets of {} separated by an else statement.
Let’s demonstrate:
z <- 1
if (z < 10){
print("z is less than 10")
} else {
print("z is more than 10")
}
[1] "z is less than 10"
With an input of 1, we get a printed message telling us that z is
less than 10. What if we change z to be greater than 10?
z <- 20
if (z < 10){
print("z is less than 10")
} else {
print("z is more than 10")
}
[1] "z is more than 10"
As expected, R tells us that z is more than 10. This combination of
if and else statements runs a control flow
that takes an input, and if a condition is met, runs function a. If the
condition is not met, it runs function b. This basic principle allows
you to perform slightly more complex functions or analyses, based on the
input that you give.
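One thing to watch for: the else branch above runs whenever the condition is FALSE, which includes z being exactly 10. If that matters, you can chain conditions together with else if. This is a sketch rather than the only way to do it, and the "exactly 10" message and the result object are my own additions:

```r
z <- 10

if (z < 10) {
  result <- "z is less than 10"
} else if (z == 10) {
  result <- "z is exactly 10"
} else {
  result <- "z is more than 10"
}

print(result)
# [1] "z is exactly 10"
```

R checks each condition in turn, and runs the body of the first one that is TRUE.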
The ifelse() function
There is an alternative to if and else
control flow functions in R - the ifelse() function. This
doesn’t always perform in quite the way that you might expect, but it’s
certainly an option to consider. The syntax is perhaps a little more
familiar than the control flow statements… Let’s perform the same thing
we just did, but this time using an ifelse() function:
z <- 1
ifelse(z < 10, "less than 10", "more than 10")
[1] "less than 10"
z <- 20
ifelse(z < 10, "less than 10", "more than 10")
[1] "more than 10"
Great. All works the same so far! The ifelse() function
takes an input, evaluates it against a condition (here that it is <
10). If the condition is TRUE, it performs the first
option, if FALSE, does the 2nd.
Helpfully, we can also perform the ifelse() functions on
a vector. Let’s create a numeric vector from 1 - 60 in steps of 6 and
run our function again
z <- seq(from = 1, to = 60, by = 6)
ifelse(z < 10, "less than 10", "more than 10")
[1] "less than 10" "less than 10" "more than 10" "more than 10" "more than 10" "more than 10" "more than 10"
[8] "more than 10" "more than 10" "more than 10"
We end up with 10 outputs, each telling us whether a value is less
or more than 10! What happens if we try the same vector with
if and else statements instead?
z <- seq(from = 1, to = 60, by = 6)
if (z < 10){
print("z is less than 10")
} else {
print("z is more than 10")
}
Error in if (z < 10) { : the condition has length > 1
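You can see from the error above that you can’t perform
if else statements on a vector - the condition
is only expecting something with a length of 1, rather than, in this
case, 10. (In versions of R before 4.2.0, this gave a warning rather
than an error, and only the first element was used.)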
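If you did want to use if and else statements on a vector, one option is to wrap them in a for loop so that the condition only ever sees one value at a time. A minimal sketch, where labels is a name I’ve chosen for the results:

```r
z <- seq(from = 1, to = 60, by = 6)

# Empty character vector to hold one label per element of z
labels <- character(length(z))

for (i in seq_along(z)) {
  if (z[i] < 10) {
    labels[i] <- "less than 10"
  } else {
    labels[i] <- "more than 10"
  }
}

# The loop gives exactly the same answer as the one-line ifelse() version
identical(labels, ifelse(z < 10, "less than 10", "more than 10"))
# [1] TRUE
```

It works, but it’s a lot more typing than ifelse() for the same result!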
We can use ifelse() on character vectors too. Let’s
create one and try it out:
countries <- c("UK", "USA", "France", "Spain")
ifelse(countries == "UK", "UK", "Elsewhere")
[1] "UK" "Elsewhere" "Elsewhere" "Elsewhere"
The function is evaluating to see if each element of the input matches
the string “UK”, and if it does, returns “UK”, otherwise, returns
“Elsewhere”.
Fairly straightforward stuff!
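I mentioned that ifelse() doesn’t always behave in the way that you might expect. One example, as far as I understand the function, is that the output is always the same length as the condition, even if the options you give it are longer. The objects short and full are just names for this sketch:

```r
# The condition has a length of 1, so only the first element of c("a", "b") comes back
short <- ifelse(TRUE, c("a", "b"), "c")
short
# [1] "a"

# An if else statement, by contrast, hands back the whole vector
full <- if (TRUE) c("a", "b") else "c"
full
# [1] "a" "b"
```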
Writing your own functions:
Writing your own functions is an important part of programming. It
keeps code simple and readable, and allows you to perform a set of tasks
multiple times without loads of duplication in your code. If you wanted
to develop your own package, custom function creation would be a huge
part of this - a package is essentially just a collection of
functions!
The process of creating functions is quite straightforward - I do it
loads to make my life quick and easy. Let’s start by creating a simple
function that adds 10 to any numeric input:
add_ten <- function(x){
y <- x + 10
return(y)
}
add_ten(1)
[1] 11
So, here, we’ve created and tested a basic function. Let’s dissect it
a little:
- add_ten - first, we have to give our new function a
name. I’ve imaginatively called this one add_ten.
- <- - then, we assign the rest of the code to the new function name
with our assignment arrow.
- function - next up, we have the function keyword. This tells R
that we want to turn our function name into a new function.
- function(x) - this tells R that we are creating a
function with a single argument (or input), x.
- {} - this denotes the body of the function - whatever
code is in here is what will be applied to an input.
- y <- x + 10 - this bit’s obvious - add 10 to our
input of x and assign it a new name, y.
- return(y) - give us an output! Without this, the last line of the
body would be the assignment to y, whose value R returns invisibly, so
nothing would be printed to the console when we ran add_ten(1).
Notice here that the new y object is treated a little
differently in functions than loops. If we remove y from
our environment and run the function again, you’ll notice that it does
not re-appear. Whatever objects you create within the body of the
function exist only within that function - they are not callable to the
wider R environment.
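Here’s a quick sketch of that difference, using the base R exists() function to check whether y is in our environment. The inherits = FALSE argument just tells R to look only in the current environment, and y_after_function is a name I’ve made up to store the answer:

```r
add_ten <- function(x){
  y <- x + 10
  return(y)
}

# Make sure there is no y lying around before we start
if (exists("y", inherits = FALSE)) rm(y)

# Running the function gives us a result, but no y appears in our environment
result <- add_ten(1)
y_after_function <- exists("y", inherits = FALSE)
y_after_function
# [1] FALSE

# A loop, on the other hand, leaves y behind (holding the final iteration)
for (x in 1:3) {
  y <- x + 10
}
y
# [1] 13
```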
Functions can also take a vector input - let’s run our
add_ten() function with a vector and see what happens
add_ten <- function(x){
y <- x + 10
return(y)
}
z <- 1:10
add_ten(z)
[1] 11 12 13 14 15 16 17 18 19 20
Of course, we could perform this function manually, as it is a really
straightforward example. I’d be unlikely to bother creating a function
to do this in practice, but it’s a useful demonstration.
The anatomy of a function
Let’s quickly recap the anatomy of how a function works
function(x) declares a function. It creates the function
within the R environment, and defines the arguments that the function
requires. In our example above, whenever you run add_ten,
you must have an x argument, otherwise the function will not
run.
{} define the block of code within which the work that
the function does takes place. Here, y is only defined
within the function.
return(y) tells R to give us an output, and display what
the function has done.
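To see what happens when the x argument is missing, we can wrap the call in the base R try() function, which catches the error so that the rest of a script can carry on. This is only a sketch - outcome is a name I’ve made up:

```r
add_ten <- function(x){
  y <- x + 10
  return(y)
}

# Calling the function with no input fails, because x has no value
outcome <- try(add_ten(), silent = TRUE)

# try() hands back an object of class "try-error" when something went wrong
inherits(outcome, "try-error")
# [1] TRUE

# With an x argument, the function runs as normal
add_ten(5)
# [1] 15
```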
Functions with multiple arguments
We can extend our functions to take multiple arguments, or inputs,
using the basics that we’ve just learned. Functions that we have used
throughout the course so far have taken multiple arguments, so let’s try
and also create one. We’ll write a function that multiplies two numbers
together:
multiply <- function(x, y){
z <- x * y
return(z)
}
multiply(3,2)
[1] 6
In our R environment window, we can see a new section has been
created called ‘Functions’. Within that, we can see both our
add_ten and our multiply functions, and the
number of arguments that each takes.
Functions with default arguments
You’ll often find that several arguments in functions that you’re
running in R have default settings that you do not have to explicitly
code in order for R to use them. For example, in the
sum() and mean() functions, the default for
the na.rm argument is FALSE. We can edit our
multiply() function to have a default y value
of 10:
multiply <- function(x, y = 10){
z <- x * y
return(z)
}
multiply(3)
[1] 30
multiply(3, 2)
[1] 6
Here, we can see that if we include only 1 argument to the function
now, it will run and use the default of y = 10. If we include two
arguments ourselves, we override the default and the function performs
on both of the arguments that we have specified. You can hopefully see
how writing functions with multiple arguments and defaults isn’t too
much of a stretch from writing a function with a single argument, and
see how useful that might be in your coding!
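Two related things are worth a quick sketch here. First, you can name the arguments when you call a function, in which case the order no longer matters. Second, you can see a default being overridden in a function you already know, mean(). The vector of values below is made up for illustration:

```r
multiply <- function(x, y = 10){
  z <- x * y
  return(z)
}

# Naming the arguments means we can give them in any order
multiply(x = 3, y = 2)
# [1] 6
multiply(y = 2, x = 3)
# [1] 6

# mean() uses its default of na.rm = FALSE unless we tell it otherwise
values <- c(1, 2, NA)
mean(values)
# [1] NA
mean(values, na.rm = TRUE)
# [1] 1.5
```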
Combining functions and conditionals:
Now, we can combine all of the skills that we’ve learned today and
create complex functions with some control flow statements inside of
them. Let’s start by writing a function that evaluates whether a
numerical input is greater than or less than 10.
eval10 <- function(x){
y <- ifelse(x < 10, "less than 10", "greater than 10")
return(y)
}
eval10(1)
[1] "less than 10"
eval10(15)
[1] "greater than 10"
We’ve included a simple ifelse() function within our eval10 function,
which performs ifelse() on the input, x, assigns the result to a new
object, y, and returns the output. We could do the same with the explicit
if and else control flow statements:
eval10_2 <- function(x){
# choose the message based on the condition
if (x < 10){
y <- "less than 10"
} else {
y <- "greater than 10"
}
# return the message, rather than printing it from inside the function
return(y)
}
eval10_2(1)
[1] "less than 10"
eval10_2(15)
[1] "greater than 10"
Whilst this is a simple example, we could build up to creating some
incredibly complex functions with custom control flows to efficiently
handle our data and downstream analysis in R. It may well be useful for
things such as looking at data to see if it contains headers before
removing them, or assessing the length of a data set before performing a
function on it.
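As a small sketch of that last idea, here is a function that checks its input before doing any work. The name safe_mean and the choice to return NA for unsuitable input are my own assumptions - you might prefer a different behaviour in your own scripts:

```r
safe_mean <- function(x){
  # An empty input has nothing to average, so hand back NA
  if (length(x) == 0){
    return(NA)
  }
  # A non-numeric input can't be averaged either
  if (!is.numeric(x)){
    return(NA)
  }
  # If we get this far, the input is fine, so calculate the mean
  return(mean(x))
}

safe_mean(c(2, 4, 6))
# [1] 4
safe_mean(numeric(0))
# [1] NA
safe_mean(c("a", "b"))
# [1] NA
```

Notice that return() stops the function at that point, so the later lines only run if the earlier checks pass.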
Lists in R
Now, we’ve had a whistle stop tour of functions. Obviously this is
not the full extent of their functionality, but it’s time to move on to
lists. We’ve not really mentioned lists so far, other than very briefly
in some of the beginner sessions. To recap, there are 4 main data
structures in R - vectors, matrices, data.frames, and lists.
A list is a special form of data structure which allows you to
contain objects of different types and lengths within it. They are a
little more complicated than the other data structures we covered
before - partly because a data.frame is both a special type of list, and
can be contained within a list! In a data.frame, each component part (or
column) must be the same length - this is not true of a list.
They are quite complicated at first glance, and often are not taught
at all in R courses - I was never formally taught what a list was or how
on earth to use one, and I had to try and work it out for myself. Lists
are probably the form of data structure that I use the least in R,
simply because I never got into the habit of using them. Now, let’s make
a list and see how they work, and figure out how to access stuff within
them. First, we’ll create three vectors (one character, one numeric, and
one logical), and combine them with the list()
function.
nintendo <- c("Mario", "Luigi", "Peach")
numbers <- 1:100
logic <- c(TRUE, FALSE, FALSE, TRUE)
my_list <- list(nintendo, numbers, logic)
Now, we have our list. We can print it to the screen by typing the
name of our list, and it’ll print everything out.
my_list
[[1]]
[1] "Mario" "Luigi" "Peach"
[[2]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[27] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
[53] 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
[79] 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
[[3]]
[1] TRUE FALSE FALSE TRUE
From this, we can clearly see that the objects contained within the
list are of different lengths - nintendo has a length of 3, numbers has
a length of 100, and logic has a length of 4.
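Rather than counting by eye, we can ask R for these lengths directly with the base R length() and lengths() functions. The sketch below also checks the earlier claim that a data.frame is a special type of list - the characters data.frame and its player column are made up for the example:

```r
nintendo <- c("Mario", "Luigi", "Peach")
numbers <- 1:100
logic <- c(TRUE, FALSE, FALSE, TRUE)
my_list <- list(nintendo, numbers, logic)

# length() counts the elements of the list itself
length(my_list)
# [1] 3

# lengths() (note the s) gives the length of each element: 3, 100 and 4
lengths(my_list)

# A data.frame really is a list underneath
characters <- data.frame(name = nintendo, player = 1:3)
is.list(characters)
# [1] TRUE
```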
So, how do we access the list? It can seem complicated, but actually
printing it to the screen has given us lots of information. We can use
square brackets [] to choose objects. We can also access
elements with their names, provided we have given them names, but we
will come back to that.
Let’s try and select the first vector element of our list. We might
try and access it with:
my_list[1]
[[1]]
[1] "Mario" "Luigi" "Peach"
Whilst this does return what we want it to, this actually isn’t in
the vector class that we created the list with. We can check with the
class() function:
class(my_list[1])
[1] "list"
We can see here that the object has retained its list class, and has
taken the first part of the list, but not the vector itself. We can
access the vector within the list with double square brackets
[[]] - the clue is in the display of the list!
my_list[[1]]
[1] "Mario" "Luigi" "Peach"
class(my_list[[1]])
[1] "character"
When we verify what we’re doing with class(), we can see
that we’re getting the original character vector stored within the list.
This is a bit complex and hand-wavy, but basically remember that
[[]] allows you to access the element from within the list
in its native form, whereas [] gives you the element of the
list, whilst retaining its list class.
Be aware that the way we access lists with [[]] is
different to vectors. For example, if we were to type
my_list[1:2] we get the first two elements of the list, but
my_list[[1:2]] gives you the second item of the first
element of the list, because a vector inside [[]] is applied one
level at a time:
my_list[1:2]
[[1]]
[1] "Mario" "Luigi" "Peach"
[[2]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[27] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
[53] 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
[79] 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
my_list[[1:2]]
[1] "Luigi"
You can access bits of lists as above, or perhaps more intuitively
and less confusingly, by combining double and single square brackets
my_list[[1]][2] # This accesses the 2nd item of the 1st list element.
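To convince ourselves that the two approaches are doing the same thing, here is a short sketch that recreates the list and compares them:

```r
nintendo <- c("Mario", "Luigi", "Peach")
numbers <- 1:100
logic <- c(TRUE, FALSE, FALSE, TRUE)
my_list <- list(nintendo, numbers, logic)

# Double brackets pull out the vector, single brackets then index within it
my_list[[1]][2]
# [1] "Luigi"

# This gives the same answer as my_list[[1:2]]
identical(my_list[[1]][2], my_list[[1:2]])
# [1] TRUE

# The same pattern works for pulling several items out of an element
my_list[[2]][1:5]
# [1] 1 2 3 4 5
```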
You can also use names to access list elements in a way that is
equivalent to how a dataframe would work. In order to do this, however,
you must specify the names when you’re creating the list, like this:
my_list <- list(nintendo = nintendo, numbers = numbers, logic = logic)
If we print the list again, you’ll see that the [[1]], [[2]], and
[[3]] have been replaced by $nintendo, $numbers, and $logic:
my_list
$nintendo
[1] "Mario" "Luigi" "Peach"
$numbers
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[27] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
[53] 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
[79] 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
$logic
[1] TRUE FALSE FALSE TRUE
Now you can access your list using the names, but note that just like
a dataframe, my_list[[1]] will still work
my_list$nintendo
[1] "Mario" "Luigi" "Peach"
my_list[[1]]
[1] "Mario" "Luigi" "Peach"
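A few other things you can do with a named list are sketched below. The names() function and [[ ]] with a name in quotes are both base R; the sega element is a made-up addition of mine, just to show that you can add to a list after it has been created:

```r
nintendo <- c("Mario", "Luigi", "Peach")
numbers <- 1:100
logic <- c(TRUE, FALSE, FALSE, TRUE)
my_list <- list(nintendo = nintendo, numbers = numbers, logic = logic)

# See the names of the elements in the list
names(my_list)
# [1] "nintendo" "numbers"  "logic"

# Double square brackets also accept a name, which is equivalent to using $
identical(my_list[["nintendo"]], my_list$nintendo)
# [1] TRUE

# Add a new (hypothetical) element to the list by assigning to a new name
my_list$sega <- c("Sonic", "Tails")
length(my_list)
# [1] 4
```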
Why are lists important?
It perhaps isn’t immediately obvious why lists are important, but we
will talk more about this in the next session. It’s useful for you to
know how to use them, as they will make R more useful to you.
Briefly, lists can be helpful if you are working on lots of
dataframes/tibbles at the same time, as you can store them within a
list. The versatility of this will become clearer next time when we’re
talking about the apply family of functions, sapply(),
lapply(), and apply().
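Until we get to the apply family, here is a rough sketch of the idea using the tools from today. The two data.frames (control and treated) and their contents are entirely made up for illustration:

```r
# A list holding two data.frames with different numbers of rows
experiments <- list(
  control = data.frame(sample = 1:3, count = c(10, 12, 11)),
  treated = data.frame(sample = 1:4, count = c(20, 22, 19, 23))
)

# Empty vector to collect the number of rows in each data.frame
row_counts <- integer(0)

# Loop over the names of the list, pulling out each data.frame in turn
for (name in names(experiments)) {
  n_rows <- nrow(experiments[[name]])
  row_counts <- c(row_counts, n_rows)
  print(paste(name, "has", n_rows, "rows"))
}
```

This prints "control has 3 rows" and then "treated has 4 rows". The same loop would work whether the list held two data.frames or two hundred.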
Summary
So, today we have learned about control flow with for,
if, else, and ifelse(), and how
these are used to control the flow of data through a set of functions,
depending on whether conditions are met.
We’ve learned how to write our own functions, including with multiple
arguments and defaults, and have combined this with our new knowledge of
control flow.
Finally, we touched on lists, which feed nicely into the apply set of
functions that we’ll look at in more detail on Thursday. We can use
sapply(), lapply() and apply() to
efficiently process vectors in R, and we’ll learn more about the
principles of proper scripting.
---
title: "LIFE4138 - Advanced R workshop 1"
output: html_notebook
---

Hello! Welcome to the first of the advanced R workshops on the LIFE4138 module. Our learning objectives for this session are:

* Learn some advanced programming skills in R, and understand that the language can be used as a "strict" programming language similar to Python etc.
* Think about control flow - we'll introduce the concepts of control flow, including `for` loops, `if` and `else` statements, and the `ifelse()` function.
* We will learn to write our own functions, a vital tool when we are building scripts. We will discuss the usability of custom functions to avoid replication in your code.
* Finally, we will begin to have a look at our 4th data structure in R - lists.

# What do we mean by "advanced"?

So, the previous sessions have all been designed to give you an overall grounding of the way that R works, and to try and understand the different approaches that R users take to the language (baseR versus the tidyverse approach for e.g.). For most R users, this is enough to get by. You'll be able to do the majority of the stuff that you need to using only the skills that we've been through so far.

That being said, there are times, particularly when undertaking bioinformatic analysis, when perhaps you might like to have a script that will run independently without your input. Situations such as this really highlight the importance of the advanced techniques that we're about to learn!

# Taking it to the next step

The below will begin to give you a greater flexibilty and versatility with your R coding, and allow you to begin to make R work for you in exactly the way that you want it to. `for` loops for example are a great principle to learn in general with your code, and learning both the loops themselves as well as the alternatives available in R will be incredibly useful to you. 

Another nice thing about moving on to more advanced coding principles in R is that they will help to deepen your understanding of how the basics work.  For example, the way that we handle and access vectors will become clearer as we use more advanced techniques to work through them.

# But how do I become an "advanced" R user?!

* Just following tutorials such as this will not make you into an advanced R user. These tutorials can only go so far - I can introduce you to a concept, but really what will deepen and strengthen your understanding is by using the techniques with real-world data, beyond the structured format of this tutorial.

* The best way to learn R (or any programming language), is by doing, playing, and experimenting.

* Critically, you also need to learn how and where to access help. Don't be afraid to use Google, Stack Overflow, or to ask your friends or colleagues for advice. A fresh pair of eyes is much more likely to spot a rogue comma than you are after you've been staring at your screen for hours. This might seem a little basic for an advanced R tutorial, but looking for answers online is a great way to see other examples, look at other people's coding styles, get help, and importantly, learn to avoid future mistakes. There are loads of wonderful R resources out there for you to use, and they are getting better with time.

* Another thing you should develop as you become a more advanced R user is your own coding philosophy. My approach is to do what works. This sounds daft at first, but there are several people whose philosophy is to achieve an end result using as few lines of code as possible. I like to do what works for me, whilst retaining an open mind to new solutions. It is very easy to get stuck in a coding rut with R, and continue to use a solution that might quickly become out of date or impractical/unnecessarily difficult (think about the introduction of the tidyverse in the 2010s for example!).

The above abilities and ideas will open up your coding abilities, and allow you to really get to grips with R as both a language and a tool for data wrangling, plotting, and downstream analysis. 

Finally, the importance of your coding environment (I mean physical environment, not your R environment) cannot be overstated. Find a way to get "in the zone", and allow yourself to concentrate solely on what it is that you're trying to achieve. Focus is important with code - it's so easy to make mistakes and end up with an analysis that hasn't done what you think it has. For me, I make a giant cup of tea, grab some biscuits, get comfy with a warm jumper, and pop my headphones on. I listen to a playlist that I have personally curated for a mixture of concentration, positivity, and motivation. It sounds daft, but I promise, it works. If I really don't want to be disturbed, I shut my office door, cover the window in the door, and turn the lights off (and put my lamp on - I'm aiming for ambient lighting rather than pitch black!).

I know I just said finally, but actually properly finally, don't forget to take a break. If you're tearing your hair out to debug a piece of code, or if you're really stuck and just can't make something work, put it down. Walk away, make yourself a cup of tea, have some cake, go for a run, do whatever you need to for some mental space. Even sleep on it if you have to. When you come back to it, often the answer will present itself. If it doesn't, everything will at least feel a little easier than it did before. Coding can be frustrating, but learning to manage and harness that frustration is a real skill!

Right. Let's get to the meat of it.

# Control flow

The first concept that we'll cover today is control flow. You may well have come across control flow in your forays with Python and Unix, and the basic idea of how it works is largely similar in R. Control flow is a fundamental part of the way that any programming language works, not just R. It essentially provides a way to control the order in which a set of operations is performed on your data.

There are several examples of control flow functions, but the ones we will concentrate on today are `for` loops, `if` and `else` statements, and the `ifelse()` function. `ifelse()` isn't strictly a control flow statement, but a function from the baseR set of functions.

Before we get started, control flow works as follows. Say, for example, you have a condition. You ask whether that condition is true or false. If the condition is true, you perform a function.

![](control_flow.png)

If it is false, you perform another function. This is easier to understand with an analogy. We can use `if` `else` statements to write a function to make ourselves a cup of tea. First, we perform the boil kettle function. Then, we periodically check to see if the water has boiled. If the outcome of "Has the water boiled" is `FALSE`, we perform the wait a bit longer function. We check again. If the outcome is `TRUE`, we perform the pour water function, which then allows us to perform the drink tea function. Admittedly, this analogy is flawed - it's missing some vital steps including adding a tea bag, adding any milk, waiting for the tea to cool, and cutting yourself a slice of cake! That being said, it demonstrates the basic concept of control flow nicely.

![](control_flow2.png)

## A note on brackets

Before we go any further, you may have noticed that we often use different types of brackets for various purposes in R. This is vitally important to the language, and your code will not function in the way that you expect it to if you get this wrong.

* `()` - round brackets are used to call functions, to wrap the condition of a control flow statement, or to denote the order of mathematical calculations
* `[]` - square brackets are used for indexing and subsetting
* `{}` - curly brackets are used to define blocks of code in functions and statements. What I mean by this will become increasingly apparent as we work through the below.
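
To see all three bracket types side by side, here is a small sketch. The `scores` vector is just something I've made up for illustration, and we'll cover the `if` statement properly further down, so don't worry about it for now:

```{r}
scores <- c(12, 7, 30)        # () calls the c() function

scores[2]                     # [] picks out the 2nd element of the vector

(scores[1] + scores[2]) * 2   # () also sets the order of a calculation

if (scores[3] > 20){          # {} wraps the block of code that belongs to the if
  print("The third score is above 20")
}
```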


## For loops

Despite `if else` statements being perhaps easier to wrap our heads around, we're going to start with `for` loops. Essentially, a `for` loop is a way of iterating through a set of values, performing a function on each in turn. These loops are relatively straightforward and intuitive in R, once you get used to using them. 

The syntax is, of course, slightly different to Python, but the basic concept is the same. Typically, `for` loops in R are a little slow. There's a much quicker way of performing these functions, but we won't go into that just yet - it's important to understand how the loops work first.

Let's try one. We want to create a basic `for` loop which will print all of the numbers from 1 to 10. We could of course do this without a loop, by assigning an individual value to x, and printing them in turn:

```{r}
x <- 1
print(x)

x <- 2
print(x)

x <- 3
print(x)

# and so on...
```

What if we wanted to automate this? Let's build that basic `for` loop...

```{r}
for (x in 1:10){
  print(x)
}
```

Brilliant, but let's go through just how that works.

* `for` - we start with the `for` statement. If you are using an IDE, you might notice that this statement changes colour to denote that it is one of R's reserved words.
* `x in 1:10` - we know that `1:10` produces a vector of all of the numbers in sequence from 1 to 10. The `x in` bit just means "for every value in the vector of". The x could be anything you want it to be, e.g. `for (numbers in 1:10)` would yield the same result. Note here that different programming languages follow different conventions - in Python, the equivalent loop would be written as `for i in range(1, 11):`.
* `{}` - the curly brackets contain the body of the loop. What that means is that all of the values we have just specified above will have the contents of the `{}` performed on them. Here, we've used the `print()` function, which just prints our values to the console.

So, this loop assigns x a value of 1 - 10 in a sequence, and prints it to the console.

This is an example of a super simple and straightforward control statement. We could replace the `1:10` with any vector we liked, and the loop would cycle through the values presented to it in turn until it reached the final one.

Now, printing the numbers 1 to 10 doesn't seem overly useful. What if we want to perform a mathematical operation on all of the numbers in a sequence quickly? Let's expand our `for` loop so that it adds 10 to every number in the sequence, and prints the result to the console:

```{r}
for (x in 1:10){
  print(x + 10)
}
```

Easy as that. Again, this is a really basic example of a `for` loop, just so you can get to grips with the principle of how they work in R. 

We can also create `for` loops with vectors consisting of character strings - let's create one with 4 components:
```{r}
years <- c("2019", "2020", "2021", "2022")
```
*Remember that when we create a vector with more than one element, we have to use the `c()` function*

Let's pop these in a loop that will print the sentence "The year is" followed by each of the elements of the vector in turn:

```{r}
years <- c("2019", "2020", "2021", "2022")

for (x in years){
  y <- paste("The year is", x)
  print(y)
}
```

This loops through the four years we have given in the vector, and uses the `paste()` function to stick two character strings together, printing each of the vector elements in turn. Notice that we've created a new object within the loop, `y` - if you look in your R environment window, or try and print y to the console with `print(y)`, you'll see that y is stored as "The year is 2022", which is the final iteration of the loop. It's important to note that the other iterations are not stored in the wider R environment, as `y` is overwritten every time the loop performs an iteration.
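
If we did want to keep the result of every iteration, one common approach is to create an empty vector first and fill it in by position. This is only a minimal sketch - `sentences` and `i` are names I've picked for illustration, and `seq_along()` simply gives us the positions 1 to 4:

```{r}
years <- c("2019", "2020", "2021", "2022")

# Create an empty character vector with one slot per year
sentences <- character(length(years))

# Loop over the positions (1, 2, 3, 4) rather than the values themselves
for (i in seq_along(years)){
  sentences[i] <- paste("The year is", years[i])
}

sentences
```

This time, all four sentences are still available in `sentences` once the loop has finished.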

Just to recap, a `for` loop runs through values in a vector (and is quick in these small toy examples!), and does some work for you inside the statement. Curly brackets `{}` denote the body of the loop, which is where we tell R what we want it to do with our input. Whilst for loops are great, every example that we've just run through can actually be achieved with simpler, and more computationally efficient code (sorry). Remember to think about your code, and consider whether there might be easier solutions to a problem!

## Vectorisation as a more efficient mechanism?

For our first example, printing 1 - 10 to the screen with a loop, we could save a lot of time by explicitly popping x into its own vector (rather than adding this to the conditions of a `for` loop), and dealing with it that way:

```{r}
x <- 1:10
print(x)
```

As you can see, this ultimately yields the same result, with considerably less faffing. In these small toy examples, there isn't a whole lot of difference in computation time, but in much larger datasets, this is certainly something worth considering. Generally, vectorisation is much speedier than for loops.
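
If you'd like to see the difference for yourself, here is a rough sketch using `system.time()` on a vector of a million numbers. The object names are my own, and the exact timings will depend on your machine, so treat this as an illustration rather than a proper benchmark:

```{r}
# A larger vector than our toy example, so the difference is measurable
big <- 1:1e6

# Loop version: create an empty output vector, then fill it one element at a time
loop_result <- numeric(length(big))
loop_time <- system.time(
  for (i in seq_along(big)){
    loop_result[i] <- big[i] + 10
  }
)

# Vectorised version: one operation on the whole vector
vector_time <- system.time(
  vector_result <- big + 10
)

# Both approaches give the same answer...
all.equal(loop_result, vector_result)

# ...so compare the "elapsed" times of the two
loop_time
vector_time
```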

For the example where we added 10 to x, we can use the same vectorisation principle. This is where R differs in its fundamental design to, for example, bash, where you would **have** to do a `for` loop to perform the following:

```{r}
x <- 1:10 # recreate our 1:10 vector
print(x + 10)
```

Here, you don't even need the `print()` function, just typing `x + 10` would be sufficient!

You don't even really need to use loops on character vectors in R! Let's go back to our "the year is" example. We can avoid a loop by using the `paste()` function on the vector explicitly, rather than having to force it through a loop:

```{r}
years <- c("2019", "2020", "2021", "2022")
paste("The year is", years)
```

## So why bother with for loops in the first place?

All of this is not to say that `for` loops are rubbish and useless in R - in fact there will likely be times where they are unavoidable, and a vector based solution just won't cut it. It is however worth hammering home the point that they are really worth thinking about quite carefully. There will often be a better way to do whatever you're trying to do - slow down when you're coding. Stop and think about what it is you're trying to achieve. Beginners in particular tend to learn how to create `for` loops, and then use them for everything, when often, a vector based solution would be much quicker and simpler!
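
One situation where a loop is a natural fit is when each step depends on the result of the step before it. Below is a sketch of a simple logistic growth calculation - the starting value, growth rate, and carrying capacity are made-up numbers, purely for illustration:

```{r}
generations <- 10      # how many generations to simulate
growth_rate <- 0.5     # made-up growth rate
capacity <- 1000       # made-up carrying capacity

# Create an empty vector and set the starting population
cells <- numeric(generations)
cells[1] <- 10

# Start at 2, because generation 1 is our starting value.
# Each generation is calculated from the one before it.
for (g in 2:generations){
  previous <- cells[g - 1]
  cells[g] <- previous + growth_rate * previous * (1 - previous / capacity)
}

cells
```

Because we can't work out generation 5 without first knowing generation 4, there's no simple one-line vector operation to replace this loop.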


# `if` statements

Now, on to `if` statements. These are another control flow coding mechanism you can use in R and further afield to customise the order that commands are run in, depending on conditions of your data. Think about the kettle example - **IF** your water boiled, then you perform the task. `if` statements ask whether a condition has been met, before they perform a function. The best way to explain this is, as always, with an example:
```{r}
z <- 1
if (z < 10){
  print("z is less than 10")
}
```

We can see that because we've given `z` the value of 1, which is less than 10, we get an output telling us so (R is performing the function!). What happens if we change the assignment of `z` and try again?

```{r}
z <- 45
if (z < 10){
  print("z is less than 10")
}
```

We don't get an output. Because the condition was not met, no function was performed on our data.

The way that this is coded in R is that the provided condition (in `()`) is run, to give a logical output. If that output is `TRUE`, the body of the statement (in `{}`) is run. If the output is `FALSE`, nothing will happen.
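
To make that a bit more concrete, we can run the condition on its own and look at the logical output before handing it to `if`. This is just a sketch, and `is_small` is a name I've made up:

```{r}
z <- 45

# The condition on its own just gives back TRUE or FALSE
z < 10

# We can store that logical value and hand it to if
is_small <- z < 10

if (is_small){
  print("z is less than 10")
}
```

Here `is_small` is `FALSE`, so the body of the statement is skipped.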

# `else` statements

It isn't a huge leap to ask whether we could take the above and perform a different function if the condition was not met, and the logical output was `FALSE`. We can achieve this by pairing our `if` statement with an `else` statement. The syntax of this can take a little getting used to, because it pairs two sets of `{}` separated by an `else` statement. Let's demonstrate:

```{r}
z <- 1

if (z < 10){
  print("z is less than 10")
} else {
  print("z is more than 10")
}
```

With an input of 1, we get a printed message telling us that z is less than 10. What if we change z to be greater than 10?

```{r}
z <- 20

if (z < 10){
  print("z is less than 10")
} else {
  print("z is more than 10")
}
```

As expected, R tells us that z is more than 10. This combination of `if` and `else` statements runs a control flow that takes an input, and if a condition is met, runs function a. If the condition is not met, it runs function b. This basic principle allows you to perform slightly more complex functions or analyses, based on the input that you give.
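
One thing to watch out for: the `else` branch catches *everything* that fails the condition, so a z of exactly 10 would also be reported as "more than 10". If that matters, you can chain in an extra condition with `else if`. A minimal sketch, storing the message in an object I've called `result`:

```{r}
z <- 10

if (z < 10){
  result <- "z is less than 10"
} else if (z == 10){
  result <- "z is exactly 10"
} else {
  result <- "z is more than 10"
}

print(result)
```

R works through the conditions from top to bottom, and runs the first block whose condition is `TRUE`.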

# The `ifelse()` function

There is an alternative to `if` and `else` control flow statements in R - the `ifelse()` function. This doesn't always perform in quite the way that you might expect, but it's certainly an option to consider. The syntax is perhaps a little more familiar than the control flow statements... Let's perform the same thing we just did, but this time using the `ifelse()` function:

```{r}
z <- 1
ifelse(z < 10, "less than 10", "more than 10")
```


```{r}
z <- 20
ifelse(z < 10, "less than 10", "more than 10")
```

Great. All works the same so far! The `ifelse()` function takes an input and evaluates it against a condition (here, that it is < 10). If the condition is `TRUE`, it returns the first value; if `FALSE`, it returns the second.

Helpfully, we can also perform the `ifelse()` function on a vector. Let's create a numeric vector from 1 - 60 in steps of 6 and run our function again:

```{r}
z <- seq(from = 1, to = 60, by = 6)
ifelse(z < 10, "less than 10", "more than 10")
```

We end up with 10 outputs, each telling us whether a result is less or more than 10! What happens if we try the same vector with explicit `if` and `else` statements?

```{r, error=TRUE}
z <- seq(from = 1, to = 60, by = 6)

if (z < 10){
  print("z is less than 10")
} else {
  print("z is more than 10")
}
```

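You can see from the error above that you can't perform `if` `else` statements on a vector - the condition is only expecting something with a length of 1, rather than, in this case, 10. (In versions of R before 4.2.0, this gave a warning rather than an error, and only the first element of the vector was tested.)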
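
If you do need to use a whole vector in an `if` condition, one option is to collapse it down to a single `TRUE` or `FALSE` first with the baseR functions `any()` or `all()`. A quick sketch:

```{r}
z <- seq(from = 1, to = 60, by = 6)

# any() asks "is at least one value TRUE?" and returns a single TRUE or FALSE
if (any(z < 10)){
  print("at least one value of z is less than 10")
}

# all() asks "is every value TRUE?"
if (all(z < 10)){
  print("every value of z is less than 10")
} else {
  print("not every value of z is less than 10")
}
```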


We can use `ifelse()` on character vectors too. Let's create one and try it out:

```{r}
countries <- c("UK", "USA", "France", "Spain")
ifelse(countries == "UK", "UK", "Elsewhere")

```

The function checks whether each element of the input matches the string "UK". If it does, it returns "UK"; otherwise, it returns "Elsewhere". Fairly straightforward stuff!


# Writing your own functions:

Writing your own functions is an important part of programming. It keeps code simple and readable, and allows you to perform a set of tasks multiple times without loads of duplication in your code. If you wanted to develop your own package, custom function creation would be a huge part of this - a package is essentially just a collection of functions!

The process of creating functions is quite straightforward - I do it loads to make my life quick and easy. Let's start by creating a simple function that adds 10 to any numeric input:

```{r}
add_ten <- function(x){
  y <- x + 10
  return(y)
}

add_ten(1)
```

So, here, we've created and tested a basic function. Let's dissect it a little:

* `add_ten` - first, we have to give our new function a name. I've imaginatively called this one add_ten. 
* Then, we assign the rest of the code to the new function name with our assignment arrow `<-`
* Next up, we have the `function` keyword. This tells R that what follows is the definition of a new function
* `function(x)` - this tells R that we are creating a function with a single argument (or input), x.
* `{}` this denotes the body of the function - whatever code is in here is what will be applied to an input
* `y <- x + 10` - this bit's obvious - add 10 to our input of x and assign the result a new name, `y`.
* `return(y)` - give us an output! R functions hand back the value of the last expression they evaluate, but here that would be the assignment `y <- x + 10`, which R returns without printing anything. Using `return(y)` makes the output explicit, and means that it is printed when we call the function.

Notice here that the new `y` object is treated a little differently in functions than loops. If we remove `y` from our environment and run the function again, you'll notice that it does not re-appear. Whatever objects you create within the body of the function exist only within that function - they are not callable to the wider R environment.
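
We can check this for ourselves with the `exists()` function. In this sketch I've renamed the object inside the function to `inner_value` (a made-up name), so that it can't be confused with anything already sitting in your environment:

```{r}
add_ten <- function(x){
  inner_value <- x + 10   # inner_value is created inside the function
  return(inner_value)
}

add_ten(1)

# exists() checks whether an object with this name can be found
exists("inner_value")
```

Assuming you haven't created an object called `inner_value` yourself, `exists()` should tell you that it can't be found, even though the function has just used it.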

Functions can also take a vector input - let's run our `add_ten()` function with a vector and see what happens

```{r}
add_ten <- function(x){
  y <- x + 10
  return(y)
}

z <- 1:10
add_ten(z)
```

Of course, we could perform this function manually, as it is a really straightforward example. I'd be unlikely to bother creating a function to do this in practice, but it's a useful demonstration. 


# The anatomy of a function

Let's quickly recap the anatomy of how a function works

`function(x)` declares a function. It creates the function within the R environment, and defines the arguments that the function requires. In our example above, whenever you run `add_ten`, you *must* have an x argument, otherwise the function will not run.

`{}` define the block of code within which the work that the function does takes place. Here, `y` is only defined within the function.

`return(y)` tells R to give us an output, and display what the function has done.
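
To see what happens when the x argument is left out, we can call `add_ten()` with nothing in the brackets. I've wrapped the call in `tryCatch()`, which goes a little beyond what we're covering today - all it does here is capture the error message as a character string, so that the rest of the notebook keeps running:

```{r}
add_ten <- function(x){
  y <- x + 10
  return(y)
}

# tryCatch() lets us capture the error message rather than stopping the notebook
missing_x <- tryCatch(
  add_ten(),
  error = function(e) conditionMessage(e)
)

missing_x
```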


# Functions with multiple arguments

We can extend our functions to take multiple arguments, or inputs, using the basics that we've just learned. Functions that we have used throughout the course so far have taken multiple arguments, so let's try and also create one. We'll write a function that multiplies two numbers together:
```{r}
multiply <- function(x, y){
  z <- x * y
  return(z)
}

multiply(3,2)
```

In our R environment window, we can see a new section has been created called 'Functions'. Within that, we can see both our `add_ten` and our `multiply` functions, and the number of arguments that each takes. 
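
If you aren't using an IDE with an environment window, you can ask R about a function's arguments directly. A small sketch using the baseR functions `args()` and `formals()`:

```{r}
multiply <- function(x, y){
  z <- x * y
  return(z)
}

# args() prints the function's arguments
args(multiply)

# formals() returns them as a list, so names() gives just the argument names
names(formals(multiply))
```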

# Functions with default arguments

You'll often find that several arguments in functions that you're running in R have default settings that you do not have to explicitly code in order for R to perform them. For example, in the `sum()` and `mean()` functions, the default for the `na.rm` argument is `FALSE`. We can edit our `multiply()` function to have a default `y` value of 10:

```{r}
multiply <- function(x, y = 10){
  z <- x * y
  return(z)
}

multiply(3)
multiply(3, 2)
```

Here, we can see that if we include only 1 argument to the function now, it will run and use the default of y = 10. If we include two arguments ourselves, we override the default and the function performs on both of the arguments that we have specified. You can hopefully see how writing functions with multiple arguments and defaults isn't too much of a stretch from writing a function with a single argument, and see how useful that might be in your coding!
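
It's also worth knowing that you can name the arguments when you call your function, just as you would with `na.rm = TRUE`. A quick sketch with the same `multiply()` function:

```{r}
multiply <- function(x, y = 10){
  z <- x * y
  return(z)
}

multiply(x = 3, y = 2)   # the same as multiply(3, 2)
multiply(y = 2, x = 3)   # with names, the order no longer matters
multiply(x = 3)          # y falls back to its default of 10
```

Naming arguments makes your code easier to read, particularly once a function has more than a couple of inputs.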

# Combining functions and conditionals:

Now, we can combine all of the skills that we've learned today and create complex functions with some control flow statements inside of them. Let's start by writing a function that evaluates whether a numerical input is greater than or less than 10.

```{r}
eval10 <- function(x){
  y <- ifelse(x < 10, "less than 10", "greater than 10")
  return(y)
}

eval10(1)
eval10(15)

```

We've included a simple `ifelse()` function within our `eval10()` function, which evaluates the input x, stores the result in a new object, y, and returns it as the output. We could do the same with the explicit `if` and `else` control flow statements:

```{r}
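eval10_2 <- function(x){
  # if/else gives back the value of whichever branch runs, so we can store it in y
  y <- if (x < 10){
    "less than 10"
  } else {
    "greater than 10"
  }
  # return y explicitly, to match eval10()
  return(y)
}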

eval10_2(1)
eval10_2(15)
```

Whilst this is a simple example, we could build up to creating some incredibly complex functions with custom control flows to efficiently handle our data and downstream analysis in R. It may well be useful for things such as looking at data to see if it contains headers before removing them, or assessing the length of a data set before performing a function on it. 
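
As a rough sketch of that last idea, here is a function that checks its input before doing any work. `safe_mean()` is a name I've made up for illustration, and the messages it returns are just one way you might choose to handle things:

```{r}
safe_mean <- function(x){
  # If there is nothing to average, return a message instead
  if (length(x) == 0){
    return("No data supplied")
  }

  # If the input is not numeric, return a different message
  if (!is.numeric(x)){
    return("Input must be numeric")
  }

  # Otherwise, go ahead and calculate the mean
  return(mean(x))
}

safe_mean(c(2, 4, 6))
safe_mean(numeric(0))
safe_mean(c("a", "b"))
```

Notice that `return()` ends the function straight away, so the later checks only run if the earlier ones have passed.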

# Lists in R

Now, we've had a whistle-stop tour of functions. Obviously this is not the full extent of their functionality, but it's time to move on to lists. We've not really mentioned lists so far, other than very briefly in some of the beginner sessions. To recap, there are 4 main data structures in R - vectors, matrices, data.frames, and lists.

A list is a special form of data structure which allows you to contain objects of different lengths within it. They are a little more complicated than the other data structures we covered before - partly because a data.frame is both a special type of list, and can be contained within a list! In a data.frame, each component part (or column) must be the same length - this is not true of a list.

They are quite complicated at first glance, and often are not taught at all in R courses - I was never formally taught what a list was or how on earth to use one, and I had to try and work it out for myself. Lists are probably the form of data structure that I use the least in R, simply because I never got into the habit of using them. Now, let's make a list and see how they work, and figure out how to access stuff within them. First, we'll create three vectors (one character, one numeric, and one logical), and combine them with the `list()` function.

```{r}
nintendo <- c("Mario", "Luigi", "Peach")
numbers <- 1:100
logic <- c(TRUE, FALSE, FALSE, TRUE)

my_list <- list(nintendo, numbers, logic)
```
Now, we have our list. We can print it to the screen by typing the name of our list, and it'll print everything out.

```{r}
my_list
```
From this, we can clearly see that the objects contained within the list are of different lengths - nintendo has a length of 3, numbers has a length of 100, and logic has a length of 4.
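
Rather than counting by eye, we can ask R for this information directly. A short sketch using the baseR functions `length()`, `lengths()`, and `str()`:

```{r}
nintendo <- c("Mario", "Luigi", "Peach")
numbers <- 1:100
logic <- c(TRUE, FALSE, FALSE, TRUE)
my_list <- list(nintendo, numbers, logic)

length(my_list)    # how many elements the list holds
lengths(my_list)   # the length of each element in turn
str(my_list)       # a compact summary of the structure
```

`str()` is particularly handy for big lists, where printing the whole thing would flood your console.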

So, how do we access the list? It can seem complicated, but actually printing it to the screen has given us lots of information. We can use square brackets `[]` to choose objects. We can also access elements with their names, provided we have given them names, but we will come back to that.

Let's try and select the first vector element of our list. We might try and access it with:

```{r}
my_list[1]
```

Whilst this does return what we want it to, this actually isn't in the vector class that we created the list with. We can check with the `class()` function:

```{r}
class(my_list[1])
```

We can see here that the object has retained its list class, and has taken the first part of the list, but not the vector itself. We can access the vector within the list with double square brackets `[[]]` - the clue is in the display of the list!

```{r}
my_list[[1]]

class(my_list[[1]])
```

When we verify what we're doing with `class()`, we can see that we're getting the original character vector stored within the list. This is a bit complex and hand-wavy, but basically remember that `[[]]` allows you to access the element from within the list in its native form, whereas `[]` gives you the element of the list, whilst retaining its list class.

Be aware that the way we access lists with `[]` is different than vectors. For example, if we were to type `my_list[1:2]` we get the first two elements of the list, but `my_list[[1:2]]` gives you the second element of the first bit of the list:

```{r}
my_list[1:2]

my_list[[1:2]]
```
You can access bits of lists as above, or perhaps more intuitively and less confusingly, by combining double and single square brackets:

```{r}
my_list[[1]][2] # This accesses the 2nd item of the 1st list element.
```

You can also use names to access list elements in a way that is equivalent to how a dataframe would work. In order to do this, however, you must specify the names when you're creating the list, like this:

```{r}
my_list <- list(nintendo = nintendo, numbers = numbers, logic = logic)
```
If we print the list again, you'll see that the `[[1]]`, `[[2]]`, and `[[3]]` have been replaced by `$nintendo`, `$numbers`, and `$logic`:
```{r}
my_list
```
Now you can access your list using the names, but note that just like a dataframe, `my_list[[1]]` will still work:

```{r}
my_list$nintendo

my_list[[1]]
```
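
A couple of other ways of working with names are sketched below - `names()` lists them, and double square brackets will accept a name in quotes as well as a number:

```{r}
nintendo <- c("Mario", "Luigi", "Peach")
numbers <- 1:100
logic <- c(TRUE, FALSE, FALSE, TRUE)
my_list <- list(nintendo = nintendo, numbers = numbers, logic = logic)

names(my_list)          # the names we gave each element

my_list[["nintendo"]]   # the same as my_list$nintendo

my_list$nintendo[2]     # the 2nd item of the nintendo element
```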

## Why are lists important?

It perhaps isn't immediately obvious why lists are important, but we will talk more about this in the next session. It's useful for you to know how to use them, as they will make R more useful to you.

Briefly, lists can be helpful if you are working on lots of dataframes/tibbles at the same time, as you can store them within a list. The versatility of this will become clearer next time when we're talking about the apply family of functions, `sapply()`, `lapply()`, and `apply()`.
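
As a small taste of this, here is a sketch using the `for` loops we learned earlier. The two data frames and their contents are made up purely for illustration:

```{r}
# Two small made-up data frames, with different numbers of rows
site_a <- data.frame(sample = c("a1", "a2", "a3"), count = c(5, 8, 2))
site_b <- data.frame(sample = c("b1", "b2"), count = c(10, 4))

# Store both in a named list
sites <- list(site_a = site_a, site_b = site_b)

# Loop over the names, pulling each data frame out with [[ ]]
for (site in names(sites)){
  total <- sum(sites[[site]]$count)
  print(paste(site, "has a total count of", total))
}
```

The same few lines would work whether the list held 2 data frames or 200, which is where lists really start to earn their keep.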

# Summary

So, today we have learned about control flow with `for`, `if`, `else`, and `ifelse()`, and how these are used to control the flow of data through a set of functions, depending on whether conditions are met. 

We've learned how to write our own functions, including with multiple arguments and defaults, and have combined this with our new knowledge of control flow.

Finally, we touched on lists, which feed nicely into the apply set of functions that we'll look at in more detail on Thursday. We can use `sapply()`, `lapply()` and `apply()` to efficiently process vectors in R, and we'll learn more about the principles of proper scripting.



