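Hello! Welcome to the first of the advanced R workshops on the LIFE4138 module. Our learning objectives for this session are:

  • Learn some advanced programming skills in R, and understand that the language can be used as a “strict” programming language similar to Python etc.
  • Think about control flow - we’ll introduce the concepts of control flow, including for loops, if and else statements, and the ifelse() function.
  • We will learn to write our own functions, a vital tool when we are building scripts. We will discuss the usability of custom functions to avoid replication in your code.
  • Finally, we will begin to have a look at our 4th data structure in R - lists.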

What do we mean by “advanced”?

So, the previous sessions have all been designed to give you an overall grounding in the way that R works, and to help you understand the different approaches that R users take to the language (base R versus the tidyverse approach, for example). For most R users, this is enough to get by. You’ll be able to do the majority of the stuff that you need to using only the skills that we’ve been through so far.

That being said, there are times, particularly when undertaking bioinformatic analysis, when perhaps you might like to have a script that will run independently without your input. Situations such as this really highlight the importance of the advanced techniques that we’re about to learn!

Taking it to the next step

The below will begin to give you greater flexibility and versatility with your R coding, and allow you to begin to make R work for you in exactly the way that you want it to. for loops, for example, are a great principle to learn in general with your code, and learning both the loops themselves and the alternatives available in R will be incredibly useful to you.

Another nice thing about moving on to more advanced coding principles in R is that they will help to deepen your understanding of how the basics work. For example, the way that we handle and access vectors will become clearer as we use more advanced techniques to work through them.

But how do I become an “advanced” R user?!

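  • Just following tutorials such as this will not make you into an advanced R user. These tutorials can only go so far - I can introduce you to a concept, but really what will deepen and strengthen your understanding is using the techniques with real-world data, beyond the structured format of this tutorial.
  • The best way to learn R (or any programming language) is by doing, playing, and experimenting.
  • Critically, you also need to learn how and where to access help. Don’t be afraid to use Google, Stack Overflow, or to ask your friends or colleagues for advice. A fresh pair of eyes is much more likely to spot a rogue comma than you are after you’ve been staring at your screen for hours. This might seem a little basic for an advanced R tutorial, but looking for answers online is a great way to see other examples, look at other people’s coding styles, get help, and importantly, learn to avoid future mistakes. There are loads of wonderful R resources out there for you to use, and they are getting better all the time.
  • Another thing you should develop as you become a more advanced R user is your own coding philosophy. My approach is to do what works. This sounds daft at first, but there are several people whose philosophy is to achieve an end result using as few lines of code as possible. I like to do what works for me, whilst retaining an open mind to new solutions. It is very easy to get stuck in a coding rut with R, and continue to use a solution that might quickly become out of date or impractical/unnecessarily difficult (think about the introduction of the tidyverse in the 2010s for example!).

The above habits and ideas will open up your coding abilities, and allow you to really get to grips with R as both a language and a tool for data wrangling, plotting, and downstream analysis.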

Finally, the importance of your coding environment (I mean physical environment, not your R environment) cannot be overstated. Find a way to get “in the zone”, and allow yourself to concentrate solely on what it is that you’re trying to achieve. Focus is important with code: it’s so easy to make mistakes and end up with an analysis that hasn’t done what you think it has. For me, I make a giant cup of tea, grab some biscuits, get comfy with a warm jumper, and pop my headphones on. I listen to a playlist that I have personally curated for a mixture of concentration, positivity, and motivation. It sounds daft, but I promise, it works. If I really don’t want to be disturbed, I shut my office door, cover the window in the door, and turn the lights off (and put my lamp on - I’m aiming for ambient lighting rather than pitch black!).

I know I just said finally, but actually properly finally, don’t forget to take a break. If you’re tearing your hair out to debug a piece of code, or if you’re really stuck and just can’t make something work, put it down. Walk away, make yourself a cup of tea, have some cake, go for a run, do whatever you need to for some mental space. Even sleep on it if you have to. When you come back to it, often the answer will present itself. If it doesn’t, everything will at least feel a little easier than it did before. Coding can be frustrating, but learning to manage and harness that frustration is a real skill!

Right. Let’s get to the meat of it.

Control flow

The first concept that we’ll cover today is control flow. You may well have come across control flow in your forays with Python and Unix, and the basic idea of how it works is largely similar in R. Control flow is a fundamental part of the way that any programming language works, not just R. It essentially provides a way to control the order in which a set of functions is performed on your dataset.

There are several control flow constructs, but the ones we will concentrate on today are for loops, if and else statements, and the ifelse() function. ifelse() isn’t strictly a control flow statement; it is an ordinary function from base R.

Before we get started, here is how control flow works. Say, for example, you have a condition. You ask whether that condition is true or false. If the condition is true, you perform a function.

If it is false, you perform another function. This is easier to understand with an analogy. We can use if else statements to write a function to make ourselves a cup of tea. First, we perform the boil kettle function. Then, we periodically check to see if the water has boiled. If the outcome of “Has the water boiled” is FALSE, we perform the wait a bit longer function. We check again. If the outcome is TRUE, we perform the pour water function, which then allows us to perform the drink tea function. Admittedly, this analogy is flawed - it’s missing some vital steps including adding a tea bag, adding any milk, waiting for the tea to cool, and cutting yourself a slice of cake! That being said, it demonstrates the basic concept of control flow nicely.
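
To make the tea analogy a little more concrete, here is a rough sketch of a single “Has the water boiled?” check written as R code. The object names (water_temp, boiling_point, next_step) and the numbers are made up for illustration, and we’ll go through the if and else syntax properly later in the session.

```r
# Hypothetical values, purely for illustration
water_temp <- 85      # current temperature of the water, in degrees C
boiling_point <- 100  # the temperature we are waiting for

# Ask the question "Has the water boiled?" and act on the answer
if (water_temp >= boiling_point) {
  next_step <- "pour water"
} else {
  next_step <- "wait a bit longer"
}

print(next_step)
```

With the water at 85 degrees the condition is FALSE, so the else branch runs and we wait a bit longer.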

A note on brackets

Before we go any further, you may have noticed that we often use different types of brackets for various purposes in R. This is vitally important to the language, and your code will not function in the way that you expect it to if you get this wrong.

  • () - round brackets are used to call functions, to wrap the condition of statements such as for and if, and to set the order of mathematical calculations
  • [] - square brackets are used for indexing and subsetting
  • {} - curly brackets are used to define blocks of code in functions and statements

What I mean by this will become increasingly apparent as we work through the below.
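
As a quick illustration, here is a small sketch that uses all three kinds of bracket together. The object names (values, second_value, total) are just made up for this example.

```r
# () call a function, and control the order of a calculation
values <- c(2, 4, 6)   # c() is a function call
(1 + 2) * 3            # the sum inside () is done first, giving 9

# [] pick out elements of an object
second_value <- values[2]   # the 2nd element, which is 4

# {} wrap a block of code, here the body of a for loop
total <- 0
for (v in values) {
  total <- total + v
}
total   # 2 + 4 + 6 = 12
```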

For loops

Despite if else statements being perhaps easier to wrap our heads around, we’re going to start with for loops. Essentially, a for loop is a way of iterating through a set of values, performing a function on each in turn. These loops are relatively straightforward and intuitive in R, once you get used to using them.

The syntax is, of course, slightly different to Python, but the basic concept is the same. for loops in R can be a little slow compared with the alternatives. There’s often a much quicker way of performing these operations, but we won’t go into that just yet - it’s important to understand how the loops work first.

Let’s try one. We want to create a basic for loop which will print all of the numbers from 1 to 10. We could of course do this without a loop, by assigning an individual value to x, and printing them in turn:

x <- 1
print(x)
[1] 1
x <- 2
print(x)
[1] 2
x <- 3
print(x)
[1] 3

What if we wanted to automate this? Let’s build that basic for loop…

for (x in 1:10){
  print(x)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

Brilliant, but let’s go through just how that works.

  • for - we start with the for statement. If you are using an IDE, you might notice that this statement changes colour to denote that it is one of R’s reserved words.
  • x in 1:10 - we know that 1:10 produces a vector of all of the numbers in sequence from 1 to 10. The x in bit just means “for every value in the vector of”. The x could be anything you want it to be, e.g. for (number in 1:10) would yield the same result. Note here that different programming languages follow different conventions - in Python you’ll often see the same loop written as for i in range(1, 11):.
  • {} - the curly brackets contain the body of the loop. What that means is that all of the values we have just specified above will have the contents of the {} performed on them. Here, we’ve used the print() function, which just prints our values to the console.

So, this loop assigns x each value from 1 to 10 in turn, and prints it to the console.

This is an example of a super simple and straightforward control statement. We could replace the 1:10 with anything we liked, and the loop would cycle through the values presented to it in turn until it reached the final one.
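
For instance, here is a small sketch that loops over a vector we have made ourselves instead of 1:10. The vector read_counts and its values are invented for illustration. The second loop uses seq_along(), a base R function that returns the positions 1, 2, 3 and so on, which is handy when you need the position as well as the value.

```r
# A made-up vector of read counts for three samples
read_counts <- c(120, 45, 300)

# Loop over the values themselves
for (count in read_counts) {
  print(count)
}

# Loop over the positions, and use [] to look up each value
for (i in seq_along(read_counts)) {
  print(paste("Sample", i, "has", read_counts[i], "reads"))
}
```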

Now, printing the numbers 1 to 10 doesn’t seem overly useful. What if we want to perform a mathematical operation on all of the numbers in a sequence quickly? Let’s expand our for loop so that it adds 10 to every number in the sequence, and prints the result to the console:

for (x in 1:10){
  print(x + 10)
}
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] 18
[1] 19
[1] 20

Easy as that. Again, this is a really basic example of a for loop, just so you can get to grips with the principle of how they work in R.

We can also create for loops with vectors consisting of character strings - let’s create one with 4 components:

years <- c("2019", "2020", "2021", "2022")

Remember that when we create a vector from several values, we have to use the c() function.

Let’s pop these in a loop that will print the sentence “The year is” followed by each of the elements of the vector in turn:

years <- c("2019", "2020", "2021", "2022")

for (x in years){
  y <- paste("The year is", x)
  print(y)
}
[1] "The year is 2019"
[1] "The year is 2020"
[1] "The year is 2021"
[1] "The year is 2022"

This loops through the four years we have given in the vector, and uses the paste() function to stick two character strings together, printing each of the vector elements in turn. Notice that we’ve created a new object within the loop, y - if you look in your R environment window, or try to print y to the console with print(y), you’ll see that y is stored as “The year is 2022”, which is the final iteration of the loop. It’s important to note that the other iterations are not stored in the wider R environment, as the loop overwrites y every time it performs an iteration.
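
If you did want to keep the result of every iteration, one common approach is to create an empty vector before the loop and fill in one slot per iteration. This is only a sketch, and the name sentences is made up for illustration.

```r
years <- c("2019", "2020", "2021", "2022")

# Create an empty character vector with one slot per year
sentences <- character(length(years))

for (i in seq_along(years)) {
  # Store each sentence in its own slot, so nothing gets overwritten
  sentences[i] <- paste("The year is", years[i])
}

sentences
```

After the loop has finished, sentences holds all four strings, not just the last one.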

Just to recap, a for loop runs through values in a vector (and is quick in these small toy examples!), and does some work for you inside the statement. Curly brackets {} denote the body of the loop, which is where we tell R what we want it to do with our input. Whilst for loops are great, every example that we’ve just run through can actually be achieved with simpler, and more computationally efficient code (sorry). Remember to think about your code, and consider whether there might be easier solutions to a problem!

Vectorisation as a more efficient mechanism?

For our first example, printing 1 - 10 to the screen with a loop, we could save a lot of time by explicitly popping x into its own vector (rather than adding this to the conditions of a for loop), and dealing with it that way:

x <- 1:10
print(x)
 [1]  1  2  3  4  5  6  7  8  9 10

As you can see, this ultimately yields the same result, with considerably less faffing. In these small toy examples, there isn’t a whole lot of difference in computation time, but in much larger datasets, this is certainly something worth considering. Generally, vectorisation is much speedier than for loops.

For the example where we added 10 to x, we can use the same vectorisation principle. This is where R differs in its fundamental design from, for example, bash, where you would have to use a for loop to perform the following:

x <- 1:10 # recreate our 1:10 vector
print(x + 10)
 [1] 11 12 13 14 15 16 17 18 19 20

Here, you don’t even need the print() function - at the console, just typing x + 10 would be sufficient!

You don’t even really need to use loops on character vectors in R! Let’s go back to our “the year is” example. We can avoid a loop by using the paste() function on the vector explicitly, rather than having to force it through a loop:

years <- c("2019", "2020", "2021", "2022")
paste("The year is", years)
[1] "The year is 2019" "The year is 2020" "The year is 2021" "The year is 2022"

So why bother with for loops in the first place?

All of this is not to say that for loops are rubbish and useless in R - in fact there will likely be times where they are unavoidable, and a vector based solution just won’t cut it. It is however worth hammering home the point that they are really worth thinking about quite carefully. There will often be a better way to do whatever you’re trying to do - slow down when you’re coding. Stop and think about what it is you’re trying to achieve. Beginners in particular tend to learn how to create for loops, and then use them for everything, when often, a vector based solution would be much quicker and simpler!
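
One situation where a loop is a natural fit is when each step depends on the result of the step before it. Below is a toy sketch of a population growing towards a maximum size. All of the names and numbers (n_generations, growth_rate, capacity) are invented for illustration and are not taken from any real dataset.

```r
# Made-up settings for a toy growth model
n_generations <- 5
growth_rate <- 0.5    # how quickly the population grows
capacity <- 1000      # the maximum size the population can reach

# Empty numeric vector to hold one population size per generation
population <- numeric(n_generations)
population[1] <- 100  # starting population size

for (gen in 2:n_generations) {
  # Each value is calculated from the one before it
  previous <- population[gen - 1]
  population[gen] <- previous + growth_rate * previous * (1 - previous / capacity)
}

population
```

Because generation 3 cannot be calculated until generation 2 is known, there is no obvious way to do this in a single vectorised step.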

If statements

Now, on to if statements. These are another control flow coding mechanism you can use in R and further afield to customise the order that commands are run in, depending on conditions of your data. Think about the kettle example - IF your water boiled, then you perform the task. if statements ask whether a condition has been met, before they perform a function. The best way to explain this is, as always, with an example:

z <- 1
if (z < 10){
  print("z is less than 10")
}
[1] "z is less than 10"

We can see that because we’ve given z the value of 1, which is less than 10, we get an output telling us so (R is performing the function!). What happens if we change the assignment of z and try again?

z <- 45
if (z < 10){
  print("z is less than 10")
}

We don’t get an output. Because the condition was not met, no function was performed on our data.

The way that this works in R is that the provided condition (in ()) is evaluated to give a logical output. If that output is TRUE, the body of the statement (in {}) is run. If the output is FALSE, nothing will happen.
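
You can see this for yourself by running the condition on its own. In the sketch below, is_small is a made-up name for the stored logical value.

```r
z <- 45

# The condition on its own is just a logical value
z < 10

# We can store that value and hand it to if
is_small <- z < 10

if (is_small) {
  print("z is less than 10")
}
# Nothing is printed here, because is_small is FALSE
```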

else statements

It isn’t a huge leap to ask whether we could take the above and perform a different function if the condition was not met, and the logical output was FALSE. We can achieve this by pairing our if statement with an else statement. The syntax of this can take a little getting used to, because it pairs two sets of {} separated by an else statement. Let’s demonstrate:

z <- 1

if (z < 10){
  print("z is less than 10")
} else {
  print("z is more than 10")
}
[1] "z is less than 10"

With an input of 1, we get a printed message telling us that z is less than 10. What if we change z to be greater than 10?

z <- 20

if (z < 10){
  print("z is less than 10")
} else {
  print("z is more than 10")
}
[1] "z is more than 10"

As expected, R tells us that z is more than 10. This combination of if and else statements runs a control flow that takes an input, and if a condition is met, runs function a. If the condition is not met, it runs function b. This basic principle allows you to perform slightly more complex functions or analyses, based on the input that you give.
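
One thing to watch with the example above is that the else branch catches everything that fails the condition, including z being exactly 10. If that matters, you can chain a second condition with else if. This is a small sketch, and message_z is a made-up object name.

```r
z <- 10

if (z < 10) {
  message_z <- "z is less than 10"
} else if (z == 10) {
  # Only checked when the first condition was FALSE
  message_z <- "z is exactly 10"
} else {
  message_z <- "z is more than 10"
}

print(message_z)
```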

The ifelse() function

There is an alternative to the if and else control flow statements in R - the ifelse() function. This doesn’t always perform in quite the way that you might expect, but it’s certainly an option to consider. The syntax is perhaps a little more familiar than the control flow statements… Let’s perform the same thing we just did, but this time using the ifelse() function:

z <- 1
ifelse(z < 10, "less than 10", "more than 10")
[1] "less than 10"
z <- 20
ifelse(z < 10, "less than 10", "more than 10")
[1] "more than 10"

Great. All works the same so far! The ifelse() function takes an input and evaluates it against a condition (here, that it is < 10). If the condition is TRUE, it returns the first option; if FALSE, it returns the second.

Helpfully, we can also perform the ifelse() functions on a vector. Let’s create a numeric vector from 1 - 60 in steps of 6 and run our function again

z <- seq(from = 1, to = 60, by = 6)
ifelse(z < 10, "less than 10", "more than 10")
 [1] "less than 10" "less than 10" "more than 10" "more than 10" "more than 10" "more than 10" "more than 10"
 [8] "more than 10" "more than 10" "more than 10"

We end up with 10 outputs, each telling us whether a value is less or more than 10! Compare that with what happens when we try the same vector with if and else statements:

z <- seq(from = 1, to = 60, by = 6)

if (z < 10){
  print("z is less than 10")
} else {
  print("z is more than 10")
}
Error in if (z < 10) { : the condition has length > 1

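You can see from the error above that you can’t use if else statements on a vector - the condition is expecting something with a length of 1, rather than, in this case, 10. (In versions of R before 4.2.0 this was a warning rather than an error, and only the first element was used.)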
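
If you do want to use if with a vector, one option is to collapse the comparison down to a single TRUE or FALSE first. The base R functions any() and all() do exactly that. This is just a sketch of the idea:

```r
z <- seq(from = 1, to = 60, by = 6)

# any() is TRUE if at least one element meets the condition
if (any(z < 10)) {
  print("at least one value is less than 10")
}

# all() is TRUE only if every element meets the condition
if (all(z < 10)) {
  print("every value is less than 10")
} else {
  print("not every value is less than 10")
}
```

Here the first message is printed, followed by “not every value is less than 10”, because only the first two values of z are below 10.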

We can use ifelse() on character vectors too. Let’s create one and try it out:

countries <- c("UK", "USA", "France", "Spain")
ifelse(countries == "UK", "UK", "Elsewhere")
[1] "UK"        "Elsewhere" "Elsewhere" "Elsewhere"

The function checks whether each element matches the string “UK”. If it does, it returns “UK”; otherwise, it returns “Elsewhere”. Fairly straightforward stuff!

Writing your own functions:

Writing your own functions is an important part of programming. It keeps code simple and readable, and allows you to perform a set of tasks multiple times without loads of duplication in your code. If you wanted to develop your own package, custom function creation would be a huge part of this - a package is essentially just a collection of functions!

The process of creating functions is quite straightforward - I do it loads to make my life quick and easy. Let’s start by creating a simple function that adds 10 to any numeric input:

add_ten <- function(x){
  y <- x + 10
  return(y)
}

add_ten(1)
[1] 11

So, here, we’ve created and tested a basic function. Let’s dissect it a little:

Notice here that the new y object is treated a little differently in functions than in loops. If we remove y from our environment and run the function again, you’ll notice that it does not re-appear. Whatever objects you create within the body of the function exist only within that function - they are not available in the wider R environment.
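
Here is a small sketch to check this. To avoid any confusion with the y left over from our earlier loop, I’ve renamed the object inside the function to y_inside (a made-up name), and used the base R function exists() to ask whether an object with that name can be found.

```r
add_ten <- function(x){
  y_inside <- x + 10   # created inside the function body
  return(y_inside)
}

result <- add_ten(1)

result               # the returned value is available to us
exists("y_inside")   # but the object made inside the function is not
```

result holds 11, whereas exists("y_inside") gives FALSE, because y_inside only ever lived inside the function.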

Functions can also take a vector input - let’s run our add_ten() function with a vector and see what happens

add_ten <- function(x){
  y <- x + 10
  return(y)
}

z <- 1:10
add_ten(z)
 [1] 11 12 13 14 15 16 17 18 19 20

Of course, we could perform this function manually, as it is a really straightforward example. I’d be unlikely to bother creating a function to do this in practice, but it’s a useful demonstration.

The anatomy of a function

Let’s quickly recap the anatomy of how a function works

function(x) declares a function. It creates the function within the R environment, and defines the arguments that the function requires. In our example above, whenever you run add_ten, you must have an x argument, otherwise the function will not run.

{} define the block of code within which the work that the function does takes place. Here, y is only defined within the function.

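return(y) tells R which value the function should hand back as its output. When we call the function at the console without assigning the result to anything, that value is printed.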
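
Because a function hands back a value, we can store that value or use it straight away in another calculation. A quick sketch, where eleven is just a made-up object name:

```r
add_ten <- function(x){
  y <- x + 10
  return(y)
}

# Store the returned value instead of printing it
eleven <- add_ten(1)

# Use the returned value straight away in another calculation
add_ten(1) * 2        # 11 * 2 = 22

# Functions can be nested: the inner result becomes the outer input
add_ten(add_ten(1))   # 11 + 10 = 21
```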

Functions with multiple arguments

We can extend our functions to take multiple arguments, or inputs, using the basics that we’ve just learned. Functions that we have used throughout the course so far have taken multiple arguments, so let’s try and also create one. We’ll write a function that multiplies two numbers together:

multiply <- function(x, y){
  z <- x * y
  return(z)
}

multiply(3,2)
[1] 6

In our R environment window, we can see a new section has been created called ‘Functions’. Within that, we can see both our add_ten and our multiply functions, and the number of arguments that each takes.

Functions with default arguments

You’ll often find that several arguments in functions that you’re running in R have default settings that you do not have to explicitly code in order for R to use them. For example, in the sum() and mean() functions, the default for the na.rm argument is FALSE. We can edit our multiply() function to have a default y value of 10:

multiply <- function(x, y = 10){
  z <- x * y
  return(z)
}

multiply(3)
[1] 30
multiply(3, 2)
[1] 6

Here, we can see that if we include only 1 argument to the function now, it will run and use the default of y = 10. If we include two arguments ourselves, we override the default and the function performs on both of the arguments that we have specified. You can hopefully see how writing functions with multiple arguments and defaults isn’t too much of a stretch from writing a function with a single argument, and see how useful that might be in your coding!
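
It’s also worth knowing that arguments can be given by name, in which case their order no longer matters. Multiplication gives the same answer either way round, so the sketch below uses a different made-up function, raise_to(), where the order of the inputs does change the result.

```r
# raise_to() is a hypothetical function written just for this example
raise_to <- function(base, exponent = 2){
  result <- base ^ exponent
  return(result)
}

raise_to(3)                       # uses the default exponent: 3 ^ 2 = 9
raise_to(2, 3)                    # matched by position: 2 ^ 3 = 8
raise_to(exponent = 3, base = 2)  # matched by name: still 2 ^ 3 = 8
```

Naming your arguments makes code easier to read, and protects you from accidentally passing values in the wrong order.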

Combining functions and conditionals:

Now, we can combine all of the skills that we’ve learned today and create complex functions with some control flow statements inside of them. Let’s start by writing a function that evaluates whether a numerical input is greater than or less than 10.

eval10 <- function(x){
  y <- ifelse(x < 10, "less than 10", "greater than 10")
  return(y)
}

eval10(1)
[1] "less than 10"
eval10(15)
[1] "greater than 10"

We’ve included a simple ifelse() call within our eval10 function, which evaluates x, stores the result in a new object, y, and returns it. We could do the same with the explicit if and else control flow statements:

eval10_2 <- function(x){
  # Store the message in y, then return it (rather than printing inside the function)
  if (x < 10){
    y <- "less than 10"
  } else {
    y <- "greater than 10"
  }
  return(y)
}

eval10_2(1)
[1] "less than 10"
eval10_2(15)
[1] "greater than 10"

Whilst this is a simple example, we could build up to creating some incredibly complex functions with custom control flows to efficiently handle our data and downstream analysis in R. It may well be useful for things such as looking at data to see if it contains headers before removing them, or assessing the length of a data set before performing a function on it.
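
As a sketch of the second idea, here is a small function that checks the length of its input before trying to calculate a mean. The function name safe_mean() is made up for illustration, and returning NA for an empty input is just one reasonable choice.

```r
# safe_mean() is a hypothetical helper written for this example
safe_mean <- function(x){
  if (length(x) == 0){
    # Nothing to average, so hand back NA instead of carrying on
    result <- NA
  } else {
    result <- mean(x)
  }
  return(result)
}

safe_mean(c(2, 4, 6))   # the mean of the three values, 4
safe_mean(numeric(0))   # an empty vector, so we get NA
```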

Lists in R

Now, we’ve had a whistle-stop tour of functions. Obviously this is not the full extent of their functionality, but it’s time to move on to lists. We’ve not really mentioned lists so far, other than very briefly in some of the beginner sessions. To recap, there are 4 main data structures in R - vectors, matrices, data.frames, and lists.

A list is a special form of data structure which allows you to contain objects of different lengths within it. They are a little more complicated than the other data structures we covered before - partly because a data.frame is both a special type of list, and can be contained within a list! In a data.frame, each component part (or column) must be the same length - this is not true of a list.

They are quite complicated at first glance, and often are not taught at all in R courses - I was never formally taught what a list was or how on earth to use one, and I had to try and work it out for myself. Lists are probably the form of data structure that I use the least in R, simply because I never got into the habit of using them. Now, let’s make a list and see how they work, and figure out how to access stuff within them. First, we’ll create three vectors (one character, one numeric, and one logical), and combine them with the list() function.

nintendo <- c("Mario", "Luigi", "Peach")
numbers <- 1:100
logic <- c(TRUE, FALSE, FALSE, TRUE)

my_list <- list(nintendo, numbers, logic)

Now, we have our list. We can print it to the screen by typing the name of our list, and it’ll print everything out.

my_list
[[1]]
[1] "Mario" "Luigi" "Peach"

[[2]]
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26
 [27]  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52
 [53]  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78
 [79]  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

[[3]]
[1]  TRUE FALSE FALSE  TRUE

From this, we can clearly see that the objects contained within the list are of different lengths - nintendo has a length of 3, numbers has a length of 100, and logic has a length of 4.

So, how do we access the list? It can seem complicated, but actually printing it to the screen has given us lots of information. We can use square brackets [] to choose objects. We can also access elements with their names, provided we have given them names, but we will come back to that.

Let’s try and select the first vector element of our list. We might try and access it with:

my_list[1]
[[1]]
[1] "Mario" "Luigi" "Peach"

Whilst this does return what we want it to, this actually isn’t in the vector class that we created the list with. We can check with the class() function:

class(my_list[1])
[1] "list"

We can see here that the object has retained its list class, and has taken the first part of the list, but not the vector itself. We can access the vector within the list with double square brackets [[]] - the clue is in the display of the list!

my_list[[1]]
[1] "Mario" "Luigi" "Peach"
class(my_list[[1]])
[1] "character"

When we verify what we’re doing with class(), we can see that we’re getting the original character vector stored within the list. This is a bit complex and hand-wavy, but basically remember that [[]] allows you to access the element from within the list in its native form, whereas [] gives you the element of the list, whilst retaining its list class.

Be aware that the way we access lists with [] is different from the way we access vectors. For example, if we were to type my_list[1:2] we get the first two elements of the list, but my_list[[1:2]] gives you the second element of the first bit of the list:

my_list[1:2]
[[1]]
[1] "Mario" "Luigi" "Peach"

[[2]]
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26
 [27]  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52
 [53]  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78
 [79]  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
my_list[[1:2]]
[1] "Luigi"

You can access bits of lists as above, or perhaps more intuitively and less confusingly, by combining double and single square brackets

my_list[[1]][2] # This accesses the 2nd item of the 1st list element.
[1] "Luigi"

You can also use names to access list elements in a way that is equivalent to how a dataframe would work. In order to do this, however, the elements need names, and the simplest way is to specify them when you’re creating the list, like this:

my_list <- list(nintendo = nintendo, numbers = numbers, logic = logic)

If we print the list again, you’ll see that the [[1]], [[2]], and [[3]] have been replaced by $nintendo, $numbers, and $logic:

my_list
$nintendo
[1] "Mario" "Luigi" "Peach"

$numbers
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26
 [27]  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52
 [53]  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78
 [79]  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

$logic
[1]  TRUE FALSE FALSE  TRUE

Now you can access your list using the names, but note that just like a dataframe, my_list[[1]] will still work

my_list$nintendo
[1] "Mario" "Luigi" "Peach"
my_list[[1]]
[1] "Mario" "Luigi" "Peach"
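
It’s also possible to add names to a list that already exists, using the base R names() function, and to use a name inside double square brackets. A short sketch, where unnamed_list is a made-up object name:

```r
nintendo <- c("Mario", "Luigi", "Peach")
numbers <- 1:100
logic <- c(TRUE, FALSE, FALSE, TRUE)

# Create the list without names, then add them afterwards
unnamed_list <- list(nintendo, numbers, logic)
names(unnamed_list) <- c("nintendo", "numbers", "logic")

# A name in [[]] works in the same way as $
unnamed_list[["nintendo"]]

# Names and [] can be combined to reach a single item
unnamed_list$logic[1]
```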

Why are lists important?

It perhaps isn’t immediately obvious why lists are important, but we will talk more about this in the next session. It’s useful for you to know how to use them, as they will make R more useful to you.

Briefly, lists can be helpful if you are working on lots of dataframes/tibbles at the same time, as you can store them within a list. The versatility of this will become clearer next time when we’re talking about the apply family of functions, sapply(), lapply(), and apply().
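
As a small taste of that, here is a sketch of two data frames of different shapes stored in one list. The data frames (samples and genes) and everything in them are invented for illustration.

```r
# Two small made-up data frames with different dimensions
samples <- data.frame(id = c("s1", "s2", "s3"), reads = c(120, 45, 300))
genes <- data.frame(gene = c("geneA", "geneB"), length = c(1500, 900))

# Store both in a single named list
my_data <- list(samples = samples, genes = genes)

# Pull one data frame back out, then a column from it
my_data$samples$reads

# Loop through the list and report the number of rows in each data frame
for (name in names(my_data)) {
  print(paste(name, "has", nrow(my_data[[name]]), "rows"))
}
```

A data frame could not hold these two tables side by side, because they have different numbers of rows, but a list is perfectly happy to.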

Summary

So, today we have learned about control flow with for, if, else, and ifelse(), and how these are used to control the flow of data through a set of functions, depending on whether conditions are met.

We’ve learned how to write our own functions, including with multiple arguments and defaults, and have combined this with our new knowledge of control flow.

Finally, we touched on lists, which feed nicely into the apply set of functions that we’ll look at in more detail on Thursday. We can use sapply(), lapply() and apply() to efficiently process vectors in R, and we’ll learn more about the principles of proper scripting.

---
title: "LIFE4138 - Advanced R workshop 1"
output: html_notebook
---

Hello! Welcome to the first of the advanced R workshops on the LIFE4138 module. Our learning objectives for this session are:

* Learn some advanced programming skills in R, and understand that the language can be used as a "strict" programming language similar to Python etc.
* Think about control flow - we'll introduce the concepts of control flow, including `for` loops, `if` and `else` statements, and the `ifelse()` function.
* We will learn to write our own functions, a vital tool when we are building scripts. We will discuss the usability of custom functions to avoid replication in your code.
* Finally, we will begin to have a look at our 4th data structure in R - lists.

# What do we mean by "advanced"?

So, the previous sessions have all been designed to give you an overall grounding in the way that R works, and to help you understand the different approaches that R users take to the language (base R versus the tidyverse approach, for example). For most R users, this is enough to get by. You'll be able to do the majority of the stuff that you need to using only the skills that we've been through so far.

That being said, there are times, particularly when undertaking bioinformatic analysis, when perhaps you might like to have a script that will run independently without your input. Situations such as this really highlight the importance of the advanced techniques that we're about to learn!

# Taking it to the next step

The below will begin to give you greater flexibility and versatility with your R coding, and allow you to begin to make R work for you in exactly the way that you want it to. `for` loops, for example, are a great principle to learn in general with your code, and learning both the loops themselves and the alternatives available in R will be incredibly useful to you.

Another nice thing about moving on to more advanced coding principles in R is that they will help to deepen your understanding of how the basics work.  For example, the way that we handle and access vectors will become clearer as we use more advanced techniques to work through them.

# But how do I become an "advanced" R user?!

* Just following tutorials such as this will not make you into an advanced R user. These tutorials can only go so far - I can introduce you to a concept, but really what will deepen and strengthen your understanding is using the techniques with real-world data, beyond the structured format of this tutorial.

* The best way to learn R (or any programming language), is by doing, playing, and experimenting.

* Critically, you also need to learn how and where to access help. Don't be afraid to use Google, Stack Overflow, or to ask your friends or colleagues for advice. A fresh pair of eyes is much more likely to spot a rogue comma than you are after you've been staring at your screen for hours. This might seem a little basic for an advanced R tutorial, but looking for answers online is a great way to see other examples, look at other people's coding styles, get help, and importantly, learn to avoid future mistakes. There are loads of wonderful R resources out there for you to use, and they are getting better all the time.

* Another thing you should develop as you become a more advanced R user is your own coding philosophy. My approach is to do what works. This sounds daft at first, but there are several people whose philosophy is to achieve an end result using as few lines of code as possible. I like to do what works for me, whilst retaining an open mind to new solutions. It is very easy to get stuck in a coding rut with R, and continue to use a solution that might quickly become out of date or impractical/unnecessarily difficult (think about the introduction of the tidyverse in the 2010s for example!).

The above abilities and ideas will open up your coding abilities, and allow you to really get to grips with R as both a language and a tool for data wrangling, plotting, and downstream analysis. 

Finally, the importance of your coding environment (I mean physical environment, not your R environment) cannot be overstated. Find a way to get "in the zone", and allow yourself to concentrate solely on what it is that you're trying to achieve. Focus is important with code: it's so easy to make mistakes and end up with an analysis that hasn't done what you think it has. For me, I make a giant cup of tea, grab some biscuits, get comfy with a warm jumper, and pop my headphones on. I listen to a playlist that I have personally curated for a mixture of concentration, positivity, and motivation. It sounds daft, but I promise, it works. If I really don't want to be disturbed, I shut my office door, cover the window in the door, and turn the lights off (and put my lamp on - I'm aiming for ambient lighting rather than pitch black!).

I know I just said finally, but actually properly finally, don't forget to take a break. If you're tearing your hair out to debug a piece of code, or if you're really stuck and just can't make something work, put it down. Walk away, make yourself a cup of tea, have some cake, go for a run, do whatever you need to for some mental space. Even sleep on it if you have to. When you come back to it, often the answer will present itself. If it doesn't everything will at least feel a little easier than it did before. Coding can be frustrating, but learning to manage and harness that frustration is a real skill!

Right. Let's get to the meat of it.

# Control flow

The first concept that we'll cover today is control flow. You may well have come across control flow in your forays with Python and Unix, and the basic idea of how it works is largely similar in R. Control flow is a fundamental part of the way that any programming language works, not just R. It essentially provides a way to control the order in which a set of functions is performed on your dataset.

There are several control flow constructs, but the ones we will concentrate on today are `for` loops, `if` and `else` statements, and the `ifelse()` function. `ifelse()` isn't strictly a control flow statement, but a function from the baseR set of functions.

Before we get started, the way that control flow works is as follows. Say, for example, you have a condition. You ask whether that condition is true or false. If the condition is true, you perform a function.

![](control_flow.png)

If it is false, you perform another function. This is easier to understand with an analogy. We can use `if` `else` statements to write a function to make ourselves a cup of tea. First, we perform the boil kettle function. Then, we periodically check to see if the water has boiled. If the outcome of "Has the water boiled" is `FALSE`, we perform the wait a bit longer function. We check again. If the outcome is `TRUE`, we perform the pour water function, which then allows us to perform the drink tea function. Admittedly, this analogy is flawed - it's missing some vital steps including adding a tea bag, adding any milk, waiting for the tea to cool, and cutting yourself a slice of cake! That being said, it demonstrates the basic concept of control flow nicely.

![](control_flow2.png)
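As a rough sketch of one step of the tea analogy in R (the object names `water_boiled` and `action` are made up for illustration, and the syntax is explained properly later in this session):

```{r}
# Hypothetical example: has the kettle boiled yet?
water_boiled <- FALSE

if (water_boiled){
  action <- "pour water"
} else {
  action <- "wait a bit longer"
}

print(action)
```

Because `water_boiled` is `FALSE` here, the code after `else` is the part that runs.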

## A note on brackets

Before we go any further, you may have noticed that we often use different types of brackets for various purposes in R. This is vitally important to the language, and your code will not function in the way that you expect it to if you get this wrong.

* `()` - rounded brackets are used to call functions, to denote the order of mathematical calculations, and to wrap the condition of control statements such as `for` and `if`
* `[]` - square brackets are used for indexing and subsetting
* `{}` - curly brackets are used to define blocks of code in functions and statements. What I mean by this will become increasingly apparent as we work through the below.
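
To see the three bracket types side by side, here is a small sketch using a made-up vector called `values`:

```{r}
values <- c(4, 8, 15)   # () calls the c() function

mean(values)            # () calls the mean() function
(2 + 3) * 4             # () controls the order of a calculation

values[2]               # [] picks out the 2nd element of the vector

if (values[2] > 5){     # {} wraps the block of code to run
  print("The 2nd value is greater than 5")
}
```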


## For loops

Despite `if else` statements being perhaps easier to wrap our heads around, we're going to start with `for` loops. Essentially, a `for` loop is a way of iterating through a set of values, performing a function on each in turn. These loops are relatively straightforward and intuitive in R, once you get used to using them. 

The syntax is, of course, slightly different to Python, but the basic concept is the same. Typically, `for` loops in R are a little slow. There's often a much quicker way of performing these operations, but we won't go into that just yet - it's important to understand how the loops work first.

Let's try one. We want to create a basic `for` loop which will print all of the numbers from 1 to 10. We could of course do this without a loop, by assigning an individual value to x, and printing them in turn:

```{r}
x <- 1
print(x)

x <- 2
print(x)

x <- 3
print(x)

# and so on...
```

What if we wanted to automate this? Let's build that basic `for` loop...

```{r}
for (x in 1:10){
  print(x)
}
```

Brilliant, but let's go through just how that works.

* `for` - we start with the `for` statement. If you are using an IDE, you might notice that this statement changes colour to denote that it is one of R's reserved words.
* `x in 1:10` - we know that 1:10 produces a vector of all of the numbers in sequence from 1 to 10. The `x in` bit just means "for every value in the vector of". The x could be anything you want it to be, e.g. `for (numbers in 1:10)` would yield the same result. Note here that different programming languages follow different conventions - in Python you'll often see the same loop written as `for i in range(1, 11):`.
* `{}` - the curly brackets contain the body of the loop. What that means is that all of the values we have just specified above will have the contents of the `{}` performed on them. Here, we've used the `print()` function, which just prints our values to the console.

So, this loop assigns x each value from 1 to 10 in turn, and prints it to the console.

This is an example of a super simple and straightforward control statement. We could replace the 1:10 with anything we liked, and the loop would cycle through the values presented to it in turn until it reached the final one.
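
As a quick illustration of both points (the name `number` and the values 3, 6, and 9 are arbitrary choices):

```{r}
# The loop variable can have any name, and the vector can hold any values
for (number in c(3, 6, 9)){
  print(number * 2)
}
```

This prints 6, 12, and 18 in turn.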

Now, printing the numbers 1 to 10 doesn't seem overly useful. What if we want to perform a mathematical operation on all of the numbers in a sequence quickly? Let's expand our `for` loop so that it adds 10 to every number in the sequence, and prints the result to the console:

```{r}
for (x in 1:10){
  print(x + 10)
}
```

Easy as that. Again, this is a really basic example of a `for` loop, just so you can get to grips with the principle of how they work in R. 

We can also create `for` loops with vectors consisting of character strings - let's create one with 4 components:
```{r}
years <- c("2019", "2020", "2021", "2022")
```
*Remember that when we create a vector from several values, we have to combine them with the `c()` function*

Let's pop these in a loop that will print the sentence "The year is" followed by each of the elements of the vector in turn.

```{r}
years <- c("2019", "2020", "2021", "2022")

for (x in years){
  y <- paste("The year is", x)
  print(y)
}
```

This loops through the four years we have given in the vector, and uses the `paste()` function to stick two character strings together, printing the result for each of the vector elements in turn. Notice that we've created a new object within the loop, `y` - if you look in your R environment window, or try and print y to the console with `print(y)`, you'll see that y is stored as "The year is 2022", which is the result of the final iteration of the loop. It's important to note that the results of the other iterations are not stored in the wider R environment, as `y` is overwritten every time the loop performs an iteration.

Just to recap, a `for` loop runs through values in a vector (and is quick in these small toy examples!), and does some work for you inside the statement. Curly brackets `{}` denote the body of the loop, which is where we tell R what we want it to do with our input. Whilst for loops are great, every example that we've just run through can actually be achieved with simpler, and more computationally efficient code (sorry). Remember to think about your code, and consider whether there might be easier solutions to a problem!
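
If you do want to keep the result of every iteration, one common approach is to create an empty vector of the right length first, and fill it in position by position. This is a minimal sketch; the names `sentences` and `i` are my own choices for illustration:

```{r}
years <- c("2019", "2020", "2021", "2022")

# Create an empty character vector with one slot per year
sentences <- character(length(years))

# seq_along(years) gives the positions 1, 2, 3, 4
for (i in seq_along(years)){
  sentences[i] <- paste("The year is", years[i])
}

sentences
```

Here the loop runs over the positions in the vector rather than the values themselves, so that each result can be stored in its own slot.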

## Vectorisation as a more efficient mechanism?

For our first example, printing 1 - 10 to the screen with a loop, we could save a lot of time by explicitly popping x into its own vector (rather than adding this to the conditions of a `for` loop), and dealing with it that way:

```{r}
x <- 1:10
print(x)
```

As you can see, this ultimately yields the same result, with considerably less faffing. In these small toy examples, there isn't a whole lot of difference in computation time, but in much larger datasets, this is certainly something worth considering. Generally, vectorisation is much speedier than for loops.

For the example where we added 10 to x, we can use the same vectorisation principle. This is where R differs in its fundamental design from, for example, bash, where you would **have** to use a `for` loop to perform the following:

```{r}
x <- 1:10 # recreate our 1:10 vector
print(x + 10)
```

Here, you don't even need the `print()` function, just typing `x + 10` would be sufficient!

You don't even really need to use loops on character vectors in R! Let's go back to our "the year is" example. We can avoid a loop by using the `paste()` function on the vector explicitly, rather than having to force it through a loop:

```{r}
years <- c("2019", "2020", "2021", "2022")
paste("The year is", years)
```
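
If you'd like to see the speed difference for yourself, here is a rough sketch that times both approaches on a million values using `system.time()`. The object names are made up for illustration, and the exact timings will vary from machine to machine:

```{r}
big <- 1:1000000

# Loop version: add 10 to each value, one at a time
loop_result <- numeric(length(big))
loop_time <- system.time(
  for (i in seq_along(big)){
    loop_result[i] <- big[i] + 10
  }
)

# Vectorised version: add 10 to the whole vector in one go
vector_time <- system.time(vector_result <- big + 10)

loop_time["elapsed"]
vector_time["elapsed"]

# Both approaches give exactly the same answer
identical(loop_result, vector_result)
```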

## So why bother with for loops in the first place?

All of this is not to say that `for` loops are rubbish and useless in R - in fact there will likely be times where they are unavoidable, and a vector based solution just won't cut it. It is however worth hammering home the point that they are really worth thinking about quite carefully. There will often be a better way to do whatever you're trying to do - slow down when you're coding. Stop and think about what it is you're trying to achieve. Beginners in particular tend to learn how to create `for` loops, and then use them for everything, when often, a vector based solution would be much quicker and simpler!
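
One situation where a loop is hard to avoid is when each step depends on the result of the previous one. Here is a minimal sketch, using made-up numbers for a population that grows towards a ceiling (the names `pop`, `growth_rate`, and `ceiling_size` are purely illustrative):

```{r}
growth_rate <- 0.5     # hypothetical growth rate per time step
ceiling_size <- 100    # hypothetical maximum population size

pop <- numeric(10)     # create space for 10 time steps
pop[1] <- 10           # starting population

for (i in 2:length(pop)){
  # each value is calculated from the one before it
  pop[i] <- pop[i - 1] + growth_rate * pop[i - 1] * (1 - pop[i - 1] / ceiling_size)
}

round(pop, 1)
```

Because `pop[i]` can't be calculated until `pop[i - 1]` exists, there is no way to hand the whole vector over in one go as we did with `x + 10`.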


# `if` statements

Now, on to `if` statements. These are another control flow coding mechanism you can use in R and further afield to customise the order that commands are run in, depending on conditions of your data. Think about the kettle example - **IF** your water boiled, then you perform the task. `if` statements ask whether a condition has been met, before they perform a function. The best way to explain this is, as always, with an example:
```{r}
z <- 1
if (z < 10){
  print("z is less than 10")
}
```

We can see that because we've given `z` the value of 1, which is less than 10, we get an output telling us so (R is performing the function!). What happens if we change the assignment of `z` and try again?

```{r}
z <- 45
if (z < 10){
  print("z is less than 10")
}
```

We don't get an output. Because the condition was not met, no function was performed on our data.

The way that this is coded in R is that the provided condition (in `()`) is run, to give a logical output. If that output is `TRUE`, the body of the statement (in `{}`) is run. If the output is `FALSE`, nothing will happen.
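
You can check this for yourself by running the condition on its own (the name `condition` is just for illustration):

```{r}
z <- 45

condition <- z < 10   # the same test that we put inside the if ()
condition
class(condition)
```

The condition evaluates to `FALSE`, which is a value of class "logical", and this is what `if` uses to decide whether or not to run the code in the `{}`.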

# `else` statements

It isn't a huge leap to ask whether we could take the above and perform a different function if the condition was not met, and the logical output was `FALSE`. We can achieve this by pairing our `if` statement with an `else` statement. The syntax of this can take a little getting used to, because it pairs two sets of `{}` separated by an `else` statement. Let's demonstrate:

```{r}
z <- 1

if (z < 10){
  print("z is less than 10")
} else {
  print("z is more than 10")
}
```

With an input of 1, we get a printed message telling us that z is less than 10. What if we change z to be greater than 10?

```{r}
z <- 20

if (z < 10){
  print("z is less than 10")
} else {
  print("z is more than 10")
}
```

As expected, R tells us that z is more than 10. This combination of `if` and `else` statements runs a control flow that takes an input, and if a condition is met, runs function a. If the condition is not met, it runs function b. This basic principle allows you to perform slightly more complex functions or analyses, based on the input that you give.
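
One edge case that the examples above gloss over is what happens when z is exactly 10: the condition `z < 10` is `FALSE`, so the `else` branch would report "more than 10". If that distinction matters, you can chain conditions together with `else if`. A small sketch (the object `msg` is my own addition so that the result is stored):

```{r}
z <- 10

if (z < 10){
  msg <- "z is less than 10"
} else if (z == 10){
  msg <- "z is exactly 10"
} else {
  msg <- "z is more than 10"
}

print(msg)
```

R works through the conditions from top to bottom and runs only the first block whose condition is `TRUE`.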

# The `ifelse()` function

There is an alternative to `if` and `else` control flow statements in R - the `ifelse()` function. This doesn't always perform in quite the way that you might expect, but it's certainly an option to consider. The syntax is perhaps a little more familiar than the control flow statements... Let's perform the same thing we just did, but this time using the `ifelse()` function:

```{r}
z <- 1
ifelse(z < 10, "less than 10", "more than 10")
```


```{r}
z <- 20
ifelse(z < 10, "less than 10", "more than 10")
```

Great. All works the same so far! The `ifelse()` function takes an input and evaluates it against a condition (here, that it is < 10). If the condition is `TRUE`, it returns the first option; if `FALSE`, it returns the second.

Helpfully, we can also use the `ifelse()` function on a vector. Let's create a numeric vector from 1 - 60 in steps of 6 and run our function again:

```{r}
z <- seq(from = 1, to = 60, by = 6)
ifelse(z < 10, "less than 10", "more than 10")
```

We end up with 10 outputs, each telling us whether a value is less or more than 10! Compare that with what happens when we give the same vector to `if` and `else` statements:

```{r, error=TRUE}
z <- seq(from = 1, to = 60, by = 6)

if (z < 10){
  print("z is less than 10")
} else {
  print("z is more than 10")
}
```

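You can see from the error above that you can't perform `if` `else` statements on a vector - the condition is only expecting something with a length of 1, rather than, in this case, 10. (From R 4.2.0 onwards this is an error; older versions of R gave a warning and used only the first element of the vector.)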
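
If you do need to use a vector in an `if` condition, one option is to collapse it to a single `TRUE` or `FALSE` first with the base R functions `any()` or `all()`. A minimal sketch:

```{r}
z <- seq(from = 1, to = 60, by = 6)

# any() asks: is the condition TRUE for at least one value?
if (any(z < 10)){
  print("at least one value is less than 10")
}

# all() asks: is the condition TRUE for every value?
if (all(z < 10)){
  print("every value is less than 10")
} else {
  print("not every value is less than 10")
}
```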


We can use `ifelse()` on character vectors too. Let's create one and try it out:

```{r}
countries <- c("UK", "USA", "France", "Spain")
ifelse(countries == "UK", "UK", "Elsewhere")

```

The function is evaluating to see if each element of the input matches the string "UK", and if it does, it returns "UK", otherwise it returns "Elsewhere". Fairly straightforward stuff!
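
One example of `ifelse()` not doing quite what you might expect is with missing values. If the condition can't be evaluated for an element, you get `NA` back for that element rather than either of your two options. A small sketch with a made-up vector:

```{r}
measurements <- c(1, NA, 20)   # hypothetical data with a missing value

result <- ifelse(measurements < 10, "less than 10", "more than 10")
result
```

The second element of `result` is `NA`, because R can't say whether a missing value is less than 10 or not.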


# Writing your own functions

Writing your own functions is an important part of programming. It keeps code simple and readable, and allows you to perform a set of tasks multiple times without loads of duplication in your code. If you wanted to develop your own package, custom function creation would be a huge part of this - a package is essentially just a collection of functions!

The process of creating functions is quite straightforward - I do it loads to make my life quick and easy. Let's start by creating a simple function that adds 10 to any numeric input:

```{r}
add_ten <- function(x){
  y <- x + 10
  return(y)
}

add_ten(1)
```

So, here, we've created and tested a basic function. Let's dissect it a little:

* `add_ten` - first, we have to give our new function a name. I've imaginatively called this one add_ten. 
* Then, we assign the rest of the code to the new function name with our assignment arrow `<-`
* Next up, we have the `function` keyword. This tells R that we want to create a new function, which will be stored under our function name
* `function(x)` - this tells R that we are creating a function with a single argument (or input), x.
* `{}` this denotes the body of the function - whatever code is in here is what will be applied to an input
* `y <- x + 10` - this bit's obvious - add 10 to our input of x and assign the result a new name, `y`.
* `return(y)` - give us an output! Without this, the last thing our function does is an assignment, so R wouldn't automatically print the result of whatever we've just asked it to do!

Notice here that the new `y` object is treated a little differently in functions than in loops. If we remove `y` from our environment and run the function again, you'll notice that it does not re-appear. Whatever objects you create within the body of the function exist only within that function - they are not accessible from the wider R environment.
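
You can test this for yourself with the base R function `exists()`, which reports whether an object of a given name can be found. A minimal sketch (the object `result` is my own addition):

```{r}
add_ten <- function(x){
  y <- x + 10
  return(y)
}

# Remove any y left over from earlier examples
if (exists("y", inherits = FALSE)){
  rm(y)
}

result <- add_ten(1)
result

# Is there a y in our environment now?
exists("y", inherits = FALSE)
```

The function has returned its answer, but the `y` it created along the way has not appeared in our environment.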

Functions can also take a vector input - let's run our `add_ten()` function with a vector and see what happens

```{r}
add_ten <- function(x){
  y <- x + 10
  return(y)
}

z <- 1:10
add_ten(z)
```

Of course, we could perform this function manually, as it is a really straightforward example. I'd be unlikely to bother creating a function to do this in practice, but it's a useful demonstration. 


# The anatomy of a function

Let's quickly recap the anatomy of how a function works

`function(x)` declares a function. It creates the function within the R environment, and defines the arguments that the function requires. In our example above, whenever you run `add_ten`, you *must* have an x argument, otherwise the function will not run.

`{}` define the block of code within which the work that the function does takes place. Here, `y` is only defined within the function.

`return(y)` tells R to give us an output, and display what the function has done.
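
To see what happens when the x argument is missing without stopping your script, you could wrap the call in the base R function `tryCatch()`. This goes a little beyond what we cover today, so treat it as a sketch; the name `outcome` is just for illustration:

```{r}
add_ten <- function(x){
  y <- x + 10
  return(y)
}

# Call the function with no arguments, and capture the error message
outcome <- tryCatch(
  add_ten(),
  error = function(e) conditionMessage(e)
)

outcome
```

Rather than a number, `outcome` holds the text of the error message that R produced.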


# Functions with multiple arguments

We can extend our functions to take multiple arguments, or inputs, using the basics that we've just learned. Functions that we have used throughout the course so far have taken multiple arguments, so let's try and also create one. We'll write a function that multiplies two numbers together:
```{r}
multiply <- function(x, y){
  z <- x * y
  return(z)
}

multiply(3,2)
```

In our R environment window, we can see a new section has been created called 'Functions'. Within that, we can see both our `add_ten` and our `multiply` functions, and the number of arguments that each takes. 

# Functions with default arguments

You'll often find that several arguments in the functions that you're running in R have default settings that you do not have to explicitly code in order for R to use them. For example, in the `sum()` and `mean()` functions, the default for the `na.rm` argument is `FALSE`. We can edit our `multiply()` function to have a default `y` value of 10:

```{r}
multiply <- function(x, y = 10){
  z <- x * y
  return(z)
}

multiply(3)
multiply(3, 2)
```

Here, we can see that if we include only 1 argument to the function now, it will run and use the default of y = 10. If we include two arguments ourselves, we override the default and the function performs on both of the arguments that we have specified. You can hopefully see how writing functions with multiple arguments and defaults isn't too much of a stretch from writing a function with a single argument, and see how useful that might be in your coding!
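
Just as with R's built-in functions, you can also name the arguments when you call your own function, in which case the order that you give them in no longer matters. A quick sketch:

```{r}
multiply <- function(x, y = 10){
  z <- x * y
  return(z)
}

multiply(x = 3)          # uses the default, y = 10
multiply(y = 2, x = 3)   # named arguments can go in any order

formals(multiply)$y      # look up the default value of y
```

The base R function `formals()` returns the arguments of a function along with any defaults, which can be handy when you've forgotten what a default is.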

# Combining functions and conditionals

Now, we can combine all of the skills that we've learned today and create complex functions with some control flow statements inside of them. Let's start by writing a function that evaluates whether a numerical input is greater than or less than 10.

```{r}
eval10 <- function(x){
  y <- ifelse(x < 10, "less than 10", "greater than 10")
  return(y)
}

eval10(1)
eval10(15)

```

We've included a simple `ifelse()` function within our `eval10()` function, which applies `ifelse()` to the input, x, stores the result in a new object, y, and returns the output. We could do the same with the explicit `if` and `else` control flow statements:

```{r}
eval10_2 <- function(x){
  # store the message in y, then return it, just as eval10() does
  if (x < 10){
    y <- "less than 10"
  } else {
    y <- "greater than 10"
  }
  return(y)
}

eval10_2(1)
eval10_2(15)
```

Whilst this is a simple example, we could build up to creating some incredibly complex functions with custom control flows to efficiently handle our data and downstream analysis in R. It may well be useful for things such as looking at data to see if it contains headers before removing them, or assessing the length of a data set before performing a function on it. 
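
As a sketch of the second idea, here is a made-up function called `safe_mean()` that checks the length of its input before trying to calculate anything:

```{r}
# Hypothetical example: only calculate a mean if there is data to work with
safe_mean <- function(x){
  if (length(x) == 0){
    print("No data supplied, returning NA")
    return(NA)
  } else {
    return(mean(x))
  }
}

safe_mean(c(2, 4, 6))
safe_mean(numeric(0))
```

The first call returns 4. The second prints our message and returns `NA`, rather than the `NaN` that `mean()` would give for an empty vector.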

# Lists in R

Now, we've had a whistle-stop tour of functions. Obviously this is not the full extent of their functionality, but it's time to move on to lists. We've not really mentioned lists so far, other than very briefly in some of the beginner sessions. To recap, there are 4 main data structures in R - vectors, matrices, data.frames, and lists.

A list is a special form of data structure which allows you to contain objects of different lengths within it. They are a little more complicated than the other data structures we covered before - partly because a data.frame is both a special type of list, and can be contained within a list! In a data.frame, each component part (or column) must be the same length - this is not true of a list.

They are quite complicated at first glance, and often are not taught at all in R courses - I was never formally taught what a list was or how on earth to use one, and I had to try and work it out for myself. Lists are probably the form of data structure that I use the least in R, simply because I never got into the habit of using them. Now, let's make a list and see how they work, and figure out how to access stuff within them. First, we'll create three vectors (one character, one numeric, and one logical), and combine them with the `list()` function.

```{r}
nintendo <- c("Mario", "Luigi", "Peach")
numbers <- 1:100
logic <- c(TRUE, FALSE, FALSE, TRUE)

my_list <- list(nintendo, numbers, logic)
```
Now, we have our list. We can print it to the screen by typing the name of our list, and it'll print everything out.

```{r}
my_list
```
From this, we can clearly see that the objects contained within the list are of different lengths - nintendo has a length of 3, numbers has a length of 100, and logic has a length of 4.
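
Rather than counting by eye, you can confirm this with the base R function `lengths()` (note the s on the end), which reports the length of each element of a list:

```{r}
nintendo <- c("Mario", "Luigi", "Peach")
numbers <- 1:100
logic <- c(TRUE, FALSE, FALSE, TRUE)

my_list <- list(nintendo, numbers, logic)

length(my_list)    # how many elements the list itself contains
lengths(my_list)   # the length of each element within the list
```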

So, how do we access the list? It can seem complicated, but actually printing it to the screen has given us lots of information. We can use square brackets `[]` to choose objects. We can also access elements with their names, provided we have given them names, but we will come back to that.

Let's try and select the first vector element of our list. We might try and access it with:

```{r}
my_list[1]
```

Whilst this does return what we want it to, this actually isn't in the vector class that we created the list with. We can check with the `class()` function:

```{r}
class(my_list[1])
```

We can see here that the object has retained its list class, and has taken the first part of the list, but not the vector itself. We can access the vector within the list with double square brackets `[[]]`. The clue is in the display of the list!

```{r}
my_list[[1]]

class(my_list[[1]])
```

When we verify what we're doing with `class()`, we can see that we're getting the original character vector stored within the list. This is a bit complex and hand-wavy, but basically remember that `[[]]` allows you to access the element from within the list in its native form, whereas `[]` gives you the element of the list, whilst retaining its list class.

Be aware that indexing a list with `[[]]` behaves differently from indexing with `[]`. For example, if we were to type `my_list[1:2]` we get the first two elements of the list, but `my_list[[1:2]]` indexes recursively, and gives you the second element of the first bit of the list:

```{r}
my_list[1:2]

my_list[[1:2]]
```
You can access bits of lists as above, or perhaps more intuitively and less confusingly, by combining double and single square brackets

```{r}
my_list[[1]][2] # This accesses the 2nd item of the 1st list element.
```

You can also use names to access list elements, in a way that is equivalent to how a dataframe would work. In order to do this, however, the elements need to have names, which are most easily supplied when you're creating the list, like this:

```{r}
my_list <- list(nintendo = nintendo, numbers = numbers, logic = logic)
```
If we print the list again, you'll see that the `[[1]]`, `[[2]]`, and `[[3]]` have been replaced by `$nintendo`, `$numbers`, and `$logic`:
```{r}
my_list
```
Now you can access your list using the names, but note that just like a dataframe, `my_list[[1]]` will still work:

```{r}
my_list$nintendo

my_list[[1]]
```
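
Names can also be added to, or changed on, a list that already exists, using the base R function `names()`. In this sketch, `other_list` is a made-up second list so that we don't overwrite `my_list`:

```{r}
nintendo <- c("Mario", "Luigi", "Peach")
numbers <- 1:100
logic <- c(TRUE, FALSE, FALSE, TRUE)

other_list <- list(nintendo, numbers, logic)   # created without names
names(other_list) <- c("nintendo", "numbers", "logic")

other_list$logic
other_list[["nintendo"]][2]   # names work inside [[]] too
```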

## Why are lists important?

It perhaps isn't immediately obvious why lists are important, but we will talk more about this in the next session. It's useful for you to know how to use them, as they will make R more useful to you.

Briefly, lists can be helpful if you are working on lots of dataframes/tibbles at the same time, as you can store them within a list. The versatility of this will become clearer next time when we're talking about the apply family of functions, `sapply()`, `lapply()`, and `apply()`.
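
As a small taste of this using only what we've covered today, here is a sketch that stores two made-up dataframes in a named list, and uses a `for` loop to report how many rows each one has. All of the names and numbers here are invented for illustration:

```{r}
# Two hypothetical dataframes with different numbers of rows
experiments <- list(
  control = data.frame(sample = 1:3, count = c(10, 12, 9)),
  treated = data.frame(sample = 1:5, count = c(20, 18, 25, 22, 19))
)

# Loop over the names of the list, and pull out each dataframe with [[]]
for (name in names(experiments)){
  print(paste(name, "has", nrow(experiments[[name]]), "rows"))
}
```

Because a list doesn't care that the two dataframes have different numbers of rows, they can sit happily side by side in the same object.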

# Summary

So, today we have learned about control flow with `for`, `if`, `else`, and `ifelse()`, and how these are used to control the flow of data through a set of functions, depending on whether conditions are met. 

We've learned how to write our own functions, including with multiple arguments and defaults, and have combined this with our new knowledge of control flow.

Finally, we touched on lists, which feed nicely into the apply set of functions that we'll look at in more detail on Thursday. We can use `sapply()`, `lapply()` and `apply()` to efficiently process vectors and lists in R, and we'll learn more about the principles of proper scripting.



