Introduction to R: Basics

2026-03-17

1 Off we go!

1.1 Credit and license

This material is heavily based on the book Learning statistics with R: A tutorial for psychology students and other beginners (Version 0.6) by Danielle Navarro. It has been reorganized, extended, rewritten, adapted and formatted with learnr by Wolf Vanpaemel under a Creative Commons BY-SA license (CC BY-SA) version 4.0. This means that this book can be reused, remixed, retained, revised, and redistributed (including commercially) as long as appropriate credit is given to the authors. If you remix or modify the original version of this open textbook, you must redistribute all versions of this open textbook under the same license - CC BY-SA. https://creativecommons.org/licenses/by-sa/4.0/

For formatting with learnr, Evelien Schat and Richard Artner provided valuable assistance. Jeffrey Goris, Jordan Revol, Robin Vloeberghs, Marre Vervloet, Yentl Koopmans, Lisa Koßmann, Peer-Ole Jacobsen and Katrijn Cnudde provided valuable feedback on a previous version.

1.2 Goal and scope

Information in the endnotes is beyond the scope of the course, and is strictly provided for your information. It might be interesting, or useful for later when you are a more prolific R user, but it is not part of what you should study.

Our goal in these chapters is not to learn any statistical concepts: we’re just trying to learn the basics of how R works and get comfortable interacting with the system. Rather than learning about how to use R to do statistics, the main goal of these chapters is to get started in R and learn how R works. In Chapter XXX, we will encounter some statistical concepts. The goal here is to show you how to compute those in R. I will not be explaining why these computations are interesting, what they mean, and how they should be interpreted. To learn about this stuff, you should go elsewhere.

The list of topics that these chapters cover is pretty broad, and there’s a lot of content there. Even though this is quite long, I’m really only scratching the surface of several fairly different and important topics. My advice is to read through the chapter once and try to follow as much of it as you can. Don’t worry too much if you can’t grasp it all at once. However, what you’ll probably find is that later on, you’ll need to flick back to earlier chapters in order to understand some of the concepts that I refer to there. In general, I’m not trying to be comprehensive in these chapters; I’m trying to make sure that you’ve got the basic foundations needed to tackle the content that comes later in the book. This means that some of the topics are revisited in more detail later. For example, I will talk about data frames in Sections XXX, XXX and XXX. This makes this book annoying as a reference book (not everything you need to know about a data frame is collected at the same spot), but I hope it makes the book a good textbook that is useful as study material, where you are taken by the hand to do everything step by step. It is a thin line to walk, though, and I do hope I have succeeded.

1.3 Format

This is a somewhat interactive document, and you will be asked to write R code in your browser. So, perhaps surprisingly, while you’ll learn how to read, write and run R code, you will not need to open R just now. Instead, you will be working in R from within a browser. This is not how you will typically use R later, but it offers a nice learning experience. In Section 5, we will work in R properly (or in RStudio, really).

You will often be asked to run some code. You can do that by hitting ctrl + enter (or command + enter for the Mac users) or pressing the “run code” button in the boxes that will be provided. For some exercises, you will see a solution box, on which you can click and copy.

1.4 Caveats

This work is neither complete, nor perfect, and is a work in progress. One thing that is surely broken is the internal referencing system (to figures, tables, sections). So that means that if I say I will talk about it in Section 8.4, there is only a 70% probability the correct section is 8.4. At other times, I gave up and didn’t even say 8.4 but used XXX as a placeholder. Or the system broke, which leads to ??. I plan to fix it, but ran out of time and energy. While ugly and mildly annoying, I don’t expect this to slow down your learning curve.

Further, it makes sense to view R from the perspective of a language. Which makes sense, given that it is a language. That means that there are many ways you can do things wrong. But, at the same time, there are many ways to do things right too! Just like with language, different people have different styles, and the same holds for R. That means that whatever code I write is most emphatically not the only way this code could or should be written. It reflects my own personal code writing style, while of course respecting the rules of grammar.

1.5 Typos

On https://docs.google.com/document/d/119uqTG6OpP9bcwEyubcw2Uy-l03HIoZjFyveAaU7oJc/edit?usp=sharing, you can find and report typos. Any typo I have found will be reported under the “confirmed typos” heading. If you think you have found a typo, and it is not listed under the “confirmed typos” heading, please add it under the “suspected typos” heading. Future users will thank you!

2 Getting started with R

2.1 Typing commands

One of the easiest things you can do with R is to use it as a simple calculator, so it’s a good place to start.

Exercise: type 10 + 20 in the box below, run the code.

Congrats! When you have done this, you’ve entered a command, and R will “execute” that command.

Not a lot of surprises in this extract. But there are a few things worth talking about, even with such a simple example.

Firstly, it’s important that you understand how to read the extract. In this example, what you typed was the 10 + 20 part. You didn’t type the [1] 30 part. That’s what R printed out in response to your command.

Secondly, it’s important to understand how the output is formatted. Obviously, the correct answer to the sum 10 + 20 is 30, and not surprisingly R has printed that out as part of its response. But it’s also printed out this [1] part, which probably doesn’t make a lot of sense to you right now. You’re going to see that a lot. I’ll talk about what this means in a bit more detail later on, but for now, you can think of [1] 30 as if R were saying “the answer to the 1st question you asked is 30”. That’s not quite the truth, but it’s close enough for now. And in any case, it’s not really very interesting at the moment: we only asked R to calculate one thing, so obviously, there’s only one answer printed on the screen. Later on, this will change, and the [1] part will start to make a bit more sense. For now, I just don’t want you to get confused or concerned by it.
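For reference, here is a minimal sketch of what the extract discussed above should look like. The first line is the command you type; the line starting with ## is R’s response, written the same way as the other output in this document.

```r
# The command you type:
10 + 20
## [1] 30
```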

2.1.1 Be very careful to avoid typos

Before we go on to talk about other types of calculations that we can do with R, there are a few other things I want to point out. The first thing is that, while R is good software, it’s still software. It’s pretty stupid, and because it’s stupid it can’t handle typos. It takes it on faith that you meant to type exactly what you did type. For example, suppose that you forgot to hit the shift key when trying to type +, and as a result, your command ended up being 10 = 20 rather than 10 + 20.

Exercise: Run the command 10 = 20 and see what happens.

What’s happened here is that R has attempted to interpret 10 = 20 as a command, and spits out an error message because the command doesn’t make any sense to it. When a human looks at this, and then looks down at his or her keyboard and sees that + and = are on the same key, it’s pretty obvious that the command was a typo. But R doesn’t know this, so it gets upset. And, if you look at it from its perspective, this makes sense. All that R “knows” is that 10 is a legitimate number, 20 is a legitimate number, and = is a legitimate part of the language too. In other words, from its perspective, this really does look like the user meant to type 10 = 20, since all the individual parts of that statement are legitimate and it’s too stupid to realise that this is probably a typo. Therefore, R takes it on faith that this is exactly what you meant. It only “discovers” that the command is nonsense when it tries to follow your instructions, typo and all. And then it whinges and spits out an error.

Even more subtle is the fact that some typos won’t produce errors at all, because they happen to correspond to “well-formed” R commands. For instance, suppose that not only did I forget to hit the shift key when trying to type 10 + 20, I also managed to press the key next to the one I meant to. The resulting typo would produce the command 10 - 20. Clearly, R has no way of knowing that you meant to add 20 to 10, not subtract 20 from 10.

Exercise: Run the command 10 - 20 and see what happens.

In this case, R produces the right answer, but to the wrong question.

To some extent, I’m stating the obvious here, but it’s important. The people who wrote R are smart. You, the user, are smart. But R itself is dumb. And because it’s dumb, it has to be mindlessly obedient. It does exactly what you ask it to do. There is no equivalent to “autocorrect” in R, and for good reason. When doing advanced stuff – and even the simplest of statistics is pretty advanced in a lot of ways – it’s dangerous to let a mindless automaton like R try to overrule the human user. But because of this, it’s your responsibility to be careful. Always make sure you type exactly what you mean. When dealing with computers, it’s not enough to type “approximately” the right thing. In general, you absolutely must be precise in what you say to R … like all machines it is too stupid to be anything other than absurdly literal in its interpretation.
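The two typos discussed above can be put side by side in a small sketch. The first one is left commented out so that the rest of the code still runs; the error text shown is what R is expected to print for it, in the same wording the document shows later for 350 = sales.

```r
# Typo 1: a command R cannot make sense of (commented out so the rest runs)
# 10 = 20
# ## Error in 10 = 20 : invalid (do_set) left-hand side to assignment

# Typo 2: a perfectly valid command, just not the one you meant
10 - 20
## [1] -10
```

The second typo is the more dangerous one: nothing in the output warns you that anything went wrong.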

2.1.2 R is (a bit) flexible with spacing

Of course, now that I’ve been so uptight about the importance of always being precise, I should point out that there are some exceptions. Or, more accurately, there are some situations in which R does show a bit more flexibility than my previous description suggests. R is smart enough to ignore redundant spacing. What I mean by this is that, when I typed 10 + 20 before, I could equally have done this

10    + 20

or this

10+20

Exercise: Try it!

You get exactly the same answer!

However, that doesn’t mean that you can insert spaces in any old place. For example, you could type citation() to get some information about how to cite R.

Exercise: Run the command citation().

It tells you how to cite the R manual. Let’s see what happens when you try changing the spacing.

Exercise: Type citation() with spaces in between the word and the parentheses, or inside the parentheses themselves.

citation () and citation( ) will produce exactly the same response. However, what you can’t do is insert spaces in the middle of the word.

Exercise: Run the command citation(), with spaces in the middle of the word.

citat ion() gives an error.

2.2 Doing simple calculations

Okay, now that we’ve discussed some of the tedious details associated with typing R commands, let’s move forward. So far, all we know how to do is addition. Clearly, a calculator that only did addition would be a bit stupid, so I should tell you about how to perform other simple calculations using R. But first, some more terminology.

Addition is an example of an “operation” that you can perform (specifically, an arithmetic operation), and the operator that performs it is +. To people with a programming or mathematics background, this terminology probably feels pretty natural, but to other people it might feel like I’m trying to make something very simple (addition) sound more complicated than it is (by calling it an arithmetic operation). To some extent, that’s true: if addition was the only operation that we were interested in, it’d be a bit silly to introduce all this extra terminology. However, as we go along, we’ll start using more and more different kinds of operations, so it’s probably a good idea to get the language straight now, while we’re still talking about very familiar concepts like addition!

2.2.1 Adding, subtracting, multiplying and dividing

So, now that we have the terminology, let’s learn how to perform some arithmetic operations in R. To that end, Table 2.1 lists (among others) the operators that correspond to the basic arithmetic we learned in primary school: addition, subtraction, multiplication and division.

Table 2.1: Basic arithmetic operations in R. These five operations (taking powers has two equivalent operators, ^ and **) are used very frequently throughout the text, so it’s important to be familiar with them at the outset.

operation        operator   example input   example output
addition         +          10 + 2          12
subtraction      -          9 - 3           6
multiplication   *          5 * 5           25
division         /          9 / 3           3
power            ^          5 ^ 2           25
power            **         4 ** 2          16

As you can see, R uses fairly standard symbols to denote each of the different operations you might want to perform: addition is done using the + operator, subtraction is performed by the - operator, and so on.

Exercise: Find out what 57 times 61 is using R.

57 * 61

So that’s handy.

2.2.2 Taking powers

The first four operations listed in Table 2.1 are things we all learned in primary school, but they aren’t the only arithmetic operations built into R. There are three other arithmetic operations that I should probably mention: taking powers, doing integer division, and calculating a modulus. Of the three, the only one that is of any real importance for the purposes of this book is taking powers, so I’ll discuss that one here: the other two are not discussed. Grace!

For those of you who can still remember your high school maths, this should be familiar. But for some people high school maths was a long time ago, and others of us didn’t listen very hard in high school. It’s not complicated. As I’m sure everyone will probably remember the moment they read this, the act of multiplying a number \(x\) by itself \(n\) times is called “raising \(x\) to the \(n\)-th power”. Mathematically, this is written as \(x^n\). Some values of \(n\) have special names: in particular, \(x^2\) is called \(x\)-squared (x kwadraat, in Dutch), and \(x^3\) is called \(x\)-cubed. So, the 4th power of 5 is calculated like this:

\[ 5^4 = 5 \times 5 \times 5 \times 5 \]

One way that we could calculate \(5^4\) in R would be to type in the complete multiplication as it is shown in the equation above. That is, we could do this

5 * 5 * 5 * 5
## [1] 625

but it does seem a bit tedious. It would be very annoying indeed if you wanted to calculate \(5^{15}\), since the command would end up being quite long. Therefore, to make our lives easier, we use the power operator instead. When we do that, our command to calculate \(5^4\) goes like this:

5 ^ 4
## [1] 625

Much easier. Another way to do this is by using ** instead of ^.

Exercise: Use ** to obtain the 4th power of 5.
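To see how much typing the power operator saves, here is a small sketch based on the \(5^{15}\) example mentioned above. Both spellings of the operator are shown. The output is deliberately left out, so you can run the lines and compare the results yourself.

```r
# The long way: fifteen 5s multiplied together
5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5

# The short way, with either spelling of the power operator
5 ^ 15
5 ** 15
```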

2.2.3 Doing calculations in the right order

In most situations where you would want to use a calculator, you might want to do multiple calculations. R lets you do this, just by typing in longer commands. In fact, we’ve already seen an example of this earlier, when I typed in 5 * 5 * 5 * 5. However, let’s try a slightly different example:

1 + 2 * 4
## [1] 9

Clearly, this isn’t a problem for R either. However, it’s worth stopping for a second, and thinking about what R just did. Clearly, since it gave us an answer of 9 it must have multiplied 2 * 4 (to get an interim answer of 8) and then added 1 to that. But, suppose it had decided to just go from left to right: if R had decided instead to add 1+2 (to get an interim answer of 3) and then multiplied by 4, it would have come up with an answer of 12. To understand why R did the former rather than the latter, you need to know the order of operations that R uses.

If you remember back to your high school maths classes, it’s actually the same order that you got taught when you were at school. In some English speaking countries, this is known as the “BEDMAS” order¹. That is, first calculate things inside Brackets (), then calculate Exponents ^, then Division / and Multiplication *, then Addition + and Subtraction -. So, to continue the example above, if we want to force R to calculate the 1+2 part before the multiplication, all we would have to do is enclose it in brackets:

(1 + 2) * 4 
## [1] 12

This is a fairly useful thing to be able to do.

The only other thing I should point out about order of operations is what to expect when you have two operations that have the same priority: that is, how does R resolve ties? For instance, multiplication and division are actually the same priority, but what should we expect when we give R a problem like 4 / 2 * 3 to solve? If it evaluates the multiplication first and then the division, it would calculate a value of two-thirds. But if it evaluates the division first it calculates a value of 6. The answer, in this case, is that R goes from left to right, so in this case, the division step would come first:

4 / 2 * 3
## [1] 6

All of the above being said, it’s helpful to remember that brackets always come first. So, if you’re ever unsure about what order R will do things in, an easy solution is to enclose the thing you want it to do first in brackets. There’s nothing stopping you from typing (4 / 2) * 3. By enclosing the division in brackets we make it clear which thing is supposed to happen first. In this instance, you wouldn’t have needed to, since R would have done the division first anyway, but when you’re first starting out it’s better to make sure R does what you want!
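As a quick sketch of the tie-breaking discussed above, the two possible readings of 4 / 2 * 3 can be compared directly. Brackets are used to force the reading that R does not choose by default.

```r
# R's default: left to right, so the division happens first
4 / 2 * 3
## [1] 6

# Brackets force the multiplication to happen first, giving two-thirds
4 / (2 * 3)
## [1] 0.6666667
```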

Exercise: A good learning trick is to try typing in a few different variations on what I’ve done here. Experiment a bit with your commands, to learn what works and what doesn’t.

2.3 Storing a number as a variable

Okay. At this point, you know how to take one of the most powerful pieces of statistical software in the world, and use it as a $2 calculator. That’s not nothing (you could argue that you’ve just saved yourself $2) but on the other hand, it’s not very much either. In order to use R more effectively, we need to introduce more programming concepts.

One of the most important things to be able to do in R (or any programming language, for that matter) is to store information in variables. At a conceptual level, you can think of a variable as a label for a certain piece of information, or even several different pieces of information. Let’s look at the very basics for how we create variables and how to work with them.

Since we’ve been working with numbers so far, let’s start by creating variables to store our numbers. And since most people like concrete examples, let’s invent one. Suppose I’m trying to calculate how much money I’m going to make from this book. There are several different numbers I might want to store. Firstly, I need to figure out how many copies I’ll sell. This isn’t exactly Harry Potter, so let’s assume I’m only going to sell one copy per student in my class. That’s 350 sales, so let’s create a variable called sales. What I want to do is assign a value to my variable sales, and that value should be 350. We do this by using the assignment operator, which is <-. Here’s how we do it:

sales <- 350

When you run this command in R, R doesn’t print out any output. This, however, does not mean all your efforts were in vain and nothing happened. Behind the scenes, you did make an impact. By typing that line, you made R create a variable called sales and give it a value of 350.

You don’t believe me? Good for you. But you can check that this has happened by asking R to print the variable on screen. The simplest way to do that is to type the name of the variable and hit ctrl + enter.

Exercise: Type the variable sales and run the code.

So that’s nice to know. Anytime you can’t remember what R has got stored in a particular variable, you can just type the name of the variable and hit enter.

Okay, so now we know how to assign variables. Actually, there’s a bit more you should know. Firstly, one of the curious features of R is that there are several different ways of making assignments. In addition to the <- operator, we can also use -> and =. Note, however, that the <- operator is by far the most widely used and that it is hard to spot a -> operator in the wild.

Let’s start by considering ->, since that’s the easy one. As you might expect from just looking at the symbol, it’s almost identical to <-. It’s just that the arrow (i.e., the assignment) goes from left to right. So if I wanted to define my sales variable using ->, I would write it like this:

350 -> sales

This has the same effect, and it still means that I’m only going to sell 350 copies. Sigh. Apart from this superficial difference, <- and -> are identical. In fact, as far as R is concerned, they’re actually the same operator, just in a “left form” and a “right form”.

A quick reminder: when using operators like <- and -> that span multiple characters, you can’t insert spaces in the middle. That is, if you type - > or < -, R will interpret your command the wrong way.

Exercise: Wanna try? Run s < - 3

Now =. Although it is not visible in the symbol itself, = does have a direction.

sales = 350

works, whereas

350 = sales
## Error in 350 = sales: invalid (do_set) left-hand side to assignment

doesn’t work.
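The three assignment forms, and the effect of a stray space inside the arrow, can be sketched as follows. One assumption to flag: the FALSE that R prints below is a logical value, which has not been introduced at this point in the text.

```r
# Three ways of storing 350 in the variable sales
sales <- 350
350 -> sales
sales = 350
sales
## [1] 350

# With a space inside the arrow, R no longer sees an assignment.
# It reads this as the question "is sales smaller than minus 3?"
sales < - 3
## [1] FALSE

# The variable itself is left untouched
sales
## [1] 350
```

This is why such a typo is easy to miss: R does not complain, it simply answers a different question and leaves the variable as it was.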

2.4 Working with variables

One final thing you need to understand about creating variables (for now, that is) is how R overwrites stuff. You could imagine that, if I now wrote sales <- 450, R would balk, and complain that sales has already been defined and that I should make up my mind, for once in my life, and stop being the fickle person that I am and grow a backbone. Let’s find out:

Exercise: Assign a new value to the variable sales and let R show it.

R graciously accepts my whims, and just pretends nothing has happened. R has overwritten the earlier value we had for sales. There is no memory left of the 350. Check it in the box above.
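A minimal sketch of the overwriting behaviour described above, using the values 350 and 450 from the text:

```r
sales <- 350   # first value
sales <- 450   # silently replaces the first value
sales
## [1] 450
```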

2.5 Doing calculations with variables

Okay, let’s get back to my original story. In my quest to become rich, I’ve written this textbook. To figure out how good this strategy is, I’ve started creating some variables in R. In addition to defining a sales variable that counts the number of copies I’m going to sell, I can also create a variable called royalty, indicating how much money I get per copy. Let’s say that my royalties are about $7 per book:

sales <- 350
royalty <- 7

The nice thing about variables (in fact, the whole point of having variables) is that we can do anything with a variable that we ought to be able to do with the information that it stores. That is, R allows the multiplication of 350 by 7

350 * 7
## [1] 2450

The good news is that it also allows the multiplication of sales by royalty.

Exercise: Multiply sales by royalty.

As far as R is concerned, the sales * royalty command is the same as the 350 * 7 command.

Not surprisingly, I can assign the output of this calculation to a new variable, which I’ll call revenue. And when we do this, the new variable revenue gets the value 2450. So let’s do that, and then get R to print out the value of revenue (by just typing revenue) so that we can verify that it’s done what we asked:

revenue <- sales * royalty
revenue
## [1] 2450

That’s fairly straightforward. A slightly more subtle thing we can do is reassign the value of my variable, based on its current value. For instance, suppose that one of my students (no doubt under the influence of psychotropic drugs) loves the book so much that he or she donates an extra $550 to me. The simplest way to capture this is by a command like this:

revenue <- revenue + 550
revenue
## [1] 3000

In this calculation, R has taken the old value of revenue (i.e., 2450) and added 550 to that value, producing a value of 3000. This new value is assigned to the revenue variable, overwriting its previous value. In any case, we now know that I’m expecting to make $3000 off this.

2.6 Storing many numbers as a vector

Let’s return to our discussion of variables. When I introduced variables in Section 2.3, I showed you how we can use variables to store a single number. In this section, we’ll extend this idea and look at how to store multiple numbers within the one variable. In R, the name for a variable that can store multiple values is a vector. So let’s create one.

Let’s stick to my silly “get rich quick by textbook writing” example. Suppose the textbook company (if I actually had one, that is) sends me sales data on a monthly basis. Since my class starts in late February, we might expect most of the sales to occur towards the start of the year. Let’s suppose that I have 100 sales in February, 200 sales in March and 50 sales in April, and no other sales for the rest of the year. What I would like to do is have a variable – let’s call it sales.by.month – that stores all this sales data. The first number stored should be 0 since I had no sales in January, the second should be 100, and so on. The simplest way to do this in R is to use the combine function, c().² To do so, all we have to do is type all the numbers we want to store in a comma-separated list, like this:

sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
sales.by.month
##  [1]   0 100 200  50   0   0   0   0   0   0   0   0

To use the correct terminology here, we have a single variable here called sales.by.month: this variable is a vector that consists of 12 elements.
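If you want to check the number of elements rather than count them by eye, one option is the length() function. It is a standard R function, but it has not been introduced in the text so far, so treat this as a small preview.

```r
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)

# length() reports how many elements a vector has
length(sales.by.month)
## [1] 12
```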

R is rather flexible in how you use the c() function. It works on both numbers and vectors at the same time. Say that you want to create a new variable with sales in even more months, where you add sales in a new month. It works like this:

sales.by.month.extended <- c(sales.by.month, 99) 
sales.by.month.extended
##  [1]   0 100 200  50   0   0   0   0   0   0   0   0  99

You ask R to combine the vector we have and appreciate (sales.by.month) with a new number (99). No biggie for R.

Worse, or better, yet, you can even define a variable by using that variable (if it exists, of course):

sales.by.month.extended <- c(sales.by.month.extended, 299)
sales.by.month.extended
##  [1]   0 100 200  50   0   0   0   0   0   0   0   0  99 299

R is most emphatically not flexible about whether or not you should use c(). It is a very common beginner mistake to forget it, but R is unforgiving if you do any of this:

sales.by.month <- (0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
sales.by.month <- 0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0
sales.by.month <- [0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0]
## Error: <text>:1:21: unexpected ','
## 1: sales.by.month <- (0,
##                         ^

2.7 Working with a vector

2.7.1 Getting information out of a vector

Let’s consider the problem of how to get information out of a vector. At this point, you might have a sneaking suspicion that the answer has something to do with the [1] thing that R has been printing out. And of course, you are correct. Suppose I want to pull out the February sales data only. February is the second month of the year, so let’s try this:

sales.by.month[2]
## [1] 100

Yep, that’s the February sales all right. The bottom line is that we can use square brackets [] to get info out of a vector.

But there’s a subtle detail to be aware of here: notice that R outputs [1] 100, not [2] 100. This is because R is being extremely literal. When we typed in sales.by.month[2], we asked R to find exactly one thing, and that one thing happens to be the second element of our sales.by.month vector. So, when it outputs [1] 100 what R is saying is that the first number that we just asked for is 100.

This behaviour makes more sense when you realise that we can use this trick to create new variables. For example, I could create a february.sales variable like this:

february.sales <- sales.by.month[2]
february.sales
## [1] 100

Obviously, the new variable february.sales should only have one element and so when I print it out this new variable, the R output begins with a [1] because 100 is the value of the first (and only) element of february.sales. The fact that this also happens to be the value of the second element of sales.by.month is irrelevant.

In the previous example, we only used a single number (i.e., 2) to indicate which element we wanted. Of course, you can also access several elements at once, by using a vector of numbers instead. For example, the sales from the first three months are extracted as follows:

sales.by.month[c(1,2,3)]
## [1]   0 100 200

Or, the sales from months 7 and 9:

sales.by.month[c(7,9)]
## [1] 0 0

2.7.2 Altering the elements of a vector

Sometimes you’ll want to change the values stored in a vector. Imagine my surprise when the publisher rings me up to tell me that the sales data for May are wrong. There were actually an additional 25 books sold in May, but there was an error or something so they hadn’t told me about it. How can I fix my sales.by.month variable? One possibility would be to assign the whole vector again from the beginning, using c(). But that’s a lot of typing. Also, it’s a little wasteful: why should R have to redefine the sales figures for all 12 months, when only the 5th one is wrong? Fortunately, we can tell R to change only the 5th element, using this trick³:

sales.by.month[5] <- 25
sales.by.month
##  [1]   0 100 200  50  25   0   0   0   0   0   0   0

Exercise: It is always interesting to see how a program (or a human, for that matter) behaves when confronted with something unexpected or impossible. Try to change a nonexistent element of sales.by.month. First, let’s try to assign the value 22 to the 13th element.

2.7.3 Using a shorthand to access a vector

R somewhat kindly provides you with handy shortcuts for very common situations. For instance, suppose that I wanted to use the vector c(2,3,4,5,6,7,8). I could do

c(2,3,4,5,6,7,8)
## [1] 2 3 4 5 6 7 8

but it’s kind of a lot of typing. To help make this easier, R lets you use 2:8 as shorthand for c(2,3,4,5,6,7,8), which makes things a lot simpler.

Exercise: You don’t have to believe me (in fact, I’d rather you didn’t!). Let’s just check that this is true.
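So as not to give away the exercise, here is the same shorthand with a different range. The sketch simply prints the sequence from 1 to 6, which is also the range used in the examples that follow.

```r
# The colon shorthand produces every whole number from 1 up to 6
1:6
## [1] 1 2 3 4 5 6
```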

This shorthand is especially useful for accessing elements from a vector. For example, the sales from the first six months are extracted as follows:

sales.by.month[c(1,2,3,4,5,6)]
## [1]   0 100 200  50  25   0

but more conveniently using

sales.by.month[1:6]
## [1]   0 100 200  50  25   0

And yes, you can also use it to alter elements of a vector.

sales.by.month[3:7] <- 2
sales.by.month
##  [1]   0 100   2   2   2   2   2   0   0   0   0   0

Any idea why the next line makes R complain?

sales.by.month[1:3] <- c(9, 19)
## Warning in sales.by.month[1:3] <- c(9, 19): number of items to replace is not a
## multiple of replacement length

Well, you tell R to replace elements 1, 2 and 3 with some numbers. You even tell R which numbers the new ones should be, by specifying c(9, 19). But you don’t give enough: you want 3 numbers to be replaced, but you only provide 2. R thinks you are a bully, but still responds quite nicely: it only issues a warning, and carries out the replacement anyway by reusing the 9 for the third element.

It might, then, come as a surprise that this works without any warning:

sales.by.month[1:4] <- c(9, 19)
sales.by.month
##  [1]  9 19  9 19  2  2  2  0  0  0  0  0

So you want 4 numbers to be replaced and you only provide 2? The reason this does work is because of the recycling rule, discussed below (Section XXX).

Because we will use it later, I will restore the original sales.by.month:

sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)

2.8 Doing calculations with vectors

2.8.1 Using a single number

You often want to alter all of the elements of a vector at once. For instance, suppose I wanted to figure out how much money I made in each month. Since I’m earning an exciting $7 per book (no seriously, that’s actually pretty close to what authors get on the very expensive textbooks that you’re expected to purchase), what I want to do is multiply each element in the sales.by.month vector by 7. R makes this pretty easy, as the following example shows:

sales.by.month * 7
##  [1]    0  700 1400  350    0    0    0    0    0    0    0    0

In other words, when you multiply a vector by a single number, all elements in the vector get multiplied. The same is true for addition, subtraction, division and taking powers. So that’s neat.
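
Here is the same idea with the other operators, as a quick sketch. The vector sales is a shortened, made-up version of sales.by.month, just to keep the output small:

```r
sales <- c(0, 100, 200, 50)   # shortened example vector (an assumption for this sketch)
sales + 10    # add 10 to every element
## [1]  10 110 210  60
sales - 10    # subtract 10 from every element
## [1] -10  90 190  40
sales / 2     # halve every element
## [1]   0  50 100  25
sales^2       # square every element
## [1]     0 10000 40000  2500
```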

2.8.2 Using another vector

Sometimes my (non-existent) publisher is in a good mood and decides to give me a bonus. In January, I get a 1 dollar bonus, in February a 2 dollar bonus, and so on.

bonus <- 1:12
bonus
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12

Computing my total profit is easily done:

profit <- sales.by.month * 7
profit
##  [1]    0  700 1400  350    0    0    0    0    0    0    0    0
bonus
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
profit_bonus <- profit + bonus
profit_bonus
##  [1]    1  702 1403  354    5    6    7    8    9   10   11   12

So the nth element of bonus is added to the nth element of profit.

On the other hand, suppose I wanted to know how much money I was making per day, rather than per month (dropping the bonus, which only existed in my imagination anyway). Since not every month has the same number of days, I need to do something slightly different. First, I’ll create a new vector:

days.per.month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

The days.per.month variable is pretty straightforward. What I want to do is divide every element of profit by the corresponding element of days.per.month. Again, R makes this pretty easy:

profit / days.per.month
##  [1]  0.00000 25.00000 45.16129 11.66667  0.00000  0.00000  0.00000  0.00000
##  [9]  0.00000  0.00000  0.00000  0.00000

Notice that the second element of the output is 25 because R has divided the second element of profit (i.e. 700) by the second element of days.per.month (i.e. 28). Similarly, the third element of the output is equal to 1400 divided by 31, and so on.

2.8.3 The recycling rule

There’s one semi-advanced thing that I should mention about how vector arithmetic works in R, and that’s the recycling rule. It is fairly straightforward, but can be confusing to novices. The easiest way to explain it is to give a simple example. Suppose I have two vectors of different length, x and y, and I want to add them together. It’s not obvious what that actually means, so let’s have a look at what R does:

x <- c(1,1,1,1,1,1)  
y <- c(0,1)          
x + y                  
## [1] 1 2 1 2 1 2

Try to understand what’s going on, from looking at this output.

As you can see, what R has done is “recycle” the value of the shorter vector (in this case y, of length 2, as compared to x of length 6) several times. That is, the first element of x is added to the first element of y, and the second element of x is added to the second element of y. However, when R reaches the third element of x there isn’t any corresponding element in y, so it returns to the beginning: thus, the third element of x is added to the first element of y. This process continues until R reaches the last element of x. And that’s all there is to it really. The same recycling rule also applies for subtraction, multiplication and division.
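
As a quick check that recycling is not special to addition, here is a sketch with two made-up vectors (the names and values are chosen only so that the pattern is easy to spot):

```r
long.vec  <- c(10, 10, 10, 10, 10, 10)   # the longer vector (length 6)
short.vec <- c(1, 2)                     # the shorter vector (length 2), used three times
long.vec - short.vec
## [1] 9 8 9 8 9 8
long.vec * short.vec
## [1] 10 20 10 20 10 20
long.vec / short.vec
## [1] 10  5 10  5 10  5
```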

Someone paying close attention might wonder what happens if the length of the longer vector isn’t an exact multiple of the length of the shorter one, for instance when the longer vector has 5 elements and the shorter one has 2. Let’s see:

x <- c(1,1,1,1,1)    
y <- c(0,1)          
x + y                  
## Warning in x + y: longer object length is not a multiple of shorter object
## length
## [1] 1 2 1 2 1

R still does it, but also gives you a warning message. Warnings are important, and shouldn’t be ignored. For now, though, we will do exactly that; I will say a bit more about warnings in Section 3.1.
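
If you would rather find out before R has to warn you, you can compare the lengths yourself. The sketch below is only one way of doing so. It relies on the remainder operator %%, which I am assuming you have met or are willing to take on trust: a %% b gives what is left over after dividing a by b.

```r
x <- c(1, 1, 1, 1, 1)
y <- c(0, 1)
length(x)                 # 5 elements
## [1] 5
length(y)                 # 2 elements
## [1] 2
length(x) %% length(y)    # a remainder of 0 would mean y fits into x exactly
## [1] 1
```

Since the remainder is not 0, the shorter vector cannot be recycled a whole number of times, which is exactly what the warning is about.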

2.9 Using functions to do calculations

The symbols +, -, * and so on are examples of operators. As we’ve seen, you can do quite a lot of calculations just by using these operators. However, in order to do more advanced calculations (and later on, to do actual statistics), you’re going to need to start using functions. I’ll talk in more detail about functions and how they work in Section 9.1, but for now let’s just dive in and use a few.

2.9.1 Using a single number

To get started, suppose I wanted to take the square root (vierkantswortel, in Dutch) of 225. The square root, in case your high school maths is a bit rusty, is just the opposite of squaring a number. So, for instance, since “5 squared is 25” I can say that “5 is the square root of 25”. The usual notation for this is \(\sqrt{25} = 5\).

To calculate the square root of 25, I can do it in my head pretty easily, since I memorised my multiplication tables when I was a kid. It gets harder when the numbers get bigger, and pretty much impossible if they’re not whole numbers. This is where something like R comes in very handy. Let’s say I wanted to calculate \(\sqrt{225}\), the square root of 225. Here is how I could do this using R.

R provides a square root function, sqrt(). To calculate the square root of 225 using this function, what I do is insert the number 225 in the parentheses.

Exercise: Calculate the square root of 225 using the sqrt() function.

When we use a function to do something, we generally refer to this as calling the function, and the values that we type into the function (in general, there can be more than one) are referred to as the arguments of that function.

Note how we provide the arguments inside the round brackets. What happens if you would inadvertently use square brackets?

If you type sqrt[225], R will think you want the 225th element of the sqrt object. But sqrt is a function, not a vector, so there are no elements to select from, and R thinks you are being unreasonable.

The party is hardly over! There are lots of other functions in R: in fact, almost everything of interest that I’ll talk about in this book is an R function of some kind. For example, one function that we will need to use in this book is the absolute value function. Compared to the square root function, it’s extremely simple: it just converts negative numbers to positive numbers and leaves positive numbers alone. Mathematically, the absolute value of \(x\) is written \(|x|\). Calculating absolute values in R is pretty easy since R provides the abs() function that you can use for this purpose.

Exercise: Feed the abs() function a positive number (e.g., 21).

Here, the absolute value function does nothing to it at all.

Exercise: Feed the abs() function a negative number (e.g., -13).

It now spits out the positive version of the same number.

Before moving on, it’s worth noting that – in the same way that R allows us to put multiple operations together into a longer command, like 1 + 2*4 for instance – it also lets us put functions together and even combine functions with operators if we so desire. For example, the following is a perfectly legitimate command:

sqrt( 1 + abs(-8) )

Exercise: What is the result of this computation? Use R to confirm.

When R executes this command, it starts out by calculating the value of abs(-8), which produces an intermediate value of 8. Having done so, the command simplifies to sqrt( 1 + 8 ). To solve the square root4 it first needs to add 1 + 8 to get 9, at which point it evaluates sqrt(9), and so it finally outputs a value of 3.
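
If you find this hard to follow, it can help to spell out the intermediate steps yourself. The variable names below are made up for this sketch:

```r
inner <- abs(-8)        # step 1: the innermost function call
inner
## [1] 8
total <- 1 + inner      # step 2: the addition inside sqrt()
total
## [1] 9
sqrt(total)             # step 3: the outermost function call
## [1] 3
sqrt( 1 + abs(-8) )     # the one-line version gives the same answer
## [1] 3
```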

2.9.2 Using a vector

The examples above only took single numbers as input. Some of you might be wondering whether you can also input a vector.

Exercise: Wonder no more! Just try. Give a vector, e.g., c(25, 49, 36), as input to the sqrt() function.

If you did everything right (for example, if you didn’t forget the c()), you will have seen that the sqrt() function just does whatever it does (taking the square root, in this case) on each element separately. So this function works on a vector element-wise.

Not every function works on a vector element-wise. You often find yourself wanting to know how many elements there are in a vector (usually because you’ve forgotten). You can use the length() function to do this. It’s quite straightforward:

length(sales.by.month) 
## [1] 12
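
To make the contrast concrete, here is a small sketch using a made-up vector: sqrt() returns one value per element, whereas length() returns a single value for the whole vector.

```r
squares <- c(25, 49, 36)   # a made-up vector for this illustration
sqrt(squares)              # element-wise: three numbers in, three numbers out
## [1] 5 7 6
length(squares)            # not element-wise: three numbers in, one number out
## [1] 3
```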

2.10 Some tips and caveats

2.10.1 Combining stuff and the work-from-within-rule

The real power of R only shows when all of this is combined. For example, you might want to take the square root of 20 + 5. You can combine both operations (adding and taking the square root) in a single line:

sqrt(20 + 5)
## [1] 5

Or you might want to take the square root of the absolute value of -25. You can do that in two steps

abs.val <- abs(-25)
sqrt( abs.val )
## [1] 5

but more conveniently in one:

sqrt( abs(-25) )
## [1] 5

The longer the expression, the more is happening, but also the harder it is to understand. Unlike in English, where you read from left to right, in R it often pays to read from within, especially when brackets are involved. So when you want to understand sqrt( abs(-25) ), you could read it from left to right as “I take the square root of the absolute value of -25”, but I have the impression that most students prefer the from-within approach: “I first take the absolute value of -25 and then take the square root”.

2.10.2 Brackets

We have seen two types of brackets: round brackets () and square brackets []. Later, we will encounter curly brackets {} (and the dreaded double square brackets [[]]). Calling a function requires round brackets. Selecting elements from a vector (and, as we will later see, from matrices and data frames too) requires square brackets.

So this doesn’t work to compute the square root of 25:

sqrt[25]
## Error in sqrt[25]: object of type 'builtin' is not subsettable

because R thinks you want the 25th element of sqrt, which is a function rather than a vector and so has no elements to select.

Nor does this work for accessing the second element of x:

x <- c(1, 2, 3)
x(2)
## Error in x(2): could not find function "x"

because R thinks you want to call the (non-existent) function called x with 2 as its input.

Of course, square brackets can appear close to a function, for example like this:

sqrt(c(1,4,9,25))[2]
## [1] 2

What happens is that the function sqrt acts on the input c(1,4,9,25), which is included between round brackets as it should. The function then produces a vector, of which we then select the 2nd element, using square brackets, as it should.
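
If the one-liner looks cramped, the same thing can be done in two steps. The name roots is just something I picked for this sketch:

```r
roots <- sqrt(c(1, 4, 9, 25))   # round brackets: call the function
roots
## [1] 1 2 3 5
roots[2]                        # square brackets: select the 2nd element
## [1] 2
```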

2.11 Using comments

Before discussing any of the more complicated R stuff, I want to introduce the comment character, #. It has a simple meaning: it tells R to ignore everything else you’ve written after it. You won’t have much need of the # character immediately, but it’s very useful later on when writing scripts (see Section 5). However, while you don’t need to use it, I want to be able to include comments in my R extracts. For instance, if you read this:

seeker <- 3.1415           # create the first variable
lover <- 2.7183            # create the second variable
keeper <- seeker * lover   # now multiply them to create a third one
keeper # print out the value of 'keeper'
## [1] 8.539539

it’s a lot easier to understand what I’m doing than if I just write this:

seeker <- 3.1415
lover <- 2.7183
keeper <- seeker * lover
keeper
## [1] 8.539539

You’ll start seeing # characters appearing in the extracts, with some human-readable explanatory remarks next to them. These are still perfectly legitimate commands since R knows that it should ignore the # character and everything after it. But hopefully, they’ll help make things a little easier to understand.

Exercise: Double check that R really doesn’t read what’s behind the #. For example, run 10 + 20 #I HATE YOU, R in the box below, and check whether R is still your faithful servant, despite you expressing your negative feelings.

You will see that R is completely unfazed by your comment.

Exercise: OK, this isn’t quite a bulletproof test. R might have actually read your comment, but just doesn’t care about the feelings of a human, or might be used to their abuse. Here is a better test. Run the following two lines in the box below: on the first line, type 10 + 20 #x <- 40; on the second line, check whether R knows the value of x by typing x.

R will tell you x cannot be found. So either R didn’t read the comment, or R is very good at denying it did.

2.12 Go out and play

If you have done and understood all of this, you can find some additional exercises in the RIntroExercises document. You can use this box to work on the exercises for Chapter 2. Note that there are two documents: one with only the questions, and one with the questions as well as some suggested solutions.

3 More fun with R

3.1 Errors and warnings

Before moving on to more complicated R stuff, I want to talk a bit about the errors and warnings that R, sometimes helpfully, sometimes spitefully, throws at you. We’ve come across some of these already, but I feel I should give them a bit more attention, as they are cries for help, and as a good psychologist, you should be trained to recognize and act upon cries for help.

Both errors and warnings signal that something is off. The difference is that, when R throws an error, it means you are done for. R couldn’t do what you asked it to do, so it stopped, producing no output. With a warning, in contrast, it powered through and produced output, but it thinks something could be off, so you should look at the code and the output with extra care.
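
Here is a small sketch of that difference. The variable names are made up, and I use try(), a function we haven’t covered, purely so that the lines after the error still get run. Treat it as scaffolding for the demonstration rather than as something you need to know.

```r
# A warning: R complains (the warning message is printed here),
# but it still computes and stores a result.
with.warning <- 1:5 + 1:2
with.warning
## [1] 2 4 4 6 6

# An error: R stops, so nothing is stored in with.error.
try(with.error <- sqrt("nine"), silent = TRUE)
exists("with.error")    # was the variable ever created?
## [1] FALSE
```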

It would be, in equal measure, impossible and maddening to explain all the error and warning messages R produces. My general advice is to read them carefully, as they sometimes make sense. When they don’t, the warning or error message at least gives you something you can use when looking for help.

Just to get an idea, here are a few:

#example 1
y <- z + 2
## Error in eval(expr, envir, enclos): object 'z' not found
#example 2
sqrt(225))
## Error: <text>:2:10: unexpected ')'
## 1: #example 2
## 2: sqrt(225))
##             ^
#example 3
sqrt{225}
## Error: <text>:2:5: unexpected '{'
## 1: #example 3
## 2: sqrt{
##        ^
#example 4
skwairroet(225)
## Error in skwairroet(225): could not find function "skwairroet"
#example 5
1:5 + 1:6   
## Warning in 1:5 + 1:6: longer object length is not a multiple of shorter object
## length
## [1]  2  4  6  8 10  7

3.2 Text data

A lot of the time your data will be numeric in nature, but not always. Sometimes your data really needs to be described using text, not using numbers.

3.2.1 Storing text data

To address this, we need to consider the situation where our variables store text. To create a variable that stores the word “hello”, we can type this:

greeting <- "hello"
greeting
## [1] "hello"

When interpreting this, it’s important to recognise that the quote marks here aren’t part of the string itself. They’re just something that we use to make sure that R knows to treat the characters that they enclose as a piece of text data, known as a character string. In other words, R treats "hello" as a string containing the word “hello”; but if I had typed hello instead, R would go looking for a variable by that name! You can also use 'hello' to specify a character string.

Okay, so that’s how we store the text. Next, it’s important to recognise that when we do this, R stores the entire word "hello" as a single element: our greeting variable is not a vector of five different letters. Rather, it has only one element, and that element corresponds to the entire character string "hello".

Exercise: Just to be sure, ask R how many elements greeting has.

If you typed length(greeting), which you should have, you see that as far as R is concerned, greeting consists of a single element only.

Exercise: What could that single element be, you surely wonder? Ask R for the first element of greeting.

You see that if you actually ask R to find the first element of greeting, by typing greeting[1], it prints the whole string.
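
For contrast, here is a sketch of what a vector of five separate letters would look like. The variable letters.of.hello is made up for this illustration:

```r
greeting <- "hello"                             # one string, so one element
length(greeting)
## [1] 1
letters.of.hello <- c("h", "e", "l", "l", "o")  # five strings, so five elements
length(letters.of.hello)
## [1] 5
letters.of.hello[1]                             # the first element is a single letter
## [1] "h"
```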

3.2.2 Storing text data as a vector

Of course, there’s no reason why I can’t create a vector of character strings, just like I can create a vector of numerical elements. For instance, if we were to continue with the example of my attempts to look at the monthly sales data for my book, one variable I might want would include the names of all 12 months. To do so, I could type in a command like this, again using the combine function c():

months <- c("January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November", 
            "December")

This is a character vector containing 12 elements, each of which is the name of a month.

Exercise: Get R to tell you the name of the fourth month. You know the answer to that question, so you will know if you did it right.

# selecting the ith element can be done using [i]
months[4]

Exercise: Get R to tell you how many months there are in a year. You know the answer to that question, so you will know if you did it right.

# counting the number of elements can be done using length()
length(months)

3.2.3 Working with text data

Working with text data is somewhat more complicated than working with numeric data. So far, most of the numerical operations (addition, etc.) and functions (e.g., sqrt(), abs()) that we have seen only make sense when applied to numeric data.

Here’s a question you never thought you would ask: can you do numerical operations on a character vector? For example, can you take the square root of months?

Exercise: Well, can you?

No. months + 1, months * 3, months + months, months^2 and sqrt(months) are all meaningless. R agrees, and throws an error.

We’ve seen one function that can be applied to pretty much any variable or vector (i.e., length()). It might be nice to see another example of a function that can be applied to text. The function I’m going to introduce you to is called nchar(), and what it does is count the number of individual characters that make up a string. Recall earlier that when we tried to calculate the length() of our greeting variable it returned a value of 1: the greeting variable contains only one string, which happens to be "hello". But what if I want to know how many letters there are in the word? Sure, I could count them, but that’s boring, and more to the point it’s a terrible strategy if what I wanted to know was the number of letters in War and Peace. That’s where the nchar() function is helpful:

nchar( greeting ) 
## [1] 5

That makes sense, since there are in fact 5 letters in the string "hello". Better yet, you can apply nchar() to whole vectors. So, for instance, if I want R to tell me how many letters there are in the names of each of the 12 months, I can do this:

nchar( months )  
##  [1] 7 8 5 5 3 4 4 6 9 7 8 8

So that’s nice to know. The nchar() function can do a bit more than this, and there are a lot of other functions that you can use to extract more information from text or do all sorts of fancy things. However, the goal here is not to teach any of that! The goal right now is just to see an example of a function that actually does work when applied to text.

Note that nchar() also works on numerics. Exhibit 1:

a.number.i.like <- 17
nchar( a.number.i.like )  
## [1] 2
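
This works because R first turns the number into text and then counts the characters. A small sketch, with values chosen purely for illustration:

```r
nchar( 17 )                # two digits
## [1] 2
nchar( "17" )              # the same digits, stored as text
## [1] 2
nchar( c(5, 17, 1234) )    # works element-wise on a numeric vector too
## [1] 1 2 4
```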

3.3 Logical data aka “true” or “false” data

Time to move on to a third kind of data. A key concept that a lot of R relies on is the idea of a logical value. A logical value is an assertion about whether something is true or false. This is implemented in R in a pretty straightforward way. There are two logical values, namely TRUE and FALSE. Despite the simplicity, logical values are very useful things. Let’s see how they work.

3.3.1 Assessing mathematical truths

In George Orwell’s classic book 1984, one of the slogans used by the totalitarian Party was “two plus two equals five”, the idea being that the political domination of human freedom becomes complete when it is possible to subvert even the most basic of truths. It’s a terrifying thought, especially when the protagonist Winston Smith finally breaks down under torture and agrees to the proposition. “Man is infinitely malleable”, the book says. I’m pretty sure that this isn’t true of humans but it’s definitely not true of R. R is not infinitely malleable. It has rather firm opinions on the topic of what is and isn’t true, at least as regards basic mathematics. If I ask it to calculate 2 + 2, it always gives the same answer, and it’s not bloody 5:

2 + 2
## [1] 4

Of course, so far R is just doing the calculations. I haven’t asked it to explicitly assert that \(2+2 = 4\) is a true statement. If I want R to make an explicit judgement, I can use a command like this:

2 + 2 == 4
## [1] TRUE

What I’ve done here is use the equality operator, ==, to force R to make a “true or false” judgement.5 Note that this is very different from, and should not be confused with, the assignment operator <- (or its alternative, =), which we use to make sure a variable takes a value. With ==, we ask R whether a variable takes a certain value.
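
A small sketch of the difference, using a made-up variable a: the assignment changes a, whereas the comparison only asks a question and leaves a untouched.

```r
a <- 4      # assignment: a now takes the value 4
a == 4      # comparison: is a equal to 4?
## [1] TRUE
a == 5      # comparison: is a equal to 5?
## [1] FALSE
a           # the comparisons did not change a
## [1] 4
```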

Okay, let’s see what R thinks of the Party slogan:

2 + 2 == 5
## [1] FALSE

Booyah! Freedom and ponies for all! Or something like that.

Anyway, it’s worth having a look at what happens if I try to force R to believe that two plus two is five by making an assignment statement like 2 + 2 <- 5. When I do this, here’s what happens:

2 + 2 <- 5
## Error in 2 + 2 <- 5: target of assignment expands to non-language object

R doesn’t like this very much. It recognises that 2 + 2 is not a variable (that’s what the “non-language object” part is saying), and it won’t let you try to “reassign” it. While R is pretty flexible and actually does let you do some quite remarkable things to redefine parts of R itself, there are just some basic, primitive truths that it refuses to give up. It won’t change the laws of addition, and it won’t change the definition of the number 2.

That’s probably for the best.

3.3.2 Storing logical data

Up to this point, I’ve introduced numeric data (in Sections 2.3 and 2.6) and character data (in Section 3.2). So you might not be surprised to discover that these TRUE and FALSE values that R has been producing are actually a third kind of data, called logical data. That is, when I asked R if 2 + 2 == 5 and it said [1] FALSE in reply, it was actually producing information that we can store in variables. For instance, I could create a variable called is.the.Party.correct, which would store R’s opinion:

is.the.Party.correct <- 2 + 2 == 5
is.the.Party.correct
## [1] FALSE

Alternatively, you can assign the value directly, by typing TRUE or FALSE in your command. Like this:

is.the.Party.correct <- TRUE
is.the.Party.correct
## [1] TRUE

Note that, again, R is totally chillax about this inconsistency. It just overwrites the previous value, without pain, grief or warning.

Better yet, because it’s kind of tedious to type TRUE or FALSE over and over again, R provides you with a shortcut: you can use T and F instead (but it’s case sensitive: t and f won’t work). As explained below, it’s safer to stick with TRUE and FALSE, but the shortcuts do work. So this works:

is.the.Party.correct <- F
is.the.Party.correct
## [1] FALSE

but this doesn’t:

is.the.Party.correct <- f
## Error in eval(expr, envir, enclos): object 'f' not found

I can’t let you go without a small warning: TRUE and FALSE are reserved keywords in R, so you can trust that they always mean what they say they do. Unfortunately, the shortcut versions T and F do not have this property. It’s even possible to create variables that set up the reverse meanings, by typing commands like T <- FALSE and F <- TRUE. This is kind of insane, and something that is generally thought to be a design flaw in R.
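
To see this for yourself, here is a sketch you can try (and then immediately undo). It uses rm(), which removes a variable; I am assuming you haven’t met it yet, so take it on trust here.

```r
T <- FALSE    # allowed: T is an ordinary variable that merely starts out as TRUE
T
## [1] FALSE
rm(T)         # remove our own T again, so that T goes back to meaning TRUE
T
## [1] TRUE
# TRUE <- FALSE would not be allowed: R refuses with an error
```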

3.3.3 Storing logical data as a vector

The next thing to mention is that you can store vectors of logical values in exactly the same way that you can store vectors of numbers (Section 2.6) and vectors of text data (Section 3.2). Again, we can define them directly via the c() function, like this:

x <- c(TRUE, TRUE, FALSE)
x
## [1]  TRUE  TRUE FALSE

More interestingly, you can produce a vector of logicals by applying a logical operator (such as the equality operator) to a vector. This might not make a lot of sense to you, so let’s unpack it slowly.

First, let’s suppose we have a vector of numbers. For instance, we could use the sales.by.month vector that we were using earlier:

sales.by.month
##  [1]   0 100 200  50   0   0   0   0   0   0   0   0

Suppose I wanted R to tell me, for each month of the year, whether it was a slow month in that no books were sold. I can do that by typing this:

sales.by.month == 0
##  [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

and again, I can store this in a vector if I want, as the example below illustrates:

no.sales.this.month <- sales.by.month == 0
no.sales.this.month
##  [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

In other words, no.sales.this.month is a logical vector whose elements are TRUE only if the corresponding element of sales.by.month is equal to zero. For instance, since I sold zero books in January, the first element is TRUE.

Let’s do a second example, but now with text. Suppose that – to continue the saga of the textbook sales – I find out that the bookshop did not have sufficient stock throughout the year. They tell me that early in the year they had "high" stocks, which then dropped to "low" levels, and in fact for two months they were "out" of copies of the book before they were able to replenish them. Thus I might have a variable called stock.levels which looks like this:

stock.levels <- c("high", "high", "low", "out", "out", "high",
                "high", "high", "high", "high", "high", "high")
stock.levels
##  [1] "high" "high" "low"  "out"  "out"  "high" "high" "high" "high" "high"
## [11] "high" "high"

If I want to know in which months the book was out of stock, I can ask R as follows:

stock.levels == "out"
##  [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

However, what you need to keep in mind is that R is not at all tolerant when it comes to capitalisation and spacing. If two strings differ in any way whatsoever, R will say that they’re not equal to each other, as the following examples indicate:

stock.levels == "high"
##  [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
stock.levels == "High"
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
stock.levels == "h igh"
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

3.3.4 Working with logical data

Above, we created the logical vector no.sales.this.month whose elements are TRUE or FALSE. Life is full of surprises, and so is R. As it turns out, you can do numerical operations with this vector. Can you find out how it works?

Exercise: Run the following commands: no.sales.this.month + 0, no.sales.this.month * 1 and no.sales.this.month^2, and compare the results with no.sales.this.month. What do you notice?

Every TRUE plays the role of a 1, and every FALSE plays the role of a 0.
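
One thing this makes possible is counting. Here is a sketch with a made-up logical vector and the sum() function, which adds up all elements of a vector (I am assuming you are happy to take it on trust at this point):

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)   # made-up logical vector
flags + 0                             # TRUE becomes 1, FALSE becomes 0
## [1] 1 0 1 1
sum(flags)                            # so adding them up counts the TRUEs
## [1] 3
```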

Later, in Section 6, I’ll show you why these logical operations and logical vectors are so handy.

3.3.5 More logical operations

So now we’ve seen logical operations at work, but so far we’ve only seen the simplest possible example, the equality operator. You probably won’t be surprised to discover that we can combine logical operations with other operations and functions in a more complicated way, like this:

3*3 + 4*4 == 5*5
## [1] TRUE

or this

sqrt(25) == 5
## [1] TRUE

Not only that, but as Table 3.1 illustrates, there are several other logical operators that you can use, corresponding to some basic mathematical concepts.

Table 3.1: Some logical operators. Technically I should be calling these “binary relational operators”, but quite frankly I don’t want to. It’s my book so no-one can make me.
operation                  operator   example input   answer
less than                  <          2 < 3           TRUE
less than or equal to      <=         2 <= 2          TRUE
greater than               >          2 > 3           FALSE
greater than or equal to   >=         2 >= 2          TRUE
equal to                   ==         2 == 3          FALSE
not equal to               !=         2 != 3          TRUE

Hopefully, these are all pretty self-explanatory: for example, the less than operator < checks to see if the number on the left is less than the number on the right. If it’s less, then R returns an answer of TRUE:

99 < 100
## [1] TRUE

but if the two numbers are equal, or if the one on the right is larger, then R returns an answer of FALSE, as the following two examples illustrate:

100 < 100
## [1] FALSE
100 < 99
## [1] FALSE

In contrast, the less than or equal to operator <= will do exactly what it says. It returns a value of TRUE if the number on the left-hand side is less than or equal to the number on the right-hand side. So if we repeat the previous two examples using <=, here’s what we get:

100 <= 100
## [1] TRUE
100 <= 99
## [1] FALSE

And at this point I hope it’s pretty obvious what the greater than operator > and the greater than or equal to operator >= do!
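
These operators work element-wise on vectors too, just like == did earlier. A sketch with a shortened, made-up sales vector:

```r
sales <- c(0, 100, 200, 50)   # shortened example vector (an assumption for this sketch)
sales > 50                    # strictly more than 50 books sold?
## [1] FALSE  TRUE  TRUE FALSE
sales >= 50                   # 50 or more books sold?
## [1] FALSE  TRUE  TRUE  TRUE
```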

Next on the list of logical operators is the not equal to operator != which – as with all the others – does what it says it does. It returns a value of TRUE when things on either side are not identical to each other. Therefore, since \(2+2\) isn’t equal to \(5\), we get:

2 + 2 != 5
## [1] TRUE

3.3.6 Even more logical operations

We’re not quite done yet. There are three more logical operations that are worth knowing about, listed in Table 3.2.

Table 3.2: Some more logical operators.
operation   operator   example input       answer
not         !          !(1==1)             FALSE
or          |          (1==1) | (2==3)     TRUE
and         &          (1==1) & (2==3)     FALSE

These are the not operator !, the and operator &, and the or operator |. Like the other logical operators, their behaviour is more or less exactly what you’d expect given their names. For instance, if I ask you to assess the claim that “either \(2+2 = 4\) or \(2+2 = 5\)” you’d say that it’s true. Since it’s an “either-or” statement, all we need is for one of the two parts to be true. That’s what the | operator does:

(2+2 == 4) | (2+2 == 5)
## [1] TRUE

On the other hand, if I ask you to assess the claim that “both \(2+2 = 4\) and \(2+2 = 5\)” you’d say that it’s false. Since this is an and statement we need both parts to be true. And that’s what the & operator does:

(2+2 == 4) & (2+2 == 5)
## [1] FALSE

To be clear, the | operator does not want exactly one statement to be true. If both parts are true, it will judge the combined statement as true as well:

(2+2 == 4) | (3+3 == 6)
## [1] TRUE

Finally, there’s the not operator, which is simple but annoying to describe in English. If I ask you to assess my claim that “it is not true that \(2+2 = 5\)” then you would say that my claim is true; because my claim is that “\(2+2 = 5\) is false”. And I’m right. If we write this as an R command we get this:

! (2+2 == 5)
## [1] TRUE

In other words, since 2+2 == 5 is a FALSE statement, it must be the case that !(2+2 == 5) is a TRUE one. Essentially, what we’ve really done is to claim that “not false” is the same thing as “true”. Obviously, this isn’t really quite right in real life. But R lives in a much more black or white world: for R everything is either true or false. No shades of grey are allowed. We can actually see this much more explicitly, like this:

! FALSE
## [1] TRUE

Of course, in our \(2+2 = 5\) example, we didn’t really need to use “not” ! and “equals to” == as two separate operators. We could have just used the “not equals to” operator != like this:

2+2 != 5
## [1] TRUE

But there are many situations where you really do need to use the ! operator.
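
One such situation is flipping a logical vector you already have. A sketch, again with a shortened, made-up sales vector:

```r
sales <- c(0, 100, 200, 50)    # shortened example vector (an assumption for this sketch)
no.sales <- sales == 0         # TRUE for months without sales
no.sales
## [1]  TRUE FALSE FALSE FALSE
!no.sales                      # flipped: TRUE for months with sales
## [1] FALSE  TRUE  TRUE  TRUE
```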

Let’s get some more practice with combining these operations. Looking back at the stock levels, suppose I want to focus only on those cases when the stock level is either “out” or “low”. One simple way to do this is:

stock.levels == "out" | stock.levels == "low"
##  [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

What this does is return TRUE for those elements of stock.levels that are either "out" or "low" and returns FALSE for all the others.

Neat. But there’s an even neater way. To send you off, I will leave you with a useful trick to be aware of, which is the %in% operator6. It’s actually very similar to the == operator, except that you can supply a collection of acceptable values, so you can look for a match of multiple cases. The best way to learn about it is to see it at work:

stock.levels %in% c("out","low") 
##  [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

You see that, again, it returns TRUE for those elements of stock.levels that are either "out" or "low" and returns FALSE for all the others. You could verbalize the above statement as “stock.levels is part of either out or low” or “stock.levels is at least one of the values in the vector consisting of out and low”.

Exercise: Is there a difference between stock.levels == "high" and stock.levels %in% "high"?

No. You could see the %in% operator as a multiple case extension of the == operator.
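For a vector without missing values, the two give the same result. There is one caveat that only matters once your data contain missing values (NA, which I have not discussed here): == returns NA for a missing element, whereas %in% returns FALSE. A small sketch, again with a made-up vector (levels.demo is hypothetical):

```r
# Hypothetical data with one missing value
levels.demo <- c("high", "low", NA)

levels.demo == "high"      # TRUE FALSE NA
levels.demo %in% "high"    # TRUE FALSE FALSE

# Combining ! with %in% picks out everything NOT in the collection
!(levels.demo %in% c("out", "low"))   # TRUE FALSE TRUE
```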

3.3.7 The order of logical operators

Just like algebraic operators have an order (e.g., multiplication before addition), logical operators have one too.

This one is easy

(TRUE | TRUE) & FALSE
## [1] FALSE

(TRUE | TRUE) gives TRUE, so we end up with TRUE & FALSE which yields FALSE.

This one is harder if you encounter it for the first time

TRUE | TRUE & FALSE
## [1] TRUE

This expression is not evaluated left-to-right. Instead, R follows operator precedence rules, holding that & has higher precedence than |. So it’s really interpreted as:

TRUE | (TRUE & FALSE)
## [1] TRUE
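The ! operator has its place in this ordering too: it comes before & and |, but after comparisons such as ==. The lines below are a small sketch of how R reads a few expressions. When in doubt, adding brackets is always allowed and usually makes the code easier to read.

```r
# & binds more tightly than |
TRUE | TRUE & FALSE        # TRUE, read as TRUE | (TRUE & FALSE)
(TRUE | TRUE) & FALSE      # FALSE

# ! binds more tightly than & and |
!TRUE & FALSE              # FALSE, read as (!TRUE) & FALSE
!(TRUE & FALSE)            # TRUE

# ! binds less tightly than comparisons such as ==
!2 + 2 == 5                # TRUE, read as !((2 + 2) == 5)
```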

3.4 Getting help

The single most important skill you need to learn as a programmer (or, to a lesser extent, as a data analyst) is getting help. I have somewhat mixed feelings about the help documentation in R. On the plus side, there’s a lot of it, and it’s very thorough. On the minus side, there’s a lot of it, and it’s very thorough. There’s so much help documentation that it sometimes doesn’t help, and most of it is written with an advanced user in mind. Often it feels like most of the help files work on the assumption that the reader already understands everything about R except for the specific topic that it’s providing help for. What that means is that, once you’ve been using R for a long time and are beginning to get a feel for how to use it, the help documentation is awesome. These days, I find myself really liking the help files (most of them anyway). But when I first started using R I found it very dense.

To some extent, there’s not much I can do to help you with this. You just have to work at it yourself; once you’re moving away from being a pure beginner and are becoming a skilled user, you’ll start finding the help documentation more and more helpful.

If you want to read the help file of, say, the startsWith() function, you can use either of the following:

help("startsWith")
?startsWith 

When I do that, R goes looking for the help file for the “startsWith” topic.

Exercise: You can also use the help() function in this document. Look up the help documentation for the startsWith() function.

Alternatively, you can try a fuzzy search for a help topic, meaning that it will not just look for the exact search term, but also at search terms that are similar to your search term.

help.search("startsWith")
??startsWith

If you try it (for example, in the box of the previous exercise), this will bring up a list of possible topics that you might want to follow up on.
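Two other built-in functions can serve as a quick first stop before you open a full help file. This is only a sketch of how you might use them; both args() and apropos() come with the basic R installation:

```r
# args() prints the arguments a function expects, in order
args(startsWith)

# apropos() lists the objects whose name contains the search term
apropos("startsWith")
```

The first command shows that startsWith() expects two arguments, called x and prefix. The second one is handy when you only remember part of a function name.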

Besides the R documentation, I want to mention a few other resources here.

  • The first help resource is your own brain and creativity. If you don’t know what some code does, just run it, and see what it does. Just (carefully) looking at it might already be enough. The main message is that you will learn, discover and understand by playing around with the code. Playing for the win!

  • Perhaps most importantly, Google is your best friend. Whatever problem you run into with R, it is very likely that someone else ran into the same problem before you. Stack Overflow, for example, is a large Q&A platform where coders help each other with their programming issues (not only in R). Imagine we have7 a data frame in which we want to convert one column from character to numeric class, but we can’t remember exactly how to do this. If you search for something along the lines of ‘convert data frame column character to numeric in R’, you will get plenty of results that can help you with this, including answers on Stack Overflow (https://stackoverflow.com/questions/37707060/converting-data-frame-column-from-character-to-numeric/37707117).

  • The Rseek website (www.rseek.org). One thing that I really find annoying about the R help documentation is that it’s hard to search properly. When coupled with the fact that the documentation is dense and highly technical, it’s often a better idea to search or ask online for answers to your questions. With that in mind, the Rseek website is great: it’s an R specific search engine. I find it really useful, and it’s almost always my first port of call when I’m looking around.

  • Another, more recent but also somewhat twisted, friend is the LLM, like ChatGPT. LLMs are remarkably good at writing code, and remarkably bad at making good jokes. One takeaway from this is that you shouldn’t despair: it is easier to be a good programmer than to be funny. A second takeaway is that you can ask ChatGPT for input whenever you are stuck. Do treat whatever it comes up with with some caution. With some luck, the code it produces will be a useful starting point. It is your responsibility, still, to make sure the code actually works as intended. So you need to read it, understand it, and check both intermediate steps and final output.

  • If you are becoming a more advanced R user, you might consider joining the R-help mailing list (see http://www.r-project.org/mail.html for details). This is the official R help mailing list; it won’t be needed for the purposes of this course. It can be very helpful, but it’s very important that you do your homework before posting a question. The list gets a lot of traffic. While the people on the list try as hard as they can to answer questions, they do so for free, and you really don’t want to know how much money they could charge on an hourly rate if they wanted to apply market rates. In short, they are doing you a favour, so be polite. Don’t waste their time asking questions that can be easily answered by a quick search on Rseek (it’s rude), and make sure your question is clear and includes all of the relevant information. In short, read the posting guidelines carefully (http://www.r-project.org/posting-guide.html), and make use of the help.request() function that R provides to check that you’re actually doing what’s expected of you.

  • Keep in mind, though, that by using these routes, you are quite likely to have your R problem solved. You are, however, not just in the business of problem solving, but of learning. So try to make sure you understand why the solution is a solution. This is especially easy when using ChatGPT: you can just ask it to explain what the proposed code is doing, so that you actually understand it!

3.5 Packages

A lot of R’s functionality is built-in and comes with simply installing R. For most of what we will be using in this book, that will suffice. But even more of R’s functionality is not built-in, and one of the benefits of R is the availability of this endless and growing list of advanced functionalities. So while it might be a bit premature to talk about them when you have just started to learn R, what I am going to explain now is so fundamental to R that I want to talk about it already, even if you won’t start using it till much later.

The additional functionality I am talking about is provided in things called packages. A package is basically just a big collection of functions, data sets and other R objects that are all grouped together under a common name. Some R packages are already installed when you put R on your computer, but the vast majority of them are out there on the internet, waiting for you to download, install and use them.

One of the main selling points for R is that there are thousands of packages that have been written for it, and these are all available online. So whereabouts online are these packages to be found, and how do we download and install them? There is a big repository of packages called the “Comprehensive R Archive Network” (CRAN).

There’s a critical distinction that you need to understand, which is the difference between having a package installed on your computer, and having a package loaded in R. As of this writing, there are just over 5000 R packages freely available “out there” on the internet.8 When you install R on your computer, you don’t get all of them: only about 30 or so come bundled with the basic R installation. So right now there are about 30 packages “installed” on your computer, and another 5000 or so that are not installed. So that’s what installed means: it means “it’s on your computer somewhere”. The critical thing to remember is that just because something is on your computer doesn’t mean R can use it. In order for R to be able to use one of your 30 or so installed packages, that package must also be “loaded”. Generally, when you open up R, only a few of these packages (about 7 or 8) are actually loaded.

So there are two things you need to remember about packages: 1) A package must be installed before it can be loaded. 2) A package must be loaded before it can be used. This two-step process might seem a little odd at first, but the designers of R had very good reasons to do it this way. Basically, the reason is that there are 5000 packages, and probably about 4000 authors of packages, and no-one really knows what all of them do. Keeping the installation separate from the loading minimizes the chances that two packages will interact with each other in a nasty way. But don’t worry, you get the hang of it pretty quickly. We will talk about the specifics of installing and loading packages in Section 5.
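To make the distinction a bit more tangible, here is a small sketch of how you could ask R about both states. It uses the stats package, which comes with every R installation; the details of installing and loading are left for later.

```r
# Is the package installed, that is, present on this computer?
"stats" %in% rownames(installed.packages())   # TRUE

# library() loads an installed package so that R can use it
library(stats)

# search() lists what is currently loaded and ready for use
"package:stats" %in% search()                 # TRUE
```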

3.6 Go out and play

If you have done and understood all of this, you can find some additional exercises in the RIntroExercises document. You can use this box to make the exercises for Chapter 3. Note that there are two documents: one with only the questions, and one with the questions as well as some suggested solutions.

4 More on functions in R

4.1 Function arguments

In Section 2.9 you were introduced to the basics of functions, like sqrt(), length() and nchar(). There are two more fairly important things that you need to understand about functions in R, and that’s the use of “named” arguments and “default values” for arguments. Further, I will introduce you to the somewhat bizarre world of pipes. That’s not to say that this is the last we’ll hear about how functions work, but these are the last things we desperately need to discuss in order to get you started.

4.1.1 Argument names

To understand what the first two concepts are all about, I’ll introduce a new function to you. The round() function can be used to round some value to the nearest whole number.

Exercise: Use the round() function for the value 3.1415.

Pretty straightforward, really. However, suppose I only wanted to round it to two decimal places: that is, I want to get 3.14 as the output. The round() function supports this, by allowing you to input a second argument to the function that specifies the number of decimal places that you want to round the number to. In other words, I could do this:

round( 3.1415, 2 )
## [1] 3.14

What’s happening here is that I’ve specified two arguments: the first argument is the number that needs to be rounded (i.e., 3.1415), the second argument is the number of decimal places that it should be rounded to (i.e., 2), and the two arguments are separated by a comma. In this simple example, it’s quite easy to remember which argument comes first and which one comes second, but for more complicated functions this is not easy. Fortunately, most R functions make use of argument names. For the round() function, for example, the number that needs to be rounded is specified using the x argument, and the number of decimal places that you want it rounded to is specified using the digits argument. Because we have these names available to us, we can specify the arguments to the function by name. We do so like this:

round( x = 3.1415, digits = 2 )
## [1] 3.14

Notice that this is kind of similar in spirit to variable assignments (Section 2.3), except that I used = here, rather than <-. In both cases, we’re specifying specific values to be associated with a label. However, there are some differences between what I was doing earlier on when creating variables, and what I’m doing here when specifying arguments, and so as a consequence, it’s important that you use = in this context.
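To see why the choice matters, here is a small sketch of what happens if you use <- inside a function call anyway. The variable name value.demo is made up for this illustration:

```r
# With =, the value 3.1415 is only handed to the argument x of round()
round( x = 3.1415, digits = 2 )

# With <-, R first creates a variable called value.demo, and only
# then passes its value to round() as the first argument
round( value.demo <- 3.1415, digits = 2 )
value.demo             # 3.1415
```

The second command gives the same rounded result, but it also leaves a new variable behind, which is rarely what you intended.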

As you can see, specifying the arguments by name involves a lot more typing, but it’s also a lot easier to read. Because of this, the commands in this book will usually specify arguments by name,9 since that makes it clearer to you what I’m doing.

One important thing to note is that when specifying the arguments using their names, it doesn’t matter what order you type them in. But if you don’t use the argument names, then you have to input the arguments in the correct order. In other words, these commands all produce the same output…

round( 3.1415, 2 )
## [1] 3.14
round( x = 3.1415, digits = 2 )
## [1] 3.14
round( digits = 2, x = 3.1415 )
## [1] 3.14

but this one does not…

round( 2, 3.1415 )
## [1] 2

What does R do when you provide names for some arguments but not for others? Let’s see

round(  2, x = 3.1415 )
## [1] 3.14
round( digits = 2, 3.1415 )
## [1] 3.14

The named argument is easy, of course. If you use x = 3.1415 you literally tell R that 3.1415 should serve as x. For the unnamed argument, R needs to decide what you mean by it. R assigns it to the first argument that has not yet been given a value. So in the first example, it knows that 2 serves as the value for digits, because round() expects arguments in the order x, digits. Since we have provided x, the first argument that has not been assigned a value is digits. In the second example, 3.1415 serves as x, because that’s the first argument round() expects, and it wasn’t assigned a value.
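The same rule applies to functions with more than two arguments. As a sketch, take seq(), a function that generates a regular sequence of numbers and expects its first three arguments in the order from, to, by (I have not introduced it properly, so treat this purely as an illustration of argument matching):

```r
# All arguments unnamed: they are taken in the order from, to, by
seq( 2, 10, 4 )               # 2 6 10

# to and by are named, so the unnamed 2 fills the first free slot: from
seq( to = 10, 2, by = 4 )     # 2 6 10

# by is named, so the unnamed values fill from and to, in that order
seq( by = 4, 2, 10 )          # 2 6 10
```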

So if you want to use names for the arguments, you are basically free to do what you want. If you don’t want to use names, you have to be very careful about order! How do you find out what the correct order is? There’s a few different ways, but the easiest one is to look at the help documentation for the function (see Section 3.4). However, if you’re ever unsure, it’s probably best to type in the argument name. To know the correct name, you also need to consult the help documentation, but at least, these names are often easier to remember than the order, so you will probably have to visit the help file less using the name approach than when you are using the order approach.

Now here is something weird. All of this works!

round( x = 3.1415, digit = 2 )
## [1] 3.14
round( x = 3.1415, digi = 2 )
## [1] 3.14
round( x = 3.1415, d = 2 )
## [1] 3.14

The reason is that R (somewhat controversially) does partial matching of argument names. Since round() has only one argument whose name starts with d, digi, or digit, R sort of auto-completes these bits to digits.

You have to do something right, though. This doesn’t work:

round( x = 3.1415, digitjes = 2 )
## Error in round(x = 3.1415, digitjes = 2): unused argument (digitjes = 2)
round( x = 3.1415, Digits = 2 )
## Error in round(x = 3.1415, Digits = 2): unused argument (Digits = 2)
round( x = 3.1415, getalletjes = 2 )
## Error in round(x = 3.1415, getalletjes = 2): unused argument (getalletjes = 2)

and R tells you why it balks (or at least it tells you something is going wrong in the argument department).
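Partial matching also breaks down when the abbreviation fits more than one argument. The sketch below defines a made-up function to show this. Writing your own functions is not something I have covered, and the names my.round and direction are purely hypothetical:

```r
# A made-up function with two arguments that both start with "d".
# The direction argument is never used; it only exists to create
# the ambiguity.
my.round <- function( x, digits = 0, direction = "nearest" ) {
  round( x = x, digits = digits )
}

# Works: digits is the only argument that starts with "dig"
my.round( x = 3.1415, dig = 2 )     # 3.14

# Fails: "d" could mean digits or direction, so R refuses to guess.
# try() catches the error so that the code keeps running.
result <- try( my.round( x = 3.1415, d = 2 ), silent = TRUE )
class( result )                     # "try-error"
```

This is one reason why typing argument names in full is the safer habit.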

4.1.2 Argument defaults

Okay, so that’s the first thing I said you’d need to know: argument names. The other thing you need to know about arguments is that they can have default values. Notice that the first time I called the round() function, round( 3.1415 ), I didn’t actually specify the digits argument at all, and yet R somehow knew that this meant it should round to the nearest whole number. How did that happen? The answer is that the digits argument has a default value of 0, meaning that if you decide not to specify a value for digits then R will act as if you had typed digits = 0. This is quite handy: the vast majority of the time when you want to round a number you want to round it to the nearest whole number, and it would be pretty annoying to have to specify the digits argument every single time. On the other hand, sometimes you actually do want to round to something other than the nearest whole number, and it would be even more annoying if R didn’t allow this! Thus, by having digits = 0 as the default value, we get the best of both worlds.

How do you find out what the default values are? Again, the easiest one is to look at the help documentation for the function (see Section 3.4). Or try to reverse engineer stuff by trying things out!
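A quick way to peek at the defaults without opening the help file is the args() function. Here is a small sketch of that, together with a check that relying on the default and spelling it out come to the same thing:

```r
# args() prints the arguments of round(), including any default values
args( round )

# Relying on the default and typing it out give the same result
round( x = 3.1415 )               # 3
round( x = 3.1415, digits = 0 )   # 3
```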

4.1.3 More about argument names

Perhaps unsurprisingly, assigning values to an argument can be done using variables. With this cryptic sentence, I mean that these commands do exactly the same thing:

round( x = 3.1415, digits = 2 )
## [1] 3.14
y <- 3.1415
d <- 2
round( x = y, digits = d )
## [1] 3.14

Functionally, there is a difference, however. Using the first strategy, we have no access to x. Its value is only known to the function, not outside it. Since in the second approach, y is defined outside the function, it is also available outside of it:

round( x = 3.1415, digits = 2 )
## [1] 3.14
x + 1 #doesn't work 
## Error in eval(expr, envir, enclos): object 'x' not found
y <- 3.1415
d <- 2
round( x = y, digits = d )
## [1] 3.14
x + 1 #doesn't work
## Error in eval(expr, envir, enclos): object 'x' not found
y + 1 #does work
## [1] 4.1415

Some people are overargumentative. If you are like that, let’s see how R reacts. For example, you might (mistakenly) think that round() has an argument which tells you both the rounding down and rounding up result, called upndown, which can be set to TRUE:

round( x = 3.1415, upndown = TRUE  )
## Error in round(x = 3.1415, upndown = TRUE): unused argument (upndown = TRUE)
round( x = 3.1415, digits = 2, upndown = TRUE  )
## Error in eval(expr, envir, enclos): 3 arguments passed to 'round' which requires 1 or 2 arguments

R complains twice, giving you a different reason in each case.

4.1.4 Pipes

A final nugget of wisdom about using functions I would like to share with you is piping. By now, you should know how easy it is to call a function. If you want to take the square root of something, you write down the function for that, which is sqrt(), and include the something between the brackets, like this:

sqrt(225)
## [1] 15

Somewhat counterintuitively, there is a different, more complicated way of doing exactly the same thing. It makes use of the forward pipe operator, which looks like this: |>. What it does is that it “pipes” (i.e., puts) everything to its left inside the function to its right, as the first argument in the call. Come take a look:

225 |> sqrt()
## [1] 15

So the above code does exactly the same as sqrt(225)! Weird or wonderful? Your call!

Piping with additional arguments is fairly straightforward:

3.1415 |> round(digits=2)
## [1] 3.14

is functionally identical to

round(3.1415, digits=2)
## [1] 3.14

and

2 |> round(x = 12.345)
## [1] 12.35

is identical to

round(2, x = 12.345)
## [1] 12.35

I won’t be using the piping approach much (or maybe even at all) in this book, but since it is gaining popularity in the R world, I thought I should get it on your radar. Personally, I don’t see the appeal of calling functions like this, but maybe that’s a very boomer thing to say.
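For completeness, one reason people often give for liking pipes is that a chain of several functions can be read from left to right rather than from the inside out. A small sketch of that idea follows; note that the |> operator requires R version 4.1.0 or later:

```r
# Nested calls are read from the inside out: sum, then sqrt, then round
round( sqrt( sum( c(2, 1, 6) ) ), digits = 1 )            # 3

# The same steps with pipes are read from left to right
c(2, 1, 6) |> sum() |> sqrt() |> round( digits = 1 )      # 3
```

Both commands add up 2, 1 and 6 to get 9, take the square root, and round the result.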

4.2 A few more mathematical functions

As I’ve mentioned earlier, R has an incredible range of mathematical functions built into it, and there really wouldn’t be much point in trying to describe or even list all of them. I will focus only on those functions that are strictly necessary for this book. When doing statistics, you will find that you will be doing a lot of transformations. Also, you will find that a lot of the transformations that you might want to apply to your data are based on fairly simple mathematical functions and operations. In this section, I want to return to that discussion, and mention several other mathematical functions and arithmetic operations that I didn’t bother to mention when introducing you to R, but are actually quite useful for a lot of real-world data analysis. Table 4.1 gives a brief overview of the various mathematical functions I want to talk about (and some that I already have talked about). Obviously, this doesn’t even come close to cataloguing the range of possibilities available in R, but it does cover a very wide range of functions that are used in day to day data analysis.

Table 4.1: Some of the mathematical functions available in R.

| mathematical function | R function | example input | answer |
|---|---|---|---|
| square root | sqrt() | sqrt(25) | 5 |
| absolute value | abs() | abs(-23) | 23 |
| rounding to nearest | round() | round(1.32) | 1 |
| rounding down | floor() | floor(1.32) | 1 |
| rounding up | ceiling() | ceiling(1.32) | 2 |
| logarithm (base 10) | log10() | log10(1000) | 3 |
| logarithm (base e) | log() | log(1000) | 6.908 |
| exponentiation | exp() | exp(6.908) | 1000.245 |
| sum | sum() | sum(c(2,1,6)) | 9 |
| mean | mean() | mean(c(2,1,6)) | 3 |
| cumulative sum | cumsum() | cumsum(c(2,1,6)) | 2 3 9 |

4.2.1 Rounding a number

One very simple transformation that crops up surprisingly often is the need to round a number to the nearest whole number, or to a certain number of significant digits. To start with, let’s assume that we want to round to a whole number. To that end, there are three useful functions in R you want to know about: round(), floor() and ceiling().

You are already familiar with the round() function from Section 4.1.1. It just rounds to the nearest whole number. So if you round the number 4.3, it “rounds down” to 4, like so:

round( x = 4.3 )
## [1] 4

In contrast, if we want to round the number 4.7, we would round upwards to 5. In everyday life, when someone talks about “rounding”, they usually mean “round to nearest”, so this is the function we use most of the time. However, sometimes you have reasons to want to always round up or always round down. If you want to always round down, use the floor() function instead; and if you want to force R to round up, then use ceiling(). That’s the only difference between the three functions.

What if you want to round to a certain number of digits? Let’s suppose you want to round to a fixed number of decimal places, say 2 decimal places. If so, what you need to do is specify the digits argument to the round() function, as was discussed in Section 4.1.1.

Exercise: Round the value 0.0123 to 2 decimal places. Specify the arguments x and digits.

round( x = 0.0123, digits = 2 )
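A few details are worth seeing in action. “Down” and “up” refer to the direction on the number line, which matters for negative numbers. A number that lies exactly halfway between two whole numbers is rounded to the even one by round(). And if you want significant digits rather than decimal places, R has the signif() function. A small sketch:

```r
# Negative numbers: "down" means towards minus infinity
round( x = -4.7 )      # -5
floor( x = -4.7 )      # -5
ceiling( x = -4.7 )    # -4

# Exactly halfway cases go to the even neighbour, not always up
round( x = 2.5 )       # 2
round( x = 3.5 )       # 4

# signif() rounds to a number of significant digits
signif( x = 123.456, digits = 2 )   # 120
```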

4.2.2 Logarithms and exponentials

Next up are logarithms and exponentials. Although they aren’t needed anywhere else in this book, they are everywhere in statistics more broadly, and not only that, there are a lot of situations in which it is convenient to analyse the logarithm of a variable (i.e., to take a “log-transform” of the variable). I suspect that many (maybe most) readers of this book will have encountered logarithms and exponentials before, but from past experience, I know that there’s a substantial proportion of students who take a social science statistics class who haven’t touched logarithms since high school, and would appreciate a bit of a refresher.

In order to understand logarithms and exponentials, the easiest thing to do is to actually calculate them and see how they relate to other simple calculations. There are three R functions in particular that I want to talk about, namely log(), log10() and exp(). To start with, let’s consider log10(), which is known as the “logarithm in base 10”. The trick to understanding a logarithm is to understand that it’s basically the “opposite” of taking a power. Specifically, the logarithm in base 10 is closely related to the powers of 10. So let’s start by noting that 10-cubed is 1000. Mathematically, we would write this:

\[ 10^3 = 1000 \]

and in R we’d calculate it by using the command 10^3. The trick to understanding a logarithm is to recognise that the statement that “10 to the power of 3 is equal to 1000” is the mirror image of the statement that “the logarithm (in base 10) of 1000 is equal to 3”. Mathematically, we write this as follows,

\[ \log_{10}( 1000 ) = 3 \]

Exercise: Calculate the base-10 logarithm of 1000 using the log10() function.

log10(1000)

Obviously, since you already know that \(10^3 = 1000\) there’s really no point in getting R to tell you that the base-10 logarithm of 1000 is 3. However, most of the time you probably don’t know what the right answer is. For instance, I can honestly say that I didn’t know that \(10^{2.69897} = 500\), so it’s rather convenient for me that I can use R to calculate the base-10 logarithm of 500.

log10( 500 )
## [1] 2.69897

Or at least it would be convenient if I had a pressing need to know the base-10 logarithm of 500.
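To check for yourself that log10() really is the “opposite” of raising 10 to a power, you can apply one after the other and see that you end up where you started. A small sketch (the variable name log.500 is made up for this illustration):

```r
10^3                       # 1000
log10( 10^3 )              # 3: the logarithm undoes the power

log.500 <- log10( 500 )    # 2.69897
10^log.500                 # 500: the power undoes the logarithm
```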

Okay, since the log10() function is related to the powers of 10, you might expect that there are other logarithms (in bases other than 10) that are related to other powers too. And of course, that’s true: there’s not really anything mathematically special about the number 10. You and I happen to find it useful because decimal numbers are built around the number 10, but the big bad world of mathematics scoffs at our decimal numbers. Sadly, the universe doesn’t actually care how we write down numbers. Anyway, the consequence of this cosmic indifference is that there’s nothing particularly special about calculating logarithms in base 10. You could, for instance, calculate your logarithms in base 2, and in fact, R does provide a function for doing that, which is (not surprisingly) called log2(). Since we know that \(2^3 = 2 \times 2 \times 2 = 8\), it’s no surprise to see that

log2( 8 )
## [1] 3

Alternatively, a third type of logarithm – and one we see a lot more of in statistics than either base 10 or base 2 – is called the natural logarithm, and corresponds to the logarithm in base \(e\). Since you might one day run into it, I’d better explain what \(e\) is. The number \(e\), known as Euler’s number, is one of those annoying “irrational” numbers whose decimal expansion is infinitely long and is considered one of the most important numbers in mathematics. The first few digits of \(e\) are:

\[ e \approx 2.718282 \]

There are quite a few situations in statistics that require us to calculate powers of \(e\). Raising \(e\) to the power \(x\) is called the exponential of \(x\), and so it’s very common to see \(e^x\) written as \(\exp(x)\). And so it’s no surprise that R has a function that calculates exponentials, called exp(). For instance, suppose I wanted to calculate \(e^3\). I could try typing in the value of \(e\) manually, like this:

2.718282 ^ 3
## [1] 20.08554

but it’s much easier to do the same thing using the exp() function.

Exercise: Calculate the exponential of 3 using the exp() function.

exp(3)

Anyway, because the number \(e\) crops up so often in statistics, the natural logarithm (i.e., logarithm in base \(e\)) also tends to turn up. Mathematicians often write it as \(\log_e(x)\) or \(\ln(x)\), or sometimes even just \(\log(x)\). In fact, R works the same way: the log() function corresponds to the natural logarithm.10 Anyway, as a quick check, let’s calculate the natural logarithm of 20.08554 using R:

log( 20.08554 )
## [1] 3

And with that, I think we’ve had quite enough exponentials and logarithms for this book!
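Well, almost enough. As a compact summary, the sketch below shows how the functions from this section relate to each other. It also uses the optional base argument of log(), which lets you compute a logarithm in any base you like:

```r
exp( 1 )                      # 2.718282, the number e itself
log( exp( 3 ) )               # 3: log() undoes exp()

# log() has an optional base argument for other bases
log( x = 8, base = 2 )        # 3, same as log2( 8 )
log( x = 1000, base = 10 )    # 3, same as log10( 1000 )
```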

4.2.3 The sum(), the mean(), and the cumsum()

Although I will defer all true statistical content to Chapter 11, I make one exception here, and that is using R to compute the mean.

As a recap, here’s what you should do to compute the mean: add all the values up and then divide by the total number of values. Okay, how do we get the magic computing box to do the work for us? If you really wanted to, you could do this calculation directly in R.

To make things a bit concrete, let’s use some data. Unlike most data sets in this book, these are actually real data, relating to the Australian Football League (AFL).11 The afl.margins variable contains the winning margin (which is just the difference between the number of points, so if one team scores 26 and the other 21, the margin is 5) for all 176 home and away games played during the 2010 season.

Here is what the first couple of scores look like (in a moment, I will show how I could use the head() function for that)

afl.margins[1:5]
## [1] 56 31 56  8 32

Exercise: For the first 5 AFL margins (56, 31, 56, 8, 32), calculate the mean just by typing it in as if R were a calculator.

(56 + 31 + 56 + 8 + 32) / 5

… in which case R outputs the answer 36.6, just as if it were a calculator.

However, that’s not the only way to do the calculations, and when the number of observations starts to become large, it’s easily the most tedious. Besides, in almost every real-world scenario, you’ve already got the actual numbers stored in a variable of some kind, just like we have with the afl.margins variable. Under those circumstances, what you want is a function that will just add up all the values stored in a numeric vector. That’s what the sum() function does.

If we want to add up all 176 winning margins in the data set, we can do so using the following command:

sum( afl.margins )
## [1] 6213

Exercise: Take the sum of the first five observations of afl.margins, using the sum() function.

sum( afl.margins[1:5] )

Exercise: Now calculate the mean, by telling R to divide the output of the summation of the first five observations by 5. Use the sum() function.

sum( afl.margins[1:5] ) / 5

Although it’s pretty easy to calculate the mean using the sum() function, we can do it in an even easier way, since R also provides us with the mean() function.

Exercise: Calculate the mean of all 176 games using the mean() function.

mean( afl.margins )

Just to show you that there’s nothing funny going on, here’s what we would do to calculate the mean for the first five observations:

mean( afl.margins[1:5] )
## [1] 36.6

As you can see, this gives exactly the same answers as the previous calculations.

Fairly easy, huh?
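If you want to convince yourself that mean() really does nothing more than “add up and divide by the number of values”, you can combine sum() with length(). The sketch below types in the first five margins by hand, so that it runs even without the afl.margins variable; the name first.five is made up for this illustration:

```r
# The first five margins, typed in by hand
first.five <- c(56, 31, 56, 8, 32)

# Add everything up and divide by the number of values ...
sum( first.five ) / length( first.five )   # 36.6

# ... which is exactly what mean() does for you
mean( first.five )                         # 36.6
```

Using length() instead of typing 5 means the same command keeps working when the number of observations changes.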

Sometimes, you don’t want just the sum, but you want the cumulative sum. Again, R helps you out here. It sort of speaks for itself:

y <- cumsum( afl.margins[1:5] )
y
## [1]  56  87 143 151 183

The first element of y is simply the first element of afl.margins. The second element of y (87) is the sum of the first 2 elements of afl.margins (56 and 31). The third element of y is the sum of the first 3 elements of afl.margins (56, 31, and 56), and so on.
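Two small consequences are easy to check. The last element of the cumulative sum is the ordinary sum, and dividing each running total by the number of values so far gives a running mean. A sketch, with the first five margins typed in by hand (first.five and running.total are made-up names):

```r
first.five <- c(56, 31, 56, 8, 32)
running.total <- cumsum( first.five )   # 56 87 143 151 183

# The last running total equals the overall sum
running.total[5] == sum( first.five )   # TRUE

# Dividing each running total by the number of games so far
# gives a running mean
running.total / 1:5    # 56.00 43.50 47.67 37.75 36.60 (rounded)
```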

4.2.4 sum() and mean() with logical data

The sum() function is especially useful in combination with logical data, by virtue of TRUEs and FALSEs doubling as 1s and 0s, as you discovered in Section 3.3.4. It makes it quite easy to count how many cases of something are in your data set.

Suppose we want to know how many AFL margins in our data set are larger than 100? Let’s ask R:

afl.margins>100
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

This doesn’t quite give the answer we are after. It gives us a bunch of TRUEs and FALSEs, where a TRUE indicates that the margin is larger than 100. So what we need to do to get to our answer, is to count the number of TRUEs. Somewhat surprisingly, the sum() function helps us out here. Why is that?

sum(afl.margins>100)
## [1] 4

The reason that it works is that, as I discussed in Section 3.3.4, TRUE and FALSE act as 1 and 0, so when summing the collection of FALSEs and TRUEs, we are just summing 0s and 1s. Since adding a 0 doesn’t really do anything, what this boils down to is just summing the 1s. And summing a number of 1s is of course identical to just counting the number of 1s. So the end result of the sum operation is the number of 1s we had, or the number of TRUEs.
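
If you want to see the 1s and 0s with your own eyes, the sketch below converts a small logical vector to numbers. The vector answers is made up for this illustration.

```r
# A small logical vector, made up for this illustration
answers <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# Converting to numbers shows the 1s and 0s hiding behind TRUE and FALSE
as.numeric(answers)

# So summing the logicals counts the TRUEs
n.true <- sum(answers)
n.true
```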

Once we have the number of TRUEs, it is of course very easy to turn this frequency into a proportion. Using length(), we count the total number of games. And the proportion of games with a margin>100 is nothing more than the number of games with a margin>100 divided by the total number of games.

n <- length(afl.margins)
sum(afl.margins>100)/n
## [1] 0.02272727

Now if you really want to be badass, you could even use the mean() function to compute the proportion of interest:

mean(afl.margins>100)
## [1] 0.02272727

So, in sum, what I mean is that in some cases, we can use mean() to compute a proportion! This might seem tricky at first, but is nothing magical, really. Remember that the mean just adds up all things and then divides it by the total number. As per above, that is exactly what we need to do if we want to compute the proportion.

Exercise: What proportion of games has a winning margin of exactly 3?

sum(afl.margins==3)/length(afl.margins) #one way 
mean(afl.margins==3) #another way

This is a feature we will be using quite a bit, so it is a good idea to familiarize yourself. Often, it will help to use the work-from-within strategy. For example, can you make sense of this line?

x <- c(7, -3, -6, 4, 4, -1, 0, 8, 9, 2)
mean(abs(x)>2) 
## [1] 0.7

It computes the proportion of elements in x for which the absolute value is larger than 2. In that single statement, no fewer than 3 things are happening, starting from within and working your way to the outside:

1: compute the absolute value of x

abs(x)  
##  [1] 7 3 6 4 4 1 0 8 9 2

2: compute which elements of the absolute value of x are larger than 2

abs(x)  > 2
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE

3: compute the proportion of TRUEs in this vector

mean(abs(x)  > 2)
## [1] 0.7

4.3 A few more general functions

There are a few quite convenient functions you will be happy to know. As most of them take quite long to describe and just seeing what they do is much easier for all parties involved, I will often be brief on the description. As the saying goes, an R command is worth a thousand words.

4.3.1 rep() and seq()

For example, here is how you can repeat stuff:

rep(2,9)
## [1] 2 2 2 2 2 2 2 2 2
rep("z",4)
## [1] "z" "z" "z" "z"
rep(c(3,4),6)
##  [1] 3 4 3 4 3 4 3 4 3 4 3 4
rep(c(2,"q"),5)
##  [1] "2" "q" "2" "q" "2" "q" "2" "q" "2" "q"

Note how in the last example, 2 was “characterized”.

I always forget whether rep(2,9) creates 9 2s or 2 9s. Using named arguments solves this first-world problem.

rep(x = 2, times = 9)
## [1] 2 2 2 2 2 2 2 2 2
rep(times = 9, x = 2)
## [1] 2 2 2 2 2 2 2 2 2
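
Named arguments also make it easier to use options I have not shown above. One of them is the each argument of rep(). Here is a quick sketch of how it differs from times:

```r
rep(x = c(3, 4), times = 2)   # repeats the whole vector: 3 4 3 4
rep(x = c(3, 4), each = 2)    # repeats each element in turn: 3 3 4 4
```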

Here’s another cool function, if you find yourself in that sequence-making mood:

seq(2,12)
##  [1]  2  3  4  5  6  7  8  9 10 11 12
seq(2,12,3)
## [1]  2  5  8 11
seq(from=2,to=12,by=3) #same as above
## [1]  2  5  8 11
seq(from=2,to=12,length.out=6) #not same as above
## [1]  2  4  6  8 10 12

So that’s a nice way to generate a sequence.

4.3.2 head() and tail()

Some variables are pretty big. For example, the afl.margins variable contains 176 games, which is a lot of info to digest if it is all printed out on my computer screen. To that end, R provides you with a few useful functions to print out only a few elements. The first of these is head(), which prints out the first few elements (six by default),12 like this:

head( afl.margins )
## [1] 56 31 56  8 32 14

You can also use the tail() function to print out the last few elements.13

As always, R serves every whim you might have. If you want more than the default number of first entries, you do you!

head( afl.margins, n = 10 )
##  [1] 56 31 56  8 32 14 36 56 19  1

Looking at the last entries can be done by tail().
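
A minimal sketch of tail() in action. To keep the example self-contained, it uses a made-up stand-in vector x instead of afl.margins.

```r
# A stand-in vector of 20 numbers (made up, so the example runs on its own)
x <- 1:20

tail(x)          # the last six elements: 15 16 17 18 19 20
tail(x, n = 3)   # the last three elements: 18 19 20
```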

4.3.3 diff()

Try to understand what diff() does:

diff(c(1,3,9))
## [1] 2 6

Good boy! It computes the difference between elements 1 and 2, between elements 2 and 3, and so on.
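
If you like to see the machinery, here is one way to get the same result by hand, using negative indices to drop elements. This is only a sketch to illustrate the idea; by.hand is a name I made up.

```r
x <- c(1, 3, 9)

# Drop the first element, drop the last element, and subtract the two
by.hand <- x[-1] - x[-length(x)]
by.hand

# Same result as diff()
diff(x)
```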

4.3.4 max() and min()

People are fond of extremes. Maybe you are, too. What’s the biggest difference in scores, you wonder? You can ask R. It’s easy:

max(afl.margins)
## [1] 116

I am sure you are bright enough to guess what min() does and how to use it.
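
In case you want to check your guess, here is a small sketch on the first six margins, typed in by hand. It also uses range(), a function I have not introduced, which gives you both extremes at once.

```r
# The first six margins from afl.margins, typed in by hand
margins <- c(56, 31, 56, 8, 32, 14)

min(margins)     # smallest value: 8
max(margins)     # largest value: 56
range(margins)   # both at once: 8 56
```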

4.3.5 which() and which.max()

One function that can be handy is the which() function; it takes as input a vector of logicals and outputs the indices of the TRUE cases.

Exercise: Apply the which() function to find which games in afl.margins have a margin larger than 100.

which( afl.margins > 100 )
# Or:
large.cases <- afl.margins > 100
which( large.cases )

What this has done is shown us that the large cases correspond to games 12, 46, 157, and 163.
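
Keep in mind that which() gives you positions, not values. If you want the margins themselves, you can feed those positions back into the square brackets. A sketch on a short made-up vector (margins and idx are names I chose for this illustration):

```r
# A short made-up vector of margins
margins <- c(56, 104, 8, 116, 32)

idx <- which(margins > 100)   # positions of the large margins: 2 4
idx

margins[idx]                  # the margins themselves: 104 116
margins[margins > 100]        # same result, indexing with the logicals directly
```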

We know from above that the highest margin was a whopping 116. But which game has this monster score?

Of course, we could do this:

which(afl.margins == max(afl.margins))
## [1] 163

But also of course, R wants you to know it is vastly smarter than you, so you could also do this:

which.max(afl.margins)
## [1] 163

I don’t think I should tell you what which.min() does, do I?
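
One difference between the two approaches is worth knowing about. When the maximum occurs more than once, which.max() only reports the first position, whereas the which() version reports all of them. A sketch with a made-up vector:

```r
# A made-up vector in which the maximum (9) occurs twice
x <- c(4, 9, 2, 9, 1)

which(x == max(x))   # all positions of the maximum: 2 4
which.max(x)         # only the first position: 2
which.min(x)         # position of the smallest value: 5
```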

4.3.6 unique()

Sometimes you wanna go full Marie Kondo and remove all ballast. unique() does exactly that and removes all duplicate elements:

unique(afl.margins)
##  [1]  56  31   8  32  14  36  19   1   3 104  43  44  72   9  28  25  27  55  20
## [20]  16   7  23  40  48  64  22  95  15  49  52  50  10  65  12  39  26 108  53
## [39]  38   4  13  66  67  61  29  81  37  70  35  54  47   2  41  24  11  71  18
## [58]   0  60  57  83  84  30  68  75  63  82  73  33  76   5  94  98  89 101  21
## [77]  42 116   6

Make good use of it.
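
One way to make good use of it is to combine it with length() to count how many different values there are. A small sketch on a made-up vector:

```r
# A made-up vector with some duplicates
x <- c(56, 31, 56, 8, 32, 8)

unique(x)           # duplicates removed: 56 31 8 32
length(unique(x))   # number of distinct values: 4
```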

4.3.7 toupper()

A task that comes up quite often is making transformations to text. A simple example of this would be converting text to lower case or upper case, which you can do using the toupper() and tolower() functions. Both of these functions have a single argument x which contains the text that needs to be converted. Imagine we have the following text vector.

text <- c( "lIfe", "Impact" )

Exercise: Convert the text in text to lower case.

tolower( x = text )
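
Going the other way works just the same. For completeness, a sketch with the same text vector:

```r
text <- c("lIfe", "Impact")

toupper(x = text)   # "LIFE" "IMPACT"
```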

4.3.8 startsWith() and endsWith()

This is pretty self-explanatory. See for yourself.

x <- "KDB"
startsWith(x, "K")
## [1] TRUE
startsWith(x, "L")
## [1] FALSE
startsWith(x, "KD")
## [1] TRUE
startsWith(x, "KC")
## [1] FALSE

What happens, you wonder, when the input is not a character? Wonder no more:

x <- 1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10
startsWith(x, 1)
## Error in startsWith(x, 1): non-character object(s)

So we really need a character as input. This also holds for the second argument:

x <- "KDB17"
endsWith(x, "6")
## [1] FALSE
endsWith(x, "7")
## [1] TRUE
endsWith(x, 7)
## Error in endsWith(x, 7): non-character object(s)

The first two commands run without problem, because the numbers 6 and 7 are, by virtue of the quotation marks, treated as text. When we, however, start treating the 7 as the number it is, like in the last line, R spits out an error.
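
If you do want to ask this kind of question about numbers, one possible workaround is to turn them into text first with as.character(). This is a sketch of that idea, not the only way to do it; x.text is a name I made up.

```r
x <- 1:10

# Turn the numbers into text first, then ask which ones start with "1"
x.text <- as.character(x)
startsWith(x.text, "1")   # TRUE for "1" and "10", FALSE for the rest
```

Note that startsWith() happily checks every element of the vector in one go.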

4.3.9 Pasting strings together

Sometimes, you will need either to glue several character strings together or to pull them apart. To glue several strings together, the paste() function is very useful. There are two important arguments to the paste() function:

  • ... These dots refer to an unnamed argument, and “match” up against any number of inputs. In this case, the inputs should be the various different strings you want to paste together.
  • sep. This argument should be a string, indicating what characters R should use as separators, in order to keep each of the original strings separate from each other in the pasted output. By default, the value is a single space, sep = " ". This is made a little clearer when we look at the examples.

That probably doesn’t make much sense yet, so let’s start with a simple example. First, let’s try to paste two words together.

Exercise: Paste together the words “hello” and “world” using the paste() function, without specifying any other arguments.

paste( "hello", "world" )

Notice that R has inserted a space between the "hello" and "world". Suppose that’s not what I wanted. Instead, I might want to use . as the separator character, or to use no separator at all. To do either of those, I would need to specify sep = "." or sep = "". For instance:

paste( "hello", "world", sep = "." )
## [1] "hello.world"

To be honest, it does bother me a little that the default value of sep is a space. Normally when I want to paste strings together I don’t want any separator character, so I’d prefer it if the default were sep="". To that end, it’s worth noting that there’s also a paste0() function, which is identical to paste() except that it always assumes that sep="".

paste( "hello", "world", sep = "" )
## [1] "helloworld"
paste0( "hello", "world" )
## [1] "helloworld"
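
The inputs to paste() do not have to be single strings. As a rough sketch, here is what happens when one of the inputs is a vector, together with the collapse argument, which I have not discussed above (labels is a made-up name):

```r
# paste() works element by element when you give it vectors
labels <- paste("game", 1:3, sep = "_")
labels   # "game_1" "game_2" "game_3"

# collapse glues the elements of a single vector into one string
paste(c("a", "b", "c"), collapse = "-")   # "a-b-c"
```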

4.3.10 The any() function

In the afl.margins data, is there at least one game with a margin of 8? You can use any() to find out!

any( afl.margins == 8 )
## [1] TRUE

We also learn there is no game with a margin of 117.

any( afl.margins == 117 )
## [1] FALSE

Sweet.

4.3.11 The all() function

Do you also feel we are living in an age where mankind is longing for everything to be true? If you are one of those people, I proudly present to you the all() function. If the input is a logical vector, it checks whether all elements are TRUE. See for yourself:

x <- 1:10
x5 <- x>5
all(x5)
## [1] FALSE
x0 <- x>0
all(x0)
## [1] TRUE
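
You don’t need the intermediate variables, and any() and all() are close cousins. A short sketch that puts them side by side:

```r
x <- 1:10

all(x > 0)      # TRUE: every element is positive
any(x > 5)      # TRUE: at least one element exceeds 5
all(x > 5)      # FALSE: but not all of them do
!any(x <= 0)    # TRUE: "all are positive" is the same as "none is zero or negative"
```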

4.3.12 The all.equal() function aka the problem with floating-point arithmetic

If I’ve learned nothing else about transfinite arithmetic (and I haven’t) it’s that infinity is a tedious and inconvenient concept. Not only is it annoying and counterintuitive at times, but it has nasty practical consequences. As we were all taught in high school, there are some numbers that cannot be represented as a decimal number of finite length, nor can they be represented as any kind of fraction between two whole numbers; \(\sqrt{2}\), \(\pi\) and \(e\), for instance. In everyday life, we mostly don’t care about this. I’m perfectly happy to approximate \(\pi\) as 3.14, quite frankly. Sure, this does produce some rounding errors from time to time, and if I’d used a more detailed approximation like 3.1415926535 I’d be less likely to run into those issues, but in all honesty, I’ve never needed my calculations to be that precise. In other words, although our pencil and paper calculations cannot represent the number \(\pi\) exactly as a decimal number, we humans are smart enough to realise that we don’t care. Computers, unfortunately, are dumb … and you don’t have to dig too deep in order to run into some very weird issues that arise because they can’t represent numbers perfectly. Here is my favourite example:

0.1 + 0.2 == 0.3
## [1] FALSE

Obviously, R has made a mistake here, because this is definitely the wrong answer. Your first thought might be that R is broken, and you might be considering switching to some other language. But you can reproduce the same error in dozens of different programming languages, so the issue isn’t specific to R. Your next thought might be that it’s something in the hardware, but you can get the same mistake on any machine. It’s something deeper than that.

The fundamental issue at hand is floating point arithmetic, which is a fancy way of saying that computers will always round a number to a fixed number of significant digits. The exact number of significant digits that the computer stores isn’t important to us:14 what matters is that whenever the number that the computer is trying to store is very long, you get rounding errors. That’s actually what’s happening with our example above. There are teeny tiny rounding errors that have appeared in the computer’s storage of the numbers, and these rounding errors have in turn caused the internal storage of 0.1 + 0.2 to be a tiny bit different from the internal storage of 0.3.

How big are these differences? Let’s ask R:

0.1 + 0.2 - 0.3
## [1] 5.551115e-17

Knowing that e-17 should be read as “times 10^(-17)”, that is, times 0.00000000000000001, this is very tiny indeed. No sane person would care about differences that small. But R is not a sane person, and the equality operator == is very literal-minded. It returns a value of TRUE only when the two values that it is given are absolutely identical to each other. And in this case, they are not.

However, this only answers half of the question. The other half of the question is, why are we getting these rounding errors when we’re only using nice simple numbers like 0.1, 0.2 and 0.3? This seems a little counterintuitive. The answer is that, like most programming languages, R doesn’t store numbers using their decimal expansion (i.e., base 10: using digits 0, 1, 2 …, 9). We humans like to write our numbers in base 10 because we have 10 fingers. But computers don’t have fingers, they have transistors; and transistors are built to store 2 numbers, not 10. So you can see where this is going: the internal storage of a number in R is based on its binary expansion (i.e., base 2: using digits 0 and 1). And unfortunately, here’s what the binary expansion of 0.1 looks like:

\[ .1 \mbox{(decimal)} = .00011001100110011... \mbox{(binary)} \]

and the pattern continues forever. In other words, from the perspective of your computer, which likes to encode numbers in binary,15 0.1 is not a simple number at all. To a computer, 0.1 is actually an infinitely long binary number! As a consequence, the computer can make minor errors when doing calculations here.
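
If you are curious, you can ask R to show more digits than it normally prints. The sketch below uses sprintf(), a function I have not introduced; the format "%.20f" asks for 20 digits after the decimal point. The values in the comments are what I would expect on a standard machine.

```r
sprintf("%.20f", 0.1)         # "0.10000000000000000555"
sprintf("%.20f", 0.3)         # "0.29999999999999998890"
sprintf("%.20f", 0.1 + 0.2)   # "0.30000000000000004441"
```

None of the three is stored exactly, and the stored versions of 0.1 + 0.2 and 0.3 differ in the 17th decimal place.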

Hopefully, it is now clear that the problem is the result of the twin facts that (1) we usually think in decimal numbers and computers usually compute with binary numbers, and (2) computers are finite machines and can’t store infinitely long numbers. The only questions that remain are when you should care and what you should do about it. Thankfully, you don’t have to care very often: because the rounding errors are small, the only practical situation in which I’ve seen this issue arise is when you want to test whether two numbers are exactly identical (e.g., is someone’s response time equal to exactly \(2 \times 0.33\) seconds?). This is pretty rare in real-world data analysis, but just in case it does occur, it’s better to use a test that allows for a small tolerance. That is, if the difference between the two numbers is below a certain threshold value, we deem them to be equal for all practical purposes.

Okay, the problem is clear, but what about the solution? For instance, you could do something like this, which asks whether the difference between the two numbers is less than a tolerance of \(10^{-10}\):

abs( 0.1 + 0.2 - 0.3 ) < 10^-10
## [1] TRUE

Neat, but clumsy. R, do you have something else up your sleeve? Most definitely, you are too kind to ask! There is a function called all.equal() that lets you test for equality but allows a small tolerance for rounding errors:

all.equal( 0.1 + 0.2, 0.3 )
## [1] TRUE
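
One quirk to be aware of: when the two numbers are not equal, all.equal() does not return FALSE, but a piece of text describing the difference. If you need a plain TRUE or FALSE, you can wrap the call in isTRUE(). A small sketch:

```r
# When the numbers are equal (within tolerance), all.equal() returns TRUE
all.equal(0.1 + 0.2, 0.3)

# When they are not, it returns a description of the difference, not FALSE
all.equal(0.1 + 0.2, 0.4)

# Wrapping the call in isTRUE() always gives a plain TRUE or FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE
isTRUE(all.equal(0.1 + 0.2, 0.4))   # FALSE
```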

4.3.13 print()

The print() function displays things. That’s easy enough. The difficult bit is that it seems unnecessary. Consider the following code:

x <- 10
x
## [1] 10

This code has printed x, without using the print() function. Who on earth had so much time on their hands or that much need for validation to spend time making the print() function?

First off, it doesn’t hurt. In the code below, it doesn’t really do anything, but it helps make clear what I’m doing.

x <- 10
print(x)
## [1] 10

Second, it can be useful if you are sourcing a script (as I will discuss in Section ??). If you source a script, just having x in your script won’t show x, but print(x) will.

Third, if you want to have something printed while running a function (see Section XXX) or a loop (see Section XXX), you will need print().
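
To give you a taste of that third point, here is a tiny sketch with a loop. Loops are only explained later, so don’t worry about the details; just compare what the two versions show on screen.

```r
# Inside a loop, a bare variable name prints nothing...
for (i in 1:3) {
  i
}

# ...so you need print() to see the values
for (i in 1:3) {
  print(i)
}
```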

Finally, it sometimes makes things look nicer. One example we will encounter in Section 6.5.1 is that, if you want to look at a data frame in the browser environment we are using, using print() will make the data frame look nice. Weirdly, it is not needed for anything other than data frames, and even not needed for data frames if you are using R in RStudio instead of in the browser environment. Don’t worry if this sounds like gibberish now. You will see in due time.

4.4 Go out and play

If you have done and understood all of this, you can find some additional exercises in the RIntroExercises document. You can use this box to make the exercises for Chapter 4. Note that there are two documents: one with only the questions, and one with the questions as well as some suggested solutions.

5 Working in RStudio

Up till now, you have been working with R in a browser. This was, so I hope, useful for learning R. But once you start using R, you will no longer work in this nifty browser environment. This setup was only used because I hope it facilitates providing exercises and solutions. Once you start using R instead of learning R, you will not use R in a browser. Rather, you will use it in, for example, RStudio.

The R terminal that comes with the installation of R should, in principle, be enough for using R. However, it is not as visually pretty as the RStudio version, and lacks some of the cooler features that RStudio provides. That’s why we’ll be using R from within RStudio.

There are some specific things about using R in RStudio that you really learn best in RStudio directly, rather than in this browser environment. These things are quite important, because, I can not stress this enough, once you’re done learning R and start using R, you will no longer be running R code in this nifty browser environment!

This chapter will be a lot less interactive than the other materials. I recommend that you do not just read what is described here, but also try it out yourself.

First, a disclaimer. Some of the things I describe in this document (such as tab to autocomplete) aren’t just an RStudio thing. For example, if you’re running R in a terminal window instead of in RStudio, tab autocomplete works in exactly the way I describe below. I don’t bother to document that here: my assumption is that if you are running R in the terminal then you’re already familiar with using tab autocomplete. So I am not going to distinguish between what is an R feature and what is an RStudio thing. I do try to have a life outside of this, you know.

When you open RStudio, you will see there are different panes (or panels). Most of the current document is describing how using these panes can make your life as an R user easy. 16

You will see that often, when you get RStudio to do something using one of the panes, you’ll actually see the R commands that get created and show up. For example, when you install the abtest package using the Packages panel (no worries; I explain below), in the Console panel you will see install.packages("abtest") appearing as a command. RStudio has sent a command to the R console, exactly as if you’d typed it yourself!

This means there are often at least two different ways of doing things: using the menu-based options (aka panel-based interface) provided by RStudio and using command-based options (aka text-based interface) from R (in the R console). Throughout this chapter, the idea is that I’ll first show you the (often easy) way to do it using RStudio and then, if needed, describe the (sometimes awkward) R commands that do all the work. I suspect that mostly, you will be using the menu-based interface, but the text-based interface does deserve some attention. One reason to be at least aware of the command-based way of doing things is that when you make a script (see Section XXX; for example if you want to share your hard R work, or want to keep a memory of what you did), it can be handy to have everything that needs to be done as an R command in text, rather than as a click-on-this instruction. Of course, in that case, you can actually use the panel-based option, and then copy and store the R commands that RStudio might have generated in your script.

5.1 Using the Console panel for running R commands

To start, I will focus on the panel labelled Console. This is where R will execute the commands you ask it to perform. Working in the console pane is very similar to typing the commands you did before in the boxes in the browser I provided. However, since the pane is not just boxes, there are a few nifty things the console is helping you with, unlike these boxes. Let’s unpack these little nuggets.

5.1.1 R can sometimes tell that you’re not finished yet (but not often)

We know how to enter commands in R. As a recap, let’s use R to add 10 and 20. To do that, in the R console, type 10+20 and hit enter. If R gives the correct answer (30), we are good. So far so good. Now, for the cool stuff. If you hit enter in a situation where it’s “obvious” to R that you haven’t actually finished typing the command, R is just smart enough to keep waiting. For example, if you type 10 + and then press enter, even R is smart enough to realise that you probably wanted to type in another number. So if you type 10+ and then accidentally press enter, there’s a blinking cursor next to the plus sign on the new line. What this means is that R is still waiting for you to finish. It “thinks” you’re still typing your command, so it hasn’t tried to execute it yet. In other words, this plus sign is actually another command prompt. It’s different from the usual one (i.e., the > symbol) to remind you that R is going to “add” whatever you type now to what you typed last time. For example, if I then go on to type 20 and hit enter, what I get is the correct answer (30). And as far as R is concerned, this is exactly the same as if you had typed 10 + 20.
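
The same rule applies when you spread a command over several lines on purpose. A minimal sketch (the variable name total is made up): because the first line ends with a +, R knows the command is not finished and reads on.

```r
# A command may continue on the next line as long as the first line
# is obviously unfinished, here because it ends with a +
total <- 10 +
  20
total   # 30
```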

Similarly, consider the citation() function that we talked about earlier. Suppose you hit enter after typing citation(. Once again, R is smart enough to realise that there must be more coming – since you need to add the ) character – so it waits. I can even hit enter several times and it will keep waiting.

That being said, it’s not often the case that R is smart enough to tell that there’s more coming. For instance, in the same way that I can’t add a space in the middle of a word, I can’t hit enter in the middle of a word either. If I hit enter after typing citat I get an error because R thinks I’m interested in an “object” called citat and can’t find it:

> citat
Error: object 'citat' not found

What about if I typed citation and hit enter? In this case, we get something very odd, something that we definitely don’t want, at least at this stage. Here’s what happens:

citation
## function (package = "base", lib.loc = NULL, auto = NULL) 
## {
##     dir <- system.file(package = package, lib.loc = lib.loc)
##     if (dir == "") 
##         stop(gettextf("package '%s' not found", package), domain = NA)
BLAH BLAH BLAH

where the BLAH BLAH BLAH goes on for rather a long time, and you don’t know enough R yet to understand what all this gibberish actually means (of course, it doesn’t actually say BLAH BLAH BLAH - it says some other things we don’t understand or need to know that I’ve edited for length). This incomprehensible output can be quite intimidating to novice users, and unfortunately it’s very easy to forget to type the parentheses; so almost certainly you’ll do this by accident. Do not panic when this happens. Simply ignore the gibberish. As you become more experienced this gibberish will start to make sense, and you’ll find it quite handy to print this stuff out.17 But for now, just try to remember to add the parentheses when typing your commands using functions.

If you start doing this yourself, you’ll eventually get yourself in trouble (it happens to us all). Maybe you start typing a command, and then you realise you’ve screwed up. For example,

> citblation( 
+ 
+ 

You’d probably prefer R not to try running this command, right? If you want to get out of this situation, just hit the ‘escape’ key.18 R will return you to the normal command prompt (i.e. >) without attempting to execute the botched command.

5.1.2 Autocomplete using “tab”

At this stage, you know how to type in basic commands, including how to use R functions. And it’s probably beginning to dawn on you that there are a lot of R functions, all of which have their own arguments. You’re probably also worried that you’re going to have to remember all of them! Thankfully, it’s not that bad. In fact, very few data analysts bother to try to remember all the commands. I want to call your attention to a couple of simple tricks that RStudio makes available to you.

One thing I want to call your attention to is the autocomplete ability in RStudio. Let’s assume that what you want to do is to round a number. This time around, start typing the name of the function that you want (e.g., ro …), and then hit the “tab” key. RStudio will then display a little window with two panels. On the left, there’s a list of variables and functions that start with the letters that I’ve typed shown in black text, and some grey text that tells you where that variable/function is stored. Ignore the grey text for now: it won’t make much sense to you until we’ve talked about packages in Section ??. You can see that there are quite a few things that start with the letters ro: there’s something called rock, something called round, something called round.Date and so on. The one we want is round, but if you’re typing this yourself you’ll notice that when you hit the tab key the window pops up with the top entry (i.e., rock) highlighted. You can use the up and down arrow keys to select the one that you want. Or, if none of the options looks right to you, you can hit the escape key (“esc”) or the left arrow key to make the window go away.

In our case, the thing we want is the round option, so we’ll select that. When you do this, you’ll see that the panel on the right changes. Previously, it had been telling us something about the rock data set (i.e., “Measurements on 48 rock samples…”) that is distributed as part of R. But when we select round, it displays information about the round() function, exactly as it is shown in Figure 5.1.

Figure 5.1: Start typing the name of a function or a variable, and hit the tab key. RStudio brings up a little dialogue box like this one that lets you select the one you want, and even prints out a little information about it.

This display is really handy. The very first thing it says is round(x, digits = 0): what this is telling you is that the round() function has two arguments. The first argument is called x, and it doesn’t have a default value. The second argument is digits, and it has a default value of 0. In a lot of situations, that’s all the information you need. But RStudio goes a bit further and provides some additional information about the function underneath. Sometimes that additional information is very helpful, sometimes it’s not: RStudio pulls that text from the R help documentation, and my experience is that the helpfulness of that documentation varies wildly. Anyway, if you’ve decided that round() is the function that you want to use, you can hit the right arrow or the enter key, and RStudio will finish typing the rest of the function name for you.

The RStudio autocomplete tool works slightly differently if you’ve already got the name of the function typed and you’re now trying to type the arguments. For instance, suppose I’ve typed round( into the console, and then I hit tab. RStudio is smart enough to recognise that I already know the name of the function that I want, because I’ve already typed it, and figures that I’m interested in the arguments of that function. Being an obedient servant, it gives us what we want. You can see this in Figure 5.2. Again, the window has two panels, and you can interact with this window in exactly the same way that you did with the window shown in Figure 5.1. On the left-hand panel, you can see a list of the argument names. On the right-hand side, it displays some information about what the selected argument does.

Figure 5.2: If you’ve typed the name of a function already along with the left parenthesis and then hit the tab key, RStudio brings up a different window to the one shown above. This one lists all the arguments to the function on the left, and information about each argument on the right.

5.1.3 Browsing your command history

One thing that RStudio does automatically is to keep track of your “command history”. That is, it remembers all the commands that you’ve previously typed. You can access this history in a few different ways. To see how this works, let’s type some commands in the R command line in the console.

age <- 2
age <- age + 1
age * 10
myName <- "Dan"

The simplest way is to use the up and down arrow keys. If you hit the up key, the R console will show you the most recent command that you’ve typed. Hit it again, and it will show you the command before that. If you want the text on the screen to go away, hit escape.19 Using the up and down keys can be really handy if you’ve typed a long command that had one typo in it. Rather than having to type it all again from scratch, you can use the up key to bring up the command and fix it.

Another method is to start typing some text and then hit the Control key and the up arrow together (on Windows or Linux) or the Command key and the up arrow together (on a Mac). This will bring up a window showing all your recent commands that started with the same text as what you’ve currently typed. That can come in quite handy sometimes.

5.2 Using the History panel for accessing your command history

This seamlessly brings us to one of the other panels in RStudio: the History panel. On the upper right-hand side of the RStudio window, you’ll see a tab labelled History. Click on that, and you’ll see a list of all your recent commands displayed in that panel: it should look something like Figure 5.3. If you double click on one of the commands, it will be copied to the R console. You can achieve the same result by selecting the command you want with the mouse and then clicking the “To Console” button.

Figure 5.3: The history panel is located in the top right hand side of the RStudio window. Click on the word History and it displays this panel.

5.3 Using the Environment panel for managing the workspace

An important concept when working with R is the notion of the workspace, also referred to as the global environment. Roughly, the workspace is an abstract location in which R variables are stored.

To have something to work with, let’s add some content to the workspace:

keeper <- 8.5395945
lover <- 2.7183
seeker <- 3.1415

5.3.1 Listing the content of the workspace

How can you now examine the contents of the workspace, i.e., which variables does R keep in its memory? If you’re using RStudio, you will be both happy and somewhat unsurprised to hear that there’s a dedicated panel for that. You will probably find that the easiest way to do this is to use the Environment panel in the top right-hand corner. Click on that, and you’ll see a list that looks very much like the one shown in Figures 5.4 and 5.5.

Figure 5.4: The RStudio Environment panel shows you the contents of the workspace. The view shown above is the list view. To switch to the grid view, click on the menu item on the top right that currently reads list. Select grid from the dropdown menu, and then it will switch to a view like the one shown in Figure 5.5.

Figure 5.5: The RStudio Environment panel shows you the contents of the workspace. Compare this grid view to the list view in Figure 5.4.

5.3.1.1 Doing it using R commands

If you want to list the content of the workspace using the command line, there are a couple of functions that may come in handy. We will only use the ls() function.20 If you try it out, you will see something like this:

ls()
## [1] "keeper" "lover"  "seeker"

5.3.2 Removing variables from the workspace

Looking over that list of variables, it occurs to me that I really don’t need them any more. I created them originally just to make a point, but they don’t serve any useful purpose anymore, and now I want to get rid of them. I’ll show you how to do this, but first I want to warn you – there’s no “undo” option for variable removal. Once a variable is removed, it’s gone forever. But quite clearly we have no need for these variables at all, so we can safely get rid of them.

In RStudio, the easiest way to remove variables is to use the Environment panel. Assuming that you’re in grid view (i.e., Figure 5.5), check the boxes next to the variables that you want to delete, then click on the “Clear” button (the broom) at the top of the panel. When you do this, RStudio will show a dialogue box asking you to confirm that you really do want to delete the variables. It’s always worth checking that you really do, because as RStudio is at pains to point out, you can’t undo this. Once a variable is deleted, it’s gone. In any case, if you click “yes”, that variable will disappear from the workspace: it will no longer appear in the environment panel, and it won’t show up when you use the ls() command. Removing all variables can be done by clicking the broom (in the List view) or by clicking the broom after selecting all variables (in the Grid view), which can be easily done by checking the box next to “Name”.

5.3.2.1 Doing it using R commands

If you want to remove variables using R commands, you will be happy to meet the remove function rm(). The simplest way to use rm() is just to type in a (comma separated) list of all the variables you want to remove. Let’s say I want to get rid of seeker and lover, but I would like to keep keeper. To do this, all I have to do is type:

rm( seeker, lover )

There’s no visible output, but if I now inspect the workspace:

ls()
## [1] "keeper"

I see that there’s only the keeper variable left. As you can see, rm() can be very handy for keeping the workspace tidy. If you want to clear the entire workspace, the following command can be used:

rm(list=ls()) 

This is a somewhat mysterious command. If you ever said you hated statistics because it destroys all of the mystery, this one’s for you.
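
To take away a little of the mystery: ls() hands back the names of the variables as a set of text strings, and the list argument of rm() accepts exactly such a set of names. The sketch below is my own illustration, not part of the example above. It recreates the three variables and removes two of them by name, so it assumes you do not mind those three names being overwritten in your workspace.

```r
# Recreate the three example variables from this section.
keeper <- 8.5395945
lover  <- 2.7183
seeker <- 3.1415

# ls() returns the variable names as text strings (a character vector).
is.character(ls())   # TRUE

# The list argument of rm() takes such a set of names,
# so this line does the same job as rm(seeker, lover).
rm(list = c("seeker", "lover"))

# Only keeper survives; seeker and lover are gone.
"keeper" %in% ls()   # TRUE
"seeker" %in% ls()   # FALSE
```

Seen this way, rm(list=ls()) simply hands rm() the names of everything in the workspace at once.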

5.4 Using the Help panel to, well, get help

I have discussed earlier that a big secret of being successful at programming, or at life more generally, is being able to ask for help. You might already have seen the Help panel on your left. It has a nifty search box, which will bring you to R’s built-in help documentation.

5.4.0.1 Doing it using R commands

We already know, from Section XXX, which commands to type if we desire help. For example, if we want to look at the help documentation for the load() function, you already know you could type either of the following:

?load 
help("load")

When you do that, R goes looking for the help file for the “load” topic. If it finds one, RStudio takes it and displays it in the, wait for it, Help panel.

Also, if you do a fuzzy search for a help topic, you will be directed to the Help panel:

??load 
help.search("load")

This will bring up a list of possible topics in the Help panel.

5.5 Using (mostly) the Packages panel for dealing with R packages

Remember I told you before what packages are and how important they are? If your answer is anything other than “Yes, of course, I am on top of this material”, you might want to revisit Section XXX.

Dealing with packages can be done in two ways: using the command line of the console, or —yes, I knew you would have guessed it!— using yet another panel, the —again, no surprises here— Packages panel.

Figure 5.6: The Packages panel.

Right, let’s get started. The first thing you need to do is look in the lower right-hand panel in RStudio. You’ll see a tab labelled “Packages”. Click on the tab, and you’ll see a list of packages that looks something like Figure 5.6. Every row in the panel corresponds to a different package, and every column is a useful piece of information about that package. Going from left to right, here’s what each column is telling you:

  • The check box on the far left column indicates whether or not the package is loaded.
  • The one word of text immediately to the right of the check box is the name of the package.
  • The short passage of text next to the name is a brief description of the package.
  • The number next to the description tells you what version of the package you have installed.
  • The little x-mark next to the version number is a button that you can push to uninstall the package from your computer (you almost never need this).
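
For the curious, much of what these columns show can also be looked up from the console. The lines below are a small sketch of my own, using the stats package as the example because it comes with every R installation. Nothing here is needed for the course.

```r
# Is the stats package currently loaded? (This is what the check box tells you.)
"package:stats" %in% search()

# Which version of the package is installed?
packageVersion("stats")

# A short description of the package (its Title field).
packageDescription("stats")$Title
```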

5.5.1 Installing new R packages

Using the RStudio tools is, again, dead simple. In the top left-hand corner of the Packages panel (Figure 5.6) you’ll see a button called “Install”. If you click on that, it will bring up a window like the one shown in Figure 5.7.

Figure 5.7: The package installation dialog box in RStudio

There are a few different buttons and boxes you can play with. Ignore most of them. Just go to the line that says “Packages” and start typing the name of the package that you want. As you type, you’ll see a dropdown menu appear (Figure 5.8), listing names of packages that start with the letters that you’ve typed so far.

Figure 5.8: When you start typing, you’ll see a dropdown menu suggest a list of possible packages that you might want to install

You can select from this list, or just keep typing. Either way, once you’ve got the package name that you want, click on the install button at the bottom of the window. R then goes off to the internet, has a conversation with CRAN, downloads some stuff, and installs it on your computer. You probably don’t care about all the details of R’s little adventure on the web, but R is rather chatty, so it reports a bunch of gibberish that you really aren’t all that interested in:

trying URL 'http://cran.rstudio.com/bin/macosx/contrib/3.0/psych_1.4.1.tgz'
Content type 'application/x-gzip' length 2737873 bytes (2.6 Mb)
opened URL
==================================================
downloaded 2.6 Mb
The downloaded binary packages are in
    /var/folders/cl/thhsyrz53g73q0w1kb5z3l_80000gn/T//RtmpmQ9VT3/downloaded_packages

Despite the long and tedious response, all that really means is “I’ve installed the psych package”. I find it best to humour the talkative little automaton. I don’t actually read any of this garbage, I just politely say “thanks” and go back to whatever I was doing.

5.5.1.1 Doing it with R commands

When you do all the things I mentioned above, you’ll see the following command appear in the R console:

install.packages("psych")

This is the R command that does all the work.
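
If you ever put this command in a script, you may not want to download the package again every time the script runs. One possible approach, sketched below, is to check first whether the package is already there. The helper name is_installed is something I made up for this illustration; it is not a built-in R function. The install line is commented out on purpose, so that running the sketch never goes off to the internet by accident.

```r
# Hypothetical helper (not part of R itself): TRUE if the package can be found.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # stats ships with R, so this is TRUE

# Only install the package when it is missing.
if (!is_installed("psych")) {
  message("psych is not installed yet")
  # install.packages("psych")
}
```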

5.5.2 Loading R packages

Remember that a package must be loaded before it can be used. That seems straightforward enough, so let’s try loading packages. For this example, I’ll use the foreign package. The foreign package is a collection of tools that are very handy when R needs to interact with files that are produced by other software packages (e.g., SPSS). It comes bundled with R, so it’s one of the ones that you have installed already, but it won’t be one of the ones loaded. Inside the foreign package is a function called read.spss(). It’s a handy little function that you can use to import an SPSS data file into R, so let’s pretend we want to use it. Currently, the foreign package isn’t loaded, so if I ask R to tell me if it knows about a function called read.spss() it tells me that there’s no such thing…

read.spss()
## Error in read.spss(): could not find function "read.spss"

Now let’s load the package. In RStudio, the process is dead simple: go to the Packages tab, find the entry for the foreign package, and check the box on the left-hand side. So you can use the RStudio package panel to do all your package loading for you. The moment that you do this, you’ll see a command appear in the R console. Oh, I suppose we should check to see if our attempt to load the package actually worked. Let’s see if R now knows about the existence of the read.spss() function…

read.spss()
## Error in grep("^(http|ftp|https)://", file): argument "file" is missing, with no default

It complains that we didn’t provide the name of the file we’d like to load (and it has every right to!), but at least it no longer complains that the function does not exist. So we must have done something right.

5.5.2.1 Doing it using R commands

As you might have gleaned from whatever R spit out when using the Packages panel to load the package, the command to load the foreign package is just this:

library(foreign)

5.5.3 Inspecting a package

Every package that you have loaded is another environment. Just like we can look up the contents of the workspace aka the global environment, we can look up the contents of these other environments. In fact, you can actually use the Environment panel in RStudio to browse any of your loaded packages (just click on the text that says “Global Environment” and you’ll see a dropdown menu like the one shown in Figure ??).

The key thing to understand then is that you can access any of the R variables and functions that are stored in one of these environments, precisely because those are the environments that you have loaded!21

It should not come as a huge surprise that you could use the ls() function for this as well, if you are keen on using the command line. You just have to be a bit more explicit in your command. If I wanted to find out what is in the package:foreign environment (i.e., the environment into which the contents of the foreign package have been loaded), here’s what I’d get:

ls("package:foreign")
##  [1] "data.restore"  "lookup.xport"  "read.arff"     "read.dbf"     
##  [5] "read.dta"      "read.epiinfo"  "read.mtp"      "read.octave"  
##  [9] "read.S"        "read.spss"     "read.ssd"      "read.systat"  
## [13] "read.xport"    "write.arff"    "write.dbf"     "write.dta"    
## [17] "write.foreign"
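
If you are wondering which environments are available at any given moment, the search() function lists them. The sketch below is my own addition. It uses the splines package as a stand-in, purely because it is part of every R installation; the same steps should work for foreign or any other installed package.

```r
# Load a package that comes with every R installation.
library(splines)

# search() lists the environments R looks through, loaded packages included.
"package:splines" %in% search()   # TRUE once the package is loaded

# A peek at the first few things the package provides.
head(ls("package:splines"))
```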

5.5.4 Unloading a package

Sometimes, especially after a long session of working with R, you find yourself wanting to get rid of some of those packages that you’ve loaded. The RStudio package panel makes this exactly as easy as loading the package in the first place. Find the entry corresponding to the package you want to unload and uncheck the box.

And the package is unloaded. We can verify this by seeing if the read.spss() function still exists:

read.spss()
## Error in read.spss(): could not find function "read.spss"

Nope. Definitely gone.

5.5.4.1 Doing it using R commands

The following bit is just for completeness. You don’t need to know this command.

When you use the Packages panel to unload the foreign package, you might have seen this command appear on your screen:

detach("package:foreign", unload=TRUE)

There’s nothing more to say here.

5.5.5 Updating R packages

Every now and then the authors of packages release updated versions. The updated versions often add new functionality, fix bugs, and so on. It’s generally a good idea to update your packages periodically. In the packages panel, click on the “Update” button. This will bring up a window that looks like the one shown in Figure 5.9. In this window, each row refers to a package that needs to be updated. You can tell R which updates you want to install by checking the boxes on the left. If you’re feeling lazy and just want to update everything, click the “Select All” button, and then click the “Install Updates” button. R then prints out a lot of garbage on the screen, individually downloading and installing all the new packages. This might take a while to complete depending on how good your internet connection is. Go make a cup of coffee. Come back, and all will be well.

5.5.5.1 Doing it using R commands

There’s an update.packages() function that you can use to do this, but it’s probably easier to stick with the RStudio tool, so I’m not gonna bother to explain.

Figure 5.9: The RStudio dialog box for updating packages

5.5.6 Masking

Something you should be aware of is this. Sometimes you’ll attempt to load a package, and R will print out a message telling you that something or other has been “masked”. This will be confusing to you if I don’t explain it now, and it actually ties very closely to the whole reason why R forces you to load packages separately from installing them. Here’s an example. 22

Two of the packages that you might encounter in your R career are called car and psych. The car package is short for “Companion to Applied Regression” (which is a really great book, I’ll add), and it has a lot of tools that I’m quite fond of. The car package was written by a guy called John Fox, who has written a lot of great statistical tools for social science applications. The psych package was written by William Revelle, and it has a lot of functions that are very useful for psychologists in particular, especially in regards to psychometric techniques. For the most part, car and psych are quite unrelated to each other. They do different things, so not surprisingly almost all of the function names are different. But… there’s one exception to that. The car package and the psych package both contain a function called logit().23 This creates a naming conflict. If I load both packages into R, an ambiguity is created. If the user types in logit(100), should R use the logit() function in the car package, or the one in the psych package? The answer is: R uses whichever package you loaded most recently, and it tells you this very explicitly. Here’s what happens when I load the car package, and then afterwards load the psych package:

library(car)
## Warning: package 'car' was built under R version 4.2.3
## Loading required package: carData
library(psych)
## Warning: package 'psych' was built under R version 4.2.3
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit

The output here is telling you that the logit object (i.e., function) in the car package is no longer accessible to you. It’s been hidden (or “masked”) from you by the one in the psych package. You can get R to use the one from the car package by using car::logit() as your command rather than logit(), since the car:: part tells R explicitly which package to use. 24
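
You can get a feel for masking without installing anything. In the toy sketch below (my own illustration, not the car and psych situation itself), I create a function in the workspace that reuses the name of R’s built-in sqrt() function. The workspace version then hides the built-in one, and the base:: prefix gets the original back, in the same way that car:: does for logit().

```r
# A toy function that (unwisely) reuses the name of R's built-in sqrt().
sqrt <- function(x) "this is my own sqrt"

sqrt(16)         # the workspace version masks the built-in one
base::sqrt(16)   # base:: asks explicitly for the built-in version: 4

# Removing the toy version makes the built-in one visible again.
rm(sqrt)
sqrt(16)         # 4
```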

5.6 Using the Plots panel

At this point, we haven’t yet discussed how to make plots in R. We will do so later, in Chapter XXX, but to give an idea how this works in RStudio, type the following in your Console:

plot(1:3,4:6)

As you will see, it produces a (vastly uninteresting) plot in yet another panel, appropriately called the Plots panel. How do you save the picture? This is another one of those situations where the easiest thing to do is to use the RStudio tools: click on the “Export” button in the Plots panel. When you do that you’ll see a menu that contains the options “Save as PDF”, “Save as Image” and “Copy to Clipboard”. All of these options work. They will bring up dialogue boxes that give you a few options that you can play with, but besides that, it’s pretty simple. This works pretty nicely for most situations.

5.6.0.1 Doing it using R commands

Saving plots using R commands can be somewhat annoying, to say the least. I do not recommend it. You can thank me later.

5.7 Using the Files panel for managing the file system

In this section, I talk a little about how R interacts with the file system on your computer.

5.7.1 Displaying files, navigating the file system, and more

For our purposes, the easiest way to navigate the file system is to make use of RStudio’s built-in tools. The Files panel, in the lower right-hand area of RStudio and shown in Figure 5.10, is actually a pretty decent file browser. What can you do with it?

As you can tell, the Files panel is a very handy little tool for navigating the file system. You can access folders and subfolders just like you can on your computer. You can point and click on the names to move around the file system. Let’s say I’m looking at the actual screen shown in Figure 5.10.

Figure 5.10: The file panel is the area shown in the lower right hand corner. It provides a very easy way to browse and navigate your computer using R.

At the top of the Files panel, you see some text that says “Home > Rbook > data”. What that means is that it’s displaying the files that are stored in the /Users/dan/Rbook/data directory on my computer.

The Files panel can be used to do other things than displaying, like navigating. If you want to move “up” to the parent folder (e.g., from /Users/dan/Rbook/data to /Users/dan/Rbook), click on the “…” link in the Files panel. To move to a subfolder, click on the name of the folder that you want to open.

But you can use the Files panel for so much more than just displaying and navigating. If you look at the buttons and menu options that it presents, you can even use it to Delete, Rename, Copy or Move files, and to create new folders. You can delete files from your computer using the “Delete” button, rename them with the “Rename” button, and so on. However, since most of that functionality isn’t critical to the basic goals of this book, I’ll let you discover those on your own.

Further, it can be used to load or open files in R. You can open some types of files by clicking on them (not necessarily in R! For example, if you click on a pdf, it will open outside of R). If, for example, you want to open a script (as we will discuss in Section 5.9), you should, in the Files panel, browse to the folder where you saved the script. Once you have found the script, just click it. This will open a new window within another panel (the “Source” panel) where you see your script. Opening a data file (like a .csv file) from the Files panel will be discussed in Section XXX.

5.7.2 Setting the working directory

Finally, you can use the Files panel to set the working directory. The what? Well, an important concept to grasp is the idea of a working directory. The working directory is just “whatever folder R is currently looking in to find stuff”.

Sometimes, you will want to change the R working directory. You can do so using the Files panel. In particular, you need to click on the button with the gear that reads “More”. This will bring up a little menu, and one of the options will be “Set as Working Directory”. By clicking it, R will set the working directory to whatever folder you are currently in. It will even show you the R command needed to achieve this feat.

5.7.2.1 Doing it (and more!) using R commands

When clicking along, you might have noticed that RStudio sends a command to the R console, exactly as if you’d typed it yourself. In particular, it uses the setwd() function, where wd obviously refers to Working Directory. You can tell that it has done its job because this command appears in the console, for example, setwd("~/Rbook/data").

Sometimes, you’ll want to just know what the working directory is. You can find out by using the getwd() command.
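
Here is a small sketch of how the two commands can work together. It is my own example: the folder I switch to is tempdir(), a scratch folder that R creates for each session, chosen only because it is guaranteed to exist on any computer.

```r
# Remember where we started.
old_wd <- getwd()

# Move to the scratch folder and check that R agrees.
setwd(tempdir())
getwd()

# Go back to where we came from.
setwd(old_wd)
getwd()
```

Storing the old working directory before changing it is a handy habit, because it lets you return to where you were without having to type the full path.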

5.8 Importing data files using the Files or Environment panels

Let’s now turn to the crucial but slightly annoying question of how you can load data from a range of different sources. As is often the case with R, the basic answer is simple but there are quite a few nuts and bolts to it. However, for the purposes of this course, we will stick to the basic answer. As an example, we will use a file called AnnaF.csv.

Basically, there are two ways in which you can import data using RStudio (for now at least; we will encounter a third one in Section 5.10).

One is to use the Files panel to go to the folder that contains the to-be-read file, left-click on the data file, and then click Import Dataset. You should see something like in Figure 5.11.

Figure 5.11: The dialog box that shows up when you are importing data using the Files panel.

Note that the First Row as Names checkbox has been checked, because (there is no easy way to say this) the first row contains the names of the variables in this particular data set. If you want to import another data set, this might or might not be the case, so it is important to make sure by checking it yourself! Also, make sure to select the appropriate Delimiter (i.e., the stuff indicating when a new column should start). For this data set, columns are indicated by a semicolon (puntkomma in Dutch), but of course things might be different for different data sets. If you are unsure which Delimiter to choose, try out a few and see what each does in the Data Preview. Once all choices are made, press the Import button. Note that, when everything worked, a new variable is now created in your R workspace. R (probably) has also automatically used the View() function to show you the data set in R.

Another is to go to the Environment panel and click on Import Dataset. You will see that there are several possibilities, depending on the type of your file. One slightly counterintuitive thing to remember is that if you want to import a csv data set, you should select the From Text (base) option, even though you are not trying to import a text file! (Remember that everybody can be a little weird, sometimes.) Browse to wherever you stored your file, and once you have located it, click on the file. You should see something like in Figure 5.12.

Figure 5.12: The dialog box that shows up when you are importing data using the Environment panel.

Annoyingly (everybody can be a bit annoying, sometimes), indicating whether or not the first row contains names (in this case: yes!) and how the columns are indicated (in this case: by a semicolon!) should be done slightly differently compared to the first approach: by selecting Yes (in this case) under Heading, and by selecting Semicolon (in this case) under Separator, respectively. Having done that, just press the Import button, sit back, relax and enjoy.

Whichever way you choose to do the import, R has suggested a name to call the variable that was the result of reading in your data. Of course you can easily overwrite that, if desired, for example, like this:

myData <- AnnaF 

Note that there are many more data formats beyond csv that you can import in R. I am not gonna bother explaining all of them, since most of those follow much the same route as described for csv files. If you ever need to import a data set and you find yourself in trouble, look around for help, for example using the resources listed in Section XXX.

Also, I want to already spill the beans that whatever is imported that way is a data frame, a sentence which is a taunting mystery to you right now but will become demystified in Section XXX.

5.8.0.1 Doing it using R commands

As with lots of the other tasks, importing data can also be done in the R console, but this is outside the scope of this course. As you will see, if you import data using the steps described above, the relevant R commands will turn up in the console. If you need those (for example, to store in a script), you can of course copy them, but I do not recommend importing data using R commands, unless you are importing data with exotic formats, and that is beyond the scope of this course.
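
Just so that the commands turning up in the console do not look too alien, here is a rough sketch of the kind of command you can expect for a semicolon-separated file. Everything in it is made up for the illustration: the file name, the two columns, and the numbers. To keep it self-contained, it first writes a tiny file to R’s scratch folder and then reads it back in. The exact command RStudio shows you may look different.

```r
# Create a tiny semicolon-separated file to practise on (invented data).
csv_path <- file.path(tempdir(), "example.csv")
writeLines(c("id;score", "1;7.5", "2;9.0"), csv_path)

# header = TRUE: the first row contains the variable names.
# sep = ";": columns are separated by semicolons.
myData <- read.csv(csv_path, header = TRUE, sep = ";")

print(myData)
```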

5.9 Using scripts in the Source panel

When you start analysing real-world data sets, you will rapidly find yourself needing to write something called scripts. Computer programs come in quite a few different forms: the kind of program that we’re most interested in from the perspective of everyday data analysis using R is known as a script. Script files are those with a .R file extension. These aren’t data files at all; rather, they’re used to save a collection of commands that you want R to execute later. It’s just a glorified text file in which you write out all the commands that you want R to run. You can write your script using whatever software you like.

In real-world data analysis writing scripts is a key skill – and as you become familiar with R you’ll probably find that most of what you do involves scripting rather than typing commands at the R prompt. The idea behind a script is that, instead of typing your commands into the R console one at a time, you write them all in a file. Not only is it a way to store the commands you need, it also makes running the code easier. Once you’ve finished writing them and saved the file, you can get R to execute all the commands in your file at once. In a moment I’ll show you exactly how this is done, but first I’d better explain why you should care.

5.9.1 Why use scripts?

To understand why scripts are so very useful, it may be helpful to consider the drawbacks to typing commands directly at the command prompt. The approach that we’ve been adopting so far, in which you type commands one at a time, and R sits there patiently in between commands, is referred to as the interactive style. Doing your data analysis this way is rather like having a conversation … a very annoying conversation between you and your data set, in which you and the data aren’t directly speaking to each other, and so you have to rely on R to pass messages back and forth. This approach makes a lot of sense when you’re just trying out a few ideas: maybe you’re trying to figure out what analyses are sensible for your data, or maybe you’re just trying to remember how the various R functions work, so you’re just typing in a few commands until you get the one you want. In other words, the interactive style is very useful as a tool for exploring your data. However, it has a number of drawbacks:

  • It’s hard to save your work effectively. You can save the workspace so that later on you can load any variables you created. You can save your plots as images. And you can even save the history or copy the contents of the R console to a file. Taken together, all these things let you create a reasonably decent record of what you did. But it does leave a lot to be desired. It seems like you ought to be able to save a single file that R could use (in conjunction with your raw data files) and reproduce everything (or at least, everything interesting) that you did during your data analysis.

  • It’s annoying to have to go back to the beginning when you make a mistake. Suppose you’ve just spent the last two hours typing in commands. Over the course of this time you’ve created lots of new variables and run lots of analyses. Then suddenly you realise that there was a nasty typo in the first command you typed, so all of your later numbers are wrong. Now you have to fix that first command, and then spend another hour or so combing through the R history to try and recreate what you did.

  • You can’t leave notes for yourself. Sure, you can scribble down some notes on a piece of paper, or even save a Word document that summarises what you did. But what you really want to be able to do is write down an English translation of your R commands, preferably right “next to” the commands themselves. That way, you can look back at what you’ve done and actually remember what you were doing. In the simple exercises we’ve engaged in so far, it hasn’t been all that hard to remember what you were doing or why you were doing it, but only because everything we’ve done could be done using only a few commands, and you’ve never been asked to reproduce your analysis six months after you originally did it! When your data analysis starts involving hundreds of variables and requires quite complicated commands to work, then you really, really need to leave yourself some notes to explain your analysis to, well, yourself.

  • It’s nearly impossible to reuse your analyses later, or adapt them to similar problems. Suppose that, sometime in January, you are handed a difficult data analysis problem. After working on it for ages, you figure out some really clever tricks that can be used to solve it. Then, in September, you get handed a really similar problem. You can sort of remember what you did, but not very well. You’d like to have a clean record of what you did last time, how you did it, and why you did it the way you did. Something like that would really help you solve this new problem.

  • It’s hard to do anything except the basics. There’s a nasty side effect of these problems. Typos are inevitable. Even the best data analyst in the world makes a lot of mistakes. So the chance that you’ll be able to string together dozens of correct R commands in a row is very small. So unless you have some way around this problem, you’ll never really be able to do anything other than simple analyses.

  • It’s difficult to share your work with other people. Because you don’t have this nice clean record of what R commands were involved in your analysis, it’s not easy to share your work with other people. Sure, you can send them all the data files you’ve saved, and your history and console logs, and even the little notes you wrote to yourself, but odds are pretty good that no-one else will really understand what’s going on (trust me on this: I’ve been handed lots of random bits of output from people who’ve been analysing their data, and it makes very little sense unless you’ve got the original person who did the work sitting right next to you explaining what you’re looking at).

Ideally, what you’d like to be able to do is something like this… Suppose you start out with a data set myrawdata.csv. What you want is a single document – let’s call it mydataanalysis.R – that stores all of the commands that you’ve used in order to do your data analysis. Kind of similar to the R history but much more focused. It would only include the commands that you want to keep for later. Then, later on, instead of typing in all those commands again, you’d just tell R to run all of the commands that are stored in mydataanalysis.R. Also, in order to help you make sense of all those commands, what you’d want is the ability to add some notes or comments within the file so that anyone reading the document for themselves would be able to understand what each of the commands actually does. But these comments wouldn’t get in the way: when you try to get R to run mydataanalysis.R it would be smart enough to recognise that these comments are for the benefit of humans, and so it would ignore them. Later on, you could tweak a few of the commands inside the file (maybe in a new file called mynewdataanalysis.R) so that you can adapt an old analysis to be able to handle a new problem. And you could email your friends and colleagues a copy of this file so that they can reproduce your analysis themselves. In other words, what you want is a script. (There are better ways of keeping track of the lifecycle of a script and better ways of sharing scripts as well, but let’s not go there for now.)

5.9.2 Writing our first script

Figure 5.13: A screenshot showing the hello.R script if you open it using the default text editor (TextEdit) on a Mac. Using a simple text editor like TextEdit on a Mac or Notepad on Windows isn’t actually the best way to write your scripts, but it is the simplest. More to the point, it highlights the fact that a script really is just an ordinary text file.

Okay then. Since scripts are so terribly awesome, let’s write one. To create a script file in RStudio, go to the “File” menu, select the “New File” option, and then click on “R script”. This will open a new window within the “Source” panel. There, you can type the commands you want (or code, as it is generally called when you’re typing the commands into a script file) and save the file when you’re done.

Let’s try using x <- "hello world" and print(x) as our commands. Then save the document as hello.R, for example by typing CTRL+S or by going to the “File” menu and choosing “Save”. When it asks you where to save the file, save it to whatever folder you want, but do remember where you stored it. And just like that, you’ve written your first R program. It really is that simple. That’s all there is to it!

You should be looking at something like Figure 5.14. As you can see (if you’re looking at this book in colour) the character string “hello world” is highlighted in green. The nice thing about using RStudio to do this is that it automatically changes the colour of the text to indicate which parts of the code are comments and which parts are actual R commands. This colouring is called syntax highlighting, and it’s not actually part of the file: it’s just RStudio trying to be helpful. It also added line numbers, to facilitate communication, thank you very much!

Just like with any other file, it is important to save your work. If you made unsaved changes to your script, RStudio will make it clear in the name of your script. On my machine, for example, the name of a script with unsaved changes is shown in red and followed by a *. Once I save the changes, it turns black again and the * disappears.

5.9.3 Running our first script

The simple script that I’ve shown above contains two commands. The first one creates a variable x and the second one prints it on screen. How can we make R execute these commands? In other words, how do we run the script? There are several approaches, really.

I often find myself running a script line by line. To do so, just put your cursor in front of the line you want to run (or any other place in that line, for that matter), and hit CTRL+ENTER, or CMD+ENTER if you are a Mac user. RStudio then transfers the command to the Console, where it is executed. You can also select more than one line, and have these lines executed by hitting CTRL+ENTER (or CMD+ENTER on a Mac).

The second approach is running all commands in the script at once. The first thing to do to make this work is to make sure that the hello.R file has been saved to your working directory, so that R can find it. There are two ways to go about it: either put the file in what is currently your working directory, or keep the file where it is and change your working directory to wherever you have put that file. Once the file is in the working directory (by whatever means), you can run the script using the following command in the Console:

source( "hello.R" )

When you type this command, R opens up the script file: it then reads each command in the file in the same order that they appear in the file, and executes those commands in that order. Alternatively, you can do as follows.

Notice in the top right-hand corner of Figure 5.14 there’s a little button that reads “Source”? If you click on that, RStudio will construct the relevant source() command for you, and send it straight to the R console. So you don’t even have to type in the source() command, which actually I think is a great thing because it really bugs me having to type all those extra keystrokes every time I want to run my script. 25

After we have run the script (by whatever approach), things happened. If we inspect the workspace using a command like ls(), we discover that R has created the new variable x within the workspace, and not surprisingly x is a character string containing the text "hello world".
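For what it's worth, here is a minimal sketch of the "run everything at once" approach from start to finish. To make it run anywhere, it writes hello.R to a temporary folder instead of your working directory; the tempdir() location and the name script_path are assumptions made just for this illustration.

```r
# Write the two commands of our script to a file called hello.R.
# tempdir() is used here only so the example runs anywhere; in practice
# you would save hello.R in your working directory.
script_path <- file.path(tempdir(), "hello.R")
writeLines(c('x <- "hello world"', 'print(x)'), script_path)

# Run all commands in the script, in the order they appear in the file
source(script_path)

# The variable x now exists in the workspace
ls()
x
```

Because the full path to the file is handed to source(), the working directory does not matter in this sketch.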

Figure 5.14: A screenshot showing the hello.R script open in RStudio. Assuming that you’re looking at this document in colour, you’ll notice that the hello world text is shown in green. This isn’t something that you do yourself: that’s RStudio being helpful. Because the text editor in RStudio knows something about how R commands work, it will highlight different parts of your script in different colours. This is useful, but it’s not actually part of the script itself.

Now, replace print(x) by just x and source hello.R again. In the Console, typing x on its own shows the contents of x, as you are used to. When the script is sourced, that does not happen: nothing is shown unless you explicitly ask for it with print(x).
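If you would like to see the difference between print(x) and a bare x side by side, the following sketch might help. It writes two throwaway versions of the script to temporary files (the file names and helper variables are made up for this illustration) and uses capture.output() to record what each version shows when it is sourced.

```r
# Two versions of the script: one with print(x), one with just x
with_print    <- file.path(tempdir(), "hello_print.R")
without_print <- file.path(tempdir(), "hello_bare.R")
writeLines(c('x <- "hello world"', 'print(x)'), with_print)
writeLines(c('x <- "hello world"', 'x'),        without_print)

# capture.output() collects whatever would have been shown on screen
shown_with_print    <- capture.output(source(with_print))
shown_without_print <- capture.output(source(without_print))

shown_with_print      # one line of output
shown_without_print   # nothing at all: character(0)
```

In both cases x is created in the workspace; the only difference is whether anything appears on screen.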

5.9.4 Commenting your script

When writing up your data analysis as a script, one thing that is generally a good idea is to include a lot of comments in the code. That way, if someone else tries to read it (or if you come back to it several days, weeks, months or years later) they can figure out what’s going on. As a beginner, I think it’s especially useful to comment thoroughly, partly because it gets you into the habit of commenting the code, and partly because the simple act of typing in an explanation of what the code does will help you keep it clear in your own mind what you’re trying to achieve.

You can use comments at the beginning of the script, so that the script announces its behaviour. The first few lines of the script could, for example, tell about what the script is actually doing behind the scenes. It’s usually a pretty good idea to do this.

We’ve seen commenting before, so you might or might not remember that everything after a # sign will not be interpreted by R.
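As a small sketch, a commented version of our hello.R script could look something like this. The wording of the comments is of course just one possibility.

```r
# hello.R
# Purpose: store a greeting in a variable and show it on screen.

x <- "hello world"   # create a character variable called x
print(x)             # print the contents of x

# Lines that start with # are skipped entirely, even if they look like code:
# x <- "goodbye world"
```

After running this, x still contains "hello world", because the last line is a comment and is never executed.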

At this point, you’ve learned the basics of scripting. You are now officially allowed to say that you can program in R, though you probably shouldn’t say it too loudly. There’s a lot more to learn, but nevertheless, if you can write scripts like these then what you are doing is, in fact, basic programming.

5.10 Using the RStudio menu

I’ve been showing you how to use the Panels, and how to do the same stuff using R commands. Since R is the gift that keeps on giving, there is often a third way to do the same, using the RStudio menu, i.e., the ribbon at the very top that starts with File. I won’t go into full detail, because knowing two ways should be enough and things are pretty self-explanatory, but I will restrict myself to highlighting a few things. Everything in this section is meant to be helpful, so I strongly recommend reading it with care, but you don’t have to study it. You will need to be able to import data, but how you do that is entirely up to you.

To open an existing file, go to the “File” menu again, and select “Open File…” and browse to where you stored it.

If you look in the “File” menu you will also see “Save” and “Save As…” options. Those options are used for dealing with scripts, and so they’ll produce .R files.

Remember I promised you a third option for importing data? Here it is: it involves using the menu on top of RStudio. For importing data sets, go to the “File” menu and find “Import Dataset”. After this, follow the same steps as you did when importing data using the Environment Panel.

Saving, copying and removing plots can be done using the “Plots” menu on top of RStudio.

Setting the working directory can be done by going to the “Session” menu in the top row of RStudio and clicking “Set Working Directory”. If you select that option, then R really will change the working directory to either the Files Pane location or the Source file location.

There’s more, much more, but you will find it out when you need it.

5.11 Pimping RStudio

As with any software tool, there are many ways in which you can adjust RStudio to your own needs and likes. Most of that can be done by choosing Tools from the RStudio menu and choosing Global Options. There is a lot you can do (like changing the font size under Appearance) and I will let you find out for yourself.

There is one exception, though, because it is really cool (but also just fyi). If you go to Tools/Global Options, click Code, open the Display tab and check “Rainbow parentheses”, nothing less than sheer rainbow joy happens after you click OK. Parentheses (), brackets [], and braces {} will now be colour-matched by nesting level, which makes complex code way easier to read. It is strongly recommended, but it is entirely up to you to decide how much rainbow fun you want in your life.

5.12 Quitting R(Studio)

Figure 5.15: The dialog box that shows up when you try to close RStudio.

There’s one last thing I should cover in this chapter: how to quit R. When I say this, I’m not trying to imply that R is some kind of pathological addiction and that you need to call the R QuitLine or wear patches to control the cravings (although you certainly might argue that there’s something seriously pathological about being addicted to R). I just mean how to exit the program. Assuming you’re running R in the usual way (i.e., through RStudio or the default GUI on a Windows or Mac computer), then you can just shut down the application in the normal way. However, R also has a function, called q() that you can use to quit, which is pretty handy if you’re running R in a terminal window.26

Regardless of what method you use to quit R, when you do so for the first time R will probably ask you if you want to save the “workspace image”. If you’re using RStudio, you’ll see a dialogue box that looks like the one shown in Figure 5.15. If you’re using a text-based interface you’ll see this:

q()
## Save workspace image? [y/n/c]: 

The y/n/c part here is short for “yes / no / cancel”. Type y if you want to save, n if you don’t, and c if you’ve changed your mind and you don’t want to quit after all.

What does this actually mean? What’s going on is that R wants to know if you want to save all those variables that you’ve been creating, so that you can use them later. This sounds like a great idea, so it’s really tempting to type y or click the “Save” button. To be honest though, I very rarely do this, and it kind of annoys me a little bit… what R is really asking is if you want it to store these variables in a “default” data file. The catch (or advantage, if you wish) is that the data file will automatically reload for you next time you open R, which is often something you won’t need. And quite frankly, if I’d wanted to save the variables, then I’d have already saved them before trying to quit. Not only that, I’d have saved them to a location of my choice, so that I can find it again later. So I personally never bother with this, and I see little reason to type y or click the “Save” button.
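If you do want to keep some variables, saving them yourself to a location of your choice might look like the sketch below. It relies on the save() and load() functions; the variable names and file name are made up, and tempdir() merely stands in for a folder you would pick yourself.

```r
# Some variables worth keeping (made-up example values)
sales   <- c(120, 95, 143)
revenue <- sales * 7

# Save them to a file of my choosing, rather than the "default" data file
my_file <- file.path(tempdir(), "my_analysis.RData")
save(sales, revenue, file = my_file)

# Later (for instance in a fresh session): remove them, then load them back
rm(sales, revenue)
load(my_file)
revenue
```

The point is simply that you decide what gets saved and where, instead of leaving it to the quit dialog.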

The next bit is quite a useful thing to know, but you shouldn’t study it. You can change the settings so that R never asks you again whether you want to save stuff. You can do this in RStudio really easily: use the menu system to find the RStudio options; the dialogue box that comes up will give you an option to tell R never to whine about this again (see Figure 5.16). On a Mac, you can open this window by going to the “RStudio” menu and selecting “Preferences”. On a Windows machine, you go to the “Tools” menu and select “Global Options”. Under the “General” tab you’ll see an option that reads “Save workspace to .RData on exit”. By default, this is set to “ask”. If you want R to stop asking, change it to “never”. Every time I install R on a new machine, this is one of the first things I do.

Figure 5.16: The options window in RStudio. On a Mac, you can open this window by going to the RStudio menu and selecting Preferences. On a Windows machine you go to the Tools menu and select Global Options

6 More on variables

You’ve seen vectors all right, but that is just the tip of the Rberg. In this chapter, we encounter matrices, factors, data frames, lists and formulas. But first, we start with …

6.1 Useful things to know about variables

6.1.1 Rules and conventions for naming variables

In the examples that we’ve seen so far, most of my variable names (such as sales and revenue) have just been English-language words written using lowercase letters. However, R allows a lot more flexibility when it comes to naming your variables, as the following list of rules illustrates:

  • Variable names can use the upper case alphabetic characters A-Z as well as the lower case characters a-z. You can also include numeric characters 0-9 in the variable name, as well as the period . or underscore _ character. In other words, you can use SaL.e_s as a variable name (though I can’t think why you would want to), but you can’t use Sales?.
  • Variable names cannot include spaces: therefore my sales is not a valid name, but my.sales is.
  • Variable names are case sensitive: that is, Sales and sales are different variable names.
  • Variable names must start with a letter or a period. You can’t use something like _sales or 1sales as a variable name. You can use .sales as a variable name if you want, but it’s not usually a good idea. By convention, variables starting with a . are used for special purposes, so you should avoid doing so.
  • Variable names cannot be one of the reserved keywords. These are special names that R needs to keep “safe” from us mere users, so you can’t use them as the names of variables. The keywords are: if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, and finally, NA_character_. Don’t feel especially obliged to memorise these: if you make a mistake and try to use one of the keywords as a variable name, R will complain about it like the whiny little automaton it is.
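To make these rules a bit more tangible, here is a small sketch. The assignments at the top all use names that R accepts. The second part uses make.names(), a base R function not otherwise discussed here, as a rough validity check: it turns any piece of text into a valid name, so a candidate that comes back unchanged was already valid. The names candidates and is_valid are just made up for this illustration.

```r
# These assignments all use valid names
SaL.e_s  <- 1
my.sales <- 2
.sales   <- 3      # valid, but best avoided

# Names are case sensitive: these are two different variables
Sales <- 10
sales <- 20

# make.names() turns any text into a valid name; a candidate that comes
# back unchanged was already valid (the reserved keyword if is not)
candidates <- c("my.sales", "my sales", "_sales", "1sales", "Sales?", "if")
is_valid   <- make.names(candidates) == candidates
names(is_valid) <- candidates
is_valid
```

Only my.sales survives the check; the other candidates break one of the rules listed above.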

In addition to those rules that R enforces, there are some informal conventions that people tend to follow when naming variables. One of them you’ve already seen: i.e., don’t use variables that start with a period. But there are several others. You aren’t obliged to follow these conventions, and there are many situations in which it’s advisable to ignore them, but it’s generally a good idea to follow them when you can:

  • Use informative variable names. As a general rule, using meaningful names like sales and revenue is preferred over arbitrary ones like variable1 and variable2. Otherwise, it’s very hard to remember what the contents of different variables are, and it becomes hard to understand what your commands actually do.
  • Use short variable names. Typing is a pain and no-one likes doing it. So we much prefer to use a name like sales over a name like sales.for.this.book.that.you.are.reading. Obviously, there’s a bit of a tension between using informative names (which tend to be long) and using short names (which tend to be meaningless), so use a bit of common sense when trading off these two conventions.
  • Use one of the conventional naming styles for multi-word variable names. Suppose I want to name a variable that stores “my new salary”. Obviously, I can’t include spaces in the variable name, so how should I do this? There are three different conventions that you sometimes see R users employing. Firstly, you can separate the words using periods, which would give you my.new.salary as the variable name. Alternatively, you could separate words using underscores, as in my_new_salary. Finally, you could use capital letters at the beginning of each word (except the first one), which gives you myNewSalary as the variable name. I don’t think there’s any strong reason to prefer one over the other, but it’s always nice to be consistent.

6.1.2 Special values

The first thing I want to mention are some of the “special” values that you might see R produce. Most likely you’ll see them in situations where you were expecting a number, but there are quite a few other ways you can encounter them. These values are Inf, NaN, NA and NULL. These values can crop up in various different places, and so it’s important to understand what they mean.

  • Infinity (Inf). The easiest of the special values to explain is Inf, since it corresponds to a value that is infinitely large. You can also have -Inf. The easiest way to get Inf is to divide a positive number by 0.

Exercise: Do try yourself:

In most real-world data analysis situations, if you’re ending up with infinite numbers in your data, then something has gone awry. Hopefully, you’ll never have to see them.

  • Not a Number (NaN). The special value of NaN is short for “not a number”, and it’s basically a reserved keyword that means “there isn’t a mathematically defined number for this”. If you can remember your high school maths, remember that it is conventional to say that \(0/0\) doesn’t have a proper answer: mathematicians would say that \(0/0\) is undefined.

Exercise: Check if R says that it’s not a number:

Nevertheless, it’s still treated as a “numeric” value. To oversimplify, NaN corresponds to cases where you asked a proper numerical question that genuinely has no meaningful answer.

  • Not available (NA). NA indicates that the value that is “supposed” to be stored here is missing. To understand what this means, it helps to recognise that the NA value is something that you’re most likely to see when analysing data from real-world experiments. Sometimes you get equipment failures, or you lose some of the data, or whatever. The point is that some of the information that you were “expecting” to get from your study is just plain missing. Note the difference between NA and NaN. For NaN, we really do know what’s supposed to be stored; it’s just that it happens to correspond to something like \(0/0\) that doesn’t make any sense at all. In contrast, NA indicates that we actually don’t know what was supposed to be there. The information is missing.

Here’s an example

x <- c(1,4,2)
x
## [1] 1 4 2
x[8] <- 9
x
## [1]  1  4  2 NA NA NA NA  9

R dutifully adds 9 as the 8th element of x. But since we have only told R what the first three elements and the 8th element are, it kindly reminds us that it has no idea what we have in mind for elements 4 to 7.

  • No value (NULL). The NULL value takes this “absence” concept even further. It basically asserts that the variable genuinely has no value whatsoever. This is quite different from both NaN and NA. For NaN we actually know what the value is because it’s something insane like \(0/0\). For NA, we believe that there is supposed to be a value “out there”, but a dog ate our homework and so we don’t quite know what it is. But for NULL we strongly believe that there is no value at all.
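The four special values can be summarised in a few lines of code. This is only a sketch: it uses is.na() and length(), which are ordinary base R functions, to show how the values differ.

```r
1 / 0            # Inf
-1 / 0           # -Inf
0 / 0            # NaN
class(0 / 0)     # "numeric": NaN is still treated as a numeric value

x <- c(1, 4, 2)
x[8] <- 9
is.na(x)         # TRUE for elements 4 to 7, which were never filled in

length(NA)       # 1: a missing value still takes up a spot
length(NULL)     # 0: NULL has no value at all
```

The last two lines capture the difference between NA and NULL fairly well: an NA is a placeholder for something unknown, whereas NULL is nothing at all.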

6.1.3 Treating special values

How does R treat these special values? Let’s see.

6.1.3.1 Handling missing values

The next topic is the issue of missing data. Real data sets very frequently turn out to have missing values: perhaps someone forgot to fill in a particular survey question, for instance. Missing data can be the source of a lot of tricky issues, most of which I’m going to gloss over. However, at a minimum, you need to understand the basics of handling missing data in R.

Let’s focus on the simplest case, in which you’re trying to work with a single variable which has missing data. In R, this means that there will be NA values in your data vector. Let’s create a variable like that:

partial <- c(10, 20, NA, 30)

Let’s assume that you want to calculate the mean of this variable. By default, R assumes that you want to calculate the mean using all four elements of this vector, which is probably the safest thing for a dumb automaton to do, but it’s rarely what you actually want. Why not? Well, remember that the basic interpretation of NA is “I don’t know what this number is”. This means that 1 + NA = NA: if I add 1 to some number that I don’t know (i.e., the NA) then the answer is also a number that I don’t know. As a consequence, if you don’t explicitly tell R to ignore the NA values, and the data set does have missing values, then the output will itself be a missing value.

Exercise: Calculate the mean of the partial vector (without doing anything about the missing value).

mean(partial)

Technically correct, but deeply unhelpful.

To fix this, some functions have an optional argument called na.rm, which is shorthand for “remove NA values”. It is a logical value indicating whether R should ignore (or “remove”) the missing data for the purposes of doing the calculations. By default, R assumes that you want to keep the missing values, so unless you say otherwise it will set na.rm = FALSE and do nothing about the missing data problem. Let’s try setting na.rm = TRUE and see what happens.

Exercise: Calculate the mean of the partial vector, and set na.rm = TRUE.

mean(partial, na.rm=TRUE)

Notice that the mean is 20 (i.e., 60 / 3) and not 15 (i.e., 60 / 4). When R ignores an NA value, it genuinely ignores it. In effect, the calculation above is identical to what you’d get if you asked for the mean of the three-element vector c(10, 20, 30).

Note that this isn’t unique to the mean() function. Pretty much all of the other functions doing statistical stuff have an na.rm argument that indicates whether it should ignore missing values.
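As a quick illustration, here are a few other base R functions that accept na.rm, applied to the same partial vector. The selection of functions is mine; there are many more.

```r
partial <- c(10, 20, NA, 30)

sum(partial)                    # NA
sum(partial, na.rm = TRUE)      # 60
median(partial, na.rm = TRUE)   # 20
max(partial, na.rm = TRUE)      # 30
sd(partial, na.rm = TRUE)       # 10
```

In each case, setting na.rm = TRUE gives the same answer as applying the function to c(10, 20, 30).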

What about operators? As always, don’t wait for me to tell you. Find out by yourself:

partial < 30
## [1]  TRUE  TRUE    NA FALSE
partial == 10
## [1]  TRUE FALSE    NA FALSE
x <- c(3, 7, NA, 4, 7)
y <- c(5, NA, 1, 2, 2)
x + y
## [1]  8 NA NA  6  9

So the basic adage is that once you don’t know something (as evidenced by NA), you won’t suddenly start knowing something. NAs don’t just disappear.

Sometimes, NA can disappear:

q <- c(TRUE, FALSE, TRUE, NA, FALSE, TRUE)
which(q)
## [1] 1 3 6

The reason is that, with which(q), we are literally asking R to tell us which elements of q are equal to TRUE. As indicated by NA, we don’t know whether the fourth element of q is equal to TRUE, so it is only fair the answer does not include 4.
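If what you actually want to know is where the missing values are, you have to ask that question explicitly. A minimal sketch, assuming the base R function is.na(), which returns TRUE for every element that is missing:

```r
q <- c(TRUE, FALSE, TRUE, NA, FALSE, TRUE)

is.na(q)          # TRUE only for the fourth element
which(is.na(q))   # 4
sum(is.na(q))     # 1 missing value in total
```

Here is.na(q) contains no NA values itself, so which() and sum() can do their job without any surprises.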

Exercise: What will the output be of sum(partial <= 20) and of sum(partial <= 20, na.rm = TRUE)?

sum(partial <= 20)
sum(partial <= 20, na.rm = TRUE)

6.2 Variable classes

As we’ve seen, R allows you to store different kinds of data. In particular, the variables we’ve defined so far have either been numeric data, character data (text), or logical data.27 It’s important that we remember what kind of information each variable stores (and even more important that R remembers) since different kinds of variables allow you to do different things to them. For instance, if your variables have numerical information in them, then it’s okay to multiply them together.

Exercise: Assign the value 5 to x, the value 4 to y, and multiply x with y.

x <- 5   # x is numeric
y <- 4   # y is numeric
x * y

But if they contain character data, multiplication makes no sense whatsoever, and R will complain if you try to do it.

Exercise: Assign "apples" to x, "oranges" to y, and multiply x with y.

x <- "apples"   # x is character
y <- "oranges"  # y is character
x * y 

Even R is smart enough to know you can’t multiply "apples" by "oranges". It knows this because the quote marks are indicators that the variable is supposed to be treated as text, not as a number.

This is quite useful, but notice that it means that R makes a big distinction between 5 and "5". Without quote marks, R treats 5 as the number five, and will allow you to do calculations with it. With the quote marks, R treats "5" as the textual character five, and doesn’t recognise it as a number any more than it recognises "p" or "five" as numbers. As a consequence, there’s a big difference between typing x <- 5 and typing x <- "5". In the former, we’re storing the number 5; in the latter, we’re storing the character "5". Thus, if we try to do multiplication with the character versions, R gets stroppy:

x <- "5"   # x is character
y <- "4"   # y is character
x * y     
## Error in x * y: non-numeric argument to binary operator

Okay, let’s suppose that I’ve forgotten what kind of data I stored in the variable x (which happens depressingly often). R provides a function that will let us find out. Or, more precisely, it provides three functions: class(), mode() and typeof(). Why the heck does it provide three functions, you might be wondering? Basically, because R actually keeps track of three different kinds of information about a variable. In this class, we will only use class(), though.

The class of a variable is a “high level” classification, and it captures psychologically (or statistically) meaningful distinctions. For instance "2011-09-12" and "my birthday" are both text strings, but there’s an important difference between the two: one of them is a date. So it would be nice if we could get R to recognise that "2011-09-12" is a date, and allow us to do things like add or subtract from it. The class of a variable is what R uses to keep track of things like that. Because the class of a variable is critical for determining what R can or can’t do with it, the class() function is very handy.
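To give a flavour of what this looks like in practice, here is a small sketch using as.Date(), a base R function that is not discussed further in this chapter. The variable names are made up for the example.

```r
birthday_text <- "2011-09-12"
class(birthday_text)        # "character"

birthday <- as.Date(birthday_text)
class(birthday)             # "Date"

birthday + 1                # the next day: "2011-09-13"
as.Date("2011-10-12") - birthday   # Time difference of 30 days
```

The text and the date look identical when typed, but only the variable with class Date allows this kind of arithmetic.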

Exercise: Find the class of the following examples using the class() function.

x <- "hello world"     
y <- TRUE    
z <- 100     
class(x)

Exciting, no?

You might have expected that R would have returned vector in all the previous exercises. It did not, despite x, y and z being, well, vectors.
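If you want R to confirm that these variables really are vectors, you have to ask a different question. The sketch below assumes the base R function is.vector(); class() reports what kind of data is stored, not how it is packaged.

```r
x <- "hello world"
y <- TRUE
z <- 100

is.vector(x)   # TRUE
is.vector(y)   # TRUE
is.vector(z)   # TRUE
length(z)      # 1: a single number is a vector with one element
```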

Later on, I’ll talk a bit about how you can convince R to “coerce” a variable to change from one class to another (Section 6.9.3). That’s a useful skill for real-world data analysis, but it’s not something that we need right now.

6.3 Matrices

6.3.1 Introducing matrices

6.3.2 Creating a matrix using rbind() and cbind()

A not-uncommon task that you might find yourself needing to undertake is to combine several vectors. A matrix is one way of doing it (we will discuss data frames as another in Section 6.5). A matrix is basically a big rectangular table of data.

Let’s suppose we have the following two numeric vectors:

cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)

The numbers here might represent the amount of each of the two cakes that are left at five different time points. Apparently, the first cake is tastier, since that one gets devoured faster.

Let’s start by using the rbind() (“row bind”) function to create a small matrix:

Mr <- rbind( cake.1, cake.2 )  # row bind them into a matrix 
Mr                           # and print it out...
##        [,1] [,2] [,3] [,4] [,5]
## cake.1  100   80    0    0    0
## cake.2  100  100   90   30   10

It quite literally binds stuff together, forming a matrix.

Exercise: The variable Mr is a matrix, which we can confirm by using the class() function.

R is being pedantic right here, and tells me Mr is both a matrix and an array. Well, R, given that a matrix is a special kind of array (not that you should care, really), you are right. Thanks. Note that, although all elements of Mr are numeric, R does not tell us that when asked about its class, unlike what it would have done if Mr had been a vector.

There is another function, the cbind() function (“column bind”) which produces a very similar looking output.

Mc <- cbind( cake.1, cake.2 )  # column bind them into a matrix 
Mc
##      cake.1 cake.2
## [1,]    100    100
## [2,]     80    100
## [3,]      0     90
## [4,]      0     30
## [5,]      0     10

The rbind() function (“row bind”) produces a somewhat different output than the cbind() function: it binds the vectors together row-wise rather than column-wise.
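One way to see how the two results relate is to compare their shapes. The sketch below uses dim() and t(), two base R functions that have not been introduced in this section: dim() reports the number of rows and columns, and t() swaps rows and columns.

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)
Mr <- rbind(cake.1, cake.2)
Mc <- cbind(cake.1, cake.2)

dim(Mr)             # 2 5: two rows, five columns
dim(Mc)             # 5 2: five rows, two columns
all(t(Mr) == Mc)    # TRUE: flipping rows and columns of Mr gives Mc
```

So the two matrices contain the same numbers; they are just laid out differently.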

6.3.3 Creating a matrix using matrix()

We know from above that the rbind() and cbind() functions will convert the vectors into a matrix. There’s yet another way, using a function called (R often isn’t the eccentric kind) matrix(). Let’s see what it does. When creating a matrix using matrix(), there are three things to specify: which numbers should be put in the matrix, what the matrix should look like, and how it should be filled. To specify what it looks like, you should, in principle, specify the number of rows AND the number of columns. However, R is smart, and needs only one of those: if you give R 10 elements to put in the matrix, and you only tell it that the matrix should have 2 rows without telling it the number of columns, it is smart enough to figure out that the matrix should have 5 columns. So you only need to specify either the number of rows or the number of columns.

So let’s put the cake data in a matrix with two rows. There are two ways we could do that:

M <- matrix(c(cake.1, cake.2), nrow = 2, byrow = TRUE)
M
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  100   80    0    0    0
## [2,]  100  100   90   30   10
Mx <- matrix(c(cake.1, cake.2), nrow = 2, byrow = FALSE)
Mx
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  100    0    0  100   30
## [2,]   80    0  100   90   10

As you can see, the byrow argument controls how the (in this case 2x5) matrix should be filled with the values from c(cake.1, cake.2): either by filling up the rows first (byrow=TRUE), or by filling up the columns first (byrow=FALSE). In this case, we clearly want the byrow=TRUE version.

What if we want two columns?

cake.mat <- matrix(c(cake.1, cake.2), ncol = 2, byrow = FALSE)
cake.mat
##      [,1] [,2]
## [1,]  100  100
## [2,]   80  100
## [3,]    0   90
## [4,]    0   30
## [5,]    0   10
cake.matx <- matrix(c(cake.1, cake.2), ncol = 2, byrow = TRUE)
cake.matx
##      [,1] [,2]
## [1,]  100   80
## [2,]    0    0
## [3,]    0  100
## [4,]  100   90
## [5,]   30   10

Now, we want the byrow=FALSE version.

R can be annoying sometimes, so to restore karma, you can annoy R once in a while. This is such a moment. Let’s ask R to put the 10 values about our cakes in a (insert diabolical laughter) matrix with 3 columns! You might even understand how it escapes our trap: it creates a matrix with 4 rows, and hence 12 empty spots, 10 of which it can easily fill with the cake values we provide. For the remaining 2 spots, it just starts over, and takes the first 2 values of the set of 10 we provided. Yep, that’s the recycling rule (see Section XXX).

cake.matHAHAHA <- matrix(c(cake.1, cake.2), ncol = 3, byrow = TRUE)
## Warning in matrix(c(cake.1, cake.2), ncol = 3, byrow = TRUE): data length [10]
## is not a sub-multiple or multiple of the number of rows [4]
cake.matHAHAHA
##      [,1] [,2] [,3]
## [1,]  100   80    0
## [2,]    0    0  100
## [3,]  100   90   30
## [4,]   10  100   80

The sneaky little munchkin does this (even being fair enough to provide a warning)!
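
If you want to convince yourself that this really is recycling, here is a small sketch. It uses two functions I have not introduced, so treat them as assumptions on my part: suppressWarnings() hides the warning (the recycling still happens), and rep_len() repeats a vector until it has the requested length.

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)
cakes  <- c(cake.1, cake.2)

# Same trap as before, with the warning hidden
m <- suppressWarnings(matrix(cakes, ncol = 3, byrow = TRUE))

m[4, 2:3]     # the two leftover spots ...
cakes[1:2]    # ... hold the first two cake values again

# Doing the recycling by hand gives the same matrix, without any warning
by.hand <- matrix(rep_len(cakes, 12), ncol = 3, byrow = TRUE)
all(m == by.hand)
```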

6.3.4 Working with matrices

6.3.4.1 Getting information out of a matrix

You can use square brackets to extract a subset of a matrix, specifying a row index and then a column index. For instance, M[2,3] pulls out the entry in the 2nd row and 3rd column of the matrix (i.e., 90). By convention, the row number comes first.

Exercise: Do try!

We will talk more about this in Section 7.3.
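
As a small preview, and only as a sketch: besides pulling out a single entry, you can leave one of the two indices empty to get a whole row or a whole column.

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)
M <- matrix(c(cake.1, cake.2), nrow = 2, byrow = TRUE)

M[2, 3]   # one entry: 2nd row, 3rd column
M[2, ]    # empty column index: the whole 2nd row
M[, 3]    # empty row index: the whole 3rd column
```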

6.3.4.2 Altering the elements of a matrix

What if you want to change a value stored in a matrix? Easy enough. One possibility would be to assign the whole matrix again from the beginning, but that is a lot of typing. Also, it’s a little wasteful: why should R have to redefine everything, when you only need to change a single value? Fortunately, we can tell R to change a specific element only.

M #before
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  100   80    0    0    0
## [2,]  100  100   90   30   10
M[1,2] <- 50
M #after
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  100   50    0    0    0
## [2,]  100  100   90   30   10

This doesn’t work

M[3,2] <- 50
## Error in `[<-`(`*tmp*`, 3, 2, value = 50): subscript out of bounds

because that element does not exist in M.

Neither does this

M[1,2] <- c(10,50)
## Error in M[1, 2] <- c(10, 50): number of items to replace is not a multiple of replacement length

This time the element does exist in M, but R cannot replace a single element with two values.

6.3.4.3 A matrix as one big variable, really

At a fundamental level, a matrix really is just one variable: it just happens that this one variable is formatted into rows and columns. If you want a matrix of numeric data, every single element in the matrix must be a number. If you want a matrix of character strings, every single element in the matrix must be a character string. If you try to mix data of different types together, then R will either spit out an error or quietly coerce all of the underlying data to a single common type.

Exercise: Let’s find out what class R secretly thinks the data within the matrix M is, by using the class() function and indexing a single element.

class( M[1,2] )

You can’t type class(M), because all that will happen is R will tell you that M is a matrix: we’re not interested in the class of the matrix itself, we want to know what class the underlying data is assumed to be.

Anyway, to give you a sense of how R enforces this, let’s try to change one of the elements of our numeric matrix into a character string:

M[1,1] <- "text"
M
##      [,1]   [,2]  [,3] [,4] [,5]
## [1,] "text" "50"  "0"  "0"  "0" 
## [2,] "100"  "100" "90" "30" "10"

It looks as if R has coerced all of the data in our matrix into character strings. And in fact, if we now type in class(M[1,2]), an element that we originally entered as a number, we see that this is exactly what has happened.

class(M[1,2])
## [1] "character"

If you alter the contents of one element in a matrix, R will change the underlying data type as necessary.

I personally don’t have any insight into what R will do when I now turn element M[1,1] into a number again. I simply don’t know enough about the inner workings of R to make a reasonable guess about how R will go about it. One thing I could do is to look it up in the help file or on the internet, or ask somebody who might know. But me not knowing what R will do is no good reason not to try it out. In fact, it is a very good reason to try it and see what happens:

M[1,1] <- 3
M
##      [,1]  [,2]  [,3] [,4] [,5]
## [1,] "3"   "50"  "0"  "0"  "0" 
## [2,] "100" "100" "90" "30" "10"
class(M[1,1])
## [1] "character"

As it turns out, once we go character, R never goes back. Even though we defined M[1,1] as a numerical value in the line M[1,1] <- 3, once we force it to become part of the matrix environment consisting of nothing but characters, its own numerical identity is overridden and it is forced to become part of the majority culture (being characters in this case).
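
R never goes back by itself, but you can drag it back. The sketch below is one possible way, not the only one, and it assumes the as.numeric() function, which turns character strings that look like numbers into actual numbers. Because as.numeric() returns a plain vector, I wrap the result in matrix() again to restore the rows and columns.

```r
# The character matrix we ended up with
M <- matrix(c("3",   "50",  "0",  "0",  "0",
              "100", "100", "90", "30", "10"),
            nrow = 2, byrow = TRUE)
class(M[1, 1])

# Convert the values and rebuild a matrix with the same number of rows
M.num <- matrix(as.numeric(M), nrow = nrow(M))
M.num
class(M.num[1, 1])
```

This only works nicely when every element looks like a number; a string such as "text" would be turned into NA, with a warning.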

6.3.5 Doing calculations using matrices

Let’s first define M again, like in the old days.

cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)
M <- matrix(c(cake.1, cake.2), nrow = 2, byrow = TRUE)
M
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  100   80    0    0    0
## [2,]  100  100   90   30   10

6.3.5.1 sum et al. 

You know about sum(), right? How would that work on a matrix? Find out!

sum(M)
## [1] 510

Quite unsurprisingly, it just summed all values in M.

But what if I wanted to sum row-by-row? That is: on the first row, sum all 5 columns; and then, on the second row, sum all 5 columns. R has your back!

rowSums(M)
## [1] 180 330

What about column-by-column? That is, summing both elements of the first column, summing both elements of the second column, etc. Again, easy:

colSums(M)
## [1] 200 180  90  30  10

There’s more to life than sums. How does mean() work for a matrix? Let’s find out!

mean(M)
## [1] 51

Quite unsurprisingly, it took the mean of all values in M.

But what if I wanted to take the mean row-by-row? That is: on the first row, take the mean of all 5 columns; and then, on the second row, take the mean of all 5 columns. R has your back!

rowMeans(M)
## [1] 36 66

What about column-by-column? That is, taking the mean of both elements of the first column, taking the mean of both elements of the second column, etc. Again, easy:

colMeans(M)
## [1] 100  90  45  15   5

Many of the other functions we have seen before work on matrices as well. For example:

max(M)
## [1] 100

finds the biggest element of M

And, if you have been paying any attention, you surely predict we also have

rowMaxs(M)
## Error in rowMaxs(M): could not find function "rowMaxs"
rowMins(M)
## Error in rowMins(M): could not find function "rowMins"
colMaxs(M)
## Error in colMaxs(M): could not find function "colMaxs"
colMins(M)
## Error in colMins(M): could not find function "colMins"

Haha. Just kidding. If you want those functions, you will need to install and load a package, in this case matrixStats:

library(matrixStats)
## Warning: package 'matrixStats' was built under R version 4.2.3
rowMaxs(M)
## [1] 100 100
rowMins(M)
## [1]  0 10
colMaxs(M)
## [1] 100 100  90  30  10
colMins(M)
## [1] 100  80   0   0   0

6.3.5.2 apply()

rowSums() and colSums() are very convenient, but they are good for only one job. You can achieve the same using a function that is a bit more difficult, but has much broader applicability. Ladies and gentlemen, and everybody in between, please put your hands together for the apply() function. This is how it works:

The apply() function applies something on a matrix. It should not come as a surprise that, for it to work, you should feed it a matrix and an instruction of what to do. So say we want to take the sum() of the elements in M. You would think apply(M, sum) would do the job. But per above, you know that there are two ways R can go about it: either do a sum row-by-row (leading to 2 values, since we have 2 rows), or do a sum column-by-column (leading to 5 values, since we have 5 columns). So we need to give R some more information beyond the ambiguous instruction to take a sum. Telling R whether it should work column-wise or row-wise is governed by the MARGIN argument. When it is 1, R works row-wise, when it is 2, it works column-wise.

apply(M, MARGIN = 1, sum)
## [1] 180 330
apply(M, MARGIN = 2, sum)
## [1] 200 180  90  30  10

To be complete, the function we ask R to apply can also be passed explicitly by name, using the FUN argument of the apply() function. Please take some time to think of a reasonable pun about FUN and learning R.

apply(M, MARGIN = 1, FUN = sum)
## [1] 180 330
apply(M, MARGIN = 2, FUN = sum)
## [1] 200 180  90  30  10
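
Here is where the broader applicability shows. As a sketch: the row-wise and column-wise maxima and minima that needed the matrixStats package earlier can also be obtained with apply() alone, by handing it max or min instead of sum.

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)
M <- matrix(c(cake.1, cake.2), nrow = 2, byrow = TRUE)

row.max <- apply(M, MARGIN = 1, FUN = max)   # biggest value in each row
row.min <- apply(M, MARGIN = 1, FUN = min)   # smallest value in each row
col.max <- apply(M, MARGIN = 2, FUN = max)   # biggest value in each column
col.min <- apply(M, MARGIN = 2, FUN = min)   # smallest value in each column

row.max
row.min
col.max
col.min
```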

Remember that sometimes you will want to give an extra argument when you run a function. For example, when using the round() function, you might wish to use the digits argument. Where should that info go if you call a function using apply? Easy: any arguments you’d like to use in the function you specify in FUN can just be included after the function.

For example, compare these two commands:

apply(M/1000, MARGIN = 2, FUN = round)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
apply(M/1000, MARGIN = 2, FUN = round, digits = 1)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  0.1  0.1  0.0    0    0
## [2,]  0.1  0.1  0.1    0    0

The first one uses the default value for the digits argument of the round() function, the second one follows our wishes and uses 1 for the digits argument.
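
The same trick works for any argument of any function. A sketch with a made-up matrix M.na, in which one cake value went missing: the na.rm argument of sum() (an assumption here, since I have not discussed it) tells sum() to skip missing values, and we pass it along through apply().

```r
# Hypothetical variant of the cake matrix with one missing value
M.na <- matrix(c(100,  80, NA,  0,  0,
                 100, 100, 90, 30, 10),
               nrow = 2, byrow = TRUE)

plain   <- apply(M.na, MARGIN = 1, FUN = sum)                 # NA spoils the first row
skipped <- apply(M.na, MARGIN = 1, FUN = sum, na.rm = TRUE)   # NA is left out

plain
skipped
```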

I get it. You are young, and you like to break things. Then R is just for you! Here is an easy way to break R:

sum <- "UMOEDER"
apply(M, MARGIN = 2, FUN = sum)
## Error in get(as.character(FUN), mode = "function", envir = envir): object 'UMOEDER' of mode 'function' was not found

Now, that poor R thing is trying to apply sum() to M, but due to your juvenile behavior, it now thinks that sum is a character string, no longer the trusted function we have learned to love. Luckily, the good people who build R have anticipated some of this behavior, and installed a protection program against it: when you call sum(...) directly, R skips over anything named sum that is not a function. So this still works, despite your best attempts to break things:

x <- c(6,7) #still childish, I know
sum(x) 
## [1] 13

Admittedly, this is not on the same level as flooding a school, but still, let’s undo this unholy nonsense:

rm(sum)
apply(M, MARGIN = 2, FUN = sum) #works again!
## [1] 200 180  90  30  10
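
For the curious, a sketch of a second way out that does not require removing anything. It relies on the package::function notation, which is my addition and not something this chapter covers: base::sum explicitly asks for the sum function that ships with R, whatever you have stored under the name sum yourself.

```r
M <- matrix(c(100, 80, 0, 0, 0, 100, 100, 90, 30, 10), nrow = 2, byrow = TRUE)

sum <- "UMOEDER"                  # the name sum now also refers to a string

direct <- sum(c(6, 7))            # a direct call still finds the real function
direct

# Spelling out where the function lives hands apply() the real thing
via.apply <- apply(M, MARGIN = 2, FUN = base::sum)
via.apply

rm(sum)                           # clean up the mess anyway
```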

6.4 Factors

Now, it’s time to start introducing some of the data types that are somewhat more specific to statistics. When we assign numbers to possible outcomes, these numbers can mean quite different things depending on what kind of variable we are attempting to measure. In particular, we commonly make the distinction between nominal, ordinal, interval and ratio scale data. How do we capture this distinction in R? Currently, we only seem to have a single numeric data type. That’s probably not going to be enough, is it?

A little thought suggests that the numeric variable class in R is perfectly suited for capturing ratio scale data. For instance, if I were to measure response time (RT) for five different events, I could store the data in R like this:

RT <- c(342, 401, 590, 391, 554)

where the data here are measured in milliseconds, as is conventional in the psychological literature. It’s perfectly sensible to talk about “twice the response time”, \(2 \times \mbox{RT}\), or the “response time plus 1 second”, \(\mbox{RT} + 1000\), and so both of the following are perfectly reasonable things for R to do:

2 * RT
## [1]  684  802 1180  782 1108
RT + 1000
## [1] 1342 1401 1590 1391 1554

And to a lesser extent, the “numeric” class is okay for interval scale data.

However. When it comes to nominal scale data, it becomes completely unacceptable, because almost all of the “usual” rules for what you’re allowed to do with numbers don’t apply to nominal scale data. If your data set about soccer contains three forwards and one winger, what’s the mean position? Indeed. It is for this reason that R has factors.

6.4.1 Introducing factors

Suppose I was doing a study in which people could belong to one of three different treatment conditions. Each group of people was asked to complete the same task, but each group received different instructions. Not surprisingly, I might want to have a variable that keeps track of what group people were in. So I could type in something like this

group <- c(1,1,1,2,2,2,3,3,3)

so that group[i] contains the group membership of the i-th person in my study. Clearly, this is numeric data, but obviously, this is a nominal scale variable. There’s no sense in which “group 1” plus “group 2” equals “group 3”, but nevertheless if I try to do that, R won’t stop me because it doesn’t know any better.

Exercise: Add the value 2 to group.

group + 2

Apparently, R seems to think that it’s allowed to invent “group 4” and “group 5”, even though they didn’t actually exist. Unfortunately, R is too stupid to know any better: it thinks that 3 is an ordinary number in this context, so it sees no problem in calculating 3 + 2. But since we’re not that stupid, we’d like to stop R from doing this. We can do so by instructing R to treat group as a factor.

6.4.2 Creating a factor

Creating a factor is easy. You can do so using the factor() function.

group.f <- factor(group)
group.f
## [1] 1 1 1 2 2 2 3 3 3
## Levels: 1 2 3

It looks more or less the same as before (though it’s not immediately obvious what all that Levels rubbish is about), but if we ask R to tell us what the class of the group.f variable is now, it’s clear that it has done what we asked.

Exercise: Use the class() function to give us the class of the group.f variable.

class(group.f)

Neat.

6.4.3 Working with factors

6.4.3.1 Getting information out of a factor

Easy. Just use the [] as with a normal vector.

group.f[2] 
## [1] 1
## Levels: 1 2 3

gives the second element and, unlike with a normal vector, tells us the levels of the factor.

6.4.3.2 Altering the elements of a factor

What if I made a mistake and I want the 7th element to be a 1:

group.f[7] <- 1 
group.f
## [1] 1 1 1 2 2 2 1 3 3
## Levels: 1 2 3

Easy! Changing it to a 4 should be easy too, of course

group.f[7] <- 4 
## Warning in `[<-.factor`(`*tmp*`, 7, value = 4): invalid factor level, NA
## generated
group.f
## [1] 1    1    1    2    2    2    <NA> 3    3   
## Levels: 1 2 3

No, it isn’t. There is no level 4, so R just assumes (probably correctly) that you are talking nonsense, and stores a missing value (NA) instead.
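
If a fourth group really does exist, you first have to tell R about it. A minimal sketch, assuming the levels() function (which gets a proper introduction in Section 6.4.5): we append "4" to the existing levels, after which the assignment goes through without a warning.

```r
group.f <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3))

# Add "4" to the list of allowed levels
levels(group.f) <- c(levels(group.f), "4")

group.f[7] <- "4"   # now a perfectly valid value
group.f
```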

6.4.4 Doing calculations using a factor

Now that we’ve converted group to a factor, look what happens when you try to add 2 to group.f

Exercise: Try it.

group.f + 2

This time even R is smart enough to know that I’m being an idiot, so it tells me off and then produces a vector of missing values (i.e., NA: see Section 6.1.2), together with a strongly worded warning. So not much to see here!

6.4.5 Labelling the factor levels

I have a confession to make. My memory is not infinite in capacity; and it seems to be getting worse as I get older. So it kind of annoys me when I get data sets where there’s a nominal scale variable called gender, with three levels corresponding to males, females and other. But when I go to print out the variable I get something like this:

gender
## [1] 1 1 1 3 1 2 2 2 2
## Levels: 1 2 3

Okaaaay. That’s not helpful at all, and it makes me very sad. Which number corresponds to the males, to the females and which one corresponds to the other category? Wouldn’t it be nice if R could actually keep track of this? It’s way too hard to remember which number corresponds to which gender.

And besides, the problem that this causes is much more serious than a single sad nerd… because R has no way of knowing that the 1s in the group.f variable are a very different kind of thing to the 1s in the gender variable. So if I try to ask which elements of the group.f variable are equal to the corresponding elements in gender, R thinks this is totally kosher and gives me this:

group.f
## [1] 1    1    1    2    2    2    <NA> 3    3   
## Levels: 1 2 3
gender
## [1] 1 1 1 3 1 2 2 2 2
## Levels: 1 2 3
group.f == gender
## [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE    NA FALSE FALSE

Well, that’s … especially stupid.28 The problem here is that R is very literal-minded. Even though you’ve declared both group.f and gender to be factors, it still assumes that a 1 is a 1 no matter which variable it appears in.

To fix both of these problems (my memory problem, and R’s infuriating literal interpretations), what we need to do is assign meaningful labels to the different levels of each factor. We can do that like this:

levels(group.f) <- c("group 1", "group 2", "group 3")
group.f 
## [1] group 1 group 1 group 1 group 2 group 2 group 2 <NA>    group 3 group 3
## Levels: group 1 group 2 group 3
levels(gender) <- c("female", "male", "other")
gender
## [1] female female female other  female male   male   male   male  
## Levels: female male other

Note how the original 1s, 2s and 3s have been replaced by whatever was in the levels.

That’s much easier on the eye, and better yet, R is smart enough to know that "female" is not equal to "group 1", so now when I try to ask which group memberships are “equal to” the gender of the corresponding person,

group.f == gender
## Error in Ops.factor(group.f, gender): level sets of factors are different

R correctly tells me that I’m an idiot.

Of course, it is your responsibility to assign the correct meaning to your data, by listing the labels in the correct order. If a 1 in your gender variable means “male”, and a 2 means “other”, you should use

levels(gender) <- c("male", "other", "female")
gender
## [1] male   male   male   female male   other  other  other  other 
## Levels: male other female

Quite conveniently, you can already define the levels when you create the factor, using the levels argument. The following attempt doesn’t really work, though, because of the mismatch between the elements (numbers) and the levels (words):

gender<-factor(c(1, 1 ,1, 3 ,1, 2, 2, 2, 2), levels = c("male", "other", "female"))
gender
## [1] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## Levels: male other female

but this works:

gender<-factor(c("male", "male" ,"male", "other" ,"male", "female", "female", "female", "female"), levels = c("male", "other", "female"))
gender
## [1] male   male   male   other  male   female female female female
## Levels: male other female

and so does this

gender <- factor(c(1, 1 ,1, 3 ,1, 2, 2, 2, 2), levels = c(1,2,3), labels = c("male", "other", "female"))

and this partially works:

gender<-factor(c("male", "male" ,"male", "other" ,"male", "female", "female", "female", "female"), levels = c("male", "X", "female"))
gender
## [1] male   male   male   <NA>   male   female female female female
## Levels: male X female
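
Since I did not print the result of the version that combines levels and labels, here is a sketch that does. The name codes is just a helper I made up for the numeric data: levels says which values occur in the data, and labels says, in the same order, what each of them should be called.

```r
codes <- c(1, 1, 1, 3, 1, 2, 2, 2, 2)   # hypothetical raw data, coded as numbers

gender <- factor(codes,
                 levels = c(1, 2, 3),                    # values in the data
                 labels = c("male", "other", "female"))  # their names, same order
gender
levels(gender)
```

Every 1 is shown as male, every 2 as other and every 3 as female, and no NA appears because each value in the data has a matching level.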

6.4.6 Moving on…

Factors are very useful things: they’re the main way to represent a nominal scale variable. And there are lots of nominal scale variables out there. I’ll talk more about factors in Section ??, but for now, you know enough to be able to get started.

6.5 Data frames

6.5.1 Introducing data frames

In order to understand why R has created this funny thing called a data frame, it helps to try to see what problem it solves. So let’s go back to the little scenario that I used when introducing factors in Section 6.4. In that section, I recorded the group.f and gender for all 9 participants in my study. Let’s also suppose I recorded their ages and their score on “My Terribly Exciting Psychological Test”:

age <- c(17, 19, 21, 37, 18, 19, 47, 18, 19)
score <- c(12, 10, 11, 15, 16, 14, 25, 21, 29)
#and just as a reminder, we have
group.f <- factor(c("group 1","group 1","group 1","group 2","group 2","group 2","group 3","group 3","group 3"), levels = c("group 1", "group 2", "group 3"))
group.f
## [1] group 1 group 1 group 1 group 2 group 2 group 2 group 3 group 3 group 3
## Levels: group 1 group 2 group 3
gender<-factor(c("male", "male" ,"male", "other" ,"male", "female", "female", "female", "female"), levels = c("male", "other", "female"))
gender
## [1] male   male   male   other  male   female female female female
## Levels: male other female

So there are four variables in the workspace, age, group.f, gender, and score. And it just so happens that all four of them are the same size (i.e., they’re all vectors with 9 elements). Aaaand it just so happens that age[1] corresponds to the age of the first person, and gender[1] is the gender of that very same person, etc. In other words, you and I both know that all four of these variables correspond to the same data set, and all four of them are organised in exactly the same way.

However, R doesn’t know this! As far as it’s concerned, there’s no reason why the age variable has to be the same length as the gender variable; and there’s no particular reason to think that age[1] has any special relationship to gender[1] any more than it has a special relationship to gender[4]. In other words, when we store everything in separate variables like this, R doesn’t know anything about the relationships between things. It doesn’t even really know that these variables actually refer to a proper data set. The data frame fixes this: if we store our variables inside a data frame, we’re telling R to treat these variables as a single, fairly coherent data set.

6.5.2 Creating a data frame

To see how they do this, let’s create one. So how do we create a data frame? One way we’ve already seen: if we import our data from a CSV file, R will store it as a data frame. Sweet!

A second way is to create it directly from some existing variables using the data.frame() function. All you have to do is type a list of variables that you want to include in the data frame. The output of a data.frame() command is, well, a data frame, just like the matrix() command is used to make a matrix. So, if I want to have different variables in a data frame, I can do so like this:

dataFrame <- data.frame ( variable1, variable2, variable3, variable4 ) 

Exercise: Store all four variables from the experiment (age, gender, group.f, score) in a data frame called expt. You can look at what you created by typing print(expt) on the next line.

expt <- data.frame ( age, gender, group.f, score ) 
print(expt)

Here is a brief note I wish I didn’t have to write. You might be wondering why I asked you to run print(expt) instead of just expt. When working in R(Studio), just typing expt would have been more than enough to have R print the data frame out for you. But for some reason I don’t (care to) comprehend, in the learnr environment this document is made in (so that you can do these exercises in the browser), just typing expt works, in the sense that R will print expt for you, but it looks ugly. There’s enough ugly in this world, and maybe I cannot provide a lot of beauty, but at least let me not generate more ugliness. To make things look nice, I have used and will use the command print(expt) if I want to inspect expt, and you probably should too! This is only needed for data frames. Vectors and matrices print beautifully without print(): just type the name of the variable.

Note that expt is a completely self-contained variable. Once you’ve created it, it no longer depends on the original variables from which it was constructed. Because this is such an important point, I want you to see it for yourself: make a change to age, and check the expt variable.

age[5] <- 19 #for example. make any changes to the age variable you like 
print(expt)

You will see that if we make changes to the original age variable, it will not lead to any changes to the age data stored in expt. This is a common (and stupid) mistake I would hate you to make.
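
In case you skipped the exercise (I saw that), here is the same point as a tiny sketch with a made-up data frame called small, holding only three people.

```r
age   <- c(17, 19, 21)          # hypothetical mini data set
score <- c(12, 10, 11)
small <- data.frame(age, score)

age[1] <- 99                    # change the loose variable only

age                             # the loose vector has changed ...
print(small)                    # ... the copy inside the data frame has not
```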

Say you want to add a new variable to the data frame, storing the number of hours slept:

slept <- c(6, 7, 8, 7, 6, 5, 4, 3, 10)

You could of course just use the data.frame() command again, data.frame ( age, gender, group.f, score, slept)

The easiest way to do so, however, is to use $, as the following example illustrates. If I type a command like this

hoursslept <- c(6, 7, 8, 7, 6, 5, 4, 3, 10)
expt$hrslept <- hoursslept
print(expt)
##   age gender group.f score hrslept
## 1  17   male group 1    12       6
## 2  19   male group 1    10       7
## 3  21   male group 1    11       8
## 4  37  other group 2    15       7
## 5  18   male group 2    16       6
## 6  19 female group 2    14       5
## 7  47 female group 3    25       4
## 8  18 female group 3    21       3
## 9  19 female group 3    29      10

then R adds a new variable called hrslept at the end of the data frame, and assigns it the numerical values. Note that the name we give the variable on its own (hoursslept) need not be identical to what we call it inside the data frame (hrslept), but it can be, if you want. R is happy either way.

Of course, you can do this in a single step.

expt$hrslept <- c(6, 7, 8, 7, 6, 5, 4, 3, 20)
print(expt)
##   age gender group.f score hrslept
## 1  17   male group 1    12       6
## 2  19   male group 1    10       7
## 3  21   male group 1    11       8
## 4  37  other group 2    15       7
## 5  18   male group 2    16       6
## 6  19 female group 2    14       5
## 7  47 female group 3    25       4
## 8  18 female group 3    21       3
## 9  19 female group 3    29      20

Note how I changed the last element to a record-breaking 20, to highlight that by the new assignment of hrslept, the previous values are overwritten in expt.

Alternatively, you could go like this

expt[, "zzz"] <- hoursslept
print(expt)
##   age gender group.f score hrslept zzz
## 1  17   male group 1    12       6   6
## 2  19   male group 1    10       7   7
## 3  21   male group 1    11       8   8
## 4  37  other group 2    15       7   7
## 5  18   male group 2    16       6   6
## 6  19 female group 2    14       5   5
## 7  47 female group 3    25       4   4
## 8  18 female group 3    21       3   3
## 9  19 female group 3    29      20  10

Note the use of "", to indicate that "zzz" is a string (being the name of the column).

A final thing to note is that, when defining the data frame, I have unlimited freedom of choosing the names of the variables:

expt2 <- data.frame ( wisdom = age, hmmm = gender, grrrr = group.f, booyah = score ) 
print(expt2)
##   wisdom   hmmm   grrrr booyah
## 1     17   male group 1     12
## 2     19   male group 1     10
## 3     21   male group 1     11
## 4     37  other group 2     15
## 5     18   male group 2     16
## 6     19 female group 2     14
## 7     47 female group 3     25
## 8     18 female group 3     21
## 9     19 female group 3     29

or, if you haven’t defined your variables yet, you can do so while defining the data frame

expt3 <- data.frame ( wisdom = c(1,2,3), hmmm = c("M","M","X") ) 
print(expt3)
##   wisdom hmmm
## 1      1    M
## 2      2    M
## 3      3    X

6.5.3 Working with data frames

6.5.3.1 Accessing the content of the data frame using $

At this point, we have all we need to know in the one variable, a data frame called expt. But as we can see when we told R to print the variable out, this data frame contains 6 variables, each of which has 9 observations. So how do we get this information out again? After all, there’s no point in storing information if you don’t use it, and there’s no way to use information if you can’t access it. So let’s talk a bit about how to pull information out of a data frame.

The first thing we might want to do is pull out one of our stored variables, let’s say hrslept. One thing you might try to do is ignore the fact that hrslept is locked up inside the expt data frame. For instance, you might try to print it out like this:

hrslept
## Error in eval(expr, envir, enclos): object 'hrslept' not found

This doesn’t work, because R doesn’t go “peeking” inside the data frame unless you explicitly tell it to do so. How do we tell R to look inside the data frame? As is always the case with R there are several ways. The simplest way is to use the $ operator to extract the variable you’re interested in, like this:

expt$hrslept
## [1]  6  7  8  7  6  5  4  3 20

We will talk a bit more about this in Section 7.4.

6.5.3.2 Altering the content of a data frame

If you want to restore the 20 hours slept to a more reasonable 10 hours, you could do so as follows:

expt$hrslept[9] <- 10
print(expt)
##   age gender group.f score hrslept zzz
## 1  17   male group 1    12       6   6
## 2  19   male group 1    10       7   7
## 3  21   male group 1    11       8   8
## 4  37  other group 2    15       7   7
## 5  18   male group 2    16       6   6
## 6  19 female group 2    14       5   5
## 7  47 female group 3    25       4   4
## 8  18 female group 3    21       3   3
## 9  19 female group 3    29      10  10

If everyone slept exactly the same number of hours, you could change all values at once:

expt$hrslept <- 10
print(expt)
##   age gender group.f score hrslept zzz
## 1  17   male group 1    12      10   6
## 2  19   male group 1    10      10   7
## 3  21   male group 1    11      10   8
## 4  37  other group 2    15      10   7
## 5  18   male group 2    16      10   6
## 6  19 female group 2    14      10   5
## 7  47 female group 3    25      10   4
## 8  18 female group 3    21      10   3
## 9  19 female group 3    29      10  10

This, for example, won’t work

expt$hrslept <- c(9,10)
## Error in `$<-.data.frame`(`*tmp*`, hrslept, value = c(9, 10)): replacement has 2 rows, data has 9
print(expt)
##   age gender group.f score hrslept zzz
## 1  17   male group 1    12      10   6
## 2  19   male group 1    10      10   7
## 3  21   male group 1    11      10   8
## 4  37  other group 2    15      10   7
## 5  18   male group 2    16      10   6
## 6  19 female group 2    14      10   5
## 7  47 female group 3    25      10   4
## 8  18 female group 3    21      10   3
## 9  19 female group 3    29      10  10

R can’t possibly know what it should change all 9 values to if you only provide two values!

This won’t work either:

expt$hrslept[10] <- 11
## Error in `$<-.data.frame`(`*tmp*`, hrslept, value = c(10, 10, 10, 10, : replacement has 10 rows, data has 9
print(expt)
##   age gender group.f score hrslept zzz
## 1  17   male group 1    12      10   6
## 2  19   male group 1    10      10   7
## 3  21   male group 1    11      10   8
## 4  37  other group 2    15      10   7
## 5  18   male group 2    16      10   6
## 6  19 female group 2    14      10   5
## 7  47 female group 3    25      10   4
## 8  18 female group 3    21      10   3
## 9  19 female group 3    29      10  10

The reason is that R would have no problem adding 11 as a 10th value for hrslept, but it doesn’t know what the corresponding values should be for the other variables (age, gender, group.f, score and zzz).
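If you do want to fill a column with a repeating pattern, or add an extra person, you have to be explicit about it. Here is a minimal sketch, assuming a small made-up data frame called toy (my own example, not the expt data): rep_len() stretches a short vector to one value per row, and rbind() adds a complete new row with a value for every variable.

```r
# a small, made-up data frame (hypothetical; not the expt data)
toy <- data.frame(age = c(17, 19, 21), hrslept = c(6, 7, 8))

# repeat the pattern 9, 10 until there is one value per row
toy$hrslept <- rep_len(c(9, 10), nrow(toy))
toy$hrslept
# 9 10 9

# add a new case by supplying a value for every variable
toy <- rbind(toy, data.frame(age = 37, hrslept = 11))
nrow(toy)
# 4
```

The point of the sketch is only that R wants you to say what each row should contain; it will not guess.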

6.5.4 Doing calculations using data frames

One thing I want to share is that the apply() function, which we encountered when working with matrices (Section XXX), also applies to data frames, in exactly the same way.

For example:

apply(expt, MARGIN = 2, head)
##      age  gender   group.f   score hrslept zzz 
## [1,] "17" "male"   "group 1" "12"  "10"    " 6"
## [2,] "19" "male"   "group 1" "10"  "10"    " 7"
## [3,] "21" "male"   "group 1" "11"  "10"    " 8"
## [4,] "37" "other"  "group 2" "15"  "10"    " 7"
## [5,] "18" "male"   "group 2" "16"  "10"    " 6"
## [6,] "19" "female" "group 2" "14"  "10"    " 5"

Another example:

apply(expt, MARGIN = 2, max)
##       age    gender   group.f     score   hrslept       zzz 
##      "47"   "other" "group 3"      "29"      "10"      "10"

You might have observed that the numbers have been turned into characters, which is reminiscent of the behavior we observed with matrices. The reason is that, strictly speaking, apply() only works on matrices. So if we supply apply() with a data frame, R silently converts it to a matrix (but does not convert the result back). As long as your data frame consists only of numeric variables, you probably won’t even notice, though.
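To see the conversion at work, and one way around it, here is a small sketch on a made-up data frame (toy is my own example, not expt). as.matrix() shows roughly what apply() gets to work with, whereas sapply() visits the columns one by one and so leaves each column’s type alone.

```r
# a made-up data frame with one numeric and one character column
toy <- data.frame(age = c(17, 19, 21),
                  gender = c("male", "male", "other"),
                  stringsAsFactors = FALSE)

# a matrix can hold only one type, so everything becomes character
m <- as.matrix(toy)
typeof(m)
# "character"

# sapply() treats the data frame as a collection of columns
sapply(toy, class)
# age is "numeric", gender is "character"
```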

What about mean(), sum() and the row-wise and column-wise versions?

sum(expt)
## Error in FUN(X[[i]], ...): only defined on a data frame with all numeric-alike variables

No luck, but it does make sense. After all, expt contains a column with the values male, female and other, and there is no way you could sum those, so you shouldn’t expect R to do it either.

So let’s, for the sake of experiment, construct a data frame with only numerical variables and see if the functions flourish again:

expt_num <- data.frame(age = expt$age, score = expt$score)
print(expt_num)
##   age score
## 1  17    12
## 2  19    10
## 3  21    11
## 4  37    15
## 5  18    16
## 6  19    14
## 7  47    25
## 8  18    21
## 9  19    29
sum(expt_num)
## [1] 368
rowSums(expt_num)
## [1] 29 29 32 52 34 33 72 39 48
rowMeans(expt_num)
## [1] 14.5 14.5 16.0 26.0 17.0 16.5 36.0 19.5 24.0
colSums(expt_num)
##   age score 
##   215   153
colMeans(expt_num)
##      age    score 
## 23.88889 17.00000

Yeah, baby, on a roll! Let’s push our luck:

rowMaxs(expt_num)
## Error in rowMaxs(expt_num): Argument 'x' must be a matrix or a vector.
colMaxs(expt_num)
## Error in colMaxs(expt_num): Argument 'x' must be a matrix or a vector.

And of course, at some point our luck runs out. rowMaxs() does not work on a data frame, which is, in hindsight, not surprising given that it is a function from the package called matrixStats and not matrixanddataframesandafewotherclassesyoumightcareaboutStats.

Help is on the way in the form of pmax and friends:

pmax(expt$age, expt$score)
## [1] 17 19 21 37 18 19 47 21 29
pmin(expt$age, expt$score)
## [1] 12 10 11 15 16 14 25 18 19
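If you want the row-wise or column-wise maxima of a whole data frame without naming each column, base R can still do it. The following is just a sketch, rebuilding expt_num from the values shown above: do.call() hands every column of the data frame to pmax(), and sapply() applies max() to each column in turn.

```r
# rebuild the numeric-only data frame from the values printed earlier
expt_num <- data.frame(age   = c(17, 19, 21, 37, 18, 19, 47, 18, 19),
                       score = c(12, 10, 11, 15, 16, 14, 25, 21, 29))

# row-wise maximum: same as pmax(expt_num$age, expt_num$score)
row_max <- do.call(pmax, expt_num)
row_max
# 17 19 21 37 18 19 47 21 29

# column-wise maximum: one value per variable
col_max <- sapply(expt_num, max)
col_max
# age 47, score 29
```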

6.6 Data frames vs matrices

So now we know two ways of binding or merging two or more vectors together: into a data frame or into a matrix. If you are now thinking that the expt data frame looks a lot like a matrix, you are right. In fact, in this particular case, we could have stored all the info quite neatly in such a matrix, for example as follows:

exptM <- cbind( age, gender, group.f, score, hoursslept ) #put everything in a matrix 
exptM #show matrix
##       age gender group.f score hoursslept
##  [1,]  17      1       1    12          6
##  [2,]  19      1       1    10          7
##  [3,]  21      1       1    11          8
##  [4,]  37      2       2    15          7
##  [5,]  18      1       2    16          6
##  [6,]  19      3       2    14          5
##  [7,]  47      3       3    25          4
##  [8,]  18      3       3    21          3
##  [9,]  19      3       3    29         10
print(expt) #show dataframe
##   age gender group.f score hrslept zzz
## 1  17   male group 1    12      10   6
## 2  19   male group 1    10      10   7
## 3  21   male group 1    11      10   8
## 4  37  other group 2    15      10   7
## 5  18   male group 2    16      10   6
## 6  19 female group 2    14      10   5
## 7  47 female group 3    25      10   4
## 8  18 female group 3    21      10   3
## 9  19 female group 3    29      10  10

The critical difference between a data frame and a matrix is that, in a data frame, we have this notion that each of the columns corresponds to a different variable: as a consequence, the columns in a data frame can be of different data types. The first column could be numeric, and the second column could contain character strings, and the third column could be logical data. In that sense, there is a fundamental asymmetry built into a data frame, because of the fact that columns represent variables (which can be qualitatively different from each other) and rows represent cases (which cannot). Matrices are intended to be thought of in a different way. All elements are of the same type (in this case, numerical values).

Note that this difference is also reflected in how the data frame expt and the matrix exptM treat the factors gender and group.f: in the matrix, their values are represented as numbers (with all the problems it entails, as I discussed when arguing for the need for factors), whereas in the data frame, they are represented with their more meaningful labels.

To drive home this point, suppose I want to store the seasons in which I have collected the data.

collection <- c("win", "win", "win", "sum", "win", "sum", "win", "sum", "sum") #winter or summer
exptM <- cbind( age, gender, group.f, score, hoursslept, collection ) 
exptM
##       age  gender group.f score hoursslept collection
##  [1,] "17" "1"    "1"     "12"  "6"        "win"     
##  [2,] "19" "1"    "1"     "10"  "7"        "win"     
##  [3,] "21" "1"    "1"     "11"  "8"        "win"     
##  [4,] "37" "2"    "2"     "15"  "7"        "sum"     
##  [5,] "18" "1"    "2"     "16"  "6"        "win"     
##  [6,] "19" "3"    "2"     "14"  "5"        "sum"     
##  [7,] "47" "3"    "3"     "25"  "4"        "win"     
##  [8,] "18" "3"    "3"     "21"  "3"        "sum"     
##  [9,] "19" "3"    "3"     "29"  "10"       "sum"
expt$col <- collection
print(expt)
##   age gender group.f score hrslept zzz col
## 1  17   male group 1    12      10   6 win
## 2  19   male group 1    10      10   7 win
## 3  21   male group 1    11      10   8 win
## 4  37  other group 2    15      10   7 sum
## 5  18   male group 2    16      10   6 win
## 6  19 female group 2    14      10   5 sum
## 7  47 female group 3    25      10   4 win
## 8  18 female group 3    21      10   3 sum
## 9  19 female group 3    29      10  10 sum

You’ll notice that all numerical values have been “characterized” (turned into character strings; see endnote 29) when put into a matrix together with a character variable. This did not happen when they were put together in a data frame. It shows that the internals of data frames and matrices are quite different.
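The practical consequence is easy to check. Here is a minimal sketch with two short made-up vectors (shorter than the ones used above, purely for illustration): in the matrix the ages can no longer be used for arithmetic until you coerce them back, whereas in the data frame they never stopped being numbers.

```r
age <- c(17, 19, 21)
season <- c("win", "sum", "win")

m <- cbind(age, season)         # everything becomes character
df <- data.frame(age, season)   # each column keeps its own type

is.numeric(m[, "age"])   # FALSE
is.numeric(df$age)       # TRUE

mean(df$age)                   # 19
mean(as.numeric(m[, "age"]))   # 19, but only after coercing back
```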

Another glimpse of the different internal workings of matrices and data frames is offered when we try to add yet another variable that has a different number of elements from the others. I just want to remember data collection occurred in winter and in summer, but I don’t want to store that my first data point was collected in the winter, my second in summer, and so on. I just want to store two names of two seasons.

collection.short <- c("win", "sum") #winter or summer
exptM <- cbind( age, gender, group.f, score, hoursslept, collection.short ) 
## Warning in cbind(age, gender, group.f, score, hoursslept, collection.short):
## number of rows of result is not a multiple of vector length (arg 6)
exptM
##       age  gender group.f score hoursslept collection.short
##  [1,] "17" "1"    "1"     "12"  "6"        "win"           
##  [2,] "19" "1"    "1"     "10"  "7"        "sum"           
##  [3,] "21" "1"    "1"     "11"  "8"        "win"           
##  [4,] "37" "2"    "2"     "15"  "7"        "sum"           
##  [5,] "18" "1"    "2"     "16"  "6"        "win"           
##  [6,] "19" "3"    "2"     "14"  "5"        "sum"           
##  [7,] "47" "3"    "3"     "25"  "4"        "win"           
##  [8,] "18" "3"    "3"     "21"  "3"        "sum"           
##  [9,] "19" "3"    "3"     "29"  "10"       "win"
expt$season <- collection.short
## Error in `$<-.data.frame`(`*tmp*`, season, value = c("win", "sum")): replacement has 2 rows, data has 9
print(expt)
##   age gender group.f score hrslept zzz col
## 1  17   male group 1    12      10   6 win
## 2  19   male group 1    10      10   7 win
## 3  21   male group 1    11      10   8 win
## 4  37  other group 2    15      10   7 sum
## 5  18   male group 2    16      10   6 win
## 6  19 female group 2    14      10   5 sum
## 7  47 female group 3    25      10   4 win
## 8  18 female group 3    21      10   3 sum
## 9  19 female group 3    29      10  10 sum

The fact that the new variable has only 2 elements, whereas the others have 9, leads to a mere warning when everything is put together into a matrix. (Note the use of the recycling rule explained in Section ??.) When trying to add this odd-one-out variable to a data frame, R balks and produces an error, so nothing is changed in the expt data frame (i.e., the additional season variable we asked for has not been added). No can do. As we’ll soon see, lists will solve this rather restrictive behavior.
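A small nuance: a data frame is only this strict when the lengths do not divide evenly. The sketch below uses a made-up data frame called toy with 6 rows (my own example); a vector of 2 values fits into 6 rows exactly and is recycled, while a vector of 4 values does not fit and triggers the same kind of error as above. I wrap the failing line in try() so that the rest of the code still runs.

```r
toy <- data.frame(id = 1:6)

# 2 goes into 6 exactly, so the values are recycled three times
toy$season <- c("win", "sum")
toy$season
# "win" "sum" "win" "sum" "win" "sum"

# 4 does not go into 6, so this assignment fails;
# try() lets the script carry on after the error
res <- try(toy$half <- c("a", "b", "c", "d"), silent = TRUE)
class(res)
# "try-error"

names(toy)   # the failed column was not added
# "id" "season"
```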

Overall, by comparing the matrix and the data frame output, I guess most of you will appreciate that the data frame looks neater and is easier to interpret. So we will mostly be using that, especially when doing statistical analyses.

After having stressed how big the difference is between a matrix and a data frame, let me tone it down a bit and stress how similar they are too. In fact, they are so similar you can combine them in an operation like addition, provided of course that the elements can be summed:

#let's remind ourselves about this cute little data frame
print(expt_num)
##   age score
## 1  17    12
## 2  19    10
## 3  21    11
## 4  37    15
## 5  18    16
## 6  19    14
## 7  47    25
## 8  18    21
## 9  19    29
#let's make a matrix
Mas <- cbind(age,score)
Mas
##       age score
##  [1,]  17    12
##  [2,]  19    10
##  [3,]  21    11
##  [4,]  37    15
##  [5,]  18    16
##  [6,]  19    14
##  [7,]  47    25
##  [8,]  18    21
##  [9,]  19    29
#now do some seriously messed up stuff
#and sum a matrix and a data frame
#prepare to get blown away
s <- Mas + expt_num
s
#what is this weird object you ask?
class(s)
## [1] "data.frame"
#it's a data frame!

6.6.1 Looking for more on data frames?

There’s a lot more that can be said about data frames: they’re fairly complicated beasts, and the longer you use R the more important it is to make sure you really understand them. We’ll talk a bit more about them in Chapter 7.

6.7 Lists

6.7.1 Introducing lists

The next kind of data I want to mention are lists. Lists are an extremely fundamental data structure in R, and as you start making the transition from a novice to a savvy R user you will use lists all the time. I don’t use lists very often in this book – not directly – but most of the advanced data structures in R are built from lists. In fact, as far as R is concerned a data frame is actually a special kind of list, or a list is like a data frame on steroids. Because lists are so important to how R stores things, it’s useful to have a basic understanding of them.

Okay, so what is a list, exactly? Like data frames, lists are just “collections of variables.” However, unlike data frames – which are basically supposed to look like a nice “rectangular” table of data – there are no constraints on what kinds of variables we include, and no requirement that the variables have any particular relationship to one another.

6.7.2 Creating lists

In order to understand what this actually means, the best thing to do is create a list, which R, being the good sport that it is, lets you do using the list() function, just like we used the matrix() function and the data.frame() function to create a matrix or a data frame. Ain’t life grand.

If I type this as my command:

Dan <- list( age = 34,
            nerd = TRUE,
            parents = c("Joe","Liz") 
)

R creates a new list variable called Dan, which is a bundle of three different variables: age, nerd and parents. Notice that the parents variable is longer than the others.

If we now print out the variable, you can see the way that R stores the list:

Dan 
## $age
## [1] 34
## 
## $nerd
## [1] TRUE
## 
## $parents
## [1] "Joe" "Liz"

As you might have guessed from those $ symbols everywhere, the variables are stored in exactly the same way that they are for a data frame (again, this is not surprising: data frames are a type of list). So you will (I hope) be entirely unsurprised and probably quite bored when I tell you that you can extract the variables from the list using the $ operator, like so:

Dan$nerd
## [1] TRUE

If you need to add new entries to the list, the easiest way to do so is to again use $, as the following example illustrates. If I type a command like this

Dan$children <- "Alex"

then R adds a new entry called children to the end of the list, and assigns it a value of "Alex". If I were now to print() this list out, you’d see a new entry at the bottom of the printout.

Finally, it’s actually possible for lists to contain other lists, so it’s quite possible that I would end up using a command like Dan$partner$age to find out how old my partner is.

Dan$partner <- list(name = "You know this, don't you", age = 45)
Dan
## $age
## [1] 34
## 
## $nerd
## [1] TRUE
## 
## $parents
## [1] "Joe" "Liz"
## 
## $children
## [1] "Alex"
## 
## $partner
## $partner$name
## [1] "You know this, don't you"
## 
## $partner$age
## [1] 45

Or I could try to remember it myself I suppose.
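To make the nested extraction concrete, here is a short sketch. It rebuilds a version of the Dan list so that it runs on its own; the partner’s name "Sam" is made up for the example. Besides chaining $, it shows the double square brackets [[ ]], which are an alternative way of pulling one element out of a list.

```r
Dan <- list(age = 34,
            nerd = TRUE,
            parents = c("Joe", "Liz"))
Dan$partner <- list(name = "Sam", age = 45)   # "Sam" is a made-up name

Dan$partner$age             # 45
Dan[["partner"]][["age"]]   # 45 again, using double square brackets
Dan$parents[2]              # "Liz"
names(Dan)                  # "age" "nerd" "parents" "partner"
```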

Note that the parents variable was longer than the others. This is perfectly acceptable for a list, but it wouldn’t be for a data frame! If we had entered the same information in a data frame, R would think there are two 34-year-old nerds, one with a parent Joe and one with a parent Liz.

DanDF <- data.frame( age = 34,
            nerd = TRUE,
            parents = c("Joe","Liz") 
)
print(DanDF)
##   age nerd parents
## 1  34 TRUE     Joe
## 2  34 TRUE     Liz

6.7.3 Data frames as lists

I have said before that a data frame is just a special case of a list. However, it is good to take your time to appreciate that it’s a very special kind of list: one where all the variables are of the same length, and the first element in each variable happens to correspond to the first “case” in the data set. Let’s look at our data frame again.

print( expt )
##   age gender group.f score hrslept zzz col
## 1  17   male group 1    12      10   6 win
## 2  19   male group 1    10      10   7 win
## 3  21   male group 1    11      10   8 win
## 4  37  other group 2    15      10   7 sum
## 5  18   male group 2    16      10   6 win
## 6  19 female group 2    14      10   5 sum
## 7  47 female group 3    25      10   4 win
## 8  18 female group 3    21      10   3 sum
## 9  19 female group 3    29      10  10 sum

Note that, despite expt being a list (on account of being a data frame), it is printed differently than Dan. No-one ever wants to see a data frame printed out in the default “list-like” way that I’ve shown in the extract above when printing Dan. If you want to see how R would show expt to you if it respected its list-like identity, you can see it here, and be appreciative that R focuses on its data frame identity instead.

as.list(expt)
## $age
## [1] 17 19 21 37 18 19 47 18 19
## 
## $gender
## [1] male   male   male   other  male   female female female female
## Levels: male other female
## 
## $group.f
## [1] group 1 group 1 group 1 group 2 group 2 group 2 group 3 group 3 group 3
## Levels: group 1 group 2 group 3
## 
## $score
## [1] 12 10 11 15 16 14 25 21 29
## 
## $hrslept
## [1] 10 10 10 10 10 10 10 10 10
## 
## $zzz
## [1]  6  7  8  7  6  5  4  3 10
## 
## $col
## [1] "win" "win" "win" "sum" "win" "sum" "win" "sum" "sum"
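You can also ask R directly about this double identity. The sketch below uses a small made-up data frame called toy (not expt) and a few base R functions to check that a data frame really is a list with one equally long element per column.

```r
toy <- data.frame(age = c(17, 19, 21), score = c(12, 10, 11))

is.list(toy)      # TRUE
length(toy)       # 2: a list with one element per column
lengths(toy)      # age 3, score 3: every element has the same length
toy[["score"]]    # 12 10 11, extracted just as from a list
```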

6.8 Formulas

The final kind of variable that I want to introduce is the formula. Formulas were originally introduced into R as a convenient way to specify a particular type of statistical model (see Chapter ??) but they’re such handy things that they’ve spread. Formulas are now used in a lot of different contexts, so it makes sense to introduce them early.

Stated simply, a formula object is a variable, but it’s a special type of variable that specifies a relationship between other variables. A formula is specified using the “tilde operator” ~. A very simple example of a formula is shown below (see endnote 30):

formula1 <- out ~ pred
formula1
## out ~ pred
## <environment: 0x00000236bb752fe8>

The precise meaning of this formula depends on exactly what you want to do with it, but in broad terms, it means “the out (outcome) variable, analysed in terms of the pred (predictor) variable”. That said, although the simplest and most common form of a formula uses the “one variable on the left, one variable on the right” format, there are others. For instance, the following examples are all reasonably common

formula2 <-  out ~ pred1 + pred2   # more than one variable on the right
formula3 <-  out ~ pred1 * pred2   # different relationship between predictors 
formula4 <-  ~ var1 + var2         # a 'one-sided' formula

and there are many more variants besides. Formulas are pretty flexible things, and so different functions will make use of different formats, depending on what the function is intended to do. We will encounter formulas later.
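To give a feel for how a formula is an object that functions can take apart and use, here is a small sketch. The first part only inspects a formula; the second part hands one to aggregate(), which is just one example of a function that accepts a formula (my choice for illustration, with a made-up data frame called toy).

```r
f <- out ~ pred1 + pred2

class(f)      # "formula"
all.vars(f)   # "out" "pred1" "pred2"

# one function that accepts a formula: aggregate()
toy <- data.frame(score = c(1, 3, 5),
                  group = c("a", "a", "b"))
means <- aggregate(score ~ group, data = toy, FUN = mean)
means
# mean score is 2 in group a and 5 in group b
```

Note that out, pred1 and pred2 do not need to exist for the formula to be created: the formula only records the names.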

6.9 More useful things to know when dealing with different kinds of variables

You want more structures? You got it! You want to identify and inspect a variable? You got it. Want to shapeshift variables? You got it!

6.9.1 Other useful data structures

Up to this point, we have encountered several different kinds of variables. At the simplest level, we’ve seen numeric data, logical data and character data. However, we’ve also encountered some more complicated kinds of variables, namely factors, formulas, data frames and lists (and I have mentioned arrays in passing). We’ll see a few more specialised data structures later on in this book.

For example, there is a class Date, which is for, well, dates (the chronological kind, not the romantic or botanical ones). Next, there is a class called table, which we will encounter when discussing tables in Section XXX. More generally, the output of a function (see Section XXX) is not guaranteed to belong to any of the classes we have already encountered, and might have a class of its own. For example, in Section XXX, we will encounter the binom.test() function to perform a, well, binomial test. You can ignore the details for now, but do observe that the binom.test() function produces something we haven’t encountered before.

q <- binom.test(x = 10, n = 20, p = 0.5,
                alternative = "two.sided",
                conf.level = 0.95)
class(q)
## [1] "htest"

The takeaway here is not that you should know anything about or even remember the mere existence of the htest class, but rather that it sometimes will pay off to inspect (or look up) the output of a function.
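Inspecting such an output usually comes down to a handful of base R functions. Here is a minimal sketch, using the same binom.test() call but storing the output in a variable I call result (my own name):

```r
result <- binom.test(x = 10, n = 20, p = 0.5)

class(result)     # "htest"
is.list(result)   # TRUE: under the hood it is a list
names(result)     # the pieces it contains, among them "p.value"
result$p.value    # extract a single piece with $
```

So even an unfamiliar class often turns out to be a list in disguise, which means that the $ operator you already know gets you a long way.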

6.9.2 Know what you are dealing with

Sometimes it will prove to be very handy to get some high-level information about the variables you are dealing with, without having to inspect them in detail. There are many functions that are helpful for this task. Here I demonstrate two: dim(), which reports the dimensions of an R object, and str(), which shows its internal structure.

dim(exptM)
## [1] 9 6
dim(expt)
## [1] 9 7
str(exptM)
##  chr [1:9, 1:6] "17" "19" "21" "37" "18" "19" "47" "18" "19" "1" "1" "1" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:6] "age" "gender" "group.f" "score" ...
str(expt)
## 'data.frame':    9 obs. of  7 variables:
##  $ age    : num  17 19 21 37 18 19 47 18 19
##  $ gender : Factor w/ 3 levels "male","other",..: 1 1 1 2 1 3 3 3 3
##  $ group.f: Factor w/ 3 levels "group 1","group 2",..: 1 1 1 2 2 2 3 3 3
##  $ score  : num  12 10 11 15 16 14 25 21 29
##  $ hrslept: num  10 10 10 10 10 10 10 10 10
##  $ zzz    : num  6 7 8 7 6 5 4 3 10
##  $ col    : chr  "win" "win" "win" "sum" ...

One problem that sometimes comes up in practice is that you forget what you called all your variables. Normally you might try to type ls(), but this command will not tell you what the names are for those variables inside a data frame! One way is to ask R to tell you what the names of all the variables stored in the data frame are, which you can do using the names() function:

names(expt)
## [1] "age"     "gender"  "group.f" "score"   "hrslept" "zzz"     "col"

Sadly, this doesn’t work for matrices:

names(exptM)
## NULL

Computer says no.

You need dimnames() instead. Sigh.

dimnames(exptM) 
## [[1]]
## NULL
## 
## [[2]]
## [1] "age"              "gender"           "group.f"          "score"           
## [5] "hoursslept"       "collection.short"

From this we learn that the rows have no names, and what the column names are.
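If you only care about the column names, the colnames() function (covered in Section 7.1.2) has the advantage that it gives the same answer for a matrix and for a data frame. A small sketch with a made-up matrix m and its data frame twin toy (both my own examples):

```r
m <- cbind(age = c(17, 19), score = c(12, 10))   # a matrix
toy <- as.data.frame(m)                          # the same data as a data frame

names(m)        # NULL
colnames(m)     # "age" "score"
names(toy)      # "age" "score"
colnames(toy)   # "age" "score"
```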

6.9.3 Coercing data from one class to another

Sometimes you want to change the variable class. This can happen for all sorts of reasons. Sometimes when you import data from files, it can come to you in the wrong format: numbers sometimes get imported as text, dates usually get imported as text, and many other possibilities besides. Regardless of how you’ve ended up in this situation, there’s a very good chance that sometimes you’ll want to convert a variable from one class into another one. Or, to use the correct term, you want to coerce the variable from one class into another. Coercion is a little tricky, and so I’ll only discuss the very basics here, using a few simple examples.

Firstly, let’s suppose we have a variable x that is supposed to be representing a number, but the data file that you’ve been given has encoded it as text. Let’s imagine that the variable is something like this:

x <- "100"  # the variable 
class(x)    # what class is it?
## [1] "character"

Obviously, if I want to do calculations using x in its current state, R is going to get very annoyed at me. It thinks that x is text, so it’s not going to allow me to try to do mathematics using it! Obviously, we need to coerce x from character to numeric. We can do that in a straightforward way by using the as.numeric() function.

Exercise: Coerce x from character to numeric, and make sure to save the result again in x. Next, check the class of x and see whether a simple calculation with x works without R complaining.

x <- as.numeric(x)  # coerce the variable
class(x)            # what class is it?
x + 1               # hey, addition works!

Not surprisingly, we can also convert it back again if we need to. The function that we use to do this is the as.character() function:

x <- as.character(x)   # coerce back to text
class(x)               # check the class:
## [1] "character"

However, there are some fairly obvious limitations: you can’t coerce the string "hello world" into a number because, well, there isn’t a number that corresponds to it. Or, at least, you can’t do anything useful:

as.numeric( "hello world" )  # this isn't going to work.
## Warning: NAs introduced by coercion
## [1] NA

In this case, R doesn’t give you an error message; it just gives you a warning, and then says that the data is missing (see Section 6.1.2 for the interpretation of NA).
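In practice this is actually handy, because the NA values tell you exactly which entries could not be converted. Here is a minimal sketch, assuming a made-up character vector x in which one entry is not a number:

```r
x <- c("100", "12.5", "hello world")

# suppressWarnings() hides the coercion warning we already expect
num <- suppressWarnings(as.numeric(x))
num             # 100.0  12.5    NA
is.na(num)      # FALSE FALSE TRUE
x[is.na(num)]   # "hello world": the entry that could not be converted
```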

That gives you a feel for how to change between numeric and character data. What about logical data? To cover this briefly, coercing text to logical data is pretty intuitive: you use the as.logical() function, and the character strings "T", "TRUE", "True" and "true" all convert to the logical value of TRUE. Similarly "F", "FALSE", "False", and "false" all become FALSE. All other strings convert to NA. When you go back the other way using as.character(), TRUE converts to "TRUE" and FALSE converts to "FALSE". Converting numbers to logicals – again using as.logical() – is straightforward. Following the convention in the study of logic, the number 0 converts to FALSE. Everything else is TRUE. Going back using as.numeric(), FALSE converts to 0 and TRUE converts to 1.
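The rules in the previous paragraph are easy to verify for yourself. The following lines are a sketch that simply tries each of them in turn:

```r
as.logical(c("T", "TRUE", "True", "true"))     # TRUE TRUE TRUE TRUE
as.logical(c("F", "FALSE", "False", "false"))  # FALSE FALSE FALSE FALSE
as.logical("yes")                              # NA

as.character(c(TRUE, FALSE))   # "TRUE" "FALSE"

as.logical(c(0, 1, -2.5))      # FALSE TRUE TRUE
as.numeric(c(FALSE, TRUE))     # 0 1
```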

In Section ??, we have already seen how we can convert something to a factor: pretty straightforwardly, it uses the as.factor() function.
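One thing to watch out for when going the other way, from a factor back to numbers, is shown in the sketch below (x is a made-up vector of numbers stored as text). Applying as.numeric() directly to a factor returns the internal level codes, so if you want the original numbers you go via as.character() first.

```r
x <- c("10", "20", "10")
f <- as.factor(x)

levels(f)                     # "10" "20"
as.numeric(f)                 # 1 2 1: the internal level codes
as.numeric(as.character(f))   # 10 20 10: the numbers you probably wanted
```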

6.10 Go out and play

If you have done and understood all of this, you can find some additional exercises in the RIntroExercises document. Since you now know how to work in RStudio, why not go there and work through them?

7 Data handling

This is a somewhat strange chapter, even by my standards. My goal in this chapter is to talk a bit more honestly about the realities of working with data than you’ll see anywhere else in the book. The problem with real-world data sets is that they are Lola Young-esque messy. Very often the data file that you start out with doesn’t have the variables stored in the right format for the analysis you want to do. It’s not uncommon in real-world data analysis to find that one of your variables isn’t quite equivalent to the variable that you really want. For instance, you may need to convert a numeric variable into a different numeric variable (e.g., you may want to analyse the absolute value of the original variable). At other times, it’s convenient to take a continuous-valued variable (e.g., age) and break it up into a smallish number of categories (e.g., younger, middle, older). Sometimes you only want to analyse a subset of the data. Et cetera.

In other words, there’s a lot of data manipulation that you need to do, just to get your data set into the format that you need. The purpose of this chapter is to provide a basic introduction to some of these pragmatic topics. Although the chapter is motivated by the kinds of practical issues that arise when manipulating real data, I’ll stick with the practice that I’ve adopted through most of the book and rely on very small, toy data sets that illustrate the underlying issue.

Let’s introduce these data sets.

As a matrix, we’ll use exptM from earlier:

exptM <- cbind( age, gender, group.f, score ) #put everything in a matrix 
exptM #show matrix
##       age gender group.f score
##  [1,]  17      1       1    12
##  [2,]  19      1       1    10
##  [3,]  21      1       1    11
##  [4,]  37      2       2    15
##  [5,]  18      1       2    16
##  [6,]  19      3       2    14
##  [7,]  47      3       3    25
##  [8,]  18      3       3    21
##  [9,]  19      3       3    29

For the data frame, let’s start with a simple example. As the father of a small child, I naturally spend a lot of time watching TV shows like In the Night Garden. I’ve transcribed a short section of the dialogue. There are two variables of interest, speaker and utterance, which are both simple vectors. When we take a look at the data, it becomes very clear what happened to my sanity.

speaker <- c("upsydaisy", "upsydaisy", "upsydaisy", "upsydaisy", "tombliboo", "tombliboo", "makkapakka", "makkapakka", "makkapakka", "makkapakka")
utterance <- c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk")
speaker 
##  [1] "upsydaisy"  "upsydaisy"  "upsydaisy"  "upsydaisy"  "tombliboo" 
##  [6] "tombliboo"  "makkapakka" "makkapakka" "makkapakka" "makkapakka"
utterance
##  [1] "pip" "pip" "onk" "onk" "ee"  "oo"  "pip" "pip" "onk" "onk"

Let’s put all this into a nice data frame to demonstrate stuff. Remember that I need the print() function to make things look nice.

itng <- data.frame(speaker, utterance)
print(itng)
##       speaker utterance
## 1   upsydaisy       pip
## 2   upsydaisy       pip
## 3   upsydaisy       onk
## 4   upsydaisy       onk
## 5   tombliboo        ee
## 6   tombliboo        oo
## 7  makkapakka       pip
## 8  makkapakka       pip
## 9  makkapakka       onk
## 10 makkapakka       onk

I’ll also use a slightly different data set, namely the garden data frame. It extends the itng data frame with a third variable, reflecting the mood of the character when speaking the utterance, on a 1-3 scale.

garden <- itng 
garden$mymood <- c(2, 1, 1, 3, 2, 2, 1, 1, 2, 3) #mood, on a 1-3 scale 
print(garden)
##       speaker utterance mymood
## 1   upsydaisy       pip      2
## 2   upsydaisy       pip      1
## 3   upsydaisy       onk      1
## 4   upsydaisy       onk      3
## 5   tombliboo        ee      2
## 6   tombliboo        oo      2
## 7  makkapakka       pip      1
## 8  makkapakka       pip      1
## 9  makkapakka       onk      2
## 10 makkapakka       onk      3

7.1 The naming game

Since having a proper name for things can dramatically simplify data handling, it is only fitting I start the data handling section with an explanation of how to name things. I generally prefer having meaningful names attached to my variables.

7.1.1 Naming variables in a vector

We have briefly seen the names() function as a way to get R to show the names stored in a data frame. In true R fashion, it can do more than that: it can also be used for assigning names to, for example, vector elements.

One thing that is sometimes a little unsatisfying about the way that R prints out a vector is that the elements come out unlabelled. Here’s what I mean. Suppose I’ve got data reporting the quarterly profits for some company. If I just create a no-frills vector, I have to rely on memory to know which element corresponds to which quarter. That is:

profit <- c( 3.1, 0.1, -1.4, 1.1 )
profit
## [1]  3.1  0.1 -1.4  1.1

You can probably guess that the first element corresponds to the first quarter, the second element to the second quarter, and so on, but that’s only because I’ve told you the back story and because this happens to be a very simple example. In general, it can be quite difficult. This is where it can be helpful to assign names to each of the elements. Here’s how you do it:

names(profit) <- c("Q1","Q2","Q3","Q4")
profit
##   Q1   Q2   Q3   Q4 
##  3.1  0.1 -1.4  1.1

This is a slightly odd-looking command, admittedly, but it’s not too difficult to follow. All we’re doing is assigning a vector of labels (character strings) to names(profit).

It’s also worth noting that you don’t have to do this as a two-stage process. You can get the same result with this command:

profit <- c( "Q1" = 3.1, "Q2" = 0.1, "Q3" = -1.4, "Q4" = 1.1 )
profit
##   Q1   Q2   Q3   Q4 
##  3.1  0.1 -1.4  1.1

The important things to notice are that (a) this does make things much easier to read, but (b) the names at the top aren’t the “real” data. The value of profit[1] is still 3.1; all I’ve done is added a name to profit[1] as well.
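Names are not only there for show: you can also use them to look things up. A short sketch, rebuilding the profit vector so that the lines run on their own:

```r
profit <- c(Q1 = 3.1, Q2 = 0.1, Q3 = -1.4, Q4 = 1.1)

profit["Q3"]      # look up an element by its name (the name is kept)
profit[["Q3"]]    # -1.4, the bare value without the name
sum(profit)       # names are ignored in calculations
unname(profit)    # 3.1 0.1 -1.4 1.1, the data without the labels
```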

We could delete these if we wanted by typing

names(profit)<-NULL
profit
## [1]  3.1  0.1 -1.4  1.1

But let’s give them back, for future exercises:

profit <- c( "Q1" = 3.1, "Q2" = 0.1, "Q3" = -1.4, "Q4" = 1.1 )

7.1.2 Naming variables in a matrix

What about naming matrices, you ask? First, let’s create one

row.1 <- c( 2,3,1 )         # create data for row 1
row.2 <- c( 5,6,7 )         # create data for row 2
M <- rbind( row.1, row.2 )  # row bind them into a matrix
M                   # and print it out...
##       [,1] [,2] [,3]
## row.1    2    3    1
## row.2    5    6    7

Notice that, when we bound the two vectors together, R retained the names of the original variables as row names. Let’s add some highly unimaginative column names as well:

colnames(M) <- c( "col.1", "col.2", "col.3" )
M
##       col.1 col.2 col.3
## row.1     2     3     1
## row.2     5     6     7

If we want to change the row names, we could of course use something like this:

rownames(M) <- c( "bettername.1", "bettername.2")
print(M)
##              col.1 col.2 col.3
## bettername.1     2     3     1
## bettername.2     5     6     7

So, you can add names to a matrix by using the rownames() and colnames() functions.

Let’s admire the result of our hard work:

rownames(M) 
## [1] "bettername.1" "bettername.2"
colnames(M) 
## [1] "col.1" "col.2" "col.3"
dimnames(M) 
## [[1]]
## [1] "bettername.1" "bettername.2"
## 
## [[2]]
## [1] "col.1" "col.2" "col.3"
names(M) #doesn't work 
## NULL
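If you already know the names when you create the matrix, you do not have to add them afterwards. The sketch below is one possible alternative, not the approach used in the rest of this chapter: it relies on the dimnames argument of matrix(), and the object name M2 is made up for this illustration.

```r
# Build the same 2 x 3 matrix in a single call.
# byrow = TRUE fills the matrix row by row instead of column by column.
M2 <- matrix(c(2, 3, 1,
               5, 6, 7),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("bettername.1", "bettername.2"),   # row names
                             c("col.1", "col.2", "col.3")))       # column names

M2
rownames(M2)
colnames(M2)
```

The dimnames argument takes a list with the row names first and the column names second, which is the same order in which dimnames(M) prints them.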

7.1.3 Naming variables in a data frame

We can do just as we did with a matrix:

rownames(garden) <- c("case.1", "case.2", "case.3", "case.4", "case.5", "case.6", "case.7", 
                      "case.8", "case.9", "case.10")
colnames(garden)[3] <- "mood"

Let’s check in awe:

rownames(garden) 
##  [1] "case.1"  "case.2"  "case.3"  "case.4"  "case.5"  "case.6"  "case.7" 
##  [8] "case.8"  "case.9"  "case.10"
colnames(garden) 
## [1] "speaker"   "utterance" "mood"
dimnames(garden) 
## [[1]]
##  [1] "case.1"  "case.2"  "case.3"  "case.4"  "case.5"  "case.6"  "case.7" 
##  [8] "case.8"  "case.9"  "case.10"
## 
## [[2]]
## [1] "speaker"   "utterance" "mood"
names(garden) #same as colnames 
## [1] "speaker"   "utterance" "mood"

7.2 Extracting a subset of a vector

One very important kind of data handling is being able to extract a particular subset of the data. For instance, you might be interested only in analysing the data from one experimental condition, or you may want to look closely at the data from people over 50 years in age. To do this, the first step is getting R to extract the subset of the data corresponding to the observations that you’re interested in.

7.2.1 Using numeric indexing

One very useful thing we can do is pull out more than one element at a time. To refresh your memory, this is what the sales.by.month vector looks like:

sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)

So, suppose I wanted the data for February, March and April. What I could do is use the vector c(2,3,4) to indicate which elements I want R to pull out. That is, I’d type this:

sales.by.month[ c(2,3,4) ]
## [1] 100 200  50

Notice that the order matters here. If I asked for the data in the reverse order (i.e., April first, then March, then February) by using the vector c(4,3,2), then R outputs the data in the reverse order:

sales.by.month[ c(4,3,2) ]
## [1]  50 200 100

A second thing to be reminded of (see Section ??) is that R provides you with handy shortcuts for very common situations. For instance, suppose that I wanted to extract everything from the 2nd month through to the 8th month. One way to do this is to do the same thing I did above, and use the vector c(2,3,4,5,6,7,8) to indicate the elements that I want.

sales.by.month[ c(2,3,4,5,6,7,8) ]
## [1] 100 200  50   0   0   0   0

That works just fine, but it’s kind of a lot of typing. To help make this easier, R lets you use 2:8 as shorthand for c(2,3,4,5,6,7,8), which makes things a lot simpler (see Section ??). Let’s check that we can use the 2:8 shorthand as a way to pull out the 2nd through 8th elements of sales.by.month:

sales.by.month[2:8]
## [1] 100 200  50   0   0   0   0

So that’s kind of neat.
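The index vector does not have to be a run of consecutive numbers. As a small sketch (the object names feb.twice and even.months are made up for this example, and I am assuming you are happy to use the base R function seq()):

```r
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)

# An index may be repeated: February shows up twice in the result
feb.twice <- sales.by.month[ c(2, 2, 4) ]
feb.twice

# seq() builds regular patterns: every second month from the 2nd to the 8th
even.months <- sales.by.month[ seq(2, 8, by = 2) ]
even.months
```

In both cases R simply walks through the index vector and returns the elements in the order in which you asked for them.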

7.2.2 Using names

Remember from Section ?? how we added names to vector elements? Names aren’t purely cosmetic, since R allows you to pull out particular elements of the vector by referring to their names:

profit["Q1"]
##  Q1 
## 3.1

Also note that I (well, you; well, R) need the quotation marks:

profit[Q1]
## Error in eval(expr, envir, enclos): object 'Q1' not found

Exercise: Pull out the names by typing the command names(profit).

Perhaps unsurprisingly, you can use names to extract multiple elements from a vector.

profit[c("Q1","Q4")]
##  Q1  Q4 
## 3.1 1.1

7.2.3 Using logical indexing

At this point, I can introduce an extremely useful tool called logical indexing. What I’d like to do is to have R select the names of the months for which I sold any books. This is where logical indexing is handy. Here’s how it can be done.

Remember that earlier on, I created a vector sales.by.month that contained the number of books I sold each month, and a vector months that contains the names of each of the months.

sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
months <- c("January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November", 
            "December")

We will use these to answer my question.

The first step is to create a logical vector any.sales.this.month, whose elements are TRUE for any month in which I sold at least one book, and FALSE for all the others.

any.sales.this.month <- sales.by.month > 0 
any.sales.this.month
##  [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

So, any.sales.this.month is a logical vector whose elements are TRUE only if the corresponding element of sales.by.month is greater than zero. For instance, since I sold zero books in January, the first element is FALSE.

In the second step, I use the logical vector any.sales.this.month to select those elements out of the months variable for which any sales were made. It looks like this:

months[any.sales.this.month]
## [1] "February" "March"    "April"

To figure out which elements of months to include in the output, what R does is look to see if the corresponding element in any.sales.this.month is TRUE. Thus, since element 1 of any.sales.this.month is FALSE, R does not include "January" as part of the output; but since element 2 of any.sales.this.month is TRUE, R does include "February" in the output. So there you have it: the list of months in which I sold at least one book.

I showed you how to do it step by step, but in fact, I could have just done this, using a single line, and would have gotten exactly the same result:

months[sales.by.month > 0]
## [1] "February" "March"    "April"

Note that the sales.by.month > 0 is the same logical expression that we used to create the any.sales.this.month vector.

There’s no reason why I can’t use the same approach to find the actual sales numbers for those months. The command to do that would just be this:

sales.by.month [sales.by.month > 0]
## [1] 100 200  50

In fact, we can take the same approach with text. Here’s an example. Suppose I want to know the months for which the bookshop was out of my book. I could apply the logical indexing approach, but with the character vector stock.levels we defined earlier. Let’s refresh first:

stock.levels <- c("high", "high", "low", "out", "out", "high",
                "high", "high", "high", "high", "high", "high")
stock.levels
##  [1] "high" "high" "low"  "out"  "out"  "high" "high" "high" "high" "high"
## [11] "high" "high"

It could look something like this:

out.of.stock <- stock.levels == "out"
out.of.stock
##  [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
months[out.of.stock]
## [1] "April" "May"

or, if you want to do everything in one go:

months[stock.levels == "out"]
## [1] "April" "May"

Alternatively, if I want to know when the bookshop was either low on copies or out of copies, I could do this:

months[stock.levels == "out" | stock.levels == "low"]
## [1] "March" "April" "May"

or, equivalently,

months[stock.levels != "high"]
## [1] "March" "April" "May"

or, equivalently,

months[stock.levels %in% c("out","low")]
## [1] "March" "April" "May"

Either way, I get the answer I want.

At this point, I hope you can see why logical indexing is such a useful thing. It’s a very basic, yet very powerful way to manipulate data.

It does take a bit of practice to become completely comfortable using logical indexing.
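One way to get that practice is to look at the logical vector from a few different angles. The sketch below uses two base R functions, which() and sum(); the object name sold is made up for this example.

```r
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
months <- c("January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November",
            "December")

sold <- sales.by.month > 0   # the logical vector used as an index

which(sold)            # the positions of the TRUE values
sum(sold)              # TRUE counts as 1, so this counts the months with sales
months[ which(sold) ]  # numeric indexing with those positions
```

Indexing with which(sold) picks out the same months as indexing with sold directly; it just goes through the numeric positions first.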

One final thing to note is how NA (see Section XXX) works when indexing:

x <- c(1, 2, 3)
q <- c(TRUE, FALSE, NA)
x[q]
## [1]  1 NA
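In other words, an NA in the index gives you an NA in the output, because R cannot tell whether that element should be kept. If you would rather treat NA as "do not select", the following sketch shows two common ways of doing so. This is one possible approach, not the only one.

```r
x <- c(1, 2, 3)
q <- c(TRUE, FALSE, NA)

x[q]                  # the NA in q produces an NA in the result

# which() only returns the positions that are TRUE, so the NA is skipped
x[ which(q) ]

# Alternatively, turn the NA into FALSE before indexing
x[ q & !is.na(q) ]
```

Both alternatives return only the first element of x.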

7.2.4 Dropping elements from a vector using negative indices

Before moving on, there’s a nice trick worth mentioning: using negative values as indices. As explained above, we can use a vector of numbers to extract a set of elements that we would like to keep. For instance, suppose I want to keep only elements 2 and 3 from sales.by.month. I could do so like this:

sales.by.month[2:3]
## [1] 100 200

But suppose, on the other hand, that I have discovered that observations 2 and 3 are untrustworthy, and I want to keep everything except those two elements. To that end, R lets you use negative numbers to remove specific values, like so:

sales.by.month[ -(2:3) ]
##  [1]  0 50  0  0  0  0  0  0  0  0

The output here corresponds to element 1 of the original vector, followed by elements 4, 5, and so on. When all you want to do is remove a few cases, this is a very handy convention.

Exercise: Remove the 1st and 3rd elements of the vector profit. Print the complete profit vector as well to check your answer.

profit[ -c(1,3) ]

Of course, you can also drop elements using logical indexing. For example, if you only want to see the sales which were non-zero, you can drop the depressing zeros as follows:

sales.by.month[ sales.by.month != 0 ]
## [1] 100 200  50

Can you use names for that? You wish:

profit[ -c("Q1","Q3") ]
## Error in -c("Q1", "Q3"): invalid argument to unary operator

If you want to use names to drop elements from a vector, you need to do something nifty like this:

profit[ !names(profit) %in% c("Q1", "Q3") ]
##  Q2  Q4 
## 0.1 1.1
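An alternative that some people find easier to read is to first work out which names you want to keep, and then index by name. The sketch below uses the base R function setdiff(), which is my own addition here; the object name keep is made up for this example.

```r
profit <- c(Q1 = 3.1, Q2 = 0.1, Q3 = -1.4, Q4 = 1.1)

# setdiff(a, b) returns the elements of a that do not occur in b
keep <- setdiff(names(profit), c("Q1", "Q3"))
keep

# Index by the names that are left over
profit[ keep ]
```

The result contains the same two elements, Q2 and Q4, as the %in% version above.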

7.3 Extracting a subset of a matrix

7.3.1 Using square brackets and numeric indexing

So far, whenever I’ve been subsetting a vector, I’ve tended to use the square brackets [] to do so. You might be wondering whether it is possible to use the square brackets to subset a matrix. The answer, of course, is yes. Not only can you use square brackets for this purpose, as you become more familiar with R you’ll find that this is actually very useful.

Let’s assume that what we want to do is to pick out rows 2, 6 and 9 and columns 1 and 2 (variables age and gender) from the exptM matrix. How shall we do this? Since a matrix is basically a table, every element in the matrix has a row number and a column number. So, if we want to pick out a single element, we have to specify the row number and a column number within the square brackets. By convention, the row number comes first.

This means that, for a matrix which has, say, 2 rows and 3 columns, the numerical indexing scheme looks like this:

Table 7.1: The row and column version
Row     Col.1   Col.2   Col.3
Row 1   [1,1]   [1,2]   [1,3]
Row 2   [2,1]   [2,2]   [2,3]

Let’s now aim to pull out three rows (2, 6 and 9) and two columns (age and gender). This is fairly simple to do since R allows us to specify multiple rows and multiple columns.

exptM[ c(2,6,9), 1:2]
##      age gender
## [1,]  19      1
## [2,]  19      3
## [3,]  19      3

Note that if I only select one column, R will not print it as a column anymore:

exptM[ c(2,6,9), 2 ]
## [1] 1 3 3

R has printed the results horizontally, not vertically. The reason for this relates to how matrices are implemented. The original matrix exptM is treated as a two-dimensional object, containing 9 rows and 4 columns. However, whenever you pull out a single row or a single column, the result is considered to be one-dimensional. As far as R is concerned there’s no real reason to distinguish between a one-dimensional object printed vertically (a column) and a one-dimensional object printed horizontally (a row), and it prints them all out horizontally.
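If you do want to keep the result two-dimensional, you can ask R not to simplify it. The following is a minimal sketch under two assumptions: I rebuild exptM from the values printed in this section (your own exptM may have been created differently), and I use the drop argument of the square brackets, which is discussed in more detail for data frames later in this chapter.

```r
# A reconstruction of exptM from the output shown in this section (an assumption)
exptM <- cbind(age     = c(17, 19, 21, 37, 18, 19, 47, 18, 19),
               gender  = c(1, 1, 1, 2, 1, 3, 3, 3, 3),
               group.f = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
               score   = c(12, 10, 11, 15, 16, 14, 25, 21, 29))

one.col <- exptM[ c(2,6,9), 2 ]
dim(one.col)    # NULL: the result is a plain vector without dimensions

kept <- exptM[ c(2,6,9), 2, drop = FALSE ]
dim(kept)       # 3 rows and 1 column: still a matrix
kept
```

With drop = FALSE the column name gender is retained as well, because the result is still a matrix.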

7.3.2 Using square brackets and names

A second way to do the same thing is to use the names of the rows and columns. That is, instead of using the row numbers and column numbers, you use the character strings that are used as the labels for the rows and columns. To apply this idea to our exptM matrix, we would use a command like this:

exptM[ c(2,6,9), c("age", "gender") ]
##      age gender
## [1,]  19      1
## [2,]  19      3
## [3,]  19      3

Once again, this produces exactly the same output. Note that, although this version is more annoying to type than the previous version, it’s a bit easier to read, because it’s often more meaningful to refer to the elements by their names rather than their numbers. If we had assigned names to our rows, we could have used those as well.

Also note that I (well, you; well, R) need the quotation marks. This command

exptM[ c(2,6,9), "age" ]
## [1] 19 19 19

selects the column called age, whereas this does not:

exptM[ c(2,6,9), age ]
## Error in exptM[c(2, 6, 9), age]: subscript out of bounds

The reason is that in the second case, R is looking for (and when found, using) the object called age, whereas in the first it knows it just needs to look for the name age.

7.3.3 Using square brackets and logical indexing

Finally, both the rows and columns can be indexed using logical vectors as well. For example, although I claimed earlier that my goal was to extract rows 2, 6 and 9, what I really wanted to do was select the 19 year olds. So what I could have done is create a logical vector that indicates which rows correspond to 19yos:

is.19yo <- exptM[,1]==19 
is.19yo
## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE

Okay, I must admit I have lost some sleep over this. To understand what is going on, you actually should have read Section 7.3.4 first. But moving the current section after that section would kind of break the elegance of the composition. So I am leaving it here, but you might want to revisit it after having read Section 7.3.4.

As you can see, the 2nd, 6th and 9th elements of this vector are TRUE while the others are FALSE. Now that I’ve constructed this “indicator” variable, what I can do is use this vector to select the rows that I want to keep:

exptM[ is.19yo, c("age", "gender") ]
##      age gender
## [1,]  19      1
## [2,]  19      3
## [3,]  19      3

And of course, the output is, yet again, the same.
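Just as with vectors, you do not have to store the logical vector first. The sketch below does the selection in one go. It assumes a copy of exptM rebuilt from the values printed in this section, and the object name res is made up for this example.

```r
# A reconstruction of exptM from the output shown in this section (an assumption)
exptM <- cbind(age     = c(17, 19, 21, 37, 18, 19, 47, 18, 19),
               gender  = c(1, 1, 1, 2, 1, 3, 3, 3, 3),
               group.f = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
               score   = c(12, 10, 11, 15, 16, 14, 25, 21, 29))

# Rows: those where age equals 19. Columns: age and gender, selected by name.
res <- exptM[ exptM[ , "age"] == 19, c("age", "gender") ]
res
```

The inner expression exptM[ , "age"] == 19 produces the same logical vector as is.19yo; it is just never given a name.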

7.3.4 Going blank

What if you want to keep all of the rows, or all of the columns? This is a prime example of less is more: By giving fewer numbers as input, you get more output. To do this, all we have to do is leave the corresponding entry blank, but it is crucial to remember to keep the comma! In particular, exptM[2,] pulls out the entire 2nd row, and exptM[,3] pulls out the entire 3rd column. Just watch.

exptM[ 2 ,  ]
##     age  gender group.f   score 
##      19       1       1      10
exptM[ , 3  ]
## [1] 1 1 1 2 2 2 3 3 3

You can pull out more than a single column or row at once:

exptM[ , 1:2 ]
##       age gender
##  [1,]  17      1
##  [2,]  19      1
##  [3,]  21      1
##  [4,]  37      2
##  [5,]  18      1
##  [6,]  19      3
##  [7,]  47      3
##  [8,]  18      3
##  [9,]  19      3

Alternatively, if I want to keep all the columns but only want some of the rows, I use the same trick, but this time I leave the second index blank.

For example, to select the 5th and 6th row of exptM, while keeping all the columns, you could do as follows

exptM[5:6, ]
##      age gender group.f score
## [1,]  18      1       2    16
## [2,]  19      3       2    14

Don’t be mistaken: the fact that one element is empty does not mean it does not do anything. Quite to the contrary: it does a lot, by signaling a whole row or column is selected.

7.3.5 Dropping elements from a matrix using negative indices

I feel I should note that it’s still okay to use negative indexes as a way of telling R to delete certain rows or columns, just like I told you when discussing subsetting vectors. For instance, if I want to delete the 3rd column, then I use this command:

exptM[ , -3 ] 
##       age gender score
##  [1,]  17      1    12
##  [2,]  19      1    10
##  [3,]  21      1    11
##  [4,]  37      2    15
##  [5,]  18      1    16
##  [6,]  19      3    14
##  [7,]  47      3    25
##  [8,]  18      3    21
##  [9,]  19      3    29

whereas if I want to delete the 3rd and 5th rows, then I’d use this one:

exptM[ -c(3,5),  ] 
##      age gender group.f score
## [1,]  17      1       1    12
## [2,]  19      1       1    10
## [3,]  37      2       2    15
## [4,]  19      3       2    14
## [5,]  47      3       3    25
## [6,]  18      3       3    21
## [7,]  19      3       3    29

So that’s nice.

7.3.6 FYI: Using a single index

Above, we have always been using two indexes, one for the row and one for the column (and even if we did not, we left one of them empty). For completeness (but this is not for studying), I have to mention that there is also a way of using only a single index to extract an element from a matrix. The single-index approach is illustrated in Table 7.2. The value of each cell is its index.

Table 7.2: The single-index version
Row     Col.1   Col.2   Col.3
Row 1   1       3       5
Row 2   2       4       6

Confirm that exptM[2,4] (using the double index approach) is identical to exptM[29] (using the single index approach).

exptM[2,4]
## score 
##    10
exptM[29]
## [1] 10

You don’t need to study this. I just tell you this because I want you to avoid questioning your sanity if you ever come across a matrix indexed by a single number.
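If you are curious where the 29 comes from: R counts down the first column, then down the second column, and so on. The sketch below spells out that arithmetic. It assumes a copy of exptM rebuilt from the values printed in this section, and the object names target.row, target.col and single are made up for this example.

```r
# A reconstruction of exptM from the output shown in this section (an assumption)
exptM <- cbind(age     = c(17, 19, 21, 37, 18, 19, 47, 18, 19),
               gender  = c(1, 1, 1, 2, 1, 3, 3, 3, 3),
               group.f = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
               score   = c(12, 10, 11, 15, 16, 14, 25, 21, 29))

target.row <- 2
target.col <- 4

# Skip the full columns that come before the target column, then count down the rows
single <- (target.col - 1) * nrow(exptM) + target.row
single            # (4 - 1) * 9 + 2 = 29

exptM[ single ]   # the same value as exptM[2,4]
```

The only difference from exptM[2,4] is that the single-index result no longer carries the column name.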

7.4 Extracting a subset of a data frame

In this section, we turn to the question of how to subset a data frame rather than a vector or matrix. To that end, the first thing I should point out is that, if all you want to do is subset one of the variables inside the data frame, then, as per Section XXX, the $ operator is your friend (as long as you are after a column, which is the common way to store variables). For instance, suppose I’m working with the itng data frame. I can restrict myself to just the utterances as follows:

itng$utterance
##  [1] "pip" "pip" "onk" "onk" "ee"  "oo"  "pip" "pip" "onk" "onk"

It is possible, but unneeded, and therefore seldom used, to do this:

itng$"utterance"
##  [1] "pip" "pip" "onk" "onk" "ee"  "oo"  "pip" "pip" "onk" "onk"

While fun, this approach is limited to situations where we want all cases from a single variable only.

Apart from that, things are pretty similar to how we dealt with matrices (see Section XXX), although there are some differences, as you will discover below.

7.4.1 Using square brackets and numeric indexing

So far, whenever I’ve been subsetting a vector or matrix I’ve tended to use the square brackets [] to do so. Given there are similarities between data frames and matrices, you might be wondering whether it is possible to use the square brackets to subset a data frame. The answer, of course, is yes. Not only can you use square brackets for this purpose, as you become more familiar with R you’ll find that this is actually very useful. Unfortunately, the use of square brackets for this purpose is somewhat complicated and can be very confusing to novices. So be warned: this section is more complicated than it feels like it “should” be. With that warning in place, I’ll try to walk you through it slowly.

Let’s assume that what we want to do is to pick out rows 5 and 6 (the two cases when Tombliboo is speaking), and columns 1 and 2 (variables speaker and utterance) from the garden data frame. How shall we do this? Since a data frame is basically a table, every element in the data frame has a row number and a column number. So, if we want to pick out a single element, we have to specify the row number and a column number within the square brackets. By convention, the row number comes first.

This means that, for a data frame which has, say, 5 rows and 3 columns, the numerical indexing scheme looks like this, much like you would expect for a matrix:

row   col1    col2    col3
1     [1,1]   [1,2]   [1,3]
2     [2,1]   [2,2]   [2,3]
3     [3,1]   [3,2]   [3,3]
4     [4,1]   [4,2]   [4,3]
5     [5,1]   [5,2]   [5,3]

Let’s now aim to solve our problem, which is to pull out two rows (5 and 6) and two columns (1 and 2). This is fairly simple to do since R allows us to specify multiple rows and multiple columns.

Exercise: Pull out two rows (5 and 6) and two columns (1 and 2) from the data frame garden. You can use print() to make it look nice.

# The `:` operator can be used to select more than one element.
print( garden[ 5:6, 1:2 ] )

Clearly, that’s exactly what we asked for: the output here is a data frame containing two variables and two cases. Note that I could have gotten the same answer if I’d used the c() function to produce my vectors rather than the : operator. That is, the following command is equivalent to the last one:

print( garden[ c(5,6), c(1,2) ] )
##          speaker utterance
## case.5 tombliboo        ee
## case.6 tombliboo        oo

It’s just not as pretty. However, if the columns and rows that you want to keep don’t happen to be next to each other in the original data frame, then you might find that you have to resort to using commands like garden[ c(2,4,5), c(1,3) ] to extract them.

7.4.2 Using square brackets and names

A second way to do the same thing is to use the names of the rows and columns. That is, instead of using the row numbers and column numbers, you use the character strings that are used as the labels for the rows and columns. To apply this idea to our garden data frame, we would use a command like this:

print( garden[ c("case.5", "case.6"), c("speaker", "utterance") ] )
##          speaker utterance
## case.5 tombliboo        ee
## case.6 tombliboo        oo

Once again, this produces exactly the same output. Note that, although this version is more annoying to type than the previous version, it’s a bit easier to read, because it’s often more meaningful to refer to the elements by their names rather than their numbers.

Also, note that you don’t have to use the same convention for the rows and columns. For instance, I often find that the variable names are meaningful and so I sometimes refer to them by name, whereas the row names are pretty arbitrary so it’s easier to refer to them by number. In fact, that’s more or less exactly what’s happening with the garden data frame.

Exercise: Pull out two rows (5 and 6) and two columns (speaker and utterance) from the data frame garden, using the variable names for the columns and referring to the rows by number.

print( garden[ 5:6, c("speaker", "utterance") ] )

Again, the output is identical.

As with matrices, quotation marks are strictly needed:

print( garden[ "case.7", c(1,2) ] )
##           speaker utterance
## case.7 makkapakka       pip

vs

print( garden[ case.7, c(1,2) ] )
## Error in `[.data.frame`(garden, case.7, c(1, 2)): object 'case.7' not found

7.4.3 Using square brackets and logical indexing

Finally, both the rows and columns can be indexed using logical vectors as well. For example, although I claimed earlier that my goal was to extract cases 5 and 6, it’s pretty obvious that what I really wanted to do was select the cases where Tombliboo is speaking. So what I could have done is create a logical vector that indicates which cases correspond to Tombliboo speaking:

is.TB.speaking <- garden$speaker == "tombliboo"
is.TB.speaking
##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE

As you can see, the 5th and 6th elements of this vector are TRUE while the others are FALSE. Now that I’ve constructed this “indicator” variable, what I can do is use this vector to select the rows that I want to keep:

print( garden[ is.TB.speaking, c("speaker", "utterance") ] )
##          speaker utterance
## case.5 tombliboo        ee
## case.6 tombliboo        oo

And of course, the output is, yet again, the same.
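As with vectors and matrices, the two steps can be combined into a single command. The sketch below assumes a copy of the garden data frame rebuilt from the values printed in this chapter (your own garden may have been created differently), and the object name tb is made up for this example.

```r
# A reconstruction of garden from the output shown in this chapter (an assumption)
garden <- data.frame(
  speaker   = c(rep("upsydaisy", 4), rep("tombliboo", 2), rep("makkapakka", 4)),
  utterance = c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk"),
  mood      = c(2, 1, 1, 3, 2, 2, 1, 1, 2, 3),
  row.names = paste0("case.", 1:10),
  stringsAsFactors = FALSE
)

# Rows: the cases where Tombliboo speaks. Columns: selected by name.
tb <- garden[ garden$speaker == "tombliboo", c("speaker", "utterance") ]
print( tb )
```

Note that the row names case.5 and case.6 travel along with the selected rows, so you can still see which cases were picked.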

7.4.4 Going blank

What if you want to keep all of the rows, or all of the columns? If you have been paying attention when we were using matrices, you know exactly what to do. To do this, all we have to do is leave the corresponding entry blank, but it is crucial to remember to keep the comma! For instance, suppose I want to keep all the rows in the garden data, but I only want to retain the first two columns. The easiest way to do this is to use a command like this :

print( garden[ , 1:2 ] )
##            speaker utterance
## case.1   upsydaisy       pip
## case.2   upsydaisy       pip
## case.3   upsydaisy       onk
## case.4   upsydaisy       onk
## case.5   tombliboo        ee
## case.6   tombliboo        oo
## case.7  makkapakka       pip
## case.8  makkapakka       pip
## case.9  makkapakka       onk
## case.10 makkapakka       onk

Alternatively, if I want to keep all the columns but only want some of the rows, I use the same trick, but this time I leave the second index blank.

Exercise: Select the 5th and 6th row of garden, while keeping all the columns.

print( garden[5:6, ] )

Of course, you can also use names. For example:

print( garden[, c("speaker","mood") ] )
##            speaker mood
## case.1   upsydaisy    2
## case.2   upsydaisy    1
## case.3   upsydaisy    1
## case.4   upsydaisy    3
## case.5   tombliboo    2
## case.6   tombliboo    2
## case.7  makkapakka    1
## case.8  makkapakka    1
## case.9  makkapakka    2
## case.10 makkapakka    3

Note again that this doesn’t work:

print( garden[, c(speaker,mood) ] )
## Error in `[.data.frame`(garden, , c(speaker, mood)): object 'mood' not found

7.4.5 Dropping elements from a data frame using negative indices

I feel I should note that it’s still okay to use negative indexes as a way of telling R to delete certain rows or columns, just like I told you when discussing subsetting vectors and matrices. For instance, if I want to delete the 3rd column, then I use this command:

print( garden[ , -3 ] )
##            speaker utterance
## case.1   upsydaisy       pip
## case.2   upsydaisy       pip
## case.3   upsydaisy       onk
## case.4   upsydaisy       onk
## case.5   tombliboo        ee
## case.6   tombliboo        oo
## case.7  makkapakka       pip
## case.8  makkapakka       pip
## case.9  makkapakka       onk
## case.10 makkapakka       onk

whereas if I want to delete the 3rd and 5th rows, then I’d use this one:

print( garden[ -c(3,5),  ] )
##            speaker utterance mood
## case.1   upsydaisy       pip    2
## case.2   upsydaisy       pip    1
## case.4   upsydaisy       onk    3
## case.6   tombliboo        oo    2
## case.7  makkapakka       pip    1
## case.8  makkapakka       pip    1
## case.9  makkapakka       onk    2
## case.10 makkapakka       onk    3

So that’s nice.

7.4.6 More than you wished you knew on the double index approach

The “double index” approach, where you specify (or leave blank) what you want for the row and for the column, is fairly straightforward. Or so you think. There is a fairly useful elaboration on this double index approach that I should point out: something called dropping.31

At this point, some of you might be wondering why I’ve been so terribly careful to choose my examples in such a way as to ensure that the output always has multiple rows and multiple columns. The reason for this is that I’ve been trying to hide the somewhat curious “dropping” behaviour that R produces when the output only has a single column. I’ll start by showing you what happens, and then I’ll try to explain it.

Firstly, let’s have a look at what happens when the output contains only a single row:

print(garden[ 5, ])
##          speaker utterance mood
## case.5 tombliboo        ee    2

This is exactly what you’d expect to see: a data frame containing three variables, and only one case per variable. Okay, no problems so far. What happens when you ask for a single column? Suppose, for instance, I try this as a command:

print(garden[ , 3 ])

Please, R? Are you being serious right now? You cold-hearted manipulative little liar. Based on everything that I’ve shown you so far, you would be well within your rights to expect to see R produce a data frame containing a single variable (i.e., mood) and ten cases. After all, that is pretty consistent with everything else that I’ve shown you so far about how square brackets work. In other words, you should expect to see this:

       mood
case.1    2
case.2    1
case.3    1
case.4    3
case.5    2
case.6    2
case.7    1
case.8    1
case.9    2
case.10   3

However, that is emphatically not what happens.

Exercise: See what you get when selecting the 3rd column of garden.

print(garden[ , 3 ])

That output is not a data frame at all! That’s just an ordinary, plain old numeric vector containing 10 elements. Before you start thinking that R can’t love anyone, cause that would mean it’d have a heart, let us hear its take on the story. As any person with a narcissistic personality will tell you, R would probably say that although it does not look like it, what’s going on here is that R is trying to be smart and helpful. Now, R has “noticed” that the output that we’ve asked for doesn’t really “need” to be wrapped up in a data frame at all, because it only corresponds to a single variable. So what it does is “drop” the output from a data frame containing a single variable, “down” to a simpler output that corresponds to that variable. This behaviour is actually very convenient for day to day usage once you’ve become familiar with it and I suppose that’s the real reason why R does this – but there’s no escaping the fact that it is deeply confusing to novices.

It’s especially confusing because the behaviour appears only for a very specific case: (a) it only works for columns and not for rows (because the columns correspond to variables and the rows do not), and (b) it only applies to the “double index” version of the square brackets we have been using so far, and not to the subset() function we will discuss in Section XXX,32 or to the “single index” use of the square brackets (as we will discover in Section XXX). As I say, it’s very confusing when you’re just starting out. For what it’s worth, you can suppress this behaviour if you want, by setting drop = FALSE when you construct your bracketed expression. That is, you could do something like this:

print(garden[ , 3, drop = FALSE ])
##         mood
## case.1     2
## case.2     1
## case.3     1
## case.4     3
## case.5     2
## case.6     2
## case.7     1
## case.8     1
## case.9     2
## case.10    3

I suppose that helps a little bit, in that it gives you some control over the dropping behaviour, but I’m not sure it helps to make things any easier to understand. Anyway, that’s the “dropping” special case. Fun, isn’t it?
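If you are ever unsure whether dropping has happened, you can simply ask R what kind of object you ended up with. The sketch below assumes a copy of the garden data frame rebuilt from the values printed in this chapter, and the object names dropped and kept are made up for this example.

```r
# A reconstruction of garden from the output shown in this chapter (an assumption)
garden <- data.frame(
  speaker   = c(rep("upsydaisy", 4), rep("tombliboo", 2), rep("makkapakka", 4)),
  utterance = c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk"),
  mood      = c(2, 1, 1, 3, 2, 2, 1, 1, 2, 3),
  row.names = paste0("case.", 1:10),
  stringsAsFactors = FALSE
)

dropped <- garden[ , 3 ]                 # default behaviour: the data frame is dropped
kept    <- garden[ , 3, drop = FALSE ]   # dropping suppressed

is.data.frame(dropped)   # FALSE: this is a plain numeric vector
is.data.frame(kept)      # TRUE: still a data frame, with one column
dim(kept)                # 10 rows and 1 column
```

The numbers are the same in both objects; what differs is the wrapper around them.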

7.4.7 FYI: Using a single index

Again, I will mention the single index approach, just to prepare you for what you might encounter in the wild one day, but for the purposes of this course, you can forget about it.

Like with matrices, you can also use a single index instead of a double index (so not even a blank space). In fact, there are two ways of doing it. One is with a single pair of square brackets [] and one is with a double pair [[ ]].

7.4.7.1 Using a single index with a single pair of brackets

What happens if you use a single index and a single pair of brackets?

Well, R will assume you want the corresponding columns, not the rows. Do not be fooled by the fact that this second method also uses square brackets: it behaves differently to the rather straightforward “double index” method that I’ve discussed in the last few sections. Again, what I’ll do is show you what happens first, and then I’ll try to explain why it happens afterwards. To that end, let’s start with the following command:

print( garden[ 1:2 ] )
##            speaker utterance
## case.1   upsydaisy       pip
## case.2   upsydaisy       pip
## case.3   upsydaisy       onk
## case.4   upsydaisy       onk
## case.5   tombliboo        ee
## case.6   tombliboo        oo
## case.7  makkapakka       pip
## case.8  makkapakka       pip
## case.9  makkapakka       onk
## case.10 makkapakka       onk

As you can see, the output gives me the first two columns, much as if I’d typed garden[,1:2] using the double-index approach. It doesn’t give me the first two rows, which is what I’d have gotten if I’d used a command like garden[1:2,]. So it seems that garden[ 1:2 ] could be treated as a (potentially handy) shorthand for garden[, 1:2 ].

Building off that insight, you might expect that garden[ 3 ] is a shorthand for garden[, 3]. Well, in the spirit of keeping things more complicated than needed, it is not. Let’s see what happens if I ask for a single column:

print( garden[3] )
##         mood
## case.1     2
## case.2     1
## case.3     1
## case.4     3
## case.5     2
## case.6     2
## case.7     1
## case.8     1
## case.9     2
## case.10    3

and compare it to

print( garden[, 3] )
##  [1] 2 1 1 3 2 2 1 1 2 3

Unlike what happens when I type garden[, 3], R does not drop the output here. This is entirely consistent with what I said earlier: the only case where dropping occurs by default is when you use the double index version of the square brackets, and the output happens to correspond to a single column.

7.4.7.2 Using a single index with a double pair of brackets

Wait, you must be thinking, there must be more to this? It should be possible to make things even more complicated? Good thinking! There is something like single index “double brackets” notation: [[ ]]. Let’s find out what it does, shall we?

print( garden[[3]] )
##  [1] 2 1 1 3 2 2 1 1 2 3

So using this notation, you force R to drop the output. Note that R will only allow you to ask for one column at a time using the double brackets. If you try to ask for multiple columns in this way, you get completely different behaviour,33 which may or may not produce an error, but definitely won’t give you the output you’re expecting. The only reason I’m mentioning it at all is that you might run into double brackets when doing further reading, and a lot of books don’t explicitly point out the difference between [ and [[. However, I promise that I won’t be using [[ anywhere else in this book.
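
To see the three notations side by side, here is a minimal sketch. The data frame toy is a hypothetical stand-in that I rebuilt from a few rows of the printed garden output, so that the example runs on its own.

```r
# toy is a hypothetical stand-in for garden, rebuilt from the printed output
toy <- data.frame(
  speaker   = c("upsydaisy", "upsydaisy", "tombliboo", "makkapakka"),
  utterance = c("pip", "onk", "ee", "pip"),
  mood      = c(2, 1, 2, 1)
)

single   <- toy[3]     # single index, single brackets: still a data frame
double   <- toy[, 3]   # double index: dropped to a plain vector
brackets <- toy[[3]]   # single index, double brackets: also a plain vector

class(single)
class(double)
class(brackets)
identical(double, brackets)   # both hold the same numeric vector
```

The first result keeps its column name and row names, while the other two are just the numbers.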

7.5 Using the subset() function

Using smart indexing is a useful way of getting info from a vector, matrix or data frame. However, it can become clunky sometimes. There are several different ways to subset a data frame in R, some easier than others. I’ll only discuss the subset() function, which is probably the conceptually simplest way to do it. For our purposes there are three different arguments that you’ll be most interested in:

  • x. The matrix or data frame that you want to subset.
  • subset. A vector of logical values indicating which cases (rows) of the matrix or data frame you want to keep. By default, all cases will be retained.
  • select. This argument indicates which variables (columns) in the matrix or data frame you want to keep. This can either be a list of variable names, or a logical vector indicating which ones to keep, or even just a numeric vector containing the relevant column numbers. By default, all variables will be retained.

Note that the function called subset() has an argument called subset. This is quite unusual, but nothing to frown upon.

7.5.1 subset() a matrix

Let’s start with an example:

subset(x=exptM, subset=age>18)
##      age gender group.f score
## [1,]  19      1       1    10
## [2,]  21      1       1    11
## [3,]  37      2       2    15
## [4,]  19      3       2    14
## [5,]  47      3       3    25
## [6,]  19      3       3    29
subset(x=exptM, subset=age>18 & score>20)
##      age gender group.f score
## [1,]  47      3       3    25
## [2,]  19      3       3    29
subset(x=exptM, subset=age>18 & score>20, select = c(age, score))
##      age score
## [1,]  47    25
## [2,]  19    29

It’s pretty self-explanatory. The only thing worth drawing your attention to is that the output is still a matrix. Matrix in, matrix out.
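
One detail that the example above quietly relies on, as far as I can tell: for a matrix, the condition passed to the subset argument is not looked up inside the matrix itself, so vectors called age and score must also exist in the workspace. A minimal sketch that sidesteps this by referring to the matrix column explicitly (toyM is a made-up matrix, not exptM):

```r
# toyM is a hypothetical small matrix, used only for illustration
toyM <- cbind(age = c(17, 19, 47), score = c(12, 10, 25))

# refer to the column through the matrix itself in the condition
res <- subset(x = toyM, subset = toyM[, "age"] > 18, select = c(age, score))
res
is.matrix(res)   # matrix in, matrix out
```

The select argument, on the other hand, does accept the bare column names.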

7.5.2 subset() a data frame

Suppose that I want to subset the itng data frame, keeping only the utterances made by Makka-Pakka. What that means is that I need to use the select argument to pick out the utterance variable, and I also need to use the subset argument, to pick out the cases when Makka-Pakka is speaking (i.e., speaker == "makkapakka").

Exercise: Subset the itng data frame using the subset() function, keeping only Makka-Pakka’s utterances. Specify the x, subset, and select arguments. Assign the result to a variable named df. Print the result.

# Read the text above the exercise again, to know what to assign to which argument of the subset() function.
df <- subset( x = itng,                            # data frame is itng
              subset = speaker == "makkapakka",   # keep only Makka-Pakka's speech
              select = utterance )                 # keep only the utterance variable
print( df )

The variable df here is still a data frame, but it only contains one variable (called utterance) and four cases. Notice that the row numbers are actually the same ones from the original data frame.

It’s worth taking a moment to briefly explain this. The reason that this happens is that these “row numbers” are actually row names. When you create a new data frame from scratch, R will assign each row a fairly boring row name, which is identical to the row number. However, when you subset the data frame, each row keeps its original row name. This can be quite useful, since – as in the current example – it provides you with a visual reminder of what each row in the new data frame corresponds to in the original data frame. However, if it annoys you, you can change the row names using the rownames() function.34
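
Here is a small sketch of what that looks like. The data frame toy is a hypothetical stand-in for itng, rebuilt from the speaker and utterance values printed elsewhere in this book, and assigning NULL is just one way of resetting the row names.

```r
# toy is a hypothetical stand-in for itng
toy <- data.frame(
  speaker   = rep(c("upsydaisy", "tombliboo", "makkapakka"), times = c(4, 2, 4)),
  utterance = c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk")
)

mp <- subset(x = toy, subset = speaker == "makkapakka", select = utterance)
old.names <- rownames(mp)   # the row names inherited from toy
old.names

rownames(mp) <- NULL        # throw the old names away; R renumbers from 1
rownames(mp)
```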

In any case, let’s return to the subset() function, and look at what happens when we don’t use all three of the arguments. Firstly, suppose that we didn’t bother to specify the select argument.

Exercise: Again, subset the itng data frame, keeping only Makka-Pakka’s utterances, but do not specify the select argument.

df2 <- subset( x = itng,
        subset = speaker == "makkapakka" )
print(df2)

Not surprisingly, R has kept the same cases from the original data set (i.e., rows 7 through 10), but this time it has kept all of the variables from the data frame.

Further, note that we could also spell out the variable in full, as itng$speaker

df2 <- subset( x = itng,
        subset = itng$speaker == "makkapakka" )
print(df2)
##       speaker utterance
## 7  makkapakka       pip
## 8  makkapakka       pip
## 9  makkapakka       onk
## 10 makkapakka       onk

but speaker on its own is enough: because we supplied x = itng, the subset() function knows that by speaker we mean itng$speaker.

Exercise: What if you don’t specify the subset argument?

df3 <- subset( x = itng, 
         select = utterance )
print(df3)

Equally unsurprisingly, if we don’t specify the subset argument, what we find is that R keeps all of the cases. Again, it’s important to note that this output is still a data frame: it’s just a data frame with only a single variable.

7.6 Go out and play

If you have done and understood all of this, you can find some additional exercises in the RIntroExercises document. Since you now know how to work in RStudio, why not go there and work through them?

8 More data handling

8.1 Transforming a variable

An important topic to discuss is the idea of transforming a variable. Taken literally, anything you do to a variable is a transformation, but in practice what it usually means is that you apply a relatively simple mathematical function to the original variable, in order to create a new variable.

The good news is that you already know how to do variable transformations. To see this, let’s go through an example. Suppose I’ve run a short study in which I ask 10 people a single question:

On a scale of 1 (strongly disagree) to 7 (strongly agree), to what extent do you agree with the proposition that “Dinosaurs are awesome”?

Now let’s load and look at the data. The data consist of a single variable that contains the raw Likert-scale responses:

likert.raw <- c(1, 7, 3, 4, 4, 4, 2, 6, 5, 5)

However, if you think about it, this isn’t the best way to represent these responses. Because of the fairly symmetric way that we set up the response scale, there’s a sense in which the midpoint of the scale should have been coded as 0 (no opinion), and the two endpoints should be \(+3\) (strongly agree) and \(-3\) (strongly disagree). By recoding the data in this way, it’s a bit more reflective of how we really think about the responses. The recoding here is trivially easy: we just subtract 4 from the raw scores:

Exercise: Recode the variable likert.raw by subtracting 4 from the raw scores, and assign it to a variable called likert.centred. Print the result.

likert.centred <- likert.raw - 4
likert.centred

One reason why it might be useful to have the data in this format is that there are a lot of situations where you might prefer to analyse the strength of the opinion separately from the direction of the opinion. We can do two different transformations on this likert.centred variable in order to distinguish between these two different concepts.

Firstly, to compute an opinion.strength variable, we want to take the absolute value of the centred data.

Exercise: Take the absolute value of the centred data likert.centred using the abs() function, and assign it to a variable called opinion.strength. Print the result.

opinion.strength <- abs( likert.centred )
opinion.strength

Secondly, to compute a variable that contains only the direction of the opinion and ignores the strength, we can use the sign() function. If you type ?sign you’ll see that this function is really simple: all negative numbers are converted to \(-1\), all positive numbers are converted to \(1\) and zero stays as \(0\).

Exercise: Apply the sign() function to the variable likert.centred, and assign it to a variable named opinion.dir. Print the result.

opinion.dir <- sign( likert.centred )
opinion.dir

And we’re done. We now have three shiny new variables, all of which are useful transformations of the original likert.raw data. All of this should seem pretty familiar to you. The tools that you use to do regular calculations in R (e.g., Chapters 2 and ??) are very much the same ones that you use to transform your variables!

The variable I transformed (likert.raw) wasn’t inside a data frame. I’ve done this to keep the explanation simple, though in real life it almost certainly would be. Before moving on, you might (or might not; hey!, it’s your life) be curious to see what these calculations look like if the data had started out in a data frame. The following example does all of the calculations using variables inside a data frame, and stores the newly created variables inside it as well:

df <- data.frame( likert.raw )                   # create data frame
df$likert.centred <- df$likert.raw - 4           # create centred data
df$opinion.strength <- abs( df$likert.centred )  # create strength variable
df$opinion.dir <- sign( df$likert.centred )      # create direction variable
print(df)                                        # print the final data frame:
##    likert.raw likert.centred opinion.strength opinion.dir
## 1           1             -3                3          -1
## 2           7              3                3           1
## 3           3             -1                1          -1
## 4           4              0                0           0
## 5           4              0                0           0
## 6           4              0                0           0
## 7           2             -2                2          -1
## 8           6              2                2           1
## 9           5              1                1           1
## 10          5              1                1           1

In other words, the commands you use are basically the same ones as before: it’s just that every time you want to read a variable from the data frame or write to the data frame, you use the $ operator.

8.2 Tabulating data

A very common task when analysing data is the construction of frequency tables or cross-tabulation of one variable against another. There are several functions that you can use in R for that purpose. In this section I’ll illustrate the use of the table() and xtabs() functions (and will end with a brief shout out to tabulate()), though there are other options available, such as ftable() (not discussed in this book).

8.2.1 Creating tables from a vector

To illustrate what this is all about, I will make use of the speaker vector from the nightgarden data set. With these as my data, one task I might find myself needing to do is construct a frequency count of the number of words each character speaks during the show. The table() function provides a simple way to do this. The basic usage of the table() function is as follows:

table(speaker)
## speaker
## makkapakka  tombliboo  upsydaisy 
##          4          2          4

The output here tells us on the first line that what we’re looking at is a tabulation of the speaker variable. On the second line, it lists all the different speakers that exist in the data, and on the third line, it tells you how many times that speaker appears in the data. In other words, it’s a frequency table.

As usual, you can assign this output to a variable. If you type speaker.freq <- table(speaker) at the command prompt R will store the table as a variable. If you then type class(speaker.freq) you’ll see that the output is actually of class table.

speaker.freq <- table(speaker)
class(speaker.freq)
## [1] "table"

The key thing to note about a table object is that it’s basically a named vector:

speaker.freq[3]
## upsydaisy 
##         4
names(speaker.freq)
## [1] "makkapakka" "tombliboo"  "upsydaisy"
speaker.freq + 3
## speaker
## makkapakka  tombliboo  upsydaisy 
##          7          5          7

Notice that in the command above I didn’t name the argument, since table() is a function that makes use of unnamed arguments. You just type in a list of the variables that you want R to tabulate, and it tabulates them. For instance, if I type in the name of two variables, what I get as the output is a cross-tabulation.

Exercise: Obtain the cross-tabulation for speaker and utterance.

table(speaker, utterance)

When interpreting this table, remember that these are counts: so the fact that the first row and second column corresponds to a value of 2 indicates that Makka-Pakka (row 1) says “onk” (column 2) twice in this data set. As you’d expect, you can produce three-way or higher-order cross-tabulations just by adding more objects to the list of inputs. However, I won’t discuss that in this section.
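
If you want to pull a single count out of such a table rather than read it off the screen, you can index the table by name, much like a matrix. A minimal sketch, in which the two vectors are retyped from the speaker and utterance values of the itng data so that the example is self-contained:

```r
# the speaker and utterance values, retyped so the example stands alone
speaker   <- rep(c("upsydaisy", "tombliboo", "makkapakka"), times = c(4, 2, 4))
utterance <- c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk")

tab <- table(speaker, utterance)
dim(tab)                        # 3 speakers by 4 distinct utterances
tab["makkapakka", "onk"]        # the count for one specific cell
```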

The tabulation commands discussed so far all construct a table of raw frequencies: that is, a count of the total number of cases that satisfy certain conditions. However, often you want your data to be organised in terms of proportions rather than counts. This is where the prop.table() function comes in handy.

Let’s see how this works. Note that we need to first call table() before we can call prop.table().

prop.table(speaker) # doesn't work!!!
## Error in sum(x): invalid 'type' (character) of argument
prop.table(table(speaker)) #does work
## speaker
## makkapakka  tombliboo  upsydaisy 
##        0.4        0.2        0.4

We see that all proportions sum to 1, as they should.

Of course, this is identical to

n <- length(speaker) #number of cases
table(speaker)/n
## speaker
## makkapakka  tombliboo  upsydaisy 
##        0.4        0.2        0.4

so it is not entirely useful so far. But we are still glad to have you around, prop.table()!

One final function I want to mention is the tabulate() function, since this is actually the low-level function that does most of the hard work. It takes a numeric vector as input, and returns how often each of the integers 1, 2, 3, and so on (up to the largest value in the data) occurs:

some.data <- c(1,2,3,1,1,3,1,1,2,8,3,1,2,4,2,3,5,2)
tabulate(some.data)
## [1] 6 5 4 1 1 0 0 1
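
To make the difference with table() concrete, here is a short comparison on the same data. It is only meant as an illustration: tabulate() reports a count for every integer from 1 up to the maximum, zeros included, whereas table() lists only the values that actually occur.

```r
some.data <- c(1,2,3,1,1,3,1,1,2,8,3,1,2,4,2,3,5,2)

counts.all  <- tabulate(some.data)   # one count per integer 1, 2, ..., 8
counts.seen <- table(some.data)      # only the values present in the data

counts.all
counts.seen
names(counts.seen)                   # note that 6 and 7 are missing here
```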

8.2.2 Creating tables from a matrix

The table() function also works on a matrix, but I see limited use for this:

#as a reminder, let's have a look at the good ol' exptM again
exptM
##       age gender group.f score
##  [1,]  17      1       1    12
##  [2,]  19      1       1    10
##  [3,]  21      1       1    11
##  [4,]  37      2       2    15
##  [5,]  18      1       2    16
##  [6,]  19      3       2    14
##  [7,]  47      3       3    25
##  [8,]  18      3       3    21
##  [9,]  19      3       3    29
table(exptM)
## exptM
##  1  2  3 10 11 12 14 15 16 17 18 19 21 25 29 37 47 
##  7  4  7  1  1  1  1  1  1  1  2  3  2  1  1  1  1

8.2.3 Creating tables from a data frame

A very common task when analysing data is the construction of frequency tables or cross-tabulation of one variable against another. We have seen how we can use the table() and prop.table() functions. The major take home of this section is that you can use these functions for data frames as well. Additionally, I’ll show you how to use the xtabs() function.

There’s a couple of options under these circumstances. Firstly, if you just want to cross-tabulate all of the variables in the data frame, then it’s really easy, as you can just use the table() function.

Exercise: Use the table() function to cross-tabulate all variables in the data frame itng.

table(itng)

However, it’s often the case that you want to select particular variables from the data frame to tabulate. For example, if you want to cross-tabulate speaker and utterance in the garden data set, table() goes full TMI.

table(garden)
## , , mood = 1
## 
##             utterance
## speaker      ee onk oo pip
##   makkapakka  0   0  0   2
##   tombliboo   0   0  0   0
##   upsydaisy   0   1  0   1
## 
## , , mood = 2
## 
##             utterance
## speaker      ee onk oo pip
##   makkapakka  0   1  0   0
##   tombliboo   1   0  1   0
##   upsydaisy   0   0  0   1
## 
## , , mood = 3
## 
##             utterance
## speaker      ee onk oo pip
##   makkapakka  0   1  0   0
##   tombliboo   0   0  0   0
##   upsydaisy   0   1  0   0

This is where the xtabs() function is useful. In this function, you input a formula in order to list all the variables you want to cross-tabulate, and the name of the data frame that stores the data. There are many situations when you’re analysing real data where this is actually extremely useful, since your data set will almost certainly contain lots of variables and you’ll only want to tabulate a few of them at a time.

For example, with the garden data, we can use

xtabs( formula =  ~ speaker + utterance, data = garden )
##             utterance
## speaker      ee onk oo pip
##   makkapakka  0   2  0   2
##   tombliboo   1   0  1   0
##   upsydaisy   0   2  0   2

Notice how the left hand side of the formula has been left empty.

As I mentioned in Section ??, the tabulation commands discussed so far all construct a table of raw frequencies: that is, a count of the total number of cases that satisfy certain conditions. However, often you want your data to be organised in terms of proportions rather than counts. This is where the prop.table() function comes in handy. And yes, it also works for data frames!

Exercise: Create the proportion table of itng, and assign it to a variable called itng.table. Display the table again, just as a reminder.

itng.table <- prop.table(table(itng))    # create the table, and assign it to a variable
itng.table                   # display the table again, as a reminder

So we can express the data in proportions, by feeding a table into prop.table(). It works similarly with the xtabs() output:

garden.table <- xtabs( formula =  ~ speaker + utterance, data = garden )
garden.table
##             utterance
## speaker      ee onk oo pip
##   makkapakka  0   2  0   2
##   tombliboo   1   0  1   0
##   upsydaisy   0   2  0   2
prop.table( garden.table )  # express in proportions
##             utterance
## speaker       ee onk  oo pip
##   makkapakka 0.0 0.2 0.0 0.2
##   tombliboo  0.1 0.0 0.1 0.0
##   upsydaisy  0.0 0.2 0.0 0.2
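
The proportions above are relative to the whole table. If you would rather have proportions within each row, prop.table() has a margin argument for that. A minimal sketch, in which toy is a hypothetical stand-in for the speaker and utterance columns of garden:

```r
# toy is a hypothetical stand-in for two columns of garden
toy <- data.frame(
  speaker   = rep(c("upsydaisy", "tombliboo", "makkapakka"), times = c(4, 2, 4)),
  utterance = c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk")
)

toy.table <- xtabs(formula = ~ speaker + utterance, data = toy)
row.props <- prop.table(toy.table, margin = 1)   # proportions within each speaker

row.props
rowSums(row.props)                               # every row now sums to 1
```

Using margin = 2 instead would give proportions within each column.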

8.3 Splitting a variable by group

One particular example of data handling that is especially common is the problem of splitting one variable up into several different variables, one corresponding to each group. To illustrate, let’s go back to the In the Night Garden example. I might want to create subsets of the utterance variable for every character.

One way to do this would be to use logical indexing:

itng$utterance
##  [1] "pip" "pip" "onk" "onk" "ee"  "oo"  "pip" "pip" "onk" "onk"
itng$speaker
##  [1] "upsydaisy"  "upsydaisy"  "upsydaisy"  "upsydaisy"  "tombliboo" 
##  [6] "tombliboo"  "makkapakka" "makkapakka" "makkapakka" "makkapakka"
itng$utterance[ itng$speaker == "makkapakka" ]
## [1] "pip" "pip" "onk" "onk"
itng$utterance[ itng$speaker == "tombliboo" ]
## [1] "ee" "oo"
itng$utterance[ itng$speaker == "upsydaisy" ]
## [1] "pip" "pip" "onk" "onk"

but that quickly gets repetitive, and hence annoying, and this strategy breaks down in a situation with many characters.

A faster, and maybe more convenient, way to do it is to use the split() function. The arguments are:

  • x. The variable that needs to be split into groups.
  • f. The grouping variable.

What this function does is output a list (Section 6.7), containing one variable for each group.

Exercise: Split the variable utterance by speaker using the split() function, and assign the result to a variable named speech.by.char. Specify the arguments x and f in the split() function. Print the result.

speech.by.char <- split( x = utterance, f = speaker )
speech.by.char

Once you’re starting to become comfortable working with lists and data frames, this output is all you need, since you can work with this list in much the same way that you would work with a data frame. For instance, if you want the first utterance made by Makka-Pakka, all you need to do is type this:

speech.by.char$makkapakka[1]
## [1] "pip"
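
As a small illustration of working with the resulting list, the sketch below retypes the two vectors so it can run on its own, and then counts the utterances per character. The sapply() function, which applies a function to every element of a list, is my own addition here and is not needed for the rest of this section.

```r
# the speaker and utterance values, retyped so the example stands alone
speaker   <- rep(c("upsydaisy", "tombliboo", "makkapakka"), times = c(4, 2, 4))
utterance <- c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk")

speech.by.char <- split(x = utterance, f = speaker)

names(speech.by.char)                      # one list element per character
speech.by.char$tombliboo                   # everything the Tombliboos said
n.by.char <- sapply(speech.by.char, length)  # number of utterances per character
n.by.char
```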

8.4 Cutting a variable into categories

One pragmatic task that arises more often than you’d think is the problem of cutting a numeric variable up into discrete categories. If it makes you happy or feel interesting, you can say that you are recoding a variable. For instance, suppose I’m interested in looking at the age distribution of people at a social gathering:

age <- c( 60,58,24,26,34,42,31,30,33,2,9 )

In some situations, it can be quite helpful to group these into a smallish number of categories. For example, we could group the data into three broad categories: young (0-20), adult (21-40) and older (41-60). This is a quite coarse-grained classification, and the labels that I’ve attached only make sense in the context of this data set (e.g., viewed more generally, a 42-year-old wouldn’t consider themselves as “older”).

We can slice this variable up quite easily using the cut() function.35

To make things a little cleaner, I’ll start by creating a variable that defines the boundaries for the categories:

age.breaks <- seq( from = 0, to = 60, by = 20 )
age.breaks
## [1]  0 20 40 60

and another one for the labels:

age.labels <- c( "young", "adult", "older" )
age.labels
## [1] "young" "adult" "older"

Note that there are four numbers in the age.breaks variable, but only three labels in the age.labels variable; I’ve done this because the cut() function requires that you specify the edges of the categories rather than the mid-points. In any case, now that we’ve done this, we can use the cut() function to assign each observation to one of these three categories.

There are several arguments to the cut() function, but the three that we need to care about are:

  • x. The variable that needs to be categorised.
  • breaks. This is either a vector containing the locations of the breaks separating the categories, or a number indicating how many categories you want.
  • labels. The labels attached to the categories. This is optional: if you don’t specify this R will attach a boring label showing the range associated with each category.

Exercise: Since we’ve already created variables corresponding to the breaks and the labels, apply the cut() function to age. Specify the three arguments discussed above. In order to see what this command has actually done, just print out the output.

age.group <- cut(x = age,                # the variable to be categorised
                 breaks = age.breaks,    # the edges of the categories
                 labels = age.labels)    # the labels for the categories
age.group
##  [1] older older adult adult adult older adult adult adult young young
## Levels: young adult older

Note that the output variable here is a factor.

Often, it’s actually more helpful to create a data frame that includes both the original variable and the categorised one so that you can see the two side by side:

agedf <- data.frame(age, age.group)
print(agedf)
##    age age.group
## 1   60     older
## 2   58     older
## 3   24     adult
## 4   26     adult
## 5   34     adult
## 6   42     older
## 7   31     adult
## 8   30     adult
## 9   33     adult
## 10   2     young
## 11   9     young

8.4.1 Letting R take the lead

In the example above, I made all the decisions myself. If you want to you can delegate a lot of the choices to R. For instance, if you want you can specify the number of categories you want, rather than giving explicit ranges for them, and you can allow R to come up with some labels for the categories. To give you a sense of how this works, have a look at the following example:

age.group2 <- cut( x = age, breaks = 3 )

With this command, I’ve asked for three categories, but let R make the choices for where the boundaries should be. I won’t bother to print out the age.group2 variable, because it’s not terribly pretty or very interesting. Instead, all of the important information can be extracted by looking at the tabulated data:

table( age.group2 ) 
## age.group2
## (1.94,21.3] (21.3,40.7] (40.7,60.1] 
##           2           6           3

This output takes a little bit of interpretation, but it’s not complicated. What R has done is determine that the lowest age category should run from 1.94 years up to 21.3 years, the second category should run from 21.3 years to 40.7 years, and so on. The formatting on those labels might look a bit funny to those of you who haven’t studied a lot of maths, but it’s pretty simple. When R describes the first category as corresponding to the range \((1.94, 21.3]\), what it’s saying is that the range consists of those numbers that are larger than 1.94 but less than or equal to 21.3. In other words, the weird asymmetric brackets are R’s way of telling you that if there happens to be a value that is exactly equal to 21.3, then it belongs to the first category, not the second one. Obviously, this isn’t actually possible since I’ve only specified the ages to the nearest whole number, but R doesn’t know this and so it’s trying to be precise just in case. This notation is actually pretty standard, but I suspect not everyone reading the book will have seen it before. In any case, those labels are pretty ugly, so it’s usually a good idea to specify your own, meaningful labels for the categories.
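
The same bracket logic applies when you supply the breaks yourself, and it matters for values that sit exactly on a boundary. The sketch below uses a few made-up ages (edge.ages is hypothetical) with the breaks 0, 20, 40 and 60 from before. The include.lowest argument is not discussed in the text; I mention it only because it is the usual remedy for the first case.

```r
edge.ages   <- c(0, 20, 21)            # hypothetical values on or near a boundary
edge.breaks <- c(0, 20, 40, 60)

edge.group <- cut(x = edge.ages, breaks = edge.breaks)
edge.group   # 0 falls outside (0,20] and becomes NA; 20 is in (0,20]; 21 is in (20,40]

# include.lowest = TRUE closes the first interval on the left, so 0 is kept
edge.group2 <- cut(x = edge.ages, breaks = edge.breaks, include.lowest = TRUE)
edge.group2
```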

It is important to take the time to figure out whether or not the resulting categories make any sense at all in terms of your research project. If they don’t make any sense to you as meaningful categories, then any data analysis that uses those categories is likely to be just as meaningless. More generally, in practice I’ve noticed that people have a very strong desire to carve their (continuous and messy) data into a few (discrete and simple) categories; and then run the analysis using the categorised data instead of the original one.

8.5 Sorting data

8.5.1 Sorting a numeric vector

One thing that you often want to do is sort a variable. If it’s a numeric variable you might want to sort in increasing or decreasing order. If it’s a character vector you might want to sort alphabetically, etc. The sort() function provides this capability.

Consider the variable numbers containing the following three values.

numbers <- c(2,4,3)

Exercise: Sort the variable numbers using the sort() function.

sort( numbers )

Exercise: Now ask R to sort numbers in decreasing order rather than increasing, by including the argument decreasing = TRUE.

sort( numbers, decreasing = TRUE )

8.5.2 Sorting a character vector

You can also ask sort() to put text data in alphabetical order:

text <- c("aardvark", "zebra", "swing", "aardvark", "zebra", "swing", "swing")
sort( text )
## [1] "aardvark" "aardvark" "swing"    "swing"    "swing"    "zebra"    "zebra"

That’s pretty straightforward. That being said, it’s important to note that I’m glossing over something here. When you apply sort() to a character vector it doesn’t strictly sort into alphabetical order. R actually has a slightly different notion of how characters are ordered, which is more closely related to how computers store text data than to how letters are ordered in the alphabet. That’s a topic I am not gonna touch, but do remember that if you ever need a strictly alphabetically sorted output, you need to take a deeper dive into R.

8.5.3 Sorting a factor

You can also sort factors, but the story here is slightly more subtle because there are two different ways you can sort a factor: alphabetically (by label) or by factor level. The sort() function uses the latter. To illustrate, let’s look at two different examples. First, let’s create a factor in the usual way:

fac <- factor( text )
fac
## [1] aardvark zebra    swing    aardvark zebra    swing    swing   
## Levels: aardvark swing zebra

Now let’s sort it:

sort(fac)
## [1] aardvark aardvark swing    swing    swing    zebra    zebra   
## Levels: aardvark swing zebra

This looks like it’s sorted things into alphabetical order, but that’s only because the factor levels themselves happen to be alphabetically ordered (i.e., the levels read: aardvark, swing, zebra).

Suppose I deliberately define the factor levels in a non-alphabetical order:

fac <- factor( text, levels = c("zebra","swing","aardvark") )
fac
## [1] aardvark zebra    swing    aardvark zebra    swing    swing   
## Levels: zebra swing aardvark

Exercise: Now what happens when we try to sort fac this time?

sort(fac)

It didn’t sort the data (that is, the text) in alphabetical order! What it does instead is sort the data into the numerical order implied by the factor levels, not the alphabetical order implied by the labels attached to those levels. Since the order of the factor levels is (by my own choosing) zebra, swing, and aardvark, it has sorted the data in that order. Normally you never notice the distinction, because by default the factor levels are assigned in alphabetical order, but it’s important to know the difference.
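
If you do want the labels in alphabetical order regardless of how the levels were defined, one option is to convert the factor back to text before sorting. A minimal sketch of the two behaviours side by side, with the data retyped so that it runs on its own:

```r
text <- c("aardvark", "zebra", "swing", "aardvark", "zebra", "swing", "swing")
fac  <- factor(text, levels = c("zebra", "swing", "aardvark"))

by.level <- sort(fac)                  # follows the order of the factor levels
by.label <- sort(as.character(fac))    # follows the labels themselves

as.character(by.level)
by.label
```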

8.5.4 Sorting a data frame

The sort() function doesn’t work properly with data frames. If you want to sort a data frame the standard advice that you’ll find online is to use the order() function (not described in this book) to determine what order the rows should be sorted, and then use square brackets to do the shuffling. There’s nothing inherently wrong with this advice, I just find it tedious. I won’t go into it any further, but I just want you to know you can do it. Remember that in life, nothing is impossible. But some things are just painfully tedious.
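
For the curious only, the order() recipe looks roughly like this. It is a bare-bones sketch on a made-up data frame (toy and its columns are hypothetical), not something you need for this course.

```r
# toy is a hypothetical data frame, used only for illustration
toy <- data.frame(name = c("c", "a", "b"), age = c(30, 10, 20))

row.order <- order(toy$age)     # positions of the rows, from youngest to oldest
row.order

sorted <- toy[row.order, ]      # use square brackets to do the shuffling
sorted
```

Note that the sorted data frame keeps the original row names, just as with subset().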

8.6 Using functions by group

8.6.1 One group

It is very commonly the case that you find yourself needing to look at stuff, broken down by some grouping variable. This is pretty easy to do in R, and we will discuss three functions in particular that are worth knowing about: tapply(), by() and aggregate().

For example, say we have these two variables:

gender <- c( "male","male","female","female","male" )
age <- c( 10,12,9,11,13 )

Suppose we want to compute the mean age per gender. If you remember the split() function from Section XXX, you know that one approach is as follows.

split.data <- split(x = age, f = gender)
mean(split.data$female)
## [1] 10
mean(split.data$male)
## [1] 11.66667

Nice, but tedious. And, if we are doing real talk, ugly. R provides several functions to make your life easy.

The first of these functions is tapply(), which has three key arguments. As before, X specifies the data, and FUN specifies a function. However, there is also an INDEX argument which specifies a grouping variable.36 What the tapply() function does is consider all of the different values that appear in the INDEX variable. Each such value defines a group: the tapply() function constructs the subset of X that corresponds to that group, and then applies the function FUN to that subset of the data. This probably sounds a little abstract, so let’s consider a specific example.

tapply( X = age, INDEX = gender, FUN = mean )
##   female     male 
## 10.00000 11.66667

In this extract, what we’re doing is using gender to define two different groups of people, and using their ages as the data. We then calculate the mean() of the ages, separately for the males and the females.

Note that tapply() is very demanding and detail-oriented, and differs from most functions in one respect. While many functions we have seen take a lowercase x as argument, tapply() is only happy (and then some!) with an uppercase X:

tapply( x = age, INDEX = gender, FUN = mean )
## Error in tapply(x = age, INDEX = gender, FUN = mean): argument "X" is missing, with no default

Remember that mean() can take the na.rm argument? So do I! How on earth could you pass this argument to tapply() (or any other argument for any other function, of course)? Well, here goes:

age_new <- age
age_new[4] <- NA
tapply( X = age_new, INDEX = gender, FUN = mean )
##   female     male 
##       NA 11.66667
tapply( X = age_new, INDEX = gender, FUN = mean, na.rm = TRUE )
##   female     male 
##  9.00000 11.66667

So the argument that should have gone inside mean() just became an argument inside tapply().

There’s even more flexibility! FUN is not restricted to built in stuff like mean() but can take your home-grown functions too!

tapply(X = age, 
       INDEX = gender, 
       FUN = function(x) sum(x > 18)
       ) 
## female   male 
##      0      0
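Since nobody in our tiny data set is older than 18, the counts above are all zero, which does not show much. As a small sketch of the same idea, here is the same call with a threshold of 10. The threshold and the name older.than.10 are my own choices for illustration.

```r
gender <- c("male", "male", "female", "female", "male")
age <- c(10, 12, 9, 11, 13)

# count, per gender, how many people are older than 10
# (the threshold 10 is chosen only to get non-zero counts)
older.than.10 <- tapply(X = age,
                        INDEX = gender,
                        FUN = function(x) sum(x > 10))
print(older.than.10)
```

This time the counts differ per group: one female (age 11) and two males (ages 12 and 13) are older than 10.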

A closely related function is by(). It actually does the same thing as tapply(), but the output is formatted a bit differently. This time around the three arguments are called data, INDICES and FUN, but they’re pretty much the same thing. The data argument specifies the data set, the INDICES argument specifies the grouping variable, and the FUN argument specifies the name of a function that you want to apply separately to each group. An example of how to use the by() function is shown in the following extract:

by( data = age, INDICES = gender, FUN = mean )
## gender: female
## [1] 10
## ------------------------------------------------------------ 
## gender: male
## [1] 11.66667

The output gives you means separately for the female group and the male group.

The same argument passing trick as with tapply() can be used. For example, if you want to add the na.rm = TRUE argument to the mean() function, just use

by( data = age_new, INDICES = gender, FUN = mean, na.rm = TRUE )
## gender: female
## [1] 9
## ------------------------------------------------------------ 
## gender: male
## [1] 11.66667

And, just like with tapply(), FUN can take a function you defined yourself:

by(data = age, 
   INDICES = gender, 
   FUN = function(x) sum(x > 18)
  ) 
## gender: female
## [1] 0
## ------------------------------------------------------------ 
## gender: male
## [1] 0

A fun fact is that by() produces an object of the class by. You don’t need to know what it does, but it is just to reinforce the idea that 1) there are many more classes than we discussed and 2) you should never assume that the class of the output is identical to the class of whatever you input.

byby <- by( data = age, INDICES = gender, FUN = mean )
class(byby)
## [1] "by"
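To make the point about classes a bit more concrete, here is a small sketch that stores the output of tapply() and by() for the same data and compares them. The variable names out.tapply and out.by are just illustrative.

```r
gender <- c("male", "male", "female", "female", "male")
age <- c(10, 12, 9, 11, 13)

out.tapply <- tapply(X = age, INDEX = gender, FUN = mean)
out.by     <- by(data = age, INDICES = gender, FUN = mean)

# the containers differ ...
class(out.tapply)
class(out.by)

# ... but the numbers inside are the same
all.equal(as.numeric(out.tapply), as.numeric(out.by))
```

So the same group means can come wrapped in different kinds of objects, depending on which function produced them.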

A final quite convenient function is the aggregate() function. There are again three arguments that you need to specify. The x argument is used to indicate which variable you want to analyse. The by argument is used to specify the grouping variable, which has to be wrapped in a list. The FUN argument is used to indicate what function you want to calculate for each group (e.g., the mean). In the code below I also combine our data in a data frame and pass it along as the data argument, although, strictly speaking, this version of aggregate() takes age and gender straight from the workspace; the data argument only really matters for the formula version that we will see in a moment. Annoyingly, if I want to show you the output of the aggregate() function in this document, I need to use the print() function. If you use it in RStudio, print() is not needed (but doesn’t hurt either).

dfAG <- data.frame(age, gender)
print(aggregate( x = age, 
                 by =  list(gender),  
                 FUN = mean,
                 data = dfAG)
)
##   Group.1        x
## 1  female 10.00000
## 2    male 11.66667

Note that we needed to convert gender to a list for aggregate() to work. In case you forget, aggregate() will kindly tell you off:

print(aggregate( x = age, 
                 by =  gender,  
                 FUN = mean,
                 data = dfAG)
)
## Error in aggregate.data.frame(as.data.frame(x), ...): 'by' must be a list

Alternatively, aggregate() can take a formula as input:

print(aggregate( x = age ~ gender,  
           data = dfAG,                     
           FUN = mean                     
))
##   gender      age
## 1 female 10.00000
## 2   male 11.66667

And yes, the extra-arguments and home-grown functions work as well. For once in its life, R is showing some consistency.

dfAG_new <- data.frame(age_new, gender)
print(aggregate( x = age_new ~ gender,  
           data = dfAG_new,                     
           FUN = mean,
           na.rm = TRUE
))
##   gender  age_new
## 1 female  9.00000
## 2   male 11.66667

and

print(aggregate(x = age ~ gender, 
         data = dfAG_new,                     
         FUN = function(x) sum(x > 18)
       )) 
##   gender age
## 1 female   0
## 2   male   0

What’s the class you wonder?

class(aggregate(x = age ~ gender, 
         data = dfAG_new,                     
         FUN = function(x) sum(x > 18)
       )) 
## [1] "data.frame"

8.6.2 Several groups

What if you have multiple grouping variables? Suppose, we have more variables in our data set, relating to whether the person has no pet, or a cat or a dog:

gender <- c( "male","male","female","female","male" )
age <- c( 10,12,9,11,13 )
pet <- c("no","cat","cat","dog","no")

For example, suppose you would like to look at the average age separately for all possible combinations of gender and pet ownership. It is possible to do this using the tapply() and by() functions, as follows. Note the use of list(), which we have encountered in Section XXX.

tapply( X = age, INDEX = list(pet, gender), FUN = mean )
##     female male
## cat      9 12.0
## dog     11   NA
## no      NA 11.5
by( data = age, INDICES = list(pet, gender), FUN = mean )
## : cat
## : female
## [1] 9
## ------------------------------------------------------------ 
## : dog
## : female
## [1] 11
## ------------------------------------------------------------ 
## : no
## : female
## [1] NA
## ------------------------------------------------------------ 
## : cat
## : male
## [1] 12
## ------------------------------------------------------------ 
## : dog
## : male
## [1] NA
## ------------------------------------------------------------ 
## : no
## : male
## [1] 11.5

I usually find it more convenient to use the aggregate() function in this situation. As before, the x argument indicates which variable you want to analyse, and the FUN argument indicates what function you want to calculate for each group (e.g., the mean). The only change is that the list given to the by argument now contains two grouping variables, pet and gender. If you prefer the formula version, you can look at age separately for each possible combination of pet ownership and gender with a formula like this: age ~ pet + gender. In that case the data argument specifies the data frame containing all the data.

dfAGP <- data.frame(age, gender, pet)
print(aggregate( x = age,  # age 
                 by = list(pet,gender), # by pet/gender combination
                 data = dfAGP,           # data is in the dfAGP data frame
                 FUN = mean              # print out group means
))
##   Group.1 Group.2    x
## 1     cat  female  9.0
## 2     dog  female 11.0
## 3     cat    male 12.0
## 4      no    male 11.5

or, using a formula,

print(aggregate( x = age ~ pet + gender,  # age by pet/gender combination
            data = dfAGP,           # data is in the dfAGP data frame
            FUN = mean              # print out group means
))
##   pet gender  age
## 1 cat female  9.0
## 2 dog female 11.0
## 3 cat   male 12.0
## 4  no   male 11.5
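One difference between the outputs above is easy to miss, so here is a short sketch that makes it explicit: tapply() keeps a cell for every pet/gender combination, whereas aggregate() only reports the combinations that actually occur in the data. The names cells and rows are illustrative, and I pass the formula as the first argument without naming it.

```r
gender <- c("male", "male", "female", "female", "male")
age <- c(10, 12, 9, 11, 13)
pet <- c("no", "cat", "cat", "dog", "no")
dfAGP <- data.frame(age, gender, pet)

# tapply() keeps every pet/gender cell, filling the empty ones with NA
cells <- tapply(X = age, INDEX = list(pet, gender), FUN = mean)
sum(is.na(cells))      # number of empty combinations

# aggregate() returns one row per combination that actually occurs
rows <- aggregate(age ~ pet + gender, data = dfAGP, FUN = mean)
nrow(rows)
```

There are six possible combinations, two of which are empty (females without a pet, and males with a dog), so tapply() shows two NA values and aggregate() returns four rows.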

The tapply(), by() and aggregate() functions are quite handy things to know about and are pretty widely used. As with the apply() function (see Section XXX), you can pass on additional function arguments after the FUN argument.

Before moving on, I should mention that there are several other functions that work along similar lines, and have suspiciously similar names: lapply, mapply, apply, vapply, rapply and eapply. You’ll hear about those when needed.

8.7 Merging

One thing you will quickly see (in Chapter XXX, to be more precise) is that, once your data are in the format R expects, doing inferential tests is almost a no-brainer. A t test can be done with the t.test() function, a proportion test with the prop.test() function, and so on. I’m pretty sure you catch my drift. The most difficult part is often getting your data in the format R expects. There is one final function that I want to talk about which will prove very handy when you start doing data analysis, because it helps you do exactly that. The merge() function supports fairly complicated “database like” merging of vectors and data frames. It doesn’t do anything you couldn’t do in a different way, by tedious and clever indexing. But it makes your life pretty easy.

To illustrate, consider a situation where we have two data frames with overlapping IDs, indicating the different participants.

dfA <- data.frame(
  ID = c(1, 2, 3),
  Score = c(10, 20, 30)
)

dfB <- data.frame(
  ID = c(2, 3, 4),
  Grade = c("B", "A", "C")
)

print(dfA)
##   ID Score
## 1  1    10
## 2  2    20
## 3  3    30
print(dfB)
##   ID Grade
## 1  2     B
## 2  3     A
## 3  4     C

Both dfA and dfB share a common column called ID. What if we want to combine both data sets in one? There’s a function for that.

merged_data <- merge(x = dfA, y = dfB, by = "ID")
print(merged_data)
##   ID Score Grade
## 1  2    20     B
## 2  3    30     A

The resulting data frame has not just put both data frames next to each other. Rather, the common column, which we specified using the by argument, has not been repeated. By default, merge() returns only the rows where the keys (IDs) match in both data frames. Thus, we end up with rows for ID=2 and ID=3 only, since ID=1 is missing in dfB and ID=4 is missing in dfA.

There are several variations. You might want to keep all rows from dfA (even if no matching ID exists in dfB). Or you could keep all rows from dfB, even if no match exists in dfA. Or you might just prefer to keep all rows from both data frames, whether or not they have a match in the other one.

print(merge(x = dfA, y = dfB, by = "ID", all.x = TRUE))
##   ID Score Grade
## 1  1    10  <NA>
## 2  2    20     B
## 3  3    30     A
print(merge(x = dfA, y = dfB, by = "ID", all.y = TRUE))
##   ID Score Grade
## 1  2    20     B
## 2  3    30     A
## 3  4    NA     C
print(merge(x = dfA, y = dfB, by = "ID", all = TRUE))
##   ID Score Grade
## 1  1    10  <NA>
## 2  2    20     B
## 3  3    30     A
## 4  4    NA     C
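In all of these examples the key column happens to have the same name, ID, in both data frames. When it does not, merge() has the by.x and by.y arguments. The sketch below assumes a hypothetical data frame dfC, which is just dfB with its key column renamed to Participant.

```r
dfA <- data.frame(ID = c(1, 2, 3), Score = c(10, 20, 30))

# hypothetical variant of dfB in which the key column is called Participant
dfC <- data.frame(Participant = c(2, 3, 4), Grade = c("B", "A", "C"))

# by.x names the key column in x, by.y names the key column in y
merged_xy <- merge(x = dfA, y = dfC, by.x = "ID", by.y = "Participant")
print(merged_xy)
```

The result has the same two rows as before (ID 2 and ID 3), and the key column keeps the name it had in x.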

Now we were lucky that we had a different variable in each data frame: Score and Grade. But would the world have continued turning if we had had Grade twice, once in each data frame? If you are feeling particularly adventurous, you can find out!

dfA <- data.frame(
  ID = c(1, 2, 3),
  Grade = c(10, 20, 30)
)

dfB <- data.frame(
  ID = c(2, 3, 4),
  Grade = c("B", "A", "C")
)

# merge them on the column "ID"
merged_data <- merge(x = dfA, y = dfB, by = "ID")
print(merged_data)
##   ID Grade.x Grade.y
## 1  2      20       B
## 2  3      30       A

The merge() function has added a suffix, to keep the world from collapsing. If you think you can do it better, you are right. For once.

merged_data <- merge(x = dfA, y = dfB, by = "ID", suffixes = c(".A",".B"))
print(merged_data)
##   ID Grade.A Grade.B
## 1  2      20       B
## 2  3      30       A

As advertised, the suffixes argument fixes!

8.8 A bit more on functions

8.8.1 Generic functions

There’s one really important thing that I omitted when I discussed functions earlier on in Section 2.9. Now that we are all up to speed about the different classes of variables, it is time to fill that gap and introduce the concept of a generic function. It does not really fit in the More Data Handling chapter, but hey.

The thing that makes generics different from the other functions is that their behaviour changes, often quite dramatically, depending on the class() of the input you give it. The easiest way to explain the concept is with an example. With that in mind, let us take a closer look at what the print() function actually does.37 I’ll do this by creating a formula, and printing it out in a few different ways. First, let’s stick with what we know:

my.formula <- blah ~ blah.blah    # create a variable of class "formula"
print( my.formula )               # print it out using the generic print() function
## blah ~ blah.blah
## <environment: 0x00000236bb752fe8>

So far, there’s nothing very surprising here. But there’s actually a lot going on behind the scenes here. When I type print( my.formula ), what actually happens is the print() function checks the class of the my.formula variable. When the function discovers that the variable it’s been given is a formula, it does the thing it does to formulas.

If you’re curious to know how R would have printed my.formula ignoring the fact that it is a formula, you can force R to show its default behaviour by using the print.default() function, which tells R to skip all the special things it does when it recognizes that the thing it needs to print is a formula:

print.default( my.formula )      # print it out using the print.default() method
## blah ~ blah.blah
## attr(,"class")
## [1] "formula"
## attr(,".Environment")
## <environment: 0x00000236bb752fe8>

Hm. You can kind of see that it is trying to print out the same formula, but there’s a bunch of ugly low-level details that have also turned up on the screen. This is because the print.default() method doesn’t know anything about formulas, and doesn’t know that it’s supposed to be hiding the obnoxious internal gibberish that R produces sometimes.

As a second example, remember the garden data frame? If we ask to print it, we get

print(garden)
##            speaker utterance mood
## case.1   upsydaisy       pip    2
## case.2   upsydaisy       pip    1
## case.3   upsydaisy       onk    1
## case.4   upsydaisy       onk    3
## case.5   tombliboo        ee    2
## case.6   tombliboo        oo    2
## case.7  makkapakka       pip    1
## case.8  makkapakka       pip    1
## case.9  makkapakka       onk    2
## case.10 makkapakka       onk    3

R neatly organized stuff when printing, adapting its behavior on account of recognizing garden to be a data frame. If we stop R from being so sensitive to class, and just let it do whatever it does without taking the data frame nature into account, we get

print.default(garden)
## $speaker
##  [1] "upsydaisy"  "upsydaisy"  "upsydaisy"  "upsydaisy"  "tombliboo" 
##  [6] "tombliboo"  "makkapakka" "makkapakka" "makkapakka" "makkapakka"
## 
## $utterance
##  [1] "pip" "pip" "onk" "onk" "ee"  "oo"  "pip" "pip" "onk" "onk"
## 
## $mood
##  [1] 2 1 1 3 2 2 1 1 2 3
## 
## attr(,"class")
## [1] "data.frame"

So the general point is that, while print() will always, well, print, what it prints exactly depends on what class the input is. What happens is that, if we use print(garden), R is smart enough to recognize garden as being a data frame, and instead of using the print() function, it automatically (but also sneakily) uses the print.data.frame() function.

print.data.frame(garden)
##            speaker utterance mood
## case.1   upsydaisy       pip    2
## case.2   upsydaisy       pip    1
## case.3   upsydaisy       onk    1
## case.4   upsydaisy       onk    3
## case.5   tombliboo        ee    2
## case.6   tombliboo        oo    2
## case.7  makkapakka       pip    1
## case.8  makkapakka       pip    1
## case.9  makkapakka       onk    2
## case.10 makkapakka       onk    3

There’s no difference in the output at all compared to print(garden). But this shouldn’t surprise you because it was actually the print.data.frame() method that was doing all the hard work in the first place. The print() function itself is a lazy bastard that doesn’t do anything other than select which of the methods is going to do the actual printing. So when the function discovers that the variable it’s been given is a data frame, it goes looking for a function called print.data.frame(), and then delegates the whole business of printing out the variable to the print.data.frame() function. You won’t need to understand the details at all for this book, but you do need to know the gist of it; if only because a lot of the functions we’ll use are actually generics.
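To get a feel for how this delegation works, here is a minimal sketch in which we give print() a method for a class of our own. The class name temperature, the variable my.temp and the message are all made up for this illustration; cat() simply writes text to the screen.

```r
# a hypothetical object: a number tagged with the made-up class "temperature"
my.temp <- 21.5
class(my.temp) <- "temperature"

# a print method for that class; print() finds it through its name
print.temperature <- function(x, ...) {
  cat("It is", unclass(x), "degrees\n")
  invisible(x)
}

print(my.temp)           # print() delegates to print.temperature()
print.default(my.temp)   # bypasses the method and shows the raw object
```

The first call prints the friendly sentence, while the second shows the bare number together with its class attribute, much like what we saw for the formula.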

8.9 Go out and play

If you have done and understood all of this, you can find some additional exercises in the RIntroExercises document. Since you now know how to work in RStudio, why not go there and do them?

9 Basic programming

Up to this point in the book, I’ve tried hard to avoid using the word “programming” too much because – at least in my experience – it’s a word that can cause a lot of fear. For one reason or another, programming (like mathematics and statistics) is often perceived by people on the “outside” as a black art, a magical skill that can be learned only by some kind of super-nerd. I think this is a shame. It’s certainly true that advanced programming is a very specialised skill: several different skills actually since there’s quite a lot of different kinds of programming out there. However, the basics of programming aren’t all that hard, and you can accomplish a lot of very impressive things just using those basics.

With that in mind, the goal of this chapter is to discuss a few basic programming concepts and how to apply them in R. However, before I do, I want to make one further attempt to point out just how non-magical programming really is, via one very simple observation: you already know how to do it. Stripped to its essentials, programming is nothing more (and nothing less) than the process of writing out a bunch of instructions that a computer can understand. To phrase this slightly differently, when you write a computer program, you need to write it in a programming language that the computer knows how to interpret. R is one such language. Although I’ve been having you type all your commands at the command prompt, and all the commands in this book so far have been shown as if that’s what I was doing, it’s also quite possible (and as you’ll see shortly, shockingly easy) to write a program using these R commands. In other words, if this is the first time reading this book, then you’re only one short chapter away from being able to legitimately claim that you can program in R, albeit at a beginner’s level.

9.1 More on functions

9.1.1 Writing your own functions

In this section, I want to talk about functions again. Functions were introduced in Section 2.9 but since you are a programmer now, we can talk about them in more detail. In particular, I want to show you how to create your own. After this you will no longer be the meager peasant who is forced to work with whatever function the R developers (or the package developers) have found the grace to provide to you, but you will be the king of your own little universe.

Here’s the syntax that you use to create a function:

     FNAME <- function ( ARG1, ARG2, ETC ) {
        STATEMENT1
        STATEMENT2
        ETC
        return( VALUE )
     }

What this does is create a function with the name FNAME, which has arguments ARG1, ARG2 and so forth. Whenever the function is called, R executes the statements in the curly braces and then outputs the contents of VALUE to the user.

To give a simple example of this, let’s create a function called quadruple() which multiplies its inputs by four.

 ## --- functionexample.R
quadruple <- function(x) {
  y <- x*4
  return(y)
} 

Exercise: Run the code and see what happens.

quadruple <- function(x) {
  y <- x*4
  return(y)
} 

Nothing appears to have happened! You can’t see it in the browser, but what did happen is that there is a new object created in the workspace called quadruple.

Exercise: Ask R to tell us what kind of object quadruple is using the class() function.

class(quadruple)

It tells us that it is a function.

And now that we’ve created the quadruple() function, we can call it just like any other function. And if we want to store the output as a variable, we can do this:

my.var <- quadruple(10)
my.var
## [1] 40
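As a small additional sketch (not something the quadruple() function was specifically designed for), note that it also accepts a vector, and that its output can be fed straight back into it:

```r
quadruple <- function(x) {
  y <- x * 4
  return(y)
}

# because * works element by element, a vector goes in and a vector comes out
quadruple(c(1, 2, 3))

# the output of one call can be the input of another
quadruple(quadruple(10))
```

The first call returns 4, 8 and 12, and the second returns 160.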

Functions are another place where the print() function might prove its right to exist. Say we want to have a glimpse of the internal workings and see what x is (which is weird, because we provided it as input, but hey). Let’s try two things.

quadruple2 <- function(x) {
  y <- x*4
  x
  return(y)
} 
quadruple3 <- function(x) {
  y <- x*4
  print(x)
  return(y)
} 
quadruple2(10)
## [1] 40
quadruple3(10)
## [1] 10
## [1] 40

Unlike in your console, adding x to the function does not show said x. You need print(x) for that.

9.1.2 Seeing sources

Now that we know how to create our own functions in R, it’s probably a good idea to talk a little more about some of the other properties of functions that I’ve been glossing over.

Exercise: To start with, let’s take this opportunity to type the name of the function without the parentheses.

As you can see, when you type the name of a function, R prints out the underlying source code that we used to define the function in the first place. In the case of the quadruple() function, this is quite helpful to us – we can read this code and actually see what the function does. Besides studying the help files, searching Google, and talking to a friendly AI assistant, this is another way of trying to make sense of what R does.

9.1.3 The ins and outs

An important thing to recognise here is that the two internal variables that the quadruple() function makes use of, x and y, stay internal.

rm(list = ls()) #remove all
quadruple <- function(x) {
  y <- x*4
  return(y)
} 
my.var <- quadruple(10)
my.var
## [1] 40
ls()
## [1] "my.var"    "quadruple"

In our workspace, we see the quadruple() function itself, as well as the my.var variable that we just created. But no trace of x or y.

It is time for some hard talk about the funny relation functions have with the workspace.

rm(list=ls()) #clear the workspace
afun1 = function(x) { 
  y <- 3
  result <- x + y
  return( result )
} 
afun1(x=10)
## [1] 13
x
## Error in eval(expr, envir, enclos): object 'x' not found
y
## Error in eval(expr, envir, enclos): object 'y' not found

When running afun1(x=10), we ask R to add x and y. We supply x as input to the function (10, in this case), and have defined y inside the function (3, in this case). Since R returns 13, it clearly had access to the correct x and y. Yet, when we ask R what x and y are, R just blanks. The reason is that our x=10 is only defined as a function argument, and the y<-3 only in the function code. Neither of them enters the workspace.

What is happening is that, every time you call a function, R briefly creates a temporary local environment in which the function can work, and which is deleted after the calculations are complete. R does not execute the commands inside the function in the workspace: all the internal statements in the body of the function are executed in that temporary environment, so they remain invisible to the user. Only the final result inside return() is handed back.
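If you want to see that the arguments and internal variables really do exist while the function is running, one option is to call ls() inside the function. The function peek() below is a made-up example that does exactly this; treat it as a sketch rather than something you would normally write.

```r
peek <- function(x) {
  y <- 3
  inside <- ls()     # names that exist inside the function at this point
  return(inside)
}

peek(x = 10)
```

Called inside a function, ls() lists what lives in the function’s own temporary environment, so here it reports x and y, even though neither of them will show up when you call ls() in the workspace afterwards.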

You might be surprised to learn that the following code still produces 13!

y <- 5
afun1(x = 10)
## [1] 13
x
## Error in eval(expr, envir, enclos): object 'x' not found
y
## [1] 5

Despite having defined y<-5 outside of the function, R uses the y<-3 from within the function, since, unlike x, we haven’t supplied our value of 5 to the function! But also keep in mind that, outside the function, R still thinks y is 5. So the y<-3 that is temporarily used while executing the function is cleared from memory afterwards! What happens in a function, stays in a function. So go wild!

Now consider this:

rm(y)
afun2 = function(x) { 
  result <- x + y
  return( result )
} 
afun2(x=10)
## Error in afun2(x = 10): object 'y' not found
x
## Error in eval(expr, envir, enclos): object 'x' not found
y
## Error in eval(expr, envir, enclos): object 'y' not found

R cannot execute this function, because it needs a y to do so and we have not supplied one as input or as part of the function definition.

This, however, does work.

y <- 5
afun2(x = 10)
## [1] 15
x
## Error in eval(expr, envir, enclos): object 'x' not found
y
## [1] 5

So while R will first go looking for a y inside the function when needed, when it cannot find one there, it will go looking elsewhere. Despite not being passed as input to the function, the y<-5 defined outside of the function is used.

This is, however, very bad practice. Supplying y as an argument is much cleaner and safer:

afun3 <- function(x, y) {
  result <- x + y
  return(result)
}
afun3(x = 10, y = 5)
## [1] 15

9.1.4 Function arguments revisited

Okay, now that we are starting to get a sense for how functions are constructed, let’s have a look at two slightly more complicated functions that I’ve created. Let’s start by looking at the first one:

## --- functionexample2.R
pow <- function( x, y = 1) {
  out <- x^y  # raise x to the power y
  return( out )
}

As you can see from looking at the code for this function, it has two arguments x and y, and all it does is raise x to the power of y. For instance, this command

pow(x=3, y=2)
## [1] 9

calculates the value of \(3^2\). The interesting thing about this function isn’t what it does, since R already has perfectly good mechanisms for calculating powers. Rather, notice that when I defined the function, I specified y=1 when listing the arguments? That’s the default value for y. So if we enter a command without specifying a value for y, then the function assumes that we want y=1:

pow( x=3 )
## [1] 3

However, since I didn’t specify any default value for x when I defined the pow() function, we always need to input a value for x. If we don’t, R will spit out an error message. Try it!

Exercise: Try it!

So now you know how to specify default values for an argument.
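To round this off, here is a short sketch of the different ways in which the arguments of pow() can be matched. The last part uses tryCatch(), which has not been discussed in this book; I only use it here to capture the error instead of letting it stop the code, and the replacement text is my own.

```r
pow <- function(x, y = 1) {
  out <- x^y  # raise x to the power y
  return(out)
}

pow(2, 3)           # matched by position: x = 2, y = 3
pow(y = 3, x = 2)   # matched by name, so the order does not matter

# leaving out x fails; tryCatch() captures the error instead of stopping
result <- tryCatch(pow(y = 3), error = function(e) "no value for x")
result
```

Both complete calls return 8. The incomplete call triggers an error, because x has no default, and result ends up holding the replacement text.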

The other thing I should point out while I’m on the topic of function arguments is the use of the ... argument. The ... argument is a special construct in R which is only used within functions. It is used as a way of matching against multiple user inputs: in other words, ... is used as a mechanism to allow the user to enter as many inputs as they like. I won’t talk at all about the low-level details of how this works, but I will show you a simple example of a function that makes use of it. To that end, consider the following function:

 ## --- functionexample3.R
doubleMax <- function( ... ) {  
  max.val <- max( ... )   # find the largest value in ... 
  out <- 2 * max.val      # double it
  return( out )
}

You can type in as many inputs as you like. The doubleMax() function identifies the largest value in the inputs, by passing all the user inputs to the max() function, and then doubles it.

Exercise: Enter a few different values in the doubleMax() function to see what it does.

 # for example:
doubleMax(1, 2, 5)
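Passing ... on to another function is one way of using it. Another common pattern, sketched below, is to collect the inputs with list(...) so that you can inspect them. The function describeInputs() is a hypothetical helper written for this illustration.

```r
# hypothetical helper: count the inputs and report the largest one
describeInputs <- function(...) {
  inputs <- list(...)        # collect everything the user typed into a list
  n <- length(inputs)        # how many inputs were supplied
  largest <- max(...)        # pass the same inputs on to max()
  return(c(n = n, largest = largest))
}

describeInputs(1, 2, 5)
```

With the inputs 1, 2 and 5, the function reports that it received 3 inputs and that the largest one is 5.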

9.1.5 2many exes

The fact that the arguments don’t enter the workspace can have consequences that for novices are either deeply confusing or straight up hilarious, depending on your sense of humor.

For example, this code should be easy to follow:

a <- rep(x = 6.66, times = 9)
round(x = a)
## [1] 7 7 7 7 7 7 7 7 7

Both functions rep() and round() have an argument called x. Noice.

This code, however, seems wrong, but isn’t:

round(x = rep(x = 6.66, times = 9))
## [1] 7 7 7 7 7 7 7 7 7

There are two different x’s being used as the argument, but since they don’t enter the workspace, there is no clash of exes.

Of course, this also works, if you want to avoid the 2 many exes problem.

round(rep(6.66, 9))
## [1] 7 7 7 7 7 7 7 7 7

My only point is that 2 many exes should never be a problem. Don’t let anyone shame you for your bodycount!

9.1.6 There’s more to functions than this

There’s a lot of other details to functions that I’ve hidden in my description in this chapter. Experienced programmers will wonder exactly how the “scoping rules” work in R,38 or want to know how to use a function to create variables in other environments39, or will wonder if function objects can be assigned as elements of a list40 and probably hundreds of other things besides. However, I don’t want to have this discussion get too cluttered with details, so I think it’s best – at least for the purposes of the current book – to stop here.

9.2 Loops

Remember that, a long time ago, I gave a description of how a script works? Well, it was a tiny bit of a lie. Specifically, it’s not necessarily the case that R starts at the top of the file and runs straight through to the end of the file. For all the scripts that we’ve seen so far, that’s exactly what happens, and unless you insert some commands to explicitly alter how the script runs, that is what will always happen. However, you actually have quite a lot of flexibility in this respect. Depending on how you write the script, you can have R repeat several commands, or skip over different commands, and so on. This topic is referred to as flow control, and the first concept to discuss in this respect is the idea of a loop. The basic idea is very simple: a loop is a block of code (i.e., a sequence of commands) that R will execute over and over again until some termination criterion is met. Looping is a very powerful idea. There are three different ways to construct a loop in R, based on the while, for and repeat functions. I’ll only discuss the first two in this book.

9.2.1 The while loop

A while loop is a simple thing. The basic format of the loop looks like this:

     while ( CONDITION ) {
        STATEMENT1
        STATEMENT2
        ETC
     }

The code corresponding to CONDITION needs to produce a logical value, either TRUE or FALSE. Whenever R encounters a while statement, it checks to see if the CONDITION is TRUE. If it is, then R goes on to execute all of the commands inside the curly brackets, proceeding from top to bottom as usual. However, when it gets to the bottom of those statements, it moves back up to the while statement. Then, like the mindless automaton it is, it checks to see if the CONDITION is TRUE. If it is, then R goes on to execute all … well, you get the idea. This continues endlessly until at some point the CONDITION turns out to be FALSE. Once that happens, R jumps to the bottom of the loop (i.e., to the } character), and then continues on with whatever commands appear next in the script.

To start with, let’s keep things simple, and use a while loop to calculate the smallest multiple of 17 that is greater than or equal to 1000. This is a very silly example since you can actually calculate it using simple arithmetic operations, but the point here isn’t to do something novel. The point is to show how to write a while loop. Here’s the script:

## --- whileexample.R
x <- 0
while ( x < 1000 ) {
  x <- x + 17
}
x 

When we run this script, R starts at the top and creates a new variable called x and assigns it a value of 0. It then moves down to the loop, and “notices” that the condition here is x < 1000. Since the current value of x is zero, the condition is true, so it enters the body of the loop (inside the curly braces). There’s only one command here41 which instructs R to increase the value of x by 17. R then returns to the top of the loop and rechecks the condition. The value of x is now 17, but that’s still less than 1000, so the loop continues. This cycle will continue for a total of 59 iterations, until finally x reaches a value of 1003 (i.e., \(59 \times 17 = 1003\)). At this point, the loop stops, and R finally reaches line 5 of the script, prints out the value of x on screen, and then halts.

Exercise: Run the while loop and watch what happens.

x <- 0
while ( x < 1000 ) {
  x <- x + 17
}
x 

Truly fascinating stuff.

9.2.2 The for loop

The for loop is also pretty simple, though not quite as simple as the while loop. The basic format of this loop goes like this:

     for ( VAR in VECTOR ) {
        STATEMENT1
        STATEMENT2
        ETC
     }

In a for loop, R runs a fixed number of iterations. We have a VECTOR which has several elements, each one corresponding to a possible value of the variable VAR. In the first iteration of the loop, VAR is given a value corresponding to the first element of VECTOR; in the second iteration of the loop VAR gets a value corresponding to the second value in VECTOR; and so on. Once we’ve exhausted all of the values in VECTOR, the loop terminates and the flow of the program continues down the script.

Once again, let’s use some very simple examples. Firstly, here is a program that just prints out the word “hello” three times and then stops:

 ## --- forexample.R
for ( i in 1:3 ) {
  print( "hello" )
}

This is the simplest example of a for loop. The vector of possible values for the i variable just corresponds to the numbers from 1 to 3. Not only that, the body of the loop doesn’t actually depend on i at all. Not surprisingly, here’s what happens when we run it.

Exercise: Run the for loop and watch what happens.

for ( i in 1:3 ) {
  print( "hello" )
}
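To see the loop variable actually taking on the values 1, 2 and 3 in turn, here is a small variation in which the body of the loop does depend on i. It is only an illustrative sketch; paste() glues its arguments together into a single string, separated by spaces.

```r
for ( i in 1:3 ) {
  print( paste( "hello number", i ) )
}
## [1] "hello number 1"
## [1] "hello number 2"
## [1] "hello number 3"
```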

However, there’s nothing that stops you from using something non-numeric as the vector of possible values, as the following example illustrates. This time around, we’ll use a character vector to control our loop, which in this case will be a vector of words. And what we’ll do in the loop is get R to convert the word to upper case letters, calculate the length of the word, and print it out. Here’s the script (note that it uses the toupper() function, which converts lowercase letters to uppercase, and the nchar() function, which counts the number of characters in a string):

 ## --- forexample2.R
 #the words
words <- c("it","was","the","dirty","end","of","winter")
 #loop over the words
for ( w in words ) {
  w.length <- nchar( w )     # calculate the number of letters
  W <- toupper( w )          # convert the word to upper case letters
  msg <- paste( W, "has", w.length, "letters" )   # a message to print
  print( msg )               # print it
}

Exercise: Run the for loop and watch what happens.

 #the words
words <- c("it","was","the","dirty","end","of","winter")
 #loop over the words
for ( w in words ) {
  w.length <- nchar( w )     # calculate the number of letters
  W <- toupper( w )          # convert the word to upper case letters
  msg <- paste( W, "has", w.length, "letters" )   # a message to print
  print( msg )               # print it
  
}

Again, pretty straightforward I hope.

Note that we can use whatever name we want for the index (i.e., VAR):

for ( i in c(2,4,6)) {
  print(3 + i)
}
## [1] 5
## [1] 7
## [1] 9

is exactly the same as

for ( dude_can_you_play_a_song_with_a_flipping_beat in c(2,4,6)) {
  print(3 + dude_can_you_play_a_song_with_a_flipping_beat)
}
## [1] 5
## [1] 7
## [1] 9

The i, which is often used in a for loop, is just a dumb letter.

Note how in the above example, we asked R to do a computation for every i or dude_can_you_play_a_song_with_a_flipping_beat, and we asked it to print the result. However, if we would like to store that result, we would not want to copy and paste the output from the console, which can be fun in its own way, but is not recommended. So let’s see how we can change the for loop to make it store stuff.

for ( i in c(2,4,6)) {
  x <- 3 + i
}
x
## [1] 9

Nice, but not quite. x has only stored the last result, not the full history. At each iteration of the loop, x is overwritten, so the final x is the one corresponding to the final i.

Maybe we could try this:

rm(x) #first remove x
for ( i in c(2,4,6)) {
  x[i] <- 3 + i
}
## Error in x[i] <- 3 + i: object 'x' not found
x
## Error in eval(expr, envir, enclos): object 'x' not found

Mmmmh. R balks. At the first iteration of the loop, i is 2. So what we ask R to do in the statement x[i] <- 3 + i is to compute 3 + 2, and store that as the second element of x. What is this x, you speak of, says R? It’s like me asking you to put your laptop in the second room of house X. You surely can not do that if I don’t first tell you what house X is.

So let’s tell R what x is. In fact, we are playing it cool, and only tell R that it exists (which, to use a fancy word, is called initializing it), but we don’t tell it what x is. The reason is that R will find out by itself what it is, each time it executes the x[i] <- 3 + i line. Saying something without saying anything can be done using NULL.

x <- NULL
for ( i in c(2,4,6)) {
  x[i] <- 3 + i
}
x
## [1] NA  5 NA  7 NA  9

Yeah! Sort of. x now stores all the results of the x[i] <- 3 + i line. But it stores them in weird places. Just like we asked, it stores them at the 2nd, 4th and 6th place, and it fills the gaps with NA. It would be nicer if it stored them at the 1st, 2nd and 3rd place. Here’s how to do that:

x <- NULL
vec <- c(2,4,6)
for ( i in 1:length(vec)) {
  x[i] <- 3 + vec[i]
}
x
## [1] 5 7 9

Hurray!
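Two common alternatives, sketched here on the assumption that you know in advance how many results there will be: you can create x at its full length up front with numeric(), and you can let seq_along() generate the indices 1, 2 and 3 for you. And for a computation this simple, R can add 3 to the whole vector in one go, without any loop at all. The name y is made up for the illustration.

```r
vec <- c(2,4,6)
x <- numeric( length(vec) )   # a vector of three zeros, to be filled in
for ( i in seq_along(vec) ) { # seq_along(vec) is 1, 2, 3
  x[i] <- 3 + vec[i]
}
x
## [1] 5 7 9

y <- 3 + vec                  # no loop needed
y
## [1] 5 7 9
```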

Exercise: Now it’s your turn. Write a for loop that iterates from 1 to 6, printing the square of each number using the print() function.

 # Start by merely writing the framework of the for loop, before moving on to the calculation that will be done inside the loop
for (i in 1:6) {
  
}
 # Remember how to calculate the square of a number in R? How do we generalise this for all iterations inside the loop?
for (i in 1:6) {
  print(i^2)
}

We will be doing a lot more of these for loops when we are doing simulations.

9.2.3 A more realistic example of a loop

To give you a sense of how you can use a loop in a more complex situation, let’s write a simple script to simulate the progression of a mortgage. Suppose we have a nice young couple who borrow $300000 from the bank, at an annual interest rate of 5%. The mortgage is a 30-year loan, so they need to pay it off within 360 months total. Our happy couple decides to set their monthly mortgage payment at $1600 per month. Will they pay off the loan in time or not? Only time will tell.42 Or, alternatively, we could get R to tell us. The script to run this is a fair bit more complicated.

To make the code easier we need to make a few calculations. The couple is making monthly payments of $1600, at an annual interest rate of 5%. This means that, each month, their outstanding balance is to be multiplied by 1.05^(1/12). 43
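As a quick sanity check (a sketch of my own, not part of the mortgage script): applying the monthly multiplier twelve times in a row should amount to exactly one year of interest. The all.equal() function compares two numbers while tolerating tiny rounding errors.

```r
interest <- 0.05
monthly.multiplier <- (1 + interest)^(1/12)
monthly.multiplier                                # a bit more than 1.004
all.equal( monthly.multiplier^12, 1 + interest )  # twelve months make one year
## [1] TRUE
```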

 ## --- mortgage.R
 # set up
month <- 0        # count the number of months
balance <- 300000 # initial mortgage balance
total.paid <- 0   # track what you've paid the bank
payment <- 1600  # monthly payment
interest <- 0.05  # 5% interest rate per year
 # convert annual interest to a monthly multiplier
monthly.multiplier <- (1+interest)^(1/12)


# keep looping until the loan is paid off...
while ( balance > 0 ) {
  
  # do the calculations for this month
  month <- month + 1  # one more month
  balance <- balance * monthly.multiplier  # add the interest
  balance <- balance - payment  # make the payment
  total.paid <- total.paid + payment # track the total paid


} # end of loop
total.paid
month

To explain what’s going on, let’s go through it carefully. In the first block of code (under # set up) all we’re doing is specifying all the variables that define the problem. The loan starts with a balance of $300,000 owed to the bank on month zero, and at that point in time the total.paid money is nothing. The couple is making monthly payments of $1600, at an annual interest rate of 5%, which we convert into the associated monthly.multiplier.

The interesting part (such as it is) is the loop. The while statement tells R that it needs to keep looping until the balance reaches zero (or less, since it might be that the final payment of $1600 pushes the balance below zero). Then, inside the body of the loop, we do all the number crunching. Firstly we increase the value of month by 1. Next, the bank charges the interest, so the balance goes up. Then, the couple makes their monthly payment and the balance goes down. Finally, we keep track of the total amount of money that the couple has paid so far, by adding the payment to the running tally.

The key thing here is the tension between the increase in balance (in the line balance <- balance * monthly.multiplier) and the decrease in balance (in the line balance <- balance - payment). As long as the decrease is bigger, then the balance will eventually drop to zero and the loop will eventually terminate. If not, the loop will continue forever! This is actually very bad programming on my part: I really should have included something to force R to stop if this goes on too long. However, I haven’t shown you how to evaluate “if” statements yet (you have to wait till Section XXX), so we’ll just have to hope that I have rigged the example so that the code actually runs. Anyway, assuming that the loop does eventually terminate, the last two lines of code print out the total amount of money that the couple handed over to the bank over the lifetime of the loan and the number of months it took.

Now that I’ve explained everything in the code in tedious detail…

Exercise: Run the while loop and see what happens.

 # set up
month <- 0        # count the number of months
balance <- 300000 # initial mortgage balance
payments <- 1600  # monthly payments
interest <- 0.05  # 5% interest rate per year
total.paid <- 0   # track what you've paid the bank
 # convert annual interest to a monthly multiplier
monthly.multiplier <- (1+interest)^(1/12)
 # keep looping until the loan is paid off...
while ( balance > 0 ) {
  
  # do the calculations for this month
  month <- month + 1  # one more month
  balance <- balance * monthly.multiplier  # add the interest
  balance <- balance - payments  # make the payments
  total.paid <- total.paid + payments # track the total paid

} # end of loop
total.paid
month

So our nice young couple has paid off their $300,000 loan just 4 months shy of the 30-year term of their loan (after 356 months, that is), at a bargain-basement price of $569,600. A happy ending!
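About that missing safeguard: one way to add it without an if statement is to put a second requirement in the CONDITION itself. The sketch below uses a hypothetical cap called max.months and a deliberately too-small payment of $1000, which does not even cover the first month’s interest of roughly $1222, so without the cap this loop would never end.

```r
month <- 0
balance <- 300000
payment <- 1000       # too small: the balance grows every month
interest <- 0.05
max.months <- 360     # hypothetical cap: give up after 30 years
monthly.multiplier <- (1 + interest)^(1/12)

# keep looping while money is still owed AND the cap has not been reached
while ( balance > 0 & month < max.months ) {
  month <- month + 1
  balance <- balance * monthly.multiplier - payment
}
month          # 360: the loop stopped because of the cap
balance > 0    # TRUE: the loan is still not paid off
```

With the original payment of $1600 the first requirement becomes FALSE well before the cap is reached, so the cap changes nothing in that case.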

9.2.4 Implicit loops

In addition to providing the explicit looping structures via while and for, R also provides a collection of functions for implicit loops. What I mean by this is that these are functions that carry out operations very similar to those that you’d normally use a loop for. However, instead of typing out the whole loop, the whole thing is done with a single command. The main reason why this can be handy is that – due to the way that R is written – these implicit looping functions are usually able to do the same calculations much faster than the corresponding explicit loops. In most applications that beginners might want to undertake, this probably isn’t very important, since most beginners tend to start out working with fairly small data sets and don’t usually need to undertake extremely time-consuming number crunching. However, because you often see these functions referred to in other contexts, it may be useful to very briefly discuss a few of them.

In fact, I can be very brief about it, since we have already discussed this in Section XXX. For example, consider the by() function. We have used it as follows:

age
## [1] 10 12  9 11 13
gender
## [1] "male"   "male"   "female" "female" "male"
by( data = age, INDICES = gender, FUN = mean )
## gender: female
## [1] 10
## ------------------------------------------------------------ 
## gender: male
## [1] 11.66667

In some sense, by() has been doing a loop for us:

unique_genders <- unique(gender)
mean_ages <- NULL
# loop over each unique gender
for (i in 1:length(unique_genders)) {
  mean_ages[i] <- mean(age[gender==unique_genders[i]])
}
mean_ages
## [1] 11.66667 10.00000

So, yeah, thanks, by().
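by() is not the only function of this kind. As a sketch, here is the same computation with tapply(), another implicit-loop function from base R; it returns the group means with the group names attached, in alphabetical order.

```r
age <- c(10, 12, 9, 11, 13)
gender <- c("male", "male", "female", "female", "male")
mean_ages <- tapply( X = age, INDEX = gender, FUN = mean )
mean_ages
##   female     male 
## 10.00000 11.66667
```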

9.3 Conditional statements

A second kind of flow control that programming languages provide is the ability to evaluate conditional statements. Unlike loops, which can repeat over and over again, a conditional statement only executes once, but it can switch between different possible commands depending on a CONDITION that is specified by the programmer. The power of these commands is that they allow the program itself to make choices, and in particular, to make different choices depending on the context in which the program is run. The most prominent example of a conditional statement is the if statement, and the accompanying else statement. The basic format of an if statement in R is as follows:

     if ( CONDITION ) {
        STATEMENT1
        STATEMENT2
        ETC
     }

And the execution of the statement is pretty straightforward. If the CONDITION is true, then R will execute the statements contained in the curly braces. If the CONDITION is false, then it does not. If you want to, you can extend the if statement to include an else statement as well, leading to the following syntax:

     if ( CONDITION ) {
        STATEMENT1
        STATEMENT2
        ETC
     } else {
        STATEMENT3
        STATEMENT4
        ETC
     }     

As you’d expect, the interpretation of this version is similar. If the CONDITION is true, then the contents of the first block of code (i.e., STATEMENT1, STATEMENT2, ETC) are executed; but if it is false, then the contents of the second block of code (i.e., STATEMENT3, STATEMENT4, ETC) are executed instead.
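To make this concrete, here is a minimal sketch with a made-up variable called temperature and a made-up threshold of 20 degrees.

```r
temperature <- 12
if ( temperature >= 20 ) {
  advice <- "leave the coat at home"
} else {
  advice <- "bring a coat"
}
advice
## [1] "bring a coat"
```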

You can expand this logic as follows:

     if ( CONDITION ) {
        STATEMENT1
        STATEMENT2
        ETC
     } else if (ANOTHER CONDITION) {
        STATEMENT3
        STATEMENT4
        ETC
     } else if (YET ANOTHER CONDITION) {
        STATEMENT5
        STATEMENT6
        ETC
     } else {
        STATEMENT7
        STATEMENT8
        ETC
     }

What will you see in this example?

score1 <- 60
score2 <- 40
if (score2 > score1 & score1 < 40){
  result <- "Ha"
} else if (score2 > score1 & score1 > 40){ 
  result <- "Ku"
} else if (score1 > score2 | score1 == 40){
  result <- "Na"
} else {
  result <- "Ma"
}
result
## [1] "Na"

A particularly useful function to make conditional statements is the ifelse() function. I low key forgot how it exactly works, so I looked it up in the R help file, ?ifelse. Here is what it says in the description: ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE. Even though I sort of know what it does, this is mostly gibberish. So I looked at the examples in that very same help file:

x <- c(6:-4)
sqrt(x)  #- gives warning
## Warning in sqrt(x): NaNs produced
##  [1] 2.449490 2.236068 2.000000 1.732051 1.414214 1.000000 0.000000      NaN
##  [9]      NaN      NaN      NaN
sqrt(ifelse(x >= 0, x, NA))  # no warning
##  [1] 2.449490 2.236068 2.000000 1.732051 1.414214 1.000000 0.000000       NA
##  [9]       NA       NA       NA

Oh, please, dear R. Why do you have to make things so complicated? This isn’t gonna help. Let’s cook up our own example.

x <- -6:6
ifelse(x >= 0, x, x^2)  
##  [1] 36 25 16  9  4  1  0  1  2  3  4  5  6

So yeah, this gives me an idea of what it does. The ifelse() function takes three arguments, test, yes and no, and works through test one element at a time. Wherever an element of test is TRUE, the function returns the corresponding element of the second or yes argument. Wherever it is FALSE, the function returns the corresponding element of the third or no argument.

So the full statement is as follows:

x <- -6:6
ifelse(test = x >= 0, yes = x, no = x^2)  
##  [1] 36 25 16  9  4  1  0  1  2  3  4  5  6
#or
ifelse(test = x >= 0, no = x^2, yes = x)  
##  [1] 36 25 16  9  4  1  0  1  2  3  4  5  6

We can do the same with the if ... else structure, combined with a for loop:

x <- -6:6
result <- NULL
for (i in 1:length(x)){
  if ( x[i] >= 0 ) {
    result[i] <- x[i]
  } else {
    result[i] <- x[i]^2
  }
}
result
##  [1] 36 25 16  9  4  1  0  1  2  3  4  5  6
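The yes and no arguments do not have to be computed from x, by the way. In this small sketch (the name label is made up), ifelse() turns each number into a piece of text, one element at a time.

```r
x <- -2:2
label <- ifelse( test = x >= 0, yes = "non-negative", no = "negative" )
label
## [1] "negative"     "negative"     "non-negative" "non-negative" "non-negative"
```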

There is another way of making conditional statements in R. In particular, the switch() function can be very useful in different contexts. However, my main aim in this chapter is to briefly cover the very basics, so I’ll move on.
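Just so you have seen it once, here is a minimal sketch of switch(), with made-up variables animal and sound. R looks for the argument whose name matches the value of animal and returns it; the last, unnamed argument is the fallback for when nothing matches.

```r
animal <- "cat"
sound <- switch( animal,
                 dog = "woof",
                 cat = "meow",
                 "no idea" )   # fallback when nothing matches
sound
## [1] "meow"
```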

10 Drawing graphs

Visualising data is one of the most important tasks facing the data analyst. It’s important for two distinct but closely related reasons. Firstly, there’s the matter of drawing “presentation graphics”: displaying your data in a clean, visually appealing fashion makes it easier for your reader to understand what you’re trying to tell them. Equally important, perhaps even more important, is the fact that drawing graphs helps you to understand the data. To that end, it’s important to draw “exploratory graphics” that help you learn about the data as you go about analysing it. These points might seem pretty obvious, but I cannot count the number of times I’ve seen people forget them.

The goal is to show you how to create basic graphs in R. The graphs themselves tend to be pretty straightforward, so in that respect, this chapter is pretty simple. Where people usually struggle is learning how to produce graphs, and especially, learning how to produce good graphs.44 The focus is on how to make these plots. The when and the why and the why not have been discussed in Statistics 1 and will not be repeated here.

Fortunately, learning how to draw graphs in R is both extremely simple and extremely hard. You are fortunate enough that R has a lot of very good graphing functions, and most of the time you can produce a clean, high-quality graphic without having to learn very much about the low-level details of how R handles graphics. As long as you’re not too picky about what your graph looks like, it is almost trivial. You make a histogram using hist(), a box plot using boxplot(), etc. It can almost not get any simpler. But. Unfortunately, on those occasions when you do want to do something non-standard, or if you need to make highly specific changes to the figure, you actually do need to learn a fair bit about these details; and those details are both complicated and boring. So doing something decent is ridiculously easy; doing something great is terrifyingly difficult.

The goal of this chapter is to teach you how to make quick-and-dirty graphs. I will show you how to make a basic graph and make a few adjustments. Making good graphs will require a lot more than just a few adjustments, and a lot more explanation, though. If you ever need that, you will have to go hunting for the right handles to tweak yourself.

10.1 Basic plotting

Before I discuss any specialised graphics, let’s start by drawing a few very simple graphs just to get a feel for what it’s like to draw pictures using R. To that end, let’s create a small vector Fibonacci that contains a few numbers we’d like R to draw for us. Then, we’ll ask R to plot() those numbers.

Fibonacci <- c( 1,1,2,3,5,8,13 )

Exercise: Ask R to plot the Fibonacci numbers, by providing the data set to the function plot().

plot(Fibonacci)

As you can see, what R has done is plot the values stored in the Fibonacci variable on the vertical axis (y-axis) and the corresponding index on the horizontal axis (x-axis). In other words, since the 4th element of the vector has a value of 3, we get a dot plotted at the location (4,3). That’s pretty straightforward, and the image is probably pretty close to what you would have had in mind when I suggested that we plot the Fibonacci data.
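Put differently, the short command is shorthand for spelling out both coordinates yourself. As a sketch (the name index is made up), the following draws the same dots, although the default axis labels may differ slightly because they are taken from the names you supply.

```r
Fibonacci <- c( 1,1,2,3,5,8,13 )
index <- 1:length(Fibonacci)     # the positions 1, 2, ..., 7
plot( x = index, y = Fibonacci )
```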

However, there’s quite a lot of customisation options available to you, so we should probably spend a bit of time looking at some of those options.

10.1.1 Changing the plot type

You can easily customise the appearance of the actual plot! To start with, let’s look at one of the most important options that the plot() function provides for you to use, which is the type argument. The type argument specifies the visual style of the plot. The possible values for this are:

  • type = "p". Draw the points only.
  • type = "h". Draw “histogram-like” vertical bars.
  • type = "s". Draw a staircase, going horizontally then vertically.
  • type = "S". Draw a Staircase, going vertically then horizontally.
  • type = "b". Draw both points and lines, but don’t overplot.
  • type = "o". Draw the line over the top of the points.
  • type = "c". Draw only the connecting lines from the “b” version.
  • type = "l". Draw a line through the points.
  • type = "n". Draw nothing. (Apparently this is useful sometimes?)

The simplest way to illustrate what each of these really looks like is just to draw them. To that end, Figure 10.1 shows the same Fibonacci data, drawn using eight different types of plot. As you can see, by altering the type argument you can get a qualitatively different appearance to your plot. In other words, as far as R is concerned, the only difference between a scatterplot and a line plot is that you draw a scatterplot by setting type = "p" and you draw a line plot by setting type = "l".

Figure 10.1: Changing the type of the plot.
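If you would like to see the plot types side by side on your own screen, a for loop can do the legwork. This is only a sketch and not necessarily how Figure 10.1 was made; it relies on the par() function with its mfrow argument, which is not covered in this chapter, to split the plotting window into 2 rows and 4 columns. The names plot.types, tp and old.par are made up.

```r
Fibonacci <- c( 1,1,2,3,5,8,13 )
plot.types <- c( "p", "o", "h", "l", "s", "b", "S", "c" )

old.par <- par( mfrow = c(2, 4) )   # 2 x 4 grid of panels; remember old settings
for ( tp in plot.types ) {
  plot( Fibonacci, type = tp, main = paste0( "type = '", tp, "'" ) )
}
par( old.par )                      # restore the original layout
```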

10.2 Customizing graphs

The basic plots R produces are ok, but you should not always settle for ok. R offers many handles to tweak to customize your plots. They might feel daunting at first but 1) they are pretty self-explanatory and 2) they work not just for the plot() function, but also for many other graphical functions we will encounter below. So they are worth studying.

10.2.1 Customising the title, the axis labels and the limits

One of the first things that you’ll find yourself wanting to do when customising your plot is to label it better. You might want to specify more appropriate axis labels, add a title or add a subtitle. The arguments that you need to specify to make this happen are:

  • main. A character string containing the main title.
  • sub. A character string containing the subtitle.
  • xlab. A character string containing the x-axis label.
  • ylab. A character string containing the y-axis label.

Exercise: Let’s have a look at what happens when we make use of all these arguments. Here’s the command.

plot(x = Fibonacci,
     main = "You specify title using the 'main' argument",
     sub = "The subtitle appears here! (Use the 'sub' argument for this)",
     xlab = "The x-axis label is 'xlab'",
     ylab = "The y-axis label is 'ylab'"
)

It’s more or less as you’d expect. The plot itself is identical to the one we drew in the previous exercise, except for the fact that we’ve changed the axis labels, and added a title and a subtitle.

Another thing you might want to have control over is setting the limits of the axes.

  • xlim and ylim. The axis scales. Generally, R does a pretty good job of figuring out where to set the edges of the plot. However, you can override its choices by setting the xlim and ylim arguments. For instance, if I decide I want the vertical scale of the plot to run from 0 to 100, then I’d set ylim = c(0, 100).

Exercise: Have a look yourself.

plot(x = Fibonacci,       # the data
     main = "You specify title using the 'main' argument",
     sub = "The subtitle appears here! (Use the 'sub' argument for this)",
     xlab = "The x-axis label is 'xlab'",
     ylab = "The y-axis label is 'ylab'",
     xlim = c(0, 15),     # expand the x-scale
     ylim = c(0, 15)      # expand the y-scale
)

The axis scales on both the horizontal and vertical dimensions have been expanded. Nice.

Even so, there’s a couple of interesting features worth calling your attention to. Firstly, notice that the subtitle is drawn below the plot, which I personally find annoying; as a consequence I almost never use subtitles. You may have a different opinion, of course, but the important thing is that you remember where the subtitle actually goes. Secondly, notice that R has decided to use boldface text and a large font size for the title. This is one of my most hated default settings in R graphics since I feel that it draws too much attention to the title. Generally, while I do want my reader to look at the title, I find that the R defaults are a bit overpowering, so I often like to change the settings.

10.2.2 Changing other features of the appearance of the plot

In Section 10.2.1 we talked about a group of graphical parameters that are related to the formatting of titles, axis labels etc. The second group of parameters I want to discuss are those related to the formatting of the plot itself:

  • pch: Plot character type: The plot character parameter is a number, usually between 0 and 24. What it does is tell R what symbol to use to draw the points that it plots. The simplest way to illustrate what the different values do (i.e., how they relate to the character types used to plot points) is with a picture. Figure 10.2 shows the first 25 plotting characters. The default plotting character is a hollow circle (i.e., pch = 1). You don’t need to know these numbers by heart! If you encounter someone who does, try to initiate a friendly conversation about setting priorities in life.

Figure 10.2: Changing the plotted characters

  • cex: Plot character size. Font size is handled in a slightly curious way in R. Instead of using some absolute size, it uses a magnification value, which is referred to as “cex” (short for “character expansion”). So this parameter describes a character expansion factor (i.e., magnification) for the plotted characters such as points. By default cex=1, but if you want bigger symbols in your graph you should specify a larger value.
  • lty: Line type. The line type parameter describes the kind of line that R draws (if you ask it to draw a line, that is). It has seven values which you can specify using a number between 0 and 6, or using a meaningful character string: "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". Note that the "blank" version (value 0) just means that R doesn’t draw the lines at all. The other six versions are shown in Figure 10.3. You don’t need to know these numbers or strings by heart.

Figure 10.3: Line types

  • lwd: Line width. The next graphical parameter in this category that I want to mention is the line width parameter, which is just a number specifying the width of the line. The default value is 1. Not surprisingly, larger values produce thicker lines and smaller values produce thinner lines.
  • col: Colour of the plot. For the plot function it’s pretty simple: the col argument refers to the colour of the points and/or lines that get drawn! The simplest way to specify this parameter is using a character string: e.g., col = "blue". Conveniently, R has a very large number of named colours (type colours() to see a list of over 650 colour names that R knows), so you can use the English language name of the colour to select it.45 Examples are "red", "gray25", and "springgreen4" (yes, R really does recognise four different shades of “spring green”).
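You do not have to take my word for any of this. The sketch below asks R how many colour names it knows, looks up the spring greens, and draws a home-made chart of plotting characters 0 to 24 (not necessarily the way Figure 10.2 was produced). The grep() function, used here to search for names containing a piece of text, is not covered in this chapter, and the names n.colours and spring are made up.

```r
n.colours <- length( colours() )   # how many named colours does R know?
n.colours

# all colour names that contain the text "springgreen"
spring <- grep( "springgreen", colours(), value = TRUE )
spring

# plotting characters 0 to 24 in a row; the x-axis tells you which is which
plot( x = 0:24, y = rep(1, 25), pch = 0:24, cex = 2 )
```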

Exercise: To illustrate what you can do by altering these parameters, let’s try the following command.

plot(x = Fibonacci,
     type = "b",
     col = "blue",
     pch = 19,
     cex = 5,
     lty = 2,
     lwd = 4)

10.2.3 Changing the appearance of the axes

There are several other possibilities worth discussing.

  • las: Orientation of the axis labels. I presume that the name of this parameter is an acronym of label style or something along those lines; but what it actually does is govern the orientation of the text used to label the individual tick marks (i.e., the numbering, not the xlab and ylab axis labels). There are four possible values for las: A value of 0 means that the labels of both axes are printed parallel to the axis itself (the default). A value of 1 means that the text is always horizontal. A value of 2 means that the labelling text is printed at right angles to the axis. Finally, a value of 3 means that the text is always vertical. You don’t need to study these values.
  • ann: Suppress labelling: This is a logical-valued argument that you can use if you don’t want R to include any annotations, such as text for a title, subtitle or axis label. To do so, set ann = FALSE. This will stop R from including any text that would normally appear in those places. Note that this will override any of your manual titles. For example, if you try to add a title using the main argument, but you also specify ann = FALSE, no title will appear.
  • axes: Suppress axis drawing: Again, this is a logical valued argument. Suppose you don’t want R to draw any axes at all. To suppress the axes, all you have to do is add axes = FALSE. This will remove the axes and the numbering, but not the axis labels (i.e. the xlab and ylab text). 46
  • frame.plot: Include a framing box: Suppose you’ve removed the axes by setting axes = FALSE, but you still want to have a simple box drawn around the plot; that is, you only wanted to get rid of the numbering and the tick marks, but you want to keep the box. To do that, you set frame.plot = TRUE.

Exercise: To illustrate what you can do by altering these parameters, let’s try the following command.

plot(x = Fibonacci,       # the data
     ann = FALSE,         # delete all annotations
     axes = FALSE,        # delete the axes
     frame.plot = TRUE    # but include a framing box
 )

The output is pretty much exactly as you’d expect. The axes have been suppressed (on account of axes being FALSE) as have the annotations (on account of ann being FALSE), but we’ve kept a box around the plot (on account of frame.plot being TRUE).
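The las parameter did not feature in that command, so here is a small sketch of it: with las = 1, the numbers along both axes are printed horizontally, which mainly changes the look of the vertical axis.

```r
Fibonacci <- c( 1,1,2,3,5,8,13 )
plot( x = Fibonacci,
      type = "b",
      las = 1 )     # tick labels always horizontal
```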

10.2.4 More!

Before moving on, I should point out that there are several more graphical parameters relating to the axes, the box and the general appearance of the plot. Some allow finer-grained control over the appearance of the axes and the annotations, and a bunch of others let you customise the font style, size, and so on, but I will let you figure those out by yourself if you ever need them. Most of them speak for themselves. For example, it should not come as a surprise, knowing what cex and main do, that cex.main controls the font size of the title.

Although you will often end up with commands that are quite long, it’s not complicated: the only thing the setting of these arguments does is that it overrides a bunch of the default parameter values. The only difficult aspect to this is that you have to remember what each of these parameters is called, and what all the different values are (unless I told you that you don’t need to remember them, of course).
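For instance, here is a sketch of a command that tones down the title. It uses cex.main, mentioned above, together with font.main, a related graphical parameter for which a value of 1 means plain (rather than bold) text.

```r
Fibonacci <- c( 1,1,2,3,5,8,13 )
plot( x = Fibonacci,
      main = "A more modest title",
      cex.main = 1,     # title at the regular text size
      font.main = 1 )   # plain text instead of boldface
```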

10.3 Scatterplots

Scatterplots are a simple but effective tool for visualising data. We’ve already seen scatterplots in this chapter when using the plot() function to draw the Fibonacci variable as a collection of dots (Section 10.1). However, for the purposes of this section, I have a slightly different notion in mind. Instead of just plotting one variable, what I want to do with my scatterplot is to display the relationship between two variables. It’s this latter application that we usually have in mind when we use the term “scatterplot”. In this kind of plot, each observation corresponds to one dot: the horizontal location of the dot plots the value of the observation on one variable, and the vertical location displays its value on the other variable. In many situations, you don’t really have clear opinions about what the causal relationship is (e.g., does A cause B, or does B cause A, or does some other variable C control both A and B). If that’s the case, it doesn’t really matter which variable you plot on the x-axis and which one you plot on the y-axis. However, in many situations, you do have a pretty strong idea which variable you think is most likely to be causal, or at least you have some suspicions in that direction. If so, then it’s conventional to plot the cause variable on the x-axis, and the effect variable on the y-axis.

To do so, let’s turn to a topic close to every parent’s heart: sleep. The following data set is fictitious but based on real events. Suppose I’m curious to find out how much my infant son’s sleeping habits affect my mood. Let’s say that I can rate my grumpiness very precisely, on a scale from 0 (not at all grumpy) to 100 (grumpy as a very, very grumpy old man). And, let us also assume that I’ve been measuring my grumpiness, my sleeping patterns and my son’s sleeping patterns for quite some time now. Let’s say, for 100 days. And, being a nerd, I’ve saved the data as a data frame called parenthood.

If we peek at the data using head(), here’s what we get:

print(head(parenthood))
##   dan.sleep baby.sleep dan.grump day
## 1      7.59      10.18        56   1
## 2      7.91      11.66        60   2
## 3      5.14       7.92        82   3
## 4      7.71       9.61        55   4
## 5      6.68       9.75        67   5
## 6      5.99       5.04        72   6

We see that the data frame called parenthood contains four variables: dan.sleep, baby.sleep, dan.grump and day.

Suppose my goal is to draw a scatterplot displaying the relationship between the amount of sleep that I get (dan.sleep) and how grumpy I am the next day (dan.grump). As you might expect given our earlier use of plot() to display the Fibonacci data, the function that we use is the plot() function. We just need to specify the name of the variable to be plotted on the x axis and the name of the variable to be plotted on the y axis, using the arguments x and y. I’m sure you can work out and remember which argument does what.

Exercise: Create a scatterplot showing the relationship between the amount of sleep that Dan gets and how grumpy he is the next day. Plot dan.sleep on the x axis and dan.grump on the y axis.

plot( x = parenthood$dan.sleep,   # data on the x-axis
      y = parenthood$dan.grump    # data on the y-axis
 ) 

If we do this, the result is a very basic scatterplot. This serves fairly well, but there are a few customisations that we probably want to make to turn it into a presentable plot. As usual, we want to add some labels, but there are a few other things we might want to do as well. Firstly, it’s sometimes useful to rescale the plots. In the scatterplot you just created, R has selected the scales so that the data fall neatly in the middle. But, in this case, we happen to know that the grumpiness measure falls on a scale from 0 to 100, and the hours slept falls on a natural scale between 0 hours and about 12 or so hours (the longest I can sleep in real life).

Exercise: Run the following command, to see how we might draw this.

plot( x = parenthood$dan.sleep,          # data on the x-axis
       y = parenthood$dan.grump,         # data on the y-axis
       xlab = "My sleep (hours)",        # x-axis label
       ylab = "My grumpiness (0-100)",   # y-axis label
       xlim = c(0,12),                   # scale the x-axis
       ylim = c(0,100),                  # scale the y-axis
       pch = 20,                         # change the plot type
       col = "gray50",                   # dim the dots slightly
       frame.plot = FALSE                # don't draw a box
 )

10.4 Adding stuff to a plot

Sometimes it can be very useful to draw a line. Don’t just tolerate any unacceptable behavior, you know? While hard to do in real life, drawing lines in R is easy. It just involves using the function lines(). Mind you: this is not a separate argument within the plot() function, but really a function of its own. Quite conveniently, the arguments that I need to specify are pretty much the exact same ones that I use when calling the plot() function. That is, suppose that I want to draw a line that goes from the point (4,93) to the point (9.5,37). Then the x locations can be specified by the vector c(4,9.5) and the y locations correspond to the vector c(93,37). In other words, I use this command:

plot( x = parenthood$dan.sleep,          # data on the x-axis
       y = parenthood$dan.grump,         # data on the y-axis
       xlab = "My sleep (hours)",        # x-axis label
       ylab = "My grumpiness (0-100)",   # y-axis label
       xlim = c(0,12),                   # scale the x-axis
       ylim = c(0,100),                  # scale the y-axis
       pch = 20,                         # change the plot type
       col = "gray50",                   # dim the dots slightly
       frame.plot = FALSE                # don't draw a box
)
lines( x = c(4,9.5),   # the horizontal locations
       y = c(93,37),   # the vertical locations
      lwd = 2         # line width
)

Figure 10.4: A scatterplot with scatter plot specific customisations

And when I do so, R plots the line over the top of the plot that I drew using the previous command.

Note that while lines() is a function of its own, it does somewhat depend on the plot() function, in the sense that if you call lines() in the void, without making a plot first, R will refuse to draw anything and throw an error instead!

Another way to draw lines is to use the abline() function. Rather than using coordinates as input, it uses the intercept (a) and slope (b) as input (hence its name). It is especially useful for drawing horizontal or vertical lines, for which you should not use the intercept and slope, but rather the h or v argument, indicating where the horizontal or vertical line should go.
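If you want to see this dependence for yourself, the following sketch closes any open plot and then calls lines() on its own. I wrap the call in tryCatch() only so that the error message is stored rather than stopping the script; the name result is my own choice.

```r
# Close all graphics devices, so there is no existing plot to add to
graphics.off()

# Try to draw a line without a plot; keep the error message if it fails
result <- tryCatch(
  {
    lines( x = c(4, 9.5), y = c(93, 37) )
    "line drawn"
  },
  error = function(e) conditionMessage(e)
)
print(result)
```

The message that comes back should mention plot.new, which is the low-level function that plot() calls behind the scenes to start a fresh figure.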

plot( x = parenthood$dan.sleep,          # data on the x-axis
       y = parenthood$dan.grump,         # data on the y-axis
       xlab = "My sleep (hours)",        # x-axis label
       ylab = "My grumpiness (0-100)",   # y-axis label
       xlim = c(0,12),                   # scale the x-axis
       ylim = c(0,100),                  # scale the y-axis
       pch = 20,                         # change the plot type
       col = "gray50",                   # dim the dots slightly
       frame.plot = FALSE                # don't draw a box
)
abline(h = 60, # draw a horizontal line at y=60
       v =  8  # draw a vertical line at x=8  
) 

Figure 10.5: A scatterplot with scatter plot specific customisations
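The example above only uses the h and v arguments. As a sketch of the a and b arguments, we can redraw the line from (4,93) to (9.5,37): the slope is the rise divided by the run, and the intercept is the y value at x = 0. The variable names x_pts, y_pts, a and b are my own, and I only plot the two points rather than the parenthood data so that the example runs by itself.

```r
# The two points that the guesstimated line runs through
x_pts <- c(4, 9.5)
y_pts <- c(93, 37)

# Slope (b) is rise over run; intercept (a) is the y value at x = 0
b <- diff(y_pts) / diff(x_pts)
a <- y_pts[1] - b * x_pts[1]

plot( x = x_pts, y = y_pts,
      xlim = c(0, 12), ylim = c(0, 100),
      xlab = "My sleep (hours)", ylab = "My grumpiness (0-100)",
      pch = 5
)
abline( a = a, b = b, lwd = 2 )   # intercept and slope instead of coordinates
```

Unlike lines(), which stops at the coordinates you give it, abline() extends the line across the whole plotting region.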

If you don’t want to add a full line, but just one or more points, R is ready to help you out with a function of its own, which is called (I am sure you could have guessed) points(). Like lines() and abline(), it is a stand-alone function but only so-so, in that R will only know what to do with it in the context of a plot. Like lines(), it uses coordinates as input. So if you, for whatever reason, would like to highlight the start and end points of the line you drew earlier, you could do something like this:

plot( x = parenthood$dan.sleep,          # data on the x-axis
       y = parenthood$dan.grump,         # data on the y-axis
       xlab = "My sleep (hours)",        # x-axis label
       ylab = "My grumpiness (0-100)",   # y-axis label
       xlim = c(0,12),                   # scale the x-axis
       ylim = c(0,100),                  # scale the y-axis
       pch = 20,                         # change the plot type
       col = "gray50",                   # dim the dots slightly
       frame.plot = FALSE                # don't draw a box
)
 lines( x = c(4,9.5),   # the horizontal locations
        y = c(93,37),   # the vertical locations
        lwd = 2         # line width
)
 points( x = c(4,9.5),   # the horizontal locations
       y = c(93,37),     # the vertical locations
       pch = 5           # the symbol 
)

Figure 10.6: A scatterplot with scatter plot specific customisations

10.5 Pie charts

Now that we’ve tamed (or possibly fled from) the beast that is R graphical parameters, let’s talk more seriously about some real-life graphics that you’ll want to draw. We begin with the humble pie chart.

I don’t have solid advice on the usefulness of pie charts. I just wanna show you how it’s done. I will use the afl.finalists variable, which contains the names of all 400 teams that played in all 200 finals matches played during the period 1987 to 2010.

What I want to do is draw a pie chart that displays the percentage of finals that each team has played in over the time spanned by the afl data set.

The good news is that you need a function that is called pie(). Easy, right? The bad news is that this doesn’t work:

pie( afl.finalists ) 
## Error in pie(afl.finalists): 'x' values must be positive.

Rather, we need to convert our data to frequencies, which can be done using the table() or tabulate() functions. Let’s have a look first.

tabulate(afl.finalists)
##  [1] 26 25 26 28 32  6 39 27 28 28 17  6 24 26 38 24
table(afl.finalists)
## afl.finalists
##         Adelaide         Brisbane          Carlton      Collingwood 
##               26               25               26               28 
##         Essendon        Fremantle          Geelong         Hawthorn 
##               32                6               39               27 
##        Melbourne  North Melbourne    Port Adelaide         Richmond 
##               28               28               17                6 
##         St Kilda           Sydney       West Coast Western Bulldogs 
##               24               26               38               24

So we created a (named) vector containing the number of finals that each team has played in the afl.finalists data.

Now we bake:

pie( tabulate(afl.finalists) )

pie( table(afl.finalists) )

The only difference is that tabulate() gives us just the numbers (so that is all pie() can use), whereas table() gives us both numbers and labels that pie() can use.
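The difference between the two functions is easiest to see on a tiny factor. The vector few_teams below is made up for illustration; it is not part of the afl data.

```r
# A small made-up factor (hypothetical data, not the real afl.finalists)
few_teams <- factor( c("Geelong", "Carlton", "Geelong",
                       "Essendon", "Geelong", "Carlton") )

counts_only   <- tabulate(few_teams)   # bare counts, in the order of the levels
counts_labels <- table(few_teams)      # the same counts, with the level names attached

counts_only
names(counts_only)      # NULL: nothing for pie() to use as labels
names(counts_labels)    # the team names that pie() picks up
```

The counts are listed in the order of the factor levels, which R sorts alphabetically by default, so the first count belongs to Carlton and the last to Geelong.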

You can, however, control the labels if you so desire, using the conveniently named labels argument.

pie( tabulate(afl.finalists), labels =  levels(afl.finalists))

Here’s a bit of a fancier version, with percentages included in the labels and a clockwise organisation:

percentages <- round(100*prop.table(table(afl.finalists)),0)
pie_labels <- paste0(names(percentages), " (", percentages,"%)") 
pie(x = table(afl.finalists), labels = pie_labels, clockwise=TRUE)

Making it more readable is needed but hard, so I’m not gonna bother.

10.6 Bar graphs

Another form of graph that you often want to plot is the bar graph. I’ll use the afl.finalists variable. The main function that you can use in R to draw them is the barplot() function.

Disappointingly, but unsurprisingly if you paid any attention when reading about the pie() function, the following command does not work:

barplot( afl.finalists)
## Error in barplot.default(afl.finalists): 'height' must be a vector or a matrix

Exercise: Draw a bar graph using the barplot() function. The main argument that you need to specify for a bar graph is the frequencies.

barplot( tabulate(afl.finalists) )

As you can see, R has drawn a pretty minimal plot. It doesn’t have any labels, obviously, because we didn’t actually tell the barplot() function what the labels are! To do this, we need to specify the names.arg argument. The names.arg argument needs to be a vector of character strings containing the text that needs to be used as the label for each of the items. So I’m obviously going to need the team names to create some labels, so let’s create a variable with those. We’ll do this using the levels() function, which outputs the names of all the levels of a factor (see Section 6.4).

Exercise: Use the levels() function to obtain the names of all the levels in afl.finalists. Save the result in a variable named teams and print the result.

teams <- levels( afl.finalists )
teams 

Okay, so now that we have the information we need, let’s draw our bar graph again.

Exercise: Add the names.arg argument to the command used in the previous exercise, to indicate the labels of the bar graph.

barplot( tabulate(afl.finalists), names.arg = teams)

We could have saved ourselves some effort and just done this, using table() instead of tabulate():

barplot( table(afl.finalists) )

Anyhoo, this is an improvement, but not much of an improvement. R has only included a few of the labels because it can’t fit them all in the plot. The fact that barplot() has omitted the names of many of the teams is somewhat problematic.

The simplest way to fix this is to rotate the labels so that the text runs vertically not horizontally. To do this, we need to set the las parameter, which I discussed briefly in Section 10.1.

Exercise: Using the command of the previous exercise, add an argument telling R to rotate the text so that it’s always perpendicular to the axes. This can be done with las = 2.

barplot( table(afl.finalists),      # the frequencies
        names.arg = teams,  # the labels 
        las = 2)   # rotate the labels

We’ve fixed the problem, but we’ve created a new one: the axis labels don’t quite fit anymore. To fix this, we have to be a bit cleverer again. A simple fix would be to use shorter names rather than the full name of all teams, and in many situations, that’s probably the right thing to do. However, at other times you really do need to create a bit more space to add your labels. I am not gonna go into detail on how to do that. Just know you can play with the space, if needed or desired.
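For the curious, one common way to create that extra space is to enlarge the bottom margin with par() before drawing. The sketch below types in the counts from the table shown earlier as a named vector called finals (my own name), and the margin size of 10.1 lines is just a guess that you would tune by eye.

```r
# Counts copied from the table(afl.finalists) output shown earlier
finals <- c( Adelaide = 26, Brisbane = 25, Carlton = 26, Collingwood = 28,
             Essendon = 32, Fremantle = 6, Geelong = 39, Hawthorn = 27,
             Melbourne = 28, "North Melbourne" = 28, "Port Adelaide" = 17,
             Richmond = 6, "St Kilda" = 24, Sydney = 26, "West Coast" = 38,
             "Western Bulldogs" = 24 )

# Margins are given as bottom, left, top, right (in lines of text);
# the default is c(5.1, 4.1, 4.1, 2.1), so this only enlarges the bottom
old_par <- par( mar = c(10.1, 4.1, 4.1, 2.1) )

barplot( finals, las = 2 )   # rotated labels now have room

par( old_par )   # put the margins back the way they were
```

Saving the old settings in old_par and restoring them afterwards keeps the larger margin from leaking into every plot you draw later.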

10.7 Histograms

Histograms are one of the simplest and most useful ways of visualising data. They make the most sense when you have an interval or ratio scale (e.g., the afl.margins data from Chapter ??) and what you want to do is get an overall impression of the data.

Most of you probably know how histograms work, since they’re so widely used, but for the sake of completeness, I’ll describe them. All you do is divide up the possible values into bins, and then count the number of observations that fall within each bin. This count is referred to as the frequency of the bin, and is displayed as a bar: in the AFL winning margins data, there are 33 games in which the winning margin was less than 10 points. Drawing this histogram in R is pretty straightforward. The function you need to use is called hist(), and it has pretty reasonable default settings.

Exercise: Create a histogram of afl.margins using the hist() function.

hist( x = afl.margins )

Although this image would need a lot of cleaning up in order to make a good presentation graphic (i.e., one you’d include in a report), it nevertheless does a pretty good job of describing the data. In fact, the big strength of a histogram is that (properly used) it does show the entire spread of the data, so you can get a pretty good sense about what it looks like. The downside to histograms is that they aren’t very compact: unlike some of the other plots I’ll talk about, it’s hard to cram 20-30 histograms into a single image without overwhelming the viewer.

The main subtlety that you need to be aware of when drawing histograms is determining where the breaks that separate bins should be located, and (relatedly) how many breaks there should be. In the histogram you just created, you can see that R has made pretty sensible choices all by itself: the breaks are located at 0, 10, 20, … 120, which is exactly what I would have done had I been forced to make a choice myself. On the other hand, consider the two histograms in Figure 10.7 and 10.8, which I produced using the following two commands:

hist( x = afl.margins, breaks = 3 )

Figure 10.7: A histogram with too few bins

hist( x = afl.margins, breaks = 0:116 )

Figure 10.8: A histogram with too many bins

In Figure 10.8, the bins are only 1 point wide. As a result, although the plot is very informative (it displays the entire data set with no loss of information at all!) the plot is very hard to interpret and feels quite cluttered. On the other hand, the plot in Figure 10.7 has a bin width of 50 points, and has the opposite problem: it’s very easy to “read” this plot, but it doesn’t convey a lot of information. One gets the sense that this histogram is hiding too much. In short, the way in which you specify the breaks has a big effect on what the histogram looks like, so it’s important to make sure you choose the breaks sensibly. In general, R does a pretty good job of selecting the breaks on its own, since it makes use of some quite clever tricks that statisticians have devised for automatically selecting the right bins for a histogram, but nevertheless, it’s usually a good idea to play around with the breaks a bit to see what happens.

There is one fairly important thing to add regarding how the breaks argument works. There are two different ways you can specify the breaks. You can either specify how many breaks you want (which is what I did when I typed breaks = 3) and let R figure out where they should go, or you can provide a vector that tells R exactly where the breaks should be placed (which is what I did when I typed breaks = 0:116). The behaviour of the hist() function is slightly different depending on which version you use. If all you do is tell it how many breaks you want, R treats it as a “suggestion”, not as a demand. It assumes you want “approximately 3” breaks, but if it doesn’t think that this would look very pretty on screen, it picks a different (but similar) number. It does this for a sensible reason – it tries to make sure that the breaks are located at sensible values (like 10) rather than stupid ones (like 7.224414). And most of the time R is right: usually, when a human researcher says “give me 3 breaks”, he or she really does mean “give me approximately 3 breaks, and don’t put them in stupid places”. However, sometimes R is dead wrong. Sometimes you really do mean “exactly 3 breaks”, and you know precisely where you want them to go. So you need to invoke “real person privilege”, and order R to do what it’s bloody well told. In order to do that, you have to input the full vector that tells R exactly where you want the breaks. If you do that, R will go back to behaving like the nice little obedient calculator that it’s supposed to be. Good boy!
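You can check which breaks R actually ended up using without even drawing the plot, by setting plot = FALSE and inspecting the result. In this sketch, fake_margins is simulated data standing in for afl.margins (an assumption so that the example runs by itself), and the break locations 0, 40, 80 and 120 are arbitrary.

```r
# Simulated stand-in for afl.margins: 176 made-up margins between 0 and 116
set.seed(1)
fake_margins <- sample( 0:116, size = 176, replace = TRUE )

# A single number is only a suggestion
suggested <- hist( fake_margins, breaks = 3, plot = FALSE )

# A vector of locations is an order
exact <- hist( fake_margins, breaks = c(0, 40, 80, 120), plot = FALSE )

suggested$breaks   # wherever R decided to put them
exact$breaks       # exactly where we told R to put them
```

Note that when you supply the locations yourself, the breaks have to cover the whole range of the data, otherwise hist() will complain.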

10.7.1 Adding stuff

This will be the most meaningless plot you will have ever encountered in your statistics education, but it serves an important goal: showing that the lines() and points() functions don’t just work with the plot() function, but with several other functions, including hist().

hist( x = afl.margins )
lines( x = c(5,50),   # the horizontal locations
       y = c(20,1),   # the vertical locations
       lwd = 2         # line width
)

Figure 10.9: A histogram with a line

10.7.2 Visual style of your histogram

Okay, so at this point, we can draw a basic histogram, and we can alter the number and even the location of the breaks. However, the visual style of the histograms shown could stand to be improved. We can fix this by making use of some of the other arguments to the hist() function. Most of the things you might want to try doing have already been covered in Section 10.1, such as main and las and the like, but there are several other new things you could do. I will, however, only discuss one.

One important argument is the labels argument, which lets you attach labels to each of the bars. The simplest way to do this is to set labels = TRUE, in which case R will add a number just above each bar, that number being the exact number of observations in the bin. Alternatively, you can choose the labels yourself, by inputting a vector of strings, e.g., labels = c("label 1","label 2","etc"), but we won’t use that for now. We will see labels at work below.

10.7.3 A histogram is more than just a plot.

In some sense, it is more like a pet: You can give your histogram a name! (Well, you can do that for other plot types as well, but there it is not as useful.)

q <- hist( afl.margins)

Besides doing that for affectionate reasons, there is actually quite a good reason to do so. Let’s see what it holds for us:

class(q)
## [1] "histogram"
q 
## $breaks
##  [1]   0  10  20  30  40  50  60  70  80  90 100 110 120
## 
## $counts
##  [1] 38 23 27 23 21 14 10  7  6  3  3  1
## 
## $density
##  [1] 0.0215909091 0.0130681818 0.0153409091 0.0130681818 0.0119318182
##  [6] 0.0079545455 0.0056818182 0.0039772727 0.0034090909 0.0017045455
## [11] 0.0017045455 0.0005681818
## 
## $mids
##  [1]   5  15  25  35  45  55  65  75  85  95 105 115
## 
## $xname
## [1] "afl.margins"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

Drawing a histogram is more than just making a plot. R did some computations for us by grouping data in bins and counting. The histogram shows the results visually, but q also shows the numbers that R computed to draw the histogram, including the breakpoints and the counts. The counts bit shows the numbers we would see when we set labels to TRUE. Further, note that, conveniently, our little histogram q is of the class histogram.

hist( afl.margins, labels = TRUE)
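To make the "grouping and counting" a bit less mysterious, here is a sketch that redoes the counting by hand with cut() and table(), and checks that it agrees with what hist() stored. Again, fake_margins is simulated data standing in for afl.margins, and the names h, bins and manual_counts are my own.

```r
# Simulated stand-in for afl.margins (hypothetical data)
set.seed(42)
fake_margins <- sample( 0:116, size = 176, replace = TRUE )

# Let hist() do its computations, without drawing anything
h <- hist( fake_margins, plot = FALSE )

# Do the same by hand: sort each value into a bin, then count per bin.
# By default hist() uses bins that include their right end point,
# and lets the first bin include its left end point as well.
bins <- cut( fake_margins, breaks = h$breaks,
             right = TRUE, include.lowest = TRUE )
manual_counts <- as.vector( table(bins) )

manual_counts
h$counts
```

The two sets of counts should be identical, which also tells you that a value sitting exactly on a break (say, a margin of 10) is counted in the bin to its left.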

10.7.4 Two types of histograms

While most arguments of the drawing functions speak for themselves, hist() has one important argument that deserves a bit of explanation: freq. It takes the values TRUE and FALSE, and (typically; there is a bit more nuance I am glossing over) defaults to TRUE. Let’s see what it does:

hist( afl.margins, freq = TRUE)

hist( afl.margins, freq = FALSE)

Both histograms have exactly the same shape, but they are different. Let’s bring out the differences. To do so, we make use of the labels argument.

hist( afl.margins, freq = TRUE, labels = TRUE)

hist( afl.margins, freq = FALSE, labels = TRUE)

It turns out that when freq = TRUE, R plots counts or frequencies, and when it is FALSE it does not. But what does it plot when freq = FALSE? The answer is densities, as can be seen from this:

q$density
##  [1] 0.0215909091 0.0130681818 0.0153409091 0.0130681818 0.0119318182
##  [6] 0.0079545455 0.0056818182 0.0039772727 0.0034090909 0.0017045455
## [11] 0.0017045455 0.0005681818

But what is a density (dichtheid, in Dutch)? For the purposes of this book, the density is nothing but the count divided by the total number of games divided by the space between the breaks.

n <- length(afl.margins) #number of games
sb <- 10 #space between breaks
#sb <- unique(diff(q$mids)) #if you'd like to compute it; but you can ignore this if you wish
q$counts/n/sb
##  [1] 0.0215909091 0.0130681818 0.0153409091 0.0130681818 0.0119318182
##  [6] 0.0079545455 0.0056818182 0.0039772727 0.0034090909 0.0017045455
## [11] 0.0017045455 0.0005681818
q$density
##  [1] 0.0215909091 0.0130681818 0.0153409091 0.0130681818 0.0119318182
##  [6] 0.0079545455 0.0056818182 0.0039772727 0.0034090909 0.0017045455
## [11] 0.0017045455 0.0005681818

This description of density is somewhat incomplete, because there can be cases where there are different spaces between breaks. But you can ignore that subtlety. Just remember that the density is the relative frequency, taking into account bin width.
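If you are curious about that subtlety anyway, the sketch below uses bins of different widths and divides each count by the width of its own bin. The data in fake_margins are simulated (not the real afl.margins) and the break locations are arbitrary.

```r
# Simulated stand-in for afl.margins (hypothetical data)
set.seed(7)
fake_margins <- sample( 0:116, size = 176, replace = TRUE )

# Bins of unequal width: 10, 20, 30 and 60 points wide
h <- hist( fake_margins, breaks = c(0, 10, 30, 60, 120), plot = FALSE )

n <- sum(h$counts)          # number of observations
widths <- diff(h$breaks)    # the width of each bin
manual_density <- h$counts / n / widths

all.equal( manual_density, h$density )   # same thing as R computed
sum( h$density * widths )                # the bar areas add up to 1
```

The last line shows why densities are handy: whatever the bin widths, the total area of the bars is always 1.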

10.8 Stem and leaf plots

Histograms are one of the most widely used methods for displaying the observed values for a variable. They’re simple, pretty, and very informative. However, they do take a little bit of effort to draw. Sometimes it can be quite useful to make use of simpler, if less visually appealing, options. One such alternative is the stem and leaf plot. To a first approximation, you can think of a stem and leaf plot as a kind of text-based histogram. Stem and leaf plots aren’t used as widely these days as they were 30 years ago since it’s now just as easy to draw a histogram as it is to draw a stem and leaf plot. Not only that, they don’t work very well for larger data sets. As a consequence, you probably won’t have as much of a need to use them yourself, though you may run into them in older publications. These days, the only real-world situation where I use them is if I have a small data set with 20-30 data points and I don’t have a computer handy, because it’s pretty easy to quickly sketch a stem and leaf plot by hand.

With all that as background, let us have a look at stem and leaf plots. The AFL margins data contains 176 observations, which is at the upper end for what you can realistically plot this way. The function in R for drawing stem and leaf plots is called stem().

Exercise: Draw a stem and leaf plot of the afl.margins data using the stem() function.

stem( afl.margins )

The values to the left of the | are called stems and the values to the right are called leaves. If you just look at the shape that the leaves make, you can see something that looks a lot like a histogram made out of numbers, just rotated by 90 degrees. But if you know how to read the plot, there’s quite a lot of additional information here. In fact, it’s also giving you the actual values of all of the observations in the data set. To illustrate, let’s have a look at the last line in the stem and leaf plot, namely 11 | 6. Specifically, let’s compare this to the largest value of the afl.margins data set:

max( afl.margins )
## [1] 116

Hm… 11 | 6 versus 116. Obviously, the stem and leaf plot is trying to tell us that the largest value in the data set is 116. Similarly, when we look at the line that reads 10 | 148, the way to interpret it is that the stem and leaf plot is telling us that the data set contains observations with values 101, 104 and 108. Finally, when we see something like 5 | 00002233445556667, the four 0s in the stem and leaf plot are telling us that there are four observations with value 50.

So that’s how we should interpret the mysterious message The decimal point is 1 digit(s) to the right of the |. It means that 11 | 6 should be read as 116 with the decimal point after the 6, so it corresponds to 116.

So if our data set had included only the numbers .11, .15, .23, .35 and .59 and we’d drawn a stem and leaf plot of these data, then R would move the decimal point: the stem values would be 1,2,3,4 and 5, but R would tell you that the decimal point has moved to the left of the | symbol.

Exercise: If you want to see this in action, try the following command: stem( x = afl.margins / 1000 )

stem( x = afl.margins / 1000 )

The stem and leaf plot itself looks identical to the original one we drew, except for the fact that R tells you that the decimal point has moved. So that’s how we should interpret the mysterious message The decimal point is 2 digit(s) to the left of the |. Here, 11 | 6 should be read as 116 with the decimal point before the first 1, so 0.116. Which is exactly what max(afl.margins / 1000) is.
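You can also try the five-number example mentioned earlier directly. In this sketch, small is simply the vector of those five values, and the second call multiplies them by 100 so that you can compare where R says the decimal point is.

```r
# The five values from the example in the text
small <- c( .11, .15, .23, .35, .59 )

stem( small )         # read the message about the decimal point
stem( small * 100 )   # same leaves, but the decimal point has moved
```

Since stem() only prints text, nothing is drawn in the plot window: the whole "plot" appears in the console.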

10.9 Boxplots

Another alternative to histograms is a boxplot, sometimes called a “box and whiskers” plot. Like histograms, they’re most suited to interval or ratio scale data. The idea behind a boxplot is to provide a simple visual depiction of the median, the interquartile range, and the range of the data. And because they do so in a fairly compact way, boxplots have become a very popular statistical graphic, especially during the exploratory stage of data analysis when you’re trying to understand the data yourself. Let’s have a look at how they work, again using the afl.margins data as our example. You will again see that whipping up a basic version is almost embarrassingly easy.

The easiest way to describe what a boxplot looks like is just to draw one. The function for doing this in R is (surprise, surprise) boxplot(). As always there’s a lot of optional arguments that you can specify if you want, but for the most part, you can just let R choose the defaults for you. That said, we’re going to override one of the defaults to start with by specifying the range argument, but for the most part, you won’t want to do this (I’ll explain why in a minute).

Exercise: Create a boxplot of afl.margins using the boxplot() function, and specify range = 100.

boxplot( x = afl.margins, range = 100 )

What R draws is the most basic boxplot possible. When you look at this plot, this is how you should interpret it: the thick line in the middle of the box is the median; the box itself spans the range from the 25th percentile to the 75th percentile; and the “whiskers” cover the full range from the minimum value to the maximum value.

In practice, this isn’t quite how boxplots usually work. In most applications, the “whiskers” don’t cover the full range from minimum to maximum. Instead, they actually go out to the most extreme data point that doesn’t exceed a certain bound. By default, this bound lies 1.5 times the interquartile range beyond the edge of the box, corresponding to a range value of 1.5. Any points that fall outside that bound are plotted separately.

Exercise: Create a boxplot of afl.margins using the boxplot() function, and use the default value for range, which is 1.5.

boxplot( afl.margins ) #I don't need to specify range if I want to use its default value

For our AFL margins data, there is one observation (a game with a margin of 116 points) that falls outside this range. As a consequence, the upper whisker is pulled back to the next largest observation (a value of 108), and the observation at 116 is plotted as a circle.
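If you want the numbers behind the picture, the boxplot.stats() function returns them without drawing anything; its coef argument plays the same role as range does in boxplot(). The sketch below uses a small made-up vector called toy (not the AFL data), chosen so that the arithmetic is easy to follow.

```r
# A small made-up data set with one suspiciously large value
toy <- c( 1, 2, 3, 4, 5, 6, 7, 8, 9, 30 )

# Default bound: 1.5 times the interquartile range beyond the box
default_stats <- boxplot.stats( toy )
default_stats$stats   # lower whisker, box edge, median, box edge, upper whisker
default_stats$out     # points that would be plotted separately

# A huge bound, as with range = 100: the whiskers reach the extremes
wide_stats <- boxplot.stats( toy, coef = 100 )
wide_stats$stats
wide_stats$out
```

Here the box runs from 3 to 8, so the upper bound is 8 + 1.5 * 5 = 15.5. The value 30 lies beyond it, which is why the upper whisker stops at 9 with the default setting and only reaches 30 when the bound is made very large.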

10.9.1 Visual style of your boxplot

Boxplots in R are extremely customisable. In addition to the usual range of graphical parameters that you can tweak to make the plot look nice, you can also exercise nearly complete control over every element of the plot. The only thing that I want to say about it is that, if you ever need it, you should (obviously) consult the help file for the boxplot() function, but also that of the bxp() function, which does most of the heavy lifting. Most arguments that are described in the help file for bxp() can be used when calling the boxplot() function.

10.10 curve()

The curve() function is part of a set of special creatures with some unprepossessing features, and part of me would like to just ignore it and keep it in its cage. (Toss, toss!) Another part of me, however, is eager to show it to you, and that part has two reasons. One: it shows that R can be messy sometimes, which should help you appreciate the countless other times it behaves exactly like you as a naive user would expect. And two: it is a very useful function to know about. So here goes (and here comes trouble):

rm(x) #let's be sure we don't have an x stored somewhere
curve(x^2, from = -10, to = 10)

You what? Some of this should not come as a surprise. We see a, well, curve, running from -10 to 10, as asked. What is baffling is that we didn’t provide any values to plot! We told R to plot the square of x, but we did not provide any x whatsoever. This must be the work of the devil.

Compare this to how the plot function works:

plot(x^2, xlim=c(-10, 10))
## Error in plot(x^2, xlim = c(-10, 10)): object 'x' not found

No x no glory, so R doesn’t plot a thing.

To make the same stuff work with the plot() function, we need to tell R what x is, and then plot x^2 against it:

x <- seq(from = -10, to = 10, by = 0.01) 
plot(x, x^2, type = "l")   # type = "l" draws a line, like curve() does

So curve() is one of those special functions that can work with x without needing to know x. Magix!

While this is wildly flexible, there are some limits to the craziness curve() can handle. This, for example, is a no-go:

curve(y^2, xlim=c(-10, 10))
## Error in curve(y^2, xlim = c(-10, 10)): 'expr' must be a function, or a call or an expression containing 'x'

The magix only worx with xs.
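A peek behind the curtain takes some of the magic away. Just like hist(), curve() hands back the numbers it used, so we can store them and look. In this sketch, pts is my own name for that result.

```r
# Store what curve() computed while drawing
pts <- curve( x^2, from = -10, to = 10 )

length( pts$x )    # how many x values curve() made up for us
range( pts$x )     # they run from 'from' to 'to'
all.equal( pts$y, pts$x^2 )   # and y is the expression evaluated at those x values

# curve() can also add to an existing plot, like lines() does
curve( 50 + 0 * x, from = -10, to = 10, add = TRUE, lty = 2 )
```

So curve() does need x values after all: it simply creates its own evenly spaced grid between from and to (101 values unless you change the n argument) and plugs that grid into your expression.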

10.11 Using more specialized functions

These are just the most basic graphics functions in R. Much more is available, often through packages. Let’s take a glimpse at what else you can do with R, just to pique your interest.

In Section XXX, I just wanted to show you how you can draw lines and points in a scatterplot. For the sake of illustration, I guesstimated where the line should be. In most realistic data analysis situations, you absolutely don’t want to just guess where the line through the points goes, since there are about a billion different ways in which you can get R to do a better job. However, it does at least illustrate the basic idea.

One possibility, if you do want to get R to draw nice clean lines through the data for you, is to use the scatterplot() function in the car package. Before we can use scatterplot() we need to load the package:

library( car )

Having done so, we can now use the function. The command we need is this one:

scatterplot( dan.grump ~ dan.sleep,
              data = parenthood, 
              smooth = FALSE
)

Figure 10.10: A fancy scatterplot drawn using the scatterplot() function in the car package.

The first two arguments should be familiar: the first input is a formula dan.grump ~ dan.sleep telling R what variables to plot,47 and the second specifies a data frame. The third argument, smooth, I’ve set to FALSE to stop the scatterplot() function from drawing a fancy “smoothed” trendline (since it’s a bit confusing to beginners). The scatterplot itself is shown in Figure 10.10. As you can see, it has not only drawn the scatterplot, but it’s also drawn boxplots for each of the two variables, as well as a simple line of best fit showing the relationship between the two variables.
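If you don’t have the car package at hand, a line of best fit can also be added with base R alone. The sketch below is one possible way, not a description of what scatterplot() does under the hood: it fits a straight line with lm() and draws it with abline(). The sleep and grump vectors are made-up numbers, purely for illustration, and are not the parenthood data.

```r
# Made-up data, purely for illustration (not the parenthood data)
sleep <- c(5, 6, 7, 8, 9)        # hours of sleep
grump <- c(80, 70, 65, 50, 45)   # grumpiness score

# Fit a straight line: grump as a linear function of sleep
fit <- lm(grump ~ sleep)
coef(fit)                        # intercept and slope of the fitted line

# Draw the scatterplot and add the fitted line on top
plot(sleep, grump, xlab = "Hours of sleep", ylab = "Grumpiness")
abline(fit)
```

No boxplots in the margins this way, but also no guesstimating where the line should go.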

10.12 Moving on

One final thing to point out. There’s no easy way to tell you this, but R has several completely distinct systems for drawing figures. In this chapter, I’ve focused on the traditional graphics system. It’s the easiest one to get started with: you can draw a histogram with a command as simple as hist(x). A single high-level command is capable of drawing an entire graph, complete with a range of customisation options. Most but not all of the high-level commands that I’ll talk about in this book come from the graphics package itself, and so belong to the world of traditional graphics. These commands all tend to share a common visual style.

However, it’s not the most powerful tool for the job, and after a while, most R users start looking to shift to fancier systems. On the other side of the great divide, people rely heavily on two different packages – lattice and ggplot2 – each of which provides a quite different visual style. As you’ve probably guessed, there’s a whole separate bunch of functions that you’d need to learn if you want to use lattice graphics or ggplot2. Of these two, probably the most popular graphics system is the one provided by the ggplot2 package. It’s not for novices: you need to have a pretty good grasp of R before you can start using it, and even then it takes a while to really get the hang of it. But when you’re finally at that stage, it’s worth taking the time to teach yourself, because it’s a much cleaner system for producing high quality graphs.

At this point, I think we’ve covered more than enough background material. The point that I’m trying to make by providing this discussion isn’t to scare you with all these horrible details, but rather to try to convey to you the fact that R doesn’t really provide a single coherent graphics system. Instead, R itself provides a platform, and different people have built different graphical tools using that platform. As a consequence of this fact, there are different universes of graphics and a great multitude of packages that live in them. At this stage, you don’t need to understand these complexities, but it’s useful to know that they’re there.

11 Hey, R! Let me do some statistics

By now, you almost know the basics of R. I didn’t talk about statistics much, apart from the visualisation side of it. There are a bunch of built-in R functions that are ideally suited for doing statistical analyses. One of those was the mean() function to, well, compute the mean. I have collected the most commonly used statistical functions in Table XXX. These functions are relatively easy to use, with little nuts and bolts, so I am not going over them in detail. If you do need to know these details — knowing what the default values for the arguments are comes to mind, or knowing the exact definition — remember that you can find out yourself by trial-and-error, by reading the R help (e.g., ?sd) or by looking for help online.

Statistic             R function
mean                  mean()
median                median()
range                 range()
quantile              quantile()
interquartile range   IQR()
standard deviation    sd()
variance              var()
covariance            cov()
correlation           cor()

While both the meaning and the usage of these functions are pretty self-explanatory, there are a couple of things I’d like to draw your attention to.

11.1 Fooled by easiness

Table XXX should show how easy it is to use R to do a lot of useful statistical things, like computing means, medians and the like. There is, however, a sense in which things have become almost too easy, in that they hide what R is actually doing. So you might think you know what R is doing, but in reality you don’t. I present two examples.

First, the variance (and its close square rooted cousin, the standard deviation). You probably think you know what R does when computing the variance, but you might be wrong. In the var() function, R has chosen to divide by \(n-1\) rather than by \(n\). Of course, if for some reason you would need the divide-by-n version, that’s easy to compute, using (n-1)*var(x)/n.
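Here is a small sketch of that conversion. The vector x holds made-up numbers and n is simply the number of observations in it; neither comes from the data sets used in this book.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up numbers
n <- length(x)                   # number of observations

var(x)                           # R's version: divides by n - 1
(n - 1) * var(x) / n             # divide-by-n version
mean((x - mean(x))^2)            # the same thing, computed from the definition
```

The last two lines should give the same number, and it should be a bit smaller than what var() returns.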

The more general point is that you can’t just assume you know what R is doing. To be fair, R makes it explicit in the help file. If you type ?var and scroll down a bit, you will read somewhere: “The denominator n-1 is used which gives an unbiased estimator of the (co)variance for i.i.d. observations.” In this particular case, you simply need to remember that this is how R does it.

Second, the quantile. If you bother to read the associated help file, you will see there are no fewer than 9 different algorithms in R that could be used in quantile(). I am not going to explain what the differences are (from lack of knowledge and desire), but if at some point you do care about these different versions, you need to know which one you want and which one R is using. Strictly FYI: if you don’t make a choice, R defaults to type 7, whatever that may be; and what you have learned in Statistics 1 was type 2.
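To see that the choice of algorithm can matter, here is a minimal sketch with four made-up numbers. The type argument of quantile() selects the algorithm; everything else is left at its default.

```r
x <- c(1, 2, 3, 4)                        # made-up numbers

quantile(x, probs = 0.25)                 # default, i.e. type = 7
quantile(x, probs = 0.25, type = 7)       # same thing, spelled out
quantile(x, probs = 0.25, type = 2)       # a different algorithm, a different answer
```

With a data set this tiny, the two types disagree about the first quartile. With larger data sets the differences tend to be small, but they are there.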

11.2 Keeping it real

If one day, you forget that you can compute the median using median(), for example because you are now deeply ingrained with the knowledge that with R, things are never that simple, here are two other ways you can compute it. If you don’t know why this works, look up the definition of quantiles and median in your Statistics course. (The last way only works as written when the number of observations is even, which is the case here.)

median( afl.margins )
## [1] 30.5
quantile( afl.margins, probs = .5)
##  50% 
## 30.5
mean(c(sort(afl.margins)[length(afl.margins)/2],sort(afl.margins)[length(afl.margins)/2+1]))
## [1] 30.5

Similarly, what was the name of the function again to compute the range? It’s on the tip of my tongue, but it keeps escaping me! Well, I’ll just compute it myself.

range(afl.margins)
## [1]   0 116
c(min(afl.margins),max(afl.margins))
## [1]   0 116

And sorry, how on earth am I supposed to remember the name of the function for that interquartile range? But I can pull this off by myself, no worries.

IQR(afl.margins)
## [1] 37.75
quantile( afl.margins, probs = .75)-quantile( afl.margins, probs = .25)
##   75% 
## 37.75

So there’s nothing magical or even smart going on in most of these functions. They are just developed to make your life a little easier.

11.3 There’s more than meets the eye

While all these functions are pleasantly simple, they run the risk of being deceptively simple, in that they might hide all the cool stuff you can do with them. As an illustration, cor() can be used to compute a correlation, but there’s more you can do with it!

For example, you can also calculate a complete “correlation matrix”, between all pairs of variables in the data frame (at least if all these variables are numeric).48

Exercise: Correlate all pairs of variables in parenthood by providing the whole data set as input to the cor() function.

cor( parenthood )  

As another example, it doesn’t just compute the Pearson correlation (i.e., that stuff we have been calling correlation, colloquially). For ordinal data, the Pearson correlation is not useful, and you should use, for example, Spearman’s rank-order correlation instead. We can calculate Spearman’s \(\rho\) using R: it only involves specifying the method argument of the cor() function. I won’t illustrate that here. This is just to pique your interest and to instill the idea that these functions often can do more than meets the eye. The default value of the method argument is "pearson", which is why we didn’t have to specify it earlier on when we were doing Pearson correlations.

11.4 Doing Everything Everywhere All at Once

Up to this point in the chapter, I’ve mentioned several different summary statistics that are commonly used when analysing data, along with specific functions that you can use in R to calculate each one. However, it’s kind of annoying to have to separately calculate means, medians, standard deviations, etc. Wouldn’t it be nice if R had some helpful functions that would do all these tedious calculations at once? Something like summary(), perhaps? Well yes, yes it would. So much so that this function exists. The summary() function is in the base package, which means that it comes with every installation of R.

The summary() function is an easy thing to use, but a tricky thing to understand in full, since it’s a generic function (see Section 8.8.1), meaning that its behaviour changes depending on what kind of input you give it. The basic idea behind the summary() function is that it prints out some useful information about whatever object (i.e., a variable, as far as we’re concerned) you specify as the object argument. As a consequence, the behaviour of the summary() function differs quite dramatically depending on the class of the object that you give it. Which is a feature not a bug, since what is interesting or even relevant about you isn’t necessarily interesting or relevant about me.

Let’s start by giving it a numeric object.

Exercise: Summarize the numeric object afl.margins using the summary() function.

summary( afl.margins ) 

For numeric variables, we get a whole bunch of useful descriptive statistics. It gives us the minimum and maximum values (i.e., the range), the first and third quartiles (25th and 75th percentiles; i.e., the IQR), the mean and the median. In other words, it gives us a pretty good collection of descriptive statistics related to the central tendency and the spread of the data.

Okay, what about if we feed it a logical vector instead? Let’s say I want to know something about how many “blowouts” there were in the 2010 AFL season. I operationalise the concept of a blowout (see Chapter ??) as a game in which the winning margin exceeds 50 points. Let’s create a logical variable blowouts in which the i-th element is TRUE if the i-th game was a blowout according to my definition:

blowouts <-  afl.margins > 50
head(blowouts)
## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE

So that’s what the blowouts variable looks like. Now let’s ask R for a summary():

summary( blowouts )
##    Mode   FALSE    TRUE 
## logical     132      44

In this context, the summary() function gives us a count of the number of TRUE values and the number of FALSE values. Pretty reasonable behaviour.
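If you only want one of these counts, or a proportion, you can get there without summary(). The sketch below relies on the fact that R treats TRUE as 1 and FALSE as 0 when you do arithmetic on a logical vector. The margins vector holds six made-up numbers and blow is a made-up name; neither is the real AFL data.

```r
margins <- c(56, 31, 56, 8, 32, 14)   # made-up winning margins
blow <- margins > 50                  # TRUE where the margin exceeds 50

sum(blow)     # number of TRUE values (TRUE counts as 1, FALSE as 0)
mean(blow)    # proportion of TRUE values
table(blow)   # counts of FALSE and TRUE, much like summary() reports
```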

Next, let’s try to give it a factor. If you recall, I’ve defined the afl.finalists vector as a factor, so let’s use that.

Exercise: Summarize the factor object afl.finalists using the summary() function.

summary( afl.finalists )

For factors, we get a frequency table, just like we got when we used the table() function.

Exercise: Interestingly, however, if we convert this to a character vector using the as.character() function (see Section 6.9.3), we don’t get the same results. Do try.

f2 <- as.character( afl.finalists )
summary( f2 )

Not really useful, but thanks anyway. Because I’ve defined afl.finalists as a factor, R knows that it should treat it as a nominal scale variable, and so it gives you a much more detailed (and helpful) summary than it would have if I’d left it as a character vector.
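To recap how summary() adapts to what you feed it, here is a small sketch with made-up objects (num, lgl, fac and chr are names I invented for the occasion). The class() function tells you what kind of object you have, and that class is what summary() looks at when deciding what to report.

```r
# Four made-up objects of different classes
num <- c(1, 2, 3, 4)
lgl <- c(TRUE, FALSE, TRUE, TRUE)
fac <- factor(c("a", "b", "a"))
chr <- as.character(fac)

class(num); class(lgl); class(fac); class(chr)   # what kind of object is each one?

summary(num)   # descriptive statistics
summary(lgl)   # counts of FALSE and TRUE
summary(fac)   # frequency table of the levels
summary(chr)   # not much at all
```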

11.4.1 “Summarising” a data frame

Okay, what about data frames? When you pass a data frame to the summary() function, it produces a slightly condensed summary of each variable inside the data frame.

To give you a sense of how this can be useful, let’s try this for a new data set, one that you’ve never seen before. The data is stored in the clinicaltrial.Rdata file. Let’s see what we’ve got:

print(clin.trial)
##        drug    therapy mood.gain
## 1   placebo no.therapy       0.5
## 2   placebo no.therapy       0.3
## 3   placebo no.therapy       0.1
## 4  anxifree no.therapy       0.6
## 5  anxifree no.therapy       0.4
## 6  anxifree no.therapy       0.2
## 7  joyzepam no.therapy       1.4
## 8  joyzepam no.therapy       1.7
## 9  joyzepam no.therapy       1.3
## 10  placebo        CBT       0.6
## 11  placebo        CBT       0.9
## 12  placebo        CBT       0.3
## 13 anxifree        CBT       1.1
## 14 anxifree        CBT       0.8
## 15 anxifree        CBT       1.2
## 16 joyzepam        CBT       1.8
## 17 joyzepam        CBT       1.3
## 18 joyzepam        CBT       1.4

There’s a single data frame called clin.trial which contains three variables, drug, therapy and mood.gain. Presumably, this data is from a clinical trial of some kind, in which people were administered different drugs; and the researchers looked to see what the drugs did to their mood. Let’s see if the summary() function sheds a little more light on this situation.

Exercise: Summarize clin.trial using the summary() function.

summary( clin.trial )

Evidently, there were three drugs: a placebo, something called “anxifree” and something called “joyzepam”; and there were 6 people administered each drug. There were 9 people treated using cognitive behavioural therapy (CBT) and 9 people who received no psychological treatment. And we can see from looking at the summary of the mood.gain variable that most people did show a mood gain (mean \(=.88\)), though without knowing what the scale is here, it’s hard to say much more than that. Still, that’s not too bad. Overall, I feel that I learned something from that.
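In case you don’t have the clinicaltrial.Rdata file at hand, the data frame is small enough to rebuild by hand from the print-out above. The sketch below does that, assuming that drug and therapy are stored as factors, and then double-checks the numbers mentioned in the text.

```r
# Rebuild the printed data by hand (values copied from the print-out)
clin.trial <- data.frame(
  drug      = factor(rep(rep(c("placebo", "anxifree", "joyzepam"), each = 3), times = 2)),
  therapy   = factor(rep(c("no.therapy", "CBT"), each = 9)),
  mood.gain = c(0.5, 0.3, 0.1, 0.6, 0.4, 0.2, 1.4, 1.7, 1.3,
                0.6, 0.9, 0.3, 1.1, 0.8, 1.2, 1.8, 1.3, 1.4)
)

table(clin.trial$drug)        # how many people per drug
table(clin.trial$therapy)     # how many people per therapy type
mean(clin.trial$mood.gain)    # overall mean mood gain
```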

11.4.2 Descriptive statistics separately for each group

Let’s say we want to look at the descriptive statistics for the clin.trial data, broken down separately by therapy type. Since summary() is just another function (but probably one of the hardest working functions in the R business), we can use the strategies discussed in Section XXX.

First, by by():

by( data=clin.trial, INDICES=clin.trial$therapy, FUN=summary )
## clin.trial$therapy: CBT
##        drug         therapy    mood.gain    
##  anxifree:3   CBT       :9   Min.   :0.300  
##  joyzepam:3   no.therapy:0   1st Qu.:0.800  
##  placebo :3                  Median :1.100  
##                              Mean   :1.044  
##                              3rd Qu.:1.300  
##                              Max.   :1.800  
## ------------------------------------------------------------ 
## clin.trial$therapy: no.therapy
##        drug         therapy    mood.gain     
##  anxifree:3   CBT       :0   Min.   :0.1000  
##  joyzepam:3   no.therapy:9   1st Qu.:0.3000  
##  placebo :3                  Median :0.5000  
##                              Mean   :0.7222  
##                              3rd Qu.:1.3000  
##                              Max.   :1.7000

Neat. As you can see, this is simply the output of the summary() function, applied separately to the CBT group and the no.therapy group, so you get the means, medians, etc. for each group. For the two factors (drug and therapy) it prints out a frequency table, whereas for the numeric variable (mood.gain) it prints out the minimum and maximum (i.e., the range), the first and third quartiles, the mean and the median.

For tapply(), we would hope this would do the job:

tapply( X = clin.trial, INDEX = clin.trial$therapy, FUN = summary )
## Error in tapply(X = clin.trial, INDEX = clin.trial$therapy, FUN = summary): arguments must have same length

but alas, this does not work. If we want the information for each therapy type separately, we need to use tapply() twice:

#one time (for mood gain)
tapply( X = clin.trial$mood.gain, INDEX = clin.trial$therapy, FUN = summary )
## $CBT
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.300   0.800   1.100   1.044   1.300   1.800 
## 
## $no.therapy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1000  0.3000  0.5000  0.7222  1.3000  1.7000
#and again a second time (for drug)
tapply( X = clin.trial$drug, INDEX = clin.trial$therapy, FUN = summary )
## $CBT
## anxifree joyzepam  placebo 
##        3        3        3 
## 
## $no.therapy
## anxifree joyzepam  placebo 
##        3        3        3

Similarly for aggregate():

print(aggregate( x = mood.gain ~ therapy,                # mood.gain by therapy 
            data = clin.trial,                     # data is in the clin.trial data frame
            FUN = summary                          # summarise each group
))
##      therapy mood.gain.Min. mood.gain.1st Qu. mood.gain.Median mood.gain.Mean
## 1        CBT      0.3000000         0.8000000        1.1000000      1.0444444
## 2 no.therapy      0.1000000         0.3000000        0.5000000      0.7222222
##   mood.gain.3rd Qu. mood.gain.Max.
## 1         1.3000000      1.8000000
## 2         1.3000000      1.7000000
print(aggregate( x = drug ~ therapy,                     # drug by therapy 
            data = clin.trial,                     # data is in the clin.trial data frame
            FUN = summary                          # summarise each group
))
##      therapy drug.anxifree drug.joyzepam drug.placebo
## 1        CBT             3             3            3
## 2 no.therapy             3             3            3

What if you have multiple grouping variables? Suppose, for example, you would like to look at the descriptives of mood gain separately for all possible combinations of drug and therapy.

Here is one way:

by( data = clin.trial$mood.gain, INDICES = list(clin.trial$drug, clin.trial$therapy), FUN = summary )
## : anxifree
## : CBT
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   0.950   1.100   1.033   1.150   1.200 
## ------------------------------------------------------------ 
## : joyzepam
## : CBT
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.30    1.35    1.40    1.50    1.60    1.80 
## ------------------------------------------------------------ 
## : placebo
## : CBT
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.30    0.45    0.60    0.60    0.75    0.90 
## ------------------------------------------------------------ 
## : anxifree
## : no.therapy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.2     0.3     0.4     0.4     0.5     0.6 
## ------------------------------------------------------------ 
## : joyzepam
## : no.therapy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.300   1.350   1.400   1.467   1.550   1.700 
## ------------------------------------------------------------ 
## : placebo
## : no.therapy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.1     0.2     0.3     0.3     0.4     0.5

And here’s another:

print(aggregate(  x = mood.gain ~ drug + therapy,  # mood.gain by drug/therapy combination
            data = clin.trial,                     # data is in the clin.trial data frame
            FUN = summary                             #make a summary
))
##       drug    therapy mood.gain.Min. mood.gain.1st Qu. mood.gain.Median
## 1 anxifree        CBT       0.800000          0.950000         1.100000
## 2 joyzepam        CBT       1.300000          1.350000         1.400000
## 3  placebo        CBT       0.300000          0.450000         0.600000
## 4 anxifree no.therapy       0.200000          0.300000         0.400000
## 5 joyzepam no.therapy       1.300000          1.350000         1.400000
## 6  placebo no.therapy       0.100000          0.200000         0.300000
##   mood.gain.Mean mood.gain.3rd Qu. mood.gain.Max.
## 1       1.033333          1.150000       1.200000
## 2       1.500000          1.600000       1.800000
## 3       0.600000          0.750000       0.900000
## 4       0.400000          0.500000       0.600000
## 5       1.466667          1.550000       1.700000
## 6       0.300000          0.400000       0.500000

And here’s a way that gets you nowhere:

tapply( X = clin.trial$mood.gain, INDEX = list(clin.trial$drug, clin.trial$therapy), FUN = summary )
##          CBT              no.therapy      
## anxifree summaryDefault,6 summaryDefault,6
## joyzepam summaryDefault,6 summaryDefault,6
## placebo  summaryDefault,6 summaryDefault,6

In case you wonder, you can sort of make it work as follows. You only get the numbers, though, without labels telling you what these numbers mean, so it’s not really helpful. (You don’t need to understand, remember or study this; I only included it because it bugs me and might bug you.)

print.table(tapply( X = clin.trial$mood.gain, INDEX = list(clin.trial$drug, clin.trial$therapy), FUN = summary ))
## [1] 0.800000, 0.950000, 1.100000, 1.033333, 1.150000, 1.200000
## [2] 1.30, 1.35, 1.40, 1.50, 1.60, 1.80                        
## [3] 0.30, 0.45, 0.60, 0.60, 0.75, 0.90                        
## [4] 0.2, 0.3, 0.4, 0.4, 0.5, 0.6                              
## [5] 1.300000, 1.350000, 1.400000, 1.466667, 1.550000, 1.700000
## [6] 0.1, 0.2, 0.3, 0.3, 0.4, 0.5
#or
print.listof(tapply( X = clin.trial$mood.gain, INDEX = list(clin.trial$drug, clin.trial$therapy), FUN = summary ))
## Component 1 :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   0.950   1.100   1.033   1.150   1.200 
## 
## Component 2 :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.30    1.35    1.40    1.50    1.60    1.80 
## 
## Component 3 :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.30    0.45    0.60    0.60    0.75    0.90 
## 
## Component 4 :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.2     0.3     0.4     0.4     0.5     0.6 
## 
## Component 5 :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.300   1.350   1.400   1.467   1.550   1.700 
## 
## Component 6 :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.1     0.2     0.3     0.3     0.4     0.5
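As far as I can tell, tapply() struggles here because summary() returns six numbers per group, and six numbers don’t fit into a single cell of a table. With a function that returns just one number per group, such as mean(), tapply() with two grouping variables gives a tidy table. The sketch below rebuilds the clin.trial data by hand from the print-out earlier in this chapter (assuming drug and therapy are factors) so that it runs on its own; cell.means is a made-up name.

```r
# The clin.trial data, rebuilt by hand from the print-out earlier in the chapter
clin.trial <- data.frame(
  drug      = factor(rep(rep(c("placebo", "anxifree", "joyzepam"), each = 3), times = 2)),
  therapy   = factor(rep(c("no.therapy", "CBT"), each = 9)),
  mood.gain = c(0.5, 0.3, 0.1, 0.6, 0.4, 0.2, 1.4, 1.7, 1.3,
                0.6, 0.9, 0.3, 1.1, 0.8, 1.2, 1.8, 1.3, 1.4)
)

# One number per cell, so tapply() can lay the result out as a table
cell.means <- tapply( X = clin.trial$mood.gain,
                      INDEX = list(clin.trial$drug, clin.trial$therapy),
                      FUN = mean )
cell.means    # rows are drugs, columns are therapy types
```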

11.5 Go out and play

If you have done and understood all of this, you can find some additional exercises in the RIntroExercises document. Since you now know how to work in RStudio, why not go there and do them?

12 Endnotes


  1. For advanced users: if you want a table showing the complete order of operator precedence in R, type ?Syntax. I haven’t included it in this book since there are quite a few different operators, and we don’t need that much detail. Besides, in practice most people seem to figure it out from seeing examples: until writing this book I never looked at the formal statement of operator precedence for any language I ever coded in, and never ran into any difficulties.↩︎

  2. Don’t get thrown off by the use of the term function. We’ll get into functions quite soon, and it will all start to make sense. Talking about great prospects!↩︎

  3. Another way to edit variables is to use the edit() or fix() functions. I won’t discuss them in detail right now, but you can check them out on your own.↩︎

  4. A note for the mathematically inclined: R does support complex numbers, but unless you explicitly specify that you want them it assumes all calculations must be real-valued. By default, the square root of a negative number is treated as undefined: sqrt(-9) will produce NaN (not a number) as its output. To get complex numbers, you would type sqrt(-9+0i) and R would now return 0+3i. However, since we won’t have any need for complex numbers in this book, I won’t refer to them again.↩︎

  5. Note that this is a very different operator to the assignment operator = that I talked about in Section 2.3. A common typo that people make when trying to write logical commands in R (or other languages, since the “= versus ==” distinction is important in most programming languages) is to accidentally type = when you really mean ==. Be especially cautious with this – I’ve been programming in various languages since I was a teenager, and I still screw this up a lot. Hm. I think I see why I wasn’t cool as a teenager. And why I’m still not cool.↩︎

  6. It’s also worth checking out the match() function↩︎

  7. I will be using a few words that will totally sound gibberish to your ears, but that’s ok for the point I want to make. So don’t fret.↩︎

  8. More precisely, there are 5000 or so packages on CRAN, the Comprehensive R Archive Network.↩︎

  9. The two functions discussed previously, sqrt() and abs(), both only have a single argument, x. So I could have typed something like sqrt(x = 225) or abs(x = -13) earlier. The fact that all these functions use x as the name of the argument that corresponds to the “main” variable that you’re working with is no coincidence. That’s a fairly widely used convention. Quite often, the writers of R functions will try to use conventional names like this to make your life easier. Or at least that’s the theory. In practice, it doesn’t always work as well as you’d hope.↩︎

  10. Actually, that’s a bit of a lie: the log() function is more flexible than that and can be used to calculate logarithms in any base. The log() function has a base argument that you can specify, which has a default value of \(e\). Thus log10(1000) is actually equivalent to log(x = 1000, base = 10). Note that the calculator you have used for Statistics 1 and 2 uses log() for the base-10 logarithm (and ln() for the base-e logarithm).↩︎

  11. Note for non-Australians: the AFL is an Australian rules football competition. You don’t need to know anything about Australian rules in order to follow this section.↩︎

  12. 6, to be precise↩︎

  13. well, 6, again↩︎

  14. For advanced users: type ?double for more information.↩︎

  15. Or at least, that’s the default. If all your numbers are integers (whole numbers), then you can explicitly tell R to store them as integers by adding an L suffix at the end of the number. That is, an assignment like x <- 2L tells R to assign x a value of 2 and to store it as an integer rather than as a binary expansion. Type ?integer for more details.↩︎

  16. You can choose which panes you see and where by going to View/Panes/Panes layout, but I recommend keeping it in the default setting, as long as you consider yourself a novice.↩︎

  17. For advanced users: yes, as you’ve probably guessed, R is printing out the source code for the function.↩︎

  18. If you’re running R from the terminal rather than from RStudio, escape doesn’t work: use CTRL-C instead.↩︎

  19. Incidentally, that always works: if you’ve started typing a command and you want to clear it and start again, hit escape.↩︎

  20. Here are some more, for those who want that little bit extra: the objects() function, the ls() function, the ls.str() function and the who() function from the lsr package.↩︎

  21. For advanced users: that’s a little over-simplistic in two respects. First, it’s a terribly imprecise way of talking about scoping. Second, it might give you the impression that all the variables in question are actually loaded into memory. That’s not quite true, since that would be very wasteful of memory. Instead, R has a “lazy loading” mechanism, in which what R actually does is create a “promise” to load those objects if they’re actually needed. For details, check out the delayedAssign() function.↩︎

  22. The details about these packages are not something you should study.↩︎

  23. The logit function is a simple mathematical function that happens not to have been included in the basic R distribution.↩︎

  24. Tip for advanced users: See also ::: if you’re especially keen to force R to use functions it otherwise wouldn’t, but take care, since ::: can be dangerous.↩︎

  25. For some reason, if I use x instead of print(x), R will print x using the first approach, but not using the second, source() approach. Don’t worry about that. You have more important things to worry about, like climate change.↩︎

  26. You can do the same using the “Session” or the “File” menu on top of RStudio and choosing Quit Session…, or by using the CTRL+Q shortcut (CMD+Q on a Mac).↩︎

  27. Or functions. But let’s ignore functions for the moment.↩︎

  28. Some users might wonder why R even allows the == operator for factors. The reason is that sometimes you really do have different factors that have the same levels. For instance, if I was analysing data associated with football games, I might have a factor called home.team, and another factor called winning.team. In that situation, I really should be able to ask if home.team == winning.team.↩︎

  29. Well, that’s not the best word, but you know what I mean.↩︎

  30. Note that, when I write out the formula, R doesn’t check to see if the out and pred variables actually exist: it’s only later on when you try to use the formula for something that this happens.↩︎

  31. But in a different way than I used it above, where dropping referred to “not showing”.↩︎

  32. Actually, you can make the subset() function behave this way by using the optional drop argument, but by default subset() does not drop, which is probably more sensible and more intuitive to novice users.↩︎

  33. Specifically, recursive indexing, a handy tool in some contexts but not something that I want to discuss here.↩︎

  34. Conveniently, if you type rownames(df) <- NULL R will renumber all the rows from scratch. For the df data frame, the labels that currently run from 7 to 10 will be changed to go from 1 to 4.↩︎

  35. It’s worth noting that there’s also a more powerful function called recode() in the car package that I won’t discuss in this book but is worth looking into if you’re looking for a bit more flexibility.↩︎

  36. Or a list of such variables, as we will see below.↩︎

  37. I mentioned earlier that print() is not a terribly useful function, but at least it makes itself useful by being a vehicle for demonstrating the concept of a generic function.↩︎

  38. Lexical scope.↩︎

  39. The assign() function.↩︎

  40. Yes.↩︎

  41. As an aside: if there’s only a single command that you want to include inside your loop, then you don’t actually need to bother including the curly braces at all. However, until you’re comfortable programming in R I’d advise always using them, even when you don’t have to.↩︎

  42. Okay, fine. This example is still a bit ridiculous, in three respects. Firstly, the bank absolutely will not let the couple pay less than the amount required to terminate the loan in 30 years. Secondly, a constant interest rate for 30 years is hilarious. Thirdly, you can solve this much more efficiently than through brute force simulation. However, we’re not exactly in the business of being realistic or efficient here.↩︎

  43. You don’t need to understand how I converted the annual percentage interest into a monthly multiplier. But if you care, here’s how. We are looking for the number that you have to multiply the current balance by each month in order to produce an annual interest rate of 5%. An annual interest rate of 5% implies that, if no payments were made over 12 months, the balance would end up being \(1.05\) times what it was originally, so the annual multiplier is \(1.05\). To calculate the monthly multiplier, we need to calculate the 12th root of 1.05 (i.e., raise 1.05 to the power of 1/12). As it happens, this corresponds to a value of about 1.004. All of which is a rather long-winded way of saying that the annual interest rate of 5% corresponds to a monthly interest rate of about 0.4%.↩︎

  44. I should add that this isn’t unique to R. Like everything in R there’s a pretty steep learning curve to learning how to draw graphs, and like always there’s a massive payoff at the end in terms of the quality of what you can produce. But to be honest, I’ve seen the same problems show up regardless of what system people use. I suspect that the hardest thing to do is to force yourself to take the time to think deeply about what your graphs are doing. I say that in full knowledge of the fact that only about half of my graphs turn out as well as they ought to. Understanding what makes a good graph is easy: actually designing a good graph is hard.↩︎

  45. On the off chance that this isn’t enough freedom for you, you can select a colour directly as a “red, green, blue” specification using the rgb() function, or as a “hue, saturation, value” specification using the hsv() function.↩︎

  46. Note that you can get finer grain control over this by specifying the xaxt and yaxt graphical parameters instead.↩︎

  47. You might be wondering why I haven’t specified the argument name for the formula. The reason is that there’s a bug in how the scatterplot() function is written: under the hood there’s one function that expects the argument to be named x and another one that expects it to be called formula. I don’t know why the function was written this way, but it’s not an isolated problem: this particular kind of bug repeats itself in a couple of other functions. The solution in such cases is to omit the argument name: that way, one function “thinks” that you’ve specified x and the other one “thinks” you’ve specified formula and everything works the way it’s supposed to. It’s not a great state of affairs, I’ll admit, but it sort of works.↩︎

  48. An alternative usage of cor() is to correlate one set of variables with another set of variables. If X and Y are both data frames with the same number of rows, then cor(x = X, y = Y) will produce a correlation matrix that correlates all variables in X with all variables in Y.↩︎