This material is heavily based on the book Learning statistics with R: A tutorial for psychology students and other beginners (Version 0.6) by
Danielle Navarro. It has been reorganized, extended, rewritten, adapted and formatted with learnr by Wolf Vanpaemel under a Creative Commons BY-SA license (CC BY-SA) version 4.0. This means that this book can be reused, remixed, retained, revised, and redistributed (including commercially) as long as appropriate credit is given to the authors. If you remix or modify the original version of this open textbook, you must redistribute all versions of this open textbook under the same license - CC BY-SA. https://creativecommons.org/licenses/by-sa/4.0/
For formatting with learnr, Evelien Schat and Richard Artner provided valuable assistance. Jeffrey Goris, Jordan Revol, Robin Vloeberghs, Marre Vervloet, Yentl Koopmans, Lisa Koßmann, Peer-Ole Jacobsen and Katrijn Cnudde provided valuable feedback on a previous version.
Information in the endnotes is beyond the scope of the course, and is strictly provided for your information. It might be interesting, or useful for later when you are a more prolific R user, but it is not part of what you should study.
Our goal in these chapters is not to learn any statistical concepts: we’re just trying to learn the basics of how R works and get comfortable interacting with the system. In Chapter XXX, we will encounter some statistical concepts, but the goal there is only to show you how to compute them in R. I will not be explaining why these computations are interesting, what they mean, or how they should be interpreted. To learn about this stuff, you should go elsewhere.
The list of topics that these chapters cover is pretty broad, and there’s a lot of content there. Even though this is quite long, I’m really only scratching the surface of several fairly different and important topics. My advice is to read through the chapter once and try to follow as much of it as you can. Don’t worry too much if you can’t grasp it all at once. However, what you’ll probably find is that later on, you’ll need to flick back to earlier chapters in order to understand some of the concepts that I refer to there. In general, I’m not trying to be comprehensive in these chapters, I’m trying to make sure that you’ve got the basic foundations needed to tackle the content that comes later in the book. This means that some of the topics are revisited in more detail later. For example, I will talk about data frames in Sections XXX, XXX and XXX. This makes this book annoying as a reference book (not everything you need to know about a data frame is collected at the same spot), but I hope it makes the book a good textbook that is useful as study material, where you are taken by the hand to do everything step by step. It is a thin line to walk, though, and I do hope I have succeeded.
This is a somewhat interactive document, and you will be asked to write R code in your browser. So, perhaps surprisingly, while you’ll learn how to read, write and run R code, you will not need to open R just now. Instead, you will be working in R from within a browser. This is not how you will typically use R later, but it offers a nice learning experience. In Section 5, we will work in R properly (or in RStudio, really).
You will often be asked to run some code. You can do that by hitting ctrl + enter (or command + enter for the Mac users) or pressing the “run code” button in the boxes that will be provided. For some exercises, you will see a solution box, on which you can click and copy.
This work is neither complete nor perfect: it is a work in progress. One thing that is surely broken is the internal referencing system (to figures, tables, sections). That means that if I say I will talk about something in Section 8.4, there is only a 70% probability the correct section is 8.4. At other times, I gave up and didn’t even say 8.4 but used XXX as a placeholder. Or the system broke, which leads to ??. I plan to fix it, but ran out of time and energy. While ugly and mildly annoying, I don’t expect this to slow down your learning.
Further, it helps to view R from the perspective of a language. Which makes sense, given that it is a language. That means that there are many ways you can do things wrong. But, at the same time, there are many ways to do things right too! Just like with language, different people have different styles, and the same holds for R. That means that whatever code I write is most emphatically not the only way this code could or should be written. It reflects my own personal code writing style, while of course respecting the rules of grammar.
On https://docs.google.com/document/d/119uqTG6OpP9bcwEyubcw2Uy-l03HIoZjFyveAaU7oJc/edit?usp=sharing, you can find and report typos. Any typo I have found will be reported under the “confirmed typos” heading. If you think you have found a typo, and it is not listed under the “confirmed typos” heading, please add it under the “suspected typos” heading. Future users will thank you!
One of the easiest things you can do with R is to use it as a simple calculator, so it’s a good place to start.
Exercise: type 10 + 20 in the box below and run the code.
Congrats! When you have done this, you’ve entered a command, and R will “execute” that command.
Not a lot of surprises in this extract. But there are a few things worth talking about, even with such a simple example.
Firstly, it’s important
that you understand how to read the extract. In this example, what you
typed was the 10 + 20 part. You didn’t type the [1] 30 part. That’s
what R printed out in response to your command.
Secondly, it’s important to understand how the output is formatted.
Obviously, the correct answer to the sum 10 + 20 is 30, and not
surprisingly R has printed that out as part of its response. But it’s
also printed out this [1] part, which probably doesn’t make a lot of
sense to you right now. You’re going to see that a lot. I’ll talk about
what this means in a bit more detail later on, but for now, you can
think of [1] 30 as if R were saying “the answer to the 1st question
you asked is 30”. That’s not quite the truth, but it’s close enough for
now. And in any case, it’s not really very interesting at the moment: we
only asked R to calculate one thing, so obviously, there’s only one
answer printed on the screen. Later on, this will change, and the [1]
part will start to make a bit more sense. For now, I just don’t want you
to get confused or concerned by it.
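As a small sketch of how to read such an extract (assuming the convention used throughout this text, where lines starting with ## show what R printed rather than what you typed), the command and its response look like this:

```r
# What you type: one command asking R to add two numbers
10 + 20

# What R prints in response: the [1] label followed by the answer
## [1] 30
```

Only the first of those two lines is your work; the second one is R talking back.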
Before we go on to talk about other types of calculations that we can do
with R, there are a few other things I want to point out. The first
thing is that, while R is good software, it’s still software. It’s
pretty stupid, and because it’s stupid it can’t handle typos. It takes
it on faith that you meant to type exactly what you did type. For
example, suppose that you forgot to hit the shift key when trying to
type +, and as a result, your command ended up being 10 = 20 rather
than 10 + 20.
Exercise: Run the command 10 = 20 and see what happens.
What’s happened here is that R has attempted to interpret 10 = 20 as a
command, and spits out an error message because the command doesn’t make
any sense to it. When a human looks at this, and then looks down at
his or her keyboard and sees that + and = are on the same key, it’s
pretty obvious that the command was a typo. But R doesn’t know this, so
it gets upset. And, if you look at it from its perspective, this makes
sense. All that R “knows” is that 10 is a legitimate number, 20 is a
legitimate number, and = is a legitimate part of the language too. In
other words, from its perspective, this really does look like the user
meant to type 10 = 20, since all the individual parts of that
statement are legitimate and it’s too stupid to realise that this is
probably a typo. Therefore, R takes it on faith that this is exactly
what you meant. It only “discovers” that the command is nonsense when
it tries to follow your instructions, typo and all. And then it whinges
and spits out an error.
Even more subtle is the fact that some typos won’t produce errors at
all, because they happen to correspond to “well-formed” R commands. For
instance, suppose that not only did I forget to hit the shift key when
trying to type 10 + 20, I also managed to press the key next to the
one I meant to. The resulting typo would produce the command 10 - 20.
Clearly, R has no way of knowing that you meant to add 20 to 10, not
subtract 20 from 10.
Exercise: Run the command 10 - 20 and see what happens.
In this case, R produces the right answer, but to the wrong question.
To some extent, I’m stating the obvious here, but it’s important. The people who wrote R are smart. You, the user, are smart. But R itself is dumb. And because it’s dumb, it has to be mindlessly obedient. It does exactly what you ask it to do. There is no equivalent to “autocorrect” in R, and for good reason. When doing advanced stuff – and even the simplest of statistics is pretty advanced in a lot of ways – it’s dangerous to let a mindless automaton like R try to overrule the human user. But because of this, it’s your responsibility to be careful. Always make sure you type exactly what you mean. When dealing with computers, it’s not enough to type “approximately” the right thing. In general, you absolutely must be precise in what you say to R … like all machines it is too stupid to be anything other than absurdly literal in its interpretation.
Of course, now that I’ve been so uptight about the importance of always
being precise, I should point out that there are some exceptions. Or,
more accurately, there are some situations in which R does show a bit
more flexibility than my previous description suggests.
R is smart enough to ignore redundant spacing. What I mean by
this is that, when I typed 10 + 20 before, I could equally have put
extra spaces around the + sign, or left the spaces out altogether.
Exercise: Try it!
You get exactly the same answer!
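A quick sketch of what “redundant spacing” means here. The exact amount of spacing below is my own choice for illustration; any amount around the + behaves the same way:

```r
10 + 20          # the spacing used so far
10    +    20    # extra spaces around the operator
10+20            # no spaces at all

# All three commands produce the same output:
## [1] 30
```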
However, that doesn’t mean that you can insert spaces in any old place.
For example, you could type citation() to get some information about
how to cite R.
Exercise: Run the command citation().
It tells you how to cite the R manual. Let’s see what happens when you try changing the spacing.
Exercise: Type citation() with spaces in between the word and the
parentheses, or inside the parentheses themselves.
citation () and citation( ) will produce exactly the same response.
However, what you can’t do is insert spaces in the middle of the word.
Exercise: Run the command citation(), with spaces in the middle of
the word.
citat ion() gives an error.
Okay, now that we’ve discussed some of the tedious details associated with typing R commands, let’s move forward. So far, all we know how to do is addition. Clearly, a calculator that only did addition would be a bit stupid, so I should tell you about how to perform other simple calculations using R. But first, some more terminology.
Addition is an example of an “operation” that you can perform
(specifically, an arithmetic operation), and the operator that
performs it is +. To people with a programming or mathematics
background, this terminology probably feels pretty natural, but to other
people it might feel like I’m trying to make something very simple
(addition) sound more complicated than it is (by calling it an
arithmetic operation). To some extent, that’s true: if addition was the
only operation that we were interested in, it’d be a bit silly to
introduce all this extra terminology. However, as we go along, we’ll
start using more and more different kinds of operations, so it’s
probably a good idea to get the language straight now, while we’re still
talking about very familiar concepts like addition!
So, now that we have the terminology, let’s learn how to perform some arithmetic operations in R. To that end, Table 2.1 lists (among others) the operators that correspond to the basic arithmetic we learned in primary school: addition, subtraction, multiplication and division.
| operation | operator | example input | example output |
|---|---|---|---|
| addition | + | 10 + 2 | 12 |
| subtraction | - | 9 - 3 | 6 |
| multiplication | * | 5 * 5 | 25 |
| division | / | 9 / 3 | 3 |
| power | ^ | 5 ^ 2 | 25 |
| power | ** | 4 ** 2 | 16 |
As you can see, R uses fairly standard symbols to denote each of the
different operations you might want to perform: addition is done using
the + operator, subtraction is performed by the - operator, and so
on.
Exercise: Find out what 57 times 61 is using R.
57 * 61
So that’s handy.
The first four operations listed in Table 2.1 are things we all learned in primary school, but they aren’t the only arithmetic operations built into R. There are three other arithmetic operations that I should probably mention: taking powers, doing integer division, and calculating a modulus. Of the three, the only one that is of any real importance for the purposes of this book is taking powers, so I’ll discuss that one here: the other two are not discussed. Grace!
For those of you who can still remember your high school maths, this should be familiar. But for some people high school maths was a long time ago, and others of us didn’t listen very hard in high school. It’s not complicated. As I’m sure everyone will probably remember the moment they read this, the act of multiplying a number \(x\) by itself \(n\) times is called “raising \(x\) to the \(n\)-th power”. Mathematically, this is written as \(x^n\). Some values of \(n\) have special names: in particular, \(x^2\) is called \(x\)-squared (x kwadraat, in Dutch), and \(x^3\) is called \(x\)-cubed. So, the 4th power of 5 is calculated like this:
\[ 5^4 = 5 \times 5 \times 5 \times 5 \]
One way that we could calculate \(5^4\) in R would be to type in the complete multiplication as it is shown in the equation above. That is, we could do this
## [1] 625
but it does seem a bit tedious. It would be very annoying indeed if you wanted to calculate \(5^{15}\), since the command would end up being quite long. Therefore, to make our lives easier, we use the power operator instead. When we do that, our command to calculate \(5^4\) goes like this:
## [1] 625
Much easier. Another way to do this is by using ** instead of ^.
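To put the two approaches side by side, here is a minimal sketch of the long multiplication and the power operator described above:

```r
5 * 5 * 5 * 5    # the long way: write out every multiplication
5 ^ 4            # the short way: raise 5 to the 4th power

# Both commands print the same result:
## [1] 625
```

The gain becomes obvious with larger exponents: 5 ^ 15 is no longer to type than 5 ^ 4.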
Exercise: Use ** to obtain the 4th power of 5.
In most situations where you would want to use a calculator, you might
want to do multiple calculations. R lets you do this, just by typing in
longer commands. In fact, we’ve already seen an example of this earlier,
when I typed in 5 * 5 * 5 * 5. However, let’s try a slightly different
example:
## [1] 9
Clearly, this isn’t a problem for R either. However, it’s worth stopping
for a second, and thinking about what R just did. Clearly, since it gave
us an answer of 9 it must have multiplied 2 * 4 (to get an interim
answer of 8) and then added 1 to that. But, suppose it had decided to
just go from left to right: if R had decided instead to add 1+2 (to
get an interim answer of 3) and then multiplied by 4, it would have come
up with an answer of 12. To understand why it didn’t, you need to know
the order of operations that R uses.
If you remember back to your high school maths classes, it’s actually
the same order that you got taught when you were at school. In some
English speaking countries, this is known as the “BEDMAS”
order (see endnote 1). That is, first calculate things inside Brackets (),
then calculate Exponents ^, then Division / and
Multiplication *, then Addition + and Subtraction -.
So, to continue the example above, if we want to force R to calculate
the 1+2 part before the multiplication, all we would have to do is
enclose it in brackets:
## [1] 12
This is a fairly useful thing to be able to do.
The only other thing I
should point out about order of operations is what to expect when you
have two operations that have the same priority: that is, how does R
resolve ties? For instance, multiplication and division are actually the
same priority, but what should we expect when we give R a problem like
4 / 2 * 3 to solve? If it evaluates the multiplication first and then
the division, it would calculate a value of two-thirds. But if it
evaluates the division first it calculates a value of 6. The answer, in
this case, is that R goes from left to right, so in this case, the
division step would come first:
## [1] 6
All of the above being said, it’s helpful to remember that brackets
always come first. So, if you’re ever unsure about what order R will do
things in, an easy solution is to enclose the thing you want it to do
first in brackets. There’s nothing stopping you from typing
(4 / 2) * 3. By enclosing the division in brackets we make it clear
which thing is supposed to happen first. In this instance, you wouldn’t
have needed to, since R would have done the division first anyway, but
when you’re first starting out it’s better to make sure R does what you
want!
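The rules above can be checked with a handful of commands. This is just a sketch collecting the examples from this section; the last line is an extra variation of my own, showing how brackets change the outcome of the tie:

```r
1 + 2 * 4      # multiplication before addition: 1 + 8, so 9
(1 + 2) * 4    # brackets first: 3 * 4, so 12
4 / 2 * 3      # same priority, so left to right: 2 * 3, so 6
(4 / 2) * 3    # brackets that change nothing: still 6
4 / (2 * 3)    # brackets that do change things: 4 / 6, about 0.67
```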
Exercise: A good learning trick is to try typing in a few different variations on what I’ve done here. Experiment a bit with your commands, to learn what works and what doesn’t.
Okay. At this point, you know how to take one of the most powerful pieces of statistical software in the world, and use it as a $2 calculator. That’s not nothing (you could argue that you’ve just saved yourself $2) but on the other hand, it’s not very much either. In order to use R more effectively, we need to introduce more programming concepts.
One of the most important things to be able to do in R (or any programming language, for that matter) is to store information in variables. At a conceptual level, you can think of a variable as a label for a certain piece of information, or even several different pieces of information. Let’s look at the very basics of how we create variables and how to work with them.
Since we’ve been working with numbers so far, let’s start by creating
variables to store our numbers. And since most people like concrete
examples, let’s invent one. Suppose I’m trying to calculate how much
money I’m going to make from this book. There are several different
numbers I might want to store. Firstly, I need to figure out how many
copies I’ll sell. This isn’t exactly Harry Potter, so let’s assume I’m
only going to sell one copy per student in my class. That’s 350 sales,
so let’s create a variable called sales. What I want to do is assign a
value to my variable sales, and that value should be 350. We
do this by using the assignment operator, which is <-. Here’s
how we do it:
When you run this command, R doesn’t print any output. This, however, does not mean all your efforts were in vain and nothing happened. Behind the scenes, you did make an impact. By typing that line, you have made R create a variable called sales and give it a value of 350.
You don’t believe me? Good for you. But you can check that this has happened by asking R to print the variable on screen. The simplest way to do that is to type the name of the variable and hit ctrl + enter.
Exercise: Type the variable sales and run the code.
So that’s nice to know. Anytime you can’t remember what R has got stored in a particular variable, you can just type the name of the variable and hit ctrl + enter.
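Put together, the two steps described above (assigning a value, then printing it) can be sketched like this:

```r
sales <- 350   # store the value 350 under the name sales; R prints nothing
sales          # typing the name on its own asks R to print what is stored

## [1] 350
```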
Okay, so now we know how to assign variables. Actually, there’s a bit
more you should know. Firstly, one of the curious features of R is that
there are several different ways of making assignments. In addition to
the <- operator, we can also use -> and =. Note, however, that the <- operator is by far the most widely used and that it is hard to spot a -> operator in the wild.
Let’s start by considering ->, since that’s the easy one.
As you might expect from just looking at the symbol, it’s almost
identical to <-. It’s just that the arrow (i.e., the assignment) goes
from left to right. So if I wanted to define my sales variable using
->, I would write 350 -> sales.
This has the same effect, and it still means that I’m only going to
sell 350 copies. Sigh. Apart from this superficial difference, <-
and -> are identical. In fact, as far as R is concerned, they’re
actually the same operator, just in a “left form” and a “right form”.
A quick reminder: when using operators like <- and -> that span
multiple characters, you can’t insert spaces in the middle. That is, if
you type - > or < -, R will interpret your command the wrong way.
Exercise: Wanna try? Run s < - 3
Now =. Although it is not visible in the symbol itself, = does have a
direction: the variable has to be on the left. Typing sales = 350
works, whereas typing 350 = sales doesn’t work:
## Error in 350 = sales: invalid (do_set) left-hand side to assignment
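As a recap, here is a sketch of the three ways of making the same assignment. The failing command is left commented out, since running it would stop with the error message discussed above:

```r
sales <- 350     # the usual form: the value moves right to left, into sales
350 -> sales     # the "right form" of the same operator
sales = 350      # = also assigns, but the variable must be on the left

# 350 = sales    # not run: this one produces an error

sales            # whichever form you used, sales now holds 350
```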
One final thing you need to understand about creating variables (for
now, that is) is how R overwrites stuff. You could imagine that if I
now wrote sales <- 450, R would balk, and complain that sales
has already been defined and that I should make up my mind, for once in
my life, and stop being the fickle person that I am and grow a backbone.
Let’s find out:
Exercise: Assign a new value to the variable sales and let R show
it.
R graciously accepts my whims, and just pretends nothing has happened. R
has overwritten the earlier value we had for sales. There is no memory
left of the 350. Check it in the box above.
Okay, let’s get back to my original story. In my quest to become rich,
I’ve written this textbook. To figure out how good this strategy is, I’ve
started creating some variables in R. In addition to defining a sales
variable that counts the number of copies I’m going to sell, I can also
create a variable called royalty, indicating how much money I get per
copy. Let’s say that my royalties are about $7 per book:
The nice thing about variables (in fact, the whole point of having
variables) is that we can do anything with a variable that we ought to
be able to do with the information that it stores. That is, R allows the
multiplication of 350 by 7
## [1] 2450
The good news is that it also allows the multiplication of sales by
royalty.
Exercise: Multiply sales by royalty.
As far as R is concerned, the sales * royalty command is the same as
the 350 * 7 command.
Not surprisingly, I can assign the output of this calculation to a new
variable, which I’ll call revenue. And when we do this, the new
variable revenue gets the value 2450. So let’s do that, and then get
R to print out the value of revenue (by just typing revenue)
so that we can verify that it’s done what we asked:
## [1] 2450
That’s fairly straightforward. A slightly more subtle thing we can do is reassign the value of my variable, based on its current value. For instance, suppose that one of my students (no doubt under the influence of psychotropic drugs) loves the book so much that he or she donates an extra $550 to me. The simplest way to capture this is by a command like this:
## [1] 3000
In this calculation, R has taken the old value of revenue (i.e., 2450)
and added 550 to that value, producing a value of 3000. This new value
is assigned to the revenue variable, overwriting its previous value.
In any case, we now know that I’m expecting to make $3000 off this.
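The whole get-rich calculation can be sketched in a few lines, using the variable names and values from the story above:

```r
sales <- 350                  # copies sold
royalty <- 7                  # dollars earned per copy

sales * royalty               # same as typing 350 * 7, which gives 2450

revenue <- sales * royalty    # store the result in a new variable
revenue                       # 2450

revenue <- revenue + 550      # old value plus the donation, stored back
revenue                       # 3000
```

Note that the right-hand side of the last assignment is worked out first, using the old value of revenue, and only then is the result stored.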
Let’s return to our discussion of variables. When I introduced variables in Section 2.3, I showed you how we can use variables to store a single number. In this section, we’ll extend this idea and look at how to store multiple numbers within the one variable. In R, the name for a variable that can store multiple values is a vector. So let’s create one.
Let’s stick to my silly “get rich quick by textbook writing” example.
Suppose the textbook company (if I actually had one, that is) sends me
sales data on a monthly basis. Since my class starts in late February, we
might expect most of the sales to occur towards the start of the year.
Let’s suppose that I have 100 sales in February, 200 sales in March and
50 sales in April, and no other sales for the rest of the year. What I
would like to do is have a variable – let’s call it sales.by.month –
that stores all this sales data. The first number stored should be 0
since I had no sales in January, the second should be 100, and so on.
The simplest way to do this in R is to use the combine function, c() (see endnote 2). To do so, all we have to do is type all the numbers we want to store in a comma-separated list, like this:
## [1] 0 100 200 50 0 0 0 0 0 0 0 0
To use the correct terminology here, we have a single variable here
called sales.by.month: this variable is a vector that consists of 12
elements.
R is rather flexible in how you use the c() function. It works on both numbers and vectors at the same time. Say that you want to create a new variable that holds the sales for one more month in addition to the twelve we already have. It works like this:
## [1] 0 100 200 50 0 0 0 0 0 0 0 0 99
You ask R to combine the vector we have and appreciate (sales.by.month) with a new number (99). No biggie for R.
Worse, or better, yet, you can even define a variable by using that variable (if it exists, of course):
## [1] 0 100 200 50 0 0 0 0 0 0 0 0 99 299
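Here is a sketch of the three uses of c() described above. The name more.sales is a hypothetical one I picked for the extended vector; the numbers are the ones shown in the output:

```r
# Combine twelve numbers into one vector
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
sales.by.month

# Combine an existing vector with a new number (more.sales is a made-up name)
more.sales <- c(sales.by.month, 99)
more.sales

# Define a variable in terms of itself: the old more.sales plus one more value
more.sales <- c(more.sales, 299)
more.sales
```

After the last command, more.sales has 14 elements, while sales.by.month still has its original 12.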
R is most emphatically not flexible about whether or not you should use c(). It is a very common beginner mistake to forget it, but R is unforgiving if you do any of this:
sales.by.month <- (0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
sales.by.month <- 0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0
sales.by.month <- [0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0]
## Error: <text>:1:21: unexpected ','
## 1: sales.by.month <- (0,
##                         ^
Let’s consider the problem of how to get information out of a vector. At
this point, you might have a sneaking suspicion that the answer has
something to do with the [1] thing that R has been printing out. And
of course, you are correct. Suppose I want to pull out the February
sales data only. February is the second month of the year, so let’s try
this:
## [1] 100
Yep, that’s the February sales all right. The bottom line is that we can use square brackets [] to get info out of a vector.
But there’s a subtle detail to be aware of here: notice that R outputs
[1] 100, not [2] 100. This is because R is being extremely
literal. When we typed in sales.by.month[2], we asked R to find
exactly one thing, and that one thing happens to be the second element
of our sales.by.month vector. So, when it outputs [1] 100 what R is
saying is that the first number that we just asked for is 100.
This behaviour makes more sense when you realise that we can use this
trick to create new variables. For example, I could create a
february.sales variable like this:
## [1] 100
Obviously, the new variable february.sales should only have one
element and so when I print out this new variable, the R output
begins with a [1] because 100 is the value of the first (and only)
element of february.sales. The fact that this also happens to be the
value of the second element of sales.by.month is irrelevant.
In the previous example, we only used a single number (i.e., 2) to indicate which element we wanted. Of course, you can also access several elements at once, by using a vector of positions instead of a single number. For example, the sales from the first three months are extracted as follows:
## [1] 0 100 200
Or, the sales from months 7 and 9:
## [1] 0 0
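The different ways of pulling elements out of a vector can be sketched as follows. The vector is recreated first so that the example runs on its own:

```r
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)

sales.by.month[2]                      # one element: the February sales, 100

february.sales <- sales.by.month[2]    # store that element in a new variable
february.sales                         # a vector with one element, 100

sales.by.month[c(1, 2, 3)]             # first three months: 0 100 200
sales.by.month[c(7, 9)]                # months 7 and 9: 0 0
```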
Sometimes you’ll want to change the values stored in a vector. Imagine
my surprise when the publisher rings me up to tell me that the sales
data for May are wrong. There were actually an additional 25 books sold
in May, but there was an error or something so they hadn’t told me about
it. How can I fix my sales.by.month variable? One possibility would be
to assign the whole vector again from the beginning, using c(). But
that’s a lot of typing. Also, it’s a little wasteful: why should R have
to redefine the sales figures for all 12 months, when only the 5th one
is wrong? Fortunately, we can tell R to change only the 5th element,
using this trick (see endnote 3):
## [1] 0 100 200 50 25 0 0 0 0 0 0 0
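A sketch of that trick: square brackets on the left of the assignment tell R which element to overwrite.

```r
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)

sales.by.month[5] <- 25    # change only the 5th element (May)
sales.by.month             # the other eleven elements are untouched
```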
Exercise: It is always interesting to see how a program (or a human, for that matter) behaves when confronted with something unexpected or impossible. Try to change a non-existent element of sales.by.month. First, let’s try to assign the value 22 to the 13th element.
R somewhat kindly provides you with handy shortcuts for very common
situations. For instance, suppose that I wanted to use the vector
c(2,3,4,5,6,7,8). I could do
## [1] 2 3 4 5 6 7 8
but it’s kind of a lot of typing. To help make this easier, R lets you
use 2:8 as shorthand for c(2,3,4,5,6,7,8), which makes things a lot
simpler.
Exercise: You don’t have to believe me (in fact, I’d rather you didn’t!). Let’s just check that this is true.
This shorthand is especially useful for accessing elements from a vector. For example, the sales from the first six months are extracted as follows:
## [1] 0 100 200 50 25 0
but more conveniently using
## [1] 0 100 200 50 25 0
And yes, you can also use it to alter elements of a vector.
## [1] 0 100 2 2 2 2 2 0 0 0 0 0
Any idea why the next line doesn’t work?
## Warning in sales.by.month[1:3] <- c(9, 19): number of items to replace is not a
## multiple of replacement length
Well, you tell R to replace elements 1, 2 and 3 with some numbers. You even tell R which numbers the new ones should be, by specifying c(9, 19). But you don’t give enough: you want 3 numbers to be replaced, but you only provide 2. R thinks you are a bully, but still responds quite nicely.
It might, then, come as a surprise that this does work:
## [1] 9 19 9 19 2 2 2 0 0 0 0 0
So you want 4 numbers to be replaced and you only provide 2? The reason this does work is because of the recycling rule, discussed below (Section XXX).
Because we will use it later, I will restore the original sales.by.month.
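The steps of this section can be sketched in one go. The positions 3:7 and 1:4 are my reading of the printed output above rather than something the text states, so treat them as assumptions; the last line puts the vector back to the values used in the rest of the chapter:

```r
2:8                                # shorthand for c(2, 3, 4, 5, 6, 7, 8)

sales.by.month <- c(0, 100, 200, 50, 25, 0, 0, 0, 0, 0, 0, 0)
sales.by.month[1:6]                # first six months: 0 100 200 50 25 0

sales.by.month[3:7] <- 2           # elements 3 to 7 all become 2
sales.by.month[1:4] <- c(9, 19)    # 9 and 19 are reused: 9 19 9 19
sales.by.month

# Restore the original values (no sales in May)
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
```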
You often want to alter all of the elements of a vector at once. For
instance, suppose I wanted to figure out how much money I made in each
month. Since I’m earning an exciting $7 per book (no seriously, that’s
actually pretty close to what authors get on the very expensive
textbooks that you’re expected to purchase), what I want to do is
multiply each element in the sales.by.month vector by 7. R makes
this pretty easy, as the following example shows:
## [1] 0 700 1400 350 0 0 0 0 0 0 0 0
In other words, when you multiply a vector by a single number, all elements in the vector get multiplied. The same is true for addition, subtraction, division and taking powers. So that’s neat.
Sometimes my (non-existing) publisher is in a good mood, and they decide to give me a bonus. In January, I get a 1 dollar bonus, in February 2, and so on.
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
Computing my total profit is easily done:
## [1] 0 700 1400 350 0 0 0 0 0 0 0 0
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
## [1] 1 702 1403 354 5 6 7 8 9 10 11 12
So the nth element of bonus is added to the nth element of profit.
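A sketch of both kinds of vector arithmetic used above. The names profit and bonus come from the text; total is a hypothetical name I chose for the sum:

```r
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)

profit <- sales.by.month * 7   # every element is multiplied by 7
profit                         # 0 700 1400 350 0 0 0 0 0 0 0 0

bonus <- 1:12                  # 1 dollar in January, 2 in February, and so on
total <- profit + bonus        # element by element: 0 + 1, 700 + 2, ...
total                          # 1 702 1403 354 5 6 7 8 9 10 11 12
```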
On the other hand, suppose I wanted to know how much money I was making per day, rather than per month (dropping the bonus, which only existed in my imagination anyways). Since not every month has the same number of days, I need to do something slightly different. Firstly, I’ll create two new vectors:
The days.per.month variable is pretty straightforward. What I want to
do is divide every element of profit by the corresponding element of
days.per.month. Again, R makes this pretty easy:
## [1] 0.00000 25.00000 45.16129 11.66667 0.00000 0.00000 0.00000 0.00000
## [9] 0.00000 0.00000 0.00000 0.00000
Notice that the second element of the output is 25 because R has divided
the second element of profit (i.e. 700) by the second element of
days.per.month (i.e. 28). Similarly, the third element of the output
is equal to 1400 divided by 31, and so on.
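The per-day calculation can be sketched like this, assuming a non-leap year (28 days in February, which matches the output above):

```r
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
profit <- sales.by.month * 7

# Number of days in each month, January to December (non-leap year)
days.per.month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

# Each element of profit is divided by the matching element of days.per.month
profit / days.per.month
```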
There’s one semi-advanced thing that I should mention about how vector
arithmetic works in R, and that’s the recycling rule. It is fairly
straightforward, but can be confusing to novices. The easiest way to
explain it is to give a simple example. Suppose I have two vectors of
different length, x and y, and I want to add them together. It’s not
obvious what that actually means, so let’s have a look at what R does:
## [1] 1 2 1 2 1 2
Try to understand what’s going on, from looking at this output.
As you can see, what R has done is “recycle” the value of the shorter
vector (in this case y, of length 2, as compared to x of length 6)
several times. That is, the first element of x is added to the first
element of y, and the second element of x is added to the second
element of y. However, when R reaches the third element of x there
isn’t any corresponding element in y, so it returns to the beginning:
thus, the third element of x is added to the first element of y.
This process continues until R reaches the last element of x. And
that’s all there is to it really. The same recycling rule also applies
for subtraction, multiplication and division.
Someone paying close attention might wonder what happens if the length of the longer vector (5, in this example) isn’t an exact multiple of the length of the shorter one (2, in this example). Let’s see:
## Warning in x + y: longer object length is not a multiple of shorter object
## length
## [1] 1 2 1 2 1
R still does it, but also gives you a warning message. Warnings are highly important, and shouldn’t be ignored. Despite this, we will ignore the warning, but will say a bit more about it in Section 3.1.
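The definitions of x and y are not shown above. The values below are an assumption, chosen only because they reproduce both outputs; sum6 and sum5 are hypothetical names:

```r
y <- c(0, 1)                 # the shorter vector, length 2

x <- c(1, 1, 1, 1, 1, 1)     # length 6: an exact multiple of 2
sum6 <- x + y                # y is recycled three times
sum6

x <- c(1, 1, 1, 1, 1)        # length 5: not a multiple of 2
sum5 <- x + y                # still computed, but R also gives a warning
sum5
```

In the second case, R uses y two and a half times: it simply stops recycling once it runs out of elements of x.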
The symbols +, -, * and so on are examples of operators. As we’ve
seen, you can do quite a lot of calculations just by using these
operators. However, in order to do more advanced calculations (and later
on, to do actual statistics), you’re going to need to start using
functions.
I’ll talk in more detail about functions and how they work in Section
9.1, but for now let’s just dive in and use a few.
To get started, suppose I wanted to take the square root (vierkantswortel, in Dutch) of 225. The square root, in case your high school maths is a bit rusty, is just the opposite of squaring a number. So, for instance, since “5 squared is 25” I can say that “5 is the square root of 25”. The usual notation for this is \(\sqrt{25} = 5\).
To calculate the square root of 25, I can do it in my head pretty easily, since I memorised my multiplication tables when I was a kid. It gets harder when the numbers get bigger, and pretty much impossible if they’re not whole numbers. This is where something like R comes in very handy. Let’s say I wanted to calculate \(\sqrt{225}\), the square root of 225. Here is how I could do this using R.
R provides a square root function, sqrt(). To calculate the square root of
225 using this function, what I do is insert the number 225 in the
parentheses.
Exercise: Calculate the square root of 225 using the sqrt()
function.
When we use a function to do something, we generally refer to this as calling the function, and the values that we type into the function (in general, there can be more than one) are referred to as the arguments of that function.
Note how we provide the arguments inside the round brackets. What happens if you would inadvertently use square brackets?
If you type sqrt[225], R will think you want the 225th element of the sqrt object. Since sqrt is a function rather than a vector, it has no elements to select, so R thinks you are being unreasonable and throws an error.
The party is hardly over! There are lots of other functions in R: in
fact, almost everything of interest that I’ll talk about in this book is
an R function of some kind. For example, one function that we will need
to use in this book is the absolute value function. Compared to
the square root function, it’s extremely simple: it just converts
negative numbers to positive numbers and leaves positive numbers alone.
Mathematically, the absolute value of \(x\) is written \(|x|\).
Calculating absolute values in R
is pretty easy since R provides the abs() function that you can use
for this purpose.
Exercise: Feed the abs() function a positive number (e.g., 21).
Here, the absolute value function does nothing to it at all.
Exercise: Feed the abs() function a negative number (e.g., -13).
It now spits out the positive version of the same number.
Before moving on, it’s worth noting that – in the same way that R
allows us to put multiple operations together into a longer command,
like 1 + 2*4 for instance – it also lets us put functions together
and even combine functions with operators if we so desire. For example,
the following is a perfectly legitimate command:
Exercise: What is the result of this computation? Use R to confirm.
When R executes this command, it starts out by calculating the value of
abs(-8), which produces an intermediate value of 8. Having done so,
the command simplifies to sqrt( 1 + 8 ). To solve the square root4
it first needs to add 1 + 8 to get 9, at which point it evaluates
sqrt(9), and so it finally outputs a value of 3.
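The steps just described can be followed one at a time in R (the name result is hypothetical):

```r
abs(-8)                        # the innermost function is evaluated first: 8
1 + abs(-8)                    # then the addition inside the brackets: 9
result <- sqrt(1 + abs(-8))    # and finally the square root
result
```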
The examples above only took single numbers as input. Some of you might be wondering whether you can also input a vector.
Exercise: Wonder no more! Just try. Give a vector, e.g., c(25, 49, 36), as input to the sqrt() function.
If you did everything right (for example, if you didn’t forget the c()), you will have seen that the sqrt() function just does whatever it does (taking the square root, in this case) on each element separately. So this function works on a vector element-wise.
Not every function works on a vector element-wise. You often find
yourself wanting to know how many elements there are in a vector
(usually because you’ve forgotten). You can use the length() function
to do this. It’s quite straightforward:
## [1] 12
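To see the contrast side by side, here is a small sketch. I am assuming that the output above comes from applying length() to the sales.by.month vector; roots and n.months are hypothetical names:

```r
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)

roots <- sqrt(c(25, 49, 36))         # element-wise: one result per element
roots

n.months <- length(sales.by.month)   # one single result for the whole vector
n.months
```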
The real power of R only comes to light when all of this is combined. For example, you might want to take the square root of 20 + 5. You can combine both operations (adding and taking the square root) in a single line:
## [1] 5
Or you might want to take the square root of the absolute value of -25. You can do that in two steps
## [1] 5
but more conveniently in one:
## [1] 5
The longer the expression, the more is happening, but also the harder it is to understand. Unlike in English, where you read from left to right, in R it often pays to read from the inside out, especially when brackets are involved. So when you want to understand sqrt( abs(-25) ), you could read it from left to right as “I take the square root of the absolute value of -25”, but I have the impression that most students prefer the inside-out approach: “I first take the absolute value of -25 and then take the square root”.
We have seen two types of brackets: round brackets () and square brackets []. Later, we will encounter curly brackets {} (and the dreaded double square brackets [[]]). Functions require round brackets. Vectors (and, as we will later see, matrices and data frames too) require square brackets.
So this doesn’t work to compute the square root of 25
## Error in sqrt[25]: object of type 'builtin' is not subsettable
because R thinks you want the 25th element of sqrt, which is a function and not a vector.
This doesn’t work either for accessing the second element of x
## Error in x(2): could not find function "x"
because R thinks you want to compute the (non-existing) function called x when the input is 2.
Of course, square brackets can appear close to a function, for example like this:
## [1] 2
What happens is that the function sqrt acts on the input c(1,4,9,25), which is included between round brackets as it should. The function then produces a vector, of which we then select the 2nd element, using square brackets, as it should.
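A short sketch of the correct use of both kinds of brackets. The vector x and its values are an assumption, as are the names second and second.root:

```r
sqrt(25)                               # round brackets: call the function sqrt()

x <- c(1, 4, 9, 25)
second <- x[2]                         # square brackets: select the 2nd element of x
second

second.root <- sqrt(c(1, 4, 9, 25))[2] # first the function call, then the selection
second.root
```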
Before discussing any of the more complicated R stuff, I want to introduce
the comment character, #. It has a simple meaning: it tells R to
ignore everything else you’ve written after it. You won’t have much
need of the # character immediately, but it’s very useful later on
when writing scripts (see Section 5). However, while you
don’t need to use it, I want to be able to include comments in my R
extracts. For instance, if you read this:
seeker <- 3.1415 # create the first variable
lover <- 2.7183 # create the second variable
keeper <- seeker * lover # now multiply them to create a third one
keeper # print out the value of 'keeper'
## [1] 8.539539
it’s a lot easier to understand what I’m doing than if I just write this:
## [1] 8.539539
You’ll start seeing # characters appearing in the extracts, with some
human-readable explanatory remarks next to them. These are still
perfectly legitimate commands since R knows that it should ignore the
# character and everything after it. But hopefully, they’ll help make
things a little easier to understand.
Exercise: Double check that R really doesn’t read what’s behind
the #. For example, run 10 + 20 #I HATE YOU, R in the box below, and
check whether R is still your faithful servant, despite you expressing
your negative feelings.
You will see that R is completely unfazed by your comment.
Exercise: OK, this isn’t quite a bulletproof test. R might have
actually read your comment, but simply doesn’t care about the feelings of
a human, or might be used to their abuse. As a better test, run the
following two lines in the box below: on the first line, type
10 + 20 #x <- 40; on the second line, check whether R knows the value
of x, by typing x.
R will tell you x cannot be found. So either R didn’t read the
comment, or R is very good at denying it did.
If you have done and understood all of this, you can find some additional exercises in the RIntroExercises document. You can use this box to make the exercises for Chapter 2. Note that there are two documents: one with only the questions, and one with the questions as well as some suggested solutions.
Before discussing any of the more complicated R stuff, I want to talk a bit about the errors and warnings that R, sometimes helpfully, sometimes spitefully, throws at you. We’ve come across some of these already, but I feel I should give them a bit more attention, as they are cries for help, and as a good psychologist, you should be trained to recognize and act upon cries for help.
Both errors and warnings signal that something is off. The difference is that, when R throws an error, it means you are done for. R couldn’t do what you asked it to do, so it stopped, producing no output. With a warning, in contrast, it powered through and produced output, but it thinks something could be off, so you should look at the code and the output with extra care.
It would be in equal measure impossible and maddening to explain all the error and warning messages R produces. My general advice is to read them carefully, as they sometimes make sense. In the cases where they don’t, the warning or error message at least gives you something you can use when looking for help.
Just to get an idea, here are a few:
## Error in eval(expr, envir, enclos): object 'z' not found
## Error: <text>:2:10: unexpected ')'
## 1: #example 2
## 2: sqrt(225))
## ^
## Error: <text>:2:5: unexpected '{'
## 1: #example 3
## 2: sqrt{
## ^
## Error in skwairroet(225): could not find function "skwairroet"
## Warning in 1:5 + 1:6: longer object length is not a multiple of shorter object
## length
## [1] 2 4 6 8 10 7
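The difference between a warning and an error can be made visible with a small sketch. It uses two base R functions that are not covered in this chapter, suppressWarnings() and try(), purely so that the example runs from start to finish; out and res are hypothetical names:

```r
# A warning: R powers through and still returns a result
out <- suppressWarnings(1:5 + 1:6)   # silence the warning, keep the output
out

# An error: R stops, so there is no result to work with
res <- try(skwairroet(225), silent = TRUE)   # try() catches the error
class(res)                                   # "try-error" instead of a number
```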
A lot of the time your data will be numeric in nature, but not always. Sometimes your data really needs to be described using text, not using numbers.
To address this, we need to consider the situation where our variables store text. To create a variable that stores the word “hello”, we can type this:
## [1] "hello"
When interpreting this, it’s important to recognise that the quote marks
here aren’t part of the string itself. They’re just something that we
use to make sure that R knows to treat the characters that they enclose
as a piece of text data, known as a character string. In other
words, R treats "hello" as a string containing the word “hello”; but
if I had typed hello instead, R would go looking for a variable by
that name! You can also use 'hello' to specify a character string.
Okay, so that’s how we store the text. Next, it’s important to recognise
that when we do this, R stores the entire word "hello" as a single
element: our greeting variable is not a vector of five different
letters. Rather, it has only one element, and that element
corresponds to the entire character string "hello".
Exercise: Just to be sure, ask R how many elements greeting has.
If you typed length(greeting), which you should have, you see that as
far as R is concerned, greeting consists of a single element only.
Exercise: What could that be, you surely wonder?
You see that if you actually ask R to find the first element of
greeting, by typing greeting[1], it prints the whole string.
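Put together, the commands discussed so far in this section look like this (a sketch, with the variable name greeting taken from the text above):

```r
greeting <- "hello"   # the quote marks tell R that this is text
greeting

length(greeting)      # one element: the whole string
greeting[1]           # the first (and only) element is the whole string
```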
Of course, there’s no reason why I can’t create a vector of character
strings, just like you can create a vector of numerical elements. For
instance, if we were to continue with the example of my attempts to look
at the monthly sales data for my book, one variable I might want would
include the names of all 12 months. To do so, I could type in a
command like this, again using the combine function c()
months <- c("January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November",
"December")
This is a character vector containing 12 elements, each of which is the name of a month.
Exercise: Get R to tell you the name of the fourth month. You know the answer to that question, so you will know if you did it right.
# selecting the ith element can be done using [i]
months[4]
Exercise: Get R to tell you how many months there are in a year. You know the answer to that question, so you will know if you did it right.
# counting the number of elements can be done using length()
length(months)
Working with text data is somewhat more complicated than working with
numeric data.
So far, most of the numerical operations (addition, etc) and functions
(e.g., sqrt(), abs()) that we have seen only
make sense when applied to numeric data.
Here’s a question you never thought you would ask: can you do numerical operations on a character vector? For example, can you take the square root of months?
Exercise: Well, can you?
No. months + 1, months * 3, months + months, months^2 and
sqrt(months) are all meaningless. R agrees, and throws an error.
We’ve seen one function that can be applied to pretty much any variable
or vector (i.e., length()). It might be nice to see another example of
a function that can be applied to text. The function I’m going to
introduce you to is called nchar(), and what it does is count the
number of individual characters that make up a string. Recall earlier
that when we tried to calculate the length() of our greeting
variable it returned a value of 1: the greeting variable contains
only one string, which happens to be "hello". But what if I want
to know how many letters there are in the word? Sure, I could count
them, but that’s boring, and more to the point it’s a terrible strategy
if what I wanted to know was the number of letters in War and Peace.
That’s where the nchar() function is helpful:
## [1] 5
That makes sense, since there are in fact 5 letters in the string
"hello". Better yet, you can apply nchar() to whole vectors. So, for
instance, if I want R to tell me how many letters there are in the names
of each of the 12 months, I can do this:
## [1] 7 8 5 5 3 4 4 6 9 7 8 8
So that’s nice to know. The nchar() function can do a bit more than
this, and there are a lot of other functions that you can use to extract
more information from text or do all sorts of fancy things. However, the
goal here is not to teach any of that! The goal right now is just to see
an example of a function that actually does work when applied to text.
Note that nchar() also works on numerics. Exhibit 1:
## [1] 2
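The three nchar() outputs in this section can be reproduced with the sketch below. The number 25 in the last command is an assumption: any two-digit number gives the output shown.

```r
greeting <- "hello"
months <- c("January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November",
            "December")

nchar(greeting)   # number of characters in one string
nchar(months)     # applied to a vector: one count per element
nchar(25)         # a number is first turned into the text "25"
```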
Time to move onto a third kind of data. A key concept that a lot of R
relies on is the idea of a logical value. A logical value is an
assertion about whether something is true or false. This is implemented
in R in a pretty straightforward way. There are two logical values,
namely TRUE and FALSE. Despite the simplicity, logical values are
very useful things. Let’s see how they work.
In George Orwell’s classic book 1984, one of the slogans used by the
totalitarian Party was “two plus two equals five”, the idea being that
the political domination of human freedom becomes complete when it is
possible to subvert even the most basic of truths. It’s a terrifying
thought, especially when the protagonist Winston Smith finally breaks
down under torture and agrees to the proposition. “Man is infinitely
malleable”, the book says. I’m pretty sure that this isn’t true of
humans but it’s definitely not true of R. R is not infinitely malleable.
It has rather firm opinions on the topic of what is and isn’t true, at
least as regards basic mathematics. If I ask it to calculate 2 + 2, it
always gives the same answer, and it’s not bloody 5:
## [1] 4
Of course, so far R is just doing the calculations. I haven’t asked it to explicitly assert that \(2+2 = 4\) is a true statement. If I want R to make an explicit judgement, I can use a command like this:
## [1] TRUE
What I’ve done here is use the equality operator, ==, to force R
to make a “true or false” judgement (see endnote 5). Note that this is very different from, and should not be confused with, the assignment operator, =, which we use to make sure a variable takes a value. With ==, we ask R whether a variable takes a certain value.
Okay, let’s see what R thinks of the Party slogan:
## [1] FALSE
Booyah! Freedom and ponies for all! Or something like that.
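For reference, a sketch of the three commands behind the outputs in this part:

```r
2 + 2         # just a calculation: 4
2 + 2 == 4    # an explicit judgement: TRUE
2 + 2 == 5    # the Party slogan: FALSE
```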
Anyway, it’s worth having a look at what happens if I try to force R
to believe that two plus two is five by making an assignment statement
like 2 + 2 <- 5. When I do this, here’s what
happens:
## Error in 2 + 2 <- 5: target of assignment expands to non-language object
R doesn’t like this very much. It recognises that 2 + 2 is not a
variable (that’s what the “non-language object” part is saying), and it
won’t let you try to “reassign” it. While R is pretty flexible and
actually does let you do some quite remarkable things to redefine parts
of R itself, there are just some basic, primitive truths that it refuses
to give up. It won’t change the laws of addition, and it won’t change
the definition of the number 2.
That’s probably for the best.
Up to this point, I’ve introduced numeric data (in Sections
2.3 and 2.6) and character data (in Section
3.2). So you might not be surprised to discover that these
TRUE and FALSE values that R has been producing are actually a third
kind of data, called logical data. That is, when I asked R if
2 + 2 == 5 and it said [1] FALSE in reply, it was actually producing
information that we can store in variables. For instance, I could create
a variable called is.the.Party.correct, which would store R’s opinion:
## [1] FALSE
Alternatively, you can assign the value directly, by typing TRUE or
FALSE in your command. Like this:
## [1] TRUE
Note that, again, R is totally chillax about this inconsistency. It just overwrites the previous value, without pain, grief or warning.
Better yet, because it’s kind of tedious to type TRUE or FALSE over
and over again, R provides you with a shortcut: you can use T and F
instead (but it’s case sensitive: t and f won’t work). Anyway, the
long and short of it is that it’s safer to use TRUE and FALSE. So
this works:
## [1] FALSE
but this doesn’t:
## Error in eval(expr, envir, enclos): object 'f' not found
I can’t let you go without a small warning: TRUE and FALSE are
reserved keywords in R, so you can trust that they always mean what they
say they do. Unfortunately, the shortcut versions T and F do not
have this property. It’s even possible to create variables that set up
the reverse meanings, by typing commands like T <- FALSE and
F <- TRUE. This is kind of insane, and something that is generally
thought to be a design flaw in R.
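A sketch of why the shortcuts are less safe than the keywords. The variable flipped is a hypothetical name, and the rm() call at the end is only there to clean up after the experiment:

```r
F                # the built-in shortcut stands for FALSE

T <- FALSE       # R allows this, because T is an ordinary variable, not a keyword
flipped <- T     # T now means FALSE
flipped

rm(T)            # remove my own T again, so that T goes back to meaning TRUE
T
```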
The next thing to mention is that you can store vectors of logical
values in exactly the same way that you can store vectors of numbers
(Section 2.6) and vectors of text data (Section 3.2).
Again, we can define them directly via the c() function, like this:
## [1] TRUE TRUE FALSE
More interestingly, you can produce a vector of logicals by applying a logical operator (such as the equality operator) to a vector. This might not make a lot of sense to you, so let’s unpack it slowly.
First, let’s suppose we have a vector of numbers.
For instance, we could use the sales.by.month vector that we
were using earlier:
## [1] 0 100 200 50 0 0 0 0 0 0 0 0
Suppose I wanted R to tell me, for each month of the year, whether it was a slow month in that no books were sold. I can do that by typing this:
## [1] TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
and again, I can store this in a vector if I want, as the example below illustrates:
## [1] TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
In other words, no.sales.this.month is a logical vector whose elements
are TRUE only if the corresponding element of sales.by.month is
equal to zero. For instance, since I sold zero books in January, the
first element is TRUE.
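The commands for this example are not shown above; a minimal sketch that reproduces the outputs:

```r
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)

sales.by.month == 0                          # one TRUE or FALSE per month

no.sales.this.month <- sales.by.month == 0   # store the result in a variable
no.sales.this.month
```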
Let’s do a second example, but now with text. Suppose that – to continue the saga of the textbook sales – I find out that the bookshop only had sufficient stocks for a few months of the year. They tell me that early in the year they had "high" stocks, which then dropped to "low" levels, and in fact for
two months they were "out" of copies of the book for a while before
they were able to replenish them. Thus I might have a variable called
stock.levels which looks like this:
stock.levels <- c("high", "high", "low", "out", "out", "high",
"high", "high", "high", "high", "high", "high")
stock.levels
## [1] "high" "high" "low" "out" "out" "high" "high" "high" "high" "high"
## [11] "high" "high"
If I want to know whether or not the book is out of stock, I can ask R as follows:
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
However, what you need to keep in mind is that R is not at all tolerant when it comes to grammar and spacing. If two strings differ in any way whatsoever, R will say that they’re not equal to each other, as the following examples indicate:
## [1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
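A sketch of comparisons that reproduce these outputs. The first two follow directly from the text; the misspelled variants "High" and "high " are my assumption about the kind of mismatch meant here:

```r
stock.levels <- c("high", "high", "low", "out", "out", "high",
                  "high", "high", "high", "high", "high", "high")

stock.levels == "out"     # TRUE for the two months without stock
stock.levels == "high"    # exact match: TRUE for the nine "high" months
stock.levels == "High"    # capital letter: no match anywhere
stock.levels == "high "   # trailing space: no match anywhere
```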
Above, we created the logical vector no.sales.this.month whose
elements are TRUE or FALSE. Life is full of surprises, and so is R.
As it turns out, you can do numerical operations with this vector. Can
you find out how it works?
Exercise: Run the following commands: no.sales.this.month + 0,
no.sales.this.month * 1, no.sales.this.month^2 and compare the results
with no.sales.this.month. What do you notice?
Every TRUE plays the role of a 1, and every FALSE plays the role of a 0.
Later, in Section 6, I’ll show you why these logical operations and logical vectors are so handy.
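If you want to check your answer to the exercise, here is a sketch that rebuilds the vector and runs the three commands:

```r
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
no.sales.this.month <- sales.by.month == 0

no.sales.this.month + 0   # TRUE becomes 1, FALSE becomes 0
no.sales.this.month * 1   # same result
no.sales.this.month^2     # same again, since 1 squared is 1 and 0 squared is 0
```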
So now we’ve seen logical operations at work, but so far we’ve only seen the simplest possible example, the equality operator. You probably won’t be surprised to discover that we can combine logical operations with other operations and functions in a more complicated way, like this:
## [1] TRUE
or this
## [1] TRUE
Not only that, but as Table 3.1 illustrates, there are several other logical operators that you can use, corresponding to some basic mathematical concepts.
| operation | operator | example input | answer |
|---|---|---|---|
| less than | < | 2 < 3 | TRUE |
| less than or equal to | <= | 2 <= 2 | TRUE |
| greater than | > | 2 > 3 | FALSE |
| greater than or equal to | >= | 2 >= 2 | TRUE |
| equal to | == | 2 == 3 | FALSE |
| not equal to | != | 2 != 3 | TRUE |
Hopefully, these are all pretty self-explanatory: for example, the
less than operator < checks to see if the number on the left is
less than the number on the right. If it’s less, then R returns an
answer of TRUE:
## [1] TRUE
but if the two numbers are equal, or if the one on the left is larger,
then R returns an answer of FALSE, as the following two examples
illustrate:
## [1] FALSE
## [1] FALSE
In contrast, the less than or equal to operator <= will do
exactly what it says. It returns a value of TRUE if the number on the
left-hand side is less than or equal to the number on the right-hand
side. So if we repeat the previous two examples using <=, here’s what
we get:
## [1] TRUE
## [1] FALSE
And at this point I hope it’s pretty obvious what the greater than
operator > and the greater than or equal to operator >= do!
Next on the list of logical operators is the not equal to operator
!= which – as with all the others – does what it says it does. It
returns a value of TRUE when things on either side are not identical
to each other. Therefore, since \(2+2\) isn’t equal to \(5\), we get:
## [1] TRUE
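The comparisons discussed above can be sketched as follows. The specific numbers are assumptions, picked to be consistent with the outputs and with Table 3.1:

```r
2 < 3        # TRUE:  the left number is smaller
2 < 2        # FALSE: the two numbers are equal
3 < 2        # FALSE: the left number is larger

2 <= 2       # TRUE:  equal counts for "less than or equal to"
3 <= 2       # FALSE

2 > 3        # FALSE
2 >= 2       # TRUE

2 + 2 != 5   # TRUE:  the two sides are not equal
```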
We’re not quite done yet. There are three more logical operations that are worth knowing about, listed in Table 3.2.
| operation | operator | example input | answer |
|---|---|---|---|
| not | ! | !(1==1) | FALSE |
| or | \| | (1==1) \| (2==3) | TRUE |
| and | & | (1==1) & (2==3) | FALSE |
These are the not operator !, the and operator &, and
the or operator |. Like the other logical operators, their
behaviour is more or less exactly what you’d expect given their names.
For instance, if I ask you to assess the claim that “either \(2+2 = 4\)
or \(2+2 = 5\)” you’d say that it’s true. Since it’s an “either-or”
statement, all we need is for one of the two parts to be true. That’s
what the | operator does:
## [1] TRUE
On the other hand, if I ask you to assess the claim that “both \(2+2 = 4\)
and \(2+2 = 5\)” you’d say that it’s false. Since this is an and
statement we need both parts to be true. And that’s what the &
operator does:
## [1] FALSE
To be clear, the | operator does not want exactly one statement to be true.
If both parts are true, it will judge the combined statement as true as well:
## [1] TRUE
Finally, there’s the not operator, which is simple but annoying to describe in English. If I ask you to assess my claim that “it is not true that \(2+2 = 5\)” then you would say that my claim is true; because my claim is that “\(2+2 = 5\) is false”. And I’m right. If we write this as an R command we get this:
## [1] TRUE
In other words, since 2+2 == 5 is a FALSE statement, it must be the
case that !(2+2 == 5) is a TRUE one. Essentially, what we’ve really
done is to claim that “not false” is the same thing as “true”.
Obviously, this isn’t really quite right in real life. But R lives in a
much more black or white world: for R everything is either true or
false. No shades of grey are allowed. We can actually see this much more
explicitly, like this:
## [1] TRUE
Of course, in our \(2+2 = 5\) example, we didn’t really need to use “not”
! and “equals to” == as two separate operators. We could have just
used the “not equals to” operator != like this:
## [1] TRUE
But there are many situations where you really do need to use the !
operator.
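Here is a sketch of commands matching the claims discussed in this part. The exact commands are not shown above, so the ones below (in particular the example with two true parts) are my own reconstruction:

```r
(2 + 2 == 4) | (2 + 2 == 5)   # or: one true part is enough, so TRUE
(2 + 2 == 4) & (2 + 2 == 5)   # and: both parts must be true, so FALSE
(2 + 2 == 4) | (1 + 1 == 2)   # or with two true parts is still TRUE

!(2 + 2 == 5)                 # not: turns FALSE into TRUE
!FALSE                        # the same thing, more explicitly
2 + 2 != 5                    # the same judgement using the != operator
```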
Let’s get some more practice with combining these operations. Looking back at the stock levels, suppose I want to focus only on those cases when the stock level is either “out” or “low”. One simple way to do this is:
## [1] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
What this does is return TRUE for those elements of stock.levels
that are either "out" or "low" and returns FALSE for all the
others.
Neat. But there’s an even neater way. To send you off, I will leave you
with a useful trick to be aware of, which is the %in% operator (see endnote 6).
It’s actually very similar to the == operator, except that you can
supply a collection of acceptable values, so you can look for a match of
multiple cases. The best way to learn about it is to see it at work:
## [1] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
You see that, again, it returns TRUE for those elements of
stock.levels that are either "out" or "low" and returns FALSE
for all the others. You could verbalize the above statement as “stock.levels is part of either out or low” or “stock.levels is at least one of the values in the vector consisting of out and low”.
Exercise: Is there a difference between stock.levels=="high" and
stock.levels %in% "high"?
No. You could see the %in% operator as a multiple case extension of the == operator.
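Both ways of selecting the “out” or “low” months can be sketched as follows. The names with.or and with.in are hypothetical, and identical() is used only to confirm that the two results are the same:

```r
stock.levels <- c("high", "high", "low", "out", "out", "high",
                  "high", "high", "high", "high", "high", "high")

# two == comparisons joined with the or operator
with.or <- (stock.levels == "out") | (stock.levels == "low")
with.or

# the same question asked with %in%
with.in <- stock.levels %in% c("out", "low")
with.in

identical(with.or, with.in)   # both approaches agree
```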
Just like algebraic operators have an order (e.g., multiplication before addition), logical operators have one too.
This one is easy
## [1] FALSE
(TRUE | TRUE) gives TRUE, so we end up with TRUE & FALSE which yields FALSE.
This one is harder if you encounter it for the first time
## [1] TRUE
This expression is not evaluated left-to-right. Instead, R follows operator precedence rules, under which & has higher precedence than |. So it’s really interpreted as:
## [1] TRUE
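The three expressions discussed here, written out as a sketch (the first one follows from the explanation above, the other two from the precedence rule):

```r
(TRUE | TRUE) & FALSE    # brackets first: TRUE & FALSE gives FALSE
TRUE | TRUE & FALSE      # no brackets: & is evaluated before |, giving TRUE
TRUE | (TRUE & FALSE)    # the same expression with the implied brackets added
```

When in doubt, adding brackets yourself makes the intended order explicit.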
The single most important skill you need to learn as a programmer (or, to a lesser extent, as a data analyst) is getting help. I have somewhat mixed feelings about the help documentation in R. On the plus side, there’s a lot of it, and it’s very thorough. On the minus side, there’s a lot of it, and it’s very thorough. There’s so much help documentation that it sometimes doesn’t help, and most of it is written with an advanced user in mind. Often it feels like most of the help files work on the assumption that the reader already understands everything about R except for the specific topic that it’s providing help for. What that means is that, once you’ve been using R for a long time and are beginning to get a feel for how to use it, the help documentation is awesome. These days, I find myself really liking the help files (most of them anyway). But when I first started using R I found it very dense.
To some extent, there’s not much I can do to help you with this. You just have to work at it yourself; once you’re moving away from being a pure beginner and are becoming a skilled user, you’ll start finding the help documentation more and more helpful.
If you want to read the help file of, say, the startsWith() function, you can use either of the following:
When I do that, R goes looking for the help file for the “startsWith” topic.
Exercise: Using the help() function in this document, look up the help documentation for the startsWith() function.
Alternatively, you can try a fuzzy search for a help topic, meaning that it will not just look for the exact search term, but also at search terms that are similar to your search term.
If you try it (for example, in the box of the previous exercise), this will bring up a list of possible topics that you might want to follow up on.
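The commands themselves are not shown above. In base R, the usual ways to ask for help are the following (a sketch; what exactly appears on your screen depends on how you are running R):

```r
?startsWith                  # shortcut: open the help file for startsWith()
help("startsWith")           # the same thing, using the help() function

??startsWith                 # shortcut for a fuzzy search across help topics
help.search("startsWith")    # the same search, using the help.search() function
```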
Besides the R documentation, I want to mention a few other resources here.
The first help resource is your own brain and creativity. If you don’t know what some code does, just run it, and see what it does. Just (carefully) looking at it might already be enough. The main message is that you will learn, discover and understand by playing around with the code. Playing for the win!
Perhaps most importantly, Google is your best friend. Whatever problem you run into with R, it is very likely that someone else ran into the same problem before you. Stack Overflow, for example, is a large Q&A platform where coders help each other with their programming issues (not only in R). Imagine we have a data frame of which we want to convert one column from character to numeric class, but we can’t remember exactly how to do this (see endnote 7). If you search for something along the lines of ‘convert data frame column character to numeric in R’, you will get plenty of results that can help you with this, including answers on Stack Overflow (https://stackoverflow.com/questions/37707060/converting-data-frame-column-from-character-to-numeric/37707117).
The Rseek website (www.rseek.org). One thing that I really find annoying about the R help documentation is that it’s hard to search properly. When coupled with the fact that the documentation is dense and highly technical, it’s often a better idea to search or ask online for answers to your questions. With that in mind, the Rseek website is great: it’s an R specific search engine. I find it really useful, and it’s almost always my first port of call when I’m looking around.
Another, more recent but also somewhat twisted, friend is the LLM, like ChatGPT. LLMs are remarkably good at writing code, and remarkably bad at making good jokes. One takeaway from this is that you shouldn’t despair: it is easier to be a good programmer than to be funny. A second takeaway is that you can ask ChatGPT for input whenever you are stuck. Do treat whatever it comes up with with some caution. With some luck, the code it produces will be a useful starting point. It is your responsibility, still, to make sure the code actually works as intended. So you need to read it, understand it, and check both intermediate steps and final output.
If you are becoming a more advanced R user, you might consider
joining the R-help mailing list (see
http://www.r-project.org/mail.html for details), although it won’t be needed for the purposes of this course. This is the
official R help mailing list. It can be very helpful, but it’s
very important that you do your homework before posting a
question. The list gets a lot of traffic. While the people on the
list try as hard as they can to answer questions, they do so for
free, and you really don’t want to know how much money they could
charge on an hourly rate if they wanted to apply market rates. In
short, they are doing you a favour, so be polite. Don’t waste their
time asking questions that can be easily answered by a quick search
on Rseek (it’s rude), make sure your question is clear, and all of
the relevant information is included. In short, read the posting
guidelines carefully
(http://www.r-project.org/posting-guide.html), and make use of the
help.request() function that R provides to check that you’re
actually doing what you’re expected to do.
Keep in mind, though, that by using these routes, you are quite likely to have your R problem solved. You are, however, not just in the business of problem solving, but of learning. So try to make sure you understand why the solution is a solution. This is especially easy when using ChatGPT: you can just ask it to explain what the proposed code is doing, so that you actually understand it!
A lot of R’s functionality is built-in and comes with simply installing R. For most of what we will be using in this book, that will suffice. But even more of R’s functionality is not built-in, and one of the benefits of R is the availability of this endless and growing list of advanced functionalities. So while it might be a bit premature to talk about them when you have just started to learn R, what I am going to explain now is so fundamental to R that I want to talk about it already, even if you won’t start using it till much later.
The additional functionality I am talking about is provided in a thing called packages. A package is basically just a big collection of functions, data sets and other R objects that are all grouped together under a common name. Some R packages are already installed when you put R on your computer, but the vast majority of them are out there on the internet, waiting for you to download, install and use them.
One of the main selling points for R is that there are thousands of packages that have been written for it, and these are all available online. So whereabouts online are these packages to be found, and how do we download and install them? There is a big repository of packages called the “Comprehensive R Archive Network” (CRAN).
There’s a critical distinction that you need to understand, which is the difference between having a package installed on your computer, and having a package loaded in R. As of this writing, there are just over 5000 R packages freely available “out there” on the internet.8 When you install R on your computer, you don’t get all of them: only about 30 or so come bundled with the basic R installation. So right now there are about 30 packages “installed” on your computer, and another 5000 or so that are not installed. So that’s what installed means: it means “it’s on your computer somewhere”. The critical thing to remember is that just because something is on your computer doesn’t mean R can use it. In order for R to be able to use one of your 30 or so installed packages, that package must also be “loaded”. Generally, when you open up R, only a few of these packages (about 7 or 8) are actually loaded.
So there are two things you need to remember about packages: 1) A package must be installed before it can be loaded. 2) A package must be loaded before it can be used. This two-step process might seem a little odd at first, but the designers of R had very good reasons to do it this way. Basically, the reason is that there are 5000 packages, and probably about 4000 authors of packages, and no-one really knows what all of them do. Keeping the installation separate from the loading minimizes the chances that two packages will interact with each other in a nasty way. But don’t worry, you get the hang of it pretty quickly. We will talk about the specifics of installing and loading packages in Section 5.
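The difference between installed and loaded can be checked from within R. The lines below are a small sketch, not part of the original text. They assume a standard R session, in which the bundled stats package is attached at startup.

```r
# Names of all packages that are installed on this computer
installed <- rownames(installed.packages())

# The search path; packages that are ready for use appear as "package:<name>"
attached <- search()

# stats comes bundled with R and is normally attached when R starts
stats_is_installed <- "stats" %in% installed
stats_is_attached <- "package:stats" %in% attached

print(stats_is_installed)
print(stats_is_attached)
```

A package that shows up in installed but not in the search path is on your computer, but R cannot use its functions directly yet.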
If you have done and understood all of this, you can find some additional exercises in the RIntroExercises document. You can use this box to make the exercises for Chapter 3. Note that there are two documents: one with only the questions, and one with the questions as well as some suggested solutions.
In Section 2.9 you were introduced to the basics of
functions, like sqrt(), length() and nchar(). There are two more fairly important things that you need to
understand about functions in R, and that’s the use of “named” arguments
and “default values” for arguments. Further, I will introduce you to the somewhat bizarre world of pipes.
That’s not
to say that this is the last we’ll hear about how functions work, but
these are the last things we desperately need to discuss in order to get
you started.
To understand what the first two concepts are all about, I’ll introduce
a new function to you. The round() function can be used to round some
value to the nearest whole number.
Exercise: Use the round() function for the value 3.1415.
Pretty straightforward, really. However, suppose I only wanted to round
it to two decimal places: that is, I want to get 3.14 as the output.
The round() function supports this, by allowing you to input a second
argument to the function that specifies the number of decimal places
that you want to round the number to. In other words, I could do this:
## [1] 3.14
What’s happening here is that I’ve specified two arguments: the first
argument is the number that needs to be rounded (i.e., 3.1415), the
second argument is the number of decimal places that it should be
rounded to (i.e., 2), and the two arguments are separated by a comma.
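As a minimal sketch of the two calls described above (the variable names are only there for illustration):

```r
# One argument: round to the nearest whole number
rounded_whole <- round(3.1415)

# Two arguments, separated by a comma: round to 2 decimal places
rounded_two <- round(3.1415, 2)

print(rounded_whole)
print(rounded_two)
```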
In this simple example, it’s quite easy to remember which argument
comes first and which one comes second, but for more complicated
functions this is not easy. Fortunately, most R functions make use of
argument names. For the round() function, for example, the number
that needs to be rounded is specified using the x argument, and the
number of decimal places that you want it rounded to is specified using
the digits argument. Because we have these names available to us, we
can specify the arguments to the function by name. We do so like this:
## [1] 3.14
Notice that this is kind of similar in spirit to variable assignments
(Section 2.3), except that I used = here, rather than <-.
In both cases, we’re specifying specific values to be associated with a
label. However, there are some differences between what I was doing
earlier on when creating variables, and what I’m doing here when
specifying arguments, and so as a consequence, it’s important that you
use = in this context.
As you can see, specifying the arguments by name involves a lot more typing, but it’s also a lot easier to read. Because of this, the commands in this book will usually specify arguments by name,9 since that makes it clearer to you what I’m doing.
One important thing to note is that when specifying the arguments using their names, it doesn’t matter what order you type them in. But if you don’t use the argument names, then you have to input the arguments in the correct order. In other words, these commands all produce the same output…
## [1] 3.14
## [1] 3.14
## [1] 3.14
but this one does not…
## [1] 2
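The exact commands behind the outputs above did not survive in this version of the text, so the following is a plausible reconstruction rather than the original code. It shows three calls that agree, and one in which the unnamed arguments are swapped.

```r
# These three calls all round 3.1415 to 2 decimal places
by_position <- round(3.1415, 2)
by_name <- round(x = 3.1415, digits = 2)
by_name_swapped <- round(digits = 2, x = 3.1415)

# Without names, order matters: here 2 is the number being rounded
swapped_unnamed <- round(2, 3.1415)

print(c(by_position, by_name, by_name_swapped))
print(swapped_unnamed)
```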
What does R do when you provide names for some arguments but not for others? Let’s see
## [1] 3.14
## [1] 3.14
The named argument is easy, of course. If you use x = 3.1415 you literally tell R that 3.1415 should serve as x. For the unnamed argument, R needs to decide what you mean by it. R assigns it to the first argument that has not been given a value by name. So in the first example, it knows that 2 serves as a value for digits, because round() expects arguments in the x and digits order. Since we have provided x, the first argument that has not been assigned a value is digits. In the second example, 3.1415 serves as x, because that’s the first argument round() expects, and that wasn’t assigned a value.
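Based on that description, the two mixed calls presumably looked something like this sketch (the variable names are mine):

```r
# x is named, so the unnamed 2 fills the next free argument: digits
named_x <- round(x = 3.1415, 2)

# digits is named, so the unnamed 3.1415 fills the first free argument: x
named_digits <- round(3.1415, digits = 2)

print(named_x)
print(named_digits)
```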
So if you want to use names for the arguments, you are basically free to do what you want. If you don’t want to use names, you have to be very careful about order! How do you find out what the correct order is? There are a few different ways, but the easiest one is to look at the help documentation for the function (see Section 3.4). However, if you’re ever unsure, it’s probably best to type in the argument name. To know the correct name, you also need to consult the help documentation, but at least, these names are often easier to remember than the order, so you will probably have to visit the help file less using the name approach than when you are using the order approach.
Now here is something weird. All of this works!
## [1] 3.14
## [1] 3.14
## [1] 3.14
The reason is that R (somewhat controversially) does partial matching with named arguments. Since round() has only one named argument that starts with d, digi, or digit, R sort of auto-completes these bits to digits.
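Here is a sketch of what such partially matched calls can look like, using the abbreviations mentioned above. I would not recommend writing code like this yourself, since the full name is much easier to read.

```r
# Each abbreviation is completed to the digits argument
partial_d <- round(3.1415, d = 2)
partial_digi <- round(3.1415, digi = 2)
partial_digit <- round(3.1415, digit = 2)

print(c(partial_d, partial_digi, partial_digit))
```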
You have to do something right, though. This doesn’t work:
## Error in round(x = 3.1415, digitjes = 2): unused argument (digitjes = 2)
## Error in round(x = 3.1415, Digits = 2): unused argument (Digits = 2)
## Error in round(x = 3.1415, getalletjes = 2): unused argument (getalletjes = 2)
and R tells you why it balks (or at least it tells you something is going wrong in the argument department).
Okay, so that’s the first thing I said you’d need to know: argument
names. The other thing you need to know about arguments is that they can
have default values. Notice that the first time I called the round()
function, round( 3.1415 ), I didn’t actually specify the digits argument at all, and yet
R somehow knew that this meant it should round to the nearest whole
number. How did that happen? The answer is that the digits argument
has a default value of 0, meaning that if you decide not to
specify a value for digits then R will act as if you had typed
digits = 0. This is quite handy: the vast majority of the time when
you want to round a number you want to round it to the nearest whole
number, and it would be pretty annoying to have to specify the digits
argument every single time. On the other hand, sometimes you actually do
want to round to something other than the nearest whole number, and it
would be even more annoying if R didn’t allow this! Thus, by having
digits = 0 as the default value, we get the best of both worlds.
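A quick way to convince yourself of this is to compare the two calls directly, as in this small sketch:

```r
# Leaving out digits ...
default_digits <- round(3.1415)

# ... gives the same result as spelling out the default value
explicit_zero <- round(3.1415, digits = 0)

same_result <- identical(default_digits, explicit_zero)
print(same_result)
```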
How do you find out what the default values are? Again, the easiest way is to look at the help documentation for the function (see Section 3.4). Or try to reverse engineer stuff by trying things out!
Perhaps unsurprisingly, assigning values to an argument can be done using variables. With this cryptic sentence, I mean that these commands do exactly the same:
## [1] 3.14
## [1] 3.14
Functionally, there is a difference, however. Using the first strategy, we have no access to x. Its value is only known to the function, not outside it. Since in the second approach, y is defined outside the function, it is also available outside of it:
## [1] 3.14
## Error in eval(expr, envir, enclos): object 'x' not found
## [1] 3.14
## Error in eval(expr, envir, enclos): object 'x' not found
## [1] 4.1415
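The commands behind these outputs are missing here, so the following is a reconstruction under that assumption. It leaves out the two calls that trigger the error about x.

```r
# First strategy: the value is typed directly into the call.
# Naming the argument x does not create a variable called x.
direct <- round(x = 3.1415, digits = 2)

# Second strategy: the value is stored in y first, then passed on
y <- 3.1415
via_variable <- round(x = y, digits = 2)

# y lives outside the function, so it can be reused afterwards
y_plus_one <- y + 1

print(direct)
print(via_variable)
print(y_plus_one)
```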
Some people are overargumentative. If you are like that, let’s see how R reacts. For example, you might (mistakenly) think that round() has an argument which tells you both the rounding down and rounding up result, called upndown, which can be set to TRUE:
## Error in round(x = 3.1415, upndown = TRUE): unused argument (upndown = TRUE)
## Error in eval(expr, envir, enclos): 3 arguments passed to 'round' which requires 1 or 2 arguments
R complains twice, giving you a different reason in each case.
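If you want to see the complaints without R stopping in its tracks, you can catch them. The sketch below assumes the two calls were a named upndown argument (which round() does not have) and a third unnamed argument. tryCatch() is not covered in this book, so treat this purely as an illustration.

```r
# An argument name that round() does not know
msg_named <- tryCatch(
  round(x = 3.1415, upndown = TRUE),
  error = function(e) conditionMessage(e)
)

# One unnamed argument too many
msg_unnamed <- tryCatch(
  round(3.1415, 2, TRUE),
  error = function(e) conditionMessage(e)
)

print(msg_named)
print(msg_unnamed)
```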
A final nugget of wisdom about using functions that I would like to share with you is piping. By now, you should know how easy it is to call a function. If you want to take the square root of something, you write down the function for that, which is sqrt(), and include the something between the brackets, like this:
## [1] 15
Somewhat counterintuitively, there is a different, more complicated way of doing exactly the same thing. It makes use of the forward pipe operator, which looks like this: |>. What it does is that it “pipes” (i.e., puts) everything to its left inside the function to its right, as the first argument in the call. Come take a look:
## [1] 15
So the above code does exactly the same as sqrt(225)! Weird or wonderful? Your call!
Piping with additional arguments is fairly straightforward:
## [1] 3.14
is functionally identical to
## [1] 3.14
and
## [1] 12.35
is identical to
## [1] 12.35
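To put the two styles side by side, here is a small sketch. It assumes R version 4.1.0 or later, because the |> operator does not exist in older versions.

```r
# A plain call and its piped twin
direct_sqrt <- sqrt(225)
piped_sqrt <- 225 |> sqrt()

# With an extra argument: the piped value becomes the first argument (x)
direct_round <- round(3.1415, digits = 2)
piped_round <- 3.1415 |> round(digits = 2)

print(c(direct_sqrt, piped_sqrt))
print(c(direct_round, piped_round))
```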
I won’t be using the piping approach much (or maybe even at all) in this book, but since it is gaining popularity in the R world, I thought I should get it on your radar. Personally, I don’t see the appeal of calling functions like this, but maybe that’s a very boomer thing to say.
As I’ve mentioned earlier, R has an incredible range of mathematical functions built into it, and there really wouldn’t be much point in trying to describe or even list all of them. I will focus only on those functions that are strictly necessary for this book. When doing statistics, you will find that you will be doing a lot of transformations. Also, you will find that a lot of the transformations that you might want to apply to your data are based on fairly simple mathematical functions and operations. In this section, I want to return to that discussion, and mention several other mathematical functions and arithmetic operations that I didn’t bother to mention when introducing you to R, but are actually quite useful for a lot of real-world data analysis. Table 4.1 gives a brief overview of the various mathematical functions I want to talk about (and some that I already have talked about). Obviously, this doesn’t even come close to cataloguing the range of possibilities available in R, but it does cover a very wide range of functions that are used in day to day data analysis.
| Mathematical function | R function | Example input | Answer |
|---|---|---|---|
| square root | sqrt() | sqrt(25) | 5 |
| absolute value | abs() | abs(-23) | 23 |
| rounding to nearest | round() | round(1.32) | 1 |
| rounding down | floor() | floor(1.32) | 1 |
| rounding up | ceiling() | ceiling(1.32) | 2 |
| logarithm (base 10) | log10() | log10(1000) | 3 |
| logarithm (base e) | log() | log(1000) | 6.908 |
| exponentiation | exp() | exp(6.908) | 1000.245 |
| sum | sum() | sum(c(2,1,6)) | 9 |
| mean | mean() | mean(c(2,1,6)) | 3 |
| cumulative sum | cumsum() | cumsum(c(2,1,6)) | 2 3 9 |
One very simple transformation that crops up surprisingly often is the
need to round a number to the nearest whole number, or to a certain
number of decimal places. To start with, let’s assume that we want
to round to a whole number. To that end, there are three useful
functions in R you want to know about: round(), floor() and
ceiling().
You are already familiar with the round() function from Section 4.2.1. It just rounds to the nearest whole
number. So if you round the number 4.3, it “rounds down” to 4, like
so:
## [1] 4
In contrast, if we want to round the number 4.7, we would round upwards
to 5. In everyday life, when someone talks about “rounding”, they
usually mean “round to nearest”, so this is the function we use most of
the time. However, sometimes you have reasons to want to always round up
or always round down. If you want to always round down, use the
floor() function instead; and if you want to force R to round up, then
use ceiling(). That’s the only difference between the three functions.
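A short sketch that puts the three functions next to each other, using the values 4.3 and 4.7 from the text:

```r
# Round to nearest: 4.3 goes down, 4.7 goes up
nearest_low <- round(4.3)
nearest_high <- round(4.7)

# Always down, even though 4.7 is closer to 5
always_down <- floor(4.7)

# Always up, even though 4.3 is closer to 4
always_up <- ceiling(4.3)

print(c(nearest_low, nearest_high, always_down, always_up))
```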
What if you want to round to a certain number of digits? Let’s suppose
you want to round to a fixed number of decimal places, say 2 decimal
places. If so, what you need to do is specify the digits argument to
the round() function, as was discussed in Section 4.2.1.
Exercise: Round the value 0.0123 to 2 decimal places. Specify the
arguments x and digits.
round( x = 0.0123, digits = 2 )
Next up are logarithms and exponentials. Although they aren’t needed anywhere else in this book, they are everywhere in statistics more broadly, and not only that, there are a lot of situations in which it is convenient to analyse the logarithm of a variable (i.e., to take a “log-transform” of the variable). I suspect that many (maybe most) readers of this book will have encountered logarithms and exponentials before, but from past experience, I know that there’s a substantial proportion of students who take a social science statistics class who haven’t touched logarithms since high school, and would appreciate a bit of a refresher.
In order to understand logarithms and exponentials, the easiest thing to
do is to actually calculate them and see how they relate to other simple
calculations. There are three R functions in particular that I want to
talk about, namely log(), log10() and exp(). To start with, let’s
consider log10(), which is known as the “logarithm in base 10”. The
trick to understanding a logarithm is to understand that it’s
basically the “opposite” of taking a power. Specifically, the logarithm
in base 10 is closely related to the powers of 10. So let’s start by
noting that 10-cubed is 1000. Mathematically, we would write this:
\[ 10^3 = 1000 \]
and in R we’d calculate it by using the command 10^3. The next step is
to recognise that the statement that “10 to
the power of 3 is equal to 1000” is the mirror
image of the statement that “the logarithm (in base 10) of 1000 is equal
to 3”. Mathematically, we write this as follows,
\[ \log_{10}( 1000 ) = 3 \]
Exercise: Calculate the base-10 logarithm of 1000 using the
log10() function.
log10(1000)
Obviously, since you already know that \(10^3 = 1000\) there’s really no point in getting R to tell you that the base-10 logarithm of 1000 is 3. However, most of the time you probably don’t know what the right answer is. For instance, I can honestly say that I didn’t know that \(10^{2.69897} = 500\), so it’s rather convenient for me that I can use R to calculate the base-10 logarithm of 500.
## [1] 2.69897
Or at least it would be convenient if I had a pressing need to know the base-10 logarithm of 500.
Okay, since the log10() function is related to the powers of 10, you
might expect that there are other logarithms (in bases other than 10)
that are related to other powers too. And of course, that’s true:
there’s not really anything mathematically special about the number 10.
You and I happen to find it useful because decimal numbers are built
around the number 10, but the big bad world of mathematics scoffs at our
decimal numbers. Sadly, the universe doesn’t actually care how we write
down numbers. Anyway, the consequence of this cosmic indifference is
that there’s nothing particularly special about calculating logarithms
in base 10. You could, for instance, calculate your logarithms in base
2, and in fact, R does provide a function for doing that, which is (not
surprisingly) called log2(). Since we know that
\(2^3 = 2 \times 2 \times 2 = 8\), it’s no surprise to see that
## [1] 3
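The mirror-image relation between powers and logarithms can be checked directly. This is just a sketch using the numbers from the text:

```r
# A power and its mirror image in base 10
power_of_ten <- 10^3
log_of_thousand <- log10(1000)

# Taking the logarithm and then the power brings you back to 500
log_of_500 <- log10(500)
back_to_500 <- 10^log_of_500

# The same idea in base 2
log2_of_8 <- log2(8)

print(c(power_of_ten, log_of_thousand, log_of_500, back_to_500, log2_of_8))
```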
Alternatively, a third type of logarithm (and one we see a lot more of in statistics than either base 10 or base 2) is called the natural logarithm, and corresponds to the logarithm in base \(e\). Since you might one day run into it, I’d better explain what \(e\) is. The number \(e\), known as Euler’s number, is considered one of the most important numbers in mathematics. It is one of those annoying “irrational” numbers whose decimal expansion is infinitely long. The first few digits of \(e\) are:
\[ e \approx 2.718282 \]
There are quite a few situations in statistics that require us to
calculate powers of \(e\).
Raising \(e\) to
the power \(x\) is called the exponential of \(x\), and so it’s very
common to see \(e^x\) written as \(\exp(x)\). And so it’s no surprise that R
has a function that calculates exponentials, called exp(). For
instance, suppose I wanted to calculate \(e^3\). I could try typing in the
value of \(e\) manually, like this:
## [1] 20.08554
but it’s much easier to do the same thing using the exp() function.
Exercise: Calculate the exponential of 3 using the exp() function.
exp(3)
Anyway, because the number \(e\) crops up so often in statistics, the
natural logarithm (i.e., logarithm in base \(e\)) also tends to turn up.
Mathematicians often write it as \(\log_e(x)\) or \(\ln(x)\), or sometimes
even just \(\log(x)\). In fact, R works the same way: the log() function
corresponds to the natural logarithm.10 Anyway, as a quick check,
let’s calculate the natural logarithm of 20.08554 using R:
## [1] 3
And with that, I think we’ve had quite enough exponentials and logarithms for this book!
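For completeness, here is a small sketch that ties exp() and log() together, using the same numbers as above:

```r
# Typing in e by hand versus letting R do the work
manual_power <- 2.718282^3
exponential <- exp(3)

# The natural logarithm undoes the exponential
back_to_three <- log(exponential)

print(manual_power)
print(exponential)
print(back_to_three)
```

The two ways of computing \(e^3\) agree to the displayed precision, and log() takes you straight back to 3.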
sum(), the mean(), and the cumsum()
Although I will defer all true statistical content to Chapter 11, I make one exception here, and that is using R to compute the mean.
As a recap, here’s what you should do to compute the mean: add all the values up and then divide by the total number of values. Okay, how do we get the magic computing box to do the work for us? If you really wanted to, you could do this calculation directly in R.
To make things a bit concrete, let’s use some data.
Unlike most data sets in this book, these are actually real data,
relating to the Australian Football League (AFL).11 The afl.margins
variable contains the winning margin (which is just the difference between the number of points, so if one team scores 26 and the other 21, the margin is 5) for all 176 home
and away games played during the 2010 season.
Here is what the first couple of scores look like (in a moment, I will
show how I could use the head() function for that)
## [1] 56 31 56 8 32
Exercise: For the first 5 AFL margins (56, 31, 56, 8, 32),
calculate the mean just by typing it in as if R were a calculator.
(56 + 31 + 56 + 8 + 32) / 5
… in which case R outputs the answer 36.6, just as if it were a calculator.
However, that’s not the only way to do the calculations, and when the
number of observations starts to become large, it’s easily the most
tedious. Besides, in almost every real-world scenario, you’ve already
got the actual numbers stored in a variable of some kind, just like we
have with the afl.margins variable. Under those circumstances, what
you want is a function that will just add up all the values stored in a
numeric vector. That’s what the sum() function does.
If we want to add up all 176 winning margins in the data set, we can do so using the following command:
## [1] 6213
Exercise: Take the sum of the first five observations of
afl.margins, using the sum() function.
sum( afl.margins[1:5] )
Exercise: Now calculate the mean, by telling R to divide the output
of the summation of the first five observations by 5. Use the sum()
function.
sum( afl.margins[1:5] ) / 5
Although it’s pretty easy to calculate the mean using the sum()
function, we can do it in an even easier way, since R also provides us
with the mean() function.
Exercise: Calculate the mean of all 176 games using the mean()
function.
mean( afl.margins )
Just to show you that there’s nothing funny going on, here’s what we would do to calculate the mean for the first five observations:
## [1] 36.6
As you can see, this gives exactly the same answers as the previous calculations.
Fairly easy, huh?
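If you do not have the afl.margins data at hand, you can still follow along with the first five margins typed in by hand. This sketch uses a stand-in vector called first_five (my name, not one from the data set):

```r
# The first five winning margins, typed in by hand
first_five <- c(56, 31, 56, 8, 32)

# The mean the long way: add everything up, divide by the count
total <- sum(first_five)
mean_by_hand <- total / length(first_five)

# The mean the short way
mean_builtin <- mean(first_five)

print(total)
print(mean_by_hand)
print(mean_builtin)
```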
Sometimes, you don’t want just the sum, but the cumulative sum. Again, R helps you out here. It sort of speaks for itself:
## [1] 56 87 143 151 183
The first element of the output is simply the first element of afl.margins. The second element (87) is the sum of the first 2 elements of afl.margins (56 and 31). The third element is the sum of the first 3 elements of afl.margins (56, 31, and 56), and so on.
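A sketch of the cumulative sum, again with the first five margins typed in by hand as a stand-in for afl.margins[1:5]:

```r
first_five <- c(56, 31, 56, 8, 32)

# Each element is the sum of everything up to and including that position
running_total <- cumsum(first_five)
print(running_total)

# The last running total equals the ordinary sum
last_matches_sum <- running_total[length(running_total)] == sum(first_five)
print(last_matches_sum)
```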
sum() and mean() with logical data
The sum() function is especially useful in combination with logical data, by virtue of TRUEs and FALSEs doubling as 1s and 0s, as you discovered in Section 3.3.4. It makes it quite easy to count how many cases of something are in your data set.
Suppose we want to know how many AFL margins in our data set are larger than 100. Let’s ask R:
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
This doesn’t quite give the answer we are after. It gives us a bunch of
TRUEs and FALSEs, where a TRUE indicates that the margin is larger than
100. So what we need to do to get to our answer, is to count the number
of TRUEs. Somewhat surprisingly, the sum() function helps us out here.
Why is that?
## [1] 4
The reason that it works is that, as I discussed in Section 3.3.4, TRUE
and FALSE act as 1 and 0, so when summing the collection of FALSEs and
TRUEs, we are just summing 0s and 1s. Since adding a 0 doesn’t really do
anything, what this boils down to is just summing the 1s. And summing a
number of 1s is of course identical to just counting the number of 1s.
So the end result of the sum operation is the number of 1s we had, or
the number of TRUEs.
Once we have the number of TRUEs, it is of course very easy to turn this
frequency into a proportion. Using length(), we count the total number
of games. And the proportion of games with a margin>100 is nothing more
than the number of games with a margin>100 divided by the total number
of games.
## [1] 0.02272727
Now if you really want to be badass, you could even use the mean()
function to compute the proportion of interest:
## [1] 0.02272727
So, in sum, what I mean is that in some cases, we can use mean() to compute a proportion! This might seem tricky at first, but is nothing magical, really. Remember that the mean just adds up all things and then divides it by the total number. As per above, that is exactly what we need to do if we want to compute the proportion.
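The same counting trick on a smaller scale: the sketch below uses the first ten margins typed in by hand, and a cut-off of 50 instead of 100. Both choices are mine, made only so that you can check the result by eye.

```r
first_ten <- c(56, 31, 56, 8, 32, 14, 36, 56, 19, 1)

# TRUE wherever the margin is larger than 50
is_large <- first_ten > 50

# Summing TRUEs counts them; dividing by the length gives the proportion
n_large <- sum(is_large)
prop_large <- n_large / length(first_ten)

# mean() does the counting and the dividing in one go
prop_via_mean <- mean(is_large)

print(n_large)
print(prop_large)
print(prop_via_mean)
```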
Exercise: What proportion of games has a winning margin of exactly 3?
sum(afl.margins==3)/length(afl.margins) #one way
mean(afl.margins==3) #another way
This is a feature we will be using quite a bit, so it is a good idea to familiarize yourself. Often, it will help to use the work-from-within strategy. For example, can you make sense of this line?
## [1] 0.7
It computes the proportion of elements in x for which the absolute value is larger than 2. In that single statement, no fewer than 3 things are happening, starting from within and working your way outward:
1: compute the absolute value of x
## [1] 7 3 6 4 4 1 0 8 9 2
2: compute which elements of the absolute value of x are larger than 2
## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
3: compute the proportion of TRUEs in this vector
## [1] 0.7
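The definition of x is not shown in this version of the text. The sketch below uses a hypothetical x whose absolute values match the output above (the signs are my own invention), and walks through the three steps one by one.

```r
# Hypothetical x: the absolute values match the text, the signs are made up
x <- c(-7, 3, -6, 4, -4, 1, 0, 8, -9, 2)

# Step 1: absolute values
step1 <- abs(x)

# Step 2: which absolute values are larger than 2?
step2 <- step1 > 2

# Step 3: proportion of TRUEs
step3 <- mean(step2)

# All three steps in one line give the same answer
one_liner <- mean(abs(x) > 2)

print(step1)
print(step2)
print(step3)
```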
There are a few quite convenient functions you will be happy to know. As most of them take quite long to describe and just seeing what they do is much easier for all parties involved, I will often be brief on the description. As the saying goes, an R command is worth a thousand words.
rep() and seq()
For example, here is how you can repeat stuff:
## [1] 2 2 2 2 2 2 2 2 2
## [1] "z" "z" "z" "z"
## [1] 3 4 3 4 3 4 3 4 3 4 3 4
## [1] "2" "q" "2" "q" "2" "q" "2" "q" "2" "q"
Note how in the last example, the number 2 was “characterized”: because a vector can only hold one type of data, it was converted to the text "2".
I always forget whether rep(2,9) creates 9 2s or 2 9s. Using named arguments solves this first-world problem.
## [1] 2 2 2 2 2 2 2 2 2
## [1] 2 2 2 2 2 2 2 2 2
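The rep() commands themselves are missing above, so here is a sketch of calls that would produce those outputs. Treat it as a reconstruction, not as the original code.

```r
nine_twos <- rep(2, 9)              # the number 2, nine times
four_zs <- rep("z", 4)              # the text "z", four times
three_four <- rep(c(3, 4), 6)       # the pair 3 4, six times
mixed <- rep(c(2, "q"), 5)          # mixing types turns 2 into "2"

# With named arguments there is no doubt about which is which
named <- rep(x = 2, times = 9)
named_swapped <- rep(times = 9, x = 2)

print(nine_twos)
print(four_zs)
print(three_four)
print(mixed)
```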
Here’s another cool function, if you find yourself in that sequence-making mood:
## [1] 2 3 4 5 6 7 8 9 10 11 12
## [1] 2 5 8 11
## [1] 2 5 8 11
## [1] 2 4 6 8 10 12
So that’s a nice way to generate a sequence.
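As with rep(), the seq() commands did not make it into this version, so the calls below are my best guess at commands that give the outputs shown above.

```r
# From 2 to 12 in steps of 1
by_one <- seq(2, 12)

# From 2 to 12 in steps of 3, with and without argument names
by_three <- seq(2, 12, by = 3)
by_three_named <- seq(from = 2, to = 12, by = 3)

# From 2 to 12 in steps of 2
by_two <- seq(2, 12, by = 2)

print(by_one)
print(by_three)
print(by_two)
```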
head() and tail()
Some variables are pretty big. For example that afl.margins variable
contains 176 games, which is a lot of info to digest if it is printed
out on my computer screen. To that end, R provides you with a few useful
functions to print out only a few of the elements. The first of these is
head() which prints out the first couple 12 of elements, like this:
## [1] 56 31 56 8 32 14
You can also use the tail() function to print out the last couple 13 of elements.
As always, R serves every whim you might have. If you want more than the default number of first entries, you do you!
## [1] 56 31 56 8 32 14 36 56 19 1
Looking at the last entries can be done by tail().
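A sketch of head() and tail() on a small stand-in vector, namely the first ten margins typed in by hand:

```r
first_ten <- c(56, 31, 56, 8, 32, 14, 36, 56, 19, 1)

# By default, head() shows the first 6 elements
default_head <- head(first_ten)

# The n argument changes how many elements you get
first_three <- head(first_ten, n = 3)
last_three <- tail(first_ten, n = 3)

print(default_head)
print(first_three)
print(last_three)
```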
diff()
Try to understand what diff() does:
## [1] 2 6
Well done! It computes the difference between each element and the one before it: element 2 minus element 1, element 3 minus element 2, and so on.
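Here is diff() applied to the first five margins (typed in by hand), as a small sketch. Note that the result is one element shorter than the input.

```r
first_five <- c(56, 31, 56, 8, 32)

# 31 - 56, 56 - 31, 8 - 56, 32 - 8
changes <- diff(first_five)
print(changes)
```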
max() and min()
People are fond of extremes. Maybe you are, too. What’s the biggest difference in scores, you wonder? You can ask R. It’s easy:
## [1] 116
I am sure you are bright enough to guess what min() does and how to use
it.
which() and which.max()
One function that can be handy is the which() function; it takes as
input a vector of logicals and outputs the indices of the TRUE cases.
Exercise: Apply the which() function to find which games in
afl.margins have a margin larger than 100.
which( afl.margins > 100 )
# Or:
large.cases <- afl.margins > 100
which( large.cases )
What this has done is shown us that the large cases correspond to games 12, 46, 157, and 163.
We know from above that the highest margin was a whopping 116. But which game has this monster score?
Of course, we could do this:
## [1] 163
But also of course, R wants you to know it is vastly smarter than you, so you could also do this:
## [1] 163
I don’t think I should tell you what which.min() does, do I?
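A sketch of the extremes and their positions, using the first ten margins typed in by hand. It also shows one difference between the two approaches: which.max() only reports the first position of the maximum, while which() reports all of them.

```r
first_ten <- c(56, 31, 56, 8, 32, 14, 36, 56, 19, 1)

largest <- max(first_ten)
smallest <- min(first_ten)

# All positions at which the maximum occurs
all_max_positions <- which(first_ten == max(first_ten))

# Only the first position of the maximum, and of the minimum
first_max_position <- which.max(first_ten)
first_min_position <- which.min(first_ten)

print(c(largest, smallest))
print(all_max_positions)
print(c(first_max_position, first_min_position))
```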
unique()
Sometimes you wanna go full Marie Kondo and remove all ballast. unique() does exactly that and removes all duplicate elements:
## [1] 56 31 8 32 14 36 19 1 3 104 43 44 72 9 28 25 27 55 20
## [20] 16 7 23 40 48 64 22 95 15 49 52 50 10 65 12 39 26 108 53
## [39] 38 4 13 66 67 61 29 81 37 70 35 54 47 2 41 24 11 71 18
## [58] 0 60 57 83 84 30 68 75 63 82 73 33 76 5 94 98 89 101 21
## [77] 42 116 6
Make good use of it.
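On a smaller scale, with the first ten margins typed in by hand, the effect is easy to verify by eye:

```r
first_ten <- c(56, 31, 56, 8, 32, 14, 36, 56, 19, 1)

# 56 occurs three times, but is kept only once (at its first position)
distinct_margins <- unique(first_ten)
print(distinct_margins)
```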
toupper()
A task that comes up quite often is making transformations to text. A simple example of this would be converting text to lower case or upper case, which you can do using the toupper() and tolower() functions. Both of these functions have a single argument x which contains the text that needs to be converted. Imagine we have the following text vector.
Exercise: Convert text to lower case.
tolower( x = text )
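If you want to try this outside the exercise box, the sketch below might help. The contents of text are invented for illustration and need not match the vector used in the exercise.

```r
# Hypothetical text vector
text <- c("Hello", "WORLD")

tolower(x = text)   # "hello" "world"
toupper(x = text)   # "HELLO" "WORLD"
```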
startsWith() and endsWith()
This is pretty self-explanatory. See for yourself.
## [1] TRUE
## [1] FALSE
## [1] TRUE
## [1] FALSE
What happens, you wonder, when the input is not a character? Wonder no more:
## [1] 1 2 3 4 5 6 7 8 9 10
## Error in startsWith(x, 1): non-character object(s)
So we really need a character as input. This also means that numbers only work when they are written as text, between quotation marks:
## [1] FALSE
## [1] TRUE
## Error in endsWith(x, 7): non-character object(s)
The first two commands run without problem, because the numbers 6 and 7 are, by virtue of the quotation marks, treated as text. When we, however, start treating the 7 as the number it is, as in the last line, R spits out an error.
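The following sketch pulls these cases together. The strings are my own examples, not necessarily the ones used to produce the output above.

```r
# Hypothetical character string
word <- "statistics"

startsWith(word, "stat")   # TRUE
startsWith(word, "S")      # FALSE: the comparison is case sensitive
endsWith(word, "ics")      # TRUE

# Numbers are fine as long as they are written as text
endsWith("2017", "7")      # TRUE
startsWith("2017", "6")    # FALSE

# endsWith("2017", 7) would stop with the error: non-character object(s)
```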
Sometimes, you will need either to glue several character
strings together or to pull them apart. To glue several strings
together, the paste() function is very useful. There are two important
arguments to the paste() function:
...: These dots refer to an unnamed argument, and “match” up against any number of inputs. In
this case, the inputs should be the various different strings you
want to paste together.
sep: This argument should be a string, indicating what characters
R should use as separators, in order to keep each of the original
strings separate from each other in the pasted output. By default,
the value is a single space, sep = " ". This is made a little
clearer when we look at the examples.
That probably doesn’t make much sense yet, so let’s start with a simple
example. First, let’s try to paste two words together.
Exercise: Paste together the words “hello” and “world” using the
paste() function, without specifying any other arguments.
paste( "hello", "world" )
Notice that R has inserted a space between the "hello" and "world".
Suppose that’s not what I wanted. Instead, I might want to use . as
the separator character, or to use no separator at all. To do either of
those, I would need to specify sep = "." or sep = "". For
instance:
## [1] "hello.world"
To be honest, it does bother me a little that the default value
of sep is a space. Normally when I want to paste strings together
I don’t want any separator character, so I’d prefer it if the
default were sep="". To that end, it’s worth noting that there’s
also a paste0() function, which is identical to paste() except
that it always assumes that sep="".
## [1] "helloworld"
## [1] "helloworld"
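For reference, here is a short sketch that puts the variants discussed in this section next to each other:

```r
paste("hello", "world")              # default separator is a space: "hello world"
paste("hello", "world", sep = ".")   # "hello.world"
paste("hello", "world", sep = "")    # no separator at all: "helloworld"
paste0("hello", "world")             # same result as sep = "": "helloworld"
```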
any() function
In the afl.margins data, is there at least one game with a margin of 8? You can use any() to find out!
## [1] TRUE
We also learn there is no game with a margin of 117.
## [1] FALSE
Sweet.
all() function
Do you also feel we are living in an age where mankind is longing for everything to be true? If you are one of those people, I proudly present to you the all() function. If the input is a logical vector, it checks whether all elements are TRUE. See for yourself:
## [1] FALSE
## [1] TRUE
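Here is a small sketch of any() and all() working on the same logical tests. The vector scores is hypothetical (ten values only), and the particular tests are my own choice rather than the ones behind the output above.

```r
# Hypothetical stand-in for afl.margins
scores <- c(56, 31, 56, 8, 32, 14, 36, 56, 19, 1)

any(scores == 8)     # TRUE: at least one element equals 8
any(scores == 117)   # FALSE: no element equals 117
all(scores > 0)      # TRUE: every element is positive
all(scores > 10)     # FALSE: 8 and 1 spoil the party
```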
all.equal() function aka the problem with floating-point arithmetic
If I’ve learned nothing else about transfinite arithmetic (and I haven’t) it’s that infinity is a tedious and inconvenient concept. Not only is it annoying and counterintuitive at times, but it has nasty practical consequences. As we were all taught in high school, there are some numbers that cannot be represented as a decimal number of finite length, nor can they be represented as any kind of fraction between two whole numbers; \(\sqrt{2}\), \(\pi\) and \(e\), for instance. In everyday life, we mostly don’t care about this. I’m perfectly happy to approximate \(\pi\) as 3.14, quite frankly. Sure, this does produce some rounding errors from time to time, and if I’d used a more detailed approximation like 3.1415926535 I’d be less likely to run into those issues, but in all honesty, I’ve never needed my calculations to be that precise. In other words, although our pencil and paper calculations cannot represent the number \(\pi\) exactly as a decimal number, we humans are smart enough to realise that we don’t care. Computers, unfortunately, are dumb … and you don’t have to dig too deep in order to run into some very weird issues that arise because they can’t represent numbers perfectly. Here is my favourite example:
## [1] FALSE
Obviously, R has made a mistake here, because this is definitely the wrong answer. Your first thought might be that R is broken, and you might be considering switching to some other language. But you can reproduce the same error in dozens of different programming languages, so the issue isn’t specific to R. Your next thought might be that it’s something in the hardware, but you can get the same mistake on any machine. It’s something deeper than that.
The fundamental issue at hand is floating point arithmetic, which
is a fancy way of saying that computers will always round a number to a
fixed number of significant digits. The exact number of significant
digits that the computer stores isn’t important to us:14 what matters
is that whenever the number that the computer is trying to store is very
long, you get rounding errors. That’s actually what’s happening with our
example above. There are teeny tiny rounding errors that have appeared
in the computer’s storage of the numbers, and these rounding errors have
in turn caused the internal storage of 0.1 + 0.2 to be a tiny bit
different from the internal storage of 0.3.
How big are these differences? Let’s ask R:
## [1] 5.551115e-17
Knowing that e-17 should be read as 10^(-17) or 0.00000000000000001,
this is very tiny indeed. No sane person would care about differences
that small. But R is not a sane person, and the equality operator ==
is very literal-minded. It returns a value of TRUE only when the two
values that it is given are absolutely identical to each other. And in
this case, they are not.
However, this only answers half of the question. The other half of the question is, why are we getting these rounding errors when we’re only using nice simple numbers like 0.1, 0.2 and 0.3? This seems a little counterintuitive. The answer is that, like most programming languages, R doesn’t store numbers using their decimal expansion (i.e., base 10: using digits 0, 1, 2 …, 9). We humans like to write our numbers in base 10 because we have 10 fingers. But computers don’t have fingers, they have transistors; and transistors are built to store 2 numbers, not 10. So you can see where this is going: the internal storage of a number in R is based on its binary expansion (i.e., base 2: using digits 0 and 1). And unfortunately, here’s what the binary expansion of 0.1 looks like:
\[ .1 \mbox{(decimal)} = .00011001100110011... \mbox{(binary)} \]
and the pattern continues forever. In other words, from the perspective of your computer, which likes to encode numbers in binary,15 0.1 is not a simple number at all. To a computer, 0.1 is actually an infinitely long binary number! As a consequence, the computer can make minor errors when doing calculations here.
Hopefully, it is now clear that the problem is the result of the twin facts that (1) we usually think in decimal numbers and computers usually compute with binary numbers, and (2) computers are finite machines and can’t store infinitely long numbers. The only questions that remain are when you should care and what you should do about it. Thankfully, you don’t have to care very often: because the rounding errors are small, the only practical situation in which I’ve seen this issue arise is when you want to test whether two numbers are exactly identical (e.g., is someone’s response time equal to exactly \(2 \times 0.33\) seconds?). This is pretty rare in real-world data analysis, but just in case it does occur, it’s better to use a test that allows for a small tolerance. That is, if the difference between the two numbers is below a certain threshold value, we deem them to be equal for all practical purposes.
Okay, the problem is clear, but what about the solution? For instance, you could do something like this, which asks whether the difference between the two numbers is less than a tolerance of \(10^{-10}\):
## [1] TRUE
Neat, but clumsy. R, do you have something else up your sleeve? Most definitely, you are too kind to ask! There is a function called all.equal() that lets you test for equality but allows a small tolerance for rounding errors:
## [1] TRUE
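The whole story of this section fits in a few lines of R. The sketch below uses 0.1 + 0.2 and 0.3, the numbers discussed above; the last line, with isTRUE(), is an extra that the text does not cover.

```r
0.1 + 0.2 == 0.3                     # FALSE: == insists on exact equality
(0.1 + 0.2) - 0.3                    # the tiny rounding error: 5.551115e-17
abs((0.1 + 0.2) - 0.3) < 1e-10       # TRUE: a hand-made tolerance test
all.equal(0.1 + 0.2, 0.3)            # TRUE: the built-in tolerance test

# When the numbers really differ, all.equal() returns a description of the
# difference instead of FALSE, so wrap it in isTRUE() to get a clean logical
isTRUE(all.equal(0.1 + 0.2, 0.4))    # FALSE
```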
print()
The print() function displays things. That’s easy enough. The difficult bit is that it seems unnecessary. Consider the following code:
## [1] 10
This code has printed x, without using the print() function. Who on earth had so much time on their hands or that much need for validation to spend time making the print() function?
First off, it doesn’t hurt. In the code below, it doesn’t really do anything, but it helps make clear what I’m doing.
## [1] 10
Second, it can be useful if you are sourcing a script (as I will
discuss in Section ??). If you source a script, just having x in your script won’t show x, but print(x) will.
Third, if you want to have something printed while running a function (see Section XXX) or a loop (see Section XXX), you will need print().
Finally, it sometimes makes things look nicer. One example we will encounter in Section 6.5.1 is that, if you want to look at a data frame in the browser environment we are using, using print() will make the data frame look nice. Weirdly, it is not needed for anything other than data frames, and even not needed for data frames if you are using R in RStudio instead of in the browser environment. Don’t worry if this sounds like gibberish now. You will see in due time.
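To give you a first taste of the third reason, here is a minimal sketch using a loop. Loops are only explained later, so treat the for (...) { } part as a black box for now; the variable name i is arbitrary.

```r
# Inside a loop, just naming a variable shows nothing
for (i in 1:3) {
  i
}

# With print(), each value is displayed: 1, then 2, then 3
for (i in 1:3) {
  print(i)
}
```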
If you have done and understood all of this, you can find some additional exercises in the RIntroExercises document. You can use this box to make the exercises for Chapter 4. Note that there are two documents: one with only the questions, and one with the questions as well as some suggested solutions.
Up till now, you have been working with R in a browser. This was, so I hope, useful for learning R. This setup was only used because I hope it facilitates providing exercises and solutions. But once you start using R instead of learning R, you will no longer work in this nifty browser environment. Rather, you will use R in, for example, RStudio.
The R terminal that comes with the installation of R should, in principle, be enough for using R. However, it is not as visually pretty as the RStudio version, and lacks some of the cooler features that RStudio provides. That’s why we’ll be using R from within RStudio.
There are some specific things to using R in RStudio that you really learn best in RStudio directly, rather than in this browser environment. These things are quite important, because, I can not stress this enough, once you’ve done learning R and start using R, you will no longer be running R code in this nifty browser environment!
This chapter will be a lot less interactive than the other materials. I do recommend that you not just read what is described here, but also try it out yourself.
First, a disclaimer. Some of the things I describe in this document (such as tab to autocomplete) aren’t just an RStudio thing. For example, if you’re running R in a terminal window instead of in RStudio, tab autocomplete works in exactly the way I describe below. I don’t bother to document that here: my assumption is that if you are running R in the terminal then you’re already familiar with using tab autocomplete. So I am not going to distinguish between what is an R feature and what is an RStudio thing. I do try to have a life outside of this, you know.
When you open RStudio, you will see there are different panes (or panels). Most of the current document is describing how using these panes can make your life as an R user easy. 16
You will see that often, when you get RStudio to do something using one of the panes, you’ll actually see the R commands that get created and show up.
For example, when you install the abtest package using the Packages panel (no worries; I explain below), in the Console panel you will see install.packages("abtest") appearing as a command. RStudio has sent a command to the R console, exactly as if you’d typed it yourself!
This means there are often at least two different ways of doing things: using the menu-based options (aka panel-based interface) provided by RStudio and using command-based options (aka text-based interface) from R (in the R console). Throughout this chapter, the idea is that I’ll first show you the (often easy) way to do it using RStudio and then, if needed, describe the (sometimes awkward) R commands that do all the work. I suspect that mostly, you will be using the menu-based interface, but the text-based interface does deserve some attention. One reason to be at least aware of the command-based way of doing things is that when you make a script (see Section XXX; for example if you want to share your hard R work, or want to keep a memory of what you did), it can be handy to have everything that needs to be done as an R command in text, rather than as a click-on-this instruction. Of course, in that case, you can actually use the panel-based option, and then copy and store the R commands that RStudio might have generated in your script.
To start, I will focus on the panel labelled Console. This is where R will execute the commands you ask it to perform. Working in the console pane is very similar to typing the commands you did before in the boxes in the browser I provided. However, since the pane is not just boxes, there are a few nifty things the console is helping you with, unlike these boxes. Let’s unpack these little nuggets.
We know how to enter commands in R. As a recap, let’s use R to add 10 and 20. To do that, in the R console, type 10+20 and hit enter. If R gives the correct answer (30), we are good.
So far so good. Now, for the cool stuff. If you hit enter in a situation where it’s “obvious” to R that you haven’t actually finished typing the command, R is just smart enough to keep waiting. For example, if you type 10 + and then press enter, even R is smart enough to realise that you probably wanted to type in another number.
So if you type 10+ and then accidentally press enter, there’s a blinking cursor next to the plus sign on the new line. What this means is that R is still waiting for you to finish. It “thinks” you’re still typing your command, so it hasn’t tried to execute it yet. In other words, this plus sign is actually another command prompt. It’s different from the usual one (i.e., the > symbol) to remind you that R is going to “add” whatever you type now to what you typed last time. For example, if I then go on to type 20 and hit enter, what I get is the correct answer (30).
And as far as R is concerned, this is exactly the same as if you had typed 10 + 20.
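The same thing works when the command is stored in a file or pasted in one go. In the sketch below, the name total is made up; the point is only that the command is spread over two lines.

```r
# The first line ends with +, so R knows the command is not finished yet
total <- 10 +
  20
total   # 30
```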
Similarly, consider the citation() function that we talked about earlier. Suppose you hit enter after typing citation(. Once again, R is smart enough to realise that there must be more coming – since you need to add the ) character – so it waits. I can even hit enter several times and it will keep waiting.
That being said, it’s not often the case that R is smart enough to tell that there’s more coming. For instance, in the same way that I can’t add a space in the middle of a word, I can’t hit enter in the middle of a word either. If I hit enter after typing citat I get an error because R thinks I’m interested in an “object” called citat and can’t find it:
> citat
Error: object 'citat' not found
What about if I typed citation and hit enter? In this case, we get something very odd, something that we definitely don’t want, at least at this stage. Here’s what happens:
citation
## function (package = "base", lib.loc = NULL, auto = NULL)
## {
## dir <- system.file(package = package, lib.loc = lib.loc)
## if (dir == "")
## stop(gettextf("package '%s' not found", package), domain = NA)
BLAH BLAH BLAH
where the BLAH BLAH BLAH goes on for rather a long time, and you don’t know enough R yet to understand what all this gibberish actually means (of course, it doesn’t actually say BLAH BLAH BLAH - it says some other things we don’t understand or need to know that I’ve edited for length). This incomprehensible output can be quite intimidating to novice users, and unfortunately it’s very easy to forget to type the parentheses; so almost certainly you’ll do this by accident. Do not panic when this happens. Simply ignore the gibberish. As you become more experienced this gibberish will start to make sense, and you’ll find it quite handy to print this stuff out.17 But for now, just try to remember to add the parentheses when typing your commands using functions.
If you start doing this yourself, you’ll eventually get yourself in trouble (it happens to us all). Maybe you start typing a command, and then you realise you’ve screwed up. For example,
> citblation(
+
+
You’d probably prefer R not to try running this command, right? If you want to get out of this situation, just hit the ‘escape’ key.18 R will return you to the normal command prompt (i.e. >) without attempting to execute the botched command.
At this stage, you know how to type in basic commands, including how to use R functions. And it’s probably beginning to dawn on you that there are a lot of R functions, all of which have their own arguments. You’re probably also worried that you’re going to have to remember all of them! Thankfully, it’s not that bad. In fact, very few data analysts bother to try to remember all the commands. I want to call your attention to a couple of simple tricks that RStudio makes available to you.
One thing I want to call your attention to is the autocomplete ability in RStudio.
Let’s assume that what you want to do is to round a number. This time around, start typing the name of the function that you want (e.g., ro …), and then hit the “tab” key.
RStudio will then display a little window with two panels. On the left, there’s a list of variables and functions that start with the letters that I’ve typed shown in black text, and some grey text that tells you where that variable/function is stored. Ignore the grey text for now: it won’t make much sense to you until we’ve talked about packages in Section ??. You can see that there are quite a few things that start with the letters ro: there’s something called rock, something called round, something called round.Date and so on. The one we want is round, but if you’re typing this yourself you’ll notice that when you hit the tab key the window pops up with the top entry (i.e., rock) highlighted. You can use the up and down arrow keys to select the one that you want. Or, if none of the options looks right to you, you can hit the escape key (“esc”) or the left arrow key to make the window go away.
In our case, the thing we want is the round option, so we’ll select that. When you do this, you’ll see that the panel on the right changes. Previously, it had been telling us something about the rock data set (i.e., “Measurements on 48 rock samples…”) that is distributed as part of R. But when we select round, it displays information about the round() function, exactly as it is shown in Figure 5.1.
Figure 5.1: Start typing the name of a function or a variable, and hit the tab key. RStudio brings up a little dialogue box like this one that lets you select the one you want, and even prints out a little information about it.
This display is really handy. The very first thing it says is round(x, digits = 0): what this is telling you is that the round() function has two arguments. The first argument is called x, and it doesn’t have a default value. The second argument is digits, and it has a default value of 0. In a lot of situations, that’s all the information you need. But RStudio goes a bit further and provides some additional information about the function underneath. Sometimes that additional information is very helpful, sometimes it’s not: RStudio pulls that text from the R help documentation, and my experience is that the helpfulness of that documentation varies wildly. Anyway, if you’ve decided that round() is the function that you want to use, you can hit the right arrow or the enter key, and RStudio will finish typing the rest of the function name for you.
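Here is what those two arguments do in practice. The number 3.14159 is just an example value.

```r
round(3.14159)                   # digits keeps its default of 0: 3
round(3.14159, digits = 2)       # 3.14
round(x = 3.14159, digits = 3)   # naming both arguments: 3.142
```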
The RStudio autocomplete tool works slightly differently if you’ve already got the name of the function typed and you’re now trying to type the arguments. For instance, suppose I’ve typed round( into the console, and then I hit tab. RStudio is smart enough to recognise that I already know the name of the function that I want, because I’ve already typed it, and figures that I’m interested in the arguments of that function. Being an obedient servant, it gives us what we want. You can see this in Figure 5.2. Again, the window has two panels, and you can interact with this window in exactly the same way that you did with the window shown in Figure 5.1. On the left-hand panel, you can see a list of the argument names. On the right-hand side, it displays some information about what the selected argument does.
Figure 5.2: If you’ve typed the name of a function already along with the left parenthesis and then hit the tab key, RStudio brings up a different window to the one shown above. This one lists all the arguments to the function on the left, and information about each argument on the right.
One thing that RStudio does automatically is to keep track of your “command history”. That is, it remembers all the commands that you’ve previously typed. You can access this history in a few different ways. To see how this works, let’s type some commands in the R command line in the console.
The simplest way is to use the up and down arrow keys. If you hit the up key, the R console will show you the most recent command that you’ve typed. Hit it again, and it will show you the command before that. If you want the text on the screen to go away, hit escape.19 Using the up and down keys can be really handy if you’ve typed a long command that had one typo in it. Rather than having to type it all again from scratch, you can use the up key to bring up the command and fix it.
Another method is to start typing some text and then hit the Control key and the up arrow together (on Windows or Linux) or the Command key and the up arrow together (on a Mac). This will bring up a window showing all your recent commands that started with the same text as what you’ve currently typed. That can come in quite handy sometimes.
This seamlessly brings us to one of the other panels in RStudio: the History panel. On the upper right-hand side of the RStudio window, you’ll see a tab labelled History. Click on that, and you’ll see a list of all your recent commands displayed in that panel: it should look something like Figure 5.3. If you double click on one of the commands, it will be copied to the R console. You can achieve the same result by selecting the command you want with the mouse and then clicking the “To Console” button.
Figure 5.3: The history panel is located in the top right hand side of the RStudio window. Click on the word History and it displays this panel.
An important concept when working with R is the notion of the workspace, also referred to as the global environment. Roughly, the workspace is an abstract location in which R variables are stored.
To have something to work with, let’s add some content to the workspace.
How can you now examine the contents of the workspace, i.e., which variables does R keep in its memory? If you’re using RStudio, you will be both happy and somewhat unsurprised to hear that there’s a dedicated panel for that. You will probably find that the easiest way to do this is to use the Environment panel in the top right-hand corner. Click on that, and you’ll see a list that looks very much like the one shown in Figures 5.4 and 5.5.
Figure 5.4: The RStudio Environment panel shows you the contents of the workspace. The view shown above is the list view. To switch to the grid view, click on the menu item on the top right that currently reads list. Select grid from the dropdown menu, and then it will switch to a view like the one shown in the other workspace figure
Figure 5.5: The RStudio Environment panel shows you the contents of the workspace. Compare this grid view to the list earlier
If you want to list the contents of the workspace using the command line, there are a couple of functions that may come in handy. We will only use the ls() function.20 If you try it out, you will see something like this:
## [1] "keeper" "lover" "seeker"
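If you would like to reproduce a listing like this one, the sketch below creates three variables with the same names and then asks for the contents of the workspace. The values are placeholders I picked; only the names matter here.

```r
# Placeholder values; any values would do
seeker <- 3.1415
lover  <- 2.7183
keeper <- seeker * lover

ls()   # lists the names in alphabetical order, including "keeper" "lover" "seeker"
```

If your workspace already contained other variables, ls() will of course list those as well.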
Looking over that list of variables, it occurs to me that I really don’t need them any more. I created them originally just to make a point, but they don’t serve any useful purpose anymore, and now I want to get rid of them. I’ll show you how to do this, but first I want to warn you – there’s no “undo” option for variable removal. Once a variable is removed, it’s gone forever. But quite clearly we have no need for these variables at all, so we can safely get rid of them.
In RStudio, the easiest way to remove variables is to use the Environment panel. Assuming that you’re in grid view (i.e., Figure 5.5), check the boxes next to the variables that you want to delete, then click on the “Clear” button (the broom) at the top of the panel. When you do this, RStudio will show a dialogue box asking you to confirm that you really do want to delete the variables. It’s always worth checking that you really do, because as RStudio is at pains to point out, you can’t undo this. Once a variable is deleted, it’s gone. In any case, if you click “yes”, that variable will disappear from the workspace: it will no longer appear in the environment panel, and it won’t show up when you use the ls() command. Removing all variables can be done by clicking the broom (in the List view) or by clicking the broom after selecting all variables (in the Grid view), which can be easily done by checking the box next to “Name”.
If you want to remove variables using R commands, you will be happy to meet the remove function rm(). The simplest way to use rm() is just to type in a (comma separated) list of all the variables you want to remove. Let’s say I want to get rid of seeker and lover, but I would like to keep keeper. To do this, all I have to do is type:
There’s no visible output, but if I now inspect the workspace
## [1] "keeper"
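Put together, the removal described above could look like the sketch below. The values given to the three variables are placeholders; they are recreated here only so that the example runs on its own.

```r
# Recreate the three variables (placeholder values)
seeker <- 3.1415
lover  <- 2.7183
keeper <- seeker * lover

rm(seeker, lover)   # remove two of them; there is no output
ls()                # "keeper" is still listed, the other two are gone
```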
I see that there’s only the keeper variable left. As you can see, rm() can be very handy for keeping the workspace tidy. If you want to clear the entire workspace, the following command can be used:
This is a somewhat mysterious command. If you ever said you hated statistics because it destroys all of the mystery, this one’s for you.
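A common way to clear everything is rm(list = ls()): ls() produces the names of all variables, and rm() removes whatever is in that list. Since there is no undo, the sketch below wraps the command in local(), which is my own precaution and not part of the text: it makes the command act on a throwaway environment instead of your real workspace. At the console you would type just the rm(list = ls()) line.

```r
left_over <- local({
  keeper <- 1       # placeholder variables inside the throwaway environment
  lover  <- 2
  rm(list = ls())   # remove every variable that ls() can see here
  ls()              # list what is left
})
left_over           # character(0): nothing is left
```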
I have discussed earlier that a big secret of being successful at programming, or at life more generally, is being able to ask for help. You might already have seen the Help panel on your left. It has a nifty search box, which will bring you to R’s built-in help documentation.
We already know, from Section XXX, which commands to type if we desire help. For example, if we want to look at the help documentation for the load() function,
you already know you could type either of the following:
When you do that, R goes looking for the help file for the “load” topic. If it finds one, RStudio takes it and displays it in the, wait for it, Help panel.
Also if you do a fuzzy search for a help topic, you will be directed to the Help panel.
This will bring up a list of possible topics in the Help panel.
Remember I told you before what packages are and how important they are? If your answer is anything else than “Yes, of course, I am on top of this material”, you might want to revisit Section XXX.
Dealing with packages can be done in two ways: using the command line of the console, or —yes, I knew you would have guessed it!— using yet another panel, the —again, no surprises here— Packages panel.
Figure 5.6: The Packages panel.
Right, let’s get started. The first thing you need to do is look in the lower right-hand panel in RStudio. You’ll see a tab labelled “Packages”. Click on the tab, and you’ll see a list of packages that looks something like Figure 5.6. Every row in the panel corresponds to a different package, and every column is a useful piece of information about that package. Going from left to right, here’s what each column is telling you:
Using the RStudio tools is, again, dead simple. In the top left-hand corner of the Packages panel (Figure 5.6) you’ll see a button called “Install”. If you click on that, it will bring up a window like the one shown in Figure 5.7.
Figure 5.7: The package installation dialog box in RStudio
There are a few different buttons and boxes you can play with. Ignore most of them. Just go to the line that says “Packages” and start typing the name of the package that you want. As you type, you’ll see a dropdown menu appear (Figure 5.8), listing names of packages that start with the letters that you’ve typed so far.
Figure 5.8: When you start typing, you’ll see a dropdown menu suggest a list of possible packages that you might want to install
You can select from this list, or just keep typing. Either way, once you’ve got the package name that you want, click on the install button at the bottom of the window. R then goes off to the internet, has a conversation with CRAN, downloads some stuff, and installs it on your computer. You probably don’t care about all the details of R’s little adventure on the web, but R is rather chatty, so it reports a bunch of gibberish that you really aren’t all that interested in:
trying URL 'http://cran.rstudio.com/bin/macosx/contrib/3.0/psych_1.4.1.tgz'
Content type 'application/x-gzip' length 2737873 bytes (2.6 Mb)
opened URL
==================================================
downloaded 2.6 Mb
The downloaded binary packages are in
/var/folders/cl/thhsyrz53g73q0w1kb5z3l_80000gn/T//RtmpmQ9VT3/downloaded_packages
Despite the long and tedious response, all that really means is “I’ve installed the psych package”. I find it best to humour the talkative little automaton. I don’t actually read any of this garbage, I just politely say “thanks” and go back to whatever I was doing.
Remember that a package must be loaded before it can be used. That seems straightforward enough, so let’s try loading packages. For this example, I’ll use the foreign package. The foreign package is a collection of tools that are very handy when R needs to interact with files that are produced by other software packages (e.g., SPSS). It comes bundled with R, so it’s one of the ones that you have installed already, but it won’t be one of the ones loaded. Inside the foreign package is a function called read.spss(). It’s a handy little function that you can use to import an SPSS data file into R, so let’s pretend we want to use it. Currently, the foreign package isn’t loaded, so if I ask R to tell me if it knows about a function called read.spss() it tells me that there’s no such thing…
## Error in read.spss(): could not find function "read.spss"
Now let’s load the package. In RStudio, the process is dead simple: go to the Packages tab, find the entry for the foreign package, and check the box on the left-hand side. So you can use the RStudio package panel to do all your package loading for you. The moment that you do this, you’ll see a command appear in the R console.
Oh, I suppose we should check to see if our attempt to load the package actually worked. Let’s see if R now knows about the existence of the read.spss() function…
## Error in grep("^(http|ftp|https)://", file): argument "file" is missing, with no default
It complains that we didn’t provide the name of the file we’d like to load (and it has every right to!), but at least it no longer complains that the function does not exist. So we must have done something right.
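On the command line, loading a package is done with library(). The sketch below assumes that the foreign package is installed, which should be the case since it comes bundled with R. The command that RStudio generates may contain extra arguments; this is the bare-bones version.

```r
library(foreign)      # load the package

exists("read.spss")   # TRUE: R now knows about the function
```

Using exists() is just a polite way of checking whether R can find the function without actually calling it.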
Every package that you have loaded is another environment. Just like we can look up the contents of the workspace aka the global environment, we can look up the contents of these other environments. In fact, you can actually use the Environment panel in RStudio to browse any of your loaded packages (just click on the text that says “Global Environment” and you’ll see a dropdown menu like the one shown in Figure ??).
The key thing to understand then is that you can access any of the R variables and functions that are stored in one of these environments, precisely because those are the environments that you have loaded!21
It should not come as a huge surprise that you could use the ls() function for this as well, if you are keen on using the command line. You just have to be a bit more explicit in your command. If I wanted to find out what is in the package:foreign environment (i.e., the environment into which the contents of the foreign package have been loaded), here’s what I’d get:
## [1] "data.restore" "lookup.xport" "read.arff" "read.dbf"
## [5] "read.dta" "read.epiinfo" "read.mtp" "read.octave"
## [9] "read.S" "read.spss" "read.ssd" "read.systat"
## [13] "read.xport" "write.arff" "write.dbf" "write.dta"
## [17] "write.foreign"
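A minimal sketch of how such a listing can be requested, assuming the foreign package is loaded: the name of the environment is handed to ls() as a character string. The call to search() is an extra I added to show which environments are currently available.

```r
# Make sure the package is loaded, otherwise the environment does not exist
library(foreign)

# List every environment R currently searches, in order
search()

# List the contents of one specific environment
ls("package:foreign")
```

The exact set of functions you see may differ a little depending on your version of the foreign package.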
Sometimes, especially after a long session of working with R, you find yourself wanting to get rid of some of those packages that you’ve loaded. The RStudio package panel makes this exactly as easy as loading the package in the first place. Find the entry corresponding to the package you want to unload and uncheck the box.
And the package is unloaded. We can verify this by seeing if the read.spss() function still exists:
## Error in read.spss(): could not find function "read.spss"
Nope. Definitely gone.
The following bit is just for completeness. You don’t need to know this command.
When you use the Package panel to unload the foreign package, you might have seen this command appear on your screen:
There’s nothing more to say here.
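For completeness only, here is a sketch of what unloading looks like as a command. I am assuming the command RStudio generates is a detach() call of roughly this form; the exact arguments on your machine may differ.

```r
# Load the package first so that there is something to unload
library(foreign)

# Remove the package from the search path and unload it
detach("package:foreign", unload = TRUE)

# R can no longer find the function
exists("read.spss")
```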
Every now and then the authors of packages release updated versions. The updated versions often add new functionality, fix bugs, and so on. It’s generally a good idea to update your packages periodically. In the packages panel, click on the “Update” button. This will bring up a window that looks like the one shown in Figure 5.9. In this window, each row refers to a package that needs to be updated. You can tell R which updates you want to install by checking the boxes on the left. If you’re feeling lazy and just want to update everything, click the “Select All” button, and then click the “Install Updates” button. R then prints out a lot of garbage on the screen, individually downloading and installing all the new packages. This might take a while to complete depending on how good your internet connection is. Go make a cup of coffee. Come back, and all will be well.
There’s an update.packages() function that you can use to do this, but it’s probably easier to stick with the RStudio tool, so I’m not gonna bother to explain.
Figure 5.9: The RStudio dialog box for updating packages
Something you should be aware of is this. Sometimes you’ll attempt to load a package, and R will print out a message telling you that something or other has been “masked”. This will be confusing to you if I don’t explain it now, and it actually ties very closely to the whole reason why R forces you to load packages separately from installing them. Here’s an example. 22
Two of the packages that you might encounter in your R career are called car and psych. The car package is short for “Companion to Applied Regression” (which is a really great book, I’ll add), and it has a lot of tools that I’m quite fond of. The car package was written by a guy called John Fox, who has written a lot of great statistical tools for social science applications. The psych package was written by William Revelle, and it has a lot of functions that are very useful for psychologists in particular, especially in regards to psychometric techniques. For the most part, car and psych are quite unrelated to each other. They do different things, so not surprisingly almost all of the function names are different. But… there’s one exception to that. The car package and the psych package both contain a function called logit().23 This creates a naming conflict. If I load both packages into R, an ambiguity is created. If the user types in logit(100), should R use the logit() function in the car package, or the one in the psych package? The answer is: R uses whichever package you loaded most recently, and it tells you this very explicitly. Here’s what happens when I load the car package, and then afterwards load the psych package:
## Warning: package 'car' was built under R version 4.2.3
## Loading required package: carData
## Warning: package 'psych' was built under R version 4.2.3
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
The output here is telling you that the logit object (i.e., function) in the car package is no longer accessible to you. It’s been hidden (or “masked”) from you by the one in the psych package. You can get R to use the one from the car package by using car::logit() as your command rather than logit(), since the car:: part tells R explicitly which package to use. 24
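You can see the same mechanism at work without installing car or psych. In the sketch below (my own illustration, not taken from either package) I create a function called median() in the workspace. Because R looks in the workspace before it looks in any package, my version masks the median() function from the stats package, and the :: prefix gets me the original back.

```r
# Define our own median() in the workspace; it masks the one in the stats package
median <- function(x) "this is my own median()"

median(c(1, 5, 9))          # R finds our version first
stats::median(c(1, 5, 9))   # the stats:: prefix picks the packaged version explicitly

# Tidy up: once our version is removed, the stats version is visible again
rm(median)
median(c(1, 5, 9))
```

The package::function() form works the same way for car::logit() and psych::logit(), provided those packages are installed.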
At this point, we haven’t yet discussed how to make plots in R. We will do so later, in Chapter XXX, but to give an idea how this works in RStudio, type the following in your Console:
As you will see, it produces a (vastly uninteresting) plot in yet another pane, appropriately called the Plots panel. How do I save the picture? This is another one of those situations where the easiest thing to do is to use the RStudio tools. The easiest way to save your image is to click on the “Export” button in the Plots panel. When you do that you’ll see a menu that contains the options “Save as PDF”, “Save as Image” and “Copy to Clipboard”. All of these work. They will bring up dialogue boxes that give you a few options that you can play with, but besides that, it’s pretty simple. This works pretty nicely for most situations.
Saving plots using R commands can be somewhat annoying, to say the least. I do not recommend it. You can thank me later.
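For the curious, here is a minimal sketch of what the command route looks like, so that you can judge the annoyance for yourself. The plot is a simple stand-in, and the file name and temporary folder are assumptions of mine; you would use a path of your own.

```r
# A temporary file is used as a stand-in; replace it with your own path
plot_file <- file.path(tempdir(), "myplot.pdf")

pdf(plot_file)   # open a PDF graphics device that writes to plot_file
plot(1:10)       # draw the plot (it goes into the file, not the Plots panel)
dev.off()        # close the device so that the file is actually written

file.exists(plot_file)
```

Forgetting dev.off() is the classic mistake here: the file stays incomplete until the device is closed.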
Let’s now turn to the crucial but slightly annoying question of how you can load data from a range of different sources. As is often the case with R, the basic answer is simple but there are quite a few nuts and bolts to it. However, for the purposes of this course, we will stick to the basic answer. As an example, we will use a file called AnnaF.csv.
Basically, there are two ways in which you can import data using RStudio (for now at least; we will encounter a third one in Section 5.10).
One is to use the Files panel to go to the folder that contains the to-be-read file, left-click on the data file, and then click Import Dataset. You should see something like in Figure 5.11.
Figure 5.11: The dialog box that shows up when you are importing data using the Files panel.
Note that the First Row as Names checkbox has been checked, because (there is no easy way to say this) the first row contains the names of the variables in this particular data set. If you want to import another data set, this might or might not be the case, so it is important to make sure by checking it yourself! Also, make sure to select the appropriate Delimiter (i.e., the character indicating where a new column starts). For this data set, columns are separated by a semicolon (puntkomma in Dutch), but of course things might be different for different data sets. If you are unsure which Delimiter to choose, try out a few and see what each does in the Data Preview. Once all choices are made, press the Import button. Note that, when everything worked, a new variable is now created in your R workspace. R (probably) has also automatically used the View() function to show you the data set in R.
Another is to go to the Environment panel and click on Import Dataset. You will see that there are several possibilities, depending on the type of your file. One slightly counterintuitive thing to remember is that if you want to import a csv data set, you should select the From Text (base) option, even though you are not trying to import a text file! (Remember that everybody can be a little weird, sometimes.) Browse to wherever you stored your file, and once you have located it, click on the file. You should see something like in Figure 5.12.
Figure 5.12: The dialog box that shows up when you are importing data using the Environment panel.
Annoyingly (everybody can be a bit annoying, sometimes), indicating whether or not the first row contains names (in this case: yes!) and how the columns are indicated (in this case: by a semicolon!) should be done slightly differently compared to the first approach: by selecting Yes (in this case) under Heading, and by selecting Semicolon (in this case) under Separator, respectively. Having done that, just press the Import button, sit back, relax and enjoy.
Whichever way you choose to do the import, R has suggested a name to call the variable that was the result of reading in your data. Of course you can easily overwrite that, if desired, for example, like this:
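A minimal sketch of such a renaming. I am assuming here that R suggested the name AnnaF (after the file name), and I use a tiny made-up data set as a stand-in for the imported data; mydata is just an example of a name you might prefer.

```r
# Stand-in for the variable created by Import Dataset (made-up values)
AnnaF <- data.frame(id = 1:3, score = c(7, 9, 8))

# Store the same data under a name of your choice
mydata <- AnnaF

# Optionally remove the old name from the workspace
rm(AnnaF)

mydata
```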
Note that there are many more data formats beyond csv that you can import in R. I am not gonna bother explaining all of them, since most of those follow much the same route as described for csv files. If you ever need to import a data set and you find yourself in trouble, look around for help, for example using the resources listed in Section XXX.
Also, I want to already spill the beans that whatever is imported that way is a data frame, a sentence which is a taunting mystery to you right now but will be demystified in Section XXX.
As with lots of the other tasks, importing data can also be done in the R console, but this is outside the scope of this course. As you will see, if you import data using the steps described above, the relevant R commands will turn up in the console. If you need those (for example, to store in a script), you can of course copy them, but I do not recommend importing data using R commands, unless you are importing data with exotic formats, and that is beyond the scope of this course.
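Just so the commands in the console do not look like magic, here is a rough sketch of what an import command for a semicolon-separated file with a header row can look like. The file is a small made-up stand-in that I create on the spot (not the real AnnaF.csv), and the command RStudio generates for you may well look different.

```r
# Create a small semicolon-separated file as a stand-in for a real data file
# (the column names and values are made up for illustration)
csv_file <- file.path(tempdir(), "example.csv")
writeLines(c("name;age", "Anna;23", "Ben;31"), csv_file)

# header = TRUE: the first row holds the variable names
# sep = ";":     columns are separated by semicolons
mydata <- read.csv(csv_file, header = TRUE, sep = ";")

mydata
```

The two arguments correspond to the choices you make in the dialog boxes: the heading (first row as names) and the separator (delimiter).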
When you start analysing real-world data sets, you will rapidly find yourself needing to write something called scripts. Computer programs come in quite a few different forms: the kind of program that we’re most interested in from the perspective of everyday data analysis using R is known as a script. Script files are those with a .R file extension. These aren’t data files at all; rather, they’re used to save a collection of commands that you want R to execute later. It’s just a glorified text file in which you write out all the commands that you want R to run. You can write your script using whatever software you like.
In real-world data analysis, writing scripts is a key skill – and as you become familiar with R you’ll probably find that most of what you do involves scripting rather than typing commands at the R prompt. The idea behind a script is that, instead of typing your commands into the R console one at a time, you write them all in a file. Not only is it a way to store the commands you need, it also makes running the code easier. Once you’ve finished writing them and saved the file, you can get R to execute all the commands in your file at once. In a moment I’ll show you exactly how this is done, but first I’d better explain why you should care.
To understand why scripts are so very useful, it may be helpful to consider the drawbacks to typing commands directly at the command prompt. The approach that we’ve been adopting so far, in which you type commands one at a time, and R sits there patiently in between commands, is referred to as the interactive style. Doing your data analysis this way is rather like having a conversation … a very annoying conversation between you and your data set, in which you and the data aren’t directly speaking to each other, and so you have to rely on R to pass messages back and forth. This approach makes a lot of sense when you’re just trying out a few ideas: maybe you’re trying to figure out what analyses are sensible for your data, or maybe you’re just trying to remember how the various R functions work, so you’re just typing in a few commands until you get the one you want. In other words, the interactive style is very useful as a tool for exploring your data. However, it has a number of drawbacks:
It’s hard to save your work effectively. You can save the workspace so that later on you can load any variables you created. You can save your plots as images. And you can even save the history or copy the contents of the R console to a file. Taken together, all these things let you create a reasonably decent record of what you did. But it does leave a lot to be desired. It seems like you ought to be able to save a single file that R could use (in conjunction with your raw data files) and reproduce everything (or at least, everything interesting) that you did during your data analysis.
It’s annoying to have to go back to the beginning when you make a mistake. Suppose you’ve just spent the last two hours typing in commands. Over the course of this time you’ve created lots of new variables and run lots of analyses. Then suddenly you realise that there was a nasty typo in the first command you typed, so all of your later numbers are wrong. Now you have to fix that first command, and then spend another hour or so combing through the R history to try and recreate what you did.
You can’t leave notes for yourself. Sure, you can scribble down some notes on a piece of paper, or even save a Word document that summarises what you did. But what you really want to be able to do is write down an English translation of your R commands, preferably right “next to” the commands themselves. That way, you can look back at what you’ve done and actually remember what you were doing. In the simple exercises we’ve engaged in so far, it hasn’t been all that hard to remember what you were doing or why you were doing it, but only because everything we’ve done could be done using only a few commands, and you’ve never been asked to reproduce your analysis six months after you originally did it! When your data analysis starts involving hundreds of variables and requires quite complicated commands to work, then you really, really need to leave yourself some notes to explain your analysis to, well, yourself.
It’s nearly impossible to reuse your analyses later, or adapt them to similar problems. Suppose that, sometime in January, you are handed a difficult data analysis problem. After working on it for ages, you figure out some really clever tricks that can be used to solve it. Then, in September, you get handed a really similar problem. You can sort of remember what you did, but not very well. You’d like to have a clean record of what you did last time, how you did it, and why you did it the way you did. Something like that would really help you solve this new problem.
It’s hard to do anything except the basics. There’s a nasty side effect of these problems. Typos are inevitable. Even the best data analyst in the world makes a lot of mistakes. So the chance that you’ll be able to string together dozens of correct R commands in a row is very small. So unless you have some way around this problem, you’ll never really be able to do anything other than simple analyses.
It’s difficult to share your work with other people. Because you don’t have this nice clean record of what R commands were involved in your analysis, it’s not easy to share your work with other people. Sure, you can send them all the data files you’ve saved, and your history and console logs, and even the little notes you wrote to yourself, but odds are pretty good that no-one else will really understand what’s going on (trust me on this: I’ve been handed lots of random bits of output from people who’ve been analysing their data, and it makes very little sense unless you’ve got the original person who did the work sitting right next to you explaining what you’re looking at).
Ideally, what you’d like to be able to do is something like this… Suppose you start out with a data set myrawdata.csv. What you want is a single document – let’s call it mydataanalysis.R – that stores all of the commands that you’ve used in order to do your data analysis. Kind of similar to the R history but much more focused. It would only include the commands that you want to keep for later. Then, later on, instead of typing in all those commands again, you’d just tell R to run all of the commands that are stored in mydataanalysis.R. Also, in order to help you make sense of all those commands, what you’d want is the ability to add some notes or comments within the file so that anyone reading the document for themselves would be able to understand what each of the commands actually does. But these comments wouldn’t get in the way: when you try to get R to run mydataanalysis.R it would be smart enough to recognise that these comments are for the benefit of humans, and so it would ignore them. Later on, you could tweak a few of the commands inside the file (maybe in a new file called mynewdatanalaysis.R) so that you can adapt an old analysis to be able to handle a new problem. And you could email your friends and colleagues a copy of this file so that they can reproduce your analysis themselves. In other words, what you want is a script.
(There are better ways of keeping track of the lifecycle of a script and better ways of sharing scripts as well, but let’s not go there for now.)
Figure 5.13: A screenshot showing the hello.R script if you open it using the default text editor (TextEdit) on a Mac. Using a simple text editor like TextEdit on a Mac or Notepad on Windows isn’t actually the best way to write your scripts, but it is the simplest. More to the point, it highlights the fact that a script really is just an ordinary text file.
Okay then. Since scripts are so terribly awesome, let’s write one. To create a script file in RStudio, go to the “File” menu, select the “New File” option, and then click on “R script”. This will open a new window within the “Source” panel. There you can type the commands you want (or code, as it is generally called when you’re typing the commands into a script file) and save it when you’re done.
Let’s try using x <- "hello world" and print(x) as our commands. Then save the document as hello.R, for example by typing CTRL+S or by going to the “File” menu and choosing “Save”. When it asks you where to save the file, save it to whatever folder you want, but do remember where you stored it. And just like that, you’ve written your first R program. It really is that simple. That’s all there is to it!
You should be looking at something like Figure 5.14. As you can see (if you’re looking at this book in colour) the character string “hello world” is highlighted in green. The nice thing about using RStudio to do this is that it automatically changes the colour of the text to indicate which parts of the code are comments and which parts are actual R commands. These colours are called syntax highlighting, but they’re not actually part of the file – it’s just RStudio trying to be helpful. It also adds line numbers, to facilitate communication, thank you very much!
Just like with any other file, it is important to save your work. If you made unsaved changes to your script, R will make it clear in the name of your script. On my machine, for example, the name of a script with unsaved changes is shown in red and followed by a *. Once I save the changes, it turns black again and the * disappears.
The simple script that I’ve shown above contains two commands. The first one creates a variable x and the second one prints it on screen. How can we make R execute these commands? In other words, how do we run the script?
There are several approaches, really.
I often find myself running a script line by line. To do so, just put your cursor in front of the line (or any other place in that line, for that matter) you want to run, and hit CTRL+ENTER or CMD + ENTER if you are a Mac user. R then transfers these commands to the Console and executes them. You can also select more than one line, and have these lines be executed by hitting CTRL+ENTER or CMD + ENTER (for Macs).
The second approach is running all commands in the script at once. The first thing to do to make this work is to make sure that the hello.R file has been saved to your working directory so that R can find it. There are two ways to go about it: either put the file in what is currently your working directory, or keep the file where it is and change your working directory to wherever you have put that file. Once the file is in the working directory (by whatever means), you can run the script using the following command in the Console:
When you type this command, R opens up the script file: it then reads each command in the file in the same order that they appear in the file, and executes those commands in that order. Alternatively, you can do as follows.
Notice in the top right-hand corner of Figure 5.14 there’s a little button that reads “Source”? If you click on that, RStudio will construct the relevant source() command for you, and send it straight to the R console. So you don’t even have to type in the source() command, which actually I think is a great thing because it really bugs me having to type all those extra keystrokes every time I want to run my script. 25
After we have run the script (by whatever approach), things happened. If we inspect the workspace using a command like ls(), we discover that R has created the new variable x within the workspace, and not surprisingly x is a character string containing the text "hello world".
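Here is a self-contained sketch of the whole round trip using the source() function. To make it runnable anywhere, I write the two-line script to a temporary folder and hand source() the full path; that folder is my own assumption. If hello.R sits in your working directory, source("hello.R") is all you need.

```r
# Write the two-line script to a temporary folder
# (a stand-in for wherever you saved hello.R)
script_file <- file.path(tempdir(), "hello.R")
writeLines(c('x <- "hello world"', 'print(x)'), script_file)

# Run every command in the script, in the order in which they appear
source(script_file)

# The variable created by the script now lives in the workspace
ls()
x
```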
Figure 5.14: A screenshot showing the hello.R script open in RStudio. Assuming that you’re looking at this document in colour, you’ll notice that the hello world text is shown in green. This isn’t something that you do yourself: that’s RStudio being helpful. Because the text editor in RStudio knows something about how R commands work, it will highlight different parts of your script in different colours. This is useful, but it’s not actually part of the script itself.
Now, replace print(x) by just x and source hello.R again. In the Console, typing x on its own shows the contents of x, but this does not work from within a sourced script: nothing is printed.
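If you want to convince yourself, the sketch below compares the two versions of the script. The file name and the use of capture.output() (which collects whatever would have been printed) are my own choices for the demonstration.

```r
script_file <- file.path(tempdir(), "hello_quiet.R")

# Version 1: the last line is just x
writeLines(c('x <- "hello world"', 'x'), script_file)
quiet <- capture.output(source(script_file))
quiet   # empty: sourcing the script printed nothing

# Version 2: the last line is print(x)
writeLines(c('x <- "hello world"', 'print(x)'), script_file)
loud <- capture.output(source(script_file))
loud    # contains the printed line
```

So inside a script, use print() whenever you want something to appear on screen. Calling source() with echo = TRUE is another way to get the output shown.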
When writing up your data analysis as a script, one thing that is generally a good idea is to include a lot of comments in the code. That way, if someone else tries to read it (or if you come back to it several days, weeks, months or years later) they can figure out what’s going on. As a beginner, I think it’s especially useful to comment thoroughly, partly because it gets you into the habit of commenting the code, and partly because the simple act of typing in an explanation of what the code does will help you keep it clear in your own mind what you’re trying to achieve.
You can use comments at the beginning of the script, so that the script announces its behaviour. The first few lines of the script could, for example, tell about what the script is actually doing behind the scenes. It’s usually a pretty good idea to do this.
We’ve seen commenting before, so you might or might not remember that everything after a # sign will not be interpreted by R.
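As a small illustration, here is the hello.R script again with comments added. The wording of the comments is of course just an example.

```r
# hello.R
# Purpose: store a greeting in a variable and show it on screen

x <- "hello world"   # create the variable x
print(x)             # print its contents

# print("this line is commented out, so R never runs it")
```

The first lines announce what the script does, the comments at the end of a line explain individual commands, and the last line shows how a # can be used to switch a command off temporarily.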
At this point, you’ve learned the basics of scripting. You are now officially allowed to say that you can program in R, though you probably shouldn’t say it too loudly. There’s a lot more to learn, but nevertheless, if you can write scripts like these then what you are doing is, in fact, basic programming.
As with any software tool, there are many ways in which you can adjust RStudio to your own needs and likes. Most of that can be done by choosing Tools from the RStudio menu and choosing Global Options. There is a lot you can do (like changing the font size under Appearance) and I will let you find out for yourself.
There is one exception, though, because it is really cool (but also just fyi). If you go to Tools/Global Options, click Code, open the Display tab and check “Rainbow parentheses”, nothing less than sheer rainbow joy happens after you click OK. Parentheses (), brackets [], and braces {} will now be color-matched by nesting level, which makes complex code way easier to read. It is strongly recommended, but it is entirely up to you to decide how much rainbow fun you want in your life.
Figure 5.15: The dialog box that shows up when you try to close RStudio.
There’s one last thing I should cover in this chapter: how to quit R. When I say this, I’m not trying to imply that R is some kind of pathological addiction and that you need to call the R QuitLine or wear patches to control the cravings (although you certainly might argue that there’s something seriously pathological about being addicted to R). I just mean how to exit the program. Assuming you’re running R in the usual way (i.e., through RStudio or the default GUI on a Windows or Mac computer), then you can just shut down the application in the normal way. However, R also has a function, called q() that you can use to quit, which is pretty handy if you’re running R in a terminal window.26
Regardless of what method you use to quit R, when you do so for the first time R will probably ask you if you want to save the “workspace image”. If you’re using RStudio, you’ll see a dialogue box that looks like the one shown in Figure 5.15. If you’re using a text-based interface you’ll see this:
The y/n/c part here is short for “yes / no / cancel”. Type y if you want to save, n if you don’t, and c if you’ve changed your mind and you don’t want to quit after all.
What does this actually mean? What’s going on is that R wants to know if you want to save all those variables that you’ve been creating, so that you can use them later. This sounds like a great idea, so it’s really tempting to type y or click the “Save” button. To be honest though, I very rarely do this, and it kind of annoys me a little bit… what R is really asking is if you want it to store these variables in a “default” data file. The catch (or advantage, if you wish) is that the data file will automatically reload for you next time you open R, which is often something you won’t need. And quite frankly, if I’d wanted to save the variables, then I’d have already saved them before trying to quit. Not only that, I’d have saved them to a location of my choice, so that I can find it again later. So I personally never bother with this, and I see little reason to type y or click the “Save” button.
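A minimal sketch of what saving to a location of your choice can look like. The variable, the file name and the temporary folder are all made up for the example; you would pick your own.

```r
sales <- c(10, 20, 30)   # a made-up variable we want to keep

# Save it to a file of our own choosing (a temporary folder is used as a stand-in)
data_file <- file.path(tempdir(), "my_variables.RData")
save(sales, file = data_file)

rm(sales)         # pretend we have restarted R: the variable is gone
load(data_file)   # bring it back from the file
sales

# When running R in a terminal, q(save = "no") quits without asking
# (left as a comment here so that it does not end your session)
```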
The next bit is quite a useful thing to know, but you shouldn’t study it. You can change the settings so that R never asks you again whether you want to save stuff. You can do this in RStudio really easily: use the menu system to find the RStudio options; the dialogue box that comes up will give you an option to tell R never to whine about this again (see Figure 5.16). On a Mac, you can open this window by going to the “RStudio” menu and selecting “Preferences”. On a Windows machine, you go to the “Tools” menu and select “Global Options”. Under the “General” tab you’ll see an option that reads “Save workspace to .RData on exit”. By default, this is set to “ask”. If you want R to stop asking, change it to “never”. Every time I install R on a new machine, this is one of the first things I do.
Figure 5.16: The options window in RStudio. On a Mac, you can open this window by going to the RStudio menu and selecting Preferences. On a Windows machine you go to the Tools menu and select Global Options.
You’ve seen vectors all right, but that is just the tip of the Rberg. In this chapter, we encounter matrices, factors, data frames, lists and formulas. But first, we start with …
In the examples that we’ve seen so far, most of my variable names (such as sales and
revenue) have just been English-language words written using lowercase
letters. However, R allows a lot more flexibility when it comes to
naming your variables, as the following list of rules illustrates:
Variable names can only use the upper case alphabetic characters A-Z as well as the lower case characters a-z. You can also include numeric characters 0-9 in the variable name, as well as the period . or underscore _ character. In other words, you can use SaL.e_s as a variable name (though I can’t think why you would want to), but you can’t use Sales?.
Variable names cannot include spaces: my sales is not a valid name, but my.sales is.
Variable names are case sensitive: Sales and sales are different variable names.
Variable names must start with a letter or a period. You can’t use something like _sales or 1sales as a variable name. You can use .sales as a variable name if you want, but it’s not usually a good idea. By convention, variables starting with a . are used for special purposes, so you should avoid doing so.
Variable names cannot be one of the reserved keywords. These are names that R keeps for its own use: if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, and finally, NA_character_. Don’t feel especially obliged to memorise these: if you make a mistake and try to use one of the keywords as a variable name, R will complain about it like the whiny little automaton it is.
In addition to those rules that R enforces, there are some informal conventions that people tend to follow when naming variables. One of them you’ve already seen: don’t use variable names that start with a period. But there are several others. You aren’t obliged to follow these conventions, and there are many situations in which it’s advisable to ignore them, but it’s generally a good idea to follow them when you can:
Use informative variable names. As a general rule, using meaningful names like sales and revenue is preferred over arbitrary ones like variable1 and variable2. Otherwise, it’s very hard to remember what the contents of different variables are, and it becomes hard to understand what your commands actually do.
Use short variable names. Typing is a pain, so a name like sales is preferred over a name like sales.for.this.book.that.you.are.reading. Obviously, there’s a bit of a tension between using informative names (which tend to be long) and using short names (which tend to be meaningless), so use a bit of common sense when trading off these two conventions.
Use one of the conventional naming styles for multi-word variable names. Suppose you want a variable that stores “my new salary”. Spaces are not allowed, so you could separate the words using periods, which gives you my.new.salary as the variable name. Alternatively, you could separate words using underscores, as in my_new_salary. Finally, you could use capital letters at the beginning of each word (except the first one), which gives you myNewSalary as the variable name. I don’t think there’s any strong reason to prefer one over the other, but it’s always nice to be consistent.
The first thing I want to mention is the set of “special” values that you might see R produce. Most likely you’ll see them in situations where you were expecting a number, but there are quite a few other ways you can encounter them. These values are Inf, NaN, NA and NULL. They can crop up in various different places, and so it’s important to understand what they mean.
Infinity (Inf). The easiest of the special values to explain is Inf, since it corresponds to a value that is infinitely large. You can also have -Inf. The easiest way to get Inf is to divide a positive number by 0.
Exercise: Try it yourself:
In most real-world data analysis situations, if you’re ending up with infinite numbers in your data, then something has gone awry. Hopefully, you’ll never have to see them.
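In case the exercise box is not available to you, these are the kind of commands you could try in the Console. The particular numbers are arbitrary choices of mine.

```r
1 / 0                # a positive number divided by 0 gives Inf
-1 / 0               # a negative number divided by 0 gives -Inf
is.infinite(1 / 0)   # TRUE: R can tell you whether a value is infinite
Inf + 1              # still Inf: adding to infinity changes nothing
```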
Not a Number (NaN). The special value of NaN is short for “not a number”, and it’s basically a reserved keyword that means “there isn’t a mathematically defined number for this”. If you can remember your high school maths, remember that it is conventional to say that \(0/0\) doesn’t have a proper answer: mathematicians would say that \(0/0\) is undefined.
Exercise: Check if R says that it’s not a number:
Nevertheless, it’s still treated as a “numeric” value. To oversimplify,
NaN corresponds to cases where you asked a proper numerical question
that genuinely has no meaningful answer.
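A short sketch of what you could type to check both claims: that \(0/0\) gives NaN, and that NaN still counts as numeric. The is.nan() and is.numeric() checks are my own additions.

```r
0 / 0             # NaN: there is no mathematically defined answer
is.nan(0 / 0)     # TRUE: R recognises the result as "not a number"
is.numeric(NaN)   # TRUE: and yet NaN is still treated as a numeric value
```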
NA). NA indicates that the value that is
“supposed” to be stored here is missing. To understand what this
means, it helps to recognise that the NA value is something that
you’re most likely to see when analysing data from real-world
experiments. Sometimes you get equipment failures, or you lose some
of the data, or whatever. The point is that some of the information
that you were “expecting” to get from your study is just plain
missing. Note the difference between NA and NaN. For NaN, we
really do know what’s supposed to be stored; it’s just that it
happens to correspond to something like \(0/0\) that doesn’t make any
sense at all. In contrast, NA indicates that we actually don’t
know what was supposed to be there. The information is missing.
Here’s an example:
## [1] 1 4 2
## [1] 1 4 2 NA NA NA NA 9
R dutifully adds 9 as the 8th element of x. But since we have only told R what the first three elements and the 8th element are, it kindly reminds us that it has no idea what we have in mind for elements 4 to 7.
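Output like this can be produced as follows. This is a sketch: the variable name x comes from the text, but the exact commands are my reconstruction from the output.

```r
x <- c(1, 4, 2)   # a vector with three elements
x
x[8] <- 9         # assign to the 8th element, skipping elements 4 to 7
x                 # the skipped elements are filled in with NA
is.na(x)          # TRUE exactly where a value is missing
```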
No value (NULL). The NULL value takes this “absence” concept
even further. It basically asserts that the variable genuinely has
no value whatsoever. This is quite different from both NaN and
NA. For NaN we actually know what the value is because it’s
something insane like \(0/0\). For NA, we believe that there is
supposed to be a value “out there”, but a dog ate our homework and
so we don’t quite know what it is. But for NULL we strongly
believe that there is no value at all.
How does R treat these special values? Let’s see.
The next topic is the issue of missing data. Real data sets very frequently turn out to have missing values: perhaps someone forgot to fill in a particular survey question, for instance. Missing data can be the source of a lot of tricky issues, most of which I’m going to gloss over. However, at a minimum, you need to understand the basics of handling missing data in R.
Let’s focus on the simplest case, in which you’re
trying to work with a single variable which has missing data. In R, this means that there will be NA values in your data vector. Let’s create a variable like that:
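A sketch of such a variable. The name partial is the one used in the exercises in this section; the values are an assumption on my part, chosen to agree with the results discussed here (four elements, one of them missing, and 10, 20 and 30 as the known values).

```r
# A data vector in which the third observation is missing (assumed values)
partial <- c(10, 20, NA, 30)
partial
# is.na() shows where the missing value sits
is.na(partial)
```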
Let’s assume that you want to calculate the mean of this variable. By
default, R assumes that you want to calculate the mean using all four
elements of this vector, which is probably the safest thing for a dumb
automaton to do, but it’s rarely what you actually want. Why not? Well,
remember that the basic interpretation of NA is “I don’t know what
this number is”. This means that 1 + NA = NA: if I add 1 to some
number that I don’t know (i.e., the NA) then the answer is also a
number that I don’t know. As a consequence, if you don’t explicitly tell
R to ignore the NA values, and the data set does have missing values,
then the output will itself be a missing value.
Exercise: Calculate the mean of the partial vector (without doing
anything about the missing value).
mean(partial)
Technically correct, but deeply unhelpful.
To fix this,
some functions have an optional argument called na.rm, which is
shorthand for “remove NA values”. It is a logical value indicating whether R
should ignore (or “remove”) the missing data for the purposes of doing
the calculations, and it matters in particular when calculating sums and means.
By default, R assumes that you want to keep the
missing values, so unless you say otherwise it will set na.rm = FALSE and do
nothing about the missing data problem. Let’s try setting na.rm = TRUE
and see what happens.
Exercise: Calculate the mean of the partial vector, and set
na.rm = TRUE.
mean(partial, na.rm=TRUE)
Notice that the mean is 20 (i.e., 60 / 3) and not 15 (i.e.,
60 / 4). When R ignores an NA value, it genuinely ignores it. In
effect, the calculation above is identical to what you’d get if you
asked for the mean of the three-element vector c(10, 20, 30).
Note that this isn’t unique to the mean() function. Many other functions that compute summary statistics, such as sum(), median() and sd(), have an na.rm argument that indicates whether missing values should be ignored.
What about operators? As always, don’t wait for me to tell you. Find out by yourself:
## [1] TRUE TRUE NA FALSE
## [1] TRUE FALSE NA FALSE
## [1] 8 NA NA 6 9
So the basic adage is that once you don’t know something (as evidenced by NA), you won’t suddenly start knowing something. NAs don’t just disappear.
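Here is one set of commands that gives exactly the three outputs shown above. The input vectors are my own guesses, picked only to reproduce those results; the original commands may have been different.

```r
# Comparing with NA: we cannot know whether an unknown number is below 3
lt <- c(1, 2, NA, 4) < 3
lt
# Testing equality with NA: same story
eq <- c(1, 2, NA, 4) == 1
eq
# Arithmetic with NA: adding 5 to an unknown number is still unknown
plus <- c(3, NA, NA, 1, 4) + 5
plus
```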
Sometimes, NA can disappear:
## [1] 1 3 6
The reason is that, with which(q), we are literally asking R to tell us which elements of q are equal to TRUE. As indicated by NA, we don’t know whether the fourth element of q is equal to TRUE, so it is only fair the answer does not include 4.
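As a sketch, a logical vector q like the following behaves this way. The values of q are an assumption, chosen so that the fourth element is NA and the result matches the output above.

```r
# A logical vector whose fourth element is unknown (assumed values)
q <- c(TRUE, FALSE, TRUE, NA, FALSE, TRUE)
# which() only reports positions that are known to be TRUE
which(q)
```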
Exercise: What will be the output of sum(partial <= 20) and of sum(partial <= 20, na.rm = TRUE)?
sum(partial <= 20)
sum(partial <= 20, na.rm = TRUE)
As we’ve seen, R allows you to store different kinds of data. In particular, the variables we’ve defined so far have either been numeric data, character data (text), or logical data.27 It’s important that we remember what kind of information each variable stores (and even more important that R remembers) since different kinds of variables allow you to do different things to them. For instance, if your variables have numerical information in them, then it’s okay to multiply them together.
Exercise: Assign the value 4 to x, the value 5 to y, and multiply x with y.
x <- 4 # x is numeric
y <- 5 # y is numeric
x * y
But if they contain character data, multiplication makes no sense whatsoever, and R will complain if you try to do it.
Exercise: Assign "apples" to x, "oranges" to y, and multiply
x with y.
x <- "apples" # x is character
y <- "oranges" # y is character
x * y
Even R is smart enough to know you can’t multiply "apples" by
"oranges". It knows this because the quote marks are indicators that
the variable is supposed to be treated as text, not as a number.
This is quite useful, but notice that it means that R makes a big
distinction between 5 and "5". Without quote marks, R treats 5 as
the number five, and will allow you to do calculations with it. With the
quote marks, R treats "5" as the textual character five, and doesn’t
recognise it as a number any more than it recognises "p" or "five"
as numbers. As a consequence, there’s a big difference between typing
x <- 5 and typing x <- "5". In the former, we’re storing the number
5; in the latter, we’re storing the character "5". Thus, if we try
to do multiplication with the character versions, R gets stroppy:
## Error in x * y: non-numeric argument to binary operator
Okay, let’s suppose that I’ve forgotten what kind of data I stored in the variable x (which happens depressingly often). R provides a function that will let us find out. Or, more precisely, it provides
three functions: class(), mode() and typeof(). Why the heck does
it provide three functions, you might be wondering? Basically, because R
actually keeps track of three different kinds of information about a
variable. In this class, we will only use class(), though.
The class of a variable is a “high level” classification, and
it captures psychologically (or statistically) meaningful
distinctions. For instance "2011-09-12" and "my birthday" are
both text strings, but there’s an important difference between the
two: one of them is a date. So it would be nice if we could get R to
recognise that "2011-09-12" is a date, and allow us to do things
like add or subtract from it. The class of a variable is what R uses
to keep track of things like that. Because the class of a variable
is critical for determining what R can or can’t do with it, the
class() function is very handy.
Exercise: Find the class of the following examples using the
class() function.
class(x)
Exciting, no?
You might have expected that R would have returned vector in all the previous exercises. It did not, despite x, y and z being, well, vectors.
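To make this concrete, here is a small sketch. The names x, y and z are the ones mentioned above, but the values stored in them are my own examples.

```r
x <- 100            # a number
y <- "hello world"  # a piece of text
z <- TRUE           # a logical value

class(x)   # "numeric"
class(y)   # "character"
class(z)   # "logical"

# All three are vectors (of length 1), yet class() reports the kind of data
is.vector(x)
```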
Later on, I’ll talk a bit about how you can convince R to “coerce” a variable to change from one class to another (Section 6.9.3). That’s a useful skill for real-world data analysis, but it’s not something that we need right now.
rbind() and cbind()
A not-uncommon task that you might find yourself needing to undertake is to combine several vectors. A matrix is one way of doing it (we will discuss data frames as another in Section 6.5). A matrix is basically a big rectangular table of data.
Let’s suppose we have the following two numeric vectors:
The numbers here might represent the amount of each of the two cakes that are left at five different time points. Apparently, the first cake is tastier, since that one gets devoured faster.
Let’s start by using the rbind() (“row bind”) function to create a small matrix:
## [,1] [,2] [,3] [,4] [,5]
## cake.1 100 80 0 0 0
## cake.2 100 100 90 30 10
It quite literally binds stuff together, forming a matrix.
Exercise: The variable Mr is a matrix, which we can confirm by
using the class() function.
R is being pedantic right here, and tells me Mr is both a matrix and an
array. Well, R, given that a matrix is a special kind of array (not that
you should care, really), you are right. Thanks. Note that, although all elements of Mr are numeric, R does not tell us that when asked about its class, unlike it would have done if Mr had been a vector.
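The steps so far can be sketched like this. The cake values are the ones used throughout this chapter, and Mr is the name used in the exercise above.

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)

# Bind the two vectors together as the rows of a matrix
Mr <- rbind(cake.1, cake.2)
Mr

# In R 4.0.0 and later this returns "matrix" "array"
class(Mr)

# The class of a single element reveals the type of the underlying data
class(Mr[1, 1])
```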
There is another function, the cbind() function (“column bind”) which
produces a very similar looking output.
## cake.1 cake.2
## [1,] 100 100
## [2,] 80 100
## [3,] 0 90
## [4,] 0 30
## [5,] 0 10
The rbind() function (“row bind”) produces a somewhat different
output than the cbind() function: it binds the vectors together row-wise rather than column-wise.
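A sketch of the column-wise version. The name Mc is hypothetical: I introduce it here only for illustration.

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)

# Bind the two vectors together as the columns of a matrix
Mc <- cbind(cake.1, cake.2)
Mc
dim(Mc)   # 5 rows, 2 columns

# Same numbers as the rbind() result, only flipped on its side
all(Mc == t(rbind(cake.1, cake.2)))
```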
matrix()
We know from above that the rbind() and cbind() functions will convert the vectors into a matrix. There’s yet another way, using a function called (R often isn’t the eccentric kind) matrix(). Let’s see what it does. When creating a matrix using matrix(), there are three things to specify: which numbers should be put in the matrix, what the matrix should look like, and how it should be filled. To specify what it looks like, you should, in principle, specify the number of rows AND the number of columns. However, R is smart, and needs only one of those: if you give R 10 elements to put in a matrix, and you only tell it that the matrix should have 2 rows without telling it the number of columns, it is smart enough to figure out that the matrix should have 5 columns. So you only need to specify either the number of rows or the number of columns.
So let’s put the cake data in a matrix with two rows. There are two ways we could do that:
## [,1] [,2] [,3] [,4] [,5]
## [1,] 100 80 0 0 0
## [2,] 100 100 90 30 10
## [,1] [,2] [,3] [,4] [,5]
## [1,] 100 0 0 100 30
## [2,] 80 0 100 90 10
As you can see, the byrow argument controls how the (in this case 2x5) matrix should be filled with the values from c(cake.1, cake.2): either by filling up the rows first, or by filling up the columns first. In this case, we clearly want the byrow=TRUE version.
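The two matrices above can be obtained as follows. This is a sketch; the names by.row and by.col are hypothetical and only there so we can refer to the results.

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)

# Two rows, filled row by row: each cake ends up on its own row
by.row <- matrix(c(cake.1, cake.2), nrow = 2, byrow = TRUE)
by.row

# Two rows, filled column by column: the cakes get mixed up
by.col <- matrix(c(cake.1, cake.2), nrow = 2, byrow = FALSE)
by.col
```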
What if we want two columns?
## [,1] [,2]
## [1,] 100 100
## [2,] 80 100
## [3,] 0 90
## [4,] 0 30
## [5,] 0 10
## [,1] [,2]
## [1,] 100 80
## [2,] 0 0
## [3,] 0 100
## [4,] 100 90
## [5,] 30 10
Now, we want the byrow=FALSE version.
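In code, the two-column case might look like this (a sketch; the names two.col and mixed are hypothetical):

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)

# Two columns, filled column by column: one cake per column
two.col <- matrix(c(cake.1, cake.2), ncol = 2, byrow = FALSE)
two.col

# Two columns, filled row by row: not what we want here
mixed <- matrix(c(cake.1, cake.2), ncol = 2, byrow = TRUE)
mixed
```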
R can be annoying sometimes, so to restore karma, you can annoy R once in a while. This is such a moment. Let’s ask R to put the 10 values about our cakes in a (insert diabolical laughter) matrix with 3 columns! You might even understand how it escapes our trap: it creates a matrix with 4 rows and hence 12 empty spots, 10 of which it can easily fill with the cake values we provide. For the remaining 2 spots, it just starts over, and takes the first 2 values of the set of 10 we provided. Yep, that’s the recycling rule (see Section XXX).
## Warning in matrix(c(cake.1, cake.2), ncol = 3, byrow = TRUE): data length [10]
## is not a sub-multiple or multiple of the number of rows [4]
## [,1] [,2] [,3]
## [1,] 100 80 0
## [2,] 0 0 100
## [3,] 100 90 30
## [4,] 10 100 80
The sneaky little munchkin does this (even being fair enough to provide a warning)!
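The call itself can be read off the warning message. As a sketch (M3 is a hypothetical name), you can see the recycling in the last row:

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)

# 10 values do not fit evenly into 3 columns, so R warns and recycles
M3 <- matrix(c(cake.1, cake.2), ncol = 3, byrow = TRUE)
M3

# The last row holds the 10th value, followed by the first two values again
M3[4, ]
```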
You can use square brackets to extract a subset of a matrix,
specifying a row index and then a column index. For instance, M[2,3]
pulls out the entry in the 2nd row and 3rd column of the matrix (i.e.,
90). By convention, the row number comes first.
Exercise: Do try!
We will talk more about this in Section 7.3.
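A small sketch of indexing, using the cake matrix M. Leaving an index empty to get a whole row or column is something I add here for illustration; it goes slightly beyond what the text above describes.

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)
M <- matrix(c(cake.1, cake.2), nrow = 2, byrow = TRUE)

M[2, 3]   # row 2, column 3
M[2, ]    # leaving the column index empty gives the whole second row
M[, 3]    # leaving the row index empty gives the whole third column
```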
What if you want to change a value stored in a matrix? Easy enough. One possibility would be to assign the whole matrix again from the beginning, but that is tedious and also a little wasteful: why should R have to redefine everything when only a single value needs to change? Fortunately, we can tell R to change a specific element only.
## [,1] [,2] [,3] [,4] [,5]
## [1,] 100 80 0 0 0
## [2,] 100 100 90 30 10
## [,1] [,2] [,3] [,4] [,5]
## [1,] 100 50 0 0 0
## [2,] 100 100 90 30 10
This doesn’t work:
## Error in `[<-`(`*tmp*`, 3, 2, value = 50): subscript out of bounds
because that element does not exist in M.
Neither does this:
## Error in M[1, 2] <- c(10, 50): number of items to replace is not a multiple of replacement length
This time the element does exist in M, but R cannot replace a single element with two.
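The successful change and the two failures can be sketched as follows. The assignments are taken from the output and error messages above; wrapping the failing ones in tryCatch() is my own addition, so that the script keeps running and just reports the error message.

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)
M <- matrix(c(cake.1, cake.2), nrow = 2, byrow = TRUE)

# Change only the element in row 1, column 2 (from 80 to 50)
M[1, 2] <- 50
M

# There is no third row, so this fails
tryCatch(M[3, 2] <- 50, error = function(e) conditionMessage(e))

# One spot cannot hold two values, so this fails too
tryCatch(M[1, 2] <- c(10, 50), error = function(e) conditionMessage(e))
```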
At a fundamental level, a matrix really is just one variable: it just happens that this one variable is formatted into rows and columns. If you want a matrix of numeric data, every single element in the matrix must be a number. If you want a matrix of character strings, every single element in the matrix must be a character string. If you try to mix data of different types together, then R will either spit out an error or quietly coerce the underlying data into a list.
Exercise: Let’s find out what class R secretly thinks the data
within the matrix M is, by using the class() function and indexing
the first observation.
class( M[1,1] )
You can’t type class(M), because all that will happen is R will tell
you that M is a matrix: we’re not interested in the class of the
matrix itself, we want to know what class the underlying data is assumed
to be.
Anyway, to give you a sense of how R enforces this, let’s try to change one of the elements of our numeric matrix into a character string:
## [,1] [,2] [,3] [,4] [,5]
## [1,] "text" "50" "0" "0" "0"
## [2,] "100" "100" "90" "30" "10"
It looks as if R has coerced all of the data in our matrix into
character strings. And in fact, if we now typed in class(M[1,1]) we’d
see that this is exactly what has happened.
## [1] "character"
If you alter the contents of one element in a matrix, R will change the underlying data type as necessary.
I personally don’t have any insight into what R will do when I now turn
element M[1,1] into a number again. I simply don’t know enough about the
inner workings of R to make a reasonable guess about how R will go about
it. One thing I could do is to look it up in the help file or on the
internet, or ask somebody who could know. Me not knowing what R will do is no good reason not to try it
out. In fact, it is a very good reason to try it and see what happens:
## [,1] [,2] [,3] [,4] [,5]
## [1,] "3" "50" "0" "0" "0"
## [2,] "100" "100" "90" "30" "10"
## [1] "character"
As it turns out, once we go character, R never goes back. Even though we
assigned a numerical value in the line M[1,1] <- 3, once that value becomes part of a matrix consisting of nothing but characters,
its own numerical identity is overridden and it is forced to become part of
the majority culture (being characters in this case).
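The whole experiment can be replayed with a few lines. This is a sketch that starts from a fresh copy of the cake matrix, so the other values differ slightly from the output above.

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)
M <- matrix(c(cake.1, cake.2), nrow = 2, byrow = TRUE)

class(M[1, 1])      # "numeric"

M[1, 1] <- "text"   # one character string is enough...
class(M[1, 2])      # ...to turn every element into "character"

M[1, 1] <- 3        # putting a number back does not undo the coercion
M[1, 1]             # "3", a character string
class(M[1, 1])      # still "character"
```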
Let’s first define M again, like in the old days.
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)
M <- matrix(c(cake.1, cake.2), nrow = 2, byrow = TRUE)
M
## [,1] [,2] [,3] [,4] [,5]
## [1,] 100 80 0 0 0
## [2,] 100 100 90 30 10
You know about sum(), right? How would that work on a matrix? Find out!
## [1] 510
Quite unsurprisingly, it just summed all values in M.
But what if I wanted to sum row-by-row? That is: on the first row, sum all 5 columns; and then, on the second row, sum all 5 columns. R has your back!
## [1] 180 330
What about column-by-column? That is, summing both elements of the first column, summing both elements of the second column, etc. Again, easy:
## [1] 200 180 90 30 10
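The three results above correspond to the following calls (a sketch using the base R functions sum(), rowSums() and colSums()):

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)
M <- matrix(c(cake.1, cake.2), nrow = 2, byrow = TRUE)

sum(M)       # one number: the sum of all 10 values
rowSums(M)   # one sum per row
colSums(M)   # one sum per column
```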
There’s more to life than sums. How does mean() work for a matrix? Let’s find out!
## [1] 51
Quite unsurprisingly, it took the mean of all values in M.
But what if I wanted to take the mean row-by-row? That is: on the first row, take the mean of all 5 columns; and then, on the second row, take the mean of all 5 columns. R has your back!
## [1] 36 66
What about column-by-column? That is, taking the mean of both elements of the first column, taking the mean of both elements of the second column, etc. Again, easy:
## [1] 100 90 45 15 5
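For the means, the matching calls are mean(), rowMeans() and colMeans(), sketched here:

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)
M <- matrix(c(cake.1, cake.2), nrow = 2, byrow = TRUE)

mean(M)       # the mean of all 10 values
rowMeans(M)   # one mean per row
colMeans(M)   # one mean per column
```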
Many of the other functions we have seen before work on matrices as well. For example:
## [1] 100
This finds the biggest element of M.
And, if you have been paying any attention, you surely predict we also have
## Error in rowMaxs(M): could not find function "rowMaxs"
## Error in rowMins(M): could not find function "rowMins"
## Error in colMaxs(M): could not find function "colMaxs"
## Error in colMins(M): could not find function "colMins"
Haha. Just kidding. If you want those functions, you will need to install and load a package (matrixStats, in this case):
## Warning: package 'matrixStats' was built under R version 4.2.3
## [1] 100 100
## [1] 0 10
## [1] 100 100 90 30 10
## [1] 100 80 0 0 0
rowSums() and colSums() are very convenient, but they are good for only one job. You can achieve the same using a function that is a bit more difficult, but has much broader applicability. Ladies and gentlemen, and everybody in between, please put your hands together for the apply() function. This is how it works:
The apply() function applies something on a matrix. It should not come as a surprise that, for it to work, you should feed it a matrix and an instruction of what to do. So say we want to take the sum() of the elements in M. You would think apply(M, sum) would do the job. But per above, you know that there are two ways R can go about it: either do a sum row-by-row (leading to 2 values, since we have 2 rows), or do a sum column-by-column (leading to 5 values, since we have 5 columns). So we need to give R some more information beyond the ambiguous instruction to take a sum. Telling R whether it should work column-wise or row-wise is governed by the MARGIN argument. When it is 1, R works row-wise, when it is 2, it works column-wise.
## [1] 180 330
## [1] 200 180 90 30 10
To be complete: the function we ask R to apply can also be passed by name, using the FUN argument of the apply() function. Please take some time to think of a reasonable pun about FUN and learning R.
## [1] 180 330
## [1] 200 180 90 30 10
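The calls behind these outputs can be sketched as follows. The last line is my own addition, to show the broader applicability: apply() works just as well with max() as with sum().

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)
M <- matrix(c(cake.1, cake.2), nrow = 2, byrow = TRUE)

# MARGIN = 1: apply sum() to each row
apply(M, 1, sum)
# MARGIN = 2: apply sum() to each column
apply(M, 2, sum)

# The same two calls, with every argument named
apply(M, MARGIN = 1, FUN = sum)
apply(M, MARGIN = 2, FUN = sum)

# Any function will do, for instance the maximum of each row
apply(M, MARGIN = 1, FUN = max)
```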
Remember that sometimes you will want to give an extra argument when you run a function. For example, when using the round() function, you might wish to use the digits argument. Where should that info go if you call a function using apply? Easy: any arguments you’d like to use in the function you specify in FUN can just be included after the function.
For example, compare these two commands:
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.1 0.1 0.0 0 0
## [2,] 0.1 0.1 0.1 0 0
The first one uses the default value for the digits argument of the round() function, the second one follows our wishes and uses 1 for the digits argument.
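Here is a sketch of two such commands. The outputs above are consistent with rounding the cake matrix after dividing it by 1000, so that is what I assume here; the name small is hypothetical.

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)
M <- matrix(c(cake.1, cake.2), nrow = 2, byrow = TRUE)

# Assumed input: the cake values scaled down to small fractions
small <- M / 1000

# round() with its default (digits = 0), applied to each column
r0 <- apply(small, 2, round)
r0

# digits = 1 comes after the function and is handed on to round()
r1 <- apply(small, 2, round, digits = 1)
r1
```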
I get it. You are young, and you like to break things. Then R is just for you! Here is an easy way to break R:
## Error in get(as.character(FUN), mode = "function", envir = envir): object 'UMOEDER' of mode 'function' was not found
Now, that poor R thing is trying to apply sum() to M, but due to your juvenile behavior, he now thinks that sum is an object, no longer the trusted function we have learned to love. Luckily, the good people who build R have anticipated some of this behavior, and installed a protection program against it, so this still works, despite your best attempts to break things:
## [1] 13
Admittedly, this is not on the same level as flooding a school, but still, let’s undo this unholy nonsense:
## [1] 200 180 90 30 10
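The whole episode can be replayed like this. It is a sketch: the string comes from the error message above, sum(6, 7) is merely my guess at a call that returns 13, and tryCatch() is added so that the error is reported without stopping the script.

```r
cake.1 <- c(100, 80, 0, 0, 0)
cake.2 <- c(100, 100, 90, 30, 10)
M <- matrix(c(cake.1, cake.2), nrow = 2, byrow = TRUE)

# Create a variable that has the same name as the sum() function
sum <- "UMOEDER"

# apply() now receives the character string instead of the function
res <- tryCatch(apply(M, 2, sum), error = function(e) conditionMessage(e))
res

# A direct call still works: R skips objects that are not functions
sum(6, 7)

# Remove our variable; the built-in sum() was never touched
rm(sum)
apply(M, 2, sum)
```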
Now, it’s time to start introducing some of the data types that are somewhat more specific to statistics. When we assign numbers to possible outcomes, these numbers can mean quite different things depending on what kind of variable we are attempting to measure. In particular, we commonly make the distinction between nominal, ordinal, interval and ratio scale data. How do we capture this distinction in R? Currently, we only seem to have a single numeric data type. That’s probably not going to be enough, is it?
A little thought suggests that the numeric variable class in R is perfectly suited for capturing ratio scale data. For instance, if I were to measure response time (RT) for five different events, I could store the data in R like this:
where the data here are measured in milliseconds, as is conventional in the psychological literature. It’s perfectly sensible to talk about “twice the response time”, \(2 \times \mbox{RT}\), or the “response time plus 1 second”, \(\mbox{RT} + 1000\), and so both of the following are perfectly reasonable things for R to do:
## [1] 684 802 1180 782 1108
## [1] 1342 1401 1590 1391 1554
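As a sketch, with the five response times worked back from the two outputs above (halving the first, or subtracting 1000 from the second):

```r
# Response times in milliseconds (values inferred from the output)
RT <- c(342, 401, 590, 391, 554)

2 * RT       # twice the response time
RT + 1000    # the response time plus 1 second
```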
And to a lesser extent, the “numeric” class is okay for interval scale data.
However, when it comes to nominal scale data, the numeric class becomes completely unacceptable, because almost all of the “usual” rules for what you’re allowed to do with numbers don’t apply to nominal scale data. If your data set about soccer contains three forwards and one winger, what’s the mean position? Indeed. It is for this reason that R has factors.
Suppose I was doing a study in which people could belong to one of three different treatment conditions. Each group of people was asked to complete the same task, but each group received different instructions. Not surprisingly, I might want to have a variable that keeps track of what group people were in. So I could type in something like this
so that group[i] contains the group membership of the i-th person in
my study. Clearly, this is numeric data, but obviously, this is a
nominal scale variable. There’s no sense in which “group 1” plus “group
2” equals “group 3”, but nevertheless if I try to do that, R won’t stop
me because it doesn’t know any better.
Exercise: Add the value 2 to group.
group + 2
Apparently, R seems to think that it’s allowed to invent “group 4” and
“group 5”, even though they didn’t actually exist. Unfortunately, R is
too stupid to know any better: it thinks that 3 is an ordinary number
in this context, so it sees no problem in calculating 3 + 2. But since
we’re not that stupid, we’d like to stop R from doing this. We can do
so by instructing R to treat group as a factor.
Creating a factor is easy. You can do so using the factor() function.
## [1] 1 1 1 2 2 2 3 3 3
## Levels: 1 2 3
It looks more or less the same as before (though it’s not immediately
obvious what all that Levels rubbish is about), but if we ask R to
tell us what the class of the group.f variable is now, it’s clear that
it has done what we asked.
Exercise: Use the class() function to give us the class of the
group.f variable.
class(group.f)
Neat.
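Putting the steps of this section together, as a sketch. The values of group are read off the output above; the exact commands are my reconstruction.

```r
# Group membership stored as plain numbers
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
group + 2            # R happily invents "group 4" and "group 5"

# Tell R to treat the variable as a factor
group.f <- factor(group)
group.f
class(group.f)       # "factor"
levels(group.f)      # the possible values: "1" "2" "3"
```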
Easy. Just use the [] as with a normal vector.
## [1] 1
## Levels: 1 2 3
gives the second element and, unlike with a normal vector, tells us the levels of the factor.
What if I made a mistake and I want the 7th element to be a 1?
## [1] 1 1 1 2 2 2 1 3 3
## Levels: 1 2 3
Easy! Changing it to a 4 should be easy too, of course:
## Warning in `[<-.factor`(`*tmp*`, 7, value = 4): invalid factor level, NA
## generated
## [1] 1 1 1 2 2 2 <NA> 3 3
## Levels: 1 2 3
No, it isn’t. There is no level 4, so R just assumes (probably correctly) that you are talking nonsense.
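A sketch of the commands behind this subsection, reconstructed from the outputs and the warning message:

```r
group.f <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3))

# Indexing works as with a normal vector, but the levels are printed too
group.f[2]

# Replacing an element with an existing level is fine
group.f[7] <- 1
group.f

# 4 is not one of the levels: R warns and stores NA instead
group.f[7] <- 4
group.f
```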
Now that we’ve converted group to a factor, look what happens when you try to add 2 to group.f
Exercise: Try it.
group.f + 2
This time even R is smart enough to know that I’m being an idiot, so it
tells me off and then produces a vector of missing values (i.e., NA:
see Section 6.1.2), together with a strongly worded warning. So not much to see here!
I have a confession to make. My memory is not infinite in capacity; and
it seems to be getting worse as I get older. So it kind of annoys me
when I get data sets where there’s a nominal scale variable called
gender, with three levels corresponding to males, females and other.
But when I go to print out the variable I get something like this:
## [1] 1 1 1 3 1 2 2 2 2
## Levels: 1 2 3
Okaaaay. That’s not helpful at all, and it makes me very sad. Which number corresponds to the males, to the females and which one corresponds to the other category? Wouldn’t it be nice if R could actually keep track of this? It’s way too hard to remember which number corresponds to which gender.
And besides, the problem that this causes is much more serious than a
single sad nerd… because R has no way of knowing that the 1s in the
group.f variable are a very different kind of thing to the 1s in the
gender variable. So if I try to ask which elements of the group.f
variable are equal to the corresponding elements in gender, R thinks
this is totally kosher and gives me this:
## [1] 1 1 1 2 2 2 <NA> 3 3
## Levels: 1 2 3
## [1] 1 1 1 3 1 2 2 2 2
## Levels: 1 2 3
## [1] TRUE TRUE TRUE FALSE FALSE TRUE NA FALSE FALSE
Well, that’s … especially stupid.28 The problem here is that R is
very literal-minded. Even though you’ve declared both group.f and
gender to be factors, it still assumes that a 1 is a 1 no matter
which variable it appears in.
To fix both of these problems (my memory problem, and R’s infuriating literal interpretations), what we need to do is assign meaningful labels to the different levels of each factor. We can do that like this:
## [1] group 1 group 1 group 1 group 2 group 2 group 2 <NA> group 3 group 3
## Levels: group 1 group 2 group 3
## [1] female female female other female male male male male
## Levels: female male other
Note how the original 1s, 2s and 3s have been replaced by the labels we assigned to the levels.
That’s much easier on the eye, and better yet, R is smart enough to know
that "female" is not equal to "group 1", so now when I try to ask
which group memberships are “equal to” the gender of the corresponding
person,
## Error in Ops.factor(group.f, gender): level sets of factors are different
R correctly tells me that I’m an idiot.
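Here is a sketch of how the labels can be assigned with levels(), and of the comparison before and after. The data are read off the outputs above; wrapping the last comparison in tryCatch() is my addition, so that the error message is shown without stopping the script.

```r
group.f <- factor(c(1, 1, 1, 2, 2, 2, NA, 3, 3))
gender  <- factor(c(1, 1, 1, 3, 1, 2, 2, 2, 2))

# With bare numbers as levels, R compares "1" to "1" without complaint
eq <- group.f == gender
eq

# Give each factor meaningful labels, listed in the order of the levels
levels(group.f) <- c("group 1", "group 2", "group 3")
levels(gender)  <- c("female", "male", "other")
group.f
gender

# Now the comparison is refused
msg <- tryCatch(group.f == gender, error = function(e) conditionMessage(e))
msg
```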
Of course, it is your responsibility to assign the correct meaning to your data, by listing the labels in the correct order. If a 1 in your gender variable means “male”, and a 2 means “other”, you should use
## [1] male male male female male other other other other
## Levels: male other female
Quite conveniently, you can already define the levels when you create the factor, using the levels argument. Be careful, though: if the elements do not match the levels you list, it doesn’t really work, and every element turns into NA.
## [1] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## Levels: male other female
but this works:
gender <- factor(c("male", "male", "male", "other", "male", "female", "female", "female", "female"), levels = c("male", "other", "female"))
gender
## [1] male male male other male female female female female
## Levels: male other female
and so does this
gender <- factor(c(1, 1, 1, 3, 1, 2, 2, 2, 2), levels = c(1, 2, 3), labels = c("male", "other", "female"))
and this partially works:
gender <- factor(c("male", "male", "male", "other", "male", "female", "female", "female", "female"), levels = c("male", "X", "female"))
gender
## [1] male male male <NA> male female female female female
## Levels: male X female
Factors are very useful things: they’re the main way to represent a nominal scale variable. And there are lots of nominal scale variables out there. I’ll talk more about factors in Section ??, but for now, you know enough to be able to get started.
In order to understand why R has created this funny thing called a data
frame, it helps to try to see what problem it solves. So let’s go back
to the little scenario that I used when introducing factors in Section
6.4. In that section, I recorded the group.f and gender
for all 9 participants in my study. Let’s also suppose I recorded their
ages and their score on “My Terribly Exciting Psychological
Test”:
age <- c(17, 19, 21, 37, 18, 19, 47, 18, 19)
score <- c(12, 10, 11, 15, 16, 14, 25, 21, 29)
# and just as a reminder, we have
group.f <- factor(c("group 1", "group 1", "group 1", "group 2", "group 2", "group 2", "group 3", "group 3", "group 3"), levels = c("group 1", "group 2", "group 3"))
group.f
## [1] group 1 group 1 group 1 group 2 group 2 group 2 group 3 group 3 group 3
## Levels: group 1 group 2 group 3
gender <- factor(c("male", "male", "male", "other", "male", "female", "female", "female", "female"), levels = c("male", "other", "female"))
gender
## [1] male male male other male female female female female
## Levels: male other female
So there are four variables in the workspace, age, group.f, gender, and score. And it just so happens that all four of them are the same
size (i.e., they’re all vectors with 9 elements). Aaaand it just so
happens that age[1] corresponds to the age of the first person, and
gender[1] is the gender of that very same person, etc. In other words,
you and I both know that all four of these variables correspond to the
same data set, and all four of them are organised in exactly the same
way.
However, R doesn’t know this! As far as it’s concerned, there’s no
reason why the age variable has to be the same length as the gender
variable; and there’s no particular reason to think that age[1] has
any special relationship to gender[1] any more than it has a special
relationship to gender[4]. In other words, when we store everything in
separate variables like this, R doesn’t know anything about the
relationships between things. It doesn’t even really know that these
variables actually refer to a proper data set. The data frame fixes
this: if we store our variables inside a data frame, we’re telling R to
treat these variables as a single, fairly coherent data set.
To see how they do this, let’s create one. So how do we create a data frame? One way we’ve already seen: if we import our data from a CSV file, R will store it as a data frame. Sweet!
A second way is to create it directly from some existing variables using the data.frame() function. All you have to do is type a list of variables that you want to include in the data frame. The output of a data.frame() command is, well, a data frame, just as the matrix()
command makes a matrix. So, if I want to have different variables in a data frame, I can do so like this:
Exercise: Store all four variables from the experiment (age, gender, group.f, score) in a data frame called expt. You can look at what you created by typing print(expt) on the next line.
expt <- data.frame ( age, gender, group.f, score )
print(expt)
Here is a brief note I wish I didn’t have to write. You might be wondering why I asked you to run print(expt) instead of just expt. When working in R(Studio), just typing expt would have been more than enough to have R print the data frame out for you. But for some reason I don’t (care to) comprehend, in the learnr environment this document is made in (so that you can do these exercises in the browser), just typing expt works, in the sense that R will print expt for you, but it looks ugly. There’s enough ugly in this world, and maybe I cannot provide a lot of beauty, but at least let me not generate more ugliness. To make things look nice, I have used and will use the command print(expt) whenever I want to inspect expt, and you probably should too! This is only needed for data frames. Vectors and matrices print nicely without print(): just type the name of the variable.
Note that expt is a completely self-contained variable. Once you’ve
created it, it no longer depends on the original variables from which it
was constructed. Because this is such an important point, I want you to
see it for yourself: make a change to age, and check the expt
variable.
age[5] <- 19 #for example. make any changes to the age variable you like
print(expt)
You will see that if we make changes to the original age variable, it
will not lead to any changes to the age data stored in expt. Expecting
otherwise is a common (and stupid) mistake I would hate you to make.
Say you want to add a new variable to the data frame, for example one storing the number of hours slept.
You could of course just use the data.frame() command again,
data.frame ( age, gender, group.f, score, hoursslept )
The easiest way to do so, however, is to use $, as the following
example illustrates. If I type a command like this
## age gender group.f score hrslept
## 1 17 male group 1 12 6
## 2 19 male group 1 10 7
## 3 21 male group 1 11 8
## 4 37 other group 2 15 7
## 5 18 male group 2 16 6
## 6 19 female group 2 14 5
## 7 47 female group 3 25 4
## 8 18 female group 3 21 3
## 9 19 female group 3 29 10
then R adds a new column called hrslept at the end of the data frame, and
assigns it the numerical values.
Note that the name we give the variable on its own (hoursslept) need
not be identical to what we call it inside the data frame
(hrslept), but it can be, if you want. R is happy either way.
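A command along the following lines would do it. This is a minimal sketch, assuming that hoursslept holds the nine values printed in the hrslept column, and using a cut-down expt with only age and score to keep the example short:

```r
# Cut-down version of expt (only two of the original columns)
age   <- c(17, 19, 21, 37, 18, 19, 47, 18, 19)
score <- c(12, 10, 11, 15, 16, 14, 25, 21, 29)
expt  <- data.frame(age, score)

# Stand-alone vector with the hours slept (values taken from the printout)
hoursslept <- c(6, 7, 8, 7, 6, 5, 4, 3, 10)

# Add it to the data frame as a new column called hrslept
expt$hrslept <- hoursslept
print(expt)
```

The name on the left of the arrow (hrslept) becomes the column name; the vector on the right only supplies the values.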
Of course, you can do this in a single step.
## age gender group.f score hrslept
## 1 17 male group 1 12 6
## 2 19 male group 1 10 7
## 3 21 male group 1 11 8
## 4 37 other group 2 15 7
## 5 18 male group 2 16 6
## 6 19 female group 2 14 5
## 7 47 female group 3 25 4
## 8 18 female group 3 21 3
## 9 19 female group 3 29 20
Note how I changed the last element to a record-breaking 20, to
highlight that by the new assignment of hrslept, the previous values
are overwritten in expt.
Alternatively, you could go like this
## age gender group.f score hrslept zzz
## 1 17 male group 1 12 6 6
## 2 19 male group 1 10 7 7
## 3 21 male group 1 11 8 8
## 4 37 other group 2 15 7 7
## 5 18 male group 2 16 6 6
## 6 19 female group 2 14 5 5
## 7 47 female group 3 25 4 4
## 8 18 female group 3 21 3 3
## 9 19 female group 3 29 20 10
Note the use of "", to indicate that "zzz" is a string (being the name of the column).
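Both steps can be sketched as follows. The exact bracket notation used for zzz is an assumption; expt[, "zzz"] is one form that works, and the data frame is again cut down to two columns:

```r
age   <- c(17, 19, 21, 37, 18, 19, 47, 18, 19)
score <- c(12, 10, 11, 15, 16, 14, 25, 21, 29)
expt  <- data.frame(age, score)
hoursslept <- c(6, 7, 8, 7, 6, 5, 4, 3, 10)
expt$hrslept <- hoursslept

# Single step: assign the values directly; this overwrites the old hrslept
expt$hrslept <- c(6, 7, 8, 7, 6, 5, 4, 3, 20)

# Alternative: name the new column with a character string inside [ ]
expt[, "zzz"] <- hoursslept
print(expt)
```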
A final thing to note is that, when defining the data frame, I have unlimited freedom of choosing the names of the variables:
## wisdom hmmm grrrr booyah
## 1 17 male group 1 12
## 2 19 male group 1 10
## 3 21 male group 1 11
## 4 37 other group 2 15
## 5 18 male group 2 16
## 6 19 female group 2 14
## 7 47 female group 3 25
## 8 18 female group 3 21
## 9 19 female group 3 29
or, if you haven’t defined your variables yet, you can do so while defining the data frame
## wisdom hmmm
## 1 1 M
## 2 2 M
## 3 3 X
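Commands of the following shape give this kind of result. It is a sketch: the column names are the ones in the printouts, but only two of the four renamed columns are reproduced here.

```r
age   <- c(17, 19, 21, 37, 18, 19, 47, 18, 19)
score <- c(12, 10, 11, 15, 16, 14, 25, 21, 29)

# name = variable pairs let you pick any column name you like
renamed <- data.frame(wisdom = age, booyah = score)
print(renamed)

# The values can also be typed directly inside data.frame()
small <- data.frame(wisdom = c(1, 2, 3), hmmm = c("M", "M", "X"))
print(small)
```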
At this point, we have all we need to
know in one variable, a data frame called expt. But as we can see
when we told R to print the variable out, this data frame contains 6
variables, each of which has 9 observations. So how do we get this
information out again? After all, there’s no point in storing
information if you don’t use it, and there’s no way to use information
if you can’t access it. So let’s talk a bit about how to pull
information out of a data frame.
The first thing we might want to do is pull out one of our stored
variables, let’s say hrslept. One thing you might try to do is ignore
the fact that hrslept is locked up inside the expt data frame. For
instance, you might try to print it out like this:
## Error in eval(expr, envir, enclos): object 'hrslept' not found
This doesn’t work, because R doesn’t go “peeking” inside the data frame
unless you explicitly tell it to do so.
How do we tell R to look inside the data frame? As is always the case
with R there are several ways. The simplest way is to use the $
operator to extract the variable you’re interested in, like this:
## [1] 6 7 8 7 6 5 4 3 20
We will talk a bit more about this in Section 7.4.
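The contrast between the two attempts can be sketched like this. The data frame is a cut-down stand-in for expt, and tryCatch() is only there so that the failing line does not stop the script:

```r
expt <- data.frame(age     = c(17, 19, 21, 37, 18, 19, 47, 18, 19),
                   hrslept = c(6, 7, 8, 7, 6, 5, 4, 3, 20))

# The bare name fails when no separate variable hrslept exists:
# R does not look inside the data frame
tryCatch(print(hrslept),
         error = function(e) message("Error: ", conditionMessage(e)))

# The $ operator tells R where to look
expt$hrslept
```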
If you want to restore the 20 hours slept to a more reasonable 10 hours, you could do so as follows:
## age gender group.f score hrslept zzz
## 1 17 male group 1 12 6 6
## 2 19 male group 1 10 7 7
## 3 21 male group 1 11 8 8
## 4 37 other group 2 15 7 7
## 5 18 male group 2 16 6 6
## 6 19 female group 2 14 5 5
## 7 47 female group 3 25 4 4
## 8 18 female group 3 21 3 3
## 9 19 female group 3 29 10 10
If everyone slept exactly the same number of hours, you could change all values at once:
## age gender group.f score hrslept zzz
## 1 17 male group 1 12 10 6
## 2 19 male group 1 10 10 7
## 3 21 male group 1 11 10 8
## 4 37 other group 2 15 10 7
## 5 18 male group 2 16 10 6
## 6 19 female group 2 14 10 5
## 7 47 female group 3 25 10 4
## 8 18 female group 3 21 10 3
## 9 19 female group 3 29 10 10
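Both changes can be sketched as follows, again on a cut-down stand-in for expt. The first line changes a single element; the second relies on R recycling a single value over all rows:

```r
expt <- data.frame(age     = c(17, 19, 21, 37, 18, 19, 47, 18, 19),
                   hrslept = c(6, 7, 8, 7, 6, 5, 4, 3, 20))

# Change only the 9th value of the column
expt$hrslept[9] <- 10
print(expt)

# Assign one value: it is repeated for every row
expt$hrslept <- 10
print(expt)
```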
This, for example, won’t work
## Error in `$<-.data.frame`(`*tmp*`, hrslept, value = c(9, 10)): replacement has 2 rows, data has 9
## age gender group.f score hrslept zzz
## 1 17 male group 1 12 10 6
## 2 19 male group 1 10 10 7
## 3 21 male group 1 11 10 8
## 4 37 other group 2 15 10 7
## 5 18 male group 2 16 10 6
## 6 19 female group 2 14 10 5
## 7 47 female group 3 25 10 4
## 8 18 female group 3 21 10 3
## 9 19 female group 3 29 10 10
R can’t possibly know what it should change all 9 values to if you only provide two values!
This won’t work either:
## Error in `$<-.data.frame`(`*tmp*`, hrslept, value = c(10, 10, 10, 10, : replacement has 10 rows, data has 9
## age gender group.f score hrslept zzz
## 1 17 male group 1 12 10 6
## 2 19 male group 1 10 10 7
## 3 21 male group 1 11 10 8
## 4 37 other group 2 15 10 7
## 5 18 male group 2 16 10 6
## 6 19 female group 2 14 10 5
## 7 47 female group 3 25 10 4
## 8 18 female group 3 21 10 3
## 9 19 female group 3 29 10 10
The reason is that R has no problem adding the 11 as the 10th value for hrslept but doesn’t know what the corresponding values should be for
age, gender, group.f and score.
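The two failing assignments can be reproduced safely like this. It is a sketch on a cut-down stand-in for expt; tryCatch() captures the error message instead of halting, and the names msg2 and msg10 are made up for the illustration:

```r
expt <- data.frame(age     = c(17, 19, 21, 37, 18, 19, 47, 18, 19),
                   hrslept = 10)

# Two values for nine rows: 9 is not a multiple of 2, so R refuses
msg2 <- tryCatch({ expt$hrslept <- c(9, 10); "no error" },
                 error = function(e) conditionMessage(e))
msg2

# Ten values for nine rows: too many, so R refuses as well
msg10 <- tryCatch({ expt$hrslept <- c(rep(10, 9), 11); "no error" },
                  error = function(e) conditionMessage(e))
msg10

# The data frame is left untouched by the failed attempts
print(expt)
```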
One thing I want to share is that the apply() function we encountered with matrices (Section XXX) also works on data frames, in exactly the same way.
For example:
## age gender group.f score hrslept zzz
## [1,] "17" "male" "group 1" "12" "10" " 6"
## [2,] "19" "male" "group 1" "10" "10" " 7"
## [3,] "21" "male" "group 1" "11" "10" " 8"
## [4,] "37" "other" "group 2" "15" "10" " 7"
## [5,] "18" "male" "group 2" "16" "10" " 6"
## [6,] "19" "female" "group 2" "14" "10" " 5"
For example:
## age gender group.f score hrslept zzz
## "47" "other" "group 3" "29" "10" "10"
You might have observed that the numbers have been turned into characters, which is reminiscent of the behavior we observed with matrices. The reason is that, strictly speaking, apply() only works on matrices. So if we supply apply() with a data frame, R silently converts it to a matrix (and does not convert the result back). As long as your data frame consists of numeric variables only, you probably won’t even notice, though.
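The silent conversion is easy to see on a small example. This sketch uses a three-row stand-in for expt, and the choice of max as the function to apply is an assumption for illustration:

```r
expt <- data.frame(age    = c(17, 19, 21),
                   gender = factor(c("male", "other", "female")),
                   score  = c(12, 10, 11))

# Mixed columns: everything becomes text before max() is applied
mixed <- apply(expt, 2, max)
class(mixed)

# Numeric columns only: the result stays numeric
expt_num <- data.frame(age = expt$age, score = expt$score)
numeric_only <- apply(expt_num, 2, max)
class(numeric_only)
numeric_only
```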
What about mean(), sum() and their row-wise and column-wise versions?
## Error in FUN(X[[i]], ...): only defined on a data frame with all numeric-alike variables
No luck, but it does make sense. After all, expt contains a column with male, female and other, and there is no way you could sum those, so you shouldn’t expect R to do it.
So let’s, for the sake of experiment, construct a data frame with only numerical variables and see if the functions flourish again:
## age score
## 1 17 12
## 2 19 10
## 3 21 11
## 4 37 15
## 5 18 16
## 6 19 14
## 7 47 25
## 8 18 21
## 9 19 29
## [1] 368
## [1] 29 29 32 52 34 33 72 39 48
## [1] 14.5 14.5 16.0 26.0 17.0 16.5 36.0 19.5 24.0
## age score
## 215 153
## age score
## 23.88889 17.00000
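The results above can be obtained with commands like these, a sketch that uses the age and score values from the start of the section:

```r
age   <- c(17, 19, 21, 37, 18, 19, 47, 18, 19)
score <- c(12, 10, 11, 15, 16, 14, 25, 21, 29)
expt_num <- data.frame(age, score)   # numeric columns only
print(expt_num)

sum(expt_num)        # grand total over all cells
rowSums(expt_num)    # one sum per row (case)
rowMeans(expt_num)   # one mean per row
colSums(expt_num)    # one sum per column (variable)
colMeans(expt_num)   # one mean per column
```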
Yeah, baby, on a roll! Let’s push our luck:
## Error in rowMaxs(expt_num): Argument 'x' must be a matrix or a vector.
## Error in colMaxs(expt_num): Argument 'x' must be a matrix or a vector.
And of course, at some point we are out. rowMaxs does not work on a data frame, which is, in hindsight, not surprising given that it is a function from the package called matrixStats and not matrixanddataframesandafewotherclassesyoumightcareaboutStats.
Help is on the way in the form of pmax and friends:
## [1] 17 19 21 37 18 19 47 21 29
## [1] 12 10 11 15 16 14 25 18 19
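A sketch of how pmax() and pmin() give these row-wise maxima and minima. Passing the columns one by one with $ is an assumption about how the call was written:

```r
age   <- c(17, 19, 21, 37, 18, 19, 47, 18, 19)
score <- c(12, 10, 11, 15, 16, 14, 25, 21, 29)
expt_num <- data.frame(age, score)

# pmax() compares the vectors element by element ("parallel" maximum)
row_max <- pmax(expt_num$age, expt_num$score)
row_max

# pmin() does the same for the minimum
row_min <- pmin(expt_num$age, expt_num$score)
row_min
```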
So now we know two ways of binding or merging two or more vectors
together: into a data frame or into a matrix.
If you are thinking now that the expt data frame looks a lot like a
matrix, you are right. In fact, in this particular case, we could have
stored all info quite neatly in such a matrix, for example as follows:
exptM <- cbind( age, gender, group.f, score, hoursslept ) #put everything in a matrix
exptM #show matrix
## age gender group.f score hoursslept
## [1,] 17 1 1 12 6
## [2,] 19 1 1 10 7
## [3,] 21 1 1 11 8
## [4,] 37 2 2 15 7
## [5,] 18 1 2 16 6
## [6,] 19 3 2 14 5
## [7,] 47 3 3 25 4
## [8,] 18 3 3 21 3
## [9,] 19 3 3 29 10
## age gender group.f score hrslept zzz
## 1 17 male group 1 12 10 6
## 2 19 male group 1 10 10 7
## 3 21 male group 1 11 10 8
## 4 37 other group 2 15 10 7
## 5 18 male group 2 16 10 6
## 6 19 female group 2 14 10 5
## 7 47 female group 3 25 10 4
## 8 18 female group 3 21 10 3
## 9 19 female group 3 29 10 10
The critical difference between a data frame and a matrix is that, in a data frame, we have this notion that each of the columns corresponds to a different variable: as a consequence, the columns in a data frame can be of different data types. The first column could be numeric, the second column could contain character strings, and the third column could be logical data. In that sense, there is a fundamental asymmetry built into a data frame, because columns represent variables (which can be qualitatively different from each other) and rows represent cases (which cannot). Matrices are intended to be thought of in a different way. All elements are of the same type (in this case, numerical values).
Note that this difference is also reflected in how the data frame expt
and the matrix exptM treat the factors gender and group.f: in the
matrix, their values are represented as numbers (with all the problems
that entails, as I discussed when arguing for the need for factors),
whereas in the data frame, they are represented with their more
meaningful labels.
To drive home this point, suppose I want to store the seasons in which I have collected the data.
collection <- c("win", "win", "win", "sum", "win", "sum", "win", "sum", "sum") #winter or summer
exptM <- cbind( age, gender, group.f, score, hoursslept, collection )
exptM
## age gender group.f score hoursslept collection
## [1,] "17" "1" "1" "12" "6" "win"
## [2,] "19" "1" "1" "10" "7" "win"
## [3,] "21" "1" "1" "11" "8" "win"
## [4,] "37" "2" "2" "15" "7" "sum"
## [5,] "18" "1" "2" "16" "6" "win"
## [6,] "19" "3" "2" "14" "5" "sum"
## [7,] "47" "3" "3" "25" "4" "win"
## [8,] "18" "3" "3" "21" "3" "sum"
## [9,] "19" "3" "3" "29" "10" "sum"
## age gender group.f score hrslept zzz col
## 1 17 male group 1 12 10 6 win
## 2 19 male group 1 10 10 7 win
## 3 21 male group 1 11 10 8 win
## 4 37 other group 2 15 10 7 sum
## 5 18 male group 2 16 10 6 win
## 6 19 female group 2 14 10 5 sum
## 7 47 female group 3 25 10 4 win
## 8 18 female group 3 21 10 3 sum
## 9 19 female group 3 29 10 10 sum
You notice that all numerical values have been converted to character strings when put into a matrix together with a character variable. This did not happen when they were put together in a data frame. It shows how the internals of data frames and matrices are quite different.
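The difference can be checked directly rather than read off the printout. This is a small sketch with three cases only; the names m and d are made up for the illustration:

```r
age        <- c(17, 19, 21)
collection <- c("win", "win", "sum")

# In a matrix every cell must have the same type, so age becomes text
m <- cbind(age, collection)
typeof(m)

# In a data frame every column keeps its own type
d <- data.frame(age, collection)
class(d$age)
class(d$collection)
```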
Another glimpse of the different internal workings of matrices and data frames is offered when we try to add yet another variable that has a different number of elements from the others. I just want to remember data collection occurred in winter and in summer, but I don’t want to store that my first data point was collected in the winter, my second in summer, and so on. I just want to store two names of two seasons.
collection.short <- c("win", "sum") #winter or summer
exptM <- cbind( age, gender, group.f, score, hoursslept, collection.short )
## Warning in cbind(age, gender, group.f, score, hoursslept, collection.short):
## number of rows of result is not a multiple of vector length (arg 6)
## age gender group.f score hoursslept collection.short
## [1,] "17" "1" "1" "12" "6" "win"
## [2,] "19" "1" "1" "10" "7" "sum"
## [3,] "21" "1" "1" "11" "8" "win"
## [4,] "37" "2" "2" "15" "7" "sum"
## [5,] "18" "1" "2" "16" "6" "win"
## [6,] "19" "3" "2" "14" "5" "sum"
## [7,] "47" "3" "3" "25" "4" "win"
## [8,] "18" "3" "3" "21" "3" "sum"
## [9,] "19" "3" "3" "29" "10" "win"
## Error in `$<-.data.frame`(`*tmp*`, season, value = c("win", "sum")): replacement has 2 rows, data has 9
## age gender group.f score hrslept zzz col
## 1 17 male group 1 12 10 6 win
## 2 19 male group 1 10 10 7 win
## 3 21 male group 1 11 10 8 win
## 4 37 other group 2 15 10 7 sum
## 5 18 male group 2 16 10 6 win
## 6 19 female group 2 14 10 5 sum
## 7 47 female group 3 25 10 4 win
## 8 18 female group 3 21 10 3 sum
## 9 19 female group 3 29 10 10 sum
The fact that the new variable has only 2 elements, whereas the others
have 9, leads to a mere warning when they are put together into a matrix.
(Note the use of the recycling rule explained in Section ??). When
trying to add this odd-one-out variable to a data frame, R balks and
produces an error, so nothing is changed in the expt data frame (i.e.,
the additional season variable we asked for has not been added to the
data frame). No can do. As we’ll soon see, lists will solve this
rather restrictive behavior.
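The contrast between a warning and an error can be sketched on three cases. The column name season comes from the error message above; tryCatch() is used so that the failing line does not stop the script, and msg is a made-up name:

```r
age              <- c(17, 19, 21)
collection.short <- c("win", "sum")

# Matrix: R recycles the short vector and merely warns (3 is not a multiple of 2)
exptM <- cbind(age, collection.short)
exptM

# Data frame: the same mismatch is an error, and expt stays as it was
expt <- data.frame(age)
msg <- tryCatch({ expt$season <- collection.short; "no error" },
                error = function(e) conditionMessage(e))
msg
names(expt)
```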
Overall, by comparing the matrix and the data frame output, I guess most of you will appreciate that the data frame looks neater and is easier to interpret. So we will mostly be using that, especially when doing statistical analyses.
After having stressed how big the difference is between a matrix and a data frame, let me tone it down a bit and stress how they are similar too. In fact, they are so similar you can combine them in an operation like addition, provided of course that the elements can be summed:
## age score
## 1 17 12
## 2 19 10
## 3 21 11
## 4 37 15
## 5 18 16
## 6 19 14
## 7 47 25
## 8 18 21
## 9 19 29
## age score
## [1,] 17 12
## [2,] 19 10
## [3,] 21 11
## [4,] 37 15
## [5,] 18 16
## [6,] 19 14
## [7,] 47 25
## [8,] 18 21
## [9,] 19 29
#now do some seriously messed up stuff
#and sum a matrix and a data frame
#prepare to get blown away
s <- Mas + expt_num
class(s)
## [1] "data.frame"
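A self-contained version of this experiment, on three cases. How Mas was created is not shown, so building it with as.matrix() is an assumption:

```r
expt_num <- data.frame(age = c(17, 19, 21), score = c(12, 10, 11))
Mas      <- as.matrix(expt_num)   # assumed: a matrix holding the same numbers

# Adding a matrix and a data frame of the same dimensions
s <- Mas + expt_num
print(s)
class(s)   # the result is a data frame
```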
There’s a lot more that can be said about data frames: they’re fairly complicated beasts, and the longer you use R the more important it is to make sure you really understand them. We’ll talk a bit more about them in Chapter 7.
The next kind of data I want to mention are lists. Lists are an extremely fundamental data structure in R, and as you start making the transition from a novice to a savvy R user you will use lists all the time. I don’t use lists very often in this book – not directly – but most of the advanced data structures in R are built from lists. In fact, as far as R is concerned a data frame is actually a special kind of list, or a list is like a data frame on steroids. Because lists are so important to how R stores things, it’s useful to have a basic understanding of them.
Okay, so what is a list, exactly? Like data frames, lists are just “collections of variables.” However, unlike data frames – which are basically supposed to look like a nice “rectangular” table of data – there are no constraints on what kinds of variables we include, and no requirement that the variables have any particular relationship to one another.
In order to understand what this actually means, the best thing to do
is create a list, which R, being the good sport that it is, lets you do using the list() function, just like
we used the matrix() function and the data.frame() function to
create a matrix or a data frame. Ain’t life grand.
If I type this as my command:
R creates a new list variable called Dan, which is a bundle of three
different variables: age, nerd and parents. Notice that the
parents variable is longer than the others.
If we now print out the variable, you can see the way that R stores the list:
## $age
## [1] 34
##
## $nerd
## [1] TRUE
##
## $parents
## [1] "Joe" "Liz"
As you might have guessed from those $ symbols everywhere, the
variables are stored in exactly the same way that they are for a data
frame (again, this is not surprising: data frames are a type of list).
So you will (I hope) be entirely unsurprised and probably quite bored
when I tell you that you can extract the variables from the list using
the $ operator, like so:
## [1] TRUE
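Put together, creating the list, printing it and pulling out one element looks like this. It is a sketch with the values taken from the printout above:

```r
# A list bundles variables of different types and lengths
Dan <- list(age     = 34,
            nerd    = TRUE,
            parents = c("Joe", "Liz"))

print(Dan)   # each element is shown under its $name

Dan$nerd     # extract a single element with $
```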
If you need to add new entries to the list, the easiest way to do so is
to again use $, as the following example illustrates. If I type a
command like this
then R creates a new entry at the end of the list called children, and
assigns it a value of "Alex". If I were now to print() this list
out, you’d see a new entry at the bottom of the printout.
Finally, it’s actually possible for lists to contain other lists, so
it’s quite possible that I would end up using a command like
Dan$partner$age to find out how old my partner is.
## $age
## [1] 34
##
## $nerd
## [1] TRUE
##
## $parents
## [1] "Joe" "Liz"
##
## $children
## [1] "Alex"
##
## $partner
## $partner$name
## [1] "You know this, don't you"
##
## $partner$age
## [1] 45
Or I could try to remember it myself I suppose.
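The printout above can be rebuilt with commands like these. It is a sketch: the values are copied from the printout, and the way partner was added is an assumption:

```r
Dan <- list(age = 34, nerd = TRUE, parents = c("Joe", "Liz"))

# Add a new entry at the end of the list
Dan$children <- "Alex"

# A list can itself hold another list
Dan$partner <- list(name = "You know this, don't you", age = 45)

print(Dan)
Dan$partner$age   # chain $ to reach into the nested list
```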
Note that the parents variable was longer than the others. This is perfectly
acceptable for a list, but it wouldn’t be for a data frame! If we had entered the same information in a data frame, R would think there are two 34-year-old nerds, one with a parent Joe and one with a parent Liz.
## age nerd parents
## 1 34 TRUE Joe
## 2 34 TRUE Liz
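You can check this yourself with a command like the one below. It is a sketch, and the name Dan.df is made up for the illustration:

```r
# The single values of age and nerd are recycled to match the two parents
Dan.df <- data.frame(age = 34, nerd = TRUE, parents = c("Joe", "Liz"))
print(Dan.df)
nrow(Dan.df)   # two rows, so R treats this as two separate cases
```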
I have said before that a data frame is just a special case of a list. However, it is good to take your time to appreciate that it’s a very special kind of list: one where all the variables are of the same length, and the first element in each variable happens to correspond to the first “case” in the data set. Let’s look at our data frame again.
## age gender group.f score hrslept zzz col
## 1 17 male group 1 12 10 6 win
## 2 19 male group 1 10 10 7 win
## 3 21 male group 1 11 10 8 win
## 4 37 other group 2 15 10 7 sum
## 5 18 male group 2 16 10 6 win
## 6 19 female group 2 14 10 5 sum
## 7 47 female group 3 25 10 4 win
## 8 18 female group 3 21 10 3 sum
## 9 19 female group 3 29 10 10 sum
Note that, despite expt being a list (on account of being a data
frame), it is printed differently than Dan. No-one ever wants to see a
data frame printed out in the default “list-like” way that I’ve shown in
the extract above when printing Dan. If you want to see how R would show expt to you if it respected its list-like identity, you can see it here, and be appreciative that R focuses on its data frame identity instead.
## $age
## [1] 17 19 21 37 18 19 47 18 19
##
## $gender
## [1] male male male other male female female female female
## Levels: male other female
##
## $group.f
## [1] group 1 group 1 group 1 group 2 group 2 group 2 group 3 group 3 group 3
## Levels: group 1 group 2 group 3
##
## $score
## [1] 12 10 11 15 16 14 25 21 29
##
## $hrslept
## [1] 10 10 10 10 10 10 10 10 10
##
## $zzz
## [1] 6 7 8 7 6 5 4 3 10
##
## $col
## [1] "win" "win" "win" "sum" "win" "sum" "win" "sum" "sum"
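One way to get such a list-like printout is to convert the data frame with as.list() before printing. This is an assumption about how the output above was produced, shown here on a three-row stand-in for expt:

```r
expt <- data.frame(age    = c(17, 19, 21),
                   gender = factor(c("male", "other", "female")))

expt.list <- as.list(expt)   # drop the data frame identity, keep the columns
print(expt.list)
```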
The final kind of variable that I want to introduce is the formula. Formulas were originally introduced into R as a convenient way to specify a particular type of statistical model (see Chapter ??) but they’re such handy things that they’ve spread. Formulas are now used in a lot of different contexts, so it makes sense to introduce them early.
Stated simply, a formula object is a variable, but it’s a special type
of variable that specifies a relationship between other variables. A
formula is specified using the “tilde operator” ~. A very simple
example of a formula is shown below:
## out ~ pred
## <environment: 0x00000236bb752fe8>
The precise meaning of this formula depends on exactly what you want
to do with it, but in broad terms, it means “the out (outcome)
variable, analysed in terms of the pred (predictor) variable”. That
said, although the simplest and most common form of a formula uses the
“one variable on the left, one variable on the right” format, there are
others. For instance, the following examples are all reasonably common:
formula2 <- out ~ pred1 + pred2 # more than one variable on the right
formula3 <- out ~ pred1 * pred2 # different relationship between predictors
formula4 <- ~ var1 + var2 # a 'one-sided' formula
and there are many more variants besides. Formulas are pretty flexible things, and so different functions will make use of different formats, depending on what the function is intended to do. We will encounter formulas later.
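A formula can be stored and inspected like any other variable. In this sketch the name formula1 is an assumption; note that the variables named in a formula do not need to exist when the formula is created:

```r
formula1 <- out ~ pred            # out and pred need not exist yet
class(formula1)                   # a formula is its own kind of variable

formula2 <- out ~ pred1 + pred2
all.vars(formula2)                # names of the variables used in the formula

formula4 <- ~ var1 + var2         # one-sided: nothing to the left of ~
length(formula4)                  # 2 parts (~ and right side) instead of 3
```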
You want more structures? You got it! You want to identify and inspect a variable? You got it. Want to shapeshift variables? You got it!
Up to this point, we have encountered several different kinds of variables. At the simplest level, we’ve seen numeric data, logical data and character data. However, we’ve also encountered some more complicated kinds of variables, namely factors, formulas, data frames and lists (and I have mentioned arrays in passing). We’ll see a few more specialised data structures later on in this book.
For example, there is a class Date, which is for, well, dates (the chronological kind, not the romantic or botanical ones). Next, there is a class called table, which we will encounter when discussing tables in Section XXX. More generally, the output of a function (see Section XXX) is not guaranteed to belong to any of the classes we already encountered, and might have a class of its own. For example, in Section XXX, we will encounter the binom.test() function to perform a, well, binomial test. You can ignore the details for now, but do observe that the binom.test() function produces something we haven’t encountered before.
## [1] "htest"
The takeaway here is not that you should know anything about or even remember the mere existence of the htest class, but rather that it sometimes will pay off to inspect (or look up) the output of a function.
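The class() function is what reveals this. In the sketch below the arguments passed to binom.test() (7 successes out of 10 trials) are made up for the illustration:

```r
# A binomial test returns an object with a class of its own
result <- binom.test(x = 7, n = 10, p = 0.5)
class(result)

# The other classes mentioned in the text can be inspected the same way
class(Sys.Date())                  # today's date
class(table(c("a", "b", "a")))     # a frequency table
```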
Sometimes it will prove to be very handy to get some high-level information about the variables you are dealing with, without having to inspect them in detail. There are many functions that are helpful for this task. Here I demonstrate two: dim(), which reports the dimensions of an R object, and str(), which shows its internal structure.
## [1] 9 6
## [1] 9 7
## chr [1:9, 1:6] "17" "19" "21" "37" "18" "19" "47" "18" "19" "1" "1" "1" ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:6] "age" "gender" "group.f" "score" ...
## 'data.frame': 9 obs. of 7 variables:
## $ age : num 17 19 21 37 18 19 47 18 19
## $ gender : Factor w/ 3 levels "male","other",..: 1 1 1 2 1 3 3 3 3
## $ group.f: Factor w/ 3 levels "group 1","group 2",..: 1 1 1 2 2 2 3 3 3
## $ score : num 12 10 11 15 16 14 25 21 29
## $ hrslept: num 10 10 10 10 10 10 10 10 10
## $ zzz : num 6 7 8 7 6 5 4 3 10
## $ col : chr "win" "win" "win" "sum" ...
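A minimal sketch of both functions, using three-row stand-ins for exptM and expt:

```r
age        <- c(17, 19, 21)
collection <- c("win", "win", "sum")
exptM <- cbind(age, collection)        # character matrix
expt  <- data.frame(age, collection)   # data frame with mixed types

dim(exptM)   # number of rows, number of columns
dim(expt)

str(exptM)   # compact description of the internal structure
str(expt)    # for a data frame: one line per variable
```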
One problem that sometimes comes up in practice is that you forget what
you called all your variables. Normally you might try to type
ls(), but this command will not tell you what the names are for those variables inside a data frame! One way is to ask
R to tell you what the names of all the variables stored in the data
frame are, which you can do using the names() function:
## [1] "age" "gender" "group.f" "score" "hrslept" "zzz" "col"
Sadly, this doesn’t work for matrices:
## NULL
Computer says no.
You need dimnames() instead. Sigh.
## [[1]]
## NULL
##
## [[2]]
## [1] "age" "gender" "group.f" "score"
## [5] "hoursslept" "collection.short"
from which we learn that the rows have no names, and what the column names are.
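The different behaviour of names() and dimnames() can be sketched on a small example, with three-row stand-ins for expt and exptM:

```r
age   <- c(17, 19, 21)
score <- c(12, 10, 11)
expt  <- data.frame(age, score)
exptM <- cbind(age, score)

names(expt)       # column names of the data frame
names(exptM)      # NULL: a matrix has no names() of this kind

dimnames(exptM)   # list with row names (none here) and column names
colnames(exptM)   # shortcut for the second element
```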
Sometimes you want to change the variable class. This can happen for all sorts of reasons. Sometimes when you import data from files, it can come to you in the wrong format: numbers sometimes get imported as text, dates usually get imported as text, and many other possibilities besides. Regardless of how you’ve ended up in this situation, there’s a very good chance that sometimes you’ll want to convert a variable from one class into another one. Or, to use the correct term, you want to coerce the variable from one class into another. Coercion is a little tricky, and so I’ll only discuss the very basics here, using a few simple examples.
Firstly, let’s suppose we have a variable x that is supposed to be
representing a number, but the data file that you’ve been given has
encoded it as text. Let’s imagine that the variable is something like
this:
## [1] "character"
Obviously, if I want to do calculations using x in its current state,
R is going to get very annoyed at me. It thinks that x is text, so
it’s not going to allow me to try to do mathematics using it! Obviously,
we need to coerce x from character to numeric. We can do that in a
straightforward way by using the as.numeric() function.
Exercise: Coerce x from character to numeric, and make sure to
save the result again in x. Next, check the class of x and see
whether a simple calculation with x works without R complaining.
x <- as.numeric(x) # coerce the variable
class(x) # what class is it?
x + 1 # hey, addition works!
Not surprisingly, we can also convert it back again if we need to. The
function that we use to do this is the as.character() function:
## [1] "character"
However, there are some fairly obvious limitations: you can’t coerce the
string "hello world" into a number because, well, there isn’t a
number that corresponds to it. Or, at least, you can’t do anything
useful:
## Warning: NAs introduced by coercion
## [1] NA
In this case, R doesn’t give you an error message; it just gives you a
warning, and then says that the data is missing (see Section
6.1.2 for the interpretation of NA).
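The whole round trip can be sketched as follows. The starting value "100" is an assumption, and the name y is made up for the illustration:

```r
x <- "100"               # a number stored as text (assumed value)
class(x)

x <- as.numeric(x)       # coerce to numeric
x + 1                    # arithmetic now works

x <- as.character(x)     # and back to text again
class(x)

# Text that is not a number becomes NA; without suppressWarnings()
# R also prints the warning "NAs introduced by coercion"
y <- suppressWarnings(as.numeric("hello world"))
y
```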
That gives you a feel for how to change between numeric and character
data. What about logical data? To cover this briefly, coercing text to
logical data is pretty intuitive: you use the as.logical() function,
and the character strings "T", "TRUE", "True" and "true" all
convert to the logical value of TRUE. Similarly "F", "FALSE",
"False", and "false" all become FALSE. All other strings convert
to NA. When you go back the other way using as.character(), TRUE
converts to "TRUE" and FALSE converts to "FALSE". Converting
numbers to logicals – again using as.logical() – is straightforward.
Following the convention in the study of logic, the number 0 converts
to FALSE. Everything else is TRUE. Going back using as.numeric(),
FALSE converts to 0 and TRUE converts to 1.
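These rules are easy to verify for yourself. A short sketch, with example values chosen for the illustration:

```r
# Text to logical
as.logical(c("T", "TRUE", "True", "true"))      # all TRUE
as.logical(c("F", "FALSE", "False", "false"))   # all FALSE
as.logical("yes")                               # anything else gives NA

# Logical to text
as.character(c(TRUE, FALSE))

# Numbers to logical: 0 is FALSE, everything else is TRUE
as.logical(c(0, 1, -2.5))

# Logical to numbers
as.numeric(c(FALSE, TRUE))
```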
In Section ??, we have already seen how we can convert something to a factor: pretty straightforwardly, it uses the as.factor() function.
If you have done and understood all of this, you can find some additional exercises in the RIntroExercises document. Since you now know how to work in RStudio, why not go there and do them?
This is a somewhat strange chapter, even by my standards. My goal in this chapter is to talk a bit more honestly about the realities of working with data than you’ll see anywhere else in the book. The problem with real-world data sets is that they are Lola Young-esque messy. Very often the data file that you start out with doesn’t have the variables stored in the right format for the analysis you want to do. It’s not uncommon in real-world data analysis to find that one of your variables isn’t quite equivalent to the variable that you really want. For instance, you may need to convert a numeric variable into a different numeric variable (e.g., you may want to analyse the absolute value of the original variable). At other times, it’s convenient to take a continuous-valued variable (e.g., age) and break it up into a smallish number of categories (e.g., younger, middle, older). Sometimes you only want to analyse a subset of the data. Et cetera.
In other words, there’s a lot of data manipulation that you need to do, just to get your data set into the format that you need. The purpose of this chapter is to provide a basic introduction to some of these pragmatic topics. Although the chapter is motivated by the kinds of practical issues that arise when manipulating real data, I’ll stick with the practice that I’ve adopted through most of the book and rely on very small, toy data sets that illustrate the underlying issue.
Let’s introduce these data sets.
As a matrix, we’ll use exptM from earlier:
## age gender group.f score
## [1,] 17 1 1 12
## [2,] 19 1 1 10
## [3,] 21 1 1 11
## [4,] 37 2 2 15
## [5,] 18 1 2 16
## [6,] 19 3 2 14
## [7,] 47 3 3 25
## [8,] 18 3 3 21
## [9,] 19 3 3 29
For the data frame, let’s start with a simple example. As the father of a small child, I
naturally spend a lot of time watching TV shows like In the Night
Garden. I’ve transcribed a
short section of the dialogue. There are two
variables of interest, speaker and utterance, which are both simple vectors.
When we take a look at the data, it becomes very clear what happened to my sanity.
speaker <- c("upsydaisy", "upsydaisy", "upsydaisy", "upsydaisy", "tombliboo", "tombliboo", "makkapakka", "makkapakka", "makkapakka", "makkapakka")
utterance <- c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk")
## [1] "upsydaisy" "upsydaisy" "upsydaisy" "upsydaisy" "tombliboo"
## [6] "tombliboo" "makkapakka" "makkapakka" "makkapakka" "makkapakka"
## [1] "pip" "pip" "onk" "onk" "ee" "oo" "pip" "pip" "onk" "onk"
Let’s put all this into a nice
data frame to demonstrate stuff. Remember I need the print() function to make things look nice.
## speaker utterance
## 1 upsydaisy pip
## 2 upsydaisy pip
## 3 upsydaisy onk
## 4 upsydaisy onk
## 5 tombliboo ee
## 6 tombliboo oo
## 7 makkapakka pip
## 8 makkapakka pip
## 9 makkapakka onk
## 10 makkapakka onk
I’ll also use a slightly different data set, namely the garden data
frame. It extends the itng data frame with a third variable,
reflecting the mood of the character when speaking the utterance, on a 1-3 scale.
## speaker utterance mymood
## 1 upsydaisy pip 2
## 2 upsydaisy pip 1
## 3 upsydaisy onk 1
## 4 upsydaisy onk 3
## 5 tombliboo ee 2
## 6 tombliboo oo 2
## 7 makkapakka pip 1
## 8 makkapakka pip 1
## 9 makkapakka onk 2
## 10 makkapakka onk 3
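Both data frames can be rebuilt as follows. It is a sketch: the mood values are copied from the printout, and rep() is used only as a compact way to repeat each speaker name:

```r
speaker   <- rep(c("upsydaisy", "tombliboo", "makkapakka"), times = c(4, 2, 4))
utterance <- c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk")

# The basic data frame with two variables
itng <- data.frame(speaker, utterance)
print(itng)

# The extended version with a third variable on a 1-3 scale
mymood <- c(2, 1, 1, 3, 2, 2, 1, 1, 2, 3)
garden <- data.frame(speaker, utterance, mymood)
print(garden)
```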
Since having a proper name for things can dramatically simplify data handling, it is only fitting I start the data handling section with an explanation of how to name things. I generally prefer having meaningful names attached to my variables.
We have briefly seen the names() function as a way to get R to show the names stored in a data frame. In true R fashion, it can do more than that: it
can also be used for assigning names to, for example, vector elements.
One thing that is sometimes a little unsatisfying about the way that R prints out a vector is that the elements come out unlabelled. Here’s what I mean. Suppose I’ve got data reporting the quarterly profits for some company. If I just create a no-frills vector, I have to rely on memory to know which element corresponds to which event. That is:
## [1] 3.1 0.1 -1.4 1.1
You can probably guess that the first element corresponds to the first
quarter, the second element to the second quarter, and so on, but that’s
only because I’ve told you the back story and because this happens to be
a very simple example. In general, it can be quite difficult. This is
where it can be helpful to assign names to each of the elements.
Here’s how you do it:
## Q1 Q2 Q3 Q4
## 3.1 0.1 -1.4 1.1
This is a slightly odd-looking command, admittedly, but it’s not too
difficult to follow. All we’re doing is assigning a vector of labels
(character strings) to names(profit).
It’s also worth noting that you don’t have to do this as a two-stage process. You can get the same result with this command:
## Q1 Q2 Q3 Q4
## 3.1 0.1 -1.4 1.1
The important things to notice are that (a) this does make things much
easier to read, but (b) the names at the top aren’t the “real” data. The
value of profit[1] is still 3.1; all I’ve done is added a name
to profit[1] as well.
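The commands behind the output above are not shown, so here is a minimal sketch of what they might look like. The numbers and labels are taken from the printed output; the name profit2 is my own, used only to compare the two approaches.

```r
# Two-stage approach: create the vector first, then attach the labels
profit <- c(3.1, 0.1, -1.4, 1.1)
names(profit) <- c("Q1", "Q2", "Q3", "Q4")
print(profit)

# One-stage approach: label the elements while creating the vector
# (profit2 is a hypothetical name, used here only for comparison)
profit2 <- c(Q1 = 3.1, Q2 = 0.1, Q3 = -1.4, Q4 = 1.1)

# Both routes give the same named vector
print(identical(profit, profit2))

# The names are only labels: the first value is still 3.1
print(unname(profit[1]))
```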
We could delete these if we wanted by assigning NULL to names(profit):
## [1] 3.1 0.1 -1.4 1.1
But let’s give them back, for future exercises:
What about naming matrices, you ask? First, let’s create one:
row.1 <- c( 2,3,1 ) # create data for row 1
row.2 <- c( 5,6,7 ) # create data for row 2
M <- rbind( row.1, row.2 ) # row bind them into a matrix
M # and print it out...
## [,1] [,2] [,3]
## row.1 2 3 1
## row.2 5 6 7
Notice that, when we bound the two vectors together, R retained the names of the original variables as row names. In fact, let’s also add some highly unimaginative column names as well:
## col.1 col.2 col.3
## row.1 2 3 1
## row.2 5 6 7
If we want to change the row names, we could of course use something like this:
## col.1 col.2 col.3
## bettername.1 2 3 1
## bettername.2 5 6 7
So, you can add names to a matrix by using the rownames() and
colnames() functions.
Let’s admire the result of our hard work:
## [1] "bettername.1" "bettername.2"
## [1] "col.1" "col.2" "col.3"
## [[1]]
## [1] "bettername.1" "bettername.2"
##
## [[2]]
## [1] "col.1" "col.2" "col.3"
## NULL
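The outputs above are consistent with commands along the following lines. This is only a sketch: the matrix M and the labels come from the text, but the exact commands that were used are my assumption.

```r
# Rebuild the matrix from the two row vectors
row.1 <- c(2, 3, 1)
row.2 <- c(5, 6, 7)
M <- rbind(row.1, row.2)

# Assign column names, then overwrite the row names inherited from rbind()
colnames(M) <- c("col.1", "col.2", "col.3")
rownames(M) <- c("bettername.1", "bettername.2")

# Inspect the result
print(rownames(M))
print(colnames(M))
print(dimnames(M))  # a list: row names first, then column names
print(names(M))     # a matrix has dimnames, not names, so this is NULL
```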
With a data frame, we can proceed just as with a matrix:
rownames(garden) <- c("case.1", "case.2", "case.3", "case.4", "case.5", "case.6", "case.7",
                      "case.8", "case.9", "case.10")
colnames(garden)[3] <- "mood"
Let’s check in awe:
## [1] "case.1" "case.2" "case.3" "case.4" "case.5" "case.6" "case.7"
## [8] "case.8" "case.9" "case.10"
## [1] "speaker" "utterance" "mood"
## [[1]]
## [1] "case.1" "case.2" "case.3" "case.4" "case.5" "case.6" "case.7"
## [8] "case.8" "case.9" "case.10"
##
## [[2]]
## [1] "speaker" "utterance" "mood"
## [1] "speaker" "utterance" "mood"
One very important kind of data handling is being able to extract a particular subset of the data. For instance, you might be interested only in analysing the data from one experimental condition, or you may want to look closely at the data from people over 50 years in age. To do this, the first step is getting R to extract the subset of the data corresponding to the observations that you’re interested in.
One very useful thing we can do is pull out more than one element at a
time.
To refresh your memory, this is what the sales.by.month vector looks
like:
So, suppose I wanted the data for February, March and April. What I
could do is use the vector c(2,3,4) to indicate which elements I want
R to pull out. That is, I’d type this:
## [1] 100 200 50
Notice that the order matters here. If I asked for the data in the
reverse order (i.e., April first, then March, then February) by using
the vector c(4,3,2), then R outputs the data in the reverse order:
## [1] 50 200 100
A second thing to remember (see Section ??) is that R provides you with handy
shortcuts for very common situations. For instance, suppose that I
wanted to extract everything from the 2nd month through to the 8th
month. One way to do this is to do the same thing I did above, and use
the vector c(2,3,4,5,6,7,8) to indicate the elements that I want.
## [1] 100 200 50 0 0 0 0
That works just fine, but it’s kind of a lot of typing. To help make this easier, R lets you use 2:8 as shorthand for c(2,3,4,5,6,7,8), which makes things a lot simpler (see Section ??).
Let’s check that we can use the 2:8 shorthand as a way to pull out
the 2nd through 8th elements of sales.by.month:
## [1] 100 200 50 0 0 0 0
So that’s kind of neat.
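To make the steps above concrete, here is a small sketch that pulls out several elements at once. The sales.by.month values are the ones used in this chapter; the names on the left-hand side (feb.to.apr and so on) are made up for the illustration.

```r
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)

# February, March and April, in that order
feb.to.apr <- sales.by.month[c(2, 3, 4)]
print(feb.to.apr)

# The order of the index vector determines the order of the output
apr.to.feb <- sales.by.month[c(4, 3, 2)]
print(apr.to.feb)

# 2:8 is shorthand for c(2, 3, 4, 5, 6, 7, 8)
long.form  <- sales.by.month[c(2, 3, 4, 5, 6, 7, 8)]
short.form <- sales.by.month[2:8]
print(identical(long.form, short.form))
```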
Remember from Section ?? how we added names to vector elements? Names aren’t purely cosmetic, since R allows you to pull out particular elements of the vector by referring to their names:
## Q1
## 3.1
Also note that I (well, you; well, R) need the quotation marks:
## Error in eval(expr, envir, enclos): object 'Q1' not found
Exercise: Pull out the names by typing the command names(profit).
Perhaps unsurprisingly, you can use names to extract multiple elements from a vector.
## Q1 Q4
## 3.1 1.1
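A short sketch of indexing by name, assuming the named profit vector from earlier in this chapter:

```r
profit <- c(Q1 = 3.1, Q2 = 0.1, Q3 = -1.4, Q4 = 1.1)

# A single element, selected by its name (the quotation marks are required)
first.quarter <- profit["Q1"]
print(first.quarter)

# Several elements at once, using a character vector of names
first.and.last <- profit[c("Q1", "Q4")]
print(first.and.last)

# Without quotation marks, R looks for a variable called Q1.
# If no such variable exists, profit[Q1] stops with an "object not found" error.
```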
At this point, I can introduce an extremely useful tool called
logical indexing. What I’d like to do is to have R select the
names of the months for which I sold any books.
This is where logical indexing is handy. Here’s how it can be done.
Remember that earlier on, I created a vector sales.by.month that contains the number of books I sold each month, and a vector months that contains the names of each of the months.
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
months <- c("January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November",
            "December")
We will use these to answer my question.
The first step is to create a logical vector any.sales.this.month, whose
elements are TRUE for any month in which I sold at least one book, and
FALSE for all the others.
## [1] FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
So, any.sales.this.month is a logical vector whose elements are TRUE only if the corresponding element of sales.by.month is greater than zero. For instance, since I sold zero books in January, the first element is FALSE.
In the second step, I use the logical vector any.sales.this.month to select those elements of the months variable for which any sales were made. It looks like this:
## [1] "February" "March" "April"
To figure out which elements of months to include in the output, what
R does is look to see if the corresponding element in
any.sales.this.month is TRUE. Thus, since element 1 of
any.sales.this.month is FALSE, R does not include "January" as
part of the output; but since element 2 of any.sales.this.month is
TRUE, R does include "February" in the output. So there you have it: the list of months in which I sold at least one book.
I showed you how to do it step by step, but in fact, I could have just done this, using a single line, and would have gotten exactly the same result:
## [1] "February" "March" "April"
Note that the sales.by.month > 0 is the same logical expression that
we used to create the any.sales.this.month vector.
There’s no reason why I can’t use the same approach to find the actual sales numbers for those months. The command to do that would just be this:
## [1] 100 200 50
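Putting the two steps together, a minimal sketch of logical indexing could look like this. I use R's built-in month.name constant to fill the months vector, which is a shortcut of mine rather than what the text does.

```r
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
months <- month.name  # built-in constant holding "January" ... "December"

# Step 1: TRUE for every month with at least one sale
any.sales.this.month <- sales.by.month > 0
print(any.sales.this.month)

# Step 2: keep only the months where the logical vector is TRUE
print(months[any.sales.this.month])

# Both steps in a single line
selling.months <- months[sales.by.month > 0]
print(selling.months)

# The same logical expression can index the sales figures themselves
nonzero.sales <- sales.by.month[sales.by.month > 0]
print(nonzero.sales)
```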
In fact, we can take the same approach with text. Here’s an example. Suppose I want to know the months for which the bookshop was out of my book. I could apply the logical indexing approach, but with the character vector stock.levels we defined earlier. Let’s refresh first:
stock.levels <- c("high", "high", "low", "out", "out", "high",
"high", "high", "high", "high", "high", "high")
stock.levels
## [1] "high" "high" "low" "out" "out" "high" "high" "high" "high" "high"
## [11] "high" "high"
It could look something like this:
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1] "April" "May"
or, if you want to do everything in one go:
## [1] "April" "May"
Alternatively, if I want to know when the bookshop was either low on copies or out of copies, I could do this:
## [1] "March" "April" "May"
or, equivalently,
## [1] "March" "April" "May"
or, equivalently,
## [1] "March" "April" "May"
Either way, I get the answer I want.
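The commands that produced these outputs are not shown. The sketch below gives three expressions that return the same months; whether they are the three variants originally intended is an assumption on my part.

```r
stock.levels <- c("high", "high", "low", "out", "out", "high",
                  "high", "high", "high", "high", "high", "high")
months <- month.name  # built-in constant with the English month names

# Months in which the shop was out of stock
out.months <- months[stock.levels == "out"]
print(out.months)

# Months in which the shop was low OR out: three equivalent formulations
v1 <- months[stock.levels == "out" | stock.levels == "low"]
v2 <- months[stock.levels != "high"]            # relies on only three levels existing
v3 <- months[stock.levels %in% c("out", "low")]
print(v1)
```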
At this point, I hope you can see why logical indexing is such a useful thing. It’s a very basic, yet very powerful way to manipulate data.
It does take a bit of practice to become completely comfortable using logical indexing.
One final thing to note is how NA (see Section XXX) works when indexing:
## [1] 1 NA
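The command behind this output is not shown. One way to get a result like it, using a made-up example vector x, is sketched here: an NA in the index produces an NA in the result.

```r
x <- c(1, 2, 3)  # hypothetical example vector

# Numeric index containing NA: the unknown position yields NA
by.number <- x[c(1, NA)]
print(by.number)

# Logical index containing NA: TRUE keeps, FALSE drops, NA yields NA
by.logical <- x[c(TRUE, NA, FALSE)]
print(by.logical)
```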
Before moving on, there’s a nice trick worth mentioning: using
negative values as indices.
As explained above, we
can use a vector of numbers to extract a set of elements that we would
like to keep. For instance, suppose I want to keep only elements 2 and 3
from sales.by.month. I could do so like this:
## [1] 100 200
But suppose, on the other hand, that I have discovered that observations 2 and 3 are untrustworthy, and I want to keep everything except those two elements. To that end, R lets you use negative numbers to remove specific values, like so:
## [1] 0 50 0 0 0 0 0 0 0 0
The output here corresponds to element 1 of the original vector, followed by elements 4, 5, and so on. When all you want to do is remove a few cases, this is a very handy convention.
Exercise: Use negative indexing to drop the 1st and 3rd elements of profit. Print the complete profit vector as well to check your answer.
profit[ -c(1,3) ]
Of course, you can also drop elements using logical indexing. For example, if you only want to see the sales which were non-zero, you can drop the depressing zeros as follows:
## [1] 100 200 50
Can you use names for that? You wish
## Error in -c("Q1", "Q3"): invalid argument to unary operator
If you want to use names to drop elements from a vector, you need to do something nifty like this:
## Q2 Q4
## 0.1 1.1
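Here is a sketch of the different ways of dropping elements discussed above. The last command is one possible "nifty" solution for dropping by name; I am assuming it, since the original command is not shown.

```r
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
profit <- c(Q1 = 3.1, Q2 = 0.1, Q3 = -1.4, Q4 = 1.1)

# Positive indices keep elements, negative indices drop them
kept    <- sales.by.month[c(2, 3)]
dropped <- sales.by.month[-c(2, 3)]
print(dropped)

# Dropping with a logical condition: keep everything that is not zero
nonzero <- sales.by.month[sales.by.month != 0]
print(nonzero)

# Names cannot be negated, but a logical test on names(profit) works
without.q1.q3 <- profit[!(names(profit) %in% c("Q1", "Q3"))]
print(without.q1.q3)
```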
So far, whenever I’ve been subsetting a vector, I’ve
tended to use the square brackets [] to do so. You might be wondering whether it is possible to use the square brackets to subset a matrix. The
answer, of course, is yes. Not only can you use square brackets for
this purpose, as you become more familiar with R you’ll find that this
is actually very useful.
Let’s assume that what we want to do is to pick out rows 2, 6 and 9 and columns 1 and 2 (variables
age and gender) from the exptM matrix. How shall we do this? Since a
matrix is basically a table, every element in the matrix has a
row number and a column number. So, if we want to pick out a single
element, we have to specify the row number and a column number within
the square brackets. By convention, the row number comes first.
This means that, for a matrix which has, say, 2 rows and 3 columns, the numerical indexing scheme looks like this:
| Row | Col.1 | Col.2 | Col.3 |
|---|---|---|---|
| Row 1 | [1,1] | [1,2] | [1,3] |
| Row 2 | [2,1] | [2,2] | [2,3] |
Let’s now aim to pull out three rows (2, 6 and 9) and two columns (age and gender). This is fairly simple to do since R allows us to specify multiple rows and multiple columns.
## age gender
## [1,] 19 1
## [2,] 19 3
## [3,] 19 3
Note that if I only select one column, R will not print it as a column anymore:
## [1] 1 3 3
R has printed the results horizontally, not vertically. The reason for this relates to how matrices are implemented. The original matrix exptM is treated as a two-dimensional object, containing 9 rows and 4 columns. However, whenever you pull out a single row or a single column, the result is considered to be one-dimensional. As far as R is concerned there’s no real reason to distinguish between a one-dimensional object printed vertically (a column) and a one-dimensional object printed horizontally (a row), and it prints them all out horizontally.
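To try this yourself, here is a sketch that rebuilds exptM from the values printed in this section. How exptM was originally created is not shown here, so the cbind() construction is my assumption.

```r
# Rebuild the 9 x 4 matrix from the printed values (construction is assumed)
exptM <- cbind(
  age     = c(17, 19, 21, 37, 18, 19, 47, 18, 19),
  gender  = c(1, 1, 1, 2, 1, 3, 3, 3, 3),
  group.f = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  score   = c(12, 10, 11, 15, 16, 14, 25, 21, 29)
)

# Rows 2, 6 and 9 of columns 1 and 2: the result is still a matrix
two.cols <- exptM[c(2, 6, 9), c(1, 2)]
print(two.cols)

# Only one column: the result is an ordinary one-dimensional vector
one.col <- exptM[c(2, 6, 9), 2]
print(one.col)
print(dim(one.col))  # NULL, because a plain vector has no dimensions
```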
A second way to do the same thing is to use the names of the rows and
columns. That is, instead of using the row numbers and column numbers,
you use the character strings that are used as the labels for the rows
and columns. To apply this idea to our exptM matrix, we would use
a command like this:
## age gender
## [1,] 19 1
## [2,] 19 3
## [3,] 19 3
Once again, this produces exactly the same output. Note that, although this version is more annoying to type than the previous version, it’s a bit easier to read, because it’s often more meaningful to refer to the elements by their names rather than their numbers. If we had assigned names to our rows, we could have used those as well.
Also note that I (well, you; well, R) need the quotation marks. This
## [1] 19 19 19
selects the column called age, whereas this does not:
## Error in exptM[c(2, 6, 9), age]: subscript out of bounds
The reason is that in the second case, R is looking for (and when found, using) the object called age, whereas in the first it knows it just needs to look for the name age.
Finally, both the rows and columns can be indexed using logical vectors as well. For example, although I claimed earlier that my goal was to extract rows 2, 6 and 9, what I really wanted to do was select the 19-year-olds. So what I could have done is create a logical vector that indicates which rows correspond to 19-year-olds:
## [1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
Okay, I must admit I have lost some sleep over this. To understand what is going on, you actually should have read Section 7.3.4 first. But moving the current section after that section would kind of break the elegance of the composition. So I am leaving it here, but you might want to revisit it after having read Section 7.3.4.
As you can see, the 2nd, 6th and 9th elements of this vector are TRUE while
the others are FALSE. Now that I’ve constructed this “indicator”
variable, what I can do is use this vector to select the rows that I
want to keep:
## age gender
## [1,] 19 1
## [2,] 19 3
## [3,] 19 3
And of course, the output is, yet again, the same.
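A sketch of the logical-indexing route for a matrix follows. The exptM values are rebuilt from the printed output, and the name is.19 for the indicator variable is hypothetical.

```r
exptM <- cbind(
  age     = c(17, 19, 21, 37, 18, 19, 47, 18, 19),
  gender  = c(1, 1, 1, 2, 1, 3, 3, 3, 3),
  group.f = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  score   = c(12, 10, 11, 15, 16, 14, 25, 21, 29)
)

# Indicator variable: TRUE for every row where age equals 19
is.19 <- exptM[, "age"] == 19
print(is.19)

# Use the indicator for the rows and names for the columns
nineteen <- exptM[is.19, c("age", "gender")]
print(nineteen)

# Same rows as the numeric version
print(identical(nineteen, exptM[c(2, 6, 9), c(1, 2)]))
```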
What if you want to keep all of the rows, or all of the columns? This is a prime example of less is more: by giving fewer numbers as input, you get more output. To do this, all we have to do is leave the corresponding entry blank, but it is crucial to remember to keep the comma! In particular, exptM[2,] pulls out the entire 2nd row, and exptM[,3] pulls out the entire 3rd column. Just watch.
## age gender group.f score
## 19 1 1 10
## [1] 1 1 1 2 2 2 3 3 3
You can pull out more than a single column or row at once:
## age gender
## [1,] 17 1
## [2,] 19 1
## [3,] 21 1
## [4,] 37 2
## [5,] 18 1
## [6,] 19 3
## [7,] 47 3
## [8,] 18 3
## [9,] 19 3
Alternatively, if I want to keep all the columns but only want some of the rows, I use the same trick, but this time I leave the second index blank. For example, to select the 5th and 6th row of exptM, while keeping all the columns, you could do as follows:
## age gender group.f score
## [1,] 18 1 2 16
## [2,] 19 3 2 14
Don’t be mistaken: the fact that one element is empty does not mean it does not do anything. Quite to the contrary: it does a lot, by signaling a whole row or column is selected.
I feel I should note that it’s still okay to use negative indexes as a way of telling R to delete certain rows or columns, just like I told you when discussing subsetting vectors. For instance, if I want to delete the 3rd column, then I use this command:
## age gender score
## [1,] 17 1 12
## [2,] 19 1 10
## [3,] 21 1 11
## [4,] 37 2 15
## [5,] 18 1 16
## [6,] 19 3 14
## [7,] 47 3 25
## [8,] 18 3 21
## [9,] 19 3 29
whereas if I want to delete the 3rd and 5th row, then I’d use this one:
## age gender group.f score
## [1,] 17 1 1 12
## [2,] 19 1 1 10
## [3,] 37 2 2 15
## [4,] 19 3 2 14
## [5,] 47 3 3 25
## [6,] 18 3 3 21
## [7,] 19 3 3 29
So that’s nice.
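The blank-index and negative-index commands for matrices can be sketched as follows, again with exptM rebuilt from the printed values (an assumption about how it was made). The variable names are mine.

```r
exptM <- cbind(
  age     = c(17, 19, 21, 37, 18, 19, 47, 18, 19),
  gender  = c(1, 1, 1, 2, 1, 3, 3, 3, 3),
  group.f = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  score   = c(12, 10, 11, 15, 16, 14, 25, 21, 29)
)

row2      <- exptM[2, ]         # entire 2nd row (blank column index)
col3      <- exptM[, 3]         # entire 3rd column (blank row index)
first.two <- exptM[, 1:2]       # all rows, first two columns
rows.5.6  <- exptM[5:6, ]       # rows 5 and 6, all columns
no.col3   <- exptM[, -3]        # everything except the 3rd column
no.rows   <- exptM[-c(3, 5), ]  # everything except rows 3 and 5

print(row2)
print(no.rows)
```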
Above, we have always been using two indexes, one for the row and one for the column (and even when we did not, we left one of them empty). For completeness (but this is not for studying), I have to mention that there is also a way of using only a single index to extract an element from a matrix. The single-index approach is illustrated in Table 7.2. The value of each cell is its index.
| Row | Col.1 | Col.2 | Col.3 |
|---|---|---|---|
| Row 1 | 1 | 3 | 5 |
| Row 2 | 2 | 4 | 6 |
Confirm that exptM[2,4] (using the double index approach) is identical to
exptM[29] (using the single index approach).
## score
## 10
## [1] 10
You don’t need to study this. I just tell you this because I want you to avoid questioning your sanity if you ever come across a matrix indexed by a single number.
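For the curious, here is a small sketch of why exptM[2,4] and exptM[29] agree. R counts down the first column, then the second, and so on, so with 9 rows the element in row 2 of column 4 has single index (4 - 1) * 9 + 2 = 29. The exptM values are rebuilt from the printed output.

```r
exptM <- cbind(
  age     = c(17, 19, 21, 37, 18, 19, 47, 18, 19),
  gender  = c(1, 1, 1, 2, 1, 3, 3, 3, 3),
  group.f = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  score   = c(12, 10, 11, 15, 16, 14, 25, 21, 29)
)

double.index <- exptM[2, 4]  # row 2, column 4
single.index <- exptM[29]    # 29th element, counting column by column

# Convert (row, column) to the single index by hand
position <- (4 - 1) * nrow(exptM) + 2

print(double.index)
print(single.index)
print(position)
```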
In this section, we turn to the question of how to subset a data frame rather than a vector or matrix. To that end, the first thing I should point out is that, if all you want to do is subset one of the variables inside the data frame, then, as per Section XXX, the $ operator is your friend (as long as you are after a column, which is the common way to store variables). For instance, if I’m working with the itng data frame, I can restrict myself to just the utterances as follows:
## [1] "pip" "pip" "onk" "onk" "ee" "oo" "pip" "pip" "onk" "onk"
It is possible, but unneeded, and therefore seldom used, to do this:
## [1] "pip" "pip" "onk" "onk" "ee" "oo" "pip" "pip" "onk" "onk"
While fun, this approach is limited to situations where we want all cases from a single variable only.
Apart from that, things are pretty similar to how we dealt with matrices (see Section XXX), although there are some differences, as you will discover below.
So far, whenever I’ve been subsetting a vector or matrix I’ve
tended to use the square brackets [] to do so. Given there are similarities between data frames and matrices, you might be wondering whether it is possible to use the square brackets to subset a data frame. The
answer, of course, is yes. Not only can you use square brackets for
this purpose, as you become more familiar with R you’ll find that this
is actually very useful. Unfortunately, the use of square brackets for this purpose is somewhat complicated and can be very confusing to novices. So be warned: this section is more complicated than it feels like it “should” be. With that warning in
place, I’ll try to walk you through it slowly.
Let’s assume that what we want to do is to pick out rows 5 and 6 (the
two cases when Tombliboo is speaking), and columns 1 and 2 (variables
speaker and utterance) from the garden data frame. How shall we do this? Since a
data frame is basically a table, every element in the data frame has a
row number and a column number. So, if we want to pick out a single
element, we have to specify the row number and a column number within
the square brackets. By convention, the row number comes first.
This means that, for a data frame which has, say, 5 rows and 3 columns, the numerical indexing scheme looks like this, much like you would expect for a matrix:
| row | col1 | col2 | col3 |
|---|---|---|---|
| 1 | [1,1] | [1,2] | [1,3] |
| 2 | [2,1] | [2,2] | [2,3] |
| 3 | [3,1] | [3,2] | [3,3] |
| 4 | [4,1] | [4,2] | [4,3] |
| 5 | [5,1] | [5,2] | [5,3] |
Let’s now aim to solve our problem, which is to pull out two rows (5 and 6) and two columns (1 and 2). This is fairly simple to do since R allows us to specify multiple rows and multiple columns.
Exercise: Pull out two rows (5 and 6) and two columns (1 and 2) from
the data frame garden. You can use print() to make it look nice.
# The `:` operator can be used to select more than one element.
print( garden[ 5:6, 1:2 ] )
Clearly, that’s exactly what we asked for: the output here is a data
frame containing two variables and two cases. Note that I could have
gotten the same answer if I’d used the c() function to produce my
vectors rather than the : operator. That is, the following command is
equivalent to the last one:
## speaker utterance
## case.5 tombliboo ee
## case.6 tombliboo oo
It’s just not as pretty. However, if the columns and rows that you want
to keep don’t happen to be next to each other in the original data
frame, then you might find that you have to resort to using commands
like garden[ c(2,4,5), c(1,3) ] to extract them.
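If you want to run these commands outside the tutorial, the sketch below rebuilds garden from the values printed earlier. The data.frame() call is my reconstruction, not necessarily how garden was originally made.

```r
# Reconstruct the garden data frame (construction is assumed)
garden <- data.frame(
  speaker   = rep(c("upsydaisy", "tombliboo", "makkapakka"), times = c(4, 2, 4)),
  utterance = c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk"),
  mood      = c(2, 1, 1, 3, 2, 2, 1, 1, 2, 3),
  row.names = paste0("case.", 1:10)
)

# Rows 5 and 6, columns 1 and 2, using the : operator
with.colon <- garden[5:6, 1:2]

# The same selection, using c() to build the index vectors
with.c <- garden[c(5, 6), c(1, 2)]
print(with.c)

# Rows and columns that are not next to each other require c()
scattered <- garden[c(2, 4, 5), c(1, 3)]
print(scattered)
```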
A second way to do the same thing is to use the names of the rows and
columns. That is, instead of using the row numbers and column numbers,
you use the character strings that are used as the labels for the rows
and columns. To apply this idea to our garden data frame, we would use
a command like this:
## speaker utterance
## case.5 tombliboo ee
## case.6 tombliboo oo
Once again, this produces exactly the same output. Note that, although this version is more annoying to type than the previous version, it’s a bit easier to read, because it’s often more meaningful to refer to the elements by their names rather than their numbers.
Also, note that you don’t have to use the same convention for the rows
and columns. For instance, I often find that the variable names are
meaningful and so I sometimes refer to them by name, whereas the row
names are pretty arbitrary so it’s easier to refer to them by number. In
fact, that’s more or less exactly what’s happening with the garden
data frame.
Exercise: Pull out two rows (5 and 6) and two columns (speaker and
utterance) from the data frame garden, using the variable names for
the columns and referring to the rows by number.
print( garden[ 5:6, c("speaker", "utterance") ] )
Again, the output is identical.
Like with matrices, quotation marks are strictly needed:
## speaker utterance
## case.7 makkapakka pip
vs
## Error in `[.data.frame`(garden, case.7, c(1, 2)): object 'case.7' not found
Finally, both the rows and columns can be indexed using logical vectors as well. For example, although I claimed earlier that my goal was to extract cases 5 and 6, it’s pretty obvious that what I really wanted to do was select the cases where Tombliboo is speaking. So what I could have done is create a logical vector that indicates which cases correspond to Tombliboo speaking:
## [1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
As you can see, the 5th and 6th elements of this vector are TRUE while
the others are FALSE. Now that I’ve constructed this “indicator”
variable, what I can do is use this vector to select the rows that I
want to keep:
## speaker utterance
## case.5 tombliboo ee
## case.6 tombliboo oo
And of course, the output is, yet again, the same.
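The three ways of selecting Tombliboo's cases (by number, by name and by logical vector) can be compared side by side. This is a sketch with garden rebuilt from the printed values; is.tombliboo is a hypothetical name for the indicator variable.

```r
garden <- data.frame(
  speaker   = rep(c("upsydaisy", "tombliboo", "makkapakka"), times = c(4, 2, 4)),
  utterance = c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk"),
  mood      = c(2, 1, 1, 3, 2, 2, 1, 1, 2, 3),
  row.names = paste0("case.", 1:10)
)

# By number
by.number <- garden[5:6, 1:2]

# By name (quotation marks required)
by.name <- garden[c("case.5", "case.6"), c("speaker", "utterance")]

# By logical vector: TRUE for the cases where Tombliboo is speaking
is.tombliboo <- garden$speaker == "tombliboo"
print(is.tombliboo)
by.logical <- garden[is.tombliboo, c("speaker", "utterance")]
print(by.logical)
```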
What if you want to keep all of the rows, or all of the
columns? If you have been paying attention when we were using matrices, you know exactly what to do. To do this, all we have to do is leave the corresponding entry blank, but it is crucial to remember to keep the comma! For instance, suppose I want to keep all the rows in the garden data, but I only want to retain the first two columns. The easiest way to do this is to
use a command like this:
## speaker utterance
## case.1 upsydaisy pip
## case.2 upsydaisy pip
## case.3 upsydaisy onk
## case.4 upsydaisy onk
## case.5 tombliboo ee
## case.6 tombliboo oo
## case.7 makkapakka pip
## case.8 makkapakka pip
## case.9 makkapakka onk
## case.10 makkapakka onk
Alternatively, if I want to keep all the columns but only want some of the rows, I use the same trick, but this time I leave the second index blank.
Exercise: Select the 5th and 6th row of garden, while keeping all
the columns.
print( garden[5:6, ] )
Of course, you can also use names. For example:
## speaker mood
## case.1 upsydaisy 2
## case.2 upsydaisy 1
## case.3 upsydaisy 1
## case.4 upsydaisy 3
## case.5 tombliboo 2
## case.6 tombliboo 2
## case.7 makkapakka 1
## case.8 makkapakka 1
## case.9 makkapakka 2
## case.10 makkapakka 3
Note again that this doesn’t work:
## Error in `[.data.frame`(garden, , c(speaker, mood)): object 'mood' not found
I feel I should note that it’s still okay to use negative indexes as a way of telling R to delete certain rows or columns, just like I told you when discussing subsetting vectors and matrices. For instance, if I want to delete the 3rd column, then I use this command:
## speaker utterance
## case.1 upsydaisy pip
## case.2 upsydaisy pip
## case.3 upsydaisy onk
## case.4 upsydaisy onk
## case.5 tombliboo ee
## case.6 tombliboo oo
## case.7 makkapakka pip
## case.8 makkapakka pip
## case.9 makkapakka onk
## case.10 makkapakka onk
whereas if I want to delete the 3rd and 5th row, then I’d use this one:
## speaker utterance mood
## case.1 upsydaisy pip 2
## case.2 upsydaisy pip 1
## case.4 upsydaisy onk 3
## case.6 tombliboo oo 2
## case.7 makkapakka pip 1
## case.8 makkapakka pip 1
## case.9 makkapakka onk 2
## case.10 makkapakka onk 3
So that’s nice.
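As a recap, the blank-index and negative-index commands for data frames might look as follows. It is a sketch with garden rebuilt from the printed values, and the names on the left are mine.

```r
garden <- data.frame(
  speaker   = rep(c("upsydaisy", "tombliboo", "makkapakka"), times = c(4, 2, 4)),
  utterance = c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk"),
  mood      = c(2, 1, 1, 3, 2, 2, 1, 1, 2, 3),
  row.names = paste0("case.", 1:10)
)

first.two.cols <- garden[, 1:2]                   # all rows, first two columns
rows.5.6       <- garden[5:6, ]                   # rows 5 and 6, all columns
by.name        <- garden[, c("speaker", "mood")]  # columns selected by name
no.mood        <- garden[, -3]                    # drop the 3rd column
no.rows.3.5    <- garden[-c(3, 5), ]              # drop the 3rd and 5th row

print(no.rows.3.5)
```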
The “double index” approach, where you specify (or leave blank) what you want for the row and for the column, is fairly straightforward. Or so you think.
There is a fairly useful elaboration on this double index approach
that I should point out: something called dropping.31
At this point, some of you might
be wondering why I’ve been so terribly careful to choose my examples in
such a way as to ensure that the output always has multiple rows and
multiple columns. The reason for this is that I’ve been trying to hide
the somewhat curious “dropping” behaviour that R produces when the
output only has a single column. I’ll start by showing you what happens,
and then I’ll try to explain it.
Firstly, let’s have a look at what happens when the output contains only a single row:
## speaker utterance mood
## case.5 tombliboo ee 2
This is exactly what you’d expect to see: a data frame containing three variables, and only one case per variable. Okay, no problems so far. What happens when you ask for a single column? Suppose, for instance, I try this as a command:
Please, R? Are you being serious right now? You cold-hearted manipulative little liar. Based on everything that I’ve shown you so far, you would be well within your rights to expect to see R produce a data frame containing a single
variable (i.e., mood) and ten cases. After all, that is pretty consistent
with everything else that I’ve shown you so far about how square
brackets work. In other words, you should expect to see this:
mood
case.1 2
case.2 1
case.3 1
case.4 3
case.5 2
case.6 2
case.7 1
case.8 1
case.9 2
case.10 3
However, that is emphatically not what happens.
Exercise: See what you get when selecting the 3rd column of
garden.
print(garden[ , 3 ])
That output is not a data frame at all! That’s just an ordinary, plain old numeric vector containing 10 elements. Before you start thinking that R can’t love anyone, cause that would mean it’d have a heart, let us hear his take on the story. As any person with a narcissistic personality will tell, R would probably tell you that although it does not look like it, what’s going on here is that R is trying to be smart and helpful. Now, R has “noticed” that the output that we’ve asked for doesn’t really “need” to be wrapped up in a data frame at all, because it only corresponds to a single variable. So what it does is “drop” the output from a data frame containing a single variable, “down” to a simpler output that corresponds to that variable. This behaviour is actually very convenient for day to day usage once you’ve become familiar with it and I suppose that’s the real reason why R does this – but there’s no escaping the fact that it is deeply confusing to novices.
It’s especially confusing because the behaviour appears only for a very
specific case: (a) it only works for columns and not for rows (because
the columns correspond to variables and the rows do not), and (b) it only
applies to the “double index” version of the square brackets we have been using so far, and not to the subset() function we will discuss in Section XXX,32 or to the “single index” use of the square brackets (as we will discover in Section XXX). As I say, it’s very confusing when you’re just starting out. For what it’s worth, you can suppress this behaviour if you want, by setting drop = FALSE when you construct your bracketed expression. That is, you could do something like this:
## mood
## case.1 2
## case.2 1
## case.3 1
## case.4 3
## case.5 2
## case.6 2
## case.7 1
## case.8 1
## case.9 2
## case.10 3
I suppose that helps a little bit, in that it gives you some control over the dropping behaviour, but I’m not sure it helps to make things any easier to understand. Anyway, that’s the “dropping” special case. Fun, isn’t it?
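The dropping behaviour is easy to check for yourself. Here is a sketch (garden rebuilt from the printed values) that asks R what kind of object each command returns:

```r
garden <- data.frame(
  speaker   = rep(c("upsydaisy", "tombliboo", "makkapakka"), times = c(4, 2, 4)),
  utterance = c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk"),
  mood      = c(2, 1, 1, 3, 2, 2, 1, 1, 2, 3),
  row.names = paste0("case.", 1:10)
)

one.row  <- garden[5, ]                # single row: still a data frame
one.col  <- garden[, 3]                # single column: dropped to a vector
kept.col <- garden[, 3, drop = FALSE]  # single column, dropping suppressed

print(is.data.frame(one.row))
print(is.data.frame(one.col))
print(is.data.frame(kept.col))
```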
Again, I will mention the single index approach, just to prepare you for what you might encounter in the wild one day, but for the purposes of this course, you can forget about it.
Like with matrices, you can also use a single index instead of a double index (so not even a blank space). In fact, there are two ways of doing it. One is with a single pair of square brackets [] and one is with a double pair [[ ]].
What happens if you use a single index and a single pair of brackets?
Well, R will assume you want the corresponding columns, not the rows. Do not be fooled by the fact that this second method also uses square brackets: it behaves differently to the rather straightforward “double index” method that I’ve discussed in the last few sections. Again, what I’ll do is show you what happens first, and then I’ll try to explain why it happens afterwards. To that end, let’s start with the following command:
## speaker utterance
## case.1 upsydaisy pip
## case.2 upsydaisy pip
## case.3 upsydaisy onk
## case.4 upsydaisy onk
## case.5 tombliboo ee
## case.6 tombliboo oo
## case.7 makkapakka pip
## case.8 makkapakka pip
## case.9 makkapakka onk
## case.10 makkapakka onk
As you can see, the output gives me the first two columns, much as if
I’d typed garden[,1:2] using the double-index approach. It doesn’t
give me the first two rows, which is what I’d have gotten if I’d used a
command like garden[1:2,]. So it seems that garden[ 1:2 ] could be
treated as a (potentially handy) shorthand for garden[, 1:2 ].
Building off that insight, you might expect that garden[ 3 ] is a
shorthand for garden[, 3]. Well, in the spirit of keeping things more
complicated than needed, it is not. Let’s see what happens if I ask for
a single column:
## mood
## case.1 2
## case.2 1
## case.3 1
## case.4 3
## case.5 2
## case.6 2
## case.7 1
## case.8 1
## case.9 2
## case.10 3
and compare it to
## [1] 2 1 1 3 2 2 1 1 2 3
Unlike what happens when I type garden[, 3] R does not drop the
output. This is entirely consistent with what I said earlier: the only
case where dropping occurs by default is when you use the double index
version of the square brackets, and the output happens to correspond to
a single column.
Wait, you must be thinking, there must be more to this? It should be
possible to make things even more complicated? Good thinking! There is
something like single index “double brackets” notation: [[ ]]. Let’s
find out what it does, shall we?
## [1] 2 1 1 3 2 2 1 1 2 3
So using this notation, you force R to drop the output. Note that R will
only allow you to ask for one column at a time using the double
brackets. If you try to ask for multiple columns in this way, you get
completely different behaviour,33 which may or may not produce an
error, but definitely won’t give you the output you’re expecting. The
only reason I’m mentioning it at all is that you might run into double
brackets when doing further reading, and a lot of books don’t explicitly
point out the difference between [ and [[. However, I promise that I
won’t be using [[ anywhere else in this book.
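To see the single-index variants next to each other, here is a sketch, again with garden rebuilt from the printed values:

```r
garden <- data.frame(
  speaker   = rep(c("upsydaisy", "tombliboo", "makkapakka"), times = c(4, 2, 4)),
  utterance = c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk"),
  mood      = c(2, 1, 1, 3, 2, 2, 1, 1, 2, 3),
  row.names = paste0("case.", 1:10)
)

single.two <- garden[1:2]  # single brackets: first two COLUMNS, as a data frame
single.one <- garden[3]    # single brackets: one column, still a data frame
double.one <- garden[[3]]  # double brackets: one column, as a plain vector

print(is.data.frame(single.one))
print(is.data.frame(double.one))

# The double brackets give the same vector as the dropped double-index version
print(identical(double.one, garden[, 3]))
```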
subset() function
Using smart indexing is a useful way of getting info from a vector, matrix or data frame. However, it can become clunky sometimes. There are several different ways to subset a data frame in R, some easier than others. I’ll only discuss the subset() function, which is probably the conceptually simplest way to do it. For our purposes there are three different arguments that you’ll be most interested in:
x. The matrix or data frame that you want to subset.
subset. A vector of logical values indicating which cases (rows) of the matrix or data frame you want to keep. By default, all cases will be retained.
select. This argument indicates which variables (columns) in the matrix or data frame you want to keep. This can either be a list of variable names, or a logical vector indicating which ones to keep, or even just a numeric vector containing the relevant column numbers. By default, all variables will be retained.
Note that the function called subset() has an argument called subset. This is quite unusual, but nothing to frown upon.
Let’s start with an example
## age gender group.f score
## [1,] 19 1 1 10
## [2,] 21 1 1 11
## [3,] 37 2 2 15
## [4,] 19 3 2 14
## [5,] 47 3 3 25
## [6,] 19 3 3 29
## age gender group.f score
## [1,] 47 3 3 25
## [2,] 19 3 3 29
## age score
## [1,] 47 25
## [2,] 19 29
It’s pretty self-explanatory. The only thing worth pointing out is that the output is still a matrix. Matrix in, matrix out.
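The commands behind the three outputs are not shown. The sketch below uses conditions that I have guessed (age above 18, and additionally group.f equal to 3) because they reproduce the printed rows; the original conditions may have been different. The exptM values are rebuilt from the printed output.

```r
exptM <- cbind(
  age     = c(17, 19, 21, 37, 18, 19, 47, 18, 19),
  gender  = c(1, 1, 1, 2, 1, 3, 3, 3, 3),
  group.f = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  score   = c(12, 10, 11, 15, 16, 14, 25, 21, 29)
)

# Keep the rows where age is above 18 (guessed condition)
older <- subset(x = exptM, subset = exptM[, "age"] > 18)

# Combine two conditions with & (guessed condition)
older.g3 <- subset(x = exptM,
                   subset = exptM[, "age"] > 18 & exptM[, "group.f"] == 3)

# Same rows, but keep only two of the columns
older.g3.small <- subset(x = exptM,
                         subset = exptM[, "age"] > 18 & exptM[, "group.f"] == 3,
                         select = c("age", "score"))

print(older.g3.small)
print(is.matrix(older.g3.small))  # matrix in, matrix out
```

Note that for a matrix the columns have to be spelled out as exptM[, "age"] inside the subset argument, since a matrix has no $ operator.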
Suppose that I want to subset the itng data frame, keeping only the utterances made by Makka-Pakka. What that means is that I need to use the select argument to pick out the utterance variable, and I also need to use the subset argument, to pick out the cases when Makka-Pakka is speaking (i.e., speaker == "makkapakka").
Exercise: Subset the itng data frame using the subset()
function, keeping only Makka-Pakka’s utterances. Specify the x,
subset, and select arguments. Assign the result to a variable named
df. Print the result.
# Read the text above the exercise again, to know what to assign to which argument of the subset() function.
df <- subset( x = itng,                          # data frame is itng
              subset = speaker == "makkapakka",  # keep only Makka-Pakka's speech
              select = utterance )               # keep only the utterance variable
print( df )
The variable df here is still a data frame, but it only contains one
variable (called utterance) and four cases. Notice that the row
numbers are actually the same ones from the original data frame.
It’s worth taking a moment to briefly explain this. The reason that this
happens is that these “row numbers” are actually row names. When you
create a new data frame from scratch R will assign each row a fairly
boring row name, which is identical to the row number. However, when you
subset the data frame, each row keeps its original row name. This can be
quite useful, since – as in the current example – it provides you with
a visual reminder of what each row in the new data frame corresponds to
in the original data frame. However, if it annoys you, you can change
the row names using the rownames() function.34
In any case, let’s return to the subset() function, and look at what
happens when we don’t use all three of the arguments. Firstly, suppose
that we didn’t bother to specify the select argument.
Exercise: Again, subset the itng data frame, keeping only
Makka-Pakka’s utterances, but do not specify the select argument.
df2 <- subset( x = itng,
subset = speaker == "makkapakka" )
print(df2)
Not surprisingly, R has kept the same cases from the original data set (i.e., rows 7 through 10), but this time it has kept all of the variables from the data frame.
Further, note that we could have spelled out the data frame in the condition, writing subset = itng$speaker == "makkapakka", which gives the same result:
## speaker utterance
## 7 makkapakka pip
## 8 makkapakka pip
## 9 makkapakka onk
## 10 makkapakka onk
However, speaker on its own is enough: because we specified x = itng, the subset() function knows that by speaker we mean itng$speaker.
Exercise: What if you don’t specify the subset argument?
df3 <- subset( x = itng,
select = utterance )
print(df3)
Equally unsurprisingly, if we don’t specify the subset argument, what
we find is that R keeps all of the cases. Again, it’s important to note
that this output is still a data frame: it’s just a data frame with only
a single variable.
If you have done and understood all of this, you can find some additional exercises in the RIntroExercises document. Since you now know how to work in RStudio, why not go there and try them?
An important topic to discuss is the idea of transforming a variable. Taken literally, anything you do to a variable is a transformation, but in practice what it usually means is that you apply a relatively simple mathematical function to the original variable, in order to create a new variable.
The good news is that you already know how to do variable transformations. To see this, let’s go through an example. Suppose I’ve run a short study in which I ask 10 people a single question:
On a scale of 1 (strongly disagree) to 7 (strongly agree), to what extent do you agree with the proposition that “Dinosaurs are awesome”?
Now let’s load and look at the data. The data consist of a single variable that contains the raw Likert-scale responses:
However, if you think about it, this isn’t the best way to represent these responses. Because of the fairly symmetric way that we set up the response scale, there’s a sense in which the midpoint of the scale should have been coded as 0 (no opinion), and the two endpoints should be \(+3\) (strongly agree) and \(-3\) (strongly disagree). Recoding the data in this way is a bit more reflective of how we really think about the responses. The recoding here is trivially easy: we just subtract 4 from the raw scores:
Exercise: Recode the variable likert.raw by subtracting 4 from the
raw scores, and assign it to a variable called likert.centred. Print
the result.
likert.centred <- likert.raw - 4
likert.centred
One reason why it might be useful to have the data in this format is
that there are a lot of situations where you might prefer to analyse the
strength of the opinion separately from the direction of the
opinion. We can do two different transformations on this
likert.centred variable in order to distinguish between these two
different concepts.
Firstly, to compute an opinion.strength variable,
we want to take the absolute value of the centered data.
Exercise: Take the absolute value of the centered data
likert.centred using the abs() function, and assign it to a variable
called opinion.strength. Print the result.
opinion.strength <- abs( likert.centred )
opinion.strength
Secondly, to compute a variable that contains only the direction of the
opinion and ignores the strength, we can use the sign() function.
If you type ?sign you’ll see that this function is really
simple: all negative numbers are converted to \(-1\), all positive numbers
are converted to \(1\) and zero stays as \(0\).
Exercise: Apply the sign() function to the variable
likert.centred, and assign it to a variable named opinion.dir. Print
the result.
opinion.dir <- sign( likert.centred )
opinion.dir
And we’re done. We now have three shiny new variables, all of which are
useful transformations of the original likert.raw data. All of this
should seem pretty familiar to you. The tools that you use to do regular
calculations in R (e.g., Chapters 2 and ??)
are very much the same ones that you use to transform your variables!
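To see the three transformations side by side, here is a minimal sketch. I type the responses in by hand rather than loading them from file; the ten values are the ones that appear in the likert.raw column of the data frame printed in this section.

```r
# The ten raw responses on the 1-7 scale, typed in by hand
likert.raw <- c(1, 7, 3, 4, 4, 4, 2, 6, 5, 5)

# Centre the scale so that the midpoint (4) becomes 0
likert.centred <- likert.raw - 4

# Strength of the opinion: distance from the midpoint
opinion.strength <- abs(likert.centred)

# Direction of the opinion: -1, 0 or 1
opinion.dir <- sign(likert.centred)

likert.centred
opinion.strength
opinion.dir
```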
The variable I transformed (likert.raw) wasn’t inside a data
frame. I’ve done this to keep the explanation simple, though in real life
it almost certainly would be.
Before moving on, you might (or might not; hey!, it’s your life) be
curious to see what these calculations look like if the data had started
out in a data frame. To that end, it may help to note that the following
example does all of the calculations using variables inside a data
frame, and stores the variables created inside it:
df <- data.frame( likert.raw )                   # create data frame
df$likert.centred <- df$likert.raw - 4           # create centred data
df$opinion.strength <- abs( df$likert.centred )  # create strength variable
df$opinion.dir <- sign( df$likert.centred )      # create direction variable
print(df)                                        # print the final data frame
## likert.raw likert.centred opinion.strength opinion.dir
## 1 1 -3 3 -1
## 2 7 3 3 1
## 3 3 -1 1 -1
## 4 4 0 0 0
## 5 4 0 0 0
## 6 4 0 0 0
## 7 2 -2 2 -1
## 8 6 2 2 1
## 9 5 1 1 1
## 10 5 1 1 1
In other words, the commands you use are basically the same ones as before: it’s
just that every time you want to read a variable from the data frame or
write to the data frame, you use the $ operator.
A very common task when analysing data is the construction of frequency tables or cross-tabulation of one variable against another. There are several functions that you can use in R for that purpose. In this section I’ll illustrate the use of the table() and xtabs() functions (and will end with a brief shout out to tabulate()), though there are other options available, such as ftable() (not discussed in this book).
To illustrate what this is all about, I will make use of the
speaker vector from the nightgarden data set.
With these as my data, one task I might find myself needing to do is
construct a frequency count of the number of words each character speaks
during the show. The table() function provides a simple way to do
this. The basic usage of the table() function is as follows:
## speaker
## makkapakka tombliboo upsydaisy
## 4 2 4
The output here tells us on the first line that what we’re looking at is
a tabulation of the speaker variable. On the second line, it lists all
the different speakers that exist in the data, and on the third line, it
tells you how many times that speaker appears in the data. In other
words, it’s a frequency table.
As usual, you can assign this output to a variable. If you type
speaker.freq <- table(speaker) at the command prompt R will store
the table as a variable. If you then type class(speaker.freq)
you’ll see that the output is actually of class table.
## [1] "table"
The key thing to note about a table object is that it’s basically a named vector:
## upsydaisy
## 4
## [1] "makkapakka" "tombliboo" "upsydaisy"
## speaker
## makkapakka tombliboo upsydaisy
## 7 5 7
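Since only the output survives here, this is a minimal sketch of commands that would produce it. The speaker vector is typed in by hand from the values printed elsewhere in this chapter, and the final line (adding 3 to every count) is my guess at an operation that turns 4, 2, 4 into 7, 5, 7.

```r
# The speaker vector from the In the Night Garden data, typed in by hand
speaker <- c("upsydaisy", "upsydaisy", "upsydaisy", "upsydaisy",
             "tombliboo", "tombliboo",
             "makkapakka", "makkapakka", "makkapakka", "makkapakka")

speaker.freq <- table(speaker)   # frequency table, stored as a variable
speaker.freq
class(speaker.freq)              # "table"

# A table behaves much like a named vector
speaker.freq["upsydaisy"]        # extract one count by name
names(speaker.freq)              # the names are the speakers
speaker.freq + 3                 # arithmetic is done element by element
```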
Notice that in the command above I
didn’t name the argument, since table() is a function that makes
use of unnamed arguments. You just type in a list of the variables that
you want R to tabulate, and it tabulates them. For instance, if I type
in the name of two variables, what I get as the output is a
cross-tabulation.
Exercise: Obtain the cross-tabulation for speaker and utterance.
table(speaker, utterance)
When interpreting this table, remember that these are counts: so the fact that the first row and second column corresponds to a value of 2 indicates that Makka-Pakka (row 1) says “onk” (column 2) twice in this data set. As you’d expect, you can produce three-way or higher-order cross-tabulations just by adding more objects to the list of inputs. However, I won’t discuss that in this section.
The tabulation commands discussed so far all construct a table of raw
frequencies: that is, a count of the total number of cases that satisfy
certain conditions. However, often you want your data to be organised in
terms of proportions rather than counts. This is where the
prop.table() function comes in handy.
Let’s see how this works. Note that we need to first call table() before we can call prop.table().
## Error in sum(x): invalid 'type' (character) of argument
## speaker
## makkapakka tombliboo upsydaisy
## 0.4 0.2 0.4
We see that all proportions sum to 1, as they should.
Of course, this is identical to
## speaker
## makkapakka tombliboo upsydaisy
## 0.4 0.2 0.4
so it is not entirely useful so far. But we are still glad to have you around, prop.table()!
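Here is a minimal sketch of the steps above, with the speaker vector typed in by hand. I wrap the failing call in try() so that the rest of the code keeps running, and I am assuming that the "identical" alternative is simply dividing the counts by their total.

```r
speaker <- c("upsydaisy", "upsydaisy", "upsydaisy", "upsydaisy",
             "tombliboo", "tombliboo",
             "makkapakka", "makkapakka", "makkapakka", "makkapakka")

# Calling prop.table() on the raw character vector fails
attempt <- try(prop.table(speaker), silent = TRUE)
inherits(attempt, "try-error")   # TRUE: we need a table first

# First tabulate, then convert the counts to proportions
speaker.freq <- table(speaker)
speaker.prop <- prop.table(speaker.freq)
speaker.prop
sum(speaker.prop)                # the proportions sum to 1

# The same thing done by hand (assumed equivalent)
speaker.freq / sum(speaker.freq)
```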
One final function I want to mention is the tabulate() function, since
this is actually the low-level function that does most of the hard work.
It takes a numeric vector as input, and returns the frequencies:
## [1] 6 5 4 1 1 0 0 1
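The input that produced this output is not shown, so here is a tiny sketch with a made-up vector (some.numbers is a hypothetical name) to show how to read the result. The first number is how often 1 occurs, the second how often 2 occurs, and so on up to the largest value.

```r
# A made-up vector of positive whole numbers
some.numbers <- c(1, 2, 2, 5)

# Counts for the values 1, 2, 3, 4 and 5, in that order
tabulate(some.numbers)
# [1] 1 2 0 0 1
```

Unlike table(), the output has no labels, and values that never occur (here 3 and 4) still get a count of zero.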
The table() function also works on a matrix, but I see limited use for this:
## age gender group.f score
## [1,] 17 1 1 12
## [2,] 19 1 1 10
## [3,] 21 1 1 11
## [4,] 37 2 2 15
## [5,] 18 1 2 16
## [6,] 19 3 2 14
## [7,] 47 3 3 25
## [8,] 18 3 3 21
## [9,] 19 3 3 29
## exptM
## 1 2 3 10 11 12 14 15 16 17 18 19 21 25 29 37 47
## 7 4 7 1 1 1 1 1 1 1 2 3 2 1 1 1 1
A very common task when analysing data is the construction of frequency
tables or cross-tabulation of one variable against another. We have seen
how we can use the table() and prop.table() functions. The major take
home of this section is that you can use these functions for data frames
as well. Additionally, I’ll show you how to use the xtabs() function.
There are a couple of options under these circumstances. Firstly, if you
just want to cross-tabulate all of the variables in the data frame, then
it’s really easy, as you can just use the table() function.
Exercise: Use the table() function to cross-tabulate all variables in the data frame itng.
table(itng)
However, it’s often the case that you want to select particular
variables from the data frame to tabulate. For example, if you want to cross-tabulate only speaker and utterance in the garden data set, table() goes full TMI:
## , , mood = 1
##
## utterance
## speaker ee onk oo pip
## makkapakka 0 0 0 2
## tombliboo 0 0 0 0
## upsydaisy 0 1 0 1
##
## , , mood = 2
##
## utterance
## speaker ee onk oo pip
## makkapakka 0 1 0 0
## tombliboo 1 0 1 0
## upsydaisy 0 0 0 1
##
## , , mood = 3
##
## utterance
## speaker ee onk oo pip
## makkapakka 0 1 0 0
## tombliboo 0 0 0 0
## upsydaisy 0 1 0 0
This is where the xtabs() function is useful. In this function, you input a formula in order to list all the variables you want to cross-tabulate, and the name of the data frame that stores the data:
There are many situations when you’re analysing real data where this is actually extremely useful since your data set will almost certainly contain lots of variables and you’ll only want to tabulate a few of them at a time.
For example, with the garden data, we can use
## utterance
## speaker ee onk oo pip
## makkapakka 0 2 0 2
## tombliboo 1 0 1 0
## upsydaisy 0 2 0 2
Notice how the left hand side of the formula has been left empty.
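A minimal sketch of such a call is given here. I rebuild only the speaker and utterance columns of itng by hand, from the values printed elsewhere in this chapter, and the name speak.tab is my own.

```r
# Rebuild two columns of the garden data by hand (assumption: same order as in the book)
itng <- data.frame(
  speaker   = c("upsydaisy", "upsydaisy", "upsydaisy", "upsydaisy",
                "tombliboo", "tombliboo",
                "makkapakka", "makkapakka", "makkapakka", "makkapakka"),
  utterance = c("pip", "pip", "onk", "onk", "ee", "oo",
                "pip", "pip", "onk", "onk")
)

# Nothing on the left of the ~, the variables to cross-tabulate on the right
speak.tab <- xtabs(formula = ~ speaker + utterance, data = itng)
speak.tab
```

The design choice is that the formula names the variables, so the other columns of the data frame are simply ignored.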
As I mentioned in Section ??, the tabulation commands discussed so far all construct a table of raw frequencies: that is, a count of the total number of cases that satisfy certain conditions. However, often you want your data to be organised in terms of proportions rather than counts. This is where the prop.table() function comes in handy. And yes, it
also works for data frames!
Exercise: Create the proportion table of itng, and assign it to a variable called itng.table. Display the table again, just as a reminder.
itng.table <- prop.table(table(itng)) # create the table, and assign it to a variable
itng.table # display the table again, as a reminder
So we can express the data in proportions, by
feeding a table into prop.table(). It works similarly from the xtabs() output:
## utterance
## speaker ee onk oo pip
## makkapakka 0 2 0 2
## tombliboo 1 0 1 0
## upsydaisy 0 2 0 2
## utterance
## speaker ee onk oo pip
## makkapakka 0.0 0.2 0.0 0.2
## tombliboo 0.1 0.0 0.1 0.0
## upsydaisy 0.0 0.2 0.0 0.2
One particular example of data handling that is
especially common is the problem of splitting one variable up into
several different variables, one corresponding to each group. To
illustrate, let’s go back to the In the Night Garden example. I might
want to create subsets of the utterance variable for every character.
One way to do this is with logical indexing:
## [1] "pip" "pip" "onk" "onk" "ee" "oo" "pip" "pip" "onk" "onk"
## [1] "upsydaisy" "upsydaisy" "upsydaisy" "upsydaisy" "tombliboo"
## [6] "tombliboo" "makkapakka" "makkapakka" "makkapakka" "makkapakka"
## [1] "pip" "pip" "onk" "onk"
## [1] "ee" "oo"
## [1] "pip" "pip" "onk" "onk"
but that quickly gets repetitive and hence annoying and this strategy breaks down in a situation with many characters.
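For reference, the logical indexing approach might look like the sketch below. The two vectors are typed in by hand from the printed values, and the three result names are hypothetical.

```r
utterance <- c("pip", "pip", "onk", "onk", "ee", "oo",
               "pip", "pip", "onk", "onk")
speaker   <- c("upsydaisy", "upsydaisy", "upsydaisy", "upsydaisy",
               "tombliboo", "tombliboo",
               "makkapakka", "makkapakka", "makkapakka", "makkapakka")

# One line per character: keep the utterances where the speaker matches
speech.upsydaisy  <- utterance[speaker == "upsydaisy"]
speech.tombliboo  <- utterance[speaker == "tombliboo"]
speech.makkapakka <- utterance[speaker == "makkapakka"]

speech.upsydaisy
speech.tombliboo
speech.makkapakka
```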
A faster, and maybe more convenient, way to do it is to use the split() function. The arguments are:
x. The variable that needs to be split into groups.
f. The grouping variable.
What this function does is output a list (Section 6.7), containing one variable for each group.
Exercise: Split the variable utterance by speaker using the
split() function, and assign the result to a variable named
speech.by.char. Specify the arguments x and f in the split()
function. Print the result.
speech.by.char <- split( x = utterance, f = speaker )
speech.by.char
Once you’re starting to become comfortable working with lists and data frames, this output is all you need, since you can work with this list in much the same way that you would work with a data frame. For instance, if you want the first utterance made by Makka-Pakka, all you need to do is type this:
## [1] "pip"
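One way to get at that first utterance is sketched here. The vectors are typed in by hand, and the use of $ followed by [1] is my assumption about the command; other forms of indexing would work too.

```r
utterance <- c("pip", "pip", "onk", "onk", "ee", "oo",
               "pip", "pip", "onk", "onk")
speaker   <- c("upsydaisy", "upsydaisy", "upsydaisy", "upsydaisy",
               "tombliboo", "tombliboo",
               "makkapakka", "makkapakka", "makkapakka", "makkapakka")

# Split the utterances into a list with one element per speaker
speech.by.char <- split(x = utterance, f = speaker)

# Pick the makkapakka element with $, then its first value with [1]
speech.by.char$makkapakka[1]
```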
One pragmatic task that arises more often than you’d think is the problem of cutting a numeric variable up into discrete categories. If it makes you happy or feel interesting, you can say that you are recoding a variable. For instance, suppose I’m interested in looking at the age distribution of people at a social gathering:
In some situations, it can be quite helpful to group these into a smallish number of categories. For example, we could group the data into three broad categories: young (0-20), adult (21-40) and older (41-60). This is a quite coarse-grained classification, and the labels that I’ve attached only make sense in the context of this data set (e.g., viewed more generally, a 42-year-old wouldn’t consider themselves as “older”).
We can slice this variable up quite easily using the cut()
function.35
To make things a little cleaner, I’ll start by creating a variable that defines the boundaries for the categories:
## [1] 0 20 40 60
and another one for the labels:
## [1] "young" "adult" "older"
Note that there are four numbers in the age.breaks variable, but only
three labels in the age.labels variable; I’ve done this because the
cut() function requires that you specify the edges of the categories
rather than the mid-points. In any case, now that we’ve done this, we
can use the cut() function to assign each observation to one of these
three categories.
There are several arguments to the cut() function,
but the three that we need to care about are:
x. The variable that needs to be categorised.
breaks. This is either a vector containing the locations of the breaks separating the categories, or a number indicating how many categories you want.
labels. The labels attached to the categories. This is optional: if you don’t specify this R will attach a boring label showing the range associated with each category.
Exercise: Since we’ve already created variables corresponding to the
breaks and the labels, apply the cut() function to age. Specify the
three arguments discussed above. In order to see what this command has
actually done, just print out the output.
age.group <- cut(x = age, # the variable to be categorised
breaks = age.breaks, # the edges of the categories
labels = age.labels) # the labels for the categories
age.group
## [1] older older adult adult adult older adult adult adult young young
## Levels: young adult older
Note that the output variable here is a factor.
Often, it’s actually more helpful to create a data frame that includes both the original variable and the categorised one so that you can see the two side by side:
## age age.group
## 1 60 older
## 2 58 older
## 3 24 adult
## 4 26 adult
## 5 34 adult
## 6 42 older
## 7 31 adult
## 8 30 adult
## 9 33 adult
## 10 2 young
## 11 9 young
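Putting the pieces together, a minimal end-to-end sketch could look like this. The ages are copied from the printed data frame; how the breaks and labels were originally typed in is my assumption.

```r
# Ages of the people at the gathering, copied from the printed output
age <- c(60, 58, 24, 26, 34, 42, 31, 30, 33, 2, 9)

age.breaks <- seq(from = 0, to = 60, by = 20)   # edges: 0, 20, 40, 60
age.labels <- c("young", "adult", "older")      # one label per category

# Assign each age to a category; the result is a factor
age.group <- cut(x = age, breaks = age.breaks, labels = age.labels)

# Original and categorised variable side by side
data.frame(age, age.group)
```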
In the example above, I made all the decisions myself. If you want to you can delegate a lot of the choices to R. For instance, if you want you can specify the number of categories you want, rather than giving explicit ranges for them, and you can allow R to come up with some labels for the categories. To give you a sense of how this works, have a look at the following example:
With this command, I’ve asked for three categories, but let R make the
choices for where the boundaries should be. I won’t bother to print out
the age.group2 variable, because it’s not terribly pretty or very
interesting. Instead, all of the important information can be extracted
by looking at the tabulated data:
## age.group2
## (1.94,21.3] (21.3,40.7] (40.7,60.1]
## 2 6 3
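A sketch of the commands that would give a table like this one is shown here, assuming the same age vector as before and that the number of categories is passed through the breaks argument.

```r
age <- c(60, 58, 24, 26, 34, 42, 31, 30, 33, 2, 9)

# Ask for three categories and let R choose boundaries and labels
age.group2 <- cut(x = age, breaks = 3)

# Tabulate to see the categories R came up with
table(age.group2)
```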
This output takes a little bit of interpretation, but it’s not complicated. What R has done is determine that the lowest age category should run from 1.94 years up to 21.3 years, the second category should run from 21.3 years to 40.7 years, and so on. The formatting on those labels might look a bit funny to those of you who haven’t studied a lot of maths, but it’s pretty simple. When R describes the first category as corresponding to the range \((1.94, 21.3]\) what it’s saying is that the range consists of those numbers that are larger than 1.94 but less than or equal to 21.3. In other words, the weird asymmetric brackets are R’s way of telling you that if there happens to be a value that is exactly equal to 21.3, then it belongs to the first category, not the second one. Obviously, this isn’t actually possible since I’ve only specified the ages to the nearest whole number, but R doesn’t know this and so it’s trying to be precise just in case. This notation is actually pretty standard, but I suspect not everyone reading the book will have seen it before. In any case, those labels are pretty ugly, so it’s usually a good idea to specify your own, meaningful labels for the categories.
It is important to take the time to figure out whether or not the resulting categories make any sense at all in terms of your research project. If they don’t make any sense to you as meaningful categories, then any data analysis that uses those categories is likely to be just as meaningless. More generally, in practice I’ve noticed that people have a very strong desire to carve their (continuous and messy) data into a few (discrete and simple) categories; and then run the analysis using the categorised data instead of the original one.
One thing that you often want to do is sort a variable. If it’s a
numeric variable you might want to sort in increasing or decreasing
order. If it’s a character vector you might want to sort alphabetically,
etc. The sort() function provides this capability.
Consider the variable numbers containing the following three values.
Exercise: Sort the variable numbers using the sort() function.
sort( numbers )
Exercise: Now ask R to sort numbers in decreasing order rather
than increasing, by including the argument decreasing = TRUE.
You can ask for R to sort in decreasing order rather than increasing:
sort( numbers, decreasing = TRUE )
You can ask it to sort text data in alphabetical order:
## [1] "aardvark" "aardvark" "swing" "swing" "swing" "zebra" "zebra"
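As a compact recap, the sketch below runs all three kinds of sorting. The three values in numbers are made up (the originals are not shown), and the name some.text is hypothetical; its words are the ones that appear in the sorted output.

```r
# Hypothetical values for the numbers variable
numbers <- c(2, 4, 3)
sort(numbers)                      # increasing order
sort(numbers, decreasing = TRUE)   # decreasing order

# A character vector with the same words as in the output
some.text <- c("aardvark", "zebra", "swing", "aardvark",
               "zebra", "swing", "swing")
sort(some.text)
```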
That’s pretty straightforward. That being said, it’s important to note
that I’m glossing over something here. When you apply sort() to a
character vector it doesn’t strictly sort into alphabetical order. R
actually has a slightly different notion of how characters are ordered,
which is more closely related to how computers store text data than to how
letters are ordered in the alphabet. That’s a topic I am not going to
touch, but do remember that if you ever need a strictly alphabetically
sorted output, you need to take a deeper dive into R.
You can also sort factors, but the story here is slightly more subtle
because there are two different ways you can sort a factor:
alphabetically (by label) or by factor level. The sort() function uses
the latter. To illustrate, let’s look at the two different examples.
First, let’s create a factor in the usual way:
## [1] aardvark zebra swing aardvark zebra swing swing
## Levels: aardvark swing zebra
Now let’s sort it:
## [1] aardvark aardvark swing swing swing zebra zebra
## Levels: aardvark swing zebra
This looks like it’s sorted things into alphabetical order, but that’s
only because the factor levels themselves happen to be alphabetically
ordered (i.e., the levels read: aardvark, swing, zebra).
Suppose I deliberately define the factor levels in a non-alphabetical order:
## [1] aardvark zebra swing aardvark zebra swing swing
## Levels: zebra swing aardvark
Exercise: Now what happens when we try to sort fac this time?
sort(fac)
It didn’t sort the data (which is the text) in the order of the alphabet! What it does instead is sort the data into the numerical order implied by the factor levels,
not the alphabetical order implied by the labels attached to those
levels. Since the order of the factor levels is (by my own choosing) zebra, swing, and aardvark, it has sorted the data in that order. Normally you never notice the distinction, because by default the factor levels are assigned in alphabetical order, but it’s important to know the difference.
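The two cases can be compared directly with the sketch below. The variable names fac.default and fac.custom are mine, introduced so that both factors can exist side by side; the levels argument of factor() is what fixes the order.

```r
some.text <- c("aardvark", "zebra", "swing", "aardvark",
               "zebra", "swing", "swing")

# Default: the levels are put in alphabetical order
fac.default <- factor(some.text)
sort(fac.default)

# Levels deliberately given in a non-alphabetical order
fac.custom <- factor(some.text, levels = c("zebra", "swing", "aardvark"))
sort(fac.custom)   # zebra first, aardvark last
```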
The sort() function doesn’t work properly with data frames. If you
want to sort a data frame the standard advice that you’ll find online is
to use the order() function (not described in this book) to determine
what order the rows should be sorted, and then use square brackets to do
the shuffling. There’s nothing inherently wrong with this advice, I just
find it tedious. I won’t go into it any further, but I just want you to
know you can do it. Remember that in life, nothing is impossible. But some things are just
painfully tedious.
It is very commonly the case that you find yourself needing to look at stuff, broken down by some grouping variable. This is pretty easy to do in R, and we will discuss three
functions in particular that are worth knowing about: tapply(), by()
and aggregate().
For example, say we have these two variables:
Suppose we want to compute the mean age per gender. If you remember the split() function from Section XXX, you know that one approach is as follows.
## [1] 10
## [1] 11.66667
Nice, but tedious. And, if we are doing real talk, ugly. R provides several functions to make your life easy.
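For completeness, the tedious split() route might look like the sketch below. The values of age and gender are the ones defined further on in this section, and age.by.gender is a hypothetical name.

```r
gender <- c("male", "male", "female", "female", "male")
age    <- c(10, 12, 9, 11, 13)

# Split the ages into one vector per gender
age.by.gender <- split(x = age, f = gender)

# Then compute each mean separately, one line per group
mean(age.by.gender$female)
mean(age.by.gender$male)
```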
The first of these functions is tapply(), which has three key
arguments. As before, X specifies the data, and FUN specifies a
function. However, there is also an INDEX argument which specifies a
grouping variable.36 What the tapply() function does is
consider all of the different values that appear in the INDEX variable. Each
such value defines a group: the tapply() function constructs the
subset of X that corresponds to that group, and then applies the
function FUN to that subset of the data. This probably sounds a little
abstract, so let’s consider a specific example.
## female male
## 10.00000 11.66667
In this extract, what we’re doing is using gender to define two different groups of people, and using their ages as the data. We then calculate the mean() of the ages, separately for the males and the females.
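A call of this kind can be sketched as follows, using the age and gender values defined further on in this section. The name mean.age is my own.

```r
gender <- c("male", "male", "female", "female", "male")
age    <- c(10, 12, 9, 11, 13)

# Mean age, computed separately for every value of gender
mean.age <- tapply(X = age, INDEX = gender, FUN = mean)
mean.age
```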
Note that tapply() is very demanding, detail-oriented, and different from most functions. While many functions we have seen take a lowercase x as argument, tapply() is only happy (and then some!) with uppercase X:
## Error in tapply(x = age, INDEX = gender, FUN = mean): argument "X" is missing, with no default
Remember that mean() can take the na.rm argument? So do I! How on earth could you pass this argument to tapply() (or any other argument for any other function, of course)? Well, here goes:
## female male
## NA 11.66667
## female male
## 9.00000 11.66667
So the argument that should have gone inside mean() just became an argument inside tapply().
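Here is a minimal sketch of that trick. The variable age_new is my reconstruction: I assume it equals age with one female value replaced by NA, which is consistent with the means shown.

```r
gender  <- c("male", "male", "female", "female", "male")
age_new <- c(10, 12, 9, NA, 13)   # assumed: one missing value among the females

# Without na.rm, the female mean is NA
tapply(X = age_new, INDEX = gender, FUN = mean)

# Extra arguments after FUN are handed on to mean()
mean.age.new <- tapply(X = age_new, INDEX = gender, FUN = mean, na.rm = TRUE)
mean.age.new
```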
There’s even more flexibility! FUN is not restricted to built in stuff like mean() but can take your home-grown functions too!
## female male
## 0 0
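The home-grown function behind the output of zeros is not shown, so I will not try to guess it. Instead, here is a sketch with a different, hypothetical function of my own, age.range(), which computes the gap between the oldest and youngest person in a group.

```r
gender <- c("male", "male", "female", "female", "male")
age    <- c(10, 12, 9, 11, 13)

# A hypothetical home-grown function: largest value minus smallest value
age.range <- function(x) {
  max(x) - min(x)
}

# tapply() applies it to each group, just like a built-in function
tapply(X = age, INDEX = gender, FUN = age.range)
```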
A closely related function is by(). It actually does the same thing as tapply(), but the output is formatted a bit differently. This time around the three arguments are called data, INDICES and FUN, but they’re pretty much the same thing. The data argument specifies the data set, the INDICES argument specifies the grouping variable, and the FUN argument specifies the name of a function that you want to apply separately to each group. An example of how to use the by() function is shown in the following extract:
## gender: female
## [1] 10
## ------------------------------------------------------------
## gender: male
## [1] 11.66667
The output gives you means separately for the female group and the male group.
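A sketch of such a call, with the same age and gender values as elsewhere in this section, is given below. The name means.by is my own.

```r
gender <- c("male", "male", "female", "female", "male")
age    <- c(10, 12, 9, 11, 13)

# Same computation as with tapply(), but with different argument names
means.by <- by(data = age, INDICES = gender, FUN = mean)
means.by
```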
The same argument passing trick as with tapply() can be used. For example, if you want to add the na.rm = TRUE argument to the mean() function, just use
## gender: female
## [1] 9
## ------------------------------------------------------------
## gender: male
## [1] 11.66667
And, just like with tapply(), FUN can take a function you defined yourself:
## gender: female
## [1] 0
## ------------------------------------------------------------
## gender: male
## [1] 0
A fun fact is that by() produces an object of the class by. You don’t need to know what that means, but it is just to reinforce the idea that 1) there are many more classes than we discussed and 2) you should never assume that the class of the output is identical to the class of whatever you input.
## [1] "by"
A final quite convenient function is the aggregate() function. There are again three arguments that you need to specify. The x argument is used to indicate which variable you want to analyse, and which variables are used to specify the groups. For instance, if you want to look at age separately for each possible gender, you can specify this using a formula, like this: age ~ gender. The FUN argument is used to indicate what function you want to calculate for each group (e.g., the mean). The data argument is used to specify the data frame containing all the data, so we need to first combine our data in a data frame. Annoyingly, if I want to show you the output of the aggregate() function in this document, I need to use the print() function. If you use it in RStudio, print() is not needed (but doesn’t hurt either).
dfAG <- data.frame(age, gender)
print(aggregate( x = age,
                 by = list(gender),
                 FUN = mean,
                 data = dfAG)
)
## Group.1 x
## 1 female 10.00000
## 2 male 11.66667
Note that we needed to convert gender to a list for aggregate() to work. In case you’d forget, aggregate() will kindly tell you off:
## Error in aggregate.data.frame(as.data.frame(x), ...): 'by' must be a list
Alternatively, aggregate() can take a formula as input:
## gender age
## 1 female 10.00000
## 2 male 11.66667
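The formula version might look like the sketch below. I pass the formula as the first argument without naming it, which is my choice; the name agg.result is also mine.

```r
gender <- c("male", "male", "female", "female", "male")
age    <- c(10, 12, 9, 11, 13)
dfAG   <- data.frame(age, gender)

# Read the formula as "age, broken down by gender"
agg.result <- aggregate(age ~ gender, data = dfAG, FUN = mean)
print(agg.result)
```

A small advantage of the formula version is that the columns of the output keep their original names (gender and age) instead of Group.1 and x.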
And yes, the extra-arguments and home-grown functions work as well. For once in its life, R is showing some consistency.
dfAG_new <- data.frame(age_new, gender)
print(aggregate( x = age_new ~ gender,
                 data = dfAG_new,
                 FUN = mean,
                 na.rm = TRUE
))
## gender age_new
## 1 female 9.00000
## 2 male 11.66667
and
## gender age
## 1 female 0
## 2 male 0
What’s the class you wonder?
## [1] "data.frame"
What if you have multiple grouping variables? Suppose, we have more variables in our data set, relating to whether the person has no pet, or a cat or a dog:
gender <- c( "male","male","female","female","male" )
age <- c( 10,12,9,11,13 )
pet <- c("no","cat","cat","dog","no")
For example, you would like to look at the average age separately for all possible combinations of gender and pet ownership. It is possible to do this
using the tapply() and by() functions, as follows. Note the use of list, which we have encountered in Section XXX.
## female male
## cat 9 12.0
## dog 11 NA
## no NA 11.5
## : cat
## : female
## [1] 9
## ------------------------------------------------------------
## : dog
## : female
## [1] 11
## ------------------------------------------------------------
## : no
## : female
## [1] NA
## ------------------------------------------------------------
## : cat
## : male
## [1] 12
## ------------------------------------------------------------
## : dog
## : male
## [1] NA
## ------------------------------------------------------------
## : no
## : male
## [1] 11.5
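The two calls behind output like this can be sketched as follows. The order in which I list the grouping variables (pet first, then gender) is inferred from the layout of the printed table.

```r
gender <- c("male", "male", "female", "female", "male")
age    <- c(10, 12, 9, 11, 13)
pet    <- c("no", "cat", "cat", "dog", "no")

# Two grouping variables are passed together as a list
age.by.both <- tapply(X = age, INDEX = list(pet, gender), FUN = mean)
age.by.both   # combinations that never occur show up as NA

# The same idea with by()
by(data = age, INDICES = list(pet, gender), FUN = mean)
```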
I usually find it more convenient to use the aggregate() function in
this situation. There are again three arguments that you need to
specify. The x argument is used to indicate which variable you want to
analyse, and which variables are used to specify the groups. For
instance, if you want to look at age separately for each
possible combination of gender and pet ownership, you can specify this using a formula, like this: age ~ gender + pet. The data argument
is used to specify the data frame containing all the data, and the FUN
argument is used to indicate what function you want to calculate for
each group (e.g., the mean).
dfAGP <- data.frame(age, gender, pet)
print(aggregate( x = age,                 # age
                 by = list(pet,gender),   # by pet/gender combination
                 data = dfAGP,            # data is in the dfAGP data frame
                 FUN = mean               # print out group means
))
## Group.1 Group.2 x
## 1 cat female 9.0
## 2 dog female 11.0
## 3 cat male 12.0
## 4 no male 11.5
or, using a formula,
print(aggregate( x = age ~ pet + gender,  # age by pet/gender combination
                 data = dfAGP,            # data is in the dfAGP data frame
                 FUN = mean               # print out group means
))
## pet gender age
## 1 cat female 9.0
## 2 dog female 11.0
## 3 cat male 12.0
## 4 no male 11.5
The tapply(), by() and aggregate() functions are quite handy things to know about and are pretty widely used. As with the apply() function (see Section XXX), you can pass on additional function arguments after the FUN argument.
Before moving on, I should mention that there are several other
functions that work along similar lines, and have suspiciously similar
names: lapply, mapply, apply, vapply, rapply and eapply.
You’ll hear about those when needed.
One thing you will quickly see (in Chapter XXX, to be more precise) is that, once your data are in the format R expects, doing inferential tests is almost a no-brainer. A t test can be done with the t.test() function, a proportion test with the prop.test() function, and so on. I’m pretty sure you catch my drift. The most difficult part is often getting your data in the format R expects. There is one final function that I want to talk about which will prove very handy when you start doing data analysis, because it will help you do exactly that. The merge() function supports fairly complicated “database like” merging of vectors and data frames. It doesn’t do anything you couldn’t do in a different way, by tedious and clever indexing. But it makes your life pretty easy.
To illustrate, consider a situation where we have two data frames with overlapping IDs, indicating the different participants.
dfA <- data.frame(
ID = c(1, 2, 3),
Score = c(10, 20, 30)
)
dfB <- data.frame(
ID = c(2, 3, 4),
Grade = c("B", "A", "C")
)
print(dfA)
## ID Score
## 1 1 10
## 2 2 20
## 3 3 30
print(dfB)
## ID Grade
## 1 2 B
## 2 3 A
## 3 4 C
Both dfA and dfB share a common column called ID. What if we want to combine both data sets in one? There’s a function for that.
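The function in question is merge(). A minimal sketch of the call, using the same form that appears further down in this section (the data frames are repeated so that the block runs on its own):

```r
# the two data frames from above
dfA <- data.frame(
  ID = c(1, 2, 3),
  Score = c(10, 20, 30)
)
dfB <- data.frame(
  ID = c(2, 3, 4),
  Grade = c("B", "A", "C")
)

# merge them on the column "ID"
merged_data <- merge(x = dfA, y = dfB, by = "ID")
print(merged_data)
```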
## ID Score Grade
## 1 2 20 B
## 2 3 30 A
The resulting data frame has not just put both data frames next to each other. Rather, the common column, which we specified using the by argument, has not been repeated. By default, merge() returns only the rows where the keys (IDs) match in both data frames. Thus, we end up with rows for ID=2 and ID=3 only, since ID=1 is missing in dfB and ID=4 is missing in dfA.
There are several variations. You might want to keep all rows from dfA (even if no matching ID exists in dfB). Or you could keep all rows from dfB, even if no match exists in dfA. Or you might prefer to keep all rows from both data frames, whether or not they have a match.
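These variations are controlled by the all.x, all.y and all arguments of merge(). A sketch of the three calls, in the order of the three outputs that follow (dfA and dfB are repeated so the block runs on its own; the names keep.A, keep.B and keep.all are just my choice):

```r
dfA <- data.frame(ID = c(1, 2, 3), Score = c(10, 20, 30))
dfB <- data.frame(ID = c(2, 3, 4), Grade = c("B", "A", "C"))

# keep all rows from dfA, matched or not
keep.A <- merge(x = dfA, y = dfB, by = "ID", all.x = TRUE)
print(keep.A)

# keep all rows from dfB, matched or not
keep.B <- merge(x = dfA, y = dfB, by = "ID", all.y = TRUE)
print(keep.B)

# keep all rows from both data frames
keep.all <- merge(x = dfA, y = dfB, by = "ID", all = TRUE)
print(keep.all)
```

Wherever a row has no partner in the other data frame, R fills the gap with NA.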
## ID Score Grade
## 1 1 10 <NA>
## 2 2 20 B
## 3 3 30 A
## ID Score Grade
## 1 2 20 B
## 2 3 30 A
## 3 4 NA C
## ID Score Grade
## 1 1 10 <NA>
## 2 2 20 B
## 3 3 30 A
## 4 4 NA C
Now we were lucky that we had two different variables in each data frame: Score and Grade. But would the world have continued turning if we had had Grade twice, once in each data frame? If you are feeling particularly adventurous, you can find out!
dfA <- data.frame(
ID = c(1, 2, 3),
Grade = c(10, 20, 30)
)
dfB <- data.frame(
ID = c(2, 3, 4),
Grade = c("B", "A", "C")
)
# merge them on the column "ID"
merged_data <- merge(x = dfA, y = dfB, by = "ID")
print(merged_data)
## ID Grade.x Grade.y
## 1 2 20 B
## 2 3 30 A
The merge() function has added a suffix, to keep the world from collapsing. If you think you can do it better, you are right. For once.
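Doing it better means choosing the suffixes yourself, through the suffixes argument of merge(). A sketch, assuming the suffixes ".A" and ".B" that show up in the output below:

```r
dfA <- data.frame(ID = c(1, 2, 3), Grade = c(10, 20, 30))
dfB <- data.frame(ID = c(2, 3, 4), Grade = c("B", "A", "C"))

# merge on "ID", and pick our own suffixes for the two Grade columns
merged_data <- merge(x = dfA, y = dfB, by = "ID", suffixes = c(".A", ".B"))
print(merged_data)
```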
## ID Grade.A Grade.B
## 1 2 20 B
## 2 3 30 A
As advertised, the suffixes argument fixes it!
There’s one really important thing that I omitted when I discussed functions earlier on in Section 2.9: the concept of a generic function. Now that we are all up to speed about the different classes of variables, it is time to fill that gap. It does not really fit in the More Data Handling chapter, but hey.
The thing that makes generics different from the other functions is that
their behaviour changes, often quite dramatically, depending on the
class() of the input you give it. The easiest way to explain the
concept is with an example. With that in mind, let us take a closer look
at what the print() function actually does.37 I’ll do this by creating a
formula, and printing it out in a few different ways. First, let’s stick
with what we know:
my.formula <- blah ~ blah.blah # create a variable of class "formula"
print( my.formula ) # print it out using the generic print() function
## blah ~ blah.blah
## <environment: 0x00000236bb752fe8>
So far, there’s nothing very surprising here. But there’s actually a lot
going on behind the scenes here. When I type print( my.formula ), what
actually happens is the print() function checks the class of the
my.formula variable. When the function discovers that the variable
it’s been given is a formula, it does the thing it does to formulas.
If you’re curious to know how R would have printed my.formula ignoring
the fact that it is a formula, you can force R to display its generic
functioning by using the print.default() function, which tells R to stop all the special things it does by recognizing that the thing it needs to print is a formula:
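A sketch of that call (the formula is created again so the block runs on its own; the environment line in the output will look different on your machine):

```r
my.formula <- blah ~ blah.blah   # a variable of class "formula"
print.default( my.formula )      # print it while ignoring its class
```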
## blah ~ blah.blah
## attr(,"class")
## [1] "formula"
## attr(,".Environment")
## <environment: 0x00000236bb752fe8>
Hm. You can kind of see that it is trying to print out the same formula,
but there’s a bunch of ugly low-level details that have also turned up
on the screen. This is because the print.default() method doesn’t know
anything about formulas, and doesn’t know that it’s supposed to be
hiding the obnoxious internal gibberish that R produces sometimes.
As a second example, remember the garden data frame? If we ask R to print it, we get
print(garden)
## speaker utterance mood
## case.1 upsydaisy pip 2
## case.2 upsydaisy pip 1
## case.3 upsydaisy onk 1
## case.4 upsydaisy onk 3
## case.5 tombliboo ee 2
## case.6 tombliboo oo 2
## case.7 makkapakka pip 1
## case.8 makkapakka pip 1
## case.9 makkapakka onk 2
## case.10 makkapakka onk 3
R has neatly organized everything when printing, adapting its behavior because it recognizes
garden to be a data frame. If we stop R from being so sensitive to class, and make it print without taking the data frame nature into account, we get
print.default(garden)
## $speaker
## [1] "upsydaisy" "upsydaisy" "upsydaisy" "upsydaisy" "tombliboo"
## [6] "tombliboo" "makkapakka" "makkapakka" "makkapakka" "makkapakka"
##
## $utterance
## [1] "pip" "pip" "onk" "onk" "ee" "oo" "pip" "pip" "onk" "onk"
##
## $mood
## [1] 2 1 1 3 2 2 1 1 2 3
##
## attr(,"class")
## [1] "data.frame"
So the general point is that, while print() will always, well, print, what it prints exactly depends on what class the input is. What happens is that, if we use print(garden), R is smart enough to recognize garden as being a data frame, and instead of using the print() function, it automatically (but also sneakily) uses the print.data.frame() function.
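You can check this by calling print.data.frame() yourself. The sketch below rebuilds garden from the printout above; how the data frame was originally created is an assumption on my part, only the contents are taken from the output.

```r
# rebuild the garden data frame from the values shown above
garden <- data.frame(
  speaker   = c(rep("upsydaisy", 4), rep("tombliboo", 2), rep("makkapakka", 4)),
  utterance = c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk"),
  mood      = c(2, 1, 1, 3, 2, 2, 1, 1, 2, 3),
  row.names = paste("case", 1:10, sep = ".")
)

class( garden )             # "data.frame", which is what print() looks at
print.data.frame( garden )  # call the data frame method directly
```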
## speaker utterance mood
## case.1 upsydaisy pip 2
## case.2 upsydaisy pip 1
## case.3 upsydaisy onk 1
## case.4 upsydaisy onk 3
## case.5 tombliboo ee 2
## case.6 tombliboo oo 2
## case.7 makkapakka pip 1
## case.8 makkapakka pip 1
## case.9 makkapakka onk 2
## case.10 makkapakka onk 3
There’s no difference in the output at all compared to print(garden). But this shouldn’t surprise you because it was actually the print.data.frame() method that was doing all the hard work in the first place. The print() function itself is a lazy bastard that doesn’t do anything other than select which of the methods is going to do the actual printing. So when the function discovers that the variable it’s been given is a data frame, it goes looking for a function called print.data.frame(), and then delegates the whole business of printing out the variable to the print.data.frame() function. You won’t need to understand the details at all for this book, but you do need to know the gist of it; if only because a lot of the functions we’ll use are actually generics.
If you have done and understood all of this, you can find some additional exercises in the RIntroExercises document. Since you now know how to work in RStudio, why not go there and work through them?
Up to this point in the book, I’ve tried hard to avoid using the word “programming” too much because – at least in my experience – it’s a word that can cause a lot of fear. For one reason or another, programming (like mathematics and statistics) is often perceived by people on the “outside” as a black art, a magical skill that can be learned only by some kind of super-nerd. I think this is a shame. It’s certainly true that advanced programming is a very specialised skill: several different skills actually since there’s quite a lot of different kinds of programming out there. However, the basics of programming aren’t all that hard, and you can accomplish a lot of very impressive things just using those basics.
With that in mind, the goal of this chapter is to discuss a few basic programming concepts and how to apply them in R. However, before I do, I want to make one further attempt to point out just how non-magical programming really is, via one very simple observation: you already know how to do it. Stripped to its essentials, programming is nothing more (and nothing less) than the process of writing out a bunch of instructions that a computer can understand. To phrase this slightly differently, when you write a computer program, you need to write it in a programming language that the computer knows how to interpret. R is one such language. Although I’ve been having you type all your commands at the command prompt, and all the commands in this book so far have been shown as if that’s what I was doing, it’s also quite possible (and as you’ll see shortly, shockingly easy) to write a program using these R commands. In other words, if this is the first time reading this book, then you’re only one short chapter away from being able to legitimately claim that you can program in R, albeit at a beginner’s level.
In this section, I want to talk about functions again. Functions were introduced in Section 2.9 but since you are a programmer now, we can talk about them in more detail. In particular, I want to show you how to create your own. After this you will no longer be the meager peasant who is forced to work with whatever function the R developers (or the package developers) have found the grace to provide to you, but you will be the king of your own little universe.
Here’s the syntax that you use to create a function:
FNAME <- function ( ARG1, ARG2, ETC ) {
STATEMENT1
STATEMENT2
ETC
return( VALUE )
}
What this does is create a function with the name FNAME, which has arguments ARG1, ARG2 and so forth. Whenever the function is called, R executes the statements in the curly braces and then outputs the contents of VALUE to the user.
To give a simple example of this, let’s create a function called
quadruple() which multiplies its inputs by four.
Exercise: Run the code and see what happens.
quadruple <- function(x) {
y <- x*4
return(y)
}
Nothing appears to have happened! You can’t see it in the browser, but
what did happen is that there is a new object created in the workspace
called quadruple.
Exercise: Ask R to tell us what kind of object quadruple is using
the class() function.
class(quadruple)
It tells us that it is a function.
And now that we’ve created the quadruple() function, we can call it
just like any other function. And if we want to store the output as a
variable, we can do this:
my.var <- quadruple(10)
my.var
## [1] 40
Functions are another place where the print() function might prove its right to exist. Say we want to have a glimpse of the internal workings and see what x is (which is weird, because we provided it as input, but hey). Let’s try two things.
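A sketch of the two attempts, assuming an input of 10 as before. The function names quadruple.silent and quadruple.loud are made up for this illustration.

```r
# attempt 1: just type x inside the function
quadruple.silent <- function(x) {
  x             # inside a function, this line displays nothing
  y <- x*4
  return(y)
}

# attempt 2: explicitly print x inside the function
quadruple.loud <- function(x) {
  print(x)      # this does show x
  y <- x*4
  return(y)
}

quadruple.silent(10)  # shows only the result
quadruple.loud(10)    # shows x first, then the result
```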
## [1] 40
## [1] 10
## [1] 40
Unlike in your console, adding x to the function does not show said x. You need print(x) for that.
Now that we know how to create our own functions in R, it’s probably a good idea to talk a little more about some of the other properties of functions that I’ve been glossing over.
Exercise: To start with, let’s take this opportunity to type the name of the function without the parentheses.
As you can see, when you type the name of a function, R prints out the
underlying source code that we used to define the function in the first
place. In the case of the quadruple() function, this is quite helpful
to us – we can read this code and actually see what the function does. Besides studying the help files, searching google, and talking to a friendly AI assistant, this is another way of trying to make sense of what R does.
An important thing to recognise here is that the two internal variables that the quadruple() function makes use of, x and y, stay internal.
rm(list = ls()) #remove all
quadruple <- function(x) {
y <- x*4
return(y)
}
my.var <- quadruple(10)
my.var
## [1] 40
ls()
## [1] "my.var" "quadruple"
In our workspace, we see the quadruple() function itself, as well as the my.var variable that we just created. But no trace of x or y.
It is time for some hard talk about the funny relation functions have with the workspace.
rm(list=ls()) #clear the workspace
afun1 = function(x) {
y <- 3
result <- x + y
return( result )
}
afun1(x=10)
## [1] 13
x
## Error in eval(expr, envir, enclos): object 'x' not found
y
## Error in eval(expr, envir, enclos): object 'y' not found
When running afun1(x=10), we ask R to add x and y. We supply x as input to the function (10, in this case), and y is defined inside the function (3, in this case). Since R returns 13, it clearly has access to the correct x and y. Yet, when we ask R what x and y are, R just blanks. The reason is that x=10 is only defined in the function call, and y<-3 only in the function code. Neither of them enters the workspace.
What is happening is that, every time you call a function, R briefly creates a temporary local environment in which the function can work, and deletes it again after the calculations are complete. R does not execute the commands inside the function in the workspace: all the internal statements in the body of the function are executed in that local environment, so they remain invisible to the user. Only the value inside return() is handed back to the workspace.
You might be surprised to learn that this still produces 13!
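A sketch of the kind of code that gives this result, assuming we create a y with the value 5 in the workspace before calling afun1() again. Typing x afterwards would still give an error, because no x was ever created in the workspace.

```r
afun1 <- function(x) {
  y <- 3              # this y exists only inside the function
  result <- x + y
  return( result )
}

y <- 5                # a different y, stored in the workspace
afun1(x = 10)         # still uses the y <- 3 from inside the function
y                     # the workspace y has not been touched
```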
## [1] 13
## Error in eval(expr, envir, enclos): object 'x' not found
## [1] 5
Despite our having defined y<-5 outside of the function, R uses the y<-3 from within the function: we never supplied our value of 5 to the function, and inside the function the local y takes precedence. But also keep in mind that, outside the function, R still thinks y is 5. The y<-3 is only used temporarily while the function is executing, and is then cleared from memory! What happens in a function, stays in a function. So go wild!
Now consider this:
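A sketch of a function that fits the errors shown below, assuming the workspace contains no y. The call is wrapped in try() only so that the script keeps running after the error; typing afun2(x = 10) on its own gives the same error message.

```r
if (exists("y")) rm(y)   # make sure there is no y left in the workspace

afun2 <- function(x) {
  result <- x + y        # y is neither an argument nor defined in here
  return( result )
}

outcome <- try( afun2(x = 10) )   # fails: object 'y' not found
```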
## Error in afun2(x = 10): object 'y' not found
## Error in eval(expr, envir, enclos): object 'x' not found
## Error in eval(expr, envir, enclos): object 'y' not found
R cannot execute this function, because it needs a y to do so and we have not supplied one, either as input or as part of the function definition.
This, however, does work.
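A sketch, assuming the same afun2() as above and a y with the value 5 created in the workspace before the call:

```r
afun2 <- function(x) {
  result <- x + y     # y is not defined inside the function
  return( result )
}

y <- 5                # but now there is a y in the workspace
afun2(x = 10)         # R finds the workspace y and uses it
y
```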
## [1] 15
## Error in eval(expr, envir, enclos): object 'x' not found
## [1] 5
So while R will first go looking for a y inside the function when needed, when it cannot find one there, it will go and look elsewhere. Despite y not being supplied as input to the function, R does use the y<-5 defined outside of the function.
This is, however, very bad practice. It is much cleaner and safer to pass y to the function as an argument:
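A sketch of the cleaner version, in which everything the function needs comes in through its arguments (the name afun3 is my own choice):

```r
afun3 <- function(x, y) {
  result <- x + y     # both x and y are supplied as input
  return( result )
}

afun3(x = 10, y = 5)
```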
## [1] 15
Okay, now that we are starting to get a sense for how functions are constructed, let’s have a look at two, slightly more complicated functions that I’ve created. Let’s start by looking at the first one:
## --- functionexample2.R
pow <- function( x, y = 1) {
out <- x^y # raise x to the power y
return( out )
}
As you can see from looking at the code for this function, it has two
arguments x and y, and all it does is raise x to the power of y.
For instance, this command
pow( x = 3, y = 2 )
## [1] 9
calculates the value of \(3^2\). The interesting thing about this function
isn’t what it does, since R already has perfectly good mechanisms for
calculating powers. Rather, notice that when I defined the function, I
specified y=1 when listing the arguments? That’s the default value for
y. So if we enter a command without specifying a value for y, then
the function assumes that we want y=1:
pow( x = 3 )
## [1] 3
However, since I didn’t specify any default value for x when I defined the pow() function, we always need to input a value for x. If we don’t, R will spit out an error message.
Exercise: Try it!
So now you know how to specify default values for an argument.
The other thing I should point out while I’m on the topic of function arguments is the use of the ...
argument. The ... argument is a special construct in R which is only
used within functions. It is used as a way of matching against multiple
user inputs: in other words, ... is used as a mechanism to allow the
user to enter as many inputs as they like. I won’t talk at all about the
low-level details of how this works, but I will show you a simple
example of a function that makes use of it. To that end, consider the
following function:
## --- functionexample3.R
doubleMax <- function( ... ) {
max.val <- max( ... ) # find the largest value in ...
out <- 2 * max.val # double it
return( out )
}
You can type in as many inputs as you like. The doubleMax() function
identifies the largest value in the inputs, by passing all the user
inputs to the max() function, and then doubles it.
Exercise: Enter a few different values in the doubleMax() function
to see what it does.
# for example:
doubleMax(1, 2, 5)
The fact that the arguments don’t enter the workspace can have consequences that for novices are either deeply confusing or straight up hilarious, depending on your sense of humor.
For example, this code should be easy to follow:
## [1] 7 7 7 7 7 7 7 7 7
Both functions rep() and round() have an argument called x. Noice.
This code, however, seems wrong, but isn’t
## [1] 7 7 7 7 7 7 7 7 7
There are two different x’s being used as the argument, but since they don’t enter the workspace, there is no clash of exes.
Of course, this also works, if you want to avoid the 2 many exes problem.
## [1] 7 7 7 7 7 7 7 7 7
My only point is that 2 many exes should never be a problem. Don’t let anyone shame you for your bodycount!
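To make the three situations concrete, here is a sketch of what such calls could look like. The input value 7.2 and the name rounded are made up; any number that rounds to 7 would give the output shown above.

```r
# 1. two steps: round() and rep() each have an argument called x
rounded <- round( x = 7.2 )
rep( x = rounded, times = 9 )

# 2. one step: two x's in one command, each belonging to its own function
rep( x = round( x = 7.2 ), times = 9 )

# 3. no argument names at all, so no x in sight
rep( round( 7.2 ), 9 )
```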
There’s a lot of other details to functions that I’ve hidden in my description in this chapter. Experienced programmers will wonder exactly how the “scoping rules” work in R,38 or want to know how to use a function to create variables in other environments39, or will wonder if function objects can be assigned as elements of a list40 and probably hundreds of other things besides. However, I don’t want to have this discussion get too cluttered with details, so I think it’s best – at least for the purposes of the current book – to stop here.
Remember how, a long time ago, I described how a script works? Well, it was a tiny bit of a lie. Specifically, it’s not necessarily the case that R starts at the
top of the file and runs straight through to the end of the file. For all the scripts that we’ve seen so far, that’s exactly what happens, and unless you insert some commands to explicitly alter how the script runs, that is what will always happen. However, you actually have quite a lot of flexibility in this respect. Depending on how you write the script, you can have R repeat several commands, or skip over different commands, and so on. This topic is referred to as flow control, and the first concept to discuss in this respect is the idea of a loop. The basic idea is very simple: a loop is a block of code (i.e., a sequence of commands) that R will execute over and over again until some termination criterion is met. Looping is a very powerful idea. There are three different ways to construct a loop in R, based on the while, for and repeat functions. I’ll only discuss the first two in this book.
while loop
A while loop is a simple thing. The basic format of the loop looks
like this:
while ( CONDITION ) {
STATEMENT1
STATEMENT2
ETC
}
The code corresponding to CONDITION needs to produce a logical value,
either TRUE or FALSE. Whenever R encounters a while statement, it
checks to see if the CONDITION is TRUE. If it is, then R goes on to
execute all of the commands inside the curly brackets, proceeding from
top to bottom as usual. However, when it gets to the bottom of those
statements, it moves back up to the while statement. Then, like the
mindless automaton it is, it checks to see if the CONDITION is TRUE.
If it is, then R goes on to execute all … well, you get the idea. This
continues endlessly until at some point the CONDITION turns out to be
FALSE. Once that happens, R jumps to the bottom of the loop (i.e., to
the } character), and then continues on with whatever commands appear
next in the script.
To start with, let’s keep things simple, and use a while loop to
calculate the smallest multiple of 17 that is greater than or equal to
1000. This is a very silly example since you can actually calculate it
using simple arithmetic operations, but the point here isn’t to do
something novel. The point is to show how to write a while loop.
Here’s the script:
x <- 0
while ( x < 1000 ) {
  x <- x + 17
}
x
When we run this script, R starts at the top and creates a new variable
called x and assigns it a value of 0. It then moves down to the loop,
and “notices” that the condition here is x < 1000. Since the current
value of x is zero, the condition is true, so it enters the body of
the loop (inside the curly braces). There’s only one command here41
which instructs R to increase the value of x by 17. R then returns to
the top of the loop and rechecks the condition. The value of x is now
17, but that’s still less than 1000, so the loop continues. This cycle
will continue for a total of 59 iterations, until finally x reaches a
value of 1003 (i.e., \(59 \times 17 = 1003\)). At this point, the loop
stops, and R finally reaches line 5 of the script, prints out the value
of x on screen, and then halts.
Exercise: Run the while loop and watch what happens.
x <- 0
while ( x < 1000 ) {
x <- x + 17
}
x
Truly fascinating stuff.
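As mentioned, the same number can be found with simple arithmetic. The sketch below adds a counter to the loop (the name n.iterations is my own) and compares the result with the direct calculation:

```r
x <- 0
n.iterations <- 0                   # count how often the loop body runs
while ( x < 1000 ) {
  x <- x + 17
  n.iterations <- n.iterations + 1
}
x                                   # 1003
n.iterations                        # 59

# the same answer without a loop: round 1000/17 up, then multiply by 17
ceiling( 1000 / 17 ) * 17           # 1003
```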
for loop
The for loop is also pretty simple, though not quite as simple as the
while loop. The basic format of this loop goes like this:
for ( VAR in VECTOR ) {
STATEMENT1
STATEMENT2
ETC
}
In a for loop, R runs a fixed number of iterations. We have a VECTOR
which has several elements, each one corresponding to a possible value
of the variable VAR. In the first iteration of the loop, VAR is given a
value corresponding to the first element of VECTOR; in the second
iteration of the loop VAR gets a value corresponding to the second value
in VECTOR; and so on. Once we’ve exhausted all of the values in VECTOR,
the loop terminates and the flow of the program continues down the
script.
Once again, let’s use some very simple examples. Firstly, here is a program that just prints out the word “hello” three times and then stops:
for ( i in 1:3 ) {
  print( "hello" )
}
This is the simplest example of a for loop. The vector of possible
values for the i variable just corresponds to the numbers from 1 to 3.
Not only that, the body of the loop doesn’t actually depend on i at
all. Not surprisingly, here’s what happens when we run it.
Exercise: Run the for loop and watch what happens.
for ( i in 1:3 ) {
print( "hello" )
}
However, there’s nothing that stops you from using something non-numeric
as the vector of possible values, as the following example illustrates.
This time around, we’ll use a character vector to control our loop,
which in this case will be a vector of words. And what we’ll do in the
loop is get R to convert the word to upper case letters, calculate the
length of the word, and print it out. Here’s the script (note that it
uses the toupper() function, which converts lowercase letters to
uppercase):
## --- forexample2.R
#the words
words <- c("it","was","the","dirty","end","of","winter")
#loop over the words
for ( w in words ) {
w.length <- nchar( w ) # calculate the number of letters
W <- toupper( w ) # convert the word to upper case letters
msg <- paste( W, "has", w.length, "letters" ) # a message to print
print( msg ) # print it
}
Exercise: Run the for loop and watch what happens.
#the words
words <- c("it","was","the","dirty","end","of","winter")
#loop over the words
for ( w in words ) {
w.length <- nchar( w ) # calculate the number of letters
W <- toupper( w ) # convert the word to upper case letters
msg <- paste( W, "has", w.length, "letters" ) # a message to print
print( msg ) # print it
}
Again, pretty straightforward I hope.
Note that we can use whatever we want as index (i.e., VAR). This loop
for ( i in c(2,4,6) ) {
print(3 + i)
}
## [1] 5
## [1] 7
## [1] 9
is exactly the same as
for ( dude_can_you_play_a_song_with_a_flipping_beat in c(2,4,6)) {
print(3 + dude_can_you_play_a_song_with_a_flipping_beat)
}
## [1] 5
## [1] 7
## [1] 9
The i, which is often used in a for loop, is just a dumb letter.
Note how in the above example, we asked R to do a computation for every i or dude_can_you_play_a_song_with_a_flipping_beat, and we asked it to print the result. However, if we would like to store that result, we would not want to copy and paste the output from the console, which can be fun in its own way, but is not recommended. So let’s see how we can change the for loop to make it store stuff.
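A first attempt, sketched under the assumption that we keep looping over c(2,4,6) and simply assign the result to a variable called x instead of printing it:

```r
for ( i in c(2,4,6) ) {
  x <- 3 + i    # store the result instead of printing it
}
x
```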
## [1] 9
Nice, but not quite. x has only stored the last result, not the full history. At each iteration of the loop, x is overwritten, so the final x is the one corresponding to the final i.
Maybe we could try this:
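A sketch of this second attempt, assuming we start without any x in the workspace. The loop is wrapped in try() only so that the script can continue after the error; running the loop on its own gives the same error message.

```r
if (exists("x")) rm(x)     # start without an x in the workspace

outcome <- try(
  for ( i in c(2,4,6) ) {
    x[i] <- 3 + i          # store the result as element i of x
  }
)
```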
## Error in x[i] <- 3 + i: object 'x' not found
## Error in eval(expr, envir, enclos): object 'x' not found
Mmmmh. R balks. At the first iteration of the loop, i is 2. So what we ask R to do in the statement x[i] <- 3 + i is to compute 3 + 2, and store that as the second element of x. What is this x, you speak of, says R? It’s like me asking you to put your laptop in the second room of house X. You surely can not do that if I don’t first tell you what house X is.
So let’s tell R what x is. In fact, we are playing it cool, and only tell R that it exists (which, by a fancy word, is referred to as initializing it), but don’t tell it what it contains. The reason is that R will find that out itself, each time it executes the x[i] <- 3 + i line. Saying something without saying anything can be done using NULL.
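A sketch of the same loop, now with x initialized as NULL before the loop starts:

```r
x <- NULL                  # x exists, but is still empty
for ( i in c(2,4,6) ) {
  x[i] <- 3 + i            # store the result as element i of x
}
x
```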
## [1] NA 5 NA 7 NA 9
Yeah! Sort of. x now stores all the results of the x[i] <- 3 + i line. But it stores it at weird places. Just like we asked, it stores it at the 2nd, 4th and 6th place. It would be nicer if it would store it at the 1st, 2nd and 3rd place. Here’s how to do that:
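One way to do this, sketched below, is to let i run over the positions 1, 2, 3 and to look up the value that belongs to each position. The name values is my own choice; other solutions are possible.

```r
values <- c(2,4,6)                  # the numbers we want to loop over
x <- NULL                           # x exists, but is still empty
for ( i in 1:length(values) ) {     # i is now 1, 2, 3
  x[i] <- 3 + values[i]             # store the i-th result at place i
}
x
```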
## [1] 5 7 9
Hurray!
Exercise: Now it’s your turn. Write a for loop that iterates from 1
to 6, printing the square of each number using the print() function.
# Start by merely writing the framework of the for loop, before moving on to the calculation that will be done inside the loop
for (i in 1:6) {
}
# Remember how to calculate the square of a number in R? How do we generalise this for all iterations inside the loop?
for (i in 1:6) {
print(i^2)
}
We will be doing a lot more of these for loops when we are doing simulations.
To give you a sense of how you can use a loop in a more complex situation, let’s write a simple script to simulate the progression of a mortgage. Suppose we have a nice young couple who borrow $300000 from the bank, at an annual interest rate of 5%. The mortgage is a 30-year loan, so they need to pay it off within 360 months total. Our happy couple decides to set their monthly mortgage payment at $1600 per month. Will they pay off the loan in time or not? Only time will tell.42 Or, alternatively, we could get R to tell us. The script to run this is a fair bit more complicated.
To keep the code simple, we first need one small calculation. The couple is making monthly payments of $1600, at an annual interest rate of 5%. This means that, each month, their outstanding balance is multiplied by 1.05^(1/12). 43
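To get a feel for this multiplier, the sketch below computes it and checks that applying it twelve times in a row indeed amounts to one year of 5% interest. The last line shows the interest charged in the very first month, before any payment is made.

```r
interest <- 0.05                               # 5% interest rate per year
monthly.multiplier <- (1 + interest)^(1/12)    # monthly growth factor

monthly.multiplier                             # roughly 1.004074
monthly.multiplier^12                          # twelve months: back to 1.05

300000 * monthly.multiplier - 300000           # first month's interest, about 1222
```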
## --- mortgage.R
# set up
month <- 0 # count the number of months
balance <- 300000 # initial mortgage balance
total.paid <- 0 # track what you've paid the bank
payment <- 1600 # monthly payment
interest <- 0.05 # 5% interest rate per year
# convert annual interest to a monthly multiplier
monthly.multiplier <- (1+interest) ** (1/12)
# keep looping until the loan is paid off...
while ( balance > 0 ) {
# do the calculations for this month
month <- month + 1 # one more month
balance <- balance * monthly.multiplier # add the interest
balance <- balance - payment # make the payment
total.paid <- total.paid + payment # track the total paid
} # end of loop
total.paid
month
To explain what’s going on, let’s go through it carefully. In the first
block of code (under #set up) all we’re doing is specifying all the
variables that define the problem. The loan starts with a balance of
$300,000 owed to the bank on month zero, and at that point in time
the total.paid money is nothing. The couple is making monthly
payments of $1600, at an annual interest rate of 5%, which we convert
into the associated monthly.multiplier.
The interesting part (such as it is) is
the loop. The while statement tells R that it needs to keep looping
until the balance reaches zero (or less, since it might be that the
final payment of $1600 pushes the balance below zero). Then, inside the
body of the loop, we do all the number crunching. Firstly we increase the value of
month by 1. Next, the bank charges the interest, so the balance goes
up. Then, the couple makes their monthly payment and the balance goes
down. Finally, we keep track of the total amount of money that the
couple has paid so far, by adding the payment to the running tally.
The key thing here is the tension between the increase in
balance (in the line balance <- balance * monthly.multiplier) and
the decrease in balance (in the line balance <- balance - payment).
As long as the decrease is bigger, then the balance will eventually drop
to zero and the loop will eventually terminate. If not, the loop will
continue forever! This is actually very bad programming on my part: I
really should have included something to force R to stop if this goes on
too long. However, I haven’t shown you how to evaluate “if” statements
yet (you have to wait till Section XXX), so we’ll just have to hope that I have rigged
the example so that the code actually runs. Anyway, assuming that the loop does eventually terminate,
there are two last lines of code that print out the total amount of money
that the couple handed over to the bank over the lifetime of the loan and the number of months it took.
Now that I’ve explained everything in the code in tedious detail…
Exercise: Run the while loop and see what happens.
# set up
month <- 0 # count the number of months
balance <- 300000 # initial mortgage balance
payments <- 1600 # monthly payments
interest <- 0.05 # 5% interest rate per year
total.paid <- 0 # track what you've paid the bank
# convert annual interest to a monthly multiplier
monthly.multiplier <- (1+interest) ** (1/12)
# keep looping until the loan is paid off...
while ( balance > 0 ) {
# do the calculations for this month
month <- month + 1 # one more month
balance <- balance * monthly.multiplier # add the interest
balance <- balance - payments # make the payments
total.paid <- total.paid + payments # track the total paid
} # end of loop
total.paid
month
So our nice young couple has paid off their $300,000 loan just 4 months shy of the 30-year term of their loan, at a bargain-basement price of $569,600. A happy ending!
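About that safety net I should have included: one way to add it without an if statement is to put a second condition in the while statement, so that the loop also stops after a maximum number of months. A sketch, in which the name max.months and the cap of 1000 months are arbitrary choices of mine:

```r
# set up
month <- 0            # count the number of months
balance <- 300000     # initial mortgage balance
total.paid <- 0       # track what you've paid the bank
payment <- 1600       # monthly payment
interest <- 0.05      # 5% interest rate per year
max.months <- 1000    # safety cap: never loop longer than this

# convert annual interest to a monthly multiplier
monthly.multiplier <- (1 + interest)^(1/12)

# keep looping until the loan is paid off, or until the cap is reached
while ( balance > 0 && month < max.months ) {
  month <- month + 1                       # one more month
  balance <- balance * monthly.multiplier  # add the interest
  balance <- balance - payment             # make the payment
  total.paid <- total.paid + payment       # track the total paid
}

total.paid
month
```

With a payment of $1600 the cap is never reached and the results are the same as before. With a payment that is too small to cover the monthly interest, the loop would now stop after 1000 months instead of running forever.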
In addition to providing the explicit looping structures via while and for, R also provides a collection of functions for implicit
loops. What I mean by this is that these are functions that carry out operations very similar to those that you’d normally use a loop for. However, instead of typing out the whole loop, the whole thing is done with a single command. The main reason why this can be handy is that, due to the way that R is written, these implicit looping functions are usually able to do the same calculations much faster than the corresponding explicit loops. In most applications that beginners might want to undertake, this probably isn’t very important, since most beginners tend to start out working with fairly small data sets and don’t usually need to undertake extremely time-consuming number crunching. However, because you often see these functions referred to in other contexts, it may be useful to very briefly discuss a few of them.
In fact, I can be very brief about it, since we have already discussed this in Section XXX. For example, consider the by() function. We have used it as follows:
## [1] 10 12 9 11 13
## [1] "male" "male" "female" "female" "male"
## gender: female
## [1] 10
## ------------------------------------------------------------
## gender: male
## [1] 11.66667
In some sense, by() has been doing a loop for us:
unique_genders <- unique(gender)
mean_ages <- NULL
# loop over each unique gender
for (i in 1:length(unique_genders)) {
  mean_ages[i] <- mean(age[gender==unique_genders[i]])
}
mean_ages
## [1] 11.66667 10.00000
So, yeah, thanks, by().
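To make the idea of an implicit loop a bit more concrete, here is a minimal sketch using two other functions from the same family, tapply() and sapply(). The age and gender values are copied from the output shown above; the variable name mean_by_gender is just something I made up for this illustration.

```r
# Same toy data as in the by() output above
age    <- c(10, 12, 9, 11, 13)
gender <- c("male", "male", "female", "female", "male")

# tapply() splits age by gender and applies mean() to each group
mean_by_gender <- tapply(age, gender, mean)
mean_by_gender

# sapply() over the unique genders stays closer to the explicit for loop
sapply(unique(gender), function(g) mean(age[gender == g]))
```

Both commands do the looping for you: there is no counter variable and no empty vector to set up first.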
A second kind of flow control that programming languages provide is the
ability to evaluate conditional statements. Unlike loops, which
can repeat over and over again, a conditional statement only executes
once, but it can switch between different possible commands depending on
a CONDITION that is specified by the programmer. The power of these
commands is that they allow the program itself to make choices, and in
particular, to make different choices depending on the context in which
the program is run. The most prominent example of a conditional
statement is the if statement, and the accompanying else statement.
The basic format of an if statement in R is as follows:
if ( CONDITION ) {
  STATEMENT1
  STATEMENT2
  ETC
}
And the execution of the statement is pretty straightforward. If the
CONDITION is true, then R will execute the statements contained in the
curly braces. If the CONDITION is false, then it does not. If you want
to, you can extend the if statement to include an else statement as
well, leading to the following syntax:
if ( CONDITION ) {
  STATEMENT1
  STATEMENT2
  ETC
} else {
  STATEMENT3
  STATEMENT4
  ETC
}
As you’d expect, the interpretation of this version is similar. If the CONDITION is true, then the contents of the first block of code (i.e., STATEMENT1, STATEMENT2, ETC) are executed; but if it is false, then the contents of the second block of code (i.e., STATEMENT3, STATEMENT4, ETC) are executed instead.
You can expand this logic as follows:
if ( CONDITION ) {
  STATEMENT1
  STATEMENT2
  ETC
} else if (ANOTHER CONDITION) {
  STATEMENT3
  STATEMENT4
  ETC
} else if (YET ANOTHER CONDITION) {
  STATEMENT5
  STATEMENT6
  ETC
} else {
  STATEMENT7
  STATEMENT8
  ETC
}
What will you see in this example?
score1 <- 60
score2 <- 40
if (score2 > score1 & score1 < 40){
  result <- "Ha"
} else if (score2 > score1 & score1 > 40){
  result <- "Ku"
} else if (score1 > score2 | score1 == 40){
  result <- "Na"
} else {
  result <- "Ma"
}
result
## [1] "Na"
A particularly useful function to make conditional statements is the
ifelse() function. I low-key forgot exactly how it works, so I looked
it up in the R help file, ?ifelse. Here is what it says in the
description: “ifelse returns a value with the same shape as test which
is filled with elements selected from either yes or no depending on
whether the element of test is TRUE or FALSE.” Even though I sort of
know what it does, this is mostly gibberish. So I looked at the examples
in that very same help file:
## Warning in sqrt(x): NaNs produced
## [1] 2.449490 2.236068 2.000000 1.732051 1.414214 1.000000 0.000000 NaN
## [9] NaN NaN NaN
## [1] 2.449490 2.236068 2.000000 1.732051 1.414214 1.000000 0.000000 NA
## [9] NA NA NA
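The commands behind this output are not shown above. As far as I can tell, the output matches the example in the ?ifelse help page, which looks like this (the comments are mine):

```r
# Example as it appears in ?ifelse (comments added)
x <- c(6:-4)
sqrt(x)                        # warning: negative values give NaN
sqrt(ifelse(x >= 0, x, NA))    # no warning: negatives are replaced by NA first
```

The first call takes the square root of negative numbers, hence the warning. The second call uses ifelse() to swap the negative numbers for NA before sqrt() ever sees them.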
Oh, please, dear R. Why do you have to make things so complicated? This isn’t gonna help. Let’s cook up our own example.
## [1] 36 25 16 9 4 1 0 1 2 3 4 5 6
So yeah, this gives me an idea of what it does. The ifelse() function
takes three arguments, test, yes and no, and works element by element. For each element of test that is TRUE, the function
returns the corresponding element of the second or yes argument. For each element of test
that is FALSE, the function returns the corresponding element of the
third or no argument.
So the full statement is as follows:
## [1] 36 25 16 9 4 1 0 1 2 3 4 5 6
## [1] 36 25 16 9 4 1 0 1 2 3 4 5 6
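The calls that produced these two outputs are not shown here. A sketch of what they might look like, assuming that x holds the integers from -6 to 6 (as in the for loop version that follows), once with positional arguments and once with the argument names spelled out:

```r
x <- -6:6   # assumed input, matching the for loop version

# positional arguments: test, yes, no
ifelse(x >= 0, x, x^2)

# the same call with named arguments
result <- ifelse(test = x >= 0, yes = x, no = x^2)
result
```

Non-negative values are returned as they are, and negative values are squared.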
We can do the same with the if … else structure, combined with a for loop:
x <- -6:6
result <- NULL
for (i in 1:length(x)){
  if ( x[i] >= 0 ) {
    result[i] <- x[i]
  } else {
    result[i] <- x[i]^2
  }
}
result
## [1] 36 25 16 9 4 1 0 1 2 3 4 5 6
There is another way of making conditional
statements in R. In particular, the
switch() function can be very useful in different contexts. However,
my main aim in this chapter is to briefly cover the very basics, so I’ll
move on.
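Just so you have seen it once, here is a minimal sketch of switch(). The function pick_summary() and its type argument are hypothetical names I made up for this illustration; they are not part of R.

```r
# Hypothetical helper: choose a summary statistic by name
pick_summary <- function(x, type) {
  switch(type,
         mean   = mean(x),
         median = median(x),
         stop("unknown type: ", type))   # unnamed last argument is the default
}

pick_summary(c(1, 2, 10), "mean")
pick_summary(c(1, 2, 10), "median")
```

You can think of switch() as a compact alternative to a chain of if … else if … else statements that all test the same variable.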
Visualising data is one of the most important tasks facing the data analyst. It’s important for two distinct but closely related reasons. Firstly, there’s the matter of drawing “presentation graphics”: displaying your data in a clean, visually appealing fashion makes it easier for your reader to understand what you’re trying to tell them. Equally important, perhaps even more important, is the fact that drawing graphs helps you to understand the data. To that end, it’s important to draw “exploratory graphics” that help you learn about the data as you go about analysing it. These points might seem pretty obvious, but I cannot count the number of times I’ve seen people forget them.
The goal is to show you how to create basic graphs in R. The graphs themselves tend to be pretty straightforward, so in that respect, this chapter is pretty simple. Where people usually struggle is learning how to produce graphs, and especially, learning how to produce good graphs.44 The focus is on how to make these plots. The when and the why and the why not have been discussed in Statistics 1 and will not be repeated here.
Fortunately, learning how to draw graphs in R is both extremely simple
and extremely hard.
You are fortunate enough that
R has a lot of very good graphing functions, and most of the time you
can produce a clean, high-quality graphic without having to learn very
much about the low-level details of how R handles graphics. As long as
you’re not too picky about what your graph looks like, it is almost
trivial. You make a histogram using hist(), a box plot using boxplot(),
etc. It can almost not get any simpler. But. Unfortunately, on those
occasions when you do want to do something non-standard, or if you need
to make highly specific changes to the figure, you actually do need to
learn a fair bit about these details; and those details are both
complicated and boring. So doing something decent is ridiculously easy;
doing something great is terrifyingly difficult.
The goal of this chapter is to teach you how to make quick-and-dirty graphs. I will show you how to make a basic graph and make a few adjustments. Making good graphs will require a lot more than just a few adjustments, and a lot more explanation. If you ever need that, you will have to go on a hunt yourself to find the right handles to tweak.
Before I discuss any specialised graphics, let’s start by drawing a few
very simple graphs just to get a feel for what it’s
like to draw pictures using R. To that end, let’s create a small vector
Fibonacci that contains a few numbers we’d like R to draw for us.
Then, we’ll ask R to plot() those numbers.
Exercise: Ask R to plot the Fibonacci numbers, by providing the
data set to the function plot().
plot(Fibonacci)
As you can see, what R has done is plot the values stored in the
Fibonacci variable on the vertical axis (y-axis) and the corresponding
index on the horizontal axis (x-axis). In other words, since the 4th
element of the vector has a value of 3, we get a dot plotted at the
location (4,3). That’s pretty straightforward, and the image is probably
pretty close to what you would have had in mind when I suggested that we
plot the Fibonacci data.
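The command that creates the Fibonacci vector is not shown above. A minimal sketch, assuming the vector holds the first seven Fibonacci numbers (the text only confirms that the 4th element is 3, so the exact length is my assumption):

```r
# Assumed contents: the first seven Fibonacci numbers
Fibonacci <- c(1, 1, 2, 3, 5, 8, 13)

Fibonacci[4]      # the 4th element, plotted at x = 4
plot(Fibonacci)   # index on the x-axis, value on the y-axis
```

Because only one vector is handed to plot(), R uses the positions 1, 2, 3, … as the x-values.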
However, there are quite a lot of customisation options available to you, so we should probably spend a bit of time looking at some of those options.
You can easily customise the appearance of the actual
plot! To start with, let’s look at one of the most important options
that the plot() function provides for you to use, which is the
type argument. The type argument specifies the visual style of the
plot. The possible values for this are:
type = "p". Draw the points only.
type = "h". Draw “histogram-like” vertical bars.
type = "s". Draw a staircase, going horizontally then vertically.
type = "S". Draw a Staircase, going vertically then horizontally.
type = "b". Draw both points and lines, but don’t overplot.
type = "o". Draw the line over the top of the points.
type = "c". Draw only the connecting lines from the “b” version.
type = "l". Draw a line through the points.
type = "n". Draw nothing. (Apparently this is useful sometimes?)
The simplest way to illustrate what each of these really looks like is
just to draw them. To that end, Figure 10.1 shows the
same Fibonacci data, drawn using eight different types of plot. As you
can see, by altering the type argument you can get a qualitatively
different appearance to your plot. In other words, as far as R is
concerned, the only difference between a scatterplot
and a line plot is
that you draw a scatterplot by setting type = "p" and you draw a line
plot by setting type = "l".
Figure 10.1: Changing the type of the plot.
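The code behind Figure 10.1 is not shown. If you want to draw something similar yourself, here is one possible sketch. It assumes the Fibonacci values given in the comment, picks eight of the nine types (leaving out "n"), and uses par(mfrow = ...) to put the plots in a 2 by 4 grid; the actual figure may have been made differently.

```r
Fibonacci <- c(1, 1, 2, 3, 5, 8, 13)   # assumed values

# eight of the nine types; "n" is left out because it draws nothing
types <- c("p", "o", "h", "l", "b", "s", "S", "c")

old_par <- par(mfrow = c(2, 4))        # 2 rows, 4 columns of plots
for (tp in types) {
  plot(Fibonacci, type = tp, main = paste0("type = '", tp, "'"))
}
par(old_par)                           # restore the previous layout
```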
The basic plots R produces are ok, but you should not always settle for ok. R offers many handles to tweak to customise your plots. They might feel daunting at first but 1) they are pretty self-explanatory and 2) they work not just for the plot() function, but also for many other graphical functions we will encounter below. So they are worth studying.
One of the first things that you’ll find yourself wanting to do when customising your plot is to label it better. You might want to specify more appropriate axis labels, add a title or add a subtitle. The arguments that you need to specify to make this happen are:
main. A character string containing the main title.
sub. A character string containing the subtitle.
xlab. A character string containing the x-axis label.
ylab. A character string containing the y-axis label.
Exercise: Let’s have a look at what happens when we make use of all these arguments. Here’s the command.
plot(x = Fibonacci,
main = "You specify title using the 'main' argument",
sub = "The subtitle appears here! (Use the 'sub' argument for this)",
xlab = "The x-axis label is 'xlab'",
ylab = "The y-axis label is 'ylab'"
)
It’s more or less as you’d expect. The plot itself is identical to the one we drew in the previous exercise, except for the fact that we’ve changed the axis labels, and added a title and a subtitle.
Another thing you might want to have control over is setting the limits of the axes.
xlim and ylim. The axis scales. Generally, R
does a pretty good job of figuring out where to set the edges of the
plot. However, you can override its choices by setting the xlim and
ylim arguments. For instance, if I decide I want the vertical scale of
the plot to run from 0 to 100, then I’d set ylim = c(0, 100).
Exercise: Have a look yourself.
plot(x = Fibonacci, # the data
main = "You specify title using the 'main' argument",
sub = "The subtitle appears here! (Use the 'sub' argument for this)",
xlab = "The x-axis label is 'xlab'",
ylab = "The y-axis label is 'ylab'",
xlim = c(0, 15), # expand the x-scale
ylim = c(0, 15) # expand the y-scale
)
The axis scales on both the horizontal and vertical dimensions have been expanded. Nice.
Even so, there are a couple of interesting features worth calling your attention to. Firstly, notice that the subtitle is drawn below the plot, which I personally find annoying; as a consequence I almost never use subtitles. You may have a different opinion, of course, but the important thing is that you remember where the subtitle actually goes. Secondly, notice that R has decided to use boldface text and a large font size for the title. This is one of my most hated default settings in R graphics since I feel that it draws too much attention to the title. Generally, while I do want my reader to look at the title, I find that the R defaults are a bit overpowering, so I often like to change the settings.
In Section 10.2.1 we talked about a group of graphical parameters that are related to the formatting of titles, axis labels etc. The second group of parameters I want to discuss are those related to the formatting of the plot itself:
pch: Plot character type. The plot
character parameter is a number, usually between 0 and 24. What it does is
tell R what symbol to use to draw the points that it plots. The
simplest way to illustrate what the different values do (i.e., how they relate to the character types used to plot points) is with a
picture. Figure 10.2 shows the first 25 plotting
characters. The default plotting character is a hollow circle (i.e.,
pch = 1). You don’t need to know these numbers by heart! If you encounter someone who does, try to initiate a friendly conversation about setting priorities in life.
Figure 10.2: Changing the plotted characters
cex: Plot character size. Font size
is handled in a slightly curious way in R. Instead of using some absolute size, it uses a magnification value, which is referred to as “cex” (short for “character expansion”). So this parameter describes a character
expansion factor (i.e., magnification) for the plotted
characters such as points. By default cex=1, but if you want bigger symbols in
your graph you should specify a larger value.
lty: Line type. The line type parameter describes the
kind of line that R draws (if you ask it to draw a line, that is). It has seven values which you can specify using a number between 0 and 6, or using a meaningful character string: "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". Note that the "blank" version (value 0) just means that R doesn’t draw the lines at all. The other six versions are shown in Figure 10.3. You don’t need to know these numbers or strings by heart.
Figure 10.3: Line types
lwd: Line width. The next graphical parameter in this category
that I want to mention is the line width parameter,
which is just a number specifying the width of the line. The default
value is 1. Not surprisingly, larger values produce thicker lines
and smaller values produce thinner lines.
col: Colour of the plot. For the
plot() function it’s pretty simple: the col argument
refers to the colour of the points and/or lines that get drawn! The simplest way to specify this parameter is using a character string: e.g., col = "blue". Conveniently, R has a very large number of named colours (type
colours() to see a list of over 650 colour names that R knows), so
you can use the English language name of the colour to select
it.45 Examples are "red", "gray25", and "springgreen4"
(yes, R really does recognise four different shades of “spring
green”).
Exercise: To illustrate what you can do by altering these parameters, let’s try the following command.
plot(x = Fibonacci,
type = "b",
col = "blue",
pch = 19,
cex=5,
lty=2,
lwd=4)
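If you are curious what the different pch values look like without flipping back to Figure 10.2, here is a small sketch that draws the symbols for pch = 0 to pch = 24 in one row, with the number printed underneath. The layout choices (the ylim and the position of the numbers) are arbitrary and just my own.

```r
# Draw the 25 plotting characters, pch = 0 to pch = 24, in one row
symbols_to_show <- 0:24
plot(x = symbols_to_show, y = rep(1, 25),
     pch = symbols_to_show,    # a different symbol for every point
     cex = 2,                  # make the symbols a bit bigger
     ylim = c(0, 2),
     ann = FALSE, axes = FALSE)

# print the pch number below each symbol
text(x = symbols_to_show, y = 0.5, labels = symbols_to_show, cex = 0.7)
```

Note that pch, like most of these parameters, accepts a vector, so each point can get its own symbol.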
There are several other possibilities worth discussing.
las: Orientation of the axis labels. I presume that the name of
this parameter is an acronym of label style or something
along those lines; but what it actually does is govern the
orientation of the text used to label the individual tick marks
(i.e., the numbering, not the xlab and ylab axis labels). There
are four possible values for las: A value of 0 means that the
labels of both axes are printed parallel to the axis itself (the
default). A value of 1 means that the text is always horizontal. A
value of 2 means that the labelling text is printed at right angles
to the axis. Finally, a value of 3 means that the text is always
vertical. You don’t need to study these values.
ann: Suppress labelling. This is a logical-valued argument that
you can use if you don’t want R to include any annotations, such as text for a title,
subtitle or axis label. To do so, set ann = FALSE. This will stop
R from including any text that would normally appear in those
places. Note that this will override any of your manual titles. For
example, if you try to add a title using the main argument, but
you also specify ann = FALSE, no title will appear.
axes: Suppress axis drawing. Again, this is a logical-valued
argument. Suppose you don’t want R to draw any axes at all. To
suppress the axes, all you have to do is add axes = FALSE. This
will remove the axes and the numbering, but not the axis labels
(i.e. the xlab and ylab text).46
frame.plot: Include a framing box. Suppose you’ve removed the
axes by setting axes = FALSE, but you still want to have a simple
box drawn around the plot; that is, you only wanted to get rid of
the numbering and the tick marks, but you want to keep the box. To
do that, you set frame.plot = TRUE.
Exercise: To illustrate what you can do by altering these parameters, let’s try the following command.
plot(x = Fibonacci, # the data
ann = FALSE, # delete all annotations
axes = FALSE, # delete the axes
frame.plot = TRUE # but include a framing box
)
The output is pretty much exactly as you’d expect.
The axes have been suppressed (on account of axes being FALSE) as
have the annotations (on account of ann being FALSE), but we’ve kept a
box around the plot (on account of frame.plot being TRUE).
Before moving on, I should point out that there are several more graphical
parameters. Some relate to the axes, the box and the general appearance of
the plot, and allow finer-grained control over the appearance of the axes
and the annotations. Others let you
customise the font style, size, and so on. I will let you figure those out by yourself if you ever need them. Most of them speak for themselves. For example, knowing what cex and main do, it should not come as a surprise that cex.main controls the font size of the title.
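As a small illustration of such a parameter, here is a sketch that tones down the title. It assumes the Fibonacci values given in the comment, and it uses font.main next to cex.main; font.main is a graphical parameter not discussed in this chapter, and it controls whether the title is plain or bold.

```r
Fibonacci <- c(1, 1, 2, 3, 5, 8, 13)   # assumed values

plot(x = Fibonacci,
     main = "A quieter title",
     cex.main = 1,     # title at normal size (the default is 1.2)
     font.main = 1     # plain text instead of bold (the default is 2)
)
```

The same naming pattern applies elsewhere: cex.lab and cex.axis, for instance, control the size of the axis labels and of the tick mark numbering.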
Although you will often end up with commands that are quite long, it’s not complicated: all that setting these arguments does is override a bunch of the default parameter values. The only difficult aspect is that you have to remember what each of these parameters is called, and what all the different values are (unless I told you that you don’t need to remember them, of course).
Scatterplots are a simple but effective tool for visualising data.
We’ve already seen scatterplots in this chapter when using the plot()
function to draw the Fibonacci variable as a collection of dots
(Section 10.1). However, for the purposes of this
section, I have a slightly different notion in mind. Instead of just
plotting one variable, what I want to do with my scatterplot is to
display the relationship between two variables.
It’s this latter application that we usually have in mind when we use
the term “scatterplot”. In this kind of plot, each observation
corresponds to one dot: the horizontal location of the dot plots the
value of the observation on one variable, and the vertical location
displays its value on the other variable. In many situations, you don’t
really have clear opinions about what the causal relationship is
(e.g., does A cause B, or does B cause A, or does some other variable C
control both A and B). If that’s the case, it doesn’t really matter
which variable you plot on the x-axis and which one you plot on the
y-axis. However, in many situations, you do have a pretty strong idea
which variable you think is most likely to be causal, or at least you
have some suspicions in that direction. If so, then it’s conventional to
plot the cause variable on the x-axis, and the effect variable on the
y-axis.
To do so, let’s turn to a topic close to every parent’s heart: sleep.
The following data set is fictitious but based on real events. Suppose
I’m curious to find out how much my infant son’s sleeping habits affect
my mood. Let’s say that I can rate my grumpiness very precisely, on a
scale from 0 (not at all grumpy) to 100 (grumpy as a very, very grumpy
old man). And, let us also assume that I’ve been measuring my
grumpiness, my sleeping patterns and my son’s sleeping patterns for
quite some time now. Let’s say, for 100 days. And, being a nerd, I’ve
saved the data as a data frame called parenthood.
If we peek at the data using head(), here’s what we get:
##   dan.sleep baby.sleep dan.grump day
## 1      7.59      10.18        56   1
## 2      7.91      11.66        60   2
## 3      5.14       7.92        82   3
## 4      7.71       9.61        55   4
## 5      6.68       9.75        67   5
## 6      5.99       5.04        72   6
We see that the data frame called
parenthood contains four variables dan.sleep, baby.sleep,
dan.grump and day.
Suppose my goal is to draw a scatterplot displaying the relationship
between the amount of sleep that I get (dan.sleep) and how grumpy I am
the next day (dan.grump). As you might expect given our earlier use of
plot() to display the Fibonacci data, the function that we use is
the plot() function. We just need to specify the name of the variable to be
plotted on the x axis and the name of the variable to be plotted on the y axis, using arguments x and y. I’m sure you can derive and remember which argument does what.
Exercise: Create a scatterplot showing the relationship between the
amount of sleep that Dan gets and how grumpy he is the next day. Plot
dan.sleep on the x axis and dan.grump on the y axis.
plot( x = parenthood$dan.sleep, # data on the x-axis
y = parenthood$dan.grump # data on the y-axis
)
If we do this, the result is a very basic scatterplot. This serves fairly well, but there are a few customisations that we probably want to make in order to have this work properly. As usual, we want to add some labels, but there are a few other things we might want to do as well. Firstly, it’s sometimes useful to rescale the plots. In the scatterplot you just created, R has selected the scales so that the data fall neatly in the middle. But, in this case, we happen to know that the grumpiness measure falls on a scale from 0 to 100, and the hours slept fall on a natural scale between 0 hours and about 12 or so hours (the longest I can sleep in real life).
Exercise: Run the following command, to see how we might draw this.
plot( x = parenthood$dan.sleep, # data on the x-axis
y = parenthood$dan.grump, # data on the y-axis
xlab = "My sleep (hours)", # x-axis label
ylab = "My grumpiness (0-100)", # y-axis label
xlim = c(0,12), # scale the x-axis
ylim = c(0,100), # scale the y-axis
pch = 20, # change the plotting character
col = "gray50", # dim the dots slightly
frame.plot = FALSE # don't draw a box
)
Sometimes it can be very useful to draw a line. Don’t just tolerate any unacceptable behavior, you know? While hard to do in real life, drawing lines in R is easy. It just involves using the function lines(). Mind you: this is
not a separate argument within the plot() function, but really a
function of its own.
Quite conveniently, the arguments that I need to
specify are pretty much the exact same ones that I use when calling the
plot() function. That is, suppose that I want to draw a line that goes
from the point (4,93) to the point (9.5,37). Then the x locations can
be specified by the vector c(4,9.5) and the y locations correspond
to the vector c(93,37). In other words, I use this command:
plot( x = parenthood$dan.sleep, # data on the x-axis
y = parenthood$dan.grump, # data on the y-axis
xlab = "My sleep (hours)", # x-axis label
ylab = "My grumpiness (0-100)", # y-axis label
xlim = c(0,12), # scale the x-axis
ylim = c(0,100), # scale the y-axis
pch = 20, # change the plotting character
col = "gray50", # dim the dots slightly
frame.plot = FALSE # don't draw a box
)
lines( x = c(4,9.5), # the horizontal locations
y = c(93,37), # the vertical locations
lwd = 2 # line width
)
Figure 10.4: A scatterplot with scatter plot specific customisations
And when I do so, R plots the line over the top of the plot that I drew using the previous command.
Note that while the lines() function is a
function of its own, it does somewhat depend on the plot() function,
in the sense that if you use the lines() function in the void, without
making a plot first, it will refuse to do so!
Another way to draw lines is to use the abline() function. Rather than
using coordinates as input, it uses the intercept (a) and slope (b)
as input (hence its name). It is especially useful for drawing
horizontal or vertical lines, for which you should not use the intercept and slope, but rather the h or v argument, indicating where the horizontal or vertical line should go.
plot( x = parenthood$dan.sleep, # data on the x-axis
y = parenthood$dan.grump, # data on the y-axis
xlab = "My sleep (hours)", # x-axis label
ylab = "My grumpiness (0-100)", # y-axis label
xlim = c(0,12), # scale the x-axis
ylim = c(0,100), # scale the y-axis
pch = 20, # change the plotting character
col = "gray50", # dim the dots slightly
frame.plot = FALSE # don't draw a box
)
abline(h = 60, # draw a horizontal line at y=60
v = 8 # draw a vertical line at x=8
)
Figure 10.5: A scatterplot with scatter plot specific customisations
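The example above only uses the h and v arguments. To show the intercept and slope in action as well, here is a sketch that redraws the line from (4,93) to (9.5,37) using abline(). Since the full parenthood data set is not reproduced here, the sketch uses only the six rows shown in the head() output; the variable names sleep, grump, slope and intercept are my own.

```r
# First six rows of parenthood, copied from the head() output
sleep <- c(7.59, 7.91, 5.14, 7.71, 6.68, 5.99)
grump <- c(56, 60, 82, 55, 67, 72)

# slope and intercept of the line through (4, 93) and (9.5, 37)
slope     <- (37 - 93) / (9.5 - 4)
intercept <- 93 - slope * 4

plot(x = sleep, y = grump,
     xlim = c(0, 12), ylim = c(0, 100), pch = 20)
abline(a = intercept,   # where the line crosses x = 0
       b = slope,       # change in y per unit of x
       lty = 2)
```

One difference with lines() is that abline() draws the line across the whole plotting region, rather than stopping at the two points.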
If you don’t want to add a full line, but just one or more points, R is
ready to help you out with a function of its own, which is called (I
am sure you could have guessed) points(). Like lines() and
abline(), it is a stand-alone function but only so-so, in that R will
only know what to do with it in the context of a plot. Like lines(),
it uses coordinates as input. So if you, for whatever reason, would like
to highlight the start and endpoints of the line you drew earlier, you
could do something like this:
plot( x = parenthood$dan.sleep, # data on the x-axis
y = parenthood$dan.grump, # data on the y-axis
xlab = "My sleep (hours)", # x-axis label
ylab = "My grumpiness (0-100)", # y-axis label
xlim = c(0,12), # scale the x-axis
ylim = c(0,100), # scale the y-axis
pch = 20, # change the plotting character
col = "gray50", # dim the dots slightly
frame.plot = FALSE # don't draw a box
)
lines( x = c(4,9.5), # the horizontal locations
y = c(93,37), # the vertical locations
lwd = 2 # line width
)
points( x = c(4,9.5), # the horizontal locations
y = c(93,37), # the vertical locations
pch = 5 # the symbol
)
Figure 10.6: A scatterplot with scatter plot specific customisations
Now that we’ve tamed (or possibly fled from) the beast that is R graphical parameters, let’s talk more seriously about some real-life graphics that you’ll want to draw. We begin with the humble pie chart.
I don’t have solid advice on the usefulness of pie charts. I just wanna
show you how it’s done. I
will use the afl.finalists variable.
The afl.finalists variable contains the names of all 400 teams that played in all 200
finals matches played during the period 1987 to 2010.
What I want to do is draw a pie chart that displays the percentage of
finals that each team has played in over the time spanned by the afl
data set.
The good news is that you need a function that is called pie(). Easy, right? The bad news is that this doesn’t work:
## Error in pie(afl.finalists): 'x' values must be positive.
Rather, we need to convert our data to frequencies, which can be done using the table() or tabulate() functions. Let’s have a look first.
## [1] 26 25 26 28 32 6 39 27 28 28 17 6 24 26 38 24
## afl.finalists
##         Adelaide         Brisbane          Carlton      Collingwood
##               26               25               26               28
##         Essendon        Fremantle          Geelong         Hawthorn
##               32                6               39               27
##        Melbourne  North Melbourne    Port Adelaide         Richmond
##               28               28               17                6
##         St Kilda           Sydney       West Coast Western Bulldogs
##               24               26               38               24
So we created a (named) vector containing the number of finals that each team has played in the afl.finalists data.
Now we bake:


The only difference is that tabulate() gives us just the numbers (so that is all pie() has to work with), whereas table() gives numbers and labels that pie() can use.
You can, however, control the labels if you so desire, using the conveniently named labels argument.
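Since the commands for this part are not shown, here is a sketch of the whole sequence. I don’t reproduce the real afl.finalists data here; instead the sketch uses a small made-up factor called finalists as a stand-in, so the counts below are purely illustrative.

```r
# Hypothetical stand-in for afl.finalists: a small factor of team names
finalists <- factor(c("Geelong", "Carlton", "Geelong",
                      "Sydney", "Carlton", "Geelong"))

tabulate(finalists)   # counts only, in the order of the factor levels
table(finalists)      # the same counts, with the level names attached

pie(x = tabulate(finalists))   # no names available: slices are numbered
pie(x = table(finalists))      # slices are labelled with the team names

# supplying the labels yourself via the labels argument
pie(x = tabulate(finalists), labels = levels(finalists))
```

The last command gives the same picture as pie(x = table(finalists)), which shows that the labels argument is all that separates the two versions.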

Here’s a bit of a fancier version, with percentages included in the
labels and a clockwise organisation:
percentages <- round(100*prop.table(table(afl.finalists)),0)
pie_labels <- paste0(names(percentages), " (", percentages,"%)")
pie(x = table(afl.finalists), labels = pie_labels, clockwise=TRUE)
Making it more readable is needed but hard, but I’m not gonna bother.
Another form of graph that you often want to plot is the bar
graph. I’ll use the
afl.finalists variable.
The main function that you can use in R to draw them is the barplot()
function.
Disappointingly, but unsurprisingly if you paid any attention when reading about the pie() function, the following command does not work:
## Error in barplot.default(afl.finalists): 'height' must be a vector or a matrix
Exercise: Draw a bar graph using the barplot() function. The main
argument that you need to specify for a bar graph is the frequencies.
barplot( tabulate(afl.finalists) )
As you can see, R has drawn a pretty minimal plot. It doesn’t have any
labels, obviously, because we didn’t actually tell the barplot()
function what the labels are! To do this, we need to specify the
names.arg argument. The names.arg argument needs to be a vector of
character strings containing the text that needs to be used as the label
for each of the items. So I’m obviously going to need the team names to create
some labels, so let’s create a variable with those. We’ll do this using
the levels() function, which outputs the names of all the levels of a
factor (see Section 6.4).
Exercise: Use the levels() function to obtain the names of all the
levels in afl.finalists. Save the result in a variable named teams
and print the result.
teams <- levels( afl.finalists )
teams
Okay, so now that we have the information we need, let’s draw our bar graph again.
Exercise: Add the names.arg argument to the command used in the
previous exercise, to indicate the labels of the bar graph.
barplot( tabulate(afl.finalists), names.arg = teams)
We could have saved ourselves some effort and just done this, using
table() instead of tabulate():
barplot( table(afl.finalists) )

Anyhoo, this is an improvement, but not much of an improvement. R has
only included a few of the labels because it can’t fit them in the plot.
The fact that barplot() has omitted the names of every team in between
Adelaide and Fitzroy is somewhat problematic.
The simplest way to fix this is to rotate the labels so that the text
runs vertically rather than horizontally. To do this, we need to set the
las parameter, which I discussed briefly in Section
10.1.
Exercise: Using the command of the previous exercise, add an
argument telling R to rotate the text so that it’s always perpendicular
to the axes. This can be done with las = 2.
barplot( table(afl.finalists), # the frequencies
names.arg = teams, # the labels
las = 2) # rotate the labels
We’ve fixed the problem, but we’ve created a new one: the axis labels don’t quite fit anymore. To fix this, we have to be a bit cleverer again. A simple fix would be to use shorter names rather than the full name of all teams, and in many situations, that’s probably the right thing to do. However, at other times you really do need to create a bit more space to add your labels. I am not gonna go into detail on how to do that. Just know you can play with the space, if needed or desired.
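In case you do want to play with the space, here is one possible starting point. It uses the mar graphical parameter, which sets the size of the four margins (bottom, left, top, right) in lines of text. The factor finalists is a made-up stand-in for afl.finalists, and the value 10.1 for the bottom margin is just a guess that you would need to tune.

```r
# Hypothetical stand-in for afl.finalists
finalists <- factor(c("North Melbourne", "Western Bulldogs",
                      "Port Adelaide", "North Melbourne"))

# enlarge the bottom margin; the default is c(5.1, 4.1, 4.1, 2.1)
old_par <- par(mar = c(10.1, 4.1, 4.1, 2.1))

barplot(table(finalists), las = 2)   # rotated labels now have room

par(old_par)   # put the margins back to what they were
```

Unlike the arguments we have seen so far, mar has to be set through par() before drawing the plot, which is why the sketch saves and restores the old settings.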
Histograms are one of the simplest and most useful ways
of visualising data. They make the most sense when you have an interval
or ratio scale (e.g., the afl.margins data from Chapter
??) and what you want to do is get an overall
impression of the data.
Most of you probably know how histograms work, since they’re so widely
used, but for the sake of completeness, I’ll describe them. All you do
is divide up the possible values into bins, and then count the
number of observations that fall within each bin. This count is referred
to as the frequency of the bin, and is displayed as a bar: in the AFL
winning margins data, there are 33 games in which the winning margin was
less than 10 points. Drawing this histogram in R is pretty
straightforward. The function you need to use is called hist(), and it
has pretty reasonable default settings.
Exercise: Create a histogram of afl.margins using the hist()
function.
hist( x = afl.margins )
Although this image would need a lot of cleaning up in order to make a good presentation graphic (i.e., one you’d include in a report), it nevertheless does a pretty good job of describing the data. In fact, the big strength of a histogram is that (properly used) it does show the entire spread of the data, so you can get a pretty good sense about what it looks like. The downside to histograms is that they aren’t very compact: unlike some of the other plots I’ll talk about, it’s hard to cram 20-30 histograms into a single image without overwhelming the viewer.
The main subtlety that you need to be aware of when drawing histograms
is determining where the breaks that separate bins should be located,
and (relatedly) how many breaks there should be. In the histogram you
just created, you can see that R has made pretty sensible choices all by
itself: the breaks are located at 0, 10, 20, … 120, which is exactly
what I would have done had I been forced to make a choice myself. On the
other hand, consider the two histograms in Figure 10.7 and
10.8, which I produced using the following two commands:
hist( x = afl.margins, breaks = 3 )
Figure 10.7: A histogram with too few bins
hist( x = afl.margins, breaks = 0:116 )
Figure 10.8: A histogram with too many bins
In Figure 10.8, the bins are only 1 point wide. As a result, although the plot is very informative (it displays the entire data set with no loss of information at all!) the plot is very hard to interpret and feels quite cluttered. On the other hand, the plot in Figure 10.7 has a bin width of 50 points, and has the opposite problem: it’s very easy to “read” this plot, but it doesn’t convey a lot of information. One gets the sense that this histogram is hiding too much. In short, the way in which you specify the breaks has a big effect on what the histogram looks like, so it’s important to make sure you choose the breaks sensibly. In general, R does a pretty good job of selecting the breaks on its own, since it makes use of some quite clever tricks that statisticians have devised for automatically selecting the right bins for a histogram, but nevertheless, it’s usually a good idea to play around with the breaks a bit to see what happens.
There is one fairly important thing to add regarding how the breaks
argument works. There are two different ways you can specify the breaks.
You can either specify how many breaks you want (which is what I did
when I typed breaks = 3) and let R figure out where they should go, or
you can provide a vector that tells R exactly where the breaks should be
placed (which is what I did when I typed breaks = 0:116). The
behaviour of the hist() function is slightly different depending on
which version you use. If all you do is tell it how many breaks you
want, R treats it as a “suggestion”, not as a demand. It assumes you
want “approximately 3” breaks, but if it doesn’t think that this would
look very pretty on screen, it picks a different (but similar) number.
It does this for a sensible reason – it tries to make sure that the
breaks are located at sensible values (like 10) rather than stupid ones
(like 7.224414). And most of the time R is right: usually, when a human
researcher says “give me 3 breaks”, he or she really does mean “give me
approximately 3 breaks, and don’t put them in stupid places”. However,
sometimes R is dead wrong. Sometimes you really do mean “exactly 3
breaks”, and you know precisely where you want them to go. So you need
to invoke “real person privilege”, and order R to do what it’s bloody
well told. In order to do that, you have to input the full vector that
tells R exactly where you want the breaks. If you do that, R will go
back to behaving like the nice little obedient calculator that it’s
supposed to be. Good boy!
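The difference between a suggestion and a demand can be checked without drawing anything. The sketch below uses a made-up vector margins (hypothetical, not the real afl.margins data) and inspects the breaks that hist() ends up using in each case; the break points in my.breaks are chosen purely for illustration.

```r
# Made-up data (hypothetical; NOT the real afl.margins data)
margins <- c(3, 8, 10, 12, 19, 25, 27, 31, 44, 56, 71, 116)

# A single number is only a suggestion: R picks "pretty" break points
h.suggest <- hist(x = margins, breaks = 3, plot = FALSE)
print(h.suggest$breaks)

# A vector is a demand: R uses exactly these break points
my.breaks <- c(0, 40, 80, 120)
h.exact <- hist(x = margins, breaks = my.breaks, plot = FALSE)
print(h.exact$breaks)
print(h.exact$counts)
```

When you supply a vector, it has to cover the whole range of the data, otherwise hist() refuses to cooperate.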
This will be the most meaningless plot you will ever have encountered in your statistics education, but it serves an important goal: showing that the lines() and points() functions don’t just work with the plot() function, but with several other functions, including hist().
hist( x = afl.margins )
lines( x = c(5,50), # the horizontal locations
y = c(20,1), # the vertical locations
lwd = 2 # line width
)
Figure 10.9: A histogram with a line
Okay, so at this point, we can draw a basic histogram, and we can alter
the number and even the location of the breaks. However, the visual
style of the histograms shown could stand to be improved. We
can fix this by making use of some of the other arguments to the
hist() function. Most of the things you might want to try doing have
already been covered in Section 10.1, such as main and las and the like, but there are several other new things you could do. I will, however, only discuss one.
One important argument is the labels argument, which lets you attach labels to each of the bars. The simplest way to do this is to set labels = TRUE, in which case R will add a number just above each bar, that number being the exact number of observations in the bin. Alternatively, you can choose the labels yourself by inputting a vector of strings, e.g., labels = c("label 1","label 2","etc"), but we won’t use that for now. We will see labels at work below.
In some sense, a histogram is more like a pet: you can give it a name! You do that by assigning the output of hist() to a variable, which we will call q. (Well, you can do that for other plot types as well, but there it is not as useful.)

Besides doing that for affectionate reasons, there is actually quite a good reason to do so. Let’s see what it holds for us:
## [1] "histogram"
## $breaks
## [1] 0 10 20 30 40 50 60 70 80 90 100 110 120
##
## $counts
## [1] 38 23 27 23 21 14 10 7 6 3 3 1
##
## $density
## [1] 0.0215909091 0.0130681818 0.0153409091 0.0130681818 0.0119318182
## [6] 0.0079545455 0.0056818182 0.0039772727 0.0034090909 0.0017045455
## [11] 0.0017045455 0.0005681818
##
## $mids
## [1] 5 15 25 35 45 55 65 75 85 95 105 115
##
## $xname
## [1] "afl.margins"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
Drawing a histogram is more than just making a plot. R did some computations for us by grouping the data in bins and counting. The histogram shows the results visually, but q also stores the numbers that R computed in order to draw the histogram, including the breakpoints and the counts. The counts bit shows the numbers we would see when we set labels to TRUE. Further, note that, conveniently, our little histogram q is of the class histogram.

While most arguments of hist() behave like those of the other drawing functions, there is one important argument that deserves a bit of explanation: freq. It takes the values TRUE and FALSE, and (typically; there is a bit more nuance that I am glossing over) defaults to TRUE. Let’s see what it does:


Both histograms look exactly the same, but they are different. Let’s bring out the differences. To do so, we make use of the labels argument.


It turns out that when freq = TRUE, R plots counts or frequencies, and when it is FALSE it does not. But what does it plot when freq = FALSE? The answer is densities, as can be seen from this:
## [1] 0.0215909091 0.0130681818 0.0153409091 0.0130681818 0.0119318182
## [6] 0.0079545455 0.0056818182 0.0039772727 0.0034090909 0.0017045455
## [11] 0.0017045455 0.0005681818
But what is a density (dichtheid, in Dutch)? For the purposes of this book, the density is nothing but the count divided by the total number of games divided by the space between the breaks.
n <- length(afl.margins) #number of games
sb <- 10 #space between breaks
#sb <- unique(diff(q$mids)) #if you'd like to compute it; but you can ignore this if you wish
q$counts/n/sb
## [1] 0.0215909091 0.0130681818 0.0153409091 0.0130681818 0.0119318182
## [6] 0.0079545455 0.0056818182 0.0039772727 0.0034090909 0.0017045455
## [11] 0.0017045455 0.0005681818
## [1] 0.0215909091 0.0130681818 0.0153409091 0.0130681818 0.0119318182
## [6] 0.0079545455 0.0056818182 0.0039772727 0.0034090909 0.0017045455
## [11] 0.0017045455 0.0005681818
This description of density is somewhat incomplete, because there can be cases where there are different spaces between breaks. But you can ignore that subtlety. Just remember that the density is the relative frequency, taking into account bin width.
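For those who do want to see the subtlety, here is a minimal sketch with unequal bin widths. It again uses a made-up vector margins (hypothetical, not the real afl.margins data) and breaks I picked for illustration. The only change to the recipe is that each count is divided by the width of its own bin.

```r
# Made-up data (hypothetical; NOT the real afl.margins data)
margins <- c(3, 8, 10, 12, 19, 25, 27, 31, 44, 56, 71, 116)

# Bins of width 10, 10, 20 and 80
uneven.breaks <- c(0, 10, 20, 40, 120)
h <- hist(x = margins, breaks = uneven.breaks, plot = FALSE)

n <- length(margins)        # number of observations
widths <- diff(h$breaks)    # width of each bin

# Density = count / total number of observations / width of that bin
manual.density <- h$counts / n / widths

print(manual.density)
print(h$density)

# The bar areas (density times width) add up to 1
print(sum(h$density * widths))
```

This is also the nuance about freq that was glossed over: when the breaks are not equally spaced, hist() plots densities rather than counts unless told otherwise.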
Histograms are one of the most widely used methods for displaying the observed values for a variable. They’re simple, pretty, and very informative. However, they do take a little bit of effort to draw. Sometimes it can be quite useful to make use of simpler, if less visually appealing, options. One such alternative is the stem and leaf plot. To a first approximation, you can think of a stem and leaf plot as a kind of text-based histogram. Stem and leaf plots aren’t used as widely these days as they were 30 years ago since it’s now just as easy to draw a histogram as it is to draw a stem and leaf plot. Not only that, they don’t work very well for larger data sets. As a consequence, you probably won’t have as much of a need to use them yourself, though you may run into them in older publications. These days, the only real-world situation where I use them is if I have a small data set with 20-30 data points and I don’t have a computer handy, because it’s pretty easy to quickly sketch a stem and leaf plot by hand.
With all that as background, let us have a look at stem and leaf plots.
The AFL margins data contains 176 observations, which is at the upper
end for what you can realistically plot this way. The function in R for
drawing stem and leaf plots is called stem().
Exercise: Draw a stem and leaf plot of the afl.margins data using
the stem() function.
stem( afl.margins )
The values to the left of the | are called stems and the values
to the right are called leaves. If you just look at the shape that
the leaves make, you can see something that looks a lot like a histogram
made out of numbers, just rotated by 90 degrees. But if you know how to
read the plot, there’s quite a lot of additional information here. In
fact, it’s also giving you the actual values of all of the
observations in the data set. To illustrate, let’s have a look at the
last line in the stem and leaf plot, namely 11 | 6. Specifically,
let’s compare this to the largest value of the afl.margins data set:
## [1] 116
Hm… 11 | 6 versus 116. Obviously, the stem and leaf plot is trying
to tell us that the largest value in the data set is 116. Similarly,
when we look at the line that reads 10 | 148, the way we interpret it
is to note that the stem and leaf plot is telling us that the data set
contains observations with values 101, 104 and 108. Finally, when we see
something like 5 | 00002233445556667 the four 0s in the stem and
leaf plot are telling us that there are four observations with value 50.
So that’s how we should interpret the mysterious “The decimal point is 1 digit(s) to the right of the |” message. It means that 11 | 6 should be read as 116 with the decimal point after the 6, so it corresponds to 116.
So if our data set had
included only the numbers .11, .15, .23, .35 and .59 and we’d drawn a
stem and leaf plot of these data, then R would move the decimal point:
the stem values would be 1,2,3,4 and 5, but R would tell you that the
decimal point has moved to the left of the | symbol.
Exercise: If you want to see this in action, try the following
command: stem( x = afl.margins / 1000 )
stem( x = afl.margins / 1000 )
The stem and leaf plot itself looks identical to the original one we
drew, except for the fact that R tells you that the decimal point has
moved. So that’s how we should interpret the mysterious “The decimal point is 2 digit(s) to the left of the |” message: 11 | 6 should be read as 116 with the decimal point before the first 1, so 0.116. Which is exactly what max(afl.margins / 1000) is.
Another alternative to histograms is a boxplot, sometimes called a
“box and whiskers” plot. Like histograms, they’re most suited to
interval or ratio scale data. The idea behind a boxplot is to provide a
simple visual depiction of the median, the interquartile range, and the
range of the data. And because they do so in a fairly compact way,
boxplots have become a very popular statistical graphic, especially
during the exploratory stage of data analysis when you’re trying to
understand the data yourself. Let’s have a look at how they work, again
using the afl.margins data as our example. You will again see that
whipping up a basic version is almost embarrassingly easy.
The easiest way to describe what a boxplot looks like is just to draw
one. The function for doing this in R is (surprise, surprise)
boxplot(). As always there’s a lot of optional arguments that you can
specify if you want, but for the most part, you can just let R choose
the defaults for you. That said, we’re going to override one of the
defaults to start with by specifying the range argument, but for the
most part, you won’t want to do this (I’ll explain why in a minute).
Exercise: Create a boxplot of afl.margins using the boxplot()
function, and specify range = 100.
boxplot( x = afl.margins, range = 100 )
What R draws is the most basic boxplot possible. When you look at this plot, this is how you should interpret it: the thick line in the middle of the box is the median; the box itself spans the range from the 25th percentile to the 75th percentile; and the “whiskers” cover the full range from the minimum value to the maximum value.
In practice, this isn’t quite how boxplots usually work. In most applications, the “whiskers” don’t cover the full range from minimum to maximum. Instead, they go out to the most extreme data point that doesn’t exceed a certain bound. By default, R will only extend the whiskers a distance of 1.5 times the interquartile range beyond the box (corresponding to a range value of 1.5), and will plot any points that fall outside that range separately.
Exercise: Create a boxplot of afl.margins using the boxplot()
function, and use the default value for range, which is 1.5.
boxplot( afl.margins ) #I don't need to specify range if I want to use its default value
For our AFL margins data, there is one observation (a game with a margin of 116 points) that falls outside this range. As a consequence, the upper whisker is pulled back to the next largest observation (a value of 108), and the observation at 116 is plotted as a circle.
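If you want to see the numbers behind the whiskers rather than the picture, the boxplot.stats() function does the computations without drawing anything; its coef argument plays the role of range. The sketch below uses a made-up vector x (hypothetical, not the AFL data) with one clearly extreme value.

```r
# Made-up data with one extreme value (hypothetical; NOT the AFL data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# Default rule: whiskers reach at most 1.5 times the box length beyond the box
default.stats <- boxplot.stats(x, coef = 1.5)
print(default.stats$stats)  # lower whisker, box bottom, median, box top, upper whisker
print(default.stats$out)    # points plotted separately

# A huge value (like range = 100 above) lets the whiskers reach the extremes
wide.stats <- boxplot.stats(x, coef = 100)
print(wide.stats$stats)
print(wide.stats$out)
```

One caveat: the edges of the box are so-called hinges, which are close to, but not always identical to, the quartiles that quantile() reports.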
Boxplots in R are extremely customisable. In addition to the usual range
of graphical parameters that you can tweak to make the plot look nice,
you can also exercise nearly complete control over every element of the
plot. The only thing that I want to say about it is that, if you ever need it, you should (obviously) consult the help file for the boxplot() function, but also that of the bxp() function, which does most of the heavy lifting. Most arguments that are described in the help for bxp() can be used when calling the boxplot() function.
curve()
The curve() function is part of a set of special creatures with some unprepossessing features, and part of me would like to just ignore it and keep it in its cage. (Toss, toss!) Another part of me, however, is eager to show it to you, and that part has two reasons. One: it shows that R can be messy sometimes, which should help you appreciate the countless other times it behaves exactly like you as a naive user would expect. And two: it is a very useful function to know about. So here goes (and here comes trouble):

You what? Some of this should not come as a surprise. We see a, well, curve, running from -10 to 10, as asked. What is baffling is that we didn’t provide any values to plot! We told R to plot the square of x, but we did not provide any x whatsoever. This must be the work of the devil.
Compare this to how the plot function works:
## Error in plot(x^2, xlim = c(-10, 10)): object 'x' not found
No x no glory, so R doesn’t plot a thing.
To make the same stuff work with the plot() function, we need to tell R what x is:

So curve() is one of those special functions that can work with x without needing to know x. Magix!
While this is wildly flexible, there are some limits to the craziness curve() can handle. This, for example, is a no-go:
## Error in curve(y^2, xlim = c(-10, 10)): 'expr' must be a function, or a call or an expression containing 'x'
The magix only worx with xs.
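To put the two approaches side by side, here is a minimal sketch. The curve() call is the one implied by the error messages above; the plot() call is just one way to get a similar picture, and the choice of 101 points is mine (it matches the default number of points curve() uses).

```r
# curve() fills in x by itself: it evaluates the expression on a grid of
# x values between the limits (101 points by default)
drawn <- curve(x^2, xlim = c(-10, 10))

# The plot() equivalent: here we have to create x ourselves
x <- seq(from = -10, to = 10, length.out = 101)
plot(x = x, y = x^2, type = "l")

# curve() invisibly returns the points it drew, so we can peek at them
print(range(drawn$x))
print(all.equal(drawn$y, drawn$x^2))
```

So there is no devil involved after all: curve() simply makes up its own x values within the limits you give it.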
These are just the most basic graphics functions in R. Much more is available, often using packages. Let’s look at a glimpse of what else you can do with R, just to pique your interest.
In Section XXX, I just wanted to show you how you can draw lines and points in a scatterplot. For the sake of illustration, I guesstimated where the line should be. In most realistic data analysis situations, you absolutely don’t want to just guess where the line through the points goes, since there are about a billion different ways in which you can get R to do a better job. However, it does at least illustrate the basic idea.
One possibility, if you do want to get R to draw nice clean lines
through the data for you, is to use the scatterplot() function in the
car package. Before we can use scatterplot() we need to load the
package:
Having done so, we can now use the function. The command we need is this one:
Figure 10.10: A fancy scatterplot drawn using the scatterplot() function in the car package.
The first two arguments should be familiar: the first input is a formula
dan.grump ~ dan.sleep telling R what variables to plot,47 and the
second specifies a data frame. The third argument smooth I’ve set to
FALSE to stop the scatterplot() function from drawing a fancy
“smoothed” trendline (since it’s a bit confusing to beginners). The
scatterplot itself is shown in Figure 10.10. As you
can see, it has not only drawn the scatterplot, but it’s also drawn
boxplots for each of the two variables, as well as a simple line of best
fit showing the relationship between the two variables.
One final thing to point out.
There’s no easy way
to tell you this, but R has several completely distinct systems for
drawing figures. In this chapter, I’ve focused on the traditional
graphics system. It’s the easiest one to get started with: you can draw
a histogram with a command as simple as hist(x).
A single high-level command is capable of drawing an entire graph,
complete with a range of customisation options. Most but not all of the
high-level commands that I’ll talk about in this book come from the
graphics package itself, and so belong to the world of traditional
graphics. These commands all tend to share a common visual style.
However, it’s not the most powerful tool for the job, and after a while,
most R users start looking to shift to fancier systems. On the other
side of the great divide, people rely heavily on two
different packages – lattice and ggplot2 – each of which provides
a quite different visual style. As you’ve probably guessed, there’s a
whole separate bunch of functions that you’d need to learn if you want
to use lattice graphics or ggplot2. Of these two, probably the most
popular graphics system is the one provided by the ggplot2 package.
It’s not for novices: you need to have a pretty good grasp of R before
you can start using it, and even then it takes a while to really get the
hang of it. But when you’re finally at that stage, it’s worth taking the
time to teach yourself, because it’s a much cleaner system for producing high quality graphs.
At this point, I think we’ve covered more than enough background material. The point that I’m trying to make by providing this discussion isn’t to scare you with all these horrible details, but rather to try to convey to you the fact that R doesn’t really provide a single coherent graphics system. Instead, R itself provides a platform, and different people have built different graphical tools using that platform. As a consequence of this fact, there are different universes of graphics and a great multitude of packages that live in them. At this stage, you don’t need to understand these complexities, but it’s useful to know that they’re there.
By now, you almost know the basics of R. I didn’t talk about statistics
much, apart from the visualisation side of it. There are a bunch of
built-in R functions that are ideally suited for doing statistical
analyses. One of those was the mean() function to, well, compute the
mean. I have collected the most commonly used statistical functions in
Table XXX. These functions are relatively easy to use, with few nuts
and bolts, so I am not going over them in detail. If you do need to know
these details (the default values for the arguments come to mind, or
the exact definition), remember that you can
find out yourself by trial-and-error, by reading the R help (e.g.,
?sd) or by looking for help online.
| Statistic | R function |
|---|---|
| mean | mean() |
| median | median() |
| range | range() |
| quantile | quantile() |
| interquartile range | IQR() |
| standard deviation | sd() |
| variance | var() |
| covariance | cov() |
| correlation | cor() |
While both the meaning and the usage of these functions are pretty self-explanatory, there are a couple of things I’d like to draw your attention to.
Table XXX should show how easy it is to use R to do a lot of useful statistical things, like computing means, medians and the like. There is, however, a sense in which things have become almost too easy, in that they hide what R is actually doing. So you might think you know what R is doing, but in reality you don’t. I present two examples.
First, the variance (and its close square rooted cousin, the standard
deviation). You probably think you know what R does when computing the
variance, but you might be wrong. In the var() function, R has chosen
to divide by \(n-1\) rather than by \(n\). Of course, if for some reason you
would need the divide-by-n version, that’s easy to compute, using
(n-1)*var(x)/n.
The more general point is that you can’t just assume you know
what R is doing. To be fair, R makes it explicit in the help file. If
you type ?var and scroll down a bit, you will read somewhere:
The denominator n-1 is used which gives an unbiased estimator of the (co)variance for i.i.d. observations.
And in this particular case, it’s just something you need to remember
that R does it the way it does.
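A quick check never hurts. The sketch below uses a small made-up vector x (hypothetical data, chosen so the numbers come out nicely) to confirm that var() divides by n-1, and to compute the divide-by-n version from it.

```r
# Made-up data (hypothetical)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# The sum of squared deviations from the mean
ss <- sum((x - mean(x))^2)

var.by.hand <- ss / (n - 1)       # what var() computes
var.n <- (n - 1) * var(x) / n     # the divide-by-n version

print(var(x))
print(var.by.hand)
print(var.n)
print(ss / n)                     # same as var.n
```

The same reasoning applies to sd(), which is just the square root of var().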
Second, the quantile. If you bother to read the associated help
file, you will see there are no fewer than 9 different algorithms in R
that could be used in quantile(). I am not going to explain what the
differences are (from lack of knowledge and desire), but if at some
point you do care about these different versions, you need to know which
one you want and which one R is using. Strictly FYI: If you don’t make a choice, R defaults to type 7, whatever that may be; and what you have learned in Statistics 1 was type 2.
If one day, you forget that you can compute the median using median(),
for example because you are now deeply ingrained with the knowledge that
with R, things are never that simple, here are two other ways you can
compute it. If you don’t know why this works, look up the
definition of quantiles and median in your Statistics course.
## [1] 30.5
## 50%
## 30.5
## [1] 30.5
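The commands behind output like this could look as follows. This is only a sketch on a small made-up vector x (hypothetical, not the afl.margins data), and the sorting approach is just one of several ways to do it by hand.

```r
# Made-up data (hypothetical; NOT the real afl.margins data)
x <- c(12, 3, 44, 27, 8, 31)

# 1. The function that is made for it
m1 <- median(x)

# 2. The median is the 50% quantile
m2 <- quantile(x, probs = 0.5)

# 3. By hand: sort, then take the middle value
#    (or the average of the two middle values when n is even)
sorted <- sort(x)
n <- length(x)
m3 <- mean(sorted[c(floor((n + 1) / 2), ceiling((n + 1) / 2))])

print(m1)
print(m2)
print(m3)
```

Note that quantile() attaches the name 50% to its result, which is why that output looks a little different from the other two.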
Similarly, what was the name of the function again to compute the range? It’s on the tip of my tongue, but it keeps escaping me! Well, I’ll just compute it myself.
## [1] 0 116
## [1] 0 116
And sorry, how on earth am I supposed to remember the name of the function for that interquartile range? But I can pull this off by myself, no worries.
## [1] 37.75
## 75%
## 37.75
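In the same spirit, here is a sketch of how the range and the interquartile range can be pieced together from simpler functions. The vector x is made up (hypothetical, not the afl.margins data), and the names my.range, quartiles and my.iqr are mine.

```r
# Made-up data (hypothetical; NOT the real afl.margins data)
x <- c(12, 3, 44, 27, 8, 31)

# The range is just the minimum and the maximum
my.range <- c(min(x), max(x))
print(my.range)
print(range(x))

# The interquartile range is the distance between the 25% and 75% quantiles
quartiles <- quantile(x, probs = c(0.25, 0.75))
my.iqr <- quartiles[2] - quartiles[1]
print(my.iqr)      # carries the name 75%, inherited from quantile()
print(IQR(x))
```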
So there’s nothing magical or even smart going on in most of these functions. They are just developed to make your life a little easier.
While all these functions are pleasantly simple, they run the risk of
being deceptively simple, in that they might hide all the cool stuff you
can do with them. As an illustration, cor() can be used to compute a
correlation, but there’s more you can do with it!
For example, you can also calculate a complete “correlation matrix”, between all pairs of variables in the data frame (at least if all these variables are numeric).48
Exercise: Correlate all pairs of variables in parenthood by
providing the whole data set as input to the cor() function.
cor( parenthood )
As another example, it doesn’t just compute the Pearson correlation,
i.e., that stuff we have been calling correlation, colloquially. For
ordinal data, the Pearson correlation is not useful, and you should use,
for example, Spearman’s rank-order correlation instead.
We can calculate Spearman’s \(\rho\) using R:
it only involves specifying the method
argument of the cor() function. I won’t go into the details here. This is
just to pique your interest and to instill the idea that these functions often can do more than meets the eye. The default value of the method argument is "pearson", which is
why we didn’t have to specify it earlier on when we were doing Pearson
correlations.
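For the curious, a minimal sketch of the method argument at work. The two vectors are made up (hypothetical numbers, not the parenthood data). The last line shows what is going on under the hood: Spearman’s correlation is the Pearson correlation computed on the ranks.

```r
# Made-up data (hypothetical; NOT the parenthood data)
hours.sleep <- c(7.5, 6.0, 8.2, 5.1, 6.8, 9.0)
grumpiness  <- c(50, 70, 41, 90, 60, 42)

# Pearson is the default, so method does not need to be specified
pearson <- cor(hours.sleep, grumpiness)

# Spearman only requires changing the method argument
spearman <- cor(hours.sleep, grumpiness, method = "spearman")

# Same thing by hand: the Pearson correlation of the ranks
by.hand <- cor(rank(hours.sleep), rank(grumpiness))

print(pearson)
print(spearman)
print(by.hand)
```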
Up to this point in the chapter, I’ve mentioned several different
summary statistics that are commonly used when analysing data, along
with specific functions that you can use in R to calculate each one.
However, it’s kind of annoying to have to separately calculate means,
medians, standard deviations, etc. Wouldn’t it be nice if R
had some helpful functions that would do all these tedious calculations
at once? Something like summary(), perhaps?
Well yes, yes it would. So much so that
this function exists. The
summary() function is in the base package, which means that it comes
with every installation of R.
The summary() function is an easy thing to use, but a tricky thing to
understand in full, since it’s a generic function (see Section
8.8.1), meaning that its behaviour changes depending on what kind of
input you give it. The basic idea behind the summary() function is that
it prints out some useful information about whatever object (i.e., a
variable, as far as we’re concerned) you specify as the object
argument. As a consequence, the behaviour of the summary() function
differs quite dramatically depending on the class of the object that you
give it. Which is a feature not a bug, since what is interesting or even relevant about you isn’t necessarily interesting or relevant about me.
Let’s start by giving it a numeric object.
Exercise: Summarize the numeric object afl.margins using the
summary() function.
summary( afl.margins )
For numeric variables, we get a whole bunch of useful descriptive statistics. It gives us the minimum and maximum values (i.e., the range), the first and third quartiles (25th and 75th percentiles; i.e., the IQR), the mean and the median. In other words, it gives us a pretty good collection of descriptive statistics related to the central tendency and the spread of the data.
Okay, what about if we feed it a logical vector instead? Let’s say I
want to know something about how many “blowouts” there were in the 2010
AFL season. I operationalise the concept of a blowout (see Chapter
??) as a game in which the winning margin exceeds 50
points. Let’s create a logical variable blowouts in which the i-th
element is TRUE if the i-th game was a blowout according to my
definition:
## [1] TRUE FALSE TRUE FALSE FALSE FALSE
So that’s what the blowouts variable looks like. Now let’s ask R for a
summary()
## Mode FALSE TRUE
## logical 132 44
In this context, the summary() function gives us a count of the number
of TRUE values and the number of FALSE values.
Pretty
reasonable behaviour.
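To try this out without the AFL data, here is a minimal sketch on a made-up vector margins (hypothetical numbers). I am assuming the logical variable is created with a simple comparison against 50, as the definition of a blowout suggests.

```r
# Made-up winning margins (hypothetical; NOT the real afl.margins data)
margins <- c(56, 31, 56, 8, 32, 14, 71, 3)

# TRUE if the winning margin exceeds 50 points, FALSE otherwise
blowouts <- margins > 50
print(blowouts)

# summary() counts the TRUE and FALSE values
print(summary(blowouts))

# The same counts can be had in other ways, since TRUE counts as 1
print(sum(blowouts))     # number of blowouts
print(mean(blowouts))    # proportion of blowouts
```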
Next, let’s try to give it a factor. If you recall, I’ve defined the
afl.finalists vector as a factor, so let’s use that.
Exercise: Summarize the factor object afl.finalists using the
summary() function.
summary( afl.finalists )
For factors, we get a frequency table, just like we got when we used the
table() function.
Exercise: Interestingly, however, if we convert this to a character
vector using the as.character() function (see Section 6.9.3),
we don’t get the same results. Do try.
f2 <- as.character( afl.finalists )
summary( f2 )
Not really useful, but thanks anyway. Because I’ve defined
afl.finalists as a factor, R knows that it should treat it as a
nominal scale variable, and so it gives you a much more detailed (and
helpful) summary than it would have if I’d left it as a character
vector.
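The contrast is easy to reproduce on a toy example. The vector finalists below is made up (a hypothetical stand-in for afl.finalists, with only a handful of entries).

```r
# A made-up factor (hypothetical stand-in for afl.finalists)
finalists <- factor(c("Geelong", "Collingwood", "Geelong",
                      "Hawthorn", "Collingwood", "Geelong"))

# For a factor, summary() gives a frequency table ...
factor.summary <- summary(finalists)
print(factor.summary)

# ... with the same counts as table()
print(table(finalists))

# For a character vector, summary() only reports length, class and mode
print(summary(as.character(finalists)))
```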
Okay, what about data frames? When you pass a data frame to the
summary() function, it produces a slightly condensed summary of each
variable inside the data frame.
To give you a sense of how this can be useful, let’s try this for a new
data set, one that you’ve never seen before. The data is stored in the
clinicaltrial.Rdata file. Let’s see what we’ve got:
## drug therapy mood.gain
## 1 placebo no.therapy 0.5
## 2 placebo no.therapy 0.3
## 3 placebo no.therapy 0.1
## 4 anxifree no.therapy 0.6
## 5 anxifree no.therapy 0.4
## 6 anxifree no.therapy 0.2
## 7 joyzepam no.therapy 1.4
## 8 joyzepam no.therapy 1.7
## 9 joyzepam no.therapy 1.3
## 10 placebo CBT 0.6
## 11 placebo CBT 0.9
## 12 placebo CBT 0.3
## 13 anxifree CBT 1.1
## 14 anxifree CBT 0.8
## 15 anxifree CBT 1.2
## 16 joyzepam CBT 1.8
## 17 joyzepam CBT 1.3
## 18 joyzepam CBT 1.4
There’s a single data frame called clin.trial which contains three
variables, drug, therapy and mood.gain. Presumably, this data is
from a clinical trial of some kind, in which people were administered
different drugs; and the researchers looked to see what the drugs did to
their mood. Let’s see if the summary() function sheds a little more
light on this situation.
Exercise: Summarize clin.trial using the summary() function.
summary( clin.trial )
Evidently, there were three drugs: a placebo, something called
“anxifree” and something called “joyzepam”; and there were 6 people
administered each drug. There were 9 people treated using cognitive
behavioural therapy (CBT) and 9 people who received no psychological
treatment. And we can see from looking at the summary of the mood.gain
variable that most people did show a mood gain (mean \(=.88\)), though
without knowing what the scale is here, it’s hard to say much more than
that. Still, that’s not too bad. Overall, I feel that I learned
something from that.
Let’s say, we want to look at the descriptive statistics for the
clin.trial data, broken down separately by therapy type. Since summary is just another function (but probably one of the hardest working functions in R business), we can use the strategies discussed in Section XXX.
First, by by():
## clin.trial$therapy: CBT
## drug therapy mood.gain
## anxifree:3 CBT :9 Min. :0.300
## joyzepam:3 no.therapy:0 1st Qu.:0.800
## placebo :3 Median :1.100
## Mean :1.044
## 3rd Qu.:1.300
## Max. :1.800
## ------------------------------------------------------------
## clin.trial$therapy: no.therapy
## drug therapy mood.gain
## anxifree:3 CBT :0 Min. :0.1000
## joyzepam:3 no.therapy:9 1st Qu.:0.3000
## placebo :3 Median :0.5000
## Mean :0.7222
## 3rd Qu.:1.3000
## Max. :1.7000
Neat. As you can see, the output is essentially identical to the output that
the summary() function produces, except that the
output now gives you the info like means, medians, etc. separately for the
CBT group and the no.therapy group. For the two factors (drug and therapy) it prints
out a frequency table, whereas for the numeric variable (mood.gain) it
prints out the range, interquartile range, mean and median.
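If you don’t have the clinicaltrial.Rdata file at hand, the data frame can be rebuilt from the printout above. The sketch below does that by typing the values in (the way of constructing it with rep() is my own shortcut), and then runs by() with summary, which I assume is close to the command that produced the output above.

```r
# Rebuild clin.trial by hand from the printed data
clin.trial <- data.frame(
  drug = factor(rep(rep(c("placebo", "anxifree", "joyzepam"), each = 3), times = 2)),
  therapy = factor(rep(c("no.therapy", "CBT"), each = 9)),
  mood.gain = c(0.5, 0.3, 0.1, 0.6, 0.4, 0.2, 1.4, 1.7, 1.3,
                0.6, 0.9, 0.3, 1.1, 0.8, 1.2, 1.8, 1.3, 1.4)
)

# summary() applied to the whole data frame, separately for each therapy type
print(by(data = clin.trial, INDICES = clin.trial$therapy, FUN = summary))

# A quick cross-check of the group means reported in that output
group.means <- tapply(X = clin.trial$mood.gain, INDEX = clin.trial$therapy, FUN = mean)
print(group.means)
```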
For tapply(), we would hope this would do the job
## Error in tapply(X = clin.trial, INDEX = clin.trial$therapy, FUN = summary): arguments must have same length
but alas, this does not work. If we want the information for each therapy type separately, we need to use tapply() twice:
#one time (for mood gain)
tapply( X = clin.trial$mood.gain, INDEX = clin.trial$therapy, FUN = summary )
## $CBT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.300 0.800 1.100 1.044 1.300 1.800
##
## $no.therapy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1000 0.3000 0.5000 0.7222 1.3000 1.7000
#and again a second time (for drug)
tapply( X = clin.trial$drug, INDEX = clin.trial$therapy, FUN = summary )
## $CBT
## anxifree joyzepam placebo
## 3 3 3
##
## $no.therapy
## anxifree joyzepam placebo
## 3 3 3
Similarly for aggregate():
print(aggregate( x = mood.gain ~ therapy, # mood.gain by therapy
data = clin.trial, # data is in the clin.trial data frame
FUN = summary # summarise each group
))
## therapy mood.gain.Min. mood.gain.1st Qu. mood.gain.Median mood.gain.Mean
## 1 CBT 0.3000000 0.8000000 1.1000000 1.0444444
## 2 no.therapy 0.1000000 0.3000000 0.5000000 0.7222222
## mood.gain.3rd Qu. mood.gain.Max.
## 1 1.3000000 1.8000000
## 2 1.3000000 1.7000000
print(aggregate( x = drug ~ therapy, # drug by therapy
data = clin.trial, # data is in the clin.trial data frame
FUN = summary # print out a frequency table for each group
))
## therapy drug.anxifree drug.joyzepam drug.placebo
## 1 CBT 3 3 3
## 2 no.therapy 3 3 3
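One thing the printed aggregate() output hides is how the result is stored. As far as I know, when FUN returns several numbers per group, aggregate() keeps them together in a single matrix-valued column. The sketch below inspects that and flattens it with do.call(data.frame, ...), a common idiom; clin.trial is rebuilt from the printed summaries (row order assumed), and the formula is passed without an argument name so that the call does not depend on the R version.

```r
# Reconstruction of clin.trial from the printed summaries (row order assumed)
clin.trial <- data.frame(
  drug      = factor(rep(rep(c("placebo", "anxifree", "joyzepam"), each = 3), times = 2)),
  therapy   = factor(rep(c("no.therapy", "CBT"), each = 9)),
  mood.gain = c(0.5, 0.3, 0.1,  0.6, 0.4, 0.2,  1.4, 1.7, 1.3,   # no.therapy
                0.6, 0.9, 0.3,  1.1, 0.8, 1.2,  1.8, 1.3, 1.4)   # CBT
)

by.therapy <- aggregate( mood.gain ~ therapy, data = clin.trial, FUN = summary )

ncol( by.therapy )                 # only two columns: therapy and mood.gain
is.matrix( by.therapy$mood.gain )  # the six statistics live in one matrix column

# Spread the matrix column out into ordinary columns
flat <- do.call( data.frame, by.therapy )
print( names(flat) )
```

Flattening matters mostly when you want to pick out one statistic by column name or export the table.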
What if you have multiple grouping variables? Suppose, for example, you would like to look at the descriptives of mood gain separately for all possible combinations of drug and therapy?
Here is one way:
by( data = clin.trial$mood.gain, INDICES = list(clin.trial$drug, clin.trial$therapy), FUN = summary )
## : anxifree
## : CBT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 0.950 1.100 1.033 1.150 1.200
## ------------------------------------------------------------
## : joyzepam
## : CBT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.30 1.35 1.40 1.50 1.60 1.80
## ------------------------------------------------------------
## : placebo
## : CBT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.30 0.45 0.60 0.60 0.75 0.90
## ------------------------------------------------------------
## : anxifree
## : no.therapy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2 0.3 0.4 0.4 0.5 0.6
## ------------------------------------------------------------
## : joyzepam
## : no.therapy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 1.350 1.400 1.467 1.550 1.700
## ------------------------------------------------------------
## : placebo
## : no.therapy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1 0.2 0.3 0.3 0.4 0.5
And here’s another:
print(aggregate( x = mood.gain ~ drug + therapy, # mood.gain by drug/therapy combination
data = clin.trial, # data is in the clin.trial data frame
FUN = summary # make a summary
))
## drug therapy mood.gain.Min. mood.gain.1st Qu. mood.gain.Median
## 1 anxifree CBT 0.800000 0.950000 1.100000
## 2 joyzepam CBT 1.300000 1.350000 1.400000
## 3 placebo CBT 0.300000 0.450000 0.600000
## 4 anxifree no.therapy 0.200000 0.300000 0.400000
## 5 joyzepam no.therapy 1.300000 1.350000 1.400000
## 6 placebo no.therapy 0.100000 0.200000 0.300000
## mood.gain.Mean mood.gain.3rd Qu. mood.gain.Max.
## 1 1.033333 1.150000 1.200000
## 2 1.500000 1.600000 1.800000
## 3 0.600000 0.750000 0.900000
## 4 0.400000 0.500000 0.600000
## 5 1.466667 1.550000 1.700000
## 6 0.300000 0.400000 0.500000
And here’s a way that does not work:
tapply( X = clin.trial$mood.gain, INDEX = list(clin.trial$drug, clin.trial$therapy), FUN = summary )
## CBT no.therapy
## anxifree summaryDefault,6 summaryDefault,6
## joyzepam summaryDefault,6 summaryDefault,6
## placebo summaryDefault,6 summaryDefault,6
In case you wonder (you don’t need to understand, remember or study this; I only included it because it bugs me and might bug you too): you can sort of make it work as follows. You only get the numbers, though, without labels telling you what these numbers mean, so it’s not really helpful.
print.table(tapply( X = clin.trial$mood.gain, INDEX = list(clin.trial$drug, clin.trial$therapy), FUN = summary ))
## [1] 0.800000, 0.950000, 1.100000, 1.033333, 1.150000, 1.200000
## [2] 1.30, 1.35, 1.40, 1.50, 1.60, 1.80
## [3] 0.30, 0.45, 0.60, 0.60, 0.75, 0.90
## [4] 0.2, 0.3, 0.4, 0.4, 0.5, 0.6
## [5] 1.300000, 1.350000, 1.400000, 1.466667, 1.550000, 1.700000
## [6] 0.1, 0.2, 0.3, 0.3, 0.4, 0.5
# or
print.listof(tapply( X = clin.trial$mood.gain, INDEX = list(clin.trial$drug, clin.trial$therapy), FUN = summary ))
## Component 1 :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 0.950 1.100 1.033 1.150 1.200
##
## Component 2 :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.30 1.35 1.40 1.50 1.60 1.80
##
## Component 3 :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.30 0.45 0.60 0.60 0.75 0.90
##
## Component 4 :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2 0.3 0.4 0.4 0.5 0.6
##
## Component 5 :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 1.350 1.400 1.467 1.550 1.700
##
## Component 6 :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1 0.2 0.3 0.3 0.4 0.5
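If all you want is a labelled result from tapply() with two grouping variables, there are two things that may help. The sketch below assumes the same clin.trial data, rebuilt here from the printed summaries (row order assumed); cells and means are names I made up. The first part pulls a single labelled summary out of the grid, the second uses a function that returns one number per cell, in which case tapply() gives an ordinary table.

```r
# Reconstruction of clin.trial from the printed summaries (row order assumed)
clin.trial <- data.frame(
  drug      = factor(rep(rep(c("placebo", "anxifree", "joyzepam"), each = 3), times = 2)),
  therapy   = factor(rep(c("no.therapy", "CBT"), each = 9)),
  mood.gain = c(0.5, 0.3, 0.1,  0.6, 0.4, 0.2,  1.4, 1.7, 1.3,   # no.therapy
                0.6, 0.9, 0.3,  1.1, 0.8, 1.2,  1.8, 1.3, 1.4)   # CBT
)

# Each cell of this grid holds a whole summary, which is why it prints so badly
cells <- tapply( X = clin.trial$mood.gain,
                 INDEX = list(clin.trial$drug, clin.trial$therapy),
                 FUN = summary )
dim( cells )                            # 3 drugs by 2 therapies
print( cells[["anxifree", "CBT"]] )     # one summary, with its labels

# With a function that returns a single number, the result is a plain table
means <- tapply( X = clin.trial$mood.gain,
                 INDEX = list(clin.trial$drug, clin.trial$therapy),
                 FUN = mean )
print( means )
```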
If you have done and understood all of this, you can find some additional exercises in the RIntroExercises document. Since you now know how to work in RStudio, why not go there and try them?
For advanced users: if you want a table showing the complete order
of operator precedence in R, type ?Syntax. I haven’t included it
in this book since there are quite a few different operators, and we
don’t need that much detail. Besides, in practice most people seem
to figure it out from seeing examples: until writing this book I
never looked at the formal statement of operator precedence for any
language I ever coded in, and never ran into any difficulties.↩︎
Don’t get thrown off by the use of the term function. We’ll get into functions quite soon, and it will all start to make sense. Talking about great prospects!↩︎
Another way to edit variables is to use the edit() or fix()
functions. I won’t discuss them in detail right now, but you can
check them out on your own.↩︎
A note for the mathematically inclined: R does support complex
numbers, but unless you explicitly specify that you want them it
assumes all calculations must be real-valued. By default, the square
root of a negative number is treated as undefined: sqrt(-9) will
produce NaN (not a number) as its output. To get complex numbers,
you would type sqrt(-9+0i) and R would now return 0+3i. However,
since we won’t have any need for complex numbers in this book, I
won’t refer to them again.↩︎
Note that this is a very different operator to the assignment
operator = that I talked about in Section 2.3. A common
typo that people make when trying to write logical commands in R (or
other languages, since the “= versus ==” distinction is
important in most programming languages) is to accidentally type =
when you really mean ==. Be especially cautious with this – I’ve
been programming in various languages since I was a teenager, and I
still screw this up a lot. Hm. I think I see why I wasn’t cool as
a teenager. And why I’m still not cool.↩︎
It’s also worth checking out the match() function.↩︎
I will be using a few words that will sound like total gibberish to you, but that’s OK for the point I want to make. So don’t fret.↩︎
More precisely, there are 5000 or so packages on CRAN, the Comprehensive R Archive Network.↩︎
The two functions discussed previously, sqrt() and abs(),
both only have a single argument, x. So I could have typed
something like sqrt(x = 225) or abs(x = -13) earlier. The fact
that all these functions use x as the name of the argument that
corresponds to the “main” variable that you’re working with is no
coincidence. That’s a fairly widely used convention. Quite often,
the writers of R functions will try to use conventional names like
this to make your life easier. Or at least that’s the theory. In
practice, it doesn’t always work as well as you’d hope.↩︎
Actually, that’s a bit of a lie: the log() function is more
flexible than that and can be used to calculate logarithms in any
base. The log() function has a base argument that you can
specify, which has a default value of \(e\). Thus log10(1000) is
actually equivalent to log(x = 1000, base = 10). Note that the calculator you have used for Statistics 1 and 2 uses log() for the base-10 logarithm (and ln() for the base-e logarithm).↩︎
Note for non-Australians: the AFL is an Australian rules football competition. You don’t need to know anything about Australian rules in order to follow this section.↩︎
6, to be precise↩︎
well, 6, again↩︎
For advanced users: type ?double for more information.↩︎
Or at least, that’s the default. If all your numbers are integers
(whole numbers), then you can explicitly tell R to store them as
integers by adding an L suffix at the end of the number. That is,
an assignment like x <- 2L tells R to assign x a value of 2 and
to store it as an integer rather than as a binary expansion. Type
?integer for more details.↩︎
You can choose which panes you see and where by going to View/Panes/Panes layout, but I recommend keeping it in the default setting, as long as you consider yourself a novice.↩︎
For advanced users: yes, as you’ve probably guessed, R is printing out the source code for the function.↩︎
If you’re running R from the terminal rather than from RStudio, escape doesn’t work: use CTRL-C instead.↩︎
Incidentally, that always works: if you’ve started typing a command and you want to clear it and start again, hit escape.↩︎
Here are some more, for those who like to go the extra mile: the objects() function, the ls() function, the ls.str() function and the who() function from the lsr package.↩︎
For advanced users: that’s a little over-simplistic in two respects. First, it’s a terribly imprecise way of talking about scoping. Second, it might give you the impression that all the variables in question are actually loaded into memory. That’s not quite true, since that would be very wasteful of memory. Instead, R has a “lazy loading” mechanism, in which what R actually does is create a “promise” to load those objects if they’re actually needed. For details, check out the delayedAssign() function.↩︎
The details about these packages are not something you should study.↩︎
The logit function is a simple mathematical function that happens not to have been included in the basic R distribution.↩︎
Tip for advanced users: See also ::: if you’re especially keen to force R to use functions it otherwise wouldn’t, but take care, since ::: can be dangerous.↩︎
For some reason, if I use x instead of print(x), R will print x using the first approach, but not using the second, source() approach. Don’t worry about that. You have more important things to worry about, like climate change.↩︎
You can do the same via the “Session” or the “File” menu at the top of RStudio by choosing Quit Session…, or by using the CTRL+Q shortcut (CMD+Q on a Mac).↩︎
Or functions. But let’s ignore functions for the moment.↩︎
Some users might wonder why R even allows the == operator for
factors. The reason is that sometimes you really do have different
factors that have the same levels. For instance, if I was analysing
data associated with football games, I might have a factor called
home.team, and another factor called winning.team. In that
situation, I really should be able to ask if
home.team == winning.team.↩︎
Well, that’s not the best word, but you know what I mean.↩︎
Note that, when I write out the formula, R doesn’t check to see
if the out and pred variables actually exist: it’s only later on
when you try to use the formula for something that this happens.↩︎
But in a different way than I used it above, where dropping referred to “not showing”.↩︎
Actually, you can make the subset() function behave this way by
using the optional drop argument, but by default subset() does
not drop, which is probably more sensible and more intuitive to
novice users.↩︎
Specifically, recursive indexing, a handy tool in some contexts but not something that I want to discuss here.↩︎
Conveniently, if you type rownames(df) <- NULL R will renumber
all the rows from scratch. For the df data frame, the labels that
currently run from 7 to 10 will be changed to go from 1 to 4.↩︎
It’s worth noting that there’s also a more powerful function
called recode() in the car package that I won’t discuss
in this book but is worth looking into if you’re looking for a bit
more flexibility.↩︎
Or a list of such variables, as we will see below.↩︎
I mentioned earlier that print() is not a terribly useful function, but at least it makes itself useful by being a vehicle for demonstrating the concept of a generic function.↩︎
Lexical scope.↩︎
The assign() function.↩︎
Yes.↩︎
As an aside: if there’s only a single command that you want to include inside your loop, then you don’t actually need to bother including the curly braces at all. However, until you’re comfortable programming in R I’d advise always using them, even when you don’t have to.↩︎
Okay, fine. This example is still a bit ridiculous, in three respects. Firstly, the bank absolutely will not let the couple pay less than the amount required to terminate the loan in 30 years. Secondly, a constant interest rate for 30 years is hilarious. Thirdly, you can solve this much more efficiently than through brute force simulation. However, we’re not exactly in the business of being realistic or efficient here.↩︎
You don’t need to understand how I converted the annual percentage interest into a monthly multiplier. But if you care, here’s how. What we need is the number that you have to multiply the current balance by each month in order to produce an annual interest rate of 5%. An annual interest rate of 5% implies that, if no payments were made over 12 months, the balance would end up being \(1.05\) times what it was originally, so the annual multiplier is \(1.05\). To calculate the monthly multiplier, we need to calculate the 12th root of 1.05 (i.e., raise 1.05 to the power of 1/12). As it happens, this corresponds to a value of about 1.004. All of which is a rather long-winded way of saying that the annual interest rate of 5% corresponds to a monthly interest rate of about 0.4%.↩︎
I should add that this isn’t unique to R. Like everything in R there’s a pretty steep learning curve to learning how to draw graphs, and like always there’s a massive payoff at the end in terms of the quality of what you can produce. But to be honest, I’ve seen the same problems show up regardless of what system people use. I suspect that the hardest thing to do is to force yourself to take the time to think deeply about what your graphs are doing. I say that in full knowledge of the fact that only about half of my graphs turn out as well as they ought to. Understanding what makes a good graph is easy: actually designing a good graph is hard.↩︎
On the off chance that this isn’t enough freedom for you, you can
select a colour directly as a “red, green, blue” specification using
the rgb() function, or as a “hue, saturation, value” specification
using the hsv() function.↩︎
Note that you can get finer-grained
control over this by specifying the xaxt and yaxt graphical
parameters instead.↩︎
You might be wondering why I haven’t specified the argument name
for the formula. The reason is that there’s a bug in how the
scatterplot() function is written: under the hood there’s one
function that expects the argument to be named x and another one
that expects it to be called formula. I don’t know why the
function was written this way, but it’s not an isolated problem:
this particular kind of bug repeats itself in a couple of other
functions. The solution in such cases is to omit the argument name:
that way, one function “thinks” that you’ve specified x and the
other one “thinks” you’ve specified formula and everything works
the way it’s supposed to. It’s not a great state of affairs, I’ll
admit, but it sort of works.↩︎
An alternative usage of cor() is to correlate one set of
variables with another subset of variables. If X and Y are both
data frames with the same number of rows, then cor(x = X, y = Y)
will produce a correlation matrix that correlates all variables in
X with all variables in Y.↩︎