Welcome to the wonderful world of R. Everybody should have RStudio downloaded and ready to go, so I’m going to jump right in to how R sees the world, and then get on to how we can control what R sees.
When you open RStudio for the first time, you’ll see three panes. The top right of your screen shows the Workspace, which lists all the variables that you’ve put into R. The bottom right is your Viewer pane. By default, this shows your current folder (or working directory). It’s also where any plots or Help files that we ask for will appear.
The left of your screen will be dominated by the Console. This is where we tell R what to do. Anything that is typed here and followed by an Enter will be sent to R for processing.
In the Console pane, type 2+2 and hit Enter.
2+2
## [1] 4
RStudio just ran the code immediately. The [1] here just means that the 4 is the first result from what R has calculated (we’ll get on to that in a bit). Next, with the Console still selected, press the up arrow and RStudio will recall the previously run line of code. You can then change the code and run it again:
2+3
## [1] 5
Unfortunately, once something has been run, it’s gone. We need a way to keep track of what we do and, more importantly, to save it for the next time we open R.
Go to the top menu File > New File > R Script (Ctrl/Cmd + N). This will drop the Console to the bottom left quarter and open up a new Script pane. This also has a tab at the top that says something like Untitled.R and is the name of our script. We can save our script by going to File > Save (Ctrl/Cmd + S), and we can open saved scripts with File > Open (Ctrl/Cmd + O).
This time, in the Script pane, type pi+1, highlight it and press Ctrl/Cmd + Enter.
pi+1
## [1] 4.141593
This will automatically transfer the code into the Console and run it. Importantly, however, the code is still in the Script pane, ready to be edited and/or used again.
Congratulations! You just did some coding!
In mathematics, a vector is usually a set of ordered numbers which can be used to represent a point in 2/3D space. However, R uses vectors in a slightly different manner to essentially mean a list of data (although a list is a different thing in R). I can create a vector in R by using the c function and passing some numbers as arguments, separated by a comma.
c(1,2,3)
## [1] 1 2 3
I used a little bit of jargon there, so to keep things clear: a function is a command that tells R to do something, and an argument is the information or data you give to the function so it can do its job. Sometimes functions don’t need arguments:
Sys.time()
## [1] "2020-03-05 16:43:37 UTC"
And, a word of warning, R is Case Sensitive. This means you need to be careful that you use upper and lower case appropriately.
sys.time()
## Error in sys.time(): could not find function "sys.time"
We’ve now created a vector of numbers but we’ve not done anything with these numbers. What is happening here?
c(1,2,3) + 10
## [1] 11 12 13
In the jargon of R (and computer programming in general), the addition symbol (+) here is called an operator. Operators take something on the left and something on the right and do something with them. Here are several more and you should recognise most of them:
+ - * / < > ^ == != :
In order: Addition (+), Subtraction (-), Multiplication (*), Division (/), Less Than (<), Greater Than (>), Exponentiation (^), Equivalency (==) (i.e. are these two things the same?) and Non-Equivalency (!=) (are these two things different?). The last one is probably new to you. The : operator works with numbers to create a vector that runs from the first number to the second, and is very useful for quickly creating big vectors.
1:100
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
This also demonstrates what the [1] meant earlier. The element next to it is the first element. The next row here shows [19], which means that the 19 next to it is the nineteenth element, and so on. This makes it easier, if you’re dealing with large data, to pinpoint where something might be wrong. Try running the code 1:1000 (either in the Console or the Script pane).
What do you think is happening here?
length(1:1000)
## [1] 1000
What happens if you run the code length(1:(10^6))? And what about length(1:(10^15))? Now try:
length(1:(10^16))
## Error in 1:(10^16): result would be too long a vector
R doesn’t even try it. R knows its limits. But it can make it up to 10^15, which means that R handled a vector that was 10^15 elements long and reported its length. And it did it really quickly (because R is good with vectors; recent versions of R store a simple sequence like this compactly rather than writing out every element).
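Before moving on, here is a small sketch of the arithmetic operators from above applied to whole vectors. The vector name v is just something I made up for the example; the point is that each operator is applied to every element in turn.

```r
v <- c(1, 2, 3)   # a small vector to experiment with (the name v is arbitrary)

v * 2             # every element is doubled: 2 4 6
v ^ 2             # every element is squared: 1 4 9
v - 1             # 1 is subtracted from every element: 0 1 2
(1:5) * 10        # works on vectors built with : as well: 10 20 30 40 50
```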
We’ve just learned about vectors as lists (even though list means something else) of numbers. But vectors can also be character vectors. Sometimes called strings, character vectors are just pieces of text, such as words:
c("Hello","World")
## [1] "Hello" "World"
Another basic data type in R is the logical, which represents TRUE or FALSE values.
c("red","blue","green","yellow") == "red"
## [1] TRUE FALSE FALSE FALSE
Here, we created a vector of characters and then checked whether each of them (one-by-one) was equal to the character vector "red". Again, however, remember that R is Case Sensitive:
"red" == "Red"
## [1] FALSE
We can also multiply logical values by numeric values (i.e. numbers) and R will interpret a TRUE as being 1 and a FALSE as being 0.
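As a quick sketch of that last point (the variable names colours and is_red are made up for illustration, and sum() is a built-in function that adds up a vector):

```r
colours <- c("red", "blue", "green", "red")
is_red <- colours == "red"   # logical vector: TRUE FALSE FALSE TRUE

is_red * 10                  # TRUE counts as 1, FALSE as 0: 10 0 0 10
sum(is_red)                  # adding up the TRUEs counts the matches: 2
```

Counting how many elements meet a condition this way comes up a lot once you start working with real data.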
Consider the following line of code:
length(c("Hello","World")) - 1
## [1] 1
There are many steps happening here. First, the c function took in the two arguments "Hello" and "World", realised they were characters (rather than numbers), put them together into a vector (that’s its job) and then passed that vector as a single argument to the length function. Finally, we subtracted 1 from the result.
Now consider the following line of code:
x <- c("Hello","World")
length(x) - 1
## [1] 1
Much easier to read and follow. We created the vector in the same way, but then we used the <- operator to put it in a box called x. On the next line, we put that box into the length function, and then subtracted the 1. This box is called a variable, and if you run this line of code, you’ll notice that the x variable now appears in our Workspace pane in the top right. The Workspace pane tells us the data type (chr is short for character) and we can tell that the length is 2 from the bit that says [1:2] (we’ll unravel this soon, too).
Being able to assign variables is a key part of using R (and again, programming in general). Once these variables have been assigned, we can use them again and again, and re-assign them as and when we need, and even use them to re-assign themselves!
x <- 2
x + 2
## [1] 4
x + 3
## [1] 5
x <- x + 1
x
## [1] 3
Almost anything that you create in R can be saved as a variable, whether that is a value or the results of a model/test that you have run.
Now, x in our Workspace pane is just a single number (not a vector). But we can obviously change that:
x <- c(2,4,6,8,10)
We’re once again told its data type (this time it’s a num, short for numeric), and its length is 5 from the [1:5]. When we’re working with vectors, sometimes we need to grab out just some of the elements and not all of them. For this, we use extraction (also called subsetting). If we want to pull out the third element of a vector, we enclose that number in square brackets, [ and ], to tell R that we are extracting data from a vector: x[3]. We can also pass a vector into the square brackets to grab more than one element:
x[c(1,2,3)]
## [1] 2 4 6
And, since we can pass in a vector, think about what the following does.
x[1:3]
## [1] 2 4 6
The next easiest way to subset is to use a logical vector that is the same length as the vector we are subsetting, and R will return any element which is TRUE and ignore elements which are FALSE.
x[c(T,T,F,F,T)]
## [1] 2 4 10
y <- c("red","blue","yellow","green","red")
x[y=="red"]
## [1] 2 10
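The logical vector can also come from a comparison on the vector itself. A minimal sketch, re-creating x with the same values as above so that it runs on its own:

```r
x <- c(2, 4, 6, 8, 10)

x > 5        # one TRUE/FALSE per element: FALSE FALSE TRUE TRUE TRUE
x[x > 5]     # keeps only the elements where the comparison is TRUE: 6 8 10
x[x != 4]    # everything except the elements equal to 4: 2 6 8 10
```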
Finally, we can also subset to remove elements we’re not interested in, by putting a minus sign in front of their positions:
x[-3]
## [1] 2 4 8 10
x[-c(2,3)]
## [1] 2 8 10
Assignment and subsetting can also be combined to change certain elements in our vector:
x[3] <- 2
x
## [1] 2 4 2 8 10
x[4:5] <- 0
x
## [1] 2 4 2 0 0
If we have a particularly large vector, we can also look at just the first 6 values by using the function head() (or conversely, tail() to show us the last 6), as we’ll see in the next section.
Do you see what the [1:5] means in the Workspace pane? The 1:5 is the possible values that we can subset x with.
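Here is a short sketch of head() and tail() on a longer vector. The name big is just an example, and the second argument to head() is an optional count of how many elements to show.

```r
big <- 1:100     # an example vector with 100 elements

head(big)        # the first six elements: 1 2 3 4 5 6
tail(big)        # the last six elements: 95 96 97 98 99 100
head(big, 3)     # ask for a different number of elements: 1 2 3
```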
Now that we have the basics, we can move on to some statistical work using data. Normally, you would use a function such as read.csv() to read in your data. This function can read in data that is formatted as a csv (comma-separated values) file. There are also ways to import more complicated data, such as from Excel, but these don’t come built-in with R (we need additional stuff called packages to load other data).
An important note when loading data using read.csv() or a similar function is that the file path needs to use forward slashes (/) rather than backward slashes (\). Basically, copy the folder path, but replace the slashes:
dat <- read.csv("M:\Documents\My Data.csv") # This line won't work
dat <- read.csv("M:/Documents/My Data.csv") # This works!
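If you don’t have a csv file to hand, the following sketch makes a tiny one in a temporary location and reads it back in. The file contents (two made-up people and their heights) and the variable names are purely illustrative. It also shows file.path(), a built-in function that joins folder names with forward slashes for you.

```r
# Create a small csv file in a temporary location (contents are made up)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,height", "Ann,160", "Bob,175"), csv_path)

# Read it back in; the first line of the file becomes the column names
dat <- read.csv(csv_path)
dat

# file.path() builds a path using forward slashes
file.path("M:", "Documents", "My Data.csv")   # "M:/Documents/My Data.csv"
```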
That’s how you get your data into R. Today, we’re going to begin by loading up a built-in dataset, called iris, which contains data about flowers.
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
The function data() can be used to load up a built-in dataset (useful for examples and practising!). As mentioned earlier, head() displays the first 6 rows of our data, which is a table (known in R as a data.frame).
Double click on iris in the Workspace and a new tab will open up in place of the Script pane. This shows the full table and allows us to scroll. There are five variables in iris: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species (note the capitalisation!).
Similar to a vector, we can subset a data.frame using the square brackets. However, since iris is a table, it is 2-dimensional, so R needs two values: one for the row and a second for the column, separated by a comma (again, we can extract multiple rows/columns by using vectors):
iris[2,3] #Returns the value in the second row and the third column
## [1] 1.4
iris[1:10,1:3] #Returns the values in the first ten rows and the first three columns
## Sepal.Length Sepal.Width Petal.Length
## 1 5.1 3.5 1.4
## 2 4.9 3.0 1.4
## 3 4.7 3.2 1.3
## 4 4.6 3.1 1.5
## 5 5.0 3.6 1.4
## 6 5.4 3.9 1.7
## 7 4.6 3.4 1.4
## 8 5.0 3.4 1.5
## 9 4.4 2.9 1.4
## 10 4.9 3.1 1.5
iris[1:5,c(3,4,5)] #Returns the first five rows and the 3rd, 4th and 5th columns
## Petal.Length Petal.Width Species
## 1 1.4 0.2 setosa
## 2 1.4 0.2 setosa
## 3 1.3 0.2 setosa
## 4 1.5 0.2 setosa
## 5 1.4 0.2 setosa
The way R structures data.frames is that each variable within the data.frame is its own vector of values, with the restriction that all the vectors have to be the same length, i.e. the number of rows in the table. This means we can pull vectors out using the $ operator:
iris$Sepal.Length #Notice that RStudio pops up with the available variable names here
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
## [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
## [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
## [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
## [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
## [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9
And these vectors can be used in the exact same way as other vectors, e.g.:
iris$Sepal.Width[1:15]
## [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0
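Because each column is just a vector, a comparison on one column can be used to pick out rows of the whole table. A minimal sketch (the name setosa_only is made up, and I’ve used column names in place of column numbers, which data.frames also accept):

```r
# Keep the rows where Species is "setosa", and only two of the columns
setosa_only <- iris[iris$Species == "setosa", c("Sepal.Length", "Species")]

nrow(setosa_only)      # 50 rows are left
head(setosa_only, 3)   # a quick look at the first three
```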
As before, we can also edit individual values (known as cells) within a dataset by combining subsetting and assignment.
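A small sketch of editing cells, working on a copy so the built-in data stays untouched. The name flowers and the new values are made up for illustration.

```r
flowers <- iris                  # a copy to experiment on

flowers[1, 1]                    # first row, first column: 5.1
flowers[1, 1] <- 5.2             # overwrite that cell using row and column numbers
flowers$Sepal.Width[2] <- 3.1    # or pull out the column with $ and pick a position

head(flowers, 2)                 # the first two rows now show the new values
```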
Here are a few functions that can be useful for describing your data:
mean(iris$Sepal.Length) #Gives the arithmetic mean of the variable Sepal.Length within the data.frame iris
## [1] 5.843333
sd(iris$Sepal.Width) #Standard Deviation of the variable
## [1] 0.4358663
table(iris$Species) #Provides a count of each unique value of the variable
##
## setosa versicolor virginica
## 50 50 50
range(iris$Petal.Length) #Gets the minimum and maximum values, these can also be extracted using min() and max()
## [1] 1.0 6.9
If we want to find out the median and/or percentiles of a variable we use the quantile() function, and we can specify which percentiles we want to find in the vector. We do this by specifying the percentiles as decimal values (i.e. between 0 and 1) in another vector and pass both arguments to the function. Once again, remember we separate arguments by a comma.
Quantiles <- c(0.05,0.5,0.95)
quantile(iris$Petal.Width, Quantiles)
## 5% 50% 95%
## 0.2 1.3 2.3
We did not need to define Quantiles beforehand here, but it certainly makes the code look a little neater than this: quantile(iris$Petal.Width,c(0.05,0.5,0.95)).
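For the median on its own, base R also has a median() function. This sketch simply checks that it agrees with the 50% quantile from above:

```r
median(iris$Petal.Width)          # 1.3
quantile(iris$Petal.Width, 0.5)   # the same value, labelled 50%
```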
We can also combine this with subsetting to pull out descriptive statistics based on conditions. What does this do?
mean(iris$Sepal.Length[iris$Species == "setosa"])
## [1] 5.006
From R’s perspective:
1. Create a logical vector that is the same length as the vector iris$Species (i.e. same as the number of rows in the table), to indicate whether iris$Species == "setosa".
2. Subset iris$Sepal.Length such that the previous vector is TRUE.
3. Take the mean of the values that are left.
R works from the inside outwards when it comes to functions like this. Not the most legible way to understand the code afterwards (but we’ll learn how to fix this next week!)
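The same calculation can be written one step at a time by saving each intermediate result as a variable. This is just one way of spelling it out, and the names is_setosa and setosa_lengths are made up for the example.

```r
is_setosa <- iris$Species == "setosa"            # one TRUE/FALSE per row of the table
setosa_lengths <- iris$Sepal.Length[is_setosa]   # keep only the setosa sepal lengths
mean(setosa_lengths)                             # 5.006, the same answer as the one-liner
```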
The basic function for plotting in R is the plot() function, and is used very intuitively for scatter plots:
plot(iris$Sepal.Length,iris$Sepal.Width)
However, this is pretty boring. Let’s add some colour and give appropriate labels
plot(x = iris$Sepal.Length,
     y = iris$Sepal.Width,
     col = iris$Species,
     main = "Plot of Sepal Length vs Width",
     sub = "Stratified by Species",
     xlab = "Sepal Length",
     ylab = "Sepal Width"
)
Here, we changed the plot to use the variable Species as an indicator of which colour to use for each point. We could have just used col="red" to tell R to plot them all red, but that’s not as useful for telling the species apart!
We’ve done something new here: we supplied our arguments to the function as named arguments. This is because not all of these arguments are needed (see the previous plot). If we don’t name them, R assumes the first argument is x and the second is y (which is a good assumption to make). However, since the others are optional, we need to tell R which is which (it can’t rely on them being supplied in the same order every time). If we don’t specify them, R will choose default values, which can be seen in the help file (?plot).
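Named arguments work the same way for other functions. As a sketch, here is the built-in seq() function, which creates a sequence of numbers and has arguments called from, to, by and length.out:

```r
seq(1, 10, 2)                     # unnamed: R assumes the order from, to, by -> 1 3 5 7 9
seq(from = 1, to = 10, by = 2)    # the same call with every argument named
seq(by = 2, to = 10, from = 1)    # once named, the order no longer matters
seq(1, 10, length.out = 4)        # naming lets us skip by and set a later argument: 1 4 7 10
```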
For any problem you might have, your first port of call in RStudio is the ? symbol. R mostly works with functions (which we will get to pretty soon), and so if you need help in RStudio, you can type a question mark followed by the name of that function and help will pop up in the Viewer pane:
?mean
If you don’t quite remember the name of the function you need help with, you can use the double question mark to search through all of R’s help files:
??mean
and then just click on the result that looks like what you’re after. Most of the built-in, default functions are preceded by a base:: or a stats::
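Those prefixes are the names of the packages the functions live in, and you can type them yourself if you like. A tiny sketch showing that the prefixed and unprefixed calls do the same thing:

```r
base::mean(c(1, 2, 3))   # same as mean(c(1, 2, 3)): 2
stats::sd(c(1, 2, 3))    # same as sd(c(1, 2, 3)): 1
```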
Next port of call should be Google/Ecosia/Bing or whatever your Search Engine of choice is. If you have a question that you need to know, type in “R” and then just ask that question to your search engine. I don’t mean try to phrase it in a certain way, I mean write the question exactly how you would ask it!
When you search for something, a lot of results will probably be tutorials on how to do what you’re trying to do. If what you’re asking is a common question, these will be good resources.
For more unusual questions, Stack Exchange is the most reliable forum for R answers. It is full of users who are more than willing to answer your difficult questions (some of them even enjoy it!)
Surprisingly enough, Statisticians love Twitter. If you have a question, just post it with the #RStats hashtag and you’ll probably get some reasonable responses. Or tweet it with #LJMU_MSIT and I will either reply or retweet it to more stats people.
Finally, the Maths, Stats & IT Team run regular Drop-In Sessions and One-To-Ones which you’re more than welcome to come by to. Just check out the Library Calendar to find a suitable session.
Remember, never feel like you’re asking a stupid question. If you don’t know the answer, just ask!
This was just a bit of a taster as an introduction to the world of R as a powerful piece of software. There is so much more that you can learn about R and I will be running more in-depth tutorials throughout the upcoming semester. If you come along, you’ll learn:
ggplot2 package (part of the tidyverse)