1 R & RStudio

1.1 R

R is a programming language that is designed to work quickly and efficiently with large amounts of data, this is done through the unique way it processes this data. It is open source, meaning that other people from around the world can easily contribute code that you can use through add-ons, known as packages. Another reason for R’s popularity is that it (and the applications that run it) are free and (perhaps because it is free) the support online (from other ordinary users) is incredibly helpful.

1.2 RStudio

If you try to download R directly, you will also download a program called RGui. This is the default program for running R code. For what it does, RGui is fine. But we don’t want fine, we want GREAT! So we’re going to use another program called RStudio. RStudio organises your workspace a little bit better than the default RGui and allows for better interacticve help features which we’ll discuss later. Below is an example of what RStudio should look like.

The screenshow shows the four panes that we usually work with. Top-Left shows the Script, which is the code file we’re currently working in. Code written here won’t be run immediately, but it allows us to save code to be re-run later. Bottom-Left shows the Console, code written here will be run, so be careful what you type here. Top-Right is the Workspace, which shows the variables/data that RStudio is storing for us and that we can use. Bottom-Right shows the Viewer, which primarily displays Plots and Help. Of course there is a lot more going on here, but for the most part, these are the important bits.

Remember: R is the language and RStudio is the application, but you can’t use RStudio without R! (That said, I’m probably going to use the two interchangeably throughout)

1.3 Where to get them?

When you’re using an LJMU computer, you can search for “RStudio” in the LJMU Application Player to download it. This course is designed for use with RStudio 64 (with the circular logo). Elsewhere, you can download it from their webpage. As mentioned above, RStudio is free and will automatically install R as well.

1.4 First steps

When you first open RStudio, what you see won’t be like the screenshot. That’s because there is no Script open and so the Console fills the entire left side. It’s also filled with some information regarding licencing and the version of R you are currently running. The first step in RStudio is to create a new script. Go to File > New File > R Script (or press Ctrl + Shift + N on Windows or Cmd + Shift N on Mac). The Script pane should now appear with a tab called “Untitled 1”.

In the Console pane, type 2+2 and hit Enter.

2+2
## [1] 4

RStudio just ran the code immediately. The [1] here just means that the 4 is the first result from what R has calculated (we’ll get on to that in a bit). Next, with the Console still selected, press up and RStudio will recall the previously run line of code. and you can change the code and you run it again

2+3
## [1] 5

This time, in the Script pane, type pi+1, hightlight it and press Ctrl + Enter (or Cmd + Enter on Mac).

pi+1
## [1] 4.141593

This will automatically transfer the code into the Console and run it. Importantly, however the code is still in the Script pane, ready to be edited and/or used again.

Congratulations! You just did some coding!

2 Vectorisation

2.1 Vectors

In mathematics, a vector is usually a set of ordered numbers which can be used to represent a point in 2/3D space. However, R uses vectors in a slightly different manner to essentially mean a list of data (although a list is a different thing in R). I can create a vector in R by using the c function and passing some numbers as arguments, separated by a comma.

c(1,2,3)
## [1] 1 2 3

I used a little bit of Jargon there, but to keep things clear. A function is a command that tells R to do something and an argument is the information or data you give to the function so it can do it’s job. Sometimes functions don’t need arguments:

Sys.time()
## [1] "2020-03-19 17:00:48 GMT"

And, a word of warning, R is Case Sensitive. This means you need to be careful that you use upper and lower case appropriately.

sys.time()
## Error in sys.time(): could not find function "sys.time"

We’ve now created a vector of numbers but we’ve not done anything with these numbers. What is happening here:

c(1,2,3) + 10
## [1] 11 12 13

In the jargon of R (and computer programming in general), the addition symbol (+) here is called an operator. Operators take something on the left and something on the right and do something with them. Here are several more and you should recognise most of them:

+ - * / < > ^ == != :

In order: Addition (+), Subtraction (-), Multiplication (*), Division (/), Less Than (<), Greater Than (>), Exponentiation (^), Equivalency (==) (i.e. are these two things the same?) and Non-Equivalency (!=) (are these two things different?). The last one, is probably new to you. the : operator works with numbers to create a vector that runs from the first number to the second and is very useful for quickly creating big vectors.

1:100
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100

This also demonstrates what the [1] meant earlier. The element next to it is the first element. The next row here, shows [19], which means that 19 is the nineteenth element, and so on. This makes it easier, if you’re dealing with large data, to pinpoint where something might be wrong. Try running the code 1:1000 (either in the Console or the Script pane).

What do you think is happening here?

length(1:1000)
## [1] 1000

what happens if you run the code length(1:(10^6))? And what about length(1:(10^15)). Try

length(1:(10^16))
## Error in 1:(10^16): result would be too long a vector

R doesn’t even try it. R knows its limits But it can make it up to 1015, which means that R created a vector that was 1015 elements long before figuring out its length. And it did it really quickly (because R is good with vectors)

2.2 Data Types

We’ve just learned about Vectors as lists(even though list means something else) of numbers. But Vectors can also be character vectors. Sometimes called strings, character vectors just means words:

c("Hello","World")
## [1] "Hello" "World"

Another basic data type in R is the logical, which represents TRUE or FALSE values.

c("red","blue","green","yellow") == "red"
## [1]  TRUE FALSE FALSE FALSE

Here, we created a vector of characters and then checked whether each of them (one-by-one) were equal to the character vector "red". Again, however, remember that R is Case Senstive:

"red" == "Red"
## [1] FALSE

We can also multiply logical values by numeric values (i.e. numbers) and R will interpret a TRUE as being 1 and a FALSE as being 0.

2.3 Assignment

Consider the following line of code:

length(c("Hello","World")) - 1
## [1] 1

There are many steps happening here. First,the c function took in the two arguments "Hello" and "World", realised they were characters (rather than numbers), put them together into a vector (that’s its job) and then passed that vector as a single argument to the length function. Finally, we subtracted 1 from the result.

Now consider the following line of code:

x <- c("Hello","World")
length(x) - 1
## [1] 1

Much easier to read and follow. We created the vector in the same way, but then we used the <- operator to put it in a box called x. On the next line, we put that box into the length function, and then subtracted the 1. This box is called a variable, and if you run this line of code, you’ll notice that the x variable now populated our Workspace pane in the Top-Right. The Workspace pane tells us the data type (chr is short for character) and we can tell that the length is 2 from the bit that says [1:2] (we’ll unravel this soon, too).

Being able to assign variables is a kay part of using R (and again, programming in general). Once these variables have been assigned, we can use them again and again, and re-assign them as and when we need, and even use them to re-assign themselves!

x <- 2
x + 2
## [1] 4
x + 3
## [1] 5
x <- x + 1
x
## [1] 3

2.4 Subsetting

Now, x in our Workspace pane is just a single number (not a vector). But we can obviously change that

x <- c(2,4,6,8,10)

We’re once again told it’s data type (this time its an int), and its length is 5 from the [1:5]. When we’re working with vectors, sometimes we need to grab out just some of the elements and not all of them. For this, we use extraction. If we want to pull out the third element of a vector, we enclose that number in square brackets, [ and ], to tell R that we are extracting data from a vector: x[3]. We can also pass a vector into the subsetting to grab more than one element:

x[c(1,2,3)]
## [1] 2 4 6

And, since we can pass in a vector, think about what the following does.

x[1:3]
## [1] 2 4 6

The next easiest way to subset is to use a logical vector that is the same length as the vector we are subsetting, and R will return any element which is TRUE and ignore elements which are FALSE.

x[c(T,T,F,F,T)]
## [1]  2  4 10
y <- c("red","blue","yellow","green","red")
x[y=="red"]
## [1]  2 10

Finally, we can also subset to remove elements we’re not interested in

x[-3]
## [1]  2  4  8 10
x[-c(2,3)]
## [1]  2  8 10

Assignment and Subsetting can also be combined to change certain elements in our vector

x[3] <- 2
x
## [1]  2  4  2  8 10
x[4:5] <- 0
x
## [1] 2 4 2 0 0

If we have a patrticularly large vector, we can also look at just the first 6 values, by using the function head() (or conversely, tails() to show us the last 6), as we’ll see in the next section.

Do you see what the [1:5] means in the Workspace pane? The 1:5 is the possible values that we can subset x with.

3 Data

3.1 Matrices

Vectors are obviously really useful, but they’re just the start. As mentioned earlier, they are essentially a list of values with an order, this means they are one-dimensional. That one dimension is what we use to reference their values. This kind of structure is great in some circumstances, but not always.

The two-dimensional equivalent of a vector is a matrix and we can use them in a very similar way. Once again, R is very quick with matrices.

Mat <- matrix(1:9,nrow=3)
Mat
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

To create this matrix, we first created a vector of length 9, and then passed that as an argument to the function matrix(), which creates matrices. We also had to tell R how many rows we wanted with the nrow=3. R then worked out that we also needed 3 columns since our data was of length 9. We could have specified we wanted 3 columns using ncol=3, but we don’t need both. Matrices can be any size and being able to specify whichever dimension we want gives us more flexibility depending on our data.

Another argument that can be passed into the matrix() function is whether you want the data to go vertically or horizontally. By default, it goes vectically (i.e. it fills in the first column, then the second column), but we can change that with the byrow argument. Compare the previous matrix with this one:

Mat <- matrix(1:9,nrow=3,byrow=T)
Mat
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

See how this time the numbers are going across rather than down.

Just like with the vectors, we can pull out values from within our matrix using the square brackets, [[ and ]. However, we need to supply two values, rather than just one, first a row, and then a column.

Mat[1,2] # Gets the first row and second column
## [1] 2
Mat[1,] # Gets the entire first row
## [1] 1 2 3
Mat[,2:3] # Gets the entire third column
##      [,1] [,2]
## [1,]    2    3
## [2,]    5    6
## [3,]    8    9

Notice that when we pull out a single row or column, R is now returning a vector. We’re essentially looking across the matrix or down it and pulling out a one-dimensional object as a vector. If we pass a vector into the matrix, it works the same as selecting multiple elements in a vector, but now we’re in two dimensions.

Just like with vectors, we can create matrices of any size:

Mat <- matrix(1:100,nrow=10)

And these appear in our Environment pane as you would expect, with two vectors in between the square brackets.

As with vectors, you can assign values to individual cells to edit your data within R, and you can combine subsetting and assignment to assign based on specific values

x <- c("Red","Blue","Green")

Mat[x == "Green",2] <- 100
Mat
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1   11   21   31   41   51   61   71   81    91
##  [2,]    2   12   22   32   42   52   62   72   82    92
##  [3,]    3  100   23   33   43   53   63   73   83    93
##  [4,]    4   14   24   34   44   54   64   74   84    94
##  [5,]    5   15   25   35   45   55   65   75   85    95
##  [6,]    6  100   26   36   46   56   66   76   86    96
##  [7,]    7   17   27   37   47   57   67   77   87    97
##  [8,]    8   18   28   38   48   58   68   78   88    98
##  [9,]    9  100   29   39   49   59   69   79   89    99
## [10,]   10   20   30   40   50   60   70   80   90   100

3.2 data.frames

Matrices don’t get used an awful lot on the surface in R. Most of the time, our data will be in the form of a data.frame, which is essentially a table. Just like a matrix, we have rows and columns. The big difference is that the columns are associated with variables and the rows are entries or participants. Much more structure and orgnisation here than in a matrix.

dat <- data.frame(
  Names = c("Steve","Natasha","Anthony","Pepper","Bruce","Wanda","Carol","Clint","James","Peter"),
  Height = c(1.88,1.7,1.85,1.63,1.78,1.7,1.8,1.91,1.85,1.78))
dat
##      Names Height
## 1    Steve   1.88
## 2  Natasha   1.70
## 3  Anthony   1.85
## 4   Pepper   1.63
## 5    Bruce   1.78
## 6    Wanda   1.70
## 7    Carol   1.80
## 8    Clint   1.91
## 9    James   1.85
## 10   Peter   1.78

When creating a data.frame using the data.frame() function, we pass the columns of our table, or the variables in a our data, as arguments with their names. We can access entries in our data frame in the same way as a matrix using a two dimensional coordinate style:

dat[1,2] # First row, second column/variable
## [1] 1.88

We can also access columns by their names:

dat[,"Names"]
##  [1] Steve   Natasha Anthony Pepper  Bruce   Wanda   Carol   Clint   James  
## [10] Peter  
## Levels: Anthony Bruce Carol Clint James Natasha Pepper Peter Steve Wanda

Notice that r now also gives us the Levels of this variable. Since it is saved as a factor, R uses a shortcut to store this data as a bunch of numbers, using the Levels as a look up. Behind the scenes, the data actually looks like this

as.numeric(dat[,"Names"])
##  [1]  9  6  1  7  2 10  3  4  5  8

The ninth level is levels(dat[,"Names"])[9], which equates to Steve, which is the first entry in our variable.

But another even more convenient way of accessing variables form our dataset is to use a new operator, the $ operator:

dat$Height
##  [1] 1.88 1.70 1.85 1.63 1.78 1.70 1.80 1.91 1.85 1.78

This has returned the vector that is being pointed to by the name Height in the dataframe dat. If you look at the Envirnment pane, you can click on the little arrow next to the dat variable and see all the stuff that’s being stored in there.

Just as with matrices, data can be edited using the assignment operator. The new thing is we can now use the $ operator much more intutively

dat$Height[dat$Names == "Natasha"] <- 1.5
dat
##      Names Height
## 1    Steve   1.88
## 2  Natasha   1.50
## 3  Anthony   1.85
## 4   Pepper   1.63
## 5    Bruce   1.78
## 6    Wanda   1.70
## 7    Carol   1.80
## 8    Clint   1.91
## 9    James   1.85
## 10   Peter   1.78

3.3 Loading data

Normally, you would use a function such as read.csv() to read in our data. This function can read in data that is formatted as a csv file (or comma-separated). There are also ways to import more complicated data, such as from excel, but these don’t come built-in with R (we need additional stuff called packages to load other data).

An important note when loading data using read.csv() or a similar function is that the file directory needs to use forward slashes (/) rather than backward slashes (\). Basically, copy the folder directory, but replace the slashes:

dat <- read.csv("M:\Documents\My Data.csv") # This line won't work
dat <- read.csv("M:/Documents/My Data.csv") # This works!

That’s how you get your data into R. Today, we’re going to begin by loading up a built in dataset, called iris, which is a dataset containing data about flowers.

data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

The function data() can be used to load up a Built-in datasets (useful for examples and practising!). As previously said, head() now displays the first 6 rows of our data, which is a table (known in R as a data.frame).

Double click on iris in the Workspace and a new tab will open up in place of the Script pane. This shows the full table and allows us to scroll. There are five variables in iris: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species (note the capitalisation!).

4 Statistics

4.1 Describing our data

Here are a few functions that can be useful for describing your data:

mean(iris$Sepal.Length) #Gives the arithmetic mean of the variable Sepal.Length within the data.frame iris
## [1] 5.843333
sd(iris$Sepal.Width) #Standard Deviation of the variable
## [1] 0.4358663
table(iris$Species) #Provides a count of each unique value of the variable
## 
##     setosa versicolor  virginica 
##         50         50         50
range(iris$Petal.Length) #Gets the minimum and maximum values, these can also be extracted using min() and max()
## [1] 1.0 6.9

If we want to find out the median and/or percentiles of a variable we use the quantile() function, and we can specify which percentiles we want to find in the vector. We do this by specifying the percentiles as decimal values (i.e. between 0 and 1) in another vector and pass both arguments to the function. Once again, remember we separate arguments by a comma.

Quantiles <- c(0.05,0.5,0.95)
quantile(iris$Petal.Width, Quantiles)
##  5% 50% 95% 
## 0.2 1.3 2.3

We did not need to define Quantiles beforehand here, but it certainly makes the code look a little neater than this: quantiles(iris$Petal.Width,c(0.05,0.5,0.95)).

We can also combine subsetting to pull out descriptive statistics based on pre-requisites. What does this do?

mean(iris$Sepal.Length[iris$Species == "setosa"])
## [1] 5.006

From R’s perspective:

  • Create a logical vector that is the same length as the vector iris$Species (i.e. same as the number of rows in the table), to indicate whether iris$Species == "setosa"
  • Pull out the subset of iris$Sepal.Length such that the previous vector is TRUE
  • Calculate the mean of this vector.

R works from the inside outwards when it comes to functions like this. Not the most legible way to understand the code afterwards (but we’ll learn how to fix this next week!)

4.2 Plotting

The basic function for plotting in R is the plot() function, and is used very intuitively for scatter plots:

plot(iris$Sepal.Length,iris$SSepal.Width)

However, this is pretty boring. Let’s add some colour and give appropriate labels

plot(x = iris$Sepal.Length,
     y = iris$SSepal.Width,
     col = iris$Species,
     main = "Plot of Sepal Length vs Width",
     sub = "Stratified by Species",
     xlab = "Sepal Length",
     ylab = "Sepal Width",
     )

Here, we changed the plot to use the variable Species as an indicator of which colur to use for each point. We could have just used col="red" to tell R to plot them all red, but that’s not as useful to distinguish!

We’ve done something new here, we supplied our arguments to the function as a named argument. This is because not all of these arguments are needed (see the previous one). If we don’t name them, R assumes the first argument is x and the second is y (which is a good assumption to make). However, since they’re not needed, we need to tell R which is which (it can’t rely on them being supplied in the same order everytime). # Probabilities

4.3 Data Distributions

Now that we can plot, it means we can visualise a key aspect of statistics: the probability density function (pdf) and the cumulative density function (cdf) of a probability distribution.To demonstrate, I’m going to use the Normal distribution.

The Normal distribution is one of the most commonly used distributions due to its symmetry and the fact that many natural variables can be modelled by it. The Standard Normal distribution is when the values have been standardised so that the mean is 0 and the sd is 1.

4.3.1 Density Function

x <- seq(-3,3,0.01) #Create a vector form -3 to 3 with steps of 0.01
y <- dnorm(x, mean=0, sd=1) #Here is the new function
plot(x, y, type="l",ylab="Density") # The argument type = "l" means we want a line plot

The function dnorm is the probability density function (pdf) for the Normal distribution, usually written as \(f(x)\). Higher density means more values are clustered around there and a lower density means there are less. the two arguments, mean and sd can be defined to match our data. For the Standard Normal, the pdf is defined by this nasty looking formula: \[ f(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2} \]

Let’s say we have a clinical trial with two arms; a control arm and a treatment arm. The treatment arm recieves a drug that may be able to reduce systolic blood pressure (SBP). After the trial is over, we are told that for the control arm, the mean SBP is 160 with an sd of 3 and the treatment arm has an SBP of 155 with an sd of 4. Assuming they are Normally distributed (a fair assumption if we have enough patients), then we can plot the density of the two distributions on the same plot

x <- seq(145, 170, 0.1) #Broad range of values to cover both groups
control.y <- dnorm(x, mean = 160, sd = 3) #Get the y values for the control arm
treat.y <- dnorm(x, mean = 155, sd = 4) #Get the y values for the treatment arm

plot(x,control.y,col="red",type="l",ylab="Density",xlab="SBP") #Remember what all these arguments do?
lines(x,treat.y,col="blue") #the lines() function *adds* a line to a previous plot

Loooking at the two plots, there is a lot of overlap so it would be difficult to say that this drug works. What happens if you change the mean or sd for either of the arms?

4.3.2 Probability Function

x <- seq(-3,3,0.01)
y <- pnorm(x, mean=0, sd=1)
plot(x, y, type="l",ylab="Probability")

The function pnorm is the cumulative density function (cdf), usually written as \(F(x)\). It is defined as the area under the pdf up to the value (i.e. its integral). It is used to find the probability that something is less than that value (assuming it follows a Normal Distribution). \[ F(x) = \int_{-\infty}^x f(u) \;\textrm{d}u = \textrm{Prob}\left(X < x\right) \]

This can answer questions such as: “If we assume the height of males at a university follows a Normal distribution with a mean of 175.5cm and a standard deviation of 6.5cm. What is the probability that a male student chosen at random is taller than 180cm?”

p <- pnorm(180, mean = 175.5, sd = 6.5) #This is the probability that a student is *shorter* than 180cm
1 - p #The probability that they are *taller* than 180cm
## [1] 0.2443721

4.3.3 Quantile Function

p <- seq(0,1,0.01) #This time, we need the input to be between 0 and 1
y <- qnorm(p,mean = 0, sd = 1)
plot(p,y,type="l")

The function qnorm() is the inverse of the pnorm() function. For a given probability, it can provide the associated value. From the previous example, what height are 95% of the students shorter than?

p <- 0.95
qnorm(p, mean = 175.5, sd = 6.5)
## [1] 186.1915

So 95% of the male students in the school are shorter than 186.2cm.

4.3.4 Random Function

z <- rnorm(100, mean = 0, sd = 1)
head(z)
## [1]  1.41750281  0.91333317 -0.04947397 -0.06346815 -2.02502643  0.71665990

If you were to print out the z variable, you’d see 100 random numbers. The first argument we pass to the function rnorm() tells it how many random numbers to generate. These random numbers have been drawn from the Standard Normal distribution and so they will follow that distribution. We can see this with the following plot:

hist(z)

This histogram shows where all the values in the data in z lie. We group them into bins and then count how many are in each bin.

This histogram should roughly look like the pdf of the distribution. However, due to the random nature and the fact that there are only 100, it’s not very close to it’s underlying distribution. If we increase the number of random numbers to 10,000, it’ll look closer.

z <- rnorm(10000) #This time, we're not specifying the mean or sd for either rnorm or dnorm
x <- seq(-3,3,0.01)
y <- dnorm(x) # By default, R will use mean = 0 and sd = 1.
hist(z, probability = T) #This tells R that we want the probability of each bin, rather than the count (i.e. count/total)
lines(x,y)

4.3.4.1 Other Distributions

So far, for this section, we’ve used the Normal distribution. However, there is a whole bunch of other distributions that we could use, and they all follow the same formatting style:

  • dxxxx(x,...) gives the probability density function of the distribution at x
  • pxxxx(x,...) gives the cumulative density function of the distribution at x
  • qxxxx(p,...) gives the inverse of the cdf of the distribution at p
  • rxxxx(n,...) generates n random numbers that follow the distribution

A few examples can be seen in the below plots:

5 Help

5.1 Help Files

For any problem you might have, your first port of call in RStudio is the ? symbol. R mostly works with functions (which we will get to prety soon), and so if you need help in RStudio, you can type a question mark followed by the name of that function and help will pop up in the Viewer pane

?mean

5.3 Ask

Surprisingly enough, Statisticians love Twitter. If you have a question, just post it with the #RStats hashtag and you’ll probably get some reasonable responses. Or tweet it with #LJMU_MSIT and I will either reply or retweet it to more stats people.

Finally, the Maths, Stats & IT Team run regular Drop-In Sessions and One-To-Ones which you’re more than welcome to come by to. Just check out the Library Calendar to find a suitable session.

Remember, never feel like you’re asking a stupid question. If you don’t know the answer, just ask!