Learning R part 1: The underlying concepts of R.

R is free software for data analysis, statistics and plotting, built around a sophisticated programming language. It runs on Windows, Mac or Linux, and 'workspace' files containing data can be swapped between the different versions. It has all the features and graphical panache of commercial statistical software such as SPSS or Stata and is far more flexible and powerful than Excel. Whilst R code can look quite similar to Matlab, R has a very large user community and many thousands of 'packages': user-generated add-ons for almost every aspect of scientific research.

Some say that R is difficult to learn but I think that is because it is sometimes taught poorly, either as a new language for computer programmers or as an adjunct to a statistics course. For beginners one problem is that there are usually many ways to accomplish any task, and often new packages will introduce improvements on the base language approach that may take time to spread throughout the R community. On this course we try wherever possible to use newer innovative packages in the R language - whilst showing older base R methods if they are still common.

The R basics

The R software package can be downloaded from the R Binaries page on CRAN, the Comprehensive R Archive Network. The installers work automatically on both Windows (double-click) and Mac (move to the Applications folder) and create a start icon on your Desktop or in your Start menu.

For this course we are going to use an additional piece of software, RStudio, that makes using R easier still. RStudio is an improved front end for R that combines a dedicated text editor, file browser, file previewer, spreadsheet, graphics window and help pages into a slick graphical user interface (GUI). Integrating these programs into a single interface, rather than lots of separate program windows, makes R a lot more user friendly.
Once you have installed R itself you can download the RStudio installer and simply double-click to install. Then when you launch RStudio you are essentially launching R with a nice easy interface.

blather ends, tutorial begins

If you have used R before then you may want to skip this first tutorial and start with the next, but I would advise against it. It is often helpful to think explicitly about rules we have come to know unconsciously, and this tutorial is quite short. The power of R is the command line. The command line can be viewed as a sort of complex calculator, or algebra machine, with a memory or workspace. At the console type the following:

x = 4
y = 8
z = x * y
ls()
z

## [1] "x" "y" "z"

## [1] 32

rm(z)
ls()

## [1] "x" "y"

In this first short set of code above we create x and y variables and then z from multiplying x and y together. Typing the name of a variable simply shows its value, and the 'function' ls() lists the variables that we have created (in RStudio you will also see the workspace variables in the top-right pane). The rm() command removes a variable from the workspace. These are the basics: a simple calculator with a memory. However, data types in R can be a lot more complex than single-valued variables such as x, y and z.

x = 1:5
y = 6:10
z = y - x
z

## [1] 5 5 5 5 5

Here each value of x has been subtracted from the corresponding value of y and the results stored in the z variable. The variables x, y and z here are called 'vectors'. Note that the original values of x and y have been replaced by our new assignment. Each element of a vector such as x or y can be accessed by using square brackets to index it, for instance:

y[2]

## [1] 7

y[2:3]

## [1] 7 8

Here we have used the brackets to extract the 2nd element of y (i.e. 7) then the 2nd and 3rd value of y (i.e. another vector 7, 8). Here is a slightly different example.

x = 1:5
y = 6
z = x - y
z

## [1] -5 -4 -3 -2 -1

This also works, except that the same single value of y is subtracted from each element of x. If you understand these examples so far you are 33.33329% of the way to learning R!
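
The behaviour in the last example is usually called 'recycling': the shorter vector is reused until it matches the length of the longer one. Here is a minimal sketch using made-up values rather than anything from the examples above:

```r
# Recycling: the shorter vector is repeated to match the longer one
x = 1:6
y = c(10, 20)
z = x - y        # y is reused as 10, 20, 10, 20, 10, 20
z

# The same calculation with the repetition written out in full
z_full = x - rep(y, times = 3)
identical(z, z_full)
```

If the longer length is not a whole multiple of the shorter one, R still recycles but issues a warning, which is usually a sign of a mistake.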

# NB the following are all the same
x = 5
x <- 5
5 -> x

For the sake of clarity and simplicity we will not use the -> or <- operators in this tutorial, but you will probably see them in other people's R code.

Exercises

Look at the following example adapted from The R Inferno, pp. 18-19. Can you guess what the values of x, y and z are before looking?
OK check them now (i.e. type x in the console and hit return!). Do you understand that?

# R Inferno Example p18-19
x = max(2, 100, -4, 3, 230, 5)
y = mean(2, -100, -4, 3, -230, 5)
z = mean(c(2, -100, -4, 3, -230, 5))

Characters

Actually vectors do not need to be of numeric mode; they can also hold characters.

x = c("cha", "cha", "cha")
y = rep("r", 3)
z = x + y

## Error: non-numeric argument to binary operator

When R encounters a problem it should hopefully produce a sensible error message like the one above. You should always read these and think about them before asking others for help! Essentially this is saying, in a rather awkward way, that adding characters together is not an arithmetic operation: the ‘+’ operator expects numbers.

Instead, the intelligible way to ‘add’ characters (i.e. ‘cha’ to ‘r’) is the paste() command:

z = paste(x, y, sep = "")
z

## [1] "char" "char" "char"

Each element of z is now the string "char", made by joining the corresponding elements of x and y. There are 3 new functions above: c() concatenates values together end to end into longer vectors, rep() is a shortcut for making repeated sequences, and paste(), as the name suggests, sticks characters together into longer strings element by element. Note the extra option sep = "": this simply means there should be no separator between the pasted variables x and y. Try paste(x, y, sep = 'gap') or paste(x, y, sep = x) instead.

Exercises

Given string = 'string', can you create new variables like the following:
“string” “string” “string”
“string string string”
“stringstringstring”
Type ?rep in the console. Look at the help page. Now create a new variable with 22 repeats of string.

Functions

I have used the word function above in a colloquial sense but haven’t properly explained it. Just as in mathematics, a function is like a black box where you supply an input (what we call arguments) and R computes an output. Later we will look at writing our own functions, but for now let’s look at a few built-in R examples.

x = rnorm(100, mean = 110, sd = 10)
y = rnorm(100, mean = 100, sd = 10)
y

##   [1] 105.94  99.27  83.38  87.98  94.60  85.08 121.01  82.82 102.76  99.05
##  [11]  80.65 101.18  93.52  93.92 118.47 106.84  94.54  69.18  93.86  95.06
##  [21]  96.87 100.75 106.64  99.21  84.50 102.27 108.47  99.25  78.55  85.63
##  [31]  96.19  80.73 112.62  93.25 101.01  80.57 104.05  88.49  99.89 108.98
##  [41]  70.39 107.20 100.00 107.82 101.66  93.18  91.20  90.35 127.64  90.19
##  [51] 106.93  99.63 108.22  95.85 107.03 120.28  96.55  89.18  93.52  87.54
##  [61] 102.33  95.13  96.70  96.73 113.59 102.34 104.60 107.29 103.30  97.38
##  [71]  90.26  91.98 112.25  90.28  97.17  88.68  92.99 100.61  91.11  97.34
##  [81] 101.82 106.04  94.46  85.01  89.58 110.23 114.60 102.62 101.90 103.25
##  [91] 102.54  96.42  97.83 102.11  79.72  86.80  98.79 105.75 113.30 106.48

The rnorm function takes as input arguments a number n, a mean, and a standard deviation and returns n random samples from a normal population with said mean and standard deviation.
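
Because rnorm() draws random numbers, your values of x and y will differ from those printed here. If you want repeatable draws, one option is to fix the random seed first with set.seed(). The seed value 42 below is an arbitrary choice for illustration, not the one used to produce the output in this tutorial:

```r
# Fixing the seed makes the 'random' draws repeatable
set.seed(42)                         # 42 is an arbitrary illustrative seed
a = rnorm(100, mean = 100, sd = 10)

set.seed(42)                         # same seed again ...
b = rnorm(100, mean = 100, sd = 10)

identical(a, b)                      # ... gives exactly the same samples
length(a)
mean(a)                              # close to, but not exactly, 100
sd(a)                                # close to, but not exactly, 10
```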

# test equality of x and y means
t.test(x, y)

## 
##  Welch Two Sample t-test
## 
## data:  x and y 
## t = 7.715, df = 197.7, p-value = 5.82e-13
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##   8.644 14.579 
## sample estimates:
## mean of x mean of y 
##    109.54     97.93 
##

The t.test function above tests whether the means of vectors x and y are the same. As you would expect from our specification of x and y above, they are not. Note that I have not explicitly named the arguments that x and y match. Instead the x and y variables have been matched by position to the first two arguments of the t.test function (see the help documentation just below).

# test if mean of x < mean of y
t.test(x, y, alternative = "less")

## 
##  Welch Two Sample t-test
## 
## data:  x and y 
## t = 7.715, df = 197.7, p-value = 1
## alternative hypothesis: true difference in means is less than 0 
## 95 percent confidence interval:
##  -Inf 14.1 
## sample estimates:
## mean of x mean of y 
##    109.54     97.93 
##

We can, however, change the behaviour of the t.test() function by adding a further argument, as in the call above. The default is alternative = "two.sided", which is how the function behaves when we do not specify this argument ourselves. The default behaviour of a function should be documented on its Help page (discussed below).

Or we can change a different named argument, paired = TRUE, and change the operation of the function in a different way (below).

# test if the means of paired groups x and y are equal
t.test(x, y, paired = TRUE)

## 
##  Paired t-test
## 
## data:  x and y 
## t = 7.926, df = 99, p-value = 3.448e-12
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##   8.705 14.518 
## sample estimates:
## mean of the differences 
##                   11.61 
##
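
The three calls above differ only in the named arguments supplied after x and y. As a hedged illustration of how R matches arguments, the sketch below uses freshly simulated vectors g1 and g2 (hypothetical names, not the x and y above) and shows that positional and named forms give the same test:

```r
set.seed(1)                              # arbitrary seed, for repeatability only
g1 = rnorm(30, mean = 110, sd = 10)      # hypothetical sample 1
g2 = rnorm(30, mean = 100, sd = 10)      # hypothetical sample 2

by_position = t.test(g1, g2)             # arguments matched by position
by_name     = t.test(y = g2, x = g1)     # matched by name, so order is irrelevant

# Same statistic and p-value either way
all.equal(by_position$statistic, by_name$statistic)
all.equal(by_position$p.value, by_name$p.value)

# Spelling out a default value changes nothing
explicit = t.test(g1, g2, alternative = "two.sided")
all.equal(by_position$p.value, explicit$p.value)
```

Naming arguments is more typing, but it makes a call easier to read and protects you from getting the order wrong.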

OK, if you have understood the examples so far then you have really learnt about 89.99991% of the concept of R. Yet there are still thousands and thousands of functions to learn and many, many combinations of arguments for each function. So how do we learn the functions and all their arguments? Well… this may be the work of a lifetime, but really to be competent you don’t actually need to know most of them. You just need to know how to navigate the documentation and make educated guesses.

Help Search

If you are looking for a function that you think may exist (i.e. a mathematical or statistical term) try something like:

help.search("t-test")

You should get a window that lists near matches to your search term with a short description for each. The one we want, at the bottom, is stats::t.test, which in English means ‘the t.test function in the “stats” package’. For more obscure mathematical operations you may have to try a few different search terms.

Use ?

Once you know the name of a function (e.g. t.test() above) then you can view the 'documentation' for this function by typing:

?t.test

Every built-in function has a documentation page and each page is standardised! It gives the name of the function followed in brackets by its 'package', a description, the usage, the 'arguments' (the types of variable and parameter that the function expects), details that usually describe how a complex function works, the value that is returned (more on this below) and examples.
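
If you only need a reminder of a function's arguments and their defaults, you can also inspect them from the console without opening the help page. Here is a small sketch using base R's args() and formals() on rnorm():

```r
# Print the argument list of rnorm, including default values
args(rnorm)

# formals() returns the same information as a named list
f = formals(rnorm)
names(f)        # the argument names
f$mean          # default value of 'mean'
f$sd            # default value of 'sd'

# help() is the long form of the ? shortcut
# help("rnorm")
```

An argument with no default, such as n here, must always be supplied by the caller.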

Objects and Classes

Let's return and look at t.test() again:

z = t.test(x, y)

What exactly is z? It can’t be a single number, character or even a vector of numbers. It is actually a built-in class of object: a composite of different types of variable. We can see this using the summary() function:

summary(z)

##             Length Class  Mode     
## statistic   1      -none- numeric  
## parameter   1      -none- numeric  
## p.value     1      -none- numeric  
## conf.int    2      -none- numeric  
## estimate    2      -none- numeric  
## null.value  1      -none- numeric  
## alternative 1      -none- character
## method      1      -none- character
## data.name   1      -none- character

z$statistic

##     t 
## 7.715

Each part of z can be accessed using the $ operator (e.g. z$p.value, z$alternative, z$conf.int). Most of the R built-in statistical functions return composite objects with many parts, and many common statistical tests return a special type or class of result which is recognised by other R functions. For instance:

z = t.test(x, y)
z = wilcox.test(x, y)
class(z)
z

## [1] "htest"

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  x and y 
## W = 7933, p-value = 7.765e-13
## alternative hypothesis: true location shift is not equal to 0 
##

Here 'htest' stands for 'hypothesis test', of which there are many types, but which all return essentially the same form of result object with mostly the same components. When someone has gone to the trouble of creating a special class this is usually because there are other functions that take objects of the class as input, rather than very long lists of variables, for instance:

z = lm(x ~ y)
coefficients(z)

## (Intercept)           y 
##   104.17244     0.05479

summary(z)

## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -48.38  -7.52   0.71   6.74  31.82 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 104.1724    10.3482   10.07   <2e-16 ***
## y             0.0548     0.1051    0.52      0.6    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 10.9 on 98 degrees of freedom
## Multiple R-squared: 0.00277, Adjusted R-squared: -0.00741 
## F-statistic: 0.272 on 1 and 98 DF,  p-value: 0.603 
##

Firstly note here we are using a new function lm() to create a linear model, essentially the R jargon for a regression. Note also 'x~y', the 'model formula', which just means x versus y. However, there are complex formulae that can be used to specify all types of complex multiple regression; we’ll deal with this later! Try plot(x~y) to see this graphically.

Secondly note that the function coefficients(z) takes the complex composite object z as an argument but knows how to handle it. Thirdly note that summary(z) returns not a simple list of the object's parts but a specially organised table. For many types of data, R functions intelligently use appropriate methods based upon the class of the argument supplied. Amongst programmers this idea is often referred to as object orientation, and the R language is rife with it. We will discuss this again in a later tutorial.

So far we have gone quite deep into the workings of functions (of course if you are a computer programmer or engineer rather than a biologist this will be second nature). You could probably learn to use R without thinking about this stuff, and most people do.
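
To make the idea of class-based dispatch concrete, here is a minimal sketch using R's S3 system. The class name 'reading' and the helper new_reading() are invented for this illustration and do not exist in R itself:

```r
# A hypothetical constructor: a list tagged with a class attribute
new_reading = function(value, unit) {
  obj = list(value = value, unit = unit)
  class(obj) = "reading"
  obj
}

# A summary method for our class; summary() finds it via the class name
summary.reading = function(object, ...) {
  paste("Mean reading:", mean(object$value), object$unit)
}

r = new_reading(c(10, 20, 30), unit = "mm")
class(r)
summary(r)                # dispatches to summary.reading
summary(c(10, 20, 30))    # a plain numeric vector uses the default method
```

The same call, summary(), does different work depending on the class of its argument, which is what happens with summary(z) for the 'htest' and 'lm' objects above.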

Data Classes

Now we will rewind and look at 3 more classes of object: the matrix, the list and finally the most useful of all - the data.frame. In large part it is the central use of these 3 special classes of data that has made R so widespread in the research community in particular (… aside from cost and the help documentation).

The Matrix

x = rnorm(10, mean = 110, sd = 10)
y = rnorm(10, mean = 100, sd = 10)
z = cbind(x, y)
class(z)

## [1] "matrix"

A matrix is a table of values all of the same type (in R 4.0.0 and later, class(z) reports "matrix" "array"). We created it here by binding two numeric vectors x and y as columns using the cbind(), or ‘column bind’, function. If the values are numeric then we can perform matrix algebra and manipulations. For instance here we take the transpose of the matrix (i.e. swap its rows and columns) using t() and then matrix multiply the original matrix by this transpose:

zt = t(z)
mm = z %*% zt
head(mm)

##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
## [1,] 25064 24457 23045 22281 23073 22862 18297 22115 25378 23145
## [2,] 24457 23914 22464 21766 22576 22372 17852 21510 24814 22601
## [3,] 23045 22464 21200 20474 21185 20990 16824 20368 23309 21272
## [4,] 22281 21766 20474 19819 20542 20355 16264 19624 22585 20583
## [5,] 23073 22576 21185 20542 21317 21125 16841 20271 23425 21327
## [6,] 22862 22372 20990 20355 21125 20935 16687 20081 23213 21133
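
The order of the operands matters in matrix multiplication. A short sketch with a small made-up 3 x 2 matrix m (not the z above) shows how the shape of the result depends on which side the transpose is on:

```r
# A small 3-row, 2-column matrix built with cbind()
m = cbind(x = c(1, 2, 3), y = c(4, 5, 6))
dim(m)

rows_by_rows = m %*% t(m)   # (3 x 2) times (2 x 3) gives 3 x 3
cols_by_cols = t(m) %*% m   # (2 x 3) times (3 x 2) gives 2 x 2
dim(rows_by_rows)
dim(cols_by_cols)
cols_by_cols
```

This is why mm above has 10 rows and 10 columns even though z itself has only 2 columns.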

Matrices have dimensions which are the rows and columns of the table. Similar to a vector we can access slices or cells of the matrix using bracket [ ] notation like so:

dim(z)

## [1] 10  2

z[1:5, ]

##          x      y
## [1,] 118.5 104.94
## [2,] 111.0 107.65
## [3,] 111.2  93.98
## [4,] 103.0  95.94
## [5,] 103.3 103.15

z[1:5, 1]

## [1] 118.5 111.0 111.2 103.0 103.3

z[3, 4]

## Error: subscript out of bounds
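
The error arises because z has only 2 columns, so there is no column 4 to index. Here is a small sketch, again with a made-up matrix rather than z, of checking the dimensions first and of indexing columns by name:

```r
m = cbind(x = c(1, 2, 3), y = c(4, 5, 6))

nrow(m)          # number of rows
ncol(m)          # number of columns

m[3, 2]          # row 3, column 2: within bounds
m[3, "y"]        # the same cell, using the column name
m[, "x"]         # a whole column, returned as a vector

# Guard against an out-of-range column index before subsetting
j = 4
cell = if (j <= ncol(m)) m[3, j] else NA
cell
```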

Exercises

There is also a matrix command. Check the help page ?matrix.
Now create a matrix called m with 10 columns and 8 rows of NA.
Tricky! Create a matrix with 10 columns of random samples from the uniform distribution between 0 and 7.

The List

R beginners often use the term list interchangeably with vector. The list, however, is a quite different object that allows variables of different types and of different lengths or dimensions to be stored together and indexed either by their order or by name.

a = 1:5
b = rep("chars", 100)
L = list(a = a, b = b)
L

## $a
## [1] 1 2 3 4 5
## 
## $b
##   [1] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##   [9] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [17] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [25] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [33] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [41] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [49] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [57] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [65] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [73] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [81] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [89] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [97] "chars" "chars" "chars" "chars"
##

L[[1]]

## [1] 1 2 3 4 5

L$b[1:5]

## [1] "chars" "chars" "chars" "chars" "chars"

Note that we access the second element, which is a vector, by name: L$b, but we can also restrict this vector to only the first 5 elements L$b[1:5]

Actually most built-in R classes (e.g. 'htest' or 'lm' shown above) are special types of list, as they contain different types of data combined into a single object and accessible by name, e.g. for a fitted lm object z, coef(z) returns the same values as z$coefficients.
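
You can check this claim directly. The sketch below fits a model to simulated data (the vectors u and v are made up for the illustration) and confirms that the result is built on a list whose parts can be reached by name:

```r
set.seed(2)                       # arbitrary seed, for repeatability only
u = rnorm(50)                     # hypothetical predictor
v = 3 + 2 * u + rnorm(50)         # hypothetical response

fit = lm(v ~ u)
is.list(fit)                      # an 'lm' object is built on a list
class(fit)
names(fit)                        # the named components of the object

# The extractor function and the named component agree
all.equal(coef(fit), fit$coefficients)
```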

Exercises

Can you access the 3rd to 5th element of the 1st element in the list L?
If not read that again carefully and try again.

The Data.Frame

The data.frame is the most useful R object, as most external data with names, labels, dates, categories and numbers comes in this format. R has a selection of built-in example datasets, including a data.frame of flower measurements from three species of iris. Tidy up your workspace first by deleting other variables:

rm(list = ls())
data(iris)
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

colnames(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

The head() function is often useful as it shows just the first few elements or lines of an object - such as a matrix, list, or data.frame. Similarly to a list the columns of a data.frame each have a unique name which you can use to access them:

iris$Species[1:4]

## [1] setosa setosa setosa setosa
## Levels: setosa versicolor virginica

Or, just like a matrix, you can access them using the bracket notation:

iris[1:4, 5]

## [1] setosa setosa setosa setosa
## Levels: setosa versicolor virginica

Note that the Species data is not a simple character vector (NB don’t say ‘a list of characters’. Remember that’s something quite different). It is actually a data type called a factor, which we will deal with in the next tutorial. Another useful function when dealing with data.frames is the attach() command, which allows us to refer to the columns of the data.frame directly by name. Below we attach() the iris data.frame, set up a plot panel with two side-by-side panes and then place one simple plot into each pane.

attach(iris)
par(mfrow = c(1, 2))
plot(Petal.Length ~ Petal.Width, col = Species)
plot(Petal.Length ~ Species)

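[Figure: scatter plot of Petal.Length against Petal.Width coloured by Species (left) and box plot of Petal.Length by Species (right)]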

Note here that R has produced two different types of plot appropriate to the type of data supplied. This is another simple example of the object orientation I mentioned earlier.
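
attach() is convenient at the console, but attached columns can be masked by workspace variables of the same name. As a hedged alternative, not used elsewhere in this tutorial, base R's with() and the data argument of modelling functions let you refer to columns by name without attaching anything:

```r
data(iris)

# Evaluate an expression using the columns of iris as variables
with(iris, mean(Petal.Length))

# Many modelling functions accept a data argument instead
fit = lm(Petal.Length ~ Petal.Width, data = iris)
coefficients(fit)

# Both routes give the same answer as the explicit $ form
all.equal(with(iris, mean(Petal.Length)), mean(iris$Petal.Length))
```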

Another more esoteric but nonetheless interesting reason that matrices, lists and data.frames greatly enhance the R language is that they encourage so-called functional programming. In practical terms this means that there are many functions that operate on each element, row, or column of these data classes (e.g. apply, lapply, tapply) and thus remove the need for programs with explicit loops. You can use loops in R, e.g.

# loops are usually more verbose than functional programming;
# print() is needed to see each result inside a loop
for (i in 1:4) print(mean(iris[, i]))

For many statistical operations, such as matrix algebra and calculus, this style of programming is similar to the way mathematical formulae are written. For those of a less computing-science-oriented background it’s quite similar to the way that row and column formulae work in Excel. For example:

df = iris[, 1:4]
# column means
apply(df, MARGIN = 2, mean)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        5.843        3.057        3.758        1.199

# sum of the rows
apply(df, MARGIN = 1, sum)

##   [1] 10.2  9.5  9.4  9.4 10.2 11.4  9.7 10.1  8.9  9.6 10.8 10.0  9.3  8.5
##  [15] 11.2 12.0 11.0 10.3 11.5 10.7 10.7 10.7  9.4 10.6 10.3  9.8 10.4 10.4
##  [29] 10.2  9.7  9.7 10.7 10.9 11.3  9.7  9.6 10.5 10.0  8.9 10.2 10.1  8.4
##  [43]  9.1 10.7 11.2  9.5 10.7  9.4 10.7  9.9 16.3 15.6 16.4 13.1 15.4 14.3
##  [57] 15.9 11.6 15.4 13.2 11.5 14.6 13.2 15.1 13.4 15.6 14.6 13.6 14.4 13.1
##  [71] 15.7 14.2 15.2 14.8 14.9 15.4 15.8 16.4 14.9 12.8 12.8 12.6 13.6 15.4
##  [85] 14.4 15.5 16.0 14.3 14.0 13.3 13.7 15.1 13.6 11.6 13.8 14.1 14.1 14.7
##  [99] 11.7 13.9 18.1 15.5 18.1 16.6 17.5 19.3 13.6 18.3 16.8 19.4 16.8 16.3
## [113] 17.4 15.2 16.1 17.2 16.8 20.4 19.5 14.7 18.1 15.3 19.2 15.7 17.8 18.2
## [127] 15.6 15.8 16.9 17.6 18.2 20.1 17.0 15.7 15.7 19.1 17.7 16.8 15.6 17.5
## [141] 17.8 17.4 15.5 18.2 18.2 17.2 15.7 16.7 17.3 15.8

In English the second statement might read ‘calculate the mean for each column of df’ and the third would read ‘calculate the sum for each row of df’. The result in each case is a vector with as many elements as df has columns or rows respectively. For those moving from more procedural languages (containing explicit loop commands) such as C, C++ or Java, this style of programming on data can take a bit of getting used to.
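
For comparison, here is a sketch of the same column means computed three ways, plus a grouped mean with tapply(), which was mentioned above but not shown. An explicit loop needs somewhere to store its results, which is part of why the functional forms are shorter:

```r
df = iris[, 1:4]

# 1. Explicit loop, storing each result as we go
loop_means = numeric(ncol(df))
for (i in seq_len(ncol(df))) loop_means[i] = mean(df[, i])
names(loop_means) = colnames(df)

# 2. apply() over columns, as in the text
apply_means = apply(df, MARGIN = 2, mean)

# 3. sapply() treats the data.frame as a list of columns
sapply_means = sapply(df, mean)

all.equal(loop_means, apply_means)
all.equal(apply_means, sapply_means)

# Mean petal length within each species
species_means = tapply(iris$Petal.Length, iris$Species, mean)
species_means
```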

This is probably a good place to stop. My aim has been to give you a glimpse of the underlying concepts of the R language but it’s probably more constructive to learn the rest via practice with real data.

Exercises

Can you redo that figure with the plots on top and below instead of either side?
Can you plot the Petal.Length by the Petal.Width only for the setosa Species?
Have a look at the help page for the data.frame() command.
Now create a new data.frame that has the original iris columns plus two further columns for Sepal.Length divided by Sepal.Width and Petal.Length divided by Petal.Width.