R is free software for data analysis, statistics and plotting, built around a sophisticated programming language. It runs on Windows, Mac or Linux, and 'workspace' files containing data can be swapped between the different versions. It has all the features and graphical panache of commercial statistical software such as SPSS or Stata and is far more flexible and powerful than Excel. Whilst R code is very similar to Matlab it has many more users, plus many thousands of 'packages': user-generated add-ons for almost every aspect of scientific research.
Some say that R is difficult to learn, but that is perhaps because it is used for complex scientific applications rather than being especially difficult itself. For beginners one problem is that there are usually many ways to accomplish any task, and new packages often introduce improvements on the base language approach that may take time to spread throughout the R community. On this course we try wherever possible to use newer, innovative packages in the R language, whilst showing older base R methods if they are still common.
The R software package can be downloaded from the R Binaries page on CRAN: The Comprehensive R Archive Network. The R Binaries will install automatically on both Windows (double-click) and Mac (move to Applications folder) and create a start icon on your Desktop or Start-Up menu.
For this course we are going to use an additional piece of software, RStudio, that makes using R easier still. RStudio is an improved front end for R that combines a dedicated text editor, file browser, file previewer, spreadsheet, graphics window, and help pages all within a single slick Graphical User Interface (GUI).
Once you have installed R itself you can download the RStudio installer and simply double-click to install. Then when you launch RStudio - you are essentially launching R with bells on.
All the tutorials, exercise answers, and necessary data are collected together with a folder for each tutorial (1-7). Begin by simply opening RStudio and using the file browser to navigate to the 'RTuts1' folder, then under the 'More' menu click the 'Set As Working Directory' option. Each time you start a new tutorial, navigate to the next folder and reset your working directory.
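The same step can be done from the console. The following is only a sketch: it creates a stand-in 'RTuts1' folder in a temporary location so that it runs anywhere, whereas on your machine the path would be wherever you unpacked the course folders.

```r
# 'tutorial_dir' is a stand-in location made up for this illustration;
# replace it with the real path to your own 'RTuts1' folder
tutorial_dir = file.path(tempdir(), "RTuts1")
dir.create(tutorial_dir, showWarnings = FALSE)

old_dir = getwd()      # remember the current working directory
setwd(tutorial_dir)    # same effect as 'Set As Working Directory'
getwd()                # confirm where R is now looking
list.files()           # files R can see here without a full path
setwd(old_dir)         # return to where we started
```

The functions getwd() and setwd() are what the RStudio menu option calls on your behalf.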
Whilst working through the course you will probably want to switch between reading through the tutorials in your browser and copy/pasting/modifying bits and pieces to test in RStudio. For the exercises in each tutorial it is probably best to create a new RScript using the RStudio file menu and write your exercises into this. Please attempt the exercises before you check the model answers.
If you have used R before then you could skip this first tutorial and start with the next, but I would advise not to. It is often helpful to think explicitly about the rules we know unconsciously, and this tutorial is quite short. The power of R is the command line, which can be viewed as a sort of complex calculator, or algebra machine, with a memory or workspace. At the console type the following:
# The hash symbol denotes a comment - if you copy this line into the
# console and hit <return> or leave it in a file of R code - nothing will
# happen
4 + 8
## [1] 12
x = 4
y = 8
z = x * y
ls()
## [1] "x" "y" "z"
z
## [1] 32
# z is of 'type' or 'mode' double i.e. a number
typeof(z)
## [1] "double"
rm(z)
ls()
## [1] "x" "y"
In this first short set of code above we create x and y variables and then z from multiplying x and y together. Typing the name of a variable simply shows its value, and the command ls() shows the names of the variables that we have created (in RStudio you will also see the workspace variables in the top-right pane). The rm() command removes a variable from the workspace. These are the basics of a simple calculator. However data types in R can be a lot more complex than simple variables such as x, y and z.
x = 1:5
y = 6:10
z = y - x
z
## [1] 5 5 5 5 5
Here the values of x have been subtracted piecewise from the corresponding values of y and stored in the z variable. The variables x, y and z here are called 'vectors'. Note that the original values of x, y and z have been replaced by our new assignments.
Using vectors like this you can start to do simple maths.
x = 1:10
y = (x^3) - (8 * x^2) - (4 * x) + 2
plot(x, y)
Each element of a vector such as x or y can be accessed by using brackets to index them e.g.
y[2]
## [1] -30
y[2:8]
## [1] -30 -55 -78 -93 -94 -75 -30
Here we have used the brackets to extract the 2nd element of y then the 2nd to 8th value of y (i.e. another vector). Here is a slightly different example.
x = 1:5
y = 6
z = x - y
z
## [1] -5 -4 -3 -2 -1
# a slightly trickier example copy it to your own console then try and
# work out what z is before typing z
x = 1:5
y = -1:1
z = x - y
## Warning: longer object length is not a multiple of shorter object length
z
## [1] 2 2 2 5 5
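We say that y is recycled: because y is shorter than x, its elements are reused from the start until every element of x has a partner, and the subtraction is then done piecewise. The warning (shown in red at the console) appears because the length of x is not a whole multiple of the length of y. If you understand these examples so far you are 33.33329% of the way to learning R!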
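Recycling can be made explicit with rep_len(), which repeats a vector until it reaches a requested length. This is a small sketch rather than anything you need in practice; the names y_recycled, z_explicit and z_implicit are made up for the illustration.

```r
x = 1:5
y = -1:1

# repeat the elements of y until there are as many as in x
y_recycled = rep_len(y, length(x))
y_recycled

# subtracting the explicitly recycled vector gives no warning
z_explicit = x - y_recycled

# suppressWarnings() hides the length warning from the implicit version
z_implicit = suppressWarnings(x - y)

# both routes should give the same result
identical(z_explicit, z_implicit)
```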
NB Special Note on the Assignment Operator
# NB the following are all the same
x = 5
x <- 5
5 -> x
For the sake of clarity and simplicity we will not use the -> or <- operators in this tutorial, but you will probably see them in other people's R code.
1. Look at the following example adapted from The R Inferno, pp. 18-19. Can you guess what the values of x, y and z are before hitting return?
# R Inferno Example p18-19
x = max(2, 100, -4, 3, 230, 5)
y = mean(2, -100, -4, 3, -230, 5)
z = mean(c(2, -100, -4, 3, -230, 5))
Thus far all the numbers or vectors created have been of type double. The double type can handle decimals such as pi, but doubles use more memory than integers. A whole number can be forced to be an integer by adding the capital letter L.
z = 3
typeof(z)
## [1] "double"
z = 3L
typeof(z)
## [1] "integer"
# NB note the warning: 3.14L cannot be an integer, so R keeps it as a double
z = 3.14L
typeof(z)
## [1] "double"
Then there is also a special type for complex numbers such as \( \sqrt{-3} \) (we'll not worry about these). There are also characters, either single characters or strings, and a very useful logical datatype (TRUE, FALSE, T, F), sometimes called boolean. Then there are factors, which are for categorical data. We'll discuss logical and factor datatypes a bit later.
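The types mentioned so far can be checked directly with typeof() and class(). The values below are arbitrary examples chosen for this sketch; the last line uses object.size() to compare the memory needed by 1000 doubles and 1000 integers.

```r
typeof(3L)                      # "integer"
typeof(3.14)                    # "double"
typeof(sqrt(as.complex(-3)))    # "complex"
typeof("a string")              # "character"
typeof(TRUE)                    # "logical"

# a factor is categorical data stored as integer codes plus labels
f = factor(c("low", "high", "low"))
class(f)                        # "factor"
typeof(f)                       # "integer"

# doubles take more memory per element than integers
object.size(numeric(1000)) > object.size(integer(1000))
```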
# 'c' and 'rep' respectively combine or replicate elements into a longer
# vector
x = c("char", "char", "char")
y = rep("acter", 3)
z = x + y
## Error: non-numeric argument to binary operator
Note the error in the code above. When R encounters a problem it will usually stop and produce a sensible error message like the one above. Essentially this error is saying that adding characters together is not an arithmetic operation: the + operator expects numbers, not characters. Instead, an intelligible command is to paste characters piecewise (i.e. 'char' to 'acter') using the paste command:
z = paste(x, y, sep = "")
# the similar command paste0(x,y) has no separator.
z
## [1] "character" "character" "character"
Each element of z is now the single string 'character'. Note the extra option sep = '': this simply means there should be no separator between the pasted variables x and y. Try paste(x,y, sep='a_gap') or paste(x,y, sep=x) instead.
2. Given: string='string' and the commands you know already, can you create a new variable like the following:
string.
The R language uses logical statements to control the way that expressions are executed, e.g.
if (expression_1) expression_2 else expression_3
This is best demonstrated:
# this is assignment
x = 12
y = 14
# this is a logical test does x equal y, is x less than or equal to y
x == y
## [1] FALSE
x <= y
## [1] TRUE
# tests can be piecewise too
a = 1:8
b = 4
a > b
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
logVector = a > b
typeof(logVector)
## [1] "logical"
# logical conditions can be compounded: this means '12 is greater than 14'
# OR '12 is greater than 4'
(x > y | x > b)
## [1] TRUE
# whereas this means '12 is greater than 14' AND '12 is greater than 4'
(x > y & x > b)
## [1] FALSE
# or they can be useful for pulling out bits of data, a %% 2 is the
# modulus, here we pull out only even elements of a
a%%2
## [1] 1 0 1 0 1 0 1 0
a%%2 == 0
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
a[a%%2 == 0]
## [1] 2 4 6 8
# if x is less than or equal to y do everything between the braces
if (x <= y) {
    print("x is less than or equal to y")
    z = 4
}
## [1] "x is less than or equal to y"
# the `if` statement can also be used with `else`; != is NOT EQUALS
if (x != y) {
    print("x doesnt equal y")
} else {
    print("x equals y")
}
## [1] "x doesnt equal y"
# finally there are loops
x
## [1] 12
for (i in 1:x) print(paste(i, x, i + x))
## [1] "1 12 13"
## [1] "2 12 14"
## [1] "3 12 15"
## [1] "4 12 16"
## [1] "5 12 17"
## [1] "6 12 18"
## [1] "7 12 19"
## [1] "8 12 20"
## [1] "9 12 21"
## [1] "10 12 22"
## [1] "11 12 23"
## [1] "12 12 24"
Note that it is good practice to indent conditional expressions. This makes it easier to follow the flow of R code. Note also that simple one-line expressions don't need braces, but you do need them if an expression is more than a single line.
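As a small sketch of the brace and indentation rules just described, the loop below has a body that spans several lines and therefore needs braces. The variable names evens and odds are made up for this illustration.

```r
evens = 0
odds = 0
# the loop body is more than one line, so it must sit inside braces
for (i in 1:10) {
    if (i %% 2 == 0) {
        evens = evens + i    # running total of the even numbers
    } else {
        odds = odds + i      # running total of the odd numbers
    }
}
evens
odds

# a one-line body can do without braces
for (i in 1:3) print(i)
```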
3. Write a statement that prints values of the variable a divisible by 3. Write a statement that prints values of a greater than 2 AND less than 7.
In following tutorials we will discuss creating your own functions or commands, but for now we should briefly describe built-in functions. Just as in mathematics, a function is like a black box where you supply inputs (what we call arguments, or sometimes options) and R computes an output. For now let's consider a few built-in R examples.
x = rnorm(n = 100, mean = 110, sd = 10)
y = rnorm(n = 100, mean = 100, sd = 10)
length(x)
## [1] 100
head(x)
## [1] 109.22 92.92 83.85 106.33 109.19 94.07
The rnorm() function takes as input arguments a number of samples n, a mean, and a standard deviation sd. It returns n random samples from a normal distribution with that mean and standard deviation, which we then assign to the variable x (and likewise y). Here is another example.
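Because rnorm() draws random samples, your numbers will differ from those printed in this tutorial. If you want repeatable results you can fix the random seed first with set.seed(); the seed value 42 and the names a, b and big below are arbitrary choices for this sketch.

```r
set.seed(42)                            # fix the random number generator
a = rnorm(n = 5, mean = 110, sd = 10)
set.seed(42)                            # reset to the same starting point
b = rnorm(n = 5, mean = 110, sd = 10)
identical(a, b)                         # the two draws are the same

# with a large sample the mean and sd should be close to those requested
big = rnorm(n = 10000, mean = 110, sd = 10)
mean(big)
sd(big)
```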
# test equality of x and y means
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 7.513, df = 198, p-value = 1.947e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.742 13.252
## sample estimates:
## mean of x mean of y
## 110.1 99.6
The t.test() function above tests whether the means of the vectors x and y we just created are the same. As you would expect from our specification of x and y above, they are not. NB I have not explicitly named the arguments that x and y match: the x and y variables have been matched automatically, by position, to the first two arguments of t.test() (see the help documentation just below).
# test if mean of x < mean of y
t.test(x, y, alternative = "less")
##
## Welch Two Sample t-test
##
## data: x and y
## t = 7.513, df = 198, p-value = 1
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 12.81
## sample estimates:
## mean of x mean of y
## 110.1 99.6
However we can change the output of the t.test() function by adding a further argument or option, as in the call above. In this case the default is alternative = 'two.sided', which is how the function behaves if we do not specify the argument ourselves. The default behaviour of every function should be documented on its Help page (discussed below). Alternatively we can change a different named argument, paired = TRUE, and change the operation of the function in a different way (below).
# test if the means of paired groups x and y are equal
t.test(x, y, paired = TRUE)
##
## Paired t-test
##
## data: x and y
## t = 7.606, df = 99, p-value = 1.663e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.758 13.236
## sample estimates:
## mean of the differences
## 10.5
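The matching of arguments by position and by name described above can be checked directly. This is a sketch under the assumption of an arbitrary seed, so the numbers will not match the output printed in this tutorial; the names by_position, by_name and reordered are made up for the illustration.

```r
set.seed(1)    # arbitrary seed so that the sketch is repeatable
x = rnorm(n = 100, mean = 110, sd = 10)
y = rnorm(n = 100, mean = 100, sd = 10)

# positional matching: x and y go to the first two arguments
by_position = t.test(x, y)

# the same call with every argument named, including the default
by_name = t.test(x = x, y = y, alternative = "two.sided")

# both calls should give the same p-value
all.equal(by_position$p.value, by_name$p.value)

# named arguments can be supplied in any order
reordered = t.test(alternative = "less", y = y, x = x)
reordered$alternative
```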
OK, if you have understood the examples so far then you have really learnt about 89.99991% of the concept of R. Yet there are still thousands and thousands of functions to learn and many, many combinations of arguments for each function. So how do we learn the functions and all their arguments? Well… this may be the work of a lifetime, but really to be competent you don't actually need to know most of them. You just need to know how to navigate the documentation and make educated guesses.
If you are looking for a function that you think may exist (i.e. a mathematical or statistical term) try something like:
help.search("t-test")
You should get a window that lists near matches to your search term with a short description for each. The one we want is at the bottom: stats::t.test, which in English means 'the t.test function in the "stats" package'. For more obscure mathematical operations you may have to try a few different search terms.
Once you know the name of a function (e.g. t.test() above) then you can view 'documentation' for this function by typing:
?t.test
Every built-in function has a documentation page, and each page follows a standardised format: the name of the function followed in brackets by the 'package'; a title; a description; usage; 'arguments', which are the types of variable and parameter that the function expects; details, which usually describes how a complex function works; value, which describes the output; see also (discussed below); and finally examples. You can quickly run these examples and examine input data and output results by using the command example(t.test).
See Also is the second-to-last section of a help document. Once you have begun to learn your way around a few functions, the quickest way to find something is to look at the help page of something similar. For instance the 'See Also' section of wilcox.test() has a link to t.test(). Alternatively the 'Index' of all functions within the 'stats' package can be viewed by clicking 'Index' at the bottom of the documentation page (although for stats this list is huge).
There are hundreds of built-in commands and thousands of packages with many, many more commands. No-one knows them all, but good R users use the help system, Google, plus prior knowledge and intuition to find what they need.
4. The variances of the x and y variables are the same (we created them so). Recalculate the t-test assuming equal variance.
5. Use your intuition and/or the help command to find a way to calculate the Pearson and Spearman correlation between x and y, and whether the correlation is significant.
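Two further base R helpers can assist when you only half remember a name: apropos() lists visible objects whose names match a pattern, and find() reports which attached package an object comes from. A brief sketch, where the variable name matches is made up for the illustration:

```r
# every visible object whose name ends in 'test'
matches = apropos("test$")
head(matches)

# which attached package provides t.test?
find("t.test")

# typing ??"t-test" at the console is shorthand for help.search("t-test")
```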
Let's return and look at t.test() again:
z = t.test(x, y)
What exactly is z? It can't be a single number, character or even a vector of numbers. It is actually a built-in class of object: a composite of different types of variable. We can see this using the attributes command.
attributes(z)
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
z$statistic
## t
## 7.513
Each part of z can be accessed separately using the $ operator (e.g. z$p.value, z$alternative, z$conf.int). Most of the R built-in statistical functions return composite objects with many parts, and many common statistical methods return a special type or class of result which is recognised by other R functions. For instance
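As a short sketch of pulling individual parts out of a test result (with an arbitrary seed, so the values will differ from those printed in this tutorial):

```r
set.seed(1)    # arbitrary seed so that the sketch is repeatable
x = rnorm(n = 100, mean = 110, sd = 10)
y = rnorm(n = 100, mean = 100, sd = 10)
z = t.test(x, y)

names(z)       # the components that can follow the $ operator
z$p.value      # a single number
z$conf.int     # the lower and upper confidence limits
z$estimate     # the two sample means

# the component names must be spelt exactly: a wrong name silently gives NULL
is.null(z$pvalue)
```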
z = t.test(x, y)
w = wilcox.test(x, y)
class(z)
## [1] "htest"
class(w)
## [1] "htest"
Here htest stands for a 'hypothesis test' of which there are many types but all with an essentially similar form of result and mostly the same components. When someone has gone to the trouble of creating a special class this is usually because there are other R commands that use these classes as input, for instance:
z = lm(y ~ x)
class(z)
## [1] "lm"
coefficients(z)
## (Intercept) x
## 96.92168 0.02435
residuals(z)
## 1 2 3 4 5 6 7
## -2.71379 -24.59930 -13.65072 1.05303 -8.81662 -10.71141 -7.58093
## 8 9 10 11 12 13 14
## 5.76673 8.63124 5.51096 -1.31720 -7.11557 -0.84341 -2.59861
## 15 16 17 18 19 20 21
## -12.65583 0.96071 14.34116 1.68740 -18.36694 5.09844 -3.97902
## 22 23 24 25 26 27 28
## 17.37686 -6.76688 -19.37759 -1.62783 -1.47235 -3.31900 14.54382
## 29 30 31 32 33 34 35
## -10.74836 -14.32159 17.68913 -1.11298 6.90034 -6.55043 -20.77094
## 36 37 38 39 40 41 42
## -6.33074 19.46216 27.39032 4.87889 6.40344 1.78745 11.94260
## 43 44 45 46 47 48 49
## -6.70782 1.80465 -4.12484 12.26206 0.14280 5.48145 -9.42167
## 50 51 52 53 54 55 56
## 0.05443 0.69080 -7.26821 -0.48181 8.77779 -0.94869 3.79709
## 57 58 59 60 61 62 63
## 11.52938 2.63320 6.44709 13.24228 -17.50916 1.44977 0.24909
## 64 65 66 67 68 69 70
## 11.72477 4.91715 -0.69468 -16.21139 -2.76491 13.41308 -3.63658
## 71 72 73 74 75 76 77
## -3.98272 14.91354 -11.58088 -9.94152 2.12077 20.31287 5.96786
## 78 79 80 81 82 83 84
## -5.13487 -4.88427 -0.04681 2.64901 -4.52752 8.20611 -3.56732
## 85 86 87 88 89 90 91
## 13.29606 -3.29457 0.19073 3.90606 -17.07340 -4.47970 7.03141
## 92 93 94 95 96 97 98
## -10.57624 -8.27072 -6.93882 9.82109 18.16312 8.40690 -3.11005
## 99 100
## -6.82185 -3.67800
attributes(z)
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
##
## $class
## [1] "lm"
summary(z)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.599 -6.590 -0.588 6.077 27.390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.9217 11.2066 8.65 1e-13 ***
## x 0.0244 0.1014 0.24 0.81
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.95 on 98 degrees of freedom
## Multiple R-squared: 0.000588, Adjusted R-squared: -0.00961
## F-statistic: 0.0577 on 1 and 98 DF, p-value: 0.811
Firstly note that here we are using a new function lm() to create a linear model, which is essentially the R jargon for a regression. Note also the model formula y~x, which in this case just means y versus x. However there are complex formulae that can be used to specify all types of multiple regression; we'll deal with this later! Try plot(x~y) or plot(y~x) to see this graphically.
Secondly note that the function coefficients(z) takes the composite class object z as an argument and knows how to handle it. Thirdly note that summary(z) returns not a simple list of the object but a specially organised table. For many types of data R functions intelligently use appropriate methods based upon the class of argument input. Amongst programmers this idea is often referred to as object orientation – and the R language is rife with object orientation.
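To illustrate how a function can choose its behaviour from the class of its argument, here is a toy sketch. The classes 'circle' and 'square' and the function describe() are invented for this example and are not part of R; only UseMethod() and structure() are real base R functions.

```r
# a generic function: it hands the work to a method chosen by class
describe = function(shape, ...) UseMethod("describe")

# one method per class, named generic.class
describe.circle = function(shape, ...) {
    paste("circle with area", round(pi * shape$r^2, 2))
}
describe.square = function(shape, ...) {
    paste("square with area", shape$side^2)
}
# the fallback used when no specific method exists
describe.default = function(shape, ...) {
    "no describe() method for this class"
}

# structure() attaches a class label to an ordinary list
a_circle = structure(list(r = 1), class = "circle")
a_square = structure(list(side = 2), class = "square")

describe(a_circle)
describe(a_square)
describe(1:3)
```

Functions such as summary() and plot() work in broadly this way, which is why the same command can give sensible output for an 'lm' object, an 'htest' object or a plain vector.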
For now we will rewind and look at 3 more basic classes of object: the matrix, the list and finally the most useful of all - the data.frame. In large part it is the central use of these 3 special data classes that has made R so widespread in the research community.
x = rnorm(10, mean = 110, sd = 10)
y = rnorm(10, mean = 100, sd = 10)
z = cbind(x, y)
class(z)
## [1] "matrix"
A matrix is a table of values all of the same type. We created it here by binding two numeric vectors x and y using cbind(), or 'column bind'. If the values are numeric then we can perform matrix algebra and manipulations. For instance here we take the transpose of the matrix (i.e. swap its rows and columns) with t() and then matrix multiply the original matrix by its transpose:
zt = t(z)
mm = z %*% zt
head(mm)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 24002 21714 21779 24800 23404 23079 22402 25079 23060 22710
## [2,] 21714 20263 19750 22814 20997 21559 20160 22754 20609 20969
## [3,] 21779 19750 19765 22532 21222 20993 20319 22761 20904 20639
## [4,] 24800 22814 22532 25856 24075 24262 23083 25953 23672 23724
## [5,] 23404 20997 21222 24075 22869 22312 21873 24435 22556 22025
## [6,] 23079 21559 20993 24262 22312 22939 21424 24187 21896 22302
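The shape of a matrix product follows from the shapes of its inputs, which can be checked with dim(). A brief sketch with an arbitrary seed:

```r
set.seed(1)    # arbitrary seed so that the sketch is repeatable
x = rnorm(10, mean = 110, sd = 10)
y = rnorm(10, mean = 100, sd = 10)
z = cbind(x, y)

dim(z)             # 10 rows, 2 columns
dim(t(z))          # 2 rows, 10 columns
dim(z %*% t(z))    # (10 x 2) times (2 x 10) gives 10 x 10
dim(t(z) %*% z)    # (2 x 10) times (10 x 2) gives 2 x 2
```

Note that the order matters: z %*% t(z) and t(z) %*% z are different matrices of different sizes.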
Matrices have dimensions which are the rows and columns of the table. Similar to a vector we can access slices or cells of the matrix using bracket [ ] notation like so:
dim(z)
## [1] 10 2
# row 1 to 5
z[1:5, ]
## x y
## [1,] 117.45 101.03
## [2,] 90.01 110.28
## [3,] 105.32 93.13
## [4,] 111.43 115.93
## [5,] 119.09 93.20
# row 1 to 5 of column 1
z[1:5, 1]
## [1] 117.45 90.01 105.32 111.43 119.09
# row 3 column 2
z[3, 2]
## y
## 93.13
There is also a matrix command. Check the help page ?matrix.
6. With this command create a matrix called m with 10 columns and 8 rows of 0.
7. Create a matrix with 10 columns of random samples from the uniform distribution (Hint: it's like rnorm()) between 0 and 7. Tricky?
R beginners often use the term list interchangeably with vector. The list, however, is a quite different object that allows different types of variables of different length or dimension to be stored together and indexed either by their position in the list, e.g. [[1]], or directly by name.
a = 1:5
b = rep("chars", 100)
L = list(a = a, b = b)
L
## $a
## [1] 1 2 3 4 5
##
## $b
## [1] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
## [9] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
## [17] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
## [25] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
## [33] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
## [41] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
## [49] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
## [57] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
## [65] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
## [73] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
## [81] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
## [89] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
## [97] "chars" "chars" "chars" "chars"
L[1]
## $a
## [1] 1 2 3 4 5
L[[1]]
## [1] 1 2 3 4 5
L$b[1:5]
## [1] "chars" "chars" "chars" "chars" "chars"
Firstly note carefully the difference between L[1] and L[[1]], as this often confuses people. L[1] is a one-element sub-list (still a list), whilst L[[1]] extracts the vector itself. Secondly note that we can access the second element by name, L$b, and then further drill into just the first five elements of this vector using brackets: L$b[1:5]. NB Most built-in R classes (e.g. 'htest' or 'lm' shown above) are extensions of list, with different types of data combined into a single object and accessible by name, e.g. coef(z) is really just z$coef.
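The difference between single and double brackets can be confirmed by asking for the class of each result. A brief sketch using the same list as above:

```r
a = 1:5
b = rep("chars", 100)
L = list(a = a, b = b)

class(L[1])        # "list": single brackets keep the list wrapper
class(L[[1]])      # "integer": double brackets return the contents
class(L["b"])      # "list": the same applies when indexing by name
class(L[["b"]])    # "character"

# the $ operator is a shorthand for double brackets with a name
identical(L$b, L[["b"]])
```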
8. Can you access the 3rd to 5th integers in the 1st element of the list L?
Similarly to a matrix the data.frame is a table of data but similar to a list the columns can be of different types of data: numeric, character etc. The data.frame is probably the most useful R object as most external data with names, labels, dates, categories and numbers is usually in tabular format. R has a selection of built-in example datasets including a data.frame of iris (the flower) measurements. Tidy up your workspace now by deleting other variables:
# tidy up
rm(list = ls())
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
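Since the columns of a data.frame can hold different types, it can be useful to check the overall shape and the class of each column before working with the data. A brief sketch using base R functions on the iris data:

```r
data(iris)
dim(iris)              # number of rows and columns
sapply(iris, class)    # the class of each column
str(iris)              # a compact summary of the whole structure
is.list(iris)          # a data.frame is built on top of a list
```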
The head() command used above is often useful for a first look at a large object as it shows just the first few elements or lines of an object - such as a matrix, list, or data.frame. Each column of data - or variable can be accessed using the $ character.
iris$Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
iris$Sepal.Length[1:10]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
(iris$Sepal.Length) * (iris$Sepal.Width)
## [1] 17.85 14.70 15.04 14.26 18.00 21.06 15.64 17.00 12.76 15.19 19.98
## [12] 16.32 14.40 12.90 23.20 25.08 21.06 17.85 21.66 19.38 18.36 18.87
## [23] 16.56 16.83 16.32 15.00 17.00 18.20 17.68 15.04 14.88 18.36 21.32
## [34] 23.10 15.19 16.00 19.25 17.64 13.20 17.34 17.50 10.35 14.08 17.50
## [45] 19.38 14.40 19.38 14.72 19.61 16.50 22.40 20.48 21.39 12.65 18.20
## [56] 15.96 20.79 11.76 19.14 14.04 10.00 17.70 13.20 17.69 16.24 20.77
## [67] 16.80 15.66 13.64 14.00 18.88 17.08 15.75 17.08 18.56 19.80 19.04
## [78] 20.10 17.40 14.82 13.20 13.20 15.66 16.20 16.20 20.40 20.77 14.49
## [89] 16.80 13.75 14.30 18.30 15.08 11.50 15.12 17.10 16.53 17.98 12.75
## [100] 15.96 20.79 15.66 21.30 18.27 19.50 22.80 12.25 21.17 16.75 25.92
## [111] 20.80 17.28 20.40 14.25 16.24 20.48 19.50 29.26 20.02 13.20 22.08
## [122] 15.68 21.56 17.01 22.11 23.04 17.36 18.30 17.92 21.60 20.72 30.02
## [133] 17.92 17.64 15.86 23.10 21.42 19.84 18.00 21.39 20.77 21.39 15.66
## [144] 21.76 22.11 20.10 15.75 19.50 21.08 17.70
However like a matrix you can access columns using the bracket notation
# REM brackets are [rows,column]
iris[, 1]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
iris[1:10, 1]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
(iris[, 1]) * (iris[, 2])
## [1] 17.85 14.70 15.04 14.26 18.00 21.06 15.64 17.00 12.76 15.19 19.98
## [12] 16.32 14.40 12.90 23.20 25.08 21.06 17.85 21.66 19.38 18.36 18.87
## [23] 16.56 16.83 16.32 15.00 17.00 18.20 17.68 15.04 14.88 18.36 21.32
## [34] 23.10 15.19 16.00 19.25 17.64 13.20 17.34 17.50 10.35 14.08 17.50
## [45] 19.38 14.40 19.38 14.72 19.61 16.50 22.40 20.48 21.39 12.65 18.20
## [56] 15.96 20.79 11.76 19.14 14.04 10.00 17.70 13.20 17.69 16.24 20.77
## [67] 16.80 15.66 13.64 14.00 18.88 17.08 15.75 17.08 18.56 19.80 19.04
## [78] 20.10 17.40 14.82 13.20 13.20 15.66 16.20 16.20 20.40 20.77 14.49
## [89] 16.80 13.75 14.30 18.30 15.08 11.50 15.12 17.10 16.53 17.98 12.75
## [100] 15.96 20.79 15.66 21.30 18.27 19.50 22.80 12.25 21.17 16.75 25.92
## [111] 20.80 17.28 20.40 14.25 16.24 20.48 19.50 29.26 20.02 13.20 22.08
## [122] 15.68 21.56 17.01 22.11 23.04 17.36 18.30 17.92 21.60 20.72 30.02
## [133] 17.92 17.64 15.86 23.10 21.42 19.84 18.00 21.39 20.77 21.39 15.66
## [144] 21.76 22.11 20.10 15.75 19.50 21.08 17.70
iris$Species
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa setosa setosa setosa setosa
## [25] setosa setosa setosa setosa setosa setosa
## [31] setosa setosa setosa setosa setosa setosa
## [37] setosa setosa setosa setosa setosa setosa
## [43] setosa setosa setosa setosa setosa setosa
## [49] setosa setosa versicolor versicolor versicolor versicolor
## [55] versicolor versicolor versicolor versicolor versicolor versicolor
## [61] versicolor versicolor versicolor versicolor versicolor versicolor
## [67] versicolor versicolor versicolor versicolor versicolor versicolor
## [73] versicolor versicolor versicolor versicolor versicolor versicolor
## [79] versicolor versicolor versicolor versicolor versicolor versicolor
## [85] versicolor versicolor versicolor versicolor versicolor versicolor
## [91] versicolor versicolor versicolor versicolor versicolor versicolor
## [97] versicolor versicolor versicolor versicolor virginica virginica
## [103] virginica virginica virginica virginica virginica virginica
## [109] virginica virginica virginica virginica virginica virginica
## [115] virginica virginica virginica virginica virginica virginica
## [121] virginica virginica virginica virginica virginica virginica
## [127] virginica virginica virginica virginica virginica virginica
## [133] virginica virginica virginica virginica virginica virginica
## [139] virginica virginica virginica virginica virginica virginica
## [145] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
Note that the Species data is not a simple character vector (NB don't say 'a list of characters'; remember that's something quite different). It is actually a datatype we mentioned earlier called a factor, which we will deal with in a later tutorial. A sometimes useful command when dealing with data.frames is attach, which allows us to refer to the columns of the data.frame directly by name. Below we attach the iris data.frame, set up a plot panel with two side-by-side panes and then place one simple plot into each pane.
attach(iris)
par(mfrow = c(1, 2))
plot(Petal.Length ~ Petal.Width, col = Species)
plot(Petal.Length ~ Species)
Note here that R has produced two different types of plot appropriate to the input data type. This is another simple example of the object orientation I mentioned earlier. This is probably a good place to stop, but try and finish the exercises below. We will discuss making your own functions in the next tutorial and after that move swiftly on to wrestling with data.
9. Can you redo that figure with the plots on top and below instead of either side?
10. Can you plot the Petal.Length by the Petal.Width only for the setosa Species?
11. Have a look at the help page for the data.frame command. Now create a new data.frame that has the original iris columns plus two further columns for Sepal.Length divided by Sepal.Width and Petal.Length divided by Petal.Width.
ExtraWork/HomeWork
1. Make a list L of length 600 with 100 random samples from the integer vector 1:20 in each element (hint: ?replicate).
2. Calculate the standard deviation and the mean of each sample in your list (hint: ?lapply).
3. Make the sd and mean values into vectors and combine them in a matrix.
4. Add a new column 'cv' of the coefficient of variation (standard deviation divided by the mean) to your matrix.
5. Make a data.frame adding a further column 'groups' of repeating 'A' to 'D' labels.
6. Make two side-by-side plots: on the left, standard deviation versus mean of the A group; on the right, a boxplot of coefficient of variation by groups.