Underlying Concepts of R.

R is free software for data analysis, statistics and plotting, and it is also a sophisticated programming language. It runs on Windows, Mac or Linux and can swap 'workspace' files containing data between the different versions. It has all the features and graphical panache of commercial statistical software such as SPSS or Stata and is far more flexible and powerful than Excel. Whilst R code is very similar to Matlab it has many more users, plus many thousands of 'packages': user-generated add-ons for almost every aspect of scientific research.

Some say that R is difficult to learn, but that is perhaps because it is used for complex scientific applications rather than being especially difficult itself. For beginners one problem is that there are usually many ways to accomplish any task, and new packages often introduce improvements on the base language approach that may take time to spread throughout the R community. On this course we try wherever possible to use newer, innovative packages in the R language, whilst showing older base R methods if they are still common.

RStudio and the Tutorials

The R software package can be downloaded from the R Binaries page on CRAN: The Comprehensive R Archive Network. The R Binaries will install automatically on both Windows (double-click) and Mac (move to Applications folder) and create a start icon on your Desktop or Start-Up menu.

For this course we are going to use an additional piece of software, RStudio, that makes using R easier still. RStudio is an improved front end for R that combines a dedicated text editor, file browser, file previewer, spreadsheet, graphics window, and help pages all within a single slick Graphical User Interface (GUI).

Once you have installed R itself you can download the RStudio installer and simply double-click to install. Then when you launch RStudio - you are essentially launching R with bells on.

RStudio Pic

All the tutorials, exercise answers, and necessary data are collected together with a folder for each tutorial (1-7). Begin by simply opening RStudio and using the file browser to navigate to the 'RTuts1' folder, then under the 'More' menu click the 'Set As Working Directory' option. Each time you start a new tutorial, navigate to the next folder and reset your working directory.

Whilst working through the course you will probably want to switch between reading through the tutorials in your browser and copy/pasting/modifying bits and pieces to test in RStudio. For the exercises in each tutorial it is probably best to create a new RScript using the RStudio file menu and write your exercises into this. Please attempt the exercises before you check the model answers.

R Basics

If you have used R before then you could skip this first tutorial and start with the next, but I would advise against it. It is often helpful to think explicitly about the rules we know unconsciously, and this tutorial is quite short. The power of R is the command line. The command line can be viewed as a sort of complex calculator, or algebra machine, with a memory or workspace. At the console type the following:

# The hash symbol denotes a comment - if you copy this line into the
# console and hit <return> or leave it in a file of R code - nothing will
# happen
4 + 8

## [1] 12

x = 4
y = 8
z = x * y
ls()

## [1] "x" "y" "z"

z

## [1] 32

# z is of 'type' or 'mode' double i.e.  a number
typeof(z)

## [1] "double"

rm(z)
ls()

## [1] "x" "y"

In this first short set of code above we create the variables x and y and then z by multiplying x and y together. Typing the name of a variable simply shows its value, and the command ls() shows the names of the variables that we have created (in RStudio you will also see the workspace variables in the top-right pane). The rm() command removes a variable from the workspace. These are the basics: the workings of a simple calculator. However data types in R can be a lot more complex than simple variables such as x, y and z.

x = 1:5
y = 6:10
z = y - x
z

## [1] 5 5 5 5 5

Here each value of x has been subtracted piecewise from the corresponding value of y and the result stored in the z variable. The variables x, y and z here are called 'vectors'. Note that the original values of x and y have been replaced by our new assignments.

Using vectors like this you can start to do simple maths.

x = 1:10
y = (x^3) - (8 * x^2) - (4 * x) + 2
plot(x, y)

plot of chunk simpleMaths

Each element of a vector such as x or y can be accessed by using brackets to index them e.g.

y[2]

## [1] -30

y[2:8]

## [1] -30 -55 -78 -93 -94 -75 -30

Here we have used the brackets to extract the 2nd element of y then the 2nd to 8th value of y (i.e. another vector). Here is a slightly different example.

x = 1:5
y = 6
z = x - y
z

## [1] -5 -4 -3 -2 -1

# a slightly trickier example copy it to your own console then try and
# work out what z is before typing z
x = 1:5
y = -1:1
z = x - y

## Warning: longer object length is not a multiple of shorter object length

z

## [1] 2 2 2 5 5

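We say that each element of y is subtracted piecewise from x, and that y is recycled (reused from its start) because it is shorter than x. The red warning message appears because the length of x (5) is not a whole multiple of the length of y (3). If you understand these examples so far you are 33.33329% of the way to learning R!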
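
The recycling rule can be made explicit with a short sketch. The vectors here are illustrative values, not the ones used above, and rep_len() is a base R function that repeats a vector up to a given length. The only point is that R performs this repetition for you behind the scenes.

```r
x = 1:6
y = c(10, 20)

# y written out to the same length as x: 10 20 10 20 10 20
recycled = rep_len(y, length(x))

# Both subtractions are element-wise and give the same result.
# No warning here, because length(x) is a whole multiple of length(y).
z1 = x - y
z2 = x - recycled
identical(z1, z2)
```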

NB Special Note on the Assignment Operator

# NB the following are all the same
x = 5
x <- 5
5 -> x

For the sake of clarity and simplicity we will not use the -> or <- operators in this tutorial but you will probably see them in other people's R code.

1. Look at the following example adapted from The R Inferno, pp. 18-19. Can you guess what the values of x, y and z are before hitting return?

# R Inferno Example p18-19
x = max(2, 100, -4, 3, 230, 5)
y = mean(2, -100, -4, 3, -230, 5)
z = mean(c(2, -100, -4, 3, -230, 5))

Types of data

Thus far all the numbers or vectors created have been of type double. Doubles can hold decimals such as pi but use more memory than integers, which you can force by adding the capital letter L after a number.

z = 3
typeof(z)

## [1] "double"

z = 3L
typeof(z)

## [1] "integer"

# NB note the warning: an integer cannot hold a decimal, so R keeps a double
z = 3.14L
typeof(z)

## [1] "double"

Then there is also a special type for complex numbers such as $\sqrt{-3}$ (we'll not worry about these). There are also characters, either single characters or strings of them, and a very useful logical datatype (TRUE, FALSE, or T and F for short), sometimes called boolean. Then there are factors, which are for categorical data. We'll discuss logical and factor datatypes a bit later.
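
As a quick reference, the sketch below asks R for the type of one example value of each kind just mentioned. The values themselves are arbitrary illustrations.

```r
typeof(2 + 3i)        # "complex"
typeof("a word")      # "character"
typeof(TRUE)          # "logical"

# sqrt() only returns a complex result if it is given a complex input
root = sqrt(as.complex(-3))
typeof(root)          # "complex"

# a factor stores category labels as integer codes plus a set of levels
f = factor(c("low", "high", "low"))
class(f)              # "factor"
typeof(f)             # "integer"
levels(f)             # "high" "low" (sorted alphabetically by default)
```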

# 'c' and 'rep' respectively combine or replicate elements into a longer
# vector
x = c("char", "char", "char")
y = rep("acter", 3)
z = x + y

## Error: non-numeric argument to binary operator

Note the error in the code above. When R encounters a problem it will usually stop and produce a sensible error message like the one above. Essentially this error is saying that adding characters together is not an arithmetic operation: the + operator expects numbers, not characters. Instead, an intelligible command is to paste the characters together piecewise (i.e. ‘char’ to ‘acter’) using the paste command:

z = paste(x, y, sep = "")
# the similar command paste0(x,y) has no separator.
z

## [1] "character" "character" "character"

Each element of z is now the string of characters 'character'. Note the extra option sep='': this simply means there should be no separator between the pasted variables x and y. Try paste(x,y, sep='a_gap') or paste(x,y, sep=x) instead.

2. Given: string='string' and the commands you know already, can you create new variables like the following:

“string” “string” “string”
“string string string”
“stringstringstring”
Type ?rep in the console. Look at the help page. Now create a new variable with 22 repeats of string.

Logic and Control

The R language uses logical statements to control the way that expressions are executed e.g.

if (expression_1) expression_2 else expression_3

This is best demonstrated:

# this is assignment
x = 12
y = 14

# this is a logical test does x equal y, is x less than or equal to y
x == y

## [1] FALSE

x <= y

## [1] TRUE


# tests can be piecewise too
a = 1:8
b = 4
a > b

## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

logVector = a > b
typeof(logVector)

## [1] "logical"


# logical conditions can be compounded: this means '12 is greater than 14'
# OR '12 is greater than 4'
(x > y | x > b)

## [1] TRUE

# whereas this means '12 is greater than 14' AND '12 is greater than 4'
(x > y & x > b)

## [1] FALSE


# or they can be useful for pulling out bits of data, a %% 2 is the
# modulus, here we pull out only even elements of a
a%%2

## [1] 1 0 1 0 1 0 1 0

a%%2 == 0

## [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE

a[a%%2 == 0]

## [1] 2 4 6 8


# if x is less than or equal to y do everything between the braces
if (x <= y) {
    print("x is less than or equal to y")
    z = 4
}

## [1] "x is less than or equal to y"


# the `if` statement can also be used with `else`; != means NOT EQUAL
if (x != y) {
    print("x doesnt equal y")
} else {
    print("x equals y")
}

## [1] "x doesnt equal y"


# finally there are loops
x

## [1] 12

for (i in 1:x) print(paste(i, x, i + x))

## [1] "1 12 13"
## [1] "2 12 14"
## [1] "3 12 15"
## [1] "4 12 16"
## [1] "5 12 17"
## [1] "6 12 18"
## [1] "7 12 19"
## [1] "8 12 20"
## [1] "9 12 21"
## [1] "10 12 22"
## [1] "11 12 23"
## [1] "12 12 24"

Note that it is good practice to indent conditional expressions. This makes it easier to follow the flow of R code. Note also that simple one-line expressions don't need braces, but you need them if an expression is more than a single line.
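
To illustrate the point about braces, here is a minimal sketch of a loop whose body spans several lines. The variable name total is an assumption made for this illustration; a is the same 1:8 vector used above.

```r
a = 1:8
total = 0

# the body of the loop is more than one line, so it needs braces
for (i in a) {
    if (i %% 2 == 0) {
        # even element: add it to the running total
        total = total + i
    }
}
total

# the vectorised equivalent needs no loop at all
sum(a[a %% 2 == 0])
```

Both versions add up the even elements of a. In R the vectorised form is usually preferred where it is available.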

3. Write a statement that prints the values of the variable a that are divisible by 3. Write a statement that prints the values of a that are greater than 2 AND less than 7.

Functions

In following tutorials we will discuss creating your own functions or commands, but for now we should briefly describe built-in functions. Just as in mathematics, a function is like a black box where you supply inputs (what we call arguments, or sometimes options) and R computes an output. For now let's consider a few built-in R examples.

x = rnorm(n = 100, mean = 110, sd = 10)
y = rnorm(n = 100, mean = 100, sd = 10)
length(x)

## [1] 100

head(x)

## [1] 109.22  92.92  83.85 106.33 109.19  94.07

The rnorm() function takes as input arguments a number n, a mean, and a standard deviation sd. The function then returns n random samples from a normal distribution with that mean and standard deviation, which we assign to the variable x. Another built-in function, t.test(), can then compare the two samples.
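
Because rnorm() draws random numbers, the values you see will differ from those printed in this tutorial. If you want repeatable results you can fix the random seed first with set.seed(). A minimal sketch follows; the seed value 42 and the variable names first and second are arbitrary choices for this illustration.

```r
# fixing the seed makes the 'random' draws repeatable
set.seed(42)
first = rnorm(n = 5, mean = 110, sd = 10)

# resetting the same seed reproduces exactly the same draws
set.seed(42)
second = rnorm(n = 5, mean = 110, sd = 10)

identical(first, second)
```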

# test equality of x and y means
t.test(x, y)

## 
##  Welch Two Sample t-test
## 
## data:  x and y 
## t = 7.513, df = 198, p-value = 1.947e-12
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##   7.742 13.252 
## sample estimates:
## mean of x mean of y 
##     110.1      99.6

The t.test() function above tests whether the means of the vectors x and y we just created are the same. As you would expect from our specification of x and y above: they are not. NB Note that I have not explicitly named the arguments that x and y match. The x and y variables have been matched automatically to the first two arguments of t.test() (see help documentation just below)

# test if mean of x < mean of y
t.test(x, y, alternative = "less")

## 
##  Welch Two Sample t-test
## 
## data:  x and y 
## t = 7.513, df = 198, p-value = 1
## alternative hypothesis: true difference in means is less than 0 
## 95 percent confidence interval:
##   -Inf 12.81 
## sample estimates:
## mean of x mean of y 
##     110.1      99.6

However we can change the output of the t.test() function by adding a further argument or option. In this case the default argument is alternative = "two.sided", which is how it behaves if we do not specify the argument ourselves. The default behaviour of every function should be documented on its Help page (discussed below). Alternatively we can change a different named argument, paired = TRUE, and change the operation of the function in a different way (below).

# test if the means of paired groups x and y are equal
t.test(x, y, paired = TRUE)

## 
##  Paired t-test
## 
## data:  x and y 
## t = 7.606, df = 99, p-value = 1.663e-11
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##   7.758 13.236 
## sample estimates:
## mean of the differences 
##                    10.5
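
The way arguments are matched by position, by name, or left at their defaults can be checked directly. The sketch below regenerates two small samples so that it runs on its own; the seed, the sample size of 20 and the names r1, r2 and r3 are assumptions for this illustration only.

```r
set.seed(1)
x = rnorm(20, mean = 110, sd = 10)
y = rnorm(20, mean = 100, sd = 10)

# arguments matched by position
r1 = t.test(x, y)

# arguments matched by name, so the order no longer matters
r2 = t.test(y = y, x = x)

# spelling out the default value changes nothing
r3 = t.test(x, y, alternative = "two.sided")

all.equal(r1$p.value, r2$p.value)
all.equal(r1$p.value, r3$p.value)
```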

OK, if you have understood the examples so far then you have really learnt about 89.99991% of the concept of R. Yet there are still thousands and thousands of functions to learn and many, many combinations of arguments for each function. So how do we learn the functions and all their arguments? Well… this may be the work of a lifetime, but really to be competent you don’t actually need to know most of them. You just need to know how to navigate the documentation and make educated guesses.

Help Search

If you are looking for a function that you think may exist (i.e. a mathematical or statistical term) try something like:

help.search("t-test")

HelpSearchPic

You should get a window something like this which lists near matches to your search term with a short description for each. The one we want is at the bottom stats::t.test, which in English means ‘the t.test function in the “stats” package’. For more obscure mathematical operations you may have to try a few different search terms.

Use ?

Once you know the name of a function (e.g. t.test() above) then you can view the 'documentation' for this function by typing:

?t.test

Every built-in function has a documentation page and each page is in a standardised format! The page starts with the name of the function followed in brackets by the 'package', then a title, a description, the usage, and the 'arguments', which are the types of variable and parameter that the function expects. Details usually describes how a complex function works, value describes the output, see also points to related functions, and finally there are examples. You can quickly run these examples and examine input data and output results by using the command example(t.test).
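
A few other base R helpers are handy when you only need a quick reminder rather than the full help page. This is a small sketch using rnorm() as the example; the pattern passed to apropos() is just one illustrative search.

```r
# help("t.test") opens the same page as ?t.test

# show the arguments of a function and their default values
args(rnorm)

# pull out one default value programmatically
formals(rnorm)$sd

# list objects on the search path whose names match a pattern
apropos("^t\\.test")
```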

Objects and Classes

Let's return and look at t.test() again:

z = t.test(x, y)

What exactly is z? It can’t be a single number, character or even a vector of numbers. It is actually a built in class of object - a composite of different types of variable. We can see this using the attributes command.

attributes(z)

## $names
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
## [6] "null.value"  "alternative" "method"      "data.name"  
## 
## $class
## [1] "htest"

z$statistic

##     t 
## 7.513

Each part of z can be accessed separately using the $ operator (e.g. z$p.value, z$alternative, z$conf.int). Most of the R built-in statistical functions return composite objects with many parts, and many common statistical tests return a special type or class of result which is recognised by other R functions. For instance:

z = t.test(x, y)
w = wilcox.test(x, y)
class(z)

## [1] "htest"

class(w)

## [1] "htest"
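
One way to see what two 'htest' results have in common is to compare their component names. The sketch below regenerates small samples so that it runs on its own; the seed, the sample size of 30 and the name shared are assumptions for this illustration.

```r
set.seed(1)
x = rnorm(30, mean = 110, sd = 10)
y = rnorm(30, mean = 100, sd = 10)

z = t.test(x, y)
w = wilcox.test(x, y)

# components that both results provide
shared = intersect(names(z), names(w))
shared

# the same component name works on either object
z$p.value
w$p.value

# a misspelt component name does not raise an error, it quietly returns NULL
is.null(z$pvalue)
```

The last line is worth remembering: a typo after $ will not stop your code, so check names() if a component seems to be empty.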

Here htest stands for a 'hypothesis test' of which there are many types but all with an essentially similar form of result and mostly the same components. When someone has gone to the trouble of creating a special class this is usually because there are other R commands that use these classes as input, for instance:

z = lm(y ~ x)
class(z)

## [1] "lm"

coefficients(z)

## (Intercept)           x 
##    96.92168     0.02435

residuals(z)

##         1         2         3         4         5         6         7 
##  -2.71379 -24.59930 -13.65072   1.05303  -8.81662 -10.71141  -7.58093 
##         8         9        10        11        12        13        14 
##   5.76673   8.63124   5.51096  -1.31720  -7.11557  -0.84341  -2.59861 
##        15        16        17        18        19        20        21 
## -12.65583   0.96071  14.34116   1.68740 -18.36694   5.09844  -3.97902 
##        22        23        24        25        26        27        28 
##  17.37686  -6.76688 -19.37759  -1.62783  -1.47235  -3.31900  14.54382 
##        29        30        31        32        33        34        35 
## -10.74836 -14.32159  17.68913  -1.11298   6.90034  -6.55043 -20.77094 
##        36        37        38        39        40        41        42 
##  -6.33074  19.46216  27.39032   4.87889   6.40344   1.78745  11.94260 
##        43        44        45        46        47        48        49 
##  -6.70782   1.80465  -4.12484  12.26206   0.14280   5.48145  -9.42167 
##        50        51        52        53        54        55        56 
##   0.05443   0.69080  -7.26821  -0.48181   8.77779  -0.94869   3.79709 
##        57        58        59        60        61        62        63 
##  11.52938   2.63320   6.44709  13.24228 -17.50916   1.44977   0.24909 
##        64        65        66        67        68        69        70 
##  11.72477   4.91715  -0.69468 -16.21139  -2.76491  13.41308  -3.63658 
##        71        72        73        74        75        76        77 
##  -3.98272  14.91354 -11.58088  -9.94152   2.12077  20.31287   5.96786 
##        78        79        80        81        82        83        84 
##  -5.13487  -4.88427  -0.04681   2.64901  -4.52752   8.20611  -3.56732 
##        85        86        87        88        89        90        91 
##  13.29606  -3.29457   0.19073   3.90606 -17.07340  -4.47970   7.03141 
##        92        93        94        95        96        97        98 
## -10.57624  -8.27072  -6.93882   9.82109  18.16312   8.40690  -3.11005 
##        99       100 
##  -6.82185  -3.67800

attributes(z)

## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"

summary(z)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.599  -6.590  -0.588   6.077  27.390 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  96.9217    11.2066    8.65    1e-13 ***
## x             0.0244     0.1014    0.24     0.81    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 9.95 on 98 degrees of freedom
## Multiple R-squared: 0.000588,    Adjusted R-squared: -0.00961 
## F-statistic: 0.0577 on 1 and 98 DF,  p-value: 0.811

Firstly note that here we are using a new function lm() to create a linear model, essentially the R jargon for a regression. Note also y~x, the model formula, which in this case just means y versus x. However there are complex formulae that can be used to specify all types of complex multiple regression; we’ll deal with this later! Try plot(x~y) or plot(y~x) to see this graphically.

Secondly note that the function coefficients(z) takes the composite class object z as an argument and knows how to handle it. Thirdly note that summary(z) returns not a simple list of the object but a specially organised table. For many types of data R functions intelligently use appropriate methods based upon the class of argument input. Amongst programmers this idea is often referred to as object orientation – and the R language is rife with object orientation.
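
The mechanism behind this can be sketched in a few lines. The class name 'reading' and its components are hypothetical, invented purely for this illustration. The idea is that a function such as summary() looks at the class of its argument and calls a matching method if one exists.

```r
# a hypothetical composite object: a list with a class label attached
r = structure(list(value = 21.5, unit = "C"), class = "reading")

# defining a function called summary.reading() gives summary() a method
# for objects of class 'reading'
summary.reading = function(object, ...) {
    paste("A reading of", object$value, object$unit)
}

class(r)
summary(r)
```

This is roughly how summary() can return something different for an 'lm' object than for a plain vector, although the built-in methods are of course far more elaborate.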

Data Structures

For now we will rewind and look at 3 more basic classes of object: the matrix, the list and finally the most useful of all - the data.frame. In large part it is the central use of these 3 special data classes that has made R so widespread in the research community.

The Matrix

x = rnorm(10, mean = 110, sd = 10)
y = rnorm(10, mean = 100, sd = 10)
z = cbind(x, y)
class(z)

## [1] "matrix"

A matrix is a table of values all of the same type. We created it here by binding two numeric vectors x and y together using cbind(), or ‘column bind’. If the values are numeric then we can perform matrix algebra and manipulations. For instance here we take the transpose of the matrix (i.e. swap its rows and columns) using t() and then matrix multiply the original matrix by this transpose:

zt = t(z)
mm = z %*% zt
head(mm)

##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
## [1,] 24002 21714 21779 24800 23404 23079 22402 25079 23060 22710
## [2,] 21714 20263 19750 22814 20997 21559 20160 22754 20609 20969
## [3,] 21779 19750 19765 22532 21222 20993 20319 22761 20904 20639
## [4,] 24800 22814 22532 25856 24075 24262 23083 25953 23672 23724
## [5,] 23404 20997 21222 24075 22869 22312 21873 24435 22556 22025
## [6,] 23079 21559 20993 24262 22312 22939 21424 24187 21896 22302

Matrices have dimensions which are the rows and columns of the table. Similar to a vector we can access slices or cells of the matrix using bracket [ ] notation like so:

dim(z)

## [1] 10  2

# row 1 to 5
z[1:5, ]

##           x      y
## [1,] 117.45 101.03
## [2,]  90.01 110.28
## [3,] 105.32  93.13
## [4,] 111.43 115.93
## [5,] 119.09  93.20

# row 1 to 5 of column 1
z[1:5, 1]

## [1] 117.45  90.01 105.32 111.43 119.09

# row 3 column 2
z[3, 2]

##     y 
## 93.13
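
The dimensions of a matrix product depend on the order of multiplication, which is easy to check with dim(). The sketch below rebuilds a 10 by 2 matrix so that it runs on its own; the seed and the name tall are assumptions for this illustration.

```r
set.seed(1)
z = cbind(x = rnorm(10, mean = 110, sd = 10),
          y = rnorm(10, mean = 100, sd = 10))

dim(z)              # 10 rows, 2 columns
dim(t(z))           # 2 rows, 10 columns
dim(z %*% t(z))     # (10 x 2) times (2 x 10) gives 10 x 10
dim(t(z) %*% z)     # (2 x 10) times (10 x 2) gives 2 x 2

# columns can be selected by name as well as by number
identical(z[, "y"], z[, 2])

# logical tests also work as row indexes; drop = FALSE keeps the result
# a matrix even if only one row matches
tall = z[z[, "x"] > 110, , drop = FALSE]
is.matrix(tall)
```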

Exercises

There is also a matrix command. Check the help page ?matrix.

6. With this command create a matrix called m with 10 columns and 8 rows of 0.

7. Create a matrix with 10 columns of random samples from the uniform distribution (Hint: it's like rnorm()) between 0 and 7. Tricky?

The List

R Beginners often use the term list interchangeably with vector. The list however is a quite different object that allows different types of variables of different length or dimension to be stored together and indexed either by their position in the list [[1]] or directly by name.

a = 1:5
b = rep("chars", 100)
L = list(a = a, b = b)
L

## $a
## [1] 1 2 3 4 5
## 
## $b
##   [1] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##   [9] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [17] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [25] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [33] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [41] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [49] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [57] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [65] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [73] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [81] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [89] "chars" "chars" "chars" "chars" "chars" "chars" "chars" "chars"
##  [97] "chars" "chars" "chars" "chars"

L[1]

## $a
## [1] 1 2 3 4 5

L[[1]]

## [1] 1 2 3 4 5

L$b[1:5]

## [1] "chars" "chars" "chars" "chars" "chars"

Firstly note carefully the difference between L[1] and L[[1]], as this often confuses people. L[1] is a sub-list containing the first element (still a list), whilst L[[1]] extracts the vector itself. Secondly note that we can access the second element by name, L$b, and then further drill into just the first five elements of this vector using brackets: L$b[1:5]. NB Most built-in R classes (e.g. 'htest' or 'lm' shown above) are extensions of list with different types of data combined into a single object and accessible by name, e.g. coef(z) is really just z$coefficients.
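
If the difference between single and double brackets is still unclear, asking for the class of each result can help. This is a minimal sketch with a shorter list than the one above; the extra element c is an assumption added for this illustration.

```r
L = list(a = 1:5, b = rep("chars", 3))

class(L[1])       # "list": single brackets return a smaller list
class(L[[1]])     # "integer": double brackets return the element itself

# access by name with [[ ]] is the same as using $
identical(L[["b"]], L$b)

# new elements can be added by name, and may be of any type
L$c = c(TRUE, FALSE)
names(L)
```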

8. Can you access the 3rd to 5th integers in the 1st element of the list L?

The data.frame

Similarly to a matrix the data.frame is a table of data but similar to a list the columns can be of different types of data: numeric, character etc. The data.frame is probably the most useful R object as most external data with names, labels, dates, categories and numbers is usually in tabular format. R has a selection of built-in example datasets including a data.frame of iris (the flower) measurements. Tidy up your workspace now by deleting other variables:

# tidy up
rm(list = ls())
data(iris)
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

colnames(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

The head() command used above is often useful for a first look at a large object as it shows just the first few elements or lines of an object - such as a matrix, list, or data.frame. Each column of data - or variable can be accessed using the $ character.

iris$Sepal.Length

##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
##  [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
##  [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
##  [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
##  [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
##  [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9

iris$Sepal.Length[1:10]

##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9

(iris$Sepal.Length) * (iris$Sepal.Width)

##   [1] 17.85 14.70 15.04 14.26 18.00 21.06 15.64 17.00 12.76 15.19 19.98
##  [12] 16.32 14.40 12.90 23.20 25.08 21.06 17.85 21.66 19.38 18.36 18.87
##  [23] 16.56 16.83 16.32 15.00 17.00 18.20 17.68 15.04 14.88 18.36 21.32
##  [34] 23.10 15.19 16.00 19.25 17.64 13.20 17.34 17.50 10.35 14.08 17.50
##  [45] 19.38 14.40 19.38 14.72 19.61 16.50 22.40 20.48 21.39 12.65 18.20
##  [56] 15.96 20.79 11.76 19.14 14.04 10.00 17.70 13.20 17.69 16.24 20.77
##  [67] 16.80 15.66 13.64 14.00 18.88 17.08 15.75 17.08 18.56 19.80 19.04
##  [78] 20.10 17.40 14.82 13.20 13.20 15.66 16.20 16.20 20.40 20.77 14.49
##  [89] 16.80 13.75 14.30 18.30 15.08 11.50 15.12 17.10 16.53 17.98 12.75
## [100] 15.96 20.79 15.66 21.30 18.27 19.50 22.80 12.25 21.17 16.75 25.92
## [111] 20.80 17.28 20.40 14.25 16.24 20.48 19.50 29.26 20.02 13.20 22.08
## [122] 15.68 21.56 17.01 22.11 23.04 17.36 18.30 17.92 21.60 20.72 30.02
## [133] 17.92 17.64 15.86 23.10 21.42 19.84 18.00 21.39 20.77 21.39 15.66
## [144] 21.76 22.11 20.10 15.75 19.50 21.08 17.70

However, like a matrix, you can also access columns using the bracket notation:

# REM brackets are [rows,column]
iris[, 1]

##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
##  [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
##  [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
##  [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
##  [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
##  [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9

iris[1:10, 1]

##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9

(iris[, 1]) * (iris[, 2])

##   [1] 17.85 14.70 15.04 14.26 18.00 21.06 15.64 17.00 12.76 15.19 19.98
##  [12] 16.32 14.40 12.90 23.20 25.08 21.06 17.85 21.66 19.38 18.36 18.87
##  [23] 16.56 16.83 16.32 15.00 17.00 18.20 17.68 15.04 14.88 18.36 21.32
##  [34] 23.10 15.19 16.00 19.25 17.64 13.20 17.34 17.50 10.35 14.08 17.50
##  [45] 19.38 14.40 19.38 14.72 19.61 16.50 22.40 20.48 21.39 12.65 18.20
##  [56] 15.96 20.79 11.76 19.14 14.04 10.00 17.70 13.20 17.69 16.24 20.77
##  [67] 16.80 15.66 13.64 14.00 18.88 17.08 15.75 17.08 18.56 19.80 19.04
##  [78] 20.10 17.40 14.82 13.20 13.20 15.66 16.20 16.20 20.40 20.77 14.49
##  [89] 16.80 13.75 14.30 18.30 15.08 11.50 15.12 17.10 16.53 17.98 12.75
## [100] 15.96 20.79 15.66 21.30 18.27 19.50 22.80 12.25 21.17 16.75 25.92
## [111] 20.80 17.28 20.40 14.25 16.24 20.48 19.50 29.26 20.02 13.20 22.08
## [122] 15.68 21.56 17.01 22.11 23.04 17.36 18.30 17.92 21.60 20.72 30.02
## [133] 17.92 17.64 15.86 23.10 21.42 19.84 18.00 21.39 20.77 21.39 15.66
## [144] 21.76 22.11 20.10 15.75 19.50 21.08 17.70

iris$Species

##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica
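
Rather than printing whole columns, it is often quicker to ask R about the shape and column types of a data.frame. The sketch below uses only base R functions on the built-in iris data; the threshold of 7.5 and the name big are arbitrary choices for this illustration.

```r
data(iris)

dim(iris)                # 150 rows, 5 columns
sapply(iris, class)      # the class of each column

class(iris$Species)      # "factor"
levels(iris$Species)     # the three category labels

# rows can be filtered with a logical test, columns selected by name
big = iris[iris$Sepal.Length > 7.5, c("Sepal.Length", "Species")]
nrow(big)
```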

Note that the Species data is not a simple character vector (NB don’t say ‘a list of characters’. Remember that’s something quite different). It is actually a datatype we mentioned earlier called a factor which we will deal with in a later tutorial. A sometimes useful command when dealing with data.frames is attach, which allows us to directly name columns of the data.frame. Below we attach the iris data.frame, set up a plot panel with two side-by-side panes and then place two simple plots into each pane.

attach(iris)
par(mfrow = c(1, 2))
plot(Petal.Length ~ Petal.Width, col = Species)
plot(Petal.Length ~ Species)

plot of chunk irisplot

Note here that R has produced two different types of plot appropriate to the input data type. This is another simple example of the object orientation I mentioned earlier. This is probably a good place to stop, but try to finish the exercises below. We will discuss making your own functions in the next tutorial and after that move swiftly on to wrestling with data.
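
As an aside, attach() stays in effect until you call detach(), which can cause confusion if you also have workspace variables with the same names as the columns. One alternative, offered here only as a sketch and not as the method used in this course, is the base R function with(), which makes the columns available for a single expression. The names m1 and m2 are assumptions for this illustration.

```r
data(iris)

# with() evaluates one expression using the columns of iris by name
m1 = with(iris, tapply(Petal.Length, Species, mean))

# the same calculation written out with the $ operator
m2 = tapply(iris$Petal.Length, iris$Species, mean)

m1
all.equal(m1, m2)
```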

9. Can you redo that figure with the plots on top and below instead of either side?

10. Can you plot the Petal.Length by the Petal.Width only for the setosa Species?

11. Have a look at the help page for the data.frame command. Now create a new data.frame that has the original iris columns plus two further columns for Sepal.Length divided by Sepal.Width and Petal.Length divided by Petal.Width.

ExtraWork/HomeWork

1. Make a list L of length 600 with 100 random samples from the numeric integer vector 1:20 in each level (hint: ?replicate).

2. Calculate the standard deviation and the mean of each sample in your list (hint: ?lapply).

3. Make the sd and mean values into vectors and combine them in a matrix.

4. Add a new column 'cv' of the coefficient of variation to your matrix (standard deviation divided by the mean).

5. Make a data.frame adding a further column 'groups' of repeating 'A' to 'D' labels.

6. Make two side-by-side plots: on the left is standard deviation versus mean of the A group; on the right is a boxplot of coefficient of variation by groups.