Objects and Arithmetic

R stores information in, and operates on, objects. The simplest objects are scalars, vectors and matrices, but there are many others: lists and dataframes, for example. In advanced use of R it can also be useful to define new types of object, specific to a particular application. We will stick with just the most commonly used objects here. An important feature of R is that it will do different things on different types of objects. For example,

4 + 6
## [1] 10

So, R does scalar arithmetic returning the scalar value 10. (In actual fact, R returns a vector of length 1 - hence the [1] denoting the first element of the vector.) We can assign objects values for subsequent use. For example:

x<-6
y<-4
z<-x+y

would do the same calculation as above, storing the result in an object called z. We can look at the contents of the object by simply typing its name:

z
## [1] 10

At any time we can list the objects which we have created:

ls()
## [1] "x" "y" "z"

Notice that ls is actually an object itself. Typing ls would result in a display of the contents of this object, in this case, the commands of the function. The use of parentheses, ls(), ensures that the function is executed and its result - in this case, a list of the objects in the directory - displayed. More commonly a function will operate on an object, for example

sqrt(16)
## [1] 4

calculates the square root of 16. Objects can be removed from the current workspace with the rm function:

rm(x,y)

for example. There are many standard functions available in R, and it is also possible to create new ones. Vectors can be created in R in a number of ways. We can describe all of the elements:

z<-c(5,9,1,0)

Note the use of the function c to concatenate or ‘glue together’ individual elements. This function can be used much more widely, for example

x <- c(5,9)
y <- c(1,0)
z <- c(x,y)
z
## [1] 5 9 1 0

would lead to the same result by gluing together two vectors to create a single vector. Sequences can be generated as follows:

x<-1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10

while more general sequences can be generated using the seq command. For example:

seq(1,9,by=2)
## [1] 1 3 5 7 9

and

seq(8,20,length=6)
## [1]  8.0 10.4 12.8 15.2 17.6 20.0

These examples illustrate that many functions in R have optional arguments, in this case, either the step length or the total length of the sequence (it doesn’t make sense to use both). If you leave out both of these options, R will make its own default choice, in this case assuming a step length of 1. So, for example,

x<-seq(1,10)
x
##  [1]  1  2  3  4  5  6  7  8  9 10

also generates a vector of integers from 1 to 10. At this point it’s worth mentioning the help facility. If you don’t know how to use a function, or don’t know what the options or default values are, type help(functionname) where functionname is the name of the function you are interested in. This will usually help and will often include examples to make things even clearer. Another useful function for building vectors is the rep command for repeating things. For example

rep(0,100)
##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

or

rep(1:3,6)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

Notice also a variation on the use of this function

rep(1:3,c(6,6,6))
##  [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3

which we could also simplify cleverly as

rep(1:3,rep(6,3))
##  [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3
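
The same pattern can also be requested by name. The short sketch below assumes the each argument of rep, which is not used elsewhere in this section, and checks that all three forms give the same vector (the object names a, b and d are only for illustration):

```r
# Three ways of repeating each of 1, 2, 3 six times
a <- rep(1:3, c(6, 6, 6))   # one repeat count per element
b <- rep(1:3, rep(6, 3))    # repeat counts built with rep itself
d <- rep(1:3, each = 6)     # 'each': the same count for every element

identical(a, b)   # TRUE
identical(a, d)   # TRUE
```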

As explained above, R will often adapt to the objects it is asked to work on. For example:

x<-c(6,8,9)
y<-c(1,2,4)
x + y
## [1]  7 10 13

and

x * y
## [1]  6 16 36

showing that R uses componentwise arithmetic on vectors. R will also try to make sense if objects are mixed. For example,

x<-c(6,8,9)
x + 2
## [1]  8 10 11

though care should be taken to make sure that R is doing what you would like it to in these circumstances. Two particularly useful functions worth remembering are length which returns the length of a vector (i.e. the number of elements it contains) and sum which calculates the sum of the elements of a vector.
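
As a brief illustration of length and sum, and of why care is needed when objects of different lengths are mixed, consider the following sketch. The vectors u and v are made up for the example:

```r
x <- c(6, 8, 9)

length(x)   # number of elements: 3
sum(x)      # 6 + 8 + 9 = 23

# A scalar is reused for every element, as in x + 2 above.
# A shorter vector is reused in the same way, which may not be intended:
u <- c(1, 2, 3, 4)
v <- c(10, 20)
u + v       # v is reused as 10 20 10 20, giving 11 22 13 24
```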


Summaries and Subscripting

Let’s suppose we’ve collected some data from an experiment and stored them in an object x:

x<-c(7.5,8.2,3.1,5.6,8.2,9.3,6.5,7.0,9.3,1.2,14.5,6.2)

Some simple summary statistics of these data can be produced:

mean(x)
## [1] 7.216667
var(x)
## [1] 11.00879
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   6.050   7.250   7.217   8.475  14.500

which should all be self-explanatory. It may be, however, that we subsequently learn that the first six observations correspond to measurements made on one machine, and the second six to measurements made on another machine. This might suggest summarizing the two sets of data separately, so we would need to extract from x the two relevant subvectors. This is achieved by subscripting:

x[1:6]
## [1] 7.5 8.2 3.1 5.6 8.2 9.3

and

x[7:12]
## [1]  6.5  7.0  9.3  1.2 14.5  6.2

give the relevant subvectors. Hence,

summary(x[1:6])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.100   6.075   7.850   6.983   8.200   9.300
summary(x[7:12])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   6.275   6.750   7.450   8.725  14.500

Other subsets can be created in the obvious way. For example:

x[c(2,4,9)]
## [1] 8.2 5.6 9.3

Negative integers can be used to exclude particular elements. For example

x[-(1:6)]
## [1]  6.5  7.0  9.3  1.2 14.5  6.2

has the same effect as x[7:12].
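
A small sketch putting these pieces together: the two subvectors are stored under the illustrative names machine1 and machine2, the equivalence of the two forms of subscripting is checked, and the two means are compared.

```r
x <- c(7.5, 8.2, 3.1, 5.6, 8.2, 9.3, 6.5, 7.0, 9.3, 1.2, 14.5, 6.2)

machine1 <- x[1:6]       # first six measurements
machine2 <- x[-(1:6)]    # everything except the first six

identical(machine2, x[7:12])   # TRUE: both forms select the same elements

mean(machine1)   # about 6.983
mean(machine2)   # 7.45
```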

Matrices

Matrices can be created in R in a variety of ways. Perhaps the simplest is to create the columns and then glue them together with the command cbind. For example,

x<-c(5,7,9)
y<-c(6,3,4)
z<-cbind(x,y)
z
##      x y
## [1,] 5 6
## [2,] 7 3
## [3,] 9 4

The dimension of a matrix can be checked with the dim command:

dim(z)
## [1] 3 2

i.e., three rows and two columns. There is a similar command, rbind, for building matrices by gluing rows together. The functions cbind and rbind can also be applied to matrices themselves (provided the dimensions match) to form larger matrices. For example,

rbind(z,z)
##      x y
## [1,] 5 6
## [2,] 7 3
## [3,] 9 4
## [4,] 5 6
## [5,] 7 3
## [6,] 9 4
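
For comparison, here is a minimal sketch of rbind applied to the two original vectors rather than to a matrix. The name w is chosen only for the example:

```r
x <- c(5, 7, 9)
y <- c(6, 3, 4)

w <- rbind(x, y)   # x and y become the rows of w
w
dim(w)             # 2 rows and 3 columns
```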

Matrices can also be built by explicit construction via the function matrix. For example,

z<-matrix(c(5,7,9,6,3,4),nrow=3)
z
##      [,1] [,2]
## [1,]    5    6
## [2,]    7    3
## [3,]    9    4

results in a matrix z with the same entries as z above (though without the column names x and y). Notice that the dimension of the matrix is determined by the size of the vector and the requirement that the number of rows is 3, as specified by the argument nrow=3. As an alternative we could have specified the number of columns with the argument ncol=2 (obviously, it is unnecessary to give both). Notice that the matrix is ‘filled up’ column-wise. If instead you wish to fill up row-wise, add the option byrow=T. For example,

z<-matrix(c(5,7,9,6,3,4),nr=3,byrow=T)
z
##      [,1] [,2]
## [1,]    5    7
## [2,]    9    6
## [3,]    3    4

Notice that the argument nrow has been abbreviated to nr. Such abbreviations are always possible for function arguments provided they induce no ambiguity - if in doubt always use the full argument name. As usual, R will try to interpret operations on matrices in a natural way. For example, with z as above, and

y<-matrix(c(1,3,0,9,5,-1),nrow=3,byrow=T)
y
##      [,1] [,2]
## [1,]    1    3
## [2,]    0    9
## [3,]    5   -1

we obtain

y + z
##      [,1] [,2]
## [1,]    6   10
## [2,]    9   15
## [3,]    8    3

and

y * z
##      [,1] [,2]
## [1,]    5   21
## [2,]    0   54
## [3,]   15   -4

Notice, multiplication here is componentwise rather than conventional matrix multiplication. Indeed, conventional matrix multiplication is undefined for y and z as the dimensions fail to match. Let’s now define

x<-matrix(c(3,4,-2,6),nrow=2,byrow=T)
x
##      [,1] [,2]
## [1,]    3    4
## [2,]   -2    6

Matrix multiplication is expressed using the notation %*%:

y%*%x
##      [,1] [,2]
## [1,]   -3   22
## [2,]  -18   54
## [3,]   17   14
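
The dimension rule behind this can be checked directly. The sketch below rebuilds the three matrices so that it runs on its own, confirms the shape of the product, and recomputes its first entry by hand. The name ok is illustrative only.

```r
y <- matrix(c(1, 3, 0, 9, 5, -1), nrow = 3, byrow = TRUE)   # 3 x 2
z <- matrix(c(5, 7, 9, 6, 3, 4), nrow = 3, byrow = TRUE)    # 3 x 2
x <- matrix(c(3, 4, -2, 6), nrow = 2, byrow = TRUE)         # 2 x 2

dim(y %*% x)   # 3 2: a (3 x 2) matrix times a (2 x 2) matrix

# y %*% z would stop with a "non-conformable arguments" error,
# because y has 2 columns but z has 3 rows
ok <- ncol(y) == nrow(z)
ok             # FALSE

# First entry of y %*% x: row 1 of y times column 1 of x
sum(y[1, ] * x[, 1])   # 1*3 + 3*(-2) = -3
```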

Other useful functions on matrices are t to calculate a matrix transpose and solve to calculate inverses:

t(z)
##      [,1] [,2] [,3]
## [1,]    5    9    3
## [2,]    7    6    4

and

solve(x)
##            [,1]       [,2]
## [1,] 0.23076923 -0.1538462
## [2,] 0.07692308  0.1153846
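
One way to convince ourselves that solve has returned the inverse is to multiply it back. This sketch uses two things not covered above, so treat them as additions: diag(2), which builds a 2 x 2 identity matrix, and the two-argument form of solve, which solves a linear system directly. The names xinv, rhs and b are illustrative.

```r
x <- matrix(c(3, 4, -2, 6), nrow = 2, byrow = TRUE)
xinv <- solve(x)

# The product should be the 2 x 2 identity, up to rounding error
x %*% xinv
all.equal(x %*% xinv, diag(2))   # TRUE

# solve(x, rhs) finds b such that x %*% b equals rhs
rhs <- c(1, 2)
b <- solve(x, rhs)
x %*% b                          # reproduces rhs
```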

As with vectors it is useful to be able to extract sub-components of matrices. In this case, we may wish to pick out individual elements, rows or columns. As before, the [ ] notation is used to subscript. The following examples should make things clear:

z[1,1]
## [1] 5
z[c(2,3),2]
## [1] 6 4
z[,2]
## [1] 7 6 4
z[1:2,]
##      [,1] [,2]
## [1,]    5    7
## [2,]    9    6

So, in particular, it is necessary to specify which rows and columns are required, whilst omitting the integer for either dimension implies that every element in that dimension is selected.

Attaching to Objects

R includes a number of datasets that it is convenient to use for examples. You can get a description of what’s available by typing

data()

To access any of these datasets, you then type data(dataset) where dataset is the name of the dataset you wish to access. For example,

data(trees)
trees[1:5,]
##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8

gives us the first 5 rows of these data, and we can now see that the columns represent measurements of girth, height and volume of trees (actually cherry trees: see help(trees)) respectively. Now, if we want to work on the columns of these data, we can use the subscripting technique explained above: for example, trees[,2] gives all of the heights. This is a bit tedious however, and it would be easier if we could refer to the heights more explicitly. We can achieve this by attaching to the trees dataset:

attach(trees)

Effectively, this makes the contents of trees a directory, and if we type the name of an object, R will look inside this directory to find it. Since Height is the name of one of the columns of trees, R now recognises this object when we type the name. Hence, for example,

mean(Height)
## [1] 76

and

mean(trees[,2])
## [1] 76

are synonymous, while it is easier to remember exactly what calculation is being performed by the first of these expressions. In actual fact, trees is an object called a dataframe, essentially a matrix with named columns (though a dataframe, unlike a matrix, may also include non-numerical variables, such as character names). Because of this, there is another equivalent syntax to extract, for example, the vector of heights:

trees$Height
##  [1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74
## [24] 72 77 81 82 80 80 80 87

which can also be used without having first attached to the dataset.
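
The following sketch compares the different ways of extracting the heights. Selecting a column by its name in the [ ] notation, and the function with, are not described above and are included here as additions. The names h1, h2 and h3 are illustrative.

```r
data(trees)

h1 <- trees[, 2]          # by column position
h2 <- trees[, "Height"]   # by column name
h3 <- trees$Height        # with the $ syntax

identical(h1, h2)   # TRUE
identical(h1, h3)   # TRUE

# with() evaluates an expression inside the dataframe without attaching it
with(trees, mean(Height))   # 76
```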

The apply function

It is possible to write loops in R, but they are best avoided whenever possible. A common situation is where we want to apply the same function to every row or column of a matrix. For example, we may want to find the mean value of each variable in the trees dataset. Obviously, we could operate on each column separately but this can be tedious, especially if there are many columns. The function apply simplifies things. It is easiest understood by example:

apply(trees,2,mean)
##    Girth   Height   Volume 
## 13.24839 76.00000 30.17097

has the effect of calculating the mean of each column (dimension 2) of trees. We’d have used a 1 instead of a 2 if we wanted the mean of every row. Any function can be applied in this way, though if optional arguments to the function are required these need to be specified as well - see help(apply) for further details.
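
As a rough sketch of both points, the example below passes an optional argument through apply and also applies a function to rows. The choice of quantile with probs = 0.9 and of max is arbitrary, and row_max is an illustrative name.

```r
data(trees)

# Column means, as in the text
apply(trees, 2, mean)

# Arguments given after the function name are passed on to it:
# here probs = 0.9 goes to quantile, giving the 90% point of each column
apply(trees, 2, quantile, probs = 0.9)

# Dimension 1 applies the function to every row instead
row_max <- apply(trees, 1, max)
length(row_max)   # 31, one value per tree
```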

Statistical Computation and Simulation

Many of the tedious statistical computations that once had to be done using statistical tables can be easily carried out in R. This can be useful for finding confidence intervals etc. Let’s take as an example the Normal distribution. There are functions in R to evaluate the density function, the distribution function and the quantile function (the inverse distribution function). These functions are, respectively, dnorm, pnorm and qnorm. Unlike with tables, there is no need to standardize the variables first. For example, suppose X ∼ N(3, 2²), then

dnorm(x,3,2)
##            [,1]      [,2]
## [1,] 0.19947114 0.1760327
## [2,] 0.00876415 0.0647588

will calculate the density function at the points contained in x (here x is still the 2 × 2 matrix defined earlier, so the result is a matrix of the same shape). Note that dnorm will assume mean 0 and standard deviation 1 unless these are specified. Note also that the function assumes you will give the standard deviation rather than the variance. As an example

dnorm(5,3,2)
## [1] 0.1209854

evaluates the density of the N(3, 4) distribution at x = 5. As a further example

x<-seq(-5,10,by=.1)
dnorm(x,3,2)
##   [1] 6.691511e-05 8.162820e-05 9.932774e-05 1.205633e-04 1.459735e-04
##   [6] 1.762978e-04 2.123901e-04 2.552325e-04 3.059510e-04 3.658322e-04
##  [11] 4.363413e-04 5.191406e-04 6.161096e-04 7.293654e-04 8.612845e-04
##  [16] 1.014524e-03 1.192044e-03 1.397129e-03 1.633410e-03 1.904881e-03
##  [21] 2.215924e-03 2.571320e-03 2.976266e-03 3.436383e-03 3.957726e-03
##  [26] 4.546781e-03 5.210467e-03 5.956122e-03 6.791485e-03 7.724674e-03
##  [31] 8.764150e-03 9.918677e-03 1.119727e-02 1.260911e-02 1.416352e-02
##  [36] 1.586983e-02 1.773730e-02 1.977502e-02 2.199180e-02 2.439601e-02
##  [41] 2.699548e-02 2.979735e-02 3.280791e-02 3.603244e-02 3.947508e-02
##  [46] 4.313866e-02 4.702454e-02 5.113246e-02 5.546042e-02 6.000450e-02
##  [51] 6.475880e-02 6.971528e-02 7.486373e-02 8.019166e-02 8.568430e-02
##  [56] 9.132454e-02 9.709303e-02 1.029681e-01 1.089261e-01 1.149411e-01
##  [61] 1.209854e-01 1.270295e-01 1.330426e-01 1.389924e-01 1.448458e-01
##  [66] 1.505687e-01 1.561270e-01 1.614862e-01 1.666123e-01 1.714719e-01
##  [71] 1.760327e-01 1.802635e-01 1.841351e-01 1.876202e-01 1.906939e-01
##  [76] 1.933341e-01 1.955213e-01 1.972397e-01 1.984763e-01 1.992220e-01
##  [81] 1.994711e-01 1.992220e-01 1.984763e-01 1.972397e-01 1.955213e-01
##  [86] 1.933341e-01 1.906939e-01 1.876202e-01 1.841351e-01 1.802635e-01
##  [91] 1.760327e-01 1.714719e-01 1.666123e-01 1.614862e-01 1.561270e-01
##  [96] 1.505687e-01 1.448458e-01 1.389924e-01 1.330426e-01 1.270295e-01
## [101] 1.209854e-01 1.149411e-01 1.089261e-01 1.029681e-01 9.709303e-02
## [106] 9.132454e-02 8.568430e-02 8.019166e-02 7.486373e-02 6.971528e-02
## [111] 6.475880e-02 6.000450e-02 5.546042e-02 5.113246e-02 4.702454e-02
## [116] 4.313866e-02 3.947508e-02 3.603244e-02 3.280791e-02 2.979735e-02
## [121] 2.699548e-02 2.439601e-02 2.199180e-02 1.977502e-02 1.773730e-02
## [126] 1.586983e-02 1.416352e-02 1.260911e-02 1.119727e-02 9.918677e-03
## [131] 8.764150e-03 7.724674e-03 6.791485e-03 5.956122e-03 5.210467e-03
## [136] 4.546781e-03 3.957726e-03 3.436383e-03 2.976266e-03 2.571320e-03
## [141] 2.215924e-03 1.904881e-03 1.633410e-03 1.397129e-03 1.192044e-03
## [146] 1.014524e-03 8.612845e-04 7.293654e-04 6.161096e-04 5.191406e-04
## [151] 4.363413e-04

calculates the density function of the same distribution at intervals of 0.1 over the range [−5, 10]. The functions pnorm and qnorm work in an identical way - use help for further information. Similar functions exist for other distributions. For example, dt, pt and qt for the t-distribution, though in this case it is necessary to give the degrees of freedom rather than the mean and standard deviation. Other distributions available include the binomial, exponential, Poisson and gamma, though care is needed interpreting the functions for discrete variables. One further important technique for many statistical applications is the simulation of data from specified probability distributions. R enables simulation from a wide range of distributions, using a syntax similar to the above. For example, to simulate 100 observations from the N(3,4) distribution we write

rnorm(100,3,2)
##   [1]  1.4556778  2.8172140 -0.7927717  1.8572141  4.5743043  0.6892071
##   [7]  4.8574912  5.3392651  2.3615672  2.5743862  1.2659589  1.6636592
##  [13]  4.3161492 -0.5780857  5.7843344  6.0156729  3.1539871  3.8571167
##  [19]  2.1427815  2.3612626  0.3415976  4.3512889  6.7615013  2.1206978
##  [25]  0.1395043  5.3105648  4.3881046  3.3280917  0.6898651  3.3676593
##  [31]  3.4664959  0.3837337  0.4596727  2.0895432 -0.3721689 -1.1681387
##  [37]  2.3167461  4.4449296  2.4964591 -0.8316262  2.9409733  2.1529829
##  [43]  3.8912609  2.8657694  5.5246989  2.9296802  2.8168330  1.7414760
##  [49]  2.7287379  4.0716435  3.2120274 -0.8598837  2.8720610  5.8618954
##  [55]  5.4891618  8.3082841  3.3578537  5.3592111  2.8445927  5.5715025
##  [61]  6.6014939  1.5893613 -0.4813746  4.3605574  4.7740553  2.1016830
##  [67]  1.7153723  4.3951442  0.9446283  5.2541473  2.9420708  5.3222128
##  [73]  1.6121321 -0.5635857 -2.3700585  4.6886791  0.1474059 -1.7658857
##  [79]  1.8487768  2.1106010  1.5806922  3.3626394  0.3342500  4.0432898
##  [85]  4.0618066  3.5343681  7.5878989  2.6065030  2.6775738  3.2648661
##  [91]  3.7229376  4.3008955  4.7701610  1.8252887  0.6850856  2.2360553
##  [97]  4.2125726  4.5453445  3.5368731  7.3415686

Similarly, rt and rpois simulate from the t and Poisson distributions, and so on.
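
To round off, here is a short sketch of pnorm and qnorm for the same N(3, 4) distribution, followed by a repeatable simulation. The use of set.seed is an addition not discussed above: it fixes the starting point of the random number generator so that a run can be reproduced. The names p, s1 and s2 are illustrative.

```r
# Distribution function: P(X <= 5) for X ~ N(3, 2^2), i.e. P(Z <= 1)
p <- pnorm(5, 3, 2)
p                      # about 0.841

# The quantile function inverts the distribution function
qnorm(p, 3, 2)         # 5

# 97.5% point of the standard Normal, as used in 95% confidence intervals
qnorm(0.975)           # about 1.96

# Simulated values change from run to run unless the seed is fixed first
set.seed(1)
s1 <- rnorm(100, 3, 2)
set.seed(1)
s2 <- rnorm(100, 3, 2)
identical(s1, s2)      # TRUE
```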

Graphics

R has many facilities for producing high quality graphics. A useful facility before beginning is to divide a page into smaller pieces so that more than one figure can be displayed. For example:

par(mfrow=c(2,2))

creates a window of graphics with 2 rows and 2 columns. With this choice the windows are filled up row-wise. Use mfcol instead of mfrow to fill up column-wise. The function par is a general function for setting graphical parameters. There are many options: see help(par). So, for example

par(mfrow=c(2,2))
hist(Height)
boxplot(Height)
hist(Volume)
boxplot(Volume)

par(mfrow=c(1,1))

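produces a figure with four panels: a histogram and a boxplot of Height in the top row, and a histogram and a boxplot of Volume in the bottom row. Note the final use of par to return the graphics window to standard size. We can also plot one variable against another using the function plot: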

plot(Height,Volume)
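
A slightly more defensive variant of the above is sketched below. It assumes two things not covered in the text: that par returns the previous settings when it changes them, so they can be saved and restored, and that plot and hist accept the labelling arguments main, xlab and ylab. The columns are referred to with the $ syntax so that the sketch does not depend on trees being attached. The name old is illustrative.

```r
data(trees)

old <- par(mfrow = c(1, 2))   # par returns the settings it replaces

hist(trees$Height, main = "Height", xlab = "Height")
plot(trees$Height, trees$Volume,
     xlab = "Height", ylab = "Volume")

par(old)                      # restore the previous layout
```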