2 Simple manipulations; numbers and vectors
2.1 Vectors and assignment
R operates on named data structures. The simplest such structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers. For example:
x = c(10.4, 5.6, 3.1,6.4,21.7)
x
[1] 10.4 5.6 3.1 6.4 21.7
This is an assignment statement using the function c() which in this context can take an arbitrary number of vector arguments and whose value is a vector got by concatenating its arguments end to end.
A number occurring by itself in an expression is taken as a vector of length one.
Assignment can also be made using the function assign.
For example:
assign("x",c(10.4,5.6,3.1,6.4,21.7))
x
[1] 10.4 5.6 3.1 6.4 21.7
Assignments can also be made in the other direction. For example:
c(10.4,5.6,3.1,6.4,21.7)->x
x
[1] 10.4 5.6 3.1 6.4 21.7
If an expression is used as a complete command, the value is printed and lost. So now if we were to use the command:
1/x
[1] 0.09615385 0.17857143 0.32258065 0.15625000 0.04608295
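As shown above, the reciprocals of the five values of x are printed at the terminal (and the value of x is, of course, unchanged).
The further assignment below creates a vector y with 11 entries, consisting of two copies of x with a zero in the middle place: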
y=c(x,0,x)
y
[1] 10.4 5.6 3.1 6.4 21.7 0.0 10.4 5.6 3.1 6.4 21.7
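As a small sketch of the points in this section, the lines below check that c() flattens its arguments into a single vector and that a lone number behaves as a vector of length one. The names a and b are illustrative and do not appear in the text above.

```r
# Illustrative vectors (hypothetical names, not from the text)
a = c(1, 2, 3)
b = c(a, 0, a)        # c() concatenates end to end, so the result is flat

length(b)             # 7: three values, one zero, three values
length(5)             # 1: a single number is a vector of length one
identical(c(5), 5)    # TRUE: wrapping one number in c() changes nothing
```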
2.2 Vector arithmetic
Vectors can be used in arithmetic expressions, in which case the operations are performed element by element. Vectors occurring in the same expression need not all be of the same length. If they are not, the value of the expression is a vector with the same length as the longest vector which occurs in the expression. Shorter vectors in the expression are recycled as often as need be (perhaps fractionally) until they match the length of the longest vector. In particular a constant is simply repeated. So with the above assignments, the command below generates a new vector v of length 11, constructed by adding together, element by element, 2*x repeated 2.2 times, y repeated just once, and 1 repeated 11 times. Because the length of y (11) is not a multiple of the length of x (5), R also issues a warning:
v=2*x+y+1
longer object length is not a multiple of shorter object length
v
[1] 32.2 17.8 10.3 20.2 66.1 21.8 22.6 12.8 16.9 50.8 43.5
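The recycling rule can be made explicit by stretching the shorter vector by hand. The sketch below assumes the same x and y as above; the names manual, v_check and even are illustrative only.

```r
x = c(10.4, 5.6, 3.1, 6.4, 21.7)
y = c(x, 0, x)

# Recycle 2*x by hand to the length of y, then add element by element
manual = rep(2 * x, length.out = length(y)) + y + 1

# Same arithmetic with implicit recycling; suppressWarnings() hides the
# warning about lengths that are not multiples of each other
v_check = suppressWarnings(2 * x + y + 1)

all.equal(manual, v_check)   # TRUE

# When the longer length is a multiple of the shorter, no warning is given
even = c(1, 2, 3, 4) + c(10, 20)   # 11 22 13 24
```

Writing the recycling out with rep() is one way to make the intent clear when the lengths do not divide evenly.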
2.3 Generating regular sequences
R has a number of facilities for generating commonly used sequences of numbers. For example, 1:30 is the vector c(1, 2, …, 29, 30):
c(1:30)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
[29] 29 30
2*1:15
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
n=10
n
[1] 10
1:n
[1] 1 2 3 4 5 6 7 8 9 10
1:n-1
[1] 0 1 2 3 4 5 6 7 8 9
1:(n-1)
[1] 1 2 3 4 5 6 7 8 9
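The examples above differ because the colon operator binds more tightly than the arithmetic operators. A minimal sketch of the three cases, using the illustrative names a, b and d:

```r
n = 10

a = 1:n - 1      # parsed as (1:n) - 1, giving 0, 1, ..., 9
b = 1:(n - 1)    # parentheses force 1, 2, ..., 9
d = 2 * 1:5      # parsed as 2 * (1:5), giving 2, 4, 6, 8, 10

length(a)        # 10
length(b)        # 9
```

When in doubt, adding parentheses around the intended end point is the safer habit.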
The construction 30:1 may be used to generate a sequence backwards.
30:1
[1] 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3
[29] 2 1
The function seq() is a more general facility for generating sequences. It has five arguments, only some of which may be specified in any one call. The first two arguments, if given, specify the beginning and end of the sequence, and if these are the only two arguments given, the result is the same as that of the colon operator. That is, seq(2,10) is the same vector as 2:10.
seq(2,10)
[1] 2 3 4 5 6 7 8 9 10
2:10
[1] 2 3 4 5 6 7 8 9 10
Arguments to seq(), and to many other R functions, can also be given in named form, in which case the order in which they appear is irrelevant. The first two arguments may be named from=value and to=value; thus seq(1, 30), seq(from=1, to=30) and seq(to=30, from=1) are all the same as 1:30. The next two arguments to seq() may be named by=value and length=value (an abbreviation of the full argument name length.out), which specify a step size and a length for the sequence respectively. If neither of these is given, the default by=1 is assumed. For example, the following two calls generate identical vectors from different specifications.
s3 = seq(-5,5,by=.2)
s3
[1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2 -3.0 -2.8 -2.6 -2.4 -2.2 -2.0
[17] -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2
[33] 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4
[49] 4.6 4.8 5.0
s4 = seq(length=51,from=-5,by=.2)
s4
[1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2 -3.0 -2.8 -2.6 -2.4 -2.2 -2.0
[17] -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2
[33] 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4
[49] 4.6 4.8 5.0
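A third way to describe the same sequence is to give both end points and the desired length, leaving the step size to be computed. The sketch below uses the full argument name length.out; the names s_by and s_len are illustrative. The comparison uses all.equal() rather than identical() because the values are floating-point numbers.

```r
s_by  = seq(-5, 5, by = 0.2)           # end points plus step size
s_len = seq(-5, 5, length.out = 51)    # end points plus length

length(s_by)               # 51
all.equal(s_by, s_len)     # TRUE
```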
The fifth argument may be named along=vector, which is normally used as the only argument to create the sequence 1, 2, …, length(vector), or the empty sequence if the vector is empty (as it can be).
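The empty-vector case is where this form matters most. In the sketch below, along is an abbreviation of the full argument name along.with, and seq_along() is the base R shortcut for the same operation; the names v and empty are illustrative.

```r
v     = c(10.4, 5.6, 3.1)
empty = numeric(0)

idx_long  = seq(along = v)      # 1 2 3
idx_short = seq_along(v)        # same result

none  = seq_along(empty)        # integer(0), so a loop over it runs zero times
risky = 1:length(empty)         # 1 0, because 1:0 counts downwards
```

A loop written over 1:length(empty) would therefore run twice on an empty vector, which is rarely what is intended.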
A related function is rep() which can be used for replicating an object in various complicated ways. The simplest form is seen below. This places five copies of x end-to-end in s5.
s5 = rep(x,times=5)
s5
[1] 10.4 5.6 3.1 6.4 21.7 10.4 5.6 3.1 6.4 21.7 10.4 5.6 3.1 6.4 21.7 10.4
[17] 5.6 3.1 6.4 21.7 10.4 5.6 3.1 6.4 21.7
Another useful version, seen below, repeats each element of the vector x five times before moving on to the next.
s6 = rep(x, each=5)
s6
[1] 10.4 10.4 10.4 10.4 10.4 5.6 5.6 5.6 5.6 5.6 3.1 3.1 3.1 3.1 3.1 6.4
[17] 6.4 6.4 6.4 6.4 21.7 21.7 21.7 21.7 21.7
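A few further forms of rep() are sketched here on a short integer vector so the pattern is easy to see. The names r1, r2 and r3 are illustrative.

```r
r1 = rep(1:3, times = c(3, 2, 1))    # 1 1 1 2 2 3: a separate count per element
r2 = rep(1:2, each = 2, times = 2)   # 1 1 2 2 1 1 2 2: each is applied first
r3 = rep(1:3, length.out = 7)        # 1 2 3 1 2 3 1: cut off at a fixed length
```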
2.4 Logical vectors
As well as numerical vectors, R allows manipulation of logical quantities. The elements of a logical vector can have the values TRUE, FALSE, and NA (for “not available”). The first two are often abbreviated as T and F, respectively. Note however that T and F are just variables which are set to TRUE and FALSE by default, but are not reserved words and hence can be overwritten by the user. Hence, you should always use TRUE and FALSE.
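The risk with the abbreviations can be seen in a short sketch. The names flag_short and flag_full are illustrative; the assignment to T is deliberately bad practice and is undone at the end.

```r
T = 0                  # legal, because T is an ordinary variable
flag_short = T         # now 0, not TRUE
flag_full  = TRUE      # TRUE is a reserved word and cannot be reassigned

isTRUE(flag_short)     # FALSE
isTRUE(flag_full)      # TRUE

rm(T)                  # remove the masking variable; T is TRUE again
```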
Logical vectors are generated by conditions. For example
temp = x>13
temp
[1] FALSE FALSE FALSE FALSE TRUE
sets temp as a vector of the same length as x with values FALSE corresponding to elements of x where the condition is not met and TRUE where it is.
The logical operators are <, <=, >, >=, == for exact equality and != for inequality. In addition, if c1 and c2 are logical expressions, then c1 & c2 is their intersection ("and"), c1 | c2 is their union ("or"), and !c1 is the negation of c1.
Logical vectors may be used in ordinary arithmetic, in which case they are coerced into numeric vectors, FALSE becoming 0 and TRUE becoming 1. However, there are situations where logical vectors and their coerced numeric counterparts are not equivalent; for an example, see the next subsection.
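The operators and the coercion rule can be sketched together on the vector x used throughout. The names big, small, both, either and not_big are illustrative.

```r
x = c(10.4, 5.6, 3.1, 6.4, 21.7)

big   = x > 6           # TRUE FALSE FALSE TRUE TRUE
small = x < 11          # TRUE TRUE TRUE TRUE FALSE

both    = big & small   # TRUE FALSE FALSE TRUE FALSE
either  = big | small   # TRUE for every element
not_big = !big          # FALSE TRUE TRUE FALSE FALSE

sum(big)                # 3: TRUE counts as 1, so this counts the matches
mean(big)               # 0.6: the proportion of elements greater than 6
```

Summing or averaging a condition in this way is a common idiom for counting how many elements satisfy it.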
2.5 Missing values
In some cases the components of a vector may not be completely known. When an element or value is “not available” or a “missing value” in the statistical sense, a place within a vector may be reserved for it by assigning it the special value NA. In general any operation on an NA becomes an NA. The motivation for this rule is simply that if the specification of an operation is incomplete, the result cannot be known and hence is not available.
The function is.na(x) gives a logical vector of the same size as x with value TRUE if and only if the corresponding element in x is NA.
z = c(1:3,NA)
z
[1] 1 2 3 NA
ind = is.na(z)
ind
[1] FALSE FALSE FALSE TRUE
There is a second kind of "missing" value, produced by numerical computation: the so-called Not a Number, NaN, values. Examples are:
0/0
[1] NaN
or
Inf/Inf
[1] NaN
In summary, is.na(xx) is TRUE both for NA and NaN values. To differentiate these, is.nan(xx) is only TRUE for NaNs.
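The difference between the two tests, and the reason is.na() is needed at all, can be sketched as follows. The vector v and the result names are illustrative. Note that comparing with NA using == does not test for missingness, since any operation on an NA yields NA.

```r
v = c(1, NA, 0/0)

na_flags  = is.na(v)     # FALSE TRUE TRUE: both NA and NaN count as missing
nan_flags = is.nan(v)    # FALSE FALSE TRUE: only the NaN
compared  = v == NA      # NA NA NA: not a usable test

total = sum(v, na.rm = TRUE)   # 1: na.rm = TRUE drops both NA and NaN
```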
2.6 Character vectors
Character quantities and character vectors are used frequently in R, for example as plot labels. Where needed they are denoted by a sequence of characters delimited by the double quote character, e.g., "x-values", "New iteration results".
Character strings are entered using either matching double (") or single (') quotes, but are printed using double quotes (or sometimes without quotes). They use C-style escape sequences, using \ as the escape character, so \ is entered and printed as \\, and inside double quotes " is entered as \".
Other useful escape sequences are \n, newline, \t, tab and \b, backspace; see ?Quotes for a full list.
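A short sketch of how escapes are typed versus what the string actually contains. The names s1, s2 and s3 are illustrative; nchar() counts the characters stored in the string, and cat() writes the string with its escapes interpreted.

```r
s1 = "a\\b"            # typed with two backslashes, stored with one
s2 = "say \"hi\""      # double quotes escaped inside double quotes
s3 = 'say "hi"'        # the same string, delimited by single quotes

nchar(s1)              # 3
identical(s2, s3)      # TRUE

cat("col1\tcol2\n")    # writes col1, a tab, col2 and a newline
```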
Character vectors may be concatenated into a vector by the c() function; read on for examples.
The paste() function takes an arbitrary number of arguments and concatenates them one by one into character strings. Any numbers given among the arguments are coerced into character strings in the evident way, that is, in the same way they would be if they were printed. The arguments are by default separated in the result by a single blank character, but this can be changed by the named argument, sep=string, which changes it to string, possibly empty.
For example
labs = paste(c("X","Y"),1:10,sep="")
labs
[1] "X1" "Y2" "X3" "Y4" "X5" "Y6" "X7" "Y8" "X9" "Y10"
c("X1","Y2","X3","Y4","X5","Y6","X7","Y8","X9","Y10")
[1] "X1" "Y2" "X3" "Y4" "X5" "Y6" "X7" "Y8" "X9" "Y10"
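In the example above the shorter argument c("X","Y") is recycled to match 1:10. A few more calls are sketched below; the result names are illustrative, and the collapse argument, which joins the resulting strings into one, is part of paste() but is not described in the text above.

```r
p1 = paste("A", 1:3, sep = "-")                # "A-1" "A-2" "A-3"
p2 = paste(c("X", "Y"), 1:4, sep = "")         # "X1" "Y2" "X3" "Y4"
p3 = paste("mean", 2.5)                        # "mean 2.5": default separator is a blank
p4 = paste(c("a", "b", "c"), collapse = "+")   # "a+b+c": one single string
```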
2.7 Index vectors; selecting and modifying subsets of a data set
Subsets of the elements of a vector may be selected by appending to the name of the vector an index vector in square brackets. More generally any expression that evaluates to a vector may have subsets of its elements similarly selected by appending an index vector in square brackets immediately after the expression. Such index vectors can be any of four distinct types:
- A logical vector. In this case the index vector is recycled to the same length as the vector from which elements are to be selected. Values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted. For example
y = x[!is.na(x)]
y
[1] 10.4 5.6 3.1 6.4 21.7
creates (or re-creates) an object y which will contain the non-missing values of x, in the same order. Note that if x has missing values, y will be shorter than x. Also,
z = (x+1)[(!is.na(x)) & x>0]
z
[1] 11.4 6.6 4.1 7.4 22.7
creates an object z and places in it the values of the vector x + 1 for which the corresponding value in x was both non-missing and positive.
- A vector of positive integral quantities. In this case the values in the index vector should lie in the set {1, 2, …, length(x)}. The corresponding elements of the vector are selected and concatenated, in that order, in the result. The index vector can be of any length and the result is of the same length as the index vector. For example, x[6] is the sixth component of x and x[1:10] selects the first 10 elements of x (assuming length(x) is not less than 10). Here x has only five elements, so the positions beyond the fifth are returned as NA:
x[6]
[1] NA
x[1:10]
[1] 10.4 5.6 3.1 6.4 21.7 NA NA NA NA NA
Also
c("x","y")[rep(c(1,2,2,1),times=4)]
[1] "x" "y" "y" "x" "x" "y" "y" "x" "x" "y" "y" "x" "x" "y" "y" "x"
(an admittedly unlikely thing to do) produces a character vector of length 16 consisting of "x", "y", "y", "x" repeated four times.
- A vector of negative integral quantities. Such an index vector specifies the values to be excluded rather than included. Thus
y = x[-(1:5)]
gives y all but the first five elements of x.
y
numeric(0)
Since x consists of only five elements, the result is an empty numeric vector.
- A vector of character strings. This possibility only applies where an object has a names attribute to identify its components. In this case a sub-vector of the names vector may be used in the same way as the positive integral indices in the second item above.
fruit = c(5,10,1,20)
names(fruit) = c("orange","banana","apple","peach")
lunch = fruit[c("apple","orange")]
The advantage is that alphanumeric names are often easier to remember than numeric indices. This option is particularly useful in connection with data frames, as we shall see later.
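The four kinds of index vector can be compared side by side on one small named vector. This is a sketch only; the vector w, its names p to s, and the result names are illustrative.

```r
w = c(p = 3, q = NA, r = -1, s = 7)

by_logical  = w[!is.na(w) & w > 0]   # elements p and s (values 3 and 7)
by_position = w[c(1, 4, 4)]          # p, s, s: an index may be repeated
by_negative = w[-(1:2)]              # everything except the first two: r, s
by_name     = w[c("s", "p")]         # selected in the order requested: s, p

names(by_name)                       # "s" "p"
```

Note that the logical index guards against the missing value with !is.na(w); without that guard the NA would be carried into the result.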
An indexed expression can also appear on the receiving end of an assignment, in which case the assignment operation is performed only on those elements of the vector. The expression must be of the form vector[index_vector] as having an arbitrary expression in place of the vector name does not make much sense here.
For example
x[is.na(x)] = 0
x
[1] 10.4 5.6 3.1 6.4 21.7
replaces any missing values in x by zeros and
y[y<0] = -y[y<0]
y
[1] 10.4 5.6 3.1 6.4 21.7
has the same effect as
y = abs(y)
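Since x and y in this session contain no missing or negative values, the two replacements leave them unchanged. A sketch on a vector where both replacements do something, using the illustrative names u and w:

```r
u = c(4, NA, -2, 7)

u[is.na(u)] = 0           # only the missing element is replaced: 4 0 -2 7

w = u
w[w < 0] = -w[w < 0]      # only the negative element is flipped: 4 0 2 7

identical(w, abs(u))      # TRUE: the same result as taking absolute values
```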
2.8 Other types of objects
Vectors are the most important type of object in R, but there are several others which we will meet more formally in later sections.
- matrices or more generally arrays are multi-dimensional generalizations of vectors. In fact, they are vectors that can be indexed by two or more indices and will be printed in special ways.
- factors provide compact ways to handle categorical data.
- lists are a general form of vector in which the various elements need not be of the same type, and are often themselves vectors or lists. Lists provide a convenient way to return the results of a statistical computation.
- data frames are matrix-like structures, in which the columns can be of different types. Think of data frames as ‘data matrices’ with one row per observational unit but with (possibly) both numerical and categorical variables. Many experiments are best described by data frames: the treatments are categorical but the response is numeric.
- functions are themselves objects in R which can be stored in the project’s workspace. This provides a simple and convenient way to extend R.
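As a preview, one small instance of each of these object types is sketched below. All names and values are illustrative, and the later sections give the proper treatment.

```r
m = matrix(1:6, nrow = 2)                # 2 x 3 matrix, filled column by column
f = factor(c("low", "high", "low"))      # categorical data with two levels
l = list(id = 1, tags = c("a", "b"))     # elements of different types
d = data.frame(treatment = c("A", "B"),  # one row per observational unit
               response  = c(1.2, 3.4))
square = function(z) z^2                 # a function stored like any other object

dim(m)         # 2 3
m[2, 3]        # 6: indexed by row and column
levels(f)      # "high" "low"
length(l)      # 2
nrow(d)        # 2
square(4)      # 16
```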
