An object does not have to contain just one piece of data. A vector is an object that contains multiple pieces of data – all of the same type: numerical, character, or logical – arranged in a specific sequence.
You can create a vector using the c() function.
x1 <- c(9, 7, 3, -2.5, 1) # numeric; 9 7 3 -2.5 1
assign("x1", c(9, 7, 3, -2.5, 1)) # same as previous command
x2 <- c("Bush 41", "Bush 43") # character; "Bush 41" "Bush 43"
x3 <- c(T, F, T, T, F) # logical; TRUE FALSE TRUE TRUE FALSE
The vectors c(4, 24, -3) and c(-3, 4, 24) are not the same; the sequence in which the elements are arranged is important. Each element has an index that indicates its position in the vector’s sequence.
A vector object can be used to indirectly assign value to another vector object, as long as all data are of the same type:
y <- c(x1, 1, 2, c(4, 5, 6)) # 9 7 3 -2.5 1 1 2 4 5 6
The above example shows how to create a vector that includes an existing vector. A small modification shows how to add data to an existing vector:
x1 <- c(x1, 1, 2, c(4, 5, 6))
Note that an existing object can be used to assign value to a new version of itself.
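On the "same type" condition mentioned above: c() does not reject mixed inputs; it converts every element to the most general type present. Here is a minimal sketch (the names mixed1 and mixed2 are made up for illustration):

```r
mixed1 <- c(1, TRUE, FALSE) # logical values are converted to numbers
mixed2 <- c(1, "a", TRUE)   # everything is converted to character

class(mixed1) # "numeric"
mixed1        # 1 1 0
class(mixed2) # "character"
mixed2        # "1" "a" "TRUE"
```

So a vector built from mixed pieces still ends up with a single type, though perhaps not the type you intended.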
More examples of vector creation:
x1.1 <- 2:8 # A numerical vector containing the integers 2 through 8
x1.2 <- numeric(length = 12) # Vector with 12 zeros
x1.3 <- rep(x = 0, times = 12) # Same as previous
x1.4 <- rep(c(1, -1, 8), 12) # A dozen repetitions of the vector 1, -1, 8
x1.5 <- seq(from = 3, to = 43, by = 10) # 3 13 23 33 43
x1.6 <- seq(from = 3, by = 10, length.out = 5) # Same as previous
x1.7 <- seq(from = 3, by = 10, along.with = x1.5) # Same as previous
x2.1 <- letters[1:5] # character vector of first five lower case letters
x2.2 <- LETTERS[1:5] # character vector of first five upper case letters
x2.3 <- month.abb[1:4] # character vector of abbreviated names of first 4 months
x2.4 <- month.name[1:4] # character vector of names of first 4 months
x3.1 <- x1 > 6 # logical vector; TRUE for the elements of x1 greater than 6, FALSE otherwise
x3.2 <- x2.1 > "c" # logical vector; TRUE for the letters of x2.1 that come after the letter "c"
x3.3 <- which(x3.1) # Numerical vector of the index numbers for which x3.1 is TRUE.
## So, the `which()` function works only on vectors of the logical type.
x1.8 <- which(x = LETTERS > "U") # Numerical vector of the integers 22 through 26
Note that letters, LETTERS, month.abb and month.name are four of R’s built-in vectors. These are also called R’s constant vectors.
The 4th element of the vector x is printed out by the command x[4].
More generally, the 3rd, 4th, and 7th elements are printed out by the command x[c(3, 4, 7)]. This vector has the 3rd element of x as its first element, the 4th element of x as its second element, etc. On the other hand, x[c(4, 3, 7)] has the 4th element of x as its first element, the 3rd element of x as its second element, etc. In other words, x[c(3, 4, 7)] and x[c(4, 3, 7)] are not the same.
“Subsetting” can even mean making a bigger vector by using certain elements of a vector multiple times. Suppose x has only 3 elements. Then x[c(1, 1, 3, 3, 3, 2)] is bigger than x because the 1st and 3rd elements of x are inserted more than once in the new “subsetted” vector.
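A small sketch of this, using an illustrative three-element vector of my own choosing:

```r
x <- c(10, 20, 30)
bigger <- x[c(1, 1, 3, 3, 3, 2)] # reuse elements 1 and 3 several times

bigger         # 10 10 30 30 30 20
length(bigger) # 6, twice the length of x
```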
Assuming the vector x has 9 elements, its 3rd, 4th, and 7th elements are also printed out by the command x[c(F, F, T, T, F, F, T, F, F)]. Note that which(c(F, F, T, T, F, F, T, F, F)) is the vector c(3, 4, 7). So, x[which(c(F, F, T, T, F, F, T, F, F))] would also work. (But it just means more typing!)
This gives a way to extract just those elements of the vector x that satisfy a condition. For example, x[which(LETTERS > "U")] is equivalent to x[22:26] and will extract elements 22 through 26 of x.
If there is a vector y <- c(3, 4, 7) or a vector z <- c(F, F, T, T, F, F, T, F, F), then x[y] and x[z] and x[which(z)] would deliver the same results. (But, again, the which way involves unnecessary typing!)
Continuing with the previous paragraph, if you want the remaining elements of x, you could use x[-y] or x[!z] or x[which(!z)] or x[-which(z)].
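The last few paragraphs can be tried out on a concrete nine-element vector. The values below are arbitrary and only serve as an illustration:

```r
x <- seq(from = 10, to = 90, by = 10) # 10 20 30 40 50 60 70 80 90
y <- c(3, 4, 7)                       # positions to keep
z <- c(F, F, T, T, F, F, T, F, F)     # the same selection as a logical vector

kept_pos   <- x[y]        # 30 40 70
kept_logic <- x[z]        # 30 40 70
kept_which <- x[which(z)] # 30 40 70

rest_neg   <- x[-y]       # 10 20 50 60 80 90
rest_not   <- x[!z]       # 10 20 50 60 80 90
```

All three "keep" forms agree, and so do the two "remaining elements" forms.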
When the vector of indexes in x[...] contains a non-integer number, the number is truncated toward zero; for a positive number, that is the highest integer not greater than it. So, x[c(1, 3)] and x[c(1.9, 3)] give the same result.
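A quick check of how fractional indexes behave, on an illustrative vector. The negative case is my own addition: as I understand R's indexing rules, fractional indexes are truncated toward zero, so -1.9 acts like -1.

```r
x <- c(10, 20, 30, 40)

frac_pos <- x[c(1.9, 3)] # 10 30, same as x[c(1, 3)]
frac_one <- x[2.999]     # 20, same as x[2]
frac_neg <- x[-1.9]      # 20 30 40, same as x[-1]
```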
The command x[-4] prints out the vector x without its 4th element. So, a way to delete the 4th element from the vector x is this:
x <- x[-4]
More generally, x <- x[-c(1,3)] deletes the first and third elements of the vector x. An equivalent command is x <- x[c(-1,-3)].
If the vector x has five elements, x[7] doesn’t exist. So the command x[7] will return an NA value.
Also, as x[c(1, 3)], being a sub-vector of x, is itself a vector, it too can serve as a vector of indexes to subset x (provided its values make sense as indexes)! In other words, x[x[c(1, 3)]] is perfectly conceivable.
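Both points can be seen in a short sketch. The vector below is an arbitrary five-element example whose values happen to be valid indexes:

```r
x <- c(2, 5, 1, 4, 3)

missing_elem <- x[7]          # NA, because x has only five elements
idx          <- x[c(1, 3)]    # 2 1
nested       <- x[x[c(1, 3)]] # 5 2, that is, x[2] followed by x[1]
```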
While the elements of a vector x may be indexed by their sequential positions, they may also be assigned their own names. Consider the vector x1.1 <- 2:8 that we saw above. We could assign each of its seven elements their own names:
days <- 2:8
daynames <- c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat")
names(days) <- daynames
days[c("Mon", "Tues")] # Yields the vector 3 4
## Mon Tues
## 3 4
As the final command in the code chunk above shows, the names of a vector’s elements could be used to subset the vector.
So, the elements of a vector can be extracted (1) by each element’s numerical position, (2) by each element’s assigned name, and (3) by a logical (TRUE/FALSE) vector that flags which elements to keep.
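Here is a sketch that pulls out the same element in all three ways. It rebuilds the days vector using setNames(), a base R shortcut for creating a vector and naming it in one step; the object names by_position, by_name and by_logical are mine.

```r
days <- setNames(2:8, c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"))

by_position <- days[2]         # (1) numerical position
by_name     <- days["Mon"]     # (2) assigned name
by_logical  <- days[days == 3] # (3) logical vector, TRUE only where the value is 3

by_position # Mon
            #   3
```

All three results are the same one-element named vector.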
More on naming a vector’s elements:
namedvec <- c(x = 1, y = 2, z = 4)
namedvec2 <- namedvec^2 # Squares each element. The new vector inherits the names.
(namedvec3 <- namedvec^2 + 4) # Adds 4 to each element. The new vector inherits the names.
## x y z
## 5 8 20
Using purrr::set_names(), the naming can be done long after a vector is created:
y <- 1:3
library(purrr)
(y <- set_names(y, c("a", "b", "c")))
## a b c
## 1 2 3
Besides names, a vector can carry other attributes, which are set with attr():
(y <- 1:3)
## [1] 1 2 3
attr(y, "greeting") <- "Hi!"
attr(y, "farewell") <- "Bye!"
y
## [1] 1 2 3
## attr(,"greeting")
## [1] "Hi!"
## attr(,"farewell")
## [1] "Bye!"
Why might this be useful? Dunno!
I have already discussed two vector operations: which() and names(). There are many others.
The “plus” operation x1 + x4 adds the i-th elements of x1 and x4, when they are of the same length.
x1 <- c(9, 7, 3, 5, 1)
x4 <- c(3, 2, 3, 7, 0)
x1 + x4
## [1] 12 9 6 12 1
The length of the vector x is given by length(x).
When two vectors are not of the same length, addition uses something called fractional recycling of vectors. Consider
x1 + x4[c(1,2)]
## Warning in x1 + x4[c(1, 2)]: longer object length is not a multiple of shorter
## object length
## [1] 12 9 6 7 4
As x4[c(1, 2)], which is the vector c(3, 2), has 2 elements and x1 has 5, x4[c(1,2)] is repeated till a 5-element vector is obtained. Then, the usual element-by-element addition is done (though with a Warning message). That is, elements are added at the end of the shorter vector x till it becomes as long as the longer vector: x[length(x) + i] <- x[i].
The + in x1 + x4[c(1,2)] could be replaced by - (subtraction), * (multiplication), and / (division), all done element by element with fractional recycling applied where necessary, assuming the operation is valid. (Don’t divide by zero!)
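To make the recycling step visible, the sketch below stretches the shorter vector by hand with rep_len() and checks that the result matches what R does on its own. The names short, stretched, by_hand, recycled and even are made up for illustration.

```r
x1    <- c(9, 7, 3, 5, 1)
short <- c(3, 2)

stretched <- rep_len(short, length.out = length(x1)) # 3 2 3 2 3
by_hand   <- x1 + stretched                          # 12 9 6 7 4
recycled  <- suppressWarnings(x1 + short)            # same values, warning silenced

# When the longer length is a multiple of the shorter one, there is no warning
even <- c(1, 2, 3, 4, 5, 6) + c(10, 20)              # 11 22 13 24 15 26
```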
Here’s the inner product of the vectors x1 and x4:
x1 %*% x4
## [,1]
## [1,] 85
Although the result is a scalar, R expresses it as a 1-by-1 matrix:
class(x1 %*% x4)
## [1] "matrix" "array"
If you must obtain the inner product as a scalar, try:
as.numeric(x1 %*% x4)
## [1] 85
By the way, x1 %*% x4[c(1,2)] is not allowed! No fractional recycling in this case.
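Two other ways to get the inner product as a plain number, plus a check that the mismatched-length product really fails. This is a sketch; the names ip_sum, ip_drop and failed are mine.

```r
x1 <- c(9, 7, 3, 5, 1)
x4 <- c(3, 2, 3, 7, 0)

ip_sum  <- sum(x1 * x4)     # 85: multiply element by element, then add up
ip_drop <- drop(x1 %*% x4)  # 85: drop() strips the 1-by-1 matrix dimensions

# try() catches the error instead of stopping the script
failed <- inherits(try(x1 %*% x4[c(1, 2)], silent = TRUE), "try-error") # TRUE
```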
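The == operation compares two vectors element by element. The output is a logical vector with TRUE where the corresponding elements are equal and FALSE where they are not. To ask whether two objects are the same as a whole, and get a single TRUE or FALSE, use identical(x1, x4) or all(x1 == x4).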
x1 == x4
## [1] FALSE FALSE TRUE FALSE FALSE
length(x1) == length(x4) # TRUE; they both have 5 elements
## [1] TRUE
The != operation is the opposite of the == operation.
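A short sketch contrasting the element-by-element comparisons with whole-object comparisons, reusing the values of x1 and x4 from above (the result names are mine):

```r
x1 <- c(9, 7, 3, 5, 1)
x4 <- c(3, 2, 3, 7, 0)

eq  <- x1 == x4 # FALSE FALSE TRUE FALSE FALSE
neq <- x1 != x4 # TRUE TRUE FALSE TRUE TRUE

all_equal  <- all(x1 == x4)                     # FALSE: a single answer
same_whole <- identical(x1, x4)                 # FALSE
same_copy  <- identical(x1, c(9, 7, 3, 5, 1))   # TRUE
```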
Here are a few more vector functions with logical output:
x <- 1:4 # 1 2 3 4
y <- 3:6 # 3 4 5 6
x > 2 # FALSE FALSE TRUE TRUE
y < 4 # TRUE FALSE FALSE FALSE
(x > 2) | (y < 4) # TRUE FALSE TRUE TRUE
(x > 2) & (y < 4) # FALSE FALSE FALSE FALSE
any(x > 2) # TRUE, because `x > 2` has some TRUE elements
all(x > 2) # FALSE, because not all elements of `x > 2` are TRUE
!(x > 2) # Complement to `x > 2`: TRUE TRUE FALSE FALSE
Fractional recycling is used when x and y have different lengths. So, x >= 2 is equivalent to x >= c(2, 2, 2, 2). It returns a vector showing TRUE where an element of x is no less than 2 and FALSE where an element is less than 2.
The match(x, y) command gives a vector whose i-th element is the index of the element of y that is the first to match the i-th element of x.
x <- 1:4
y <- 3:6
match(x, y)
## [1] NA NA 1 2
In related news, the x %in% y command gives a logical vector whose i-th element indicates whether the i-th element of x has a matching element in y.
x %in% y
## [1] FALSE FALSE TRUE TRUE
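A few more checks on match() and %in%, as a sketch. The second example, with repeated values, is my own and is meant to show the "first to match" rule.

```r
x <- 1:4
y <- 3:6

common   <- x[x %in% y]                              # 3 4: elements of x found in y
same_idea <- identical(x %in% y, !is.na(match(x, y))) # TRUE

# With duplicates in the second vector, the first matching position is reported
first_hit <- match(c(3, 3, 7), c(7, 3, 3))           # 2 2 1
```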
Find the maximum value among the elements of a vector:
max(x)
## [1] 4
max(y)
## [1] 6
max(x, y) # largest element of the vector c(x, y)
## [1] 6
The min() function works analogously.
The function pmin(x, y, z) uses R’s fractional recycling, if necessary, to extend all vectors to the length of the longest of the vectors x, y, and z. Then pmin(x, y, z) creates a vector whose i-th element is the smallest of the i-th elements across all the (extended) vectors. If the j-th element is NA in any vector, the j-th element of the returned vector will also be NA. To ignore NAs, use the argument na.rm = TRUE. The function pmax(x, y, z, na.rm = F) is the maximizing counterpart to pmin(x, y, z, na.rm = F).
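A small sketch of pmin() and pmax() on two illustrative vectors, a and b (my own names and values), one of which contains an NA:

```r
a <- c(1, 5, 9)
b <- c(4, 2, NA)

lo_na   <- pmin(a, b)               # 1 2 NA
lo_skip <- pmin(a, b, na.rm = TRUE) # 1 2 9
hi_skip <- pmax(a, b, na.rm = TRUE) # 4 5 9
capped  <- pmin(a, 3)               # 1 3 3: the single 3 is recycled
```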
To add all elements of x, use sum(x, na.rm = F). This assumes numeric or logical elements. For logical elements, T = 1 and F = 0.
Similarly, use prod(x, na.rm = F) to multiply all elements.
The arithmetic mean of a numeric vector x is obtained thus: mean(x, na.rm = F).
The median is obtained from median(x, na.rm = F).
The range of the numbers in x is obtained from range(x, na.rm = F). It returns two numbers: the lowest and the highest. So, it is equivalent to c(min(x), max(x)).
The variance of the elements of x is obtained from var(x). The same result is obtainable in a roundabout way as sum((x - mean(x))^2)/(length(x) - 1).
The standard deviation of the elements of x is obtained from sd(x). I could equivalently use sqrt(var(x)).
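The summary functions above can be checked on a small vector whose answers are easy to verify by hand. The values are illustrative, and var_by_hand is my own name.

```r
x <- c(9, 7, 3, 5, 1)

sum(x)    # 25
prod(x)   # 945
mean(x)   # 5
median(x) # 5
range(x)  # 1 9

# Squared deviations from the mean are 16 4 4 0 16, which add up to 40
var_by_hand <- sum((x - mean(x))^2) / (length(x) - 1) # 40 / 4 = 10

n_true <- sum(c(TRUE, FALSE, TRUE)) # 2, since TRUE counts as 1 and FALSE as 0
```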
To sort the elements of x in ascending order, use sort(x). To sort in descending order, use sort(x, decreasing = TRUE).
So, max(x) is the same as sort(x)[length(x)], the 2nd largest element of x is sort(x)[length(x) - 1], etc. And min(x) is the same as sort(x)[1], the 2nd smallest element is sort(x)[2], etc.
To get the ranks of the elements of x from 1 (lowest) to length(x) (highest), use the command rank(x).
A related function is order(). order(x) is a vector of the indexes of x – that is, the numbers from 1 through length(x) – arranged from the index of the smallest element of x to the index of the largest. To reverse the ordering, use order(x, decreasing = TRUE).
Note that order() is different from rank(). rank(x) returns the rank of each element of x, while order(x) returns the positions, in the original vector, of the elements taken in sorted order. So, x[order(x)] is the same as sort(x).
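The difference between sort(), rank() and order() is easiest to see on a tiny example (values chosen for illustration; the result names are mine):

```r
x <- c(30, 10, 20)

sorted <- sort(x)  # 10 20 30
ranks  <- rank(x)  # 3 1 2: 30 is the largest, 10 the smallest
ord    <- order(x) # 2 3 1: the smallest is at position 2, the next at position 3

via_order <- x[ord]                          # 10 20 30, same as sort(x)
second_largest <- sort(x)[length(x) - 1]     # 20
```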
To randomly rearrange the elements of a vector use sample(x).
Here’s how to generate a sample of five two-digit positive integers:
(mysample <- sample(x = 10:99, size = 5, replace = FALSE))
## [1] 50 70 23 93 46
Here’s how to generate a sample of five zeroes and ones:
(mysample01 <- sample(x = c(0, 1), size = 5, replace = TRUE, prob = c(0.30, 0.70)))
## [1] 1 1 1 0 1
Note that the chance of getting 1 has been made 70 percent.
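Because sample() is random, the outputs shown above will differ from run to run. If you need the same draw every time, you can fix the random number seed first with set.seed(). A sketch follows; the seed value 2024 is arbitrary, and I deliberately do not show the numbers drawn.

```r
set.seed(2024) # arbitrary seed
draw1 <- sample(x = 10:99, size = 5, replace = FALSE)

set.seed(2024) # same seed again
draw2 <- sample(x = 10:99, size = 5, replace = FALSE)

identical(draw1, draw2) # TRUE: same seed, same sample

shuffled <- sample(1:5) # a random rearrangement of 1 2 3 4 5
```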
The diff() function comes in handy in time series work.
(x <- rnorm(n = 10, mean = 0, sd = 1)) # Vector of 10 random numbers from a normal distribution with mean = zero and standard deviation = 1
## [1] 1.2931486 -0.1249023 -0.6443600 -0.3405146 -0.7188714 0.9804323
## [7] -2.8611554 -0.4376202 2.5839809 1.3805696
diff(x, lag = 2) # 8 numbers, x[3] - x[1], x[4] - x[2], etc
## [1] -1.93750860 -0.21561222 -0.07451142 1.32094688 -2.14228399 -1.41805248
## [7] 5.44513629 1.81818978
diff(x, differences = 2) # diff with lag 1 done on diff with lag 1
## [1] 0.8985934 0.8233030 -0.6822022 2.0776605 -5.5408914 6.2651229 0.5980659
## [8] -4.2250124
diff(diff(x, lag = 1), lag = 1) # Same as diff(x, differences = 2)
## [1] 0.8985934 0.8233030 -0.6822022 2.0776605 -5.5408914 6.2651229 0.5980659
## [8] -4.2250124
diff(x, lag = 3, differences = 2)
## [1] -0.88697759 0.87522025 -0.02124366 6.76236579
diff(diff(x, lag = 3, differences = 1), lag = 3, differences = 1) # Same as previous
## [1] -0.88697759 0.87522025 -0.02124366 6.76236579
I wish there was a command that would automatically generate:
c(NA, NA, diff(x, lag = 2))
## [1] NA NA -1.93750860 -0.21561222 -0.07451142 1.32094688
## [7] -2.14228399 -1.41805248 5.44513629 1.81818978
That way, the elements not amenable to differencing would be explicitly indicated.
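In the meantime, a small helper can do the padding. The function below, padded_diff(), is a hypothetical name of my own and not part of base R; it simply puts as many NAs in front of diff()'s output as are needed to restore the original length.

```r
# padded_diff() is a hypothetical helper, not a base R function
padded_diff <- function(x, lag = 1, differences = 1) {
  d <- diff(x, lag = lag, differences = differences)
  # diff() drops lag * differences elements; put NAs back in their place
  c(rep(NA, length(x) - length(d)), d)
}

v <- c(1, 4, 9, 16, 25)
padded_diff(v, lag = 2)         # NA NA 8 12 16
padded_diff(v, differences = 2) # NA NA 2 2 2
```

The result always has the same length as the input, so it can sit next to the original series in a table.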
Suppose you have data on dinner guests’ ratings of your cooking, from one to three stars, with the interpretations being “bad”, “okay”, and “good”, respectively. These ratings can be represented by a vector of the factor type:
x <- c(3,2,2,3,1,2,3,2,1,2) # Guests' numerical ratings, as a numeric vector
xf <- factor(x, labels=c("bad","okay","good"), ordered = TRUE)
# Interpretations of the ratings. Now the vector is a factor.
x
## [1] 3 2 2 3 1 2 3 2 1 2
xf
## [1] good okay okay good bad okay good okay bad okay
## Levels: bad < okay < good
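A few things one can do with such a factor, as a sketch using the same ratings (the result names are mine):

```r
x  <- c(3, 2, 2, 3, 1, 2, 3, 2, 1, 2)
xf <- factor(x, labels = c("bad", "okay", "good"), ordered = TRUE)

counts <- table(xf)      # bad 2, okay 5, good 3
lev    <- levels(xf)     # "bad" "okay" "good"
codes  <- as.integer(xf) # back to the underlying codes 3 2 2 3 1 2 3 2 1 2

# Because the factor is ordered, comparisons against a level are meaningful
at_least_okay <- sum(xf >= "okay") # 8 guests rated the cooking okay or good
```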
Objects that denote dates will be discussed later.
The following illustrates how to combine numeric vectors into a list and how to use the sapply() function to apply other functions to the vectors that make up the list.
x <- 1:10
y <- 100:105
mylist <- list(x,y)
sapply(mylist, FUN = sum, na.rm = TRUE)
## [1] 55 615
sapply(mylist, FUN = mean, na.rm = TRUE)
## [1] 5.5 102.5
sapply(mylist, FUN = var, na.rm = TRUE)
## [1] 9.166667 3.500000
While the above commands show how the functions sum(), mean() and var() can be applied to multiple numeric vectors all at once, it should be clear that pretty much all the functions discussed above that acted on a single numeric vector could also be made to do their magic on multiple numeric vectors all at once.
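For instance, here is a sketch with a few of the other functions. I have given the list elements names (x and y), which the original list did not have, so that the results are labelled.

```r
mylist <- list(x = 1:10, y = 100:105)

maxes   <- sapply(mylist, FUN = max)    # named vector: x = 10, y = 105
lengths <- sapply(mylist, FUN = length) # named vector: x = 10, y = 6

# When the function returns two numbers per vector, sapply() builds a matrix
ranges  <- sapply(mylist, FUN = range)  # 2-by-2 matrix, one column per vector
```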