All objects in R are built on top of a basic set of built-in objects. The type of an object defines how it is stored in R. Objects in R are also members of a class. Classes define what information objects contain, and how those objects may be used.
The following table shows all of the built-in object types.
Table Primitive object types in R
| Category | Object type | Description | Example |
|---|---|---|---|
| Vectors | integer | Naturally produced from sequences. Can be created with the integer() function or coerced with as.integer(). | 5:5, integer(5) |
| | double | Used to represent floating-point numbers (numbers with decimals and large numbers). On most modern platforms, this will be 8 bytes, or 64 bits. By default, most numerical values are represented as doubles. Can be created with the double() function or coerced with as.double(). | 1, -1, 2^50, double(5) |
| | complex | Complex numbers. To use, you must include both the real and the imaginary parts (even if the real part is 0). | 2+3i, 0+1i, exp(0+1i * pi) |
| | character | A string of characters (just called a string in many other languages). | "Hello world." |
| | logical | Represents Boolean values. | TRUE, FALSE |
| | raw | A vector containing raw bytes. Useful for encoding objects from outside the R environment. | raw(8), charToRaw("Hello") |
| Compound | list | A (possibly heterogeneous) collection of other objects. Elements of a list may be named. Many other object types in R (such as data frames) are implemented as lists. | list(1, 2, "hat") |
| | pairlist | A data structure used to represent a set of name-value pairs. Pairlists are primarily used internally but can be created at the user level. Their use is deprecated in user-level programs, because standard list objects are just as efficient and more flexible. | .Options, pairlist(apple=1, pear=2, banana=3) |
| | S4 | An R object supporting modern object-oriented paradigms (inheritance, methods, etc.). | |
| | environment | An R environment describes the set of symbols available in a specific context. An environment contains a set of symbol-value pairs and a pointer to an enclosing environment. | .GlobalEnv, new.env(parent = baseenv()) |
| Special | any | An object used to mean that "any" type is OK. Used to prevent coercion from one type to another. Useful in defining slots in S4 objects or signatures for generic functions. (For example, you could use any in the signature of a default generic function.) | setClass("Something", representation(data="ANY")) |
| | NULL | An object that means "there is no object." Returned by functions and expressions whose value is not defined. The NULL object can have no attributes. | NULL |
| | ... | Used in functions to implement variable-length argument lists, particularly arguments passed to other functions. | N/A |
| R language | symbol | A symbol is a language object that refers to other objects. Usually encountered when parsing R statements. | as.name(x), as.symbol(x), quote(x) |
| | promise | Promises are objects that are not evaluated when they are created but are instead evaluated when they are first used. They are used to implement delayed loading of objects in packages. | x <- 1; y <- 2; z <- 3; delayedAssign("v", c(x, y, z)) # v is a promise |
| | language | R language objects are used when processing the R language itself. | quote(function(x) { x + 1}) |
| | expression | An unevaluated R expression. Expression objects can be created with the expression function, and later evaluated with the eval function. | expression(1 + 2) |
| Functions | closure | An R function not implemented inside the R system. Most functions fall into this category. Includes user-defined functions, most functions included with R, and most functions in R packages. | f <- function(x) { x + 1}, print |
| | special | An internal function whose arguments are not necessarily evaluated on call. | if, [ |
| | builtin | An internal function that evaluates its arguments. | +, ^ |
| Internal | char | A scalar "string" object. A character vector is composed of chars. (Users can't easily generate a char object but don't ever need to.) | N/A |
| | bytecode | A data type reserved for a future byte-code compiler. | N/A |
| | externalptr | External pointer. Used in C code. | N/A |
| | weakref | Weak reference (internal only). | N/A |
To make them easier to understand, the object types can be classified into a few categories:
Basic vectors
These are vectors containing a single type of value: integers, floating-point numbers, complex numbers, text, logical values, or raw data.
Compound objects
These objects are containers for the basic vectors: lists, pairlists, S4 objects, and environments. Each of these objects has unique properties (described below), but each of them contains a number of named objects.
Special objects
These objects serve a special purpose in R programming: any, NULL, and .... Each of these means something important in a specific context, but you would never create an object of these types.
R language
These are objects that represent R code; they can be evaluated to return other objects.
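To make the distinction between type and class concrete, here is a minimal sketch using the base functions typeof and class. The variable names x and df are illustrative only.

```r
# typeof() reports how an object is stored; class() reports how it behaves.
x  <- 1:3
df <- data.frame(a = 1:2)

typeof(x)          # "integer"
typeof(1)          # "double"
typeof("a")        # "character"
typeof(list(1))    # "list"
typeof(quote(y))   # "symbol"
typeof(mean)       # "closure"
typeof(`if`)       # "special"
typeof(sum)        # "builtin"

typeof(df)         # "list": a data frame is stored as a list
class(df)          # "data.frame"
```

The last two lines show why both ideas are needed: the storage type of a data frame is list, while its class determines how it prints and how functions treat it.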
Vectors and lists were introduced in Chapter 2 and Chapter 3.
a=1:10
a^3
## [1] 1 8 27 64 125 216 343 512 729 1000
c(1:10)^3
## [1] 1 8 27 64 125 216 343 512 729 1000
a=1:5
b=2:6
c=seq(2,5,by=1)
a*b
## [1] 2 6 12 20 30
a^c
## Warning in a^c: longer object length is not a multiple of shorter object
## length
## [1] 1 8 81 1024 25
a%%b
## [1] 1 2 3 4 5
a%/%b
## [1] 0 0 0 0 0
a%*%t(b)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2 3 4 5 6
## [2,] 4 6 8 10 12
## [3,] 6 9 12 15 18
## [4,] 8 12 16 20 24
## [5,] 10 15 20 25 30
A matrix is an extension of a vector to two dimensions. A matrix is used to represent two-dimensional data of a single type. A clean way to generate a new matrix is with the matrix function. For example,
m <- matrix(data=1:12,nrow=4,ncol=3, dimnames=list(c("r1","r2","r3","r4"), c("c1","c2","c3")))
m## c1 c2 c3
## r1 1 5 9
## r2 2 6 10
## r3 3 7 11
## r4 4 8 12
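By default, matrix fills its cells column by column, as the output above shows. The following sketch contrasts this with the byrow argument; the names m_col and m_row are illustrative.

```r
# matrix() fills by column unless byrow = TRUE is given.
m_col <- matrix(1:6, nrow = 2)                 # columns filled first
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # rows filled first

m_col[1, ]   # 1 3 5
m_row[1, ]   # 1 2 3
```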
It is also possible to transform another data structure into a matrix using the as.matrix function.
a = data.frame(1:5,seq(4,8))
is.matrix(a)
## [1] FALSE
b = as.matrix(a)
is.matrix(b)
## [1] TRUE
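Because a matrix holds a single type, converting a data frame with mixed columns coerces everything to a common type. A small sketch, using a hypothetical data frame named mixed:

```r
# A data frame with one integer column and one character column.
mixed <- data.frame(id = 1:2, label = c("x", "y"))

mm <- as.matrix(mixed)
typeof(mm)   # "character": the integer column was converted to text
dim(mm)      # 2 2
```

If you need a numeric matrix, select only the numeric columns before calling as.matrix.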
An important note: matrices are implemented as vectors, not as a vector of vectors (or as a list of vectors). Array subscripts are used for referencing elements and don’t reflect the way the data is stored. Unlike other classes, matrices don’t have an explicit class attribute.
A <- 1:6
dim(A)## NULL
A## [1] 1 2 3 4 5 6
dim(A) <- c(2,3)
A## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
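Since the data is stored as a plain vector in column order, a two-subscript reference can be reproduced with a single linear index. The sketch below illustrates this for one cell; the names i and j are illustrative.

```r
A <- matrix(1:6, nrow = 2, ncol = 3)
i <- 2
j <- 3

A[i, j]                     # 6, using matrix subscripts
A[(j - 1) * nrow(A) + i]    # 6, the same cell by its position in the vector
names(attributes(A))        # "dim": the only attribute the matrix carries
```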
The same works for a list, whose dim attribute is also initially NULL:
B <- list(1,2,3,4,5,6)
dim(B)## NULL
B## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4
##
## [[5]]
## [1] 5
##
## [[6]]
## [1] 6
dim(B) <- c(2,3)
B## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
B <- matrix(1:6, 2, 3)
B[1,] # First row
## [1] 1 3 5
B[,3] # Third column
## [1] 5 6
If you want the result to be a one-row matrix or a one-column matrix, then include the drop=FALSE argument:
B[1,,drop=FALSE] # First row in a one-row matrix
## [,1] [,2] [,3]
## [1,] 1 3 5
B[,3,drop=FALSE] # Third column in a one-column matrix
## [,1]
## [1,] 5
## [2,] 6
Loop over matrix elements
If you want to loop over elements in a matrix (columns and rows), a straightforward way is to use nested loops.
# Define corr
corr <- matrix(c(1.00, 0.96, 0.88, 0.96, 1.00, 0.74, 0.88, 0.74, 1.00), 3, 3)
row.names(corr) <- c("apple", "ibm", "micr")
colnames(corr) <- c("apple", "ibm", "micr")
# Print out corr
corr## apple ibm micr
## apple 1.00 0.96 0.88
## ibm 0.96 1.00 0.74
## micr 0.88 0.74 1.00
# Create a nested loop
for(row in 1:nrow(corr)) {
  for(col in 1:ncol(corr)) {
    print(paste(colnames(corr)[col], "and", rownames(corr)[row],
                "have a correlation of", corr[row,col]))
  }
}
## [1] "apple and apple have a correlation of 1"
## [1] "ibm and apple have a correlation of 0.96"
## [1] "micr and apple have a correlation of 0.88"
## [1] "apple and ibm have a correlation of 0.96"
## [1] "ibm and ibm have a correlation of 1"
## [1] "micr and ibm have a correlation of 0.74"
## [1] "apple and micr have a correlation of 0.88"
## [1] "ibm and micr have a correlation of 0.74"
## [1] "micr and micr have a correlation of 1"
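Because the correlation matrix is symmetric, the nested loop reports every pair twice. As an alternative sketch, which and upper.tri can list each distinct pair once without an explicit loop. The names idx and msgs are illustrative.

```r
corr <- matrix(c(1.00, 0.96, 0.88, 0.96, 1.00, 0.74, 0.88, 0.74, 1.00), 3, 3,
               dimnames = list(c("apple", "ibm", "micr"),
                               c("apple", "ibm", "micr")))

# Row and column positions of the cells above the diagonal.
idx <- which(upper.tri(corr), arr.ind = TRUE)

# Indexing with a two-column matrix picks one cell per row of idx.
msgs <- paste(rownames(corr)[idx[, "row"]], "and", colnames(corr)[idx[, "col"]],
              "have a correlation of", corr[idx])
print(msgs)
```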
A <- matrix(1:6, 2, 3)
B <- matrix(7:12, 2, 3)
C <- c(13:15)
A## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
B## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
t(B)## [,1] [,2]
## [1,] 7 8
## [2,] 9 10
## [3,] 11 12
A+B## [,1] [,2] [,3]
## [1,] 8 12 16
## [2,] 10 14 18
A-B## [,1] [,2] [,3]
## [1,] -6 -6 -6
## [2,] -6 -6 -6
A*B## [,1] [,2] [,3]
## [1,] 7 27 55
## [2,] 16 40 72
A%*%t(B)## [,1] [,2]
## [1,] 89 98
## [2,] 116 128
A%*%C## [,1]
## [1,] 130
## [2,] 172
A/B## [,1] [,2] [,3]
## [1,] 0.1428571 0.3333333 0.4545455
## [2,] 0.2500000 0.4000000 0.5000000
x = c(1,2,3)
y = diag(x)
y## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 2 0
## [3,] 0 0 3
solve(y)## [,1] [,2] [,3]
## [1,] 1 0.0 0.0000000
## [2,] 0 0.5 0.0000000
## [3,] 0 0.0 0.3333333
A <- matrix(1:9, 3, 3)
A## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
s <- svd(A)
s## $d
## [1] 1.684810e+01 1.068370e+00 5.543107e-16
##
## $u
## [,1] [,2] [,3]
## [1,] -0.4796712 0.77669099 0.4082483
## [2,] -0.5723678 0.07568647 -0.8164966
## [3,] -0.6650644 -0.62531805 0.4082483
##
## $v
## [,1] [,2] [,3]
## [1,] -0.2148372 -0.8872307 0.4082483
## [2,] -0.5205874 -0.2496440 -0.8164966
## [3,] -0.8263375 0.3879428 0.4082483
s$u %*% diag(s$d) %*% t(s$v)## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
s <- eigen(A)
s## eigen() decomposition
## $values
## [1] 1.611684e+01 -1.116844e+00 -5.700691e-16
##
## $vectors
## [,1] [,2] [,3]
## [1,] -0.4645473 -0.8829060 0.4082483
## [2,] -0.5707955 -0.2395204 -0.8164966
## [3,] -0.6770438 0.4038651 0.4082483
d <- s$values
v <- s$vectors
v %*% diag(d) %*% solve(v)## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
There are some other objects that you should know about if you’re using R. Although most of these objects are not formally part of the R language, they are used in so many R packages, or get special treatment in R, that they’re worth a closer look.
An array is an extension of a vector to more than two dimensions. Arrays are used to represent multidimensional data of a single type. As above, you can generate an array with the array function:
a <- array(data=1:24,dim=c(3,4,2))
a## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 13 16 19 22
## [2,] 14 17 20 23
## [3,] 15 18 21 24
Like matrices, the underlying storage mechanism for an array is a vector. (Like matrices, and unlike most other classes, arrays don’t have an explicit class attribute.)
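The following sketch shows how a three-subscript reference maps onto the underlying vector for an array of the same shape as the one above. The helper variables i, j, and k are illustrative.

```r
a <- array(data = 1:24, dim = c(3, 4, 2))
i <- 2; j <- 3; k <- 2

a[i, j, k]                                  # 20
# Position in the underlying vector: rows vary fastest, then columns, then layers.
a[i + (j - 1) * 3 + (k - 1) * 3 * 4]        # 20
dim(a)                                      # 3 4 2
```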
When analyzing data, it’s quite common to encounter categorical values. For example, suppose that you have a set of observations about people that includes eye color. You could represent the eye colors as a character vector:
eye.colors <- c("brown","blue","blue","green","brown","brown","brown")
This is a perfectly valid way to represent the information, but it can become inefficient if you are working with long names or a large number of observations. R provides a better way to represent categorical values: factors. A factor is an ordered collection of items. The different values that the factor can take are called levels.
eye.colors <- factor(c("brown", "blue", "blue", "green", "brown", "brown", "brown"))
levels(eye.colors) # The levels function shows all the levels from a factor
## [1] "blue" "brown" "green"
Printing a factor shows slightly different information than printing a character vector. In particular, notice that the quotes are not shown and that the levels are explicitly printed:
eye.colors## [1] brown blue blue green brown brown brown
## Levels: blue brown green
In the eye color example, order did not matter. However, sometimes the order of the factors matters for a specific problem. For example, suppose that you had conducted a survey and asked respondents how they felt about the statement “melon is delicious with an omelet.” Furthermore, suppose that you allowed respondents to give the following responses: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree.
There are multiple ways to represent this information in R. You could code these as integers (for example, on a scale of 1 to 5), although this approach has some drawbacks. This approach implies a specific quantitative relationship between values, which may or may not make sense. For example, is the difference between Strongly Disagree and Disagree the same as the difference between Disagree and Neutral? A numeric response also implies that you can calculate meaningful statistics based on the responses. Can you be sure that a Disagree response and an Agree response average out to Neutral?
To get around these problems, you can use an ordered factor to represent the response of this survey. Here is an example:
survey.results <- factor( c("Disagree", "Neutral", "Strongly Disagree", "Neutral", "Agree", "Strongly Agree", "Disagree", "Strongly Agree", "Neutral", "Strongly Disagree", "Neutral", "Agree"), levels=c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"), ordered=TRUE)
survey.results## [1] Disagree Neutral Strongly Disagree
## [4] Neutral Agree Strongly Agree
## [7] Disagree Strongly Agree Neutral
## [10] Strongly Disagree Neutral Agree
## 5 Levels: Strongly Disagree < Disagree < Neutral < ... < Strongly Agree
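An ordered factor supports comparisons that respect the level order, which a plain character vector does not. A small sketch with a shortened, hypothetical set of responses:

```r
lv <- c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree")
responses <- factor(c("Disagree", "Agree", "Neutral", "Strongly Agree"),
                    levels = lv, ordered = TRUE)

responses >= "Agree"           # FALSE TRUE FALSE TRUE
sum(responses >= "Agree")      # 2 respondents agree or strongly agree
as.character(max(responses))   # "Strongly Agree"
```

The comparison uses the position of each level in lv, not alphabetical order.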
Factors are implemented internally using integers. The levels attribute maps each integer to a factor level. Integers take up a small, fixed amount of storage space, so they can be more space efficient than character vectors. It’s possible to take a factor and turn it into an integer vector:
eye.colors## [1] brown blue blue green brown brown brown
## Levels: blue brown green
class(eye.colors)## [1] "factor"
eye.colors.integer.vector <- unclass(eye.colors)
eye.colors.integer.vector## [1] 2 1 1 3 2 2 2
## attr(,"levels")
## [1] "blue" "brown" "green"
class(eye.colors.integer.vector)## [1] "integer"
It’s possible to change this back to a factor by setting the class attribute:
class(eye.colors.integer.vector) <- "factor"
eye.colors.integer.vector## [1] brown blue blue green brown brown brown
## Levels: blue brown green
class(eye.colors.integer.vector)## [1] "factor"
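As a complementary sketch, as.integer extracts the integer codes without carrying the levels along, and indexing levels by those codes recovers the original labels. The names eye, codes, and labels are illustrative.

```r
eye <- factor(c("brown", "blue", "blue", "green"))

codes  <- as.integer(eye)      # 2 1 1 3
labels <- levels(eye)[codes]   # look each code up in the levels vector

identical(labels, as.character(eye))   # TRUE
```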
In business contexts, data is often kept in database tables. Each table has many rows, which may consist of multiple “columns” representing different quantities and which may be kept in multiple formats. A data frame is a natural way to represent these data sets in R.
Use the data.frame function to assemble vectors into a data frame:
v1 = c(1:5)
v2 = c(6:10)
v3 = c(11:15)
dfrm1 <- data.frame(v1, v2, v3)
dfrm1

| v1 | v2 | v3 |
|---|---|---|
| 1 | 6 | 11 |
| 2 | 7 | 12 |
| 3 | 8 | 13 |
| 4 | 9 | 14 |
| 5 | 10 | 15 |
If your data is in a list of vectors, use as.data.frame:
list.of.vectors = list(a = v1, b = v2, c = v3)
dfrm2 <- as.data.frame(list.of.vectors)
dfrm2

| a | b | c |
|---|---|---|
| 1 | 6 | 11 |
| 2 | 7 | 12 |
| 3 | 8 | 13 |
| 4 | 9 | 14 |
| 5 | 10 | 15 |
If the vectors have lengths that cannot be recycled to a common number of rows, data.frame reports an error:
data.frame(a=c(1,2,3,4,5),b=c(1,2,3,4))
Error in data.frame(a = c(1, 2, 3, 4, 5), b = c(1, 2, 3, 4)) : arguments imply differing number of rows: 5, 4
Initializing a data frame from row data
Imagine that your data is organized by rows, and you want to assemble it into a data frame. You can store each row in a one-row data frame, store the one-row data frames in a list, and then use rbind and do.call to bind the rows into one large data frame:
Row1 = data.frame(city="Chicago", county="Cook", state="IL", pop=2853114)
Row2 = data.frame(city="Kenosha", county="Kenosha", state="WI", pop=5428)
Row3 = data.frame(city="Aurora", county="Kane", state="IL", pop=171782)
obs = list(Row1, Row2, Row3)
obs[[1]]

| city | county | state | pop |
|---|---|---|---|
| Chicago | Cook | IL | 2853114 |
rbind(obs[[1]], obs[[2]])

| city | county | state | pop |
|---|---|---|---|
| Chicago | Cook | IL | 2853114 |
| Kenosha | Kenosha | WI | 5428 |
dfrm <- do.call(rbind, obs)
dfrm

| city | county | state | pop |
|---|---|---|---|
| Chicago | Cook | IL | 2853114 |
| Kenosha | Kenosha | WI | 5428 |
| Aurora | Kane | IL | 171782 |
Appending rows to a data frame
newRow <- data.frame(city="West Dundee", county="Kane", state="IL", pop=5428)
suburbs <- rbind(dfrm, newRow)
suburbs

| city | county | state | pop |
|---|---|---|---|
| Chicago | Cook | IL | 2853114 |
| Kenosha | Kenosha | WI | 5428 |
| Aurora | Kane | IL | 171782 |
| West Dundee | Kane | IL | 5428 |
Preallocating a data frame
Imagine that you are building a data frame, row by row. You want to preallocate the space instead of appending rows incrementally. Then, you can create a data frame from generic vectors and factors using the functions numeric(n), character(n) and factor(n):
N <- 10
dfrm <- data.frame(a=factor(N, levels=c("NJ", "IL", "CA")),
b=character(N),
c=numeric(N) )
dfrm

| a | b | c |
|---|---|---|
| NA | | 0 |
| NA | | 0 |
| NA | | 0 |
| NA | | 0 |
| NA | | 0 |
| NA | | 0 |
| NA | | 0 |
| NA | | 0 |
| NA | | 0 |
| NA | | 0 |
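Once the space is allocated, each row can be filled in place by index. The sketch below uses a smaller, hypothetical data frame with only a character and a numeric column; the values written into it are made up for illustration.

```r
N <- 3
dfrm <- data.frame(b = character(N), c = numeric(N), stringsAsFactors = FALSE)

# Fill the preallocated rows one at a time instead of growing the data frame.
for (i in seq_len(N)) {
  dfrm$b[i] <- paste0("row", i)
  dfrm$c[i] <- i^2
}
dfrm
```

Setting stringsAsFactors = FALSE keeps column b as plain text, so arbitrary strings can be assigned to it.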
subset
You can use the subset function to select rows and columns from a data frame or matrix. The select argument is a column name, or a vector of column names, to be selected:
v1 = c(1:5)
v2 = c(6:10)
v3 = c(11:15)
dfrm <- data.frame(v1, v2, v3)
subset(dfrm, select = c(v1, v3))

| v1 | v3 |
|---|---|
| 1 | 11 |
| 2 | 12 |
| 3 | 13 |
| 4 | 14 |
| 5 | 15 |
subset(dfrm, subset=(v1 > 3))

| | v1 | v2 | v3 |
|---|---|---|---|
| 4 | 4 | 9 | 14 |
| 5 | 5 | 10 | 15 |
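The same selection can be written with bracket indexing, which is often preferred inside functions because it does not rely on evaluating column names in the data frame. A sketch comparing the two forms (the result names are illustrative):

```r
dfrm <- data.frame(v1 = 1:5, v2 = 6:10, v3 = 11:15)

# Rows where v1 > 3, keeping only columns v1 and v3.
by_subset   <- subset(dfrm, subset = v1 > 3, select = c(v1, v3))
by_brackets <- dfrm[dfrm$v1 > 3, c("v1", "v3")]

all.equal(by_subset, by_brackets)   # TRUE
```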
Changing the names of data frame columns
The colnames function returns the vector of column names of a data frame. You can update individual names or the entire vector:
dfrm

| v1 | v2 | v3 |
|---|---|---|
| 1 | 6 | 11 |
| 2 | 7 | 12 |
| 3 | 8 | 13 |
| 4 | 9 | 14 |
| 5 | 10 | 15 |
colnames(dfrm) <- c('a','b','c')
dfrm

| a | b | c |
|---|---|---|
| 1 | 6 | 11 |
| 2 | 7 | 12 |
| 3 | 8 | 13 |
| 4 | 9 | 14 |
| 5 | 10 | 15 |
Removing NAs from a data frame
You can use na.omit to remove rows that contain any NA values.
dfrm[1,] = c(1,2,NA)
dfrm

| a | b | c |
|---|---|---|
| 1 | 2 | NA |
| 2 | 7 | 12 |
| 3 | 8 | 13 |
| 4 | 9 | 14 |
| 5 | 10 | 15 |
clean <- na.omit(dfrm)
clean

| | a | b | c |
|---|---|---|---|
| 2 | 2 | 7 | 12 |
| 3 | 3 | 8 | 13 |
| 4 | 4 | 9 | 14 |
| 5 | 5 | 10 | 15 |
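Before dropping rows, it can help to see where the missing values are. The following sketch, on a small made-up data frame, uses complete.cases to flag rows without any NA and colSums(is.na(...)) to count missing values per column.

```r
dfrm <- data.frame(a = c(1, 2, 3), b = c(2, NA, 8), c = c(NA, 12, 13))

complete.cases(dfrm)      # FALSE FALSE TRUE
colSums(is.na(dfrm))      # a: 0, b: 1, c: 1

# Keeping only the complete rows selects the same rows that na.omit keeps.
clean <- dfrm[complete.cases(dfrm), ]
nrow(clean)               # 1
```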
Combining two data frames
You can combine the contents of two data frames into one data frame. Use rbind to stack the rows and cbind to place the columns side by side:
dfrm1 <- data.frame(a = 1:5, b = 6:10, c = 11:15)
dfrm2 <- data.frame(a = -1:-5, b = -6:-10, c = -11:-15)
rbind(dfrm1, dfrm2)

| a | b | c |
|---|---|---|
| 1 | 6 | 11 |
| 2 | 7 | 12 |
| 3 | 8 | 13 |
| 4 | 9 | 14 |
| 5 | 10 | 15 |
| -1 | -6 | -11 |
| -2 | -7 | -12 |
| -3 | -8 | -13 |
| -4 | -9 | -14 |
| -5 | -10 | -15 |
cbind(dfrm1, dfrm2)

| a | b | c | a | b | c |
|---|---|---|---|---|---|
| 1 | 6 | 11 | -1 | -6 | -11 |
| 2 | 7 | 12 | -2 | -7 | -12 |
| 3 | 8 | 13 | -3 | -8 | -13 |
| 4 | 9 | 14 | -4 | -9 | -14 |
| 5 | 10 | 15 | -5 | -10 | -15 |
Loop over data frame rows
Imagine that you are interested in the days where the stock price of Apple rises above 116. If it goes above this value, you want to print out the current date and stock price; otherwise, you print a note that the day is not important.
# Define stock
date <- seq(from = as.Date("2016-12-01"), to = as.Date("2016-12-30"), by = "days")
date <- date[-c(3,4,10,11,17,18,24,25,26)]
apple <- c(109.49, 109.90, 109.11, 109.95, 111.03, 112.12, 113.95, 113.30,
115.19, 115.19, 115.82, 115.97, 116.64, 116.95, 117.06, 116.29, 116.52,
117.26, 116.76, 116.73, 115.82)
stock <- data.frame(date = date, apple = apple)
# Loop over stock rows
for (row in 1:nrow(stock)) {
  price <- stock[row, "apple"]
  date <- stock[row, "date"]
  if (price > 116) {
    print(paste("On", date,
                "the stock price was", price))
  } else {
    print(paste("The date:", date,
                "is not an important day!"))
  }
}
## [1] "The date: 2016-12-01 is not an important day!"
## [1] "The date: 2016-12-02 is not an important day!"
## [1] "The date: 2016-12-05 is not an important day!"
## [1] "The date: 2016-12-06 is not an important day!"
## [1] "The date: 2016-12-07 is not an important day!"
## [1] "The date: 2016-12-08 is not an important day!"
## [1] "The date: 2016-12-09 is not an important day!"
## [1] "The date: 2016-12-12 is not an important day!"
## [1] "The date: 2016-12-13 is not an important day!"
## [1] "The date: 2016-12-14 is not an important day!"
## [1] "The date: 2016-12-15 is not an important day!"
## [1] "The date: 2016-12-16 is not an important day!"
## [1] "On 2016-12-19 the stock price was 116.64"
## [1] "On 2016-12-20 the stock price was 116.95"
## [1] "On 2016-12-21 the stock price was 117.06"
## [1] "On 2016-12-22 the stock price was 116.29"
## [1] "On 2016-12-23 the stock price was 116.52"
## [1] "On 2016-12-27 the stock price was 117.26"
## [1] "On 2016-12-28 the stock price was 116.76"
## [1] "On 2016-12-29 the stock price was 116.73"
## [1] "The date: 2016-12-30 is not an important day!"
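When only the matching days are of interest, a vectorized filter is a common alternative to the row loop. The sketch below uses three rows taken from the data above; the names important and msgs are illustrative.

```r
stock <- data.frame(
  date  = as.Date(c("2016-12-16", "2016-12-19", "2016-12-20")),
  apple = c(115.97, 116.64, 116.95)
)

# Logical indexing keeps only the rows where the price exceeds 116.
important <- stock[stock$apple > 116, ]
msgs <- paste("On", important$date, "the stock price was", important$apple)
print(msgs)
```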
Very often, you need to express a relationship between variables. Sometimes, you want to plot a chart showing the relationship between the two variables. R provides a formula class that lets you describe the relationship for both purposes.
Let’s create a formula as an example:
sample.formula <- as.formula(y~x1+x2+x3)
class(sample.formula)
## [1] "formula"
typeof(sample.formula)
## [1] "language"
This formula means “y is a function of x1, x2, and x3.” Some R functions use more complicated formulas. Here is an explanation of the meaning of different items in formulas:
Variable names
Represent variable names.
Tilde (~)
Used to show the relationship between the response variables (to the left) and the stimulus variables (to the right).
Plus sign (+)
Used to express a linear relationship between variables.
Zero (0)
When added to a formula, indicates that no intercept term should be included. For example:
y~u+w+v+0
Vertical bar (|)
Used to specify conditioning variables (in lattice formulas).
Identity function (I())
Used to indicate that the enclosed expression should be interpreted by its arithmetic meaning. For example: a+b means that both a and b should be included in the formula. The formula: I(a+b) means that “a plus b” should be included in the formula.
Asterisk (*)
Used to indicate interactions between variables. For example: y~(u+v)*w is equivalent to: y~u+v+w+u:w+v:w, where u:w denotes the interaction of u and w.
Caret (^)
Used to indicate crossing to a specific degree. For example: y~(u+w)^2 is equivalent to: y~(u+w)*(u+w)
Function of variables
Indicates that the function of the specified variables should be interpreted as a variable. For example: y~log(u)+sin(v)+w
Some additional items have special meaning in formulas, for example, s() for smoothing splines in formulas passed to gam.
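One way to check how R reads a formula is to expand it with terms and inspect the resulting term labels. The helper labels_of below is a hypothetical convenience wrapper, not part of R; the variables in the formulas do not need to exist for this to work.

```r
# Hypothetical helper: list the terms R derives from a formula.
labels_of <- function(f) attr(terms(f), "term.labels")

labels_of(y ~ (u + v) * w)    # "u" "v" "w" "u:w" "v:w"
labels_of(y ~ (u + w)^2)      # "u" "w" "u:w"
labels_of(y ~ I(a + b))       # "I(a + b)": a single term, the arithmetic sum

attr(terms(y ~ u + w + 0), "intercept")   # 0: no intercept term
all.vars(y ~ log(u) + sin(v) + w)         # "y" "u" "v" "w"
```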
Many important problems look at how a variable changes over time, and R includes a class to represent this data: time series objects. Regression functions for time series (like ar or arima) use time series objects. Additionally, many plotting functions in R have special methods for time series.
To create a time series object (of class “ts”), use the ts function:
ts(data = NA, start = 1, end = numeric(0), frequency = 1,
   deltat = 1, ts.eps = getOption("ts.eps"), class = , names = )
The data argument specifies the series of observations; the other arguments specify when the observations were taken. Here is a description of the arguments to ts.
| Argument | Description | Default |
|---|---|---|
| data | A vector or matrix representing a set of observations over time (usually numeric). | NA |
| start | A numeric vector with one or two elements representing the start of the time series. If one element is used, then it represents a “natural time unit.” If two elements are used, then it represents a “natural time unit” and an offset. | 1 |
| end | A numeric vector with one or two elements representing the end of the time series. (Represented the same way as start.) | numeric(0) |
| frequency | The number of observations per unit of time. | 1 |
| deltat | The fraction of the sampling period between observations; frequency=1/deltat. | 1 |
| ts.eps | Time series comparison tolerance. The frequency of two time series objects is considered equal if the difference is less than this amount. | getOption("ts.eps") |
| class | The class to be assigned to the result. | "ts" for a single series, c("mts", "ts") for multiple series |
| names | A character vector specifying the name of each series in a multiple series object. | colnames(data) when not null, otherwise "Series1", "Series2", … |
The print method for time series objects can print pretty results when used with units of months or quarters (this is enabled by default and is controlled with the calendar argument to print.ts; see the help file for more details). As an example, let’s create a time series representing eight consecutive quarters between Q2 2008 and Q1 2010:
ts(1:8,start=c(2008,2),frequency=4)
##      Qtr1 Qtr2 Qtr3 Qtr4
## 2008         1    2    3
## 2009    4    5    6    7
## 2010    8
With frequency=12, the same start value is interpreted as February 2008:
ts(1:28,start=c(2008,2),frequency=12)
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2008       1   2   3   4   5   6   7   8   9  10  11
## 2009  12  13  14  15  16  17  18  19  20  21  22  23
## 2010  24  25  26  27  28
As another example of a time series, we will look at the price of turkey. The U.S. Department of Agriculture has a program that collects data on the retail price of various meats. The data is taken from supermarkets representing approximately 20% of the U.S. market and then averaged by month and region. The turkey price data is included in the nutshell package as turkey.price.ts:
library(nutshell)
## Loading required package: nutshell.bbdb
## Loading required package: nutshell.audioscrobbler
data(turkey.price.ts)
turkey.price.ts
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2001 1.58 1.75 1.63 1.45 1.56 2.07 1.81 1.74 1.54 1.45 0.57 1.15
## 2002 1.50 1.66 1.34 1.67 1.81 1.60 1.70 1.87 1.47 1.59 0.74 0.82
## 2003 1.43 1.77 1.47 1.38 1.66 1.66 1.61 1.74 1.62 1.39 0.70 1.07
## 2004 1.48 1.48 1.50 1.27 1.56 1.61 1.55 1.69 1.49 1.32 0.53 1.03
## 2005 1.62 1.63 1.40 1.73 1.73 1.80 1.92 1.77 1.71 1.53 0.67 1.09
## 2006 1.71 1.90 1.68 1.46 1.86 1.85 1.88 1.86 1.62 1.45 0.67 1.18
## 2007 1.68 1.74 1.70 1.49 1.81 1.96 1.97 1.91 1.89 1.65 0.70 1.17
## 2008 1.76 1.78 1.53 1.90
R includes a variety of utility functions for looking at time series objects:
start(turkey.price.ts)
## [1] 2001 1
end(turkey.price.ts)
## [1] 2008 4
frequency(turkey.price.ts)
## [1] 12
deltat(turkey.price.ts)
## [1] 0.08333333
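The same utilities work on any ts object, so they can be tried without the nutshell package. The sketch below uses a small quarterly series (named q here for illustration) and also shows window, which extracts a sub-period.

```r
q <- ts(1:8, start = c(2008, 2), frequency = 4)

start(q)      # 2008 2
end(q)        # 2010 1
time(q)[1]    # 2008.25: Q2 2008 expressed as a decimal year

# Extract the observations from Q3 2008 through Q2 2009.
w <- window(q, start = c(2008, 3), end = c(2009, 2))
as.vector(w)  # 2 3 4 5
```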
R includes a set of classes for representing dates and times:
Date
Represents dates but not times.
POSIXct
Stores dates and times as seconds since January 1, 1970, 12:00 A.M. UTC.
POSIXlt
Stores dates and times as a list of separate vectors. The list includes sec (0-61, to allow for leap seconds), min (0-59), hour (0-23), mday (day of month, 1-31), mon (month, 0-11), year (years since 1900), wday (day of week, 0-6), yday (day of year, 0-365), and isdst (flag for “is daylight savings time”).
When possible, it’s a good idea to store date and time values as date objects, not as strings or numbers. There are many good reasons for this. First, manipulating dates as strings is difficult. The date and time classes include functions for addition and subtraction. For example:
date.I.started.writing <- as.Date("2/13/2009","%m/%d/%Y")
date.I.started.writing
## [1] "2009-02-13"
today <- Sys.Date()
today
## [1] "2019-06-14"
today - date.I.started.writing
## Time difference of 3773 days
Additionally, R includes a number of other functions for manipulating time and date objects. Many plotting functions require dates and times.
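A short sketch of the three classes in use, based on the date from the example above. The variable names d and lt are illustrative, and the time zone is set explicitly to UTC so the POSIXct value does not depend on the local machine.

```r
d <- as.Date("2009-02-13")

d + 30                                   # "2009-03-15": Date arithmetic is in days
as.numeric(as.Date("2019-06-14") - d)    # 3773

# POSIXlt exposes the components as list elements.
lt <- as.POSIXlt(d)
lt$year + 1900                           # 2009 (year counts from 1900)
lt$mon + 1                               # 2    (mon counts from 0)
lt$mday                                  # 13

# POSIXct is a count of seconds since the start of 1970 in UTC.
as.numeric(as.POSIXct("1970-01-02", tz = "UTC"))   # 86400
```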
Objects in R can have many properties associated with them, called attributes. These properties explain what an object represents and how it should be interpreted by R. Quite often, the only difference between two similar objects is that they have different attributes. Some important attributes are shown in the following table.
Table Common attributes
| Attribute | Description |
|---|---|
| class | The class of the object. |
| comment | A comment on the object; often a description of what the object means. |
| dim | Dimensions of the object. |
| dimnames | Names associated with each dimension of the object. |
| names | The names attribute of an object. Results depend on object type; for example, it holds the name of each data column in a data frame or each named object in an array. |
| row.names | The names of the rows of a data frame. |
| tsp | Start time, end time, and frequency of an object. Useful for time series data. |
| levels | Levels of a factor. |
Many objects in R are used to represent numerical data, in particular, arrays, matrices, and data frames. So, many common attributes refer to properties of these objects.
There is a standard way to query object attributes in R. For an object x and attribute a, you refer to the attribute through a(x). You can get a list of all attributes of an object using the attributes function. For example,
m <- matrix(data=1:12,nrow=4,ncol=3, dimnames=list(c("r1","r2","r3","r4"), c("c1","c2","c3")))
attributes(m)## $dim
## [1] 4 3
##
## $dimnames
## $dimnames[[1]]
## [1] "r1" "r2" "r3" "r4"
##
## $dimnames[[2]]
## [1] "c1" "c2" "c3"
The dim attribute shows the dimensions of the object, in this case four rows by three columns. The dimnames attribute is a two-element list, consisting of the names for each respective dimension of the object (rows then columns). It is possible to access each of these attributes directly, using the dim and dimnames functions, respectively:
dim(m)## [1] 4 3
dimnames(m)## [[1]]
## [1] "r1" "r2" "r3" "r4"
##
## [[2]]
## [1] "c1" "c2" "c3"
dimnames(m)[[1]]## [1] "r1" "r2" "r3" "r4"
There are convenience functions for accessing the row and column names:
colnames(m)## [1] "c1" "c2" "c3"
row.names(m)## [1] "r1" "r2" "r3" "r4"
It is possible to transform this matrix into another object class simply by changing the attributes. Specifically, we can remove the dimension attribute (by setting it to NULL), and the object will be transformed into a vector:
dim(m) <- NULL
class(m)
## [1] "integer"
typeof(m)
## [1] "integer"
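Besides the dedicated accessor functions, the general-purpose attr function reads or sets any attribute by name. In this sketch the attribute name "note" is a made-up example of a custom attribute.

```r
x <- 1:6

attr(x, "dim") <- c(2, 3)     # equivalent to dim(x) <- c(2, 3)
is.matrix(x)                  # TRUE: the dim attribute makes it a matrix

attr(x, "note") <- "demo"     # hypothetical custom attribute
names(attributes(x))          # includes "dim" and "note"

attr(x, "dim") <- NULL        # removing dim turns it back into a plain vector
is.matrix(x)                  # FALSE
```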
Let’s consider the following example. We’ll construct an array a, and define a vector with the same contents:
a <- array(1:12, dim = c(3,4))
b <- 1:12
You can use R’s bracket notation to refer to elements in a as a two-dimensional array, but you can’t refer to elements in b as a two-dimensional array, because b doesn’t have any dimensions assigned:
a[2,2]
## [1] 5
b[2,2]
Error in b[2, 2] : incorrect number of dimensions
At this point, you might wonder if R considers the two objects to be the same. Here’s what happens when you compare them with the == operator:
a == b## [,1] [,2] [,3] [,4]
## [1,] TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE
Notice what is returned: an array with the dimensions of a, where each cell shows the results of the comparison. There is a function in R called all.equal that compares the data and attributes of two objects to show if they’re “nearly” equal, and if they are not explains why:
all.equal(a,b)## [1] "Attributes: < Modes: list, NULL >"
## [2] "Attributes: < Lengths: 1, 0 >"
## [3] "Attributes: < names for target but not for current >"
## [4] "Attributes: < current is not list-like >"
## [5] "target is matrix, current is numeric"
If you just want to check whether two objects are exactly the same, but don’t care why, use the function identical:
identical(a,b)## [1] FALSE
By assigning a dimension attribute to b, b is transformed into an array and the two-dimensional data access tools will work. The all.equal function will also show that the two objects are equivalent:
dim(b) <- c(3,4)
b[2,2]## [1] 5
all.equal(a,b)## [1] TRUE
identical(a,b)## [1] TRUE
Functions are the R objects that evaluate a set of input arguments and return an output value. This chapter explains how to create and use functions in R.
In R, function objects are defined with this syntax:
function(arguments) body
where arguments is a set of symbol names (and, optionally, default values) that will be defined within the body of the function, and body is an R expression. Typically, the body is enclosed in curly braces, but it does not have to be if the body is a single expression. For example, the following two definitions are equivalent:
f <- function(x,y) x+y
f <- function(x,y) {x+y}
A function definition in R includes the names of arguments. Optionally, it may include default values. If you specify a default value for an argument, then the argument is considered optional:
f <- function(x,y) {x+y}
f(1,2)
## [1] 3
g <- function(x,y=10) {x+y}
g(1)
## [1] 11
If you do not specify a default value for an argument, and you do not specify a value when calling the function, you will get an error if the function attempts to use the argument:
f(1)
Error in f(1) : argument "y" is missing, with no default
Note that you will only get an error if you try to use the uninitialized argument within the function; you could easily write a function that simply doesn’t reference the argument, and it will work fine.
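This behavior follows from R evaluating arguments only when they are needed. A small sketch, using the hypothetical functions h and k (neither appears in the text above), shows a function that never touches its second argument and one that checks for it with missing before using it:

```r
# h never references y, so omitting y causes no error
h <- function(x, y) {x * 2}
h(1)      # 2

# k uses missing() to test whether y was supplied before using it
k <- function(x, y) {
  if (missing(y)) x else x + y
}
k(1)      # 1
k(1, 2)   # 3
```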
In a function call, you may override the default value:
g(1,2)
## [1] 3
In R, it is often convenient to specify a variable-length argument list. You might want to pass extra arguments to another function, or you may want to write a function that accepts a variable number of arguments. To do this in R, you specify an ellipsis (...) in the arguments to the function. You might remember from Chapter 7 that ... is a special type of object in R. The only place you can manipulate this object is inside the body of a function. In this context, it means “all the other arguments for the function.”
As an example, let’s create a function that prints the first argument and then passes all the other arguments to the summary function. To do this, we will create a function that takes one argument: x. The arguments specification also includes an ellipsis to indicate that the function takes other arguments. We can then call the summary function with the ellipsis as its argument:
v <- c(sqrt(1:100))
f <- function(x,...) {print(x); summary(...)}
f("Here is the summary for v.", v, digits=2)
## [1] "Here is the summary for v."
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 5.1 7.1 6.7 8.7 10.0
Notice that all of the arguments after x were passed to summary.
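The same pattern is often used to forward optional arguments to a single underlying function. The sketch below assumes a hypothetical wrapper named wrapped_mean; whatever is supplied after x is handed to mean unchanged:

```r
# wrapped_mean is a hypothetical wrapper; mean() receives everything in ...
wrapped_mean <- function(x, ...) {
  mean(x, ...)
}

values <- c(1, NA, 3)
wrapped_mean(values)                # NA, because mean() sees the missing value
wrapped_mean(values, na.rm = TRUE)  # 2, because na.rm is forwarded to mean()
```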
It is also possible to read the arguments from the variable-length argument list. To do this, you can convert the object ... to a list within the body of the function. As an example, let’s create a function that simply sums all its arguments:
addemup <- function(x,...) {
args <- list(...)
for (a in args) x <- x + a
x
}
addemup(1,1)
## [1] 2
addemup(1,2,3,4,5)
## [1] 15
You can also directly refer to items within the list ... through the variables ..1, ..2, ..3, etc. Use ..1 for the first item, ..2 for the second, and so on. Named arguments are valid symbols within the body of the function.
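As a short illustration of both ideas, the sketch below defines two hypothetical helpers: second_arg picks out one item of ... by position, and arg_names recovers the names that the caller supplied:

```r
# second_arg returns the second item passed through ...
second_arg <- function(...) {
  ..2
}
second_arg("a", "b", "c")        # "b"

# arg_names converts ... to a list and reads the argument names from it
arg_names <- function(...) {
  names(list(...))
}
arg_names(apple = 1, pear = 2)   # "apple" "pear"
```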
In an R function, you may use the return function to specify the value returned by the function. For example:
f <- function(x) {return(x^2 + 3)}
f(3)
## [1] 12
However, R will simply return the last evaluated expression as the result of a function. So, it is common to omit the return statement:
f <- function(x) {x^2 + 3}
f(3)
## [1] 12
In some cases, an explicit return value may lead to cleaner code.
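One such case is leaving a function early. The following sketch uses a hypothetical function, safe_sqrt, that returns immediately for negative input instead of nesting the rest of the body inside an else branch:

```r
safe_sqrt <- function(x) {
  # leave early for inputs where sqrt() would give NaN with a warning
  if (x < 0) return(NA_real_)
  sqrt(x)
}
safe_sqrt(9)    # 3
safe_sqrt(-1)   # NA
```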
Many functions in R can take other functions as arguments. For example, many modeling functions accept an optional argument that specifies how to handle missing values; this argument is usually a function for processing the input data.
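For instance, lm accepts a function through its na.action argument. The sketch below uses a small made-up data frame d and passes na.omit, which drops incomplete rows before the model is fitted:

```r
# d is a small made-up data frame with one missing x value
d <- data.frame(x = c(1, 2, NA, 4), y = c(2, 4, 6, 8))

# na.omit is itself a function; lm applies it to the model data before fitting
fit <- lm(y ~ x, data = d, na.action = na.omit)
nobs(fit)   # 3: the row containing NA was dropped
```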
As an example of a function that takes another function as an argument, let’s look at sapply. The sapply function iterates through each element in a vector, applying another function to each element in the vector, and returning the results. Here is a simple example:
a <- 1:7
sapply(a, sqrt)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
f <- function(x,y){return(x^y) }
sapply(a,2, FUN=f)
## [1] 1 4 9 16 25 36 49
This is a toy example; you could have calculated the same quantity with the expression sqrt(1:7). However, there are many useful functions that don’t work properly on a vector with more than one element; sapply provides a simple way to extend such a function to work on a vector. Related functions allow you to summarize every element in a data structure or to perform more complicated calculations.
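A function built around if is a typical case, because if expects a single condition. In the sketch below, parity is a hypothetical function written for one value at a time, and sapply extends it to a whole vector:

```r
# parity works on a single number, since `if` needs one TRUE/FALSE condition
parity <- function(x) {
  if (x %% 2 == 0) "even" else "odd"
}
sapply(1:4, parity)   # "odd" "even" "odd" "even"
```

Calling parity(1:4) directly would produce a warning or an error, depending on the version of R, because the condition would have length four.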
So far, we’ve mostly seen named functions in R. However, because functions are just objects in R, it is possible to create functions that do not have names. These are called anonymous functions. Anonymous functions are usually passed as arguments to other functions. Let’s start with a very simple example.
We will define a function that takes another function as its argument and then applies that function to the number 3. Let’s call the function apply.to.three, and we will call the argument f:
apply.to.three <- function(f) {f(3)}
Now, let’s call apply.to.three with an anonymous function assigned to argument f. As an example, let’s create a simple function that takes one argument and multiplies that argument by 7:
apply.to.three(function(x) {x * 7})
## [1] 21
Here’s how this works. When the R interpreter evaluates the expression apply.to.three(function(x) {x * 7}), it binds the anonymous function function(x) {x * 7} to the argument f. The interpreter then begins evaluating the expression f(3). The interpreter assigns 3 to the argument x for the anonymous function. Finally, the interpreter evaluates the expression 3 * 7 and returns the result.
Anonymous functions are a very powerful concept that is used in many places in R. Above, we used the sapply function to apply a named function to every element in an array. You can also pass an anonymous function as an argument to sapply:
a <- c(1, 2, 3, 4, 5)
sapply(a, function(x) {x+1})
## [1] 2 3 4 5 6
This family of functions is a good alternative to control structures.
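To make the comparison concrete, here is a sketch that computes the same squares twice, once with a for loop and once with sapply. The variable names are illustrative only:

```r
# Version 1: a for loop that fills a preallocated vector
squares_loop <- numeric(5)
for (i in 1:5) squares_loop[i] <- i^2

# Version 2: sapply with an anonymous function
squares_apply <- sapply(1:5, function(i) {i^2})

identical(squares_loop, squares_apply)   # TRUE
```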
By the way, it is possible to define an anonymous function and apply it directly to an argument. Here’s an example:
(function(x) {x+1})(1)
## [1] 2
Notice that the function object needs to be enclosed in parentheses. This is because function calls, expressions of the form f(arguments), have very high precedence in R.
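The sketch below shows what happens when the parentheses are left out. The name g is hypothetical; the point is that the trailing (1) is absorbed into the function body rather than being applied to the function:

```r
# Without parentheses, the call (1) becomes part of the function body
g <- function(x) {x + 1}(1)
is.function(g)   # TRUE: g is a function, not the number 2

# Calling g fails, because its body tries to call the value of {x + 1} as a function
result <- tryCatch(g(1), error = function(e) "error")
result           # "error"
```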
R includes a set of functions for getting more information about function objects. To see the set of arguments accepted by a function, use the args function. The args function returns a function object with NULL as the body. Here are a few examples:
args(sin)
## function (x)
## NULL
args(`?`)
## function (e1, e2)
## NULL
args(args)
## function (name)
## NULL
args(lm)
## function (formula, data, subset, weights, na.action, method = "qr",
## model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
## contrasts = NULL, offset, ...)
## NULL
If you would like to manipulate the list of arguments with R code, then you may find the formals function more useful. The formals function will return a pairlist object, with a pair for every argument. The name of each pair will correspond to each argument name in the function. When a default value is defined, the corresponding value in the pairlist will be set to that value. When no default is defined, the value will be the empty symbol, which prints as a blank line. The formals function is only available for functions written in R (objects of type closure); for primitive functions it returns NULL.
Here is a simple example of using formals to extract information about the arguments to a function:
f <- function(x,y=1,z=2) {x+y+z}
f.formals <- formals(f)
f.formals
## $x
##
##
## $y
## [1] 1
##
## $z
## [1] 2
f.formals$x
f.formals$y
## [1] 1
f.formals$z
## [1] 2
You may also use formals on the left-hand side of an assignment statement to change the formal argument for a function. For example:
f.formals$y <- 3
formals(f) <- f.formals
args(f)
## function (x, y = 3, z = 2)
## NULL
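The following sketch shows a few ways of reading argument information with formals. The function area is hypothetical; sum is used as an example of a primitive function, for which formals returns NULL and args can be used as an intermediate step:

```r
# area is a hypothetical closure with one required and one optional argument
area <- function(width, height = 1) {width * height}
names(formals(area))        # "width" "height"
formals(area)$height        # 1

# sum is a primitive, so formals() has nothing to return
is.null(formals(sum))       # TRUE

# args() returns a closure with the same argument list, which formals() can read
names(formals(args(sum)))   # "..." "na.rm"
```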
R provides a convenience function called alist to construct an argument list. You simply specify the argument list as if you were defining a function. (Note that for an argument with no default, you do not need to include a value but still need to include the equals sign.)
f <- function(x,y=1,z=2) {x + y + z}
formals(f) <- alist(x=,y=100,z=200)
f
## function (x, y = 100, z = 200)
## {
## x + y + z
## }
R provides a similar function called body that can be used to return the body of a function:
body(f)
## {
## x + y + z
## }
Like the formals function, the body function may be used on the left-hand side of an assignment statement:
f
## function (x, y = 100, z = 200)
## {
## x + y + z
## }
body(f) <- expression({x * y * z})
f
## function (x, y = 100, z = 200)
## {
## x * y * z
## }
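Note that the body of a function is a language object (here, a call to {). When you assign a new body, supply a language object, such as one built with quote; if you supply an expression vector, as above, its first element is used.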
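A quoted call built with quote is another way to supply a new body. The sketch below uses a hypothetical function p, so that f from the example above is left untouched, and inspects the stored body with typeof:

```r
p <- function(x, y) {x + y}
typeof(body(p))             # "language": the stored body is a call to `{`

body(p) <- quote({x * y})   # quote() builds the call without evaluating it
p(3, 4)                     # 12
```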
When you specify a function in R, you assign a name to each argument in the function. Inside the body of the function, you can access the arguments by name. For example, consider the following function definition:
addTheLog <- function(first, second) {first + log(second)}
This function takes two arguments, called first and second. Inside the body of the function, you can refer to the arguments by these names. When you call a function in R, you can specify the arguments in three different ways (in order of priority):
Exact names:
addTheLog(second=exp(4),first=1)
## [1] 5
Partially matching names:
addTheLog(s=exp(4),f=1)
## [1] 5
Argument order:
addTheLog(1,exp(4))
## [1] 5
When you are using generic functions, you cannot specify the argument name of the object on which the generic function is being called. You can still specify names for other arguments.
When possible, it’s a good practice to use exact argument names. Specifying full argument names does require extra typing, but it makes your code easier to read and removes ambiguity.
Partial names are best avoided because they can lead to confusion. As an example, consider the following function:
f <- function(arg1=10,arg2=20) {
print(paste("arg1:",arg1))
print(paste("arg2:",arg2))
}
When you call this function with one ambiguous argument, it will cause an error:
f(arg=1)
Error in f(arg = 1) : argument 1 matches multiple formal arguments
However, when you specify two arguments, the ambiguous argument could refer to either of the other arguments:
f(arg=1,arg2=2)
## [1] "arg1: 1"
## [1] "arg2: 2"
f(arg=1,arg1=2)
## [1] "arg1: 2"
## [1] "arg2: 1"
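Another situation where abbreviated names can surprise you involves arguments that come after ... in a definition: these are matched only by their exact names. The sketch below assumes a hypothetical function, scaled_sum, to show an abbreviated name being silently absorbed by ... instead:

```r
# scaled_sum is hypothetical; scale comes after ... in the argument list
scaled_sum <- function(..., scale = 1) {
  sum(...) * scale
}
scaled_sum(1, 2, scale = 10)   # 30: scale is matched by its exact name
scaled_sum(1, 2, sc = 10)      # 13: sc is not matched to scale, so it is summed via ...
```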
R provides the functions colSums(), colMeans(), cumsum() and cumprod(). Please write your own functions that do the same things.
The function mvrnorm() is in the R package MASS. The function chol() computes the Cholesky factorization of a real symmetric positive-definite square matrix. Please write a function that uses chol() to generate a dataset from the multivariate normal distribution with mean \(\mu\) and covariance matrix \(\Sigma\).