1 R objects

All objects in R are built on top of a basic set of built-in objects. The type of an object defines how it is stored in R. Objects in R are also members of a class. Classes define what information objects contain, and how those objects may be used.

The following table shows all of the built-in object types.

Table Primitive object types in R

Category	Object type	Description	Example
Vectors	integer	Naturally produced from sequences. Can be created with the integer() function or coerced with as.integer().	5:5, integer(5)
	double	Used to represent floating-point numbers (numbers with decimals and large numbers). On most modern platforms, this will be 8 bytes, or 64 bits. By default, most numerical values are represented as doubles. Can be created with the double() function or coerced with as.double().	1, -1, 2^50, double(5)
	complex	Complex numbers. To use, you must include both the real and the imaginary parts (even if the real part is 0).	2+3i, 0+1i, exp(0+1i * pi)
	character	A string of characters (just called a string in many other languages).	"Hello world."
	logical	Represents Boolean values.	TRUE, FALSE
	raw	A vector containing raw bytes. Useful for encoding objects from outside the R environment.	raw(8), charToRaw("Hello")
Compound	list	A (possibly heterogeneous) collection of other objects. Elements of a list may be named. Many other object types in R (such as data frames) are implemented as lists.	list(1, 2, "hat")
	pairlist	A data structure used to represent a set of name-value pairs. Pairlists are primarily used internally but can be created at the user level. Their use is deprecated in user-level programs, because standard list objects are just as efficient and more flexible.	.Options, pairlist(apple=1, pear=2, banana=3)
	S4	An R object supporting modern object-oriented paradigms (inheritance, methods, etc.).
	environment	An R environment describes the set of symbols available in a specific context. An environment contains a set of symbol-value pairs and a pointer to an enclosing environment.	.GlobalEnv, new.env(parent = baseenv())
Special	`any`	An object used to mean that “any” type is OK. Used to prevent coercion from one type to another. Useful in defining slots in S4 objects or signatures for generic functions. (For example, you could use any in the signature of a default generic function.)	setClass("Something", representation( data="ANY" ) )
	`NULL`	An object that means “there is no object.” Returned by functions and expressions whose value is not defined. The NULL object can have no attributes.	NULL
	`...`	Used in functions to implement variable-length argument lists, particularly arguments passed to other functions.	`N/A`
R language	symbol	A symbol is a language object that refers to other objects. Usually encountered when parsing R statements.	as.name(x), as.symbol(x), quote(x)
	promise	Promises are objects that are not evaluated when they are created but are instead evaluated when they are first used. They are used to implement delayed loading of objects in packages.	`> x <- 1;` `> y <- 2;` `> z <- 3` `> delayedAssign("v", c(x, y, z))` `> # v is a promise`
	language	R language objects are used when processing the R language itself.	quote(function(x) { x + 1})
	expression	An unevaluated R expression. Expression objects can be created with the expression function, and later evaluated with the eval function.	expression(1 + 2)
Functions	closure	An R function not implemented inside the R system. Most functions fall into this category. Includes user-defined functions, most functions included with R, and most functions in R packages.	f <- function(x) { x + 1}, print
	special	An internal function whose arguments are not necessarily evaluated on call.	`if`, `[`
	builtin	An internal function that evaluates its arguments.	`+`, `^`
Internal	char	A scalar “string” object. A character vector is composed of char’s. (Users can’t easily generate a char object but don’t ever need to.)	`N/A`
	bytecode	Byte-compiled code produced by R's byte-code compiler (internal).	`N/A`
	externalptr	External pointer. Used in C code.	`N/A`
	weakref	Weak reference (internal only).	`N/A`

To make them easier to understand, the object types can be grouped into a few categories:
Basic vectors

These are vectors containing a single type of value: integers, floating-point numbers, complex numbers, text, logical values, or raw data.
Compound objects

These objects are containers for the basic vectors: lists, pairlists, S4 objects, and environments. Each of these objects has unique properties (described below), but each of them contains a number of named objects.
Special objects

These objects serve a special purpose in R programming: any, NULL, and .... Each of these means something important in a specific context, but you would never create an object of these types.
R language

These are objects that represent R code; they can be evaluated to return other objects.
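As a quick check of the table above, the built-in typeof function reports the storage type of an object. The following is a minimal sketch; the example objects are chosen here for illustration, and each list element is named after the type it is expected to have:

```r
# One example object per type; the element names are the expected types
examples <- list(
  integer     = 1:3,
  double      = 1,
  complex     = 2+3i,
  character   = "hello",
  logical     = TRUE,
  raw         = raw(2),
  list        = list(1, "a"),
  environment = globalenv(),
  symbol      = quote(x),
  language    = quote(x + 1),
  expression  = expression(1 + 2),
  closure     = function(x) x + 1,
  special     = `if`,
  builtin     = `+`
)

# typeof() returns the internal type of each object as a string
types <- sapply(examples, typeof)
types
```

If the table is right, every value in types matches its name.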

2 Vectors and lists

Vectors and lists were introduced in Chapter 2 and Chapter 3.

Additional examples of vector operators

a=1:10
a^3

##  [1]    1    8   27   64  125  216  343  512  729 1000

c(1:10)^3

##  [1]    1    8   27   64  125  216  343  512  729 1000

a=1:5
b=2:6
c=seq(2,5,by=1)
a*b

## [1]  2  6 12 20 30

a^c

## Warning in a^c: longer object length is not a multiple of shorter object
## length

## [1]    1    8   81 1024   25

a%%b

## [1] 1 2 3 4 5

a%/%b

## [1] 0 0 0 0 0

a%*%t(b)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    3    4    5    6
## [2,]    4    6    8   10   12
## [3,]    6    9   12   15   18
## [4,]    8   12   16   20   24
## [5,]   10   15   20   25   30
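The warning from a^c above comes from R's recycling rule: when two vectors differ in length, the shorter one is repeated to match the longer one, and R warns only if the longer length is not a multiple of the shorter. A small sketch (the variable values here are chosen for illustration):

```r
a <- 1:6
b <- c(10, 20)

# b is recycled three times to match the length of a; no warning is raised
a + b                            # 11 22 13 24 15 26

# For two plain vectors, x %*% t(y) is an outer product;
# outer() produces the same matrix
x <- 1:5
y <- 2:6
all(x %*% t(y) == outer(x, y))   # TRUE
```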

3 Matrices

A matrix is an extension of a vector to two dimensions. A matrix is used to represent two-dimensional data of a single type. A clean way to generate a new matrix is with the matrix function. For example,

m <- matrix(data=1:12,nrow=4,ncol=3, dimnames=list(c("r1","r2","r3","r4"), c("c1","c2","c3")))
m

##    c1 c2 c3
## r1  1  5  9
## r2  2  6 10
## r3  3  7 11
## r4  4  8 12

It is also possible to transform another data structure into a matrix using the as.matrix function.

a = data.frame(1:5,seq(4,8))
is.matrix(a)

## [1] FALSE

b = as.matrix(a)
is.matrix(b)

## [1] TRUE

An important note: matrices are implemented as vectors, not as a vector of vectors (or as a list of vectors). Array subscripts are used for referencing elements and don’t reflect the way the data is stored. Unlike other classes, matrices don’t have an explicit class attribute.

In R, a matrix is just a vector that has dimensions. It may seem strange at first, but you can transform a vector into a matrix simply by giving it dimensions.

A <- 1:6
dim(A)

## NULL

A

## [1] 1 2 3 4 5 6

dim(A) <- c(2,3)
A

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
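To make the storage layout concrete, here is a small sketch showing that the elements are laid out column by column, and that a two-dimensional subscript is just a computed position in the underlying vector:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

# Elements are stored column by column, so as.vector() returns 1:6
as.vector(m)

# Element [i, j] sits at position (j - 1) * nrow(m) + i of the vector
i <- 2; j <- 2
m[i, j] == m[(j - 1) * nrow(m) + i]   # TRUE

# The only attribute is dim; there is no explicit class attribute
names(attributes(m))                  # "dim"
```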

A matrix can be created from a list, too. Like a vector, a list has a dim attribute, which is initially NULL:

B <- list(1,2,3,4,5,6)
dim(B)

## NULL

B

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5
## 
## [[6]]
## [1] 6

dim(B) <- c(2,3)
B

##      [,1] [,2] [,3]
## [1,] 1    3    5   
## [2,] 2    4    6
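Because the storage is still a list, a matrix built this way can hold elements of different types. A brief sketch, with illustrative values:

```r
B <- list(1, "a", TRUE, 2, "b", FALSE)
dim(B) <- c(2, 3)

is.matrix(B)   # TRUE
typeof(B)      # "list": the storage is still a list
B[[2, 1]]      # "a": use [[ ]] to extract a single element
```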

Selecting one row or column from a matrix

If you want the result to be a simple vector, just use normal indexing:

B <- matrix(1:6, 2, 3)
B[1,] # First row

## [1] 1 3 5

B[,3] # Third column

## [1] 5 6

If you want the result to be a one-row matrix or a one-column matrix, then include the drop=FALSE argument:

B[1,,drop=FALSE] # First row in a one-row matrix

##      [,1] [,2] [,3]
## [1,]    1    3    5

B[,3,drop=FALSE] # Third column in a one-column matrix

##      [,1]
## [1,]    5
## [2,]    6

Loop over matrix elements

If you want to loop over elements in a matrix (columns and rows), then you will have to use nested loops.

# Define corr
corr <- matrix(c(1.00, 0.96, 0.88, 0.96, 1.00, 0.74, 0.88, 0.74, 1.00), 3, 3)
row.names(corr) <- c("apple", "ibm", "micr")
colnames(corr) <- c("apple", "ibm", "micr")

# Print out corr
corr

##       apple  ibm micr
## apple  1.00 0.96 0.88
## ibm    0.96 1.00 0.74
## micr   0.88 0.74 1.00

# Create a nested loop
for(row in 1:nrow(corr)) {
    for(col in 1:ncol(corr)) {
        print(paste(colnames(corr)[col], "and", rownames(corr)[row], 
                    "have a correlation of", corr[row,col]))
    }
}

## [1] "apple and apple have a correlation of 1"
## [1] "ibm and apple have a correlation of 0.96"
## [1] "micr and apple have a correlation of 0.88"
## [1] "apple and ibm have a correlation of 0.96"
## [1] "ibm and ibm have a correlation of 1"
## [1] "micr and ibm have a correlation of 0.74"
## [1] "apple and micr have a correlation of 0.88"
## [1] "ibm and micr have a correlation of 0.74"
## [1] "micr and micr have a correlation of 1"
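Nested loops are not the only option. As an alternative sketch (not the approach used above), the same messages can be built without explicit loops by enumerating the row and column names with expand.grid and indexing the matrix with a two-column matrix of names:

```r
corr <- matrix(c(1.00, 0.96, 0.88, 0.96, 1.00, 0.74, 0.88, 0.74, 1.00), 3, 3,
               dimnames = list(c("apple", "ibm", "micr"),
                               c("apple", "ibm", "micr")))

# All (column, row) combinations; the first variable varies fastest,
# which matches the inner loop over columns
pairs <- expand.grid(col = colnames(corr), row = rownames(corr),
                     stringsAsFactors = FALSE)

# A two-column character matrix looks up one cell per (row, col) pair
msgs <- paste(pairs$col, "and", pairs$row, "have a correlation of",
              corr[cbind(pairs$row, pairs$col)])
msgs
```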

Matrix arithmetic

A <- matrix(1:6, 2, 3)
B <- matrix(7:12, 2, 3)
C <- c(13:15)
A

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

B

##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

t(B)

##      [,1] [,2]
## [1,]    7    8
## [2,]    9   10
## [3,]   11   12

A+B

##      [,1] [,2] [,3]
## [1,]    8   12   16
## [2,]   10   14   18

A-B

##      [,1] [,2] [,3]
## [1,]   -6   -6   -6
## [2,]   -6   -6   -6

A*B

##      [,1] [,2] [,3]
## [1,]    7   27   55
## [2,]   16   40   72

A%*%t(B)

##      [,1] [,2]
## [1,]   89   98
## [2,]  116  128

A%*%C

##      [,1]
## [1,]  130
## [2,]  172

A/B

##           [,1]      [,2]      [,3]
## [1,] 0.1428571 0.3333333 0.4545455
## [2,] 0.2500000 0.4000000 0.5000000

x = c(1,2,3)
y = diag(x)
y

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    2    0
## [3,]    0    0    3

solve(y)

##      [,1] [,2]      [,3]
## [1,]    1  0.0 0.0000000
## [2,]    0  0.5 0.0000000
## [3,]    0  0.0 0.3333333
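With one argument, solve returns the inverse of a matrix; with two, it solves a linear system. The sketch below checks the inverse computed above and solves a small system whose coefficients are made up for illustration:

```r
y <- diag(c(1, 2, 3))

# A matrix times its inverse gives the identity (up to floating-point error)
all.equal(y %*% solve(y), diag(3))   # TRUE

# solve(M, rhs) solves M %*% x = rhs directly, without forming the inverse
M   <- matrix(c(2, 1, 1, 3), 2, 2)
rhs <- c(3, 5)
x   <- solve(M, rhs)
x                                    # 0.8 1.4
```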

Matrix decomposition

A <- matrix(1:9, 3, 3)
A

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

s <- svd(A)
s

## $d
## [1] 1.684810e+01 1.068370e+00 5.543107e-16
## 
## $u
##            [,1]        [,2]       [,3]
## [1,] -0.4796712  0.77669099  0.4082483
## [2,] -0.5723678  0.07568647 -0.8164966
## [3,] -0.6650644 -0.62531805  0.4082483
## 
## $v
##            [,1]       [,2]       [,3]
## [1,] -0.2148372 -0.8872307  0.4082483
## [2,] -0.5205874 -0.2496440 -0.8164966
## [3,] -0.8263375  0.3879428  0.4082483

s$u %*% diag(s$d) %*% t(s$v)

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

s <- eigen(A)
s

## eigen() decomposition
## $values
## [1]  1.611684e+01 -1.116844e+00 -5.700691e-16
## 
## $vectors
##            [,1]       [,2]       [,3]
## [1,] -0.4645473 -0.8829060  0.4082483
## [2,] -0.5707955 -0.2395204 -0.8164966
## [3,] -0.6770438  0.4038651  0.4082483

d <- s$values
v <- s$vectors
v %*% diag(d) %*% solve(v)

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
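Note that the third singular value and the third eigenvalue above are zero up to rounding error, which indicates that A is singular, so solve(A) is expected to fail for this matrix. One way to check the numerical rank is sketched below; the tolerance is a common convention and is an assumption here, not something fixed by R:

```r
A <- matrix(1:9, 3, 3)
d <- svd(A)$d

# Count the singular values that exceed a small tolerance
tol <- max(dim(A)) * max(d) * .Machine$double.eps
sum(d > tol)    # 2

# The QR decomposition reports the same rank
qr(A)$rank      # 2
```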

4 Other objects

There are some other objects that you should know about if you’re using R. Although most of these objects are not formally part of the R language, they are used in so many R packages, or get special treatment in R, that they’re worth a closer look.

4.1 Arrays

An array is an extension of a vector to more than two dimensions. Arrays are used to represent multidimensional data of a single type. As with matrices, you can generate an array with the array function:

a <- array(data=1:24,dim=c(3,4,2))
a

## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24

Like matrices, the underlying storage mechanism for an array is a vector. (Like matrices, and unlike most other classes, arrays don’t have an explicit class attribute.)
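Array elements are referenced with one subscript per dimension. A short sketch using the same array as above:

```r
a <- array(data = 1:24, dim = c(3, 4, 2))

dim(a)             # 3 4 2
a[2, 3, 1]         # 8: row 2, column 3 of the first slice
a[, , 2]           # the second slice, a 3 x 4 matrix
apply(a, 3, sum)   # 78 222: the sum of each slice
```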

4.2 Factors

When analyzing data, it’s quite common to encounter categorical values. For example, suppose that you have a set of observations about people that includes eye color. You could represent the eye colors as a character vector:

eye.colors <- c("brown","blue","blue","green","brown","brown","brown")

This is a perfectly valid way to represent the information, but it can become inefficient if you are working with large names or a large number of observations. R provides a better way to represent categorical values, by using factors. A factor is an ordered collection of items. The different values that the factor can take are called levels.

eye.colors <- factor(c("brown", "blue", "blue", "green", "brown", "brown", "brown"))
levels(eye.colors) # The levels function shows all the levels from a factor

## [1] "blue"  "brown" "green"

Printing a factor shows slightly different information than printing a character vector. In particular, notice that the quotes are not shown and that the levels are explicitly printed:

eye.colors

## [1] brown blue  blue  green brown brown brown
## Levels: blue brown green

In the eye color example, order did not matter. However, sometimes the order of the factors matters for a specific problem. For example, suppose that you had conducted a survey and asked respondents how they felt about the statement “melon is delicious with an omelet.” Furthermore, suppose that you allowed respondents to give the following responses: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree.

There are multiple ways to represent this information in R. You could code these as integers (for example, on a scale of 1 to 5), although this approach has some drawbacks. This approach implies a specific quantitative relationship between values, which may or may not make sense. For example, is the difference between Strongly Disagree and Disagree the same as the difference between Disagree and Neutral? A numeric response also implies that you can calculate meaningful statistics based on the responses. Can you be sure that a Disagree response and an Agree response average out to Neutral?

To get around these problems, you can use an ordered factor to represent the response of this survey. Here is an example:

survey.results <- factor( c("Disagree", "Neutral", "Strongly Disagree", "Neutral", "Agree", "Strongly Agree", "Disagree", "Strongly Agree", "Neutral", "Strongly Disagree", "Neutral", "Agree"), levels=c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"), ordered=TRUE)
survey.results

##  [1] Disagree          Neutral           Strongly Disagree
##  [4] Neutral           Agree             Strongly Agree   
##  [7] Disagree          Strongly Agree    Neutral          
## [10] Strongly Disagree Neutral           Agree            
## 5 Levels: Strongly Disagree < Disagree < Neutral < ... < Strongly Agree

Factors are implemented internally using integers. The levels attribute maps each integer to a factor level. Integers take up a small, fixed amount of storage space, so they can be more space efficient than character vectors. It’s possible to take a factor and turn it into an integer vector:

eye.colors

## [1] brown blue  blue  green brown brown brown
## Levels: blue brown green

class(eye.colors)

## [1] "factor"

eye.colors.integer.vector <- unclass(eye.colors)
eye.colors.integer.vector

## [1] 2 1 1 3 2 2 2
## attr(,"levels")
## [1] "blue"  "brown" "green"

class(eye.colors.integer.vector)

## [1] "integer"

It’s possible to change this back to a factor by setting the class attribute:

class(eye.colors.integer.vector) <- "factor"
eye.colors.integer.vector

## [1] brown blue  blue  green brown brown brown
## Levels: blue brown green

class(eye.colors.integer.vector)

## [1] "factor"
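A few everyday operations on factors are sketched below. The sizes factor is a made-up example; the comparison and max calls work only because the factor is ordered:

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"), ordered = TRUE)

table(sizes)          # counts per level: small 2, medium 1, large 1
as.integer(sizes)     # 1 3 2 1: the underlying integer codes
sizes < "large"       # TRUE FALSE TRUE TRUE
max(sizes)            # large
```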

4.3 Data frames

In business contexts, data is often kept in database tables. Each table has many rows, which may consist of multiple “columns” representing different quantities and which may be kept in multiple formats. A data frame is a natural way to represent these data sets in R.

If your data is captured in several vectors and/or factors, use the data.frame function to assemble them into a data frame:

v1 = c(1:5)
v2 = c(6:10)
v3 = c(11:15)
dfrm1 <- data.frame(v1, v2, v3)
dfrm1

v1	v2	v3
1	6	11
2	7	12
3	8	13
4	9	14
5	10	15

If your data is captured in a list that contains vectors and/or factors, use instead as.data.frame:

list.of.vectors = list(a = v1, b = v2, c = v3)
dfrm2 <- as.data.frame(list.of.vectors)
dfrm2

a	b	c
1	6	11
2	7	12
3	8	13
4	9	14
5	10	15

A data frame represents a table of data. Each column may be a different type, but all columns must have the same length (the number of rows):

data.frame(a=c(1,2,3,4,5),b=c(1,2,3,4))

Error in data.frame(a = c(1, 2, 3, 4, 5), b = c(1, 2, 3, 4)) : arguments imply differing number of rows: 5, 4
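By contrast, columns of different types but equal length combine without complaint. A minimal sketch, with hypothetical column names and values:

```r
people <- data.frame(id   = 1:3,
                     name = c("Ann", "Bob", "Cy"),
                     ok   = c(TRUE, FALSE, TRUE),
                     stringsAsFactors = FALSE)

sapply(people, class)   # id: "integer", name: "character", ok: "logical"
dim(people)             # 3 3
is.list(people)         # TRUE: a data frame is a list of equal-length columns
```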

Initializing a data frame from row data

Imagine that your data is organized by rows, and you want to assemble it into a data frame. You can store each row in a one-row data frame, collect the one-row data frames in a list, and then use rbind and do.call to bind the rows into one large data frame:
```
Row1 = data.frame(city="Chicago", county="Cook", state="IL", pop=2853114)
Row2 = data.frame(city="Kenosha", county="Kenosha", state="WI", pop=5428)
Row3 = data.frame(city="Aurora", county="Kane", state="IL", pop=171782)
obs = list(Row1, Row2, Row3)
obs[[1]]
```
city	county	state	pop
Chicago	Cook	IL	2853114
```
rbind(obs[[1]], obs[[2]])
```
city	county	state	pop
Chicago	Cook	IL	2853114
Kenosha	Kenosha	WI	5428
```
dfrm <- do.call(rbind, obs)
dfrm
```
city	county	state	pop
Chicago	Cook	IL	2853114
Kenosha	Kenosha	WI	5428
Aurora	Kane	IL	171782

Appending rows to a data frame
```
newRow <- data.frame(city="West Dundee", county="Kane", state="IL", pop=5428)
suburbs <- rbind(dfrm, newRow)
suburbs
```
city	county	state	pop
Chicago	Cook	IL	2853114
Kenosha	Kenosha	WI	5428
Aurora	Kane	IL	171782
West Dundee	Kane	IL	5428

Preallocating a data frame

Imagine that you are building a data frame row by row and want to preallocate the space instead of appending rows incrementally. You can create the data frame from generic vectors and factors using functions such as numeric(n) and character(n); a factor column can start out filled with NA values, with its levels declared up front:
```
N <- 10
dfrm <- data.frame(a=factor(rep(NA, N), levels=c("NJ", "IL", "CA")),
                   b=character(N),
                   c=numeric(N) )
dfrm
```
a	b	c
NA		0
NA		0
NA		0
NA		0
NA		0
NA		0
NA		0
NA		0
NA		0
NA		0

subset

You can use the subset function to select rows and columns from a data frame or matrix. The select argument is a column name, or a vector of column names, to be selected:
```
v1 = c(1:5)
v2 = c(6:10)
v3 = c(11:15)
dfrm <- data.frame(v1, v2, v3)
subset(dfrm, select = c(v1, v3))
```
v1	v3
1	11
2	12
3	13
4	14
5	15
```
subset(dfrm, subset=(v1 > 3))
```
	v1	v2	v3
4	4	9	14
5	5	10	15
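The same selections can also be written with bracket indexing, and both arguments of subset can be combined in one call. A small sketch using the same data (the names rows_a, rows_b, cols_b, and both are introduced here for illustration):

```r
dfrm <- data.frame(v1 = 1:5, v2 = 6:10, v3 = 11:15)

rows_a <- subset(dfrm, subset = (v1 > 3))
rows_b <- dfrm[dfrm$v1 > 3, ]            # same rows with bracket indexing
cols_b <- dfrm[, c("v1", "v3")]          # same columns as select = c(v1, v3)

# Row and column selection combined in one call
both <- subset(dfrm, subset = (v1 > 3), select = c(v1, v3))
both
```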

Changing the names of data frame columns

The colnames function gets or sets the vector of column names of a data frame. You can update individual names or the entire vector:
```
dfrm
```
v1	v2	v3
1	6	11
2	7	12
3	8	13
4	9	14
5	10	15
```
colnames(dfrm) <- c('a','b','c')
dfrm
```
a	b	c
1	6	11
2	7	12
3	8	13
4	9	14
5	10	15

Removing NAs from a data frame

You can use na.omit to remove rows that contain any NA values.
```
dfrm[1,] = c(1,2,NA)
dfrm
```
a	b	c
1	2	NA
2	7	12
3	8	13
4	9	14
5	10	15
```
clean <- na.omit(dfrm)
clean
```
	a	b	c
2	2	7	12
3	3	8	13
4	4	9	14
5	5	10	15
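To see which rows would be dropped before removing them, complete.cases returns one logical value per row. A brief sketch on a smaller, made-up data frame:

```r
dfrm <- data.frame(a = c(1, 2, 3), b = c(2, 7, 8), c = c(NA, 12, 13))

complete.cases(dfrm)           # FALSE TRUE TRUE
colSums(is.na(dfrm))           # a 0, b 0, c 1: NA count per column
dfrm[complete.cases(dfrm), ]   # the same rows that na.omit(dfrm) keeps
```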

Combining two data frames

You can combine the contents of two data frames into one data frame, stacking rows with rbind or placing columns side by side with cbind:
```
dfrm1 <- data.frame(a = 1:5, b = 6:10, c = 11:15)
dfrm2 <- data.frame(a = -1:-5, b = -6:-10, c = -11:-15)
rbind(dfrm1, dfrm2)
```
a	b	c
1	6	11
2	7	12
3	8	13
4	9	14
5	10	15
-1	-6	-11
-2	-7	-12
-3	-8	-13
-4	-9	-14
-5	-10	-15
```
cbind(dfrm1, dfrm2)
```
a	b	c	a	b	c
1	6	11	-1	-6	-11
2	7	12	-2	-7	-12
3	8	13	-3	-8	-13
4	9	14	-4	-9	-14
5	10	15	-5	-10	-15

Loop over data frame rows

Imagine that you are interested in the days where the stock price of Apple rises above 116. If it goes above this value, you want to print out the current date and stock price.

# Define stock
date <- seq(from = as.Date("2016-12-01"), to = as.Date("2016-12-30"), by = "days")
date <- date[-c(3,4,10,11,17,18,24,25,26)]
apple <- c(109.49, 109.90, 109.11, 109.95, 111.03, 112.12, 113.95, 113.30,
           115.19, 115.19, 115.82, 115.97, 116.64, 116.95, 117.06, 116.29, 116.52,
           117.26, 116.76, 116.73, 115.82)
stock <- data.frame(date = date, apple = apple)

# Loop over stock rows
for (row in 1:nrow(stock)) {
    price <- stock[row, "apple"]
    date  <- stock[row, "date"]

    if(price > 116) {
        print(paste("On", date, 
                    "the stock price was", price))
    } else {
        print(paste("The date:", date, 
                    "is not an important day!"))
    }
}

## [1] "The date: 2016-12-01 is not an important day!"
## [1] "The date: 2016-12-02 is not an important day!"
## [1] "The date: 2016-12-05 is not an important day!"
## [1] "The date: 2016-12-06 is not an important day!"
## [1] "The date: 2016-12-07 is not an important day!"
## [1] "The date: 2016-12-08 is not an important day!"
## [1] "The date: 2016-12-09 is not an important day!"
## [1] "The date: 2016-12-12 is not an important day!"
## [1] "The date: 2016-12-13 is not an important day!"
## [1] "The date: 2016-12-14 is not an important day!"
## [1] "The date: 2016-12-15 is not an important day!"
## [1] "The date: 2016-12-16 is not an important day!"
## [1] "On 2016-12-19 the stock price was 116.64"
## [1] "On 2016-12-20 the stock price was 116.95"
## [1] "On 2016-12-21 the stock price was 117.06"
## [1] "On 2016-12-22 the stock price was 116.29"
## [1] "On 2016-12-23 the stock price was 116.52"
## [1] "On 2016-12-27 the stock price was 117.26"
## [1] "On 2016-12-28 the stock price was 116.76"
## [1] "On 2016-12-29 the stock price was 116.73"
## [1] "The date: 2016-12-30 is not an important day!"
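For comparison, the same result can be obtained without an explicit loop. The sketch below is an alternative to the loop above, not a replacement for it, and uses a three-day excerpt of the data to stay self-contained:

```r
# Three-day excerpt of the stock data, used only to keep the sketch short
stock <- data.frame(date  = as.Date(c("2016-12-16", "2016-12-19", "2016-12-20")),
                    apple = c(115.97, 116.64, 116.95))

# Logical indexing keeps only the rows above the threshold
high <- stock[stock$apple > 116, ]

# ifelse() builds one message per row without an explicit loop
msgs <- ifelse(stock$apple > 116,
               paste("On", stock$date, "the stock price was", stock$apple),
               paste("The date:", stock$date, "is not an important day!"))
msgs
```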

4.4 Formulas

Very often, you need to express a relationship between variables. Sometimes, you want to plot a chart showing the relationship between the two variables. R provides a formula class that lets you describe the relationship for both purposes.

Let’s create a formula as an example:

sample.formula <- as.formula(y~x1+x2+x3)
class(sample.formula)

## [1] "formula"

typeof(sample.formula)

## [1] "language"

This formula means “y is a function of x1, x2, and x3.” Some R functions use more complicated formulas. Here is an explanation of the meaning of different items in formulas:

Variable names

Represent variable names.
Tilde (~)

Used to show the relationship between the response variables (to the left) and the stimulus variables (to the right).
Plus sign (+)

Used to express a linear relationship between variables.
Zero (0)

When added to a formula, indicates that no intercept term should be included. For example:

y~u+w+v+0
Vertical bar (|)

Used to specify conditioning variables (in lattice formulas).
Identity function (I())

Used to indicate that the enclosed expression should be interpreted by its arithmetic meaning. For example: a+b means that both a and b should be included in the formula. The formula: I(a+b) means that “a plus b” should be included in the formula.
Asterisk (*)

Used to indicate interactions between variables. For example: y~(u+v)*w is equivalent to: y~u+v+w+u:w+v:w, where u:w denotes the interaction between u and w.
Caret (^)

Used to indicate crossing to a specific degree. For example: y~(u+w)^2 is equivalent to: y~(u+w)*(u+w)
Function of variables

Indicates that the function of the specified variables should be interpreted as a variable. For example: y~log(u)+sin(v)+w

Some additional items have special meaning in formulas, for example, s() for smoothing splines in formulas passed to gam.
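One way to see how R expands a formula is to inspect it with terms, which works on the formula alone and needs no data. In the sketch below, interactions appear in the term labels with a colon (for example, u:w):

```r
# terms() expands a formula without needing any data
attr(terms(y ~ (u + v) * w), "term.labels")   # "u" "v" "w" "u:w" "v:w"
attr(terms(y ~ (u + w)^2), "term.labels")     # "u" "w" "u:w"
attr(terms(y ~ u + w + v + 0), "intercept")   # 0: no intercept term

# all.vars() lists the variables that appear in a formula
all.vars(y ~ log(u) + sin(v) + w)             # "y" "u" "v" "w"
```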

4.5 Time series

Many important problems look at how a variable changes over time, and R includes a class to represent this data: time series objects. Regression functions for time series (like ar or arima) use time series objects. Additionally, many plotting functions in R have special methods for time series.

To create a time series object (of class “ts”), use the ts function:

ts(data = NA, start = 1, end = numeric(0), frequency = 1,
   deltat = 1, ts.eps = getOption("ts.eps"), class = , names = )

The data argument specifies the series of observations; the other arguments specify when the observations were taken. Here is a description of the arguments to ts.

Argument	Description	Default
data	A vector or matrix representing a set of observations over time (usually numeric).	`NA`
start	A numeric vector with one or two elements representing the start of the time series. If one element is used, then it represents a “natural time unit.” If two elements are used, then it represents a “natural time unit” and an offset.	`1`
end	A numeric vector with one or two elements representing the end of the time series. (Represented the same way as start.)	`numeric(0)`
frequency	The number of observations per unit of time.	`1`
deltat	The fraction of the sampling period between observations; `frequency=1/deltat`.	`1`
ts.eps	Time series comparison tolerance. The frequency of two time series objects is considered equal if the difference is less than this amount.	`getOption("ts.eps")`
class	The class to be assigned to the result.	“ts” for a single series, c(“mts”, “ts”) for multiple series
names	A character vector specifying the name of each series in a multiple series object.	colnames(data) when not null, otherwise “Series1”, “Series2”, …

The print method for time series objects can print pretty results when used with units of months or quarters (this is enabled by default and is controlled with the calendar argument to print.ts; see the help file for more details). As an example, let’s create a time series representing eight consecutive quarters between Q2 2008 and Q1 2010:

ts(1:8,start=c(2008,2),frequency=4)

##      Qtr1 Qtr2 Qtr3 Qtr4
## 2008         1    2    3
## 2009    4    5    6    7
## 2010    8

ts(1:28,start=c(2008,2),frequency=12)

##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2008       1   2   3   4   5   6   7   8   9  10  11
## 2009  12  13  14  15  16  17  18  19  20  21  22  23
## 2010  24  25  26  27  28

As another example of a time series, we will look at the price of turkey. The U.S. Department of Agriculture has a program that collects data on the retail price of various meats. The data is taken from supermarkets representing approximately 20% of the U.S. market and then averaged by month and region. The turkey price data is included in the nutshell package as turkey.price.ts:

library(nutshell)

## Loading required package: nutshell.bbdb

## Loading required package: nutshell.audioscrobbler

data(turkey.price.ts)
turkey.price.ts

##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 2001 1.58 1.75 1.63 1.45 1.56 2.07 1.81 1.74 1.54 1.45 0.57 1.15
## 2002 1.50 1.66 1.34 1.67 1.81 1.60 1.70 1.87 1.47 1.59 0.74 0.82
## 2003 1.43 1.77 1.47 1.38 1.66 1.66 1.61 1.74 1.62 1.39 0.70 1.07
## 2004 1.48 1.48 1.50 1.27 1.56 1.61 1.55 1.69 1.49 1.32 0.53 1.03
## 2005 1.62 1.63 1.40 1.73 1.73 1.80 1.92 1.77 1.71 1.53 0.67 1.09
## 2006 1.71 1.90 1.68 1.46 1.86 1.85 1.88 1.86 1.62 1.45 0.67 1.18
## 2007 1.68 1.74 1.70 1.49 1.81 1.96 1.97 1.91 1.89 1.65 0.70 1.17
## 2008 1.76 1.78 1.53 1.90

R includes a variety of utility functions for looking at time series objects:

start(turkey.price.ts)

## [1] 2001    1

end(turkey.price.ts)

## [1] 2008    4

frequency(turkey.price.ts)

## [1] 12

deltat(turkey.price.ts)

## [1] 0.08333333
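Other helpers extract the time axis or a sub-range of a series. A short sketch on the quarterly series created earlier, since the turkey price data requires the nutshell package:

```r
x <- ts(1:8, start = c(2008, 2), frequency = 4)

time(x)[1]    # 2008.25: Q2 2008 expressed as a decimal year
cycle(x)      # the quarter (1 to 4) of each observation

# window() extracts the observations between two time points
w <- window(x, start = c(2009, 1), end = c(2009, 4))
w             # 4 5 6 7: the four quarters of 2009
```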

4.6 Dates and times

R includes a set of classes for representing dates and times:

Date

Represents dates but not times.
POSIXct

Stores dates and times as seconds since midnight (12:00 A.M.) UTC on January 1, 1970.
POSIXlt

Stores dates and times as a list of separate vectors. The list includes sec (0-61, to allow for leap seconds), min (0-59), hour (0-23), mday (day of month, 1-31), mon (month, 0-11), year (years since 1900), wday (day of week, 0-6), yday (day of year, 0-365), and isdst (flag for “is daylight saving time”).

When possible, it’s a good idea to store date and time values as date objects, not as strings or numbers. There are many good reasons for this. First, manipulating dates as strings is difficult. The date and time classes include functions for addition and subtraction. For example:

date.I.started.writing <- as.Date("2/13/2009","%m/%d/%Y")
date.I.started.writing

## [1] "2009-02-13"

today <- Sys.Date()
today

## [1] "2019-06-14"

today - date.I.started.writing

## Time difference of 3773 days

Additionally, R includes a number of other functions for manipulating time and date objects. Many plotting functions require dates and times.
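A few of these functions are sketched below. The dates are the ones used above, and the time of day in the POSIXlt example is made up for illustration:

```r
d <- as.Date("2009-02-13")

d + 30                                   # "2009-03-15": adding a number adds days
format(d, "%Y/%m")                       # "2009/02"
seq(d, by = "month", length.out = 3)     # the 13th of Feb, Mar, and Apr 2009

# difftime() lets you choose the units of the difference
difftime(as.Date("2019-06-14"), d, units = "weeks")   # 539 weeks

# POSIXlt exposes the individual components of a date-time
lt <- as.POSIXlt("2009-02-13 10:30:00", tz = "UTC")
c(lt$year + 1900, lt$mon + 1, lt$mday, lt$hour)       # 2009 2 13 10
```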

4.7 Attributes

Objects in R can have many properties associated with them, called attributes. These properties explain what an object represents and how it should be interpreted by R. Quite often, the only difference between two similar objects is that they have different attributes. Some important attributes are shown in the following table.

Table Common attributes

Attribute	Description
class	The class of the object.
comment	A comment on the object; often a description of what the object means.
dim	Dimensions of the object.
dimnames	Names associated with each dimension of the object.
names	The names attribute of an object. Results depend on object type; for example, returns the name of each data column in a data frame or each named object in an array.
row.names	The names of the rows of an object (for example, a data frame).
tsp	Start time, end time, and frequency of an object. Useful for time series data.
levels	Levels of a factor.

Many objects in R are used to represent numerical data, in particular, arrays, matrices, and data frames. So, many common attributes refer to properties of these objects.

There is a standard way to query object attributes in R. For an object x and attribute a, you refer to the attribute through a(x). You can get a list of all attributes of an object using the attributes function. For example,

m <- matrix(data=1:12,nrow=4,ncol=3, dimnames=list(c("r1","r2","r3","r4"), c("c1","c2","c3")))
attributes(m)

## $dim
## [1] 4 3
## 
## $dimnames
## $dimnames[[1]]
## [1] "r1" "r2" "r3" "r4"
## 
## $dimnames[[2]]
## [1] "c1" "c2" "c3"

The dim attribute shows the dimensions of the object, in this case four rows by three columns. The dimnames attribute is a two-element list, consisting of the names for each respective dimension of the object (rows then columns). It is possible to access each of these attributes directly, using the dim and dimnames functions, respectively:

dim(m)

## [1] 4 3

dimnames(m)

## [[1]]
## [1] "r1" "r2" "r3" "r4"
## 
## [[2]]
## [1] "c1" "c2" "c3"

dimnames(m)[[1]]

## [1] "r1" "r2" "r3" "r4"

There are convenience functions for accessing the row and column names:

colnames(m)

## [1] "c1" "c2" "c3"

row.names(m)

## [1] "r1" "r2" "r3" "r4"

It is possible to transform this matrix into another object class simply by changing the attributes. Specifically, we can remove the dimension attribute (by setting it to NULL), and the object will be transformed into a vector:

dim(m) <- NULL
class(m)

## [1] "integer"

typeof(m)

## [1] "integer"

Let’s consider the following example. We’ll construct an array a, and define a vector with the same contents:

a <- array(1:12, dim = c(3,4))
b <- 1:12

You can use R’s bracket notation to refer to elements in a as a two-dimensional array, but you can’t refer to elements in b as a two-dimensional array, because b doesn’t have any dimensions assigned:

a[2,2]

## [1] 5

b[2,2]

Error in b[2, 2] : incorrect number of dimensions

At this point, you might wonder if R considers the two objects to be the same. Here’s what happens when you compare them with the == operator:

a == b

##      [,1] [,2] [,3] [,4]
## [1,] TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE

Notice what is returned: an array with the dimensions of a, where each cell shows the result of the comparison. There is a function in R called all.equal that compares the data and attributes of two objects to show whether they are “nearly” equal and, if they are not, explains why:

all.equal(a,b)

## [1] "Attributes: < Modes: list, NULL >"                   
## [2] "Attributes: < Lengths: 1, 0 >"                       
## [3] "Attributes: < names for target but not for current >"
## [4] "Attributes: < current is not list-like >"            
## [5] "target is matrix, current is numeric"

If you just want to check whether two objects are exactly the same, but don’t care why, use the function identical:

identical(a,b)

## [1] FALSE

By assigning a dimension attribute to b, b is transformed into an array and the two-dimensional data access tools will work. The all.equal function will also show that the two objects are equivalent:

dim(b) <- c(3,4)
b[2,2]

## [1] 5

all.equal(a,b)

## [1] TRUE

identical(a,b)

## [1] TRUE
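The difference between “nearly” equal and exactly equal matters most for floating-point values. The sketch below uses two made-up variables, lhs and rhs, to illustrate this. It also wraps all.equal in isTRUE, because all.equal returns a character vector of differences instead of FALSE when the objects differ.

```r
lhs <- 0.1 + 0.2
rhs <- 0.3

# identical() demands an exact match, which floating-point rounding breaks.
exact <- identical(lhs, rhs)

# all.equal() allows a small numerical tolerance; isTRUE() turns its
# result into a single logical that is safe to use in if().
near <- isTRUE(all.equal(lhs, rhs))
```

Here exact is FALSE while near is TRUE.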

5 Functions

Functions are the R objects that evaluate a set of input arguments and return an output value. This section explains how to create and use functions in R.

5.1 The keyword `function`

In R, function objects are defined with this syntax:

    function(arguments) body

where arguments is a set of symbol names (and, optionally, default values) that will be defined within the body of the function, and body is an R expression. Typically, the body is enclosed in curly braces, but it does not have to be if the body is a single expression. For example, the following two definitions are equivalent:

f <- function(x,y) x+y
f <- function(x,y) {x+y}

5.2 Arguments

A function definition in R includes the names of arguments. Optionally, it may include default values. If you specify a default value for an argument, then the argument is considered optional:

f <- function(x,y) {x+y}
f(1,2)

## [1] 3

g <- function(x,y=10) {x+y}
g(1)

## [1] 11

If you do not specify a default value for an argument, and you do not specify a value when calling the function, you will get an error if the function attempts to use the argument:

f(1)

Error in f(1) : argument "y" is missing, with no default

Note that you will only get an error if you try to use the uninitialized argument within the function; you could easily write a function that simply doesn’t reference the argument, and it will work fine.
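As a small illustration of this behavior, the sketch below defines two hypothetical functions: ignore_y never touches its second argument, and optional_y uses the missing function to test whether the argument was supplied before using it.

```r
# y is declared but never used, so omitting it causes no error.
ignore_y <- function(x, y) x * 2

# missing() reports whether the caller supplied y.
optional_y <- function(x, y) {
  if (missing(y)) x else x + y
}

r1 <- ignore_y(5)       # y omitted and never evaluated
r2 <- optional_y(1)     # y omitted, so only x is returned
r3 <- optional_y(1, 2)  # y supplied, so the sum is returned
```

The results are 10, 1, and 3, respectively.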

In a function call, you may override the default value:

g(1,2)

## [1] 3

In R, it is often convenient to specify a variable-length argument list. You might want to pass extra arguments to another function, or you may want to write a function that accepts a variable number of arguments. To do this in R, you specify an ellipsis (...) in the arguments to the function. Recall that ... is a special type of object in R. The only place you can manipulate this object is inside the body of a function. In this context, it means “all the other arguments for the function.”

As an example, let’s create a function that prints the first argument and then passes all the other arguments to the summary function. To do this, we will create a function that takes one argument: x. The arguments specification also includes an ellipsis to indicate that the function takes other arguments. We can then call the summary function with the ellipsis as its argument:

v <- c(sqrt(1:100))
f <- function(x,...) {print(x); summary(...)}
f("Here is the summary for v.", v, digits=2)

## [1] "Here is the summary for v."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     5.1     7.1     6.7     8.7    10.0

Notice that all of the arguments after x were passed to summary.

It is also possible to read the arguments from the variable-length argument list. To do this, you can convert the object ... to a list within the body of the function. As an example, let’s create a function that simply sums all its arguments:

addemup <- function(x,...) {
  args <- list(...)
  for (a in args) x <- x + a
  x
}
addemup(1,1)

## [1] 2

addemup(1,2,3,4,5)

## [1] 15

You can also directly refer to items within the list ... through the variables ..1, ..2, and so on. Use ..1 for the first item, ..2 for the second, and so forth. Named arguments are valid symbols within the body of the function.
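The following sketch combines both ways of reading the ellipsis. The function name describe_dots is made up for illustration. Note that ..1 raises an error if no extra arguments are supplied, so this version assumes at least one.

```r
describe_dots <- function(...) {
  args <- list(...)           # capture every extra argument in a list
  list(
    count     = length(args), # how many extra arguments were passed
    first     = ..1,          # the first extra argument, by position
    arg_names = names(args)   # names given by the caller ("" if unnamed)
  )
}

res <- describe_dots(a = 10, b = 20, 30)
```

In this call, res$count is 3, res$first is 10, and res$arg_names is c("a", "b", "").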

5.3 Return values

In an R function, you may use the return function to specify the value returned by the function. For example:

f <- function(x) {return(x^2 + 3)}
f(3)

## [1] 12

However, R will simply return the last evaluated expression as the result of a function. So, it is common to omit the return statement:

f <- function(x) {x^2 + 3}
f(3)

## [1] 12

In some cases, an explicit return value may lead to cleaner code.
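One such case is an early exit, where return lets a function stop before reaching its last expression. The sketch below uses a hypothetical function, safe_log, that assumes a single numeric input.

```r
safe_log <- function(x) {
  if (x <= 0) {
    return(NA_real_)  # leave early: the logarithm is undefined here
  }
  log(x)              # otherwise the last expression is the result
}

bad  <- safe_log(-1)
good <- safe_log(1)
```

Here bad is NA and good is 0.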

5.4 Functions as arguments

Many functions in R can take other functions as arguments. For example, many modeling functions accept an optional argument that specifies how to handle missing values; this argument is usually a function for processing the input data.

As an example of a function that takes another function as an argument, let’s look at sapply. The sapply function iterates through each element in a vector, applying another function to each element in the vector, and returning the results. Here is a simple example:

a <- 1:7
sapply(a, sqrt)

## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751

f <- function(x,y){return(x^y) }
sapply(a,2, FUN=f)

## [1]  1  4  9 16 25 36 49

These are toy examples; you could have calculated the same quantities with the expressions sqrt(1:7) and (1:7)^2. However, there are many useful functions that don’t work properly on a vector with more than one element; sapply provides a simple way to extend such a function to work on a vector. Related functions allow you to summarize every element in a data structure or to perform more complicated calculations.
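To make this concrete, here is a sketch of a function that only handles a single value. The name sign_label is made up for illustration. Its if condition expects exactly one TRUE or FALSE, so it cannot be called directly on a longer vector, but sapply applies it element by element.

```r
# Works for one number at a time: if() needs a single logical condition.
sign_label <- function(x) if (x < 0) "negative" else "non-negative"

values <- c(-2, 0, 3)

# sapply() calls sign_label once per element and collects the results.
labels <- sapply(values, sign_label)

# vapply() does the same but also checks that each result is one string.
labels_checked <- vapply(values, sign_label, character(1))
```

Both calls return c("negative", "non-negative", "non-negative").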

5.5 Anonymous functions

So far, we’ve mostly seen named functions in R. However, because functions are just objects in R, it is possible to create functions that do not have names. These are called anonymous functions. Anonymous functions are usually passed as arguments to other functions. Let’s start with a very simple example.

We will define a function that takes another function as its argument and then applies that function to the number 3. Let’s call the function apply.to.three, and we will call the argument f:

apply.to.three <- function(f) {f(3)}

Now, let’s call apply.to.three with an anonymous function assigned to argument f. As an example, let’s create a simple function that takes one argument and multiplies that argument by 7:

apply.to.three(function(x) {x * 7})

## [1] 21

Here’s how this works. When the R interpreter evaluates the expression apply.to.three(function(x) {x * 7}), it assigns the argument f to the anonymous function function(x) {x * 7}. The interpreter then begins evaluating the expression f(3). The interpreter assigns 3 to the argument x for the anonymous function. Finally, the interpreter evaluates the expression 3 * 7 and returns the result.

Anonymous functions are a very powerful concept that is used in many places in R. Above, we used the sapply function to apply a named function to every element in an array. You can also pass an anonymous function as an argument to sapply:

a <- c(1, 2, 3, 4, 5)
sapply(a, function(x) {x+1})

## [1] 2 3 4 5 6

This family of functions is a good alternative to control structures.

By the way, it is possible to define an anonymous function and apply it directly to an argument. Here’s an example:

(function(x) {x+1})(1)

## [1] 2

Notice that the function object needs to be enclosed in parentheses. This is because function calls, expressions of the form f(arguments), have very high precedence in R.
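Anonymous functions also pair naturally with the base R helpers Filter, Map, and Reduce, which take a function as their first argument. The following sketch is only an illustration of the idea; the variable names are arbitrary.

```r
# Keep only the elements for which the anonymous function returns TRUE.
evens <- Filter(function(n) n %% 2 == 0, 1:10)

# Combine the elements pairwise, carrying an accumulated value along.
total <- Reduce(function(acc, n) acc + n, 1:5)

# Apply a two-argument anonymous function across two vectors in parallel.
products <- Map(function(a, b) a * b, 1:3, 4:6)
```

Here evens contains 2, 4, 6, 8, 10, total is 15, and products is a list holding 4, 10, and 18.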

5.6 Properties of functions

R includes a set of functions for getting more information about function objects. To see the set of arguments accepted by a function, use the args function. The args function returns a function object with NULL as the body. Here are a few examples:

args(sin)

## function (x) 
## NULL

args(`?`)

## function (e1, e2) 
## NULL

args(args)

## function (name) 
## NULL

args(lm)

## function (formula, data, subset, weights, na.action, method = "qr", 
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, offset, ...) 
## NULL

If you would like to manipulate the list of arguments with R code, then you may find the formals function more useful. The formals function will return a pairlist object, with a pair for every argument. The name of each pair will correspond to each argument name in the function. When a default value is defined, the corresponding value in the pairlist will be set to that value. When no default is defined, the value will be the empty symbol, which prints as a blank line. The formals function is only available for functions written in R (objects of type closure); for built-in functions it returns NULL.

Here is a simple example of using formals to extract information about the arguments to a function:

f <- function(x,y=1,z=2) {x+y+z}
f.formals <- formals(f)
f.formals

## $x
## 
## 
## $y
## [1] 1
## 
## $z
## [1] 2

f.formals$x

f.formals$y

## [1] 1

f.formals$z

## [1] 2
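The sketch below inspects a formal argument list programmatically. The function k is made up for illustration. It checks how an argument without a default is stored, and it shows one way to get argument names for a built-in function such as sum, by passing it through args first.

```r
k  <- function(x, y = 1) x + y
fm <- formals(k)

arg_names   <- names(fm)        # the argument names, in order
x_is_null   <- is.null(fm$x)    # an argument with no default is not NULL...
x_is_symbol <- is.symbol(fm$x)  # ...it is stored as an (empty) symbol

# formals() yields NULL for a primitive, but args() returns a closure
# whose formals can be inspected.
prim_formals   <- formals(sum)
prim_arg_names <- names(formals(args(sum)))
```

Here arg_names is c("x", "y"), x_is_null is FALSE, x_is_symbol is TRUE, prim_formals is NULL, and prim_arg_names is c("...", "na.rm").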

You may also use formals on the left-hand side of an assignment statement to change the formal arguments of a function. For example:

f.formals$y <- 3
formals(f) <- f.formals
args(f)

## function (x, y = 3, z = 2) 
## NULL

R provides a convenience function called alist to construct an argument list. You simply specify the argument list as if you were defining a function. (Note that for an argument with no default, you do not need to include a value but still need to include the equals sign.)

f <- function(x,y=1,z=2) {x + y + z}
formals(f) <- alist(x=,y=100,z=200)
f

## function (x, y = 100, z = 200) 
## {
##     x + y + z
## }

R provides a similar function called body that can be used to return the body of a function:

body(f)

## {
##     x + y + z
## }

Like the formals function, the body function may be used on the left-hand side of an assignment statement:

f

## function (x, y = 100, z = 200) 
## {
##     x + y + z
## }

body(f) <- expression({x * y * z})
f

## function (x, y = 100, z = 200) 
## {
##     x * y * z
## }

Note that the body of a function is an unevaluated R expression (a language object), so the new value you assign must also be unevaluated code, for example an object created with expression, as above.
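In practice, the replacement form of body also accepts a quoted call created with quote. The sketch below uses a hypothetical function, cube, to show this and to inspect the class of the stored body.

```r
cube <- function(x) x^2   # deliberately wrong body, fixed below

# quote() captures the code without evaluating it.
body(cube) <- quote(x^3)

result     <- cube(2)
body_class <- class(body(cube))  # the body is stored as a call
```

After the assignment, result is 8 and body_class is "call".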

5.7 Argument order and named arguments

When you specify a function in R, you assign a name to each argument in the function. Inside the body of the function, you can access the arguments by name. For example, consider the following function definition:

addTheLog <- function(first, second) {first + log(second)}

This function takes two arguments, called first and second. Inside the body of the function, you can refer to the arguments by these names. When you call a function in R, you can specify the arguments in three different ways (in order of priority):

1. Exact names. The arguments will be assigned to full names explicitly given in the argument list. Full argument names are matched first:
```
addTheLog(second=exp(4),first=1)
```
```
## [1] 5
```
2. Partially matching names. The arguments will be assigned to partial names explicitly given in the arguments list:
```
addTheLog(s=exp(4),f=1)
```
```
## [1] 5
```
3. Argument order. The arguments will be assigned to names in the order in which they were given:
```
addTheLog(1,exp(4))
```
```
## [1] 5
```

When you are using generic functions, you cannot specify the argument name of the object on which the generic function is being called. You can still specify names for other arguments.

When possible, it’s a good practice to use exact argument names. Specifying full argument names does require extra typing, but it makes your code easier to read and removes ambiguity.
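The three matching rules can be mixed in a single call. The sketch below redefines the example function under a new, made-up name, add_log, so it stands on its own. It then uses match.call to show how R resolves a call that combines a partial name with a positional argument.

```r
add_log <- function(first, second) first + log(second)

# The named argument is matched first; the unnamed value then fills
# the remaining formal argument (second) by position.
mixed <- add_log(exp(4), first = 1)

# match.call() returns the call with every argument fully named.
mc <- match.call(add_log, quote(add_log(s = 2, 1)))
```

Here mixed is 5, and mc shows that s was expanded to second (value 2) while the positional 1 was assigned to first.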

Partial names are discouraged because they can lead to confusion. As an example, consider the following function:

f <- function(arg1=10,arg2=20) {
  print(paste("arg1:",arg1))
  print(paste("arg2:",arg2))
}

When you call this function with one ambiguous argument, it will cause an error:

f(arg=1)

Error in f(arg = 1) : argument 1 matches multiple formal arguments

However, when you specify two arguments and one of them matches exactly, the exact match is resolved first, so the ambiguous name is assigned to whichever argument remains:

f(arg=1,arg2=2)

## [1] "arg1: 1"
## [1] "arg2: 2"

f(arg=1,arg1=2)

## [1] "arg1: 2"
## [1] "arg2: 1"

6 Exercises

Ex1. R provides the functions colSums(), colMeans(), cumsum() and cumprod(). Please write your own functions that implement the same functionality.
Ex2. The useful function mvrnorm() is in the R package MASS. The function chol() computes the Cholesky factorization of a real symmetric positive-definite square matrix. Please write a function that uses chol() to generate a dataset from the multivariate normal distribution with mean \(\mu\) and covariance matrix \(\Sigma\).
- The formulas behind this are as follows. Assume that a vector \(v\) has independent standard normal components. The symmetric positive-definite square matrix \(\Sigma\) can be decomposed as \(\Sigma=AA^T\), where \(A\) could be \(A=\Sigma^{1/2}\) or the transpose of the upper-triangular factor returned by chol(). Thus, we have \[\begin{align} Var(A v)= A\, Var(v)\, A^T = AA^T = \Sigma, \end{align}\] where \(Var(x)\) denotes the variance of \(x\), and we use the fact that \(Var(v)\) is an identity matrix. Adding \(\mu\) to \(Av\) then gives a vector with mean \(\mu\) and covariance \(\Sigma\).

R objects

Xu Liu