These are my notes on the lectures for the Coursera R Programming course. I can't guarantee there are no mistakes or omissions. The goal has been to include every R example shown in the lectures, although there may be cases (such as with debugging) when the R code won't execute properly in an R Markdown document.
So far, week 3 is complete. I've gotten started on week 1 but it is far from complete. Nothing yet on weeks 2 or 4.
Example of an error message:
library(datasets)
data(airquality)
cor(airquality)
Error in cor(airquality) : missing observations in cov/cor
Google the phrase “Error in cor(airquality) : missing observations in cov/cor”
Make the header/subject of a question as specific as possible, rather than a general plea for help.
R has five basic (atomic) data types: character, numeric (real numbers), integer, complex and logical.
The most basic object is a vector
R objects can have attributes
Attributes of an object can be accessed using the attributes() function.
At the R prompt we can type expressions. The <- symbol is the assignment operator.
x <- 1
print(x)
## [1] 1
x
## [1] 1
msg <- "hello"
The grammar of the language determines whether an expression is complete.
x <- ## Incomplete expression
When a complete expression is entered at the prompt, it is evaluated and the result is returned. The result may be auto-printed.
x <- 5 ## nothing printed
x ## auto-printing occurs
## [1] 5
print(x) ## explicit printing
## [1] 5
The : operator is used to create integer sequences.
x <- 1:20
The c() function creates vectors.
x <- c(0.5, 0.6) ## numeric
x <- c(TRUE, FALSE) ## logical
x <- c(T, F) ## logical (T and F are shortcuts for TRUE and FALSE)
x <- c("a", "b", "c") ## character
x <- 9:29 ## integer
x <- c(1 + (0+0i), 2 + (0+4i)) ## complex (there is no further use of complex numbers in the course)
Using the vector() function:
x <- vector("numeric", length = 10) # creates vector with 10 zeros
Vectors can contain values of only one type, such as numeric or character. If necessary, values are “coerced”. Example:
y <- c(1.7, "a") ## 1.7 coerced to character '1.7'
y <- c(TRUE, 2) ## TRUE coerced to numeric 1
y <- c("a", TRUE) ## TRUE coerced to character 'TRUE'
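The coercion rule can be checked directly with class(). This is a small sketch, not from the lectures; the variable names are hypothetical. It suggests that R picks the most flexible type present, in the order logical, integer, numeric, character.

```r
# Implicit coercion: the result takes the most flexible type present
num_from_logical <- class(c(TRUE, 2))    # logical + numeric  -> "numeric"
chr_from_numeric <- class(c(1.7, "a"))   # numeric + character -> "character"
chr_from_logical <- class(c("a", TRUE))  # logical + character -> "character"
int_from_logical <- class(c(TRUE, 2L))   # logical + integer  -> "integer"

num_from_logical
chr_from_numeric
chr_from_logical
int_from_logical
```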
You can also coerce values explicitly. There are many more functions similar to the examples below.
x <- 0:6
class(x)
## [1] "integer"
as.numeric(x)
## [1] 0 1 2 3 4 5 6
as.logical(x)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
as.character(x)
## [1] "0" "1" "2" "3" "4" "5" "6"
Nonsensical coercion results in NAs.
x <- c("a", "b", "c")
as.numeric(x)
## Warning: NAs introduced by coercion
## [1] NA NA NA
as.logical(x)
## [1] NA NA NA
as.complex(x)
## Warning: NAs introduced by coercion
## [1] NA NA NA
Matrices are vectors with a dimension attribute. The dimension attribute is itself a vector of length 2 (nrow, ncol)
m <- matrix(nrow = 2, ncol = 3)
m
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
dim(m)
## [1] 2 3
attributes(m)
## $dim
## [1] 2 3
Matrices are constructed column-wise, so entries can be thought of as starting in “upper left” corner and running down the columns
m <- matrix(1:6, nrow = 2, ncol = 3)
m
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Matrices can also be created directly from vectors by adding a dimension attribute.
m <- 1:10
dim(m) <- c(2, 5)
m
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
(For the above to work, 2 times 5 must equal the length of m.)
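As a quick check of that length requirement, the sketch below (not from the lectures) tries a dimension vector whose product does not match the length of m, catches the resulting error with tryCatch, and then assigns dimensions that do match.

```r
m <- 1:10

# 3 * 4 = 12 does not equal length(m) = 10, so the assignment fails
result <- tryCatch({
  dim(m) <- c(3, 4)
  "assigned"
}, error = function(e) "error")
result

# 5 * 2 = 10 matches the length, so this works
dim(m) <- c(5, 2)
dim(m)
```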
Matrices can be created by column-binding or row-binding
x <- 1:3
y <- 10:12
cbind(x, y)
## x y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12
rbind(x, y)
## [,1] [,2] [,3]
## x 1 2 3
## y 10 11 12
Lists are a special type of vector that can contain elements of different classes. (Not noted on slide: in the example below each element is itself a vector, and so holds only one type of value, but the type varies between elements. The vector for a list element may contain only a single value, as it does here. More generally, a list element can be any R object, including another list.)
x <- list(1, "a", TRUE, 1 + (0+4i))
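To see the differing classes inside the list, one option (a sketch, not shown in the lectures) is to ask for the class of each element, and to compare double-bracket and single-bracket indexing.

```r
x <- list(1, "a", TRUE, 1 + 4i)

# class of each element of the list
classes <- sapply(x, class)
classes

second <- x[[2]]   # double brackets extract the element itself
sub <- x[2]        # single brackets return a list of length 1
class(sub)
```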
Factors are used to represent categorical data. Factors can be unordered or ordered. One can think of a factor as an integer vector where each integer has a label.
x <- factor(c("yes", "yes", "no", "yes", "no"))
x
## [1] yes yes no yes no
## Levels: no yes
table(x) # creates a contingency table for a factor
## x
## no yes
## 2 3
unclass(x)
## [1] 2 2 1 2 1
## attr(,"levels")
## [1] "no" "yes"
The order of the levels can be set using the levels argument for factor()
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no")) # 'yes' listed before 'no'
x
## [1] yes yes no yes no
## Levels: yes no
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("no", "yes")) # 'no' listed before 'yes'
x
## [1] yes yes no yes no
## Levels: no yes
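The notes mention ordered factors without showing one. A minimal sketch, using a made-up sizes vector (hypothetical, not from the lectures), where the levels argument sets the order and ordered = TRUE makes comparisons meaningful:

```r
sizes <- factor(c("low", "high", "medium", "low"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)

# the underlying integer codes follow the order given in levels
codes <- as.integer(sizes)
codes

# comparisons work on ordered factors
below_high <- sizes < "high"
below_high
```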
An auxiliary function split is also useful, particularly with lapply.
lapply accepts a list, or a data structure that it coerces to a list, and returns a list. It passes each element of the list to a specified function, and the returned list contains that function's output for each element of the supplied list.
x <- list(a = 1:5, b = rnorm(10))
# x is a list with two elements: a is a vector with values 1 to 5, and b is a
# vector with 10 random values from the standard normal distribution
lapply(x, mean)
## $a
## [1] 3
##
## $b
## [1] -0.3021
# This returns two elements: a is the mean of the elements in x$a, and b is the
# mean of the elements in x$b
Now pass a data frame instead of a list:
data <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
# data is a data.frame rather than a plain list, so lapply treats it as a list of its columns
lapply(data, mean)
## $a
## [1] 2
##
## $b
## [1] 5
Or a matrix:
data <- matrix(1:4, 2, 2)
lapply(data, mean)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4
In the previous case, we simply got a mean of each cell in the matrix. Not useful. So we can specify a list containing each of the columns of a matrix.
data <- matrix(1:4, 2, 2)
lapply(list(data[, 1], data[, 2]), mean)
## [[1]]
## [1] 1.5
##
## [[2]]
## [1] 3.5
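Listing each column by hand does not scale to wider matrices. Two alternatives, sketched here under the assumption that column means are what is wanted (neither appears in the lectures): loop over the column indices, or convert the matrix to a data frame so that lapply sees one element per column.

```r
data <- matrix(1:4, 2, 2)

# iterate over column indices instead of listing columns by hand
via_index <- lapply(seq_len(ncol(data)), function(j) mean(data[, j]))

# a data frame is a list of columns, so lapply works column by column;
# as.data.frame names the columns V1 and V2
via_df <- lapply(as.data.frame(data), mean)

via_index
via_df
```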
We can of course specify functions other than mean. The runif function (random uniform distribution) will provide random values from the uniform distribution, by default between 0 and 1. The number of values returned is determined by the argument passed to it.
The output is one random value for the first element, two for the second, three for the third, and four for the fourth. Each of these counts matches the values in x.
x <- 1:4 # vector containing 1, 2, 3, 4
lapply(x, runif)
## [[1]]
## [1] 0.3595
##
## [[2]]
## [1] 0.7483 0.2645
##
## [[3]]
## [1] 0.31651 0.03357 0.06476
##
## [[4]]
## [1] 0.1271 0.4602 0.1954 0.1891
Here is an example of including arguments after the function, which are passed to the function on each invocation. For example, the following passes min = 0 and max = 10 to each invocation of runif. Therefore, the random values returned are between 0 and 10 (instead of the default 0 and 1).
x <- 1:4
lapply(x, runif, min = 0, max = 10)
# returns a list of 1, 2, 3 and 4 random values, each between 0 and 10 (the values differ on every run)
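Because the values are random, the pass-through of min and max is easier to confirm by checking counts and bounds than by reading the numbers. A sketch (the seed is arbitrary and only makes the run repeatable):

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
x <- 1:4
draws <- lapply(x, runif, min = 0, max = 10)

# one value for the first element, two for the second, and so on
counts <- lengths(draws)
counts

# every value lies between 0 and 10
in_range <- all(unlist(draws) >= 0 & unlist(draws) <= 10)
in_range

# the same call written with an anonymous function
draws2 <- lapply(x, function(n) runif(n, min = 0, max = 10))
```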
The functions passed to lapply can be anonymous. That means they are defined right within the call to lapply, and don't have a name.
x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
lapply(x, function(elt) elt[, 1]) # returns the first column of each element in x
## $a
## [1] 1 2
##
## $b
## [1] 1 2 3
The sapply function is very similar to lapply, but it will try to “simplify” the results returned if it can.
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
lapply(x, mean) # lapply always returns a list
## $a
## [1] 2.5
##
## $b
## [1] -0.3331
##
## $c
## [1] 0.6532
##
## $d
## [1] 4.993
sapply(x, mean) # the output is four elements, each with a single number, so a vector is returned
## a b c d
## 2.5000 -0.3331 0.6532 4.9931
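The simplification depends on the shape of each result. In this sketch (fixed inputs chosen for illustration, not from the lectures), results of length 2 come back as a matrix, and vapply is shown as a stricter alternative that states the expected result type up front.

```r
x <- list(a = 1:4, b = c(2, 4, 6))

# each call to range returns two values, so sapply builds a 2 x 2 matrix
ranges <- sapply(x, range)
ranges

# vapply requires every result to be a single numeric value
means <- vapply(x, mean, numeric(1))
means
```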
You can write loops to do the same things as these functions, but it can require much more code. And without either an apply function or a loop, you can't do something like this:
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
mean(x)
## Warning: argument is not numeric or logical: returning NA
## [1] NA
The mean function expects a simple data structure like a vector, and it doesn't know how to handle a list.
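For comparison, here is roughly what the loop version of lapply(x, mean) looks like. It is a sketch with fixed inputs so the result can be compared exactly.

```r
x <- list(a = 1:4, b = c(10, 20, 30))

# pre-allocate a list of the right length and copy the names across
means <- vector("list", length(x))
names(means) <- names(x)
for (i in seq_along(x)) {
  means[[i]] <- mean(x[[i]])
}

# the loop result matches what lapply produces in one line
same <- identical(means, lapply(x, mean))
same
```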
apply is used to evaluate a function (often an anonymous one) over the margins of an array.
str(apply)
## function (X, MARGIN, FUN, ...)
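Arguments: X is an array, MARGIN is an integer vector indicating which margins should be retained, FUN is the function to be applied, and ... is for other arguments to be passed to FUN.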
A “margin” in an array is one of its dimensions. So if a matrix is of size 5x4, it has two dimensions. The first is rows, and the second is columns. So 1 is the value to refer to the “margin” for rows, and 2 to refer to the “margin” for columns.
Note that arrays can have three or more dimensions.
x <- matrix(rnorm(200), 20, 10) # creates a 20x10 matrix with random values
apply(x, 2, mean) # returns the mean for each column (margin=2) in x
## [1] 0.53518 0.12483 0.23964 0.10299 0.20221 -0.05152 -0.22057
## [8] 0.24382 0.10032 0.42731
apply(x, 1, mean) # returns the mean for each row (margin=1) in x
## [1] 0.50576 -0.09749 0.25340 -0.16674 0.45362 0.61261 0.17138
## [8] -0.11008 0.38748 -0.31078 0.38368 -0.43272 -0.05969 0.44327
## [15] 0.24876 0.45502 0.33899 0.33950 0.45171 -0.45926
Note that R has built-in functions for getting the sums and means of rows or columns in a matrix (rowSums, rowMeans, colSums and colMeans), and these should be preferred to the equivalent apply calls.
Here is an example getting the 25th and 75th percentiles of each row of a matrix, using the quantile function.
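A short sketch showing that the built-in row and column functions agree with the corresponding apply calls, using a small fixed matrix so the values are easy to verify by hand:

```r
x <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

row_sums  <- rowSums(x)    # same as apply(x, 1, sum)
row_means <- rowMeans(x)   # same as apply(x, 1, mean)
col_sums  <- colSums(x)    # same as apply(x, 2, sum)
col_means <- colMeans(x)   # same as apply(x, 2, mean)

all.equal(row_sums,  apply(x, 1, sum))
all.equal(col_means, apply(x, 2, mean))
```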
x <- matrix(rnorm(200), 20, 10)
apply(x, 1, quantile, probs = c(0.25, 0.75))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## 25% -1.2047 -0.2604 -0.79300 -0.3014 -1.0702 -0.4210 -0.5916 -0.4372
## 75% 0.5207 1.1154 0.07974 0.3564 0.3464 0.3111 0.5982 -0.1266
## [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
## 25% -0.1355 -0.5876 -0.4634 -0.8932 -0.5595 -0.2733 -0.1801 -0.6549
## 75% 0.3864 1.0244 0.6058 0.5012 0.3128 0.6266 0.5659 1.0032
## [,17] [,18] [,19] [,20]
## 25% -0.3744 -0.2516 -0.2301 -0.2818
## 75% 0.1464 0.8058 0.6022 0.4956
Arrays can have more than two dimensions. So you can specify a vector for the margins, containing all of the dimensions over which a function should be applied.
a <- array(rnorm(2 * 2 * 10), c(2, 2, 10)) # 2x2x10 array
apply(a, c(1, 2), mean) # get the mean for all of the values in each combination of the first two dimensions
## [,1] [,2]
## [1,] 0.5505 0.4118
## [2,] 0.1098 -0.1189
rowMeans(a, dims = 2) # returns same result
## [,1] [,2]
## [1,] 0.5505 0.4118
## [2,] 0.1098 -0.1189
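tapply is used to apply a function over subsets of a vector. Parameters: X is a vector, INDEX is a factor or a list of factors (or something that can be coerced to one), FUN is the function to be applied, ... contains other arguments to be passed to FUN, and simplify indicates whether the result should be simplified.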
Think of this as a way to group parts of a vector. Example: a vector contains data on 100 people, the first 50 of which are men, the second 50 are women. You want to apply some function to those two groups separately.
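The 100-person scenario can be sketched as follows. The heights, their means and standard deviations, and the seed are all invented for illustration; only the grouping pattern matters.

```r
set.seed(42)  # arbitrary seed for repeatability
# hypothetical data: 50 men followed by 50 women
heights <- c(rnorm(50, mean = 178, sd = 7), rnorm(50, mean = 165, sd = 6))
sex <- gl(2, 50, labels = c("male", "female"))

# one mean per group
group_means <- tapply(heights, sex, mean)
names(group_means)
```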
The following example creates a vector of 30 numbers, splits it into three groups, and gets the mean of each group.
(The gl function generates a factor with a specified number of levels.)
x <- c(rnorm(10), runif(10), rnorm(10, 1)) # vector of 30 random numbers:
# the first 10 are from the normal distribution, the second 10 are from the uniform
# distribution, and the last 10 are from a normal distribution with a mean of 1
# (instead of the default 0)
f <- gl(3, 10) # f is a factor with 3 levels, each with 10 values
f
## [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
## Levels: 1 2 3
tapply(x, f, mean)
## 1 2 3
## 0.01055 0.45760 1.42804
By default, tapply simplifies the result, but this can be turned off.
tapply(x, f, mean, simplify = FALSE) # returns list
## $`1`
## [1] 0.01055
##
## $`2`
## [1] 0.4576
##
## $`3`
## [1] 1.428
In another example, you can get the ranges for specified groups in a vector. The range function returns two values for its input, the lowest and highest values in the input. The output is therefore a list, because a two-value result per group cannot be simplified to a vector.
tapply(x, f, range)
## $`1`
## [1] -1.308 1.659
##
## $`2`
## [1] 0.01472 0.88373
##
## $`3`
## [1] -0.4889 3.1764
Note that “simplify” is different for sapply and tapply. In the previous example, tapply returned a list with three elements, each of which was a two-element vector. sapply would have returned that as a matrix. tapply returns either a vector or a list. This distinction is made in the documentation for the two functions.
split takes a vector or other object and splits it into groups determined by a factor or list of factors. Arguments: x is a vector, list or data frame, f is a factor (or something coerced to one) or a list of factors, and drop indicates whether empty factor levels should be dropped.
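The difference in simplification is easy to see side by side. This sketch uses a small fixed vector (hypothetical values) with three groups of two:

```r
x <- c(1, 5, 2, 8, 3, 9)
f <- gl(3, 2)   # groups: 1 1 2 2 3 3

# tapply keeps two-value results as a list, one element per group
as_list <- tapply(x, f, range)
is.list(as_list)

# sapply on the split data simplifies the same results to a 2 x 3 matrix
as_matrix <- sapply(split(x, f), range)
as_matrix
```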
This means that you can split data into groups, then pass it to lapply or sapply. You don't have to use tapply.
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)
split(x, f)
## $`1`
## [1] 1.59825 -0.47098 -2.02568 -1.10501 0.08255 -0.07554 0.01834
## [8] 0.55490 -1.21988 0.23186
##
## $`2`
## [1] 0.9166755 0.8547015 0.6513598 0.3441858 0.0008893 0.3269521 0.2141416
## [8] 0.9718464 0.6278163 0.7504628
##
## $`3`
## [1] -0.40125 1.44140 0.29018 2.36271 0.06213 0.24409 1.40883
## [8] 3.19144 0.06292 1.27034
Now the result can be passed to lapply or sapply. But there is no advantage in the following example over using tapply.
lapply(split(x, f), mean)
## $`1`
## [1] -0.2411
##
## $`2`
## [1] 0.5659
##
## $`3`
## [1] 0.9933
split becomes more useful when we are dealing with more complicated objects such as a data frame.
library(datasets) # get some data
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
We might want to calculate the means for ozone, solar, wind, etc. by month.
s <- split(airquality, airquality$Month)
lapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
## $`5`
## Ozone Solar.R Wind
## NA NA 11.62
##
## $`6`
## Ozone Solar.R Wind
## NA 190.17 10.27
##
## $`7`
## Ozone Solar.R Wind
## NA 216.484 8.942
##
## $`8`
## Ozone Solar.R Wind
## NA NA 8.794
##
## $`9`
## Ozone Solar.R Wind
## NA 167.43 10.18
So in the original data frame, Month was not a factor variable, but split coerced it to a factor in order to form the groups.
The reason we needed an anonymous function at this point was that we didn't want to apply the colMeans function to everything in s. If we did, we would also have means for month and day, which doesn't make much sense:
lapply(s, colMeans)
## $`5`
## Ozone Solar.R Wind Temp Month Day
## NA NA 11.62 65.55 5.00 16.00
##
## $`6`
## Ozone Solar.R Wind Temp Month Day
## NA 190.17 10.27 79.10 6.00 15.50
##
## $`7`
## Ozone Solar.R Wind Temp Month Day
## NA 216.484 8.942 83.903 7.000 16.000
##
## $`8`
## Ozone Solar.R Wind Temp Month Day
## NA NA 8.794 83.968 8.000 16.000
##
## $`9`
## Ozone Solar.R Wind Temp Month Day
## NA 167.43 10.18 76.90 9.00 15.50
Of course, we could take the above output and keep only the columns we want. But first, that would mean extra calculation to no purpose, and second, if one of the columns contained character data, calculating its mean would be nonsensical (colMeans requires numeric input and would stop with an error).
We could use sapply instead of lapply to simplify the result to a matrix.
sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
## 5 6 7 8 9
## Ozone NA NA NA NA NA
## Solar.R NA 190.17 216.484 NA 167.43
## Wind 11.62 10.27 8.942 8.794 10.18
We can also pass na.rm = TRUE to colMeans to exclude NA values. Note that this does not remove entire rows where any one of the columns has NA. It removes values on a column-by-column basis. So if row 1 has an NA for “Ozone”, then it is not included in calculating the mean for Ozone, but if the Wind column has a value, that row is still used in calculating the mean for Wind.
sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))
## 5 6 7 8 9
## Ozone 23.62 29.44 59.115 59.962 31.45
## Solar.R 181.30 190.17 216.484 171.857 167.43
## Wind 11.62 10.27 8.942 8.794 10.18
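The column-by-column behaviour of na.rm can be seen on a tiny made-up data frame (hypothetical values, not the airquality data), by contrasting it with dropping whole rows via complete.cases:

```r
df <- data.frame(Ozone = c(10, NA, 30), Wind = c(1, 5, 3))

# na.rm = TRUE drops the NA only from the Ozone column;
# all three Wind values are still used
per_column <- colMeans(df, na.rm = TRUE)
per_column

# dropping whole rows first also discards the Wind value in row 2
whole_rows <- colMeans(df[complete.cases(df), ])
whole_rows
```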
A data frame may have more than one factor. Example: gender and race. You may want the mean for each combination of the levels in these factors.
x <- rnorm(10)
f1 <- gl(2, 5, labels = c("male", "female"))
f2 <- gl(5, 2, labels = c("white", "black", "asian", "latino", "other"))
f1
## [1] male male male male male female female female female female
## Levels: male female
f2
## [1] white white black black asian asian latino latino other other
## Levels: white black asian latino other
interaction(f1, f2)
## [1] male.white male.white male.black male.black male.asian
## [6] female.asian female.latino female.latino female.other female.other
## 10 Levels: male.white female.white male.black female.black ... female.other
So there are 10 combinations of gender and race.
str(split(x, list(f1, f2))) # use str to compactly display the structure of splitting x by the factors
## List of 10
## $ male.white : num [1:2] 0.169 -1.171
## $ female.white : num(0)
## $ male.black : num [1:2] 0.963 1.35
## $ female.black : num(0)
## $ male.asian : num -1.56
## $ female.asian : num -0.54
## $ male.latino : num(0)
## $ female.latino: num [1:2] 0.288 -2.042
## $ male.other : num(0)
## $ female.other : num [1:2] -0.821 1.295
Some of the combinations have no values. For example, female.white. That is because f1 and f2 have no combination for those levels (compare the output of f1 and f2 above). So we can drop these.
str(split(x, list(f1, f2), drop = TRUE))
## List of 6
## $ male.white : num [1:2] 0.169 -1.171
## $ male.black : num [1:2] 0.963 1.35
## $ male.asian : num -1.56
## $ female.asian : num -0.54
## $ female.latino: num [1:2] 0.288 -2.042
## $ female.other : num [1:2] -0.821 1.295
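Once the data is split by both factors, getting the mean of each combination is one more step. A sketch with fixed values in place of rnorm, so the results are predictable:

```r
x <- c(1, 3, 5, 7, 9, 11, 13, 15, 17, 19)
f1 <- gl(2, 5, labels = c("male", "female"))
f2 <- gl(5, 2, labels = c("white", "black", "asian", "latino", "other"))

# mean for each non-empty combination of the two factors
means <- sapply(split(x, list(f1, f2), drop = TRUE), mean)
means

# tapply gives the same numbers as a 2 x 5 table, with NA for empty cells
tab <- tapply(x, list(f1, f2), mean)
tab
```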
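The previous functions, lapply, sapply and tapply, only apply a function over a single object. mapply allows you to apply a function to more than one object in parallel (for example, more than one list, in contrast to lapply). Arguments: FUN is the function to apply, ... contains the objects to apply it over, MoreArgs is a list of other arguments to FUN, and SIMPLIFY indicates whether the result should be simplified.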
Example: It is tedious to type the following.
list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))
So instead:
mapply(rep, 1:4, 4:1)
## [[1]]
## [1] 1 1 1 1
##
## [[2]]
## [1] 2 2 2
##
## [[3]]
## [1] 3 3
##
## [[4]]
## [1] 4
“This is an artificial example.” Agreed.
noise <- function(n, mean, sd) {
rnorm(n, mean, sd)
}
noise(5, 1, 2)
## [1] -0.3477 1.4204 2.4208 4.1725 2.9152
# this doesn't do what is expected, i.e., 1 random normal with mean 1, 2
# random normals with mean 2, etc.; when the n value passed to rnorm has
# length > 1, then the length is taken to be the number of values wanted
noise(1:5, 1:5, 2)
## [1] 3.067 1.721 3.053 4.630 3.193
So we can use mapply to get what we want:
# the first call to noise has arguments 1, 1, 2; the second has 2, 2, 2; and so on
mapply(noise, 1:5, 1:5, 2)
## [[1]]
## [1] -2.588
##
## [[2]]
## [1] -2.299 1.750
##
## [[3]]
## [1] 2.845 5.872 2.780
##
## [[4]]
## [1] 1.318 3.811 3.950 3.316
##
## [[5]]
## [1] 5.969 5.049 4.815 3.600 6.465
“That is how I can instantly vectorize a function that doesn't allow for vector arguments.”
Note that rnorm does in fact allow a vector argument for n; it just doesn't treat that argument the way we expect.
Here the log function returns NaN and issues a warning, but execution continues:
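Both behaviours can be confirmed by looking at lengths rather than at the random values themselves. A sketch, redefining noise so the block stands alone:

```r
noise <- function(n, mean, sd) {
  rnorm(n, mean, sd)
}

# a vector n is accepted, but only its length (5) is used as the count
direct <- noise(1:5, 1:5, 2)
length(direct)

# mapply calls noise once per position: noise(1, 1, 2), noise(2, 2, 2), ...
per_call <- mapply(noise, 1:5, 1:5, 2)
lengths(per_call)
```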
log(-1)
## Warning: NaNs produced
## [1] NaN
The following generates an error.
(invisible function: This will return a value from a function, without autoprinting it.)
printmessage <- function(x) {
if (x > 0)
print("x is greater than zero") else print("x is less than or equal to zero")
invisible(x)
}
printmessage(1) # normal behavior
## [1] "x is greater than zero"
printmessage(NA) # an error message
## Error: missing value where TRUE/FALSE needed
An attempt to fix the above:
printmessage2 <- function(x) {
if (is.na(x))
print("x is missing a value!") else if (x > 0)
print("x is greater than zero") else print("x is less than or equal to zero")
invisible(x)
}
x <- log(-1)
## Warning: NaNs produced
printmessage2(x)
## [1] "x is missing a value!"
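Two details in the example above are worth checking separately: is.na treats NaN as missing (which is why printmessage2 reports a missing value for log(-1)), and invisible suppresses auto-printing without changing the returned value. A sketch; the quiet function is a hypothetical stand-in:

```r
# is.na is TRUE for both NA and NaN; is.nan is TRUE only for NaN
na_checks  <- c(is.na(NA), is.na(NaN))
nan_checks <- c(is.nan(NA), is.nan(NaN))
na_checks
nan_checks

# invisible() returns the value but marks it as not to be auto-printed
quiet <- function(x) invisible(x)
visible_flag <- withVisible(quiet(3))$visible
value <- quiet(3)   # the value is still returned and can be assigned
visible_flag
value
```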
How do you know that something is wrong with your function?
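Tools for debugging include traceback, debug and recover, which are covered in these notes; the lectures also list browser and trace.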
You can also insert print and cat statements into functions, but then they must be deleted later.
rm(x) # first remove x from environment
mean(x) # this creates an error
## Error: object 'x' not found
traceback()
## No traceback available
Unfortunately, it appears that traceback does not work in R Markdown code, because such code runs in a non-interactive environment. So for the rest of these examples, I will not include them in R blocks.
lm(f ~ x)
traceback()
Output for above looks like the following. The stack is listed with the most recent call first: the first thing that happened was the call to lm(f ~ x) (frame 1), and the last, where the error occurred, was eval(expr, envir, enclos) (frame 7).
7: eval(expr, envir, enclos)
6: eval(predvars, data, env)
5: model.frame.default(formula = f ~ x, drop.unused.levels = TRUE)
4: stats::model.frame(formula = f ~ x, drop.unused.levels = TRUE)
3: eval(expr, envir, enclos)
2: eval(mf, parent.frame())
1: lm(f ~ x)
The debug function flags a function so that it goes into the interactive debugging environment whenever it is called. The debug call itself prints nothing; the next time the function is called, R prints the entire body of the function and then waits at a prompt.
debug(lm)
lm(y ~ x)
Partial output for above. Note that after outputting the function's code, it presents a “Browse” prompt.
debugging in: lm(y ~ x)
debug: {
ret.x <- x
ret.y <- y
cl <- match.call()
...
if (!qr)
z$qr <- NULL
z
}
Browse[2]>
The environment of this “browser” is the environment of the lm function. So at the start of the call, there is nothing in the environment except the function arguments, including any default arguments that might not be included in the call.
Now you can enter n at the prompt (for “next”) and the function executes one line at a time.
Browse[2]> n
debug: ret.x <- x
Browse[2]> n
debug: ret.y <- y
Browse[2]> n
debug: cl <- match.call()
Browse[2]> n
debug: mf <- match.call(expand.dots = FALSE)
Browse[2]> n
debug: m <- match(c("formula", "data", "subset", "weights", "na.action",
"offset"), names(mf), 0L)
The Environment tab in RStudio updates after each line, based on the line of the function that was just executed. Keep hitting n until the error occurs; now you have found exactly where it occurs.
Note that you can debug functions called within the debugged function. So you might call debug on the match function, which is part of the above output.
This section seems incomplete to me. No discussion of how to break out of debug, or the fact that it “sticks” for an entire session, or that there is a debugonce function (which debugs a function for only one invocation).
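The points listed as missing can be sketched without entering the browser. The function f here is a hypothetical stand-in, and it is deliberately never called while flagged, so the block runs non-interactively. At a Browse prompt itself, c continues execution and Q quits the browser.

```r
f <- function(x) x + 1   # hypothetical function to experiment with

debug(f)                 # the flag "sticks" until it is removed
flag_on <- isdebugged(f)

undebug(f)               # remove the flag again
flag_off <- isdebugged(f)

debugonce(f)             # flag is cleared automatically after the next call
flag_once <- isdebugged(f)
undebug(f)               # clear it here, since f is not called in this sketch

c(flag_on, flag_off, flag_once)
```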
You can set the recover function to be the error handler using the options function. This sets a global option for an R session.
When recover is the error handler and an error occurs, it presents a menu in which the options are the frames of the call stack, as shown by the traceback function.
options(error = recover)
read.csv("nosuchfile")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message
In file(file, "rt") :
cannot open file 'nosuchfile': No such file or directory
Enter a frame number, or 0 to exit
1: read.csv("nosuchfile")
2: read.table(file = file, header = header, sep = sep, quote = quote, dec =
3: file(file, "rt")
Selection:
You can enter 1-3 in this case to browse the environment of a particular frame. Hit enter to go back to the menu, and enter 0 to exit.
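Since the option is global for the session, it may be worth restoring the previous handler afterwards. A sketch of one way to do that: options returns the old values invisibly, and passing them back restores them.

```r
# options() returns the previous settings, so they can be restored later
old <- options(error = recover)
handler_is_recover <- identical(getOption("error"), recover)

# ... reproduce the error interactively here ...

# put back whatever the error handler was before
options(old)
restored <- identical(getOption("error"), old$error)

c(handler_is_recover, restored)
```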