These are my notes on the lectures for the Coursera R Programming course. I can't guarantee there are no mistakes or omissions. The goal has been to include every R example shown in the lectures, although there may be cases (such as with debugging) when the R code won't execute properly in an R Markdown document.

So far, week 3 is complete. I've gotten started on week 1 but it is far from complete. Nothing yet on weeks 2 or 4.

Week 1

Video 3: Getting help

Finding answers

Try to find an answer by searching archives of forum you plan to post to
Try to find an answer by searching web
Try to find an answer by reading manual
Try to find an answer in FAQ
Try to find an answer by inspection/experimentation
Try to find an answer by asking skilled friend
If you're a programmer, try to find an answer by reading source code

Example of an error message:

library(datasets)
data(airquality)
cor(airquality)
Error in cor(airquality) : missing observations in cov/cor

Google the phrase “Error in cor(airquality) : missing observations in cov/cor”

Asking questions

What steps reproduce the problem?
What is expected output?
What do you see instead?
What version of product (R, packages) are you using?
What operating system?
Any other relevant info

Make the header/subject of question as specific as possible, rather than a general plea for help.
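
The version and operating system details asked for above can be collected from within R. This is a minimal sketch using only base R functions; the exact output depends on your installation, and stats is used as the example package only because it ships with R.

```r
# R version as a single string, convenient for pasting into a question
r_version <- R.version.string
print(r_version)

# Operating system, locale, and attached packages with their versions
info <- sessionInfo()
print(info)

# Version of one specific package (stats is just an example)
pkg_version <- as.character(packageVersion("stats"))
print(pkg_version)
```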

Video 4: Data types part 1

Data types

Five basic/atomic data types

character
numeric (real numbers)
integer
complex numbers
logical (true/false)

Vectors

The most basic object is a vector

Can contain objects of only one class (numeric, integer, etc.)
The one exception is a list, which is represented as a vector but can contain objects of different classes
Empty (zero-length) vectors can be created with vector() function

Numbers

Numbers in R generally treated as numeric objects (double precision reals)
If you explicitly want an integer, append L to the number: 10L
The special number Inf represents infinity, e.g. 1/0; it can be used in ordinary calculations, e.g. 1/Inf is 0.
The value NaN represents an undefined value (“not a number”), e.g. 0/0; NaN can also be thought of as a missing value (more on that later)
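
The points about numbers can be tried at the prompt. This is a small sketch of my own, not code from the lecture:

```r
class(10)        # "numeric": a double-precision real
class(10L)       # "integer": the L suffix requests an integer
is.integer(10)   # FALSE, even though the value is whole

1 / 0            # Inf
1 / Inf          # 0
-1 / 0           # -Inf

0 / 0            # NaN
is.nan(0 / 0)    # TRUE
is.na(0 / 0)     # TRUE: NaN also counts as a missing value
```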

Attributes

R objects can have attributes

names, dimnames
dimensions (e.g., matrices, arrays)
class
length
other user-defined attributes/metadata

Attributes of an object can be accessed using the attributes() function.
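
A short illustration of attributes (not from the lecture). The object names and the "units" attribute are made up for the example:

```r
m <- matrix(1:6, nrow = 2, ncol = 3,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
attributes(m)          # a list with $dim and $dimnames

v <- c(first = 1, second = 2)
attributes(v)          # a list with $names
length(v)              # 2

# A user-defined attribute; the name "units" is arbitrary
attr(v, "units") <- "cm"
attributes(v)          # now contains names and units
```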

Entering input

At the R prompt we can type expressions. The <- symbol is the assignment operator.

x <- 1
print(x)

## [1] 1

msg <- "hello"

The grammar of the language determines whether an expression is complete.

x <- ## Incomplete expression

Evaluation

When a complete expression is entered at the prompt, it is evaluated and the result is returned. The result may be auto-printed.

x <- 5  ## nothing printed
x  ## auto-printing occurs

## [1] 5

print(x)  ## explicit printing

## [1] 5

The : operator is used to create integer sequences.

x <- 1:20

Creating vectors

The c() function creates vectors.

x <- c(0.5, 0.6)  ## numeric
x <- c(TRUE, FALSE)  ## logical
x <- c(T, F)  ## logical (T and F are shortcuts for TRUE and FALSE)
x <- c("a", "b", "c")  ## character
x <- 9:29  ## integer
x <- c(1 + (0+0i), 2 + (0+4i))  ## complex (there is no further use of complex numbers in the course)

Using the vector() function:

x <- vector("numeric", length = 10)  # creates vector with 10 zeros

Vectors can contain values of only one type, such as numeric or character. If necessary, values are “coerced”. Example:

y <- c(1.7, "a")  ## 1.7 coerced to character '1.7'
y <- c(TRUE, 2)  ## TRUE coerced to numeric 1
y <- c("a", TRUE)  ## TRUE coerced to character 'TRUE'

You can also coerce values explicitly. There are many more functions similar to the examples below.

x <- 0:6
class(x)

## [1] "integer"

as.numeric(x)

## [1] 0 1 2 3 4 5 6

as.logical(x)

## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

as.character(x)

## [1] "0" "1" "2" "3" "4" "5" "6"

Nonsensical coercion results in NAs.

x <- c("a", "b", "c")
as.numeric(x)

## Warning: NAs introduced by coercion

## [1] NA NA NA

as.logical(x)

## [1] NA NA NA

as.complex(x)

## Warning: NAs introduced by coercion

## [1] NA NA NA

Matrices

Matrices are vectors with a dimension attribute. The dimension attribute is itself a vector of length 2 (nrow, ncol)

m <- matrix(nrow = 2, ncol = 3)
m

##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA

dim(m)

## [1] 2 3

attributes(m)

## $dim
## [1] 2 3

Matrices are constructed column-wise, so entries can be thought of as starting in “upper left” corner and running down the columns

m <- matrix(1:6, nrow = 2, ncol = 3)
m

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Matrices can also be created directly from vectors by adding a dimension attribute.

m <- 1:10
dim(m) <- c(2, 5)
m

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10

(For the above to work, 2 times 5 must equal the length of m.)

cbind and rbind

Matrices can be created by column-binding or row-binding

x <- 1:3
y <- 10:12
cbind(x, y)

##      x  y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12

rbind(x, y)

##   [,1] [,2] [,3]
## x    1    2    3
## y   10   11   12

Lists

Lists are a special type of vector that can contain elements of different classes. (Not noted on slide: each element can be any R object, such as a vector, a function or another list. A vector used as an element must still hold only one type of value, but the type can vary between elements. Note that an element may contain only a single value, as in the example below.)

x <- list(1, "a", TRUE, 1 + (0+4i))
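
To check the class of each element, and to see that elements can hold more than a single value, something like the following can be used. This is my own sketch; the element names in y are illustrative.

```r
x <- list(1, "a", TRUE, 1 + 4i)
sapply(x, class)   # "numeric" "character" "logical" "complex"
length(x)          # 4

# Elements can hold several values, or even another list
y <- list(numbers = 1:3, chars = c("a", "b"), nested = list(TRUE, 2))
str(y)
class(y$nested)    # "list"
```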

Factors

Factors are used to represent categorical data. Factors can be unordered or ordered. One can think of a factor as an integer vector where each integer has a label.

Factors are treated specially by modelling functions like lm() and glm()
Using factors with labels is better than using integers because factors are self-describing; having a variable that has values “Male” and “Female” is better than a variable that has values 1 and 2.

x <- factor(c("yes", "yes", "no", "yes", "no"))
x

## [1] yes yes no  yes no 
## Levels: no yes

table(x)  # creates a contingency table for a factor

## x
##  no yes 
##   2   3

unclass(x)

## [1] 2 2 1 2 1
## attr(,"levels")
## [1] "no"  "yes"

The order of the levels can be set using the levels argument for factor()

x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))  # 'yes' listed before 'no'
x

## [1] yes yes no  yes no 
## Levels: yes no

x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("no", "yes"))  # 'no' listed before 'yes'
x

## [1] yes yes no  yes no 
## Levels: no yes
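
The examples above are all unordered factors. For an ordered factor, a sketch along these lines should work, using the ordered argument of factor(). The variable name and values are made up:

```r
sizes <- factor(c("low", "high", "medium", "low"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)
sizes
levels(sizes)          # "low" "medium" "high"
as.integer(sizes)      # 1 3 2 1: the integer codes behind the labels
sizes[1] < sizes[2]    # TRUE: "low" ranks below "high"
table(sizes)
```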

Missing values

Week 3

Video 1

Looping functions

lapply: Loop over a list and evaluate a function on each element
sapply: Same as lapply but tries to simplify the result
tapply: Apply a function over subsets of a vector
mapply: Multivariate version of lapply

An auxiliary function split is also useful, particularly with lapply.

lapply

This function accepts a list, or a data structure it coerces to a list, and returns a list. It will send each element of the list to a specified function, and the returned list contains the output of that function for each element in the supplied list.

Arguments

X: The data structure. If not a list, coerced to a list using as.list.
FUN: The function to be applied.
…: Any/all other arguments are passed into FUN for each invocation.

x <- list(a = 1:5, b = rnorm(10))
# x is a list with two elements.  a is a vector with values 1 to 5 b is a
# vector with 10 random values from the standard normal distribution
lapply(x, mean)

## $a
## [1] 3
## 
## $b
## [1] -0.3021

# This returns two elements a is the mean of the elements in x$a b is the
# mean of the elements in x$b

Now pass a data frame instead of a list:

data <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
# data is not a list, but a data.frame, so it is coerced to a list
lapply(data, mean)

## $a
## [1] 2
## 
## $b
## [1] 5

Or a matrix:

data <- matrix(1:4, 2, 2)
lapply(data, mean)

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4

In the previous case, we simply got a mean of each cell in the matrix. Not useful. So we can specify a list containing each of the columns of a matrix.

data <- matrix(1:4, 2, 2)
lapply(list(data[, 1], data[, 2]), mean)

## [[1]]
## [1] 1.5
## 
## [[2]]
## [1] 3.5

We can of course specify functions other than mean. The runif function (random uniform distribution) will provide random values from the uniform distribution, by default between 0 and 1. The number of values returned is determined by the argument passed to it.

The output is one random value for the first element, two for the second, three for the third, and four for the fourth. Each of these counts matches the values in x.

x <- 1:4  # vector containing 1, 2, 3, 4
lapply(x, runif)

## [[1]]
## [1] 0.3595
## 
## [[2]]
## [1] 0.7483 0.2645
## 
## [[3]]
## [1] 0.31651 0.03357 0.06476
## 
## [[4]]
## [1] 0.1271 0.4602 0.1954 0.1891

Here is an example of including arguments after the function, which are passed to the function for each invocation. For example, the following passes min = 0 and max = 0 to each invocation of runif. Because both bounds are 0, every value returned is 0; passing max = 10 instead would return random values between 0 and 10 (instead of the default 0 and 1).

x <- 1:4
lapply(x, runif, min = 0, max = 0)

## [[1]]
## [1] 0
## 
## [[2]]
## [1] 0 0
## 
## [[3]]
## [1] 0 0 0
## 
## [[4]]
## [1] 0 0 0 0

The functions passed to lapply can be anonymous. That means they are defined right within the call to lapply, and don't have a name.

x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
lapply(x, function(elt) elt[, 1])  # returns the first column of each element in x

## $a
## [1] 1 2
## 
## $b
## [1] 1 2 3

sapply

The sapply function is very similar to lapply, but it will try to “simplify” the results returned if it can.

If the result is a list where every element in that list contains only a single element, then a vector is returned.
If the result is a list where every element is a vector of the same length, and that length > 1, then a matrix is returned
Otherwise a list is returned

x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
lapply(x, mean)  # lapply always returns a list

## $a
## [1] 2.5
## 
## $b
## [1] -0.3331
## 
## $c
## [1] 0.6532
## 
## $d
## [1] 4.993

sapply(x, mean)  # the output is four elements, each with a single number, so a vector is returned

##       a       b       c       d 
##  2.5000 -0.3331  0.6532  4.9931
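
The three simplification cases can be seen side by side with small inputs. This is my own sketch, not from the lecture:

```r
# Every result has length 1, so sapply returns a vector
v <- sapply(1:3, function(i) i^2)
v                      # 1 4 9

# Every result has the same length (2), so sapply returns a matrix
# with one column per input element
m <- sapply(1:3, function(i) c(i, i^2))
dim(m)                 # 2 3

# The results have different lengths, so sapply falls back to a list
l <- sapply(1:3, seq_len)
class(l)               # "list"
```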

Why the apply functions are important

You can write loops to do the same things as these functions, but it can require much more code. And without either an apply function or a loop, you can't do something like this:

x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
mean(x)

## Warning: argument is not numeric or logical: returning NA

## [1] NA

The mean function expects a simple data structure like a vector, and it doesn't know how to handle a list.
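
For comparison, here is roughly what the loop version of lapply(x, mean) looks like. This is a sketch with fixed values so the result is reproducible:

```r
x <- list(a = 1:4, b = c(2, 4, 6), c = c(10, 20))

# Loop version: preallocate a list, fill it, then copy the names over
result <- vector("list", length(x))
for (i in seq_along(x)) {
    result[[i]] <- mean(x[[i]])
}
names(result) <- names(x)

# lapply version: one line, same result
identical(result, lapply(x, mean))   # TRUE
```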

Video 2

apply

apply is used to evaluate a function (often an anonymous one) over the margins of an array.

It is most often used to apply a function to rows or columns of a matrix
It can be used with general arrays, e.g. taking the average of an array of matrices
It is not really faster than writing a loop, but it works in one line

str(apply)

## function (X, MARGIN, FUN, ...)

Arguments:

X: an array
MARGIN: a vector indicating which array margins should be retained
FUN: the function to be used
…: any arguments to be passed to each invocation of the function

A “margin” in an array is one of its dimensions. So if a matrix is of size 5x4, it has two dimensions. The first is rows, and the second is columns. So 1 is the value to refer to the “margin” for rows, and 2 to refer to the “margin” for columns.

Note that arrays can have three or more dimensions.

x <- matrix(rnorm(200), 20, 10)  # creates a 20x10 matrix with random values
apply(x, 2, mean)  # returns the mean for each column (margin=2) in x

##  [1]  0.53518  0.12483  0.23964  0.10299  0.20221 -0.05152 -0.22057
##  [8]  0.24382  0.10032  0.42731

apply(x, 1, mean)  # returns the mean for each row (margin=1) in x

##  [1]  0.50576 -0.09749  0.25340 -0.16674  0.45362  0.61261  0.17138
##  [8] -0.11008  0.38748 -0.31078  0.38368 -0.43272 -0.05969  0.44327
## [15]  0.24876  0.45502  0.33899  0.33950  0.45171 -0.45926

Note that R has built-in functions for getting the sums and means of the rows or columns of a matrix, and these should be preferred:

rowSums is equivalent to apply(x, 1, sum)
rowMeans is equivalent to apply(x, 1, mean)
colSums is equivalent to apply(x, 2, sum)
colMeans is equivalent to apply(x, 2, mean)
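
The equivalences can be verified on a small matrix. This is a sketch of my own; all.equal is used rather than identical because the built-in functions return doubles even for an integer matrix.

```r
x <- matrix(1:6, nrow = 2, ncol = 3)

all.equal(rowSums(x),  apply(x, 1, sum))    # TRUE
all.equal(rowMeans(x), apply(x, 1, mean))   # TRUE
all.equal(colSums(x),  apply(x, 2, sum))    # TRUE
all.equal(colMeans(x), apply(x, 2, mean))   # TRUE
```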

Here is an example getting the 25th and 75th percentiles of each row of a matrix, using the quantile function.

x <- matrix(rnorm(200), 20, 10)
apply(x, 1, quantile, probs = c(0.25, 0.75))

##        [,1]    [,2]     [,3]    [,4]    [,5]    [,6]    [,7]    [,8]
## 25% -1.2047 -0.2604 -0.79300 -0.3014 -1.0702 -0.4210 -0.5916 -0.4372
## 75%  0.5207  1.1154  0.07974  0.3564  0.3464  0.3111  0.5982 -0.1266
##        [,9]   [,10]   [,11]   [,12]   [,13]   [,14]   [,15]   [,16]
## 25% -0.1355 -0.5876 -0.4634 -0.8932 -0.5595 -0.2733 -0.1801 -0.6549
## 75%  0.3864  1.0244  0.6058  0.5012  0.3128  0.6266  0.5659  1.0032
##       [,17]   [,18]   [,19]   [,20]
## 25% -0.3744 -0.2516 -0.2301 -0.2818
## 75%  0.1464  0.8058  0.6022  0.4956
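
Note the shape of the result: when the function returns more than one value, apply puts each result in a column, even when the margin is rows. A small sketch with fixed values (my own example) makes this easier to see:

```r
x <- matrix(1:20, nrow = 4, ncol = 5)   # 4 rows, 5 columns

q <- apply(x, 1, quantile, probs = c(0.25, 0.75))
dim(q)        # 2 4: one row per probability, one column per row of x
rownames(q)   # "25%" "75%"

t(q)          # transpose to get one row per row of x
```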

Arrays can have more than two dimensions. So you can specify a vector for the margins, containing all of the dimensions over which a function should be applied.

a <- array(rnorm(2 * 2 * 10), c(2, 2, 10))  # 2x2x10 array
apply(a, c(1, 2), mean)  # get the mean for all of the values in each combination of the first two dimensions

##        [,1]    [,2]
## [1,] 0.5505  0.4118
## [2,] 0.1098 -0.1189

rowMeans(a, dims = 2)  # returns same result

##        [,1]    [,2]
## [1,] 0.5505  0.4118
## [2,] 0.1098 -0.1189

Video 3: tapply

tapply is used to apply a function over subsets of a vector. Parameters:

X: a vector
INDEX: a factor or a list of factors (or else they are coerced to factors)
FUN: function to be applied
…: arguments to be passed to FUN
simplify: whether to simplify the result

Think of this as a way to group parts of a vector. Example: a vector contains data on 100 people, the first 50 of which are men, the second 50 are women. You want to apply some function to those two groups separately.

The following example creates a vector of 30 numbers, splits it into three groups, and gets the mean of each group.

(The gl function generates a factor with a specified number of levels.)

x <- c(rnorm(10), runif(10), rnorm(10, 1))  # vector of 30 random numbers:
# the first 10 are from the standard normal distribution, the second 10 are
# from the uniform distribution, and the last 10 are from a normal
# distribution with a mean of 1 (instead of the default 0)
f <- gl(3, 10)  # f is a factor with 3 levels, each with 10 values
f

##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
## Levels: 1 2 3

tapply(x, f, mean)

##       1       2       3 
## 0.01055 0.45760 1.42804

By default, tapply simplifies the result, but this can be turned off.

tapply(x, f, mean, simplify = FALSE)  # returns list

## $`1`
## [1] 0.01055
## 
## $`2`
## [1] 0.4576
## 
## $`3`
## [1] 1.428

In another example, you can get the ranges for specified groups in a vector. The range function returns two values for its input, the lowest and highest values in the input. The output is a list, because each group yields a two-element vector, which tapply does not simplify.

tapply(x, f, range)

## $`1`
## [1] -1.308  1.659
## 
## $`2`
## [1] 0.01472 0.88373
## 
## $`3`
## [1] -0.4889  3.1764

Note that “simplify” is different for sapply and tapply. In the previous example, tapply returned a list with three elements, each of which was a two-element vector. sapply would have returned that as a matrix. tapply returns either a vector or a list. This distinction is made in the documentation for the two functions.
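
The difference in simplification can be seen by running range through both functions. This is my own sketch with fixed values; split is used here only to give sapply the same groups.

```r
x <- c(1, 5, 3, 10, 20, 15)        # fixed values so the result is reproducible
f <- gl(2, 3)                      # two groups of three

r_t <- tapply(x, f, range)         # list: one two-element vector per group
r_s <- sapply(split(x, f), range)  # matrix: rows are min and max, columns are groups

r_t[[1]]                           # 1 5
r_s
```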

Video 4: split

split takes a vector or other object and splits it into groups determined by a factor or list of factors. Arguments:

x: a vector, list or data frame
f: a factor (or coerced to factor) or a list of factors
drop: indicates whether empty factor levels should be dropped

This means that you can split data into groups, then pass it to lapply or sapply. You don't have to use tapply.

x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)
split(x, f)

## $`1`
##  [1]  1.59825 -0.47098 -2.02568 -1.10501  0.08255 -0.07554  0.01834
##  [8]  0.55490 -1.21988  0.23186
## 
## $`2`
##  [1] 0.9166755 0.8547015 0.6513598 0.3441858 0.0008893 0.3269521 0.2141416
##  [8] 0.9718464 0.6278163 0.7504628
## 
## $`3`
##  [1] -0.40125  1.44140  0.29018  2.36271  0.06213  0.24409  1.40883
##  [8]  3.19144  0.06292  1.27034

Now the result can be passed to lapply or sapply. But there is no advantage in the following example over using tapply.

lapply(split(x, f), mean)

## $`1`
## [1] -0.2411
## 
## $`2`
## [1] 0.5659
## 
## $`3`
## [1] 0.9933

split becomes more useful when we are dealing with more complicated objects such as a data frame.

library(datasets)  # get some data
head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

We might want to calculate the means for ozone, solar, wind, etc. by month.

s <- split(airquality, airquality$Month)
lapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))

## $`5`
##   Ozone Solar.R    Wind 
##      NA      NA   11.62 
## 
## $`6`
##   Ozone Solar.R    Wind 
##      NA  190.17   10.27 
## 
## $`7`
##   Ozone Solar.R    Wind 
##      NA 216.484   8.942 
## 
## $`8`
##   Ozone Solar.R    Wind 
##      NA      NA   8.794 
## 
## $`9`
##   Ozone Solar.R    Wind 
##      NA  167.43   10.18

So in the original data frame, Month was not a factor variable, but split coerced it to a factor in order to form the groups.
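
To see how split treats a grouping variable that is not a factor, one can compare the class of the Month column with the names of the groups. This is a sketch of my own:

```r
library(datasets)

class(airquality$Month)                  # "integer", not a factor
s <- split(airquality, airquality$Month)
names(s)                                 # "5" "6" "7" "8" "9"
sapply(s, nrow)                          # number of days in each month

# The original column is unchanged by split
class(airquality$Month)                  # still "integer"
```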

The reason we needed an anonymous function at this point was that we didn't want to apply the colMeans function to everything in s. If we did, we would also have means for month and day, which doesn't make much sense:

lapply(s, colMeans)

## $`5`
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      NA      NA   11.62   65.55    5.00   16.00 
## 
## $`6`
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      NA  190.17   10.27   79.10    6.00   15.50 
## 
## $`7`
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      NA 216.484   8.942  83.903   7.000  16.000 
## 
## $`8`
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      NA      NA   8.794  83.968   8.000  16.000 
## 
## $`9`
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      NA  167.43   10.18   76.90    9.00   15.50

Of course, we could take the above output and remove only the columns we want. But first, that would mean extra calculation to no purpose, and second, if one of the columns contained character data, calculating its mean would be nonsensical (and result in warnings).

We could use sapply instead of lapply to simplify the result to a matrix.

sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))

##             5      6       7     8      9
## Ozone      NA     NA      NA    NA     NA
## Solar.R    NA 190.17 216.484    NA 167.43
## Wind    11.62  10.27   8.942 8.794  10.18

We can also pass na.rm = TRUE to colMeans to exclude NA values from the specified columns. Note that this does not remove entire rows where any one of the columns has NA. It removes values on a column-by-column basis. So if row 1 has an NA for “Ozone”, then it is not included in calculating the mean for Ozone, but if the Wind column has a value, that row is still used in calculating the mean for Wind.

sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))

##              5      6       7       8      9
## Ozone    23.62  29.44  59.115  59.962  31.45
## Solar.R 181.30 190.17 216.484 171.857 167.43
## Wind     11.62  10.27   8.942   8.794  10.18
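
The column-by-column behaviour of na.rm can be checked on a tiny data frame. This is my own sketch with made-up values; na.omit is used only for contrast, since it drops whole rows.

```r
df <- data.frame(Ozone = c(10, NA, 30),
                 Wind  = c(1, 2, 6))

# Column by column: Ozone uses rows 1 and 3, Wind uses all three rows
colMeans(df, na.rm = TRUE)        # Ozone 20, Wind 3

# Dropping whole rows first: row 2 is removed for both columns
colMeans(na.omit(df))             # Ozone 20, Wind 3.5
```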

Splitting on more than one level

A data frame may have more than one factor. Example: gender and race. You may want the mean for each combination of the levels in these factors.

x <- rnorm(10)
f1 <- gl(2, 5, labels = c("male", "female"))
f2 <- gl(5, 2, labels = c("white", "black", "asian", "latino", "other"))
f1

##  [1] male   male   male   male   male   female female female female female
## Levels: male female

f2

##  [1] white  white  black  black  asian  asian  latino latino other  other 
## Levels: white black asian latino other

interaction(f1, f2)

##  [1] male.white    male.white    male.black    male.black    male.asian   
##  [6] female.asian  female.latino female.latino female.other  female.other 
## 10 Levels: male.white female.white male.black female.black ... female.other

So there are 10 combinations of gender and race.

str(split(x, list(f1, f2)))  # use str to compactly display the structure of splitting x by the factors

## List of 10
##  $ male.white   : num [1:2] 0.169 -1.171
##  $ female.white : num(0) 
##  $ male.black   : num [1:2] 0.963 1.35
##  $ female.black : num(0) 
##  $ male.asian   : num -1.56
##  $ female.asian : num -0.54
##  $ male.latino  : num(0) 
##  $ female.latino: num [1:2] 0.288 -2.042
##  $ male.other   : num(0) 
##  $ female.other : num [1:2] -0.821 1.295

Some of the combinations have no values. For example, female.white. That is because f1 and f2 have no combination for those levels (compare the output of f1 and f2 above). So we can drop these.

str(split(x, list(f1, f2), drop = TRUE))

## List of 6
##  $ male.white   : num [1:2] 0.169 -1.171
##  $ male.black   : num [1:2] 0.963 1.35
##  $ male.asian   : num -1.56
##  $ female.asian : num -0.54
##  $ female.latino: num [1:2] 0.288 -2.042
##  $ female.other : num [1:2] -0.821 1.295

Video 5: mapply

The previous functions, lapply, sapply and tapply, only apply a function over a single object. mapply allows you to apply a function to more than one object (for example, more than one list, in contrast to lapply). Arguments:

FUN: the function to be applied. This function must accept at least as many arguments as there are objects to which you want to apply the function.
…: the objects to which the function should be applied.
MoreArgs: a list of other arguments to pass to each invocation of the function. This is equivalent to the … in lapply, sapply and tapply.
SIMPLIFY: whether the result should be simplified

Example: It is tedious to type the following.

list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))

So instead:

mapply(rep, 1:4, 4:1)

## [[1]]
## [1] 1 1 1 1
## 
## [[2]]
## [1] 2 2 2
## 
## [[3]]
## [1] 3 3
## 
## [[4]]
## [1] 4

“This is an artificial example.” Agreed.

noise <- function(n, mean, sd) {
    rnorm(n, mean, sd)
}
noise(5, 1, 2)

## [1] -0.3477  1.4204  2.4208  4.1725  2.9152

# this doesn't do what is expected, i.e., 1 random normal with mean 1, 2
# random normals with mean 2, etc. When the n value passed to rnorm has
# length > 1, the length is taken to be the number of values wanted.
noise(1:5, 1:5, 2)

## [1] 3.067 1.721 3.053 4.630 3.193

So we can use mapply to get what we want:

# the first call to noise has arguments 1, 1, 2; the second has 2, 2, 2; and so on
mapply(noise, 1:5, 1:5, 2)

## [[1]]
## [1] -2.588
## 
## [[2]]
## [1] -2.299  1.750
## 
## [[3]]
## [1] 2.845 5.872 2.780
## 
## [[4]]
## [1] 1.318 3.811 3.950 3.316
## 
## [[5]]
## [1] 5.969 5.049 4.815 3.600 6.465

“That is how I can instantly vectorize a function that doesn't allow for vector arguments.”

Note that rnorm does in fact allow a vector argument for n, it's just that it doesn't treat that argument the way we expect.
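
Both points can be checked by looking at lengths rather than the random values themselves. The sketch below (not from the lecture) also passes the constant sd through MoreArgs instead of as a third vectorized argument. The seed is arbitrary.

```r
noise <- function(n, mean, sd) {
    rnorm(n, mean, sd)
}

set.seed(1)   # arbitrary seed, only to make the run repeatable

# rnorm treats a vector n as "return length(n) values"
length(noise(1:5, 1:5, 2))        # 5

# mapply calls noise(1, 1, 2), noise(2, 2, 2), ..., noise(5, 5, 2);
# sd is constant, so it can be supplied once through MoreArgs
res <- mapply(noise, n = 1:5, mean = 1:5, MoreArgs = list(sd = 2))
sapply(res, length)               # 1 2 3 4 5
```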

Video 6: Debugging tools part 1

message: A generic notification/diagnostic message produced by the message function; execution of the function continues
warning: An indication that something is wrong but not necessarily fatal; execution of the function continues; generated by warning function
error: An indication that a fatal problem has occurred; execution stops; produced by stop function
condition: A generic concept for indicating that something unexpected can occur; programmers create their own conditions
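
A small sketch of the first three in one function. check_value is a hypothetical helper written for these notes, and tryCatch (base R, not covered in this video) is used only to show that an error is a condition that can be intercepted.

```r
check_value <- function(x) {              # hypothetical helper for illustration
    message("checking x = ", x)           # message: execution continues
    if (x < 0) {
        stop("x must not be negative")    # error: execution stops here
    }
    if (x == 0) {
        warning("x is zero")              # warning: execution continues
    }
    sqrt(x)
}

check_value(4)                            # prints the message, returns 2

# tryCatch intercepts the error and returns its message instead of stopping
caught <- tryCatch(check_value(-1),
                   error = function(e) conditionMessage(e))
caught                                    # "x must not be negative"
```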

Function returns a NaN, and writes a warning. Execution continues:

log(-1)

## Warning: NaNs produced

## [1] NaN

The following generates an error.

(invisible function: This will return a value from a function, without autoprinting it.)

printmessage <- function(x) {
    if (x > 0) 
        print("x is greater than zero") else print("x is less than or equal to zero")
    invisible(x)
}
printmessage(1)  # normal behavior

## [1] "x is greater than zero"

printmessage(NA)  # an error message

## Error: missing value where TRUE/FALSE needed

An attempt to fix the above:

printmessage2 <- function(x) {
    if (is.na(x)) 
        print("x is missing a value!") else if (x > 0) 
        print("x is greater than zero") else print("x is less than or equal to zero")
    invisible(x)
}
x <- log(-1)

## Warning: NaNs produced

printmessage2(x)

## [1] "x is missing a value!"
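
The message says “missing” even though x is NaN, because is.na(NaN) is TRUE. If the two cases need to be told apart, a variant along these lines could be used. describe is a hypothetical function of my own, and it returns the text instead of printing it:

```r
# Hypothetical variant of printmessage2 that reports NaN separately.
# is.nan must be tested before is.na, because is.na(NaN) is also TRUE.
describe <- function(x) {
    if (is.nan(x)) {
        "x is not a number"
    } else if (is.na(x)) {
        "x is missing a value!"
    } else if (x > 0) {
        "x is greater than zero"
    } else {
        "x is less than or equal to zero"
    }
}

describe(NA)                          # "x is missing a value!"
describe(suppressWarnings(log(-1)))   # "x is not a number"
describe(3)                           # "x is greater than zero"
```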

How do you know that something is wrong with your function?

What was your input? How did you call your function?
What were you expecting? Output, messages, other results?
What did you get?
How does what you got differ from what you were expecting?
Were your expectations correct in the first place?
Can you reproduce the problem (exactly)?

Video 7: Debugging tools part 2

Tools for debugging:

traceback: prints function call stack after an error occurs; does nothing if there is no error
debug: flags a function for “debug” mode which allows you to step through execution of a function one line at a time
browser: suspends the execution of a function wherever it is called and puts function in debug mode
trace: allows you to insert debugging code into a function at specific places
recover: allows you to modify the error behavior so that you can browse the function call stack

You can also insert print and cat statements into functions, but then they must be deleted later.

traceback function

rm(x)  # first remove x from environment
mean(x)  # this creates an error

## Error: object 'x' not found

traceback()

## No traceback available

Unfortunately, it appears that traceback does not work in R Markdown code, because such code runs in a non-interactive environment. So for the rest of these examples, I will show the code and output as plain text rather than running them in R blocks.

lm(y ~ x)
traceback()

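Output for the above looks like the following. (The captured output shows the formula as f ~ x rather than y ~ x, but the structure of the stack is the same.) traceback lists the most recent call first, so the stack reads from the bottom up: the first call made was lm (frame 1), and the error occurred in the deepest call, eval(expr, envir, enclos) (frame 7).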

7: eval(expr, envir, enclos)
6: eval(predvars, data, env)
5: model.frame.default(formula = f ~ x, drop.unused.levels = TRUE)
4: stats::model.frame(formula = f ~ x, drop.unused.levels = TRUE)
3: eval(expr, envir, enclos)
2: eval(mf, parent.frame())
1: lm(f ~ x)

debug function

The debug function flags a function so that it enters the interactive debugging environment whenever it is called. The next time the flagged function is called, R first prints out the entire body of the function.

debug(lm)
lm(y ~ x)

Partial output for the above. Note that after outputting the function's code, it presents a “Browse” prompt.

debugging in: lm(y ~ x)
debug: {
    ret.x <- x
    ret.y <- y
    cl <- match.call()
    ...
    if (!qr) 
        z$qr <- NULL
    z
}
Browse[2]>

The environment of this “browser” is the environment of the lm function. So at the start of the call, there is nothing in the environment except the function arguments, including any default arguments that might not be included in the call.

Now you can enter n at the prompt (for “next”) and the function executes one line at a time.

Browse[2]> n
debug: ret.x <- x
Browse[2]> n
debug: ret.y <- y
Browse[2]> n
debug: cl <- match.call()
Browse[2]> n
debug: mf <- match.call(expand.dots = FALSE)
Browse[2]> n
debug: m <- match(c("formula", "data", "subset", "weights", "na.action", 
    "offset"), names(mf), 0L)

The Environment tab in RStudio updates after each line, based on the line of the function that was just executed. Keep hitting n until the error occurs; now you have found exactly where it occurs.

Note that you can debug functions called within the debugged function. So you might call debug on the match function, which is part of the above output.

This section seems incomplete to me. No discussion of how to break out of debug, or the fact that it “sticks” for an entire session, or that there is a debugonce function (which debugs a function for only one invocation).
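
To fill part of that gap, here is a sketch of my own (not from the lecture) using base R's isdebugged and undebug on a throwaway function f. Nothing here opens the browser, because f is only called after the flag has been cleared.

```r
f <- function(x) x + 1    # throwaway function for illustration

debug(f)
isdebugged(f)             # TRUE: every call to f now opens the browser
undebug(f)
isdebugged(f)             # FALSE: f runs normally again
f(1)                      # 2

# debugonce(f) would flag f for the next call only.
# At the Browse prompt, c continues to the end of the function and Q quits.
```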

recover function

You can set the recover function to be the error handler using the options function. This sets a global option for an R session.

When recover is the error handler and an error occurs, it presents a menu in which the options are the call stack, as shown in the traceback function.

options(error = recover)
read.csv("nosuchfile")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message
In file(file, "rt") : 
  cannot open file 'nosuchfile': No such file or directory

Enter a frame number, or 0 to exit

1: read.csv("nosuchfile")
2: read.table(file = file, header = header, sep = sep, quote = quote, dec =
3: file(file, "rt")

Selection:

You can enter 1-3 in this case to browse the environment of the corresponding frame. Hit enter to go back to the menu, and enter 0 to exit.
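
Because the option is global for the session, it is worth restoring the previous handler when you are done. A minimal sketch, relying on the fact that options() returns the old values of whatever it changes:

```r
old <- options(error = recover)   # options() returns the previous settings
# ... reproduce the error and inspect the frames interactively ...
options(old)                      # restore whatever handler was set before
```

Calling options(error = NULL) instead resets to R's default error behavior.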

Summary

There are three main indications of a problem/condition: message, warning, error
Only an error is fatal
When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation
Interactive debugging tools traceback, debug, browser, trace and recover can be used to find problematic code in functions
Debugging tools are not a substitute for thinking