This brief overview introduces some of the key elements, intended for those who have limited experience with R. It introduces R objects, basic operations, storage modes, and functions. Concepts include vectorization.

R for algorithm development

Algorithm development in this course will exploit a key feature of “base R” called vectorization. By vectorizing operations, the language R allows a user to operate at a level that is intermediate between low-level languages like FORTRAN and C, which specify each operation with a separate line of code, and point-and-click (or drag-and-drop) software that bury algorithms behind a user interface. Vectorization allows for development of basic algorithms that are fast. To see this, here is low-level code that I could write to multiply by 2 the numbers from 1 to 10:

numbers <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) # create a vector
for(j in 1:10){                             # each element times 2
  numbers[j] <- numbers[j]*2
}

This code touches each element of the vector numbers using a for loop. Exploiting vectorization this becomes

numbers <- c(1:10)*2

The simplification comes with no loss of transparency. The advantages of vectorization will affect nearly everything we do in this course.

Where operations cannot be vectorized in R, the expanding capacity to link them to C++ code offers the R user an approach to generate fast code with the control of low-level programming.

In recent years, high-level packages like tidyverse have been added to R that bundle more of the common algorithmic operations within a novel syntax. This can simplify many tasks, but it comes with the disadvantage that most users never learn to exploit R’s greatest strength: vectorization. Because this course requires algorithm development I focus on base R, but I encourage use of other approaches where ever they can help.

The language R and the software RStudio have expanded throughout the sciences (and now, humanities), because vectorization allows a user to work at multiple levels. A computer’s central processing unit (CPU) recognizes binary inputs. We could call them zeros and ones. Or “no” and “yes”–they are qualitative, not quantitative. With enough of these zeros and ones it is possible to represent numbers, letters, and words. It is possible to represent tasks.

The development of computer languages has been guided, in part, by efforts to find an optimal distance between the user language and the machine language (translating zeros and ones). Low-level languages that persist to this day include FORTRAN and C. In fact, many of the underlying functions in R are written in these languages. A programmer can read and write operations in these languages with control over each operation. A compiler translates C into zeros and ones in a way that can be understood by the CPU. The compiled code is processed all at once. The advantages of control and fast processing comes with the cost that finding problems can be cumbersome. The compilation puts some distance between programmer and CPU. At the other extreme are high-level software packages that have limited adaptation to specific questions. Where possible, existing software should be exploited. If a study calls for linear regression there is no need to reinvent the wheel. However, most ecologists are commonly confronted with misaligned and unbalanced data. Many questions suggest models that have not been written before. They require new algorithms. Vectorization lets a user build algorithms at a low level, without costly loops, within a framework that admits many high-level operations.

About your computer

It’s worth understanding a few things about your computer that are relevant when writing algorithms to be used with large data sets. Speed is important. The CPU processes inputs, reported in billions of cycles per second, or gigahertz (GHz). The actual work that gets done further depends on the number of cores that simultaneously share tasks. In principal, more cores means higher capacity. However, efficiency keeps the total number of cores relatively small, as they may have to work together without tripping all over each other. Of course, a faster CPU is good idea, but at any given time, processing speed may not vary widely between laptops and workstations.

Where computers still differ importantly is in main memory. Traditionally called random access memory (RAM) but now extending to additional types, main memory sets the limits on what can be accessed immediately. It’s a bit like the facts that live in your head versus those that you “store” outside your brain, in books, the internet, and so on. When you need them, you look them up, and many will be forgotten by tomorrow or next week. For the computer, this external storage is the disk space. Main memory is managed in part by writing (and then deleting) temporary files. Operations that can be managed with main memory are generally faster than those that require writing and reading from the disk. Part of the time it takes to start up involves reading from files that store information from the last time the machine was open. Likewise, shut down requires time as selected information that will be needed in the next session is written to disk. Web sites also write files to your disk, so they do not have to store all information on millions of users. Management of the potentially huge clutter than can accumulate from this process is both a hardware and a software challenge. For computation in this course, it is helpful to have main memory of 32, 64 gigabytes (GB), or more. Main memory remains relatively cheap, so pay attention to RAM when purchasing a computer that you expect to use for data analysis.

With this background, let’s learn about R.

Basic syntax

The symbol > at the beginning of a line means that R is ready to receive input. Lines beginning with a number in brackets, e.g. [1], are responses from R. The following example of three lines includes two I entered followed by a response from R:

z <- 5  #assign the value 5 to ‘z’
z

## [1] 5

The first line contains the assignment operator <-, which sets the left-hand side equal to the right-hand side. In the first line I define a variable named z, and I assign to it the value 5. The # symbol indicates that a comment follows. R will not execute anything to the right of the symbol #. In the second line I entered the name of the variable, and that value was displayed on the console. The index [1] seems pointless here, but it’s useful for arrays as an indicator of location (see below).

`list` and `data.frame`

The list storage mode offers one way to hold a heterogeneous collection of objects. In the forest inventory example for next time, each tree has attributions including diameter (numeric), leaf habit (factor), and species (character). Here are these attributes for two trees:

diameter  <- c(5.7, 82)
species   <- c('red maple', 'white pine')
leafHabit <- as.factor( c('deciduous','evergreen') )
treeList  <- list( spec = species, diam = diameter, leaf = leafHabit )        # as a list
treeData  <- data.frame( species, diameter, leafHabit )  # as a data frame

I have organized these objects first as a list, where the attributes are given the names spec and diam. If I enter treeList it will be displayed with the names sequentially. I can refer to the second I object as treeList$diam.

I then organize the same objects into a data.frame, which is a specific type of list where each object has the same length. For this reason, it will be displayed as a table with the objects species, diameter, and leafHabit as column headings:

knitr::kable(treeData)

species	diameter	leafHabit
red maple	5.7	deciduous
white pine	82.0	evergreen

Both lists and data.frames are widely used in R.

Functions

A function performs one or more operations on one or more arguments. If I want to repeat the value of z three times I could use the rep function with arguments for z and 3 and get this response: rep(z, times = 3) = 5, 5, 5. The function name is followed by parentheses. Arguments are inside the parentheses.

If I forget the arguments that are required for rep I can use help(rep). Note here that help is the name of the function and it takes another function name as its argument. If I forget what the help function does, I can enter help(help).

When I consult the help page for rep, the first section Usage says this:

rep( x, ...)

Where it asks for x, I sent a specific value, which I have called z. If I do not give a name to the first object, it will assign it a name x. Outside the function, I call this variable z. An equivalent way to call this function would be this:

rep( x = z, times = 3 )

The ellipsis in the call to rep means that it will accept a number of arguments that, if not specified, default to preset values. For example, if I want to repeat each of the values in x three times I could do this:

rep( x = c(1,2), each = 3 ) = 1, 1, 1, 2, 2, 2

In the case of the rep function I need to specify at least one of the options listed on the help page. Otherwise, it simply returns x.

A function might return any type of R object, e.g., a value, a vector, a list, or even another function.
The help page for the function describes the Arguments needed and the Value that will be returned.

Not all functions require an argument. For example, the function ls() (no argument) returns a character vector with the names of objects that are in my current work environment.

Order and default values

The syntax for using the function that generates a normal (Gaussian) density is given in the help page this way:

dnorm(x, mean = 0, sd = 1, log = FALSE)

I see four arguments. The first argument is a vector of quantiles x. It is not assigned a default value, so I must supply it. The remaining arguments are each assigned defaults. If I do not supply them, they will be assigned these values.

If I want to assign values to mean and sd I can do it without names, provided that I use the order assumed by the function: first mean, then sd:

dnorm(x, 58.2, 3.8)

If I want to supply them in some other order, I have to name them, e.g.,

dnorm(x, sd = 3.8, mean = 58.2)

Because I did not specify it in either of these examples, the default log = FALSE applies.

For a combination of assigned and preset values I could this:

dnorm(x, log = TRUE )

which is the same as dnorm(x, mean = 0, sd = 1, log = TRUE). If I say dnorm(x, 3.8), it will assume that the value 3.8 refers to the mean, not the sd.

Further arguments

In the help page for the function rep I saw this:

rep( x, ...)

The help page explained that ... represents “further arguments” followed by some examples like times, length.out, each. The help page may not list all possible arguments that could be used here. In this case, googling the function may help.

Indexing arrays

Numbers are often collected into an array. The difference between a vector, matrix, and array is in how they are rendered to the console and some attributes to recognize lengths in 1, 2, and 3 dimensions. Locations in arrays are indexed by square brackets. The array dimensions are separated by commas. For a one-dimensional array, a vector, there are no commas. Here I construct a vector with even numbers from –2 to 8 to which I assign the name xvec,

xvec <- seq( -2, 9, by = 2 )  # sequence of values
xvec

## [1] -2  0  2  4  6  8

The function seq( from, to, by ) creates the vector. When I entered the name of the vector xvec, it is displayed on the console. If I want only the 3rd element of this vector, I enter

xvec[3]  # the third value of a vector

## [1] 2

If I want the last three elements, I enter

xvec[4:6] # the 4th, 5th, and 6th values

## [1] 4 6 8

If I want the $j^{th}$ element of xvec, I assign a value to j and then ask for that value

j <- 5
xvec[j]   # the jth value of xvec

## [1] 6

To determine the length of the vector use the length function:

length( xvec )

## [1] 6

I could also create a vector with the concatenate function c,

zvec <- c( 5.2, 7, -15 )  #define a vector with 3 elements
zvec

## [1]   5.2   7.0 -15.0

To construct a 2-dimensional array, or matrix, I could use the matrix function

n <- 2
m <- 3
xmat <- matrix( xvec, nrow = n, ncol = m, byrow = T ) # 2 by 3 matrix
xmat

##      [,1] [,2] [,3]
## [1,]   -2    0    2
## [2,]    4    6    8

In my call to the R function matrix, I said to construct a 2 by 3 matrix from the vector having the name xvec (the length-6 vector), filling the matrix by rows. The default is to fill a matrix by columns (start with column 1, then column 2,…). The dimension of xmat is

dim( xmat )

## [1] 2 3

The 2nd column is

xmat[,2]

## [1] 0 6

and the element in row 1, column 3 is

xmat[1,3]

## [1] 2

The [j,k]th element of xmat is

i <- 1
j <- 3
xmat[i,j]

## [1] 2

The matrix consisting of the first and third columns is

xmat[ ,c(1,3) ]

##      [,1] [,2]
## [1,]   -2    2
## [2,]    4    8

Here is a diagram showing row [i,] = 1, column [,j] = 3, and then element [i,j]:

Equivalently, I could get element [i,j] this way:

ij <- cbind(i,j)  #bind two values, each as a column
ij

##      i j
## [1,] 1 3

Now I can use this two-column matrix to specify values in xmat like this:

xmat[ ij ]

## [1] 2

Here I have bound together two values to construct a row vector ij, which R will read as the two dimensions of xmat, the row and column, respectively.

I can also construct a matrix by binding together vectors, provided they have the same length. For example, the same matrix xmat can be constructed by binding two rows created with the concatenate function c,

rbind( c(-2, 0, 2), c(4, 6, 8) )  # bind 2 row vectors

##      [,1] [,2] [,3]
## [1,]   -2    0    2
## [2,]    4    6    8

or by binding together three columns

cbind( c(-2,4), c(0,6), c(2,8) ) # bind 3 columns

##      [,1] [,2] [,3]
## [1,]   -2    0    2
## [2,]    4    6    8

R executes the inner functions first, in this case the calls to function c, followed by the outer call to function rbind or cbind. I now say more about functions.

Try this now: Construct a matrix with n rows and m columns filled with random numbers drawn from a uniform distribution between -1 and 1 using the function runif.

Loops and vectorization

Vectorization is one of the most important features of R, allowing one to operate on elements of a vector without programming loops. Loops are the standard way to apply operations to elements of arrays sequentially. Loops are fast in (low-level) compiled languages, but slow in the interactive language R. Vectorization internalizes loops (in compiled code) so they do not slow execution.

Consider a population of individuals, each of which can be reproductive (“1”) or not (“0”) depending on their differing developmental stages. The mature state 1 is reached in stage 3. I wish to simulate developmental stage $[1, \dots, 5]$ for n = 10 individuals and assign each the correct reproductive state $[0, 1]$.

First, I generate a random length-n vector that I will call stage using the sample function. I define a second vector repro of binary indicators (zero or one) for whether or not a stage is reproductive. I want to create a vector y that holds the reproductive states for each individual in the sample. Here is a for loop to inspect the stage of each individual, determine if it is reproductive, and, if so, assign it a value of 1:

n      <- 10                             # sample size
nstage <- 5                              # no. developmental stages
stage  <- sample(nstage, n, replace = T) # randomly assign a stage to each observation
repro  <- c(0,0,1,1,1)                   # early stages not reproductive
y      <- stage*0                        # initialize data
for(i in 1:n){                           # loop over i from 1 to n
  if(repro[ stage[i] ] == 1)y[i] <- 1    # determine if each is reproductive
}

If the first element of the vector stage = 4, then the first individual is assigned repro[stage[1]] = 1. Here is what I get for the entire vector:

print( rbind(stage, y) )

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## stage    4    5    4    5    3    1    2    3    2     5
## y        1    1    1    1    1    0    0    1    0     1

This loop executes fast, because there are only 10 individuals. It would be too slow if I had, say, $10^6$ individuals or if I had to repeat this loop thousands of times (nested loops).

Here is the same operation, where the for loop is replaced with a vector:

y <- repro[stage]

This alternative is vectorization. It is not only fast, but also simple. Use loops only when vectorization is not possible.

Try this now. Simulate exponential growth using a for loop and determine that simulated values agree with the growth rate used to generate those data.

Define n time steps and a growth rate drawn at random from rho <- runif(1, .9, 1.1)
Write a for loop where the next value of x will be rho times as large as the current value.
Verify growth rate.

Scatter plots and histograms

Scatter plots are generated with the function plot. Here is a plot of two simulated vectors:

n <- 500
x <- rnorm(n)
y <- rnorm(n, x, .4)
plot( x, y)

The function rnorm draws random values from a distribution. I need a way to visualize the sample.

Histograms

Histograms can be generated with the function hist. Here are histograms for the preceeding two vectors:

par(mfrow = c(1,2))
hist(x)
hist(y, nclass = 100)

Try this now: Simulate random body sizes for a population consisting of males and females where mean body size differs by gender, and the fraction that are gender 1 is a parameter p. Do this using i) a for loop and ii) by vectorization. Generate a histogram of body size.

Steps:

Assign a population size n
Draw random genders g using rbinom( n, 1, p)
Assign a mean size for two genders mu
Write a for loop where you check the gender of each individual and draw a random size using rnorm( 1, mu )
Vectorize this operation using g as an index for a length-2 vector mu

My functions

A function contains operations to be performed on objects that have been previously defined in the main program (they exist as objects in the user’s workspace) and/or on arguments passed to them. Nearly every operation in R can be viewed as a function, and I can write my own functions.

Just like the built-in R functions, each function that I write is identified by a name. Any arguments that will be passed to the function (as opposed to being available directly from the user’s workspace) are contained in parentheses that follow the name of the function. Examples of functions encountered in the previous section include seq, c, cbind, rbind, and matrix. Many functions are available in R and can be called by the user. I can also write my own functions. For example, if I want to take the mean of the elements in xvec, I can call the R function

mean(xvec)

## [1] 3

The argument sent to mean is xvec. mean(xvec) returns the answer 3. I can supply additional arguments to mean, which can, for example, handle missing cases. If I want to know the options, I use the help function,

help(mean)

I could also write my own function, such as

mymean <- function(x){sum(x)/length(x)}

This code assigns to the name mymean a function that takes an argument that is assigned within the function mymean the name x. The function mymean, in turn, calls two internal R functions sum and length to determine the mean of whatever vector has been passed to it. To determine the mean of xvec I call the function,

mymean(xvec)

## [1] 3

Due to the fact that xvec is the first (and, in this case, only) argument passed to mymean, it will be assigned the internal name x, a name that is shielded from the main program. Thus, operations within functions do not interact with objects in the user workspace.

Any object that is needed but not assigned within a function must either be passed or it must be available from the user’s workspace. If the object is not in either of these places the function will give up. Note that a function called by another function will not have available to it any objects created in the calling function, unless they are passed. Specifically, if I call a function that I have written a(), which generates an object r and then calls from within a() another function b() without passing r, then b() cannot know about r.

b <- function(r){r^2} # square a number

a <- function(m, p){  
  r <- m/p            # r is generated in the calling funciton, but b doesn't know that
  b()
}
a(7,2)               # this fails

a <- function(m, p){ # now I pass r
  b(m/p)
}
a(7,2)               # this works

Because objects in the home environment may not have the values that I think they do, I write functions to accept as arguments all of the objects that will be needed. I do not rely on them being available in the user’s workspace.

A function returns an object to the workspace. I can do this in a few ways. Here is a function to return the mean of x:

myMean <- function(x){
  mu <- mean(x)
  return( mu )     # return this object
}

Here is a call to this function:

x <- rnorm(1000)
myMean(x)

## [1] -0.007113672

However, I know from the previous definition of myMean that this will also work:

myMean <- function(x){
  mean(x)  # operation is not assigned, so it is returned directly
}

But this will not work:

myMean <- function(x){
  mu <- mean(x)  # assigned but not returned
}

If I do not assign the last operation to a name the function will return that last object.

Try this now: Write a function to calculate the mean and standard deviation of a vector and return them as a length-2 vector.

Basic matrix operations

The basic operations used for scalar quantities apply element-wise to numeric matrices of the same dimension. This only works if the matrices have the same dimensions. These operations include +, -, *, /:

a <- matrix( rnorm(20), 5, 4 )
b <- matrix( rnorm(20), 5, 4 )
c <- a/b
all.equal(a, b*c)  # recover matrix a

## [1] TRUE

Matrix multiplication is written as %*% and requires that the first matrix has the same number of columns as the number of rows in the second matrix. If $\mathbf{A}$ is $n \times k$ and $\mathbf{B}$ is $k \times m$, then $\mathbf{A} \times \mathbf{B}$ is $n \times m$:

n <- 6
k <- 3
m <- 4
A <- matrix( rnorm(n*k), n, k )
B <- matrix( rnorm(k*m), k, m )
C <- A%*%B
print(C)

##            [,1]        [,2]       [,3]       [,4]
## [1,]  2.2768143  0.35393760 -3.8235079  4.2895192
## [2,] -0.5131719  0.38662255 -0.3357246 -0.3785857
## [3,]  0.1020754 -0.08770148  1.0909963 -0.6363143
## [4,] -0.4870463 -0.09908917  2.1361214 -1.8283885
## [5,] -1.5934130 -0.64168348  2.9404518 -2.9756617
## [6,] -1.4585513 -0.22212895  2.7179952 -2.9385403

Recall from matrix algebra that element C[i,j] will be the sum I get from multiplying row i of A and column j of B.

i <- 2
j <- 3
sum( A[i,]*B[,j] )

## [1] -0.3357246

Here are these matrices:

I have highlighted row i of A, which is multiplied by column j of B and summed to give element [i,j] of C.

A diagonal matrix can be generated with the function diag. The identity matrix is a diagonal matrix with ones along the diagonal. The product of the identity matrix and another matrix is the same matrix:

I <- diag(k)
identical( A, A%*%I )

## [1] TRUE

Of course, I and A must have compatible dimensions. Look at I and A in this example.

Also, a square matrix multiplied by its inverse is the identity matrix:

A <- matrix( rnorm( 9 ), 3, 3 )  # a random square matrix
B <- solve( A )                  # its inverse
round( B%*%A, 10 )               # recover identity matrix

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

R objects have storage modes

The most common storage modes for R objects are numeric and character.

object	attributes
`vector`	a vector can hold `numeric`, `character`, or `factor` levels
`matrix`	a 2-dimensional `array`
`array`	use for more than 2 dimensions
`list`	objects that can be named or referenced as `[[1]]`, `[[2]]`, or by name
`data.frame`	a `list` of objects of equal length displayed together as columns
`factor`	a `vector` containing discrete levels

A data.frame is a type of list where objects are vectors of equal length but not necessarily of the same storage mode. When rendered to the console, a data.frame looks like a table, or matrix. However, because it is not stored as a matrix, it does not admit matrix operations like +, *, or %*%.

Functions for vectorization

`which` function

Consider the following task:

I am confronted with a table of values, some of which are missing. I want to initialize the missing values with the mean value for that column.

This problem arises when there are missing predictors in a design matrix. Sometimes a decision is made to fill those missing values with the mean for that predictor. To illustrate with an example I generate a random matrix and fill some of the values with NA, which indicates that a value is missing.

n <- 8                                     # rows
k <- 4                                     # columns
x <- matrix( round(rnorm(n*k), 2), n, k )  # a random matrix

miss <- sample(n*k, 7)                     # random locations for NA
x[ miss ] <- NA                            # set values in x to NA

Here is the matrix with missing values:

print(x)

##       [,1]  [,2]  [,3]  [,4]
## [1,] -1.47 -1.78  0.57  0.56
## [2,]  0.20  0.08  0.72 -0.87
## [3,]    NA -0.64  0.08    NA
## [4,]    NA -0.39 -0.74    NA
## [5,] -0.28 -0.55  0.34    NA
## [6,]  0.90  1.66  1.50  0.26
## [7,]  0.31    NA    NA -0.68
## [8,] -1.19  0.12  0.16  0.56

The matrix is displayed as rows and columns, but it is stored by stacking columns atop one another, referenced by numbers from 1 to n*k. Here is a diagram of matrix locations in x, highlighting the locations of NA values, using two different ways to reference location:

I know the locations of the NA values, because I put them there myself when I generated the vector miss = 3, 4, 15, 23, 27, 28, 29. These elements are shown at left. They could also be located using which( is.na(x) ) = 3, 4, 15, 23, 27, 28, 29. This location refers to how the matrix is stored ($1, \dots, nk$).

Matrix locations can also be referenced by row and column [i,j] as shown at right. I can get the locations in this format with the argument arr.ind = T:

missAI <- which( is.na(x), arr.ind = T ) # provide the array index [row, column]
print( missAI )

##      row col
## [1,]   3   1
## [2,]   4   1
## [3,]   7   2
## [4,]   7   3
## [5,]   3   4
## [6,]   4   4
## [7,]   5   4

This version of missAI holds the same information, but it is often preferable where I want to see column locations.

I now have a quick solution to the initial problem, filling the NA values in matrix x with mean values for that column

cmean <- colMeans( x, na.rm = T )         # column means
x[ missAI ] <- cmean[ missAI[,2] ]        # missing values to column mean in cmu
print( round(x, 2) )

##       [,1]  [,2]  [,3]  [,4]
## [1,] -1.47 -1.78  0.57  0.56
## [2,]  0.20  0.08  0.72 -0.87
## [3,] -0.26 -0.64  0.08 -0.03
## [4,] -0.26 -0.39 -0.74 -0.03
## [5,] -0.28 -0.55  0.34 -0.03
## [6,]  0.90  1.66  1.50  0.26
## [7,]  0.31 -0.21  0.38 -0.68
## [8,] -1.19  0.12  0.16  0.56

Here I highlight the values that were replaced:

Take a moment to see how I’ve used missAI as a two-column matrix to locate missing values in x and as a vector to select the right column mean from cmu.

`apply`, `tapply`, and related functions

Consider a population of individuals that differ in some numeric trait x between one another and over time. I want to summarize these differences by individual and year.

The two most common formats for environmental data are i) as observations-by-variables, in this case one row for each subject-time, and ii) as subjects-by-time, i.e., one row per subject and one column added each time it is observed. In the observations-by-variables I can store more variables by adding more columns. With subjects-by-time all columns hold the same variable.

R has some functions that simplify matrix and data.frame operations that help to move from one format to the other and to generate data summaries. To illustrate I simulate data on ni subjects each of which is observed nj times. Each observation on each subject is entered as a new line in data. For the trait x I assume that it results from differences between individuals xi, over time xj, and a residual:

ni <- 8            # no. individuals
nj <- 5            # no. observation times
xi <- rnorm(ni)    # individual differences
xj <- rnorm(nj)    # differences over time

# individual and time index for ni*nj observations
ind  <- rep( 1:ni, each = nj )
time <- rep( 1:nj, ni )

# simulated data set
x    <- xi[ind] + xj[time] + rnorm(ni*nj, 0, .1)   # vectorized effects + residual
data <- data.frame( ind, time, x )
head(data, 10)

These first few lines of data show the relationship between ind and time indexes.

Data stored as subjects (rows) over time (columns) can be easier to enter in the field where each new observation is placed on the same line as previous ones for the same subject. Here I use vectorization to fill an ind by time matrix with the observed x:

xmat <- matrix(NA, ni, nj,   # initialize the matrix with NA
               dimnames = list( paste('i', c(1:ni), sep = '' ), 
                                paste( 't', c(1:nj), sep = '' ) ) )   
ij   <- cbind(ind, time)     # vectorization to fill matrix
xmat[ ij ] <- x              # fill matrix
xmat

##            t1        t2         t3          t4        t5
## i1  0.7467917 2.4448315 -0.0923592  0.84365365 2.0300277
## i2 -0.1798948 1.5630569 -1.0844677  0.05818866 1.1654111
## i3  0.5599638 2.7562738 -0.3712804  1.07583189 2.1303586
## i4  0.1718194 1.9585385 -1.0662533  0.42247360 1.3071380
## i5 -0.3930947 1.2639592 -1.4372861 -0.19462458 0.7762615
## i6 -0.8777564 0.8594973 -1.7594136 -0.64854330 0.4667120
## i7 -0.4574536 1.5062559 -1.3830894  0.03846775 0.9562548
## i8  1.7830301 3.2750232  0.4721835  2.13395409 3.1634118

Individual and time differences can be obtained from either format. From the data vector I use the tapply function with the ind index to obtain means and standard deviations, first by individual and then by time. Here is by individual:

# individual mean +/- sd from data vector
imu <- tapply( x, ind, mean )
isd <- tapply( x, ind, sd )
print( round( cbind(imu, isd), 3) )

##      imu   isd
## 1  1.195 1.030
## 2  0.304 1.066
## 3  1.230 1.243
## 4  0.559 1.154
## 5  0.003 1.056
## 6 -0.392 1.058
## 7  0.132 1.142
## 8  2.166 1.145

Here is by time:

# time means +/- sd
tmu <- tapply( x, time, mean )
tsd <- tapply( x, time, sd )
print( round( cbind(tmu, tsd), 3) )

##      tmu   tsd
## 1  0.169 0.847
## 2  1.953 0.816
## 3 -0.840 0.766
## 4  0.466 0.872
## 5  1.499 0.885

For the xmat matrix, which is organized as subjects-by-time, I use rowMeans and apply. The apply function applies a function across the margin of an array, either rows (MARGIN = 1) or columns (MARGIN = 2). Here I use apply for the standard deviation sd. I could also use apply for the mean values, but I opt instead for the faster rowMeans and colMeans:

# individual mean +/- sd from matrix
imu <- rowMeans( xmat )
isd <- apply( xmat, 1, sd )
print( round( cbind(imu, isd), 3) )

##       imu   isd
## i1  1.195 1.030
## i2  0.304 1.066
## i3  1.230 1.243
## i4  0.559 1.154
## i5  0.003 1.056
## i6 -0.392 1.058
## i7  0.132 1.142
## i8  2.166 1.145

# time means +/- sd
tmu <- colMeans( xmat )
tsd <- apply( xmat, 2, sd )
print( round( cbind(tmu, tsd), 3) )

##       tmu   tsd
## t1  0.169 0.847
## t2  1.953 0.816
## t3 -0.840 0.766
## t4  0.466 0.872
## t5  1.499 0.885

The format change can be reversed. I recover x from xmat, again using vectorization:

identical( xmat[ ij ], x )

## [1] TRUE

Try this now: Generate a matrix x with missing values. Then replace with the missing values with the median value for that variable using the function apply( x, 2, quantile, .5, na.rm = T ).

The `match` function

The foregoing format changes and summaries were made easy by vectorization using indices that I had generated to simulate the data, the vectors ind and time. How do I generate these indexes from a data set that does not come with a fixed number of times per subject?

I can generate the index using the match function, which locates the matching entry for one vector in a second vector. For example,

a <- sample( letters, 5, replace = T )   # vector a holds 5 random letters
m <- match(a, letters)                   # m holds the first matches in the alphabet
print( rbind( a, letters[m] ) )          # it worked

##   [,1] [,2] [,3] [,4] [,5]
## a "g"  "r"  "i"  "g"  "y" 
##   "g"  "r"  "i"  "g"  "y"

In this example, match compared a random sample of 5 letters to the 26 letters of the alphabet. It returned the matching locations as the integer vector m.

In the block of code that follows I use match to compare the variable ind from data to a vector of unique individuals. The match ii holds one entry for each row in data (i.e., length(ind) = nrow(data)). I do the same for time:

# find the unique individuals and times
inds  <- sort( unique( data$ind ) )   # unique individuals
times <- sort( unique( data$time) )   # unique times

# match rows in data to inds and times
ii <- match( data$ind, inds )
jj <- match( data$time, times )
mm <- cbind( ii, jj )

identical( xmat[ mm ], x )           # it worked

## [1] TRUE

By combining the locations ii with jj in the matrix mm I was able to extract from matrix xmat the vector x.

Try this now: Write a function to transform a data vector into a subject-by-time matrix.

Steps:

Execute the code below to generate a random data set with individuals labeled by letters
Find unique values for subjects i and times t and initialize a matrix x with NA. x must have enough rows to accommodate all individuals (ni) and enough columns to accommodate all times (nj).
Use match to align i with unique subject letters and t to unique times
Fill the matrix with values of x
Wrap it in a function to return the matrix

n  <- 100
ni <- 8
nt <- 6
x  <- rnorm(100)
t  <- sample(nt, n, replace = T)
i  <- sample(letters[1:ni], n, replace = T)
id <- paste(i, t, sep = '_')
data <- data.frame(id, i, t, x)
data <- data[ !duplicated(data$id), ]               # remove duplicates
data <- data[ order(data$id), ]                     # order by sample id

Note that data has multiple columns, but the vector of interest is only the last column. Now go to step 2.

These are common operations, so I save functions to go both ways: observations as subject-time and subjects by time:

vec2Mat <- function( x, i, j){
  I <- sort(unique(i))
  J <- sort(unique(j))
  mat <- matrix( NA, length(I), length(J), dimnames = list( I, J) )
  mm <- cbind( match(i, I), match(j, J) )
  mat[ mm ] <- x
  mat
}
 
mat2Vec <- function( mat ){  # reverse: matrix back to vector
  
  if(is.null(colnames(mat)))colnames(mat) <- c(1:ncol(mat))
  if(is.null(rownames(mat)))rownames(mat) <- c(1:nrow(mat))
  
  i   <- rownames(mat)
  j   <- colnames(mat)
  ii  <- rep( i, length(j) )
  jj  <- rep( j, each = length(i) )
  vec <- mat[ cbind(ii, jj) ]
  w   <- which(is.finite(vec))
  df  <- data.frame( i = ii, j = jj, x = vec)
  df[w,]
}

mat <- vec2Mat( data$x, data$i, data$t )
vec <- mat2Vec( mat )

Try this now: Use the function vec2Mat to generate a subject-by-time matrix from data. Then use the function mat2Vec to recover data. Did it work? Why or why not?

R packages

Groups of related functions are packaged for use by the community and made available through the CRAN website, github, and elsewhere. You can build your own package for use by your own group. These are loaded like this:

install.packages( 'nameOfPackageIWant' )

Once installed I attach it to an environment using

library( nameOfPackageIWant )

To see what packages are loaded I can list them with the search function

search()

## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"

Once a package is installed, functions can be used without loading all functions in the package using packageName::functionName(). This has the advantage of reducing conflicts between functions having the same name in different packages.

Getting into a file

The range of different file types that can be interpreted in R is broad. The most common types that I interpret include the following:

file type	example function
`.txt`	`read.table`
`.csv`	`read.csv`
`.Rdata`	`load`
`.r`	`source`

Note that the function data.table::fread is a fast alternative to these traditional functions and requires limited qualification.

Efficiency in R

As with any computer language the best R programmers fumble around all the time. Efficiency depends on ways to find quick answers. There are at least two ways to go about this. Any errors that relate to the interpretation of specific R functions can be quickly checked with the help function:

help( nameOfConfusingFunction )

The second route is to google the error message. You will arrive at a handfull of web sites inhabited by knowledgable and helpful people who spend their free time answering questions like yours. You will almost always find that your errors have been through these web sites many times.

Optional challenge

This file has breeding-bird survey (BBS) data stored as observations-by-species, where an observation is a route-year combination. I want to explore how the abundance of an individual species changes over time, so I want to create a route-by-year matrix. Try this in two ways:

for an individual species
as a summary for all species

I load the data from a gitHub repository here. The data are held in a data.frame x (info about observations) and a matrix y (counts by species):

# load( "bbsExample.rdata", verbose = T )

library( repmis )
d <- "https://github.com/jimclarkatduke/gjam/blob/master/bbsExample.rdata?raw = True"
source_data(d)

head(x)                          # examine structure of data
head(y)[,1:5]                    # spp counts aligned with x, rownames are route_year

routes <- sort(unique(x$Route))  # unique routes
years  <- sort(unique(x$year))   # unique years

The NA values indicate routes that were not observed in a given year.

Try this: Here are steps you might follow. For a single species, this can work:

construct a matrix that can accommodate one route in each row and one year in each column
match the routes in x with routes created in the previous code block
match the years in x with years
select a species (column) in y and use the vectors generated by match to fill the new matrix
sum the counts by year using colSums

For all species, this can work:

Because I am going to sum over all columns in y, the route and year columns in x must be repeated for every column in y, e.g., rep( x$Route, ncol(y) ). These extended route and year indices can be used with tapply to generate a summed route by year matrix.
as for one species, the total count by year can be obtained using colSums

As an added challenge, express these counts as counts per effort (CPE), where effort is the number of routes in each year.

Examine trends for your species and for all species over time.

Getting started with R
some fundamentals
Duke University

Jim Clark

2025-01-09

R for algorithm development

About your computer

Basic syntax

`list` and `data.frame`

Functions

Order and default values

Further arguments

Indexing arrays

Loops and vectorization

Scatter plots and histograms

Histograms

My functions

Basic matrix operations

R objects have storage modes

Functions for vectorization

`which` function

The `match` function

R packages

Getting into a file

Efficiency in R

Optional challenge

Getting started with R some fundamentals Duke University

Jim Clark

2025-01-09

R for algorithm development

About your computer

Basic syntax

list and data.frame

Functions

Order and default values

Further arguments

Indexing arrays

Loops and vectorization

Scatter plots and histograms

Histograms

My functions

Basic matrix operations

R objects have storage modes

Functions for vectorization

which function

apply, tapply, and related functions

The match function

R packages

Getting into a file

Efficiency in R

Optional challenge

Getting started with R
some fundamentals
Duke University

`list` and `data.frame`

`which` function

`apply`, `tapply`, and related functions

The `match` function