This brief overview introduces some of the key elements, intended for those who have limited experience with R. It introduces R objects, basic operations, storage modes, and functions. Concepts include vectorization.

Basic syntax

The symbol > at the beginning of a line means that R is ready to receive input. Lines beginning with a number in brackets, e.g. [1], are responses from R. The following example of three lines includes two I entered followed by a response from R:

z <- 5  #assign the value 5 to ‘z’
z

## [1] 5

The first line contains the assignment operator <-, which sets the left-hand side equal to the right-hand side. In the first line I define a variable named z, and I assign to it the value 5. The # symbol indicates that a comment follows. R will not execute anything to the right of the symbol #. In the second line I entered the name of the variable, and that value was displayed on the console. The index [1] seems pointless here, but it’s useful for arrays as an indicator of location (see below).

Functions

A function performs one or more operations on an one or more arguments. If I want to repeat the value of z three times I could use the rep function with arguments for z and 3 and get this response: rep(z, times = 3) = 5, 5, 5. The function name is followed by parentheses. Arguments are inside the parentheses.

If I forget the arguments that are required for rep I can use help(rep). Note here that help is also the name of the function and it takes another function name as its argument. If I forget what the help function does, I can enter help(help).

A function might return any type of R object, e.g., a value, a vector, a list, or even another function. The help page for the function describes the Arguments needed and the Value that will be returned.

Not all functions require an argument. For example, the function ls() (no argument) returns a character vector with the names of objects that are in my current work environment.

Indexing arrays

Numbers are often collected into an array. The difference between a vector, matrix, and array is in how they are rendered to the console and some attributes to recognize lengths in 1, 2, and 3 dimensions. Locations in arrays are indexed by square brackets. The array dimensions are separated by commas. For a one-dimensional array, a vector, there are no commas. Here I construct a vector with even numbers from –2 to 8 to which I assign the name xvec,

xvec <- seq( -2, 9, by=2 )  # sequence of values
xvec

## [1] -2  0  2  4  6  8

The function seq(from, to, by) creates the vector. When I entered the name of the vector xvec, it is displayed on the console. If I want only the 3rd element, I enter

xvec[3]  # the third value of a vector

## [1] 2

If I want the last three elements, I enter

xvec[4:6] # the 4th, 5th, and 6th values

## [1] 4 6 8

If I want the $j^{th}$ element of xvec, I assign a value to j and then ask for that value

j <- 5
xvec[j]   # the jth value of xvec

## [1] 6

To determine the length of the vector use the length function:

length(xvec)

## [1] 6

I could also create a vector with the concatenate function c,

zvec <- c(5.2, 7, -15)  #define a vector with 3 elements
zvec

## [1]   5.2   7.0 -15.0

To construct a 2-dimensional array, or matrix, I could use the matrix function

n <- 2
m <- 3
xmat <- matrix(xvec, nrow=n, ncol=m, byrow=T) # 2 by 3 matrix
xmat

##      [,1] [,2] [,3]
## [1,]   -2    0    2
## [2,]    4    6    8

In my call to the R function matrix, I said to construct a 2 by 3 matrix from the vector having the name xvec (the length-6 vector), filling the matrix by rows. The default is to fill a matrix by columns (start with column 1, then column 2,…). The dimension of xmat is

dim(xmat)

## [1] 2 3

The 2nd column is

xmat[,2]

## [1] 0 6

and the element in row 1, column 3 is

xmat[1,3]

## [1] 2

The [j,k]th element of xmat is

i <- 1
j <- 3
xmat[i,j]

## [1] 2

Here is a diagram showing row [i,], column [,j], and then element [i,j]:

Equivalently, I could get element [i,j] this way:

ij <- cbind(i,j)  #bind two values, each as a column
ij

##      i j
## [1,] 1 3

xmat[ ij ]

## [1] 2

Here I have bound together two values to construct a row vector jk, which R will read as the two dimensions of xmat, the row and column, respectively.

I can also construct a matrix by binding together vectors, provided they have the same lengths. For example, the same matrix xmat can be constructed by binding two rows created with the concatenate function c,

rbind( c(-2, 0, 2), c(4, 6, 8) )  # bind 2 row vectors

##      [,1] [,2] [,3]
## [1,]   -2    0    2
## [2,]    4    6    8

or by binding together three columns

cbind( c(-2,4), c(0,6), c(2,8) ) # bind 3 columns

##      [,1] [,2] [,3]
## [1,]   -2    0    2
## [2,]    4    6    8

R executes the inner functions first, in this case the calls to function c, followed by the outer call to function rbind or cbind. I now say more about functions.

`list` and `data.frame`

The list storage mode offers one way to hold a heterogeneous collection of objects. In the forest inventory example for next time, each tree has attributions including diameter (numeric), leaf habit (factor), and species (character). Here are these attributes for two trees:

diameter  <- c(5.7, 82)
species   <- c('red maple', 'white pine')
leafHabit <- as.factor( c('deciduous','evergreen') )
treeList  <- list( spec = species, diam = diameter, leaf = leafHabit )        # as a list
treeData  <- data.frame( species, diameter, leafHabit )  # as a data frame

I have organized these objects first as a list, where the attributes are given the names spec and diam. If I enter treeList it will be displayed with the names sequentially. I can refer to the second I object as treeList$diam.

I then organize the same objects into a data.frame, which is a specific type of list where each object has the same length. For this reason, it will be displayed as a table with the objects species, diameter, and leafHabit as column headings:

knitr::kable(treeData)

species	diameter	leafHabit
red maple	5.7	deciduous
white pine	82.0	evergreen

Both lists and data.frames are widely used in R.

Loops and vectorization

Vectorization is one of the most important features of R, allowing one to operate on elements of a vector without programming loops. Loops are the standard way to apply operations to elements of arrays sequentially. Loops are fast in (low-level) compiled languages, but slow in the interactive language R. Vectorization internalizes loops (in compiled code) so they do not slow execution.

Consider a population of individuals, each of which can be reproductive or not depending on their differing developmental stages. The mature state is reached in stage 3. I wish to simulate developmental stage $[1, \dots, 5]$ for 10 individuals and assign each the correct reproductive state $[0, 1]$.

Here is a simple for loop that prints to the console the index and exponential increase of a variable x at 1% per increment:

x <- 2.7
for(i in 1:5){
  x <- x*1.01
  print( c(i, x) )
}

## [1] 1.000 2.727
## [1] 2.00000 2.75427
## [1] 3.000000 2.781813
## [1] 4.000000 2.809631
## [1] 5.000000 2.837727

The index is an integer value i that is incremented over i in 1:5.

For the problem above, I begin by writing a for loop that sequentially evaluates the stage of each individual $i, \dots, n$ and assigns a reproductive state. First, I generate a random length-n vector that I will call stage using the sample function. I define a second vector repro of binary indicators (zero or one) for whether or not a stage is reproductive. I want to create a vector y that holds the reproductive states for each individual in the sample. Here is a for loop to inspect the stage of each individual, determine if it is reproductive, and, if so, assign it a value of 1:

n      <- 10                           # sample size
nstage <- 5                            # no. developmental stages
stage  <- sample(nstage, n, replace=T) # randomly assign a stage to each observation
repro  <- c(0,0,1,1,1)                 # early stages not reproductive
y      <- stage*0                      # initialize data
for(i in 1:n){                         # loop over i from 1 to n
  if(repro[ stage[i] ] == 1)y[i] <- 1  # determine if each is reproductive
}

If the first element of the vector stage = 1, then the first individual is assigned repro[stage[1]] = 0. Here is what I get for the entire vector:

print( rbind(stage, y) )

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## stage    1    4    4    5    4    5    3    2    4     3
## y        0    1    1    1    1    1    1    0    1     1

This loop executes fast, because there are only 10 individuals. It would be too slow if I had, say, $10^6$ individuals or if I had to repeat this loop thousands of times (nested loops).

Here is the same operation, where the for loop is replaced with a vector:

y <- repro[stage]

This alternative is vectorization. It is not only fast, but also simple. Use loops only when vectorization is not possible.

Try this now. Simulate exponential growth using a for loop and determine that simulated values agree with the growth rate used to generate those data.

Define n time steps and a growth rate drawn at random from rho <- runif(1, .9, 1.1)
Write a for loop where the next value of x will be rho times as large as the current value.
Verify growth rate.

Scatter plots and histograms

Scatter plots are generated with the function plot. Here is a plot of two simulated vectors:

n <- 500
x <- rnorm(n)
y <- x + rnorm(n,.01)
plot( x, y)

The function rnorm draws random values from a distribution. I need a way to visualize the sample.

Histograms

Histograms can be generated with the function hist. Here are histograms for the preceeding two vectors:

par(mfrow=c(1,2))
hist(x)
hist(y, nclass=100)

Try this now: Simulate random body sizes for a population consisting of males and females where mean body size differs by gender, and the fraction that are gender 1 is a parameter p. Do this using i) a for loop and ii) by vectorization. Generate a histogram of body size.

Steps:

Assign a population size n
Draw random genders g using rbinom( n, 1, p)
Assign a mean size for two genders mu
Write a for loop where you check the gender of each individual and draw a random size using rnorm( 1, mu )
Vectorize this operation using g as an index for a length-2 vector mu

My functions

A function contains operations to be performed on objects that have been previously defined in the main program (they exist as objects in the user’s workspace) and/or on arguments passed to them. Nearly every operation in R can be viewed as a function, and I can write my own functions.

Just like the built-in R functions, each function that I write is identified by a name. Any arguments that will be passed to the function (as opposed to being available directly from the user’s workspace) are contained in parentheses that follow the name of the function. Examples of functions encountered in the previous section include seq, c, cbind, rbind, and matrix. Many functions are available in R and can be called by the user. I can also write my own functions. For example, if I want to take the mean of the elements in xvec, I can call the R function

mean(xvec)

## [1] 3

The argument sent to mean is xvec. mean(xvec) returns the answer 3. I can supply additional arguments to mean, which can, for example, handle missing cases. If I want to know the options, I use the help function,

help(mean)

I could also write my own function, such as

mymean <- function(x){sum(x)/length(x)}

This code assigns to the name mymean a function that takes an argument that is assigned within the function mymean the name x. The function mymean, in turn, calls two internal R functions sum and length to determine the mean of whatever vector has been passed to it. To determine the mean of xvec I call the function,

mymean(xvec)

## [1] 3

Due to the fact that xvec is the first (and, in this case, only) argument passed to mymean, it will be assigned the internal name x, a name that is shielded from the main program. Thus, operations within functions do not interact with objects in the user workspace.

Any object that is needed but not assigned within a function must either be passed or it must be available from the user’s workspace. If the object is not in either of these places the function will give up. Note that a function called by another function will not have available to it any objects created in the calling function, unless they are passed. Specifically, if I call a function that I have written a(), which generates an object r and then calls from within a() another function b() without passing r, then b() cannot know about r.

b <- function(r){r^2}

a <- function(m, p){ # function b cannot know about r
  r <- m/p
  b()
}
a(7,2)               # this fails

a <- function(m, p){ # now I pass r
  b(m/p)
}
a(7,2)               # this works

Because objects in the home environment may not have the values that I think they do, I write functions to accept as arguments all of the objects that will be needed. I do not rely on them being available in the user’s workspace.

A function returns an object to the workspace. I can do this in a few ways. Here is a function to return the mean of x:

myMean <- function(x){
  mu <- mean(x)
  return( mu )     # return this object
}

Here is call to this function:

x <- rnorm(1000)
myMean(x)

## [1] 0.02274613

However, I know from the previous definition of myMean that this will also work:

myMean <- function(x){
  mean(x)  # operation is not assigned, so it is returned directly
}

But this will not work:

myMean <- function(x){
  mu <- mean(x)  # assigned but not returned
}

If I do not assign the last operation to a name the function will return that last object.

Try this now: Write a function to calculate the mean and standard deviation of a vector and return them as a length-2 vector.

Basic matrix operations

The basic operations used for scalar quantities apply element-wise to numeric matrices of the same dimension. This includes +, -, *, /:

a <- matrix( rnorm(20), 5, 4 )
b <- matrix( rnorm(20), 5, 4 )
c <- a/b
all.equal(a, b*c)  # recover matrix a

## [1] TRUE

Matrix multiplication is written as %*% and requires that the first matrix has the same number of columns as the number of rows in the second matrix. If $\mathbf{A}$ is $n \times k$ and $\mathbf{B}$ is $k \times m$, then $\mathbf{A} \times \mathbf{B}$ is $n \times m$:

n <- 6
k <- 3
m <- 4
A <- matrix( rnorm(n*k), n, k )
B <- matrix( rnorm(k*m), k, m )
C <- A%*%B
print(C)

##            [,1]        [,2]       [,3]       [,4]
## [1,] -1.5577034  0.32028842  1.4129541  0.4826290
## [2,] -1.6750248 -0.59794597  2.6374972  0.5906760
## [3,]  0.7597912 -0.25421742 -0.2383849 -0.2836977
## [4,] -2.4225748  0.32338237  2.6698078  0.7197284
## [5,] -1.9559941  0.16059894  2.3211554  0.5810423
## [6,] -2.6861332  0.08956355  2.1132421  1.0128224

Recall from matrix algebra that element C[i,j] will be the sum I get from multiplying row i of A and column j of B.

i <- 2
j <- 3
sum( A[i,]*B[,j] )

## [1] 2.637497

Here are these matrices:

I have highlighted row i of A, which is multiplied by column j of B and summed to give element [i,j] of C.

A diagonal matrix can be generated with the function diag. The identity matrix is a diagonal matrix with ones along the diagonal. The product of the identity matrix and another matrix is the same matrix:

I <- diag(k)
identical( A, A%*%I )

## [1] TRUE

Of course, I and A must have compatible dimensions. Look at I and A in this example.

R objects have storage modes

The most common storage modes for R objects are numeric and character.

object	attributes
`vector`	a vector can hold `numeric`, `character`, or `factor`
`matrix`	a 2-dimensional `array`
`array`	use for more than 2 dimensions
`list`	objects that can be named or referenced as `[[1]]`, `[[2]]`, or by name
`data.frame`	a `list` of objects of equal length bound together as columns
`factor`	a `vector` containing discrete levels

A data.frame is a type of list where objects are vectors of equal length but not necessarily of the same storage mode. When rendered to the console, a data.frame looks like a table, or matrix. However, because it is not stored as a matrix, it does not admit matrix operations like +, *, or %*%.

`which` function

I am confronted with a table of values, some of which are missing. I want to initialize the missing values with the mean value for that column.

This problem arises when there are missing predictors in a design matrix. Sometimes a decision is made to fill those missing values with the mean for that predictor.

To illustrate with an example I generate a random matrix and fill some of the values with NA, which indicates that a value is missing.

n <- 8                                     # rows
k <- 4                                     # columns
x <- matrix( round(rnorm(n*k), 2), n, k )  # a random matrix

miss <- sample(n*k, 7)                     # random locations for NA
x[ miss ] <- NA                            # set to NA
print( x )

##       [,1]  [,2]  [,3]  [,4]
## [1,] -0.38    NA    NA -1.33
## [2,] -0.17  1.73    NA    NA
## [3,] -0.33  2.74  0.79 -0.46
## [4,]  0.38    NA  0.51  0.05
## [5,] -0.33 -0.36 -0.11  0.12
## [6,]  0.40 -1.92    NA -0.71
## [7,]  0.28 -1.69    NA  0.82
## [8,]  0.05 -0.16 -0.52 -0.14

Here is diagram of matrix locations, highlighting the locations of NA values:

A matrix is stored by stacking columns atop one another, referenced by numbers from 1 to n*k. The NA values in matrix xoccupy the brown cells. I know the locations of the NA values, because I put them there myself when I generated the vector miss = 9, 12, 17, 18, 22, 23, 26. These elements are shown at left. They could also be located using which( is.na(x) ) = 9, 12, 17, 18, 22, 23, 26.

Matrix locations can also be referenced by row and column [i,j] as shown at right. In many cases it will be important to translate the location vector into paired coordinates that specify [row,column]. I can get them with the argument arr.ind = T:

miss <- which( is.na(x), arr.ind = T ) # provide the array index [row, column]
print( miss )

##      row col
## [1,]   1   2
## [2,]   4   2
## [3,]   1   3
## [4,]   2   3
## [5,]   6   3
## [6,]   7   3
## [7,]   2   4

This version of miss holds the same information, but it is often preferable where I want to see column locations.

I now have a quick solution to the initial problem, filling the NA values in matrix x with mean values for that column

miss  <- which( is.na(x), arr.ind = T )  # find missing values
cmean <- colMeans( x, na.rm=T )          # column means
x[ miss ] <- cmean[ miss[,2] ]           # missing values to column mean in cmu
print( round(x, 2) )

##       [,1]  [,2]  [,3]  [,4]
## [1,] -0.38  0.06  0.17 -1.33
## [2,] -0.17  1.73  0.17 -0.24
## [3,] -0.33  2.74  0.79 -0.46
## [4,]  0.38  0.06  0.51  0.05
## [5,] -0.33 -0.36 -0.11  0.12
## [6,]  0.40 -1.92  0.17 -0.71
## [7,]  0.28 -1.69  0.17  0.82
## [8,]  0.05 -0.16 -0.52 -0.14

Here I highlight the values that were replaced:

Take a moment to see how I’ve used miss as a two-column matrix to locate missing values in x and as a vector to select the right column mean from cmu.

`apply`, `tapply`, and related functions

Consider a population of individuals that differ in some numeric trait x between one another and over time. I want to summarize these differences by individual and year.

The two most common formats for environmental data are i) as observations-by-variables, in this case one row for each subject-time, and ii) as subjects-by-time, i.e., one row per subject and one column added each time it is observed.

R has some functions that simplify matrix and data.frame operations that help to move from one format to the other and to generate data summaries. To illustrate I simulate data on ni subjects each of which is observed nj times. Each observation on each subject is entered as a new line in data. For the trait x I assume that it results from differences between individuals xi, over time xj, and a residual:

ni <- 8            # no. individuals
nj <- 5            # no. of observation times
xi <- rnorm(ni)    # individual differences
xj <- rnorm(nj)    # differences over time

# individual and time index for ni*nj observations
ind  <- rep( 1:ni, each = nj )
time <- rep( 1:nj, ni )

# simulated data set
x    <- xi[ind] + xj[time] + rnorm(ni*nj, 0, .1)   # vectorized effects + residual
data <- data.frame( ind, time, x )
head(data, 10)

These first few lines of data show the relationship between ind and time indexes.

Conversely, data may be collected as individuals (rows) over time (columns) as a matrix. This is often convenient because additional columns might be easier in the field (e.g., a tag number), in which case it is easier to enter a new observation on the same line as previous ones. Here I use vectorization to fill an ind by time matrix with the observed x:

xmat <- matrix(NA, ni, nj)   # initialize the matrix with NA
ij   <- cbind(ind, time)     # vectorization to fill matrix
xmat[ ij ] <- x              # fill matrix
head(xmat)

##             [,1]       [,2]        [,3]     [,4]       [,5]
## [1,] -0.22263452 -0.8261760 -0.05167892 2.547390  0.5762054
## [2,] -1.18555406 -1.6865309 -1.04339173 1.621193 -0.4480409
## [3,]  1.79649142  1.3091016  1.91072880 4.672713  2.6833900
## [4,] -0.05359806 -0.2998227  0.27132886 3.099535  1.0879431
## [5,] -0.33386035 -0.5695714 -0.14427386 2.466162  0.7130838
## [6,]  1.52689893  1.2382908  1.95544156 4.576311  2.6381606

Individual and time differences can be obtained from either format. From the data vector I use the tapply function with the ind index to obtain means and standard deviations, first by individual and then by time. Here is by individual:

# individual mean +/- sd from data vector
imu <- tapply( x, ind, mean )
isd <- tapply( x, ind, sd )
print( cbind(imu, isd) )

##          imu      isd
## 1  0.4046211 1.297827
## 2 -0.5484649 1.290691
## 3  2.4744850 1.323958
## 4  0.8210772 1.377160
## 5  0.4263080 1.238794
## 6  2.3870205 1.332452
## 7  0.7996734 1.290195
## 8  2.8511177 1.349622

Here is by time:

# time means +/- sd
tmu <- tapply( x, time, mean )
tsd <- tapply( x, time, sd )
print( cbind(tmu, tsd) )

##          tmu      tsd
## 1 0.46560774 1.181664
## 2 0.06160807 1.179008
## 3 0.69852491 1.231771
## 4 3.37747403 1.238349
## 5 1.40668405 1.260092

For the xmat matrix, which is organize as subjects-by-time, I use rowMeans and apply. The apply function applies a function across the margin of an array, either rows (MARGIN = 1) or columns (MARGIN = 2). Here I use apply for the standard deviation sd. I could also use apply for the mean values, but I opt instead for the faster rowMeans and colMeans:

# individual mean +/- sd from matrix
imu <- rowMeans( xmat )
isd <- apply( xmat, 1, sd )
print( cbind(imu, isd) )

##             imu      isd
## [1,]  0.4046211 1.297827
## [2,] -0.5484649 1.290691
## [3,]  2.4744850 1.323958
## [4,]  0.8210772 1.377160
## [5,]  0.4263080 1.238794
## [6,]  2.3870205 1.332452
## [7,]  0.7996734 1.290195
## [8,]  2.8511177 1.349622

# time means +/- sd
tmu <- colMeans( xmat )
tsd <- apply( xmat, 2, sd )
print( cbind(tmu, tsd) )

##             tmu      tsd
## [1,] 0.46560774 1.181664
## [2,] 0.06160807 1.179008
## [3,] 0.69852491 1.231771
## [4,] 3.37747403 1.238349
## [5,] 1.40668405 1.260092

The format change can be reversed. I recover x from xmat, again using vectorization:

identical( xmat[ ij ], x )

## [1] TRUE

Try this now: Generate a design matrix x with missing values. Then replace with the missing values with the median value for that variable using the function apply( x, 2, quantile, .5, na.rm=T ).

The `match` function

The foregoing format changes and summaries were made easy by vectorization using indices that I had generated to simulate the data, the vectors ind and time. How do I generate these indexes from a data set that does not come with a fixed number of times per subject?

I can generate the index using the match function, which locates the matching entry for one vector in a second vector. For example,

a <- sample( letters, 5, replace=T )   # vector a holds 5 random letters
m <- match(a, letters)                 # m holds the first matches in the alphabet
rbind( a, letters[m] )                 # it worked

##   [,1] [,2] [,3] [,4] [,5]
## a "s"  "s"  "s"  "y"  "q" 
##   "s"  "s"  "s"  "y"  "q"

In this example, match compared a random sample of 5 letters to the 26 letters of the alphabet. It returned the matching locations as the integer vector m.

In the block of code that follows I use match to compare the variable ind from data to a vector of unique individuals. The match ii holds one entry for each row in data (i.e., length(ind) = nrow(data)). I do the same for time:

# find the unique individuals and times
inds  <- sort( unique( data$ind ) )   # unique individuals
times <- sort( unique( data$time) )   # unique times

# match rows in data to inds and times
ii <- match( data$ind, inds )
jj <- match( data$time, times )
mm <- cbind( ii, jj )

identical( xmat[ mm ], x )           # it worked

## [1] TRUE

By combining the locations ii with jj in the matrix mm I was able to extract from matrix xmat the vector x.

Try this now: Write a function to transform a data vector into a subject-by-time matrix.

Steps:

Execute the code below to generate a random data set with individuals labeled by letters
Find unique values for subjects i and times t and initialize a matrix x with NA. x must have enough rows to accommodate all individuals (ni) and enough columns to accommodate all times (nj).
Use match to align i with unique subject letters and t to unique times
Fill the matrix with values of x
Wrap it in a function to return the matrix

n  <- 100
ni <- 8
nt <- 6
x  <- rnorm(100)
t  <- sample(nt, n, replace = T)
i  <- sample(letters[1:ni], n, replace = T)
id <- paste(i, t, sep='_')
data <- data.frame(id, i, t, x)
data <- data[ !duplicated(data$id), ]               # remove duplicates
data <- data[ order(data$id), ]                     # order by sample id

Note that data has multiple columns, but the vector of interest is only the last column. Now go to step 2.

R packages

Groups of related functions are packaged for use by the community and made available through the CRAN website, github, and elsewhere. You can build your own package for use by your own group. These are loaded like this:

install.packages( 'nameOfPackageIWant' )

Once installed I attach it to an environment using

library( nameOfPackageIWant )

To see what packages are loaded I can list them with the search function

search()

## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"

Once a package is installed, functions can be used without loading all functions in the package using packageName::functionName(). This has the advantage of reducing conflicts between functions having the same name in different packages.

Getting into a file

The range of different file types that can be interpreted in R is broad. The most common types that I interpret include the following:

file type	example function
`.txt`	`read.table`
`.csv`	`read.csv`
`.Rdata`	`load`
`.r`	`source`

Note that the function data.table::fread is a fast alternative to these traditional functions and requires limited qualification.

Efficiency in R

As with any computer language the best R programmers fumble around all the time. Efficiency depends on ways to find quick answers. There are at least two ways to go about this. Any errors that relate to the interpretation of specific R functions can be quickly checked with the help function:

help( nameOfConfusingFunction )

The second route is to google the error message. You will arrive at a handfull of web sites inhabited by knowledgable and helpful people who spend their free time answering questions like yours. You will almost always find that your errors have been through these web sites many times.

Optional challenge

This file has breeding-bird survey (BBS) data stored as observations-by-species, where an observation is a route-year combination. I want to explore how the abundance of an individual species changes over time, so I want to create a route-by-year matrix. Try this in two ways:

for an individual species
as a summary for all species

I load the data here:

load( 'bbsExample.Rdata', verbose = T )

head(x)  # examine structure of data from data attributes
head(y)  # spp data aligned with x, rownames are route_year

routes <- sort(unique(x$Route))  # unique routes
years  <- sort(unique(x$year))   # unique years

Try this: Here are steps you might follow. For a single species, this can work:

construct a matrix that can accommodate one route in each row and one year in each column
match the routes in x with routes created in the previous code block
match the years in x with years
select a species (column) in y and used the vectors generated by match to fill the new matrix
sum the counts by year using colSums

For all species, this can work:

Because I am going to sum over all columns in y, the route and year columns in x must be repeated for every column in y, e.g., rep( x$Route, ncol(y) ). These extended route and year indices can be used with tapply to generate a summed route by year matrix.
as for one species, the total count by year can be obtained using colSums

As an added challenge, express these counts as counts per effort (CPE), where effort is the number of routes in each year.

Examine trends for your species and for all species over time.

Getting started with R
some fundamentals
Duke University

Jim Clark

2023-01-10

Basic syntax

Functions

Indexing arrays

`list` and `data.frame`

Loops and vectorization

Scatter plots and histograms

Histograms

My functions

Basic matrix operations

R objects have storage modes

`which` function

The `match` function

R packages

Getting into a file

Efficiency in R

Optional challenge

Getting started with R some fundamentals Duke University

Jim Clark

2023-01-10

Basic syntax

Functions

Indexing arrays

list and data.frame

Loops and vectorization

Scatter plots and histograms

Histograms

My functions

Basic matrix operations

R objects have storage modes

which function

apply, tapply, and related functions

The match function

R packages

Getting into a file

Efficiency in R

Optional challenge

Getting started with R
some fundamentals
Duke University

`list` and `data.frame`

`which` function

`apply`, `tapply`, and related functions

The `match` function