This brief overview introduces some of the key elements, intended for those who have limited experience with R. It introduces R objects, basic operations, storage modes, and functions. Concepts include vectorization.


Basic syntax

The symbol > at the beginning of a line means that R is ready to receive input. Lines beginning with a number in brackets, e.g. [1], are responses from R. The following example of three lines includes two I entered followed by a response from R:

z <- 5  #assign the value 5 to ‘z’
z
## [1] 5

The first line contains the assignment operator <-, which sets the left-hand side equal to the right-hand side. In the first line I define a variable named z, and I assign to it the value 5. The # symbol indicates that a comment follows. R will not execute anything to the right of the symbol #. In the second line I entered the name of the variable, and that value was displayed on the console. The index [1] seems pointless here, but it’s useful for arrays as an indicator of location (see below).

Functions

A function performs one or more operations on an one or more arguments. If I want to repeat the value of z three times I could use the rep function with arguments for z and 3 and get this response: rep(z, times = 5) = 5, 5, 5. The function name is followed by parentheses. Arguments are inside the parentheses.

If I forget the arguments that are required for rep I can use help(rep). Note here that help is also the name of the function and it takes another function name as its argument. If I forget what the help function does, I can enter help(help).

A function might return any type of R object, e.g., a value, a vector, a list, or even another function. The help page for the function describes the Arguments needed and the Value that will be returned.

Not all functions require an argument. For example, the function ls() (no argument) returns a character vector with the names of objects that are in my current work environment.

Indexing arrays

Numbers are often collected into an array. The difference between a vector, matrix, and array is in how they are rendered to the console and some attributes to recognize lengths in 1, 2, and 3 dimensions. Locations in arrays are indexed by square brackets. The array dimensions are separated by commas. For a one-dimensional array, a vector, there are no commas. Here I construct a vector with even numbers from –2 to 8 to which I assign the name xvec,

xvec <- seq( -2, 9, by=2 )  # sequence of values
xvec
## [1] -2  0  2  4  6  8

The function seq(from, to, by) creates the vector. When I entered the name of the vector xvec, it is displayed on the console. If I want only the 3rd element, I enter

xvec[3]  # the third value of a vector
## [1] 2

If I want the last three elements, I enter

xvec[4:6] # the 4th, 5th, and 6th values
## [1] 4 6 8

If I want the \(j^{th}\) element of xvec, I assign a value to j and then ask for that value

j <- 5
xvec[j]   # the jth value of xvec
## [1] 6

To determine the length of the vector use this:

length(xvec)
## [1] 6

I could also create a vector with the concatenate function c,

zvec <- c(5.2,7,-15)  #define a vector with 3 elements
zvec
## [1]   5.2   7.0 -15.0

To construct a 2-dimensional array, or matrix, I could use the matrix function

n <- 2
m <- 3
xmat <- matrix(xvec, nrow=n, ncol=m, byrow=T) # 2 by 3 matrix
xmat
##      [,1] [,2] [,3]
## [1,]   -2    0    2
## [2,]    4    6    8

In my call to the R function matrix, I said to construct a 2 by 3 matrix from the vector having the name xvec (the length-6 vector), filling the matrix by rows. The default is to fill a matrix by columns (start with column 1, then column 2,…). The dimension of xmat is

dim(xmat)
## [1] 2 3

The 2nd column is

xmat[,2]
## [1] 0 6

and the element in row 1, column 3 is

xmat[1,3]
## [1] 2

The [j,k]th element of xmat is

i <- 1
j <- 3
xmat[i,j]
## [1] 2

Equivalently, I could get element [j,k] this way:

ij <- cbind(i,j)  #bind two values, each as a column
ij
##      i j
## [1,] 1 3
xmat[ ij ]
## [1] 2

Here I have bound together two values to construct a row vector jk, which R will read as the two dimensions of xmat, the row and column, respectively.

I can also construct a matrix by binding together vectors, provided they have the same lengths. For example, the same matrix xmat can be constructed by binding two rows created with the concatenate function c,

rbind( c(-2, 0, 2), c(4, 6, 8) )  # bind 2 row vectors 
##      [,1] [,2] [,3]
## [1,]   -2    0    2
## [2,]    4    6    8

or by binding together three columns

cbind( c(-2,4), c(0,6), c(2,8) ) # bind 3 columns
##      [,1] [,2] [,3]
## [1,]   -2    0    2
## [2,]    4    6    8

R executes the inner functions first, in this case the calls to function c, followed by the outer call to function rbind or cbind. I now say more about functions.

list and data.frame

The list storage mode offers one way to hold a heterogeneous collection of objects. In the forest inventory example for next time, each tree has attributions including diameter (numeric), leaf habit (factor), and species (character). Here are these attributes for two trees:

diameter  <- c(5.7, 82)
species   <- c('red maple', 'white pine')
leafHabit <- as.factor( c('deciduous','evergreen') )
treeList  <- list( spec = species, diam = diameter, leaf = leafHabit )        # as a list
treeData  <- data.frame( species, diameter, leafHabit )  # as a data frame

I have organized these objects first as a list, where the attributes are given the names spec and diam. If I enter treeList it will be displayed with the names sequentially. I can refer to the second I object as treeList$diam.

I then organize the same objects into a data.frame, which is a specific type of list where each object has the same length. For this reason, it will be displayed as a table with the objects species, diameter, and leafHabit as column headings:

knitr::kable(treeData)
species diameter leafHabit
red maple 5.7 deciduous
white pine 82.0 evergreen

Both lists and data.frames are widely used in R.

Loops and vectorization

Vectorization is one of the most important features of R, allowing one to operate on elements of a vector without programming loops. Loops are the standard way to apply operations to elements of arrays sequentially. Loops are fast in (low-level) compiled languages, but slow in the interactive language R. Vectorization internalizes loops (in compiled code) so they do not slow execution.

Consider a population of individuals, each of which can be reproductive or not depending on their differing developmental stages. The mature state is reached in stage 3. I wish to simulate developmental stage \([1, \dots, 5]\) for 10 individuals and assign each the correct reproductive state \([0, 1]\).

Here is a simple for loop that prints to the console the index and exponential increase of a variable x at 1% per increment:

x <- 2.7
for(i in 1:5){
  x <- x*1.01
  print( c(i, x) )
}
## [1] 1.000 2.727
## [1] 2.00000 2.75427
## [1] 3.000000 2.781813
## [1] 4.000000 2.809631
## [1] 5.000000 2.837727

The index is an integer value i that is incremented over i in 1:5.

For the problem above, I begin by writing a for loop that sequentially evaluates the stage of each individual \(i, \dots, n\) and assigns a reproductive state. First, I generate a random length-n vector that I will call stage using the sample function. I define a second vector repro of binary indicators (zero or one) for whether or not a stage is reproductive. I want to create a vector y that holds the reproductive states for each individual in the sample. Here is a for loop to inspect the stage of each individual, determine if it is reproductive, and, if so, assign it a value of 1:

n      <- 10                           # sample size
nstage <- 5                            # no. developmental stages
stage  <- sample(nstage, n, replace=T) # randomly assign a stage to each observation
repro  <- c(0,0,1,1,1)                 # early stages not reproductive
y      <- stage*0                      # initialize data
for(i in 1:n){                         # loop over i from 1 to n
  if(repro[ stage[i] ] == 1)y[i] <- 1  # determine if each is reproductive
}

If the first element of the vector stage = 1, then the first individual is assigned repro[stage[1]] = 0. Here is what I get for the entire vector:

print( rbind(stage, y) )
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## stage    1    3    5    5    5    3    2    4    3     1
## y        0    1    1    1    1    1    0    1    1     0

This loop executes fast, because there are only 10 individuals. It would be too slow if I had, say, \(10^6\) individuals or if I had to repeat this loop thousands of times (nested loops).

Here is the same operation, where the for loop is replaced with a vector:

y <- repro[stage]

This alternative is vectorization. It is not only fast, but also simple. Use loops only when vectorization is not possible.

Try this now. Simulate exponential growth using a for loop and determine that simulated values agree with the growth rate used to generate those data.

  1. Define n time steps and a growth rate drawn at random from rho <- runif(1, .9, 1.1)
  2. Write a for loop where the next value of x will be rho times as large as the current value.
  3. Verify growth rate.

Scatter plots and histograms

Scatter plots are generated with the function plot. Here is a plot of two simulated vectors:

n <- 500
x <- rnorm(n)
y <- x + rnorm(n,.01)
plot( x, y)

The function rnorm draws random values from a distribution. I need a way to visualize the sample.

Histograms

Histograms can be generated with the function hist. Here are histograms for the preceeding two vectors:

par(mfrow=c(1,2))
hist(x)
hist(y, nclass=100)

Try this now: Simulate random body sizes for a population consisting of males and females where mean body size differs by gender, and the fraction that are gender 1 is a parameter p. Do this using i) a for loop and ii) by vectorization. Generate a histogram of body size.

Steps:
  1. Assign a population size n
  2. Draw random genders g using rbinom( n, 1, p)
  3. Assign a mean size for two genders mu
  4. Write a for loop where you check the gender of each individual and draw a random size using rnorm( 1, mu )
  5. Vectorize this operation using g as an index for a length-2 vector mu

My functions

A function contains operations to be performed on objects that have been previously defined in the main program (they exist as objects in the user’s workspace) and/or on arguments passed to them. Nearly every operation in R can be viewed as a function, and I can write my own functions.

Just like the built-in R functions, each function that I write is identified by a name. Any arguments that will be passed to the function (as opposed to being available directly from the user’s workspace) are contained in parentheses that follow the name of the function. Examples of functions encountered in the previous section include seq, c, cbind, rbind, and matrix. Many functions are available in R and can be called by the user. I can also write my own functions. For example, if I want to take the mean of the elements in xvec, I can call the R function

mean(xvec) 
## [1] 3

The argument sent to mean is xvec. mean(xvec) returns the answer 3. I can supply additional arguments to mean, which can, for example, handle missing cases. If I want to know the options, I use the help function,

help(mean)

I could also write my own function, such as

mymean <- function(x){sum(x)/length(x)}

This code assigns to the name mymean a function that takes an argument that is assigned within the function mymean the name x. The function mymean, in turn, calls two internal R functions sum and length to determine the mean of whatever vector has been passed to it. To determine the mean of xvec I call the function,

mymean(xvec)
## [1] 3

Due to the fact that xvec is the first (and, in this case, only) argument passed to mymean, it will be assigned the internal name x, a name that is shielded from the main program. Thus, operations within functions do not interact with objects in the user workspace.

Any object that is needed but not assigned within a function must either be passed or it must be available from the user’s workspace. If the object is not in either of these places the function will give up. Note that a function called by another function will not have available to it any objects created in the calling function, unless they are passed. Specifically, if I call a function that I have written a(), which generates an object r and call from within a() another function b() without passing r, b() cannot know about r.

b <- function(r){r^2}

a <- function(m, p){ # function b cannot know about r
  r <- m/p
  b()
}
a(7,2)               # this fails

a <- function(m, p){ # now I pass r
  b(m/p)
}
a(7,2)               # this works

Because objects in the home environment may not have the values that I think they do, I write functions to accept as arguments all of the objects that will be needed. I do not rely on them being available in the users workspace.

A function returns an object to the workspace. I can do this in a few ways. Here is a function to return the mean and standard deviation of x:

myMean <- function(x){
  mu <- mean(x)
  return( mu )     # return this object
}

Here is call to this function:

x <- rnorm(1000)
myMean(x)
## [1] 0.01769951

However, I know from the previous definition of myMean that this will also work:

myMean <- function(x){
  mean(x)  # operation is not assigned, so it is returned directly
}

But this will not work:

myMean <- function(x){
  mu <- mean(x)  # assigned but not returned
}

If I do not assign the last operation to a name the function will return that last object.

Try this now. Write a function to calculate the mean and standard deviation of a vector and return them as a length-2 vector.

Basic matrix operations

The basic operations used for scalar quantities apply element-wise to numeric matrices of the same dimension. This includes +, -, *, /:

a <- matrix( rnorm(20), 5, 4 )
b <- matrix( rnorm(20), 5, 4 )
c <- a/b
all.equal(a, b*c)  # recover matrix a
## [1] TRUE

Matrix multiplication is written as %*% and requires that the first matrix has the same number of columns as the number of rows in the second matrix. If \(\mathbf{A}\) is \(n \times k\) and \(\mathbf{B}\) is \(k \times m\), then \(\mathbf{A} \times \mathbf{B}\) is \(n \times m\):

n <- 6
k <- 3
m <- 4
A <- matrix( rnorm(n*k), n, k )
B <- matrix( rnorm(k*m), k, m )
C <- A%*%B
print(C)
##             [,1]        [,2]       [,3]       [,4]
## [1,]  1.97645836 -0.06348035  0.7975842  1.4626173
## [2,]  1.68474269 -0.47041874  1.1067285 -1.9528443
## [3,] -0.90354510 -0.40747350 -0.9085258 -0.9263017
## [4,] -1.97457967 -0.28145404 -1.6675183 -0.2874005
## [5,]  0.07165521  0.05807023  0.4837502 -1.0971187
## [6,]  1.83496895 -0.95115674  0.2612222 -1.1445620

Recall from matrix algebra that element C[i,j] will be the sum I get from multiplying row i of A and column j of B.

i <- 2
j <- 3
sum( A[i,]*B[,j] )
## [1] 1.106729

Here are these matrices:

I have highlighted row i of A, which is multiplied by column j of B and summed to give element [i,j] of C.

A diagonal matrix can be generated with the function diag. The identity matrix is a diagonal matrix with ones along the diagonal. The product of the identity matrix and another matrix is the same matrix:

I <- diag(k)
identical( A, A%*%I )
## [1] TRUE

Of course, I and A must have compatible dimensions. Look at I and A in this example.

R objects have storage modes

The most common storage modes for R objects are numeric and character.

object attributes
vector a vector can hold numeric, character, or factor
matrix a 2-dimensional array
array use for more than 2 dimensions
list objects that can be named or referenced as [[1]], [[2]], or by name
data.frame a list of objects of equal length bound together as columns
factor a vector containing discrete levels

A data.frame is a type of list where objects are vectors of equal length but not necessarily of the same storage mode. When rendered to the console, a data.frame looks like a table, or matrix. However, because it is not stored as a matrix, it does not admit matrix operations like +, *, or %*%.

which function

I am confronted with a table of values, some of which are missing. I want to initialize the missing values with the mean value for that column.

This problem arises when there are missing predictors in a design matrix. Sometimes a decision is made to fill those missing values with the mean for that predictor.

To illustrate with an example I generate a random matrix and fill some of the values with NA, which indicates that a value is missing.

n <- 7                                     # rows
k <- 4                                     # columns
x <- matrix( round(rnorm(n*k), 2), n, k )  # a random matrix

miss <- sample(n*k, 8)                     # random locations for NA
x[ miss ] <- NA                            # set to NA
print( x )
##       [,1]  [,2]  [,3]  [,4]
## [1,] -1.12  0.83    NA -0.89
## [2,]    NA  0.47  1.42    NA
## [3,]  1.30 -1.34 -1.07  1.29
## [4,]    NA    NA    NA    NA
## [5,]  0.90  0.57 -1.13  0.30
## [6,]    NA  0.52 -0.67  0.58
## [7,] -0.81 -0.10 -0.80 -0.48

Here is diagram of matrix locations, highlighting the locations of NA values:

A matrix is stored by stacking columns atop one another, referenced by numbers from 1 to n*k. The NA values in matrix xoccupy the brown cells. I know the locations of the NA values, because I put them there myself when I generated the vector miss = 2, 4, 6, 11, 15, 18, 23, 25. These elements are shown at left. They could also be located using which( is.na(x) ).

Matrix locations can also be referenced by row and column [i,j] as shown at right. In many cases it will be important to translate the location vector into paired coordinates that specify [row,column]. I can get them with the argument arr.ind = T:

miss <- which( is.na(x), arr.ind = T ) # provide the array index [row, column]
print( miss )
##      row col
## [1,]   2   1
## [2,]   4   1
## [3,]   6   1
## [4,]   4   2
## [5,]   1   3
## [6,]   4   3
## [7,]   2   4
## [8,]   4   4

This version of miss holds the same information, but it is often preferable where I want to see column locations. I now have a quick solution to the initial problem:

miss <- which( is.na(x), arr.ind = T ) # find missing values
cmu  <- colMeans( x, na.rm=T )         # column means
x[ miss ] <- cmu[ miss[,2] ]           # missing values to column mean in cmu
print( x )
##         [,1]       [,2]  [,3]  [,4]
## [1,] -1.1200  0.8300000 -0.45 -0.89
## [2,]  0.0675  0.4700000  1.42  0.16
## [3,]  1.3000 -1.3400000 -1.07  1.29
## [4,]  0.0675  0.1583333 -0.45  0.16
## [5,]  0.9000  0.5700000 -1.13  0.30
## [6,]  0.0675  0.5200000 -0.67  0.58
## [7,] -0.8100 -0.1000000 -0.80 -0.48

Take a moment to see how I’ve used miss as a two-column matrix to locate missing values in x and as a vector to select the right column mean from cmu.

The match function

The foregoing format changes and summaries were made easy by vectorization using indices that I had generated to simulate the data, the vectors ind and time. How do I generate these indexes from a data set that does not come with a fixed number of times per subject?

I can generate the index using the match function, which locates the matching entry for one vector in a second vector. For example,

a <- sample( letters, 5, replace=T )   # vector a holds 5 random letters
m <- match(a, letters)                 # m holds the first matches in the alphabet
rbind( a, letters[m] )                 # it worked
##   [,1] [,2] [,3] [,4] [,5]
## a "k"  "l"  "v"  "l"  "j" 
##   "k"  "l"  "v"  "l"  "j"

In this example, match compared a random sample of 5 letters to the 26 letters of the alphabet. It returned the matching locations as the integer vector m.

In the block of code that follows I use match to compare the variable ind from data to a vector of unique individuals. The match ii holds one entry for each row in data (i.e., length(ind) = nrow(data)). I do the same for time:

# find the unique individuals and times
inds  <- sort( unique( data$ind ) )   # unique individuals
times <- sort( unique( data$time) )   # unique times

# match rows in data to inds and times
ii <- match( data$ind, inds )
jj <- match( data$time, times )
mm <- cbind( ii, jj )

identical( xmat[ mm ], x )           # it worked
## [1] TRUE

By combining the locations ii with jj in the matrix mm I was able to extract from matrix xmat the vector x.

Exercise 2. Write a function to transform a data vector into a subject-by-time matrix.

Steps:
  1. Execute the code below to generate a random data set with individuals labeled by letters
  2. Find unique values for subjects i and times t and initialize a matrix x with NA. x must have enough rows to accommodate all individuals (ni) and enough columns to accommodate all times (nj).
  3. Use match to align i with unique subject letters and t to unique times
  4. Fill the matrix with values of x
  5. Wrap it in a function to return the matrix
n  <- 100
ni <- 8
nt <- 6
x <- rnorm(100)
t <- sample(nt, n, replace = T)
i <- sample(letters[1:ni], n, replace = T)
id <- paste(i, t, sep='_')
data <- data.frame(id, i, t, x)
data <- data[ !duplicated(data$id), ]               # remove duplicates
data <- data[ order(data$id), ]                     # order by sample id

Note that data has multiple columns, but the vector of interest is only the last column. Now go to step 2.

R packages

Groups of related functions are packaged for use by the community and made available through the CRAN website, github, and elsewhere. You can build your own package for use by your own group. These are loaded like this:

install.packages( 'nameOfPackageIWant' )

Once installed I attach it to an environment using

library( nameOfPackageIWant )

To see what packages are loaded I can list them with the search function

search()
## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"

Once a package is installed, functions can be used without loading all functions in the package using packageName::functionName(). This has the advantage of reducing conflicts between functions having the same name in different packages.

Getting into a file

The range of different file types that can be interpreted in R is broad. The most common types that I interpret include the following:

file type example function
.txt read.table
.csv read.csv
.Rdata load
.r source

Note that the function data.table::fread is a fast alternative to these traditional functions and requires limited qualification.

Efficiency in R

As with any computer language the best R programmers fumble around all the time. Efficiency depends on ways to find quick answers. There are at least two ways to go about this. Any errors that relate to the interpretation of specific R functions can be quickly checked with the help function:

help( nameOfConfusingFunction )

The second route is to google the error message. You will arrive at a handfull of web sites inhabited by knowledgable and helpful people who spend their free time answering questions like yours. You will almost always find that your errors have been through these web sites many times.