This brief overview introduces some of the key elements, intended for those who have limited experience with R. It introduces R objects, basic operations, storage modes, and functions. Concepts include vectorization.
Basic syntax
The symbol > at the beginning of a line means that R
is ready to receive input. Lines beginning with a number in brackets,
e.g. [1], are responses from R. The following example of
three lines includes two I entered followed by a response from R:
z <- 5 #assign the value 5 to ‘z’
z## [1] 5
The first line contains the assignment operator <-,
which sets the left-hand side equal to the right-hand side. In the first
line I define a variable named z, and I assign to it the
value 5. The # symbol indicates that a comment
follows. R will not execute anything to the right of the symbol
#. In the second line I entered the name of the variable,
and that value was displayed on the console. The index [1]
seems pointless here, but it’s useful for arrays as an
indicator of location (see below).
Functions
A function performs one or more operations on an one or
more arguments. If I want to repeat the value of z three
times I could use the rep function with arguments for
z and 3 and get this response:
rep(z, times = 3) = 5, 5, 5. The function name is followed
by parentheses. Arguments are inside the parentheses.
If I forget the arguments that are required for rep I
can use help(rep). Note here that help is also
the name of the function and it takes another function name as its
argument. If I forget what the help function does, I can
enter help(help).
A function might return any type of R object, e.g., a value, a
vector, a list, or even another
function. The help page for the function
describes the Arguments needed and the
Value that will be returned.
Not all functions require an argument. For example, the function
ls() (no argument) returns a character vector
with the names of objects that are in my current work environment.
Indexing arrays
Numbers are often collected into an array. The
difference between a vector, matrix, and
array is in how they are rendered to the console and some
attributes to recognize lengths in 1, 2, and 3 dimensions. Locations in
arrays are indexed by square brackets. The
array dimensions are separated by commas. For a
one-dimensional array, a vector, there are no
commas. Here I construct a vector with even numbers from –2
to 8 to which I assign the name xvec,
xvec <- seq( -2, 9, by=2 ) # sequence of values
xvec## [1] -2 0 2 4 6 8
The function seq(from, to, by) creates the
vector. When I entered the name of the
vector xvec, it is displayed on the console. If I want only
the 3rd element, I enter
xvec[3] # the third value of a vector## [1] 2
If I want the last three elements, I enter
xvec[4:6] # the 4th, 5th, and 6th values## [1] 4 6 8
If I want the \(j^{th}\) element of
xvec, I assign a value to j and then ask for
that value
j <- 5
xvec[j] # the jth value of xvec## [1] 6
To determine the length of the vector use the
length function:
length(xvec)## [1] 6
I could also create a vector with the concatenate
function c,
zvec <- c(5.2, 7, -15) #define a vector with 3 elements
zvec## [1] 5.2 7.0 -15.0
To construct a 2-dimensional array, or
matrix, I could use the matrix function
n <- 2
m <- 3
xmat <- matrix(xvec, nrow=n, ncol=m, byrow=T) # 2 by 3 matrix
xmat## [,1] [,2] [,3]
## [1,] -2 0 2
## [2,] 4 6 8
In my call to the R function matrix, I said to construct
a 2 by 3 matrix from the vector having the
name xvec (the length-6 vector), filling the
matrix by rows. The default is to fill a
matrix by columns (start with column 1, then column 2,…).
The dimension of xmat is
dim(xmat)## [1] 2 3
The 2nd column is
xmat[,2]## [1] 0 6
and the element in row 1, column 3 is
xmat[1,3]## [1] 2
The [j,k]th element of xmat is
i <- 1
j <- 3
xmat[i,j]## [1] 2
Here is a diagram showing row [i,], column
[,j], and then element [i,j]:
Equivalently, I could get element [i,j] this way:
ij <- cbind(i,j) #bind two values, each as a column
ij## i j
## [1,] 1 3
xmat[ ij ]## [1] 2
Here I have bound together two values to construct a row
vector jk, which R will read as the two dimensions of
xmat, the row and column, respectively.
I can also construct a matrix by binding together
vectors, provided they have the same lengths.
For example, the same matrix xmat can be constructed by
binding two rows created with the concatenate function
c,
rbind( c(-2, 0, 2), c(4, 6, 8) ) # bind 2 row vectors ## [,1] [,2] [,3]
## [1,] -2 0 2
## [2,] 4 6 8
or by binding together three columns
cbind( c(-2,4), c(0,6), c(2,8) ) # bind 3 columns## [,1] [,2] [,3]
## [1,] -2 0 2
## [2,] 4 6 8
R executes the inner functions first, in this case the calls to
function c, followed by the outer call to function
rbind or cbind. I now say more about
functions.
list and data.frame
The list storage mode offers one way to hold a
heterogeneous collection of objects. In the forest inventory example for
next time, each tree has attributions including diameter
(numeric), leaf habit (factor), and species
(character). Here are these attributes for two trees:
diameter <- c(5.7, 82)
species <- c('red maple', 'white pine')
leafHabit <- as.factor( c('deciduous','evergreen') )
treeList <- list( spec = species, diam = diameter, leaf = leafHabit ) # as a list
treeData <- data.frame( species, diameter, leafHabit ) # as a data frameI have organized these objects first as a list, where
the attributes are given the names spec and
diam. If I enter treeList it will be displayed
with the names sequentially. I can refer to the second I object as
treeList$diam.
I then organize the same objects into a data.frame,
which is a specific type of list where each object has the
same length. For this reason, it will be displayed as a table with the
objects species, diameter, and
leafHabit as column headings:
knitr::kable(treeData)| species | diameter | leafHabit |
|---|---|---|
| red maple | 5.7 | deciduous |
| white pine | 82.0 | evergreen |
Both lists and data.frames are widely used
in R.
Loops and vectorization
Vectorization is one of the most important features of R, allowing one to operate on elements of a vector without programming loops. Loops are the standard way to apply operations to elements of arrays sequentially. Loops are fast in (low-level) compiled languages, but slow in the interactive language R. Vectorization internalizes loops (in compiled code) so they do not slow execution.
Consider a population of individuals, each of which can be reproductive or not depending on their differing developmental stages. The mature state is reached in stage 3. I wish to simulate developmental stage \([1, \dots, 5]\) for 10 individuals and assign each the correct reproductive state \([0, 1]\).
Here is a simple for loop that prints to the console the
index and exponential increase of a variable x at 1% per
increment:
x <- 2.7
for(i in 1:5){
x <- x*1.01
print( c(i, x) )
}## [1] 1.000 2.727
## [1] 2.00000 2.75427
## [1] 3.000000 2.781813
## [1] 4.000000 2.809631
## [1] 5.000000 2.837727
The index is an integer value i that is incremented over
i in 1:5.
For the problem above, I begin by writing a for loop
that sequentially evaluates the stage of each individual \(i, \dots, n\) and assigns a reproductive
state. First, I generate a random length-n vector that I
will call stage using the sample function. I
define a second vector repro of binary indicators (zero or
one) for whether or not a stage is reproductive. I want to create a
vector y that holds the reproductive states for each
individual in the sample. Here is a for loop to inspect the
stage of each individual, determine if it is reproductive,
and, if so, assign it a value of 1:
n <- 10 # sample size
nstage <- 5 # no. developmental stages
stage <- sample(nstage, n, replace=T) # randomly assign a stage to each observation
repro <- c(0,0,1,1,1) # early stages not reproductive
y <- stage*0 # initialize data
for(i in 1:n){ # loop over i from 1 to n
if(repro[ stage[i] ] == 1)y[i] <- 1 # determine if each is reproductive
}If the first element of the vector stage = 1, then the
first individual is assigned repro[stage[1]] = 0. Here is
what I get for the entire vector:
print( rbind(stage, y) )## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## stage 1 4 4 5 4 5 3 2 4 3
## y 0 1 1 1 1 1 1 0 1 1
This loop executes fast, because there are only 10 individuals. It would be too slow if I had, say, \(10^6\) individuals or if I had to repeat this loop thousands of times (nested loops).
Here is the same operation, where the for loop is
replaced with a vector:
y <- repro[stage]This alternative is vectorization. It is not only fast, but also simple. Use loops only when vectorization is not possible.
Try this now. Simulate exponential growth using a
for loop and determine that simulated values agree with the
growth rate used to generate those data.
- Define
ntime steps and a growth rate drawn at random fromrho <- runif(1, .9, 1.1) - Write a
forloop where the next value ofxwill berhotimes as large as the current value. - Verify growth rate.
Scatter plots and histograms
Scatter plots are generated with the function plot. Here
is a plot of two simulated vectors:
n <- 500
x <- rnorm(n)
y <- x + rnorm(n,.01)
plot( x, y)The function rnorm draws random values from a
distribution. I need a way to visualize the sample.
Histograms
Histograms can be generated with the function hist. Here
are histograms for the preceeding two vectors:
par(mfrow=c(1,2))
hist(x)
hist(y, nclass=100)
Try this now: Simulate random body sizes for a
population consisting of males and females where mean body size differs
by gender, and the fraction that are gender 1 is a
parameter p. Do this using i) a for loop and
ii) by vectorization. Generate a histogram of body size.
-
Assign a population size
n -
Draw random genders
gusingrbinom( n, 1, p) -
Assign a mean size for two genders
mu -
Write a
forloop where you check the gender of each individual and draw a random size usingrnorm( 1, mu ) -
Vectorize this operation using
gas an index for a length-2vector mu
My functions
A function contains operations to be performed on objects
that have been previously defined in the main program (they exist as
objects in the user’s workspace) and/or on arguments passed to them.
Nearly every operation in R can be viewed as a function, and I can write
my own functions.
Just like the built-in R functions, each function that I
write is identified by a name. Any arguments that will be passed to the
function (as opposed to being available directly from the user’s
workspace) are contained in parentheses that follow the name of the
function. Examples of functions encountered in the previous
section include seq, c, cbind,
rbind, and matrix. Many functions are
available in R and can be called by the user. I can also write my own
functions. For example, if I want to take the mean of the elements in
xvec, I can call the R function
mean(xvec) ## [1] 3
The argument sent to mean is xvec.
mean(xvec) returns the answer 3. I can supply additional
arguments to mean, which can, for example, handle missing
cases. If I want to know the options, I use the help function,
help(mean)I could also write my own function, such as
mymean <- function(x){sum(x)/length(x)}This code assigns to the name mymean a function that
takes an argument that is assigned within the function
mymean the name x. The function
mymean, in turn, calls two internal R functions
sum and length to determine the mean of
whatever vector has been passed to it. To determine the mean of
xvec I call the function,
mymean(xvec)## [1] 3
Due to the fact that xvec is the first (and, in this
case, only) argument passed to mymean, it will be assigned
the internal name x, a name that is shielded from the main
program. Thus, operations within functions do not interact with objects
in the user workspace.
Any object that is needed but not assigned within a
function must either be passed or it must be available from
the user’s workspace. If the object is not in either of these places the
function will give up. Note that a function called by
another function will not have available to it any objects
created in the calling function, unless they are passed.
Specifically, if I call a function that I have written
a(), which generates an object r and then
calls from within a() another function b()
without passing r, then b() cannot know about
r.
b <- function(r){r^2}
a <- function(m, p){ # function b cannot know about r
r <- m/p
b()
}
a(7,2) # this fails
a <- function(m, p){ # now I pass r
b(m/p)
}
a(7,2) # this worksBecause objects in the home environment may not have the values that
I think they do, I write functions to accept as arguments
all of the objects that will be needed. I do not rely on them being
available in the user’s workspace.
A function returns an object to the workspace. I can do this in a few
ways. Here is a function to return the mean of x:
myMean <- function(x){
mu <- mean(x)
return( mu ) # return this object
}Here is call to this function:
x <- rnorm(1000)
myMean(x)## [1] 0.02274613
However, I know from the previous definition of myMean
that this will also work:
myMean <- function(x){
mean(x) # operation is not assigned, so it is returned directly
}But this will not work:
myMean <- function(x){
mu <- mean(x) # assigned but not returned
}If I do not assign the last operation to a name the function will return that last object.
Try this now: Write a function to calculate the mean and standard deviation of a vector and return them as a length-2 vector.
Basic matrix operations
The basic operations used for scalar quantities apply element-wise to
numeric matrices of the same dimension. This includes +,
-, *, /:
a <- matrix( rnorm(20), 5, 4 )
b <- matrix( rnorm(20), 5, 4 )
c <- a/b
all.equal(a, b*c) # recover matrix a## [1] TRUE
Matrix multiplication is written as %*% and requires
that the first matrix has the same number of columns as the number of
rows in the second matrix. If \(\mathbf{A}\) is \(n \times k\) and \(\mathbf{B}\) is \(k \times m\), then \(\mathbf{A} \times \mathbf{B}\) is \(n \times m\):
n <- 6
k <- 3
m <- 4
A <- matrix( rnorm(n*k), n, k )
B <- matrix( rnorm(k*m), k, m )
C <- A%*%B
print(C)## [,1] [,2] [,3] [,4]
## [1,] -1.5577034 0.32028842 1.4129541 0.4826290
## [2,] -1.6750248 -0.59794597 2.6374972 0.5906760
## [3,] 0.7597912 -0.25421742 -0.2383849 -0.2836977
## [4,] -2.4225748 0.32338237 2.6698078 0.7197284
## [5,] -1.9559941 0.16059894 2.3211554 0.5810423
## [6,] -2.6861332 0.08956355 2.1132421 1.0128224
Recall from matrix algebra that element C[i,j] will be
the sum I get from multiplying row i of A and
column j of B.
i <- 2
j <- 3
sum( A[i,]*B[,j] )## [1] 2.637497
Here are these matrices:
I have highlighted row i of A, which is
multiplied by column j of B and summed to give
element [i,j] of C.
A diagonal matrix can be generated with the
function diag. The identity matrix is a
diagonal matrix with ones along the diagonal. The product of the
identity matrix and another matrix is the same
matrix:
I <- diag(k)
identical( A, A%*%I )## [1] TRUE
Of course, I and A must have compatible
dimensions. Look at I and A in this
example.
R objects have storage modes
The most common storage modes for R objects are
numeric and character.
| object | attributes |
|---|---|
vector |
a vector can hold numeric, character, or
factor |
matrix |
a 2-dimensional array |
array |
use for more than 2 dimensions |
list |
objects that can be named or referenced as [[1]],
[[2]], or by name |
data.frame |
a list of objects of equal length bound together as
columns |
factor |
a vector containing discrete levels |
A data.frame is a type of list where
objects are vectors of equal length but not
necessarily of the same storage mode. When rendered to the
console, a data.frame looks like a table, or matrix.
However, because it is not stored as a matrix, it does not
admit matrix operations like +,
*, or %*%.
which function
I am confronted with a table of values, some of which are missing. I want to initialize the missing values with the mean value for that column.
This problem arises when there are missing predictors in a design matrix. Sometimes a decision is made to fill those missing values with the mean for that predictor.
To illustrate with an example I generate a random matrix
and fill some of the values with NA, which indicates that a
value is missing.
n <- 8 # rows
k <- 4 # columns
x <- matrix( round(rnorm(n*k), 2), n, k ) # a random matrix
miss <- sample(n*k, 7) # random locations for NA
x[ miss ] <- NA # set to NA
print( x )## [,1] [,2] [,3] [,4]
## [1,] -0.38 NA NA -1.33
## [2,] -0.17 1.73 NA NA
## [3,] -0.33 2.74 0.79 -0.46
## [4,] 0.38 NA 0.51 0.05
## [5,] -0.33 -0.36 -0.11 0.12
## [6,] 0.40 -1.92 NA -0.71
## [7,] 0.28 -1.69 NA 0.82
## [8,] 0.05 -0.16 -0.52 -0.14
Here is diagram of matrix locations, highlighting the locations of
NA values:
A matrix is stored by stacking columns atop one another, referenced
by numbers from 1 to n*k. The NA values in
matrix xoccupy the brown cells. I know the locations of the
NA values, because I put them there myself when I generated
the vector miss = 9, 12, 17, 18, 22, 23, 26. These elements
are shown at left. They could also be located using
which( is.na(x) ) = 9, 12, 17, 18, 22, 23, 26.
Matrix locations can also be referenced by row and column
[i,j] as shown at right. In many cases it will be important
to translate the location vector into paired coordinates that specify
[row,column]. I can get them with the argument
arr.ind = T:
miss <- which( is.na(x), arr.ind = T ) # provide the array index [row, column]
print( miss )## row col
## [1,] 1 2
## [2,] 4 2
## [3,] 1 3
## [4,] 2 3
## [5,] 6 3
## [6,] 7 3
## [7,] 2 4
This version of miss holds the same information, but it
is often preferable where I want to see column locations.
I now have a quick solution to the initial problem, filling the
NA values in matrix x with mean values for
that column
miss <- which( is.na(x), arr.ind = T ) # find missing values
cmean <- colMeans( x, na.rm=T ) # column means
x[ miss ] <- cmean[ miss[,2] ] # missing values to column mean in cmu
print( round(x, 2) )## [,1] [,2] [,3] [,4]
## [1,] -0.38 0.06 0.17 -1.33
## [2,] -0.17 1.73 0.17 -0.24
## [3,] -0.33 2.74 0.79 -0.46
## [4,] 0.38 0.06 0.51 0.05
## [5,] -0.33 -0.36 -0.11 0.12
## [6,] 0.40 -1.92 0.17 -0.71
## [7,] 0.28 -1.69 0.17 0.82
## [8,] 0.05 -0.16 -0.52 -0.14
Here I highlight the values that were replaced:
Take a moment to see how I’ve used miss as a two-column
matrix to locate missing values in x and as a
vector to select the right column mean from
cmu.
The match function
The foregoing format changes and summaries were made easy by
vectorization using indices that I had generated to simulate the data,
the vectors ind and time. How do I generate
these indexes from a data set that does not come with a fixed number of
times per subject?
I can generate the index using the match function, which
locates the matching entry for one vector in a second
vector. For example,
a <- sample( letters, 5, replace=T ) # vector a holds 5 random letters
m <- match(a, letters) # m holds the first matches in the alphabet
rbind( a, letters[m] ) # it worked## [,1] [,2] [,3] [,4] [,5]
## a "s" "s" "s" "y" "q"
## "s" "s" "s" "y" "q"
In this example, match compared a random sample of 5
letters to the 26 letters of the alphabet. It returned the matching
locations as the integer vector m.
In the block of code that follows I use match to compare
the variable ind from data to a vector of
unique individuals. The match ii holds one entry for each
row in data (i.e., length(ind) = nrow(data)).
I do the same for time:
# find the unique individuals and times
inds <- sort( unique( data$ind ) ) # unique individuals
times <- sort( unique( data$time) ) # unique times
# match rows in data to inds and times
ii <- match( data$ind, inds )
jj <- match( data$time, times )
mm <- cbind( ii, jj )
identical( xmat[ mm ], x ) # it worked## [1] TRUE
By combining the locations ii with jj in
the matrix mm I was able to extract from
matrix xmat the vector x.
Try this now: Write a function to
transform a data vector into a subject-by-time
matrix.
- Execute the code below to generate a random data set with individuals labeled by letters
-
Find unique values for subjects
iand timestand initialize amatrix xwithNA.xmust have enough rows to accommodate all individuals (ni) and enough columns to accommodate all times (nj). -
Use
matchto aligniwith unique subject letters andtto unique times -
Fill the matrix with values of
x - Wrap it in a function to return the matrix
n <- 100
ni <- 8
nt <- 6
x <- rnorm(100)
t <- sample(nt, n, replace = T)
i <- sample(letters[1:ni], n, replace = T)
id <- paste(i, t, sep='_')
data <- data.frame(id, i, t, x)
data <- data[ !duplicated(data$id), ] # remove duplicates
data <- data[ order(data$id), ] # order by sample idNote that data has multiple columns, but the vector of
interest is only the last column. Now go to step 2.
R packages
Groups of related functions are packaged for use by the community and made available through the CRAN website, github, and elsewhere. You can build your own package for use by your own group. These are loaded like this:
install.packages( 'nameOfPackageIWant' )Once installed I attach it to an environment using
library( nameOfPackageIWant )To see what packages are loaded I can list them with the
search function
search()## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
Once a package is installed, functions can be used without loading
all functions in the package using
packageName::functionName(). This has the advantage of
reducing conflicts between functions having the same name in different
packages.
Getting into a file
The range of different file types that can be interpreted in R is broad. The most common types that I interpret include the following:
| file type | example function |
|---|---|
.txt |
read.table |
.csv |
read.csv |
.Rdata |
load |
.r |
source |
Note that the function data.table::fread is a fast
alternative to these traditional functions and requires limited
qualification.
Efficiency in R
As with any computer language the best R programmers fumble around all the time. Efficiency depends on ways to find quick answers. There are at least two ways to go about this. Any errors that relate to the interpretation of specific R functions can be quickly checked with the help function:
help( nameOfConfusingFunction )The second route is to google the error message. You will arrive at a handfull of web sites inhabited by knowledgable and helpful people who spend their free time answering questions like yours. You will almost always find that your errors have been through these web sites many times.
Optional challenge
This file has breeding-bird survey (BBS) data stored as
observations-by-species, where an observation is a route-year
combination. I want to explore how the abundance of an individual
species changes over time, so I want to create a route-by-year
matrix. Try this in two ways:
for an individual species
as a summary for all species
I load the data here:
load( 'bbsExample.Rdata', verbose = T )
head(x) # examine structure of data from data attributes
head(y) # spp data aligned with x, rownames are route_year
routes <- sort(unique(x$Route)) # unique routes
years <- sort(unique(x$year)) # unique yearsTry this: Here are steps you might follow. For a single species, this can work:
- construct a
matrixthat can accommodate oneroutein each row and oneyearin each column matchthe routes inxwithroutescreated in the previous code blockmatchthe years inxwithyears- select a species (column) in
yand used the vectors generated bymatchto fill the newmatrix - sum the counts by
yearusingcolSums
For all species, this can work:
- Because I am going to sum over all columns in
y, therouteandyearcolumns inxmust be repeated for every column iny, e.g.,rep( x$Route, ncol(y) ). These extendedrouteandyearindices can be used withtapplyto generate a summedroutebyyearmatrix. - as for one species, the total count by
yearcan be obtained usingcolSums
As an added challenge, express these counts as counts per effort (CPE), where effort is the number of routes in each year.
Examine trends for your species and for all species over time.