This brief overview introduces some of the key elements, intended for those who have limited experience with R. It introduces R objects, basic operations, storage modes, and functions. Concepts include vectorization.
Algorithm development in this course will exploit a key feature of “base R” called vectorization. By vectorizing operations, the language R allows a user to operate at a level that is intermediate between low-level languages like FORTRAN and C, which specify each operation with a separate line of code, and point-and-click (or drag-and-drop) software that bury algorithms behind a user interface. Vectorization allows for development of basic algorithms that are fast. To see this, here is low-level code that I could write to multiply by 2 the numbers from 1 to 10:
numbers <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) # create a vector
for(j in 1:10){ # each element times 2
numbers[j] <- numbers[j]*2
}
This code touches each element of the vector numbers
using a for loop. Exploiting vectorization this becomes
numbers <- c(1:10)*2
The simplification comes with no loss of transparency. The advantages of vectorization will affect nearly everything we do in this course.
Where operations cannot be vectorized in R, the expanding capacity to link them to C++ code offers the R user an approach to generate fast code with the control of low-level programming.
In recent years, high-level packages like tidyverse have been added to R that bundle more of the common algorithmic operations within a novel syntax. This can simplify many tasks, but it comes with the disadvantage that most users never learn to exploit R’s greatest strength: vectorization. Because this course requires algorithm development I focus on base R, but I encourage use of other approaches where ever they can help.
The language R and the software RStudio have expanded throughout the sciences (and now, humanities), because vectorization allows a user to work at multiple levels. A computer’s central processing unit (CPU) recognizes binary inputs. We could call them zeros and ones. Or “no” and “yes”–they are qualitative, not quantitative. With enough of these zeros and ones it is possible to represent numbers, letters, and words. It is possible to represent tasks.
The development of computer languages has been guided, in part, by efforts to find an optimal distance between the user language and the machine language (translating zeros and ones). Low-level languages that persist to this day include FORTRAN and C. In fact, many of the underlying functions in R are written in these languages. A programmer can read and write operations in these languages with control over each operation. A compiler translates C into zeros and ones in a way that can be understood by the CPU. The compiled code is processed all at once. The advantages of control and fast processing comes with the cost that finding problems can be cumbersome. The compilation puts some distance between programmer and CPU. At the other extreme are high-level software packages that have limited adaptation to specific questions. Where possible, existing software should be exploited. If a study calls for linear regression there is no need to reinvent the wheel. However, most ecologists are commonly confronted with misaligned and unbalanced data. Many questions suggest models that have not been written before. They require new algorithms. Vectorization lets a user build algorithms at a low level, without costly loops, within a framework that admits many high-level operations.
It’s worth understanding a few things about your computer that are relevant when writing algorithms to be used with large data sets. Speed is important. The CPU processes inputs, reported in billions of cycles per second, or gigahertz (GHz). The actual work that gets done further depends on the number of cores that simultaneously share tasks. In principal, more cores means higher capacity. However, efficiency keeps the total number of cores relatively small, as they may have to work together without tripping all over each other. Of course, a faster CPU is good idea, but at any given time, processing speed may not vary widely between laptops and workstations.
Where computers still differ importantly is in main memory. Traditionally called random access memory (RAM) but now extending to additional types, main memory sets the limits on what can be accessed immediately. It’s a bit like the facts that live in your head versus those that you “store” outside your brain, in books, the internet, and so on. When you need them, you look them up, and many will be forgotten by tomorrow or next week. For the computer, this external storage is the disk space. Main memory is managed in part by writing (and then deleting) temporary files. Operations that can be managed with main memory are generally faster than those that require writing and reading from the disk. Part of the time it takes to start up involves reading from files that store information from the last time the machine was open. Likewise, shut down requires time as selected information that will be needed in the next session is written to disk. Web sites also write files to your disk, so they do not have to store all information on millions of users. Management of the potentially huge clutter than can accumulate from this process is both a hardware and a software challenge. For computation in this course, it is helpful to have main memory of 32, 64 gigabytes (GB), or more. Main memory remains relatively cheap, so pay attention to RAM when purchasing a computer that you expect to use for data analysis.
With this background, let’s learn about R.
The symbol > at the beginning of a line means that R
is ready to receive input. Lines beginning with a number in brackets,
e.g. [1], are responses from R. The following example of
three lines includes two I entered followed by a response from R:
z <- 5 #assign the value 5 to ‘z’
z
## [1] 5
The first line contains the assignment operator <-,
which sets the left-hand side equal to the right-hand side. In the first
line I define a variable named z, and I assign to it the
value 5. The # symbol indicates that a comment
follows. R will not execute anything to the right of the symbol
#. In the second line I entered the name of the variable,
and that value was displayed on the console. The index [1]
seems pointless here, but it’s useful for arrays as an
indicator of location (see below).
list and data.frameThe list storage mode offers one way to hold a
heterogeneous collection of objects. In the forest inventory example for
next time, each tree has attributions including diameter
(numeric), leaf habit (factor), and species
(character). Here are these attributes for two trees:
diameter <- c(5.7, 82)
species <- c('red maple', 'white pine')
leafHabit <- as.factor( c('deciduous','evergreen') )
treeList <- list( spec = species, diam = diameter, leaf = leafHabit ) # as a list
treeData <- data.frame( species, diameter, leafHabit ) # as a data frame
I have organized these objects first as a list, where
the attributes are given the names spec and
diam. If I enter treeList it will be displayed
with the names sequentially. I can refer to the second I object as
treeList$diam.
I then organize the same objects into a data.frame,
which is a specific type of list where each object has the
same length. For this reason, it will be displayed as a table with the
objects species, diameter, and
leafHabit as column headings:
knitr::kable(treeData)
| species | diameter | leafHabit |
|---|---|---|
| red maple | 5.7 | deciduous |
| white pine | 82.0 | evergreen |
Both lists and data.frames are widely used
in R.
A function performs one or more operations on one or
more arguments. If I want to repeat the value of z three
times I could use the rep function with arguments for
z and 3 and get this response:
rep(z, times = 3) = 5, 5, 5. The function name is followed
by parentheses. Arguments are inside the parentheses.
If I forget the arguments that are required for rep I
can use help(rep). Note here that help is the
name of the function and it takes another function name as its argument.
If I forget what the help function does, I can enter
help(help).
When I consult the help page for rep, the first section
Usage says this:
rep( x, ...)
Where it asks for x, I sent a specific value, which I
have called z. If I do not give a name to the first object,
it will assign it a name x. Outside the function, I call
this variable z. An equivalent way to call this function
would be this:
rep( x = z, times = 3 )
The ellipsis in the call to rep means that it will
accept a number of arguments that, if not specified, default to preset
values. For example, if I want to repeat each of the values in
x three times I could do this:
rep( x = c(1,2), each = 3 ) = 1, 1, 1, 2, 2, 2
In the case of the rep function I need to specify at
least one of the options listed on the help page. Otherwise, it simply
returns x.
A function might return any type of R object, e.g., a value, a
vector, a list, or even another
function.
The help page for the function describes the
Arguments needed and the Value that
will be returned.
Not all functions require an argument. For example, the function
ls() (no argument) returns a character vector
with the names of objects that are in my current work environment.
The syntax for using the function that generates a normal (Gaussian)
density is given in the help page this way:
dnorm(x, mean = 0, sd = 1, log = FALSE)
I see four arguments. The first argument is a vector of quantiles
x. It is not assigned a default value, so I must supply it.
The remaining arguments are each assigned defaults. If I do not supply
them, they will be assigned these values.
If I want to assign values to mean and sd I
can do it without names, provided that I use the order assumed by the
function: first mean, then sd:
dnorm(x, 58.2, 3.8)
If I want to supply them in some other order, I have to name them, e.g.,
dnorm(x, sd = 3.8, mean = 58.2)
Because I did not specify it in either of these examples, the default
log = FALSE applies.
For a combination of assigned and preset values I could this:
dnorm(x, log = TRUE )
which is the same as
dnorm(x, mean = 0, sd = 1, log = TRUE). If I say
dnorm(x, 3.8), it will assume that the value
3.8 refers to the mean, not the
sd.
In the help page for the function rep I saw this:
rep( x, ...)
The help page explained that ... represents
“further arguments” followed by some examples like times,
length.out, each. The help page
may not list all possible arguments that could be used here. In this
case, googling the function may help.
Numbers are often collected into an array. The
difference between a vector, matrix, and
array is in how they are rendered to the console and some
attributes to recognize lengths in 1, 2, and 3 dimensions. Locations in
arrays are indexed by square brackets. The
array dimensions are separated by commas. For a
one-dimensional array, a vector, there are no
commas. Here I construct a vector with even numbers from –2
to 8 to which I assign the name xvec,
xvec <- seq( -2, 9, by = 2 ) # sequence of values
xvec
## [1] -2 0 2 4 6 8
The function seq( from, to, by ) creates the
vector. When I entered the name of the
vector xvec, it is displayed on the console. If I want only
the 3rd element of this vector, I enter
xvec[3] # the third value of a vector
## [1] 2
If I want the last three elements, I enter
xvec[4:6] # the 4th, 5th, and 6th values
## [1] 4 6 8
If I want the \(j^{th}\) element of
xvec, I assign a value to j and then ask for
that value
j <- 5
xvec[j] # the jth value of xvec
## [1] 6
To determine the length of the vector use the
length function:
length( xvec )
## [1] 6
I could also create a vector with the concatenate
function c,
zvec <- c( 5.2, 7, -15 ) #define a vector with 3 elements
zvec
## [1] 5.2 7.0 -15.0
To construct a 2-dimensional array, or
matrix, I could use the matrix function
n <- 2
m <- 3
xmat <- matrix( xvec, nrow = n, ncol = m, byrow = T ) # 2 by 3 matrix
xmat
## [,1] [,2] [,3]
## [1,] -2 0 2
## [2,] 4 6 8
In my call to the R function matrix, I said to construct
a 2 by 3 matrix from the vector having the
name xvec (the length-6 vector), filling the
matrix by rows. The default is to fill a
matrix by columns (start with column 1, then column 2,…).
The dimension of xmat is
dim( xmat )
## [1] 2 3
The 2nd column is
xmat[,2]
## [1] 0 6
and the element in row 1, column 3 is
xmat[1,3]
## [1] 2
The [j,k]th element of xmat is
i <- 1
j <- 3
xmat[i,j]
## [1] 2
The matrix consisting of the first and third columns is
xmat[ ,c(1,3) ]
## [,1] [,2]
## [1,] -2 2
## [2,] 4 8
Here is a diagram showing row [i,] = 1, column
[,j] = 3, and then element [i,j]:
Equivalently, I could get element [i,j] this way:
ij <- cbind(i,j) #bind two values, each as a column
ij
## i j
## [1,] 1 3
Now I can use this two-column matrix to specify values in
xmat like this:
xmat[ ij ]
## [1] 2
Here I have bound together two values to construct a row
vector ij, which R will read as the two dimensions of
xmat, the row and column, respectively.
I can also construct a matrix by binding together
vectors, provided they have the same length.
For example, the same matrix xmat can be constructed by
binding two rows created with the concatenate function
c,
rbind( c(-2, 0, 2), c(4, 6, 8) ) # bind 2 row vectors
## [,1] [,2] [,3]
## [1,] -2 0 2
## [2,] 4 6 8
or by binding together three columns
cbind( c(-2,4), c(0,6), c(2,8) ) # bind 3 columns
## [,1] [,2] [,3]
## [1,] -2 0 2
## [2,] 4 6 8
R executes the inner functions first, in this case the calls to
function c, followed by the outer call to function
rbind or cbind. I now say more about
functions.
Try this now: Construct a matrix with
n rows and m columns filled with random
numbers drawn from a uniform distribution between -1 and 1 using the
function runif.
Vectorization is one of the most important features of R, allowing one to operate on elements of a vector without programming loops. Loops are the standard way to apply operations to elements of arrays sequentially. Loops are fast in (low-level) compiled languages, but slow in the interactive language R. Vectorization internalizes loops (in compiled code) so they do not slow execution.
Consider a population of individuals, each of which can be
reproductive (“1”) or not (“0”) depending on their differing
developmental stages. The mature state 1 is reached in stage 3. I wish
to simulate developmental stage \([1, \dots,
5]\) for n = 10 individuals and assign each the
correct reproductive state \([0,
1]\).
First, I generate a random length-n vector that I will
call stage using the sample function. I define
a second vector repro of binary indicators (zero or one)
for whether or not a stage is reproductive. I want to create a vector
y that holds the reproductive states for each individual in
the sample. Here is a for loop to inspect the
stage of each individual, determine if it is reproductive,
and, if so, assign it a value of 1:
n <- 10 # sample size
nstage <- 5 # no. developmental stages
stage <- sample(nstage, n, replace = T) # randomly assign a stage to each observation
repro <- c(0,0,1,1,1) # early stages not reproductive
y <- stage*0 # initialize data
for(i in 1:n){ # loop over i from 1 to n
if(repro[ stage[i] ] == 1)y[i] <- 1 # determine if each is reproductive
}
If the first element of the vector stage = 4, then the
first individual is assigned repro[stage[1]] = 1. Here is
what I get for the entire vector:
print( rbind(stage, y) )
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## stage 4 5 4 5 3 1 2 3 2 5
## y 1 1 1 1 1 0 0 1 0 1
This loop executes fast, because there are only 10 individuals. It would be too slow if I had, say, \(10^6\) individuals or if I had to repeat this loop thousands of times (nested loops).
Here is the same operation, where the for loop is
replaced with a vector:
y <- repro[stage]
This alternative is vectorization. It is not only fast, but also simple. Use loops only when vectorization is not possible.
Try this now. Simulate exponential growth using a
for loop and determine that simulated values agree with the
growth rate used to generate those data.
n time steps and a growth rate drawn at random
from rho <- runif(1, .9, 1.1)for loop where the next value of x
will be rho times as large as the current value.Scatter plots are generated with the function plot. Here
is a plot of two simulated vectors:
n <- 500
x <- rnorm(n)
y <- rnorm(n, x, .4)
plot( x, y)
The function rnorm draws random values from a
distribution. I need a way to visualize the sample.
Histograms can be generated with the function hist. Here
are histograms for the preceeding two vectors:
par(mfrow = c(1,2))
hist(x)
hist(y, nclass = 100)
Try this now: Simulate random body sizes for a
population consisting of males and females where mean body size differs
by gender, and the fraction that are gender 1 is a
parameter p. Do this using i) a for loop and
ii) by vectorization. Generate a histogram of body size.
n
g using rbinom( n, 1, p)
mu
for loop where you check the gender of each
individual and draw a random size using rnorm( 1, mu )
g as an index for a length-2
vector mu
A function contains operations to be performed on objects
that have been previously defined in the main program (they exist as
objects in the user’s workspace) and/or on arguments passed to them.
Nearly every operation in R can be viewed as a function, and I can write
my own functions.
Just like the built-in R functions, each function that I
write is identified by a name. Any arguments that will be passed to the
function (as opposed to being available directly from the user’s
workspace) are contained in parentheses that follow the name of the
function. Examples of functions encountered in the previous
section include seq, c, cbind,
rbind, and matrix. Many functions are
available in R and can be called by the user. I can also write my own
functions. For example, if I want to take the mean of the elements in
xvec, I can call the R function
mean(xvec)
## [1] 3
The argument sent to mean is xvec.
mean(xvec) returns the answer 3. I can supply additional
arguments to mean, which can, for example, handle missing
cases. If I want to know the options, I use the help function,
help(mean)
I could also write my own function, such as
mymean <- function(x){sum(x)/length(x)}
This code assigns to the name mymean a function that
takes an argument that is assigned within the function
mymean the name x. The function
mymean, in turn, calls two internal R functions
sum and length to determine the mean of
whatever vector has been passed to it. To determine the mean of
xvec I call the function,
mymean(xvec)
## [1] 3
Due to the fact that xvec is the first (and, in this
case, only) argument passed to mymean, it will be assigned
the internal name x, a name that is shielded from the main
program. Thus, operations within functions do not interact with objects
in the user workspace.
Any object that is needed but not assigned within a
function must either be passed or it must be available from
the user’s workspace. If the object is not in either of these places the
function will give up. Note that a function called by
another function will not have available to it any objects
created in the calling function, unless they are passed.
Specifically, if I call a function that I have written
a(), which generates an object r and then
calls from within a() another function b()
without passing r, then b() cannot know about
r.
b <- function(r){r^2} # square a number
a <- function(m, p){
r <- m/p # r is generated in the calling funciton, but b doesn't know that
b()
}
a(7,2) # this fails
a <- function(m, p){ # now I pass r
b(m/p)
}
a(7,2) # this works
Because objects in the home environment may not have the values that
I think they do, I write functions to accept as arguments
all of the objects that will be needed. I do not rely on them being
available in the user’s workspace.
A function returns an object to the workspace. I can do this in a few
ways. Here is a function to return the mean of x:
myMean <- function(x){
mu <- mean(x)
return( mu ) # return this object
}
Here is a call to this function:
x <- rnorm(1000)
myMean(x)
## [1] -0.007113672
However, I know from the previous definition of myMean
that this will also work:
myMean <- function(x){
mean(x) # operation is not assigned, so it is returned directly
}
But this will not work:
myMean <- function(x){
mu <- mean(x) # assigned but not returned
}
If I do not assign the last operation to a name the function will return that last object.
Try this now: Write a function to calculate the mean and standard deviation of a vector and return them as a length-2 vector.
The basic operations used for scalar quantities apply element-wise to
numeric matrices of the same dimension. This only works if the
matrices have the same dimensions. These operations include
+, -, *, /:
a <- matrix( rnorm(20), 5, 4 )
b <- matrix( rnorm(20), 5, 4 )
c <- a/b
all.equal(a, b*c) # recover matrix a
## [1] TRUE
Matrix multiplication is written as %*% and requires
that the first matrix has the same number of columns as the number of
rows in the second matrix. If \(\mathbf{A}\) is \(n \times k\) and \(\mathbf{B}\) is \(k \times m\), then \(\mathbf{A} \times \mathbf{B}\) is \(n \times m\):
n <- 6
k <- 3
m <- 4
A <- matrix( rnorm(n*k), n, k )
B <- matrix( rnorm(k*m), k, m )
C <- A%*%B
print(C)
## [,1] [,2] [,3] [,4]
## [1,] 2.2768143 0.35393760 -3.8235079 4.2895192
## [2,] -0.5131719 0.38662255 -0.3357246 -0.3785857
## [3,] 0.1020754 -0.08770148 1.0909963 -0.6363143
## [4,] -0.4870463 -0.09908917 2.1361214 -1.8283885
## [5,] -1.5934130 -0.64168348 2.9404518 -2.9756617
## [6,] -1.4585513 -0.22212895 2.7179952 -2.9385403
Recall from matrix algebra that element C[i,j] will be
the sum I get from multiplying row i of A and
column j of B.
i <- 2
j <- 3
sum( A[i,]*B[,j] )
## [1] -0.3357246
Here are these matrices:
I have highlighted row i of A, which is
multiplied by column j of B and summed to give
element [i,j] of C.
A diagonal matrix can be generated with the
function diag. The identity matrix is a
diagonal matrix with ones along the diagonal. The product of the
identity matrix and another matrix is the same
matrix:
I <- diag(k)
identical( A, A%*%I )
## [1] TRUE
Of course, I and A must have compatible
dimensions. Look at I and A in this
example.
Also, a square matrix multiplied by its inverse is the identity matrix:
A <- matrix( rnorm( 9 ), 3, 3 ) # a random square matrix
B <- solve( A ) # its inverse
round( B%*%A, 10 ) # recover identity matrix
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
The most common storage modes for R objects are
numeric and character.
| object | attributes |
|---|---|
vector |
a vector can hold numeric, character, or
factor levels |
matrix |
a 2-dimensional array |
array |
use for more than 2 dimensions |
list |
objects that can be named or referenced as [[1]],
[[2]], or by name |
data.frame |
a list of objects of equal length displayed together as
columns |
factor |
a vector containing discrete levels |
A data.frame is a type of list where
objects are vectors of equal length but not
necessarily of the same storage mode. When rendered to the
console, a data.frame looks like a table, or matrix.
However, because it is not stored as a matrix, it does not
admit matrix operations like +,
*, or %*%.
which functionConsider the following task:
I am confronted with a table of values, some of which are missing. I want to initialize the missing values with the mean value for that column.
This problem arises when there are missing predictors in a design
matrix. Sometimes a decision is made to fill those missing values with
the mean for that predictor. To illustrate with an example I generate a
random matrix and fill some of the values with
NA, which indicates that a value is missing.
n <- 8 # rows
k <- 4 # columns
x <- matrix( round(rnorm(n*k), 2), n, k ) # a random matrix
miss <- sample(n*k, 7) # random locations for NA
x[ miss ] <- NA # set values in x to NA
Here is the matrix with missing values:
print(x)
## [,1] [,2] [,3] [,4]
## [1,] -1.47 -1.78 0.57 0.56
## [2,] 0.20 0.08 0.72 -0.87
## [3,] NA -0.64 0.08 NA
## [4,] NA -0.39 -0.74 NA
## [5,] -0.28 -0.55 0.34 NA
## [6,] 0.90 1.66 1.50 0.26
## [7,] 0.31 NA NA -0.68
## [8,] -1.19 0.12 0.16 0.56
The matrix is displayed as rows and columns, but it is stored by
stacking columns atop one another, referenced by numbers from 1 to
n*k. Here is a diagram of matrix locations in
x, highlighting the locations of NA values,
using two different ways to reference location:
I know the locations of the NA values, because I put
them there myself when I generated the vector miss = 3, 4,
15, 23, 27, 28, 29. These elements are shown at left. They could also be
located using which( is.na(x) ) = 3, 4, 15, 23, 27, 28,
29. This location refers to how the matrix is stored (\(1, \dots, nk\)).
Matrix locations can also be referenced by row and column
[i,j] as shown at right. I can get the locations in this
format with the argument arr.ind = T:
missAI <- which( is.na(x), arr.ind = T ) # provide the array index [row, column]
print( missAI )
## row col
## [1,] 3 1
## [2,] 4 1
## [3,] 7 2
## [4,] 7 3
## [5,] 3 4
## [6,] 4 4
## [7,] 5 4
This version of missAI holds the same information, but
it is often preferable where I want to see column locations.
I now have a quick solution to the initial problem, filling the
NA values in matrix x with mean values for
that column
cmean <- colMeans( x, na.rm = T ) # column means
x[ missAI ] <- cmean[ missAI[,2] ] # missing values to column mean in cmu
print( round(x, 2) )
## [,1] [,2] [,3] [,4]
## [1,] -1.47 -1.78 0.57 0.56
## [2,] 0.20 0.08 0.72 -0.87
## [3,] -0.26 -0.64 0.08 -0.03
## [4,] -0.26 -0.39 -0.74 -0.03
## [5,] -0.28 -0.55 0.34 -0.03
## [6,] 0.90 1.66 1.50 0.26
## [7,] 0.31 -0.21 0.38 -0.68
## [8,] -1.19 0.12 0.16 0.56
Here I highlight the values that were replaced:
Take a moment to see how I’ve used missAI as a
two-column matrix to locate missing values in
x and as a vector to select the right column
mean from cmu.
match functionThe foregoing format changes and summaries were made easy by
vectorization using indices that I had generated to simulate the data,
the vectors ind and time. How do I generate
these indexes from a data set that does not come with a fixed number of
times per subject?
I can generate the index using the match function, which
locates the matching entry for one vector in a second
vector. For example,
a <- sample( letters, 5, replace = T ) # vector a holds 5 random letters
m <- match(a, letters) # m holds the first matches in the alphabet
print( rbind( a, letters[m] ) ) # it worked
## [,1] [,2] [,3] [,4] [,5]
## a "g" "r" "i" "g" "y"
## "g" "r" "i" "g" "y"
In this example, match compared a random sample of 5
letters to the 26 letters of the alphabet. It returned the matching
locations as the integer vector m.
In the block of code that follows I use match to compare
the variable ind from data to a vector of
unique individuals. The match ii holds one entry for each
row in data (i.e., length(ind) = nrow(data)).
I do the same for time:
# find the unique individuals and times
inds <- sort( unique( data$ind ) ) # unique individuals
times <- sort( unique( data$time) ) # unique times
# match rows in data to inds and times
ii <- match( data$ind, inds )
jj <- match( data$time, times )
mm <- cbind( ii, jj )
identical( xmat[ mm ], x ) # it worked
## [1] TRUE
By combining the locations ii with jj in
the matrix mm I was able to extract from
matrix xmat the vector x.
Try this now: Write a function to
transform a data vector into a subject-by-time
matrix.
i and times t
and initialize a matrix x with NA.
x must have enough rows to accommodate all individuals
(ni) and enough columns to accommodate all times
(nj).
match to align i with unique subject
letters and t to unique times
x
n <- 100
ni <- 8
nt <- 6
x <- rnorm(100)
t <- sample(nt, n, replace = T)
i <- sample(letters[1:ni], n, replace = T)
id <- paste(i, t, sep = '_')
data <- data.frame(id, i, t, x)
data <- data[ !duplicated(data$id), ] # remove duplicates
data <- data[ order(data$id), ] # order by sample id
Note that data has multiple columns, but the vector of
interest is only the last column. Now go to step 2.
These are common operations, so I save functions to go both ways: observations as subject-time and subjects by time:
vec2Mat <- function( x, i, j){
I <- sort(unique(i))
J <- sort(unique(j))
mat <- matrix( NA, length(I), length(J), dimnames = list( I, J) )
mm <- cbind( match(i, I), match(j, J) )
mat[ mm ] <- x
mat
}
mat2Vec <- function( mat ){ # reverse: matrix back to vector
if(is.null(colnames(mat)))colnames(mat) <- c(1:ncol(mat))
if(is.null(rownames(mat)))rownames(mat) <- c(1:nrow(mat))
i <- rownames(mat)
j <- colnames(mat)
ii <- rep( i, length(j) )
jj <- rep( j, each = length(i) )
vec <- mat[ cbind(ii, jj) ]
w <- which(is.finite(vec))
df <- data.frame( i = ii, j = jj, x = vec)
df[w,]
}
mat <- vec2Mat( data$x, data$i, data$t )
vec <- mat2Vec( mat )
Try this now: Use the function vec2Mat to
generate a subject-by-time matrix from data. Then use the
function mat2Vec to recover data. Did it work?
Why or why not?
Groups of related functions are packaged for use by the community and made available through the CRAN website, github, and elsewhere. You can build your own package for use by your own group. These are loaded like this:
install.packages( 'nameOfPackageIWant' )
Once installed I attach it to an environment using
library( nameOfPackageIWant )
To see what packages are loaded I can list them with the
search function
search()
## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
Once a package is installed, functions can be used without loading
all functions in the package using
packageName::functionName(). This has the advantage of
reducing conflicts between functions having the same name in different
packages.
The range of different file types that can be interpreted in R is broad. The most common types that I interpret include the following:
| file type | example function |
|---|---|
.txt |
read.table |
.csv |
read.csv |
.Rdata |
load |
.r |
source |
Note that the function data.table::fread is a fast
alternative to these traditional functions and requires limited
qualification.
As with any computer language the best R programmers fumble around all the time. Efficiency depends on ways to find quick answers. There are at least two ways to go about this. Any errors that relate to the interpretation of specific R functions can be quickly checked with the help function:
help( nameOfConfusingFunction )
The second route is to google the error message. You will arrive at a handfull of web sites inhabited by knowledgable and helpful people who spend their free time answering questions like yours. You will almost always find that your errors have been through these web sites many times.
This file has breeding-bird survey (BBS) data stored as
observations-by-species, where an observation is a route-year
combination. I want to explore how the abundance of an individual
species changes over time, so I want to create a route-by-year
matrix. Try this in two ways:
for an individual species
as a summary for all species
I load the data from a gitHub repository here. The data are held in a
data.frame x (info about observations) and a
matrix y (counts by species):
# load( "bbsExample.rdata", verbose = T )
library( repmis )
d <- "https://github.com/jimclarkatduke/gjam/blob/master/bbsExample.rdata?raw = True"
source_data(d)
head(x) # examine structure of data
head(y)[,1:5] # spp counts aligned with x, rownames are route_year
routes <- sort(unique(x$Route)) # unique routes
years <- sort(unique(x$year)) # unique years
The NA values indicate routes that were not observed in
a given year.
Try this: Here are steps you might follow. For a single species, this can work:
matrix that can accommodate one
route in each row and one year in each
columnmatch the routes in x with
routes created in the previous code blockmatch the years in x with
yearsy and use the vectors
generated by match to fill the new matrixyear using colSumsFor all species, this can work:
y, the
route and year columns in x must
be repeated for every column in y, e.g.,
rep( x$Route, ncol(y) ). These extended route
and year indices can be used with tapply to
generate a summed route by year
matrix.year can be
obtained using colSumsAs an added challenge, express these counts as counts per effort (CPE), where effort is the number of routes in each year.
Examine trends for your species and for all species over time.