Welcome to Session III of the TRIA-Net Introduction to R Workshop. We will be introducing the R graphics engine today and demonstrating some of R’s basic plotting functions to get you started. After that, we’ll show how to write your own R functions then finish with some more advanced plotting examples, and steps for creating awesome GIS graphics like this:
Note that today we’ll be using R’s default graphics engine. We don’t have time to cover ggplot2, an alternative graphics package (Wickham 2009) which has become very popular in the R community.
To return to the list of sessions at the rpubs website, click here:
Let’s start by loading some data for examples. R’s default installation includes the dataset package, providing ~90 sets of data from a range of disciplines, all free to explore and use. To see a complete list, enter the following into the console:
library(help = "datasets")
One of these, Loblolly, is a set of growth records of Loblolly pine trees, which seems relevant for this audience. For details on what the data is and where it came from, you can type
?Loblolly
and read the documentation that appears in the help pane. In particular you can find a citation for the dataset (Kung 1986).
As a part of the dataset package, Loblolly is loaded into R by default as a data frame object. Let’s have a look at the contents now using some of the commands we learned in the last session:
dim(Loblolly)
## [1] 84 3
head(Loblolly)
## height age Seed
## 1 4.51 3 301
## 15 10.89 5 301
## 29 28.72 10 301
## 43 41.74 15 301
## 57 52.70 20 301
## 71 60.92 25 301
We see that there are 84 height and age measurements indexed by a unique seed number (In fact each measurement is an average over 16 plantations, and the seed number indicates the seed source).
The unique function will show how many different seed sources were assayed
unique(Loblolly$Seed)
## [1] 301 303 305 307 309 311 315 319 321 323 325 327 329 331
## 14 Levels: 329 < 327 < 325 < 307 < 331 < 311 < 315 < 321 < 319 < ... < 305
What if we wanted to know how many ages were sampled for seed source 309? We can use logical vector indexing to pick out subsets in a data frame. The logical vector here is:
Loblolly$Seed == 309
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
where the double equals-sign tells R to look at the vector Loblolly$Seed and return a vector of equal length, indicating TRUE wherever the seed number is equal to 309, and FALSE wherever it is not. This gives us a logical index of the rows correspond to seed number 309. Now recall that square brackets are used to access specific entries in a vector or data frame. We will use our logical index to specify the rows we want from Loblolly:
Loblolly[Loblolly$Seed == 309,]
## height age Seed
## 5 4.81 3 309
## 19 11.20 5 309
## 33 28.66 10 309
## 47 41.66 15 309
## 61 53.31 20 309
## 75 63.05 25 309
and R returns the subset of the data frame for seed source 309. Since we’re interested in ages in particular, we can pare this down even more by specifying ‘age’ for the column entry in the square brackets:
Loblolly[Loblolly$Seed == 309,'age']
## [1] 3 5 10 15 20 25
Since I’m lousy at counting, I’ll plug this into the length function to count how many ages were measured:
length( Loblolly[Loblolly$Seed == 309,'age'] )
## [1] 6
Now we’ll use a more advanced R function to do this for all the seed numbers at once. The by function can be used to repeat a function call over many subsets of a data frame:
by(Loblolly$age, Loblolly$Seed, FUN=length)
Scan through the output and you’ll find that all the seed sources have 6 associated age-height measurements. How would we check if these are always for ages 3, 5, 10, 15, 20, and 25? The function all.equal can be used to test equality of many elements in a list, so we can use the by function again with FUN = all.equal. This will return us a list of TRUE/FALSE values - I’ll pass these into the function all, which tests if the result was TRUE for all seed sources.
all( by(Loblolly$age, Loblolly$Seed, FUN=all.equal, c(3,5,10,15,20,25)) )
## [1] TRUE
The output is TRUE, so we see that indeed each seed source was measured at these 6 specific ages.
With big datasets, by and its sibling functions apply tapply, and sapply are powerful alternatives to for-loops (which could the same thing, but more slowly). However they can be tricky to use for beginners. A nice intro can be found on this blog.
We’re going to start with a plot of one of these seed sources, then we’ll look at how to include all of them at once.
The plot function creates scatterplots by default. Let’s do this now for seed source 309.
plot(height ~ age, data=Loblolly[Loblolly$Seed == 309,])
There are two arguments in this function call. The first, height ~ age, is referred to as a formula. It specifies which variables to plot, and for a simple plot call it should have the form y ~ x (In the next session we’ll see how this kind of formula can be used to specify a model equation). The second specifies the data source.
As with many tasks in R, we can do this another way:
xvals = Loblolly[Loblolly$Seed == 301,'age']
yvals = Loblolly[Loblolly$Seed == 301,'height']
plot(x=xvals, y=yvals)
That code will produce the same plot, by explicitly providing R with the vectors containing the x-values and the y-values. I think the formula notation is tidier, so I’ll stick with that.
Now let’s look at how to plot multiple seed sources on the same plot. R’s points function can be used to add new points to an existing plot. We’ll use it to add seed source 311 to our scatterplot. And in order to distinguish the two seed sources, I’ll use the pch argument to tell R to use circles for 309 and triangles for 311:
plot(height ~ age, data = Loblolly[Loblolly$Seed == 309,], pch=1)
points(height ~ age, data = Loblolly[Loblolly$Seed == 311,], pch=2)
To add a legend, we call the legend function after a plot has been drawn.
plot(height ~ age, data = Loblolly[Loblolly$Seed == 309,], pch=1)
points(height ~ age, data = Loblolly[Loblolly$Seed == 311,], pch=2)
legend( 'bottomright', c('source 309', 'source 311'), pch=c(1,2) )
The first argument in legend gives the placement, the second argument provides text labels for each entry in the legend, and the third argument tells which symbols to put next to the labels (in this case, the pch codes).
Another basic plotting tool is the lines function, which draws line segments on an existing plot. We’ll use it here to connect the dots.
# make a plot of source 309
plot(height ~ age, data = Loblolly[Loblolly$Seed == 309,], pch=1)
lines(height ~ age, data=Loblolly[Loblolly$Seed == 309,])
# add source 311
points(height ~ age, data = Loblolly[Loblolly$Seed == 311,], pch=2)
lines(height ~ age, data=Loblolly[Loblolly$Seed == 311,])
# add a legend
legend( 'bottomright', c('source 309', 'source 311'), pch=c(1,2) )
Now that the script has gotten a bit longer, I’m adding comments (using the ‘#’ symbol) to keep things organized.
Another basic plotting function in R is called text. This will draw text anywhere on an existing plot, so it’s useful for labelling special points or regions. Let’s label the point for source 311 in year 10:
# make a plot of source 309
plot(height ~ age, data = Loblolly[Loblolly$Seed == 309,], pch=1)
lines(height ~ age, data=Loblolly[Loblolly$Seed == 309,])
# add source 311
points(height ~ age, data = Loblolly[Loblolly$Seed == 311,], pch=2)
lines(height ~ age, data=Loblolly[Loblolly$Seed == 311,])
# add a legend
legend( 'bottomright', c('source 309', 'source 311'), pch=c(1,2) )
# identify a special point
specialheight = Loblolly[Loblolly$Seed == 311,'height'][3]
# label it
text(x=10, y=specialheight, labels='special point', pos=4)
To position the label correctly, I needed the y-coordinate for 311’s year-10 measurement. Year-10 is the 3rd height measurement, so I accessed it using the square bracket index [3]. In the call to the text function, the first two arguments give the x-y coordinates, the third gives the text to write, and the fourth indicates that I want the text to lie to the right of the point (the 4th position) instead of directly on top of the point.
The pch argument that we used to draw circles and triangles in the last section is called the plotting character. It specifies which symbol to use when drawing points. R has hundreds of symbols, each with its own pch number. We can also specify different colours with the col argument, and control the size of the points with cex.
plot(height ~ age, data=Loblolly[Loblolly$Seed == 309,], pch=42, cex=4)
points(height ~ age, data=Loblolly[Loblolly$Seed == 311,], pch=198, col='red')
Like points, the lines function accepts various graphical parameters. For example lwd controls the width of the line, and lty the type (solid, dashed, etc.):
plot(height ~ age, data=Loblolly[Loblolly$Seed == 309,])
lines(height ~ age, data=Loblolly[Loblolly$Seed == 309,], col='green', lwd=2)
Note that lines can’t create a plot for you. It only adds to existing plots. What if you want just the lines, and not the scatterplot points? The trick is to tell R to make the points invisible when you create the plot; do this by setting pch to NA:
plot(height ~ age, data=Loblolly[Loblolly$Seed == 309,], pch=NA)
lines(height ~ age, data=Loblolly[Loblolly$Seed == 309,], col='green')
lines(height ~ age, data=Loblolly[Loblolly$Seed == 311,], col='violet')
Parameters like pch, lty, cex (and many others) can be set ahead of time using the par function. That way, they’ll be the same every time you call an R graphics function like points or text. Complete documentation for graphical parameters and how to use them is found in:
?par
You can query the values of any graphical parameter by calling par with the parameter name as the argument. For example, to see how large your margins are (in inches), call
par('mai')
## [1] 1.02 0.82 0.82 0.42
In script-writing we use many functions that are pre-written and loaded when R starts. But sometimes it makes sense to write your own functions that are specially tailored for your task. These user-defined functions are simply chunks of code that R will execute once you call them by name.
Usually, you want to define your functions at the very beginning of your script. The syntax is:
functionName = function(argument1, argument2,...)
{
# your code goes here
return(...)
}
“functionName” can be any name that isn’t already used by R for something else (eg you shouldn’t write a new function called plot or Loblolly because R already uses those names for something). It should be something short and descriptive.
The arguments are objects that are needed by the code inside the curly braces. This includes anything that R doesn’t already “know” by default. So for example if your function computes the area of a circle using its radius, you will need to pass in the radius as an argument, but not the value of pi (which R knows by default).
return(…) is like an outbox for your function. return(x) tells R that the function is finished, and should return the value of object x (x being something that was computed inside the curly braces). So the circle area function would look like this:
circleA = function(radius)
{
A = pi*radius^2
return(A)
}
It’s important to understand that variables defined inside a function will not be available to you outside the function. Instead we pass values out from the function - by writing an assignment to a new variable in the function call. Try the following:
# assign the value of the area of a circle to a new variable called area5
area5 = circleA(5)
print(area5)
## [1] 78.53982
By passing ‘A’ as output from the function, we are able to copy its value into a new variable called area5. Function print simply intructs R to display its value in the console output. Now try this:
# try to access variable 'A' that was defined inside the function circleA
print(A)
## Error in print(A): object 'A' not found
R gives an error message. Variable ‘A’ was defined inside the circleA function only so the symbol ‘A’ will not be recognized as a variable after the function has finished executing.
This is the concept of environments at work: an environment is space in which variables are stored and recognized by R. Every function gets its own new isolated environment to work in. This means variable definitions inside a function are not “remembered” by R once the function has finished executing, nor will they overwrite existing variables that were defined already prior to the function call.
This is actually very important, since otherwise we would have to worry about what variable names are taken inside all the various pre-written functions we use R. If you want to learn more about environments, you can read up on it here.
If you omit the return call, R will simply return the value of whatever you have in the last line of code inside the function. So this version will behave exactly the same as the above:
# function definition without return call
circleA = function(radius)
{
pi*radius^2
}
# test behavior of function
circleA(5)
## [1] 78.53982
So what’s the point of the return call? For one, I think it makes your code more readable. But it can also be useful when you have multiple cases to consider when producing your function’s output. For example:
# function definition
myVariance = function(inputValues)
{
if( is.numeric(inputValues) )
{
return( var(inputValues) )
}
else if( !is.numeric(inputValues) )
{
print('error: input was not numeric')
}
}
# test behavior of function
myVariance( c(1,2,3,4) )
## [1] 1.666667
myVariance( 'I like cookies' )
## [1] "error: input was not numeric"
This function tests the input to see if it’s numeric data or something else (using the is.numeric function), in which case it prints an error message.
Functions don’t necessarily need to include any arguments. Sometimes all the objects and information you need in your function are already built into R.In that case, just leave the round braces empty. For example, this function will compute the number of days that have elapsed since the supposed “End of the World” date in December 2012 (as predicted by the X-Files):
# function definition with no arguments
endDays = function()
{
Sys.Date() - as.Date('2012-12-22')
}
# test function behavior
endDays()
## Time difference of 1165 days
Functions don’t necessarily need to return output values either. In the Functions & Scripts section we’ll write a function that just makes plots without returning any data.
Plots are a great way to get a feel for a mathematical function that you’re not familiar with. For example suppose you want to have a look at the Gompertz population growth model:
\(y(t) = ae^{-be^{-ct}}\)
for various values of the three parameters. You could start by writing the formula as an R function, with inputs \(t\), \(a\), \(b\), and \(c\), and output equal to the value of \(y(t)\). I’ll call it gomp:
gomp = function(t, a, b, c)
{
a*exp( -b*exp(-c*t) )
}
Next, define a domain (a set of t values) to plot over, and set the parameters. Here I will fix \(a=c=1\), and look at how the function changes as we vary \(b\). Let’s start with \(b=1\)
t = seq(-3, 6, length=1000)
a = c = 1
b = 1
Now we can evaluate the Gompertz function over these \(t\)-values and plot the result:
# evaluate the gomp function over all the t values
y = gomp(t, a, b, c)
# initiate an empty plot
plot(y~t, ylim=c(0,1), pch=NA)
# add lines to show the Gompertz growth curve
lines(y~t)
This looks like a smooth curve, but in fact it’s just a whole bunch of short line segments, connecting all \(y(t)\) values that we just computed. The more \(t\) values that you start with, the smoother the curve will look (we used 1000). Notice that R has seamlessly evaluated a whole vector of \(t\) values at once when you called gomp, and it does so very quickly.
Now let’s compare plots with different values of \(b\), using a for loop:
# define which b values to look at, and assign a colour for each one
b_set = c(0.25, 1, 4, 16)
b_colours = c('red', 'violet', 'blue', 'darkblue')
# make an empty plot using the old values
plot(y~t, pch=NA)
# compute the y(t) values and add lines to the plot
for (j in 1:4)
{
y = gomp(t, a, b_set[j], c)
lines(y~t, col=b_colours[j], lwd=2)
}
# add legend
legend('bottomright', legend=b_set, lty=1, lwd=2, col=b_colours)
If you were fitting parameters to a model using numerical maximum likelihood estimation, you would start by defining a likelihood function in the same way as we’ve done with gomp. We’ll look at an example of this in the next session.
One of the really useful things about functions is that they allow you to compartmentalize your code. Often in a script you will find yourself doing the same thing repeatedly on different chunks of data. Functions simplify things by creating shorthand for repeated bits of code.
For example, in the Points, Lines, & Text section we wrote some code to add lines to a plot. If you were going to plot various combinations of the 14 seed sources, you could streamline it by first creating a special plotting function. Let’s start with a function that plots two seed sources on the same graph. The input will be the Loblolly dataset, and the number codes for the two seed sources:
Loblollyplot = function(Loblolly, seed1, seed2)
{
# generate an empty plot
plot(height ~ age, data = Loblolly, pch=NA)
# add lines
lines(height ~ age, data=Loblolly[Loblolly$Seed == seed1,], col='green')
lines(height ~ age, data=Loblolly[Loblolly$Seed == seed2,], col='violet')
# add legend
legend( 'bottomright', legend=c(seed1, seed2), lty=1, col=c('green', 'violet') )
}
Now with this function defined, we can create plots like the one at the end of Graphical Parameters with one line of code, instead of 4:
Loblollyplot(Loblolly, 309, 311)
Now that we’ve covered the basics, we can look in some more detail at how various types of plots can be created, customized and saved to disk. At the end we’ll look at R’s GIS capabilities and make a plot that combines satellite imagery and recent mountain pine beetle aerial survey data.
One of the handiest graphical parameters to know is ‘mfrow=c(x,y)’. This will divide your graphics window into an x by y grid of panels, each of which can be filled up by its own plot. For example you could call Loblollyplot four times, and arrange the results into a single plot like this:
par(mfrow=c(2,2))
Loblollyplot(Loblolly, 309, 311)
Loblollyplot(Loblolly, 305, 319)
Loblollyplot(Loblolly, 331, 329)
Loblollyplot(Loblolly, 315, 307)
The col argument recognizes the names of hundreds of colours, from primary colours to crazy things like ‘burlywood4’ and ‘lemonchiffon2’. For a complete list, type in:
colours()
Another way to specify colours is to give an RGB (red-green-blue) value. They are coded in a strange way (called hex) that computer scientists often use, with values like ‘#1B9E77’. So if you’re fussy about getting just the right colour, you could find its hex value using a tool like colorpicker.com and R will understand it.
If you specify colours by name or RGB code, remember to use quotation marks. Otherwise R will think you’re referring to a variable (rather than a string representing a colour name or RGB value) and it will give you an object not found error.
I think the best way to do colours is have R create a palette automatically. The palette generator functions rainbow, heat.colors, terrain.colors, topo.colors, and cm.colors are all available in R by default. We can have a glance at what one of them looks like by using the R function pie with a slice for each color:
n = 14
Loblollypalette = topo.colors(n)
pie( rep(1,n), col=Loblollypalette )
Notice that we can set n to any number, and the palette functions will generate that many colours for us to use. Now let’s use this palette function to automatically add colours to a plot of growth curves for several seed sources at once. I will modify the function that we defined in the Functions & Scripts section to accept any number of seed sources. This time the input vector seeds can be anything from 1 to 14 different seed sources:
Loblollyplot = function(Loblolly, seeds)
{
# tally the number of seed sources to plot and create a palette
n = length(seeds)
Loblollypalette = topo.colors(n)
# generate an empty plot
plot(height ~ age, data = Loblolly, pch=NA, xlim=c(3,28))
# for loop to add the lines one-at-a-time
for(j in 1:n)
{
seedNumber = seeds[j]
colour = Loblollypalette[j]
lines(height ~ age, data=Loblolly[Loblolly$Seed == seedNumber,], col=colour)
}
# add legend
legend('topright', legend=seeds, lty=1, col=Loblollypalette, cex=0.5)
}
Go through this code carefully and make sure you understand how the for loop works by incrementing the value of j. I’ve made some adjustments to make sure everything fits in the plot: xlim sets the range of x-values to display (a little wider than our data, to make room for the legend), and cex scales the size of the text so that all 14 legend entries will fit.
Try using this function to plot the first four seed sources. Recall that we can get a list of the seed sources (a vector, actually) using the unique function, and the notation “[1:4]” tells R to index the first four entries.
seeds = unique(Loblolly$Seed)[1:4]
Loblollyplot(Loblolly, seeds)
Now we’ll plot all the seed sources at once:
seeds = unique(Loblolly$Seed)
Loblollyplot(Loblolly, seeds)
If you want really professional looking colour palettes, check out the excellent RColorBrewer package (E. Neuwirth and Neuwirth 2007), which is based on work by Cynthia Brewer, a cartographer and professor of geography at Penn State
R has built-in functions barplot and hist for creating bar-charts and histograms. These are often useful as visual diagnostics in model development. For example we can get a quick histogram of all the heights with this function call:
hist(Loblolly$height)
or if we want a look at the empirical distribution of heights at 10 years:
year10Heights = Loblolly[Loblolly$age == 10, 'height']
hist(year10Heights)
Next I’ll demonstrate how to make a bar chart. If we were fitting a model to this data, we might be interested in whether the variances are equal across age groups. Using the by function, we can compute the sample variances of height at each age:
by(Loblolly$height, Loblolly$age, FUN=var)
## Loblolly$age: 3
## [1] 0.1628951
## --------------------------------------------------------
## Loblolly$age: 5
## [1] 0.6651654
## --------------------------------------------------------
## Loblolly$age: 10
## [1] 2.365095
## --------------------------------------------------------
## Loblolly$age: 15
## [1] 3.805794
## --------------------------------------------------------
## Loblolly$age: 20
## [1] 4.892182
## --------------------------------------------------------
## Loblolly$age: 25
## [1] 5.147607
Display these results graphically by calling barplot:
# compute the height variances by age
heightVars = by(Loblolly$height, Loblolly$age, FUN=var)
# vector of ages sampled to use as labels for the bars
ageLabels = unique(Loblolly$age)
# generate a palette of colours from the heat.col function
ageColours = heat.colors( length(ageLabels) )
# create the bar chart and label it
barplot( heightVars, names.arg=ageLabels, col=ageColours)
title(xlab='age', main='Loblolly Height Variances by Age')
In the last line I called the title function, which adds titles and axis labels to any existing plot. The parameters ‘xlab’, ‘ylab’, and ‘main’ can be specified directly in calls to plot, hist and barplot as well (without having to call title).
A similar function, called axis, can be called to add customized axes to any existing plot. This could for example be used to add a second y-axis to a plot comparing two different response variables.
The easiest way to save a plot to a file in RStudio is to use the ‘Export’ menu above the plot pane. In the dialogue that appears, you can specify the ‘Width’ and ‘Height’ values (in pixels) to control the resolution of your plot.
Sometimes you will want to automate this process. Let’s look at how to export a plot using R functions png and dev.off. For a quick and easy plot all you have to do is specify a file name with png, draw a plot (for example using the Loblollyplot function), then finish with dev.off:
png('myplot.png')
Loblollyplot(Loblolly, seeds)
dev.off()
The role of png is to initiate a special plot device that will write to a a file as soon as dev.off is called. Have a look in your working directory now. R should have saved a .PNG file of the Loblolly growth curves in there called ‘myplot.png’. The resolution low, so let’s fix this using arguments to png:
png('myplot.png', width=6, height=4, units="in", res=1200, pointsize=8)
Loblollyplot(Loblolly, seeds)
dev.off()
Here I specified the dimensions (in inches), and image resolution (1200 points per inch) that I want. The ‘pointsize’ argument is to scale the font size of the axes correctly as we change the resolution. In your own plots you may have to play with these parameters to get the best looking image.
You can put as much code between png and dev.off as you want. For example we could export a 2-panel plot like this:
# initiate the file-writing plot device
png('my2panelPlot.png', width=8, height=4, units="in", res=1200, pointsize=8)
# divide the graphic into two panes
par(mfrow=c(1,2))
# create the first plot then add a title
Loblollyplot(Loblolly, seeds)
title(main='Loblolly Growth Curves by Seed Source')
# create the second plot (barplot of variances)
heightVars = by(Loblolly$height, Loblolly$age, FUN=var)
ageLabels = unique(Loblolly$age)
ageColours = heat.colors( length(ageLabels) )
barplot( heightVars, names.arg=ageLabels, col=ageColours)
title(xlab='age', main='Loblolly Height Variances by Age')
# close (and write) the png file
dev.off()
The will save a .PNG file in your working directory with this more complicated plot. It should look something like this:
A number of different filetypes are supported in R. If you want, you may use the functions bmp, jpeg, tiff, or pdf instead of png. However I find .PNG is best for the kind of graphics generated in a statistical analysis (made up mostly of text, lines, and solid colour blocks).
R can be used to display and analyse GIS data. This is impressive given how expensive professional GIS softare can be. Of course a program like ArcGIS is more slick, and I do prefer it for exploring unfamiliar datasets. However when it comes to analysing large shapefiles and rasters in my research, I found it a lot easier to use R. And by writing scripts instead of pointing and clicking, we get reproducibility.
This segment is to get you started on working with your own shapefiles in R. We’ll start by pulling some mountain pine beetle (MPB) data from the BC Ministry of Forests website. The following code will download the 2015 aerial overview survey data into a new folder in your working directory:
# specify the URL to get the file from
FIDS2015_URL = "https://www.for.gov.bc.ca/ftp/HFP/external/!publish/Aerial_Overview/2015/final_prov_data/FHF_spatial_Feb11.zip"
# create a new folder to put the files in
dir.create('FIDS2015')
# download the file and unzip it, then delete the zip file
temp = tempfile()
download.file(FIDS2015_URL,temp)
unzip(temp,exdir = 'FIDS2015')
unlink(temp)
By the way this ministry website has tons more MPB data freely available, including annual aerial surveys going back almost 60 years.
These data are in the form of shapefiles. To open them, we’ll need the rgdal package (Bivand, Keitt, and Rowlingson 2013). Install this now:
install.packages('rgdal')
Now load the package and have a look at the MPB observations. This might take a few moments as it is a very large dataset:
# load the GIS library and open the shapefile for polygon observations
library('rgdal')
## Warning: package 'rgdal' was built under R version 3.2.3
## Loading required package: sp
## Warning: package 'sp' was built under R version 3.2.3
## rgdal: version: 1.1-3, (SVN revision 594)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.0.1, released 2015/09/15
## Path to GDAL shared files: C:/Program Files/R/R-3.2.2/library/rgdal/gdal
## GDAL does not use iconv for recoding strings.
## Loaded PROJ.4 runtime: Rel. 4.9.1, 04 March 2015, [PJ_VERSION: 491]
## Path to PROJ.4 shared files: C:/Program Files/R/R-3.2.2/library/rgdal/proj
## Linking to sp version: 1.2-2
FIDSlatest = readOGR(dsn = 'FIDS2015', layer ='FHF_Poly_2015')
# take subset with Forest Health Factor (FHF) labelled 'IBM' (for MPB)
FIDSmpb = FIDSlatest[FIDSlatest$FHF == 'IBM',]
summary(FIDSmpb)
the summary command shows us some important info including the GIS projection used, and the attributes that each of the 2951 polygons have.
It’s easy to get a basic plot:
plot(FIDSmpb)
This is a map of BC forest areas with MPB-killed pine - areas large enough for the ministry to mark as blobs on the map, rather than points. Notice the plot function understands what to do with FIDSmpb, even though it is not a vector of values like we’ve seen before (the name for this data type is SpatialPolygonsDataFrame).
But our plot isn’t very good unless we add some geography to the background. To do this, we’ll install another R package, OpenStreetMap (Fellows and Stotz 2012), that uses the OpenStreetMap project protocol to pull maps off the web (from sites like OpenStreetMap, Bing, and Mapquest) and load them into R.
install.packages('OpenStreetMap')
Let’s zoom in on the Peace region in Northeastern BC. We can provide some coordinates for a rough bounding box, and the function openmap will pull a topographical map from ESRI and plot the result:
library('OpenStreetMap')
## Warning: package 'OpenStreetMap' was built under R version 3.2.3
peaceMap = openmap(c(59,-128), c(54,-119), type='esri-topo')
plot(peaceMap)
We can overlay the MPB polygons onto this plot, using the ‘add’ argument in a new plot command. However, when working in GIS it is very important to match up the projections of each data source. The topographical map is a mercator projection, while the BC MoF data is NAD83. So first we must convert this to mercator using the spTransform function:
# convert the projection
FIDSmpb_mercator = spTransform(FIDSmpb,osm())
# plot onto our map (overlay by using add=TRUE in the second plot call)
plot(peaceMap)
plot(FIDSmpb_mercator, add=TRUE)
We can also colour the MPB affected-areas based on the severity index used by the BC MoF (trace = ‘T’, low = ‘L’, moderate = ‘M’, severe = ‘S’, very severe = ‘V’). I’ll use a satellite image as the background this time (from Bing).
# pull the satellite map
peaceMap = openmap(c(59,-128), c(54,-119), type='bing')
# define the severity levels in order
severities = c('T', 'L', 'M', 'S', 'V')
n = length(severities)
# create a palette with semi-transparent colours
FIDScolours = heat.colors(n, alpha=0.7)
# plot the map
plot(peaceMap)
# plot each severity level
for (j in 1:n)
{
FIDSsubset = FIDSmpb_mercator[FIDSmpb_mercator$SEVERITY == severities[j],]
plot(FIDSsubset, add=TRUE, col=FIDScolours[j], border=NA)
}
Last, we’ll export a high-res version with a legend and title - the one at the top of this webpage. The jpeg file format is appropriate for this kind of image (few lines or solid blocks of colour), so we’ll use jpeg instead of png.
# find the RGB code of white with 60% transparency
semiwhite = rgb(1,1,1, alpha=0.6)
# find out x and y dimensions of the map (its 'bounding box')
x1 = peaceMap$bbox$p1[1]
x2 = peaceMap$bbox$p2[1]
y1 = peaceMap$bbox$p2[2]
y2 = peaceMap$bbox$p1[2]
# initiate the plot device with dimensions matching the satellite image in peaceMap
jpeg('PeaceRiverMPB.jpg', width=816, height=823, quality=100, pointsize=18)
# plot the map
plot(peaceMap)
# plot each severity level over top
for (j in 1:n)
{
FIDSsubset = FIDSmpb_mercator[FIDSmpb_mercator$SEVERITY == severities[j],]
plot(FIDSsubset, add=TRUE, col=FIDScolours[j], border=NA)
}
# add a legend with semi-transparent background
legend('topleft', severities, fill=FIDScolours, bg=semiwhite, cex=0.9)
# add a (sub)title with semitransparent background
rect(xleft=x1, ybottom=y1, xright=x2, ytop= y1 + (y2-y1)/25, col=semiwhite)
title(sub='2015 FIDS MPB Polygon data for Peace Region', cex.sub=1.5)
# close and save the file
dev.off()
I’ve tweaked the resolution and scaling in jpeg, legend, and title until it looked nice, and I added a solid block of transparent-white to emphasize the title using the rect function. I encourage you to try changing some of these parameters to see their effect on the plot. I find this kind of thing in R involves a little trial and error, but in the end we get excellent looking plots.
Rasters are a common data-type in GIS work. In fact the maps we used as backgrounds just now are raster images. We won’t have time to get into details today, but R has great support for raster file formats in the package raster (Hijmans and Van Etten 2013).
Bivand, Roger, Tim Keitt, and Barry Rowlingson. 2013. “Rgdal: Bindings for the Geospatial Data Abstraction Library.” R Package Version 0.8-10.
Fellows, I, and JP Stotz. 2012. “OpenStreetMap: Access to Open Street Map Raster Images.” R Package Version.
Hijmans, Robert J, and J Van Etten. 2013. “Raster: Geographic Data Analysis and Modeling. R Package Version 2.1-49.” See http://CRAN. R-project. org/package= raster.
Kung, FH. 1986. “Fitting Logistic Growth Curve with Predetermined Carrying Capacity.” In ASA Proceedings of the Statistical Computing Section, 340–43.
Neuwirth, Erich, and Maintainer Erich Neuwirth. 2007. “The RColorBrewer Package.”
Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer Science & Business Media.