GEOG71922 Week 0. Introduction to the R environment: exploring spatial relationships with simulated data

Author

In this pre-course practical for GEOG71922 we will familiarize ourselves with the R language and the R Studio working environment and consider some issues related to resolution, resampling and spatial data aggregation. We will look at these ideas in the Week One lecture and some of you may have come across similar ideas in GEOG60951. Here we will experiment with some simulated data. So, we won’t need any data for today’s practical as we will make it ourselves. This is relatively easy to do in R and also helpful in general for exploring concepts in spatial ecology.

Before opening R Studio, create a working directory (a folder) on your home directory for GEOG71922. In this folder create another for Week1 so that the file path reads: “my home directory/GEOG71922/Week0” (replacing “my home directory” with your actual home directory).

Open RStudio from the Start Menu. Once open press Ctrl+shift+N to open a new R script (you can also do this from the main File menu under “New File”).

Press ctrl+S to save your script as “Prac0.R” to the Week1 folder.

Familiarise yourself with the R Studio IDE (“Integrated Development Environment”). The upper left panel is where your new script will appear and where we will do most of our coding. The upper right panel gives information about R objects that are currently available in your work space (these are all either items that you will have loaded from files or created inside R). The bottom left panel is the console where results and warnings/errors are reported (you can also enter code directly to the console, for example to access information on functions by typing “?function”). The bottom right panel consists of several sub-panels, the most relevant for us being the “plots” panel where graphics are produced and the “help” panel where information on functions and packages etc. is displayed.

Now, be sure to set the working directory of your session to your newly created folder.

On this course, when you see a line of code in a grey box, that is an indication that you need to do something! To begin with I would advise to copy and paste the code provided into the upper left panel to reduce the chance of coding errors. Once you become more familiar with the console and how it works, then you can start typing the code yourself to create the script.

Running Code

To run code, you can run each line individually by pressing Ctrl+Return or by clicking the Run button on the toolbar. Alternatively you can run whole blocks of code by selecting the code with the cursor (as you would in MS Word) and then pressing Ctrl+Return or clicking Run.First, let’s set the working directory


setwd("my home directory/GEOG71922/Week1")

Once our working directory is set we will look at a few of the fundamental components of the R language that we will commonly use on the course.

Let’s start with the traditional first command.

print("Hello world")

[1] "Hello world"

. Basic objects in R (vectors, matrices, data frames)

R is an object oriented language designed for statistical computing and graphics. Given these emphases it is easy to see how R has become popular within the discipline of spatial ecology, modelling and visualisation. However, many of the operations that we will carry out are based on familiar logical data arrangement and assessment and much of the content of this course will rely on the application of statistical prodedures to data frames (you can think of data frames as tables much like those formed with excel or spss or similar desktop packages). The ability to render and analyse spatial data in this kind of format (and display results graphically), automate complex functions and manage them in an intuitive IDE makes the R “environment” a powerful option for spatial data analysis.

The main “building blocks” for working with data in R are the objects which store our information of interest. Objects come in various “classes” such as vectors, data frames and lists as well as more fundamental GIS objects such as raster layers and spatial points layers.

Vectors in R consist of a series of elements of the same type and are generally formulated with the c() function. Creating an object is done using “object<-” such that creating a vector of integers called x is done as follows:

x<-c(1,2,3,4,5,6,7,8,9)

x

[1] 1 2 3 4 5 6 7 8 9

Note that the : operator can be useful when creating vectors:

x<-c(1:9)

x

[1] 1 2 3 4 5 6 7 8 9

A typical operation for cleaning and otherwise arranging data into the correct format for analysis (which is the bulk of what we spend our time on as spatial scientists) involves modifying vectors, data frames and other classes of spatial data (i.e. data linked to geometries).

This commonly involves using square brackets to access elements of vectors (or rows/columns of matrices and data frames). For example we can return the first three elements of vector ‘x’:

x<-x[1:3]

x

[1] 1 2 3

A basic extension of this operation can be applied to a simple matrix.

As an example, first create a binary vector using the rep() function.

v<-c(rep(0,5),rep(1,5))

class(v)

[1] "numeric"

#inspect

v

 [1] 0 0 0 0 0 1 1 1 1 1

The above code contains two functions (one within the other): the function c() combines two repetitions of length:5 - one repeating ‘0’ five times and one repeating ‘1’ five times.

Then we can create, from the ten elements in ‘v’, a two column matrix that will place 0s and 1s in separate columns.

m<-matrix(v,ncol = 2)

m

     [,1] [,2]
[1,]    0    1
[2,]    0    1
[3,]    0    1
[4,]    0    1
[5,]    0    1

For objects in >=2 dimensions we use [x,y] to select rows(x) and columns(y) respectively. In the example below we select rows 1 to 3 of matrix m and, by leaving the column selection empty, all columns are selected by default.

s<-m[1:3,]

s

     [,1] [,2]
[1,]    0    1
[2,]    0    1
[3,]    0    1

Let’s create another vector ‘t’, make this the third column of a new data frame and provide column headings.

t<-c(2,2,2)

df<-data.frame(s,t)

colnames(df)<-c("One", "Two", "Three")

df

  One Two Three
1   0   1     2
2   0   1     2
3   0   1     2

####check the class of object df

class(df)

[1] "data.frame"

Having a basic grasp of this logic is a good start to working with data in R. Note the use of “#” in the above code to make comments within your script. You can’t underestimate the usefulness of annotating your code with comments both for your own records and to allow others to understand the purpose of your code.

For now we will move onto working with typical spatial data objects to explore some basic functions.

Working with spatial data objects

We will look at how we can explore some basic spatial relationships by creating our own spatial objects in R. We will need to load the “Raster” package and associated libraries from CRAN. In the below code we then set the seed for our random number generator (so that we all end up with the same results) and create a raster object with 12 by 12 cells and with values derived from a poisson distribution (with a mean for all values of 7). First, we can call the terra package and associated libraries with the ‘library()’ function.


install.packages("terra")

library(terra)

Warning: package 'terra' was built under R version 4.5.2

terra 1.8.93

Next, we’ll create a raster object from scratch using the “rast” function in terra:

set.seed(16)

toyData <- rast(nrows=12, ncols=12, xmin=0, xmax=12, ymin=0, ymax=12)

toyData[] <- rpois(ncell(toyData), lambda = 7)

Once you have created the raster object, you can view it using the general plot() function and use the “text()” function to display our simulated values. Remember you can get more information on any function from a loaded package by calling ?function from the console (e.g. “?plot”).

plot(toyData, axes=FALSE, box=FALSE)

text(toyData, digits=2)

At this stage save your script Prac0.R (Ctrl+S)

For a little more versatile plotting, “par(mrfow)” is a useful function that allows you to control the way plots are presented. In this case we have elected to have our plots arranged on one row and two columns (i.e. side by side). Lets create another raster, using the same method, and plot them side by side in the graphics pane.

toy2 <- toyData

toy2[] <- 1:ncell(toyData)

toy2[] <- rpois(ncell(toyData), lambda  =7)

par(mfrow=c(1,2))

plot(toyData, axes=F, box=F)

text(toyData, digits=2,cex=0.7)



plot(toy2, axes=F, box=F)
text(toy2,digits=1,cex=0.7)

Now we have two rasters with different value distributions. We can check this by running a correlation test with the values() function to get the values and the cor() function to look at the Pearson correlation between the two rasters. We’ll plot these values, again using the base plot() function.

par(mfrow=c(1,1))

cor(values(toyData),values(toy2))

          lyr.1
lyr.1 0.2410789

plot(values(toyData),values(toy2))

So, only a very weak correlation observed between our two raster data sets.

Note that “plot(values(toyData),values(toy2))” was a shortcut and we could equally have created two objects to hold the values before plotting such as:

v1<-values(toyData)
v2<-values(toy2)

plot(v1,v2)

Now we’ll see if aggregating these raster values by increasing the cell size has any effect on this relationship. We will use the below code to aggregate both our data sets, using 3 as the aggregation factor (number of times by which the cell resolution will be increased) and ‘mean’ as the function to compute our new value for each cell. The factor ’3’ will result in our 12x12 raster being transformed to a 4x4 raster (i.e. 12/3 - so, if the original resolution of the raster was 1m, then the aggregated raster would now be 3m).

toy_mean <- aggregate(toyData, fact=3, fun=mean)#mean value


toy2_mean <- aggregate(toy2, fact=3, fun=mean)#mean value

Check the resolution before and after aggregation:

res(toyData)

[1] 1 1

res(toy_mean)

[1] 3 3

Now, let’s plot these two rasters again, check the correlation after aggregation and draw a best fit line to the data points:

par(mfrow=c(1,2))

plot(toy_mean,axes=F,box=F)
text(toy_mean)
plot(toy2_mean,axes=F,box=F)
text(toy2_mean, digits=2,cex=0.7)

par(mfrow=c(1,1))

#create objects to plot best fit line

valuesToy1mean<-values(toy_mean)

valuesToy2mean<-values(toy2_mean)

plot(valuesToy1mean,valuesToy2mean) 

#Access linear regression function "lm()" to calculate the best fit line (ordinary least squares) and add this to the existing plot using the "abline()" function

abline(lm(valuesToy1mean~valuesToy2mean),col="red")

cor(values(toy_mean),values(toy2_mean))

          lyr.1
lyr.1 0.6703298

Now we can see that aggregating the raster object brings them closer together in terms of their distributed values. In essesnce what we have done is reduce the variance of each set of values whilst holding the mean constant. We can test this by fetching the mean and variance for our original and aggragated data sets using the global function in the terra package:

global(toyData, mean)

      mean
lyr.1    7

global(toyData,var)

         lyr.1
lyr.1 6.797203

global(toy2_mean,mean)

          mean
lyr.1 6.972222

global(toy2_mean,var)

          lyr.1
lyr.1 0.5259259

See how the variance reduced for our data after aggregation? Although not always the case, altering our data in such a way (for which there may be a good reason - for issues of compatibility for example) can affect our interpretation of the data.

We can also “disaggregate” data, which involves increasing the resolution (reducing cell size). Here’s how:

toy_dis1<-disagg(toyData,fact=3)
toy_dis2<-disagg(toy2,fact=3)

par(mfrow=c(1,2))
plot(toy_dis1,axes=F,box=F)
plot(toy_dis2,axes=F,box=F)

Now check the correlation for these disaggregated rasters:

cor(values(toy_dis1),values(toy_dis2))

          lyr.1
lyr.1 0.2410789

#remove all plots
dev.off()

null device 
          1

Notice that the correlation statistic is back to the value we started with (i.e. for toyData and toy2 rasters). This is because the values for our disagregated rasters have not altered from the original, there are simply more cells for each value (and values are in the same relative position between rasters as in the original comparison). We can polygonize these disagregated raster layers to demonstrate this.

plot(toyData,axes=FALSE,box=FALSE)
text(toyData)

plot(toy_dis1,axes=FALSE,box=FALSE)
text(toy_dis1, digits=2,cex=0.5)

plot(as.polygons(toy_dis1), add=TRUE, border='gray50', lwd=1)#add polygons to original plot

Alternatively, we can disaggregate by using a bilinear interpolation, interpolating values based in a “rook” neighbourhood (4 nearest cells) to create new values.

toy_dis1_bilinear<-disagg(toyData,fact=3, method='bilinear')
toy_dis2_bilinear<-disagg(toy2,fact=3, method='bilinear')

plot(toy_dis1_bilinear)

plot(toy_dis2_bilinear)

cor(values(toy_dis1_bilinear),values(toy_dis2_bilinear))

          lyr.1
lyr.1 0.2339434

We now have a new distribution of values due to the smoothing effect of the interpolation, and a correlation statistic closer to that of the original layers.

global(toy_dis1_bilinear,mean)

      mean
lyr.1    7

global(toy_dis1_bilinear,var)

         lyr.1
lyr.1 3.827904

global(toy_dis2_bilinear,mean)

          mean
lyr.1 6.972222

global(toy_dis2_bilinear,var)

         lyr.1
lyr.1 3.927747

Plotting this new pair of values together next to our original raster objects, we can see that the effect of the bilinear interpolation is to cause values to cluster around the mean. It is important therefore to consider the effects of such methods if subsequent analyses are to be planed.

par(mfrow=c(1,2))

plot(values(toyData),values(toy2))
plot(values(toy_dis1_bilinear),values(toy_dis2_bilinear))

For the final steps of the practical, we will generate some point data from scratch and use these to extract values from some of our raster objects. This will be important for later practicals as much of the techniques in spatial ecology rely on extracting information from raster layers to point locations.

For this we will create a matrix of coordinate pairs to overlay onto our raster. Our rasters are basically unprojected cells numbering 1 to 12 so we need to create a Spatial Points layer that takes coordinates with all possible combinations of pairs for this range of numbers.

Fortunately there is a function available in R - “xy.grid()” from the ‘sfsmisc’ - library that will do just this. We simply need to provide the range of numbers for x and y and xy.grid will do the rest.

To be sure we produce points at cell centres however we need to modify this range slightly. Note that the origin of our raster is 0,0 (the top left corner of the raster object as the raster package reads it) so to get the centre of this cell we need to start at 0.5,0.5. Likewise the centre of our final cell (bottom right) is 11.5,11.5. We therefore create our matrix of coordinates as follows:


# points from scratch

install.packages("sfsmisc")

library(sfsmisc)

Warning: package 'sfsmisc' was built under R version 4.5.2

#Create a matrix of XY coordinates with unique combinations of numbers 0.5 - 11.5 at 1 unit intervals.

grid<-xy.grid(0.5:11.5, 0.5:11.5)

sp = vect(grid)
plot(toyData)
plot(sp,add=TRUE)

Now we can extract our raster cell values to the overlaid points and inspect their distribution. Let’s extract the vaues from our aggregated rasters (toy_mean and toy2_mean).

E1<-extract(toy_mean,sp)
E2<-extract(toy2_mean,sp)

Now plot their values together. Note here the use of the $ symbol to access columns in a data frame. The result of the extract function is a data frame where the first column is the point ID and the second column is the extracted data. We want the second column (with the default name “lyr.1” for an un-named raster)

plot(E1$lyr.1,E2$lyr.1)

So here we basically reproduced the correlation analysis but based on the values having been extracted to point data. Note that there are more values in the scatter plot we just produced than in the one we created earlier for toy_mean and toy2_mean raster values but the plots look exactly the same. This is because the number of UNIQUE values is the same in both cases. We simply duplicated the same 4x4 cell values using our 12x12 grid of points.

You’ve reached the end of this introductory practical, well done! We have explored some basic functionality in the R environment for working with spatial objects, extracting information and manipulating data into meaningful formats. We have also seen that, although manipulation of spatial data is an essential part of spatial ecology and GIS in general, it is not without certain limitations. It’s important to keep in mind the tendency for spatial manipulation to significantly alter our interpretation of the data as you move through this course.