In practical 2 of GEOG71922 we will cover a range of spatial techniques in R that are central to working with spatial data in ecology. Today you will learn how to: 1) convert tabulated data to spatial point distributions, 2) crop data quickly and neatly to a desired study extent, 3) reclassify a raster for more focussed analysis, 4) generate background (or pseudo-absence) points for your study area, 5) assign and transform spatial projections, 6) use for loops and 7) create your own functions in R.
The main aim of the practical will be to evaluate the characteristic scale which describes how the occurrence of Sciurus vulgaris (Eurasian Red Squirrel) responds to land cover. We will use occurrence data downloaded from the National Biodiversity Network Atlas [https://nbnatlas.org/] to model the distribution of Sciurus vulgaris (point data uploaded since 2015) and use the UK Land Cover Map (2015) to characterize associated habitat types. We will evaluate this relationship through a generalized linear model (GLM) approach at a number of scales using a series of buffers around occurrence data points.
1. Read in and map the data
Copy the data from the “Sciurus” folder in the GEOG71922 area on the shared drive to a folder for week two in your own GEOG71922 area. Open R Studio and create a new script for today’s practical (Prac2.R). Remember how to set the working directory?:
setwd("P:/GEOG71922/Week2")
#Code to tell R to only use the terra installation of PROJ (library for performing projections)
plib<-Sys.getenv("PROJ_LIB")
prj<-system.file("proj",package="terra")[1]
Sys.setenv("PROJ_LIB"=prj)
Install the packages for today’s work:
#Install packages
install.packages(c("terra","sf","mapview"))
Load the libraries that we will need from the start of today’s practical (Note: you may receive some information or warnings in the R console similar to the ones below as produced on my machine; you can ignore these - they just provide information about versions and compatibility):
library(terra)
## terra 1.7.72
library(sf)
## Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
Now read in the spreadsheet containing the coordinates for occurrence locations and some other miscellaneous data columns. Notice that when you read in a .csv file the object automatically takes the class of data.frame.
sciurus<- read.csv("Sciurus.csv")
head(sciurus) # view the first six rows of the data
## NBN.Atlas.record.ID Scientific.name Name.qualifier
## 1 fd44c89f-dc54-4bc8-b242-b25f34714c67 Sciurus vulgaris NA
## 2 fbcf21ec-d168-4b2b-9644-d798e0d296b8 Sciurus vulgaris NA
## 3 fbc7523e-c291-497f-8c1f-821e7a715e7a Sciurus vulgaris NA
## 4 f9dad20d-9ea8-4b58-aafd-cdeee990d703 Sciurus vulgaris NA
## 5 f55a1c93-8b8d-4f9d-88be-b81561231bc9 Sciurus vulgaris NA
## 6 e2d32bf2-d58d-4cfc-ae48-ac5668d7842f Sciurus vulgaris NA
## Common.name Species.ID..TVK. Taxon.Rank Occurrence.status
## 1 Eurasian Red Squirrel NBNSYS0000005108 species present
## 2 Eurasian Red Squirrel NBNSYS0000005108 species present
## 3 Eurasian Red Squirrel NBNSYS0000005108 species present
## 4 Eurasian Red Squirrel NBNSYS0000005108 species present
## 5 Eurasian Red Squirrel NBNSYS0000005108 species present
## 6 Eurasian Red Squirrel NBNSYS0000005108 species present
## Start.date.year Locality OSGR Latitude Longitude Coordinate.uncertainty_m
## 1 2015 NO64G 56.57747 -2.60386 2000
## 2 2015 SD31R 53.60975 -2.95368 2000
## 3 2015 NS99Z 56.17291 -3.62832 2000
## 4 2015 SD31W 53.60999 -2.92346 2000
## 5 2015 SD31V 53.59201 -2.92306 2000
## 6 2015 SD30N 53.55558 -2.98265 2000
## Identification.verification.status Vitality
## 1 Unconfirmed
## 2 Accepted - considered correct
## 3 Unconfirmed
## 4 Accepted - considered correct
## 5 Accepted - considered correct
## 6 Accepted - considered correct
#check the class of the loaded .csv file.
class(sciurus)
## [1] "data.frame"
We will do a small amount of data cleaning next, to ensure there are no NA values in our coordinate data (this would cause later analytical steps to crash). Here we use the subset() function and !is.na() to select all rows with complete values for latitude (‘!’ is the logical NOT operator in R, so !is.na() keeps rows that are not NA). In addition, let’s remove all records with coordinate uncertainty greater than 100 m.
#subset the data to only include points with complete coordinates.
sciurus<-subset(sciurus,!is.na(sciurus$Latitude))
#remove all points with uncertainty > 100m
sciurus<-subset(sciurus, sciurus$Coordinate.uncertainty_m<100)
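As a quick optional check, you can confirm that the cleaning has worked before moving on:
#optional check: no NA coordinates or high-uncertainty records should remain
sum(is.na(sciurus$Latitude)) #should now be 0
max(sciurus$Coordinate.uncertainty_m) #should now be below 100
nrow(sciurus) #number of records remaining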
2. Create a vector layer from coordinates
Next we make a vector layer from our sciurus object. For this we need two things: a set of coordinates (in this case longitude and latitude) and a coordinate reference system (CRS), which tells R what coordinate system our coordinates are recorded in. For now we are working in the WGS84 system. The full string that describes the projection for WGS84 is “+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0”. Fortunately, R allows us to pass the epsg code for WGS84 instead. epsg codes are a kind of shorthand for projection strings and the code for WGS84 is “epsg:4326”. So all we have to do is build a data frame of coordinates, convert it to a SpatVector of points with terra’s vect() function (using the geom argument to name the coordinate columns), and then assign the epsg code to the new layer with the crs() function.
Note the use of the ‘$’ symbol below for accessing columns of dataframes.
#make spatial points layer
#create a data frame of coordinates (x = longitude, y = latitude)
sciurus.latlong<-data.frame(x=sciurus$Longitude,y=sciurus$Latitude)
#use the coordinates to create our SpatVector points object
sciurus.sp<-vect(sciurus.latlong,geom=c("x","y"))
#assign our desired crs (WGS84) using its EPSG code
crs(sciurus.sp)<-"epsg:4326"
Type ?crs directly into the console and press enter to bring up information about this function in the help panel. The crs() function is very versatile and allows us to get or set the crs of any SpatRaster or SpatVector object, as well as creating character objects that hold projection strings that can then be applied to any number of objects. The latter is important when you have data without a formal projection (usually from a text or csv file, as in our case) and want to assign one from scratch. Under the hood, terra uses the PROJ and GDAL libraries for handling projections. You can find more information on epsg codes here [http://www.epsg-registry.org/] and a handy overview of using CRS in R here [https://tinyurl.com/y65j2o94]
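As a quick optional illustration (the exact output will vary with your terra version), you can inspect the CRS we have just assigned, either as the full WKT string or as a short summary that includes the EPSG code:
#optional: inspect the CRS of our points layer
crs(sciurus.sp) #full WKT representation
crs(sciurus.sp, describe=TRUE) #short summary including the EPSG authority code (4326)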
Now plot our points:
plot(sciurus.sp)
Now let’s have a look at a handy tool that allows us to work interactively with spatial data in R - the ‘mapview’ library. This allows us to look at our data and pan around their extent against base maps, much as you would in ArcGIS. Run the following code then have a play around with the options. When you’re ready, move on to the next step. We will focus our analysis around the Cairngorms area of Scotland - see if you can locate it on the map.
library(mapview)
mapview(st_as_sf(sciurus.sp))#Note: you may receive a warning message about Feature IDs when you run this command but you can ignore this for now.
## Warning in cbind(`Feature ID` = fid, mat): number of rows of result is not a
## multiple of vector length (arg 1)
We have a lot of occurrence data and although it would be nice to consider all of these points, that would be computationally too intensive for our practical session so let’s cut them down to something we can work with by clipping them to a much smaller extent. To do this we use the crop() function and create a new extent:
#First set the extent to something workable
studyExtent<-ext(-4.2,-2.7,56.5,57.5) #coordinates in the order: min Long, max Long, min Lat, max Lat
#now crop our points to this area
C<-crop(sciurus.sp,studyExtent)
3. Mapping land-cover
Now let’s bring in our land cover data to see where our points fit on the map. You will probably receive a warning message on loading the raster layer. This relates to some changes and updates to the PROJ library which may not yet be fully implemented in certain packages (such as the older raster package). These deprecated versions still work fine for now, so this is nothing to worry about.
#read in our raster data
LCM<-rast("LCMUK.tif")
The raster is set to OSGB, so our two data sets have different projections. Let’s fix that with the project() function. Here we create an object containing the crs information for LCM inside the project() function and pass it to our points layer:
sciurusFin<-project(C,crs(LCM))
Note that above we did not need to use any character strings to define our projection. This is down to the versatility of the crs() function whereby, in this case, we can simply pass the crs of the LCM layer directly to the project() function.
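As an optional sanity check (this assumes a reasonably recent version of terra, which provides the same.crs() helper), you can confirm that the points and the raster now share the same coordinate reference system:
#optional: confirm the two layers now share the same CRS (should return TRUE)
same.crs(sciurusFin, LCM)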
Before we continue, let’s crop the land cover data to the extent of our points data (plus 5km to allow space for the buffers we will create in subsequent steps).
sciurusCoords<-crds(sciurusFin) #get the coordinates of the reprojected points
x.min <- min(sciurusCoords[,1]) - 5000 #expand the extent by 5 km in each direction
x.max <- max(sciurusCoords[,1]) + 5000
y.min <- min(sciurusCoords[,2]) - 5000
y.max <- max(sciurusCoords[,2]) + 5000
extent.new <- ext(x.min, x.max, y.min, y.max) #build a SpatExtent from these values
LCM <- crop(LCM$LCMUK_1, extent.new) #crop the land cover raster to the new extent
plot(LCM)
plot(sciurusFin,add=TRUE)
Now we are ready to start characterizing the locations of our points with the LCM layer.
4. Generating (pseudo-)absence points
Our analysis will be based on attempting to predict the locations of our points relative to the underlying habitat. In order to do this, however, we also want to characterize areas where Sciurus vulgaris doesn’t occur. We could try without absence data (and some algorithms such as Maxent and Bioclim are built for this) but it is more robust to provide absence as well as presence data in our modelling. Since we don’t have formal absence point locations, we will simulate our own by creating random background points (also called pseudo-absence data). For this we will use the spatSample() function. This function draws a random sample of locations from the raster and, with as.points=TRUE, returns them directly as a SpatVector of points.
Note that in practice we might create many more background points to get a representative sample but we will settle for 1000 for this practical to reduce processing time later on.
set.seed(11)
back.xy <- spatSample(LCM, size=1000,as.points=TRUE)
#plot the background points (red) over the land cover map along with our presence points
plot.new()
plot(LCM)
plot(sciurusFin,add=T)
plot(back.xy,add=TRUE, col='red')
5. Characterizing point locations
Now we will use some functions that we learned last week to extract land cover values to our points and plot them to compare presence and absence locations.
eA<-extract(LCM,back.xy)
eP<-extract(LCM,sciurusFin)
A very useful function for cross-tabulating data (i.e. summing counts per category of interest) is the table() function. Let’s use this to tabulate counts per landcover type for our presence and absence points.
#calculate point frequency per landcover category (headings are landcover category codes - see LCM documentation for descriptions)
table(eA[,2])
##
## 0 1 2 3 4 7 9 10 11 12 14 16 20 21
## 2 39 129 101 122 129 328 53 26 48 13 1 2 7
table(eP[,2])
##
## 1 2 3 4 7 9 10 11 14 21
## 102 88 23 48 27 1 1 1 1 24
Use the range of values provided by table(eA) and table(eP) to set the x axes of two histograms comparing presence and absence locations. Note that ylim will be c(0,1) on this occasion because we set the ‘freq’ argument in the hist() function to FALSE, so that the values on the y axis of our histograms represent proportions rather than counts (this is important as the number of points in our presence and absence data are markedly different).
par(mfrow=c(1,2))
hist(eA[,2],freq=FALSE,breaks=c(0:21),xlim=c(0,21),ylim=c(0,1))
hist(eP[,2],freq=FALSE,breaks=c(0:21),xlim=c(0,21),ylim=c(0,1))
The maximum value on our y axis seems to be about 0.4, so change ylim to ylim=c(0,0.5) for both histograms and run again (remember the c() function for combining elements - here the min and max values for our axes).
hist(eA[,2],freq=FALSE,breaks=c(0:21),xlim=c(0,21),ylim=c(0,0.5))
hist(eP[,2],freq=FALSE,breaks=c(0:21),xlim=c(0,21),ylim=c(0,0.5))
From these outputs we can see that the proportion of points occurring in broadleaf woodland (class 1 in the UK Land Cover Map - see documentation) is much higher for presence points than for background points. The focus for the rest of the practical will be determining to what extent this response to broadleaf woodland differs as a function of scale and, if it does, what the characteristic scale of this effect is (so that we can operationalize this optimum scale in future analyses, for example for species distribution modelling).
Next let’s create a single data frame to hold both our presence and absence data so that we can start to carry out our scale analysis on the one dataset.
Abs<-data.frame(crds(back.xy),Pres=0)
head(Abs)
## x y Pres
## 1 357087.5 747287.5 0
## 2 288062.5 774962.5 0
## 3 319062.5 791362.5 0
## 4 284137.5 836537.5 0
## 5 286387.5 762612.5 0
## 6 292987.5 825812.5 0
Pres<-data.frame(crds(sciurusFin),Pres=1)
head(Pres)
## x y Pres
## 1 302144.9 826416.5 1
## 2 348875.5 780667.0 1
## 3 313613.0 791183.4 1
## 4 302332.6 820625.2 1
## 5 291470.3 822625.6 1
## 6 293429.5 818944.1 1
sciurusData<-rbind(Pres,Abs) # bind the two data frames by row (both dataframes have the same column headings)
#inspect
head(sciurusData)
## x y Pres
## 1 302144.9 826416.5 1
## 2 348875.5 780667.0 1
## 3 313613.0 791183.4 1
## 4 302332.6 820625.2 1
## 5 291470.3 822625.6 1
## 6 293429.5 818944.1 1
tail(sciurusData)
## x y Pres
## 1311 346362.5 779337.5 0
## 1312 322587.5 779687.5 0
## 1313 273387.5 836312.5 0
## 1314 306137.5 849837.5 0
## 1315 271637.5 803512.5 0
## 1316 313462.5 802687.5 0
class(sciurusData)
## [1] "data.frame"
6. Re-classifying the raster
In order to explore how occurrence responds to broadleaf forest cover, we first need to extract a layer from our LCM raster consisting only of cells assigned to this land cover type in the raster (classification ‘1’). We can do this through a simple reclassification of the raster. First we convert the values of the raster layer to factors. In R, factors represent categories (categorical data). You will hopefully have realized that our land cover map is a categorical rather than continuous raster, meaning that the values represent land cover types (rather than some continuous variable such as elevation or slope - we’ll look at these later in the course).
#access levels of the raster by treating them as categorical data ('factors' in R)
LCM<-as.factor(LCM)
levels(LCM)
## [[1]]
## ID LCMUK_1
## 1 0 0
## 2 1 1
## 3 2 2
## 4 3 3
## 5 4 4
## 6 5 5
## 7 7 7
## 8 8 8
## 9 9 9
## 10 10 10
## 11 11 11
## 12 12 12
## 13 13 13
## 14 14 14
## 15 15 15
## 16 16 16
## 17 17 17
## 18 18 18
## 19 19 19
## 20 20 20
## 21 21 21
Looking at this list we can create a binary raster (assigning 1 for broadleaf woodland - the second element in the list - and 0 for everything else).
To reclassify a raster we create a two-column matrix where the first column contains the original cell values and the second column contains our new values (this is similar to how the reclassify function works in ArcGIS only here we provide the matrix ourselves).
#create a vector object called reclass holding the new values (1 for broadleaf woodland, 0 for everything else)
reclass <- c(0,1,rep(0,19))
# combine with the LCM categories into a matrix of old and new values.
RCmatrix<- cbind(levels(LCM)[[1]],reclass)
RCmatrix<-RCmatrix[,2:3]#we only need columns 2 and 3
RCmatrix$LCMUK_1<-as.numeric(RCmatrix$LCMUK_1)#convert character to numeric
#Use the classify() function to assign new values to LCM with our reclassification matrix
broadleaf <- classify(LCM, RCmatrix)
Inspect the new data visually:
plot(broadleaf)
plot(sciurusFin,add=TRUE)
In the class practical we characterised the “neighbourhood” of each data point using buffers. This is one approach to neighbourhood analysis, which is a powerful tool in the GIS toolbox and one of the ways in which spatial analysis stands apart from other forms of statistics (and science in general). Another way of considering, and summarizing, neighbourhoods is to use focal analysis, which you have probably come across if you’ve taken the GIS core module on this programme.
In R, with the “terra” package, we can use the focal() function to carry out this kind of analysis.
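As a minimal sketch of the idea (using the broadleaf raster created above and the illustrative object name broadleaf3x3), a simple 3 x 3 moving-window mean converts the binary broadleaf layer into the proportion of woodland in each cell’s immediate neighbourhood:
#minimal focal() example: proportion of broadleaf cover in a 3 x 3 cell neighbourhood
broadleaf3x3 <- focal(broadleaf, w=3, fun=mean) #w=3 gives a 3 x 3 window; the mean of a 0/1 raster is a proportion
plot(broadleaf3x3)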
Alternative calculation of woodland cover with diagnostics (not in the week 2 practical)
Although our results from the practical gave us some confidence in the characteristic scales, with strong signals at 100m and 1700m, we cannot be sure about the variation that may occur within our buffers. With buffers tested at 100m intervals, and a fairly smooth increase in log likelihood after 300m, we can be fairly confident that the response is well explained at 100m and at 1700m. However, we can test whether the response to broadleaf woodland is stronger at smaller distances than at larger ones within this larger buffer size by applying a Gaussian filter. A Gaussian filter follows the bell shape of a normal distribution and, within a specified radius, applies more weight to locations closer to the focal cell than to those further away. We can create a Gaussian filter using the focalMat() function in the terra package.
The focalMat() function creates a matrix of weights that defines a moving window to be passed over every cell in the broadleaf raster. The size of the window is one of the function arguments and is analogous to the buffer radius we used in the practical.
The degree to which the filter “favours” cells near to the focal cell is determined by the sigma argument - the sigma is basically the standard deviation of a normal distribution which determines the shape of the filter - smaller sigma = more weight given to nearby cells. The other argument that the focalMat() function takes is the size of the buffer to be used (1800m in our case).
Once the focal window is established (with focalMat()) we use the focal() function to create a new raster in which each cell is the result of some statistic of the neighbourhood defined by the focalMat() object. Here we will use ‘sum’ as the statistic, so that all woodland cells (with a value of 1) are multiplied by the corresponding values in the focalMat object (the Gaussian filter), where nearby cells carry higher weights than those further away. Note that the weights in the focal matrix sum to 1, so the result for each cell will range from 0 to 1, where 0 = no woodland in the moving window and 1 = only woodland in the moving window (so, a proportion basically).
We’ll perform a simple diagnostic analysis on our data. First we will use the “circle” option in the focalMat() function. This behaves exactly like a regular buffer and gives equal weight to all cells inside a given distance (which, if you think about it, is what a regular buffer does, as used in the practical). Then we will use the “Gauss” option in the focalMat() function to give locations closer to the focal cell more weight (importance) than those further away. An advantage of using focal analysis is that we can also easily use a much greater number of background points, because we can simply use the extract() function to get the data from the new focal raster for our modelling without the need to make buffers around each point as above (so it’s quicker). For now, though, we’ll use the same number of background points as above so that we can compare log likelihood values (adding more sampling points would affect the log likelihood statistic and prevent a fair comparison).
We can look at the relative weights created by the different sigma values simply by typing the name of the object we create. For a radius of 50 metres and a very low sigma of 1 (meaning cells closer to the focal cell have much more weight) we would specify the following (you can experiment with the sigma values to look at the effect this has):
fw.test <- focalMat(broadleaf, c(1,50),type= 'Gauss')
fw.test
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0.000000e+00 0.000000e+00 0.000000e+00 0
## [2,] 0 3.680856e-272 1.918556e-136 3.680856e-272 0
## [3,] 0 1.918556e-136 1.000000e+00 1.918556e-136 0
## [4,] 0 3.680856e-272 1.918556e-136 3.680856e-272 0
## [5,] 0 0.000000e+00 0.000000e+00 0.000000e+00 0
#you can type ?focalMat() into the console to get information on the other options for creating focal weights.
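To see the effect of sigma for yourself, you could (optionally) build a second filter with a larger sigma over the same 50 m radius and compare it with fw.test (fw.test2 below is just an illustrative name). Note also that the weights in each filter sum to (approximately) 1, which is why the focal sum used later behaves like a proportion:
#optional experiment: a wider filter (sigma = 25) spreads the weight more evenly across the window
fw.test2 <- focalMat(broadleaf, c(25,50), type= 'Gauss')
fw.test2
sum(fw.test) #both filters should sum to (approximately) 1
sum(fw.test2)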
Now let’s run these two models using our 1800m distance taken from the practical (for the regular circle buffer - without FAB correction). For the Gaussian kernel we will try a sigma value of 100, giving more weight to nearby cells. This step can take several minutes to run so please be patient.
################# "Circle" option (i.e. regular buffer) ##################################################
fwCircle <- focalMat(broadleaf, d=1800,type= 'circle')#circular window around each cell of the broadleaf layer, applying our 1800m distance ("d")
focalBroadleaf<-focal(broadleaf,fwCircle,sum) #the focal() function now creates the raster
eA_circle<-extract(focalBroadleaf,back.xy)
eP_circle<-extract(focalBroadleaf,sciurusFin)
absCircle<-data.frame(eA_circle[,2],Pres=0)
presCircle<-data.frame(eP_circle[,2],Pres=1)
names(absCircle)<-c("area","Pres")
names(presCircle)<-c("area","Pres")
sciurusCircle<-rbind(presCircle,absCircle)
head(sciurusCircle)
## area Pres
## 1 0.23803953 1
## 2 0.03706668 1
## 3 0.05812450 1
## 4 0.07715042 1
## 5 0.06089527 1
## 6 0.21211748 1
glmCircle <- glm(Pres ~ area, family = "binomial", data = sciurusCircle)
LL_Circle<-as.vector(logLik(glmCircle))
print(as.numeric(LL_Circle))
## [1] -568.6407
###############################Gaussian approach
fwGauss <- focalMat(broadleaf, c(100,1800),type= 'Gauss')#buffer around each cell of the broadleaf layer, applying our 1800m distance and the sigma
focalBroadleaf<-focal(broadleaf,fwGauss,sum) #the focal() function now creates the raster
plot(focalBroadleaf)
eA_Gauss<-extract(focalBroadleaf,back.xy)
eP_Gauss<-extract(focalBroadleaf,sciurusFin)
absGauss<-data.frame(eA_Gauss[,2],Pres=0)
presGauss<-data.frame(eP_Gauss[,2],Pres=1)
names(absGauss)<-c("area","Pres")
names(presGauss)<-c("area","Pres")
sciurusGauss<-rbind(presGauss,absGauss)
head(sciurusGauss)
## area Pres
## 1 3.685497e-01 1
## 2 2.934999e-07 1
## 3 2.874158e-01 1
## 4 2.158703e-01 1
## 5 8.146309e-03 1
## 6 1.018067e-16 1
glmGauss <- glm(Pres ~ area, family = "binomial", data = sciurusGauss)
LL_Gauss<-as.vector(logLik(glmGauss))
print(as.numeric(LL_Gauss))
## [1] -587.0268
We can see from the comparison of the two models that the circle approach (regular buffer) performs better than the Gaussian kernel (its log likelihood is higher, i.e. closer to zero), suggesting that our original results can be trusted and that woodland cells closer to our presence points are not necessarily more important than those at greater distances inside our 1800 m buffer.
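Because the two models have the same number of parameters, an equivalent (optional) way to formalise this comparison is with base R’s AIC() function, where the lower value indicates the better-supported model:
#optional: AIC comparison of the two models (lower = better)
AIC(glmCircle, glmGauss)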