The thinking behind this tutorial

This practical is designed to have an explanatary text, together with code examples. Note that all code examples have a light yellow background, and are boxed. Output from R is also shown, and text output is also boxed. Graphical output is shown ‘in line’. The idea is to copy the text in the boxes into a running version of R - you can use the output to see whether you are on the right track. Also, don’t be afraid to experiment - it may not always work, but for example playing around with some of the function parameters change some aspects of the analysis, or the graphical presentation. Also, it is generally assumed that the code you type in from the boxes is actually entered in the same order as it appears here. Some of the later boxes depend on what was done earlier, and so skipping ahead might lead to errors. The help facility (putting a question mark in front of a command, for example ?plot, and hitting return) is also a good way to discover things…

Again, we will be using the IIASA Human Impact data.

Analysis of Human Impact and Related Data - Basic PCA

Firstly, load all of the packages that r requires for this exercise.

# Load libraries
packages.wanted <- c("robustbase","mvoutlier","RColorBrewer","GWmodel","raster","tmap")
for (package in packages.wanted) require(package,character.only=TRUE)

If entering the above gives an error message, make sure the packages: robustbase, mvoutlier, RColorBrewer, GWmodel, raster, tmap are all installed on your system.

Now, set up the data set (gw.data) used for this practical. This is based on the raster data attached to the human impact score points.

tr.data <- read.csv("points.csv") # The human impact data
night.lights.gr <- raster("night_lights.tif") # night lights raster
tree.loss.gr <- raster("tree_loss") # tree loss raster
glc.gr <- raster("glc_share.tif") #global land cover
crops.gr  <- raster("crops_iiasa1.tif") #crops

Next, extract the raster values of each of the raster data sets at the locations of the human impact scores:

coords <- tr.data[,2:3]
ntlights <- extract(night.lights.gr,coords)
trloss <- extract(tree.loss.gr,coords)
glc <- as.factor(extract(glc.gr,coords))
crops  <- extract(crops.gr,coords)

Some of the ntlights values are NA as the points are outside of the coverage - make a note of these, to remove later:

no.data <- is.na(ntlights)

Also, simplify the land use to human-influenced and others:

hlu <- as.numeric(glc %in% c("1","2","3","8"))

The as.numeric here turns a logical value (TRUE / FALSE) to zero or one - useful in later analysis. Next join all of this in a data frame:

gw.data <- data.frame(h.imp=tr.data[,4],ntlights,trloss,hlu,crops)[!no.data,]

The [!no.data,] part here filters out the points without any ntlights information. We can then create a SpatialPointsDataFrame :

gw.spdf <- SpatialPointsDataFrame(tr.data[!no.data,2:3],gw.data,proj4string=CRS("+init=epsg:4326"))

Just to test things, we can plot this on a backdrop of crops - using the tmap style of map production:

tm_shape(crops.gr) + tm_raster(title='Crops') + tm_shape(gw.spdf) + tm_dots()

Next, the global principal components of the variables can be computed. They are in the data slot of the gw.spdf object - the first step computes the PCA, the second appends the scores to the data frame in gw.spdf.

pca <- princomp(gw.spdf@data,cor=TRUE,scores=TRUE)
gw.spdf@data <- cbind(gw.spdf@data,pca$scores)
head(gw.spdf)

  h.imp ntlights     trloss hlu crops     Comp.1      Comp.2     Comp.3      Comp.4
1     5        9 0.00000000   1    40 -0.7042022 -0.65412151 -0.9527897 -1.39407706
2   100       19 0.00000000   1    81  1.6383331  0.20986366  0.2125358 -0.39226255
3    90        7 0.00000000   1    67  1.0486146 -0.52405588  0.2964270 -0.12022456
4   100       11 0.00000000   1    78  1.4581494 -0.28894609  0.3948562 -0.20535399
5    75       13 0.00000000   1    45  0.5314931 -0.05751044 -0.2016545 -0.05471134
6    15        0 0.01074219   1    30 -0.9533638 -1.02306416 -0.4766315 -0.87727751
       Comp.5
1 -1.19083084
2  0.40190841
3  0.04998924
4  0.32964207
5 -0.41411511
6 -1.31431841

Looking at the PCA loadings helps to interpret the scores. Recall - probably from several years ago ! - that the first principal component is an index, made from a weighted combination of the variables that gives the greatest variability in the study area. By looking at the values of the weights, we get a picture of what single characteristic accounts for most variability in the multivariate data. The second principal component finds the next most variable weighted index, given the further constraint that it is uncorrelated (orthogonal) with the first, and so on for the other components. Below, the loadings are examined:

pca$loadings


Loadings:
         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
h.imp     0.499  0.171  0.375  0.694  0.316
ntlights  0.218  0.886 -0.350 -0.207       
trloss   -0.347  0.368  0.821 -0.214 -0.156
hlu       0.562                      -0.815
crops     0.517 -0.212  0.230 -0.652  0.457

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
SS loadings       1.0    1.0    1.0    1.0    1.0
Proportion Var    0.2    0.2    0.2    0.2    0.2
Cumulative Var    0.2    0.4    0.6    0.8    1.0

From these, it seems like component 1 (Comp.1) contrasts tree loss with all of the other variables. This can be mapped:

data(Europe)
qtm(Europe) + 
  tm_shape(gw.spdf) + 
  tm_dots(col='Comp.1',size=0.08,title='First PC') + 
  tm_compass() + 
  tm_style_col_blind()

The second component contrasts crops against the others - this can also be mapped:

data(Europe)
qtm(Europe) + 
  tm_shape(gw.spdf) + 
  tm_dots(col='Comp.2',size=0.08,title='Second PC') + 
  tm_compass() + 
  tm_style_col_blind()

PCA and anomaly detection

Going back to the first principal component, we can also use boxplot to find the outlying values:

boxplot(gw.spdf$Comp.1,horizontal = TRUE)

There are some outlying high values. The boxplot function, as well as actually providing plots, returns information about the outliers themselves. Sometimes extracting the information is a bit convoluted, but it can be done. Basically, you extract the outlying values and match them against the whole list for the variable of interest. For example, to find which cases correspond to the outlying values of Comp.1, use

outl <- which(gw.spdf$Comp.1 %in% boxplot(gw.spdf$Comp.1,plot=FALSE)$out)
outl

 [1]  180  266  290  331  799  996 1067 1400 1872 2033

Then, by subsetting gw.spdf to just these values, just the outlying cases are mapped. In this case, these are places with outstandingly high tree loss contrasted against the other variables (i.e. a multivariate outlier).

qtm(Europe,text='iso_a3',text.size=0.8) + tm_shape(gw.spdf[outl,]) + tm_dots(size='Comp.1',col='navy',title.size='First PC') + tm_compass()

Note that here size is used to represent the values of the component score, rather than colour. Also, the text option in the initial qtm adds a labelling variable to each polygon where iso_a3 is a variable in the SpatialPolygonsDataFrame called Europe - giving three letter abbreviations for each country.

Geographically Weighted PCA and anomaly detection

The above exercise demonstrated the sort of analysis a standard PCA can give - essentially picking out the key multivariate modes of variability in the data. Also looking at outlying values of the principal components of these data gives us an idea of unusual sites (in terms of combinations of all the variables considered). It can also tell us where unusual sites are located.

Next, Geographically weighted PCA can be used to find spatial multivariate outliers. Sounds complicated, but really all this means is it identifies sites that have an unusual multi-way combination of variables in relation to their immediate geographical neighbours. It might be that the values observed at these sites as a combination is not uncommon in general - but is very unusual in its locality.

To find such outliers the procedure is relatively simple - instead of doing a PCA on all of the data, for each sample we do a PCA on data falling into a window centred on the location of that sample. In that way we can check whether the sample is like its neighbours or not, from a multivariate viewpoint.

To compute this, the GWModel function gwpca is used. As with many other GW approaches, it is often helpful to pre-compute distances. Here they are stored in the matrix d:

d <- gw.dist(coordinates(gw.spdf),coordinates(gw.spdf),longlat=TRUE)

Now the GWPCA is computed.

 pca.gw <- gwpca(gw.spdf,vars=colnames(gw.spdf@data)[1:5],bw=20,k=5,dMat=d,adaptive=TRUE)

GWPCA Loadings

A GWPCA actually generates a lot of information. One initial ‘overview’ of this localized PCA is to map which of the variables has most influence on the components - this is done by identifying which one has the highest principal component weightings (loadings). The following R code identifies which variable has the highest weighting at each sample point, and provides a colour-coded map. A few explanatory points:

pca.gw$loadings contains a loading matrix for each component, and for each location. It is a three-dimensional array - the three dimensions are the 5 loadings, the 5 components and the 2642 observations.
lead.item contains the name of the variable with the highest weighting for each location.

local.loadings <- pca.gw$loadings[,,1] # note first component only - would need to explore all components..
lead.item <- colnames(local.loadings)[max.col(abs(local.loadings))]
gw.spdf@data$lead.item <- lead.item
qtm(Europe) + 
  tm_shape(gw.spdf) + 
  tm_dots(col='lead.item',title='Lead PC',size=0.08) + 
  tm_compass()

So the greatest source of variation is one of the variables crops, h.imp (the most common), and ntlights. Whereas this kind of display gives some information, there is still much that it doesn’t really demonstrate. Although it is possible to see a geographical pattern in the variables that dominate the local PCAs (i.e. the ones that contribute most notably to local compositional variability in different places). Glyph plots are more complicated, but they give a view of all of the component weightings (loadings). Unfortunately, they aren’t yet implemented in tmap so a more basic approach is used:

# Transform gw.spdf2 to the map projection of Europe
gw.spdf2 <- spTransform(gw.spdf,CRS(proj4string(Europe))) 
# Draw a 'dummy' map whose extent matches gw.spdf2
plot(gw.spdf2,pch='') 
# Draw the Europe backdrop
plot(Europe,col='grey',add=TRUE) 
# Add a glyph plot layer
glyph.plot(local.loadings,coordinates(gw.spdf2),add=TRUE)

Each ‘spoke’ on the stars represents the value of one of the loadings. Negative loadings are given in red. Although this shows spatial trends (note the shape of the glyphs alters around the study area) - it is not immediately clear how to interpret the glyphs fully, and identify the full weighting information. To get around this, there is an interactive function check.components. Enter

check.components(local.loadings,coordinates(gw.spdf2))

Each time you click on a glyph, it prints out the exact weighting. When you are done, hit the esc key.

GWPCA Percentage of total variation (PTV) explained

Another useful diagnostic for PCA is the percentage of variability in the data explained by each of the components. This can be achieved by looking at the var component of pca.gw; this is written as pca.gw$var. This is an 2642 by 5 matrix - where 2642 is the number of observations and 5 is the number of components. For each location, the 5 columns correspond to the variance of each of the principal components. Looking at the proportion of each component in the sum of all of the variances shows how much of the variability in the data each component contributes. If, say, the first two components contributed 90% of the total variance, then it is reasonable to assume that much of the variability in the data can be seen by just looking at these two components. Because this is geographically weighted PCA, however, this quantity varies across the map. Here is an R function to compute the proportion of variance at each observation point, given the var component of a GWPCA object, and the number of PCs to consider (e.g. in the above discussion I considered the first two components):

prop.var <- function(gwpca.obj, n.components) {
  return(rowSums(gwpca.obj$var[,1:n.components])/rowSums(gwpca.obj$var))
}

Copy this in to R. At this stage it doesn’t do anything except define a new function (in simpler terms, it has told R how to do the calculation, but hasn’t actually done it yet). To compute these values for the pca.gw object, and for the first two components, enter the following:

props <- prop.var(pca.gw,2)

This stores the proportions in props. It may be helpful to inspect the values:

summary(props)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.7253  0.9258  0.9627  0.9494  0.9837  1.0000

From this it can be seen that there is some spatial variation in the amount of variance the first two components explain - between around 72.5% and 100.0%. The following code maps the variability:

gw.spdf@data$prop <- props * 100
qtm(Europe) + 
  tm_shape(gw.spdf) + 
  tm_dots(col='prop',title='% variation',size=0.08) + 
  tm_compass()

Choosing the GWPCA Bandwidth Automatically

In the last example, the bandwidth was pre-specified. However, in it also possible to select the bandwidth automatically. Without going into detail here, this is achieved by a form of cross validation, where each observation is omitted, and it is attempted to reconstruct the values on the basis of principal components, derived from the other observations. The bandwidth achieving the optimal results is the one selected. For a complete explanation, see @Harris2011 . The function bw.gwpca computes this:

bw.choice <- bw.gwpca(gw.spdf, vars=colnames(gw.spdf@data)[1:5],  
                      k=2,dMat=d,adaptive=TRUE)

Adaptive bandwidth(number of nearest neighbours): 1633 CV score: 324994 
Adaptive bandwidth(number of nearest neighbours): 1010 CV score: 314580.6 
Adaptive bandwidth(number of nearest neighbours): 624 CV score: 302074.1 
Adaptive bandwidth(number of nearest neighbours): 386 CV score: 295028.2 
Adaptive bandwidth(number of nearest neighbours): 238 CV score: 289141.1 
Adaptive bandwidth(number of nearest neighbours): 147 CV score: 308675 
Adaptive bandwidth(number of nearest neighbours): 294 CV score: 292003.1 
Adaptive bandwidth(number of nearest neighbours): 203 CV score: 290376.3 
Adaptive bandwidth(number of nearest neighbours): 259 CV score: 290345.9 
Adaptive bandwidth(number of nearest neighbours): 224 CV score: 289271.1 
Adaptive bandwidth(number of nearest neighbours): 245 CV score: 289537.3 
Adaptive bandwidth(number of nearest neighbours): 231 CV score: 289086 
Adaptive bandwidth(number of nearest neighbours): 229 CV score: 289153.6 
Adaptive bandwidth(number of nearest neighbours): 234 CV score: 289093.1 
Adaptive bandwidth(number of nearest neighbours): 230 CV score: 289114.7 
Adaptive bandwidth(number of nearest neighbours): 232 CV score: 289024.1 
Adaptive bandwidth(number of nearest neighbours): 232 CV score: 289024.1

bw.choice

[1] 232

From this, the cross-validation choice is quite a lot larger than the choice made earlier. The GWPCA can now be re-run using the cross-validation value:

pca.gw2 <- gwpca(gw.spdf,vars=colnames(gw.spdf@data)[1:5], 
                 bw=bw.choice,k=5,dMat=d,adaptive=TRUE)

Then various characteristics may be re-mapped. Here, the highest loading is re-mapped:

local.loadings2 <- pca.gw2$loadings[,,1] 
# note first component only - would need to explore all components..
lead.item2 <- colnames(local.loadings2)[max.col(abs(local.loadings2))]
gw.spdf@data$lead.item2 <- lead.item2
qtm(Europe) + 
  tm_shape(gw.spdf) + 
  tm_dots(col='lead.item2',title='Lead PC',size=0.08) + 
  tm_compass()

Now, only crops or h.imp are dominant, with south west England and north Africa having notable dominance in terms of crops variability.

Finding Spatial Multivariate Outliers

Recall from earlier that global PCA was used to identify multivariate outliers - but that it was also possible to use local PCA (ie GWPCA) to identify local outliers. One way of doing this links back to the cross-validation idea used earlier to select a bandwidth. Recall that this is based on a score of how well each observation can be reconstructed on the basis of local PCs. The score measures the total discrepancies of true data values from the reconstructed ones - and the bandwidth chosen is the one minimising this. However, the total discrepancy score is the sum of the individual discrepancies. A very large individual discrepancy associated with an observation suggests it is very different - in a multidimensional way, to the observations near to it. These discrepancies can be calculated with the gwpca.cv.contrib function. Note that there is an error in the code for this function in the current version of GWmodel, so an alternative definition is provided here. This takes the same arguments as the gwpca command, but returns the discrepancies.

data.mat <- scale(as.matrix(gw.spdf@data[,1:5]))
discrepancy <- gwpca.cv.contrib(data.mat,coordinates(gw.spdf), 
                                bw.choice,adaptive=TRUE,dMat=d)

Note that the bw=bw.choice part of the command sets the bandwidth to be the automatically selected one. As before, box plots help us to find extreme values:

boxplot(discrepancy,horizontal=TRUE,xlab='Local PC Discrepancy')

A number of these exist, but the most notable is perhaps the one exceeding 200. Again , it is informative to map these - here, the values greater than 20 are chosen:

gw.spdf@data$disc <- discrepancy
qtm(Europe) + tm_shape(gw.spdf[gw.spdf$disc > 20,]) + tm_dots(col='indianred',title.size='Discrepancy',size='disc') + tm_compass()

As before, you can use check.components to inspect the actual values of the variables:

check.components(local.loadings,coordinates(gw.spdf))

Another possibility to undrerstand the nature of the outlier – which doesn’t always work on Windows (!) – is a parallel coordinates plot. Here, each observation is shown as a series of line segments connecting points corresponding to each variable on a set of parallel scales. Since here we are investigating local outliers, one particular observation is highlighted in red, and the remaining ones in grey, but with the intensity of the grey fading according to their distance from the red observation (i.e. a GW parallel coordinates plot). This enables you to see what characteristic the red observation has that means it is outlying from its neighbours. The plot can be created using gw.pcplot - first this must be defined:

gw.pcplot <- function(x,i,loc,bw,ylim=NULL,ylab="",fixtrans=FALSE,...) {
  m <- ncol(x)
    bw <- bw*bw
    dists <- rowSums(sweep(loc,2,loc[i,])**2)
    nbrlist <- which(dists < bw*bw)
  wts <- (1 - dists/(bw*bw))^12
    xss <- scale(x)
    span <- 1:m
    tsc <- 25/length(nbrlist)
    if (is.null(ylim)) ylim <- c(min(xss[nbrlist,]),max(xss[nbrlist,]))
    plot(span,xss[i,],type='l',ylim=ylim,
        xlim=c(0.5,m+0.5),col='red',lwd=6,axes=FALSE,xlab="",ylab=ylab,...)
    axis(1,at=1:m,labels=colnames(x),las=2,cex.axis=1.2)
    axis(2,at=seq(floor(ylim[1]),ceiling(ylim[2]),by=1),cex.axis=1.2)
    abline(v=1:m,col=grey(0.6))
    lines(c(1,m),c(0,0),col=grey(0.6))
    if (fixtrans) {
        for (nbr in nbrlist) lines(span,xss[nbr,],col=rgb(0,0,0,0.3), lwd=3) }
  else {
      for (nbr in nbrlist) lines(span,xss[nbr,],col=rgb(0,0,0,tsc*wts[nbr]), lwd=3)} }

just copy this code and paste into the R terminal. Then, to carry out the plot, enter:

big.outl <- which(gw.spdf$disc > 100) # Find the massive outlier
gw.pcplot(data.mat,big.outl,coordinates(gw.spdf),bw=sqrt(bw.choice))

It turns out this area has a very large value of trloss and this is unusual amongst its neighbours.

Self-test: As a further exercise, try to produce some of the earlier plots using bw.choice as the bandwidth.

Session 3: Geographically Weighted Principal Component Analysis

Lex Comber and Paul Harris