This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.
These practicals are designed to have explanatory text, together with code examples. Note that all code examples have a light yellow background, and are boxed. Output from R is also shown, and text output is also boxed. Graphical output is shown ‘in line’. The idea is to copy the text in the boxes into a running version of R - you can use the output to see whether you are on the right track. Also, don’t be afraid to experiment - it may not always work, but, for example, playing around with some of the function parameters changes aspects of the analysis, or of the graphical presentation. It is generally assumed that the code you type in from the boxes is entered in the same order as it appears here. Some of the later boxes depend on what was done earlier, and so skipping ahead might lead to errors. The help facility (putting a question mark in front of a command, for example ?plot, and hitting return) is also a good way to discover things…
Firstly, load all of the packages that R requires for this exercise.
# Load libraries
packages.wanted <- c("robustbase","mvoutlier","RColorBrewer","GWmodel")
for (package in packages.wanted) require(package,character.only=TRUE)
# Load the GWPCA functions....
library(GWmodel)
If entering the above gives an error message, make sure that the packages robustbase, mvoutlier, RColorBrewer and GWmodel are all installed on your system.
Now, set up the data set (Data.1) used for this practical. This is a subset of the Baltic Soil Survey (BSS) data set (Reimann et al. 2000) from the mvoutlier package (retaining ten key variables and location information from the original).
data(bsstop)
Data.1 <- bsstop[,1:14]
Now, check the column names of this data set. There should be an ID, an X and Y location pair (called XCOO and YCOO where COO is short for coordinate), and 10 columns relating to the concentration of chemical compounds in the soil.
colnames(Data.1)
[1] "ID" "CNo" "XCOO" "YCOO" "SiO2_T" "TiO2_T" "Al2O3_T" "Fe2O3_T"
[9] "MnO_T" "MgO_T" "CaO_T" "Na2O_T" "K2O_T" "P2O5_T"
Next, we will be working with scaled data - that is, we will replace the chemical concentration measurements by their z-scores. If you haven’t heard of z-scores before: if x is the original variable then \[ z = \frac{x - \bar{x}}{\textrm{sd}(x)} \] so that the variable is rescaled to have a mean value of zero, and a standard deviation of one. This means that we are measuring relative fluctuations in each individual chemical across geography, rather than measuring all concentrations on the same scale.
Data.1.scaled <- scale(as.matrix(Data.1[5:14])) # standardised data...
rownames(Data.1.scaled) <- Data.1[,1]
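If you want to reassure yourself that the scaling has worked as expected, a quick check like the following (just a sketch - not strictly needed for the practical) confirms that each column now has a mean of (almost exactly) zero and a standard deviation of one, and that the first column matches a ‘by hand’ z-score calculation:
round(colMeans(Data.1.scaled),6) # means should all be essentially zero
apply(Data.1.scaled,2,sd) # standard deviations should all be one
all.equal(as.numeric(Data.1.scaled[,1]),(Data.1$SiO2_T-mean(Data.1$SiO2_T))/sd(Data.1$SiO2_T))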
We also store the number of observation points in n1:
n1 <- length(Data.1.scaled[,1])
n1
[1] 768
From this it may be seen that there are 768 sample locations. Using these data, compute the principal components:
pca <- princomp(Data.1.scaled,cor=F,scores=T) # use covariance matrix to match the following...
pca$loadings
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
SiO2_T 0.178 -0.587 -0.255 0.450 0.239 -0.423 -0.310 0.119
TiO2_T -0.366 -0.182 0.300 -0.114 -0.454 0.673 -0.244
Al2O3_T -0.398 -0.213 -0.226 -0.136 0.164 0.830
Fe2O3_T -0.399 0.274 -0.257 -0.184 0.747 -0.303
MnO_T -0.316 -0.133 0.469 0.125 0.503 0.112 0.610
MgO_T -0.379 0.170 -0.363 -0.256 -0.566 -0.546
CaO_T -0.264 0.301 -0.338 -0.495 0.516 -0.439 0.103
Na2O_T -0.322 -0.571 0.638 0.137 0.113 -0.355
K2O_T -0.224 -0.489 -0.288 0.450 -0.540 -0.136 -0.172 -0.293
P2O5_T -0.224 0.465 0.566 0.339 0.173 -0.465 -0.216
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
SS loadings 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Proportion Var 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
Cumulative Var 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
The list of loadings, above, gives the key components of variation in the soil data set. Recall - probably from several years ago! - that the first principal component is an index, made from a weighted combination of the variables, that gives the greatest variability in the study area. By looking at the values of the weights, we get a picture of what single characteristic accounts for most variability in the multivariate data. The second principal component finds the next most variable weighted index, given the further constraint that it is uncorrelated with the first, and so on for the other components. Looking down the columns shows these weights. From this, we see that the first component contrasts the level of \(SiO_2\) with all of the other compounds. In a nutshell, higher levels of \(SiO_2\) correspond to larger particle size in the soil, which in turn suggests that the soil is more barren. The next component contrasts \(CaO\) and \(P_2O_5\) with \(SiO_2\), \(TiO_2\), \(Al_2O_3\), \(MnO\) and \(K_2O\). Note that cells in this table left blank correspond to ones with a weighting below 0.1 - a relatively low weighting. This suggests that \(Fe_2O_3\), \(MgO\) and \(Na_2O\) do not play a notable role in the second component.
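To make the ‘weighted index’ idea concrete, the first PC score for each site is simply the scaled data multiplied by the first column of loadings. The following optional sketch verifies this, and also uses summary to show how much of the total variance each component accounts for:
pc1.byhand <- Data.1.scaled %*% pca$loadings[,1] # weighted combination of the variables
cor(pc1.byhand,pca$scores[,1]) # should be very close to 1
summary(pca) # proportion of variance explained by each component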
So the analysis tells us what the weightings are - but not what the geography actually is. A few lines of R will help to address this. Firstly the data set bss.background contains national borders for the study region:
data(bss.background)
Next we can draw a map. This is quite basic. Firstly plot the national borders - type='l' tells R that we want to draw lines. asp=1 makes sure the x- and y- axes are drawn to the same scale, so the map doesn’t appear ‘stretched’. All of the other options are used to switch off various details of the axes - although I am using plot to draw a map, it is essentially a graph-plotting command. It can be used for maps, but the default settings must be altered.
plot(bss.background,asp=1,type='l',xaxt='n',yaxt='n',xlab='',ylab='',bty='n',col='grey')
In fact we will use this backdrop quite often - so define a special function backdrop() as a short cut for the longer plot command above:
backdrop <- function() plot(bss.background,asp=1,type='l',xaxt='n',yaxt='n',xlab='',ylab='',bty='n',col='grey')
Next, select out the first principal component:
pc1 <- pca$scores[,1]
Now plot where it is negative (red), and where it is positive (blue).
backdrop()
points(Data.1$XCOO[pc1>0],Data.1$YCOO[pc1>0],pch=16,col='blue')
points(Data.1$XCOO[pc1<0],Data.1$YCOO[pc1<0],pch=16,col='red')
This shows a clear spatial trend in the first principal component. You can do similar things to identify patterns in the second, third and other PCs. For example, for the second PC:
pc2 <- pca$scores[,2]
backdrop()
points(Data.1$XCOO[pc2>0],Data.1$YCOO[pc2>0],pch=16,col='blue')
points(Data.1$XCOO[pc2<0],Data.1$YCOO[pc2<0],pch=16,col='red')
Going back to PC1, we can also use boxplot to find the outlying values:
boxplot(pc1,horizontal = TRUE)
There are two outlying low values. The boxplot function, as well as actually providing plots, returns information about the outliers themselves. Sometimes extracting this information is a bit convoluted, but it can be done. Basically, if the PC1 values each have a name, then you match the names of the outlying ones against the whole list. For example, to find the case numbers of the two outlying PC1 values, use:
outl <- match(names(boxplot(pc1,plot=FALSE)$out),names(pc1))
outl
[1] 43 152
The plot=FALSE option was used because on this occasion we only want the outlier information - we don’t actually want R to draw a box plot. From this it can be seen that observations 43 and 152 are the outliers. This information is stored in outl. Finally, draw these on a map:
backdrop()
points(Data.1$XCOO[pc1>0],Data.1$YCOO[pc1>0],pch=16,col='lightblue',cex=0.5)
points(Data.1$XCOO[pc1<0],Data.1$YCOO[pc1<0],pch=16,col='pink',cex=0.5)
points(Data.1$XCOO[outl],Data.1$YCOO[outl],col='red',cex=2)
The two red circles show the two outlying values of PC1 - in essence, these are areas with unusually low \(SiO_2\) levels, when compared to the data set as a whole.
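If you want to see what makes these two sites unusual, a quick optional sketch is to print their standardised values - large magnitudes indicate the compounds driving their extreme PC1 scores:
round(Data.1.scaled[outl,],2) # z-scores for the two outlying observations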
The last exercise demonstrated the sort of analysis a standard PCA can give - essentially picking out the key multivariate modes of variability in the data. Looking at outlying values of the principal components of these data gives us an idea of unusual sites (in terms of combinations of all the chemical compound concentrations). It can also tell us where unusual sites are located. Next, geographically weighted PCA can be used to find spatial multivariate outliers. This sounds complicated, but really all it means is that it identifies sites that have an unusual multi-way combination of chemical compounds in relation to their immediate geographical neighbours. It might be that the combination of values observed at such a site is not uncommon in northern Europe as a whole - but is very unusual in its locality.
To find such outliers the procedure is relatively simple - instead of doing a PCA on northern Europe as a whole, for each sample we do a PCA on data falling into a window centred on the location of that sample. In that way we can check whether the sample is like its neighbours or not, from a multivariate viewpoint.
The following code carries out a geographically weighted PCA. In short, it runs a ‘windowed’ PCA around each of the sample points; the window is 1,000,000m - that is, 1,000km. The k=10 option means we are asking for all 10 of the local PCA weightings to be computed. Note that the gwpca function returns the results of weighted PCAs centred on each of the coordinates of the sample points, and stores this information in the object pca.gw.
Coords1 <- as.matrix(cbind(Data.1$XCOO,Data.1$YCOO)) # Coordinates of the sites
d1s <- SpatialPointsDataFrame(Coords1,as.data.frame(Data.1.scaled))
pca.gw <- gwpca(d1s,vars=colnames(d1s@data),bw=1000000,k=10)
This actually generates a lot of information. One initial ‘overview’ of the local PCA is to map which of the chemical concentrations has the most influence on the first local component - this is done by identifying which one has the highest (absolute) weighting. The following R code identifies that variable at each sample point, and provides a colour-coded map. A few explanatory points:
- d1s is a SpatialPointsDataFrame containing the data and its coordinates for input to the function
- pca.gw$loadings contains a loading matrix for each component, and for each location. It is a three-dimensional array - the three dimensions are the 768 observations, the 10 loadings (one per variable) and the 10 components (PC1-PC10)
- lead.item contains the name of the variable with the highest weighting for each location
- df1p is a SpatialPointsDataFrame containing the lead.item for each sample point location
- colour contains an assigned colour for each point. The colour corresponds to the value of the lead.item
local.loadings <- pca.gw$loadings[,,1] # note first component only - would need to explore all components..
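# A quick optional check (a sketch): the loadings array should have dimensions
# 768 (observations) by 10 (variable loadings) by 10 (components)
dim(pca.gw$loadings)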
lead.item <- colnames(local.loadings)[max.col(abs(local.loadings))]
df1p = SpatialPointsDataFrame(Coords1, data.frame(lead=lead.item))
backdrop()
colour <- brewer.pal(8,"Dark2")[match(df1p$lead,unique(df1p$lead))]
plot(df1p,pch=18,col=colour,add=TRUE)
legend('topleft',as.character(unique(df1p$lead)),pch=18,col=brewer.pal(8,"Dark2"))
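It can also be useful to tabulate how many sample points each compound ‘leads’ at - a quick optional sketch:
table(lead.item) # counts of the lead variable across all 768 locations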
This kind of display gives some information, but there is still much that it doesn’t really demonstrate, although it is possible to see a geographical pattern in the chemicals that dominate the local PCAs (ie the ones that contribute most notably to local compositional variability in different places). Glyph plots are more complicated, but they give a view of all of the component weightings.
backdrop()
glyph.plot(local.loadings,Coords1,add=TRUE)
Each ‘spoke’ on the stars represents the value of one of the loadings. Negative loadings are given in red. Although this shows spatial trends (note how the shape of the glyphs alters around the study area), it is not immediately clear how to interpret the glyphs fully and identify the full weighting information. To get around this, there is an interactive function check.components. Enter
check.components(local.loadings,Coords1)
Each time you click on a glyph, it prints out the exact weighting. When you are done, hit the esc key.
It is also easier to see patterns if you randomly ‘thin out’ the glyphs that you plot:
subsample <- sample(n1,200)
backdrop()
glyph.plot(local.loadings[subsample,],Coords1[subsample,],add=TRUE,r1=25)
To understand this code, recall that n1 contains the number of original sample points. The function call sample(n1,200) selects 200 random numbers in the range 1 to n1, without replacement - thus giving a randomly thinned-out subsample for the updated glyph plot, which arguably makes trend-like patterns easier to read. The parameter r1 is somewhat esoteric - but essentially smaller values lead to larger glyphs. The default value is 50, so the value 25 used here leads to larger glyphs - this also helps with readability.
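If you would like the same thinned-out subsample each time you run this (so that your map is reproducible), you could fix the random number seed before sampling - an optional sketch, where the seed value 12345 is arbitrary:
set.seed(12345) # any fixed value will do
subsample <- sample(n1,200)
# ...then re-run the backdrop() and glyph.plot() commands above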
Another useful diagnostic for PCA is the percentage of variability in the data explained by each of the components. This can be achieved by looking at the var component of pca.gw; this is written as pca.gw$var. This is a 768 by 10 matrix - where 768 is the number of observations and 10 is the number of components. For each location, the 10 columns correspond to the variances of each of the principal components. Looking at the proportion of each component in the sum of all of the variances shows how much of the variability in the data each component contributes. If, say, the first two components contributed 90% of the total variance, then it is reasonable to assume that much of the variability in the data can be seen by just looking at these two components. Because this is geographically weighted PCA, however, this quantity varies across the map. Here is an R function to compute the proportion of variance at each observation point, given a GWPCA object and the number of PCs to consider (eg in the above discussion I considered the first two components):
prop.var <- function(gwpca.obj, n.components) {
return(rowSums(gwpca.obj$var[,1:n.components])/rowSums(gwpca.obj$var))
}
Copy this into R. At this stage it doesn’t do anything except define a new function (in simpler terms, it has told R how to do the calculation, but hasn’t actually done it yet). To compute these values for the pca.gw object, and for the first two components, enter the following:
props <- prop.var(pca.gw,2)
This stores the proportions in props. It may be helpful to inspect the values:
summary(props)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.6393 0.6896 0.7171 0.7324 0.7707 0.8900
From this it can be seen that there is some spatial variation in the amount of variance the first two components explain - between around 63.9% and 89.0%. The following code maps the variability:
interval <- cut(props,c(0.6,0.7,0.8,0.9),labels=FALSE) # split the proportions into categories for mapping
df1v = SpatialPointsDataFrame(Coords1, data.frame(int=interval))
backdrop()
colour <- brewer.pal(4,"Blues")[-1][interval]
plot(df1v,pch=18,col=colour,add=TRUE)
interval.labels <- c("Below 0.7","0.7 up to 0.8","0.8 and above")
legend('topleft',interval.labels,pch=18,col=brewer.pal(4,"Blues")[-1])
This is fairly complex, but the following outlines what has been done:
- cut splits the props variable into numerical categories, based on the breakpoints {0.6,0.7,0.8,0.9}
- df1v is a SpatialPointsDataFrame with the observation locations, and the split category number as set out above
Looking at the map it seems that as one heads south, the first two PCs explain a greater amount of the variation. In one sense, the data gets more ‘two dimensional’ in the southern part of the survey.
In the last example, the bandwidth was pre-specified. However, it is also possible to select the bandwidth automatically. Without going into detail here, this is achieved by a form of cross-validation, where each observation is omitted in turn, and an attempt is made to reconstruct its values on the basis of principal components derived from the other observations. The bandwidth achieving the best reconstruction is the one selected. For a complete explanation, see Harris, Brunsdon, and Charlton (2011). The function bw.gwpca computes this - the arguments are almost the same as gwpca, except that no bandwidth is given; instead the function searches for the bandwidth that is optimal in the sense defined above (or, more rigorously, in Harris, Brunsdon, and Charlton (2011)). Below, the optimal bandwidth is found for the data used before:
bw.choice <- bw.gwpca(d1s,vars=colnames(d1s@data),k=2)
pca.gw.auto <- gwpca(d1s,vars=colnames(d1s@data),bw=bw.choice,k=2)
Since this command computes several GWPCA estimates at different bandwidths in order to find an optimum, it will take noticeably longer than the version of the function with a prescribed bandwidth. Once the bandwidth is found, gwpca is re-run with the bandwidth set to bw.choice.
bw.choice
[1] 1464968
Thus the ‘optimal’ automatic bandwidth is around 1465 km. (Note the scare quotes around ‘optimal’ - ours is just one definition of optimal; there are others!)
Having selected a bandwidth automatically, it is now possible to map the results. Essentially this may be done by repeating the above steps to get maps, with the original pca.gw object replaced by the newly computed pca.gw.auto. For example, the following will produce a new ‘largest locally weighted component’ map:
local.loadings <- pca.gw.auto$loadings[,,1] # note first component only - would need to explore all components..
lead.item <- colnames(local.loadings)[max.col(abs(local.loadings))]
df1p = SpatialPointsDataFrame(Coords1, data.frame(lead=lead.item))
backdrop()
colour <- brewer.pal(8,"Dark2")[match(df1p$lead,unique(df1p$lead))]
plot(df1p,pch=18,col=colour,add=TRUE)
legend('topleft',as.character(unique(df1p$lead)),pch=18,col=brewer.pal(8,"Dark2"))
Note that this is similar to the original map, but perhaps shows less spatial variability. This is unsurprising - the automatic bandwidth of around 1465km is a little larger than the manually chosen 1000km used in the earlier maps. Larger bandwidths imply bigger moving spatial windows, which in turn imply smoother spatially varying outputs. We can do the same with the weightings on the second component:
local.loadings.2 <- pca.gw.auto$loadings[,,2] # note second component now
lead.item <- colnames(local.loadings.2)[max.col(abs(local.loadings))]
df1p = SpatialPointsDataFrame(Coords1, data.frame(lead=lead.item))
backdrop()
colour <- brewer.pal(8,"Dark2")[match(df1p$lead,unique(df1p$lead))]
plot(df1p,pch=18,col=colour,add=TRUE)
legend('topleft',as.character(unique(df1p$lead)),pch=18,col=brewer.pal(8,"Dark2"))
Recall from earlier that global PCA was used to identify multivariate outliers - but that it was also possible to use local PCA (ie GWPCA) to identify local outliers. One way of doing this links back to the cross-validation idea used earlier to select a bandwidth. Recall that this is based on a score of how well each observation can be reconstructed on the basis of the local PCs. The score measures the total discrepancy between the true data values and the reconstructed ones - and the bandwidth chosen is the one minimising this. However, the total discrepancy score is the sum of the individual discrepancies. A very large individual discrepancy associated with an observation suggests it is very different - in a multidimensional way - from the observations near to it. These discrepancies can be calculated with the gwpca.cv.contrib function. Note that there is an error in the code for this function in the current version of GWmodel, so an alternative definition is provided here. This takes the same arguments as the gwpca command, but returns the discrepancies.
gwpca.cv.contrib <- function (x, loc, bw, k = 2, robust = FALSE, kernel = "bisquare",
                              adaptive = FALSE, p = 2, theta = 0, longlat = F, pcafun = wpca, dMat)
{
  if (missing(dMat))
    DM.given <- F
  else DM.given <- T
  n <- nrow(loc)
  m <- ncol(x)
  w <- array(0, c(n, m, k))
  score <- numeric(n)
  # Choose the (robust or non-robust) weighted PCA routine
  if (robust == FALSE)
    pcafun = wpca
  else pcafun = rwpca
  for (i in 1:n) {
    # Distances from observation i to all other observations
    if (DM.given)
      dist.vi <- dMat[, i]
    else {
      dist.vi <- gw.dist(loc, focus = i, p = p, theta = theta,
                         longlat = longlat)
    }
    # Geographical weights, with observation i itself left out (leave-one-out)
    wt <- gw.weight(dist.vi, bw, kernel, adaptive)
    wt[i] <- 0
    use <- wt > 0
    wt <- wt[use]
    if (length(wt) <= 1) {
      score[i] <- Inf
      expr <- paste("Too small bandwidth: ", bw)
      warning(paste(expr, "and the CV value can't be given there.",
                    sep = ", "))
      break
    }
    # Local weighted PCA on the neighbours, then score how poorly the
    # first k components reconstruct the omitted observation
    v <- pcafun(x[use, ], wt, nu = 0, nv = k)$v
    v <- v %*% t(v)
    score[i] <- sum((x[i, ] - x[i, ] %*% v))^2
  }
  score
}
discrepancy <- gwpca.cv.contrib(Data.1.scaled,Coords1,bw=bw.choice,k=2,p=2,theta=0,longlat=FALSE,adaptive=FALSE,kernel='bisquare')
Note that the bw=bw.choice part of the command sets the bandwidth to be the automatically selected one. As before, box plots help us to find extreme values:
boxplot(discrepancy,horizontal=TRUE,xlab='Local PC Discrepancy')
It can be seen that there is one extreme outlier - with a discrepancy measure of about 30. Next, find out which one this is.
which(discrepancy > 25)
[1] 359
Now we can see it is observation 359 - this can be shown on a map.
backdrop()
points(Data.1$XCOO, Data.1$YCOO, pch = 16, col = "sienna", cex = 0.5)
points(Data.1$XCOO[359], Data.1$YCOO[359], col = "red", cex = 2)
On the map, the outlying point is ringed. As before, you can use check.components to inspect the local loadings:
check.components(local.loadings,Coords1)
Click on the points you are interested in - when you are done, hit the escape key.
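It may also help to look up this observation directly - the following optional sketch shows its ID and location, and its standardised chemical concentrations:
Data.1[359,c("ID","XCOO","YCOO")] # which site is this, and where is it?
round(Data.1.scaled[359,],2) # its z-scores for the ten compounds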
Another way to understand the nature of the outlier – one which doesn’t always work on Windows (!) – is a parallel coordinates plot. Here, each observation is shown as a series of line segments connecting points corresponding to each variable on a set of parallel scales. Since here we are investigating local outliers, one particular observation is highlighted in red, and the remaining ones are drawn in grey, with the intensity of the grey fading according to their distance from the red observation. This lets you see what characteristics the red observation has that make it outlying relative to its neighbours. The plot can be created using gw.pcplot - first this function must be defined:
gw.pcplot <- function(x,i,loc,bw,ylim=NULL,ylab="",fixtrans=FALSE,...) {
  m <- ncol(x)
  bw <- bw*bw
  # Squared distances from observation i; points within the window form the neighbour list
  dists <- rowSums(sweep(loc,2,loc[i,])**2)
  nbrlist <- which(dists < bw*bw)
  # Weights used to fade the grey lines with distance
  wts <- (1 - dists/(bw*bw))^12
  xss <- scale(x)
  span <- 1:m
  tsc <- 25/length(nbrlist)
  if (is.null(ylim)) ylim <- c(min(xss[nbrlist,]),max(xss[nbrlist,]))
  # Draw observation i as a thick red line across the parallel axes
  plot(span,xss[i,],type='l',ylim=ylim,
       xlim=c(0.5,m+0.5),col='red',lwd=6,axes=FALSE,xlab="",ylab=ylab,...)
  axis(1,at=1:m,labels=colnames(x),las=2,cex.axis=1.2)
  axis(2,at=seq(floor(ylim[1]),ceiling(ylim[2]),by=1),cex.axis=1.2)
  abline(v=1:m,col=grey(0.6))
  lines(c(1,m),c(0,0),col=grey(0.6))
  # Overlay the neighbours in grey - either at fixed transparency, or fading with distance
  if (fixtrans) {
    for (nbr in nbrlist) lines(span,xss[nbr,],col=rgb(0,0,0,0.3), lwd=3) }
  else {
    for (nbr in nbrlist) lines(span,xss[nbr,],col=rgb(0,0,0,tsc*wts[nbr]), lwd=3)} }
Just copy this code and paste it into the R console. Then, to produce the plot, enter:
gw.pcplot(Data.1.scaled,359,Coords1,bw=sqrt(bw.choice))
From this it can be seen that a key distinction between observation 359 and its neighbours is its high level of \(P_2O_5\). Note that here we specified the bandwidth to be the square root of the optimum as found by cross-validation. Here the bandwidth is used to control the intensity of shading, rather than the actual calculation of the local statistics, and experimentally this has been found to give better results visually.
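Since the gw.pcplot function defined above also has a fixtrans argument, you can compare this with a version in which all neighbours are drawn at the same intensity, rather than fading with distance - an optional sketch:
gw.pcplot(Data.1.scaled,359,Coords1,bw=sqrt(bw.choice),fixtrans=TRUE) # fixed-transparency neighbours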
Self-test: As a further exercise, try to produce glyph plots (using all of the data and also a random subset, as in the previous example) from pca.gw.auto. Some suggestions follow the references - but ideally try to do them yourself before looking…
For the glyphs of every observation.
backdrop()
local.loadings <- pca.gw.auto$loadings[,,1] # assuming you want to look at PC1
glyph.plot(local.loadings,Coords1,add=TRUE)
For a random selection of 200
subsample <- sample(n1,200)
backdrop()
glyph.plot(local.loadings[subsample,],Coords1[subsample,],add=TRUE)
This assumes you ran the code for every observation first.
Harris, P., C. Brunsdon, and M. Charlton. 2011. “Geographically Weighted Principal Components Analysis.” International Journal of Geographical Information Science 25 (10): 1717–36. doi:10.1080/13658816.2011.554838.
Reimann, C., U. Siewers, T. Tarvainen, L. Bityukova, J. Eriksson, A. Gilucis, V. Gregorauskiene, V. Lukashev, N.N. Matinian, and A. Pasieczna. 2000. “Baltic Soil Survey: Total Concentrations of Major and Selected Trace Elements in Arable Soils from 10 Countries Around the Baltic Sea.” Science of The Total Environment 257 (2–3): 155–70. doi:10.1016/S0048-9697(00)00515-5.