Introduction

GWPCA can be used for a number of applications, including classification (e.g. Cluster Analysis (unsupervised) and Discriminant Analysis (supervised)) and sample design (sample selection).

In this session you will use the farm data again. Load the packages and the data into your R session and create a spatial data set. Note the use of the scale function in the code below: this is an important precursor to any PCA, standardising the data so that each variable carries equal weight. Also note the proj4string parameter, which sets the projection to British National Grid (see http://spatialreference.org/ref/epsg/27700/).

library(rgdal)
library(GISTools)
library(GWmodel)
library(e1071)
library(MASS)   # for the lda function used later
library(tmap)
# load the farm data 
# with Farmlet_No re-coded as "1, 2 or 3" for "Green, Blue and Red"
Data.1 <- read.csv("GWPCA_SA_2018_Workshopdata_v1.csv", header=T) 
# standardise the data (scaling and centering)
df <- scale(as.matrix(Data.1[,c(12:33)])) 
# create SPDF
gw.spdf <- SpatialPointsDataFrame(Data.1[,2:3], 
                                  data = data.frame(df),
                                  proj4string=CRS("+init=epsg:27700"))
# Read in shapefile
shp <- readOGR(".","FP_Fields_Fenced_post13082013")
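
It is worth checking that the standardisation has worked: each column of df should now have a mean of (approximately) zero and a standard deviation of one. A quick sanity check:

# quick check: scaled columns should have mean ~0 and sd 1
round(colMeans(df), 6)
round(apply(df, 2, sd), 6)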

PCA

The global principal components of the variables can be computed. The variables are in the data slot of the gw.spdf object. The first step below computes the PCA; the second appends the scores to the data frame in gw.spdf.

pca <- princomp(gw.spdf@data,cor=TRUE,scores=TRUE)
gw.spdf@data <- cbind(gw.spdf@data,pca$scores)
head(gw.spdf)
        Al          As         Ca         Cd          Co        Cr        Cu        Fe
1 2.286476  0.80978276  0.4037600 -1.5423474 -0.06547790 0.1557249 -0.818614 -1.981890
2 2.799274  0.19860719 -0.6981407 -0.9998981 -0.20786264 0.3432710 -1.822840 -2.730610
3 2.320538  0.14387505 -0.5128341 -1.1807145 -0.02518033 1.1448794 -1.293537 -1.929330
4 2.840022  0.49051194 -0.5275000 -0.8190816 -0.13264052 1.5925701 -1.304413 -2.028143
5 2.541374  0.08610224 -0.8375275 -1.1807145 -0.06279140 1.2961263 -1.378733 -2.020927
6 2.203060 -0.06289081 -0.8471894 -1.1807145 -0.05741839 0.9936325 -1.322540 -1.841374
         K        Mg        Mn        Mo       Na          Ni          P        Pb
1 1.254644 1.8565125 -1.303652 -1.253894 2.155344  0.64458885 -0.6096284 0.4153414
2 1.449575 1.2339111 -2.213102 -1.600074 2.683832 -0.09861288 -1.5266654 0.3729909
3 1.328695 0.7790655 -1.708819 -1.332571 2.407962 -0.09861288 -1.0116351 0.4953370
4 1.627575 0.9888803 -1.764465 -1.395513 2.851921  0.09756219 -1.3897027 0.4482808
5 1.551981 0.7586163 -1.836612 -1.426984 2.882438 -0.06465950 -1.2535064 0.5212179
6 1.402872 0.5963829 -1.763824 -1.316836 2.710232 -0.15520185 -1.1169494 0.5823910
           S       Se       Ti         Zn         SOM         pH    Comp.1    Comp.2
1  0.6242669 1.293269 2.016818  0.3478990  1.02723096 -0.3944766 -2.462340 0.8682224
2 -0.2709719 1.087907 2.320318 -0.7190034  0.05735695 -1.5394950 -4.476816 2.9848874
3 -0.1875817 1.019453 1.574097 -0.3956996  0.33566863 -0.8163254 -3.628524 2.1806874
4 -0.3419523 1.567085 2.331456 -0.4603604 -0.09444941 -1.1779102 -4.800811 1.8952403
5 -0.3414781 1.224815 1.835831 -0.5250211 -0.03963045 -1.4792308 -4.423372 2.4817529
6 -0.3380792 0.950999 1.387542 -0.5250211  0.05735695 -1.0573820 -3.817838 2.4638880
    Comp.3     Comp.4      Comp.5    Comp.6   Comp.7     Comp.8      Comp.9    Comp.10
1 3.710113 -0.8407333 -0.61974258 0.9194416 1.014163 -1.4931847 0.006786866 -1.1059213
2 3.215674 -0.9052513  0.03671765 0.7317994 1.586621 -1.6371861 0.130228919 -1.1773463
3 2.950204 -0.6177336 -0.06190955 0.7804780 1.057393 -0.8917427 0.164422342 -0.5885168
4 3.466900 -0.6463973 -0.18627969 0.6768703 1.486462 -1.2858564 0.216551679 -0.7628597
5 3.128145 -0.4197616  0.35675308 0.8502963 1.224432 -1.2209977 0.271276205 -0.6960382
6 2.723372 -0.3893284  0.39090562 0.9320225 1.153410 -0.8402552 0.115899579 -0.5486613
    Comp.11    Comp.12   Comp.13   Comp.14     Comp.15   Comp.16    Comp.17     Comp.18
1 1.5466295  0.5547534 0.8565750 1.4836818  0.02190296 0.5677329 -0.6057264 0.417208494
2 0.8916275  0.4863108 0.8075032 0.7082322  0.47624470 0.4094964 -0.4734630 0.001729923
3 0.9551369 -0.1580545 0.2525130 0.9365356 -0.05229419 0.3837194 -0.3293832 0.222621882
4 0.7290484 -0.4065990 0.4412459 0.6828750 -0.25417791 0.3929000 -0.4358772 0.002191132
5 0.9035869 -0.3790722 0.3416132 0.7048892 -0.09645480 0.3570410 -0.3890014 0.114064953
6 0.9406274 -0.2383287 0.1514403 0.7600118  0.04197376 0.4089871 -0.3039554 0.161800413
     Comp.19   Comp.20     Comp.21     Comp.22
1 -0.9257268 0.2360492 -0.16993557 0.064061332
2 -0.7179861 0.2435590 -0.01565366 0.131697613
3 -0.3524445 0.4051074  0.22736376 0.007768694
4 -0.5373519 0.3336271  0.13593402 0.002818879
5 -0.4455517 0.2748323  0.19180951 0.024954606
6 -0.3031819 0.1990659  0.21715812 0.036357933

Looking at the PCA loadings helps to interpret the scores. Recall - probably from several years ago! - that the first principal component is an index, made from a weighted combination of the variables, that captures the greatest variability in the study area. By looking at the values of the weights, we get a picture of which single characteristic accounts for most of the variability in the multivariate data. The second principal component finds the next most variable weighted index, subject to the further constraint that it is uncorrelated with the first, and so on for the remaining components. Below, the loadings are examined:

pca$loadings

From these, it seems like component 1 (Comp.1) contrasts Al and Cu with all of the other variables. This can be mapped:

tm_shape(gw.spdf) + 
  tm_dots(col='Comp.1',size=0.4,title='1st PC') + 
  tm_compass() + tm_style_col_blind()

The loadings suggest a different contrast for the second component - this can also be mapped:

tm_shape(gw.spdf) + 
  tm_dots(col='Comp.2',size=0.4,title='2nd PC') + 
  tm_compass() + tm_style_col_blind()
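
How many components are worth interpreting? The summary and scree plot of the global PCA give a quick indication of how much of the total variance each component accounts for - a short sketch:

# proportion of variance explained by each component
summary(pca)
# scree plot of the component variances
screeplot(pca, type = "lines")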

Going back to the first principal component, we can also use boxplot to find the outlying values:

boxplot(gw.spdf$Comp.1,horizontal = TRUE)

There are some outlying high values. The boxplot function, as well as drawing the plot, returns information about the outliers themselves. Extracting this information can be a little convoluted, but it can be done: you extract the outlying values and match them against the whole list for the variable of interest. For example, to find which cases correspond to the outlying values of Comp.1, use

outl <- which(gw.spdf$Comp.1 %in% boxplot(gw.spdf$Comp.1,plot=FALSE)$out)
outl
[1]  233  388 1070

Then, by subsetting gw.spdf to just these values, only the outlying cases are mapped. In this case, these are places with outstandingly high scores on the first component - recall that this contrasts Al and Cu with the other variables.

tm_shape(shp) + 
  tm_polygons(col = "grey") + 
  tm_shape(gw.spdf[outl,]) + 
  tm_dots(col = "Comp.1", title.size='First PC', size = 0.4) +
  tm_compass()  

GWPCA and preparation

As before, a GWPCA object is created using the functions in the GWmodel package. First create the distance matrix and determine the GWPCA bandwidth. Note that vars is restricted to the 22 scaled variables (colnames(df)) - the global PC scores appended to gw.spdf earlier should not be included in the GWPCA:

# create distance matrix
d <- gw.dist(coordinates(gw.spdf))
# select bandwidth
bw.choice <- bw.gwpca(gw.spdf, 
                      vars = colnames(df), 
                      k = 11, dMat = d, adaptive = T)
Adaptive bandwidth(number of nearest neighbours): 662 CV score: 1798.059 
Adaptive bandwidth(number of nearest neighbours): 410 CV score: 2753.779 
Adaptive bandwidth(number of nearest neighbours): 818 CV score: 1931.68 
Adaptive bandwidth(number of nearest neighbours): 565 CV score: 1851.935 
Adaptive bandwidth(number of nearest neighbours): 721 CV score: 1846.611 
Adaptive bandwidth(number of nearest neighbours): 624 CV score: 1777.419 
Adaptive bandwidth(number of nearest neighbours): 602 CV score: 1785.127 
Adaptive bandwidth(number of nearest neighbours): 639 CV score: 1779.82 
Adaptive bandwidth(number of nearest neighbours): 616 CV score: 1777.369 
Adaptive bandwidth(number of nearest neighbours): 609 CV score: 1781.611 
Adaptive bandwidth(number of nearest neighbours): 618 CV score: 1777.203 
Adaptive bandwidth(number of nearest neighbours): 621 CV score: 1777.601 
Adaptive bandwidth(number of nearest neighbours): 617 CV score: 1776.863 
Adaptive bandwidth(number of nearest neighbours): 615 CV score: 1777.499 
Adaptive bandwidth(number of nearest neighbours): 616 CV score: 1777.369 
Adaptive bandwidth(number of nearest neighbours): 615 CV score: 1777.499 
Adaptive bandwidth(number of nearest neighbours): 615 CV score: 1777.499 
Adaptive bandwidth(number of nearest neighbours): 614 CV score: 1778.489 
Adaptive bandwidth(number of nearest neighbours): 614 CV score: 1778.489 
Adaptive bandwidth(number of nearest neighbours): 613 CV score: 1777.34 
Adaptive bandwidth(number of nearest neighbours): 613 CV score: 1777.34 
Adaptive bandwidth(number of nearest neighbours): 612 CV score: 1778.156 
Adaptive bandwidth(number of nearest neighbours): 612 CV score: 1778.156 
Adaptive bandwidth(number of nearest neighbours): 611 CV score: 1779.476 
Adaptive bandwidth(number of nearest neighbours): 611 CV score: 1779.476 
Adaptive bandwidth(number of nearest neighbours): 610 CV score: 1780.583 
Adaptive bandwidth(number of nearest neighbours): 610 CV score: 1780.583 
Adaptive bandwidth(number of nearest neighbours): 609 CV score: 1781.611 
Adaptive bandwidth(number of nearest neighbours): 609 CV score: 1781.611 
Adaptive bandwidth(number of nearest neighbours): 608 CV score: 1782.159 
Adaptive bandwidth(number of nearest neighbours): 608 CV score: 1782.159 
Adaptive bandwidth(number of nearest neighbours): 607 CV score: 1782.531 
Adaptive bandwidth(number of nearest neighbours): 607 CV score: 1782.531 
Adaptive bandwidth(number of nearest neighbours): 606 CV score: 1782.673 
Adaptive bandwidth(number of nearest neighbours): 606 CV score: 1782.673 
bw.choice
[1] 617
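
As an aside, a fixed (distance) bandwidth could be selected instead by setting adaptive = FALSE - a sketch only, as this takes a while to run:

# fixed bandwidth alternative - a distance in metres, as the data are in OSGB coordinates
bw.fixed <- bw.gwpca(gw.spdf, vars = colnames(df), k = 11, dMat = d, adaptive = FALSE)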

And finally these can be used to specify a GWPCA as before:

# do gwpca on the 22 scaled variables
pca.gw <- gwpca(gw.spdf, vars = colnames(df), 
                bw = bw.choice, k = 11, dMat = d, adaptive = T)

We will use the loadings as input to the classification, but the signs need to be corrected first: loadings (eigenvectors) are only identified up to sign, so they can flip arbitrarily from one location to the next and must be aligned before they can be meaningfully compared - Harry can explain further!

# for each row, take the element with the largest absolute value
rowmax <- function(z) z[cbind(1:nrow(z),max.col(abs(z)))]
# flip the sign of each row so that its dominant loading is positive
pca.gw$loadings[,,1] <- sweep(pca.gw$loadings[,,1],1,sign(rowmax(pca.gw$loadings[,,1])),'*')
pca.gw$loadings[,,2] <- sweep(pca.gw$loadings[,,2],1,sign(rowmax(pca.gw$loadings[,,2])),'*')
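
A minimal illustration of the sign problem on a small simulated matrix (unrelated to the farm data): an eigenvector v and its negation -v describe exactly the same axis of variation, so a PCA routine is free to return either, and locally fitted PCAs can return different signs at different locations.

# sign ambiguity demo: v and -v are equally valid loading vectors
set.seed(1)
m <- matrix(rnorm(100), 20, 5)
v <- eigen(cov(m))$vectors[, 1]
var(m %*% v)   # variance of the scores using v ...
var(m %*% -v)  # ... is identical using -v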

Finally we can create a dataset that has the GWPCA outputs attached to each data point, as input for the cluster analysis and the sample design:

# Data for CA and DA includes the original data plus GWPCA loadings
Data.CA.DA <- data.frame(Data.1[,c(7,12:33)],
                         pca.gw$loadings[,,1], 
                         pca.gw$loadings[,,2])

1. GWPCA for Cluster Analysis

The code below runs a cluster analysis. Note the comments, which explain each step:

# Number of classes
CA.cut <- 3
# Cluster Analysis
CA.out <- hclust(dist(scale(Data.CA.DA)), "ward.D2")
# Cut the dendrogram to yield 3 groups
CA.out.x <- cutree(CA.out, k=CA.cut)
sort(CA.out.x)
   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  [43] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  [85] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [127] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [169] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [253] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [295] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [337] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [379] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [421] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
 [463] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [505] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [547] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [589] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [631] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [673] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [715] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [757] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [799] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [841] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [883] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [925] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [967] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[1009] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[1051] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
CA.out.xx <- levels(factor(CA.out.x))
# Comparing the results...
Cont.table.sum <- table(CA.out.x, Data.1$Farmlet_No)
print(Cont.table.sum)
        
CA.out.x   1   2   3
       1  82 265 106
       2  52   4 224
       3 217  88  32
# Find optimal match between the two classifications from e1071 library
Class.match.sum <- matchClasses(as.matrix(Cont.table.sum),method="exact")
Direct agreement: 3 of 3 pairs
Cases in matched pairs: 65.98 %
# Print the confusion table, with rows permuted to maximize the diagonal
print(Cont.table.sum[,Class.match.sum])
        
CA.out.x   2   3   1
       1 265 106  82
       2   4 224  52
       3  88  32 217
# To find percentage ($diag) correct and kappa ($kappa)
cm.sum <- Cont.table.sum[,Class.match.sum]
CA.results <- classAgreement(as.matrix(cm.sum))
CA.results
$diag
[1] 0.6598131

$kappa
[1] 0.489897

$rand
[1] 0.663415

$crand
[1] 0.2496843

The results show the proportion classified correctly ($diag) and the kappa statistic ($kappa) - a poor classification in this case.
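
To see how the three clusters emerge, you could also plot the dendrogram and mark the cut (a quick sketch):

# plot the cluster tree and outline the 3-group cut
plot(CA.out, labels = FALSE, hang = -1)
rect.hclust(CA.out, k = CA.cut, border = "red")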

2. GWPCA for Discriminant Analysis

The LDA code below runs a loop that undertakes the classification 100 times, each with a 5:95 training:test split (the lda function is from the MASS package):

# number repeats / simulations
nsim=100
# set up the results matrices
results.percent <- matrix(nrow=nsim,ncol=1)
results.kappa <- matrix(nrow=nsim,ncol=1)
# run the loop 
# Note this can fall over - stratified sampling would be more robust
for(i in 1:nsim) {
  Data.DA <- data.frame(Data.CA.DA)
  train.x <- sample(1:1070, 54) # 5:95 split
  table(Data.DA$Farmlet_No[train.x])
  # the response (Farmlet_No) appears on the left-hand side only,
  # so it is excluded from the predictors
  z <- lda(Farmlet_No ~ ., data = Data.DA, prior = c(1,1,1)/3, subset = train.x)
  DA.predict.x <- predict(z, Data.DA[-train.x, ])$class
  actual.x <- Data.1$Farmlet_No[-train.x]
  # Comparing the results...
  Cont.table.sum.da <- table(DA.predict.x, actual.x)
  #print(Cont.table.sum.da)
  result <- classAgreement(as.matrix(Cont.table.sum.da))
  results.percent[i] <- result$diag
  results.kappa[i] <- result$kappa
}
Warning in lda.default(x, grouping, ...): variables are collinear
(this warning is repeated on each of the 100 runs)
DA.results <- data.frame(correct = mean(results.percent), kappa = mean(results.kappa))
DA.results
    correct     kappa
1 0.7740551 0.6611058

The results show a much better classification this time.
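
The averages hide some run-to-run variation. A quick way to see the spread across the 100 simulations (a sketch):

# distribution of the proportion correctly classified over the 100 runs
hist(results.percent, xlab = "Proportion correct",
     main = "LDA accuracy over 100 runs")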

Comparing the results

# the Cluster Analysis results
Cont.table.sum[,Class.match.sum]
        
CA.out.x   2   3   1
       1 265 106  82
       2   4 224  52
       3  88  32 217
# create spatial data
Data.CA.spdf <- SpatialPointsDataFrame(coordinates(gw.spdf), 
                                       data.frame(Data.1, CA.out.x),
                                       proj4string=CRS("+init=epsg:27700"))
Data.DA.spdf <- SpatialPointsDataFrame(coordinates(gw.spdf[-train.x,]),
                                       data = data.frame(DA.predict.x),
                                       proj4string=CRS("+init=epsg:27700"))
# convert the farm class to a factor
Data.CA.spdf$Farmlet_No <- as.factor(Data.CA.spdf$Farmlet_No)

The results can be mapped for the Cluster Analysis…

tm_shape(Data.CA.spdf) +
  tm_dots(c("Farmlet_No","CA.out.x"),  palette="Spectral", size = 0.2) + 
  tm_style_col_blind(legend.position = c("right", "top"))

…and the Discriminant Analysis

tm_shape(Data.DA.spdf) +
  tm_dots(c("DA.predict.x"),  palette="Spectral", size = 0.2) + 
  tm_style_col_blind(legend.position = c("right", "top"))

3. GWPCA for network (sample) re-design

We can use the GWPCA outputs to help inform a location-allocation algorithm. Again this requires a distance matrix. Location-allocation typically seeks to minimise the total distance of a population to a set of chosen locations. Here the 'populations' are derived from the GWPCA PTVs (percentages of total variance) created below in the props R object.

The tbart package should be installed and loaded:

install.packages("tbart")
library(tbart)

Basically we would like to select sample locations / sites in areas of low PTV - that is, areas with a high proportion of 'variance not explained'. The PTV (the proportion of the variance explained by each component) can be used as input to a sample design. The code below determines the PTV from the standard/basic GWPCA for the first three components combined:

# PTV: proportion of total variance accounted for by the first n components
prop.var <- function(gwpca.obj, n.components) {
  return(rowSums(gwpca.obj$var[,1:n.components])/rowSums(gwpca.obj$var))
}
# the first 3 components
props <- prop.var(pca.gw,3)
# 'population' weights: the percentage of variance NOT explained locally
gw.spdf$popns.nd <- 100*(1-props)
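
To see where the first three components explain the data least well - the areas the sample design should target - the weights can be mapped (a quick sketch, following the tm_dots style used above):

# map the local 'variance not explained' weights
tm_shape(gw.spdf) + 
  tm_dots(col = "popns.nd", size = 0.3, title = "% variance not explained")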

Here you will identify the best 25 sites from 100 possible ones and create a new distance matrix. The data are sampled down to 100 locations to ensure the process does not take too long:

set.seed(100)
gw2.spdf <- gw.spdf[sample(1:nrow(gw.spdf), 100), ]
# have look if you want! 
# plot(gw2.spdf)
new.sites <- 25
d <- gw.dist(coordinates(gw2.spdf))

Now the Teitz-Bart algorithm can be run. This starts with an initial set of candidate locations and then tries swapping them for others to improve the overall result (the minimised total weighted distance).

Now calculate a population-weighted demand (distance) matrix:

d.w <- d * gw2.spdf$popns.nd 

Then use the tb function in tbart, supplying the weighted matrix, to determine the optimal set of 25 from the 100 potential locations:

demands.nd <- tb(gw2.spdf, p = new.sites, metric = d.w) # select sites minimising the weighted distance

These can be plotted …

plot(gw2.spdf)
plot(gw2.spdf[demands.nd,], add = T, col = "red", pch = 19)

Alternatively, the catchments for each selected location can be identified and mapped:

# note: the allocation function in tbart is allocations
allocations.list <- allocations(gw2.spdf, p = 25, metric = d.w)
plot(shp)
plot(gw2.spdf, col = allocations.list, pch = 19, add = T)
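
As a quick check, you could tabulate how many of the 100 points fall in each catchment (assuming allocations.list holds, for each point, the index of the site it is allocated to):

# number of points allocated to each selected site
table(allocations.list)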