May 2018

Session 1

Part 1. Introductions and setup

Who we are!

Course structure

There will be 5 sessions

  • Introduction to Regression & GWR (Lex)
  • Introduction to PCA & GWPCA (Harry)
  • GWPCA for Outlier Detection (Harry)
  • GWPCA for Network Design (Harry)
  • GWPCA for Spatial Classification (Harry!)

There will be breaks

  • 10:30 - Coffee / Tea
  • 12:30 - Lunch
  • 15:10 - Coffee / Tea

Aim to finish at 17:00

Course structure

We have set aside time in the first session (Introduction) to make sure that everything is set up:

  • you have R installed
  • you can load all of the packages
  • you have all of the materials on your computer
  • you have established a robust workflow
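
One way to check that the packages load is to ask R which ones it can find. This is a minimal sketch only: the names in `course_pkgs` are assumptions (tmap appears later in this session; GWmodel is a guess at the GWR package), so swap in the list given in the exercise documents.

```r
# Assumed package list: replace with the packages named in the exercise documents
course_pkgs <- c("tmap", "GWmodel")

# requireNamespace() returns TRUE if a package is installed and can be loaded
installed <- vapply(course_pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Install anything that is missing (uncomment to run; only needed once)
# install.packages(course_pkgs[!installed])
```

Any package reported as FALSE needs installing before the exercises will run.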

Getting going

  • We will be working in R and RStudio
  • R is the engine
  • we encourage you to work in RStudio

Technical aspects

  1. Code can be typed directly into the console
  2. You should write your code in a script (like a text file)
  3. This is not a typing / English lesson - so copy and paste from the documents
  4. You can run code from the script
  5. This means your code is saved, can be easily changed and re-run etc

Working in R

Technical aspects: RStudio start

Working in R

Technical aspects: Open a script

Working in R

Technical aspects: RStudio Components

Working in R

Conceptual aspects

  1. You are not expected to remember every R command, function, etc
  2. BUT... you are expected to understand what you are trying to do with your code
    • so ask if you need to!!!
  3. For these reasons you should annotate your code - use the # in your script
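
As an illustration of annotation, here is a short script in which every step has a `#` comment saying what it does and why. It is only a sketch: the pH values are the first six shown in the data table in Part 2, and the variable names are made up for this example.

```r
# Session 1: practice script
# Purpose: find the average soil pH of the first six samples

# pH values copied from the first six rows of the course data
ph <- c(5.60, 5.41, 5.53, 5.47, 5.42, 5.49)

# mean() adds the values up and divides by how many there are
ph_mean <- mean(ph)

# round to 2 decimal places for reporting
round(ph_mean, 2)
```

Comments like these cost a few seconds to write and make the script readable when you come back to it later.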

Working in R

Conceptual aspects

  • Learning R is like learning to drive
  • It takes time to become a good driver
  • Learning by doing is important

Why use R

  • R includes a very large number of tools, functions and packages
  • R has the latest methods and tools
  • New tools often appear in R well before commercial software (sometimes 10-20 years earlier)
  • The tools in R are open (i.e. the source code is visible)
  • This supports research transparency
  • Oh - and R is free!

Course structure

In each session

  • we will briefly introduce each Part (10-15 mins)
  • then you will work through the exercise documents
  • we will be going round the lab helping where it is needed
  • you are not just expected to run the code but to play with it as well!

Questions?

Part 2. Regression / GW Regression

Overview

  • Data
  • Exploration
  • Multiple Linear Regression
  • Testing Assumptions
  • Outliers

You will find Chapters 5 and 8 of An Introduction to R for Spatial Analysis and Mapping by Chris Brunsdon and Lex Comber useful.

Data

  • you will load data (.csv format)
  • you will create spatial data
  • you will load shapefiles (.shp)
  • you will need to install some packages
    • this only needs to be done once
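
The steps above can be sketched as follows. So that it runs on its own, the example writes the first six rows of the course data to a temporary .csv file and reads them back; in the exercises you will read the real file instead. The use of the sf package, the coordinate reference system (EPSG:27700, British National Grid) and the shapefile name are assumptions made for illustration.

```r
# First six rows of the course data, typed in so the sketch is self-contained
csv_text <- "ID,Easting,Northing,S,Ca,Fe,P,SOM,pH,Slope,Aspect
1,265632,99300,698.36,3163.39,28688.60,1026.80,12.90,5.60,1.32,257.46
2,265625,99275,585.10,2527.01,25116.20,823.39,10.60,5.41,0.57,269.99
3,265650,99275,595.65,2634.03,28939.38,937.63,11.26,5.53,2.25,307.24
4,265600,99250,576.12,2625.56,28467.91,853.77,10.24,5.47,2.83,315.00
5,265625,99250,576.18,2446.51,28502.34,883.98,10.37,5.42,0.37,258.71
6,265650,99250,576.61,2440.93,29359.05,914.27,10.60,5.49,2.53,298.74"

# Write to a temporary file, then load it as you would any .csv
csv_file <- tempfile(fileext = ".csv")
writeLines(csv_text, csv_file)
df <- read.csv(csv_file)
str(df)

# Create spatial (point) data from the coordinates, if sf is installed
if (requireNamespace("sf", quietly = TRUE)) {
  # crs = 27700 is an assumption about the coordinate system
  df_sf <- sf::st_as_sf(df, coords = c("Easting", "Northing"), crs = 27700)
  print(df_sf)
}

# A shapefile would be read in a similar way (hypothetical file name):
# boundary <- sf::st_read("study_area.shp")
```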

The Data Variables

The data set contains a number of variables for a study area at Rothamsted Research, where Harry works.

  • Sulphur (S)
  • Calcium (Ca)
  • Iron (Fe)
  • Phosphorus (P)
  • Soil Organic Matter (SOM)
  • pH
  • Slope
  • Aspect
  • coordinates (Easting and Northing)

The Data Itself

ID  Easting  Northing       S       Ca        Fe        P    SOM    pH  Slope  Aspect
 1   265632     99300  698.36  3163.39  28688.60  1026.80  12.90  5.60   1.32  257.46
 2   265625     99275  585.10  2527.01  25116.20   823.39  10.60  5.41   0.57  269.99
 3   265650     99275  595.65  2634.03  28939.38   937.63  11.26  5.53   2.25  307.24
 4   265600     99250  576.12  2625.56  28467.91   853.77  10.24  5.47   2.83  315.00
 5   265625     99250  576.18  2446.51  28502.34   883.98  10.37  5.42   0.37  258.71
 6   265650     99250  576.61  2440.93  29359.05   914.27  10.60  5.49   2.53  298.74

Initial Explorations

Sulphur
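
A first look at a variable usually combines a numerical summary, a histogram and a check of how it relates to the other variables. The sketch below uses only the six rows shown in the data table, so the numbers are illustrative and say nothing reliable about the full data set; the object name `df6` is made up for this example.

```r
# Six rows of S, SOM and pH copied from the data table
df6 <- data.frame(
  S   = c(698.36, 585.10, 595.65, 576.12, 576.18, 576.61),
  SOM = c(12.90, 10.60, 11.26, 10.24, 10.37, 10.60),
  pH  = c(5.60, 5.41, 5.53, 5.47, 5.42, 5.49)
)

# Five-number summary plus the mean for each variable
summary(df6)

# Pairwise correlations between the variables
cor_mat <- cor(df6)
print(round(cor_mat, 2))

# Plots are only drawn in an interactive session such as RStudio
if (interactive()) {
  hist(df6$S, main = "Sulphur", xlab = "S")
  plot(df6$SOM, df6$S, xlab = "SOM", ylab = "S")
}
```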

Regression Models

Our Model Assumptions

We are interested in predicting S.

We assume that S is associated with:

  • Calcium (Ca)
  • Iron (Fe)
  • Phosphorus (P)
  • Soil Organic Matter (SOM)
  • pH
  • Slope
  • Aspect

Models

The equation for a regression is:

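\(y_i = \beta_{0} + \sum_{k=1}^{m}\beta_{k}x_{ik} + \epsilon_i \textsf{ for } i \in \{1, \ldots, n\}\)

  • \(y_i\) is S here
  • The \(x_{ik}\)’s are the \(m\) predictors (SOM etc.)
  • The \(\beta_k\)’s are the coefficients to be estimated and \(\epsilon_i\) is the error term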

for example

# fit a multiple linear regression of S on the seven predictors
m <- lm(S ~ Ca + Fe + P + SOM + pH + Slope + Aspect, data = df)

Summary

## 
## Call:
## lm(formula = S ~ Ca + Fe + P + SOM + pH + Slope + Aspect, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -333.55  -33.16   -0.40   31.99  285.15 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.466e+02  6.843e+01  10.910  < 2e-16 ***
## Ca           6.213e-02  4.790e-03  12.972  < 2e-16 ***
## Fe          -8.236e-03  3.765e-04 -21.874  < 2e-16 ***
## P            1.152e-01  1.186e-02   9.710  < 2e-16 ***
## SOM          1.952e+01  1.021e+00  19.118  < 2e-16 ***
## pH          -7.060e+01  1.335e+01  -5.289 1.49e-07 ***
## Slope        9.125e+00  8.159e-01  11.184  < 2e-16 ***
## Aspect       1.157e-01  1.481e-02   7.814 1.33e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.45 on 1062 degrees of freedom
## Multiple R-squared:  0.8357, Adjusted R-squared:  0.8346 
## F-statistic: 771.5 on 7 and 1062 DF,  p-value: < 2.2e-16
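
To see how the coefficients in the summary produce a prediction, the sketch below plugs the values for observation 1 (from the data table) into the regression equation. The coefficients are the rounded estimates printed above, so the result is approximate; the object names are made up for this example.

```r
# Coefficient estimates copied from the summary output above
b <- c(Intercept = 7.466e+02, Ca = 6.213e-02, Fe = -8.236e-03, P = 1.152e-01,
       SOM = 1.952e+01, pH = -7.060e+01, Slope = 9.125e+00, Aspect = 1.157e-01)

# Observation 1 from the data table; the leading 1 multiplies the intercept
x1 <- c(Intercept = 1, Ca = 3163.39, Fe = 28688.60, P = 1026.80,
        SOM = 12.90, pH = 5.60, Slope = 1.32, Aspect = 257.46)

# Fitted value: beta_0 plus the sum of beta_k * x_1k
pred_S <- sum(b * x1)

# Residual: observed S minus fitted S
resid_S <- 698.36 - pred_S

print(c(predicted = pred_S, residual = resid_S))
```

With a fitted model object, `predict(m)` and `residuals(m)` return the same quantities for every observation.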

Outliers

Studentised Residuals
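
Studentised residuals rescale each residual by an estimate of its standard error, so unusually large values stand out. A minimal sketch on simulated data (not the course data) with one planted outlier follows; the cut-off of 3 is a common rule of thumb rather than a fixed rule, and the object names are made up for this example.

```r
set.seed(1)

# Simulated data: a straight-line relationship plus noise
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 2 + 3 * d$x + rnorm(n)

# Plant one outlier at observation 10
d$y[10] <- d$y[10] + 15

m_demo <- lm(y ~ x, data = d)

# rstudent() returns the (externally) studentised residuals
s_resids <- rstudent(m_demo)

# Flag observations beyond the chosen cut-off
outliers <- which(abs(s_resids) > 3)
print(outliers)
```

For the course model the same call would be `rstudent(m)`.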

Spatial Aspects

## tmap mode set to interactive viewing

Assumptions - Normally Distributed Errors
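
One way to examine this assumption is to compare the model residuals with a normal distribution, visually with a Q-Q plot and formally with a Shapiro-Wilk test. The sketch below uses simulated data whose errors really are normal, so it only shows the mechanics; the choice of test is an assumption, not necessarily the one used in the exercises.

```r
set.seed(2)

# Simulated data with normally distributed errors
d <- data.frame(x = runif(200, 0, 10))
d$y <- 1 + 0.5 * d$x + rnorm(200)

m_demo <- lm(y ~ x, data = d)
res <- residuals(m_demo)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
sw <- shapiro.test(res)
print(sw)

# Q-Q plot: points close to the line suggest approximate normality
if (interactive()) {
  qqnorm(res)
  qqline(res)
}
```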

Variations on Regression

  • Robust Regression
    • Minimising the effects of outliers
  • Geographically Weighted Regression
    • Allowing coefficients to change geographically
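
Written in the same notation as the regression equation earlier in this Part, GWR lets each coefficient depend on location. The coordinates \((u_i, v_i)\) of observation \(i\) (here Easting and Northing) are notation introduced for this illustration:

\(y_i = \beta_{0}(u_i, v_i) + \sum_{k=1}^{m}\beta_{k}(u_i, v_i)x_{ik} + \epsilon_i\)

Each local set of coefficients is estimated by weighted least squares, with observations near \((u_i, v_i)\) given more weight than distant ones.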

Robust regression
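
A minimal sketch of the idea, on simulated data with a few contaminated observations. It assumes robust regression is done with `rlm()` from the MASS package (which uses Huber M-estimation by default); the method used in the exercises may differ.

```r
set.seed(3)

# Simulated data: the true slope is 0.5
d <- data.frame(x = 1:50)
d$y <- 2 + 0.5 * d$x + rnorm(50, sd = 0.5)

# Contaminate the last five observations
d$y[46:50] <- d$y[46:50] + 20

# Ordinary least squares is pulled towards the contaminated points
ols <- lm(y ~ x, data = d)
print(coef(ols))

# Robust regression down-weights observations with large residuals
if (requireNamespace("MASS", quietly = TRUE)) {
  rob <- MASS::rlm(y ~ x, data = d)
  print(coef(rob))
}
```

Comparing the two sets of coefficients shows how strongly a handful of outliers can influence an ordinary regression.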

Geographically Weighted Regression

Tobler’s first law of Geography

“Everything is related to everything else, but near things are more related than distant things.” (Tobler, 1970)

Justification

If Tobler’s law did not hold:

“the full range of conditions anywhere on the Earth’s surface could in principle be found packed within any small area. There would be no regions of approximately homogeneous conditions to be described by giving attributes to area objects. Topographic surfaces would vary chaotically, with slopes that were everywhere infinite, and the contours of such surfaces would be infinitely dense and contorted. Spatial analysis, and indeed life itself, would be impossible.” (de Smith et al., 2007, p. 44)

GWR

Coefficients change

         Intercept      Ca      Fe       P     SOM        pH   Slope  Aspect
Min.      -233.472  -0.026  -0.015  -0.093  -5.664  -309.436  -3.774  -0.104
1st Qu.    338.519   0.040  -0.006   0.039  10.136  -102.276  -0.542  -0.001
Median     692.092   0.060  -0.004   0.097  18.789   -68.268   1.033   0.043
Mean       659.049   0.063  -0.005   0.103  18.120   -67.445   1.456   0.086
3rd Qu.    896.233   0.082  -0.003   0.168  26.647   -21.152   3.388   0.107
Max.      1942.674   0.176   0.002   0.284  44.712   161.084   8.880   0.737
Global     746.569   0.062  -0.008   0.115  19.516   -70.596   9.125   0.116

GWR

Coefficients change over the map:

GWR

Some coefficients change sign across the study area…
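
The mechanics behind locally varying coefficients can be sketched in base R: fit a separate weighted regression at every location, giving nearby observations more weight. This is an illustration on simulated data, not the implementation used in the exercises; the Gaussian kernel, the fixed bandwidth `bw` and all object names are assumptions made for this example.

```r
set.seed(42)

# Simulated data: the effect of x on y changes from west to east
n <- 200
d <- data.frame(east = runif(n, 0, 100), north = runif(n, 0, 100), x = rnorm(n))
true_slope <- -1 + 0.02 * d$east          # runs from -1 to +1, so it flips sign
d$y <- 5 + true_slope * d$x + rnorm(n, sd = 0.3)

# Global model: one slope for the whole study area
global <- lm(y ~ x, data = d)

# Local models: one weighted regression per location
bw <- 15  # kernel bandwidth in map units (arbitrary choice)
local_coef <- t(sapply(seq_len(n), function(i) {
  # distance from location i to every observation
  dist_i <- sqrt((d$east - d$east[i])^2 + (d$north - d$north[i])^2)
  # Gaussian kernel: weight decays smoothly with distance
  w <- exp(-0.5 * (dist_i / bw)^2)
  coef(lm(y ~ x, data = d, weights = w))
}))

# Compare the spread of local slopes with the single global slope
print(summary(local_coef[, "x"]))
print(coef(global)[["x"]])
```

In this simulation the local slopes are negative in the west and positive in the east, while the global slope averages the two and hides the change. In practice the bandwidth is chosen from the data rather than fixed by hand.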

Questions?