This session sets the scene for the main sessions in the rest of the workshop. The aims of this session are to:

  1. Make sure you have R / RStudio installed and ready on your computer and that you have the basic tools for the workshop
  2. Load some data
  3. Develop a regression
  4. Create spatial data
  5. Develop a Geographically weighted Regression (GWR)

1. Introduction and getting started

1.1 Why use R?

R was initially developed by Robert Gentleman and Ross Ihaka of the Department of Statistics at the University of Auckland. R is increasingly becoming the default software package in many areas of science. There are a number of reasons for this:

  • R includes a very large number of tools, functions and packages
    • packages are libraries of tools
    • packages are written by scientists in different subject areas
  • R has the latest methods and tools
  • New tools are in R 10-20 years before commercial software
    • e.g. GWR was around for 10 years before being in ArcGIS
  • The tools in R are open (i.e. the source code is visible)
    • you can see exactly what is being done
    • most commercial tools are hidden - black boxes
  • R is free!

For these reasons R is becoming widely used in many areas of scientific activity and quantitative research.

R can be found at the CRAN website: https://cran.r-project.org/

1.2 Working in R

There are 2 key points about working in R

  1. When working in R, either writing your own code or copying and pasting from these materials, you should write the code into a script or document. Go to File > New File > R Script to open a new R file.

The reasons for this are that you get used to using the R console, and that running the code will help your understanding of the code’s functionality. Then, in order to run the code in the R console, a quick way to enter it is to highlight the code (with the mouse or using the keyboard controls) and then press Ctrl-R (or Cmd-Enter on a Mac).

  2. Learning R is like learning to drive. You may pass your test, but to become a good driver it is time behind the wheel that counts. The importance of learning by doing and getting your hands dirty cannot be overstated. Some of the code might look a bit fearsome when first viewed, especially in later sessions, BUT the only really effective way to understand it is to give it a try. A further minor point is that in the code, comments are prefixed by \(#\) and are ignored by R when entered into the console.

There are some core concepts that you should be familiar with:

1. Assignment: this is the basic process of giving R objects values:

vals <- c(4.3,7.1,6.3,5.2,3.2,2.1)

2. Operations: having assigned values to an object, it can be manipulated using functions:

vals*2
## [1]  8.6 14.2 12.6 10.4  6.4  4.2
sum(vals)
## [1] 28.2
mean(vals)
## [1] 4.7

3. Indexing: elements of R objects can be referred to by their index:

vals[1]    # first element
## [1] 4.3
vals[1:3]   # a subset of elements 1 to 3
## [1] 4.3 7.1 6.3
sqrt(vals[1:3]) #square roots of the subset
## [1] 2.073644 2.664583 2.509980
vals[c(5,3,2)]  # a subset of elements 5,3,2 - note the ordering
## [1] 3.2 6.3 7.1

Note you can copy and paste the code above into a new R/RStudio script.

1.3 R packages

When you install R / RStudio it comes with a large number of tools already (referred to as base functionality).

However, one of the joys of R is the community of users. Users share what they do and create in R in a number of ways. One of these is through packages. Packages are collections of related functions that have been created, tested and supported with help files. These are bundled into a package and shared with other R users via the CRAN repository, from which they can be downloaded.

There are 1000s of packages in R. These contain sets of tools and can be written by anyone, and the number of packages is continually growing. Once packages are installed, they can be loaded as libraries. The background to R, along with documentation and information about packages as well as the contributors, can be found at the R Project website http://www.r-project.org.

Packages can be found at the CRAN website - https://cran.r-project.org/web/packages/:

Users install the package once to mount it on their computer, and then it can be called in R scripts as required.

The basic operations are

  1. install the package before the first time it is used
    • you may have to set a mirror site the first time you install a package
    • this is only done once
  2. load the package using the library function to use the package tools
    • this is done for each R session

So a typical way to do this is

install.packages("<package name>", dep = TRUE)

For this workshop you will require a number of packages to be installed.

The code below will make sure you have all of the R packages you will need installed. The tests check whether a given package is in the list of installed packages, and installs it if not. Package installation only needs to be done once - not every time you use R - so if you are sure you have all of the packages installed already, you can skip this step.

# set a mirror
options(repos = c(CRAN = "http://cran.rstudio.com"))
# test for package existence and install
if (!is.element("tidyverse", installed.packages()))
    install.packages("tidyverse", dep = T)
if (!is.element("GISTools", installed.packages()))
    install.packages("GISTools", dep = T)
if (!is.element("GWmodel", installed.packages())) 
    install.packages("GWmodel", dep = T)
if (!is.element("MASS", installed.packages())) 
    install.packages("MASS", dep = T)
if (!is.element("rgdal", installed.packages()))
    install.packages("rgdal", dep = T)
if (!is.element("tmap", installed.packages())) 
    install.packages("tmap", dep = T)
if (!is.element("e1071", installed.packages()))
    install.packages("e1071", dep = T)
if (!is.element("knitr", installed.packages()))
    install.packages("knitr", dep = T)
if (!is.element("car", installed.packages()))
    install.packages("car", dep = T)
if (!is.element("gclus", installed.packages()))
    install.packages("gclus", dep = T)
if (!is.element("kableExtra", installed.packages()))
    install.packages("kableExtra", dep = T)

Note that the code above installs each package and all of its dependencies (dep = TRUE). Once we are sure all the packages are installed, you need to load them into the current session:

# load into the R session
library(tidyverse)
library(GISTools)
library(GWmodel)
library(MASS)
library(rgdal)
library(tmap)
library(e1071)
library(knitr)
library(car)
library(gclus)
library(kableExtra)
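
For reference, the same install-and-load steps can be written more compactly by looping over a vector of package names - an equivalent sketch rather than a required part of the workshop:

# the packages used in this workshop
pkgs <- c("tidyverse", "GISTools", "GWmodel", "MASS", "rgdal",
          "tmap", "e1071", "knitr", "car", "gclus", "kableExtra")
# install any that are missing, then load each one
for (p in pkgs) {
  if (!is.element(p, installed.packages())) install.packages(p, dep = TRUE)
  library(p, character.only = TRUE)
}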

Also note that packages can also be installed using the RStudio menu

Tools > Install Packages…

2. Data

There are a number of ways of loading data into your R session:

  1. read a local file in a proprietary format (e.g. an Excel file or .csv file)
  2. read a local R formatted binary file (typically with an .rda or .RData extension)
  3. download and manipulate data from the internet
  4. read a file from somewhere on the internet (proprietary or R binary format) - we may do some of this

We are not going to cover the last two in any detail in this workshop.

2.1 Loading tabular data from your computer

Loading .csv format data

The base installation of R comes with core functions for reading .txt, .csv and other tabular formats. To load data from local files you need to point R / RStudio to the directory that contains them. One way is to use the setwd() function, as in the examples below:

## Mac
setwd("/Users/geoaco/Desktop/")
## Windows
setwd("C:\\")

Another is to use the menu system

Session > Set Working Directory … which gives you options to choose from.

However you do it, you should now set your working directory to the folder that contains the file GWPCA_SA_2018_Workshopdata.csv and run the code below to load the data:

dat <- read.csv("GWPCA_SA_2018_Workshopdata.csv")

This loads a data table of 1070 rows and 33 columns. You can inspect the data in a few ways.

## dimensions - rows and columns
dim(dat)
## [1] 1070   33
## column / variable names
names(dat)
##  [1] "ID"                   "Easting"              "Northing"            
##  [4] "Field_No"             "Field_East_Centroid"  "Field_North_Centroid"
##  [7] "Farmlet_No"           "Soil_Class_No"        "Elevation"           
## [10] "Slope"                "Aspect"               "Al"                  
## [13] "As"                   "Ca"                   "Cd"                  
## [16] "Co"                   "Cr"                   "Cu"                  
## [19] "Fe"                   "K"                    "Mg"                  
## [22] "Mn"                   "Mo"                   "Na"                  
## [25] "Ni"                   "P"                    "Pb"                  
## [28] "S"                    "Se"                   "Ti"                  
## [31] "Zn"                   "SOM"                  "pH"
## look at the first 6 rows and 7 columns
dat[1:6,1:7]
##   ID Easting Northing Field_No Field_East_Centroid Field_North_Centroid
## 1  1  265632    99300        1            265618.9             99134.54
## 2  2  265625    99275        1            265618.9             99134.54
## 3  3  265650    99275        1            265618.9             99134.54
## 4  4  265600    99250        1            265618.9             99134.54
## 5  5  265625    99250        1            265618.9             99134.54
## 6  6  265650    99250        1            265618.9             99134.54
##   Farmlet_No
## 1          3
## 2          3
## 3          3
## 4          3
## 5          3
## 6          3
## use the summary function for the first 7 columns
summary(dat[,1:7])
##        ID            Easting          Northing        Field_No     
##  Min.   :   1.0   Min.   :265425   Min.   :97475   Min.   : 1.000  
##  1st Qu.: 268.2   1st Qu.:265650   1st Qu.:97975   1st Qu.: 3.000  
##  Median : 535.5   Median :265850   Median :98250   Median : 6.000  
##  Mean   : 535.5   Mean   :265907   Mean   :98309   Mean   : 6.313  
##  3rd Qu.: 802.8   3rd Qu.:266175   3rd Qu.:98650   3rd Qu.: 9.000  
##  Max.   :1070.0   Max.   :266528   Max.   :99300   Max.   :15.000  
##  Field_East_Centroid Field_North_Centroid   Farmlet_No  
##  Min.   :265505      Min.   :97629        Min.   :1.00  
##  1st Qu.:265672      1st Qu.:97968        1st Qu.:1.00  
##  Median :265885      Median :98313        Median :2.00  
##  Mean   :265908      Mean   :98308        Mean   :2.01  
##  3rd Qu.:266173      3rd Qu.:98604        3rd Qu.:3.00  
##  Max.   :266451      Max.   :99135        Max.   :3.00

The default for read.csv is that the file has a header (i.e. the first row contains the names of the columns) and that the separator between values in any record is a comma. However, these can be changed depending on the nature of the file you are seeking to load into R. A number of different types of files can be read into R. You should examine the help files for reading data in different formats. Enter ??read to see some of these listed. You will note that read.table and write.table require more parameters to be specified than read.csv and write.csv.
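
For example, a tab-delimited text file with no header row might be read with something like the following (illustrative only - the file name here is hypothetical):

# read a tab-delimited file that has no header row (hypothetical file name)
dat2 <- read.table("some_other_file.txt", header = FALSE, sep = "\t")
# read.csv is essentially read.table with header = TRUE and sep = ","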

Load R binary files

You can also load R binary files. These have the advantage of being very efficient at storing data and quicker to load than, for example, .csv files:

load("data.wk.rda")
## use ls to see what is loaded
ls()

You should see that a variable called data.wk has been loaded. It is the same as the data read into dat. You can explore the data.wk R object if you want to using the functions above that were applied to dat.

2.2 Saving data

Saving .csv format data

Data can be written into a Comma Separated Values (.csv) file using the command write.csv and then read back into a different variable, as follows:

write.csv(dat, file = "new_data.csv")

This writes a .csv file into the current working directory. If you open it using a text editor or spreadsheet software, you will see that it has the expected columns plus an index for each record. This is because the default for write.csv includes row.names = TRUE. Again, examine the help file for this function.

write.csv(dat, file = "test.csv", row.names = F)

Saving R Data files

It is possible to save variables that are in your workspace to a designated .rda or .RData file. This can be loaded at the start of your next session. Saving your workspace saves everything that is present in your workspace - as listed by ls() - whilst the save command allows you to specify what variables you wish to save.

There are a number of ways to do this:

  • You can save the workspace using the drop down menus Session > Save Workspace As…
  • You can save the workspace using the save.image function
save.image(file = "myworkspace.RData")
  • you can save identical elements using the save function
# this will save everything in the workspace
save(list = ls(), file = "MyData.RData")
# this will save just dat
save(list = "dat", file = "MyData.RData")
# this will save dat and data.wk
save(list = c("dat", "data.wk"), file = "MyData.RData")

3. Regression

The dat data is of data.frame format. The code below assigns a subset of the data variables to df, a tibble class of data frame:

# have a look!
class(dat)
# assign selected variables to df
df <- as_tibble(dat[, c(1:3, 28, 14, 19, 26, 32, 33, 10, 11)])

The df object is in tibble format and contains a number of variables.

To see this in R, enter

head(df)

Our aim in this practical is to build a model that describes the Sulphur values (S) using the factors listed above. We will use this to understand a number of key features in regression.

3.1 Initial explorations

We should have a look at the data to see what kinds of properties it has - we could use previously introduced functions such as summary to summarise the attributes. You may wish to do this now. But in terms of model selection, what we are interested in is the degree to which different variables are good predictors of sulphur.

In order to examine the relationships between the variables in df we can plot the continuous data against each other and use the cor function to examine collinearity amongst them:

plot(df[, -c(1:3)], cex = 0.5, col = grey(0.145,alpha=0.5))
Correlations between numeric variables.

Another useful tool here is to show the upper triangle of the scatterplot matrix with smoothed trend lines. These are achieved with lowess curve fits (Cleveland, 1979) - smooth (but possibly curved) bivariate trend lines - and provide a good way of judging by eye whether there is collinearity in the predictor variables. Essentially, a straight-line shaped trend without too much scattering of the individual points suggests collinearity might be an issue. When two predictors are correlated it can be difficult to identify whether it is one or the other (or both) that influences the quantity to be predicted. The code below does this. Note that as collinearity amongst the predictors is the key concern here, we focus on these, ignoring S for this graphic. The upper.panel=panel.smooth argument causes the lowess curves to be added.

plot(df[,-c(1:4)], cex = 0.5, col = grey(0.145,alpha=0.5), upper.panel=panel.smooth)

Visually, it seems that there may be some collinearity between Ca, pH and P, and between Slope and P, for example.

A numerical table of these relationships as correlations can be generated with the kable function:

library(knitr)
tab <- round(cor(df[,-c(1:3)]), 3)
kable(tab, booktabs = T, format = 'markdown')
S Ca Fe P SOM pH Slope Aspect
S 1.000 0.614 -0.436 0.690 0.808 0.062 0.455 0.256
Ca 0.614 1.000 -0.010 0.662 0.517 0.546 0.289 0.101
Fe -0.436 -0.010 1.000 0.016 -0.263 0.261 0.018 -0.083
P 0.690 0.662 0.016 1.000 0.679 0.167 0.367 0.149
SOM 0.808 0.517 -0.263 0.679 1.000 0.032 0.395 0.183
pH 0.062 0.546 0.261 0.167 0.032 1.000 0.089 0.208
Slope 0.455 0.289 0.018 0.367 0.395 0.089 1.000 0.133
Aspect 0.256 0.101 -0.083 0.149 0.183 0.208 0.133 1.000

What we are looking for in the plots, and to be confirmed in the table, are correlations that look like straight lines, showing that values in one variable change with changes in another.
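
If you prefer to pick out the strongly correlated pairs programmatically, a quick, illustrative check is to flag those whose absolute correlation exceeds some threshold (0.7 here, an arbitrary choice):

# correlation matrix of the numeric variables
cm <- cor(df[, -c(1:3)])
# index the pairs (row, column) with |r| > 0.7, upper triangle only
strong <- which(abs(cm) > 0.7 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[strong[, 1]],
           var2 = colnames(cm)[strong[, 2]],
           r    = round(cm[strong], 3))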

A final useful tool here is Catherine Hurley’s gclus package. This is similar to the data frame plotting, but provides tools to reorder the variables so that the most strongly correlated pairs are closer to the diagonal in the plots. It can also colour-code the plots to show whether the correlations are positive (red) negative (blue) or near to zero (yellow). For this plot we focus on the predictors again. This is shown below

library(gclus)
cor.mat <- cov(df[,-c(1:3)]) 
cpairs(df[,-c(1:3)],order.single(cor.mat),dmat.color(cor.mat))

The order of the variables in the diagonal generally puts positively correlated variables near to each other as much as possible. The ordering here is based on a clustering method (single linkage) provided via the order.single function in gclus - the colouring is based on correlations (via the dmat.color) function in gclus.

Note, you could also write the table out to a .csv file:

write.csv(cor.mat, file = "tab.csv")

The plot and the table above of the correlations between the variables show that only SOM is highly correlated (< -0.8 or > +0.8) with the response variable S, with the others less so. This provides us with confidence that we can construct a reasonable model from this data - at least some of the variables are correlated with S - but care may be needed in interpreting some of the regression parameters due to some collinearity in the predictors.

We are interested in a predictive model of sulphur and it is useful to see how that value varies. The code below plots the histogram of the variable S. Note the use of the parameter prob: if set to TRUE this computes probabilities for each histogram bin rather than a frequency count. This makes the \(y\)-axis compatible with a kernel density estimate (a smooth estimate of the probability distribution of \(x\)), so the two may be overlaid. Here this is done by using the lines function with a kernel density estimate for the distribution given by density. A rug plot is added to show the exact locations of the values on the \(x\)-axis.

The graphs were obtained with the following code:

hist(df$S, xlab='Sulphur (S)', main='Histogram of Sulphur', col = 'DarkRed')

hist(df$S, prob=T, xlab='Sulphur (S)', main='Histogram of Sulphur', col='NavyBlue')
lines(density(df$S,na.rm=T),col='salmon',lwd=2)
rug(jitter(df$S))

# qqPlot(df$S,main='Normal QQ plot of Sulphur') 

3.2 Regression Models

So we have some data. We have had a look at some of the variables, seen how some of them relate to the variable we are interested in predicting, S, how they correlate with each other, and perhaps we have even started to think about hypotheses that we might wish to test in our model.

We will start with a hypothesised model in which Sulphur is associated with Calcium (Ca), Iron (Fe), Phosphorus (P), Soil Organic Matter (SOM), pH, slope (Slope) and landscape aspect (Aspect). In R we can construct a regression model using the function lm, which stands for linear model. This is a standard ordinary least squares regression function; although simple, it is very powerful and there are many variants and functions for manipulating lm outputs: remember R was originally a stats package.

The equation for a regression is:

\(y_i = \beta_{0} + \sum_{k=1}^{m}\beta_{k}x_{ik} + \epsilon_i \textsf{ for } i \in 1 \cdots n\)

where \(y_i\) is the response variable for observation \(i\), \(x_{ik}\) is the value of the \(k^{\textsf{th}}\) predictor variable for observation \(i\), \(m\) is the number of predictor variables, \(\beta_{0}\) is the intercept term, \(\beta_{k}\) is the regression coefficient for the \(k^{th}\) predictor variable, \(\epsilon_i\) is the random error term for observation \(i\) and \(n\) is the number of observations.

In our case, the response variable \(y\) is sulphur (S), and the 5 chemical variables and 2 landscape variables Ca, Fe, P, SOM, pH, Slope and Aspect are the different predictor variables, the \(k\) different \(x\)’s.

The basic formula in R to specify a linear regression model with lm is as follows:

model <- lm(variable_we_are_trying_to_predict ~ predictor_variable1 +
              predictor_variable2...)

where the ~ sign (tilde) separates the variable we are trying to model (the response variable) from the variables we will use to predict it, the predictor variables. The predictor variables are usually separated by a + sign - meaning that they are linear terms in the model added together.

We can specify our first model.

m <- lm(S~Ca+Fe+P+SOM+pH+Slope+Aspect, data = df) 

And look at its summary attributes

summary(m)
## 
## Call:
## lm(formula = S ~ Ca + Fe + P + SOM + pH + Slope + Aspect, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -333.55  -33.16   -0.40   31.99  285.15 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.466e+02  6.843e+01  10.910  < 2e-16 ***
## Ca           6.213e-02  4.790e-03  12.972  < 2e-16 ***
## Fe          -8.236e-03  3.765e-04 -21.874  < 2e-16 ***
## P            1.152e-01  1.186e-02   9.710  < 2e-16 ***
## SOM          1.952e+01  1.021e+00  19.118  < 2e-16 ***
## pH          -7.060e+01  1.335e+01  -5.289 1.49e-07 ***
## Slope        9.125e+00  8.159e-01  11.184  < 2e-16 ***
## Aspect       1.157e-01  1.481e-02   7.814 1.33e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.45 on 1062 degrees of freedom
## Multiple R-squared:  0.8357, Adjusted R-squared:  0.8346 
## F-statistic: 771.5 on 7 and 1062 DF,  p-value: < 2.2e-16

There are lots of things to look at in the summary of the model but we will focus on 3 aspects:

First, Significance. If we examine the model we can see that all the variables are significant predictors of Sulphur. The level of significance is indicated by the \(P\)-value in the column \(Pr(>|t|)\) and the asterisks \(*\). For each coefficient \(\beta_k\) these are the significance levels associated with the hypothesis \(\textsf{H}_0 : \beta_k = 0\) against the alternative \(\textsf{H}_1 : \beta_k \ne 0\) - this corresponds to the probability that one would obtain a \(t\)-statistic for \(\beta_k\) at least as large as the one observed, given that \(\textsf{H}_0\) is true.

It is a measure of how surprising the relationship is, and indicative of whether the relationship between \(x_k\) and \(y\) found in this data could have been found by chance. Very small values suggest that the level of association found here would be very unlikely to occur in a random sample of data if \(\beta_k\) really were zero. Cutting to the chase, small \(P\)-values mean strong evidence for a non-zero regression coefficient.

So in this case we can say:

  • ‘changes in Calcium are significantly associated with changes in Sulphur at the \(0.001\) level’. This means that the association between the predictor and response variables is not one that would be found by chance in a series of random samples \(99.9\%\) of the time. Formally, we can reject the Null hypothesis that there is no relationship between changes in Sulphur and Calcium.
  • ‘changes in Fe, P, SOM, pH, Slope and Aspect are significantly associated with changes in S at the \(<0.001\) level’, which indicates significance at less than \(0.001\). So we are even more confident that the observed relationships between these variables and Sulphur are not by chance, and can again confidently reject the Null hypotheses.
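
If you want to inspect these \(P\)-values directly rather than reading them off the printed summary, they can be extracted from the model’s coefficient table (a quick check using the model m fitted above):

# the coefficient table: Estimate, Std. Error, t value and Pr(>|t|)
ctab <- summary(m)$coefficients
round(ctab[, "Pr(>|t|)"], 4)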

Second, the Coefficient Estimates. These are the \(\beta_{k}\) values that tell you the rate of change between \(x_k\) and \(y\), given in the column ‘Estimate’ in the summary table. The first of these is the Intercept - this is \(\beta_0\) in the regression equation and tells you the value of \(y\) if \(x_1 \cdots x_k\) are all zero.

Remember, regression is like taking lots of \(x_k\) and \(y\) values and seeing how well they are correlated. The regression fits a line between them. So in the figure below a regression model with just one \(x\) is defined and summarised, and then the \(x\) and \(y\) values are plotted along with the modelled relationship.

m2 <- lm(df$S~df$SOM)
summary(m2)
## 
## Call:
## lm(formula = df$S ~ df$SOM)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -158.12  -50.74  -13.96   36.50  463.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 168.2686    10.3170   16.31   <2e-16 ***
## df$SOM       43.1110     0.9616   44.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 74.56 on 1068 degrees of freedom
## Multiple R-squared:  0.653,  Adjusted R-squared:  0.6527 
## F-statistic:  2010 on 1 and 1068 DF,  p-value: < 2.2e-16
plot(df$SOM, df$S, xlab = "x = SOM", ylab = "y = Sulphur")
abline(m2, col = "red", lty = 2, lwd = 2)

Here we can see that the intercept of the relationship is 168.3, and that the rate of change coefficient between S and SOM, when they are modelled on their own, is 43.1. This means that each increase of 1 in SOM is associated with an increase of 43.1 in sulphur.

So returning to our model m, we can now see how changes of 1 unit in the different predictor variables are associated with changes in S. The association of P is positive but small, the association of SOM is reduced (due to interactions with the other \(x\)’s) but is still positive, and the pH and Fe variables are strongly negatively associated with S.

So the regression model m gives us an idea of how different factors are associated in different ways with sulphur in the study area.

Third, the \(R^2\) value. This provides a measure of model fit. It is calculated from the differences between the actual values of \(y\) (e.g. of S) and the values predicted by the model - sometimes written as \(\hat{y}\) (called ‘\(y\)-hat’). A good description is here http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit: Generally, \(R^2\) values greater than 0.7 indicate a reasonably well fitting model. Some users of this method describe low \(R^2\) values as ‘bad’ - they are not - it just means you have a more imprecise model - the predictions are a bit more wrong! Essentially there is a sliding scale of reliability of the predictions - based on the relative amounts of deterministic and random variability in the system being analysed - and \(R^2\) is a measure of this.
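
As a sanity check, the \(R^2\) reported by summary(m) can be reproduced from the residual and total sums of squares (a quick sketch using the model m and the data in df):

# recompute R-squared by hand for model m
rss <- sum(resid(m)^2)              # residual sum of squares
tss <- sum((df$S - mean(df$S))^2)   # total sum of squares
1 - rss / tss                       # should match the Multiple R-squared (about 0.836)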

Finally, we are also interested in outliers - these are records that the model does not describe very well, either vastly over-predicting Sulphur or under-predicting it. When the bivariate regression was considered earlier, these were easy to find, since they were just the points lying a long way from the regression line. However, when there are multiple predictors, relationships cannot be encapsulated in two dimensional space.

For this situation, a useful diagnostic is the studentised residual. For different values of the predictors, the expected difference between the actual and fitted \(y\) values has a different variance even if the theoretical model gives \(\epsilon_i\) a constant variance. This discrepancy is due to issues such as collinear predictors. Studentised residuals attempt to compensate for this. Plotting these and looking for notably high or low values is a helpful way of spotting outliers in the regression model.

A simple way of doing this is just to plot the residuals in order:

s.resids <- rstudent(m)
plot(s.resids,type='h')
abline(h=c(-2,2),col='lightskyblue')

Here we see some residuals are relatively high. Normally, one might expect standardised residuals to lie between -2 and 2. Adding lines to this plot helps to identify when this is not the case - as is done above.

Another useful job is to identify which are the outlying cases - here just done as a basic table (via kable):

df_out <- data_frame(Name=df$ID,Residual=s.resids)
df_out <- df_out[abs(df_out$Residual)>2,]
kable(df_out)
Name Residual
7 -2.091382
17 -2.033073
29 -2.537072
41 -2.044278
131 -2.620754
211 3.775532
220 2.160951
229 2.903274
242 2.011449
318 2.745407
340 2.775047
428 2.466755
441 2.728331
443 2.786703
453 2.387636
454 2.068413
463 2.516256
464 2.120153
465 4.586674
524 2.141880
662 -2.021370
677 2.196887
919 2.318173
932 -2.730475
964 -2.252646
986 -2.243018
992 -2.085380
993 -2.014177
1002 -2.432719
1006 -2.996939
1009 -4.382376
1010 -2.703729
1012 -2.089864
1013 -2.359771
1014 -2.369609
1019 -2.560585
1047 5.722086
1060 -2.092619
1065 -2.056205
1069 -2.207681
1070 -8.819858

Records with negative values here have much lower sulphur than the model predicts; those with positive values have much higher sulphur.

Finally, we need to check another model assumption in lm - that the error terms \(\epsilon_i\) are normally distributed. The best way to do this is via a qq-plot, which plots the empirical quantiles of a variable whose distribution we are interested in against the quantiles of the theoretical distribution we are testing for. If the variable is from this distribution, the results should yield a straight line - frequently a reference line is provided to test this. Here s.resids is the variable of interest - the studentised residuals - and these are compared to the \(t\) distribution. Although the \(\epsilon_i\)’s have a normal distribution, theory shows that the studentised residuals derived from a sample coming from this model have a \(t\) distribution. However, the qqPlot function from the package car does all the work, and requires only the regression model object to create a plot:

qqPlot(m)

This demonstrates that mostly the residuals do conform to this, with a couple of extreme outliers at the higher end - labelled in the plot. Consulting the earlier table suggests these include IDs 1009, 1047, 465 and 1070.

It is also important to consider spatial patterns in the residuals, from a geographical viewpoint. The code below plots the outliers and their spatial distributions:

# determine residuals
df$s.resids = rstudent(m)

# 1.5 standard deviations
resid.shades = shading(c(-1.5,1.5),c("indianred1","antiquewhite2","powderblue"))
# define a color scheme
cols = resid.shades$cols[1 + findInterval(df$s.resids, resid.shades$breaks)]
# set plot parameters
par(mar = c(0,0,2,0))

# make some spatial data 
# define a projection
os.proj <- CRS("+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +datum=OSGB36 +units=m +no_defs ")

df.sp <- SpatialPointsDataFrame(coords = df[, 2:3], data = data.frame(df), 
                                proj4string = os.proj)

# have a look at the spatial distribution of outliers
choropleth(df.sp,df$s.resids,resid.shades, main = "Model Outliers", pch = 19, cex = 0.7) 

The regression session finishes here, but you will need the results of this session later on. To save all of the variables, regression objects and so on from this session, enter

save.image(file='session2.RData')

This saves an image of all of the R objects currently defined in a file called session2.RData - later on you will be able to load this again via the load function.

4. Spatial data

You will have noticed that the last few lines of code above introduced some spatial data to do the mapping. We do not have time to go into this fully here but in this section we will introduce some spatial data formats and how to read data in and out of R.

In the preceding description of data each of the records (rows) relates to some kind of real world feature (a person, a transaction, a date, etc) and the columns represent some attribute associated with that feature. In spatial data, each record typically represents some real world geographical feature - a place, a route, a region etc. In fact geographic data typically represent points, lines or areas.

There is some key background information to working with spatial data in R. R is dynamic - things don’t stay the same. New tools, packages and functions are constantly being produced, and they are updated to improve and develop them. In most cases this is not problematic, as an update almost always extends the functionality of a package without affecting the original code. However, in a few instances, specific packages are completely re-written without backwards compatibility. If this happens, then R code that previously worked may not work with the new package, as the functions may take different parameters and arguments. However, there is usually a period of a few package versions before the code stops working altogether.

Occasionally a completely new paradigm is introduced and this has been the case recently for spatial data in R with the advent of the sf package.

Much of the code for handling spatial data in this workshop is built around the data structures defined in the sp package, which defines a number of spatial object classes in R (e.g. SpatialPoints, SpatialLines and SpatialPolygons, each with a *DataFrame counterpart that holds attributes).

The sp package underpins many of the packages that you have loaded: either directly or indirectly (i.e. because they are loaded by other packages), many of them have dependencies on sp. The GISTools package, for example, implicitly links to maptools, sp, rgeos etc. You may have noticed these packages being listed as GISTools loads.

However, recently a new class of spatial object has been defined, called sf or Simple Features, which seeks to encode spatial data in a way that conforms to a formal standard (ISO 19125-1:2004). This emphasises the spatial geometry of objects and the way that objects are stored in databases. In brief, the aim of the team developing sf (actually they are the same people who developed sp, so they do know what they are doing!!) is to provide a standard format for spatial data in R. An overview of the evolution of spatial data in R can be found at https://edzer.github.io/UseR2017/.

Their idea is that a feature is a thing, or an object in the real world, such as a building or a tree. As is the case with objects, they often consist of other objects. This is the case with features too: a set of features can form a single feature. A forest stand can be a feature, a forest can be a feature, a city can be a feature. A satellite image pixel can be a feature, and a complete image can be a feature too. Features have a geometry describing where on Earth the feature is located, and they have attributes, which describe other properties. There are many sf geometry types, but the key ones (similar to points, lines and areas) are POINT, LINESTRING and POLYGON, together with their MULTI* counterparts (see the sf vignette).

Ultimately, sf formats will completely replace sp, and packages that use sp (such as GWmodel for GWR) will all have to be updated to use sf at some point, but that is a few years away.

So… in this session we will introduce both sp and sf, as well as different mapping routines (with the plot function and with the tmap package), and the tools to move between them. The later sessions on regression and spatial interaction models will mainly use sp.
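
As an illustration of moving between the two formats: st_as_sf() converts an sp object to sf, and as(x, "Spatial") converts back. A quick sketch, assuming the sf package is installed and using the df.sp points created earlier:

library(sf)                      # assumes sf is installed
# sp -> sf: convert the df.sp points to a simple features object
df.sf <- st_as_sf(df.sp)
class(df.sf)
# sf -> sp: convert back to an sp object
df.sp2 <- as(df.sf, "Spatial")
class(df.sp2)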

4.1 Reading and Writing Spatial Data

As we are working with spatial data it seems appropriate to consider how to get spatial data in and out of R.

Very often we have data in a particular format, such as shapefile format. R has the ability to read and write data from and to many different spatial data formats using functions in the rgdal and sf packages - we will consider both here.

However, as the sp package and data formats are deprecated, so are their functions for reading and writing spatial data. The rgdal package includes 2 generic functions for reading and writing all kinds of spatial data - pretty much you name the format and it has a driver for that format. These are

readOGR()
writeOGR()

Spatial data can be read into R and assigned to a variable using the readOGR function:

fields <- readOGR("FP_Fields_Fenced_post13082013.shp") 
## OGR data source with driver: ESRI Shapefile 
## Source: "FP_Fields_Fenced_post13082013.shp", layer: "FP_Fields_Fenced_post13082013"
## with 20 features
## It has 13 fields

If you enter:

class(fields)

You will see that the fields object is an sp class object (a SpatialPolygonsDataFrame).

As fields is an sp format object, we can write it out to a shapefile using the writeOGR() function as follows:

writeOGR(obj=fields, dsn=".", layer="fields", driver="ESRI Shapefile") 

You will see that a shapefile has been written into your current working directory, with its associated supporting files (.dbf, etc) that can be recognised by other applications (QGIS etc). You should examine the writeOGR and readOGR functions in the rgdal package. R is also able to read and write other proprietary spatial data formats using a number of packages, which you should be able to find through a search of the R help system or via an internet search engine. The rgdal package is the R interface to the Geospatial Data Abstraction Library (GDAL) and includes a number of methods for reading and writing spatial objects, including to and from SpatialXDataFrame objects.

The full syntax can be important - the code below overwrites any existing similarly named file:

writeOGR(fields, dsn = ".", layer = "fields", 
         driver="ESRI Shapefile", overwrite_layer = T)

The dsn parameter is important here: for shape files it determines the folder the files are written to. In the above example it was set to "." which places the files in the current working directory. You could specify a file path here.

For a PC this might be something like: D:\MyDocuments\MyProject\DataFiles
And for a Mac: /Users/lex/my_docs/project

The setwd() and getwd() functions can be used to determine and set the file path. You may want to set the file path and then use the dsn setting as above:

setwd("/Users/lex/my_docs/project")
writeOGR(fields, dsn = ".", layer = "fields", 
         driver="ESRI Shapefile", overwrite_layer = T)

Or you could use the getwd() function, save the result to a variable and pass this to writeOGR:

td <- getwd()
writeOGR(fields, dsn = td, layer = "fields", 
         driver="ESRI Shapefile", overwrite_layer = T)

You should also examine the functions for reading and writing raster layers in rgdal, which are readGDAL and writeGDAL. These read and write functions in rgdal are incredibly powerful and can read / write almost any format of spatial data.

Spatial data can be also be read in and written out using the sf functions:

st_read()
st_write()

They work in a similar way - we do not have the time to examine them in detail here.
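
For completeness, a rough equivalent of the readOGR / writeOGR calls above using sf would look something like this (a sketch, assuming the sf package is installed; the output file name is just an example):

library(sf)
# read the shapefile into an sf object
fields.sf <- st_read("FP_Fields_Fenced_post13082013.shp")
# write it back out, overwriting any existing layer of the same name
st_write(fields.sf, "fields_sf.shp", delete_layer = TRUE)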

4.2 Mapping with tmap

The tmap package is for thematic visualisation, with a focus on mapping the spatial distribution of data attributes. It has a similar grammar to plotting with ggplot in that it seeks to handle each element of the map separately in a series of layers, and in so doing to exercise control over each element. This is different to the basic plot functions used above to map sp and sf data. In earlier sections qtm() was used to compose a quick map and now we will expand this more fully. The code in this section will use the fields and df.sp datasets.

A number of plot parameters exist in addition to the ones previously used, including different window sizes, multiple plots in the same window, polygon or area shading, hatching, boundary thickness, boundary colour, labelling etc.

A useful introductory overview of tmap can be found in the tmap-nutshell vignette. You can use this later to get some further ideas about mapping options:

vignette("tmap-nutshell")

Like ggplot, the key point is that tmap is invoked with tm_shape and then the kind of mapping is specified with further aesthetic commands to control and specify the base layers:
base         description
tm_polygons  Create a polygon layer (with borders)
tm_symbols   Create a layer of symbols
tm_lines     Create a layer of lines
tm_raster    Create a raster layer
tm_text      Create a layer of text labels

Further aesthetics for derived layers are described in

help(tmap)

The code below recreates and augments the map created above with the choropleth function in GISTools, this time with field boundaries.

tm_shape(df.sp) +
  tm_dots("s.resids", size = 0.1, breaks = c(-10,-1.5,2,10), palette = "RdBu") +
  tm_shape(fields) + 
  tm_borders()

So what you can see in the above code are two sets of tmap plot commands: the first set plots the df.sp dataset, specifying dots to show the data locations, a size and some breaks. The second set adds the field outlines.

It is also possible to plot multiple different maps from different datasets together, but this requires a bit more control over the tmap parameters.
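
One possible approach (a sketch using tmap's tmap_arrange function) is to build each map as a separate object and then arrange them in a single plot window:

# build two separate maps and arrange them side by side
p1 <- tm_shape(df.sp) + tm_dots("s.resids", size = 0.1)
p2 <- tm_shape(fields) + tm_borders()
tmap_arrange(p1, p2)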

One of the great features of tmap is its ability to include leaflet plots. This can be done by setting the tmap_mode parameter:

tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(df.sp) +
  tm_dots("s.resids", breaks = c(-10,-1.5,2,10), palette = "RdBu") +
  tm_shape(fields) + 
  tm_borders()

Try changing the layers icon - the Esri.WorldTopoMap options! This is where Harry works - on a little farm at the end of the world!!! Note the street name of Cocktree Throat - I thought this was an illness not a place!

5. Geographically Weighted Regression

Earlier in the program we looked at basic regression. As a reminder, this was a way of modelling the relationship between a response variable \(y_i\) and a number of predictor variables \(x_{i1} \cdots x_{im}\), for each observation \(i\) (\(i = 1 \cdots n\)). The basic model used was

\[ y_{i} = \beta_0 + \sum_{k=1}^{k=m} \beta_k x_{ik} + \epsilon_i \] so for each observation, the relationship is linear plus a random error part, where each \(\epsilon_i\) has an independent normal distribution with mean zero and variance \(\sigma^2\). However, the assumption that the relationship is linear, and that the error terms have identical and independent normal distributions, is quite often a consequence of wishful thinking rather than a rigorously derived conclusion. Although quite often these assumptions resemble reality closely enough to allow predictions to be of some use, changing some of the model assumptions would certainly result in more plausible results.

To take a step closer to reality, it is a good idea to look beyond these assumptions to obtain more reliable predictions, or to gain a better theoretical understanding. To mitigate the problems caused by the inappropriate use of simple linear regression (or OLS - Ordinary Least Squares), two general approaches are used:

  1. Making the model calibration more robust - so that minor deviations (such as extreme outliers) - do not negatively influence the modelling process.
  2. Allowing different assumptions in the model or the error term

Here we will look at some examples of these.

5.1. Robust (‘bomb-proof’) Regression

This relates to point 1 above. Most of this session will relate to examples linked to point 2, but it is useful to consider at least one example of the first kind. To investigate this, we will need the results from earlier. Recall that the results of the linear regression are stored in m -

summary(m)
## 
## Call:
## lm(formula = S ~ Ca + Fe + P + SOM + pH + Slope + Aspect, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -333.55  -33.16   -0.40   31.99  285.15 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.466e+02  6.843e+01  10.910  < 2e-16 ***
## Ca           6.213e-02  4.790e-03  12.972  < 2e-16 ***
## Fe          -8.236e-03  3.765e-04 -21.874  < 2e-16 ***
## P            1.152e-01  1.186e-02   9.710  < 2e-16 ***
## SOM          1.952e+01  1.021e+00  19.118  < 2e-16 ***
## pH          -7.060e+01  1.335e+01  -5.289 1.49e-07 ***
## Slope        9.125e+00  8.159e-01  11.184  < 2e-16 ***
## Aspect       1.157e-01  1.481e-02   7.814 1.33e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.45 on 1062 degrees of freedom
## Multiple R-squared:  0.8357, Adjusted R-squared:  0.8346 
## F-statistic: 771.5 on 7 and 1062 DF,  p-value: < 2.2e-16

However, looking at the studentised residuals suggested a few outliers that might ‘throw’ the regression coefficient estimates. One way to investigate this is to remove the outlying observations and re-fit the model. The removal is best done using the dplyr approach with the filter function, before re-fitting the model.

df %>% filter(abs(rstudent(m)) <= 2) -> df2
m2 <- lm(S~Ca+Fe+P+SOM+pH+Slope+Aspect, data = df2)

This done, the compareCoefs function in car lets the results be compared:

kable(compareCoefs(m,m2,print=FALSE), digits=3)
Est. 1 SE 1 Est. 2 SE 2
(Intercept) 746.569 68.429 733.611 62.113
Ca 0.062 0.005 0.079 0.006
Fe -0.008 0.000 -0.009 0.000
P 0.115 0.012 0.103 0.011
SOM 19.516 1.021 18.392 0.931
pH -70.596 13.348 -65.865 12.280
Slope 9.125 0.816 7.902 0.709
Aspect 0.116 0.015 0.109 0.013

It may be seen that some of the values alter notably (by around 10%) between the models with and without outliers. Outliers do have an influence.

However, it could be argued that the approach of removing them entirely is misleading - essentially trying to pass off a model as universal but intentionally ignoring observations that contradict it. Although that might be quite an extreme view, the point holds to some extent. Those observations do have those values, and so should perhaps not be ignored entirely.

Robust regression offers a more nuanced approach. Here, instead of simply ignoring the outliers, it is assumed that for whatever reason, they are subject to much more variance than the remaining observations - possibly due to some characteristic of the sampling process, or - more likely here - due to some unusual circumstances in the geographical locality. Rather than totally ignoring such observations, they are downweighted to allow for this. This is less extreme than removing outliers - which in a sense is extreme downweighting to the value zero!

There are a number of approaches to robust regression - here we will use one proposed by Huber (1981). This uses the algorithm below (a rough R sketch of these steps is given after the list):

  1. Fit a standard regression model and note the residuals.
  2. Compute the weight for each observation as a function of absolute residual size. A number of possible weight functions could be used, but Huber’s is
    \[ w_i(z_i) = \left\{ \begin{array}{ll} 1, & \textsf{if } |z_i| < c; \\ c/|z_i|, & \textsf{if } |z_i| \ge c; \\ \end{array} \right. \] where \(c \approx 1.345\) and \(z_i\) is a scaled residual for observation \(i\). If the actual residual is \(e_i\), then \(z_i = e_i/\tau\) where \(\tau\) is the median of \(\frac{|e_i|}{0.6745}\). Once the scaled residuals exceed \(c\), the weighting gradually decreases - in accordance with the idea that higher residuals come from processes with greater variance.
  3. Refit the regression model with these weights. Update the residual values.
  4. Repeat from step 2 until regression coefficient estimates converge.
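
The following is a rough, illustrative sketch of those four steps written directly in R - the rlm function used below does this properly (and more efficiently), so this is only to make the algorithm concrete:

# illustrative iteratively re-weighted least squares using Huber's weights
huber_w <- function(z, c = 1.345) ifelse(abs(z) < c, 1, c / abs(z))
w <- rep(1, nrow(df))                    # step 1: start with equal weights
for (iter in 1:25) {
  m_w <- lm(S~Ca+Fe+P+SOM+pH+Slope+Aspect, data = df, weights = w)
  e <- resid(m_w)                        # residuals from the current fit
  tau <- median(abs(e) / 0.6745)         # scale estimate
  w_new <- huber_w(e / tau)              # step 2: new weights from scaled residuals
  if (max(abs(w_new - w)) < 1e-6) break  # step 4: stop when weights converge
  w <- w_new                             # step 3: refit with the updated weights
}
coef(m_w)   # compare with the rlm coefficients below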

This is all implemented in the rlm function in the package MASS. It is very easy to use - once you have loaded MASS it just works like lm:

library(MASS)
m3 <- rlm(S~Ca+Fe+P+SOM+pH+Slope+Aspect, data = df)
m3
## Call:
## rlm(formula = S ~ Ca + Fe + P + SOM + pH + Slope + Aspect, data = df)
## Converged in 7 iterations
## 
## Coefficients:
##   (Intercept)            Ca            Fe             P           SOM 
## 726.727575457   0.079668225  -0.008732398   0.103818551  18.153368017 
##            pH         Slope        Aspect 
## -66.944296245   8.416860001   0.107198782 
## 
## Degrees of freedom: 1070 total; 1062 residual
## Scale estimate: 47.4

Here the coefficient estimates are printed out, and it may also be seen that the model converged in 7 iterations, and that the scale factor \(\tau\) was estimated as 47.4. As with lm you can print out a summary:

summary(m3)
## 
## Call: rlm(formula = S ~ Ca + Fe + P + SOM + pH + Slope + Aspect, data = df)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -496.6919  -32.5874    0.9519   30.9568  311.2977 
## 
## Coefficients:
##             Value    Std. Error t value 
## (Intercept) 726.7276  65.2317    11.1407
## Ca            0.0797   0.0046    17.4480
## Fe           -0.0087   0.0004   -24.3289
## P             0.1038   0.0113     9.1837
## SOM          18.1534   0.9731    18.6551
## pH          -66.9443  12.7241    -5.2612
## Slope         8.4169   0.7778    10.8213
## Aspect        0.1072   0.0141     7.5916
## 
## Residual standard error: 47.38 on 1062 degrees of freedom

Finally, coefficients from all three approaches can be compared:

coef_tab <- data_frame(OLS=coef(m),`Outliers Deleted`=coef(m2),Robust=coef(m3))
kable(coef_tab,digits=3)
            OLS       Outliers Deleted   Robust
(Intercept) 746.569   733.611            726.728
Ca          0.062     0.079              0.080
Fe          -0.008    -0.009             -0.009
P           0.115     0.103              0.104
SOM         19.516    18.392             18.153
pH          -70.596   -65.865            -66.944
Slope       9.125     7.902              8.417
Aspect      0.116     0.109              0.107

For some coefficients, the robust estimate is closer to the standard OLS estimate, and for others it is closer to the ‘outliers deleted’ estimate - however for others still, it differs from OLS but not in the same direction as the ‘outliers deleted’ case. As a precautionary approach, I suggest working with the robust approach if there are some notable outliers.

5.2. Geographically Weighted Models

Regression of some kind underpins many statistical models. However, they often make one massive assumption: that the relationships between the \(x_k\)’s and \(y\) are the same everywhere. Regression models (and most other models) are aspatial, and because they consider all of the data in the study area in one universal model, they are referred to as Global models. For example, in the models seen in this program so far, the assumption is that the contributions made by the different variables in predicting S are the same across the study area. In reality this assumption of spatial invariance is violated in many instances when geographic phenomena are considered.

For these reasons, as a geographer….

In statistical terminology… I am interested in spatial non-stationarity in relationships and process spatial heterogeneity.

What this means is that instead of just one general relationship for, for example, the association between changes in SOM and S, I might expect that relationship (as described in the coefficient estimate for \(\beta_k\)) to be different in different places. This is a core concept for geographical data analysis. A key question is “how can such a model be calibrated” - a clue lies in the next subsection…

Tobler’s first law of Geography

everything is related to everything else, but near things are more related than distant things. (Tobler, 1970)

Here, Tobler is arguing (although by his own admission he used the term in a tongue in cheek way) that geographically located phenomena that are near to one another tend to have similar characteristics - or that if this were not the case -

“the full range of conditions anywhere on the Earth’s surface could in principle be found packed within any small area. There would be no regions of approximately homogeneous conditions to be described by giving attributes to area objects. Topographic surfaces would vary chaotically, with slopes that were everywhere infinite, and the contours of such surfaces would be infinitely dense and contorted. Spatial analysis, and indeed life itself, would be impossible.” (de Smith et al 2007, p44)

Here, the idea informs the technique known as Geographically Weighted Regression or GWR (Brunsdon et al 1996). If the relationships between the \(x_k\)’s and \(y\) vary spatially, one might at least expect that the relationship between nearby sets of observations should be consistent. Suppose we assign a value of each regression coefficient to a location \((u,v)\) in space (e.g. \(u\) is an easting and \(v\) a northing), so that we denote it \(\beta_k(u,v)\). Then we could estimate each \(\beta_k(u,v)\) by calibrating a weighted regression model where \(w_i\) was 1 if an observation was within some radius \(h\) of \((u,v)\) and 0 otherwise.
In this way, an entire surface of coefficients could be estimated. It may be imagined as being generated by a moving circular window scanning across a map of the locations of the observations - in this case the soil sample points. However, 0/1 weighting is quite crude, and one might expect data from the periphery of the circular window to have less spatial influence than data near the middle.

Thus, critical to the operation of GW models is a moving kernel that downweights peripheral observations. The kernel moves through the study area (for example to cells in a predefined grid) and at each location computes a local model. It uses the data under the kernel to construct a (local) model at that location, with the data weighted so that points further away from the kernel centre contribute less to the solution - hence the weighted element in the GW framework. Thus, the kernel defines the data and weights that are used to calibrate the model at each location. The weight, \(w_i\), associated with each location \((u_i, v_i)\) is a decreasing function of \(d_i\), the distance from the centre of the kernel to \((u_i, v_i)\):

A typical kernel function, for example, is the bisquare kernel. For a given bandwidth \(h\), this is defined by \[ f(d) = \left\{ \begin{array}{cl} \left(1 - \left( \frac{d}{h} \right)^2 \right)^2 & \mbox{ if $d < h$}; \\ & \\ 0 & \mbox{ otherwise.} \end{array} \right. \] Here \(h\) may be defined as a fixed value, or in an adaptive way. Gollini et al (2015) provide a description of common kernel shapes used in GW models. Generally, larger values of \(h\) result in a greater degree of spatial smoothing - having a larger window around \((u,v)\) in which observations have non-zero weighting. As \(h \rightarrow \infty\) the coefficients converge to those of the global OLS model.
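
Written as a simple R function (purely for illustration), the bisquare kernel is:

# bisquare kernel: weight for a distance d given a bandwidth h
bisquare <- function(d, h) ifelse(d < h, (1 - (d / h)^2)^2, 0)
# e.g. weights for points 0, 500 and 1500 m away with a 1000 m bandwidth
bisquare(c(0, 500, 1500), h = 1000)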

There are a number of packages that support geographically weighted analyses in R, including spgwr, GWmodel, gwrr and McSpatial. GWmodel is curated and supported by two of the original GWR team working with others. We will use it here.

Note that Gollini et al (2015) provide an overview of geographically weighted approaches and a thorough treatment demonstrating their use, including the steps of model selection, bandwidth / kernel size optimisation, handling collinearity etc.

In brief there are 2 basic steps to any GW approach:

  1. Determine the optimal kernel bandwidth \(h\) (ie how many data points - or which size window - to include in each local model);
  2. Fit the GW model.

We will undertake these steps in turn in the code below, using the spatial data (df.sp) created earlier. Note that the df.sp data is projected in metres and not degrees: distance based measures are used to determine the bandwidth / kernel size.

# determine the kernel bandwidth
bw <- bw.gwr(S~Ca+Fe+P+SOM+pH+Slope+Aspect,
              adaptive = T,
              data=df.sp) 
# fit the GWR model
m4 <- gwr.basic(S~Ca+Fe+P+SOM+pH+Slope+Aspect, 
                adaptive = T,
                data = df.sp,
                bw = bw)  

You should have a look at the bandwidth: it has a value of 184, indicating that around 17% of the data points will be used in each local calculation (there are 1070 records in the data).

bw
## [1] 184

You should have a look at the GWR outputs

m4

The code above prints out the whole model summary. You should compare the summary of the GWR coefficient estimates with the coefficient estimates from the ordinary regression model:

tab <- rbind(apply(m4$SDF@data[, 1:8], 2, summary), coef(m))
rownames(tab)[7] <- "Global"
tab <- round(tab, 3)
kable(tab, booktabs = T, format = 'markdown')
Intercept Ca Fe P SOM pH Slope Aspect
Min. -233.472 -0.026 -0.015 -0.093 -5.664 -309.436 -3.774 -0.104
1st Qu. 338.519 0.040 -0.006 0.039 10.136 -102.276 -0.542 -0.001
Median 692.092 0.060 -0.004 0.097 18.789 -68.268 1.033 0.043
Mean 659.049 0.063 -0.005 0.103 18.120 -67.445 1.456 0.086
3rd Qu. 896.233 0.082 -0.003 0.168 26.647 -21.152 3.388 0.107
Max. 1942.674 0.176 0.002 0.284 44.712 161.084 8.880 0.737
Global 746.569 0.062 -0.008 0.115 19.516 -70.596 9.125 0.116

So the Global coefficient estimates are similar to the Mean and Median values of the GWR output. But you can also see the variation in the local coefficient estimates, i.e. in the associations between sulphur (S) and the different predictors.

Note: the GWmodel outputs have a number of components. Of interest is the SDF, which is a spatial object of the same class as the input - in this case a SpatialPointsDataFrame.

So for example, for SOM (soil organic matter):

  • the global coefficient estimate is 19.516
  • the local coefficient estimate varies from 10.14 to 26.65 between the 1st and 3rd quartiles (see the quick check after this list)
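
The local estimates can be pulled straight from the SDF object, for example:

# distribution of the local SOM coefficient estimates
summary(m4$SDF$SOM)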

Mapping GW outputs

The variation in the GWR coefficient estimates can be mapped to show how different factors are associated with Sulphur in different ways in different parts of the study area.

Maps where large variation in the relationship between the predictor variable and sulphur was found:

tmap_mode('plot')
## tmap mode set to plotting
tm_shape(m4$SDF) +
  tm_dots(c("SOM", "Slope"), size = 0.1) +
  tm_layout(legend.position = c("right","top"))

Maps showing coefficients whose values flip from positive to negative in different parts of the study area:

tm_shape(m4$SDF) +
  tm_dots(c("Slope", "pH"), size = 0.1) +
  tm_layout(legend.position = c("right","top")) +
  tm_style_col_blind()

So what we see here are the local variations in the degree to which changes in different variables are associated with changes in Sulphur. The maps above are not simply maps of the predictor variables.

References

Brunsdon, C. and Comber, L., 2015. An introduction to R for spatial analysis and mapping. Sage.

Brunsdon, C., Fotheringham, A.S. and Charlton, M. (1996). Geographically Weighted Regression - A Method for Exploring Spatial Non-Stationarity. Geographical Analysis, 28: 281-298.

Cleveland, W. S. (1979) Robust locally weighted regression and smoothing scatterplots. J. American Statistical Association 74, 829–836.

De Smith, M., Goodchild, M. and Longley, P. (2007). Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools. Troubador Publishing Ltd.

Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: the analysis of spatially varying relationships. John Wiley & Sons.

Gollini, I., Lu, B., Charlton, M., Brunsdon, C., & Harris, P. 2015. GWmodel: an R package for exploring spatial heterogeneity using geographically weighted models. Journal of Statistical Software 63 (17), 1–50.

Huber, P.J. (1981) Robust Statistics., Wiley.

Hurley, Catherine B. (2004) Clustering Visualisations of Multidimensional Data, Journal of Computational and Graphical Statistics, 13(4), 788-806.

Tobler, W. (1970). A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(2): 234-240.