This session sets up the main sessions for the rest of the workshop, introducing R and RStudio, some core concepts for working in R, installing and loading packages, and reading and writing data.
R was initially developed by Robert Gentleman and Ross Ihaka of the Department of Statistics at the University of Auckland. R is increasingly becoming the default software package in many areas of science. There are a number of reasons for this:
For these reasons R is becoming widely used in many areas of scientific activity and quantitative research.
R can be found at the CRAN website: https://cran.r-project.org/
There are 2 key points about working in R: enter the code into the R console yourself, and run it as you go. The reasons for this are so that you get used to using the R console, and running the code will help your understanding of the code's functionality. To run code in the R console, a quick way is to highlight the code (with the mouse or using the keyboard controls) and then press Ctrl-R, or Cmd-Enter on a Mac.
There are some core concepts that you should be familiar with:
1. Assignment: this is the basic process of giving R objects values:
vals <- c(4.3,7.1,6.3,5.2,3.2,2.1)
2. Operations: having assigned values to an object, it can be manipulated using functions:
vals*2
## [1] 8.6 14.2 12.6 10.4 6.4 4.2
sum(vals)
## [1] 28.2
mean(vals)
## [1] 4.7
3. Indexing: elements of R objects can be referred to:
vals[1] # first element
## [1] 4.3
vals[1:3] # a subset of elements 1 to 3
## [1] 4.3 7.1 6.3
sqrt(vals[1:3]) #square roots of the subset
## [1] 2.073644 2.664583 2.509980
vals[c(5,3,2)] # a subset of elements 5,3,2 - note the ordering
## [1] 3.2 6.3 7.1
Note you can copy and paste the code above into a new R/RStudio script.
When you install R / RStudio it comes with a large number of tools already (referred to as base functionality).
However, one of the joys of R is the community of users. Users share what they do and create in R in a number of ways. One of these is through packages. Packages are collections of related functions that have been created, tested and supported with help files. These are bundled into a package and shared with other R users via the CRAN repository, from which they can be downloaded.
There are 1000s of packages in R. These contain sets of tools and can be written by anyone, and the number of packages is continually growing. Once installed, packages can be called as libraries. The background to R, along with documentation and information about packages as well as the contributors, can be found at the R Project website http://www.r-project.org.
Packages can be found at the CRAN website - https://cran.r-project.org/web/packages/:
Users install a package once to put it on their computer, and it can then be called in R scripts as required. The basic operations are install.packages() to install a package and the library() function to use the package tools. So a typical way to do this is:
install.packages("<package name>"", dep = TRUE)
For this workshop you will require a number of packages to be installed.
The code below will make sure you have all of the R packages you will need installed. The tests check whether a given package is in the list of installed packages, and install it if not. Package installation only needs to be done once - not every time you use R - so if you are sure you have all of the packages installed already, you can skip this step.
# set a mirror
options(repos = c(CRAN = "http://cran.rstudio.com"))
# test for package existence and install
if (!is.element("tidyverse", installed.packages()))
install.packages("tidyverse", dep = T)
if (!is.element("GISTools", installed.packages()))
install.packages("GISTools", dep = T)
if (!is.element("GWmodel", installed.packages()))
install.packages("GWmodel", dep = T)
if (!is.element("MASS", installed.packages()))
install.packages("MASS", dep = T)
if (!is.element("rgdal", installed.packages()))
install.packages("rgdal", dep = T)
if (!is.element("tmap", installed.packages()))
install.packages("tmap", dep = T)
if (!is.element("e1071", installed.packages()))
install.packages("e1071", dep = T)
if (!is.element("knitr", installed.packages()))
install.packages("knitr", dep = T)
if (!is.element("car", installed.packages()))
install.packages("car", dep = T)
if (!is.element("gclus", installed.packages()))
install.packages("gclus", dep = T)
if (!is.element("kableExtra", installed.packages()))
install.packages("kableExtra", dep = T)
Note that the code above installs each package and all of its dependencies (dep = TRUE). Once we are sure all the packages are installed, you need to load them into the current session:
# load into the R session
library(tidyverse)
library(GISTools)
library(GWmodel)
library(MASS)
library(rgdal)
library(tmap)
library(e1071)
library(knitr)
library(car)
library(gclus)
library(kableExtra)
Also note that packages can also be installed using the RStudio menu:
Tools > Install Packages…
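As an aside, the repeated if statements used above for the installation checks can be written more compactly by looping over a vector of package names - a sketch only (the pkgs name is just illustrative):
pkgs <- c("tidyverse", "GISTools", "GWmodel", "MASS", "rgdal", "tmap",
          "e1071", "knitr", "car", "gclus", "kableExtra")
# install any package in the list that is not already present
for (p in pkgs) {
  if (!is.element(p, installed.packages())) install.packages(p, dep = TRUE)
}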
There are a number of ways of loading data into your R session: from a local file (for example a .csv file), or from an R binary file (with a .rda or .RData extension).
Reading .csv format data
The base installation of R comes with core functions for reading .txt, .csv and other tabular formats. To load data from local files you need to point R / RStudio to the directory that contains the local file. One way is to use the setwd() function as in the example below:
## Mac
setwd("/Users/geoaco/Desktop/")
## Windows
setwd("C:\\")
Another is to use the menu system: Session > Set Working Directory…, which gives you options to choose from.
However you do it, you should now set your working directory to the folder that contains the file GWPCA_SA_2018_Workshopdata.csv and run the code below to load the data:
dat <- read.csv("GWPCA_SA_2018_Workshopdata.csv")
This loads a data table of 1070 rows and 33 columns. You can inspect the data in a few ways.
## dimensions - rows and columns
dim(dat)
## [1] 1070 33
## column / variable names
names(dat)
## [1] "ID" "Easting" "Northing"
## [4] "Field_No" "Field_East_Centroid" "Field_North_Centroid"
## [7] "Farmlet_No" "Soil_Class_No" "Elevation"
## [10] "Slope" "Aspect" "Al"
## [13] "As" "Ca" "Cd"
## [16] "Co" "Cr" "Cu"
## [19] "Fe" "K" "Mg"
## [22] "Mn" "Mo" "Na"
## [25] "Ni" "P" "Pb"
## [28] "S" "Se" "Ti"
## [31] "Zn" "SOM" "pH"
## look at the first 6 rows and 7 columns
dat[1:6,1:7]
## ID Easting Northing Field_No Field_East_Centroid Field_North_Centroid
## 1 1 265632 99300 1 265618.9 99134.54
## 2 2 265625 99275 1 265618.9 99134.54
## 3 3 265650 99275 1 265618.9 99134.54
## 4 4 265600 99250 1 265618.9 99134.54
## 5 5 265625 99250 1 265618.9 99134.54
## 6 6 265650 99250 1 265618.9 99134.54
## Farmlet_No
## 1 3
## 2 3
## 3 3
## 4 3
## 5 3
## 6 3
## use the summary function for the first 7 columns
summary(dat[,1:7])
## ID Easting Northing Field_No
## Min. : 1.0 Min. :265425 Min. :97475 Min. : 1.000
## 1st Qu.: 268.2 1st Qu.:265650 1st Qu.:97975 1st Qu.: 3.000
## Median : 535.5 Median :265850 Median :98250 Median : 6.000
## Mean : 535.5 Mean :265907 Mean :98309 Mean : 6.313
## 3rd Qu.: 802.8 3rd Qu.:266175 3rd Qu.:98650 3rd Qu.: 9.000
## Max. :1070.0 Max. :266528 Max. :99300 Max. :15.000
## Field_East_Centroid Field_North_Centroid Farmlet_No
## Min. :265505 Min. :97629 Min. :1.00
## 1st Qu.:265672 1st Qu.:97968 1st Qu.:1.00
## Median :265885 Median :98313 Median :2.00
## Mean :265908 Mean :98308 Mean :2.01
## 3rd Qu.:266173 3rd Qu.:98604 3rd Qu.:3.00
## Max. :266451 Max. :99135 Max. :3.00
The default for read.csv is that the file has a header (i.e. the first row contains the names of the columns) and that the separator between values in any record is a comma. However, these can be changed depending on the nature of the file you are seeking to load into R. A number of different types of files can be read into R. You should examine the help files for reading data in different formats. Enter ??read to see some of these listed. You will note that read.table and write.table require more parameters to be specified than read.csv and write.csv.
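For illustration, a minimal sketch of reading a tab-delimited text file with read.table, spelling out the header and separator arguments (the file name here is hypothetical):
# read a tab-delimited text file with a header row
# (the file name is purely illustrative)
dat2 <- read.table("some_data.txt", header = TRUE, sep = "\t")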
You can also load R binary files. These have the advantage of being very efficient at storing data and quicker to load than, for example, .csv files:
load("data.wk.rda")
## use ls to see what is loaded
ls()
You should see that a variable called data.wk
has been loaded. It is the same as the data read into dat
. You can explore the data.wk
R object if you want to using the functions above that were applied to dat
.
Writing .csv format data
Data can be written into a Comma Separated Variable (.csv) file using the command write.csv and then read back into a different variable, as follows:
write.csv(dat, file = "new_data.csv")
This writes a .csv file into the current working directory. If you open it using a text editor or spreadsheet software, you will see that it has the expected columns plus an index for each record. This is because the default for write.csv includes row.names = TRUE. Again, examine the help file for this function.
write.csv(dat, file = "test.csv", row.names = F)
It is possible to save variables that are in your workspace to a designated .rda
or .RData
file. This can be loaded at the start of your next session. Saving your workspace saves everything that is present in your workspace - as listed by ls()
- whilst the save
command allows you to specify what variables you wish to save.
There are a number of ways to do this. The first is to use the save.image function:
save.image(file = "myworkspace.RData")
The second is to use the save function:
# this will save everything in the workspace
save(list = ls(), file = "MyData.RData")
# this will save just dat
save(list = "dat", file = "MyData.RData")
# this will save dat and data.wk
save(list = c("dat", "data.wk"), file = "MyData.RData")
The dat
data is of data.frame
format. The code below assigns a subset of the data variables to df
, a tibble
class of data frame:
# have a look!
class(dat)
# assign selected variables to df
df <- as_tibble(dat[, c(1:3, 28, 14, 19, 26, 32, 33, 10, 11)])
The df object is in tibble format and contains a number of variables: Sulphur (S), Calcium (Ca), Iron (Fe), Phosphorus (P), soil organic matter (SOM), pH, slope (Slope) and aspect (Aspect), plus an ID and the location of each observation (Easting and Northing). To see this in R, enter
head(df)
Our aim in this practical is to build a model that describes the Sulphur values (S
) using the factors listed above. We will use this to understand a number of key features in regression.
We should have a look at the data to see what kinds of properties it has - we could use previously introduced functions such as summary
to summarize the attributes. You may wish to use this now. But in terms of model selection, what we are interested in is the degree to which different variables are good predictors of sulphur.
In order to examine the relationships between the variables in df
we can plot the continuous data against each other and use the cor
function to examine collinearity amongst them:
plot(df[, -c(1:3)], cex = 0.5, col = grey(0.145,alpha=0.5))
Correlations between numeric variables.
Another useful tool here is to show the upper triangle of the scatterplot matrix with smoothed trend lines. These are achieved with lowess curve fits (Cleveland, 1979) - smooth (but possibly curved) bivariate trend lines - and provide a good way of judging by eye whether there is collinearity in the predictor variables. Essentially a straight-line shaped trend without too much scattering of the individual points suggests collinearity might be an issue. When two predictors are correlated it can be difficult to identify whether it is one or the other (or both) that influences the quantity to be predicted. The code below does this. Note that as collinearity amongst the predictors is the key concern here, we focus on these, ignoring S for this graphic. The argument upper.panel=panel.smooth causes the lowess curves to be added.
plot(df[,-c(1:4)], cex = 0.5, col = grey(0.145,alpha=0.5), upper.panel=panel.smooth)
Visually, it seems that there may be some collinearity between Ca, pH and P, and between Slope and P, for example.
A numerical table of these relationships as correlations can be generated with the kable
function:
library(knitr)
tab <- round(cor(df[,-c(1:3)]), 3)
kable(tab, booktabs = T, format = 'markdown')
 | S | Ca | Fe | P | SOM | pH | Slope | Aspect |
---|---|---|---|---|---|---|---|---|
S | 1.000 | 0.614 | -0.436 | 0.690 | 0.808 | 0.062 | 0.455 | 0.256 |
Ca | 0.614 | 1.000 | -0.010 | 0.662 | 0.517 | 0.546 | 0.289 | 0.101 |
Fe | -0.436 | -0.010 | 1.000 | 0.016 | -0.263 | 0.261 | 0.018 | -0.083 |
P | 0.690 | 0.662 | 0.016 | 1.000 | 0.679 | 0.167 | 0.367 | 0.149 |
SOM | 0.808 | 0.517 | -0.263 | 0.679 | 1.000 | 0.032 | 0.395 | 0.183 |
pH | 0.062 | 0.546 | 0.261 | 0.167 | 0.032 | 1.000 | 0.089 | 0.208 |
Slope | 0.455 | 0.289 | 0.018 | 0.367 | 0.395 | 0.089 | 1.000 | 0.133 |
Aspect | 0.256 | 0.101 | -0.083 | 0.149 | 0.183 | 0.208 | 0.133 | 1.000 |
What we are looking for in the plots, and to be confirmed in the table, are correlations that look like straight lines, showing that values in one variable change with changes in another.
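To pick out the stronger relationships from the table programmatically, a short illustrative sketch using the tab matrix computed above (the 0.6 threshold is an arbitrary choice, and each pair appears twice because the matrix is symmetric):
# indices of the off-diagonal cells with |correlation| > 0.6
high <- which(abs(tab) > 0.6 & abs(tab) < 1, arr.ind = TRUE)
# list the variable pairs and their correlations
data.frame(var1 = rownames(tab)[high[, 1]],
           var2 = colnames(tab)[high[, 2]],
           r = tab[high])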
A final useful tool here is Catherine Hurley's gclus package. This is similar to the data frame plotting, but provides tools to reorder the variables so that the most strongly correlated pairs are closer to the diagonal in the plots. It can also colour-code the plots to show whether the correlations are positive (red), negative (blue) or near to zero (yellow). For this plot we focus on the predictors again. This is shown below:
library(gclus)
cor.mat <- cov(df[,-c(1:3)])
cpairs(df[,-c(1:3)],order.single(cor.mat),dmat.color(cor.mat))
The order of the variables in the diagonal generally puts positively correlated variables near to each other as much as possible. The ordering here is based on a clustering method (single linkage) provided via the order.single
function in gclus
- the colouring is based on correlations (via the dmat.color
) function in gclus
.
Note, you could also write the correlation table out to a .csv file:
write.csv(tab, file = "tab.csv")
The plot and the table above of the correlations between the variables show that only SOM is highly correlated (< -0.8 or > +0.8) with the response variable S, with the other variables correlated less strongly. This provides us with confidence that we can construct a reasonable model from this data - at least some of the variables are correlated with S - but care may be needed in interpreting some of the regression parameters due to some collinearity in the predictors.
We are interested in a predictive model of sulphur and it is useful to see how that value varies. The code below plots the histogram of the variable S. Note the use of the parameter prob: left at its default (FALSE) the histogram shows frequency counts, while prob=T computes probabilities for each histogram bin instead. This makes the \(y\)-axis compatible with a kernel density estimate (a smooth estimate of the probability distribution of \(x\)), so the two may be overlaid. Here this is done by using the lines function with a kernel density estimate for the distribution given by density. A rug plot is added to show the exact locations of the values on the \(x\)-axis.
The graphs were obtained with the following code:
hist(df$S, xlab='Sulphur (S)', main='Histogram of Sulphur', col = 'DarkRed')
hist(df$S, prob=T, xlab='Sulphur (S)', main='Histogram of Sulphur', col='NavyBlue')
lines(density(df$S,na.rm=T),col='salmon',lwd=2)
rug(jitter(df$S))
# qqPlot(df$S,main='Normal QQ plot of Sulphur')
So we have some data. We have had a look at some of the variables, seen how some of them relate to the variable we are interested in predicting, S
, how they correlate with each other, and perhaps we have even started to think about hypotheses that we might wish to test in our model.
We will start with a hypothesized model of Sulphur is associated with Calcium (Ca
), Iron (Fe
), Phosphorus (P
), Soil Organic Matter (SOM
), pH, slope (Slope
) and landscape aspect (Aspect
). In R we can construct a regression model using the function lm
which stands for linear model. This is a standard ordinary regression function; although simple, it is very powerful and there are many variants and functions for manipulating lm
outputs: remember R was originally a stats package.
The equation for a regression is:
\(y_i = \beta_{0} + \sum_{k=1}^{m}\beta_{k}x_{ik} + \epsilon_i \textsf{ for } i \in 1 \cdots n\)
where \(y_i\) is the response variable for observation \(i\), \(x_{ik}\) is the value of the \(k^{\textsf{th}}\) predictor variable for observation \(i\), \(m\) is the number of predictor variables, \(\beta_{0}\) is the intercept term, \(\beta_{k}\) is the regression coefficient for the \(k^{th}\) predictor variable, \(\epsilon_i\) is the random error term for observation \(i\) and \(n\) is the number of observations.
In our case, the response variable \(y\) is sulphur (S
), and the 5 chemical variables and 2 landscape variables Ca
, Fe
, P
, SOM
, pH
, Slope
and Aspect
are the different predictor variables, the \(k\) different \(x\)’s.
The basic formula in R to specify a linear regression model, lm
is as follows:
model <- lm(variable_we_are_trying_to_predict ~ predictor_variable1 +
predictor_variable2...)
where the ~ sign (tilde) separates the variable we are trying to model (the response variable) from the variables we will use to predict it, the predictor variables. The predictor variables are usually separated by a +
sign - meaning that they are linear terms in the model added together.
We can specify our first model.
m <- lm(S~Ca+Fe+P+SOM+pH+Slope+Aspect, data = df)
And look at its summary attributes
summary(m)
##
## Call:
## lm(formula = S ~ Ca + Fe + P + SOM + pH + Slope + Aspect, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -333.55 -33.16 -0.40 31.99 285.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.466e+02 6.843e+01 10.910 < 2e-16 ***
## Ca 6.213e-02 4.790e-03 12.972 < 2e-16 ***
## Fe -8.236e-03 3.765e-04 -21.874 < 2e-16 ***
## P 1.152e-01 1.186e-02 9.710 < 2e-16 ***
## SOM 1.952e+01 1.021e+00 19.118 < 2e-16 ***
## pH -7.060e+01 1.335e+01 -5.289 1.49e-07 ***
## Slope 9.125e+00 8.159e-01 11.184 < 2e-16 ***
## Aspect 1.157e-01 1.481e-02 7.814 1.33e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.45 on 1062 degrees of freedom
## Multiple R-squared: 0.8357, Adjusted R-squared: 0.8346
## F-statistic: 771.5 on 7 and 1062 DF, p-value: < 2.2e-16
There are lots of things to look at in the summary of the model but we will focus on 3 aspects:
First, Significance. If we examine the model we can see that all are significant predictors of Sulphur. The level of significance is indicated by the \(P\)-value in the column \(Pr(>|t|)\) and the asterisks \(*\). For each coefficient \(\beta_k\) these are the significance levels associated with the hypothesis \(\textsf{H}_0 : \beta_k = 0\) against the alternative \(\textsf{H}_1 : \beta_k \ne 0\) - this corresponds to the probability that one would obtain a \(t\)-statistic for \(\beta_k\) at least as large as the one observed given \(\textsf{H}_0\) is true.
It is a measure of how surprising the relationship is and indicative of whether the relationship between \(x_k\) and \(y\) found in this data could have been found by chance. Very small values suggest that it is very unlikely that the level of association found here could have come from a random sample of data if \(\beta_k\) really is zero. Cutting to the chase, small \(P\)-values mean strong evidence for a non-zero regression coefficient.
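If you want to work with these quantities directly, the coefficient table (estimates, standard errors, \(t\)-values and \(P\)-values) can be extracted from the summary of the model m fitted above - a quick illustration:
# the full coefficient table from the model summary
coef(summary(m))
# just the P-values
coef(summary(m))[, "Pr(>|t|)"]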
So in this case we can say that all of the predictors (Ca, Fe, P, SOM, pH, Slope and Aspect) are significantly associated with changes in S at the \(<0.001\) level - that is, the associated \(P\)-values are less than \(0.001\). So we can be very confident that the observed relationships between these variables and Sulphur have not arisen by chance and can confidently reject the null hypothesis in each case.
Second, the Coefficient Estimates. These are the \(\beta_{k}\) values that tell you the rate of change between \(x_k\) and \(y\), given in the column 'Estimate' in the summary table. The first of these is the Intercept - this is \(\beta_0\) in the regression equation and tells you the value of \(y\) if \(x_1 \cdots x_k\) are all zero.
Remember regression is like taking lots of \(x_k\) and \(y\) values and seeing how well they are correlated. The regression fits a line between them. So in the figure below a regression model with just one \(x\) is defined and summarised, and then the \(x\) and \(y\) values are plotted with the fitted relationship overlaid as a line.
m2 <- lm(df$S~df$SOM)
summary(m2)
##
## Call:
## lm(formula = df$S ~ df$SOM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -158.12 -50.74 -13.96 36.50 463.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 168.2686 10.3170 16.31 <2e-16 ***
## df$SOM 43.1110 0.9616 44.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 74.56 on 1068 degrees of freedom
## Multiple R-squared: 0.653, Adjusted R-squared: 0.6527
## F-statistic: 2010 on 1 and 1068 DF, p-value: < 2.2e-16
plot(df$SOM, df$S, xlab = "x = SOM", ylab = "y = Sulphur")
abline(m2, col = "red", lty = 2, lwd = 2)
Here we can see that the intercept of the relationship is 168.3, and that the rate of change coefficient between S and SOM, when they are modelled on their own, is 43.1 (the fitted line shows this). This means that each increase of 1 in SOM is associated with an increase of about 43.1 in sulphur.
So returning to our model m
, we can now see how changes of 1 unit for the different predictor variables are associated with changes in S
. The association of P
is positive but small, the association of SOM
is reduced due to interactions with other \(x\)’s but is still positive, the pH
and Fe
variables are strongly negatively associated with S
.
So the regression model m
gives us an idea of how different factors are associated in different ways with sulphur in the study area.
Third, the \(R^2\) value. This provides a measure of model fit. It is calculated from the differences between the actual values of \(y\) (e.g. of S) and the values predicted by the model - sometimes written as \(\hat{y}\) (called '\(y\)-hat'). A good description is here: http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit. Generally \(R^2\) values of greater than 0.7 indicate a reasonably well fitting model. Some users of this method describe low \(R^2\) as 'bad' - they are not - it just means you have a more imprecise model - the predictions are a bit wrong-er! Essentially there is a sliding scale of reliability of the predictions - based on the relative amounts of deterministic and random variability in the system being analysed - and \(R^2\) is a measure of this.
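The \(R^2\) reported in the model summary can be extracted directly, or reproduced by hand from the observed and fitted values - a short sketch:
# extract R-squared directly from the model summary
summary(m)$r.squared
# or compute it by hand: 1 - residual sum of squares / total sum of squares
y.hat <- fitted(m)
1 - sum((df$S - y.hat)^2) / sum((df$S - mean(df$S))^2)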
Finally, we are also interested in outliers - these are records that the model does not describe very well, either vastly over-predicting Sulphur or under-predicting it. When the bivariate regression was considered earlier, these were easy to find, since they were just the points lying a long way from the regression line. However, when there are multiple predictors, relationships cannot be encapsulated in two dimensional space.
For this situation, a useful diagnostic is the studentised residual. For different values of predictors, the expected difference between the actual and fitted \(y\) values has a different variance even if the theoretic model gives \(\epsilon_i\) a constant variance. This discrepancy is due to issues such as collinear predictors. Standardised variables attempt to compensate for this. Plotting these and looking for notably high or low values is a helpful way of spotting outliers in the regression model.
A simple way of doing this is just to plot the residuals in order:
s.resids <- rstudent(m)
plot(s.resids,type='h')
abline(h=c(-2,2),col='lightskyblue')
Here we see some residuals are relatively high. Normally, one might expect standardised residuals to lie between -2 and 2. Adding lines to this plot helps to identify when this is not the case - as is done above.
Another useful job is to identify which are the outlying cases - here just done as a basic table (via kable
)
df_out <- data_frame(Name=df$ID,Residual=s.resids)
df_out <- df_out[abs(df_out$Residual)>2,]
kable(df_out)
Name | Residual |
---|---|
7 | -2.091382 |
17 | -2.033073 |
29 | -2.537072 |
41 | -2.044278 |
131 | -2.620754 |
211 | 3.775532 |
220 | 2.160951 |
229 | 2.903274 |
242 | 2.011449 |
318 | 2.745407 |
340 | 2.775047 |
428 | 2.466755 |
441 | 2.728331 |
443 | 2.786703 |
453 | 2.387636 |
454 | 2.068413 |
463 | 2.516256 |
464 | 2.120153 |
465 | 4.586674 |
524 | 2.141880 |
662 | -2.021370 |
677 | 2.196887 |
919 | 2.318173 |
932 | -2.730475 |
964 | -2.252646 |
986 | -2.243018 |
992 | -2.085380 |
993 | -2.014177 |
1002 | -2.432719 |
1006 | -2.996939 |
1009 | -4.382376 |
1010 | -2.703729 |
1012 | -2.089864 |
1013 | -2.359771 |
1014 | -2.369609 |
1019 | -2.560585 |
1047 | 5.722086 |
1060 | -2.092619 |
1065 | -2.056205 |
1069 | -2.207681 |
1070 | -8.819858 |
Negative values here indicate observations with much lower sulphur than the model predicts; positive values suggest it is much higher.
Finally, we need to check another model assumption in lm
- that the error terms \(\epsilon_i\) are normally distributed. The best way to do this is via a qq-plot, which plots the empirical quantiles of a variable whose distribution we are interested in against the quantiles of the theoretical distribution we are testing for. If the variable is from this distribution, the results should yield a straight line - frequently a reference line is provided to test this. Here s.resids
is the variable of interest - the studentised residuals, and these are compared to the \(t\) distribution. Although the \(\epsilon_i\)’s have a normal distribution, theory shows that the studentised residuals derived from a sample coming from this model have a \(t\) distribution. However, the qqPlot
function from the package car
does all the work, and requires only the regression model object to create a plot:
qqPlot(m)
This demonstrates that the residuals mostly do conform to this, with a few extreme outliers at either end - labelled in the plot. Consulting the earlier table suggests these include IDs 1009, 1047, 465 and 1070.
It is also important to consider spatial patterns in the residuals, from a geographical viewpoint. The code below plots the outliers and their spatial distributions:
# determine residuals
df$s.resids = rstudent(m)
# 1.5 standard deviations
resid.shades = shading(c(-1.5,1.5),c("indianred1","antiquewhite2","powderblue"))
# define a color scheme
cols = resid.shades$cols[1 + findInterval(df$s.resids, resid.shades$breaks)]
# set plot paramters
par(mar = c(0,0,2,0))
# make some spatial data
# define a projection
os.proj <- CRS("+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +datum=OSGB36 +units=m +no_defs ")
df.sp <- SpatialPointsDataFrame(coords = df[, 2:3], data = data.frame(df),
proj4string = os.proj)
# have a look at the spatial distribution of outliers
choropleth(df.sp,df$s.resids,resid.shades, main = "Model Outliers", pch = 19, cex = 0.7)
The regression session finishes here, but you will need the results of this session later on. To save all of the variables, regression objects and so on from this session, enter
save.image(file='session2.RData')
This saves an image of all of the R objects currently defined in a file called session2.RData
- later on you will be able to load this again via the load
function.
You will have noticed that the last few lines of code above introduced some spatial data to do the mapping. We do not have time to go into this fully here but in this section we will introduce some spatial data formats and how to read data in and out of R.
In the preceding description of data each of the records (rows) relates to some kind of real world feature (a person, a transaction, a date, etc) and the columns represent some attribute associated with that feature. In spatial data, each record typically represents some real world geographical feature - a place, a route, a region etc. In fact geographic data typically represent points, lines or areas.
There is some key background information to working with spatial data in R. R is dynamic - things don't stay the same. New tools, packages and functions are constantly being produced and updated to improve and develop them. In most cases this is not problematic, as an update almost always extends the functionality of the package without affecting the original code. However, in a few instances, specific packages are completely re-written without backwards compatibility. If this happens then R code that previously worked may not work with the new package, as the functions may take different parameters and arguments. However, there is usually a period of a few package versions before the old code stops working altogether.
Occasionally a completely new paradigm is introduced and this has been the case recently for spatial data in R with the advent of the sf
package.
Much of the code for handling spatial data in this workshop is built around data structures defined in the sp
package, which defines a family of spatial objects in R: SpatialPoints, SpatialLines and SpatialPolygons, together with their *DataFrame counterparts (e.g. SpatialPointsDataFrame) that attach a table of attributes to the geometries.
The sp package underpins many of the packages that you have loaded: either directly or indirectly (i.e. loaded by other packages), they have dependencies on sp. The GISTools package, for example, implicitly links to maptools, sp, rgeos etc. You may have noticed these packages being listed as GISTools loads.
However, recently a new class of spatial object has been defined called sf, or Simple Features, that seeks to encode spatial data in a way that conforms to a formal standard (ISO 19125-1:2004). This emphasises the spatial geometry of objects and the way that objects are stored in databases. In brief, the aim of the team developing sf (actually they are the same people who developed sp, so they do know what they are doing!!) is to provide a standardised format for spatial data. An overview of the evolution of spatial data in R can be found at https://edzer.github.io/UseR2017/.
Their idea is that a feature is a thing, or an object in the real world, such as a building or a tree. As is the case with objects, they often consist of other objects. This is the case with features too: a set of features can form a single feature. A forest stand can be a feature, a forest can be a feature, a city can be a feature. A satellite image pixel can be a feature, a complete image can be a feature too. Features have a geometry describing where on Earth the feature is located, and they have attributes, which describe other properties. There are many sf geometry types, but the key ones (similar to points, lines and areas) are POINT, LINESTRING and POLYGON, together with their MULTI* variants (see the sf vignette).
Ultimately, sf formats will completely replace sp, and packages that use sp (such as GWmodel for GWR) will all have to be updated to use sf at some point, but that is a few years away.
So… in this session we will introduce both sp and sf, as well as different mapping routines (with the plot function and with the tmap package), and the tools to move between them. The later sessions on regression and spatial interaction models will mainly use sp.
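As an illustration of moving between the two formats (a sketch, assuming the sf package is installed), an sp object such as the df.sp object created earlier can be converted to sf and back:
library(sf)
# sp -> sf
df.sf <- st_as_sf(df.sp)
class(df.sf)
# sf -> sp
df.sp2 <- as(df.sf, "Spatial")
class(df.sp2)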
As we are working with spatial data it seems appropriate to consider how to get spatial data in and out of R.
Very often we have data that is in a particular format such as shapefile format. R has the ability to read and write data from and to many different spatial data formats using functions in the rgdal
and sf
packages - we will consider them both here:
However, as and when the sp package and data formats become deprecated, so too will their functions for reading and writing spatial data. For now, the rgdal package includes 2 generic functions for reading and writing all kinds of spatial data - pretty much you name the format and it has a driver for it. These are:
readOGR()
writeOGR()
Spatial data can be read into R and assigned to a variable using the readOGR
function:
fields <- readOGR("FP_Fields_Fenced_post13082013.shp")
## OGR data source with driver: ESRI Shapefile
## Source: "FP_Fields_Fenced_post13082013.shp", layer: "FP_Fields_Fenced_post13082013"
## with 20 features
## It has 13 fields
If you enter:
class(fields)
You will see that the fields object is a SpatialPolygonsDataFrame - one of the sp classes of spatial object.
The fields object is in sp format - we can write it out to a shapefile using the writeOGR() function as follows:
writeOGR(obj=fields, dsn=".", layer="fields", driver="ESRI Shapefile")
You will see that a shapefile has been written into your current working directory, with its associated supporting files (.dbf, etc.) that can be recognised by other applications (QGIS etc.). You should examine the writeOGR and readOGR functions in the rgdal package. R is also able to read and write other proprietary spatial data formats using a number of packages, which you should be able to find through a search of the R help system or via an internet search engine. The rgdal package is the R interface to the Geospatial Data Abstraction Library (GDAL) and includes a number of methods for reading and writing spatial objects, including to and from SpatialXDataFrame objects.
The full syntax can be important - the code below overwrites any existing similarly named file:
writeOGR(fields, dsn = ".", layer = "fields",
driver="ESRI Shapefile", overwrite_layer = T)
The dsn
parameter is important here: for shape files it determines the folder the files are written to. In the above example it was set to "."
which places the files in the current working directory. You could specify a file path here.
For a PC this might be something like: D:\MyDocuments\MyProject\DataFiles
And for a Mac: /Users/lex/my_docs/project
The setwd()
and getwd()
functions can be used in determining and setting the file path. You may want to set the file path and then use the dsn
setting as above:
setwd("/Users/lex/my_docs/project")
writeOGR(fields, dsn = ".", layer = "fields",
driver="ESRI Shapefile", overwrite_layer = T)
Or you could use the getwd()
function, save the results to variable and pass this to writeOGR
:
td <- getwd()
writeOGR(fields, dsn = td, layer = "fields",
driver="ESRI Shapefile", overwrite_layer = T)
You should also examine the functions for reading and writing raster layers in rgdal
which are readGDAL
and writeGDAL
. These read and write functions in rgdal
are incredibly powerful and can read / write almost any format of spatial data.
Spatial data can be also be read in and written out using the sf
functions:
st_read()
st_write()
They work in a similar way - we do not have the time to examine them in detail here.
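For completeness, a minimal sketch of the sf equivalents applied to the same shapefile (the output file name fields_sf.shp is just illustrative):
library(sf)
# read the shapefile into an sf object
fields.sf <- st_read("FP_Fields_Fenced_post13082013.shp")
# write it back out (delete_layer = TRUE overwrites an existing layer)
st_write(fields.sf, "fields_sf.shp", delete_layer = TRUE)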
tmap
The tmap package is for thematic visualization, with a focus on mapping the spatial distribution of data attributes. It has a similar grammar to plotting with ggplot in that it seeks to handle each element of the map separately in a series of layers and in so doing to exercise control over each element. This is different to the basic plot functions used above to map sp and sf data. In earlier sections qtm() was used to compose a quick map and now we will expand on this more fully. The code in this section will use the fields and df.sp datasets.
A number of plot parameters additional to the ones that have previously been used exist, including different window sizes, multiple plots in the same window, polygon or area shading, hatching, boundary thickness, boundary colour, labeling etc.
A useful introductory overview of tmap
can be found in the tmap-nutshell
vignette. You can uses this later to get some further ideas about mapping options:
vignette("tmap-nutshell")
Like ggplot
the key point is that tmap
is invoked with tm_shape
and then the kind of mapping is specified with further Aesthetics commands to control and specify the base layers:
base | description |
---|---|
tm_polygons | Create a polygon layer (with borders) |
tm_symbols | Create a layer of symbols |
tm_lines | Create a layer of lines |
tm_raster | Create a raster layer |
tm_text | Create a layer of text labels |
Further aesthetics for derived layers are described in
help(tmap)
The code below recreates and augments the map above created with the choropleth function in GISTools, this time with field boundaries.
tm_shape(df.sp) +
tm_dots("s.resids", size = 0.1, breaks = c(-10,-1.5,2,10), palette = "RdBu") +
tm_shape(fields) +
tm_borders()
So what you can see in the above code are two sets of tmap plot commands: the first set plots the df.sp dataset, specifying dots to show the data locations, a size and some breaks. The second set adds the field outlines.
It is also possible to plot multiple different maps from different datasets together but this requires a bit more control over the tmap
parameters.
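One way to do this (a sketch, assuming the tmap_arrange function is available in your version of tmap; the object names are illustrative) is to build each map as an object and then arrange them side by side:
# build each map as an object...
map.resids <- tm_shape(df.sp) + tm_dots("s.resids", size = 0.1)
map.fields <- tm_shape(fields) + tm_borders()
# ...and arrange them side by side
tmap_arrange(map.resids, map.fields)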
One of the great features of tmap
is its ability to include leaflet
plots. This can be done by setting the tmap_mode
parameter:
tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(df.sp) +
tm_dots("s.resids", breaks = c(-10,-1.5,2,10), palette = "RdBu") +
tm_shape(fields) +
tm_borders()
Try changing the layers icon - the Esri.WorldTopoMap options! This is where Harry works - on a little farm at the end of the world!!! Note the street name of Cocktree Throat - I thought this was an illness not a place!
Earlier in the program we looked at basic regression. As a reminder, this was a way of modelling the relationship between a response variable \(y_i\) and a number of predictor variables \(x_{i1} \cdots x_{im}\), for each observation \(i\) (\(i = 1 \cdots n\)). The basic model used was
\[ y_{i} = \beta_0 + \sum_{k=1}^{k=m} \beta_k x_{ik} + \epsilon_i \] so for each observation, the relationship is linear plus a random error part, where each \(\epsilon_i\) has an independent normal distribution with mean zero and variance \(\sigma^2\). However the assumption that the relationship is linear, and that the error terms have identical and independent normal distributions, is quite often a consequence of wishful thinking rather than a rigorously derived conclusion. Although quite often these assumptions resemble reality enough to allow predictions to be of some use, changing some of the model assumptions would certainly result in more plausible results.
To take a step closer to reality, it is a good idea to look beyond these assumptions to obtain more reliable predictions, or gain a better theoretical understanding. To mitigate the problems caused by the inappropriate use of simple linear regression (or OLS - Ordinary Least Squares), two general approaches are used: (1) identifying and handling unusual observations (outliers), for example by downweighting them; and (2) relaxing the assumption that the relationships are the same everywhere, for example by allowing the regression coefficients to vary over space.
Here we will look at some examples of these.
This relates to point 1 above. Most of this session will relate to examples linking to point 2 - but it is useful to consider at least one example of the first kind. To investigate this example, we will need to recall the results from earlier. Recall that the results of the linear regression are stored in m
-
summary(m)
##
## Call:
## lm(formula = S ~ Ca + Fe + P + SOM + pH + Slope + Aspect, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -333.55 -33.16 -0.40 31.99 285.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.466e+02 6.843e+01 10.910 < 2e-16 ***
## Ca 6.213e-02 4.790e-03 12.972 < 2e-16 ***
## Fe -8.236e-03 3.765e-04 -21.874 < 2e-16 ***
## P 1.152e-01 1.186e-02 9.710 < 2e-16 ***
## SOM 1.952e+01 1.021e+00 19.118 < 2e-16 ***
## pH -7.060e+01 1.335e+01 -5.289 1.49e-07 ***
## Slope 9.125e+00 8.159e-01 11.184 < 2e-16 ***
## Aspect 1.157e-01 1.481e-02 7.814 1.33e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.45 on 1062 degrees of freedom
## Multiple R-squared: 0.8357, Adjusted R-squared: 0.8346
## F-statistic: 771.5 on 7 and 1062 DF, p-value: < 2.2e-16
However, looking at the studentised residuals suggested a few outliers that might 'throw' the regression coefficient estimates. One way to investigate this is to remove the outlying observations and re-fit the model. The removal is best done using the dplyr approach and the filter function, before re-fitting the model:
df %>% filter(abs(rstudent(m)) <= 2) -> df2
m2 <- lm(S~Ca+Fe+P+SOM+pH+Slope+Aspect, data = df2)
This done, the compareCoefs
function in car
lets the results be compared:
kable(compareCoefs(m,m2,print=FALSE), digits=3)
 | Est. 1 | SE 1 | Est. 2 | SE 2 |
---|---|---|---|---|
(Intercept) | 746.569 | 68.429 | 733.611 | 62.113 |
Ca | 0.062 | 0.005 | 0.079 | 0.006 |
Fe | -0.008 | 0.000 | -0.009 | 0.000 |
P | 0.115 | 0.012 | 0.103 | 0.011 |
SOM | 19.516 | 1.021 | 18.392 | 0.931 |
pH | -70.596 | 13.348 | -65.865 | 12.280 |
Slope | 9.125 | 0.816 | 7.902 | 0.709 |
Aspect | 0.116 | 0.015 | 0.109 | 0.013 |
It may be seen that some of the values alter notably (by around 10%) between the models with and without outliers. Outliers do have an influence.
However, it could be argued that the approach of removing them entirely is misleading - essentially trying to pass off a model as universal but intentionally ignoring observations that contradict it. Although that might be quite extreme, the point holds to some extent. Those observations do have those values, and so should perhaps not be ignored entirely.
Robust regression offers a more nuanced approach. Here, instead of simply ignoring the outliers, it is assumed that for whatever reason, they are subject to much more variance than the remaining observations - possibly due to some characteristic of the sampling process, or - more likely here - due to some unusual circumstances in the geographical locality. Rather than totally ignoring such observations, they are downweighted to allow for this. This is less extreme than removing outliers - which in a sense is extreme downweighting to the value zero!
There are a number of approaches to robust regression - here we will use one proposed by Huber (1981). In outline the algorithm works as follows: fit an initial (ordinary least squares) model; compute the residuals; give each observation a weight that decreases as its absolute residual grows beyond a threshold; re-fit the model using weighted least squares; and repeat until the coefficient estimates converge.
This is all implemented in the rlm
function in the package MASS
. It is very easy to use - once you have loaded MASS
it just works like lm
:
library(MASS)
m3 <- rlm(S~Ca+Fe+P+SOM+pH+Slope+Aspect, data = df)
m3
## Call:
## rlm(formula = S ~ Ca + Fe + P + SOM + pH + Slope + Aspect, data = df)
## Converged in 7 iterations
##
## Coefficients:
## (Intercept) Ca Fe P SOM
## 726.727575457 0.079668225 -0.008732398 0.103818551 18.153368017
## pH Slope Aspect
## -66.944296245 8.416860001 0.107198782
##
## Degrees of freedom: 1070 total; 1062 residual
## Scale estimate: 47.4
Here the coefficient estimates are printed out, and it may also be seen that the model converged in 7 iterations, and that the scale factor \(\tau\) was estimated as 47.4. As with lm
you can print out a summary:
summary(m3)
##
## Call: rlm(formula = S ~ Ca + Fe + P + SOM + pH + Slope + Aspect, data = df)
## Residuals:
## Min 1Q Median 3Q Max
## -496.6919 -32.5874 0.9519 30.9568 311.2977
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 726.7276 65.2317 11.1407
## Ca 0.0797 0.0046 17.4480
## Fe -0.0087 0.0004 -24.3289
## P 0.1038 0.0113 9.1837
## SOM 18.1534 0.9731 18.6551
## pH -66.9443 12.7241 -5.2612
## Slope 8.4169 0.7778 10.8213
## Aspect 0.1072 0.0141 7.5916
##
## Residual standard error: 47.38 on 1062 degrees of freedom
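To see the downweighting described above in action, the final weights used by rlm are stored in the fitted object as the w component: observations with weights well below 1 are the ones that have been treated as outliers. A quick illustrative check:
# the ten smallest weights - the most heavily downweighted observations
head(sort(m3$w), 10)
# the IDs of those observations, for comparison with the outlier table earlier
df$ID[order(m3$w)[1:10]]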
Finally, coefficients from all three approaches can be compared:
coef_tab <- data_frame(OLS=coef(m),`Outliers Deleted`=coef(m2),Robust=coef(m3))
kable(coef_tab,digits=3)
OLS | Outliers Deleted | Robust |
---|---|---|
746.569 | 733.611 | 726.728 |
0.062 | 0.079 | 0.080 |
-0.008 | -0.009 | -0.009 |
0.115 | 0.103 | 0.104 |
19.516 | 18.392 | 18.153 |
-70.596 | -65.865 | -66.944 |
9.125 | 7.902 | 8.417 |
0.116 | 0.109 | 0.107 |
For some coefficients, the robust estimate is closer to the standard OLS, and for others, it is closer to the ‘outliers deleted’ estimate - however in others, it differs from OLS, but not in the same direction as the ‘outliers deleted’ case. As a precautionary approach, I suggest working with the robust approach if there are some notable outliers.
Regression of some kind underpins many statistical models. However, they often make one massive assumption: that the relationships between the \(x_k\)'s and \(y\) are the same everywhere. Regression models (and most other models) are aspatial and, because they consider all of the data in the study area in one universal model, they are referred to as Global models. For example, in the models seen in this program so far the assumption is that the contributions made by the different variables in predicting S
are the same across the study area. In reality this assumption of spatial invariance is violated in many instances when geographic phenomena are considered.
For these reasons, as a geographer….
In statistical terminology… I am interested in spatial non-stationarity in relationships and process spatial heterogeneity.
What this means is that instead of just one general relationship for, for example, the association between changes in SOM
and S
, I might expect that relationship (as described in the coefficient estimate for \(\beta_k\)) to be different in different places. This is a core concept for geographical data analysis. A key question is “how can such a model be calibrated” - a clue lies in the next subsection…
everything is related to everything else, but near things are more related than distant things. (Tobler, 1970)
Here, Tobler is arguing (although by his own admission he used the term in a tongue in cheek way) that geographically located phenomena that are near to one another tend to have similar characteristics - or that if this were not the case -
“the full range of conditions anywhere on the Earth’s surface could in principle be found packed within any small area. There would be no regions of approximately homogeneous conditions to be described by giving attributes to area objects. Topographic surfaces would vary chaotically, with slopes that were everywhere infinite, and the contours of such surfaces would be infinitely dense and contorted. Spatial analysis, and indeed life itself, would be impossible.” (de Smith et al 2007, p44)
Here, the idea informs the technique known as Geographically Weighted Regression or GWR (Brunsdon et al 1996). If the relationships between the \(x_k\)'s and \(y\) vary spatially, one might at least expect that the relationship between nearby sets of observations should be consistent. Suppose we assign a value of each regression coefficient to a location \((u,v)\) in space (e.g. \(u\) is an easting and \(v\) a northing), so that we denote it \(\beta_k(u,v)\). Then we could estimate each \(\beta_k(u,v)\) by calibrating a weighted regression model where \(w_i\) was 1 if an observation was within some radius \(h\) of \((u,v)\) and 0 otherwise.
In this way, an entire surface of coefficients could be estimated. It may be imagined as being generated by a moving circular window scanning across a map of the locations of the observations - in this case the soil sample locations. However, 0/1 weighting is quite crude, and one might expect data from the periphery of the circular window to have less influence than data near the middle.
Thus, critical to the operation of GW models is a moving kernel that downweights peripheral observations. The kernel moves through the study area (for example to cells in a predefined grid) and at each location computes a local model. It uses the data under the kernel to construct a (local) model at that location, with the data weighted so that points further away from the kernel centre contribute less to the solution. Hence the weighted element in the GW framework. Thus, the kernel defines the data and weights that are used to calibrate the model at each location. The weight, \(w_i\), associated with each location \((u_i, v_i)\) is a decreasing function of \(d_i\), the distance from the centre of the kernel to \((u_i, v_i)\):
A typical kernel function for example is the bisquare kernel. For a given bandwidth \(h\), this is defined by \[ f(d) = \left\{ \begin{array}{cl} \left(1 - \left( \frac{d}{h} \right)^2 \right)^2 & \mbox{ if $d < h$}; \\ & \\ 0 & \mbox{ otherwise.} \end{array} \right. \] Here \(h\) may be defined as a fixed value, or in an adaptive way. Gollini et al (2015) provide a description of common kernel shapes used in GW models. Generally larger values of \(h\) result in a greater degree of spatial smoothing - having a larger window around \(\mathbf{u}\) in which observations have non-zero weighting. As \(h \rightarrow \infty\) the coefficients converge to the global OLS model.
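To make the weighting scheme concrete, here is a minimal sketch of the bisquare weighting function (the bisquare name and the 500 m bandwidth are purely illustrative):
# bisquare kernel: the weight decays with distance d and is zero beyond the bandwidth h
bisquare <- function(d, h) ifelse(d < h, (1 - (d / h)^2)^2, 0)
# e.g. weights for a set of distances with a 500 m bandwidth
bisquare(d = c(0, 100, 250, 400, 600), h = 500)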
There are a number of packages that support geographically weighted analyses in R, including spgwr
, GWmodel
, gwrr
and McSpatial
. GWmodel
is curated and supported by two of the original GWR team, working with others. We will use it here.
Note that Gollini et al (2015) provide an overview of geographically weighted approaches and a thorough treatment demonstrating their use, including the steps of model selection, bandwidth / kernel size optimization, handling collinearity etc.
In brief there are 2 basic steps to any GW approach: first, determine the kernel bandwidth (the size of the moving window), and second, fit the GW model using that kernel. We will undertake these steps in turn in the code below. Note that the df.sp data is projected in metres and not degrees: distance-based measures are used to determine the bandwidth / kernel size.
# determine the kernel bandwidth
bw <- bw.gwr(S~Ca+Fe+P+SOM+pH+Slope+Aspect,
adaptive = T,
data=df.sp)
# fit the GWR model
m4 <- gwr.basic(S~Ca+Fe+P+SOM+pH+Slope+Aspect,
adaptive = T,
data = df.sp,
bw = bw)
You should have a look at the bandwidth: it has a value of 184, indicating that around 17% of the data points will be used in each local calculation (there are 1070 records in the data).
bw
## [1] 184
You should have a look at the GWR outputs
m4
The code above prints out the whole model summary. You should compare the summary of the GWR coefficient estimates with the coefficient estimates of the ordinary regression model:
tab <- rbind(apply(m4$SDF@data[, 1:8], 2, summary), coef(m))
rownames(tab)[7] <- "Global"
tab <- round(tab, 3)
kable(tab, booktabs = T, format = 'markdown')
 | Intercept | Ca | Fe | P | SOM | pH | Slope | Aspect |
---|---|---|---|---|---|---|---|---|
Min. | -233.472 | -0.026 | -0.015 | -0.093 | -5.664 | -309.436 | -3.774 | -0.104 |
1st Qu. | 338.519 | 0.040 | -0.006 | 0.039 | 10.136 | -102.276 | -0.542 | -0.001 |
Median | 692.092 | 0.060 | -0.004 | 0.097 | 18.789 | -68.268 | 1.033 | 0.043 |
Mean | 659.049 | 0.063 | -0.005 | 0.103 | 18.120 | -67.445 | 1.456 | 0.086 |
3rd Qu. | 896.233 | 0.082 | -0.003 | 0.168 | 26.647 | -21.152 | 3.388 | 0.107 |
Max. | 1942.674 | 0.176 | 0.002 | 0.284 | 44.712 | 161.084 | 8.880 | 0.737 |
Global | 746.569 | 0.062 | -0.008 | 0.115 | 19.516 | -70.596 | 9.125 | 0.116 |
So the Global coefficient estimates are similar to the Mean and Median values of the GWR output. But you can also see the variation across the study area in the GWR coefficient estimates describing the associations between sulphur (S) and the different predictors.
Note: the GWmodel outputs have a number of components. Of interest is the SDF, which is a spatial object of the same class as the input - in this case a SpatialPointsDataFrame. So for example, for SOM (soil organic matter), the local coefficient estimates range from about -5.7 to 44.7, indicating that the strength (and even the sign) of its association with S varies across the study area.
The variation in the GWR coefficient estimates can be mapped to show how the associations between the different factors and Sulphur vary across the study area. First, maps of predictors where large variation in the relationship with sulphur was found:
tmap_mode('plot')
## tmap mode set to plotting
tm_shape(m4$SDF) +
tm_dots(c("SOM", "Slope"), size = 0.1) +
tm_layout(legend.position = c("right","top"))
Maps showing coefficients whose values flip from positive to negative in different parts of the study area:
tm_shape(m4$SDF) +
tm_dots(c("Slope", "pH"), size = 0.1) +
tm_layout(legend.position = c("right","top")) +
tm_style_col_blind()
So what we see here are the local variations in the degree to which changes in different variables are associated with changes in Sulphur. The maps above are not simply maps of the predictor variables.
Brunsdon, C. and Comber, L., 2015. An introduction to R for spatial analysis and mapping. Sage.
Brunsdon, C.F., Fotheringham, A.S. and Charlton M. (1996). Geographically Weighted Regression - A Method for Exploring Spatial Non-Stationarity, Geographic Analysis, 28: 281-298.
Cleveland, W. S. (1979) Robust locally weighted regression and smoothing scatterplots. J. American Statistical Association 74, 829–836.
De Smith M., Goodchild M., Longley P. Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools, Troubador Publishing Ltd, 2007
Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: the analysis of spatially varying relationships. John Wiley & Sons.
Gollini, I., Lu, B., Charlton, M., Brunsdon, C., & Harris, P. 2015. GWmodel: an R package for exploring spatial heterogeneity using geographically weighted models. Journal of Statistical Software 63 (17), 1–50.
Huber, P.J. (1981) Robust Statistics. Wiley.
Hurley, Catherine B. (2004) Clustering Visualisations of Multidimensional Data, Journal of Computational and Graphical Statistics, 13(4), 788-806.
Tobler W. (1970) A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(2): 234-240.