Earlier in the program we looked at basic regression. As a reminder, this was a way of modelling the relationshiop between a response variable \(y_i\) and a number of predictor variables \(x_{i1} \cdots x_{im}\), for each observation \(i\) (\(i = 1 \cdots n\)). The basic model used was
\[ y_{i} = \beta_0 + \sum_{k=1}^{k=m} \beta_k x_{ik} + \epsilon_i \] so for each observation, the relationship is linear plus a random error part where each which \(\epsilon_i\) has a independent normal distribution with mean zero and varianace \(\sigma^2\). However the assumption that the relationship is linear, and that the error terms have identical and independent normal distributions is quite often a consequence of wishful thinking, rather than a rigorously derived conclusion. Although quite often these assumptions are resemble reality enough to allow predictions to be of some use, changing some of the model assumptions would certainly result in more plausible results.
To take a step closer to reality, it is a good idea to look beyond these assumptions to obtain more reliable predictions, or gain a better theoretical understanding. To mediate the problems caused by the inappropriate use of simple linear regression (or OLS - Ordinary Least Squares) - two general approaches are used:
In this session we will look at some examples of these. To do this, we’ll need to load some R packages, and reload the image of the variables created in the earlier regression session (Session 2).
library(tidyverse)
library(GISTools)
library(GWmodel)
library(rgdal)
library(tmap)
library(knitr)
library(car)
library(gclus)
library(kableExtra)
load('session2.RData')
This relates to point 1 above. Most of this session will relate to examples linking to point 2 - but it is useful to consider at least one example of the first kind. To investigate this example, we will need to recall the results from session 2 - the introductory regression session, and revisit the regression example for median income (MedInc
) there. We have already re-loaded the results from the session in the introduction. The results of the linear regression are stored in m
-
summary(m)
##
## Call:
## lm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + PctPov +
## PctBlack, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4203 -2.9897 -0.6163 2.2095 25.8201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.59895 3.10893 16.919 < 2e-16 ***
## PctRural 0.07377 0.02043 3.611 0.000414 ***
## PctBach 0.69726 0.11221 6.214 4.73e-09 ***
## PctEld -0.78862 0.17979 -4.386 2.14e-05 ***
## PctFB -1.29030 0.47388 -2.723 0.007229 **
## PctPov -0.95400 0.10459 -9.121 4.19e-16 ***
## PctBlack 0.03313 0.03717 0.891 0.374140
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.155 on 152 degrees of freedom
## Multiple R-squared: 0.7685, Adjusted R-squared: 0.7593
## F-statistic: 84.09 on 6 and 152 DF, p-value: < 2.2e-16
However, looking at the studentised residuals suggested a few outliers that might ‘throw’ the regression coefficient estimates. One way to investigate this is to remove the outlying observations and re-fit the model. The removal is best done using the dplyr
approach and the filter
function, and re-fitting the model.
df %>% filter(abs(rstudent(m)) <= 2) -> df2
m2 <- lm(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, data = df2)
This done, the compareCoefs
function in car
lets the results be compared:
kable(compareCoefs(m,m2,print=FALSE), digits=3)
Est. 1 | SE 1 | Est. 2 | SE 2 | |
---|---|---|---|---|
(Intercept) | 52.599 | 3.109 | 52.905 | 2.644 |
PctRural | 0.074 | 0.020 | 0.067 | 0.017 |
PctBach | 0.697 | 0.112 | 0.728 | 0.105 |
PctEld | -0.789 | 0.180 | -0.925 | 0.147 |
PctFB | -1.290 | 0.474 | -1.485 | 0.459 |
PctPov | -0.954 | 0.105 | -0.887 | 0.092 |
PctBlack | 0.033 | 0.037 | 0.043 | 0.031 |
It may be seen that some of the values alter notably (by around 10%) between the models with and without outliers. PctBlack
is more extreme at around 30%. Outliers do have an influence.
However, it could be argued that the approach of removing them entirely is misleading - essentially trying to pass off a model as universal but intentionally ignoring observations that contradict it. Although that might be quite extreme, the point holds to some extent. Those counties do have those statistics, and so should perhaps not be ignored entirely.
Robust regression offers a more nuanced approach. Here, instead of simply ignoring the outliers, it is assumed that for whatever reason, they are subject to much more variance than the remaining observations - possibly due to some characteristic of the sampling process, or - more likely here - due to some unusual circumstances in the geographical locality (for example the presence of a large university skewing income figures compared to most of the State). Rather than totally ignoring such observations, they are downweighted to allow for this. This is less extreme than removing outliers - which in a sense is extreme downweighting to the value zero!
There are a number of approaches to robust regression - here we will use one proposed by Huber (1981). This uses an algorithm as below:
This is all implemented in the rlm
function in the package MASS
. It is very easy to use - once you have loaded MASS
it just works like lm
:
library(MASS)
m3 <- rlm(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, data = df)
m3
## Call:
## rlm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB +
## PctPov + PctBlack, data = df)
## Converged in 9 iterations
##
## Coefficients:
## (Intercept) PctRural PctBach PctEld PctFB PctPov
## 51.60773916 0.06337478 0.67787963 -0.77778234 -1.22469054 -0.88053199
## PctBlack
## 0.03179542
##
## Degrees of freedom: 159 total; 152 residual
## Scale estimate: 4.19
Here the coefficient estimates are printed out, and it may also be seen that the model converged in 9 iterations, and that the scale factor \(\tau\) was estimated as 4.19. As with lm
you can print out a summary:
summary(m3)
##
## Call: rlm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB +
## PctPov + PctBlack, data = df)
## Residuals:
## Min 1Q Median 3Q Max
## -12.8485 -2.8326 -0.1765 2.5012 27.4204
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 51.6077 2.7377 18.8506
## PctRural 0.0634 0.0180 3.5224
## PctBach 0.6779 0.0988 6.8604
## PctEld -0.7778 0.1583 -4.9126
## PctFB -1.2247 0.4173 -2.9348
## PctPov -0.8805 0.0921 -9.5604
## PctBlack 0.0318 0.0327 0.9714
##
## Residual standard error: 4.188 on 152 degrees of freedom
Finally, coefficients from all three aproaches can be compared:
coef_tab <- data_frame(OLS=coef(m),`Outliers Deleted`=coef(m2),Robust=coef(m3))
kable(coef_tab,digits=3)
OLS | Outliers Deleted | Robust |
---|---|---|
52.599 | 52.905 | 51.608 |
0.074 | 0.067 | 0.063 |
0.697 | 0.728 | 0.678 |
-0.789 | -0.925 | -0.778 |
-1.290 | -1.485 | -1.225 |
-0.954 | -0.887 | -0.881 |
0.033 | 0.043 | 0.032 |
For some coefficients, the robust estimate is closer to the standard OLS, and for others, it is closer to the ‘outliers deleted’ estimate - however in others, it differs from OLS, but not in the same direction as the ‘outliers deleted’ case. As a precautionary approach, I suggest working with the robust approach if there are some notable outliers.
Regression of some kind underpins many statistical models. However often they make one massive assumption: that the relationships between the \(x_k\)’s and \(y\) are the same everywhere. Regression models (and most other models) are aspatial and because they consider all of the data in the study area in one universal model, they are referred to as Global models. For example, in the models seen in this program so far the assumption is that the contributions made by the different variables in predicting MedInc
are the same across the study area. In reality this assumption of spatial invariance is violated in many instances when geographic phenomena are considered.
For these reasons, as a geographer….
In statistical terminology… I am interested in spatial non-stationarity in relationships and process spatial heterogeneity.
What this means is that instead of just one general relationship for, for example, the association between changes in PctBach
and MedInc
, I might expect that relationship (as described in the coefficient estimate for \(\beta_k\)) to be different in different places. This is a core concept for geographical data analysis. A key question is “how can such a model be calibrated” - a clue lies in the next subsection…
everything is related to everything else, but near things are more related than distant things. (Tobler, 1970)
Here, Tobler is arguing (although by his own admission he used the term in a tongue in cheek way) that geographically located phenomena that are near to one another tend to have similar characteristics - or that if this were not the case -
“the full range of conditions anywhere on the Earth’s surface could in principle be found packed within any small area. There would be no regions of approximately homogeneous conditions to be described by giving attributes to area objects. Topographic surfaces would vary chaotically, with slopes that were everywhere infinite, and the contours of such surfaces would be infinitely dense and contorted. Spatial analysis, and indeed life itself, would be impossible.” (de Smith et al 2007, p44)
Here, the idea informs the technique known as Geographically Weighted Regression or GWR (Brunsdon et al 1996). If the relationships between the \(x_k\)’s and \(y\) vary spatially, one might at least expect that the relationship between nearby sets of observations should be consistent. Suppose we assign a value of each regression coefficient to a location \((u,v)\) in space (eg \(u\) is an easting and \(v\) a northing), so that we denote it \(\beta_k(u,v)\). The we could estimate each \(\beta_k(u,v)\) by calibrating a weighted regression model where \(w_i\) was 1 if an observation was within some radius \(h\) of \((u,v)\) and 0 otherwise.
In this way, an entire surface of coefficients could be estimated. It may be imagined as being generated by a moving circular window inclusing scanning across a map of the locations of the observations - in this case county centroids. However, 0/1 weighting is quite crude, and one might expect data from the periphery of the circular window to have less spatial influence than data near the middle.
Thus, critical to the operation of GW models is a moving kernel, that downweights peripheral observations. The kernel moves through the study area (for example to cells in a predefined grid) and at each location computes a local model. It uses data under the kernel to construct a (local) model at that location with data weighted points further away from the kernel centre contributing less to the solution. Hence the weighted element in the GW framework. Thus, the kernel defines the data and weights that are used to calibrate the model at each location. The weight, \(w_i\), associated with each location \((u_i, v_i)\) is a decreasing function of \(d_i\), the distance from the centre of the kernel to \((u_i, v_i)\):
A typical kernel function for example is the bisquare kernel. For a given bandwidth \(h\), this is defined by \[ f(d) = \left\{ \begin{array}{cl} \left(1 - \left( \frac{d}{h} \right)^2 \right)^2 & \mbox{ if $d < h$}; \\ & \\ 0 & \mbox{ otherwise.} \end{array} \right. \] Here \(h\) may be defined as a fixed value, or in an adaptive way. Gollini et al (2015) provide a description of common kernels shape used in GW models. Generally larger values of \(h\) result in a greater degree of spatial smoothing - having a larger window around \(\mathbf{u}\) in which cross tabulations have non-zero weighting. As \(h \rightarrow \infty\) the coeffients converge to the global OLS model.
There are a number of packages that support geographically weighted analyses in R, including spgwr
, GWmodel
, gwrr
and McSpatial
. GWmodel
is curated and supported by the two of the original GWR team working with others. We will use this here.
Note that Gollini at al (2015) provide an overview of geographically weighted approaches and provide a thorough treatment demonstrate their use, including the steps of model selection, bandwidth / kernel size optimization, handling collinearity etc.
In brief there are 2 basic steps to any GW approach:
We will create some spatial data and undertake these steps in turn in the code below: Note the use of the georgia2
dataset as this is projected in meters and not degrees. Distance based measures are used to determine the bandwidth / kernel size.
# create spatial data
df.sp <- georgia2[, c(14,4:9)]
df.sp$MedInc <- df.sp$MedInc/1000
# determine the kernel bandwidth
bw <- bw.gwr(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack,
adaptive = T,
data=df.sp)
# fit the GWR model
m4 <- gwr.basic(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack,
adaptive = T,
data = df.sp,
bw = bw)
You should have a look at the bandwidth: it has a value of 101 indicating that 63% of the county data will be used in each local calculation (there are 159 records in the data).
bw
## [1] 101
You should have a look at the GWR outputs
m4
The code above prints out the whole model summary. You should compare the Summary of GWR coefficient estimates with the outputs of the ordinary regression mode coefficient estimates:
tab <- rbind(apply(m4$SDF@data[, 1:7], 2, summary), coef(m))
rownames(tab)[7] <- "Global"
tab <- round(tab, 3)
kable(tab, booktabs = T, format = 'markdown')
Intercept | PctRural | PctBach | PctEld | PctFB | PctPov | PctBlack | |
---|---|---|---|---|---|---|---|
Min. | 43.570 | 0.026 | 0.338 | -1.156 | -2.931 | -1.563 | -0.196 |
1st Qu. | 48.440 | 0.054 | 0.596 | -0.948 | -2.100 | -1.244 | -0.028 |
Median | 51.900 | 0.078 | 0.672 | -0.857 | -1.268 | -0.875 | 0.031 |
Mean | 52.560 | 0.078 | 0.710 | -0.860 | -1.410 | -0.931 | 0.017 |
3rd Qu. | 56.900 | 0.105 | 0.856 | -0.787 | -0.750 | -0.594 | 0.078 |
Max. | 61.630 | 0.139 | 1.142 | -0.526 | 0.152 | -0.490 | 0.160 |
Global | 52.599 | 0.074 | 0.697 | -0.789 | -1.290 | -0.954 | 0.033 |
So the Global coefficient estimates are similar to the Mean and Median values of the GWR output. But you can see the variation in the association between median income (MedInc
) and the different coefficient estimates.
Note: the GWmodel
outputs have a number of types. Of interest is the SDF
which is spatial object of the same class as the input - in this case a SpatialPolygonsDataFrame
.
So for example, for PctPov
(poverty) :
The variation in the GWR coefficient estimates can be mapped to show how different factors are varying associated with Median Income in different parts of Georgia.
Maps where large variation in the relationship between the predictor variable and median income was found:
tm_shape(m4$SDF) +
tm_fill(c("PctBach", "PctPov")) +
tm_layout(legend.position = c("right","top"))
Maps showing coefficients who values flip from positive to negative in different parts of the state.
tm_shape(m4$SDF) +
tm_fill(c("PctFB", "PctBlack")) +
tm_layout(legend.position = c("right","top")) +
tm_style_col_blind()
So what we see here are the local variations in the degree to which changes in different variables are associated with changes in Median Income. The maps above are not simply maps of the predictor variable.
Here another assumption in OLS is modified. The \(y_i\) variable is assumed to have a normal distribution, whose mean is determined by the predictor variables (as \(\beta_0 + \sum_k \beta_kx_{ik}\) ) and whose variance is \(\sigma^2\). However, suppose the value of \(y_i\) was a count of some phenomenon - for example the number of crimes committed in a month, or the number of houses sold (rather than the price). Then, the normal distribution is not a sensible model because
or, more bluntly, \(y_i\) is not normally distributed! Typically count data are modelled by the Poisson distribution -
\[ \textsf{Prob}(y_i=j) = \frac{e^{-\lambda}\lambda^j}{j!} \] where \(\lambda\) is the mean value of \(y_i\). There is a constraint that \(\lambda > 0\). Here \(\lambda\) is a constant parameter, but if this depended on some predictor variables, we would have a prediction model similar to OLS, but with \(y_i\) as a Poisson distributed count variable rather than a normally distributed continuous one.
In principal, one could specify the model so that \(\lambda_i\), the mean value for observation \(i\), is given by \(\lambda_i = \beta_0 + \sum_k\beta_kx_{ij}\) however, this could lead to problems if a negative value of \(\lambda_i\) were predicted. One way of overcoming this is to specify a log-transformed relationship:
\[ \log(\lambda_i) = \beta_0 + \sum_{k=1}^{k=m} \beta_k x_{ik} \]
which is the standard form for Poisson regression.
Taking antilogs on both sides, the effect on the expected value is a multiplicative one:
\[ \lambda_i = e^{\beta_0} \times e^{\beta_1 x_{i1}} \times \cdots e^{\beta_k x_{ik}} \]
so that the expected number of observations changes by a scale factor (ie a ratio) for a unit change in a prector variable. In particular for a coefficient \(\beta_k\), the factor is \(e^{\beta_k}\) - this is useful in interpreting the model.
A final - and extra - feature found in some Poisson regression model is the idea of an offset. For many geographical count-based regressions, there is a ‘population at risk’ - for example the total population of an area when counting people having a particular illness. In this case an expected count is essentially a rate of occurrence multiplied by the population at risk. The model can be written as
\[ \lambda_i = P_i \times e^{\beta_0} \times e^{\beta_1 x_{i1}} \times \cdots e^{\beta_k x_{ik}} \] where \(P_i\) is the population at risk. Writing \(P_i\) as \(e^{\log(P_i)}\) this is really just another predictor term (in \(\log(P_i)\)) but with the coefficient being fixed at the value 1, rather than being calibrated. Such a term can be specified in R - and is referred to as an offset term.
Here, we will use a data set based on UK crime reporting data - as made available via the police.uk
web site.
In this case, burglary reports for a one month period in Newcastle upon Tyne are counted for the geographical lower-level super output areas (LSOAs). These are the \(y\)-variable. As predictors a number of variables are used:
In addition, the response variable is burgs - the count of reported burglaries.
The data can obtained from the data link in the schedule web page. Download the file into your current directory, and then load it into R. The file format is the R binary format.
load('lsoa_imd.RData')
This loads a tibble called lsoa_imd
- which can be listed:
kable(head(lsoa_imd))
lsoacode | IMD_score | popn | dens | burgs | geometry |
---|---|---|---|---|---|
E01008365 | 3.674 | 1679 | 62.5 | 0 | 427144.1, 426837.0, 426649.5, 426483.9, 426738.7, 426898.2, 427144.1, 565908.8, 565774.6, 565897.6, 566180.1, 566368.5, 566457.1, 565908.8 |
E01008364 | 11.968 | 1696 | 99.6 | 2 | 427379.1, 427375.2, 427154.7, 427144.1, 426898.2, 427160.3, 427379.1, 566194.8, 565915.1, 565909.7, 565908.8, 566457.1, 566577.0, 566194.8 |
E01008362 | 18.735 | 1795 | 156.8 | 2 | 426943.9, 426900.9, 426844.4, 427263.1, 427284.9, 427284.2, 426943.9, 565158.8, 565291.4, 565654.0, 565439.8, 565357.8, 565357.5, 565158.8 |
E01008453 | 14.138 | 1669 | 44.0 | 2 | 423059.4, 423131.2, 422695.3, 422308.3, 422254.8, 422068.1, 422265.4, 422497.8, 423059.4, 565633.6, 565550.8, 565034.9, 565036.8, 565192.6, 565528.3, 565764.0, 565423.3, 565633.6 |
E01008292 | 8.104 | 1342 | 63.9 | 1 | 420218.9, 420401.4, 420591.7, 420617.2, 420124.4, 420218.9, 565042.6, 565328.6, 565244.8, 564720.6, 564917.2, 565042.6 |
E01008293 | 17.514 | 1447 | 21.4 | 0 | 421477.4, 421387.9, 421087.2, 420840.0, 420980.7, 420547.5, 420550.9, 420547.5, 420550.7, 421477.4, 568316.5, 567537.1, 567858.2, 567794.8, 567612.1, 567725.8, 567758.4, 568454.3, 568457.4, 568316.5 |
The geometry
column is used to map the LSOAs - the other columns are the variables used in the regression. The regression itself is set out below in R:
crime_model <- glm(burgs~dens+IMD_score+offset(log(popn)),data=lsoa_imd,family=poisson())
summary(crime_model)
##
## Call:
## glm(formula = burgs ~ dens + IMD_score + offset(log(popn)), family = poisson(),
## data = lsoa_imd)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9792 -1.3341 -0.1743 0.2937 5.7159
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.611e+00 1.760e-01 -43.232 < 2e-16 ***
## dens 6.727e-05 2.020e-03 0.033 0.97344
## IMD_score 1.080e-02 3.293e-03 3.280 0.00104 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 274.86 on 174 degrees of freedom
## Residual deviance: 264.40 on 172 degrees of freedom
## AIC: 519.14
##
## Number of Fisher Scoring iterations: 5
The density coefficient, although positive, is not significant. The index of multiple deprivation is significant, suggesting a strong linkage between burglary risk and higher levels of deprivation. The value of the coefficient is 0.0108 - and the antilog of this is 1.0109 suggesting that if the IMD increases by one unit, this corresponds to around 1% rise in the risk of burglary.
Brunsdon, C.F., Fotheringham, A.S. and Charlton M. (1996). Geographically Weighted Regression - A Method for Exploring Spatial Non-Stationarity, Geographic Analysis, 28: 281-298.
De Smith M., Goodchild M., Longley P. Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools, Troubador Publishing Ltd, 2007
Gollini, I., Lu, B., Charlton, M., Brunsdon, C., & Harris, P. 2015. GWmodel: an R package for exploring spatial heterogeneity using geographically weighted models. Journal of Statistical Software 63 (17), 1–50.
Huber, P.J. (1981) Robust Statistics., Wiley.
Tobler W., (1970) A computer movie simulating urban growth in the Detroit region.,Economic Geography, 46(2): 234-240