Earlier in the program we looked at basic regression. As a reminder, this was a way of modelling the relationshiop between a response variable \(y_i\) and a number of predictor variables \(x_{i1} \cdots x_{im}\), for each observation \(i\) (\(i = 1 \cdots n\)). The basic model used was
\[ y_{i} = \beta_0 + \sum_{k=1}^{k=m} \beta_k x_{ik} + \epsilon_i \] so for each observation, the relationship is linear plus a random error part where each which \(\epsilon_i\) has a independent normal distribution with mean zero and varianace \(\sigma^2\). However the assumption that the relationship is linear, and that the error terms have identical and independent normal distributions is quite often a consequence of wishful thinking, rather than a rigorously derived conclusion. Although quite often these assumptions are resemble reality enough to allow predictions to be of some use, changing some of the model assumptions would certainly result in more plausible results.
To take a step closer to reality, it is a good idea to look beyond these assumptions to obtain more reliable predictions, or gain a better theoretical understanding. To mediate the problems caused by the inappropriate use of simple linear regression (or OLS  Ordinary Least Squares)  two general approaches are used:
In this session we will look at some examples of these. To do this, we’ll need to load some R packages, and reload the image of the variables created in the earlier regression session (Session 2).
library(tidyverse)
library(GISTools)
library(GWmodel)
library(rgdal)
library(tmap)
library(knitr)
library(car)
library(gclus)
library(kableExtra)
load('session2.RData')
This relates to point 1 above. Most of this session will relate to examples linking to point 2  but it is useful to consider at least one example of the first kind. To investigate this example, we will need to recall the results from session 2  the introductory regression session, and revisit the regression example for median income (MedInc
) there. We have already reloaded the results from the session in the introduction. The results of the linear regression are stored in m

summary(m)
##
## Call:
## lm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + PctPov +
## PctBlack, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## 12.4203 2.9897 0.6163 2.2095 25.8201
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 52.59895 3.10893 16.919 < 2e16 ***
## PctRural 0.07377 0.02043 3.611 0.000414 ***
## PctBach 0.69726 0.11221 6.214 4.73e09 ***
## PctEld 0.78862 0.17979 4.386 2.14e05 ***
## PctFB 1.29030 0.47388 2.723 0.007229 **
## PctPov 0.95400 0.10459 9.121 4.19e16 ***
## PctBlack 0.03313 0.03717 0.891 0.374140
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.155 on 152 degrees of freedom
## Multiple Rsquared: 0.7685, Adjusted Rsquared: 0.7593
## Fstatistic: 84.09 on 6 and 152 DF, pvalue: < 2.2e16
However, looking at the studentised residuals suggested a few outliers that might ‘throw’ the regression coefficient estimates. One way to investigate this is to remove the outlying observations and refit the model. The removal is best done using the dplyr
approach and the filter
function, and refitting the model.
df %>% filter(abs(rstudent(m)) <= 2) > df2
m2 < lm(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, data = df2)
This done, the compareCoefs
function in car
lets the results be compared:
kable(compareCoefs(m,m2,print=FALSE), digits=3)
Est. 1  SE 1  Est. 2  SE 2  

(Intercept)  52.599  3.109  52.905  2.644 
PctRural  0.074  0.020  0.067  0.017 
PctBach  0.697  0.112  0.728  0.105 
PctEld  0.789  0.180  0.925  0.147 
PctFB  1.290  0.474  1.485  0.459 
PctPov  0.954  0.105  0.887  0.092 
PctBlack  0.033  0.037  0.043  0.031 
It may be seen that some of the values alter notably (by around 10%) between the models with and without outliers. PctBlack
is more extreme at around 30%. Outliers do have an influence.
However, it could be argued that the approach of removing them entirely is misleading  essentially trying to pass off a model as universal but intentionally ignoring observations that contradict it. Although that might be quite extreme, the point holds to some extent. Those counties do have those statistics, and so should perhaps not be ignored entirely.
Robust regression offers a more nuanced approach. Here, instead of simply ignoring the outliers, it is assumed that for whatever reason, they are subject to much more variance than the remaining observations  possibly due to some characteristic of the sampling process, or  more likely here  due to some unusual circumstances in the geographical locality (for example the presence of a large university skewing income figures compared to most of the State). Rather than totally ignoring such observations, they are downweighted to allow for this. This is less extreme than removing outliers  which in a sense is extreme downweighting to the value zero!
There are a number of approaches to robust regression  here we will use one proposed by Huber (1981). This uses an algorithm as below:
This is all implemented in the rlm
function in the package MASS
. It is very easy to use  once you have loaded MASS
it just works like lm
:
library(MASS)
m3 < rlm(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, data = df)
m3
## Call:
## rlm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB +
## PctPov + PctBlack, data = df)
## Converged in 9 iterations
##
## Coefficients:
## (Intercept) PctRural PctBach PctEld PctFB PctPov
## 51.60773916 0.06337478 0.67787963 0.77778234 1.22469054 0.88053199
## PctBlack
## 0.03179542
##
## Degrees of freedom: 159 total; 152 residual
## Scale estimate: 4.19
Here the coefficient estimates are printed out, and it may also be seen that the model converged in 9 iterations, and that the scale factor \(\tau\) was estimated as 4.19. As with lm
you can print out a summary:
summary(m3)
##
## Call: rlm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB +
## PctPov + PctBlack, data = df)
## Residuals:
## Min 1Q Median 3Q Max
## 12.8485 2.8326 0.1765 2.5012 27.4204
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 51.6077 2.7377 18.8506
## PctRural 0.0634 0.0180 3.5224
## PctBach 0.6779 0.0988 6.8604
## PctEld 0.7778 0.1583 4.9126
## PctFB 1.2247 0.4173 2.9348
## PctPov 0.8805 0.0921 9.5604
## PctBlack 0.0318 0.0327 0.9714
##
## Residual standard error: 4.188 on 152 degrees of freedom
Finally, coefficients from all three aproaches can be compared:
coef_tab < data_frame(OLS=coef(m),`Outliers Deleted`=coef(m2),Robust=coef(m3))
kable(coef_tab,digits=3)
OLS  Outliers Deleted  Robust 

52.599  52.905  51.608 
0.074  0.067  0.063 
0.697  0.728  0.678 
0.789  0.925  0.778 
1.290  1.485  1.225 
0.954  0.887  0.881 
0.033  0.043  0.032 
For some coefficients, the robust estimate is closer to the standard OLS, and for others, it is closer to the ‘outliers deleted’ estimate  however in others, it differs from OLS, but not in the same direction as the ‘outliers deleted’ case. As a precautionary approach, I suggest working with the robust approach if there are some notable outliers.
Regression of some kind underpins many statistical models. However often they make one massive assumption: that the relationships between the \(x_k\)’s and \(y\) are the same everywhere. Regression models (and most other models) are aspatial and because they consider all of the data in the study area in one universal model, they are referred to as Global models. For example, in the models seen in this program so far the assumption is that the contributions made by the different variables in predicting MedInc
are the same across the study area. In reality this assumption of spatial invariance is violated in many instances when geographic phenomena are considered.
For these reasons, as a geographer….
In statistical terminology… I am interested in spatial nonstationarity in relationships and process spatial heterogeneity.
What this means is that instead of just one general relationship for, for example, the association between changes in PctBach
and MedInc
, I might expect that relationship (as described in the coefficient estimate for \(\beta_k\)) to be different in different places. This is a core concept for geographical data analysis. A key question is “how can such a model be calibrated”  a clue lies in the next subsection…