1. Introduction

Earlier in the program we looked at basic regression. As a reminder, this was a way of modelling the relationshiop between a response variable \(y_i\) and a number of predictor variables \(x_{i1} \cdots x_{im}\), for each observation \(i\) (\(i = 1 \cdots n\)). The basic model used was

\[ y_{i} = \beta_0 + \sum_{k=1}^{k=m} \beta_k x_{ik} + \epsilon_i \] so for each observation, the relationship is linear plus a random error part where each which \(\epsilon_i\) has a independent normal distribution with mean zero and varianace \(\sigma^2\). However the assumption that the relationship is linear, and that the error terms have identical and independent normal distributions is quite often a consequence of wishful thinking, rather than a rigorously derived conclusion. Although quite often these assumptions are resemble reality enough to allow predictions to be of some use, changing some of the model assumptions would certainly result in more plausible results.

To take a step closer to reality, it is a good idea to look beyond these assumptions to obtain more reliable predictions, or gain a better theoretical understanding. To mediate the problems caused by the inappropriate use of simple linear regression (or OLS - Ordinary Least Squares) - two general approaches are used:

  1. Making the model calibration more robust - so that minor deviations (such as extreme outliers) - do not negatively in fluence the modelling process.
  2. Allowing different assumptions in the model or the error term

In this session we will look at some examples of these. To do this, we’ll need to load some R packages, and reload the image of the variables created in the earlier regression session (Session 2).


2. Robust (‘bomb-proof’) Regression

This relates to point 1 above. Most of this session will relate to examples linking to point 2 - but it is useful to consider at least one example of the first kind. To investigate this example, we will need to recall the results from session 2 - the introductory regression session, and revisit the regression example for median income (MedInc) there. We have already re-loaded the results from the session in the introduction. The results of the linear regression are stored in m -

## Call:
## lm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + PctPov + 
##     PctBlack, data = df)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4203  -2.9897  -0.6163   2.2095  25.8201 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.59895    3.10893  16.919  < 2e-16 ***
## PctRural     0.07377    0.02043   3.611 0.000414 ***
## PctBach      0.69726    0.11221   6.214 4.73e-09 ***
## PctEld      -0.78862    0.17979  -4.386 2.14e-05 ***
## PctFB       -1.29030    0.47388  -2.723 0.007229 ** 
## PctPov      -0.95400    0.10459  -9.121 4.19e-16 ***
## PctBlack     0.03313    0.03717   0.891 0.374140    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 5.155 on 152 degrees of freedom
## Multiple R-squared:  0.7685, Adjusted R-squared:  0.7593 
## F-statistic: 84.09 on 6 and 152 DF,  p-value: < 2.2e-16

However, looking at the studentised residuals suggested a few outliers that might ‘throw’ the regression coefficient estimates. One way to investigate this is to remove the outlying observations and re-fit the model. The removal is best done using the dplyr approach and the filter function, and re-fitting the model.

df %>% filter(abs(rstudent(m)) <= 2) -> df2
m2 <- lm(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, data = df2)

This done, the compareCoefs function in car lets the results be compared:

kable(compareCoefs(m,m2,print=FALSE), digits=3)
Est. 1 SE 1 Est. 2 SE 2
(Intercept) 52.599 3.109 52.905 2.644
PctRural 0.074 0.020 0.067 0.017
PctBach 0.697 0.112 0.728 0.105
PctEld -0.789 0.180 -0.925 0.147
PctFB -1.290 0.474 -1.485 0.459
PctPov -0.954 0.105 -0.887 0.092
PctBlack 0.033 0.037 0.043 0.031

It may be seen that some of the values alter notably (by around 10%) between the models with and without outliers. PctBlack is more extreme at around 30%. Outliers do have an influence.

However, it could be argued that the approach of removing them entirely is misleading - essentially trying to pass off a model as universal but intentionally ignoring observations that contradict it. Although that might be quite extreme, the point holds to some extent. Those counties do have those statistics, and so should perhaps not be ignored entirely.

Robust regression offers a more nuanced approach. Here, instead of simply ignoring the outliers, it is assumed that for whatever reason, they are subject to much more variance than the remaining observations - possibly due to some characteristic of the sampling process, or - more likely here - due to some unusual circumstances in the geographical locality (for example the presence of a large university skewing income figures compared to most of the State). Rather than totally ignoring such observations, they are downweighted to allow for this. This is less extreme than removing outliers - which in a sense is extreme downweighting to the value zero!

There are a number of approaches to robust regression - here we will use one proposed by Huber (1981). This uses an algorithm as below:

  1. Fit a standard regression model and note the residuals.
  2. Compute the weight for each observation as a function of absolute residual size. A number of possible weight functions could be used, but Huber’s is
    \[ w_i(z_i) = \left\{ \begin{array}{ll} 1, & \textsf{if } |z_i| < c; \\ c/|z|, & \textsf{if } |z_i| \ge c; \\ \end{array} \right. \] where \(c \approx 1.345\) and \(z_i\) is a scaled residual for observation \(i\). If actual residual is \(e_i\), then \(z_i = e_i/\tau\) where \(\tau\) is the median of \(\frac{|e_i|}{0.6745}\). After residuals reach a value \(c\) weighting gradually decreases - in accordance with the idea that higher residuals come from processes with greater variance.
  3. Refit the regression model with these weights. Update the residual values.
  4. Repeat from step 2 until regression coefficient estimates converge.

This is all implemented in the rlm function in the package MASS. It is very easy to use - once you have loaded MASS it just works like lm:

m3 <- rlm(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, data = df)
## Call:
## rlm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + 
##     PctPov + PctBlack, data = df)
## Converged in 9 iterations
## Coefficients:
## (Intercept)    PctRural     PctBach      PctEld       PctFB      PctPov 
## 51.60773916  0.06337478  0.67787963 -0.77778234 -1.22469054 -0.88053199 
##    PctBlack 
##  0.03179542 
## Degrees of freedom: 159 total; 152 residual
## Scale estimate: 4.19

Here the coefficient estimates are printed out, and it may also be seen that the model converged in 9 iterations, and that the scale factor \(\tau\) was estimated as 4.19. As with lm you can print out a summary:

## Call: rlm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + 
##     PctPov + PctBlack, data = df)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8485  -2.8326  -0.1765   2.5012  27.4204 
## Coefficients:
##             Value   Std. Error t value
## (Intercept) 51.6077  2.7377    18.8506
## PctRural     0.0634  0.0180     3.5224
## PctBach      0.6779  0.0988     6.8604
## PctEld      -0.7778  0.1583    -4.9126
## PctFB       -1.2247  0.4173    -2.9348
## PctPov      -0.8805  0.0921    -9.5604
## PctBlack     0.0318  0.0327     0.9714
## Residual standard error: 4.188 on 152 degrees of freedom

Finally, coefficients from all three aproaches can be compared:

coef_tab <- data_frame(OLS=coef(m),`Outliers Deleted`=coef(m2),Robust=coef(m3))
OLS Outliers Deleted Robust
52.599 52.905 51.608
0.074 0.067 0.063
0.697 0.728 0.678
-0.789 -0.925 -0.778
-1.290 -1.485 -1.225
-0.954 -0.887 -0.881
0.033 0.043 0.032

For some coefficients, the robust estimate is closer to the standard OLS, and for others, it is closer to the ‘outliers deleted’ estimate - however in others, it differs from OLS, but not in the same direction as the ‘outliers deleted’ case. As a precautionary approach, I suggest working with the robust approach if there are some notable outliers.

3. Geographically Weighted Models

Regression of some kind underpins many statistical models. However often they make one massive assumption: that the relationships between the \(x_k\)’s and \(y\) are the same everywhere. Regression models (and most other models) are aspatial and because they consider all of the data in the study area in one universal model, they are referred to as Global models. For example, in the models seen in this program so far the assumption is that the contributions made by the different variables in predicting MedInc are the same across the study area. In reality this assumption of spatial invariance is violated in many instances when geographic phenomena are considered.

For these reasons, as a geographer….

In statistical terminology… I am interested in spatial non-stationarity in relationships and process spatial heterogeneity.

What this means is that instead of just one general relationship for, for example, the association between changes in PctBach and MedInc, I might expect that relationship (as described in the coefficient estimate for \(\beta_k\)) to be different in different places. This is a core concept for geographical data analysis. A key question is “how can such a model be calibrated” - a clue lies in the next subsection…

Tobler’s first law of Geography