1. Introduction

Earlier in the program we looked at basic regression. As a reminder, this was a way of modelling the relationshiop between a response variable $y_i$ and a number of predictor variables $x_{i1} \cdots x_{im}$, for each observation $i$ ($i = 1 \cdots n$). The basic model used was

\[ y_{i} = \beta_0 + \sum_{k=1}^{k=m} \beta_k x_{ik} + \epsilon_i \] so for each observation, the relationship is linear plus a random error part where each which $\epsilon_i$ has a independent normal distribution with mean zero and varianace $\sigma^2$. However the assumption that the relationship is linear, and that the error terms have identical and independent normal distributions is quite often a consequence of wishful thinking, rather than a rigorously derived conclusion. Although quite often these assumptions are resemble reality enough to allow predictions to be of some use, changing some of the model assumptions would certainly result in more plausible results.

To take a step closer to reality, it is a good idea to look beyond these assumptions to obtain more reliable predictions, or gain a better theoretical understanding. To mediate the problems caused by the inappropriate use of simple linear regression (or OLS - Ordinary Least Squares) - two general approaches are used:

Making the model calibration more robust - so that minor deviations (such as extreme outliers) - do not negatively in fluence the modelling process.
Allowing different assumptions in the model or the error term

In this session we will look at some examples of these. To do this, we’ll need to load some R packages, and reload the image of the variables created in the earlier regression session (Session 2).

library(tidyverse)
library(GISTools)
library(GWmodel)
library(rgdal)
library(tmap)
library(knitr)
library(car)
library(gclus)
library(kableExtra)
load('session2.RData')

2. Robust (‘bomb-proof’) Regression

This relates to point 1 above. Most of this session will relate to examples linking to point 2 - but it is useful to consider at least one example of the first kind. To investigate this example, we will need to recall the results from session 2 - the introductory regression session, and revisit the regression example for median income (MedInc) there. We have already re-loaded the results from the session in the introduction. The results of the linear regression are stored in m -

summary(m)

## 
## Call:
## lm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + PctPov + 
##     PctBlack, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4203  -2.9897  -0.6163   2.2095  25.8201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.59895    3.10893  16.919  < 2e-16 ***
## PctRural     0.07377    0.02043   3.611 0.000414 ***
## PctBach      0.69726    0.11221   6.214 4.73e-09 ***
## PctEld      -0.78862    0.17979  -4.386 2.14e-05 ***
## PctFB       -1.29030    0.47388  -2.723 0.007229 ** 
## PctPov      -0.95400    0.10459  -9.121 4.19e-16 ***
## PctBlack     0.03313    0.03717   0.891 0.374140    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.155 on 152 degrees of freedom
## Multiple R-squared:  0.7685, Adjusted R-squared:  0.7593 
## F-statistic: 84.09 on 6 and 152 DF,  p-value: < 2.2e-16

However, looking at the studentised residuals suggested a few outliers that might ‘throw’ the regression coefficient estimates. One way to investigate this is to remove the outlying observations and re-fit the model. The removal is best done using the dplyr approach and the filter function, and re-fitting the model.

df %>% filter(abs(rstudent(m)) <= 2) -> df2
m2 <- lm(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, data = df2)

This done, the compareCoefs function in car lets the results be compared:

kable(compareCoefs(m,m2,print=FALSE), digits=3)

	Est. 1	SE 1	Est. 2	SE 2
(Intercept)	52.599	3.109	52.905	2.644
PctRural	0.074	0.020	0.067	0.017
PctBach	0.697	0.112	0.728	0.105
PctEld	-0.789	0.180	-0.925	0.147
PctFB	-1.290	0.474	-1.485	0.459
PctPov	-0.954	0.105	-0.887	0.092
PctBlack	0.033	0.037	0.043	0.031

It may be seen that some of the values alter notably (by around 10%) between the models with and without outliers. PctBlack is more extreme at around 30%. Outliers do have an influence.

However, it could be argued that the approach of removing them entirely is misleading - essentially trying to pass off a model as universal but intentionally ignoring observations that contradict it. Although that might be quite extreme, the point holds to some extent. Those counties do have those statistics, and so should perhaps not be ignored entirely.

Robust regression offers a more nuanced approach. Here, instead of simply ignoring the outliers, it is assumed that for whatever reason, they are subject to much more variance than the remaining observations - possibly due to some characteristic of the sampling process, or - more likely here - due to some unusual circumstances in the geographical locality (for example the presence of a large university skewing income figures compared to most of the State). Rather than totally ignoring such observations, they are downweighted to allow for this. This is less extreme than removing outliers - which in a sense is extreme downweighting to the value zero!

There are a number of approaches to robust regression - here we will use one proposed by Huber (1981). This uses an algorithm as below:

Fit a standard regression model and note the residuals.
Compute the weight for each observation as a function of absolute residual size. A number of possible weight functions could be used, but Huber’s is
\[ w_i(z_i) = \left\{ \begin{array}{ll} 1, & \textsf{if } |z_i| < c; \\ c/|z|, & \textsf{if } |z_i| \ge c; \\ \end{array} \right. \] where $c \approx 1.345$ and $z_i$ is a scaled residual for observation $i$. If actual residual is $e_i$, then $z_i = e_i/\tau$ where $\tau$ is the median of $\frac{|e_i|}{0.6745}$. After residuals reach a value $c$ weighting gradually decreases - in accordance with the idea that higher residuals come from processes with greater variance.
Refit the regression model with these weights. Update the residual values.
Repeat from step 2 until regression coefficient estimates converge.

This is all implemented in the rlm function in the package MASS. It is very easy to use - once you have loaded MASS it just works like lm:

library(MASS)
m3 <- rlm(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, data = df)
m3

## Call:
## rlm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + 
##     PctPov + PctBlack, data = df)
## Converged in 9 iterations
## 
## Coefficients:
## (Intercept)    PctRural     PctBach      PctEld       PctFB      PctPov 
## 51.60773916  0.06337478  0.67787963 -0.77778234 -1.22469054 -0.88053199 
##    PctBlack 
##  0.03179542 
## 
## Degrees of freedom: 159 total; 152 residual
## Scale estimate: 4.19

Here the coefficient estimates are printed out, and it may also be seen that the model converged in 9 iterations, and that the scale factor $\tau$ was estimated as 4.19. As with lm you can print out a summary:

summary(m3)

## 
## Call: rlm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + 
##     PctPov + PctBlack, data = df)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8485  -2.8326  -0.1765   2.5012  27.4204 
## 
## Coefficients:
##             Value   Std. Error t value
## (Intercept) 51.6077  2.7377    18.8506
## PctRural     0.0634  0.0180     3.5224
## PctBach      0.6779  0.0988     6.8604
## PctEld      -0.7778  0.1583    -4.9126
## PctFB       -1.2247  0.4173    -2.9348
## PctPov      -0.8805  0.0921    -9.5604
## PctBlack     0.0318  0.0327     0.9714
## 
## Residual standard error: 4.188 on 152 degrees of freedom

Finally, coefficients from all three aproaches can be compared:

coef_tab <- data_frame(OLS=coef(m),`Outliers Deleted`=coef(m2),Robust=coef(m3))
kable(coef_tab,digits=3)

OLS	Outliers Deleted	Robust
52.599	52.905	51.608
0.074	0.067	0.063
0.697	0.728	0.678
-0.789	-0.925	-0.778
-1.290	-1.485	-1.225
-0.954	-0.887	-0.881
0.033	0.043	0.032

For some coefficients, the robust estimate is closer to the standard OLS, and for others, it is closer to the ‘outliers deleted’ estimate - however in others, it differs from OLS, but not in the same direction as the ‘outliers deleted’ case. As a precautionary approach, I suggest working with the robust approach if there are some notable outliers.

3. Geographically Weighted Models

Regression of some kind underpins many statistical models. However often they make one massive assumption: that the relationships between the $x_k$’s and $y$ are the same everywhere. Regression models (and most other models) are aspatial and because they consider all of the data in the study area in one universal model, they are referred to as Global models. For example, in the models seen in this program so far the assumption is that the contributions made by the different variables in predicting MedInc are the same across the study area. In reality this assumption of spatial invariance is violated in many instances when geographic phenomena are considered.

For these reasons, as a geographer….

I do not expect things to be the same everywhere
I expect to find clusters, hotspots etc
I am interested in how and where processes, relationships, and other phenomena vary spatially**

In statistical terminology… I am interested in spatial non-stationarity in relationships and process spatial heterogeneity.

What this means is that instead of just one general relationship for, for example, the association between changes in PctBach and MedInc, I might expect that relationship (as described in the coefficient estimate for $\beta_k$) to be different in different places. This is a core concept for geographical data analysis. A key question is “how can such a model be calibrated” - a clue lies in the next subsection…

Tobler’s first law of Geography

everything is related to everything else, but near things are more related than distant things. (Tobler, 1970)

Here, Tobler is arguing (although by his own admission he used the term in a tongue in cheek way) that geographically located phenomena that are near to one another tend to have similar characteristics - or that if this were not the case -

“the full range of conditions anywhere on the Earth’s surface could in principle be found packed within any small area. There would be no regions of approximately homogeneous conditions to be described by giving attributes to area objects. Topographic surfaces would vary chaotically, with slopes that were everywhere infinite, and the contours of such surfaces would be infinitely dense and contorted. Spatial analysis, and indeed life itself, would be impossible.” (de Smith et al 2007, p44)

Here, the idea informs the technique known as Geographically Weighted Regression or GWR (Brunsdon et al 1996). If the relationships between the $x_k$’s and $y$ vary spatially, one might at least expect that the relationship between nearby sets of observations should be consistent. Suppose we assign a value of each regression coefficient to a location $(u,v)$ in space (eg $u$ is an easting and $v$ a northing), so that we denote it $\beta_k(u,v)$. The we could estimate each $\beta_k(u,v)$ by calibrating a weighted regression model where $w_i$ was 1 if an observation was within some radius $h$ of $(u,v)$ and 0 otherwise.
In this way, an entire surface of coefficients could be estimated. It may be imagined as being generated by a moving circular window inclusing scanning across a map of the locations of the observations - in this case county centroids. However, 0/1 weighting is quite crude, and one might expect data from the periphery of the circular window to have less spatial influence than data near the middle.

Thus, critical to the operation of GW models is a moving kernel, that downweights peripheral observations. The kernel moves through the study area (for example to cells in a predefined grid) and at each location computes a local model. It uses data under the kernel to construct a (local) model at that location with data weighted points further away from the kernel centre contributing less to the solution. Hence the weighted element in the GW framework. Thus, the kernel defines the data and weights that are used to calibrate the model at each location. The weight, $w_i$, associated with each location $(u_i, v_i)$ is a decreasing function of $d_i$, the distance from the centre of the kernel to $(u_i, v_i)$:

A typical kernel function for example is the bisquare kernel. For a given bandwidth $h$, this is defined by \[ f(d) = \left\{ \begin{array}{cl} \left(1 - \left( \frac{d}{h} \right)^2 \right)^2 & \mbox{ if $d < h$}; \\ & \\ 0 & \mbox{ otherwise.} \end{array} \right. \] Here $h$ may be defined as a fixed value, or in an adaptive way. Gollini et al (2015) provide a description of common kernels shape used in GW models. Generally larger values of $h$ result in a greater degree of spatial smoothing - having a larger window around $\mathbf{u}$ in which cross tabulations have non-zero weighting. As $h \rightarrow \infty$ the coeffients converge to the global OLS model.

There are a number of packages that support geographically weighted analyses in R, including spgwr, GWmodel, gwrr and McSpatial. GWmodel is curated and supported by the two of the original GWR team working with others. We will use this here.

Note that Gollini at al (2015) provide an overview of geographically weighted approaches and provide a thorough treatment demonstrate their use, including the steps of model selection, bandwidth / kernel size optimization, handling collinearity etc.

In brief there are 2 basic steps to any GW approach:

Determine the optimal kernel bandwidth $h$ (ie how many data points - or which size window - to include in each local model);
Fit the GW model.

We will create some spatial data and undertake these steps in turn in the code below: Note the use of the georgia2 dataset as this is projected in meters and not degrees. Distance based measures are used to determine the bandwidth / kernel size.

# create spatial data 
df.sp <- georgia2[, c(14,4:9)]
df.sp$MedInc <- df.sp$MedInc/1000
# determine the kernel bandwidth
bw <- bw.gwr(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack,
              adaptive = T,
              data=df.sp) 
# fit the GWR model
m4 <- gwr.basic(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, 
                adaptive = T,
                data = df.sp,
                bw = bw)

You should have a look at the bandwidth: it has a value of 101 indicating that 63% of the county data will be used in each local calculation (there are 159 records in the data).

bw

## [1] 101

You should have a look at the GWR outputs

m4

The code above prints out the whole model summary. You should compare the Summary of GWR coefficient estimates with the outputs of the ordinary regression mode coefficient estimates:

tab <- rbind(apply(m4$SDF@data[, 1:7], 2, summary), coef(m))
rownames(tab)[7] <- "Global"
tab <- round(tab, 3)

kable(tab, booktabs = T, format = 'markdown')

	Intercept	PctRural	PctBach	PctEld	PctFB	PctPov	PctBlack
Min.	43.570	0.026	0.338	-1.156	-2.931	-1.563	-0.196
1st Qu.	48.440	0.054	0.596	-0.948	-2.100	-1.244	-0.028
Median	51.900	0.078	0.672	-0.857	-1.268	-0.875	0.031
Mean	52.560	0.078	0.710	-0.860	-1.410	-0.931	0.017
3rd Qu.	56.900	0.105	0.856	-0.787	-0.750	-0.594	0.078
Max.	61.630	0.139	1.142	-0.526	0.152	-0.490	0.160
Global	52.599	0.074	0.697	-0.789	-1.290	-0.954	0.033

So the Global coefficient estimates are similar to the Mean and Median values of the GWR output. But you can see the variation in the association between median income (MedInc) and the different coefficient estimates.

Note: the GWmodel outputs have a number of types. Of interest is the SDF which is spatial object of the same class as the input - in this case a SpatialPolygonsDataFrame.

So for example, for PctPov (poverty) :

the global coefficient estimate is -0.954
the local coefficient estimate varies from -1.244 to -0.594 between the 1^st and 3^rd quartiles

Mapping GW outputs

The variation in the GWR coefficient estimates can be mapped to show how different factors are varying associated with Median Income in different parts of Georgia.

Maps where large variation in the relationship between the predictor variable and median income was found:

tm_shape(m4$SDF) +
  tm_fill(c("PctBach", "PctPov")) +
  tm_layout(legend.position = c("right","top"))

Maps showing coefficients who values flip from positive to negative in different parts of the state.

tm_shape(m4$SDF) +
  tm_fill(c("PctFB", "PctBlack")) +
  tm_layout(legend.position = c("right","top")) +
  tm_style_col_blind()

So what we see here are the local variations in the degree to which changes in different variables are associated with changes in Median Income. The maps above are not simply maps of the predictor variable.

4. Poisson Regression

Here another assumption in OLS is modified. The $y_i$ variable is assumed to have a normal distribution, whose mean is determined by the predictor variables (as $\beta_0 + \sum_k \beta_kx_{ik}$ ) and whose variance is $\sigma^2$. However, suppose the value of $y_i$ was a count of some phenomenon - for example the number of crimes committed in a month, or the number of houses sold (rather than the price). Then, the normal distribution is not a sensible model because

$y_i$ can only take intger values
$y_i$ takes a moinimum value of zero

or, more bluntly, $y_i$ is not normally distributed! Typically count data are modelled by the Poisson distribution -

\[ \textsf{Prob}(y_i=j) = \frac{e^{-\lambda}\lambda^j}{j!} \] where $\lambda$ is the mean value of $y_i$. There is a constraint that $\lambda > 0$. Here $\lambda$ is a constant parameter, but if this depended on some predictor variables, we would have a prediction model similar to OLS, but with $y_i$ as a Poisson distributed count variable rather than a normally distributed continuous one.

In principal, one could specify the model so that $\lambda_i$, the mean value for observation $i$, is given by $\lambda_i = \beta_0 + \sum_k\beta_kx_{ij}$ however, this could lead to problems if a negative value of $\lambda_i$ were predicted. One way of overcoming this is to specify a log-transformed relationship:

\[ \log(\lambda_i) = \beta_0 + \sum_{k=1}^{k=m} \beta_k x_{ik} \]

which is the standard form for Poisson regression.

Taking antilogs on both sides, the effect on the expected value is a multiplicative one:

\[ \lambda_i = e^{\beta_0} \times e^{\beta_1 x_{i1}} \times \cdots e^{\beta_k x_{ik}} \]

so that the expected number of observations changes by a scale factor (ie a ratio) for a unit change in a prector variable. In particular for a coefficient $\beta_k$, the factor is $e^{\beta_k}$ - this is useful in interpreting the model.

A final - and extra - feature found in some Poisson regression model is the idea of an offset. For many geographical count-based regressions, there is a ‘population at risk’ - for example the total population of an area when counting people having a particular illness. In this case an expected count is essentially a rate of occurrence multiplied by the population at risk. The model can be written as

\[ \lambda_i = P_i \times e^{\beta_0} \times e^{\beta_1 x_{i1}} \times \cdots e^{\beta_k x_{ik}} \] where $P_i$ is the population at risk. Writing $P_i$ as $e^{\log(P_i)}$ this is really just another predictor term (in $\log(P_i)$) but with the coefficient being fixed at the value 1, rather than being calibrated. Such a term can be specified in R - and is referred to as an offset term.

Poisson Regression Example.

Here, we will use a data set based on UK crime reporting data - as made available via the police.uk web site.

In this case, burglary reports for a one month period in Newcastle upon Tyne are counted for the geographical lower-level super output areas (LSOAs). These are the $y$-variable. As predictors a number of variables are used:

IMD_score The Index of Multiple Deprivation - an index of economic and social deprivation. Higher values correspond to more deprived areas.
dens The population density of each LSOA - this is a measure of degree of rurality/urban-ness
popn The population at risk - the log of this will be the offset variable here

In addition, the response variable is burgs - the count of reported burglaries.

The data can obtained from the data link in the schedule web page. Download the file into your current directory, and then load it into R. The file format is the R binary format.

load('lsoa_imd.RData')

This loads a tibble called lsoa_imd - which can be listed:

kable(head(lsoa_imd))

lsoacode	IMD_score	popn	dens	burgs	geometry
E01008365	3.674	1679	62.5	0	427144.1, 426837.0, 426649.5, 426483.9, 426738.7, 426898.2, 427144.1, 565908.8, 565774.6, 565897.6, 566180.1, 566368.5, 566457.1, 565908.8
E01008364	11.968	1696	99.6	2	427379.1, 427375.2, 427154.7, 427144.1, 426898.2, 427160.3, 427379.1, 566194.8, 565915.1, 565909.7, 565908.8, 566457.1, 566577.0, 566194.8
E01008362	18.735	1795	156.8	2	426943.9, 426900.9, 426844.4, 427263.1, 427284.9, 427284.2, 426943.9, 565158.8, 565291.4, 565654.0, 565439.8, 565357.8, 565357.5, 565158.8
E01008453	14.138	1669	44.0	2	423059.4, 423131.2, 422695.3, 422308.3, 422254.8, 422068.1, 422265.4, 422497.8, 423059.4, 565633.6, 565550.8, 565034.9, 565036.8, 565192.6, 565528.3, 565764.0, 565423.3, 565633.6
E01008292	8.104	1342	63.9	1	420218.9, 420401.4, 420591.7, 420617.2, 420124.4, 420218.9, 565042.6, 565328.6, 565244.8, 564720.6, 564917.2, 565042.6
E01008293	17.514	1447	21.4	0	421477.4, 421387.9, 421087.2, 420840.0, 420980.7, 420547.5, 420550.9, 420547.5, 420550.7, 421477.4, 568316.5, 567537.1, 567858.2, 567794.8, 567612.1, 567725.8, 567758.4, 568454.3, 568457.4, 568316.5

The geometry column is used to map the LSOAs - the other columns are the variables used in the regression. The regression itself is set out below in R:

crime_model <- glm(burgs~dens+IMD_score+offset(log(popn)),data=lsoa_imd,family=poisson())
summary(crime_model)

## 
## Call:
## glm(formula = burgs ~ dens + IMD_score + offset(log(popn)), family = poisson(), 
##     data = lsoa_imd)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9792  -1.3341  -0.1743   0.2937   5.7159  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.611e+00  1.760e-01 -43.232  < 2e-16 ***
## dens         6.727e-05  2.020e-03   0.033  0.97344    
## IMD_score    1.080e-02  3.293e-03   3.280  0.00104 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 274.86  on 174  degrees of freedom
## Residual deviance: 264.40  on 172  degrees of freedom
## AIC: 519.14
## 
## Number of Fisher Scoring iterations: 5

The density coefficient, although positive, is not significant. The index of multiple deprivation is significant, suggesting a strong linkage between burglary risk and higher levels of deprivation. The value of the coefficient is 0.0108 - and the antilog of this is 1.0109 suggesting that if the IMD increases by one unit, this corresponds to around 1% rise in the risk of burglary.

References

Brunsdon, C.F., Fotheringham, A.S. and Charlton M. (1996). Geographically Weighted Regression - A Method for Exploring Spatial Non-Stationarity, Geographic Analysis, 28: 281-298.

De Smith M., Goodchild M., Longley P. Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools, Troubador Publishing Ltd, 2007

Gollini, I., Lu, B., Charlton, M., Brunsdon, C., & Harris, P. 2015. GWmodel: an R package for exploring spatial heterogeneity using geographically weighted models. Journal of Statistical Software 63 (17), 1–50.

Huber, P.J. (1981) Robust Statistics., Wiley.

Tobler W., (1970) A computer movie simulating urban growth in the Detroit region.,Economic Geography, 46(2): 234-240

Other Forms of Regression

Chris Brunsdon

July 2018