In this notebook we undertake a validation exercise, exploring whether the estimated attractiveness score for each town centre boundary responds to variables typically associated with attractive places to live.
# load packages, suppressing any load-time warnings
suppressWarnings(library(lattice))
suppressWarnings(library(robustbase))
suppressWarnings(library(dplyr))
suppressWarnings(library(ggplot2))
# silence warnings globally for cleaner notebook output
options(warn=-1)
# load data
res.all.out = read.csv('c:/Users/sgscombe/Documents/mlm_geos.csv')
# check kde plot of mlm estimates
plot(density(res.all.out$mlm_est))
# exponentiate to return from log units to pounds
plot(density(exp(res.all.out$mlm_est)))
Let’s undertake a quick multicollinearity check.
cols = c('stores', 'CRIME.AND.DISORDER.SCORE', 'medianprice_2015q2_rollingyr',
'green900', 'vacant_rate', 'eer15nm', 'percent_unemp')
## check multicollinearity
pairs(res.all.out[cols], panel=panel.smooth)
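As a numeric complement to the pairs plot, we might also inspect the correlation matrix of the continuous predictors (a quick sketch; eer15nm is categorical, so it is excluded):
# numeric complement to the visual check: pairwise correlations
# between the continuous candidate predictors
num_cols = setdiff(cols, 'eer15nm')
round(cor(res.all.out[num_cols], use='pairwise.complete.obs'), 2)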
On the whole, everything looks fine, so let’s move on to the next step. We estimate a standard linear model alongside a robust regression. We use a set of variables derived from a range of sources, including the AHAH index, the IMD and the 2011 Census. These are:
formula.res = mlm_est ~ stores + CRIME.AND.DISORDER.SCORE +
medianprice_2015q2_rollingyr + green900 +
vacant_rate + eer15nm + percent_unemp
model_lm = lm(formula.res, data=res.all.out)
summary(model_lm)
##
## Call:
## lm(formula = formula.res, data = res.all.out)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5747 -0.3802 -0.1138 0.2248 2.7925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.818e+00 6.180e-02 142.697 < 2e-16 ***
## stores 3.968e-04 4.108e-05 9.659 < 2e-16 ***
## CRIME.AND.DISORDER.SCORE 8.811e-02 2.519e-02 3.498 0.000476 ***
## medianprice_2015q2_rollingyr 7.441e-07 7.885e-08 9.437 < 2e-16 ***
## green900 1.492e-02 2.120e-02 0.704 0.481800
## vacant_rate -1.987e-02 1.755e-03 -11.321 < 2e-16 ***
## eer15nmEastern 3.597e-01 6.261e-02 5.745 1.02e-08 ***
## eer15nmLondon 1.015e-01 5.757e-02 1.763 0.077950 .
## eer15nmNorth East 2.710e-01 7.709e-02 3.515 0.000447 ***
## eer15nmNorth West 2.602e-01 5.832e-02 4.462 8.46e-06 ***
## eer15nmSouth East 3.049e-01 5.697e-02 5.352 9.43e-08 ***
## eer15nmSouth West 1.733e-01 6.023e-02 2.877 0.004046 **
## eer15nmWest Midlands 2.617e-01 6.470e-02 4.045 5.38e-05 ***
## eer15nmYorkshire and The Humber 2.171e-01 6.092e-02 3.563 0.000373 ***
## percent_unemp -2.333e-02 7.759e-03 -3.006 0.002669 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6254 on 2756 degrees of freedom
## Multiple R-squared: 0.1423, Adjusted R-squared: 0.1379
## F-statistic: 32.65 on 14 and 2756 DF, p-value: < 2.2e-16
Now let’s plot the fitted values against the observed.
res.all.out$fitted_est = fitted(model_lm)
ggplot(res.all.out) +
aes(x = fitted_est, y = mlm_est) +
geom_point(alpha=.5, colour='black') +
geom_smooth(method='lm', colour='red') +
xlab('Fitted') +
ylab('Observed')
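The strength of this linear relationship can be quantified directly; for an OLS fit, the squared correlation between fitted and observed values equals the model’s \(R^2\) (a quick sketch):
# squared correlation of fitted vs observed reproduces the multiple R-squared
cor(res.all.out$fitted_est, res.all.out$mlm_est)^2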
To test for outliers, we plot the leverage values for each observation. The leverage value is the \(i\)th diagonal element of the hat matrix:
\[ H = X(X^TX)^{-1}X^T. \] Since \(h_{ii}\) expresses how much the observation \(y_i\) influences its own fitted value \(\hat{y}_i\), it can be used to flag potential outliers. A large value of \(h_{ii}\) indicates that the \(i\)th observation is distant from the centre of the \(n\) observations in \(X\), and hence exerts more leverage.
M = model_lm
# common rules of thumb: flag observations with leverage above 2p/n or 3p/n
cut <- c(2, 3) * length(coefficients(M)) / length(resid(M))
M.hat <- hatvalues(M)
plot(M.hat)
abline(h = cut, col = "red", lty = 2)
# label the observations exceeding the lower (2p/n) threshold
text(which(M.hat > min(cut)), M.hat[M.hat > min(cut)],
     row.names(res.all.out)[M.hat > min(cut)], pos=4, col="red")
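As a sanity check, the \(h_{ii}\) values returned by hatvalues() can be recomputed directly from the design matrix (a minimal sketch that avoids forming the full \(n \times n\) hat matrix):
# recompute the hat diagonal: diag(X (X'X)^{-1} X') = rowSums((X A) * X)
X <- model.matrix(model_lm)
h_direct <- rowSums((X %*% solve(crossprod(X))) * X)
all.equal(unname(hatvalues(model_lm)), unname(h_direct))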
There seem to be several observations that exceed the leverage thresholds. Let’s take a more detailed look by visualising leverage alongside each observation’s discrepancy (studentized residuals), with each observation’s symbol size proportional to its Cook’s D score.
res_augment <- res.all.out %>%
mutate(hat = hatvalues(model_lm),
student = rstudent(model_lm),
cooksd = cooks.distance(model_lm))
# draw bubble plot
ggplot(res_augment, aes(hat, student)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_point(aes(size = cooksd), shape = 1) +
geom_text(data = res_augment %>%
arrange(-cooksd) %>%
slice(1:10),
aes(label = cl_name_x)) +
scale_size_continuous(range = c(1, 20)) +
labs(x = "Leverage",
y = "Studentized residual") +
theme(legend.position = "none")
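As a complementary numeric screen, we could also list the most influential observations by the common 4/n rule of thumb for Cook’s distance (a sketch; the threshold choice is a convention, not part of the original analysis):
# list the ten most influential observations exceeding the 4/n rule of thumb
n <- nrow(res_augment)
res_augment %>%
  filter(cooksd > 4 / n) %>%
  arrange(desc(cooksd)) %>%
  select(cl_name_x, hat, student, cooksd) %>%
  head(10)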
In the plot above we observe several residuals greater than 1.96 standard deviations from the mean. Overall, areas of London appear to exert disproportionate leverage on the linear model, which may necessitate accounting for outliers using a robust regression model. Robust regression commonly uses so-called MM-estimation, which is useful when the distribution of the residuals is not normal, or when some outliers affect the model.
model_rob = lmrob(formula.res, data=res.all.out, method='MM')
summary(model_rob)
##
## Call:
## lmrob(formula = formula.res, data = res.all.out, method = "MM")
## \--> method = "MM"
## Residuals:
## Min 1Q Median 3Q Max
## -7.07585 -0.21520 0.02104 0.32057 2.97399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.557e+00 4.508e-02 189.805 < 2e-16 ***
## stores 8.906e-04 5.203e-05 17.118 < 2e-16 ***
## CRIME.AND.DISORDER.SCORE 8.680e-02 2.037e-02 4.262 2.10e-05 ***
## medianprice_2015q2_rollingyr 8.953e-07 7.949e-08 11.263 < 2e-16 ***
## green900 4.200e-03 1.447e-02 0.290 0.772
## vacant_rate -9.285e-03 1.625e-03 -5.715 1.21e-08 ***
## eer15nmEastern 3.924e-01 4.552e-02 8.620 < 2e-16 ***
## eer15nmLondon 2.462e-01 4.242e-02 5.804 7.20e-09 ***
## eer15nmNorth East 2.736e-01 6.051e-02 4.521 6.42e-06 ***
## eer15nmNorth West 2.202e-01 4.451e-02 4.948 7.93e-07 ***
## eer15nmSouth East 3.056e-01 4.074e-02 7.501 8.48e-14 ***
## eer15nmSouth West 1.744e-01 3.947e-02 4.418 1.03e-05 ***
## eer15nmWest Midlands 2.075e-01 4.542e-02 4.569 5.11e-06 ***
## eer15nmYorkshire and The Humber 1.993e-01 4.264e-02 4.675 3.09e-06 ***
## percent_unemp -4.606e-02 6.024e-03 -7.645 2.86e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Robust residual standard error: 0.3813
## Multiple R-squared: 0.4119, Adjusted R-squared: 0.4089
## Convergence in 14 IRWLS iterations
##
## Robustness weights:
## 117 observations c(3,4,7,8,12,13,16,18,20,22,23,565,566,699,839,862,863,868,877,878,885,900,1126,1127,1128,1130,1132,1136,1137,1141,1145,1147,1151,1153,1155,1160,1246,1368,1370,1375,1381,1384,1385,1387,1390,1391,1398,1407,1410,1416,1417,1740,1741,1742,1744,1746,1748,1749,1750,1753,1755,1756,1758,1761,1763,1764,1765,1767,1771,1774,1781,1783,1786,2052,2053,2055,2056,2068,2072,2079,2080,2280,2283,2285,2289,2290,2297,2369,2386,2387,2392,2393,2398,2399,2400,2401,2403,2404,2405,2407,2411,2413,2419,2421,2587,2591,2592,2594,2597,2598,2601,2604,2605,2606,2607,2610,2612)
## are outliers with |weight| <= 2.2e-05 ( < 3.6e-05);
## 234 weights are ~= 1. The remaining 2420 ones are summarized as
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000468 0.8640000 0.9546000 0.8663000 0.9870000 0.9990000
## Algorithmic parameters:
## tuning.chi bb tuning.psi refine.tol
## 1.548e+00 5.000e-01 4.685e+00 1.000e-07
## rel.tol solve.tol eps.outlier eps.x
## 1.000e-07 1.000e-07 3.609e-05 4.908e-06
## warn.limit.reject warn.limit.meanrw
## 5.000e-01 5.000e-01
## nResample max.it best.r.s k.fast.s k.max
## 500 50 2 1 200
## maxit.scale trace.lev mts compute.rd fast.s.large.n
## 200 0 1000 0 2000
## psi subsampling cov
## "bisquare" "nonsingular" ".vcov.avar1"
## compute.outlier.stats
## "SM"
## seed : int(0)
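Before plotting the robust fit, it can be instructive to set the OLS and MM coefficient estimates side by side (a minimal sketch):
# compare OLS and robust MM coefficient estimates side by side
round(cbind(OLS = coef(model_lm), MM = coef(model_rob)), 4)
Note, for instance, that the stores and percent_unemp coefficients roughly double in magnitude under MM-estimation, suggesting the outlying observations were attenuating these effects in the OLS fit.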
res.all.out$fitted_robust = fitted(model_rob)
ggplot(res.all.out) +
aes(x = fitted_robust, y = mlm_est) +
geom_point(alpha=.5, colour='black') +
geom_smooth(method='lm', colour='red') +
xlab('Fitted') +
ylab('Observed')
Hmm. There is possibly an issue of residual heteroscedasticity here, so let’s quickly check whether the error variance is constant.
par(mfrow=c(1,1))
plot(model_rob)
## recomputing robust Mahalanobis distances
## saving the robust distances 'MD' as part of 'model_rob'
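Beyond the visual diagnostics, a formal heteroscedasticity test could be applied to the standard linear model; a Breusch-Pagan test is one option (a sketch, assuming the lmtest package is available — it was not loaded in the original notebook):
# Breusch-Pagan test for non-constant error variance
# (bptest() is designed for lm fits, so we test model_lm rather than model_rob)
library(lmtest)
bptest(model_lm)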