In this notebook we undertake a validation exercise, exploring whether the estimated attractiveness score for each town centre boundary responds to variables typically associated with attractive places to live.
# load packages, suppressing any load-time warnings
suppressWarnings(library(lattice))
suppressWarnings(library(robustbase))
suppressWarnings(library(dplyr))
suppressWarnings(library(ggplot2))
# silence warnings globally for cleaner notebook output
options(warn=-1)
# load data
res.all.out = read.csv('c:/Users/sgscombe/Documents/mlm_geos.csv')
# check kde plot of mlm estimates
plot(density(res.all.out$mlm_est))
# exponentiate to return from log units to pounds
plot(density(exp(res.all.out$mlm_est)))
Let’s undertake a quick multicollinearity check.
cols = c('stores', 'CRIME.AND.DISORDER.SCORE', 'medianprice_2015q2_rollingyr',
'green900', 'vacant_rate', 'eer15nm', 'percent_unemp')
## check multicollinearity
pairs(res.all.out[cols], panel=panel.smooth)
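As a numeric complement to the pairs plot, we might also inspect the correlation matrix of the continuous predictors (a quick sketch; eer15nm is categorical, so it is excluded):
# numeric complement to the visual check: pairwise correlations
# between the continuous candidate predictors
num_cols = setdiff(cols, 'eer15nm')
round(cor(res.all.out[num_cols], use='pairwise.complete.obs'), 2)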
On the whole, everything looks fine, so let’s move on to the next step. We estimate a standard linear model alongside a robust regression. We use a set of variables derived from a range of sources, including the AHAH index, the IMD and the 2011 Census. These are:
formula.res = mlm_est ~ stores + CRIME.AND.DISORDER.SCORE +
medianprice_2015q2_rollingyr + green900 +
vacant_rate + eer15nm + percent_unemp
model_lm = lm(formula.res, data=res.all.out)
summary(model_lm)
##
## Call:
## lm(formula = formula.res, data = res.all.out)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5747 -0.3802 -0.1138 0.2248 2.7925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.818e+00 6.180e-02 142.697 < 2e-16 ***
## stores 3.968e-04 4.108e-05 9.659 < 2e-16 ***
## CRIME.AND.DISORDER.SCORE 8.811e-02 2.519e-02 3.498 0.000476 ***
## medianprice_2015q2_rollingyr 7.441e-07 7.885e-08 9.437 < 2e-16 ***
## green900 1.492e-02 2.120e-02 0.704 0.481800
## vacant_rate -1.987e-02 1.755e-03 -11.321 < 2e-16 ***
## eer15nmEastern 3.597e-01 6.261e-02 5.745 1.02e-08 ***
## eer15nmLondon 1.015e-01 5.757e-02 1.763 0.077950 .
## eer15nmNorth East 2.710e-01 7.709e-02 3.515 0.000447 ***
## eer15nmNorth West 2.602e-01 5.832e-02 4.462 8.46e-06 ***
## eer15nmSouth East 3.049e-01 5.697e-02 5.352 9.43e-08 ***
## eer15nmSouth West 1.733e-01 6.023e-02 2.877 0.004046 **
## eer15nmWest Midlands 2.617e-01 6.470e-02 4.045 5.38e-05 ***
## eer15nmYorkshire and The Humber 2.171e-01 6.092e-02 3.563 0.000373 ***
## percent_unemp -2.333e-02 7.759e-03 -3.006 0.002669 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6254 on 2756 degrees of freedom
## Multiple R-squared: 0.1423, Adjusted R-squared: 0.1379
## F-statistic: 32.65 on 14 and 2756 DF, p-value: < 2.2e-16
Now let’s plot the fitted values against the observed.
res.all.out$fitted_est = fitted(model_lm)
ggplot(res.all.out) +
aes(x = fitted_est, y = mlm_est) +
geom_point(alpha=.5, colour='black') +
geom_smooth(method='lm', colour='red') +
xlab('Fitted') +
ylab('Observed')
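The strength of this linear relationship can be quantified directly; for an OLS fit, the squared correlation between fitted and observed values equals the model’s \(R^2\) (a quick sketch):
# squared correlation of fitted vs observed reproduces the multiple R-squared
cor(res.all.out$fitted_est, res.all.out$mlm_est)^2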
To test for outliers, we plot the leverage values for each observation. The leverage value is the \(i\)th diagonal element of the hat matrix:
\[ H = X(X^TX)^{-1}X^T. \] Since \(h_{ii}\) expresses how much the observation \(y_i\) influences its own fitted value \(\hat{y}_i\), it can be used to flag potential outliers. A large value of \(h_{ii}\) indicates that the \(i\)th observation is distant from the centre of the \(n\) observations in \(X\), and hence exerts more leverage.
M = model_lm
# common rules of thumb: flag observations with leverage above 2p/n or 3p/n
cut <- c(2, 3) * length(coefficients(M)) / length(resid(M))
M.hat <- hatvalues(M)
plot(M.hat)
abline(h = cut, col = "red", lty = 2)
# label the observations exceeding the lower (2p/n) threshold
text(which(M.hat > min(cut)), M.hat[M.hat > min(cut)],
     row.names(res.all.out)[M.hat > min(cut)], pos=4, col="red")
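As a sanity check, the \(h_{ii}\) values returned by hatvalues() can be recomputed directly from the design matrix (a minimal sketch that avoids forming the full \(n \times n\) hat matrix):
# recompute the hat diagonal: diag(X (X'X)^{-1} X') = rowSums((X A) * X)
X <- model.matrix(model_lm)
h_direct <- rowSums((X %*% solve(crossprod(X))) * X)
all.equal(unname(hatvalues(model_lm)), unname(h_direct))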
There seem to be several observations that exceed the leverage thresholds. Let’s take a more detailed look by visualising leverage alongside each observation’s discrepancy (studentized residuals), with each observation’s symbol size proportional to its Cook’s D score.
res_augment <- res.all.out %>%
mutate(hat = hatvalues(model_lm),
student = rstudent(model_lm),
cooksd = cooks.distance(model_lm))
# draw bubble plot
ggplot(res_augment, aes(hat, student)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_point(aes(size = cooksd), shape = 1) +
geom_text(data = res_augment %>%
arrange(-cooksd) %>%
slice(1:10),
aes(label = cl_name_x)) +
scale_size_continuous(range = c(1, 20)) +
labs(x = "Leverage",
y = "Studentized residual") +
theme(legend.position = "none")
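As a complementary numeric screen, we could also list the most influential observations by the common 4/n rule of thumb for Cook’s distance (a sketch; the threshold choice is a convention, not part of the original analysis):
# list the ten most influential observations exceeding the 4/n rule of thumb
n <- nrow(res_augment)
res_augment %>%
  filter(cooksd > 4 / n) %>%
  arrange(desc(cooksd)) %>%
  select(cl_name_x, hat, student, cooksd) %>%
  head(10)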
In the plot above we observe several residuals greater than 1.96 standard deviations from the mean. Overall, areas of London appear to exert disproportionate leverage on the linear model, which may necessitate accounting for outliers using a robust regression model. Robust regression commonly uses so-called MM-estimation, which is useful when the distribution of the residuals is not normal, or when some outliers affect the model.
model_rob = lmrob(formula.res, data=res.all.out, method='MM')
summary(model_rob)
##
## Call:
## lmrob(formula = formula.res, data = res.all.out, method = "MM")
## \--> method = "MM"
## Residuals:
## Min 1Q Median 3Q Max
## -7.07585 -0.21520 0.02104 0.32057 2.97399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.557e+00 4.508e-02 189.805 < 2e-16 ***
## stores 8.906e-04 5.203e-05 17.118 < 2e-16 ***
## CRIME.AND.DISORDER.SCORE 8.680e-02 2.037e-02 4.262 2.10e-05 ***
## medianprice_2015q2_rollingyr 8.953e-07 7.949e-08 11.263 < 2e-16 ***
## green900 4.200e-03 1.447e-02 0.290 0.772
## vacant_rate -9.285e-03 1.625e-03 -5.715 1.21e-08 ***
## eer15nmEastern 3.924e-01 4.552e-02 8.620 < 2e-16 ***
## eer15nmLondon 2.462e-01 4.242e-02 5.804 7.20e-09 ***
## eer15nmNorth East 2.736e-01 6.051e-02 4.521 6.42e-06 ***
## eer15nmNorth West 2.202e-01 4.451e-02 4.948 7.93e-07 ***
## eer15nmSouth East 3.056e-01 4.074e-02 7.501 8.48e-14 ***
## eer15nmSouth West 1.744e-01 3.947e-02 4.418 1.03e-05 ***
## eer15nmWest Midlands 2.075e-01 4.542e-02 4.569 5.11e-06 ***
## eer15nmYorkshire and The Humber 1.993e-01 4.264e-02 4.675 3.09e-06 ***
## percent_unemp -4.606e-02 6.024e-03 -7.645 2.86e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Robust residual standard error: 0.3813
## Multiple R-squared: 0.4119, Adjusted R-squared: 0.4089
## Convergence in 14 IRWLS iterations
##
## Robustness weights:
## 117 observations c(3,4,7,8,12,13,16,18,20,22,23,565,566,699,839,862,863,868,877,878,885,900,1126,1127,1128,1130,1132,1136,1137,1141,1145,1147,1151,1153,1155,1160,1246,1368,1370,1375,1381,1384,1385,1387,1390,1391,1398,1407,1410,1416,1417,1740,1741,1742,1744,1746,1748,1749,1750,1753,1755,1756,1758,1761,1763,1764,1765,1767,1771,1774,1781,1783,1786,2052,2053,2055,2056,2068,2072,2079,2080,2280,2283,2285,2289,2290,2297,2369,2386,2387,2392,2393,2398,2399,2400,2401,2403,2404,2405,2407,2411,2413,2419,2421,2587,2591,2592,2594,2597,2598,2601,2604,2605,2606,2607,2610,2612)
## are outliers with |weight| <= 2.2e-05 ( < 3.6e-05);
## 234 weights are ~= 1. The remaining 2420 ones are summarized as
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000468 0.8640000 0.9546000 0.8663000 0.9870000 0.9990000
## Algorithmic parameters:
## tuning.chi bb tuning.psi refine.tol
## 1.548e+00 5.000e-01 4.685e+00 1.000e-07
## rel.tol solve.tol eps.outlier eps.x
## 1.000e-07 1.000e-07 3.609e-05 4.908e-06
## warn.limit.reject warn.limit.meanrw
## 5.000e-01 5.000e-01
## nResample max.it best.r.s k.fast.s k.max
## 500 50 2 1 200
## maxit.scale trace.lev mts compute.rd fast.s.large.n
## 200 0 1000 0 2000
## psi subsampling cov
## "bisquare" "nonsingular" ".vcov.avar1"
## compute.outlier.stats
## "SM"
## seed : int(0)
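Before plotting the robust fit, it can be instructive to set the OLS and MM coefficient estimates side by side (a minimal sketch):
# compare OLS and robust MM coefficient estimates side by side
round(cbind(OLS = coef(model_lm), MM = coef(model_rob)), 4)
Note, for instance, that the stores and percent_unemp coefficients roughly double in magnitude under MM-estimation, suggesting the outlying observations were attenuating these effects in the OLS fit.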
res.all.out$fitted_robust = fitted(model_rob)
ggplot(res.all.out) +
aes(x = fitted_robust, y = mlm_est) +
geom_point(alpha=.5, colour='black') +
geom_smooth(method='lm', colour='red') +
xlab('Fitted') +
ylab('Observed')
Hmm. There is possibly an issue of residual heteroscedasticity here, so let’s quickly check whether the error variance is constant.
par(mfrow=c(1,1))
plot(model_rob)
## recomputing robust Mahalanobis distances
## saving the robust distances 'MD' as part of 'model_rob'
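Beyond the visual diagnostics, a formal heteroscedasticity test could be applied to the standard linear model; a Breusch-Pagan test is one option (a sketch, assuming the lmtest package is available — it was not loaded in the original notebook):
# Breusch-Pagan test for non-constant error variance
# (bptest() is designed for lm fits, so we test model_lm rather than model_rob)
library(lmtest)
bptest(model_lm)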