Student Number: 1125961

Exploring House Price Data in Birmingham

To many economic geographers, the notion that “spatial is special” is obvious. After all, Tobler’s first law of geography states that “everything is related to everything else, but near things are more related than distant things” (1969).

Because of this, it is often assumed that analysis of geographic datasets will be done via the Geographically Weighted toolkit developed by Fotheringham et al. (2002) (or other comparable methodologies). However, a large volume of academic and commercial research still relies upon aspatial models – in particular, the Random Forest (“RF”).

Whilst this preference can no doubt be partly attributed to a lack of geographic knowledge, in many cases it is also appropriate to recognise the strong performance and generalisable nature offered by non-parametric models like RF.

The following document therefore develops and presents a novel geographic variant of RF, in which the results are weighted by K-nearest neighbours. The purpose of doing so is to marry the advantages of spatially weighted parametric models to the benefits of RF. This is a model newly developed for this project, and as such will be subject to further revisions and analysis before being uploaded to CRAN.

To demonstrate the use of the spatially K-weighted RF (“KGWRF”), an analysis of Birmingham house price data is conducted. The research explores the relationship between dependent and independent variables rather than attempting to make predictions.

A brief layout of the document is included below:

  1. Exploratory Analysis;
  2. Comparison of Moran’s I via different neighbourhood weighting systems;
  3. The results of an ordinary least squares regression (“OLS”);
  4. The results of the spatial error model;
  5. The results of a geographically weighted regression (“GWR”);
  6. The results of KGWRF; and
  7. A review of the dataset.

The code used to create the KGWRF is presented at the bottom of the markdown to improve readability (it is a long chunk of code).

The Exploratory Analysis

The libraries used in the following analysis are loaded at the beginning of the code.

The data on Birmingham City is then read in and a spatial polygons data frame is created.

Access Packages and Relevant Data

Sys.setenv(NOAWT=1)
library(sf)
library(tmap)
library(plotly)
library(ggplot2)
library(dplyr)
library(tidyr)
library(grid)
library(GWmodel)
library(spdep)
library(spatialreg)
library(tmaptools)
library(RANN)
library(cartogram)
library(broom)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(OpenStreetMap)

Birmingham_Shape <- st_read("birmingham.shp")
st_crs(Birmingham_Shape) <- 27700 # Assign OSGB 1936 (British National Grid) before any conversion
Birmingham_Shape$LN_Price <- log(Birmingham_Shape$price_paid)
birmingham_sp <- as_Spatial(Birmingham_Shape)
birmingham_map <- read_osm(Birmingham_Shape)

To explore the distribution of the relevant data, a chart of house price density with a histogram overlay has been created.

Examining the charts reveals that among Birmingham’s MSOAs, the median house price predominantly falls between £80,000 and £300,000. However, like many continuous variables, there are significant outliers at the right end of the distribution. Consequently, the data has a positive skew, with mean house price > median house price > mode house price.

Since modelling a non-normal distribution is likely to produce results which are also highly skewed, a log transformation is applied to the data analysed in this research.
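
As a quick check of the skew described above, the mean, median and a moment-based skewness statistic can be computed directly (a minimal sketch using base R; the exact figures depend on the data):

x <- Birmingham_Shape$price_paid
c(mean = mean(x), median = median(x))   # the mean exceeds the median under a right skew
mean((x - mean(x))^3) / sd(x)^3         # moment-based skewness; a positive value confirms the right skew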

Create Plotly Charts of Price Distributions

p <- ggplot(Birmingham_Shape, aes(log(price_paid))) + 
  geom_histogram(aes(y = ..density..), alpha = 0.7, fill = "#333333") + 
  geom_density(fill = "#ff4d4d", alpha = 0.5) + 
  theme(panel.background = element_rect(fill = '#ffffff')) + 
  ggtitle("Birmingham House Price Density with Histogram overlay") +
  xlab("Price (log)") +
  ylab("Density (log)")


p1 <- ggplot(Birmingham_Shape, aes(price_paid)) + 
  geom_histogram(aes(y = ..density..), alpha = 0.7, fill = "#333333") + 
  geom_density(fill = "#20f40c", alpha = 0.5) + 
  theme(panel.background = element_rect(fill = '#ffffff')) + 
  ggtitle("Birmingham House Price Density with Histogram overlay") +
  xlab("Price") +
  ylab("Density")

subplot(p1, p,margin = 0.1, titleY = TRUE, titleX = TRUE)

To review the spatial distribution of house prices in Birmingham, a map is created below using the original data for clarity (rather than the log transformations).

A cursory look at the map shows the high value (blue) properties are situated within two regions. These are the mid-west and north-west areas of the city.

To assess this further we can rescale the map by population size. An examination of this map suggests that the mid-west MSOAs represent densely populated city areas, whilst the north-western area is more likely a less densely populated rural commuter belt (imagine metropolitan flats vs suburban homes with gardens).

Weighting the maps in this way demonstrates the need for a spatially aware methodology, because the underlying features which drive property prices vary by geography. For instance, in a simple regression, crime may act as a confounding variable and correlate with high property prices in the mid-west region of Birmingham since it is a proxy for city proximity. Conversely, in the north-west suburbs crime would likely have the more intuitive effect of driving down property prices.

Create Map and Population Weighted Map of Birmingham

breaks = c(0.5,1,1.5,2,2.5,3,3.5,4) * 100000

tmap_mode("plot")

benm <- 
  qtm(birmingham_map)+
  tm_shape(Birmingham_Shape) +
  tm_fill("price_paid",palette=c("red","white","blue"),style = "cont",title = "Median Household Price", midpoint = NA, alpha = 0.25) +
  tm_borders()+
  tm_layout(legend.outside = TRUE, main.title = "Median House Price",
    main.title.position = "center", title.size = 10, frame = "white") +
  tm_scale_bar(position = c("right", "bottom"))

# construct a cartogram weighted by the MSOA population
afr_sf_cont <- cartogram_cont(Birmingham_Shape, "persons", 5)
st_crs(afr_sf_cont) <- 27700 #Assigning OSGB 1936 to shapefile.
woah <- read_osm(afr_sf_cont)

ben <- 
  qtm(woah)+
  tm_shape(afr_sf_cont) +
  tm_fill("price_paid",palette=c("red","white","blue"),style = "cont",title = "Median Household Price", midpoint = NA,alpha = 0.25) +
  tm_borders()+
  tm_layout(legend.outside = TRUE, main.title = "Median House Price (MSOA weighted by pop)",
    main.title.position = "center", title.size = 10, frame = "white") +
  tm_scale_bar(position = c("right", "bottom"))

Comparison of Moran’s I via different neighbourhood weighting systems

To statistically determine (as opposed to eyeballing a map) spatial autocorrelation, it is necessary to calculate Moran’s I.

Doing so requires the identification of neighbourhoods, which can be done in numerous ways, however for the purpose of this study a few common variants have been calculated and visualised.

Beneath the code, reading left to right and then top to bottom, the variants are:

  1. 5 K-nearest neighbours;
  2. 10 K-nearest neighbours;
  3. Inverse-distance 5 K-nearest neighbours (same as above);
  4. Contiguous neighbours;
  5. Neighbours within 0-1500m; and
  6. Neighbours within 0-2000m.

Assign Variables For Calculating Neighbours

pts <- coordinates(birmingham_sp)
IDs <- row.names(as(birmingham_sp, "data.frame"))

Calculate Contiguous Neighbours

(The standard method for calculating contiguous neighbours produces incorrect results here, with some MSOAs incorrectly listed as neighbourless. The code below rectifies the situation.)
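
For reference, a minimal sketch of the standard approach and of the check that reveals the problem (illustrative only; the st_overlaps workaround below is what is actually used):

nb_std <- poly2nb(birmingham_sp)   # standard contiguity neighbours
sum(card(nb_std) == 0)             # number of MSOAs incorrectly reported as having no neighbours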

overlapmat = st_overlaps(Birmingham_Shape,sparse=FALSE)
ovnb = mat2listw(overlapmat)
wr <- nb2mat(ovnb$neighbours, style='B')
contig<- ovnb$neighbours

Calculate Distance Based Neighbours

wd1 <- dnearneigh(pts, 0, 1500)

wd2 <- dnearneigh(pts, 0, 2000)

Calculate K Nearest Neighbours

k10 <- knn2nb(knearneigh(pts, k=10, RANN=FALSE))
k5 <- knn2nb(knearneigh(pts, k=5, RANN=FALSE))

Calculate Inverse Weight

knn <- nn2(pts, pts, k = 5)
d <- knn$nn.dists   # The distances
d <- d[,-1]         # Delete the first column of zero distances
idw <- 1/d          # the inverse of distance 
glist <- lapply(1:nrow(idw), function(i) idw[i,])
# This is just a way of converting the inverse distances into weights
knearnb <- knn2nb(knearneigh(pts, k=4, RANN=FALSE))
spknear35IDW <- nb2listw(knearnb, glist=glist)
# Combine the neighbourhood list with the inverse distance weighting

Convert Different Weights to ListW

wr_lw <- nb2listw(contig, zero.policy=T)
wd1_lw <- nb2listw(wd1, zero.policy=T)
wd2_lw <- nb2listw(wd2, zero.policy=T)
k5_lw <- nb2listw(k5, zero.policy=T)
k10_lw <- nb2listw(k10, zero.policy=T)

Visualise Neighbourhoods

par(mfrow=c(1,2))

plot(Birmingham_Shape$geometry, border = "grey", main = "5 K-Nearest Neighbours")
plot(k5, st_coordinates(st_centroid(Birmingham_Shape)), pch = 19, cex = 0.4, add = TRUE, col = 'red')

plot(Birmingham_Shape$geometry, border = "grey", main = "10 K-Nearest Neighbours")
plot(k10, st_coordinates(st_centroid(Birmingham_Shape)), pch = 19, cex = 0.4, add = TRUE, col = 'red')

Once we have divided the MSOAs into neighbourhoods, we can calculate Moran’s I. This can be done via regression or Monte Carlo simulation. Since the distribution of polygons in our dataset is irregular, the more robust method (Monte Carlo simulation) is used. The results are shown in the table below, with a visualisation of the Monte Carlo simulations presented underneath.

Moran’s I:

Neighbourhood     Moran’s I   P-Value
Contiguous        0.619       0.01
0-1500m           0.443       0.01
0-2000m           0.449       0.01
K 5               0.512       0.01
K 10              0.416       0.01
K 5 (inverse)     0.532       0.01
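
For comparison with the simulated values, the test can also be run analytically on any of the weight lists (a sketch; the Monte Carlo results remain those reported here):

moran.test(Birmingham_Shape$LN_Price, k5_lw, zero.policy = TRUE)   # analytical Moran's I test, 5 nearest neighbours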

In each instance the computed P-value suggests a significant result. As such we can reject the null hypothesis (that the median house price is randomly distributed among the MSOAs). Instead we can conclude that the spatial distribution of expensive and/or cheap house prices in the dataset is more spatially clustered than would be expected if the underlying spatial processes were random. This is clearly visualised in each of the Monte Carlo simulations below, where the Moran’s I value sits well outside each of the simulated outcomes.

Among the different conceptualisations of spatial relationships, the greatest level of spatial autocorrelation is present within the contiguous neighbourhood calculation, whilst the distance-based definitions (0-1500m and 0-2000m) return some of the lowest values. This is a counter-intuitive outcome, since contiguity and short distance bands should be very similar measures of proximity; however, it is likely explained by the large number of features left without connections by the distance method (distance is taken from the centroid, not the border).
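
This explanation can be checked directly by counting the features left without neighbours under each distance band (a quick sketch using the objects created above):

sum(card(wd1) == 0)   # MSOAs with no neighbours within 1500m
sum(card(wd2) == 0)   # MSOAs with no neighbours within 2000m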

Run Moran Monte Carlo Simulations

#run sim
wr_moran<- moran.mc(Birmingham_Shape$LN_Price, wr_lw, nsim=99)
wd1_moran<- moran.mc(Birmingham_Shape$LN_Price, wd1_lw,zero.policy=TRUE, nsim=99)
wd2_moran<- moran.mc(Birmingham_Shape$LN_Price, wd2_lw,zero.policy=TRUE, nsim=99)
k5_moran<- moran.mc(Birmingham_Shape$LN_Price, k5_lw, nsim=99)
k10_moran<- moran.mc(Birmingham_Shape$LN_Price, k10_lw, nsim=99)
k35_moran<- moran.mc(Birmingham_Shape$LN_Price, spknear35IDW, nsim=99)

Visualise Monte Carlo Simulations

par(mfrow=c(1,2))
plot(k5_moran,main="5 K-Nearest Neighbours",xlab="Birmingham Log House Price 
     MC Simulation")
plot(k10_moran,main="10 K-Nearest Neighbours",xlab="Birmingham Log House Price 
     MC Simulation")

par(mfrow=c(1,2))

plot(k35_moran,main="Inverse 5 K-Nearest Neighbours",xlab="Birmingham Log House Price 
     MC Simulation")

plot(wr_moran,main="Contiguous Neighbours",xlab="Birmingham Log House Price 
     MC Simulation")

par(mfrow=c(1,2))

plot(wd1_moran,main="Neighbours within 0-1500m",xlab="Birmingham Log House Price 
     MC Simulation")

plot(wd2_moran,main="Neighbours within 0-2000m",xlab="Birmingham Log House Price 
     MC Simulation")

After calculating the above we can select the most appropriate method of neighbourhood calculation. Since the values of median price are skewed, the K-nearest neighbours method is chosen. This has the advantage of ensuring there will always be neighbours, even in geographies where feature density varies widely. Similarly, because the target variable is continuous, inverse distances and a Gaussian kernel are more appropriate.

The next step is to calculate the optimum number of neighbours. In the chart below, the correlation between the variable and its kth nearest neighbour is plotted. The smoothed line shows that the correlation begins to flatten out after around the 40th neighbour - implying an optimum neighbourhood of the inverse-weighted 40 nearest neighbours.

Interestingly, the correlation begins to rise again at around the 70th neighbour.

Calculate Optimum Number of Neighbours

knear250 <- knearneigh(pts, k=131)
## Warning in knearneigh(pts, k = 131): k greater than one-third of the number of
## data points
r <- sapply(1:131, function(i) {
  cor(Birmingham_Shape$LN_Price, Birmingham_Shape$LN_Price[knear250$nn[,i]])
})

K_optimum<-data.frame(k = 1:131, r = r) %>%
  ggplot(aes(x = k, y = r)) +
  geom_line() +
  geom_smooth(se = FALSE) +
  xlab("kth nearest neighbour") +
  ylab("Correlation, r")

K_optimum

Create the Optimum Neighbour Matrix

options(scipen = 0)

knn_gau_idw <- nn2(pts, pts, k = 40)
d <- knn_gau_idw$nn.dists
d <- d[,-1]
glist <- lapply(1:nrow(d), function(i) {
max.d <- d[i,39]
exp(-0.5*d[i,]^2 / max.d^2)
# The above calculates the gaussian weighting
})
knear_gau_idw <- knearneigh(as(pts, "matrix"), k=39)
knearnb_gau_idw <- knn2nb(knear_gau_idw)
spknear30IDW <- nb2listw(knearnb_gau_idw, glist=glist, style="C")

The Monte Carlo simulation shows that Moran’s I for the optimum weight matrix is 0.17 and statistically significant. Though this is lower than for the methods demonstrated above, we now have a rationale for the number of neighbours selected, rather than an arbitrary choice.

The results are visualised below.

Moran Value for Optimum

hallo<- moran.mc(Birmingham_Shape$LN_Price, spknear30IDW, nsim=99)

plot(hallo,main="Optimum Weight Matrix (inverse K-40)",xlab="Birmingham Log House Price 
     MC Simulation")

Additionally, it is possible to calculate each MSOA’s local Moran value for the log price of housing from its respective centroid.

As can be seen in the histogram below, the majority of local Moran values are greater than zero, suggesting most MSOAs sit within a cluster of similarly priced MSOAs.

Calculate the Local Moran

localm_h <- localmoran(birmingham_sp$LN_Price, listw2U(spknear30IDW))

localm_1 <- data.frame(localm_h)

localm_i <- data.frame(localm_1$Ii)

moran.map <- cbind(birmingham_sp, localm_1)
hist(moran.map$Ii,
     main="Histogram for Local Moran", 
     xlab="Local Moran",
     border="white", 
     col="blue",
     xlim=c(-1,2),
     las=1)
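
As a quick check on the claim above, the proportion of MSOAs with a positive local Moran value can be computed from the object just created (a sketch):

mean(moran.map$Ii > 0)   # proportion of MSOAs with a local Moran value greater than zero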

These results are then compared against the geographically weighted standard deviation of log prices.

To calculate the geographically weighted standard deviation, a bandwidth is defined. This requires the selection of independent variables.

In this instance the variables selected are:

  1. Crime
  2. Detached
  3. New Builds
  4. Freeholds
  5. Distance from City Centre
  6. Percentage Unemployed

Though the selection of variables is relatively arbitrary (with the exception of distance, which was chosen to demonstrate the problem of treating distance as an ordinary feature), further details of the variables can be found in Harris’ book Quantitative Geography: The Basics (2016).

A Gaussian kernel is used for the bandwidth calculation since the variable is continuous.

Calculate the Standard Deviation

distances<- gw.dist(dp.locat = pts)

bw<- bw.gwr(LN_Price ~ crimes +  detached  + new + freehld + d + pctunemp, data=birmingham_sp, adaptive = T, dMat = distances, kernel = "gaussian")

gwstats <- gwss(birmingham_sp, vars = "LN_Price", adaptive=T, bw=bw,
kernel = "gaussian")



gw_sd <- as.data.frame(gwstats$SDF$LN_Price_LSD)

If we then map the local Moran’s I and the local standard deviation, we can see that the lower Moran values are predominantly on the periphery of the city. It is thus possible to infer that further from the city centre, prices become more random and less affected by spatial clustering. The exception to this rule is the higher-priced properties in the north of the city. Other than in this area, the relationship between the Moran value and the standard deviation of price appears weak.

Map of STDs and Local Moran Values

std_1 <- cbind(birmingham_sp, gw_sd)

std_map<- tm_shape(std_1) +
  tm_fill("gwstats.SDF.LN_Price_LSD",palette=c("red","orange","yellow","green","blue"),title = "Standard Deviation",style = "cont", midpoint = NA) +
  tm_borders()+
  tm_layout(legend.outside = TRUE, 
    main.title.position = "center", title.size = 10,frame="white")

local_map_moran <- tm_shape(moran.map) +
  tm_fill("Ii",palette=c("red","orange","yellow","green","blue"),title = "Local Moran Value",style = "cont", midpoint = NA) +
  tm_borders()+
  tm_layout(legend.outside = TRUE, 
    main.title.position = "center", title.size = 10,frame="white")

To formalise this statement we can plot the relationship between the local Moran’s I and the standard deviation of property price. As can be seen below, the regression line shows there is very little relationship - although the correlation is 0.22 and does appear to be statistically significant when explicitly calculated.

Local Moran Plot

yo<- data.frame(moran.map,std_1)


ggplot(data= yo, aes(x= localm_i$localm_1.Ii, y= std_1$gwstats.SDF.LN_Price_LSD))+
geom_point()+
geom_smooth(method=lm, se= F)+
labs(x= "Local Moran’s I", y= "STD",
title= "Local Moran Statistics")
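
The “explicitly calculated” correlation mentioned above can be reproduced with a standard correlation test (a sketch, assuming the column names created earlier):

cor.test(localm_1$Ii, std_1$gwstats.SDF.LN_Price_LSD)   # Pearson correlation between local Moran's I and the GW standard deviation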

Run OLS Regression

Returning to the discussion of aspatial models, a common practice among non-geographers is to treat distance metrics as simply another feature.

For instance, in the dataset we are analysing, one could attempt to use the variable “d”, which represents distance from the inner city, to address spatial dependencies in the data. However, as Britto and Sabourin (2014) argue, spatial heterogeneity will likely persist since geography cannot be adequately controlled for in this way.

To test this, an OLS regression is run on the Birmingham City dataset with the ‘d’ feature. Additional variables identical to those used in the bandwidth section are also used in the model.

OLS Output

model1 <- lm(LN_Price ~ crimes +  detached  + new + freehld + d + pctunemp, data=Birmingham_Shape)

tab_model(model1,digits=5)
                      LN_Price
Predictors            Estimates   CI                      p
(Intercept)           12.59232    12.29638 – 12.88826     <0.001
crimes                -0.00072    -0.00127 – -0.00018     0.009
detached              1.45711     1.16834 – 1.74589       <0.001
new                   -0.04365    -0.29698 – 0.20969      0.734
freehld               -0.17866    -0.34523 – -0.01209     0.036
d                     -0.00003    -0.00004 – -0.00002     <0.001
pctunemp              -0.06655    -0.07905 – -0.05405     <0.001
Observations          132
R2 / R2 adjusted      0.814 / 0.805

The P-values for all features are shown to be statistically significant, with the exception of New Housing. However, an examination of the model’s coefficients highlights the failures of OLS in dealing with spatial data.

For example, although we expect the variable “d” to be negative (the further a home is from the city centre, the less the land is likely to be worth), its impact is far too low. A coefficient of -0.000029 implies that for each one-unit increase in distance, the price of a house falls by this amount. As such, a house 10km from the city centre would be valued £29 less (after accounting for the log transformation of the dependent variable) than one in the city centre. Even the most basic knowledge of real estate would allow one to infer that distance from the city matters more to the price of property than this.

The problem for the model is that it cannot account for cases where the spatial traits of the data alter the importance and impact of a feature. We can see evidence of this again in the “freehld” variable. Properties which are freehold should be worth more than those which are not; however, because inner-city properties are more often leasehold, the inverse appears true here. As such, property prices in certain areas are incorrectly penalised.

From the distribution of residuals in the histogram below it is immediately apparent that the model has struggled to fit the data, since we would hope to see some degree of symmetry around the mean.

Looking at a map of the residuals we can see evidence of spatial clustering, which supports the aforementioned statement. Overestimation occurs around the city centre, whilst there is systematic undershooting a little way north of the centre and in the north-east.

Histogram of Residuals

hist(model1$residuals,
     main="Histogram for OLS Residuals", 
     xlab="Residuals",
     border="white", 
     col="blue",
     xlim=c(-0.5,1),
     las=1)

Map of Residuals
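
The chunk used to draw the residual map is not reproduced here; a minimal sketch of how it could be recreated with tmap follows (the column name ols_resid is introduced purely for illustration):

birmingham_sp$ols_resid <- model1$residuals
tm_shape(birmingham_sp) +
  tm_fill("ols_resid", palette = c("blue", "white", "red"), style = "cont", midpoint = 0, title = "OLS Residuals") +
  tm_borders() +
  tm_layout(legend.outside = TRUE)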

Looking at a scatter plot of the residuals vs fitted values, we can see further evidence of heteroskedasticity. Were the model’s predictions centred on zero we could consider it a reasonable fit; however, this is clearly not the case. Instead, the variance increases as the fitted values increase in size.

Residuals vs Fitted Plot

heter_graph<- plot(model1, which = 1)

To further examine the spatial dependency we can run a Moran’s I test using the optimal matrix defined above.

In doing so we can correct for the fact that the residuals are the product of a regression.

The results show a statistically significant value for the Moran I statistic.

Additionally, the high observed Moran’s I (0.17) suggests we need to deal with the data in a spatially conscious manner and cannot simply use an aspatial regression.

Thus we now have sufficient motivation to deploy a spatially weighted model rather than an aspatial model.

Moran Test

linear_moran<- lm.morantest(model1, listw2U(spknear30IDW))

Global Moran’s I For Regression Residuals:

Observed Moran I    Expectation    Variance
0.168               -0.121         0.0001

Standard Deviate    P-Value
13.09               < 2.2e-16

Spatial Econometric Models

Two common approaches to dealing with spatial data are the spatial lag and the spatial error model. To decide between them we can use a Lagrange multiplier test, and interpret the output via the flow diagram developed by Anselin and Rey (2014, p.110).

Calculating the Lagrange multipliers shows that only the spatial error model is statistically significant, whilst the spatially lagged model is not. As such, Anselin and Rey’s framework suggests we should run a spatial error model.

Lagrange Test:

bam <-lm.LMtests(model1, spknear30IDW, test = c("LMerr","LMlag"))
## Warning in lm.LMtests(model1, spknear30IDW, test = c("LMerr", "LMlag")): Spatial
## weights matrix not row standardized
Model    P-Value
LMerr    < 2.2e-16
LMlag    0.268

Intuitively this makes sense. A spatially lagged model is appropriate when we want to understand interactions between interdependent units (Darmofal, 2015: 4). For data measured over years, such direct unit-to-unit interactions are unlikely to apply here.

Instead, the spatial error model is more appropriate as the spatial dependence observed in our data does not reflect a truly spatial process, but merely the geographical clustering of points of interest (Darmofal, 2015: 4). For example, in this instance it is likely house prices simply cluster by geography due to either proximity to the city centre or planning permissions which facilitate larger gardens etc.
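
For clarity, the two specifications differ in where the spatial term enters: the spatial lag model takes the form y = ρWy + Xβ + ε, so the outcome itself depends on neighbouring outcomes, whereas the spatial error model takes the form y = Xβ + u with u = λWu + ε, so only the error term is spatially dependent.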

Examining the results of the spatial error model, we can see that although the distribution of residuals is still far from perfect, it is better than that of the OLS model. Interestingly, there is little variation in the significance of individual variables: the regression does not identify any as significant which were not previously, or vice versa. However, the standard error of the freehold variable has fallen, whilst that for distance has risen.

Spatial Error Model

fit_3_err <- errorsarlm(LN_Price ~ crimes +  detached  + new + freehld + d + pctunemp, data=Birmingham_Shape, spknear30IDW) 
## Warning: Function errorsarlm moved to the spatialreg package

Histogram of Residuals

hist(fit_3_err$residuals,
     main="Histogram for Spatial Error Residuals", 
     xlab="Residuals",
     border="white", 
     col="blue",
     xlim=c(-0.5,1),
     las=1)

Summary of Spatial Error Model:

             Estimate      Std. Error    z value     Pr(>|z|)
(Intercept)  1.2437e+01    1.4509e-01    85.7189     < 2.2e-16
crimes       -6.3915e-04   2.4336e-04    -2.6263     0.008631
detached     1.5911e+00    1.3460e-01    11.8216     < 2.2e-16
new          1.7339e-02    1.1531e-01    0.1504      0.880469
freehld      -1.5202e-01   7.4383e-02    -2.0437     0.040981
d            -2.9191e-05   6.3582e-06    -4.5911     4.408e-06
pctunemp     -5.9467e-02   5.7737e-03    -10.2996    < 2.2e-16

Residuals:

Min         1Q          Median      3Q         Max
-0.296553   -0.072184   -0.015588   0.065992   0.465379

As per Burnham and Anderson (2004), a difference of ten or more between AIC scores indicates substantially better support for the model with the lower score. Since the AIC of the OLS regression is -133.2 whilst that of the spatial error model is -150.7, there is very strong evidence that the spatial error model better describes the data. Similarly, because the log likelihood of the OLS model is 74.6 whilst for the spatial error model it is 84.3, there is further evidence in favour of the spatial error model.
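
These figures can be extracted directly from the fitted objects (a short sketch; the values quoted above were taken from the model summaries):

AIC(model1)       # OLS
AIC(fit_3_err)    # spatial error model
logLik(model1)
logLik(fit_3_err)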

However, as mentioned, the unequal distribution of residuals around the mean, seen in the histogram above, requires us to continue searching for a method which can adequately deal with spatial autocorrelation.

The results of a GWR

Moving onto the geographically weighted regression, we must again specify a bandwidth and kernel. These are left the same as those used in the models above.

Since the model is composed of many local models, it will typically generate a high R-squared; as such, only limited information can be gleaned from this output.

Moving onto the coefficients, if we examine the variable “detached”, the coefficient ranges from 1.28 to 2.19. This is equivalent to an increase in house price of £259 to £877 (after adjusting for the log dependent variable) for each unit of change in the independent variable. For half of the dataset (the inter-quartile range) the price of houses changes by between £361 and £557 for each unit of change in the variable. When mapping the data we can see that the houses in the north are those for which the coefficient is lowest, since other factors drive price there (as discussed above).

Coefficients can also be seen for the other variables, with crime, as previously hypothesised, having a large range of impact. To visualise this we again plot the spatial patterning on a map. Surprisingly, the crime coefficient is positive in areas far from the city centre and negative in those closer to it. Such a result suggests that crime is unlikely to be statistically significant (as we will see shortly).

The spatial variation follows a similar pattern in the other variables shown below, with the exception of detached housing, which is positive everywhere. The coefficient of detached housing roughly increases as we move closer to the city centre, perhaps suggesting that it can be used as a proxy for land value per square foot, ceteris paribus, as opposed to the house price we currently use as a dependent variable. Meanwhile, the maps of the other variables suggest they are not particularly significant.

Geographically Weighted Regression

gwr.model <- gwr.basic(LN_Price ~ crimes +  detached  + new + freehld + d + pctunemp, data=birmingham_sp, adaptive=T, dMat=distances, bw=bw, kernel = "gaussian")

Coefficient Maps

results<-as.data.frame(gwr.model$SDF)
birmingham_sp$coedetached<-results$detached
birmingham_sp$coefcrime<-results$crimes
birmingham_sp$coefd<-results$d
birmingham_sp$coefpctunemp<-results$pctunemp

detached <- tm_shape(birmingham_sp) +
  tm_fill("coedetached",palette=c("red","orange","yellow","green","blue"),title = "Detached",style = "cont", midpoint = NA) +
  tm_borders()+
  tm_layout(legend.outside = TRUE, 
    main.title.position = "center", title.size = 10,frame="white")

Crime<-tm_shape(birmingham_sp) + 
  tm_fill("coefcrime",palette=c("red","orange","yellow","green","blue"),title = "Crime",style = "cont", midpoint = NA) +
  tm_borders()+
  tm_layout(legend.outside = TRUE, 
    main.title.position = "center", title.size = 10,frame="white")

distance<-tm_shape(birmingham_sp) +
  tm_fill("coefd",palette=c("red","orange","yellow","green","blue"),title = "Distance",style = "cont", midpoint = NA) +
  tm_borders()+
  tm_layout(legend.outside = TRUE, 
    main.title.position = "center", title.size = 10,frame="white")

pct_unemp<-tm_shape(birmingham_sp) +
  tm_fill("coefpctunemp",palette=c("red","orange","yellow","green","blue"),title = "% Unemployed",style = "cont", midpoint = NA) +
  tm_borders()+
  tm_layout(legend.outside = TRUE, 
    main.title.position = "center", title.size = 10,frame="white")
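
The coefficient range and inter-quartile values quoted above can be checked against the extracted SDF (a quick sketch):

summary(results$detached)   # minimum, quartiles and maximum of the local detached coefficients
summary(results$crimes)     # the same for the local crime coefficients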

Before moving on, however, it is important to check the P-values of the variables used in the model.

To do so we again use a Monte Carlo simulation. The results show that only the distance feature is clearly significant (percentage unemployed is borderline at p = 0.05). As such, it is only for distance that the geographic variation of the variable clearly affects the price of houses, and the coefficient surfaces for the other variables should be treated with caution.

Monte Carlo GWR

montecarlo.gwr(LN_Price ~ crimes +  detached  + new + freehld + d + pctunemp, data=birmingham_sp,
adaptive = T, dMat = distances, bw= bw, kernel = "gaussian")
## 
## Tests based on the Monte Carlo significance test
## 
##             p-value
## (Intercept)    0.41
## crimes         0.52
## detached       0.32
## new            0.79
## freehld        0.98
## d              0.00
## pctunemp       0.05

The Geographic Random Forest

Finally we can apply the KGWRF. The code draws on the original randomForest package in R and the SpatialML work, but with adaptations based on the GWR framework developed by Fotheringham et al. (2002). Rather than computing a single “global” model, the KGWRF gives each data point a “local” model calculated from its neighbours, which are defined by the K-nearest neighbours model. The effect is a reduction in spatial autocorrelation, achieved by allowing feature importance to vary according to a data point’s spatial properties; predictions are therefore less prone to bias towards particular geographies and to errors clustering in space.

Though the Random Forest is typically used as a predictive tool, it is applied below as an explanatory model.

The model is calibrated to calculate a local Random Forest for each MSOA from its 40 K-nearest neighbours, as per the optimum model identified earlier.

Run KGWRF (Code for KGWRF is Provided at End of Document)

KGWRF <- BRF(LN_Price ~ crimes +  detached  + new + freehld + d + pctunemp, dframe=birmingham_sp, bw=40,
kernel="adaptive",spd= birmingham_sp, kz=40, ky=39)

Interestingly, the relative model strength varies across space. This suggests a need to incorporate additional variables into the model. Doing so is less problematic than it would be with the GWR, since a strength of RF is that it should not overfit the data even as additional variables are introduced.

Beyond the above, the spatial variation in feature importance can also be calculated using the geographically weighted random forest, using the localised increase in mean squared error (%IncMSE) as a proxy.

In the model results mapped below, we can see clear spatial clustering in feature importance, with the exception of distance, which is important across most of the map. Such an output corresponds with the GWR finding that distance was the most significant variable in our model.
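
The maps referred to above can be drawn from the object returned by BRF along the following lines (a sketch; the column names imp_d and local_r2 are introduced here for illustration):

birmingham_sp$imp_d <- KGWRF$Local.Pc.IncMSE$d        # local %IncMSE for the distance feature
birmingham_sp$local_r2 <- KGWRF$LGofFit$LM_Rsq100     # local model strength (R-squared, %)

tm_shape(birmingham_sp) +
  tm_fill("imp_d", palette = c("red","orange","yellow","green","blue"), style = "cont", midpoint = NA, title = "Local Importance: Distance (%IncMSE)") +
  tm_borders() +
  tm_layout(legend.outside = TRUE)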

Conclusion

The research above presents a variety of models which can be used by economists and real estate agents to understand the housing market in Birmingham.

Since the data has spatial features, and a Moran’s I which indicates spatial autocorrelation, aspatial models are inadequate for the task. The OLS regression demonstrated this, with counter-intuitive coefficients assigned to variables and skewed residuals which clustered spatially.

Spatial models can be considered a necessity as the price of property spreads via diffusion (Redfern, 1997), clusters in certain areas, and is influenced by the features of neighbouring areas (Kolko, 2007). In other words, any model utilised must have the means to deal with spatial autocorrelation.

Following a Lagrange multiplier test, a spatial error model was applied to account for the spatial clustering of residuals. The effect of this model switch was an increase in standard errors, but with little substantive change to coefficients – i.e. the model recognised an uncertainty due to geography which the OLS failed to capture, but the coefficients themselves barely moved. Additionally, the model’s residuals remained skewed.

The next model utilised was the geographically weighted regression. The tool allows for analysis of how different features within the dataset affect property price in the context of an area’s spatial properties (Cellmer, 2012). However, much of the literature suggests the model is not sufficiently capable of dealing with the diffusive nature of spatial data, and that it has a tendency toward overfitting (Steif et al., n.d.).

Finally, the KGWRF model was demonstrated. The model is a useful tool for investigating spatial variations in model performance and feature importance, and highlights variations which may be masked when utilising a traditional Random Forest model. As such, the model provides a bridge for those who prefer non-parametric models to a GWR. The KGWRF also has the benefit of being able to calculate feature importance without the computationally expensive Monte Carlo simulation that the GWR requires.

Beyond the use of models, however, the dataset also requires review. For instance, Lees (2000: 398) has argued that the variables which cause gentrification may diminish in importance over time. Consequently, the temporal dimension of the data has not been treated with the same respect applied to its spatial properties. In an ideal world, we would also have had access to a longitudinal record of house prices.

Another issue is that the KGWRF and GRF rely upon point estimates taken from each MSOA’s centroid. This restricts the analysis performed, so a better alternative may be to interpolate the data and create more variables, or otherwise to use a model which can recognise boundaries.

Bibliography

Burnham, K.P. and Anderson, D.R., 2004. Multimodel Inference: Understanding AIC and BIC in Model Selection. Sociological Methods & Research 33(2): 261–304. doi:10.1177/0049124104268644

Cellmer, R., 2012. The Use of the Geographically Weighted Regression for the Real Estate Market Analysis. Folia Oeconomica Stetinensia 11. https://doi.org/10.2478/v10031-012-0009-6

Fotheringham, A.S., Charlton, M.E. and Brunsdon, C., 1998. Geographically Weighted Regression: A Natural Evolution of the Expansion Method for Spatial Data Analysis. Environment and Planning A: Economy and Space 30(11): 1905–1927. doi:10.1068/a301905

Fotheringham, A.S., Brunsdon, C. and Charlton, M., 2002. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Wiley, Chichester.

Harris, R., 2016. Quantitative Geography: The Basics. Sage.

Kolko, J., 2007. The Determinants of Gentrification. SSRN Working Paper. doi:10.2139/ssrn.985714

Lees, L., 2000. A reappraisal of gentrification: Towards a ‘geography of gentrification’. Progress in Human Geography 24(3): 389–408.

Redfern, P.A., 1997. A new look at gentrification: 2. A model of gentrification. Environment and Planning A 29(8): 1335–1354.

Steif, K., Mallach, A., Fichman, M. and Kassel, S., n.d. Predicting gentrification using longitudinal census data [WWW Document]. Urban Spatial. URL http://urbanspatialanalysis.com/portfolio/predicting-gentrification-using-longitudinal-census-data/ (accessed 12.10.20).

Tobler, W.R., 1969. Geographical Filters and their Inverses. Geographical Analysis 1: 234–253. https://doi.org/10.1111/j.1538-4632.1969.tb00621.x

KGWRF Code

BRF <- function(formula, dframe, bw, kernel, spd, ntree=500, mtry=NULL, importance=TRUE, forests = TRUE, kz=6, ky=5)
  {
    library(randomForest)
    set.seed(1)
    f <- formula(formula)
    RNames <- attr(terms(f), "term.labels")
    ModelVarNo <- length(RNames)
    ntrees <- ntree
    Obs <- nrow(dframe)
    if (is.null(mtry)) {mtry= max(floor(ModelVarNo/3), 1)}

    message("\nNumber of Observations: ", Obs)
    message("Number of Independent Variables: ", ModelVarNo)

    if(kernel == 'adaptive')
    {
      Ne <- bw
      message("Kernel: Adaptive\nNeightbours: ", Ne)
    }
    else
    {
      if(kernel == 'fixed')
      {
        message("Kernel: Fixed\nBandwidth: ", bw)
      }
    }

    Gl.Model <- eval(substitute(randomForest(formula, data = dframe, ntree=ntree, mtry= mtry, importance=importance)))
    yhat<-predict(Gl.Model, dframe)

    message("Number of Variables: ", ModelVarNo)
    message("\n--------------- Global Model Summary ---------------\n")

    print(Gl.Model)

    message("\nImportance:\n")
    print(importance(Gl.Model))

    g.RSS<-sum((Gl.Model$y-yhat)^2)
    g.mean.y<-mean(Gl.Model$y)
    g.TSS<-sum((Gl.Model$y-g.mean.y)^2)

    g.r<-1-(g.RSS/g.TSS)

    message("\nResidual Sum of Squares (Predicted): ", round(g.RSS,3))
    message("R-squared (Predicted) %: ", round(100 * g.r,3))

    pts <- coordinates(spd) # Get the centroid of each district
    knn <- nn2(pts, pts, k = kz)
    # Recall that using RANN includes each district as its own neighbour
    d <- knn$nn.dists   # The distances
    d <- d[,-1]         # Delete the first column of zero distances
    idw <- 1/d          # the inverse of distance
    
    
    DistanceT <- idw
    Dij <- as.matrix(DistanceT)


if (forests == TRUE) {LM_Forests <- as.list(rep(NA, length(ntrees)))}

      LM_LEst1 <- as.data.frame(setNames(replicate(ModelVarNo,numeric(0), simplify = F), RNames[1:ModelVarNo]))
      LM_LEst2 <- as.data.frame(setNames(replicate(ModelVarNo,numeric(0), simplify = F), RNames[1:ModelVarNo]))

      LM_GofFit <- data.frame(y=numeric(0), LM_yfitOOB=numeric(0), LM_ResOOB=numeric(0), LM_yfitPred=numeric(0), LM_ResPred=numeric(0), LM_MSR=numeric(0), LM_Rsq100=numeric(0))

      for(m in 1:Obs){

        #Get the data
        DNeighbour <- Dij[,ky]
        DataSet <- data.frame(dframe, DNeighbour=DNeighbour)

        #Sort by distance
        DataSetSorted<- DataSet[order(DataSet$DNeighbour),]

        if(kernel == 'adaptive')
        {
          #Keep Nearest Neighbours
          SubSet <- DataSetSorted[1:Ne,]
          Kernel_H <- max(SubSet$DNeighbour)
        }
        else
        {
          if(kernel == 'fixed')
          {
            SubSet <- subset(DataSetSorted, DNeighbour <= bw)
            Kernel_H <- bw
          }
        }

        #Bi-square weights
        Wts<-(1-(SubSet$DNeighbour/Kernel_H)^2)^2

        #Calculate WLM
        Lcl.Model<-eval(substitute(randomForest(formula, data = SubSet, ntree=ntree, mtry= mtry, importance=importance)))

        if (forests == TRUE) {LM_Forests[[m]]<-Lcl.Model}

        #Store in table
        #Importance
        for (j in 1:ModelVarNo) {
          LM_LEst1[m,j] <- importance(Lcl.Model)[j,1]
          LM_LEst2[m,j] <- importance(Lcl.Model)[j,2]
        }

    #Observed y
    LM_GofFit[m,1]<-Gl.Model$y[m]
    LM_GofFit[m,2]<-Lcl.Model$predicted[[1]]
    LM_GofFit[m,3]<-LM_GofFit[m,1] - LM_GofFit[m,2]
    LM_GofFit[m,4]<-predict(Lcl.Model, dframe[m,])
    LM_GofFit[m,5]<-LM_GofFit[m,1] - LM_GofFit[m,4]
    LM_GofFit[m,6]<-Lcl.Model$mse[ntrees]
    LM_GofFit[m,7]<-100 * Lcl.Model$rsq[ntrees]

  }
  if (forests == TRUE) {BRF.out<-list(Global.Model=Gl.Model, Locations = pts, Local.Pc.IncMSE= LM_LEst1, Local.IncNodePurity= LM_LEst2, LGofFit=LM_GofFit,Forests=LM_Forests)}
  else {BRF.out<-list(Global.Model=Gl.Model, Locations = pts, Local.Pc.IncMSE= LM_LEst1, Local.IncNodePurity= LM_LEst2, LGofFit=LM_GofFit)}

   message("\n--------------- Local Model Summary ---------------\n")

   message("\nResiduals OOB:\n")
   print(summary(BRF.out$LGofFit$LM_ResOOB))

   message("\nResiduals Predicted:\n")
   print(summary(BRF.out$LGofFit$LM_ResPred))

   t1 <- data.frame(Min = apply(BRF.out$Local.Pc.IncMSE, 2, min), Max = apply(BRF.out$Local.Pc.IncMSE, 2, max),
                     Mean = apply(BRF.out$Local.Pc.IncMSE, 2, mean), StD = apply(BRF.out$Local.Pc.IncMSE, 2, sd))

   message("\n%IncMSE:\n")
   print(t1)

   t2 <- data.frame(Min = apply(BRF.out$Local.IncNodePurity, 2, min), Max = apply(BRF.out$Local.IncNodePurity, 2, max),
                     Mean = apply(BRF.out$Local.IncNodePurity, 2, mean), StD = apply(BRF.out$Local.IncNodePurity, 2, sd))

   message("\n%IncNodePurity: \n")
   print(t2)

   l.RSS.OOB<-sum(BRF.out$LGofFit$LM_ResOOB^2)
   l.RSS.Pred<-sum(BRF.out$LGofFit$LM_ResPred^2)

   mean.y<-mean(BRF.out$LGofFit$y)
   TSS<-sum((BRF.out$LGofFit$y-mean.y)^2)


   l.r.OOB<-1-(l.RSS.OOB/TSS)
   message("\nResidual Sum of Squares (OOB): ", round(l.RSS.OOB,3))
   message("R-squared (OOB) %: ", round(100* l.r.OOB,3))

   l.r.Pred<-1-(l.RSS.Pred/TSS)
   message("Residual Sum of Squares (Predicted): ", round(l.RSS.Pred,3))
   message("R-squared (Predicted) %: ", round(100* l.r.Pred,3))

   lModelSummary = list()
   lModelSummary$l.IncMSE<-t1
   lModelSummary$l.IncNodePurity<-t2
   lModelSummary$l.RSS.OOB<-l.RSS.OOB
   lModelSummary$l.r.OOB<-l.r.OOB
   lModelSummary$l.RSS.Pred<-l.RSS.Pred
   lModelSummary$l.r.Pred <-l.r.Pred


   BRF.out$LocalModelSummary<-lModelSummary


  return(BRF.out)
}