Points and Spatial Regression

Introduction

In this assignment a subset of housing transaction data are taken from the land registry in order to interpret the spatial heterogeneity of price, as well as to critique modelling techniques for predicting house price values.

#Set Working Directory

setwd('C:/Users/Christian/Documents/Masters/Spatial_Analysis/A_2/assignment2/')
 #Load packages

# Layout
library(tufte)
# For pretty table
library(knitr)
# Spatial Data management
library(rgdal)
# Pretty graphics
library(ggplot2)
# Thematic maps
library(tmap)
# Pretty maps
library(ggmap)
# Various GIS utilities
library(GISTools)
# For all your interpolation needs
library(gstat)
# For data manipulation
library(plyr)
# Spatial regression
library(spdep)
# Simulation methods
library(arm)
#Interactive Maps
library(leaflet)
#Plot Spatial polygons
library(sp)
#Smooth Kernels 
library(KernSmooth)
#Allow creation of data tables
library(data.table)
#Allows conversion to raster to plot over leaflet
library(raster)
#Read in Shapefile containg data
db <- readOGR(dsn = '.', layer = 'feb16')
## OGR data source with driver: ESRI Shapefile 
## Source: ".", layer: "feb16"
## with 312 features
## It has 31 fields
## Integer64 fields read as strings:  price imd_rank
#Convert price to numeric
db@data$price <- as.numeric(as.character((db@data$price)))

# Format dates
dts <- as.Date(db@data$trans_date)

Examining the distribution

Histogram

The most prominent inequality among urban areas is residential segregation as a result of a number of factors which combine to influence the heterogeneity of house prices (Vesselinov et al., 2007). Such inequality is highlighted when the distribution of house prices is examined through a histogram. The distribution forms a shape that resembles a power law distribution (Goldstein, 2004), whereby the main concentration of values is situated to the lesser end with a handful that are much larger towards the tail of the distribution (Figure 1).

#Histogram of Distribution

#Create the histogram
hist <- qplot(data=db@data,x=price)

#Call plot
hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Figure 1 Distribution of House Prices

One-dimensional KDE

Logarithms of values have been taken to help normalise the distribution (Figure 2), although there are still some outlieing values towards the tail end caused by a school that sold for ÂŁ10,076,828 (over ten times more than the next highest price - ÂŁ825,000) (Table 3). Kernel Density Estimation (KDE) is a non-parametric method of probability density estimation (red line in figure 2), essentially created using a contiguous smoothing approach through histogram binning. The use of 50 bins sufficiently demonstrates the value of a KDE as it mirrors the now more Gaussian distribution of the histogram, albeit now representing probability rather than counts (Figure 2).

#One-dimensional KDE and logarithmic histogram

#Use logarithmic because it is not a normal distribution
# Create log of house prices
logpr <- log(as.numeric(db@data$price))

#Add log price to data frame
db@data['logpr'] <- logpr

#Create the base
base <- ggplot(db@data, aes(x=logpr))
#Histogram
hist <- base + 
  geom_histogram(bins=50, aes(y=..density..))
#Overlay density plot
kde <- hist + 
  geom_density(fill="#FF6666", alpha=0.5, colour="#FF6666")
#Call plot
kde


Figure 2 Logarithmic distribution of house prices and One-dimensional KDE

Two-dimensional KDE

While a one-dimensional KDE can display the value of the technique to match a distribution, the importance of space in geography implies that a two-dimensional KDE may be of greater value. Figure 3 displays areas of higher transaction propensity with increased shading. The distribution of the density is mostly situated within a large cluster (hot spot), combining two smaller areas of sub-clustering with decreasing density contouring outwards from each hot spot.

While there may a number of reasons why hot spots occur, the lack of reference through other data prevents any definite inference to explain which such hotspots exist. Although personal knowledge of the local area provides evidence to suggest that the high propensity of transactions are not a result of high population and therefore demand for housing (see McMillen, 2003). One explanation for the high transaction density hotspots within the predominantly student area of Wavertree is a high turnover of property within the lucrative student housing industry. Further southeast around the more affluent Allerton, a higher demand for housing for young professionals moving in and elderly moving out might also be a reason to explain the main hotspot within Figure 3.

#Two-dimensional (spatial) KDE

# Reproject coordinates
wgs84 <- CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
db_wgs84 <- spTransform(db, wgs84)
db_wgs84@data['lon'] <- db_wgs84@coords[, 1]
db_wgs84@data['lat'] <- db_wgs84@coords[, 2]

#Convert dataframe to data table
dat <- data.table(db_wgs84@data)

#Make Contour lines
#Use KernSmooth to calculate the KDE plot, set bandwidth and grid size
kde <- bkde2D(dat[ , list(lon, lat)],
              bandwidth=c(.0045, .0068), gridsize = c(100,100))
#Create contour lines
CL <- contourLines(kde$x1 , kde$x2 , kde$fhat)

## Extract contour line levels
#Define levels using wrapper (sapply)
LEVS <- as.factor(sapply(CL, `[[`, "level"))
#Define number of levels within KDE based on data
NLEV <- length(levels(LEVS))

## Convert contour lines to polygons
pgons <- lapply(1:length(CL), function(i)
  Polygons(list(Polygon(cbind(CL[[i]]$x, CL[[i]]$y))), ID=i))
spgons = SpatialPolygons(pgons)

#Convert levels to numeric for plotting
LEVS1 <- as.numeric(as.character(LEVS))

#Define colour palette
pal <- colorNumeric(
  palette = "YlOrRd",
  domain = LEVS1
)

#Create Map

#Stroke= False to prevent polygon borders
leaflet(spgons) %>% addTiles() %>% 
  #Add base map from CartoDB
  addProviderTiles("CartoDB.Positron") %>%
  #Overlay polygons
  addPolygons(color = pal(LEVS1) , stroke = FALSE, fillOpacity = 0.3)%>%
  #Add legend
  addLegend("topright", pal=pal, values= LEVS1,
            title= "Propensity <br> of 
            House <br>Transactions")


Figure 3 Two dimensional KDE

Spatial interpolation - Inverse Distance Weighting (IDW)

To interpret the variation of a given value in space a group of techniques known as spatial interpolation must be used because KDE only represents a surface of propensities. More specifically Inverse Distance Weighting will be used. In simple terms IDW attempts to estimate property price based on the characteristics of the surrounding neighbours of a given property whereby closer neighbours have higher weighting (Brunsdon &Comber, 2015).

The predominant hotspot of high IDW is to the south east of Liverpool (Figure 4). Further inspection of the most expensive houses finds that the most expensive is within L25, additionally L17, L18 and L12 in the south east of Liverpool contain the top 30 most expensive properties (Table 3). IDW is highest when a house is surrounded by consistently similar house prices which may explain why the weighting is so high in the south east where the house prices are more uniformly high. Figure 4 supports some evidence of house price heterogeneity in Liverpool because the hotspot of IDW is surrounded by several cold spots implying that the hotspot is not reflective of the area in general which reflects house price inequalities (Vesselinov et al., 2007). However the majority of Liverpool has a homogenous house price distribution, with consideration that postcodes that aren’t in the data have taken the lowest IDW value to fill the clipped raster.

#Inverse Distance Weights

#Set up grid
liv.grid <- spsample(db, type='regular', n=25000)

#Create IDW
idw.hp <- idw(price ~ 1, locations=db, newdata=liv.grid)
## [inverse distance weighted interpolation]
# Load Liverpool outline for clipping raster file
liv.otl <- readOGR(dsn = '.', layer ='liv_outline')
## Warning in ogrInfo(dsn = dsn, layer = layer, encoding = encoding, use_iconv
## = use_iconv, : ogrInfo: ./liv_outline.dbf not found
## OGR data source with driver: ESRI Shapefile 
## Source: ".", layer: "liv_outline"
## with 1 features
## It has 0 fields
# #Surface Plot

#Set up Coordinates using logarithm because of large house values causing skew
xyz <- data.frame(x=coordinates(idw.hp)[, 1], 
                  y=coordinates(idw.hp)[, 2], 
                  z=idw.hp$var1.pred)

#Set up log coordinates
xyz['lz'] <- log(xyz$z)

#Create copy of coordinates file
xyz2 <- xyz

#Remove unnecessary column
xyz2$z <- NULL

#Create spatial polygon
spg <- xyz2

#Add coordinates to spatial polgon
coordinates(spg) <- ~ x + y

# coerce to SpatialPixelsDataFrame
gridded(spg) <- TRUE

# coerce to raster
rasterDF <- raster(spg)

#Set CRS of raster file for plotting
crs(rasterDF) <- CRS("+init=epsg:27700")

#Clip Raster to Liverpool outline for neatness
clip_rast <- mask(rasterDF, liv.otl)

#Define Colour Palette
pal <- colorNumeric("OrRd", values(clip_rast), na.color = "transparent")

#Create Leaflet Map
leaflet() %>% addTiles() %>%
  #Add base map from CartoDB
  addProviderTiles("CartoDB.Positron") %>%
  #Add raster tiles
  addRasterImage(clip_rast, colors = pal, opacity=0.5) %>%
  #Add legend
  addLegend(pal = pal, values = values(clip_rast),
            title = "Inverse <br> Distance <br> Weight")


Figure 4 Inverse Distance Weighting

## Spatial Econometrics

Regression Modelling

Regression develops bivariate correlation to a multivariate level. Each variable has a coefficient which reflects the degree of correlation when the other variables remain constant. With consideration of this, IMD and whether a property is newly built have been used as explanatory variables when trying to decipher the spatial patterns of house prices in Liverpool. Control variables are used to explain the percentage of correlation between IMD and price that new built houses are responsible for (confounding factors), given that new builds are often built in areas with low deprivation which a bivariate correlation would not account for. Adjusted R-squared is used to measure model fit to prevent overestimation caused by autocorrelation.

#Spatial Econonmetrics

#Read in data
hst <- readOGR(dsn = '.', layer = 'feb16')
## OGR data source with driver: ESRI Shapefile 
## Source: ".", layer: "feb16"
## with 312 features
## It has 31 fields
## Integer64 fields read as strings:  price imd_rank
#Format dates
dts <- as.Date(hst@data$trans_date)

#Read in IMD to cross compare continuous variables within spatial setting
imd <- read.csv('E08000012.csv')

#Merge files to match deprivation with house transaction data by location
db <- merge(hst, imd)
## Warning in `[<-.factor`(`*tmp*`, ri, value = structure(c(68L, 68L, 15L, :
## invalid factor level, NA generated
#Convert price to numeric
db@data$price <- as.numeric(as.character(db@data$price))

#Create column that contains first part of postcode using string split and lapply to form a list the same length as the original column
db$pc <- as.character(lapply(strsplit(as.character(db$pcds), split=" "), "[", 1))

#Create column that contains first part of postcode using string split and lapply to form a list the same length as the original column
db$pc <- as.character(lapply(strsplit(as.character(db$pcds), split=" "), "[", 1))

#Gaussian function to select random values

#Create Function to draw possible random model values 
generate_draw <- function(m){
  # Set up predictors matrix
  x <- model.matrix(m)
  # Obtain draws of parameters (inferential uncertainty)
  sim_bs <- sim(m, 1)
  # Predicted value
  mu <- x %*% sim_bs@coef[1, ]
  # Draw
  n <- length(mu)
  y_hat <- rnorm(n, mu, sim_bs@sigma[1])
  return(y_hat)
}

Baseline non-spatial regression

A baseline non-spatial regression is conducted in Model 1 with each coefficient representing the correlation between the respective explanatory variable and price with confounding factors accounted for. Each model in this study considers the effect of confounding factors, however each model can only account for explanatory variables it includes. The results of Model 1 show that houses tend to be more expensive in areas with lower deprivation (an average of ÂŁ7,336 for each additional unit increase in IMD) which might be expected. A new build might expect an even greater increase in price (an average of ÂŁ38,076) which despite being statistically insignificant might be explained by the highest house price (which causes the data skew in figure 1 and 2) being a new build. The standard error of the new build coefficient reflects this (204,895) which may lead to inaccuracy of the simulation (Gelman & Hill, 2006 p158) in Figure 5. Further error is likely as the coefficient is statistically insignificant and the residual standard error is also very high. The adjusted R-squared value (0.018) suggests that 1.8% of the variation in price is captured by this model.

#Gaussian Distribution Baseline Model

#Basic null linear regreassion model
m1 <- lm('price ~ new + imd_score', db)

#Call model summary
summary(m1)
## 
## Call:
## lm(formula = "price ~ new + imd_score", data = db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -328798 -156445  -68113   25188 9711922 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   454766      93779   4.849 1.97e-06 ***
## newY           38076     204895   0.186  0.85270    
## imd_score      -7336       2622  -2.797  0.00548 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 793900 on 309 degrees of freedom
## Multiple R-squared:  0.02473,    Adjusted R-squared:  0.01842 
## F-statistic: 3.918 on 2 and 309 DF,  p-value: 0.02088


Model 1 Baseline non-spatial regression

The predictive checking graphs throughout this study display a predicted single estimate of the distribution of the (log of) price, as well as the actual distribution and 250 simulations based of the respective model. The predicted line in Figure 5 demonstrates that the prediction is skewed towards the tail of the distribution which is likely to have occurred because the distribution is not perfectly normal (Gelman and Hill, 2006). Moreover the prediction is not accounting for all of the predictive error captured by the model. The simulation most likely underestimates because the residual error boundaries of the model as they are so large for each coefficient as mentioned. The simulation relies on a normal distribution which explains why there is over dispersion and may provide evidence for redistribution of the data.

#Plot predictive checking graph


#Plot realisation of the model given and actual expected value
#There is no need to set the print margin when the tallest line is plotted first
plot(density(db$price), 
     col='red',
     main='')
#Add line for model prediction
lines(density(m1$fitted.values), 
      col='black',
      main='')
# Loop for realizations
for(i in 1:250){
  tmp_y <- generate_draw(m1)
  lines(density(tmp_y),
        col='green',
        lwd=0.1
  )
}
#Add legend
legend('topright', 
       c('Predicted', 'Actual', 'Simulated(n=250)'),
       col=c('black', 'red', 'green'),
       lwd=1)
#Add title
title(main="Predictive Check - Baseline  Non-Spatial Model")


Figure 5 Predictive Check - Baseline non-spatial regression

Logarithmic non-spatial regression

As mentioned previously using logarithmic values of price causes a more Gaussian (normal) distribution. Natural logarithms have been chosen because they are directly interpretable as the coefficient being a percentage difference in price (as opposed to log 10). Taking exponentials of the coefficient values is essential to properly interpret the coefficients, given that the model is in relation to log (price) (Gelman and Hill, 2006 p60-61). In Model 2 the coefficient for IMD represents that for each unit increase in IMD score there is a 2.2% decrease in house price. The variable new is now significant (to the 99% significance level) and the coefficient implies that a new house might expect a 55.2% increase in price in comparison to an old house. The change in distribution has led to an increased ability for the model to capture the variation in house prices (adjusted R-squared = 29.3%) which provides some evidence to support the use of logarithmic values.

#Guassian Logarithmic Baseline

#Define Model
m2 <- lm('logpr ~ new + imd_score', db)

#Call model summary
summary(m2)
## 
## Call:
## lm(formula = "logpr ~ new + imd_score", data = db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4295 -0.2809 -0.0120  0.2653  3.9012 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.490193   0.068591 182.097  < 2e-16 ***
## newY         0.439804   0.149863   2.935  0.00359 ** 
## imd_score   -0.021683   0.001918 -11.304  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5806 on 309 degrees of freedom
## Multiple R-squared:  0.2975, Adjusted R-squared:  0.293 
## F-statistic: 65.43 on 2 and 309 DF,  p-value: < 2.2e-16


Model 2 Predictive check - Logarithmic Non-Spatial Model

The predictive checking (Figure 6) demonstrates that the prediction is still overestimating towards the right causing a skewed distribution as well as under dispersion of values which may imply autocorrelation between the explanatory variables (Zhu, 2012). While there is still some over dispersion of the simulations, the 250 runs are now more accurate which reflects a decrease in inferential error when the coefficient accuracy improves. The values of the actual model now increase rapidly at log 11 before flattening off slightly to the peak just before 12 which explains where the main body of house prices are distributed. However there is still a skew to the right centre, as a result of higher house prices, including the most expensive which is now found at log 16.

#create logpr because R removes it for some reason 

# Create log of house prices
logpr <- log(as.numeric(db@data$price))

#Add log price to data frame
db@data['logpr'] <- logpr
#Plot realisation of the model given and actual expected value
plot(density(m2$fitted.values), 
     xlim=c(8, max(db$logpr)),
     main='') 

# Loop for realizations
for(i in 1:250){
  tmp_y <- generate_draw(m2)
  lines(density(tmp_y),
        col='green',
        lwd=0.1
  )
}#Add line for model prediction
lines(density(db@data$logpr), 
      col='red',
      main='')

#Add legend
legend('topleft', 
       c('Predicted', 'Actual', 'Simulated (n=250)'),
       col=c('black', 'red', 'green'),
       lwd=1)
#Add title
title(main="Predictive check - Logarithmic Non-Spatial Model")


Figure 6 Predictive check - Logarithmic Non-Spatial Model

Fixed Effects

The introduction of fixed effects in the form of short postcodes allows each constant term to vary by postcode and essentially attempts to isolate the comparison between neighbourhood characteristics. The constant terms of Model 3 are all statistically significant which implies that some of the unobserved effects previously absorbed in the error term align spatially within postcodes. Despite looking fairly similar, the exponential values of the postcode coefficients vary greatly between the smallest L28 (98,449) and largest L18 (274,366), which might be expected given that a lot of the most expensive properties sold were in L18. Introducing fixed effects has decreased the percentage impact of a house being new increasing price (45.9%) and for one unit increase in the deprivation index there is still around a 2% decrease in house price.

#Gaussian Logarithmic Fixed Effects

#Fixed Effects model
m3 <- lm('log(price) ~ pc + new + imd_score - 1', db)

#Call model summary
summary(m3)
## 
## Call:
## lm(formula = "log(price) ~ pc + new + imd_score - 1", data = db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3723 -0.2891 -0.0187  0.2266  3.7348 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## pcL10     12.316686   0.315119  39.086  < 2e-16 ***
## pcL11     12.262708   0.213379  57.469  < 2e-16 ***
## pcL12     12.251291   0.116612 105.061  < 2e-16 ***
## pcL13     12.216101   0.178686  68.366  < 2e-16 ***
## pcL14     12.219684   0.245377  49.800  < 2e-16 ***
## pcL15     12.063087   0.143614  83.996  < 2e-16 ***
## pcL16     12.356093   0.206604  59.806  < 2e-16 ***
## pcL17     12.506734   0.125515  99.643  < 2e-16 ***
## pcL18     12.522219   0.096834 129.317  < 2e-16 ***
## pcL19     12.307708   0.143676  85.663  < 2e-16 ***
## pcL20     12.448735   0.614169  20.269  < 2e-16 ***
## pcL24     12.101003   0.227885  53.101  < 2e-16 ***
## pcL25     12.591609   0.122658 102.656  < 2e-16 ***
## pcL27     11.972328   0.330814  36.191  < 2e-16 ***
## pcL28     11.497295   0.590961  19.455  < 2e-16 ***
## newY       0.377967   0.158674   2.382   0.0179 *  
## imd_score -0.016378   0.003143  -5.211 3.53e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5715 on 295 degrees of freedom
## Multiple R-squared:  0.9978, Adjusted R-squared:  0.9977 
## F-statistic:  7870 on 17 and 295 DF,  p-value: < 2.2e-16


Model 3 Predictive check - Logarithmic FE Model

The peak density of the predictive check has decreased (Figure 7) in comparison to Figure 6 which might imply that the predictive error has decreased as more of the explanatory variables influencing price are captured. This is reflected in the near perfect R-squared value (99.8%). Furthermore the predicted distribution is now closer to the smooth bell curve distribution one might expect from a Gaussian distribution especially before the peak density (~ log 12). The simulations are relatively similar to Model 2 with slightly less overestimation towards the peak despite over dispersion around either side of the peak as a result of a change in inferential error.

#Fixed Effects model using LSOA for RMSE comparison
m3b <- lm('log(price) ~ pc + new + LSOA11CD - 1', db)

#Plot realisation of the model given and actual expected value
plot(density(m3$fitted.values), 
     xlim=c(8, max(db$logpr)),
     main='') 
# Loop for realizations
for(i in 1:250){
  tmp_y <- generate_draw(m3)
  lines(density(tmp_y),
        col='green',
        lwd=0.1
  )
}#Add line for model prediction
lines(density(db@data$logpr), 
      col='red',
      main='')

#Add legend
legend('topleft', 
       c('Predicted', 'Actual', 'Simulated (n=250)'),
       col=c('black', 'red', 'green'),
       lwd=1)
#Add title
title(main="Predictive check - Logarithmic FE Model")


Figure 7 Predictive check - Logarithmic FE Model

Fixed Effects - Spatial Regime

The introduction of fixed effects in the form of short postcodes allows each constant term to vary by postcode and essentially attempts to isolate the comparison between neighbourhood characteristics. The constant terms of Model 3 are all statistically significant which implies that some of the unobserved effects previously absorbed in the error term align spatially within postcodes. Despite looking fairly similar, the exponential values of the postcode coefficients vary greatly between the smallest L28 (98,449) and largest L18 (274,366), which might be expected given that a lot of the most expensive properties sold were in L18. Introducing fixed effects has decreased the percentage impact of a house being new increasing price (45.9%) and for one unit increase in the deprivation index there is still around a 2% decrease in house price.

#Spatial Regimes
#Model 4 focuses on IMD to allow a seperate constant term and estimate on house price for every postcode

# Create a new variable `newB` to store the binary form of `new`
db@data$newB <- 1
db[db@data$new=='N', 'newB'] <- 0

#Set constant to allow variation of constant
db$one <- 1

#Display table to show NA
table(db$pc, db$new)
##      
##        N  Y
##   L10  4  0
##   L11 17  8
##   L12 34  2
##   L13 33  0
##   L14  9  0
##   L15 37  0
##   L16  8  0
##   L17 37  2
##   L18 41  3
##   L19 26  1
##   L20  1  0
##   L24 13  0
##   L25 31  0
##   L27  4  0
##   L28  1  0

Table 1 Newly Built Houses by Postcode

#Spatial Regimes Model
m4 <- lm('log(price) ~ 0 + (one + imd_score):pc', db)

#Call model summary
summary(m4)
## 
## Call:
## lm(formula = "log(price) ~ 0 + (one + imd_score):pc", data = db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5146 -0.3159 -0.0234  0.2408  3.5012 
## 
## Coefficients: (2 not defined because of singularities)
##                  Estimate Std. Error t value Pr(>|t|)    
## one:pcL10       12.045818  15.303892   0.787 0.431875    
## one:pcL11       10.471641   0.939362  11.148  < 2e-16 ***
## one:pcL12       12.089118   0.234966  51.450  < 2e-16 ***
## one:pcL13       12.002095   0.482152  24.893  < 2e-16 ***
## one:pcL14       12.438533   0.748224  16.624  < 2e-16 ***
## one:pcL15       12.116095   0.232220  52.175  < 2e-16 ***
## one:pcL16       12.271297   0.692667  17.716  < 2e-16 ***
## one:pcL17       12.247727   0.243354  50.329  < 2e-16 ***
## one:pcL18       12.500780   0.255626  48.903  < 2e-16 ***
## one:pcL19       12.835956   0.289501  44.338  < 2e-16 ***
## one:pcL20       11.277203   0.567562  19.870  < 2e-16 ***
## one:pcL24       11.427994   0.697553  16.383  < 2e-16 ***
## one:pcL25       13.139211   0.233243  56.333  < 2e-16 ***
## one:pcL27        9.968490   1.279735   7.789 1.27e-13 ***
## one:pcL28       10.714418   0.567562  18.878  < 2e-16 ***
## imd_score:pcL10 -0.009967   0.362139  -0.028 0.978061    
## imd_score:pcL11  0.018098   0.016814   1.076 0.282677    
## imd_score:pcL12 -0.007777   0.010100  -0.770 0.441936    
## imd_score:pcL13 -0.011847   0.009993  -1.186 0.236790    
## imd_score:pcL14 -0.020826   0.014712  -1.416 0.158004    
## imd_score:pcL15 -0.017912   0.006153  -2.911 0.003889 ** 
## imd_score:pcL16 -0.010185   0.048418  -0.210 0.833536    
## imd_score:pcL17 -0.006173   0.008275  -0.746 0.456274    
## imd_score:pcL18 -0.012939   0.017546  -0.737 0.461459    
## imd_score:pcL19 -0.033867   0.009118  -3.714 0.000245 ***
## imd_score:pcL20        NA         NA      NA       NA    
## imd_score:pcL24 -0.003459   0.013045  -0.265 0.791091    
## imd_score:pcL25 -0.042013   0.009821  -4.278 2.58e-05 ***
## imd_score:pcL27  0.021412   0.023534   0.910 0.363670    
## imd_score:pcL28        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5676 on 284 degrees of freedom
## Multiple R-squared:  0.9979, Adjusted R-squared:  0.9977 
## F-statistic:  4846 on 28 and 284 DF,  p-value: < 2.2e-16


Model 4 Predictive check - Logarithmic FE Model - Spatial Regime

Figure 8 shows that the prediction has an increased density for lower log price values compared to other models, before a trough and then a steep increase to the peak density around log 12 like the other logarithmic models. The trough in the prediction means Model 4 is under predicting when the actual reaches its peak. Additional patterns of under dispersion can be seen either side of the distribution respective to the actual which is linked to the inability of the model to remove all of the inferential and predictive error despite being most accurate in terms of Adjusted R-squared (99.8%).

#Plot realisation of the model given and actual expected value
plot(density(m4$fitted.values), 
     xlim=c(8, max(db$logpr)),
     main='') 
#Add line for model prediction
lines(density(db@data$logpr), 
      col='red',
      main='')

#Add legend
legend('topleft', 
       c('Predicted', 'Actual'),
       col=c('black', 'red'),
       lwd=1)
#Add title
title(main="Predictive check - Logarithmic FE Model - Spatial Regime")


Figure 8 Predictive check - Logarithmic FE Model - Spatial Regime

Plotting coefficients

The spatial distribution of IMD coefficients by postcode can be seen in Figure 9. Apart from L27 to the East it could be argued that there is almost a pattern of increasing coefficient positivity in the south and increasingly negative coefficients to the North. At the postcode aggregation it could be argued the effect of IMD on house price is heterogeneous, albeit not sporadically heterogeneous which agrees with current literature (Vesselinov et al., 2007). L25 has the highest negative percentage decrease in price for each unit of IMD and contains some of the highest house prices which implies that in this case high house prices are most affected by IMD. This might be expected given that house buyers may consider an area of low deprivation worthy of a higher house price in association with perceived quality of living. Consideration of spatial lag will help to test this theory.

#Plot coefficients 

#Extract coeffieicents and convert to df
temp <- data.frame(summary(m4)$coefficients)

#Write as csv to index table when read back in
write.csv(temp, "temp.csv")
temp <- read.csv('temp.csv')

#Create column that contains first part of postcode to match to db
temp$pc <- as.character(lapply(strsplit(as.character(temp$X), split="imd_score:pc"), "[", 2))

#Remove NA values created when R cannot split the string of the first 15 values
#Table created with coeff for each postcode to show the effect of IMD vary over space
temp[complete.cases(temp),]
##                  X     Estimate  Std..Error     t.value     Pr...t..  pc
## 16 imd_score:pcL10 -0.009967480 0.362138588 -0.02752394 9.780612e-01 L10
## 17 imd_score:pcL11  0.018097653 0.016813646  1.07636691 2.826767e-01 L11
## 18 imd_score:pcL12 -0.007776935 0.010099829 -0.77000653 4.419360e-01 L12
## 19 imd_score:pcL13 -0.011846756 0.009992602 -1.18555258 2.367900e-01 L13
## 20 imd_score:pcL14 -0.020825949 0.014712311 -1.41554576 1.580042e-01 L14
## 21 imd_score:pcL15 -0.017912047 0.006153296 -2.91096805 3.888575e-03 L15
## 22 imd_score:pcL16 -0.010185326 0.048418093 -0.21036198 8.335360e-01 L16
## 23 imd_score:pcL17 -0.006173363 0.008275088 -0.74601773 4.562737e-01 L17
## 24 imd_score:pcL18 -0.012939394 0.017546196 -0.73744723 4.614593e-01 L18
## 25 imd_score:pcL19 -0.033867231 0.009117936 -3.71435282 2.452442e-04 L19
## 26 imd_score:pcL24 -0.003458827 0.013045108 -0.26514360 7.910912e-01 L24
## 27 imd_score:pcL25 -0.042012702 0.009820642 -4.27799960 2.579834e-05 L25
## 28 imd_score:pcL27  0.021412262 0.023533689  0.90985575 3.636701e-01 L27
#Convert temp to df
temp <- data.frame(temp)

#Create column that contains first part of postcode using string split and lapply to form a list the same length as the original column
db_wgs84$pc <- as.character(lapply(strsplit(as.character(db_wgs84$pcds), split=" "), "[", 1))


#Match to values to each point within shapefile
db_wgs84@data <- data.frame(db_wgs84@data, 
                            temp[match(db_wgs84@data[, "pc"],
                                       temp[, "pc"]), ])


#Create df 
df <- db_wgs84@data

# Create a palette that maps factor levels to colors
pal <- colorBin("RdYlGn", df$Estimate)

#Define Leaflet plot
leaflet(df) %>% addTiles() %>% 
  #Add cartoDB base map
  addProviderTiles("CartoDB.Positron") %>%
  #Add points as circles
  addCircleMarkers(
    #Choose radius
    radius = 5,
    #Select estimate as colour
    color = ~pal(Estimate),
    #Remove point outlines
    stroke = FALSE, 
    #Define opacity
    fillOpacity = 0.5,
    #Add popup to display postcode
    popup = ~pc)%>%
  #Add legend
  addLegend(pal = pal, values = exp(df$Estimate),
            title= "Coefficent <br> of IMD")
## Assuming 'lon' and 'lat' are longitude and latitude, respectively


Figure 9 Spatial Distribution of IMD Coefficients

Spatial Lag

Spatial lag can be defined as the average house price as a result of the combination of explanatory variables from a given neighbourhood. In this example the square root (18) of the number of observations (320) has been used as the number of neighbours which is convention for the sample size, although cross-validation might be considered for future study (Duda et al., 2012). Model 5 shows that the reintroduction of newly built in to the regression might influence house price by around 56.5% which is higher than any other model in this study. For every unit increase in IMD house price might expect a 1.9% decrease which is similar to the other models. In terms of the new coefficient for spatial lag which is statistically insignificant, there is evidence to suggest that the IMD of the surrounding houses has no conclusive effect on a given house price. Further study might look to test whether aggregation of IMD at the LSOA level rather than the household level would help spatial lag to have a more significant effect on the model.

#Spatial lag

#K-nearest neighbours where k =  150
# Because some rows are different units on the same house, slightly
# jitter the locations to break ties
xy.jit <- jitter(db@coords)

#Calculate the square root of n
sqrt(320)
## [1] 17.88854
# Create knn list of each house
hnn <- knearneigh(xy.jit, k=18)

# Create nb object
hnb <- knn2nb(hnn)

# Create spatial weights matrix (note it row-standardizes by default)
hknn <- nb2listw(hnb)

#Compute the spatial lag of imd_score
db@data$w_imd_score <- lag.listw(hknn, db@data$imd_score)
#Spatial lag model
m5 <- lm('logpr ~ new + imd_score + w_imd_score', db)

#Call model summary
summary(m5)
## 
## Call:
## lm(formula = "logpr ~ new + imd_score + w_imd_score", data = db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4349 -0.2882 -0.0048  0.2526  3.9080 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.547388   0.089996 139.422  < 2e-16 ***
## newY         0.448107   0.150110   2.985  0.00306 ** 
## imd_score   -0.019371   0.003037  -6.378 6.57e-10 ***
## w_imd_score -0.004204   0.004282  -0.982  0.32699    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5807 on 308 degrees of freedom
## Multiple R-squared:  0.2997, Adjusted R-squared:  0.2929 
## F-statistic: 43.94 on 3 and 308 DF,  p-value: < 2.2e-16


Model 5 Predictive check - Logarithmic FE Model - Spatial lag

Figure 10 displays the predictive checking of Model 5 and is virtually identical to Model 2. With consideration of adjusted R-squared (30%), there is strong evidence to support the use of a spatial fixed effect in the model for this data using the particular method of spatial lag.

#Plot realisation of the model given and actual expected value
plot(density(m5$fitted.values), 
     xlim=c(8, max(db$logpr)),
     main='') 
# Loop for realizations
for(i in 1:250){
  tmp_y <- generate_draw(m5)
  lines(density(tmp_y),
        col='green',
        lwd=0.1
  )
}#Add line for model prediction
lines(density(db@data$logpr), 
      col='red',
      main='')

#Add legend
legend('topleft', 
       c('Predicted', 'Actual', 'Simulated (n=250)'),
       col=c('black', 'red', 'green'),
       lwd=1)
#Add title
title(main="Predictive check - Logarithmic FE Model - Spatial lag")


Figure 10 Predictive check - Logarithmic FE Model - Spatial lag

RMSE

RMSE can be defined as a predictive performance tool that measures the average distance between the predicted vector and actual distribution. The difference in units between the baseline model and the logarithmic models means that they are incomparable (without rescaling which is beyond this assignment) (Table 2). The addition of fixed effects to the logarithmic baseline model helps to decrease the predictive error by removing some uncertainty in the variability in price. However there is also an addition of inferential uncertainty gained around the error margins defined by the residual standard deviation of the coefficient. When the spatial regime is conducted it can be argued that by isolating each postcode the predictive uncertainty further decreases. This is because the model is analysing at a more specific aggregation which consequently decreases the individual inferential uncertainty by reducing the coefficient error term. As mentioned previously the addition of spatial lag instead of postcodes greatly reduces the models accuracy reflected in the R-squared value and the increase in RMSE which confirms the worsened model fit. For additional consideration the RMSE of the LSOA-level as a higher aggregation of fixed effects has been included and suggests that the model becomes more accurate which might be expected given that LSOA-level boundaries are specifically designed to have similar characteristics.

#RMSE

#Define RMSE function
rmse <- function(t, p){
  se <- (t - p)^2
  mse <- mean(se)
  rmse <- sqrt(mse)
  return(rmse)
}

#Create dataframe of model results
rmses <- data.frame(model=c('OLS - Baseline', 'Logarithmic - Baseline', 'Logarithmic + FE(PC)', 
                            'Logarithmic + FE(LSOA)', 'Logarithmic + Spatial Regime', 'Logarithmic + Spatial Lag'
                            
),
#Apply function to each set of variables
RMSE=c(rmse(db$price, 
            m1$fitted.values),
       rmse(db$logpr, 
            m2$fitted.values),
       rmse(db$logpr, 
            m3$fitted.values),
       rmse(db$logpr, 
            m3b$fitted.values),
       rmse(db$logpr, 
            m4$fitted.values),
       rmse(db$logpr, 
            m5$fitted.values)
)
)

#Set value format for easier comparison
options(scipen = 999)

#Call results
rmses
##                          model           RMSE
## 1               OLS - Baseline 790026.4863505
## 2       Logarithmic - Baseline      0.5778338
## 3         Logarithmic + FE(PC)      0.5557589
## 4       Logarithmic + FE(LSOA)      0.4088923
## 5 Logarithmic + Spatial Regime      0.5414959
## 6    Logarithmic + Spatial Lag      0.5769318


Table 2 Root Mean Squared Error

Comparison of Results

Comparison of results in class cannot be wholly compared like the trip data as the points vary month to month because very few houses resell frequently. The reduction of sample size and sample area means that the distribution of house prices is also relatively incomparable, although there is evidence to suggest that the removal of city centre postcodes has been conducted to better demonstrate the effect of price propensity that isn’t skewed by population. Evidence of this can be seen within Figure 5, where the hotspots in class are most noticeable in the North and more densely populated centre of Liverpool. The distribution of the data for this assignment (especially the extremely high value) has caused a large amount of overestimation on the effect of new builds in all models apart from when spatial regimes are used, in comparison to the class data. Additionally the skew within the data has led to greater statistical significance for a number of variables in comparison to the class data. The RMSE is not directly compatible although it is noticeable that the addition of fixed effects within a spatial regime provides the highest decrease in RMSE which might be expected.

Further study

  • Remove the large extreme value at the tail end to help normalise the distribution at the risk of increasing researcher bias.

  • The inability of logarithms to normalise the distribution suggests that using a Poisson distribution may be more useful than a Gaussian one. In fact some intermediate analysis finds that a Poisson distribution of improves the model accuracy (RMSE) for a number of models.

Appendix

#Create mini df to create price
db_mini <- data.frame("ID" = db$id,
                      "Price" = db$price,
                      "Postcode" = db$pc,
                      "Description" = db$paon,
                      "Locality" = db$locality)

# Order table by descending price
db_mini <- db_mini[with(db_mini, order(-db_mini$Price)), ]
head(db_mini, 30)
##                                         ID    Price Postcode
## 308 {2D1E4B27-1F71-FCD1-E050-A8C0630544EC} 10076828      L25
## 309 {2D1E4B27-1F72-FCD1-E050-A8C0630544EC} 10076828      L25
## 15  {2D1E4B26-8D24-FCD1-E050-A8C0630544EC}   825000      L17
## 33  {2D1E4B26-8E2A-FCD1-E050-A8C0630544EC}   775000      L12
## 306 {2D1E4B27-1F5E-FCD1-E050-A8C0630544EC}   700000      L18
## 126 {2D1E4B26-8C0C-FCD1-E050-A8C0630544EC}   623500      L19
## 190 {2D1E4B26-8CDA-FCD1-E050-A8C0630544EC}   600000      L25
## 213 {2D1E4B26-8D31-FCD1-E050-A8C0630544EC}   595000      L19
## 10  {2D1E4B26-8C50-FCD1-E050-A8C0630544EC}   525000      L17
## 244 {2AC10E4F-C395-1AF6-E050-A8C063052BA1}   525000      L17
## 23  {2D1E4B26-8DAC-FCD1-E050-A8C0630544EC}   520000      L18
## 107 {2D1E4B26-8F09-FCD1-E050-A8C0630544EC}   470746      L12
## 133 {2D1E4B26-8C1C-FCD1-E050-A8C0630544EC}   450000      L17
## 268 {2D1E4B26-7167-FCD1-E050-A8C0630544EC}   427500      L18
## 50  {2D1E4B26-8D68-FCD1-E050-A8C0630544EC}   425000      L17
## 272 {2D1E4B26-71AA-FCD1-E050-A8C0630544EC}   425000      L18
## 37  {2D1E4B26-8E44-FCD1-E050-A8C0630544EC}   424995      L18
## 39  {2D1E4B26-8E49-FCD1-E050-A8C0630544EC}   414995      L18
## 72  {2D1E4B26-8DD2-FCD1-E050-A8C0630544EC}   392500      L18
## 145 {2D1E4B26-8F8E-FCD1-E050-A8C0630544EC}   375000      L18
## 127 {2D1E4B26-8F5F-FCD1-E050-A8C0630544EC}   355000      L17
## 24  {2D1E4B26-8DB8-FCD1-E050-A8C0630544EC}   350000      L18
## 32  {2D1E4B26-8E1B-FCD1-E050-A8C0630544EC}   342496      L18
## 234 {2D1E4B27-1F81-FCD1-E050-A8C0630544EC}   340000      L18
## 174 {2D1E4B26-8CA8-FCD1-E050-A8C0630544EC}   335000      L17
## 155 {2D1E4B26-8EC1-FCD1-E050-A8C0630544EC}   329995      L12
## 122 {2D1E4B26-8C03-FCD1-E050-A8C0630544EC}   325000      L18
## 13  {2D1E4B26-8CBE-FCD1-E050-A8C0630544EC}   320000      L18
## 253 {23B6165E-925B-FCF4-E050-A8C0620577FA}   320000      L18
## 25  {2D1E4B26-8DBA-FCD1-E050-A8C0630544EC}   310000      L12
##                                 Description       Locality
## 308 GATEACRE COMMUNITY COMPREHENSIVE SCHOOL       GATEACRE
## 309 GATEACRE COMMUNITY COMPREHENSIVE SCHOOL       GATEACRE
## 15                                       23           <NA>
## 33                      OSBOURNE HOUSE, 33B SANDFIELD PARK
## 306                                       1       ALLERTON
## 126                                       5    CRESSINGTON
## 190                                     389        WOOLTON
## 213                                      33           <NA>
## 10                                        8           <NA>
## 244                                       8           <NA>
## 23                                       4A           <NA>
## 107                                      49           <NA>
## 133                                      25           <NA>
## 268                                       4           <NA>
## 50                                        8       AIGBURTH
## 272                                     149           <NA>
## 37                                        5           <NA>
## 39                                        4           <NA>
## 72                                       29           <NA>
## 145                                     114           <NA>
## 127                                       6           <NA>
## 24                                        9           <NA>
## 32                                        6           <NA>
## 234                                       2   MOSSLEY HILL
## 174                         HAZELMERE HOUSE           <NA>
## 155                                      14           <NA>
## 122                                       4           <NA>
## 13                                       24           <NA>
## 253                                      24           <NA>
## 25                                        8           <NA>


Table 3 Data Sorted by Descending Price