In this assignment a subset of housing transaction data are taken from the land registry in order to interpret the spatial heterogeneity of price, as well as to critique modelling techniques for predicting house price values.
#Set Working Directory
setwd('C:/Users/Christian/Documents/Masters/Spatial_Analysis/A_2/assignment2/')
#Load packages
# Layout
library(tufte)
# For pretty table
library(knitr)
# Spatial Data management
library(rgdal)
# Pretty graphics
library(ggplot2)
# Thematic maps
library(tmap)
# Pretty maps
library(ggmap)
# Various GIS utilities
library(GISTools)
# For all your interpolation needs
library(gstat)
# For data manipulation
library(plyr)
# Spatial regression
library(spdep)
# Simulation methods
library(arm)
#Interactive Maps
library(leaflet)
#Plot Spatial polygons
library(sp)
#Smooth Kernels
library(KernSmooth)
#Allow creation of data tables
library(data.table)
#Allows conversion to raster to plot over leaflet
library(raster)
#Read in Shapefile containg data
db <- readOGR(dsn = '.', layer = 'feb16')
## OGR data source with driver: ESRI Shapefile
## Source: ".", layer: "feb16"
## with 312 features
## It has 31 fields
## Integer64 fields read as strings: price imd_rank
#Convert price to numeric
db@data$price <- as.numeric(as.character((db@data$price)))
# Format dates
dts <- as.Date(db@data$trans_date)
Histogram
The most prominent inequality among urban areas is residential segregation as a result of a number of factors which combine to influence the heterogeneity of house prices (Vesselinov et al., 2007). Such inequality is highlighted when the distribution of house prices is examined through a histogram. The distribution forms a shape that resembles a power law distribution (Goldstein, 2004), whereby the main concentration of values is situated to the lesser end with a handful that are much larger towards the tail of the distribution (Figure 1).
#Histogram of Distribution
#Create the histogram
hist <- qplot(data=db@data,x=price)
#Call plot
hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 1 Distribution of House Prices
One-dimensional KDE
Logarithms of values have been taken to help normalise the distribution (Figure 2), although there are still some outlieing values towards the tail end caused by a school that sold for ÂŁ10,076,828 (over ten times more than the next highest price - ÂŁ825,000) (Table 3). Kernel Density Estimation (KDE) is a non-parametric method of probability density estimation (red line in figure 2), essentially created using a contiguous smoothing approach through histogram binning. The use of 50 bins sufficiently demonstrates the value of a KDE as it mirrors the now more Gaussian distribution of the histogram, albeit now representing probability rather than counts (Figure 2).
#One-dimensional KDE and logarithmic histogram
#Use logarithmic because it is not a normal distribution
# Create log of house prices
logpr <- log(as.numeric(db@data$price))
#Add log price to data frame
db@data['logpr'] <- logpr
#Create the base
base <- ggplot(db@data, aes(x=logpr))
#Histogram
hist <- base +
geom_histogram(bins=50, aes(y=..density..))
#Overlay density plot
kde <- hist +
geom_density(fill="#FF6666", alpha=0.5, colour="#FF6666")
#Call plot
kde
Figure 2 Logarithmic distribution of house prices and One-dimensional KDE
Two-dimensional KDE
While a one-dimensional KDE can display the value of the technique to match a distribution, the importance of space in geography implies that a two-dimensional KDE may be of greater value. Figure 3 displays areas of higher transaction propensity with increased shading. The distribution of the density is mostly situated within a large cluster (hot spot), combining two smaller areas of sub-clustering with decreasing density contouring outwards from each hot spot.
While there may a number of reasons why hot spots occur, the lack of reference through other data prevents any definite inference to explain which such hotspots exist. Although personal knowledge of the local area provides evidence to suggest that the high propensity of transactions are not a result of high population and therefore demand for housing (see McMillen, 2003). One explanation for the high transaction density hotspots within the predominantly student area of Wavertree is a high turnover of property within the lucrative student housing industry. Further southeast around the more affluent Allerton, a higher demand for housing for young professionals moving in and elderly moving out might also be a reason to explain the main hotspot within Figure 3.
#Two-dimensional (spatial) KDE
# Reproject coordinates
wgs84 <- CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
db_wgs84 <- spTransform(db, wgs84)
db_wgs84@data['lon'] <- db_wgs84@coords[, 1]
db_wgs84@data['lat'] <- db_wgs84@coords[, 2]
#Convert dataframe to data table
dat <- data.table(db_wgs84@data)
#Make Contour lines
#Use KernSmooth to calculate the KDE plot, set bandwidth and grid size
kde <- bkde2D(dat[ , list(lon, lat)],
bandwidth=c(.0045, .0068), gridsize = c(100,100))
#Create contour lines
CL <- contourLines(kde$x1 , kde$x2 , kde$fhat)
## Extract contour line levels
#Define levels using wrapper (sapply)
LEVS <- as.factor(sapply(CL, `[[`, "level"))
#Define number of levels within KDE based on data
NLEV <- length(levels(LEVS))
## Convert contour lines to polygons
pgons <- lapply(1:length(CL), function(i)
Polygons(list(Polygon(cbind(CL[[i]]$x, CL[[i]]$y))), ID=i))
spgons = SpatialPolygons(pgons)
#Convert levels to numeric for plotting
LEVS1 <- as.numeric(as.character(LEVS))
#Define colour palette
pal <- colorNumeric(
palette = "YlOrRd",
domain = LEVS1
)
#Create Map
#Stroke= False to prevent polygon borders
leaflet(spgons) %>% addTiles() %>%
#Add base map from CartoDB
addProviderTiles("CartoDB.Positron") %>%
#Overlay polygons
addPolygons(color = pal(LEVS1) , stroke = FALSE, fillOpacity = 0.3)%>%
#Add legend
addLegend("topright", pal=pal, values= LEVS1,
title= "Propensity <br> of
House <br>Transactions")
To interpret the variation of a given value in space a group of techniques known as spatial interpolation must be used because KDE only represents a surface of propensities. More specifically Inverse Distance Weighting will be used. In simple terms IDW attempts to estimate property price based on the characteristics of the surrounding neighbours of a given property whereby closer neighbours have higher weighting (Brunsdon &Comber, 2015).
The predominant hotspot of high IDW is to the south east of Liverpool (Figure 4). Further inspection of the most expensive houses finds that the most expensive is within L25, additionally L17, L18 and L12 in the south east of Liverpool contain the top 30 most expensive properties (Table 3). IDW is highest when a house is surrounded by consistently similar house prices which may explain why the weighting is so high in the south east where the house prices are more uniformly high. Figure 4 supports some evidence of house price heterogeneity in Liverpool because the hotspot of IDW is surrounded by several cold spots implying that the hotspot is not reflective of the area in general which reflects house price inequalities (Vesselinov et al., 2007). However the majority of Liverpool has a homogenous house price distribution, with consideration that postcodes that aren’t in the data have taken the lowest IDW value to fill the clipped raster.
#Inverse Distance Weights
#Set up grid
liv.grid <- spsample(db, type='regular', n=25000)
#Create IDW
idw.hp <- idw(price ~ 1, locations=db, newdata=liv.grid)
## [inverse distance weighted interpolation]
# Load Liverpool outline for clipping raster file
liv.otl <- readOGR(dsn = '.', layer ='liv_outline')
## Warning in ogrInfo(dsn = dsn, layer = layer, encoding = encoding, use_iconv
## = use_iconv, : ogrInfo: ./liv_outline.dbf not found
## OGR data source with driver: ESRI Shapefile
## Source: ".", layer: "liv_outline"
## with 1 features
## It has 0 fields
# #Surface Plot
#Set up Coordinates using logarithm because of large house values causing skew
xyz <- data.frame(x=coordinates(idw.hp)[, 1],
y=coordinates(idw.hp)[, 2],
z=idw.hp$var1.pred)
#Set up log coordinates
xyz['lz'] <- log(xyz$z)
#Create copy of coordinates file
xyz2 <- xyz
#Remove unnecessary column
xyz2$z <- NULL
#Create spatial polygon
spg <- xyz2
#Add coordinates to spatial polgon
coordinates(spg) <- ~ x + y
# coerce to SpatialPixelsDataFrame
gridded(spg) <- TRUE
# coerce to raster
rasterDF <- raster(spg)
#Set CRS of raster file for plotting
crs(rasterDF) <- CRS("+init=epsg:27700")
#Clip Raster to Liverpool outline for neatness
clip_rast <- mask(rasterDF, liv.otl)
#Define Colour Palette
pal <- colorNumeric("OrRd", values(clip_rast), na.color = "transparent")
#Create Leaflet Map
leaflet() %>% addTiles() %>%
#Add base map from CartoDB
addProviderTiles("CartoDB.Positron") %>%
#Add raster tiles
addRasterImage(clip_rast, colors = pal, opacity=0.5) %>%
#Add legend
addLegend(pal = pal, values = values(clip_rast),
title = "Inverse <br> Distance <br> Weight")
Regression develops bivariate correlation to a multivariate level. Each variable has a coefficient which reflects the degree of correlation when the other variables remain constant. With consideration of this, IMD and whether a property is newly built have been used as explanatory variables when trying to decipher the spatial patterns of house prices in Liverpool. Control variables are used to explain the percentage of correlation between IMD and price that new built houses are responsible for (confounding factors), given that new builds are often built in areas with low deprivation which a bivariate correlation would not account for. Adjusted R-squared is used to measure model fit to prevent overestimation caused by autocorrelation.
#Spatial Econonmetrics
#Read in data
hst <- readOGR(dsn = '.', layer = 'feb16')
## OGR data source with driver: ESRI Shapefile
## Source: ".", layer: "feb16"
## with 312 features
## It has 31 fields
## Integer64 fields read as strings: price imd_rank
#Format dates
dts <- as.Date(hst@data$trans_date)
#Read in IMD to cross compare continuous variables within spatial setting
imd <- read.csv('E08000012.csv')
#Merge files to match deprivation with house transaction data by location
db <- merge(hst, imd)
## Warning in `[<-.factor`(`*tmp*`, ri, value = structure(c(68L, 68L, 15L, :
## invalid factor level, NA generated
#Convert price to numeric
db@data$price <- as.numeric(as.character(db@data$price))
#Create column that contains first part of postcode using string split and lapply to form a list the same length as the original column
db$pc <- as.character(lapply(strsplit(as.character(db$pcds), split=" "), "[", 1))
#Create column that contains first part of postcode using string split and lapply to form a list the same length as the original column
db$pc <- as.character(lapply(strsplit(as.character(db$pcds), split=" "), "[", 1))
#Gaussian function to select random values
#Create Function to draw possible random model values
generate_draw <- function(m){
# Set up predictors matrix
x <- model.matrix(m)
# Obtain draws of parameters (inferential uncertainty)
sim_bs <- sim(m, 1)
# Predicted value
mu <- x %*% sim_bs@coef[1, ]
# Draw
n <- length(mu)
y_hat <- rnorm(n, mu, sim_bs@sigma[1])
return(y_hat)
}
Baseline non-spatial regression
A baseline non-spatial regression is conducted in Model 1 with each coefficient representing the correlation between the respective explanatory variable and price with confounding factors accounted for. Each model in this study considers the effect of confounding factors, however each model can only account for explanatory variables it includes. The results of Model 1 show that houses tend to be more expensive in areas with lower deprivation (an average of ÂŁ7,336 for each additional unit increase in IMD) which might be expected. A new build might expect an even greater increase in price (an average of ÂŁ38,076) which despite being statistically insignificant might be explained by the highest house price (which causes the data skew in figure 1 and 2) being a new build. The standard error of the new build coefficient reflects this (204,895) which may lead to inaccuracy of the simulation (Gelman & Hill, 2006 p158) in Figure 5. Further error is likely as the coefficient is statistically insignificant and the residual standard error is also very high. The adjusted R-squared value (0.018) suggests that 1.8% of the variation in price is captured by this model.
#Gaussian Distribution Baseline Model
#Basic null linear regreassion model
m1 <- lm('price ~ new + imd_score', db)
#Call model summary
summary(m1)
##
## Call:
## lm(formula = "price ~ new + imd_score", data = db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -328798 -156445 -68113 25188 9711922
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 454766 93779 4.849 1.97e-06 ***
## newY 38076 204895 0.186 0.85270
## imd_score -7336 2622 -2.797 0.00548 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 793900 on 309 degrees of freedom
## Multiple R-squared: 0.02473, Adjusted R-squared: 0.01842
## F-statistic: 3.918 on 2 and 309 DF, p-value: 0.02088
Model 1 Baseline non-spatial regression
The predictive checking graphs throughout this study display a predicted single estimate of the distribution of the (log of) price, as well as the actual distribution and 250 simulations based of the respective model. The predicted line in Figure 5 demonstrates that the prediction is skewed towards the tail of the distribution which is likely to have occurred because the distribution is not perfectly normal (Gelman and Hill, 2006). Moreover the prediction is not accounting for all of the predictive error captured by the model. The simulation most likely underestimates because the residual error boundaries of the model as they are so large for each coefficient as mentioned. The simulation relies on a normal distribution which explains why there is over dispersion and may provide evidence for redistribution of the data.
#Plot predictive checking graph
#Plot realisation of the model given and actual expected value
#There is no need to set the print margin when the tallest line is plotted first
plot(density(db$price),
col='red',
main='')
#Add line for model prediction
lines(density(m1$fitted.values),
col='black',
main='')
# Loop for realizations
for(i in 1:250){
tmp_y <- generate_draw(m1)
lines(density(tmp_y),
col='green',
lwd=0.1
)
}
#Add legend
legend('topright',
c('Predicted', 'Actual', 'Simulated(n=250)'),
col=c('black', 'red', 'green'),
lwd=1)
#Add title
title(main="Predictive Check - Baseline Non-Spatial Model")
Figure 5 Predictive Check - Baseline non-spatial regression
Logarithmic non-spatial regression
As mentioned previously using logarithmic values of price causes a more Gaussian (normal) distribution. Natural logarithms have been chosen because they are directly interpretable as the coefficient being a percentage difference in price (as opposed to log 10). Taking exponentials of the coefficient values is essential to properly interpret the coefficients, given that the model is in relation to log (price) (Gelman and Hill, 2006 p60-61). In Model 2 the coefficient for IMD represents that for each unit increase in IMD score there is a 2.2% decrease in house price. The variable new is now significant (to the 99% significance level) and the coefficient implies that a new house might expect a 55.2% increase in price in comparison to an old house. The change in distribution has led to an increased ability for the model to capture the variation in house prices (adjusted R-squared = 29.3%) which provides some evidence to support the use of logarithmic values.
#Guassian Logarithmic Baseline
#Define Model
m2 <- lm('logpr ~ new + imd_score', db)
#Call model summary
summary(m2)
##
## Call:
## lm(formula = "logpr ~ new + imd_score", data = db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4295 -0.2809 -0.0120 0.2653 3.9012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.490193 0.068591 182.097 < 2e-16 ***
## newY 0.439804 0.149863 2.935 0.00359 **
## imd_score -0.021683 0.001918 -11.304 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5806 on 309 degrees of freedom
## Multiple R-squared: 0.2975, Adjusted R-squared: 0.293
## F-statistic: 65.43 on 2 and 309 DF, p-value: < 2.2e-16
Model 2 Predictive check - Logarithmic Non-Spatial Model
The predictive checking (Figure 6) demonstrates that the prediction is still overestimating towards the right causing a skewed distribution as well as under dispersion of values which may imply autocorrelation between the explanatory variables (Zhu, 2012). While there is still some over dispersion of the simulations, the 250 runs are now more accurate which reflects a decrease in inferential error when the coefficient accuracy improves. The values of the actual model now increase rapidly at log 11 before flattening off slightly to the peak just before 12 which explains where the main body of house prices are distributed. However there is still a skew to the right centre, as a result of higher house prices, including the most expensive which is now found at log 16.
#create logpr because R removes it for some reason
# Create log of house prices
logpr <- log(as.numeric(db@data$price))
#Add log price to data frame
db@data['logpr'] <- logpr
#Plot realisation of the model given and actual expected value
plot(density(m2$fitted.values),
xlim=c(8, max(db$logpr)),
main='')
# Loop for realizations
for(i in 1:250){
tmp_y <- generate_draw(m2)
lines(density(tmp_y),
col='green',
lwd=0.1
)
}#Add line for model prediction
lines(density(db@data$logpr),
col='red',
main='')
#Add legend
legend('topleft',
c('Predicted', 'Actual', 'Simulated (n=250)'),
col=c('black', 'red', 'green'),
lwd=1)
#Add title
title(main="Predictive check - Logarithmic Non-Spatial Model")
Figure 6 Predictive check - Logarithmic Non-Spatial Model
Fixed Effects
The introduction of fixed effects in the form of short postcodes allows each constant term to vary by postcode and essentially attempts to isolate the comparison between neighbourhood characteristics. The constant terms of Model 3 are all statistically significant which implies that some of the unobserved effects previously absorbed in the error term align spatially within postcodes. Despite looking fairly similar, the exponential values of the postcode coefficients vary greatly between the smallest L28 (98,449) and largest L18 (274,366), which might be expected given that a lot of the most expensive properties sold were in L18. Introducing fixed effects has decreased the percentage impact of a house being new increasing price (45.9%) and for one unit increase in the deprivation index there is still around a 2% decrease in house price.
#Gaussian Logarithmic Fixed Effects
#Fixed Effects model
m3 <- lm('log(price) ~ pc + new + imd_score - 1', db)
#Call model summary
summary(m3)
##
## Call:
## lm(formula = "log(price) ~ pc + new + imd_score - 1", data = db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3723 -0.2891 -0.0187 0.2266 3.7348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## pcL10 12.316686 0.315119 39.086 < 2e-16 ***
## pcL11 12.262708 0.213379 57.469 < 2e-16 ***
## pcL12 12.251291 0.116612 105.061 < 2e-16 ***
## pcL13 12.216101 0.178686 68.366 < 2e-16 ***
## pcL14 12.219684 0.245377 49.800 < 2e-16 ***
## pcL15 12.063087 0.143614 83.996 < 2e-16 ***
## pcL16 12.356093 0.206604 59.806 < 2e-16 ***
## pcL17 12.506734 0.125515 99.643 < 2e-16 ***
## pcL18 12.522219 0.096834 129.317 < 2e-16 ***
## pcL19 12.307708 0.143676 85.663 < 2e-16 ***
## pcL20 12.448735 0.614169 20.269 < 2e-16 ***
## pcL24 12.101003 0.227885 53.101 < 2e-16 ***
## pcL25 12.591609 0.122658 102.656 < 2e-16 ***
## pcL27 11.972328 0.330814 36.191 < 2e-16 ***
## pcL28 11.497295 0.590961 19.455 < 2e-16 ***
## newY 0.377967 0.158674 2.382 0.0179 *
## imd_score -0.016378 0.003143 -5.211 3.53e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5715 on 295 degrees of freedom
## Multiple R-squared: 0.9978, Adjusted R-squared: 0.9977
## F-statistic: 7870 on 17 and 295 DF, p-value: < 2.2e-16
Model 3 Predictive check - Logarithmic FE Model
The peak density of the predictive check has decreased (Figure 7) in comparison to Figure 6 which might imply that the predictive error has decreased as more of the explanatory variables influencing price are captured. This is reflected in the near perfect R-squared value (99.8%). Furthermore the predicted distribution is now closer to the smooth bell curve distribution one might expect from a Gaussian distribution especially before the peak density (~ log 12). The simulations are relatively similar to Model 2 with slightly less overestimation towards the peak despite over dispersion around either side of the peak as a result of a change in inferential error.
#Fixed Effects model using LSOA for RMSE comparison
m3b <- lm('log(price) ~ pc + new + LSOA11CD - 1', db)
#Plot realisation of the model given and actual expected value
plot(density(m3$fitted.values),
xlim=c(8, max(db$logpr)),
main='')
# Loop for realizations
for(i in 1:250){
tmp_y <- generate_draw(m3)
lines(density(tmp_y),
col='green',
lwd=0.1
)
}#Add line for model prediction
lines(density(db@data$logpr),
col='red',
main='')
#Add legend
legend('topleft',
c('Predicted', 'Actual', 'Simulated (n=250)'),
col=c('black', 'red', 'green'),
lwd=1)
#Add title
title(main="Predictive check - Logarithmic FE Model")
Figure 7 Predictive check - Logarithmic FE Model
Fixed Effects - Spatial Regime
The introduction of fixed effects in the form of short postcodes allows each constant term to vary by postcode and essentially attempts to isolate the comparison between neighbourhood characteristics. The constant terms of Model 3 are all statistically significant which implies that some of the unobserved effects previously absorbed in the error term align spatially within postcodes. Despite looking fairly similar, the exponential values of the postcode coefficients vary greatly between the smallest L28 (98,449) and largest L18 (274,366), which might be expected given that a lot of the most expensive properties sold were in L18. Introducing fixed effects has decreased the percentage impact of a house being new increasing price (45.9%) and for one unit increase in the deprivation index there is still around a 2% decrease in house price.
#Spatial Regimes
#Model 4 focuses on IMD to allow a seperate constant term and estimate on house price for every postcode
# Create a new variable `newB` to store the binary form of `new`
db@data$newB <- 1
db[db@data$new=='N', 'newB'] <- 0
#Set constant to allow variation of constant
db$one <- 1
#Display table to show NA
table(db$pc, db$new)
##
## N Y
## L10 4 0
## L11 17 8
## L12 34 2
## L13 33 0
## L14 9 0
## L15 37 0
## L16 8 0
## L17 37 2
## L18 41 3
## L19 26 1
## L20 1 0
## L24 13 0
## L25 31 0
## L27 4 0
## L28 1 0
Table 1 Newly Built Houses by Postcode
#Spatial Regimes Model
m4 <- lm('log(price) ~ 0 + (one + imd_score):pc', db)
#Call model summary
summary(m4)
##
## Call:
## lm(formula = "log(price) ~ 0 + (one + imd_score):pc", data = db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5146 -0.3159 -0.0234 0.2408 3.5012
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## one:pcL10 12.045818 15.303892 0.787 0.431875
## one:pcL11 10.471641 0.939362 11.148 < 2e-16 ***
## one:pcL12 12.089118 0.234966 51.450 < 2e-16 ***
## one:pcL13 12.002095 0.482152 24.893 < 2e-16 ***
## one:pcL14 12.438533 0.748224 16.624 < 2e-16 ***
## one:pcL15 12.116095 0.232220 52.175 < 2e-16 ***
## one:pcL16 12.271297 0.692667 17.716 < 2e-16 ***
## one:pcL17 12.247727 0.243354 50.329 < 2e-16 ***
## one:pcL18 12.500780 0.255626 48.903 < 2e-16 ***
## one:pcL19 12.835956 0.289501 44.338 < 2e-16 ***
## one:pcL20 11.277203 0.567562 19.870 < 2e-16 ***
## one:pcL24 11.427994 0.697553 16.383 < 2e-16 ***
## one:pcL25 13.139211 0.233243 56.333 < 2e-16 ***
## one:pcL27 9.968490 1.279735 7.789 1.27e-13 ***
## one:pcL28 10.714418 0.567562 18.878 < 2e-16 ***
## imd_score:pcL10 -0.009967 0.362139 -0.028 0.978061
## imd_score:pcL11 0.018098 0.016814 1.076 0.282677
## imd_score:pcL12 -0.007777 0.010100 -0.770 0.441936
## imd_score:pcL13 -0.011847 0.009993 -1.186 0.236790
## imd_score:pcL14 -0.020826 0.014712 -1.416 0.158004
## imd_score:pcL15 -0.017912 0.006153 -2.911 0.003889 **
## imd_score:pcL16 -0.010185 0.048418 -0.210 0.833536
## imd_score:pcL17 -0.006173 0.008275 -0.746 0.456274
## imd_score:pcL18 -0.012939 0.017546 -0.737 0.461459
## imd_score:pcL19 -0.033867 0.009118 -3.714 0.000245 ***
## imd_score:pcL20 NA NA NA NA
## imd_score:pcL24 -0.003459 0.013045 -0.265 0.791091
## imd_score:pcL25 -0.042013 0.009821 -4.278 2.58e-05 ***
## imd_score:pcL27 0.021412 0.023534 0.910 0.363670
## imd_score:pcL28 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5676 on 284 degrees of freedom
## Multiple R-squared: 0.9979, Adjusted R-squared: 0.9977
## F-statistic: 4846 on 28 and 284 DF, p-value: < 2.2e-16
Model 4 Predictive check - Logarithmic FE Model - Spatial Regime
Figure 8 shows that the prediction has an increased density for lower log price values compared to other models, before a trough and then a steep increase to the peak density around log 12 like the other logarithmic models. The trough in the prediction means Model 4 is under predicting when the actual reaches its peak. Additional patterns of under dispersion can be seen either side of the distribution respective to the actual which is linked to the inability of the model to remove all of the inferential and predictive error despite being most accurate in terms of Adjusted R-squared (99.8%).
#Plot realisation of the model given and actual expected value
plot(density(m4$fitted.values),
xlim=c(8, max(db$logpr)),
main='')
#Add line for model prediction
lines(density(db@data$logpr),
col='red',
main='')
#Add legend
legend('topleft',
c('Predicted', 'Actual'),
col=c('black', 'red'),
lwd=1)
#Add title
title(main="Predictive check - Logarithmic FE Model - Spatial Regime")
Figure 8 Predictive check - Logarithmic FE Model - Spatial Regime
Plotting coefficients
The spatial distribution of IMD coefficients by postcode can be seen in Figure 9. Apart from L27 to the East it could be argued that there is almost a pattern of increasing coefficient positivity in the south and increasingly negative coefficients to the North. At the postcode aggregation it could be argued the effect of IMD on house price is heterogeneous, albeit not sporadically heterogeneous which agrees with current literature (Vesselinov et al., 2007). L25 has the highest negative percentage decrease in price for each unit of IMD and contains some of the highest house prices which implies that in this case high house prices are most affected by IMD. This might be expected given that house buyers may consider an area of low deprivation worthy of a higher house price in association with perceived quality of living. Consideration of spatial lag will help to test this theory.
#Plot coefficients
#Extract coeffieicents and convert to df
temp <- data.frame(summary(m4)$coefficients)
#Write as csv to index table when read back in
write.csv(temp, "temp.csv")
temp <- read.csv('temp.csv')
#Create column that contains first part of postcode to match to db
temp$pc <- as.character(lapply(strsplit(as.character(temp$X), split="imd_score:pc"), "[", 2))
#Remove NA values created when R cannot split the string of the first 15 values
#Table created with coeff for each postcode to show the effect of IMD vary over space
temp[complete.cases(temp),]
## X Estimate Std..Error t.value Pr...t.. pc
## 16 imd_score:pcL10 -0.009967480 0.362138588 -0.02752394 9.780612e-01 L10
## 17 imd_score:pcL11 0.018097653 0.016813646 1.07636691 2.826767e-01 L11
## 18 imd_score:pcL12 -0.007776935 0.010099829 -0.77000653 4.419360e-01 L12
## 19 imd_score:pcL13 -0.011846756 0.009992602 -1.18555258 2.367900e-01 L13
## 20 imd_score:pcL14 -0.020825949 0.014712311 -1.41554576 1.580042e-01 L14
## 21 imd_score:pcL15 -0.017912047 0.006153296 -2.91096805 3.888575e-03 L15
## 22 imd_score:pcL16 -0.010185326 0.048418093 -0.21036198 8.335360e-01 L16
## 23 imd_score:pcL17 -0.006173363 0.008275088 -0.74601773 4.562737e-01 L17
## 24 imd_score:pcL18 -0.012939394 0.017546196 -0.73744723 4.614593e-01 L18
## 25 imd_score:pcL19 -0.033867231 0.009117936 -3.71435282 2.452442e-04 L19
## 26 imd_score:pcL24 -0.003458827 0.013045108 -0.26514360 7.910912e-01 L24
## 27 imd_score:pcL25 -0.042012702 0.009820642 -4.27799960 2.579834e-05 L25
## 28 imd_score:pcL27 0.021412262 0.023533689 0.90985575 3.636701e-01 L27
#Convert temp to df
temp <- data.frame(temp)
#Create column that contains first part of postcode using string split and lapply to form a list the same length as the original column
db_wgs84$pc <- as.character(lapply(strsplit(as.character(db_wgs84$pcds), split=" "), "[", 1))
#Match to values to each point within shapefile
db_wgs84@data <- data.frame(db_wgs84@data,
temp[match(db_wgs84@data[, "pc"],
temp[, "pc"]), ])
#Create df
df <- db_wgs84@data
# Create a palette that maps factor levels to colors
pal <- colorBin("RdYlGn", df$Estimate)
#Define Leaflet plot
leaflet(df) %>% addTiles() %>%
#Add cartoDB base map
addProviderTiles("CartoDB.Positron") %>%
#Add points as circles
addCircleMarkers(
#Choose radius
radius = 5,
#Select estimate as colour
color = ~pal(Estimate),
#Remove point outlines
stroke = FALSE,
#Define opacity
fillOpacity = 0.5,
#Add popup to display postcode
popup = ~pc)%>%
#Add legend
addLegend(pal = pal, values = exp(df$Estimate),
title= "Coefficent <br> of IMD")
## Assuming 'lon' and 'lat' are longitude and latitude, respectively
Spatial Lag
Spatial lag can be defined as the average house price as a result of the combination of explanatory variables from a given neighbourhood. In this example the square root (18) of the number of observations (320) has been used as the number of neighbours which is convention for the sample size, although cross-validation might be considered for future study (Duda et al., 2012). Model 5 shows that the reintroduction of newly built in to the regression might influence house price by around 56.5% which is higher than any other model in this study. For every unit increase in IMD house price might expect a 1.9% decrease which is similar to the other models. In terms of the new coefficient for spatial lag which is statistically insignificant, there is evidence to suggest that the IMD of the surrounding houses has no conclusive effect on a given house price. Further study might look to test whether aggregation of IMD at the LSOA level rather than the household level would help spatial lag to have a more significant effect on the model.
#Spatial lag
#K-nearest neighbours where k = 150
# Because some rows are different units on the same house, slightly
# jitter the locations to break ties
xy.jit <- jitter(db@coords)
#Calculate the square root of n
sqrt(320)
## [1] 17.88854
# Create knn list of each house
hnn <- knearneigh(xy.jit, k=18)
# Create nb object
hnb <- knn2nb(hnn)
# Create spatial weights matrix (note it row-standardizes by default)
hknn <- nb2listw(hnb)
#Compute the spatial lag of imd_score
db@data$w_imd_score <- lag.listw(hknn, db@data$imd_score)
#Spatial lag model
m5 <- lm('logpr ~ new + imd_score + w_imd_score', db)
#Call model summary
summary(m5)
##
## Call:
## lm(formula = "logpr ~ new + imd_score + w_imd_score", data = db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4349 -0.2882 -0.0048 0.2526 3.9080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.547388 0.089996 139.422 < 2e-16 ***
## newY 0.448107 0.150110 2.985 0.00306 **
## imd_score -0.019371 0.003037 -6.378 6.57e-10 ***
## w_imd_score -0.004204 0.004282 -0.982 0.32699
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5807 on 308 degrees of freedom
## Multiple R-squared: 0.2997, Adjusted R-squared: 0.2929
## F-statistic: 43.94 on 3 and 308 DF, p-value: < 2.2e-16
Model 5 Predictive check - Logarithmic FE Model - Spatial lag
Figure 10 displays the predictive checking of Model 5 and is virtually identical to Model 2. With consideration of adjusted R-squared (30%), there is strong evidence to support the use of a spatial fixed effect in the model for this data using the particular method of spatial lag.
#Plot realisation of the model given and actual expected value
plot(density(m5$fitted.values),
xlim=c(8, max(db$logpr)),
main='')
# Loop for realizations
for(i in 1:250){
tmp_y <- generate_draw(m5)
lines(density(tmp_y),
col='green',
lwd=0.1
)
}#Add line for model prediction
lines(density(db@data$logpr),
col='red',
main='')
#Add legend
legend('topleft',
c('Predicted', 'Actual', 'Simulated (n=250)'),
col=c('black', 'red', 'green'),
lwd=1)
#Add title
title(main="Predictive check - Logarithmic FE Model - Spatial lag")
Figure 10 Predictive check - Logarithmic FE Model - Spatial lag
RMSE can be defined as a predictive performance tool that measures the average distance between the predicted vector and actual distribution. The difference in units between the baseline model and the logarithmic models means that they are incomparable (without rescaling which is beyond this assignment) (Table 2). The addition of fixed effects to the logarithmic baseline model helps to decrease the predictive error by removing some uncertainty in the variability in price. However there is also an addition of inferential uncertainty gained around the error margins defined by the residual standard deviation of the coefficient. When the spatial regime is conducted it can be argued that by isolating each postcode the predictive uncertainty further decreases. This is because the model is analysing at a more specific aggregation which consequently decreases the individual inferential uncertainty by reducing the coefficient error term. As mentioned previously the addition of spatial lag instead of postcodes greatly reduces the models accuracy reflected in the R-squared value and the increase in RMSE which confirms the worsened model fit. For additional consideration the RMSE of the LSOA-level as a higher aggregation of fixed effects has been included and suggests that the model becomes more accurate which might be expected given that LSOA-level boundaries are specifically designed to have similar characteristics.
#RMSE
#Define RMSE function
rmse <- function(t, p){
se <- (t - p)^2
mse <- mean(se)
rmse <- sqrt(mse)
return(rmse)
}
#Create dataframe of model results
rmses <- data.frame(model=c('OLS - Baseline', 'Logarithmic - Baseline', 'Logarithmic + FE(PC)',
'Logarithmic + FE(LSOA)', 'Logarithmic + Spatial Regime', 'Logarithmic + Spatial Lag'
),
#Apply function to each set of variables
RMSE=c(rmse(db$price,
m1$fitted.values),
rmse(db$logpr,
m2$fitted.values),
rmse(db$logpr,
m3$fitted.values),
rmse(db$logpr,
m3b$fitted.values),
rmse(db$logpr,
m4$fitted.values),
rmse(db$logpr,
m5$fitted.values)
)
)
#Set value format for easier comparison
options(scipen = 999)
#Call results
rmses
## model RMSE
## 1 OLS - Baseline 790026.4863505
## 2 Logarithmic - Baseline 0.5778338
## 3 Logarithmic + FE(PC) 0.5557589
## 4 Logarithmic + FE(LSOA) 0.4088923
## 5 Logarithmic + Spatial Regime 0.5414959
## 6 Logarithmic + Spatial Lag 0.5769318
Table 2 Root Mean Squared Error
Comparison of results in class cannot be wholly compared like the trip data as the points vary month to month because very few houses resell frequently. The reduction of sample size and sample area means that the distribution of house prices is also relatively incomparable, although there is evidence to suggest that the removal of city centre postcodes has been conducted to better demonstrate the effect of price propensity that isn’t skewed by population. Evidence of this can be seen within Figure 5, where the hotspots in class are most noticeable in the North and more densely populated centre of Liverpool. The distribution of the data for this assignment (especially the extremely high value) has caused a large amount of overestimation on the effect of new builds in all models apart from when spatial regimes are used, in comparison to the class data. Additionally the skew within the data has led to greater statistical significance for a number of variables in comparison to the class data. The RMSE is not directly compatible although it is noticeable that the addition of fixed effects within a spatial regime provides the highest decrease in RMSE which might be expected.
Remove the large extreme value at the tail end to help normalise the distribution at the risk of increasing researcher bias.
The inability of logarithms to normalise the distribution suggests that using a Poisson distribution may be more useful than a Gaussian one. In fact some intermediate analysis finds that a Poisson distribution of improves the model accuracy (RMSE) for a number of models.
#Create mini df to create price
db_mini <- data.frame("ID" = db$id,
"Price" = db$price,
"Postcode" = db$pc,
"Description" = db$paon,
"Locality" = db$locality)
# Order table by descending price
db_mini <- db_mini[with(db_mini, order(-db_mini$Price)), ]
head(db_mini, 30)
## ID Price Postcode
## 308 {2D1E4B27-1F71-FCD1-E050-A8C0630544EC} 10076828 L25
## 309 {2D1E4B27-1F72-FCD1-E050-A8C0630544EC} 10076828 L25
## 15 {2D1E4B26-8D24-FCD1-E050-A8C0630544EC} 825000 L17
## 33 {2D1E4B26-8E2A-FCD1-E050-A8C0630544EC} 775000 L12
## 306 {2D1E4B27-1F5E-FCD1-E050-A8C0630544EC} 700000 L18
## 126 {2D1E4B26-8C0C-FCD1-E050-A8C0630544EC} 623500 L19
## 190 {2D1E4B26-8CDA-FCD1-E050-A8C0630544EC} 600000 L25
## 213 {2D1E4B26-8D31-FCD1-E050-A8C0630544EC} 595000 L19
## 10 {2D1E4B26-8C50-FCD1-E050-A8C0630544EC} 525000 L17
## 244 {2AC10E4F-C395-1AF6-E050-A8C063052BA1} 525000 L17
## 23 {2D1E4B26-8DAC-FCD1-E050-A8C0630544EC} 520000 L18
## 107 {2D1E4B26-8F09-FCD1-E050-A8C0630544EC} 470746 L12
## 133 {2D1E4B26-8C1C-FCD1-E050-A8C0630544EC} 450000 L17
## 268 {2D1E4B26-7167-FCD1-E050-A8C0630544EC} 427500 L18
## 50 {2D1E4B26-8D68-FCD1-E050-A8C0630544EC} 425000 L17
## 272 {2D1E4B26-71AA-FCD1-E050-A8C0630544EC} 425000 L18
## 37 {2D1E4B26-8E44-FCD1-E050-A8C0630544EC} 424995 L18
## 39 {2D1E4B26-8E49-FCD1-E050-A8C0630544EC} 414995 L18
## 72 {2D1E4B26-8DD2-FCD1-E050-A8C0630544EC} 392500 L18
## 145 {2D1E4B26-8F8E-FCD1-E050-A8C0630544EC} 375000 L18
## 127 {2D1E4B26-8F5F-FCD1-E050-A8C0630544EC} 355000 L17
## 24 {2D1E4B26-8DB8-FCD1-E050-A8C0630544EC} 350000 L18
## 32 {2D1E4B26-8E1B-FCD1-E050-A8C0630544EC} 342496 L18
## 234 {2D1E4B27-1F81-FCD1-E050-A8C0630544EC} 340000 L18
## 174 {2D1E4B26-8CA8-FCD1-E050-A8C0630544EC} 335000 L17
## 155 {2D1E4B26-8EC1-FCD1-E050-A8C0630544EC} 329995 L12
## 122 {2D1E4B26-8C03-FCD1-E050-A8C0630544EC} 325000 L18
## 13 {2D1E4B26-8CBE-FCD1-E050-A8C0630544EC} 320000 L18
## 253 {23B6165E-925B-FCF4-E050-A8C0620577FA} 320000 L18
## 25 {2D1E4B26-8DBA-FCD1-E050-A8C0630544EC} 310000 L12
## Description Locality
## 308 GATEACRE COMMUNITY COMPREHENSIVE SCHOOL GATEACRE
## 309 GATEACRE COMMUNITY COMPREHENSIVE SCHOOL GATEACRE
## 15 23 <NA>
## 33 OSBOURNE HOUSE, 33B SANDFIELD PARK
## 306 1 ALLERTON
## 126 5 CRESSINGTON
## 190 389 WOOLTON
## 213 33 <NA>
## 10 8 <NA>
## 244 8 <NA>
## 23 4A <NA>
## 107 49 <NA>
## 133 25 <NA>
## 268 4 <NA>
## 50 8 AIGBURTH
## 272 149 <NA>
## 37 5 <NA>
## 39 4 <NA>
## 72 29 <NA>
## 145 114 <NA>
## 127 6 <NA>
## 24 9 <NA>
## 32 6 <NA>
## 234 2 MOSSLEY HILL
## 174 HAZELMERE HOUSE <NA>
## 155 14 <NA>
## 122 4 <NA>
## 13 24 <NA>
## 253 24 <NA>
## 25 8 <NA>
Table 3 Data Sorted by Descending Price