Spatial Data Analysis on Green Spaces and House Prices in London Boroughs
Author
Affiliations
John Karuitha
Karatina University, School of Business
University of the Witwatersrand, School of Construction Economics & Management
Published
February 11, 2024
Modified
February 11, 2024
Abstract
In this analysis, we examine the connection between access to local parks and housing prices in London. The results reveal a noteworthy association between house prices and access to local parks and metropolitan parks. However, district parks and open spaces show a negative relationship with house prices, although it is not statistically significant. Notably, Westminster consistently stands out as the priciest area. The regression models employed in the analysis show that accessibility to parks has a limited role in influencing housing prices. Instead, the borough is of primary importance in pricing. Nonetheless, access to open spaces, local parks, and district parks has a significant relationship with house prices. Metropolitan parks and regional parks have no significant relationship with prices.
Keywords
R, Quarto, Spatial Data, sf, Housing, London, Parks
1 BACKGROUND
In the urban context of London, the relationship between access to green spaces and housing prices has become a significant focal point for investigation. As urbanization intensifies, the availability of green spaces in a city has profound implications for the quality of life and the economic landscape. Access to parks, gardens, and recreational areas not only contributes to the overall well-being of residents but also serves as a potential factor influencing housing market dynamics. Understanding this relationship is crucial for urban planners, policymakers, and residents alike, as it can inform strategic decisions regarding green space preservation, urban development, and housing affordability in a metropolis as dynamic and diverse as London.
Against this backdrop, the main objective of the forthcoming analysis is to systematically explore and establish the intricate relationship between access to green spaces and the pricing of residential properties in London. This investigation aims to employ quantitative methods to discern patterns, correlations, and potential causation between the proximity and quality of green spaces and the fluctuating prices of houses across different neighborhoods in the city. By unraveling these connections, the analysis seeks to contribute valuable insights that can guide urban planning initiatives, inform housing policies, and enhance our understanding of the complex interplay between urban ecology and real estate dynamics in the context of London.
Code
## Load the package manager ----
if (!require(pacman)) {
  install.packages("pacman")
  library(pacman) # attach pacman after a fresh install so p_load() is available
}

## Load required packages ----
p_load(
  tidyverse, janitor, skimr, mice, ggthemes, rmarkdown, readxl, conflicted,
  naniar, GGally, modelsummary, sf, tmap, gt, patchwork, spdep, kableExtra,
  sp, maptools
)

## Load from GitHub ----
p_load_gh("datarootsio/artyfarty")

## Set the options ----
options(digits = 3)
options(scipen = 999)

## Set a nice theme for plots ----
theme_set(theme_bain())
2 OBJECTIVE
The main objective of this analysis is to establish the relationship between access to green spaces and the price of houses in London.
3 DATA
The data set encompasses house prices across various London regions. To facilitate mapping, I enhance the set of data by incorporating a shape file of London. Additionally, I categorize the regions into either Inner London or Outer London. The analysis initially encompasses the entirety of London before specifically delving into the examination of Inner London.
Code
## Inner London boroughs ----
inner_london <- c(
  'Camden', 'Greenwich', 'Hackney', 'Hammersmith and Fulham', 'Islington',
  'Royal Borough of Kensington and Chelsea', 'Lambeth', 'Lewisham', 'Southwark',
  'Tower Hamlets', 'Wandsworth', 'Westminster', 'City of London'
)

## Outer London boroughs ----
outer_london <- c(
  'Barking and Dagenham', 'Barnet', 'Bexley', 'Brent', 'Bromley', 'Croydon',
  'Ealing', 'Enfield', 'Haringey', 'Harrow', 'Havering', 'Hillingdon',
  'Hounslow', 'Kingston upon Thames', 'Merton', 'Newham', 'Redbridge',
  'Richmond upon Thames', 'Sutton', 'Waltham Forest'
)
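As a minimal sketch of how such vectors could be used to label each observation, the base R example below tags rows of a made-up data frame. The shortened borough vectors, the toy prices, and the "outer_london" label are assumptions for illustration; only the column name area and the value "inner_london" appear later in this analysis.

```r
# Hypothetical, shortened borough vectors (illustration only)
inner_example <- c("Camden", "Hackney", "Westminster")
outer_example <- c("Barnet", "Bromley", "Croydon")

# Toy data with made-up prices
toy <- data.frame(
  borough_name = c("Camden", "Barnet", "Westminster", "Croydon"),
  prices = c(800000, 550000, 1200000, 400000)
)

# Label each row; boroughs found in neither vector get NA
toy$area <- ifelse(
  toy$borough_name %in% inner_example, "inner_london",
  ifelse(toy$borough_name %in% outer_example, "outer_london", NA_character_)
)

table(toy$area, useNA = "ifany")
```

A borough name that is misspelled or split across two strings would fall into the NA group, so tabulating with useNA = "ifany" is a cheap check on the vectors.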
The data contains 525 observations of 17 variables. I augment the data with the shapefile of London, handled with the sf package, to permit data visualization using maps.
We examine the missing values in the data. Eight (8) variables in the parks data have 56 missing values each. The table below shows the variables and their associated missing values.
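As a rough illustration of how such a count can be produced, the sketch below tallies missing values per column in base R. The toy data frame and its values are made up; only the column names echo the parks data. The last line checks that 56 missing values out of 525 observations is about 11%, which agrees with the Missing (%) column in the summary statistics reported later.

```r
# Toy data with made-up values (illustration only)
toy <- data.frame(
  open_space  = c(10, NA, 30, NA),
  local_parks = c(5, 6, NA, 8),
  prices      = c(250000, 300000, 450000, 500000)
)

# Count and share of missing values per variable
missing_n <- colSums(is.na(toy))
missing_pct <- round(100 * missing_n / nrow(toy), 1)
data.frame(variable = names(missing_n), missing_n, missing_pct, row.names = NULL)

# Share implied by the figures quoted in the text: 56 of 525 observations
share_missing <- round(100 * 56 / 525)
share_missing
```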
In this section, I examine all the boroughs in London. There are 525 observations across the London boroughs in this set of data. I start with data exploration and then run a series of statistical tests.
4.1 Data Exploration and Visualization
The data visualization shows the clustering of houses by neighborhood and house price. Central London has a very high concentration of high-priced houses, and prices decline with distance from central London. Figure 2 shows the relationship between house prices and access to green spaces: open space, local parks, district parks, and metropolitan parks. Across all the cases, there is no clear pattern. However, a pattern could emerge once we do spatial analysis.
parks %>%
  as.data.frame() %>%
  dplyr::select(where(is.numeric), -year, -HECTARES, -NONLD_AREA) %>%
  modelsummary::datasummary_skim()
| | Unique (#) | Missing (%) | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| open_space | 358 | 11 | 50.0 | 26.9 | 0.0 | 51.2 | 100.0 |
| local_parks | 355 | 11 | 36.5 | 22.1 | 0.0 | 34.3 | 96.5 |
| district_parks | 270 | 11 | 33.0 | 34.5 | 0.0 | 20.0 | 100.0 |
| metropolitan_parks | 215 | 11 | 55.3 | 42.0 | 0.0 | 66.4 | 100.0 |
| regional_parks | 59 | 11 | 24.9 | 41.2 | 0.0 | 0.0 | 100.0 |
| prices | 336 | 0 | 535963.1 | 268089.3 | 230000.0 | 476500.0 | 2550000.0 |
Figure 3 displays pairs plots illustrating the relationships between house prices and access to parks in London. The focus is on understanding the variable distributions and examining correlation coefficients. Notably, the distribution of prices is right-skewed, indicating a concentration of houses with lower prices and a few with significantly higher prices. A noteworthy finding is the weak but statistically significant positive correlation (0.15) between house prices and access to metropolitan parks. However, access to open spaces and district parks exhibits a negative correlation with house prices, although this correlation is not statistically significant.
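A correlation as small as 0.15 can still be statistically significant when the sample is large. The sketch below works through the usual t-test for a Pearson correlation, assuming (this is my assumption, not a figure reported above) that the test uses the 525 observations minus the 56 with missing park data.

```r
# Assumed inputs: r from the text, n = 525 observations minus 56 incomplete ones
r <- 0.15
n <- 525 - 56

# t statistic and two-sided p-value for H0: correlation = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

c(n = n, t = t_stat, p = p_value)
```

Under these assumptions the p-value falls below 0.01, even though the correlation explains only about 2% of the variance in prices (0.15 squared).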
5 Statistical Tests
In this section, I run a range of statistical tests to check the relationship between access to green spaces and the prices of housing in London. To run these tests, I start by dropping data that has missing values.
5.0.1 Analysis of spatial autocorrelation (e.g. Moran’s I or LISA mapping)
In this section, I test for spatial autocorrelation.
Spatial autocorrelation is a special case of correlation, which is the global concept that two attribute variables X and Y have some average degree of alignment between the relative magnitudes of their respective values (Griffith and Chun 2018).
We test the following hypothesis:
H0: Moran’s I is not significantly different from zero (there is no spatial autocorrelation in house prices). Specifically, prices are randomly distributed across boroughs following a completely random process.
H1: Moran’s I is significantly different from zero (there is spatial autocorrelation in house prices). Specifically, the distribution of house prices across boroughs is not random.
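For reference, the statistic under test is the standard global Moran’s I. The notation here is mine: \(x_i\) is the (logged) price in polygon \(i\), \(\bar{x}\) is its mean, \(w_{ij}\) is the spatial weight between polygons \(i\) and \(j\), and \(n\) is the number of polygons.

\[
I = \frac{n}{S_0} \cdot \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad S_0 = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}
\]

Values well above zero indicate that similar prices cluster together, while values near zero are consistent with a spatially random arrangement.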
The Moran’s I test is sensitive to outliers, prompting an initial exploration of the distribution of the price variable. Observing a pronounced right skewness in the original price distribution, a logarithmic transformation is applied to bring the distribution closer to normal. Consequently, the analysis proceeds with the logarithm of prices, a more suitable representation for robust and accurate assessments (Chen 2021).
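To illustrate why the log transformation helps, the sketch below compares the sample skewness of simulated right-skewed prices before and after taking logs. The prices are drawn from a log-normal distribution with parameters I chose for illustration; they are not the London data, and skewness() is a small helper defined here rather than a function from a package.

```r
set.seed(123)

# Simulated right-skewed prices (illustration only, not the London data)
toy_prices <- rlnorm(1000, meanlog = 13, sdlog = 0.5)

# Helper: moment-based sample skewness
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(toy_prices)
skew_log <- skewness(log(toy_prices))
c(raw = skew_raw, logged = skew_log)
```

A skewness close to zero after logging suggests that a few very expensive houses will no longer dominate the statistic.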
Code
## Distribution of raw prices ----
p_raw <- parks %>%
  ggplot(mapping = aes(x = prices)) +
  geom_density() +
  labs(x = "Prices", y = "Density",
       title = "Distribution of Prices in London",
       subtitle = "House prices in London are heavily skewed to the right.")

## Distribution after logging prices ----
p_log <- parks %>%
  ggplot(mapping = aes(x = prices)) +
  geom_density() +
  labs(x = "Prices (Log Scale)", y = "Density",
       title = "Distribution of Prices in London",
       subtitle = "House prices in London are relatively normal after taking logs.") +
  scale_x_log10()

## Place the two plots side by side (patchwork) ----
p_raw | p_log
Distribution of house prices in greater London
I additionally investigate the variation in house prices across diverse London boroughs. The presented graph highlights a considerable disparity, with Westminster having notably higher house prices, followed by Camden as a distant second. In contrast, Barking and Dagenham exhibit the lowest median house prices. The overarching objective of this analysis is to discern if this observed price differential is associated with the accessibility of green spaces.
Code
# names(parks)
parks |>
  dplyr::filter(!is.na(borough_name)) |>
  mutate(borough_name = fct_reorder(borough_name, prices, median)) |>
  ggplot(mapping = aes(y = prices, x = borough_name)) +
  geom_boxplot(aes(fill = borough_name), show.legend = FALSE) +
  geom_jitter(shape = ".") +
  # Labels follow the aesthetics: y is prices, even after coord_flip()
  labs(x = "", y = "House Price",
       title = "House Prices by Borough",
       subtitle = "Westminster has the highest priced houses; Barking and Dagenham is the cheapest.") +
  coord_flip()
House prices by region
I start by defining the neighboring polygons. Below, we see that polygon 2 has 3 neighbors: 4, 7, and 8.
Code
## Define neighboring polygons ----
nb <- poly2nb(parks, queen = TRUE)

## Neighbors in second slot
nb[[2]]
[1] 4 7 8
Next, I assign weights to each neighboring polygon. In this case, each neighboring polygon is given the weight \(1/(\text{number of neighbors})\) (style="W"; note the uppercase "W") so that the weights for each polygon sum to 1. If a binary weight is desired (i.e. one where each neighboring polygon is assigned a weight of 1, regardless of the number of neighbors), we set style="B".
Code
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)
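To make the two weighting styles concrete, the sketch below builds them by hand in base R for a hypothetical map of four polygons. The adjacency matrix is invented for illustration, and the code mimics the idea behind style="W" and style="B" rather than calling spdep.

```r
# Hypothetical adjacency matrix: entry [i, j] is 1 if polygons i and j touch
adj <- matrix(
  c(0, 1, 1, 0,
    1, 0, 1, 1,
    1, 1, 0, 1,
    0, 1, 1, 0),
  nrow = 4, byrow = TRUE
)

# Row-standardized weights (the idea behind style = "W"):
# each neighbor of polygon i gets 1 / (number of neighbors of i)
w_row <- adj / rowSums(adj)

# Binary weights (the idea behind style = "B"): every neighbor gets weight 1
w_bin <- adj

rowSums(w_row) # each row sums to 1
rowSums(w_bin) # each row sums to the number of neighbors
```

Row standardization keeps polygons with many neighbors from carrying more total weight than polygons with few.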
To relate the house price in each polygon to the prices in its neighboring polygons, we compute the spatial lag of prices (the weighted average of the neighbors' prices) as follows.
Code
parks$lag <- lag.listw(lw, parks$prices)
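With row-standardized weights, the spatial lag of a polygon is simply the average price of its neighbors. The base R sketch below shows this for a hypothetical four-polygon map with made-up prices; it illustrates the calculation by matrix multiplication and does not use spdep.

```r
# Hypothetical adjacency matrix and made-up prices (illustration only)
adj <- matrix(
  c(0, 1, 1, 0,
    1, 0, 1, 1,
    1, 1, 0, 1,
    0, 1, 1, 0),
  nrow = 4, byrow = TRUE
)
w_row <- adj / rowSums(adj) # row-standardized weights
prices <- c(300000, 450000, 600000, 900000)

# Spatial lag: weighted average of the neighbors' prices
lag_prices <- as.vector(w_row %*% prices)
lag_prices
```

For example, polygon 1 borders polygons 2 and 3, so its lag is the mean of 450000 and 600000.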
Next, I specify and run a regression model.
Code
# Create a regression model
M <- lm(log(lag) ~ log(prices), data = parks)

# Extract the intercept and slope
coef(M)
(Intercept) log(prices)
2.91 0.78
The slope coefficient is positive (0.78). This means that as house prices in one area increase, house prices in the neighboring areas also tend to be higher.
Code
# Plot the data
plot(log(lag) ~ log(prices), parks, pch = 21, asp = 1, las = 1,
     col = "grey40", bg = "grey80")
abline(M, col = "blue") # Add the regression line from model M
abline(v = mean(log(parks$prices)), lty = 3, col = "grey80")
abline(h = mean(log(parks$prices)), lty = 3, col = "grey80")
Plotting the Regression Model
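With row-standardized weights, the slope of a regression of the spatially lagged variable on the variable itself is the Moran’s I coefficient. Because the model above takes the log of the lagged prices rather than the lag of the logged prices, its slope (0.78) is close to, but not identical to, the statistic computed directly below (0.749). The next step shows how to compute this statistic without computing the lagged values or fitting a regression model.
We can compute the Moran’s I statistic directly, as below.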
Code
moran(log(parks$prices), listw = lw, n = length(nb), S0 = Szero(lw))
$I
[1] 0.749
$K
[1] 5.04
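To show what this statistic does under the hood, the sketch below computes Moran’s I by hand for a hypothetical four-polygon map and checks that it equals the slope of a regression of the lagged values on the values themselves. The adjacency matrix and prices are made up, and the equality relies on row-standardized weights with no polygon lacking neighbors.

```r
# Hypothetical adjacency matrix and made-up prices (illustration only)
adj <- matrix(
  c(0, 1, 1, 0,
    1, 0, 1, 1,
    1, 1, 0, 1,
    0, 1, 1, 0),
  nrow = 4, byrow = TRUE
)
w <- adj / rowSums(adj) # row-standardized weights
x <- log(c(300000, 450000, 600000, 900000))

# Moran's I from its definition
z <- x - mean(x)
n <- length(x)
s0 <- sum(w)
moran_i <- (n / s0) * sum(w * outer(z, z)) / sum(z^2)

# Slope of the lagged values regressed on the values
lag_x <- as.vector(w %*% x)
slope <- unname(coef(lm(lag_x ~ x))[2])

c(moran_i = moran_i, slope = slope)
```

Note that the lag here is taken of the logged prices, which is what makes the two numbers agree exactly.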
To test the hypothesis that the Moran’s coefficient is statistically different from zero, we get the p-values, as follows.
Code
MC <- moran.mc(log(parks$prices), lw, nsim = 999)

# View results (including pseudo p-value)
MC
Monte-Carlo simulation of Moran I
data: log(parks$prices)
weights: lw
number of simulations + 1: 1000
statistic = 0.7, observed rank = 1000, p-value = 0.001
alternative hypothesis: greater
We employ visualization to illustrate the simulation, where the curve represents the distribution of Moran I values anticipated if house prices were randomly distributed among the boroughs. The observed value, represented by the vertical line in the graph, significantly surpasses the anticipated value. This discrepancy suggests that house prices exhibit a spatial correlation, indicating a non-random distribution pattern across the boroughs.
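The logic of the Monte Carlo test can be sketched in base R: shuffle the values across polygons many times, recompute Moran’s I each time, and see how often a shuffled map is at least as clustered as the observed one. Everything below is a toy setup of my own (20 polygons in a row with a made-up trend, and a helper named moran_stat); it imitates the idea of the test and is not the spdep implementation.

```r
set.seed(42)

# Helper: Moran's I for values x and weight matrix w
moran_stat <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Toy map: 20 polygons in a row, each bordering the next
n <- 20
adj <- (abs(outer(1:n, 1:n, "-")) == 1) * 1
w <- adj / rowSums(adj)

# Made-up values with a strong spatial trend
x <- seq_len(n) + rnorm(n)

# Observed statistic versus 999 random reshuffles
observed <- moran_stat(x, w)
nsim <- 999
simulated <- replicate(nsim, moran_stat(sample(x), w))

# Pseudo p-value for the alternative "greater"
p_value <- (sum(simulated >= observed) + 1) / (nsim + 1)
c(observed = observed, p_value = p_value)
```

With 999 simulations the smallest possible pseudo p-value is 1/1000 = 0.001, which is the value obtained when the observed statistic ranks above every simulated one.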
Code
# Plot the distribution (note that this is a density plot instead of a histogram)
plot(MC, main = "", las = 1)
Margin Plots
5.0.2 ANOVA
I run the analysis of variance (ANOVA). We find a statistically significant difference in house prices according to access to parks. Access to metropolitan parks, local parks, and regional parks has a significant relationship with house prices. A Tukey post-hoc test revealed significant differences in house prices between levels of park access. Houses with access to metropolitan parks, regional parks, and district parks attract a better price than houses with less access (Potvin 2020).
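The sketch below shows one way an ANOVA with a Tukey post-hoc test can be set up in base R. It is an illustration under my own assumptions and not necessarily the specification used here: the data are simulated, access is treated as a percentage between 0 and 100, and the cut points that split it into low, medium, and high bands are arbitrary.

```r
set.seed(1)

# Simulated data (illustration only): prices rise with park access
n <- 150
access <- runif(n, 0, 100)
prices <- 400000 + 1500 * access + rnorm(n, sd = 50000)

# Arbitrary bands of access
access_band <- cut(
  access,
  breaks = c(0, 33, 66, 100),
  labels = c("low", "medium", "high"),
  include.lowest = TRUE
)

# One-way ANOVA followed by Tukey's honest significant differences
fit <- aov(prices ~ access_band)
summary(fit)

tukey <- TukeyHSD(fit)
tukey$access_band
```

Each row of the Tukey table compares two bands, with the difference in mean price, a confidence interval, and an adjusted p-value.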
Similar to the analysis of variance (ANOVA) discussed earlier, the regression analysis reaffirms that houses with access to parks generally command higher prices. Specifically, the model indicates that, holding all other factors constant, houses with access to metropolitan parks are, on average, £931 more expensive than comparable houses without such access. Similarly, houses with access to local parks carry an average premium of £836 compared to similar houses without access to local parks. Access to regional parks also plays a significant role, correlating with an average increase of £595. The model itself demonstrates statistical significance at a 1% level. However, it is noteworthy that the model exhibits limited explanatory power, accounting for only 2% of the variability in house prices. This suggests that, while park access contributes to pricing, other factors such as location and house type play pivotal roles in determining house prices.
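To illustrate how park access can be statistically significant yet explain little of the variation in prices, the sketch below fits two linear models to simulated data: one with park access only and one that adds the borough. All names, effect sizes, and noise levels are invented for illustration and are not estimates from the London data.

```r
set.seed(7)

# Simulated data (illustration only)
n <- 200
borough <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
borough_effect <- c(A = 0, B = 100000, C = 250000, D = 600000)
metro_access <- runif(n, 0, 100)
prices <- 300000 + 900 * metro_access +
  unname(borough_effect[as.character(borough)]) +
  rnorm(n, sd = 20000)

# Park access only versus park access plus borough
fit_parks <- lm(prices ~ metro_access)
fit_full <- lm(prices ~ metro_access + borough)

c(
  parks_only = summary(fit_parks)$r.squared,
  with_borough = summary(fit_full)$r.squared
)
coef(fit_full)[["metro_access"]]
```

In this toy setup the borough accounts for most of the price variation, which mirrors the pattern described above in which location matters far more than park access.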
5.0.4 Correlations or chi-square tests of association
In this section, I rerun the correlation tests. There is a significant correlation between house prices and access to metropolitan parks. District parks and open spaces have a negative correlation with prices, although this correlation is not statistically significant. There is also a notable degree of skewness in prices and access to parks.
inner_parks <- parks %>% dplyr::filter(area == "inner_london")
We repeat the same analysis for the inner London boroughs, which account for 139 observations, starting with the visualization and summary of the data.
6.1 Data Exploration and Visualization
In the case of inner London boroughs, we have a set of data with 139 observations of 18 variables. I start by mapping the boroughs of inner London and the corresponding house prices. We see a high concentration of high-cost houses in the North West of London. The South East of London has the bulk of the lower-priced houses.
In this section, I run a range of statistical tests to check the relationship between access to green spaces and the prices of housing in Inner London.
6.2.1 Analysis of spatial autocorrelation (e.g. Moran’s I or LISA mapping)
In this analysis of spatial autocorrelation (Moran’s I test), I start by defining the neighboring polygons.
In this analysis, we test the following hypothesis:
H0: Moran’s I is not significantly different from zero (there is no spatial autocorrelation in house prices).
H1: Moran’s I is significantly different from zero (there is spatial autocorrelation in house prices).
We see that polygon 2 has 3 neighbors: 1, 3, and 5.
Code
## Define neighboring polygons ----
nb1 <- poly2nb(inner_parks, queen = TRUE)

## Neighbors in second slot
nb1[[2]]
[1] 1 3 5
I start by examining the distribution of house prices in inner London boroughs. As in the case of the whole of London, the prices are right-skewed. Because Moran’s I test is sensitive to outliers, we work with the logarithm of prices. Westminster is still the most expensive area of inner London, as was the case for the whole of London. Greenwich is the cheapest area of inner London.
Code
## Distribution of raw prices ----
p_raw_inner <- inner_parks %>%
  ggplot(mapping = aes(x = prices)) +
  geom_density() +
  labs(x = "Prices", y = "Density",
       title = "Distribution of Prices in London",
       subtitle = "House prices in London are heavily skewed to the right.")

## Distribution after logging prices ----
p_log_inner <- inner_parks %>%
  ggplot(mapping = aes(x = prices)) +
  geom_density() +
  labs(x = "Prices (Log Scale)", y = "Density",
       title = "Distribution of Prices in London",
       subtitle = "House prices in London are Relatively Normal after Taking Logs.") +
  scale_x_log10()

## Place the two plots side by side (patchwork) ----
p_raw_inner | p_log_inner
Distribution of house prices in Inner London
Code
# names(parks)
inner_parks |>
  dplyr::filter(!is.na(borough_name)) |>
  mutate(borough_name = fct_reorder(borough_name, prices, median)) |>
  ggplot(mapping = aes(y = prices, x = borough_name)) +
  geom_boxplot(aes(fill = borough_name), show.legend = FALSE) +
  geom_jitter(shape = ".") +
  # Labels follow the aesthetics: y is prices, x is the borough
  labs(x = "", y = "House Price",
       title = "House Prices by Borough: Inner London",
       subtitle = "In inner London, Westminster has the highest priced houses; Greenwich is the cheapest.")
House prices by region in inner London
As previously, we assign weights to each of the polygons.
Code
lw1 <- nb2listw(nb1, style = "W", zero.policy = TRUE)
To relate the house price in each polygon to the prices in its neighboring polygons, we again work with the spatial lag of prices.
Plotting the prices against the lags shows strong positive spatial autocorrelation in prices. Positive spatial autocorrelation means that geographically nearby values of a variable tend to be similar on a map: high values tend to be located near high values, medium values near medium values, and low values near low values. Indeed, the map also corroborates this information, as we see concentrations of houses of roughly equal prices clustered together.
After running the regression, the slope coefficient is 0.739. This coefficient means that as prices in one area go up, prices in the neighboring areas also tend to be higher.
Code
# Create a regression model
M1 <- lm(log(lag) ~ log(prices), data = inner_parks)

# Extract the intercept and slope
coef(M1)
(Intercept) log(prices)
3.507 0.739
I plot the lagged prices against prices, together with the fitted regression line.
Code
# Plot the data
plot(log(lag) ~ log(prices), inner_parks, pch = 21, asp = 1, las = 1,
     col = "grey40", bg = "grey80")
abline(M1, col = "blue") # Add the regression line from the inner London model M1
abline(v = mean(log(inner_parks$prices)), lty = 3, col = "grey80")
abline(h = mean(log(inner_parks$prices)), lty = 3, col = "grey80")
Spatial Correlation Plots
The slope of the regression model is the Moran’s I coefficient (Moraga 2023). Next, we compute this statistic without computing the lagged values or fitting a regression model.
We can compute the Moran’s I statistic directly, as below.
Code
# Use the inner London neighbor list (nb1) and weights (lw1) throughout
moran(log(inner_parks$prices), listw = lw1, n = length(nb1), S0 = Szero(lw1))
$I
[1] 0.696
$K
[1] 4.5
We see that the prices are spatially autocorrelated, with a spatial correlation coefficient of 0.696. Thus, there is a high degree of correspondence between the prices of houses in the same neighborhood. The Monte Carlo test below confirms that this spatial correlation is significantly different from zero.
Monte-Carlo simulation of Moran I
data: inner_parks$prices
weights: lw1
number of simulations + 1: 1000
statistic = 0.5, observed rank = 1000, p-value = 0.001
alternative hypothesis: greater
Running the simulation yields figure 14 below. The curve shows the distribution of Moran I values we could expect had the house prices been randomly distributed across the boroughs. Our observed value (vertical line in the graph below) is far above the expected value, which implies that the house prices are spatially correlated.
Code
# Plot the distribution (note that this is a density plot instead of a histogram)
plot(MC1, main = "", las = 1)
Margin Plots
6.2.2 ANOVA
The ANOVA results for Inner London mirror the results for all of London. There is a significant difference in house prices based on their access to local parks and metropolitan parks.
As was the case for the whole of London, the regression model for inner London shows that access to metropolitan parks and local parks has a positive relationship with house prices. All else remaining the same, a one-unit rise in access to metropolitan parks corresponds to a £1,983 rise in average prices. Local parks are even more important: a one-unit rise in access to local parks corresponds to a £2,034 rise in prices, ceteris paribus. Access to open spaces in inner London has a negative association with prices. A one-unit rise in access to open spaces is associated with a £2,960 drop in prices on average, all else remaining the same. Access to regional parks and district parks also has a positive relationship with prices, although this relationship is not statistically significant.
6.2.4 Correlations or chi-square tests of association
In this section, we examine the correlation coefficients between house prices and access to green spaces in the inner London boroughs. Again, we see a significant correlation between house prices and access to parks. District parks and open spaces have a negative correlation with house prices. However, the correlation is not significant.
In this analysis, we have examined the relationship between house prices and access to parks and open spaces. First, we find that house prices are spatially correlated, with the highest priced houses in Westminster and the lowest priced in Barking and Dagenham. Second, there is a statistically significant correlation between house prices and access to local parks. Third, access to parks has a significant relationship with house prices, especially access to local parks, regional parks, and metropolitan parks. The regression models have low explanatory power, probably because factors other than access to parks determine house prices.
References
Chen, Yanguang. 2021. “An Analytical Process of Spatial Autocorrelation Functions Based on Moran’s Index.” PLoS One 16 (4): e0249589.