In this document I will be doing some exploratory analyses on the London House Price dataset prepared by Imperial College as part of the Summer Data Challenge. The goal is to provide a visual exploration of how house prices varies as a factor of different attributes. To this end I have also built an interactive web app to portray how these data on a London map. The link to this app can be found at the end of this document.
While all the R codes to process the data and produce the visualisations are present for reference, if you are not familiar with the language feel free to skip the blocks of text in coloured backgrounds.
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
First let’s load the dataset and do some initial preprocessing.
df <- read.csv('Houseprice_2009_100km_London.csv')
df$Price <- as.integer(gsub(pattern='_', replacement='', df$Price)) # to int
df$Trdate <- strptime(df$Trdate, format='%Y-%m-%d %H:%M', tz='GMT')
df$Trdate <- as.Date(df$Trdate)
str(df)
## 'data.frame': 1384998 obs. of 13 variables:
## $ Price : int 83500 150000 227000 930000 183750 125000 350000 180000 220000 535000 ...
## $ Trdate : Date, format: "2009-06-26" "2009-03-25" ...
## $ Postcode : Factor w/ 305958 levels "AL10 0AB","AL10 0AH",..: 161804 243836 51161 223823 276399 45347 123008 61368 28556 245115 ...
## $ Property_Type: Factor w/ 4 levels "D","F","S","T": 3 4 3 4 4 2 1 4 3 1 ...
## $ Newbuild : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freeorlease : Factor w/ 2 levels "F","L": 1 1 1 1 1 2 1 1 1 1 ...
## $ Year : int 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
## $ Month : Factor w/ 65 levels "2009-01","2009-02",..: 6 3 8 10 12 5 8 9 12 6 ...
## $ Oseast1M : Factor w/ 124643 levels "429_936","430_206",..: 28074 63153 71771 68629 112206 115675 54477 84103 82892 62031 ...
## $ Osnrth1M : Factor w/ 119808 levels "100_001","100_003",..: 115057 44145 39269 51441 18605 93398 41646 55265 110981 37469 ...
## $ Oa11 : Factor w/ 61206 levels "E00000001","E00000003",..: 49229 20245 5404 19675 45055 37441 14226 1598 32236 52784 ...
## $ Latitude : num 52.3 51.4 51.3 51.5 51.1 ...
## $ Longitude : num -0.717 -0.1743 -0.0515 -0.0924 0.6935 ...
Everything ready! Let’s do some preliminary exploratory analyses. Below are some automatically generated summary statistics by R on the dataset.
summary(df[,c(1,2,4,5,6,8)])
## Price Trdate Property_Type Newbuild
## Min. : 6200 Min. :2009-01-01 D:264394 N:1256261
## 1st Qu.: 179000 1st Qu.:2010-07-08 F:427491 Y: 128737
## Median : 247500 Median :2011-11-18 S:315479
## Mean : 332720 Mean :2011-11-09 T:377634
## 3rd Qu.: 365000 3rd Qu.:2013-04-12
## Max. :54959000 Max. :2014-05-30
##
## Freeorlease Month
## F:944801 2013-08: 30920
## L:440197 2013-11: 30870
## 2013-10: 29896
## 2013-12: 29855
## 2009-12: 29325
## 2013-07: 29008
## (Other):1205124
Eyeballing through the summary, we can see the that London property prices ranged from £6,200 to nearly £55m! The median price is roughly £250,000, with 50% of the data ranging between £179,000 and £365,000. For this reason it is pretty safe to ignore the mean as they are more sensitive to skews in the data.
We can see that the data spans between 1st Jan 2009 and 30th May 2014, as the data for 2014 is incomplete we might see a marked decrease in summary statistics computed on this year.
Note an interesting pattern in the “Month” variable, it seems a lot of property purchases was made between August and December 2013. This must have been a particularly ‘hot’ period for the London property market.
There are a few other statistics but it will be more intuitive to look at them through some visualisations. They will be presented in later parts of the analyses. For now let’s have a look at the distribution of these prices a bit more closely.
ggplot(data=df, aes(x=Price)) + geom_histogram(aes(y=..density..)) + geom_density()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Above is a histogram combined with a density estimate of how many properties were sold at each prize point. Unsurprisingly, the distribution looks like it follows a power law. The easiest way to understand the power law distribution is to think of the famous 80:20 rule. In business this often translates to “80% of your revenue comes from 20% of your customers”. In the context of this dataset, you can make a very rough estimate that 80% of purchases were made in the lowest 20% of the price range.
To get a better view of the data we could do a log transformation on prices:
ggplot(data=df, aes(x=log(Price))) + geom_histogram(aes(y=..density..)) + geom_density()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
exp(12.5) # = 268,337.3
## [1] 268337.3
We get (roughly) normal distribution after log-transform with mode around exp(12.5), or ~£270,000. There is an spike around this price point and several smaller peaks scattered across different price points. We will learn more about these spikes in the next few visualisations.
Following this discovery, all the subsequent visualisations with a Prices distribution will be log transformed. Note that each tick on the price axis will be
associated with an exponential increase in price, not a constant increase like you would normally expect.
Now we will look at how each of the variables affect house price through a few visualisations.
by_pt <- select(df, Price, Property_Type) %>%
group_by(Property_Type) %>%
summarise(Count=n(), Mean.Price=mean(Price), St.Dev.Price=sd(Price), Median.Price=median(Price))
ggplot(melt(by_pt), aes(x=variable, y=value, fill=Property_Type)) +
facet_wrap(~ variable, scales='free') +
geom_bar(position="dodge", stat='identity', width=0.3)
## Using Property_Type as id variables
We’ll look at the summary statistics individually. The variable “Count” shows us the number of purchases for each type of property. From looking at the graph we can see that most purchases properties in London were Flats, followed by Terraced houses. There are considerably less Detached and Semi-Detached houses bought in this period.
The mean and median prices tell us similar stories: Detached houses are a lot more expensive, and other property types had roughly similar price points. The mean statistic is much higher than the median price, which is most likely due to outliers - or the odd houses at the several million pound price point skewing the data.
Standard deviation gives us a measure of variation or dispersion around the mean. We can see that Detached and Terraced houses vary wildly in price, followed by Flats, whereas Semi-Detached house prices are comparatively less spread out. Unsurprisingly the standard deviations are also very large, again most likely due to the power law distribution we’ve observed above.
ggplot(df, aes(x=Property_Type, y=Price)) +
stat_ydensity(trim = FALSE, aes(fill = Property_Type)) +
scale_y_log10(breaks=round(10^seq(3.6,8,0.2))) +
coord_trans(y = "log10") +
coord_flip() +
theme(axis.text.x= element_text(angle=45, hjust=1))
The above is a “Violin” graph. Think of each blob as a smoothed histogram with an inverted side. It allows us to compare the distribution of house prices for each property type. Looking at the plot we can see that the distribution for Terraced and Semi-detached houses are very similar. The other two distributions show that Flats are mostly priced lower than the previous two, with Detached houses being generally higher in price.
We see again this interesting drop in number of purchases around the £250k price point. There is another dip around ~500k which is most dramatic in detached houses. These price points appear to correspond to the tax bands associated with house prices, hence most properties priced above these price points will be pushed down to make them more attractive to buyers.
by_nb <- select(df, Price, Newbuild) %>%
group_by(Newbuild) %>%
summarise(Count=n(), Mean.Price=mean(Price), St.Dev.Price=sd(Price), Median.Price=median(Price))
ggplot(melt(by_nb), aes(x=variable, y=value, fill=Newbuild)) +
facet_wrap(~ variable, scales='free') +
geom_bar(position="dodge", stat='identity', width=0.3)
## Using Newbuild as id variables
It’s not surprising that most purchased houses are not newbuilds, as this lack of new developments in the London property market is well documented. The median price suggests a negligible difference in price between the average newbuild and non-newbuild home. There is likely to be some subtlety here, however, so let’s take a look at the violin plot to take a closer look at their price distributions.
ggplot(df, aes(x=Newbuild, y=Price)) +
stat_ydensity(trim = FALSE, aes(fill = Newbuild)) +
scale_y_log10(breaks=round(10^seq(3.6,8,0.2))) +
coord_trans(y = "log10") +
coord_flip() +
theme(axis.text.x= element_text(angle=45, hjust=1))
Turns out their distributions are fairly similar. Non-newbuilds tend to have sharper cutoff peaksand more extreme prices at higher price points. This suggests that there are new properties being sold at all price points. It would be interesting if we could try separating supply and demand for newbuild properties, however it is not possible to do so from this graph. We can see that the most popular price point was roughly around 250k, so any new properties set around this price is likely to find many interested buyers.
In a later visualisation we will investigate where these new properties are being purchased.
by_fl <- select(df, Price, Freeorlease) %>%
group_by(Freeorlease) %>%
summarise(Count=n(), Mean.Price=mean(Price), St.Dev.Price=sd(Price), Median.Price=median(Price))
ggplot(melt(by_fl), aes(x=variable, y=value, fill=Freeorlease)) +
facet_wrap(~ variable, scales='free') +
geom_bar(position="dodge", stat='identity', width=0.3)
## Using Freeorlease as id variables
There were over twice as many freehold properties being purchased in the last 4 years than leasehold properties. Freeholds are also more expensive, however the difference is much smaller. Looking at the median prices this difference is roughly £40,000.
ggplot(df, aes(x=Freeorlease, y=Price)) +
stat_ydensity(trim = FALSE, aes(fill = Freeorlease)) +
scale_y_log10(breaks=round(10^seq(3.6,8,0.2))) +
coord_trans(y = "log10") +
coord_flip() +
theme(axis.text.x= element_text(angle=45, hjust=1))
This price different becomes clear after we observe the price distributions. There is a large peak on the Leasehold distribution at roughly £170k, and the known peak at ~£250k is not as sharp as Freehold properties. You also see more Freehold properties selling at prices beyond ~£500k. This suggests that Leasehold properties tend to be sold at lower price points and Freehold properties at higher price points. Given that most Leaseholds properties are inside central London, where you would expect the most extreme price points, this is a fairly surprising result.
by_y <- select(df, Price, Year) %>%
group_by(Year) %>%
summarise(Count=n(), Mean.Price=mean(Price), St.Dev.Price=sd(Price), Median.Price=median(Price))
ggplot(melt(by_y, 'Year'), aes(x=Year, y=value, group=variable, colour=variable)) +
facet_wrap(~ variable, scales='free') +
geom_line()
When looking at property prices by year, we see a steady increase across time. We observe a dramatic drop in the count of properties purchased in 2014 because data is only available until May. Regardless, the median price plot shows a sharp rise in property prices. This may have occurred as a response to the sharp rise in number of properties being purchased in 2013, implying a sudden growth in demand for London properties.
ggplot(df, aes(x= Year, y=Price)) +
stat_ydensity(trim = FALSE, aes(fill = as.factor(Year))) +
scale_y_log10(breaks=round(10^seq(3.6,8,0.2))) +
coord_trans(y = "log10") +
coord_flip() +
theme(axis.text.x= element_text(angle=45, hjust=1))
Looking at the price distributions we see a clear increase in property price across the years. Note that the peaks at ~£250k and ~£500k appears to be fairly constant. Towards 2014 however, we start to see more properties being purchased beyond these price points and less properties being sold at the lower price points.
by_m <- select(df, Price, Month) %>%
group_by(Month) %>%
summarise(Count=n(), Mean.Price=mean(Price), St.Dev.Price=sd(Price), Median.Price=median(Price))
ggplot(melt(by_m), aes(x=Month, y=value, group=variable, colour=variable)) +
scale_x_discrete('Time',breaks = by_m$Month[grep('(-0[369]|-12)$', by_m$Month)]) +
facet_wrap(~ variable, scales='free') +
geom_line() +
theme(axis.text.x= element_text(angle=75, hjust=1))
## Using Month as id variables
We’ll drill down on the data to see how prices changes across time by months. The mean and median prices both shows a gradual increase with time. It’s interesting to note that the standard deviation is also increases, indicating more dispersed prices from the mean. There is a sharp rise in standard deviation around the start of 2012, where it appears to stabilise shortly after. This means that prices suddenly started varying a lot more around this time.
There appears to be some seasonality in the data, especially in Count (number of purchases). To investigate this we’ll look at all these numbers per month, averaged across all years (except 2014) due to incomplete data.
by_mm <- select(df, Price, Month) %>%
filter(!grepl('2014-', Month)) %>% # Remove 2014 because the year is incomplete
mutate(MonthAve = sub('\\d+-', '', Month)) %>%
group_by(MonthAve) %>%
summarise(Count=n(), Mean.Price=mean(Price), St.Dev.Price=sd(Price), Median.Price=median(Price))
ggplot(melt(by_mm), aes(x=MonthAve, y=value, group=variable, colour=variable)) +
facet_wrap(~ variable, scales='free') +
geom_line()
## Using MonthAve as id variables
The graphs make it clear that house sale volume increases in the summer months resulting in a pricing surge. The astute buyer will be sure to go house hunting closer to March when competition is scarcer and prices are lower!
Finally, we’ll have a look at how these factors vary across different parts of London. First we will look at a heatmap of the number of properties purchases by their latitude/longitude and overlay it on top of a London map.
small_df <- select(df, Year, Latitude, Longitude)
rm(df)
london <- get_map('london', zoom=11)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=london&zoom=11&size=%20640x640&scale=%202&maptype=terrain&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=london&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
london_map <- ggmap(london, extent='panel')
london_map +
stat_density2d(
data = small_df,
aes(x = Longitude, y = Latitude,
fill = ..level..,
alpha = ..level..),
size = 0.1, bins = 8,
geom = 'polygon',
) +
scale_fill_gradient(
'Density'
) +
scale_alpha('Density',
guide=FALSE,
range = c(0, 0.8)
) +
ggtitle('Number of property purchases between Jan-09 and May-14')
## Warning: Removed 993583 rows containing non-finite values
## (stat_density2d).
From this map we observe that the majority of property purchases occurred near central London. Several centroids emerge around areas such as Poplar, Kensington, Maida Vale, Hackney, and Battersea, to name a few. We also observe some activities around some commuting hubs such as Croydon, Bromley, Kingston upon Thames, and Surbiton.
Now we’ll look at how these purchasing activities change across time.
london_map +
stat_density2d(
data = small_df,
aes(x = Longitude, y = Latitude,
fill = ..level..,
alpha = ..level..),
size = 0.1, bins = 8,
geom = 'polygon',
) +
scale_fill_gradient(
'Density'
) +
scale_alpha('Density',
guide=FALSE,
range = c(0, 0.8)
) +
facet_wrap(~ Year) +
ggtitle('Number of property purchases between Jan-09 and May-14')
## Warning: Removed 167927 rows containing non-finite values
## (stat_density2d).
## Warning: Removed 178771 rows containing non-finite values
## (stat_density2d).
## Warning: Removed 177516 rows containing non-finite values
## (stat_density2d).
## Warning: Removed 178652 rows containing non-finite values
## (stat_density2d).
## Warning: Removed 215276 rows containing non-finite values
## (stat_density2d).
## Warning: Removed 75441 rows containing non-finite values (stat_density2d).
Several interesting patterns emerge. We can see that most of the purchasing activities in Poplar occurred around 2010 and then moved south of the river toward Canada Water. Most of the purchases in East London also slowly moved south-east towards areas near Shadwell and Wapping. Purchasing activities in Wesminster and directly south near Battersea remained fairly constant. In recent years we also observe more purchases being made in less central areas, such as in South London in areas near Crystal Palace.
To visualise how house price varies across different areas of London, I’ve built an interactive web visualisation using d3.js. In this visualisation you can select different values you’d like to see overlaid on different London boroughs, then hover and zoom to investigate the data. A small preview is shown below. You can try it out here.
Feel free to explore the data yourself and let me know what you find. You can find me on twitter at @thinh_ha.