Predicting Melbourne housing prices

Finding homes that ‘fit’…

Jacob Mathew (s4012538)

Last updated: 28 May, 2023

Introduction

Source: www.istockphoto.com

  • Culturally, for many individuals and families, owning a house is part of the Great Australian Dream. (‘Australian Dream’, 2022)
  • Following decades of sustained price increase, Australian housing valuation lies among the highest in the world. (House-price-to-income ratio in selected countries 2022)
  • Unlike other assets, like equities for which discounted cash flow methods can be used, housing valuation is relative, particularly for residential housing, which, unlike investment property does not afford an income stream.
  • Understanding the drivers of housing valuation is of importance to help buyers, sellers and agents to make more informed and data-driven decisions.

Problem Statement

Data: Description

melb_data = fread("../02 DATA/Melbourne_housing_FULL.csv") %>%
  filter(Bathroom < 7 & Bathroom > 0 & Bedroom2 < 7 & Bedroom2 > 0 & 
           Landsize < 2500 & Landsize > 0 & Distance < 25 & Type == "h") %>%
  select (Date, Address, Postcode, Price, Distance, Landsize, BuildingArea, 
          Bedroom2, Bathroom) %>%
  na.omit() %>% group_by(Address, Postcode) %>% arrange(desc(Date)) %>% slice(1) %>% 
  setDT # data.table syntax can be more succinct

Pre-processing: Outliers

Pre-processing: Outliers [cont]

dbscan_result <- 
  melb_data[, .(Price, Landsize, BuildingArea)] %>% 
  na.omit %>% scale() %>% dbscan(eps = 0.75, minPts = 5)
melb_data[, Outlier := dbscan_result$cluster == 0]

p <- plot_ly(melb_data, x = ~Landsize, y = ~BuildingArea, 
  type = "scatter", mode = "markers", color = ~Outlier, 
  colors = c("cornflowerblue", "salmon"), alpha = 0.7, 
  hovertext = ~paste0(Address, " ", Postcode, "<br>Price: $", 
     Price, "<br>Land: ", Landsize, "<br>Building: ", 
     BuildingArea, "<br>Bed: ", Bedroom2, "<br>Bath: ", 
     Bathroom, "<br>Distance: ", Distance)) %>%
  layout(xaxis = list(title = "Landsize"), 
     yaxis = list(title = "BuildingArea"),
     title = paste0("BuildingArea vs. Landsize ",
            "(with DBSCAN outlier prediction)")) %>%  
  layout(legend = list(title = "Predicted Outlier")); p

This is an interactive graph. If viewing an online version of this file: Hover over points to see property details, and click and drag to zoom.

Pre-processing: Outliers [cont]

labeled_outliers <- ifelse(
      (melb_data$Address == "30 Pyne St" & melb_data$Postcode == 3162) | # Err price
      (melb_data$Address == "52 Monash St" & melb_data$Postcode == 3075) | # Err  BuildingArea
      (melb_data$Address == "7 Garnet St" & melb_data$Postcode == 3056) | # Err  BuildingArea
      (melb_data$Address == "53 Stewart St" & melb_data$Postcode == 3056) | # Err Landsize
      (melb_data$Address == "35 Bevis St" & melb_data$Postcode == 3170) | # Err price
      (melb_data$Address == "3 Lewisham La" & melb_data$Postcode == 3181) | # Err Landsize
      (melb_data$Address == "77 Suffolk St" & melb_data$Postcode == 3012) | # error BuildingArea
      (melb_data$Address == "24 Fitzwilliam St" & melb_data$Postcode == 3101) | # Err BuildingArea
      (melb_data$Address == "46 Athelstan Rd" & melb_data$Postcode == 3124) | # Err BuildingArea
      (melb_data$Address == "19 Warringal St" & melb_data$Postcode == 3105) | # Err price
      (melb_data$Address == "20 Wright St" & melb_data$Postcode == 3204) | # Err Landsize
      (melb_data$Address == "171 Moreland Rd" & melb_data$Postcode == 3058), # units not house
      1, 0
    )
melb_data = melb_data[labeled_outliers==0 ][ # remove explicitlylabelled outliers
  !between(BuildingArea / Landsize, 0.95, 1.05)][ # remove the y=x cluster
    !BuildingArea / Landsize > 1.5][ # remove implausible BuildingAreas
      !BuildingArea < 75 ] # remove implausible BuildingAreas
  • 79 possible outliers found by DBSCAN (using k = 5, \(\epsilon\) = 0.75) were checked against online sources. (realestate.com.au, 2023; www.onthehouse.com.au, 2023)
  • 12 clearly erroneous observations found during the above review are deleted and BuildingArea is further filtered based on the comments in the previous slides.

Descriptive Statistics

# Calculating summary statistics using 
# summarise across columns
melb_data %>% summarise( across( 
  .cols = Price:BuildingArea,
  .fns = list( Min = min, Median = median, 
     `1Q` = ~quantile(., 0.25, na.rm = TRUE),
     `3Q` = ~quantile(., 0.75, na.rm = TRUE), 
     Max = max, Mean = mean, 
     SD = sd, N = ~length(.), 
     NAs = ~sum(is.na(.))) ) ) %>%
# Transposing the summary statistics DataFrame
  pivot_longer(cols = everything(), 
     names_to = c(".value", "Statistic"), 
     names_sep = "_") %>%
  kable(digits = c(0, 0, 1, 1, 1)) %>%
    row_spec(8, bold=T) %>% row_spec(9, bold=T)
Statistic Price Distance Landsize BuildingArea
Min 260000 1.3 66.0 75.0
Median 1100000 10.6 560.0 148.0
25th Centile 783000 7.0 355.0 116.0
75th Centile 1550000 14.2 670.0 197.0
Max 8000000 24.8 2187.0 789.0
Mean 1273331 11.0 534.2 166.8
SD 711052 5.1 232.9 71.5
N 6001 6001.0 6001.0 6001.0
NAs 0 0.0 0.0 0.0

Visualisation of study variables

  • The graphs pair histograms with horizontally orientated boxplots for each variable.
    • Dashed vertical red lines indicate the mean. Transverse red error bars on the x axis represent \(\mu\) ± 3 \(\sigma\) representing the typical plausible range for normally distributed data.
  • Price (a), the target variable, is quite skewed to the right as will be discussed on the next slide.
  • Distance (b) is symmetrical (naturally, land area increases \(\propto\) Distance\(^2\) but this is offset by lower population density). It is visibly platykurtic.
  • Landsize (c) is bi-modal, likely reflecting distinct sub-populations of smaller inner city properties, and larger outer city properties.
  • BuildingArea (d) shows significant positive skew even after filtering.
variables <- c("Price", "Distance", "Landsize", "BuildingArea")
grouped_plots <- lapply(variables, create_grouped_plot)
subplot <- plot_grid(plotlist = grouped_plots, nrow = 2, ncol = 2, 
  labels="auto"); subplot

Tranformation of dependent variable

  • Per previous slide, Price is right skewed.
  • Large positive outliers may influence fit significantly and may lead to hetereocedasticity of residuals at higher Prices.
  • One option is logarithmic transformation of Price: (a) This yields a more symmetrical distribution, though (b) a slight positive skew remains, with higher density around log(Price) = 13 than the normal distribution.
melb_data[, log_Price := log(Price)]
tf_Price = create_grouped_plot("log_Price") # See appendix
qq_tf_Price = melb_data %>% 
  ggplot(aes(sample = log_Price)) + 
  geom_qq(color="purple", alpha = 0.4) + 
  geom_qq_line(colour = 'orange', linetype='dashed') + 
  theme_minimal() + 
  labs(x="Normal quantiles", y="log_Price quantiles")
plot_grid(plotlist = list(tf_Price, qq_tf_Price), 
  nrow = 1, ncol = 2, labels = "auto")

Exploration: multi-collinearity

  • This refers to pairwise or higher order linear relationships between predictors, leads to unstable coefficient estimation, complicates coefficient interpetation, may mislead hypothesis testing of individual predictor significance, and can challenge the matrix inversion needed for OLS parameter estimation.
  • Here, there are only weak to moderate pairwise linear relationships between the predictors, which addresses this concern.
  • Supporting our subjective impression, pairwise Pearson correlation coefficients in the top diagonal show only moderate values. (‘Multicollinearity’, 2023)
ggpairs(melb_data, 
  columns = c( "log_Price", "Landsize", 
               "BuildingArea", "Distance")) + 
  theme_minimal() +
  scale_fill_manual(values = c("cornflowerblue"))  + 
  ggtitle("Paired scatter-plots for study variables")

Hypothesis Testing

  • Our null hypothesis is that there is no linear relationship between the predictors and log(Price).
    • Formally, given the model: \[ \begin{align*} & \hspace{1.5cm} \ln(\text{Price}) = \beta_0 + \beta_1 \cdot \text{Distance} + \beta_2 \cdot \text{Landsize} + \beta_3 \cdot \text{BuildingArea} + \epsilon \end{align*} \]
    • Our hypotheses are: \[ \begin{align*} &\text{H}_0: \hspace{0.25 cm} \text{An intercept only model explains as much variability in ln(Price) as the full model}. \\ &\hspace{1.5cm} i.e. \quad \beta_0 = \mu_{\text{ln(Price)}}, \beta_1 = \beta_2 = \beta_3 = 0 & \\ &\text{H}_1: \hspace{0.25 cm} \text{The full model explains more variability in ln(Price) than an intercept only model.} \\ &\hspace{1.5cm} i.e. \quad \beta_1 \text{ or } \beta_2 \text{ or } \beta_3 \neq 0 & \end{align*} \]
  • We will assume an \(\alpha\) of 0.05 when assessing significance.
  • The assumptions underlying linear regression are:
    • Linearity: there is a linear relationship between predictors and log_Price.
    • Residuals (\(\epsilon\)) follow a normal distribution
    • Homoscedasticity (equal variance) of residuals across the range of the dependent variable.
    • Model fit is sensitive to extreme outliers
    • Estimated model coefficients can be very unstable in presence of multi-collinearity.

Model fit and residual plots to assess assumptions

  • Residual vs fitted: Residuals are symmetrically distributed around zero throughout range of fitted values, supporting linearity assumption.

  • QQ residuals: Residual distribution is slightly thinner at left tail than normal distribution. There are some more extreme values at both tails that might be expected but this is driven by a few observations. Overall, assumption of normality seems reasonable.

  • Scale-location: The scaled residuals are symmetrical around 1 through range of fitted values and there is no sustained trend in the LOESS line supporting homoscedasticity of the residuals.

  • Residuals vs leverage: No outliers exceed Cook’s distance, suggesting the model fit has not been greatly biased by these.

lm_model <- lm(log_Price ~ Landsize + BuildingArea + Distance, 
               data = melb_data)
par(mfrow=c(2,2))
plot(lm_model)

Model summary

x = summary(lm_model); print(x)
## 
## Call:
## lm(formula = log_Price ~ Landsize + BuildingArea + Distance, 
##     data = melb_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68314 -0.22466 -0.01561  0.22920  1.40568 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.377e+01  1.348e-02 1020.88   <2e-16 ***
## Landsize      5.531e-04  2.128e-05   25.99   <2e-16 ***
## BuildingArea  3.076e-03  6.291e-05   48.89   <2e-16 ***
## Distance     -5.807e-02  8.955e-04  -64.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3182 on 5997 degrees of freedom
## Multiple R-squared:  0.5627, Adjusted R-squared:  0.5625 
## F-statistic:  2572 on 3 and 5997 DF,  p-value: < 2.2e-16
  • F-statistic assesses the overall significance of the model, by comparing the fit of the full model (including all predictors) to a reduced model with no predictors (intercept only model).
  • Our model’s numerator has 3 (\(= predictors\)) degrees of freedom (DoF) and denominator has 5997 (\(= observations - predictors - 1\)) DoF.
  • The F-statistic value of 2572, with the given DoF, yields a p-value of < 2.2e-16, being the very small probability of observing an F-statistic as extreme as this were the null hypothesis true.
  • We therefore reject the null hypothesis and accept the alternative that a linear model including these predictors explains more variabillity in the dependent variable than an intercept only model.

Model parameters

Discussion

References

Appendix: Custom function to visualise distribution

create_grouped_plot <- function(variable) {
  # Calculate mean and standard deviation
  mean_value <- mean(melb_data[[variable]])
  sd_value <- sd(melb_data[[variable]])
  min_scale = min(c(melb_data[[variable]], mean_value-3*sd_value) )
  max_scale = max(c(melb_data[[variable]], mean_value+3*sd_value) )
  
  # Histogram with Density Plot
  histogram <- ggplot(data = melb_data, aes(x = !!sym(variable)), linewidth = 0.2) +
    geom_histogram(aes(y = ..count..), bins = 100, fill = "cornflowerblue", 
                   color = "cornflowerblue", alpha = 0.5) +
    geom_vline(xintercept = mean_value, color = "red", size = 0.5, linetype="dashed") +
    geom_errorbarh(aes(y = 0, xmin = mean_value-3*sd_value, 
                      xmax = mean_value+3*sd_value), 
                  width = 2, color = "red", size = 0.5, height=50, linetype = "solid") +
    scale_x_continuous(limits = c(min_scale, max_scale)) + 
    labs(title = "", x = NULL, y = "Frequency") +
    theme_minimal() +
    theme(
      axis.text.x = element_blank(),
      plot.margin = margin(0, 0, 0, 0)
    )

  # Boxplot
  boxplot <- ggplot(data = melb_data, aes(y = !!sym(variable)), linewidth = 0.5) +
    geom_boxplot(fill = "orange", color = "salmon", alpha = 0.5) +
    geom_hline(yintercept = mean_value, color = "red", size = 0.5, linetype="dashed") +
    labs(title = "", y = variable, x = "") +
    scale_y_continuous(labels = label_comma(), limits =c(min_scale, max_scale) ) +
    theme_minimal() +
    theme(
      axis.text.y = element_blank(),
      panel.grid.major = element_blank(),
      plot.margin = margin(0, 0, 0, 0)
    ) +
    coord_flip()
  
  # Combine the plots
  plot_grid(histogram, boxplot, nrow = 2, align = "v",
            rel_heights = c(5, 2), greedy = F)
}

Appendix: Libraries used

library(tidyverse)
library(ggplot2) 
library(knitr) 
library(data.table)
library(plotly)
library(dbscan)
library(scales)
library(cowplot)
library (patchwork)
library(GGally)
library(kableExtra)