Introduction

Source: www.istockphoto.com

Culturally, for many individuals and families, owning a house is part of the Great Australian Dream. (‘Australian Dream’, 2022)
Following decades of sustained price increase, Australian housing valuation lies among the highest in the world. (House-price-to-income ratio in selected countries 2022)
Unlike other assets, like equities for which discounted cash flow methods can be used, housing valuation is relative, particularly for residential housing, which, unlike investment property does not afford an income stream.
Understanding the drivers of housing valuation is of importance to help buyers, sellers and agents to make more informed and data-driven decisions.

Problem Statement

Starting from the intuitive but naive assumptions that “bigger is better and proximity pays” we seek to test the hypothesis that ceteris paribus, Melbourne residential property prices are linearly related to property and building size and inversely so to distance from the city centre.
Approach
- We will collect publicly available house pricing data and filter to properties of interest
- Obvious outliers will be identified using clustering algorithms as the dataset is too large to parse manually.
- During data exploration any further data transformation will be considered
- A linear model will be fit to predict Price
- The assumptions underlying linear regression will be assessed.
- The model’s performance and summary parameters will be discussed.

Data: Description

Data were obtained from an online data source describing Melbourne housing sale prices, and house characteristics between February 2016 and March 2018. (Pino, 2018)
The dependent variable was sale price (Price, in Australian dollars), and other features included distance to Melbourne CBD (Distance, km), land area (Landsize, m\(^2\)), building area (BuildingArea, m\(^2\)), number of bedrooms (Room2), number of bathrooms (Bathroom), number of car parking spots (Car).
Data were filtered to standalone houses, within 25 km of the CBD, with 6 or fewer bedrooms, bathrooms and car spots,and a land area of < 2500 m\(^2\) as these were felt a priori to be reasonable limits for residential property within inner metropolitan Melbourne. For houses sold more than once during the time period only the last transaction was included.

melb_data = fread("../02 DATA/Melbourne_housing_FULL.csv") %>%
  filter(Bathroom < 7 & Bathroom > 0 & Bedroom2 < 7 & Bedroom2 > 0 & 
           Landsize < 2500 & Landsize > 0 & Distance < 25 & Type == "h") %>%
  select (Date, Address, Postcode, Price, Distance, Landsize, BuildingArea, 
          Bedroom2, Bathroom) %>%
  na.omit() %>% group_by(Address, Postcode) %>% arrange(desc(Date)) %>% slice(1) %>% 
  setDT # data.table syntax can be more succinct

We will restrict further analysis to the Distance, Landsize and BuildingArea variables.

Pre-processing: Outliers

Our filtered dataset constrains Distance and Landsize to plausible values, but there are some erroneous Price and BuildingArea values. Clustering algorithms are an efficient way to screen for outliers.
DBSCAN is one flexible approach: Features were scaled, K was and \(\epsilon\) were chosen using the inflection on a K-distance graph (not shown), favouring smaller k and slightly larger \(\epsilon\) to enrich for the extreme outliers that are most likely to bias the fit of a linear model. (Hahsler, Piekenbrock and Doran, 2019)
The scatter-plot on the next slide highlights outliers (in red) using DBSCAN clustering based on Price, Landsize and BuildingArea after scaling those features.
- Note is made of implausible BuildingArea values (< 50 m\(^2\), or having area much greater than Landsize).
- Note suspicious cluster of observations along the \(x=y\) line (which likely reflect transcription errors where Landsize was inadvertently entered into BuildingArea).
- Outliers to the top left of the scatter plot reflect high BuildingArea relative to Landsize, many of which are likely erroneous.
- Those to right mostly reflect large Landsize and may not be erroneous.
- Outliers within main cluster reflect high pricing relative to building and property size. Many of the latter are properties in expensive suburbs, but some are transcription errors of Price.

Pre-processing: Outliers [cont]

dbscan_result <- 
  melb_data[, .(Price, Landsize, BuildingArea)] %>% 
  na.omit %>% scale() %>% dbscan(eps = 0.75, minPts = 5)
melb_data[, Outlier := dbscan_result$cluster == 0]

p <- plot_ly(melb_data, x = ~Landsize, y = ~BuildingArea, 
  type = "scatter", mode = "markers", color = ~Outlier, 
  colors = c("cornflowerblue", "salmon"), alpha = 0.7, 
  hovertext = ~paste0(Address, " ", Postcode, "<br>Price: $", 
     Price, "<br>Land: ", Landsize, "<br>Building: ", 
     BuildingArea, "<br>Bed: ", Bedroom2, "<br>Bath: ", 
     Bathroom, "<br>Distance: ", Distance)) %>%
  layout(xaxis = list(title = "Landsize"), 
     yaxis = list(title = "BuildingArea"),
     title = paste0("BuildingArea vs. Landsize ",
            "(with DBSCAN outlier prediction)")) %>%  
  layout(legend = list(title = "Predicted Outlier")); p

This is an interactive graph. If viewing an online version of this file: Hover over points to see property details, and click and drag to zoom.

Pre-processing: Outliers [cont]

labeled_outliers <- ifelse(
      (melb_data$Address == "30 Pyne St" & melb_data$Postcode == 3162) | # Err price
      (melb_data$Address == "52 Monash St" & melb_data$Postcode == 3075) | # Err  BuildingArea
      (melb_data$Address == "7 Garnet St" & melb_data$Postcode == 3056) | # Err  BuildingArea
      (melb_data$Address == "53 Stewart St" & melb_data$Postcode == 3056) | # Err Landsize
      (melb_data$Address == "35 Bevis St" & melb_data$Postcode == 3170) | # Err price
      (melb_data$Address == "3 Lewisham La" & melb_data$Postcode == 3181) | # Err Landsize
      (melb_data$Address == "77 Suffolk St" & melb_data$Postcode == 3012) | # error BuildingArea
      (melb_data$Address == "24 Fitzwilliam St" & melb_data$Postcode == 3101) | # Err BuildingArea
      (melb_data$Address == "46 Athelstan Rd" & melb_data$Postcode == 3124) | # Err BuildingArea
      (melb_data$Address == "19 Warringal St" & melb_data$Postcode == 3105) | # Err price
      (melb_data$Address == "20 Wright St" & melb_data$Postcode == 3204) | # Err Landsize
      (melb_data$Address == "171 Moreland Rd" & melb_data$Postcode == 3058), # units not house
      1, 0
    )
melb_data = melb_data[labeled_outliers==0 ][ # remove explicitlylabelled outliers
  !between(BuildingArea / Landsize, 0.95, 1.05)][ # remove the y=x cluster
    !BuildingArea / Landsize > 1.5][ # remove implausible BuildingAreas
      !BuildingArea < 75 ] # remove implausible BuildingAreas

79 possible outliers found by DBSCAN (using k = 5, \(\epsilon\) = 0.75) were checked against online sources. (realestate.com.au, 2023; www.onthehouse.com.au, 2023)
12 clearly erroneous observations found during the above review are deleted and BuildingArea is further filtered based on the comments in the previous slides.

Descriptive Statistics

We are left with 6001 observations with superficially plausible variable values.
- Ranges of Price and BuildingArea well exceeds 6 \(\sigma\) with IQR less than 1.33 \(\sigma\), and \(\mu\) > median reflecting very right skewed, non-normal distributions.
- Range of Landsize is almost 10 \(\sigma\) though IQR is 1.4 \(\sigma\) and \(\mu\) < median, and, as we shall see, follows a bimodal distribution.
- Distribution of distance is more symmetrical than the other variables.

# Calculating summary statistics using 
# summarise across columns
melb_data %>% summarise( across( 
  .cols = Price:BuildingArea,
  .fns = list( Min = min, Median = median, 
     `1Q` = ~quantile(., 0.25, na.rm = TRUE),
     `3Q` = ~quantile(., 0.75, na.rm = TRUE), 
     Max = max, Mean = mean, 
     SD = sd, N = ~length(.), 
     NAs = ~sum(is.na(.))) ) ) %>%
# Transposing the summary statistics DataFrame
  pivot_longer(cols = everything(), 
     names_to = c(".value", "Statistic"), 
     names_sep = "_") %>%
  kable(digits = c(0, 0, 1, 1, 1)) %>%
    row_spec(8, bold=T) %>% row_spec(9, bold=T)

Statistic	Price	Distance	Landsize	BuildingArea
Min	260000	1.3	66.0	75.0
Median	1100000	10.6	560.0	148.0
25th Centile	783000	7.0	355.0	116.0
75th Centile	1550000	14.2	670.0	197.0
Max	8000000	24.8	2187.0	789.0
Mean	1273331	11.0	534.2	166.8
SD	711052	5.1	232.9	71.5
N	6001	6001.0	6001.0	6001.0
NAs	0	0.0	0.0	0.0

Visualisation of study variables

The graphs pair histograms with horizontally orientated boxplots for each variable.
- Dashed vertical red lines indicate the mean. Transverse red error bars on the x axis represent \(\mu\) ± 3 \(\sigma\) representing the typical plausible range for normally distributed data.
Price (a), the target variable, is quite skewed to the right as will be discussed on the next slide.
Distance (b) is symmetrical (naturally, land area increases \(\propto\) Distance\(^2\) but this is offset by lower population density). It is visibly platykurtic.
Landsize (c) is bi-modal, likely reflecting distinct sub-populations of smaller inner city properties, and larger outer city properties.
BuildingArea (d) shows significant positive skew even after filtering.

variables <- c("Price", "Distance", "Landsize", "BuildingArea")
grouped_plots <- lapply(variables, create_grouped_plot)
subplot <- plot_grid(plotlist = grouped_plots, nrow = 2, ncol = 2, 
  labels="auto"); subplot

Tranformation of dependent variable

Per previous slide, Price is right skewed.
Large positive outliers may influence fit significantly and may lead to hetereocedasticity of residuals at higher Prices.
One option is logarithmic transformation of Price: (a) This yields a more symmetrical distribution, though (b) a slight positive skew remains, with higher density around log(Price) = 13 than the normal distribution.

melb_data[, log_Price := log(Price)]
tf_Price = create_grouped_plot("log_Price") # See appendix
qq_tf_Price = melb_data %>% 
  ggplot(aes(sample = log_Price)) + 
  geom_qq(color="purple", alpha = 0.4) + 
  geom_qq_line(colour = 'orange', linetype='dashed') + 
  theme_minimal() + 
  labs(x="Normal quantiles", y="log_Price quantiles")
plot_grid(plotlist = list(tf_Price, qq_tf_Price), 
  nrow = 1, ncol = 2, labels = "auto")

Exploration: multi-collinearity

This refers to pairwise or higher order linear relationships between predictors, leads to unstable coefficient estimation, complicates coefficient interpetation, may mislead hypothesis testing of individual predictor significance, and can challenge the matrix inversion needed for OLS parameter estimation.
Here, there are only weak to moderate pairwise linear relationships between the predictors, which addresses this concern.
Supporting our subjective impression, pairwise Pearson correlation coefficients in the top diagonal show only moderate values. (‘Multicollinearity’, 2023)

ggpairs(melb_data, 
  columns = c( "log_Price", "Landsize", 
               "BuildingArea", "Distance")) + 
  theme_minimal() +
  scale_fill_manual(values = c("cornflowerblue"))  + 
  ggtitle("Paired scatter-plots for study variables")

Hypothesis Testing

Our null hypothesis is that there is no linear relationship between the predictors and log(Price).
- Formally, given the model: \[ \begin{align*} & \hspace{1.5cm} \ln(\text{Price}) = \beta_0 + \beta_1 \cdot \text{Distance} + \beta_2 \cdot \text{Landsize} + \beta_3 \cdot \text{BuildingArea} + \epsilon \end{align*} \]
- Our hypotheses are: \[ \begin{align*} &\text{H}_0: \hspace{0.25 cm} \text{An intercept only model explains as much variability in ln(Price) as the full model}. \\ &\hspace{1.5cm} i.e. \quad \beta_0 = \mu_{\text{ln(Price)}}, \beta_1 = \beta_2 = \beta_3 = 0 & \\ &\text{H}_1: \hspace{0.25 cm} \text{The full model explains more variability in ln(Price) than an intercept only model.} \\ &\hspace{1.5cm} i.e. \quad \beta_1 \text{ or } \beta_2 \text{ or } \beta_3 \neq 0 & \end{align*} \]
We will assume an \(\alpha\) of 0.05 when assessing significance.

The assumptions underlying linear regression are:
- Linearity: there is a linear relationship between predictors and log_Price.
- Residuals (\(\epsilon\)) follow a normal distribution
- Homoscedasticity (equal variance) of residuals across the range of the dependent variable.
- Model fit is sensitive to extreme outliers
- Estimated model coefficients can be very unstable in presence of multi-collinearity.

Model fit and residual plots to assess assumptions

Residual vs fitted: Residuals are symmetrically distributed around zero throughout range of fitted values, supporting linearity assumption.
QQ residuals: Residual distribution is slightly thinner at left tail than normal distribution. There are some more extreme values at both tails that might be expected but this is driven by a few observations. Overall, assumption of normality seems reasonable.
Scale-location: The scaled residuals are symmetrical around 1 through range of fitted values and there is no sustained trend in the LOESS line supporting homoscedasticity of the residuals.
Residuals vs leverage: No outliers exceed Cook’s distance, suggesting the model fit has not been greatly biased by these.

lm_model <- lm(log_Price ~ Landsize + BuildingArea + Distance, 
               data = melb_data)
par(mfrow=c(2,2))
plot(lm_model)

Model summary

x = summary(lm_model); print(x)

## 
## Call:
## lm(formula = log_Price ~ Landsize + BuildingArea + Distance, 
##     data = melb_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68314 -0.22466 -0.01561  0.22920  1.40568 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.377e+01  1.348e-02 1020.88   <2e-16 ***
## Landsize      5.531e-04  2.128e-05   25.99   <2e-16 ***
## BuildingArea  3.076e-03  6.291e-05   48.89   <2e-16 ***
## Distance     -5.807e-02  8.955e-04  -64.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3182 on 5997 degrees of freedom
## Multiple R-squared:  0.5627, Adjusted R-squared:  0.5625 
## F-statistic:  2572 on 3 and 5997 DF,  p-value: < 2.2e-16

F-statistic assesses the overall significance of the model, by comparing the fit of the full model (including all predictors) to a reduced model with no predictors (intercept only model).
Our model’s numerator has 3 (\(= predictors\)) degrees of freedom (DoF) and denominator has 5997 (\(= observations - predictors - 1\)) DoF.
The F-statistic value of 2572, with the given DoF, yields a p-value of < 2.2e-16, being the very small probability of observing an F-statistic as extreme as this were the null hypothesis true.
We therefore reject the null hypothesis and accept the alternative that a linear model including these predictors explains more variabillity in the dependent variable than an intercept only model.

Model parameters

The intercept and all predictors have T-statistic p-values < 2.2e-16, meaning in the context of this valid and significant linear model, they are individually significant, linear, predictors of ln(Price) with non-zero coefficients.
The intercept term has an estimated coefficient of 13.8, which represents the average log_Price when all predictor variables are zero (an implausible situation in this case!).
Landsize, BuildingArea, and Distance have estimated coefficients of 0.00055, 0.00308, and -0.05807, respectively.
- These indicate expected change in log_Price associated with a one-unit increase in each respective predictor, holding other variables constant.
- Specifically, log_Price could be expected to increase by 0.055 for every 100 m\(^2\) increase in land size, by 0.308 for every 100 m\(^2\) increase in building size and to decrease by 0.058 for every km distance from the CBD.
- The decision to log transform Price has yielded a reasonable linear model, but may have sacrificed some interpretability, as the above refer to changes in log(Price).
R-squared value (0.563) indicates proportion of variance in log_Price that is explained by the predictors. It represents the goodness-of-fit of the model, with higher values indicating a better fit, with a plausible range of 0 (no explanatory power) to 1 (perfect fit).
The Adjusted R-squared (0.562) adjusts the R-squared value for the number of predictors and sample size, and in this instance is comparable, given a relatively parsimonious model.

Discussion

Overall we found that that the logarithm of house Price varied inversely with distance from Melbourne CBD and increases linearly with land size, and building size, that such a model satisfies assumptions of linear regression and explains about half the variation in the natural logarithm of house price during the study period.
This might be regarded as a useful explanatory model, demonstrating the importance of these simple metrics in explaining house price, though for the reasons below, will likely not form a practically usable predictive model.
- Initial data pre-processing filtered to plausible ranges for residential housing, but by excluding observations with missing values, may have biased the analysis.
- The model did not account for other important determinants of housing price, such as clustering within geographic regions or postcodes, beyond distance from CBD, and does not account for temporal trends, which even within the study period of two years may have been significant.
- Internal validation (e.g. k-fold cross-validation) was not attempted here, and no attempt was made to evaluate out-of-sample performance.
- Care must be taken not to apply the model outside of the study domain (i.e. properties outside inner metropolitan Melbourne, those of a different type - e.g. apartments, those significantly larger or smaller than represented in the study and for time periods that are far outside those in the study).
Further research might evaluate a more detailed set of predictors of price such as geographic location, explore diverse geographic locations, account for temporal trends and the state of the broader economy, and attempt both internal and external validation.
In conclusion, proximity to urban centres, and property and building size are important determinants of price, and can by in themselves explain a significant proportion of the variation in Melbourne house prices.

References

‘Australian Dream’ (2022) Wikipedia. Available at: https://en.wikipedia.org/w/index.php?title=Australian_Dream&oldid=1117008920 (Accessed: 28 May 2023).
Hahsler, M., Piekenbrock, M. and Doran, D. (2019) ‘dbscan : Fast Density-Based Clustering with R’, Journal of Statistical Software, 91(1). Available at: https://doi.org/10.18637/jss.v091.i01.
House-price-to-income ratio in selected countries 2022 (no date) Statista. Available at: https://www.statista.com/statistics/237529/price-to-income-ratio-of-housing-worldwide/ (Accessed: 28 May 2023).
‘Multicollinearity’ (2023) Wikipedia. Available at: https://en.wikipedia.org/w/index.php?title=Multicollinearity&oldid=1140956842#Consequences (Accessed: 28 May 2023).
www.onthehouse.com.au (2023) www.onthehouse.com.au. Available at: https://www.onthehouse.com.au/ (Accessed: 28 May 2023)
Pino, T. (2018) Melbourne Housing Market. Available at: https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market (Accessed: 28 May 2023).
realestate.com.au (2023) realestate.com.au. Available at: https://realestate.com.au (Accessed: 28 May 2023).

Appendix: Custom function to visualise distribution

create_grouped_plot <- function(variable) {
  # Calculate mean and standard deviation
  mean_value <- mean(melb_data[[variable]])
  sd_value <- sd(melb_data[[variable]])
  min_scale = min(c(melb_data[[variable]], mean_value-3*sd_value) )
  max_scale = max(c(melb_data[[variable]], mean_value+3*sd_value) )
  
  # Histogram with Density Plot
  histogram <- ggplot(data = melb_data, aes(x = !!sym(variable)), linewidth = 0.2) +
    geom_histogram(aes(y = ..count..), bins = 100, fill = "cornflowerblue", 
                   color = "cornflowerblue", alpha = 0.5) +
    geom_vline(xintercept = mean_value, color = "red", size = 0.5, linetype="dashed") +
    geom_errorbarh(aes(y = 0, xmin = mean_value-3*sd_value, 
                      xmax = mean_value+3*sd_value), 
                  width = 2, color = "red", size = 0.5, height=50, linetype = "solid") +
    scale_x_continuous(limits = c(min_scale, max_scale)) + 
    labs(title = "", x = NULL, y = "Frequency") +
    theme_minimal() +
    theme(
      axis.text.x = element_blank(),
      plot.margin = margin(0, 0, 0, 0)
    )

  # Boxplot
  boxplot <- ggplot(data = melb_data, aes(y = !!sym(variable)), linewidth = 0.5) +
    geom_boxplot(fill = "orange", color = "salmon", alpha = 0.5) +
    geom_hline(yintercept = mean_value, color = "red", size = 0.5, linetype="dashed") +
    labs(title = "", y = variable, x = "") +
    scale_y_continuous(labels = label_comma(), limits =c(min_scale, max_scale) ) +
    theme_minimal() +
    theme(
      axis.text.y = element_blank(),
      panel.grid.major = element_blank(),
      plot.margin = margin(0, 0, 0, 0)
    ) +
    coord_flip()
  
  # Combine the plots
  plot_grid(histogram, boxplot, nrow = 2, align = "v",
            rel_heights = c(5, 2), greedy = F)
}

Predicting Melbourne housing prices

Finding homes that ‘fit’…

RPubs link information

Introduction

Problem Statement

Data: Description

Pre-processing: Outliers

Pre-processing: Outliers [cont]

Pre-processing: Outliers [cont]

Descriptive Statistics

Visualisation of study variables

Tranformation of dependent variable

Exploration: multi-collinearity

Hypothesis Testing

Model fit and residual plots to assess assumptions

Model summary

Model parameters

Discussion

References

Appendix: Custom function to visualise distribution

Appendix: Libraries used