Jacob Mathew (s4012538)
Last updated: 28 May, 2023
Source: www.istockphoto.com
PricePrice, in
Australian dollars), and other features included distance to Melbourne
CBD (Distance, km), land area (Landsize,
m\(^2\)), building area
(BuildingArea, m\(^2\)),
number of bedrooms (Room2), number of bathrooms
(Bathroom), number of car parking spots
(Car).melb_data = fread("../02 DATA/Melbourne_housing_FULL.csv") %>%
filter(Bathroom < 7 & Bathroom > 0 & Bedroom2 < 7 & Bedroom2 > 0 &
Landsize < 2500 & Landsize > 0 & Distance < 25 & Type == "h") %>%
select (Date, Address, Postcode, Price, Distance, Landsize, BuildingArea,
Bedroom2, Bathroom) %>%
na.omit() %>% group_by(Address, Postcode) %>% arrange(desc(Date)) %>% slice(1) %>%
setDT # data.table syntax can be more succinctDistance,
Landsize and BuildingArea variables.Distance and
Landsize to plausible values, but there are some erroneous
Price and BuildingArea values. Clustering
algorithms are an efficient way to screen for outliers.Price,
Landsize and BuildingArea after scaling those
features.
BuildingArea values (<
50 m\(^2\), or having area much greater
than Landsize).Landsize was inadvertently
entered into BuildingArea).BuildingArea relative to Landsize, many of
which are likely erroneous.Landsize and may
not be erroneous.Price.dbscan_result <-
melb_data[, .(Price, Landsize, BuildingArea)] %>%
na.omit %>% scale() %>% dbscan(eps = 0.75, minPts = 5)
melb_data[, Outlier := dbscan_result$cluster == 0]
p <- plot_ly(melb_data, x = ~Landsize, y = ~BuildingArea,
type = "scatter", mode = "markers", color = ~Outlier,
colors = c("cornflowerblue", "salmon"), alpha = 0.7,
hovertext = ~paste0(Address, " ", Postcode, "<br>Price: $",
Price, "<br>Land: ", Landsize, "<br>Building: ",
BuildingArea, "<br>Bed: ", Bedroom2, "<br>Bath: ",
Bathroom, "<br>Distance: ", Distance)) %>%
layout(xaxis = list(title = "Landsize"),
yaxis = list(title = "BuildingArea"),
title = paste0("BuildingArea vs. Landsize ",
"(with DBSCAN outlier prediction)")) %>%
layout(legend = list(title = "Predicted Outlier")); pThis is an interactive graph. If viewing an online version of this file: Hover over points to see property details, and click and drag to zoom.
labeled_outliers <- ifelse(
(melb_data$Address == "30 Pyne St" & melb_data$Postcode == 3162) | # Err price
(melb_data$Address == "52 Monash St" & melb_data$Postcode == 3075) | # Err BuildingArea
(melb_data$Address == "7 Garnet St" & melb_data$Postcode == 3056) | # Err BuildingArea
(melb_data$Address == "53 Stewart St" & melb_data$Postcode == 3056) | # Err Landsize
(melb_data$Address == "35 Bevis St" & melb_data$Postcode == 3170) | # Err price
(melb_data$Address == "3 Lewisham La" & melb_data$Postcode == 3181) | # Err Landsize
(melb_data$Address == "77 Suffolk St" & melb_data$Postcode == 3012) | # error BuildingArea
(melb_data$Address == "24 Fitzwilliam St" & melb_data$Postcode == 3101) | # Err BuildingArea
(melb_data$Address == "46 Athelstan Rd" & melb_data$Postcode == 3124) | # Err BuildingArea
(melb_data$Address == "19 Warringal St" & melb_data$Postcode == 3105) | # Err price
(melb_data$Address == "20 Wright St" & melb_data$Postcode == 3204) | # Err Landsize
(melb_data$Address == "171 Moreland Rd" & melb_data$Postcode == 3058), # units not house
1, 0
)
melb_data = melb_data[labeled_outliers==0 ][ # remove explicitlylabelled outliers
!between(BuildingArea / Landsize, 0.95, 1.05)][ # remove the y=x cluster
!BuildingArea / Landsize > 1.5][ # remove implausible BuildingAreas
!BuildingArea < 75 ] # remove implausible BuildingAreasBuildingArea is further filtered based on the
comments in the previous slides.Price and BuildingArea well
exceeds 6 \(\sigma\) with IQR less than
1.33 \(\sigma\), and \(\mu\) > median reflecting very right
skewed, non-normal distributions.Landsize is almost 10 \(\sigma\) though IQR is 1.4 \(\sigma\) and \(\mu\) < median, and, as we shall see,
follows a bimodal distribution.distance is more symmetrical than the
other variables.# Calculating summary statistics using
# summarise across columns
melb_data %>% summarise( across(
.cols = Price:BuildingArea,
.fns = list( Min = min, Median = median,
`1Q` = ~quantile(., 0.25, na.rm = TRUE),
`3Q` = ~quantile(., 0.75, na.rm = TRUE),
Max = max, Mean = mean,
SD = sd, N = ~length(.),
NAs = ~sum(is.na(.))) ) ) %>%
# Transposing the summary statistics DataFrame
pivot_longer(cols = everything(),
names_to = c(".value", "Statistic"),
names_sep = "_") %>%
kable(digits = c(0, 0, 1, 1, 1)) %>%
row_spec(8, bold=T) %>% row_spec(9, bold=T)| Statistic | Price | Distance | Landsize | BuildingArea |
|---|---|---|---|---|
| Min | 260000 | 1.3 | 66.0 | 75.0 |
| Median | 1100000 | 10.6 | 560.0 | 148.0 |
| 25th Centile | 783000 | 7.0 | 355.0 | 116.0 |
| 75th Centile | 1550000 | 14.2 | 670.0 | 197.0 |
| Max | 8000000 | 24.8 | 2187.0 | 789.0 |
| Mean | 1273331 | 11.0 | 534.2 | 166.8 |
| SD | 711052 | 5.1 | 232.9 | 71.5 |
| N | 6001 | 6001.0 | 6001.0 | 6001.0 |
| NAs | 0 | 0.0 | 0.0 | 0.0 |
Price (a), the target variable, is
quite skewed to the right as will be discussed on the next slide.Distance (b) is symmetrical
(naturally, land area increases \(\propto\) Distance\(^2\) but this is offset by lower population
density). It is visibly platykurtic.Landsize (c) is bi-modal, likely
reflecting distinct sub-populations of smaller inner city properties,
and larger outer city properties.BuildingArea (d) shows significant
positive skew even after filtering.Price is right skewed.Prices.Price:
(a) This yields a more symmetrical distribution, though
(b) a slight positive skew remains, with higher density
around log(Price) = 13 than the normal distribution.melb_data[, log_Price := log(Price)]
tf_Price = create_grouped_plot("log_Price") # See appendix
qq_tf_Price = melb_data %>%
ggplot(aes(sample = log_Price)) +
geom_qq(color="purple", alpha = 0.4) +
geom_qq_line(colour = 'orange', linetype='dashed') +
theme_minimal() +
labs(x="Normal quantiles", y="log_Price quantiles")
plot_grid(plotlist = list(tf_Price, qq_tf_Price),
nrow = 1, ncol = 2, labels = "auto")log_Price.Residual vs fitted: Residuals are symmetrically distributed around zero throughout range of fitted values, supporting linearity assumption.
QQ residuals: Residual distribution is slightly thinner at left tail than normal distribution. There are some more extreme values at both tails that might be expected but this is driven by a few observations. Overall, assumption of normality seems reasonable.
Scale-location: The scaled residuals are symmetrical around 1 through range of fitted values and there is no sustained trend in the LOESS line supporting homoscedasticity of the residuals.
Residuals vs leverage: No outliers exceed Cook’s distance, suggesting the model fit has not been greatly biased by these.
##
## Call:
## lm(formula = log_Price ~ Landsize + BuildingArea + Distance,
## data = melb_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.68314 -0.22466 -0.01561 0.22920 1.40568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.377e+01 1.348e-02 1020.88 <2e-16 ***
## Landsize 5.531e-04 2.128e-05 25.99 <2e-16 ***
## BuildingArea 3.076e-03 6.291e-05 48.89 <2e-16 ***
## Distance -5.807e-02 8.955e-04 -64.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3182 on 5997 degrees of freedom
## Multiple R-squared: 0.5627, Adjusted R-squared: 0.5625
## F-statistic: 2572 on 3 and 5997 DF, p-value: < 2.2e-16
Price) with non-zero coefficients.log_Price when all predictor
variables are zero (an implausible situation in this case!).Landsize, BuildingArea, and
Distance have estimated coefficients of 0.00055, 0.00308,
and -0.05807, respectively.
log_Price associated
with a one-unit increase in each respective predictor, holding other
variables constant.log_Price could be expected to increase
by 0.055 for every 100 m\(^2\) increase
in land size, by 0.308 for every 100 m\(^2\) increase in building size and to
decrease by 0.058 for every km distance from the CBD.Price).log_Price that is explained by the predictors.
It represents the goodness-of-fit of the model, with higher values
indicating a better fit, with a plausible range of 0 (no explanatory
power) to 1 (perfect fit).Price
varied inversely with distance from Melbourne CBD and increases linearly
with land size, and building size, that such a model satisfies
assumptions of linear regression and explains about half the variation
in the natural logarithm of house price during the study period.‘Australian Dream’ (2022) Wikipedia. Available at: https://en.wikipedia.org/w/index.php?title=Australian_Dream&oldid=1117008920 (Accessed: 28 May 2023).
Hahsler, M., Piekenbrock, M. and Doran, D. (2019) ‘dbscan : Fast Density-Based Clustering with R’, Journal of Statistical Software, 91(1). Available at: https://doi.org/10.18637/jss.v091.i01.
House-price-to-income ratio in selected countries 2022 (no date) Statista. Available at: https://www.statista.com/statistics/237529/price-to-income-ratio-of-housing-worldwide/ (Accessed: 28 May 2023).
‘Multicollinearity’ (2023) Wikipedia. Available at: https://en.wikipedia.org/w/index.php?title=Multicollinearity&oldid=1140956842#Consequences (Accessed: 28 May 2023).
www.onthehouse.com.au (2023) www.onthehouse.com.au. Available at: https://www.onthehouse.com.au/ (Accessed: 28 May 2023)
Pino, T. (2018) Melbourne Housing Market. Available at: https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market (Accessed: 28 May 2023).
realestate.com.au (2023) realestate.com.au. Available at: https://realestate.com.au (Accessed: 28 May 2023).
create_grouped_plot <- function(variable) {
# Calculate mean and standard deviation
mean_value <- mean(melb_data[[variable]])
sd_value <- sd(melb_data[[variable]])
min_scale = min(c(melb_data[[variable]], mean_value-3*sd_value) )
max_scale = max(c(melb_data[[variable]], mean_value+3*sd_value) )
# Histogram with Density Plot
histogram <- ggplot(data = melb_data, aes(x = !!sym(variable)), linewidth = 0.2) +
geom_histogram(aes(y = ..count..), bins = 100, fill = "cornflowerblue",
color = "cornflowerblue", alpha = 0.5) +
geom_vline(xintercept = mean_value, color = "red", size = 0.5, linetype="dashed") +
geom_errorbarh(aes(y = 0, xmin = mean_value-3*sd_value,
xmax = mean_value+3*sd_value),
width = 2, color = "red", size = 0.5, height=50, linetype = "solid") +
scale_x_continuous(limits = c(min_scale, max_scale)) +
labs(title = "", x = NULL, y = "Frequency") +
theme_minimal() +
theme(
axis.text.x = element_blank(),
plot.margin = margin(0, 0, 0, 0)
)
# Boxplot
boxplot <- ggplot(data = melb_data, aes(y = !!sym(variable)), linewidth = 0.5) +
geom_boxplot(fill = "orange", color = "salmon", alpha = 0.5) +
geom_hline(yintercept = mean_value, color = "red", size = 0.5, linetype="dashed") +
labs(title = "", y = variable, x = "") +
scale_y_continuous(labels = label_comma(), limits =c(min_scale, max_scale) ) +
theme_minimal() +
theme(
axis.text.y = element_blank(),
panel.grid.major = element_blank(),
plot.margin = margin(0, 0, 0, 0)
) +
coord_flip()
# Combine the plots
plot_grid(histogram, boxplot, nrow = 2, align = "v",
rel_heights = c(5, 2), greedy = F)
}