Introduction - The Great Australian Dream

“The Great Australian Dream is a belief that in Australia, home-ownership can lead to a better life and is an expression of success and security. Although this standard of living is enjoyed by many in the existing Australian population, rising house prices compared to average wages are making it increasingly difficult for many to achieve the “great Australian Dream”, especially for those living in large cities.”

— Wikipedia on the ‘Australian Dream’^[1]

In order to achieve the great Australian Dream, Australians are forced to purchase a home further away from the CBD (Central Business District)
The further the house is from the CBD, the cheaper the price
A large percentage of jobs (an estimated 25%)^[2] in the greater Melbourne region are located in the CBD, in addition to universities and other key amenities
An estimated 75%^[3] of Melbourne residents drive to work
Thus there is a trade-off between house price and travel time into the CBD by private vehicle
As part of this project we also compare our “Travel Time” linear regression model to a more traditional “Distance from CBD” linear regression model, which was found to have R² = 0.441, p < .001

^[1] Wikipedia, ‘The Australian Dream’, (https://en.wikipedia.org/wiki/Australian_Dream), (accessed 27 October 2018).
^[2] ^[3] C. Butt and T. Jacks, ‘Cars continue to rule Melbourne roads, census shows’, (https://blogs.crikey.com.au/theurbanist/2018/03/19/jobs-centre-city/), 24 October 2017, (accessed 27 October 2018).

Problem Statement

This study aims to investigate the relationship between travel time into the CBD by private vehicle and the median house price per suburb.
Travel time by private vehicle was used as it provides a more interesting dimension than just kilometres from the CBD; it takes into account existing infrastructure and congestion. Two suburbs at the same distance from the CBD can have different travel times due to access to better roads/freeways and less congestion.
In order to do this, two data sets are used:
Sold house prices in Melbourne (Data from Kaggle)
Travel time based on Google Maps’ estimation (Self Collected)
The flow of the statistical investigation is as follows:

Import, pre-process and join the 2 datasets together
Summarise and visualise the dataset to provide an initial view
Run a Pearson’s correlation coefficient and a linear regression model on the 2 main variables
Check the assumptions of the linear regression model

Linear regression is a suitable statistical tool here because it both models the relationship between the two variables and quantifies its strength.

Data set 1 - Sold House Prices in Melbourne

-The Housing Dataset is sourced from: Melbourne Housing Market

Key variables in the dataset based on importance are as follows:

Price (ratio, in $AUD)
Suburb (factor)
Rooms (factor - used for filtering)
Type (factor - used for filtering)
Date sold (Date - used for filtering)
Distance (ratio, in KM from city center. Used for filtering)

Dataset has been cleaned using the following steps:

Remove all null prices (House sold but price not reported)
Filter results to 3 bedroom houses
- 3 bedroom houses were chosen as a standard to compare across suburbs as the number of bedrooms and the type of house has an impact on the house price. ‘Compare apples to apples’
Filter out only houses within 100 km of CBD
- 50km radius is an estimate on the ‘radius’ of Melbourne city. Aim is to remove regional Victoria house prices from influencing the model
Keep only the houses that were sold in 2018
- House prices increased substantially between 2016 and 2017. Keeping 2018 records allows a better reflection of current price
Calculate median price per suburb
- Median is used because it is robust to outliers within the suburb
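The cleaning steps above can be sketched with dplyr roughly as follows. This is an illustration on a toy data frame, not the study's actual script: the column names, the `"h"` code for houses, and the `max_distance_km` value are assumptions and should be replaced with whatever the real dataset and analysis use.

```r
library(dplyr)

# Toy stand-in for the housing data; column names and Type codes are assumed.
housing <- data.frame(
  Suburb   = c("Abbotsford", "Abbotsford", "Abbotsford", "Sunbury", "Sunbury", "Geelong"),
  Rooms    = c(3, 3, 2, 3, 3, 3),
  Type     = c("h", "h", "h", "h", "u", "h"),
  Price    = c(1200000, 1400000, 900000, NA, 500000, 600000),
  Date     = as.Date(c("2018-02-03", "2018-03-10", "2018-01-20",
                       "2018-02-17", "2018-03-03", "2017-11-11")),
  Distance = c(3, 3, 3, 31.7, 31.7, 75),
  stringsAsFactors = FALSE
)

max_distance_km <- 50  # placeholder cut-off; use the radius chosen for the study

suburb_prices <- housing %>%
  filter(!is.na(Price)) %>%                  # drop sales with no reported price
  filter(Rooms == 3, Type == "h") %>%        # keep 3 bedroom houses only
  filter(Distance <= max_distance_km) %>%    # exclude regional Victoria
  filter(format(Date, "%Y") == "2018") %>%   # keep 2018 sales only
  group_by(Suburb) %>%
  summarise(median_price = median(Price),    # median price per suburb
            n_sales = n())                   # number of sales behind each median

print(suburb_prices)
```

Keeping `n_sales` alongside the median makes it easy to spot suburbs whose median rests on only one or two sales.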

Data set 2 - Travel time to RMIT

Data collected from GoogleAPI: GoogleCloudPlatform
RMIT city campus was used as the destination as RMIT is both a central location in Melbourne CBD and central to the authors of this study
Travel time is calculated at suburb granularity, with Google determining where the suburb starting point is
Travel time is based on how long it takes to reach RMIT at 9am on a Monday. This specific time was chosen to account for morning peak hour traffic and to simulate arriving at work
Travel time returned is in seconds. This has been converted to minutes
Finally, the travel time duration from each suburb was added to the main dataset (Housing data) using a left join.
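The conversion and join described above might look like the sketch below. The data frames and the `duration_seconds` column are hypothetical stand-ins for the API results; only the seconds-to-minutes conversion and the left join come from the text.

```r
library(dplyr)

# Hypothetical suburb-level housing data (main dataset)
suburb_prices <- data.frame(
  Suburb = c("Abbotsford", "Box Hill", "Werribee"),
  median_price = c(1300000, 1250000, 520000),
  stringsAsFactors = FALSE
)

# Hypothetical travel-time results: one row per suburb, duration in seconds
travel <- data.frame(
  Suburb = c("Abbotsford", "Box Hill"),
  duration_seconds = c(840, 2460),
  stringsAsFactors = FALSE
)

# Convert seconds to minutes
travel <- travel %>% mutate(Travel_Time = duration_seconds / 60)

# Left join keeps every suburb in the housing data;
# suburbs with no travel time get NA in Travel_Time
Dataset <- suburb_prices %>%
  left_join(select(travel, Suburb, Travel_Time), by = "Suburb")

print(Dataset)
```

A left join is the safer choice here because it preserves every suburb in the housing data, so any suburb that failed to match shows up as an explicit `NA` rather than silently disappearing.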


Descriptive Statistics and Visualisation - Histograms of travel time and price

# Side-by-side histograms of suburb median price and travel time
par(mfrow = c(1, 2))
hist(Dataset$median_price, main = "Median property prices", xlab = "Median Price", col = "lightblue")
hist(Dataset$Travel_Time, main = "Travel time to RMIT", xlab = "Travel time (mins)", col = "lightblue")

Median Price observations

The median property price distribution is skewed to the right.
For the purpose of running the regression, the median price was log-transformed to reduce this skew.

Travel time observations

The distribution looks symmetrical and similar to a normal distribution
The most common travel time to RMIT is 30 - 45 minutes

Descriptive Statistics and Visualisation - Box chart and % in travel duration

library(ggplot2)
library(scales)     # dollar labels
library(lattice)    # barchart
library(gridExtra)  # grid.arrange
# One colour per travel-time category, from shortest (green) to longest (red)
category_colours <- c("#336600", "#9ACD32", "#FFFF00", "#FF6600", "red")
# Horizontal boxplots of suburb median price for each travel-time category
p1 <- ggplot(Dataset, aes(x = Time_category, y = median_price, fill = Time_category)) +
  geom_boxplot() + scale_y_continuous(labels = dollar) + coord_flip() +
  scale_fill_manual(values = category_colours) + theme(legend.position = "none") +
  ggtitle("Property price against time category") + xlab("Time categories") + ylab("Median Price") +
  theme(plot.title = element_text(colour = "Black", size = 14, face = "bold", hjust = 0.5))
# Bar chart of the percentage in each category (`percentage` is computed beforehand)
p2 <- barchart(x = percentage, col = category_colours, main = "Travel Duration",
               xlab = "Population percentage")
grid.arrange(p1, p2, nrow = 1)

Observations

The boxplots clearly show a significant decrease in median prices as travel time increases, up until the last category
The 30 - 45 minute category shows the biggest drop in median price from the previous category (15 - 30 minutes)
The last two categories (45 - 60 mins, 60+ mins) appear to have similar IQRs and medians, which suggests that variability in price has dropped
There are significant upper outliers in the higher price categories (15 - 30 mins, 0 - 15 mins), which represent the most expensive and affluent suburbs
These outliers cannot be removed as they are not errors (i.e. these houses actually sold in those suburbs)
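The slides do not show how `Time_category` was derived from travel time. One plausible way, sketched here with base R's `cut()`, is given below; the treatment of boundary values (for example, whether exactly 15 minutes falls in "0-15 mins" or "15-30 mins") is an assumption.

```r
# Example travel times in minutes (illustrative values only)
travel_minutes <- c(8, 15, 22.5, 44, 59.9, 75)

# Bin into the five categories used in the charts.
# right = FALSE makes intervals closed on the left: [0,15), [15,30), ...
Time_category <- cut(
  travel_minutes,
  breaks = c(0, 15, 30, 45, 60, Inf),
  labels = c("0-15 mins", "15-30 mins", "30-45 mins", "45-60 mins", "60+ mins"),
  right  = FALSE
)

print(table(Time_category))
```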

Descriptive Statistics Cont - Scatter chart

ggplot(Dataset, aes(x = Travel_Time, y = median_price, col = Time_category)) +
  scale_y_continuous(trans = 'log', limits = c(400000, 3500000),
    breaks = c(400000,500000,600000,700000,800000,900000,1000000,1200000,1400000,1600000,1800000,2000000,2250000,2500000,2750000,3000000,3500000)) +
  geom_point() + geom_smooth(method = lm, se = FALSE) +  # one fitted line per time category
  scale_color_manual(values = c('#336600', '#9ACD32', '#FFFF00', "#FF6600", "red")) +
  labs(title = "Melbourne: Property price against travel time", x = "Travel time (in mins)", y = "Median property price (log scale)") +
  theme(plot.title = element_text(colour = "Black", size = 14, face = "bold", hjust = 0.5))

Notes/Observations

Roughly, there is a linear relationship between median price (log) and travel time
- Log price was taken to strengthen the linearity of the relationship.
Within the 0 - 15 minute time category, it is interesting to note the increase in price as time increases.
- this could possibly be due to avoiding the ‘hustle and bustle’ of Melbourne CBD but still remaining within close proximity to it
There is a significant decrease in price in the 30 - 45 minute category, which also has the steepest slope of all categories
It is also interesting that the last 2 categories show a very slight increase in price as travel time increases.

Descriptive Statistics

Dataset %>%  group_by(Time_category) %>% summarise(Min = min(median_price,na.rm = TRUE),
                                          Q1 = quantile(median_price,probs = .25,na.rm=TRUE),
                                          Median = median(median_price, na.rm = TRUE),
                                          Q3 = quantile(median_price,probs = .75,na.rm=TRUE),
                                          Max = max(median_price,na.rm = TRUE),
                                          Mean = mean(median_price, na.rm = TRUE),
                                          SD = sd(median_price, na.rm = TRUE),
                                          IQR=IQR(median_price,na.rm = TRUE),
                                          n = n(),
                                          Missing = sum(is.na(median_price))) ->table_stats

knitr::kable(table_stats)

Time_category	Min	Q1	Median	Q3	Max	Mean	SD	IQR	n
0-15 mins	680000	1230062	1411250	1587500	3339000	1490203.1	572479.5	357437.5	16
15-30 mins	650000	982500	1220000	1645500	2812500	1302472.5	416519.0	663000.0	91
30-45 mins	362500	642000	773500	1030000	2300000	865817.8	316246.5	388000.0	133
45-60 mins	381000	550000	640000	743000	1320000	669101.7	188585.6	193000.0	59
60+ mins	407000	510000	667500	737500	912500	642442.3	168632.9	227500.0	13

Observations

The 30-45 mins travel time category has the largest number of observations (n = 133).
The median price, interquartile range and standard deviation for the last 2 categories are very similar.
This suggests that the two categories could possibly be combined

Hypothesis Testing - Pearson Correlation

A Pearson correlation was conducted to measure the strength of the linear relationship between house prices (no log) and travel time.
A negative correlation of r = -0.597 was obtained.
Several transformations were tried in an attempt to improve this value. The log of the median price yielded the best result, with r = -0.639:

cor.test(Dataset$log_median_price,Dataset$Travel_Time)

## 
##  Pearson's product-moment correlation
## 
## data:  Dataset$log_median_price and Dataset$Travel_Time
## t = -14.617, df = 310, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7001423 -0.5680117
## sample estimates:
##        cor 
## -0.6387628

The hypothesis test for Pearson’s correlation is as follows, where $\rho$ is the population correlation:

\[H_0: \rho = 0\]

\[H_A: \rho \ne 0\]

Result

The negative correlation was statistically significant, $r$ = -0.639, $p$ < .001, 95% CI [-0.700, -0.568]
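As a check on the reported output, the t statistic can be recovered from the correlation and the sample size using the standard test statistic for Pearson's correlation. Here n = 312 is inferred from df = n - 2 = 310:

\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.6388 \times \sqrt{310}}{\sqrt{1 - 0.6388^2}} \approx \frac{-11.247}{0.7694} \approx -14.62 \]

This agrees with the t = -14.617 reported by cor.test.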

Hypothesis Testing - Linear Regression

As there is a negative relationship between house prices and travel time, a linear regression model is the logical next step:

log_lm <- lm(log_median_price ~ Travel_Time, data = Dataset)
log_lm %>% summary()

## 
## Call:
## lm(formula = log_median_price ~ Travel_Time, data = Dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83460 -0.22031 -0.03803  0.21610  1.02360 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.425999   0.052243  276.13   <2e-16 ***
## Travel_Time -0.019994   0.001368  -14.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3142 on 310 degrees of freedom
## Multiple R-squared:  0.408,  Adjusted R-squared:  0.4061 
## F-statistic: 213.7 on 1 and 310 DF,  p-value: < 2.2e-16

The hypothesis test for the constant (intercept) is:

$H_0: \alpha = 0$
$H_A: \alpha \ne 0$

and the hypothesis test for the slope is:

$H_0: \beta = 0$
$H_A: \beta \ne 0$

The summary shows that $\alpha$ = 14.426 and $\beta$ = -0.020
Both coefficients are statistically significant ($p$ < 0.001 for each), as is the overall regression model ($F$-test $p$ < 0.001)
The R² = 0.408
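Because the response is the log of the median price, the slope is easier to read after back-transforming. A short worked step follows, using only the coefficients reported in the summary and assuming the natural log was used (which is consistent with the back-transformed intercept quoted in the discussion). Here $t$ denotes travel time in minutes:

\[ \log(\text{price}) = \alpha + \beta t \quad\Rightarrow\quad \text{price} = e^{\alpha} \cdot e^{\beta t} \]

\[ e^{\beta} = e^{-0.019994} \approx 0.980 \]

So each additional minute of travel time is associated with a median price that is roughly 2% lower.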

Assumptions Testing

Linear regression has 4 main assumptions that must be validated:

1) Independence of variables

Travel Time: As the Google API algorithm is not open source, there is no way to verify that one travel time is truly independent of another. It is assumed that each suburb’s travel time is independent of every other suburb’s.
House Prices: There is potentially a small amount of dependence between individual sale prices, as real estate agents set valuations based on recent sales of similar properties in the area. This risk is reduced by taking the median of the suburb.

2) Linearity

As covered in slide 9, there is a moderate negative relationship between median price (log) and travel time.
The ‘Residuals vs Fitted’ graph on slide 13 helps verify the linearity of the relationship: a roughly flat trend line with no curved pattern supports a linear fit.

3) Normality of residuals

The ‘QQ plot’ is used to determine whether residuals are normally distributed or not

4) Homoscedasticity

Homoscedasticity refers to the assumption of constant variance across the predicted values.
The ‘Residuals vs Fitted’ graph and the ‘Scale - Location’ graph on slide 15 help to validate this assumption.

Assumptions Testing - Residuals vs. Fitted Plot

plot(log_lm,which=c(1))

Observation

The red line is roughly flat and sits close to zero across the fitted values. This is a good indication that the linearity assumption holds.
Variance from left to right remains fairly constant, a good demonstration of homoscedasticity

Assumptions Testing - Quantile-Quantile Plot

# Q-Q plot of studentised residuals with a confidence envelope (qqPlot is from the car package)
invisible(car::qqPlot(log_lm))

Observation

The Q-Q plot shows most of the points fall approximately on the reference line.
There are some deviations outside the dotted blue lines (the 95% confidence envelope) around the 0 t-quantile.
Overall, the residuals can be assumed to be normally distributed
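A visual Q-Q check can be complemented with a formal test. The sketch below runs a Shapiro-Wilk test on the residuals of a model fitted to simulated data; the simulated variables only mimic the shape of the study's model and are not its data, and this test was not part of the original analysis.

```r
set.seed(1)

# Simulated stand-in data: log price falling linearly with travel time
travel_time <- runif(200, min = 5, max = 80)
log_price   <- 14.4 - 0.02 * travel_time + rnorm(200, sd = 0.3)
sim_lm      <- lm(log_price ~ travel_time)

# Shapiro-Wilk test: H0 is that the residuals are normally distributed
sw <- shapiro.test(residuals(sim_lm))
print(sw)
```

A large p-value means there is no evidence against normality. With a few hundred observations the test can flag small departures that matter little in practice, so it is best read together with the Q-Q plot.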

Assumptions Testing - Scale Location

plot(log_lm,which=c(3))

Observation

The Scale Location plot shows an even spread of the data. This is indicated by the flattening red line and is further evidence of homoscedasticity

Assumptions Testing - Residuals vs. Leverage

plot(log_lm,which=c(5))

Observation

The Residuals vs. Leverage plot shows that there are no significant outliers affecting the linear model, as there are no points outside the Cook’s distance contours (dotted red lines)
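The same check can be made numerically rather than by eye. This sketch computes Cook's distance for each observation of a model fitted to simulated data (not the study's data) and lists any points beyond 0.5, which is the inner contour that base R draws by default on the Residuals vs. Leverage plot.

```r
set.seed(2)

# Simulated stand-in data with the same structure as the study's model
travel_time <- runif(100, min = 5, max = 80)
log_price   <- 14.4 - 0.02 * travel_time + rnorm(100, sd = 0.3)
sim_lm      <- lm(log_price ~ travel_time)

# Cook's distance measures how much each observation shifts the fitted model
cooks_d <- cooks.distance(sim_lm)

# Indices of observations beyond the 0.5 contour (empty if none are influential)
flagged <- which(cooks_d > 0.5)

print(summary(cooks_d))
print(flagged)
```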

Discussion - Result

A linear regression model was fitted to predict the dependent variable, house price (log), from a single predictor, travel time.
The scatter plot demonstrated a negative linear relationship between house price and travel time
The overall regression model was significant, $F$(1,310) = 213.7, $p$ < 0.001
Travel time accounted for 40.8% of the variability in log house price ($R$² = 0.408)
The estimated regression equation is: median_price (logged) = 14.425999 - 0.019994 * Travel_Time
The negative slope was statistically significant, $b$ = -0.020, $t$(310) = -14.62, $p$ < .001, 95% CI [-0.023, -0.017]
The intercept was statistically significant, $\alpha$ = 14.426, $t$(310) = 276.13, $p$ < .001, 95% CI [14.323, 14.529]
Back-transformed, the intercept is equal to $1,841,332, which suggests that a 3 bedroom house with zero travel time (i.e. in Melbourne CBD) is expected to sell for about $1.8m
Inspection of residuals supported normality and homoscedasticity
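The figures above can be reproduced from the reported coefficients alone. The sketch below back-transforms the fitted equation into dollars and rebuilds the slope's confidence interval; `predict_price` is a hypothetical helper introduced only for illustration, and the natural log is assumed.

```r
# Coefficients and standard error as reported in the regression summary
alpha    <- 14.425999   # intercept (log dollars)
beta     <- -0.019994   # slope (log dollars per minute)
se_beta  <- 0.001368    # standard error of the slope
df_resid <- 310         # residual degrees of freedom

# Hypothetical helper: predicted median price in dollars for a travel time in minutes
predict_price <- function(minutes) exp(alpha + beta * minutes)

# 95% confidence interval for the slope: estimate +/- t critical value * SE
t_crit  <- qt(0.975, df_resid)
ci_beta <- beta + c(-1, 1) * t_crit * se_beta

print(predict_price(0))    # back-transformed intercept, about $1.84m
print(predict_price(30))   # predicted price 30 minutes from the CBD
print(round(ci_beta, 3))
```

Note that exponentiating a prediction made on the log scale gives an estimate closer to a typical (median-like) price than to the arithmetic mean.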

Strengths, weaknesses and proposals for future investigations

Strengths

A relatively large and complete dataset was used (i.e. this is perhaps all the publicly available house price data)
Google travel time data gives a relatively accurate estimate of travel time from a suburb to Melbourne CBD

Weaknesses

Some suburbs have a very small number of sales for a 3 bedroom house in 2018. This could lead to the median price not being representative of that suburb
Other types of houses were excluded from the scope of this study - e.g. town houses, apartments and houses with more or fewer than 3 bedrooms
Source data only uses domain.com.au and only includes information available publicly.
Google’s travel time data was taken at face value, with no understanding of how it is calculated

Future investigations

Multiple linear regression to take into account other factors such as land size, crime rates, bedroom numbers
Possibility of fitting separate linear models based on direction, e.g. south-east suburbs vs western suburbs
Incorporating other categorical factors such as closeness to water, to public transport, good schools
Expand the investigation to include other capital cities or the national level.
Expand the investigation to include other modes of transport such as public transport

Conclusion

A linear model was also fitted for distance from the CBD and median price (as reported in slide 3)
The R² for distance is 0.441, versus 0.408 for travel time
This means that distance from the CBD is a slightly better predictor of price than travel time.
The travel time linear model may still be useful if a user wishes to predict house price based on travel time to CBD
Lastly, the 30 - 45 minute category seems to offer the best value for money

References:

House and Sold Image: https://pixabay.com
Time and architect Image: https://www.pexels.com/

RPubs link for the study
