The Great Australian Dream       Kenneth Tsang (s3750542)   Ahmad Hasnain (s3712538)   Goran Stojkoski (s3017862)

Introduction - The Great Australian Dream

“The Great Australian Dream is a belief that in Australia, home-ownership can lead to a better life and is an expression of success and security. Although this standard of living is enjoyed by many in the existing Australian population, rising house prices compared to average wages are making it increasingly difficult for many to achieve the “great Australian Dream”, especially for those living in large cities.”

— Wikipedia on the ‘Australian Dream’[1]

[1] Wikipedia, ‘The Australian Dream’, (https://en.wikipedia.org/wiki/Australian_Dream), (accessed 27 October 2018).
[2] [3] C. Butt, and T. Jacks, ‘Cars continue to rule Melbourne roads, census shows’, (https://blogs.crikey.com.au/theurbanist/2018/03/19/jobs-centre-city/), 24 October 2017, (accessed 27 October 2018).

Problem Statement

  1. Import, pre-process and join the 2 datasets together
  2. Summarise and visualise the dataset to provide an initial view
  3. Compute Pearson’s correlation coefficient and fit a linear regression model on the 2 main variables
  4. Identify and check the assumptions behind the analysis of the 2 main variables
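
The join in step 1 can be sketched as follows. This is a minimal illustration, assuming both datasets have already been reduced to one row per suburb and share a `Suburb` key; the suburb labels and values are invented, and `prices` and `travel` are hypothetical names.

```r
# Hypothetical per-suburb tables; every value here is invented for illustration.
prices <- data.frame(
  Suburb = c("Suburb A", "Suburb B", "Suburb C"),
  median_price = c(1400000, 900000, 550000),
  stringsAsFactors = FALSE
)
travel <- data.frame(
  Suburb = c("Suburb A", "Suburb B", "Suburb D"),
  Travel_Time = c(10, 35, 65),
  stringsAsFactors = FALSE
)

# Inner join on Suburb: only suburbs present in BOTH tables are kept,
# so Suburb C (no travel time) and Suburb D (no price) drop out.
Dataset <- merge(prices, travel, by = "Suburb")
print(Dataset)
```

An inner join is only one option; `merge(..., all.x = TRUE)` would keep suburbs without a travel time and leave `NA` in that column.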

Data set 1 - Sold House Prices in Melbourne

The Housing Dataset is sourced from: Melbourne Housing Market

Variables used:

  1. Price (ratio, in $AUD)
  2. Suburb (factor)
  3. Rooms (factor - used for filtering)
  4. Type (factor - used for filtering)
  5. Date sold (Date - used for filtering)
  6. Distance (ratio, in km from city center; used for filtering)

Pre-processing steps:

  1. Remove all null prices (house sold but price not reported)
  2. Filter results to 3-bedroom houses
    • 3-bedroom houses were chosen as a common standard across suburbs, because both the number of bedrooms and the type of house affect the price (‘compare apples to apples’)
  3. Keep only houses within 100 km of the CBD
    • A 50 km radius is an estimate of the ‘radius’ of Melbourne city. The aim is to stop regional Victoria house prices from influencing the model
  4. Keep only the houses that were sold in 2018
    • House prices increased substantially between 2016 and 2017. Keeping only 2018 records better reflects current prices
  5. Calculate the median price per suburb
    • The median is used because it is robust to outliers within the suburb
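
The filtering and aggregation steps above can be sketched in base R. This is an illustration on invented rows, not the original pre-processing code. It assumes houses are coded as `Type == "h"` and that dates are stored as day/month/year strings. The text mentions both 100 km and 50 km for the distance cut-off, so it is left as the hypothetical parameter `max_km`.

```r
# Toy stand-in for the housing data; all rows are invented.
housing <- data.frame(
  Suburb   = c("A", "A", "A", "B", "B", "B", "C"),
  Rooms    = c(3, 3, 2, 3, 3, 3, 3),
  Type     = c("h", "h", "h", "h", "h", "u", "h"),   # assumed coding: h = house, u = unit
  Price    = c(900000, 1100000, 700000, NA, 650000, 400000, 500000),
  Date     = c("3/02/2018", "17/03/2018", "3/02/2018", "10/03/2018",
               "10/03/2018", "24/02/2018", "9/12/2017"),  # assumed d/m/Y format
  Distance = c(5, 5, 5, 12, 12, 12, 30),
  stringsAsFactors = FALSE
)
max_km <- 50  # assumed cut-off; adjust to the radius actually used

# Year of sale, extracted from the date string
sold_year <- format(as.Date(housing$Date, format = "%d/%m/%Y"), "%Y")

# Steps 1-4: non-missing price, 3-bedroom houses, within the radius, sold in 2018
keep <- !is.na(housing$Price) &
  housing$Rooms == 3 &
  housing$Type == "h" &
  housing$Distance <= max_km &
  sold_year == "2018"
filtered <- housing[keep, ]

# Step 5: median price per suburb
median_by_suburb <- aggregate(Price ~ Suburb, data = filtered, FUN = median)
names(median_by_suburb)[2] <- "median_price"
print(median_by_suburb)
```

In this toy data, suburb C drops out entirely because its only sale was in 2017, which shows how the filters can shrink the number of suburbs available for the later join.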

Data set 2 - Travel time to RMIT

Getting Late

Descriptive Statistics and Visualisation - Histograms of travel time and price

par(mfrow = c(1, 2))  # two plots side by side
hist(Dataset$median_price, main = "Median property prices", xlab = "Median Price", col = "lightblue")
hist(Dataset$Travel_Time, main = "Travel time to RMIT", xlab = "Travel time (mins)", col = "lightblue")

Median Price observations

Travel time observations

Descriptive Statistics and Visualisation - Box chart and % in travel duration

library(ggplot2)    # ggplot(), geom_boxplot()
library(scales)     # dollar labels for the price axis
library(lattice)    # barchart()
library(gridExtra)  # grid.arrange()

# Box plot of median price per travel-time category, coloured from green (short) to red (long)
p1 <- ggplot(Dataset, aes(x = Time_category, y = median_price, fill = Time_category)) +
  geom_boxplot() + scale_y_continuous(labels = dollar) + coord_flip() +
  scale_fill_manual(values = c("#336600", "#9ACD32", "#FFFF00", "#FF6600", "red")) +
  theme(legend.position = "none") +
  ggtitle("Property price against time category") + xlab("Time categories") + ylab("Median Price") +
  theme(plot.title = element_text(colour = "Black", size = 14, face = "bold", hjust = 0.5))
# Bar chart of `percentage` (population percentage by travel duration)
p2 <- barchart(col = c("#336600", "#9ACD32", "#FFFF00", "#FF6600", "red"), main = "Travel Duration",
               xlab = "Population percentage", x = percentage)
grid.arrange(p1, p2, nrow = 1)
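
The plot relies on a `Time_category` column whose construction is not shown. One way it could be derived from `Travel_Time` is with `cut()`, sketched below on invented travel times. Treating each band as closed on the left (so exactly 15 minutes falls in "15-30 mins") is an assumption, as the source does not say how boundaries are handled.

```r
# Invented travel times in minutes, chosen to include boundary values
travel_time <- c(8, 15, 29.5, 44, 60, 75)

# Band edges and labels matching the categories used in the plots
breaks <- c(0, 15, 30, 45, 60, Inf)
labels <- c("0-15 mins", "15-30 mins", "30-45 mins", "45-60 mins", "60+ mins")

# right = FALSE makes intervals [0,15), [15,30), ..., [60,Inf)  (assumed convention)
time_category <- cut(travel_time, breaks = breaks, labels = labels, right = FALSE)
print(data.frame(travel_time, time_category))
```

Because `cut()` returns a factor with levels in band order, the categories plot in the intended sequence without further sorting.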

Observations

Descriptive Statistics Cont - Scatter chart

ggplot(Dataset, aes(x = Travel_Time, y = median_price, col = Time_category)) +
  scale_y_continuous(trans = 'log', limits = c(400000, 3500000),
                     breaks = c(400000, 500000, 600000, 700000, 800000, 900000, 1000000,
                                1200000, 1400000, 1600000, 1800000, 2000000, 2250000,
                                2500000, 2750000, 3000000, 3500000)) +
  geom_point() + geom_smooth(method = lm, se = FALSE) +
  scale_color_manual(values = c('#336600', '#9ACD32', '#FFFF00', "#FF6600", "red")) +
  labs(title = "Melbourne:Property price against travel time", x = "Travel time (in mins)", y = "median property price (log scale)") +
  theme(plot.title = element_text(colour = "Black", size = 14, face = "bold", hjust = 0.5))

Notes/Observations

Descriptive Statistics

Dataset %>%  group_by(Time_category) %>% summarise(Min = min(median_price,na.rm = TRUE),
                                          Q1 = quantile(median_price,probs = .25,na.rm=TRUE),
                                          Median = median(median_price, na.rm = TRUE),
                                          Q3 = quantile(median_price,probs = .75,na.rm=TRUE),
                                          Max = max(median_price,na.rm = TRUE),
                                          Mean = mean(median_price, na.rm = TRUE),
                                          SD = sd(median_price, na.rm = TRUE),
                                          IQR=IQR(median_price,na.rm = TRUE),
                                          n = n(),
                                          Missing = sum(is.na(median_price))) ->table_stats

knitr::kable(table_stats)
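| Time_category | Min    | Q1      | Median  | Q3      | Max     | Mean      | SD       | IQR      | n   | Missing |
|:--------------|-------:|--------:|--------:|--------:|--------:|----------:|---------:|---------:|----:|--------:|
| 0-15 mins     | 680000 | 1230062 | 1411250 | 1587500 | 3339000 | 1490203.1 | 572479.5 | 357437.5 | 16  | 0       |
| 15-30 mins    | 650000 | 982500  | 1220000 | 1645500 | 2812500 | 1302472.5 | 416519.0 | 663000.0 | 91  | 0       |
| 30-45 mins    | 362500 | 642000  | 773500  | 1030000 | 2300000 | 865817.8  | 316246.5 | 388000.0 | 133 | 0       |
| 45-60 mins    | 381000 | 550000  | 640000  | 743000  | 1320000 | 669101.7  | 188585.6 | 193000.0 | 59  | 0       |
| 60+ mins      | 407000 | 510000  | 667500  | 737500  | 912500  | 642442.3  | 168632.9 | 227500.0 | 13  | 0       |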
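
As a quick check on the table, the `n` column can be turned into the share of suburbs per travel-time category. The counts below are copied from the table. The shares describe suburbs in this dataset and should not be confused with the population percentages shown in the earlier bar chart.

```r
# Suburb counts per category, copied from the `n` column of the table
n <- c("0-15 mins" = 16, "15-30 mins" = 91, "30-45 mins" = 133,
       "45-60 mins" = 59, "60+ mins" = 13)

total <- sum(n)                      # total number of suburbs in the dataset
share <- round(100 * n / total, 1)   # percentage of suburbs in each category

print(total)
print(share)
```

The total of 312 suburbs agrees with the 310 degrees of freedom (n - 2) reported by the correlation test and the regression.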

Observations

Hypothesis Testing - Pearson Correlation

cor.test(Dataset$log_median_price,Dataset$Travel_Time)
## 
##  Pearson's product-moment correlation
## 
## data:  Dataset$log_median_price and Dataset$Travel_Time
## t = -14.617, df = 310, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7001423 -0.5680117
## sample estimates:
##        cor 
## -0.6387628

The hypothesis test for Pearson’s correlation is as follows, where \(\rho\) is the population correlation between log median price and travel time:

\[H_0: \rho = 0\]

\[H_A: \rho \ne 0\]
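
The figures in the `cor.test` output can be reproduced by hand from the sample correlation and the degrees of freedom. The sketch below uses the standard formulas, t = r·sqrt(n − 2)/sqrt(1 − r²) for the test statistic and the Fisher z-transformation for the confidence interval. The inputs are the rounded values printed above, so the results agree only to the printed precision.

```r
r  <- -0.6387628   # sample correlation from the output above
df <- 310          # degrees of freedom from the output above
n  <- df + 2       # number of suburbs

# Test statistic and two-sided p-value under H0: no correlation
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(t_stat)
print(p_value)
print(ci)
```

Because the whole interval lies below zero, the null hypothesis of no correlation is rejected at the 5% level, which matches the reported p-value.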

Result

Hypothesis Testing - Linear Regression

log_lm <- lm(log_median_price ~ Travel_Time, data = Dataset)
log_lm %>% summary()
## 
## Call:
## lm(formula = log_median_price ~ Travel_Time, data = Dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83460 -0.22031 -0.03803  0.21610  1.02360 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.425999   0.052243  276.13   <2e-16 ***
## Travel_Time -0.019994   0.001368  -14.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3142 on 310 degrees of freedom
## Multiple R-squared:  0.408,  Adjusted R-squared:  0.4061 
## F-statistic: 213.7 on 1 and 310 DF,  p-value: < 2.2e-16

The hypothesis test for the intercept is:

\(H_0: \alpha = 0\)
\(H_A: \alpha \ne 0\)

and the hypothesis test for the slope is:

\(H_0: \beta = 0\)
\(H_A: \beta \ne 0\)
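
Because the response is the natural log of median price, the slope is easier to read once back-transformed. The sketch below uses only the coefficients printed in the summary; `predict_price` is a hypothetical helper introduced for illustration. Back-transformed predictions from a log model are better read as a typical (median-like) price than as a mean.

```r
intercept <- 14.425999   # (Intercept) estimate from the summary
slope     <- -0.019994   # Travel_Time estimate from the summary
slope_se  <- 0.001368    # Travel_Time standard error from the summary
r         <- -0.6387628  # Pearson correlation from cor.test

# Percentage change in median price for each extra minute of travel
pct_per_min <- 100 * (exp(slope) - 1)

# The slope's t value is the estimate divided by its standard error
t_slope <- slope / slope_se

# In simple linear regression, R-squared equals the squared correlation
r_squared <- r^2

# Hypothetical helper: back-transform a fitted log price to dollars
predict_price <- function(minutes) exp(intercept + slope * minutes)

# Ratio of the fitted price 30 minutes out to the fitted price at 0 minutes
ratio_30 <- predict_price(30) / predict_price(0)

print(pct_per_min)
print(t_slope)
print(r_squared)
print(ratio_30)
```

On these coefficients each additional minute of travel is associated with a fall of about 2% in median price, and the recomputed t value and R-squared match the summary to rounding.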

Assumptions Testing

1) Independence of observations

2) Linearity

3) Normality of residuals

4) Homoscedasticity
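
The diagnostic plots that follow assess these assumptions visually. As a complement, normality and homoscedasticity can also be checked with formal tests. The sketch below runs on simulated data rather than the real `Dataset`, and the Shapiro-Wilk test and the studentised Breusch-Pagan statistic are choices made for this illustration, not steps taken from the original analysis.

```r
set.seed(1)

# Simulated stand-in for the real data (values are invented)
travel_time <- runif(200, min = 5, max = 80)
log_price   <- 14.4 - 0.02 * travel_time + rnorm(200, sd = 0.3)

fit <- lm(log_price ~ travel_time)
res <- residuals(fit)

# Normality of residuals: Shapiro-Wilk test (H0: residuals are normal)
sw <- shapiro.test(res)

# Homoscedasticity: studentised Breusch-Pagan statistic, computed by hand as
# n * R^2 from regressing the squared residuals on the predictor
aux     <- lm(I(res^2) ~ travel_time)
bp_stat <- length(res) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)  # 1 predictor => 1 df

print(sw$p.value)
print(c(statistic = bp_stat, p_value = bp_p))
```

Small p-values would suggest a violated assumption. Independence and linearity are harder to test this way and are usually judged from the study design and the residual plots.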

Assumptions Testing - Residuals vs. Fitted Plot

plot(log_lm,which=c(1))

Observation

Assumptions Testing - Quantile-Quantile Plot

# qqPlot() from the car package draws only the Q-Q plot, so no `which` argument is needed
invisible(car::qqPlot(log_lm))

Observation

Assumptions Testing - Scale Location

plot(log_lm,which=c(3))

Observation

Assumptions Testing - Residuals vs. Leverage

plot(log_lm,which=c(5))
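
The quantities behind this plot can also be inspected directly. The sketch below uses simulated data with one deliberately extreme travel time to show how `hatvalues()` and `cooks.distance()` flag unusual points. The cut-offs 2p/n for leverage and 4/n for Cook's distance are common rules of thumb, used here as assumptions rather than as thresholds from the original analysis.

```r
set.seed(2)

# Simulated data; the last point has an extreme x value by construction
x <- c(runif(50, min = 5, max = 80), 200)
y <- 14.4 - 0.02 * x + rnorm(51, sd = 0.3)
fit <- lm(y ~ x)

lev  <- hatvalues(fit)        # leverage of each observation
cook <- cooks.distance(fit)   # influence of each observation on the fit

p <- length(coef(fit))        # number of estimated coefficients (2)
n <- length(x)                # number of observations (51)

# Rule-of-thumb cut-offs (assumed, not from the source)
high_leverage <- which(lev > 2 * p / n)
influential   <- which(cook > 4 / n)

print(high_leverage)
print(influential)
```

A point with high leverage is not necessarily influential: it only distorts the fit if its response also departs from the trend, which is why the plot shows residuals and leverage together.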

Observation

Discussion - Result

Strengths, weaknesses and proposals for future investigations

Strengths

Weaknesses

Future investigations

Conclusion

References:

House and Sold Image: https://pixabay.com
Time and architect Image: https://www.pexels.com/