Housing Price in Melbourne

Can distance of houses from the CBD be used to predict Melbourne house prices?

Keavatey Srun, S3767615

Last updated: 24 October, 2019

Introduction

[Photo: Melbourne’s Skyline (The Australian 2018)]

Problem Statement

[Photo: House in North Melbourne (realestate.com.au 2019)]

Data

Suburb      Address             Rooms  Type  Price    Day  Month  Year  Distance  Postcode  Bathroom  Car  YearBuilt
Abbotsford  68 Studley St       2      h     NA       3    09     2016  2.5       3067      1         1    NA
Abbotsford  85 Turner St        2      h     1480000  3    12     2016  2.5       3067      1         1    NA
Abbotsford  25 Bloomburg St     2      h     1035000  4    02     2016  2.5       3067      1         0    1900
Abbotsford  18/659 Victoria St  3      u     NA       4    02     2016  2.5       3067      2         1    NA
Abbotsford  5 Charles St        3      h     1465000  4    03     2017  2.5       3067      2         0    1900
Abbotsford  40 Federation La    3      h     850000   4    03     2017  2.5       3067      2         1    NA

Data (Cont.)

  1. Tidy the Date variable by separating it into Day, Month and Year, so that it can be used for filtering later.
  2. Filter the dataset by number of rooms, number of bathrooms, year sold and distance, as mentioned earlier.
  3. Check for missing values and special values.
    • There are 56 missing values in the Price variable.
    • Because missing values account for about 21% of the Price observations, they are replaced by the mean.
    • No special values are detected.
    • Note: There are also missing values in the Car and YearBuilt variables, but they can be left as they are because they will not affect the analysis.
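
The three steps above can be sketched in base R as follows. This is a minimal illustration on made-up rows, not the code used for the report: the data frame `houses`, the `d/mm/yyyy` date format, and the filter thresholds are all assumptions, since the actual criteria are only described as "mentioned earlier".

```r
# Toy rows standing in for the real listings (all values are illustrative).
houses <- data.frame(
  Date     = c("3/09/2016", "4/02/2016", "6/01/2018", "6/01/2018"),
  Rooms    = c(2, 3, 2, 2),
  Bathroom = c(1, 2, 1, 1),
  Distance = c(2.5, 2.5, 19.6, 13.1),
  Price    = c(NA, 1035000, 432000, 412000),
  stringsAsFactors = FALSE
)

# Step 1: split Date into Day, Month and Year (assumes a d/mm/yyyy format).
parts <- do.call(rbind, strsplit(houses$Date, "/", fixed = TRUE))
houses$Day   <- as.integer(parts[, 1])
houses$Month <- parts[, 2]
houses$Year  <- as.integer(parts[, 3])

# Step 2: filter the rows. These thresholds are hypothetical placeholders.
houses_f <- subset(houses,
                   Rooms <= 3 & Bathroom <= 2 & Year >= 2016 & Distance <= 20)

# Step 3: count missing and special (infinite or NaN) prices,
# then replace the missing prices with the mean of the observed ones.
n_missing <- sum(is.na(houses_f$Price))
n_special <- sum(is.infinite(houses_f$Price) | is.nan(houses_f$Price))
houses_f$Price[is.na(houses_f$Price)] <- mean(houses_f$Price, na.rm = TRUE)

houses_f
```

Mean imputation keeps the sample size but pulls the imputed rows towards the centre of the price distribution, which is worth keeping in mind when reading the later summary statistics.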

Data (Cont.)

  4. Detect bivariate outliers using the Mahalanobis distance: 17 outliers are found in the dataset.
  5. Deal with the outliers.
    • Since the outliers make up only about 6.85% of the observations, they can be removed from the data set. This is the last step of data pre-processing.
Preview of the final dataset: the final data set has 248 observations and 13 variables.
Suburb   Address           Rooms  Type  Price    Day  Month  Year  Distance  Postcode  Bathroom  Car  YearBuilt
Epping   10A Cabot Dr      2      h     432000   6    01     2018  19.6      3076      1         1    1995
Fawkner  1/1 Clara St      2      u     412000   6    01     2018  13.1      3060      1         1    NA
Glenroy  70 Beatty Av      2      t     530000   6    01     2018  11.2      3046      1         2    2010
Glenroy  43 Bindi St       2      h     637000   6    01     2018  11.2      3046      1         2    NA
Glenroy  181 Daley St      2      h     628000   6    01     2018  11.2      3046      1         4    NA
Glenroy  36 Gladstone Pde  2      h     1245000  6    01     2018  11.2      3046      1         2    NA
## [1] 248  13
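
The bivariate outlier screening described above could look roughly like the sketch below. It uses `stats::mahalanobis()` with a chi-square cutoff on simulated data. Both the cutoff (the 97.5% quantile with 2 degrees of freedom) and the data frame `toy` are assumptions for illustration, and the report may have used a different tool or threshold.

```r
set.seed(1)

# Simulated Price/Distance pairs (illustrative only, not the real data).
n <- 200
toy <- data.frame(Distance = runif(n, 1, 20))
toy$Price <- 1.1e6 - 17000 * toy$Distance + rnorm(n, sd = 2e5)

# Plant one obvious bivariate outlier: far from the CBD but very expensive.
toy <- rbind(toy, data.frame(Distance = 19, Price = 5e6))

# Squared Mahalanobis distance of each row from the bivariate centre.
x  <- as.matrix(toy[, c("Price", "Distance")])
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Flag rows beyond the assumed chi-square cutoff and drop them.
cutoff    <- qchisq(0.975, df = 2)
is_out    <- d2 > cutoff
toy_clean <- toy[!is_out, ]

sum(is_out)    # number of flagged rows
mean(is_out)   # flagged rows as a share of all rows
```

Reporting the share of flagged rows, as done here with `mean(is_out)`, makes it explicit which total the percentage is relative to.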

Descriptive Statistics and Visualisation

Summary of Price:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  385000  788125  979775  977615 1102000 2220000

Summary of Distance:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   5.100   6.700   7.865  10.600  19.900

Visualisation

Hypothesis Testing–Overall Model

\(H_0\): The data does not fit the linear regression model.

\(H_A\): The data fits the linear regression model.

An F-test will be used to test the overall model.

The following assumptions will also be checked.

  1. Independence of residuals
  2. Linearity of residuals
  3. Normality of residuals (check after model is fitted)
  4. Homoscedasticity (check after model is fitted)
## 
## Call:
## lm(formula = Price ~ Distance, data = Houseprice0_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -592417 -190017  -35996  115621 1342832 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1110740      41858  26.536  < 2e-16 ***
## Distance      -16926       4753  -3.561 0.000443 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 296500 on 246 degrees of freedom
## Multiple R-squared:  0.04902,    Adjusted R-squared:  0.04516 
## F-statistic: 12.68 on 1 and 246 DF,  p-value: 0.0004433
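
As a cross-check on the output above, the overall F-statistic can be recovered from the reported R-squared and degrees of freedom. The sketch below only uses the rounded figures printed in the summary, so the results agree with the output up to rounding. The variable names are illustrative.

```r
# Figures copied from the model summary above.
r_squared <- 0.04902   # Multiple R-squared
df_model  <- 1         # one predictor (Distance)
df_resid  <- 246       # residual degrees of freedom

# F = (R^2 / df_model) / ((1 - R^2) / df_resid)
f_stat <- (r_squared / df_model) / ((1 - r_squared) / df_resid)

# Upper-tail p-value of the F distribution.
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

# With a single predictor, F equals the squared t value of the slope.
t_slope <- -3.561

c(F = f_stat, t_squared = t_slope^2, p = p_value)
```

Both routes give an F-statistic of about 12.68 and a p-value well below 0.05, so \(H_0\) is rejected. The small R-squared shows that Distance alone explains only about 4.9% of the variation in Price.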

Linear Regression–Interpreting the intercept

HousePrice %>% summary() %>% coef() 
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 1110740.24  41857.879 26.535989 3.859886e-74
## Distance     -16925.56   4752.918 -3.561088 4.432840e-04
HousePrice %>% confint()
##                  2.5 %      97.5 %
## (Intercept) 1028294.69 1193185.787
## Distance     -26287.17   -7563.955
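
The confidence intervals from `confint()` can be reproduced by hand as estimate ± critical t value × standard error, using the 246 residual degrees of freedom. The following is a small sketch using the figures printed above. The object names are illustrative.

```r
# Estimates and standard errors copied from the coefficient table above.
est <- c(Intercept = 1110740.24, Distance = -16925.56)
se  <- c(Intercept = 41857.879,  Distance = 4752.918)

# Two-sided 95% critical value of the t distribution with 246 df.
t_crit <- qt(0.975, df = 246)

# Lower and upper limits for each coefficient.
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```

The intercept is the predicted mean price at a distance of zero. Its interval lies well above zero, which is consistent with the very small p-value for the intercept.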

Linear Regression–Interpreting Slope

HousePrice %>% summary() %>% coef() 
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 1110740.24  41857.879 26.535989 3.859886e-74
## Distance     -16925.56   4752.918 -3.561088 4.432840e-04
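
To make the slope concrete, the fitted line can be written as a small function. `predict_price()` is a hypothetical helper defined here for illustration only. It applies the estimated coefficients above, with distance in the same units as the Distance variable.

```r
# Coefficients copied from the table above.
b0 <- 1110740.24   # intercept
b1 <- -16925.56    # slope for Distance

# Hypothetical helper: predicted mean price at a given distance from the CBD.
predict_price <- function(distance) b0 + b1 * distance

# Predicted mean prices at distances of 0, 5 and 10 units.
predict_price(c(0, 5, 10))

# Moving one unit further from the CBD changes the prediction by the slope.
predict_price(11) - predict_price(10)
```

Each additional unit of distance lowers the predicted mean price by about 16,926, and the confidence interval for the slope excludes zero.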

Fitting the Regression Model

Testing Assumptions

Assumptions

Before the final regression model can be reported, all the assumptions for linear regression mentioned earlier must be validated.

Independence of Residuals

Independence is checked through the research design. Because the data set is cross-sectional, with each observation collected at a single point in time, the independence of residuals is assumed to be met.

Linearity of Residuals

The plot shows a very slight curve, but the residuals are spread evenly around the horizontal line without a distinct pattern (the trend is close to flat). This is a good indication that the relationship is linear.

Normality of Residuals

As seen in the plot above, the residuals follow close to a straight line, except at the upper end, where they move off the line. This is a fairly good indication that they are approximately normally distributed.

Testing Assumptions (Cont.)

Homoscedasticity

The residuals are reasonably well spread above and below a roughly horizontal line. At the beginning of the line there are fewer points along and below it, so the variance is slightly lower there. However, homoscedasticity can still be assumed.

Residuals vs Leverage

No values fall outside the bands; therefore, there is no evidence of influential cases.
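
The four diagnostic plots discussed in this section are the standard ones produced by `plot()` on a fitted `lm` object. The sketch below shows one way to generate them, using simulated data in place of the real data set. The data frame `sim` and its parameters are assumptions chosen only to resemble the fitted model.

```r
set.seed(42)

# Simulated stand-in for the cleaned data (illustrative values only).
n   <- 248
sim <- data.frame(Distance = runif(n, 1.6, 19.9))
sim$Price <- 1110740 - 16926 * sim$Distance + rnorm(n, sd = 296500)

fit <- lm(Price ~ Distance, data = sim)

# which = 1: Residuals vs Fitted (linearity)
# which = 2: Normal Q-Q (normality of residuals)
# which = 3: Scale-Location (homoscedasticity)
# which = 5: Residuals vs Leverage (influential cases)
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  plot(fit, which = c(1, 2, 3, 5))
  par(op)
}

# Numeric counterparts of the Residuals vs Leverage plot.
cooks <- cooks.distance(fit)   # the plot's bands are Cook's distance contours
lev   <- hatvalues(fit)        # leverage of each observation
max(cooks)
```

The bands in the Residuals vs Leverage plot are drawn at Cook's distances of 0.5 and 1, so checking `max(cooks)` against those values gives the same conclusion as reading the plot.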

Strength and Direction of Linear Relationship

r <- cor(Price, Distance, use = "complete.obs")
r
## [1] -0.2214115
CIr(r, n = 248, level = 0.95)
## [1] -0.33669245 -0.09959113
r^2
## [1] 0.04902305
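
The interval above is consistent with the usual Fisher z-transformation approach to a confidence interval for a correlation. The sketch below is a hand calculation under that assumption, using only base R, and it reproduces the reported limits.

```r
# Sample correlation and sample size from the output above.
r <- -0.2214115
n <- 248

# Fisher z-transformation and its approximate standard error.
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)

# 95% interval on the z scale, transformed back to the correlation scale.
z_crit <- qnorm(0.975)
ci_r   <- tanh(z + c(-1, 1) * z_crit * se_z)
ci_r

# The squared correlation equals the R-squared of the simple regression.
r^2
```

Because the interval excludes zero, the negative correlation between Price and Distance is statistically significant, although it is weak.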

Discussion

Major Findings

Conclusion

Discussion (Cont.)

Limitations and Future Analysis

References