Introduction

Melbourne is known as one of the most liveable cities in the world and is often named one of the best places to live in Australia (Chalkley-Rhoden, 2017).
With relatively low unemployment, plenty to do, and suburbs and districts to suit everyone, Melbourne is well suited to family life (TransferWise, 2018).
When people look to start a family, buying a house is often a priority, although it is a huge investment for everyone, especially young couples.
Buying a first home is a stressful and complicated task for newlyweds (Tweddale, 2015).
As the housing market in Melbourne has recovered with a 1.7 percent increase in September (Fuary-Wagner, 2019), buyers will want to obtain as much information as possible to make a wise decision before purchasing a house here.

[Photo: Melbourne’s Skyline (The Australian 2018)]

Problem Statement

Since a house may be their first major purchase together, young couples would be interested in understanding the factors that influence house prices in Melbourne so that they can make better decisions when looking for a house.
In general, location is known as a primary factor influencing housing prices (Chen & Hao, 2008).
Therefore, this analysis investigates whether the distance of a house from the CBD can be used to predict its price in Melbourne.
The linear relationship between house price and distance from the CBD is tested using the linear regression F-test.
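
In symbols, the simple linear regression model examined here can be written as below. The subscript $i$ (indexing houses), the sample size $n$ and the error term $\varepsilon_i$ are notation introduced for illustration; $\alpha$ and $\beta$ are the intercept and slope that are tested later in the analysis.

```latex
\text{Price}_i = \alpha + \beta \, \text{Distance}_i + \varepsilon_i, \qquad i = 1, \dots, n
```

With a single predictor, the overall F-test and the t-test of $H_0: \beta = 0$ are equivalent, since $F = t^2$.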

[Photo: House in North Melbourne (realestate.com.au 2019)]

Data

The dataset used in this analysis is a subset of the original data retrieved from Kaggle, which was scraped by Tony Pino from Domain’s website.
The original data contains 34,857 records with 21 variables. It covers houses sold from January 2016 to the end of 2018.
However, for the purpose of this study, the data is filtered to 265 records with 11 variables by selecting only houses with two rooms, one bathroom and a distance of no more than 20 kilometres from the Melbourne CBD, since these conditions are the most suitable for young professionals, newlyweds and small families. In addition, only houses sold in 2018 were selected since these are the most recent sales.

Suburb	Address	Rooms	Type	Price	Day	Month	Year	Distance	Postcode	Bathroom	Car	YearBuilt
Abbotsford	68 Studley St	2	h	NA	3	09	2016	2.5	3067	1	1	NA
Abbotsford	85 Turner St	2	h	1480000	3	12	2016	2.5	3067	1	1	NA
Abbotsford	25 Bloomburg St	2	h	1035000	4	02	2016	2.5	3067	1	0	1900
Abbotsford	18/659 Victoria St	3	u	NA	4	02	2016	2.5	3067	2	1	NA
Abbotsford	5 Charles St	3	h	1465000	4	03	2017	2.5	3067	2	0	1900
Abbotsford	40 Federation La	3	h	850000	4	03	2017	2.5	3067	2	1	NA

Data (Cont.)

Most variables in the dataset are self-explanatory: Suburb, Address, Rooms, Type, Price, Date, Distance, Postcode, Bathroom, Car, YearBuilt. The most relevant variables for this study are listed below.
- Rooms: number of rooms in each house
- Price: sale price of the house
- Date: date the house was sold
- Distance: distance of each house from the Melbourne CBD (in km)
- Bathroom: number of bathrooms in each house

Data Pre-processing steps done in R

Tidy the Date variable by separating it into Day, Month and Year for later filtering.
Filter the dataset by number of rooms, number of bathrooms, year sold and distance, as described earlier.
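
The two steps above can be sketched in base R as follows. This is a minimal illustration on a made-up data frame, not the original script: the `raw` object, its values and the assumed `day/month/year` text format of `Date` are all hypothetical.

```r
# Toy stand-in for the raw data; every value here is made up for illustration.
raw <- data.frame(
  Rooms    = c(2, 3, 2, 2),
  Bathroom = c(1, 2, 1, 1),
  Distance = c(2.5, 2.5, 25.0, 11.2),
  Date     = c("3/09/2016", "4/03/2017", "6/01/2018", "6/01/2018"),
  stringsAsFactors = FALSE
)

# Step 1: split the Date text (assumed "day/month/year") into three columns.
parts <- do.call(rbind, strsplit(raw$Date, "/", fixed = TRUE))
raw$Day   <- as.integer(parts[, 1])
raw$Month <- parts[, 2]
raw$Year  <- as.integer(parts[, 3])
raw$Date  <- NULL

# Step 2: keep two-room, one-bathroom houses within 20 km that sold in 2018.
filtered <- subset(raw, Rooms == 2 & Bathroom == 1 & Distance <= 20 & Year == 2018)
filtered
```

Replacing one Date column with three is also why the variable count grows from 11 to 13 in the final dataset.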

Check for missing values and special values
- There are 56 missing values in the Price variable.
- Since the missing values in Price account for about 21% of the records, they are replaced by the mean rather than dropped.
- No special values are detected.
- Note: There are also missing values in the Car and YearBuilt variables, but they can be left as they are because they do not affect the analysis.
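
A minimal sketch of the mean-imputation step, using a short made-up price vector rather than the study data (the object names are hypothetical):

```r
# Made-up prices with two missing entries, for illustration only.
price <- c(432000, NA, 530000, 637000, NA)

n_missing     <- sum(is.na(price))    # count of missing prices
share_missing <- mean(is.na(price))   # proportion of missing prices

# Replace each missing price with the mean of the observed prices.
price_mean    <- mean(price, na.rm = TRUE)
price_imputed <- ifelse(is.na(price), price_mean, price)
price_imputed
```

Mean imputation keeps the sample mean unchanged but reduces the spread of the variable, which is worth keeping in mind when reading the later results.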

Data (Cont.)

Detecting bivariate outliers using the Mahalanobis distance: 17 outliers are found in the dataset.

Dealing with outliers
- Since the outliers make up only about 6.4% of the 265 records, they can be removed from the dataset. This is the last step of data pre-processing.

Houseprice0_clean <- Melbournehousingfilter[-c(1:17), ]
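
The line above drops rows by position, so it removes the outliers only if they occupy rows 1 to 17. As an alternative, the sketch below flags and drops rows by their Mahalanobis distance. It runs on simulated data, and both the chi-square cut-off at the 95% level and the object names are assumptions rather than the method used in the original analysis.

```r
set.seed(1)
# Simulated stand-in data with one planted extreme point (hypothetical values).
toy <- data.frame(
  Distance = c(runif(50, 1, 20), 3),
  Price    = c(rnorm(50, 900000, 150000), 3000000)
)

# Squared Mahalanobis distance of each (Distance, Price) pair from the centroid.
xy <- as.matrix(toy[, c("Distance", "Price")])
d2 <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))

# Assumed rule: flag points beyond the 95% chi-square quantile with 2 df.
cutoff     <- qchisq(0.95, df = 2)
is_outlier <- d2 > cutoff

# Drop the flagged rows by logical index rather than by position.
toy_clean <- toy[!is_outlier, ]
c(before = nrow(toy), flagged = sum(is_outlier), after = nrow(toy_clean))
```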

Preview of the final dataset: final data set has 248 observations and 13 variables.

Suburb	Address	Rooms	Type	Price	Day	Month	Year	Distance	Postcode	Bathroom	Car	YearBuilt
Epping	10A Cabot Dr	2	h	432000	6	01	2018	19.6	3076	1	1	1995
Fawkner	1/1 Clara St	2	u	412000	6	01	2018	13.1	3060	1	1	NA
Glenroy	70 Beatty Av	2	t	530000	6	01	2018	11.2	3046	1	2	2010
Glenroy	43 Bindi St	2	h	637000	6	01	2018	11.2	3046	1	2	NA
Glenroy	181 Daley St	2	h	628000	6	01	2018	11.2	3046	1	4	NA
Glenroy	36 Gladstone Pde	2	h	1245000	6	01	2018	11.2	3046	1	2	NA

dim(Houseprice0_clean)

## [1] 248  13

Descriptive Statistics and Visualisation

Summary statistics and histogram of Price to check normality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  385000  788125  979775  977615 1102000 2220000

Summary statistics and histogram of Distance to check normality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   5.100   6.700   7.865  10.600  19.900

Visualization

Based on the summary statistics, the mean and median of each variable are close to each other. The histograms of both show slight right skewness but are almost symmetric. Since the dataset has far more than 30 observations, the Central Limit Theorem implies that the sampling distributions of the estimates are approximately normal even though the variables themselves are not exactly normal. Thus, no data transformation is applied before the analysis.

Assign final variable names

Price <- as.numeric(Houseprice0_clean$Price)
Distance <- Houseprice0_clean$Distance

Observe the relationship of both variables in the scatter plot below

plot(Price ~ Distance, data = Houseprice0_clean,
     xlab = "Distances from CBD (in Km)", ylab = "House Prices")

Hypothesis Testing–Overall Model

Hypotheses for the overall linear regression model

$H_0$: The data does not fit the linear regression model.

$H_A$: The data fits the linear regression model.

An F-test is used to test the overall model.

Assumptions

The assumptions below will also be checked.

- Independence of residuals
- Linearity of residuals
- Normality of residuals (check after model is fitted)
- Homoscedasticity (check after model is fitted)

The linear regression model is fitted using the lm() function.

HousePrice <- lm(Price ~ Distance, data = Houseprice0_clean)
HousePrice %>% summary()

## 
## Call:
## lm(formula = Price ~ Distance, data = Houseprice0_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -592417 -190017  -35996  115621 1342832 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1110740      41858  26.536  < 2e-16 ***
## Distance      -16926       4753  -3.561 0.000443 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 296500 on 246 degrees of freedom
## Multiple R-squared:  0.04902,    Adjusted R-squared:  0.04516 
## F-statistic: 12.68 on 1 and 246 DF,  p-value: 0.0004433

The p-value for the F-test is small, F(1, 246) = 12.68, p < 0.001. As p < 0.05, $H_0$ is rejected. There is statistically significant evidence that the data fit a linear regression model.
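
As a cross-check, the F statistic can be recovered from the reported R-squared and the degrees of freedom. The sketch below uses only numbers printed in the summary output above; the variable names are illustrative.

```r
# Values taken from the summary() output above.
r_squared <- 0.04902   # Multiple R-squared
n         <- 248       # observations in the final dataset
k         <- 1         # number of predictors (Distance)

# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
f_stat  <- (r_squared / k) / ((1 - r_squared) / (n - k - 1))
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

# With one predictor, F equals the squared t value of the slope.
t_slope <- -3.561
c(F = f_stat, t_squared = t_slope^2, p = p_value)
```

The residual degrees of freedom are n - k - 1 = 246, which matches the "1 and 246 DF" line in the output.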

Linear Regression–Interpreting the intercept

The intercept is reported below as a = 1110740.24, meaning that when distance equals 0 the expected price of a house is $1,110,740.24.

HousePrice %>% summary() %>% coef()

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 1110740.24  41857.879 26.535989 3.859886e-74
## Distance     -16925.56   4752.918 -3.561088 4.432840e-04

The statistical hypotheses below are set to test the statistical significance of the intercept.
- $H_0: \alpha = 0$
- $H_A: \alpha \ne 0$

HousePrice %>% confint()

##                  2.5 %      97.5 %
## (Intercept) 1028294.69 1193185.787
## Distance     -26287.17   -7563.955

R reports the 95% CI for a to be [1028294.69, 1193185.787]. The interval does not capture the null value $\alpha = 0$, so $H_0$ is rejected.
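
The intervals returned by confint() can be reproduced by hand as estimate ± t critical value × standard error. The sketch below uses the estimates and standard errors from the coefficient table above; the object names are illustrative.

```r
# Estimates and standard errors copied from the coefficient table.
est <- c(Intercept = 1110740.24, Distance = -16925.56)
se  <- c(Intercept = 41857.879,  Distance = 4752.918)

# Two-sided 95% critical value of the t distribution with 246 residual df.
df_resid <- 246
t_crit   <- qt(0.975, df_resid)

ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```

Neither interval contains 0, which is the basis for rejecting the null hypothesis for each coefficient.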

Linear Regression–Interpreting Slope

The slope of the regression line is reported as b = -16925.56. This means that for each additional kilometre from the CBD, the house price decreases on average by $16,925.56.

HousePrice %>% summary() %>% coef()

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 1110740.24  41857.879 26.535989 3.859886e-74
## Distance     -16925.56   4752.918 -3.561088 4.432840e-04

The hypothesis test of the slope b is as follows:
- $H_0: \beta = 0$
- $H_A: \beta \ne 0$
From the earlier result of the confint() function, the 95% confidence interval for b is [-26287.17, -7563.955]. It does not capture the null value $\beta = 0$, so $H_0$ is rejected. Thus, there is statistically significant evidence that distance is negatively related to house price.

Fitting the Regression Model

The line of best fit: House Price = 1110740 - 16926 x Distance from CBD
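
As a worked example, the fitted equation can be wrapped in a small helper to read off expected prices at chosen distances. The function name `predict_price` is hypothetical and uses the rounded coefficients from the equation above.

```r
# Expected price from the fitted line, using the rounded coefficients.
predict_price <- function(distance_km) {
  1110740 - 16926 * distance_km
}

# Expected prices at 0, 5 and 10 km from the CBD.
predict_price(c(0, 5, 10))
```

The observed distances range from 1.6 km to 19.9 km, so values outside that range, including the intercept at 0 km, are extrapolations.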
Summarise Linear Relationship in a plot

Testing Assumptions

Assumptions

Before the final regression model can be reported, all the assumptions for linear regression mentioned earlier must be validated.

A residual reflects how far an observed value $y_i$ (actual) deviates from the value $\hat{y}_i$ (fitted) predicted by the line of best fit, as shown in the plot below. The length of each point’s line is a residual.
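
The definition of a residual can be checked directly in R. The sketch below fits a model to simulated data (the values are hypothetical, not the study's records) and confirms that residuals() returns the observed value minus the fitted value.

```r
set.seed(42)
# Simulated stand-in data; the coefficients and noise level are made up.
toy <- data.frame(Distance = runif(40, 1, 20))
toy$Price <- 1100000 - 17000 * toy$Distance + rnorm(40, sd = 250000)

fit <- lm(Price ~ Distance, data = toy)

# Residual = observed value minus fitted value.
res_manual <- toy$Price - fitted(fit)
same <- all.equal(unname(res_manual), unname(residuals(fit)))
same
```

For a least-squares fit with an intercept, the residuals also sum to zero up to rounding error.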

Testing Assumptions

Independence of Residuals

Independence is checked through the research design. Since the dataset is cross-sectional, with each observation collected at a single point in time, the independence of residuals is assumed to be met.

Linearity of Residuals

The plot shows a very slight curve, but the residuals are spread evenly around the horizontal line without a distinct pattern (close to flat). This is a good indication that the relationship is linear.

Normality of Residuals

As seen in the plot above, the residuals lie close to a straight line except at the upper end, where they move away from it. This is a fairly good indication that they are approximately normally distributed.

Testing Assumptions (Cont.)

Homoscedasticity

The residuals are reasonably well spread above and below a fairly horizontal line, although at the beginning of the line there are fewer points, so the variance is slightly lower there. Nevertheless, homoscedasticity can still be assumed.

Residuals vs Leverage

No values fall outside the bands; therefore, there is no evidence of influential cases.
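
The diagnostic plots discussed in this section are the standard ones produced by plot() on an lm object. The sketch below draws them for a model fitted to simulated data (hypothetical values) and also counts cases whose Cook's distance exceeds 0.5, the inner band drawn by default on the residuals vs leverage plot.

```r
set.seed(7)
# Simulated stand-in data; values are hypothetical, not the study's records.
toy <- data.frame(Distance = runif(60, 1, 20))
toy$Price <- 1100000 - 17000 * toy$Distance + rnorm(60, sd = 250000)
fit <- lm(Price ~ Distance, data = toy)

# Residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage.
pdf(NULL)   # null device so the sketch runs in batch mode; drop it to view the plots
par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 3, 5))
invisible(dev.off())

# Numeric counterpart of the leverage plot: Cook's distance per observation.
cooks     <- cooks.distance(fit)
n_flagged <- sum(cooks > 0.5)
n_flagged
```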

Strength and Direction of Linear Relationship

$r$ = -0.22, which means the correlation between the two variables is negative and weak.

r <- cor(Price, Distance, use = "complete.obs")
r

## [1] -0.2214115

A hypothesis test for r has the following statistical hypotheses:
- $H_0:r=0$
- $H_A:r \ne 0$

CIr(r, n = 248, level = 0.95)

## [1] -0.33669245 -0.09959113

The confidence interval does not capture the null value $r = 0$, so $H_0$ is rejected. There is a statistically significant negative correlation between house price and distance from the CBD.
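
A confidence interval for a correlation is commonly built with Fisher's z-transformation. The sketch below assumes that approach (it is not taken from the CIr() source) and uses the r and n reported above; its result agrees with the CIr() output.

```r
r <- -0.2214115   # sample correlation reported above
n <- 248          # number of observations

# Fisher's z-transformation and its approximate standard error.
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)

# 95% interval on the z scale, transformed back to the r scale.
ci_r <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)
ci_r
```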

r^2

## [1] 0.04902305

$r^{2}$ is 0.049, which means only 4.9% of the variability in house price can be explained by a linear relationship with distance.

Discussion

Major Findings

A linear regression model is fitted to predict the house price in Melbourne (dependent variable) from the distance from the CBD (independent variable).
There are no influential cases.
Linearity is assumed. Normality of residuals and homoscedasticity are checked and validated.
The overall regression model is statistically significant, F(1, 246) = 12.68, p < 0.001, and explains 4.9% of the variability in house price, $R^{2}$ = 0.049.
The estimated regression equation is House Price = 1110740 - 16926 x Distance from CBD.
The intercept is statistically significant, a = 1110740, 95% CI [1028294.69, 1193185.787].
The negative slope for distance is statistically significant, b = -16926, 95% CI [-26287.17, -7563.955].

Conclusion

Our analysis aimed to give young professionals, newlywed couples and small families a better understanding of the relationship between house price and distance from the CBD, to help them make better decisions when looking to buy a home.

Discussion (Cont.)

The result shows that there is a statistically significant negative linear relationship between distance from the CBD and house price in Melbourne. This means the distance of a house from the CBD can be used to explain house prices in Melbourne, as follows.
- The average price of a house in the selected category at zero distance from the CBD is expected to be $1,110,740.
- The mean price of a house in the selected category decreases by $16,926 on average for each additional kilometre of distance from the CBD.

Limitations and Future Analysis

The current dataset was scraped from domain.com.au. Although it is one of the leading property websites in Melbourne, it does not cover all houses listed throughout Melbourne.
Furthermore, only houses sold in 2018 that have 2 rooms and 1 bathroom and are no further than 20 km from the CBD were included. Thus, the result cannot represent the relationship between price and distance for houses in other categories. For future analysis, one could work on a more representative sample to explore the relationship between the two variables.
In addition, many factors generally influence house price, such as house size, number of rooms, location, suburb, demand and supply, to name a few. In this analysis, only the simple linear relationship between distance and price is explored. As seen in the result, only 4.9% of the variability in house price can be explained by distance. In future investigations, a multiple linear regression of house price on other factors (such as house size and location) should be explored.

References

Pino, T. (2019). Melbourne Housing Market. [online] Kaggle.com. Available at: https://www.kaggle.com/anthonypino/melbourne-housing-market [Accessed 18 Oct. 2019].
Tweddale, A. (2015). How to Buy Your First Home as a Newlywed Couple. [online] GOBankingRates. Available at: https://www.gobankingrates.com/investing/real-estate/buy-first-home-newlywed-couple/ [Accessed 22 Oct. 2019].
Fuary-Wagner, I. (2019). House price growth ‘close to boom times’. [online] Financial Review. Available at: https://www.afr.com/property/residential/rapid-bounce-back-house-prices-gain-momentum-20190930-p52wda [Accessed 22 Oct. 2019].
Chalkley-Rhoden, S. (2017). World’s most liveable city: Melbourne takes top spot for seventh year running. [online] ABC. Available at: https://www.abc.net.au/news/2017-08-16/melbourne-named-worlds-most-liveable-city-for-seventh-year/8812196 [Accessed 24 Oct. 2019].
TransferWise (2018). Buy a house in Australia guide. [online] The Telegraph. Available at: https://www.telegraph.co.uk/money/transferwise/buy-a-house-in-australia/ [Accessed 24 Oct. 2019].
Chen, J. & Hao, Q. (2008). The Impacts of Distance to CBD on Housing Prices in Shanghai: A Hedonic Analysis. [online] Research Gate. Available at: https://www.researchgate.net/publication/24085146_The_impacts_of_distance_to_CBD_on_housing_prices_in_Shanghai_a_hedonic_analysis [Accessed 22 Oct. 2019].
Melbourne’s Skyline (2018). Melbourne prepares to overtake Sydney as biggest city. [Digital photography]. The Australian. Available at: https://www.theaustralian.com.au/nation/inquirer/melbourne-prepares-to-overtake-sydney-as-biggest-city/news-story/bf607f69c9959efb0d750bd1a9169917 [Accessed 22 Oct. 2019].
House in North Melbourne (2019). 273 Flemington Road, North Melbourne, Vic 3051. [Digital photography]. realestate.com.au. Available at: https://www.realestate.com.au/property-house-vic-north+melbourne-129480618 [Accessed 22 Oct. 2019].

Housing Price in Melbourne

Can distance of houses from the CBD be used to predict Melbourne house prices?
