Project Title: “Analyzing the influence of ‘Distance from CBD’ and ‘Land Size’ on the Price of Houses in Melbourne”

Name: “Kunal Bhattacharya”

Email: “kunal.jubce@gmail.com

Company : “Virtusa Consulting Services”

output: html_document

1. Introduction

Real estate is “property consisting of land and the buildings on it, along with it natural resources such as crops, minerals or water”. The business of real estate is the profession of buying, selling, or renting land, buildings or housing. Residential real estate may contain either a single family or multifamily structure that is available for occupation or for non-business purposes. Residences can be classified by if and how they are connected to neighbouring residences and land -

Attached

1. Apartment or Flat
2. Multi-family house
3. Terraced House/Townhouse
4. Condominium
5. Cooperative

Semi-attached

1. Duplex

Detached

1. Detached house

2. Overview of the study

My field study concerns the housing rates in the various regions of Melbourne, Australia. As of November 2017, Melbourne’s house price growth has exceeded Sydney’s for the previous twelve months with signs of renewed momentum in the latter half of that period after showing signs of slowdown. Experts are predicting the slowed trends to continue in 2018 due to APRA’s (Australian Prudential Regulatory Authority) tightening on investor borrowing and interest-only loans which has also led to less investors. This also indicates a possibility of a comeback of first-time property buyers who can now get low-interest loans more easily than before. On these lines, I believe that it would be extremely interesting to study what factors influence the housing prices in certain highly populated regions.

3. An empirical study of the house prices in various regions of Melbourne

3.1 Overview

Melbourne is currently experiencing a housing bubble (some experts say it may burst soon). The specific objective of this study is to identify the influence of “Distance from CBD (Central Business District)” and “Landsize” that affect the prices of houses in Melbourne. The dataset includes Address, Type of Real estate, Suburb, Method of Selling, Rooms, Price, Real Estate Agent, Date of Sale and distance from C.B.D.

Hypothesis H1 - The price of houses is correlated with Distance from CBD and landsize.

3.2 Data

The data was scraped from publicly available results posted every week from Domain.com.au. The variables used in our analysis are as follows:

Suburb: Suburb

Rooms: Number of rooms

Price: Price in dollars

Method: Status of property sale -

S - property sold; 
SP - property sold prior; 
PI - property passed in; 
PN - sold prior not disclosed; 
SN - sold not disclosed; 
VB - vendor bid; 
W -  withdrawn prior to auction; 
SA - sold after auction; 
SS - sold after auction price not disclosed.

Type: Type of house -

h - house,cottage,villa, semi,terrace; 
u - unit, duplex; 
t - townhouse;

SellerG: Real Estate Agent

Date: Date sold

Distance: Distance from CBD

Regionname: General Region (West, North West, North, North east …etc)

Propertycount: Number of properties that exist in the suburb.

Bedroom2: Number of Bedrooms

Bathroom: Number of Bathrooms

Car: Number of carspots

Landsize: Land Size

YearBuilt: Year the house was built

3.3 Model

In order to test Hypothesis H1, we propose the following model:

\[Price= \alpha_0 + \alpha_1 Distance + \alpha_2 Bedroom2 + \alpha_3 Bathroom + \alpha_4 Car + \alpha_5 Landsize + \epsilon\]

Reading the data into R

raw_data.df <- read.csv(paste("F:/Data Analytics for Managerial Applications/Capstone Project/Melbourne_housing_data.csv", sep = ""))

Data Cleansing

Here we see that there are a lot of missing and junk values in the data set which we need to cleanse. For Eg., Price, Bedroom2 and Bathroom have quite a lot of “NA” values (missing) and Landsize has a lot junk values such as 0,1,3,10 and so on, which are obviously too less to build a house on.

So, to cleanse our dataset, we do the following: 1. Filter out the NA values in the columns - Price, Bedroom2, Bathroom, Car and YearBuilt 2. Filter out all data that has landsize < 121. The reason behind this figure of 121 : Considering that landsize has been mentioned as square units, we take 121 i.e. square of 11 as a reasonable entity to build our house on.

housing.df <- raw_data.df[which(raw_data.df$Price != 'NA' & raw_data.df$Bedroom2 != 'NA' & raw_data.df$Bathroom != 'NA' & raw_data.df$Car != 'NA' & raw_data.df$YearBuilt != 'NA' & raw_data.df$Landsize > 120),]
housing.df$Distance <- as.numeric(as.character(housing.df$Distance)) ## to change datatype of Distance from Factor to Numeric
lmodel <- lm(Price ~ Distance + Bedroom2 + Bathroom + Car + Landsize + YearBuilt, data = housing.df)
summary(lmodel)
## 
## Call:
## lm(formula = Price ~ Distance + Bedroom2 + Bathroom + Car + Landsize + 
##     YearBuilt, data = housing.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4198601  -292258   -76087   195442  8353672 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.151e+07  3.390e+05  33.959   <2e-16 ***
## Distance    -3.008e+04  9.876e+02 -30.460   <2e-16 ***
## Bedroom2     1.691e+05  8.643e+03  19.561   <2e-16 ***
## Bathroom     3.465e+05  1.042e+04  33.262   <2e-16 ***
## Car          6.361e+04  6.373e+03   9.981   <2e-16 ***
## Landsize     1.269e+01  5.193e+00   2.444   0.0145 *  
## YearBuilt   -5.728e+03  1.746e+02 -32.815   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 514900 on 7764 degrees of freedom
## Multiple R-squared:  0.4504, Adjusted R-squared:   0.45 
## F-statistic:  1060 on 6 and 7764 DF,  p-value: < 2.2e-16

We established the effect of various factors on the price of houses in Melbourne with the simplest model. We regressed Price on no.of rooms, car slots, YearBuilt and distance of CBD. We estimated model, using linear least squares.

3.4 Results

We found empirical support for H1. The F-statistic and overall p-value (2.2e-16 < 0.05) suggests that the model is overall a good model. The value of adjusted R-squared (=0.45) suggests that the selected explanatory variables explain 45% of the variance while also suggesting that there are few other variables that account for the price. Further, the beta-coefficients of all explanatory variables have a significant p-value (<0.05) which suggests all of them are correlated with Price.

4. Conclusion

This paper was motivated by the need for research that could improve our understanding of how various factors influences the prices of houses in the Melbourne real-estate industry. We found that Distance from CBD is a inversely proportional to prices while all other factors are also correlated to the price.

5. References

Dataset compiled by Anthony Pino - https://www.kaggle.com/anthonypino/melbourne-housing-market

Residential_Research_Melbourne_Q3-2017.pdf - A research paper of JLL Research

Five property market trends to expect in 2018 - Chris Kohler, www.domain.com.au

Appendix - 1

Descriptive Statistics

head(housing.df)
##        Suburb Rooms Type   Price Method SellerG       Date Distance
## 3  Abbotsford     2    h 1035000      S  Biggin 04-02-2016      2.5
## 5  Abbotsford     3    h 1465000     SP  Biggin 04-03-2017      2.5
## 12 Abbotsford     3    h 1876000      S  Nelson 07-05-2016      2.5
## 15 Abbotsford     2    h 1636000      S  Nelson 08-10-2016      2.5
## 19 Abbotsford     2    h 1097000      S  Biggin 08-10-2016      2.5
## 25 Abbotsford     3    h 1350000     VB  Nelson 12-11-2016      2.5
##    Postcode Bedroom2 Bathroom Car Landsize YearBuilt            Regionname
## 3      3067        2        1   0      156      1900 Northern Metropolitan
## 5      3067        3        2   0      134      1900 Northern Metropolitan
## 12     3067        4        2   0      245      1910 Northern Metropolitan
## 15     3067        2        1   2      256      1890 Northern Metropolitan
## 19     3067        3        1   2      220      1900 Northern Metropolitan
## 25     3067        3        2   2      214      2005 Northern Metropolitan
##    Propertycount
## 3           4019
## 5           4019
## 12          4019
## 15          4019
## 19          4019
## 25          4019
dim(housing.df)
## [1] 7771   16

Now we see that there are no more ‘NA’ values and even the Landsizes look more practical for consideration. We have 16 columns and 7771 rows of data.

Summary Statistics of the dataset

To check for the datatypes in columns and also to get the summary statistics for numerical variables:

housing.df$Distance <- as.numeric(as.character(housing.df$Distance)) ## to change datatype of Distance from Factor to Numeric
str(housing.df)
## 'data.frame':    7771 obs. of  16 variables:
##  $ Suburb       : Factor w/ 347 levels "Abbotsford","Aberfeldie",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Rooms        : int  2 3 3 2 2 3 2 3 2 4 ...
##  $ Type         : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Price        : int  1035000 1465000 1876000 1636000 1097000 1350000 1310000 1176500 890000 1330000 ...
##  $ Method       : Factor w/ 9 levels "PI","PN","S",..: 3 6 3 3 3 8 3 3 3 1 ...
##  $ SellerG      : Factor w/ 366 levels "@Realty","A",..: 34 34 233 233 34 233 162 191 34 175 ...
##  $ Date         : Factor w/ 69 levels "01-07-2017","03-06-2017",..: 6 7 13 18 18 28 34 36 39 40 ...
##  $ Distance     : num  2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ Postcode     : Factor w/ 209 levels "#N/A","3000",..: 55 55 55 55 55 55 55 55 55 55 ...
##  $ Bedroom2     : int  2 3 4 2 3 3 2 2 2 4 ...
##  $ Bathroom     : int  1 2 2 1 1 2 1 1 1 2 ...
##  $ Car          : int  0 0 0 2 2 2 2 1 1 2 ...
##  $ Landsize     : int  156 134 245 256 220 214 238 138 150 780 ...
##  $ YearBuilt    : int  1900 1900 1910 1890 1900 2005 1890 1890 1985 1900 ...
##  $ Regionname   : Factor w/ 9 levels "#N/A","Eastern Metropolitan",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Propertycount: Factor w/ 340 levels "#N/A","1008",..: 189 189 189 189 189 189 189 189 189 189 ...
library(psych)
describe(housing.df[,c(2,4,8,10:13)])
##          vars    n       mean        sd   median    trimmed       mad
## Rooms       1 7771       3.23      0.90      3.0       3.20      1.48
## Price       2 7771 1164559.68 694199.51 960500.0 1057799.10 485551.50
## Distance    3 7771      11.58      6.60     10.7      10.76      4.89
## Bedroom2    4 7771       3.20      0.91      3.0       3.18      1.48
## Bathroom    5 7771       1.68      0.74      2.0       1.58      1.48
## Car         6 7771       1.76      1.00      2.0       1.67      1.48
## Landsize    7 7771     624.24   1133.77    545.0     509.90    269.83
##               min       max     range  skew kurtosis      se
## Rooms         1.0      12.0      11.0  0.54     2.43    0.01
## Price    131000.0 9000000.0 8869000.0  2.38    10.71 7874.91
## Distance      0.7      47.4      46.7  1.62     4.09    0.07
## Bedroom2      0.0      12.0      12.0  0.58     2.84    0.01
## Bathroom      0.0       9.0       9.0  1.23     3.67    0.01
## Car           0.0      10.0      10.0  1.40     5.45    0.01
## Landsize    121.0   40469.0   40348.0 17.70   443.93   12.86

Contingency tables

To check for number of different types of property per region:

region_by_type <- xtabs(~ Regionname+Type, data = housing.df)
region_by_type
##                             Type
## Regionname                      h    t    u
##   #N/A                          0    0    0
##   Eastern Metropolitan        814   71   78
##   Eastern Victoria             45    0    2
##   Northern Metropolitan      1795  142  227
##   Northern Victoria            47    0    0
##   South-Eastern Metropolitan  305   13   19
##   Southern Metropolitan      1739  218  367
##   Western Metropolitan       1587  138  126
##   Western Victoria             38    0    0

Here, Type is the type of house in the dataset: h -> house,cottage,villa, semi,terrace; t -> townhouse; u -> unit, duplex;

To check for status of property sale per type per region:

sell_type_region <- xtabs(~ Regionname+Type+Method, data = housing.df)
ftable(sell_type_region)
##                                 Method   PI   PN    S   SA   SN   SP   SS   VB    W
## Regionname                 Type                                                    
## #N/A                       h              0    0    0    0    0    0    0    0    0
##                            t              0    0    0    0    0    0    0    0    0
##                            u              0    0    0    0    0    0    0    0    0
## Eastern Metropolitan       h            115    0  540    5    0   85    0   69    0
##                            t             10    0   44    1    0    8    0    8    0
##                            u              7    0   62    0    0    4    0    5    0
## Eastern Victoria           h              5    0   24    2    0   12    0    2    0
##                            t              0    0    0    0    0    0    0    0    0
##                            u              0    0    1    0    0    1    0    0    0
## Northern Metropolitan      h            169    0 1270    9    0  239    0  108    0
##                            t             12    0   96    0    0   21    0   13    0
##                            u             17    0  133    2    0   56    0   19    0
## Northern Victoria          h              0    0   32    2    0   13    0    0    0
##                            t              0    0    0    0    0    0    0    0    0
##                            u              0    0    0    0    0    0    0    0    0
## South-Eastern Metropolitan h             23    0  213    2    0   56    0   11    0
##                            t              0    0   13    0    0    0    0    0    0
##                            u              1    0   14    0    0    4    0    0    0
## Southern Metropolitan      h            288    0 1060   15    0  113    0  263    0
##                            t             32    0  147    2    0   19    0   18    0
##                            u             45    0  243    2    0   39    0   38    0
## Western Metropolitan       h            154    0 1023    6    0  312    0   92    0
##                            t             18    0   78    0    0   30    0   12    0
##                            u             14    0   73    3    0   26    0   10    0
## Western Victoria           h              3    0   30    0    0    5    0    0    0
##                            t              0    0    0    0    0    0    0    0    0
##                            u              0    0    0    0    0    0    0    0    0

Boxplot to check the data and outliers

To draw a boxplot to check for price in different regions:

par(mar=c(3.1,12,4.1,2.1), mgp = c(11, 1, 0))
boxplot(housing.df$Price ~ housing.df$Regionname, horizontal = TRUE, ylab = "Region Name", xlab = "Price of houses", main = "Boxplot of price of houses by region", las = 1)

Insights:

To draw a boxplot to check for landsize in different regions:

par(mar=c(3.1,12,0.95,2.1), mgp = c(11, 1, 0), mfrow = c(2,1))
boxplot(housing.df$Landsize ~ housing.df$Regionname, horizontal = TRUE, xlab = "Landsize of houses", ylab = "Region Name", main = "Boxplot of landsize of houses by region", las = 1)
boxplot(housing.df$Distance ~ housing.df$Type, horizontal = TRUE, ylab = "Type of House", xlab = "Distance from CBD", main = "Boxplot of distance from CBD vs type of houses", las = 1)

Histograms

par(mfrow = c(2,2), mar=c(3.1,3.1,0.95,0))
hist(housing.df$Distance, breaks = 40, xlim = c(0,50), ylim = c(0,800),xlab = "Distance from CBD", col = "Green", main = "Histogram of Distances from CBD", las =1)
hist(housing.df$Bedroom2, breaks = 40, xlim = c(0,10), ylim = c(0,4000),xlab = "No. of Bedrooms", col = "Red", main = "Histogram of no. of bedrooms in houses", las =1)
hist(housing.df$Bathroom, breaks = 40, xlim = c(0,10), ylim = c(0,4000),xlab = "No. of Bathrooms", col = "Blue", main = "Histogram of no. of bathrooms in houses", las =1)
hist(housing.df$Car, breaks = 40, xlim = c(0,10), ylim = c(0,4000),xlab = "No. of Carslots", col = "gray33", main = "Histogram of no. of carslots in houses", las =1)

From the Distance from CBD histogram, we can see a very strong right skew in the data indicating that very few of the houses are extremely far away from CBD. From the remaining 3 histograms, we can infer that the majority of houses have 2-4 bedrooms, 1-2 bathrooms and 1-2 carslots. However, there are outliers in each histogram.

Plotting a scatterplot matrix for various variables

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(housing.df[,c(4,8,10:13)], spread = FALSE, smoother.args = list(lty = 2), main = "Scatter Plot Matrix")

Correlation Matrix

To get the correlation matrix for all numeric variables in the dataset, rounded to 2 decimal places:

round(cor(housing.df[,c(2,4,8,10:13)]),2)
##          Rooms Price Distance Bedroom2 Bathroom  Car Landsize
## Rooms     1.00  0.42     0.19     0.95     0.61 0.34    -0.02
## Price     0.42  1.00    -0.31     0.40     0.44 0.17     0.00
## Distance  0.19 -0.31     1.00     0.20     0.08 0.21     0.04
## Bedroom2  0.95  0.40     0.20     1.00     0.61 0.35    -0.02
## Bathroom  0.61  0.44     0.08     0.61     1.00 0.28     0.04
## Car       0.34  0.17     0.21     0.35     0.28 1.00     0.08
## Landsize -0.02  0.00     0.04    -0.02     0.04 0.08     1.00

Corrgram

library(corrgram)
corrgram(housing.df[,c(2,4,8,10:13)], order=FALSE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main="Corrgram of housing data intercorrelations")

From the corrgram, we see that there’s a strong positive correlation between No. of Rooms, Bathrooms, carslots and Bedrooms with price while there is a negative correlation between Distance from CBD and price, both of which make sense. There does not appear to be a strong correlation between landsize and price, which we need to check further.

Hypothesis Testing

Testing Correlation between landsize and Price of houses

Null Hypothesis: There is no significant difference between landsize and Price of houses i.e. there is no correlation between the two variable. Alternative Hypothesis: There is a significant difference between landsize and price and they are correlated.

Using the Pearson’s Correlation test:

cor.test(housing.df$Price, housing.df$Landsize)
## 
##  Pearson's product-moment correlation
## 
## data:  housing.df$Price and housing.df$Landsize
## t = 0.14774, df = 7769, p-value = 0.8826
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02055883  0.02390948
## sample estimates:
##         cor 
## 0.001676152

p-value = 0.8826(>0.05) is not significant and suggests that we fail to reject the null hypothesis that there is no correlation between land size and price.

Testing Correlation between Distance from CBD and Price of houses

Null Hypothesis: There is no significant difference between Distance from CBD and Price of houses i.e. there is no correlation between the two variable. Alternative Hypothesis: There is a significant difference between Distance from CBD and price and they are correlated.

Using the Pearson’s Correlation test:

cor.test(housing.df$Price, housing.df$Distance)
## 
##  Pearson's product-moment correlation
## 
## data:  housing.df$Price and housing.df$Distance
## t = -28.46, df = 7769, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3272682 -0.2869964
## sample estimates:
##        cor 
## -0.3072699

We have a significant p-value here (2.2e-16 << 0.05) and thus we reject the null hypothesis. We conclude there is indeed a correlation between distance from CBD and price.

Testing Correlation between no. of rooms and Price of houses

Null Hypothesis: There is no significant difference between No. of rooms and Price of houses i.e. there is no correlation between the two variable. Alternative Hypothesis: There is a significant difference between No. of rooms and price and they are correlated.

Using the Pearson’s Correlation test:

cor.test(housing.df$Price, housing.df$Rooms)
## 
##  Pearson's product-moment correlation
## 
## data:  housing.df$Price and housing.df$Rooms
## t = 40.806, df = 7769, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4016376 0.4382605
## sample estimates:
##       cor 
## 0.4201201

Here again we have a significant p-value here (2.2e-16 << 0.05) and thus we reject the null hypothesis. We conclude there is indeed a correlation between No. of rooms and price.