Real estate is “property consisting of land and the buildings on it, along with it natural resources such as crops, minerals or water”. The business of real estate is the profession of buying, selling, or renting land, buildings or housing. Residential real estate may contain either a single family or multifamily structure that is available for occupation or for non-business purposes. Residences can be classified by if and how they are connected to neighbouring residences and land -
Attached
1. Apartment or Flat
2. Multi-family house
3. Terraced House/Townhouse
4. Condominium
5. Cooperative
Semi-attached
1. Duplex
Detached
1. Detached house
My field study concerns the housing rates in the various regions of Melbourne, Australia. As of November 2017, Melbourne’s house price growth has exceeded Sydney’s for the previous twelve months with signs of renewed momentum in the latter half of that period after showing signs of slowdown. Experts are predicting the slowed trends to continue in 2018 due to APRA’s (Australian Prudential Regulatory Authority) tightening on investor borrowing and interest-only loans which has also led to less investors. This also indicates a possibility of a comeback of first-time property buyers who can now get low-interest loans more easily than before. On these lines, I believe that it would be extremely interesting to study what factors influence the housing prices in certain highly populated regions.
Melbourne is currently experiencing a housing bubble (some experts say it may burst soon). The specific objective of this study is to identify the influence of “Distance from CBD (Central Business District)” and “Landsize” that affect the prices of houses in Melbourne. The dataset includes Address, Type of Real estate, Suburb, Method of Selling, Rooms, Price, Real Estate Agent, Date of Sale and distance from C.B.D.
Hypothesis H1 - The price of houses is correlated with Distance from CBD and landsize.
The data was scraped from publicly available results posted every week from Domain.com.au. The variables used in our analysis are as follows:
Suburb: Suburb
Rooms: Number of rooms
Price: Price in dollars
Method: Status of property sale -
S - property sold;
SP - property sold prior;
PI - property passed in;
PN - sold prior not disclosed;
SN - sold not disclosed;
VB - vendor bid;
W - withdrawn prior to auction;
SA - sold after auction;
SS - sold after auction price not disclosed.
Type: Type of house -
h - house,cottage,villa, semi,terrace;
u - unit, duplex;
t - townhouse;
SellerG: Real Estate Agent
Date: Date sold
Distance: Distance from CBD
Regionname: General Region (West, North West, North, North east …etc)
Propertycount: Number of properties that exist in the suburb.
Bedroom2: Number of Bedrooms
Bathroom: Number of Bathrooms
Car: Number of carspots
Landsize: Land Size
YearBuilt: Year the house was built
In order to test Hypothesis H1, we propose the following model:
\[Price= \alpha_0 + \alpha_1 Distance + \alpha_2 Bedroom2 + \alpha_3 Bathroom + \alpha_4 Car + \alpha_5 Landsize + \epsilon\]
Reading the data into R
raw_data.df <- read.csv(paste("F:/Data Analytics for Managerial Applications/Capstone Project/Melbourne_housing_data.csv", sep = ""))
Data Cleansing
Here we see that there are a lot of missing and junk values in the data set which we need to cleanse. For Eg., Price, Bedroom2 and Bathroom have quite a lot of “NA” values (missing) and Landsize has a lot junk values such as 0,1,3,10 and so on, which are obviously too less to build a house on.
So, to cleanse our dataset, we do the following: 1. Filter out the NA values in the columns - Price, Bedroom2, Bathroom, Car and YearBuilt 2. Filter out all data that has landsize < 121. The reason behind this figure of 121 : Considering that landsize has been mentioned as square units, we take 121 i.e. square of 11 as a reasonable entity to build our house on.
housing.df <- raw_data.df[which(raw_data.df$Price != 'NA' & raw_data.df$Bedroom2 != 'NA' & raw_data.df$Bathroom != 'NA' & raw_data.df$Car != 'NA' & raw_data.df$YearBuilt != 'NA' & raw_data.df$Landsize > 120),]
housing.df$Distance <- as.numeric(as.character(housing.df$Distance)) ## to change datatype of Distance from Factor to Numeric
lmodel <- lm(Price ~ Distance + Bedroom2 + Bathroom + Car + Landsize + YearBuilt, data = housing.df)
summary(lmodel)
##
## Call:
## lm(formula = Price ~ Distance + Bedroom2 + Bathroom + Car + Landsize +
## YearBuilt, data = housing.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4198601 -292258 -76087 195442 8353672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.151e+07 3.390e+05 33.959 <2e-16 ***
## Distance -3.008e+04 9.876e+02 -30.460 <2e-16 ***
## Bedroom2 1.691e+05 8.643e+03 19.561 <2e-16 ***
## Bathroom 3.465e+05 1.042e+04 33.262 <2e-16 ***
## Car 6.361e+04 6.373e+03 9.981 <2e-16 ***
## Landsize 1.269e+01 5.193e+00 2.444 0.0145 *
## YearBuilt -5.728e+03 1.746e+02 -32.815 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 514900 on 7764 degrees of freedom
## Multiple R-squared: 0.4504, Adjusted R-squared: 0.45
## F-statistic: 1060 on 6 and 7764 DF, p-value: < 2.2e-16
We established the effect of various factors on the price of houses in Melbourne with the simplest model. We regressed Price on no.of rooms, car slots, YearBuilt and distance of CBD. We estimated model, using linear least squares.
We found empirical support for H1. The F-statistic and overall p-value (2.2e-16 < 0.05) suggests that the model is overall a good model. The value of adjusted R-squared (=0.45) suggests that the selected explanatory variables explain 45% of the variance while also suggesting that there are few other variables that account for the price. Further, the beta-coefficients of all explanatory variables have a significant p-value (<0.05) which suggests all of them are correlated with Price.
This paper was motivated by the need for research that could improve our understanding of how various factors influences the prices of houses in the Melbourne real-estate industry. We found that Distance from CBD is a inversely proportional to prices while all other factors are also correlated to the price.
Dataset compiled by Anthony Pino - https://www.kaggle.com/anthonypino/melbourne-housing-market
Residential_Research_Melbourne_Q3-2017.pdf - A research paper of JLL Research
Five property market trends to expect in 2018 - Chris Kohler, www.domain.com.au
head(housing.df)
## Suburb Rooms Type Price Method SellerG Date Distance
## 3 Abbotsford 2 h 1035000 S Biggin 04-02-2016 2.5
## 5 Abbotsford 3 h 1465000 SP Biggin 04-03-2017 2.5
## 12 Abbotsford 3 h 1876000 S Nelson 07-05-2016 2.5
## 15 Abbotsford 2 h 1636000 S Nelson 08-10-2016 2.5
## 19 Abbotsford 2 h 1097000 S Biggin 08-10-2016 2.5
## 25 Abbotsford 3 h 1350000 VB Nelson 12-11-2016 2.5
## Postcode Bedroom2 Bathroom Car Landsize YearBuilt Regionname
## 3 3067 2 1 0 156 1900 Northern Metropolitan
## 5 3067 3 2 0 134 1900 Northern Metropolitan
## 12 3067 4 2 0 245 1910 Northern Metropolitan
## 15 3067 2 1 2 256 1890 Northern Metropolitan
## 19 3067 3 1 2 220 1900 Northern Metropolitan
## 25 3067 3 2 2 214 2005 Northern Metropolitan
## Propertycount
## 3 4019
## 5 4019
## 12 4019
## 15 4019
## 19 4019
## 25 4019
dim(housing.df)
## [1] 7771 16
Now we see that there are no more ‘NA’ values and even the Landsizes look more practical for consideration. We have 16 columns and 7771 rows of data.
To check for the datatypes in columns and also to get the summary statistics for numerical variables:
housing.df$Distance <- as.numeric(as.character(housing.df$Distance)) ## to change datatype of Distance from Factor to Numeric
str(housing.df)
## 'data.frame': 7771 obs. of 16 variables:
## $ Suburb : Factor w/ 347 levels "Abbotsford","Aberfeldie",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Rooms : int 2 3 3 2 2 3 2 3 2 4 ...
## $ Type : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 1 1 1 1 1 ...
## $ Price : int 1035000 1465000 1876000 1636000 1097000 1350000 1310000 1176500 890000 1330000 ...
## $ Method : Factor w/ 9 levels "PI","PN","S",..: 3 6 3 3 3 8 3 3 3 1 ...
## $ SellerG : Factor w/ 366 levels "@Realty","A",..: 34 34 233 233 34 233 162 191 34 175 ...
## $ Date : Factor w/ 69 levels "01-07-2017","03-06-2017",..: 6 7 13 18 18 28 34 36 39 40 ...
## $ Distance : num 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
## $ Postcode : Factor w/ 209 levels "#N/A","3000",..: 55 55 55 55 55 55 55 55 55 55 ...
## $ Bedroom2 : int 2 3 4 2 3 3 2 2 2 4 ...
## $ Bathroom : int 1 2 2 1 1 2 1 1 1 2 ...
## $ Car : int 0 0 0 2 2 2 2 1 1 2 ...
## $ Landsize : int 156 134 245 256 220 214 238 138 150 780 ...
## $ YearBuilt : int 1900 1900 1910 1890 1900 2005 1890 1890 1985 1900 ...
## $ Regionname : Factor w/ 9 levels "#N/A","Eastern Metropolitan",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Propertycount: Factor w/ 340 levels "#N/A","1008",..: 189 189 189 189 189 189 189 189 189 189 ...
library(psych)
describe(housing.df[,c(2,4,8,10:13)])
## vars n mean sd median trimmed mad
## Rooms 1 7771 3.23 0.90 3.0 3.20 1.48
## Price 2 7771 1164559.68 694199.51 960500.0 1057799.10 485551.50
## Distance 3 7771 11.58 6.60 10.7 10.76 4.89
## Bedroom2 4 7771 3.20 0.91 3.0 3.18 1.48
## Bathroom 5 7771 1.68 0.74 2.0 1.58 1.48
## Car 6 7771 1.76 1.00 2.0 1.67 1.48
## Landsize 7 7771 624.24 1133.77 545.0 509.90 269.83
## min max range skew kurtosis se
## Rooms 1.0 12.0 11.0 0.54 2.43 0.01
## Price 131000.0 9000000.0 8869000.0 2.38 10.71 7874.91
## Distance 0.7 47.4 46.7 1.62 4.09 0.07
## Bedroom2 0.0 12.0 12.0 0.58 2.84 0.01
## Bathroom 0.0 9.0 9.0 1.23 3.67 0.01
## Car 0.0 10.0 10.0 1.40 5.45 0.01
## Landsize 121.0 40469.0 40348.0 17.70 443.93 12.86
To check for number of different types of property per region:
region_by_type <- xtabs(~ Regionname+Type, data = housing.df)
region_by_type
## Type
## Regionname h t u
## #N/A 0 0 0
## Eastern Metropolitan 814 71 78
## Eastern Victoria 45 0 2
## Northern Metropolitan 1795 142 227
## Northern Victoria 47 0 0
## South-Eastern Metropolitan 305 13 19
## Southern Metropolitan 1739 218 367
## Western Metropolitan 1587 138 126
## Western Victoria 38 0 0
Here, Type is the type of house in the dataset: h -> house,cottage,villa, semi,terrace; t -> townhouse; u -> unit, duplex;
To check for status of property sale per type per region:
sell_type_region <- xtabs(~ Regionname+Type+Method, data = housing.df)
ftable(sell_type_region)
## Method PI PN S SA SN SP SS VB W
## Regionname Type
## #N/A h 0 0 0 0 0 0 0 0 0
## t 0 0 0 0 0 0 0 0 0
## u 0 0 0 0 0 0 0 0 0
## Eastern Metropolitan h 115 0 540 5 0 85 0 69 0
## t 10 0 44 1 0 8 0 8 0
## u 7 0 62 0 0 4 0 5 0
## Eastern Victoria h 5 0 24 2 0 12 0 2 0
## t 0 0 0 0 0 0 0 0 0
## u 0 0 1 0 0 1 0 0 0
## Northern Metropolitan h 169 0 1270 9 0 239 0 108 0
## t 12 0 96 0 0 21 0 13 0
## u 17 0 133 2 0 56 0 19 0
## Northern Victoria h 0 0 32 2 0 13 0 0 0
## t 0 0 0 0 0 0 0 0 0
## u 0 0 0 0 0 0 0 0 0
## South-Eastern Metropolitan h 23 0 213 2 0 56 0 11 0
## t 0 0 13 0 0 0 0 0 0
## u 1 0 14 0 0 4 0 0 0
## Southern Metropolitan h 288 0 1060 15 0 113 0 263 0
## t 32 0 147 2 0 19 0 18 0
## u 45 0 243 2 0 39 0 38 0
## Western Metropolitan h 154 0 1023 6 0 312 0 92 0
## t 18 0 78 0 0 30 0 12 0
## u 14 0 73 3 0 26 0 10 0
## Western Victoria h 3 0 30 0 0 5 0 0 0
## t 0 0 0 0 0 0 0 0 0
## u 0 0 0 0 0 0 0 0 0
To draw a boxplot to check for price in different regions:
par(mar=c(3.1,12,4.1,2.1), mgp = c(11, 1, 0))
boxplot(housing.df$Price ~ housing.df$Regionname, horizontal = TRUE, ylab = "Region Name", xlab = "Price of houses", main = "Boxplot of price of houses by region", las = 1)
Insights:
To draw a boxplot to check for landsize in different regions:
par(mar=c(3.1,12,0.95,2.1), mgp = c(11, 1, 0), mfrow = c(2,1))
boxplot(housing.df$Landsize ~ housing.df$Regionname, horizontal = TRUE, xlab = "Landsize of houses", ylab = "Region Name", main = "Boxplot of landsize of houses by region", las = 1)
boxplot(housing.df$Distance ~ housing.df$Type, horizontal = TRUE, ylab = "Type of House", xlab = "Distance from CBD", main = "Boxplot of distance from CBD vs type of houses", las = 1)
par(mfrow = c(2,2), mar=c(3.1,3.1,0.95,0))
hist(housing.df$Distance, breaks = 40, xlim = c(0,50), ylim = c(0,800),xlab = "Distance from CBD", col = "Green", main = "Histogram of Distances from CBD", las =1)
hist(housing.df$Bedroom2, breaks = 40, xlim = c(0,10), ylim = c(0,4000),xlab = "No. of Bedrooms", col = "Red", main = "Histogram of no. of bedrooms in houses", las =1)
hist(housing.df$Bathroom, breaks = 40, xlim = c(0,10), ylim = c(0,4000),xlab = "No. of Bathrooms", col = "Blue", main = "Histogram of no. of bathrooms in houses", las =1)
hist(housing.df$Car, breaks = 40, xlim = c(0,10), ylim = c(0,4000),xlab = "No. of Carslots", col = "gray33", main = "Histogram of no. of carslots in houses", las =1)
From the Distance from CBD histogram, we can see a very strong right skew in the data indicating that very few of the houses are extremely far away from CBD. From the remaining 3 histograms, we can infer that the majority of houses have 2-4 bedrooms, 1-2 bathrooms and 1-2 carslots. However, there are outliers in each histogram.
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(housing.df[,c(4,8,10:13)], spread = FALSE, smoother.args = list(lty = 2), main = "Scatter Plot Matrix")
To get the correlation matrix for all numeric variables in the dataset, rounded to 2 decimal places:
round(cor(housing.df[,c(2,4,8,10:13)]),2)
## Rooms Price Distance Bedroom2 Bathroom Car Landsize
## Rooms 1.00 0.42 0.19 0.95 0.61 0.34 -0.02
## Price 0.42 1.00 -0.31 0.40 0.44 0.17 0.00
## Distance 0.19 -0.31 1.00 0.20 0.08 0.21 0.04
## Bedroom2 0.95 0.40 0.20 1.00 0.61 0.35 -0.02
## Bathroom 0.61 0.44 0.08 0.61 1.00 0.28 0.04
## Car 0.34 0.17 0.21 0.35 0.28 1.00 0.08
## Landsize -0.02 0.00 0.04 -0.02 0.04 0.08 1.00
library(corrgram)
corrgram(housing.df[,c(2,4,8,10:13)], order=FALSE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main="Corrgram of housing data intercorrelations")
From the corrgram, we see that there’s a strong positive correlation between No. of Rooms, Bathrooms, carslots and Bedrooms with price while there is a negative correlation between Distance from CBD and price, both of which make sense. There does not appear to be a strong correlation between landsize and price, which we need to check further.
Null Hypothesis: There is no significant difference between landsize and Price of houses i.e. there is no correlation between the two variable. Alternative Hypothesis: There is a significant difference between landsize and price and they are correlated.
Using the Pearson’s Correlation test:
cor.test(housing.df$Price, housing.df$Landsize)
##
## Pearson's product-moment correlation
##
## data: housing.df$Price and housing.df$Landsize
## t = 0.14774, df = 7769, p-value = 0.8826
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02055883 0.02390948
## sample estimates:
## cor
## 0.001676152
p-value = 0.8826(>0.05) is not significant and suggests that we fail to reject the null hypothesis that there is no correlation between land size and price.
Null Hypothesis: There is no significant difference between Distance from CBD and Price of houses i.e. there is no correlation between the two variable. Alternative Hypothesis: There is a significant difference between Distance from CBD and price and they are correlated.
Using the Pearson’s Correlation test:
cor.test(housing.df$Price, housing.df$Distance)
##
## Pearson's product-moment correlation
##
## data: housing.df$Price and housing.df$Distance
## t = -28.46, df = 7769, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3272682 -0.2869964
## sample estimates:
## cor
## -0.3072699
We have a significant p-value here (2.2e-16 << 0.05) and thus we reject the null hypothesis. We conclude there is indeed a correlation between distance from CBD and price.
Null Hypothesis: There is no significant difference between No. of rooms and Price of houses i.e. there is no correlation between the two variable. Alternative Hypothesis: There is a significant difference between No. of rooms and price and they are correlated.
Using the Pearson’s Correlation test:
cor.test(housing.df$Price, housing.df$Rooms)
##
## Pearson's product-moment correlation
##
## data: housing.df$Price and housing.df$Rooms
## t = 40.806, df = 7769, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4016376 0.4382605
## sample estimates:
## cor
## 0.4201201
Here again we have a significant p-value here (2.2e-16 << 0.05) and thus we reject the null hypothesis. We conclude there is indeed a correlation between No. of rooms and price.