Assignment 4 - Melbourne Housing Prices

How and when to get a good deal

Balkan Misirli - S3650250

Last updated: 04 June, 2017

Introduction

Problem Statement

Data

Data Cont.

Preprocessing

melb <- read.csv("/Users/balkan_misirli/Downloads/Melbourne_housing_extra_data.csv")
melb <- melb %>% separate(Date, c("Day","Month","Year"),"/", remove = FALSE)
melb <- melb %>% filter(Suburb != "    RE") %>% filter(Suburb !="") %>% filter(!is.na(Price)) %>% filter(Rooms %in% c("1","2","3","4","5"))
melb$Date <- as.Date(melb$Date, format = "%d/%m/%Y")
goodTransaction <- c("S","SP","SA")
melb <- melb %>% filter(Method %in% goodTransaction)
melb$Postcode <- as.factor(melb$Postcode)
melb$bin <- cut(melb$Distance, breaks = c(0,5,10,15), labels = c("<5","5-10","10-15"))
top5 <- c("Nelson","Jellis","hockingstuart","Barry","Buxton")
melb <- melb %>% mutate(top5 = ifelse(SellerG %in% top5, 1,0))

Descriptive Statistics and Visualisation

melb$bin %>% table() %>% prop.table()*100
## .
##       <5     5-10    10-15 
## 17.94812 41.06084 40.99104
melb %>% group_by(bin) %>% summarise(Min = min(Price/1000,na.rm = TRUE), Q1 = quantile(Price/1000,probs = .25,na.rm = TRUE), Median = median(Price/1000, na.rm = TRUE), Q3 = quantile(Price/1000,probs = .75,na.rm = TRUE), Max = max(Price/1000,na.rm = TRUE), Mean = mean(Price/1000, na.rm = TRUE), SD = sd(Price/1000, na.rm = TRUE), n = n(), Missing = sum(is.na(Price))) -> table1
knitr::kable(table1)
bin Min Q1 Median Q3 Max Mean SD n Missing
<5 240 635.00 978.5 1400 6250.0 1128.7634 701.1179 1543 0
5-10 210 678.50 940.0 1350 6240.0 1115.8146 665.8885 3530 0
10-15 185 645.85 840.0 1230 4802.5 994.5407 518.6734 3524 0

Rooms Versus Price

boxplot(melb$Price/1000000 ~ melb$Rooms, main = "Prices by number of Rooms", xlab = "Number of Rooms", ylab = "Price in $ Millions")

- The boxplot of Price by number of Rooms shows a clear (and understandable) positive linear trend for properties with 1-5 rooms. For each extra room, the variation around the median also appears to increase markedly. The number of outliers indicates there is a premium being paid due to some other variable irrespective of the size of the property

Pricing by Suburb

table3<-melb %>% filter(CouncilArea != "") %>% group_by(CouncilArea) %>% summarise(CostPerRoom= round(mean(Price/Rooms),2), AvgDist=round(mean(Distance),2)) %>% arrange(desc(CostPerRoom))
knitr::kable(table3)
CouncilArea CostPerRoom AvgDist
Bayside 543486.6 11.89
Boroondara 522724.6 7.84
Stonnington 514060.2 6.22
Port Phillip 486413.0 5.10
Yarra 482851.5 2.70
Melbourne 427010.5 2.81
Whitehorse 408637.4 12.08
Glen Eira 383580.0 11.97
Monash 354837.7 13.29
Manningham 353788.3 13.30
Hobsons Bay 343012.7 9.43
Moonee Valley 336060.4 8.87
Darebin 335335.8 8.61
Banyule 317991.5 10.84
Kingston 314397.9 14.62
Maribyrnong 301404.7 7.83
Moreland 298416.1 8.88
Brimbank 217715.3 13.35
Hume 169125.0 14.35

Seasonality

freq1 <- melb$Month %>% table() %>% prop.table()*100 
barplot(freq1, ylim = c(0,20), main = "Sales Volume by Month",ylab="Percent", xlab="Month", col = "lightblue")

Seasonality cont.

room_season <- table(melb$Rooms,melb$Month) %>% prop.table(margin = 1)
barplot(room_season, main = "Sales Volume per Month by number of Rooms",
ylab="Proportion within No. Rooms", ylim = c(0,.30), xlab="Month", beside = TRUE, args.legend=c(x = "top",horiz=TRUE,title="Rooms"), legend=rownames(room_season),col=brewer.pal(5, name = "RdYlGn"))

Real Estate Agency Effects

t1<- melb %>% filter(SellerG %in% c("Nelson","Jellis","hockingstuart","Barry","Buxton")) %>% summarise(Top5Mean = mean(Price), sd(Price), count = n())
t2 <-melb %>% filter(!(SellerG %in% c("Nelson","Jellis","hockingstuart","Barry","Buxton"))) %>% summarise(NonTop5Mean = mean(Price), sd(Price),count=n())
knitr::kable(t1)
Top5Mean sd(Price) count
1058538 562350 3783
knitr::kable(t2)
NonTop5Mean sd(Price) count
1076199 661590 4814

Chi Square Test of Association (Hypothesis Testing)

-As this is a situation where we wish to compare an observed outcome to an expected outcome of a categorical variable (top 5 or not) it is fitting to use the Chi Square Test of Association.

-Null Hypothesis: whether or not one uses a top 5 agency they will get the same price result

\(H_O : \mu_1 = \mu_2\)

-Alternate Hypothesis: it makes a difference to result price if you use a top 5 agency

\(H_A : \mu_1 ≠ \mu_2\)

-Assumptions:

-Not more than 25% of cells have expected counts below 5. This is confirmed.

-Decision Rule:

-If test finds p-value < 0.05, reject Null Hypothesis

Hypothesis Test

-It is hard to tell visually what differences there are between the top 5 agencies and the rest from the below graph, except that non-top 5 agencies together conduct more maximum price range ($2m+) than the big 5.

-The bars go from left-most red ($200-399k) to right-most green ($2m+) in increments of $200,000

price.cat <- function(x, lower = 200, upper, by = 200,
                   sep = "-", above.char = "+") {
  
  labs <- c(paste(seq(lower, upper - by, by = by),
                  seq(lower + by - 1, upper - 1, by = by),
                  sep = sep),
            paste(upper, above.char, sep = ""))
  
  cut(floor(x), breaks = c(seq(lower, upper, by = by), Inf),
      right = FALSE, labels = labs)
}
tb1 <- table(price.cat(melb$Price/1000, upper = 2000),melb$top5) %>% prop.table(margin = 2)
barplot(tb1,main = "Top 5 versus non-Top 5 price range frequencies",ylab="Proportion Within Group", ylim=c(0,.3), beside=TRUE,
        xlab="0 = Rest,  1 = Top 5", col = brewer.pal(10, name = "RdYlGn"))

Hypothesis Test cont.

-The Chi Square test;

chi2 <- chisq.test(table(price.cat((melb$Price/1000), upper = 2000),melb$top5)) 
chi2
## 
##  Pearson's Chi-squared test
## 
## data:  table(price.cat((melb$Price/1000), upper = 2000), melb$top5)
## X-squared = 75.175, df = 9, p-value = 1.459e-12

-Critical Value and P Value

qchisq(p = .95,df = 9)
## [1] 16.91898
# 75.175 > 16.919

pchisq(q = 75.175,df = 9,lower.tail = FALSE)
## [1] 1.459428e-12
# p-value below 0.001

Observed vs Expected Values

-Observed Values

t(chi2$observed)
##    
##     200-399 400-599 600-799 800-999 1000-1199 1200-1399 1400-1599
##   0     191     748    1048     855       489       425       284
##   1     107     557     747     701       424       457       270
##    
##     1600-1799 1800-1999 2000+
##   0       208       151   414
##   1       185       130   204

-Expected Values

t(chi2$expected)
##    
##      200-399  400-599   600-799  800-999 1000-1199 1200-1399 1400-1599
##   0 166.8731 730.7696 1005.1582 871.3238  511.2588  493.8995  310.2271
##   1 131.1269 574.2304  789.8418 684.6762  401.7412  388.1005  243.7729
##    
##     1600-1799 1800-1999    2000+
##   0  220.0709  157.3535 346.0656
##   1  172.9291  123.6465 271.9344

Results of Chi Square Test

-The Chi-Square Test of Association was used to test for a statistically significant association between a property’s selling agent and the sales price achieved. The result of this test indicates a statistically significant association, with \(χ^2 = 75.175\) and p-value < 0.001. This leads us to reject the null hypothesis.

This result is evidence that it is statistically likely that these top 5 realtors do indeed attract a premium price for properties they sell compared to their competitors.

Discussion

-Limitations

-This investigation originally attempted to add a further variable measuring the distance to the nearest train station using Google Maps API, geosphere package and copious nested looping. This would have been a useful addition but had to be abandoned due to excessive compute time and API limitations.

-Housing is a highly multifaceted domain to analyse as its pricing mechanisms reflect many different variables. As such, this study was inherently limited in its possible findings.

-Future

-It is quite feasible that geographic location and distance from amenities impacts pricing, this could readily be assessed and analysed in future.

-Clearly, interest rates and economic activity are influential upon housing prices, this may also be an avenue of interest in future analysis.

Discussion cont.

-Strengths

-The dataset contained a lot of observations and useful variables. Thus, the results carried more statistical weight given the numerous counts of each contingency.

-Findings

-This investigation found that there was significant seasonality in the sample, such that testing was probably not required. Volume of auctions does tend to spike just before the middle and end of the year.

-A baseline relationship appears to exist between number of rooms and price, however there is such a premium on top of this from elsewhere as to make its usefulness limited

-The top 5 agencies appear to have a defensible argument as to their superiority, given that they are able to bring in higher prices than the competitors

-In summary, if you are planning to sell your house, do it in May with one of the top 5 realtors. This strategy may net you a higher price.

References