Balkan Misirli - S3650250
Last updated: 04 June, 2017
melb <- read.csv("/Users/balkan_misirli/Downloads/Melbourne_housing_extra_data.csv")
melb <- melb %>% separate(Date, c("Day","Month","Year"),"/", remove = FALSE)
melb <- melb %>% filter(Suburb != " RE") %>% filter(Suburb !="") %>% filter(!is.na(Price)) %>% filter(Rooms %in% c("1","2","3","4","5"))
melb$Date <- as.Date(melb$Date, format = "%d/%m/%Y")
goodTransaction <- c("S","SP","SA")
melb <- melb %>% filter(Method %in% goodTransaction)
melb$Postcode <- as.factor(melb$Postcode)
melb$bin <- cut(melb$Distance, breaks = c(0,5,10,15), labels = c("<5","5-10","10-15"))
top5 <- c("Nelson","Jellis","hockingstuart","Barry","Buxton")
melb <- melb %>% mutate(top5 = ifelse(SellerG %in% top5, 1,0))
melb$bin %>% table() %>% prop.table()*100
## .
## <5 5-10 10-15
## 17.94812 41.06084 40.99104
melb %>% group_by(bin) %>% summarise(Min = min(Price/1000,na.rm = TRUE), Q1 = quantile(Price/1000,probs = .25,na.rm = TRUE), Median = median(Price/1000, na.rm = TRUE), Q3 = quantile(Price/1000,probs = .75,na.rm = TRUE), Max = max(Price/1000,na.rm = TRUE), Mean = mean(Price/1000, na.rm = TRUE), SD = sd(Price/1000, na.rm = TRUE), n = n(), Missing = sum(is.na(Price))) -> table1
knitr::kable(table1)
bin | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
---|---|---|---|---|---|---|---|---|---|
<5 | 240 | 635.00 | 978.5 | 1400 | 6250.0 | 1128.7634 | 701.1179 | 1543 | 0 |
5-10 | 210 | 678.50 | 940.0 | 1350 | 6240.0 | 1115.8146 | 665.8885 | 3530 | 0 |
10-15 | 185 | 645.85 | 840.0 | 1230 | 4802.5 | 994.5407 | 518.6734 | 3524 | 0 |
boxplot(melb$Price/1000000 ~ melb$Rooms, main = "Prices by number of Rooms", xlab = "Number of Rooms", ylab = "Price in $ Millions")
- The boxplot of Price by number of Rooms shows a clear (and understandable) positive linear trend for properties with 1-5 rooms. For each extra room, the variation around the median also appears to increase markedly. The number of outliers indicates there is a premium being paid due to some other variable irrespective of the size of the property
table3<-melb %>% filter(CouncilArea != "") %>% group_by(CouncilArea) %>% summarise(CostPerRoom= round(mean(Price/Rooms),2), AvgDist=round(mean(Distance),2)) %>% arrange(desc(CostPerRoom))
knitr::kable(table3)
CouncilArea | CostPerRoom | AvgDist |
---|---|---|
Bayside | 543486.6 | 11.89 |
Boroondara | 522724.6 | 7.84 |
Stonnington | 514060.2 | 6.22 |
Port Phillip | 486413.0 | 5.10 |
Yarra | 482851.5 | 2.70 |
Melbourne | 427010.5 | 2.81 |
Whitehorse | 408637.4 | 12.08 |
Glen Eira | 383580.0 | 11.97 |
Monash | 354837.7 | 13.29 |
Manningham | 353788.3 | 13.30 |
Hobsons Bay | 343012.7 | 9.43 |
Moonee Valley | 336060.4 | 8.87 |
Darebin | 335335.8 | 8.61 |
Banyule | 317991.5 | 10.84 |
Kingston | 314397.9 | 14.62 |
Maribyrnong | 301404.7 | 7.83 |
Moreland | 298416.1 | 8.88 |
Brimbank | 217715.3 | 13.35 |
Hume | 169125.0 | 14.35 |
freq1 <- melb$Month %>% table() %>% prop.table()*100
barplot(freq1, ylim = c(0,20), main = "Sales Volume by Month",ylab="Percent", xlab="Month", col = "lightblue")
room_season <- table(melb$Rooms,melb$Month) %>% prop.table(margin = 1)
barplot(room_season, main = "Sales Volume per Month by number of Rooms",
ylab="Proportion within No. Rooms", ylim = c(0,.30), xlab="Month", beside = TRUE, args.legend=c(x = "top",horiz=TRUE,title="Rooms"), legend=rownames(room_season),col=brewer.pal(5, name = "RdYlGn"))
t1<- melb %>% filter(SellerG %in% c("Nelson","Jellis","hockingstuart","Barry","Buxton")) %>% summarise(Top5Mean = mean(Price), sd(Price), count = n())
t2 <-melb %>% filter(!(SellerG %in% c("Nelson","Jellis","hockingstuart","Barry","Buxton"))) %>% summarise(NonTop5Mean = mean(Price), sd(Price),count=n())
knitr::kable(t1)
Top5Mean | sd(Price) | count |
---|---|---|
1058538 | 562350 | 3783 |
knitr::kable(t2)
NonTop5Mean | sd(Price) | count |
---|---|---|
1076199 | 661590 | 4814 |
-As this is a situation where we wish to compare an observed outcome to an expected outcome of a categorical variable (top 5 or not) it is fitting to use the Chi Square Test of Association.
-Null Hypothesis: whether or not one uses a top 5 agency they will get the same price result
\(H_O : \mu_1 = \mu_2\)
-Alternate Hypothesis: it makes a difference to result price if you use a top 5 agency
\(H_A : \mu_1 ≠ \mu_2\)
-Assumptions:
-Not more than 25% of cells have expected counts below 5. This is confirmed.
-Decision Rule:
-If test finds p-value < 0.05, reject Null Hypothesis
-It is hard to tell visually what differences there are between the top 5 agencies and the rest from the below graph, except that non-top 5 agencies together conduct more maximum price range ($2m+) than the big 5.
-The bars go from left-most red ($200-399k) to right-most green ($2m+) in increments of $200,000
price.cat <- function(x, lower = 200, upper, by = 200,
sep = "-", above.char = "+") {
labs <- c(paste(seq(lower, upper - by, by = by),
seq(lower + by - 1, upper - 1, by = by),
sep = sep),
paste(upper, above.char, sep = ""))
cut(floor(x), breaks = c(seq(lower, upper, by = by), Inf),
right = FALSE, labels = labs)
}
tb1 <- table(price.cat(melb$Price/1000, upper = 2000),melb$top5) %>% prop.table(margin = 2)
barplot(tb1,main = "Top 5 versus non-Top 5 price range frequencies",ylab="Proportion Within Group", ylim=c(0,.3), beside=TRUE,
xlab="0 = Rest, 1 = Top 5", col = brewer.pal(10, name = "RdYlGn"))
-The Chi Square test;
chi2 <- chisq.test(table(price.cat((melb$Price/1000), upper = 2000),melb$top5))
chi2
##
## Pearson's Chi-squared test
##
## data: table(price.cat((melb$Price/1000), upper = 2000), melb$top5)
## X-squared = 75.175, df = 9, p-value = 1.459e-12
-Critical Value and P Value
qchisq(p = .95,df = 9)
## [1] 16.91898
# 75.175 > 16.919
pchisq(q = 75.175,df = 9,lower.tail = FALSE)
## [1] 1.459428e-12
# p-value below 0.001
-Observed Values
t(chi2$observed)
##
## 200-399 400-599 600-799 800-999 1000-1199 1200-1399 1400-1599
## 0 191 748 1048 855 489 425 284
## 1 107 557 747 701 424 457 270
##
## 1600-1799 1800-1999 2000+
## 0 208 151 414
## 1 185 130 204
-Expected Values
t(chi2$expected)
##
## 200-399 400-599 600-799 800-999 1000-1199 1200-1399 1400-1599
## 0 166.8731 730.7696 1005.1582 871.3238 511.2588 493.8995 310.2271
## 1 131.1269 574.2304 789.8418 684.6762 401.7412 388.1005 243.7729
##
## 1600-1799 1800-1999 2000+
## 0 220.0709 157.3535 346.0656
## 1 172.9291 123.6465 271.9344
-The Chi-Square Test of Association was used to test for a statistically significant association between a property’s selling agent and the sales price achieved. The result of this test indicates a statistically significant association, with \(χ^2 = 75.175\) and p-value < 0.001. This leads us to reject the null hypothesis.
This result is evidence that it is statistically likely that these top 5 realtors do indeed attract a premium price for properties they sell compared to their competitors.
-Limitations
-This investigation originally attempted to add a further variable measuring the distance to the nearest train station using Google Maps API, geosphere package and copious nested looping. This would have been a useful addition but had to be abandoned due to excessive compute time and API limitations.
-Housing is a highly multifaceted domain to analyse as its pricing mechanisms reflect many different variables. As such, this study was inherently limited in its possible findings.
-Future
-It is quite feasible that geographic location and distance from amenities impacts pricing, this could readily be assessed and analysed in future.
-Clearly, interest rates and economic activity are influential upon housing prices, this may also be an avenue of interest in future analysis.
-Strengths
-The dataset contained a lot of observations and useful variables. Thus, the results carried more statistical weight given the numerous counts of each contingency.
-Findings
-This investigation found that there was significant seasonality in the sample, such that testing was probably not required. Volume of auctions does tend to spike just before the middle and end of the year.
-A baseline relationship appears to exist between number of rooms and price, however there is such a premium on top of this from elsewhere as to make its usefulness limited
-The top 5 agencies appear to have a defensible argument as to their superiority, given that they are able to bring in higher prices than the competitors
-In summary, if you are planning to sell your house, do it in May with one of the top 5 realtors. This strategy may net you a higher price.