Problem Statement

What factors most influence property prices in Melbourne?
Does employing a well-known real estate agency increase your chances of getting a higher price for your property?
Does the time of year make any difference to the resulting sales price?
This study will dissect trends within the auction sales sample and test statistical significance for the hypotheses that result in regards to the population of properties in Melbourne

Data

The dataset is from a Kaggle competition and was compiled by Tony Pino, via scraping of the Domain.com.au auction results weekly from January 2016 until May 2017
The data includes fields on:
- Location (address, suburb, postcode, council area, GPS coordinates and distance from CBD)
- Attributes (bedrooms, bathrooms, car space, year built and type of dwelling (house/unit/townhouse) )
- Sales Method (real estate agency and sale result type (sold at/before/after auction) )
- Price
The dataset is likely to represent most auction sales with largely correct information, however it is not guaranteed to do so. Domain is a reputable source although the compiler themselves is unknown to us.
For the purposes of this study, it will be assumed that the observations represent a subset of the population of Melbourne property market values.

Data Cont.

The data spans 14,246 sales across 144 suburbs, 94 postcodes and 19 council areas
The sales were conducted by 237 agencies with Nelson Alexander, Jellis Craig and Hocking Stuart together completing 31% of them.
Properties were divided into houses (‘h’), townhouses (‘t’) and units (‘u’)
Price is in Australian dollars and distance in kilometres

Preprocessing

Date field was converted from string to date type, and then also split into day/month/year columns
The data was filtered to include only confirmed price transactions, leaving out those observations which did not result in sales or where the transaction price was not publicised. This was done to distil the data into known transacted prices instead of hypothesised quote prices.
Similarly, observations with NA prices were left out
Extravagantly large properties with over 5 bedrooms were also left out as it was not interesting to study them for the purposes of this investigation
There were 24 observations which did not include any data except for the Suburb field being “RE” or empty. These were also removed.
A field was added to denote whether the property was inner city (<5km from CBD), middle suburbs (5-10km) or outer suburbs (10-15km)
The resulting clean dataset was comprised of 8,597 observations.

melb <- read.csv("/Users/balkan_misirli/Downloads/Melbourne_housing_extra_data.csv")
melb <- melb %>% separate(Date, c("Day","Month","Year"),"/", remove = FALSE)
melb <- melb %>% filter(Suburb != "    RE") %>% filter(Suburb !="") %>% filter(!is.na(Price)) %>% filter(Rooms %in% c("1","2","3","4","5"))
melb$Date <- as.Date(melb$Date, format = "%d/%m/%Y")
goodTransaction <- c("S","SP","SA")
melb <- melb %>% filter(Method %in% goodTransaction)
melb$Postcode <- as.factor(melb$Postcode)
melb$bin <- cut(melb$Distance, breaks = c(0,5,10,15), labels = c("<5","5-10","10-15"))
top5 <- c("Nelson","Jellis","hockingstuart","Barry","Buxton")
melb <- melb %>% mutate(top5 = ifelse(SellerG %in% top5, 1,0))

Descriptive Statistics and Visualisation

Of the properties sold, the location distribution in percentages was as such;

melb$bin %>% table() %>% prop.table()*100

## .
##       <5     5-10    10-15 
## 17.94812 41.06084 40.99104

The summary statistics for prices per location were as below (in thousands of dollars);

melb %>% group_by(bin) %>% summarise(Min = min(Price/1000,na.rm = TRUE), Q1 = quantile(Price/1000,probs = .25,na.rm = TRUE), Median = median(Price/1000, na.rm = TRUE), Q3 = quantile(Price/1000,probs = .75,na.rm = TRUE), Max = max(Price/1000,na.rm = TRUE), Mean = mean(Price/1000, na.rm = TRUE), SD = sd(Price/1000, na.rm = TRUE), n = n(), Missing = sum(is.na(Price))) -> table1
knitr::kable(table1)

bin	Min	Q1	Median	Q3	Max	Mean	SD	n
<5	240	635.00	978.5	1400	6250.0	1128.7634	701.1179	1543
5-10	210	678.50	940.0	1350	6240.0	1115.8146	665.8885	3530
10-15	185	645.85	840.0	1230	4802.5	994.5407	518.6734	3524

Rooms Versus Price

boxplot(melb$Price/1000000 ~ melb$Rooms, main = "Prices by number of Rooms", xlab = "Number of Rooms", ylab = "Price in $ Millions")

- The boxplot of Price by number of Rooms shows a clear (and understandable) positive linear trend for properties with 1-5 rooms. For each extra room, the variation around the median also appears to increase markedly. The number of outliers indicates there is a premium being paid due to some other variable irrespective of the size of the property

Pricing by Suburb

The cost of a property per room varies greatly depending on the Suburb and its distance from the CBD. It seems Moreland, Maribyrnong and Darebin are reasonably cheap given their proximity to the city.

table3<-melb %>% filter(CouncilArea != "") %>% group_by(CouncilArea) %>% summarise(CostPerRoom= round(mean(Price/Rooms),2), AvgDist=round(mean(Distance),2)) %>% arrange(desc(CostPerRoom))
knitr::kable(table3)

CouncilArea	CostPerRoom	AvgDist
Bayside	543486.6	11.89
Boroondara	522724.6	7.84
Stonnington	514060.2	6.22
Port Phillip	486413.0	5.10
Yarra	482851.5	2.70
Melbourne	427010.5	2.81
Whitehorse	408637.4	12.08
Glen Eira	383580.0	11.97
Monash	354837.7	13.29
Manningham	353788.3	13.30
Hobsons Bay	343012.7	9.43
Moonee Valley	336060.4	8.87
Darebin	335335.8	8.61
Banyule	317991.5	10.84
Kingston	314397.9	14.62
Maribyrnong	301404.7	7.83
Moreland	298416.1	8.88
Brimbank	217715.3	13.35
Hume	169125.0	14.35

Seasonality

Sales volumes are not uniform throughout the year, with more auctions being conducted in May and November than other times. January and July appear to be particularly bad times to sell.

freq1 <- melb$Month %>% table() %>% prop.table()*100 
barplot(freq1, ylim = c(0,20), main = "Sales Volume by Month",ylab="Percent", xlab="Month", col = "lightblue")

Seasonality cont.

The distribution of sales volumes of each Room category over the year is shown below. All categories display roughly the same month-to-month trends, although the market seems to swing slightly from smaller properties to larger ones as the year progresses. January is well and truly a dead month for the market.

room_season <- table(melb$Rooms,melb$Month) %>% prop.table(margin = 1)
barplot(room_season, main = "Sales Volume per Month by number of Rooms",
ylab="Proportion within No. Rooms", ylim = c(0,.30), xlab="Month", beside = TRUE, args.legend=c(x = "top",horiz=TRUE,title="Rooms"), legend=rownames(room_season),col=brewer.pal(5, name = "RdYlGn"))

Real Estate Agency Effects

Nelson Alexander, Jellis Craig, Hocking Stuart, Barry Plant and Buxton are the 5 largest agencies in Melbourne by volume over this period. Together they represented 44% of the sales in this dataset.
Using hypothesis testing we will aim to find whether hiring a top 5 real estate agency tends to result in higher sales prices for one’s property in general.
Mean prices and standard deviations for top 5/non-top 5 below;

t1<- melb %>% filter(SellerG %in% c("Nelson","Jellis","hockingstuart","Barry","Buxton")) %>% summarise(Top5Mean = mean(Price), sd(Price), count = n())
t2 <-melb %>% filter(!(SellerG %in% c("Nelson","Jellis","hockingstuart","Barry","Buxton"))) %>% summarise(NonTop5Mean = mean(Price), sd(Price),count=n())
knitr::kable(t1)

Top5Mean	sd(Price)	count
1058538	562350	3783

knitr::kable(t2)

NonTop5Mean	sd(Price)	count
1076199	661590	4814

Chi Square Test of Association (Hypothesis Testing)

-As this is a situation where we wish to compare an observed outcome to an expected outcome of a categorical variable (top 5 or not) it is fitting to use the Chi Square Test of Association.

-Null Hypothesis: whether or not one uses a top 5 agency they will get the same price result

$H_O : \mu_1 = \mu_2$

-Alternate Hypothesis: it makes a difference to result price if you use a top 5 agency

$H_A : \mu_1 ≠ \mu_2$

-Assumptions:

-Not more than 25% of cells have expected counts below 5. This is confirmed.

-Decision Rule:

-If test finds p-value < 0.05, reject Null Hypothesis

Hypothesis Test

-It is hard to tell visually what differences there are between the top 5 agencies and the rest from the below graph, except that non-top 5 agencies together conduct more maximum price range ($2m+) than the big 5.

-The bars go from left-most red ($200-399k) to right-most green ($2m+) in increments of $200,000

price.cat <- function(x, lower = 200, upper, by = 200,
                   sep = "-", above.char = "+") {
  
  labs <- c(paste(seq(lower, upper - by, by = by),
                  seq(lower + by - 1, upper - 1, by = by),
                  sep = sep),
            paste(upper, above.char, sep = ""))
  
  cut(floor(x), breaks = c(seq(lower, upper, by = by), Inf),
      right = FALSE, labels = labs)
}
tb1 <- table(price.cat(melb$Price/1000, upper = 2000),melb$top5) %>% prop.table(margin = 2)
barplot(tb1,main = "Top 5 versus non-Top 5 price range frequencies",ylab="Proportion Within Group", ylim=c(0,.3), beside=TRUE,
        xlab="0 = Rest,  1 = Top 5", col = brewer.pal(10, name = "RdYlGn"))

Hypothesis Test cont.

-The Chi Square test;

chi2 <- chisq.test(table(price.cat((melb$Price/1000), upper = 2000),melb$top5)) 
chi2

## 
##  Pearson's Chi-squared test
## 
## data:  table(price.cat((melb$Price/1000), upper = 2000), melb$top5)
## X-squared = 75.175, df = 9, p-value = 1.459e-12

-Critical Value and P Value

qchisq(p = .95,df = 9)

## [1] 16.91898

# 75.175 > 16.919

pchisq(q = 75.175,df = 9,lower.tail = FALSE)

## [1] 1.459428e-12

# p-value below 0.001

Observed vs Expected Values

-Observed Values

t(chi2$observed)

##    
##     200-399 400-599 600-799 800-999 1000-1199 1200-1399 1400-1599
##   0     191     748    1048     855       489       425       284
##   1     107     557     747     701       424       457       270
##    
##     1600-1799 1800-1999 2000+
##   0       208       151   414
##   1       185       130   204

-Expected Values

t(chi2$expected)

##    
##      200-399  400-599   600-799  800-999 1000-1199 1200-1399 1400-1599
##   0 166.8731 730.7696 1005.1582 871.3238  511.2588  493.8995  310.2271
##   1 131.1269 574.2304  789.8418 684.6762  401.7412  388.1005  243.7729
##    
##     1600-1799 1800-1999    2000+
##   0  220.0709  157.3535 346.0656
##   1  172.9291  123.6465 271.9344

Results of Chi Square Test

-The Chi-Square Test of Association was used to test for a statistically significant association between a property’s selling agent and the sales price achieved. The result of this test indicates a statistically significant association, with $χ^2 = 75.175$ and p-value < 0.001. This leads us to reject the null hypothesis.

This result is evidence that it is statistically likely that these top 5 realtors do indeed attract a premium price for properties they sell compared to their competitors.

Discussion

-Limitations

-This investigation originally attempted to add a further variable measuring the distance to the nearest train station using Google Maps API, geosphere package and copious nested looping. This would have been a useful addition but had to be abandoned due to excessive compute time and API limitations.

-Housing is a highly multifaceted domain to analyse as its pricing mechanisms reflect many different variables. As such, this study was inherently limited in its possible findings.

-Future

-It is quite feasible that geographic location and distance from amenities impacts pricing, this could readily be assessed and analysed in future.

-Clearly, interest rates and economic activity are influential upon housing prices, this may also be an avenue of interest in future analysis.

Discussion cont.

-Strengths

-The dataset contained a lot of observations and useful variables. Thus, the results carried more statistical weight given the numerous counts of each contingency.

-Findings

-This investigation found that there was significant seasonality in the sample, such that testing was probably not required. Volume of auctions does tend to spike just before the middle and end of the year.

-A baseline relationship appears to exist between number of rooms and price, however there is such a premium on top of this from elsewhere as to make its usefulness limited

-The top 5 agencies appear to have a defensible argument as to their superiority, given that they are able to bring in higher prices than the competitors

-In summary, if you are planning to sell your house, do it in May with one of the top 5 realtors. This strategy may net you a higher price.

Assignment 4 - Melbourne Housing Prices

How and when to get a good deal

Introduction