As part of the Property Price Influencers Group, our aim was to identify what are the most influential property factors that would a play vital role on determining the property prices growth of the Greater Sydney Region. Since there were many predictor variables collected and compared, we noticed there might be a chance for some of the important linear assumptions for being violated. Like,
Lack of Independence nature between data points : It means the property prices within a particular suburb is dependent on the suburb. In our scenario, every suburb has a slightly different influencer on property price, and this is going to be an idiosyncratic factor that affects all prices from the same suburb, thus rendering these different prices inter-dependent (correlated) rather than independent.
Collinearity among variables : The important suburb specific factors like SEIFA scores, crime factors, Migration ratio or distancing from public schools/private schools etc., are correlating with each other rather than statistically significant on determining the property price. This makes the interpretation of model being unstable sometimes which may overrides the explanatory power of certain significant variables.
So the following article is an attempt to handle these violations, particularly about independence assumptions via Linear Mixed Effect Model Approach.
A quick look on the features collected from propery analysis dataset is as below. Some of the key points on the collected data set would be:
library(MASS)
library(caret)
library(tidyverse)
library(scales)
library(corrplot)
library(viridis)
library(dplyr)
library(hrbrthemes)
property_data <- read.csv("cleaned_merged_property_data_final.csv")
str(property_data)
## 'data.frame': 51300 obs. of 44 variables:
## $ postcode : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ year : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
## $ property_id : int 411 310 450 362 345 303 397 61 402 463 ...
## $ lga : Factor w/ 42 levels "Bankstown City Council",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ lat : num -33.9 -33.9 -33.9 -33.9 -33.9 ...
## $ lon : num 151 151 151 151 151 ...
## $ date : Factor w/ 852 levels "1/01/2018","1/02/2018",..: 624 555 828 464 466 823 650 797 432 448 ...
## $ price : num 920000 1210000 1750000 1280000 1060000 ...
## $ address : Factor w/ 50089 levels "(Lot 103) 17 Entwistle Crescent, HARRINGTON PARK NSW 2567",..: 25888 27480 21777 49821 49823 7519 23488 41205 47073 43851 ...
## $ suburb : Factor w/ 836 levels "abbotsbury","abbotsford",..: 706 706 706 706 706 706 706 341 706 706 ...
## $ url : Factor w/ 51288 levels "https://www.domain.com.au/-35-stoneham-circuit-oran-park-nsw-2570-2015456400",..: 26149 27762 22001 50711 50713 7603 23234 41616 47453 44287 ...
## $ no_of_bed : int 1 2 3 2 2 1 2 2 1 2 ...
## $ no_of_bath : int 1 2 2 2 2 1 2 2 1 2 ...
## $ no_of_parking : int 1 1 1 1 1 1 1 4 1 2 ...
## $ house_size : int 88 89 160 141 107 91 107 175 83 156 ...
## $ type : Factor w/ 8 levels "Apartment / Unit / Flat",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ distance_from_CBD : num 0.661 0.754 0.968 1.034 0.968 ...
## $ median_taxable_income : int 27138 27138 27138 27138 27138 27138 27138 27138 27138 27138 ...
## $ mean_taxable_income : int 66698 66698 66698 66698 66698 66698 66698 66698 66698 66698 ...
## $ AQI : num 76.4 76.4 76.4 76.4 76.4 ...
## $ AQI_exclude_bushfire_month : num 32.2 32.2 32.2 32.2 32.2 ...
## $ Within_5km_2019 : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ Btwn_5_10km_2019 : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ Btwn_10_15km_2019 : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ Within_5km_2018 : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ Btwn_5_10km_2018 : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ Btwn_10_15km_2018 : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ Drug.offences : int 3663 3663 3663 3663 3663 3663 3663 3663 3663 3663 ...
## $ Non.Violent.Crime : int 36823 36823 36823 36823 36823 36823 36823 36823 36823 36823 ...
## $ Violent.Crime : int 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 ...
## $ distance_from_closest_station : num 0.298 0.163 0.117 0.233 0.117 0.176 0.12 0.234 0.196 0.252 ...
## $ distance_from_closest_public_school : num 0.788 0.478 0.356 0.751 0.356 0.232 0.357 0.439 0.841 0.737 ...
## $ distance_from_closest_private_school: num 0.38 0.243 0.469 0.371 0.469 0.846 0.471 0.443 0.164 0.414 ...
## $ metropolitan_area : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ Nom.number : int 6471 6471 6471 6471 6471 6471 6471 6471 6471 6471 ...
## $ Nom.ratio : num 2.6 2.6 2.6 2.6 2.6 2.6 2.6 2.6 2.6 2.6 ...
## $ Year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ SAdvDisav_score : int 1072 1072 1072 1072 1072 1072 1072 1007 1072 1072 ...
## $ SAdvDisav_decile : int 9 9 9 9 9 9 9 6 9 9 ...
## $ ER_score : int 806 806 806 806 806 806 806 718 806 806 ...
## $ ER_decile : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EDUOCC_score : int 1135 1135 1135 1135 1135 1135 1135 1079 1135 1135 ...
## $ EDUOCC_decile : int 10 10 10 10 10 10 10 9 10 10 ...
## $ avg_temp : num 17 21 23.5 11.4 14.5 23 17 23.5 17 23.5 ...
barplot(table(property_data$postcode),main="No of Properties by Postcodes",xlab="Postcode",ylab='Frequency')
As we have the properties distributed across most of the sydney postcodes (>40 more) and few 100 to 400 property values were collected per suburb, we are narrowing our analysis further by subsetting the top 10 suburbs based on higher to lower in number of properties sold and also restrict to important property types such as “House” and “Apartment/Unit” for effectively validating the model.
Those postcodes are:
top10_suburb <- property_data %>%
group_by(suburb) %>%
summarize(num=n()) %>%
arrange(desc(num)) %>%
slice(1:10)
top10_suburb
df <- property_data %>%
inner_join(top10_suburb) %>%
filter(type=="House" | type=="Unit" ) %>%
#removing outliers - for each LGA remove 2 highest and 2 lowest from each suburb
arrange(suburb, price) %>%
group_by(suburb) %>%
slice(3:(n()-2)) %>%
ungroup()
As the above table depicts there are multiple observations collected on every suburb,let’s visualise in diagram by how much these price variations is happening between those suburbs even though the number of properties being sold between them is similar in range.
df %>%
mutate(myaxis = paste0(suburb, "\n", "n=", num)) %>%
ggplot( aes(x=myaxis, y=price, fill=suburb)) +
geom_violin(width=1.4) +
geom_boxplot(width=0.1, color="grey", alpha=0.2) +
scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
coord_flip() + # This switch X and Y axis and allows to get the horizontal version
xlab("") +
ylab("Property Price (Log10)")
This is also depicted in the below diagram with the inclusion of a significant fixed effect variable (no_of_beds). Note: The price range is normalised. So based on that, We observed that there will be atleast 0.25 to 0.50 variation of increase/decrease in property prices between suburbs. There lies again a two trends among them with suburbs under “The Hills Shire Council” shown a little higher range of properties on comparing to other suburbs. Thus the property price is influenced also with by the behaviour within suburbs. Let’s see if we can break this dependency via Mixed Effect Model.
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
df$price_norm <- normalize(df$price)
qplot(no_of_bed, price_norm, facets = . ~ suburb,
colour = suburb, geom = "boxplot", data = df)
Apart from the suburb influence, there is an additional source of non-independence that needs to be accounted for : Different Property types We expected the by-type variation among those properties. For example, property prices across Houses can value more than the property price of an Unit even across different suburbs. This behaviour can be noticed irrespective of suburbs they belong to. Under the logarthmic scale, there shows the terrace building would cost 2 scales above than a Unit/Duplex properties. However,it is shown that, the amount of variation among these property types are little as compared to the variation between suburbs.
property_data$type <- as.character(property_data$type)
property_data$type <-
case_when (
property_data$type =="Apartment / Unit / Flat" ~ "Unit",
property_data$type =="New House & Land" ~ "H & L",
TRUE ~ property_data$type
)
property_data %>%
ggplot(aes(x = type, y = price)) +
geom_boxplot() +
scale_y_continuous(labels = comma) +
stat_summary(fun.y=mean, colour="darkred", geom="point",
shape=18, size=3,show_guide = FALSE) +
scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) +
labs(x = "Property Type",
y = "Property Price",
title = "Property Price by Property Type")
By deriving the understanding based on (Green and Tukey, 1960) definition, if sample exhausts the entire population, then it is considered as a fixed effect variable. In our property dataset, all property demographic features like 1. Property Types 2. Property Size 3. Number in terms of bedrooms,bathrooms and car parking And some of the common suburb related ABS survey metrics like 1. Migration ratio 2. Air Quality numbers are all can be considered as the fixed effect variable since entire population of samples are considered.
However, for random effects, eventhough the variable can have all possible levels, the observations are collected only for the sample of populations which will eventually be used to generalise for future predictions. In property dataset, it can be 1. Property addresses within suburb
2. Property prices across suburbs
Let’s say we are interested only on analysing the property behaviour of these top 10 suburbs. So, Are we correct on understanding whether these top 10 suburbs can have any common behaviour among them. That will be interesting to explore.
First, let’s see it in diagram.
#Select Properties for top 10 suburbs
df <- property_data %>%
inner_join(top10_suburb) %>%
filter(type=="House" | type=="Unit" ) %>%
#removing outliers - for each LGA remove 2 highest and 2 lowest from each suburb
arrange(suburb, price) %>%
group_by(suburb) %>%
slice(3:(n()-2)) %>%
ungroup()
#Setting color theme
theme_set(theme_bw(base_size = 18))
df %>%
ggplot(aes(x = type, y = price,colour = suburb)) +
geom_boxplot() +
scale_y_continuous(breaks = seq(300000,max(df$price), by = 500000)) +
labs(x = "Property Type",
y = "Property Price",
title = "Property Price by Property Type")
#Note:Price label to brake starting from the minimum price of suburb which belongs to "Unit" type and break the limits by 500,000 to locate for the distinction clearly in the picture
The visual explains an immediate distinction in valuation of properties between “House” and “Unit”. The mean price of a house is atleast 500,000 higher than the mean price of an unit. But one implicit factor among them is the different on the house size or landscape associated with the property. The Unit/Apartment properties may have 2 times higher than the average size of house/townhose.
with(df, aggregate(price ~ type, FUN = "mean"))
with(df, aggregate(house_size ~ type, FUN = "mean"))
In addition, there are some suburbs having no units sold. Wonder why? But reason could be demographic background or landscape behind them. But I couldn’t confirmed this using the dataset collected.
Let’s first model the differences based on suburbs. For that, we are using the wonderful concept from Linear Mixed Effect model (LMER) which can estimate the random intercepts for every suburb. As per my knowledge, random intercept explains the base value upon which the properties are laid out. These baseline values should be more or less equal to the mean actual prices of the properties. Let’s validate our understanding.
library(lme4)
#Model 1:
res1 = lmer(price ~ type + (1 | suburb), data = df)
# summary(res1)
coef(res1)$suburb[1]
with(df, aggregate(price ~ suburb, FUN = "mean"))
The prices are pretty much closer to the actual mean price of every suburb. The random intercept estimated by every suburb will take care of individual influences across suburbs. The same behaviour can be noticed whenever multiple responses collected for every subject. The subject in this case is the actual properties.
Let’s extend this basic mode by adding few significant fixed effect variables that matters to the properyt prices. But how these various fixed effect variables will interact or effect each other and which one among them can play a influential data role here. Let’s identify by running a statistial test:
#Transformation of variables into factors
df$no_of_bed <- as.factor(df$no_of_bed)
df$no_of_bath <- as.factor(df$no_of_bath)
df$house_size_cat = cut(df$house_size, breaks = seq(0, 1000, by = 100))
#Model 2:
res2.no_of_bath.null = lmer(price ~ type + no_of_bed + (1 | suburb), data = df,REML=FALSE)
# coef(res2.no_of_bath.null)
# summary(res2.no_of_bath.null)
#Model 3:
res3.no_of_bath.model = lmer(price ~ type + no_of_bed + no_of_bath + (1 | suburb), data = df,REML=FALSE)
# coef(res3.no_of_bath.model)
# summary(res3.no_of_bath.model)
anova(res2.no_of_bath.null,res3.no_of_bath.model)
According to Bodo Winter bw_LME_tutorial,the statistical test “Likelihood Ratio Test” can determines presence of a variable with & without in model. So we tested by running a couple of model with property features like Number of Bedrooms and/or with Number of Bathrooms etc.. And,it is found that the effect of having number of bathrooms in the model can effect the model with (p<0.01) and chi-squuared χ2(1)=440.68 (Note: No of bathrooms also shown a good role in AT2 step-wise analysis model). So we can say that basic property features plays a most important role in defining the price of a property.
Now, How about if we compare this resultset with one more fixed effect model : house_size_cat?
#Model 4:
res4.complete.model = lmer(price ~ type + no_of_bed + no_of_bath + house_size_cat + (1 | suburb), data = df,REML=FALSE)
# coef(res4.complete.model)
# summary(res4.complete.model)
anova(res2.no_of_bath.null,res3.no_of_bath.model)
anova(res3.no_of_bath.model,res4.complete.model)
It shows the reduction of AIC value by 200 and also less increase in chi-square. So, it doesn’t have much variance by adding the house size and without. So in order to avoid the collinearity issues, we can stick to our fixed effect model with basic features of property.
Now,let’s extend this model to external properties outside of property scope like Migration Numbers, SEIFA Education Decile number or Bushfire impact numbers. Again, there are many features which are influencing the property model so let’s compare one with Null features, important features (which we understood from AT2 Analysis) and all features.
Here,we cannot model with restricted properties under parsimonious effect of these variables . So we are moving away the filter effect and see how lmer can work if we extend it for all suburbs and particularly with all suburb-wide scope variables.
df2 <- property_data %>%
filter(type=="House" | type=="Unit" ) %>%
#removing outliers - for each LGA remove 2 highest and 2 lowest from each suburb
arrange(lga, price) %>%
group_by(lga) %>%
slice(3:(n()-2)) %>%
ungroup()
df2$no_of_bed <- as.factor(df2$no_of_bed)
df2$no_of_bath <- as.factor(df2$no_of_bath)
#normalise data
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
df2$Violent.Crime<-normalize(df2$Violent.Crime)
#Model 4b:
res4b.all.no_of_bath.model = lmer(price ~ type + no_of_bed + no_of_bath + (1 | suburb), data = df2,REML=FALSE)
#Model 5:
res5a.all.seifa_nom.model = lmer(price ~ type + no_of_bed + no_of_bath + EDUOCC_decile + Nom.ratio + Violent.Crime + (1 | suburb), data = df2,REML=FALSE)
# summary(res5.all.seifa_nom.model)
anova(res4b.all.no_of_bath.model,res5a.all.seifa_nom.model)
The results shown a usage of these variables influenced with significant chisq χ2(1)=403. But again, in contrast to the chi-Square or P-value of property features, this is showing a bit lesser value between them. So if we look out for a model to generalise the results across all suburbs, then it is warrant to consider these 3 important external features.
However, since this analysis focused mainly towards the influential factors between & within suburbs for focused group via top 10 properties, let’s analyse further on random intercept result from these models.
summary(res3.no_of_bath.model)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: price ~ type + no_of_bed + no_of_bath + (1 | suburb)
## Data: df
##
## AIC BIC logLik deviance df.resid
## 94714.6 94807.3 -47342.3 94684.6 3539
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.1177 -0.5538 -0.0574 0.4367 8.0415
##
## Random effects:
## Groups Name Variance Std.Dev.
## suburb (Intercept) 4.652e+10 215679
## Residual 2.137e+10 146178
## Number of obs: 3554, groups: suburb, 10
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 715990 88218 8.116
## typeUnit -236056 27647 -8.538
## no_of_bed2 26759 57052 0.469
## no_of_bed3 74040 56138 1.319
## no_of_bed4 159120 56534 2.815
## no_of_bed5 241699 56984 4.242
## no_of_bed6 212431 59327 3.581
## no_of_bed7 152023 80349 1.892
## no_of_bed8 191318 101641 1.882
## no_of_bath2 47103 7091 6.643
## no_of_bath3 175790 10134 17.347
## no_of_bath4 327016 26833 12.187
## no_of_bath5 487898 41388 11.788
The standard deviation for every value of a suburb can approximately result in 215K differences among these propertiess and also the residual difference shows more significant value. So this indicates the dataset can still represent the property price by very other different variables that are not accounted on this model like distance from school/station/transport/shopping center accessbilities/any other unknown features etc.,
Now let’s also understand further on the effect of fixed effect variables:
#Note: The model tweaked by replacing type with house_size
library(sjPlot)
library(sjmisc)
res3b.no_of_bath.model = lmer(price ~ house_size + no_of_bed + no_of_bath + (1 | suburb), data = df,REML=FALSE)
plot_model(res3b.no_of_bath.model, type = "slope")
All three plots explains a positive effect of variance for every increase in the base fixed effect value. But there is a hinch while notice the decrease in the price of property in case if they exceeds more than 6 number of bedrooms. Let’s see we can plot this via linear model by differentiating between two groups.
df$no_of_bed <- as.numeric(df$no_of_bed)
ggplot(data=filter(df,no_of_bed<6 & type=="House"), aes(x=price,y=no_of_bed)) + geom_point()+
geom_smooth(method='lm')
ggplot(data=filter(df,no_of_bed>=6 & type=="House"), aes(x=price,y=no_of_bed)) + geom_point()+
geom_smooth(method='lm')
Oh yes, we can notice the same results by interpreting the fixed coeffcients of the variables. Also we can note from these coeffecients the suburbs of houses except castle hill is ranging from 500K to 600K value and since the remaining features are fixed, for instance, the unit price, they have same range of values across suburbs with -200K for every unit and gradual increase for every bedroom/bathroom.
coef(res3.no_of_bath.model)
## $suburb
## (Intercept) typeUnit no_of_bed2 no_of_bed3 no_of_bed4
## bateau bay 557044.3 -236055.8 26758.71 74039.82 159119.7
## baulkham hills 952581.6 -236055.8 26758.71 74039.82 159119.7
## blacktown 559349.3 -236055.8 26758.71 74039.82 159119.7
## castle hill 1186285.5 -236055.8 26758.71 74039.82 159119.7
## glenmore park 578957.3 -236055.8 26758.71 74039.82 159119.7
## kellyville 949893.5 -236055.8 26758.71 74039.82 159119.7
## quakers hill 625100.0 -236055.8 26758.71 74039.82 159119.7
## seven hills 625311.3 -236055.8 26758.71 74039.82 159119.7
## st clair 545019.6 -236055.8 26758.71 74039.82 159119.7
## umina beach 580352.5 -236055.8 26758.71 74039.82 159119.7
## no_of_bed5 no_of_bed6 no_of_bed7 no_of_bed8 no_of_bath2
## bateau bay 241699.2 212430.8 152022.6 191317.8 47102.67
## baulkham hills 241699.2 212430.8 152022.6 191317.8 47102.67
## blacktown 241699.2 212430.8 152022.6 191317.8 47102.67
## castle hill 241699.2 212430.8 152022.6 191317.8 47102.67
## glenmore park 241699.2 212430.8 152022.6 191317.8 47102.67
## kellyville 241699.2 212430.8 152022.6 191317.8 47102.67
## quakers hill 241699.2 212430.8 152022.6 191317.8 47102.67
## seven hills 241699.2 212430.8 152022.6 191317.8 47102.67
## st clair 241699.2 212430.8 152022.6 191317.8 47102.67
## umina beach 241699.2 212430.8 152022.6 191317.8 47102.67
## no_of_bath3 no_of_bath4 no_of_bath5
## bateau bay 175790.2 327015.5 487897.6
## baulkham hills 175790.2 327015.5 487897.6
## blacktown 175790.2 327015.5 487897.6
## castle hill 175790.2 327015.5 487897.6
## glenmore park 175790.2 327015.5 487897.6
## kellyville 175790.2 327015.5 487897.6
## quakers hill 175790.2 327015.5 487897.6
## seven hills 175790.2 327015.5 487897.6
## st clair 175790.2 327015.5 487897.6
## umina beach 175790.2 327015.5 487897.6
##
## attr(,"class")
## [1] "coef.mer"
Since so far the effect of the variable is defined only under the different variation between the suburbs but it is quite expected that property price value would also show random variation within suburbs due to its different property features. Let’s model how it looks like:
df$no_of_bed <- as.factor(df$no_of_bed)
res3c.no_of_bath.model =
lmer(price ~ type + no_of_bath + (1 + suburb | no_of_bed ), data = df, REML=FALSE)
#summary(res3c.no_of_bath.model)
coef(res3c.no_of_bath.model)
## $no_of_bed
## suburbbaulkham hills suburbblacktown suburbcastle hill suburbglenmore park
## 1 -62786.083 -558.0452 -165646.96 -43820.9828
## 2 405699.381 -2815.4831 471260.20 -53132.0976
## 3 386701.989 20711.5442 515371.41 -997.6518
## 4 394848.373 -36599.5095 615864.18 16952.6619
## 5 448483.213 30431.8437 757292.59 91011.9234
## 6 530236.631 32561.7991 826423.61 67442.4499
## 7 321984.728 9293.1934 521576.29 46140.7058
## 8 -864.601 10146.2908 30504.96 23200.9237
## suburbkellyville suburbquakers hill suburbseven hills suburbst clair
## 1 -160219.75 -3548.842 4097.064 30476.43
## 2 149856.54 84255.375 84241.056 34824.03
## 3 243944.12 85037.064 79124.078 12463.68
## 4 373031.09 47889.313 35609.559 -34553.38
## 5 519457.06 84670.783 64564.552 -46269.19
## 6 511906.09 105801.559 87764.098 -28619.64
## 7 338464.45 56807.500 44436.491 -27308.08
## 8 47674.04 1564.754 -1224.580 -10158.65
## suburbumina beach (Intercept) typeUnit no_of_bath2 no_of_bath3 no_of_bath4
## 1 -17546.050 628699.9 -221087.8 58636.11 175308.2 312045.4
## 2 6082.508 599506.3 -221087.8 58636.11 175308.2 312045.4
## 3 14275.283 630253.8 -221087.8 58636.11 175308.2 312045.4
## 4 39363.152 720133.7 -221087.8 58636.11 175308.2 312045.4
## 5 44436.845 719026.2 -221087.8 58636.11 175308.2 312045.4
## 6 40593.353 686163.9 -221087.8 58636.11 175308.2 312045.4
## 7 29774.857 700199.5 -221087.8 58636.11 175308.2 312045.4
## 8 4171.853 688157.9 -221087.8 58636.11 175308.2 312045.4
## no_of_bath5
## 1 486823.8
## 2 486823.8
## 3 486823.8
## 4 486823.8
## 5 486823.8
## 6 486823.8
## 7 486823.8
## 8 486823.8
##
## attr(,"class")
## [1] "coef.mer"
d_bycond = na.omit(df) %>%
group_by(no_of_bed, suburb) %>%
summarise(mean_price = mean(price))
ggplot(d_bycond, aes(x=no_of_bed, y=mean_price,
colour=suburb, group=suburb)) +
geom_line(size=2) + geom_point(size=1, shape=21, fill="white")
The results are apparent that every suburb have increase in value for every increase in the number of bedrooms. For example: The umina beach shows a increase of around 3k to 4k for every increase in the bedroom which is also happening across costliest suburbs like baulkham hills (proportional increase of 3% to 4% ).
Let’s replace the same model but instead of finding the random slope by rooms, let’s do by property types. It will be evident to notice a different range of values between Houses and Units and also all it is common across all suburbs. The proportion of this difference is also similar to the proportion based on number of rooms.
res3d.no_of_bath.model =
lmer(price ~ no_of_bed + no_of_bath + (1 + suburb | type ), data = df, REML=FALSE)
coef(res3d.no_of_bath.model)
## $type
## suburbbaulkham hills suburbblacktown suburbcastle hill
## House 397077.6 -630.9702 631817.9
## Unit 214112.6 20489.3300 463421.3
## suburbglenmore park suburbkellyville suburbquakers hill suburbseven hills
## House 22129.16 394476.7 68060.4 68449.22
## Unit 125260.73 260749.1 152464.6 149308.14
## suburbst clair suburbumina beach (Intercept) no_of_bed2 no_of_bed3
## House -12229.29 23712.62 540178.9 38816.31 91290.87
## Unit 123347.64 130887.34 344269.9 38816.31 91290.87
## no_of_bed4 no_of_bed5 no_of_bed6 no_of_bed7 no_of_bed8 no_of_bath2
## House 175291.5 257787.3 229256.7 169210.8 210065.9 47295.74
## Unit 175291.5 257787.3 229256.7 169210.8 210065.9 47295.74
## no_of_bath3 no_of_bath4 no_of_bath5
## House 175223.6 326177.6 487061.9
## Unit 175223.6 326177.6 487061.9
##
## attr(,"class")
## [1] "coef.mer"
d_bycond = na.omit(df) %>%
group_by(type, suburb) %>%
summarise(mean_price = mean(price))
ggplot(d_bycond, aes(x=type, y=mean_price,
colour=suburb, group=suburb)) +
geom_line(size=2) + geom_point(size=5, shape=21, fill="white")
As we have mixed effect model which is very effective in estimating the baseline price value and also good on estimating the variation between them, how can we understand “If random slopes is important?”. Let’s answer by identifying AIC deviance between these models. As per the below reference article, it is mentioned that lower the AIC’s, the better the model is fitting the data. Let’s compare the AIC’s value here.
res3a.no_of_bath.model = lmer(price ~ type + no_of_bed + no_of_bath + (1 | suburb), data = df,REML=FALSE)
res3c.no_of_bath.model =
lmer(price ~ type + no_of_bath + (1 + suburb | no_of_bed ), data = df, REML=FALSE)
anova(res3a.no_of_bath.model,res3c.no_of_bath.model)
Surprisingly (am not sure if am interpreting results correct), but the AIC value over random slope/random intercept model shows lesser AIC value than the model only with random intercept. And they also have 5 times more than the number of parametes than random intercept. So I am not sure, if the model belongs to overfitting, it’s going to be part of my section-2 analysis in future.
I would like to reinstate the earlier question on how the major assumption from linear model would be violated when addressing a more realtime datasets like Property datasets here. Especially if any datasets comes under hierarchical structure eg Different positions within organisations or Different groups of entities or Different categories of animals/flowers, better to watch out more carefully on explanatory power of variables.
In those cases, the linear mixed effect model is a highly recommended model (in scope of my little knowledge). This model can be used for behavioural study between and within the geographical regions of the properties. I am happy to explore this same model (mixed effect) for learning about properties outside of Australia as well. However, I understand I am scratching the surface still.