1. Going back to Group Results

As part of the Property Price Influencers Group, our aim was to identify what are the most influential property factors that would a play vital role on determining the property prices growth of the Greater Sydney Region. Since there were many predictor variables collected and compared, we noticed there might be a chance for some of the important linear assumptions for being violated. Like,

  1. Lack of Independence nature between data points : It means the property prices within a particular suburb is dependent on the suburb. In our scenario, every suburb has a slightly different influencer on property price, and this is going to be an idiosyncratic factor that affects all prices from the same suburb, thus rendering these different prices inter-dependent (correlated) rather than independent.

  2. Collinearity among variables : The important suburb specific factors like SEIFA scores, crime factors, Migration ratio or distancing from public schools/private schools etc., are correlating with each other rather than statistically significant on determining the property price. This makes the interpretation of model being unstable sometimes which may overrides the explanatory power of certain significant variables.

So the following article is an attempt to handle these violations, particularly about independence assumptions via Linear Mixed Effect Model Approach.

2. Violation of Independence Assumption

2.1 General Look of Dataset

A quick look on the features collected from propery analysis dataset is as below. Some of the key points on the collected data set would be:

  1. Properties collected across Greater Sydney Region
  2. Basic Demographic features recorded from Domain websites are considered
  3. External influencing features on Natural Disasters, Crime Rate & SEIFA Scores are collected.
  4. Collected for the time period from Nov 2018 onwards. Property evaluation across timeperiod cannot be analysed.
  5. Top 3 & Bottom 3 outliers of property prices can be eliminated from this dataset as some data quality issues were encountered during manual entry of properties.
library(MASS)
library(caret)
library(tidyverse)
library(scales)
library(corrplot)
library(viridis)
library(dplyr)
library(hrbrthemes)

property_data <- read.csv("cleaned_merged_property_data_final.csv")
str(property_data)
## 'data.frame':    51300 obs. of  44 variables:
##  $ postcode                            : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ year                                : int  2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
##  $ property_id                         : int  411 310 450 362 345 303 397 61 402 463 ...
##  $ lga                                 : Factor w/ 42 levels "Bankstown City Council",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ lat                                 : num  -33.9 -33.9 -33.9 -33.9 -33.9 ...
##  $ lon                                 : num  151 151 151 151 151 ...
##  $ date                                : Factor w/ 852 levels "1/01/2018","1/02/2018",..: 624 555 828 464 466 823 650 797 432 448 ...
##  $ price                               : num  920000 1210000 1750000 1280000 1060000 ...
##  $ address                             : Factor w/ 50089 levels "(Lot 103) 17 Entwistle Crescent, HARRINGTON PARK NSW 2567",..: 25888 27480 21777 49821 49823 7519 23488 41205 47073 43851 ...
##  $ suburb                              : Factor w/ 836 levels "abbotsbury","abbotsford",..: 706 706 706 706 706 706 706 341 706 706 ...
##  $ url                                 : Factor w/ 51288 levels "https://www.domain.com.au/-35-stoneham-circuit-oran-park-nsw-2570-2015456400",..: 26149 27762 22001 50711 50713 7603 23234 41616 47453 44287 ...
##  $ no_of_bed                           : int  1 2 3 2 2 1 2 2 1 2 ...
##  $ no_of_bath                          : int  1 2 2 2 2 1 2 2 1 2 ...
##  $ no_of_parking                       : int  1 1 1 1 1 1 1 4 1 2 ...
##  $ house_size                          : int  88 89 160 141 107 91 107 175 83 156 ...
##  $ type                                : Factor w/ 8 levels "Apartment / Unit / Flat",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ distance_from_CBD                   : num  0.661 0.754 0.968 1.034 0.968 ...
##  $ median_taxable_income               : int  27138 27138 27138 27138 27138 27138 27138 27138 27138 27138 ...
##  $ mean_taxable_income                 : int  66698 66698 66698 66698 66698 66698 66698 66698 66698 66698 ...
##  $ AQI                                 : num  76.4 76.4 76.4 76.4 76.4 ...
##  $ AQI_exclude_bushfire_month          : num  32.2 32.2 32.2 32.2 32.2 ...
##  $ Within_5km_2019                     : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Btwn_5_10km_2019                    : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Btwn_10_15km_2019                   : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Within_5km_2018                     : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Btwn_5_10km_2018                    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Btwn_10_15km_2018                   : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Drug.offences                       : int  3663 3663 3663 3663 3663 3663 3663 3663 3663 3663 ...
##  $ Non.Violent.Crime                   : int  36823 36823 36823 36823 36823 36823 36823 36823 36823 36823 ...
##  $ Violent.Crime                       : int  1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 ...
##  $ distance_from_closest_station       : num  0.298 0.163 0.117 0.233 0.117 0.176 0.12 0.234 0.196 0.252 ...
##  $ distance_from_closest_public_school : num  0.788 0.478 0.356 0.751 0.356 0.232 0.357 0.439 0.841 0.737 ...
##  $ distance_from_closest_private_school: num  0.38 0.243 0.469 0.371 0.469 0.846 0.471 0.443 0.164 0.414 ...
##  $ metropolitan_area                   : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ Nom.number                          : int  6471 6471 6471 6471 6471 6471 6471 6471 6471 6471 ...
##  $ Nom.ratio                           : num  2.6 2.6 2.6 2.6 2.6 2.6 2.6 2.6 2.6 2.6 ...
##  $ Year                                : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ SAdvDisav_score                     : int  1072 1072 1072 1072 1072 1072 1072 1007 1072 1072 ...
##  $ SAdvDisav_decile                    : int  9 9 9 9 9 9 9 6 9 9 ...
##  $ ER_score                            : int  806 806 806 806 806 806 806 718 806 806 ...
##  $ ER_decile                           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EDUOCC_score                        : int  1135 1135 1135 1135 1135 1135 1135 1079 1135 1135 ...
##  $ EDUOCC_decile                       : int  10 10 10 10 10 10 10 9 10 10 ...
##  $ avg_temp                            : num  17 21 23.5 11.4 14.5 23 17 23.5 17 23.5 ...
barplot(table(property_data$postcode),main="No of Properties by Postcodes",xlab="Postcode",ylab='Frequency')

2.2 Restricted Dataset for this Analysis

As we have the properties distributed across most of the sydney postcodes (>40 more) and few 100 to 400 property values were collected per suburb, we are narrowing our analysis further by subsetting the top 10 suburbs based on higher to lower in number of properties sold and also restrict to important property types such as “House” and “Apartment/Unit” for effectively validating the model.

Those postcodes are:

top10_suburb <- property_data %>% 
  group_by(suburb) %>% 
  summarize(num=n()) %>%
  arrange(desc(num)) %>%
  slice(1:10)

top10_suburb
df <- property_data %>% 
  inner_join(top10_suburb) %>%
  filter(type=="House" | type=="Unit"  ) %>%
  #removing outliers - for each LGA remove 2 highest and 2 lowest from each suburb
  arrange(suburb, price) %>% 
  group_by(suburb) %>% 
  slice(3:(n()-2)) %>%
  ungroup()

As the above table depicts there are multiple observations collected on every suburb,let’s visualise in diagram by how much these price variations is happening between those suburbs even though the number of properties being sold between them is similar in range.

df %>%
  mutate(myaxis = paste0(suburb, "\n", "n=", num)) %>%
  ggplot( aes(x=myaxis, y=price, fill=suburb)) +
  geom_violin(width=1.4) +
  geom_boxplot(width=0.1, color="grey", alpha=0.2) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
    labels = trans_format("log10", math_format(10^.x))) +
  scale_fill_viridis(discrete = TRUE) +
  theme_ipsum() +
  theme(
    legend.position="none",
    plot.title = element_text(size=11)
  ) +
  coord_flip() + # This switch X and Y axis and allows to get the horizontal version
  xlab("") +
  ylab("Property Price (Log10)")

This is also depicted in the below diagram with the inclusion of a significant fixed effect variable (no_of_beds). Note: The price range is normalised. So based on that, We observed that there will be atleast 0.25 to 0.50 variation of increase/decrease in property prices between suburbs. There lies again a two trends among them with suburbs under “The Hills Shire Council” shown a little higher range of properties on comparing to other suburbs. Thus the property price is influenced also with by the behaviour within suburbs. Let’s see if we can break this dependency via Mixed Effect Model.

normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

df$price_norm <- normalize(df$price)

qplot(no_of_bed, price_norm, facets = . ~ suburb,
     colour = suburb, geom = "boxplot", data = df) 

2.3 Absence of Independence - Property Type

Apart from the suburb influence, there is an additional source of non-independence that needs to be accounted for : Different Property types We expected the by-type variation among those properties. For example, property prices across Houses can value more than the property price of an Unit even across different suburbs. This behaviour can be noticed irrespective of suburbs they belong to. Under the logarthmic scale, there shows the terrace building would cost 2 scales above than a Unit/Duplex properties. However,it is shown that, the amount of variation among these property types are little as compared to the variation between suburbs.

property_data$type <- as.character(property_data$type)

property_data$type <- 
  case_when (
    property_data$type =="Apartment / Unit / Flat" ~ "Unit",
    property_data$type =="New House & Land" ~ "H & L",
    TRUE ~ property_data$type
  )

property_data %>%
  ggplot(aes(x = type, y = price)) +
  geom_boxplot() + 
  scale_y_continuous(labels = comma) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", 
               shape=18, size=3,show_guide = FALSE) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  labs(x = "Property Type",
       y = "Property Price",
       title = "Property Price by Property Type")

3. Fixed Vs Random Effect Variables

By deriving the understanding based on (Green and Tukey, 1960) definition, if sample exhausts the entire population, then it is considered as a fixed effect variable. In our property dataset, all property demographic features like 1. Property Types 2. Property Size 3. Number in terms of bedrooms,bathrooms and car parking And some of the common suburb related ABS survey metrics like 1. Migration ratio 2. Air Quality numbers are all can be considered as the fixed effect variable since entire population of samples are considered.

However, for random effects, eventhough the variable can have all possible levels, the observations are collected only for the sample of populations which will eventually be used to generalise for future predictions. In property dataset, it can be 1. Property addresses within suburb
2. Property prices across suburbs

4. Data Analysis across Property types

Let’s say we are interested only on analysing the property behaviour of these top 10 suburbs. So, Are we correct on understanding whether these top 10 suburbs can have any common behaviour among them. That will be interesting to explore.

First, let’s see it in diagram.

#Select Properties for top 10 suburbs
df <- property_data %>% 
  inner_join(top10_suburb) %>%
  filter(type=="House" | type=="Unit" ) %>%
  #removing outliers - for each LGA remove 2 highest and 2 lowest from each suburb
  arrange(suburb, price) %>% 
  group_by(suburb) %>% 
  slice(3:(n()-2)) %>%
  ungroup()

#Setting color theme
theme_set(theme_bw(base_size = 18))

df %>%
  ggplot(aes(x = type, y = price,colour = suburb)) +
  geom_boxplot() + 
  scale_y_continuous(breaks = seq(300000,max(df$price), by = 500000)) +
  labs(x = "Property Type",
       y = "Property Price",
       title = "Property Price by Property Type")

#Note:Price label to brake starting from the minimum price of suburb which belongs to "Unit" type and break the limits by 500,000 to locate for the distinction clearly in the picture

The visual explains an immediate distinction in valuation of properties between “House” and “Unit”. The mean price of a house is atleast 500,000 higher than the mean price of an unit. But one implicit factor among them is the different on the house size or landscape associated with the property. The Unit/Apartment properties may have 2 times higher than the average size of house/townhose.

with(df, aggregate(price ~ type, FUN = "mean"))
with(df, aggregate(house_size ~ type, FUN = "mean"))

In addition, there are some suburbs having no units sold. Wonder why? But reason could be demographic background or landscape behind them. But I couldn’t confirmed this using the dataset collected.

5. Mixed Effect Model for Random Suburbs

Let’s first model the differences based on suburbs. For that, we are using the wonderful concept from Linear Mixed Effect model (LMER) which can estimate the random intercepts for every suburb. As per my knowledge, random intercept explains the base value upon which the properties are laid out. These baseline values should be more or less equal to the mean actual prices of the properties. Let’s validate our understanding.

5.1 Creating Model

library(lme4)

#Model 1:
res1 = lmer(price ~ type + (1 | suburb), data = df)

# summary(res1)
coef(res1)$suburb[1]

5.2 Finding Actual Mean Price of Dataset

with(df, aggregate(price ~ suburb, FUN = "mean"))

The prices are pretty much closer to the actual mean price of every suburb. The random intercept estimated by every suburb will take care of individual influences across suburbs. The same behaviour can be noticed whenever multiple responses collected for every subject. The subject in this case is the actual properties.

6. Likelihood Ratio Test(LRT) for validating Models

Let’s extend this basic mode by adding few significant fixed effect variables that matters to the properyt prices. But how these various fixed effect variables will interact or effect each other and which one among them can play a influential data role here. Let’s identify by running a statistial test:

6.1 Basic Property Features

#Transformation of variables into factors
df$no_of_bed <- as.factor(df$no_of_bed)
df$no_of_bath <- as.factor(df$no_of_bath)
df$house_size_cat = cut(df$house_size, breaks = seq(0, 1000, by = 100))

#Model 2:
res2.no_of_bath.null = lmer(price ~ type + no_of_bed +  (1 | suburb), data = df,REML=FALSE)
# coef(res2.no_of_bath.null)
# summary(res2.no_of_bath.null)

#Model 3:
res3.no_of_bath.model = lmer(price ~ type + no_of_bed + no_of_bath +  (1 | suburb), data = df,REML=FALSE)
# coef(res3.no_of_bath.model)
# summary(res3.no_of_bath.model)

anova(res2.no_of_bath.null,res3.no_of_bath.model)

According to Bodo Winter bw_LME_tutorial,the statistical test “Likelihood Ratio Test” can determines presence of a variable with & without in model. So we tested by running a couple of model with property features like Number of Bedrooms and/or with Number of Bathrooms etc.. And,it is found that the effect of having number of bathrooms in the model can effect the model with (p<0.01) and chi-squuared χ2(1)=440.68 (Note: No of bathrooms also shown a good role in AT2 step-wise analysis model). So we can say that basic property features plays a most important role in defining the price of a property.

6.2 Adding House Size

Now, How about if we compare this resultset with one more fixed effect model : house_size_cat?

#Model 4:
res4.complete.model = lmer(price ~ type + no_of_bed + no_of_bath + house_size_cat + (1 | suburb), data = df,REML=FALSE)
# coef(res4.complete.model)
# summary(res4.complete.model)

anova(res2.no_of_bath.null,res3.no_of_bath.model)
anova(res3.no_of_bath.model,res4.complete.model)

It shows the reduction of AIC value by 200 and also less increase in chi-square. So, it doesn’t have much variance by adding the house size and without. So in order to avoid the collinearity issues, we can stick to our fixed effect model with basic features of property.

6.3 Features External to Properties

Now,let’s extend this model to external properties outside of property scope like Migration Numbers, SEIFA Education Decile number or Bushfire impact numbers. Again, there are many features which are influencing the property model so let’s compare one with Null features, important features (which we understood from AT2 Analysis) and all features.

Here,we cannot model with restricted properties under parsimonious effect of these variables . So we are moving away the filter effect and see how lmer can work if we extend it for all suburbs and particularly with all suburb-wide scope variables.

df2 <- property_data %>% 
  filter(type=="House" | type=="Unit"  ) %>%
  #removing outliers - for each LGA remove 2 highest and 2 lowest from each suburb
  arrange(lga, price) %>% 
  group_by(lga) %>% 
  slice(3:(n()-2)) %>%
  ungroup()

df2$no_of_bed <- as.factor(df2$no_of_bed)
df2$no_of_bath <- as.factor(df2$no_of_bath)
#normalise data
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
df2$Violent.Crime<-normalize(df2$Violent.Crime)

#Model 4b:
res4b.all.no_of_bath.model = lmer(price ~ type + no_of_bed + no_of_bath +  (1 | suburb), data = df2,REML=FALSE)

#Model 5:
res5a.all.seifa_nom.model = lmer(price ~ type + no_of_bed + no_of_bath +  EDUOCC_decile + Nom.ratio + Violent.Crime +   (1 | suburb), data = df2,REML=FALSE)
# summary(res5.all.seifa_nom.model)

anova(res4b.all.no_of_bath.model,res5a.all.seifa_nom.model)

The results shown a usage of these variables influenced with significant chisq χ2(1)=403. But again, in contrast to the chi-Square or P-value of property features, this is showing a bit lesser value between them. So if we look out for a model to generalise the results across all suburbs, then it is warrant to consider these 3 important external features.

7. Interpretation of model results - Random Intercept and Fixed Slope

However, since this analysis focused mainly towards the influential factors between & within suburbs for focused group via top 10 properties, let’s analyse further on random intercept result from these models.

7.1 Results Summary

summary(res3.no_of_bath.model)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: price ~ type + no_of_bed + no_of_bath + (1 | suburb)
##    Data: df
## 
##      AIC      BIC   logLik deviance df.resid 
##  94714.6  94807.3 -47342.3  94684.6     3539 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.1177 -0.5538 -0.0574  0.4367  8.0415 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  suburb   (Intercept) 4.652e+10 215679  
##  Residual             2.137e+10 146178  
## Number of obs: 3554, groups:  suburb, 10
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   715990      88218   8.116
## typeUnit     -236056      27647  -8.538
## no_of_bed2     26759      57052   0.469
## no_of_bed3     74040      56138   1.319
## no_of_bed4    159120      56534   2.815
## no_of_bed5    241699      56984   4.242
## no_of_bed6    212431      59327   3.581
## no_of_bed7    152023      80349   1.892
## no_of_bed8    191318     101641   1.882
## no_of_bath2    47103       7091   6.643
## no_of_bath3   175790      10134  17.347
## no_of_bath4   327016      26833  12.187
## no_of_bath5   487898      41388  11.788

The standard deviation for every value of a suburb can approximately result in 215K differences among these propertiess and also the residual difference shows more significant value. So this indicates the dataset can still represent the property price by very other different variables that are not accounted on this model like distance from school/station/transport/shopping center accessbilities/any other unknown features etc.,

7.2 Variation from Fixed Effect Variables

Now let’s also understand further on the effect of fixed effect variables:

#Note: The model tweaked by replacing type with house_size
library(sjPlot)
library(sjmisc)

res3b.no_of_bath.model = lmer(price ~ house_size + no_of_bed + no_of_bath +  (1 | suburb), data = df,REML=FALSE)

plot_model(res3b.no_of_bath.model, type = "slope")

All three plots explains a positive effect of variance for every increase in the base fixed effect value. But there is a hinch while notice the decrease in the price of property in case if they exceeds more than 6 number of bedrooms. Let’s see we can plot this via linear model by differentiating between two groups.

df$no_of_bed <- as.numeric(df$no_of_bed)

ggplot(data=filter(df,no_of_bed<6 & type=="House"), aes(x=price,y=no_of_bed)) + geom_point()+ 
  geom_smooth(method='lm')

ggplot(data=filter(df,no_of_bed>=6 & type=="House"), aes(x=price,y=no_of_bed)) + geom_point()+ 
  geom_smooth(method='lm')

Oh yes, we can notice the same results by interpreting the fixed coeffcients of the variables. Also we can note from these coeffecients the suburbs of houses except castle hill is ranging from 500K to 600K value and since the remaining features are fixed, for instance, the unit price, they have same range of values across suburbs with -200K for every unit and gradual increase for every bedroom/bathroom.

coef(res3.no_of_bath.model)
## $suburb
##                (Intercept)  typeUnit no_of_bed2 no_of_bed3 no_of_bed4
## bateau bay        557044.3 -236055.8   26758.71   74039.82   159119.7
## baulkham hills    952581.6 -236055.8   26758.71   74039.82   159119.7
## blacktown         559349.3 -236055.8   26758.71   74039.82   159119.7
## castle hill      1186285.5 -236055.8   26758.71   74039.82   159119.7
## glenmore park     578957.3 -236055.8   26758.71   74039.82   159119.7
## kellyville        949893.5 -236055.8   26758.71   74039.82   159119.7
## quakers hill      625100.0 -236055.8   26758.71   74039.82   159119.7
## seven hills       625311.3 -236055.8   26758.71   74039.82   159119.7
## st clair          545019.6 -236055.8   26758.71   74039.82   159119.7
## umina beach       580352.5 -236055.8   26758.71   74039.82   159119.7
##                no_of_bed5 no_of_bed6 no_of_bed7 no_of_bed8 no_of_bath2
## bateau bay       241699.2   212430.8   152022.6   191317.8    47102.67
## baulkham hills   241699.2   212430.8   152022.6   191317.8    47102.67
## blacktown        241699.2   212430.8   152022.6   191317.8    47102.67
## castle hill      241699.2   212430.8   152022.6   191317.8    47102.67
## glenmore park    241699.2   212430.8   152022.6   191317.8    47102.67
## kellyville       241699.2   212430.8   152022.6   191317.8    47102.67
## quakers hill     241699.2   212430.8   152022.6   191317.8    47102.67
## seven hills      241699.2   212430.8   152022.6   191317.8    47102.67
## st clair         241699.2   212430.8   152022.6   191317.8    47102.67
## umina beach      241699.2   212430.8   152022.6   191317.8    47102.67
##                no_of_bath3 no_of_bath4 no_of_bath5
## bateau bay        175790.2    327015.5    487897.6
## baulkham hills    175790.2    327015.5    487897.6
## blacktown         175790.2    327015.5    487897.6
## castle hill       175790.2    327015.5    487897.6
## glenmore park     175790.2    327015.5    487897.6
## kellyville        175790.2    327015.5    487897.6
## quakers hill      175790.2    327015.5    487897.6
## seven hills       175790.2    327015.5    487897.6
## st clair          175790.2    327015.5    487897.6
## umina beach       175790.2    327015.5    487897.6
## 
## attr(,"class")
## [1] "coef.mer"

8. Interpretation of model results - Random Intercept and Random Slope

Since so far the effect of the variable is defined only under the different variation between the suburbs but it is quite expected that property price value would also show random variation within suburbs due to its different property features. Let’s model how it looks like:

8.1 Random Slope : No-of-Bed

df$no_of_bed <- as.factor(df$no_of_bed)

res3c.no_of_bath.model = 
  lmer(price ~  type + no_of_bath +  (1 + suburb | no_of_bed ), data = df, REML=FALSE)

#summary(res3c.no_of_bath.model)

coef(res3c.no_of_bath.model)
## $no_of_bed
##   suburbbaulkham hills suburbblacktown suburbcastle hill suburbglenmore park
## 1           -62786.083       -558.0452        -165646.96         -43820.9828
## 2           405699.381      -2815.4831         471260.20         -53132.0976
## 3           386701.989      20711.5442         515371.41           -997.6518
## 4           394848.373     -36599.5095         615864.18          16952.6619
## 5           448483.213      30431.8437         757292.59          91011.9234
## 6           530236.631      32561.7991         826423.61          67442.4499
## 7           321984.728       9293.1934         521576.29          46140.7058
## 8             -864.601      10146.2908          30504.96          23200.9237
##   suburbkellyville suburbquakers hill suburbseven hills suburbst clair
## 1       -160219.75          -3548.842          4097.064       30476.43
## 2        149856.54          84255.375         84241.056       34824.03
## 3        243944.12          85037.064         79124.078       12463.68
## 4        373031.09          47889.313         35609.559      -34553.38
## 5        519457.06          84670.783         64564.552      -46269.19
## 6        511906.09         105801.559         87764.098      -28619.64
## 7        338464.45          56807.500         44436.491      -27308.08
## 8         47674.04           1564.754         -1224.580      -10158.65
##   suburbumina beach (Intercept)  typeUnit no_of_bath2 no_of_bath3 no_of_bath4
## 1        -17546.050    628699.9 -221087.8    58636.11    175308.2    312045.4
## 2          6082.508    599506.3 -221087.8    58636.11    175308.2    312045.4
## 3         14275.283    630253.8 -221087.8    58636.11    175308.2    312045.4
## 4         39363.152    720133.7 -221087.8    58636.11    175308.2    312045.4
## 5         44436.845    719026.2 -221087.8    58636.11    175308.2    312045.4
## 6         40593.353    686163.9 -221087.8    58636.11    175308.2    312045.4
## 7         29774.857    700199.5 -221087.8    58636.11    175308.2    312045.4
## 8          4171.853    688157.9 -221087.8    58636.11    175308.2    312045.4
##   no_of_bath5
## 1    486823.8
## 2    486823.8
## 3    486823.8
## 4    486823.8
## 5    486823.8
## 6    486823.8
## 7    486823.8
## 8    486823.8
## 
## attr(,"class")
## [1] "coef.mer"
d_bycond = na.omit(df) %>%
  group_by(no_of_bed, suburb) %>%
  summarise(mean_price = mean(price))

ggplot(d_bycond, aes(x=no_of_bed, y=mean_price, 
                     colour=suburb, group=suburb)) +
  geom_line(size=2) + geom_point(size=1, shape=21, fill="white")

The results are apparent that every suburb have increase in value for every increase in the number of bedrooms. For example: The umina beach shows a increase of around 3k to 4k for every increase in the bedroom which is also happening across costliest suburbs like baulkham hills (proportional increase of 3% to 4% ).

8.2 Random Slope : Type

Let’s replace the same model but instead of finding the random slope by rooms, let’s do by property types. It will be evident to notice a different range of values between Houses and Units and also all it is common across all suburbs. The proportion of this difference is also similar to the proportion based on number of rooms.

res3d.no_of_bath.model = 
  lmer(price ~  no_of_bed + no_of_bath +  (1 + suburb | type  ), data = df, REML=FALSE)

coef(res3d.no_of_bath.model)
## $type
##       suburbbaulkham hills suburbblacktown suburbcastle hill
## House             397077.6       -630.9702          631817.9
## Unit              214112.6      20489.3300          463421.3
##       suburbglenmore park suburbkellyville suburbquakers hill suburbseven hills
## House            22129.16         394476.7            68060.4          68449.22
## Unit            125260.73         260749.1           152464.6         149308.14
##       suburbst clair suburbumina beach (Intercept) no_of_bed2 no_of_bed3
## House      -12229.29          23712.62    540178.9   38816.31   91290.87
## Unit       123347.64         130887.34    344269.9   38816.31   91290.87
##       no_of_bed4 no_of_bed5 no_of_bed6 no_of_bed7 no_of_bed8 no_of_bath2
## House   175291.5   257787.3   229256.7   169210.8   210065.9    47295.74
## Unit    175291.5   257787.3   229256.7   169210.8   210065.9    47295.74
##       no_of_bath3 no_of_bath4 no_of_bath5
## House    175223.6    326177.6    487061.9
## Unit     175223.6    326177.6    487061.9
## 
## attr(,"class")
## [1] "coef.mer"
d_bycond = na.omit(df) %>%
  group_by(type, suburb) %>%
  summarise(mean_price = mean(price))

ggplot(d_bycond, aes(x=type, y=mean_price, 
                     colour=suburb, group=suburb)) +
  geom_line(size=2) + geom_point(size=5, shape=21, fill="white")

8.3 Comparing Models between Random Slope Vs Only Random Intercept

As we have mixed effect model which is very effective in estimating the baseline price value and also good on estimating the variation between them, how can we understand “If random slopes is important?”. Let’s answer by identifying AIC deviance between these models. As per the below reference article, it is mentioned that lower the AIC’s, the better the model is fitting the data. Let’s compare the AIC’s value here.

res3a.no_of_bath.model = lmer(price ~ type + no_of_bed + no_of_bath +  (1 | suburb), data = df,REML=FALSE)

res3c.no_of_bath.model = 
  lmer(price ~  type + no_of_bath +  (1 + suburb | no_of_bed ), data = df, REML=FALSE)

anova(res3a.no_of_bath.model,res3c.no_of_bath.model)

Surprisingly (am not sure if am interpreting results correct), but the AIC value over random slope/random intercept model shows lesser AIC value than the model only with random intercept. And they also have 5 times more than the number of parametes than random intercept. So I am not sure, if the model belongs to overfitting, it’s going to be part of my section-2 analysis in future.

9. Conclusion

I would like to reinstate the earlier question on how the major assumption from linear model would be violated when addressing a more realtime datasets like Property datasets here. Especially if any datasets comes under hierarchical structure eg Different positions within organisations or Different groups of entities or Different categories of animals/flowers, better to watch out more carefully on explanatory power of variables.

In those cases, the linear mixed effect model is a highly recommended model (in scope of my little knowledge). This model can be used for behavioural study between and within the geographical regions of the properties. I am happy to explore this same model (mixed effect) for learning about properties outside of Australia as well. However, I understand I am scratching the surface still.

10. References

  1. Winter, B. (2013). Linear models and linear mixed effects models in R with linguistic applications. arXiv:1308.5499. Link
  2. Bodo Winter (2011). The F distribution and the basic principle behind ANOVAs.Last updated: September 21, 2011.Link
  3. Section Week 8 - Linear Mixed Models. Link 4.Daniel Lüdecke.2020-05-28.Plotting Interaction Effects of Regression Models. Link 5.Micah Mumper. January 2017. Multilevel modelling. A quick introduction.Link