The group project aimed to identify the variables that most affect house prices in the Greater Sydney Region. The motivation is that Greater Sydney’s housing market is recognized as one of the most expensive in the world, with median detached dwelling prices exceeding $1 million and reaching 10.5 times the average household income. At the same time, mortgages are absorbing up to 50% of household income. The highly leveraged position in which many Greater Sydney residents find themselves is not sustainable: any economic shock could undermine the precarious balance of the region's housing market. We first developed a forward multiple linear regression to identify the critical variables. We then implemented a second model based on the ‘Average House’, defined as the average of the main features that characterize a house. We found that the most important variables in determining house prices in the Greater Sydney Region are the Mean Taxable Income, followed by the distance from the CBD.
At this point, we are ready to extend our research on the housing market in Sydney. After studying our dataset in more depth, we found a possible hierarchical structure in the data. The housing choice starts from the area in which the house is built, then the suburb, and finally the house itself, with all its characteristics. It therefore seems natural to use multilevel modelling (MLM) to study the dataset. Throughout this paper, we want to answer the following questions:
How much of the variability among prices can be explained by differences among LGAs (Local Government Areas)? What is the impact of house size on price, depending on the LGA?
The multilevel approach suits this dataset because the observations are organized in more than one level (nested data). The individual house is the lowest level, nested in the suburb and finally in the LGA. In our dataset, the LGAs differ considerably in the number of observations and in the types of houses. The following graph shows the number of houses in each LGA, subdivided by type of house.
As a result, I think it would be interesting to develop a multilevel model to analyze the impact of the area in which a house is built on its price. I am going to fit a mixed-effects model to highlight the effects of the hierarchical nature of the data and the differences in price among LGAs.
We use this statistical method because it allows us to compare prices among different LGAs. Moreover, it allows us to analyse the dataset at several levels simultaneously, as is typical for nested data. MLM is well suited to clustered or grouped data: it helps us understand how much the suburb, postcode or council influences house prices. Once the model is fitted, it could also be used to estimate the average price of houses in given areas and with particular characteristics.
For the project, I am going to use the R packages tidyverse and lme4. Before applying the multilevel model to the dataset, we first have to modify some of its variables:
We applied a logarithmic transformation to the price variable to deal with its skewness.
We transformed numerical variables such as distance from the CBD, number of beds and air quality into categorical variables (see Appendix).
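As a rough illustration of these two preprocessing steps, the sketch below uses only base R on a small made-up data frame. The column names `price_raw` and `dist_km`, the base-10 logarithm and the quartile cut points are all assumptions made for the example; the transformations actually used are the ones referred to in the Appendix.

```r
# Toy data standing in for the real dataset (values are invented).
dataset <- data.frame(
  price_raw = c(650000, 980000, 1500000, 3200000),  # sale price in dollars
  dist_km   = c(45, 22, 9, 3)                        # distance from the CBD
)

# Log-transform the price to reduce right skew (base 10 is an assumption).
dataset$price <- log10(dataset$price_raw)

# Bin a numeric variable into four ordered categories at its quartiles.
breaks <- quantile(dataset$dist_km, probs = seq(0, 1, by = 0.25))
dataset$distCBD <- cut(
  dataset$dist_km,
  breaks = breaks,
  labels = c("low", "medium", "high", "very high"),
  include.lowest = TRUE
)

print(dataset)
```

Cutting at quartiles gives groups of similar size; fixed distance bands would be an equally reasonable choice if the categories need a direct geographic meaning.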
We start with the simplest possible model: one with no explanatory variables apart from the intercept. A random intercepts model is a model in which intercepts are allowed to vary, so the score on the dependent variable for each individual observation is predicted by an intercept that varies across groups. This model assumes that slopes are fixed (the same across different contexts). In addition, it provides information about intraclass correlations, which are helpful in determining whether multilevel models are required in the first place.[1] For property ‘a’ in LGA ‘b’, the price of the house is given by a fixed term (the overall mean price) plus a random effect (the differential due to LGA ‘b’). The random term represents the deviation of the corresponding LGA from the overall mean. The model is specified as follows:
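Written out in symbols, the random intercepts model described above can be sketched as below. The notation is an assumption introduced here for illustration: y_ab is the (log) price of property a in LGA b, beta_0 is the overall intercept, u_b is the LGA-specific deviation and e_ab is the house-level residual.

```latex
\begin{aligned}
y_{ab} &= \beta_0 + u_b + e_{ab}, \\
u_b &\sim \mathcal{N}(0, \sigma_u^2), \qquad e_{ab} \sim \mathcal{N}(0, \sigma_e^2), \\
\mathrm{ICC} &= \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}.
\end{aligned}
```

In the `lmer` call that follows, the term `(1 | lga)` plays the role of u_b, and the intraclass correlation is the share of the total variance that lies between LGAs.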
library(lme4)
# Null model: overall intercept plus a random intercept for each LGA
lme.1 <- lmer(price ~ 1 + (1 | lga), data = dataset)
summary(lme.1)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: price ~ 1 + (1 | lga)
#> Data: dataset
#>
#> REML criterion at convergence: -54133.8
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -8.5116 -0.5394 -0.0263 0.5206 22.5166
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> lga (Intercept) 0.04754 0.2180
#> Residual 0.02025 0.1423
#> Number of obs: 51300, groups: lga, 42
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 6.06116 0.03366 180.1
dataset$predict <- predict(lme.1)
ranef(lme.1)
#> $lga
#> (Intercept)
#> Bankstown City Council -0.12317721
#> Bayside Council -0.01441420
#> Blue Mountains City Council -0.25011466
#> Burwood Council 0.08220606
#> Camden Council -0.19274736
#> Campbelltown City Council -0.27247246
#> Canterbury City Council -0.10377695
#> Central Coast Council -0.16438787
#> City of Blacktown -0.19727544
#> City of Canada Bay Council 0.17650504
#> City of Lake Macquarie -0.26022948
#> City of Liverpool -0.16988359
#> City of Parramatta Council -0.06951067
#> City of Ryde 0.09805750
#> City of Willoughby 0.28328003
#> Council of the City of Sydney 0.11251804
#> Cumberland Council -0.14661988
#> Fairfield City Council -0.16055613
#> Hawkesbury City Council -0.23928748
#> Hurstville City Council -0.01494982
#> Inner West Council 0.14209136
#> Kogarah City Council 0.08044354
#> Ku-ring-gai Council 0.22122201
#> Lane Cove Municipal Council 0.20201523
#> Mosman Municipal Council 0.45476442
#> North Sydney Council 0.24205529
#> Northern Beaches Council 0.15351306
#> Penrith City Council -0.23159945
#> Randwick City Council 0.22078391
#> Shellharbour City Council -0.23513035
#> Strathfield Municipal Council 0.08265607
#> Sutherland Shire -0.01079302
#> The Council of the Municipality of Hunters Hill 0.33677399
#> The Council of the Municipality of Kiama -0.10731519
#> The Council of the Shire of Hornsby 0.04464278
#> The Hills Shire 0.03621923
#> Waverley Council 0.40065478
#> Wingecarribee Shire Council -0.19302693
#> Wollondilly Shire Council -0.23941970
#> Wollongong City Council -0.18269156
#> Woollahra Municipal Council 0.50857293
#> Wyong Shire Council -0.29959585
#>
#> with conditional variances for "lga"
performance::icc(lme.1)
#> # Intraclass Correlation Coefficient
#>
#> Adjusted ICC: 0.701
#> Conditional ICC: 0.701
We can see that the model produces a fixed intercept of 6.06116. The LGAs fall either above or below this estimated intercept: areas such as Blue Mountains City Council or Campbelltown City Council are below it, while areas such as Mosman Municipal Council or Waverley Council are far above. The intraclass correlation, computed with the icc function from the performance package, is equal to 0.701. This value is very high: it means that about 70% of the total variance is due to differences among LGAs. This is a major result, because it tells us that the large majority of the variability in prices depends on the area in which the house is built. It also indicates the presence of a segmented market, in which house prices vary considerably between LGAs.
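The reported ICC can be checked by hand from the variance components in the summary above. The short base R sketch below copies the two variances from the `lme.1` output and applies the usual definition, LGA variance divided by total variance; the variable names are hypothetical.

```r
# Variance components copied from the summary(lme.1) output.
var_lga   <- 0.04754  # between-LGA variance (random intercept)
var_resid <- 0.02025  # within-LGA (residual) variance

# ICC: share of the total variance that lies between LGAs.
icc_manual <- var_lga / (var_lga + var_resid)
print(round(icc_manual, 3))
```

The result rounds to 0.701, which agrees with the value returned by `performance::icc`.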
At this point, we can introduce another random effect into our analysis. We choose the distance from the CBD. Having transformed the variable from numeric to categorical, we can fit the second model, model2.
lme.2 <- lmer(price ~ (1|lga) + (1|distCBD), data=dataset)
summary(lme.2)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: price ~ (1 | lga) + (1 | distCBD)
#> Data: dataset
#>
#> REML criterion at convergence: -54203.7
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -8.5179 -0.5398 -0.0248 0.5213 22.5333
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> lga (Intercept) 0.0411819 0.20293
#> distCBD (Intercept) 0.0004029 0.02007
#> Residual 0.0202237 0.14221
#> Number of obs: 51300, groups: lga, 42; distCBD, 4
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 6.05321 0.03291 183.9
dataset$predict <- predict(lme.2)
ranef(lme.2)
#> $lga
#> (Intercept)
#> Bankstown City Council -0.13329330
#> Bayside Council -0.03153009
#> Blue Mountains City Council -0.22155028
#> Burwood Council 0.06509832
#> Camden Council -0.17301689
#> Campbelltown City Council -0.25454490
#> Canterbury City Council -0.12088662
#> Central Coast Council -0.14226571
#> City of Blacktown -0.18337066
#> City of Canada Bay Council 0.15937513
#> City of Lake Macquarie -0.23166424
#> City of Liverpool -0.15987240
#> City of Parramatta Council -0.07875728
#> City of Ryde 0.08093476
#> City of Willoughby 0.26613791
#> Council of the City of Sydney 0.09539402
#> Cumberland Council -0.15120692
#> Fairfield City Council -0.15480005
#> Hawkesbury City Council -0.21586601
#> Hurstville City Council -0.03151545
#> Inner West Council 0.12496534
#> Kogarah City Council 0.06332604
#> Ku-ring-gai Council 0.20430735
#> Lane Cove Municipal Council 0.18487025
#> Mosman Municipal Council 0.43752423
#> North Sydney Council 0.22490473
#> Northern Beaches Council 0.14138472
#> Penrith City Council -0.21146850
#> Randwick City Council 0.20365079
#> Shellharbour City Council -0.20656683
#> Strathfield Municipal Council 0.06554096
#> Sutherland Shire -0.01066365
#> The Council of the Municipality of Hunters Hill 0.31953368
#> The Council of the Municipality of Kiama -0.07877190
#> The Council of the Shire of Hornsby 0.04085213
#> The Hills Shire 0.03990841
#> Waverley Council 0.38343669
#> Wingecarribee Shire Council -0.16446863
#> Wollondilly Shire Council -0.21085392
#> Wollongong City Council -0.15451526
#> Woollahra Municipal Council 0.49133611
#> Wyong Shire Council -0.27103207
#>
#> $distCBD
#> (Intercept)
#> low 0.025072078
#> medium 0.005520451
#> high -0.009976939
#> very high -0.020615590
#>
#> with conditional variances for "lga" "distCBD"
We can see that this time the model produces a fixed intercept of 6.05321. We can immediately see that distance from the CBD negatively influences price: the estimated effects decline steadily from 0.025 for ‘low’ to -0.021 for ‘very high’, so the farther a house is from the CBD, the lower its price is likely to be. We could have expected this result, since our previous work had shown a strong correlation between distance from the CBD and mean taxable income. The intraclass correlation, however, computed with the icc function, does not change significantly with respect to that of model1.
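To see how the variance is distributed in this second model, its variance components can be split into shares in the same way as for the first model. This is a sketch in base R with hypothetical variable names, using the variances copied from the `lme.2` summary above; counting both random effects in the numerator of the combined share is an assumption about how the ICC is defined for this model.

```r
# Variance components copied from the summary(lme.2) output.
variances <- c(lga = 0.0411819, distCBD = 0.0004029, residual = 0.0202237)

# Share of the total variance attributed to each component.
shares <- variances / sum(variances)
print(round(shares, 3))

# Combined share of the two grouping factors.
combined <- unname(shares["lga"] + shares["distCBD"])
print(round(combined, 3))
```

The LGA component still accounts for roughly two thirds of the variance, while the distance bands contribute well under 1%, so almost all of the between-group variation remains at the LGA level.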
Until now, our models allowed the group means to vary but contained no predictor whose slope could differ across groups. We therefore decided to fit a multilevel model that also allows the slope of a predictor to vary by group. A random slopes model is a model in which slopes are allowed to vary, so that the slopes differ across groups. As the predictor with a varying slope, we chose the size of the house. In this way, we can identify how prices change across LGAs depending on the size of the house. Moreover, as an exchange student from a small village in the north of Italy, I was impressed on arriving here by the multi-ethnic character of the city of Sydney. For this reason, and also considering what is happening in the world right now, I decided to include the number of overseas migrants in the model. As we can see from this graph, the number of migrants appears to be clustered across postcodes.
This model lets the effect of house size on price vary across LGAs, while all the other variables are treated as fixed effects.
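price = distCBD (fixed) + bed (fixed) + bath (fixed) + housesize (fixed) + type (fixed) + migrants (fixed) + income (fixed) + housesize | LGA (random intercept and slope)
model3 <- lmer(price ~ 1 + distCBD + bed + bath + housesize + type + migrants +
                 income + (housesize | lga), data = dataset)
summary(model3)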
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: price ~ 1 + distCBD + bed + bath + housesize + type + migrants +
#> income + (housesize | lga)
#> Data: dataset
#>
#> REML criterion at convergence: -86087.2
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -10.720 -0.596 -0.062 0.491 32.326
#>
#> Random effects:
#> Groups Name Variance Std.Dev. Corr
#> lga (Intercept) 0.006988 0.08359
#> housesizeMedium 0.002692 0.05188 0.67
#> housesizeBig 0.005660 0.07524 0.70 0.92
#> housesizeVeryBig 0.010220 0.10109 0.67 0.81 0.95
#> Residual 0.009083 0.09530
#> Number of obs: 46569, groups: lga, 39
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 5.571534 0.031374 177.587
#> distCBDmedium -0.053907 0.002005 -26.892
#> distCBDhigh -0.088392 0.002963 -29.828
#> distCBDvery high -0.105051 0.003800 -27.641
#> bedBed2 0.066981 0.004877 13.734
#> bedBed3 0.112681 0.005168 21.805
#> bedMorebed 0.152578 0.005309 28.741
#> bathBath2 0.057633 0.001208 47.707
#> bathBath3 0.126765 0.001721 73.664
#> bathMoreBath 0.215173 0.003198 67.293
#> housesizeMedium 0.065203 0.010128 6.438
#> housesizeBig 0.111266 0.013602 8.180
#> housesizeVeryBig 0.152095 0.017369 8.757
#> typeDuplex 0.082542 0.005329 15.489
#> typeHouse 0.117761 0.003261 36.107
#> typeNew House & Land 0.105287 0.007771 13.549
#> typeSemi-Detached 0.096742 0.005253 18.415
#> typeTerrace 0.132804 0.009995 13.287
#> typeTownhouse 0.051552 0.004017 12.832
#> typeVilla 0.057031 0.004879 11.689
#> migrantsMedium 0.125023 0.037883 3.300
#> migrantsHigh 0.042374 0.039016 1.086
#> migrantsVeryHigh 0.095250 0.040138 2.373
#> incomeMedium 0.049519 0.001623 30.512
#> incomeHigh 0.144946 0.002637 54.957
#> incomeVeryHigh 0.221466 0.007590 29.179
#>
#> Correlation matrix not shown by default, as p = 26 > 12.
#> Use print(x, correlation=TRUE) or
#> vcov(x) if you need it
ranef(model3)
#> $lga
#> (Intercept) housesizeMedium
#> Bankstown City Council -0.054788427 -0.009586885
#> Bayside Council -0.007909152 -0.006916195
#> Blue Mountains City Council -0.022737307 -0.021131315
#> Burwood Council 0.023117982 -0.012825164
#> Camden Council -0.066273244 -0.037792573
#> Campbelltown City Council -0.148643882 -0.060169820
#> Canterbury City Council -0.058697990 0.032579924
#> Central Coast Council -0.097787358 -0.041683264
#> City of Blacktown -0.110005115 -0.032219158
#> City of Canada Bay Council 0.021762674 0.023307025
#> City of Lake Macquarie -0.153295467 -0.085912235
#> City of Liverpool -0.082984479 -0.022196153
#> City of Parramatta Council -0.039957890 -0.037565945
#> City of Ryde 0.049109438 0.016985297
#> City of Willoughby 0.044775428 0.037643731
#> Council of the City of Sydney 0.122609646 0.039120464
#> Cumberland Council -0.082779458 -0.034514878
#> Fairfield City Council -0.046066479 -0.032085926
#> Hawkesbury City Council -0.051722541 -0.020346620
#> Inner West Council 0.055564271 0.048586910
#> Ku-ring-gai Council 0.007543399 -0.002775343
#> Lane Cove Municipal Council 0.069075763 0.048513667
#> Mosman Municipal Council 0.122703805 0.098904534
#> North Sydney Council 0.058340536 0.088200242
#> Northern Beaches Council 0.121141629 0.044950630
#> Penrith City Council -0.075851002 -0.134940540
#> Randwick City Council 0.119907153 0.035742216
#> Shellharbour City Council -0.029942630 -0.017587238
#> Strathfield Municipal Council 0.014930329 -0.014906446
#> Sutherland Shire -0.019275306 -0.012900840
#> The Council of the Municipality of Hunters Hill 0.059089257 0.030037557
#> The Council of the Municipality of Kiama 0.035995779 0.019326917
#> The Council of the Shire of Hornsby 0.039926441 0.003074293
#> The Hills Shire 0.024069717 -0.009095897
#> Waverley Council 0.124632847 0.061675544
#> Wingecarribee Shire Council -0.036661528 -0.021205873
#> Wollondilly Shire Council -0.077049778 -0.044306826
#> Wollongong City Council -0.015717675 -0.007370034
#> Woollahra Municipal Council 0.163850609 0.091386215
#> housesizeBig housesizeVeryBig
#> Bankstown City Council -0.034139879 -0.02810037
#> Bayside Council 0.009003138 0.01166230
#> Blue Mountains City Council -0.066083336 -0.09590685
#> Burwood Council 0.004654611 0.03403993
#> Camden Council -0.049433971 -0.04354714
#> Campbelltown City Council -0.094712160 -0.13500750
#> Canterbury City Council 0.065660246 0.07247355
#> Central Coast Council -0.081167980 -0.10398415
#> City of Blacktown -0.036801706 -0.08835306
#> City of Canada Bay Council 0.040547795 0.07502376
#> City of Lake Macquarie -0.101668741 -0.11231207
#> City of Liverpool -0.030565729 -0.05162184
#> City of Parramatta Council -0.043187705 -0.04455733
#> City of Ryde 0.021454713 0.03127004
#> City of Willoughby 0.041172731 0.04877277
#> Council of the City of Sydney 0.115187053 0.15537331
#> Cumberland Council -0.047008738 -0.04532123
#> Fairfield City Council -0.034965403 -0.06917474
#> Hawkesbury City Council -0.053509081 -0.07403404
#> Inner West Council 0.051629535 0.05178405
#> Ku-ring-gai Council -0.026480189 -0.04624038
#> Lane Cove Municipal Council 0.088691866 0.11257132
#> Mosman Municipal Council 0.156916641 0.21412810
#> North Sydney Council 0.115682035 0.15859821
#> Northern Beaches Council 0.037237741 -0.02873328
#> Penrith City Council -0.159364596 -0.18000069
#> Randwick City Council 0.046517538 0.04204069
#> Shellharbour City Council -0.051593870 -0.09336604
#> Strathfield Municipal Council 0.063500409 0.17784640
#> Sutherland Shire -0.044107610 -0.07194487
#> The Council of the Municipality of Hunters Hill 0.072425044 0.12621586
#> The Council of the Municipality of Kiama -0.008276877 -0.04498806
#> The Council of the Shire of Hornsby -0.019352418 -0.03567197
#> The Hills Shire -0.036899182 -0.01485598
#> Waverley Council 0.098749628 0.13637785
#> Wingecarribee Shire Council -0.040399847 -0.05456099
#> Wollondilly Shire Council -0.063320889 -0.05632209
#> Wollongong City Council -0.045762138 -0.09925277
#> Woollahra Municipal Council 0.139771314 0.16967934
#>
#> with conditional variances for "lga"
The results of this model are not easy to read. Consider first the fixed part of the model. Higher average taxable income, more beds, more baths and a larger house all have a positive effect on price. Among the house types, ‘Terrace’ and ‘House’ have the largest coefficients (0.133 and 0.118). The effect of the variable ‘migrants’ is harder to interpret: the coefficients do not change monotonically with the number of migrants, so there is no clear pattern between price and migration. This is a worthwhile point for further analysis. As for the random part, the results are fascinating. There are some LGAs in which the price is consistently above or below the others, regardless of the size of the house. This is further evidence that the market in Sydney is divided into submarkets: depending on the area, prices in Sydney vary considerably.
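The second research question, how the effect of house size varies by LGA, can be read off by adding each LGA's random slope to the fixed slope. The base R sketch below does this for the `housesizeVeryBig` coefficient in three LGAs, with the numbers copied from the output above; the object names are hypothetical. With the fitted model available, `coef(model3)$lga` from lme4 returns these combined coefficients directly.

```r
# Fixed slope for housesizeVeryBig from the summary(model3) output.
fixed_verybig <- 0.152095

# LGA-specific deviations for housesizeVeryBig from ranef(model3).
random_verybig <- c(
  "Mosman Municipal Council" = 0.21412810,
  "Bayside Council"          = 0.01166230,
  "Penrith City Council"     = -0.18000069
)

# Total effect of a very big house (vs. the reference size) in each LGA.
lga_slope <- fixed_verybig + random_verybig
print(round(lga_slope, 3))
```

On the log-price scale, the premium for a very big house is about 0.37 in Mosman but slightly negative (about -0.03) in Penrith, so the size effect differs markedly between LGAs.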
During the group work this semester, we analyzed the housing market in Sydney. In the first part, we found that the variables that most affect house prices are the mean taxable income and the distance from the CBD. With this project, we wanted to extend that result by analyzing the hierarchical structure of the dataset. The results are striking. Through the first model, we found that about 70% of the variance in prices can be explained by differences among LGAs. We then extended model1 by adding distance from the CBD as a second grouping factor in model2, and found that the market is clearly segmented at more than one level (submarkets). In the last part, we allowed the slope for house size to vary across LGAs. Again, we found that the market has many clusters, with some LGAs consistently above the average. With this analysis, we wanted to demonstrate the existence of a hierarchical structure; the approach can be extended to more levels. We think these results help us understand the Sydney housing market a little better. In my opinion, the location of a house seems to be more important than the characteristics of the house itself. This is a strong conclusion, because it indicates that significant differences among LGAs account for much of the variability in prices.
Transformation of numerical variables:
- https://en.wikipedia.org/wiki/Multilevel_model
- http://www.statstutor.ac.uk/resources/uploaded/multilevelmodelling.pdf
- Kristy Kitto - Multilevel Models - https://canvas.uts.edu.au/courses/14754/files/399449?fd_cookie_set=1
- Harrison, X. A., Donaldson, L., Correa-Cano, M. E., Evans, J., Fisher, D. N., Goodwin, C. E., … & Inger, R. (2018). A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ, 6, e4794.
- https://statmodeling.stat.columbia.edu/2005/01/25/why_i_dont_use/
- https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf