In this assignment we will develop a model to predict housing prices for five neighborhoods in Reykjavík. First we split the dataset into a training set and a test set. Then we develop a model on the training set before measuring its effectiveness on the test set.
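The split described above can be sketched in a few lines. This is a minimal illustration, not the code used for the report: the 25% test fraction, the seed, and the helper name `train_test_split` are assumptions, and the list of row indices stands in for the actual property records.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for the dataset: 20 row indices.
rows = list(range(20))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so the test set stays untouched while the model is developed on the training rows.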
Housing prices show a significant positive skew, so we create a new dependent variable containing the log-transformed housing prices. We also examined the distribution of floor area (ibm2) and decided to log-transform that variable as well.
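To illustrate why a log transform helps with positive skew, the sketch below computes the sample skewness of a small made-up price vector before and after taking logs. The numbers are hypothetical and chosen only to have a long right tail; they are not taken from the dataset.

```python
import math

def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical prices (illustration only) with a long right tail.
prices = [18, 20, 22, 25, 27, 30, 35, 42, 60, 120]
log_prices = [math.log(p) for p in prices]

print(round(skewness(prices), 2), round(skewness(log_prices), 2))
```

The log compresses the large values more than the small ones, which pulls the right tail in and lowers the skewness.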
First we make predictor plots to visualize how the covariates in our dataset vary with housing prices.
There are several ordinal predictors counting the number of rooms, bathtubs, etc. While they are technically not continuous, we treat them as such in this analysis. There seems to be a positive linear relationship between the log of floor area (log_ibm2) and the log of price, log(núvirði). There is also a weaker positive linear relationship between log price and days from purchase date (kdagur). It is worth mentioning that there seems to be a positive relationship between price and the number of rooms, showers, and toilets a house has. That is most likely connected to the size of the building, so these counts could be acting as proxies for how big the house is.
The variance in price seems to be highest in the neighborhood of Þingholt. Among the types of housing, semi-detached houses have the largest variance in price. The small sample size of semi-detached houses could be causing this effect.
Since many of the ordinal variables count similar things (rooms, kitchens, bedrooms, etc.), including all of them would worsen the conditioning of the least squares problem. We analyzed correlations between predictors and found that many of them are collinear. With this in mind, we decided to use as few of the collinear predictors as we could while maintaining acceptable model quality.
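A small sketch of the kind of collinearity check described above: the Pearson correlation between two count predictors and the variance inflation factor it implies. The `rooms` and `bedrooms` vectors are hypothetical, not values from the dataset. The formula VIF = 1 / (1 - r²) holds for the two-predictor case; with more predictors, r² is replaced by the R² from regressing one predictor on all the others.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(r):
    """Variance inflation factor when exactly two predictors have correlation r."""
    return 1.0 / (1.0 - r ** 2)

# Hypothetical counts for eight properties (illustration only).
rooms = [2, 3, 3, 4, 4, 5, 5, 6]
bedrooms = [1, 2, 2, 3, 2, 3, 4, 4]

r = pearson(rooms, bedrooms)
print(round(r, 3), round(vif_two_predictors(r), 2))
```

A large VIF means the coefficient variance is inflated by that factor relative to uncorrelated predictors, which is the practical cost of keeping both variables in the model.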
To avoid overfitting the training dataset, we analyze the effect of each predictor with the R command drop1, with k set to log(n), where n is the number of data points. With this penalty the criterion is the BIC (Bayesian information criterion), which punishes unnecessary predictors more severely than the AIC (Akaike information criterion), whose penalty is k = 2.
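The effect of setting k to log(n) can be sketched outside R. The Python snippet below is an illustration, not the code used for this report. It computes the change in BIC from two residual sums of squares, using the n·log(RSS/n) + k·edf criterion that R's extractAIC reports for linear models. The RSS values are copied from the first drop1 table in this report. The sample size n = 448 is an assumption inferred from the reported F values; it is not stated in the text.

```python
import math

def delta_bic(rss_full, rss_reduced, df_dropped, n):
    """Change in BIC when a term with `df_dropped` parameters is removed
    from a Gaussian linear model fitted to n observations.
    Positive values mean the reduced model is worse."""
    return n * math.log(rss_reduced / rss_full) - df_dropped * math.log(n)

N_ASSUMED = 448  # assumption: inferred from the F values, not given in the report

# RSS of the full model and of the model without kdagur (first drop1 table).
change = delta_bic(rss_full=11.51220, rss_reduced=17.83372,
                   df_dropped=1, n=N_ASSUMED)

# Per-parameter penalties: AIC uses 2, BIC uses log(n).
print(round(change, 2), 2, round(math.log(N_ASSUMED), 2))
```

Under the assumed n, the result is close to 190, in line with the BIC column entry for kdagur. The per-parameter penalty log(448) is roughly 6.1, about three times the AIC penalty of 2.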
| term | Df | Sum of Sq | RSS | BIC | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| <none> | | | 11.51220 | 0.00000 | | |
| teg_eign | 3 | 1.767157 | 13.27936 | 45.66152 | 22.41143 | 0 |
| matssvaedi | 4 | 4.156626 | 15.66883 | 113.68386 | 39.53635 | 0 |
| kdagur | 1 | 6.321519 | 17.83372 | 189.97743 | 240.51219 | 0 |
| log_ibm2 | 1 | 23.265127 | 34.77733 | 489.18523 | 885.15855 | 0 |
The table above shows that dropping any of the predictors increases the BIC, so all of them should stay in the model. Next we investigate whether the slope for log_ibm2 differs between apartments and houses.
| term | Df | Sum of Sq | RSS | BIC | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| <none> | | | 11.16251 | 0.000000 | | |
| log_ibm2:ibud | 1 | 0.3496955 | 11.51220 | 7.714656 | 13.69020 | 0.0002431 |
| teg_eign | 3 | 0.9681800 | 12.13069 | 18.949262 | 12.63440 | 0.0000001 |
| matssvaedi | 4 | 4.0509070 | 15.21342 | 114.289215 | 39.64715 | 0.0000000 |
| kdagur | 1 | 6.2770974 | 17.43961 | 193.785246 | 245.74150 | 0.0000000 |
The table above shows that the interaction effect should probably be kept in the model, but the plot below shows that the difference in slopes is very small and might be explained solely by the fact that apartments are smaller than houses.
Next we examine whether the age of a property has an effect on its price. The plot below shows that the areas were built at very different times. We therefore examine the main effect of age as well as the interaction between age and area.
| term | Df | Sum of Sq | RSS | BIC | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| <none> | | | 9.464248 | 0.000000 | | |
| log_ibm2:ibud | 1 | 0.2846354 | 9.748883 | 7.170086 | 12.992316 | 0.0003492 |
| teg_eign | 3 | 0.6594986 | 10.123746 | 11.864012 | 10.034373 | 0.0000021 |
| matssvaedi:aldur | 4 | 0.8375043 | 10.301752 | 13.567945 | 9.557068 | 0.0000002 |
| kdagur | 1 | 6.5915776 | 16.055825 | 230.685798 | 300.875630 | 0.0000000 |
The table above shows that the interaction effect is significant and adds predictive power to the model, but the figure below shows mixed results. In three areas prices are roughly flat across age, and in two the effect is slightly negative. The negative effect may simply be a proxy for the different neighborhoods within those areas.
| set | name | r2 | r2_adj |
|---|---|---|---|
| test | Simple | 0.903 | 0.903 |
| test | ibm2:appartement | 0.907 | 0.906 |
| test | matssvaedi:aldur | 0.906 | 0.906 |
| train | Simple | 0.914 | 0.912 |
| train | ibm2:appartement | 0.919 | 0.917 |
| train | matssvaedi:aldur | 0.927 | 0.924 |
The model fit improved only slightly when we introduced the interaction effects or the effect of property age: on the test set, R² rose from 0.903 to at most 0.907. Common advice is to choose the simplest model when in doubt. Thus we conclude that the simple model without interaction effects is the best choice, and we expect it to predict well out of sample.
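The small gap between r2 and r2_adj in the table can be checked with the usual adjustment formula. The sketch below is illustrative and rests on two assumptions not stated in the report: a training sample size of n = 448 (inferred from the F values in the drop1 tables) and p = 9 slope coefficients for the simple model (counted from its fitted formula).

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p slope coefficients (intercept excluded)
    fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

N_TRAIN_ASSUMED = 448   # assumption: inferred, not stated in the report
P_SIMPLE = 9            # assumption: slope terms counted from the fitted formula

simple_train_adj = adjusted_r2(0.914, N_TRAIN_ASSUMED, P_SIMPLE)
print(round(simple_train_adj, 3))
```

With n this large relative to p, the adjustment barely moves the value, which is why the two columns are nearly identical.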
Based on this report we conclude that the simple model is the better choice. Its fitted formula is:
\[ \begin{aligned} \mathrm{log(Núvirði)} &= 3.28 \\ &+ 0.73 \cdot \mathrm{log(ibm2)} \\ &+ 0.00024 \cdot \mathrm{kdagur^*} \\ &- 0.26 \cdot \mathbb I_{\text{teg_eign = Íbúðareign}} \\ &- 0.05 \cdot \mathbb I_{\text{teg_eign = Parhús}} \\ &- 0.18 \cdot \mathbb I_{\text{teg_eign = Raðhús}} \\ &+ 0.11 \cdot \mathbb I_{\text{Matssvæði = Hlíðar}} \\ &+ 0.36 \cdot \mathbb I_{\text{Matssvæði = Miðbær (Suður Þingholt)}} \\ &- 0.12 \cdot \mathbb I_{\text{Matssvæði = Seljahverfi}} \\ &+ 0.15 \cdot \mathbb I_{\text{Matssvæði = Vesturbær (Vestan Bræðraborgarstígs)}} \end{aligned} \]
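Because the response is on the log scale, the coefficients are easiest to read as multiplicative effects on price. The sketch below converts the fitted coefficients above into price ratios. It is an interpretation aid rather than part of the original analysis; the names `SIZE_ELASTICITY`, `INDICATOR_COEFS`, and `price_ratio` are introduced here for illustration. The reference categories for teg_eign and Matssvæði are not named in the report, so they are referred to generically.

```python
import math

# Coefficients copied from the fitted formula (log-price scale).
SIZE_ELASTICITY = 0.73  # coefficient of log(ibm2)
INDICATOR_COEFS = {
    "teg_eign = Íbúðareign": -0.26,
    "teg_eign = Parhús": -0.05,
    "teg_eign = Raðhús": -0.18,
    "Matssvæði = Hlíðar": 0.11,
    "Matssvæði = Miðbær (Suður Þingholt)": 0.36,
    "Matssvæði = Seljahverfi": -0.12,
    "Matssvæði = Vesturbær (Vestan Bræðraborgarstígs)": 0.15,
}

def price_ratio(coef):
    """Multiplicative effect on price of an indicator with a log-scale coefficient."""
    return math.exp(coef)

# Doubling the floor area multiplies the predicted price by 2 ** 0.73.
doubling_factor = 2 ** SIZE_ELASTICITY
print(f"doubling ibm2: x{doubling_factor:.2f}")

for name, coef in INDICATOR_COEFS.items():
    print(f"{name}: x{price_ratio(coef):.2f} relative to the reference category")
```

For example, exp(0.36) is about 1.43, so with everything else equal the model predicts a price roughly 43% higher in Miðbær (Suður Þingholt) than in the reference area. Doubling the floor area raises the predicted price by about 66% rather than 100%.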