Executive summary

In this assignment we will develop a model to predict housing prices for five neighborhoods in Reykjavík. First we split the dataset into a training set and a test set. Then we develop a model on the training set before measuring its effectiveness on the test set.
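As an illustration of the split described above, here is a minimal Python sketch, independent of the R code used for the analysis. The report does not state the split proportion or how rows were assigned, so the 25% test share, the fixed seed and the function name are assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Randomly assign rows to a training part and a test part.

    ASSUMPTION: test_fraction and seed are illustrative defaults,
    not values taken from the report.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(len(rows)))
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Stand-in for the rows of the housing dataset.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed matters here because every later model comparison depends on which rows end up in the test set.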

Exploratory analysis

Transforming the dependent variable

Housing prices show a pronounced positive skew, so we create a new dependent variable containing the log-transformed housing prices. We also examined the distribution of floor area (ibm2) and decided to log-transform that variable as well.
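The effect of the log transform on skew can be sketched in a few lines of Python. The prices below are made up for illustration and are not taken from the dataset; the skewness measure is the standardized third moment.

```python
import math


def skewness(values):
    """Standardized third central moment (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# ASSUMPTION: made-up, right-skewed prices (a few expensive properties).
prices = [18, 21, 22, 24, 25, 27, 29, 31, 35, 42, 55, 90]
log_prices = [math.log(p) for p in prices]

print(f"skewness of prices:     {skewness(prices):.2f}")
print(f"skewness of log prices: {skewness(log_prices):.2f}")
```

The log compresses the long right tail, so the transformed values are noticeably less skewed, although not necessarily symmetric.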

Predictor analysis

First we make predictor plots to visualize how the covariates in our dataset vary with housing prices.

(Pseudo)Continuous predictors

There are several ordinal predictors counting the number of rooms, bathtubs, etc. While they are technically not continuous, we treat them as such in this analysis. There seems to be a positive linear relationship between the log of floor area (log_ibm2) and the log of price, log(núvirði). There is also a weaker positive linear relationship between the log of price and days from buy date (kdagur). It is worth mentioning that there seems to be a positive relationship between price and the number of rooms, showers and toilets in a house. This is most likely connected to the size of the building, so these counts may act as proxies for how big the house is.

Discrete predictors

The variance in price seems to be highest in the neighborhood of Þingholt. Among housing types, semi-detached houses have the largest variance in price, although their small sample size could be causing this effect.

Multicollinearity

Since many of the ordinal variables count similar things (rooms, kitchens, bedrooms, etc.), including all of them would worsen the conditioning of the least squares problem and make the coefficient estimates unstable. We analyzed correlations between predictors and found that many of them are collinear. With this in mind we decided to use as few of the collinear predictors as we could while maintaining acceptable model quality.
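A correlation screen of this kind might look like the Python sketch below. The column names, the counts and the 0.8 threshold are all hypothetical and chosen only to show the mechanics; they are not the dataset's variables or the cutoff used in the analysis.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def collinear_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# ASSUMPTION: made-up counts for eight properties.
columns = {
    "rooms":    [2, 3, 3, 4, 4, 5, 6, 7],
    "bedrooms": [1, 2, 2, 3, 3, 4, 4, 5],
    "bathtubs": [1, 1, 0, 1, 1, 0, 1, 1],
}

for name_a, name_b, r in collinear_pairs(columns):
    print(f"{name_a} ~ {name_b}: r = {r:.3f}")
```

In this toy data only the rooms and bedrooms pair is flagged, which mirrors the argument above: two counts that measure nearly the same thing add little information but destabilize the fit, so one of them can be dropped.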

Fitting a model

Finding the optimal model on the training set

To avoid overfitting the training dataset, we assess the effect of each predictor with the R command drop1, with k set to log(n), where n is the number of data points. With this penalty the criterion is the BIC (Bayesian information criterion), which penalizes unnecessary predictors more heavily than the AIC (Akaike information criterion).
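For a linear model, the criterion that drop1 compares can be written, up to an additive constant, as follows. Here p denotes the number of estimated coefficients and RSS the residual sum of squares; the name IC is introduced only for this illustration.

\[ \mathrm{IC}(k) = n \log\left(\frac{\mathrm{RSS}}{n}\right) + k \cdot p \]

Setting k = 2 gives the AIC and k = log(n) gives the BIC. Since log(n) exceeds 2 as soon as n is at least 8, each additional coefficient costs more under the BIC than under the AIC for any realistic sample size.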

Table 1: drop1 table for first model
term        Df  Sum of Sq       RSS        BIC    F value  Pr(>F)
<none>                     11.51220    0.00000
teg_eign     3   1.767157  13.27936   45.66152   22.41143       0
matssvaedi   4   4.156626  15.66883  113.68386   39.53635       0
kdagur       1   6.321519  17.83372  189.97743  240.51219       0
log_ibm2     1  23.265127  34.77733  489.18523  885.15855       0

Table 1 shows that dropping any predictor raises the BIC, so all predictors should stay in the model. Next we investigate whether the slope for log_ibm2 differs between apartments and houses.
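The BIC and F value columns of Table 1 can be reproduced from the RSS column alone, which is a useful sanity check when reading such a table. The Python sketch below assumes n = 448 training rows and 10 coefficients in the full model (intercept, 3 for teg_eign, 4 for matssvaedi, kdagur, log_ibm2). The report does not state n; 448 is inferred here because it is the value that makes the F column consistent with the sums of squares.

```python
import math


def drop1_row(n, n_params_full, rss_full, rss_dropped, df):
    """Return (delta_bic, f_value) for removing a term with df coefficients."""
    # Change in n*log(RSS/n) + log(n)*p: the fit term grows because RSS
    # grows, while the penalty shrinks by df*log(n).
    delta_bic = n * math.log(rss_dropped / rss_full) - df * math.log(n)
    # Partial F statistic against the full model's residual variance.
    resid_df = n - n_params_full
    f_value = ((rss_dropped - rss_full) / df) / (rss_full / resid_df)
    return delta_bic, f_value


N_TRAIN = 448        # ASSUMPTION: inferred from Table 1, not stated in the report
N_PARAMS_FULL = 10   # intercept + 3 + 4 + 1 + 1 coefficients
RSS_FULL = 11.51220  # <none> row of Table 1

# term -> (Df, RSS after dropping the term), copied from Table 1
table1 = {
    "teg_eign": (3, 13.27936),
    "matssvaedi": (4, 15.66883),
    "kdagur": (1, 17.83372),
    "log_ibm2": (1, 34.77733),
}

for term, (df, rss) in table1.items():
    delta_bic, f_value = drop1_row(N_TRAIN, N_PARAMS_FULL, RSS_FULL, rss, df)
    print(f"{term:<12} delta BIC = {delta_bic:8.2f}   F = {f_value:8.2f}")
```

A positive delta BIC means the model gets worse when the term is removed, even after crediting the simpler model for having fewer coefficients.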

Table 2: drop1 table for second model
term           Df  Sum of Sq       RSS         BIC    F value     Pr(>F)
<none>                        11.16251    0.000000
log_ibm2:ibud   1  0.3496955  11.51220    7.714656   13.69020  0.0002431
teg_eign        3  0.9681800  12.13069   18.949262   12.63440  0.0000001
matssvaedi      4  4.0509070  15.21342  114.289215   39.64715  0.0000000
kdagur          1  6.2770974  17.43961  193.785246  245.74150  0.0000000

Table 2 shows that the interaction effect should probably be kept in the model, but the plot below shows that the difference in slopes is very small and might be explained solely by the fact that apartments are smaller than houses.
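The interaction term tested in Table 2 can be read as a slope adjustment. Assuming that ibud is a 0/1 indicator for apartments (the report does not define it), and writing β and γ for two coefficients named only for this illustration, the size-related part of the model is:

\[ \log(\text{núvirði}) = \dots + \left(\beta + \gamma \cdot \mathbb I_{\text{ibud}}\right) \cdot \log(\text{ibm2}) \]

Houses then have slope β and apartments have slope β + γ, so a γ close to zero corresponds to the nearly parallel lines seen in the plot.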

Next we examine whether the age of a property has an effect on its price. The plot below shows that the areas were built at very different times. We therefore examine the main effect of age as well as the interaction between age and area.

Table 3: drop1 table for third model
term              Df  Sum of Sq        RSS         BIC     F value     Pr(>F)
<none>                            9.464248    0.000000
log_ibm2:ibud      1  0.2846354   9.748883    7.170086   12.992316  0.0003492
teg_eign           3  0.6594986  10.123746   11.864012   10.034373  0.0000021
matssvaedi:aldur   4  0.8375043  10.301752   13.567945    9.557068  0.0000002
kdagur             1  6.5915776  16.055825  230.685798  300.875630  0.0000000

Table 3 shows that the interaction effect is significant and adds predictive power to the model, but the figure below shows mixed results. In three areas prices are roughly constant across age, and in two areas the effect is slightly negative. The negative effect might simply be a proxy for the different neighborhoods within those areas.

Predicting on the test set

Table 4: Summary table of model quality
set    name              r2     r2_adj
test   Simple            0.903  0.903
test   ibm2:appartement  0.907  0.906
test   matssvaedi:aldur  0.906  0.906
train  Simple            0.914  0.912
train  ibm2:appartement  0.919  0.917
train  matssvaedi:aldur  0.927  0.924

Conclusion

The model fit did not improve much when we introduced interaction effects or the effect of property age: on the test set, r2 rises only from 0.903 to at most 0.907 (Table 4). Common advice is to choose the simplest model when in doubt. We therefore conclude that the simple model without interaction effects is the best choice, and we expect it to predict at least as reliably out of sample.

The fitted formula of the simple model is:

\[ \begin{aligned}
\mathrm{log(Núvirði)} &= 3.28 \\
&+ 0.73 \cdot \mathrm{log(ibm2)} \\
&+ 0.00024 \cdot \mathrm{kdagur^*} \\
&- 0.26 \cdot \mathbb I_{\text{teg_eign = Íbúðareign}} \\
&- 0.05 \cdot \mathbb I_{\text{teg_eign = Parhús}} \\
&- 0.18 \cdot \mathbb I_{\text{teg_eign = Raðhús}} \\
&+ 0.11 \cdot \mathbb I_{\text{Matssvæði = Hlíðar}} \\
&+ 0.36 \cdot \mathbb I_{\text{Matssvæði = Miðbær (Suður Þingholt)}} \\
&- 0.12 \cdot \mathbb I_{\text{Matssvæði = Seljahverfi}} \\
&+ 0.15 \cdot \mathbb I_{\text{Matssvæði = Vesturbær (Vestan Bræðraborgarstígs)}}
\end{aligned} \]
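To show how the formula is used, the Python sketch below evaluates it for one hypothetical property: an Íbúðareign in Hlíðar with ibm2 = 100. The report does not define the transformation behind kdagur* or the unit of núvirði, so the example sets kdagur* to 0 and reports the result only on the scale of the model. Passing None for a factor selects its reference level, whose effect is absorbed by the intercept.

```python
import math

# Coefficients copied from the fitted formula above.
INTERCEPT = 3.28
B_LOG_IBM2 = 0.73
B_KDAGUR = 0.00024
TEG_EIGN = {"Íbúðareign": -0.26, "Parhús": -0.05, "Raðhús": -0.18}
MATSSVAEDI = {
    "Hlíðar": 0.11,
    "Miðbær (Suður Þingholt)": 0.36,
    "Seljahverfi": -0.12,
    "Vesturbær (Vestan Bræðraborgarstígs)": 0.15,
}


def predict_log_price(ibm2, kdagur_star, teg_eign=None, matssvaedi=None):
    """Evaluate the simple model; None means the factor's reference level."""
    type_effect = 0.0 if teg_eign is None else TEG_EIGN[teg_eign]
    area_effect = 0.0 if matssvaedi is None else MATSSVAEDI[matssvaedi]
    return (INTERCEPT
            + B_LOG_IBM2 * math.log(ibm2)
            + B_KDAGUR * kdagur_star
            + type_effect
            + area_effect)


# ASSUMPTION: hypothetical property, with kdagur* fixed at 0.
log_price = predict_log_price(100, 0, "Íbúðareign", "Hlíðar")
print(f"predicted log price: {log_price:.3f}")
print(f"predicted price:     {math.exp(log_price):.1f}")

# Because both price and size are logged, coefficients act multiplicatively.
print(f"doubling ibm2 multiplies price by {2 ** B_LOG_IBM2:.2f}")
print(f"Íbúðareign vs reference type:     {math.exp(TEG_EIGN['Íbúðareign']):.2f}")
```

The last two lines illustrate how to read a log-log model: the coefficient 0.73 is an elasticity, so doubling the floor area raises the predicted price by roughly two thirds rather than doubling it, and each indicator coefficient translates into a fixed percentage shift relative to the reference level.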