Executive summary

In this assignment we will develop a model to predict housing prices for five neighborhoods in Reykjavík. First we split the dataset into a training set and a test set. Then we develop a model on the training set before measuring its effectiveness on the test set.

Exploratory analysis

Transforming the dependent variable

Housing prices show a strong positive skew, so we create a new dependent variable containing the log-transformed housing prices. We also examined the distribution of floor area (ibm2) and decided to log-transform that variable as well (log_ibm2).
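As a minimal illustration of why the log transform helps, the Python sketch below (the report's own analysis was done in R) computes the sample skewness of a small set of prices before and after taking logs. The prices are made up for illustration and are not taken from the dataset; the helper name skewness is likewise hypothetical.

```python
import math

def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical, right-skewed prices (illustrative only, not from the dataset).
prices = [20, 22, 25, 27, 30, 35, 40, 55, 80, 150]
log_prices = [math.log(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(f"skewness of prices:     {skew_raw:.2f}")
print(f"skewness of log prices: {skew_log:.2f}")
```

For data with a long right tail like this, the skewness of the logged values is smaller than that of the raw values, which is the motivation for modelling log prices.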

Predictor analysis

First we make predictor plots to visualize how the covariates in our dataset vary with housing prices.

(Pseudo)Continuous predictors

There are several ordinal predictors counting the number of rooms, bathtubs, etc. While they are technically not continuous, we treat them as such in this analysis. There seems to be a positive linear relationship between the log of floor area (log_ibm2) and the log of price, log(núvirði). There is also a weaker positive linear relationship between the purchase date counted in days (kdagur) and the log of price. It is worth mentioning that there seems to be a positive relationship between price and the number of rooms, showers and toilets a house has. This is most likely connected to the size of the building, so these counts could be acting as proxies for how big the house is.

Discrete predictors

The variance in price seems to be highest in the neighborhood of Þingholt. Among the types of housing, semi-detached houses have the largest variance in price, although the small sample size of semi-detached houses could be causing this effect.

Multicollinearity

Since many of the ordinal variables count similar things (rooms, kitchens, bedrooms, etc.), including all of them would worsen the conditioning of the least squares problem and make the coefficient estimates unstable. We analyzed the correlations between predictors and found that many of them are strongly correlated. With this in mind we decided to use as few of the collinear predictors as we could while maintaining acceptable model quality.
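The kind of check described above can be sketched with a plain Pearson correlation between two count predictors. The Python snippet below is only an illustration (the report's analysis was done in R); the room and bedroom counts are made up and the function name pearson is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical counts for six properties (illustrative only, not from the dataset).
rooms = [2, 3, 3, 4, 5, 6]
bedrooms = [1, 2, 1, 3, 3, 4]

r = pearson(rooms, bedrooms)
print(f"correlation between rooms and bedrooms: {r:.2f}")
```

A correlation close to 1 between two such predictors suggests that they carry nearly the same information, so keeping only one of them loses little while improving the conditioning of the fit.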

Fitting a model

Finding the optimal model on the training set

To avoid overfitting the training dataset, we assess the contribution of each predictor with the R command drop1, with k set to log(n), where n is the number of data points. This choice of k corresponds to the BIC (Bayesian information criterion), which penalizes unnecessary predictors more heavily than the AIC (Akaike information criterion).
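As a rough sketch of what this criterion measures: for a Gaussian linear model it equals n·log(RSS/n) + k·p up to a constant, where p is the number of coefficients, so removing a term with Df parameters changes it by n·log(RSS_reduced/RSS_full) − k·Df. The Python snippet below (the report itself used R) applies this to the teg_eign row of Table 1. The value n = 448 is not stated in the report; it is an assumption inferred from the F values in Table 1, which imply 438 residual degrees of freedom for a model with 10 coefficients. The function name delta_criterion is hypothetical.

```python
import math

def delta_criterion(rss_full, rss_reduced, df_dropped, n, k):
    """Change in n*log(RSS/n) + k*p when a term with df_dropped parameters is removed."""
    return n * math.log(rss_reduced / rss_full) - k * df_dropped

n = 448  # assumption: inferred from Table 1, not stated in the report

rss_full = 11.51220              # RSS of the full model (<none> row of Table 1)
rss_without_teg_eign = 13.27936  # RSS after dropping teg_eign (Df = 3)

delta_bic = delta_criterion(rss_full, rss_without_teg_eign, 3, n, k=math.log(n))
delta_aic = delta_criterion(rss_full, rss_without_teg_eign, 3, n, k=2)

print(f"BIC change when dropping teg_eign: {delta_bic:.2f}")
print(f"AIC change when dropping teg_eign: {delta_aic:.2f}")
```

Under these assumptions the BIC change comes out at about 45.66, in line with the teg_eign row of Table 1. The AIC change is larger because AIC gives less credit for the three parameters saved, which is another way of saying that BIC is stricter about keeping predictors.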

Table 1: drop1 table for first model
term	Df	Sum of Sq	RSS	BIC	F value	Pr(>F)
<none>			11.51220	0.00000
teg_eign	3	1.767157	13.27936	45.66152	22.41143	0
matssvaedi	4	4.156626	15.66883	113.68386	39.53635	0
kdagur	1	6.321519	17.83372	189.97743	240.51219	0
log_ibm2	1	23.265127	34.77733	489.18523	885.15855	0

The table above shows that dropping any predictor increases the BIC, so all predictors should stay in the model. Next we investigate whether the coefficient for log_ibm2 differs between apartments and houses.

Table 2: drop1 table for second model
term	Df	Sum of Sq	RSS	BIC	F value	Pr(>F)
<none>			11.16251	0.000000
log_ibm2:ibud	1	0.3496955	11.51220	7.714656	13.69020	0.0002431
teg_eign	3	0.9681800	12.13069	18.949262	12.63440	0.0000001
matssvaedi	4	4.0509070	15.21342	114.289215	39.64715	0.0000000
kdagur	1	6.2770974	17.43961	193.785246	245.74150	0.0000000

The table above shows that the interaction effect should probably be kept in the model, but the plot below shows that the difference in slopes is very small and might be explained solely by the fact that apartments are smaller than houses.

Next we examine whether the age of a property has an effect on its price. The plot below shows that the areas were built at very different times. Thus we decide to examine the main effect of age as well as the interaction between age and area.

Table 3: drop1 table for third model
term	Df	Sum of Sq	RSS	BIC	F value	Pr(>F)
<none>			9.464248	0.000000
log_ibm2:ibud	1	0.2846354	9.748883	7.170086	12.992316	0.0003492
teg_eign	3	0.6594986	10.123746	11.864012	10.034373	0.0000021
matssvaedi:aldur	4	0.8375043	10.301752	13.567945	9.557068	0.0000002
kdagur	1	6.5915776	16.055825	230.685798	300.875630	0.0000000

The table above shows that the interaction effect is significant and provides the model with predictive power, but the figure below shows mixed results. In three areas prices are roughly constant across property age and in two the effect is slightly negative. The negative effect might just be a proxy for the different neighborhoods within the areas.

Predicting on the test set

Table 4: Summary table of model quality
set	name	r2	r2_adj
test	Simple	0.903	0.903
test	ibm2:apartment	0.907	0.906
test	matssvaedi:aldur	0.906	0.906
train	Simple	0.914	0.912
train	ibm2:apartment	0.919	0.917
train	matssvaedi:aldur	0.927	0.924
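The adjusted values in Table 4 can be related to the plain R² through the usual formula r2_adj = 1 − (1 − r2)(n − 1)/(n − p − 1), where p is the number of coefficients excluding the intercept. The Python sketch below (the report itself used R) applies it to the training rows. The training sample size n = 448 and the coefficient counts p = 9, 10 and 15 are assumptions inferred from the residual degrees of freedom implied by the drop1 tables; they are not stated in the report.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a linear model with p coefficients besides the intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n_train = 448  # assumption: inferred from the drop1 tables, not stated in the report

# (model name, training R-squared from Table 4, assumed number of coefficients p)
train_rows = [
    ("Simple", 0.914, 9),
    ("log_ibm2:ibud interaction", 0.919, 10),
    ("matssvaedi:aldur interaction", 0.927, 15),
]

adjusted = {name: adjusted_r2(r2, n_train, p) for name, r2, p in train_rows}
for name, value in adjusted.items():
    print(f"{name}: adjusted R2 = {value:.3f}")
```

Under these assumptions the results round to the training values of r2_adj listed in Table 4, which illustrates how little the penalty for extra coefficients matters at this sample size.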

Conclusion

The model fit did not improve much when we introduced interaction effects or the effect of property age: on the test set R² only rises from 0.903 to at most 0.907. Common advice is to choose the simplest model when in doubt. Thus we conclude that the simple model without interaction effects is the best choice, and we expect it to predict out-of-sample about as reliably as the more complex models.

The fitted formula of the simple model is:

\[ \begin{aligned} \mathrm{log(Núvirði)} &= 3.28 \\ &+ 0.73 \cdot \mathrm{log(ibm2)} \\ &+ 0.00024 \cdot \mathrm{kdagur^*} \\ &- 0.26 \cdot \mathbb I_{\text{teg_eign = Íbúðareign}} \\ &- 0.05 \cdot \mathbb I_{\text{teg_eign = Parhús}} \\ &- 0.18 \cdot \mathbb I_{\text{teg_eign = Raðhús}} \\ &+ 0.11 \cdot \mathbb I_{\text{Matssvæði = Hlíðar}} \\ &+ 0.36 \cdot \mathbb I_{\text{Matssvæði = Miðbær (Suður Þingholt)}} \\ &- 0.12 \cdot \mathbb I_{\text{Matssvæði = Seljahverfi}} \\ &+ 0.15 \cdot \mathbb I_{\text{Matssvæði = Vesturbær (Vestan Bræðraborgarstígs)}} \end{aligned} \]
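To show how the formula is used, the Python sketch below (the report itself used R) evaluates it for one made-up property. Several things are assumptions: log is taken to be the natural logarithm, levels that do not appear in the formula are treated as the reference categories with coefficient 0, and kdagur* is passed in as whatever transformed value the model used, since the report does not define the asterisk. The example inputs and the function name predict_log_price are hypothetical.

```python
import math

# Rounded coefficients copied from the formula above.
INTERCEPT = 3.28
COEF_LOG_IBM2 = 0.73
COEF_KDAGUR = 0.00024
TEG_EIGN = {"Íbúðareign": -0.26, "Parhús": -0.05, "Raðhús": -0.18}
MATSSVAEDI = {
    "Hlíðar": 0.11,
    "Miðbær (Suður Þingholt)": 0.36,
    "Seljahverfi": -0.12,
    "Vesturbær (Vestan Bræðraborgarstígs)": 0.15,
}

def predict_log_price(ibm2, kdagur_star, teg_eign, matssvaedi):
    """Predicted log price; levels not listed above count as the reference level (0)."""
    return (
        INTERCEPT
        + COEF_LOG_IBM2 * math.log(ibm2)  # assumption: natural logarithm
        + COEF_KDAGUR * kdagur_star       # assumption: already on the model's kdagur* scale
        + TEG_EIGN.get(teg_eign, 0.0)
        + MATSSVAEDI.get(matssvaedi, 0.0)
    )

# Hypothetical input: an apartment with ibm2 = 100 in Hlíðar and kdagur* = 0.
log_price = predict_log_price(100, 0, "Íbúðareign", "Hlíðar")
print(f"predicted log price: {log_price:.3f}")

# In a log-log model the coefficient acts as an elasticity:
# a 10% larger floor area multiplies the predicted price by 1.1 ** 0.73.
area_effect = 1.1 ** COEF_LOG_IBM2
print(f"price factor for 10% more floor area: {area_effect:.3f}")
```

Because both price and floor area are logged, the coefficient 0.73 reads as an elasticity: with everything else fixed, a 10% increase in floor area corresponds to roughly a 7% increase in predicted price.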

Assignment 5

Predicting housing prices in Reykjavík.

Brynjólfur Gauti Jónsson

Magnús Benedikt Sigurðsson

Teacher: Birgir Hrafnkelsson