In Iceland, an annual housing price assessment is performed on the 31st of May by the national register. In this assignment I use the housing price data from the 2017 assessment. First, I use the models I have studied this semester, as well as some others, to predict housing prices. Second, I train two models, a linear discriminant classifier and a multilayer perceptron, to classify a property into one of six building categories.
The dataset I use in this report has many covariates, many of which will not be needed. I start by analyzing effects graphically and then use correlation plots, along with principal component analysis and matrix factorization, to remove unneeded continuous predictors.
Before starting the exploratory analysis I split the data into a training set and a test set. The test set consists of the 1572 sales that happened on or after December 1st, 2016. The training set consists of the remaining 27486 sales.
An often-repeated maxim of real estate pricing is that location is the most important attribute of a property. In the plots below we can see the distribution of prices broken down by location.
Price distribution by community
There seems to be a difference in the price distributions by community, but it is not obvious. Next I break the locations down into smaller areas to see whether larger differences emerge.
Price distribution broken down by smaller location areas
The differences in the distributions become more obvious when we look at smaller areas. The highest median prices are in Arnarnes and Flatir in Garðabær, Skerjafjörður in Reykjavík and Leirvogstunga in Mosfellsbær. These are all upscale areas in prime locations. On the other hand, the lowest median prices are a single property in rural Reykjavík, three areas in Breiðholt and Flensborg in Hafnarfjörður.
It is likely that the predictor svfn is unneeded if we include matssvaedi. To see whether this is the case I fit a linear model of nuvirdi as a function of svfn and matssvaedi, then check whether the model's BIC (Bayesian information criterion) rises substantially when svfn is removed.
| Variable | \(\Delta BIC\) | F value |
|---|---|---|
| <none> | 0.000 | NA |
| svfn | 0.000 | NA |
| matssvaedi | 2083.656 | 51.11257 |
The table above shows that cutting svfn has no meaningful effect on the performance of the model. I therefore decide to remove svfn.
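A minimal sketch of how this comparison can be produced with drop1(), assuming the training data lives in a data frame called train (the name is an assumption):

```r
# Location codes are categorical, not numeric
train$svfn       <- factor(train$svfn)
train$matssvaedi <- factor(train$matssvaedi)

m_loc <- lm(nuvirdi ~ svfn + matssvaedi, data = train)

# drop1() refits the model with each term removed in turn;
# k = log(n) swaps the default AIC penalty for BIC
drop1(m_loc, test = "F", k = log(nrow(train)))
```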
Housing is commonly considered a good investment since property values tend to rise over time. In the plot below we can see a steady increase in price over the years. Although the increase is relatively small compared to the full prices, any significant predictor is worth including.
Evolution of property price over the years
Below is a scatter plot of nuvirdi as a function of age, with a third-degree polynomial fit overlaid. There is no meaningful relationship to be seen between the two variables.
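A sketch of how such a plot could be drawn with ggplot2; train is the assumed training data frame and age is a placeholder name for the property-age column:

```r
library(ggplot2)

# Scatter plot of price against age with a cubic polynomial fit overlaid
ggplot(train, aes(x = age, y = nuvirdi)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3)) +
  labs(x = "Age of property", y = "nuvirdi")
```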
Below we can see that there are some differences in pricing distribution by property type.
Below we see that property size has a positive effect on nuvirdi, but the effect of property circumference is not easily seen. I expect that one of ibm2 and ntm2 will be enough for the models, and having both might even hurt model performance.
There are a lot of predictors that count the number of rooms, appliances or other miscellaneous features of a property:
These predictors are likely to correlate strongly. To cut down on collinearity, I look at the scree plot from a principal component analysis to see how many components are needed to explain at least 90% of the variance, as a guide to how many of these predictors to keep.
Effects of fj predictors on current price
The scree plot above suggests that 6 components should be enough to capture most of the variance in these predictors. Keep in mind that PCA uses orthogonal bases, so we can expect to explain less variance if we keep the same number of the original predictors.
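A minimal sketch of the underlying computation, assuming the fj count predictors are collected in a data frame called fj_vars (an assumed name):

```r
# PCA on the standardized count predictors
pca_fj <- prcomp(fj_vars, center = TRUE, scale. = TRUE)

# Proportion of variance explained per component, and its cumulative sum
var_explained <- pca_fj$sdev^2 / sum(pca_fj$sdev^2)
cumsum(var_explained)

# Scree plot of the component variances
screeplot(pca_fj, type = "lines")
```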
Removing predictors based on correlation tables and plots is good for spotting simple pairwise correlations. In this data I suspect there are collinearities involving several variables at once. To decide which predictors are the biggest culprits I compute the Cholesky decomposition of the correlation matrix.
The Cholesky decomposition factors the correlation matrix into a triangular matrix times its transpose. When the original columns are close to orthogonal, the diagonal of the triangular factor is close to 1; a diagonal value far below one means that variable is nearly a linear combination of the variables before it, which is a sign of multicollinearity. From the diagonal of the decomposition I find variables with low values, as well as variables with high absolute column-wise sums in the correlation matrix, and mark those as likely candidates for removal.
| Measure | fjhaed | fjbkar | fjsturt | fjklos | fjeld | fjherb | fjstof | fjgeym |
|---|---|---|---|---|---|---|---|---|
| Cholesky | 1.000 | 0.988 | 0.882 | 0.715 | 0.963 | 0.738 | 0.848 | 0.994 |
| Correlation | 3.078 | 2.343 | 2.600 | 3.527 | 2.184 | 3.259 | 2.930 | 1.304 |
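A sketch of how the two diagnostics in the table above could be computed, again assuming the count predictors sit in a data frame called fj_vars:

```r
# Correlation matrix of the count predictors
R <- cor(fj_vars)

# Diagonal of the Cholesky factor: values far below 1 flag variables that
# are nearly linear combinations of the preceding columns
chol_diag <- diag(chol(R))

# Column-wise sums of absolute correlations: high sums flag variables that
# are entangled with many others
cor_sums <- colSums(abs(R))

round(rbind(Cholesky = chol_diag, Correlation = cor_sums), 3)
```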
The table above shows that the variables fjklos and fjherb have the lowest Cholesky values. We can also see that fjhaed and fjstof have high correlation sums. Below I draw a correlation plot.
From the plot we can see that the four variables we marked all correlate strongly with one another. I suspect that removing fjherb and fjstof will resolve much of the problem.
To see whether these preprocessing steps helped, I look at correlation plots before and after removing the predictors. I also compare the determinants of the correlation matrices. If a matrix has highly collinear columns, its determinant will be close or equal to zero. By removing highly collinear variables I expect to increase the absolute value of the determinant.
Correlation plot before removing predictors
We can see multiple correlation clusters in the plot above, and the variables we marked before seem to be the culprits, as seen in the analysis table above. Also, the determinant of the correlation matrix is \(10^{-4}\), which is very low and tells us there are multiple collinearities present. Below are the results after removing the unnecessary predictors.
Correlation plot after removing predictors
Above we can see far fewer correlation clusters in the plot, and the determinant is now 0.134, an increase by a factor of 1319! The hard work in variable selection seems to have paid off.
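A sketch of the determinant comparison and the plots, assuming the continuous predictors before and after the cut are held in data frames cont_before and cont_after (assumed names):

```r
library(corrplot)

# Determinant of the correlation matrix: near zero signals strong collinearity
det(cor(cont_before))
det(cor(cont_after))

# Correlation plots before and after removing the flagged predictors
corrplot(cor(cont_before), order = "hclust")
corrplot(cor(cont_after),  order = "hclust")
```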
Below we can see the correlation analysis for all continuous predictors. The diagonal values for fjsturt and fjhaed decreased when the size-related variables were added, so we should keep an eye out for the possibility of removing them as well.
| Variable | Cholesky | Correlation |
|---|---|---|
| fjsturt | 0.78 | 2.72 |
| fjhaed | 0.78 | 3.12 |
| fjstof | 0.82 | 3.17 |
| svalm2 | 0.91 | 2.21 |
| fjeld | 0.92 | 2.39 |
| fjbkar | 0.96 | 2.38 |
| ummal | 0.96 | 2.38 |
| ibm2 | 0.98 | 3.59 |
| geymm2 | 0.99 | 1.51 |
| fjgeym | 0.99 | 1.47 |
| fjmib | 1.00 | 1.97 |
| haednr | 1.00 | 2.28 |
Having done some preliminary subset selection, I now train a linear model of nuvirdi as a function of all remaining predictors. I then look at \(\Delta BIC\), the change in model BIC when any one of its predictors is removed. If a predictor has a very low \(\Delta BIC\) it does little for the model and should be dropped.
| Variable | \(\Delta BIC\) |
|---|---|
| fjstof | 0.000000 |
| <none> | 9.011805 |
| fjbkar | 10.156585 |
| fjmib | 39.831634 |
| fjeld | 75.635517 |
| fjgeym | 93.611392 |
| lyfta | 97.431565 |
| ummal | 171.915717 |
| fjsturt | 243.870652 |
| haednr | 323.961381 |
| svalm2 | 423.531594 |
| fjhaed | 716.837275 |
| geymm2 | 1248.011218 |
| teg_eign | 2124.554427 |
| matssvaedi | 8661.585535 |
| kdagur | 9126.110542 |
| ibm2 | 16455.770641 |
We can see that removing fjstof actually reduces the BIC of our model, so it seems logical to drop that predictor. We also see that fjbkar has little positive effect on the model, with a \(\Delta BIC\) only about 1 higher than the model without it. These results are in accordance with the Cholesky and correlation analysis, so I remove that predictor too.
Below I compare the full model, trained without removing any predictors, and the reduced model, trained after removing them. As we can see, the effect on performance is minimal.
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| Full | 6367.704 | 0.8355359 | 3854.235 |
| Reduced | 6642.239 | 0.8210488 | 4124.218 |
Before training models using caret I perform the following preprocessing steps (sketched in code after the list):
- Box-Cox transform the dependent variable, which results in a log transform. This makes sense both statistically and from a predictive point of view. Statistically we want a normally distributed dependent variable. The log transform also means that small differences punish the model less when the actual value is high: the model is punished a lot for an error of 500 when the true value is 1000, but much less when the true value is 10000.
- Center and scale all continuous predictors, using the means and standard deviations from the training set to normalize the test set.
- One-hot encode all categorical variables.
- Create a resample list containing 10 bootstrapped samples to be used by every model. By using exactly the same resamples in all models I can later combine them into a caretEnsemble and hopefully squeeze even more out of them.
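A minimal sketch of these steps with caret, assuming data frames train and test and a character vector cont_vars naming the continuous predictors (all assumed names):

```r
library(caret)

# Box-Cox transform of the dependent variable; with an estimated lambda
# near zero this amounts to a log transform
bc <- BoxCoxTrans(train$nuvirdi)
train$nuvirdi <- predict(bc, train$nuvirdi)
test$nuvirdi  <- predict(bc, test$nuvirdi)

# Centering and scaling parameters are estimated on the training set only,
# then applied to both sets
pp <- preProcess(train[, cont_vars], method = c("center", "scale"))
train[, cont_vars] <- predict(pp, train[, cont_vars])
test[, cont_vars]  <- predict(pp, test[, cont_vars])

# One-hot encode the categorical predictors
dv      <- dummyVars(nuvirdi ~ ., data = train)
train_x <- predict(dv, train)
test_x  <- predict(dv, test)

# Ten bootstrap resamples, reused by every model so that a caretEnsemble
# can be built from them later
set.seed(42)
boot_idx <- createResample(train$nuvirdi, times = 10)
ctrl     <- trainControl(method = "boot", index = boot_idx,
                         savePredictions = "final")
```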
I will show basic model learning curves before comparing their resamples and test set performance in a later section.
I start by training a stepwise linear model. There is no real learning curve to show, so we move on to the next model.
Elastic net regularization applies a mixture of the LASSO and ridge penalties; in the glmnet parameterization, the estimate minimizes the following objective:
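\[
\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]
\]

where \(\alpha = 1\) corresponds to the LASSO penalty and \(\alpha = 0\) to ridge regression.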
I trained linear regression models using elastic net regularization with the following parameters:
GAM models were originally developed by our dear lords and mentors Trevor Hastie and Robert Tibshirani. They are generalized linear models that relate the dependent variable to smooth functions of the predictor variables.
MARS models were introduced by Jerome Friedman in 1991 and are a nonlinear extension of ordinary linear models. MARS captures nonlinearities by taking a continuous predictor x and breaking it into two hinge functions: max(0, x - c) and max(0, c - x). MARS models can also handle categorical predictors and interactions of any degree. They make for a good middle ground between accuracy and interpretability.
The term MARS is trademarked, so many software implementations, including the R package, are called earth.
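A minimal sketch of fitting such a model through caret's earth interface, reusing the shared bootstrap control object from the preprocessing step; the nprune grid here is illustrative, not the one used in the report:

```r
library(caret)
library(earth)

# Second-degree interactions; nprune controls how many terms survive the
# backward pruning pass
earth_grid <- expand.grid(degree = 2, nprune = seq(5, 35, by = 5))

earth_fit <- train(x = train_x, y = train$nuvirdi,
                   method = "earth",
                   tuneGrid = earth_grid,
                   trControl = ctrl)

# Hinge-function formula of the selected model
summary(earth_fit$finalModel)
```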
Below we see the learning curve for the earth model. The number of terms to keep in the model is a tuning parameter, so it is easy to manage the complexity of the model. I trained with an interaction degree of 2 and the numbers of retained terms shown below.
Below is the formula for the final MARS model:
```
## 10.97032
## - 0.2131135 * matssvaedi.150
## - 0.2045601 * matssvaedi.160
## - 0.3182457 * matssvaedi.161
## - 0.2622683 * matssvaedi.170
## - 0.1990688 * matssvaedi.200
## - 0.1371983 * matssvaedi.600
## - 0.0002730988 * h(1542-kdagur)
## - 0.3978572 * h(0.764092-ibm2)
## + 0.1355252 * h(ibm2-0.764092)
## - 0.1114682 * teg_eignÍbúðareign*h(0.921902-geymm2)
## + 0.1217767 * h(haednr--0.669594)*matssvaedi.20
```
I train an Extremely Randomized Trees model over a range of numbers of randomly selected predictors. Extremely randomized trees are like random forests, except they do not make supervised splits on the randomly chosen variables. Instead they make completely random splits and choose the best one.
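One way to fit such a model is through the ranger engine, which exposes an extratrees split rule; this is a sketch under that assumption, not necessarily the implementation used here:

```r
library(caret)
library(ranger)

# Extremely randomized trees: candidate split points are drawn at random
# rather than optimized, and the best of the random splits is kept
ert_grid <- expand.grid(mtry = c(3, 6, 9),
                        splitrule = "extratrees",
                        min.node.size = 5)

ert_fit <- train(x = train_x, y = train$nuvirdi,
                 method = "ranger",
                 tuneGrid = ert_grid,
                 trControl = ctrl)
```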
The GBM model was trained with parameters:
Cubist is a rule-based hierarchical linear model that was only commercially available until its source code was released in 2011. The algorithm acts like a CART regression tree, but at each node, instead of a single prediction, there is a linear model. When a value is predicted, it traverses down the tree to the appropriate linear model and a prediction is made. On the way back up, the prediction is adjusted by its parent models higher in the tree.
Cubist allows for a boosting-like mechanism called committees, in which it builds multiple such linear model trees and averages their predictions. Cubist also has a parameter called instances, which implements a K-nearest-neighbors smoothing using nearby points from the training set, where instances plays the role of K in KNN.
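A sketch of fitting Cubist through caret, with an illustrative grid over the committees and instances (neighbors) parameters described above:

```r
library(caret)
library(Cubist)

# committees = number of boosted model trees to average;
# neighbors  = K used for the nearest-neighbors smoothing of predictions
cubist_grid <- expand.grid(committees = c(1, 10, 50),
                           neighbors  = c(0, 5, 9))

cubist_fit <- train(x = train_x, y = train$nuvirdi,
                    method = "cubist",
                    tuneGrid = cubist_grid,
                    trControl = ctrl)
```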
Resampled RMSE measures
From the figure above we can see that Cubist and GBM performed best, with the random forest and MARS models close behind. In a statistical setting the MARS model would be a viable candidate since it produces interpretable coefficients.
Because we used identical bootstrap resamples for every model, caretEnsemble allows us to stack them into one ensemble by using their predictions as covariates in a stepwise linear model. Any model can be used as the meta-learner, but I chose a simple one to save time and guard against overfitting.
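A sketch of the stacking step with caretEnsemble, assuming the base models were fit through caretList with the shared resample indices (the object names and method list are assumptions):

```r
library(caretEnsemble)

# All base models share the same bootstrap resamples through the common
# trainControl index, which is required for stacking
model_list <- caretList(x = train_x, y = train$nuvirdi,
                        trControl = ctrl,
                        methodList = c("glmnet", "earth", "ranger",
                                       "gbm", "cubist"))

# Stack the base models: their out-of-fold predictions become covariates
# in a stepwise linear meta-model
stack_fit <- caretStack(model_list, method = "lmStepAIC",
                        trControl = trainControl(method = "boot", number = 10))
stack_fit
```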
The table below shows the weight on each model’s predictions. In an optimal setting I would train models that have weak correlations in order to capture different parts of the predictor space.
| model | RMSE | MAE | Rsquared |
|---|---|---|---|
| GBM | 0.0948299 | 0.0713518 | 0.9461574 |
| Ensemble | 0.0968804 | 0.0725684 | 0.9437722 |
| Cubist | 0.1171608 | 0.0864059 | 0.9177981 |
| RF | 0.1216649 | 0.0906210 | 0.9157851 |
| Stepwise LM | 0.1651826 | 0.1214602 | 0.8364213 |
| GLM | 0.1651948 | 0.1215269 | 0.8364111 |
| MARS | 0.1671436 | 0.1275649 | 0.8325142 |
| GAM | 0.1730909 | 0.1323546 | 0.8203838 |
The table above shows the training set performance of the best models.
| model | RMSE | MAE | Rsquared |
|---|---|---|---|
| GBM | 0.1096725 | 0.0844474 | 0.9114116 |
| RF | 0.1554163 | 0.1206349 | 0.8778158 |
| MARS | 0.1970418 | 0.1596586 | 0.8203008 |
| Ensemble | 0.8424269 | 0.8314427 | 0.8585638 |
| Cubist | 3.0943439 | 3.0763884 | 0.4853602 |
| GLM | 3.9316304 | 3.9288276 | 0.8386808 |
It seems like many models have overfit the training data. The ones that do well are the ones that perform inherent feature selection: random forest, gradient boosting and MARS. The matssvaedi variable turns into 60 predictors once one-hot encoded. It could be that this explicit representation of housing zones makes the models read signal from what is really noise in the geographic data.
Even though the GBM and RF models did better (as should be expected), I am still very happy with the performance of the multivariate adaptive regression spline model. I hadn't heard much about it before, but I read about it in the book Applied Predictive Modeling by Max Kuhn. A nonlinear model with interpretable coefficients and built-in feature selection (which can include interactions up to nth order) is something I'm going to study further in the future. I'm especially looking forward to trying it out in my research paper for the hospital this summer.
Next I use a linear discriminant classifier as well as a multilayer perceptron neural network to classify which type of property a data point represents. For this task I chose to remove predictors that could inadvertently give the models too much information, that is, predictors like the number of elevators, which floor the property is on, and so on.
In the end I settled on predicting using the following covariates:
ummal, ibm2, fjhaed, fjherb, fjsturt, fjstof, fjeld, fjbkar, fjklos, svalm2, geymm2
Having trained most of the models we covered in the course above, I thought it would be nice to also train a linear discriminant classifier. LDA approximates the Bayes classifier by assuming that, within each class, the predictors are multivariate normal with a covariance matrix shared across classes.
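A minimal sketch of the fit with MASS::lda, assuming the selected covariates and the teg_eign response are collected in data frames class_train and class_test (assumed names), producing a confusion matrix like the one below:

```r
library(MASS)

# Linear discriminant analysis of property type on the selected covariates
lda_fit <- lda(teg_eign ~ ., data = class_train)

# Confusion matrix on the held-out test set
lda_pred <- predict(lda_fit, newdata = class_test)$class
table(Prediction = lda_pred, Reference = class_test$teg_eign)
```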
| Prediction | Einbýlishús | Fjölbýlishús | Íbúðareign | Íbúðarhús | Parhús | Raðhús |
|---|---|---|---|---|---|---|
| Einbýlishús | 86 | 1 | 10 | 0 | 3 | 25 |
| Fjölbýlishús | 5 | 0 | 5 | 0 | 0 | 0 |
| Íbúðareign | 41 | 3 | 1230 | 1 | 4 | 32 |
| Íbúðarhús | 5 | 0 | 1 | 0 | 1 | 0 |
| Parhús | 1 | 0 | 0 | 0 | 0 | 0 |
| Raðhús | 27 | 0 | 30 | 0 | 19 | 42 |
| Model Accuracy |
|---|
| 0.8638677 |
The LDA model does very well. This is perhaps to be expected, since we are using so many covariates that relate to property type.
Next I train a multilayer perceptron, which is a simple neural network architecture. A neural network works like a large collection of linear models whose outputs are passed through nonlinear functions and then fed into further linear models.
I used the Keras interface for R and trained a network with four hidden layers:
Each layer used the relu activation function, max(0, f(x)), where f(x) is a linear combination of the layer's inputs. To prevent overfitting I added dropout layers with a rate of 20% between the hidden layers. Dropout works by giving each neuron a chance of not firing during each training step, so the network cannot rely on a few narrowly trained neurons but has to make good decisions collectively.
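A sketch of such a network with the Keras interface for R; the layer widths are illustrative, since only the depth, activation and dropout rate are stated above, and x_train / y_train (one-hot encoded, e.g. with to_categorical()) are assumed objects:

```r
library(keras)

# Four hidden relu layers with 20% dropout between them;
# the output layer has one unit per building category
model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu",
              input_shape = ncol(x_train)) %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 6, activation = "softmax")

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = "adam",
  metrics = "accuracy"
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 50, batch_size = 32,
  validation_split = 0.2
)
```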
Learning curve for multilayer perceptron
| Prediction | Einbýlishús | Fjölbýlishús | Íbúðareign | Íbúðarhús | Parhús | Raðhús |
|---|---|---|---|---|---|---|
| Einbýlishús | 117 | 1 | 12 | 0 | 6 | 26 |
| Fjölbýlishús | 0 | 0 | 0 | 0 | 0 | 0 |
| Íbúðareign | 21 | 3 | 1255 | 1 | 1 | 18 |
| Íbúðarhús | 0 | 0 | 0 | 0 | 0 | 0 |
| Parhús | 0 | 0 | 0 | 0 | 0 | 0 |
| Raðhús | 27 | 0 | 9 | 0 | 20 | 55 |
| Model Accuracy |
|---|
| 0.9077608 |
The neural network does a little better, and could probably achieve higher accuracy with more parameter tuning.