In Iceland, an annual housing price assessment is performed on the 31st of May by the national register. In this assignment I use the housing price data from the 2017 assessment. First, I use the models I have studied this semester, as well as some others, to predict housing prices. Second, I train two models, a linear discriminant classifier and a multilayer perceptron, to classify a property into one of six building categories.
The dataset I use in this report has many covariates, many of which will not be needed. I start by analyzing effects graphically and then use correlation plots, along with principal component analysis and matrix factorization, to remove unneeded continuous predictors.
Before starting the exploratory analysis I split the data into a training set and a test set. The test set consists of the 1572 sales that happened on or after December 1st, 2016. The training set consists of the remaining 27486 sales.
An often-repeated maxim of real estate pricing is that location is the most important attribute of a property. In the plots below we can see the distribution of prices broken down by location.
Price distribution by community
There seems to be a difference in the price distributions by community, but it is not obvious. Next I break the locations down into smaller areas to see whether larger differences emerge.
Price distribution broken down by smaller location areas
The differences in the distributions become more obvious when we look at smaller areas. The highest median prices are in Arnarnes and Flatir in Garðabær, Skerjafjörður in Reykjavík and Leirvogstunga in Mosfellsbær. These are all upscale areas in prime locations. On the other hand, the lowest median prices are a single property in rural Reykjavík, three areas in Breiðholt and Flensborg in Hafnarfjörður.
It is likely that the predictor svfn is unneeded if we include matssvaedi. To see whether this is the case I fit a linear model of nuvirdi as a function of svfn and matssvaedi, then check whether the model's BIC (Bayesian information criterion) rises substantially when svfn is removed.
| Variable | \(\Delta BIC\) | F value |
|---|---|---|
| <none> | 0.000 | NA |
| svfn | 0.000 | NA |
| matssvaedi | 2083.656 | 51.11257 |
The table above shows that cutting svfn has no meaningful effect on the performance of the model. I therefore decide to remove svfn.
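A minimal sketch of how this comparison can be produced with drop1(), assuming the training data lives in a data frame called train (the name is an assumption):

```r
# Location codes are categorical, not numeric
train$svfn       <- factor(train$svfn)
train$matssvaedi <- factor(train$matssvaedi)

m_loc <- lm(nuvirdi ~ svfn + matssvaedi, data = train)

# drop1() refits the model with each term removed in turn;
# k = log(n) swaps the default AIC penalty for BIC
drop1(m_loc, test = "F", k = log(nrow(train)))
```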
Housing is commonly considered a good investment since property values tend to rise over time. In the plot below we can see a steady increase in price over the years. Although the increase is relatively small compared to the full prices, any significant predictor is worth including.
Evolution of property price over the years
Below is a scatter plot of nuvirdi as a function of age, with a third-degree polynomial fit overlaid. There is no meaningful relationship to be seen between the two variables.
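A sketch of how such a plot could be drawn with ggplot2; train is the assumed training data frame and age is a placeholder name for the property-age column:

```r
library(ggplot2)

# Scatter plot of price against age with a cubic polynomial fit overlaid
ggplot(train, aes(x = age, y = nuvirdi)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3)) +
  labs(x = "Age of property", y = "nuvirdi")
```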
Below we can see that there are some differences in pricing distribution by property type.
Below we see that property size has a positive effect on nuvirdi, but the effect of property circumference is not easily seen. I expect that one of ibm2 and ntm2 will be enough for the models, and having both might even hurt model performance.
There are a lot of predictors that count the number of rooms, appliances or other miscellaneous features of a property:
These predictors are likely to correlate strongly. To cut down on collinearity, I look at the scree plot from a principal component analysis to see how many components are needed to explain at least 90% of the variance, as a guide to how many of these predictors to keep.
Effects of fj predictors on current price
The scree plot above suggests that 6 components should be enough to capture most of the variance in these predictors. Keep in mind that PCA uses orthogonal bases, so we can expect to explain less variance if we keep the same number of the original predictors.
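A minimal sketch of the underlying computation, assuming the fj count predictors are collected in a data frame called fj_vars (an assumed name):

```r
# PCA on the standardized count predictors
pca_fj <- prcomp(fj_vars, center = TRUE, scale. = TRUE)

# Proportion of variance explained per component, and its cumulative sum
var_explained <- pca_fj$sdev^2 / sum(pca_fj$sdev^2)
cumsum(var_explained)

# Scree plot of the component variances
screeplot(pca_fj, type = "lines")
```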
Removing predictors based on correlation tables and plots is good for spotting simple pairwise correlations. In this data I suspect there are collinearities involving several variables at once. To decide which predictors are the biggest culprits I compute the Cholesky decomposition of the correlation matrix.
The Cholesky decomposition factors the correlation matrix into a triangular matrix times its transpose. When the original columns are close to orthogonal, the diagonal of the triangular factor is close to 1; a diagonal value far below one means that variable is nearly a linear combination of the variables before it, which is a sign of multicollinearity. From the diagonal of the decomposition I find variables with low values, as well as variables with high absolute column-wise sums in the correlation matrix, and mark those as likely candidates for removal.
| Measure | fjhaed | fjbkar | fjsturt | fjklos | fjeld | fjherb | fjstof | fjgeym |
|---|---|---|---|---|---|---|---|---|
| Cholesky | 1.000 | 0.988 | 0.882 | 0.715 | 0.963 | 0.738 | 0.848 | 0.994 |
| Correlation | 3.078 | 2.343 | 2.600 | 3.527 | 2.184 | 3.259 | 2.930 | 1.304 |
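A sketch of how the two diagnostics in the table above could be computed, again assuming the count predictors sit in a data frame called fj_vars:

```r
# Correlation matrix of the count predictors
R <- cor(fj_vars)

# Diagonal of the Cholesky factor: values far below 1 flag variables that
# are nearly linear combinations of the preceding columns
chol_diag <- diag(chol(R))

# Column-wise sums of absolute correlations: high sums flag variables that
# are entangled with many others
cor_sums <- colSums(abs(R))

round(rbind(Cholesky = chol_diag, Correlation = cor_sums), 3)
```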
The table above shows that the variables fjklos and fjherb have the lowest Cholesky values. We can also see that fjhaed and fjstof have high correlation sums. Below I draw a correlation plot.
From the plot we can see that the four variables we marked all correlate strongly with one another. I suspect that removing fjherb and fjstof will resolve much of the problem.
To see whether these preprocessing steps helped, I look at correlation plots before and after removing the predictors. I also compare the determinants of the correlation matrices. If a matrix has highly collinear columns, its determinant will be close or equal to zero. By removing highly collinear variables I expect to increase the absolute value of the determinant.
Correlation plot before removing predictors
We can see multiple correlation clusters in the plot above, and the variables we marked before seem to be the culprits, as seen in the analysis table above. Also, the determinant of the correlation matrix is \(10^{-4}\), which is very low and tells us there are multiple collinearities present. Below are the results after removing the unnecessary predictors.
Correlation plot after removing predictors
Above we can see far fewer correlation clusters in the plot, and the determinant is now 0.134, an increase by a factor of 1319! The hard work in variable selection seems to have paid off.
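A sketch of the determinant comparison and the plots, assuming the continuous predictors before and after the cut are held in data frames cont_before and cont_after (assumed names):

```r
library(corrplot)

# Determinant of the correlation matrix: near zero signals strong collinearity
det(cor(cont_before))
det(cor(cont_after))

# Correlation plots before and after removing the flagged predictors
corrplot(cor(cont_before), order = "hclust")
corrplot(cor(cont_after),  order = "hclust")
```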
Below we can see the correlation analysis for all continuous predictors. The diagonal values for fjsturt and fjhaed decreased when the size-related variables were added, so we should keep an eye out for the possibility of removing them as well.
| Variable | Cholesky | Correlation |
|---|---|---|
| fjsturt | 0.78 | 2.72 |
| fjhaed | 0.78 | 3.12 |
| fjstof | 0.82 | 3.17 |
| svalm2 | 0.91 | 2.21 |
| fjeld | 0.92 | 2.39 |
| fjbkar | 0.96 | 2.38 |
| ummal | 0.96 | 2.38 |
| ibm2 | 0.98 | 3.59 |
| geymm2 | 0.99 | 1.51 |
| fjgeym | 0.99 | 1.47 |
| fjmib | 1.00 | 1.97 |
| haednr | 1.00 | 2.28 |
Having done some preliminary subset selection, I now train a linear model of nuvirdi as a function of all remaining predictors. I then look at \(\Delta BIC\), the change in model BIC when any one of its predictors is removed. If a predictor has a very low \(\Delta BIC\) it does little for the model and should be dropped.
| Variable | \(\Delta BIC\) |
|---|---|
| fjstof | 0.000000 |
| <none> | 9.011805 |
| fjbkar | 10.156585 |
| fjmib | 39.831634 |
| fjeld | 75.635517 |
| fjgeym | 93.611392 |
| lyfta | 97.431565 |
| ummal | 171.915717 |
| fjsturt | 243.870652 |
| haednr | 323.961381 |
| svalm2 | 423.531594 |
| fjhaed | 716.837275 |
| geymm2 | 1248.011218 |
| teg_eign | 2124.554427 |
| matssvaedi | 8661.585535 |
| kdagur | 9126.110542 |
| ibm2 | 16455.770641 |
We can see that removing fjstof actually reduces the BIC of our model, so it seems logical to drop that predictor. We also see that fjbkar has little positive effect on the model, with a \(\Delta BIC\) only about 1 higher than the model without it. These results are in accordance with the Cholesky and correlation analysis, so I remove that predictor too.
Below I compare the full model, trained without removing any predictors, and the reduced model, trained after removing them. As we can see, the effect on performance is minimal.
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| Full | 6367.704 | 0.8355359 | 3854.235 |
| Reduced | 6642.239 | 0.8210488 | 4124.218 |
Before training models using caret I perform the following preprocessing steps (sketched in code after the list):
- Box-Cox transform the dependent variable, which results in a log transform. This makes sense both statistically and from a predictive point of view. Statistically we want a normally distributed dependent variable. The log transform also means that small differences punish the model less when the actual value is high: the model is punished a lot for an error of 500 when the true value is 1000, but much less when the true value is 10000.
- Center and scale all continuous predictors, using the means and standard deviations from the training set to normalize the test set.
- One-hot encode all categorical variables.
- Create a resample list containing 10 bootstrapped samples to be used by every model. By using exactly the same resamples in all models I can later combine them into a caretEnsemble and hopefully squeeze even more out of them.
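A minimal sketch of these steps with caret, assuming data frames train and test and a character vector cont_vars naming the continuous predictors (all assumed names):

```r
library(caret)

# Box-Cox transform of the dependent variable; with an estimated lambda
# near zero this amounts to a log transform
bc <- BoxCoxTrans(train$nuvirdi)
train$nuvirdi <- predict(bc, train$nuvirdi)
test$nuvirdi  <- predict(bc, test$nuvirdi)

# Centering and scaling parameters are estimated on the training set only,
# then applied to both sets
pp <- preProcess(train[, cont_vars], method = c("center", "scale"))
train[, cont_vars] <- predict(pp, train[, cont_vars])
test[, cont_vars]  <- predict(pp, test[, cont_vars])

# One-hot encode the categorical predictors
dv      <- dummyVars(nuvirdi ~ ., data = train)
train_x <- predict(dv, train)
test_x  <- predict(dv, test)

# Ten bootstrap resamples, reused by every model so that a caretEnsemble
# can be built from them later
set.seed(42)
boot_idx <- createResample(train$nuvirdi, times = 10)
ctrl     <- trainControl(method = "boot", index = boot_idx,
                         savePredictions = "final")
```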
I will show basic model learning curves before comparing their resamples and test set performance in a later section.
I start by training a stepwise linear model. There is no real learning curve to show, so we move on to the next model.
Elastic net regularization applies a mixture of the LASSO and ridge penalties; in the glmnet parameterization, the estimate minimizes the following objective:
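\[
\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]
\]

where \(\alpha = 1\) corresponds to the LASSO penalty and \(\alpha = 0\) to ridge regression.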
I trained linear regression models using elastic net regularization with the following parameters:
GAM models were originally developed by our dear lords and mentors Trevor Hastie and Robert Tibshirani. They are generalized linear models that relate the dependent variable to smooth functions of the predictor variables.
MARS models were introduced by Jerome Friedman in 1991 and are a nonlinear extension of ordinary linear models. MARS captures nonlinearities by taking a continuous predictor x and breaking it into two hinge functions: max(0, x - c) and max(0, c - x). MARS models can also handle categorical predictors and interactions of any degree. They make for a good middle ground between accuracy and interpretability.
The term MARS is trademarked, so many software implementations, including the R package, are called earth.
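A minimal sketch of fitting such a model through caret's earth interface, reusing the shared bootstrap control object from the preprocessing step; the nprune grid here is illustrative, not the one used in the report:

```r
library(caret)
library(earth)

# Second-degree interactions; nprune controls how many terms survive the
# backward pruning pass
earth_grid <- expand.grid(degree = 2, nprune = seq(5, 35, by = 5))

earth_fit <- train(x = train_x, y = train$nuvirdi,
                   method = "earth",
                   tuneGrid = earth_grid,
                   trControl = ctrl)

# Hinge-function formula of the selected model
summary(earth_fit$finalModel)
```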
Below we see the learning curve for the earth model. The number of terms to keep in the model is a tuning parameter, so it is easy to manage the complexity of the model. I trained with an interaction degree of 2 and the numbers of retained terms shown below.
Below is the formula for the final MARS model:
```
## 10.97032
## - 0.2131135 * matssvaedi.150
## - 0.2045601 * matssvaedi.160
## - 0.3182457 * matssvaedi.161
## - 0.2622683 * matssvaedi.170
## - 0.1990688 * matssvaedi.200
## - 0.1371983 * matssvaedi.600
## - 0.0002730988 * h(1542-kdagur)
## - 0.3978572 * h(0.764092-ibm2)
## + 0.1355252 * h(ibm2-0.764092)
## - 0.1114682 * teg_eignÍbúðareign*h(0.921902-geymm2)
## + 0.1217767 * h(haednr--0.669594)*matssvaedi.20
```
I train an Extremely Randomized Trees model over a range of numbers of randomly selected predictors. Extremely randomized trees are like random forests, except they do not make supervised splits on the randomly chosen variables. Instead they make completely random splits and choose the best one.
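One way to fit such a model is through the ranger engine, which exposes an extratrees split rule; this is a sketch under that assumption, not necessarily the implementation used here:

```r
library(caret)
library(ranger)

# Extremely randomized trees: candidate split points are drawn at random
# rather than optimized, and the best of the random splits is kept
ert_grid <- expand.grid(mtry = c(3, 6, 9),
                        splitrule = "extratrees",
                        min.node.size = 5)

ert_fit <- train(x = train_x, y = train$nuvirdi,
                 method = "ranger",
                 tuneGrid = ert_grid,
                 trControl = ctrl)
```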
The GBM model was trained with parameters:
Cubist is a rule-based hierarchical linear model that was only commercially available until its source code was released in 2011. The algorithm acts like a CART regression tree, but at each node, instead of a single prediction, there is a linear model. When a value is predicted, it traverses down the tree to the appropriate linear model and a prediction is made. On the way back up, the prediction is adjusted by its parent models higher in the tree.
Cubist allows for a boosting-like mechanism called committees, in which it builds multiple such linear model trees and averages their predictions. Cubist also has a parameter called instances, which implements a K-nearest-neighbors smoothing using nearby points from the training set, where instances plays the role of K in KNN.
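A sketch of fitting Cubist through caret, with an illustrative grid over the committees and instances (neighbors) parameters described above:

```r
library(caret)
library(Cubist)

# committees = number of boosted model trees to average;
# neighbors  = K used for the nearest-neighbors smoothing of predictions
cubist_grid <- expand.grid(committees = c(1, 10, 50),
                           neighbors  = c(0, 5, 9))

cubist_fit <- train(x = train_x, y = train$nuvirdi,
                    method = "cubist",
                    tuneGrid = cubist_grid,
                    trControl = ctrl)
```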
Resampled RMSE measures
From the figure above we can see that Cubist and GBM performed best, with the random forest and MARS models close behind. In a statistical setting the MARS model would be a viable candidate since it produces interpretable coefficients.
Because we used identical bootstrap resamples for every model, caretEnsemble allows us to stack them into one ensemble by using their predictions as covariates in a stepwise linear model. Any model can be used as the meta-learner, but I chose a simple one to save time and guard against overfitting.
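A sketch of the stacking step with caretEnsemble, assuming the base models were fit through caretList with the shared resample indices (the object names and method list are assumptions):

```r
library(caretEnsemble)

# All base models share the same bootstrap resamples through the common
# trainControl index, which is required for stacking
model_list <- caretList(x = train_x, y = train$nuvirdi,
                        trControl = ctrl,
                        methodList = c("glmnet", "earth", "ranger",
                                       "gbm", "cubist"))

# Stack the base models: their out-of-fold predictions become covariates
# in a stepwise linear meta-model
stack_fit <- caretStack(model_list, method = "lmStepAIC",
                        trControl = trainControl(method = "boot", number = 10))
stack_fit
```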
The table below shows the weight on each model’s predictions. In an optimal setting I would train models that have weak correlations in order to capture different parts of the predictor space.
| model | RMSE | MAE | Rsquared |
|---|---|---|---|
| GBM | 0.0948299 | 0.0713518 | 0.9461574 |
| Ensemble | 0.0968804 | 0.0725684 | 0.9437722 |
| Cubist | 0.1171608 | 0.0864059 | 0.9177981 |
| RF | 0.1216649 | 0.0906210 | 0.9157851 |
| Stepwise LM | 0.1651826 | 0.1214602 | 0.8364213 |
| GLM | 0.1651948 | 0.1215269 | 0.8364111 |
| MARS | 0.1671436 | 0.1275649 | 0.8325142 |
| GAM | 0.1730909 | 0.1323546 | 0.8203838 |
The table above shows the training set performance of the best models.
| model | RMSE | MAE | Rsquared |
|---|---|---|---|
| GBM | 0.1096725 | 0.0844474 | 0.9114116 |
| RF | 0.1554163 | 0.1206349 | 0.8778158 |
| MARS | 0.1970418 | 0.1596586 | 0.8203008 |
| Ensemble | 0.8424269 | 0.8314427 | 0.8585638 |
| Cubist | 3.0943439 | 3.0763884 | 0.4853602 |
| GLM | 3.9316304 | 3.9288276 | 0.8386808 |
It seems like many models have overfit the training data. The ones that do well are the ones that perform inherent feature selection: random forest, gradient boosting and MARS. The matssvaedi variable turns into 60 predictors once one-hot encoded. It could be that this explicit representation of housing zones makes the models read signal from what is really noise in the geographic data.
Even though the GBM and RF models did better (as should be expected), I am still very happy with the performance of the multivariate adaptive regression spline model. I hadn't heard much about it before, but I read about it in the book Applied Predictive Modeling by Max Kuhn. A nonlinear model with interpretable coefficients and built-in feature selection (which can include interactions up to nth order) is something I'm going to study further in the future. I'm especially looking forward to trying it out in my research paper for the hospital this summer.
Next I use a linear discriminant classifier as well as a multilayer perceptron neural network to classify which type of property a data point represents. For this task I chose to remove predictors that could inadvertently give the models too much information, that is, predictors like the number of elevators, which floor the property is on, and so on.
In the end I settled on predicting using the following covariates:
ummal, ibm2, fjhaed, fjherb, fjsturt, fjstof, fjeld, fjbkar, fjklos, svalm2, geymm2
Having trained most of the models we covered in the course above, I thought it would be nice to also train a linear discriminant classifier. LDA approximates the Bayes classifier by assuming that, within each class, the predictors are multivariate normal with a covariance matrix shared across classes.
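A minimal sketch of the fit with MASS::lda, assuming the selected covariates and the teg_eign response are collected in data frames class_train and class_test (assumed names), producing a confusion matrix like the one below:

```r
library(MASS)

# Linear discriminant analysis of property type on the selected covariates
lda_fit <- lda(teg_eign ~ ., data = class_train)

# Confusion matrix on the held-out test set
lda_pred <- predict(lda_fit, newdata = class_test)$class
table(Prediction = lda_pred, Reference = class_test$teg_eign)
```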
| Prediction | Einbýlishús | Fjölbýlishús | Íbúðareign | Íbúðarhús | Parhús | Raðhús |
|---|---|---|---|---|---|---|
| Einbýlishús | 86 | 1 | 10 | 0 | 3 | 25 |
| Fjölbýlishús | 5 | 0 | 5 | 0 | 0 | 0 |
| Íbúðareign | 41 | 3 | 1230 | 1 | 4 | 32 |
| Íbúðarhús | 5 | 0 | 1 | 0 | 1 | 0 |
| Parhús | 1 | 0 | 0 | 0 | 0 | 0 |
| Raðhús | 27 | 0 | 30 | 0 | 19 | 42 |
| Model Accuracy |
|---|
| 0.8638677 |
The LDA model does very well. This is perhaps to be expected, since we are using so many covariates that relate to property type.
Next I train a multilayer perceptron, which is a simple neural network architecture. A neural network works like a large collection of linear models whose outputs are passed through nonlinear functions and then fed into further linear models.
I used the Keras interface for R and trained a network with four hidden layers:
Each layer used the relu activation function, max(0, f(x)), where f(x) is a linear combination of the layer's inputs. To prevent overfitting I added dropout layers with a rate of 20% between the hidden layers. Dropout works by giving each neuron a chance of not firing during each training step, so the network cannot rely on a few narrowly trained neurons but has to make good decisions collectively.
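A sketch of such a network with the Keras interface for R; the layer widths are illustrative, since only the depth, activation and dropout rate are stated above, and x_train / y_train (one-hot encoded, e.g. with to_categorical()) are assumed objects:

```r
library(keras)

# Four hidden relu layers with 20% dropout between them;
# the output layer has one unit per building category
model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu",
              input_shape = ncol(x_train)) %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 6, activation = "softmax")

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = "adam",
  metrics = "accuracy"
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 50, batch_size = 32,
  validation_split = 0.2
)
```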
Learning curve for multilayer perceptron
| Prediction | Einbýlishús | Fjölbýlishús | Íbúðareign | Íbúðarhús | Parhús | Raðhús |
|---|---|---|---|---|---|---|
| Einbýlishús | 117 | 1 | 12 | 0 | 6 | 26 |
| Fjölbýlishús | 0 | 0 | 0 | 0 | 0 | 0 |
| Íbúðareign | 21 | 3 | 1255 | 1 | 1 | 18 |
| Íbúðarhús | 0 | 0 | 0 | 0 | 0 | 0 |
| Parhús | 0 | 0 | 0 | 0 | 0 | 0 |
| Raðhús | 27 | 0 | 9 | 0 | 20 | 55 |
| Model Accuracy |
|---|
| 0.9077608 |
The neural network does a little better, and could probably achieve higher accuracy with more parameter tuning.