This is a personal project of mine - I gathered structural and sales data for properties, built a GIS framework in Google Earth, and used ArcGIS (and R) to derive the spatial data needed to enrich the existing dataset.
The point was to compare ALL the prominent mass valuation methods, in terms of predictive accuracy, on a common dataset, since no published paper has ever done so. There are currently two papers that compare more than two algorithms (McCluskey et al., 2013; Antipov & Pokryshevskaya, 2012); however, the way testing was performed (a simple train-test split) and the lack of hyperparameter optimization motivated me to "try for something better".
The variables for the models were chosen after reading papers on the factors that impact real estate prices and drawing on my own experience.
A lot of research was also done on the techniques that have been used in mass valuation (ranging from traditional comparative methods to hedonic models, spatial regressions and ML algorithms).
As you can imagine, the research background is huge and cannot be covered in this single post; anyone interested - contact me.
The area of study is the city of Thessaloniki in Greece, during the period 2002-2009.
During this period the Greek market was extremely volatile (financial peak and crisis), which means we cannot expect very high accuracy. In good, non-volatile market conditions, local professional real estate appraisers have achieved an accuracy of ±8-10%. This number can serve as a good, strict baseline.
After gathering structural and sales data from the real estate market for 759 properties I enriched the dataset with spatial information (for full list of variables see “Variables” below).
This was achieved by performing the following process:
The properties were geotagged and their coordinates extracted. The geodetic WGS84 coordinates were transformed to GGRS87 (the Greek Grid, see https://epsg.io/2100).
A framework was created in Google Earth containing all points of interest (such as the Industrial Zone, Primary Roads, Secondary Roads, the Coastal Line etc.). Again, all coordinates were transformed from geodetic to Cartesian.
ArcGIS was used to derive the shortest distance between each property and each point (or area) of interest. ArcGIS is not strictly needed though; you can find the R code that performs the exact same process at the end of this page, and a condensed sketch follows below.
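A condensed sketch of the same GIS step in R is shown here, using the sf package; the file names and layer contents are assumptions for illustration only.
library(sf)
# geotagged properties with geodetic WGS84 coordinates
props <- st_as_sf(read.csv("properties_latlon.csv"), coords = c("lon", "lat"), crs = 4326)
props <- st_transform(props, 2100)                        # GGRS87 / Greek Grid (EPSG:2100)
# one of the point-of-interest layers digitized in Google Earth
roads <- st_transform(st_read("primary_roads.kml"), 2100)
# shortest distance from every property to the nearest primary road
props$DstToMainRoad <- apply(st_distance(props, roads), 1, min)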
We will assess and compare the results of multiple statistical and Machine Learning methods against each other (ensembles and stacked models will be covered in a later post), in order to evaluate which one provides the most accurate predictions.
The methods we will apply, assess & compare are:
To train the SAR Model we will have to jump through some hoops:
Most models were tuned using Grid Search and nested cross-validation (see relevant code).
XGBoost was tuned manually, and the ANN was tuned through a combination of manual search (for the activation and loss functions) and Grid Search. This was done to save time, since Grid Search is a very time-consuming process.
The Grid Search part of the code for the ANN has been removed due to the time needed to run it for each fold (Grid Search for the ANN took 48 hours FOR EACH FOLD on my Intel i7; unfortunately I was not able to use my graphics card to train the NN).
Bayesian Optimization was avoided because of the nominal hyperparameters, and Evolutionary Algorithms, while effective, require far too much time.
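For reference, the ANN Grid Search that was removed would look roughly like the sketch below, built around h2o.grid. The hyperparameter ranges are illustrative assumptions (not the grid that was actually run), and ANNLearn and trainData are defined inside the cross-validation loop further down.
library(h2o)
h2o.init(nthreads = 2)
hyper_params <- list(hidden      = list(c(64,64), c(128,128,32), c(220,220,65)),
                     l1          = c(0, 1e-5, 1e-4),
                     l2          = c(0, 1e-5, 1e-4),
                     huber_alpha = c(0.7, 0.9))
ann_grid <- h2o.grid("deeplearning", x = ANNLearn, y = "Price2001",
                     training_frame = as.h2o(trainData), nfolds = 5,
                     activation = "Rectifier", distribution = "huber",
                     hyper_params = hyper_params,
                     search_criteria = list(strategy = "Cartesian"))
h2o.getGrid(ann_grid@grid_id, sort_by = "mae", decreasing = FALSE)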
Since Grid Search takes quite a while, we will be employing the doParallel package:
R normally runs sequentially (XGBoost and H2O are exceptions; they work in parallel), which means that only 20-25% of our CPU is being utilised. The doParallel package, however, allows us to run segments of our code in parallel, utilising up to 100% of the CPU.
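The basic pattern is just a few lines (the worker count below is an assumption):
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores() - 1)   # leave one core free
registerDoParallel(cl)
# caret's train() now runs its resamples on the registered workers;
# foreach can also be used directly:
squares <- foreach(i = 1:10, .combine = c) %dopar% i^2
stopCluster(cl)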
All inaccurate and incomplete records were removed beforehand. A small number of records that were considered extreme outliers were also removed. However, we decided not to exclude other records that could be classified as outliers, in order to test the robustness of each algorithm.
To account for the temporal effect, the prices were adjusted to a common date (2001) using the house price index for Thessaloniki, published by the National Bank of Greece. Another variable, equal to the difference in years between the sale date and the common adjustment date (2001), was also added; this accounts for inconsistencies between the published price index and real prices and gave better results. While there are other, more data-driven methods to perform this adjustment, which involve deriving indices from the data itself (Gloudemans, 1999), they could not be applied due to the small size of the dataset.
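As a rough sketch of the adjustment (assuming PrMSQR holds the observed price per square metre, YoE the sale year, and a hypothetical named vector hpi holding the index value for each year, rebased so that 2001 = 100):
data$Price2001 <- data$PrMSQR * 100 / hpi[as.character(data$YoE)]
# years between the sale date and the common adjustment date
data$YoE.2001  <- data$YoE - 2001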
Some data transformations and a few tricks were also employed; however, I will not explain them here. If you are interested - contact me!
Following is the list of variables we will be using.
This list was compiled after studying many research papers and through the experience I have accumulated working in the real estate sector.
rm(list=ls())                 # clear the workspace
DATA<-"Real_Estate.csv"
data<-read.csv(DATA)          # load the dataset
colnames(data)                # list the available variables
## [1] "region.id" "Name" "DstToMainRoad"
## [4] "DstToScndRoad" "DstToRingRoad" "DstToCoastal"
## [7] "DstToHospitals" "DstToUniCampus" "DstToExCamp"
## [10] "DstToCemetary" "DstToIndustrial" "DstToStadium"
## [13] "DstToSquare_SmPark" "DstToBigPark" "DstToCityCenter"
## [16] "Floor" "Floor0.5" "Age"
## [19] "Age.Squared" "Area" "Cond"
## [22] "Cond1" "Cond2" "Cond3"
## [25] "Cond4" "Corner" "InCentre"
## [28] "OnSeafront" "NewBuilt" "OnMainRoad"
## [31] "OnScndRoad" "OnRingROad" "NearStadium"
## [34] "NearSmallPark_Square" "NearBigPark" "NearCemetary"
## [37] "Parking" "LuxuriousBuilding" "PrMSQR"
## [40] "AbsPr" "YoE" "YoE.2000"
## [43] "YoE.2001" "Ybuilt" "Loc"
## [46] "Price2001" "X" "Y"
## [49] "DstToTrainstation" "Floor1" "Floor2"
## [52] "Floor3" "Floor4" "Floor5"
## [55] "Floor6Above" "Floor0"
Let's have a look at what our dataset looks like.
We will use the Hmisc library to create histograms and the psych library to get descriptive statistics.
We build a function to get frequencies (and percentages).
For the dummy variables, 1 means "Yes" and 0 means "No".
Analysing the dataset before modelling takes place can provide useful insights and reveal potential challenges. Ideally, a comparison with the complete sales data from the whole market would have been made, to ensure that the acquired sample is representative; however, this was not possible.
While the sample is not very large, the values of most attributes are well spread out (see the descriptive statistics and histograms below). This is quite important, since all statistical methods and ML algorithms require variance among the values of the predictors.
As observed, only 0.79% of the properties are located on the seafront and only 1.58% in the city centre. These attributes are assumed to strongly affect values, and the low count in both cases (6 and 12 properties respectively) might limit the ability of the different methods to estimate their impact accurately. Similarly, flats located in a luxurious complex account for only 1.45% of the data and might get mispriced. Issues might also arise when estimating the values of properties that are on a main road (1.58%) or next to the ring road (1.19%), since access to such roads is considered highly sought after. Of the remaining properties, 0.92% are next to a big park and 0.66% are next to a cemetery. Properties located on the ground floor and in-between the ground and the first floor account for 1.98% and 1.84% of the sample respectively.
As already mentioned, because the total number of properties on the 7th floor or above is extremely small (5 properties, or 0.66%), it was decided to group them with those on the 6th floor (hence the variable "6th floor or above").
library(Hmisc)
data_to_plot<-data[,c(3:15,18,20,46)]
hist.data.frame(data_to_plot)
library(psych)
cat("\n\nDescriptive Statistics\n\n")
##
##
## Descriptive Statistics
describe(data_to_plot,skew=FALSE)
## vars n mean sd min max range
## DstToMainRoad 1 759 1050.57 755.83 0.00 2840.89 2840.89
## DstToScndRoad 2 759 324.97 318.49 0.00 1616.41 1616.41
## DstToRingRoad 3 759 1735.83 1053.51 22.08 4497.74 4475.66
## DstToCoastal 4 759 3031.04 1573.75 88.15 7001.31 6913.16
## DstToHospitals 5 759 2239.85 1267.91 62.07 5818.50 5756.43
## DstToUniCampus 6 759 5134.14 2416.40 92.11 9206.05 9113.94
## DstToExCamp 7 759 2206.11 1449.92 81.59 6466.20 6384.61
## DstToCemetary 8 759 1701.02 829.27 5.48 3751.92 3746.43
## DstToIndustrial 9 759 6564.97 4790.69 615.19 17778.53 17163.34
## DstToStadium 10 759 1691.72 1099.67 20.90 4803.70 4782.80
## DstToSquare_SmPark 11 759 222.24 162.86 0.00 1018.73 1018.73
## DstToBigPark 12 759 1872.71 1386.60 0.00 5691.86 5691.86
## DstToCityCenter 13 759 5283.00 1994.33 279.61 9869.41 9589.80
## Age 14 759 14.28 13.57 0.00 54.00 54.00
## Area 15 759 87.74 27.86 27.00 240.00 213.00
## Price2001 16 759 922.48 267.79 311.99 2524.23 2212.23
## se
## DstToMainRoad 27.43
## DstToScndRoad 11.56
## DstToRingRoad 38.24
## DstToCoastal 57.12
## DstToHospitals 46.02
## DstToUniCampus 87.71
## DstToExCamp 52.63
## DstToCemetary 30.10
## DstToIndustrial 173.89
## DstToStadium 39.92
## DstToSquare_SmPark 5.91
## DstToBigPark 50.33
## DstToCityCenter 72.39
## Age 0.49
## Area 1.01
## Price2001 9.72
cat("\n\nDummy Variables\n\n")
##
##
## Dummy Variables
cat("\nCondition\n\n")
##
## Condition
tblFun <- function(x){
tbl <- table(x)
res <- (cbind(table(x),round(prop.table(tbl)*100,2)))
colnames(res) <- c('Count','Percentage')
res
}
tblFun(data[,"Cond"])
## Count Percentage
## 1 61 8.04
## 2 75 9.88
## 3 465 61.26
## 4 158 20.82
cat("\nFloor\n\n")
##
## Floor
tblFun(data[,"Floor"])
## Count Percentage
## 0 29 3.82
## 1 198 26.09
## 2 185 24.37
## 3 137 18.05
## 4 104 13.70
## 5 78 10.28
## 6 23 3.03
## 7 3 0.40
## 8 2 0.26
cat("\n\nRest of Dummy Variables\n\n")
##
##
## Rest of Dummy Variables
for (i in 26:38){
print(colnames(data[i]))
print(tblFun(data[,i]))
}
## [1] "Corner"
## Count Percentage
## 0 509 67.06
## 1 250 32.94
## [1] "InCentre"
## Count Percentage
## 0 747 98.42
## 1 12 1.58
## [1] "OnSeafront"
## Count Percentage
## 0 753 99.21
## 1 6 0.79
## [1] "NewBuilt"
## Count Percentage
## 0 594 78.26
## 1 165 21.74
## [1] "OnMainRoad"
## Count Percentage
## 0 747 98.42
## 1 12 1.58
## [1] "OnScndRoad"
## Count Percentage
## 0 690 90.91
## 1 69 9.09
## [1] "OnRingROad"
## Count Percentage
## 0 750 98.81
## 1 9 1.19
## [1] "NearStadium"
## Count Percentage
## 0 733 96.57
## 1 26 3.43
## [1] "NearSmallPark_Square"
## Count Percentage
## 0 700 92.23
## 1 59 7.77
## [1] "NearBigPark"
## Count Percentage
## 0 752 99.08
## 1 7 0.92
## [1] "NearCemetary"
## Count Percentage
## 0 754 99.34
## 1 5 0.66
## [1] "Parking"
## Count Percentage
## 0 457 60.21
## 1 302 39.79
## [1] "LuxuriousBuilding"
## Count Percentage
## 0 748 98.55
## 1 11 1.45
Calculating the correlation matrix and visualizing the result revealed that some attributes were correlated. However, removing any of the highly correlated features resulted in a loss of accuracy for every method except the SVM. For the regression models, multicollinearity might invalidate significance tests and interfere with the correct estimation of the beta parameters, which in turn can lead to a wrong assessment of the impact each variable has on the price; it has been argued, however, that it does not necessarily affect the predictive power of the model. The same generally applies to many ML algorithms, especially regression trees, because of the way the splitting at each node works. As the focus of this study is on the accuracy of the predictions, and keeping the correlated variables yielded better predictive accuracy, it was decided not to remove them. For the SVM, however, the highly correlated variables were removed.
Regarding feature selection for the regression models, a number of different elimination techniques were considered. Stepwise regression and Akaike's Information Criterion (Akaike, 1974), both of which are based on variable importance, were applied, and recursive feature elimination (RFE) was also tried. However, the results of multiple experiments indicated that there was no consistent set of features that provided better accuracy than using the whole set: each time, the "best" set of features depended heavily on the training sample. For this reason, and because the ML algorithms in this study can effectively perform feature selection on their own (Random Forests use voting and subsamples of the features, while XGBoost, ANNs and SVMs all use regularization), all the available features were included.
Later experiments showed that a reduced set of variables resulted in slightly better performance for the SVM model only; for this run, however, we will use all of them.
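A sketch of the selection approaches that were tried is shown below; the settings are assumptions, and the formula object full is defined further down the page.
library(MASS)
library(caret)
# stepwise selection driven by AIC
step_fit <- stepAIC(lm(full, data = data), direction = "both", trace = FALSE)
formula(step_fit)            # the "best" subset for this particular sample
# recursive feature elimination with a linear-model wrapper
pred_cols <- setdiff(all.vars(full), "Price2001")
rfe_fit <- rfe(x = data[, pred_cols], y = data$Price2001, sizes = c(10, 20, 30),
               rfeControl = rfeControl(functions = lmFuncs, method = "cv", number = 10))
predictors(rfe_fit)          # the subset RFE settles on for this sample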
datacors<-(data[,c(3:16,18,21,26:38,49)])
datacors<-cor(datacors)
library(corrplot)
corrplot(datacors,type="upper")
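For the SVM run, the candidates to drop can be flagged directly from this correlation matrix with caret's findCorrelation; the 0.85 cutoff below is an assumption, not the threshold actually used.
library(caret)
high_cor <- findCorrelation(datacors, cutoff = 0.85, names = TRUE)
high_cor    # candidate columns to remove before training the SVM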
As mentioned, different algorithms require different formats for the variables (for example, regressions and ANNs need the nominal variables to be one-hot encoded, while trees do not).
Experiments were conducted to determine which format provides the best results for each algorithm.
Let’s load the data and create the formulas:
full<-Price2001~DstToMainRoad+DstToScndRoad+DstToRingRoad+DstToCoastal+DstToHospitals+DstToUniCampus+DstToExCamp+DstToCemetary+DstToIndustrial+DstToStadium+DstToSquare_SmPark+DstToBigPark+DstToCityCenter+Floor1+Floor2+Floor3+Floor4+Floor5+Floor6Above+Floor0.5+Age+Age.Squared+Cond2+Cond3+Cond4+InCentre+OnSeafront+NewBuilt+OnMainRoad+OnScndRoad+OnRingROad+Parking+NearStadium+NearSmallPark_Square+NearBigPark+NearCemetary+LuxuriousBuilding+YoE.2001
fullMLSVM<-Price2001~DstToMainRoad+DstToScndRoad+DstToRingRoad+DstToCoastal+DstToHospitals+DstToUniCampus+DstToExCamp+DstToCemetary+DstToIndustrial+DstToStadium+DstToSquare_SmPark+DstToBigPark+DstToCityCenter+Floor1+Floor2+Floor3+Floor4+Floor5+Floor6Above+Floor0.5+Age+Age.Squared+Cond1+Cond2+Cond3+Cond4+InCentre+OnSeafront+NewBuilt+OnMainRoad+OnScndRoad+OnRingROad+Parking+NearStadium+NearSmallPark_Square+NearBigPark+NearCemetary+LuxuriousBuilding+YoE.2001+DstToTrainstation+Corner+Area+X+Y
fullMLtree<-Price2001~DstToMainRoad+DstToScndRoad+DstToRingRoad+DstToCoastal+DstToHospitals+DstToUniCampus+DstToExCamp+DstToCemetary+DstToIndustrial+DstToStadium+DstToSquare_SmPark+DstToBigPark+DstToCityCenter+Floor+Floor0.5+Age+Age.Squared+Cond+InCentre+OnSeafront+NewBuilt+OnMainRoad+OnScndRoad+OnRingROad+Parking+NearStadium+NearSmallPark_Square+NearBigPark+NearCemetary+LuxuriousBuilding+YoE.2001+DstToTrainstation+Corner+Area+X+Y
In this experiment we do not scale/normalize the feature values for the ANN or the SVM; this step is handled by caret (for the SVM) and by H2O (for the ANN).
However, it is generally preferable to perform this step manually, since the default way these libraries do it is not suitable for all cases.
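A minimal sketch of doing it manually inside the cross-validation loop, using the training fold's statistics only (the column selection here is illustrative):
num_cols <- c("DstToMainRoad", "DstToScndRoad", "DstToCoastal", "DstToCityCenter", "Age", "Area")
train_mu <- sapply(trainData[, num_cols], mean)
train_sd <- sapply(trainData[, num_cols], sd)
trainData[, num_cols] <- scale(trainData[, num_cols], center = train_mu, scale = train_sd)
testData[, num_cols]  <- scale(testData[, num_cols],  center = train_mu, scale = train_sd)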
The values of the hyperparameters of both the statistical methods and the ML algorithms had to be optimized in order to achieve maximum accuracy. This was done using nested cross-validation, to avoid introducing any kind of bias during the optimization process.
The SAR specification used is the spatial error model, as it has been shown in the literature to provide more accurate results than the lag model. For the selection of the weight function and the number of nearest neighbours, a manual search was performed; the function f(d) = 1/d^2 with k = 5 neighbours provided the best results.
For the Geographically Weighted Regression, different combinations of kernels and bandwidth types were tried. The optimal combination was the Gaussian kernel with a "fixed" bandwidth. The optimal bandwidth in each fold was determined using the built-in selection function, which is essentially a nested-CV Grid Search.
The neural network, due to its vast complexity, was tuned by setting the activation and loss functions manually and then performing a Grid Search for the rest of the parameters. The Rectifier activation (Hahnloser et al., 2000) and the Huber loss (Huber, 1964) were selected manually: the Rectifier was chosen since it has been demonstrated to provide better training for ANNs (Glorot et al., 2011), while the Huber loss was selected because it is less sensitive to outliers than the squared error loss. The number of hidden layers and nodes, the values of the L1 and L2 regularization parameters, and the Huber alpha were determined through Grid Search. The number of initial training iterations was set high, so as to avoid getting "stuck" in a local minimum, and early stopping was employed to avoid over-fitting; the optimal number of iterations was determined through cross-validation in each fold. The most accurate results were obtained with 3 hidden layers and a high total number of nodes (220, 220, and 65). As this architecture seemed unusually large, multiple runs of Grid Search were performed, but it provided the best results in every trial.
The Random Forest and SVM algorithms were both tuned using Grid Search. For the Random Forest, the number of random variables considered at each split was tuned, while the number of trees was set to 500 and the subsample rate to 90%. For the SVM, the radial kernel was used, and the values of sigma and the cost were tuned.
XGBoost was tuned entirely by hand. The tree-based booster was selected. The learning rate ("eta") was set to a low value (0.01) to achieve higher accuracy, which in turn required a higher number of rounds and more time. A low tree depth (3) was selected, since making the model more complex offered no improvement in accuracy, which is reasonable considering the total number of variables (36) and observations (759). The subsample rate was set to 80% to improve generalization. Minimum child weight (in the regression case, the minimum number of observations a node must contain; a regularization parameter) was set to a low value (3), and gamma (another regularization parameter, related to the minimum loss reduction required to make a further partition at a node) to a high value (120). These settings provided the optimal results.
To provide a fair comparison and analysis of the predictive power of the different models, each property had to be evaluated without, at the same time, taking part in the training process.
For this reason, k-fold cross-validation was adopted: the dataset was split into 10 equal parts. In every iteration, 9 of them were used as the training set and the remaining one (different in each iteration) was used as the test set. After 10 repetitions, all properties had been evaluated without ever being in the training and the test set at the same time. Stratified sampling was used; however, the resulting folds might not be totally representative of the whole dataset, since the software stratifies on only one feature, in this case the sale price. To reduce variance, the cross-validation process was performed multiple times and the final results were averaged.
Everything is contained in the following big loop. We perform k-fold cross-validation, training and testing every algorithm on the same folds.
The code contains comments and is easily readable.
Remember, we have to add noise to the training and test sets for our spatial regression models (that's why we have trainData1 and testData1), and then jump through the aforementioned hoops to train and test the SAR model.
trainData1<-data.frame() #we will be adding noise to the train/test set for the geostatistical methods
testData1<-data.frame() #same as above
test_set<-data.frame()
test_set2<-data.frame()
#Perform 10-fold cross-validation
#Randomly shuffle the data
yourData<-data[sample(nrow(data)),]
#Create 10 equally sized, stratified folds
library(caret)
folds <- createFolds(yourData$Price2001, k=10, list = FALSE)
for(i in 1:10){
#Segment the data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- yourData[testIndexes, ]
trainData <- yourData[-testIndexes, ]
trainData1<-trainData ### else the noise loop will not run
testData1<-testData ###
# Adding noise to variables
for (j in 3:56){
noise <- sample(-5:5,length(trainData$X),replace=TRUE)
noise <-noise/1000
trainData1[,j]<-as.numeric(trainData[,j])+noise
}
# Adding noise to the X/Y coordinates for the SAR models (avoids identical points when calculating neighbours)
for (j in 47:48){
noise <- sample(-5:5,length(trainData$X),replace=TRUE)
noise<-noise/10
trainData1[,j]<-as.numeric(trainData[,j])+noise
}
# Region.id for later use on SAR models has been added externally
library(sp)
### The geostatistical methods need SpatialPointsDataFrames
trainData1<-SpatialPointsDataFrame(trainData1[,47:48],trainData1[,1:56])
testData1<-SpatialPointsDataFrame(testData1[,47:48],testData1[,1:56])
###### TRAIN AND TEST MODELS IN HERE BEFORE THE LOOP CLOSES
###### Simple Regression #####
simpleLinear<-lm(full,trainData)
LM_fit<-predict(simpleLinear,testData)
LM_residuals<-as.vector(testData$Price2001-LM_fit)
###### SAR Model ######
### Preparing the kNearestNeighbors
#SAR
#knearest neighbors
require(spdep)
neighbors<-knearneigh(trainData1, k=5, longlat = NULL, RANN=FALSE)
neighbors<-knn2nb(neighbors, row.names = trainData1$region.id, sym = FALSE)
neigh.dist <- nbdists(neighbors,coordinates(trainData1), longlat=FALSE)
### creating weight matrix
inverse <- lapply(neigh.dist, function(x) (1/((x^2))))
neigh.listw <- nb2listw(neighbors, glist=inverse, style="W", zero.policy=TRUE)
W <- as(neigh.listw, "CsparseMatrix")
trW<-trW(W, m = 30, p = 16, type = "mult", listw=neigh.listw, momentsSymmetry=TRUE)
### Train
fullerrorsarlm<- errorsarlm(full, data=trainData1, listw=neigh.listw, na.action=na.exclude, zero.policy=TRUE,etype="error",trs=trW,tol.solve=1e-14)
### The test set gets its neighbours from the training set, as test-set neighbours should not have been "seen"
neighbors_test<-knearneigh(trainData1, k=5, longlat = NULL, RANN=FALSE)
neighbors_test<-knn2nb(neighbors_test, row.names = trainData1$region.id, sym = FALSE)
neigh.dist_test <- nbdists(neighbors_test,coordinates(trainData1), longlat=FALSE)
inverse_test <- lapply(neigh.dist_test, function(x) (1/((x^2))))
neigh.listw_test <- nb2listw(neighbors_test, glist=inverse_test, style="W", zero.policy=TRUE)
sar<-predict.sarlm(fullerrorsarlm, newdata = testData1, listw = neigh.listw_test, pred.type = "TS", all.data = FALSE, zero.policy = TRUE, legacy = TRUE, legacy.mixed = FALSE, power = NULL, order = 250, tol = .Machine$double.eps^(3/5), spChk = TRUE)
resultssar<-print.sarlm.pred(sar)
SAR_fit<-resultssar$fit
SAR_residuals<-testData$Price2001-SAR_fit
###### GWR ######
require(GWmodel)
bw<-bw.gwr(full, data=trainData1,kernel="gaussian", adaptive=FALSE, p=2, theta=0, longlat=F, approach='CV')
gwr<-gwr.basic(full, data = trainData1, bw=bw, kernel="gaussian", adaptive=FALSE, longlat=F)
gwrpredict<-gwr.predict(full, data=trainData1, predictdata=testData1, bw=bw, kernel="gaussian",adaptive=FALSE, longlat=F)
results_gwr<-gwrpredict$SDF
GWR_fit<-results_gwr$prediction
GWR_residuals<-testData$Price2001-GWR_fit
##### ANN ######
# These are the variables we will use in our Neural Network
ANNLearn<-colnames(trainData[,c(3:21,26,27,28,29,30:37,38,43,47,48,49)])
# Train
# We use ReLu and the Huber Loss Function to deal with outliers more effectively.
# Stopping metric for early stopping is MAE; ideally we would like MAPE but it is not supported by H2O as of now. Can be done in TensorFlow.
require(h2o)
h2o.init(nthreads = 2)
h2o.no_progress()
model = h2o.deeplearning(x=ANNLearn, y = 'Price2001',
training_frame = as.h2o(trainData),
nfolds = 5,
activation = 'Rectifier',
hidden = c(220,220,65),
epochs = 300,
stopping_metric = "MAE",
train_samples_per_iteration = -2,
stopping_rounds = 5,
distribution="huber",
huber_alpha = 0.9,
overwrite_with_best_model = FALSE,
verbose = FALSE,
loss = "Huber")
# Predicting the Test set results
ANN_fit = h2o.predict(model, newdata = as.h2o(testData))
ANN_fit = as.vector(ANN_fit)
ANN_residuals=testData$Price2001-ANN_fit
##### RANDOM FOREST #####
library(randomForest)
library(caret)
library(doParallel)
##### Hyper-parameter optimization in parallel using more CPU cores for faster implementation
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
## Grid Search
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
tunegrid <- expand.grid(.mtry=c(10,14,16,18,20,22,24,26,28,30,34))
rf.tune <- train(fullMLtree, data=trainData, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
stopCluster(cl)
# Predict test-set results
rf_fit<-predict(rf.tune,
newdata=testData)
rf_fit = as.vector(rf_fit)
rForest_residuals<-testData$Price2001-rf_fit
##### SVM #####
library(e1071)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
svmgrid<-expand.grid(C=c(0.001,0.01,0.1, 1, 10, 100, 200, 300, 500,1000),
sigma=c(0.0005,0.001, 0.005, 0.01, 0.1, 1,10,100))
##### Hyper-parameter optimization in parallel using more CPU cores for faster implementation
library(doParallel)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
svm.tune <- train(fullMLSVM,
data=trainData,
method = "svmRadial",
tuneGrid = svmgrid,
preProcess=c("scale"),
trControl=control)
stopCluster(cl)
# Predict test-set results
SVM_fit<-predict(svm.tune,
newdata=testData)
SVM_fit<- as.vector(SVM_fit)
SVM_residuals=testData$Price2001-SVM_fit
trainData<-as.data.frame(trainData)
testData<-as.data.frame(testData)
##### XGBOOST #####
library(xgboost)
y<-trainData$Price2001
#Algorithm has been tuned manually; we still need to find out when to stop training (early stopping via CV)
xgbearlystop <- xgb.cv(data =data.matrix(trainData[,c(3:21,26,27,28,29,30:37,38,43,47,48,49)]),
label =y,
nfold=10,
max.depth =3,
eta = 0.01,
nthread = 8,
nround = 6000,
min_child_weight=3,
#max_delta_step=10,
eval_metric = "mae",
tree_method="exact",
gamma=120,
subsample=0.8,
#colsample_bytree=0.6,
#colsample_bytree=0.9,
objective = "reg:linear",
early_stopping_rounds=200,
verbose = 0,
#alpha=0.1,
callbacks=list(cb.evaluation.log()))
#Train
modelxgboost <- xgboost(data =data.matrix(trainData[,c(3:21,26,27,28,29,30:37,38,43,47,48,49)]),
nround = xgbearlystop$best_iteration,
label =y,
nfold=10,
max.depth =3,
eta = 0.01,
nthread = 8,
min_child_weight=3,
# max_delta_step=10,
eval_metric = "mae",
tree_method="exact",
gamma=120,
subsample=0.8,
#colsample_bytree=0.6,
#colsample_bytree=0.8,
objective = "reg:linear",
verbose = 0
#alpha=0.1
)
#Predict
XGBoost_fit<-predict(modelxgboost, data.matrix(testData[,c(3:21,26,27,28,29,30:37,38,43,47,48,49)]))
XGBoost_fit <- as.vector(XGBoost_fit)
XGBoost_residuals=testData$Price2001-XGBoost_fit
# The following can be used to get plots for the importances of variables
# Leaving them here
#xgb.plot.importance(xgbimportances)
#xgbimportances<-xgb.importance(feature_names = names(data[,c(3:21,26:38,43,49)]),model=modelxgboost)
#modelxgboost <- xgboost(data =data.matrix(trainData[,c(3:21,26:38,43,47,48,49)]),
#variableimportance_xgboost <- xgb.importance(feature_names = names(trainData[,c(3:19,29,32:44,48:54,56,61,72,73)]), model = modelxgboost)
#xgb.ggplot.importance(xgbimportances)
##### lets wrap up the results #####
test1<-as.data.frame(testData)
binded<-cbind(test1,LM_fit,LM_residuals,SAR_fit,SAR_residuals,GWR_fit,GWR_residuals,ANN_fit,ANN_residuals,XGBoost_fit,XGBoost_residuals,rf_fit,rForest_residuals,SVM_fit,SVM_residuals)
test_set<-rbind(test_set,binded)
}
Now that we have finished our k-fold cross-validation, let's calculate the metrics. We are calculating MSE, RMSE, MAE, MAPE and R-squared.
MSE_LM<-mean(test_set$LM_residuals^2)
MSE_SAR<-mean(test_set$SAR_residuals^2)
MSE_GWR<-mean(test_set$GWR_residuals^2)
MSE_ANN<-mean(test_set$ANN_residuals^2)
MSE_rFor<-mean(test_set$rForest_residuals^2)
MSE_SVM<-mean(test_set$SVM_residuals^2)
MSE_XGB<-mean(test_set$XGBoost_residuals^2)
MSE<-cbind(MSE_LM,MSE_SAR,MSE_GWR,MSE_ANN,MSE_rFor,MSE_SVM,MSE_XGB)
RMSE_LM<-(MSE_LM)^(1/2)
RMSE_SAR<-(MSE_SAR)^(1/2)
RMSE_GWR<-(MSE_GWR)^(1/2)
RMSE_ANN<-(MSE_ANN)^(1/2)
RMSE_rFor<-(MSE_rFor)^(1/2)
RMSE_SVM<-(MSE_SVM)^(1/2)
RMSE_XGB<-(MSE_XGB)^(1/2)
RMSE<-cbind(RMSE_LM,RMSE_SAR,RMSE_GWR,RMSE_ANN,RMSE_rFor,RMSE_SVM,RMSE_XGB)
MAE_LM<-mean(abs(test_set$LM_residuals))
MAE_SAR<-mean(abs(test_set$SAR_residuals))
MAE_GWR<-mean(abs(test_set$GWR_residuals))
MAE_ANN<-mean(abs(test_set$ANN_residuals))
MAE_rFor<-mean(abs(test_set$rForest_residuals))
MAE_SVM<-mean(abs(test_set$SVM_residuals))
MAE_XGB<-mean(abs(test_set$XGBoost_residuals))
MAE<-cbind(MAE_LM,MAE_SAR,MAE_GWR,MAE_ANN,MAE_rFor,MAE_SVM,MAE_XGB)
MAPE_LM<-mean(abs(test_set$LM_residuals/test_set$Price2001))
MAPE_SAR<-mean(abs(test_set$SAR_residuals/test_set$Price2001))
MAPE_GWR<-mean(abs(test_set$GWR_residuals/test_set$Price2001))
MAPE_ANN<-mean(abs(test_set$ANN_residuals/test_set$Price2001))
MAPE_rFor<-mean(abs(test_set$rForest_residuals/test_set$Price2001))
MAPE_SVM<-mean(abs(test_set$SVM_residuals/test_set$Price2001))
MAPE_XGB<-mean(abs(test_set$XGBoost_residuals/test_set$Price2001))
MAPE<-cbind(MAPE_LM,MAPE_SAR,MAPE_GWR,MAPE_ANN,MAPE_rFor,MAPE_SVM,MAPE_XGB)
PE_LM<-abs(test_set$LM_residuals/test_set$Price2001)
PE_SAR<-abs(test_set$SAR_residuals/test_set$Price2001)
PE_GWR<-abs(test_set$GWR_residuals/test_set$Price2001)
PE_ANN<-abs(test_set$ANN_residuals/test_set$Price2001)
PE_rFor<-abs(test_set$rForest_residuals/test_set$Price2001)
PE_SVM<-abs(test_set$SVM_residuals/test_set$Price2001)
PE_XGB<-abs(test_set$XGBoost_residuals/test_set$Price2001)
PE<-cbind(PE_LM,PE_SAR,PE_GWR,PE_ANN,PE_rFor,PE_SVM,PE_XGB)
R2LM <- 1 - (sum((test_set$LM_residuals )^2)/sum((test_set$Price2001-mean(test_set$Price2001))^2))
R2SAR <- 1 - (sum((test_set$SAR_residuals )^2)/sum((test_set$Price2001-mean(test_set$Price2001))^2))
R2GWR<- 1- (sum((test_set$GWR_residuals )^2)/sum((test_set$Price2001-mean(test_set$Price2001))^2))
R2ANN <- 1 - (sum((test_set$ANN_residuals )^2)/sum((test_set$Price2001-mean(test_set$Price2001))^2))
R2RFOR <- 1 - (sum((test_set$rForest_residuals )^2)/sum((test_set$Price2001-mean(test_set$Price2001))^2))
R2SVM<- 1 - (sum((test_set$SVM_residuals )^2)/sum((test_set$Price2001-mean(test_set$Price2001))^2))
R2XGB <- 1 - (sum((test_set$XGBoost_residuals )^2)/sum((test_set$Price2001-mean(test_set$Price2001))^2))
R2<-cbind(R2LM,R2SAR,R2GWR,R2ANN,R2RFOR,R2SVM,R2XGB)
results<-rbind(MSE,RMSE,MAE,MAPE,R2)
colnames(results)<-c("MLR","SAR","GWR","ANN","Random Forest","SVM","XGB")
rownames(results)<-c("MSE","RMSE","MAE","MAPE","Rsquared")
MAPE has become one of the most frequently used evaluation metrics for real estate models, because of its ability to express the error in a simple and understandable way, while also being more informative and less misleading than MAE, since it is expressed as a percentage of the sale price and not as an absolute value.
Looking at the results below, we observe that when an algorithm shows a lower MAPE than another, it also tends to show lower values in the other metrics as well.
XGBoost performs best (after all, it is the most popular algorithm on Kaggle for structured tabular data, due to its precision, speed and scalability), and Random Forest comes second (!), which is not what we expected. However, there is one paper that has showcased the ability of Random Forest to surpass other algorithms when applied to mass real estate valuation (Antipov & Pokryshevskaya, 2012).
After applying feature selection for the SVM algorithm (eliminating most of the variables) I managed to get slightly better results (12.01% MAPE); however, it still performed the worst of all the ML algorithms.
results
## X MLR SAR GWR ANN
## 1 MSE 2.049849e+04 2.048817e+04 2.010253e+04 2.096424e+04
## 2 RMSE 1.431729e+02 1.431369e+02 1.417834e+02 1.447903e+02
## 3 MAE 1.123721e+02 1.123647e+02 1.112654e+02 1.067373e+02
## 4 MAPE 1.277548e-01 1.278283e-01 1.256679e-01 1.181973e-01
## 5 Rsquared 7.137788e-01 7.139228e-01 7.193076e-01 7.072755e-01
## Random.Forest SVM XGB
## 1 2.142707e+04 2.050502e+04 1.637152e+04
## 2 1.463799e+02 1.431958e+02 1.279512e+02
## 3 1.039159e+02 1.116478e+02 9.420039e+01
## 4 1.162081e-01 1.254637e-01 1.056557e-01
## 5 7.008130e-01 7.136875e-01 7.714038e-01
It depends: Do we care about explainability?
Banks, for example, need to be able to explain how their models work. If we do care, then we should use GWR.
If we do not, then XGBoost is the undisputed winner - fast, generally easy to tune and very accurate.
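GWR's appeal for explainability is that the fitted object keeps a local coefficient for every predictor at every location. Below is a sketch using the gwr object from the loop above, assuming GWmodel's convention of one coefficient column per variable in gwr$SDF:
local_coefs <- as.data.frame(gwr$SDF)
summary(local_coefs$DstToCoastal)   # how the "distance to the coast" effect varies across the city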
Let’s have a look at what XGBoost considers as the most important variables.
We will train the model once again, this time using the whole data set and plot importances.
We see that most Near_X variables have been omitted; XGBoost is pretty good at capturing the "impact" that a zero distance to a PoI has on the price.
Also, while the distances play a major role, the X and Y coordinates still have a big impact on the price.
Later experiments showed that we can indeed omit most of the distance variables (apart from the "InCentre" and "OnSeafront" variables) and use X and Y as proxies, without a big loss in accuracy (~11.2% MAPE).
This is pretty important, as it would eliminate the need for an extensive GIS framework and cut down the pre-processing time.
price<-data$Price2001
xgbearlystop2 <- xgb.cv(data =data.matrix(data[,c(3:21,26,27,28,29,30:37,38,43,47,48,49)]),
label =price,
nfold=10,
max.depth =3,
eta = 0.01,
nthread = 8,
nround = 6000,
min_child_weight=3,
#max_delta_step=10,
eval_metric = "mae",
tree_method="exact",
gamma=120,
subsample=0.8,
#colsample_bytree=0.6,
#colsample_bytree=0.9,
objective = "reg:linear",
early_stopping_rounds=200,
verbose = 0,
#alpha=0.1,
callbacks=list(cb.evaluation.log()))
#Train
modelxgboost2 <- xgboost(data =data.matrix(data[,c(3:21,26,27,28,29,30:37,38,43,47,48,49)]),
nround = xgbearlystop2$best_iteration,
label =price,
nfold=10,
max.depth =3,
eta = 0.01,
nthread = 8,
min_child_weight=3,
# max_delta_step=10,
eval_metric = "mae",
tree_method="exact",
gamma=120,
subsample=0.8,
#colsample_bytree=0.6,
#colsample_bytree=0.8,
objective = "reg:linear",
verbose = 0
#alpha=0.1
)
variableimportance_xgboost <- xgb.importance(feature_names = colnames(data[,c(3:21,26,27,28,29,30:37,38,43,47,48,49)]), model = modelxgboost2)
xgb.plot.importance(variableimportance_xgboost)
As mentioned earlier, later experiments focused on removing all the distance variables (dummy variables included) apart from "InCentre" and "OnSeafront", and just using the X and Y coordinates as proxies.
Generally speaking, GWR and XGBoost were able to provide roughly the same level of accuracy as the results above (slightly worse: 12.92% and 11.16% MAPE respectively); however, this approach eliminates the need for a GIS framework (and the associated time and cost).
The neural network, on the other hand, performed much worse after being re-tuned (13.23% MAPE).
For comparison, MLR and SAR had MAPEs of 15.98% and 16.20% respectively (without the X and Y variables).
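For reference, the reduced specification used in those later runs looked roughly like this (the exact variable set kept is an assumption):
reduced<-Price2001~X+Y+InCentre+OnSeafront+Floor+Floor0.5+Age+Age.Squared+Cond+NewBuilt+Corner+Parking+LuxuriousBuilding+Area+YoE.2001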