1. Background

Recently, there have been wide swings in the growth rate of housing prices across many countries. Housing prices tend to rise over time, yet predicting them accurately remains difficult: many factors can influence the demand for a property, and it is hard to pin down which set of factors explains buyers’ behavior, since every buyer has their own preferences regarding house size, location, and so on. In this document, I predict housing prices in King County using a linear regression model and examine which attributes play the key role in determining the price. The dataset was obtained from Kaggle (https://www.kaggle.com/datasets/harlfoxem/housesalesprediction) and consists of 21,613 observations.

2. Data Preparation

2.1 Data Inspection
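
The chunks below rely on a handful of packages. Here is a minimal setup sketch, assuming the following packages based on the functions that appear later (RMSE() with y_pred/y_true arguments suggests MLmetrics; bptest() comes from lmtest, vif() from car, and ggcorr() from GGally):

library(dplyr)      # glimpse(), %>%, select(), mutate(), slice_min(), ...
library(ggplot2)    # geom_point(), scale_alpha_manual(), guides() used with the correlation plot
library(GGally)     # ggcorr()
library(MLmetrics)  # RMSE()
library(lmtest)     # bptest()
library(car)        # vif()
library(nortest)    # ad.test(), also called below as nortest::ad.test()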

house <- read.csv("kc_house_data.csv")
head(house)
glimpse(house)
## Rows: 21,613
## Columns: 21
## $ id            <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005…
## $ date          <chr> "20141013T000000", "20141209T000000", "20150225T000000",…
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
## $ bedrooms      <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_living   <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
## $ sqft_lot      <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ view          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition     <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade         <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above    <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built      <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ yr_renovated  <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ zipcode       <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, …
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
## $ sqft_lot15    <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …

Here is some information about the features:

  • id: Unique ID for each home sold
  • date: Date of the home sale
  • price: Price of each home sold
  • bedrooms: Number of bedrooms
  • bathrooms: Number of bathrooms, where .5 accounts for a room with a toilet but no shower
  • sqft_living: Square footage of the house interior living space
  • sqft_lot: Square footage of the land space
  • floors: Number of floors
  • waterfront: Whether the house was overlooking the waterfront or not
  • view: An index of how good the view of the property was (0-4)
  • condition: An index on the condition of the apartment (1-5)
  • grade: An index on the quality of building construction and design (1-13)
  • sqft_above: The square footage of the interior housing space that is above ground level
  • sqft_basement: The square footage of the interior housing space that is below ground level
  • yr_built: The year the house was initially built
  • yr_renovated: The year of the house’s last renovation
  • zipcode: Zipcode area the house is in
  • lat: Latitude
  • long: Longitude
  • sqft_living15: The square footage of interior housing living space for the nearest 15 neighbors
  • sqft_lot15: The square footage of the land lots of the nearest 15 neighbors

We can drop id and date because they are not relevant to predicting house prices. Furthermore, since all of our data comes from King County, we can also drop zipcode, lat, and long. Inspecting the data further, we can see that waterfront is stored as an integer rather than as a categorical feature, so we will correct that as well.

house <- house %>% 
  select(-c(id, date, zipcode, lat, long)) %>% 
  mutate(waterfront = as.factor(waterfront)) 

2.2 Missing and Duplicated Values

Checking whether the data has any missing or duplicated values.

colSums(is.na(house))
##         price      bedrooms     bathrooms   sqft_living      sqft_lot 
##             0             0             0             0             0 
##        floors    waterfront          view     condition         grade 
##             0             0             0             0             0 
##    sqft_above sqft_basement      yr_built  yr_renovated sqft_living15 
##             0             0             0             0             0 
##    sqft_lot15 
##             0
sum(duplicated(house))
## [1] 6

The data does not have any missing values, but it does contain 6 duplicated rows. Although some houses in the county may well be built exactly alike, keeping just one of each is enough for modeling: duplicated rows can systematically bias the regression estimates. So, I will drop these duplicated rows.
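
Before dropping them, we can glance at the rows involved; a quick sketch (duplicated() with fromLast = TRUE flags both copies of each duplicate):

house[duplicated(house) | duplicated(house, fromLast = TRUE), ]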

house <- house[!duplicated(house), ]

The data is now clean and ready to be used.

3. Exploratory Data Analysis

First, let’s see our data summary.

summary(house)
##      price            bedrooms        bathrooms      sqft_living   
##  Min.   :  75000   Min.   : 0.000   Min.   :0.000   Min.   :  290  
##  1st Qu.: 321725   1st Qu.: 3.000   1st Qu.:1.750   1st Qu.: 1428  
##  Median : 450000   Median : 3.000   Median :2.250   Median : 1910  
##  Mean   : 540107   Mean   : 3.371   Mean   :2.115   Mean   : 2080  
##  3rd Qu.: 645000   3rd Qu.: 4.000   3rd Qu.:2.500   3rd Qu.: 2550  
##  Max.   :7700000   Max.   :33.000   Max.   :8.000   Max.   :13540  
##     sqft_lot           floors      waterfront      view          condition    
##  Min.   :    520   Min.   :1.000   0:21444    Min.   :0.0000   Min.   :1.000  
##  1st Qu.:   5040   1st Qu.:1.000   1:  163    1st Qu.:0.0000   1st Qu.:3.000  
##  Median :   7620   Median :1.500              Median :0.0000   Median :3.000  
##  Mean   :  15110   Mean   :1.494              Mean   :0.2342   Mean   :3.409  
##  3rd Qu.:  10692   3rd Qu.:2.000              3rd Qu.:0.0000   3rd Qu.:4.000  
##  Max.   :1651359   Max.   :3.500              Max.   :4.0000   Max.   :5.000  
##      grade          sqft_above   sqft_basement       yr_built   
##  Min.   : 1.000   Min.   : 290   Min.   :   0.0   Min.   :1900  
##  1st Qu.: 7.000   1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951  
##  Median : 7.000   Median :1560   Median :   0.0   Median :1975  
##  Mean   : 7.657   Mean   :1788   Mean   : 291.6   Mean   :1971  
##  3rd Qu.: 8.000   3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997  
##  Max.   :13.000   Max.   :9410   Max.   :4820.0   Max.   :2015  
##   yr_renovated     sqft_living15    sqft_lot15    
##  Min.   :   0.00   Min.   : 399   Min.   :   651  
##  1st Qu.:   0.00   1st Qu.:1490   1st Qu.:  5100  
##  Median :   0.00   Median :1840   Median :  7620  
##  Mean   :  84.33   Mean   :1987   Mean   : 12771  
##  3rd Qu.:   0.00   3rd Qu.:2360   3rd Qu.: 10084  
##  Max.   :2015.00   Max.   :6210   Max.   :871200

The range of our target variable, price, is quite wide, so there may be a lot of outliers in the data. Let’s check the distribution of house prices.

hist(house$price,
     col = "skyblue",
     main = "Distribution of House Prices",
     xlab = "House Price",
     breaks = 30)

boxplot(house$price,
        main = "Boxplot of House Prices",
        col = "skyblue4")

We can see that the distribution of house prices looks roughly normal, with a long right tail that skews the distribution. The boxplot confirms that there are many outliers in the data. I plan to build the model with Ordinary Least Squares (OLS) linear regression, and although OLS does not require the outcome or the independent variables to be normally (Gaussian) distributed, it is very sensitive to outliers. Besides, outliers can be regarded as anomalies in the data. So, I will trim the data to remove some of the outliers.

house_trim <- house %>% 
  arrange(price) %>% 
  slice_min(price, prop = 0.97)
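
As a quick sanity check on the trim, we can look at the approximate price cutoff it implies and how many rows were removed (a sketch):

quantile(house$price, 0.97)      # approximate cutoff price kept by the trim
nrow(house) - nrow(house_trim)   # number of rows removed (roughly 3% of the data)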

hist(house_trim$price,
     col = "skyblue",
     main = "Distribution of House Prices",
     xlab = "House Price",
     breaks = 20)

The distribution seems to be better. Now, let’s see the correlation between features in this dataset.

ggcorr(house_trim, geom = "blank", label = T, label_size = 3, hjust = 1, size = 3, layout.exp = 2) +
  geom_point(size = 8, aes(color = coefficient > 0, alpha = abs(coefficient) >= 0.5)) +
  scale_alpha_manual(values = c("TRUE" = 0.25, "FALSE" = 0)) +
  guides(color = "none", alpha = "none")
## Warning in ggcorr(house_trim, geom = "blank", label = T, label_size = 3, : data
## in column(s) 'waterfront' are not numeric and were ignored

By looking at the correlation matrix, we can see that all variables are positively correlated with price. Furthermore, sqft_living, grade, sqft_above, and sqft_living15 seem to have strong correlations with price.
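
The same information can be read off numerically by sorting the correlations of the numeric columns with price (a sketch; select_if() keeps only numeric columns, so waterfront is excluded here just as in the plot):

sort(cor(select_if(house_trim, is.numeric))[, "price"], decreasing = TRUE)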

3.1 Train-Test Split

Before proceeding to the modeling part, I will split the data into 2 sets: a train set and a test set. The reasoning is straightforward: the main purpose of a model is to predict data it has not seen, so it is important to evaluate its out-of-sample accuracy rather than only its accuracy on the training set (the data the model has learned from). I will split the data into a 70% train set and a 30% test set.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(417)
idx <- sample(nrow(house_trim), nrow(house_trim) * 0.7)
train <- house_trim[idx, ]
test <- house_trim[-idx, ]
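
A quick check that the split came out roughly 70/30 (a sketch):

nrow(train) / nrow(house_trim)   # ~0.7
nrow(test) / nrow(house_trim)    # ~0.3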

4. Modeling

4.1 Multiple Linear Regression

Let’s try to construct the model using 4 features that are strongly correlated with house prices: sqft_living, grade, sqft_above, and sqft_living15.

summary(lm(price ~ sqft_living + grade + sqft_above + sqft_living15,
           data = house))
## 
## Call:
## lm(formula = price ~ sqft_living + grade + sqft_above + sqft_living15, 
##     data = house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1073331  -136535   -23410    98540  4832331 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6.476e+05  1.355e+04 -47.788  < 2e-16 ***
## sqft_living    2.273e+02  4.186e+00  54.313  < 2e-16 ***
## grade          1.057e+05  2.413e+03  43.789  < 2e-16 ***
## sqft_above    -8.183e+01  4.465e+00 -18.330  < 2e-16 ***
## sqft_living15  2.619e+01  4.024e+00   6.508 7.77e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 248500 on 21602 degrees of freedom
## Multiple R-squared:  0.542,  Adjusted R-squared:  0.5419 
## F-statistic:  6391 on 4 and 21602 DF,  p-value: < 2.2e-16

We can judge the goodness of fit of the model by looking at the adjusted R2 value, which tells us how well our dependent variable can be explained by our independent variables. Using those 4 features, we get an adjusted R2 value of 0.5419, which is quite poor. Let’s compare this to a model with all the features included.
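
To keep that number handy for the comparison, one can store the fitted model and pull the adjusted R2 out programmatically (a sketch; model_4 is a name introduced here for illustration only):

model_4 <- lm(price ~ sqft_living + grade + sqft_above + sqft_living15, data = house)
summary(model_4)$adj.r.squared   # about 0.54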

model_all <- lm(price ~ ., data = house)
summary(model_all)
## 
## Call:
## lm(formula = price ~ ., data = house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1341541  -109312   -10670    89963  4289643 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.194e+06  1.385e+05  44.733  < 2e-16 ***
## bedrooms      -3.928e+04  2.027e+03 -19.380  < 2e-16 ***
## bathrooms      4.567e+04  3.493e+03  13.075  < 2e-16 ***
## sqft_living    1.670e+02  4.668e+00  35.778  < 2e-16 ***
## sqft_lot      -1.789e-03  5.129e-02  -0.035  0.97217    
## floors         2.694e+04  3.785e+03   7.118 1.13e-12 ***
## waterfront1    5.787e+05  1.864e+04  31.047  < 2e-16 ***
## view           4.332e+04  2.274e+03  19.051  < 2e-16 ***
## condition      1.955e+04  2.497e+03   7.828 5.19e-15 ***
## grade          1.197e+05  2.250e+03  53.220  < 2e-16 ***
## sqft_above    -6.258e+00  4.545e+00  -1.377  0.16858    
## sqft_basement         NA         NA      NA       NA    
## yr_built      -3.569e+03  7.100e+01 -50.274  < 2e-16 ***
## yr_renovated   1.036e+01  3.916e+00   2.645  0.00818 ** 
## sqft_living15  2.488e+01  3.601e+00   6.909 5.03e-12 ***
## sqft_lot15    -5.503e-01  7.837e-02  -7.022 2.25e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216100 on 21592 degrees of freedom
## Multiple R-squared:  0.6538, Adjusted R-squared:  0.6536 
## F-statistic:  2912 on 14 and 21592 DF,  p-value: < 2.2e-16

When we use all the features, the adjusted R2 increases by quite a significant amount, from 0.5419 to 0.6536. Let’s check the RMSE of our model on the test set to gauge the out-of-sample error. RMSE measures, on average, how far our predictions are from the actual data.

RMSE(y_pred = predict(model_all, newdata = test),
     y_true = test$price)
## [1] 169260.4
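
For reference, RMSE is just the square root of the mean squared prediction error; computing it by hand (a sketch) should reproduce the value above:

sqrt(mean((predict(model_all, newdata = test) - test$price)^2))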

The RMSE is quite high, so this is still a poor model. Since there are still statistically insignificant variables, and to improve the model further, I will use backward stepwise regression. This method iteratively examines the contribution of every independent variable in the regression model: it starts from a model containing all candidate variables and, at each step, deletes the variable whose removal lowers the AIC (Akaike Information Criterion, a measure of information loss) the most.
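
To make the criterion concrete: step() compares the AIC of the current model with the AIC of each model obtained by dropping a single term, and removes the term whose omission lowers the AIC the most. A hand-rolled comparison for one candidate term (a sketch; sqft_lot is chosen purely as an illustration):

AIC(model_all)                             # AIC of the full model
AIC(update(model_all, . ~ . - sqft_lot))   # AIC after dropping one term; lower is better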

model_back <- step(model_all, direction = "backward", trace = 0)
summary(model_back)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors + 
##     waterfront + view + condition + grade + yr_built + yr_renovated + 
##     sqft_living15 + sqft_lot15, data = house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1344213  -109623   -10716    89998  4287299 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.214e+06  1.376e+05  45.146  < 2e-16 ***
## bedrooms      -3.928e+04  2.026e+03 -19.384  < 2e-16 ***
## bathrooms      4.646e+04  3.445e+03  13.486  < 2e-16 ***
## sqft_living    1.629e+02  3.575e+00  45.561  < 2e-16 ***
## floors         2.476e+04  3.436e+03   7.206 5.94e-13 ***
## waterfront1    5.778e+05  1.862e+04  31.023  < 2e-16 ***
## view           4.389e+04  2.235e+03  19.637  < 2e-16 ***
## condition      1.975e+04  2.492e+03   7.925 2.40e-15 ***
## grade          1.194e+05  2.239e+03  53.338  < 2e-16 ***
## yr_built      -3.578e+03  7.068e+01 -50.626  < 2e-16 ***
## yr_renovated   1.031e+01  3.916e+00   2.633  0.00846 ** 
## sqft_living15  2.392e+01  3.528e+00   6.781 1.22e-11 ***
## sqft_lot15    -5.604e-01  5.556e-02 -10.085  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216100 on 21594 degrees of freedom
## Multiple R-squared:  0.6537, Adjusted R-squared:  0.6536 
## F-statistic:  3398 on 12 and 21594 DF,  p-value: < 2.2e-16
RMSE(y_pred = predict(model_back, newdata = test),
     y_true = test$price)
## [1] 169384.2

So, the model with the least information loss (lowest AIC) is the one that drops sqft_lot and sqft_above; sqft_basement was already undefined in the full model because of collinearity. Its fit is essentially the same as the full model’s, with an adjusted R2 of 0.6536 and a test RMSE of 169,384.2. We can pretty much conclude that the best multiple linear regression model we can construct from this data is still fairly poor. Before going further, let’s check whether the model fulfills all the multiple linear regression assumptions.

4.2 Assumptions

Let’s first check on the assumption of normality.

hist(model_back$residuals,
     col = "skyblue",
     main = "Histogram of Residuals",
     xlab = "Residuals")

nortest::ad.test(model_back$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  model_back$residuals
## A = 470.68, p-value < 2.2e-16
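
Alongside the test, a normal Q-Q plot of the residuals gives a quick visual check (a sketch):

qqnorm(model_back$residuals, main = "Normal Q-Q Plot of Residuals")
qqline(model_back$residuals, col = "red")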

I used the Anderson-Darling normality test instead of the Shapiro-Wilk test because the sample is too large for the latter (shapiro.test() in R only accepts up to 5,000 observations). From the histogram and the statistical test, we can see that the residuals are not normally distributed. This is likely caused by the outliers in our data, and we must be cautious when using the model, since the errors will not behave like symmetric noise around 0. Let’s now check the homoscedasticity assumption.

plot(model_back$fitted.values, model_back$residuals,
     xlab = "Fitted Values",
     ylab = "Residuals")
abline(h = 0, col = "red")

bptest(model_back)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_back
## BP = 2765.5, df = 12, p-value < 2.2e-16

Since the p-value is less than 0.05, we reject the null hypothesis and conclude that our model violates the homoscedasticity assumption as well: the residual variance is not constant, which suggests there are patterns the model was unable to learn. Now for the last assumption, multicollinearity.
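
As a brief aside before that last check: if we still want to report coefficient inference from this model despite the heteroscedasticity, heteroscedasticity-consistent (robust) standard errors are a common workaround. A sketch, assuming the sandwich package is available:

lmtest::coeftest(model_back, vcov. = sandwich::vcovHC(model_back, type = "HC1"))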

data.frame("VIF" = vif(model_back))

Since all VIF values are below 10, we can conclude that multicollinearity is not present in our model. Given that the model violates two of the assumptions, there is room for improvement, or perhaps multiple linear regression is simply not flexible enough to capture all the patterns in the data.

4.3 Model Improvement

Our target variable is right-skewed to begin with, so a handful of extreme values can dominate the least-squares fit. I would like to see what happens if we remove all outliers in the target variable and log-transform every numeric variable in the data. This should pull the variables’ distributions closer to a bell curve and make the fit less sensitive to extreme values. For the log transformation, since there are quite a few zeros in the data, I will add a constant and use log10(x+1) instead of a plain logarithm.
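
The +1 offset matters because several columns (e.g. sqft_basement, yr_renovated, view) contain zeros, and the logarithm of zero is not finite:

log10(0)      # -Inf, which would break the regression
log10(0 + 1)  # 0, so zeros stay zeros after the transform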

log_tf <- function(x){
  return(log10(x + 1))   # +1 offset so that zero values stay finite
}
house2 <- house %>% 
  filter(price < quantile(house$price, 0.75) + 1.5 * IQR(house$price)) %>%  # drop upper-tail outliers (1.5 * IQR rule)
  mutate_if(is.numeric, log_tf)                                             # apply log10(x + 1) to every numeric column

house2

Splitting the data into train and test set.

set.seed(417)
idx_tf <- sample(nrow(house2), nrow(house2) * 0.7)
train_tf <- house2[idx_tf, ]
test_tf <- house2[-idx_tf, ]

Creating multiple linear regression model using transformed data.

model_tf <- step(object = lm(price ~ ., data = train_tf),
                 direction = "backward",
                 trace = F)

summary(model_tf)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors + 
##     waterfront + view + condition + grade + sqft_above + sqft_basement + 
##     yr_built + yr_renovated + sqft_living15 + sqft_lot15, data = train_tf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50260 -0.08692  0.00982  0.08808  0.79153 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   35.230172   0.800623  44.003  < 2e-16 ***
## bedrooms      -0.174142   0.015636 -11.137  < 2e-16 ***
## bathrooms      0.121389   0.018315   6.628 3.52e-11 ***
## sqft_living    0.196043   0.030836   6.358 2.11e-10 ***
## floors         0.247268   0.018228  13.565  < 2e-16 ***
## waterfront1    0.058243   0.021296   2.735  0.00625 ** 
## view           0.079930   0.008550   9.349  < 2e-16 ***
## condition      0.207228   0.019524  10.614  < 2e-16 ***
## grade          1.541827   0.032827  46.969  < 2e-16 ***
## sqft_above     0.096171   0.030241   3.180  0.00148 ** 
## sqft_basement  0.016296   0.002045   7.968 1.74e-15 ***
## yr_built      -9.983872   0.243796 -40.952  < 2e-16 ***
## yr_renovated   0.003302   0.001797   1.837  0.06623 .  
## sqft_living15  0.283654   0.012660  22.406  < 2e-16 ***
## sqft_lot15    -0.051048   0.003767 -13.550  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1282 on 14307 degrees of freedom
## Multiple R-squared:  0.5624, Adjusted R-squared:  0.562 
## F-statistic:  1313 on 14 and 14307 DF,  p-value: < 2.2e-16

We ended up with a lower adjusted R2 value. Let’s see the RMSE.

predict_tf <- predict(model_tf, newdata = test_tf)
RMSE(y_pred = 10^(predict_tf) - 1,
     y_true = 10^(test_tf$price) - 1)
## [1] 136126.8

Although the adjusted R2 is lower, the RMSE on the test set is lower as well. Let’s check whether the new model violates the linear model assumptions.

hist(model_tf$residuals,
     col = "skyblue",
     main = "Histogram of Residuals",
     xlab = "Residuals")

nortest::ad.test(model_tf$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  model_tf$residuals
## A = 21.792, p-value < 2.2e-16

Although the histogram looks only slightly skewed, statistically the residuals are still not normally distributed. So the new model also violates the first assumption, since the errors are not normally distributed.

plot(model_tf$fitted.values, model_tf$residuals,
     xlab = "Fitted Values",
     ylab = "Residuals")
abline(h = 0, col = "red")

bptest(model_tf)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_tf
## BP = 541.44, df = 14, p-value < 2.2e-16

The homoscedasticity assumption is also violated, which means that there are patterns that the model was unable to learn. Let’s check the last assumption, multicollinearity.

data.frame("VIF" = vif(model_tf))

The new model turns out to violate the multicollinearity assumption as well, most likely because sqft_living is (approximately) the sum of sqft_above and sqft_basement. I found that removing sqft_above results in a slightly lower RMSE, but the previous two assumptions are still violated. It seems that multiple linear regression is not an appropriate model for this data.
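
For reference, the refit without sqft_above that I tested looks roughly like this (a sketch; model_tf2 is a name used only here):

model_tf2 <- update(model_tf, . ~ . - sqft_above)
RMSE(y_pred = 10^(predict(model_tf2, newdata = test_tf)) - 1,
     y_true = 10^(test_tf$price) - 1)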

5. Conclusion

For predicting housing prices in King County, USA, multiple linear regression performs poorly; it was unable to capture all the patterns in the data. The best adjusted R2 obtained with a multiple linear model was 0.6536, with a test RMSE of about 169,000. Removing outliers and log-transforming the numeric features did not help much either: it lowered the adjusted R2 to 0.562 and lowered the RMSE to 136,126.8, which is still high. All of the models also violate several linear regression assumptions. For this data, it is recommended to try other machine learning methods such as decision trees, random forests, or deep learning.
