Introduction

In this project we try to predict Warsaw real estate prices. The data comes from a GitHub user (link) and consists of real web-scraped listings. It is fairly large: 25240 rows x 75 columns. We will use RMSE as the metric to compare models. There will be 8 models, including simple ones like linear regression and k-nearest neighbours, but also newer tree-based approaches like XGBoost. Our goal is to get an intuition about which type works better. The dependent variable is Price and describes the price of a piece of real estate. Because its distribution is heavily skewed, we are going to model a log transformation of this variable.
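For reference, RMSE is the square root of the mean squared difference between predictions and actual values; caret's RMSE() function, used throughout, computes exactly this. A one-line equivalent for the curious:

#root mean squared error: sqrt of the average squared prediction error
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))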

Data preparation

Loading libraries

library(tidyverse)
library(car)
library(pROC)
library(lmtest)
library(caret)
library(DMwR)
library(ranger)
library(corrplot)
library(DT)

Reading the data and printing the first 20 observations.

df <- readxl::read_xlsx('RE_models_input_enriched.xlsx')
DT::datatable(head(df,20),extensions = 'FixedColumns',options = list(
  scrollX = TRUE))

The data has 75 columns; at first glance we can spot that some of them are transformations of others. We need to tidy the data. First we look at the dependent variable, Price. It may require a log transformation, as price variables are usually skewed. Let's make a histogram of Price.

ggplot(df,
       aes(x = Price)) +
  geom_histogram(fill = "pink",
                 col='grey',
                 bins = 100) + ggtitle('Price distribution')  +
  theme_bw()

We can see that it is skewed, with a fat right tail. This is no surprise: there are probably some very expensive apartments. Let's plot a histogram of the log transformation.

ggplot(df,
       aes(x = log(Price))) +
  geom_histogram(fill = "pink",
                 col='grey',
                 bins = 100) + ggtitle('log(Price) distribution')  +
  theme_bw()

#transforming Price to log
df$Price <- log(df$Price)

Now it looks much better, so we are going to model the log-transformed version. Normally such transformations should be done on the training sample only, but a log transformation does not cause data leakage, and a skewed price variable is a common situation, so we did it at the beginning.
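One practical consequence is that the models will predict on the log scale, so predictions have to be back-transformed with exp() before being reported on the original price scale. A minimal sketch (pred_log is a hypothetical vector of model predictions):

#hypothetical back-transformation of log-scale predictions
pred_log <- c(12.8, 13.4)   #example predicted log(Price) values
pred_price <- exp(pred_log) #back to the original price scale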
Next we will take a closer look at the character variables. Let's print the number of unique values of each.

df_character_vars <- 
  sapply(df, is.character) %>% 
  which() %>% 
  names()
sapply(df[, df_character_vars], 
        function(x) 
          unique(x) %>% 
          length()) %>% 
  sort()
##                City          offer_date              market Construction_status 
##                   1                   2                   2                   4 
##        Windows_type  Building_ownership             Heating       Building_type 
##                   4                   5                   7                   9 
##           rooms_num   Building_material            floor_no        district_old 
##                  11                  11                  15                  17 
##            district         subdistrict 
##                  17                 140

There are some suspicious results. We will delete City immediately, as it has only 1 value. Let's look closer at the others.

table(df$offer_date)
## 
## 2020-05-24 2020-07-07 
##       7018      18222

It looks like there are only two dates. We will delete this variable; it does not tell us much. They are probably just the dates on which the data were scraped.

table(df$market)
## 
##   primary secondary 
##      8287     16953

The market variable looks fine.

table(df$Construction_status)
## 
## not_specified  ready_to_use to_completion to_renovation 
##          8619          9074          6062          1485

The Construction_status variable looks fine as well.

table(df$Windows_type)
## 
##     aluminium not_specified       plastic        wooden 
##           196         12995          9532          2517

The Windows_type variable looks reasonable too.

table(df$Heating)
## 
##   boiler_room    electrical           gas not_specified         other 
##           185            70           543         10265           562 
##   tiled_stove         urban 
##             1         13614
df$Heating[df$Heating=='tiled_stove'] <- 'not_specified'

In Heating, tiled_stove has only 1 instance, so we merge it into not_specified.

table(df$Building_type)
## 
##     apartment         block      detached         house        infill 
##          6271         10906             1           107             7 
##          loft not_specified        ribbon      tenement 
##             5          5588            90          2265
#merging levels
df$Building_type[df$Building_type=='loft'] <- 'not_specified'
df$Building_type[df$Building_type=='detached'] <- 'not_specified'
df$Building_type[df$Building_type=='infill'] <- 'not_specified'

In Building_type there are 3 rare levels that should be merged: loft, detached and infill are changed to not_specified.
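As a side note, such merges can be written in a single call with forcats::fct_lump_min, which collapses every level rarer than a threshold; with min = 10 it reproduces the manual recoding above (shown as an equivalent alternative, not what we actually ran):

#alternative: collapse all levels with fewer than 10 observations
df$Building_type <- forcats::fct_lump_min(factor(df$Building_type),
                                          min = 10,
                                          other_level = "not_specified")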

table(df$rooms_num)
## 
##    1   10    2    3    4    5    6    7    8    9 more 
## 2491    1 8990 9200 3499  815  188   31    7    9    9
#merging levels
df$rooms_num[df$rooms_num %in% c('7','8','9','more','10')] <- '7+'

In rooms_num, every value of 7 or more is merged into a 7+ category.

table(df$Building_material)
## 
##         breezeblock               brick   cellular_concrete            concrete 
##                 388                6509                  70                 115 
##      concrete_plate            hydroton       not_specified               other 
##                1344                   5               13189                2629 
## reinforced_concrete             silikat                wood 
##                 656                 332                   3
#merging hydroton and wood to other
df$Building_material[df$Building_material %in% c('hydroton','wood')] <- 'other'

In Building_material, hydroton and wood are merged into other.

table(df$floor_no)
## 
##          cellar         floor_1        floor_10         floor_2         floor_3 
##               6            4980             447            4044            3471 
##         floor_4         floor_5         floor_6         floor_7         floor_8 
##            2733            1577            1149             753             616 
##         floor_9 floor_higher_10          garret    ground_floor   not_specified 
##             445             773               3            3412             831
#merging garret and cellar to not_specified
df$floor_no[df$floor_no %in% c('garret','cellar')] <- 'not_specified'

In floor_no, cellar and garret are merged into not_specified.
There are also 3 location variables: district_old, district and subdistrict. We will delete subdistrict, as it has too many categories, and district_old, which is pretty much the same variable as district. All planned deletions are carried out together at the end of the data preparation.
Now we will transform all those character variables to factors. Then we will check how many unique values the numeric variables have.

#changing character variables to factor type
df[df_character_vars] <- lapply(df[df_character_vars] , factor)

df_numeric_vars <- 
  sapply(df, is.numeric) %>% 
  which() %>% 
  names()
#checking number of unique observations
numeric_unique <- sapply(df[, df_numeric_vars], 
        function(x) 
          unique(x) %>% 
          length()) %>% 
  sort() %>% as.data.frame()
datatable(numeric_unique)

We can see that there are many variables with only 2 unique values. They are probably dummy variables and should be transformed to factors. Next we check which of the numeric variables have near-zero variance.

#changing columns from equipment to security to factors
to_fac <- df %>% 
  select(Equipment_types_dishwasher:Security_types_roller_shutters) %>%
  colnames()
df[to_fac] <- lapply(df[to_fac] , factor)
zero_names <- df[, df_numeric_vars] %>% 
          nearZeroVar(names=TRUE)
zero_names
## [1] "Equipment_types_tv"             "Extras_types_attic"            
## [3] "Extras_types_two_storey"        "Media_types_cable_television"  
## [5] "Media_types_electricity"        "Media_types_sewage"            
## [7] "Media_types_water"              "Security_types_alarm"          
## [9] "Security_types_roller_shutters"

Nine of them are flagged as suspicious. Let's check their distributions; we will delete only those that are highly unbalanced.
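For the curious, nearZeroVar can also return the diagnostics behind these flags via saveMetrics: the frequency ratio of the two most common values and the percentage of unique values. A quick optional sketch:

#inspecting the metrics behind the near-zero-variance flags
nzv_metrics <- nearZeroVar(df[, df_numeric_vars], saveMetrics = TRUE)
nzv_metrics[zero_names, ]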

foo_table <- table(df[,zero_names[1]])
for (name in zero_names[-1]){
  foo_table <- rbind(foo_table,table(df[,name]))
}
rownames(foo_table) <- zero_names
foo_table %>%
  kableExtra::kbl() %>%
  kableExtra::kable_classic("striped", full_width = F)
                                    0    1
Equipment_types_tv              24446  794
Extras_types_attic              25239    1
Extras_types_two_storey         24837  403
Media_types_cable_television    25239    1
Media_types_electricity         25238    2
Media_types_sewage              25239    1
Media_types_water               25238    2
Security_types_alarm            24408  832
Security_types_roller_shutters  25023  217

The table above shows their distributions. We will delete those that have only 1 or 2 observations in the '1' column, i.e. 5 of these variables. They would be completely irrelevant in modelling.

#deleting all variables mentioned so far
#this variable has a '-' in its name, so as a workaround we copy it to a clean name before dropping both
df$Media_types_cable_television <- df$`Media_types_cable-television`
#defining variable with names
drops <- c("City","offer_date",'subdistrict','Extras_types_attic','Media_types_cable_television','Media_types_electricity','Media_types_sewage','Media_types_water','unit_price','floor_num',"Id",'Media_types_cable-television')
#filtering data 
df <- df[ , !(names(df) %in% drops)]
ncol(df)
## [1] 63

Finally, we partition the data into an 80% train and a 20% test sample.

set.seed(987654321)
#splitting the data 
data_which_train <- createDataPartition(df$Price, # target variable
                                          # share of the training sample
                                          p = 0.8, 
                                          # should result be a list?
                                          list = FALSE)
#filtering to train
df_train <- df[data_which_train,]
#filtering to test
df_test <- df[-data_which_train,]
#summarising train and test to compare
matrix_compare <- rbind(summary(df_train$Price),summary(df_test$Price))
rownames(matrix_compare) <- c('train','test')
matrix_compare %>%
  kableExtra::kbl() %>%
  kableExtra::kable_classic("striped", full_width = F)
       Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
train  11.91170  12.99453  13.25164  13.34013  13.57979  16.82613
test   11.68688  12.99225  13.25164  13.33794  13.57979  16.41561
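The quartiles are nearly identical. For a more formal check, a two-sample Kolmogorov-Smirnov test could compare the full train and test distributions; a quick optional sketch (ties in Price may trigger a warning):

#testing whether train and test Price come from the same distribution
ks.test(df_train$Price, df_test$Price)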

Feature selection & data transformation

Now we check whether the numeric variables are correlated with each other. From each highly correlated pair we will delete one variable, keeping the one that sounds more natural and has the higher correlation with Price. Only pairs showing very high correlation are affected.

#selecting names of numeric variables
train_numeric_vars <- 
  # check if variable is numeric
  sapply(df_train, is.numeric) %>% 
  # select those which are
  which() %>% 
  # and keep just their names
  names()
train_numeric_vars
##  [1] "Area"                        "Price"                      
##  [3] "latitude"                    "longitude"                  
##  [5] "build_year"                  "building_floors_num"        
##  [7] "lon_mod"                     "lat_mod"                    
##  [9] "index"                       "grid_price"                 
## [11] "sample_size"                 "geo_Id"                     
## [13] "distance_transit_8AM"        "time_transit_8AM"           
## [15] "distance_driving_8AM"        "time_driving_8AM"           
## [17] "distance_return_transit_5PM" "time_return_transit_5PM"    
## [19] "distance_return_driving_5PM" "time_return_driving_5PM"    
## [21] "price_decrease_from_20k"     "price_decrease_per_10min"   
## [23] "restaurant_price_level"      "restaurant_mean_rating"     
## [25] "restaurant_mean_popularity"  "restaurant_count"           
## [27] "restaurant_ratings_count"

We can see that there are 27 numeric variables. We will plot two correlograms, one for the first half of the variables and one for the second.

#counting correlations
train_correlations <- 
    cor(df_train[,train_numeric_vars],
        use = "pairwise.complete.obs")
#ordering from +1 to -1
train_numeric_vars_order <- 
  # we take correlations with Price
  train_correlations[,"Price"] %>% 
  # sort them in the decreasing order
  sort(decreasing = TRUE) %>%
  # and extract just the variables' names
  names()
corrplot.mixed(train_correlations[train_numeric_vars_order[1:15], 
                                   train_numeric_vars_order[1:15]],
               upper = "square",
               lower = "number",
               tl.col="black", # color of labels (variable names)
               tl.pos = "lt")

Looking at the first correlogram, we can see that lon_mod is very highly correlated with longitude. We will delete lon_mod, as longitude sounds and looks more natural. We will also delete restaurant_ratings_count, as it is highly correlated with restaurant_count, which is slightly more correlated with Price. All other variables look fine.
Now we will draw the correlation plot for the second part.

corrplot.mixed(train_correlations[train_numeric_vars_order[c(1,15:27)], 
                                   train_numeric_vars_order[c(1,15:27)]],
               upper = "square",
               lower = "number",
               tl.col="black", # color of labels (variable names)
               tl.pos = "lt")

We can see that many more variables are correlated with each other here. We will delete lat_mod, geo_Id and index, which are correlated with latitude. Moreover, there is a group of variables named 'time_...' and 'distance_...' that are highly correlated. We will keep only one of them, distance_transit_8AM, as it has the highest correlation with Price; also, most people use transit and care about how fast they can get to work in the morning. The time variables are derived from distance, so it makes more sense to keep distance.
Furthermore, there might be some data leakage here. As there was no description of what each variable means, we can only make the selection based on purely statistical analysis. For example, price_decrease_from_20k sounds suspicious.
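As a cross-check of this manual selection, caret::findCorrelation can propose deletions automatically given a cutoff; a sketch, with the 0.9 cutoff being our assumption:

#automated suggestion: drop one variable from each pair with |cor| > 0.9
findCorrelation(train_correlations, cutoff = 0.9, names = TRUE)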

train_drops <- c('lon_mod','lat_mod','restaurant_ratings_count','geo_Id',
                 'time_driving_8AM','time_return_driving_5PM',
                 "time_return_transit_5PM",'time_transit_8AM',
                 'distance_return_driving_5PM','distance_driving_8AM',
                 'district_old','distance_return_transit_5PM','index')
df_train <- df_train[ , !(names(df_train) %in% train_drops)]

The last step is to choose the final variables for the models using Random Forest variable importance.

options("scipen"=100, "digits"=4)

#Fitting Random Forest to train data
set.seed(987654321)
ranger_model <- ranger::ranger(Price~.,df_train, importance = "impurity_corrected")
#printing variable importance
imp_vars <- importance(ranger_model) %>% 
  data.frame() %>% arrange(desc(.)) 
imp_vars %>%
  kableExtra::kbl() %>%
  kableExtra::kable_classic("striped", full_width = F) %>% 
  kableExtra::scroll_box(height = "400px") 
variable                            importance
Area 1421.5408
rooms_num 744.7496
price_decrease_from_20k 286.9759
grid_price 270.5109
distance_transit_8AM 116.1758
latitude 115.4145
restaurant_count 114.3624
build_year 108.8691
Building_type 106.4479
price_decrease_per_10min 91.3255
building_floors_num 85.0382
district 76.1436
longitude 75.7134
restaurant_mean_rating 74.0507
restaurant_price_level 71.6801
restaurant_mean_popularity 62.6309
Windows_type 44.5182
Extras_types_garage 43.3451
market 33.8212
Extras_types_terrace 33.3808
Construction_status 25.4553
floor_no 24.8035
Building_ownership 17.1149
sample_size 16.9539
Extras_types_air_conditioning 15.0466
Building_material 10.4956
Security_types_monitoring 9.5471
Extras_types_two_storey 8.8318
Heating 7.1137
Extras_types_balcony 6.6870
Extras_types_basement 5.9224
Extras_types_lift 5.2697
Extras_types_usable_room 4.5745
Security_types_entryphone 4.2427
Security_types_alarm 3.8526
Equipment_types_furniture 3.2810
Media_types_phone 2.9625
Equipment_types_dishwasher 2.5976
Security_types_closed_area 2.5450
Media_types_internet 2.5029
Extras_types_garden 2.0550
Security_types_anti_burglary_door 1.9770
Extras_types_separate_kitchen 1.7242
Equipment_types_tv 0.8692
Equipment_types_oven 0.7854
Equipment_types_stove 0.5760
Equipment_types_fridge 0.5122
Equipment_types_washing_machine 0.3963
Security_types_roller_shutters 0.0381

We decided to keep only the 20 variables with the highest importance.
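Since we fitted ranger with importance = "impurity_corrected", the cutoff of 20 could also be sanity-checked with p-values for the importance scores; a sketch (the Janitza method is our suggestion and assumes enough variables with negative scores):

#optional: p-values for the corrected impurity importance scores
ranger::importance_pvalues(ranger_model, method = "janitza")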

df_train <- df_train %>% select(all_of(rownames(imp_vars)[1:20]), Price)
df_test <- df_test %>% select(all_of(rownames(imp_vars)[1:20]), Price)
rm(ranger_model)

Training

Having everything prepared, we will now train 8 models. The training process is pretty much the same for all of them: we use caret's train function with the appropriate method name, and some models include tuning parameters. We comment line by line only for the first method; the rest follow the same pattern. We will use 5-fold cross-validation throughout.
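Note that allowParallel = TRUE below only takes effect if a parallel backend has been registered beforehand; a minimal sketch with doParallel (the number of workers is an assumption):

#registering a parallel backend so caret can run CV folds in parallel
library(doParallel)
cl <- makePSOCKcluster(4) #assumption: 4 worker processes
registerDoParallel(cl)
#...train the models...
#stopCluster(cl) once training is finished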

options(contrasts = c("contr.treatment",  # for non-ordinal factors
                      "contr.treatment")) # for ordinal factors
ctrl_cv5 <- trainControl(method = "cv",
                          number=5,
                         allowParallel=TRUE)
#fitting linear regression
set.seed(987654321)
lm_model <- train(Price ~ ., 
        data = df_train,
        method = "lm",
        trControl = ctrl_cv5,
        metric='RMSE')

#predicting train data with the cross-validated linear regression
train_lm = predict(lm_model)
#computing train RMSE
train_lm_acc = RMSE(train_lm,df_train$Price)
#predicting test data
test_lm = predict(lm_model,df_test)
#computing test RMSE
test_lm_acc = RMSE(test_lm,df_test$Price)
#storing cross-validation RMSE
vec_valid <- mean(lm_model$resample[,"RMSE"])
#storing train prediction RMSE
vec_train <- train_lm_acc
#storing test prediction RMSE
vec_test <- test_lm_acc

#everything below follows the same procedure, so we will not comment on every model
#the only differences are the additional tuning parameters of some methods

#KNN
set.seed(987654321)
different_k <- data.frame(k = c(3))
knn_model <- 
  train(Price ~ ., 
        data = df_train ,
        method = "knn",
        trControl = ctrl_cv5,
        tuneGrid = different_k,
        preProcess = c("range"),
        #preProcess = c("center","scale"),
        metric='RMSE')

train_knn = predict(knn_model)
train_knn_acc =  RMSE(train_knn,df_train$Price)
test_knn = predict(knn_model,df_test)
test_knn_acc = RMSE(test_knn,df_test$Price)
vec_valid <- rbind(vec_valid,mean(knn_model$resample$RMSE))
vec_train <- rbind(vec_train,train_knn_acc)
vec_test <- rbind(vec_test,test_knn_acc)

#Random Forest
tgrid <- expand.grid(
  .mtry = 15,
  .splitrule = "variance",
  .min.node.size = 2
) 
set.seed(987654321)
rf_gridsearch <- train(Price ~ ., 
                       data = df_train,
                       method = 'ranger',
                       metric = 'RMSE',
                       tuneGrid = tgrid,
                       trControl = ctrl_cv5,
                       verbose=TRUE)
train_rf= predict(rf_gridsearch)
train_rf_acc = RMSE(train_rf,df_train$Price)
test_rf = predict(rf_gridsearch,df_test)
test_rf_acc = RMSE(test_rf,df_test$Price)
vec_valid <- rbind(vec_valid,mean(rf_gridsearch$resample$RMSE))
vec_train <- rbind(vec_train,train_rf_acc)
vec_test <- rbind(vec_test,test_rf_acc)

#Extreme Gradient Boosting
set.seed(987654321)
tune_grid <- expand.grid(nrounds = 200,
                        max_depth = 4,
                        eta = 0.05,
                        gamma = 0,
                        colsample_bytree = 0.5,
                        min_child_weight = 0,
                        subsample = 0.5)

xgb_fit <- train(Price ~., data = df_train, method = "xgbTree",
                trControl=ctrl_cv5,
                tuneGrid = tune_grid,
                objective = "reg:squarederror")
train_xgb= predict(xgb_fit)
train_xgb_acc = RMSE(train_xgb,df_train$Price)
test_xgb = predict(xgb_fit,df_test)
test_xgb_acc = RMSE(test_xgb,df_test$Price)
vec_valid <- rbind(vec_valid,mean(xgb_fit$resample$RMSE))
vec_train <- rbind(vec_train,train_xgb_acc)
vec_test <- rbind(vec_test,test_xgb_acc)


#Neural Network
set.seed(987654321)
nnet_fit <- train(Price ~., data = df_train, method = "nnet",
                trControl=ctrl_cv5,
                 maxit=1000,
                preProcess = c('center', 'scale'),
                metric='RMSE', 
                trace=FALSE,
                linout=TRUE,
                tuneGrid=expand.grid(size=c(5), decay=c(0.1)))
train_nn= predict(nnet_fit)
train_nn_acc = RMSE(train_nn,df_train$Price)
test_nn = predict(nnet_fit,df_test)
test_nn_acc = RMSE(test_nn,df_test$Price)
vec_valid <- rbind(vec_valid,mean(nnet_fit$resample$RMSE))
vec_train <- rbind(vec_train,train_nn_acc)
vec_test <- rbind(vec_test,test_nn_acc)

#Elastic net
set.seed(987654321)
parameters_elastic2 <- expand.grid(alpha = seq(0, 1, 0.1), 
                                   lambda = seq(0.01, 0.4, 0.01))

glm_fit <- train(Price ~., data = df_train, method = "glmnet",
                trControl=ctrl_cv5,
                metric='RMSE',
                tuneGrid = parameters_elastic2
                )
train_glm= predict(glm_fit)
train_glm_acc = RMSE(train_glm,df_train$Price)
test_glm = predict(glm_fit,df_test)
test_glm_acc = RMSE(test_glm,df_test$Price)
vec_valid <- rbind(vec_valid,mean(glm_fit$resample$RMSE))
vec_train <- rbind(vec_train,train_glm_acc)
vec_test <- rbind(vec_test,test_glm_acc)

#Gradient Boosting
set.seed(987654321)
parameters_gbm<- expand.grid(interaction.depth=c(4),
                             n.trees=c(250),
                             shrinkage=0.1,
                             n.minobsinnode=10)
gbm_fit <- train(Price ~ ., data = df_train, 
                 method = "gbm",
                 metric='RMSE',
                 trControl = ctrl_cv5, 
                 tuneGrid = parameters_gbm,
                 verbose = FALSE)
train_gbm= predict(gbm_fit)
train_gbm_acc = RMSE(train_gbm,df_train$Price)
test_gbm = predict(gbm_fit,df_test)
test_gbm_acc = RMSE(test_gbm,df_test$Price)
vec_valid <- rbind(vec_valid,mean(gbm_fit$resample$RMSE))
vec_train <- rbind(vec_train,train_gbm_acc)
vec_test <- rbind(vec_test,test_gbm_acc)

#Partial Least Squares
set.seed(987654321)
pls_fit <- train(Price ~ ., data = df_train, 
                 method = "pls",
                 metric='RMSE',
                 preProc = c("center", "scale"),
                 tuneLength = 10,
                 maxit=1000,
                 trControl = ctrl_cv5)
train_pls= predict(pls_fit)
train_pls_acc = RMSE(train_pls,df_train$Price)
test_pls = predict(pls_fit,df_test)
test_pls_acc = RMSE(test_pls,df_test$Price)
vec_valid <- rbind(vec_valid,mean(pls_fit$resample$RMSE))
vec_train <- rbind(vec_train,train_pls_acc)
vec_test <- rbind(vec_test,test_pls_acc)

After all the models are trained, we create a matrix of results.

#column binding train prediction,cross-validation and test prediction results
df_results <- cbind(vec_train,vec_valid,vec_test)
#changing rownames
rownames(df_results) <- c('lm','knn','rf', 'xgb','nnet','elastic',
                          'gbm','pls')
#changing colnames
colnames(df_results) <- c('train','cross_validation','test')

The plot below shows the cross-validation results. First, we can see that KNN is the worst model; an RMSE of 0.278 is far worse than the other methods. Next there are three models with very similar performance: elastic net, linear regression and partial least squares. This is no surprise, as they all share a similar concept; their RMSE is around 0.191. The neural network does better, with an RMSE of 0.154. XGBoost and GBM perform slightly better still, with an RMSE of 0.145. The best method is Random Forest, with an RMSE of 0.130.

#storing a list of resamples to make a dotplot of the cross-validation RMSE results
cvValues <- resamples(list(elastic = glm_fit,nn=nnet_fit,
                           xgb=xgb_fit,
                           lm=lm_model,rf=rf_gridsearch,
                           gbm=gbm_fit,pls=pls_fit,knn=knn_model))
#plotting the models' cross-validation results
dotplot(cvValues, metric = "RMSE", main = 'Cross-validation results')

Now let's check how the models predict out-of-sample data. As the table shows, most models perform slightly better on the test data than in cross-validation; there could be some extreme, unusual observations in the train folds. KNN is still the worst. For lm, elastic net and pls the test RMSE is 0.187. The neural network's RMSE is the same on the test data, and likewise for gbm and xgb. Random Forest's test RMSE of 0.128 is even better than its cross-validation score.
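Since the models predict log(Price), a test RMSE translates into a typical multiplicative price error of roughly exp(RMSE) - 1. A quick back-of-the-envelope check:

#translating log-scale RMSE into approximate percentage price errors
exp(0.128) - 1 #Random Forest: about 14% typical error
exp(0.271) - 1 #KNN: about 31% typical error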

#printing table of train/cross-validation/test results
df_results %>% as.data.frame() %>% arrange(cross_validation) %>%
  kableExtra::kbl(digits = 3,caption = 'RMSE') %>%
  kableExtra::kable_classic("striped", full_width = F)
RMSE

          train  cross_validation   test
rf        0.050             0.130  0.128
gbm       0.137             0.145  0.145
xgb       0.137             0.145  0.145
nnet      0.148             0.154  0.154
lm        0.189             0.190  0.187
pls       0.189             0.191  0.187
elastic   0.190             0.191  0.187
knn       0.185             0.278  0.271

Summary

To sum up, tree-based models are the best, especially Random Forest. They easily detect nonlinearities, and their performance is far better than standard methods like linear regression. The neural network also looks promising on this dataset; maybe with more iterations and better tuning it could beat the tree-based methods. Moreover, applying explainable artificial intelligence could be interesting here, to check why Random Forest did better than the gradient boosting methods.
We were not pleased that there was no description of the variables. More domain knowledge could have led to better data preparation and selection. Preparing and tidying the data took the biggest part of our time in this project; we think it is the crucial stage of modelling.