Please click here to visit my github page.

Introduction

The goal of this project is to predict the sale price of residential homes in Ames, Iowa. The dataset comprises 79 explanatory variables. An XGBoost regression algorithm will be used for prediction.

More information about the data can be found here: http://jse.amstat.org/v19n3/decock.pdf

Preparations

Load Packages

First, the relevant packages are loaded. We load a range of libraries for general data wrangling and visualisation together with more specialised tools for modelling.

# install any missing packages and make sure every package is actually loaded
packages <- c("here", "tidymodels", "tidyverse", "janitor", "EnvStats",
              "ggpubr", "gridExtra", "vip", "visdat", "DataExplorer",
              "igraph", "xgboost", "knitr")

for (pkg in packages) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, repos = "http://cran.us.r-project.org")
    library(pkg, character.only = TRUE)
  }
}

Load Data

The data was downloaded from Kaggle and stored locally (login necessary). The here package is used to locate the files relative to the project root, and the clean_names() function from the janitor package is used to create clean, consistent variable names. We also take a first look at the amount of missing data:

training_data <- read_csv(here("Data", "train.csv")) %>%
                  clean_names()
testing_data <-  read_csv(here("Data", "test.csv")) %>%
                  clean_names()

combined <- bind_rows(training_data, testing_data)
train_rows <- nrow(training_data)

cat(sum(!complete.cases(training_data)), "out of", train_rows, "observations have at least one missing value.")
1460 out of 1460 observations have at least one missing value.
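
As a side note on clean_names(): it converts the original Kaggle column names to snake_case and prefixes names that start with a digit with an "x", which is why variables such as x1st_flr_sf and sale_price appear below. A tiny illustration with made-up values:

# toy example of what clean_names() does to two of the original column names
tibble(`1stFlrSF` = 856, SalePrice = 208500) %>% clean_names()
# the columns are renamed to x1st_flr_sf and sale_price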

Missing data

We already saw that every observation has at least one missing value. On the other hand, we have 79 explanatory variables in the training set, so we will first take a closer look at how the missing values are distributed across variables. The vis_dat() function from the visdat package gives a nice first graphical overview of the amount of missing data. In a second step, the DataExplorer package is used to identify (and later drop) all variables with a missing rate of more than 40% in the training set:

vis_dat(training_data)

Remove Missing Data

The next graph shows an overview of all variables with at least one missing value and their missing rate. The DataExplorer package is used to create this plot, and its output makes it easy to extract all variables with a missing rate of more than 40%, which will be dropped later in the preprocessing recipe:

missing_plot <- DataExplorer::plot_missing(training_data, missing_only = TRUE, ggtheme = theme_classic())

drop_cols <- missing_plot$data %>% filter(Band %in% c("Remove", "Bad")) %>% pull(feature)
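
The same set of columns can be double-checked directly with dplyr. This is a small, optional sanity check; it assumes (as in DataExplorer's default bands used above) that the "Bad" and "Remove" bands correspond to a missing rate above 40%:

# columns with more than 40% missing values in the training data
training_data %>%
  summarise(across(everything(), ~ mean(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "missing_rate") %>%
  filter(missing_rate > 0.4) %>%
  arrange(-missing_rate)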

Target Variable

Original Scale

We will continue by looking at the distribution of our target variable.

# hist
g1 <- ggplot(training_data, aes(x=sale_price)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white")+
  geom_density(alpha=.2, fill="indianred")+ 
  labs(x = "", y = "") +
  theme_classic() 

# boxplot
g2 <- ggplot(training_data, aes(y=sale_price)) + 
  geom_boxplot(aes(x=""), colour="black", fill="indianred", alpha=.2)+
  coord_flip()+ 
  labs(x = "", y = "") +
  theme_classic()

# qqplot
g3 <- ggplot(training_data, aes(sample = sale_price))+ 
  stat_qq()+
  stat_qq_line()+ 
  labs(x = "", y = "") + 
  theme_classic()

grid.arrange(g1, arrangeGrob(g2, g3, nrow=2), nrow = 1)

Notes:

  • All three graphs show clear evidence for a right-skewed distribution.
  • Common transformations for right-skewed data include log transformation.

Log-Transformation

Let's see if a log transformation helps to make the distribution more normal:

# hist
g1 <- ggplot(training_data, aes(x=log(sale_price))) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white")+
  geom_density(alpha=.2, fill="indianred")+ 
  labs(x = "", y = "") +
  theme_classic() 

# boxplot
g2 <- ggplot(training_data, aes(y=log(sale_price))) + 
  geom_boxplot(aes(x=""), colour="black", fill="indianred", alpha=.2)+
  coord_flip()+ 
  labs(x = "", y = "") +
  theme_classic()

# qqplot
g3 <- ggplot(training_data, aes(sample = log(sale_price)))+ 
  stat_qq()+
  stat_qq_line()+ 
  labs(x = "", y = "") + 
  theme_classic()

grid.arrange(g1, arrangeGrob(g2, g3, nrow=2), nrow = 1)

Notes:

  • The distribution looks much better after the log transformation.
  • All three graphs show almost no signs of remaining skewness (a quick numeric check follows below).
  • The log transformation will be applied in modelling.
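
As a numeric complement to the plots, the sample skewness can be computed before and after the transformation. This is a minimal sketch using a hand-rolled skewness function, so it relies on base R only:

# third-moment skewness; values near 0 indicate a roughly symmetric distribution
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(training_data$sale_price)       # clearly positive: right-skewed
skew(log(training_data$sale_price))  # much closer to zero after the log transform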

Exploratory Analysis

Important Predictors

In a first regression using all variables without any missing values as predictors, Overall Quality, First Floor Square Feet, Zoning Classification and Overall Condition were identified as potentially important predictors. Therefore, the relationship of these four variables with Sale Price is plotted below. The results of the regression used to identify these predictors can be found in the “Quick” Regression section below.


g1 <- ggplot(aes(as.factor(overall_qual), sale_price), data = training_data) +
  geom_bar(stat = "summary", fun = "mean", fill = "indianred") +
  stat_n_text(y.pos = 25000, size = 2) + 
  scale_fill_brewer(palette="Pastel1") +
  xlab("Quality") + 
  ylab("Sale Price") + 
  ggtitle("Price vs Overall Quality of Material") +
  theme_classic() 
#sale price increases steadily with overall quality

g2 <- ggplot(aes(x1st_flr_sf, sale_price), data = combined[!is.na(combined$sale_price),]) +
  geom_point(col = "indianred") +
  xlab("First floor square feet") + 
  ylab("Sale Price") + 
  ggtitle("Price vs First floor square feet") +
  theme_classic()


g3 <- ggplot(aes(reorder(ms_zoning,-sale_price), sale_price), data = training_data) +
  geom_bar(stat = "summary", fun = "mean", fill = "indianred") +
  stat_n_text(y.pos = 12500, size = 3) + 
  xlab("Zoning Classification") + 
  ylab("Sale Price") + 
  ggtitle("Price vs Zoning Classification") +
  #coord_cartesian(ylim = c(0, 400000)) + 
  theme_classic()

g4 <- ggplot(aes(as.factor(overall_cond), sale_price), data = training_data) +
  geom_bar(stat = "summary", fun = "mean", fill = "indianred") +
  stat_n_text(y.pos = 12500, size = 2) + 
  xlab("Quality") + 
  ylab("Sale Price") + 
  ggtitle("Price vs Overall Condition") +
  theme_classic()

ggarrange(g1, g2, g3, g4,
          ncol = 2, nrow = 2)

Notes:

  • For all four variables, the direction of the relationship with Sale Price is clearly visible in the plots.
  • Overall Quality and First Floor Square Feet in particular show a strong relationship with house price.

“Quick” Regression

As mentioned before, a quick regression using all variables without any missing values as predictors was run to get a first impression of which variables may be important for predicting Sale Price. A log transformation is applied to Sale Price to address the skewness of this variable shown above.

regressionResult <- tidy(lm(log(sale_price) ~ ., data = training_data[ ,sapply(training_data, function(x) !any(is.na(x)))]))
kable(head(arrange(regressionResult, -statistic), 10))

term                 estimate   std.error   statistic   p.value
-----------------  ----------  ----------  ----------  --------
roof_matlWdShngl    2.7202321   0.1503643   18.090946         0
roof_matlCompShg    2.6138786   0.1448945   18.039880         0
roof_matlTar&Grv    2.6408934   0.1665445   15.856987         0
roof_matlWdShake    2.5250789   0.1613245   15.652172         0
roof_matlRoll       2.6816411   0.1826503   14.681831         0
roof_matlMembran    2.9917945   0.2097053   14.266663         0
roof_matlMetal      2.7819635   0.2071698   13.428423         0
x1st_flr_sf         0.0002549   0.0000234   10.890736         0
overall_qual        0.0470576   0.0044789   10.506443         0
overall_cond        0.0366476   0.0038256    9.579511         0

Notes:

  • The t-statistic is the coefficient divided by its standard error and gives a good overview of the relative importance of the predictors (a quick check of this is shown right after these notes).
  • The roof material seems to have a big influence on Sale Price and will be examined further.
  • First floor square feet, Overall Quality and Overall Condition also have a strong relationship with Sale Price, which seems highly plausible.
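
As a quick, purely illustrative sanity check, the statistic column of the tidy() output can be reproduced by dividing the estimates by their standard errors:

# verify that the t-statistic equals estimate / std.error for the top terms
regressionResult %>%
  arrange(-statistic) %>%
  head(3) %>%
  mutate(t_check = estimate / std.error) %>%
  select(term, estimate, std.error, statistic, t_check)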

The roof material seems to have a big influence on Sale Price. Let us take a closer look at this variable:

kable(training_data %>% count(roof_matl))

roof_matl      n
----------  ----
ClyTile        1
CompShg     1434
Membran        1
Metal          1
Roll           1
Tar&Grv       11
WdShake        5
WdShngl        6
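
To quantify the imbalance, the counts can be turned into shares (a small additional check):

# share of each roof material in the training data
training_data %>%
  count(roof_matl) %>%
  mutate(share = n / sum(n)) %>%
  arrange(-share)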

Oops, the variable is highly unbalanced and has low variability: over 98% of houses have Standard (Composite) Shingle as roof material. Let us look at the regression again without roof material:

regressionResult <- tidy(lm(log(sale_price) ~ ., data = subset(training_data[ ,sapply(training_data, function(x) !any(is.na(x)))], select = -c(roof_matl))))
kable(head(arrange(regressionResult, -statistic)))

term            estimate   std.error   statistic   p.value
------------  ----------  ----------  ----------  --------
overall_qual   0.0553948   0.0049657   11.155507         0
x1st_flr_sf    0.0002271   0.0000260    8.724359         0
ms_zoningRL    0.4201654   0.0515950    8.143535         0
ms_zoningRM    0.3921489   0.0482741    8.123385         0
x2nd_flr_sf    0.0002089   0.0000258    8.102472         0
overall_cond   0.0328835   0.0042818    7.679909         0

Notes:

  • Without roof material, we see strong relationships with Sale Price for: Overall Quality, First Floor Square Feet, Zoning Classification, Second Floor Square Feet and Overall Condition

Data Preprocessing

In the next step, typos will be fixed, outliers will be removed, and additional variables that could be useful for predicting Sale Price will be created.

Outliers

The plot below shows two data points that were removed because they are clearly atypical on visual inspection and model performance was better without them:

#fix typos and known data errors
combined <- combined %>% mutate(roof_matl = recode(roof_matl, 'Tar&Grv' = "TarGrv")) %>% 
                          mutate(exterior1st = recode(exterior1st, 'Wd Sdng' = "WdSdng")) %>%
                          mutate(exterior2nd = recode(exterior2nd, 'Brk Cmn' = 'BrkComm', 'CmentBd' = 'CemntBd', 'Wd Sdng' = 'WdSdng', 'Wd Shng' = 'WdShing')) %>%
                          mutate(garage_yr_blt = replace(garage_yr_blt, garage_yr_blt > 2010, NA_real_)) #implausible garage year set to missing

#Outliers
ggplot(combined, aes(x=gr_liv_area, y=sale_price))+geom_point()+
  geom_text(data=combined[combined$gr_liv_area>4500,], mapping=aes(label=id), vjust=1.5, col = "red" ) +
  theme_classic()

combined <- combined %>% filter(!(gr_liv_area>4500 & !is.na(sale_price)))

Feature Creation

There are many variables in the dataset that can be transformed further to extract new features. In the following code, 13 new features are created and two existing variables (yr_sold and mo_sold) are converted to factors. Many of the new features are indicators of whether the house contains a certain feature (like a pool or a second floor); others combine existing variables with the goal of creating a single variable with a stronger relationship to Sale Price than its components. For example, the variable “overall_sf” (overall square feet) combines the above-ground living area and the total basement area (both measured in square feet).

combined <- combined %>% mutate(has_bsmt = as.numeric(total_bsmt_sf!=0),
                          has_2ndFloor = as.numeric(x2nd_flr_sf!=0),
                          has_pool = as.numeric(pool_area!=0),
                          has_porch = as.numeric((open_porch_sf+enclosed_porch+x3ssn_porch+screen_porch)!=0),
                          has_remod = factor(year_remod_add != year_built),
                          has_fireplace = as.numeric(fireplaces >0),
                          is_new = as.numeric(yr_sold==year_built),
                          overall_sf = gr_liv_area + total_bsmt_sf,
                          overall_bath = full_bath + 0.5*half_bath + 0.5 *bsmt_half_bath + bsmt_full_bath,
                          house_age = yr_sold - year_remod_add,
                          yr_sold = factor(yr_sold),
                          mo_sold = factor(mo_sold),
                          tot_porch_sf = open_porch_sf + enclosed_porch +x3ssn_porch + screen_porch,
                          bsmt_bath = as.numeric(bsmt_half_bath + bsmt_full_bath !=0),
                          is_bsmt_unf = as.numeric(total_bsmt_sf == bsmt_unf_sf))

#update training and testing data so that they include all transformations
training_data <- combined %>% filter(!is.na(sale_price))
testing_data <- combined %>% filter(is.na(sale_price))

Modelling

Now we start the modelling part. As mentioned before, an XGBoost regression model will be used. Further feature engineering will be done with the recipes package, since it provides comprehensive preprocessing possibilities in a tidy code structure.

Performance on cross-validation subset

We start by preparing our data. To assess the performance on the training dataset, resampling is used in the form of 10-fold cross-validation. Our XGBoost model is specified with the parsnip package. In addition, the recipes package is used to bundle all transformations and other preprocessing steps into a single object that can be applied to any subset of the data.

The recipe applies the following transformations:

  • Remove street, utilities and the variables with too many missing values identified above
  • Assign a placeholder level for factor levels that were not seen before
  • Impute missing values via k-nearest neighbors
  • Create traditional dummy variables from the nominal variables
  • Remove variables that are highly sparse and unbalanced (near-zero variance)
  • Log-transform the outcome variable

After specifying the recipe, we run a first XGBoost model on the cross-validation folds to get an impression of performance and variable importance without any tuning. The results are displayed in the following table:

set.seed(121)
vfold_data <- rsample::vfold_cv(data = training_data, v = 10)

xgb_recipe <- training_data %>%
  recipe(sale_price ~ .) %>%
  update_role(id, new_role = "id var") %>%
  step_rm(street, utilities) %>%
  step_rm(one_of(drop_cols)) %>%
  step_novel(all_predictors(), -all_numeric()) %>%
  step_impute_knn(all_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_nzv(everything()) %>%
  step_log(all_outcomes(), offset = 1) 

#specify model
xgb <-  parsnip::boost_tree(trees = 100) %>%
  set_engine("xgboost", objective = "reg:squarederror") %>%
  set_mode("regression")

xgb_wf <- workflows::workflow() %>%
  workflows::add_recipe(xgb_recipe) %>% 
  workflows::add_model(xgb)

# performance on cross-validation
xgb_fit <- xgb_wf %>% fit_resamples(vfold_data)
kable(collect_metrics(xgb_fit))

.metric   .estimator        mean    n     std_err   .config
--------  -----------  ---------  ---  ----------  ---------------------
rmse      standard     0.1388313   10   0.0045048   Preprocessor1_Model1
rsq       standard     0.8798578   10   0.0071070   Preprocessor1_Model1

Without further optimization, we get:

  • an RMSE of 0.14 (a rough interpretation of this number follows right below)
  • an R-squared of 0.88
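
Because the model is fit on log(sale_price), the RMSE can be read as a multiplicative error on the original price scale. A rough, illustrative calculation:

# a log-scale RMSE of about 0.139 corresponds to a typical multiplicative error of
# exp(0.139), i.e. predictions are off by roughly 15% of the true sale price
exp(0.1388313)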

After looking at variable importance, we will improve the accuracy with model tuning.

#variable importance
xgb_wf %>% fit(training_data) %>%
  extract_fit_parsnip() %>% 
  vip() +
  theme_classic()

Notes:

  • The created variable overall_sf is the most important variable for our model and seems to have replaced first and second floor square feet
  • As already seen during data exploration, Overall Quality is again a very important variable for predicting Sale Price with the XGBoost algorithm
  • Some of the other created variables are even more important than Overall Condition

Tune Hyperparameters

Now it is time to improve our model. For that purpose, the model and the parameters to be tuned are defined. R-squared is used to compare the tuning results below (the final model is later selected by RMSE). Then, the defined parameters are tuned over a space-filling grid and the model accuracy after tuning is assessed.

set.seed(1234)
xgb_tuning <- parsnip::boost_tree(
  trees = 1324, 
  tree_depth = tune(), 
  min_n = tune(), 
  loss_reduction = tune(),                     ## first three: model complexity
  sample_size = 0.212, mtry = 80,              ## randomness
  learn_rate = 0.00893,                     ## step size
) %>% 
  set_engine("xgboost", objective = "reg:squarederror") %>%
  set_mode("regression")

xgb_grid <- grid_latin_hypercube(
  #trees(), 
  tree_depth(),
  min_n(),
  loss_reduction(),
  #sample_size = sample_prop(),
  #finalize(mtry(), training_data),
  #learn_rate(),
  size = 50
)

xgb_wf_tune <- workflows::workflow() %>%
  workflows::add_recipe(xgb_recipe) %>% 
  workflows::add_model(xgb_tuning)

#tune hyperparameters
xgb_res <- tune_grid(
  xgb_wf_tune,
  resamples = vfold_data,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)

#look at tuning results
kable(head(xgb_res %>% collect_metrics %>% arrange(desc(.metric) ,-mean)))

min_n   tree_depth   loss_reduction  .metric   .estimator        mean    n    std_err   .config
------  -----------  ---------------  --------  -----------  ---------  ---  ----------  ----------------------
     6           14        0.0000000  rsq       standard     0.9166721   10   0.0080414  Preprocessor1_Model06
     5            9        0.0000307  rsq       standard     0.9162517   10   0.0078405  Preprocessor1_Model26
     3            6        0.0000000  rsq       standard     0.9160771   10   0.0081234  Preprocessor1_Model44
     7            8        0.0023759  rsq       standard     0.9158317   10   0.0082477  Preprocessor1_Model12
     7           14        0.0006571  rsq       standard     0.9158051   10   0.0081564  Preprocessor1_Model27
     9           14        0.0003326  rsq       standard     0.9148765   10   0.0084475  Preprocessor1_Model49

xgb_res %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  select(mean, min_n:loss_reduction ) %>%
  pivot_longer(min_n:loss_reduction,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "R-Squared") +
  theme_classic() +
  theme(panel.background = element_rect(colour = 'black'))

After model tuning, we get a cross-validation accuracy of

  • 0.12 in terms of RMSE
  • 0.91 in terms of R-squared

Therefore, the accuracy was improved.

Finalize model

Finally, the model with the best accuracy will be selected, and this final model will be fitted on the whole training data to generate the test predictions.

best_xgb <- xgb_res %>%
  select_best(metric = "rmse")

final_wf <- xgb_wf_tune %>% 
  finalize_workflow(best_xgb)

# build a manual split: the whole (cleaned) training data is used for fitting,
# the test rows (with missing sale_price) are used for prediction
ind <- list(analysis = seq(nrow(training_data)), assessment = nrow(training_data) + seq(nrow(testing_data)))
splits <- make_splits(ind, combined)

final_fit <- final_wf %>% 
  last_fit(splits)

submission <- final_fit %>% collect_predictions() %>% transmute(Id = testing_data$id, SalePrice = exp(.pred))

submission %>% write_csv(here("Data", "submission.csv"))

Conclusion

This project showed a basic workflow for setting up and improving an XGBoost regression model to predict house prices. Whereas an XGBoost model without hyperparameter tuning already showed good results, it is great to see that the cross-validation performance in terms of R-squared could be further improved by 2.75 percentage points through hyperparameter tuning. In total, 90.7% of the variance in the training set can be explained by our prediction model.