Please click here to visit my github page.

Introduction

The goal of this project is to predict the sale price of residential homes in Ames, Iowa. The dataset comprises 79 explanatory variables. An XGBoost regression algorithm will be used for prediction.

More information about the data can be found here: http://jse.amstat.org/v19n3/decock.pdf

Preparations

Load Packages

First, the relevant packages are loaded. We load a range of libraries for general data wrangling and visualisation together with more specialised tools for modelling.

# install any missing packages and make sure every package is actually loaded
packages <- c("here", "tidymodels", "tidyverse", "janitor", "EnvStats",
              "ggpubr", "gridExtra", "vip", "visdat", "DataExplorer",
              "igraph", "xgboost", "knitr")

for (pkg in packages) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, repos = "http://cran.us.r-project.org")
    library(pkg, character.only = TRUE)
  }
}

Load Data

The data was downloaded from Kaggle and stored locally (login necessary). The here package is used to locate the files relative to the project root, and the clean_names() function from the janitor package is used to create clean, consistent variable names. We also take a first look at the amount of missing data:

training_data <- read_csv(here("Data", "train.csv")) %>%
                  clean_names()
testing_data <-  read_csv(here("Data", "test.csv")) %>%
                  clean_names()

combined <- bind_rows(training_data, testing_data)
train_rows <- nrow(training_data)

cat(sum(!complete.cases(training_data)), "out of", train_rows, "observations have at least one missing value.")
1460 out of 1460 observations have at least one missing value.
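
As a side note on clean_names(): it converts the original Kaggle column names to snake_case and prefixes names that start with a digit with an "x", which is why variables such as x1st_flr_sf and sale_price appear below. A tiny illustration with made-up values:

# toy example of what clean_names() does to two of the original column names
tibble(`1stFlrSF` = 856, SalePrice = 208500) %>% clean_names()
# the columns are renamed to x1st_flr_sf and sale_price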

Missing data

We already saw that every observation has at least one missing value. On the other hand, we have 79 explanatory variables in the training set, so we will first take a closer look at how the missing values are distributed across variables. The vis_dat() function from the visdat package gives a nice first graphical overview of the amount of missing data. In a second step, the DataExplorer package is used to identify (and later drop) all variables with a missing rate of more than 40% in the training set:

vis_dat(training_data)

Remove Missing Data

The next graph shows an overview of all variables with at least one missing value and their missing rate. The DataExplorer package is used to create this plot, and its output makes it easy to extract all variables with a missing rate of more than 40%, which will be dropped later in the preprocessing recipe:

missing_plot <- DataExplorer::plot_missing(training_data, missing_only = TRUE, ggtheme = theme_classic())

drop_cols <- missing_plot$data %>% filter(Band %in% c("Remove", "Bad")) %>% pull(feature)
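
The same set of columns can be double-checked directly with dplyr. This is a small, optional sanity check; it assumes (as in DataExplorer's default bands used above) that the "Bad" and "Remove" bands correspond to a missing rate above 40%:

# columns with more than 40% missing values in the training data
training_data %>%
  summarise(across(everything(), ~ mean(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "missing_rate") %>%
  filter(missing_rate > 0.4) %>%
  arrange(-missing_rate)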

Target Variable

Original Scale

We will continue by looking at the distribution of our target variable.

# hist
g1 <- ggplot(training_data, aes(x=sale_price)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white")+
  geom_density(alpha=.2, fill="indianred")+ 
  labs(x = "", y = "") +
  theme_classic() 

# boxplot
g2 <- ggplot(training_data, aes(y=sale_price)) + 
  geom_boxplot(aes(x=""), colour="black", fill="indianred", alpha=.2)+
  coord_flip()+ 
  labs(x = "", y = "") +
  theme_classic()

# qqplot
g3 <- ggplot(training_data, aes(sample = sale_price))+ 
  stat_qq()+
  stat_qq_line()+ 
  labs(x = "", y = "") + 
  theme_classic()

grid.arrange(g1, arrangeGrob(g2, g3, nrow=2), nrow = 1)

Notes:

  • All three graphs show clear evidence for a right-skewed distribution.
  • Common transformations for right-skewed data include log transformation.

Log-Transformation

Let's see if a log transformation helps to make the distribution more normal:

# hist
g1 <- ggplot(training_data, aes(x=log(sale_price))) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white")+
  geom_density(alpha=.2, fill="indianred")+ 
  labs(x = "", y = "") +
  theme_classic() 

# boxplot
g2 <- ggplot(training_data, aes(y=log(sale_price))) + 
  geom_boxplot(aes(x=""), colour="black", fill="indianred", alpha=.2)+
  coord_flip()+ 
  labs(x = "", y = "") +
  theme_classic()

# qqplot
g3 <- ggplot(training_data, aes(sample = log(sale_price)))+ 
  stat_qq()+
  stat_qq_line()+ 
  labs(x = "", y = "") + 
  theme_classic()

grid.arrange(g1, arrangeGrob(g2, g3, nrow=2), nrow = 1)

Notes:

  • The distribution looks much better after the log transformation.
  • All three graphs show almost no signs of remaining skewness (a quick numeric check follows below).
  • The log transformation will be applied in modelling.
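
As a numeric complement to the plots, the sample skewness can be computed before and after the transformation. This is a minimal sketch using a hand-rolled skewness function, so it relies on base R only:

# third-moment skewness; values near 0 indicate a roughly symmetric distribution
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(training_data$sale_price)       # clearly positive: right-skewed
skew(log(training_data$sale_price))  # much closer to zero after the log transform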

Exploratory Analysis

Important Predictors

In a first regression using all variables without any missing values as predictors, Overall Quality, First Floor Square Feet, Zoning Classification and Overall Condition were identified as potentially important predictors. Therefore, the relationship of these four variables with Sale Price is plotted below. The results of the regression used to identify these predictors can be found in the “Quick” Regression section below.


g1 <- ggplot(aes(as.factor(overall_qual), sale_price), data = training_data) +
  geom_bar(stat = "summary", fun = "mean", fill = "indianred") +
  stat_n_text(y.pos = 25000, size = 2) + 
  scale_fill_brewer(palette="Pastel1") +
  xlab("Quality") + 
  ylab("Sale Price") + 
  ggtitle("Price vs Overall Quality of Material") +
  theme_classic() 
#sale price increases steadily with overall quality

g2 <- ggplot(aes(x1st_flr_sf, sale_price), data = combined[!is.na(combined$sale_price),]) +
  geom_point(col = "indianred") +
  xlab("First floor square feet") + 
  ylab("Sale Price") + 
  ggtitle("Price vs First floor square feet") +
  theme_classic()


g3 <- ggplot(aes(reorder(ms_zoning,-sale_price), sale_price), data = training_data) +
  geom_bar(stat = "summary", fun = "mean", fill = "indianred") +
  stat_n_text(y.pos = 12500, size = 3) + 
  xlab("Zoning Classification") + 
  ylab("Sale Price") + 
  ggtitle("Price vs Zoning Classification") +
  #coord_cartesian(ylim = c(0, 400000)) + 
  theme_classic()

g4 <- ggplot(aes(as.factor(overall_cond), sale_price), data = training_data) +
  geom_bar(stat = "summary", fun = "mean", fill = "indianred") +
  stat_n_text(y.pos = 12500, size = 2) + 
  xlab("Quality") + 
  ylab("Sale Price") + 
  ggtitle("Price vs Overall Condition") +
  theme_classic()

ggarrange(g1, g2, g3, g4,
          ncol = 2, nrow = 2)

Notes:

  • For all four variables, the direction of the relationship with Sale Price is clearly visible in the plots.
  • Overall Quality and First Floor Square Feet in particular show a strong relationship with house price.

“Quick” Regression

As mentioned before, a quick regression using all variables without any missing values as predictors was run to get a first impression of which variables may be important for predicting Sale Price. A log transformation is applied to Sale Price to address the skewness of this variable shown above.

regressionResult <- tidy(lm(log(sale_price) ~ ., data = training_data[ ,sapply(training_data, function(x) !any(is.na(x)))]))
kable(head(arrange(regressionResult, -statistic), 10))

term                 estimate   std.error   statistic   p.value
-----------------  ----------  ----------  ----------  --------
roof_matlWdShngl    2.7202321   0.1503643   18.090946         0
roof_matlCompShg    2.6138786   0.1448945   18.039880         0
roof_matlTar&Grv    2.6408934   0.1665445   15.856987         0
roof_matlWdShake    2.5250789   0.1613245   15.652172         0
roof_matlRoll       2.6816411   0.1826503   14.681831         0
roof_matlMembran    2.9917945   0.2097053   14.266663         0
roof_matlMetal      2.7819635   0.2071698   13.428423         0
x1st_flr_sf         0.0002549   0.0000234   10.890736         0
overall_qual        0.0470576   0.0044789   10.506443         0
overall_cond        0.0366476   0.0038256    9.579511         0

Notes:

  • The t-statistic is the coefficient divided by its standard error and gives a good overview of the relative importance of the predictors (a quick check of this is shown right after these notes).
  • The roof material seems to have a big influence on Sale Price and will be examined further.
  • First floor square feet, Overall Quality and Overall Condition also have a strong relationship with Sale Price, which seems highly plausible.
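
As a quick, purely illustrative sanity check, the statistic column of the tidy() output can be reproduced by dividing the estimates by their standard errors:

# verify that the t-statistic equals estimate / std.error for the top terms
regressionResult %>%
  arrange(-statistic) %>%
  head(3) %>%
  mutate(t_check = estimate / std.error) %>%
  select(term, estimate, std.error, statistic, t_check)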

The roof material seems to have a big influence on Sale Price. Let us take a closer look at this variable:

kable(training_data %>% count(roof_matl))

roof_matl      n
----------  ----
ClyTile        1
CompShg     1434
Membran        1
Metal          1
Roll           1
Tar&Grv       11
WdShake        5
WdShngl        6
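
To quantify the imbalance, the counts can be turned into shares (a small additional check):

# share of each roof material in the training data
training_data %>%
  count(roof_matl) %>%
  mutate(share = n / sum(n)) %>%
  arrange(-share)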

Oops, the variable is highly unbalanced and has low variability: over 98% of houses have Standard (Composite) Shingle as roof material. Let us look at the regression again without roof material:

regressionResult <- tidy(lm(log(sale_price) ~ ., data = subset(training_data[ ,sapply(training_data, function(x) !any(is.na(x)))], select = -c(roof_matl))))
kable(head(arrange(regressionResult, -statistic)))

term            estimate   std.error   statistic   p.value
------------  ----------  ----------  ----------  --------
overall_qual   0.0553948   0.0049657   11.155507         0
x1st_flr_sf    0.0002271   0.0000260    8.724359         0
ms_zoningRL    0.4201654   0.0515950    8.143535         0
ms_zoningRM    0.3921489   0.0482741    8.123385         0
x2nd_flr_sf    0.0002089   0.0000258    8.102472         0
overall_cond   0.0328835   0.0042818    7.679909         0

Notes:

  • Without roof material, we see strong relationships with Sale Price for: Overall Quality, First Floor Square Feet, Zoning Classification, Second Floor Square Feet and Overall Condition

Data Preprocessing

In the next step, typos will be fixed, outliers will be removed, and additional variables that could be useful for predicting Sale Price will be created.

Outliers

The plot below shows two data points that were removed because they are clearly atypical on visual inspection and model performance was better without them:

#fix typos and known data errors
combined <- combined %>% mutate(roof_matl = recode(roof_matl, 'Tar&Grv' = "TarGrv")) %>% 
                          mutate(exterior1st = recode(exterior1st, 'Wd Sdng' = "WdSdng")) %>%
                          mutate(exterior2nd = recode(exterior2nd, 'Brk Cmn' = 'BrkComm', 'CmentBd' = 'CemntBd', 'Wd Sdng' = 'WdSdng', 'Wd Shng' = 'WdShing')) %>%
                          mutate(garage_yr_blt = replace(garage_yr_blt, garage_yr_blt > 2010, NA_real_)) #implausible garage year set to missing

#Outliers
ggplot(combined, aes(x=gr_liv_area, y=sale_price))+geom_point()+
  geom_text(data=combined[combined$gr_liv_area>4500,], mapping=aes(label=id), vjust=1.5, col = "red" ) +
  theme_classic()

combined <- combined %>% filter(!(gr_liv_area>4500 & !is.na(sale_price)))

Feature Creation

There are many variables in the dataset that can be transformed further to extract new features. In the following code, 13 new features are created and two existing variables (yr_sold and mo_sold) are converted to factors. Many of the new features are indicators of whether the house contains a certain feature (like a pool or a second floor); others combine existing variables with the goal of creating a single variable with a stronger relationship to Sale Price than its components. For example, the variable “overall_sf” (overall square feet) combines the above-ground living area and the total basement area (both measured in square feet).

combined <- combined %>% mutate(has_bsmt = as.numeric(total_bsmt_sf!=0),
                          has_2ndFloor = as.numeric(x2nd_flr_sf!=0),
                          has_pool = as.numeric(pool_area!=0),
                          has_porch = as.numeric((open_porch_sf+enclosed_porch+x3ssn_porch+screen_porch)!=0),
                          has_remod = factor(year_remod_add != year_built),
                          has_fireplace = as.numeric(fireplaces >0),
                          is_new = as.numeric(yr_sold==year_built),
                          overall_sf = gr_liv_area + total_bsmt_sf,
                          overall_bath = full_bath + 0.5*half_bath + 0.5 *bsmt_half_bath + bsmt_full_bath,
                          house_age = yr_sold - year_remod_add,
                          yr_sold = factor(yr_sold),
                          mo_sold = factor(mo_sold),
                          tot_porch_sf = open_porch_sf + enclosed_porch +x3ssn_porch + screen_porch,
                          bsmt_bath = as.numeric(bsmt_half_bath + bsmt_full_bath !=0),
                          is_bsmt_unf = as.numeric(total_bsmt_sf == bsmt_unf_sf))

#update training and testing data so that they include all transformations
training_data <- combined %>% filter(!is.na(sale_price))
testing_data <- combined %>% filter(is.na(sale_price))

Modelling

Now we start the modelling part. As mentioned before, an XGBoost regression model will be used. Further feature engineering will be done with the recipes package, since it provides comprehensive preprocessing possibilities in a tidy code structure.

Performance on cross-validation subset

We start by preparing our data. To assess the performance on the training dataset, resampling is used in the form of 10-fold cross-validation. Our XGBoost model is specified with the parsnip package. In addition, the recipes package is used to bundle all transformations and other preprocessing steps into a single object that can be applied to any subset of the data.

The recipe applies the following transformations:

  • Remove street, utilities and the variables with too many missing values identified above
  • Assign a placeholder level for factor levels that were not seen before
  • Impute missing values via k-nearest neighbors
  • Create traditional dummy variables from the nominal variables
  • Remove variables that are highly sparse and unbalanced (near-zero variance)
  • Log-transform the outcome variable

After specifying the recipe, we run a first XGBoost model on the cross-validation folds to get an impression of performance and variable importance without any tuning. The results are displayed in the following table:

set.seed(121)
vfold_data <- rsample::vfold_cv(data = training_data, v = 10)

xgb_recipe <- training_data %>%
  recipe(sale_price ~ .) %>%
  update_role(id, new_role = "id var") %>%
  step_rm(street, utilities) %>%
  step_rm(one_of(drop_cols)) %>%
  step_novel(all_predictors(), -all_numeric()) %>%
  step_impute_knn(all_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_nzv(everything()) %>%
  step_log(all_outcomes(), offset = 1) 

#specify model
xgb <-  parsnip::boost_tree(trees = 100) %>%
  set_engine("xgboost", objective = "reg:squarederror") %>%
  set_mode("regression")

xgb_wf <- workflows::workflow() %>%
  workflows::add_recipe(xgb_recipe) %>% 
  workflows::add_model(xgb)

# performance on cross-validation
xgb_fit <- xgb_wf %>% fit_resamples(vfold_data)
kable(collect_metrics(xgb_fit))

.metric   .estimator        mean    n     std_err   .config
--------  -----------  ---------  ---  ----------  ---------------------
rmse      standard     0.1388313   10   0.0045048   Preprocessor1_Model1
rsq       standard     0.8798578   10   0.0071070   Preprocessor1_Model1

Without further optimization, we get:

  • an RMSE of 0.14 (a rough interpretation of this number follows right below)
  • an R-squared of 0.88
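
Because the model is fit on log(sale_price), the RMSE can be read as a multiplicative error on the original price scale. A rough, illustrative calculation:

# a log-scale RMSE of about 0.139 corresponds to a typical multiplicative error of
# exp(0.139), i.e. predictions are off by roughly 15% of the true sale price
exp(0.1388313)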

After looking at variable importance, we will improve the accuracy with model tuning.

#variable importance
xgb_wf %>% fit(training_data) %>%
  extract_fit_parsnip() %>% 
  vip() +
  theme_classic()

Notes:

  • The created variable overall_sf is the most important variable for our model and seems to have replaced first and second floor square feet
  • As already seen during data exploration, Overall Quality is again a very important variable for predicting Sale Price with the XGBoost algorithm
  • Some of the other created variables are even more important than Overall Condition

Tune Hyperparameters

Now it is time to improve our model. For that purpose, the model and the parameters to be tuned are defined. R-squared is used to compare the tuning results below (the final model is later selected by RMSE). Then, the defined parameters are tuned over a space-filling grid and the model accuracy after tuning is assessed.

set.seed(1234)
xgb_tuning <- parsnip::boost_tree(
  trees = 1324, 
  tree_depth = tune(), 
  min_n = tune(), 
  loss_reduction = tune(),                     ## first three: model complexity
  sample_size = 0.212, mtry = 80,              ## randomness
  learn_rate = 0.00893,                     ## step size
) %>% 
  set_engine("xgboost", objective = "reg:squarederror") %>%
  set_mode("regression")

xgb_grid <- grid_latin_hypercube(
  #trees(), 
  tree_depth(),
  min_n(),
  loss_reduction(),
  #sample_size = sample_prop(),
  #finalize(mtry(), training_data),
  #learn_rate(),
  size = 50
)

xgb_wf_tune <- workflows::workflow() %>%
  workflows::add_recipe(xgb_recipe) %>% 
  workflows::add_model(xgb_tuning)

#tune hyperparameters
xgb_res <- tune_grid(
  xgb_wf_tune,
  resamples = vfold_data,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)

#look at tuning results
kable(head(xgb_res %>% collect_metrics %>% arrange(desc(.metric) ,-mean)))

min_n   tree_depth   loss_reduction  .metric   .estimator        mean    n    std_err   .config
------  -----------  ---------------  --------  -----------  ---------  ---  ----------  ----------------------
     6           14        0.0000000  rsq       standard     0.9166721   10   0.0080414  Preprocessor1_Model06
     5            9        0.0000307  rsq       standard     0.9162517   10   0.0078405  Preprocessor1_Model26
     3            6        0.0000000  rsq       standard     0.9160771   10   0.0081234  Preprocessor1_Model44
     7            8        0.0023759  rsq       standard     0.9158317   10   0.0082477  Preprocessor1_Model12
     7           14        0.0006571  rsq       standard     0.9158051   10   0.0081564  Preprocessor1_Model27
     9           14        0.0003326  rsq       standard     0.9148765   10   0.0084475  Preprocessor1_Model49

xgb_res %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  select(mean, min_n:loss_reduction ) %>%
  pivot_longer(min_n:loss_reduction,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "R-Squared") +
  theme_classic() +
  theme(panel.background = element_rect(colour = 'black'))

After model tuning, we get a cross-validation accuracy of

  • 0.12 in terms of RMSE
  • 0.91 in terms of R-squared

Therefore, the accuracy was improved.

Finalize model

Finally, the model with the best accuracy will be selected, and this final model will be fitted on the whole training data to generate the test predictions.

best_xgb <- xgb_res %>%
  select_best(metric = "rmse")

final_wf <- xgb_wf_tune %>% 
  finalize_workflow(best_xgb)

# build a manual split: the whole (cleaned) training data is used for fitting,
# the test rows (with missing sale_price) are used for prediction
ind <- list(analysis = seq(nrow(training_data)), assessment = nrow(training_data) + seq(nrow(testing_data)))
splits <- make_splits(ind, combined)

final_fit <- final_wf %>% 
  last_fit(splits)

submission <- final_fit %>% collect_predictions() %>% transmute(Id = testing_data$id, SalePrice = exp(.pred))

submission %>% write_csv(here("Data", "submission.csv"))

Conclusion

This project showed a basic workflow for setting up and improving an XGBoost regression model to predict house prices. Whereas an XGBoost model without hyperparameter tuning already showed good results, it is great to see that the cross-validation performance in terms of R-squared could be further improved by 2.75 percentage points through hyperparameter tuning. In total, 90.7% of the variance in the training set can be explained by our prediction model.