The goal of this project is to predict the sale price of residential homes in Ames, Iowa. The dataset comprises 79 explanatory variables. An XGBoost regression algorithm will be used for prediction.
More information about the data can be found here: http://jse.amstat.org/v19n3/decock.pdf
First, the relevant packages are loaded. We load a range of libraries for data wrangling and visualisation, together with more specialised tools for modelling.
if (!require(here)) install.packages('here')
if (!require(tidymodels)) install.packages('tidymodels')
if (!require(tidyverse)) install.packages('tidyverse')
if (!require(janitor)) install.packages('janitor')
if (!require(EnvStats)) install.packages('EnvStats')
if (!require(ggpubr)) install.packages('ggpubr')
if (!require(gridExtra)) install.packages('gridExtra')
if (!require(vip)) install.packages('vip')
if (!require(visdat)) install.packages('visdat')
if (!require(DataExplorer)) install.packages('DataExplorer', repos = "http://cran.us.r-project.org")
if (!require(igraph)) install.packages('igraph', repos = "http://cran.us.r-project.org")
if (!require(xgboost)) install.packages('xgboost', repos = "http://cran.us.r-project.org")
if (!require(knitr)) install.packages('knitr')
The data was downloaded from Kaggle and stored locally (login necessary). The here package is used to locate the files relative to the project root. The clean_names() function from the janitor package is used to produce clean, consistent variable names. We also take a first general look at the amount of missing data:
training_data <- read_csv(here("Data", "train.csv")) %>%
clean_names()
testing_data <- read_csv(here("Data", "test.csv")) %>%
clean_names()
combined <- bind_rows(training_data, testing_data)
train_rows <- nrow(training_data)
cat(dim(training_data[!complete.cases(training_data),])[1], "out of", train_rows, "observations have at least one missing value.")
1460 out of 1460 observations have at least one missing value.
We just saw that every observation has at least one missing value. On the other hand, we have 79 explanatory variables in the training set. Therefore, we will first take a closer look at the distribution of missing values in our dataset. We will use the visdat package for a first graphical overview, and then the DataExplorer package to identify and delete all variables with a missing rate above 40% in the training set:
vis_dat(training_data)
The next graph shows an overview of all variables with at least one missing value and their missing rates. The DataExplorer package is used again for this plot; it makes it easy to extract an indicator of all variables with more than 40% missing values:
missing_plot <- DataExplorer::plot_missing(training_data, missing_only = T, ggtheme = theme_classic())
drop_cols <- missing_plot$data %>% filter(Band %in% c("Remove", "Bad")) %>% pull(feature)
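To double-check what will be dropped, we can inspect the flagged columns directly. A minimal sketch, assuming the plot data retains DataExplorer's pct_missing column alongside the Band indicator:
# Columns flagged as "Bad" (>40% missing) or "Remove" (>80% missing),
# together with their missing rates.
missing_plot$data %>%
  filter(Band %in% c("Remove", "Bad")) %>%
  select(feature, pct_missing)
We will continue by looking at the distribution of our target variable.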
# hist
g1 <- ggplot(training_data, aes(x=sale_price)) +
geom_histogram(aes(y = after_stat(density)), colour="black", fill="white")+
geom_density(alpha=.2, fill="indianred")+
labs(x = "", y = "") +
theme_classic()
# boxplot
g2 <- ggplot(training_data, aes(y=sale_price)) +
geom_boxplot(aes(x=""), colour="black", fill="indianred", alpha=.2)+
coord_flip()+
labs(x = "", y = "") +
theme_classic()
# qqplot
g3 <- ggplot(training_data, aes(sample = sale_price))+
stat_qq()+
stat_qq_line()+
labs(x = "", y = "") +
theme_classic()
grid.arrange(g1, arrangeGrob(g2, g3, nrow=2), nrow = 1)
Let's see if a log transformation helps to make the distribution more normal:
# hist
g1 <- ggplot(training_data, aes(x=log(sale_price))) +
geom_histogram(aes(y = after_stat(density)), colour="black", fill="white")+
geom_density(alpha=.2, fill="indianred")+
labs(x = "", y = "") +
theme_classic()
# boxplot
g2 <- ggplot(training_data, aes(y=log(sale_price))) +
geom_boxplot(aes(x=""), colour="black", fill="indianred", alpha=.2)+
coord_flip()+
labs(x = "", y = "") +
theme_classic()
# qqplot
g3 <- ggplot(training_data, aes(sample = log(sale_price)))+
stat_qq()+
stat_qq_line()+
labs(x = "", y = "") +
theme_classic()
grid.arrange(g1, arrangeGrob(g2, g3, nrow=2), nrow = 1)
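As a quick numeric check of what the plots suggest, we can compare the skewness of the raw and log-transformed target. A minimal sketch, assuming the e1071 package (not loaded above) for its skewness() function:
if (!require(e1071)) install.packages('e1071')
# Raw sale price: strongly right-skewed (skewness around 1.9).
e1071::skewness(training_data$sale_price)
# Log-transformed: close to symmetric (skewness near 0).
e1071::skewness(log(training_data$sale_price))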
In a first regression using all variables without any missing data as predictors, Overall Quality, First Floor Square Feet, Zoning Classification and Overall Condition were identified as potentially important predictors. Therefore, these four variables are plotted against Sale Price below. The results from the regression used to identify these predictors can be seen in the "Quick" Regression tab above.
g1 <- ggplot(aes(as.factor(overall_qual), sale_price), data = training_data) +
geom_bar(stat = "summary", fun = "mean", fill = "indianred") +
stat_n_text(y.pos = 25000, size = 2) +
scale_fill_brewer(palette="Pastel1") +
xlab("Quality") +
ylab("Sale Price") +
ggtitle("Price vs Overall Quality of Material") +
theme_classic()
#Sale Price decreases steadily with lower overall quality
g2 <- ggplot(aes(x1st_flr_sf, sale_price), data = combined[!is.na(combined$sale_price),]) +
geom_point(col = "indianred") +
xlab("First floor square feet") +
ylab("Sale Price") +
ggtitle("Price vs First floor square feet") +
theme_classic()
g3 <- ggplot(aes(reorder(ms_zoning,-sale_price), sale_price), data = training_data) +
geom_bar(stat = "summary", fun = "mean", fill = "indianred") +
stat_n_text(y.pos = 12500, size = 3) +
xlab("Zoning Classification") +
ylab("Sale Price") +
ggtitle("Price vs Zoning Classification") +
#coord_cartesian(ylim = c(0, 400000)) +
theme_classic()
g4 <- ggplot(aes(as.factor(overall_cond), sale_price), data = training_data) +
geom_bar(stat = "summary", fun = "mean", fill = "indianred") +
stat_n_text(y.pos = 12500, size = 2) +
xlab("Quality") +
ylab("Sale Price") +
ggtitle("Price vs Overall Condition") +
theme_classic()
ggarrange(g1, g2, g3, g4,
ncol = 2, nrow = 2)
As said before, a quick regression using all variables without any missing data as predictors was run to get a first impression of which variables may be important for predicting Sale Price. A log transformation is applied to SalePrice to address the skewness of this variable shown above.
regressionResult <- tidy(lm(log(sale_price) ~ ., data = training_data[ ,sapply(training_data, function(x) !any(is.na(x)))]))
kable(head(arrange(regressionResult, -statistic), 10))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| roof_matlWdShngl | 2.7202321 | 0.1503643 | 18.090946 | 0 |
| roof_matlCompShg | 2.6138786 | 0.1448945 | 18.039880 | 0 |
| roof_matlTar&Grv | 2.6408934 | 0.1665445 | 15.856987 | 0 |
| roof_matlWdShake | 2.5250789 | 0.1613245 | 15.652172 | 0 |
| roof_matlRoll | 2.6816411 | 0.1826503 | 14.681831 | 0 |
| roof_matlMembran | 2.9917945 | 0.2097053 | 14.266663 | 0 |
| roof_matlMetal | 2.7819635 | 0.2071698 | 13.428423 | 0 |
| x1st_flr_sf | 0.0002549 | 0.0000234 | 10.890736 | 0 |
| overall_qual | 0.0470576 | 0.0044789 | 10.506443 | 0 |
| overall_cond | 0.0366476 | 0.0038256 | 9.579511 | 0 |
Roof material seems to have a big influence on Sale Price. Let us look further at this variable:
kable(training_data %>% count(roof_matl))
| roof_matl | n |
|---|---|
| ClyTile | 1 |
| CompShg | 1434 |
| Membran | 1 |
| Metal | 1 |
| Roll | 1 |
| Tar&Grv | 11 |
| WdShake | 5 |
| WdShngl | 6 |
Oops, the variable is highly unbalanced and has low variability: over 98% of houses have Standard (Composite) Shingle as roof material. Let us look again at the regression without roof material:
regressionResult <- tidy(lm(log(sale_price) ~ ., data = subset(training_data[ ,sapply(training_data, function(x) !any(is.na(x)))], select = -c(roof_matl))))
kable(head(arrange(regressionResult, -statistic)))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| overall_qual | 0.0553948 | 0.0049657 | 11.155507 | 0 |
| x1st_flr_sf | 0.0002271 | 0.0000260 | 8.724359 | 0 |
| ms_zoningRL | 0.4201654 | 0.0515950 | 8.143535 | 0 |
| ms_zoningRM | 0.3921489 | 0.0482741 | 8.123385 | 0 |
| x2nd_flr_sf | 0.0002089 | 0.0000258 | 8.102472 | 0 |
| overall_cond | 0.0328835 | 0.0042818 | 7.679909 | 0 |
In the next step, typos will be fixed, outliers will be removed, and further variables which could be useful for predicting Sale Price will be created.
The plot below shows two data points which were removed, since they are clearly atypical on visual inspection and the model performance was better without them:
#fix typos
combined <- combined %>% mutate(roof_matl = recode(roof_matl, 'Tar&Grv' = "TarGrv")) %>% # drop the special character
mutate(exterior1st = recode(exterior1st, 'Wd Sdng'="WdSdng")) %>%
mutate(exterior2nd = recode(exterior2nd,'Brk Cmn' = 'BrkComm', 'CmentBd' = 'CemntBd', 'Wd Sdng' = 'WdSdng', 'Wd Shng' = 'WdShing')) %>%
mutate(garage_yr_blt = replace(garage_yr_blt, garage_yr_blt > 2010, NA)) # use a real NA, not the string 'NA'
#Outliers
ggplot(combined, aes(x=gr_liv_area, y=sale_price))+geom_point()+
geom_text(data=combined[combined$gr_liv_area>4500,], mapping=aes(label=id), vjust=1.5, col = "red" ) +
theme_classic()
combined <- combined %>% filter(!(gr_liv_area>4500 & !is.na(sale_price)))
There are many variables in the dataset which can be further transformed to extract new features. In the following code, 15 new features are created. Many of them are indicators of whether the house contains a certain feature (such as a pool or a second floor); others combine existing variables with the goal of creating a single variable with a stronger relationship to SalePrice than the individual variables have. For example, the variable "overall_sf" (overall square feet) is created by combining the above-ground living area and the total basement area (both measured in square feet).
combined <- combined %>% mutate(has_bsmt = as.numeric(total_bsmt_sf!=0),
has_2ndFloor = as.numeric(x2nd_flr_sf!=0),
has_pool = as.numeric(pool_area!=0),
has_porch = as.numeric((open_porch_sf+enclosed_porch+x3ssn_porch+screen_porch)!=0),
has_remod = factor(year_remod_add != year_built),
has_fireplace = as.numeric(fireplaces >0),
is_new = as.numeric(yr_sold==year_built),
overall_sf = gr_liv_area + total_bsmt_sf,
overall_bath = full_bath + 0.5*half_bath + 0.5 *bsmt_half_bath + bsmt_full_bath,
house_age = yr_sold - year_remod_add,
yr_sold = factor(yr_sold),
mo_sold = factor(mo_sold),
tot_porch_sf = open_porch_sf + enclosed_porch +x3ssn_porch + screen_porch,
bsmt_bath = as.numeric(bsmt_half_bath + bsmt_full_bath !=0),
is_bsmt_unf = as.numeric(total_bsmt_sf == bsmt_unf_sf))
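To sanity-check a combined feature such as overall_sf, we can compare its correlation with the log sale price to that of its components. A small illustrative sketch on the training rows (the exact values are not reproduced here):
# A stronger correlation with log(sale_price) than either component would
# suggest that overall_sf carries more signal on its own.
combined %>%
  filter(!is.na(sale_price)) %>%
  summarise(
    cor_overall_sf  = cor(overall_sf, log(sale_price)),
    cor_gr_liv_area = cor(gr_liv_area, log(sale_price)),
    cor_total_bsmt  = cor(total_bsmt_sf, log(sale_price))
  )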
#update training and testing data that it includes all transformations
training_data <- combined %>% filter(!is.na(sale_price))
testing_data <- combined %>% filter(is.na(sale_price))
Now we will start the modelling part. As said before, an XGBoost regression model will be used. Further feature engineering will be done with the recipes package, since it provides comprehensive preprocessing possibilities in a nice and tidy code structure.
We will start by preparing our data. To assess performance on the training dataset, resampling will be used in the form of 10-fold cross-validation. Our XGBoost model is specified with the parsnip package. In addition, the recipes package is used to combine all transformations and other model-related preprocessing steps into a single block that can be applied to any subset of the data.
The recipe we use applies the following transformations:

- assign the id column an identifier role so it is not used as a predictor,
- remove street and utilities as well as all columns with more than 40% missing values (drop_cols),
- assign a placeholder level to factor levels that only appear in new data (step_novel),
- impute the remaining missing values via k-nearest neighbours,
- convert all nominal predictors to dummy variables,
- remove near-zero-variance predictors,
- log-transform the outcome (with an offset of 1).
After specifying the recipe, we run a first XGBoost model on the cross-validation folds to get an impression of performance and variable importance without tuning. The results are displayed in the following table:
set.seed(121)
vfold_data <- rsample::vfold_cv(data = training_data, v = 10)
xgb_recipe <- training_data %>%
recipe(sale_price ~ .) %>%
update_role(id, new_role = "id var") %>%
step_rm(street, utilities) %>%
step_rm(one_of(drop_cols)) %>%
step_novel(all_predictors(), -all_numeric()) %>%
step_impute_knn(all_predictors()) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_nzv(everything()) %>%
step_log(all_outcomes(), offset = 1)
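Before fitting, it is worth checking what the recipe actually produces. In the recipes package, prep() estimates the steps on the training data and bake(new_data = NULL) returns the processed training set, so we can verify that the dropped columns are gone and the dummy variables were created:
# Estimate the recipe and inspect the processed training data.
xgb_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()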
#specify model
xgb <- parsnip::boost_tree(trees = 100) %>%
set_engine("xgboost", objective = "reg:squarederror") %>%
set_mode("regression")
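To see how parsnip maps this specification onto the underlying xgboost training call, including the defaults it fills in, translate() can be used:
# Show the engine-level training call generated by parsnip.
xgb %>% translate()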
xgb_wf <- workflows::workflow() %>%
workflows::add_recipe(xgb_recipe) %>%
workflows::add_model(xgb)
# performance on cross-validation
xgb_fit <- xgb_wf %>% fit_resamples(vfold_data)
kable(collect_metrics(xgb_fit))
| .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|
| rmse | standard | 0.1388313 | 10 | 0.0045048 | Preprocessor1_Model1 |
| rsq | standard | 0.8798578 | 10 | 0.0071070 | Preprocessor1_Model1 |
Without further optimizing, we get an RMSE of 0.139 and an R-squared of 0.880 on the cross-validation folds.
After looking at variable importance, we will improve the accuracy with model tuning.
#variable importance
xgb_wf %>% fit(training_data) %>%
extract_fit_parsnip() %>%
vip() +
theme_classic()
Now it is time to improve our model. For that purpose, the model and the parameters to tune are defined. We will use R-squared as the metric to monitor during optimisation. Then, the defined parameters are tuned via a space-filling grid search and the model accuracy after tuning is assessed.
set.seed(1234)
xgb_tuning <- parsnip::boost_tree(
trees = 1324,
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(), ## first three: model complexity
sample_size = 0.212, mtry = 80, ## randomness
learn_rate = 0.00893, ## step size
) %>%
set_engine("xgboost", objective = "reg:squarederror") %>%
set_mode("regression")
xgb_grid <- grid_latin_hypercube(
#trees(),
tree_depth(),
min_n(),
loss_reduction(),
#sample_size = sample_prop(),
#finalize(mtry(), training_data),
#learn_rate(),
size = 50
)
xgb_wf_tune <- workflows::workflow() %>%
workflows::add_recipe(xgb_recipe) %>%
workflows::add_model(xgb_tuning)
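A note on runtime: 50 candidate parameter sets evaluated on 10 folds amount to 500 model fits, so this step can take a while. Depending on the tune version, registering a parallel backend beforehand can speed this up considerably; a minimal sketch assuming the doParallel package:
if (!require(doParallel)) install.packages('doParallel')
# Register a parallel backend; tune_grid() can then fit resamples in parallel.
doParallel::registerDoParallel(cores = max(1, parallel::detectCores() - 1))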
#tune hyperparameters
xgb_res <- tune_grid(
xgb_wf_tune,
resamples = vfold_data,
grid = xgb_grid,
control = control_grid(save_pred = TRUE)
)
#look at tuning results
kable(head(xgb_res %>% collect_metrics() %>% arrange(desc(.metric), -mean)))
| min_n | tree_depth | loss_reduction | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|
| 6 | 14 | 0.0000000 | rsq | standard | 0.9166721 | 10 | 0.0080414 | Preprocessor1_Model06 |
| 5 | 9 | 0.0000307 | rsq | standard | 0.9162517 | 10 | 0.0078405 | Preprocessor1_Model26 |
| 3 | 6 | 0.0000000 | rsq | standard | 0.9160771 | 10 | 0.0081234 | Preprocessor1_Model44 |
| 7 | 8 | 0.0023759 | rsq | standard | 0.9158317 | 10 | 0.0082477 | Preprocessor1_Model12 |
| 7 | 14 | 0.0006571 | rsq | standard | 0.9158051 | 10 | 0.0081564 | Preprocessor1_Model27 |
| 9 | 14 | 0.0003326 | rsq | standard | 0.9148765 | 10 | 0.0084475 | Preprocessor1_Model49 |
xgb_res %>%
collect_metrics() %>%
filter(.metric == "rsq") %>%
select(mean, min_n:loss_reduction ) %>%
pivot_longer(min_n:loss_reduction,
values_to = "value",
names_to = "parameter"
) %>%
ggplot(aes(value, mean, color = parameter)) +
geom_point(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~parameter, scales = "free_x") +
labs(x = NULL, y = "R-Squared") +
theme_classic() +
theme(panel.background = element_rect(colour = 'black'))
After model tuning, the best candidates reach a cross-validation R-squared of up to 0.917 (see the table above).
Compared to 0.880 without tuning, the accuracy was therefore clearly improved.
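The best parameter combinations can also be listed directly with tune's show_best():
# Top five candidates ranked by cross-validated R-squared.
show_best(xgb_res, metric = "rsq", n = 5)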
Finally, the model with the best accuracy (lowest cross-validated RMSE) is selected, and this final model is fitted on the whole training data to make the test predictions.
best_xgb <- xgb_res %>%
select_best(metric = "rmse")
final_wf <- xgb_wf_tune %>%
finalize_workflow(best_xgb)
ind <- list(analysis = seq(nrow(training_data)), assessment = nrow(training_data) + seq(nrow(testing_data)))
splits <- make_splits(ind, combined)
final_fit <- final_wf %>%
last_fit(splits)
submission <- final_fit %>% collect_predictions() %>% transmute(ID = testing_data$id, SalePrice = expm1(.pred)) # expm1 inverts the log(x + 1) applied in the recipe
submission %>% write_csv(here("Data", "submission.csv"))
This project showed a basic workflow to set up and improve an XGBoost regression model to predict house prices. While an XGBoost model without hyperparameter tuning already showed good results, it is great to see that the cross-validation performance in terms of R-squared could be further improved by 2.75 percentage points through hyperparameter tuning. In total, 90.7% of the variance in the training set can be explained by our prediction model.