This document is a response to the Kaggle Challenge “House Prices - Advanced Regression Techniques”.
The goal of this challenge is to predict house prices in Ames, Iowa using the available information, which comprises 79 explanatory variables.
After performing target and feature engineering, we fitted three models: a Random Forest, a Gradient Boosting Machine and an Elastic Net. Finally, we fitted a stacked model that combines the advantages of these three base learners to produce a model with a smaller RMSE and a better R-squared.
When applied to the testing data set and submitted to Kaggle, the stacked model returned an RMSLE of 0.12251 which, at the time of writing, is a top 15% submission.
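Kaggle scores this competition with the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price, commonly reported as RMSLE. A minimal sketch of the metric for reference (the rmsle helper below is our own, not a competition utility):
# RMSE between the logs of predicted and observed prices; the common RMSLE
# formulation uses log1p, but for prices of this magnitude the difference is negligible
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted) - log(actual))^2))
}
rmsle(c(200000, 150000), c(208500, 140000))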
library(rlang)
library(httr)
library(jsonlite)
library(data.table)
library(tidyr)
library(ggplot2)
library(dplyr)
library(showtext)
library(extrafont)
library(viridis)
library(RColorBrewer)
library(ggsci)
library(lubridate)
library(forcats)
library(broom)
library(WVPlots)
library(caret)
library(pROC)
library(recipes)
library(readr)
library(ROCR)
library(vip)
library(xts)
library(stringr)
library(scales)
library(ggtext)
library(rpart)
library(rpart.plot)
library(ranger)
library(VGAM)
library(randomForest)
library(gridExtra)
library(ggplotify)
library(grid)
library(ggthemes)
library(moments)
library(cowplot)
library(egg)
library(patchwork)
library(outliers)
library(Hmisc)
library(zoo)
library(earth)
library(caretEnsemble)
library(h2o)
library(pdp)
library(gbm)
library(xgboost)
library(DALEX)
loadfonts(device = "win")
font_add("Palatino", "pala.ttf")
showtext_auto()
base_size <- 20
theme1 <- theme_minimal(base_size = base_size) +
theme(text = element_text(family = "Palatino"),
panel.grid.minor.x = element_line(linetype = "dashed"),
panel.grid.minor.y = element_line(linetype = "dashed"))
my_theme <- function(type = c("discrete", "continuous")){
if (type == "discrete"){
list(theme1, scale_color_jama(), scale_fill_jama())
} else {
list(theme1, scale_fill_continuous(high = pal_jama("default")(9)[1], low = "lightskyblue"))
}
}
color1 <- pal_jama("default")(9)[1]
color2 <- pal_jama("default")(9)[2]
color3 <- "lightskyblue"
Let’s load the training set and the testing set and have a look at the data structure.
test <- fread("test.csv")
train <- fread("train.csv")
str(train)
## Classes 'data.table' and 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ 1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ 2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ 3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
## - attr(*, ".internal.selfref")=<externalptr>
To make predictions on the testing data set, we will need it to have the same columns as the training one. So we’ll add the SalePrice variable.
test <- test %>%
mutate(SalePrice = as.integer(0))
We will combine the features YrSold and MoSold to analyze each sale by its exact month and year (a new variable called SaleDate). If we plot the histogram for this feature, we will notice how seasonal the real estate market is (sales improve in the summer). Notice the color of the bars: the darker they are, the later in the year the sale took place. Sales start the year slowly, pick up mid-year, then diminish again by winter.
date_function <- function(df){
df %>%
mutate(MoSold = ifelse(MoSold < 10, paste(0,MoSold,sep =""), MoSold)) %>%
mutate(SaleDate = paste(MoSold,"-01-",YrSold, sep = "")) %>%
mutate(SaleDate = as.Date(SaleDate, "%m-%d-%Y"))
}
train <- train %>%
date_function()
test <- test %>%
date_function()
sales_by_month <- train %>%
group_by(SaleDate) %>%
summarise(Count = n(), meanPrice = mean(SalePrice))
sales_by_month <- sales_by_month %>%
mutate(movingAvg = (meanPrice + lag(meanPrice) + lag(meanPrice, 2) +
lag(meanPrice, 3) + lag(meanPrice, 4) +
lag(meanPrice, 5))/6,
totalAvg = cumsum(meanPrice)/index(sales_by_month))
sales_by_month %>%
ggplot(aes(x = SaleDate, y = Count, fill = lubridate::month(SaleDate))) +
geom_col() +
scale_x_date(breaks = unique(sales_by_month$SaleDate)) +
labs(title = "Houses Sold by Date") +
my_theme(type = "continuous") +
theme(axis.text.x = element_text(angle = 270),
legend.position = "none")
However, if we check the average Sale Price by month, we see another pattern: in most years (2009 being an exception), houses were on average more expensive by the end of the year. The gold line is the six-month moving average sale price and helps us visualize the seasonality. The dashed red line is the cumulative moving average sale price, which suggests another conclusion: Sale Price declined slightly over the years, but it hasn’t changed dramatically.
sales_by_month %>%
ggplot(aes(x = SaleDate, y = meanPrice)) +
geom_col(aes(fill = lubridate::month(SaleDate)), show.legend = FALSE) +
geom_line(aes(y = movingAvg, color = "6-month Moving Average"), size = base_size/10) +
geom_line(aes(y = totalAvg, color = "Cumulative Moving Average"), size = base_size/15, linetype = "dashed") +
scale_x_date(breaks = unique(sales_by_month$SaleDate)) +
scale_y_continuous(expand = c(0,0)) +
scale_color_manual(name = "Average",
breaks = c("6-month Moving Average", "Cumulative Moving Average"),
values = c("6-month Moving Average" = color2,
"Cumulative Moving Average" = "darkred")) +
labs(title = "Average House Price per Date") +
my_theme(type = "continuous") +
theme(axis.text.x = element_text(angle = 270),
legend.position = "bottom")
In conclusion, we will turn the MoSold feature into a factor variable, since it seems to have an impact on the price, but it isn’t a purely numerical feature.
Some of the categorical features in the data set are qualitative features that can be modified to better capture their relationship with the response variable. For example, ExterCond evaluates the condition of the materials on the exterior of the property, from Poor to Excellent, the latter being naturally preferred. We will perform label-encoding transformations to turn this kind of feature into integer features. The features to undergo label-encoding are essentially the ones that evaluate the quality (suffixes Qual or QC) or condition (suffix Cond) of various areas of the house.
We will also perform idiosyncratic transformations on the following variables:
YearBuilt: we will cut and group the years in a way that divides the sample into levels with a similar number of observations. Since this will create a naturally ordinal feature, we will also add this to the label-encoding step, which will give the oldest house the minimum value and the newest house the maximum value. This variable will be called YearBuiltGrouped.
YearRemodAdd: this shall undergo the same transformation that we applied to the YearBuilt feature, except it concerns the year of the last remodeling.
MSSubClass: even though it is listed as an integer feature, we will turn it into a factor since it encodes the type of dwelling involved in the sale.
Other features will require replacement of the NA values with a string, since in these cases the NAs do not mean the values are missing, but rather that the house does not have that specific feature. (For example, the NAs in GarageType mean that the house doesn’t have a garage.)
levels1 <- rev(c("Ex", "Gd", "TA", "Fa", "Po"))
levels2 <- rev(c("Ex", "Gd", "TA", "Fa", "Po", "None"))
levels3 <- rev(c("Gd", "Av", "Mn", "No", "None"))
levels4 <- rev(c("GLQ", "ALQ", "BLQ", "Rec", "LwQ", "Unf", "None"))
levels5 <- rev(c("Typ", "Min1", "Min2", "Mod", "Maj1", "Maj2", "Sev", "Sal"))
levels6 <- rev(c("Fin", "RFn", "Unf", "None"))
levels7 <- rev(c("GdPrv", "MnPrv", "GdWo", "MnWw", "None"))
levels_yrbuilt <- cut2(train$YearBuilt, g = 8, onlycuts = TRUE)
levels_yrremod <- cut2(train$YearRemodAdd, g = 8, onlycuts = TRUE)
mutate_fun <- function(df){
df %>%
rename(FirstFloorSF = "1stFlrSF",
SecondFloorSF = "2ndFlrSF",
ThreeSsnPorch = "3SsnPorch") %>%
mutate(MSSubClass = factor(MSSubClass),
MoSold = factor(MoSold),
YrSold = factor(YrSold),
YearBuiltGrouped = cut(YearBuilt, breaks = levels_yrbuilt, include.lowest = TRUE),
YearRemodGrouped = cut(YearRemodAdd, breaks = levels_yrremod, include.lowest = TRUE),
GarageType = replace_na(GarageType, "None"),
Alley = replace_na(Alley, "None"),
MiscFeature = replace_na(MiscFeature, "None"),
ExterCond = fct_relevel(ExterCond, levels = levels1),
ExterQual = fct_relevel(ExterQual, levels = levels1),
HeatingQC = fct_relevel(HeatingQC, levels = levels1),
Functional = fct_relevel(Functional, levels = levels5),
KitchenQual = fct_relevel(KitchenQual, levels = levels1),
FireplaceQu = fct_relevel(replace_na(FireplaceQu, "None"), levels = levels2),
GarageQual = fct_relevel(replace_na(GarageQual, "None"), levels = levels2),
GarageCond = fct_relevel(replace_na(GarageCond, "None"), levels = levels2),
PoolQC = fct_relevel(replace_na(PoolQC, "None"), levels = levels2),
Fence = fct_relevel(replace_na(Fence, "None"), levels = levels7),
GarageFinish = fct_relevel(replace_na(GarageFinish, "None"), levels = levels6),
BsmtQual = fct_relevel(replace_na(BsmtQual, "None"), levels = levels2),
BsmtCond = fct_relevel(replace_na(BsmtCond, "None"), levels = levels2),
BsmtExposure = fct_relevel(replace_na(BsmtExposure, "None"), levels = levels3),
BsmtFinType1 = fct_relevel(replace_na(BsmtFinType1, "None"), levels = levels4),
BsmtFinType2 = fct_relevel(replace_na(BsmtFinType2, "None"), levels = levels4)) %>%
mutate_if(is.character, as.factor)
}
train <- train %>%
mutate_fun() %>%
select(-c(YearBuilt, YearRemodAdd))
test <- test %>%
mutate_fun() %>%
select(-c(YearBuilt, YearRemodAdd))
We will also deal with missing values by performing a KNN-imputation, which will search for the 6 nearest neighboring observations and provide imputations based on the average or the mode of these observations.
Finally, we execute the feature engineering “recipe” to label-encode the features and impute the missing values.
label_encode <- c("ExterCond", "ExterQual", "HeatingQC",
"Functional", "KitchenQual","FireplaceQu",
"GarageQual", "GarageCond", "PoolQC",
"Fence", "GarageFinish",
"BsmtQual", "BsmtCond",
"BsmtExposure","BsmtFinType1", "BsmtFinType2",
"YearBuiltGrouped", "YearRemodGrouped")
(recipe <- recipe(SalePrice ~ ., data = train) %>%
step_integer(all_of(label_encode)) %>%
step_other(all_nominal(), threshold = 0.1, other = "other") %>%
step_knnimpute(all_predictors(), neighbors = 6))
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 81
##
## Operations:
##
## Integer encoding for all_of(label_encode)
## Collapsing factor levels for all_nominal()
## K-nearest neighbor imputation for all_predictors()
prepare <- prep(recipe, training = train)
train <- bake(prepare, new_data = train)
test <- bake(prepare, new_data = test)
In this section, we evaluate the distribution of our response variable. Can log-transformation make its shape more similar to a normal distribution?
ggplot(train, aes(x = SalePrice)) +
geom_histogram(aes(y = ..count..), fill = color3, color = color1, size = base_size*0.05) +
my_theme(type = "discrete") +
theme(axis.title.y = element_blank()) +
ggtitle("Response Variable Distribution")
ggplot(train, aes(x = log(SalePrice))) +
geom_histogram(aes(y = ..count..), fill = color3, color = color1, size = base_size*0.05) +
my_theme(type = "discrete") +
theme(axis.title.y = element_blank()) +
ggtitle("Response Variable Distribution (Log-Transformated)")
Log-transformation of the response variable lowers the skewness and kurtosis of its distribution.
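A quick check with the moments package (already loaded) supports this; the exact values depend on the data, but the log-transformed response sits much closer to the normal benchmarks of skewness 0 and kurtosis 3:
# skewness and kurtosis of the response, before and after the log transformation
skewness(train$SalePrice)
skewness(log(train$SalePrice))
kurtosis(train$SalePrice)
kurtosis(log(train$SalePrice))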
We will add this log transformation to the baking recipe. We will also add a Yeo-Johnson transformation to normalize the numeric predictors, along with standardization (centering and scaling) and near-zero variance removal.
In this section we analyze numeric features. We already determined that log-transforming SalePrice will help normalize its distribution.
We will also perform transformations to normalize and standardize the numeric features. In this step, we add to the recipe a Yeo-Johnson transformation, which determines by itself which features require normalizing, plus a centering and a scaling step.
Dropping near-zero variance predictors is also a step we will add, since it performs feature filtering very early in the process.
(recipe2 <- recipe(SalePrice ~ ., data = train) %>%
step_log(all_outcomes()) %>%
step_nzv(all_predictors()) %>%
step_YeoJohnson(all_numeric()) %>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes()))
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 81
##
## Operations:
##
## Log transformation on all_outcomes()
## Sparse, unbalanced variable filter on all_predictors()
## Yeo-Johnson transformation on all_numeric()
## Centering for all_numeric(), -all_outcomes()
## Scaling for all_numeric(), -all_outcomes()
prepare2 <- prep(recipe2, training = train)
baked_train <- bake(prepare2, new_data = train)
baked_train <- cbind(train$Id, baked_train)
baked_train <- baked_train %>%
select(-Id) %>%
rename(Id = "train$Id")
baked_test <- bake(prepare2, new_data = test)
baked_test <- cbind(test$Id, baked_test)
baked_test <- baked_test %>%
select(-Id) %>%
rename(Id = "test$Id")
This step dropped 18 variables from the training set. This will help us perform cleaner Exploratory Data Analysis.
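As a quick sanity check (not part of the modeling pipeline), we can list exactly which predictors the near-zero variance filter removed by comparing the column names before and after baking:
# columns present before recipe2 but dropped by the near-zero variance filter
setdiff(names(train), names(baked_train))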
Let’s take a look at the numeric features in the “baked” data set. Which will have higher correlation with the response variable?
cormat <- cor(baked_train %>% select_if(is.numeric) %>% select(-Id))
get_upper_tri<-function(cormat){
cormat[lower.tri(cormat)] <- NA
return(cormat)
}
upper_tri <- melt(get_upper_tri(cormat), na.rm = TRUE)
ggplot(upper_tri, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
geom_text(aes(x = Var1, y = Var2, label = round(value, 2)),
color = "white", size = base_size/5) +
labs(title = "Correlation Matrix of Numeric Features",
fill = "Correlation") +
my_theme(type = "continuous") +
theme(axis.text.x = element_text(angle = 270),
axis.title = element_blank(),
legend.position = "none")
upper_tri %>%
filter(Var2 == "SalePrice", Var1 != "SalePrice") %>%
arrange(desc(value)) %>%
head(10)
Qualitative evaluation features have a high correlation with the final Sale Price. Amongst these, the Overall Quality is the most important, followed by the exterior quality, the kitchen quality and the basement quality.
We also see that many of the features measuring the square footage of different areas of the house seem to be important. The most relevant one is GrLivArea, the above-ground living area, followed by the areas of the garage, the basement and the first floor.
The number of cars that fit in the garage is important, as is the period in which the house was built.
Which features have low correlation with the response variable?
upper_tri %>%
filter(Var2 == "SalePrice", Var1 != "SalePrice") %>%
arrange(value) %>%
head(10)
The Fence, the overall condition, the number of basement half-bathrooms and the External Condition have the lowest correlations with the response variable. Next, let’s look at the distributions of some of the most relevant ordinal features and their relationship with the response variable.
boxplot_plot <- function(df, yvar, xvar, logtrans = FALSE){
x <- eval(substitute(xvar), df)
y <- eval(substitute(yvar), df)
plot <- ggplot(data = df, aes(x = factor(df[[x]]), y = df[[y]])) +
geom_boxplot(color = color1, fill = "lightskyblue", size = base_size*0.05) +
my_theme(type = "discrete") +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_line(linetype = "dashed")) +
labs(title = paste(as.character(substitute(y))," by ",
as.character(substitute(x))),
y = paste(as.character(substitute(y))),
x = paste(as.character(substitute(x))))
if(logtrans == FALSE){
plot + scale_y_continuous(labels = dollar, breaks = pretty_breaks(n = 5))
} else {
plot + scale_y_log10(labels = dollar, breaks = pretty_breaks(n = 5))
}
}
histogram_plot <- function(df, xvar){
x <- eval(substitute(xvar), df)
ggplot(data = df, aes(x = factor(df[[x]]))) +
geom_histogram(stat = 'count', aes(y = ..count..),
fill = color1, color = "lightskyblue", size = base_size/7.5) +
my_theme(type = "discrete") +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_line(linetype = "dashed")) +
labs(title = "Histogram",
y = "Count",
x = paste(as.character(substitute(xvar))))
}
ordered_analysis <- function(df, yvar, xvar){
grid.arrange(
histogram_plot(train, xvar),
boxplot_plot(train, xvar, yvar),
nrow = 1, ncol = 2
)
}
ordered_analysis(train, yvar = "SalePrice", xvar = "OverallQual")
ordered_analysis(train, yvar = "SalePrice", xvar = "ExterQual")
ordered_analysis(train, yvar = "SalePrice", xvar = "KitchenQual")
As expected, the sale price increases with the quality rating of the rooms in the house. Also, most houses receive a medium quality rating.
ordered_analysis(train, yvar = "SalePrice", xvar = "YearBuiltGrouped")
We have grouped the years in which the houses were built, so the higher the “grade” (from 1 to 8), the newer the house. The average sale price is higher for newer houses, but the change is very slight among the older houses (groups 1 to 3).
Now let’s plot the most relevant continuous variables. We will fit a simple linear model to try to capture the statistical relationship between these features and the response variable. We will also plot the model residuals to check for homoscedasticity. The blue line in the residual plot is a LOESS smoothing of the residuals.
outliers <- function(df, x, upper, lower){
x <- eval(quote(x), df)
sel <- df[[x]]
upper_bound <- quantile(sel, upper)
lower_bound <- quantile(sel, lower)
return(sel[sel > upper_bound | sel < lower_bound])
}
density_plot <- function(df,xvar, upper.bound, lower.bound, logtrans = FALSE){
x <- eval(substitute(xvar), df)
out <- outliers(df, xvar, upper = upper.bound, lower = lower.bound)
df1 <- subset(df, !df[[x]] %in% out)
plot <- ggplot(data = df1, aes(x = df1[[x]])) +
geom_histogram(aes(y = ..density..),
bins = 50, fill = color1, alpha = 0.5) +
geom_density(aes(y = ..density..),
size = base_size*0.05, color = color1, fill = color3, alpha = 0.5) +
my_theme(type = "discrete") +
theme(panel.grid.minor.x = element_line(linetype = "dashed"),
panel.grid.minor.y = element_line(linetype = "dashed")) +
labs(title = "Density Plot",
y = "Density",
x = paste(as.character(substitute(x))))
if(logtrans == FALSE){
return(plot + scale_x_continuous(breaks = pretty_breaks(n = 5)))
} else {
return(plot + scale_x_log10(breaks = pretty_breaks(n = 5)))
}
}
scatter_plot <- function(df,yvar,xvar, upper.bound, lower.bound,logtrans = FALSE){
x <- eval(substitute(xvar), df)
y <- eval(substitute(yvar), df)
out <- outliers(df, xvar, upper = upper.bound, lower = lower.bound)
df1 <- subset(df, !df[[x]] %in% out)
plot <- ggplot(data = df1, aes(x = df1[[x]], y = df1[[y]])) +
geom_point(color = color1, alpha = 0.5) +
geom_smooth(method = "lm", color = color3) +
scale_y_log10(labels = dollar) +
my_theme(type = "discrete") +
labs(title = paste(as.character(substitute(y))," by ",
as.character(substitute(x))),
y = paste(as.character(substitute(y))),
x = paste(as.character(substitute(x))))
if(logtrans == FALSE){
return(plot + scale_x_continuous(breaks = pretty_breaks(n = 5)))
} else {
return(plot + scale_x_log10(breaks = pretty_breaks(n = 5)))
}
}
residual_plot <- function(df, response, predictor, xvar, upper.bound, lower.bound){
x <- eval(substitute(xvar), df)
predictor <- eval(substitute(predictor), df)
y <- eval(substitute(response), df)
out <- outliers(df, xvar, upper = upper.bound, lower = lower.bound)
df1 <- subset(df, !df[[x]] %in% out)
formula <- as.formula(paste(as.character(substitute(y)), "~",
as.character(substitute(predictor))))
fit <- train(
form = formula,
data = df1,
method = "lm",
trControl = trainControl(method = "cv", number = 10)
)
ggplot(fit$finalModel, aes(fitted(fit), resid(fit))) +
geom_point(alpha = .6, color = color1) +
geom_smooth(method = "loess", color = color3, size = base_size*0.08, se = FALSE) +
geom_hline(yintercept=0, color = color2, size = base_size*0.05, alpha = 0.5) +
scale_x_continuous(expand = c(0, 0)) +
labs(title = "Model Residuals",
x = "Predicted Values",
y = "Residuals") +
my_theme(type = "discrete")
}
numerical_analysis <- function(df, yvar, xvar, response, predictor,
upper.bound = 0.975, lower.bound = 0.025, logtrans = FALSE){
grid.arrange(
density_plot(df, xvar, upper.bound = upper.bound, lower.bound = lower.bound,
logtrans = logtrans),
scatter_plot(df, yvar, xvar, upper.bound = upper.bound, lower.bound = lower.bound,
logtrans = logtrans),
residual_plot(df, response = response, predictor = predictor, xvar = xvar,
upper.bound = upper.bound, lower.bound = lower.bound),
ncol = 2,
layout_matrix = cbind(c(1,1), c(2,3))
)
}
numerical_analysis(train, yvar = "SalePrice", xvar = "GrLivArea",
response = "log(SalePrice)", predictor = "log(GrLivArea)",
logtrans = TRUE)
We have log-scaled GrLivArea (Above Ground Living Area) to normalize its distribution. This feature has a strong linear relationship with the log-scaled Sale Price. We also see that the errors seem to have constant variance.
numerical_analysis(train, yvar = "SalePrice", xvar = "GarageArea",
response = "log(SalePrice)", predictor = "GarageArea",
logtrans = FALSE)
For GarageArea, the most noticeable thing is the high occurrence of zeros, which indicates that a sizable number of houses have no garage at all.
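We can confirm this share directly on the training set:
# proportion of houses in the training set with zero garage area
mean(train$GarageArea == 0)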
numerical_analysis(train, yvar = "SalePrice", xvar = "TotalBsmtSF",
response = "log(SalePrice)", predictor = "TotalBsmtSF")
numerical_analysis(train, yvar = "SalePrice", xvar = "FirstFloorSF",
response = "log(SalePrice)", predictor = "log(FirstFloorSF)",
logtrans = TRUE)
For the Basement and First Floor areas, the model residuals show higher variance than for the two previously analyzed features.
Now, on to the categorical features. First of all, we will one-hot encode the nominal variables, turning each categorical feature into a set of dummy variables taking the values 0 or 1.
(recipe3 <- recipe(SalePrice ~ ., data = baked_train) %>%
step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE))
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 63
##
## Operations:
##
## Dummy variables from all_nominal(), -all_outcomes()
prepare3 <- prep(recipe3, data = baked_train)
baked_train1 <- bake(prepare3, new_data = baked_train)
baked_test <- bake(prepare3, new_data = baked_test)
Let’s calculate the Point Biserial Correlation between each dummy variable and the response variable and plot the largest absolute values that are statistically significant.
baked_train_dummy <- baked_train1 %>%
select_if(~ is.numeric(.) && all(between(., 0, 1)))
pbs_correlation <- data.frame(0)
for (i in 1:ncol(baked_train_dummy)){
x <- baked_train_dummy[[i]]
cortest <- cor.test(baked_train[["SalePrice"]], x)
pbs_correlation[i,1] <- base::colnames(baked_train_dummy[,i])
pbs_correlation[i,2] <- cortest$p.value
pbs_correlation[i,3] <- cortest$estimate
}
base::colnames(pbs_correlation) <- c("Feature", "P_value", "Correlation")
pbs_correlation %>%
filter(P_value <= 0.05) %>%
ggplot(aes(x = Correlation)) +
geom_col(data = pbs_correlation %>%
filter(P_value <= 0.05, Correlation > 0.25 | Correlation < -0.25) %>%
arrange(desc(Correlation)),
aes(x = Correlation, y = fct_reorder(Feature, Correlation),
fill = Correlation)) +
my_theme(type = "continuous") +
labs(title = "Point Biserial Correlation: dummy variables vs. SalePrice",
y = "")
Foundation_PConc is the dummy with the highest positive correlation: houses whose foundation is poured concrete tend to sell for higher prices.
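A quick group-by on the training data illustrates the same point (note that rare Foundation levels may already have been collapsed by the step_other step of the first recipe):
# count and median sale price by foundation type
train %>%
  group_by(Foundation) %>%
  summarise(Count = n(), MedianPrice = median(SalePrice)) %>%
  arrange(desc(MedianPrice))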
The basis of our model will be a Random Forest model. By building several decision trees on bootstrapped copies of the training set, a bagged tree process will have good predictive performance with low variance. Furthermore, combining this with the randomization of a limited subset of features to be considered at each split, we will have a Random Forest, which further improves the model by achieving de-correlated trees.
We will perform a grid search to automatically tune the hyper-parameters of the model. Using the ranger method, the tunable hyper-parameters are: mtry, the size of the subset of features considered at each split; min.node.size, the minimum number of observations in a terminal node; and splitrule, the criterion used to decide each split.
response <- "SalePrice"
predictors <- setdiff(names(baked_train1), response)
n_features <- length(predictors)
set.seed(123)
rf_grid <- expand.grid(
mtry = floor(n_features * c(.25, .333,.5,.666,.75)),
min.node.size = c(2,3,5),
splitrule = c("extratrees")
)
train_control <- trainControl(
method = "cv",
number = 5,
savePredictions = "final",
index=createResample(baked_train1$SalePrice, 5))
(rf_model <- train(
SalePrice~.,
data= baked_train1,
method = 'ranger',
tuneGrid = rf_grid,
trControl= train_control))
## Random Forest
##
## 1460 samples
## 110 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1460, 1460, 1460, 1460, 1460
## Resampling results across tuning parameters:
##
## mtry min.node.size RMSE Rsquared MAE
## 27 2 0.01414142 0.8633831 0.009432508
## 27 3 0.01412139 0.8637996 0.009436124
## 27 5 0.01414600 0.8631515 0.009435525
## 36 2 0.01400858 0.8650781 0.009327478
## 36 3 0.01402868 0.8648902 0.009323417
## 36 5 0.01405492 0.8643441 0.009370735
## 55 2 0.01386001 0.8669270 0.009234908
## 55 3 0.01389116 0.8665296 0.009247718
## 55 5 0.01383715 0.8676296 0.009233661
## 73 2 0.01380998 0.8674738 0.009207579
## 73 3 0.01381440 0.8673175 0.009202873
## 73 5 0.01381143 0.8675723 0.009208541
## 82 2 0.01379674 0.8678737 0.009192222
## 82 3 0.01381949 0.8670338 0.009228803
## 82 5 0.01381297 0.8670811 0.009195442
##
## Tuning parameter 'splitrule' was held constant at a value of extratrees
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 82, splitrule = extratrees
## and min.node.size = 2.
The grid search selected the hyper-parameters that optimized the overall RMSE. We see that it has selected a subset size of 82, which corresponds to roughly 75% of the original feature space.
We also notice that the optimal RMSE of 0.0137967 is lower than the standard deviation of the response variable of 0.0371552. This indicates that our model fares better than the naive estimate of the mean for all predictions. Mind you, these values have undergone the transformations we chose to apply during pre-processing, so they don’t reflect error estimates in dollars but rather on a normalized, log-adjusted scale.
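Both quantities can be reproduced directly from the processed training set (a quick check on the same transformed scale the model was fit on):
# RMSE of a naive model that always predicts the mean is (approximately)
# the standard deviation of the transformed response
sd(baked_train1$SalePrice)
# the selected mtry corresponds to roughly 75% of the 110 predictors
floor(n_features * 0.75)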
ggplot(rf_model) +
geom_point(size = base_size/5, aes(color = min.node.size)) +
geom_line(aes(group = min.node.size, color = min.node.size),
size = base_size/15) +
guides(shape = FALSE) +
theme(legend.position = "bottom") +
my_theme(type = "discrete") +
labs(
title = "Model Tuning - Random Forest",
color = "Minimal Node Size",
x = "Subset of Features (mtry)"
)
Let’s make predictions on the training set using the model we just trained. How did it fare against the actual sale prices?
lambda <- prepare2$steps[[3]]$lambdas["SalePrice"]
predictions <- as.data.frame(matrix(nrow = 1460))
predictions$rf_pred <- predict(rf_model, newdata = baked_train1)
predictions$SalePrice <- baked_train1$SalePrice
predictions <- as.data.frame(predictions)
predictions <- predictions %>%
mutate(rf_pred = exp(1)^yeo.johnson(rf_pred, lambda = lambda, inverse = TRUE),
SalePrice = exp(1)^yeo.johnson(SalePrice, lambda = lambda, inverse = TRUE))
fitted_values_plot <- function(df, x){
df %>%
ggplot(aes(x = {{x}}, y = SalePrice)) +
geom_point(color = color1, alpha = 0.5) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
my_theme(type = "discrete") +
scale_y_log10(labels = dollar) +
scale_x_log10(labels = dollar) +
coord_flip() +
labs(title = "Fitted Values vs Response Values",
x = "Fitted Values",
y = "Response Values")
}
fitted_values_plot(predictions, x = rf_pred)
When we plot the fitted values against the actual response values, we see that most observations fall fairly close to the dashed line where the fitted value equals the response. That is, except for the really cheap or really expensive houses. Let’s evaluate the residuals.
predictions <- predictions %>%
mutate(rf_residuals = SalePrice - rf_pred)
resid_plot <- function(df, x, y){
df %>%
ggplot(aes(x = {{x}}, y = {{y}})) +
geom_point(color = color1, alpha = 0.5) +
geom_hline(yintercept=0, color = color2, size = 0.8, alpha = 0.5) +
geom_smooth(method = "loess", color = color3, se = TRUE) +
scale_x_log10(labels = dollar) +
scale_y_continuous(labels = dollar) +
my_theme(type = "discrete") +
labs(title = "Model Residuals",
y = "Residuals",
x = "Fitted Values")
}
resid_histogram <- function(df, x){
df %>%
ggplot(aes(x = {{x}})) +
geom_histogram(aes(y = ..count..), bins = 100,
color = color3, fill = color1, size = base_size/30) +
my_theme(type = "discrete") +
labs(title = "Residual Distribution",
x = "Residuals",
y = "Density") +
coord_flip()
}
residual_analysis <- function(df, fitted, resid){
grid.arrange(
resid_plot(df = df, x = {{fitted}}, y = {{resid}}),
resid_histogram(df = df, x = {{resid}}),
nrow = 1, ncol = 2, widths = c(2, 1)
)
}
residual_analysis(df = predictions, fitted = rf_pred, resid = rf_residuals)
Looking at the residual distribution, we see that it is mostly centered around zero, which is a good sign. However, the distribution is right-skewed: when houses are expensive, residuals tend to be very large and positive (predicted values are well below the actual values), and when houses are cheap, residuals are generally negative (predicted values are above the actual values). Random Forest predictions are averages of training observations, which makes them bad at extrapolating.
For most houses, our model will have good predictive power. However, for the outlying, very cheap or very expensive houses, our model will make a very poor prediction.
This means that the Random Forest model is not enough; we will have to combine it with other models that are better at extrapolating.
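A toy example on synthetic data (purely illustrative, using the already-loaded ranger package) shows the extrapolation problem: a forest trained on x between 0 and 10 cannot predict a response larger than anything it saw during training, while a linear model can.
# illustrative only: random forests cannot predict beyond the range of the training response
set.seed(123)
toy <- data.frame(x = runif(500, 0, 10))
toy$y <- 3 * toy$x + rnorm(500)
rf_toy <- ranger(y ~ x, data = toy)
# the forest prediction at x = 20 stays near the maximum observed y (around 30)...
predict(rf_toy, data = data.frame(x = 20))$predictions
# ...while a linear model extrapolates to around 60
predict(lm(y ~ x, data = toy), newdata = data.frame(x = 20))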
We will perform a 5-step process to build an Extreme Gradient Boosting (XGBoost) model. This model is also tree-based, but unlike the Random Forest, it builds shallow trees sequentially, with each new tree fitted via gradient descent on the errors of the current ensemble.
Step one of this process is to tune the hyper-parameters of the boosting process itself: nrounds, which represents the number of trees built, and eta, the “learning rate”, i.e. the rate at which the gradient descent proceeds.
set.seed(123)
xgb_grid <- expand.grid(
nrounds = seq(from = 200, to = 1000, by = 100),
eta = c(0.025, 0.05, 0.1, 0.3),
max_depth = c(2, 3, 5),
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1
)
xgb_model <- train(
SalePrice~.,
data= baked_train1,
method = 'xgbTree',
tuneGrid = xgb_grid,
objective = "reg:squarederror",
trControl= train_control)
In Step 2, we hold constant the values tuned in Step 1 and move on to the tree parameters: the maximum depth of each tree (max_depth) and the minimum sum of instance weights in a child node (min_child_weight). Remember that the goal of the gradient boosting machine is to build shallow trees, not deep ones.
set.seed(123)
xgb_grid2 <- expand.grid(
nrounds = xgb_model$bestTune$nrounds,
eta = xgb_model$bestTune$eta,
max_depth = c(2, 3, 5),
min_child_weight = c(1, 2, 3),
colsample_bytree = 1,
subsample = 1,
gamma = 0
)
xgb_model2 <- train(
SalePrice~.,
data= baked_train1,
method = 'xgbTree',
tuneGrid = xgb_grid2,
objective = "reg:squarederror",
trControl= train_control)
In Step 3, we deal with the stochastic GBM hyper-parameters, which help the gradient descent avoid getting “stuck” by subsampling rows (subsample) and columns (colsample_bytree), much like a Random Forest does.
set.seed(123)
xgb_grid3 <- expand.grid(
nrounds = xgb_model$bestTune$nrounds,
eta = xgb_model$bestTune$eta,
max_depth = xgb_model2$bestTune$max_depth,
min_child_weight = xgb_model2$bestTune$min_child_weight,
colsample_bytree = c(0.5, 0.75, 1.0),
subsample = c(0.5, 0.75, 1.0),
gamma = 0
)
xgb_model3 <- train(
SalePrice~.,
data= baked_train1,
method = 'xgbTree',
tuneGrid = xgb_grid3,
objective = "reg:squarederror",
trControl= train_control)
In Step 4, we tune gamma, a hyper-parameter for pruning the trees: the minimum loss reduction required to make a further split. This helps prevent overfitting.
set.seed(123)
xgb_grid4 <- expand.grid(
nrounds = xgb_model$bestTune$nrounds,
eta = xgb_model$bestTune$eta,
max_depth = xgb_model2$bestTune$max_depth,
min_child_weight = xgb_model2$bestTune$min_child_weight,
colsample_bytree = xgb_model3$bestTune$colsample_bytree,
subsample = xgb_model3$bestTune$subsample,
gamma = c(0, 0.05, 0.1, 0.5, 0.7, 0.9, 1.0)
)
xgb_model4 <- train(
SalePrice~.,
data= baked_train1,
method = 'xgbTree',
tuneGrid = xgb_grid4,
objective = "reg:squarederror",
trControl= train_control)
In Step 5, with every other parameter tuned, we retune the learning rate over a wider range of nrounds, so that a smaller rate of descent can be explored, helping the model reach the optimal error rate. This will be our final XGBoost model.
set.seed(123)
xgb_grid5 <- expand.grid(
nrounds = seq(100, 10000, by = 100),
eta = c(0.01, 0.015, 0.025, 0.05),
max_depth = xgb_model2$bestTune$max_depth,
min_child_weight = xgb_model2$bestTune$min_child_weight,
colsample_bytree = xgb_model3$bestTune$colsample_bytree,
subsample = xgb_model3$bestTune$subsample,
gamma = xgb_model4$bestTune$gamma
)
(xgb_model5 <- train(
SalePrice~.,
data= baked_train1,
method = 'xgbTree',
tuneGrid = xgb_grid5,
objective = "reg:squarederror",
trControl= train_control))
## eXtreme Gradient Boosting
##
## 1460 samples
## 110 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1460, 1460, 1460, 1460, 1460
## Resampling results across tuning parameters:
##
## eta nrounds RMSE Rsquared MAE
## 0.010 100 0.85387161 NaN 0.853077045
## 0.010 200 0.31401178 0.6805169 0.312662842
## 0.010 300 0.11673631 0.8078558 0.114860590
## 0.010 400 0.04537512 0.8491497 0.043024283
## 0.010 500 0.02127960 0.8637374 0.018195498
## 0.010 600 0.01476462 0.8716227 0.010931428
## 0.010 700 0.01331284 0.8770471 0.009220156
## 0.010 800 0.01290540 0.8807505 0.008777987
## 0.010 900 0.01273757 0.8829988 0.008613014
## 0.010 1000 0.01263120 0.8847513 0.008531811
## 0.010 1100 0.01255545 0.8860364 0.008468356
## 0.010 1200 0.01251112 0.8867942 0.008428738
## 0.010 1300 0.01247390 0.8874650 0.008398666
## 0.010 1400 0.01243772 0.8880838 0.008366186
## 0.010 1500 0.01241854 0.8884389 0.008341119
## 0.010 1600 0.01239894 0.8887879 0.008321055
## 0.010 1700 0.01238474 0.8890042 0.008299817
## 0.010 1800 0.01237175 0.8892561 0.008283489
## 0.010 1900 0.01236528 0.8893860 0.008272097
## 0.010 2000 0.01236333 0.8894256 0.008261313
## 0.010 2100 0.01235629 0.8895580 0.008251056
## 0.010 2200 0.01235007 0.8896494 0.008244790
## 0.010 2300 0.01234297 0.8897310 0.008235373
## 0.010 2400 0.01234652 0.8896673 0.008227902
## 0.010 2500 0.01234438 0.8896994 0.008225872
## 0.010 2600 0.01234196 0.8897389 0.008222050
## 0.010 2700 0.01234427 0.8897415 0.008223336
## 0.010 2800 0.01234221 0.8897746 0.008216764
## 0.010 2900 0.01234422 0.8897594 0.008216744
## 0.010 3000 0.01234754 0.8896907 0.008215171
## 0.010 3100 0.01235070 0.8896559 0.008213764
## 0.010 3200 0.01235100 0.8896428 0.008209477
## 0.010 3300 0.01235073 0.8896155 0.008206713
## 0.010 3400 0.01235594 0.8895506 0.008207553
## 0.010 3500 0.01235840 0.8894898 0.008207865
## 0.010 3600 0.01235700 0.8895052 0.008207755
## 0.010 3700 0.01235887 0.8895021 0.008206205
## 0.010 3800 0.01236131 0.8894393 0.008206407
## 0.010 3900 0.01236464 0.8893812 0.008207743
## 0.010 4000 0.01236670 0.8893524 0.008207947
## 0.010 4100 0.01236559 0.8893632 0.008207597
## 0.010 4200 0.01236905 0.8892863 0.008208116
## 0.010 4300 0.01236911 0.8893168 0.008207658
## 0.010 4400 0.01237421 0.8892351 0.008210150
## 0.010 4500 0.01237541 0.8892140 0.008211710
## 0.010 4600 0.01238112 0.8891106 0.008215724
## 0.010 4700 0.01238136 0.8891183 0.008214138
## 0.010 4800 0.01238138 0.8891070 0.008214525
## 0.010 4900 0.01238393 0.8890778 0.008215870
## 0.010 5000 0.01238760 0.8889968 0.008217617
## 0.010 5100 0.01239100 0.8889330 0.008219615
## 0.010 5200 0.01239286 0.8888881 0.008219906
## 0.010 5300 0.01239379 0.8888654 0.008221142
## 0.010 5400 0.01239647 0.8888285 0.008222793
## 0.010 5500 0.01239903 0.8887608 0.008223076
## 0.010 5600 0.01240113 0.8887324 0.008225922
## 0.010 5700 0.01240257 0.8886971 0.008227442
## 0.010 5800 0.01240402 0.8886747 0.008229043
## 0.010 5900 0.01240667 0.8886308 0.008229715
## 0.010 6000 0.01241000 0.8885751 0.008230931
## 0.010 6100 0.01241318 0.8885213 0.008233799
## 0.010 6200 0.01241340 0.8885103 0.008235837
## 0.010 6300 0.01241690 0.8884451 0.008238619
## 0.010 6400 0.01242064 0.8883860 0.008240072
## 0.010 6500 0.01242159 0.8883583 0.008241582
## 0.010 6600 0.01242261 0.8883503 0.008240948
## 0.010 6700 0.01242447 0.8883033 0.008242908
## 0.010 6800 0.01242592 0.8882583 0.008244519
## 0.010 6900 0.01242677 0.8882436 0.008244885
## 0.010 7000 0.01243062 0.8881823 0.008247784
## 0.010 7100 0.01243224 0.8881491 0.008250079
## 0.010 7200 0.01243315 0.8881281 0.008251284
## 0.010 7300 0.01243634 0.8880831 0.008254001
## 0.010 7400 0.01243915 0.8880320 0.008255520
## 0.010 7500 0.01244005 0.8880075 0.008256671
## 0.010 7600 0.01244000 0.8879973 0.008257277
## 0.010 7700 0.01244211 0.8879655 0.008259174
## 0.010 7800 0.01244657 0.8878865 0.008263320
## 0.010 7900 0.01244734 0.8878779 0.008263887
## 0.010 8000 0.01244878 0.8878571 0.008264912
## 0.010 8100 0.01244965 0.8878293 0.008266762
## 0.010 8200 0.01245256 0.8877655 0.008269256
## 0.010 8300 0.01245457 0.8877420 0.008270804
## 0.010 8400 0.01245561 0.8877155 0.008271162
## 0.010 8500 0.01245716 0.8876827 0.008272404
## 0.010 8600 0.01245811 0.8876630 0.008272850
## 0.010 8700 0.01245876 0.8876496 0.008274428
## 0.010 8800 0.01246163 0.8876028 0.008277478
## 0.010 8900 0.01246297 0.8875703 0.008278485
## 0.010 9000 0.01246243 0.8875784 0.008278837
## 0.010 9100 0.01246355 0.8875471 0.008280176
## 0.010 9200 0.01246556 0.8875167 0.008281441
## 0.010 9300 0.01246604 0.8875067 0.008282818
## 0.010 9400 0.01246566 0.8875095 0.008283199
## 0.010 9500 0.01246684 0.8874876 0.008284456
## 0.010 9600 0.01246795 0.8874710 0.008285253
## 0.010 9700 0.01246901 0.8874508 0.008286328
## 0.010 9800 0.01247041 0.8874270 0.008287939
## 0.010 9900 0.01247179 0.8874041 0.008289623
## 0.010 10000 0.01247249 0.8873894 0.008290750
## 0.015 100 0.51554354 0.5869598 0.514415724
## 0.015 200 0.11583916 0.8070756 0.113946298
## 0.015 300 0.02956789 0.8572672 0.026789506
## 0.015 400 0.01471319 0.8715855 0.010869642
## 0.015 500 0.01304894 0.8790998 0.008918842
## 0.015 600 0.01273788 0.8829099 0.008592721
## 0.015 700 0.01258167 0.8854992 0.008452760
## 0.015 800 0.01250674 0.8867935 0.008385423
## 0.015 900 0.01244108 0.8879130 0.008325411
## 0.015 1000 0.01241311 0.8883610 0.008291066
## 0.015 1100 0.01238676 0.8888805 0.008259770
## 0.015 1200 0.01236556 0.8892544 0.008241903
## 0.015 1300 0.01233845 0.8897205 0.008211343
## 0.015 1400 0.01233310 0.8898580 0.008202472
## 0.015 1500 0.01232140 0.8900608 0.008184101
## 0.015 1600 0.01232130 0.8901003 0.008180069
## 0.015 1700 0.01230957 0.8902601 0.008166509
## 0.015 1800 0.01231018 0.8902990 0.008162514
## 0.015 1900 0.01231108 0.8902723 0.008165554
## 0.015 2000 0.01231022 0.8903244 0.008162020
## 0.015 2100 0.01231032 0.8902905 0.008155305
## 0.015 2200 0.01231356 0.8902649 0.008158117
## 0.015 2300 0.01231717 0.8901871 0.008158377
## 0.015 2400 0.01232215 0.8900856 0.008160755
## 0.015 2500 0.01232531 0.8900098 0.008159903
## 0.015 2600 0.01232928 0.8899616 0.008164432
## 0.015 2700 0.01233015 0.8899293 0.008163505
## 0.015 2800 0.01233121 0.8899146 0.008162118
## 0.015 2900 0.01233544 0.8898344 0.008163442
## 0.015 3000 0.01234022 0.8897493 0.008165855
## 0.015 3100 0.01234435 0.8896809 0.008170613
## 0.015 3200 0.01235176 0.8895554 0.008175070
## 0.015 3300 0.01235866 0.8894329 0.008182868
## 0.015 3400 0.01236205 0.8893654 0.008185501
## 0.015 3500 0.01236755 0.8892633 0.008191477
## 0.015 3600 0.01236885 0.8892574 0.008191101
## 0.015 3700 0.01236889 0.8892329 0.008193100
## 0.015 3800 0.01237056 0.8892110 0.008194462
## 0.015 3900 0.01237581 0.8891157 0.008196920
## 0.015 4000 0.01237919 0.8890555 0.008200125
## 0.015 4100 0.01238225 0.8890072 0.008205764
## 0.015 4200 0.01238400 0.8889667 0.008206801
## 0.015 4300 0.01238650 0.8889158 0.008211646
## 0.015 4400 0.01239061 0.8888518 0.008216396
## 0.015 4500 0.01239368 0.8887962 0.008219297
## 0.015 4600 0.01239700 0.8887361 0.008220888
## 0.015 4700 0.01239933 0.8886936 0.008222592
## 0.015 4800 0.01240375 0.8886089 0.008225464
## 0.015 4900 0.01240829 0.8885324 0.008229524
## 0.015 5000 0.01241109 0.8884796 0.008233064
## 0.015 5100 0.01241087 0.8884793 0.008233359
## 0.015 5200 0.01241318 0.8884318 0.008236304
## 0.015 5300 0.01241634 0.8883748 0.008239972
## 0.015 5400 0.01241848 0.8883396 0.008243229
## 0.015 5500 0.01241882 0.8883196 0.008244361
## 0.015 5600 0.01241967 0.8882934 0.008245071
## 0.015 5700 0.01242549 0.8881917 0.008249690
## 0.015 5800 0.01242893 0.8881389 0.008252717
## 0.015 5900 0.01242859 0.8881423 0.008252902
## 0.015 6000 0.01243030 0.8881031 0.008253851
## 0.015 6100 0.01243084 0.8880874 0.008253745
## 0.015 6200 0.01243180 0.8880617 0.008256181
## 0.015 6300 0.01243496 0.8880114 0.008259024
## 0.015 6400 0.01243606 0.8879871 0.008260966
## 0.015 6500 0.01243856 0.8879494 0.008264002
## 0.015 6600 0.01243984 0.8879172 0.008265925
## 0.015 6700 0.01244092 0.8879013 0.008266658
## 0.015 6800 0.01244157 0.8878839 0.008268117
## 0.015 6900 0.01244338 0.8878502 0.008269375
## 0.015 7000 0.01244491 0.8878201 0.008271478
## 0.015 7100 0.01244587 0.8877956 0.008272412
## 0.015 7200 0.01244742 0.8877689 0.008274913
## 0.015 7300 0.01244900 0.8877420 0.008276153
## 0.015 7400 0.01245016 0.8877181 0.008277746
## 0.015 7500 0.01245121 0.8877035 0.008279559
## 0.015 7600 0.01245256 0.8876822 0.008281087
## 0.015 7700 0.01245400 0.8876550 0.008282449
## 0.015 7800 0.01245492 0.8876358 0.008284009
## 0.015 7900 0.01245804 0.8875849 0.008286405
## 0.015 8000 0.01245939 0.8875540 0.008287526
## 0.015 8100 0.01246071 0.8875347 0.008289713
## 0.015 8200 0.01246168 0.8875162 0.008290794
## 0.015 8300 0.01246237 0.8875062 0.008291848
## 0.015 8400 0.01246283 0.8874948 0.008292762
## 0.015 8500 0.01246510 0.8874586 0.008295155
## 0.015 8600 0.01246580 0.8874452 0.008296572
## 0.015 8700 0.01246648 0.8874354 0.008297380
## 0.015 8800 0.01246728 0.8874163 0.008298228
## 0.015 8900 0.01246856 0.8873926 0.008299159
## 0.015 9000 0.01246897 0.8873858 0.008300150
## 0.015 9100 0.01246917 0.8873765 0.008300550
## 0.015 9200 0.01247036 0.8873574 0.008301598
## 0.015 9300 0.01247120 0.8873407 0.008302484
## 0.015 9400 0.01247192 0.8873264 0.008303072
## 0.015 9500 0.01247242 0.8873149 0.008303844
## 0.015 9600 0.01247251 0.8873159 0.008304259
## 0.015 9700 0.01247347 0.8872974 0.008305602
## 0.015 9800 0.01247429 0.8872816 0.008305796
## 0.015 9900 0.01247471 0.8872756 0.008306699
## 0.015 10000 0.01247562 0.8872647 0.008307592
## 0.025 100 0.18747750 0.7565256 0.185860621
## 0.025 200 0.02090959 0.8614989 0.017744220
## 0.025 300 0.01310467 0.8780779 0.008941946
## 0.025 400 0.01269958 0.8836608 0.008562190
## 0.025 500 0.01255027 0.8862685 0.008438835
## 0.025 600 0.01247652 0.8876326 0.008357100
## 0.025 700 0.01243007 0.8885014 0.008308833
## 0.025 800 0.01238348 0.8892290 0.008262387
## 0.025 900 0.01237355 0.8893572 0.008245859
## 0.025 1000 0.01237703 0.8893663 0.008234050
## 0.025 1100 0.01236126 0.8895396 0.008213342
## 0.025 1200 0.01236872 0.8894625 0.008210338
## 0.025 1300 0.01237378 0.8893739 0.008205490
## 0.025 1400 0.01236571 0.8895043 0.008197807
## 0.025 1500 0.01236977 0.8894390 0.008201460
## 0.025 1600 0.01236984 0.8893681 0.008200426
## 0.025 1700 0.01237574 0.8893231 0.008202956
## 0.025 1800 0.01239033 0.8890434 0.008213137
## 0.025 1900 0.01240445 0.8887808 0.008220118
## 0.025 2000 0.01240896 0.8886910 0.008221984
## 0.025 2100 0.01242446 0.8884325 0.008232715
## 0.025 2200 0.01243418 0.8882620 0.008236421
## 0.025 2300 0.01244212 0.8880908 0.008241720
## 0.025 2400 0.01245232 0.8878843 0.008251754
## 0.025 2500 0.01245877 0.8877902 0.008255709
## 0.025 2600 0.01245898 0.8877796 0.008256087
## 0.025 2700 0.01246423 0.8876611 0.008261768
## 0.025 2800 0.01246280 0.8876912 0.008260843
## 0.025 2900 0.01246514 0.8876145 0.008263497
## 0.025 3000 0.01246765 0.8875567 0.008267931
## 0.025 3100 0.01247509 0.8874455 0.008276150
## 0.025 3200 0.01247630 0.8874218 0.008276712
## 0.025 3300 0.01247956 0.8873601 0.008276608
## 0.025 3400 0.01248293 0.8872897 0.008281052
## 0.025 3500 0.01248835 0.8871734 0.008285382
## 0.025 3600 0.01248934 0.8871665 0.008289170
## 0.025 3700 0.01249268 0.8870946 0.008291278
## 0.025 3800 0.01249854 0.8869905 0.008297627
## 0.025 3900 0.01250311 0.8869154 0.008301602
## 0.025 4000 0.01250993 0.8867850 0.008305981
## 0.025 4100 0.01251203 0.8867439 0.008309036
## 0.025 4200 0.01251354 0.8866966 0.008312272
## 0.025 4300 0.01251407 0.8866757 0.008313530
## 0.025 4400 0.01251656 0.8866285 0.008315088
## 0.025 4500 0.01251940 0.8865723 0.008317763
## 0.025 4600 0.01251855 0.8865771 0.008317423
## 0.025 4700 0.01251967 0.8865635 0.008318793
## 0.025 4800 0.01252056 0.8865376 0.008321107
## 0.025 4900 0.01252241 0.8865187 0.008322501
## 0.025 5000 0.01252387 0.8864839 0.008323623
## 0.025 5100 0.01252649 0.8864437 0.008326859
## 0.025 5200 0.01252700 0.8864186 0.008329328
## 0.025 5300 0.01252944 0.8863744 0.008332898
## 0.025 5400 0.01253188 0.8863342 0.008335071
## 0.025 5500 0.01253203 0.8863262 0.008335826
## 0.025 5600 0.01253257 0.8863139 0.008336087
## 0.025 5700 0.01253457 0.8862837 0.008338048
## 0.025 5800 0.01253654 0.8862476 0.008339889
## 0.025 5900 0.01253681 0.8862408 0.008341758
## 0.025 6000 0.01253641 0.8862444 0.008341886
## 0.025 6100 0.01253879 0.8862004 0.008344656
## 0.025 6200 0.01254070 0.8861669 0.008345638
## 0.025 6300 0.01254168 0.8861428 0.008346350
## 0.025 6400 0.01254161 0.8861485 0.008346739
## 0.025 6500 0.01254217 0.8861324 0.008347868
## 0.025 6600 0.01254241 0.8861234 0.008348202
## 0.025 6700 0.01254343 0.8861046 0.008349438
## 0.025 6800 0.01254289 0.8861163 0.008349032
## 0.025 6900 0.01254249 0.8861227 0.008349141
## 0.025 7000 0.01254322 0.8861110 0.008350290
## 0.025 7100 0.01254375 0.8861016 0.008351190
## 0.025 7200 0.01254374 0.8861030 0.008350843
## 0.025 7300 0.01254411 0.8860920 0.008350977
## 0.025 7400 0.01254359 0.8860975 0.008350743
## 0.025 7500 0.01254360 0.8860987 0.008350604
## 0.025 7600 0.01254366 0.8860968 0.008350876
## 0.025 7700 0.01254385 0.8860928 0.008351064
## 0.025 7800 0.01254370 0.8860923 0.008351037
## 0.025 7900 0.01254420 0.8860859 0.008351100
## 0.025 8000 0.01254442 0.8860813 0.008351273
## 0.025 8100 0.01254484 0.8860735 0.008351681
## 0.025 8200 0.01254485 0.8860741 0.008351914
## 0.025 8300 0.01254520 0.8860688 0.008352247
## 0.025 8400 0.01254522 0.8860677 0.008352137
## 0.025 8500 0.01254520 0.8860688 0.008352254
## 0.025 8600 0.01254546 0.8860647 0.008352237
## 0.025 8700 0.01254506 0.8860692 0.008352189
## 0.025 8800 0.01254538 0.8860631 0.008352855
## 0.025 8900 0.01254532 0.8860630 0.008352901
## 0.025 9000 0.01254529 0.8860646 0.008353076
## 0.025 9100 0.01254557 0.8860592 0.008353501
## 0.025 9200 0.01254574 0.8860564 0.008353499
## 0.025 9300 0.01254582 0.8860554 0.008353504
## 0.025 9400 0.01254600 0.8860530 0.008353686
## 0.025 9500 0.01254579 0.8860553 0.008353516
## 0.025 9600 0.01254605 0.8860528 0.008353794
## 0.025 9700 0.01254607 0.8860500 0.008353931
## 0.025 9800 0.01254598 0.8860524 0.008353922
## 0.025 9900 0.01254634 0.8860453 0.008354222
## 0.025 10000 0.01254639 0.8860449 0.008354283
## 0.050 100 0.02019549 0.8606155 0.016961112
## 0.050 200 0.01279386 0.8811788 0.008667120
## 0.050 300 0.01256953 0.8852374 0.008481421
## 0.050 400 0.01249969 0.8864172 0.008408215
## 0.050 500 0.01245798 0.8871797 0.008344584
## 0.050 600 0.01245868 0.8872084 0.008324380
## 0.050 700 0.01246484 0.8871500 0.008330722
## 0.050 800 0.01247842 0.8869224 0.008338992
## 0.050 900 0.01248562 0.8868353 0.008341825
## 0.050 1000 0.01250411 0.8865056 0.008356213
## 0.050 1100 0.01251547 0.8863078 0.008361095
## 0.050 1200 0.01252145 0.8862307 0.008365572
## 0.050 1300 0.01252606 0.8861557 0.008370854
## 0.050 1400 0.01253458 0.8860098 0.008377187
## 0.050 1500 0.01254621 0.8857896 0.008389025
## 0.050 1600 0.01255036 0.8857148 0.008397745
## 0.050 1700 0.01255913 0.8855536 0.008411406
## 0.050 1800 0.01255977 0.8855344 0.008417230
## 0.050 1900 0.01256272 0.8854499 0.008418160
## 0.050 2000 0.01256712 0.8853849 0.008424742
## 0.050 2100 0.01257240 0.8852863 0.008433033
## 0.050 2200 0.01257880 0.8851610 0.008439755
## 0.050 2300 0.01258292 0.8850647 0.008444897
## 0.050 2400 0.01258975 0.8849653 0.008450154
## 0.050 2500 0.01259107 0.8849291 0.008452318
## 0.050 2600 0.01259270 0.8848952 0.008456038
## 0.050 2700 0.01259382 0.8848717 0.008459563
## 0.050 2800 0.01259667 0.8848214 0.008462621
## 0.050 2900 0.01259774 0.8847913 0.008466732
## 0.050 3000 0.01260184 0.8847280 0.008469645
## 0.050 3100 0.01260548 0.8846629 0.008473451
## 0.050 3200 0.01260726 0.8846183 0.008474622
## 0.050 3300 0.01260811 0.8845957 0.008474765
## 0.050 3400 0.01260974 0.8845740 0.008477239
## 0.050 3500 0.01261054 0.8845628 0.008477454
## 0.050 3600 0.01261124 0.8845509 0.008478141
## 0.050 3700 0.01261198 0.8845448 0.008478146
## 0.050 3800 0.01261238 0.8845325 0.008479475
## 0.050 3900 0.01261208 0.8845361 0.008478692
## 0.050 4000 0.01261348 0.8845101 0.008479756
## 0.050 4100 0.01261338 0.8845178 0.008479984
## 0.050 4200 0.01261265 0.8845233 0.008479842
## 0.050 4300 0.01261227 0.8845291 0.008480017
## 0.050 4400 0.01261171 0.8845406 0.008479701
## 0.050 4500 0.01261203 0.8845389 0.008479937
## 0.050 4600 0.01261196 0.8845379 0.008480264
## 0.050 4700 0.01261231 0.8845331 0.008480360
## 0.050 4800 0.01261209 0.8845355 0.008480287
## 0.050 4900 0.01261192 0.8845360 0.008480297
## 0.050 5000 0.01261176 0.8845398 0.008480388
## 0.050 5100 0.01261225 0.8845330 0.008480862
## 0.050 5200 0.01261178 0.8845445 0.008480608
## 0.050 5300 0.01261215 0.8845392 0.008480719
## 0.050 5400 0.01261194 0.8845397 0.008480582
## 0.050 5500 0.01261222 0.8845354 0.008480904
## 0.050 5600 0.01261232 0.8845323 0.008480726
## 0.050 5700 0.01261179 0.8845419 0.008480155
## 0.050 5800 0.01261186 0.8845438 0.008479775
## 0.050 5900 0.01261153 0.8845452 0.008479749
## 0.050 6000 0.01261135 0.8845522 0.008479488
## 0.050 6100 0.01261128 0.8845503 0.008479479
## 0.050 6200 0.01261114 0.8845504 0.008479193
## 0.050 6300 0.01261182 0.8845405 0.008479697
## 0.050 6400 0.01261193 0.8845393 0.008480066
## 0.050 6500 0.01261205 0.8845347 0.008480178
## 0.050 6600 0.01261205 0.8845342 0.008480209
## 0.050 6700 0.01261240 0.8845291 0.008480386
## 0.050 6800 0.01261216 0.8845329 0.008480111
## 0.050 6900 0.01261237 0.8845312 0.008480265
## 0.050 7000 0.01261202 0.8845327 0.008479976
## 0.050 7100 0.01261231 0.8845318 0.008480244
## 0.050 7200 0.01261259 0.8845280 0.008480324
## 0.050 7300 0.01261262 0.8845265 0.008480551
## 0.050 7400 0.01261248 0.8845295 0.008480330
## 0.050 7500 0.01261254 0.8845278 0.008480528
## 0.050 7600 0.01261233 0.8845298 0.008480318
## 0.050 7700 0.01261262 0.8845276 0.008480475
## 0.050 7800 0.01261236 0.8845295 0.008480386
## 0.050 7900 0.01261238 0.8845327 0.008480133
## 0.050 8000 0.01261204 0.8845386 0.008479905
## 0.050 8100 0.01261247 0.8845328 0.008480063
## 0.050 8200 0.01261253 0.8845285 0.008480172
## 0.050 8300 0.01261303 0.8845233 0.008480255
## 0.050 8400 0.01261309 0.8845190 0.008480134
## 0.050 8500 0.01261330 0.8845164 0.008480371
## 0.050 8600 0.01261333 0.8845140 0.008480446
## 0.050 8700 0.01261322 0.8845163 0.008480404
## 0.050 8800 0.01261313 0.8845146 0.008480440
## 0.050 8900 0.01261314 0.8845165 0.008480383
## 0.050 9000 0.01261318 0.8845155 0.008480473
## 0.050 9100 0.01261326 0.8845131 0.008480429
## 0.050 9200 0.01261315 0.8845131 0.008480633
## 0.050 9300 0.01261305 0.8845155 0.008480416
## 0.050 9400 0.01261325 0.8845139 0.008480460
## 0.050 9500 0.01261321 0.8845149 0.008480414
## 0.050 9600 0.01261308 0.8845184 0.008480317
## 0.050 9700 0.01261303 0.8845186 0.008480364
## 0.050 9800 0.01261327 0.8845158 0.008480532
## 0.050 9900 0.01261338 0.8845137 0.008480634
## 0.050 10000 0.01261342 0.8845139 0.008480795
##
## Tuning parameter 'max_depth' was held constant at a value of 3
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.5
## Tuning parameter 'min_child_weight' was held constant at a value of 3
## Tuning parameter 'subsample' was held constant at a value of 0.5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 1700, max_depth = 3, eta
## = 0.015, gamma = 0, colsample_bytree = 0.5, min_child_weight = 3 and
## subsample = 0.5.
predictions$xgb_pred <- predict(xgb_model5, newdata = baked_train1)
predictions <- predictions %>%
mutate(xgb_pred = exp(1)^yeo.johnson(xgb_pred, lambda = lambda, inverse = TRUE))
fitted_values_plot(predictions, x = xgb_pred)
predictions <- predictions %>%
mutate(xgb_residuals = SalePrice - xgb_pred)
residual_analysis(df = predictions, fitted = xgb_pred, resid = xgb_residuals)
Extreme Gradient Boosting seemed to further normalize the residuals, mitigating the extrapolation problem we observed with the Random Forest.
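One quick way to check this claim (a sketch, reusing the predictions data frame built above and the skewness function from the moments package loaded earlier) is to compare residual skewness across the two tree-based models:
# Sketch: residual skewness for the Random Forest vs. the XGBoost model;
# values closer to zero indicate a more symmetric residual distribution.
predictions %>%
  summarise(
    rf_skewness  = skewness(rf_residuals),
    xgb_skewness = skewness(xgb_residuals)
  )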
The second model we will run to complement the Random Forest is an Elastic Net. The Elastic Net is a form of regularized regression that combines the Ridge penalty with the Lasso penalty. These constraints on the model help deal with correlation between features and also provide automated feature selection.
We will let caret perform a grid search over the tunable parameters alpha (which decides the mix of the two penalties: alpha = 0 means a pure Ridge model, while alpha = 1 means a pure Lasso model) and lambda (the scale of the penalty applied in the objective function).
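For reference, this is the penalized least-squares objective that glmnet minimizes in its standard parameterization (ignoring observation weights), with alpha mixing the two penalties and lambda scaling them:

$$\min_{\beta_0,\,\beta}\;\frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2+\lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2+\alpha\lVert\beta\rVert_1\right]$$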
set.seed(123)
en_grid <- expand.grid(
alpha = c(0.7,0.75,0.8,0.85,0.9,0.95,1),
lambda = seq(0,0.001, by = 0.00005)
)
elastic_net <- train(
SalePrice ~ .,
data = baked_train1,
method = "glmnet",
trControl = train_control,
tuneGrid = en_grid
)
ggplot(elastic_net) +
geom_vline(aes(xintercept = elastic_net$bestTune$lambda), linetype = "dashed",
size = base_size/15) +
my_theme(type = "discrete") +
labs(title = "Model Tuning - Elastic Net")
The model tuning resulted in an RMSE of 0.0126309. The selected alpha parameter equals 1, with the chosen lambda marked by the dashed line in the plot above.
Alpha = 1 means that only the Lasso penalty was applied and the Ridge penalty was dropped entirely. In other words, model tuning effectively turned our Elastic Net into a Lasso model.
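As a quick illustration of that automated feature selection (a sketch, assuming the elastic_net object fitted above), we can extract the coefficients of the selected model and count how many the Lasso penalty actually kept:
# Sketch: coefficients of the selected glmnet model at the tuned lambda;
# the non-zero entries are the features the Lasso penalty retained.
lasso_coefs <- coef(elastic_net$finalModel, s = elastic_net$bestTune$lambda)
sum(lasso_coefs[, 1] != 0)                              # number of retained terms
head(rownames(lasso_coefs)[lasso_coefs[, 1] != 0], 10)  # a few surviving features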
predictions$en_pred <- predict(elastic_net, newdata = baked_train1)
predictions <- predictions %>%
mutate(en_pred = exp(1)^yeo.johnson(en_pred, lambda = lambda, inverse = TRUE))
fitted_values_plot(predictions, x = en_pred)
predictions <- predictions %>%
mutate(en_residuals = SalePrice - en_pred)
residual_analysis(df = predictions, fitted = en_pred, resid = en_residuals)
By plotting model results and residuals, we can observe that the model made some large overestimations, which resulted in a left-skewed distribution of residuals.
Combining the Elastic Net with the previous tree-based models should improve the generalization power of our final model, since linear models are low-variance, high-bias learners that complement the low-bias, high-variance trees.
Now we list the models and look at how correlated their cross-validated results are.
# Collect the Base Learners in a list of train objects (c() would flatten them);
# caretStack() expects a caretList, so we tag the class accordingly
model_list <- list(rf_model, elastic_net, xgb_model5)
class(model_list) <- c("caretList", "list")
model_cor <- modelCor(resamples(model_list))
melt(model_cor) %>%
ggplot(aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
geom_text(aes(x = Var1, y = Var2, size = base_size/10, label = round(value, 2)),
color = "white") +
my_theme(type = "continuous") +
labs(title = "Model Correlation",
y = "Model",
x = "Model",
fill = "Correlation")
We see that the XGBoost model (xgbTree) and the Elastic Net (glmnet) were fairly highly correlated with each other, while the Random Forest was less correlated with either. Since the overall correlation was not too high, our final model should benefit from model stacking.
The meta-model will be trained on the matrix of fitted values from each individual Base Learner. For this model, we use a Multivariate Adaptive Regression Spline (MARS).
A MARS model uses hinge (piecewise-linear) basis functions to capture nonlinear and interaction effects between features.
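Concretely, in the additive (degree = 1) case the fitted MARS model is a weighted sum of hinge basis functions with knots $c_m$ chosen by the algorithm:

$$\hat{f}(x)=\beta_0+\sum_{m=1}^{M}\beta_m\,h_m(x),\qquad h_m(x)\in\left\{\max(0,\,x_j-c_m),\;\max(0,\,c_m-x_j)\right\}$$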
set.seed(123)
stacked_model <- caretStack(
model_list,
method="earth",
metric = "RMSE",
tuneLength = 10,
trControl = trainControl(method = "cv", number = 10, savePredictions = TRUE)
)
stacked_model
## A earth ensemble of 3 base models: ranger1, glmnet2, xgbTree3
##
## Ensemble results:
## Multivariate Adaptive Regression Spline
##
## 2698 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2427, 2429, 2429, 2429, 2428, 2427, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 2 0.01248116 0.8864399 0.008437066
## 3 0.01201387 0.8946213 0.008057307
## 4 0.01211000 0.8929142 0.008090165
## 5 0.01202917 0.8945270 0.008032821
## 6 0.01198728 0.8954661 0.008021465
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 6 and degree = 1.
predictions$stack_pred <- predict(stacked_model, newdata = baked_train1)
predictions <- predictions %>%
mutate(stack_pred = exp(1)^yeo.johnson(stack_pred, lambda = lambda, inverse = TRUE))
fitted_values_plot(predictions, x = stack_pred)
predictions <- predictions %>%
mutate(stack_residuals = SalePrice - stack_pred)
residual_analysis(df = predictions, fitted = stack_pred, resid = stack_residuals)
While stacking the models resulted in a slightly more platykurtic distribution of residuals, the LOESS smoothing shows that the average error narrowed for the higher fitted values.
Here we look at Partial Dependence, which helps us interpret the marginal effect that an increase in a given feature has on the final prediction. Here, the features of the meta-model are the predictions of the individual Base Learners.
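Formally, the partial dependence of the meta-model $\hat{f}$ on a chosen feature value $x_S$ (here, one Base Learner's prediction) is its average prediction with the remaining features fixed at their observed values $x_{i,C}$:

$$\hat{f}_S(x_S)=\frac{1}{n}\sum_{i=1}^{n}\hat{f}\left(x_S,\,x_{i,C}\right)$$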
partial1 <- partial(stacked_model$ens_model, pred.var = "ranger1", plot = TRUE,
center = TRUE, grid.resolution = 20,
plot.engine = "ggplot2")
partial2 <- partial(stacked_model$ens_model, pred.var = "glmnet2", plot = TRUE,
center = TRUE, grid.resolution = 20,
plot.engine = "ggplot2")
partial3 <- partial(stacked_model$ens_model, pred.var = "xgbTree3", plot = TRUE,
center = TRUE, grid.resolution = 20,
plot.engine = "ggplot2")
grid.arrange(
partial1 + geom_line(color = color3, size = base_size/7.5) + my_theme(type = "discrete")
+ ggtitle("Partial Dependence - Random Forest Model"),
partial2 + geom_line(color = color3, size = base_size/7.5) + my_theme(type = "discrete")
+ ggtitle("Partial Dependence - Elastic Net Model"),
partial3 + geom_line(color = color3, size = base_size/7.5) + my_theme(type = "discrete")
+ ggtitle("Partial Dependence - XGBoost Model"),
ncol = 3, nrow = 1
)
While the Random Forest predictions have a roughly constant marginal effect on the meta-model output, the XGBoost predictions have an increasing effect once they surpass a certain point, which is consistent with the MARS meta-model placing a hinge knot there.
predictions <- predictions %>%
mutate(stack_pred = as.numeric(stack_pred),
stack_residuals = as.numeric(stack_residuals))
# Helper: pull a given performance statistic (e.g. "RMSE") for the best tune of a caret model
get_statistic <- function(df, stat){
  as.numeric(df$results %>%
    filter(RMSE == min(RMSE)) %>%
    select(stat))
}
# A plain list of train objects is enough here, since we only query their results tables
list_models <- list(rf_model, elastic_net, xgb_model5, stacked_model$ens_model)
r_squared <- melt(lapply(list_models, get_statistic, stat = "Rsquared"))
rmse <- melt(lapply(list_models, get_statistic, stat = "RMSE"))
mae <- melt(lapply(list_models, get_statistic, stat = "MAE"))
models <- c("Random Forest", "Elastic Net", "XGBoost", "Stacked Model")
stats <- cbind(models, rmse[,1], r_squared[,1], mae[,1])
colnames(stats) <- c("Models", "RMSE", "R_squared", "MAE")
stats <- as.data.frame(stats)
stats <- stats %>%
mutate(RMSE = as.numeric(RMSE), R_squared = as.numeric(R_squared), MAE = as.numeric(MAE))
pred_stacked <- predictions %>%
select(rf_residuals, en_residuals, xgb_residuals, stack_residuals) %>%
stack() %>%
mutate(ind = factor(ind))
model_info <- function(model){
paste(model, "\n", "RMSE:", round(stats[models == model, "RMSE"], 4),
"\n", "R-squared:", round(stats[models == model, "R_squared"], 4))
}
model_information <- sapply(models, model_info)
levels(pred_stacked$ind) <- model_information
pred_stacked %>%
ggplot(aes(y = values, x = ind, fill = ind)) +
geom_boxplot() +
scale_y_log10(labels = comma) +
my_theme(type = "discrete") +
theme(legend.position = "none") +
labs(title = "Residual Boxplots", x = "")
As we can see from the plot, stacking the models has lowered cross-validated RMSE while simultaneously improving cross-validated R-squared.
Looking at the residual boxplots, we can see that every model produced its own batch of overestimated outliers. While stacking helped lower the average overestimation, the stacked model showed a slightly higher median residual than its tree-based Base Learners.
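To quantify that shift (a sketch over the same predictions data frame, assuming dplyr >= 1.0 for across()), we can compare the median residual of each model directly:
# Sketch: median residual per model; negative medians indicate a tendency
# to overestimate, positive medians a tendency to underestimate.
predictions %>%
  summarise(across(
    c(rf_residuals, en_residuals, xgb_residuals, stack_residuals),
    median
  ))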
To better visualize how each feature has impacted each model, we calculate Feature Importance and plot the 10 most important features for each model.
rf_impurity <- ranger(
formula = SalePrice ~ .,
data = baked_train1,
num.trees = rf_model$finalModel$num.trees,
mtry = rf_model$finalModel$mtry,
min.node.size = rf_model$finalModel$min.node.size,
replace = FALSE,
importance = "impurity",
respect.unordered.factors = "order",
verbose = FALSE,
seed = 123
)
vip_rf <- vip(rf_impurity, num_features = 10, bar = FALSE)
vip_xgb <- vip(xgb_model5, num_features = 10, bar = FALSE)
vip_en <- vip(elastic_net, num_features = 10, bar = FALSE)
importance <- function(df){
df %>%
ggplot(aes(x = Importance, y = reorder(Variable, Importance), fill = Importance)) +
geom_col() +
my_theme(type = "continuous") +
theme(legend.position = "none") +
labs(
title = "Component Importance",
y = "Components"
)
}
grid.arrange(
importance(vip_rf$data) + ggtitle("Random Forest"),
importance(vip_xgb$data) + ggtitle("XGBoost"),
importance(vip_en$data) + ggtitle("Elastic Net"),
nrow = 1, ncol = 3
)
OverallQual and GrLivArea were the top two most important features in every model. However, notice how differently importance was distributed for the Random Forest model. From the third feature onwards, the models produced quite varied rankings.
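To make that comparison concrete (a sketch reusing the vip_* objects above, assuming dplyr >= 1.0 for slice_max()), we can stack the three importance tables and look at the top features per model:
# Sketch: combine the three importance tables and keep the top 5 features
# of each model to compare rankings side by side.
bind_rows(
  mutate(vip_rf$data,  model = "Random Forest"),
  mutate(vip_xgb$data, model = "XGBoost"),
  mutate(vip_en$data,  model = "Elastic Net")
) %>%
  group_by(model) %>%
  slice_max(Importance, n = 5) %>%
  arrange(model, desc(Importance))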
By stacking the models, we created a better-fitting model while minimizing Mean Squared Error. The MARS meta-model captured the nonlinear relationships between the Base Learners' predictions and the response variable.
Upon applying the model to the testing data set and submitting the predictions to Kaggle, our model was graded with a Root Mean Squared Logarithmic Error of 0.12251. At the time this report is written, this score places our model in the top 15% of a challenge with over 10,000 submissions.
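For completeness, here is a minimal sketch of how that submission file could be produced, assuming the test set was baked through the same recipe into a hypothetical baked_test1 object and that lambda is the same Yeo-Johnson parameter used for the target above:
# Sketch: score the processed test set with the stacked model, apply the same
# back-transformation used for the training predictions, and write the Kaggle file.
test_pred <- predict(stacked_model, newdata = baked_test1)
submission <- data.frame(
  Id = test$Id,
  SalePrice = exp(yeo.johnson(as.numeric(test_pred), lambda = lambda, inverse = TRUE))
)
write.csv(submission, "stacked_model_submission.csv", row.names = FALSE)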