This document is a response to the Kaggle Challenge “House Prices - Advanced Regression Techniques”.

The goal of this challenge is to predict house prices in Ames, Iowa using the available information, which consists of 79 explanatory variables.

After performing Target and Feature Engineering, we fitted three models: a Random Forest, a Gradient Boosting Machine and an Elastic Net. Finally, we fitted a stacked model that combines the strengths of these three Base Learners to produce smaller RMSE and better R-squared.

When applied to the testing data set and submitted to Kaggle, the stacked model returned an RMSLE of 0.12251 which, at the time of writing, is a top 15% submission.
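
For reference, the metric used by Kaggle for this competition is the root mean squared logarithmic error, i.e., the RMSE computed on log-prices. A minimal sketch in R of how it can be computed (the rmsle helper and its arguments are illustrative names, not objects used later in this report):

# Illustrative helper: RMSE on the log of the sale prices (the competition metric)
rmsle <- function(actual, predicted) {
  sqrt(mean((log(actual) - log(predicted))^2))
}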

Summary

Required packages

library(rlang)
library(httr)
library(jsonlite)
library(data.table)
library(tidyr)
library(ggplot2)
library(dplyr)
library(showtext)
library(extrafont)
library(viridis)
library(RColorBrewer)
library(ggsci)
library(lubridate)
library(forcats)
library(broom)
library(WVPlots)
library(caret)
library(pROC)
library(recipes)
library(readr)
library(ROCR)
library(vip)
library(xts)
library(stringr)
library(scales)
library(ggtext)
library(rpart) 
library(rpart.plot)
library(ranger)
library(VGAM)
library(randomForest)
library(gridExtra)
library(ggplotify)
library(grid)
library(ggthemes)
library(moments)
library(cowplot)
library(egg)
library(patchwork)
library(outliers)
library(Hmisc)
library(zoo)
library(earth)
library(caretEnsemble)
library(h2o)
library(pdp)
library(gbm)
library(xgboost)
library(DALEX)

Setting theme to be used

loadfonts(device = "win")
font_add("Palatino", "pala.ttf")
showtext_auto()

base_size <- 20
theme1 <- theme_minimal(base_size = base_size) + 
  theme(text = element_text(family = "Palatino"),
        panel.grid.minor.x = element_line(linetype = "dashed"),
        panel.grid.minor.y = element_line(linetype = "dashed"))

my_theme <- function(type = c("discrete", "continuous")){
  type <- match.arg(type)
  if (type == "discrete"){
    list(theme1, scale_color_jama(), scale_fill_jama())
  } else {
    list(theme1, scale_fill_continuous(high = pal_jama("default")(9)[1], low = "lightskyblue"))
  }
}

color1 <- pal_jama("default")(9)[1]
color2 <- pal_jama("default")(9)[2]
color3 <- "lightskyblue"

Downloading Data Sets

Let’s read in the training and testing sets and have a look at their structure.

test <- fread("test.csv")
train <- fread("train.csv")
str(train)
## Classes 'data.table' and 'data.frame':   1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ 1stFlrSF     : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ 2ndFlrSF     : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ 3SsnPorch    : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
##  - attr(*, ".internal.selfref")=<externalptr>

To make predictions on the testing data set, we need it to have the same columns as the training one, so we’ll add a placeholder SalePrice variable set to zero.

test <- test %>%
  mutate(SalePrice = as.integer(0))

Exploratory Data Analysis and Target and Feature Engineering

Dealing with Dates and Seasonality

We will combine the features YrSold and MoSold to analyze each sale by its exact month and year (the new variable will be called SaleDate). If we plot the sales counts for this feature, we notice how seasonal the real estate market is (sales pick up in the summer). Notice the color of the bars: the darker they are, the later in the year the month is. Sales begin slow, pick up around mid-year, then diminish again by winter.

date_function <- function(df){
  df %>%
    mutate(MoSold = ifelse(MoSold < 10, paste(0,MoSold,sep =""), MoSold)) %>%
    mutate(SaleDate = paste(MoSold,"-01-",YrSold, sep = "")) %>% 
    mutate(SaleDate = as.Date(SaleDate, "%m-%d-%Y"))
}

train <- train %>% 
  date_function()

test <- test %>%
  date_function()

sales_by_month <- train %>% 
    group_by(SaleDate) %>%
    summarise(Count = n(), meanPrice = mean(SalePrice))
  
sales_by_month <- sales_by_month %>%
    mutate(movingAvg = (meanPrice + lag(meanPrice) + lag(meanPrice, 2) +
                          lag(meanPrice, 3) + lag(meanPrice, 4) +
                          lag(meanPrice, 5))/6,
           totalAvg = cumsum(meanPrice)/index(sales_by_month))


sales_by_month %>%
  ggplot(aes(x = SaleDate, y = Count, fill = lubridate::month(SaleDate))) +
    geom_col() +
    scale_x_date(breaks = unique(sales_by_month$SaleDate)) +
    labs(title = "Houses Sold by Date") +
    my_theme(type = "continuous") +
    theme(axis.text.x = element_text(angle = 270),
          legend.position = "none")

However, if we check the average Sale Price by month, we see another pattern: in most years (2009 being an exception), houses were on average more expensive by the end of the year. The gold line is the six-month moving average sale price and helps us visualize the seasonality. The dashed red line is the cumulative moving average sale price, which suggests another conclusion: Sale Price declined slightly over the years, but it hasn’t changed dramatically.

sales_by_month %>%
  ggplot(aes(x = SaleDate, y = meanPrice)) +
    geom_col(aes(fill = lubridate::month(SaleDate)), show.legend = FALSE) +
    geom_line(aes(y = movingAvg, color = "6-month Moving Average"), size = base_size/10) +
    geom_line(aes(y = totalAvg, color = "Cumulative Moving Average"), size = base_size/15, linetype = "dashed") +
    scale_x_date(breaks = unique(sales_by_month$SaleDate)) +
    scale_y_continuous(expand = c(0,0)) +
    scale_color_manual(name = "Average", 
                       breaks = c("6-month Moving Average", "Cumulative Moving Average"),
                       values = c("6-month Moving Average" = color2, 
                                  "Cumulative Moving Average" = "darkred")) +
    labs(title = "Average House Price per Date") +
    my_theme(type = "continuous") +
    theme(axis.text.x = element_text(angle = 270),
          legend.position = "bottom")

In conclusion, we will turn the MoSold feature into a factor variable, since it seems to have an impact on the price, but it isn’t a purely numerical feature.

Ordinal categorical features

Some of the categorical features in the data set are qualitative features that can be modified to better capture their relationship with the response variable. For example, ExterCond evaluates the condition of the materials on the exterior of the property, from Poor to Excellent, the latter being naturally preferred. We will perform transformations (label-encoding) to turn this kind of feature into integer features. The features to undergo label-encoding are essentially the ones that evaluate the quality (suffixes Qual or QC) or condition (suffix Cond) of various areas of the house:

  • ExterCond, ExterQual, HeatingQC, Functional, KitchenQual, FireplaceQu, GarageQual, GarageCond, PoolQC, Fence, GarageFinish, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2.

We will also perform idiosyncratic transformations on the following variables:

  • YearBuilt: we will cut and group the years in a way that divides the sample into levels with a similar number of observations. Since this will create a naturally ordinal feature, we will also add this to the label-encoding step, which will give the oldest house the minimum value and the newest house the maximum value. This variable will be called YearBuiltGrouped.

  • YearRemodAdd: this shall undergo the same transformation that we applied to the YearBuilt feature, except it concerns the year of the last remodeling.

  • MSSubClass: even though it is listed as an integer feature, we will turn it into a factor since it concerns the type of dwelling involved in the sale.

Other features will require replacement of the NA values with a string, since the NAs in these cases do not mean the values are missing but rather that the house does not have that specific amenity. (For example, the NAs in GarageType mean that those houses don’t have garages.)

levels1 <- rev(c("Ex", "Gd", "TA", "Fa", "Po"))
levels2 <- rev(c("Ex", "Gd", "TA", "Fa", "Po", "None"))
levels3 <- rev(c("Gd", "Av", "Mn", "No", "None"))
levels4 <- rev(c("GLQ", "ALQ", "BLQ", "Rec", "LwQ", "Unf", "None"))
levels5 <- rev(c("Typ", "Min1", "Min2", "Mod", "Maj1", "Maj2", "Sev", "Sal"))
levels6 <- rev(c("Fin", "RFn", "Unf", "None"))
levels7 <- rev(c("GdPrv", "MnPrv", "GdWo", "MnWw", "None"))

levels_yrbuilt <- cut2(train$YearBuilt, g = 8, onlycuts = TRUE)
levels_yrremod <- cut2(train$YearRemodAdd, g = 8, onlycuts = TRUE)

mutate_fun <- function(df){
  df %>%
    rename(FirstFloorSF = "1stFlrSF",
         SecondFloorSF = "2ndFlrSF",
         ThreeSsnPorch = "3SsnPorch") %>%
  mutate(MSSubClass = factor(MSSubClass), 
         MoSold = factor(MoSold), 
         YrSold = factor(YrSold),
         YearBuiltGrouped = cut(YearBuilt, breaks  = levels_yrbuilt, include.lowest = TRUE),
         YearRemodGrouped = cut(YearRemodAdd, breaks = levels_yrremod, include.lowest = TRUE),
         GarageType = replace_na(GarageType, "None"),
         Alley = replace_na(Alley, "None"),
         MiscFeature = replace_na(MiscFeature, "None"),
         ExterCond = fct_relevel(ExterCond, levels = levels1),
         ExterQual = fct_relevel(ExterQual, levels = levels1),
         HeatingQC = fct_relevel(HeatingQC, levels = levels1),
         Functional = fct_relevel(Functional, levels = levels5),
         KitchenQual = fct_relevel(KitchenQual, levels = levels1),
         FireplaceQu = fct_relevel(replace_na(FireplaceQu, "None"), levels = levels2),
         GarageQual = fct_relevel(replace_na(GarageQual, "None"), levels = levels2),
         GarageCond = fct_relevel(replace_na(GarageCond, "None"), levels = levels2),
         PoolQC = fct_relevel(replace_na(PoolQC, "None"), levels = levels2),
         Fence = fct_relevel(replace_na(Fence, "None"), levels = levels7),
         GarageFinish = fct_relevel(replace_na(GarageFinish, "None"), levels = levels6),
         BsmtQual = fct_relevel(replace_na(BsmtQual, "None"), levels = levels2),
         BsmtCond = fct_relevel(replace_na(BsmtCond, "None"), levels = levels2),
         BsmtExposure = fct_relevel(replace_na(BsmtExposure, "None"), levels = levels3),
         BsmtFinType1 = fct_relevel(replace_na(BsmtFinType1, "None"), levels = levels4),
         BsmtFinType2 = fct_relevel(replace_na(BsmtFinType2, "None"), levels = levels4)) %>%
  mutate_if(is.character, as.factor)
}

train <- train %>%
  mutate_fun() %>%
  select(-c(YearBuilt, YearRemodAdd))

test <- test %>%
  mutate_fun() %>%
  select(-c(YearBuilt, YearRemodAdd))

We will also deal with missing values by performing a KNN imputation, which searches for the 6 nearest neighboring observations and imputes values based on the average (for numeric features) or the mode (for categorical ones) of these neighbors.

Finally, we execute the feature engineering “recipe” to label-encode the features and impute the missing values.

label_encode <- c("ExterCond", "ExterQual", "HeatingQC",
                  "Functional", "KitchenQual","FireplaceQu",
                  "GarageQual", "GarageCond", "PoolQC",
                  "Fence", "GarageFinish",
                  "BsmtQual", "BsmtCond",
                  "BsmtExposure","BsmtFinType1", "BsmtFinType2",
                  "YearBuiltGrouped", "YearRemodGrouped")

(recipe <- recipe(SalePrice ~ ., data = train) %>%
  step_integer(all_of(label_encode)) %>%
  step_other(all_nominal(), threshold = 0.1, other = "other") %>%
  step_knnimpute(all_predictors(), neighbors = 6))
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         81
## 
## Operations:
## 
## Integer encoding for all_of(label_encode)
## Collapsing factor levels for all_nominal()
## K-nearest neighbor imputation for all_predictors()
prepare <- prep(recipe, training = train)
train <- bake(prepare, new_data = train)
test <- bake(prepare, new_data = test)

Response Variable Transformation

In this section, we evaluate the distribution of our response variable. Can log-transformation make its shape more similar to a normal distribution?

ggplot(train, aes(x = SalePrice)) +
  geom_histogram(aes(y = ..count..), fill = color3, color = color1, size = base_size*0.05) +
  my_theme(type = "discrete") +
  theme(axis.title.y = element_blank()) +
  ggtitle("Response Variable Distribution")

ggplot(train, aes(x = log(SalePrice))) +
  geom_histogram(aes(y = ..count..), fill = color3, color = color1,  size = base_size*0.05) +
  my_theme(type = "discrete") +
  theme(axis.title.y = element_blank()) +
  ggtitle("Response Variable Distribution (Log-Transformated)")

Log-transforming the response variable has lowered the skewness and kurtosis of its distribution.
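
As a quick numeric check of this claim, we can compare the shape statistics of the raw and log-transformed response using the moments package loaded above (the exact figures are not reproduced here):

# Skewness and kurtosis of the raw and the log-transformed response
c(skewness = skewness(train$SalePrice), kurtosis = kurtosis(train$SalePrice))
c(skewness = skewness(log(train$SalePrice)), kurtosis = kurtosis(log(train$SalePrice)))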

We will add this to the baking recipe. We will also add a Yeo-Johnson transformation to normalize the numeric predictors, along with standardization (centering and scaling) and near-zero-variance removal.

Numeric Predictors Analysis and Engineering

In this section we analyze numeric features. We already determined that log-transforming SalePrice will help normalize its distribution.

We will also perform transformations to normalize and standardize the numeric features. In this step, we add a Yeo-Johnson transformation to the recipe, which will determine on its own which features require normalizing. Centering and scaling steps will also be performed.

Dropping near-zero-variance predictors is also a step we will add, since it performs feature filtering very early in the process.

(recipe2 <- recipe(SalePrice ~ ., data = train) %>%
  step_log(all_outcomes()) %>%
  step_nzv(all_predictors()) %>%
  step_YeoJohnson(all_numeric()) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()))
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         81
## 
## Operations:
## 
## Log transformation on all_outcomes()
## Sparse, unbalanced variable filter on all_predictors()
## Yeo-Johnson transformation on all_numeric()
## Centering for all_numeric(), -all_outcomes()
## Scaling for all_numeric(), -all_outcomes()
prepare2 <- prep(recipe2, training = train)

baked_train <- bake(prepare2, new_data = train)
baked_train <- cbind(train$Id, baked_train)
baked_train <- baked_train %>%
  select(-Id) %>%
  rename(Id  = "train$Id")

baked_test <- bake(prepare2, new_data = test)
baked_test <- cbind(test$Id, baked_test)
baked_test <- baked_test %>%
  select(-Id) %>%
  rename(Id  = "test$Id")

This step dropped 18 variables from the training set. This will help us perform cleaner Exploratory Data Analysis.
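
As a quick sanity check (not part of the modeling pipeline), we can list exactly which columns the near-zero-variance filter removed by comparing column names before and after baking:

# Columns present before pre-processing but dropped by step_nzv()
setdiff(names(train), names(baked_train))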

Let’s take a look at the numeric features in the “baked” data set. Which will have higher correlation with the response variable?

cormat <- cor(baked_train %>% select_if(is.numeric) %>% select(-Id))

get_upper_tri<-function(cormat){
    cormat[lower.tri(cormat)] <- NA
    return(cormat)
}

upper_tri <- melt(get_upper_tri(cormat), na.rm = TRUE)
  
ggplot(upper_tri, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(x = Var1, y = Var2, label = round(value, 2)),
            color = "white", size = base_size/5) +
  labs(title = "Correlation Matrix of Numeric Features",
       fill = "Correlation") +
  my_theme(type = "continuous") +
  theme(axis.text.x = element_text(angle = 270),
        axis.title = element_blank(),
        legend.position = "none")

upper_tri %>%
  filter(Var2 == "SalePrice", Var1 != "SalePrice") %>%
  arrange(desc(value)) %>%
  head(10)

Qualitative evaluation features have high correlation with the final Sale Price. Amongst these, the Overall Quality is the most important, followed by the quality of the external area, the kitchen quality and the basement quality.

We also see that many features measuring the square footage of various areas of the house seem to be important. The most relevant one is GrLivArea, the above-ground living area, followed by the areas of the garage, basement and first floor.

The number of cars that fit in the garage is important, as is the period in which the house was built.

Which features have low correlation with the response variable?

upper_tri %>%
  filter(Var2 == "SalePrice", Var1 != "SalePrice") %>%
  arrange(value) %>%
  head(10)

The Fence, the overall condition, the number of basement half-bathrooms and the exterior condition have the lowest correlations with the response variable. Next, let’s look at the distributions of the most relevant ordinal features and their relationship with the response variable.

boxplot_plot <- function(df, yvar, xvar, logtrans = FALSE){
  x <- eval(substitute(xvar), df)
  y <- eval(substitute(yvar), df)
  plot <- ggplot(data = df, aes(x = factor(df[[x]]), y = df[[y]])) +
    geom_boxplot(color = color1, fill = "lightskyblue", size = base_size*0.05) +
    my_theme(type = "discrete") +
    theme(panel.grid.major.x = element_blank(),
          panel.grid.minor.y = element_line(linetype = "dashed")) +
    labs(title = paste(as.character(substitute(y))," by ",
                         as.character(substitute(x))), 
         y = paste(as.character(substitute(y))),
         x = paste(as.character(substitute(x))))
  if(logtrans == FALSE){
    plot + scale_y_continuous(labels = dollar, breaks = pretty_breaks(n = 5))
  } else {
    plot + scale_y_log10(labels = dollar, breaks = pretty_breaks(n = 5))
  }
}

histogram_plot <- function(df, xvar){
  x <- eval(substitute(xvar), df)
  ggplot(data = df, aes(x = factor(df[[x]]))) +
    geom_histogram(stat = 'count', aes(y = ..count..),
                   fill = color1, color = "lightskyblue", size = base_size/7.5) +
    my_theme(type = "discrete") +
    theme(panel.grid.major.x = element_blank(),
          panel.grid.minor.y = element_line(linetype = "dashed")) +
    labs(title = "Histogram", 
         y = "Count",
         x = paste(as.character(substitute(xvar))))
}

ordered_analysis <- function(df, yvar, xvar){
  grid.arrange(
    histogram_plot(train, xvar),
    boxplot_plot(train, xvar, yvar),
    nrow = 1, ncol = 2
  )
}

ordered_analysis(train, yvar = "SalePrice", xvar = "OverallQual")

ordered_analysis(train, yvar = "SalePrice", xvar = "ExterQual")

ordered_analysis(train, yvar = "SalePrice", xvar = "KitchenQual")

Evidently, the expected sale price increases with the quality of the rooms in the house. Also, most houses receive a medium quality rating.

ordered_analysis(train, yvar = "SalePrice", xvar = "YearBuiltGrouped")

We have grouped the years in which the houses were built, so the higher the “grade” (from 1 to 8), the newer the house. Average sale price is higher for newer houses, but the change is very slight for the older groups (1 to 3).

Now let’s plot the most relevant continuous variables. We will fit a simple linear model to try to capture the statistical relationship between these features and the response variable. We will also plot the model residuals to check for homoscedasticity. The blue line in the residual plot is a LOESS smoothing of the residuals.

outliers <- function(df, x, upper, lower){
  x <- eval(quote(x), df)
  sel <- df[[x]]
  upper_bound <- quantile(sel, upper)
  lower_bound <- quantile(sel, lower)
  return(sel[sel > upper_bound | sel < lower_bound])
}

density_plot <- function(df,xvar, upper.bound, lower.bound, logtrans = FALSE){
  x <- eval(substitute(xvar), df)
  out <- outliers(df, xvar, upper = upper.bound, lower = lower.bound)
  df1 <- subset(df, !df[[x]] %in% out)
  plot <- ggplot(data = df1, aes(x = df1[[x]])) +
      geom_histogram(aes(y = ..density..), 
                   bins = 50, fill = color1, alpha = 0.5) +
      geom_density(aes(y = ..density..),
                 size = base_size*0.05, color = color1, fill = color3, alpha = 0.5) +
      my_theme(type = "discrete") +
      theme(panel.grid.minor.x = element_line(linetype = "dashed"),
          panel.grid.minor.y = element_line(linetype = "dashed")) +
      labs(title = "Density Plot", 
         y = "Density",
         x = paste(as.character(substitute(x))))
  if(logtrans == FALSE){
    return(plot + scale_x_continuous(breaks = pretty_breaks(n = 5)))
  } else {
    return(plot + scale_x_log10(breaks = pretty_breaks(n = 5)))
  }
}

scatter_plot <- function(df,yvar,xvar, upper.bound, lower.bound,logtrans = FALSE){
  x <- eval(substitute(xvar), df)
  y <- eval(substitute(yvar), df)
  out <- outliers(df, xvar, upper = upper.bound, lower = lower.bound)
  df1 <- subset(df, !df[[x]] %in% out)
  plot <- ggplot(data = df1, aes(x = df1[[x]], y = df1[[y]])) +
      geom_point(color = color1, alpha = 0.5) +
      geom_smooth(method = "lm", color = color3) +
      scale_y_log10(labels = dollar) +
      my_theme(type = "discrete") +
      labs(title = paste(as.character(substitute(y))," by ",
                         as.character(substitute(x))), 
         y = paste(as.character(substitute(y))),
         x = paste(as.character(substitute(x))))
  if(logtrans == FALSE){
    return(plot + scale_x_continuous(breaks = pretty_breaks(n = 5)))
  } else {
    return(plot + scale_x_log10(breaks = pretty_breaks(n = 5)))
  }
}

residual_plot <- function(df, response, predictor, xvar, upper.bound, lower.bound){
  x <- eval(substitute(xvar), df)
  predictor <- eval(substitute(predictor), df)
  y <- eval(substitute(response), df)
  out <- outliers(df, xvar, upper = upper.bound, lower = lower.bound)
  df1 <- subset(df, !df[[x]] %in% out)
  formula <- as.formula(paste(as.character(substitute(y)), "~", 
                              as.character(substitute(predictor))))
  fit <- train(
    form = formula, 
    data = df1, 
    method = "lm",
    trControl = trainControl(method = "cv", number = 10)
  )
  ggplot(fit$finalModel, aes(fitted(fit), resid(fit))) +
    geom_point(alpha = .6, color = color1) +
    geom_smooth(method = "loess", color = color3, size = base_size*0.08, se = FALSE) +
    geom_hline(yintercept=0, color = color2, size = base_size*0.05, alpha = 0.5) +
    scale_x_continuous(expand = c(0, 0)) +
    labs(title = "Model Residuals", 
       x = "Predicted Values",
       y = "Residuals") +
   my_theme(type = "discrete")
}

numerical_analysis <- function(df, yvar, xvar, response, predictor, 
                               upper.bound = 0.975, lower.bound = 0.025, logtrans = FALSE){
  grid.arrange(
    density_plot(df, xvar, upper.bound = upper.bound, lower.bound = lower.bound,
                 logtrans = logtrans),
    scatter_plot(df, yvar, xvar, upper.bound = upper.bound, lower.bound = lower.bound,
                 logtrans = logtrans), 
    residual_plot(df, response = response, predictor = predictor, xvar = xvar,
                  upper.bound = upper.bound, lower.bound = lower.bound), 
    ncol = 2, 
    layout_matrix = cbind(c(1,1), c(2,3))
  )
}

numerical_analysis(train, yvar = "SalePrice", xvar = "GrLivArea", 
                   response = "log(SalePrice)", predictor = "log(GrLivArea)", 
                   logtrans = TRUE)

We have log-scaled GrLivArea (Above Ground Living Area) to normalize its distribution. This feature has a strong linear relationship with the log-scaled Sale Price. We also see that the errors seem to have constant variance.

numerical_analysis(train, yvar = "SalePrice", xvar = "GarageArea", 
                   response = "log(SalePrice)", predictor = "GarageArea", 
                   logtrans = FALSE)

For GarageArea, the most noticeable pattern is the high occurrence of zeros, which indicates that many houses have no garage at all.

numerical_analysis(train, yvar = "SalePrice", xvar = "TotalBsmtSF", 
                   response = "log(SalePrice)", predictor = "TotalBsmtSF")

numerical_analysis(train, yvar = "SalePrice", xvar = "FirstFloorSF", 
                   response = "log(SalePrice)", predictor = "log(FirstFloorSF)", 
                   logtrans = TRUE)

For the Basement and First Floor areas, the model residuals show higher variance than for the two previously analyzed features.

Categorical Features Analysis and Engineering

Now, onto categorical features. First of all, we will one-hot encode all the nominal variables. This means each categorical feature will be turned into a set of 0/1 dummy variables.

(recipe3 <- recipe(SalePrice ~ ., data = baked_train) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE))
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         63
## 
## Operations:
## 
## Dummy variables from all_nominal(), -all_outcomes()
prepare3 <- prep(recipe3, training = baked_train)
baked_train1 <- bake(prepare3, new_data = baked_train)
baked_test <- bake(prepare3, new_data = baked_test)

Let’s calculate the Point Biserial Correlation between each dummy variable and the response variable, and plot the correlations with the highest absolute values that are statistically significant.

baked_train_dummy <- baked_train1 %>%
 select_if(~ is.numeric(.) && all(between(., 0, 1)))

pbs_correlation <- data.frame(0)
for (i in 1:ncol(baked_train_dummy)){
  x <- baked_train_dummy[[i]]
  cortest <- cor.test(baked_train[["SalePrice"]], x)
  pbs_correlation[i,1] <- base::colnames(baked_train_dummy[,i])
  pbs_correlation[i,2] <- cortest$p.value
  pbs_correlation[i,3] <- cortest$estimate
}

base::colnames(pbs_correlation) <-  c("Feature", "P_value", "Correlation")

pbs_correlation %>%
  filter(P_value <= 0.05) %>%
  ggplot(aes(x = Correlation)) +
    geom_col(data = pbs_correlation %>%
               filter(P_value <= 0.05, Correlation > 0.25 | Correlation < -0.25) %>%
               arrange(desc(Correlation)), 
             aes(x = Correlation, y = fct_reorder(Feature, Correlation), 
                 fill = Correlation)) +
  my_theme(type = "continuous") +
  labs(title = "Point Biserial Correlation: dummy variables vs. SalePrice",
       y = "")

Foundation_PConc is the feature with the highest positive correlation, meaning that houses with a poured-concrete foundation tend to sell for higher prices.
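
A quick aggregate on the training set, which still holds prices in dollars, supports this reading (a sanity check rather than part of the recipe):

# Mean sale price by foundation type
train %>%
  group_by(Foundation) %>%
  summarise(meanPrice = mean(SalePrice), Count = n()) %>%
  arrange(desc(meanPrice))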

Random Forest Model

The basis of our model will be a Random Forest. By building several decision trees on bootstrapped copies of the training set, a bagged-tree process achieves good predictive performance with low variance. Combining this with the randomization of a limited subset of features considered at each split gives us a Random Forest, which further improves the model by de-correlating the trees.

We will perform a grid search to automatically tune the hyper-parameters of the model. Using the ranger method, we tune the following hyper-parameters: mtry is the size of the subset of features considered at each split, min.node.size is the minimum number of observations in each terminal node, and splitrule is the criterion used to decide each split.

response <- "SalePrice"
predictors <- setdiff(names(baked_train1), response)
n_features <- length(predictors)

set.seed(123)

rf_grid <- expand.grid(
  mtry = floor(n_features * c(.25, .333,.5,.666,.75)),
  min.node.size = c(2,3,5),
  splitrule = c("extratrees")
)

train_control <- trainControl(
  method = "cv", 
  number = 5, 
  savePredictions = "final",
  index=createResample(baked_train1$SalePrice, 5))

(rf_model <- train(
  SalePrice~., 
  data= baked_train1, 
  method = 'ranger', 
  tuneGrid = rf_grid,
  trControl= train_control))
## Random Forest 
## 
## 1460 samples
##  110 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1460, 1460, 1460, 1460, 1460 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  RMSE        Rsquared   MAE        
##   27    2              0.01414142  0.8633831  0.009432508
##   27    3              0.01412139  0.8637996  0.009436124
##   27    5              0.01414600  0.8631515  0.009435525
##   36    2              0.01400858  0.8650781  0.009327478
##   36    3              0.01402868  0.8648902  0.009323417
##   36    5              0.01405492  0.8643441  0.009370735
##   55    2              0.01386001  0.8669270  0.009234908
##   55    3              0.01389116  0.8665296  0.009247718
##   55    5              0.01383715  0.8676296  0.009233661
##   73    2              0.01380998  0.8674738  0.009207579
##   73    3              0.01381440  0.8673175  0.009202873
##   73    5              0.01381143  0.8675723  0.009208541
##   82    2              0.01379674  0.8678737  0.009192222
##   82    3              0.01381949  0.8670338  0.009228803
##   82    5              0.01381297  0.8670811  0.009195442
## 
## Tuning parameter 'splitrule' was held constant at a value of extratrees
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 82, splitrule = extratrees
##  and min.node.size = 2.

The grid search selected the hyper-parameters that minimized the overall RMSE. We see that it has selected a subset size (mtry) of 82, which corresponds to 75% of the original feature space.

We also notice that the optimal RMSE of 0.0137967 is lower than the standard deviation of the response variable, 0.0371552. This indicates that our model fares better than the naive strategy of predicting the mean for every observation. Mind you, these values have undergone the transformations applied in our pre-processing of the data, so they don’t reflect error estimates in dollars but rather in a normalized, log-adjusted scale.
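
Both the response-variable standard deviation and the 75% figure quoted above can be reproduced from objects already in memory (a quick check, not new modeling code):

# Standard deviation of the transformed response: roughly the RMSE of always predicting the mean
sd(baked_train1$SalePrice)

# The selected mtry corresponds to 75% of the 110 predictors
floor(n_features * 0.75)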

ggplot(rf_model) +
  geom_point(size = base_size/5, aes(color = min.node.size)) +
  geom_line(aes(group = min.node.size, color = min.node.size),
            size = base_size/15) +
  guides(shape = FALSE) +
  theme(legend.position = "bottom") +
  my_theme(type = "discrete") +
  labs(
    title = "Model Tuning - Random Forest",
    color = "Minimal Node Size",
    x = "Subset of Features (mtry)"
    )

Assessing Model Performance

Let’s make predictions on the training set using the model we just trained. How did it fare against the actual sale prices?

lambda <- prepare2$steps[[3]]$lambdas["SalePrice"]

predictions <- as.data.frame(matrix(nrow = 1460))

predictions$rf_pred <- predict(rf_model, newdata = baked_train1)

predictions$SalePrice <- baked_train1$SalePrice

predictions <- as.data.frame(predictions)

predictions <- predictions %>%
  mutate(rf_pred = exp(1)^yeo.johnson(rf_pred, lambda = lambda, inverse = TRUE),
         SalePrice = exp(1)^yeo.johnson(SalePrice, lambda = lambda, inverse = TRUE))

fitted_values_plot <- function(df, x){
  df %>%
    ggplot(aes(x = {{x}}, y = SalePrice)) +
      geom_point(color = color1, alpha = 0.5) +
      geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
      my_theme(type = "discrete") +
      scale_y_log10(labels = dollar) +
      scale_x_log10(labels = dollar) +
      coord_flip() +
      labs(title = "Fitted Values vs Response Values",
         x = "Fitted Values",
         y = "Response Values")
}

fitted_values_plot(predictions, x = rf_pred)

When we plot the fitted values against the actual response values, we see that most observations fall pretty close to the dashed line where x = y. That is, except for really cheap or really expensive houses. Let’s evaluate the residuals.

predictions <- predictions %>%
  mutate(rf_residuals = SalePrice - rf_pred)

resid_plot <- function(df, x, y){
  df %>%
    ggplot(aes(x = {{x}}, y = {{y}})) +
      geom_point(color = color1, alpha = 0.5) +
      geom_hline(yintercept=0, color = color2, size = 0.8, alpha = 0.5) +
      geom_smooth(method = "loess", color = color3, se = TRUE) +
      scale_x_log10(labels = dollar) +
      scale_y_continuous(labels = dollar) +
      my_theme(type = "discrete") +
      labs(title = "Model Residuals",
         y = "Residuals", 
         x = "Fitted Values")
}

resid_histogram <- function(df, x){
  df %>%
    ggplot(aes(x = {{x}})) +
      geom_histogram(aes(y = ..count..), bins = 100,
                     color = color3, fill = color1, size = base_size/30) +
      my_theme(type = "discrete") +
      labs(title = "Residual Distribution", 
           x = "Residuals",
           y = "Density") +
      coord_flip()
}

residual_analysis <- function(df, fitted, resid){
  grid.arrange(
    resid_plot(df = df, x = {{fitted}}, y = {{resid}}),
    resid_histogram(df = df, x = {{resid}}),
    nrow = 1, ncol = 2, widths = c(2, 1)
  )
}

residual_analysis(df = predictions, fitted = rf_pred, resid = rf_residuals)

Looking at the residual distribution, we see that it is mostly centered around 0, which is a good sign. However, the distribution is right-skewed: for expensive houses, residuals tend to be very large and positive (predicted values are well below the actual values), while for cheap houses residuals are generally negative (predicted values are above the actual values). Random Forest predictions are averages of training observations, which makes them bad at extrapolating beyond the range of the training data.
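
The right skew can be confirmed numerically with the moments package loaded above (a quick check rather than reported output):

# Positive skewness confirms the long right tail of the residual distribution
skewness(predictions$rf_residuals)
summary(predictions$rf_residuals)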

For most houses, our model will have good predictive power. However, for the outlying, very cheap or very expensive houses, our model will make a very poor prediction.

This means that the Random Forest model is not enough on its own; we will have to combine it with other models that are better at extrapolating.

XGBoost Model

We will perform a 5-step process to build an Extreme Gradient Boosting Machine model. This model is also tree-based, but unlike the Random Forest, it uses shallow trees that are built sequentially, each one correcting the errors of the previous ones via gradient descent.

Step one of this process is to tune the hyper-parameters of the gradient boosting machine itself: nrounds, which represents the number of trees built, and eta, the “learning rate”, i.e., the rate at which the gradient descent occurs.

set.seed(123)

xgb_grid <- expand.grid(
  nrounds = seq(from = 200, to = 1000, by = 100),
  eta = c(0.025, 0.05, 0.1, 0.3),
  max_depth = c(2, 3, 5),
  gamma = 0, 
  colsample_bytree = 1,
  min_child_weight = 1, 
  subsample = 1
)

xgb_model <- train(
  SalePrice~., 
  data= baked_train1, 
  method = 'xgbTree', 
  tuneGrid = xgb_grid,
  objective = "reg:squarederror",
  trControl= train_control)

In Step 2, we hold constant the values tuned in Step 1 and move on to the tree parameters: the maximum depth of each tree and the minimum sum of instance weights in a child node. Remember that the goal of the gradient boosting machine is to build shallow trees, not deep ones.

set.seed(123)

xgb_grid2 <- expand.grid(
  nrounds = xgb_model$bestTune$nrounds,
  eta = xgb_model$bestTune$eta,
  max_depth = c(2, 3, 5),
  min_child_weight = c(1, 2, 3),
  colsample_bytree = 1, 
  subsample = 1, 
  gamma = 0
)

xgb_model2 <- train(
  SalePrice~., 
  data= baked_train1, 
  method = 'xgbTree', 
  tuneGrid = xgb_grid2,
  objective = "reg:squarederror",
  trControl= train_control)

In Step 3, we deal with the stochastic GBM hyper-parameters, which help prevent the gradient descent from getting “stuck” by subsampling rows and columns (much like in Random Forests).

set.seed(123)

xgb_grid3 <- expand.grid(
  nrounds = xgb_model$bestTune$nrounds,
  eta = xgb_model$bestTune$eta,
  max_depth = xgb_model2$bestTune$max_depth,
  min_child_weight = xgb_model2$bestTune$min_child_weight,
  colsample_bytree = c(0.5, 0.75, 1.0), 
  subsample = c(0.5, 0.75, 1.0), 
  gamma = 0
)

xgb_model3 <- train(
  SalePrice~., 
  data= baked_train1, 
  method = 'xgbTree', 
  tuneGrid = xgb_grid3,
  objective = "reg:squarederror",
  trControl= train_control)

In Step 4, we tune gamma, a hyper-parameter for pruning our trees: the minimum reduction in loss required to make a further split. This helps prevent overfitting.

set.seed(123)

xgb_grid4 <- expand.grid(
  nrounds = xgb_model$bestTune$nrounds,
  eta = xgb_model$bestTune$eta,
  max_depth = xgb_model2$bestTune$max_depth,
  min_child_weight = xgb_model2$bestTune$min_child_weight,
  colsample_bytree = xgb_model3$bestTune$colsample_bytree, 
  subsample = xgb_model3$bestTune$subsample, 
  gamma = c(0, 0.05, 0.1, 0.5, 0.7, 0.9, 1.0)
)

xgb_model4 <- train(
  SalePrice~., 
  data= baked_train1, 
  method = 'xgbTree', 
  tuneGrid = xgb_grid4,
  objective = "reg:squarederror",
  trControl= train_control)

In Step 5, having tuned every other parameter, we retune the learning rate (together with the number of trees) so that a smaller rate of descent can be used, helping the model reach a lower error. This will be our final model.

set.seed(123)

xgb_grid5 <- expand.grid(
  nrounds = seq(100, 10000, by = 100),
  eta = c(0.01, 0.015, 0.025, 0.05),
  max_depth = xgb_model2$bestTune$max_depth,
  min_child_weight = xgb_model2$bestTune$min_child_weight,
  colsample_bytree = xgb_model3$bestTune$colsample_bytree, 
  subsample = xgb_model3$bestTune$subsample, 
  gamma = xgb_model4$bestTune$gamma
)

(xgb_model5 <- train(
  SalePrice~., 
  data= baked_train1, 
  method = 'xgbTree', 
  tuneGrid = xgb_grid5,
  objective = "reg:squarederror",
  trControl= train_control))
## eXtreme Gradient Boosting 
## 
## 1460 samples
##  110 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1460, 1460, 1460, 1460, 1460 
## Resampling results across tuning parameters:
## 
##   eta    nrounds  RMSE        Rsquared   MAE        
##   0.010    100    0.85387161        NaN  0.853077045
##   0.010    200    0.31401178  0.6805169  0.312662842
##   0.010    300    0.11673631  0.8078558  0.114860590
##   0.010    400    0.04537512  0.8491497  0.043024283
##   0.010    500    0.02127960  0.8637374  0.018195498
##   0.010    600    0.01476462  0.8716227  0.010931428
##   0.010    700    0.01331284  0.8770471  0.009220156
##   0.010    800    0.01290540  0.8807505  0.008777987
##   0.010    900    0.01273757  0.8829988  0.008613014
##   0.010   1000    0.01263120  0.8847513  0.008531811
##   0.010   1100    0.01255545  0.8860364  0.008468356
##   0.010   1200    0.01251112  0.8867942  0.008428738
##   0.010   1300    0.01247390  0.8874650  0.008398666
##   0.010   1400    0.01243772  0.8880838  0.008366186
##   0.010   1500    0.01241854  0.8884389  0.008341119
##   0.010   1600    0.01239894  0.8887879  0.008321055
##   0.010   1700    0.01238474  0.8890042  0.008299817
##   0.010   1800    0.01237175  0.8892561  0.008283489
##   0.010   1900    0.01236528  0.8893860  0.008272097
##   0.010   2000    0.01236333  0.8894256  0.008261313
##   0.010   2100    0.01235629  0.8895580  0.008251056
##   0.010   2200    0.01235007  0.8896494  0.008244790
##   0.010   2300    0.01234297  0.8897310  0.008235373
##   0.010   2400    0.01234652  0.8896673  0.008227902
##   0.010   2500    0.01234438  0.8896994  0.008225872
##   0.010   2600    0.01234196  0.8897389  0.008222050
##   0.010   2700    0.01234427  0.8897415  0.008223336
##   0.010   2800    0.01234221  0.8897746  0.008216764
##   0.010   2900    0.01234422  0.8897594  0.008216744
##   0.010   3000    0.01234754  0.8896907  0.008215171
##   0.010   3100    0.01235070  0.8896559  0.008213764
##   0.010   3200    0.01235100  0.8896428  0.008209477
##   0.010   3300    0.01235073  0.8896155  0.008206713
##   0.010   3400    0.01235594  0.8895506  0.008207553
##   0.010   3500    0.01235840  0.8894898  0.008207865
##   0.010   3600    0.01235700  0.8895052  0.008207755
##   0.010   3700    0.01235887  0.8895021  0.008206205
##   0.010   3800    0.01236131  0.8894393  0.008206407
##   0.010   3900    0.01236464  0.8893812  0.008207743
##   0.010   4000    0.01236670  0.8893524  0.008207947
##   0.010   4100    0.01236559  0.8893632  0.008207597
##   0.010   4200    0.01236905  0.8892863  0.008208116
##   0.010   4300    0.01236911  0.8893168  0.008207658
##   0.010   4400    0.01237421  0.8892351  0.008210150
##   0.010   4500    0.01237541  0.8892140  0.008211710
##   0.010   4600    0.01238112  0.8891106  0.008215724
##   0.010   4700    0.01238136  0.8891183  0.008214138
##   0.010   4800    0.01238138  0.8891070  0.008214525
##   0.010   4900    0.01238393  0.8890778  0.008215870
##   0.010   5000    0.01238760  0.8889968  0.008217617
##   0.010   5100    0.01239100  0.8889330  0.008219615
##   0.010   5200    0.01239286  0.8888881  0.008219906
##   0.010   5300    0.01239379  0.8888654  0.008221142
##   0.010   5400    0.01239647  0.8888285  0.008222793
##   0.010   5500    0.01239903  0.8887608  0.008223076
##   0.010   5600    0.01240113  0.8887324  0.008225922
##   0.010   5700    0.01240257  0.8886971  0.008227442
##   0.010   5800    0.01240402  0.8886747  0.008229043
##   0.010   5900    0.01240667  0.8886308  0.008229715
##   0.010   6000    0.01241000  0.8885751  0.008230931
##   0.010   6100    0.01241318  0.8885213  0.008233799
##   0.010   6200    0.01241340  0.8885103  0.008235837
##   0.010   6300    0.01241690  0.8884451  0.008238619
##   0.010   6400    0.01242064  0.8883860  0.008240072
##   0.010   6500    0.01242159  0.8883583  0.008241582
##   0.010   6600    0.01242261  0.8883503  0.008240948
##   0.010   6700    0.01242447  0.8883033  0.008242908
##   0.010   6800    0.01242592  0.8882583  0.008244519
##   0.010   6900    0.01242677  0.8882436  0.008244885
##   0.010   7000    0.01243062  0.8881823  0.008247784
##   0.010   7100    0.01243224  0.8881491  0.008250079
##   0.010   7200    0.01243315  0.8881281  0.008251284
##   0.010   7300    0.01243634  0.8880831  0.008254001
##   0.010   7400    0.01243915  0.8880320  0.008255520
##   0.010   7500    0.01244005  0.8880075  0.008256671
##   0.010   7600    0.01244000  0.8879973  0.008257277
##   0.010   7700    0.01244211  0.8879655  0.008259174
##   0.010   7800    0.01244657  0.8878865  0.008263320
##   0.010   7900    0.01244734  0.8878779  0.008263887
##   0.010   8000    0.01244878  0.8878571  0.008264912
##   0.010   8100    0.01244965  0.8878293  0.008266762
##   0.010   8200    0.01245256  0.8877655  0.008269256
##   0.010   8300    0.01245457  0.8877420  0.008270804
##   0.010   8400    0.01245561  0.8877155  0.008271162
##   0.010   8500    0.01245716  0.8876827  0.008272404
##   0.010   8600    0.01245811  0.8876630  0.008272850
##   0.010   8700    0.01245876  0.8876496  0.008274428
##   0.010   8800    0.01246163  0.8876028  0.008277478
##   0.010   8900    0.01246297  0.8875703  0.008278485
##   0.010   9000    0.01246243  0.8875784  0.008278837
##   0.010   9100    0.01246355  0.8875471  0.008280176
##   0.010   9200    0.01246556  0.8875167  0.008281441
##   0.010   9300    0.01246604  0.8875067  0.008282818
##   0.010   9400    0.01246566  0.8875095  0.008283199
##   0.010   9500    0.01246684  0.8874876  0.008284456
##   0.010   9600    0.01246795  0.8874710  0.008285253
##   0.010   9700    0.01246901  0.8874508  0.008286328
##   0.010   9800    0.01247041  0.8874270  0.008287939
##   0.010   9900    0.01247179  0.8874041  0.008289623
##   0.010  10000    0.01247249  0.8873894  0.008290750
##   0.015    100    0.51554354  0.5869598  0.514415724
##   0.015    200    0.11583916  0.8070756  0.113946298
##   0.015    300    0.02956789  0.8572672  0.026789506
##   0.015    400    0.01471319  0.8715855  0.010869642
##   0.015    500    0.01304894  0.8790998  0.008918842
##   0.015    600    0.01273788  0.8829099  0.008592721
##   0.015    700    0.01258167  0.8854992  0.008452760
##   0.015    800    0.01250674  0.8867935  0.008385423
##   0.015    900    0.01244108  0.8879130  0.008325411
##   0.015   1000    0.01241311  0.8883610  0.008291066
##   0.015   1100    0.01238676  0.8888805  0.008259770
##   0.015   1200    0.01236556  0.8892544  0.008241903
##   0.015   1300    0.01233845  0.8897205  0.008211343
##   0.015   1400    0.01233310  0.8898580  0.008202472
##   0.015   1500    0.01232140  0.8900608  0.008184101
##   0.015   1600    0.01232130  0.8901003  0.008180069
##   0.015   1700    0.01230957  0.8902601  0.008166509
##   0.015   1800    0.01231018  0.8902990  0.008162514
##   0.015   1900    0.01231108  0.8902723  0.008165554
##   0.015   2000    0.01231022  0.8903244  0.008162020
##   0.015   2100    0.01231032  0.8902905  0.008155305
##   0.015   2200    0.01231356  0.8902649  0.008158117
##   0.015   2300    0.01231717  0.8901871  0.008158377
##   0.015   2400    0.01232215  0.8900856  0.008160755
##   0.015   2500    0.01232531  0.8900098  0.008159903
##   0.015   2600    0.01232928  0.8899616  0.008164432
##   0.015   2700    0.01233015  0.8899293  0.008163505
##   0.015   2800    0.01233121  0.8899146  0.008162118
##   0.015   2900    0.01233544  0.8898344  0.008163442
##   0.015   3000    0.01234022  0.8897493  0.008165855
##   0.015   3100    0.01234435  0.8896809  0.008170613
##   0.015   3200    0.01235176  0.8895554  0.008175070
##   0.015   3300    0.01235866  0.8894329  0.008182868
##   0.015   3400    0.01236205  0.8893654  0.008185501
##   0.015   3500    0.01236755  0.8892633  0.008191477
##   0.015   3600    0.01236885  0.8892574  0.008191101
##   0.015   3700    0.01236889  0.8892329  0.008193100
##   0.015   3800    0.01237056  0.8892110  0.008194462
##   0.015   3900    0.01237581  0.8891157  0.008196920
##   0.015   4000    0.01237919  0.8890555  0.008200125
##   0.015   4100    0.01238225  0.8890072  0.008205764
##   0.015   4200    0.01238400  0.8889667  0.008206801
##   0.015   4300    0.01238650  0.8889158  0.008211646
##   0.015   4400    0.01239061  0.8888518  0.008216396
##   0.015   4500    0.01239368  0.8887962  0.008219297
##   0.015   4600    0.01239700  0.8887361  0.008220888
##   0.015   4700    0.01239933  0.8886936  0.008222592
##   0.015   4800    0.01240375  0.8886089  0.008225464
##   0.015   4900    0.01240829  0.8885324  0.008229524
##   0.015   5000    0.01241109  0.8884796  0.008233064
##   0.015   5100    0.01241087  0.8884793  0.008233359
##   0.015   5200    0.01241318  0.8884318  0.008236304
##   0.015   5300    0.01241634  0.8883748  0.008239972
##   0.015   5400    0.01241848  0.8883396  0.008243229
##   0.015   5500    0.01241882  0.8883196  0.008244361
##   0.015   5600    0.01241967  0.8882934  0.008245071
##   0.015   5700    0.01242549  0.8881917  0.008249690
##   0.015   5800    0.01242893  0.8881389  0.008252717
##   0.015   5900    0.01242859  0.8881423  0.008252902
##   0.015   6000    0.01243030  0.8881031  0.008253851
##   0.015   6100    0.01243084  0.8880874  0.008253745
##   0.015   6200    0.01243180  0.8880617  0.008256181
##   0.015   6300    0.01243496  0.8880114  0.008259024
##   0.015   6400    0.01243606  0.8879871  0.008260966
##   0.015   6500    0.01243856  0.8879494  0.008264002
##   0.015   6600    0.01243984  0.8879172  0.008265925
##   0.015   6700    0.01244092  0.8879013  0.008266658
##   0.015   6800    0.01244157  0.8878839  0.008268117
##   0.015   6900    0.01244338  0.8878502  0.008269375
##   0.015   7000    0.01244491  0.8878201  0.008271478
##   0.015   7100    0.01244587  0.8877956  0.008272412
##   0.015   7200    0.01244742  0.8877689  0.008274913
##   0.015   7300    0.01244900  0.8877420  0.008276153
##   0.015   7400    0.01245016  0.8877181  0.008277746
##   0.015   7500    0.01245121  0.8877035  0.008279559
##   0.015   7600    0.01245256  0.8876822  0.008281087
##   0.015   7700    0.01245400  0.8876550  0.008282449
##   0.015   7800    0.01245492  0.8876358  0.008284009
##   0.015   7900    0.01245804  0.8875849  0.008286405
##   0.015   8000    0.01245939  0.8875540  0.008287526
##   0.015   8100    0.01246071  0.8875347  0.008289713
##   0.015   8200    0.01246168  0.8875162  0.008290794
##   0.015   8300    0.01246237  0.8875062  0.008291848
##   0.015   8400    0.01246283  0.8874948  0.008292762
##   0.015   8500    0.01246510  0.8874586  0.008295155
##   0.015   8600    0.01246580  0.8874452  0.008296572
##   0.015   8700    0.01246648  0.8874354  0.008297380
##   0.015   8800    0.01246728  0.8874163  0.008298228
##   0.015   8900    0.01246856  0.8873926  0.008299159
##   0.015   9000    0.01246897  0.8873858  0.008300150
##   0.015   9100    0.01246917  0.8873765  0.008300550
##   0.015   9200    0.01247036  0.8873574  0.008301598
##   0.015   9300    0.01247120  0.8873407  0.008302484
##   0.015   9400    0.01247192  0.8873264  0.008303072
##   0.015   9500    0.01247242  0.8873149  0.008303844
##   0.015   9600    0.01247251  0.8873159  0.008304259
##   0.015   9700    0.01247347  0.8872974  0.008305602
##   0.015   9800    0.01247429  0.8872816  0.008305796
##   0.015   9900    0.01247471  0.8872756  0.008306699
##   0.015  10000    0.01247562  0.8872647  0.008307592
##   0.025    100    0.18747750  0.7565256  0.185860621
##   0.025    200    0.02090959  0.8614989  0.017744220
##   0.025    300    0.01310467  0.8780779  0.008941946
##   0.025    400    0.01269958  0.8836608  0.008562190
##   0.025    500    0.01255027  0.8862685  0.008438835
##   0.025    600    0.01247652  0.8876326  0.008357100
##   0.025    700    0.01243007  0.8885014  0.008308833
##   0.025    800    0.01238348  0.8892290  0.008262387
##   0.025    900    0.01237355  0.8893572  0.008245859
##   0.025   1000    0.01237703  0.8893663  0.008234050
##   0.025   1100    0.01236126  0.8895396  0.008213342
##   0.025   1200    0.01236872  0.8894625  0.008210338
##   0.025   1300    0.01237378  0.8893739  0.008205490
##   0.025   1400    0.01236571  0.8895043  0.008197807
##   0.025   1500    0.01236977  0.8894390  0.008201460
##   0.025   1600    0.01236984  0.8893681  0.008200426
##   0.025   1700    0.01237574  0.8893231  0.008202956
##   0.025   1800    0.01239033  0.8890434  0.008213137
##   0.025   1900    0.01240445  0.8887808  0.008220118
##   0.025   2000    0.01240896  0.8886910  0.008221984
##   0.025   2100    0.01242446  0.8884325  0.008232715
##   0.025   2200    0.01243418  0.8882620  0.008236421
##   0.025   2300    0.01244212  0.8880908  0.008241720
##   0.025   2400    0.01245232  0.8878843  0.008251754
##   0.025   2500    0.01245877  0.8877902  0.008255709
##   0.025   2600    0.01245898  0.8877796  0.008256087
##   0.025   2700    0.01246423  0.8876611  0.008261768
##   0.025   2800    0.01246280  0.8876912  0.008260843
##   0.025   2900    0.01246514  0.8876145  0.008263497
##   0.025   3000    0.01246765  0.8875567  0.008267931
##   0.025   3100    0.01247509  0.8874455  0.008276150
##   0.025   3200    0.01247630  0.8874218  0.008276712
##   0.025   3300    0.01247956  0.8873601  0.008276608
##   0.025   3400    0.01248293  0.8872897  0.008281052
##   0.025   3500    0.01248835  0.8871734  0.008285382
##   0.025   3600    0.01248934  0.8871665  0.008289170
##   0.025   3700    0.01249268  0.8870946  0.008291278
##   0.025   3800    0.01249854  0.8869905  0.008297627
##   0.025   3900    0.01250311  0.8869154  0.008301602
##   0.025   4000    0.01250993  0.8867850  0.008305981
##   0.025   4100    0.01251203  0.8867439  0.008309036
##   0.025   4200    0.01251354  0.8866966  0.008312272
##   0.025   4300    0.01251407  0.8866757  0.008313530
##   0.025   4400    0.01251656  0.8866285  0.008315088
##   0.025   4500    0.01251940  0.8865723  0.008317763
##   0.025   4600    0.01251855  0.8865771  0.008317423
##   0.025   4700    0.01251967  0.8865635  0.008318793
##   0.025   4800    0.01252056  0.8865376  0.008321107
##   0.025   4900    0.01252241  0.8865187  0.008322501
##   0.025   5000    0.01252387  0.8864839  0.008323623
##   0.025   5100    0.01252649  0.8864437  0.008326859
##   0.025   5200    0.01252700  0.8864186  0.008329328
##   0.025   5300    0.01252944  0.8863744  0.008332898
##   0.025   5400    0.01253188  0.8863342  0.008335071
##   0.025   5500    0.01253203  0.8863262  0.008335826
##   0.025   5600    0.01253257  0.8863139  0.008336087
##   0.025   5700    0.01253457  0.8862837  0.008338048
##   0.025   5800    0.01253654  0.8862476  0.008339889
##   0.025   5900    0.01253681  0.8862408  0.008341758
##   0.025   6000    0.01253641  0.8862444  0.008341886
##   0.025   6100    0.01253879  0.8862004  0.008344656
##   0.025   6200    0.01254070  0.8861669  0.008345638
##   0.025   6300    0.01254168  0.8861428  0.008346350
##   0.025   6400    0.01254161  0.8861485  0.008346739
##   0.025   6500    0.01254217  0.8861324  0.008347868
##   0.025   6600    0.01254241  0.8861234  0.008348202
##   0.025   6700    0.01254343  0.8861046  0.008349438
##   0.025   6800    0.01254289  0.8861163  0.008349032
##   0.025   6900    0.01254249  0.8861227  0.008349141
##   0.025   7000    0.01254322  0.8861110  0.008350290
##   0.025   7100    0.01254375  0.8861016  0.008351190
##   0.025   7200    0.01254374  0.8861030  0.008350843
##   0.025   7300    0.01254411  0.8860920  0.008350977
##   0.025   7400    0.01254359  0.8860975  0.008350743
##   0.025   7500    0.01254360  0.8860987  0.008350604
##   0.025   7600    0.01254366  0.8860968  0.008350876
##   0.025   7700    0.01254385  0.8860928  0.008351064
##   0.025   7800    0.01254370  0.8860923  0.008351037
##   0.025   7900    0.01254420  0.8860859  0.008351100
##   0.025   8000    0.01254442  0.8860813  0.008351273
##   0.025   8100    0.01254484  0.8860735  0.008351681
##   0.025   8200    0.01254485  0.8860741  0.008351914
##   0.025   8300    0.01254520  0.8860688  0.008352247
##   0.025   8400    0.01254522  0.8860677  0.008352137
##   0.025   8500    0.01254520  0.8860688  0.008352254
##   0.025   8600    0.01254546  0.8860647  0.008352237
##   0.025   8700    0.01254506  0.8860692  0.008352189
##   0.025   8800    0.01254538  0.8860631  0.008352855
##   0.025   8900    0.01254532  0.8860630  0.008352901
##   0.025   9000    0.01254529  0.8860646  0.008353076
##   0.025   9100    0.01254557  0.8860592  0.008353501
##   0.025   9200    0.01254574  0.8860564  0.008353499
##   0.025   9300    0.01254582  0.8860554  0.008353504
##   0.025   9400    0.01254600  0.8860530  0.008353686
##   0.025   9500    0.01254579  0.8860553  0.008353516
##   0.025   9600    0.01254605  0.8860528  0.008353794
##   0.025   9700    0.01254607  0.8860500  0.008353931
##   0.025   9800    0.01254598  0.8860524  0.008353922
##   0.025   9900    0.01254634  0.8860453  0.008354222
##   0.025  10000    0.01254639  0.8860449  0.008354283
##   0.050    100    0.02019549  0.8606155  0.016961112
##   0.050    200    0.01279386  0.8811788  0.008667120
##   0.050    300    0.01256953  0.8852374  0.008481421
##   0.050    400    0.01249969  0.8864172  0.008408215
##   0.050    500    0.01245798  0.8871797  0.008344584
##   0.050    600    0.01245868  0.8872084  0.008324380
##   0.050    700    0.01246484  0.8871500  0.008330722
##   0.050    800    0.01247842  0.8869224  0.008338992
##   0.050    900    0.01248562  0.8868353  0.008341825
##   0.050   1000    0.01250411  0.8865056  0.008356213
##   0.050   1100    0.01251547  0.8863078  0.008361095
##   0.050   1200    0.01252145  0.8862307  0.008365572
##   0.050   1300    0.01252606  0.8861557  0.008370854
##   0.050   1400    0.01253458  0.8860098  0.008377187
##   0.050   1500    0.01254621  0.8857896  0.008389025
##   0.050   1600    0.01255036  0.8857148  0.008397745
##   0.050   1700    0.01255913  0.8855536  0.008411406
##   0.050   1800    0.01255977  0.8855344  0.008417230
##   0.050   1900    0.01256272  0.8854499  0.008418160
##   0.050   2000    0.01256712  0.8853849  0.008424742
##   0.050   2100    0.01257240  0.8852863  0.008433033
##   0.050   2200    0.01257880  0.8851610  0.008439755
##   0.050   2300    0.01258292  0.8850647  0.008444897
##   0.050   2400    0.01258975  0.8849653  0.008450154
##   0.050   2500    0.01259107  0.8849291  0.008452318
##   0.050   2600    0.01259270  0.8848952  0.008456038
##   0.050   2700    0.01259382  0.8848717  0.008459563
##   0.050   2800    0.01259667  0.8848214  0.008462621
##   0.050   2900    0.01259774  0.8847913  0.008466732
##   0.050   3000    0.01260184  0.8847280  0.008469645
##   0.050   3100    0.01260548  0.8846629  0.008473451
##   0.050   3200    0.01260726  0.8846183  0.008474622
##   0.050   3300    0.01260811  0.8845957  0.008474765
##   0.050   3400    0.01260974  0.8845740  0.008477239
##   0.050   3500    0.01261054  0.8845628  0.008477454
##   0.050   3600    0.01261124  0.8845509  0.008478141
##   0.050   3700    0.01261198  0.8845448  0.008478146
##   0.050   3800    0.01261238  0.8845325  0.008479475
##   0.050   3900    0.01261208  0.8845361  0.008478692
##   0.050   4000    0.01261348  0.8845101  0.008479756
##   0.050   4100    0.01261338  0.8845178  0.008479984
##   0.050   4200    0.01261265  0.8845233  0.008479842
##   0.050   4300    0.01261227  0.8845291  0.008480017
##   0.050   4400    0.01261171  0.8845406  0.008479701
##   0.050   4500    0.01261203  0.8845389  0.008479937
##   0.050   4600    0.01261196  0.8845379  0.008480264
##   0.050   4700    0.01261231  0.8845331  0.008480360
##   0.050   4800    0.01261209  0.8845355  0.008480287
##   0.050   4900    0.01261192  0.8845360  0.008480297
##   0.050   5000    0.01261176  0.8845398  0.008480388
##   0.050   5100    0.01261225  0.8845330  0.008480862
##   0.050   5200    0.01261178  0.8845445  0.008480608
##   0.050   5300    0.01261215  0.8845392  0.008480719
##   0.050   5400    0.01261194  0.8845397  0.008480582
##   0.050   5500    0.01261222  0.8845354  0.008480904
##   0.050   5600    0.01261232  0.8845323  0.008480726
##   0.050   5700    0.01261179  0.8845419  0.008480155
##   0.050   5800    0.01261186  0.8845438  0.008479775
##   0.050   5900    0.01261153  0.8845452  0.008479749
##   0.050   6000    0.01261135  0.8845522  0.008479488
##   0.050   6100    0.01261128  0.8845503  0.008479479
##   0.050   6200    0.01261114  0.8845504  0.008479193
##   0.050   6300    0.01261182  0.8845405  0.008479697
##   0.050   6400    0.01261193  0.8845393  0.008480066
##   0.050   6500    0.01261205  0.8845347  0.008480178
##   0.050   6600    0.01261205  0.8845342  0.008480209
##   0.050   6700    0.01261240  0.8845291  0.008480386
##   0.050   6800    0.01261216  0.8845329  0.008480111
##   0.050   6900    0.01261237  0.8845312  0.008480265
##   0.050   7000    0.01261202  0.8845327  0.008479976
##   0.050   7100    0.01261231  0.8845318  0.008480244
##   0.050   7200    0.01261259  0.8845280  0.008480324
##   0.050   7300    0.01261262  0.8845265  0.008480551
##   0.050   7400    0.01261248  0.8845295  0.008480330
##   0.050   7500    0.01261254  0.8845278  0.008480528
##   0.050   7600    0.01261233  0.8845298  0.008480318
##   0.050   7700    0.01261262  0.8845276  0.008480475
##   0.050   7800    0.01261236  0.8845295  0.008480386
##   0.050   7900    0.01261238  0.8845327  0.008480133
##   0.050   8000    0.01261204  0.8845386  0.008479905
##   0.050   8100    0.01261247  0.8845328  0.008480063
##   0.050   8200    0.01261253  0.8845285  0.008480172
##   0.050   8300    0.01261303  0.8845233  0.008480255
##   0.050   8400    0.01261309  0.8845190  0.008480134
##   0.050   8500    0.01261330  0.8845164  0.008480371
##   0.050   8600    0.01261333  0.8845140  0.008480446
##   0.050   8700    0.01261322  0.8845163  0.008480404
##   0.050   8800    0.01261313  0.8845146  0.008480440
##   0.050   8900    0.01261314  0.8845165  0.008480383
##   0.050   9000    0.01261318  0.8845155  0.008480473
##   0.050   9100    0.01261326  0.8845131  0.008480429
##   0.050   9200    0.01261315  0.8845131  0.008480633
##   0.050   9300    0.01261305  0.8845155  0.008480416
##   0.050   9400    0.01261325  0.8845139  0.008480460
##   0.050   9500    0.01261321  0.8845149  0.008480414
##   0.050   9600    0.01261308  0.8845184  0.008480317
##   0.050   9700    0.01261303  0.8845186  0.008480364
##   0.050   9800    0.01261327  0.8845158  0.008480532
##   0.050   9900    0.01261338  0.8845137  0.008480634
##   0.050  10000    0.01261342  0.8845139  0.008480795
## 
## Tuning parameter 'max_depth' was held constant at a value of 3
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.5
## Tuning parameter 'min_child_weight' was held constant at a value of 3
## Tuning parameter 'subsample' was held constant at a value of 0.5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 1700, max_depth = 3, eta
##  = 0.015, gamma = 0, colsample_bytree = 0.5, min_child_weight = 3 and
##  subsample = 0.5.

Assessing Model Performance

predictions$xgb_pred <- predict(xgb_model5, newdata = baked_train1)

predictions <- predictions %>%
  mutate(xgb_pred = exp(1)^yeo.johnson(xgb_pred, lambda = lambda, inverse = TRUE))

fitted_values_plot(predictions, x = xgb_pred)

predictions <- predictions %>%
  mutate(xgb_residuals = SalePrice - xgb_pred)

residual_analysis(df = predictions, fitted = xgb_pred, resid = xgb_residuals)

Extreme Gradient Boosting seemed to further normalize the residuals, mitigating the Random Forest's problem with extrapolation.
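
One quick way to check this numerically is to compare residual skewness and kurtosis across the two tree-based models with the moments package. This is a minimal sketch; it assumes the rf_residuals column was created in the Random Forest section in the same way as xgb_residuals above.

predictions %>%
  summarise(
    rf_skewness  = skewness(rf_residuals),   # rf_residuals assumed from the earlier section
    rf_kurtosis  = kurtosis(rf_residuals),
    xgb_skewness = skewness(xgb_residuals),
    xgb_kurtosis = kurtosis(xgb_residuals)
  )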

The Elastic Net Model

The second model we will run to complement the Random Forest is an Elastic Net. The Elastic Net is a form of regularized regression that combines the Ridge penalty and the Lasso penalty. These constraints help the model deal with correlated features and also provide automated feature selection.

We will let caret perform a grid search over the tunable parameters alpha (which controls the mix of penalties: alpha = 0 gives a pure Ridge model, while alpha = 1 gives a pure Lasso model) and lambda (the scale of the penalty applied in the objective function).
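
For reference, the objective that glmnet minimizes for a Gaussian response is, roughly,

$$\min_{\beta_0,\,\beta}\;\frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 \;+\; \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right],$$

so alpha interpolates between the Ridge (squared) and Lasso (absolute-value) penalties, while lambda scales the overall amount of shrinkage.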

set.seed(123)

en_grid <- expand.grid(
  alpha = c(0.7,0.75,0.8,0.85,0.9,0.95,1),
  lambda = seq(0,0.001, by = 0.00005)
)

elastic_net <- train(
  SalePrice ~ .,
  data = baked_train1,
  method = "glmnet",
  trControl = train_control,
  tuneGrid = en_grid
)

ggplot(elastic_net) +
  geom_vline(aes(xintercept = elastic_net$bestTune$lambda), linetype = "dashed", 
             size = base_size/15) +
  my_theme(type = "discrete") +
  labs(title = "Model Tuning - Elastic Net")

The model tuning resulted in an RMSE of 0.0126309. The alpha parameter equals 1, and the selected lambda is given by elastic_net$bestTune$lambda.

Alpha = 1 means that only the Lasso penalty was applied, with no Ridge component. In other words, model tuning effectively turned our Elastic Net into a LASSO model.
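
Because the Lasso penalty shrinks some coefficients exactly to zero, we can check which predictors survived the automated feature selection. This is a minimal sketch that pulls the coefficients of the underlying glmnet fit at the lambda chosen by caret.

# Coefficients of the underlying glmnet fit at the selected lambda
selected_coefs <- as.matrix(coef(elastic_net$finalModel, s = elastic_net$bestTune$lambda))

# Predictors with non-zero coefficients (dropping the intercept)
setdiff(rownames(selected_coefs)[selected_coefs[, 1] != 0], "(Intercept)")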

Assessing Model Performance

predictions$en_pred <- predict(elastic_net, newdata = baked_train1)

predictions <- predictions %>%
  mutate(en_pred = exp(1)^yeo.johnson(en_pred, lambda = lambda, inverse = TRUE))

fitted_values_plot(predictions, x = en_pred)

predictions <- predictions %>%
  mutate(en_residuals = SalePrice - en_pred)

residual_analysis(df = predictions, fitted = en_pred, resid = en_residuals)

By plotting the fitted values and the residuals, we can observe that the model made some gross overestimations, which resulted in a left-skewed distribution of residuals.

Combining the Elastic Net with the previous tree-based models should improve the generalization power of our final model, since linear models are low-variance, high-bias learners that complement the low-bias, high-variance behavior of the trees.

Stacking the Models

Now, we collect the models in a list and take a look at how correlated their cross-validated results are.

model_list <- list(rf_model, elastic_net, xgb_model5)

model_cor <- modelCor(resamples(model_list))

melt(model_cor) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) +
    geom_tile() +
    geom_text(aes(x = Var1, y = Var2, label = round(value, 2)),
              size = base_size/10, color = "white") +
    my_theme(type = "continuous") +
    labs(title = "Model Correlation",
         y = "Model", 
         x = "Model", 
         fill = "Correlation")

We see that the XGBoost model (xgbTree) and the Elastic Net (glmnet) were reasonably highly correlated with each other, while the Random Forest was less correlated with either. Since the overall correlation is not that high, our final model should benefit from model stacking.
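
For a complementary view, we can also correlate the models' fitted values on the training data. This is a quick sketch; it assumes the rf_pred column from the Random Forest section alongside the en_pred and xgb_pred columns computed above.

predictions %>%
  select(rf_pred, xgb_pred, en_pred) %>%   # rf_pred assumed from the earlier section
  cor() %>%
  round(2)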

The meta-model will be trained with the matrix of fitted values from each individual Base Learner. For this model, we use a Multivariate Adaptive Regression Spline.

A MARS model uses hinge (piecewise-linear) basis functions to capture nonlinearity and interactions between features.
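
To make the hinge idea concrete, here is a purely illustrative sketch of the basis function max(0, x - c) with a hypothetical knot at c = 12; none of these values come from our data.

# Illustrative hinge basis function, not part of the modelling pipeline
hinge <- function(x, knot) pmax(0, x - knot)

data.frame(x = seq(10, 14, by = 0.05)) %>%
  mutate(h = hinge(x, knot = 12)) %>%
  ggplot(aes(x = x, y = h)) +
  geom_line(color = color1, size = base_size/7.5) +
  my_theme(type = "discrete") +
  labs(title = "A MARS Hinge Function", y = "max(0, x - 12)")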

set.seed(123)

stacked_model <- caretStack(
  model_list,
  method="earth",
  metric = "RMSE",
  tuneLength = 10,
  trControl = trainControl(method = "cv", number = 10, savePredictions = TRUE)
)

stacked_model
## A earth ensemble of 3 base models: ranger1, glmnet2, xgbTree3
## 
## Ensemble results:
## Multivariate Adaptive Regression Spline 
## 
## 2698 samples
##    3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2427, 2429, 2429, 2429, 2428, 2427, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE        Rsquared   MAE        
##   2       0.01248116  0.8864399  0.008437066
##   3       0.01201387  0.8946213  0.008057307
##   4       0.01211000  0.8929142  0.008090165
##   5       0.01202917  0.8945270  0.008032821
##   6       0.01198728  0.8954661  0.008021465
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 6 and degree = 1.
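
To see which hinge terms the meta-model actually selected, we can inspect the underlying earth fit. This is a sketch; it assumes that, as in current caretEnsemble versions, caretStack stores the meta-learner as a caret train object in ens_model, whose finalModel is the earth object.

# Print the selected basis functions and their coefficients
summary(stacked_model$ens_model$finalModel)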

Assessing Model Performance

predictions$stack_pred <- predict(stacked_model, newdata = baked_train1)

predictions <- predictions %>%
  mutate(stack_pred = exp(1)^yeo.johnson(stack_pred, lambda = lambda, inverse = TRUE))

fitted_values_plot(predictions, x = stack_pred)

predictions <- predictions %>%
  mutate(stack_residuals = SalePrice - stack_pred)

residual_analysis(df = predictions, fitted = stack_pred, resid = stack_residuals)

While stacking the models resulted in a slightly more platykurtic distribution of residuals, the LOESS smoother shows that the average error narrowed for the higher fitted values.

Next, we look at Partial Dependence, which helps us interpret the marginal effect that an increase in a given feature has on the final prediction. In this case, the features of the meta-model are the predictions of the individual Base Learners.

partial1 <- partial(stacked_model$ens_model, pred.var = "ranger1", plot = TRUE,
                    center = TRUE, grid.resolution = 20,
              plot.engine = "ggplot2")

partial2 <- partial(stacked_model$ens_model, pred.var = "glmnet2", plot = TRUE,
                    center = TRUE, grid.resolution = 20,
              plot.engine = "ggplot2")

partial3 <- partial(stacked_model$ens_model, pred.var = "xgbTree3", plot = TRUE,
                    center = TRUE, grid.resolution = 20,
              plot.engine = "ggplot2")


grid.arrange(
  partial1 + geom_line(color = color3, size = base_size/7.5) + my_theme(type = "discrete") 
  + ggtitle("Partial Dependence - Random Forest Model"),
  partial2 + geom_line(color = color3, size = base_size/7.5) + my_theme(type = "discrete") 
  + ggtitle("Partial Dependence - Elastic Net Model"),
  partial3 + geom_line(color = color3, size = base_size/7.5) + my_theme(type = "discrete") 
  + ggtitle("Partial Dependence - XGBoost Model"),
  ncol = 3, nrow = 1
)

While the Random Forest prediction has a roughly constant marginal effect on the meta-model's output, the XGBoost prediction has an increasing effect once its value passes a certain point.

Model Comparison and Interpretation

predictions <- predictions %>%
  mutate(stack_pred = as.numeric(stack_pred),
         stack_residuals = as.numeric(stack_residuals))

get_statistic <- function(df, stat){
  as.numeric(df$results %>%
    filter(RMSE == min(RMSE)) %>%
    select(all_of(stat)))
}

list_models <- list(rf_model, elastic_net, xgb_model5, stacked_model$ens_model)

r_squared <- melt(lapply(list_models, get_statistic, stat = "Rsquared"))
rmse <- melt(lapply(list_models, get_statistic, stat = "RMSE"))
mae <- melt(lapply(list_models, get_statistic, stat = "MAE"))

models <- c("Random Forest", "Elastic Net", "XGBoost", "Stacked Model")
stats <- cbind(models, rmse[,1], r_squared[,1], mae[,1])
colnames(stats) <- c("Models", "RMSE", "R_squared", "MAE")
stats <- as.data.frame(stats)

stats <- stats %>%
  mutate(RMSE = as.numeric(RMSE), R_squared = as.numeric(R_squared), MAE = as.numeric(MAE))

pred_stacked <- predictions %>%
  select(rf_residuals, en_residuals, xgb_residuals, stack_residuals) %>%
  stack() %>%
  mutate(ind = factor(ind)) 

model_info <- function(model){
  paste(model, "\n", "RMSE:", round(stats[models == model, "RMSE"], 4),
        "\n", "R-squared:", round(stats[models == model, "R_squared"], 4))
}

model_information <- sapply(models, model_info)
levels(pred_stacked$ind) <- model_information

pred_stacked %>%
    ggplot(aes(y = values, x = ind, fill = ind)) +
    geom_boxplot() +
    scale_y_log10(labels = comma) +
    my_theme(type = "discrete") +
    theme(legend.position = "none") +
    labs(title = "Residual Boxplots", x = "")

As we can see from the plot, stacking the models has lowered cross-validated RMSE while simultaneously improving cross-validated R-squared.

When applying the models, we can see that each one produced its own batch of overestimated outliers. While stacking helped lower the average overestimation, it resulted in a slightly higher median residual than the tree-based Base Learners.
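
We can verify the point about the medians directly from the residual columns. A minimal sketch (rf_residuals is assumed from the Random Forest section):

predictions %>%
  summarise(
    rf_median    = median(rf_residuals),    # assumed from the earlier section
    en_median    = median(en_residuals),
    xgb_median   = median(xgb_residuals),
    stack_median = median(stack_residuals)
  )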

To better visualize how each feature has impacted each model, we calculate Feature Importance and plot the 10 most important features for each model.

rf_impurity <- ranger(
  formula = SalePrice ~ .,
  data = baked_train1,
  num.trees = rf_model$finalModel$num.trees,
  mtry = rf_model$finalModel$mtry,
  min.node.size = rf_model$finalModel$min.node.size,
  replace = FALSE,
  importance = "impurity",
  respect.unordered.factors = "order",
  verbose = FALSE,
  seed = 123
)

vip_rf <- vip(rf_impurity, num_features = 10, bar = FALSE)
vip_xgb <- vip(xgb_model5, num_features = 10, bar = FALSE)
vip_en <- vip(elastic_net, num_features = 10, bar = FALSE)

importance <- function(df){
  df %>%
    ggplot(aes(x = Importance, y = reorder(Variable, Importance), fill = Importance)) +
      geom_col() +
      my_theme(type = "continuous") +
      theme(legend.position = "none") +
      labs(
      title = "Component Importance",
      y = "Components"
      )
}

grid.arrange(
  importance(vip_rf$data) + ggtitle("Random Forest"), 
  importance(vip_xgb$data) + ggtitle("XGBoost"),
  importance(vip_en$data) + ggtitle("Elastic Net"),
  nrow = 1, ncol = 3
)

OverallQual and GrLivArea were the two most important features in every model. Notice, however, how differently importance was distributed in the Random Forest model. From the third feature onwards, the models produced varied results.
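
To quantify the overlap, a simple sketch is to check which features appear in the top 10 of all three models, using the data stored in each vip object.

top10 <- list(
  rf  = vip_rf$data$Variable,
  xgb = vip_xgb$data$Variable,
  en  = vip_en$data$Variable
)

# Features common to the top 10 of every model
Reduce(intersect, top10)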

Conclusion

By stacking the models, we created a better-fitted model with a smaller Mean Squared Error. The MARS meta-model captured the nonlinear relationships between the Base Learners' predictions and the response variable.

Upon applying the model to the testing data set and submitting the predictions to Kaggle, our model was graded with a Root Mean Squared Logarithmic Error of 0.12251. At the moment this report is written, that score puts our submission in the top 15% of a challenge with over 10,000 entries.
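
For completeness, here is a sketch of how the submission file itself could be produced. The object baked_test is a hypothetical name for the testing set passed through the same recipe as baked_train1, and lambda is the Yeo-Johnson parameter used for the response in the earlier preprocessing steps.

# baked_test is assumed: the test set processed with the same recipe as baked_train1
submission <- data.frame(
  Id = test$Id,
  SalePrice = exp(1)^yeo.johnson(as.numeric(predict(stacked_model, newdata = baked_test)),
                                 lambda = lambda, inverse = TRUE)
)

write.csv(submission, "submission.csv", row.names = FALSE)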