Introduction

This report presents the analysis and modeling done on the House Pricing dataset provided on Kaggle for advanced regression technique.

The goal of this report is to show how the data was transformed into numeric for modeling, while keeping all the relevant features for effective prediction. Since gradient boosting cannot extrapolate, ridge, lasso and elastic-net regularization will be used in ensemble. This will also account for a lot of the multicollinearity in the data. Prediction from all four models will be averaged to get final prediction.

Data Wrangling

Missing Values - MICE Imputations

The training and testing data is loaded and combined together for data wrangling. The training data consists of 1460 rows and 81 columns while the testing has 1459 rows and 80 columns (excluding the SalePrice column), which in this dataset is the response variable. To perform analysis at a more productive rate the 2 dataframes are first combined for engineering the features, then split once we are ready to train a model.

The Id feature is removed because it does not contribute to the the modeling and SalePrice being the response variable is not included either. The ultimate goal is to tranform all our variables to numeric.

library(ggplot2)
library(dplyr)
library(Amelia)
library(varhandle)
library(stringr)
library(caret)
library(moments)
library(mice)
library(corrplot)
library(randomForest)
library(gridExtra)
library(xgboost)
library(glmnet)

train <- read.csv("train.csv")
test <- read.csv("test.csv")

train_row <- nrow(train)
test_row <- nrow(test)

comb_data <- rbind(within(train, rm('Id','SalePrice')), within(test, rm('Id')))

Firstly, missing values are dealt with using MICE imputation with predictive mean matching (PMM) method. PMM imputation provides one of the best and quick ways of filling in the missing data. it is a well-known and widely used method for generating hot-deck imputations, which imputes missing values by means of the nearest-neighbor donor with distance based on the expected values of the missing variables conditional on the observed covariates.

#Calculating the number of missing values for each variable
NA_sum <- sort(sapply(comb_data, function(x) sum(is.na(x))), decreasing = TRUE)
print(NA_sum) 
##        PoolQC   MiscFeature         Alley         Fence   FireplaceQu 
##          2909          2814          2721          2348          1420 
##   LotFrontage   GarageYrBlt  GarageFinish    GarageQual    GarageCond 
##           486           159           159           159           159 
##    GarageType      BsmtCond  BsmtExposure      BsmtQual  BsmtFinType2 
##           157            82            82            81            80 
##  BsmtFinType1    MasVnrType    MasVnrArea      MSZoning     Utilities 
##            79            24            23             4             2 
##  BsmtFullBath  BsmtHalfBath    Functional   Exterior1st   Exterior2nd 
##             2             2             2             1             1 
##    BsmtFinSF1    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF    Electrical 
##             1             1             1             1             1 
##   KitchenQual    GarageCars    GarageArea      SaleType    MSSubClass 
##             1             1             1             1             0 
##       LotArea        Street      LotShape   LandContour     LotConfig 
##             0             0             0             0             0 
##     LandSlope  Neighborhood    Condition1    Condition2      BldgType 
##             0             0             0             0             0 
##    HouseStyle   OverallQual   OverallCond     YearBuilt  YearRemodAdd 
##             0             0             0             0             0 
##     RoofStyle      RoofMatl     ExterQual     ExterCond    Foundation 
##             0             0             0             0             0 
##       Heating     HeatingQC    CentralAir     X1stFlrSF     X2ndFlrSF 
##             0             0             0             0             0 
##  LowQualFinSF     GrLivArea      FullBath      HalfBath  BedroomAbvGr 
##             0             0             0             0             0 
##  KitchenAbvGr  TotRmsAbvGrd    Fireplaces    PavedDrive    WoodDeckSF 
##             0             0             0             0             0 
##   OpenPorchSF EnclosedPorch    X3SsnPorch   ScreenPorch      PoolArea 
##             0             0             0             0             0 
##       MiscVal        MoSold        YrSold SaleCondition 
##             0             0             0             0
missmap(comb_data, col = c("black", "White"), 
        main = "Missing Data", rank.order = T,
        y.labels = F, y.at = F, legend = F, tsvar = T)

#Running MICE imputation with PMM on the columns that had null values. This is reduce computation time.
 
mice_mod <- mice(comb_data[, colnames(comb_data) %in% c('LotFrontage',
                                                          'Alley','BsmtQual',' BsmtCond',
                                                          'BsmtExposure','BsmtFinType1',
                                                          'BsmtFinType2', 'Electrical',
                                                          'FireplaceQu','GarageType', 
                                                          'GarageYrBlt','GarageFinish', 
                                                          'PoolQC','Fence', 'MiscFeature', 
                                                          'PoolQC','Fence', 'MasVnrType',
                                                          'BsmtCond', 'GarageQual',
                                                          'GarageCond', "MasVnrArea", "MSZoning",
                                                          "Utilities", "BsmtFullBath",
                                                          "MSZoning", 'BsmtHalfBath', 'Functional',
                                                          'Exterior1st', 'Exterior2nd','BsmtFinSF1',
                                                          'BsmtFinSF2', 'BsmtUnfSf', 'TotalBsmtSF',
                                                          'Electrical', 'KitchenQual', 'GarageCars',
                                                          'GarageArea', 'SaleType',
                                                        'BsmtUnfSF')], method = "pmm")
## 
##  iter imp variable
##   1   1  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   1   2  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   1   3  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   1   4  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   1   5  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   2   1  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   2   2  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   2   3  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   2   4  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   2   5  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   3   1  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   3   2  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   3   3  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   3   4  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   3   5  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   4   1  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   4   2  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   4   3  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   4   4  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   4   5  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   5   1  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   5   2  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   5   3  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   5   4  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
##   5   5  MSZoning  LotFrontage  Alley  Utilities  Exterior1st  Exterior2nd  MasVnrType  MasVnrArea  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinSF1  BsmtFinType2  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  Electrical  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  FireplaceQu  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond  PoolQC  Fence  MiscFeature  SaleType
complete_pmm <- complete(mice_mod)

#Recombining the imputed values with the rest of the data
sim_col <- match(colnames(complete_pmm), colnames(comb_data))
comb_data <- comb_data[,-sim_col]


comb_data <- cbind(comb_data, complete_pmm)
missmap(comb_data, col = c("black", "White"), 
        main = "Missing Data", rank.order = T,
        y.labels = F, y.at = F, legend = F, tsvar = T)

Feature Engineering

The first step in feature engineering is to separate the numeric variables in a separate dataset. Once separated, the categoric variables are hotencoded into the dataset.

numeric <- sapply(comb_data, is.numeric)

print(numeric)
##    MSSubClass       LotArea        Street      LotShape   LandContour 
##          TRUE          TRUE         FALSE         FALSE         FALSE 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##         FALSE         FALSE          TRUE          TRUE          TRUE 
##  YearRemodAdd     RoofStyle      RoofMatl     ExterQual     ExterCond 
##          TRUE         FALSE         FALSE         FALSE         FALSE 
##    Foundation       Heating     HeatingQC    CentralAir     X1stFlrSF 
##         FALSE         FALSE         FALSE         FALSE          TRUE 
##     X2ndFlrSF  LowQualFinSF     GrLivArea      FullBath      HalfBath 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##  BedroomAbvGr  KitchenAbvGr  TotRmsAbvGrd    Fireplaces    PavedDrive 
##          TRUE          TRUE          TRUE          TRUE         FALSE 
##    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch   ScreenPorch 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##      PoolArea       MiscVal        MoSold        YrSold SaleCondition 
##          TRUE          TRUE          TRUE          TRUE         FALSE 
##      MSZoning   LotFrontage         Alley     Utilities   Exterior1st 
##         FALSE          TRUE         FALSE         FALSE         FALSE 
##   Exterior2nd    MasVnrType    MasVnrArea      BsmtQual      BsmtCond 
##         FALSE         FALSE          TRUE         FALSE         FALSE 
##  BsmtExposure  BsmtFinType1    BsmtFinSF1  BsmtFinType2    BsmtFinSF2 
##         FALSE         FALSE          TRUE         FALSE          TRUE 
##     BsmtUnfSF   TotalBsmtSF    Electrical  BsmtFullBath  BsmtHalfBath 
##          TRUE          TRUE         FALSE          TRUE          TRUE 
##   KitchenQual    Functional   FireplaceQu    GarageType   GarageYrBlt 
##         FALSE         FALSE         FALSE         FALSE          TRUE 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##         FALSE          TRUE          TRUE         FALSE         FALSE 
##        PoolQC         Fence   MiscFeature      SaleType 
##         FALSE         FALSE         FALSE         FALSE
numeric_dat <- comb_data[,numeric==TRUE]

A function is created to plot the categoric variable and another to hotencode. Hence, train dataset is grouped back to get the relationship between the SalePrice and levels of the categoric variable. This would evetually help with numerically labeling the ordinal variables.

# function that groups a column by its features and returns the mdedian saleprice for each unique feature. 

group_df <- comb_data[1:1460,]
group_df$SalePrice <- train$SalePrice

group.prices <- function(col) { 
        group.table <- group_df[,c(col, 'SalePrice', 'OverallQual')] %>%
                group_by_(col) %>% 
                summarise(mean.Quality = round(mean(OverallQual),2),
                          mean.Price = mean(SalePrice), n = n()) %>%
                arrange(mean.Quality)
    
  print(qplot(x=reorder(group.table[[col]], -group.table[['mean.Price']]),
              y=group.table[['mean.Price']]) +
                geom_bar(stat='identity', fill='#feb24c') +
                theme_minimal() +
                scale_y_continuous() + 
                labs(x=col, y='Mean SalePrice') + 
                theme(axis.text.x = element_text(angle = 45)))
  
  return(data.frame(group.table))
}


hotencode <- function(column, dum.vars, dataframe) {
        for(col in column){
                dataframe[col] <- as.numeric(dum.vars[comb_data[,col]])
        }
        return(dataframe)
}
#The following code plots the categorical variables agaisnt mean SalePrice
#Then the variables are hot encoded ordinally

group.prices('FireplaceQu')

##   FireplaceQu mean.Quality mean.Price   n
## 1          Po         5.13   129116.3  38
## 2          Fa         5.53   150531.4  72
## 3          TA         5.99   175922.2 590
## 4          Gd         6.24   186774.6 720
## 5          Ex         7.17   253211.1  40
group.prices('BsmtQual')

##   BsmtQual mean.Quality mean.Price   n
## 1       Fa         4.86   114006.1  36
## 2       TA         5.28   139129.3 683
## 3       Gd         6.65   202328.2 620
## 4       Ex         8.26   327041.0 121
group.prices('KitchenQual')

##   KitchenQual mean.Quality mean.Price   n
## 1          Fa         4.49   105565.2  39
## 2          TA         5.34   139962.5 735
## 3          Gd         6.79   212116.0 586
## 4          Ex         8.27   328554.7 100
group.prices('GarageCond')

##   GarageCond mean.Quality mean.Price    n
## 1         Fa         4.56  110256.47   45
## 2         Po         5.11   99377.78    9
## 3         Ex         5.50  124000.00    2
## 4         Gd         5.80  170437.00   10
## 5         TA         6.16  183885.68 1394
group.prices('HeatingQC')

##   HeatingQC mean.Quality mean.Price   n
## 1        Fa         5.00   123919.5  49
## 2        Po         5.00    87000.0   1
## 3        TA         5.37   142362.9 428
## 4        Gd         5.70   156858.9 241
## 5        Ex         6.72   214914.4 741
qual_column <- c('ExterQual', 'ExterCond', 'GarageQual', 
               'GarageCond', 'FireplaceQu', 'KitchenQual', 
               'HeatingQC', 'BsmtQual')

qual_list <- c('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
numeric_dat <- hotencode(qual_column, qual_list, numeric_dat)


group.prices('BsmtExposure')

##   BsmtExposure mean.Quality mean.Price   n
## 1           No         5.86   163547.3 989
## 2           Mn         6.25   192789.7 114
## 3           Av         6.56   206253.1 222
## 4           Gd         6.95   256521.7 135
bsmt_list <- c('None' = 0, 'No' = 1, 'Mn' = 2, 'Av' = 3, 'Gd' = 4)
numeric_dat <-hotencode(c("BsmtExposure"), bsmt_list, numeric_dat)

group.prices('BsmtFinType1')

##   BsmtFinType1 mean.Quality mean.Price   n
## 1          Rec         5.31   144306.6 141
## 2          BLQ         5.34   148800.4 150
## 3          LwQ         5.51   150717.1  76
## 4          ALQ         5.54   161224.6 223
## 5          Unf         6.16   169191.9 438
## 6          GLQ         6.95   231398.2 432
bsmtFin_list <- c('None' = 0, 'Unf' = 1, 'LwQ' = 2,
                   'Rec'= 3, 'BLQ' = 4, 'ALQ' = 5, 'GLQ' = 6)
numeric_dat <- hotencode(c('BsmtFinType1','BsmtFinType2'), bsmtFin_list, numeric_dat)

group.prices('Functional')

##   Functional mean.Quality mean.Price    n
## 1       Min2         4.97   144240.6   34
## 2       Maj2         5.00    85800.0    5
## 3       Min1         5.26   146385.5   31
## 4        Mod         5.40   168393.3   15
## 5       Maj1         5.50   153948.1   14
## 6        Sev         6.00   129000.0    1
## 7        Typ         6.16   183429.1 1360
functional_list <- c('None' = 0, 'Sal' = 1, 'Sev' = 2, 'Maj2' = 3, 'Maj1' = 4,
                     'Mod' = 5, 'Min2' = 6, 'Min1' = 7, 'Typ'= 8)
numeric_dat['Functional'] <- as.numeric(functional_list[comb_data$Functional])

group.prices('Fence')

##   Fence mean.Quality mean.Price   n
## 1 MnPrv         5.91   171254.0 748
## 2  MnWw         5.91   155397.8  23
## 3  GdWo         6.00   173405.9 268
## 4 GdPrv         6.52   204275.6 421
fence_list <- c('None' = 0, 'MnWw' = 1, 'GdWo' = 1, 'MnPrv' = 2, 'GdPrv' = 4)
numeric_dat['Fence'] <- as.numeric(fence_list[comb_data$Fence])

numeric_dat['NewerDwelling'] <- as.numeric(ifelse(comb_data$NewerDwelling == '20', 1, 0))


group.prices('GarageFinish')

##   GarageFinish mean.Quality mean.Price   n
## 1          Unf         5.33   138602.3 663
## 2          RFn         6.50   198443.4 439
## 3          Fin         7.04   237807.1 358
GarageFin_list <- c('None' = 0,'Unf' = 1, 'RFn' = 1, 'Fin' = 2)
numeric_dat['GarageFinish'] <- as.numeric(GarageFin_list[comb_data$GarageFinish])

group.prices('HeatingQC')

##   HeatingQC mean.Quality mean.Price   n
## 1        Fa         5.00   123919.5  49
## 2        Po         5.00    87000.0   1
## 3        TA         5.37   142362.9 428
## 4        Gd         5.70   156858.9 241
## 5        Ex         6.72   214914.4 741
heating.list <- c('Po' = 0, 'Fa' = 1, 'TA' = 2, 'Gd' = 3, 'Ex' = 4)
numeric_dat['HeatingScale'] <- as.numeric(heating.list[comb_data$HeatingQC])

With this, categoric features with an ordianl scale have been transformed into a numeric variables. Corelation between the numeric varibale is then assesed. Variables with strong relationship with SalePrice are the focus for modeling so we will focus primarily on features that have a coefficient > .5 or < -.5.

#Numeric values are selected for corelation matrix
numeric_cor <- cbind(numeric_dat[1:1460,], train['SalePrice'])

corr_df <- cor(numeric_cor)
highcor <- as.matrix(sort(corr_df[ ,'SalePrice'], decreasing = TRUE))

corr.idx <- names(which(apply(highcor, 1, function(x) (x > 0.5 | x < -0.5))))

corrplot(as.matrix(corr_df[corr.idx,corr.idx]), method='square', 
         addCoef.col = 'black', tl.cex = .8,cl.cex = .8, number.cex=.6)

From the correlation plot we can see the 10 features with the strongest effect on SalePrice.

  1. OverallQual: Overall material and finish quality
  2. GrLivArea: Above grade (ground) living area square feet
  3. GarageCars: Size of garage in car capacity
  4. GarageArea: Size of garage in square feet
  5. TotalBsmtSF: Total square feet of basement area
  6. 1stFlrSF: First Floor square feet
  7. YearBuilt: Original construction date
  8. YearRemodAdd: Remodel date
  9. FullBath: Full bathrooms above grade
  10. TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

A strong correlation between GrLivArea and TotRmsAbvGrd is observed since the size of the living area will most likely be a constraint on the number of rooms above ground. An interesting relationship to look into is the relationship between houses with a small living area but many rooms, which will result in smaller rooms and vise versa. Besides, the newer homes will likely give higher listings compared to older models. We can print a matrix of scatter plots to see what these relationships look like under the hood to get a better sense of whats going on.

For the remaining categorical variables dummy variables are created.

# helper function for plotting categoric data for easier data visualization
plot.categoric <- function(cols, df){
  for (col in cols) {
    order.cols <- names(sort(table(comb_data[,col]), decreasing = TRUE))
  
    num.plot <- qplot(df[,col]) +
      geom_bar(fill = '#8856a7') +
      geom_text(aes(label = ..count..), stat='count', vjust=-0.5) +
      theme_minimal() +
      scale_y_continuous(limits = c(0,max(table(df[,col]))*1.1)) +
      scale_x_discrete(limits = order.cols) +
      xlab(col) +
      theme(axis.text.x = element_text(angle = 30, size=12))
  
    print(num.plot)
  }
}


#The category with the most count is hotencoded as 1 and remaning 0
plot.categoric('LotShape', comb_data)

numeric_dat['RegularLotShape'] <- (comb_data$LotShape == 'Reg') * 1

plot.categoric('LandContour', comb_data)

numeric_dat['LandLeveled'] <- (comb_data$LandContour == 'Lvl') * 1

plot.categoric('LandSlope', comb_data)

numeric_dat['LandSlopeGentle'] <- (comb_data$LandSlope == 'Gtl') * 1

plot.categoric('Electrical', comb_data)

numeric_dat['ElectricalSB'] <- (comb_data$Electrical == 'SBrkr') * 1

plot.categoric('GarageType', comb_data)

numeric_dat['GarageDetchd'] <- (comb_data$GarageType == 'Detchd') * 1

plot.categoric('SaleCondition', comb_data)

numeric_dat['PartialPlan'] <- (comb_data$SaleCondition == 'Partial') * 1

Dummy variable columns for PavedDrive are hotencoded. Also, for WoodDeckSF, X2ndFlrSF, MasVnrArea values over 0 are encoded as 1 under ‘HasX’ columns where X are the mentioned variables. For MiscFeature, if the variable has Shed the HasX column is encoded for 1.

numeric_dat['HasPavedDrive'] <- (comb_data$PavedDrive == 'Y') * 1

numeric_dat['HasWoodDeck'] <- (comb_data$WoodDeckSF > 0) * 1

numeric_dat['Has2ndFlr'] <- (comb_data$X2ndFlrSF > 0) * 1

numeric_dat['HasMasVnr'] <- (comb_data$MasVnrArea > 0) * 1

feature_column<- c('OpenPorchSF','EnclosedPorch', 'X3SsnPorch', 'ScreenPorch')

for (col in feature_column){
        numeric_dat[str_c('Has',col)] <- (comb_data[,col] !=0) * 1
        }


plot.categoric('MiscFeature', comb_data)

numeric_dat['HasShed'] <- (comb_data$MiscFeature == 'Shed') * 1

A new column is also created to differentiate the new houses from the remodeled houses. Also, rencently remodeled houses were differentianted in new dummy variable. New column for house sold in the same year it was built were given a new column, because of the ‘Hotness’ factor.

## New column when the house was not remodeled in the same year as built

numeric_dat['Remodeled'] <- (comb_data$YearBuilt != comb_data$YearRemodAdd) * 1

numeric_dat['RecentRemodel'] <- (comb_data$YearRemodAdd >= comb_data$YrSold) * 1

#New column for house sold in the same year it was built

numeric_dat['NewHouse'] <- (numeric_dat$YearBuilt == numeric_dat$YrSold) * 1

It is obvious that neigbourhood plays an important role in housing prices. Hence, neighbourhood were hotencoded on a scale of 0 to 4:

  • 0 = MeadowV
  • 1 = IDOTRR Sawyer, BrDale , OldTown, Edwards, BrkSide, Blueste
  • 2 = SWISU, NAmes, NPkVill, Mitchel,SawyerW, Gilbert, NWAmes, Blmngtn, CollgCr
  • 3 = ClearCr,Crawfor, Veenker, Somerst, Timber
  • 4 = StoneBr, NoRidge,NridgHt
group.prices('Neighborhood')

##    Neighborhood mean.Quality mean.Price   n
## 1       MeadowV         4.47   98576.47  17
## 2        IDOTRR         4.76  100123.78  37
## 3        Sawyer         5.03  136793.14  74
## 4       BrkSide         5.05  124834.05  58
## 5       Edwards         5.08  128219.70 100
## 6         NAmes         5.36  145847.08 225
## 7       OldTown         5.39  128225.30 113
## 8         SWISU         5.44  142591.36  25
## 9       Mitchel         5.59  156270.12  49
## 10       BrDale         5.69  104493.75  16
## 11      ClearCr         5.89  212565.43  28
## 12      Blueste         6.00  137500.00   2
## 13      NPkVill         6.00  142694.44   9
## 14      Crawfor         6.27  210624.73  51
## 15      SawyerW         6.32  186555.80  59
## 16       NWAmes         6.33  189050.07  73
## 17      Gilbert         6.56  192854.51  79
## 18      CollgCr         6.64  197965.77 150
## 19      Veenker         6.73  238772.73  11
## 20       Timber         7.16  242247.45  38
## 21      Blmngtn         7.18  194870.88  17
## 22      Somerst         7.34  225379.84  86
## 23      NoRidge         7.93  335295.32  41
## 24      StoneBr         8.16  310499.00  25
## 25      NridgHt         8.26  316270.62  77
nbrhood_list <- c('MeadowV' = 0, 'IDOTRR' = 1, 'Sawyer' = 1, 'BrDale' = 1, 'OldTown' = 1, 
                  'Edwards' = 1, 'BrkSide' = 1, 'Blueste' = 1, 'SWISU' = 2, 
                  'NAmes' = 2, 'NPkVill' = 2, 'Mitchel' = 2,'SawyerW' = 2, 'Gilbert' = 2,
                  'NWAmes' = 2, 'Blmngtn' = 2, 'CollgCr' = 2,'ClearCr' = 3,'Crawfor' = 3, 
                  'Veenker' = 3, 'Somerst' = 3, 'Timber' = 3, 'StoneBr' = 4, 
                  'NoRidge' = 4,'NridgHt' = 4)

numeric_dat['NeighbourhoodCluster'] <- as.numeric(nbrhood_list[comb_data$Neighborhood])

Finally, the remaining categoric variables are encoded using dummy variable function in the caret package.

dummy <- dummyVars(" ~ .",data=comb_data[,numeric==FALSE])
categoric_dat <- data.frame(predict(dummy,newdata=comb_data[,numeric==FALSE]))
df <- cbind(numeric_dat, categoric_dat)

Variables showing near zero variance

nzv_logic <- nearZeroVar(df, saveMetrics = TRUE)

nzv_var <- rownames(nzv_logic)[nzv_logic$nzv==TRUE]

new_df <- df[,!names(df) %in% nzv_var]
#Dimensions of the feature engineered dataset
dim(new_df)
## [1] 2919  156

Dealing With Outliers

The variable GrLivArea and GarageArea are two variables with high correlation with SalePrice. Hence, it would be better to deal with the outliers for these variables

train_dat <- new_df[1:1460, ]

test_dat <- new_df[1461:nrow(new_df),]

boxplot(train_dat$GrLivArea, main = "GrLivArea")

boxplot(train_dat$GarageArea, main = "GarageArea")

# Houses with garage area more 1200SF and grLivArea more than 4000SF are removed because they are gargantuan and their presence is adding skewness to the data.

condition1 <- which(train_dat$GarageArea < 1200 & train_dat$GrLivArea < 4000)

train_dat <- train_dat[1:nrow(train_dat) %in% condition1, ]

#Outliers are also removed from our response variable

y_true <- train$SalePrice[which(1:nrow(train) %in% condition1)]

Dealing with Skewness of Response Variable

SalePrice distribution in dataset has been skewed as seen from the distribution plot.

true_dist <- qplot(y_true, geom = "density") + geom_histogram(aes(y=..density..),fill = "Blue", alpha = .5, bins = 75)+
        geom_line(aes(y=..density..), color='green', lwd = 1, stat = 'density') +
        stat_function(fun = dnorm, colour = 'red', lwd = 1, args = 
                              list(mean(y_true), sd(y_true))) +
        labs(x = 'SalePrice') + ggtitle("Distribution of Sale Price") + 
        annotate('text', label = paste('skewness =', signif(skewness(y_true),3)),
                 x=6e+05,y=7.5e-06)

print(true_dist)

To deal with the skewness, log(SalePrice +1) is taken. This solves the skewness problem.

y_train <-log(y_true+1)


train_dist <- qplot(y_train, geom = "density") + 
        geom_histogram(aes(y=..density..), fill = "Blue", alpha = .5, bins = 75)+
        geom_line(aes(y=..density..), color='green', lwd = 1, stat = 'density') +
        stat_function(fun = dnorm, colour = 'red', lwd = 1, args = 
                              list(mean(y_train), sd(y_train))) +
        labs(x = 'log(SalePrice + 1)') + ggtitle("Distribution of Log(Sale Price + 1)") + 
        annotate('text', label = paste('skewness =', signif(skewness(y_train),3)),
                 x=13,y=1.2)



grid.arrange(true_dist, train_dist, ncol = 2 )

Modeling

With the feature engineering complete and the data ready, regularization, XGBoosting were used for predicting the housing prices.

#XGBoost has its own function to create matrix for the modeling

dtrain <- xgb.DMatrix(as.matrix(train_dat), label = y_train)
dtest <- xgb.DMatrix(as.matrix(test_dat))


cv.ctrl <- trainControl(method = "repeatedcv", repeats = 1,number = 4, 
                        allowParallel=T)

xgb.grid <- expand.grid(nrounds = 750,
                        eta = c(0.01,0.005,0.001),
                        max_depth = c(4,6,8),
                        colsample_bytree=c(0,1,10),
                        min_child_weight = 2,
                        subsample=c(0,0.2,0.4,0.6),
                        gamma=0.01)

set.seed(53)

xgb_tune <- train(as.matrix(train_dat),
                  y_train, method="xgbTree", 
                  trControl=cv.ctrl, 
                  tuneGrid=xgb.grid, 
                  verbose=T, metric="RMSE", nthread =3)

xgb_params <- list(
        booster = 'gbtree',
        objective = 'reg:linear',
        colsample_bytree=1,
        eta=0.01,
        max_depth=8,
        min_child_weight=3,
        alpha=0.3,
        lambda=0.4,
        gamma=0.01, # less overfit
        subsample=0.4,
        seed=5,
        silent=TRUE)

bst <- xgb.train(xgb_params,dtrain, nrounds = 10000, early_stopping_rounds = 300, watchlist = list(train=dtrain))

To test how well the model has done the RMSE value is calculated.

# Functtion to calculate the RMSE value
rmse_eval <- function(y.true, y.pred) {
  mse_eval <- sum((y.true - exp(y.pred)-1)^2) / length(y.true)
  return(sqrt(mse_eval))
}

y_pred.xgb <- predict(bst, dtrain)
rmse_eval(y_true, y_pred.xgb)
## [1] 8060.264

XGBoost has an extremely useful function called xgb.plot.importance, which plots the feature importance of the model.

Feature importance is computed by averaging the gain of each feature for all split and all trees in the model. We can take a look at the 10 most important features used in our model. Also, predict the SalePrice on our test data is calculated as well.

model.names <- dimnames(dtrain)[[2]]

importance_matrix <- xgb.importance(model.names, model = bst)

#Plotting 10 most important feature that contribute to the model
xgb.plot.importance(importance_matrix[1:10])

pred.xgb <- predict(bst, as.matrix(test_dat))

xgb.SalePrice <- as.data.frame(exp(pred.xgb)-1)

colnames(xgb.SalePrice) <- c("1")

One limitation to using GBM’s and XGBoost in particular is its inability to extrapolate and because of this linear model can better predict any sale prices outside the range of prices given in our training set. Hence, regularized linear models are used with penalty term lambda to minimize the error. Ridge, Lasso and Elastic Net regularization is ensembled.

#For regularization, a matrix is generated

x <- train_dat %>% data.matrix()

# a range of lambdas are generated over which regularization can be tested
lamdas <- 10^seq(3, -2, by = -.1)

#cv.glm net provides the optimum lambda for regularization

glm.cv.ridge <- cv.glmnet(x, y_train, alpha = 0, lambda = lamdas)
glm.cv.lasso <- cv.glmnet(x, y_train, alpha = 1, lambda = lamdas)
glm.cv.net <- cv.glmnet(x, y_train, alpha = 0.01, lambda = lamdas)

par(mfrow=c(1,3))

plot(glm.cv.ridge, main = "Ridge")
plot(glm.cv.lasso, main = "Lasso")
plot(glm.cv.net, main = "Elastic Net")

#The the minimum value is optimum lambda 
penalty.ridge <- glm.cv.ridge$lambda.min
penalty.lasso <- glm.cv.lasso$lambda.min
penalty.net <- glm.cv.net$lambda.min


pred.ridge <- predict(glm.cv.ridge, s = penalty.ridge, newx = data.matrix(test_dat))
ridge.SalePrice <- as.data.frame(exp(pred.ridge)-1)
pred.lasso <- predict(glm.cv.lasso, s = penalty.lasso, newx = data.matrix(test_dat))
lasso.SalePrice <- as.data.frame(exp(pred.lasso)-1)
pred.net <- predict(glm.cv.net, s = penalty.net, newx = data.matrix(test_dat))
net.SalePrice <- as.data.frame(exp(pred.net)-1)


#Random forest was also used to see how well the prediction worked. However it yielded bad results compared to gradient boosting and regularization ensembles

#model_rf <- randomForest(y_train ~ ., data = train_dat)
#pred.rf <- predict(model_rf, test_dat)
#rf.SalePrice <- as.data.frame(exp(pred.net)-1)

SalePrice <- (xgb.SalePrice + ridge.SalePrice + lasso.SalePrice + net.SalePrice)/4.0

PredictedPrice <- cbind(test['Id'], SalePrice$`1`)

colnames(PredictedPrice) <- c("Id", "SalePrice")

write.csv(PredictedPrice, file = "PredictedPrice.csv", row.names = F)

Conclusion

This report elaborated on the details of how the house prices were predicted. Missing data was first imputed, then new features were engineers that could be fed into the model. A combination of XGBoost (Gradient Boosting) and regularization techniques was used to create the model.