1 Introduction

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. This Kaggle competition’s dataset proves that there are many more house features that influence price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this research tries to predict the final price of each home.

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often-cited Boston Housing dataset.

2 Executive Summary

This research aims at predicting prices of residential homes in Ames, Iowa based on a large set of features. It starts by acquiring and defining input data, data cleaning, feature engineering, and finally by model training, selection, and prediction. The research predicts house prices by combining results of several Machine Learning algorithms using Models Ensembling technique.

Four algorithms have been used for model training and prediction; Ridge Regression, Lasso Regression, Gradient Boosting, and Linear Model with Forward Stepwise. The Root Mean Square Error (RMSE) is used to measure the prediction error for each. Model RMSE’s have been found to be 0.0986, 0.110, 0.120, and 0.147 for the four mentioned models, respectively. Weighted Average Ensembling has been used to predict the house prices in the Kaggle test file.

Several R packages are used in the research. Packages cover data visualization, data manipulation, and model training and prediction.

3 Research Data

Data of the research has been obtained from Kaggle and is available under this link.

The data is split into two data sets:

3.1 R Libraries

The research uses the libraries loaded below:

library(knitr)
library(dplyr)
library(gridExtra)
library(ggplot2)
library(ggthemes)
library(cowplot)
library(moments)
library(caret)
library(glmnet)

3.2 Importing Kaggle data sets

Kaggle train and test data sets are imported into R. Then, both sets are combined into one set (allSet) for Data Cleaning and Feature Engineering before being used in model training and prediction.

train <- read.csv("./input/train.csv", stringsAsFactors = F)
test <- read.csv("./input/test.csv", stringsAsFactors = F)
allSet <- bind_rows(train,test)

3.3 Data dimensions and structure

trainDim <- dim(train)
testDim <- dim(test)
trainR <- trainDim[1];testR <- testDim[1]
trainC <- trainDim[2];testC <- testDim[2]

By having a look at the dimensions of the train and test data, we see that they have 1460 and 1459 observations (houses) and 81 and 80 features, respectively.

We can also have a glimpse of the data structure. It is clear that we have features that are integers and others as characters. It’s worth noticing that the outcome variable (SalePrice) is integer.

glimpse(allSet)
## Observations: 2,919
## Variables: 81
## $ Id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ MSSubClass    <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ MSZoning      <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", ...
## $ LotFrontage   <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, ...
## $ LotArea       <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ Street        <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", ...
## $ Alley         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LotShape      <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg",...
## $ LandContour   <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl",...
## $ Utilities     <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub"...
## $ LotConfig     <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Ins...
## $ LandSlope     <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl",...
## $ Neighborhood  <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoR...
## $ Condition1    <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm",...
## $ Condition2    <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", ...
## $ BldgType      <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", ...
## $ HouseStyle    <chr> "2Story", "1Story", "2Story", "2Story", "2Story"...
## $ OverallQual   <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond   <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt     <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd  <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ RoofStyle     <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Ga...
## $ RoofMatl      <chr> "CompShg", "CompShg", "CompShg", "CompShg", "Com...
## $ Exterior1st   <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "Vin...
## $ Exterior2nd   <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "Vin...
## $ MasVnrType    <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace",...
## $ MasVnrArea    <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ ExterQual     <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", ...
## $ ExterCond     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", ...
## $ Foundation    <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "...
## $ BsmtQual      <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", ...
## $ BsmtCond      <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", ...
## $ BsmtExposure  <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", ...
## $ BsmtFinType1  <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ",...
## $ BsmtFinSF1    <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinType2  <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf",...
## $ BsmtFinSF2    <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF     <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF   <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ Heating       <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", ...
## $ HeatingQC     <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", ...
## $ CentralAir    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"...
## $ Electrical    <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SB...
## $ X1stFlrSF     <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ X2ndFlrSF     <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea     <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ BsmtFullBath  <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, ...
## $ BsmtHalfBath  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FullBath      <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, ...
## $ HalfBath      <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ BedroomAbvGr  <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, ...
## $ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, ...
## $ KitchenQual   <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", ...
## $ TotRmsAbvGrd  <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ Functional    <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ",...
## $ Fireplaces    <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, ...
## $ FireplaceQu   <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA"...
## $ GarageType    <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd"...
## $ GarageYrBlt   <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, ...
## $ GarageFinish  <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn",...
## $ GarageCars    <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea    <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ GarageQual    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", ...
## $ GarageCond    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", ...
## $ PavedDrive    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"...
## $ WoodDeckSF    <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF   <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PoolQC        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fence         <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA,...
## $ MiscFeature   <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, ...
## $ MiscVal       <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold        <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold        <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleType      <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", ...
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal...
## $ SalePrice     <int> 208500, 181500, 223500, 140000, 250000, 143000, ...

3.4 Data dictionary

Below is a description of each of the features existing in the data sets to give insight of what information we have about houses.

Feature Description
SalePrice the property’s sale price in dollars. This is the target variable that you’re trying to predict.
MSSubClass The building class
MSZoning The general zoning classification
LotFrontage Linear feet of street connected to property
LotArea Lot size in square feet
Street Type of road access
Alley Type of alley access
LotShape General shape of property
LandContour Flatness of the property
Utilities Type of utilities available
LotConfig Lot configuration
LandSlope Slope of property
Neighborhood Physical locations within Ames city limits
Condition1 Proximity to main road or railroad
Condition2 Proximity to main road or railroad (if a second is present)
BldgType Type of dwelling
HouseStyle Style of dwelling
OverallQual Overall material and finish quality
OverallCond Overall condition rating
YearBuilt Original construction date
YearRemodAdd Remodel date
RoofStyle Type of roof
RoofMatl Roof material
Exterior1st Exterior covering on house
Exterior2nd Exterior covering on house (if more than one material)
MasVnrType Masonry veneer type
MasVnrArea Masonry veneer area in square feet
ExterQual Exterior material quality
ExterCond Present condition of the material on the exterior
Foundation Type of foundation
BsmtQual Height of the basement
BsmtCond General condition of the basement
BsmtExposure Walkout or garden level basement walls
BsmtFinType1 Quality of basement finished area
BsmtFinSF1 Type 1 finished square feet
BsmtFinType2 Quality of second finished area (if present)
BsmtFinSF2 Type 2 finished square feet
BsmtUnfSF Unfinished square feet of basement area
TotalBsmtSF Total square feet of basement area
Heating Type of heating
HeatingQC Heating quality and condition
CentralAir Central air conditioning
Electrical Electrical system
1stFlrSF First Floor square feet
2ndFlrSF Second floor square feet
LowQualFinSF Low quality finished square feet (all floors)
GrLivArea Above grade (ground) living area square feet
BsmtFullBath Basement full bathrooms
BsmtHalfBath Basement half bathrooms
FullBath Full bathrooms above grade
HalfBath Half baths above grade
Bedroom Number of bedrooms above basement level
Kitchen Number of kitchens
KitchenQual Kitchen quality
TotRmsAbvGrd Total rooms above grade (does not include bathrooms)
Functional Home functionality rating
Fireplaces Number of fireplaces
FireplaceQu Fireplace quality
GarageType Garage location
GarageYrBlt Year garage was built
GarageFinish Interior finish of the garage
GarageCars Size of garage in car capacity
GarageArea Size of garage in square feet
GarageQual Garage quality
GarageCond Garage condition
PavedDrive Paved driveway
WoodDeckSF Wood deck area in square feet
OpenPorchSF Open porch area in square feet
EnclosedPorch Enclosed porch area in square feet
3SsnPorch Three season porch area in square feet
ScreenPorch Screen porch area in square feet
PoolArea Pool area in square feet
PoolQC Pool quality
Fence Fence quality
MiscFeature Miscellaneous feature not covered in other categories
MiscVal $Value of miscellaneous feature
MoSold Month Sold
YrSold Year Sold
SaleType Type of sale
SaleCondition Condition of sale

4 Data Cleaning

By exploring the data set, as shown in the below subset, we can realize that the data needs to be processed and cleaned. Data cleaning involves handling missing values, fixing features classes, or coding character features.

kable(head(allSet))
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1 60 RL 65 8450 Pave NA Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NA Attchd 2003 RFn 2 548 TA TA Y 0 61 0 0 0 0 NA NA NA 0 2 2008 WD Normal 208500
2 20 RL 80 9600 Pave NA Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976 RFn 2 460 TA TA Y 298 0 0 0 0 0 NA NA NA 0 5 2007 WD Normal 181500
3 60 RL 68 11250 Pave NA IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001 RFn 2 608 TA TA Y 0 42 0 0 0 0 NA NA NA 0 9 2008 WD Normal 223500
4 70 RL 60 9550 Pave NA IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998 Unf 3 642 TA TA Y 0 35 272 0 0 0 NA NA NA 0 2 2006 WD Abnorml 140000
5 60 RL 84 14260 Pave NA IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000 RFn 3 836 TA TA Y 192 84 0 0 0 0 NA NA NA 0 12 2008 WD Normal 250000
6 50 RL 85 14115 Pave NA IR1 Lvl AllPub Inside Gtl Mitchel Norm Norm 1Fam 1.5Fin 5 5 1993 1995 Gable CompShg VinylSd VinylSd None 0 TA TA Wood Gd TA No GLQ 732 Unf 0 64 796 GasA Ex Y SBrkr 796 566 0 1362 1 0 1 1 1 1 TA 5 Typ 0 NA Attchd 1993 Unf 2 480 TA TA Y 40 30 0 320 0 0 NA MnPrv Shed 700 10 2009 WD Normal 143000

4.1 Imputing missing values

Missing values in data sets are problematic in data analysis, model training and prediction. Hence, we will impute missing values before going further in model training.

Let’s first define a generic function to check for missing values.

# define a function to check NAs as we will be using it frequenty to check
# what NAs are left after imputation
checkNAs <- function(df){
        # identify columns (by column number) with missing values
        naCols <- which(colSums(is.na(df))>0)
        # get columns with missing values sorted by number of missing values
        sort(colSums(sapply(allSet[naCols],is.na)), decreasing = TRUE)
}

checkNAs(allSet)
##       PoolQC  MiscFeature        Alley        Fence    SalePrice 
##         2909         2814         2721         2348         1459 
##  FireplaceQu  LotFrontage  GarageYrBlt GarageFinish   GarageQual 
##         1420          486          159          159          159 
##   GarageCond   GarageType     BsmtCond BsmtExposure     BsmtQual 
##          159          157           82           82           81 
## BsmtFinType2 BsmtFinType1   MasVnrType   MasVnrArea     MSZoning 
##           80           79           24           23            4 
##    Utilities BsmtFullBath BsmtHalfBath   Functional  Exterior1st 
##            2            2            2            2            1 
##  Exterior2nd   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF 
##            1            1            1            1            1 
##   Electrical  KitchenQual   GarageCars   GarageArea     SaleType 
##            1            1            1            1            1

After thoroughly studying the features with missing values, we can classify the imputation process into:

  1. Imputation based on most frequent values of the missing-value features
  2. Imputation based on other features that help predict the missing-value features

4.1.1 Imputation based on most frequent values of the missing-value features

This type of imputation is used for features that cannot be predicted from any other features in the available data set. These features are:

  • Fence
  • Alley
  • MiscFeature
  • Utilities
  • Functional
  • Exterior1st
  • Exterior2nd

We will visualize the frequencies/distributions of these features to identify the most appropriate values to use in the imputation process.

# visualizing Fence
p1 <- allSet%>% 
        ggplot(aes(x = Fence))+
        geom_histogram(stat = "count")+
        geom_label(stat='count',aes(label=..count..))

# visualizing Alley
p2 <- allSet%>% 
        ggplot(aes(x = Alley))+
        geom_histogram(stat = "count")+
        geom_label(stat='count',aes(label=..count..))

# visualizing MiscFeature
p3 <- allSet%>% 
        ggplot(aes(x = MiscFeature))+
        geom_histogram(stat = "count")+
        geom_label(stat='count',aes(label=..count..))

# visualizing Utilities
p4 <- allSet%>% 
        ggplot(aes(x = Utilities))+
        geom_histogram(stat = "count")+
        geom_label(stat='count',aes(label=..count..))

# visualizing Functional
p5 <- allSet%>% 
        ggplot(aes(x = Functional))+
        geom_histogram(stat = "count")+
        geom_label(stat='count',aes(label=..count..))

# visualizing Exterior1st & Exterior2nd
p6 <- allSet%>% 
        ggplot(aes(x = Exterior1st))+
        geom_histogram(stat = "count")+
        geom_label(stat='count',aes(label=..count..))+
       coord_flip()
p7 <- allSet%>% 
        ggplot(aes(x = Exterior2nd))+
        geom_histogram(stat = "count")+
        geom_label(stat='count',aes(label=..count..))+
        coord_flip()

grid.arrange(p1,p2,p3,p4,p5, ncol = 2)

grid.arrange(p6,p7, ncol = 2)

Analysis and conclusion

Using the visualizations shown above, we can notice the following:

  • Fence feature: most of the houses have NA fence values which indicates a No Fence value according to the data_description.txt file. So, we’ll replace the NAs by None.

  • Alley feature: most of the houses have NA alley values which indicates a No alley access value according to the data_description.txt file. So, we’ll replace the NAs by None.

  • MiscFeature: most of the houses have NA MiscFeature values which indicates a no additional miscellaneous features of the house and cannot be predicted from available data. Hence, we’ll replace the NAs by None.

  • Utilities feature: almost all houses have AllPub utilities value. So, we’ll replace the NAs by AllPub.

  • Functional: we have only 2 missing values of the Functional feature. Hence, we’ll replace the missing values by the most common value (Typ).

  • Exterior1st and Exterior2nd: we have only one missing value for each feature. Hence, we’ll replace the missing values by the most common value (VinylSd for both)

Imputation

Based on the above analysis and conclusion, we will impute missing values for this set of features.

allSet$Fence[is.na(allSet$Fence)] <- "None"
allSet$Alley[is.na(allSet$Alley)] <- "None"
allSet$MiscFeature[is.na(allSet$MiscFeature)] <- "None"
allSet$Utilities[is.na(allSet$Utilities)] <- "AllPub"
allSet$Functional[is.na(allSet$Functional)] <- "Typ"
allSet$Exterior1st[is.na(allSet$Exterior1st)] <- "VinylSd"
allSet$Exterior2nd[is.na(allSet$Exterior2nd)] <- "VinylSd"

4.1.2 Imputation based on other features that help predict the missing-value features

This type of imputation is used for features that can be predicted from other features in the available data set. These features are:

  • FireplaceQu: We have two features of fireplace that can be related to each other; Fireplaces and FireplaceQu. If a fireplace exists, then its quality should be logged as well. So, let’s see how many missing values of FireplaceQu are there for non-missing Fireplaces values.
# check number of houses with missing (NA) Fireplace quality and zero fireplaces.
fpNoQul <- allSet[is.na(allSet$FireplaceQu) & allSet$Fireplaces == 0 ,c("Fireplaces","FireplaceQu")]

# such houses should have "None" values in the FireplaceQu feature.
allSet$FireplaceQu[is.na(allSet$FireplaceQu) & allSet$Fireplaces == 0] <- "None"

# check that houses with no fireplaces all have "None" value for FireplaceQu
fp <- allSet[allSet$Fireplaces == 0 ,c("Fireplaces","FireplaceQu")]

table(fp)
##           FireplaceQu
## Fireplaces None
##          0 1420
  • LotFrontage: The linear feet of street connected to property (LotFrontage) can be linked to the Neighborhood feature as houses within the same neighborhood tend to have similar lot frontages.

Let’s see medians of lot frontages for each neighborhood and use them to impute missing lot frontage values.

# get median of lot frontage for each neighborhood
medNgbrLotFr <- allSet[!is.na(allSet$Neighborhood), c("Neighborhood","LotFrontage")] %>% 
        group_by(Neighborhood) %>% 
        summarize(median = median(LotFrontage, na.rm = T))
medNgbrLotFr
## # A tibble: 25 x 2
##    Neighborhood median
##    <chr>         <dbl>
##  1 Blmngtn        43.0
##  2 Blueste        24.0
##  3 BrDale         21.0
##  4 BrkSide        51.0
##  5 ClearCr        80.5
##  6 CollgCr        70.0
##  7 Crawfor        70.0
##  8 Edwards        65.0
##  9 Gilbert        64.0
## 10 IDOTRR         60.0
## # ... with 15 more rows
# replace missing lot frontage values with the median of lot frontage with the same neighborhood
rIndx <- which(is.na(allSet$LotFrontage))

# run a vlookup-like loop to get median value for each missing value by linkage to neighborhood
for(i in rIndx){
        medVal <- medNgbrLotFr[medNgbrLotFr$Neighborhood == allSet$Neighborhood[i],"median"]
        allSet$LotFrontage[i] <- medVal[[1]]
}
  • MasVnrType and MasVnrArea: Both masonry features are related. If a house has a masonry veneer type, it should have its area recorded. On the other hand, if a house does not have a veener type and recorded area, it can be assumed not having a masonry veneer as one option available in the data_description.txt.

Let’s check the masonry veneer type and area (MasVnrType and MasVnrArea) missing values.

allSet[is.na(allSet$MasVnrType) | is.na(allSet$MasVnrArea), c("MasVnrType","MasVnrArea")]
##      MasVnrType MasVnrArea
## 235        <NA>         NA
## 530        <NA>         NA
## 651        <NA>         NA
## 937        <NA>         NA
## 974        <NA>         NA
## 978        <NA>         NA
## 1244       <NA>         NA
## 1279       <NA>         NA
## 1692       <NA>         NA
## 1707       <NA>         NA
## 1883       <NA>         NA
## 1993       <NA>         NA
## 2005       <NA>         NA
## 2042       <NA>         NA
## 2312       <NA>         NA
## 2326       <NA>         NA
## 2341       <NA>         NA
## 2350       <NA>         NA
## 2369       <NA>         NA
## 2593       <NA>         NA
## 2611       <NA>        198
## 2658       <NA>         NA
## 2687       <NA>         NA
## 2863       <NA>         NA

Out of 24 houses, we have 23 with missing values in both features. So, we’ll replace missing values where both are missing in the house by None in the MasVnrType and by zero in the MasVnrArea.

allSet$MasVnrType[is.na(allSet$MasVnrType) & is.na(allSet$MasVnrArea)] <- "None"
allSet$MasVnrArea[is.na(allSet$MasVnrArea)] <- 0

For the house with missing type but with recorded area, we will use the median of Masonry areas across the data to estimate its masonry type.

missArea <- allSet$MasVnrArea[is.na(allSet$MasVnrType)]
allSet[allSet$MasVnrType != "None", c("MasVnrType","MasVnrArea")] %>% 
        ggplot(aes(x = MasVnrType, y = MasVnrArea)) +
        geom_boxplot()+
        geom_hline(yintercept = missArea, color = "red")

medMason <- allSet[allSet$MasVnrType != "None", c("MasVnrType","MasVnrArea")] %>% 
        group_by(MasVnrType) %>% 
        summarize(median = median(MasVnrArea))
medMason
## # A tibble: 4 x 2
##   MasVnrType median
##   <chr>       <dbl>
## 1 BrkCmn        161
## 2 BrkFace       203
## 3 Stone         200
## 4 <NA>           NA

It’s obvious that the nearest masonry type to the missing values is Stone. Hence, we will replace the missing value by Stone.

rIndx <- which(is.na(allSet$MasVnrType))

for(i in rIndx){
        medVal <- medMason[which((abs(medMason$median - allSet$MasVnrArea[i])) == min(abs(medMason$median - allSet$MasVnrArea[i]), na.rm = T)),"MasVnrType"]
         allSet$MasVnrType[i] <- medVal[[1]]
}
  • MSZoning: The zoning classification of the sale (MSZoning) can be linked to the type of dwelling involved in the sale (MSSubClass); i.e. house type.
allSet[is.na(allSet$MSZoning), c("MSSubClass","MSZoning")]
##      MSSubClass MSZoning
## 1916         30     <NA>
## 2217         20     <NA>
## 2251         70     <NA>
## 2905         20     <NA>
# distribution of house type versus its zoning classification 
# for the three dwelling types of the missing zoning values
missZoning <- unique(allSet$MSSubClass[is.na(allSet$MSZoning)])
allSet[!is.na(allSet$MSZoning) & allSet$MSSubClass %in% missZoning, c("MSZoning","MSSubClass")] %>% 
        ggplot(aes(x = MSZoning, fill = factor(MSSubClass)))+
        geom_histogram(stat = "count")

From the histogram above we can see that house types of 70 and 30 can have a zoning classification of RM while the 20 type is of zoning RL. So, the values will be imputed accordingly.

allSet$MSZoning[is.na(allSet$MSZoning) & allSet$MSSubClass %in% c(70,30)] <- "RM"
allSet$MSZoning[is.na(allSet$MSZoning) & allSet$MSSubClass == 20] <- "RL"
  • SaleType: Sale Type may be linked to Sale Condition (SaleCondition).
allSet[is.na(allSet$SaleType), c("SaleType","SaleCondition")]
##      SaleType SaleCondition
## 2490     <NA>        Normal

We have one house with missing Sale Type value having a Normal Sale Condition. Let’s explore the relationship between both feautres.

allSet[!is.na(allSet$SaleType) & !is.na(allSet$SaleCondition), c("SaleType","SaleCondition")] %>% 
        ggplot(aes(x = SaleType, fill = factor(SaleCondition)))+
        geom_histogram(stat = "count")+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

We can see that most Normal SaleCondition have a WD SaleType. Hence, we’ll impute the missing value with the same.

allSet$SaleType[is.na(allSet$SaleType)] <- "WD"
  • Electrical: This feature can be directly related to the overall condition of the house (OverallCond). Let’s see how they are related.
allSet[is.na(allSet$Electrical), c("OverallCond","Electrical")]
##      OverallCond Electrical
## 1380           5       <NA>

We have 1 house with missing Electrical value having a 5 rate of Overall Condition. Let’s explore the relationship between both features.

allSet[!is.na(allSet$Electrical), c("OverallCond","Electrical")] %>% 
        ggplot(aes(x = factor(OverallCond), fill = factor(Electrical)))+
        geom_histogram(stat = "count")

Most of 5-rating Overall Condition houses have SBrkr electrical system. Hence, we’ll impute the missing value.

allSet$Electrical[is.na(allSet$Electrical)] <- "SBrkr"
  • KitchenQual: We will use the Kitchens Above Grade feature (KitchenAbvGr) to guide us to the missing values in kitchen quality. Let’s see the relationship.
allSet[is.na(allSet$KitchenQual), c("KitchenAbvGr","KitchenQual")]
##      KitchenAbvGr KitchenQual
## 1556            1        <NA>
allSet[!is.na(allSet$KitchenQual), c("KitchenAbvGr","KitchenQual")] %>% 
        ggplot(aes(x = KitchenAbvGr, fill = factor(KitchenQual)))+
        geom_histogram(stat = "count")

For a KitchenAbvGr of 1, the most likely value for a KitchenQual is TA. Hence, we’ll impute the missing value with TA.

allSet$KitchenQual[is.na(allSet$KitchenQual)] <- "TA"
  • PoolQC: Since Pool Quality can be detected from Pool Area, it can be assumed that if a pool area is zero, then the house has no pool. Hence, missing Pool Quality can be imputed with None. Let’s see the count of houses with missing (NA) PoolQC values.
allSet[is.na(allSet$PoolQC),c("PoolQC","PoolArea")] %>% 
        group_by(PoolArea) %>% 
        summarize(count = n())
## # A tibble: 4 x 2
##   PoolArea count
##      <int> <int>
## 1        0  2906
## 2      368     1
## 3      444     1
## 4      561     1

Out of the 2909 houses with missing PoolQC values, we have 2906 with zero PoolArea. This lets us assume with high confidence that houses with zero PoolAreas do not have pools; hence, these PoolQC values can be imputed with None.

allSet$PoolQC[allSet$PoolArea == 0] <- "None"

However, we have three houses with missing PoolQC values while having non-zero PoolAreas.

allSet[is.na(allSet$PoolQC) & allSet$PoolArea >0,c("PoolQC","PoolArea")]
##      PoolQC PoolArea
## 2421   <NA>      368
## 2504   <NA>      444
## 2600   <NA>      561

Since PoolArea is a likely predictor to PoolQC, we can impute the missing PoolQC values with corresponding values of other houses having means of PoolAreas near to those of the missing PoolQC values.

# calculate the means of PoolAreas for each of the PoolQC in the data set
meanArea <- allSet[!is.na(allSet$PoolQC),c("PoolQC","PoolArea")] %>% 
        group_by(PoolQC) %>% 
        summarize(AreaMean = round(mean(PoolArea),0))
meanArea
## # A tibble: 4 x 2
##   PoolQC AreaMean
##   <chr>     <dbl>
## 1 Ex          360
## 2 Fa          584
## 3 Gd          648
## 4 None          0

Impute missing PoolQC values with the values of nearest PoolArea means.

rIndx <- which(is.na(allSet$PoolQC) & allSet$PoolArea >0)

for(i in rIndx){
        poolQc <- meanArea[which((abs(meanArea$AreaMean - allSet$PoolArea[i])) == min(abs(meanArea$AreaMean - allSet$PoolArea[i]), na.rm = T)),"PoolQC"]
        allSet$PoolQC[i] <- poolQc[[1]]
}
  • Garage features: By checking the Garage-related features with missing values, we notice that we have 7 features all related to garage and have maximum number of missing values of 159 in GarageYrBlt column.
# identify all Garage features as those starting with Garage text
garageCols <- names(allSet)[grepl("Garage.*", names(allSet))]
garageCols
## [1] "GarageType"   "GarageYrBlt"  "GarageFinish" "GarageCars"  
## [5] "GarageArea"   "GarageQual"   "GarageCond"

By studying garage features description in the data_description.txt file, we see that garage features: GarageType, GarageFinish, GarageQual, and GarageCond all have NA values to indicate No-Garage rather than a missing value recorded. Yet, we need to change the NAs to None for these features, and to zero for the integer-type features: GarageYrBlt, * GarageCars* and * GarageArea*. This is to avoid problems in modeling the data.

First, we will consider houses with a complete set of NAs or zeros- as explained above- as houses with None values.

# find indexes of houses having all Garage features flagged with NAs or zeros
noGarage <- which((is.na(allSet$GarageArea) | allSet$GarageArea == 0)
                  & (is.na(allSet$GarageCars) | allSet$GarageCars == 0)
                  & is.na(allSet$GarageCond)
                  & is.na(allSet$GarageFinish)
                  & is.na(allSet$GarageQual)
                  & is.na(allSet$GarageType)
                  & (is.na(allSet$GarageYrBlt) | allSet$GarageYrBlt == 0))
                  
 
allSet[noGarage,garageCols]
##      GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 40         <NA>          NA         <NA>          0          0       <NA>
## 49         <NA>          NA         <NA>          0          0       <NA>
## 79         <NA>          NA         <NA>          0          0       <NA>
## 89         <NA>          NA         <NA>          0          0       <NA>
## 90         <NA>          NA         <NA>          0          0       <NA>
## 100        <NA>          NA         <NA>          0          0       <NA>
## 109        <NA>          NA         <NA>          0          0       <NA>
## 126        <NA>          NA         <NA>          0          0       <NA>
## 128        <NA>          NA         <NA>          0          0       <NA>
## 141        <NA>          NA         <NA>          0          0       <NA>
## 149        <NA>          NA         <NA>          0          0       <NA>
## 156        <NA>          NA         <NA>          0          0       <NA>
## 164        <NA>          NA         <NA>          0          0       <NA>
## 166        <NA>          NA         <NA>          0          0       <NA>
## 199        <NA>          NA         <NA>          0          0       <NA>
## 211        <NA>          NA         <NA>          0          0       <NA>
## 242        <NA>          NA         <NA>          0          0       <NA>
## 251        <NA>          NA         <NA>          0          0       <NA>
## 288        <NA>          NA         <NA>          0          0       <NA>
## 292        <NA>          NA         <NA>          0          0       <NA>
## 308        <NA>          NA         <NA>          0          0       <NA>
## 376        <NA>          NA         <NA>          0          0       <NA>
## 387        <NA>          NA         <NA>          0          0       <NA>
## 394        <NA>          NA         <NA>          0          0       <NA>
## 432        <NA>          NA         <NA>          0          0       <NA>
## 435        <NA>          NA         <NA>          0          0       <NA>
## 442        <NA>          NA         <NA>          0          0       <NA>
## 465        <NA>          NA         <NA>          0          0       <NA>
## 496        <NA>          NA         <NA>          0          0       <NA>
## 521        <NA>          NA         <NA>          0          0       <NA>
## 529        <NA>          NA         <NA>          0          0       <NA>
## 534        <NA>          NA         <NA>          0          0       <NA>
## 536        <NA>          NA         <NA>          0          0       <NA>
## 563        <NA>          NA         <NA>          0          0       <NA>
## 583        <NA>          NA         <NA>          0          0       <NA>
## 614        <NA>          NA         <NA>          0          0       <NA>
## 615        <NA>          NA         <NA>          0          0       <NA>
## 621        <NA>          NA         <NA>          0          0       <NA>
## 636        <NA>          NA         <NA>          0          0       <NA>
## 637        <NA>          NA         <NA>          0          0       <NA>
## 639        <NA>          NA         <NA>          0          0       <NA>
## 650        <NA>          NA         <NA>          0          0       <NA>
## 706        <NA>          NA         <NA>          0          0       <NA>
## 711        <NA>          NA         <NA>          0          0       <NA>
## 739        <NA>          NA         <NA>          0          0       <NA>
## 751        <NA>          NA         <NA>          0          0       <NA>
## 785        <NA>          NA         <NA>          0          0       <NA>
## 827        <NA>          NA         <NA>          0          0       <NA>
## 844        <NA>          NA         <NA>          0          0       <NA>
## 922        <NA>          NA         <NA>          0          0       <NA>
## 943        <NA>          NA         <NA>          0          0       <NA>
## 955        <NA>          NA         <NA>          0          0       <NA>
## 961        <NA>          NA         <NA>          0          0       <NA>
## 969        <NA>          NA         <NA>          0          0       <NA>
## 971        <NA>          NA         <NA>          0          0       <NA>
## 977        <NA>          NA         <NA>          0          0       <NA>
## 1010       <NA>          NA         <NA>          0          0       <NA>
## 1012       <NA>          NA         <NA>          0          0       <NA>
## 1031       <NA>          NA         <NA>          0          0       <NA>
## 1039       <NA>          NA         <NA>          0          0       <NA>
## 1097       <NA>          NA         <NA>          0          0       <NA>
## 1124       <NA>          NA         <NA>          0          0       <NA>
## 1132       <NA>          NA         <NA>          0          0       <NA>
## 1138       <NA>          NA         <NA>          0          0       <NA>
## 1144       <NA>          NA         <NA>          0          0       <NA>
## 1174       <NA>          NA         <NA>          0          0       <NA>
## 1180       <NA>          NA         <NA>          0          0       <NA>
## 1219       <NA>          NA         <NA>          0          0       <NA>
## 1220       <NA>          NA         <NA>          0          0       <NA>
## 1235       <NA>          NA         <NA>          0          0       <NA>
## 1258       <NA>          NA         <NA>          0          0       <NA>
## 1284       <NA>          NA         <NA>          0          0       <NA>
## 1324       <NA>          NA         <NA>          0          0       <NA>
## 1326       <NA>          NA         <NA>          0          0       <NA>
## 1327       <NA>          NA         <NA>          0          0       <NA>
## 1338       <NA>          NA         <NA>          0          0       <NA>
## 1350       <NA>          NA         <NA>          0          0       <NA>
## 1408       <NA>          NA         <NA>          0          0       <NA>
## 1450       <NA>          NA         <NA>          0          0       <NA>
## 1451       <NA>          NA         <NA>          0          0       <NA>
## 1454       <NA>          NA         <NA>          0          0       <NA>
## 1514       <NA>          NA         <NA>          0          0       <NA>
## 1532       <NA>          NA         <NA>          0          0       <NA>
## 1540       <NA>          NA         <NA>          0          0       <NA>
## 1553       <NA>          NA         <NA>          0          0       <NA>
## 1557       <NA>          NA         <NA>          0          0       <NA>
## 1559       <NA>          NA         <NA>          0          0       <NA>
## 1561       <NA>          NA         <NA>          0          0       <NA>
## 1591       <NA>          NA         <NA>          0          0       <NA>
## 1594       <NA>          NA         <NA>          0          0       <NA>
## 1595       <NA>          NA         <NA>          0          0       <NA>
## 1615       <NA>          NA         <NA>          0          0       <NA>
## 1616       <NA>          NA         <NA>          0          0       <NA>
## 1718       <NA>          NA         <NA>          0          0       <NA>
## 1722       <NA>          NA         <NA>          0          0       <NA>
## 1788       <NA>          NA         <NA>          0          0       <NA>
## 1809       <NA>          NA         <NA>          0          0       <NA>
## 1811       <NA>          NA         <NA>          0          0       <NA>
## 1812       <NA>          NA         <NA>          0          0       <NA>
## 1820       <NA>          NA         <NA>          0          0       <NA>
## 1823       <NA>          NA         <NA>          0          0       <NA>
## 1832       <NA>          NA         <NA>          0          0       <NA>
## 1835       <NA>          NA         <NA>          0          0       <NA>
## 1837       <NA>          NA         <NA>          0          0       <NA>
## 1840       <NA>          NA         <NA>          0          0       <NA>
## 1848       <NA>          NA         <NA>          0          0       <NA>
## 1894       <NA>          NA         <NA>          0          0       <NA>
## 2011       <NA>          NA         <NA>          0          0       <NA>
## 2082       <NA>          NA         <NA>          0          0       <NA>
## 2091       <NA>          NA         <NA>          0          0       <NA>
## 2094       <NA>          NA         <NA>          0          0       <NA>
## 2097       <NA>          NA         <NA>          0          0       <NA>
## 2100       <NA>          NA         <NA>          0          0       <NA>
## 2105       <NA>          NA         <NA>          0          0       <NA>
## 2136       <NA>          NA         <NA>          0          0       <NA>
## 2152       <NA>          NA         <NA>          0          0       <NA>
## 2154       <NA>          NA         <NA>          0          0       <NA>
## 2190       <NA>          NA         <NA>          0          0       <NA>
## 2191       <NA>          NA         <NA>          0          0       <NA>
## 2192       <NA>          NA         <NA>          0          0       <NA>
## 2193       <NA>          NA         <NA>          0          0       <NA>
## 2194       <NA>          NA         <NA>          0          0       <NA>
## 2213       <NA>          NA         <NA>          0          0       <NA>
## 2239       <NA>          NA         <NA>          0          0       <NA>
## 2247       <NA>          NA         <NA>          0          0       <NA>
## 2354       <NA>          NA         <NA>          0          0       <NA>
## 2355       <NA>          NA         <NA>          0          0       <NA>
## 2399       <NA>          NA         <NA>          0          0       <NA>
## 2400       <NA>          NA         <NA>          0          0       <NA>
## 2423       <NA>          NA         <NA>          0          0       <NA>
## 2427       <NA>          NA         <NA>          0          0       <NA>
## 2553       <NA>          NA         <NA>          0          0       <NA>
## 2554       <NA>          NA         <NA>          0          0       <NA>
## 2558       <NA>          NA         <NA>          0          0       <NA>
## 2576       <NA>          NA         <NA>          0          0       <NA>
## 2580       <NA>          NA         <NA>          0          0       <NA>
## 2604       <NA>          NA         <NA>          0          0       <NA>
## 2610       <NA>          NA         <NA>          0          0       <NA>
## 2692       <NA>          NA         <NA>          0          0       <NA>
## 2694       <NA>          NA         <NA>          0          0       <NA>
## 2709       <NA>          NA         <NA>          0          0       <NA>
## 2768       <NA>          NA         <NA>          0          0       <NA>
## 2772       <NA>          NA         <NA>          0          0       <NA>
## 2790       <NA>          NA         <NA>          0          0       <NA>
## 2792       <NA>          NA         <NA>          0          0       <NA>
## 2800       <NA>          NA         <NA>          0          0       <NA>
## 2860       <NA>          NA         <NA>          0          0       <NA>
## 2863       <NA>          NA         <NA>          0          0       <NA>
## 2871       <NA>          NA         <NA>          0          0       <NA>
## 2889       <NA>          NA         <NA>          0          0       <NA>
## 2892       <NA>          NA         <NA>          0          0       <NA>
## 2893       <NA>          NA         <NA>          0          0       <NA>
## 2894       <NA>          NA         <NA>          0          0       <NA>
## 2910       <NA>          NA         <NA>          0          0       <NA>
## 2914       <NA>          NA         <NA>          0          0       <NA>
## 2915       <NA>          NA         <NA>          0          0       <NA>
## 2918       <NA>          NA         <NA>          0          0       <NA>
##      GarageCond
## 40         <NA>
## 49         <NA>
## 79         <NA>
## 89         <NA>
## 90         <NA>
## 100        <NA>
## 109        <NA>
## 126        <NA>
## 128        <NA>
## 141        <NA>
## 149        <NA>
## 156        <NA>
## 164        <NA>
## 166        <NA>
## 199        <NA>
## 211        <NA>
## 242        <NA>
## 251        <NA>
## 288        <NA>
## 292        <NA>
## 308        <NA>
## 376        <NA>
## 387        <NA>
## 394        <NA>
## 432        <NA>
## 435        <NA>
## 442        <NA>
## 465        <NA>
## 496        <NA>
## 521        <NA>
## 529        <NA>
## 534        <NA>
## 536        <NA>
## 563        <NA>
## 583        <NA>
## 614        <NA>
## 615        <NA>
## 621        <NA>
## 636        <NA>
## 637        <NA>
## 639        <NA>
## 650        <NA>
## 706        <NA>
## 711        <NA>
## 739        <NA>
## 751        <NA>
## 785        <NA>
## 827        <NA>
## 844        <NA>
## 922        <NA>
## 943        <NA>
## 955        <NA>
## 961        <NA>
## 969        <NA>
## 971        <NA>
## 977        <NA>
## 1010       <NA>
## 1012       <NA>
## 1031       <NA>
## 1039       <NA>
## 1097       <NA>
## 1124       <NA>
## 1132       <NA>
## 1138       <NA>
## 1144       <NA>
## 1174       <NA>
## 1180       <NA>
## 1219       <NA>
## 1220       <NA>
## 1235       <NA>
## 1258       <NA>
## 1284       <NA>
## 1324       <NA>
## 1326       <NA>
## 1327       <NA>
## 1338       <NA>
## 1350       <NA>
## 1408       <NA>
## 1450       <NA>
## 1451       <NA>
## 1454       <NA>
## 1514       <NA>
## 1532       <NA>
## 1540       <NA>
## 1553       <NA>
## 1557       <NA>
## 1559       <NA>
## 1561       <NA>
## 1591       <NA>
## 1594       <NA>
## 1595       <NA>
## 1615       <NA>
## 1616       <NA>
## 1718       <NA>
## 1722       <NA>
## 1788       <NA>
## 1809       <NA>
## 1811       <NA>
## 1812       <NA>
## 1820       <NA>
## 1823       <NA>
## 1832       <NA>
## 1835       <NA>
## 1837       <NA>
## 1840       <NA>
## 1848       <NA>
## 1894       <NA>
## 2011       <NA>
## 2082       <NA>
## 2091       <NA>
## 2094       <NA>
## 2097       <NA>
## 2100       <NA>
## 2105       <NA>
## 2136       <NA>
## 2152       <NA>
## 2154       <NA>
## 2190       <NA>
## 2191       <NA>
## 2192       <NA>
## 2193       <NA>
## 2194       <NA>
## 2213       <NA>
## 2239       <NA>
## 2247       <NA>
## 2354       <NA>
## 2355       <NA>
## 2399       <NA>
## 2400       <NA>
## 2423       <NA>
## 2427       <NA>
## 2553       <NA>
## 2554       <NA>
## 2558       <NA>
## 2576       <NA>
## 2580       <NA>
## 2604       <NA>
## 2610       <NA>
## 2692       <NA>
## 2694       <NA>
## 2709       <NA>
## 2768       <NA>
## 2772       <NA>
## 2790       <NA>
## 2792       <NA>
## 2800       <NA>
## 2860       <NA>
## 2863       <NA>
## 2871       <NA>
## 2889       <NA>
## 2892       <NA>
## 2893       <NA>
## 2894       <NA>
## 2910       <NA>
## 2914       <NA>
## 2915       <NA>
## 2918       <NA>
# impute with NAs or zeros
allSet[noGarage,c("GarageType","GarageFinish","GarageQual","GarageCond")] <- "None"
allSet[noGarage, c("GarageYrBlt","GarageCars","GarageArea")] <- 0

Let’s see all houses with at least one garage feature having a missing value. Such houses do not indicate a house with no garage, and their missing values need to be imputed with reasonable values.

# find indexes of houses having at least one missing Garage value
missGarage <- which(is.na(allSet$GarageArea) 
                    | is.na(allSet$GarageCars)
                    | is.na(allSet$GarageCond)
                    | is.na(allSet$GarageFinish)
                    | is.na(allSet$GarageQual)
                    | is.na(allSet$GarageType)
                    | is.na(allSet$GarageYrBlt))
allSet[missGarage,garageCols]
##      GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 2127     Detchd          NA         <NA>          1        360       <NA>
## 2577     Detchd          NA         <NA>         NA         NA       <NA>
##      GarageCond
## 2127       <NA>
## 2577       <NA>

We have only two houses with garages but having missing values in some features. For both houses, we can use the YearBuilt feature to impute the GarageYrBlt value as most of garages are built with their houses.

allSet[missGarage,"GarageYrBlt"] <- allSet[missGarage,"YearBuilt"]

House 2577 is missing more values than house 2127. Hence, we will impute its missing values first based on the matching values of houses of the same GarageType and GarageYrBlt.

# index houses with missing garageCars or area
miss <- which(is.na(allSet$GarageCars) | is.na(allSet$GarageArea))

# group houses by garage features
grpdVals1 <- allSet%>% 
        group_by(GarageType, GarageYrBlt, GarageFinish, GarageQual, GarageCond) %>% 
        summarize(medCars = median(GarageCars), meanArea = round(mean(GarageArea),0),count = n()) %>% 
        arrange(desc(count))
comp <- complete.cases(grpdVals1)
grpdVals1 <- grpdVals1[comp,]
grpdVals1
## # A tibble: 601 x 8
## # Groups: GarageType, GarageYrBlt, GarageFinish, GarageQual [553]
##    GarageType GarageYrBlt GarageFinish GarageQual GarageCond medCars
##    <chr>            <dbl> <chr>        <chr>      <chr>        <dbl>
##  1 None                 0 None         None       None          0   
##  2 Attchd            2006 RFn          TA         TA            2.00
##  3 Attchd            2005 Fin          TA         TA            2.00
##  4 Attchd            2007 Fin          TA         TA            3.00
##  5 Attchd            2005 RFn          TA         TA            2.00
##  6 Attchd            2006 Fin          TA         TA            2.00
##  7 Attchd            2004 Fin          TA         TA            2.00
##  8 Attchd            2007 RFn          TA         TA            2.00
##  9 Attchd            2003 Fin          TA         TA            2.00
## 10 Attchd            2008 Fin          TA         TA            3.00
## # ... with 591 more rows, and 2 more variables: meanArea <dbl>, count
## #   <int>
# look-up missing values based on houses with the same type and year built
rIndx <- miss
for(i in rIndx){
    missVals <- grpdVals1[which((grpdVals1$GarageYrBlt == allSet$GarageYrBlt[i]) & (grpdVals1$GarageType == allSet$GarageType[i])),c("GarageFinish", "medCars", "meanArea", "GarageQual","GarageCond")]
        allSet$GarageFinish[i] <- missVals[[1]][1]
        allSet$GarageCars[i] <- missVals[[2]][1]
        allSet$GarageArea[i] <- missVals[[3]][1]
        allSet$GarageQual[i] <- missVals[[4]][1]
        allSet$GarageCond[i] <- missVals[[5]][1]
  }

House 2127 has a garage with and area of 360 square feet and 1-GarageCars value. So, we will impute its missing values (GarageFinish, GarageQual and GarageCond) with those having similar GarageArea, GarageCars, and GarageType.

grpdVals <- allSet[allSet$GarageType == "Detchd" & allSet$GarageCars == 1,garageCols] %>% 
        group_by(GarageFinish, GarageQual, GarageCond) %>% 
        summarize(meanArea = round(mean(GarageArea),0),count = n()) %>% 
        arrange(desc(count))
comp <- complete.cases(grpdVals)
grpdVals <- grpdVals[comp,]
grpdVals
## # A tibble: 14 x 5
## # Groups: GarageFinish, GarageQual [8]
##    GarageFinish GarageQual GarageCond meanArea count
##    <chr>        <chr>      <chr>         <dbl> <int>
##  1 Unf          TA         TA              294   269
##  2 Unf          Fa         TA              248    40
##  3 Unf          Fa         Fa              226    25
##  4 Unf          TA         Fa              258    18
##  5 RFn          TA         TA              373     6
##  6 Fin          TA         TA              327     4
##  7 Unf          Fa         Po              246     4
##  8 Unf          Po         Po              301     4
##  9 Fin          Ex         Ex              924     1
## 10 Unf          Ex         Ex              300     1
## 11 Unf          Gd         TA              352     1
## 12 Unf          Po         Fa              195     1
## 13 Unf          TA         Gd              440     1
## 14 Unf          TA         Po              216     1

Impute missing values with the nearest values found in the table above.

rIndx <- missGarage

for(i in rIndx){
    missVals <- grpdVals[which((abs(grpdVals$meanArea - allSet$GarageArea[i])) == min(abs(grpdVals$meanArea - allSet$GarageArea[i]),na.rm = T)),c("GarageFinish", "GarageQual","GarageCond")]
        allSet$GarageFinish[i] <- missVals[[1]][1]
        allSet$GarageQual[i] <- missVals[[2]][1]
        allSet$GarageCond[i] <- missVals[[3]][1]
  }
  • Basement features: By studying the data_description.txt file, we find that there are 11 features-listed below- that describe the basements in houses.
# identify all basement features as those containing Bsmt text
bsmtCols <- names(allSet)[grepl("Bsmt.*", names(allSet))]
bsmtCols
##  [1] "BsmtQual"     "BsmtCond"     "BsmtExposure" "BsmtFinType1"
##  [5] "BsmtFinSF1"   "BsmtFinType2" "BsmtFinSF2"   "BsmtUnfSF"   
##  [9] "TotalBsmtSF"  "BsmtFullBath" "BsmtHalfBath"

First, let’s check missing values in all Basement features. Assuming that if a house has NA values in all basement features, then the house has no basement. Basement values of such houses will be replaced by None for string features and by zero for numeric features.

# check all Bsmt houses having all Bsmt features NAs or zeros
bsmtAllNa <- which((allSet[,bsmtCols[1]]== 0 | is.na(allSet[,bsmtCols[1]])) &
                        (allSet[,bsmtCols[2]]== 0 | is.na(allSet[,bsmtCols[2]])) & 
                        (allSet[,bsmtCols[3]]== 0 | is.na(allSet[,bsmtCols[3]])) &
                        (allSet[,bsmtCols[4]]== 0 | is.na(allSet[,bsmtCols[4]])) &
                        (allSet[,bsmtCols[5]]== 0 | is.na(allSet[,bsmtCols[5]])) &
                        (allSet[,bsmtCols[6]]== 0 | is.na(allSet[,bsmtCols[6]])) &
                        (allSet[,bsmtCols[7]]== 0 | is.na(allSet[,bsmtCols[7]])) &
                        (allSet[,bsmtCols[8]]== 0 | is.na(allSet[,bsmtCols[8]])) &
                        (allSet[,bsmtCols[9]]== 0 | is.na(allSet[,bsmtCols[9]])) &
                        (allSet[,bsmtCols[10]]== 0 | is.na(allSet[,bsmtCols[10]])) &
                        (allSet[,bsmtCols[11]]== 0 | is.na(allSet[,bsmtCols[11]]))
        )
allSet[bsmtAllNa,bsmtCols]
##      BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 18       <NA>     <NA>         <NA>         <NA>          0         <NA>
## 40       <NA>     <NA>         <NA>         <NA>          0         <NA>
## 91       <NA>     <NA>         <NA>         <NA>          0         <NA>
## 103      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 157      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 183      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 260      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 343      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 363      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 372      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 393      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 521      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 533      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 534      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 554      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 647      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 706      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 737      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 750      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 779      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 869      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 895      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 898      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 985      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1001     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1012     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1036     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1046     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1049     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1050     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1091     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1180     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1217     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1219     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1233     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1322     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1413     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1586     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1594     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1730     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1779     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1815     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1848     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1849     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1857     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1858     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1859     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1861     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1916     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2051     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2067     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2069     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2121     <NA>     <NA>         <NA>         <NA>         NA         <NA>
## 2123     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2189     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2190     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2191     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2194     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2217     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2225     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2388     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2436     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2453     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2454     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2491     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2499     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2548     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2553     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2565     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2579     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2600     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2703     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2764     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2767     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2804     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2805     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2825     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2892     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2905     <NA>     <NA>         <NA>         <NA>          0         <NA>
##      BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
## 18            0         0           0            0            0
## 40            0         0           0            0            0
## 91            0         0           0            0            0
## 103           0         0           0            0            0
## 157           0         0           0            0            0
## 183           0         0           0            0            0
## 260           0         0           0            0            0
## 343           0         0           0            0            0
## 363           0         0           0            0            0
## 372           0         0           0            0            0
## 393           0         0           0            0            0
## 521           0         0           0            0            0
## 533           0         0           0            0            0
## 534           0         0           0            0            0
## 554           0         0           0            0            0
## 647           0         0           0            0            0
## 706           0         0           0            0            0
## 737           0         0           0            0            0
## 750           0         0           0            0            0
## 779           0         0           0            0            0
## 869           0         0           0            0            0
## 895           0         0           0            0            0
## 898           0         0           0            0            0
## 985           0         0           0            0            0
## 1001          0         0           0            0            0
## 1012          0         0           0            0            0
## 1036          0         0           0            0            0
## 1046          0         0           0            0            0
## 1049          0         0           0            0            0
## 1050          0         0           0            0            0
## 1091          0         0           0            0            0
## 1180          0         0           0            0            0
## 1217          0         0           0            0            0
## 1219          0         0           0            0            0
## 1233          0         0           0            0            0
## 1322          0         0           0            0            0
## 1413          0         0           0            0            0
## 1586          0         0           0            0            0
## 1594          0         0           0            0            0
## 1730          0         0           0            0            0
## 1779          0         0           0            0            0
## 1815          0         0           0            0            0
## 1848          0         0           0            0            0
## 1849          0         0           0            0            0
## 1857          0         0           0            0            0
## 1858          0         0           0            0            0
## 1859          0         0           0            0            0
## 1861          0         0           0            0            0
## 1916          0         0           0            0            0
## 2051          0         0           0            0            0
## 2067          0         0           0            0            0
## 2069          0         0           0            0            0
## 2121         NA        NA          NA           NA           NA
## 2123          0         0           0            0            0
## 2189          0         0           0           NA           NA
## 2190          0         0           0            0            0
## 2191          0         0           0            0            0
## 2194          0         0           0            0            0
## 2217          0         0           0            0            0
## 2225          0         0           0            0            0
## 2388          0         0           0            0            0
## 2436          0         0           0            0            0
## 2453          0         0           0            0            0
## 2454          0         0           0            0            0
## 2491          0         0           0            0            0
## 2499          0         0           0            0            0
## 2548          0         0           0            0            0
## 2553          0         0           0            0            0
## 2565          0         0           0            0            0
## 2579          0         0           0            0            0
## 2600          0         0           0            0            0
## 2703          0         0           0            0            0
## 2764          0         0           0            0            0
## 2767          0         0           0            0            0
## 2804          0         0           0            0            0
## 2805          0         0           0            0            0
## 2825          0         0           0            0            0
## 2892          0         0           0            0            0
## 2905          0         0           0            0            0

We have 79 houses that can be flagged as not having basements.

# flag the non-basement houses with "No Basement" for string bsmt features and with zeros
# for numeric bsmt feautres
allSet[bsmtAllNa,c("BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2")] <- "None"
allSet[bsmtAllNa, c("BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath")] <- 0

Now, let’s check remaining basement houses which have missing values but seem to have basements based on at least one non-NA basement feature.

bsmtSmNa <- which(is.na(allSet[,bsmtCols[1]]) |
                          is.na(allSet[,bsmtCols[2]]) | 
                          is.na(allSet[,bsmtCols[3]]) | 
                          is.na(allSet[,bsmtCols[4]]) |
                          is.na(allSet[,bsmtCols[5]]) | 
                          is.na(allSet[,bsmtCols[6]]) |
                          is.na(allSet[,bsmtCols[7]]) | 
                          is.na(allSet[,bsmtCols[8]]) | 
                          is.na(allSet[,bsmtCols[9]]) |
                          is.na(allSet[,bsmtCols[10]])| 
                          is.na(allSet[,bsmtCols[11]]))
bsmtNa <- allSet[bsmtSmNa,bsmtCols]
bsmtNa
##      BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 333        Gd       TA           No          GLQ       1124         <NA>
## 949        Gd       TA         <NA>          Unf          0          Unf
## 1488       Gd       TA         <NA>          Unf          0          Unf
## 2041       Gd     <NA>           Mn          GLQ       1044          Rec
## 2186       TA     <NA>           No          BLQ       1033          Unf
## 2218     <NA>       Fa           No          Unf          0          Unf
## 2219     <NA>       TA           No          Unf          0          Unf
## 2349       Gd       TA         <NA>          Unf          0          Unf
## 2525       TA     <NA>           Av          ALQ        755          Unf
##      BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
## 333         479      1603        3206            1            0
## 949           0       936         936            0            0
## 1488          0      1595        1595            0            0
## 2041        382         0        1426            1            0
## 2186          0        94        1127            0            1
## 2218          0       173         173            0            0
## 2219          0       356         356            0            0
## 2349          0       725         725            0            0
## 2525          0       240         995            0            0

Reading the description of each basement feature, we can see that:

  • Basement finished, unfinished, and total square feet are related in the following formula: TotalBsmtSF = BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF

  • BsmtFinType1 and BsmtFinSF1 might be related. We may predict rating of basement finished area (type 1) from its finished square feet.

  • BsmtFinType2 and BsmtFinSF2 might be related. We may predict rating of basement finished area (type 2) from its finished square feet.

Let’s first make sure that all basement areas are correct according to the abovementioned formula.

sum( allSet$TotalBsmtSF != (allSet$BsmtFinSF1 + allSet$BsmtFinSF2 + allSet$BsmtUnfSF))
## [1] 0

All basement areas numbers are correct and according to the formula.

We have one house (333) whose BsmtFinType2 value is missing while its BsmtFinSF2 is 479 square feet. Let’s see the distribution of all BsmtFinType2 values versus BsmtFinSF2.

# check all houses having non-zero BsmtFinSF2 versus their BsmtFinType2
missFinSF2 <- bsmtNa$BsmtFinSF2[is.na(bsmtNa$BsmtFinType2)]
bsmt2 <- allSet[allSet$BsmtFinSF2 >0 & !is.na(allSet$BsmtFinType2), c("BsmtFinSF2", "BsmtFinType2")]
bsmt2 %>% ggplot(aes(x = BsmtFinType2, y = BsmtFinSF2))+
                geom_boxplot()+
                geom_hline(yintercept = missFinSF2, color = "red")

# calculate medians of finished square feet for each bsmt finish type 2
medSF <- bsmt2 %>% 
        group_by(BsmtFinType2) %>% 
        summarize(median = median(BsmtFinSF2))
medSF
## # A tibble: 6 x 2
##   BsmtFinType2 median
##   <chr>         <dbl>
## 1 ALQ          553   
## 2 BLQ          294   
## 3 GLQ          691   
## 4 LwQ          263   
## 5 Rec          308   
## 6 Unf            6.00

Let’s impute the missing rating of basement finished area (type2) based on the nearest median of square feet for the available type2 ratings.

rIndx <- which(is.na(allSet$BsmtFinType2))

for(i in rIndx){
        bsmtVal <- medSF[which((abs(medSF$median - allSet$BsmtFinSF2[i])) == min(abs(medSF$median - allSet$BsmtFinSF2[i]), na.rm = T)),"BsmtFinType2"]
         allSet$BsmtFinType2[i] <- bsmtVal[[1]]
}

For basement exposure (BsmtExposure), we have 3 missing values. Let’s see if there’s any relationship between basement types and exposure.

allSet[!is.na(allSet$BsmtExposure), c("BsmtExposure","BsmtFinType1")] %>% 
        ggplot(aes(x = BsmtExposure, fill = BsmtFinType1))+
        geom_histogram(stat = "count")

From the histogram above, we can see that the majority of houses have no basement exposure, and the majority of the no-basement exposure houses are Unf basement finish type, which match the 3 missing values of basement exposure. So, we’ll impute the missing values with “No”.

allSet$BsmtExposure[is.na(allSet$BsmtExposure)] <- "No"

Now, we are left with missing values in basement quality and basement condition. Let’s see if there’s some relationship between basement quality (which refers to its height) and the exposure.

allSet[!is.na(allSet$BsmtQual), c("BsmtQual","BsmtExposure")] %>% 
        ggplot(aes(x = BsmtQual, fill = BsmtExposure))+
        geom_histogram(stat = "count")

It seems that most of the No-exposure basements have either a TA or a Gd Quality with higher frequency towards TA. So, we’ll take TA as the value of the missing basement quality houses.

allSet$BsmtQual[is.na(allSet$BsmtQual)] <- "TA"

Now, let’s see the relationship between basement quality and condition.

allSet[!is.na(allSet$BsmtQual) & !is.na(allSet$BsmtCond), c("BsmtQual","BsmtCond")] %>% 
        ggplot(aes(x = BsmtQual, fill = BsmtCond))+
        geom_histogram(stat = "count")

It’s obvious from the plot that the majority of TA and Gd quality have TA basement condition. So, we’ll use TA basement condition for the missing values.

allSet$BsmtCond[is.na(allSet$BsmtCond)] <- "TA"

By finishing the imputation process for all missing values of all features we are ready to move to the next step of Feature Engineering. Let’s check if any other missing values are still there in the data set.

checkNAs(allSet)
## SalePrice 
##      1459

Aside from the SalePrice response feature, which is expected to have missing values in the original test set, no remaining missing values are observed. Therefore, we can move to the next step.

5 Feature Engineering

Feature Engineering simply refers to creating new features from the existing ones to improve model performance. The following sub-sections handle multiple feature engineering tasks that would make training the model possible, easier, and more accurate in prediction.

5.1 Log transformation

It is a good practice to transform highly skewed numerical variables into their log values. This helps show the relative change in variable values rather than the absolute change for variables not showing normal distribution. Most researchers use the range of -2 to +2 as the acceptable limits of skewness to decide on a variable normality. Hence, we will use this range to log transform values.

# create a new copy of the master data set (allSet)
allSetNew <- allSet

# get classes of features in the data set
featureClasses <- sapply(names(allSetNew), function(x){class(allSetNew[[x]])})

# get numeric or integer class features
numFeatures <- names(featureClasses[featureClasses == "numeric" | featureClasses == "integer"])

# get character class features
charFeatures <- names(featureClasses[featureClasses == "character"])

# determine skewness of each numeric feature
skewedVals <- sapply(numFeatures, function(x){skewness(allSetNew[[x]],na.rm = T)})

# identify skewed features with threshold of -2,+2
skewedFeatures <- skewedVals[skewedVals < -2 | skewedVals > 2]

# log-transform skewed features
for (i in names(skewedFeatures)) {
        allSetNew[[i]] <- log(allSetNew[[i]] + 1)
}

5.2 Convert character features into dummy variables

Since we will use the glmment function to build regression models, we need to convert character features into dummy variables since the glmnet function accepts numeric/quantitative variables.

# create a full set of dummy variables for the character features
dummies <- dummyVars(~., allSetNew[charFeatures])
dummyVariables <- predict(dummies, allSetNew[charFeatures])

# compile the data set again by combining both numerical and dummy variables features
allSetNew <- cbind(allSetNew[numFeatures], dummyVariables)

6 Model Training and Testing

Now, having all features processed and engineered, we can start the model training then testing with all existing features (except the Id) as model predictors.

First, we will re-split the full data set into its original train and test sets.

salesPriceNA <- which(is.na(allSetNew["SalePrice"]))
train <- allSetNew[-salesPriceNA,]
test <- allSetNew[salesPriceNA,]

6.1 Using Shrinkage Models

Since we are dealing with high-dimensional data set, we will use the Shrinkage Models (e.g. Ridge and Lasso) to obtain the best model with the least effect of predictors variance and colinearity. Ridge and Lasso methods penalize the model coefficient estimates, using Lambda tuning parameter, to reduce their variance which result in a better model fitting.

6.1.1 Ridge Regression

Ridge regression performs shrinkage without exclusion of predictors. To use the glmnet function we need to have a matrix of all predictors and a vector of the response.

# convert the train and test data sets into matrices excluding the Id and SalePrice features
train.Matrix =as.matrix(train[,names(train) != c("Id","SalePrice")])
test.Matrix = as.matrix(test[,names(test) != c("Id","SalePrice")])

# create a vector of the log-transformed response (SalePrice)
train.y = log(train$SalePrice + 1)

Fit the Ridge Regression using the alpha argument = 0. We will implement the function over a grid of values ranging from λ = 10^10 to λ = 10^−2, essentially covering the full range of scenarios from the null model containing only the intercept to the least squares fit model.

set.seed(4)
# create a grid of lambda values
grid = 10^seq(10,-2, length = 100)

# train the Ridge Regression model (alpha = 0) using the grid of selected lambdas
ridge.mod = glmnet(train.Matrix,train.y,alpha = 0, lambda = grid)
dim(coef(ridge.mod))
## [1] 303 100

We have 303 regression coefficients with 100 lambda values. We will split the train data set into training and testing subsets to estimate the test error.

set.seed(5)
# split the train data set into training and testing data sets (75/25 ratio).
# number of training rows
nTrain <- round(0.75 * nrow(train.Matrix))

# sample row IDs
sampleTrain <- sample(nrow(train.Matrix),nTrain)

# create training and testing data sets
training <- train.Matrix[sampleTrain,]
testing <- train.Matrix[-sampleTrain,]
training.y <- train.y[sampleTrain]
testing.y <- train.y[-sampleTrain]

We will use the built-in cross-validation function cv.glmnet() to choose the best tuning parameter (lambda) through cross validation in the training data.

set.seed(1)
cv.out <- cv.glmnet(training, training.y, alpha = 0)
plot(cv.out)

ridgeBestLambda <- cv.out$lambda.min
ridgeBestLambda;log(ridgeBestLambda)
## [1] 0.2249694
## [1] -1.491791

The best lambda value that results in the smallest cross validation error is 0.2249694. Let’s see the test RMSE associated with this value of lambda.

# predict the response (SalePrice) in the testing data using the best lambda
ridge.predict <- predict(ridge.mod,s = ridgeBestLambda, newx = testing)
ridge.rmse <- sqrt(mean((ridge.predict-testing.y)^2))
ridge.rmse
## [1] 0.09866289

So, the test error (RMSE) resulted from the Ridge Regression model on the testing data is 0.0986629.

(Note: remember that the testing data is the testing partition from the original train data set)

6.1.2 Lasso Regression

Let’s see if the Lasso regression can yield a better RMSE than that of the Ridge. We will use the glmnet() function but with alpha = 1.

set.seed(5)
lasso.mod = glmnet(train.Matrix,train.y,alpha = 1, lambda = grid)
plot(lasso.mod)

We can see from the coefficient plot that depending on the choice of tuning parameter, some of the coefficients will be exactly equal to zero. Let’s perform cross validation and see the test error.

set.seed(1)
cv.out <- cv.glmnet(training, training.y, alpha = 1)
plot(cv.out)

lassoBestlambda <- cv.out$lambda.min
lassoBestlambda;log(lassoBestlambda)
## [1] 0.00583801
## [1] -5.143365

The best lambda value that results in the smallest cross validation error is 0.005838. Let’s see the test RMSE associated with this value of lambda.

lasso.predict <- predict(lasso.mod,s = lassoBestlambda, newx = testing)
lasso.rmse <- sqrt(mean((lasso.predict-testing.y)^2))
lasso.rmse
## [1] 0.110181

So, the RMSE resulted from the Lasso Regression model is 0.110181 which is slightly higher than that of the Ridge Regression.

lasso.coef <- predict(lasso.mod,type = "coefficients", s = lassoBestlambda)[1:303,]
names(lasso.coef[lasso.coef !=0])
##  [1] "(Intercept)"          "LotArea"              "OverallQual"         
##  [4] "OverallCond"          "YearBuilt"            "YearRemodAdd"        
##  [7] "BsmtFinSF1"           "TotalBsmtSF"          "GrLivArea"           
## [10] "BsmtFullBath"         "KitchenAbvGr"         "Fireplaces"          
## [13] "GarageCars"           "GarageArea"           "WoodDeckSF"          
## [16] "OpenPorchSF"          "ScreenPorch"          "MSZoningC (all)"     
## [19] "MSZoningRM"           "NeighborhoodCrawfor"  "NeighborhoodEdwards" 
## [22] "NeighborhoodNridgHt"  "NeighborhoodSomerst"  "NeighborhoodStoneBr" 
## [25] "Condition1Artery"     "Condition1Norm"       "Condition2PosN"      
## [28] "RoofMatlClyTile"      "Exterior1stBrkComm"   "Exterior1stBrkFace"  
## [31] "ExterQualTA"          "ExterCondFa"          "FoundationPConc"     
## [34] "BsmtQualEx"           "BsmtCondFa"           "BsmtExposureGd"      
## [37] "BsmtFinType1GLQ"      "BsmtFinType1Unf"      "HeatingGrav"         
## [40] "HeatingQCEx"          "CentralAirN"          "KitchenQualEx"       
## [43] "KitchenQualTA"        "FunctionalMaj2"       "FunctionalSev"       
## [46] "FunctionalTyp"        "FireplaceQuNone"      "GarageTypeAttchd"    
## [49] "GarageFinishUnf"      "GarageCondTA"         "PavedDriveN"         
## [52] "PavedDriveY"          "SaleTypeNew"          "SaleConditionAbnorml"

Number of coefficients that the Lasso Regression picked (not zero) is 54.

6.1.3 Gradient Boosting Machine (GBM) Algorithm

Now, we will try the GBM algorithm. First, let’s split the train and test data sets, then split the train data into training and testing data sets to validate the model.

set.seed(5)
# re-split the combined data into train and test data sets
salesPriceNA <- which(is.na(allSetNew["SalePrice"]))
train <- allSetNew[-salesPriceNA,]
test <- allSetNew[salesPriceNA,]
 
# convert the response in the train data into log
train$SalePrice <- log(train$SalePrice + 1)
 
# split the train data set into training and testing data sets (75/25 ratio)
# to validate the model

# number of training rows
nTrain <- round(0.75 * nrow(train))

# sample row IDs
sampleTrain <- sample(nrow(train),nTrain)

# create training and testing data sets
training <- train[sampleTrain,!names(train) == "Id"]
testing <- train[-sampleTrain,!names(train) == "Id"]
 
testing.y <- testing$SalePrice

Build the GBM model.

set.seed(2)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 5, verboseIter = FALSE)
 
gbm.mod <- train(SalePrice~., data = training, method = "gbm", trControl = control, verbose = FALSE)
 
gbm.mod
## Stochastic Gradient Boosting 
## 
## 1095 samples
##  302 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 985, 984, 985, 985, 985, 986, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE       Rsquared   MAE       
##   1                   50      0.1754895  0.8277985  0.12294897
##   1                  100      0.1491191  0.8595910  0.10367975
##   1                  150      0.1427706  0.8688748  0.09846921
##   2                   50      0.1536606  0.8551644  0.10579030
##   2                  100      0.1399041  0.8737287  0.09511092
##   2                  150      0.1364623  0.8796950  0.09257943
##   3                   50      0.1453491  0.8670813  0.09885915
##   3                  100      0.1357116  0.8811948  0.09180459
##   3                  150      0.1335058  0.8851087  0.09033930
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

Let’s see the test error (RMSE) by applying the GBM model to the testing data set.

gbm.predict <- predict(gbm.mod,newdata = testing)
gbm.rmse <- sqrt(mean((gbm.predict-testing.y)^2))
gbm.rmse
## [1] 0.1203715

GBM model resulted in a worse prediction than that of both Ridge and Lasso as indicated by its high RMSE value.

6.1.4 Linear Model with Forward Stepwise

Forward Stepwise is a Subset Selection method that begins with a model containing no predictors (null model), and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.

# use the lm Regression model to predict the SalePrice. Use Forward Stepwise algorithm to select the best predictors.
 
set.seed(4)
# build the null model with no predictors
null_model <- lm(SalePrice~1, data = training)
 
# build the full model with all predictors
full_model <- lm(SalePrice~., data = training)
 
# perform forward stepwise algorithm to get an economic model with best predictors
step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")

Let’s see the test error (RMSE) by applying the Step Wise model to the testing data set.

step.predict <- predict(step_model,newdata = testing)
step.rmse <- sqrt(mean((step.predict-testing.y)^2))
step.rmse
## [1] 0.1472766

The RMSE resulted from the Linear model with Forward Stepwise is 0.1472766.

7 Predictions on test data set

By now we have trained and tested four models. We will use each one to predict the Sale Price response variable in the original test data set.

# predict using Ridge
test.predict.ridge <- exp(predict(ridge.mod,s = ridgeBestLambda,newx = test.Matrix))-1
 
# predict using lasso
test.predict.lasso <- exp(predict(lasso.mod, s = lassoBestlambda, newx = test.Matrix))-1
 
# predict using gbm
test.predict.gbm <- exp(predict(gbm.mod,newdata = test)) - 1
 
# predict using linear step wise
test.predict.step <- exp(predict(step_model,newdata = test)) - 1

8 Models Ensembling

In order to get more robust predictions that incorporate predictions of the previous four models we will perform the Model Ensemble technique using the Weighted Average method.

We will assign the weights based on each model’s RMSE value. Hence, the models’ weights can be assumed as shown below:

ensmb.df <- data.frame(Model = c("ridge", "lasso", "gbm", "step"), RMSE = c(ridge.rmse,lasso.rmse,gbm.rmse,step.rmse), Weight = c(40,30,15,15))

ensmb.df
##   Model       RMSE Weight
## 1 ridge 0.09866289     40
## 2 lasso 0.11018098     30
## 3   gbm 0.12037154     15
## 4  step 0.14727656     15

9 Writing the solution to a csv file

Finally, we will predict the Sale Price in the original test data set using the ensembled models with the weights defined above.

# construct data frame for the ensembled solution
solution <- data.frame(Id = as.integer(rownames(test)),SalePrice =  as.numeric(test.predict.ridge*.4 + test.predict.lasso*.3 + test.predict.gbm*.15 + test.predict.step*.15))

write.csv(solution,"ensemble_sol.csv",row.names=FALSE)