1 Introduction

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. This Kaggle competition’s dataset proves that there are many more house features that influence price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this research tries to predict the final price of each home.

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often-cited Boston Housing dataset.

2 Executive Summary

This research aims at predicting prices of residential homes in Ames, Iowa based on a large set of features. It starts by acquiring and defining input data, data cleaning, feature engineering, and finally by model training, selection, and prediction. The research predicts house prices by combining results of several Machine Learning algorithms using Models Ensembling technique.

Four algorithms have been used for model training and prediction; Ridge Regression, Lasso Regression, Gradient Boosting, and Linear Model with Forward Stepwise. The Root Mean Square Error (RMSE) is used to measure the prediction error for each. Model RMSE’s have been found to be 0.0986, 0.110, 0.120, and 0.147 for the four mentioned models, respectively. Weighted Average Ensembling has been used to predict the house prices in the Kaggle test file.

Several R packages are used in the research. Packages cover data visualization, data manipulation, and model training and prediction.

3 Research Data

Data of the research has been obtained from Kaggle and is available under this link.

The data is split into two data sets:

train set (train.csv) which will be used to train the prediction models. Since the available test set does not contain the ground truth of the price of each house, this training set will be split into training and testing sets. The temporary training set will be used to train prediction models, whereas the temporary testing set will be used to validate model performance and measure test error. Hence, ranking of models in terms of lower RMSE is based on the testing data set.
test set (test.csv) in which the ground truth response (SalePrice) is not provided. The research aims at predicting the response in this set. For each house in the test set, models predicted values will be ensembled to give the house sale price value.

3.1 R Libraries

The research uses the libraries loaded below:

library(knitr)
library(dplyr)
library(gridExtra)
library(ggplot2)
library(ggthemes)
library(cowplot)
library(moments)
library(caret)
library(glmnet)

3.2 Importing Kaggle data sets

Kaggle train and test data sets are imported into R. Then, both sets are combined into one set (allSet) for Data Cleaning and Feature Engineering before being used in model training and prediction.

train <- read.csv("./input/train.csv", stringsAsFactors = F)
test <- read.csv("./input/test.csv", stringsAsFactors = F)
allSet <- bind_rows(train,test)

3.3 Data dimensions and structure

trainDim <- dim(train)
testDim <- dim(test)
trainR <- trainDim[1];testR <- testDim[1]
trainC <- trainDim[2];testC <- testDim[2]

By having a look at the dimensions of the train and test data, we see that they have 1460 and 1459 observations (houses) and 81 and 80 features, respectively.

We can also have a glimpse of the data structure. It is clear that we have features that are integers and others as characters. It’s worth noticing that the outcome variable (SalePrice) is integer.

glimpse(allSet)

## Observations: 2,919
## Variables: 81
## $ Id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ MSSubClass    <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ MSZoning      <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", ...
## $ LotFrontage   <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, ...
## $ LotArea       <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ Street        <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", ...
## $ Alley         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LotShape      <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg",...
## $ LandContour   <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl",...
## $ Utilities     <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub"...
## $ LotConfig     <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Ins...
## $ LandSlope     <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl",...
## $ Neighborhood  <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoR...
## $ Condition1    <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm",...
## $ Condition2    <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", ...
## $ BldgType      <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", ...
## $ HouseStyle    <chr> "2Story", "1Story", "2Story", "2Story", "2Story"...
## $ OverallQual   <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond   <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt     <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd  <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ RoofStyle     <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Ga...
## $ RoofMatl      <chr> "CompShg", "CompShg", "CompShg", "CompShg", "Com...
## $ Exterior1st   <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "Vin...
## $ Exterior2nd   <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "Vin...
## $ MasVnrType    <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace",...
## $ MasVnrArea    <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ ExterQual     <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", ...
## $ ExterCond     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", ...
## $ Foundation    <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "...
## $ BsmtQual      <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", ...
## $ BsmtCond      <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", ...
## $ BsmtExposure  <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", ...
## $ BsmtFinType1  <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ",...
## $ BsmtFinSF1    <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinType2  <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf",...
## $ BsmtFinSF2    <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF     <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF   <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ Heating       <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", ...
## $ HeatingQC     <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", ...
## $ CentralAir    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"...
## $ Electrical    <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SB...
## $ X1stFlrSF     <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ X2ndFlrSF     <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea     <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ BsmtFullBath  <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, ...
## $ BsmtHalfBath  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FullBath      <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, ...
## $ HalfBath      <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ BedroomAbvGr  <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, ...
## $ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, ...
## $ KitchenQual   <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", ...
## $ TotRmsAbvGrd  <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ Functional    <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ",...
## $ Fireplaces    <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, ...
## $ FireplaceQu   <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA"...
## $ GarageType    <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd"...
## $ GarageYrBlt   <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, ...
## $ GarageFinish  <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn",...
## $ GarageCars    <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea    <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ GarageQual    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", ...
## $ GarageCond    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", ...
## $ PavedDrive    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"...
## $ WoodDeckSF    <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF   <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PoolQC        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fence         <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA,...
## $ MiscFeature   <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, ...
## $ MiscVal       <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold        <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold        <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleType      <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", ...
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal...
## $ SalePrice     <int> 208500, 181500, 223500, 140000, 250000, 143000, ...

3.4 Data dictionary

Below is a description of each of the features existing in the data sets to give insight of what information we have about houses.

Feature	Description
SalePrice	the property’s sale price in dollars. This is the target variable that you’re trying to predict.
MSSubClass	The building class
MSZoning	The general zoning classification
LotFrontage	Linear feet of street connected to property
LotArea	Lot size in square feet
Street	Type of road access
Alley	Type of alley access
LotShape	General shape of property
LandContour	Flatness of the property
Utilities	Type of utilities available
LotConfig	Lot configuration
LandSlope	Slope of property
Neighborhood	Physical locations within Ames city limits
Condition1	Proximity to main road or railroad
Condition2	Proximity to main road or railroad (if a second is present)
BldgType	Type of dwelling
HouseStyle	Style of dwelling
OverallQual	Overall material and finish quality
OverallCond	Overall condition rating
YearBuilt	Original construction date
YearRemodAdd	Remodel date
RoofStyle	Type of roof
RoofMatl	Roof material
Exterior1st	Exterior covering on house
Exterior2nd	Exterior covering on house (if more than one material)
MasVnrType	Masonry veneer type
MasVnrArea	Masonry veneer area in square feet
ExterQual	Exterior material quality
ExterCond	Present condition of the material on the exterior
Foundation	Type of foundation
BsmtQual	Height of the basement
BsmtCond	General condition of the basement
BsmtExposure	Walkout or garden level basement walls
BsmtFinType1	Quality of basement finished area
BsmtFinSF1	Type 1 finished square feet
BsmtFinType2	Quality of second finished area (if present)
BsmtFinSF2	Type 2 finished square feet
BsmtUnfSF	Unfinished square feet of basement area
TotalBsmtSF	Total square feet of basement area
Heating	Type of heating
HeatingQC	Heating quality and condition
CentralAir	Central air conditioning
Electrical	Electrical system
1stFlrSF	First Floor square feet
2ndFlrSF	Second floor square feet
LowQualFinSF	Low quality finished square feet (all floors)
GrLivArea	Above grade (ground) living area square feet
BsmtFullBath	Basement full bathrooms
BsmtHalfBath	Basement half bathrooms
FullBath	Full bathrooms above grade
HalfBath	Half baths above grade
Bedroom	Number of bedrooms above basement level
Kitchen	Number of kitchens
KitchenQual	Kitchen quality
TotRmsAbvGrd	Total rooms above grade (does not include bathrooms)
Functional	Home functionality rating
Fireplaces	Number of fireplaces
FireplaceQu	Fireplace quality
GarageType	Garage location
GarageYrBlt	Year garage was built
GarageFinish	Interior finish of the garage
GarageCars	Size of garage in car capacity
GarageArea	Size of garage in square feet
GarageQual	Garage quality
GarageCond	Garage condition
PavedDrive	Paved driveway
WoodDeckSF	Wood deck area in square feet
OpenPorchSF	Open porch area in square feet
EnclosedPorch	Enclosed porch area in square feet
3SsnPorch	Three season porch area in square feet
ScreenPorch	Screen porch area in square feet
PoolArea	Pool area in square feet
PoolQC	Pool quality
Fence	Fence quality
MiscFeature	Miscellaneous feature not covered in other categories
MiscVal	$Value of miscellaneous feature
MoSold	Month Sold
YrSold	Year Sold
SaleType	Type of sale
SaleCondition	Condition of sale

4 Data Cleaning

By exploring the data set, as shown in the below subset, we can realize that the data needs to be processed and cleaned. Data cleaning involves handling missing values, fixing features classes, or coding character features.

kable(head(allSet))

Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	Condition1	Condition2	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	RoofMatl	Exterior1st	Exterior2nd	MasVnrType	MasVnrArea	ExterQual	ExterCond	Foundation	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtUnfSF	TotalBsmtSF	Heating	HeatingQC	CentralAir	Electrical	X1stFlrSF	X2ndFlrSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Functional	Fireplaces	FireplaceQu	GarageType	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	PavedDrive	WoodDeckSF	OpenPorchSF	EnclosedPorch	X3SsnPorch	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
1	60	RL	65	8450	Pave	NA	Reg	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2003	2003	Gable	CompShg	VinylSd	VinylSd	BrkFace	196	Gd	TA	PConc	Gd	TA	No	GLQ	706	Unf	150	856	GasA	Ex	Y	SBrkr	856	854	1710	1	0	2	1	3	1	Gd	8	Typ	0	NA	Attchd	2003	RFn	2	548	TA	TA	Y	0	61	0	0	NA	NA	NA	0	2	2008	WD	Normal	208500
2	20	RL	80	9600	Pave	NA	Reg	Lvl	AllPub	FR2	Gtl	Veenker	Feedr	Norm	1Fam	1Story	6	8	1976	1976	Gable	CompShg	MetalSd	MetalSd	None	0	TA	TA	CBlock	Gd	TA	Gd	ALQ	978	Unf	284	1262	GasA	Ex	Y	SBrkr	1262	0	1262	0	1	2	0	3	1	TA	6	Typ	1	TA	Attchd	1976	RFn	2	460	TA	TA	Y	298	0	0	0	NA	NA	NA	0	5	2007	WD	Normal	181500
3	60	RL	68	11250	Pave	NA	IR1	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2001	2002	Gable	CompShg	VinylSd	VinylSd	BrkFace	162	Gd	TA	PConc	Gd	TA	Mn	GLQ	486	Unf	434	920	GasA	Ex	Y	SBrkr	920	866	1786	1	0	2	1	3	1	Gd	6	Typ	1	TA	Attchd	2001	RFn	2	608	TA	TA	Y	0	42	0	0	NA	NA	NA	0	9	2008	WD	Normal	223500
4	70	RL	60	9550	Pave	NA	IR1	Lvl	AllPub	Corner	Gtl	Crawfor	Norm	Norm	1Fam	2Story	7	5	1915	1970	Gable	CompShg	Wd Sdng	Wd Shng	None	0	TA	TA	BrkTil	TA	Gd	No	ALQ	216	Unf	540	756	GasA	Gd	Y	SBrkr	961	756	1717	1	0	1	0	3	1	Gd	7	Typ	1	Gd	Detchd	1998	Unf	3	642	TA	TA	Y	0	35	272	0	NA	NA	NA	0	2	2006	WD	Abnorml	140000
5	60	RL	84	14260	Pave	NA	IR1	Lvl	AllPub	FR2	Gtl	NoRidge	Norm	Norm	1Fam	2Story	8	5	2000	2000	Gable	CompShg	VinylSd	VinylSd	BrkFace	350	Gd	TA	PConc	Gd	TA	Av	GLQ	655	Unf	490	1145	GasA	Ex	Y	SBrkr	1145	1053	2198	1	0	2	1	4	1	Gd	9	Typ	1	TA	Attchd	2000	RFn	3	836	TA	TA	Y	192	84	0	0	NA	NA	NA	0	12	2008	WD	Normal	250000
6	50	RL	85	14115	Pave	NA	IR1	Lvl	AllPub	Inside	Gtl	Mitchel	Norm	Norm	1Fam	1.5Fin	5	5	1993	1995	Gable	CompShg	VinylSd	VinylSd	None	0	TA	TA	Wood	Gd	TA	No	GLQ	732	Unf	64	796	GasA	Ex	Y	SBrkr	796	566	1362	1	0	1	1	1	1	TA	5	Typ	0	NA	Attchd	1993	Unf	2	480	TA	TA	Y	40	30	0	320	NA	MnPrv	Shed	700	10	2009	WD	Normal	143000

4.1 Imputing missing values

Missing values in data sets are problematic in data analysis, model training and prediction. Hence, we will impute missing values before going further in model training.

Let’s first define a generic function to check for missing values.

# define a function to check NAs as we will be using it frequenty to check
# what NAs are left after imputation
checkNAs <- function(df){
        # identify columns (by column number) with missing values
        naCols <- which(colSums(is.na(df))>0)
        # get columns with missing values sorted by number of missing values
        sort(colSums(sapply(allSet[naCols],is.na)), decreasing = TRUE)
}

checkNAs(allSet)

##       PoolQC  MiscFeature        Alley        Fence    SalePrice 
##         2909         2814         2721         2348         1459 
##  FireplaceQu  LotFrontage  GarageYrBlt GarageFinish   GarageQual 
##         1420          486          159          159          159 
##   GarageCond   GarageType     BsmtCond BsmtExposure     BsmtQual 
##          159          157           82           82           81 
## BsmtFinType2 BsmtFinType1   MasVnrType   MasVnrArea     MSZoning 
##           80           79           24           23            4 
##    Utilities BsmtFullBath BsmtHalfBath   Functional  Exterior1st 
##            2            2            2            2            1 
##  Exterior2nd   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF 
##            1            1            1            1            1 
##   Electrical  KitchenQual   GarageCars   GarageArea     SaleType 
##            1            1            1            1            1

After thoroughly studying the features with missing values, we can classify the imputation process into:

Imputation based on most frequent values of the missing-value features
Imputation based on other features that help predict the missing-value features

4.1.1 Imputation based on most frequent values of the missing-value features

This type of imputation is used for features that cannot be predicted from any other features in the available data set. These features are:

Fence
Alley
MiscFeature
Utilities
Functional
Exterior1st
Exterior2nd

We will visualize the frequencies/distributions of these features to identify the most appropriate values to use in the imputation process.

# visualizing Fence
p1 <- allSet%>% 
        ggplot(aes(x = Fence))+
        geom_histogram(stat = "count")+
        geom_label(stat='count',aes(label=..count..))

# visualizing Alley
p2 <- allSet%>% 
        ggplot(aes(x = Alley))+
        geom_histogram(stat = "count")+
        geom_label(stat='count',aes(label=..count..))

# visualizing MiscFeature
p3 <- allSet%>% 
        ggplot(aes(x = MiscFeature))+
        geom_histogram(stat = "count")+
        geom_label(stat='count',aes(label=..count..))

# visualizing Utilities
p4 <- allSet%>% 
        ggplot(aes(x = Utilities))+
        geom_histogram(stat = "count")+
        geom_label(stat='count',aes(label=..count..))

# visualizing Functional
p5 <- allSet%>% 
        ggplot(aes(x = Functional))+
        geom_histogram(stat = "count")+
        geom_label(stat='count',aes(label=..count..))

# visualizing Exterior1st & Exterior2nd
p6 <- allSet%>% 
        ggplot(aes(x = Exterior1st))+
        geom_histogram(stat = "count")+
        geom_label(stat='count',aes(label=..count..))+
       coord_flip()
p7 <- allSet%>% 
        ggplot(aes(x = Exterior2nd))+
        geom_histogram(stat = "count")+
        geom_label(stat='count',aes(label=..count..))+
        coord_flip()

grid.arrange(p1,p2,p3,p4,p5, ncol = 2)

grid.arrange(p6,p7, ncol = 2)

Analysis and conclusion

Using the visualizations shown above, we can notice the following:

Fence feature: most of the houses have NA fence values which indicates a No Fence value according to the data_description.txt file. So, we’ll replace the NAs by None.
Alley feature: most of the houses have NA alley values which indicates a No alley access value according to the data_description.txt file. So, we’ll replace the NAs by None.
MiscFeature: most of the houses have NA MiscFeature values which indicates a no additional miscellaneous features of the house and cannot be predicted from available data. Hence, we’ll replace the NAs by None.
Utilities feature: almost all houses have AllPub utilities value. So, we’ll replace the NAs by AllPub.
Functional: we have only 2 missing values of the Functional feature. Hence, we’ll replace the missing values by the most common value (Typ).
Exterior1st and Exterior2nd: we have only one missing value for each feature. Hence, we’ll replace the missing values by the most common value (VinylSd for both)

Imputation

Based on the above analysis and conclusion, we will impute missing values for this set of features.

allSet$Fence[is.na(allSet$Fence)] <- "None"
allSet$Alley[is.na(allSet$Alley)] <- "None"
allSet$MiscFeature[is.na(allSet$MiscFeature)] <- "None"
allSet$Utilities[is.na(allSet$Utilities)] <- "AllPub"
allSet$Functional[is.na(allSet$Functional)] <- "Typ"
allSet$Exterior1st[is.na(allSet$Exterior1st)] <- "VinylSd"
allSet$Exterior2nd[is.na(allSet$Exterior2nd)] <- "VinylSd"

4.1.2 Imputation based on other features that help predict the missing-value features

This type of imputation is used for features that can be predicted from other features in the available data set. These features are:

FireplaceQu: We have two features of fireplace that can be related to each other; Fireplaces and FireplaceQu. If a fireplace exists, then its quality should be logged as well. So, let’s see how many missing values of FireplaceQu are there for non-missing Fireplaces values.

# check number of houses with missing (NA) Fireplace quality and zero fireplaces.
fpNoQul <- allSet[is.na(allSet$FireplaceQu) & allSet$Fireplaces == 0 ,c("Fireplaces","FireplaceQu")]

# such houses should have "None" values in the FireplaceQu feature.
allSet$FireplaceQu[is.na(allSet$FireplaceQu) & allSet$Fireplaces == 0] <- "None"

# check that houses with no fireplaces all have "None" value for FireplaceQu
fp <- allSet[allSet$Fireplaces == 0 ,c("Fireplaces","FireplaceQu")]

table(fp)

##           FireplaceQu
## Fireplaces None
##          0 1420

LotFrontage: The linear feet of street connected to property (LotFrontage) can be linked to the Neighborhood feature as houses within the same neighborhood tend to have similar lot frontages.

Let’s see medians of lot frontages for each neighborhood and use them to impute missing lot frontage values.

# get median of lot frontage for each neighborhood
medNgbrLotFr <- allSet[!is.na(allSet$Neighborhood), c("Neighborhood","LotFrontage")] %>% 
        group_by(Neighborhood) %>% 
        summarize(median = median(LotFrontage, na.rm = T))
medNgbrLotFr

## # A tibble: 25 x 2
##    Neighborhood median
##    <chr>         <dbl>
##  1 Blmngtn        43.0
##  2 Blueste        24.0
##  3 BrDale         21.0
##  4 BrkSide        51.0
##  5 ClearCr        80.5
##  6 CollgCr        70.0
##  7 Crawfor        70.0
##  8 Edwards        65.0
##  9 Gilbert        64.0
## 10 IDOTRR         60.0
## # ... with 15 more rows

# replace missing lot frontage values with the median of lot frontage with the same neighborhood
rIndx <- which(is.na(allSet$LotFrontage))

# run a vlookup-like loop to get median value for each missing value by linkage to neighborhood
for(i in rIndx){
        medVal <- medNgbrLotFr[medNgbrLotFr$Neighborhood == allSet$Neighborhood[i],"median"]
        allSet$LotFrontage[i] <- medVal[[1]]
}

MasVnrType and MasVnrArea: Both masonry features are related. If a house has a masonry veneer type, it should have its area recorded. On the other hand, if a house does not have a veener type and recorded area, it can be assumed not having a masonry veneer as one option available in the data_description.txt.

Let’s check the masonry veneer type and area (MasVnrType and MasVnrArea) missing values.

allSet[is.na(allSet$MasVnrType) | is.na(allSet$MasVnrArea), c("MasVnrType","MasVnrArea")]

##      MasVnrType MasVnrArea
## 235        <NA>         NA
## 530        <NA>         NA
## 651        <NA>         NA
## 937        <NA>         NA
## 974        <NA>         NA
## 978        <NA>         NA
## 1244       <NA>         NA
## 1279       <NA>         NA
## 1692       <NA>         NA
## 1707       <NA>         NA
## 1883       <NA>         NA
## 1993       <NA>         NA
## 2005       <NA>         NA
## 2042       <NA>         NA
## 2312       <NA>         NA
## 2326       <NA>         NA
## 2341       <NA>         NA
## 2350       <NA>         NA
## 2369       <NA>         NA
## 2593       <NA>         NA
## 2611       <NA>        198
## 2658       <NA>         NA
## 2687       <NA>         NA
## 2863       <NA>         NA

Out of 24 houses, we have 23 with missing values in both features. So, we’ll replace missing values where both are missing in the house by None in the MasVnrType and by zero in the MasVnrArea.

allSet$MasVnrType[is.na(allSet$MasVnrType) & is.na(allSet$MasVnrArea)] <- "None"
allSet$MasVnrArea[is.na(allSet$MasVnrArea)] <- 0

For the house with missing type but with recorded area, we will use the median of Masonry areas across the data to estimate its masonry type.

missArea <- allSet$MasVnrArea[is.na(allSet$MasVnrType)]
allSet[allSet$MasVnrType != "None", c("MasVnrType","MasVnrArea")] %>% 
        ggplot(aes(x = MasVnrType, y = MasVnrArea)) +
        geom_boxplot()+
        geom_hline(yintercept = missArea, color = "red")

medMason <- allSet[allSet$MasVnrType != "None", c("MasVnrType","MasVnrArea")] %>% 
        group_by(MasVnrType) %>% 
        summarize(median = median(MasVnrArea))
medMason

## # A tibble: 4 x 2
##   MasVnrType median
##   <chr>       <dbl>
## 1 BrkCmn        161
## 2 BrkFace       203
## 3 Stone         200
## 4 <NA>           NA

It’s obvious that the nearest masonry type to the missing values is Stone. Hence, we will replace the missing value by Stone.

rIndx <- which(is.na(allSet$MasVnrType))

for(i in rIndx){
        medVal <- medMason[which((abs(medMason$median - allSet$MasVnrArea[i])) == min(abs(medMason$median - allSet$MasVnrArea[i]), na.rm = T)),"MasVnrType"]
         allSet$MasVnrType[i] <- medVal[[1]]
}

MSZoning: The zoning classification of the sale (MSZoning) can be linked to the type of dwelling involved in the sale (MSSubClass); i.e. house type.

allSet[is.na(allSet$MSZoning), c("MSSubClass","MSZoning")]

##      MSSubClass MSZoning
## 1916         30     <NA>
## 2217         20     <NA>
## 2251         70     <NA>
## 2905         20     <NA>

# distribution of house type versus its zoning classification 
# for the three dwelling types of the missing zoning values
missZoning <- unique(allSet$MSSubClass[is.na(allSet$MSZoning)])
allSet[!is.na(allSet$MSZoning) & allSet$MSSubClass %in% missZoning, c("MSZoning","MSSubClass")] %>% 
        ggplot(aes(x = MSZoning, fill = factor(MSSubClass)))+
        geom_histogram(stat = "count")

From the histogram above we can see that house types of 70 and 30 can have a zoning classification of RM while the 20 type is of zoning RL. So, the values will be imputed accordingly.

allSet$MSZoning[is.na(allSet$MSZoning) & allSet$MSSubClass %in% c(70,30)] <- "RM"
allSet$MSZoning[is.na(allSet$MSZoning) & allSet$MSSubClass == 20] <- "RL"

SaleType: Sale Type may be linked to Sale Condition (SaleCondition).

allSet[is.na(allSet$SaleType), c("SaleType","SaleCondition")]

##      SaleType SaleCondition
## 2490     <NA>        Normal

We have one house with missing Sale Type value having a Normal Sale Condition. Let’s explore the relationship between both feautres.

allSet[!is.na(allSet$SaleType) & !is.na(allSet$SaleCondition), c("SaleType","SaleCondition")] %>% 
        ggplot(aes(x = SaleType, fill = factor(SaleCondition)))+
        geom_histogram(stat = "count")+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

We can see that most Normal SaleCondition have a WD SaleType. Hence, we’ll impute the missing value with the same.

allSet$SaleType[is.na(allSet$SaleType)] <- "WD"

Electrical: This feature can be directly related to the overall condition of the house (OverallCond). Let’s see how they are related.

allSet[is.na(allSet$Electrical), c("OverallCond","Electrical")]

##      OverallCond Electrical
## 1380           5       <NA>

We have 1 house with missing Electrical value having a 5 rate of Overall Condition. Let’s explore the relationship between both features.

allSet[!is.na(allSet$Electrical), c("OverallCond","Electrical")] %>% 
        ggplot(aes(x = factor(OverallCond), fill = factor(Electrical)))+
        geom_histogram(stat = "count")

Most of 5-rating Overall Condition houses have SBrkr electrical system. Hence, we’ll impute the missing value.

allSet$Electrical[is.na(allSet$Electrical)] <- "SBrkr"

KitchenQual: We will use the Kitchens Above Grade feature (KitchenAbvGr) to guide us to the missing values in kitchen quality. Let’s see the relationship.

allSet[is.na(allSet$KitchenQual), c("KitchenAbvGr","KitchenQual")]

##      KitchenAbvGr KitchenQual
## 1556            1        <NA>

allSet[!is.na(allSet$KitchenQual), c("KitchenAbvGr","KitchenQual")] %>% 
        ggplot(aes(x = KitchenAbvGr, fill = factor(KitchenQual)))+
        geom_histogram(stat = "count")

For a KitchenAbvGr of 1, the most likely value for a KitchenQual is TA. Hence, we’ll impute the missing value with TA.

allSet$KitchenQual[is.na(allSet$KitchenQual)] <- "TA"

PoolQC: Since Pool Quality can be detected from Pool Area, it can be assumed that if a pool area is zero, then the house has no pool. Hence, missing Pool Quality can be imputed with None. Let’s see the count of houses with missing (NA) PoolQC values.

allSet[is.na(allSet$PoolQC),c("PoolQC","PoolArea")] %>% 
        group_by(PoolArea) %>% 
        summarize(count = n())

## # A tibble: 4 x 2
##   PoolArea count
##      <int> <int>
## 1        0  2906
## 2      368     1
## 3      444     1
## 4      561     1

Out of the 2909 houses with missing PoolQC values, we have 2906 with zero PoolArea. This lets us assume with high confidence that houses with zero PoolAreas do not have pools; hence, these PoolQC values can be imputed with None.

allSet$PoolQC[allSet$PoolArea == 0] <- "None"

However, we have three houses with missing PoolQC values while having non-zero PoolAreas.

allSet[is.na(allSet$PoolQC) & allSet$PoolArea >0,c("PoolQC","PoolArea")]

##      PoolQC PoolArea
## 2421   <NA>      368
## 2504   <NA>      444
## 2600   <NA>      561

Since PoolArea is a likely predictor to PoolQC, we can impute the missing PoolQC values with corresponding values of other houses having means of PoolAreas near to those of the missing PoolQC values.

# calculate the means of PoolAreas for each of the PoolQC in the data set
meanArea <- allSet[!is.na(allSet$PoolQC),c("PoolQC","PoolArea")] %>% 
        group_by(PoolQC) %>% 
        summarize(AreaMean = round(mean(PoolArea),0))
meanArea

## # A tibble: 4 x 2
##   PoolQC AreaMean
##   <chr>     <dbl>
## 1 Ex          360
## 2 Fa          584
## 3 Gd          648
## 4 None          0

Impute missing PoolQC values with the values of nearest PoolArea means.

rIndx <- which(is.na(allSet$PoolQC) & allSet$PoolArea >0)

for(i in rIndx){
        poolQc <- meanArea[which((abs(meanArea$AreaMean - allSet$PoolArea[i])) == min(abs(meanArea$AreaMean - allSet$PoolArea[i]), na.rm = T)),"PoolQC"]
        allSet$PoolQC[i] <- poolQc[[1]]
}

Garage features: By checking the Garage-related features with missing values, we notice that we have 7 features all related to garage and have maximum number of missing values of 159 in GarageYrBlt column.

# identify all Garage features as those starting with Garage text
garageCols <- names(allSet)[grepl("Garage.*", names(allSet))]
garageCols

## [1] "GarageType"   "GarageYrBlt"  "GarageFinish" "GarageCars"  
## [5] "GarageArea"   "GarageQual"   "GarageCond"

By studying garage features description in the data_description.txt file, we see that garage features: GarageType, GarageFinish, GarageQual, and GarageCond all have NA values to indicate No-Garage rather than a missing value recorded. Yet, we need to change the NAs to None for these features, and to zero for the integer-type features: GarageYrBlt, * GarageCars* and * GarageArea*. This is to avoid problems in modeling the data.

First, we will consider houses with a complete set of NAs or zeros- as explained above- as houses with None values.

# find indexes of houses having all Garage features flagged with NAs or zeros
noGarage <- which((is.na(allSet$GarageArea) | allSet$GarageArea == 0)
                  & (is.na(allSet$GarageCars) | allSet$GarageCars == 0)
                  & is.na(allSet$GarageCond)
                  & is.na(allSet$GarageFinish)
                  & is.na(allSet$GarageQual)
                  & is.na(allSet$GarageType)
                  & (is.na(allSet$GarageYrBlt) | allSet$GarageYrBlt == 0))
                  
 
allSet[noGarage,garageCols]

##      GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 40         <NA>          NA         <NA>          0          0       <NA>
## 49         <NA>          NA         <NA>          0          0       <NA>
## 79         <NA>          NA         <NA>          0          0       <NA>
## 89         <NA>          NA         <NA>          0          0       <NA>
## 90         <NA>          NA         <NA>          0          0       <NA>
## 100        <NA>          NA         <NA>          0          0       <NA>
## 109        <NA>          NA         <NA>          0          0       <NA>
## 126        <NA>          NA         <NA>          0          0       <NA>
## 128        <NA>          NA         <NA>          0          0       <NA>
## 141        <NA>          NA         <NA>          0          0       <NA>
## 149        <NA>          NA         <NA>          0          0       <NA>
## 156        <NA>          NA         <NA>          0          0       <NA>
## 164        <NA>          NA         <NA>          0          0       <NA>
## 166        <NA>          NA         <NA>          0          0       <NA>
## 199        <NA>          NA         <NA>          0          0       <NA>
## 211        <NA>          NA         <NA>          0          0       <NA>
## 242        <NA>          NA         <NA>          0          0       <NA>
## 251        <NA>          NA         <NA>          0          0       <NA>
## 288        <NA>          NA         <NA>          0          0       <NA>
## 292        <NA>          NA         <NA>          0          0       <NA>
## 308        <NA>          NA         <NA>          0          0       <NA>
## 376        <NA>          NA         <NA>          0          0       <NA>
## 387        <NA>          NA         <NA>          0          0       <NA>
## 394        <NA>          NA         <NA>          0          0       <NA>
## 432        <NA>          NA         <NA>          0          0       <NA>
## 435        <NA>          NA         <NA>          0          0       <NA>
## 442        <NA>          NA         <NA>          0          0       <NA>
## 465        <NA>          NA         <NA>          0          0       <NA>
## 496        <NA>          NA         <NA>          0          0       <NA>
## 521        <NA>          NA         <NA>          0          0       <NA>
## 529        <NA>          NA         <NA>          0          0       <NA>
## 534        <NA>          NA         <NA>          0          0       <NA>
## 536        <NA>          NA         <NA>          0          0       <NA>
## 563        <NA>          NA         <NA>          0          0       <NA>
## 583        <NA>          NA         <NA>          0          0       <NA>
## 614        <NA>          NA         <NA>          0          0       <NA>
## 615        <NA>          NA         <NA>          0          0       <NA>
## 621        <NA>          NA         <NA>          0          0       <NA>
## 636        <NA>          NA         <NA>          0          0       <NA>
## 637        <NA>          NA         <NA>          0          0       <NA>
## 639        <NA>          NA         <NA>          0          0       <NA>
## 650        <NA>          NA         <NA>          0          0       <NA>
## 706        <NA>          NA         <NA>          0          0       <NA>
## 711        <NA>          NA         <NA>          0          0       <NA>
## 739        <NA>          NA         <NA>          0          0       <NA>
## 751        <NA>          NA         <NA>          0          0       <NA>
## 785        <NA>          NA         <NA>          0          0       <NA>
## 827        <NA>          NA         <NA>          0          0       <NA>
## 844        <NA>          NA         <NA>          0          0       <NA>
## 922        <NA>          NA         <NA>          0          0       <NA>
## 943        <NA>          NA         <NA>          0          0       <NA>
## 955        <NA>          NA         <NA>          0          0       <NA>
## 961        <NA>          NA         <NA>          0          0       <NA>
## 969        <NA>          NA         <NA>          0          0       <NA>
## 971        <NA>          NA         <NA>          0          0       <NA>
## 977        <NA>          NA         <NA>          0          0       <NA>
## 1010       <NA>          NA         <NA>          0          0       <NA>
## 1012       <NA>          NA         <NA>          0          0       <NA>
## 1031       <NA>          NA         <NA>          0          0       <NA>
## 1039       <NA>          NA         <NA>          0          0       <NA>
## 1097       <NA>          NA         <NA>          0          0       <NA>
## 1124       <NA>          NA         <NA>          0          0       <NA>
## 1132       <NA>          NA         <NA>          0          0       <NA>
## 1138       <NA>          NA         <NA>          0          0       <NA>
## 1144       <NA>          NA         <NA>          0          0       <NA>
## 1174       <NA>          NA         <NA>          0          0       <NA>
## 1180       <NA>          NA         <NA>          0          0       <NA>
## 1219       <NA>          NA         <NA>          0          0       <NA>
## 1220       <NA>          NA         <NA>          0          0       <NA>
## 1235       <NA>          NA         <NA>          0          0       <NA>
## 1258       <NA>          NA         <NA>          0          0       <NA>
## 1284       <NA>          NA         <NA>          0          0       <NA>
## 1324       <NA>          NA         <NA>          0          0       <NA>
## 1326       <NA>          NA         <NA>          0          0       <NA>
## 1327       <NA>          NA         <NA>          0          0       <NA>
## 1338       <NA>          NA         <NA>          0          0       <NA>
## 1350       <NA>          NA         <NA>          0          0       <NA>
## 1408       <NA>          NA         <NA>          0          0       <NA>
## 1450       <NA>          NA         <NA>          0          0       <NA>
## 1451       <NA>          NA         <NA>          0          0       <NA>
## 1454       <NA>          NA         <NA>          0          0       <NA>
## 1514       <NA>          NA         <NA>          0          0       <NA>
## 1532       <NA>          NA         <NA>          0          0       <NA>
## 1540       <NA>          NA         <NA>          0          0       <NA>
## 1553       <NA>          NA         <NA>          0          0       <NA>
## 1557       <NA>          NA         <NA>          0          0       <NA>
## 1559       <NA>          NA         <NA>          0          0       <NA>
## 1561       <NA>          NA         <NA>          0          0       <NA>
## 1591       <NA>          NA         <NA>          0          0       <NA>
## 1594       <NA>          NA         <NA>          0          0       <NA>
## 1595       <NA>          NA         <NA>          0          0       <NA>
## 1615       <NA>          NA         <NA>          0          0       <NA>
## 1616       <NA>          NA         <NA>          0          0       <NA>
## 1718       <NA>          NA         <NA>          0          0       <NA>
## 1722       <NA>          NA         <NA>          0          0       <NA>
## 1788       <NA>          NA         <NA>          0          0       <NA>
## 1809       <NA>          NA         <NA>          0          0       <NA>
## 1811       <NA>          NA         <NA>          0          0       <NA>
## 1812       <NA>          NA         <NA>          0          0       <NA>
## 1820       <NA>          NA         <NA>          0          0       <NA>
## 1823       <NA>          NA         <NA>          0          0       <NA>
## 1832       <NA>          NA         <NA>          0          0       <NA>
## 1835       <NA>          NA         <NA>          0          0       <NA>
## 1837       <NA>          NA         <NA>          0          0       <NA>
## 1840       <NA>          NA         <NA>          0          0       <NA>
## 1848       <NA>          NA         <NA>          0          0       <NA>
## 1894       <NA>          NA         <NA>          0          0       <NA>
## 2011       <NA>          NA         <NA>          0          0       <NA>
## 2082       <NA>          NA         <NA>          0          0       <NA>
## 2091       <NA>          NA         <NA>          0          0       <NA>
## 2094       <NA>          NA         <NA>          0          0       <NA>
## 2097       <NA>          NA         <NA>          0          0       <NA>
## 2100       <NA>          NA         <NA>          0          0       <NA>
## 2105       <NA>          NA         <NA>          0          0       <NA>
## 2136       <NA>          NA         <NA>          0          0       <NA>
## 2152       <NA>          NA         <NA>          0          0       <NA>
## 2154       <NA>          NA         <NA>          0          0       <NA>
## 2190       <NA>          NA         <NA>          0          0       <NA>
## 2191       <NA>          NA         <NA>          0          0       <NA>
## 2192       <NA>          NA         <NA>          0          0       <NA>
## 2193       <NA>          NA         <NA>          0          0       <NA>
## 2194       <NA>          NA         <NA>          0          0       <NA>
## 2213       <NA>          NA         <NA>          0          0       <NA>
## 2239       <NA>          NA         <NA>          0          0       <NA>
## 2247       <NA>          NA         <NA>          0          0       <NA>
## 2354       <NA>          NA         <NA>          0          0       <NA>
## 2355       <NA>          NA         <NA>          0          0       <NA>
## 2399       <NA>          NA         <NA>          0          0       <NA>
## 2400       <NA>          NA         <NA>          0          0       <NA>
## 2423       <NA>          NA         <NA>          0          0       <NA>
## 2427       <NA>          NA         <NA>          0          0       <NA>
## 2553       <NA>          NA         <NA>          0          0       <NA>
## 2554       <NA>          NA         <NA>          0          0       <NA>
## 2558       <NA>          NA         <NA>          0          0       <NA>
## 2576       <NA>          NA         <NA>          0          0       <NA>
## 2580       <NA>          NA         <NA>          0          0       <NA>
## 2604       <NA>          NA         <NA>          0          0       <NA>
## 2610       <NA>          NA         <NA>          0          0       <NA>
## 2692       <NA>          NA         <NA>          0          0       <NA>
## 2694       <NA>          NA         <NA>          0          0       <NA>
## 2709       <NA>          NA         <NA>          0          0       <NA>
## 2768       <NA>          NA         <NA>          0          0       <NA>
## 2772       <NA>          NA         <NA>          0          0       <NA>
## 2790       <NA>          NA         <NA>          0          0       <NA>
## 2792       <NA>          NA         <NA>          0          0       <NA>
## 2800       <NA>          NA         <NA>          0          0       <NA>
## 2860       <NA>          NA         <NA>          0          0       <NA>
## 2863       <NA>          NA         <NA>          0          0       <NA>
## 2871       <NA>          NA         <NA>          0          0       <NA>
## 2889       <NA>          NA         <NA>          0          0       <NA>
## 2892       <NA>          NA         <NA>          0          0       <NA>
## 2893       <NA>          NA         <NA>          0          0       <NA>
## 2894       <NA>          NA         <NA>          0          0       <NA>
## 2910       <NA>          NA         <NA>          0          0       <NA>
## 2914       <NA>          NA         <NA>          0          0       <NA>
## 2915       <NA>          NA         <NA>          0          0       <NA>
## 2918       <NA>          NA         <NA>          0          0       <NA>
##      GarageCond
## 40         <NA>
## 49         <NA>
## 79         <NA>
## 89         <NA>
## 90         <NA>
## 100        <NA>
## 109        <NA>
## 126        <NA>
## 128        <NA>
## 141        <NA>
## 149        <NA>
## 156        <NA>
## 164        <NA>
## 166        <NA>
## 199        <NA>
## 211        <NA>
## 242        <NA>
## 251        <NA>
## 288        <NA>
## 292        <NA>
## 308        <NA>
## 376        <NA>
## 387        <NA>
## 394        <NA>
## 432        <NA>
## 435        <NA>
## 442        <NA>
## 465        <NA>
## 496        <NA>
## 521        <NA>
## 529        <NA>
## 534        <NA>
## 536        <NA>
## 563        <NA>
## 583        <NA>
## 614        <NA>
## 615        <NA>
## 621        <NA>
## 636        <NA>
## 637        <NA>
## 639        <NA>
## 650        <NA>
## 706        <NA>
## 711        <NA>
## 739        <NA>
## 751        <NA>
## 785        <NA>
## 827        <NA>
## 844        <NA>
## 922        <NA>
## 943        <NA>
## 955        <NA>
## 961        <NA>
## 969        <NA>
## 971        <NA>
## 977        <NA>
## 1010       <NA>
## 1012       <NA>
## 1031       <NA>
## 1039       <NA>
## 1097       <NA>
## 1124       <NA>
## 1132       <NA>
## 1138       <NA>
## 1144       <NA>
## 1174       <NA>
## 1180       <NA>
## 1219       <NA>
## 1220       <NA>
## 1235       <NA>
## 1258       <NA>
## 1284       <NA>
## 1324       <NA>
## 1326       <NA>
## 1327       <NA>
## 1338       <NA>
## 1350       <NA>
## 1408       <NA>
## 1450       <NA>
## 1451       <NA>
## 1454       <NA>
## 1514       <NA>
## 1532       <NA>
## 1540       <NA>
## 1553       <NA>
## 1557       <NA>
## 1559       <NA>
## 1561       <NA>
## 1591       <NA>
## 1594       <NA>
## 1595       <NA>
## 1615       <NA>
## 1616       <NA>
## 1718       <NA>
## 1722       <NA>
## 1788       <NA>
## 1809       <NA>
## 1811       <NA>
## 1812       <NA>
## 1820       <NA>
## 1823       <NA>
## 1832       <NA>
## 1835       <NA>
## 1837       <NA>
## 1840       <NA>
## 1848       <NA>
## 1894       <NA>
## 2011       <NA>
## 2082       <NA>
## 2091       <NA>
## 2094       <NA>
## 2097       <NA>
## 2100       <NA>
## 2105       <NA>
## 2136       <NA>
## 2152       <NA>
## 2154       <NA>
## 2190       <NA>
## 2191       <NA>
## 2192       <NA>
## 2193       <NA>
## 2194       <NA>
## 2213       <NA>
## 2239       <NA>
## 2247       <NA>
## 2354       <NA>
## 2355       <NA>
## 2399       <NA>
## 2400       <NA>
## 2423       <NA>
## 2427       <NA>
## 2553       <NA>
## 2554       <NA>
## 2558       <NA>
## 2576       <NA>
## 2580       <NA>
## 2604       <NA>
## 2610       <NA>
## 2692       <NA>
## 2694       <NA>
## 2709       <NA>
## 2768       <NA>
## 2772       <NA>
## 2790       <NA>
## 2792       <NA>
## 2800       <NA>
## 2860       <NA>
## 2863       <NA>
## 2871       <NA>
## 2889       <NA>
## 2892       <NA>
## 2893       <NA>
## 2894       <NA>
## 2910       <NA>
## 2914       <NA>
## 2915       <NA>
## 2918       <NA>

# impute with NAs or zeros
allSet[noGarage,c("GarageType","GarageFinish","GarageQual","GarageCond")] <- "None"
allSet[noGarage, c("GarageYrBlt","GarageCars","GarageArea")] <- 0

Let’s see all houses with at least one garage feature having a missing value. Such houses do not indicate a house with no garage, and their missing values need to be imputed with reasonable values.

# find indexes of houses having at least one missing Garage value
missGarage <- which(is.na(allSet$GarageArea) 
                    | is.na(allSet$GarageCars)
                    | is.na(allSet$GarageCond)
                    | is.na(allSet$GarageFinish)
                    | is.na(allSet$GarageQual)
                    | is.na(allSet$GarageType)
                    | is.na(allSet$GarageYrBlt))
allSet[missGarage,garageCols]

##      GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 2127     Detchd          NA         <NA>          1        360       <NA>
## 2577     Detchd          NA         <NA>         NA         NA       <NA>
##      GarageCond
## 2127       <NA>
## 2577       <NA>

We have only two houses with garages but having missing values in some features. For both houses, we can use the YearBuilt feature to impute the GarageYrBlt value as most of garages are built with their houses.

allSet[missGarage,"GarageYrBlt"] <- allSet[missGarage,"YearBuilt"]

House 2577 is missing more values than house 2127. Hence, we will impute its missing values first based on the matching values of houses of the same GarageType and GarageYrBlt.

# index houses with missing garageCars or area
miss <- which(is.na(allSet$GarageCars) | is.na(allSet$GarageArea))

# group houses by garage features
grpdVals1 <- allSet%>% 
        group_by(GarageType, GarageYrBlt, GarageFinish, GarageQual, GarageCond) %>% 
        summarize(medCars = median(GarageCars), meanArea = round(mean(GarageArea),0),count = n()) %>% 
        arrange(desc(count))
comp <- complete.cases(grpdVals1)
grpdVals1 <- grpdVals1[comp,]
grpdVals1

## # A tibble: 601 x 8
## # Groups: GarageType, GarageYrBlt, GarageFinish, GarageQual [553]
##    GarageType GarageYrBlt GarageFinish GarageQual GarageCond medCars
##    <chr>            <dbl> <chr>        <chr>      <chr>        <dbl>
##  1 None                 0 None         None       None          0   
##  2 Attchd            2006 RFn          TA         TA            2.00
##  3 Attchd            2005 Fin          TA         TA            2.00
##  4 Attchd            2007 Fin          TA         TA            3.00
##  5 Attchd            2005 RFn          TA         TA            2.00
##  6 Attchd            2006 Fin          TA         TA            2.00
##  7 Attchd            2004 Fin          TA         TA            2.00
##  8 Attchd            2007 RFn          TA         TA            2.00
##  9 Attchd            2003 Fin          TA         TA            2.00
## 10 Attchd            2008 Fin          TA         TA            3.00
## # ... with 591 more rows, and 2 more variables: meanArea <dbl>, count
## #   <int>

# look-up missing values based on houses with the same type and year built
rIndx <- miss
for(i in rIndx){
    missVals <- grpdVals1[which((grpdVals1$GarageYrBlt == allSet$GarageYrBlt[i]) & (grpdVals1$GarageType == allSet$GarageType[i])),c("GarageFinish", "medCars", "meanArea", "GarageQual","GarageCond")]
        allSet$GarageFinish[i] <- missVals[[1]][1]
        allSet$GarageCars[i] <- missVals[[2]][1]
        allSet$GarageArea[i] <- missVals[[3]][1]
        allSet$GarageQual[i] <- missVals[[4]][1]
        allSet$GarageCond[i] <- missVals[[5]][1]
  }

House 2127 has a garage with and area of 360 square feet and 1-GarageCars value. So, we will impute its missing values (GarageFinish, GarageQual and GarageCond) with those having similar GarageArea, GarageCars, and GarageType.

grpdVals <- allSet[allSet$GarageType == "Detchd" & allSet$GarageCars == 1,garageCols] %>% 
        group_by(GarageFinish, GarageQual, GarageCond) %>% 
        summarize(meanArea = round(mean(GarageArea),0),count = n()) %>% 
        arrange(desc(count))
comp <- complete.cases(grpdVals)
grpdVals <- grpdVals[comp,]
grpdVals

## # A tibble: 14 x 5
## # Groups: GarageFinish, GarageQual [8]
##    GarageFinish GarageQual GarageCond meanArea count
##    <chr>        <chr>      <chr>         <dbl> <int>
##  1 Unf          TA         TA              294   269
##  2 Unf          Fa         TA              248    40
##  3 Unf          Fa         Fa              226    25
##  4 Unf          TA         Fa              258    18
##  5 RFn          TA         TA              373     6
##  6 Fin          TA         TA              327     4
##  7 Unf          Fa         Po              246     4
##  8 Unf          Po         Po              301     4
##  9 Fin          Ex         Ex              924     1
## 10 Unf          Ex         Ex              300     1
## 11 Unf          Gd         TA              352     1
## 12 Unf          Po         Fa              195     1
## 13 Unf          TA         Gd              440     1
## 14 Unf          TA         Po              216     1

Impute missing values with the nearest values found in the table above.

rIndx <- missGarage

for(i in rIndx){
    missVals <- grpdVals[which((abs(grpdVals$meanArea - allSet$GarageArea[i])) == min(abs(grpdVals$meanArea - allSet$GarageArea[i]),na.rm = T)),c("GarageFinish", "GarageQual","GarageCond")]
        allSet$GarageFinish[i] <- missVals[[1]][1]
        allSet$GarageQual[i] <- missVals[[2]][1]
        allSet$GarageCond[i] <- missVals[[3]][1]
  }

Basement features: By studying the data_description.txt file, we find that there are 11 features-listed below- that describe the basements in houses.

# identify all basement features as those containing Bsmt text
bsmtCols <- names(allSet)[grepl("Bsmt.*", names(allSet))]
bsmtCols

##  [1] "BsmtQual"     "BsmtCond"     "BsmtExposure" "BsmtFinType1"
##  [5] "BsmtFinSF1"   "BsmtFinType2" "BsmtFinSF2"   "BsmtUnfSF"   
##  [9] "TotalBsmtSF"  "BsmtFullBath" "BsmtHalfBath"

First, let’s check missing values in all Basement features. Assuming that if a house has NA values in all basement features, then the house has no basement. Basement values of such houses will be replaced by None for string features and by zero for numeric features.

# check all Bsmt houses having all Bsmt features NAs or zeros
bsmtAllNa <- which((allSet[,bsmtCols[1]]== 0 | is.na(allSet[,bsmtCols[1]])) &
                        (allSet[,bsmtCols[2]]== 0 | is.na(allSet[,bsmtCols[2]])) & 
                        (allSet[,bsmtCols[3]]== 0 | is.na(allSet[,bsmtCols[3]])) &
                        (allSet[,bsmtCols[4]]== 0 | is.na(allSet[,bsmtCols[4]])) &
                        (allSet[,bsmtCols[5]]== 0 | is.na(allSet[,bsmtCols[5]])) &
                        (allSet[,bsmtCols[6]]== 0 | is.na(allSet[,bsmtCols[6]])) &
                        (allSet[,bsmtCols[7]]== 0 | is.na(allSet[,bsmtCols[7]])) &
                        (allSet[,bsmtCols[8]]== 0 | is.na(allSet[,bsmtCols[8]])) &
                        (allSet[,bsmtCols[9]]== 0 | is.na(allSet[,bsmtCols[9]])) &
                        (allSet[,bsmtCols[10]]== 0 | is.na(allSet[,bsmtCols[10]])) &
                        (allSet[,bsmtCols[11]]== 0 | is.na(allSet[,bsmtCols[11]]))
        )
allSet[bsmtAllNa,bsmtCols]

##      BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 18       <NA>     <NA>         <NA>         <NA>          0         <NA>
## 40       <NA>     <NA>         <NA>         <NA>          0         <NA>
## 91       <NA>     <NA>         <NA>         <NA>          0         <NA>
## 103      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 157      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 183      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 260      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 343      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 363      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 372      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 393      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 521      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 533      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 534      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 554      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 647      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 706      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 737      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 750      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 779      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 869      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 895      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 898      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 985      <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1001     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1012     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1036     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1046     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1049     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1050     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1091     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1180     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1217     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1219     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1233     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1322     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1413     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1586     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1594     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1730     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1779     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1815     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1848     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1849     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1857     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1858     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1859     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1861     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 1916     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2051     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2067     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2069     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2121     <NA>     <NA>         <NA>         <NA>         NA         <NA>
## 2123     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2189     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2190     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2191     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2194     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2217     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2225     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2388     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2436     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2453     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2454     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2491     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2499     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2548     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2553     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2565     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2579     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2600     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2703     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2764     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2767     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2804     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2805     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2825     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2892     <NA>     <NA>         <NA>         <NA>          0         <NA>
## 2905     <NA>     <NA>         <NA>         <NA>          0         <NA>
##      BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
## 18            0         0           0            0            0
## 40            0         0           0            0            0
## 91            0         0           0            0            0
## 103           0         0           0            0            0
## 157           0         0           0            0            0
## 183           0         0           0            0            0
## 260           0         0           0            0            0
## 343           0         0           0            0            0
## 363           0         0           0            0            0
## 372           0         0           0            0            0
## 393           0         0           0            0            0
## 521           0         0           0            0            0
## 533           0         0           0            0            0
## 534           0         0           0            0            0
## 554           0         0           0            0            0
## 647           0         0           0            0            0
## 706           0         0           0            0            0
## 737           0         0           0            0            0
## 750           0         0           0            0            0
## 779           0         0           0            0            0
## 869           0         0           0            0            0
## 895           0         0           0            0            0
## 898           0         0           0            0            0
## 985           0         0           0            0            0
## 1001          0         0           0            0            0
## 1012          0         0           0            0            0
## 1036          0         0           0            0            0
## 1046          0         0           0            0            0
## 1049          0         0           0            0            0
## 1050          0         0           0            0            0
## 1091          0         0           0            0            0
## 1180          0         0           0            0            0
## 1217          0         0           0            0            0
## 1219          0         0           0            0            0
## 1233          0         0           0            0            0
## 1322          0         0           0            0            0
## 1413          0         0           0            0            0
## 1586          0         0           0            0            0
## 1594          0         0           0            0            0
## 1730          0         0           0            0            0
## 1779          0         0           0            0            0
## 1815          0         0           0            0            0
## 1848          0         0           0            0            0
## 1849          0         0           0            0            0
## 1857          0         0           0            0            0
## 1858          0         0           0            0            0
## 1859          0         0           0            0            0
## 1861          0         0           0            0            0
## 1916          0         0           0            0            0
## 2051          0         0           0            0            0
## 2067          0         0           0            0            0
## 2069          0         0           0            0            0
## 2121         NA        NA          NA           NA           NA
## 2123          0         0           0            0            0
## 2189          0         0           0           NA           NA
## 2190          0         0           0            0            0
## 2191          0         0           0            0            0
## 2194          0         0           0            0            0
## 2217          0         0           0            0            0
## 2225          0         0           0            0            0
## 2388          0         0           0            0            0
## 2436          0         0           0            0            0
## 2453          0         0           0            0            0
## 2454          0         0           0            0            0
## 2491          0         0           0            0            0
## 2499          0         0           0            0            0
## 2548          0         0           0            0            0
## 2553          0         0           0            0            0
## 2565          0         0           0            0            0
## 2579          0         0           0            0            0
## 2600          0         0           0            0            0
## 2703          0         0           0            0            0
## 2764          0         0           0            0            0
## 2767          0         0           0            0            0
## 2804          0         0           0            0            0
## 2805          0         0           0            0            0
## 2825          0         0           0            0            0
## 2892          0         0           0            0            0
## 2905          0         0           0            0            0

We have 79 houses that can be flagged as not having basements.

# flag the non-basement houses with "No Basement" for string bsmt features and with zeros
# for numeric bsmt feautres
allSet[bsmtAllNa,c("BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2")] <- "None"
allSet[bsmtAllNa, c("BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath")] <- 0

Now, let’s check remaining basement houses which have missing values but seem to have basements based on at least one non-NA basement feature.

bsmtSmNa <- which(is.na(allSet[,bsmtCols[1]]) |
                          is.na(allSet[,bsmtCols[2]]) | 
                          is.na(allSet[,bsmtCols[3]]) | 
                          is.na(allSet[,bsmtCols[4]]) |
                          is.na(allSet[,bsmtCols[5]]) | 
                          is.na(allSet[,bsmtCols[6]]) |
                          is.na(allSet[,bsmtCols[7]]) | 
                          is.na(allSet[,bsmtCols[8]]) | 
                          is.na(allSet[,bsmtCols[9]]) |
                          is.na(allSet[,bsmtCols[10]])| 
                          is.na(allSet[,bsmtCols[11]]))
bsmtNa <- allSet[bsmtSmNa,bsmtCols]
bsmtNa

##      BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 333        Gd       TA           No          GLQ       1124         <NA>
## 949        Gd       TA         <NA>          Unf          0          Unf
## 1488       Gd       TA         <NA>          Unf          0          Unf
## 2041       Gd     <NA>           Mn          GLQ       1044          Rec
## 2186       TA     <NA>           No          BLQ       1033          Unf
## 2218     <NA>       Fa           No          Unf          0          Unf
## 2219     <NA>       TA           No          Unf          0          Unf
## 2349       Gd       TA         <NA>          Unf          0          Unf
## 2525       TA     <NA>           Av          ALQ        755          Unf
##      BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
## 333         479      1603        3206            1            0
## 949           0       936         936            0            0
## 1488          0      1595        1595            0            0
## 2041        382         0        1426            1            0
## 2186          0        94        1127            0            1
## 2218          0       173         173            0            0
## 2219          0       356         356            0            0
## 2349          0       725         725            0            0
## 2525          0       240         995            0            0

Reading the description of each basement feature, we can see that:

Basement finished, unfinished, and total square feet are related in the following formula: TotalBsmtSF = BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF
BsmtFinType1 and BsmtFinSF1 might be related. We may predict rating of basement finished area (type 1) from its finished square feet.
BsmtFinType2 and BsmtFinSF2 might be related. We may predict rating of basement finished area (type 2) from its finished square feet.

Let’s first make sure that all basement areas are correct according to the abovementioned formula.

sum( allSet$TotalBsmtSF != (allSet$BsmtFinSF1 + allSet$BsmtFinSF2 + allSet$BsmtUnfSF))

## [1] 0

All basement areas numbers are correct and according to the formula.

We have one house (333) whose BsmtFinType2 value is missing while its BsmtFinSF2 is 479 square feet. Let’s see the distribution of all BsmtFinType2 values versus BsmtFinSF2.

# check all houses having non-zero BsmtFinSF2 versus their BsmtFinType2
missFinSF2 <- bsmtNa$BsmtFinSF2[is.na(bsmtNa$BsmtFinType2)]
bsmt2 <- allSet[allSet$BsmtFinSF2 >0 & !is.na(allSet$BsmtFinType2), c("BsmtFinSF2", "BsmtFinType2")]
bsmt2 %>% ggplot(aes(x = BsmtFinType2, y = BsmtFinSF2))+
                geom_boxplot()+
                geom_hline(yintercept = missFinSF2, color = "red")

# calculate medians of finished square feet for each bsmt finish type 2
medSF <- bsmt2 %>% 
        group_by(BsmtFinType2) %>% 
        summarize(median = median(BsmtFinSF2))
medSF

## # A tibble: 6 x 2
##   BsmtFinType2 median
##   <chr>         <dbl>
## 1 ALQ          553   
## 2 BLQ          294   
## 3 GLQ          691   
## 4 LwQ          263   
## 5 Rec          308   
## 6 Unf            6.00

Let’s impute the missing rating of basement finished area (type2) based on the nearest median of square feet for the available type2 ratings.

rIndx <- which(is.na(allSet$BsmtFinType2))

for(i in rIndx){
        bsmtVal <- medSF[which((abs(medSF$median - allSet$BsmtFinSF2[i])) == min(abs(medSF$median - allSet$BsmtFinSF2[i]), na.rm = T)),"BsmtFinType2"]
         allSet$BsmtFinType2[i] <- bsmtVal[[1]]
}

For basement exposure (BsmtExposure), we have 3 missing values. Let’s see if there’s any relationship between basement types and exposure.

allSet[!is.na(allSet$BsmtExposure), c("BsmtExposure","BsmtFinType1")] %>% 
        ggplot(aes(x = BsmtExposure, fill = BsmtFinType1))+
        geom_histogram(stat = "count")

From the histogram above, we can see that the majority of houses have no basement exposure, and the majority of the no-basement exposure houses are Unf basement finish type, which match the 3 missing values of basement exposure. So, we’ll impute the missing values with “No”.

allSet$BsmtExposure[is.na(allSet$BsmtExposure)] <- "No"

Now, we are left with missing values in basement quality and basement condition. Let’s see if there’s some relationship between basement quality (which refers to its height) and the exposure.

allSet[!is.na(allSet$BsmtQual), c("BsmtQual","BsmtExposure")] %>% 
        ggplot(aes(x = BsmtQual, fill = BsmtExposure))+
        geom_histogram(stat = "count")

It seems that most of the No-exposure basements have either a TA or a Gd Quality with higher frequency towards TA. So, we’ll take TA as the value of the missing basement quality houses.

allSet$BsmtQual[is.na(allSet$BsmtQual)] <- "TA"

Now, let’s see the relationship between basement quality and condition.

allSet[!is.na(allSet$BsmtQual) & !is.na(allSet$BsmtCond), c("BsmtQual","BsmtCond")] %>% 
        ggplot(aes(x = BsmtQual, fill = BsmtCond))+
        geom_histogram(stat = "count")

It’s obvious from the plot that the majority of TA and Gd quality have TA basement condition. So, we’ll use TA basement condition for the missing values.

allSet$BsmtCond[is.na(allSet$BsmtCond)] <- "TA"

By finishing the imputation process for all missing values of all features we are ready to move to the next step of Feature Engineering. Let’s check if any other missing values are still there in the data set.

checkNAs(allSet)

## SalePrice 
##      1459

Aside from the SalePrice response feature, which is expected to have missing values in the original test set, no remaining missing values are observed. Therefore, we can move to the next step.

5 Feature Engineering

Feature Engineering simply refers to creating new features from the existing ones to improve model performance. The following sub-sections handle multiple feature engineering tasks that would make training the model possible, easier, and more accurate in prediction.

5.1 Log transformation

It is a good practice to transform highly skewed numerical variables into their log values. This helps show the relative change in variable values rather than the absolute change for variables not showing normal distribution. Most researchers use the range of -2 to +2 as the acceptable limits of skewness to decide on a variable normality. Hence, we will use this range to log transform values.

# create a new copy of the master data set (allSet)
allSetNew <- allSet

# get classes of features in the data set
featureClasses <- sapply(names(allSetNew), function(x){class(allSetNew[[x]])})

# get numeric or integer class features
numFeatures <- names(featureClasses[featureClasses == "numeric" | featureClasses == "integer"])

# get character class features
charFeatures <- names(featureClasses[featureClasses == "character"])

# determine skewness of each numeric feature
skewedVals <- sapply(numFeatures, function(x){skewness(allSetNew[[x]],na.rm = T)})

# identify skewed features with threshold of -2,+2
skewedFeatures <- skewedVals[skewedVals < -2 | skewedVals > 2]

# log-transform skewed features
for (i in names(skewedFeatures)) {
        allSetNew[[i]] <- log(allSetNew[[i]] + 1)
}

5.2 Convert character features into dummy variables

Since we will use the glmment function to build regression models, we need to convert character features into dummy variables since the glmnet function accepts numeric/quantitative variables.

# create a full set of dummy variables for the character features
dummies <- dummyVars(~., allSetNew[charFeatures])
dummyVariables <- predict(dummies, allSetNew[charFeatures])

# compile the data set again by combining both numerical and dummy variables features
allSetNew <- cbind(allSetNew[numFeatures], dummyVariables)

6 Model Training and Testing

Now, having all features processed and engineered, we can start the model training then testing with all existing features (except the Id) as model predictors.

First, we will re-split the full data set into its original train and test sets.

salesPriceNA <- which(is.na(allSetNew["SalePrice"]))
train <- allSetNew[-salesPriceNA,]
test <- allSetNew[salesPriceNA,]

6.1 Using Shrinkage Models

Since we are dealing with high-dimensional data set, we will use the Shrinkage Models (e.g. Ridge and Lasso) to obtain the best model with the least effect of predictors variance and colinearity. Ridge and Lasso methods penalize the model coefficient estimates, using Lambda tuning parameter, to reduce their variance which result in a better model fitting.

6.1.1 Ridge Regression

Ridge regression performs shrinkage without exclusion of predictors. To use the glmnet function we need to have a matrix of all predictors and a vector of the response.

# convert the train and test data sets into matrices excluding the Id and SalePrice features
train.Matrix =as.matrix(train[,names(train) != c("Id","SalePrice")])
test.Matrix = as.matrix(test[,names(test) != c("Id","SalePrice")])

# create a vector of the log-transformed response (SalePrice)
train.y = log(train$SalePrice + 1)

Fit the Ridge Regression using the alpha argument = 0. We will implement the function over a grid of values ranging from λ = 10^10 to λ = 10^−2, essentially covering the full range of scenarios from the null model containing only the intercept to the least squares fit model.

set.seed(4)
# create a grid of lambda values
grid = 10^seq(10,-2, length = 100)

# train the Ridge Regression model (alpha = 0) using the grid of selected lambdas
ridge.mod = glmnet(train.Matrix,train.y,alpha = 0, lambda = grid)
dim(coef(ridge.mod))

## [1] 303 100

We have 303 regression coefficients with 100 lambda values. We will split the train data set into training and testing subsets to estimate the test error.

set.seed(5)
# split the train data set into training and testing data sets (75/25 ratio).
# number of training rows
nTrain <- round(0.75 * nrow(train.Matrix))

# sample row IDs
sampleTrain <- sample(nrow(train.Matrix),nTrain)

# create training and testing data sets
training <- train.Matrix[sampleTrain,]
testing <- train.Matrix[-sampleTrain,]
training.y <- train.y[sampleTrain]
testing.y <- train.y[-sampleTrain]

We will use the built-in cross-validation function cv.glmnet() to choose the best tuning parameter (lambda) through cross validation in the training data.

set.seed(1)
cv.out <- cv.glmnet(training, training.y, alpha = 0)
plot(cv.out)

ridgeBestLambda <- cv.out$lambda.min
ridgeBestLambda;log(ridgeBestLambda)

## [1] 0.2249694

## [1] -1.491791

The best lambda value that results in the smallest cross validation error is 0.2249694. Let’s see the test RMSE associated with this value of lambda.

# predict the response (SalePrice) in the testing data using the best lambda
ridge.predict <- predict(ridge.mod,s = ridgeBestLambda, newx = testing)
ridge.rmse <- sqrt(mean((ridge.predict-testing.y)^2))
ridge.rmse

## [1] 0.09866289

So, the test error (RMSE) resulted from the Ridge Regression model on the testing data is 0.0986629.

(Note: remember that the testing data is the testing partition from the original train data set)

6.1.2 Lasso Regression

Let’s see if the Lasso regression can yield a better RMSE than that of the Ridge. We will use the glmnet() function but with alpha = 1.

set.seed(5)
lasso.mod = glmnet(train.Matrix,train.y,alpha = 1, lambda = grid)
plot(lasso.mod)

We can see from the coefficient plot that depending on the choice of tuning parameter, some of the coefficients will be exactly equal to zero. Let’s perform cross validation and see the test error.

set.seed(1)
cv.out <- cv.glmnet(training, training.y, alpha = 1)
plot(cv.out)

lassoBestlambda <- cv.out$lambda.min
lassoBestlambda;log(lassoBestlambda)

## [1] 0.00583801

## [1] -5.143365

The best lambda value that results in the smallest cross validation error is 0.005838. Let’s see the test RMSE associated with this value of lambda.

lasso.predict <- predict(lasso.mod,s = lassoBestlambda, newx = testing)
lasso.rmse <- sqrt(mean((lasso.predict-testing.y)^2))
lasso.rmse

## [1] 0.110181

So, the RMSE resulted from the Lasso Regression model is 0.110181 which is slightly higher than that of the Ridge Regression.

lasso.coef <- predict(lasso.mod,type = "coefficients", s = lassoBestlambda)[1:303,]
names(lasso.coef[lasso.coef !=0])

##  [1] "(Intercept)"          "LotArea"              "OverallQual"         
##  [4] "OverallCond"          "YearBuilt"            "YearRemodAdd"        
##  [7] "BsmtFinSF1"           "TotalBsmtSF"          "GrLivArea"           
## [10] "BsmtFullBath"         "KitchenAbvGr"         "Fireplaces"          
## [13] "GarageCars"           "GarageArea"           "WoodDeckSF"          
## [16] "OpenPorchSF"          "ScreenPorch"          "MSZoningC (all)"     
## [19] "MSZoningRM"           "NeighborhoodCrawfor"  "NeighborhoodEdwards" 
## [22] "NeighborhoodNridgHt"  "NeighborhoodSomerst"  "NeighborhoodStoneBr" 
## [25] "Condition1Artery"     "Condition1Norm"       "Condition2PosN"      
## [28] "RoofMatlClyTile"      "Exterior1stBrkComm"   "Exterior1stBrkFace"  
## [31] "ExterQualTA"          "ExterCondFa"          "FoundationPConc"     
## [34] "BsmtQualEx"           "BsmtCondFa"           "BsmtExposureGd"      
## [37] "BsmtFinType1GLQ"      "BsmtFinType1Unf"      "HeatingGrav"         
## [40] "HeatingQCEx"          "CentralAirN"          "KitchenQualEx"       
## [43] "KitchenQualTA"        "FunctionalMaj2"       "FunctionalSev"       
## [46] "FunctionalTyp"        "FireplaceQuNone"      "GarageTypeAttchd"    
## [49] "GarageFinishUnf"      "GarageCondTA"         "PavedDriveN"         
## [52] "PavedDriveY"          "SaleTypeNew"          "SaleConditionAbnorml"

Number of coefficients that the Lasso Regression picked (not zero) is 54.

6.1.3 Gradient Boosting Machine (GBM) Algorithm

Now, we will try the GBM algorithm. First, let’s split the train and test data sets, then split the train data into training and testing data sets to validate the model.

set.seed(5)
# re-split the combined data into train and test data sets
salesPriceNA <- which(is.na(allSetNew["SalePrice"]))
train <- allSetNew[-salesPriceNA,]
test <- allSetNew[salesPriceNA,]
 
# convert the response in the train data into log
train$SalePrice <- log(train$SalePrice + 1)
 
# split the train data set into training and testing data sets (75/25 ratio)
# to validate the model

# number of training rows
nTrain <- round(0.75 * nrow(train))

# sample row IDs
sampleTrain <- sample(nrow(train),nTrain)

# create training and testing data sets
training <- train[sampleTrain,!names(train) == "Id"]
testing <- train[-sampleTrain,!names(train) == "Id"]
 
testing.y <- testing$SalePrice

Build the GBM model.

set.seed(2)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 5, verboseIter = FALSE)
 
gbm.mod <- train(SalePrice~., data = training, method = "gbm", trControl = control, verbose = FALSE)
 
gbm.mod

## Stochastic Gradient Boosting 
## 
## 1095 samples
##  302 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 985, 984, 985, 985, 985, 986, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE       Rsquared   MAE       
##   1                   50      0.1754895  0.8277985  0.12294897
##   1                  100      0.1491191  0.8595910  0.10367975
##   1                  150      0.1427706  0.8688748  0.09846921
##   2                   50      0.1536606  0.8551644  0.10579030
##   2                  100      0.1399041  0.8737287  0.09511092
##   2                  150      0.1364623  0.8796950  0.09257943
##   3                   50      0.1453491  0.8670813  0.09885915
##   3                  100      0.1357116  0.8811948  0.09180459
##   3                  150      0.1335058  0.8851087  0.09033930
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

Let’s see the test error (RMSE) by applying the GBM model to the testing data set.

gbm.predict <- predict(gbm.mod,newdata = testing)
gbm.rmse <- sqrt(mean((gbm.predict-testing.y)^2))
gbm.rmse

## [1] 0.1203715

GBM model resulted in a worse prediction than that of both Ridge and Lasso as indicated by its high RMSE value.

6.1.4 Linear Model with Forward Stepwise

Forward Stepwise is a Subset Selection method that begins with a model containing no predictors (null model), and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.

# use the lm Regression model to predict the SalePrice. Use Forward Stepwise algorithm to select the best predictors.
 
set.seed(4)
# build the null model with no predictors
null_model <- lm(SalePrice~1, data = training)
 
# build the full model with all predictors
full_model <- lm(SalePrice~., data = training)
 
# perform forward stepwise algorithm to get an economic model with best predictors
step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")

Let’s see the test error (RMSE) by applying the Step Wise model to the testing data set.

step.predict <- predict(step_model,newdata = testing)
step.rmse <- sqrt(mean((step.predict-testing.y)^2))
step.rmse

## [1] 0.1472766

The RMSE resulted from the Linear model with Forward Stepwise is 0.1472766.

7 Predictions on test data set

By now we have trained and tested four models. We will use each one to predict the Sale Price response variable in the original test data set.

# predict using Ridge
test.predict.ridge <- exp(predict(ridge.mod,s = ridgeBestLambda,newx = test.Matrix))-1
 
# predict using lasso
test.predict.lasso <- exp(predict(lasso.mod, s = lassoBestlambda, newx = test.Matrix))-1
 
# predict using gbm
test.predict.gbm <- exp(predict(gbm.mod,newdata = test)) - 1
 
# predict using linear step wise
test.predict.step <- exp(predict(step_model,newdata = test)) - 1

8 Models Ensembling

In order to get more robust predictions that incorporate predictions of the previous four models we will perform the Model Ensemble technique using the Weighted Average method.

We will assign the weights based on each model’s RMSE value. Hence, the models’ weights can be assumed as shown below:

ensmb.df <- data.frame(Model = c("ridge", "lasso", "gbm", "step"), RMSE = c(ridge.rmse,lasso.rmse,gbm.rmse,step.rmse), Weight = c(40,30,15,15))

ensmb.df

##   Model       RMSE Weight
## 1 ridge 0.09866289     40
## 2 lasso 0.11018098     30
## 3   gbm 0.12037154     15
## 4  step 0.14727656     15

9 Writing the solution to a csv file

Finally, we will predict the Sale Price in the original test data set using the ensembled models with the weights defined above.

# construct data frame for the ensembled solution
solution <- data.frame(Id = as.integer(rownames(test)),SalePrice =  as.numeric(test.predict.ridge*.4 + test.predict.lasso*.3 + test.predict.gbm*.15 + test.predict.step*.15))

write.csv(solution,"ensemble_sol.csv",row.names=FALSE)

House Prices: Ensembled Prediction Models

Mohammed K. Barakat

March 24, 2018