Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. This Kaggle competition’s dataset proves that there are many more house features that influence price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this research tries to predict the final price of each home.
The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often-cited Boston Housing dataset.
This research aims at predicting prices of residential homes in Ames, Iowa based on a large set of features. It starts by acquiring and defining input data, data cleaning, feature engineering, and finally by model training, selection, and prediction. The research predicts house prices by combining results of several Machine Learning algorithms using Models Ensembling technique.
Four algorithms have been used for model training and prediction; Ridge Regression, Lasso Regression, Gradient Boosting, and Linear Model with Forward Stepwise. The Root Mean Square Error (RMSE) is used to measure the prediction error for each. Model RMSE’s have been found to be 0.0986, 0.110, 0.120, and 0.147 for the four mentioned models, respectively. Weighted Average Ensembling has been used to predict the house prices in the Kaggle test file.
Several R packages are used in the research. Packages cover data visualization, data manipulation, and model training and prediction.
Data of the research has been obtained from Kaggle and is available under this link.
The data is split into two data sets:
train set (train.csv) which will be used to train the prediction models. Since the available test set does not contain the ground truth of the price of each house, this training set will be split into training and testing sets. The temporary training set will be used to train prediction models, whereas the temporary testing set will be used to validate model performance and measure test error. Hence, ranking of models in terms of lower RMSE is based on the testing data set.
test set (test.csv) in which the ground truth response (SalePrice) is not provided. The research aims at predicting the response in this set. For each house in the test set, models predicted values will be ensembled to give the house sale price value.
The research uses the libraries loaded below:
library(knitr)
library(dplyr)
library(gridExtra)
library(ggplot2)
library(ggthemes)
library(cowplot)
library(moments)
library(caret)
library(glmnet)
Kaggle train and test data sets are imported into R. Then, both sets are combined into one set (allSet) for Data Cleaning and Feature Engineering before being used in model training and prediction.
train <- read.csv("./input/train.csv", stringsAsFactors = F)
test <- read.csv("./input/test.csv", stringsAsFactors = F)
allSet <- bind_rows(train,test)
trainDim <- dim(train)
testDim <- dim(test)
trainR <- trainDim[1];testR <- testDim[1]
trainC <- trainDim[2];testC <- testDim[2]
By having a look at the dimensions of the train and test data, we see that they have 1460 and 1459 observations (houses) and 81 and 80 features, respectively.
We can also have a glimpse of the data structure. It is clear that we have features that are integers and others as characters. It’s worth noticing that the outcome variable (SalePrice) is integer.
glimpse(allSet)
## Observations: 2,919
## Variables: 81
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ MSZoning <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", ...
## $ LotFrontage <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, ...
## $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ Street <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", ...
## $ Alley <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LotShape <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg",...
## $ LandContour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl",...
## $ Utilities <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub"...
## $ LotConfig <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Ins...
## $ LandSlope <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl",...
## $ Neighborhood <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoR...
## $ Condition1 <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm",...
## $ Condition2 <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", ...
## $ BldgType <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", ...
## $ HouseStyle <chr> "2Story", "1Story", "2Story", "2Story", "2Story"...
## $ OverallQual <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ RoofStyle <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Ga...
## $ RoofMatl <chr> "CompShg", "CompShg", "CompShg", "CompShg", "Com...
## $ Exterior1st <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "Vin...
## $ Exterior2nd <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "Vin...
## $ MasVnrType <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace",...
## $ MasVnrArea <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ ExterQual <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", ...
## $ ExterCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", ...
## $ Foundation <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "...
## $ BsmtQual <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", ...
## $ BsmtCond <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", ...
## $ BsmtExposure <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", ...
## $ BsmtFinType1 <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ",...
## $ BsmtFinSF1 <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinType2 <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf",...
## $ BsmtFinSF2 <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ Heating <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", ...
## $ HeatingQC <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", ...
## $ CentralAir <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"...
## $ Electrical <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SB...
## $ X1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ X2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ BsmtFullBath <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, ...
## $ BsmtHalfBath <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FullBath <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, ...
## $ HalfBath <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ BedroomAbvGr <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, ...
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, ...
## $ KitchenQual <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", ...
## $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ Functional <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ",...
## $ Fireplaces <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, ...
## $ FireplaceQu <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA"...
## $ GarageType <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd"...
## $ GarageYrBlt <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, ...
## $ GarageFinish <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn",...
## $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ GarageQual <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", ...
## $ GarageCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", ...
## $ PavedDrive <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"...
## $ WoodDeckSF <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PoolQC <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fence <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA,...
## $ MiscFeature <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, ...
## $ MiscVal <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleType <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", ...
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal...
## $ SalePrice <int> 208500, 181500, 223500, 140000, 250000, 143000, ...
Below is a description of each of the features existing in the data sets to give insight of what information we have about houses.
| Feature | Description |
|---|---|
| SalePrice | the property’s sale price in dollars. This is the target variable that you’re trying to predict. |
| MSSubClass | The building class |
| MSZoning | The general zoning classification |
| LotFrontage | Linear feet of street connected to property |
| LotArea | Lot size in square feet |
| Street | Type of road access |
| Alley | Type of alley access |
| LotShape | General shape of property |
| LandContour | Flatness of the property |
| Utilities | Type of utilities available |
| LotConfig | Lot configuration |
| LandSlope | Slope of property |
| Neighborhood | Physical locations within Ames city limits |
| Condition1 | Proximity to main road or railroad |
| Condition2 | Proximity to main road or railroad (if a second is present) |
| BldgType | Type of dwelling |
| HouseStyle | Style of dwelling |
| OverallQual | Overall material and finish quality |
| OverallCond | Overall condition rating |
| YearBuilt | Original construction date |
| YearRemodAdd | Remodel date |
| RoofStyle | Type of roof |
| RoofMatl | Roof material |
| Exterior1st | Exterior covering on house |
| Exterior2nd | Exterior covering on house (if more than one material) |
| MasVnrType | Masonry veneer type |
| MasVnrArea | Masonry veneer area in square feet |
| ExterQual | Exterior material quality |
| ExterCond | Present condition of the material on the exterior |
| Foundation | Type of foundation |
| BsmtQual | Height of the basement |
| BsmtCond | General condition of the basement |
| BsmtExposure | Walkout or garden level basement walls |
| BsmtFinType1 | Quality of basement finished area |
| BsmtFinSF1 | Type 1 finished square feet |
| BsmtFinType2 | Quality of second finished area (if present) |
| BsmtFinSF2 | Type 2 finished square feet |
| BsmtUnfSF | Unfinished square feet of basement area |
| TotalBsmtSF | Total square feet of basement area |
| Heating | Type of heating |
| HeatingQC | Heating quality and condition |
| CentralAir | Central air conditioning |
| Electrical | Electrical system |
| 1stFlrSF | First Floor square feet |
| 2ndFlrSF | Second floor square feet |
| LowQualFinSF | Low quality finished square feet (all floors) |
| GrLivArea | Above grade (ground) living area square feet |
| BsmtFullBath | Basement full bathrooms |
| BsmtHalfBath | Basement half bathrooms |
| FullBath | Full bathrooms above grade |
| HalfBath | Half baths above grade |
| Bedroom | Number of bedrooms above basement level |
| Kitchen | Number of kitchens |
| KitchenQual | Kitchen quality |
| TotRmsAbvGrd | Total rooms above grade (does not include bathrooms) |
| Functional | Home functionality rating |
| Fireplaces | Number of fireplaces |
| FireplaceQu | Fireplace quality |
| GarageType | Garage location |
| GarageYrBlt | Year garage was built |
| GarageFinish | Interior finish of the garage |
| GarageCars | Size of garage in car capacity |
| GarageArea | Size of garage in square feet |
| GarageQual | Garage quality |
| GarageCond | Garage condition |
| PavedDrive | Paved driveway |
| WoodDeckSF | Wood deck area in square feet |
| OpenPorchSF | Open porch area in square feet |
| EnclosedPorch | Enclosed porch area in square feet |
| 3SsnPorch | Three season porch area in square feet |
| ScreenPorch | Screen porch area in square feet |
| PoolArea | Pool area in square feet |
| PoolQC | Pool quality |
| Fence | Fence quality |
| MiscFeature | Miscellaneous feature not covered in other categories |
| MiscVal | $Value of miscellaneous feature |
| MoSold | Month Sold |
| YrSold | Year Sold |
| SaleType | Type of sale |
| SaleCondition | Condition of sale |
By exploring the data set, as shown in the below subset, we can realize that the data needs to be processed and cleaned. Data cleaning involves handling missing values, fixing features classes, or coding character features.
kable(head(allSet))
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | RL | 65 | 8450 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NA | Attchd | 2003 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2008 | WD | Normal | 208500 |
| 2 | 20 | RL | 80 | 9600 | Pave | NA | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 5 | 2007 | WD | Normal | 181500 |
| 3 | 60 | RL | 68 | 11250 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 9 | 2008 | WD | Normal | 223500 |
| 4 | 70 | RL | 60 | 9550 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 5 | 60 | RL | 84 | 14260 | Pave | NA | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 12 | 2008 | WD | Normal | 250000 |
| 6 | 50 | RL | 85 | 14115 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | 1Fam | 1.5Fin | 5 | 5 | 1993 | 1995 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | Wood | Gd | TA | No | GLQ | 732 | Unf | 0 | 64 | 796 | GasA | Ex | Y | SBrkr | 796 | 566 | 0 | 1362 | 1 | 0 | 1 | 1 | 1 | 1 | TA | 5 | Typ | 0 | NA | Attchd | 1993 | Unf | 2 | 480 | TA | TA | Y | 40 | 30 | 0 | 320 | 0 | 0 | NA | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
Missing values in data sets are problematic in data analysis, model training and prediction. Hence, we will impute missing values before going further in model training.
Let’s first define a generic function to check for missing values.
# define a function to check NAs as we will be using it frequenty to check
# what NAs are left after imputation
checkNAs <- function(df){
# identify columns (by column number) with missing values
naCols <- which(colSums(is.na(df))>0)
# get columns with missing values sorted by number of missing values
sort(colSums(sapply(allSet[naCols],is.na)), decreasing = TRUE)
}
checkNAs(allSet)
## PoolQC MiscFeature Alley Fence SalePrice
## 2909 2814 2721 2348 1459
## FireplaceQu LotFrontage GarageYrBlt GarageFinish GarageQual
## 1420 486 159 159 159
## GarageCond GarageType BsmtCond BsmtExposure BsmtQual
## 159 157 82 82 81
## BsmtFinType2 BsmtFinType1 MasVnrType MasVnrArea MSZoning
## 80 79 24 23 4
## Utilities BsmtFullBath BsmtHalfBath Functional Exterior1st
## 2 2 2 2 1
## Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 1 1 1 1
## Electrical KitchenQual GarageCars GarageArea SaleType
## 1 1 1 1 1
After thoroughly studying the features with missing values, we can classify the imputation process into:
This type of imputation is used for features that cannot be predicted from any other features in the available data set. These features are:
We will visualize the frequencies/distributions of these features to identify the most appropriate values to use in the imputation process.
# visualizing Fence
p1 <- allSet%>%
ggplot(aes(x = Fence))+
geom_histogram(stat = "count")+
geom_label(stat='count',aes(label=..count..))
# visualizing Alley
p2 <- allSet%>%
ggplot(aes(x = Alley))+
geom_histogram(stat = "count")+
geom_label(stat='count',aes(label=..count..))
# visualizing MiscFeature
p3 <- allSet%>%
ggplot(aes(x = MiscFeature))+
geom_histogram(stat = "count")+
geom_label(stat='count',aes(label=..count..))
# visualizing Utilities
p4 <- allSet%>%
ggplot(aes(x = Utilities))+
geom_histogram(stat = "count")+
geom_label(stat='count',aes(label=..count..))
# visualizing Functional
p5 <- allSet%>%
ggplot(aes(x = Functional))+
geom_histogram(stat = "count")+
geom_label(stat='count',aes(label=..count..))
# visualizing Exterior1st & Exterior2nd
p6 <- allSet%>%
ggplot(aes(x = Exterior1st))+
geom_histogram(stat = "count")+
geom_label(stat='count',aes(label=..count..))+
coord_flip()
p7 <- allSet%>%
ggplot(aes(x = Exterior2nd))+
geom_histogram(stat = "count")+
geom_label(stat='count',aes(label=..count..))+
coord_flip()
grid.arrange(p1,p2,p3,p4,p5, ncol = 2)
grid.arrange(p6,p7, ncol = 2)
Analysis and conclusion
Using the visualizations shown above, we can notice the following:
Fence feature: most of the houses have NA fence values which indicates a No Fence value according to the data_description.txt file. So, we’ll replace the NAs by None.
Alley feature: most of the houses have NA alley values which indicates a No alley access value according to the data_description.txt file. So, we’ll replace the NAs by None.
MiscFeature: most of the houses have NA MiscFeature values which indicates a no additional miscellaneous features of the house and cannot be predicted from available data. Hence, we’ll replace the NAs by None.
Utilities feature: almost all houses have AllPub utilities value. So, we’ll replace the NAs by AllPub.
Functional: we have only 2 missing values of the Functional feature. Hence, we’ll replace the missing values by the most common value (Typ).
Exterior1st and Exterior2nd: we have only one missing value for each feature. Hence, we’ll replace the missing values by the most common value (VinylSd for both)
Imputation
Based on the above analysis and conclusion, we will impute missing values for this set of features.
allSet$Fence[is.na(allSet$Fence)] <- "None"
allSet$Alley[is.na(allSet$Alley)] <- "None"
allSet$MiscFeature[is.na(allSet$MiscFeature)] <- "None"
allSet$Utilities[is.na(allSet$Utilities)] <- "AllPub"
allSet$Functional[is.na(allSet$Functional)] <- "Typ"
allSet$Exterior1st[is.na(allSet$Exterior1st)] <- "VinylSd"
allSet$Exterior2nd[is.na(allSet$Exterior2nd)] <- "VinylSd"
This type of imputation is used for features that can be predicted from other features in the available data set. These features are:
# check number of houses with missing (NA) Fireplace quality and zero fireplaces.
fpNoQul <- allSet[is.na(allSet$FireplaceQu) & allSet$Fireplaces == 0 ,c("Fireplaces","FireplaceQu")]
# such houses should have "None" values in the FireplaceQu feature.
allSet$FireplaceQu[is.na(allSet$FireplaceQu) & allSet$Fireplaces == 0] <- "None"
# check that houses with no fireplaces all have "None" value for FireplaceQu
fp <- allSet[allSet$Fireplaces == 0 ,c("Fireplaces","FireplaceQu")]
table(fp)
## FireplaceQu
## Fireplaces None
## 0 1420
Let’s see medians of lot frontages for each neighborhood and use them to impute missing lot frontage values.
# get median of lot frontage for each neighborhood
medNgbrLotFr <- allSet[!is.na(allSet$Neighborhood), c("Neighborhood","LotFrontage")] %>%
group_by(Neighborhood) %>%
summarize(median = median(LotFrontage, na.rm = T))
medNgbrLotFr
## # A tibble: 25 x 2
## Neighborhood median
## <chr> <dbl>
## 1 Blmngtn 43.0
## 2 Blueste 24.0
## 3 BrDale 21.0
## 4 BrkSide 51.0
## 5 ClearCr 80.5
## 6 CollgCr 70.0
## 7 Crawfor 70.0
## 8 Edwards 65.0
## 9 Gilbert 64.0
## 10 IDOTRR 60.0
## # ... with 15 more rows
# replace missing lot frontage values with the median of lot frontage with the same neighborhood
rIndx <- which(is.na(allSet$LotFrontage))
# run a vlookup-like loop to get median value for each missing value by linkage to neighborhood
for(i in rIndx){
medVal <- medNgbrLotFr[medNgbrLotFr$Neighborhood == allSet$Neighborhood[i],"median"]
allSet$LotFrontage[i] <- medVal[[1]]
}
Let’s check the masonry veneer type and area (MasVnrType and MasVnrArea) missing values.
allSet[is.na(allSet$MasVnrType) | is.na(allSet$MasVnrArea), c("MasVnrType","MasVnrArea")]
## MasVnrType MasVnrArea
## 235 <NA> NA
## 530 <NA> NA
## 651 <NA> NA
## 937 <NA> NA
## 974 <NA> NA
## 978 <NA> NA
## 1244 <NA> NA
## 1279 <NA> NA
## 1692 <NA> NA
## 1707 <NA> NA
## 1883 <NA> NA
## 1993 <NA> NA
## 2005 <NA> NA
## 2042 <NA> NA
## 2312 <NA> NA
## 2326 <NA> NA
## 2341 <NA> NA
## 2350 <NA> NA
## 2369 <NA> NA
## 2593 <NA> NA
## 2611 <NA> 198
## 2658 <NA> NA
## 2687 <NA> NA
## 2863 <NA> NA
Out of 24 houses, we have 23 with missing values in both features. So, we’ll replace missing values where both are missing in the house by None in the MasVnrType and by zero in the MasVnrArea.
allSet$MasVnrType[is.na(allSet$MasVnrType) & is.na(allSet$MasVnrArea)] <- "None"
allSet$MasVnrArea[is.na(allSet$MasVnrArea)] <- 0
For the house with missing type but with recorded area, we will use the median of Masonry areas across the data to estimate its masonry type.
missArea <- allSet$MasVnrArea[is.na(allSet$MasVnrType)]
allSet[allSet$MasVnrType != "None", c("MasVnrType","MasVnrArea")] %>%
ggplot(aes(x = MasVnrType, y = MasVnrArea)) +
geom_boxplot()+
geom_hline(yintercept = missArea, color = "red")
medMason <- allSet[allSet$MasVnrType != "None", c("MasVnrType","MasVnrArea")] %>%
group_by(MasVnrType) %>%
summarize(median = median(MasVnrArea))
medMason
## # A tibble: 4 x 2
## MasVnrType median
## <chr> <dbl>
## 1 BrkCmn 161
## 2 BrkFace 203
## 3 Stone 200
## 4 <NA> NA
It’s obvious that the nearest masonry type to the missing values is Stone. Hence, we will replace the missing value by Stone.
rIndx <- which(is.na(allSet$MasVnrType))
for(i in rIndx){
medVal <- medMason[which((abs(medMason$median - allSet$MasVnrArea[i])) == min(abs(medMason$median - allSet$MasVnrArea[i]), na.rm = T)),"MasVnrType"]
allSet$MasVnrType[i] <- medVal[[1]]
}
allSet[is.na(allSet$MSZoning), c("MSSubClass","MSZoning")]
## MSSubClass MSZoning
## 1916 30 <NA>
## 2217 20 <NA>
## 2251 70 <NA>
## 2905 20 <NA>
# distribution of house type versus its zoning classification
# for the three dwelling types of the missing zoning values
missZoning <- unique(allSet$MSSubClass[is.na(allSet$MSZoning)])
allSet[!is.na(allSet$MSZoning) & allSet$MSSubClass %in% missZoning, c("MSZoning","MSSubClass")] %>%
ggplot(aes(x = MSZoning, fill = factor(MSSubClass)))+
geom_histogram(stat = "count")
From the histogram above we can see that house types of 70 and 30 can have a zoning classification of RM while the 20 type is of zoning RL. So, the values will be imputed accordingly.
allSet$MSZoning[is.na(allSet$MSZoning) & allSet$MSSubClass %in% c(70,30)] <- "RM"
allSet$MSZoning[is.na(allSet$MSZoning) & allSet$MSSubClass == 20] <- "RL"
allSet[is.na(allSet$SaleType), c("SaleType","SaleCondition")]
## SaleType SaleCondition
## 2490 <NA> Normal
We have one house with missing Sale Type value having a Normal Sale Condition. Let’s explore the relationship between both feautres.
allSet[!is.na(allSet$SaleType) & !is.na(allSet$SaleCondition), c("SaleType","SaleCondition")] %>%
ggplot(aes(x = SaleType, fill = factor(SaleCondition)))+
geom_histogram(stat = "count")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We can see that most Normal SaleCondition have a WD SaleType. Hence, we’ll impute the missing value with the same.
allSet$SaleType[is.na(allSet$SaleType)] <- "WD"
allSet[is.na(allSet$Electrical), c("OverallCond","Electrical")]
## OverallCond Electrical
## 1380 5 <NA>
We have 1 house with missing Electrical value having a 5 rate of Overall Condition. Let’s explore the relationship between both features.
allSet[!is.na(allSet$Electrical), c("OverallCond","Electrical")] %>%
ggplot(aes(x = factor(OverallCond), fill = factor(Electrical)))+
geom_histogram(stat = "count")
Most of 5-rating Overall Condition houses have SBrkr electrical system. Hence, we’ll impute the missing value.
allSet$Electrical[is.na(allSet$Electrical)] <- "SBrkr"
allSet[is.na(allSet$KitchenQual), c("KitchenAbvGr","KitchenQual")]
## KitchenAbvGr KitchenQual
## 1556 1 <NA>
allSet[!is.na(allSet$KitchenQual), c("KitchenAbvGr","KitchenQual")] %>%
ggplot(aes(x = KitchenAbvGr, fill = factor(KitchenQual)))+
geom_histogram(stat = "count")
For a KitchenAbvGr of 1, the most likely value for a KitchenQual is TA. Hence, we’ll impute the missing value with TA.
allSet$KitchenQual[is.na(allSet$KitchenQual)] <- "TA"
allSet[is.na(allSet$PoolQC),c("PoolQC","PoolArea")] %>%
group_by(PoolArea) %>%
summarize(count = n())
## # A tibble: 4 x 2
## PoolArea count
## <int> <int>
## 1 0 2906
## 2 368 1
## 3 444 1
## 4 561 1
Out of the 2909 houses with missing PoolQC values, we have 2906 with zero PoolArea. This lets us assume with high confidence that houses with zero PoolAreas do not have pools; hence, these PoolQC values can be imputed with None.
allSet$PoolQC[allSet$PoolArea == 0] <- "None"
However, we have three houses with missing PoolQC values while having non-zero PoolAreas.
allSet[is.na(allSet$PoolQC) & allSet$PoolArea >0,c("PoolQC","PoolArea")]
## PoolQC PoolArea
## 2421 <NA> 368
## 2504 <NA> 444
## 2600 <NA> 561
Since PoolArea is a likely predictor to PoolQC, we can impute the missing PoolQC values with corresponding values of other houses having means of PoolAreas near to those of the missing PoolQC values.
# calculate the means of PoolAreas for each of the PoolQC in the data set
meanArea <- allSet[!is.na(allSet$PoolQC),c("PoolQC","PoolArea")] %>%
group_by(PoolQC) %>%
summarize(AreaMean = round(mean(PoolArea),0))
meanArea
## # A tibble: 4 x 2
## PoolQC AreaMean
## <chr> <dbl>
## 1 Ex 360
## 2 Fa 584
## 3 Gd 648
## 4 None 0
Impute missing PoolQC values with the values of nearest PoolArea means.
rIndx <- which(is.na(allSet$PoolQC) & allSet$PoolArea >0)
for(i in rIndx){
poolQc <- meanArea[which((abs(meanArea$AreaMean - allSet$PoolArea[i])) == min(abs(meanArea$AreaMean - allSet$PoolArea[i]), na.rm = T)),"PoolQC"]
allSet$PoolQC[i] <- poolQc[[1]]
}
# identify all Garage features as those starting with Garage text
garageCols <- names(allSet)[grepl("Garage.*", names(allSet))]
garageCols
## [1] "GarageType" "GarageYrBlt" "GarageFinish" "GarageCars"
## [5] "GarageArea" "GarageQual" "GarageCond"
By studying garage features description in the data_description.txt file, we see that garage features: GarageType, GarageFinish, GarageQual, and GarageCond all have NA values to indicate No-Garage rather than a missing value recorded. Yet, we need to change the NAs to None for these features, and to zero for the integer-type features: GarageYrBlt, * GarageCars* and * GarageArea*. This is to avoid problems in modeling the data.
First, we will consider houses with a complete set of NAs or zeros- as explained above- as houses with None values.
# find indexes of houses having all Garage features flagged with NAs or zeros
noGarage <- which((is.na(allSet$GarageArea) | allSet$GarageArea == 0)
& (is.na(allSet$GarageCars) | allSet$GarageCars == 0)
& is.na(allSet$GarageCond)
& is.na(allSet$GarageFinish)
& is.na(allSet$GarageQual)
& is.na(allSet$GarageType)
& (is.na(allSet$GarageYrBlt) | allSet$GarageYrBlt == 0))
allSet[noGarage,garageCols]
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 40 <NA> NA <NA> 0 0 <NA>
## 49 <NA> NA <NA> 0 0 <NA>
## 79 <NA> NA <NA> 0 0 <NA>
## 89 <NA> NA <NA> 0 0 <NA>
## 90 <NA> NA <NA> 0 0 <NA>
## 100 <NA> NA <NA> 0 0 <NA>
## 109 <NA> NA <NA> 0 0 <NA>
## 126 <NA> NA <NA> 0 0 <NA>
## 128 <NA> NA <NA> 0 0 <NA>
## 141 <NA> NA <NA> 0 0 <NA>
## 149 <NA> NA <NA> 0 0 <NA>
## 156 <NA> NA <NA> 0 0 <NA>
## 164 <NA> NA <NA> 0 0 <NA>
## 166 <NA> NA <NA> 0 0 <NA>
## 199 <NA> NA <NA> 0 0 <NA>
## 211 <NA> NA <NA> 0 0 <NA>
## 242 <NA> NA <NA> 0 0 <NA>
## 251 <NA> NA <NA> 0 0 <NA>
## 288 <NA> NA <NA> 0 0 <NA>
## 292 <NA> NA <NA> 0 0 <NA>
## 308 <NA> NA <NA> 0 0 <NA>
## 376 <NA> NA <NA> 0 0 <NA>
## 387 <NA> NA <NA> 0 0 <NA>
## 394 <NA> NA <NA> 0 0 <NA>
## 432 <NA> NA <NA> 0 0 <NA>
## 435 <NA> NA <NA> 0 0 <NA>
## 442 <NA> NA <NA> 0 0 <NA>
## 465 <NA> NA <NA> 0 0 <NA>
## 496 <NA> NA <NA> 0 0 <NA>
## 521 <NA> NA <NA> 0 0 <NA>
## 529 <NA> NA <NA> 0 0 <NA>
## 534 <NA> NA <NA> 0 0 <NA>
## 536 <NA> NA <NA> 0 0 <NA>
## 563 <NA> NA <NA> 0 0 <NA>
## 583 <NA> NA <NA> 0 0 <NA>
## 614 <NA> NA <NA> 0 0 <NA>
## 615 <NA> NA <NA> 0 0 <NA>
## 621 <NA> NA <NA> 0 0 <NA>
## 636 <NA> NA <NA> 0 0 <NA>
## 637 <NA> NA <NA> 0 0 <NA>
## 639 <NA> NA <NA> 0 0 <NA>
## 650 <NA> NA <NA> 0 0 <NA>
## 706 <NA> NA <NA> 0 0 <NA>
## 711 <NA> NA <NA> 0 0 <NA>
## 739 <NA> NA <NA> 0 0 <NA>
## 751 <NA> NA <NA> 0 0 <NA>
## 785 <NA> NA <NA> 0 0 <NA>
## 827 <NA> NA <NA> 0 0 <NA>
## 844 <NA> NA <NA> 0 0 <NA>
## 922 <NA> NA <NA> 0 0 <NA>
## 943 <NA> NA <NA> 0 0 <NA>
## 955 <NA> NA <NA> 0 0 <NA>
## 961 <NA> NA <NA> 0 0 <NA>
## 969 <NA> NA <NA> 0 0 <NA>
## 971 <NA> NA <NA> 0 0 <NA>
## 977 <NA> NA <NA> 0 0 <NA>
## 1010 <NA> NA <NA> 0 0 <NA>
## 1012 <NA> NA <NA> 0 0 <NA>
## 1031 <NA> NA <NA> 0 0 <NA>
## 1039 <NA> NA <NA> 0 0 <NA>
## 1097 <NA> NA <NA> 0 0 <NA>
## 1124 <NA> NA <NA> 0 0 <NA>
## 1132 <NA> NA <NA> 0 0 <NA>
## 1138 <NA> NA <NA> 0 0 <NA>
## 1144 <NA> NA <NA> 0 0 <NA>
## 1174 <NA> NA <NA> 0 0 <NA>
## 1180 <NA> NA <NA> 0 0 <NA>
## 1219 <NA> NA <NA> 0 0 <NA>
## 1220 <NA> NA <NA> 0 0 <NA>
## 1235 <NA> NA <NA> 0 0 <NA>
## 1258 <NA> NA <NA> 0 0 <NA>
## 1284 <NA> NA <NA> 0 0 <NA>
## 1324 <NA> NA <NA> 0 0 <NA>
## 1326 <NA> NA <NA> 0 0 <NA>
## 1327 <NA> NA <NA> 0 0 <NA>
## 1338 <NA> NA <NA> 0 0 <NA>
## 1350 <NA> NA <NA> 0 0 <NA>
## 1408 <NA> NA <NA> 0 0 <NA>
## 1450 <NA> NA <NA> 0 0 <NA>
## 1451 <NA> NA <NA> 0 0 <NA>
## 1454 <NA> NA <NA> 0 0 <NA>
## 1514 <NA> NA <NA> 0 0 <NA>
## 1532 <NA> NA <NA> 0 0 <NA>
## 1540 <NA> NA <NA> 0 0 <NA>
## 1553 <NA> NA <NA> 0 0 <NA>
## 1557 <NA> NA <NA> 0 0 <NA>
## 1559 <NA> NA <NA> 0 0 <NA>
## 1561 <NA> NA <NA> 0 0 <NA>
## 1591 <NA> NA <NA> 0 0 <NA>
## 1594 <NA> NA <NA> 0 0 <NA>
## 1595 <NA> NA <NA> 0 0 <NA>
## 1615 <NA> NA <NA> 0 0 <NA>
## 1616 <NA> NA <NA> 0 0 <NA>
## 1718 <NA> NA <NA> 0 0 <NA>
## 1722 <NA> NA <NA> 0 0 <NA>
## 1788 <NA> NA <NA> 0 0 <NA>
## 1809 <NA> NA <NA> 0 0 <NA>
## 1811 <NA> NA <NA> 0 0 <NA>
## 1812 <NA> NA <NA> 0 0 <NA>
## 1820 <NA> NA <NA> 0 0 <NA>
## 1823 <NA> NA <NA> 0 0 <NA>
## 1832 <NA> NA <NA> 0 0 <NA>
## 1835 <NA> NA <NA> 0 0 <NA>
## 1837 <NA> NA <NA> 0 0 <NA>
## 1840 <NA> NA <NA> 0 0 <NA>
## 1848 <NA> NA <NA> 0 0 <NA>
## 1894 <NA> NA <NA> 0 0 <NA>
## 2011 <NA> NA <NA> 0 0 <NA>
## 2082 <NA> NA <NA> 0 0 <NA>
## 2091 <NA> NA <NA> 0 0 <NA>
## 2094 <NA> NA <NA> 0 0 <NA>
## 2097 <NA> NA <NA> 0 0 <NA>
## 2100 <NA> NA <NA> 0 0 <NA>
## 2105 <NA> NA <NA> 0 0 <NA>
## 2136 <NA> NA <NA> 0 0 <NA>
## 2152 <NA> NA <NA> 0 0 <NA>
## 2154 <NA> NA <NA> 0 0 <NA>
## 2190 <NA> NA <NA> 0 0 <NA>
## 2191 <NA> NA <NA> 0 0 <NA>
## 2192 <NA> NA <NA> 0 0 <NA>
## 2193 <NA> NA <NA> 0 0 <NA>
## 2194 <NA> NA <NA> 0 0 <NA>
## 2213 <NA> NA <NA> 0 0 <NA>
## 2239 <NA> NA <NA> 0 0 <NA>
## 2247 <NA> NA <NA> 0 0 <NA>
## 2354 <NA> NA <NA> 0 0 <NA>
## 2355 <NA> NA <NA> 0 0 <NA>
## 2399 <NA> NA <NA> 0 0 <NA>
## 2400 <NA> NA <NA> 0 0 <NA>
## 2423 <NA> NA <NA> 0 0 <NA>
## 2427 <NA> NA <NA> 0 0 <NA>
## 2553 <NA> NA <NA> 0 0 <NA>
## 2554 <NA> NA <NA> 0 0 <NA>
## 2558 <NA> NA <NA> 0 0 <NA>
## 2576 <NA> NA <NA> 0 0 <NA>
## 2580 <NA> NA <NA> 0 0 <NA>
## 2604 <NA> NA <NA> 0 0 <NA>
## 2610 <NA> NA <NA> 0 0 <NA>
## 2692 <NA> NA <NA> 0 0 <NA>
## 2694 <NA> NA <NA> 0 0 <NA>
## 2709 <NA> NA <NA> 0 0 <NA>
## 2768 <NA> NA <NA> 0 0 <NA>
## 2772 <NA> NA <NA> 0 0 <NA>
## 2790 <NA> NA <NA> 0 0 <NA>
## 2792 <NA> NA <NA> 0 0 <NA>
## 2800 <NA> NA <NA> 0 0 <NA>
## 2860 <NA> NA <NA> 0 0 <NA>
## 2863 <NA> NA <NA> 0 0 <NA>
## 2871 <NA> NA <NA> 0 0 <NA>
## 2889 <NA> NA <NA> 0 0 <NA>
## 2892 <NA> NA <NA> 0 0 <NA>
## 2893 <NA> NA <NA> 0 0 <NA>
## 2894 <NA> NA <NA> 0 0 <NA>
## 2910 <NA> NA <NA> 0 0 <NA>
## 2914 <NA> NA <NA> 0 0 <NA>
## 2915 <NA> NA <NA> 0 0 <NA>
## 2918 <NA> NA <NA> 0 0 <NA>
## GarageCond
## 40 <NA>
## 49 <NA>
## 79 <NA>
## 89 <NA>
## 90 <NA>
## 100 <NA>
## 109 <NA>
## 126 <NA>
## 128 <NA>
## 141 <NA>
## 149 <NA>
## 156 <NA>
## 164 <NA>
## 166 <NA>
## 199 <NA>
## 211 <NA>
## 242 <NA>
## 251 <NA>
## 288 <NA>
## 292 <NA>
## 308 <NA>
## 376 <NA>
## 387 <NA>
## 394 <NA>
## 432 <NA>
## 435 <NA>
## 442 <NA>
## 465 <NA>
## 496 <NA>
## 521 <NA>
## 529 <NA>
## 534 <NA>
## 536 <NA>
## 563 <NA>
## 583 <NA>
## 614 <NA>
## 615 <NA>
## 621 <NA>
## 636 <NA>
## 637 <NA>
## 639 <NA>
## 650 <NA>
## 706 <NA>
## 711 <NA>
## 739 <NA>
## 751 <NA>
## 785 <NA>
## 827 <NA>
## 844 <NA>
## 922 <NA>
## 943 <NA>
## 955 <NA>
## 961 <NA>
## 969 <NA>
## 971 <NA>
## 977 <NA>
## 1010 <NA>
## 1012 <NA>
## 1031 <NA>
## 1039 <NA>
## 1097 <NA>
## 1124 <NA>
## 1132 <NA>
## 1138 <NA>
## 1144 <NA>
## 1174 <NA>
## 1180 <NA>
## 1219 <NA>
## 1220 <NA>
## 1235 <NA>
## 1258 <NA>
## 1284 <NA>
## 1324 <NA>
## 1326 <NA>
## 1327 <NA>
## 1338 <NA>
## 1350 <NA>
## 1408 <NA>
## 1450 <NA>
## 1451 <NA>
## 1454 <NA>
## 1514 <NA>
## 1532 <NA>
## 1540 <NA>
## 1553 <NA>
## 1557 <NA>
## 1559 <NA>
## 1561 <NA>
## 1591 <NA>
## 1594 <NA>
## 1595 <NA>
## 1615 <NA>
## 1616 <NA>
## 1718 <NA>
## 1722 <NA>
## 1788 <NA>
## 1809 <NA>
## 1811 <NA>
## 1812 <NA>
## 1820 <NA>
## 1823 <NA>
## 1832 <NA>
## 1835 <NA>
## 1837 <NA>
## 1840 <NA>
## 1848 <NA>
## 1894 <NA>
## 2011 <NA>
## 2082 <NA>
## 2091 <NA>
## 2094 <NA>
## 2097 <NA>
## 2100 <NA>
## 2105 <NA>
## 2136 <NA>
## 2152 <NA>
## 2154 <NA>
## 2190 <NA>
## 2191 <NA>
## 2192 <NA>
## 2193 <NA>
## 2194 <NA>
## 2213 <NA>
## 2239 <NA>
## 2247 <NA>
## 2354 <NA>
## 2355 <NA>
## 2399 <NA>
## 2400 <NA>
## 2423 <NA>
## 2427 <NA>
## 2553 <NA>
## 2554 <NA>
## 2558 <NA>
## 2576 <NA>
## 2580 <NA>
## 2604 <NA>
## 2610 <NA>
## 2692 <NA>
## 2694 <NA>
## 2709 <NA>
## 2768 <NA>
## 2772 <NA>
## 2790 <NA>
## 2792 <NA>
## 2800 <NA>
## 2860 <NA>
## 2863 <NA>
## 2871 <NA>
## 2889 <NA>
## 2892 <NA>
## 2893 <NA>
## 2894 <NA>
## 2910 <NA>
## 2914 <NA>
## 2915 <NA>
## 2918 <NA>
# impute with NAs or zeros
allSet[noGarage,c("GarageType","GarageFinish","GarageQual","GarageCond")] <- "None"
allSet[noGarage, c("GarageYrBlt","GarageCars","GarageArea")] <- 0
Let’s see all houses with at least one garage feature having a missing value. Such houses do not indicate a house with no garage, and their missing values need to be imputed with reasonable values.
# find indexes of houses having at least one missing Garage value
missGarage <- which(is.na(allSet$GarageArea)
| is.na(allSet$GarageCars)
| is.na(allSet$GarageCond)
| is.na(allSet$GarageFinish)
| is.na(allSet$GarageQual)
| is.na(allSet$GarageType)
| is.na(allSet$GarageYrBlt))
allSet[missGarage,garageCols]
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 2127 Detchd NA <NA> 1 360 <NA>
## 2577 Detchd NA <NA> NA NA <NA>
## GarageCond
## 2127 <NA>
## 2577 <NA>
We have only two houses with garages but having missing values in some features. For both houses, we can use the YearBuilt feature to impute the GarageYrBlt value as most of garages are built with their houses.
allSet[missGarage,"GarageYrBlt"] <- allSet[missGarage,"YearBuilt"]
House 2577 is missing more values than house 2127. Hence, we will impute its missing values first based on the matching values of houses of the same GarageType and GarageYrBlt.
# index houses with missing garageCars or area
miss <- which(is.na(allSet$GarageCars) | is.na(allSet$GarageArea))
# group houses by garage features
grpdVals1 <- allSet%>%
group_by(GarageType, GarageYrBlt, GarageFinish, GarageQual, GarageCond) %>%
summarize(medCars = median(GarageCars), meanArea = round(mean(GarageArea),0),count = n()) %>%
arrange(desc(count))
comp <- complete.cases(grpdVals1)
grpdVals1 <- grpdVals1[comp,]
grpdVals1
## # A tibble: 601 x 8
## # Groups: GarageType, GarageYrBlt, GarageFinish, GarageQual [553]
## GarageType GarageYrBlt GarageFinish GarageQual GarageCond medCars
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 None 0 None None None 0
## 2 Attchd 2006 RFn TA TA 2.00
## 3 Attchd 2005 Fin TA TA 2.00
## 4 Attchd 2007 Fin TA TA 3.00
## 5 Attchd 2005 RFn TA TA 2.00
## 6 Attchd 2006 Fin TA TA 2.00
## 7 Attchd 2004 Fin TA TA 2.00
## 8 Attchd 2007 RFn TA TA 2.00
## 9 Attchd 2003 Fin TA TA 2.00
## 10 Attchd 2008 Fin TA TA 3.00
## # ... with 591 more rows, and 2 more variables: meanArea <dbl>, count
## # <int>
# look-up missing values based on houses with the same type and year built
rIndx <- miss
for(i in rIndx){
missVals <- grpdVals1[which((grpdVals1$GarageYrBlt == allSet$GarageYrBlt[i]) & (grpdVals1$GarageType == allSet$GarageType[i])),c("GarageFinish", "medCars", "meanArea", "GarageQual","GarageCond")]
allSet$GarageFinish[i] <- missVals[[1]][1]
allSet$GarageCars[i] <- missVals[[2]][1]
allSet$GarageArea[i] <- missVals[[3]][1]
allSet$GarageQual[i] <- missVals[[4]][1]
allSet$GarageCond[i] <- missVals[[5]][1]
}
House 2127 has a garage with and area of 360 square feet and 1-GarageCars value. So, we will impute its missing values (GarageFinish, GarageQual and GarageCond) with those having similar GarageArea, GarageCars, and GarageType.
grpdVals <- allSet[allSet$GarageType == "Detchd" & allSet$GarageCars == 1,garageCols] %>%
group_by(GarageFinish, GarageQual, GarageCond) %>%
summarize(meanArea = round(mean(GarageArea),0),count = n()) %>%
arrange(desc(count))
comp <- complete.cases(grpdVals)
grpdVals <- grpdVals[comp,]
grpdVals
## # A tibble: 14 x 5
## # Groups: GarageFinish, GarageQual [8]
## GarageFinish GarageQual GarageCond meanArea count
## <chr> <chr> <chr> <dbl> <int>
## 1 Unf TA TA 294 269
## 2 Unf Fa TA 248 40
## 3 Unf Fa Fa 226 25
## 4 Unf TA Fa 258 18
## 5 RFn TA TA 373 6
## 6 Fin TA TA 327 4
## 7 Unf Fa Po 246 4
## 8 Unf Po Po 301 4
## 9 Fin Ex Ex 924 1
## 10 Unf Ex Ex 300 1
## 11 Unf Gd TA 352 1
## 12 Unf Po Fa 195 1
## 13 Unf TA Gd 440 1
## 14 Unf TA Po 216 1
Impute missing values with the nearest values found in the table above.
rIndx <- missGarage
for(i in rIndx){
missVals <- grpdVals[which((abs(grpdVals$meanArea - allSet$GarageArea[i])) == min(abs(grpdVals$meanArea - allSet$GarageArea[i]),na.rm = T)),c("GarageFinish", "GarageQual","GarageCond")]
allSet$GarageFinish[i] <- missVals[[1]][1]
allSet$GarageQual[i] <- missVals[[2]][1]
allSet$GarageCond[i] <- missVals[[3]][1]
}
# identify all basement features as those containing Bsmt text
bsmtCols <- names(allSet)[grepl("Bsmt.*", names(allSet))]
bsmtCols
## [1] "BsmtQual" "BsmtCond" "BsmtExposure" "BsmtFinType1"
## [5] "BsmtFinSF1" "BsmtFinType2" "BsmtFinSF2" "BsmtUnfSF"
## [9] "TotalBsmtSF" "BsmtFullBath" "BsmtHalfBath"
First, let’s check missing values in all Basement features. Assuming that if a house has NA values in all basement features, then the house has no basement. Basement values of such houses will be replaced by None for string features and by zero for numeric features.
# check all Bsmt houses having all Bsmt features NAs or zeros
bsmtAllNa <- which((allSet[,bsmtCols[1]]== 0 | is.na(allSet[,bsmtCols[1]])) &
(allSet[,bsmtCols[2]]== 0 | is.na(allSet[,bsmtCols[2]])) &
(allSet[,bsmtCols[3]]== 0 | is.na(allSet[,bsmtCols[3]])) &
(allSet[,bsmtCols[4]]== 0 | is.na(allSet[,bsmtCols[4]])) &
(allSet[,bsmtCols[5]]== 0 | is.na(allSet[,bsmtCols[5]])) &
(allSet[,bsmtCols[6]]== 0 | is.na(allSet[,bsmtCols[6]])) &
(allSet[,bsmtCols[7]]== 0 | is.na(allSet[,bsmtCols[7]])) &
(allSet[,bsmtCols[8]]== 0 | is.na(allSet[,bsmtCols[8]])) &
(allSet[,bsmtCols[9]]== 0 | is.na(allSet[,bsmtCols[9]])) &
(allSet[,bsmtCols[10]]== 0 | is.na(allSet[,bsmtCols[10]])) &
(allSet[,bsmtCols[11]]== 0 | is.na(allSet[,bsmtCols[11]]))
)
allSet[bsmtAllNa,bsmtCols]
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 18 <NA> <NA> <NA> <NA> 0 <NA>
## 40 <NA> <NA> <NA> <NA> 0 <NA>
## 91 <NA> <NA> <NA> <NA> 0 <NA>
## 103 <NA> <NA> <NA> <NA> 0 <NA>
## 157 <NA> <NA> <NA> <NA> 0 <NA>
## 183 <NA> <NA> <NA> <NA> 0 <NA>
## 260 <NA> <NA> <NA> <NA> 0 <NA>
## 343 <NA> <NA> <NA> <NA> 0 <NA>
## 363 <NA> <NA> <NA> <NA> 0 <NA>
## 372 <NA> <NA> <NA> <NA> 0 <NA>
## 393 <NA> <NA> <NA> <NA> 0 <NA>
## 521 <NA> <NA> <NA> <NA> 0 <NA>
## 533 <NA> <NA> <NA> <NA> 0 <NA>
## 534 <NA> <NA> <NA> <NA> 0 <NA>
## 554 <NA> <NA> <NA> <NA> 0 <NA>
## 647 <NA> <NA> <NA> <NA> 0 <NA>
## 706 <NA> <NA> <NA> <NA> 0 <NA>
## 737 <NA> <NA> <NA> <NA> 0 <NA>
## 750 <NA> <NA> <NA> <NA> 0 <NA>
## 779 <NA> <NA> <NA> <NA> 0 <NA>
## 869 <NA> <NA> <NA> <NA> 0 <NA>
## 895 <NA> <NA> <NA> <NA> 0 <NA>
## 898 <NA> <NA> <NA> <NA> 0 <NA>
## 985 <NA> <NA> <NA> <NA> 0 <NA>
## 1001 <NA> <NA> <NA> <NA> 0 <NA>
## 1012 <NA> <NA> <NA> <NA> 0 <NA>
## 1036 <NA> <NA> <NA> <NA> 0 <NA>
## 1046 <NA> <NA> <NA> <NA> 0 <NA>
## 1049 <NA> <NA> <NA> <NA> 0 <NA>
## 1050 <NA> <NA> <NA> <NA> 0 <NA>
## 1091 <NA> <NA> <NA> <NA> 0 <NA>
## 1180 <NA> <NA> <NA> <NA> 0 <NA>
## 1217 <NA> <NA> <NA> <NA> 0 <NA>
## 1219 <NA> <NA> <NA> <NA> 0 <NA>
## 1233 <NA> <NA> <NA> <NA> 0 <NA>
## 1322 <NA> <NA> <NA> <NA> 0 <NA>
## 1413 <NA> <NA> <NA> <NA> 0 <NA>
## 1586 <NA> <NA> <NA> <NA> 0 <NA>
## 1594 <NA> <NA> <NA> <NA> 0 <NA>
## 1730 <NA> <NA> <NA> <NA> 0 <NA>
## 1779 <NA> <NA> <NA> <NA> 0 <NA>
## 1815 <NA> <NA> <NA> <NA> 0 <NA>
## 1848 <NA> <NA> <NA> <NA> 0 <NA>
## 1849 <NA> <NA> <NA> <NA> 0 <NA>
## 1857 <NA> <NA> <NA> <NA> 0 <NA>
## 1858 <NA> <NA> <NA> <NA> 0 <NA>
## 1859 <NA> <NA> <NA> <NA> 0 <NA>
## 1861 <NA> <NA> <NA> <NA> 0 <NA>
## 1916 <NA> <NA> <NA> <NA> 0 <NA>
## 2051 <NA> <NA> <NA> <NA> 0 <NA>
## 2067 <NA> <NA> <NA> <NA> 0 <NA>
## 2069 <NA> <NA> <NA> <NA> 0 <NA>
## 2121 <NA> <NA> <NA> <NA> NA <NA>
## 2123 <NA> <NA> <NA> <NA> 0 <NA>
## 2189 <NA> <NA> <NA> <NA> 0 <NA>
## 2190 <NA> <NA> <NA> <NA> 0 <NA>
## 2191 <NA> <NA> <NA> <NA> 0 <NA>
## 2194 <NA> <NA> <NA> <NA> 0 <NA>
## 2217 <NA> <NA> <NA> <NA> 0 <NA>
## 2225 <NA> <NA> <NA> <NA> 0 <NA>
## 2388 <NA> <NA> <NA> <NA> 0 <NA>
## 2436 <NA> <NA> <NA> <NA> 0 <NA>
## 2453 <NA> <NA> <NA> <NA> 0 <NA>
## 2454 <NA> <NA> <NA> <NA> 0 <NA>
## 2491 <NA> <NA> <NA> <NA> 0 <NA>
## 2499 <NA> <NA> <NA> <NA> 0 <NA>
## 2548 <NA> <NA> <NA> <NA> 0 <NA>
## 2553 <NA> <NA> <NA> <NA> 0 <NA>
## 2565 <NA> <NA> <NA> <NA> 0 <NA>
## 2579 <NA> <NA> <NA> <NA> 0 <NA>
## 2600 <NA> <NA> <NA> <NA> 0 <NA>
## 2703 <NA> <NA> <NA> <NA> 0 <NA>
## 2764 <NA> <NA> <NA> <NA> 0 <NA>
## 2767 <NA> <NA> <NA> <NA> 0 <NA>
## 2804 <NA> <NA> <NA> <NA> 0 <NA>
## 2805 <NA> <NA> <NA> <NA> 0 <NA>
## 2825 <NA> <NA> <NA> <NA> 0 <NA>
## 2892 <NA> <NA> <NA> <NA> 0 <NA>
## 2905 <NA> <NA> <NA> <NA> 0 <NA>
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
## 18 0 0 0 0 0
## 40 0 0 0 0 0
## 91 0 0 0 0 0
## 103 0 0 0 0 0
## 157 0 0 0 0 0
## 183 0 0 0 0 0
## 260 0 0 0 0 0
## 343 0 0 0 0 0
## 363 0 0 0 0 0
## 372 0 0 0 0 0
## 393 0 0 0 0 0
## 521 0 0 0 0 0
## 533 0 0 0 0 0
## 534 0 0 0 0 0
## 554 0 0 0 0 0
## 647 0 0 0 0 0
## 706 0 0 0 0 0
## 737 0 0 0 0 0
## 750 0 0 0 0 0
## 779 0 0 0 0 0
## 869 0 0 0 0 0
## 895 0 0 0 0 0
## 898 0 0 0 0 0
## 985 0 0 0 0 0
## 1001 0 0 0 0 0
## 1012 0 0 0 0 0
## 1036 0 0 0 0 0
## 1046 0 0 0 0 0
## 1049 0 0 0 0 0
## 1050 0 0 0 0 0
## 1091 0 0 0 0 0
## 1180 0 0 0 0 0
## 1217 0 0 0 0 0
## 1219 0 0 0 0 0
## 1233 0 0 0 0 0
## 1322 0 0 0 0 0
## 1413 0 0 0 0 0
## 1586 0 0 0 0 0
## 1594 0 0 0 0 0
## 1730 0 0 0 0 0
## 1779 0 0 0 0 0
## 1815 0 0 0 0 0
## 1848 0 0 0 0 0
## 1849 0 0 0 0 0
## 1857 0 0 0 0 0
## 1858 0 0 0 0 0
## 1859 0 0 0 0 0
## 1861 0 0 0 0 0
## 1916 0 0 0 0 0
## 2051 0 0 0 0 0
## 2067 0 0 0 0 0
## 2069 0 0 0 0 0
## 2121 NA NA NA NA NA
## 2123 0 0 0 0 0
## 2189 0 0 0 NA NA
## 2190 0 0 0 0 0
## 2191 0 0 0 0 0
## 2194 0 0 0 0 0
## 2217 0 0 0 0 0
## 2225 0 0 0 0 0
## 2388 0 0 0 0 0
## 2436 0 0 0 0 0
## 2453 0 0 0 0 0
## 2454 0 0 0 0 0
## 2491 0 0 0 0 0
## 2499 0 0 0 0 0
## 2548 0 0 0 0 0
## 2553 0 0 0 0 0
## 2565 0 0 0 0 0
## 2579 0 0 0 0 0
## 2600 0 0 0 0 0
## 2703 0 0 0 0 0
## 2764 0 0 0 0 0
## 2767 0 0 0 0 0
## 2804 0 0 0 0 0
## 2805 0 0 0 0 0
## 2825 0 0 0 0 0
## 2892 0 0 0 0 0
## 2905 0 0 0 0 0
We have 79 houses that can be flagged as not having basements.
# flag the non-basement houses with "No Basement" for string bsmt features and with zeros
# for numeric bsmt feautres
allSet[bsmtAllNa,c("BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2")] <- "None"
allSet[bsmtAllNa, c("BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath")] <- 0
Now, let’s check remaining basement houses which have missing values but seem to have basements based on at least one non-NA basement feature.
bsmtSmNa <- which(is.na(allSet[,bsmtCols[1]]) |
is.na(allSet[,bsmtCols[2]]) |
is.na(allSet[,bsmtCols[3]]) |
is.na(allSet[,bsmtCols[4]]) |
is.na(allSet[,bsmtCols[5]]) |
is.na(allSet[,bsmtCols[6]]) |
is.na(allSet[,bsmtCols[7]]) |
is.na(allSet[,bsmtCols[8]]) |
is.na(allSet[,bsmtCols[9]]) |
is.na(allSet[,bsmtCols[10]])|
is.na(allSet[,bsmtCols[11]]))
bsmtNa <- allSet[bsmtSmNa,bsmtCols]
bsmtNa
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 333 Gd TA No GLQ 1124 <NA>
## 949 Gd TA <NA> Unf 0 Unf
## 1488 Gd TA <NA> Unf 0 Unf
## 2041 Gd <NA> Mn GLQ 1044 Rec
## 2186 TA <NA> No BLQ 1033 Unf
## 2218 <NA> Fa No Unf 0 Unf
## 2219 <NA> TA No Unf 0 Unf
## 2349 Gd TA <NA> Unf 0 Unf
## 2525 TA <NA> Av ALQ 755 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
## 333 479 1603 3206 1 0
## 949 0 936 936 0 0
## 1488 0 1595 1595 0 0
## 2041 382 0 1426 1 0
## 2186 0 94 1127 0 1
## 2218 0 173 173 0 0
## 2219 0 356 356 0 0
## 2349 0 725 725 0 0
## 2525 0 240 995 0 0
Reading the description of each basement feature, we can see that:
Basement finished, unfinished, and total square feet are related in the following formula: TotalBsmtSF = BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF
BsmtFinType1 and BsmtFinSF1 might be related. We may predict rating of basement finished area (type 1) from its finished square feet.
BsmtFinType2 and BsmtFinSF2 might be related. We may predict rating of basement finished area (type 2) from its finished square feet.
Let’s first make sure that all basement areas are correct according to the abovementioned formula.
sum( allSet$TotalBsmtSF != (allSet$BsmtFinSF1 + allSet$BsmtFinSF2 + allSet$BsmtUnfSF))
## [1] 0
All basement areas numbers are correct and according to the formula.
We have one house (333) whose BsmtFinType2 value is missing while its BsmtFinSF2 is 479 square feet. Let’s see the distribution of all BsmtFinType2 values versus BsmtFinSF2.
# check all houses having non-zero BsmtFinSF2 versus their BsmtFinType2
missFinSF2 <- bsmtNa$BsmtFinSF2[is.na(bsmtNa$BsmtFinType2)]
bsmt2 <- allSet[allSet$BsmtFinSF2 >0 & !is.na(allSet$BsmtFinType2), c("BsmtFinSF2", "BsmtFinType2")]
bsmt2 %>% ggplot(aes(x = BsmtFinType2, y = BsmtFinSF2))+
geom_boxplot()+
geom_hline(yintercept = missFinSF2, color = "red")
# calculate medians of finished square feet for each bsmt finish type 2
medSF <- bsmt2 %>%
group_by(BsmtFinType2) %>%
summarize(median = median(BsmtFinSF2))
medSF
## # A tibble: 6 x 2
## BsmtFinType2 median
## <chr> <dbl>
## 1 ALQ 553
## 2 BLQ 294
## 3 GLQ 691
## 4 LwQ 263
## 5 Rec 308
## 6 Unf 6.00
Let’s impute the missing rating of basement finished area (type2) based on the nearest median of square feet for the available type2 ratings.
rIndx <- which(is.na(allSet$BsmtFinType2))
for(i in rIndx){
bsmtVal <- medSF[which((abs(medSF$median - allSet$BsmtFinSF2[i])) == min(abs(medSF$median - allSet$BsmtFinSF2[i]), na.rm = T)),"BsmtFinType2"]
allSet$BsmtFinType2[i] <- bsmtVal[[1]]
}
For basement exposure (BsmtExposure), we have 3 missing values. Let’s see if there’s any relationship between basement types and exposure.
allSet[!is.na(allSet$BsmtExposure), c("BsmtExposure","BsmtFinType1")] %>%
ggplot(aes(x = BsmtExposure, fill = BsmtFinType1))+
geom_histogram(stat = "count")
From the histogram above, we can see that the majority of houses have no basement exposure, and the majority of the no-basement exposure houses are Unf basement finish type, which match the 3 missing values of basement exposure. So, we’ll impute the missing values with “No”.
allSet$BsmtExposure[is.na(allSet$BsmtExposure)] <- "No"
Now, we are left with missing values in basement quality and basement condition. Let’s see if there’s some relationship between basement quality (which refers to its height) and the exposure.
allSet[!is.na(allSet$BsmtQual), c("BsmtQual","BsmtExposure")] %>%
ggplot(aes(x = BsmtQual, fill = BsmtExposure))+
geom_histogram(stat = "count")
It seems that most of the No-exposure basements have either a TA or a Gd Quality with higher frequency towards TA. So, we’ll take TA as the value of the missing basement quality houses.
allSet$BsmtQual[is.na(allSet$BsmtQual)] <- "TA"
Now, let’s see the relationship between basement quality and condition.
allSet[!is.na(allSet$BsmtQual) & !is.na(allSet$BsmtCond), c("BsmtQual","BsmtCond")] %>%
ggplot(aes(x = BsmtQual, fill = BsmtCond))+
geom_histogram(stat = "count")
It’s obvious from the plot that the majority of TA and Gd quality have TA basement condition. So, we’ll use TA basement condition for the missing values.
allSet$BsmtCond[is.na(allSet$BsmtCond)] <- "TA"
By finishing the imputation process for all missing values of all features we are ready to move to the next step of Feature Engineering. Let’s check if any other missing values are still there in the data set.
checkNAs(allSet)
## SalePrice
## 1459
Aside from the SalePrice response feature, which is expected to have missing values in the original test set, no remaining missing values are observed. Therefore, we can move to the next step.
Feature Engineering simply refers to creating new features from the existing ones to improve model performance. The following sub-sections handle multiple feature engineering tasks that would make training the model possible, easier, and more accurate in prediction.
It is a good practice to transform highly skewed numerical variables into their log values. This helps show the relative change in variable values rather than the absolute change for variables not showing normal distribution. Most researchers use the range of -2 to +2 as the acceptable limits of skewness to decide on a variable normality. Hence, we will use this range to log transform values.
# create a new copy of the master data set (allSet)
allSetNew <- allSet
# get classes of features in the data set
featureClasses <- sapply(names(allSetNew), function(x){class(allSetNew[[x]])})
# get numeric or integer class features
numFeatures <- names(featureClasses[featureClasses == "numeric" | featureClasses == "integer"])
# get character class features
charFeatures <- names(featureClasses[featureClasses == "character"])
# determine skewness of each numeric feature
skewedVals <- sapply(numFeatures, function(x){skewness(allSetNew[[x]],na.rm = T)})
# identify skewed features with threshold of -2,+2
skewedFeatures <- skewedVals[skewedVals < -2 | skewedVals > 2]
# log-transform skewed features
for (i in names(skewedFeatures)) {
allSetNew[[i]] <- log(allSetNew[[i]] + 1)
}
Since we will use the glmment function to build regression models, we need to convert character features into dummy variables since the glmnet function accepts numeric/quantitative variables.
# create a full set of dummy variables for the character features
dummies <- dummyVars(~., allSetNew[charFeatures])
dummyVariables <- predict(dummies, allSetNew[charFeatures])
# compile the data set again by combining both numerical and dummy variables features
allSetNew <- cbind(allSetNew[numFeatures], dummyVariables)
Now, having all features processed and engineered, we can start the model training then testing with all existing features (except the Id) as model predictors.
First, we will re-split the full data set into its original train and test sets.
salesPriceNA <- which(is.na(allSetNew["SalePrice"]))
train <- allSetNew[-salesPriceNA,]
test <- allSetNew[salesPriceNA,]
Since we are dealing with high-dimensional data set, we will use the Shrinkage Models (e.g. Ridge and Lasso) to obtain the best model with the least effect of predictors variance and colinearity. Ridge and Lasso methods penalize the model coefficient estimates, using Lambda tuning parameter, to reduce their variance which result in a better model fitting.
Ridge regression performs shrinkage without exclusion of predictors. To use the glmnet function we need to have a matrix of all predictors and a vector of the response.
# convert the train and test data sets into matrices excluding the Id and SalePrice features
train.Matrix =as.matrix(train[,names(train) != c("Id","SalePrice")])
test.Matrix = as.matrix(test[,names(test) != c("Id","SalePrice")])
# create a vector of the log-transformed response (SalePrice)
train.y = log(train$SalePrice + 1)
Fit the Ridge Regression using the alpha argument = 0. We will implement the function over a grid of values ranging from λ = 10^10 to λ = 10^−2, essentially covering the full range of scenarios from the null model containing only the intercept to the least squares fit model.
set.seed(4)
# create a grid of lambda values
grid = 10^seq(10,-2, length = 100)
# train the Ridge Regression model (alpha = 0) using the grid of selected lambdas
ridge.mod = glmnet(train.Matrix,train.y,alpha = 0, lambda = grid)
dim(coef(ridge.mod))
## [1] 303 100
We have 303 regression coefficients with 100 lambda values. We will split the train data set into training and testing subsets to estimate the test error.
set.seed(5)
# split the train data set into training and testing data sets (75/25 ratio).
# number of training rows
nTrain <- round(0.75 * nrow(train.Matrix))
# sample row IDs
sampleTrain <- sample(nrow(train.Matrix),nTrain)
# create training and testing data sets
training <- train.Matrix[sampleTrain,]
testing <- train.Matrix[-sampleTrain,]
training.y <- train.y[sampleTrain]
testing.y <- train.y[-sampleTrain]
We will use the built-in cross-validation function cv.glmnet() to choose the best tuning parameter (lambda) through cross validation in the training data.
set.seed(1)
cv.out <- cv.glmnet(training, training.y, alpha = 0)
plot(cv.out)
ridgeBestLambda <- cv.out$lambda.min
ridgeBestLambda;log(ridgeBestLambda)
## [1] 0.2249694
## [1] -1.491791
The best lambda value that results in the smallest cross validation error is 0.2249694. Let’s see the test RMSE associated with this value of lambda.
# predict the response (SalePrice) in the testing data using the best lambda
ridge.predict <- predict(ridge.mod,s = ridgeBestLambda, newx = testing)
ridge.rmse <- sqrt(mean((ridge.predict-testing.y)^2))
ridge.rmse
## [1] 0.09866289
So, the test error (RMSE) resulted from the Ridge Regression model on the testing data is 0.0986629.
(Note: remember that the testing data is the testing partition from the original train data set)
Let’s see if the Lasso regression can yield a better RMSE than that of the Ridge. We will use the glmnet() function but with alpha = 1.
set.seed(5)
lasso.mod = glmnet(train.Matrix,train.y,alpha = 1, lambda = grid)
plot(lasso.mod)
We can see from the coefficient plot that depending on the choice of tuning parameter, some of the coefficients will be exactly equal to zero. Let’s perform cross validation and see the test error.
set.seed(1)
cv.out <- cv.glmnet(training, training.y, alpha = 1)
plot(cv.out)
lassoBestlambda <- cv.out$lambda.min
lassoBestlambda;log(lassoBestlambda)
## [1] 0.00583801
## [1] -5.143365
The best lambda value that results in the smallest cross validation error is 0.005838. Let’s see the test RMSE associated with this value of lambda.
lasso.predict <- predict(lasso.mod,s = lassoBestlambda, newx = testing)
lasso.rmse <- sqrt(mean((lasso.predict-testing.y)^2))
lasso.rmse
## [1] 0.110181
So, the RMSE resulted from the Lasso Regression model is 0.110181 which is slightly higher than that of the Ridge Regression.
lasso.coef <- predict(lasso.mod,type = "coefficients", s = lassoBestlambda)[1:303,]
names(lasso.coef[lasso.coef !=0])
## [1] "(Intercept)" "LotArea" "OverallQual"
## [4] "OverallCond" "YearBuilt" "YearRemodAdd"
## [7] "BsmtFinSF1" "TotalBsmtSF" "GrLivArea"
## [10] "BsmtFullBath" "KitchenAbvGr" "Fireplaces"
## [13] "GarageCars" "GarageArea" "WoodDeckSF"
## [16] "OpenPorchSF" "ScreenPorch" "MSZoningC (all)"
## [19] "MSZoningRM" "NeighborhoodCrawfor" "NeighborhoodEdwards"
## [22] "NeighborhoodNridgHt" "NeighborhoodSomerst" "NeighborhoodStoneBr"
## [25] "Condition1Artery" "Condition1Norm" "Condition2PosN"
## [28] "RoofMatlClyTile" "Exterior1stBrkComm" "Exterior1stBrkFace"
## [31] "ExterQualTA" "ExterCondFa" "FoundationPConc"
## [34] "BsmtQualEx" "BsmtCondFa" "BsmtExposureGd"
## [37] "BsmtFinType1GLQ" "BsmtFinType1Unf" "HeatingGrav"
## [40] "HeatingQCEx" "CentralAirN" "KitchenQualEx"
## [43] "KitchenQualTA" "FunctionalMaj2" "FunctionalSev"
## [46] "FunctionalTyp" "FireplaceQuNone" "GarageTypeAttchd"
## [49] "GarageFinishUnf" "GarageCondTA" "PavedDriveN"
## [52] "PavedDriveY" "SaleTypeNew" "SaleConditionAbnorml"
Number of coefficients that the Lasso Regression picked (not zero) is 54.
Now, we will try the GBM algorithm. First, let’s split the train and test data sets, then split the train data into training and testing data sets to validate the model.
set.seed(5)
# re-split the combined data into train and test data sets
salesPriceNA <- which(is.na(allSetNew["SalePrice"]))
train <- allSetNew[-salesPriceNA,]
test <- allSetNew[salesPriceNA,]
# convert the response in the train data into log
train$SalePrice <- log(train$SalePrice + 1)
# split the train data set into training and testing data sets (75/25 ratio)
# to validate the model
# number of training rows
nTrain <- round(0.75 * nrow(train))
# sample row IDs
sampleTrain <- sample(nrow(train),nTrain)
# create training and testing data sets
training <- train[sampleTrain,!names(train) == "Id"]
testing <- train[-sampleTrain,!names(train) == "Id"]
testing.y <- testing$SalePrice
Build the GBM model.
set.seed(2)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 5, verboseIter = FALSE)
gbm.mod <- train(SalePrice~., data = training, method = "gbm", trControl = control, verbose = FALSE)
gbm.mod
## Stochastic Gradient Boosting
##
## 1095 samples
## 302 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 985, 984, 985, 985, 985, 986, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 0.1754895 0.8277985 0.12294897
## 1 100 0.1491191 0.8595910 0.10367975
## 1 150 0.1427706 0.8688748 0.09846921
## 2 50 0.1536606 0.8551644 0.10579030
## 2 100 0.1399041 0.8737287 0.09511092
## 2 150 0.1364623 0.8796950 0.09257943
## 3 50 0.1453491 0.8670813 0.09885915
## 3 100 0.1357116 0.8811948 0.09180459
## 3 150 0.1335058 0.8851087 0.09033930
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
Let’s see the test error (RMSE) by applying the GBM model to the testing data set.
gbm.predict <- predict(gbm.mod,newdata = testing)
gbm.rmse <- sqrt(mean((gbm.predict-testing.y)^2))
gbm.rmse
## [1] 0.1203715
GBM model resulted in a worse prediction than that of both Ridge and Lasso as indicated by its high RMSE value.
Forward Stepwise is a Subset Selection method that begins with a model containing no predictors (null model), and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.
# use the lm Regression model to predict the SalePrice. Use Forward Stepwise algorithm to select the best predictors.
set.seed(4)
# build the null model with no predictors
null_model <- lm(SalePrice~1, data = training)
# build the full model with all predictors
full_model <- lm(SalePrice~., data = training)
# perform forward stepwise algorithm to get an economic model with best predictors
step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")
Let’s see the test error (RMSE) by applying the Step Wise model to the testing data set.
step.predict <- predict(step_model,newdata = testing)
step.rmse <- sqrt(mean((step.predict-testing.y)^2))
step.rmse
## [1] 0.1472766
The RMSE resulted from the Linear model with Forward Stepwise is 0.1472766.
By now we have trained and tested four models. We will use each one to predict the Sale Price response variable in the original test data set.
# predict using Ridge
test.predict.ridge <- exp(predict(ridge.mod,s = ridgeBestLambda,newx = test.Matrix))-1
# predict using lasso
test.predict.lasso <- exp(predict(lasso.mod, s = lassoBestlambda, newx = test.Matrix))-1
# predict using gbm
test.predict.gbm <- exp(predict(gbm.mod,newdata = test)) - 1
# predict using linear step wise
test.predict.step <- exp(predict(step_model,newdata = test)) - 1
In order to get more robust predictions that incorporate predictions of the previous four models we will perform the Model Ensemble technique using the Weighted Average method.
We will assign the weights based on each model’s RMSE value. Hence, the models’ weights can be assumed as shown below:
ensmb.df <- data.frame(Model = c("ridge", "lasso", "gbm", "step"), RMSE = c(ridge.rmse,lasso.rmse,gbm.rmse,step.rmse), Weight = c(40,30,15,15))
ensmb.df
## Model RMSE Weight
## 1 ridge 0.09866289 40
## 2 lasso 0.11018098 30
## 3 gbm 0.12037154 15
## 4 step 0.14727656 15
Finally, we will predict the Sale Price in the original test data set using the ensembled models with the weights defined above.
# construct data frame for the ensembled solution
solution <- data.frame(Id = as.integer(rownames(test)),SalePrice = as.numeric(test.predict.ridge*.4 + test.predict.lasso*.3 + test.predict.gbm*.15 + test.predict.step*.15))
write.csv(solution,"ensemble_sol.csv",row.names=FALSE)