Background

  1. Choose a dataset. You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homework assignments. You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS, or other open data portals.
  2. Select one of the methodologies studied in weeks 1-10 and another methodology from weeks 11-15 to apply to the newly selected dataset.
  3. To complete this task:
     1. Describe the problem you are trying to solve.
     2. Describe your dataset and what you did to prepare the data for analysis.
     3. Describe the methodologies you used for analyzing the data.
     4. State the purpose of the analysis performed.
     5. Draw conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.

Dataset Overview

The dataset I chose is Kaggle’s “House Prices - Advanced Regression Techniques” competition dataset, which includes features of residential homes in Ames, Iowa. The goal of the competition is to predict the final sale price of each home. The data consists of a training dataset and a test dataset (which does not include the target variable).

I chose this dataset because it is a classic regression problem in machine learning. It also has many more features than the datasets we have previously worked with, which I imagine is closer to what happens in the real world. I also wanted to work with the random forest algorithm and compare it to a neural network (an algorithm I haven’t tried yet), and this data, with both categorical and numeric features, seemed like a great option for that comparison. The dataset also gave me a lot of practice with data preparation and extensive EDA, which take up a good portion of data science work. Lastly, real estate data genuinely interested me, so this seemed like the perfect problem to solve.

Kaggle link: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview

Housing features:

SalePrice: The property’s sale price in dollars. This is the target variable that you’re trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale

Business Goal

The business goal of this project is to predict the sale price of each house. The data science goal is to perform EDA and experiment with models that fit this dataset, treating it as a classic machine learning regression problem with SalePrice as the target variable.

1. Importing Libraries

#Import Libraries
library(kableExtra)
library(knitr)
library(readr)
library(tidyverse)
library(corrplot)
library(dplyr)
library(GGally)
library(caret)
library(pROC)
library(glmnet)
library(MASS)
library(car)
library(correlationfunnel)
library(faraway)
library(arm)
library(performance)
library(see)
library(reshape2)
#library(tidymodels)
#library(rms)
library(smotefamily)
library(themis)
library(skimr)
library(DataExplorer)
library(naniar)
library(mice)
library(corrr)
library(FactoMineR)
library(ggcorrplot)
library(factoextra)
library(ranger)
library(neuralnet)
library(tidymodels)
library(gmodels)

2. Data Ingestion

To obtain the dataset, I enrolled in the ongoing Kaggle competition and downloaded the data.

# Read in the training dataset (downloaded from Kaggle)
raw_data <- read.csv("/Users/gillianmcgovern/Documents/CUNY/DATA_622/FINAL\ PROJECT/final_project_train.csv")

3. Exploratory Data Analysis

The housing prices training dataset contains 1,460 observations and 81 variables, where each observation is a house with its features and final sale price (the target variable is named SalePrice). The dataset contains both categorical and numeric/continuous variables, and the categorical variables include both ordinal and nominal ones.
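
Because the ordinal ratings are read in as plain character columns, their ranking is not visible to R. As a minimal sketch (not necessarily the encoding used later in this analysis), a rating such as ExterCond could be converted to an ordered factor. The level order Po < Fa < TA < Gd < Ex is an assumption based on the usual reading of these codes (Poor, Fair, Typical/Average, Good, Excellent), and the `toy` data frame and `ExterCondScore` column are made up for illustration.

```r
# Toy stand-in for a few rows of the housing data (values are illustrative)
toy <- data.frame(
  ExterCond = c("TA", "Gd", "Fa", "Po", "Ex"),
  stringsAsFactors = FALSE
)

# Assumed quality ordering, worst to best
quality_levels <- c("Po", "Fa", "TA", "Gd", "Ex")

# An ordered factor keeps the ranking; as.integer() turns it into a 1-5 score
toy$ExterCond <- factor(toy$ExterCond, levels = quality_levels, ordered = TRUE)
toy$ExterCondScore <- as.integer(toy$ExterCond)

print(toy)
```

Nominal variables such as Neighborhood have no natural order, so they would stay as unordered factors or be one-hot encoded instead.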

Missing data:

0% of the rows are complete, i.e. every row has at least one missing value. Despite this, only 5.9% of all observations are missing.
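
The two figures can coexist because a single missing value is enough to make a whole row incomplete. The sketch below shows how both numbers could be computed in base R on a made-up `toy` data frame; I am assuming that "missing observations" refers to the share of missing cells (rows x columns).

```r
# Toy data frame with a few missing values (illustrative only)
toy <- data.frame(
  LotFrontage = c(65, NA, 68, 60),
  Alley       = c(NA, NA, "Grvl", NA),
  LotArea     = c(8450, 9600, 11250, 9550),
  stringsAsFactors = FALSE
)

# Share of rows with no missing value in any column
pct_complete_rows <- mean(complete.cases(toy)) * 100

# Share of all cells (rows x columns) that are missing
pct_missing_cells <- mean(is.na(toy)) * 100

cat("Complete rows (%):", pct_complete_rows, "\n")
cat("Missing cells (%):", round(pct_missing_cells, 1), "\n")
```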

The missing data breakdown bar chart shows that Electrical, MasVnrType, MasVnrArea, BsmtFinType1, BsmtCond, BsmtQual, BsmtFinType2, BsmtExposure, GarageCond, GarageQual, GarageFinish, GarageYrBlt, GarageType, LotFrontage, FireplaceQu, Fence, Alley, MiscFeature, and PoolQC have missing values. Fence, Alley, MiscFeature, and PoolQC have the most missing values, with a missing percentage above 80%. We will have to look into these missing values to check if they are indeed missing, and if we should keep them.
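
As a rough sketch of the numbers behind such a bar chart, the per-column missing percentage can be computed with `colMeans(is.na(...))`. The `toy` data frame is made up, and the 80% cutoff is simply the threshold mentioned above, not a general rule.

```r
# Toy data frame (illustrative values only)
toy <- data.frame(
  PoolQC      = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, "Gd"),
  LotFrontage = c(65, 80, NA, 60, 84, 85, 75, NA, 51, 50),
  LotArea     = c(8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 6120, 7420),
  stringsAsFactors = FALSE
)

# Percentage of missing values per column, largest first
pct_missing <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)
print(pct_missing)

# Columns above the 80% threshold, flagged for a closer look
high_missing <- names(pct_missing[pct_missing > 80])
print(high_missing)
```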

The missing upset plot shows that Fence, Alley, MiscFeature, and PoolQC are usually all missing together.
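
One way to check co-missingness numerically, as a sketch on made-up data, is to cross-multiply the 0/1 missing-indicator matrix: each cell of the result counts the rows in which a pair of columns is missing at the same time. The names `toy`, `miss`, and `co_missing` are illustrative.

```r
# Toy data frame (illustrative values only)
toy <- data.frame(
  PoolQC      = c(NA, NA, NA, "Gd", NA),
  Fence       = c(NA, NA, "MnPrv", NA, NA),
  Alley       = c(NA, "Grvl", NA, NA, NA),
  LotFrontage = c(65, 80, NA, 60, 84),
  stringsAsFactors = FALSE
)

# Missing-value indicators as a 0/1 matrix (1 = missing)
miss <- is.na(toy) * 1

# Cell [i, j] counts rows where columns i and j are both missing;
# the diagonal holds each column's own missing count
co_missing <- t(miss) %*% miss
print(co_missing)
```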

Duplicate data:

There are no duplicate observations.

Invalid data:

All values of the categorical features appear to be valid (i.e. no special characters need cleaning).
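
Besides eyeballing the unique values, a pattern check can flag suspicious entries automatically. This is only a sketch: the whitelist in `allowed` is my own assumption about which characters are legitimate in codes such as "C (all)" or "Tar&Grv", and the `toy` data frame contains deliberately broken values for illustration.

```r
# Toy categorical data with two deliberately bad values in Street
toy <- data.frame(
  MSZoning = c("RL", "RM", "C (all)", "FV"),
  Street   = c("Pave", "Grvl", "Pave ", "pave?"),
  stringsAsFactors = FALSE
)

# Assumed whitelist: letters, digits, spaces, and a few punctuation marks
allowed <- "^[A-Za-z0-9 .&()]+$"

# For each column, keep unique non-missing values that break the pattern
# or carry leading/trailing whitespace
suspicious <- lapply(toy, function(col) {
  vals <- unique(col[!is.na(col)])
  vals[!grepl(allowed, vals) | vals != trimws(vals)]
})
print(suspicious)
```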

Summary statistics:

Summary statistics of the predictor variables in the training dataset are included in the table below. Key metrics such as minimum, maximum, mean, median, and standard deviation (SD) help us understand the range, central tendency, and variability of each variable.
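
Base R's `summary()` reports the minimum, quartiles, mean, and maximum but not the standard deviation. A minimal sketch of one way to get SD alongside the other metrics is shown below; the `toy` data frame reuses a few values from the first rows of the dataset, and the `stats` layout is just one possible choice.

```r
# Small sample of the data (first five rows of two numeric columns)
toy <- data.frame(
  LotArea     = c(8450, 9600, 11250, 9550, 14260),
  OverallQual = c(7, 6, 7, 7, 8),
  MSZoning    = c("RL", "RL", "RL", "RL", "RL"),
  stringsAsFactors = FALSE
)

# Keep numeric columns only
num <- toy[sapply(toy, is.numeric)]

# One row per variable; na.rm = TRUE skips missing values
stats <- data.frame(
  min    = sapply(num, min, na.rm = TRUE),
  median = sapply(num, median, na.rm = TRUE),
  mean   = sapply(num, mean, na.rm = TRUE),
  max    = sapply(num, max, na.rm = TRUE),
  sd     = sapply(num, sd, na.rm = TRUE)
)
print(round(stats, 2))
```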

Some interesting things to note right from the start are:

NOTE: YrSold, MoSold, SaleType, and SaleCondition will all be kept for this model because the test dataset provided for this competition contains these features; this assumes the information is available when predicting the final price of a house. Depending on when the price predictions are made, this information might not be available in a real-world scenario (in which case these features would be removed before training a model). The competition does not state that this information is unavailable at prediction time.
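
If the sale-time information were not available at prediction time, those columns could be dropped before training. The sketch below shows one way to do that; it is not applied in this analysis, and the `toy` data frame and variable names are illustrative.

```r
# Toy data frame with a mix of pre-sale and sale-time features
toy <- data.frame(
  GrLivArea     = c(1710, 1262),
  YrSold        = c(2008, 2007),
  MoSold        = c(2, 5),
  SaleType      = c("WD", "WD"),
  SaleCondition = c("Normal", "Normal"),
  SalePrice     = c(208500, 181500),
  stringsAsFactors = FALSE
)

# Features that are only known once a sale is recorded
sale_time_features <- c("YrSold", "MoSold", "SaleType", "SaleCondition")

# Keep every column that is not a sale-time feature
toy_pre_sale <- toy[, setdiff(names(toy), sale_time_features), drop = FALSE]
print(names(toy_pre_sale))
```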

# Structure of the data 
str(raw_data)
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
# Glimpse of the data
head(raw_data)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000
# Id doesn't impact prediction so remove 
raw_data <- raw_data[ , -1]

# Summary statistics
raw_data %>%
  summary() %>%
  kable(caption = "Descriptive Statistics of Predictor Variables") %>%
  kable_styling()
Descriptive Statistics of Predictor Variables
(numeric variables, one per row; the 43 character variables each report only Length: 1460, Class: character, Mode: character)

Variable        Min.      1st Qu.   Median    Mean      3rd Qu.   Max.       NA's
MSSubClass      20.0      20.0      50.0      56.9      70.0      190.0
LotFrontage     21.00     59.00     69.00     70.05     80.00     313.00     259
LotArea         1300      7554      9478      10517     11602     215245
OverallQual     1.000     5.000     6.000     6.099     7.000     10.000
OverallCond     1.000     5.000     5.000     5.575     6.000     9.000
YearBuilt       1872      1954      1973      1971      2000      2010
YearRemodAdd    1950      1967      1994      1985      2004      2010
MasVnrArea      0.0       0.0       0.0       103.7     166.0     1600.0     8
BsmtFinSF1      0.0       0.0       383.5     443.6     712.2     5644.0
BsmtFinSF2      0.00      0.00      0.00      46.55     0.00      1474.00
BsmtUnfSF       0.0       223.0     477.5     567.2     808.0     2336.0
TotalBsmtSF     0.0       795.8     991.5     1057.4    1298.2    6110.0
X1stFlrSF       334       882       1087      1163      1391      4692
X2ndFlrSF       0         0         0         347       728       2065
LowQualFinSF    0.000     0.000     0.000     5.845     0.000     572.000
GrLivArea       334       1130      1464      1515      1777      5642
BsmtFullBath    0.0000    0.0000    0.0000    0.4253    1.0000    3.0000
BsmtHalfBath    0.00000   0.00000   0.00000   0.05753   0.00000   2.00000
FullBath        0.000     1.000     2.000     1.565     2.000     3.000
HalfBath        0.0000    0.0000    0.0000    0.3829    1.0000    2.0000
BedroomAbvGr    0.000     2.000     3.000     2.866     3.000     8.000
KitchenAbvGr    0.000     1.000     1.000     1.047     1.000     3.000
TotRmsAbvGrd    2.000     5.000     6.000     6.518     7.000     14.000
Fireplaces      0.000     0.000     1.000     0.613     1.000     3.000
GarageYrBlt     1900      1961      1980      1979      2002      2010       81
GarageCars      0.000     1.000     2.000     1.767     2.000     4.000
GarageArea      0.0       334.5     480.0     473.0     576.0     1418.0
WoodDeckSF      0.00      0.00      0.00      94.24     168.00    857.00
OpenPorchSF     0.00      0.00      25.00     46.66     68.00     547.00
EnclosedPorch   0.00      0.00      0.00      21.95     0.00      552.00
X3SsnPorch      0.00      0.00      0.00      3.41      0.00      508.00
ScreenPorch     0.00      0.00      0.00      15.06     0.00      480.00
PoolArea        0.000     0.000     0.000     2.759     0.000     738.000
MiscVal         0.00      0.00      0.00      43.49     0.00      15500.00
MoSold          1.000     5.000     6.000     6.322     8.000     12.000
YrSold          2006      2007      2008      2008      2009      2010
SalePrice       34900     129975    163000    180921    214000    755000
plot_intro(raw_data)

# Visualize Missing Data
plot_missing(raw_data)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the DataExplorer package.
##   Please report the issue at
##   <https://github.com/boxuancui/DataExplorer/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Check how missingness relates to other variables
gg_miss_upset(raw_data, nsets = 10)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the UpSetR package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the UpSetR package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Check how missingness affects SalePrice
gg_miss_fct(x = raw_data, fct = SalePrice)

# Check for duplicates
duplicates <- duplicated(raw_data)

# Print the duplicates
print(raw_data[duplicates, ])
##  [1] MSSubClass    MSZoning      LotFrontage   LotArea       Street       
##  [6] Alley         LotShape      LandContour   Utilities     LotConfig    
## [11] LandSlope     Neighborhood  Condition1    Condition2    BldgType     
## [16] HouseStyle    OverallQual   OverallCond   YearBuilt     YearRemodAdd 
## [21] RoofStyle     RoofMatl      Exterior1st   Exterior2nd   MasVnrType   
## [26] MasVnrArea    ExterQual     ExterCond     Foundation    BsmtQual     
## [31] BsmtCond      BsmtExposure  BsmtFinType1  BsmtFinSF1    BsmtFinType2 
## [36] BsmtFinSF2    BsmtUnfSF     TotalBsmtSF   Heating       HeatingQC    
## [41] CentralAir    Electrical    X1stFlrSF     X2ndFlrSF     LowQualFinSF 
## [46] GrLivArea     BsmtFullBath  BsmtHalfBath  FullBath      HalfBath     
## [51] BedroomAbvGr  KitchenAbvGr  KitchenQual   TotRmsAbvGrd  Functional   
## [56] Fireplaces    FireplaceQu   GarageType    GarageYrBlt   GarageFinish 
## [61] GarageCars    GarageArea    GarageQual    GarageCond    PavedDrive   
## [66] WoodDeckSF    OpenPorchSF   EnclosedPorch X3SsnPorch    ScreenPorch  
## [71] PoolArea      PoolQC        Fence         MiscFeature   MiscVal      
## [76] MoSold        YrSold        SaleType      SaleCondition SalePrice    
## <0 rows> (or 0-length row.names)
# Break up numerical and categorical variables
cat("Numerical predictors:")
## Numerical predictors:
data_raw_numeric <- raw_data |>
  dplyr::select(where(is.numeric))
numerical_predictors <- names(data_raw_numeric)
print(numerical_predictors)
##  [1] "MSSubClass"    "LotFrontage"   "LotArea"       "OverallQual"  
##  [5] "OverallCond"   "YearBuilt"     "YearRemodAdd"  "MasVnrArea"   
##  [9] "BsmtFinSF1"    "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"  
## [13] "X1stFlrSF"     "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"    
## [17] "BsmtFullBath"  "BsmtHalfBath"  "FullBath"      "HalfBath"     
## [21] "BedroomAbvGr"  "KitchenAbvGr"  "TotRmsAbvGrd"  "Fireplaces"   
## [25] "GarageYrBlt"   "GarageCars"    "GarageArea"    "WoodDeckSF"   
## [29] "OpenPorchSF"   "EnclosedPorch" "X3SsnPorch"    "ScreenPorch"  
## [33] "PoolArea"      "MiscVal"       "MoSold"        "YrSold"       
## [37] "SalePrice"
cat("\nCategorical predictors:")
## 
## Categorical predictors:
data_raw_categorical <- raw_data |>
  dplyr::select(where(is.factor) | where(is.character))
# All character/factor columns are predictors (the target, SalePrice, is numeric)
categorical_predictors <- names(data_raw_categorical)
print(categorical_predictors)
##  [1] "MSZoning"      "Street"        "Alley"         "LotShape"     
##  [5] "LandContour"   "Utilities"     "LotConfig"     "LandSlope"    
##  [9] "Neighborhood"  "Condition1"    "Condition2"    "BldgType"     
## [13] "HouseStyle"    "RoofStyle"     "RoofMatl"      "Exterior1st"  
## [17] "Exterior2nd"   "MasVnrType"    "ExterQual"     "ExterCond"    
## [21] "Foundation"    "BsmtQual"      "BsmtCond"      "BsmtExposure" 
## [25] "BsmtFinType1"  "BsmtFinType2"  "Heating"       "HeatingQC"    
## [29] "CentralAir"    "Electrical"    "KitchenQual"   "Functional"   
## [33] "FireplaceQu"   "GarageType"    "GarageFinish"  "GarageQual"   
## [37] "GarageCond"    "PavedDrive"    "PoolQC"        "Fence"        
## [41] "MiscFeature"   "SaleType"      "SaleCondition"
# Check for any typos in categorical features values
data_raw_categorical %>%
  lapply(unique) %>%
  print()
## $MSZoning
## [1] "RL"      "RM"      "C (all)" "FV"      "RH"     
## 
## $Street
## [1] "Pave" "Grvl"
## 
## $Alley
## [1] NA     "Grvl" "Pave"
## 
## $LotShape
## [1] "Reg" "IR1" "IR2" "IR3"
## 
## $LandContour
## [1] "Lvl" "Bnk" "Low" "HLS"
## 
## $Utilities
## [1] "AllPub" "NoSeWa"
## 
## $LotConfig
## [1] "Inside"  "FR2"     "Corner"  "CulDSac" "FR3"    
## 
## $LandSlope
## [1] "Gtl" "Mod" "Sev"
## 
## $Neighborhood
##  [1] "CollgCr" "Veenker" "Crawfor" "NoRidge" "Mitchel" "Somerst" "NWAmes" 
##  [8] "OldTown" "BrkSide" "Sawyer"  "NridgHt" "NAmes"   "SawyerW" "IDOTRR" 
## [15] "MeadowV" "Edwards" "Timber"  "Gilbert" "StoneBr" "ClearCr" "NPkVill"
## [22] "Blmngtn" "BrDale"  "SWISU"   "Blueste"
## 
## $Condition1
## [1] "Norm"   "Feedr"  "PosN"   "Artery" "RRAe"   "RRNn"   "RRAn"   "PosA"  
## [9] "RRNe"  
## 
## $Condition2
## [1] "Norm"   "Artery" "RRNn"   "Feedr"  "PosN"   "PosA"   "RRAn"   "RRAe"  
## 
## $BldgType
## [1] "1Fam"   "2fmCon" "Duplex" "TwnhsE" "Twnhs" 
## 
## $HouseStyle
## [1] "2Story" "1Story" "1.5Fin" "1.5Unf" "SFoyer" "SLvl"   "2.5Unf" "2.5Fin"
## 
## $RoofStyle
## [1] "Gable"   "Hip"     "Gambrel" "Mansard" "Flat"    "Shed"   
## 
## $RoofMatl
## [1] "CompShg" "WdShngl" "Metal"   "WdShake" "Membran" "Tar&Grv" "Roll"   
## [8] "ClyTile"
## 
## $Exterior1st
##  [1] "VinylSd" "MetalSd" "Wd Sdng" "HdBoard" "BrkFace" "WdShing" "CemntBd"
##  [8] "Plywood" "AsbShng" "Stucco"  "BrkComm" "AsphShn" "Stone"   "ImStucc"
## [15] "CBlock" 
## 
## $Exterior2nd
##  [1] "VinylSd" "MetalSd" "Wd Shng" "HdBoard" "Plywood" "Wd Sdng" "CmentBd"
##  [8] "BrkFace" "Stucco"  "AsbShng" "Brk Cmn" "ImStucc" "AsphShn" "Stone"  
## [15] "Other"   "CBlock" 
## 
## $MasVnrType
## [1] "BrkFace" "None"    "Stone"   "BrkCmn"  NA       
## 
## $ExterQual
## [1] "Gd" "TA" "Ex" "Fa"
## 
## $ExterCond
## [1] "TA" "Gd" "Fa" "Po" "Ex"
## 
## $Foundation
## [1] "PConc"  "CBlock" "BrkTil" "Wood"   "Slab"   "Stone" 
## 
## $BsmtQual
## [1] "Gd" "TA" "Ex" NA   "Fa"
## 
## $BsmtCond
## [1] "TA" "Gd" NA   "Fa" "Po"
## 
## $BsmtExposure
## [1] "No" "Gd" "Mn" "Av" NA  
## 
## $BsmtFinType1
## [1] "GLQ" "ALQ" "Unf" "Rec" "BLQ" NA    "LwQ"
## 
## $BsmtFinType2
## [1] "Unf" "BLQ" NA    "ALQ" "Rec" "LwQ" "GLQ"
## 
## $Heating
## [1] "GasA"  "GasW"  "Grav"  "Wall"  "OthW"  "Floor"
## 
## $HeatingQC
## [1] "Ex" "Gd" "TA" "Fa" "Po"
## 
## $CentralAir
## [1] "Y" "N"
## 
## $Electrical
## [1] "SBrkr" "FuseF" "FuseA" "FuseP" "Mix"   NA     
## 
## $KitchenQual
## [1] "Gd" "TA" "Ex" "Fa"
## 
## $Functional
## [1] "Typ"  "Min1" "Maj1" "Min2" "Mod"  "Maj2" "Sev" 
## 
## $FireplaceQu
## [1] NA   "TA" "Gd" "Fa" "Ex" "Po"
## 
## $GarageType
## [1] "Attchd"  "Detchd"  "BuiltIn" "CarPort" NA        "Basment" "2Types" 
## 
## $GarageFinish
## [1] "RFn" "Unf" "Fin" NA   
## 
## $GarageQual
## [1] "TA" "Fa" "Gd" NA   "Ex" "Po"
## 
## $GarageCond
## [1] "TA" "Fa" NA   "Gd" "Po" "Ex"
## 
## $PavedDrive
## [1] "Y" "N" "P"
## 
## $PoolQC
## [1] NA   "Ex" "Fa" "Gd"
## 
## $Fence
## [1] NA      "MnPrv" "GdWo"  "GdPrv" "MnWw" 
## 
## $MiscFeature
## [1] NA     "Shed" "Gar2" "Othr" "TenC"
## 
## $SaleType
## [1] "WD"    "New"   "COD"   "ConLD" "ConLI" "CWD"   "ConLw" "Con"   "Oth"  
## 
## $SaleCondition
## [1] "Normal"  "Abnorml" "Partial" "AdjLand" "Alloca"  "Family"
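One way to make the typo check above more systematic is to compare the values of paired columns. The sketch below copies the Exterior1st and Exterior2nd values by hand from the output above and uses setdiff() to list values that appear in only one of the two columns. On the real data, the vectors would come from unique(raw_data$Exterior1st) and unique(raw_data$Exterior2nd) instead.

```r
# Values typed in from the unique() output above (illustration only)
exterior1st <- c("VinylSd", "MetalSd", "Wd Sdng", "HdBoard", "BrkFace",
                 "WdShing", "CemntBd", "Plywood", "AsbShng", "Stucco",
                 "BrkComm", "AsphShn", "Stone", "ImStucc", "CBlock")
exterior2nd <- c("VinylSd", "MetalSd", "Wd Shng", "HdBoard", "Plywood",
                 "Wd Sdng", "CmentBd", "BrkFace", "Stucco", "AsbShng",
                 "Brk Cmn", "ImStucc", "AsphShn", "Stone", "Other", "CBlock")

# Values present in one column but not in the other
only_in_1st <- setdiff(exterior1st, exterior2nd)
only_in_2nd <- setdiff(exterior2nd, exterior1st)

print(only_in_1st)
print(only_in_2nd)
```

The mismatches (for example "CemntBd" vs. "CmentBd") suggest the two columns may spell some of the same materials differently, which would matter if the columns were ever compared or combined.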

3.1 Data Cleaning

Each categorical feature should be converted to a factor so that R treats it as categorical rather than as free text. Note: I treated GarageType as a nominal feature, since its categories have no natural order.

After taking a closer look at the metadata file, Alley, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PoolQC, Fence, and MiscFeature each have an “NA” category that actually means the house does not have the item the predictor describes. For example, an NA value for GarageType means “No Garage”. This is an informative categorical value and shouldn’t be treated as missing data.

To clean the data, let’s convert these NA values to their actual categorical value.
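A minimal sketch of this recoding on a toy vector (the values below are hypothetical, standing in for a column such as GarageQual). The order of the two steps matters: if factor() were called with fixed levels first, the NA entries would stay NA.

```r
# Hypothetical toy vector standing in for a column such as GarageQual
garage_qual <- c("TA", NA, "Gd", "Fa", NA, "Ex")

# Step 1: recode NA to an explicit category while the column is still character
garage_qual[is.na(garage_qual)] <- "NoGarage"

# Step 2: build the ordered factor, from worst to best
garage_qual <- factor(
  garage_qual,
  levels = c("NoGarage", "Po", "Fa", "TA", "Gd", "Ex"),
  ordered = TRUE
)

# No values are missing any more, and order comparisons now work
n_missing <- sum(is.na(garage_qual))
at_least_typical <- garage_qual >= "TA"
print(at_least_typical)
```

Because the factor is ordered, a comparison such as garage_qual >= "TA" treats "NoGarage" as the lowest level, which is the ranking chosen here for illustration.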

After cleaning up the data, the remaining predictors with missing values are Electrical, MasVnrType, MasVnrArea, GarageYrBlt, and LotFrontage.

As we can see above, a missing GarageYrBlt indicates “No Garage”. For GarageYrBlt, let’s replace NA with 0 and rely on the existing GarageType categorical variable, which already indicates when no garage is present.

MasVnrType, MasVnrArea, and LotFrontage seem genuinely missing according to the metadata file. Looking at the other predictors when LotFrontage is NA, the missing values are spread across many predictors and their values, so a missing LotFrontage doesn’t seem to encode any informative feature of the house. For example, LotFrontage is not missing only when Condition1 == “PosN”. Therefore, imputation will probably be used for these variables.
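The check described above can also be done numerically by computing the share of missing values within each level of another predictor. A minimal sketch on made-up data (the toy data frame is hypothetical; on the real data the columns would be raw_data$LotFrontage and raw_data$Condition1):

```r
# Hypothetical toy data, not taken from the dataset
toy <- data.frame(
  LotFrontage = c(65, NA, 80, NA, 60, 70, NA, 75),
  Condition1  = c("Norm", "Norm", "Feedr", "Feedr", "PosN", "PosN", "Norm", "Feedr")
)

# Share of missing LotFrontage values within each Condition1 level
missing_rate <- tapply(is.na(toy$LotFrontage), toy$Condition1, mean)
print(round(missing_rate, 2))
```

If the missing rate were near zero for all but one level, the missingness would be tied to that level; rates spread over several levels are more consistent with values that are simply missing.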

# Convert categorical to factors

raw_data$MSSubClass <- as.factor(raw_data$MSSubClass) # this is a sneaky categorical variable that needs to be a factor (nominal)

# Create new categories if the house feature doesn't exist, such as "No Garage"
na_labels <- c(
  Alley = "NoAlley",
  BsmtQual = "NoBasement", BsmtCond = "NoBasement", BsmtExposure = "NoBasement",
  BsmtFinType1 = "NoBasement", BsmtFinType2 = "NoBasement",
  FireplaceQu = "NoFireplace",
  GarageType = "NoGarage", GarageFinish = "NoGarage",
  GarageQual = "NoGarage", GarageCond = "NoGarage",
  PoolQC = "NoPool", Fence = "NoFence", MiscFeature = "None"
)
# Replace NA with the matching label in each (character) column
for (col in names(na_labels)) {
  raw_data[[col]][is.na(raw_data[[col]])] <- na_labels[[col]]
}

# Update ordinal features
raw_data$Fence <- factor(
  raw_data$Fence,
  levels = c("NoFence", "MnWw", "GdWo", "MnPrv", "GdPrv"),
  ordered = TRUE
)
raw_data$PoolQC <- factor(
  raw_data$PoolQC,
  levels = c("NoPool", "Fa", "TA", "Gd", "Ex"),
  ordered = TRUE
)
raw_data$GarageCond <- factor(
  raw_data$GarageCond,
  levels = c("NoGarage", "Po", "Fa", "TA", "Gd", "Ex"),
  ordered = TRUE
)
raw_data$GarageQual <- factor(
  raw_data$GarageQual,
  levels = c("NoGarage", "Po", "Fa", "TA", "Gd", "Ex"),
  ordered = TRUE
)
raw_data$GarageFinish <- factor(
  raw_data$GarageFinish,
  levels = c("NoGarage", "Unf", "RFn", "Fin"),
  ordered = TRUE
)
raw_data$FireplaceQu <- factor(
  raw_data$FireplaceQu,
  levels = c("NoFireplace", "Po", "Fa", "TA", "Gd", "Ex"),
  ordered = TRUE
)
raw_data$Functional <- factor(
  raw_data$Functional,
  levels = c("Sal", "Sev", "Maj2", "Maj1", "Mod", "Min2", "Min1", "Typ"),
  ordered = TRUE
)
raw_data$KitchenQual <- factor(
  raw_data$KitchenQual,
  levels = c("Po", "Fa", "TA", "Gd", "Ex"),
  ordered = TRUE
)
raw_data$HeatingQC <- factor(
  raw_data$HeatingQC,
  levels = c("Po", "Fa", "TA", "Gd", "Ex"),
  ordered = TRUE
)
raw_data$BsmtFinType2 <- factor(
  raw_data$BsmtFinType2,
  levels = c("NoBasement", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"),
  ordered = TRUE
)
raw_data$BsmtFinType1 <- factor(
  raw_data$BsmtFinType1,
  levels = c("NoBasement", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"),
  ordered = TRUE
)
raw_data$BsmtExposure <- factor(
  raw_data$BsmtExposure,
  levels = c("NoBasement", "No", "Mn", "Av", "Gd"),
  ordered = TRUE
)
raw_data$BsmtCond <- factor(
  raw_data$BsmtCond,
  levels = c("NoBasement", "Po", "Fa", "TA", "Gd", "Ex"),
  ordered = TRUE
)
raw_data$BsmtQual <- factor(
  raw_data$BsmtQual,
  levels = c("NoBasement", "Po", "Fa", "TA", "Gd", "Ex"),
  ordered = TRUE
)
raw_data$ExterCond <- factor(
  raw_data$ExterCond,
  levels = c("Po", "Fa", "TA", "Gd", "Ex"),
  ordered = TRUE
)
raw_data$ExterQual <- factor(
  raw_data$ExterQual,
  levels = c("Po", "Fa", "TA", "Gd", "Ex"),
  ordered = TRUE
)
raw_data$LandSlope <- factor(
  raw_data$LandSlope,
  levels = c("Sev", "Mod", "Gtl"),
  ordered = TRUE
)

# Update nominal features
# (1-10 rating columns are left as numeric, which preserves their order)
nominal_features <- c("SaleCondition", "SaleType", "MiscFeature", "PavedDrive", "GarageType", "CentralAir", "Heating", "Foundation", "Exterior2nd", "Exterior1st", "RoofMatl", "RoofStyle", "HouseStyle", "BldgType", "Condition2", "Condition1", "Neighborhood", "LotConfig", "Utilities", "LandContour", "LotShape", "Alley", "Street", "MSZoning") 
raw_data <- update_columns(raw_data, nominal_features, as.factor)

# Visualize Missing Data
plot_missing(raw_data)

# Check how missingness relates to other variables
gg_miss_upset(raw_data, nsets = 10)

# Does a missing LotFrontage indicate anything?
# Focus on other predictor variable values when this value is NA
gg_miss_fct(x = raw_data, fct = Condition1)

gg_miss_fct(x = raw_data, fct = Alley)

gg_miss_fct(x = raw_data, fct = Neighborhood)

gg_miss_fct(x = raw_data, fct = LotConfig)

gg_miss_fct(x = raw_data, fct = BldgType)

gg_miss_fct(x = raw_data, fct = Street)

# Replace NA with 0 for GarageYrBlt and use the existing Garage missing indicator categorical variables
raw_data$GarageYrBlt <- ifelse(is.na(raw_data$GarageYrBlt), 0, raw_data$GarageYrBlt)


# Keep year as numeric and convert month to factor
# raw_data$YrSold <- as.factor(raw_data$YrSold)
raw_data$MoSold <- as.factor(raw_data$MoSold)
# raw_data$GarageYrBlt <- as.factor(raw_data$GarageYrBlt)
# raw_data$YearBuilt <- as.factor(raw_data$YearBuilt)
# raw_data$YearRemodAdd <- as.factor(raw_data$YearRemodAdd)

3.3 Data Distribution

SalePrice is right skewed and has a wide spread.

Most numerical predictors are at least slightly right skewed except for OverallCond, YearBuilt, YearRemodAdd, and GarageYrBlt.

Most numerical predictors are unimodal except for BsmtFinSF1 (bimodal), BsmtUnfSF (multimodal), MSSubClass (multimodal), TotalBsmtSF (bimodal), X2ndFlrSF (bimodal), YearBuilt (multimodal), YearRemodAdd (multimodal), GarageArea (multimodal), OpenPorchSF (multimodal), WoodDeckSF (bimodal), and YrSold (bimodal).

As we saw earlier, the predictors vary widely in spread. For example, GarageArea has a very large spread, while FullBath has a very small one.
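To put a number on the skew described above, one option is the sample skewness (third central moment divided by the cubed standard deviation). The sketch below uses a hypothetical helper, sample_skewness(), and made-up vectors; it is an illustration, not the calculation behind the plots in this report.

```r
# Hypothetical helper: moment-based sample skewness (divides by n)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

right_skewed <- c(1, 1, 2, 2, 3, 3, 4, 20)  # made-up values with a long right tail
symmetric    <- c(1, 2, 3, 4, 5, 6, 7)      # made-up symmetric values

skew_right <- sample_skewness(right_skewed)
skew_sym   <- sample_skewness(symmetric)
print(c(right = skew_right, symmetric = skew_sym))
```

A value well above 0 indicates a right skew, while a value near 0 indicates a roughly symmetric distribution.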

For categorical predictors, some things to note:

  • 1-STORY 1946 & NEWER ALL STYLES is the most frequent type of dwelling
  • RL (Residential Low Density) has the highest frequency for general zoning classification
  • Most streets are paved
  • Most houses have no alley
  • Most slopes are gentle
  • Most properties are near flat/level
  • Condition 1 and 2 are mostly normal
  • Most houses are single family
  • Most roofs are gable style
  • Most roof material is standard (composite) shingle
  • The exterior material is most often in typical condition
  • Most houses have gas heating
  • Most houses have no basement exposure (walkout or garden-level walls)
  • Spring is the most frequent season for house sales

Some features have categories with very few observations, such as Condition1 and Condition2 (PosA is rare), RoofMatl, and RoofStyle. Depending on the model, these sparse categories could be grouped into an “Other” category.
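A minimal base R sketch of such grouping, assuming a count threshold below which a level is considered rare. The helper name lump_rare(), the threshold, and the toy column are all hypothetical.

```r
# Hypothetical helper: replace levels with fewer than min_count rows by "Other"
lump_rare <- function(x, min_count) {
  x <- as.character(x)
  counts <- table(x)
  rare <- names(counts)[counts < min_count]
  x[x %in% rare] <- "Other"
  factor(x)
}

# Made-up column standing in for something like RoofMatl
roof <- c(rep("CompShg", 10), rep("Tar&Grv", 3), "Metal", "Roll")
roof_lumped <- lump_rare(roof, min_count = 3)
print(table(roof_lumped))
```

Whether this helps depends on the model: tree-based models can often cope with sparse levels, while models that rely on one-hot encoding may benefit from fewer, better-populated categories.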

The QQ plots and box plots confirm the skew seen in the histograms (many extreme values). Many predictor variables have outliers, shown by the red dots in the box plots. However, based on the feature meanings and provided information, there is no reason to believe that any of these extreme values are mistakes or data errors. As such, we will not remove the extreme values, as they could be predictive of the target.
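The points a box plot draws as outliers are conventionally those lying more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule on made-up values (the vector is hypothetical, not taken from the dataset):

```r
# Made-up values standing in for a skewed predictor such as LotArea
lot_area <- c(7000, 8000, 8500, 9000, 9500, 10000, 11000, 12000, 50000)

# Quartiles and the 1.5 * IQR fences
q <- quantile(lot_area, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# Points outside the fences are the ones a box plot would mark
flagged <- lot_area[lot_area < lower | lot_area > upper]
print(flagged)
```

Flagging points this way only identifies them for review; it does not by itself justify removing them.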

For lower-priced houses, there seems to be a wider spread and more outliers, which makes sense since that is where the bulk of the data lies. January and July have the widest spread for SalePrice. Prices increase as the year built, and the remodel year, become more recent. Generally, as square footage increases, the price increases as well.

# Plot histograms
plot_histogram(raw_data)

# Categorical Variables
data_raw_categorical <- raw_data |>
  dplyr::select(where(is.factor) | where(is.character))
categorical_predictors <- names(data_raw_categorical)[names(data_raw_categorical) != "SalePrice"]
plot_bar(data_raw_categorical)

# Closer look at target distribution (SalePrice)
hist(as.numeric(raw_data$SalePrice), main="SalePrice Distribution",
     xlab="Sale Price", col="lightblue", breaks=100)

# QQ Plot
plot_qq(raw_data, sampled_rows = 1000L)
## Warning: Removed 181 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Warning: Removed 181 rows containing non-finite outside the scale range
## (`stat_qq_line()`).

# Box plots - visualize outliers
ggplot(stack(data_raw_numeric), aes(x = ind, y = values)) + 
  geom_boxplot(color = 'skyblue', outlier.color = 'red') +
  coord_cartesian(ylim = c(0, 10000)) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = 'grey96')) +
  labs(title = "Boxplots of Predictor Variables", x="Predictors")
## Warning: Removed 348 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# Box plots by TARGET
plot_boxplot(raw_data, by = "SalePrice", ggtheme = theme_light())
## Warning: Removed 267 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# Scatter plots by TARGET
plot_scatterplot(raw_data, by = "SalePrice", sampled_rows = 1000L)

3.4 Identifying Correlations and Relationships

Top positive correlations with SalePrice (full list printed below):

  1. OverallQual: 0.79 (Higher overall quality, higher price)
  2. GrLivArea: 0.71 (Larger living area, higher price)
  3. GarageCars: 0.64 (Larger garage capacity in cars, and presumably more cars owned, higher price)
  4. GarageArea: 0.62 (Larger garage area, higher price)
  5. TotalBsmtSF: 0.61 (Larger basement, higher price)

Top negative correlations with SalePrice:

  1. KitchenAbvGr: -0.14 (More kitchens on the above-ground floors of a house, lower the price – interesting!)
  2. EnclosedPorch: -0.13 (Larger enclosed porch area in square feet, lower the price)
  3. OverallCond: -0.08 (Higher overall condition rating, lower the price – this does not make sense, and needs looking into.)
  4. YrSold: -0.03 (Later sale year, marginally lower price; effectively no linear relationship)
  5. LowQualFinSF: -0.03 (More low-quality finished square footage, marginally lower price; effectively no linear relationship)

Many more features are strongly positively correlated with SalePrice than negatively correlated: the strongest negative correlation is only -0.14.

Top correlations among predictors:

  1. GarageCars <-> GarageArea: 0.88 (More cars, the larger the garage)
  2. GrLivArea <-> TotRmsAbvGrd: 0.83 (Higher above grade (ground) living area square feet, more total rooms above grade)
  3. TotalBsmtSF <-> X1stFlrSF: 0.82 (Higher total basement square feet, higher first floor square feet - share the same or similar floor plan dimensions/area)
  4. X2ndFlrSF <-> GrLivArea: 0.69 (Larger 2nd floor usually means larger living area)
  5. BedroomAbvGr <-> TotRmsAbvGrd: 0.68 (More bedrooms, usually more rooms in total)

Many more can be seen below.

Initial business insights:

The correlation funnel shows that the following values for the predictors correspond to a high SalePrice (213,497.50 and above):

  • Overall material and finish quality of 7 or higher
  • Good exterior material quality (not Excellent, since the Excellent category is not very common)
  • 3 car garage
  • 578 sq ft sized garage or larger
  • 1768 sq ft living area or larger
  • Excellent kitchen quality
  • 1308.25 sq ft basement or larger

Feature engineering can be done to remove multicollinearity and slim down the dataset.
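One way to shortlist candidates for removal or combination is to list predictor pairs whose absolute correlation exceeds a threshold. The sketch below uses simulated data and a 0.8 cutoff; both are assumptions made for illustration, not values taken from this analysis.

```r
set.seed(1)  # simulated toy data only
n <- 200
garage_cars <- sample(0:3, n, replace = TRUE)
toy <- data.frame(
  GarageCars = garage_cars,
  GarageArea = 250 * garage_cars + rnorm(n, sd = 40),  # built to track GarageCars
  LotArea    = rnorm(n, mean = 10000, sd = 2000)       # unrelated by construction
)

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and only if |r| exceeds the cutoff
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(pairs_df)
```

Applied to the predictor correlations listed above, a 0.8 cutoff would flag GarageCars/GarageArea (0.88), GrLivArea/TotRmsAbvGrd (0.83), and TotalBsmtSF/X1stFlrSF (0.82).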

Two-way Cross-Tabulations - Categorical Variables:

Based on the p-values, the following categorical variables are not statistically significant at the 0.05 level (the rest are; see the data frame below):

Variable p-value
Utilities 1.0000000
Condition1 1.0000000
RoofStyle 1.0000000
RoofMatl 1.0000000
MiscFeature 1.0000000
Fence 1.0000000
GarageCond 1.0000000
BldgType 0.9999861
Exterior1st 0.9999839
HeatingQC 0.9995947
MSSubClass 0.9928572
PavedDrive 0.9912800
Alley 0.9637568
BsmtFinType2 0.9480709
Exterior2nd 0.8469190
HouseStyle 0.6482615
BsmtFinType1 0.2129277
MoSold 0.1417502
LandSlope 0.1050864
LandContour 0.0867465
Condition2 0.0759864
GarageType 0.0647807
Electrical 0.0556880
GarageQual 0.0501487

Most of these are not too surprising, as they don’t seem like critical house features.
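The code that produced these p-values is not shown in this section. One common way to get a p-value from a two-way cross-tabulation is a chi-squared test of independence between a categorical predictor and a binned version of the target. The sketch below assumes that approach and bins the price at its median; the data, the binning, and the variable names are all hypothetical.

```r
# Made-up data: 40 houses, a binary feature and a sale price
central_air <- rep(c("Y", "N"), each = 20)
sale_price  <- c(seq(150000, 340000, by = 10000),  # "Y" houses: higher prices
                 seq(60000, 250000, by = 10000))   # "N" houses: lower prices

# Bin the target at its median (an assumption made for this sketch)
price_band <- ifelse(sale_price > median(sale_price), "High", "Low")

# Two-way cross-tabulation and chi-squared test of independence
xtab <- table(central_air, price_band)
test <- chisq.test(xtab)
p_value <- test$p.value

print(xtab)
print(p_value)
```

A p-value below 0.05 would suggest that the feature and the price band are not independent. With many sparse categories, the chi-squared approximation can be unreliable, so very large p-values for features with rare levels should be read with caution.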

Some additional insights into relationships between predictors (from predictor vs. predictor box plots below):

  • Overall quality is higher for 1 and 2 story houses (lower for 1.5)
  • 2.5 story and 1 story houses have the largest 1st floor square footage (either the house is massive or, for a 1 story house, all of the living area is on the 1st floor)
  • Good exterior condition is associated with larger open porch square footage
  • Better kitchen quality is associated with a more recent year built, a larger basement, a larger first floor, and higher overall quality

These predictor vs. predictor box plots also showed some outliers that do not necessarily make sense, such as a 1 story house having 2nd floor square footage, though there may be exceptions in the real estate world. Since I don’t have enough information to know whether these are valid observations, I’m going to keep them in the model. The chosen models should be robust to outliers anyway.

#Correlation matrix with target

numeric_vars <- raw_data %>% select_if(is.numeric) 
cor_matrix <- cor(numeric_vars, use="pairwise.complete.obs")
corr_target <- cor_matrix[,"SalePrice"]
corr_target_sorted <- sort(corr_target, decreasing = TRUE)

kable(as.data.frame(corr_target_sorted), col.names = c("Correlation with SalePrice"), digits = 2)
Correlation with SalePrice
SalePrice 1.00
OverallQual 0.79
GrLivArea 0.71
GarageCars 0.64
GarageArea 0.62
TotalBsmtSF 0.61
X1stFlrSF 0.61
FullBath 0.56
TotRmsAbvGrd 0.53
YearBuilt 0.52
YearRemodAdd 0.51
MasVnrArea 0.48
Fireplaces 0.47
BsmtFinSF1 0.39
LotFrontage 0.35
WoodDeckSF 0.32
X2ndFlrSF 0.32
OpenPorchSF 0.32
HalfBath 0.28
LotArea 0.26
GarageYrBlt 0.26
BsmtFullBath 0.23
BsmtUnfSF 0.21
BedroomAbvGr 0.17
ScreenPorch 0.11
PoolArea 0.09
X3SsnPorch 0.04
BsmtFinSF2 -0.01
BsmtHalfBath -0.02
MiscVal -0.02
LowQualFinSF -0.03
YrSold -0.03
OverallCond -0.08
EnclosedPorch -0.13
KitchenAbvGr -0.14
#Correlation heatmap
corrplot::corrplot(cor_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         number.cex = 0.7, diag = FALSE)

# Top 3 positive and top 3 negative correlations with SalePrice
corr_target <- cor_matrix[, "SalePrice"]
corr_target <- corr_target[names(corr_target) != "SalePrice"]  # Remove self-correlation

top_pos <- sort(corr_target, decreasing = TRUE)[1:3]
top_neg <- sort(corr_target, decreasing = FALSE)[1:3]

combined_df <- data.frame(
  Feature = c(names(top_pos), names(top_neg)),
  Correlation = c(top_pos, top_neg)) %>%
  mutate(Direction = ifelse(Correlation > 0, "Positive", "Negative"))

ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
  labs(title = "Top Features Correlated with SalePrice",
       x = "Feature",
       y = "Correlation with SalePrice",
       fill = "Direction") +
  theme_minimal()

# Correlation among predictors
melt_cor <- melt(cor_matrix)
melt_cor <- melt_cor |>
  filter(Var1 != "SalePrice") |>
  filter(Var2 != "SalePrice")
filtered_cor <- melt_cor[melt_cor$Var1 != melt_cor$Var2 & as.numeric(melt_cor$Var1) < as.numeric(melt_cor$Var2), ]
sorted_cor <- filtered_cor[order(abs(filtered_cor$value), decreasing = TRUE), ]
kable(as.data.frame(sorted_cor), digits = 2)
Var1 Var2 value
875 GarageCars GarageArea 0.88
729 GrLivArea TotRmsAbvGrd 0.83
385 TotalBsmtSF X1stFlrSF 0.82
489 X2ndFlrSF GrLivArea 0.69
734 BedroomAbvGr TotRmsAbvGrd 0.68
518 BsmtFinSF1 BsmtFullBath 0.65
593 GrLivArea FullBath 0.63
727 X2ndFlrSF TotRmsAbvGrd 0.62
625 X2ndFlrSF HalfBath 0.61
819 OverallQual GarageCars 0.60
840 GarageYrBlt GarageCars 0.60
479 OverallQual GrLivArea 0.59
175 YearBuilt YearRemodAdd 0.59
139 OverallQual YearBuilt 0.57
488 X1stFlrSF GrLivArea 0.57
853 OverallQual GarageArea 0.56
874 GarageYrBlt GarageArea 0.56
732 FullBath TotRmsAbvGrd 0.55
173 OverallQual YearRemodAdd 0.55
581 OverallQual FullBath 0.55
821 YearBuilt GarageCars 0.54
343 OverallQual TotalBsmtSF 0.54
348 BsmtFinSF1 TotalBsmtSF 0.52
661 GrLivArea BedroomAbvGr 0.52
659 X2ndFlrSF BedroomAbvGr 0.50
314 BsmtFinSF1 BsmtUnfSF -0.50
862 X1stFlrSF GarageArea 0.49
861 TotalBsmtSF GarageArea 0.49
855 YearBuilt GarageArea 0.48
377 OverallQual X1stFlrSF 0.48
834 FullBath GarageCars 0.47
865 GrLivArea GarageArea 0.47
583 YearBuilt FullBath 0.47
831 GrLivArea GarageCars 0.47
763 GrLivArea Fireplaces 0.46
375 LotFrontage X1stFlrSF 0.46
487 TotalBsmtSF GrLivArea 0.45
382 BsmtFinSF1 X1stFlrSF 0.45
828 X1stFlrSF GarageCars 0.44
584 YearRemodAdd FullBath 0.44
827 TotalBsmtSF GarageCars 0.43
717 OverallQual TotRmsAbvGrd 0.43
35 LotFrontage LotArea 0.43
520 BsmtUnfSF BsmtFullBath -0.42
591 X2ndFlrSF FullBath 0.42
822 YearRemodAdd GarageCars 0.42
627 GrLivArea HalfBath 0.42
350 BsmtUnfSF TotalBsmtSF 0.42
207 OverallQual MasVnrArea 0.41
760 X1stFlrSF Fireplaces 0.41
726 X1stFlrSF TotRmsAbvGrd 0.41
868 FullBath GarageArea 0.41
477 LotFrontage GrLivArea 0.40
751 OverallQual Fireplaces 0.40
341 LotFrontage TotalBsmtSF 0.39
345 YearBuilt TotalBsmtSF 0.39
483 MasVnrArea GrLivArea 0.39
957 YearBuilt EnclosedPorch -0.39
590 X1stFlrSF FullBath 0.38
140 OverallCond YearBuilt -0.38
857 MasVnrArea GarageArea 0.37
856 YearRemodAdd GarageArea 0.37
823 MasVnrArea GarageCars 0.36
347 MasVnrArea TotalBsmtSF 0.36
664 FullBath BedroomAbvGr 0.36
838 TotRmsAbvGrd GarageCars 0.36
715 LotFrontage TotRmsAbvGrd 0.35
851 LotFrontage GarageArea 0.34
381 MasVnrArea X1stFlrSF 0.34
733 HalfBath TotRmsAbvGrd 0.34
759 TotalBsmtSF Fireplaces 0.34
872 TotRmsAbvGrd GarageArea 0.34
933 GrLivArea OpenPorchSF 0.33
770 TotRmsAbvGrd Fireplaces 0.33
589 TotalBsmtSF FullBath 0.32
384 BsmtUnfSF X1stFlrSF 0.32
209 YearBuilt MasVnrArea 0.32
921 OverallQual OpenPorchSF 0.31
309 OverallQual BsmtUnfSF 0.31
521 TotalBsmtSF BsmtFullBath 0.31
839 Fireplaces GarageCars 0.30
376 LotArea X1stFlrSF 0.30
858 BsmtFinSF1 GarageArea 0.30
411 OverallQual X2ndFlrSF 0.30
346 YearRemodAdd TotalBsmtSF 0.29
785 OverallQual GarageYrBlt 0.29
588 BsmtUnfSF FullBath 0.29
482 YearRemodAdd GrLivArea 0.29
817 LotFrontage GarageCars 0.29
725 TotalBsmtSF TotRmsAbvGrd 0.29
379 YearBuilt X1stFlrSF 0.28
721 MasVnrArea TotRmsAbvGrd 0.28
585 MasVnrArea FullBath 0.28
615 OverallQual HalfBath 0.27
787 YearBuilt GarageYrBlt 0.27
750 LotArea Fireplaces 0.27
873 Fireplaces GarageArea 0.27
749 LotFrontage Fireplaces 0.27
245 MasVnrArea BsmtFinSF1 0.26
647 LotFrontage BedroomAbvGr 0.26
478 LotArea GrLivArea 0.26
342 LotArea TotalBsmtSF 0.26
756 BsmtFinSF1 Fireplaces 0.26
936 FullBath OpenPorchSF 0.26
735 KitchenAbvGr TotRmsAbvGrd 0.26
69 LotFrontage OverallQual 0.25
724 BsmtUnfSF TotRmsAbvGrd 0.25
243 YearBuilt BsmtFinSF1 0.25
755 MasVnrArea Fireplaces 0.25
899 GrLivArea WoodDeckSF 0.25
929 TotalBsmtSF OpenPorchSF 0.25
522 X1stFlrSF BsmtFullBath 0.24
766 FullBath Fireplaces 0.24
617 YearBuilt HalfBath 0.24
944 GarageArea OpenPorchSF 0.24
380 YearRemodAdd X1stFlrSF 0.24
486 BsmtUnfSF GrLivArea 0.24
241 OverallQual BsmtFinSF1 0.24
887 OverallQual WoodDeckSF 0.24
896 X1stFlrSF WoodDeckSF 0.24
940 TotRmsAbvGrd OpenPorchSF 0.23
239 LotFrontage BsmtFinSF1 0.23
895 TotalBsmtSF WoodDeckSF 0.23
665 HalfBath BedroomAbvGr 0.23
909 GarageCars WoodDeckSF 0.23
924 YearRemodAdd OpenPorchSF 0.23
889 YearBuilt WoodDeckSF 0.22
910 GarageArea WoodDeckSF 0.22
824 BsmtFinSF1 GarageCars 0.22
835 HalfBath GarageCars 0.22
826 BsmtUnfSF GarageCars 0.21
240 LotArea BsmtFinSF1 0.21
943 GarageCars OpenPorchSF 0.21
930 X1stFlrSF OpenPorchSF 0.21
315 BsmtFinSF2 BsmtUnfSF -0.21
484 BsmtFinSF1 GrLivArea 0.21
931 X2ndFlrSF OpenPorchSF 0.21
1055 LotFrontage PoolArea 0.21
890 YearRemodAdd WoodDeckSF 0.21
892 BsmtFinSF1 WoodDeckSF 0.20
767 HalfBath Fireplaces 0.20
420 X1stFlrSF X2ndFlrSF -0.20
619 MasVnrArea HalfBath 0.20
907 Fireplaces WoodDeckSF 0.20
937 HalfBath OpenPorchSF 0.20
481 YearBuilt GrLivArea 0.20
579 LotFrontage FullBath 0.20
700 BedroomAbvGr KitchenAbvGr 0.20
761 X2ndFlrSF Fireplaces 0.19
582 OverallCond FullBath -0.19
958 YearRemodAdd EnclosedPorch -0.19
205 LotFrontage MasVnrArea 0.19
720 YearRemodAdd TotRmsAbvGrd 0.19
716 LotArea TotRmsAbvGrd 0.19
923 YearBuilt OpenPorchSF 0.19
902 FullBath WoodDeckSF 0.19
515 YearBuilt BsmtFullBath 0.19
805 Fireplaces GarageYrBlt 0.19
820 OverallCond GarageCars -0.19
1043 Fireplaces ScreenPorch 0.18
829 X2ndFlrSF GarageCars 0.18
683 OverallQual KitchenAbvGr -0.18
447 YearBuilt LowQualFinSF -0.18
618 YearRemodAdd HalfBath 0.18
860 BsmtUnfSF GarageArea 0.18
312 YearRemodAdd BsmtUnfSF 0.18
852 LotArea GarageArea 0.18
210 YearRemodAdd MasVnrArea 0.18
866 BsmtFullBath GarageArea 0.18
793 TotalBsmtSF GarageYrBlt 0.18
900 BsmtFullBath WoodDeckSF 0.18
685 YearBuilt KitchenAbvGr -0.17
415 MasVnrArea X2ndFlrSF 0.17
419 TotalBsmtSF X2ndFlrSF -0.17
886 LotArea WoodDeckSF 0.17
344 OverallCond TotalBsmtSF -0.17
1069 GrLivArea PoolArea 0.17
523 X2ndFlrSF BsmtFullBath -0.17
941 Fireplaces OpenPorchSF 0.17
656 BsmtUnfSF BedroomAbvGr 0.17
794 X1stFlrSF GarageYrBlt 0.17
906 TotRmsAbvGrd WoodDeckSF 0.17
869 HalfBath GarageArea 0.16
797 GrLivArea GarageYrBlt 0.16
891 MasVnrArea WoodDeckSF 0.16
519 BsmtFinSF2 BsmtFullBath 0.16
512 LotArea BsmtFullBath 0.16
803 KitchenAbvGr GarageYrBlt -0.16
818 LotArea GarageCars 0.15
919 LotFrontage OpenPorchSF 0.15
854 OverallCond GarageArea -0.15
977 GarageCars EnclosedPorch -0.15
662 BsmtFullBath BedroomAbvGr -0.15
686 YearRemodAdd KitchenAbvGr -0.15
311 YearBuilt BsmtUnfSF 0.15
560 BsmtFullBath BsmtHalfBath -0.15
753 YearBuilt Fireplaces 0.15
796 LowQualFinSF GarageYrBlt -0.15
788 YearRemodAdd GarageYrBlt 0.15
378 OverallCond X1stFlrSF -0.14
1062 BsmtFinSF1 PoolArea 0.14
414 YearRemodAdd X2ndFlrSF 0.14
863 X2ndFlrSF GarageArea 0.14
764 BsmtFullBath Fireplaces 0.14
800 FullBath GarageYrBlt 0.14
416 BsmtFinSF1 X2ndFlrSF -0.14
310 OverallCond BsmtUnfSF -0.14
630 FullBath HalfBath 0.14
490 LowQualFinSF GrLivArea 0.13
789 MasVnrArea GarageYrBlt 0.13
698 FullBath KitchenAbvGr 0.13
307 LotFrontage BsmtUnfSF 0.13
832 BsmtFullBath GarageCars 0.13
1066 X1stFlrSF PoolArea 0.13
728 LowQualFinSF TotRmsAbvGrd 0.13
928 BsmtUnfSF OpenPorchSF 0.13
244 YearRemodAdd BsmtFinSF1 0.13
208 OverallCond MasVnrArea -0.13
658 X1stFlrSF BedroomAbvGr 0.13
1065 TotalBsmtSF PoolArea 0.13
580 LotArea FullBath 0.13
979 WoodDeckSF EnclosedPorch -0.13
925 MasVnrArea OpenPorchSF 0.13
769 KitchenAbvGr Fireplaces -0.12
137 LotFrontage YearBuilt 0.12
978 GarageArea EnclosedPorch -0.12
624 X1stFlrSF HalfBath -0.12
648 LotArea BedroomAbvGr 0.12
516 YearRemodAdd BsmtFullBath 0.12
548 OverallCond BsmtHalfBath 0.12
908 GarageYrBlt WoodDeckSF 0.12
801 HalfBath GarageYrBlt 0.12
790 BsmtFinSF1 GarageYrBlt 0.12
970 FullBath EnclosedPorch -0.12
313 MasVnrArea BsmtUnfSF 0.11
955 OverallQual EnclosedPorch -0.11
754 YearRemodAdd Fireplaces 0.11
926 BsmtFinSF1 OpenPorchSF 0.11
274 LotArea BsmtFinSF2 0.11
513 OverallQual BsmtFullBath 0.11
959 MasVnrArea EnclosedPorch -0.11
903 HalfBath WoodDeckSF 0.11
768 BedroomAbvGr Fireplaces 0.11
654 BsmtFinSF1 BedroomAbvGr -0.11
70 LotArea OverallQual 0.11
660 LowQualFinSF BedroomAbvGr 0.11
783 LotFrontage GarageYrBlt 0.11
349 BsmtFinSF2 TotalBsmtSF 0.10
206 LotArea MasVnrArea 0.10
653 MasVnrArea BedroomAbvGr 0.10
960 BsmtFinSF1 EnclosedPorch -0.10
649 OverallQual BedroomAbvGr 0.10
1035 GrLivArea ScreenPorch 0.10
511 LotFrontage BsmtFullBath 0.10
695 GrLivArea KitchenAbvGr 0.10
417 BsmtFinSF2 X2ndFlrSF -0.10
383 BsmtFinSF2 X1stFlrSF 0.10
554 BsmtUnfSF BsmtHalfBath -0.10
804 TotRmsAbvGrd GarageYrBlt 0.10
719 YearBuilt TotRmsAbvGrd 0.10
963 TotalBsmtSF EnclosedPorch -0.10
971 HalfBath EnclosedPorch -0.10
1077 Fireplaces PoolArea 0.10
830 LowQualFinSF GarageCars -0.09
938 BedroomAbvGr OpenPorchSF 0.09
980 OpenPorchSF EnclosedPorch -0.09
897 X2ndFlrSF WoodDeckSF 0.09
105 OverallQual OverallCond -0.09
905 KitchenAbvGr WoodDeckSF -0.09
1029 BsmtFinSF2 ScreenPorch 0.09
171 LotFrontage YearRemodAdd 0.09
1032 X1stFlrSF ScreenPorch 0.09
885 LotFrontage WoodDeckSF 0.09
684 OverallCond KitchenAbvGr -0.09
836 BedroomAbvGr GarageCars 0.09
517 MasVnrArea BsmtFullBath 0.09
920 LotArea OpenPorchSF 0.08
1031 TotalBsmtSF ScreenPorch 0.08
1076 TotRmsAbvGrd PoolArea 0.08
1049 EnclosedPorch ScreenPorch -0.08
1067 X2ndFlrSF PoolArea 0.08
688 BsmtFinSF1 KitchenAbvGr -0.08
409 LotFrontage X2ndFlrSF 0.08
480 OverallCond GrLivArea -0.08
1056 LotArea PoolArea 0.08
976 GarageYrBlt EnclosedPorch -0.08
587 BsmtFinSF2 FullBath -0.08
1048 OpenPorchSF ScreenPorch 0.07
1047 WoodDeckSF ScreenPorch -0.07
174 OverallCond YearRemodAdd 0.07
1081 WoodDeckSF PoolArea 0.07
784 LotArea GarageYrBlt 0.07
1039 HalfBath ScreenPorch 0.07
279 MasVnrArea BsmtFinSF2 -0.07
553 BsmtFinSF2 BsmtHalfBath 0.07
1074 BedroomAbvGr PoolArea 0.07
651 YearBuilt BedroomAbvGr -0.07
956 OverallCond EnclosedPorch 0.07
939 KitchenAbvGr OpenPorchSF -0.07
987 LotFrontage X3SsnPorch 0.07
449 MasVnrArea LowQualFinSF -0.07
691 TotalBsmtSF KitchenAbvGr -0.07
1092 OverallCond MiscVal 0.07
699 HalfBath KitchenAbvGr -0.07
692 X1stFlrSF KitchenAbvGr 0.07
893 BsmtFinSF2 WoodDeckSF 0.07
278 YearRemodAdd BsmtFinSF2 -0.07
1070 BsmtFullBath PoolArea 0.07
864 LowQualFinSF GarageArea -0.07
552 BsmtFinSF1 BsmtHalfBath 0.07
934 BsmtFullBath OpenPorchSF 0.07
1138 BsmtFullBath YrSold 0.07
964 X1stFlrSF EnclosedPorch -0.07
870 BedroomAbvGr GarageArea 0.07
1057 OverallQual PoolArea 0.07
1023 OverallQual ScreenPorch 0.06
594 BsmtFullBath FullBath -0.06
450 BsmtFinSF1 LowQualFinSF -0.06
871 KitchenAbvGr GarageArea -0.06
795 X2ndFlrSF GarageYrBlt 0.06
455 X2ndFlrSF LowQualFinSF 0.06
448 YearRemodAdd LowQualFinSF -0.06
1109 KitchenAbvGr MiscVal 0.06
1068 LowQualFinSF PoolArea 0.06
1028 BsmtFinSF1 ScreenPorch 0.06
965 X2ndFlrSF EnclosedPorch 0.06
1027 MasVnrArea ScreenPorch 0.06
1044 GarageYrBlt ScreenPorch 0.06
966 LowQualFinSF EnclosedPorch 0.06
1080 GarageArea PoolArea 0.06
616 OverallCond HalfBath -0.06
1082 OpenPorchSF PoolArea 0.06
1154 PoolArea YrSold -0.06
1042 TotRmsAbvGrd ScreenPorch 0.06
693 X2ndFlrSF KitchenAbvGr 0.06
103 LotFrontage OverallCond -0.06
275 OverallQual BsmtFinSF2 -0.06
945 WoodDeckSF OpenPorchSF 0.06
586 BsmtFinSF1 FullBath 0.06
1150 OpenPorchSF YrSold -0.06
718 OverallCond TotRmsAbvGrd -0.06
998 X1stFlrSF X3SsnPorch 0.06
514 OverallCond BsmtFullBath -0.05
1024 OverallCond ScreenPorch 0.05
595 BsmtHalfBath FullBath -0.05
1083 EnclosedPorch PoolArea 0.05
613 LotFrontage HalfBath 0.05
730 BsmtFullBath TotRmsAbvGrd -0.05
1041 KitchenAbvGr ScreenPorch -0.05
758 BsmtUnfSF Fireplaces 0.05
1046 GarageArea ScreenPorch 0.05
1085 ScreenPorch PoolArea 0.05
410 LotArea X2ndFlrSF 0.05
837 KitchenAbvGr GarageCars -0.05
1045 GarageCars ScreenPorch 0.05
657 TotalBsmtSF BedroomAbvGr 0.05
1025 YearBuilt ScreenPorch -0.05
280 BsmtFinSF1 BsmtFinSF2 -0.05
968 BsmtFullBath EnclosedPorch -0.05
273 LotFrontage BsmtFinSF2 0.05
942 GarageYrBlt OpenPorchSF 0.05
1072 FullBath PoolArea 0.05
798 BsmtFullBath GarageYrBlt 0.05
277 YearBuilt BsmtFinSF2 -0.05
623 TotalBsmtSF HalfBath -0.05
546 LotArea BsmtHalfBath 0.05
524 LowQualFinSF BsmtFullBath -0.05
757 BsmtFinSF2 Fireplaces 0.05
904 BedroomAbvGr WoodDeckSF 0.05
1139 BsmtHalfBath YrSold -0.05
663 BsmtHalfBath BedroomAbvGr 0.05
242 OverallCond BsmtFinSF1 -0.05
992 YearRemodAdd X3SsnPorch 0.05
722 BsmtFinSF1 TotRmsAbvGrd 0.04
1040 BedroomAbvGr ScreenPorch 0.04
1126 OverallCond YrSold 0.04
1022 LotArea ScreenPorch 0.04
1113 GarageCars MiscVal -0.04
792 BsmtUnfSF GarageYrBlt 0.04
1063 BsmtFinSF2 PoolArea 0.04
972 BedroomAbvGr EnclosedPorch 0.04
696 BsmtFullBath KitchenAbvGr -0.04
1021 LotFrontage ScreenPorch 0.04
1132 BsmtUnfSF YrSold -0.04
622 BsmtUnfSF HalfBath -0.04
689 BsmtFinSF2 KitchenAbvGr -0.04
1033 X2ndFlrSF ScreenPorch 0.04
652 YearRemodAdd BedroomAbvGr -0.04
276 OverallCond BsmtFinSF2 0.04
901 BsmtHalfBath WoodDeckSF 0.04
547 OverallQual BsmtHalfBath -0.04
1147 GarageCars YrSold -0.04
1026 YearRemodAdd ScreenPorch -0.04
443 LotFrontage LowQualFinSF 0.04
825 BsmtFinSF2 GarageCars -0.04
549 YearBuilt BsmtHalfBath -0.04
1090 LotArea MiscVal 0.04
697 BsmtHalfBath KitchenAbvGr -0.04
687 MasVnrArea KitchenAbvGr -0.04
997 TotalBsmtSF X3SsnPorch 0.04
973 KitchenAbvGr EnclosedPorch 0.04
1015 EnclosedPorch X3SsnPorch -0.04
961 BsmtFinSF2 EnclosedPorch 0.04
1137 GrLivArea YrSold -0.04
1142 BedroomAbvGr YrSold -0.04
1011 GarageCars X3SsnPorch 0.04
1128 YearRemodAdd YrSold 0.04
1004 FullBath X3SsnPorch 0.04
723 BsmtFinSF2 TotRmsAbvGrd -0.04
1003 BsmtHalfBath X3SsnPorch 0.04
1064 BsmtUnfSF PoolArea -0.04
1012 GarageArea X3SsnPorch 0.04
791 BsmtFinSF2 GarageYrBlt 0.04
525 GrLivArea BsmtFullBath 0.03
1144 TotRmsAbvGrd YrSold -0.03
1093 YearBuilt MiscVal -0.03
453 TotalBsmtSF LowQualFinSF -0.03
1013 WoodDeckSF X3SsnPorch -0.03
922 OverallCond OpenPorchSF -0.03
621 BsmtFinSF2 HalfBath -0.03
1037 BsmtHalfBath ScreenPorch 0.03
1119 ScreenPorch MiscVal 0.03
1131 BsmtFinSF2 YrSold 0.03
1143 KitchenAbvGr YrSold 0.03
1050 X3SsnPorch ScreenPorch -0.03
1091 OverallQual MiscVal -0.03
991 YearBuilt X3SsnPorch 0.03
628 BsmtFullBath HalfBath -0.03
445 OverallQual LowQualFinSF -0.03
989 OverallQual X3SsnPorch 0.03
690 BsmtUnfSF KitchenAbvGr 0.03
995 BsmtFinSF2 X3SsnPorch -0.03
1095 MasVnrArea MiscVal -0.03
1120 PoolArea MiscVal 0.03
1010 GarageYrBlt X3SsnPorch 0.03
765 BsmtHalfBath Fireplaces 0.03
412 OverallCond X2ndFlrSF 0.03
1136 LowQualFinSF YrSold -0.03
1135 X2ndFlrSF YrSold -0.03
452 BsmtUnfSF LowQualFinSF 0.03
1114 GarageArea MiscVal -0.03
1148 GarageArea YrSold -0.03
1125 OverallQual YrSold -0.03
626 LowQualFinSF HalfBath -0.03
1034 LowQualFinSF ScreenPorch 0.03
551 MasVnrArea BsmtHalfBath 0.03
994 BsmtFinSF1 X3SsnPorch 0.03
990 OverallCond X3SsnPorch 0.03
446 OverallCond LowQualFinSF 0.03
898 LowQualFinSF WoodDeckSF -0.03
935 BsmtHalfBath OpenPorchSF -0.03
975 Fireplaces EnclosedPorch -0.02
1110 TotRmsAbvGrd MiscVal 0.02
1007 KitchenAbvGr X3SsnPorch -0.02
867 BsmtHalfBath GarageArea -0.02
1006 BedroomAbvGr X3SsnPorch -0.02
999 X2ndFlrSF X3SsnPorch -0.02
1145 Fireplaces YrSold -0.02
557 X2ndFlrSF BsmtHalfBath -0.02
1098 BsmtUnfSF MiscVal -0.02
731 BsmtHalfBath TotRmsAbvGrd -0.02
752 OverallCond Fireplaces -0.02
1036 BsmtFullBath ScreenPorch 0.02
1104 BsmtFullBath MiscVal -0.02
1073 HalfBath PoolArea 0.02
1149 WoodDeckSF YrSold 0.02
762 LowQualFinSF Fireplaces -0.02
1100 X1stFlrSF MiscVal -0.02
1079 GarageCars PoolArea 0.02
833 BsmtHalfBath GarageCars -0.02
996 BsmtUnfSF X3SsnPorch 0.02
1001 GrLivArea X3SsnPorch 0.02
988 LotArea X3SsnPorch 0.02
1071 BsmtHalfBath PoolArea 0.02
1140 FullBath YrSold -0.02
559 GrLivArea BsmtHalfBath -0.02
993 MasVnrArea X3SsnPorch 0.02
1152 X3SsnPorch YrSold 0.02
1116 OpenPorchSF MiscVal -0.02
1099 TotalBsmtSF MiscVal -0.02
1117 EnclosedPorch MiscVal 0.02
954 LotArea EnclosedPorch -0.02
932 LowQualFinSF OpenPorchSF 0.02
859 BsmtFinSF2 GarageArea -0.02
682 LotArea KitchenAbvGr -0.02
799 BsmtHalfBath GarageYrBlt 0.02
1101 X2ndFlrSF MiscVal 0.02
1078 GarageYrBlt PoolArea 0.02
655 BsmtFinSF2 BedroomAbvGr -0.02
1133 TotalBsmtSF YrSold -0.01
451 BsmtFinSF2 LowQualFinSF 0.01
1075 KitchenAbvGr PoolArea -0.01
1130 BsmtFinSF1 YrSold 0.01
1106 FullBath MiscVal -0.01
1124 LotArea YrSold -0.01
614 LotArea HalfBath 0.01
454 X1stFlrSF LowQualFinSF -0.01
138 LotArea YearBuilt 0.01
172 LotArea YearRemodAdd 0.01
1127 YearBuilt YrSold -0.01
1134 X1stFlrSF YrSold -0.01
650 OverallCond BedroomAbvGr 0.01
1030 BsmtUnfSF ScreenPorch -0.01
629 BsmtHalfBath HalfBath -0.01
550 YearRemodAdd BsmtHalfBath -0.01
1061 MasVnrArea PoolArea 0.01
1009 Fireplaces X3SsnPorch 0.01
1146 GarageYrBlt YrSold -0.01
953 LotFrontage EnclosedPorch 0.01
1153 ScreenPorch YrSold 0.01
413 YearBuilt X2ndFlrSF 0.01
1094 YearRemodAdd MiscVal -0.01
1141 HalfBath YrSold -0.01
1151 EnclosedPorch YrSold -0.01
802 BedroomAbvGr GarageYrBlt -0.01
485 BsmtFinSF2 GrLivArea -0.01
1115 WoodDeckSF MiscVal -0.01
967 GrLivArea EnclosedPorch 0.01
969 BsmtHalfBath EnclosedPorch -0.01
1129 MasVnrArea YrSold -0.01
1038 FullBath ScreenPorch -0.01
1084 X3SsnPorch PoolArea -0.01
1108 BedroomAbvGr MiscVal 0.01
694 LowQualFinSF KitchenAbvGr 0.01
1123 LotFrontage YrSold 0.01
1105 BsmtHalfBath MiscVal -0.01
545 LotFrontage BsmtHalfBath -0.01
1008 TotRmsAbvGrd X3SsnPorch -0.01
1112 GarageYrBlt MiscVal -0.01
786 OverallCond GarageYrBlt -0.01
681 LotFrontage KitchenAbvGr -0.01
1014 OpenPorchSF X3SsnPorch -0.01
558 LowQualFinSF BsmtHalfBath -0.01
1060 YearRemodAdd PoolArea 0.01
104 LotArea OverallCond -0.01
894 BsmtUnfSF WoodDeckSF -0.01
1005 HalfBath X3SsnPorch 0.00
1059 YearBuilt PoolArea 0.00
1097 BsmtFinSF2 MiscVal 0.00
1155 MiscVal YrSold 0.00
444 LotArea LowQualFinSF 0.00
418 BsmtUnfSF X2ndFlrSF 0.00
1000 LowQualFinSF X3SsnPorch 0.00
620 BsmtFinSF1 HalfBath 0.00
974 TotRmsAbvGrd EnclosedPorch 0.00
1102 LowQualFinSF MiscVal 0.00
1096 BsmtFinSF1 MiscVal 0.00
1089 LotFrontage MiscVal 0.00
888 OverallCond WoodDeckSF 0.00
927 BsmtFinSF2 OpenPorchSF 0.00
308 LotArea BsmtUnfSF 0.00
962 BsmtUnfSF EnclosedPorch 0.00
1103 GrLivArea MiscVal 0.00
1058 OverallCond PoolArea 0.00
556 X1stFlrSF BsmtHalfBath 0.00
1111 Fireplaces MiscVal 0.00
1107 HalfBath MiscVal 0.00
592 LowQualFinSF FullBath 0.00
1118 X3SsnPorch MiscVal 0.00
555 TotalBsmtSF BsmtHalfBath 0.00
1002 BsmtFullBath X3SsnPorch 0.00
# Correlation Funnel
set.seed(123)
raw_data_v2 <- raw_data
raw_data_v2 <- na.omit(raw_data_v2)
raw_data_binarized_tbl <- raw_data_v2 %>%
    binarize(n_bins = 4, thresh_infreq = 0.01)
raw_data_correlated_tbl <- raw_data_binarized_tbl %>%
    correlationfunnel::correlate(target = SalePrice__213497.5_Inf)
raw_data_correlated_tbl %>%
    correlationfunnel::plot_correlation_funnel(interactive = FALSE)
## Warning: ggrepel: 93 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Two-way Cross-Tabulations - Categorical Variables:

# Comparing categorical features with the target variable - Chi-square

# got this piece of code from the internet
# Loop through categorical variables, and apply chi square test for significance
results <- map_dfr(categorical_predictors, function(var) {
  tibble(
    variable = var,
    chisq_p_value = chisq.test(table(raw_data[[var]], raw_data$SalePrice))$p.value
  )
})
results %>%
  arrange(desc(chisq_p_value)) %>%
  kbl() %>%
  kable_styling(full_width = TRUE)
variable chisq_p_value
Utilities 1.0000000
Condition1 1.0000000
RoofStyle 1.0000000
RoofMatl 1.0000000
MiscFeature 1.0000000
Fence 1.0000000
GarageCond 1.0000000
BldgType 0.9999861
Exterior1st 0.9999839
HeatingQC 0.9995947
MSSubClass 0.9928572
PavedDrive 0.9912800
Alley 0.9637568
BsmtFinType2 0.9480709
Exterior2nd 0.8469190
HouseStyle 0.6482615
BsmtFinType1 0.2129277
MoSold 0.1417502
LandSlope 0.1050864
LandContour 0.0867465
Condition2 0.0759864
GarageType 0.0647807
Electrical 0.0556880
GarageQual 0.0501487
LotConfig 0.0458062
CentralAir 0.0000123
Foundation 0.0000097
FireplaceQu 0.0000000
Neighborhood 0.0000000
Street 0.0000000
GarageFinish 0.0000000
MSZoning 0.0000000
LotShape 0.0000000
ExterCond 0.0000000
SaleCondition 0.0000000
SaleType 0.0000000
BsmtExposure 0.0000000
Heating 0.0000000
ExterQual NaN
BsmtQual NaN
BsmtCond NaN
KitchenQual NaN
Functional NaN
PoolQC NaN
MasVnrType NaN
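
The NaN results and the many p-values of exactly 1 above are probably a side effect of cross-tabulating each feature against SalePrice, which is continuous and has nearly as many distinct values as there are rows. As a rough sketch of one alternative (not the method used in this report), a Kruskal-Wallis test compares the distribution of a numeric target across the levels of a categorical predictor. The toy data frame, its column values, and the helper `kw_p_value` are made up for illustration.

```r
# Toy data standing in for raw_data; the columns and values are hypothetical
set.seed(123)
toy <- data.frame(
  Quality = factor(rep(c("Low", "Mid", "High"), each = 30)),
  Noise   = factor(sample(c("A", "B"), 90, replace = TRUE))
)
toy$SalePrice <- c(rnorm(30, 120000, 15000),
                   rnorm(30, 180000, 15000),
                   rnorm(30, 260000, 15000))

# Kruskal-Wallis p-value for one categorical predictor against a numeric target
kw_p_value <- function(data, var, target = "SalePrice") {
  groups <- droplevels(as.factor(data[[var]]))  # drop unused levels first
  kruskal.test(data[[target]], groups)$p.value
}

toy_predictors <- c("Quality", "Noise")
kw_results <- data.frame(
  variable   = toy_predictors,
  kw_p_value = sapply(toy_predictors, function(v) kw_p_value(toy, v)),
  row.names  = NULL
)
# Smallest p-values (strongest evidence of a relationship) first
kw_results[order(kw_results$kw_p_value), ]
```

On the real data, the same helper could be mapped over categorical_predictors in place of the chisq.test call, assuming each predictor has at least two observed levels.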
# too hard to read

# Numerical relationships
ggpairs(data_raw_numeric, progress = FALSE)

# Boxplots broken out by feature
# Pick some features that might seem interesting, otherwise way too many graphs

boxplot_predictors <- c("HouseStyle", "ExterCond", "GarageFinish", "Neighborhood", "KitchenQual")
for (col_name in boxplot_predictors) {
  plot_boxplot(raw_data, by = col_name, geom_boxplot_args = list("outlier.color" = "red"))
}

4 Data Preparation

4.1 Fix Missing Values

Neural networks cannot handle missing values, and several random forest implementations in R (for example randomForest with its default na.action) also stop on incomplete data, so let’s fill in the remaining gaps. LotFrontage, which is an “orange/bad” predictor with almost 50% of its values missing, gets a missing indicator plus median imputation. For MasVnrArea, median imputation will be used, and for Electrical and MasVnrType, the mode will be used.

raw_data_transform <- raw_data

# Add missing indicator for LotFrontage
# Create new missing indicator variable
raw_data_transform <- raw_data_transform %>%
  mutate(is_LotFrontage_missing = is.na(LotFrontage))
# impute LotFrontage missing values using the median
raw_data_transform <- raw_data_transform %>%
  mutate(LotFrontage_imputed = coalesce(LotFrontage, median(LotFrontage, na.rm = TRUE)))
raw_data_transform$is_LotFrontage_missing <- as.factor(raw_data_transform$is_LotFrontage_missing)
# Remove original feature
raw_data_transform <- raw_data_transform |>
  dplyr::select(-LotFrontage)


# Fill in NA with mode for categorical and median for numeric
raw_data_transform <- raw_data_transform %>%
   mutate_at(vars(c("MasVnrArea")), ~ifelse(is.na(.), median(., na.rm = TRUE), .)) # this value is 0, so makes sense MasVnrType_mode to be None
# Got these from distributions
Electrical_mode <- "SBrkr"
MasVnrType_mode <- "None"
# as.character() guards against ifelse() returning integer codes if these columns are factors
raw_data_transform$Electrical <- ifelse(is.na(raw_data_transform$Electrical), Electrical_mode, as.character(raw_data_transform$Electrical))
raw_data_transform$MasVnrType <- ifelse(is.na(raw_data_transform$MasVnrType), MasVnrType_mode, as.character(raw_data_transform$MasVnrType))
# convert to factor
raw_data_transform$Electrical <- as.factor(raw_data_transform$Electrical)
raw_data_transform$MasVnrType <- as.factor(raw_data_transform$MasVnrType)

# Check for missing values again
plot_missing(raw_data_transform)
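
The modes above were read off the distribution plots and typed in by hand. As a hedged sketch, they could also be computed from the data. The helpers `stat_mode` and `impute_mode` and the toy vector are hypothetical and not part of the original analysis.

```r
# Most frequent non-missing value of a vector (ties: first in table order)
stat_mode <- function(x) {
  counts <- table(x, useNA = "no")
  names(counts)[which.max(counts)]
}

# Replace NAs with the mode; works on character values so that factor
# inputs are not silently turned into integer codes
impute_mode <- function(x) {
  x_chr <- as.character(x)
  x_chr[is.na(x_chr)] <- stat_mode(x_chr)
  factor(x_chr)
}

# Toy stand-in for a column such as Electrical
toy_electrical <- c("SBrkr", "FuseA", "SBrkr", NA, "SBrkr", "FuseF")
stat_mode(toy_electrical)
impute_mode(toy_electrical)
```

Computing the mode this way keeps the imputation in step with the data if rows are filtered or the dataset is refreshed.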

4.2 Double Check EDA (Correlations and Distributions)

# Plot histograms
plot_histogram(raw_data_transform)

# Categorical Variables
plot_bar(raw_data_transform)

# Correlations
plot_correlation(raw_data_transform)
## 1 features with more than 20 categories ignored!
## Neighborhood: 25 categories

4.3 Transform skewed predictors

Since random forests and neural networks can both handle skewed predictors (although neural networks still need their inputs preprocessed), let’s not apply any transformations such as log or square root.

4.4 Combine features

Let’s do some feature engineering and combine related features into one to simplify the dataset (not strictly necessary for either model, but in some real-world scenarios limiting the number of inputs is useful).

OverallTotalRating, TotalBaths, and TotalSF were created.

# Combine features by adding

# Total Overall Rating - Combine OverallQual and OverallCond by addition
raw_data_transform$OverallTotalRating <- raw_data_transform$OverallQual + raw_data_transform$OverallCond
# Total Baths - Full and Half
raw_data_transform$TotalBaths <- raw_data_transform$BsmtFullBath + raw_data_transform$BsmtHalfBath + raw_data_transform$FullBath + raw_data_transform$HalfBath
# Total SF
raw_data_transform$TotalSF <- raw_data_transform$TotalBsmtSF + raw_data_transform$X1stFlrSF + raw_data_transform$X2ndFlrSF

4.5 Remove Multicollinearity

Random Forest can handle multicollinearity (and neural networks can to some extent), but let’s remove some features to simplify the dataset. With multicollinearity, RF’s feature importances are less informative, because importance gets shared between correlated features and we don’t truly know which one is more impactful.

For GarageCars/GarageArea, let’s remove GarageCars. For TotalBsmtSF/X1stFlrSF, let’s remove TotalBsmtSF. For X2ndFlrSF/GrLivArea, let’s remove GrLivArea. For BedroomAbvGr/TotRmsAbvGrd, let’s remove BedroomAbvGr. For BsmtFinSF1/BsmtFullBath, let’s remove BsmtFullBath. For X2ndFlrSF/TotRmsAbvGrd, let’s remove TotRmsAbvGrd. For X2ndFlrSF/HalfBath, let’s remove X2ndFlrSF.

# Create a dataset that removes multicollinearity

raw_data_transform2 <- raw_data_transform |>
  dplyr::select(-c(GarageCars, TotRmsAbvGrd, TotalBsmtSF, GrLivArea, BedroomAbvGr, BsmtFullBath, X2ndFlrSF))
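
The pairs above were picked by eye from the correlation table. A minimal sketch of how such pairs could be listed programmatically is shown below, using base R only. The helper `high_cor_pairs`, the 0.6 cutoff, and the toy data are assumptions for illustration; the report does not state which cutoff was used.

```r
# List feature pairs whose absolute Pearson correlation exceeds a cutoff
high_cor_pairs <- function(data, cutoff = 0.6) {
  cor_mat <- cor(data, use = "pairwise.complete.obs")
  # Upper triangle only, so each pair appears once and the diagonal is skipped
  idx <- which(upper.tri(cor_mat) & abs(cor_mat) > cutoff, arr.ind = TRUE)
  pairs <- data.frame(
    feature_1   = rownames(cor_mat)[idx[, "row"]],
    feature_2   = colnames(cor_mat)[idx[, "col"]],
    correlation = round(cor_mat[idx], 2)
  )
  pairs[order(-abs(pairs$correlation)), ]
}

# Toy numeric data: GarageCars is built from GarageArea, YrSold is unrelated
set.seed(123)
toy <- data.frame(GarageArea = rnorm(200, 470, 200))
toy$GarageCars <- toy$GarageArea / 250 + rnorm(200, 0, 0.2)
toy$YrSold <- sample(2006:2010, 200, replace = TRUE)

high_cor_pairs(toy, cutoff = 0.6)
```

Which member of each pair to drop remains a judgment call, as in the list above.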

4.6 Prepare for Neural Network

The data preprocessing necessary for neural networks is:

  • convert categorical features into numeric (not factors)
  • remove missing values (already done)
  • scale the features between 0 and 1
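
As a minimal sketch of the last step, min-max scaling maps each numeric column onto the range 0 to 1. The helper `min_max` and the toy values are hypothetical; with a train/test split, the minimum and maximum would normally be taken from the training set only and reused for the test set.

```r
# Rescale a numeric vector to [0, 1]; constant columns are mapped to 0
# to avoid dividing by zero
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy stand-in for a few numeric columns
toy <- data.frame(
  LotArea   = c(8450, 9600, 11250, 9550),
  YearBuilt = c(2003, 1976, 2001, 1915),
  PoolArea  = c(0, 0, 0, 0)
)
toy_scaled <- as.data.frame(lapply(toy, min_max))
summary(toy_scaled)
```
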
# Neural networks work best when the input data is scaled to a narrow range around zero; inputs must be numeric, not factors

# dataset without the new combined features
raw_data_transform_nn1 <- raw_data_transform |>
  dplyr::select(-c(OverallTotalRating, TotalBaths, TotalSF))
# find categorical predictors
raw_data_transform_categorical <- raw_data_transform_nn1 |>
  dplyr::select(where(is.factor) | where(is.character))
raw_data_transform_categorical_predictors <- names(raw_data_transform_categorical)[names(raw_data_transform_categorical) != "SalePrice"]

dummy_formula <- as.formula(paste("~ . -SalePrice")) # Adjust as needed
# categorical dataset
cat_data_raw_nn <- raw_data_transform_nn1[, raw_data_transform_categorical_predictors]
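# use dummyVars to create dummy variables
# note: ordered factors (e.g. LandSlope) are encoded with polynomial contrasts (.L, .Q) rather than 0/1 indicators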
dummies <- dummyVars(~ ., data = cat_data_raw_nn, fullRank = FALSE)
my_data_encoded <- data.frame(predict(dummies, newdata = cat_data_raw_nn))
# numeric columns
numeric_cols <- setdiff(names(raw_data_transform_nn1), c(raw_data_transform_categorical_predictors, "SalePrice"))
# combine back the columns
final_data_for_nn <- cbind(raw_data_transform_nn1[, numeric_cols], my_data_encoded, raw_data_transform_nn1$SalePrice)
# Rename target column
colnames(final_data_for_nn)[ncol(final_data_for_nn)] <- "SalePrice"

head(final_data_for_nn)
##   LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1
## 1    8450           7           5      2003         2003        196        706
## 2    9600           6           8      1976         1976          0        978
## 3   11250           7           5      2001         2002        162        486
## 4    9550           7           5      1915         1970          0        216
## 5   14260           8           5      2000         2000        350        655
## 6   14115           5           5      1993         1995          0        732
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## 1          0       150         856       856       854            0      1710
## 2          0       284        1262      1262         0            0      1262
## 3          0       434         920       920       866            0      1786
## 4          0       540         756       961       756            0      1717
## 5          0       490        1145      1145      1053            0      2198
## 6          0        64         796       796       566            0      1362
##   BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr
## 1            1            0        2        1            3            1
## 2            0            1        2        0            3            1
## 3            1            0        2        1            3            1
## 4            1            0        1        0            3            1
## 5            1            0        2        1            4            1
## 6            1            0        1        1            1            1
##   TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars GarageArea WoodDeckSF
## 1            8          0        2003          2        548          0
## 2            6          1        1976          2        460        298
## 3            6          1        2001          2        608          0
## 4            7          1        1998          3        642          0
## 5            9          1        2000          3        836        192
## 6            5          0        1993          2        480         40
##   OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea MiscVal YrSold
## 1          61             0          0           0        0       0   2008
## 2           0             0          0           0        0       0   2007
## 3          42             0          0           0        0       0   2008
## 4          35           272          0           0        0       0   2006
## 5          84             0          0           0        0       0   2008
## 6          30             0        320           0        0     700   2009
##   LotFrontage_imputed MSSubClass.20 MSSubClass.30 MSSubClass.40 MSSubClass.45
## 1                  65             0             0             0             0
## 2                  80             1             0             0             0
## 3                  68             0             0             0             0
## 4                  60             0             0             0             0
## 5                  84             0             0             0             0
## 6                  85             0             0             0             0
##   MSSubClass.50 MSSubClass.60 MSSubClass.70 MSSubClass.75 MSSubClass.80
## 1             0             1             0             0             0
## 2             0             0             0             0             0
## 3             0             1             0             0             0
## 4             0             0             1             0             0
## 5             0             1             0             0             0
## 6             1             0             0             0             0
##   MSSubClass.85 MSSubClass.90 MSSubClass.120 MSSubClass.160 MSSubClass.180
## 1             0             0              0              0              0
## 2             0             0              0              0              0
## 3             0             0              0              0              0
## 4             0             0              0              0              0
## 5             0             0              0              0              0
## 6             0             0              0              0              0
##   MSSubClass.190 MSZoning.C..all. MSZoning.FV MSZoning.RH MSZoning.RL
## 1              0                0           0           0           1
## 2              0                0           0           0           1
## 3              0                0           0           0           1
## 4              0                0           0           0           1
## 5              0                0           0           0           1
## 6              0                0           0           0           1
##   MSZoning.RM Street.Grvl Street.Pave Alley.Grvl Alley.NoAlley Alley.Pave
## 1           0           0           1          0             1          0
## 2           0           0           1          0             1          0
## 3           0           0           1          0             1          0
## 4           0           0           1          0             1          0
## 5           0           0           1          0             1          0
## 6           0           0           1          0             1          0
##   LotShape.IR1 LotShape.IR2 LotShape.IR3 LotShape.Reg LandContour.Bnk
## 1            0            0            0            1               0
## 2            0            0            0            1               0
## 3            1            0            0            0               0
## 4            1            0            0            0               0
## 5            1            0            0            0               0
## 6            1            0            0            0               0
##   LandContour.HLS LandContour.Low LandContour.Lvl Utilities.AllPub
## 1               0               0               1                1
## 2               0               0               1                1
## 3               0               0               1                1
## 4               0               0               1                1
## 5               0               0               1                1
## 6               0               0               1                1
##   Utilities.NoSeWa LotConfig.Corner LotConfig.CulDSac LotConfig.FR2
## 1                0                0                 0             0
## 2                0                0                 0             1
## 3                0                0                 0             0
## 4                0                1                 0             0
## 5                0                0                 0             1
## 6                0                0                 0             0
##   LotConfig.FR3 LotConfig.Inside LandSlope.L LandSlope.Q Neighborhood.Blmngtn
## 1             0                1   0.7071068   0.4082483                    0
## 2             0                0   0.7071068   0.4082483                    0
## 3             0                1   0.7071068   0.4082483                    0
## 4             0                0   0.7071068   0.4082483                    0
## 5             0                0   0.7071068   0.4082483                    0
## 6             0                1   0.7071068   0.4082483                    0
##   Neighborhood.Blueste Neighborhood.BrDale Neighborhood.BrkSide
## 1                    0                   0                    0
## 2                    0                   0                    0
## 3                    0                   0                    0
## 4                    0                   0                    0
## 5                    0                   0                    0
## 6                    0                   0                    0
##   Neighborhood.ClearCr Neighborhood.CollgCr Neighborhood.Crawfor
## 1                    0                    1                    0
## 2                    0                    0                    0
## 3                    0                    1                    0
## 4                    0                    0                    1
## 5                    0                    0                    0
## 6                    0                    0                    0
##   Neighborhood.Edwards Neighborhood.Gilbert Neighborhood.IDOTRR
## 1                    0                    0                   0
## 2                    0                    0                   0
## 3                    0                    0                   0
## 4                    0                    0                   0
## 5                    0                    0                   0
## 6                    0                    0                   0
##   Neighborhood.MeadowV Neighborhood.Mitchel Neighborhood.NAmes
## 1                    0                    0                  0
## 2                    0                    0                  0
## 3                    0                    0                  0
## 4                    0                    0                  0
## 5                    0                    0                  0
## 6                    0                    1                  0
##   Neighborhood.NoRidge Neighborhood.NPkVill Neighborhood.NridgHt
## 1                    0                    0                    0
## 2                    0                    0                    0
## 3                    0                    0                    0
## 4                    0                    0                    0
## 5                    1                    0                    0
## 6                    0                    0                    0
##   Neighborhood.NWAmes Neighborhood.OldTown Neighborhood.Sawyer
## 1                   0                    0                   0
## 2                   0                    0                   0
## 3                   0                    0                   0
## 4                   0                    0                   0
## 5                   0                    0                   0
## 6                   0                    0                   0
##   Neighborhood.SawyerW Neighborhood.Somerst Neighborhood.StoneBr
## 1                    0                    0                    0
## 2                    0                    0                    0
## 3                    0                    0                    0
## 4                    0                    0                    0
## 5                    0                    0                    0
## 6                    0                    0                    0
##   Neighborhood.SWISU Neighborhood.Timber Neighborhood.Veenker Condition1.Artery
## 1                  0                   0                    0                 0
## 2                  0                   0                    1                 0
## 3                  0                   0                    0                 0
## 4                  0                   0                    0                 0
## 5                  0                   0                    0                 0
## 6                  0                   0                    0                 0
##   Condition1.Feedr Condition1.Norm Condition1.PosA Condition1.PosN
## 1                0               1               0               0
## 2                1               0               0               0
## 3                0               1               0               0
## 4                0               1               0               0
## 5                0               1               0               0
## 6                0               1               0               0
##   Condition1.RRAe Condition1.RRAn Condition1.RRNe Condition1.RRNn
## 1               0               0               0               0
## 2               0               0               0               0
## 3               0               0               0               0
## 4               0               0               0               0
## 5               0               0               0               0
## 6               0               0               0               0
##   Condition2.Artery Condition2.Feedr Condition2.Norm Condition2.PosA
## 1                 0                0               1               0
## 2                 0                0               1               0
## 3                 0                0               1               0
## 4                 0                0               1               0
## 5                 0                0               1               0
## 6                 0                0               1               0
##   Condition2.PosN Condition2.RRAe Condition2.RRAn Condition2.RRNn BldgType.1Fam
## 1               0               0               0               0             1
## 2               0               0               0               0             1
## 3               0               0               0               0             1
## 4               0               0               0               0             1
## 5               0               0               0               0             1
## 6               0               0               0               0             1
##   BldgType.2fmCon BldgType.Duplex BldgType.Twnhs BldgType.TwnhsE
## 1               0               0              0               0
## 2               0               0              0               0
## 3               0               0              0               0
## 4               0               0              0               0
## 5               0               0              0               0
## 6               0               0              0               0
##   HouseStyle.1.5Fin HouseStyle.1.5Unf HouseStyle.1Story HouseStyle.2.5Fin
## 1                 0                 0                 0                 0
## 2                 0                 0                 1                 0
## 3                 0                 0                 0                 0
## 4                 0                 0                 0                 0
## 5                 0                 0                 0                 0
## 6                 1                 0                 0                 0
##   HouseStyle.2.5Unf HouseStyle.2Story HouseStyle.SFoyer HouseStyle.SLvl
## 1                 0                 1                 0               0
## 2                 0                 0                 0               0
## 3                 0                 1                 0               0
## 4                 0                 1                 0               0
## 5                 0                 1                 0               0
## 6                 0                 0                 0               0
##   RoofStyle.Flat RoofStyle.Gable RoofStyle.Gambrel RoofStyle.Hip
## 1              0               1                 0             0
## 2              0               1                 0             0
## 3              0               1                 0             0
## 4              0               1                 0             0
## 5              0               1                 0             0
## 6              0               1                 0             0
##   RoofStyle.Mansard RoofStyle.Shed RoofMatl.ClyTile RoofMatl.CompShg
## 1                 0              0                0                1
## 2                 0              0                0                1
## 3                 0              0                0                1
## 4                 0              0                0                1
## 5                 0              0                0                1
## 6                 0              0                0                1
##   RoofMatl.Membran RoofMatl.Metal RoofMatl.Roll RoofMatl.Tar.Grv
## 1                0              0             0                0
## 2                0              0             0                0
## 3                0              0             0                0
## 4                0              0             0                0
## 5                0              0             0                0
## 6                0              0             0                0
##   RoofMatl.WdShake RoofMatl.WdShngl Exterior1st.AsbShng Exterior1st.AsphShn
## 1                0                0                   0                   0
## 2                0                0                   0                   0
## 3                0                0                   0                   0
## 4                0                0                   0                   0
## 5                0                0                   0                   0
## 6                0                0                   0                   0
##   Exterior1st.BrkComm Exterior1st.BrkFace Exterior1st.CBlock
## 1                   0                   0                  0
## 2                   0                   0                  0
## 3                   0                   0                  0
## 4                   0                   0                  0
## 5                   0                   0                  0
## 6                   0                   0                  0
##   Exterior1st.CemntBd Exterior1st.HdBoard Exterior1st.ImStucc
## 1                   0                   0                   0
## 2                   0                   0                   0
## 3                   0                   0                   0
## 4                   0                   0                   0
## 5                   0                   0                   0
## 6                   0                   0                   0
##   Exterior1st.MetalSd Exterior1st.Plywood Exterior1st.Stone Exterior1st.Stucco
## 1                   0                   0                 0                  0
## 2                   1                   0                 0                  0
## 3                   0                   0                 0                  0
## 4                   0                   0                 0                  0
## 5                   0                   0                 0                  0
## 6                   0                   0                 0                  0
##   Exterior1st.VinylSd Exterior1st.Wd.Sdng Exterior1st.WdShing
## 1                   1                   0                   0
## 2                   0                   0                   0
## 3                   1                   0                   0
## 4                   0                   1                   0
## 5                   1                   0                   0
## 6                   1                   0                   0
##   Exterior2nd.AsbShng Exterior2nd.AsphShn Exterior2nd.Brk.Cmn
## 1                   0                   0                   0
## 2                   0                   0                   0
## 3                   0                   0                   0
## 4                   0                   0                   0
## 5                   0                   0                   0
## 6                   0                   0                   0
##   Exterior2nd.BrkFace Exterior2nd.CBlock Exterior2nd.CmentBd
## 1                   0                  0                   0
## 2                   0                  0                   0
## 3                   0                  0                   0
## 4                   0                  0                   0
## 5                   0                  0                   0
## 6                   0                  0                   0
##   Exterior2nd.HdBoard Exterior2nd.ImStucc Exterior2nd.MetalSd Exterior2nd.Other
## 1                   0                   0                   0                 0
## 2                   0                   0                   1                 0
## 3                   0                   0                   0                 0
## 4                   0                   0                   0                 0
## 5                   0                   0                   0                 0
## 6                   0                   0                   0                 0
##   Exterior2nd.Plywood Exterior2nd.Stone Exterior2nd.Stucco Exterior2nd.VinylSd
## 1                   0                 0                  0                   1
## 2                   0                 0                  0                   0
## 3                   0                 0                  0                   1
## 4                   0                 0                  0                   0
## 5                   0                 0                  0                   1
## 6                   0                 0                  0                   1
##   Exterior2nd.Wd.Sdng Exterior2nd.Wd.Shng MasVnrType.BrkCmn MasVnrType.BrkFace
## 1                   0                   0                 0                  1
## 2                   0                   0                 0                  0
## 3                   0                   0                 0                  1
## 4                   0                   1                 0                  0
## 5                   0                   0                 0                  1
## 6                   0                   0                 0                  0
##   MasVnrType.None MasVnrType.Stone   ExterQual.L ExterQual.Q   ExterQual.C
## 1               0                0  3.162278e-01  -0.2672612 -6.324555e-01
## 2               1                0 -1.481950e-18  -0.5345225  1.786843e-17
## 3               0                0  3.162278e-01  -0.2672612 -6.324555e-01
## 4               1                0 -1.481950e-18  -0.5345225  1.786843e-17
## 5               0                0  3.162278e-01  -0.2672612 -6.324555e-01
## 6               1                0 -1.481950e-18  -0.5345225  1.786843e-17
##   ExterQual.4  ExterCond.L ExterCond.Q  ExterCond.C ExterCond.4
## 1  -0.4780914 -1.48195e-18  -0.5345225 1.786843e-17   0.7171372
## 2   0.7171372 -1.48195e-18  -0.5345225 1.786843e-17   0.7171372
## 3  -0.4780914 -1.48195e-18  -0.5345225 1.786843e-17   0.7171372
## 4   0.7171372 -1.48195e-18  -0.5345225 1.786843e-17   0.7171372
## 5  -0.4780914 -1.48195e-18  -0.5345225 1.786843e-17   0.7171372
## 6   0.7171372 -1.48195e-18  -0.5345225 1.786843e-17   0.7171372
##   Foundation.BrkTil Foundation.CBlock Foundation.PConc Foundation.Slab
## 1                 0                 0                1               0
## 2                 0                 1                0               0
## 3                 0                 0                1               0
## 4                 1                 0                0               0
## 5                 0                 0                1               0
## 6                 0                 0                0               0
##   Foundation.Stone Foundation.Wood BsmtQual.L BsmtQual.Q BsmtQual.C BsmtQual.4
## 1                0               0  0.3585686 -0.1091089 -0.5217492 -0.5669467
## 2                0               0  0.3585686 -0.1091089 -0.5217492 -0.5669467
## 3                0               0  0.3585686 -0.1091089 -0.5217492 -0.5669467
## 4                0               0  0.1195229 -0.4364358 -0.2981424  0.3779645
## 5                0               0  0.3585686 -0.1091089 -0.5217492 -0.5669467
## 6                0               1  0.3585686 -0.1091089 -0.5217492 -0.5669467
##   BsmtQual.5 BsmtCond.L BsmtCond.Q BsmtCond.C BsmtCond.4 BsmtCond.5
## 1 -0.3149704  0.1195229 -0.4364358 -0.2981424  0.3779645  0.6299408
## 2 -0.3149704  0.1195229 -0.4364358 -0.2981424  0.3779645  0.6299408
## 3 -0.3149704  0.1195229 -0.4364358 -0.2981424  0.3779645  0.6299408
## 4  0.6299408  0.3585686 -0.1091089 -0.5217492 -0.5669467 -0.3149704
## 5 -0.3149704  0.1195229 -0.4364358 -0.2981424  0.3779645  0.6299408
## 6 -0.3149704  0.1195229 -0.4364358 -0.2981424  0.3779645  0.6299408
##   BsmtExposure.L BsmtExposure.Q BsmtExposure.C BsmtExposure.4 BsmtFinType1.L
## 1  -3.162278e-01     -0.2672612   6.324555e-01     -0.4780914      0.5669467
## 2   6.324555e-01      0.5345225   3.162278e-01      0.1195229      0.3779645
## 3  -1.481950e-18     -0.5345225   1.786843e-17      0.7171372      0.5669467
## 4  -3.162278e-01     -0.2672612   6.324555e-01     -0.4780914      0.3779645
## 5   3.162278e-01     -0.2672612  -6.324555e-01     -0.4780914      0.5669467
## 6  -3.162278e-01     -0.2672612   6.324555e-01     -0.4780914      0.5669467
##   BsmtFinType1.Q BsmtFinType1.C BsmtFinType1.4 BsmtFinType1.5 BsmtFinType1.6
## 1   5.455447e-01      0.4082483      0.2417469      0.1091089     0.03289758
## 2  -5.621884e-17     -0.4082483     -0.5640761     -0.4364358    -0.19738551
## 3   5.455447e-01      0.4082483      0.2417469      0.1091089     0.03289758
## 4  -5.621884e-17     -0.4082483     -0.5640761     -0.4364358    -0.19738551
## 5   5.455447e-01      0.4082483      0.2417469      0.1091089     0.03289758
## 6   5.455447e-01      0.4082483      0.2417469      0.1091089     0.03289758
##   BsmtFinType2.L BsmtFinType2.Q BsmtFinType2.C BsmtFinType2.4 BsmtFinType2.5
## 1     -0.3779645   8.914347e-17      0.4082483     -0.5640761      0.4364358
## 2     -0.3779645   8.914347e-17      0.4082483     -0.5640761      0.4364358
## 3     -0.3779645   8.914347e-17      0.4082483     -0.5640761      0.4364358
## 4     -0.3779645   8.914347e-17      0.4082483     -0.5640761      0.4364358
## 5     -0.3779645   8.914347e-17      0.4082483     -0.5640761      0.4364358
## 6     -0.3779645   8.914347e-17      0.4082483     -0.5640761      0.4364358
##   BsmtFinType2.6 Heating.Floor Heating.GasA Heating.GasW Heating.Grav
## 1     -0.1973855             0            1            0            0
## 2     -0.1973855             0            1            0            0
## 3     -0.1973855             0            1            0            0
## 4     -0.1973855             0            1            0            0
## 5     -0.1973855             0            1            0            0
## 6     -0.1973855             0            1            0            0
##   Heating.OthW Heating.Wall HeatingQC.L HeatingQC.Q HeatingQC.C HeatingQC.4
## 1            0            0   0.6324555   0.5345225   0.3162278   0.1195229
## 2            0            0   0.6324555   0.5345225   0.3162278   0.1195229
## 3            0            0   0.6324555   0.5345225   0.3162278   0.1195229
## 4            0            0   0.3162278  -0.2672612  -0.6324555  -0.4780914
## 5            0            0   0.6324555   0.5345225   0.3162278   0.1195229
## 6            0            0   0.6324555   0.5345225   0.3162278   0.1195229
##   CentralAir.N CentralAir.Y Electrical.FuseA Electrical.FuseF Electrical.FuseP
## 1            0            1                0                0                0
## 2            0            1                0                0                0
## 3            0            1                0                0                0
## 4            0            1                0                0                0
## 5            0            1                0                0                0
## 6            0            1                0                0                0
##   Electrical.Mix Electrical.SBrkr KitchenQual.L KitchenQual.Q KitchenQual.C
## 1              0                1  3.162278e-01    -0.2672612 -6.324555e-01
## 2              0                1 -1.481950e-18    -0.5345225  1.786843e-17
## 3              0                1  3.162278e-01    -0.2672612 -6.324555e-01
## 4              0                1  3.162278e-01    -0.2672612 -6.324555e-01
## 5              0                1  3.162278e-01    -0.2672612 -6.324555e-01
## 6              0                1 -1.481950e-18    -0.5345225  1.786843e-17
##   KitchenQual.4 Functional.L Functional.Q Functional.C Functional.4
## 1    -0.4780914    0.5400617    0.5400617    0.4308202     0.282038
## 2     0.7171372    0.5400617    0.5400617    0.4308202     0.282038
## 3    -0.4780914    0.5400617    0.5400617    0.4308202     0.282038
## 4    -0.4780914    0.5400617    0.5400617    0.4308202     0.282038
## 5    -0.4780914    0.5400617    0.5400617    0.4308202     0.282038
## 6     0.7171372    0.5400617    0.5400617    0.4308202     0.282038
##   Functional.5 Functional.6 Functional.7 FireplaceQu.L FireplaceQu.Q
## 1    0.1497862   0.06154575   0.01706972    -0.5976143     0.5455447
## 2    0.1497862   0.06154575   0.01706972     0.1195229    -0.4364358
## 3    0.1497862   0.06154575   0.01706972     0.1195229    -0.4364358
## 4    0.1497862   0.06154575   0.01706972     0.3585686    -0.1091089
## 5    0.1497862   0.06154575   0.01706972     0.1195229    -0.4364358
## 6    0.1497862   0.06154575   0.01706972    -0.5976143     0.5455447
##   FireplaceQu.C FireplaceQu.4 FireplaceQu.5 GarageType.2Types GarageType.Attchd
## 1    -0.3726780     0.1889822   -0.06299408                 0                 1
## 2    -0.2981424     0.3779645    0.62994079                 0                 1
## 3    -0.2981424     0.3779645    0.62994079                 0                 1
## 4    -0.5217492    -0.5669467   -0.31497039                 0                 0
## 5    -0.2981424     0.3779645    0.62994079                 0                 1
## 6    -0.3726780     0.1889822   -0.06299408                 0                 1
##   GarageType.Basment GarageType.BuiltIn GarageType.CarPort GarageType.Detchd
## 1                  0                  0                  0                 0
## 2                  0                  0                  0                 0
## 3                  0                  0                  0                 0
## 4                  0                  0                  0                 1
## 5                  0                  0                  0                 0
## 6                  0                  0                  0                 0
##   GarageType.NoGarage GarageFinish.L GarageFinish.Q GarageFinish.C GarageQual.L
## 1                   0      0.2236068           -0.5     -0.6708204    0.1195229
## 2                   0      0.2236068           -0.5     -0.6708204    0.1195229
## 3                   0      0.2236068           -0.5     -0.6708204    0.1195229
## 4                   0     -0.2236068           -0.5      0.6708204    0.1195229
## 5                   0      0.2236068           -0.5     -0.6708204    0.1195229
## 6                   0     -0.2236068           -0.5      0.6708204    0.1195229
##   GarageQual.Q GarageQual.C GarageQual.4 GarageQual.5 GarageCond.L GarageCond.Q
## 1   -0.4364358   -0.2981424    0.3779645    0.6299408    0.1195229   -0.4364358
## 2   -0.4364358   -0.2981424    0.3779645    0.6299408    0.1195229   -0.4364358
## 3   -0.4364358   -0.2981424    0.3779645    0.6299408    0.1195229   -0.4364358
## 4   -0.4364358   -0.2981424    0.3779645    0.6299408    0.1195229   -0.4364358
## 5   -0.4364358   -0.2981424    0.3779645    0.6299408    0.1195229   -0.4364358
## 6   -0.4364358   -0.2981424    0.3779645    0.6299408    0.1195229   -0.4364358
##   GarageCond.C GarageCond.4 GarageCond.5 PavedDrive.N PavedDrive.P PavedDrive.Y
## 1   -0.2981424    0.3779645    0.6299408            0            0            1
## 2   -0.2981424    0.3779645    0.6299408            0            0            1
## 3   -0.2981424    0.3779645    0.6299408            0            0            1
## 4   -0.2981424    0.3779645    0.6299408            0            0            1
## 5   -0.2981424    0.3779645    0.6299408            0            0            1
## 6   -0.2981424    0.3779645    0.6299408            0            0            1
##     PoolQC.L  PoolQC.Q   PoolQC.C  PoolQC.4    Fence.L    Fence.Q    Fence.C
## 1 -0.6324555 0.5345225 -0.3162278 0.1195229 -0.6324555  0.5345225 -0.3162278
## 2 -0.6324555 0.5345225 -0.3162278 0.1195229 -0.6324555  0.5345225 -0.3162278
## 3 -0.6324555 0.5345225 -0.3162278 0.1195229 -0.6324555  0.5345225 -0.3162278
## 4 -0.6324555 0.5345225 -0.3162278 0.1195229 -0.6324555  0.5345225 -0.3162278
## 5 -0.6324555 0.5345225 -0.3162278 0.1195229 -0.6324555  0.5345225 -0.3162278
## 6 -0.6324555 0.5345225 -0.3162278 0.1195229  0.3162278 -0.2672612 -0.6324555
##      Fence.4 MiscFeature.Gar2 MiscFeature.None MiscFeature.Othr
## 1  0.1195229                0                1                0
## 2  0.1195229                0                1                0
## 3  0.1195229                0                1                0
## 4  0.1195229                0                1                0
## 5  0.1195229                0                1                0
## 6 -0.4780914                0                0                0
##   MiscFeature.Shed MiscFeature.TenC MoSold.1 MoSold.2 MoSold.3 MoSold.4
## 1                0                0        0        1        0        0
## 2                0                0        0        0        0        0
## 3                0                0        0        0        0        0
## 4                0                0        0        1        0        0
## 5                0                0        0        0        0        0
## 6                1                0        0        0        0        0
##   MoSold.5 MoSold.6 MoSold.7 MoSold.8 MoSold.9 MoSold.10 MoSold.11 MoSold.12
## 1        0        0        0        0        0         0         0         0
## 2        1        0        0        0        0         0         0         0
## 3        0        0        0        0        1         0         0         0
## 4        0        0        0        0        0         0         0         0
## 5        0        0        0        0        0         0         0         1
## 6        0        0        0        0        0         1         0         0
##   SaleType.COD SaleType.Con SaleType.ConLD SaleType.ConLI SaleType.ConLw
## 1            0            0              0              0              0
## 2            0            0              0              0              0
## 3            0            0              0              0              0
## 4            0            0              0              0              0
## 5            0            0              0              0              0
## 6            0            0              0              0              0
##   SaleType.CWD SaleType.New SaleType.Oth SaleType.WD SaleCondition.Abnorml
## 1            0            0            0           1                     0
## 2            0            0            0           1                     0
## 3            0            0            0           1                     0
## 4            0            0            0           1                     1
## 5            0            0            0           1                     0
## 6            0            0            0           1                     0
##   SaleCondition.AdjLand SaleCondition.Alloca SaleCondition.Family
## 1                     0                    0                    0
## 2                     0                    0                    0
## 3                     0                    0                    0
## 4                     0                    0                    0
## 5                     0                    0                    0
## 6                     0                    0                    0
##   SaleCondition.Normal SaleCondition.Partial is_LotFrontage_missing.FALSE
## 1                    1                     0                            1
## 2                    1                     0                            1
## 3                    1                     0                            1
## 4                    0                     0                            1
## 5                    1                     0                            1
## 6                    1                     0                            1
##   is_LotFrontage_missing.TRUE SalePrice
## 1                           0    208500
## 2                           0    181500
## 3                           0    223500
## 4                           0    140000
## 5                           0    250000
## 6                           0    143000
# inspired from textbook
normalize <- function(x) {
    return((x - min(x)) / (max(x) - min(x)))
}
# scale the features
final_data_for_nn_norm <- as.data.frame(lapply(final_data_for_nn, normalize))
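
A minimal sketch of what the min-max scaling above does, and how scaled values could be mapped back to dollars. If SalePrice is among the scaled columns, the network's predictions come out on the 0-1 scale and need this reverse step before computing price-based metrics. The denormalize() helper is a hypothetical name, not part of the original code; the six prices are the SalePrice values from the head() output above.

```r
# Min-max scaling to [0, 1], mirroring the normalize() helper above
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical helper: map scaled values back to the original units
denormalize <- function(x_scaled, original) {
  x_scaled * (max(original) - min(original)) + min(original)
}

# First six SalePrice values from the head() output above
sale_price <- c(208500, 181500, 223500, 140000, 250000, 143000)
scaled <- normalize(sale_price)
restored <- denormalize(scaled, sale_price)

range(scaled)                     # smallest value maps to 0, largest to 1
all.equal(restored, sale_price)   # TRUE
```

One caveat: a column with a single constant value (for example a dummy column that is all zeros) gives 0 / 0 = NaN under this formula, so such columns are worth checking before training.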

4.7 Split Data for Validation

To evaluate the models objectively and detect overfitting, I split each dataset into training (80%) and testing (20%) sets using a consistent random seed (set.seed(123)) for reproducibility. The same row indices are reused for every dataset so that the splits line up.

# Split data for validation
set.seed(123)

train_idx <- createDataPartition(y = raw_data$SalePrice, p = 0.8, list = FALSE)

# Create train/test splits for each dataset
train <- raw_data[train_idx, ]
test <- raw_data[-train_idx, ]

# Transformed predictors dataset
train_transformed <- raw_data_transform[train_idx, ]
test_transformed <- raw_data_transform[-train_idx, ]

# Transformed predictors dataset with high multicollinearity predictors removed
train_transformed2 <- raw_data_transform2[train_idx, ]
test_transformed2 <- raw_data_transform2[-train_idx, ]


# NN 1 dataset
train_transformed_nn1 <- final_data_for_nn_norm[train_idx, ]
test_transformed_nn1 <- final_data_for_nn_norm[-train_idx, ]

PCA

I was going to try PCA since I have a lot of features, including ones that show multicollinearity. However, RF and NN are both fairly robust to correlated features and PCA is not required for either, so I decided to skip it.

# ## Visualize principal component analysis
# raw_data_transform_pca <- raw_data_transform |>
#   dplyr::select(-LotFrontage)
# plot_prcomp(raw_data_transform_pca, maxcat = 10L)

5 Build Models

NOTE: The Kaggle competition uses RMSE as the evaluation metric, computed on the log of the actual and predicted values to reduce the impact of large errors. Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.
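
To make the note concrete, here is a small sketch of RMSE on the log scale. The rmse_log() helper name and the two prices are made up for illustration. The evaluation code later in this report uses log(x + 1) instead of log(x), which is practically identical at house-price magnitudes.

```r
# Root mean squared error on the log scale (hypothetical helper name)
rmse_log <- function(pred, obs) {
  sqrt(mean((log(pred) - log(obs))^2))
}

# Made-up prices: each prediction is 10% above the actual value
obs  <- c(100000, 500000)
pred <- obs * 1.10

rmse_log(pred, obs)          # both houses contribute equally: log(1.10)
sqrt(mean((pred - obs)^2))   # raw RMSE is dominated by the expensive house
```

On the log scale a 10% miss costs the same for a cheap house as for an expensive one, whereas on the raw scale the 50,000 error on the expensive house outweighs the 10,000 error on the cheap one.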

Algorithm Selection

I chose random forest as the algorithm from weeks 1-10 and the neural network as the algorithm from weeks 11-15. Random forest is an ensemble method that uses a “team” of decision trees.

Random forest would be good for this dataset because:

  • Little data preprocessing is necessary - it can handle both categorical and continuous features (which this dataset has)
  • Can handle noisy data (and, depending on the implementation, missing data as well)
  • Can capture nonlinear, complex relationships
  • Can easily be used with many features (80+ features in this dataset) or observations
  • Can extract important features for business insights
  • Has been shown to be a robust model with high accuracy, as it uses an ensemble of decision trees
  • Less prone to overfitting than a single decision tree
  • Training with the ranger package is fairly fast

I have also never used random forest for regression, so this would be a good learning experience.

Potential issues:

  • Not easily interpretable
  • Has relatively few hyperparameters, so tuning offers limited gains in performance
  • Can have high computational cost
  • Predictions are bound by the range of the target values seen in the training data

Neural network is another good option. A neural network contains an input layer, one or more hidden layers, and an output layer, loosely mimicking a human brain. Raw input enters through the input layer and is processed by the hidden layers, and the output layer produces the result. The network learns by optimizing its many parameters (weights) through gradient descent and backpropagation. The many weights and the nonlinear activation functions allow the model to capture complex patterns and relationships, so it is able to fit the training data well.
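
As a rough illustration of the layer structure described above, here is a forward pass through a network with one hidden layer. All weights, biases and inputs are made-up numbers, not values fitted to the housing data, and training (gradient descent with backpropagation) is not shown.

```r
# Illustrative one-hidden-layer network; all weights are made up, not fitted
sigmoid <- function(z) 1 / (1 + exp(-z))

forward <- function(x, W1, b1, W2, b2) {
  hidden <- sigmoid(W1 %*% x + b1)   # hidden layer: weighted sum + nonlinearity
  as.numeric(W2 %*% hidden + b2)     # linear output unit for regression
}

x  <- c(0.5, 0.2, 0.8)               # three scaled input features
W1 <- matrix(c(0.1, -0.4,  0.3,
               0.7,  0.2, -0.5), nrow = 2, byrow = TRUE)  # 2 hidden units
b1 <- c(0, 0.1)
W2 <- matrix(c(0.6, -0.3), nrow = 1)
b2 <- 0.05

prediction <- forward(x, W1, b1, W2, b2)
prediction
```

Training consists of repeatedly nudging W1, b1, W2 and b2 in the direction that reduces the prediction error; the nonlinear sigmoid in the hidden layer is what lets the network represent more than a straight-line relationship.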

Pros of NN:

  • Can capture nonlinear, complex relationships
  • Good for large datasets
  • Has been shown to be a robust model with high accuracy
  • Can handle categorical and continuous features (though categorical features must first be encoded as numeric)
  • Automated feature engineering (learns relationships between features on its own)

I have also never used a neural network before, so this would be a good learning experience.

Potential issues:

  • Not very interpretable due to “black box” nature (harder for business decisions)
  • Higher computational cost
  • Difficult to tune
  • Data preprocessing needed for categorical features - increases the number of features
  • Can’t have missing values
  • Features must be scaled
  • Requires large amounts of data to accurately detect patterns
  • Can be prone to overfitting
  • Not always the best choice for structured, tabular datasets; often better suited to unstructured data

Random Forest

ranger function arguments:

  • formula Object of class formula or character describing the model to fit. Interaction terms supported only for numerical variables.
  • data Training data of class data.frame, matrix, dgCMatrix (Matrix) or gwaa.data (GenABEL).
  • num.trees Number of trees.
  • mtry Number of variables to possibly split at in each node. Default is the (rounded down) square root of the number of variables. Alternatively, a single argument function returning an integer, given the number of independent variables.
  • importance Variable importance mode, one of 'none', 'impurity', 'impurity_corrected', 'permutation'. The 'impurity' measure is the Gini index for classification, the variance of the responses for regression and the sum of test statistics (see splitrule) for survival.
  • write.forest Save ranger.forest object, required for prediction. Set to FALSE to reduce memory usage if no prediction intended.
  • probability Grow a probability forest as in Malley et al. (2012).
  • min.node.size Minimal node size to split at. Default 1 for classification, 5 for regression, 3 for survival, and 10 for probability. For classification, this can be a vector of class-specific values.
  • min.bucket Minimal terminal node size. No nodes smaller than this value can occur. Default 3 for survival and 1 for all other tree types. For classification, this can be a vector of class-specific values.
  • max.depth Maximal tree depth. A value of NULL or 0 (the default) corresponds to unlimited depth, 1 to tree stumps (1 split per tree).
  • replace Sample with replacement.

RF Model 1: Original Predictors

Experiment:

  • Use the original predictors (except for LotFrontage)
  • Use tuning grid for optimal params:
  • Increasing mtry: A smaller number of variables to possibly split at in each node increases the randomness and decreases the correlation between trees, yet it can decrease the accuracy of individual trees. A smaller mtry increases bias and decreases variance; larger values decrease bias (more complex trees) and increase variance (less randomness, more correlation among trees).
  • Increasing min.node.size: Smaller values allow trees to become more complex and grow deeper, decreasing bias and increasing variance (more prone to overfitting). Larger values make trees less complex and more shallow, increasing bias and lowering variance (less prone to overfitting, but could be too simple).
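
One detail worth keeping in mind when reading the tuning results: a grid built with data.frame() pairs the values row by row, so mtry and min.node.size increase together and their effects cannot be separated. The sketch below contrasts that with expand.grid(), which crosses every value with every other. The object names paired_grid and full_grid are illustrative only.

```r
# Row-wise pairing: 3 combinations, mtry and min.node.size change together
paired_grid <- data.frame(
  .mtry = c(3, 10, 25),
  .splitrule = "variance",
  .min.node.size = c(2, 4, 10)
)

# Full factorial grid: every mtry crossed with every min.node.size
full_grid <- expand.grid(
  .mtry = c(3, 10, 25),
  .splitrule = "variance",
  .min.node.size = c(2, 4, 10),
  stringsAsFactors = FALSE
)

nrow(paired_grid)  # 3
nrow(full_grid)    # 9
```

The full grid takes three times as long to cross-validate, but it shows whether an improvement comes from mtry, from min.node.size, or from both.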

Purpose:

See how well a model does with all the original predictors, and varied mtry and min node size values (tuning goals above).

Result:

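  • Very high R Squared of 0.9699 and low RMSE of 0.0718 (using logs) on the test rows. However, the training output reports 1460 samples, meaning the model was fit on the full raw_data_transform dataset, test rows included, so these test metrics are optimistic. The 5-fold cross-validated results (RMSE 27885.92, R Squared 0.8827, not on the log scale) are the fairer estimate of performance
  • R Squared on the original scale (0.9631) is very similar to the log-scale value (0.9699); RMSE on the original scale is 18869.41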
  • Not surprisingly, OverallQual, ExterQual and GarageArea are the most important predictors - this matches the correlation funnel!
  • Interestingly, YearBuilt ranks #6 in importance - it did not rank especially high in the correlation funnel
  • A potential downside - highly correlated variables such as GarageArea and GarageCars both appear in the top 10 important features, so it is unclear which of the two is actually more important
  • The final values used for the model were mtry = 25, splitrule = variance and min.node.size = 10.
# RF MODEL 1 -- Base model (use original predictors, except for LotFrontage updates)

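# Remove the target and the new combined features created
# NOTE: raw_data_transform is the full dataset (all 1460 rows), so the rows later
# used as test_transformed are also seen during training; train_transformed
# would keep them held out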
raw_data_transform_predictors <- raw_data_transform |>
  dplyr::select(-c(SalePrice, OverallTotalRating, TotalBaths, TotalSF))

# Target variable
raw_data_transform_response <- raw_data_transform$SalePrice

set.seed(123)
# Train the RF model
# ranger is a much faster random forest implementation than randomForest

# Tuning grid: data.frame() pairs the values row by row, so 3 combinations are tried
tuneGrid <- data.frame(
  .mtry = c(3, 10, 25), # increasing mtry
  .splitrule = "variance",
  .min.node.size = c(2, 4, 10) # increasing min node size
)

# Train the model - use cross-validation
rf_model1 <- train(
  x = raw_data_transform_predictors,
  y = raw_data_transform_response,
  method = "ranger",
  trControl = trainControl(
    method = "cv",
    number = 5,
    verboseIter = TRUE
  ),
  metric = "RMSE",
  tuneGrid = tuneGrid,
  # probability = TRUE,
  importance = "impurity"
)
## + Fold1: mtry= 3, splitrule=variance, min.node.size= 2 
## - Fold1: mtry= 3, splitrule=variance, min.node.size= 2 
## + Fold1: mtry=10, splitrule=variance, min.node.size= 4 
## - Fold1: mtry=10, splitrule=variance, min.node.size= 4 
## + Fold1: mtry=25, splitrule=variance, min.node.size=10 
## - Fold1: mtry=25, splitrule=variance, min.node.size=10 
## + Fold2: mtry= 3, splitrule=variance, min.node.size= 2 
## - Fold2: mtry= 3, splitrule=variance, min.node.size= 2 
## + Fold2: mtry=10, splitrule=variance, min.node.size= 4 
## - Fold2: mtry=10, splitrule=variance, min.node.size= 4 
## + Fold2: mtry=25, splitrule=variance, min.node.size=10 
## - Fold2: mtry=25, splitrule=variance, min.node.size=10 
## + Fold3: mtry= 3, splitrule=variance, min.node.size= 2 
## - Fold3: mtry= 3, splitrule=variance, min.node.size= 2 
## + Fold3: mtry=10, splitrule=variance, min.node.size= 4 
## - Fold3: mtry=10, splitrule=variance, min.node.size= 4 
## + Fold3: mtry=25, splitrule=variance, min.node.size=10 
## - Fold3: mtry=25, splitrule=variance, min.node.size=10 
## + Fold4: mtry= 3, splitrule=variance, min.node.size= 2 
## - Fold4: mtry= 3, splitrule=variance, min.node.size= 2 
## + Fold4: mtry=10, splitrule=variance, min.node.size= 4 
## - Fold4: mtry=10, splitrule=variance, min.node.size= 4 
## + Fold4: mtry=25, splitrule=variance, min.node.size=10 
## - Fold4: mtry=25, splitrule=variance, min.node.size=10 
## + Fold5: mtry= 3, splitrule=variance, min.node.size= 2 
## - Fold5: mtry= 3, splitrule=variance, min.node.size= 2 
## + Fold5: mtry=10, splitrule=variance, min.node.size= 4 
## - Fold5: mtry=10, splitrule=variance, min.node.size= 4 
## + Fold5: mtry=25, splitrule=variance, min.node.size=10 
## - Fold5: mtry=25, splitrule=variance, min.node.size=10 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 25, splitrule = variance, min.node.size = 10 on full training set
# Print RF model
print(rf_model1)
## Random Forest 
## 
## 1460 samples
##   80 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1169, 1169, 1167, 1168, 1167 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  RMSE      Rsquared   MAE     
##    3     2             30929.83  0.8722486  18001.67
##   10     4             28327.61  0.8825102  16616.94
##   25    10             27885.92  0.8827094  16551.26
## 
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 25, splitrule = variance
##  and min.node.size = 10.
# Plot hyperparam tuning
plot(rf_model1)

# Make predictions
rf_model1_predictions <- predict(rf_model1, test_transformed)

# Evaluate
# NOTE: competition recommended looking at the log of SalePrice for RMSE (selected metric) - reduces impact of large errors
log_metrics <- postResample(pred = log(rf_model1_predictions + 1), obs = log(test_transformed$SalePrice + 1))
rf_model1_rmse_log <- log_metrics["RMSE"]
rf_model1_Rsquared_log <- log_metrics["Rsquared"]
rf_model1_MAE_log <- log_metrics["MAE"]


# Not taking log
metrics <- postResample(pred = rf_model1_predictions, obs = test_transformed$SalePrice)
rf_model1_rmse <- metrics["RMSE"]
rf_model1_Rsquared <- metrics["Rsquared"]
rf_model1_MAE <- metrics["MAE"]


rf_model1_output <- paste(
    "\n=== Model Selection and Evaluation ===\n\n",
    "=== RF MODEL 1 Evaluation ===\n",
    "RMSLE:", round(rf_model1_rmse_log, 4), 
    "| MAE (log):", round(rf_model1_MAE_log, 4),
    "| R² (log):", round(rf_model1_Rsquared_log, 4),
    "| RMSE:", round(rf_model1_rmse, 4), 
    "| MAE:", round(rf_model1_MAE, 4),
    "| R²:", round(rf_model1_Rsquared, 4),
    "\n\n",
    sep = " "
  )
cat(rf_model1_output)
## 
## === Model Selection and Evaluation ===
## 
##  === RF MODEL 1 Evaluation ===
##  RMSLE: 0.0718 | MAE (log): 0.0451 | R² (log): 0.9699 | RMSE: 18869.4112 | MAE: 9206.9896 | R²: 0.9631
# Feature importance
plot(varImp(rf_model1))

RF Model 2: Address Multicollinearity

Experiment:

  • Remove variables that show very high multicollinearity
  • Use tuning grid for optimal params:
  • Increasing mtry: A smaller number of variables to possibly split at in each node increases the randomness and decreases the correlation between trees, yet it can decrease the accuracy of individual trees. A smaller mtry increases bias and decreases variance; larger values decrease bias (more complex trees) and increase variance (less randomness, more correlation among trees).
  • Increasing min.node.size: Smaller values allow trees to become more complex and grow deeper, decreasing bias and increasing variance (more prone to overfitting). Larger values make trees less complex and more shallow, increasing bias and lowering variance (less prone to overfitting, but could be too simple).

Purpose:

  • Although random forest is fairly robust to multicollinearity, multicollinearity makes it harder to interpret which of the correlated features is actually more important (the model may split on any one of them more or less arbitrarily).
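As a rough illustration of how highly correlated predictor pairs can be flagged, here is a minimal base R sketch on simulated data. The toy columns, the 0.9 cutoff, and the name high_pairs are all assumptions for illustration; the report's actual removal step is done earlier and may differ.

```r
set.seed(1)
n <- 200
area <- rnorm(n)

# Simulated predictors: GarageCars is built to track GarageArea closely
toy <- data.frame(
  GarageArea = area,
  GarageCars = area + rnorm(n, sd = 0.1),
  LotFrontage = rnorm(n)
)

# Absolute correlations, keeping only the upper triangle so each pair appears once
cm <- abs(cor(toy))
cm[lower.tri(cm, diag = TRUE)] <- 0

# Pairs above the (assumed) cutoff of 0.9
high <- which(cm > 0.9, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r = cm[high]
)
high_pairs
```

One variable from each flagged pair would then be a candidate for removal.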

Result:

  • High R squared of 0.8605 and low RMSLE of 0.1521 (using logs): the model explains a high proportion of the variance and is fairly accurate
  • Not as high performing as model 1, but still performs fairly well
  • Not surprisingly, OverallQual, ExterQual and GarageArea are the most important predictors - this matches the correlation funnel!
  • The top 10 important features changed after removing multicollinear predictors - this possibly makes business decisions easier
  • A wider variety of features, such as FullBath, is now included in the top 10
  • Interestingly, YearBuilt ranks #4 in importance - it was not especially high in the correlation funnel
  • The final values used for the model were mtry = 25, splitrule = variance and min.node.size = 10.
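For reference, the log-scale error reported for this model is the root mean squared logarithmic error: the RMSE computed on log(price + 1). A minimal base R sketch follows; the function name rmsle and the prices are made up for illustration and are not part of caret or this dataset.

```r
# Root mean squared logarithmic error on log(x + 1)
rmsle <- function(pred, obs) {
  sqrt(mean((log(pred + 1) - log(obs + 1))^2))
}

obs  <- c(100000, 200000, 300000)  # made-up sale prices
pred <- c(110000, 190000, 330000)  # made-up predictions

rmsle(pred, obs)
```

Because the error is measured on the log scale, a 10% miss on a cheap house counts about as much as a 10% miss on an expensive one.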
# RF MODEL 2 -- Use the dataset that removed high multicollinearity predictors 

# Remove the new combined features created
train_transformed2_predictors <- train_transformed2 |>
  dplyr::select(-c(SalePrice, OverallTotalRating, TotalBaths, TotalSF))

# Target variable
train_transformed2_response <- train_transformed2$SalePrice

set.seed(123)
# Train the RF model
# ranger is a faster implementation of random forests than randomForest

# Tuning grid
tuneGrid <- data.frame(
  .mtry = c(3, 10, 25), # increasing mtry
  .splitrule = "variance",
  .min.node.size = c(2, 4, 10) # increasing node size
)

# Train the model - use cross-validation
rf_model2 <- train(
  x = train_transformed2_predictors,
  y = train_transformed2_response,
  method = "ranger",
  trControl = trainControl(
    method = "cv",
    number = 5,
    verboseIter = TRUE
  ),
  metric = "RMSE",
  tuneGrid = tuneGrid,
  importance = "impurity"
)
## + Fold1: mtry= 3, splitrule=variance, min.node.size= 2 
## - Fold1: mtry= 3, splitrule=variance, min.node.size= 2 
## + Fold1: mtry=10, splitrule=variance, min.node.size= 4 
## - Fold1: mtry=10, splitrule=variance, min.node.size= 4 
## + Fold1: mtry=25, splitrule=variance, min.node.size=10 
## - Fold1: mtry=25, splitrule=variance, min.node.size=10 
## + Fold2: mtry= 3, splitrule=variance, min.node.size= 2 
## - Fold2: mtry= 3, splitrule=variance, min.node.size= 2 
## + Fold2: mtry=10, splitrule=variance, min.node.size= 4 
## - Fold2: mtry=10, splitrule=variance, min.node.size= 4 
## + Fold2: mtry=25, splitrule=variance, min.node.size=10 
## - Fold2: mtry=25, splitrule=variance, min.node.size=10 
## + Fold3: mtry= 3, splitrule=variance, min.node.size= 2 
## - Fold3: mtry= 3, splitrule=variance, min.node.size= 2 
## + Fold3: mtry=10, splitrule=variance, min.node.size= 4 
## - Fold3: mtry=10, splitrule=variance, min.node.size= 4 
## + Fold3: mtry=25, splitrule=variance, min.node.size=10 
## - Fold3: mtry=25, splitrule=variance, min.node.size=10 
## + Fold4: mtry= 3, splitrule=variance, min.node.size= 2 
## - Fold4: mtry= 3, splitrule=variance, min.node.size= 2 
## + Fold4: mtry=10, splitrule=variance, min.node.size= 4 
## - Fold4: mtry=10, splitrule=variance, min.node.size= 4 
## + Fold4: mtry=25, splitrule=variance, min.node.size=10 
## - Fold4: mtry=25, splitrule=variance, min.node.size=10 
## + Fold5: mtry= 3, splitrule=variance, min.node.size= 2 
## - Fold5: mtry= 3, splitrule=variance, min.node.size= 2 
## + Fold5: mtry=10, splitrule=variance, min.node.size= 4 
## - Fold5: mtry=10, splitrule=variance, min.node.size= 4 
## + Fold5: mtry=25, splitrule=variance, min.node.size=10 
## - Fold5: mtry=25, splitrule=variance, min.node.size=10 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 25, splitrule = variance, min.node.size = 10 on full training set
# Print and plot model (hyperparam tuning)
print(rf_model2)
## Random Forest 
## 
## 1169 samples
##   73 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 935, 935, 934, 936, 936 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  RMSE      Rsquared   MAE     
##    3     2             32579.84  0.8428896  20203.31
##   10     4             29963.05  0.8561327  18538.95
##   25    10             29653.15  0.8558539  18618.96
## 
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 25, splitrule = variance
##  and min.node.size = 10.
plot(rf_model2)

# Make predictions
rf_model2_predictions <- predict(rf_model2, test_transformed2)

# Evaluate
# NOTE: competition recommended looking at the log of SalePrice for RMSE (selected metric) - reduces impact of large errors
log_metrics <- postResample(pred = log(rf_model2_predictions + 1), obs = log(test_transformed2$SalePrice + 1))
# RMSE
rf_model2_rmse_log <- log_metrics["RMSE"]
# R Squared
rf_model2_Rsquared_log <- log_metrics["Rsquared"]
# MAE
rf_model2_MAE_log <- log_metrics["MAE"]


# not taking log
metrics <- postResample(pred = rf_model2_predictions, obs = test_transformed2$SalePrice)
# RMSE
rf_model2_rmse <- metrics["RMSE"]
# R Squared
rf_model2_Rsquared <- metrics["Rsquared"]
# MAE
rf_model2_MAE <- metrics["MAE"]

rf_model2_output <- paste(
    "\n=== Model Selection and Evaluation ===\n\n",
    "=== RF MODEL 2 Evaluation ===\n",
    "RMSLE:", round(rf_model2_rmse_log, 4), 
    "| MAE (log):", round(rf_model2_MAE_log, 4),
    "| R² (log):", round(rf_model2_Rsquared_log, 4),
    "| RMSE:", round(rf_model2_rmse, 4), 
    "| MAE:", round(rf_model2_MAE, 4),
    "| R²:", round(rf_model2_Rsquared, 4),
    "\n\n",
    sep = " "
  )
cat(rf_model2_output)
## 
## === Model Selection and Evaluation ===
## 
##  === RF MODEL 2 Evaluation ===
##  RMSLE: 0.1521 | MAE (log): 0.1034 | R² (log): 0.8605 | RMSE: 41690.627 | MAE: 21201.3074 | R²: 0.8006
# Feature importance
plot(varImp(rf_model2))

RF Model 3: Combined Features

Experiment:

  • Use the combined features created above (OverallTotalRating, TotalBaths, and TotalSF) and remove the related original predictors.
  • Use a tuning grid to search for better parameters:
  • Increasing mtry: A smaller number of candidate variables at each split increases the randomness and decreases the correlation between trees, yet it can decrease the accuracy of individual trees. For smaller mtry, bias increases and variance decreases. Larger values decrease bias (stronger individual trees) and increase variance (less randomness, more correlation among trees).
  • Increasing min.node.size: Smaller values allow trees to grow deeper and become more complex, decreasing bias and increasing variance (more prone to overfitting). Larger values keep trees shallower and less complex, increasing bias and lowering variance (less prone to overfitting, but possibly too simple).

Purpose:

  • See whether creating new features from existing ones helps or hurts the model
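To make the idea of combined features concrete, here is a minimal sketch on a two-row toy data frame. The formulas are assumptions for illustration only (plain sums, with half baths weighted 0.5); the definitions actually used for OverallTotalRating, TotalBaths, and TotalSF are set earlier in the report and may differ.

```r
# Toy rows with made-up values, using the original predictor names
toy <- data.frame(
  OverallQual = c(7, 5), OverallCond = c(5, 6),
  BsmtFullBath = c(1, 0), BsmtHalfBath = c(0, 1),
  FullBath = c(2, 1), HalfBath = c(1, 0),
  TotalBsmtSF = c(856, 1262), X1stFlrSF = c(856, 1262), X2ndFlrSF = c(854, 0)
)

# Assumed definitions of the combined features
toy$TotalSF <- toy$TotalBsmtSF + toy$X1stFlrSF + toy$X2ndFlrSF
toy$TotalBaths <- toy$BsmtFullBath + toy$FullBath +
  0.5 * (toy$BsmtHalfBath + toy$HalfBath)
toy$OverallTotalRating <- toy$OverallQual + toy$OverallCond

toy[, c("TotalSF", "TotalBaths", "OverallTotalRating")]
```

Each combined column replaces several correlated originals with one predictor, which is why the related originals are dropped before training.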

Result:

  • High R squared of 0.8865 and low RMSLE of 0.1369 (using logs): the model explains a high proportion of the variance and is fairly accurate
  • Not as high performing as model 1, but still performs fairly well
  • Creating new features performed better than only removing features (model 2), suggesting multicollinearity is not a huge issue for random forests
  • TotalSF, GarageCars and ExterQual are the most important predictors - the total square footage of the house now matters more than its rating. Garage space and exterior rating are still important for this model.
  • Interestingly, OverallTotalRating dropped to #5 (OverallQual used to be #1)
  • The final values used for the model were mtry = 25, splitrule = variance and min.node.size = 10.
# RF MODEL 3 - USE COMBINED FEATURES

# Predictors -- remove predictors used for combined features
train_transformed3_predictors <- train_transformed |>
  dplyr::select(-c(SalePrice, OverallQual, OverallCond, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, TotalBsmtSF, X1stFlrSF, X2ndFlrSF))

# Target variable
train_transformed3_response <- train_transformed$SalePrice

set.seed(123)
# Train the RF model
# ranger is a faster implementation of random forests than randomForest

# Tuning grid
tuneGrid <- data.frame(
  .mtry = c(3, 10, 25),
  .splitrule = "variance",
  .min.node.size = c(2, 4, 10)
)

# Train the model - use cross-validation
rf_model3 <- train(
  x = train_transformed3_predictors,
  y = train_transformed3_response,
  method = "ranger",
  trControl = trainControl(
    method = "cv",
    number = 5,
    verboseIter = TRUE
  ),
  metric = "RMSE",
  tuneGrid = tuneGrid,
  importance = "impurity"
)
## + Fold1: mtry= 3, splitrule=variance, min.node.size= 2 
## - Fold1: mtry= 3, splitrule=variance, min.node.size= 2 
## + Fold1: mtry=10, splitrule=variance, min.node.size= 4 
## - Fold1: mtry=10, splitrule=variance, min.node.size= 4 
## + Fold1: mtry=25, splitrule=variance, min.node.size=10 
## - Fold1: mtry=25, splitrule=variance, min.node.size=10 
## + Fold2: mtry= 3, splitrule=variance, min.node.size= 2 
## - Fold2: mtry= 3, splitrule=variance, min.node.size= 2 
## + Fold2: mtry=10, splitrule=variance, min.node.size= 4 
## - Fold2: mtry=10, splitrule=variance, min.node.size= 4 
## + Fold2: mtry=25, splitrule=variance, min.node.size=10 
## - Fold2: mtry=25, splitrule=variance, min.node.size=10 
## + Fold3: mtry= 3, splitrule=variance, min.node.size= 2 
## - Fold3: mtry= 3, splitrule=variance, min.node.size= 2 
## + Fold3: mtry=10, splitrule=variance, min.node.size= 4 
## - Fold3: mtry=10, splitrule=variance, min.node.size= 4 
## + Fold3: mtry=25, splitrule=variance, min.node.size=10 
## - Fold3: mtry=25, splitrule=variance, min.node.size=10 
## + Fold4: mtry= 3, splitrule=variance, min.node.size= 2 
## - Fold4: mtry= 3, splitrule=variance, min.node.size= 2 
## + Fold4: mtry=10, splitrule=variance, min.node.size= 4 
## - Fold4: mtry=10, splitrule=variance, min.node.size= 4 
## + Fold4: mtry=25, splitrule=variance, min.node.size=10 
## - Fold4: mtry=25, splitrule=variance, min.node.size=10 
## + Fold5: mtry= 3, splitrule=variance, min.node.size= 2 
## - Fold5: mtry= 3, splitrule=variance, min.node.size= 2 
## + Fold5: mtry=10, splitrule=variance, min.node.size= 4 
## - Fold5: mtry=10, splitrule=variance, min.node.size= 4 
## + Fold5: mtry=25, splitrule=variance, min.node.size=10 
## - Fold5: mtry=25, splitrule=variance, min.node.size=10 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 25, splitrule = variance, min.node.size = 10 on full training set
# Plot and print the model
print(rf_model3)
## Random Forest 
## 
## 1169 samples
##   74 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 935, 935, 934, 936, 936 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  RMSE      Rsquared   MAE     
##    3     2             29444.78  0.8743839  17682.10
##   10     4             26277.80  0.8905740  15938.26
##   25    10             25454.40  0.8939029  15863.10
## 
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 25, splitrule = variance
##  and min.node.size = 10.
plot(rf_model3)

# Make predictions
rf_model3_predictions <- predict(rf_model3, test_transformed)

# Evaluate
# NOTE: competition recommended looking at the log of SalePrice for RMSE (selected metric) - reduces impact of large errors
log_metrics <- postResample(pred = log(rf_model3_predictions + 1), obs = log(test_transformed$SalePrice + 1))
# RMSE
rf_model3_rmse_log <- log_metrics["RMSE"]
# R Squared
rf_model3_Rsquared_log <- log_metrics["Rsquared"]
# MAE
rf_model3_MAE_log <- log_metrics["MAE"]


# not taking log
metrics <- postResample(pred = rf_model3_predictions, obs = test_transformed$SalePrice)
# RMSE
rf_model3_rmse <- metrics["RMSE"]
# R Squared
rf_model3_Rsquared <- metrics["Rsquared"]
# MAE
rf_model3_MAE <- metrics["MAE"]

rf_model3_output <- paste(
    "\n=== Model Selection and Evaluation ===\n\n",
    "=== RF MODEL 3 Evaluation ===\n",
    "RMSLE:", round(rf_model3_rmse_log, 4), 
    "| MAE (log):", round(rf_model3_MAE_log, 4),
    "| R² (log):", round(rf_model3_Rsquared_log, 4),
    "| RMSE:", round(rf_model3_rmse, 4), 
    "| MAE:", round(rf_model3_MAE, 4),
    "| R²:", round(rf_model3_Rsquared, 4),
    "\n\n",
    sep = " "
  )
cat(rf_model3_output)
## 
## === Model Selection and Evaluation ===
## 
##  === RF MODEL 3 Evaluation ===
##  RMSLE: 0.1369 | MAE (log): 0.0888 | R² (log): 0.8865 | RMSE: 38218.1468 | MAE: 18227.6135 | R²: 0.8304
# Feature importance
plot(varImp(rf_model3))

Neural Network

The default parameters for neuralnet are (from textbook):

  • hidden (number of neurons in the hidden layers) == 1
  • act.fct (activation function) == “logistic”

Note: rprop+ is the default algorithm used to train the neural network. rprop+ is resilient backpropagation with weight backtracking, which manages the learning rate dynamics automatically. So for this scenario, I will not experiment with different learning rates (I will not switch the algorithm to “backprop”).

NN Model 1: Original Predictors (Base Model)

Experiment:

  • Use original predictors and default params as a base model.
  • Use only 1 hidden layer with a single node (the default, hidden = 1)

Purpose:

  • See how a basic neural network performs with the dataset.

Result:

  • High R squared of 0.8587 and RMSLE of 0.1568 (using logs, as in the RMSE_LOG column of the comparison table): the model explains a high proportion of the variance and is fairly accurate
  • Fairly good performance with only 1 hidden node
  • The logistic activation function seems to be fairly suitable for this dataset - better gradient landscape, stable convergence
  • Hyperparameter tuning might improve the model
# NN MODEL 1 - BASE MODEL W/ ORIG (ALL) PREDICTORS
set.seed(123)
nn_model1 <- neuralnet(SalePrice ~ .,
  data = train_transformed_nn1) # use the nn dataset we created earlier

plot(nn_model1)
# compute returns list with 2 components - neurons (stores neurons for each layer in the network) & net.result (stores the model's predicted values)
nn1_results <- compute(nn_model1, test_transformed_nn1)

# Grab predictions - normalized
nn1_predictions <- nn1_results$net.result

# correlation is scale-invariant, so the normalized values can be compared directly
cor(nn1_predictions, test_transformed_nn1$SalePrice)
##           [,1]
## [1,] 0.8973185
# Evaluation

# inspired from textbook
unnormalize <- function(x) {
    return(x * (max(final_data_for_nn$SalePrice) - min(final_data_for_nn$SalePrice)) + min(final_data_for_nn$SalePrice))
}
nn1_predictions_unnorm <- unnormalize(nn1_predictions)

# Evaluation using log
nn1_residuals_log <- log(test_transformed$SalePrice + 1) - log(nn1_predictions_unnorm + 1)
nn1_mse_log <- mean(nn1_residuals_log^2)
nn1_rmse_log <- sqrt(nn1_mse_log)
nn1_Rsquared_log <- R2(log(nn1_predictions_unnorm + 1), log(test_transformed$SalePrice + 1))
mae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}
nn1_mae_log <- mae(log(test_transformed$SalePrice + 1), log(nn1_predictions_unnorm + 1))


# NOT using log
nn1_residuals <- test_transformed$SalePrice - nn1_predictions_unnorm
nn1_mse <- mean(nn1_residuals^2)
nn1_rmse <- sqrt(nn1_mse)
nn1_Rsquared <- R2(nn1_predictions_unnorm, test_transformed$SalePrice)
# mae() was defined above, so it is reused here
nn1_mae <- mae(test_transformed$SalePrice, nn1_predictions_unnorm)

nn_model1_output <- paste(
    "\n=== Model Selection and Evaluation ===\n\n",
    "=== NN MODEL 1 Evaluation ===\n",
    "RMSLE:", round(nn1_rmse_log, 4), # report the root of the mean squared log error, not the MSE
    "| MAE (log):", round(nn1_mae_log, 4),
    "| R² (log):", round(nn1_Rsquared_log, 4),
    "| RMSE:", round(nn1_rmse, 4), 
    "| MAE:", round(nn1_mae, 4),
    "| R²:", round(nn1_Rsquared, 4),
    "\n\n",
    sep = " "
  )
cat(nn_model1_output)
## 
## === Model Selection and Evaluation ===
## 
##  === NN MODEL 1 Evaluation ===
##  RMSLE: 0.1568 | MAE (log): 0.0989 | R² (log): 0.8587 | RMSE: 43265.3982 | MAE: 19380.8065 | R²: 0.8052
NN Model 2: Original Predictors - Increase Hidden Nodes

Experiment:

  • Use original predictors
  • Increase number of hidden nodes

Purpose:

  • Increasing the number of hidden nodes should reduce underfitting by making the model more complex. This decreases bias and increases variance.
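To give a sense of how quickly complexity grows with hidden nodes, the sketch below counts the weights (including bias terms) in a fully connected network. The helper count_weights is hypothetical, and the input count of 10 is a made-up number, not the number of predictors in this dataset.

```r
# Number of weights in a fully connected network, counting one bias per node
count_weights <- function(n_inputs, hidden, n_outputs = 1) {
  sizes <- c(n_inputs, hidden, n_outputs)
  # each layer: (nodes in previous layer + 1 bias) * nodes in this layer
  sum((head(sizes, -1) + 1) * tail(sizes, -1))
}

count_weights(10, 1)        # one hidden node
count_weights(10, 7)        # seven hidden nodes in one layer
count_weights(10, c(5, 5))  # two hidden layers of five nodes
```

With many dummy-encoded inputs, each extra hidden node adds one weight per input, so the parameter count can approach the number of training rows quickly.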

Result:

  • Very low R squared of 0.2805 and high RMSE of 122739.4042 (NOT using logs): the model explains little of the variance and is inaccurate
  • The log-based metrics are NaN because some unnormalized predictions fall below -1, where log(prediction + 1) is undefined
  • Potentially too many hidden nodes, causing low bias and high variance (overfitting - the model changes too much with small changes in the data)
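The NaN log metrics come from taking the log of a negative number. The sketch below reproduces that with made-up values and shows one possible guard, clipping predictions at zero before the log. The clipping step is an assumption for illustration and is not applied in this report; it also changes what the metric measures.

```r
pred <- c(150000, -5000, 220000)  # made-up predictions, one negative
obs  <- c(140000, 90000, 210000)  # made-up actual prices

# Naive RMSLE: log(-5000 + 1) is NaN, which propagates through mean() and sqrt()
naive <- suppressWarnings(sqrt(mean((log(pred + 1) - log(obs + 1))^2)))

# Guarded version: clip negative predictions to 0 before taking logs
clipped <- pmax(pred, 0)
guarded <- sqrt(mean((log(clipped + 1) - log(obs + 1))^2))

naive
guarded
```

A clipped prediction is still heavily penalized, so the guarded value stays large; it just becomes a number that can be compared across models.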
# NN MODEL 2 - INCREASE HIDDEN NODES (1 HIDDEN LAYER, 7 NODES)

set.seed(123)

# Train model
nn_model2 <- neuralnet(SalePrice ~ .,
  data = train_transformed_nn1,
  hidden = 7)

plot(nn_model2)
# compute returns list with 2 components - neurons (stores neurons for each layer in the network) & net.result (stores the model's predicted values)
nn2_results <- compute(nn_model2, test_transformed_nn1)

# Grab predictions - normalized
nn2_predictions <- nn2_results$net.result

# correlation is scale-invariant, so the normalized values can be compared directly
cor(nn2_predictions, test_transformed_nn1$SalePrice)
##           [,1]
## [1,] 0.5296615
# Evaluation
nn2_predictions_unnorm <- unnormalize(nn2_predictions)

# Evaluation metrics using log (recommended by kaggle)
nn2_residuals_log <- log(test_transformed$SalePrice + 1) - log(nn2_predictions_unnorm + 1)
## Warning in log(nn2_predictions_unnorm + 1): NaNs produced
nn2_mse_log <- mean(nn2_residuals_log^2)
nn2_rmse_log <- sqrt(nn2_mse_log)
nn2_Rsquared_log <- R2(log(nn2_predictions_unnorm + 1), log(test_transformed$SalePrice + 1))
## Warning in log(nn2_predictions_unnorm + 1): NaNs produced
nn2_mae_log <- mae(log(test_transformed$SalePrice + 1), log(nn2_predictions_unnorm + 1))
## Warning in log(nn2_predictions_unnorm + 1): NaNs produced
# NOT taking log
nn2_residuals <- test_transformed$SalePrice - nn2_predictions_unnorm
nn2_mse <- mean(nn2_residuals^2)
nn2_rmse <- sqrt(nn2_mse)
nn2_Rsquared <- R2(nn2_predictions_unnorm, test_transformed$SalePrice)
nn2_mae <- mae(test_transformed$SalePrice, nn2_predictions_unnorm)

nn_model2_output <- paste(
    "\n=== Model Selection and Evaluation ===\n\n",
    "=== NN MODEL 2 Evaluation ===\n",
    "RMSLE:", round(nn2_rmse_log, 4), # report the root of the mean squared log error, not the MSE
    "| MAE (log):", round(nn2_mae_log, 4),
    "| R² (log):", round(nn2_Rsquared_log, 4),
    "| RMSE:", round(nn2_rmse, 4), 
    "| MAE:", round(nn2_mae, 4),
    "| R²:", round(nn2_Rsquared, 4),
    "\n\n",
    sep = " "
  )
cat(nn_model2_output)
## 
## === Model Selection and Evaluation ===
## 
##  === NN MODEL 2 Evaluation ===
##  RMSLE: NaN | MAE (log): NaN | R² (log): NA | RMSE: 122739.4042 | MAE: 74402.0277 | R²: 0.2805
NN Model 3: Original Predictors - Update Activation Function (SmoothReLU) + Extra Hidden Layer

Experiment:

  • Use original predictors
  • Use SmoothReLU for activation function
  • Use a network with 2 hidden layers (5 nodes each)

Purpose:

  • ReLU is highly efficient for gradient descent, but can’t be used with neuralnet since its derivative is undefined at x=0. The purpose of this experiment is to see whether a smooth approximation of ReLU, known as SmoothReLU (softplus), can improve the model. I will also add a second hidden layer of five nodes to create a two-layer network. This lets the model learn more complex patterns, yet could also lead to overfitting and the vanishing gradient problem.
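A small base R sketch of the relationship between softplus and ReLU follows. It only evaluates the two functions on a few points and differentiates softplus symbolically with D(); the helper relu is defined here for comparison and is not passed to neuralnet.

```r
softplus <- function(x) log(1 + exp(x))
relu <- function(x) pmax(x, 0)

x <- c(-5, 0, 5)
softplus(x)  # always slightly above relu(x), closest for large |x|
relu(x)

# Symbolic derivative of softplus: exp(x) / (1 + exp(x)), i.e. the logistic function
d_softplus <- D(quote(log(1 + exp(x))), "x")
d_softplus
eval(d_softplus, list(x = 0))
```

Unlike ReLU, softplus has a derivative everywhere, which is what a package that differentiates the activation function needs.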

Result:

  • Very low R squared of 0.0051 and high RMSE of 2326672.3496 (NOT using logs): the model explains almost none of the variance and is very inaccurate
  • Does not fit the data well at all
  • Adding the extra layer most likely increased the variance greatly, causing overfitting (far too complex for this small dataset)
  • The vanishing gradient problem could be occurring with the extra layer as well
  • SmoothReLU could also be causing gradient instability
# NN MODEL 3 - USE SmoothReLU + 2 HIDDEN LAYERS (W/ ORIG PREDICTORS)

set.seed(123)

# define a softplus function
softplus <- function(x) { log(1 + exp(x)) }

# Train model
nn_model3 <- neuralnet(SalePrice ~ .,
                      hidden = c(5,5), 
                      act.fct = softplus, 
                      data = train_transformed_nn1)

plot(nn_model3)
# compute returns list with 2 components - neurons (stores neurons for each layer in the network) & net.result (stores the model's predicted values)
nn3_results <- compute(nn_model3, test_transformed_nn1)

# Grab predictions - normalized
nn3_predictions <- nn3_results$net.result

# correlation is scale-invariant, so the normalized values can be compared directly
cor(nn3_predictions, test_transformed_nn1$SalePrice)
##            [,1]
## [1,] 0.07133873


# Evaluation
nn3_predictions_unnorm <- unnormalize(nn3_predictions)

# Evaluation metrics using log
nn3_residuals_log <- log(test_transformed$SalePrice + 1) - log(nn3_predictions_unnorm + 1)
## Warning in log(nn3_predictions_unnorm + 1): NaNs produced
nn3_mse_log <- mean(nn3_residuals_log^2)
nn3_rmse_log <- sqrt(nn3_mse_log)
nn3_Rsquared_log <- R2(log(nn3_predictions_unnorm + 1), log(test_transformed$SalePrice + 1))
## Warning in log(nn3_predictions_unnorm + 1): NaNs produced
nn3_mae_log <- mae(log(test_transformed$SalePrice + 1), log(nn3_predictions_unnorm + 1))
## Warning in log(nn3_predictions_unnorm + 1): NaNs produced
# NOT using log
nn3_residuals <- test_transformed$SalePrice - nn3_predictions_unnorm
nn3_mse <- mean(nn3_residuals^2)
nn3_rmse <- sqrt(nn3_mse)
nn3_Rsquared <- R2(nn3_predictions_unnorm, test_transformed$SalePrice)
nn3_mae <- mae(test_transformed$SalePrice, nn3_predictions_unnorm)


nn_model3_output <- paste(
    "\n=== Model Selection and Evaluation ===\n\n",
    "=== NN MODEL 3 Evaluation ===\n",
    "RMSLE:", round(nn3_rmse_log, 4), # report the root of the mean squared log error, not the MSE
    "| MAE (log):", round(nn3_mae_log, 4),
    "| R² (log):", round(nn3_Rsquared_log, 4),
    "| RMSE:", round(nn3_rmse, 4), 
    "| MAE:", round(nn3_mae, 4),
    "| R²:", round(nn3_Rsquared, 4),
    "\n\n",
    sep = " "
  )
cat(nn_model3_output)
## 
## === Model Selection and Evaluation ===
## 
##  === NN MODEL 3 Evaluation ===
##  RMSLE: NaN | MAE (log): NaN | R² (log): NA | RMSE: 2326672.3496 | MAE: 283846.6175 | R²: 0.0051
NN Model 4: Original Predictors - Only Update Activation Function (SmoothReLU)

Experiment:

  • Use original predictors
  • Use SmoothReLU for activation function

Purpose:

  • ReLU is highly efficient for gradient descent, but can’t be used with neuralnet since its derivative is undefined at x=0. The purpose of this experiment is to see whether a smooth approximation of ReLU, known as SmoothReLU (softplus), can improve the model. Since the last model performed very poorly, I’m going to try the same activation function with just 1 hidden layer to see if the model can train more effectively. SmoothReLU should also perform fairly well with shallow networks.

Result:

  • Very low R squared of 0.0507 and high RMSE of 1380219.3757 (NOT using logs): the model explains almost none of the variance and is very inaccurate
  • Does not fit the data well at all
  • Removing the extra layer helped only marginally (R squared rose from 0.0051 to 0.0507), and the model is still unusable
  • SmoothReLU could also be causing gradient instability and might be too smooth
  • Possibly more hyperparameter tuning is required
# NN MODEL 4 - USE SmoothReLU (W/ ORIG PREDICTORS)

set.seed(123)

# Train model
nn_model4 <- neuralnet(SalePrice ~ .,
                      hidden = c(5), # only 1 hidden layer network
                      act.fct = softplus, 
                      data = train_transformed_nn1)

plot(nn_model4)
# compute returns list with 2 components - neurons (stores neurons for each layer in the network) & net.result (stores the model's predicted values)
nn4_results <- compute(nn_model4, test_transformed_nn1)

# Grab predictions - normalized
nn4_predictions <- nn4_results$net.result

# correlation is scale-invariant, so the normalized values can be compared directly
cor(nn4_predictions, test_transformed_nn1$SalePrice)
##          [,1]
## [1,] 0.225215
# Evaluation
nn4_predictions_unnorm <- unnormalize(nn4_predictions)

# Evaluation metrics using log
nn4_residuals_log <- log(test_transformed$SalePrice + 1) - log(nn4_predictions_unnorm + 1)
## Warning in log(nn4_predictions_unnorm + 1): NaNs produced
nn4_mse_log <- mean(nn4_residuals_log^2)
nn4_rmse_log <- sqrt(nn4_mse_log)
nn4_Rsquared_log <- R2(log(nn4_predictions_unnorm + 1), log(test_transformed$SalePrice + 1))
## Warning in log(nn4_predictions_unnorm + 1): NaNs produced
nn4_mae_log <- mae(log(test_transformed$SalePrice + 1), log(nn4_predictions_unnorm + 1))
## Warning in log(nn4_predictions_unnorm + 1): NaNs produced
# Evaluation metrics NOT using log
nn4_residuals <- test_transformed$SalePrice - nn4_predictions_unnorm
nn4_mse <- mean(nn4_residuals^2)
nn4_rmse <- sqrt(nn4_mse)
nn4_Rsquared <- R2(nn4_predictions_unnorm, test_transformed$SalePrice)
nn4_mae <- mae(test_transformed$SalePrice, nn4_predictions_unnorm)

nn_model4_output <- paste(
    "\n=== Model Selection and Evaluation ===\n\n",
    "=== NN MODEL 4 Evaluation ===\n",
    "RMSLE:", round(nn4_rmse_log, 4), # report the root of the mean squared log error, not the MSE
    "| MAE (log):", round(nn4_mae_log, 4),
    "| R² (log):", round(nn4_Rsquared_log, 4),
    "| RMSE:", round(nn4_rmse, 4), 
    "| MAE:", round(nn4_mae, 4),
    "| R²:", round(nn4_Rsquared, 4),
    "\n\n",
    sep = " "
  )
cat(nn_model4_output)
## 
## === Model Selection and Evaluation ===
## 
##  === NN MODEL 4 Evaluation ===
##  RMSLE: NaN | MAE (log): NaN | R² (log): NA | RMSE: 1380219.3757 | MAE: 243127.9723 | R²: 0.0507

6 Comparison

NOTE: See results table below

Recommended model: RF MODEL 1 (Original predictors)

Key Strengths:

RF Model 1 Top 10 Features:

Trade-offs:

Reasoning on not choosing other models:

RF Model 2:

RF Model 2 Top 10 Features:

RF Model 3:

RF Model 3 Top 10 Features:

NN Model 1 (base model):

NN Model 2 (increased hidden nodes):

NN Model 3 (SmoothReLU + 2nd layer):

NN Model 4 (SmoothReLU):

# Comparison table
all_experiments <- data.frame(
  Model = c("RF Model 1", "RF Model 2", "RF Model 3", "NN Model 1", "NN Model 2", "NN Model 3", "NN Model 4"),
  RMSE = c(rf_model1_rmse, rf_model2_rmse, rf_model3_rmse, nn1_rmse, nn2_rmse, nn3_rmse, nn4_rmse),
  MAE = c(rf_model1_MAE, rf_model2_MAE, rf_model3_MAE, nn1_mae, nn2_mae, nn3_mae, nn4_mae),
  Rsquared = c(rf_model1_Rsquared, rf_model2_Rsquared, rf_model3_Rsquared, nn1_Rsquared, nn2_Rsquared, nn3_Rsquared, nn4_Rsquared),
  RMSE_LOG = c(rf_model1_rmse_log, rf_model2_rmse_log, rf_model3_rmse_log, nn1_rmse_log, nn2_rmse_log, nn3_rmse_log, nn4_rmse_log),
  MAE_LOG = c(rf_model1_MAE_log, rf_model2_MAE_log, rf_model3_MAE_log, nn1_mae_log, nn2_mae_log, nn3_mae_log, nn4_mae_log),
  Rsquared_LOG = c(rf_model1_Rsquared_log, rf_model2_Rsquared_log, rf_model3_Rsquared_log, nn1_Rsquared_log, nn2_Rsquared_log, nn3_Rsquared_log, nn4_Rsquared_log)
)


all_experiments %>%
  kbl() %>%
  kable_styling(full_width = TRUE)
Model        RMSE         MAE         Rsquared   RMSE_LOG   MAE_LOG    Rsquared_LOG
RF Model 1   18869.41     9206.99     0.9631335  0.0718030  0.0451099  0.9698832
RF Model 2   41690.63     21201.31    0.8006262  0.1521428  0.1034242  0.8605485
RF Model 3   38218.15     18227.61    0.8303695  0.1368532  0.0887514  0.8864929
NN Model 1   43265.40     19380.81    0.8051805  0.1568195  0.0989110  0.8587128
NN Model 2   122739.40    74402.03    0.2805413  NaN        NaN        NA
NN Model 3   2326672.35   283846.62   0.0050892  NaN        NaN        NA
NN Model 4   1380219.38   243127.97   0.0507218  NaN        NaN        NA

7 Conclusion

To summarize the comparison above, the random forest models fit this data better overall, although the base neural network model performed fairly well. It’s possible that this particular regression problem, with data of this size, is too simple for a neural network. Neural networks tend to work better with unstructured datasets such as images, text and audio. Random forests, in contrast, handle high-dimensional tabular data very well. Additionally, random forests can perform well even with smaller datasets.

It’s entirely possible the neural network could have achieved a higher performance, but it would take more manual hyperparameter tuning. A downside of neural networks is that a single poorly chosen hyperparameter can ruin the model, making NNs very tricky to experiment with (many experiments are required). Parameters I would try tuning include the learning rate and bias (for standard backpropagation; I could not do this due to time constraints). In a real-world scenario, a reliable, faster, less computationally expensive and less time-consuming model, such as the random forest, might work better.

Some additional experiments I would have liked to try, given more time, include combining the NN and the RF model, where the NN would perform the feature engineering and the RF would make the predictions. I also imputed the missing values in this dataset before training the RF models, so another experiment would be to see if the RF performance improves when the missing values are kept. Another thing to try would be removing YearSold, MoSold, etc., since to me they sound like data leakage.

Business impact of the final model:

As a business, the top features to pay attention to are (full list can be seen in the model section):

Overall, these features make complete sense. The overall rating/quality and size of a house are what mostly determine the final price. Another insight from this model is the order of importance of the different areas of the house: Living Area > External > Garage > Basement > First Floor > Kitchen > 2nd Floor > Baths. This tells us that a book, or a house in this case, is judged by its cover, as External comes just behind Living Area. It’s interesting that External is more important than first floor, kitchen and 2nd floor.

As mentioned before, it’s also interesting that YearBuilt is a fairly important feature. More modern houses tend to have a higher price. Neighborhood is fairly low in importance, which doesn’t match the classic real estate phrase, “location, location, location.” Many features have extremely low importance, such as Fence, PoolQC and HalfBath.

What’s great about this model is that it showcases features that are either higher or lower in importance than expected. Regarding setting up this type of model in production, the best part is that not much data preprocessing is needed. If a neural network were used in production, all of that conversion and scaling would need to be done in production as well, increasing the chance of a mishap.
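To illustrate the scaling burden mentioned above, here is a minimal sketch of min-max scaling where the minimum and maximum are learned once from training data and then reused for new data. The helper names fit_minmax and normalize, the two-argument form of unnormalize, and the prices are all made up for illustration; this is a variation on the unnormalize helper used earlier, not the report's exact code.

```r
# Learn the scaling parameters from training data only
fit_minmax <- function(x) list(min = min(x), max = max(x))

# Apply and invert the scaling with the stored parameters
normalize <- function(x, p) (x - p$min) / (p$max - p$min)
unnormalize <- function(x, p) x * (p$max - p$min) + p$min

train_price <- c(100000, 150000, 300000)  # made-up training prices
new_price   <- c(120000, 250000)          # made-up production inputs

p <- fit_minmax(train_price)
scaled <- normalize(new_price, p)     # what the network would see
restored <- unnormalize(scaled, p)    # back to dollars

scaled
restored
```

In production the stored parameters have to be shipped with the model; recomputing them on incoming data would silently shift every prediction.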