The dataset I chose is Kaggle’s “House Prices - Advanced Regression Techniques” competition dataset, which includes features of residential homes in Ames, Iowa. The goal of the competition is to predict the final sale price of each home. The dataset consists of a training set and a test set (the latter does not include the target variable).
The reason I chose this dataset is that it is a classic regression problem in machine learning. Additionally, it has many more features than what we’ve previously worked with, which is closer to what I would expect in the real world. I also wanted to work with the random forest algorithm and compare it to a neural network (an algorithm I haven’t tried yet), and this mix of categorical and numeric features seemed like a great option for that comparison. The dataset also provides a lot of practice with data preparation and extensive EDA, which takes up a good portion of data science work. Lastly, real estate data genuinely seemed interesting, so this felt like the perfect problem to solve.
Kaggle link: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview
Housing features:
SalePrice: the property’s sale price in dollars. This is the target variable that you’re trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale
The business goal of this project is to predict the sale price of each house. The data science goal is to perform EDA and experiment with models that could fit this dataset, solving a classic machine learning regression problem (predicting the SalePrice variable).
# Import Libraries
library(kableExtra)
library(knitr)
library(readr)
library(tidyverse)
library(corrplot)
library(dplyr)
library(GGally)
library(caret)
library(pROC)
library(glmnet)
library(MASS)
library(car)
library(correlationfunnel)
library(faraway)
library(arm)
library(performance)
library(see)
library(reshape2)
#library(tidymodels)
#library(rms)
library(smotefamily)
library(themis)
library(skimr)
library(DataExplorer)
library(naniar)
library(mice)
library(corrr)
library(FactoMineR)
library(ggcorrplot)
library(factoextra)
library(ranger)
library(neuralnet)
library(tidymodels)
library(gmodels)
To obtain the dataset, I enrolled in the ongoing Kaggle competition and downloaded the data.
# Read in the training dataset (downloaded from Kaggle)
raw_data <- read.csv("/Users/gillianmcgovern/Documents/CUNY/DATA_622/FINAL\ PROJECT/final_project_train.csv")
The housing prices training dataset contains 1,460 observations and 81 variables, where each observation is a house with its features and final sale price (SalePrice is the target variable). The dataset contains both categorical and numeric/continuous variables, and the categorical variables include both ordinal and nominal features.
Missing data:
There are 0% complete rows (every row has at least one missing value), yet only 5.9% of all values are missing (a quick check of these figures is sketched below).
The missing data breakdown bar chart shows that Electrical, MasVnrType, MasVnrArea, BsmtFinType1, BsmtCond, BsmtQual, BsmtFinType2, BsmtExposure, GarageCond, GarageQual, GarageFinish, GarageYrBlt, GarageType, LotFrontage, FireplaceQu, Fence, Alley, MiscFeature, and PoolQC have missing values. Fence, Alley, MiscFeature, and PoolQC have the most missing values, with a missing percentage above 80%. We will have to look into these missing values to check if they are indeed missing, and if we should keep them.
The missing upset plot shows that Fence, Alley, MiscFeature, and PoolQC are usually all missing together.
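A quick way to check the missingness figures quoted above (a minimal sketch using naniar, which is loaded earlier; these helper calls are my own and not part of the original output):
# Verify the missingness summary quoted above
naniar::pct_complete_case(raw_data)  # percentage of rows with no missing values (~0%)
naniar::pct_miss(raw_data)           # percentage of all cells that are missing (~5.9%)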
Duplicate data:
There are no duplicate observations.
Invalid data:
It appears all the values for the categorical features are valid as well (i.e. no special characters need cleaning).
Summary statistics:
Summary statistics of the predictor variables in the training dataset are included in the table below. Key metrics such as minimum, maximum, mean, median, and standard deviation (SD) help us understand the range, central tendency, and variability of each variable.
Some interesting things to note right from the start:
NOTE: YrSold, MoSold, SaleType, and SaleCondition will all be kept for this model, since the test dataset provided for this competition contains these features; we therefore assume this information is available when predicting the final price of a house. Depending on when the predictions need to be made, this information might not be available in a real-world scenario (in which case these features would be removed before training a model). The competition does not state that this information is unavailable for prediction.
# Structure of the data
str(raw_data)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
# Glimpse of the data
head(raw_data)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
# Id doesn't impact prediction so remove
raw_data <- raw_data[ , -1]
# Summary statistics
raw_data %>%
  summary() %>%
  kable(caption = "Descriptive Statistics of Predictor Variables") %>%
  kable_styling()
| MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 20.0 | Length:1460 | Min. : 21.00 | Min. : 1300 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Min. : 1.000 | Min. :1.000 | Min. :1872 | Min. :1950 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Min. : 0.0 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Min. : 0.0 | Length:1460 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Min. : 334 | Min. : 0 | Min. : 0.000 | Min. : 334 | Min. :0.0000 | Min. :0.00000 | Min. :0.000 | Min. :0.0000 | Min. :0.000 | Min. :0.000 | Length:1460 | Min. : 2.000 | Length:1460 | Min. :0.000 | Length:1460 | Length:1460 | Min. :1900 | Length:1460 | Min. :0.000 | Min. : 0.0 | Length:1460 | Length:1460 | Length:1460 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.000 | Length:1460 | Length:1460 | Length:1460 | Min. : 0.00 | Min. : 1.000 | Min. :2006 | Length:1460 | Length:1460 | Min. : 34900 | |
| 1st Qu.: 20.0 | Class :character | 1st Qu.: 59.00 | 1st Qu.: 7554 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 5.000 | 1st Qu.:5.000 | 1st Qu.:1954 | 1st Qu.:1967 | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 0.0 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 0.0 | Class :character | 1st Qu.: 0.00 | 1st Qu.: 223.0 | 1st Qu.: 795.8 | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 882 | 1st Qu.: 0 | 1st Qu.: 0.000 | 1st Qu.:1130 | 1st Qu.:0.0000 | 1st Qu.:0.00000 | 1st Qu.:1.000 | 1st Qu.:0.0000 | 1st Qu.:2.000 | 1st Qu.:1.000 | Class :character | 1st Qu.: 5.000 | Class :character | 1st Qu.:0.000 | Class :character | Class :character | 1st Qu.:1961 | Class :character | 1st Qu.:1.000 | 1st Qu.: 334.5 | Class :character | Class :character | Class :character | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.000 | Class :character | Class :character | Class :character | 1st Qu.: 0.00 | 1st Qu.: 5.000 | 1st Qu.:2007 | Class :character | Class :character | 1st Qu.:129975 | |
| Median : 50.0 | Mode :character | Median : 69.00 | Median : 9478 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 6.000 | Median :5.000 | Median :1973 | Median :1994 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 0.0 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 383.5 | Mode :character | Median : 0.00 | Median : 477.5 | Median : 991.5 | Mode :character | Mode :character | Mode :character | Mode :character | Median :1087 | Median : 0 | Median : 0.000 | Median :1464 | Median :0.0000 | Median :0.00000 | Median :2.000 | Median :0.0000 | Median :3.000 | Median :1.000 | Mode :character | Median : 6.000 | Mode :character | Median :1.000 | Mode :character | Mode :character | Median :1980 | Mode :character | Median :2.000 | Median : 480.0 | Mode :character | Mode :character | Mode :character | Median : 0.00 | Median : 25.00 | Median : 0.00 | Median : 0.00 | Median : 0.00 | Median : 0.000 | Mode :character | Mode :character | Mode :character | Median : 0.00 | Median : 6.000 | Median :2008 | Mode :character | Mode :character | Median :163000 | |
| Mean : 56.9 | NA | Mean : 70.05 | Mean : 10517 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Mean : 6.099 | Mean :5.575 | Mean :1971 | Mean :1985 | NA | NA | NA | NA | NA | Mean : 103.7 | NA | NA | NA | NA | NA | NA | NA | Mean : 443.6 | NA | Mean : 46.55 | Mean : 567.2 | Mean :1057.4 | NA | NA | NA | NA | Mean :1163 | Mean : 347 | Mean : 5.845 | Mean :1515 | Mean :0.4253 | Mean :0.05753 | Mean :1.565 | Mean :0.3829 | Mean :2.866 | Mean :1.047 | NA | Mean : 6.518 | NA | Mean :0.613 | NA | NA | Mean :1979 | NA | Mean :1.767 | Mean : 473.0 | NA | NA | NA | Mean : 94.24 | Mean : 46.66 | Mean : 21.95 | Mean : 3.41 | Mean : 15.06 | Mean : 2.759 | NA | NA | NA | Mean : 43.49 | Mean : 6.322 | Mean :2008 | NA | NA | Mean :180921 | |
| 3rd Qu.: 70.0 | NA | 3rd Qu.: 80.00 | 3rd Qu.: 11602 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.: 7.000 | 3rd Qu.:6.000 | 3rd Qu.:2000 | 3rd Qu.:2004 | NA | NA | NA | NA | NA | 3rd Qu.: 166.0 | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.: 712.2 | NA | 3rd Qu.: 0.00 | 3rd Qu.: 808.0 | 3rd Qu.:1298.2 | NA | NA | NA | NA | 3rd Qu.:1391 | 3rd Qu.: 728 | 3rd Qu.: 0.000 | 3rd Qu.:1777 | 3rd Qu.:1.0000 | 3rd Qu.:0.00000 | 3rd Qu.:2.000 | 3rd Qu.:1.0000 | 3rd Qu.:3.000 | 3rd Qu.:1.000 | NA | 3rd Qu.: 7.000 | NA | 3rd Qu.:1.000 | NA | NA | 3rd Qu.:2002 | NA | 3rd Qu.:2.000 | 3rd Qu.: 576.0 | NA | NA | NA | 3rd Qu.:168.00 | 3rd Qu.: 68.00 | 3rd Qu.: 0.00 | 3rd Qu.: 0.00 | 3rd Qu.: 0.00 | 3rd Qu.: 0.000 | NA | NA | NA | 3rd Qu.: 0.00 | 3rd Qu.: 8.000 | 3rd Qu.:2009 | NA | NA | 3rd Qu.:214000 | |
| Max. :190.0 | NA | Max. :313.00 | Max. :215245 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Max. :10.000 | Max. :9.000 | Max. :2010 | Max. :2010 | NA | NA | NA | NA | NA | Max. :1600.0 | NA | NA | NA | NA | NA | NA | NA | Max. :5644.0 | NA | Max. :1474.00 | Max. :2336.0 | Max. :6110.0 | NA | NA | NA | NA | Max. :4692 | Max. :2065 | Max. :572.000 | Max. :5642 | Max. :3.0000 | Max. :2.00000 | Max. :3.000 | Max. :2.0000 | Max. :8.000 | Max. :3.000 | NA | Max. :14.000 | NA | Max. :3.000 | NA | NA | Max. :2010 | NA | Max. :4.000 | Max. :1418.0 | NA | NA | NA | Max. :857.00 | Max. :547.00 | Max. :552.00 | Max. :508.00 | Max. :480.00 | Max. :738.000 | NA | NA | NA | Max. :15500.00 | Max. :12.000 | Max. :2010 | NA | NA | Max. :755000 | |
| NA | NA | NA’s :259 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :8 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :81 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
plot_intro(raw_data)
# Visualize Missing Data
plot_missing(raw_data)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the DataExplorer package.
## Please report the issue at
## <https://github.com/boxuancui/DataExplorer/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Check how missingness relates to other variables
gg_miss_upset(raw_data, nsets = 10)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the UpSetR package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the UpSetR package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Check how missingness affects SalePrice
gg_miss_fct(x = raw_data, fct = SalePrice)
# Check for duplicates
duplicates <- duplicated(raw_data)
# Print the duplicates
print(raw_data[duplicates, ])
## [1] MSSubClass MSZoning LotFrontage LotArea Street
## [6] Alley LotShape LandContour Utilities LotConfig
## [11] LandSlope Neighborhood Condition1 Condition2 BldgType
## [16] HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd
## [21] RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## [26] MasVnrArea ExterQual ExterCond Foundation BsmtQual
## [31] BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## [36] BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC
## [41] CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## [46] GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
## [51] BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## [56] Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
## [61] GarageCars GarageArea GarageQual GarageCond PavedDrive
## [66] WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## [71] PoolArea PoolQC Fence MiscFeature MiscVal
## [76] MoSold YrSold SaleType SaleCondition SalePrice
## <0 rows> (or 0-length row.names)
# Break up numerical and categorical variables
cat("Numerical predictors:")
## Numerical predictors:
data_raw_numeric <- raw_data |>
  dplyr::select(where(is.numeric))
numerical_predictors <- names(data_raw_numeric)
print(numerical_predictors)
## [1] "MSSubClass" "LotFrontage" "LotArea" "OverallQual"
## [5] "OverallCond" "YearBuilt" "YearRemodAdd" "MasVnrArea"
## [9] "BsmtFinSF1" "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF"
## [13] "X1stFlrSF" "X2ndFlrSF" "LowQualFinSF" "GrLivArea"
## [17] "BsmtFullBath" "BsmtHalfBath" "FullBath" "HalfBath"
## [21] "BedroomAbvGr" "KitchenAbvGr" "TotRmsAbvGrd" "Fireplaces"
## [25] "GarageYrBlt" "GarageCars" "GarageArea" "WoodDeckSF"
## [29] "OpenPorchSF" "EnclosedPorch" "X3SsnPorch" "ScreenPorch"
## [33] "PoolArea" "MiscVal" "MoSold" "YrSold"
## [37] "SalePrice"
cat("\nCategorical predictors:")
##
## Categorical predictors:
data_raw_categorical <- raw_data |>
  dplyr::select(where(is.factor) | where(is.character))
categorical_predictors <- names(data_raw_categorical)  # SalePrice is numeric, so there is no categorical target to exclude
print(categorical_predictors)
## [1] "MSZoning" "Street" "Alley" "LotShape"
## [5] "LandContour" "Utilities" "LotConfig" "LandSlope"
## [9] "Neighborhood" "Condition1" "Condition2" "BldgType"
## [13] "HouseStyle" "RoofStyle" "RoofMatl" "Exterior1st"
## [17] "Exterior2nd" "MasVnrType" "ExterQual" "ExterCond"
## [21] "Foundation" "BsmtQual" "BsmtCond" "BsmtExposure"
## [25] "BsmtFinType1" "BsmtFinType2" "Heating" "HeatingQC"
## [29] "CentralAir" "Electrical" "KitchenQual" "Functional"
## [33] "FireplaceQu" "GarageType" "GarageFinish" "GarageQual"
## [37] "GarageCond" "PavedDrive" "PoolQC" "Fence"
## [41] "MiscFeature" "SaleType" "SaleCondition"
# Check for any typos in categorical features values
data_raw_categorical %>%
  lapply(unique) %>%
  print()
## $MSZoning
## [1] "RL" "RM" "C (all)" "FV" "RH"
##
## $Street
## [1] "Pave" "Grvl"
##
## $Alley
## [1] NA "Grvl" "Pave"
##
## $LotShape
## [1] "Reg" "IR1" "IR2" "IR3"
##
## $LandContour
## [1] "Lvl" "Bnk" "Low" "HLS"
##
## $Utilities
## [1] "AllPub" "NoSeWa"
##
## $LotConfig
## [1] "Inside" "FR2" "Corner" "CulDSac" "FR3"
##
## $LandSlope
## [1] "Gtl" "Mod" "Sev"
##
## $Neighborhood
## [1] "CollgCr" "Veenker" "Crawfor" "NoRidge" "Mitchel" "Somerst" "NWAmes"
## [8] "OldTown" "BrkSide" "Sawyer" "NridgHt" "NAmes" "SawyerW" "IDOTRR"
## [15] "MeadowV" "Edwards" "Timber" "Gilbert" "StoneBr" "ClearCr" "NPkVill"
## [22] "Blmngtn" "BrDale" "SWISU" "Blueste"
##
## $Condition1
## [1] "Norm" "Feedr" "PosN" "Artery" "RRAe" "RRNn" "RRAn" "PosA"
## [9] "RRNe"
##
## $Condition2
## [1] "Norm" "Artery" "RRNn" "Feedr" "PosN" "PosA" "RRAn" "RRAe"
##
## $BldgType
## [1] "1Fam" "2fmCon" "Duplex" "TwnhsE" "Twnhs"
##
## $HouseStyle
## [1] "2Story" "1Story" "1.5Fin" "1.5Unf" "SFoyer" "SLvl" "2.5Unf" "2.5Fin"
##
## $RoofStyle
## [1] "Gable" "Hip" "Gambrel" "Mansard" "Flat" "Shed"
##
## $RoofMatl
## [1] "CompShg" "WdShngl" "Metal" "WdShake" "Membran" "Tar&Grv" "Roll"
## [8] "ClyTile"
##
## $Exterior1st
## [1] "VinylSd" "MetalSd" "Wd Sdng" "HdBoard" "BrkFace" "WdShing" "CemntBd"
## [8] "Plywood" "AsbShng" "Stucco" "BrkComm" "AsphShn" "Stone" "ImStucc"
## [15] "CBlock"
##
## $Exterior2nd
## [1] "VinylSd" "MetalSd" "Wd Shng" "HdBoard" "Plywood" "Wd Sdng" "CmentBd"
## [8] "BrkFace" "Stucco" "AsbShng" "Brk Cmn" "ImStucc" "AsphShn" "Stone"
## [15] "Other" "CBlock"
##
## $MasVnrType
## [1] "BrkFace" "None" "Stone" "BrkCmn" NA
##
## $ExterQual
## [1] "Gd" "TA" "Ex" "Fa"
##
## $ExterCond
## [1] "TA" "Gd" "Fa" "Po" "Ex"
##
## $Foundation
## [1] "PConc" "CBlock" "BrkTil" "Wood" "Slab" "Stone"
##
## $BsmtQual
## [1] "Gd" "TA" "Ex" NA "Fa"
##
## $BsmtCond
## [1] "TA" "Gd" NA "Fa" "Po"
##
## $BsmtExposure
## [1] "No" "Gd" "Mn" "Av" NA
##
## $BsmtFinType1
## [1] "GLQ" "ALQ" "Unf" "Rec" "BLQ" NA "LwQ"
##
## $BsmtFinType2
## [1] "Unf" "BLQ" NA "ALQ" "Rec" "LwQ" "GLQ"
##
## $Heating
## [1] "GasA" "GasW" "Grav" "Wall" "OthW" "Floor"
##
## $HeatingQC
## [1] "Ex" "Gd" "TA" "Fa" "Po"
##
## $CentralAir
## [1] "Y" "N"
##
## $Electrical
## [1] "SBrkr" "FuseF" "FuseA" "FuseP" "Mix" NA
##
## $KitchenQual
## [1] "Gd" "TA" "Ex" "Fa"
##
## $Functional
## [1] "Typ" "Min1" "Maj1" "Min2" "Mod" "Maj2" "Sev"
##
## $FireplaceQu
## [1] NA "TA" "Gd" "Fa" "Ex" "Po"
##
## $GarageType
## [1] "Attchd" "Detchd" "BuiltIn" "CarPort" NA "Basment" "2Types"
##
## $GarageFinish
## [1] "RFn" "Unf" "Fin" NA
##
## $GarageQual
## [1] "TA" "Fa" "Gd" NA "Ex" "Po"
##
## $GarageCond
## [1] "TA" "Fa" NA "Gd" "Po" "Ex"
##
## $PavedDrive
## [1] "Y" "N" "P"
##
## $PoolQC
## [1] NA "Ex" "Fa" "Gd"
##
## $Fence
## [1] NA "MnPrv" "GdWo" "GdPrv" "MnWw"
##
## $MiscFeature
## [1] NA "Shed" "Gar2" "Othr" "TenC"
##
## $SaleType
## [1] "WD" "New" "COD" "ConLD" "ConLI" "CWD" "ConLw" "Con" "Oth"
##
## $SaleCondition
## [1] "Normal" "Abnorml" "Partial" "AdjLand" "Alloca" "Family"
We should convert each categorical feature into a factor so that R handles it properly. Note: I treated GarageType as a nominal feature, since it doesn’t really have an inherent order.
After taking a closer look at the metadata file, Alley, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PoolQC, Fence, and MiscFeature all have an “NA” category which actually means the house does not have the item the predictor refers to. For example, an NA value for GarageType actually means “No Garage”. This is an informative categorical value and shouldn’t be treated as missing data.
To clean the data, let’s convert these NA values to their actual categorical value.
After cleaning up the data, the remaining missing predictors are Electrical, MasVnrType, MasVnrArea, GarageYrBlt, LotFrontage.
As we can see above, a missing GarageYrBlt indicates “No Garage”. For GarageYrBlt, let’s replace NA with 0 and rely on the existing GarageType categorical variable, which already indicates whether a garage is present.
MasVnrType, MasVnrArea, and LotFrontage appear genuinely missing according to the metadata file. Looking at the other predictors’ values when LotFrontage is NA, the missingness is spread across many different predictor values rather than concentrated in a particular one (it is not the case that LotFrontage is only missing when, say, Condition1 == “PosN”), so a missing LotFrontage does not appear to encode an informative house feature. Therefore, imputation will probably be used for these variables.
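As a sketch of what that imputation could look like (an assumption about a later step, not necessarily the approach ultimately used in this project), the mice package loaded above could impute the genuinely missing predictors; the column selection and settings below are illustrative only:
# Hedged sketch: impute the remaining genuinely missing predictors with mice
# (defaults: pmm for numeric columns, polyreg for multi-level factors).
# In practice, more predictor columns would be included to inform the imputation.
to_impute <- raw_data %>%
  dplyr::select(LotFrontage, MasVnrArea, MasVnrType, Electrical) %>%
  dplyr::mutate(dplyr::across(where(is.character), as.factor))
imp <- mice::mice(to_impute, m = 1, seed = 622, printFlag = FALSE)
colSums(is.na(mice::complete(imp)))  # expect zero missing values after imputation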
# Convert categorical to factors
raw_data$MSSubClass <- as.factor(raw_data$MSSubClass) # this is a sneaky categorical variable that needs to be a factor (nominal)
# Create new categories if the house feature doesn't exist, such as "No Garage"
raw_data$Alley = ifelse(is.na(raw_data$Alley), "NoAlley", raw_data$Alley)
raw_data$BsmtQual = ifelse(is.na(raw_data$BsmtQual), "NoBasement", raw_data$BsmtQual)
raw_data$BsmtCond = ifelse(is.na(raw_data$BsmtCond), "NoBasement", raw_data$BsmtCond)
raw_data$BsmtExposure = ifelse(is.na(raw_data$BsmtExposure), "NoBasement", raw_data$BsmtExposure)
raw_data$BsmtFinType1 = ifelse(is.na(raw_data$BsmtFinType1), "NoBasement", raw_data$BsmtFinType1)
raw_data$BsmtFinType2 = ifelse(is.na(raw_data$BsmtFinType2), "NoBasement", raw_data$BsmtFinType2)
raw_data$FireplaceQu = ifelse(is.na(raw_data$FireplaceQu), "NoFireplace", raw_data$FireplaceQu)
raw_data$GarageType = ifelse(is.na(raw_data$GarageType), "NoGarage", raw_data$GarageType)
raw_data$GarageFinish = ifelse(is.na(raw_data$GarageFinish), "NoGarage", raw_data$GarageFinish)
raw_data$GarageQual = ifelse(is.na(raw_data$GarageQual), "NoGarage", raw_data$GarageQual)
raw_data$GarageCond = ifelse(is.na(raw_data$GarageCond), "NoGarage", raw_data$GarageCond)
raw_data$PoolQC = ifelse(is.na(raw_data$PoolQC), "NoPool", raw_data$PoolQC)
raw_data$Fence = ifelse(is.na(raw_data$Fence), "NoFence", raw_data$Fence)
raw_data$MiscFeature = ifelse(is.na(raw_data$MiscFeature), "None", raw_data$MiscFeature)
# Update ordinal features
raw_data$Fence <- factor(
raw_data$Fence,
levels = c("NoFence", "MnWw", "GdWo", "MnPrv", "GdPrv"),
ordered = TRUE
)
raw_data$PoolQC <- factor(
raw_data$PoolQC,
levels = c("NoPool", "Fa", "TA", "Gd", "Ex"),
ordered = TRUE
)
raw_data$GarageCond <- factor(
raw_data$GarageCond,
levels = c("NoGarage", "Po", "Fa", "TA", "Gd", "Ex"),
ordered = TRUE
)
raw_data$GarageQual <- factor(
raw_data$GarageQual,
levels = c("NoGarage", "Po", "Fa", "TA", "Gd", "Ex"),
ordered = TRUE
)
raw_data$GarageFinish <- factor(
raw_data$GarageFinish,
levels = c("NoGarage", "Unf", "RFn", "Fin"),
ordered = TRUE
)
raw_data$FireplaceQu <- factor(
raw_data$FireplaceQu,
levels = c("NoFireplace", "Po", "Fa", "TA", "Gd", "Ex"),
ordered = TRUE
)
raw_data$Functional <- factor(
raw_data$Functional,
levels = c("Sal", "Sev", "Maj2", "Maj1", "Mod", "Min2", "Min1", "Typ"),
ordered = TRUE
)
raw_data$KitchenQual <- factor(
raw_data$KitchenQual,
levels = c("Po", "Fa", "TA", "Gd", "Ex"),
ordered = TRUE
)
raw_data$HeatingQC <- factor(
raw_data$HeatingQC,
levels = c("Po", "Fa", "TA", "Gd", "Ex"),
ordered = TRUE
)
raw_data$BsmtFinType2 <- factor(
raw_data$BsmtFinType2,
levels = c("NoBasement", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"),
ordered = TRUE
)
raw_data$BsmtFinType1 <- factor(
raw_data$BsmtFinType1,
levels = c("NoBasement", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"),
ordered = TRUE
)
raw_data$BsmtExposure <- factor(
raw_data$BsmtExposure,
levels = c("NoBasement", "No", "Mn", "Av", "Gd"),
ordered = TRUE
)
raw_data$BsmtCond <- factor(
raw_data$BsmtCond,
levels = c("NoBasement", "Po", "Fa", "TA", "Gd", "Ex"),
ordered = TRUE
)
raw_data$BsmtQual <- factor(
raw_data$BsmtQual,
levels = c("NoBasement", "Po", "Fa", "TA", "Gd", "Ex"),
ordered = TRUE
)
raw_data$ExterCond <- factor(
raw_data$ExterCond,
levels = c("Po", "Fa", "TA", "Gd", "Ex"),
ordered = TRUE
)
raw_data$ExterQual <- factor(
raw_data$ExterQual,
levels = c("Po", "Fa", "TA", "Gd", "Ex"),
ordered = TRUE
)
raw_data$LandSlope <- factor(
raw_data$LandSlope,
levels = c("Sev", "Mod", "Gtl"),
ordered = TRUE
)
# Update nominal features
# Note: the 1-10 ratings (OverallQual, OverallCond) are treated as ordered numeric and left unchanged
nominal_features <- c("SaleCondition", "SaleType", "MiscFeature", "PavedDrive", "GarageType", "CentralAir", "Heating", "Foundation", "Exterior2nd", "Exterior1st", "RoofMatl", "RoofStyle", "HouseStyle", "BldgType", "Condition2", "Condition1", "Neighborhood", "LotConfig", "Utilities", "LandContour", "LotShape", "Alley", "Street", "MSZoning")
raw_data <- update_columns(raw_data, nominal_features, as.factor)
# Visualize Missing Data
plot_missing(raw_data)
# Check how missingness relates to other variables
gg_miss_upset(raw_data, nsets = 10)
# Does a missing LotFrontage indicate anything?
# Focus on other predictor variable values when this value is NA
gg_miss_fct(x = raw_data, fct = Condition1)
gg_miss_fct(x = raw_data, fct = Alley)
gg_miss_fct(x = raw_data, fct = Neighborhood)
gg_miss_fct(x = raw_data, fct = LotConfig)
gg_miss_fct(x = raw_data, fct = BldgType)
gg_miss_fct(x = raw_data, fct = Street)
# Replace NA with 0 for GarageYrBlt and use the existing Garage missing indicator categorical variables
raw_data$GarageYrBlt = ifelse(is.na(raw_data$GarageYrBlt), 0, raw_data$GarageYrBlt)
# Keep year as numeric and convert month to factor
# raw_data$YrSold <- as.factor(raw_data$YrSold)
raw_data$MoSold <- as.factor(raw_data$MoSold)
# raw_data$GarageYrBlt <- as.factor(raw_data$GarageYrBlt)
# raw_data$YearBuilt <- as.factor(raw_data$YearBuilt)
# raw_data$YearRemodAdd <- as.factor(raw_data$YearRemodAdd)
SalePrice is right skewed and has a wide spread (one common remedy, a log transform, is sketched after these observations).
Most numerical predictors are at least slightly right skewed except for OverallCond, YearBuilt, YearRemodAdd, and GarageYrBlt.
Most numerical predictors are unimodal except for BsmtFinSF1 (bimodal), BsmtUnfSF (multimodal), MSSubClass (multimodal), TotalBsmtSF (bimodal), X2ndFlrSF (bimodal), YearBuilt (multimodal), YearRemodAdd (multimodal), GarageArea (multimodal), OpenPorchSF (multimodal), WoodDeckSF (bimodal), and YrSold (bimodal).
As we saw earlier, predictors have a wide range of spread. For example, GarageArea has a very large spread compared to FullBath which has a very small spread.
For categorical predictors, some things to note:
Some features have very low counts for certain categories, such as Condition1 and Condition2 (PosA barely has any observations), RoofMatl, and RoofStyle. Depending on the model, some of these rare values could be grouped into an “Other” category (one way to do this is sketched after the plots below).
The QQ plots and the box plots confirm the skew seen in the histograms (many extreme values). Many predictor variables have outliers, shown by the red dots. However, based on the feature meanings and the provided information, there is no reason to believe that any of these extreme values are mistakes or data errors. As such, we will not remove them, as they could be predictive of the target.
For lower sale prices there is a wider spread and more outliers, but this makes sense since that is where the bulk of the data lies. January and July have the widest spread in SalePrice. Housing prices increase as the year built (and remodel year) becomes more recent. Generally, as square footage increases, the price increases as well.
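As referenced above, here is a minimal sketch (my own illustration, not a step the competition requires at this point) of how the right skew in SalePrice could be reduced with a log transform:
# Hedged sketch: compare the raw and log-transformed target distributions
log_sale_price <- log1p(raw_data$SalePrice)  # log(1 + x); sale prices are all positive
par(mfrow = c(1, 2))
hist(raw_data$SalePrice, main = "SalePrice", xlab = "Sale Price", col = "lightblue", breaks = 50)
hist(log_sale_price, main = "log1p(SalePrice)", xlab = "log(1 + Sale Price)", col = "lightblue", breaks = 50)
par(mfrow = c(1, 1))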
# Plot histograms
plot_histogram(raw_data)
# Categorical Variables
data_raw_categorical <- raw_data |>
  dplyr::select(where(is.factor) | where(is.character))
categorical_predictors <- names(data_raw_categorical)  # SalePrice is numeric, so there is no categorical target to exclude
plot_bar(data_raw_categorical)
# Closer look at target distribution (SalePrice)
hist(as.numeric(raw_data$SalePrice), main = "SalePrice Distribution",
     xlab = "Sale Price", col = "lightblue", breaks = 100)
# QQ Plot
plot_qq(raw_data, sampled_rows = 1000L)
## Warning: Removed 181 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Warning: Removed 181 rows containing non-finite outside the scale range
## (`stat_qq_line()`).
# Box plots - visualize outliers
ggplot(stack(data_raw_numeric), aes(x = ind, y = values)) +
  geom_boxplot(color = 'skyblue', outlier.color = 'red') +
  coord_cartesian(ylim = c(0, 10000)) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = 'grey96')) +
  labs(title = "Boxplots of Predictor Variables", x = "Predictors")
## Warning: Removed 348 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Box plots by TARGET
plot_boxplot(raw_data, by = "SalePrice", ggtheme = theme_light())
## Warning: Removed 267 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Scatter plots by TARGET
plot_scatterplot(raw_data, by = "SalePrice", sampled_rows = 1000L)
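Following up on the note above about sparsely populated categories, here is a minimal sketch (an option, not a committed preprocessing step) of grouping rare factor levels into an “Other” category with forcats, which ships with the tidyverse loaded earlier; the minimum count of 10 is an arbitrary illustrative threshold:
# Hedged sketch: collapse rare levels of sparse categorical predictors
table(forcats::fct_lump_min(raw_data$Condition2, min = 10, other_level = "Other"))
table(forcats::fct_lump_min(raw_data$RoofMatl, min = 10, other_level = "Other"))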
Top positive correlations with SalePrice (full list printed below): OverallQual (0.79), GrLivArea (0.71), GarageCars (0.64), GarageArea (0.62), TotalBsmtSF (0.61), and X1stFlrSF (0.61).
Top negative correlations with SalePrice: KitchenAbvGr (-0.14), EnclosedPorch (-0.13), and OverallCond (-0.08).
There are many more features that are strongly positively correlated with SalePrice than negatively correlated.
Top correlations among predictors: GarageCars & GarageArea (0.88), GrLivArea & TotRmsAbvGrd (0.83), and TotalBsmtSF & X1stFlrSF (0.82).
Many more can be seen below.
Initial business insights:
The correlation funnel shows which predictor values correspond to a high SalePrice (213,497.50 and above).
Feature engineering can be done to remove multicollinearity and slim down the dataset.
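A minimal sketch of how that could be approached (an assumption, not necessarily the feature engineering this project ends up doing) using caret, which is already loaded; the 0.8 cutoff is arbitrary:
# Hedged sketch: flag numeric predictors with pairwise correlation above the cutoff
# as candidates to drop for multicollinearity
num_predictors <- raw_data %>% dplyr::select(where(is.numeric), -SalePrice)
pred_cor <- cor(num_predictors, use = "pairwise.complete.obs")
caret::findCorrelation(pred_cor, cutoff = 0.8, names = TRUE)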
Two-way Cross-Tabulations - Categorical Variables:
Based on their p-values, these are the categorical variables that are not statistically significant (the rest are - see the dataframe below):
| Variable | p-value |
|---|---|
| Utilities | 1.0000000 |
| Condition1 | 1.0000000 |
| RoofStyle | 1.0000000 |
| RoofMatl | 1.0000000 |
| MiscFeature | 1.0000000 |
| Fence | 1.0000000 |
| GarageCond | 1.0000000 |
| BldgType | 0.9999861 |
| Exterior1st | 0.9999839 |
| HeatingQC | 0.9995947 |
| MSSubClass | 0.9928572 |
| PavedDrive | 0.9912800 |
| Alley | 0.9637568 |
| BsmtFinType2 | 0.9480709 |
| Exterior2nd | 0.8469190 |
| HouseStyle | 0.6482615 |
| BsmtFinType1 | 0.2129277 |
| MoSold | 0.1417502 |
| LandSlope | 0.1050864 |
| LandContour | 0.0867465 |
| Condition2 | 0.0759864 |
| GarageType | 0.0647807 |
| Electrical | 0.0556880 |
| GarageQual | 0.0501487 |
Most of these are not too surprising, as they don’t seem like critical house features.
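For reference, a minimal sketch of how p-values like these might be produced (an assumption about the method; the actual cross-tabulations may have been computed differently), by splitting SalePrice at its median and testing each categorical predictor for independence:
# Hedged sketch: chi-square test between each categorical predictor and a
# high/low SalePrice split; large p-values suggest no detectable association
price_bin <- factor(ifelse(raw_data$SalePrice > median(raw_data$SalePrice), "High", "Low"))
cat_cols <- names(raw_data)[sapply(raw_data, is.factor)]
pvals <- sapply(cat_cols, function(col) {
  suppressWarnings(chisq.test(table(raw_data[[col]], price_bin))$p.value)
})
round(sort(pvals, decreasing = TRUE), 4)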
The predictor vs. predictor box plots below also reveal some additional insights into relationships between the predictors.
These predictor vs. predictor box plots also showed some outliers that do not necessarily make sense, such as a 1-story house having second-floor square footage, but possibly there are exceptions in the real estate world. Since I don’t have enough information to know whether these are valid observations, I’m going to keep them in the model. The chosen models should be robust to outliers anyway.
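As a quick illustration of the kind of seemingly inconsistent observation mentioned above (my own check, shown only as a sketch):
# Hedged sketch: count 1-story houses that nonetheless report second-floor square footage
raw_data %>%
  dplyr::filter(HouseStyle == "1Story", X2ndFlrSF > 0) %>%
  dplyr::summarise(n_suspicious = dplyr::n())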
# Correlation matrix with target
numeric_vars <- raw_data %>% select_if(is.numeric)
cor_matrix <- cor(numeric_vars, use="pairwise.complete.obs")
corr_target <- cor_matrix[,"SalePrice"]
corr_target_sorted <- sort(corr_target, decreasing = TRUE)
kable(as.data.frame(corr_target_sorted), col.names = c("Correlation with SalePrice"), digits = 2)
| Feature | Correlation with SalePrice |
|---|---|
| SalePrice | 1.00 |
| OverallQual | 0.79 |
| GrLivArea | 0.71 |
| GarageCars | 0.64 |
| GarageArea | 0.62 |
| TotalBsmtSF | 0.61 |
| X1stFlrSF | 0.61 |
| FullBath | 0.56 |
| TotRmsAbvGrd | 0.53 |
| YearBuilt | 0.52 |
| YearRemodAdd | 0.51 |
| MasVnrArea | 0.48 |
| Fireplaces | 0.47 |
| BsmtFinSF1 | 0.39 |
| LotFrontage | 0.35 |
| WoodDeckSF | 0.32 |
| X2ndFlrSF | 0.32 |
| OpenPorchSF | 0.32 |
| HalfBath | 0.28 |
| LotArea | 0.26 |
| GarageYrBlt | 0.26 |
| BsmtFullBath | 0.23 |
| BsmtUnfSF | 0.21 |
| BedroomAbvGr | 0.17 |
| ScreenPorch | 0.11 |
| PoolArea | 0.09 |
| X3SsnPorch | 0.04 |
| BsmtFinSF2 | -0.01 |
| BsmtHalfBath | -0.02 |
| MiscVal | -0.02 |
| LowQualFinSF | -0.03 |
| YrSold | -0.03 |
| OverallCond | -0.08 |
| EnclosedPorch | -0.13 |
| KitchenAbvGr | -0.14 |
# Correlation heatmap
corrplot::corrplot(cor_matrix, method = "color", type = "upper",
                   tl.col = "black", tl.srt = 45, addCoef.col = "black",
                   number.cex = 0.7, diag = FALSE)
# Correlation top 5 positive and negative
corr_target <- cor_matrix[, "SalePrice"]
corr_target <- corr_target[names(corr_target) != "SalePrice"] # Remove self-correlation
top_pos <- sort(corr_target, decreasing = TRUE)[1:3]
top_neg <- sort(corr_target, decreasing = FALSE)[1:3]
combined_df <- data.frame(
  Feature = c(names(top_pos), names(top_neg)),
  Correlation = c(top_pos, top_neg)) %>%
  mutate(Direction = ifelse(Correlation > 0, "Positive", "Negative"))
ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
  labs(title = "Top Features Correlated with SalePrice",
       x = "Feature",
       y = "Correlation with SalePrice",
       fill = "Direction") +
  theme_minimal()
# Correlation among predictors
melt_cor <- melt(cor_matrix)
melt_cor <- melt_cor |>
  filter(Var1 != "SalePrice") |>
  filter(Var2 != "SalePrice")
filtered_cor <- melt_cor[melt_cor$Var1 != melt_cor$Var2 & as.numeric(melt_cor$Var1) < as.numeric(melt_cor$Var2), ]
sorted_cor <- filtered_cor[order(abs(filtered_cor$value), decreasing = TRUE), ]
kable(as.data.frame(sorted_cor), digits = 2)
|  | Var1 | Var2 | value |
|---|---|---|---|
| 875 | GarageCars | GarageArea | 0.88 |
| 729 | GrLivArea | TotRmsAbvGrd | 0.83 |
| 385 | TotalBsmtSF | X1stFlrSF | 0.82 |
| 489 | X2ndFlrSF | GrLivArea | 0.69 |
| 734 | BedroomAbvGr | TotRmsAbvGrd | 0.68 |
| 518 | BsmtFinSF1 | BsmtFullBath | 0.65 |
| 593 | GrLivArea | FullBath | 0.63 |
| 727 | X2ndFlrSF | TotRmsAbvGrd | 0.62 |
| 625 | X2ndFlrSF | HalfBath | 0.61 |
| 819 | OverallQual | GarageCars | 0.60 |
| 840 | GarageYrBlt | GarageCars | 0.60 |
| 479 | OverallQual | GrLivArea | 0.59 |
| 175 | YearBuilt | YearRemodAdd | 0.59 |
| 139 | OverallQual | YearBuilt | 0.57 |
| 488 | X1stFlrSF | GrLivArea | 0.57 |
| 853 | OverallQual | GarageArea | 0.56 |
| 874 | GarageYrBlt | GarageArea | 0.56 |
| 732 | FullBath | TotRmsAbvGrd | 0.55 |
| 173 | OverallQual | YearRemodAdd | 0.55 |
| 581 | OverallQual | FullBath | 0.55 |
| 821 | YearBuilt | GarageCars | 0.54 |
| 343 | OverallQual | TotalBsmtSF | 0.54 |
| 348 | BsmtFinSF1 | TotalBsmtSF | 0.52 |
| 661 | GrLivArea | BedroomAbvGr | 0.52 |
| 659 | X2ndFlrSF | BedroomAbvGr | 0.50 |
| 314 | BsmtFinSF1 | BsmtUnfSF | -0.50 |
| 862 | X1stFlrSF | GarageArea | 0.49 |
| 861 | TotalBsmtSF | GarageArea | 0.49 |
| 855 | YearBuilt | GarageArea | 0.48 |
| 377 | OverallQual | X1stFlrSF | 0.48 |
| 834 | FullBath | GarageCars | 0.47 |
| 865 | GrLivArea | GarageArea | 0.47 |
| 583 | YearBuilt | FullBath | 0.47 |
| 831 | GrLivArea | GarageCars | 0.47 |
| 763 | GrLivArea | Fireplaces | 0.46 |
| 375 | LotFrontage | X1stFlrSF | 0.46 |
| 487 | TotalBsmtSF | GrLivArea | 0.45 |
| 382 | BsmtFinSF1 | X1stFlrSF | 0.45 |
| 828 | X1stFlrSF | GarageCars | 0.44 |
| 584 | YearRemodAdd | FullBath | 0.44 |
| 827 | TotalBsmtSF | GarageCars | 0.43 |
| 717 | OverallQual | TotRmsAbvGrd | 0.43 |
| 35 | LotFrontage | LotArea | 0.43 |
| 520 | BsmtUnfSF | BsmtFullBath | -0.42 |
| 591 | X2ndFlrSF | FullBath | 0.42 |
| 822 | YearRemodAdd | GarageCars | 0.42 |
| 627 | GrLivArea | HalfBath | 0.42 |
| 350 | BsmtUnfSF | TotalBsmtSF | 0.42 |
| 207 | OverallQual | MasVnrArea | 0.41 |
| 760 | X1stFlrSF | Fireplaces | 0.41 |
| 726 | X1stFlrSF | TotRmsAbvGrd | 0.41 |
| 868 | FullBath | GarageArea | 0.41 |
| 477 | LotFrontage | GrLivArea | 0.40 |
| 751 | OverallQual | Fireplaces | 0.40 |
| 341 | LotFrontage | TotalBsmtSF | 0.39 |
| 345 | YearBuilt | TotalBsmtSF | 0.39 |
| 483 | MasVnrArea | GrLivArea | 0.39 |
| 957 | YearBuilt | EnclosedPorch | -0.39 |
| 590 | X1stFlrSF | FullBath | 0.38 |
| 140 | OverallCond | YearBuilt | -0.38 |
| 857 | MasVnrArea | GarageArea | 0.37 |
| 856 | YearRemodAdd | GarageArea | 0.37 |
| 823 | MasVnrArea | GarageCars | 0.36 |
| 347 | MasVnrArea | TotalBsmtSF | 0.36 |
| 664 | FullBath | BedroomAbvGr | 0.36 |
| 838 | TotRmsAbvGrd | GarageCars | 0.36 |
| 715 | LotFrontage | TotRmsAbvGrd | 0.35 |
| 851 | LotFrontage | GarageArea | 0.34 |
| 381 | MasVnrArea | X1stFlrSF | 0.34 |
| 733 | HalfBath | TotRmsAbvGrd | 0.34 |
| 759 | TotalBsmtSF | Fireplaces | 0.34 |
| 872 | TotRmsAbvGrd | GarageArea | 0.34 |
| 933 | GrLivArea | OpenPorchSF | 0.33 |
| 770 | TotRmsAbvGrd | Fireplaces | 0.33 |
| 589 | TotalBsmtSF | FullBath | 0.32 |
| 384 | BsmtUnfSF | X1stFlrSF | 0.32 |
| 209 | YearBuilt | MasVnrArea | 0.32 |
| 921 | OverallQual | OpenPorchSF | 0.31 |
| 309 | OverallQual | BsmtUnfSF | 0.31 |
| 521 | TotalBsmtSF | BsmtFullBath | 0.31 |
| 839 | Fireplaces | GarageCars | 0.30 |
| 376 | LotArea | X1stFlrSF | 0.30 |
| 858 | BsmtFinSF1 | GarageArea | 0.30 |
| 411 | OverallQual | X2ndFlrSF | 0.30 |
| 346 | YearRemodAdd | TotalBsmtSF | 0.29 |
| 785 | OverallQual | GarageYrBlt | 0.29 |
| 588 | BsmtUnfSF | FullBath | 0.29 |
| 482 | YearRemodAdd | GrLivArea | 0.29 |
| 817 | LotFrontage | GarageCars | 0.29 |
| 725 | TotalBsmtSF | TotRmsAbvGrd | 0.29 |
| 379 | YearBuilt | X1stFlrSF | 0.28 |
| 721 | MasVnrArea | TotRmsAbvGrd | 0.28 |
| 585 | MasVnrArea | FullBath | 0.28 |
| 615 | OverallQual | HalfBath | 0.27 |
| 787 | YearBuilt | GarageYrBlt | 0.27 |
| 750 | LotArea | Fireplaces | 0.27 |
| 873 | Fireplaces | GarageArea | 0.27 |
| 749 | LotFrontage | Fireplaces | 0.27 |
| 245 | MasVnrArea | BsmtFinSF1 | 0.26 |
| 647 | LotFrontage | BedroomAbvGr | 0.26 |
| 478 | LotArea | GrLivArea | 0.26 |
| 342 | LotArea | TotalBsmtSF | 0.26 |
| 756 | BsmtFinSF1 | Fireplaces | 0.26 |
| 936 | FullBath | OpenPorchSF | 0.26 |
| 735 | KitchenAbvGr | TotRmsAbvGrd | 0.26 |
| 69 | LotFrontage | OverallQual | 0.25 |
| 724 | BsmtUnfSF | TotRmsAbvGrd | 0.25 |
| 243 | YearBuilt | BsmtFinSF1 | 0.25 |
| 755 | MasVnrArea | Fireplaces | 0.25 |
| 899 | GrLivArea | WoodDeckSF | 0.25 |
| 929 | TotalBsmtSF | OpenPorchSF | 0.25 |
| 522 | X1stFlrSF | BsmtFullBath | 0.24 |
| 766 | FullBath | Fireplaces | 0.24 |
| 617 | YearBuilt | HalfBath | 0.24 |
| 944 | GarageArea | OpenPorchSF | 0.24 |
| 380 | YearRemodAdd | X1stFlrSF | 0.24 |
| 486 | BsmtUnfSF | GrLivArea | 0.24 |
| 241 | OverallQual | BsmtFinSF1 | 0.24 |
| 887 | OverallQual | WoodDeckSF | 0.24 |
| 896 | X1stFlrSF | WoodDeckSF | 0.24 |
| 940 | TotRmsAbvGrd | OpenPorchSF | 0.23 |
| 239 | LotFrontage | BsmtFinSF1 | 0.23 |
| 895 | TotalBsmtSF | WoodDeckSF | 0.23 |
| 665 | HalfBath | BedroomAbvGr | 0.23 |
| 909 | GarageCars | WoodDeckSF | 0.23 |
| 924 | YearRemodAdd | OpenPorchSF | 0.23 |
| 889 | YearBuilt | WoodDeckSF | 0.22 |
| 910 | GarageArea | WoodDeckSF | 0.22 |
| 824 | BsmtFinSF1 | GarageCars | 0.22 |
| 835 | HalfBath | GarageCars | 0.22 |
| 826 | BsmtUnfSF | GarageCars | 0.21 |
| 240 | LotArea | BsmtFinSF1 | 0.21 |
| 943 | GarageCars | OpenPorchSF | 0.21 |
| 930 | X1stFlrSF | OpenPorchSF | 0.21 |
| 315 | BsmtFinSF2 | BsmtUnfSF | -0.21 |
| 484 | BsmtFinSF1 | GrLivArea | 0.21 |
| 931 | X2ndFlrSF | OpenPorchSF | 0.21 |
| 1055 | LotFrontage | PoolArea | 0.21 |
| 890 | YearRemodAdd | WoodDeckSF | 0.21 |
| 892 | BsmtFinSF1 | WoodDeckSF | 0.20 |
| 767 | HalfBath | Fireplaces | 0.20 |
| 420 | X1stFlrSF | X2ndFlrSF | -0.20 |
| 619 | MasVnrArea | HalfBath | 0.20 |
| 907 | Fireplaces | WoodDeckSF | 0.20 |
| 937 | HalfBath | OpenPorchSF | 0.20 |
| 481 | YearBuilt | GrLivArea | 0.20 |
| 579 | LotFrontage | FullBath | 0.20 |
| 700 | BedroomAbvGr | KitchenAbvGr | 0.20 |
| 761 | X2ndFlrSF | Fireplaces | 0.19 |
| 582 | OverallCond | FullBath | -0.19 |
| 958 | YearRemodAdd | EnclosedPorch | -0.19 |
| 205 | LotFrontage | MasVnrArea | 0.19 |
| 720 | YearRemodAdd | TotRmsAbvGrd | 0.19 |
| 716 | LotArea | TotRmsAbvGrd | 0.19 |
| 923 | YearBuilt | OpenPorchSF | 0.19 |
| 902 | FullBath | WoodDeckSF | 0.19 |
| 515 | YearBuilt | BsmtFullBath | 0.19 |
| 805 | Fireplaces | GarageYrBlt | 0.19 |
| 820 | OverallCond | GarageCars | -0.19 |
| 1043 | Fireplaces | ScreenPorch | 0.18 |
| 829 | X2ndFlrSF | GarageCars | 0.18 |
| 683 | OverallQual | KitchenAbvGr | -0.18 |
| 447 | YearBuilt | LowQualFinSF | -0.18 |
| 618 | YearRemodAdd | HalfBath | 0.18 |
| 860 | BsmtUnfSF | GarageArea | 0.18 |
| 312 | YearRemodAdd | BsmtUnfSF | 0.18 |
| 852 | LotArea | GarageArea | 0.18 |
| 210 | YearRemodAdd | MasVnrArea | 0.18 |
| 866 | BsmtFullBath | GarageArea | 0.18 |
| 793 | TotalBsmtSF | GarageYrBlt | 0.18 |
| 900 | BsmtFullBath | WoodDeckSF | 0.18 |
| 685 | YearBuilt | KitchenAbvGr | -0.17 |
| 415 | MasVnrArea | X2ndFlrSF | 0.17 |
| 419 | TotalBsmtSF | X2ndFlrSF | -0.17 |
| 886 | LotArea | WoodDeckSF | 0.17 |
| 344 | OverallCond | TotalBsmtSF | -0.17 |
| 1069 | GrLivArea | PoolArea | 0.17 |
| 523 | X2ndFlrSF | BsmtFullBath | -0.17 |
| 941 | Fireplaces | OpenPorchSF | 0.17 |
| 656 | BsmtUnfSF | BedroomAbvGr | 0.17 |
| 794 | X1stFlrSF | GarageYrBlt | 0.17 |
| 906 | TotRmsAbvGrd | WoodDeckSF | 0.17 |
| 869 | HalfBath | GarageArea | 0.16 |
| 797 | GrLivArea | GarageYrBlt | 0.16 |
| 891 | MasVnrArea | WoodDeckSF | 0.16 |
| 519 | BsmtFinSF2 | BsmtFullBath | 0.16 |
| 512 | LotArea | BsmtFullBath | 0.16 |
| 803 | KitchenAbvGr | GarageYrBlt | -0.16 |
| 818 | LotArea | GarageCars | 0.15 |
| 919 | LotFrontage | OpenPorchSF | 0.15 |
| 854 | OverallCond | GarageArea | -0.15 |
| 977 | GarageCars | EnclosedPorch | -0.15 |
| 662 | BsmtFullBath | BedroomAbvGr | -0.15 |
| 686 | YearRemodAdd | KitchenAbvGr | -0.15 |
| 311 | YearBuilt | BsmtUnfSF | 0.15 |
| 560 | BsmtFullBath | BsmtHalfBath | -0.15 |
| 753 | YearBuilt | Fireplaces | 0.15 |
| 796 | LowQualFinSF | GarageYrBlt | -0.15 |
| 788 | YearRemodAdd | GarageYrBlt | 0.15 |
| 378 | OverallCond | X1stFlrSF | -0.14 |
| 1062 | BsmtFinSF1 | PoolArea | 0.14 |
| 414 | YearRemodAdd | X2ndFlrSF | 0.14 |
| 863 | X2ndFlrSF | GarageArea | 0.14 |
| 764 | BsmtFullBath | Fireplaces | 0.14 |
| 800 | FullBath | GarageYrBlt | 0.14 |
| 416 | BsmtFinSF1 | X2ndFlrSF | -0.14 |
| 310 | OverallCond | BsmtUnfSF | -0.14 |
| 630 | FullBath | HalfBath | 0.14 |
| 490 | LowQualFinSF | GrLivArea | 0.13 |
| 789 | MasVnrArea | GarageYrBlt | 0.13 |
| 698 | FullBath | KitchenAbvGr | 0.13 |
| 307 | LotFrontage | BsmtUnfSF | 0.13 |
| 832 | BsmtFullBath | GarageCars | 0.13 |
| 1066 | X1stFlrSF | PoolArea | 0.13 |
| 728 | LowQualFinSF | TotRmsAbvGrd | 0.13 |
| 928 | BsmtUnfSF | OpenPorchSF | 0.13 |
| 244 | YearRemodAdd | BsmtFinSF1 | 0.13 |
| 208 | OverallCond | MasVnrArea | -0.13 |
| 658 | X1stFlrSF | BedroomAbvGr | 0.13 |
| 1065 | TotalBsmtSF | PoolArea | 0.13 |
| 580 | LotArea | FullBath | 0.13 |
| 979 | WoodDeckSF | EnclosedPorch | -0.13 |
| 925 | MasVnrArea | OpenPorchSF | 0.13 |
| 769 | KitchenAbvGr | Fireplaces | -0.12 |
| 137 | LotFrontage | YearBuilt | 0.12 |
| 978 | GarageArea | EnclosedPorch | -0.12 |
| 624 | X1stFlrSF | HalfBath | -0.12 |
| 648 | LotArea | BedroomAbvGr | 0.12 |
| 516 | YearRemodAdd | BsmtFullBath | 0.12 |
| 548 | OverallCond | BsmtHalfBath | 0.12 |
| 908 | GarageYrBlt | WoodDeckSF | 0.12 |
| 801 | HalfBath | GarageYrBlt | 0.12 |
| 790 | BsmtFinSF1 | GarageYrBlt | 0.12 |
| 970 | FullBath | EnclosedPorch | -0.12 |
| 313 | MasVnrArea | BsmtUnfSF | 0.11 |
| 955 | OverallQual | EnclosedPorch | -0.11 |
| 754 | YearRemodAdd | Fireplaces | 0.11 |
| 926 | BsmtFinSF1 | OpenPorchSF | 0.11 |
| 274 | LotArea | BsmtFinSF2 | 0.11 |
| 513 | OverallQual | BsmtFullBath | 0.11 |
| 959 | MasVnrArea | EnclosedPorch | -0.11 |
| 903 | HalfBath | WoodDeckSF | 0.11 |
| 768 | BedroomAbvGr | Fireplaces | 0.11 |
| 654 | BsmtFinSF1 | BedroomAbvGr | -0.11 |
| 70 | LotArea | OverallQual | 0.11 |
| 660 | LowQualFinSF | BedroomAbvGr | 0.11 |
| 783 | LotFrontage | GarageYrBlt | 0.11 |
| 349 | BsmtFinSF2 | TotalBsmtSF | 0.10 |
| 206 | LotArea | MasVnrArea | 0.10 |
| 653 | MasVnrArea | BedroomAbvGr | 0.10 |
| 960 | BsmtFinSF1 | EnclosedPorch | -0.10 |
| 649 | OverallQual | BedroomAbvGr | 0.10 |
| 1035 | GrLivArea | ScreenPorch | 0.10 |
| 511 | LotFrontage | BsmtFullBath | 0.10 |
| 695 | GrLivArea | KitchenAbvGr | 0.10 |
| 417 | BsmtFinSF2 | X2ndFlrSF | -0.10 |
| 383 | BsmtFinSF2 | X1stFlrSF | 0.10 |
| 554 | BsmtUnfSF | BsmtHalfBath | -0.10 |
| 804 | TotRmsAbvGrd | GarageYrBlt | 0.10 |
| 719 | YearBuilt | TotRmsAbvGrd | 0.10 |
| 963 | TotalBsmtSF | EnclosedPorch | -0.10 |
| 971 | HalfBath | EnclosedPorch | -0.10 |
| 1077 | Fireplaces | PoolArea | 0.10 |
| 830 | LowQualFinSF | GarageCars | -0.09 |
| 938 | BedroomAbvGr | OpenPorchSF | 0.09 |
| 980 | OpenPorchSF | EnclosedPorch | -0.09 |
| 897 | X2ndFlrSF | WoodDeckSF | 0.09 |
| 105 | OverallQual | OverallCond | -0.09 |
| 905 | KitchenAbvGr | WoodDeckSF | -0.09 |
| 1029 | BsmtFinSF2 | ScreenPorch | 0.09 |
| 171 | LotFrontage | YearRemodAdd | 0.09 |
| 1032 | X1stFlrSF | ScreenPorch | 0.09 |
| 885 | LotFrontage | WoodDeckSF | 0.09 |
| 684 | OverallCond | KitchenAbvGr | -0.09 |
| 836 | BedroomAbvGr | GarageCars | 0.09 |
| 517 | MasVnrArea | BsmtFullBath | 0.09 |
| 920 | LotArea | OpenPorchSF | 0.08 |
| 1031 | TotalBsmtSF | ScreenPorch | 0.08 |
| 1076 | TotRmsAbvGrd | PoolArea | 0.08 |
| 1049 | EnclosedPorch | ScreenPorch | -0.08 |
| 1067 | X2ndFlrSF | PoolArea | 0.08 |
| 688 | BsmtFinSF1 | KitchenAbvGr | -0.08 |
| 409 | LotFrontage | X2ndFlrSF | 0.08 |
| 480 | OverallCond | GrLivArea | -0.08 |
| 1056 | LotArea | PoolArea | 0.08 |
| 976 | GarageYrBlt | EnclosedPorch | -0.08 |
| 587 | BsmtFinSF2 | FullBath | -0.08 |
| 1048 | OpenPorchSF | ScreenPorch | 0.07 |
| 1047 | WoodDeckSF | ScreenPorch | -0.07 |
| 174 | OverallCond | YearRemodAdd | 0.07 |
| 1081 | WoodDeckSF | PoolArea | 0.07 |
| 784 | LotArea | GarageYrBlt | 0.07 |
| 1039 | HalfBath | ScreenPorch | 0.07 |
| 279 | MasVnrArea | BsmtFinSF2 | -0.07 |
| 553 | BsmtFinSF2 | BsmtHalfBath | 0.07 |
| 1074 | BedroomAbvGr | PoolArea | 0.07 |
| 651 | YearBuilt | BedroomAbvGr | -0.07 |
| 956 | OverallCond | EnclosedPorch | 0.07 |
| 939 | KitchenAbvGr | OpenPorchSF | -0.07 |
| 987 | LotFrontage | X3SsnPorch | 0.07 |
| 449 | MasVnrArea | LowQualFinSF | -0.07 |
| 691 | TotalBsmtSF | KitchenAbvGr | -0.07 |
| 1092 | OverallCond | MiscVal | 0.07 |
| 699 | HalfBath | KitchenAbvGr | -0.07 |
| 692 | X1stFlrSF | KitchenAbvGr | 0.07 |
| 893 | BsmtFinSF2 | WoodDeckSF | 0.07 |
| 278 | YearRemodAdd | BsmtFinSF2 | -0.07 |
| 1070 | BsmtFullBath | PoolArea | 0.07 |
| 864 | LowQualFinSF | GarageArea | -0.07 |
| 552 | BsmtFinSF1 | BsmtHalfBath | 0.07 |
| 934 | BsmtFullBath | OpenPorchSF | 0.07 |
| 1138 | BsmtFullBath | YrSold | 0.07 |
| 964 | X1stFlrSF | EnclosedPorch | -0.07 |
| 870 | BedroomAbvGr | GarageArea | 0.07 |
| 1057 | OverallQual | PoolArea | 0.07 |
| 1023 | OverallQual | ScreenPorch | 0.06 |
| 594 | BsmtFullBath | FullBath | -0.06 |
| 450 | BsmtFinSF1 | LowQualFinSF | -0.06 |
| 871 | KitchenAbvGr | GarageArea | -0.06 |
| 795 | X2ndFlrSF | GarageYrBlt | 0.06 |
| 455 | X2ndFlrSF | LowQualFinSF | 0.06 |
| 448 | YearRemodAdd | LowQualFinSF | -0.06 |
| 1109 | KitchenAbvGr | MiscVal | 0.06 |
| 1068 | LowQualFinSF | PoolArea | 0.06 |
| 1028 | BsmtFinSF1 | ScreenPorch | 0.06 |
| 965 | X2ndFlrSF | EnclosedPorch | 0.06 |
| 1027 | MasVnrArea | ScreenPorch | 0.06 |
| 1044 | GarageYrBlt | ScreenPorch | 0.06 |
| 966 | LowQualFinSF | EnclosedPorch | 0.06 |
| 1080 | GarageArea | PoolArea | 0.06 |
| 616 | OverallCond | HalfBath | -0.06 |
| 1082 | OpenPorchSF | PoolArea | 0.06 |
| 1154 | PoolArea | YrSold | -0.06 |
| 1042 | TotRmsAbvGrd | ScreenPorch | 0.06 |
| 693 | X2ndFlrSF | KitchenAbvGr | 0.06 |
| 103 | LotFrontage | OverallCond | -0.06 |
| 275 | OverallQual | BsmtFinSF2 | -0.06 |
| 945 | WoodDeckSF | OpenPorchSF | 0.06 |
| 586 | BsmtFinSF1 | FullBath | 0.06 |
| 1150 | OpenPorchSF | YrSold | -0.06 |
| 718 | OverallCond | TotRmsAbvGrd | -0.06 |
| 998 | X1stFlrSF | X3SsnPorch | 0.06 |
| 514 | OverallCond | BsmtFullBath | -0.05 |
| 1024 | OverallCond | ScreenPorch | 0.05 |
| 595 | BsmtHalfBath | FullBath | -0.05 |
| 1083 | EnclosedPorch | PoolArea | 0.05 |
| 613 | LotFrontage | HalfBath | 0.05 |
| 730 | BsmtFullBath | TotRmsAbvGrd | -0.05 |
| 1041 | KitchenAbvGr | ScreenPorch | -0.05 |
| 758 | BsmtUnfSF | Fireplaces | 0.05 |
| 1046 | GarageArea | ScreenPorch | 0.05 |
| 1085 | ScreenPorch | PoolArea | 0.05 |
| 410 | LotArea | X2ndFlrSF | 0.05 |
| 837 | KitchenAbvGr | GarageCars | -0.05 |
| 1045 | GarageCars | ScreenPorch | 0.05 |
| 657 | TotalBsmtSF | BedroomAbvGr | 0.05 |
| 1025 | YearBuilt | ScreenPorch | -0.05 |
| 280 | BsmtFinSF1 | BsmtFinSF2 | -0.05 |
| 968 | BsmtFullBath | EnclosedPorch | -0.05 |
| 273 | LotFrontage | BsmtFinSF2 | 0.05 |
| 942 | GarageYrBlt | OpenPorchSF | 0.05 |
| 1072 | FullBath | PoolArea | 0.05 |
| 798 | BsmtFullBath | GarageYrBlt | 0.05 |
| 277 | YearBuilt | BsmtFinSF2 | -0.05 |
| 623 | TotalBsmtSF | HalfBath | -0.05 |
| 546 | LotArea | BsmtHalfBath | 0.05 |
| 524 | LowQualFinSF | BsmtFullBath | -0.05 |
| 757 | BsmtFinSF2 | Fireplaces | 0.05 |
| 904 | BedroomAbvGr | WoodDeckSF | 0.05 |
| 1139 | BsmtHalfBath | YrSold | -0.05 |
| 663 | BsmtHalfBath | BedroomAbvGr | 0.05 |
| 242 | OverallCond | BsmtFinSF1 | -0.05 |
| 992 | YearRemodAdd | X3SsnPorch | 0.05 |
| 722 | BsmtFinSF1 | TotRmsAbvGrd | 0.04 |
| 1040 | BedroomAbvGr | ScreenPorch | 0.04 |
| 1126 | OverallCond | YrSold | 0.04 |
| 1022 | LotArea | ScreenPorch | 0.04 |
| 1113 | GarageCars | MiscVal | -0.04 |
| 792 | BsmtUnfSF | GarageYrBlt | 0.04 |
| 1063 | BsmtFinSF2 | PoolArea | 0.04 |
| 972 | BedroomAbvGr | EnclosedPorch | 0.04 |
| 696 | BsmtFullBath | KitchenAbvGr | -0.04 |
| 1021 | LotFrontage | ScreenPorch | 0.04 |
| 1132 | BsmtUnfSF | YrSold | -0.04 |
| 622 | BsmtUnfSF | HalfBath | -0.04 |
| 689 | BsmtFinSF2 | KitchenAbvGr | -0.04 |
| 1033 | X2ndFlrSF | ScreenPorch | 0.04 |
| 652 | YearRemodAdd | BedroomAbvGr | -0.04 |
| 276 | OverallCond | BsmtFinSF2 | 0.04 |
| 901 | BsmtHalfBath | WoodDeckSF | 0.04 |
| 547 | OverallQual | BsmtHalfBath | -0.04 |
| 1147 | GarageCars | YrSold | -0.04 |
| 1026 | YearRemodAdd | ScreenPorch | -0.04 |
| 443 | LotFrontage | LowQualFinSF | 0.04 |
| 825 | BsmtFinSF2 | GarageCars | -0.04 |
| 549 | YearBuilt | BsmtHalfBath | -0.04 |
| 1090 | LotArea | MiscVal | 0.04 |
| 697 | BsmtHalfBath | KitchenAbvGr | -0.04 |
| 687 | MasVnrArea | KitchenAbvGr | -0.04 |
| 997 | TotalBsmtSF | X3SsnPorch | 0.04 |
| 973 | KitchenAbvGr | EnclosedPorch | 0.04 |
| 1015 | EnclosedPorch | X3SsnPorch | -0.04 |
| 961 | BsmtFinSF2 | EnclosedPorch | 0.04 |
| 1137 | GrLivArea | YrSold | -0.04 |
| 1142 | BedroomAbvGr | YrSold | -0.04 |
| 1011 | GarageCars | X3SsnPorch | 0.04 |
| 1128 | YearRemodAdd | YrSold | 0.04 |
| 1004 | FullBath | X3SsnPorch | 0.04 |
| 723 | BsmtFinSF2 | TotRmsAbvGrd | -0.04 |
| 1003 | BsmtHalfBath | X3SsnPorch | 0.04 |
| 1064 | BsmtUnfSF | PoolArea | -0.04 |
| 1012 | GarageArea | X3SsnPorch | 0.04 |
| 791 | BsmtFinSF2 | GarageYrBlt | 0.04 |
| 525 | GrLivArea | BsmtFullBath | 0.03 |
| 1144 | TotRmsAbvGrd | YrSold | -0.03 |
| 1093 | YearBuilt | MiscVal | -0.03 |
| 453 | TotalBsmtSF | LowQualFinSF | -0.03 |
| 1013 | WoodDeckSF | X3SsnPorch | -0.03 |
| 922 | OverallCond | OpenPorchSF | -0.03 |
| 621 | BsmtFinSF2 | HalfBath | -0.03 |
| 1037 | BsmtHalfBath | ScreenPorch | 0.03 |
| 1119 | ScreenPorch | MiscVal | 0.03 |
| 1131 | BsmtFinSF2 | YrSold | 0.03 |
| 1143 | KitchenAbvGr | YrSold | 0.03 |
| 1050 | X3SsnPorch | ScreenPorch | -0.03 |
| 1091 | OverallQual | MiscVal | -0.03 |
| 991 | YearBuilt | X3SsnPorch | 0.03 |
| 628 | BsmtFullBath | HalfBath | -0.03 |
| 445 | OverallQual | LowQualFinSF | -0.03 |
| 989 | OverallQual | X3SsnPorch | 0.03 |
| 690 | BsmtUnfSF | KitchenAbvGr | 0.03 |
| 995 | BsmtFinSF2 | X3SsnPorch | -0.03 |
| 1095 | MasVnrArea | MiscVal | -0.03 |
| 1120 | PoolArea | MiscVal | 0.03 |
| 1010 | GarageYrBlt | X3SsnPorch | 0.03 |
| 765 | BsmtHalfBath | Fireplaces | 0.03 |
| 412 | OverallCond | X2ndFlrSF | 0.03 |
| 1136 | LowQualFinSF | YrSold | -0.03 |
| 1135 | X2ndFlrSF | YrSold | -0.03 |
| 452 | BsmtUnfSF | LowQualFinSF | 0.03 |
| 1114 | GarageArea | MiscVal | -0.03 |
| 1148 | GarageArea | YrSold | -0.03 |
| 1125 | OverallQual | YrSold | -0.03 |
| 626 | LowQualFinSF | HalfBath | -0.03 |
| 1034 | LowQualFinSF | ScreenPorch | 0.03 |
| 551 | MasVnrArea | BsmtHalfBath | 0.03 |
| 994 | BsmtFinSF1 | X3SsnPorch | 0.03 |
| 990 | OverallCond | X3SsnPorch | 0.03 |
| 446 | OverallCond | LowQualFinSF | 0.03 |
| 898 | LowQualFinSF | WoodDeckSF | -0.03 |
| 935 | BsmtHalfBath | OpenPorchSF | -0.03 |
| 975 | Fireplaces | EnclosedPorch | -0.02 |
| 1110 | TotRmsAbvGrd | MiscVal | 0.02 |
| 1007 | KitchenAbvGr | X3SsnPorch | -0.02 |
| 867 | BsmtHalfBath | GarageArea | -0.02 |
| 1006 | BedroomAbvGr | X3SsnPorch | -0.02 |
| 999 | X2ndFlrSF | X3SsnPorch | -0.02 |
| 1145 | Fireplaces | YrSold | -0.02 |
| 557 | X2ndFlrSF | BsmtHalfBath | -0.02 |
| 1098 | BsmtUnfSF | MiscVal | -0.02 |
| 731 | BsmtHalfBath | TotRmsAbvGrd | -0.02 |
| 752 | OverallCond | Fireplaces | -0.02 |
| 1036 | BsmtFullBath | ScreenPorch | 0.02 |
| 1104 | BsmtFullBath | MiscVal | -0.02 |
| 1073 | HalfBath | PoolArea | 0.02 |
| 1149 | WoodDeckSF | YrSold | 0.02 |
| 762 | LowQualFinSF | Fireplaces | -0.02 |
| 1100 | X1stFlrSF | MiscVal | -0.02 |
| 1079 | GarageCars | PoolArea | 0.02 |
| 833 | BsmtHalfBath | GarageCars | -0.02 |
| 996 | BsmtUnfSF | X3SsnPorch | 0.02 |
| 1001 | GrLivArea | X3SsnPorch | 0.02 |
| 988 | LotArea | X3SsnPorch | 0.02 |
| 1071 | BsmtHalfBath | PoolArea | 0.02 |
| 1140 | FullBath | YrSold | -0.02 |
| 559 | GrLivArea | BsmtHalfBath | -0.02 |
| 993 | MasVnrArea | X3SsnPorch | 0.02 |
| 1152 | X3SsnPorch | YrSold | 0.02 |
| 1116 | OpenPorchSF | MiscVal | -0.02 |
| 1099 | TotalBsmtSF | MiscVal | -0.02 |
| 1117 | EnclosedPorch | MiscVal | 0.02 |
| 954 | LotArea | EnclosedPorch | -0.02 |
| 932 | LowQualFinSF | OpenPorchSF | 0.02 |
| 859 | BsmtFinSF2 | GarageArea | -0.02 |
| 682 | LotArea | KitchenAbvGr | -0.02 |
| 799 | BsmtHalfBath | GarageYrBlt | 0.02 |
| 1101 | X2ndFlrSF | MiscVal | 0.02 |
| 1078 | GarageYrBlt | PoolArea | 0.02 |
| 655 | BsmtFinSF2 | BedroomAbvGr | -0.02 |
| 1133 | TotalBsmtSF | YrSold | -0.01 |
| 451 | BsmtFinSF2 | LowQualFinSF | 0.01 |
| 1075 | KitchenAbvGr | PoolArea | -0.01 |
| 1130 | BsmtFinSF1 | YrSold | 0.01 |
| 1106 | FullBath | MiscVal | -0.01 |
| 1124 | LotArea | YrSold | -0.01 |
| 614 | LotArea | HalfBath | 0.01 |
| 454 | X1stFlrSF | LowQualFinSF | -0.01 |
| 138 | LotArea | YearBuilt | 0.01 |
| 172 | LotArea | YearRemodAdd | 0.01 |
| 1127 | YearBuilt | YrSold | -0.01 |
| 1134 | X1stFlrSF | YrSold | -0.01 |
| 650 | OverallCond | BedroomAbvGr | 0.01 |
| 1030 | BsmtUnfSF | ScreenPorch | -0.01 |
| 629 | BsmtHalfBath | HalfBath | -0.01 |
| 550 | YearRemodAdd | BsmtHalfBath | -0.01 |
| 1061 | MasVnrArea | PoolArea | 0.01 |
| 1009 | Fireplaces | X3SsnPorch | 0.01 |
| 1146 | GarageYrBlt | YrSold | -0.01 |
| 953 | LotFrontage | EnclosedPorch | 0.01 |
| 1153 | ScreenPorch | YrSold | 0.01 |
| 413 | YearBuilt | X2ndFlrSF | 0.01 |
| 1094 | YearRemodAdd | MiscVal | -0.01 |
| 1141 | HalfBath | YrSold | -0.01 |
| 1151 | EnclosedPorch | YrSold | -0.01 |
| 802 | BedroomAbvGr | GarageYrBlt | -0.01 |
| 485 | BsmtFinSF2 | GrLivArea | -0.01 |
| 1115 | WoodDeckSF | MiscVal | -0.01 |
| 967 | GrLivArea | EnclosedPorch | 0.01 |
| 969 | BsmtHalfBath | EnclosedPorch | -0.01 |
| 1129 | MasVnrArea | YrSold | -0.01 |
| 1038 | FullBath | ScreenPorch | -0.01 |
| 1084 | X3SsnPorch | PoolArea | -0.01 |
| 1108 | BedroomAbvGr | MiscVal | 0.01 |
| 694 | LowQualFinSF | KitchenAbvGr | 0.01 |
| 1123 | LotFrontage | YrSold | 0.01 |
| 1105 | BsmtHalfBath | MiscVal | -0.01 |
| 545 | LotFrontage | BsmtHalfBath | -0.01 |
| 1008 | TotRmsAbvGrd | X3SsnPorch | -0.01 |
| 1112 | GarageYrBlt | MiscVal | -0.01 |
| 786 | OverallCond | GarageYrBlt | -0.01 |
| 681 | LotFrontage | KitchenAbvGr | -0.01 |
| 1014 | OpenPorchSF | X3SsnPorch | -0.01 |
| 558 | LowQualFinSF | BsmtHalfBath | -0.01 |
| 1060 | YearRemodAdd | PoolArea | 0.01 |
| 104 | LotArea | OverallCond | -0.01 |
| 894 | BsmtUnfSF | WoodDeckSF | -0.01 |
| 1005 | HalfBath | X3SsnPorch | 0.00 |
| 1059 | YearBuilt | PoolArea | 0.00 |
| 1097 | BsmtFinSF2 | MiscVal | 0.00 |
| 1155 | MiscVal | YrSold | 0.00 |
| 444 | LotArea | LowQualFinSF | 0.00 |
| 418 | BsmtUnfSF | X2ndFlrSF | 0.00 |
| 1000 | LowQualFinSF | X3SsnPorch | 0.00 |
| 620 | BsmtFinSF1 | HalfBath | 0.00 |
| 974 | TotRmsAbvGrd | EnclosedPorch | 0.00 |
| 1102 | LowQualFinSF | MiscVal | 0.00 |
| 1096 | BsmtFinSF1 | MiscVal | 0.00 |
| 1089 | LotFrontage | MiscVal | 0.00 |
| 888 | OverallCond | WoodDeckSF | 0.00 |
| 927 | BsmtFinSF2 | OpenPorchSF | 0.00 |
| 308 | LotArea | BsmtUnfSF | 0.00 |
| 962 | BsmtUnfSF | EnclosedPorch | 0.00 |
| 1103 | GrLivArea | MiscVal | 0.00 |
| 1058 | OverallCond | PoolArea | 0.00 |
| 556 | X1stFlrSF | BsmtHalfBath | 0.00 |
| 1111 | Fireplaces | MiscVal | 0.00 |
| 1107 | HalfBath | MiscVal | 0.00 |
| 592 | LowQualFinSF | FullBath | 0.00 |
| 1118 | X3SsnPorch | MiscVal | 0.00 |
| 555 | TotalBsmtSF | BsmtHalfBath | 0.00 |
| 1002 | BsmtFullBath | X3SsnPorch | 0.00 |
# Correlation Funnel
set.seed(123)
raw_data_v2 <- raw_data
raw_data_v2 <- na.omit(raw_data_v2)
raw_data_binarized_tbl <- raw_data_v2 %>%
binarize(n_bins = 4, thresh_infreq = 0.01)
raw_data_correlated_tbl <- raw_data_binarized_tbl %>%
correlationfunnel::correlate(target = SalePrice__213497.5_Inf)
raw_data_correlated_tbl %>%
correlationfunnel::plot_correlation_funnel(interactive = FALSE)
## Warning: ggrepel: 93 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Two-way Cross-Tabulations - Categorical Variables:
# Comparing categorical features with the target variable - Chi-square
# got this piece of code from the internet
# Loop through categorical variables, and apply chi square test for significance
results <- map_dfr(categorical_predictors, function(var) {
tibble(
variable = var,
chisq_p_value = chisq.test(table(raw_data[[var]], raw_data$SalePrice))$p.value
)
})
results %>%
arrange(desc(chisq_p_value)) %>%
kbl() %>%
kable_styling(full_width = TRUE)
| variable | chisq_p_value |
|---|---|
| Utilities | 1.0000000 |
| Condition1 | 1.0000000 |
| RoofStyle | 1.0000000 |
| RoofMatl | 1.0000000 |
| MiscFeature | 1.0000000 |
| Fence | 1.0000000 |
| GarageCond | 1.0000000 |
| BldgType | 0.9999861 |
| Exterior1st | 0.9999839 |
| HeatingQC | 0.9995947 |
| MSSubClass | 0.9928572 |
| PavedDrive | 0.9912800 |
| Alley | 0.9637568 |
| BsmtFinType2 | 0.9480709 |
| Exterior2nd | 0.8469190 |
| HouseStyle | 0.6482615 |
| BsmtFinType1 | 0.2129277 |
| MoSold | 0.1417502 |
| LandSlope | 0.1050864 |
| LandContour | 0.0867465 |
| Condition2 | 0.0759864 |
| GarageType | 0.0647807 |
| Electrical | 0.0556880 |
| GarageQual | 0.0501487 |
| LotConfig | 0.0458062 |
| CentralAir | 0.0000123 |
| Foundation | 0.0000097 |
| FireplaceQu | 0.0000000 |
| Neighborhood | 0.0000000 |
| Street | 0.0000000 |
| GarageFinish | 0.0000000 |
| MSZoning | 0.0000000 |
| LotShape | 0.0000000 |
| ExterCond | 0.0000000 |
| SaleCondition | 0.0000000 |
| SaleType | 0.0000000 |
| BsmtExposure | 0.0000000 |
| Heating | 0.0000000 |
| ExterQual | NaN |
| BsmtQual | NaN |
| BsmtCond | NaN |
| KitchenQual | NaN |
| Functional | NaN |
| PoolQC | NaN |
| MasVnrType | NaN |
# too hard to read
# Numerical relationships
ggpairs(data_raw_numeric, progress = FALSE)
# Boxplots broken out by feature
# Pick some features that might seem interesting, otherwise way too many graphs
boxplot_predictors <- c("HouseStyle", "ExterCond", "GarageFinish", "Neighborhood", "KitchenQual")
for (col_name in boxplot_predictors) {
plot_boxplot(raw_data, by = col_name, geom_boxplot_args = list("outlier.color" = "red"))
}
### 4 Data Preparation
Random forest doesn’t require missing values to be treated, but neural networks do. Let’s use a missing indicator plus median imputation for LotFrontage, which the missingness plot flags as an “orange/bad” predictor with a sizeable share of missing values. For MasVnrArea, median imputation will be used, and for Electrical and MasVnrType the mode will be used.
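The modes are hard-coded below from the earlier distribution plots; a small hypothetical helper such as get_mode() (not part of the pipeline) could compute them programmatically instead:
# Hypothetical helper: returns the most frequent non-NA value of a vector
get_mode <- function(x) {
x <- x[!is.na(x)]
names(sort(table(x), decreasing = TRUE))[1]
}
# These should match the hard-coded values used below
# get_mode(raw_data$Electrical)  # "SBrkr"
# get_mode(raw_data$MasVnrType)  # "None"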
raw_data_transform <- raw_data
# Add missing indicator for LotFrontage
# Create new missing indicator variable
raw_data_transform <- raw_data_transform %>%
mutate(is_LotFrontage_missing = is.na(LotFrontage))
# impute LotFrontage missing values using the median
raw_data_transform <- raw_data_transform %>%
mutate(LotFrontage_imputed = coalesce(LotFrontage, median(LotFrontage, na.rm = TRUE)))
raw_data_transform$is_LotFrontage_missing <- as.factor(raw_data_transform$is_LotFrontage_missing)
# Remove original feature
raw_data_transform <- raw_data_transform |>
dplyr::select(-LotFrontage)
# Fill in NA with mode for categorical and median for numeric
raw_data_transform <- raw_data_transform %>%
mutate_at(vars(c("MasVnrArea")), ~ifelse(is.na(.), median(., na.rm = TRUE), .)) # the median here is 0, which is consistent with MasVnrType's mode being "None"
# Got these from distributions
Electrical_mode <- "SBrkr"
MasVnrType_mode <- "None"
raw_data_transform$Electrical = ifelse(is.na(raw_data_transform$Electrical), Electrical_mode, raw_data_transform$Electrical)
raw_data_transform$MasVnrType = ifelse(is.na(raw_data_transform$MasVnrType), MasVnrType_mode, raw_data_transform$MasVnrType)
# convert to factor
raw_data_transform$Electrical <- as.factor(raw_data_transform$Electrical)
raw_data_transform$MasVnrType <- as.factor(raw_data_transform$MasVnrType)
# Check for missing values again
plot_missing(raw_data_transform)
# Plot histograms
plot_histogram(raw_data_transform)
# Categorical Variables
plot_bar(raw_data_transform)
# Correlations
plot_correlation(raw_data_transform)
## 1 features with more than 20 categories ignored!
## Neighborhood: 25 categories
Since random forest and neural networks can both handle skewed predictors (although neural networks still need their inputs scaled, which is done below), let’s not apply transformations such as log or square root.
Let’s do some feature engineering and combine related features into single ones to simplify the dataset (not strictly necessary for either algorithm, but limiting the number of features is useful in some real-world scenarios).
OverallTotalRating, TotalBaths, and TotalSF were created.
# Combine features by adding
# Total Overall Rating - Combine OverallQual and OverallCond by addition
raw_data_transform$OverallTotalRating <- raw_data_transform$OverallQual + raw_data_transform$OverallCond
# Total Baths - Full and Half
raw_data_transform$TotalBaths <- raw_data_transform$BsmtFullBath + raw_data_transform$BsmtHalfBath + raw_data_transform$FullBath + raw_data_transform$HalfBath
# Total SF
raw_data_transform$TotalSF <- raw_data_transform$TotalBsmtSF + raw_data_transform$X1stFlrSF + raw_data_transform$X2ndFlrSF
Random forest can handle multicollinearity (and neural networks to some extent), but let’s remove some features to simplify the dataset. Multicollinearity also makes random forest’s feature importances less informative, since we can’t tell which of the correlated features is truly driving the prediction.
For GarageCars/GarageArea, let’s remove GarageCars. For TotalBsmtSF/X1stFlrSF, let’s remove TotalBsmtSF. For X2ndFlrSF/GrLivArea, let’s remove GrLivArea. For BedroomAbvGr/TotRmsAbvGrd, let’s remove BedroomAbvGr. For BsmtFinSF1/BsmtFullBath, let’s remove BsmtFullBath. For X2ndFlrSF/TotRmsAbvGrd, let’s remove TotRmsAbvGrd. For X2ndFlrSF/HalfBath, let’s remove X2ndFlrSF.
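As a cross-check on the pairs chosen above, caret’s findCorrelation() can flag highly correlated numeric columns automatically. A minimal sketch, assuming data_raw_numeric is the numeric-only frame used in the EDA above and using an |r| > 0.8 cutoff:
library(caret)
# pairwise-complete correlations tolerate any remaining NAs in the numeric frame
num_cor <- cor(data_raw_numeric, use = "pairwise.complete.obs")
# columns caret suggests dropping at |r| > 0.8
findCorrelation(num_cor, cutoff = 0.8, names = TRUE)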
# Create a dataset that removes multicollinearity
raw_data_transform2 <- raw_data_transform |>
dplyr::select(-c(GarageCars, TotRmsAbvGrd, TotalBsmtSF, GrLivArea, BedroomAbvGr, BsmtFullBath, X2ndFlrSF))
The data preprocessing necessary for neural networks is:
# Neural networks work best when the input data is scaled to a narrow range around zero - inputs must be numeric, not factors
# dataset without the new combined features
raw_data_transform_nn1 <- raw_data_transform |>
dplyr::select(-c(OverallTotalRating, TotalBaths, TotalSF))
# find categorical predictors
raw_data_transform_categorical <- raw_data_transform_nn1 |>
dplyr::select(where(is.factor) | where(is.character))
raw_data_transform_categorical_predictors <- names(raw_data_transform_categorical)[names(raw_data_transform_categorical) != "SalePrice"]
dummy_formula <- as.formula("~ . - SalePrice") # not used below; dummyVars is applied directly to the predictor-only frame
# categorical dataset
cat_data_raw_nn <- raw_data_transform_nn1[, raw_data_transform_categorical_predictors]
# use dummyVars to create dummy variables
dummies <- dummyVars(~ ., data = cat_data_raw_nn, fullRank = FALSE)
my_data_encoded <- data.frame(predict(dummies, newdata = cat_data_raw_nn))
# numeric columns
numeric_cols <- setdiff(names(raw_data_transform_nn1), c(raw_data_transform_categorical_predictors, "SalePrice"))
# combine back the columns
final_data_for_nn <- cbind(raw_data_transform_nn1[, numeric_cols], my_data_encoded, raw_data_transform_nn1$SalePrice)
# Rename target column
colnames(final_data_for_nn)[ncol(final_data_for_nn)] <- "SalePrice"
head(final_data_for_nn)
## LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1
## 1 8450 7 5 2003 2003 196 706
## 2 9600 6 8 1976 1976 0 978
## 3 11250 7 5 2001 2002 162 486
## 4 9550 7 5 1915 1970 0 216
## 5 14260 8 5 2000 2000 350 655
## 6 14115 5 5 1993 1995 0 732
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## 1 0 150 856 856 854 0 1710
## 2 0 284 1262 1262 0 0 1262
## 3 0 434 920 920 866 0 1786
## 4 0 540 756 961 756 0 1717
## 5 0 490 1145 1145 1053 0 2198
## 6 0 64 796 796 566 0 1362
## BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr
## 1 1 0 2 1 3 1
## 2 0 1 2 0 3 1
## 3 1 0 2 1 3 1
## 4 1 0 1 0 3 1
## 5 1 0 2 1 4 1
## 6 1 0 1 1 1 1
## TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars GarageArea WoodDeckSF
## 1 8 0 2003 2 548 0
## 2 6 1 1976 2 460 298
## 3 6 1 2001 2 608 0
## 4 7 1 1998 3 642 0
## 5 9 1 2000 3 836 192
## 6 5 0 1993 2 480 40
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea MiscVal YrSold
## 1 61 0 0 0 0 0 2008
## 2 0 0 0 0 0 0 2007
## 3 42 0 0 0 0 0 2008
## 4 35 272 0 0 0 0 2006
## 5 84 0 0 0 0 0 2008
## 6 30 0 320 0 0 700 2009
## LotFrontage_imputed MSSubClass.20 MSSubClass.30 MSSubClass.40 MSSubClass.45
## 1 65 0 0 0 0
## 2 80 1 0 0 0
## 3 68 0 0 0 0
## 4 60 0 0 0 0
## 5 84 0 0 0 0
## 6 85 0 0 0 0
## MSSubClass.50 MSSubClass.60 MSSubClass.70 MSSubClass.75 MSSubClass.80
## 1 0 1 0 0 0
## 2 0 0 0 0 0
## 3 0 1 0 0 0
## 4 0 0 1 0 0
## 5 0 1 0 0 0
## 6 1 0 0 0 0
## MSSubClass.85 MSSubClass.90 MSSubClass.120 MSSubClass.160 MSSubClass.180
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## MSSubClass.190 MSZoning.C..all. MSZoning.FV MSZoning.RH MSZoning.RL
## 1 0 0 0 0 1
## 2 0 0 0 0 1
## 3 0 0 0 0 1
## 4 0 0 0 0 1
## 5 0 0 0 0 1
## 6 0 0 0 0 1
## MSZoning.RM Street.Grvl Street.Pave Alley.Grvl Alley.NoAlley Alley.Pave
## 1 0 0 1 0 1 0
## 2 0 0 1 0 1 0
## 3 0 0 1 0 1 0
## 4 0 0 1 0 1 0
## 5 0 0 1 0 1 0
## 6 0 0 1 0 1 0
## LotShape.IR1 LotShape.IR2 LotShape.IR3 LotShape.Reg LandContour.Bnk
## 1 0 0 0 1 0
## 2 0 0 0 1 0
## 3 1 0 0 0 0
## 4 1 0 0 0 0
## 5 1 0 0 0 0
## 6 1 0 0 0 0
## LandContour.HLS LandContour.Low LandContour.Lvl Utilities.AllPub
## 1 0 0 1 1
## 2 0 0 1 1
## 3 0 0 1 1
## 4 0 0 1 1
## 5 0 0 1 1
## 6 0 0 1 1
## Utilities.NoSeWa LotConfig.Corner LotConfig.CulDSac LotConfig.FR2
## 1 0 0 0 0
## 2 0 0 0 1
## 3 0 0 0 0
## 4 0 1 0 0
## 5 0 0 0 1
## 6 0 0 0 0
## LotConfig.FR3 LotConfig.Inside LandSlope.L LandSlope.Q Neighborhood.Blmngtn
## 1 0 1 0.7071068 0.4082483 0
## 2 0 0 0.7071068 0.4082483 0
## 3 0 1 0.7071068 0.4082483 0
## 4 0 0 0.7071068 0.4082483 0
## 5 0 0 0.7071068 0.4082483 0
## 6 0 1 0.7071068 0.4082483 0
## Neighborhood.Blueste Neighborhood.BrDale Neighborhood.BrkSide
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Neighborhood.ClearCr Neighborhood.CollgCr Neighborhood.Crawfor
## 1 0 1 0
## 2 0 0 0
## 3 0 1 0
## 4 0 0 1
## 5 0 0 0
## 6 0 0 0
## Neighborhood.Edwards Neighborhood.Gilbert Neighborhood.IDOTRR
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Neighborhood.MeadowV Neighborhood.Mitchel Neighborhood.NAmes
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 1 0
## Neighborhood.NoRidge Neighborhood.NPkVill Neighborhood.NridgHt
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 1 0 0
## 6 0 0 0
## Neighborhood.NWAmes Neighborhood.OldTown Neighborhood.Sawyer
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Neighborhood.SawyerW Neighborhood.Somerst Neighborhood.StoneBr
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Neighborhood.SWISU Neighborhood.Timber Neighborhood.Veenker Condition1.Artery
## 1 0 0 0 0
## 2 0 0 1 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## Condition1.Feedr Condition1.Norm Condition1.PosA Condition1.PosN
## 1 0 1 0 0
## 2 1 0 0 0
## 3 0 1 0 0
## 4 0 1 0 0
## 5 0 1 0 0
## 6 0 1 0 0
## Condition1.RRAe Condition1.RRAn Condition1.RRNe Condition1.RRNn
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## Condition2.Artery Condition2.Feedr Condition2.Norm Condition2.PosA
## 1 0 0 1 0
## 2 0 0 1 0
## 3 0 0 1 0
## 4 0 0 1 0
## 5 0 0 1 0
## 6 0 0 1 0
## Condition2.PosN Condition2.RRAe Condition2.RRAn Condition2.RRNn BldgType.1Fam
## 1 0 0 0 0 1
## 2 0 0 0 0 1
## 3 0 0 0 0 1
## 4 0 0 0 0 1
## 5 0 0 0 0 1
## 6 0 0 0 0 1
## BldgType.2fmCon BldgType.Duplex BldgType.Twnhs BldgType.TwnhsE
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## HouseStyle.1.5Fin HouseStyle.1.5Unf HouseStyle.1Story HouseStyle.2.5Fin
## 1 0 0 0 0
## 2 0 0 1 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 1 0 0 0
## HouseStyle.2.5Unf HouseStyle.2Story HouseStyle.SFoyer HouseStyle.SLvl
## 1 0 1 0 0
## 2 0 0 0 0
## 3 0 1 0 0
## 4 0 1 0 0
## 5 0 1 0 0
## 6 0 0 0 0
## RoofStyle.Flat RoofStyle.Gable RoofStyle.Gambrel RoofStyle.Hip
## 1 0 1 0 0
## 2 0 1 0 0
## 3 0 1 0 0
## 4 0 1 0 0
## 5 0 1 0 0
## 6 0 1 0 0
## RoofStyle.Mansard RoofStyle.Shed RoofMatl.ClyTile RoofMatl.CompShg
## 1 0 0 0 1
## 2 0 0 0 1
## 3 0 0 0 1
## 4 0 0 0 1
## 5 0 0 0 1
## 6 0 0 0 1
## RoofMatl.Membran RoofMatl.Metal RoofMatl.Roll RoofMatl.Tar.Grv
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## RoofMatl.WdShake RoofMatl.WdShngl Exterior1st.AsbShng Exterior1st.AsphShn
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## Exterior1st.BrkComm Exterior1st.BrkFace Exterior1st.CBlock
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Exterior1st.CemntBd Exterior1st.HdBoard Exterior1st.ImStucc
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Exterior1st.MetalSd Exterior1st.Plywood Exterior1st.Stone Exterior1st.Stucco
## 1 0 0 0 0
## 2 1 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## Exterior1st.VinylSd Exterior1st.Wd.Sdng Exterior1st.WdShing
## 1 1 0 0
## 2 0 0 0
## 3 1 0 0
## 4 0 1 0
## 5 1 0 0
## 6 1 0 0
## Exterior2nd.AsbShng Exterior2nd.AsphShn Exterior2nd.Brk.Cmn
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Exterior2nd.BrkFace Exterior2nd.CBlock Exterior2nd.CmentBd
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Exterior2nd.HdBoard Exterior2nd.ImStucc Exterior2nd.MetalSd Exterior2nd.Other
## 1 0 0 0 0
## 2 0 0 1 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## Exterior2nd.Plywood Exterior2nd.Stone Exterior2nd.Stucco Exterior2nd.VinylSd
## 1 0 0 0 1
## 2 0 0 0 0
## 3 0 0 0 1
## 4 0 0 0 0
## 5 0 0 0 1
## 6 0 0 0 1
## Exterior2nd.Wd.Sdng Exterior2nd.Wd.Shng MasVnrType.BrkCmn MasVnrType.BrkFace
## 1 0 0 0 1
## 2 0 0 0 0
## 3 0 0 0 1
## 4 0 1 0 0
## 5 0 0 0 1
## 6 0 0 0 0
## MasVnrType.None MasVnrType.Stone ExterQual.L ExterQual.Q ExterQual.C
## 1 0 0 3.162278e-01 -0.2672612 -6.324555e-01
## 2 1 0 -1.481950e-18 -0.5345225 1.786843e-17
## 3 0 0 3.162278e-01 -0.2672612 -6.324555e-01
## 4 1 0 -1.481950e-18 -0.5345225 1.786843e-17
## 5 0 0 3.162278e-01 -0.2672612 -6.324555e-01
## 6 1 0 -1.481950e-18 -0.5345225 1.786843e-17
## ExterQual.4 ExterCond.L ExterCond.Q ExterCond.C ExterCond.4
## 1 -0.4780914 -1.48195e-18 -0.5345225 1.786843e-17 0.7171372
## 2 0.7171372 -1.48195e-18 -0.5345225 1.786843e-17 0.7171372
## 3 -0.4780914 -1.48195e-18 -0.5345225 1.786843e-17 0.7171372
## 4 0.7171372 -1.48195e-18 -0.5345225 1.786843e-17 0.7171372
## 5 -0.4780914 -1.48195e-18 -0.5345225 1.786843e-17 0.7171372
## 6 0.7171372 -1.48195e-18 -0.5345225 1.786843e-17 0.7171372
## Foundation.BrkTil Foundation.CBlock Foundation.PConc Foundation.Slab
## 1 0 0 1 0
## 2 0 1 0 0
## 3 0 0 1 0
## 4 1 0 0 0
## 5 0 0 1 0
## 6 0 0 0 0
## Foundation.Stone Foundation.Wood BsmtQual.L BsmtQual.Q BsmtQual.C BsmtQual.4
## 1 0 0 0.3585686 -0.1091089 -0.5217492 -0.5669467
## 2 0 0 0.3585686 -0.1091089 -0.5217492 -0.5669467
## 3 0 0 0.3585686 -0.1091089 -0.5217492 -0.5669467
## 4 0 0 0.1195229 -0.4364358 -0.2981424 0.3779645
## 5 0 0 0.3585686 -0.1091089 -0.5217492 -0.5669467
## 6 0 1 0.3585686 -0.1091089 -0.5217492 -0.5669467
## BsmtQual.5 BsmtCond.L BsmtCond.Q BsmtCond.C BsmtCond.4 BsmtCond.5
## 1 -0.3149704 0.1195229 -0.4364358 -0.2981424 0.3779645 0.6299408
## 2 -0.3149704 0.1195229 -0.4364358 -0.2981424 0.3779645 0.6299408
## 3 -0.3149704 0.1195229 -0.4364358 -0.2981424 0.3779645 0.6299408
## 4 0.6299408 0.3585686 -0.1091089 -0.5217492 -0.5669467 -0.3149704
## 5 -0.3149704 0.1195229 -0.4364358 -0.2981424 0.3779645 0.6299408
## 6 -0.3149704 0.1195229 -0.4364358 -0.2981424 0.3779645 0.6299408
## BsmtExposure.L BsmtExposure.Q BsmtExposure.C BsmtExposure.4 BsmtFinType1.L
## 1 -3.162278e-01 -0.2672612 6.324555e-01 -0.4780914 0.5669467
## 2 6.324555e-01 0.5345225 3.162278e-01 0.1195229 0.3779645
## 3 -1.481950e-18 -0.5345225 1.786843e-17 0.7171372 0.5669467
## 4 -3.162278e-01 -0.2672612 6.324555e-01 -0.4780914 0.3779645
## 5 3.162278e-01 -0.2672612 -6.324555e-01 -0.4780914 0.5669467
## 6 -3.162278e-01 -0.2672612 6.324555e-01 -0.4780914 0.5669467
## BsmtFinType1.Q BsmtFinType1.C BsmtFinType1.4 BsmtFinType1.5 BsmtFinType1.6
## 1 5.455447e-01 0.4082483 0.2417469 0.1091089 0.03289758
## 2 -5.621884e-17 -0.4082483 -0.5640761 -0.4364358 -0.19738551
## 3 5.455447e-01 0.4082483 0.2417469 0.1091089 0.03289758
## 4 -5.621884e-17 -0.4082483 -0.5640761 -0.4364358 -0.19738551
## 5 5.455447e-01 0.4082483 0.2417469 0.1091089 0.03289758
## 6 5.455447e-01 0.4082483 0.2417469 0.1091089 0.03289758
## BsmtFinType2.L BsmtFinType2.Q BsmtFinType2.C BsmtFinType2.4 BsmtFinType2.5
## 1 -0.3779645 8.914347e-17 0.4082483 -0.5640761 0.4364358
## 2 -0.3779645 8.914347e-17 0.4082483 -0.5640761 0.4364358
## 3 -0.3779645 8.914347e-17 0.4082483 -0.5640761 0.4364358
## 4 -0.3779645 8.914347e-17 0.4082483 -0.5640761 0.4364358
## 5 -0.3779645 8.914347e-17 0.4082483 -0.5640761 0.4364358
## 6 -0.3779645 8.914347e-17 0.4082483 -0.5640761 0.4364358
## BsmtFinType2.6 Heating.Floor Heating.GasA Heating.GasW Heating.Grav
## 1 -0.1973855 0 1 0 0
## 2 -0.1973855 0 1 0 0
## 3 -0.1973855 0 1 0 0
## 4 -0.1973855 0 1 0 0
## 5 -0.1973855 0 1 0 0
## 6 -0.1973855 0 1 0 0
## Heating.OthW Heating.Wall HeatingQC.L HeatingQC.Q HeatingQC.C HeatingQC.4
## 1 0 0 0.6324555 0.5345225 0.3162278 0.1195229
## 2 0 0 0.6324555 0.5345225 0.3162278 0.1195229
## 3 0 0 0.6324555 0.5345225 0.3162278 0.1195229
## 4 0 0 0.3162278 -0.2672612 -0.6324555 -0.4780914
## 5 0 0 0.6324555 0.5345225 0.3162278 0.1195229
## 6 0 0 0.6324555 0.5345225 0.3162278 0.1195229
## CentralAir.N CentralAir.Y Electrical.FuseA Electrical.FuseF Electrical.FuseP
## 1 0 1 0 0 0
## 2 0 1 0 0 0
## 3 0 1 0 0 0
## 4 0 1 0 0 0
## 5 0 1 0 0 0
## 6 0 1 0 0 0
## Electrical.Mix Electrical.SBrkr KitchenQual.L KitchenQual.Q KitchenQual.C
## 1 0 1 3.162278e-01 -0.2672612 -6.324555e-01
## 2 0 1 -1.481950e-18 -0.5345225 1.786843e-17
## 3 0 1 3.162278e-01 -0.2672612 -6.324555e-01
## 4 0 1 3.162278e-01 -0.2672612 -6.324555e-01
## 5 0 1 3.162278e-01 -0.2672612 -6.324555e-01
## 6 0 1 -1.481950e-18 -0.5345225 1.786843e-17
## KitchenQual.4 Functional.L Functional.Q Functional.C Functional.4
## 1 -0.4780914 0.5400617 0.5400617 0.4308202 0.282038
## 2 0.7171372 0.5400617 0.5400617 0.4308202 0.282038
## 3 -0.4780914 0.5400617 0.5400617 0.4308202 0.282038
## 4 -0.4780914 0.5400617 0.5400617 0.4308202 0.282038
## 5 -0.4780914 0.5400617 0.5400617 0.4308202 0.282038
## 6 0.7171372 0.5400617 0.5400617 0.4308202 0.282038
## Functional.5 Functional.6 Functional.7 FireplaceQu.L FireplaceQu.Q
## 1 0.1497862 0.06154575 0.01706972 -0.5976143 0.5455447
## 2 0.1497862 0.06154575 0.01706972 0.1195229 -0.4364358
## 3 0.1497862 0.06154575 0.01706972 0.1195229 -0.4364358
## 4 0.1497862 0.06154575 0.01706972 0.3585686 -0.1091089
## 5 0.1497862 0.06154575 0.01706972 0.1195229 -0.4364358
## 6 0.1497862 0.06154575 0.01706972 -0.5976143 0.5455447
## FireplaceQu.C FireplaceQu.4 FireplaceQu.5 GarageType.2Types GarageType.Attchd
## 1 -0.3726780 0.1889822 -0.06299408 0 1
## 2 -0.2981424 0.3779645 0.62994079 0 1
## 3 -0.2981424 0.3779645 0.62994079 0 1
## 4 -0.5217492 -0.5669467 -0.31497039 0 0
## 5 -0.2981424 0.3779645 0.62994079 0 1
## 6 -0.3726780 0.1889822 -0.06299408 0 1
## GarageType.Basment GarageType.BuiltIn GarageType.CarPort GarageType.Detchd
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 1
## 5 0 0 0 0
## 6 0 0 0 0
## GarageType.NoGarage GarageFinish.L GarageFinish.Q GarageFinish.C GarageQual.L
## 1 0 0.2236068 -0.5 -0.6708204 0.1195229
## 2 0 0.2236068 -0.5 -0.6708204 0.1195229
## 3 0 0.2236068 -0.5 -0.6708204 0.1195229
## 4 0 -0.2236068 -0.5 0.6708204 0.1195229
## 5 0 0.2236068 -0.5 -0.6708204 0.1195229
## 6 0 -0.2236068 -0.5 0.6708204 0.1195229
## GarageQual.Q GarageQual.C GarageQual.4 GarageQual.5 GarageCond.L GarageCond.Q
## 1 -0.4364358 -0.2981424 0.3779645 0.6299408 0.1195229 -0.4364358
## 2 -0.4364358 -0.2981424 0.3779645 0.6299408 0.1195229 -0.4364358
## 3 -0.4364358 -0.2981424 0.3779645 0.6299408 0.1195229 -0.4364358
## 4 -0.4364358 -0.2981424 0.3779645 0.6299408 0.1195229 -0.4364358
## 5 -0.4364358 -0.2981424 0.3779645 0.6299408 0.1195229 -0.4364358
## 6 -0.4364358 -0.2981424 0.3779645 0.6299408 0.1195229 -0.4364358
## GarageCond.C GarageCond.4 GarageCond.5 PavedDrive.N PavedDrive.P PavedDrive.Y
## 1 -0.2981424 0.3779645 0.6299408 0 0 1
## 2 -0.2981424 0.3779645 0.6299408 0 0 1
## 3 -0.2981424 0.3779645 0.6299408 0 0 1
## 4 -0.2981424 0.3779645 0.6299408 0 0 1
## 5 -0.2981424 0.3779645 0.6299408 0 0 1
## 6 -0.2981424 0.3779645 0.6299408 0 0 1
## PoolQC.L PoolQC.Q PoolQC.C PoolQC.4 Fence.L Fence.Q Fence.C
## 1 -0.6324555 0.5345225 -0.3162278 0.1195229 -0.6324555 0.5345225 -0.3162278
## 2 -0.6324555 0.5345225 -0.3162278 0.1195229 -0.6324555 0.5345225 -0.3162278
## 3 -0.6324555 0.5345225 -0.3162278 0.1195229 -0.6324555 0.5345225 -0.3162278
## 4 -0.6324555 0.5345225 -0.3162278 0.1195229 -0.6324555 0.5345225 -0.3162278
## 5 -0.6324555 0.5345225 -0.3162278 0.1195229 -0.6324555 0.5345225 -0.3162278
## 6 -0.6324555 0.5345225 -0.3162278 0.1195229 0.3162278 -0.2672612 -0.6324555
## Fence.4 MiscFeature.Gar2 MiscFeature.None MiscFeature.Othr
## 1 0.1195229 0 1 0
## 2 0.1195229 0 1 0
## 3 0.1195229 0 1 0
## 4 0.1195229 0 1 0
## 5 0.1195229 0 1 0
## 6 -0.4780914 0 0 0
## MiscFeature.Shed MiscFeature.TenC MoSold.1 MoSold.2 MoSold.3 MoSold.4
## 1 0 0 0 1 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 1 0 0
## 5 0 0 0 0 0 0
## 6 1 0 0 0 0 0
## MoSold.5 MoSold.6 MoSold.7 MoSold.8 MoSold.9 MoSold.10 MoSold.11 MoSold.12
## 1 0 0 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0 0
## 3 0 0 0 0 1 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 1
## 6 0 0 0 0 0 1 0 0
## SaleType.COD SaleType.Con SaleType.ConLD SaleType.ConLI SaleType.ConLw
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## SaleType.CWD SaleType.New SaleType.Oth SaleType.WD SaleCondition.Abnorml
## 1 0 0 0 1 0
## 2 0 0 0 1 0
## 3 0 0 0 1 0
## 4 0 0 0 1 1
## 5 0 0 0 1 0
## 6 0 0 0 1 0
## SaleCondition.AdjLand SaleCondition.Alloca SaleCondition.Family
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## SaleCondition.Normal SaleCondition.Partial is_LotFrontage_missing.FALSE
## 1 1 0 1
## 2 1 0 1
## 3 1 0 1
## 4 0 0 1
## 5 1 0 1
## 6 1 0 1
## is_LotFrontage_missing.TRUE SalePrice
## 1 0 208500
## 2 0 181500
## 3 0 223500
## 4 0 140000
## 5 0 250000
## 6 0 143000
# inspired by the textbook
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
# scale the features
final_data_for_nn_norm <- as.data.frame(lapply(final_data_for_nn, normalize))
To ensure objective model evaluation and prevent overfitting, we split each dataset into training (80%) and testing (20%) sets using a consistent random seed (set.seed(123)) for reproducibility.
# Split data for validation
set.seed(123)
train_idx <- createDataPartition(y = raw_data$SalePrice, p = 0.8, list = FALSE)
# Create train/test splits for each dataset
train <- raw_data[train_idx, ]
test <- raw_data[-train_idx, ]
# Transformed predictors dataset
train_transformed <- raw_data_transform[train_idx, ]
test_transformed <- raw_data_transform[-train_idx, ]
# Transformed predictors dataset
train_transformed2 <- raw_data_transform2[train_idx, ]
test_transformed2 <- raw_data_transform2[-train_idx, ]
# NN 1 dataset
train_transformed_nn1 <- final_data_for_nn_norm[train_idx, ]
test_transformed_nn1 <- final_data_for_nn_norm[-train_idx, ]
I was going to try PCA since there are a lot of features, including ones that show multicollinearity, but since RF and NN are both quite robust and PCA is not required, I decided to skip it.
# ## Visualize principal component analysis
# raw_data_transform_pca <- raw_data_transform |>
# dplyr::select(-LotFrontage)
# plot_prcomp(raw_data_transform_pca, maxcat = 10L)
NOTE: The Kaggle competition recommends RMSE as the evaluation metric, computed on the log of the actual and predicted values to reduce the impact of large errors. Taking logs means that errors in predicting expensive houses and cheap houses affect the result equally.
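Concretely, the metric is root mean squared error on log-transformed prices. A minimal sketch of a hypothetical rmsle() helper, equivalent to what postResample() reports below when given the logged values:
# RMSLE: RMSE computed on log-transformed values (log1p handles the +1 offset)
rmsle <- function(actual, predicted) {
sqrt(mean((log1p(predicted) - log1p(actual))^2))
}
# the same $20k miss hurts more on a cheap house than on an expensive one
rmsle(actual = 100000, predicted = 120000)  # ~0.182
rmsle(actual = 500000, predicted = 520000)  # ~0.039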
I chose random forest as the algorithm from weeks 1-10 and the neural network from the recent weeks. Random forest is an ensemble method that uses a “team” of decision trees and averages their predictions.
Random forest would be good for this dataset because:
I have also never used random forest for regression, so this would be a good learning experience.
Potential issues:
A neural network is another good option. Neural networks contain an input layer, one or more hidden layers, and an output layer, loosely mimicking a human brain. Raw input enters the input layer, is processed through the hidden layers, and the output layer produces the result. The network learns by optimizing its many weights through gradient descent and backpropagation. The many hyperparameters and the choice of activation function let it capture complex patterns and relationships, so it can fit the training data very well.
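To make the layered structure concrete, here is a tiny self-contained neuralnet sketch on toy data (the layer sizes are arbitrary and unrelated to the experiments below):
library(neuralnet)
# toy regression problem: learn y = x^2 on [0, 1]
toy <- data.frame(x = seq(0, 1, length.out = 50))
toy$y <- toy$x^2
# input layer (x) -> hidden layers with 3 and 2 nodes -> linear output node (regression)
nn_sketch <- neuralnet(y ~ x, data = toy, hidden = c(3, 2), linear.output = TRUE)
# plot(nn_sketch)  # visualizes the layers and learned weights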
Pros of NN:
I have also never used a neural network before, so this would be a good learning experience.
Potential issues:
ranger function arguments:
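For reference, a minimal direct ranger() call sketching the main arguments tuned below; the values shown are ranger’s regression defaults or placeholders, and the actual experiments set mtry and min.node.size through caret’s tuning grid:
library(ranger)
# illustrative only (not run) -- the experiments below fit ranger through caret::train
# rf_sketch <- ranger(SalePrice ~ .,
#   data = train_transformed,   # 80% training split created above
#   num.trees = 500,            # default number of trees
#   mtry = 10,                  # predictors sampled at each split
#   min.node.size = 5,          # regression default
#   splitrule = "variance",     # regression split criterion
#   importance = "impurity")    # for the variable importance plots below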
Experiment:
Purpose:
See how well a model does with all the original predictors, and varied mtry and min node size values (tuning goals above).
Result:
# RF MODEL 1 -- Base model (use original predictors, except for LotFrontage updates)
# Remove the new combined features created
raw_data_transform_predictors <- raw_data_transform |>
dplyr::select(-c(SalePrice, OverallTotalRating, TotalBaths, TotalSF))
# Target variable
raw_data_transform_response <- raw_data_transform$SalePrice
set.seed(123)
# Train the RF model
# ranger is a much faster implementation of random forest than the randomForest package
# Tuning grid
tuneGrid <- data.frame(
.mtry = c(3, 10, 25), # increasing mtry
.splitrule = "variance",
.min.node.size = c(2, 4, 10) # increasing min node size
)
# Train the model - use cross-validation
rf_model1 <- train(
x = raw_data_transform_predictors,
y = raw_data_transform_response,
method = "ranger",
trControl = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE
),
metric = "RMSE",
tuneGrid = tuneGrid,
# probability = TRUE,
importance = "impurity"
)
## + Fold1: mtry= 3, splitrule=variance, min.node.size= 2
## - Fold1: mtry= 3, splitrule=variance, min.node.size= 2
## + Fold1: mtry=10, splitrule=variance, min.node.size= 4
## - Fold1: mtry=10, splitrule=variance, min.node.size= 4
## + Fold1: mtry=25, splitrule=variance, min.node.size=10
## - Fold1: mtry=25, splitrule=variance, min.node.size=10
## + Fold2: mtry= 3, splitrule=variance, min.node.size= 2
## - Fold2: mtry= 3, splitrule=variance, min.node.size= 2
## + Fold2: mtry=10, splitrule=variance, min.node.size= 4
## - Fold2: mtry=10, splitrule=variance, min.node.size= 4
## + Fold2: mtry=25, splitrule=variance, min.node.size=10
## - Fold2: mtry=25, splitrule=variance, min.node.size=10
## + Fold3: mtry= 3, splitrule=variance, min.node.size= 2
## - Fold3: mtry= 3, splitrule=variance, min.node.size= 2
## + Fold3: mtry=10, splitrule=variance, min.node.size= 4
## - Fold3: mtry=10, splitrule=variance, min.node.size= 4
## + Fold3: mtry=25, splitrule=variance, min.node.size=10
## - Fold3: mtry=25, splitrule=variance, min.node.size=10
## + Fold4: mtry= 3, splitrule=variance, min.node.size= 2
## - Fold4: mtry= 3, splitrule=variance, min.node.size= 2
## + Fold4: mtry=10, splitrule=variance, min.node.size= 4
## - Fold4: mtry=10, splitrule=variance, min.node.size= 4
## + Fold4: mtry=25, splitrule=variance, min.node.size=10
## - Fold4: mtry=25, splitrule=variance, min.node.size=10
## + Fold5: mtry= 3, splitrule=variance, min.node.size= 2
## - Fold5: mtry= 3, splitrule=variance, min.node.size= 2
## + Fold5: mtry=10, splitrule=variance, min.node.size= 4
## - Fold5: mtry=10, splitrule=variance, min.node.size= 4
## + Fold5: mtry=25, splitrule=variance, min.node.size=10
## - Fold5: mtry=25, splitrule=variance, min.node.size=10
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 25, splitrule = variance, min.node.size = 10 on full training set
# Print RF model
print(rf_model1)
## Random Forest
##
## 1460 samples
## 80 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1169, 1169, 1167, 1168, 1167
## Resampling results across tuning parameters:
##
## mtry min.node.size RMSE Rsquared MAE
## 3 2 30929.83 0.8722486 18001.67
## 10 4 28327.61 0.8825102 16616.94
## 25 10 27885.92 0.8827094 16551.26
##
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 25, splitrule = variance
## and min.node.size = 10.
# Plot hyperparam tuning
plot(rf_model1)
# Make predictions
rf_model1_predictions <- predict(rf_model1, test_transformed)
# Evaluate
# NOTE: competition recommended looking at the log of SalePrice for RMSE (selected metric) - reduces impact of large errors
log_metrics <- postResample(pred = log(rf_model1_predictions + 1), obs = log(test_transformed$SalePrice + 1))
rf_model1_rmse_log <- log_metrics["RMSE"]
rf_model1_Rsquared_log <- log_metrics["Rsquared"]
rf_model1_MAE_log <- log_metrics["MAE"]
# Not taking log
metrics <- postResample(pred = rf_model1_predictions, obs = test_transformed$SalePrice)
rf_model1_rmse <- metrics["RMSE"]
rf_model1_Rsquared <- metrics["Rsquared"]
rf_model1_MAE <- metrics["MAE"]
rf_model1_output <- paste(
"\n=== Model Selection and Evaluation ===\n\n",
"=== RF MODEL 1 Evaluation ===\n",
"RMSLE:", round(rf_model1_rmse_log, 4),
"| MAE (log):", round(rf_model1_MAE_log, 4),
"| R² (log):", round(rf_model1_Rsquared_log, 4),
"| RMSE:", round(rf_model1_rmse, 4),
"| MAE:", round(rf_model1_MAE, 4),
"| R²:", round(rf_model1_Rsquared, 4),
"\n\n",
sep = " "
)
cat(rf_model1_output)
##
## === Model Selection and Evaluation ===
##
## === RF MODEL 1 Evaluation ===
## RMSLE: 0.0718 | MAE (log): 0.0451 | R² (log): 0.9699 | RMSE: 18869.4112 | MAE: 9206.9896 | R²: 0.9631
# Feature importance
plot(varImp(rf_model1))
Experiment:
Purpose:
Result:
# RF MODEL 2 -- Use the dataset that removed high multicollinearity predictors
# Remove the new combined features created
train_transformed2_predictors <- train_transformed2 |>
dplyr::select(-c(SalePrice, OverallTotalRating, TotalBaths, TotalSF))
# Target variable
train_transformed2_response <- train_transformed2$SalePrice
set.seed(123)
# Train the RF model
# ranger is a much faster implementation of random forest than the randomForest package
# Tuning grid
tuneGrid <- data.frame(
.mtry = c(3, 10, 25), # increasing mtry
.splitrule = "variance",
.min.node.size = c(2, 4, 10) # increasing node size
)
# Train the model - use cross-validation
rf_model2 <- train(
x = train_transformed2_predictors,
y = train_transformed2_response,
method = "ranger",
trControl = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE
),
metric = "RMSE",
tuneGrid = tuneGrid,
importance = "impurity"
)
## + Fold1: mtry= 3, splitrule=variance, min.node.size= 2
## - Fold1: mtry= 3, splitrule=variance, min.node.size= 2
## + Fold1: mtry=10, splitrule=variance, min.node.size= 4
## - Fold1: mtry=10, splitrule=variance, min.node.size= 4
## + Fold1: mtry=25, splitrule=variance, min.node.size=10
## - Fold1: mtry=25, splitrule=variance, min.node.size=10
## + Fold2: mtry= 3, splitrule=variance, min.node.size= 2
## - Fold2: mtry= 3, splitrule=variance, min.node.size= 2
## + Fold2: mtry=10, splitrule=variance, min.node.size= 4
## - Fold2: mtry=10, splitrule=variance, min.node.size= 4
## + Fold2: mtry=25, splitrule=variance, min.node.size=10
## - Fold2: mtry=25, splitrule=variance, min.node.size=10
## + Fold3: mtry= 3, splitrule=variance, min.node.size= 2
## - Fold3: mtry= 3, splitrule=variance, min.node.size= 2
## + Fold3: mtry=10, splitrule=variance, min.node.size= 4
## - Fold3: mtry=10, splitrule=variance, min.node.size= 4
## + Fold3: mtry=25, splitrule=variance, min.node.size=10
## - Fold3: mtry=25, splitrule=variance, min.node.size=10
## + Fold4: mtry= 3, splitrule=variance, min.node.size= 2
## - Fold4: mtry= 3, splitrule=variance, min.node.size= 2
## + Fold4: mtry=10, splitrule=variance, min.node.size= 4
## - Fold4: mtry=10, splitrule=variance, min.node.size= 4
## + Fold4: mtry=25, splitrule=variance, min.node.size=10
## - Fold4: mtry=25, splitrule=variance, min.node.size=10
## + Fold5: mtry= 3, splitrule=variance, min.node.size= 2
## - Fold5: mtry= 3, splitrule=variance, min.node.size= 2
## + Fold5: mtry=10, splitrule=variance, min.node.size= 4
## - Fold5: mtry=10, splitrule=variance, min.node.size= 4
## + Fold5: mtry=25, splitrule=variance, min.node.size=10
## - Fold5: mtry=25, splitrule=variance, min.node.size=10
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 25, splitrule = variance, min.node.size = 10 on full training set
# Print and plot model (hyperparam tuning)
print(rf_model2)
## Random Forest
##
## 1169 samples
## 73 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 935, 935, 934, 936, 936
## Resampling results across tuning parameters:
##
## mtry min.node.size RMSE Rsquared MAE
## 3 2 32579.84 0.8428896 20203.31
## 10 4 29963.05 0.8561327 18538.95
## 25 10 29653.15 0.8558539 18618.96
##
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 25, splitrule = variance
## and min.node.size = 10.
plot(rf_model2)
# Make predictions
rf_model2_predictions <- predict(rf_model2, test_transformed2)
# Evaluate
# NOTE: competition recommended looking at the log of SalePrice for RMSE (selected metric) - reduces impact of large errors
log_metrics <- postResample(pred = log(rf_model2_predictions + 1), obs = log(test_transformed2$SalePrice + 1))
# RMSE
rf_model2_rmse_log <- log_metrics["RMSE"]
# R Squared
rf_model2_Rsquared_log <- log_metrics["Rsquared"]
# MAE
rf_model2_MAE_log <- log_metrics["MAE"]
# not taking log
metrics <- postResample(pred = rf_model2_predictions, obs = test_transformed2$SalePrice)
# RMSE
rf_model2_rmse <- metrics["RMSE"]
# R Squared
rf_model2_Rsquared <- metrics["Rsquared"]
# MAE
rf_model2_MAE <- metrics["MAE"]
rf_model2_output <- paste(
"\n=== Model Selection and Evaluation ===\n\n",
"=== RF MODEL 2 Evaluation ===\n",
"RMSLE:", round(rf_model2_rmse_log, 4),
"| MAE (log):", round(rf_model2_MAE_log, 4),
"| R² (log):", round(rf_model2_Rsquared_log, 4),
"| RMSE:", round(rf_model2_rmse, 4),
"| MAE:", round(rf_model2_MAE, 4),
"| R²:", round(rf_model2_Rsquared, 4),
"\n\n",
sep = " "
)
cat(rf_model2_output)
##
## === Model Selection and Evaluation ===
##
## === RF MODEL 2 Evaluation ===
## RMSLE: 0.1521 | MAE (log): 0.1034 | R² (log): 0.8605 | RMSE: 41690.627 | MAE: 21201.3074 | R²: 0.8006
# Feature importance
plot(varImp(rf_model2))
Experiment:
Purpose:
Result:
# RF MODEL 3 - USE COMBINED FEATURES
# Predictors -- remove predictors used for combined features
train_transformed3_predictors <- train_transformed |>
dplyr::select(-c(SalePrice, OverallQual, OverallCond, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, TotalBsmtSF, X1stFlrSF, X2ndFlrSF))
# Target variable
train_transformed3_response <- train_transformed$SalePrice
set.seed(123)
# Train the RF model
# ranger is a much faster implementation of random forest than the randomForest package
# Tuning grid
tuneGrid <- data.frame(
.mtry = c(3, 10, 25),
.splitrule = "variance",
.min.node.size = c(2, 4, 10)
)
# Train the model - use cross-validation
rf_model3 <- train(
x = train_transformed3_predictors,
y = train_transformed3_response,
method = "ranger",
trControl = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE
),
metric = "RMSE",
tuneGrid = tuneGrid,
importance = "impurity"
)
## + Fold1: mtry= 3, splitrule=variance, min.node.size= 2
## - Fold1: mtry= 3, splitrule=variance, min.node.size= 2
## + Fold1: mtry=10, splitrule=variance, min.node.size= 4
## - Fold1: mtry=10, splitrule=variance, min.node.size= 4
## + Fold1: mtry=25, splitrule=variance, min.node.size=10
## - Fold1: mtry=25, splitrule=variance, min.node.size=10
## + Fold2: mtry= 3, splitrule=variance, min.node.size= 2
## - Fold2: mtry= 3, splitrule=variance, min.node.size= 2
## + Fold2: mtry=10, splitrule=variance, min.node.size= 4
## - Fold2: mtry=10, splitrule=variance, min.node.size= 4
## + Fold2: mtry=25, splitrule=variance, min.node.size=10
## - Fold2: mtry=25, splitrule=variance, min.node.size=10
## + Fold3: mtry= 3, splitrule=variance, min.node.size= 2
## - Fold3: mtry= 3, splitrule=variance, min.node.size= 2
## + Fold3: mtry=10, splitrule=variance, min.node.size= 4
## - Fold3: mtry=10, splitrule=variance, min.node.size= 4
## + Fold3: mtry=25, splitrule=variance, min.node.size=10
## - Fold3: mtry=25, splitrule=variance, min.node.size=10
## + Fold4: mtry= 3, splitrule=variance, min.node.size= 2
## - Fold4: mtry= 3, splitrule=variance, min.node.size= 2
## + Fold4: mtry=10, splitrule=variance, min.node.size= 4
## - Fold4: mtry=10, splitrule=variance, min.node.size= 4
## + Fold4: mtry=25, splitrule=variance, min.node.size=10
## - Fold4: mtry=25, splitrule=variance, min.node.size=10
## + Fold5: mtry= 3, splitrule=variance, min.node.size= 2
## - Fold5: mtry= 3, splitrule=variance, min.node.size= 2
## + Fold5: mtry=10, splitrule=variance, min.node.size= 4
## - Fold5: mtry=10, splitrule=variance, min.node.size= 4
## + Fold5: mtry=25, splitrule=variance, min.node.size=10
## - Fold5: mtry=25, splitrule=variance, min.node.size=10
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 25, splitrule = variance, min.node.size = 10 on full training set
# Plot and print the model
print(rf_model3)
## Random Forest
##
## 1169 samples
## 74 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 935, 935, 934, 936, 936
## Resampling results across tuning parameters:
##
## mtry min.node.size RMSE Rsquared MAE
## 3 2 29444.78 0.8743839 17682.10
## 10 4 26277.80 0.8905740 15938.26
## 25 10 25454.40 0.8939029 15863.10
##
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 25, splitrule = variance
## and min.node.size = 10.
plot(rf_model3)
# Make predictions
rf_model3_predictions <- predict(rf_model3, test_transformed)
# Evaluate
# NOTE: competition recommended looking at the log of SalePrice for RMSE (selected metric) - reduces impact of large errors
log_metrics <- postResample(pred = log(rf_model3_predictions + 1), obs = log(test_transformed$SalePrice + 1))
# RMSE
rf_model3_rmse_log <- log_metrics["RMSE"]
# R Squared
rf_model3_Rsquared_log <- log_metrics["Rsquared"]
# MAE
rf_model3_MAE_log <- log_metrics["MAE"]
# not taking log
metrics <- postResample(pred = rf_model3_predictions, obs = test_transformed$SalePrice)
# RMSE
rf_model3_rmse <- metrics["RMSE"]
# R Squared
rf_model3_Rsquared <- metrics["Rsquared"]
# MAE
rf_model3_MAE <- metrics["MAE"]
rf_model3_output <- paste(
"\n=== Model Selection and Evaluation ===\n\n",
"=== RF MODEL 3 Evaluation ===\n",
"RMSLE:", round(rf_model3_rmse_log, 4),
"| MAE (log):", round(rf_model3_MAE_log, 4),
"| R² (log):", round(rf_model3_Rsquared_log, 4),
"| RMSE:", round(rf_model3_rmse, 4),
"| MAE:", round(rf_model3_MAE, 4),
"| R²:", round(rf_model3_Rsquared, 4),
"\n\n",
sep = " "
)
cat(rf_model3_output)
##
## === Model Selection and Evaluation ===
##
## === RF MODEL 3 Evaluation ===
## RMSLE: 0.1369 | MAE (log): 0.0888 | R² (log): 0.8865 | RMSE: 38218.1468 | MAE: 18227.6135 | R²: 0.8304
# Feature importance
plot(varImp(rf_model3))
The default parameters for neuralnet are (from textbook):
Note: rprop+ is the default algorithm used by neuralnet. rprop+ is resilient backpropagation with weight backtracking, which manages the learning-rate dynamics automatically, so for this scenario I will not experiment with different learning rates (i.e., I will not switch the algorithm to “backprop”).
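For completeness, switching away from rprop+ would look like the sketch below (illustrative only and not run; learningrate must be supplied when algorithm = "backprop", and 0.01 is an arbitrary placeholder):
# illustrative only -- not one of the experiments in this report
# nn_backprop_sketch <- neuralnet(SalePrice ~ .,
#   data = train_transformed_nn1,
#   hidden = c(5),
#   algorithm = "backprop",   # plain backpropagation instead of rprop+
#   learningrate = 0.01,      # required for backprop; placeholder value
#   linear.output = TRUE)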
Experiment:
Purpose:
Result:
# NN MODEL 1 - BASE MODEL W/ ORIG (ALL) PREDICTORS
set.seed(123)
nn_model1 <- neuralnet(SalePrice ~ .,
data = train_transformed_nn1) # use the nn dataset we created earlier
plot(nn_model1)
# compute returns list with 2 components - neurons (stores neurons for each layer in the network) & net.result (stores the model's predicted values)
nn1_results <- compute(nn_model1, test_transformed_nn1)
# Grab predictions - normalized
nn1_predictions <- nn1_results$net.result
# correlation is scale-invariant, so the normalization doesn't matter here
cor(nn1_predictions, test_transformed_nn1$SalePrice)
## [,1]
## [1,] 0.8973185
# Evaluation
# inspired by the textbook
unnormalize <- function(x) {
return(x * (max(final_data_for_nn$SalePrice) - min(final_data_for_nn$SalePrice)) + min(final_data_for_nn$SalePrice))
}
nn1_predictions_unnorm <- unnormalize(nn1_predictions)
# Evaluation using log
nn1_residuals_log <- log(test_transformed$SalePrice + 1) - log(nn1_predictions_unnorm + 1)
nn1_mse_log <- mean(nn1_residuals_log^2)
nn1_rmse_log <- sqrt(nn1_mse_log)
nn1_Rsquared_log <- R2(log(nn1_predictions_unnorm + 1), log(test_transformed$SalePrice + 1))
mae <- function(actual, predicted) {
mean(abs(actual - predicted))
}
nn1_mae_log <- mae(log(test_transformed$SalePrice + 1), log(nn1_predictions_unnorm + 1))
# NOT using log
nn1_residuals <- test_transformed$SalePrice - nn1_predictions_unnorm
nn1_mse <- mean(nn1_residuals^2)
nn1_rmse <- sqrt(nn1_mse)
nn1_Rsquared <- R2(nn1_predictions_unnorm, test_transformed$SalePrice)
nn1_mae <- mae(test_transformed$SalePrice, nn1_predictions_unnorm)
nn_model1_output <- paste(
"\n=== Model Selection and Evaluation ===\n\n",
"=== NN MODEL 1 Evaluation ===\n",
"RMSLE:", round(nn1_mse_log, 4),
"| MAE (log):", round(nn1_mae_log, 4),
"| R² (log):", round(nn1_Rsquared_log, 4),
"| RMSE:", round(nn1_rmse, 4),
"| MAE:", round(nn1_mae, 4),
"| R²:", round(nn1_Rsquared, 4),
"\n\n",
sep = " "
)
cat(nn_model1_output)
##
## === Model Selection and Evaluation ===
##
## === NN MODEL 1 Evaluation ===
## RMSLE: 0.1568 | MAE (log): 0.0989 | R² (log): 0.8587 | RMSE: 43265.3982 | MAE: 19380.8065 | R²: 0.8052
Experiment:
Purpose:
ReLU is highly efficient for gradient descent, but it can’t be used directly with neuralnet since its derivative is undefined at x = 0. The purpose of this experiment is to see whether a smooth approximation of ReLU, known as SmoothReLU (softplus), improves the model. Since the last model performed very poorly, I will keep the same activation function but use only one hidden layer to see if the model can train more effectively; SmoothReLU should also perform fairly well in a shallow network.
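For reference, SmoothReLU is the softplus function softplus(x) = log(1 + e^x). The softplus object passed to act.fct below is assumed to have been defined earlier in the report along these lines:
# softplus (SmoothReLU): smooth and differentiable everywhere, so usable as act.fct in neuralnet
softplus <- function(x) log(1 + exp(x))
# sanity check: close to 0 for large negative x, close to x for large positive x
softplus(c(-5, 0, 5))  # ~0.0067, ~0.693, ~5.0067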
Result:
# NN MODEL 4 - USE SmoothReLU (W/ ORIG PREDICTORS)
set.seed(123)
# Train model
nn_model4 <- neuralnet(SalePrice ~ .,
hidden = c(5), # only 1 hidden layer network
act.fct = softplus,
data = train_transformed_nn1)
plot(nn_model4)
# compute returns list with 2 components - neurons (stores neurons for each layer in the network) & net.result (stores the model's predicted values)
nn4_results <- compute(nn_model4, test_transformed_nn1)
# Grab predictions - normalized
nn4_predictions <- nn4_results$net.result
# correlation is scale-invariant, so the normalization doesn't matter here
cor(nn4_predictions, test_transformed_nn1$SalePrice)
## [,1]
## [1,] 0.225215
# Evaluation
nn4_predictions_unnorm <- unnormalize(nn4_predictions)
# Evaluation metrics using log
nn4_residuals_log <- log(test_transformed$SalePrice + 1) - log(nn4_predictions_unnorm + 1)
## Warning in log(nn4_predictions_unnorm + 1): NaNs produced
nn4_mse_log <- mean(nn4_residuals_log^2)
nn4_rmse_log <- sqrt(nn4_mse_log)
nn4_Rsquared_log <- R2(log(nn4_predictions_unnorm + 1), log(test_transformed$SalePrice + 1))
## Warning in log(nn4_predictions_unnorm + 1): NaNs produced
nn4_mae_log <- mae(log(test_transformed$SalePrice + 1), log(nn4_predictions_unnorm + 1))
## Warning in log(nn4_predictions_unnorm + 1): NaNs produced
# Evaluation metrics NOT using log
nn4_residuals <- test_transformed$SalePrice - nn4_predictions_unnorm
nn4_mse <- mean(nn4_residuals^2)
nn4_rmse <- sqrt(nn4_mse)
nn4_Rsquared <- R2(nn4_predictions_unnorm, test_transformed$SalePrice)
nn4_mae <- mae(test_transformed$SalePrice, nn4_predictions_unnorm)
nn_model4_output <- paste(
"\n=== Model Selection and Evaluation ===\n\n",
"=== NN MODEL 4 Evaluation ===\n",
"RMSLE:", round(nn4_rmse_log, 4),
"| MAE (log):", round(nn4_mae_log, 4),
"| R² (log):", round(nn4_Rsquared_log, 4),
"| RMSE:", round(nn4_rmse, 4),
"| MAE:", round(nn4_mae, 4),
"| R²:", round(nn4_Rsquared, 4),
"\n\n",
sep = " "
)
cat(nn_model4_output)
##
## === Model Selection and Evaluation ===
##
## === NN MODEL 4 Evaluation ===
## RMSLE: NaN | MAE (log): NaN | R² (log): NA | RMSE: 1380219.3757 | MAE: 243127.9723 | R²: 0.0507
NOTE: See results table below
Recommended model: RF MODEL 1 (Original predictors)
Key Strengths:
RF Model 1 Top 10 Features:
Trade-offs:
Reasoning on not choosing other models:
RF Model 2:
RF Model 2 Top 10 Features:
RF Model 3:
RF Model 3 Top 10 Features:
NN Model 1 (base model):
NN Model 2 (increased hidden nodes):
NN Model 3 (SmoothReLU + 2nd layer):
NN Model 4 (SmoothReLU):
# Comparison table
all_experiments <- data.frame(
Model = c("RF Model 1", "RF Model 2", "RF Model 3", "NN Model 1", "NN Model 2", "NN Model 3", "NN Model 4"),
RMSE = c(rf_model1_rmse, rf_model2_rmse, rf_model3_rmse, nn1_rmse, nn2_rmse, nn3_rmse, nn4_rmse),
MAE = c(rf_model1_MAE, rf_model2_MAE, rf_model3_MAE, nn1_mae, nn2_mae, nn3_mae, nn4_mae),
Rsquared = c(rf_model1_Rsquared, rf_model2_Rsquared, rf_model3_Rsquared, nn1_Rsquared, nn2_Rsquared, nn3_Rsquared, nn4_Rsquared),
RMSE_LOG = c(rf_model1_rmse_log, rf_model2_rmse_log, rf_model3_rmse_log, nn1_rmse_log, nn2_rmse_log, nn3_rmse_log, nn4_rmse_log),
MAE_LOG = c(rf_model1_MAE_log, rf_model2_MAE_log, rf_model3_MAE_log, nn1_mae_log, nn2_mae_log, nn3_mae_log, nn4_mae_log),
Rsquared_LOG = c(rf_model1_Rsquared_log, rf_model2_Rsquared_log, rf_model3_Rsquared_log, nn1_Rsquared_log, nn2_Rsquared_log, nn3_Rsquared_log, nn4_Rsquared_log)
)
all_experiments %>%
kbl() %>%
kable_styling(full_width = TRUE)
| Model | RMSE | MAE | Rsquared | RMSE_LOG | MAE_LOG | Rsquared_LOG |
|---|---|---|---|---|---|---|
| RF Model 1 | 18869.41 | 9206.99 | 0.9631335 | 0.0718030 | 0.0451099 | 0.9698832 |
| RF Model 2 | 41690.63 | 21201.31 | 0.8006262 | 0.1521428 | 0.1034242 | 0.8605485 |
| RF Model 3 | 38218.15 | 18227.61 | 0.8303695 | 0.1368532 | 0.0887514 | 0.8864929 |
| NN Model 1 | 43265.40 | 19380.81 | 0.8051805 | 0.1568195 | 0.0989110 | 0.8587128 |
| NN Model 2 | 122739.40 | 74402.03 | 0.2805413 | NaN | NaN | NA |
| NN Model 3 | 2326672.35 | 283846.62 | 0.0050892 | NaN | NaN | NA |
| NN Model 4 | 1380219.38 | 243127.97 | 0.0507218 | NaN | NaN | NA |
To summarize the comparison above, the random forest models fit this data better overall, although the base neural network model still gave fairly good performance. (The NaN values in the log-based columns for NN Models 2–4 are most likely there because, as the warnings for NN Model 4 show, those models produced some negative predictions, for which the log error is undefined.) It’s possible that this particular regression problem, with a dataset of this size, is simply not a good fit for a neural network: neural networks tend to work better on unstructured data such as images, text, and audio, while random forests handle high-dimensional tabular data very well and can perform well even on smaller datasets.
It’s entirely possible the neural network could have achieved higher performance, but it would take more manual hyperparameter tuning. A downside of neural networks is that a single poorly chosen hyperparameter can ruin the model, which makes them tricky to experiment with and requires many runs. Some parameters I would try tuning are the learning rate and the bias terms when using standard backpropagation, which I wasn’t able to do due to time constraints; a sketch of that tuning is shown below. In a real-world scenario, a reliable, faster, and less computationally expensive model such as the random forest might be the better choice.
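As a rough illustration (not run for this report), the sweep below shows how that learning-rate tuning could look. It reuses the train_transformed_nn1 / test_transformed_nn1 frames and the unnormalize() helper from above; the learning-rate grid is an arbitrary choice, and neuralnet only uses the learningrate argument with the classic backpropagation algorithm (algorithm = "backprop"), which may need a larger stepmax or a looser threshold to converge.
# Hypothetical learning-rate sweep using standard backpropagation
set.seed(123)
for (lr in c(0.001, 0.01, 0.1)) {
  nn_tuned <- neuralnet(SalePrice ~ .,
                        data = train_transformed_nn1,
                        hidden = c(5),
                        algorithm = "backprop",  # classic backprop, so learningrate is used
                        learningrate = lr,
                        linear.output = TRUE,
                        stepmax = 1e6)
  nn_tuned_preds <- unnormalize(compute(nn_tuned, test_transformed_nn1)$net.result)
  cat("learning rate:", lr,
      "| RMSE:", round(sqrt(mean((test_transformed$SalePrice - nn_tuned_preds)^2)), 2), "\n")
}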
Some additional experiments I would have liked to try, given more time, include combining the NN and RF models so that the NN performs the feature engineering and the RF makes the predictions. I also imputed the missing values in this dataset before training the RF models, so another experiment would be to see whether RF performance changes when the missing values are kept (or flagged) rather than imputed. One more thing to try would be removing YrSold, MoSold, and similar sale-time features, since to me they sound like potential data leakage; a quick version of that check is sketched below.
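A minimal sketch of that leakage check, assuming a train_transformed frame paired with the test_transformed frame used above and a randomForest() call comparable to RF Model 1 (the exact settings of the earlier RF models may differ):
# Hypothetical check: refit without the sale-timing columns and compare RMSE to RF Model 1
library(randomForest)
set.seed(123)
rf_no_sale <- randomForest(SalePrice ~ . - YrSold - MoSold,
                           data = train_transformed,
                           importance = TRUE)
rf_no_sale_preds <- predict(rf_no_sale, newdata = test_transformed)
sqrt(mean((test_transformed$SalePrice - rf_no_sale_preds)^2)) # compare against RF Model 1's RMSE of 18869.41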
Business impact of the final model:
As a business, the top features to pay attention to are (full list can be seen in the model section):
Overall, these features make complete sense: the overall quality rating and the size of a house are what mostly determine the final price. Another insight from this model is the order of importance among specific areas of the house: Living Area > Exterior > Garage > Basement > First Floor > Kitchen > 2nd Floor > Baths. This tells us that a book, or a house in this case, is judged by its cover, since Exterior sits just behind Living Area. It’s interesting that the exterior ranks above the first floor, the kitchen, and the second floor.
As mentioned before, it’s also interesting that YearBuilt is a fairly important feature: newer houses tend to sell for higher prices. Neighborhood, on the other hand, ranks fairly low in importance, which doesn’t match the classic real estate phrase, “location, location, location.” Many features, such as Fence, PoolQC, and HalfBath, have extremely low importance.
What’s great about this model is that it highlights features that are either more or less important than expected. Regarding putting this type of model into production, the best part is that very little data preprocessing is needed. If a neural network were used instead, all of the encoding and scaling would have to be replicated in production as well, increasing the chance of a mishap; the sketch below shows the kind of scaling step that would have to be repeated for every record scored.
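As a minimal illustration of that extra step (the helper and the train / new_record names here are hypothetical), an NN in production would need to store each numeric feature’s training min/max and re-apply the same min-max scaling to every incoming record before scoring:
# Hypothetical helper: re-apply training-set min-max scaling at scoring time
normalize_with <- function(x, train_min, train_max) {
  (x - train_min) / (train_max - train_min)
}
# e.g., for one numeric feature of an incoming record:
# new_record$GrLivArea <- normalize_with(new_record$GrLivArea,
#                                        min(train$GrLivArea), max(train$GrLivArea))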