Introduction

This document outlines my methodology for working through the Kaggle competition ‘House Prices - Advanced Regression Techniques’.

The ultimate goal of this project is to build a model which predicts the sale price of houses in Ames, Iowa.

I will be using R/RStudio for this project. This document is also underpinned by R Markdown as I would like to share the R code & output along with my explanations & thought process.
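For reference, the code chunks below assume a setup chunk (not shown) that loads the required packages. The following is a plausible reconstruction based on the functions used throughout the document; the exact set of packages loaded originally may differ:

#packages assumed by the code in this document (reconstructed setup chunk)
library(ggplot2)      #plots
library(scales)       #comma axis labels
library(plyr)         #revalue()
library(moments)      #skewness() and kurtosis() (assumed source of these functions)
library(psych)        #skew(), used when transforming skewed numeric variables
library(corrplot)     #corrplot.mixed()
library(ggrepel)      #geom_text_repel()
library(randomForest) #randomForest(), importance()
library(caret)        #train(), trainControl(), preProcess()
#the glmnet and xgboost packages also need to be installed for the caret methods used later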

The Kaggle Data

The data consists of the response variable SalePrice and 79 predictor variables (plus an Id column). There are 2919 observations in total. Kaggle splits these roughly 50/50 into two sets, train and test, which are provided as csv files; the sole difference is that values of the response SalePrice are only provided in the train set.

The idea is to use the train set to build a model, then use the model to make predictions on the test set. Following that, predictions are submitted to Kaggle, which returns a score based on the root mean square error (RMSE) of your predictions. You are then ranked on the Kaggle leaderboard, which is where the ‘competition’ element comes in.
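For reference, the RMSE is defined below; as I understand it, the score for this competition is computed between the logs of the predicted and observed sale prices:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(\hat{y}_i) - \log(y_i)\right)^2}$$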

Motivation

My motivations for taking on this project:

  • Continue to develop my skills in various aspects of data science (data cleaning, processing, EDA and modelling).
  • Fix mistakes (or inefficiencies) in my workflow from previous predictive modelling or similar data projects.
  • Build on my efficiency and overall knowledge of R/RStudio. (I find that projects like those on Kaggle create natural opportunities to read articles and discussion forums, or watch YouTube videos, when a problem is encountered, and to find better ways to tackle certain issues or write cleaner code.)
  • I plan to one day become a top statistician, so I enjoy using the free time at my disposal to work through projects like this one which challenge my application of statistical theory.
  • Earning a good rank on Kaggle would be nice too. (Who doesn’t like the bragging rights?!).

Workflow

I have partitioned this document into the following main sections: loading the data, cleaning the data, exploratory data analysis (EDA), feature engineering, and experimenting with models.

Loading the data

First, I load the csv files (train and test) into the R session:

#read test and train sets 
train<-read.csv("train.csv",header = TRUE)
test<-read.csv("test.csv", header = TRUE)
dim(train)
## [1] 1460   81
dim(test)
## [1] 1459   80

The train set consists of 1460 observations of 81 variables (since it includes the response SalePrice), whereas the test set consists of 1459 observations of 80 variables.

Although I will be training models only on the train set, this doesn’t mean that exploratory data analysis (EDA) on the predictors should be restricted to only the train set. Sure, for SalePrice, I do not have the test set values, but for all of the other variables I will use all of the data to understand their distributions and relationships with each other.

For this reason, I next combine the train and test sets into one data frame. To (row) combine two data frames in R, they need to have the same column structure, so I add a SalePrice column to the test data frame and set it to NA for every observation:

#add a SalePrice column to the test and impute NA
test$SalePrice <- NA
#combine
all <- rbind(train, test)

Now that all of the data is held in one data frame, I use the str() command to look at the structure of the data:

str(all)
## 'data.frame':    2919 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

Firstly, I see that the first variable is Id. Unique identifiers offer no predictive value, so I remove this column from the data:

#remove ID 
all$Id <- NULL 

Also, it is clear that many variables have repeated values over the first few observations, indicating a mix of categorical and numeric variables. I also notice NA values within the data, so I display all of the columns containing at least one NA (in descending order of NA count):

#columns with NA values 
cols.with.NAs <- which(colSums(is.na(all)) > 0)
sort(colSums(sapply(all[cols.with.NAs], is.na)), decreasing = TRUE)
##       PoolQC  MiscFeature        Alley        Fence    SalePrice  FireplaceQu 
##         2909         2814         2721         2348         1459         1420 
##  LotFrontage  GarageYrBlt GarageFinish   GarageQual   GarageCond   GarageType 
##          486          159          159          159          159          157 
##     BsmtCond BsmtExposure     BsmtQual BsmtFinType2 BsmtFinType1   MasVnrType 
##           82           82           81           80           79           24 
##   MasVnrArea     MSZoning    Utilities BsmtFullBath BsmtHalfBath   Functional 
##           23            4            2            2            2            2 
##  Exterior1st  Exterior2nd   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF 
##            1            1            1            1            1            1 
##   Electrical  KitchenQual   GarageCars   GarageArea     SaleType 
##            1            1            1            1            1
length(cols.with.NAs)
## [1] 35

There are 34 variables which contain at least one NA (ignoring SalePrice, whose NAs I imputed in order to combine the data frames). With this in mind, I move on to the data cleaning process.

Cleaning the data

I have two main objectives for this part:

  • Correct the data type for every variable based on whether it is an ordered factor, unordered factor or a numeric variable.
  • Appropriately handle the NA values for any variable that has them.

For both of these tasks I primarily make use of the data description provided by Kaggle and perhaps some brief EDA where necessary.

Obviously there are a lot of variables to work through in this dataset, so to avoid repetition I will only include a few examples of correcting the data types and handling the NAs of different variables.

An important note is that typically NA values refer to missing values of a variable. However, based on the data description, an NA is rarely indicative of a missing value in this dataset. I show examples of this below.

Example 1:

FireplaceQu: Fireplace quality

   Ex   Excellent - Exceptional Masonry Fireplace
   Gd   Good - Masonry Fireplace in main level
   TA   Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
   Fa   Fair - Prefabricated Fireplace in basement
   Po   Poor - Ben Franklin Stove
   NA   No Fireplace

FireplaceQu is one of the variables with a vast number of NA values; however, as shown above, NA simply means that the house does not have a fireplace, not that the value is missing!

Therefore, it makes sense to replace the NAs with the value “None”:

#FireplaceQu (ordinal)
all$FireplaceQu[is.na(all$FireplaceQu)] <- 'None'

I also see from the data description that this variable is an ordered factor (or in other words an ordinal variable) because the levels of the factor represent a change in quality.

I correct ordinal variables as follows:

#map the shared quality levels to integers (revalue() is from the plyr package)
Qual.levels <- c('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
all$FireplaceQu<-as.integer(revalue(all$FireplaceQu, Qual.levels))

That is, ordinal variables are stored as integers whose values reflect the ordering of the levels.

#count at each level
table(all$FireplaceQu)
## 
##    0    1    2    3    4    5 
## 1420   46   74  592  744   43

I will show another example of cleaning an ordinal variable so that the process is clear.

Example 2:

PoolQC: Pool quality

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   NA   No Pool
   

Again, in this case NA simply means that the house has no pool, not that the value is missing. PoolQC is also an ordinal variable and is similar to FireplaceQu in that it uses the same level names (in fact, many ordinal variables in this dataset share the same level convention), so here I can reuse the Qual.levels vector from Example 1 to rename the levels:

#PoolQC (ordinal)
all$PoolQC[is.na(all$PoolQC)] <- 'None'
all$PoolQC<-as.integer(revalue(all$PoolQC, Qual.levels))
table(all$PoolQC)
## 
##    0    2    4    5 
## 2909    2    4    4

I continue to run the table() command on the categorical variables as I clean them because it is useful to get a general understanding of the distribution (count at each level) in preparation for EDA.
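As an aside, since many of the quality-style ordinals share the Qual.levels mapping, a small helper function could reduce the repetition. This is only a sketch; encode_quality() is a hypothetical name and is not part of the original workflow:

#hypothetical helper: replace NAs with 'None' and map the shared quality levels to integers
encode_quality <- function(x, levels_map = Qual.levels) {
  x[is.na(x)] <- 'None'
  as.integer(revalue(x, levels_map))
}

#usage would then look like:
#all$FireplaceQu <- encode_quality(all$FireplaceQu)
#all$PoolQC      <- encode_quality(all$PoolQC)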

Example 3:

LotShape: General shape of property

   Reg  Regular 
   IR1  Slightly irregular
   IR2  Moderately Irregular
   IR3  Irregular

LotShape has no NA values so no handling of NAs is required. I simply infer from the data description that it is ordinal and correct the data type in the usual way.

#LotShape (ordinal)
all$LotShape<-as.integer(revalue(all$LotShape, c('IR3'=0, 'IR2'=1, 'IR1'=2, 'Reg'=3)))
table(all$LotShape)
## 
##    0    1    2    3 
##   16   76  968 1859

The only difference here is that I renamed the levels directly inside the revalue() call, since LotShape does not share level names with any other variable. (There is no need to create a separate vector to store the level names because I will not be reusing them.)

Example 4:

MiscFeature: Miscellaneous feature not covered in other categories

   Elev Elevator
   Gar2 2nd Garage (if not described in garage section)
   Othr Other
   Shed Shed (over 100 SF)
   TenC Tennis Court
   NA   None

MiscFeature is another variable with a vast number of NA values. However, once again, this simply means that the house does not have a miscellaneous feature, not that the value is missing. Therefore, I handle the NAs by replacing them with “None”.

Although MiscFeature is clearly a factor, it is an unordered factor, i.e. the levels are distinct but there is no order or progression between them. For unordered factors I use as.factor() to correct the data type.

#MiscFeature (factor)
all$MiscFeature[is.na(all$MiscFeature)] <- 'None'
all$MiscFeature <- as.factor(all$MiscFeature)
table(all$MiscFeature)
## 
## Gar2 None Othr Shed TenC 
##    5 2814    4   95    1

From this point on I will refer to unordered factors as just ‘factors’ and ordered factors as ‘ordinal’ variables.

Example 5:

LotConfig: Lot configuration

   Inside   Inside lot
   Corner   Corner lot
   CulDSac  Cul-de-sac
   FR2  Frontage on 2 sides of property
   FR3  Frontage on 3 sides of property

LotConfig has no NAs to handle and is just a factor since the levels are not ordered:

#LotConfig (factor)
all$LotConfig <- as.factor(all$LotConfig)
table(all$LotConfig)
## 
##  Corner CulDSac     FR2     FR3  Inside 
##     511     176      85      14    2133

Example 6:

In this example I clean a numeric variable.

LotFrontage: Linear feet of street connected to property

This variable has 486 NA values. Unlike in previous examples, this time the data description does not explicitly state how to interpret them. I therefore shortlist three options for dealing with the NAs:

    1. Assume the NAs mean missing and impute the mean or median to replace them.
    2. Assume the NAs mean missing and remove the variable altogether.
    3. Assume the NAs mean that there are zero linear feet of street connected to the property and impute 0 to replace them.

I decided to proceed with option 1. I did not go with option 2 because, although 486 is a lot of NAs, over 80% of the values for the variable are still present. I decided against option 3 because I thought its assumption was less plausible than the assumption underlying options 1 and 2.

Having decided on option 1, I now decide whether mean or median imputation is more appropriate. To do that, I run some basic analysis on the variable:

#LotFrontage (numeric)
summary(all$LotFrontage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   21.00   59.00   68.00   69.31   80.00  313.00     486
ggplot(data=all)+geom_histogram(mapping=aes(x=LotFrontage), fill="blue")

There is not much difference between the median (68) and the mean (69.31) for LotFrontage. However, due to the positive skew I decide to go with median imputation to replace the NAs:

#based on the slight positive skew caused by an extreme value, imputing median
all$LotFrontage[is.na(all$LotFrontage)] <- median(all$LotFrontage, na.rm = TRUE)

Example 7:

KitchenQual: Kitchen quality

   Ex   Excellent
   Gd   Good
   TA   Typical/Average
   Fa   Fair
   Po   Poor
   

KitchenQual is clearly another ordinal variable which uses the quality levels seen previously. The reason I am including it as an additional example is that it has 1 NA value to handle, but unlike before I cannot simply impute “None”, because the data description does not list an NA level for this variable (see above). This implies that the single NA is a genuine missing value.

In most cases, a simple and effective way to handle a small number of missing values in a factor or ordinal variable is to impute the mode (the most common level):

#kitchen quality (ordinal)
all$KitchenQual[is.na(all$KitchenQual)] <- 'TA' #mode imputation
all$KitchenQual<-as.integer(revalue(all$KitchenQual, Qual.levels))
table(all$KitchenQual)
## 
##    2    3    4    5 
##   70 1493 1151  205

Other than the NA handling, KitchenQual is just another ordinal variable so I correct its data type as usual.
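As an aside, base R has no built-in function for the statistical mode of a categorical variable (the table() output above is what confirms that ‘TA’ is the most common level). A small helper along the following lines could find it programmatically; get_mode() is a hypothetical name, not part of the original workflow:

#hypothetical helper: most common (non-NA) level of a categorical variable
get_mode <- function(x) {
  counts <- table(x)               #table() drops NAs by default
  names(counts)[which.max(counts)] #level with the highest count
}
#e.g. get_mode(all$KitchenQual) would have returned "TA" before the conversion to integers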

Example 8:

TotalBsmtSF: Total square feet of basement area

TotalBsmtSF is clearly a numeric variable and there is 1 NA to handle. Here I assume that the NA means that there was no basement to measure for that particular house, i.e. that it is appropriate to impute zero:

#TotalBsmtSF (integer)
all$TotalBsmtSF[is.na(all$TotalBsmtSF)] <-0

This variable is already in a numeric structure (integer) so there is no data type conversion required here.

Example 9:

Utilities: Type of utilities available

   AllPub   All public Utilities (E,G,W,& S)    
   NoSewr   Electricity, Gas, and Water (Septic Tank)
   NoSeWa   Electricity and Gas Only
   ELO  Electricity only

Utilities is an ordinal variable since there is a clear progression between the levels, so I change the data type as previously seen. However, when I run table() I find an interesting output:

#Utilities (ordinal)
table(all$Utilities)
## 
## AllPub NoSeWa 
##   2916      1

Since only a single observation differs from the rest of the data, I remove this variable; it offers essentially no predictive power.

#remove this variable 
all$Utilities <- NULL

This concludes the examples for this section. When cleaning the rest of the variables, I was simply repeating the steps shown in one of the nine examples above.

After working through all of the variables, I run the str() command again to see the corrected structure for each variable:

str(all)
## 'data.frame':    2919 obs. of  79 variables:
##  $ MSSubClass   : Factor w/ 16 levels "1 story 1946+",..: 6 1 6 7 6 5 1 6 5 16 ...
##  $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 68 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Alley        : Factor w/ 3 levels "Grvl","None",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ LotShape     : int  3 3 2 2 2 2 3 2 3 3 ...
##  $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
##  $ LandSlope    : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
##  $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
##  $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
##  $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
##  $ MasVnrType   : int  1 0 1 0 1 0 2 2 0 0 ...
##  $ MasVnrArea   : num  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : int  4 3 4 3 4 3 4 3 3 3 ...
##  $ ExterCond    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
##  $ BsmtQual     : int  4 4 4 3 4 4 5 4 3 3 ...
##  $ BsmtCond     : int  3 3 3 4 3 3 3 3 3 3 ...
##  $ BsmtExposure : int  1 4 2 1 3 1 3 2 1 1 ...
##  $ BsmtFinType1 : int  6 5 6 5 6 6 6 5 1 6 ...
##  $ BsmtFinSF1   : num  706 978 486 216 655 ...
##  $ BsmtFinType2 : int  1 1 1 1 1 1 1 4 1 1 ...
##  $ BsmtFinSF2   : num  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : num  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : num  856 1262 920 756 1145 ...
##  $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ HeatingQC    : int  5 5 5 4 5 5 5 5 4 5 ...
##  $ CentralAir   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : num  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : int  4 3 4 4 4 3 4 3 3 3 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : int  7 7 7 7 7 7 7 7 6 7 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : int  0 3 3 4 3 0 4 3 3 3 ...
##  $ GarageType   : Factor w/ 7 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : int  2 2 2 1 2 1 2 2 1 2 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : num  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : int  3 3 3 3 3 3 3 3 2 4 ...
##  $ GarageCond   : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ PavedDrive   : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fence        : Factor w/ 5 levels "GdPrv","GdWo",..: 5 5 5 5 5 3 5 5 5 5 ...
##  $ MiscFeature  : Factor w/ 5 levels "Gar2","None",..: 2 2 2 2 2 4 2 4 2 2 ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : Factor w/ 12 levels "1","2","3","4",..: 2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

Also, I double check that all of the NAs have now been handled (with the exception of SalePrice of course):

#count of NAs in each column
colSums(is.na(all))
##    MSSubClass      MSZoning   LotFrontage       LotArea        Street 
##             0             0             0             0             0 
##         Alley      LotShape   LandContour     LotConfig     LandSlope 
##             0             0             0             0             0 
##  Neighborhood    Condition1    Condition2      BldgType    HouseStyle 
##             0             0             0             0             0 
##   OverallQual   OverallCond     YearBuilt  YearRemodAdd     RoofStyle 
##             0             0             0             0             0 
##      RoofMatl   Exterior1st   Exterior2nd    MasVnrType    MasVnrArea 
##             0             0             0             0             0 
##     ExterQual     ExterCond    Foundation      BsmtQual      BsmtCond 
##             0             0             0             0             0 
##  BsmtExposure  BsmtFinType1    BsmtFinSF1  BsmtFinType2    BsmtFinSF2 
##             0             0             0             0             0 
##     BsmtUnfSF   TotalBsmtSF       Heating     HeatingQC    CentralAir 
##             0             0             0             0             0 
##    Electrical     X1stFlrSF     X2ndFlrSF  LowQualFinSF     GrLivArea 
##             0             0             0             0             0 
##  BsmtFullBath  BsmtHalfBath      FullBath      HalfBath  BedroomAbvGr 
##             0             0             0             0             0 
##  KitchenAbvGr   KitchenQual  TotRmsAbvGrd    Functional    Fireplaces 
##             0             0             0             0             0 
##   FireplaceQu    GarageType   GarageYrBlt  GarageFinish    GarageCars 
##             0             0             0             0             0 
##    GarageArea    GarageQual    GarageCond    PavedDrive    WoodDeckSF 
##             0             0             0             0             0 
##   OpenPorchSF EnclosedPorch    X3SsnPorch   ScreenPorch      PoolArea 
##             0             0             0             0             0 
##        PoolQC         Fence   MiscFeature       MiscVal        MoSold 
##             0             0             0             0             0 
##        YrSold      SaleType SaleCondition     SalePrice 
##             0             0             0          1459

Throughout the data cleaning process, I was able to categorise many of the variables into groups. For example, there are several basement variables (BsmtQual, BsmtCond, BsmtExposure, etc.) which measure the quality, condition or dimensions of the basement of the house.

Grouping the variables helps simplify the task and, coupled with EDA, helps formulate hypotheses about which variables are likely to be the most important in a model.

With a clean dataset, I move onto the exploratory data analysis (EDA) section.

Exploratory Data Analysis

The main objectives of exploratory data analysis (EDA) here are to understand the distribution of the response, SalePrice, and to identify which predictors have the strongest relationships with it.

Also, to keep this section concise, I will only include key findings and visualisations.

The response, SalePrice.

A logical place to start is to analyse the response variable, SalePrice.

#The response SalePrice 
summary(all$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   34900  129975  163000  180921  214000  755000    1459
ggplot(data=all[!is.na(all$SalePrice),], mapping = aes(x=SalePrice)) +
  geom_histogram(fill="blue", binwidth = 10000) +
  scale_x_continuous(breaks= seq(0, 800000, by=100000), labels = comma)

skewness(all[!is.na(all$SalePrice),]$SalePrice)
## [1] 1.880941
kurtosis(all[!is.na(all$SalePrice),]$SalePrice)
## [1] 9.509812

SalePrice has a clear positive skew. I have found that this can often be problematic when modelling. For example, when fitting a linear regression model, the extreme values on the right tail could lead to violation of the normality of residuals assumption.

I consider the distribution of the natural logarithm of the SalePrice:

#log transform on SalePrice 
ggplot(data=all[!is.na(all$SalePrice),]) + geom_histogram(mapping = aes(x=log(SalePrice)), fill="green")

skewness(log(all[!is.na(all$SalePrice),]$SalePrice))
## [1] 0.1212104
kurtosis(log(all[!is.na(all$SalePrice),]$SalePrice))
## [1] 3.802656

log(SalePrice) has a fairly symmetrical distribution. As a comparison, a normal distribution has skewness equal to 0 and kurtosis equal to 3. I will transform SalePrice before trying to fit models later in the project.
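For reference, the moment-based definitions of skewness and kurtosis used above (under which a normal distribution has skewness 0 and kurtosis 3) are:

$$\text{skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^{3/2}}, \qquad \text{kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^{2}}$$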

Identifying ‘important’ variables

Before analysing the relationships between SalePrice and the predictors, it makes sense to understand which variables are the most ‘important’. I use the word ‘important’ loosely because the way importance is measured depends on the technique used. I will use two techniques:

  • Specifically for numeric and ordinal variables, I run a correlation matrix. Here I measure variable importance based on correlation with SalePrice. Additionally, this can help identify multicollinearity between the predictors.

  • I will also fit a random forest model (with 100 trees, using the randomForest package) to predict SalePrice. Here I use the importance() function to assess the importance of each variable based on how much the model’s mean squared error would increase (%IncMSE) if that variable’s values were randomly permuted. This is ad hoc, but fast and easy to do.

Correlation Matrix to determine importance

Starting with a correlation matrix:

#numerics: columns of numeric variables (including ordinals stored as integers);
#this is an assumed definition, since the chunk that originally created it is not shown
numerics <- which(sapply(all, is.numeric))
#dataframe of numeric variables (including ordinal)
df_all_numVar <- all[, numerics]
#correlations of all numeric variables
cor_numVar <- cor(df_all_numVar, use="pairwise.complete.obs") 

#sort on decreasing correlations with SalePrice
cor_sorted <- as.matrix(sort(cor_numVar[,'SalePrice'], decreasing = TRUE))
#select only high correlations (|r| > 0.5 with SalePrice)
HighCor <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
cor_numVar <- cor_numVar[HighCor, HighCor]
corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt", tl.cex = 0.7,cl.cex = .7, number.cex=.7)

length(HighCor)
## [1] 17

There are 16 predictors which have an absolute correlation with SalePrice greater than 0.5 (the 17th element of HighCor is SalePrice itself).

The correlation matrix shows that the two variables with the highest correlation with SalePrice are OverallQual (Overall Quality) and GrLivArea (Ground Living Area in square feet).

OverallQual (Overall Quality) has the highest correlation with SalePrice (0.79). Since OverallQual is an ordinal variable I visualise the strength of the relationship with a multi-level boxplot:

ggplot(data=all[!is.na(all$SalePrice),], aes(x=factor(OverallQual), y=SalePrice))+
  geom_boxplot(col='blue') + labs(x='Overall Quality') +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)+ geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3)

The boxplot illustrates that each time the quality level increases by one, there is a clear upward shift in the distribution of SalePrice.

GrLivArea (Ground Living Area in square feet) has the second highest correlation with SalePrice (0.71). As GrLivArea is numeric, I visualise the relationship with a scatterplot:

ggplot(data=all[!is.na(all$SalePrice),], mapping = aes(x=GrLivArea, y=SalePrice))+geom_point(mapping=aes(color=OverallQual))+scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)+geom_text_repel(aes(label = ifelse(all$GrLivArea[!is.na(all$SalePrice)]>4500, rownames(all), ''))) +labs(x="Ground Living Area (sqft)")

Two outliers appear to severely distort the linearity between GrLivArea and SalePrice: they have extremely high values of GrLivArea (over 4500 sqft) but only average values of SalePrice.

Since just 2 observations out of 1460 distort the linearity so wildly, I decide that it is reasonable to remove them from the data.

#remove outliers 
all <- all[-c(524, 1299),]
cor(all$SalePrice[!is.na(all$SalePrice)], all$GrLivArea[!is.na(all$SalePrice)])
## [1] 0.7349682

The correlation between SalePrice and GrLivArea increases from roughly 0.71 to 0.73 as a result.

Random Forest to determine importance

Next, I deploy the second technique for understanding variable importance, which is to fit a random forest predicting SalePrice. Unlike the correlation matrix, this provides insight into the importance of the factor variables as well as the numeric and ordinal ones.

#random forest
set.seed(1234)
RF <- randomForest(x=all[1:1458,-79], y=all$SalePrice[1:1458], ntree=100,importance=TRUE)
imp_RF <- importance(RF)
imp_DF <- data.frame(Variables = row.names(imp_RF), MSE = imp_RF[,1])
imp_DF <- imp_DF[order(imp_DF$MSE, decreasing = TRUE),]

imp_DF[1:25,]

According to the random forest, Neighborhood is the most important factor variable, followed by MSSubClass. Again, I visualise these relationships with SalePrice by producing multi-level boxplots. (In the plots I arrange the levels in ascending order of median SalePrice):

#SalePrice by Neighborhood 
ggplot(all[!is.na(all$SalePrice),], aes(x=reorder(Neighborhood, SalePrice, FUN=median), y=SalePrice)) +
  geom_boxplot(color = "blue") +theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
  geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3)+labs(x= "Neighbourhood")

There is clearly significant predictive power in the Neighborhood variable, although 25 levels for this variable does seem excessive. I keep this in mind for feature engineering.

SalePrice by MSSubClass:

ggplot(all[!is.na(all$SalePrice),], aes(x=reorder(MSSubClass, SalePrice, FUN=median), y=SalePrice))+geom_boxplot(color = "blue") +theme(axis.text.x = element_text(angle = 45, hjust = 1))+scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) + labs(x="MSSubClass")

The relationship here is more subtle; however, MSSubClass seems to offer some predictive power.

At this point, having built a strong understanding of the variables, I move onto feature engineering.

Feature Engineering

The main objectives for this section are to create and test new features, one-hot encode the factor variables, pre-process the numeric variables, and drop columns with zero or near-zero variance.

Testing new features

Neighborhood:

First, I create a simplified Neighborhood variable since it appeared to have too many levels. I use the multi-level boxplot from earlier to produce a new variable with 3 ordinal levels (based on the distribution of SalePrice), each level consisting of a group of neighborhoods:

#reducing levels based on SalePrice
all$NeighRich[all$Neighborhood %in% c('StoneBr', 'NridgHt', 'NoRidge')] <- 2
all$NeighRich[!all$Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale', 'StoneBr', 'NridgHt', 'NoRidge')] <- 1
all$NeighRich[all$Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale')] <- 0

#multi-level boxplot
ggplot(all[!is.na(all$SalePrice),], aes(x=as.factor(NeighRich), y=SalePrice)) +
  geom_boxplot(color = "blue") +
  scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
  geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3)+labs(x= "Neighbourhood reduced levels")

Levels ‘0’ and ‘2’ each hold 3 of the initial Neighborhood levels; level ‘1’ contains the other 19.

The boxplot illustrates a significant change in the distribution of SalePrice across the engineered levels. In fact, the 1st quartile of level ‘1’ is greater than the 3rd quartile of level ‘0’, and the 1st quartile of level ‘2’ is greater than the 3rd quartile of level ‘1’. This indicates that, based on the train set, this variable is highly predictive.

‘TotalBathrooms’:

Also, when studying the data description I noticed that there are several variables relating to the number of bathrooms in the house. I create a total bathrooms variable which is a weighted sum of the full and half bathrooms:

#total bathrooms 
all$TotalBathrooms <- all$FullBath + (all$HalfBath*0.5) + all$BsmtFullBath + (all$BsmtHalfBath*0.5)

ggplot(data=all[!is.na(all$SalePrice),], aes(x=as.factor(TotalBathrooms), y=SalePrice))+
  geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) + geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3)+labs(x= "Total Bathrooms")

cor(all$SalePrice, all$TotalBathrooms, use= "pairwise.complete.obs")
## [1] 0.6358963

It is unsurprising that the number of bathrooms tends to be a key component in determining house price- the fairly strong linearity here makes sense.

‘Age’:

Additionally, when exploring the variables I noticed two things about the YearBuilt (Year Built) and YearRemodAdd (Year Remodeled) variables:

  • YearRemodAdd >= YearBuilt for all observations.
  • YearRemodAdd = YearBuilt quite often.

Using this I inferred that when YearRemodAdd = YearBuilt, this must mean that the house had never been remodeled.

Therefore, I defined an Age variable by subtracting YearRemodAdd from YrSold (Year Sold). Defining an age variable in this way means that a value can be derived for all observations, regardless of whether the house was remodeled or not.

#house age 
all$Age <- as.numeric(all$YrSold)-all$YearRemodAdd
ggplot(data=all[!is.na(all$SalePrice),], aes(x=Age, y=SalePrice))+
  geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)

cor(all$SalePrice, all$Age, use= "pairwise.complete.obs")
## [1] -0.5097058

Age has a significant negative correlation with SalePrice which makes sense intuitively.

Sold New?

Again using the YrSold (year sold) and YearBuilt (year built) variables, I create a binary variable (Yes/No) which indicates whether the house was sold in the same year that it was built, i.e. was the house sold new?

I tested this feature based on the hypothesis: a premium is charged when a house is sold brand new. (This is just a general hypothesis that I wanted to test based on my existing knowledge of housing markets).

#was the house sold in the same year it was built? (Y/N)
all$IsNew <- ifelse(all$YrSold==all$YearBuilt, 1, 0)
table(all$IsNew)
## 
##    0    1 
## 2803  114
ggplot(all[!is.na(all$SalePrice),], aes(x=as.factor(IsNew), y=SalePrice)) +
  geom_bar(stat='summary', fun.y = "median", fill='blue') +
  geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=6) +
  scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
  theme_grey(base_size = 18) + labs(x= "Sold new?") + geom_hline(yintercept=163000, linetype="dashed") #dashed line is median SalePrice

Only 4.3% of houses (in the train set) were sold new; however, since their median SalePrice is around $100,000 greater than the median of those not sold new, I keep this variable.

‘TotalSqFeet’:

Also, since there are multiple ‘square feet’ numeric variables, I tried to consolidate logical combinations of them. One of the first I tried was TotalSqFeet, which I defined as the sum of GrLivArea (Ground Living Area) and TotalBsmtSF (Total Basement Square Feet).

#Total square feet 
all$TotalSqFeet <- all$GrLivArea + all$TotalBsmtSF
ggplot(data=all[!is.na(all$SalePrice),], aes(x=TotalSqFeet, y=SalePrice))+
  geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +labs(x= "Total Sq Feet")

cor(all$SalePrice, all$TotalSqFeet, use= "pairwise.complete.obs")
## [1] 0.829042

This variable is highly correlated with SalePrice (0.83).

One-hot encoding

One-hot encoding is the process of converting factor variables into a numerical format, which allows the data to be fed into machine learning algorithms. I take this step because I plan to experiment with multiple machine learning algorithms in the modelling section and I would like the data to be as compatible as possible.

Essentially, I create a new data frame which has a column for every level of every factor. For each factor variable of a given observation, one level column contains 1 and the rest contain 0; the column containing 1 corresponds to the level which that particular observation takes.
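As a tiny illustration of what the model.matrix(~.-1, ...) call used below does for a single factor (the toy data frame here is my own and is not part of the competition data):

#toy example: one 0/1 column per level, with the intercept dropped by the ~.-1 formula
toy <- data.frame(BldgType = factor(c("1Fam", "Twnhs", "1Fam")))
model.matrix(~.-1, toy) #produces columns BldgType1Fam and BldgTypeTwnhs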

First, creating a vector containing the names of the numeric variables:

numericVarNames<- c('LotFrontage', 'LotArea','BsmtFullBath','BsmtHalfBath','BsmtFinSF2','BsmtUnfSF','MasVnrArea','YearBuilt','X1stFlrSF', 'X2ndFlrSF', 'LowQualFinSF','GrLivArea', 'FullBath', 'HalfBath', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'X3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'TotalBathrooms', 'Age', 'TotalSqFeet')

One-hot encoding:

DFfactors <- all[, !(names(all) %in% numericVarNames)]
DFfactors <- DFfactors[, names(DFfactors) != 'SalePrice']
#dummify categorical variables 
DFdummies <- as.data.frame(model.matrix(~.-1, DFfactors))
dim(DFdummies)
## [1] 2917  205

The new data frame has a large number of columns because each factor variable has a column for every level.

Pre-Processing Numeric Variables

I know from the EDA that I conducted that some of the numeric variables are highly skewed. Therefore, I decide to transform any numeric variable x with an absolute skew greater than 1 by taking the natural log of x+1. Taking the log of x+1 rather than x avoids taking the log of zero for observations where x=0.

DFnumeric <- all[, names(all) %in% numericVarNames]

for(i in 1:ncol(DFnumeric)){
  if (abs(skew(DFnumeric[,i]))>1){
    DFnumeric[,i] <- log(DFnumeric[,i] +1)
  }
}

Also, I use the preProcess() function (from the caret package) to centre and scale the numeric variables:

PreNum <- preProcess(DFnumeric, method=c("center", "scale"))
print(PreNum)
## Created from 2917 samples and 24 variables
## 
## Pre-processing:
##   - centered (24)
##   - ignored (0)
##   - scaled (24)
DFnorm <- predict(PreNum, DFnumeric)

Dropping variables with zero (or near-zero) variance

Next, I drop variables or levels with zero or near-zero variance, because columns with zero variance offer no predictive power and columns with near-zero variance are likely to encourage overfitting and therefore increase the RMSE of predictions.

#Dropping levels with zero (or near zero) variance in the test or train set 
#check if some values are absent in the test set
ZerocolTest <- which(colSums(DFdummies[(nrow(all[!is.na(all$SalePrice),])+1):nrow(all),])==0)
colnames(DFdummies[ZerocolTest])
##  [1] "Condition2RRAe"     "Condition2RRAn"     "Condition2RRNn"    
##  [4] "HouseStyle2.5Fin"   "RoofMatlMembran"    "RoofMatlMetal"     
##  [7] "RoofMatlRoll"       "Exterior1stImStucc" "Exterior1stStone"  
## [10] "Exterior2ndOther"   "HeatingOthW"        "ElectricalMix"     
## [13] "MiscFeatureTenC"
DFdummies <- DFdummies[,-ZerocolTest] #removing predictors
#check if some values are absent in the train set
ZerocolTrain <- which(colSums(DFdummies[1:nrow(all[!is.na(all$SalePrice),]),])==0)
colnames(DFdummies[ZerocolTrain])
## [1] "MSSubClass1,5 story PUD all"
DFdummies <- DFdummies[,-ZerocolTrain] #removing predictors
fewOnes <- which(colSums(DFdummies[1:nrow(all[!is.na(all$SalePrice),]),])<10)
colnames(DFdummies[fewOnes])
##  [1] "MSSubClass1 story unf attic" "LotConfigFR3"               
##  [3] "NeighborhoodBlueste"         "NeighborhoodNPkVill"        
##  [5] "Condition1PosA"              "Condition1RRNe"             
##  [7] "Condition1RRNn"              "Condition2Feedr"            
##  [9] "Condition2PosA"              "Condition2PosN"             
## [11] "RoofStyleMansard"            "RoofStyleShed"              
## [13] "RoofMatlWdShake"             "RoofMatlWdShngl"            
## [15] "Exterior1stAsphShn"          "Exterior1stBrkComm"         
## [17] "Exterior1stCBlock"           "Exterior2ndAsphShn"         
## [19] "Exterior2ndBrk Cmn"          "Exterior2ndCBlock"          
## [21] "Exterior2ndStone"            "FoundationStone"            
## [23] "FoundationWood"              "HeatingGrav"                
## [25] "HeatingWall"                 "ElectricalFuseP"            
## [27] "GarageTypeCarPort"           "MiscFeatureOthr"            
## [29] "SaleTypeCon"                 "SaleTypeConLD"              
## [31] "SaleTypeConLI"               "SaleTypeConLw"              
## [33] "SaleTypeCWD"                 "SaleTypeOth"                
## [35] "SaleConditionAdjLand"
DFdummies <- DFdummies[,-fewOnes] #removing predictors
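As an aside, the caret package also provides nearZeroVar() to flag low-variance columns automatically. I used the manual checks above instead, so the following is purely illustrative:

#flag zero- and near-zero-variance columns across the dummified data
nzv <- nearZeroVar(DFdummies, saveMetrics = TRUE)
head(nzv[nzv$zeroVar | nzv$nzv, ]) #inspect the flagged columns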

Combining the engineered dataframes:

#combine the normalized numeric variables with the dummified factor variables into one data frame
combined <- cbind(DFnorm, DFdummies)

Taking the natural log of the response, SalePrice:

#log(SalePrice)
all$SalePrice <- log(all$SalePrice)

Finally, splitting the combined dataframe into train and test components:

train1 <- combined[!is.na(all$SalePrice),]
test1 <- combined[is.na(all$SalePrice),]

We can now use the cleaned and transformed train set (train1) to experiment with models.

Experimenting with models

In this section I will predominantly use the caret package to fit and assess models. I use 5-fold cross-validation to make fair assessments of model performance. The final model will be the best-performing model, i.e. the model with the lowest cross-validated root mean square error (RMSE). Models predict log(SalePrice), meaning that predictions from the final model will need to be exponentiated (base e) to recover SalePrice.

I will experiment with the following: linear regression (on the full and a reduced set of predictors), regularized regression (lasso and ridge), and extreme gradient boosting (XGBoost).

First, I use the set.seed() command to fix the randomized folds for cross-validation:

set.seed(1234)

Linear Regression

Starting with a linear regression model using all of the predictors:

my_control <-trainControl(method="cv", number=5)
linear_mod1 = train(x=train1,y=all$SalePrice[!is.na(all$SalePrice)], method='lm', trControl= my_control)
min(linear_mod1$results$RMSE)
## [1] 0.1173242

It is likely that the above linear regression is overfitted because all of the predictors are used by the model.

A linear regression model is fitted by minimizing the sum of squared residuals; when there are a lot of predictors relative to the sample size, this can lead to an overfit model. Overfitting occurs when the model starts to explain the noise in the train set and therefore does not generalize well to new data.
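For reference, ordinary least squares chooses the coefficients purely by minimizing the residual sum of squares, with nothing in the objective to discourage an overly flexible fit:

$$\hat{\beta} = \underset{\beta}{\arg\min}\ \sum_{i=1}^{n}\left(y_i - x_i^{T}\beta\right)^2$$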

Therefore, next I try running another linear regression on a reduced dataset (based on EDA):

Linear Regression on a reduced train set

#linear regression on reduced train set
#train2: a reduced set of predictors chosen from the EDA (the chunk constructing it is not shown)
linear_mod2 = train(x=train2,y=all$SalePrice[!is.na(all$SalePrice)], method='lm', trControl= my_control)
min(linear_mod2$results$RMSE)
## [1] 0.1298363

RMSE comparison:

model_list <- list(lm1 = linear_mod1, lm2 = linear_mod2)
resamples = resamples(model_list)
summary(resamples)
## 
## Call:
## summary.resamples(object = resamples)
## 
## Models: lm1, lm2 
## Number of resamples: 5 
## 
## MAE 
##           Min.    1st Qu.     Median       Mean    3rd Qu.       Max. NA's
## lm1 0.07531025 0.08110973 0.08244776 0.08154510 0.08425569 0.08460208    0
## lm2 0.09203930 0.09469219 0.09488237 0.09509678 0.09661205 0.09725797    0
## 
## RMSE 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lm1 0.1115543 0.1155767 0.1160029 0.1173242 0.1189430 0.1245443    0
## lm2 0.1277336 0.1292786 0.1293744 0.1298363 0.1305633 0.1322314    0
## 
## Rsquared 
##         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lm1 0.902428 0.9066380 0.9134711 0.9145326 0.9170927 0.9330333    0
## lm2 0.883119 0.8888129 0.8973805 0.8944530 0.8984089 0.9045438    0
bwplot(resamples, metric = "RMSE")

The RMSE is higher for the second linear model indicating that too much information was lost by reducing the number of predictors.

Regularized Regression

With regularized regression, the model is fitted by minimizing the sum of squared residuals plus a penalization term. The penalization term shrinks the model coefficients, making the model less sensitive to individual predictors and hedging against the possibility that it has been fitted too precisely to the train set.

The definition of the penalization term depends on which regularization technique is being used. There are 3 techniques:

  • Lasso (L1) Regularization
  • Ridge (L2) Regularization
  • Elastic Net (a combination of Lasso and Ridge)

The difference is that the Ridge penalty shrinks the coefficients of correlated predictors towards each other, whereas the Lasso penalty tends to make the model pick one of them and set the others to zero. In all of the above cases the strength of the penalty depends on the hyperparameter lambda, and I determine the best value for lambda using cross-validation.
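For reference, the elastic net objective as implemented in glmnet (which caret calls under the hood when method='glmnet') is, to the best of my understanding:

$$\min_{\beta_0,\beta}\ \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{T}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]$$

Setting alpha = 1 gives the Lasso penalty, alpha = 0 gives the Ridge penalty, and intermediate values give the Elastic Net.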

lassoGrid <- expand.grid(alpha = 1, lambda = seq(0.001,0.1,by = 0.0005))
lasso_mod <- train(x=train1, y=all$SalePrice[!is.na(all$SalePrice)], method='glmnet', trControl= my_control, tuneGrid=lassoGrid) 
lasso_mod$bestTune
min(lasso_mod$results$RMSE)
## [1] 0.1134197

Comparing RMSE:

model_list <- list(lm1 = linear_mod1, lasso = lasso_mod)
resamples = resamples(model_list)
summary(resamples)
## 
## Call:
## summary.resamples(object = resamples)
## 
## Models: lm1, lasso 
## Number of resamples: 5 
## 
## MAE 
##             Min.    1st Qu.     Median       Mean    3rd Qu.       Max. NA's
## lm1   0.07531025 0.08110973 0.08244776 0.08154510 0.08425569 0.08460208    0
## lasso 0.07680830 0.07855494 0.07942849 0.07940335 0.08105786 0.08116716    0
## 
## RMSE 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lm1   0.1115543 0.1155767 0.1160029 0.1173242 0.1189430 0.1245443    0
## lasso 0.1073087 0.1074690 0.1114766 0.1134197 0.1175392 0.1233047    0
## 
## Rsquared 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lm1   0.9024280 0.9066380 0.9134711 0.9145326 0.9170927 0.9330333    0
## lasso 0.9118777 0.9165367 0.9204476 0.9199835 0.9238094 0.9272464    0
bwplot(resamples, metric = "RMSE")

The lasso regression model performs better than the initial linear regression. I also fitted a ridge regression model (by setting hyperparameter alpha = 0 in the expand.grid() function) but the resultant RMSE was higher than the RMSE for the lasso model.
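For completeness, the ridge fit looks much the same as the lasso fit above, just with alpha fixed at 0. A sketch (reusing the same lambda range as the lasso, which may not be optimal for ridge):

#ridge regression: same setup as the lasso, with alpha fixed at 0
ridgeGrid <- expand.grid(alpha = 0, lambda = seq(0.001, 0.1, by = 0.0005))
ridge_mod <- train(x=train1, y=all$SalePrice[!is.na(all$SalePrice)], method='glmnet',
                   trControl=my_control, tuneGrid=ridgeGrid)
min(ridge_mod$results$RMSE)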

Extreme Gradient Boosting

In this project I wanted to experiment with an extreme gradient boosting model because gradient boosting methods have become very popular in data science in recent years. I am yet to study the theory of gradient boosting machines; however, I know that extreme gradient boosting is the more regularized form (it guards more aggressively against overfitting). At the time of writing up this project I have started to read ‘Applied Predictive Modeling’ by Max Kuhn (author of the caret package) and Kjell Johnson, and I will soon encounter real-world examples of fitting models like these. For now though, I fit the model out of curiosity.

As for tuning the parameters, I used the help of Kaggle kernels to try to compose a sensible tuning grid:

#xgboost
xgbTuningGrid = expand.grid(nrounds = c(50, 100), 
                            lambda = seq(0.1, 0.5, 0.1), 
                            alpha = seq(0.1, 0.5, 0.1),
                            eta = c(0.3, 0.4))
xgb_mod = train(x=train1, y=all$SalePrice[!is.na(all$SalePrice)],
                   tuneLength = 3,
                   method = "xgbLinear",
                   trControl = my_control,
                   tuneGrid = xgbTuningGrid)

Comparing RMSE with the Lasso regression model:

model_list = list(gbm = xgb_mod, lasso = lasso_mod)
resamples = resamples(model_list)
summary(resamples)
## 
## Call:
## summary.resamples(object = resamples)
## 
## Models: gbm, lasso 
## Number of resamples: 5 
## 
## MAE 
##            Min.    1st Qu.     Median       Mean    3rd Qu.       Max. NA's
## gbm   0.0857297 0.08625690 0.08806357 0.09060407 0.09512357 0.09784660    0
## lasso 0.0768083 0.07855494 0.07942849 0.07940335 0.08105786 0.08116716    0
## 
## RMSE 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## gbm   0.1128568 0.1217299 0.1318506 0.1295182 0.1377619 0.1433920    0
## lasso 0.1073087 0.1074690 0.1114766 0.1134197 0.1175392 0.1233047    0
## 
## Rsquared 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## gbm   0.8687095 0.8888698 0.8993188 0.8953026 0.9070067 0.9126080    0
## lasso 0.9118777 0.9165367 0.9204476 0.9199835 0.9238094 0.9272464    0
bwplot(resamples, metric = "RMSE")

The XGBoost model has a greater RMSE than the lasso model. This is likely because the tuning parameters are not optimal and many have been left at their defaults. However, this is only an initial model; I plan to revisit this project and experiment further when I have a stronger conceptual understanding of the algorithm. For now, the lasso regression remains the best-performing model.
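Finally, a sketch of how a Kaggle submission could be produced from the best model. Kaggle expects an Id and a SalePrice column; the Id values are still available in the original test data frame, and the file name here is my own choice:

#predict on the processed test set and back-transform from the log scale
lasso_preds <- exp(predict(lasso_mod, newdata = test1))
#assemble and write the submission file
submission <- data.frame(Id = test$Id, SalePrice = lasso_preds)
write.csv(submission, "lasso_submission.csv", row.names = FALSE)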