This document outlines my methodology for working through the Kaggle competition ‘House Prices - Advanced Regression Techniques’.
The ultimate goal of this project is to build a model which predicts the sale price of houses in Ames, Iowa.
I will be using R/RStudio for this project. This document is written in R Markdown because I would like to share the R code and output along with my explanations and thought process.
The data consists of the response variable SalePrice and 80 predictor variables, with 2919 observations in total. Kaggle splits these observations roughly 50/50 into two sets, train and test, which are provided as csv files; the sole difference is that values for the response SalePrice are only provided in the train set.
The idea is to use the train set to build a model, then use the model to make predictions on the test set. Following that, predictions are submitted to Kaggle, who return a score based on the root mean square error (RMSE) of your predictions. You are then ranked on the Kaggle leaderboard, which is where the ‘competition’ element comes in.
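For reference, the RMSE is simply the square root of the mean squared difference between predictions and observed values; a minimal sketch in R (the vector names predicted and actual are just placeholders):
#root mean square error of a set of predictions
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}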
My motivations for taking on this project:
I have partitioned this document into four main sections:
First, I load the csv files (train and test) into the R session:
#read test and train sets
train<-read.csv("train.csv",header = TRUE)
test<-read.csv("test.csv", header = TRUE)
dim(train)
## [1] 1460 81
dim(test)
## [1] 1459 80
The train set consists of 1460 observations of 81 variables (since it includes the response SalePrice), whereas the test set consists of 1459 observations of 80 variables.
Although I will be training models only on the train set, this doesn't mean that exploratory data analysis (EDA) on the predictors should be restricted to the train set. Of course, I do not have the test set values of SalePrice, but for all of the other variables I can use all of the data to understand their distributions and relationships with each other.
For this reason, I next combine the train and test sets into one data frame. To (row) combine two data frames in R, they need to have the same column structure, so I add a SalePrice column to the test data frame and fill it with NA for every observation:
#add a SalePrice column to the test and impute NA
test$SalePrice <- NA
#combine
all <- rbind(train, test)
Now that all of the data is held in one data frame, I use the
str() command to look at the structure of the data:
str(all)
## 'data.frame': 2919 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
Firstly, I see that the first variable is Id. Unique identifiers are of no use in terms of prediction so I remove this column from the data:
#remove ID
all$Id <- NULL
Also, it is clear that many of the variables have repeated values over the first few observations, indicating a mix of categorical and numeric variables. I also notice NA values within the data, so I display all of the columns containing at least one NA (in descending order of NA count):
#columns with NA values
cols.with.NAs <- which(colSums(is.na(all)) > 0)
sort(colSums(sapply(all[cols.with.NAs], is.na)), decreasing = TRUE)
## PoolQC MiscFeature Alley Fence SalePrice FireplaceQu
## 2909 2814 2721 2348 1459 1420
## LotFrontage GarageYrBlt GarageFinish GarageQual GarageCond GarageType
## 486 159 159 159 159 157
## BsmtCond BsmtExposure BsmtQual BsmtFinType2 BsmtFinType1 MasVnrType
## 82 82 81 80 79 24
## MasVnrArea MSZoning Utilities BsmtFullBath BsmtHalfBath Functional
## 23 4 2 2 2 2
## Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 1 1 1 1 1
## Electrical KitchenQual GarageCars GarageArea SaleType
## 1 1 1 1 1
length(cols.with.NAs)
## [1] 35
There are 34 variables which contain at least 1 NA (ignoring SalePrice, where I imputed the NAs in order to combine the data frames). With this in mind, I move on to the data cleaning process.
I have two main objectives for this part:
For both of these tasks I primarily make use of the data description provided by Kaggle, along with some brief EDA where necessary.
Obviously there are a lot of variables to work through in this dataset, so to avoid repetition I will only include a few examples of correcting the data types and handling the NAs of different variables.
An important note is that typically NA values refer to missing values of a variable. However, based on the data description, an NA is rarely indicative of a missing value in this dataset. I show examples of this below.
Example 1:
FireplaceQu: Fireplace quality
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
FireplaceQu is one of the variables with a large number of NA values; however, as shown above, NA simply means that the house does not have a fireplace, not that the value is missing!
Therefore, it makes sense to replace the NAs with the value “None”:
#FireplaceQu (ordinal)
all$FireplaceQu[is.na(all$FireplaceQu)] <- 'None'
I also see from the data description that this variable is an ordered factor (or in other words an ordinal variable) because the levels of the factor represent a change in quality.
I correct ordinal variables as follows:
Qual.levels <- c('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
all$FireplaceQu<-as.integer(revalue(all$FireplaceQu, Qual.levels))
That is, ordinal variables will be stored as integers, with the integer value encoding the level.
#count at each level
table(all$FireplaceQu)
##
## 0 1 2 3 4 5
## 1420 46 74 592 744 43
I will show another example of cleaning an ordinal variable so that the process is clear.
Example 2:
PoolQC: Pool quality
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool
Again, in this case NA simply means that the house has no pool, not that the value is missing. PoolQC is also an ordinal variable and is similar to FireplaceQu in that it has exactly the same level names (in fact, many ordinal variables in the dataset share this level convention), so here I can reuse the Qual.levels vector from Example 1 to recode the levels:
#PoolQC (ordinal)
all$PoolQC[is.na(all$PoolQC)] <- 'None'
all$PoolQC<-as.integer(revalue(all$PoolQC, Qual.levels))
table(all$PoolQC)
##
## 0 2 4 5
## 2909 2 4 4
I continue to run the table() command on the categorical
variables as I clean them because it is useful to get a general
understanding of the distribution (count at each level) in preparation
for EDA.
Example 3:
LotShape: General shape of property
Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
IR3 Irregular
LotShape has no NA values so no handling of NAs is required. I simply infer from the data description that it is ordinal and correct the data type in the usual way.
#LotShape (ordinal)
all$LotShape<-as.integer(revalue(all$LotShape, c('IR3'=0, 'IR2'=1, 'IR1'=2, 'Reg'=3)))
table(all$LotShape)
##
## 0 1 2 3
## 16 76 968 1859
The only difference here is that I renamed the levels directly inside
of the revalue() function since LotShape does not
share level names with any other variable. (It is unnecessary to create
a separate vector to store the level names because I will not be reusing
them).
Example 4:
MiscFeature: Miscellaneous feature not covered in other categories
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
MiscFeature is another variable with a large number of NA values. However, once again, this simply means that the house does not have a miscellaneous feature, not that the value is missing. Therefore, I handle the NAs by replacing them with “None”.
Although MiscFeature is clearly a factor, it is an unordered factor, i.e. the levels are distinct but there is no order or progression between them. For unordered factors I use as.factor() to correct the data type.
#MiscFeature (factor)
all$MiscFeature[is.na(all$MiscFeature)] <- 'None'
all$MiscFeature <- as.factor(all$MiscFeature)
table(all$MiscFeature)
##
## Gar2 None Othr Shed TenC
## 5 2814 4 95 1
From this point on I will refer to unordered factors as just ‘factors’ and ordered factors as ‘ordinal’ variables.
Example 5:
LotConfig: Lot configuration
Inside Inside lot
Corner Corner lot
CulDSac Cul-de-sac
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property
LotConfig has no NAs to handle and is just a factor since the levels are not ordered:
#LotConfig (factor)
all$LotConfig <- as.factor(all$LotConfig)
table(all$LotConfig)
##
## Corner CulDSac FR2 FR3 Inside
## 511 176 85 14 2133
Example 6:
In this example I clean a numeric variable.
LotFrontage: Linear feet of street connected to property
This variable has 486 NA values. Unlike in previous examples, this time the data description does not explicitly state how to interpret them. Therefore, I shortlist three options for dealing with the NAs here:
1) Impute the mean or median of LotFrontage (assuming the values are genuinely missing).
2) Remove the variable entirely.
3) Impute zero (assuming that NA means the property has no street frontage).
I decided to proceed with 1). I did not go with 2) because, although 486 is a lot of NAs, over 80% of the values for the variable are still present. I decided against 3) because I thought its assumption was less likely to be true than the assumption behind 1) and 2).
Having decided on option 1), now I decide whether mean or median imputation is more appropriate. To do that I run some basic analysis on the variable:
#LotFrontage (numeric)
summary(all$LotFrontage)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 21.00 59.00 68.00 69.31 80.00 313.00 486
ggplot(data=all)+geom_histogram(mapping=aes(x=LotFrontage), fill="blue")
There is not much difference between the median (68) and the mean (69.31) for LotFrontage. However, due to the positive skew I decide to go with median imputation to replace the NAs:
#based on the slight positive skew caused by an extreme value, imputing median
all$LotFrontage[is.na(all$LotFrontage)] <- median(all$LotFrontage, na.rm = TRUE)
Example 7:
KitchenQual: Kitchen quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
KitchenQual is clearly another ordinal variable which uses the quality levels seen previously. The reason I include this variable as an additional example is that it has 1 NA value to handle. Unlike before, I cannot simply impute “None” because the data description does not specify an NA level for this variable (see above), implying that the single NA is a genuine missing value.
In most cases the best way to handle a small number of missing values of a factor or ordinal variable is to impute the mode:
#kitchen quality (ordinal)
all$KitchenQual[is.na(all$KitchenQual)] <- 'TA' #mode imputation
all$KitchenQual<-as.integer(revalue(all$KitchenQual, Qual.levels))
table(all$KitchenQual)
##
## 2 3 4 5
## 70 1493 1151 205
Other than the NA handling, KitchenQual is just another ordinal variable so I correct its data type as usual.
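As an aside, rather than reading the mode off a table, it can be computed programmatically; a minimal sketch (the helper name get_mode is my own):
#most frequent (modal) level of a categorical vector; table() drops NAs by default
get_mode <- function(x) names(which.max(table(x)))
#e.g. running get_mode(all$KitchenQual) before the conversion above would return the modal quality level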
Example 8:
TotalBsmtSF: Total square feet of basement area
TotalBsmtSF is clearly a numeric variable and there is 1 NA to handle. Here I assume that the NA means there was no basement to measure for that particular house, i.e. that it is appropriate to impute zero:
#TotalBsmtSF (integer)
all$TotalBsmtSF[is.na(all$TotalBsmtSF)] <- 0
This variable is already in a numeric structure (integer) so there is no data type conversion required here.
Example 9:
Utilities: Type of utilities available
AllPub All public Utilities (E,G,W,& S)
NoSewr Electricity, Gas, and Water (Septic Tank)
NoSeWa Electricity and Gas Only
ELO Electricity only
Utilities is an ordinal variable since there is a clear
progression between the levels. So I change the data type as previously
seen. However, when table() is run I find an interesting
output:
#Utilities (ordinal)
table(all$Utilities)
##
## AllPub NoSeWa
## 2916 1
Since only a single observation differs from the rest of the data, this variable offers essentially no predictive power, so I remove it.
#remove this variable
all$Utilities <- NULL
This concludes the examples for this section. When cleaning the rest of the variables, I was simply repeating the steps shown in one of the nine examples above.
After working through all of the variables, I run the
str() command again to see the corrected structure for each
variable:
str(all)
## 'data.frame': 2919 obs. of 79 variables:
## $ MSSubClass : Factor w/ 16 levels "1 story 1946+",..: 6 1 6 7 6 5 1 6 5 16 ...
## $ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
## $ LotFrontage : int 65 80 68 60 84 85 75 68 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Alley : Factor w/ 3 levels "Grvl","None",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ LotShape : int 3 3 2 2 2 2 3 2 3 3 ...
## $ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
## $ LandSlope : int 2 2 2 2 2 2 2 2 2 2 ...
## $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
## $ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
## $ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
## $ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
## $ MasVnrType : int 1 0 1 0 1 0 2 2 0 0 ...
## $ MasVnrArea : num 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : int 4 3 4 3 4 3 4 3 3 3 ...
## $ ExterCond : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
## $ BsmtQual : int 4 4 4 3 4 4 5 4 3 3 ...
## $ BsmtCond : int 3 3 3 4 3 3 3 3 3 3 ...
## $ BsmtExposure : int 1 4 2 1 3 1 3 2 1 1 ...
## $ BsmtFinType1 : int 6 5 6 5 6 6 6 5 1 6 ...
## $ BsmtFinSF1 : num 706 978 486 216 655 ...
## $ BsmtFinType2 : int 1 1 1 1 1 1 1 4 1 1 ...
## $ BsmtFinSF2 : num 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : num 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : num 856 1262 920 756 1145 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ HeatingQC : int 5 5 5 4 5 5 5 5 4 5 ...
## $ CentralAir : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : num 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : num 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : int 4 3 4 4 4 3 4 3 3 3 ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : int 7 7 7 7 7 7 7 7 6 7 ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : int 0 3 3 4 3 0 4 3 3 3 ...
## $ GarageType : Factor w/ 7 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : int 2 2 2 1 2 1 2 2 1 2 ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : num 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : int 3 3 3 3 3 3 3 3 2 4 ...
## $ GarageCond : int 3 3 3 3 3 3 3 3 3 3 ...
## $ PavedDrive : int 2 2 2 2 2 2 2 2 2 2 ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Fence : Factor w/ 5 levels "GdPrv","GdWo",..: 5 5 5 5 5 3 5 5 5 5 ...
## $ MiscFeature : Factor w/ 5 levels "Gar2","None",..: 2 2 2 2 2 4 2 4 2 2 ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : Factor w/ 12 levels "1","2","3","4",..: 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
Also, I double check that all of the NAs have now been handled (with the exception of SalePrice of course):
#count of NAs in each column
colSums(is.na(all))
## MSSubClass MSZoning LotFrontage LotArea Street
## 0 0 0 0 0
## Alley LotShape LandContour LotConfig LandSlope
## 0 0 0 0 0
## Neighborhood Condition1 Condition2 BldgType HouseStyle
## 0 0 0 0 0
## OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
## 0 0 0 0 0
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea
## 0 0 0 0 0
## ExterQual ExterCond Foundation BsmtQual BsmtCond
## 0 0 0 0 0
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## 0 0 0 0 0
## BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## 0 0 0 0 0
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## 0 0 0 0 0
## BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 0 0 0 0 0
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces
## 0 0 0 0 0
## FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 0 0 0 0 0
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF
## 0 0 0 0 0
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## 0 0 0 0 0
## PoolQC Fence MiscFeature MiscVal MoSold
## 0 0 0 0 0
## YrSold SaleType SaleCondition SalePrice
## 0 0 0 1459
Throughout the data cleaning process, I was able to categorise many of the variables into groups. For example, there are several basement variables (BsmtQual, BsmtCond, BsmtExposure, etc.), each measuring some aspect of the quality, condition, or size of the basement.
Grouping the variables helps to simplify the task and, coupled with EDA, to formulate hypotheses about which variables are likely to be the most important in a model.
With a clean dataset, I move onto the exploratory data analysis (EDA) section.
The main objectives of exploratory data analysis (EDA):
Also, to keep this section concise, I will only include key findings and visualisations.
A logical place to start is to analyse the response variable, SalePrice.
#The response SalePrice
summary(all$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 34900 129975 163000 180921 214000 755000 1459
ggplot(data=all[!is.na(all$SalePrice),], mapping = aes(x=SalePrice)) +
geom_histogram(fill="blue", binwidth = 10000) +
scale_x_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
skewness(all[!is.na(all$SalePrice),]$SalePrice)
## [1] 1.880941
kurtosis(all[!is.na(all$SalePrice),]$SalePrice)
## [1] 9.509812
SalePrice has a clear positive skew. I have found that this can often be problematic when modelling. For example, when fitting a linear regression model, the extreme values on the right tail could lead to violation of the normality of residuals assumption.
I consider the distribution of the natural logarithm of the SalePrice:
#log transform on SalePrice
ggplot(data=all[!is.na(all$SalePrice),]) + geom_histogram(mapping = aes(x=log(SalePrice)), fill="green")
skewness(log(all[!is.na(all$SalePrice),]$SalePrice))
## [1] 0.1212104
kurtosis(log(all[!is.na(all$SalePrice),]$SalePrice))
## [1] 3.802656
log(SalePrice) has a fairly symmetrical distribution. As a comparison, a normal distribution has skewness equal to 0 and kurtosis equal to 3. I will transform SalePrice before trying to fit models later in the project.
Before analysing the relationships between SalePrice and the predictors, it makes sense to understand what the most ‘important’ variables are. I use the word ‘important’ loosely because the way importance is measured depends on the technique used. I will use two techniques:
Specifically for numeric and ordinal variables, I run a correlation matrix. Here I measure variable importance based on correlation with SalePrice. Additionally, this can help identify multicollinearity between the predictors.
I will also fit a random forest model (with 100 trees) predicting SalePrice using the randomForest package. Here I will use the importance() function to assess the importance of each variable based on how much the mean squared error (%IncMSE) of the model would increase if the information in that variable were removed. This is ad hoc, but fast and easy to do thanks to the randomForest package.
Starting with a correlation matrix:
#dataframe of numeric variables (including ordinal)
df_all_numVar <- all[, numerics]
#correlations of all numeric variables
cor_numVar <- cor(df_all_numVar, use="pairwise.complete.obs")
#sort on decreasing correlations with SalePrice
cor_sorted <- as.matrix(sort(cor_numVar[,'SalePrice'], decreasing = TRUE))
#select only high correlations
HighCor <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
cor_numVar <- cor_numVar[HighCor, HighCor]
corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt", tl.cex = 0.7,cl.cex = .7, number.cex=.7)
length(HighCor)
## [1] 17
There are 16 predictors which have an absolute correlation with SalePrice greater than 0.5.
The correlation matrix shows that the two variables with the highest correlation with SalePrice are OverallQual (Overall Quality) and GrLivArea (Ground Living Area in square feet).
OverallQual (Overall Quality) has the highest correlation with SalePrice (0.79). Since OverallQual is an ordinal variable I visualise the strength of the relationship with a multi-level boxplot:
ggplot(data=all[!is.na(all$SalePrice),], aes(x=factor(OverallQual), y=SalePrice))+
geom_boxplot(col='blue') + labs(x='Overall Quality') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)+ geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3)
The boxplot illustrates that each time the quality level increases by one, there is a clear upward shift in the distribution of SalePrice.
GrLivArea (Ground Living Area in square feet) has the second highest correlation with SalePrice (0.71). As GrLivArea is numeric, I visualise the relationship with a scatterplot:
ggplot(data=all[!is.na(all$SalePrice),], mapping = aes(x=GrLivArea, y=SalePrice))+geom_point(mapping=aes(color=OverallQual))+scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)+geom_text_repel(aes(label = ifelse(all$GrLivArea[!is.na(all$SalePrice)]>4500, rownames(all), ''))) +labs(x="Ground Living Area (sqft)")
Two outliers appear to be severely distorting the linearity between GrLivArea and SalePrice: they have extremely high GrLivArea values (over 4500 sqft) but only average SalePrice values.
Since just 2 observations out of 1460 distort the linearity so wildly, I decide that it is reasonable to remove them from the data.
#remove outliers
all <- all[-c(524, 1299),]
cor(all$SalePrice[!is.na(all$SalePrice)], all$GrLivArea[!is.na(all$SalePrice)])
## [1] 0.7349682
The correlation between SalePrice and GrLivArea increases from 0.71 to about 0.73 as a result.
Next, I deploy the second technique to understand variable importance which is to fit a random forest predicting SalePrice. Unlike the correlation matrix, this will provide insight into the important factors as well as numeric or ordinal variables.
#random forest
set.seed(1234)
RF <- randomForest(x=all[1:1458,-79], y=all$SalePrice[1:1458], ntree=100,importance=TRUE)
imp_RF <- importance(RF)
imp_DF <- data.frame(Variables = row.names(imp_RF), MSE = imp_RF[,1])
imp_DF <- imp_DF[order(imp_DF$MSE, decreasing = TRUE),]
imp_DF[1:25,]
According to the random forest, Neighborhood is the most important factor variable, followed by MSSubClass. Again, I visualise these relationships with SalePrice by producing multi-level boxplots. (In the plots I arrange the levels in ascending order of median SalePrice):
#SalePrice by Neighborhood
ggplot(all[!is.na(all$SalePrice),], aes(x=reorder(Neighborhood, SalePrice, FUN=median), y=SalePrice)) +
geom_boxplot(color = "blue") +theme(axis.text.x = element_text(angle = 45, hjust = 1))+
scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3)+labs(x= "Neighbourhood")
There is clearly significant predictive power in the Neighborhood variable, although 25 levels for this variable does seem excessive. I keep this in mind for feature engineering.
SalePrice by MSSubClass:
ggplot(all[!is.na(all$SalePrice),], aes(x=reorder(MSSubClass, SalePrice, FUN=median), y=SalePrice))+geom_boxplot(color = "blue") +theme(axis.text.x = element_text(angle = 45, hjust = 1))+scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) + labs(x="MSSubClass")
The relationship here is more subtle, however MSSubClass seems to offer some predictive power.
At this point, having built a strong understanding of the variables, I move onto feature engineering.
The main objectives for this section:
Neighborhood:
First, I create a simplified Neighborhood variable since it appeared to have too many levels. I use the multi-level boxplot from earlier to produce a new variable with 3 ordinal levels (based on the distribution of SalePrice), each level consisting of a group of neighborhoods:
#reducing levels based on SalePrice
all$NeighRich[all$Neighborhood %in% c('StoneBr', 'NridgHt', 'NoRidge')] <- 2
all$NeighRich[!all$Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale', 'StoneBr', 'NridgHt', 'NoRidge')] <- 1
all$NeighRich[all$Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale')] <- 0
#multi-level boxplot
ggplot(all[!is.na(all$SalePrice),], aes(x=as.factor(NeighRich), y=SalePrice)) +
geom_boxplot(color = "blue") +
scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3)+labs(x= "Neighbourhood reduced levels")
Levels ‘0’ and ‘2’ each hold 3 of the initial Neighborhood levels; level ‘1’ contains the other 19.
The boxplot illustrates a significant change in the distribution of SalePrice across the engineered levels. In fact, the 1st quartile of level ‘1’ is greater than the 3rd quartile of level ‘0’, and the 1st quartile of level ‘2’ is greater than the 3rd quartile of level ‘1’. This indicates that, based on the train set, this variable is highly predictive.
‘TotalBathrooms’:
Also, when studying the data description I noticed that there are several variables relating to the number of bathrooms in the house. I create a total bathrooms variable which is a weighted sum of the full and half bathrooms:
#total bathrooms
all$TotalBathrooms <- all$FullBath + (all$HalfBath*0.5) + all$BsmtFullBath + (all$BsmtHalfBath*0.5)
ggplot(data=all[!is.na(all$SalePrice),], aes(x=as.factor(TotalBathrooms), y=SalePrice))+
geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) + geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3)+labs(x= "Total Bathrooms")
cor(all$SalePrice, all$TotalBathrooms, use= "pairwise.complete.obs")
## [1] 0.6358963
It is unsurprising that the number of bathrooms tends to be a key component in determining house price- the fairly strong linearity here makes sense.
‘Age’:
Additionally, when exploring the variables I also noticed two things about the YearBuilt (Year Built) and YearRemodAdd (Year Remodeled) variables:
Using this I inferred that when YearRemodAdd = YearBuilt, this must mean that the house had never been remodeled.
Therefore, I defined an Age variable by subtracting YearRemodAdd from YrSold (Year Sold). Defining an age variable in this way means that a value can be derived for all observations, regardless of whether the house was remodeled or not.
#house age
all$Age <- as.numeric(all$YrSold)-all$YearRemodAdd
ggplot(data=all[!is.na(all$SalePrice),], aes(x=Age, y=SalePrice))+
geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
cor(all$SalePrice, all$Age, use= "pairwise.complete.obs")
## [1] -0.5097058
Age has a significant negative correlation with SalePrice which makes sense intuitively.
Sold New?
Again using the YrSold (year sold) and YearBuilt (year built) variables, I create a binary variable which indicates whether the house was sold in the same year that it was built, i.e. was the house sold new? (1 = yes, 0 = no).
I tested this feature based on the hypothesis: a premium is charged when a house is sold brand new. (This is just a general hypothesis that I wanted to test based on my existing knowledge of housing markets).
#was the house sold in the same year it was built? (Y/N)
all$IsNew <- ifelse(all$YrSold==all$YearBuilt, 1, 0)
table(all$IsNew)
##
## 0 1
## 2803 114
ggplot(all[!is.na(all$SalePrice),], aes(x=as.factor(IsNew), y=SalePrice)) +
geom_bar(stat='summary', fun.y = "median", fill='blue') +
geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=6) +
scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
theme_grey(base_size = 18) + labs(x= "Sold new?") + geom_hline(yintercept=163000, linetype="dashed") #dashed line is median SalePrice
Only 4.3% of houses (in the train set) were sold new; however, since the median SalePrice of new houses is around $100,000 greater than the median of those not sold new, I keep this variable.
‘TotalSqFeet’:
Also, since there are multiple ‘square feet’ numeric variables, I tried to consolidate logical combinations of them. One of the first I tried was TotalSqFeet, which I defined as the sum of GrLivArea (Ground Living Area) and TotalBsmtSF (Total Basement Area).
#Total square feet
all$TotalSqFeet <- all$GrLivArea + all$TotalBsmtSF
ggplot(data=all[!is.na(all$SalePrice),], aes(x=TotalSqFeet, y=SalePrice))+
geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +labs(x= "Total Sq Feet")
cor(all$SalePrice, all$TotalSqFeet, use= "pairwise.complete.obs")
## [1] 0.829042
This variable is highly correlated with SalePrice (0.83).
One-hot encoding is the process of converting factor variables into a numerical format, which allows the data to be fed into machine learning algorithms. I take this step because I plan to experiment with multiple machine learning algorithms in the modelling section and I would like the data to be as compatible as possible.
Essentially, I create a new data frame which has a column for every level of every factor. For each factor variable of a given observation, one level column contains a 1 and the rest contain 0, with the 1 marking the level that the observation takes. A toy illustration is shown below.
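As a toy illustration (the data frame here is made up), a single factor with two levels becomes two 0/1 indicator columns:
#toy example of one-hot encoding with model.matrix()
model.matrix(~.-1, data.frame(f = factor(c("a", "b", "a"))))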
First, creating a vector containing the names of the numeric variables:
numericVarNames<- c('LotFrontage', 'LotArea','BsmtFullBath','BsmtHalfBath','BsmtFinSF2','BsmtUnfSF','MasVnrArea','YearBuilt','X1stFlrSF', 'X2ndFlrSF', 'LowQualFinSF','GrLivArea', 'FullBath', 'HalfBath', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'X3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'TotalBathrooms', 'Age', 'TotalSqFeet')
One-hot encoding:
DFfactors <- all[, !(names(all) %in% numericVarNames)]
DFfactors <- DFfactors[, names(DFfactors) != 'SalePrice']
#dummify categorical variables
DFdummies <- as.data.frame(model.matrix(~.-1, DFfactors))
dim(DFdummies)
## [1] 2917 205
The new data frame has a large number of columns because each factor variable has a column for every level.
I know from the EDA that I conducted that some of the numeric variables are highly skewed. Therefore, I decide to transform any numeric variable x with a skew greater than 1 (significant positive skew) by taking the natural log of x+1. Taking the natural log of x+1 rather than x is necessary because log(0) is undefined, so adding 1 handles the observations where x = 0.
DFnumeric <- all[, names(all) %in% numericVarNames]
for(i in 1:ncol(DFnumeric)){
if (abs(skew(DFnumeric[,i]))>1){
DFnumeric[,i] <- log(DFnumeric[,i] +1)
}
}
Also, I use the preProcess() function (from the caret
package) to normalize the numeric variables:
PreNum <- preProcess(DFnumeric, method=c("center", "scale"))
print(PreNum)
## Created from 2917 samples and 24 variables
##
## Pre-processing:
## - centered (24)
## - ignored (0)
## - scaled (24)
DFnorm <- predict(PreNum, DFnumeric)
Next, I drop variables or levels with zero or near-zero variance: columns with zero variance offer no predictive power, and columns with near-zero variance are likely to encourage overfitting and therefore increase the RMSE of predictions.
#Dropping levels with zero (or near zero) variance in the test or train set
#check if some values are absent in the test set
ZerocolTest <- which(colSums(DFdummies[(nrow(all[!is.na(all$SalePrice),])+1):nrow(all),])==0)
colnames(DFdummies[ZerocolTest])
## [1] "Condition2RRAe" "Condition2RRAn" "Condition2RRNn"
## [4] "HouseStyle2.5Fin" "RoofMatlMembran" "RoofMatlMetal"
## [7] "RoofMatlRoll" "Exterior1stImStucc" "Exterior1stStone"
## [10] "Exterior2ndOther" "HeatingOthW" "ElectricalMix"
## [13] "MiscFeatureTenC"
DFdummies <- DFdummies[,-ZerocolTest] #removing predictors
#check if some values are absent in the train set
ZerocolTrain <- which(colSums(DFdummies[1:nrow(all[!is.na(all$SalePrice),]),])==0)
colnames(DFdummies[ZerocolTrain])
## [1] "MSSubClass1,5 story PUD all"
DFdummies <- DFdummies[,-ZerocolTrain] #removing predictors
fewOnes <- which(colSums(DFdummies[1:nrow(all[!is.na(all$SalePrice),]),])<10)
colnames(DFdummies[fewOnes])
## [1] "MSSubClass1 story unf attic" "LotConfigFR3"
## [3] "NeighborhoodBlueste" "NeighborhoodNPkVill"
## [5] "Condition1PosA" "Condition1RRNe"
## [7] "Condition1RRNn" "Condition2Feedr"
## [9] "Condition2PosA" "Condition2PosN"
## [11] "RoofStyleMansard" "RoofStyleShed"
## [13] "RoofMatlWdShake" "RoofMatlWdShngl"
## [15] "Exterior1stAsphShn" "Exterior1stBrkComm"
## [17] "Exterior1stCBlock" "Exterior2ndAsphShn"
## [19] "Exterior2ndBrk Cmn" "Exterior2ndCBlock"
## [21] "Exterior2ndStone" "FoundationStone"
## [23] "FoundationWood" "HeatingGrav"
## [25] "HeatingWall" "ElectricalFuseP"
## [27] "GarageTypeCarPort" "MiscFeatureOthr"
## [29] "SaleTypeCon" "SaleTypeConLD"
## [31] "SaleTypeConLI" "SaleTypeConLw"
## [33] "SaleTypeCWD" "SaleTypeOth"
## [35] "SaleConditionAdjLand"
DFdummies <- DFdummies[,-fewOnes] #removing predictors
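As an aside, the caret package's nearZeroVar() function automates this kind of check; a sketch of how it could be used (I did the filtering manually above instead):
#flag near-zero variance columns automatically
nzv <- nearZeroVar(DFdummies, saveMetrics = TRUE)
head(nzv[nzv$nzv, ]) #inspect the flagged columns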
Combining the engineered dataframes:
#combine the normalized numeric variables with the dummified factor variables into one data frame
combined <- cbind(DFnorm, DFdummies)
Taking the natural log of the response, SalePrice:
#log(SalePrice)
all$SalePrice <- log(all$SalePrice)
Finally, splitting the combined dataframe into train and test components:
train1 <- combined[!is.na(all$SalePrice),]
test1 <- combined[is.na(all$SalePrice),]
We can now use the cleaned and transformed train set (train1) to experiment with models.
In this section I will predominantly use the caret package to fit and assess models. I use 5-fold cross-validation to make fair assessments of model performance. The final model will be the best-performing model, i.e. the model with the lowest root mean square error (RMSE). Models will predict log(SalePrice), meaning that predictions from the final model will need to be exponentiated (base e) in order to predict SalePrice.
I will experiment with the following:
First, I use the set.seed() command to fix the
randomized folds for cross-validation:
set.seed(1234)
Starting with a linear regression model using all of the predictors:
my_control <-trainControl(method="cv", number=5)
linear_mod1 = train(x=train1,y=all$SalePrice[!is.na(all$SalePrice)], method='lm', trControl= my_control)
min(linear_mod1$results$RMSE)
## [1] 0.1173242
It is likely that the above linear regression is overfitted because all of the predictors are used by the model.
A linear regression model is fitted by minimizing the sum of squared residuals; when there are many predictors relative to the sample size, this can lead to an overfit model. Overfitting occurs when the model starts to fit the noise in the train set and therefore does not generalize well to new data.
Therefore, next I try running another linear regression on a reduced dataset (based on EDA):
#linear regression on reduced train set
linear_mod2 = train(x=train2,y=all$SalePrice[!is.na(all$SalePrice)], method='lm', trControl= my_control)
min(linear_mod2$results$RMSE)
## [1] 0.1298363
RMSE comparison:
model_list <- list(lm1 = linear_mod1, lm2 = linear_mod2)
resamples = resamples(model_list)
summary(resamples)
##
## Call:
## summary.resamples(object = resamples)
##
## Models: lm1, lm2
## Number of resamples: 5
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm1 0.07531025 0.08110973 0.08244776 0.08154510 0.08425569 0.08460208 0
## lm2 0.09203930 0.09469219 0.09488237 0.09509678 0.09661205 0.09725797 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm1 0.1115543 0.1155767 0.1160029 0.1173242 0.1189430 0.1245443 0
## lm2 0.1277336 0.1292786 0.1293744 0.1298363 0.1305633 0.1322314 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm1 0.902428 0.9066380 0.9134711 0.9145326 0.9170927 0.9330333 0
## lm2 0.883119 0.8888129 0.8973805 0.8944530 0.8984089 0.9045438 0
bwplot(resamples, metric = "RMSE")
The RMSE is higher for the second linear model indicating that too much information was lost by reducing the number of predictors.
With regularized regression, the model is fitted by minimizing the sum of the squared residuals plus a penalty term. The penalty shrinks the coefficient estimates towards zero, making the model less sensitive to individual predictors and hedging against the possibility that the model has been fitted too closely to the train set.
The definition of the penalty depends on which regularization technique is being used. There are three techniques: ridge regression (an L2 penalty on the coefficients), lasso regression (an L1 penalty), and the elastic net (a weighted combination of the two).
The practical difference is that the ridge penalty ‘shrinks’ the coefficients of correlated predictors towards each other, whereas the lasso penalty tends to cause the model to pick one of them and discard the others. In all of the above cases the strength of the penalty depends on the hyperparameter lambda, and I determine the best value for lambda using cross-validation. A standard formulation of the penalties is shown below.
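In standard textbook notation (not taken from the competition materials), writing the residual sum of squares as RSS, the coefficients as beta_j, and the penalty weight as lambda:

$$\text{Ridge: } \; \text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2 \qquad\qquad \text{Lasso: } \; \text{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$

In glmnet (the engine behind method = 'glmnet' in caret), the hyperparameter alpha mixes the two penalties, with alpha = 1 giving the lasso and alpha = 0 giving ridge regression.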
lassoGrid <- expand.grid(alpha = 1, lambda = seq(0.001,0.1,by = 0.0005))
lasso_mod <- train(x=train1, y=all$SalePrice[!is.na(all$SalePrice)], method='glmnet', trControl= my_control, tuneGrid=lassoGrid)
lasso_mod$bestTune
min(lasso_mod$results$RMSE)
## [1] 0.1134197
Comparing RMSE:
model_list <- list(lm1 = linear_mod1, lasso = lasso_mod)
resamples = resamples(model_list)
summary(resamples)
##
## Call:
## summary.resamples(object = resamples)
##
## Models: lm1, lasso
## Number of resamples: 5
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm1 0.07531025 0.08110973 0.08244776 0.08154510 0.08425569 0.08460208 0
## lasso 0.07680830 0.07855494 0.07942849 0.07940335 0.08105786 0.08116716 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm1 0.1115543 0.1155767 0.1160029 0.1173242 0.1189430 0.1245443 0
## lasso 0.1073087 0.1074690 0.1114766 0.1134197 0.1175392 0.1233047 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm1 0.9024280 0.9066380 0.9134711 0.9145326 0.9170927 0.9330333 0
## lasso 0.9118777 0.9165367 0.9204476 0.9199835 0.9238094 0.9272464 0
bwplot(resamples, metric = "RMSE")
The lasso regression model performs better than the initial linear regression. I also fitted a ridge regression model (by setting the hyperparameter alpha = 0 in the expand.grid() function; a sketch is shown below), but the resultant RMSE was higher than the RMSE for the lasso model.
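For completeness, a sketch of how that ridge fit would look (mirroring the lasso grid above; the exact lambda range I tried is not reproduced here):
#ridge regression (alpha = 0)
ridgeGrid <- expand.grid(alpha = 0, lambda = seq(0.001, 0.1, by = 0.0005))
ridge_mod <- train(x=train1, y=all$SalePrice[!is.na(all$SalePrice)], method='glmnet', trControl= my_control, tuneGrid=ridgeGrid)
min(ridge_mod$results$RMSE)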
In this project I wanted to experiment with an extreme gradient boosting (XGBoost) model because I know that gradient boosting methods have become very popular in data science in recent years. I am yet to study the theory of gradient boosting machines in depth; however, I know that extreme gradient boosting is a more regularized form (it guards more aggressively against overfitting). At the time of writing up this project I have started to read ‘Applied Predictive Modeling’ by Max Kuhn (developer of the caret package) and Kjell Johnson, and will soon encounter real-world examples of fitting boosted tree models. For now though, I fit the model out of curiosity.
As for tuning the parameters, I used the help of Kaggle kernels to try to compose a sensible tuning grid:
#xgboost
xgbTuningGrid = expand.grid(nrounds = c(50, 100),
lambda = seq(0.1, 0.5, 0.1),
alpha = seq(0.1, 0.5, 0.1),
eta = c(0.3, 0.4))
xgb_mod = train(x=train1, y=all$SalePrice[!is.na(all$SalePrice)],
tuneLength = 3,
method = "xgbLinear",
trControl = my_control,
tuneGrid = xgbTuningGrid)
Comparing RMSE with the Lasso regression model:
model_list = list(gbm = xgb_mod, lasso = lasso_mod)
resamples = resamples(model_list)
summary(resamples)
##
## Call:
## summary.resamples(object = resamples)
##
## Models: gbm, lasso
## Number of resamples: 5
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## gbm 0.0857297 0.08625690 0.08806357 0.09060407 0.09512357 0.09784660 0
## lasso 0.0768083 0.07855494 0.07942849 0.07940335 0.08105786 0.08116716 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## gbm 0.1128568 0.1217299 0.1318506 0.1295182 0.1377619 0.1433920 0
## lasso 0.1073087 0.1074690 0.1114766 0.1134197 0.1175392 0.1233047 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## gbm 0.8687095 0.8888698 0.8993188 0.8953026 0.9070067 0.9126080 0
## lasso 0.9118777 0.9165367 0.9204476 0.9199835 0.9238094 0.9272464 0
bwplot(resamples, metric = "RMSE")
The XGBoost model has a greater RMSE than the lasso model. This is likely because the tuning parameters are not optimal and many have been left as defaults. However, this is only an initial model; I plan to revisit this project and experiment further when I have a stronger conceptual understanding of the algorithm. For now, the lasso regression remains the best-performing model.
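Finally, as noted at the start of this section, the chosen model predicts log(SalePrice), so predictions must be exponentiated before submission. A sketch of how the submission file could be produced from the lasso model (assuming the Id column of the original test csv is used to label the predictions, and the file name is just an example):
#predict on the processed test set, back-transform, and write a submission file
lasso_preds <- predict(lasso_mod, newdata = test1)
submission <- data.frame(Id = test$Id, SalePrice = exp(lasso_preds))
write.csv(submission, file = "lasso_submission.csv", row.names = FALSE)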