Adam Tolnay
Introduction
….. Competition description
….. Full list of variables
….. Executive summary
….. Loading the required libraries
….. Reading in the dataset
Overview of the data
….. Distribution of sale price and year sold
….. Sale price by total square feet
………. Noticing an outlier
Data cleaning
….. Converting categorical variables to factors
….. Missing value imputation
………. Basement, garage and masonry
………. Missing values meaning the absence of a feature
………. Remaining NAs
….. Integer encoding ordinal variables
Feature engineering
….. Total square feet, total bathrooms
….. Age variables
….. Consolidating basement and porch square footage
Removing highly concentrated variables
Relationships with sale price
….. Numeric variables
….. Factors
Preparing the data for modeling
….. Outliers
….. Pre-processing
………. Transforming numeric variables
………. Creating dummy variables
….. Dropping correlated numeric variables
….. Dropping categories with few observations
….. Removing linear dependencies
Modeling
….. Linear models
………. Linear regression
………. Regularization
…………… Lasso
…………… Ridge
….. Tree models
………. Random forest
………. XGBoost
Output
….. Performance on the test set
Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
I used two classes of models to predict SalePrice, linear models and tree-based models. Linear models performed well, with a standard regression having a 10-fold cross-validated RMSE of .1162. A lasso model was slightly better with an RMSE of .1154, while a ridge model showed no improvements. Tree-based models didn’t do as well. A random forest had an RMSE of .1322, and boosted decision trees (using XGBoost) had an RMSE of .1220. Because the models are so different, I averaged the lasso and XGBoost predictions for the final submission, and there was some improvement over each model individually.
library(caret)
library(tidyverse)
library(gridExtra)
library(scales)
train <- read.csv(file = "train.csv", stringsAsFactors = FALSE, header = TRUE)
test <- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE)
Id <- test$Id
test$Id <- NULL
train$Id <- NULL
test$SalePrice <- NA
full <- rbind(train, test)
cat('There are ', nrow(full), ' observations of ', ncol(full), 'variables in the dataset')
## There are 2919 observations of 80 variables in the dataset
Distribution of the response variable, SalePrice:
ggplot(train, aes(SalePrice)) + geom_density() +
theme(axis.text.x = element_text(angle = 45, hjust = 1), axis.text.y = element_blank()) +
scale_x_continuous(breaks= seq(0, 800000, by = 100000), labels = comma) +
labs( x = "Sale Price", y = "Density")
The time period covered is 2006 to 2010:
ggplot(full, aes(YrSold)) + geom_bar() +
theme(axis.text.y = element_blank()) +
labs(x = 'Year Sold', y = 'Count')
Sale price by total living area:
ggplot(train, aes(TotalBsmtSF + GrLivArea, SalePrice)) + geom_point() + geom_smooth(method = 'lm') +
scale_y_continuous(breaks= seq(0, 800000, by=100000)) +
labs(x = 'Total Square Feet', y = 'Sale Price')
There is an outlier that I will take a closer look at:
outlierRow <- which(train$TotalBsmtSF + train$GrLivArea == max(train$TotalBsmtSF + train$GrLivArea, na.rm = TRUE))
full[outlierRow, c('SalePrice', 'TotalBsmtSF', 'GrLivArea', 'LotArea', 'OverallQual', 'PoolArea')]
## SalePrice TotalBsmtSF GrLivArea LotArea OverallQual PoolArea
## 1299 160000 6110 5642 63887 10 480
This home has an overall quality of 10, a pool, 64000 square foot lot size and sold for $160,000. It’s not obvious that the observation is an error, but I will flag it as an outlier and consider removing it later.
Because there are so many variables, there was a significant amount of cleaning required. In this section, categorical variables are cast as factors, ordered factors if there is ordinality, and missing values are imputed.
All character columns are factors, so I can use a loop:
characterCols <- which(sapply(full, is.character))
for(i in characterCols){full[,i] <- as.factor(full[,i])}
In addition, the numeric variables MSSubClass, OverallQual, and OverallCond are factors. YrSold is also converted to a factor.
#MSSubClass has numeric codes for different building classes
full$MSSubClass <- as.factor(full$MSSubClass)
#OverallQual and OverallCond are on a scale of 1-10
full$OverallQual <- as.factor(full$OverallQual)
full$OverallCond <- as.factor(full$OverallCond)
#New variable YrSoldFactor so I can use YrSold later.
full$YrSoldFactor <- as.factor(full$YrSold)
Variables with ordinality are converted to ordered factors.
#Many of the ordered factors use these levels
factorLevels <- c('None', 'Po', 'Fa', 'TA', 'Gd', 'Ex')
full$OverallQual <- factor(full$OverallQual, ordered = TRUE)
full$OverallCond <- factor(full$OverallCond, ordered = TRUE)
full$Alley <- factor(full$Alley, ordered = TRUE, levels = c('None', 'Grvl', 'Pave'))
full$ExterQual <- factor(full$ExterQual, ordered = TRUE, levels = factorLevels)
full$ExterCond <- factor(full$ExterCond, ordered = TRUE, levels = factorLevels)
full$BsmtQual <- factor(full$BsmtQual, ordered = TRUE, levels = factorLevels)
full$BsmtCond <- factor(full$BsmtCond, ordered = TRUE, levels = factorLevels)
full$BsmtExposure <- factor(full$BsmtExposure, ordered = TRUE, levels = c('None', 'No', 'Mn', 'Av', 'Gd'))
full$BsmtFinType1 <- factor(full$BsmtFinType1, ordered = TRUE, levels = c('None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'))
full$BsmtFinType2 <- factor(full$BsmtFinType2, ordered = TRUE, levels = c('None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'))
full$HeatingQC <- factor(full$HeatingQC, ordered = TRUE, levels = factorLevels)
full$KitchenQual <- factor(full$KitchenQual, ordered = TRUE, levels = factorLevels)
full$Functional <- factor(full$Functional, ordered = TRUE, levels = c('Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'))
full$FireplaceQu <- factor(full$FireplaceQu, ordered = TRUE, levels = factorLevels)
full$GarageFinish <- factor(full$GarageFinish, ordered = TRUE, levels = c('None', 'Unf', 'RFn', 'Fin'))
full$GarageQual <- factor(full$GarageQual, ordered = TRUE, factorLevels)
full$GarageCond <- factor(full$GarageCond, ordered = TRUE, factorLevels)
full$PavedDrive <- factor(full$PavedDrive, ordered = TRUE, levels = c('N', 'P', 'Y'))
full$PoolQC <- factor(full$PoolQC, ordered = TRUE, factorLevels)
full$Fence <- factor(full$Fence, ordered = TRUE, levels = c('None', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv'))
Locating variables with missing values:
naCols <- which(colSums(is.na(full)) > 0)
sort(colSums(is.na(full[,naCols])), decreasing = TRUE)
## PoolQC MiscFeature Alley Fence SalePrice
## 2909 2814 2721 2348 1459
## FireplaceQu LotFrontage GarageYrBlt GarageFinish GarageQual
## 1420 486 159 159 159
## GarageCond GarageType BsmtCond BsmtExposure BsmtQual
## 159 157 82 82 81
## BsmtFinType2 BsmtFinType1 MasVnrType MasVnrArea MSZoning
## 80 79 24 23 4
## Utilities BsmtFullBath BsmtHalfBath Functional Exterior1st
## 2 2 2 2 1
## Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 1 1 1 1
## Electrical KitchenQual GarageCars GarageArea SaleType
## 1 1 1 1 1
There are quite a few missing values. Sometimes they refer to a specific category, for example if PoolArea is NA then there’s no pool. However, there are some discrepencies within the garage, basement, and masonry variables.
NA for GarageType, GarageCond, GarageQual, GarageFinish, and GarageYrBlt means there’s no garage, but they don’t agree on which homes don’t have a garage. First I will check that 157 of them agree on the same homes:
nrow(full[is.na(full$GarageType) & is.na(full$GarageCond) & is.na(full$GarageQual) & is.na(full$GarageFinish) & is.na(full$GarageYrBlt),])
## [1] 157
So they all agree on the 157 homes without a garage. Locating the other 2:
full[!is.na(full$GarageType) & (is.na(full$GarageCond) | is.na(full$GarageQual) | is.na(full$GarageFinish) | is.na(full$GarageYrBlt)), c('GarageType', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'GarageCond', 'GarageQual', 'GarageFinish')]
## GarageType GarageYrBlt GarageCars GarageArea GarageCond GarageQual
## 2127 Detchd NA 1 360 <NA> <NA>
## 2577 Detchd NA NA NA <NA> <NA>
## GarageFinish
## 2127 <NA>
## 2577 <NA>
2127 appears to have a garage, while 2577 appears not to.
For GarageYrBlt, I will impute using YearBuilt, and use the mode for the other variables.
full$GarageYrBlt[2127] <- full$YearBuilt[2127]
full$GarageCond[2127] <- names(sort(table(full$GarageCond), decreasing = TRUE))[1]
full$GarageQual[2127] <- names(sort(table(full$GarageQual), decreasing = TRUE))[1]
full$GarageFinish[2127] <- names(sort(table(full$GarageFinish), decreasing = TRUE))[1]
full$GarageType[2577] <- NA
Correcting an error:
full$GarageYrBlt[2593] <- 2007 #Was 2207
A similar process is followed for the basement variables: BsmtFinType1, BsmtFinType2, BsmtQual, BsmtExposure, and BsmtCond.
They agree on the 79 homes without a basement, here are the remaining discrepencies:
full[!is.na(full$BsmtFinType1) & (is.na(full$BsmtFinType2) | is.na(full$BsmtQual) | is.na(full$BsmtExposure) | is.na(full$BsmtCond)), c('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF')]
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 333 Gd TA No GLQ 1124 <NA>
## 949 Gd TA <NA> Unf 0 Unf
## 1488 Gd TA <NA> Unf 0 Unf
## 2041 Gd <NA> Mn GLQ 1044 Rec
## 2186 TA <NA> No BLQ 1033 Unf
## 2218 <NA> Fa No Unf 0 Unf
## 2219 <NA> TA No Unf 0 Unf
## 2349 Gd TA <NA> Unf 0 Unf
## 2525 TA <NA> Av ALQ 755 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 333 479 1603 3206
## 949 0 936 936
## 1488 0 1595 1595
## 2041 382 0 1426
## 2186 0 94 1127
## 2218 0 173 173
## 2219 0 356 356
## 2349 0 725 725
## 2525 0 240 995
They all appear to have basements, so I will impute NAs using the mode:
full$BsmtFinType2[333] <- names(sort(table(full$BsmtFinType2), decreasing = TRUE))[1]
full$BsmtExposure[949] <- names(sort(table(full$BsmtExposure), decreasing = TRUE))[1]
full$BsmtExposure[1488] <- names(sort(table(full$BsmtExposure), decreasing = TRUE))[1]
full$BsmtCond[2041] <- names(sort(table(full$BsmtCond), decreasing = TRUE))[1]
full$BsmtCond[2186] <- names(sort(table(full$BsmtCond), decreasing = TRUE))[1]
full$BsmtQual[2218] <- names(sort(table(full$BsmtQual), decreasing = TRUE))[1]
full$BsmtQual[2219] <- names(sort(table(full$BsmtQual), decreasing = TRUE))[1]
full$BsmtExposure[2349] <- names(sort(table(full$BsmtExposure), decreasing = TRUE))[1]
full$BsmtCond[2525] <- names(sort(table(full$BsmtCond), decreasing = TRUE))[1]
Masonry veneer and type agree on all but one observation, which I replaced with the mode of houses with masonry veneer (The mode is none).
full[2611, c('MasVnrType', 'MasVnrArea')]
## MasVnrType MasVnrArea
## 2611 <NA> 198
full$MasVnrType[2611] <- names(sort(table(full$MasVnrType), decreasing = TRUE))[2]
Imputing missing values for all variables where NA refers to an absence of the feature:
noneCols <- c('Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature')
full$MiscFeature <- factor(full$MiscFeature, levels=c(levels(full$MiscFeature), 'None'))
full$GarageType <- factor(full$GarageType, levels=c(levels(full$GarageType), 'None'))
for(i in noneCols){full[is.na(full[,i]), i] <- 'None'}
The remaining missing values:
naCols <- names(which(colSums(is.na(full)) > 0))
sort(colSums(is.na(full[,naCols])), decreasing = TRUE)
## SalePrice LotFrontage GarageYrBlt MasVnrType MasVnrArea
## 1459 486 158 23 23
## MSZoning Utilities BsmtFullBath BsmtHalfBath Functional
## 4 2 2 2 2
## Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## 1 1 1 1 1
## TotalBsmtSF Electrical KitchenQual GarageCars GarageArea
## 1 1 1 1 1
## SaleType
## 1
LotFrontage, which measures the length of street connected to the property, has many missing values. The missing values for GarageYrBlt are for homes without a garage, and I will impute them as YearBuilt. For the remaining NAs, I will use the median for numeric variables, and the mode for categorical variables.
#GarageYrBlt
full$GarageYrBlt[is.na(full$GarageYrBlt)] <- full$YearBuilt[is.na(full$GarageYrBlt)]
naColsNumeric <- names(which(sapply(full, is.numeric)))
naColsNumeric <- naColsNumeric[which(naColsNumeric %in% naCols)]
naColsNumeric <- naColsNumeric[-which(naColsNumeric %in% 'SalePrice')]
naColsCategorical <- names(which(sapply(full, is.factor)))
naColsCategorical <- naColsCategorical[which(naColsCategorical %in% naCols)]
#Numeric variables
for(i in naColsNumeric){full[is.na(full[,i]), i] <- median(full[,i], na.rm = TRUE)}
#Categorical variables
for(i in naColsCategorical){full[is.na(full[,i]), i] <- names(sort(table(full[,i]), decreasing = TRUE))[1]}
Integer encoding ordinal factors:
orderedCols <- which(sapply(full, is.ordered))
for (i in orderedCols) {full[[i]] <- as.integer(full[[i]])}
A typical house listing would say something like: 1650 total square feet, 5 bed, 3 bath, 3200 square foot lot size.
I will consolidate variables into these metrics, and remove linear dependencies. Bedrooms and lot size are already available in the data set, so I begin with total square feet and total bathrooms
#Total square feet.
full$TotalSF <- full$TotalBsmtSF + full$GrLivArea
full$GrLivArea <- NULL
#1stFlrSF, 2ndFlrSF, LowQualFinSF and GrLivArea are a linear combination. I will drop LowQualFinSF.
full$LowQualFinSF <- NULL
#Total bathrooms - Half bathrooms are counted as half a bathroom. FullBath and HalfBath do not include bathrooms in the basement.
full$Bathrooms <- full$BsmtFullBath + full$BsmtHalfBath/2 + full$FullBath + full$HalfBath/2
full$BsmtFullBath <- full$BsmtFullBath + full$BsmtHalfBath/2
full$FullBath <- NULL
full$HalfBath <- NULL
full$BsmtHalfBath <- NULL
Next, I convert YearBuilt, GarageYrBlt, and YearRemodAdd to their ages, and create an isNew variable.
#YearBuilt
full$HouseAge <- full$YrSold - full$YearBuilt
full$isNew <- ifelse(full$YrSold == full$YearBuilt, 1, 0)
full$isNew <- as.factor(full$isNew)
#GarageYrBlt
full$GarageAge <- full$YrSold - full$GarageYrBlt
full$GarageYrBlt <- NULL
full$YearBuilt <- NULL
#YearRemodAdd
full$RemodelAge <- full$YrSold - full$YearRemodAdd
full$YearRemodAdd <- NULL
full$YrSold <- NULL #I created a YrSoldFactor variable
Consolidating basement and porch square footage:
#Few homes have multiple basements, so I will combine BsmtFinSF1 and BsmtFinSF2
full$BsmtFinSF <- full$BsmtFinSF1 + full$BsmtFinSF2
full$BsmtFinSF1 <- NULL
full$BsmtFinSF2 <- NULL
#BsmtFinSF, BsmtUnfSF and GrLivArea and TotalSF are a linear combination. I will remove BsmtUnfSF.
full$BsmtUnfSF <- NULL
#The square footage of four porch types are measured: Open, enclosed, three season, and screen. Three of them have low variability so I combine all four into total square feet of porch.
full$PorchSF <- full$OpenPorchSF + full$EnclosedPorch + full$X3SsnPorch + full$ScreenPorch
full$EnclosedPorch <- NULL
full$X3SsnPorch <- NULL
full$ScreenPorch <- NULL
full$OpenPorchSF <- NULL
The nearZeroVar function returns a variable if the ratio of the most common value to the second most common value exceeds the freqCut parameter, in this case 100/1.
nearZeroVar(full, freqCut = 100/1, names = TRUE)
## [1] "Street" "Utilities" "Condition2" "RoofMatl" "Heating"
## [6] "PoolArea" "PoolQC" "MiscVal"
#Street - Only 12 are gravel, the rest are paved.
full$Street <- NULL
#Utilities - Only 3 do no have all utilities.
full$Utilities <- NULL
#Condition2 - Condition1 returns whether the home is close to a railroad or street, while condition2 is normal unless there are multiple conditions. Only 30 have more than one condition.
full$Condition2 <- NULL
#RoofMatl - Only 43 aren't standard shingle
full$RoofMatl <- NULL
#Heating - Only 45 aren't forced warm air furnaces.
full$Heating <- NULL
#PoolArea and PoolQC - Only 13 homes have pools. One of them is the outlier.
full$PoolArea <- NULL
full$PoolQC <- NULL
#MiscVal - 95 homes have sheds, 1 has a tennis court, 5 have a second garage, and 4 are 'other.' There is enough variability to keep this variable.
There was also a problem with the variables Exterior1st and Exterior2nd, which give the material covering the house. They have very similar distributions. Different inspectors probably varied in which material they put 1st and 2nd. I will only keep one of them.
table(full$Exterior1st)
##
## AsbShng AsphShn BrkComm BrkFace CBlock CemntBd HdBoard ImStucc MetalSd
## 44 2 6 87 2 126 442 1 450
## Plywood Stone Stucco VinylSd Wd Sdng WdShing
## 221 2 43 1026 411 56
table(full$Exterior2nd)
##
## AsbShng AsphShn Brk Cmn BrkFace CBlock CmentBd HdBoard ImStucc MetalSd
## 38 4 22 47 3 126 406 15 447
## Other Plywood Stone Stucco VinylSd Wd Sdng Wd Shng
## 1 270 6 47 1015 391 81
full$Exterior2nd <- NULL
Numeric variable correlations with SalePrice:
numericCols <- which(sapply(full, is.numeric))
train <- full[!is.na(full$SalePrice),]
correlations <- cor(train[,numericCols])
SPcorrelations <- correlations[,'SalePrice']
SPcorrelations[order(abs(SPcorrelations), decreasing = TRUE)]
## SalePrice OverallQual TotalSF ExterQual KitchenQual
## 1.000000000 0.790981601 0.778958829 0.682639242 0.659599721
## GarageCars Bathrooms GarageArea TotalBsmtSF X1stFlrSF
## 0.640409197 0.631731068 0.623431439 0.613580552 0.605852185
## BsmtQual GarageFinish TotRmsAbvGrd HouseAge FireplaceQu
## 0.585207199 0.549246756 0.533723156 -0.523350418 0.520437606
## RemodelAge GarageAge MasVnrArea Fireplaces HeatingQC
## -0.509078738 -0.508602486 0.472614499 0.466928837 0.427648707
## BsmtExposure BsmtFinSF LotFrontage WoodDeckSF X2ndFlrSF
## 0.375044959 0.366327692 0.334543941 0.324413445 0.319333803
## BsmtFinType1 GarageQual LotArea GarageCond PavedDrive
## 0.304907873 0.273839074 0.263843354 0.263190784 0.231356952
## BsmtFullBath BsmtCond PorchSF BedroomAbvGr Fence
## 0.224953438 0.212607156 0.195738941 0.168213154 -0.146941526
## KitchenAbvGr Functional Alley OverallCond MoSold
## -0.135907371 0.107618893 -0.092607450 -0.077855894 0.046432245
## MiscVal ExterCond BsmtFinType2
## -0.021189580 0.018899118 -0.004329316
Variables that would show up in a home listing like square feet of living space and number of rooms are important, as are variables measuring quality, like OverallQual, ExterQual, and KitchenQual.
It’s strange that the correlation between overall condition and sale price is negative. OverallQual “rates the overall material and finish of the house,” and OverallCond “rates the overall condition of the house.”
One would expect that these two variables are highly correlated, but that’s not the case:
cor(train$OverallCond, train$OverallQual)
## [1] -0.09193234
s1 <- ggplot(train, aes(OverallQual, SalePrice)) + geom_boxplot(aes(group = OverallQual)) + scale_x_continuous(breaks= seq(0, 10, 1))
s2 <- ggplot(train, aes(OverallCond, SalePrice)) + geom_boxplot(aes(group = OverallCond)) + scale_x_continuous(breaks= seq(0, 10, 1))
grid.arrange(s1, s2)
OverallQual seems like the more accurate of the two variables. Maybe some inspectors used a scale of 1-5 for OverallCond?
BsmtFinType2 is the numeric variable least correlated with SalePrice.
table(train$BsmtFinType2)
##
## 1 2 3 4 5 6 7
## 37 1257 46 54 33 19 14
It has been integer encoded, but the categories in order are ‘no basement’, ‘unfinished’, and then increasing levels of finish quality ranging from ‘low quality’ to ‘good living quarters.’ Homes with no second basement must have been encoded as ‘unfinished,’ but that’s in line with the ordinality.
Importance of factors:
I used ANOVA to get a rough sense for the importance of categorical variables. Only year sold and month sold weren’t very significant.
categoricalCols <- which(sapply(full, is.factor))
s1 <- ggplot(train, aes(YrSoldFactor, SalePrice)) + geom_boxplot(aes(group = YrSoldFactor))
s2 <- ggplot(train, aes(MoSold, SalePrice)) + geom_boxplot(aes(group = MoSold))
grid.arrange(s1, s2)
summary(aov(SalePrice ~ YrSoldFactor, train))
## Df Sum Sq Mean Sq F value Pr(>F)
## YrSoldFactor 4 1.631e+10 4.078e+09 0.646 0.63
## Residuals 1455 9.192e+12 6.317e+09
summary(aov(SalePrice ~ MoSold, train))
## Df Sum Sq Mean Sq F value Pr(>F)
## MoSold 1 1.985e+10 1.985e+10 3.15 0.0761 .
## Residuals 1458 9.188e+12 6.302e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
SalePrice quartiles don’t vary much by year or month, and the probability that all means are the same is high. However, prices for the various types of homes could vary depending on when they are sold, so I will keep them.
Removing the outlier:
full <- full[-outlierRow,]
Submissions are evaluated based on the RMSE between the natural logarithm of predicted and actual sale price, so I will take the log of SalePrice. I want my models to minimize this metric, so I won’t transform it any more.
full$SalePrice <- log(full$SalePrice)
I will pre-process by transforming numeric variables, and creating dummy variables for factors.
The first step is to separate the data into inumeric, factor, and SalePrice components.
numericCols <- names(which(sapply(full, is.numeric)))
numericCols <- numericCols[-which(numericCols %in% 'SalePrice')]
factorCols <- names(which(sapply(full, is.factor)))
SalePrice <- full$SalePrice
fullNumeric <- full[,numericCols]
fullFactor <- full[,factorCols]
I applied the Box-Cox transformation to numeric variables. Only the lasso and ridge models need standardized predictors, but I found that the RMSE improved for the other models as well.
preProc <- preProcess(fullNumeric, method = c('BoxCox', 'center', 'scale'))
fullNumeric <- predict(preProc, fullNumeric)
Converting factors to dummy variables:
preProc <- dummyVars( ~ ., fullFactor, fullRank = TRUE)
fullFactor <- data.frame(predict(preProc, fullFactor))
Putting everything back together:
full <- cbind(SalePrice, fullNumeric, fullFactor)
Using the training data, I will flag variables for removal if:
Numeric variables are too correlated.
Factor levels have too few observations.
There are any linear dependencies.
train <- full[!is.na(full$SalePrice),]
trainNumeric <- train[,numericCols]
trainFactor <- train[,!(names(train) %in% numericCols)]
Highly correlated pairs of numeric variables:
correlationMatrix <- cor(trainNumeric)
highCor <- findCorrelation(correlationMatrix, cutoff = .8, verbose = TRUE)
## Compare row 36 and column 14 with corr 0.803
## Means: 0.346 vs 0.194 so flagging column 36
## Compare row 27 and column 28 with corr 0.888
## Means: 0.311 vs 0.186 so flagging column 27
## Compare row 38 and column 39 with corr 0.846
## Means: 0.293 vs 0.18 so flagging column 38
## Compare row 25 and column 24 with corr 0.889
## Means: 0.22 vs 0.176 so flagging column 25
## Compare row 30 and column 29 with corr 0.916
## Means: 0.18 vs 0.174 so flagging column 30
## All correlations <= 0.8
The numeric variables with correlations greater than .8 in the training set are:
TotalSF and TotalBsmtSF (.803) - I will keep both
GarageCars and GarageArea (.888) - I will drop GarageArea
HouseAge and GarageAge (.846) - I will drop GarageAge
Fireplaces and FireplaceQu (.889) - I will drop Fireplaces
GarageQual and GarageCond (.916) - I will drop GarageCond
drop <- c('GarageArea', 'GarageAge', 'Fireplaces', 'GarageCond')
trainNumeric <- trainNumeric[,!(names(trainNumeric) %in% drop)]
Dropping any factor levels with too few observations:
Using freCut = 150/1 means any categories with less than 10 observations are dropped. The levels dropped are shown below.
nearZeroVar(trainFactor, freqCut = 150/1, names = TRUE)
## [1] "MSSubClass.40" "MSSubClass.150"
## [3] "LotShape.IR3" "LotConfig.FR3"
## [5] "Neighborhood.Blueste" "Neighborhood.NPkVill"
## [7] "Condition1.PosA" "Condition1.RRNe"
## [9] "Condition1.RRNn" "HouseStyle.2.5Fin"
## [11] "RoofStyle.Mansard" "RoofStyle.Shed"
## [13] "Exterior1st.AsphShn" "Exterior1st.BrkComm"
## [15] "Exterior1st.CBlock" "Exterior1st.ImStucc"
## [17] "Exterior1st.Stone" "Foundation.Stone"
## [19] "Foundation.Wood" "Electrical.FuseP"
## [21] "Electrical.Mix" "GarageType.CarPort"
## [23] "MiscFeature.Othr" "MiscFeature.TenC"
## [25] "SaleType.Con" "SaleType.ConLD"
## [27] "SaleType.ConLI" "SaleType.ConLw"
## [29] "SaleType.CWD" "SaleType.Oth"
## [31] "SaleCondition.AdjLand"
dropLevels <- nearZeroVar(trainFactor, freqCut = 150/1)
drop <- append(drop, names(trainFactor)[dropLevels])
trainFactor <- trainFactor[,-dropLevels]
Removing linear dependencies:
train <- cbind(trainNumeric, trainFactor)
linearCombos <- findLinearCombos(train)
drop <- append(drop, names(train)[linearCombos$remove])
The only linear dependency found is between the dummy variables BldType.Duplex and MSSubclass.Duplex, which are identical.
Finally, I will drop all flagged variables from full, and split it into training and test sets for modeling.
full <- full[,!(names(full) %in% drop)]
train <- full[!is.na(full$SalePrice),]
test <- full[is.na(full$SalePrice),]
Linear regression:
set.seed(528)
lm.model <- train(SalePrice ~ ., data = train, method = 'lm', trControl = trainControl(method="cv", number=10))
lm.predictions <- exp(predict(lm.model, test))
cat('The cross-validated RMSE is ', lm.model$results$RMSE,', and Rsquared is ', lm.model$results$Rsquared)
## The cross-validated RMSE is 0.1161896 , and Rsquared is 0.9151573
A linear model performs very well.
qqnorm(residuals(lm.model)); qqline(residuals(lm.model))
There is too much weight on the tails, but residuals are approximately normally distributed.
Next, I checked variable importance. Caret’s varImp function uses the t-statistic of each coefficient to rank variable importance, and scales the absolute value from 0-100.
varImp(lm.model)
## lm variable importance
##
## only 20 most important variables shown (out of 140)
##
## Overall
## OverallCond 100.00
## OverallQual 95.47
## MSZoning.RL 88.19
## MSZoning.RM 86.18
## MSZoning.FV 77.69
## MSZoning.RH 77.13
## Functional 65.47
## LotArea 61.16
## GarageCars 55.81
## SaleCondition.Normal 52.31
## BsmtFinSF 50.29
## TotalSF 50.25
## HouseAge 44.31
## Condition1.Norm 43.65
## PorchSF 42.62
## Exterior1st.BrkFace 36.79
## GarageQual 34.63
## WoodDeckSF 33.90
## GarageType.None 33.60
## FireplaceQu 30.87
OverallCond is the most important variable. In a previous section, I saw that the correlation of OverallCond with SalePrice was -.092 (before transforming numeric variables), and I’m not sure how to explain this. Here is the correlation after applying transformations:
cor(train$OverallCond, train$SalePrice)
## [1] -0.02556907
Quality and square footage variables are also important. The four zoning variables stand for residential low, medium, and high density, and floating village (retirement communities), so whether the home is urban or rural is important.
Lasso and ridge regressions:
In glmnet, the penalty is defined to be (1-a)/2|B|^2+a|B|, so alpha controls the weights of the L1 and L2 penalty. Alpha = 0 results in a ridge, and alpha = 1 results in a lasso.
Lasso regression:
Initial tuning showed the optimal value of lambda would be in the range (0,.01).
set.seed(50)
lasso.model <- train(SalePrice ~ ., data = train, method = 'glmnet', tuneGrid = data.frame(alpha = 1, lambda = seq(0,.01,.0001)), trControl = trainControl(method="cv", number=10))
ggplot(lasso.model)
lasso.predictions <- exp(predict(lasso.model, test))
cat('The regularization parameter that minimizes RMSE in the lasso model is ', lasso.model$bestTune$lambda, ', with cross-validated RMSE ', min(lasso.model$results$RMSE), ', and Rsquared ', max(lasso.model$results$Rsquared))
## The regularization parameter that minimizes RMSE in the lasso model is 5e-04 , with cross-validated RMSE 0.1154211 , and Rsquared 0.9170709
The lasso model performs slightly better than a linear model. The tuning chart has an interesting shape, with a local maximum around lambda = .001.
An L1 penalty sets some coefficients to 0, so I will check how many variables it drops:
cat('The lasso model drops ', length(which(varImp(lasso.model)$importance==0)), ' of ', ncol(train), ' variables')
## The lasso model drops 21 of 141 variables
I also checked how many variables are dropped at lambda = .0025, the local minimum.
lasso.model2 <- train(SalePrice ~ ., data = train, method = 'glmnet', tuneGrid = data.frame(alpha = 1, lambda = .0025), trControl = trainControl(method="cv", number=10))
cat(length(which(varImp(lasso.model2)$importance==0)), ' variables are dropped at the local minimum')
## 62 variables are dropped at the local minimum
At lambda = .0025, 62 variables are dropped with only a slight increase in RMSE, so 41 of the variables included at lambda = .0005 must have very low predictive power. I considered using the model with lambda = .0025, but improvements in cross-validated RMSE should translate to the test set.
Ridge regression:
Initial tuning showed that the optimal value of lambda would be in the range (0,.1).
set.seed(584)
ridge.model <- train(SalePrice ~ ., data = train, method = 'glmnet', tuneGrid = data.frame(alpha = 0, lambda = seq(0,.1,.001)), trControl = trainControl(method="cv", number=10))
ggplot(ridge.model)
ridge.predictions <- exp(predict(ridge.model, test))
cat('The cross-validated RMSE of the ridge model is ', min(ridge.model$results$RMSE))
## The cross-validated RMSE of the ridge model is 0.1190069
Increasing the L2 penalty doesn’t reduce RMSE, so the ridge model does not improve on the linear model.
Random forest:
I used the suggested mtry (number of parameters tested at each split) for regression of p/3.
set.seed(56)
rf.model <- train(SalePrice ~ ., data = train, method = 'rf', tuneGrid = data.frame(mtry = round(ncol(train)/3)), trControl = trainControl(method="cv", number=10))
rf.predictions <- exp(predict(rf.model, test))
cat('The RMSE is ', rf.model$results$RMSE)
## The RMSE is 0.1322074
A random forest is not an improvement over linear models.
XGBoost:
First, I tuned each hyperparameter individually, keeping the others at their default values, which gave rough ranges for each one. Then I did a random grid search of 100 models to find a good set of parameters.
The grid search takes a while, so it’s not included. The best set of parameters is passed to the model.
set.seed(58)
#gridSearch <- data.frame(nrounds = round(runif(100,100,1000)),
# max_depth = round(runif(100,.51,10.5)),
# eta = round(runif(100,.01,.3), digits = 2),
# gamma = round(runif(100,0,1), digits = 2),
# min_child_weight = round(runif(100,.5,10.5)),
# colsample_bytree = round(runif(100,.3,1), digits = 2),
# subsample = round(runif(100,.3,1), digits = 2))
#
#xgb.model <- train(SalePrice ~ ., data = train, method = 'xgbTree', trControl = trainControl(method= "cv", number = 10), tuneGrid = gridSearch)
xgb.model <- train(SalePrice ~ ., data = train, method = 'xgbTree', trControl = trainControl(method= "cv", number = 10), tuneGrid = data.frame(nrounds = 729, max_depth = 7, eta = 0.04, gamma = 0.05, colsample_bytree
= 0.85, min_child_weight = 9, subsample = 0.7))
xgb.predictions <- exp(predict(xgb.model, test))
cat('The RMSE is ', min(xgb.model$results$RMSE))
## The RMSE is 0.1219796
The RMSE is pretty good, but is still not an improvement over the lasso model. Furthermore, running the grid search of 100 models took a few hours, while comparing 1000 lasso models took a few seconds.
I will average the lasso and XGBoost predictions for the final submission. First, I checked the RMSE between those two very different models:
RMSE(log(lasso.predictions), log(xgb.predictions))
## [1] 0.06580744
Output:
output <- as.data.frame(Id)
output$SalePrice <- (lasso.predictions + xgb.predictions)/2
write.csv(output, file = "Kaggle.csv", row.names = FALSE)
RMSEs on the test set:
lm: .12591
lasso: .12243
ridge: .12294
rf: .13763
xgb: .12741
ensemble: .12121
Cross-validated RMSEs were all underestimated. Although individually XGBoost did worse than lasso, averaging them improved RMSE slightly, so some of the errors cancelled out.