House Prices - Kaggle

Adam Tolnay

Introduction

Competition description

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

View the description on Kaggle

List of variables

Executive summary

I used two classes of models to predict SalePrice: linear models and tree-based models. Linear models performed well, with a standard regression having a 10-fold cross-validated RMSE of .1162. A lasso model was slightly better with an RMSE of .1154, while a ridge model showed no improvement. Tree-based models did not do as well: a random forest had an RMSE of .1322, and boosted decision trees (using XGBoost) had an RMSE of .1220. Because the two approaches are so different, I averaged the lasso and XGBoost predictions for the final submission, which improved on each model individually.

Required libraries

library(caret)
library(tidyverse)
library(gridExtra)
library(scales)

Reading in data

train <- read.csv(file = "train.csv", stringsAsFactors = FALSE, header = TRUE)
test <- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE)
Id <- test$Id
test$Id <- NULL
train$Id <- NULL
test$SalePrice <- NA
full <- rbind(train, test)

Overview of the dataset

cat('There are ', nrow(full), ' observations of ', ncol(full), 'variables in the dataset')
## There are  2919  observations of  80 variables in the dataset

Distribution of the response variable, SalePrice:

ggplot(train, aes(SalePrice)) + geom_density() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1), axis.text.y = element_blank()) + 
  scale_x_continuous(breaks= seq(0, 800000, by = 100000), labels = comma) + 
  labs( x = "Sale Price", y = "Density")

The time period covered is 2006 to 2010:

ggplot(full, aes(YrSold)) + geom_bar() + 
  theme(axis.text.y = element_blank()) +
  labs(x = 'Year Sold', y = 'Count')

Sale price by total square footage (basement plus above-grade living area):

ggplot(train, aes(TotalBsmtSF + GrLivArea, SalePrice)) + geom_point() + geom_smooth(method = 'lm') + 
  scale_y_continuous(breaks= seq(0, 800000, by=100000)) + 
  labs(x = 'Total Square Feet', y = 'Sale Price') 

There is an outlier that I will take a closer look at:

outlierRow <- which(train$TotalBsmtSF + train$GrLivArea == max(train$TotalBsmtSF + train$GrLivArea, na.rm = TRUE))

full[outlierRow, c('SalePrice', 'TotalBsmtSF', 'GrLivArea', 'LotArea', 'OverallQual', 'PoolArea')]
##      SalePrice TotalBsmtSF GrLivArea LotArea OverallQual PoolArea
## 1299    160000        6110      5642   63887          10      480

This home has an overall quality of 10, a pool, and a 64,000 square foot lot, yet it sold for only $160,000. It’s not obvious that the observation is an error, but I will flag it as an outlier and consider removing it later.

Cleaning

Because there are so many variables, a significant amount of cleaning was required. In this section, categorical variables are cast as factors (ordered factors where there is a natural ordering), and missing values are imputed.

All character columns are factors, so I can use a loop:

characterCols <- which(sapply(full, is.character))

for(i in characterCols){full[,i] <- as.factor(full[,i])}

In addition, MSSubClass, OverallQual, and OverallCond are stored as numbers but are really categorical, so they are converted to factors. YrSold also gets a factor version.

#MSSubClass has numeric codes for different building classes
full$MSSubClass <- as.factor(full$MSSubClass)

#OverallQual and OverallCond are on a scale of 1-10
full$OverallQual <- as.factor(full$OverallQual)
full$OverallCond <- as.factor(full$OverallCond)

#New variable YrSoldFactor so I can use YrSold later.
full$YrSoldFactor <- as.factor(full$YrSold)

Variables with ordinality are converted to ordered factors.

#Many of the ordered factors use these levels
factorLevels <- c('None', 'Po', 'Fa', 'TA', 'Gd', 'Ex')

full$OverallQual <- factor(full$OverallQual, ordered = TRUE)
full$OverallCond <- factor(full$OverallCond, ordered = TRUE)
full$Alley <- factor(full$Alley, ordered = TRUE, levels = c('None', 'Grvl', 'Pave'))
full$ExterQual <- factor(full$ExterQual, ordered = TRUE, levels = factorLevels)
full$ExterCond <- factor(full$ExterCond, ordered = TRUE, levels = factorLevels)
full$BsmtQual <- factor(full$BsmtQual, ordered = TRUE, levels = factorLevels)
full$BsmtCond <- factor(full$BsmtCond, ordered = TRUE, levels = factorLevels)
full$BsmtExposure <- factor(full$BsmtExposure, ordered = TRUE, levels = c('None', 'No', 'Mn', 'Av', 'Gd'))
full$BsmtFinType1 <- factor(full$BsmtFinType1, ordered = TRUE, levels = c('None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'))
full$BsmtFinType2 <- factor(full$BsmtFinType2, ordered = TRUE, levels = c('None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'))
full$HeatingQC <- factor(full$HeatingQC, ordered = TRUE, levels = factorLevels)
full$KitchenQual <- factor(full$KitchenQual, ordered = TRUE, levels = factorLevels)
full$Functional <- factor(full$Functional, ordered = TRUE, levels = c('Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'))
full$FireplaceQu <- factor(full$FireplaceQu, ordered = TRUE, levels = factorLevels)
full$GarageFinish <- factor(full$GarageFinish, ordered = TRUE, levels = c('None', 'Unf', 'RFn', 'Fin'))
full$GarageQual <- factor(full$GarageQual, ordered = TRUE, levels = factorLevels)
full$GarageCond <- factor(full$GarageCond, ordered = TRUE, levels = factorLevels)
full$PavedDrive <- factor(full$PavedDrive, ordered = TRUE, levels = c('N', 'P', 'Y'))
full$PoolQC <- factor(full$PoolQC, ordered = TRUE, levels = factorLevels)
full$Fence <- factor(full$Fence, ordered = TRUE, levels = c('None', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv'))

Missing value imputation

Locating variables with missing values:

naCols <- which(colSums(is.na(full)) > 0)

sort(colSums(is.na(full[,naCols])), decreasing = TRUE)
##       PoolQC  MiscFeature        Alley        Fence    SalePrice 
##         2909         2814         2721         2348         1459 
##  FireplaceQu  LotFrontage  GarageYrBlt GarageFinish   GarageQual 
##         1420          486          159          159          159 
##   GarageCond   GarageType     BsmtCond BsmtExposure     BsmtQual 
##          159          157           82           82           81 
## BsmtFinType2 BsmtFinType1   MasVnrType   MasVnrArea     MSZoning 
##           80           79           24           23            4 
##    Utilities BsmtFullBath BsmtHalfBath   Functional  Exterior1st 
##            2            2            2            2            1 
##  Exterior2nd   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF 
##            1            1            1            1            1 
##   Electrical  KitchenQual   GarageCars   GarageArea     SaleType 
##            1            1            1            1            1

There are quite a few missing values. Often an NA simply means the feature is absent; for example, an NA in PoolQC means there is no pool. However, there are some discrepancies within the garage, basement, and masonry variables.

NA for GarageType, GarageCond, GarageQual, GarageFinish, and GarageYrBlt means there is no garage, but they don’t agree on which homes lack one: GarageType has 157 NAs, while the other four have 159. First, I check whether the five variables agree on the same 157 homes:

nrow(full[is.na(full$GarageType) & is.na(full$GarageCond) & is.na(full$GarageQual) & is.na(full$GarageFinish) & is.na(full$GarageYrBlt),])
## [1] 157

So they all agree on the 157 homes without a garage. Locating the other 2:

full[!is.na(full$GarageType) & (is.na(full$GarageCond) | is.na(full$GarageQual) | is.na(full$GarageFinish) | is.na(full$GarageYrBlt)), c('GarageType', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'GarageCond', 'GarageQual', 'GarageFinish')]
##      GarageType GarageYrBlt GarageCars GarageArea GarageCond GarageQual
## 2127     Detchd          NA          1        360       <NA>       <NA>
## 2577     Detchd          NA         NA         NA       <NA>       <NA>
##      GarageFinish
## 2127         <NA>
## 2577         <NA>

Observation 2127 appears to have a garage, while 2577 appears not to.

For 2127, I impute GarageYrBlt from YearBuilt and use the mode for the other garage variables. For 2577, I set GarageType to NA so that it is later treated as having no garage.

full$GarageYrBlt[2127] <- full$YearBuilt[2127]
full$GarageCond[2127] <- names(sort(table(full$GarageCond), decreasing = TRUE))[1]
full$GarageQual[2127] <- names(sort(table(full$GarageQual), decreasing = TRUE))[1]
full$GarageFinish[2127] <- names(sort(table(full$GarageFinish), decreasing = TRUE))[1]

full$GarageType[2577] <- NA

Correcting an error:

full$GarageYrBlt[2593] <- 2007 #Was 2207

A similar process is followed for the basement variables: BsmtFinType1, BsmtFinType2, BsmtQual, BsmtExposure, and BsmtCond.

They agree on the 79 homes without a basement; here are the remaining discrepancies:

full[!is.na(full$BsmtFinType1) & (is.na(full$BsmtFinType2) | is.na(full$BsmtQual) | is.na(full$BsmtExposure) | is.na(full$BsmtCond)), c('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF')]
##      BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 333        Gd       TA           No          GLQ       1124         <NA>
## 949        Gd       TA         <NA>          Unf          0          Unf
## 1488       Gd       TA         <NA>          Unf          0          Unf
## 2041       Gd     <NA>           Mn          GLQ       1044          Rec
## 2186       TA     <NA>           No          BLQ       1033          Unf
## 2218     <NA>       Fa           No          Unf          0          Unf
## 2219     <NA>       TA           No          Unf          0          Unf
## 2349       Gd       TA         <NA>          Unf          0          Unf
## 2525       TA     <NA>           Av          ALQ        755          Unf
##      BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 333         479      1603        3206
## 949           0       936         936
## 1488          0      1595        1595
## 2041        382         0        1426
## 2186          0        94        1127
## 2218          0       173         173
## 2219          0       356         356
## 2349          0       725         725
## 2525          0       240         995

They all appear to have basements, so I will impute NAs using the mode:

full$BsmtFinType2[333] <- names(sort(table(full$BsmtFinType2), decreasing = TRUE))[1]
full$BsmtExposure[949] <- names(sort(table(full$BsmtExposure), decreasing = TRUE))[1]
full$BsmtExposure[1488] <- names(sort(table(full$BsmtExposure), decreasing = TRUE))[1]
full$BsmtCond[2041] <- names(sort(table(full$BsmtCond), decreasing = TRUE))[1]
full$BsmtCond[2186] <- names(sort(table(full$BsmtCond), decreasing = TRUE))[1]
full$BsmtQual[2218] <- names(sort(table(full$BsmtQual), decreasing = TRUE))[1]
full$BsmtQual[2219] <- names(sort(table(full$BsmtQual), decreasing = TRUE))[1]
full$BsmtExposure[2349] <- names(sort(table(full$BsmtExposure), decreasing = TRUE))[1]
full$BsmtCond[2525] <- names(sort(table(full$BsmtCond), decreasing = TRUE))[1]
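The mode lookup repeated above could be wrapped in a small helper; a minimal sketch (the name getMode is my own and is not used elsewhere in this report):

#Most frequent non-missing value of a vector or factor
getMode <- function(x) {
  names(sort(table(x), decreasing = TRUE))[1]
}

#Equivalent to the first assignment above
full$BsmtFinType2[333] <- getMode(full$BsmtFinType2)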

MasVnrType and MasVnrArea agree on all but one observation, which I replace with the most common veneer type among houses that actually have a veneer (the overall mode is 'None', so the second most frequent type is used):

full[2611, c('MasVnrType', 'MasVnrArea')]
##      MasVnrType MasVnrArea
## 2611       <NA>        198
full$MasVnrType[2611] <- names(sort(table(full$MasVnrType), decreasing = TRUE))[2]

Imputing missing values for all variables where NA refers to an absence of the feature:

noneCols <- c('Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature')

full$MiscFeature <- factor(full$MiscFeature, levels=c(levels(full$MiscFeature), 'None'))
full$GarageType <- factor(full$GarageType, levels=c(levels(full$GarageType), 'None'))

for(i in noneCols){full[is.na(full[,i]), i] <- 'None'}

The remaining missing values:

naCols <- names(which(colSums(is.na(full)) > 0))

sort(colSums(is.na(full[,naCols])), decreasing = TRUE)
##    SalePrice  LotFrontage  GarageYrBlt   MasVnrType   MasVnrArea 
##         1459          486          158           23           23 
##     MSZoning    Utilities BsmtFullBath BsmtHalfBath   Functional 
##            4            2            2            2            2 
##  Exterior1st  Exterior2nd   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF 
##            1            1            1            1            1 
##  TotalBsmtSF   Electrical  KitchenQual   GarageCars   GarageArea 
##            1            1            1            1            1 
##     SaleType 
##            1

LotFrontage, which measures the length of street connected to the property, has many missing values. The missing GarageYrBlt values belong to homes without a garage, and I will impute them with YearBuilt. For the remaining NAs, including LotFrontage, I use the median for numeric variables and the mode for categorical variables.

#GarageYrBlt
full$GarageYrBlt[is.na(full$GarageYrBlt)] <- full$YearBuilt[is.na(full$GarageYrBlt)]

naColsNumeric <- names(which(sapply(full, is.numeric)))
naColsNumeric <- naColsNumeric[which(naColsNumeric %in% naCols)]
naColsNumeric <- naColsNumeric[-which(naColsNumeric %in% 'SalePrice')]

naColsCategorical <- names(which(sapply(full, is.factor)))
naColsCategorical <- naColsCategorical[which(naColsCategorical %in% naCols)]

#Numeric variables
for(i in naColsNumeric){full[is.na(full[,i]), i] <- median(full[,i], na.rm = TRUE)}

#Categorical variables
for(i in naColsCategorical){full[is.na(full[,i]), i] <- names(sort(table(full[,i]), decreasing = TRUE))[1]}

Integer encoding ordinal factors:

orderedCols <- which(sapply(full, is.ordered))
for (i in orderedCols) {full[[i]] <- as.integer(full[[i]])}

Feature engineering

A typical house listing would say something like: 1650 total square feet, 5 bed, 3 bath, 3200 square foot lot size.

I will consolidate variables into these metrics and remove linear dependencies. Bedrooms and lot size are already available in the data set, so I begin with total square feet and total bathrooms.

#Total square feet.
full$TotalSF <- full$TotalBsmtSF + full$GrLivArea
full$GrLivArea <- NULL

#1stFlrSF, 2ndFlrSF, LowQualFinSF and GrLivArea are a linear combination. I will drop LowQualFinSF.
full$LowQualFinSF <- NULL

#Total bathrooms - Half bathrooms are counted as half a bathroom. FullBath and HalfBath do not include bathrooms in the basement.
full$Bathrooms <- full$BsmtFullBath + full$BsmtHalfBath/2 + full$FullBath + full$HalfBath/2
full$BsmtFullBath <- full$BsmtFullBath + full$BsmtHalfBath/2
full$FullBath   <- NULL
full$HalfBath   <- NULL
full$BsmtHalfBath <- NULL 

Next, I convert YearBuilt, GarageYrBlt, and YearRemodAdd to their ages, and create an isNew variable.

#YearBuilt 
full$HouseAge <- full$YrSold - full$YearBuilt
full$isNew <- ifelse(full$YrSold == full$YearBuilt, 1, 0)
full$isNew <- as.factor(full$isNew)

#GarageYrBlt 
full$GarageAge <- full$YrSold - full$GarageYrBlt
full$GarageYrBlt <- NULL
full$YearBuilt <- NULL

#YearRemodAdd 
full$RemodelAge <- full$YrSold - full$YearRemodAdd
full$YearRemodAdd <- NULL
full$YrSold <- NULL #I created a YrSoldFactor variable

Consolidating basement and porch square footage:

#Few homes have multiple basements, so I will combine BsmtFinSF1 and BsmtFinSF2
full$BsmtFinSF <- full$BsmtFinSF1 + full$BsmtFinSF2
full$BsmtFinSF1 <- NULL
full$BsmtFinSF2 <- NULL
#BsmtFinSF, BsmtUnfSF and GrLivArea and TotalSF are a linear combination. I will remove BsmtUnfSF.
full$BsmtUnfSF <- NULL

#The square footage of four porch types are measured: Open, enclosed, three season, and screen. Three of them have low variability so I combine all four into total square feet of porch.
full$PorchSF <- full$OpenPorchSF + full$EnclosedPorch + full$X3SsnPorch + full$ScreenPorch
full$EnclosedPorch <- NULL
full$X3SsnPorch <- NULL
full$ScreenPorch <- NULL
full$OpenPorchSF <- NULL

Removing highly concentrated variables

The nearZeroVar function flags a variable when the ratio of the frequency of its most common value to that of its second most common value exceeds the freqCut parameter, here 100/1 (and, by default, the variable also has few distinct values). Street, for example, has only 12 gravel observations out of 2,919, a frequency ratio of roughly 242 to 1.

nearZeroVar(full, freqCut = 100/1, names = TRUE)
## [1] "Street"     "Utilities"  "Condition2" "RoofMatl"   "Heating"   
## [6] "PoolArea"   "PoolQC"     "MiscVal"
#Street - Only 12 are gravel, the rest are paved.
full$Street <- NULL
#Utilities - Only 3 do not have all utilities.
full$Utilities <- NULL
#Condition2 - Condition1 returns whether the home is close to a railroad or street, while condition2 is normal unless there are multiple conditions. Only 30 have more than one condition.
full$Condition2 <- NULL
#RoofMatl - Only 43 aren't standard shingle
full$RoofMatl <- NULL
#Heating - Only 45 aren't forced warm air furnaces.
full$Heating <- NULL
#PoolArea and PoolQC - Only 13 homes have pools. One of them is the outlier.
full$PoolArea <- NULL
full$PoolQC <- NULL
#MiscVal - 95 homes have sheds, 1 has a tennis court, 5 have a second garage, and 4 are 'other.' There is enough variability to keep this variable.

There was also a problem with Exterior1st and Exterior2nd, which give the exterior covering on the house. Their distributions are very similar, and different assessors probably varied in which material they listed first and which second, so I will keep only one of them.

table(full$Exterior1st)
## 
## AsbShng AsphShn BrkComm BrkFace  CBlock CemntBd HdBoard ImStucc MetalSd 
##      44       2       6      87       2     126     442       1     450 
## Plywood   Stone  Stucco VinylSd Wd Sdng WdShing 
##     221       2      43    1026     411      56
table(full$Exterior2nd)
## 
## AsbShng AsphShn Brk Cmn BrkFace  CBlock CmentBd HdBoard ImStucc MetalSd 
##      38       4      22      47       3     126     406      15     447 
##   Other Plywood   Stone  Stucco VinylSd Wd Sdng Wd Shng 
##       1     270       6      47    1015     391      81
full$Exterior2nd <- NULL

Relationships with SalePrice

Numeric variable correlations with SalePrice:

numericCols <- which(sapply(full, is.numeric))

train <- full[!is.na(full$SalePrice),]

correlations <- cor(train[,numericCols])
SPcorrelations <- correlations[,'SalePrice']
SPcorrelations[order(abs(SPcorrelations), decreasing = TRUE)]
##    SalePrice  OverallQual      TotalSF    ExterQual  KitchenQual 
##  1.000000000  0.790981601  0.778958829  0.682639242  0.659599721 
##   GarageCars    Bathrooms   GarageArea  TotalBsmtSF    X1stFlrSF 
##  0.640409197  0.631731068  0.623431439  0.613580552  0.605852185 
##     BsmtQual GarageFinish TotRmsAbvGrd     HouseAge  FireplaceQu 
##  0.585207199  0.549246756  0.533723156 -0.523350418  0.520437606 
##   RemodelAge    GarageAge   MasVnrArea   Fireplaces    HeatingQC 
## -0.509078738 -0.508602486  0.472614499  0.466928837  0.427648707 
## BsmtExposure    BsmtFinSF  LotFrontage   WoodDeckSF    X2ndFlrSF 
##  0.375044959  0.366327692  0.334543941  0.324413445  0.319333803 
## BsmtFinType1   GarageQual      LotArea   GarageCond   PavedDrive 
##  0.304907873  0.273839074  0.263843354  0.263190784  0.231356952 
## BsmtFullBath     BsmtCond      PorchSF BedroomAbvGr        Fence 
##  0.224953438  0.212607156  0.195738941  0.168213154 -0.146941526 
## KitchenAbvGr   Functional        Alley  OverallCond       MoSold 
## -0.135907371  0.107618893 -0.092607450 -0.077855894  0.046432245 
##      MiscVal    ExterCond BsmtFinType2 
## -0.021189580  0.018899118 -0.004329316

Variables that would show up in a home listing, like square feet of living space and number of rooms, are important, as are variables measuring quality, such as OverallQual, ExterQual, and KitchenQual.

It’s strange that the correlation between overall condition and sale price is negative. OverallQual “rates the overall material and finish of the house,” and OverallCond “rates the overall condition of the house.”

One would expect that these two variables are highly correlated, but that’s not the case:

cor(train$OverallCond, train$OverallQual)
## [1] -0.09193234
s1 <- ggplot(train, aes(OverallQual, SalePrice)) + geom_boxplot(aes(group = OverallQual)) + scale_x_continuous(breaks= seq(0, 10, 1))
s2 <- ggplot(train, aes(OverallCond, SalePrice)) + geom_boxplot(aes(group = OverallCond)) + scale_x_continuous(breaks= seq(0, 10, 1))

grid.arrange(s1, s2)

OverallQual seems like the more accurate of the two variables. Maybe some inspectors used a scale of 1-5 for OverallCond?

BsmtFinType2 is the numeric variable least correlated with SalePrice.

table(train$BsmtFinType2)
## 
##    1    2    3    4    5    6    7 
##   37 1257   46   54   33   19   14

It has been integer encoded, but the categories in order are ‘no basement’, ‘unfinished’, and then increasing levels of finish quality ranging from ‘low quality’ to ‘good living quarters.’ Homes without a second finished basement area are encoded as ‘unfinished’, which is consistent with the ordering.

Importance of factors:

I used ANOVA to get a rough sense for the importance of categorical variables. Only year sold and month sold weren’t very significant.

categoricalCols <- which(sapply(full, is.factor))

s1 <- ggplot(train, aes(YrSoldFactor, SalePrice)) + geom_boxplot(aes(group = YrSoldFactor)) 
s2 <- ggplot(train, aes(MoSold, SalePrice)) + geom_boxplot(aes(group = MoSold)) 

grid.arrange(s1, s2)

summary(aov(SalePrice ~ YrSoldFactor, train))
##                Df    Sum Sq   Mean Sq F value Pr(>F)
## YrSoldFactor    4 1.631e+10 4.078e+09   0.646   0.63
## Residuals    1455 9.192e+12 6.317e+09
summary(aov(SalePrice ~ MoSold, train))
##               Df    Sum Sq   Mean Sq F value Pr(>F)  
## MoSold         1 1.985e+10 1.985e+10    3.15 0.0761 .
## Residuals   1458 9.188e+12 6.302e+09                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
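Only the ANOVA results for the two sale-date variables are shown above; a minimal sketch of how the same screen could be run over every factor column (my own loop, using the categoricalCols index defined above):

#One-way ANOVA p-value of SalePrice against each factor variable
anovaPvals <- sapply(names(categoricalCols), function(col) {
  fit <- aov(train$SalePrice ~ droplevels(train[[col]]))
  summary(fit)[[1]][["Pr(>F)"]][1]
})
sort(anovaPvals, decreasing = TRUE)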

SalePrice quartiles don’t vary much by year or month, and the ANOVA p-values give no evidence that the mean prices differ. However, prices for different types of homes could still depend on when they are sold, so I will keep both variables.

Preparing for modeling

Removing the outlier:

full <- full[-outlierRow,]

Submissions are evaluated on the RMSE between the logarithm of the predicted sale price and the logarithm of the actual sale price, so I take the log of SalePrice. Since I want the models to minimize exactly this metric, I apply no further transformation to the response.

full$SalePrice <- log(full$SalePrice)
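For reference, the competition metric written out as a small helper (a sketch; the name logRMSE is mine and the function is not used later). Because SalePrice has just been logged, minimizing ordinary RMSE on the transformed response is equivalent to minimizing this metric.

#Kaggle's evaluation metric: RMSE between log predictions and log actuals
logRMSE <- function(actual, predicted) {
  sqrt(mean((log(actual) - log(predicted))^2))
}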

I will pre-process by transforming numeric variables, and creating dummy variables for factors.

The first step is to separate the data into numeric, factor, and SalePrice components.

numericCols <- names(which(sapply(full, is.numeric)))
numericCols <- numericCols[-which(numericCols %in% 'SalePrice')]

factorCols <- names(which(sapply(full, is.factor)))

SalePrice <- full$SalePrice
fullNumeric <- full[,numericCols]
fullFactor <- full[,factorCols]

I applied the Box-Cox transformation to numeric variables. Only the lasso and ridge models need standardized predictors, but I found that the RMSE improved for the other models as well.

preProc <- preProcess(fullNumeric, method = c('BoxCox', 'center', 'scale'))
fullNumeric <- predict(preProc, fullNumeric)

Converting factors to dummy variables:

preProc <- dummyVars( ~ ., fullFactor, fullRank = TRUE) 
fullFactor <- data.frame(predict(preProc, fullFactor))

Putting everything back together:

full <- cbind(SalePrice, fullNumeric, fullFactor)

Using the training data, I will flag variables for removal if:
Numeric variables are too correlated.
Factor levels have too few observations.
There are any linear dependencies.

train <- full[!is.na(full$SalePrice),]
trainNumeric <- train[,numericCols]
trainFactor <- train[,!(names(train) %in% numericCols)]

Highly correlated pairs of numeric variables:

correlationMatrix <- cor(trainNumeric)
highCor <- findCorrelation(correlationMatrix, cutoff = .8, verbose = TRUE)
## Compare row 36  and column  14 with corr  0.803 
##   Means:  0.346 vs 0.194 so flagging column 36 
## Compare row 27  and column  28 with corr  0.888 
##   Means:  0.311 vs 0.186 so flagging column 27 
## Compare row 38  and column  39 with corr  0.846 
##   Means:  0.293 vs 0.18 so flagging column 38 
## Compare row 25  and column  24 with corr  0.889 
##   Means:  0.22 vs 0.176 so flagging column 25 
## Compare row 30  and column  29 with corr  0.916 
##   Means:  0.18 vs 0.174 so flagging column 30 
## All correlations <= 0.8

The numeric variables with correlations greater than .8 in the training set are:
TotalSF and TotalBsmtSF (.803) - I will keep both
GarageCars and GarageArea (.888) - I will drop GarageArea
HouseAge and GarageAge (.846) - I will drop GarageAge
Fireplaces and FireplaceQu (.889) - I will drop Fireplaces
GarageQual and GarageCond (.916) - I will drop GarageCond

drop <- c('GarageArea', 'GarageAge', 'Fireplaces', 'GarageCond')

trainNumeric <- trainNumeric[,!(names(trainNumeric) %in% drop)]

Dropping any factor levels with too few observations:

Using freqCut = 150/1 means any dummy variable whose rarer value appears fewer than 10 times is dropped: with roughly 1,460 training rows, a level that occurs only 9 times gives a frequency ratio above 150/1, while 10 occurrences bring it below. The dropped levels are shown below.

nearZeroVar(trainFactor, freqCut = 150/1, names = TRUE)
##  [1] "MSSubClass.40"         "MSSubClass.150"       
##  [3] "LotShape.IR3"          "LotConfig.FR3"        
##  [5] "Neighborhood.Blueste"  "Neighborhood.NPkVill" 
##  [7] "Condition1.PosA"       "Condition1.RRNe"      
##  [9] "Condition1.RRNn"       "HouseStyle.2.5Fin"    
## [11] "RoofStyle.Mansard"     "RoofStyle.Shed"       
## [13] "Exterior1st.AsphShn"   "Exterior1st.BrkComm"  
## [15] "Exterior1st.CBlock"    "Exterior1st.ImStucc"  
## [17] "Exterior1st.Stone"     "Foundation.Stone"     
## [19] "Foundation.Wood"       "Electrical.FuseP"     
## [21] "Electrical.Mix"        "GarageType.CarPort"   
## [23] "MiscFeature.Othr"      "MiscFeature.TenC"     
## [25] "SaleType.Con"          "SaleType.ConLD"       
## [27] "SaleType.ConLI"        "SaleType.ConLw"       
## [29] "SaleType.CWD"          "SaleType.Oth"         
## [31] "SaleCondition.AdjLand"
dropLevels <- nearZeroVar(trainFactor, freqCut = 150/1)

drop <- append(drop, names(trainFactor)[dropLevels])

trainFactor <- trainFactor[,-dropLevels]

Removing linear dependencies:

train <- cbind(trainNumeric, trainFactor)

linearCombos <- findLinearCombos(train) 

drop <- append(drop, names(train)[linearCombos$remove])

The only linear dependency found is between the BldgType duplex dummy and the corresponding MSSubClass duplex dummy, which are identical.

Finally, I will drop all flagged variables from full, and split it into training and test sets for modeling.

full <- full[,!(names(full) %in% drop)]

train <- full[!is.na(full$SalePrice),]
test <- full[is.na(full$SalePrice),]

Modeling

Linear models

Linear regression:

set.seed(528)

lm.model <- train(SalePrice ~ ., data = train, method = 'lm', trControl = trainControl(method="cv", number=10))

lm.predictions <- exp(predict(lm.model, test))

cat('The cross-validated RMSE is ', lm.model$results$RMSE,', and Rsquared is ', lm.model$results$Rsquared)
## The cross-validated RMSE is  0.1161896 , and Rsquared is  0.9151573

A linear model performs very well.

qqnorm(residuals(lm.model)); qqline(residuals(lm.model))

The tails are heavier than normal, but otherwise the residuals are approximately normally distributed.

Next, I checked variable importance. Caret’s varImp function ranks the variables in a linear model by the absolute value of each coefficient’s t-statistic and rescales the result to 0-100.

varImp(lm.model)
## lm variable importance
## 
##   only 20 most important variables shown (out of 140)
## 
##                      Overall
## OverallCond           100.00
## OverallQual            95.47
## MSZoning.RL            88.19
## MSZoning.RM            86.18
## MSZoning.FV            77.69
## MSZoning.RH            77.13
## Functional             65.47
## LotArea                61.16
## GarageCars             55.81
## SaleCondition.Normal   52.31
## BsmtFinSF              50.29
## TotalSF                50.25
## HouseAge               44.31
## Condition1.Norm        43.65
## PorchSF                42.62
## Exterior1st.BrkFace    36.79
## GarageQual             34.63
## WoodDeckSF             33.90
## GarageType.None        33.60
## FireplaceQu            30.87
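The ranking can be reproduced roughly from the fitted model itself; a quick sketch (varImp excludes the intercept and rescales to 0-100, which this does not):

#Absolute t-statistics of the lm coefficients, largest first
tvals <- abs(coef(summary(lm.model$finalModel))[, "t value"])
head(sort(tvals, decreasing = TRUE), 10)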

OverallCond is the most important variable. In a previous section, I saw that the correlation of OverallCond with SalePrice was only -.078 (before transforming the numeric variables), and I’m not sure how to explain this. Here is the correlation after applying the transformations:

cor(train$OverallCond, train$SalePrice)
## [1] -0.02556907

Quality and square footage variables are also important. The four zoning dummies stand for residential low, medium, and high density, and floating village (retirement communities), so the zoning of the property matters.

Lasso and ridge regressions:

In glmnet, the penalty is defined as λ[(1 − α)/2 · ‖β‖₂² + α · ‖β‖₁], so alpha controls the mix between the L2 and L1 penalties. Alpha = 0 gives ridge regression, and alpha = 1 gives the lasso.
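For reference, the same penalized fits can be run directly with the glmnet package rather than through caret; a minimal sketch (assuming glmnet is installed; this is not the workflow used below):

library(glmnet)

#glmnet needs a numeric design matrix, so expand the formula and drop the intercept column
x <- model.matrix(SalePrice ~ ., data = train)[, -1]
y <- train$SalePrice

#alpha = 1 is the lasso, alpha = 0 is ridge; cv.glmnet picks lambda by cross-validation
cvLasso <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
cvLasso$lambda.min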

Lasso regression:

Initial tuning showed the optimal value of lambda would be in the range (0,.01).

set.seed(50)

lasso.model <- train(SalePrice ~ ., data = train, method = 'glmnet', tuneGrid = data.frame(alpha = 1, lambda = seq(0,.01,.0001)), trControl = trainControl(method="cv", number=10))

ggplot(lasso.model)

lasso.predictions <- exp(predict(lasso.model, test))

cat('The regularization parameter that minimizes RMSE in the lasso model is ', lasso.model$bestTune$lambda, ', with cross-validated RMSE ', min(lasso.model$results$RMSE), ', and Rsquared ', max(lasso.model$results$Rsquared))
## The regularization parameter that minimizes RMSE in the lasso model is  5e-04 , with cross-validated RMSE  0.1154211 , and Rsquared  0.9170709

The lasso model performs slightly better than a linear model. The tuning chart has an interesting shape, with a local maximum around lambda = .001.

An L1 penalty sets some coefficients to 0, so I will check how many variables it drops:

cat('The lasso model drops ', length(which(varImp(lasso.model)$importance==0)), ' of ', ncol(train), ' variables')
## The lasso model drops  21  of  141  variables
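The same count can be cross-checked directly on the underlying glmnet fit; a sketch (this count includes the intercept, which varImp drops):

#Coefficients at the chosen lambda; zeros are the dropped variables
lassoCoefs <- coef(lasso.model$finalModel, s = lasso.model$bestTune$lambda)
sum(as.matrix(lassoCoefs) == 0)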

I also checked how many variables are dropped at lambda = .0025, the local minimum.

lasso.model2 <- train(SalePrice ~ ., data = train, method = 'glmnet', tuneGrid = data.frame(alpha = 1, lambda = .0025), trControl = trainControl(method="cv", number=10))
cat(length(which(varImp(lasso.model2)$importance==0)), ' variables are dropped at the local minimum')
## 62  variables are dropped at the local minimum

At lambda = .0025, 62 variables are dropped with only a slight increase in RMSE, so 41 of the variables included at lambda = .0005 must have very low predictive power. I considered using the model with lambda = .0025, but improvements in cross-validated RMSE should translate to the test set.

Ridge regression:

Initial tuning showed that the optimal value of lambda would be in the range (0,.1).

set.seed(584)

ridge.model <- train(SalePrice ~ ., data = train, method = 'glmnet', tuneGrid = data.frame(alpha = 0, lambda = seq(0,.1,.001)), trControl = trainControl(method="cv", number=10))

ggplot(ridge.model) 

ridge.predictions <- exp(predict(ridge.model, test))

cat('The  cross-validated RMSE of the ridge model is ', min(ridge.model$results$RMSE))
## The  cross-validated RMSE of the ridge model is  0.1190069

Increasing the L2 penalty doesn’t reduce RMSE, so the ridge model does not improve on the linear model.

Tree models

Random forest:

I used the commonly suggested mtry for regression, p/3 (mtry is the number of predictors considered at each split).

set.seed(56)

rf.model <- train(SalePrice ~ ., data = train, method = 'rf', tuneGrid = data.frame(mtry = round(ncol(train)/3)), trControl = trainControl(method="cv", number=10))

rf.predictions <- exp(predict(rf.model, test))

cat('The RMSE is ', rf.model$results$RMSE)
## The RMSE is  0.1322074

A random forest is not an improvement over linear models.

XGBoost:

First, I tuned each hyperparameter individually, keeping the others at their default values, which gave rough ranges for each one. Then I did a random grid search of 100 models to find a good set of parameters.

The grid search takes a while, so it is left commented out below; the best set of parameters it found is passed directly to the model.

set.seed(58)

#gridSearch <- data.frame(nrounds = round(runif(100,100,1000)),
#           max_depth = round(runif(100,.51,10.5)),
#           eta = round(runif(100,.01,.3), digits = 2),
#           gamma = round(runif(100,0,1), digits = 2),
#           min_child_weight = round(runif(100,.5,10.5)),
#           colsample_bytree = round(runif(100,.3,1), digits = 2),
#           subsample = round(runif(100,.3,1), digits = 2))
#
#xgb.model <- train(SalePrice ~ ., data = train, method = 'xgbTree', trControl = trainControl(method= "cv", number = 10), tuneGrid = gridSearch)

xgb.model <- train(SalePrice ~ ., data = train, method = 'xgbTree', trControl = trainControl(method = "cv", number = 10),
                   tuneGrid = data.frame(nrounds = 729, max_depth = 7, eta = 0.04, gamma = 0.05,
                                         colsample_bytree = 0.85, min_child_weight = 9, subsample = 0.7))

xgb.predictions <- exp(predict(xgb.model, test))

cat('The RMSE is ', min(xgb.model$results$RMSE))
## The RMSE is  0.1219796

The RMSE is pretty good, but it is still not an improvement over the lasso model. Furthermore, running the grid search of 100 XGBoost models took a few hours, while comparing 1000 lasso models took a few seconds.

Output

I will average the lasso and XGBoost predictions for the final submission. First, I checked the RMSE between those two very different models:

RMSE(log(lasso.predictions), log(xgb.predictions))
## [1] 0.06580744

Output:

output <- as.data.frame(Id)
output$SalePrice <- (lasso.predictions + xgb.predictions)/2
write.csv(output, file = "Kaggle.csv", row.names = FALSE)

RMSEs on the test set:

lm: .12591
lasso: .12243
ridge: .12294
rf: .13763
xgb: .12741
ensemble: .12121

The cross-validated RMSEs all underestimated the test-set error. Although XGBoost did worse than the lasso individually, averaging the two improved the RMSE slightly, so some of their errors cancelled out.