Introduction

A house's sale price is affected by many factors, and companies such as Zillow and Redfin apply highly advanced models to predict it. If you ask home buyers about their dream house, they will mention a number of common features, many of which play key roles in the price: the living area, the number of rooms and the location of the house (neighborhood) seem very important, while, for example, the size of the guest bathroom is not as significant.

In this project, we predict the sale price of residential houses in Ames, Iowa using the training set provided. The high number of features (79) and numerous missing values make this project challenging, and the newly engineered features add further complexity. Five models are applied to this regression problem: XGBoost, lasso, ridge, elastic net, and random forest. Ultimately, the average of the five models estimates the sale price for the test dataset.
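As a minimal sketch of that final blend (the pred.* names below are hypothetical placeholders for the test-set predictions produced later in this report):

# Sketch: unweighted average of five equal-length prediction vectors.
# The pred.* objects are placeholders, not variables defined yet.
blend <- function(...) rowMeans(cbind(...))
# final.pred <- blend(pred.xgb, pred.lasso, pred.ridge, pred.net, pred.rf)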

Data Preparation

The train and test data are downloaded from the Kaggle competition “House Prices: Advanced Regression Techniques”.

train <- read.csv('train.csv', stringsAsFactors = FALSE)
test <- read.csv('test.csv', stringsAsFactors = FALSE)
test$SalePrice <- NA 
dim(train) ; dim(test)
## [1] 1460   81
## [1] 1459   81

There are 1460 samples in the train set and 1459 in the test set. Both datasets have 81 columns (after adding SalePrice = NA to the test set): 79 features plus SalePrice and Id. We combine the train and test sets into one data frame so that any feature engineering or data manipulation is automatically applied to the test set as well.

alldata <- rbind(train, test)

What is the status of missing values in our data?

library(ggplot2)
library(tidyr)
library(dplyr)
missingvalue.plot <- function(MS_Dataset) {
  missing_values <- MS_Dataset %>% summarize_all(~ sum(is.na(.)) / n())
  missing_values <- gather(missing_values, key="feature", value="missing_pct")
  print(sum(missing_values$missing_pct >0))
  missing_values %>% 
    filter (missing_pct >0) %>% # only plot the features with missing value
    ggplot(aes(x=reorder(feature,-missing_pct),y=missing_pct)) +
    geom_bar(stat="identity",fill="red")+
    coord_flip()+theme_bw()
}
missingvalue.plot(alldata)
## [1] 35

Excluding ‘SalePrice’, there are 34 features with NAs. The NAs in ‘SalePrice’ belong to the test set.

Handling the Missing Values

Next, we fill in the missing values, using the rest of the data to predict them where possible.

which(is.na(alldata$PoolQC) & alldata$PoolArea >0)
## [1] 2421 2504 2600

There are three houses that have a pool (PoolArea greater than zero) but no assigned pool quality. We populate the quality based on the mean pool area of each quality category, assuming that pool area correlates with quality (people with larger houses probably have larger, higher-quality pools and higher-quality houses in general):

tapply(alldata$PoolArea, alldata$PoolQC, mean)
##     Ex     Fa     Gd 
## 359.75 583.50 648.50
alldata$PoolArea[which(is.na(alldata$PoolQC) & alldata$PoolArea >0)]
## [1] 368 444 561
alldata$PoolQC[alldata$Id == 2421] <- "Ex"
alldata$PoolQC[alldata$Id == 2504] <- "Ex"
alldata$PoolQC[alldata$Id == 2600] <- "Fa"
alldata$PoolQC[is.na(alldata$PoolQC)] = 'None' # assign to the houses with no pool
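For reference, the same fix can be done programmatically, by picking the quality level whose mean pool area is closest to each house's pool area. A sketch, meant to run in place of the three manual assignments above (before the ‘None’ fill; run afterwards it finds nothing to change):

# Sketch: assign each missing PoolQC the level whose mean PoolArea is
# closest to that house's PoolArea (reproduces the manual fix above).
area.means <- tapply(alldata$PoolArea, alldata$PoolQC, mean)
idx <- which(is.na(alldata$PoolQC) & alldata$PoolArea > 0)
alldata$PoolQC[idx] <- names(area.means)[
  sapply(alldata$PoolArea[idx], function(a) which.min(abs(area.means - a)))]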

We fill the missing “GarageYrBlt” (garage year built) values with the corresponding “YearBuilt” of the house.

alldata$GarageYrBlt[is.na(alldata$GarageYrBlt)]<- alldata$YearBuilt[is.na(alldata$GarageYrBlt)]

Let’s fix some of the features associated with the garage.

Garage.Cols <- c('GarageArea', 'GarageCars', 'GarageQual', 
                 'GarageFinish', 'GarageCond', 'GarageType')
alldata[which(is.na(alldata$GarageCond)), Garage.Cols]
##      GarageArea GarageCars GarageQual GarageFinish GarageCond GarageType
## 40            0          0       <NA>         <NA>       <NA>       <NA>
## 49            0          0       <NA>         <NA>       <NA>       <NA>
## ...
## 2127        360          1       <NA>         <NA>       <NA>     Detchd
## ...
## 2577         NA         NA       <NA>         <NA>       <NA>     Detchd
## ...
## 2918          0          0       <NA>         <NA>       <NA>       <NA>
## (output truncated: all remaining rows show GarageArea = 0, GarageCars = 0
##  and <NA> for the other garage columns)

If a house has zero “GarageArea” and zero “GarageCars”, it has no garage at all. We see one house with GarageArea = 360 sqft and GarageCars = 1, and another with GarageArea = GarageCars = NA. We assume the second one has no garage at all, so we set its GarageArea and GarageCars to zero and its other garage features to None. The first house definitely has a garage; we can fill in its missing features using other houses with similar GarageArea and GarageCars values.

names(sapply(alldata[which(alldata$GarageArea < 375 & alldata$GarageArea > 345 &
                             alldata$GarageCars == 1), Garage.Cols],
             function(x) sort(table(x), decreasing = TRUE)[1]))
## [1] "GarageArea.352"    "GarageCars.1"      "GarageQual.TA"    
## [4] "GarageFinish.Unf"  "GarageCond.TA"     "GarageType.Attchd"
alldata[2127,'GarageQual']    = 'TA'
alldata[2127, 'GarageFinish'] = 'Unf'
alldata[2127, 'GarageCond']   = 'TA'
alldata[2577, c('GarageQual', 'GarageFinish', 'GarageCond', 'GarageType')] = 'None'
alldata[2577, c('GarageCars', 'GarageArea')] = 0
alldata[which(is.na(alldata$GarageCond)), c('GarageQual', 'GarageFinish', 'GarageCond', 'GarageType')] = 'None'

The missing ‘KitchenQual’ and ‘Electrical’ values are populated with their most frequent classes, ‘TA’ and ‘SBrkr’, respectively. We define the multiplot helper function used for the following plots.

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  
  numPlots = length(plots)
  
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  
  if (numPlots==1) {
    print(plots[[1]])
    
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
feature.plot <- function(DATASET, FEATURE) {
  ggplot(DATASET, aes(x = DATASET[, FEATURE])) +
    geom_bar(fill = 'cornflowerblue') +
    xlab(FEATURE)
}

p1 <- feature.plot(alldata, 'KitchenQual') 
p2 <- feature.plot(alldata, 'Electrical') 
layout <- matrix(c(1,2), 1,2, byrow=TRUE)
multiplot(p1, p2, layout = layout)

alldata$KitchenQual[is.na(alldata$KitchenQual)] <- 'TA'
alldata$Electrical[is.na(alldata$Electrical)] <- 'SBrkr'
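As a quick programmatic check of the modal classes used above, a small sketch (most.common is a helper introduced here for illustration):

# Sketch: the most frequent level of a categorical feature.
most.common <- function(x) names(sort(table(x), decreasing = TRUE))[1]
most.common(alldata$KitchenQual)  # expected: "TA"
most.common(alldata$Electrical)   # expected: "SBrkr"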

Let’s move on to the basement NAs:

require(stringr)
Bsmnt.Cols <- names(alldata)[sapply(names(alldata), function(x) str_detect(x, "Bsmt"))]
alldata[is.na(alldata$BsmtCond), Bsmnt.Cols]
##      BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 18       <NA>     <NA>         <NA>         <NA>          0         <NA>
## 40       <NA>     <NA>         <NA>         <NA>          0         <NA>
## ...
## 2041       Gd     <NA>           Mn          GLQ       1044          Rec
## ...
## 2121     <NA>     <NA>         <NA>         <NA>         NA         <NA>
## ...
## 2186       TA     <NA>           No          BLQ       1033          Unf
## ...
## 2525       TA     <NA>           Av          ALQ        755          Unf
## ...
## (output truncated: the remaining rows are entirely <NA> with BsmtFinSF1 = 0)
##      BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
## 18            0         0           0            0            0
## ...
## 2041        382         0        1426            1            0
## 2121         NA        NA          NA           NA           NA
## 2186          0        94        1127            0            1
## 2189          0         0           0           NA           NA
## 2525          0       240         995            0            0
## ...
## (output truncated: all other rows are zero)

Five houses have inconsistent missing values in their basement features. Two of them (rows 2121 and 2189) do not have a basement at all, so their quantitative features are set to zero. For the houses in rows 2041, 2186, and 2525, the missing ‘BsmtCond’ is set to its most frequent value, ‘TA’.

p1 <- feature.plot(alldata, 'BsmtCond') 
p2 <- feature.plot(alldata, 'BsmtQual') 
p3 <- feature.plot(alldata, 'BsmtExposure') 
p4 <- feature.plot(alldata, 'BsmtFinType2') 
layout <- matrix(c(1,2,3,4), 2,2, byrow=TRUE)
multiplot(p1, p2, p3, p4,layout = layout)

Also, ‘BsmtQual’, ‘BsmtExposure’, and ‘BsmtFinType2’ are filled in with their most frequent values when they are NA but the other basement features are not:

alldata$BsmtCond[is.na(alldata$BsmtCond) & !is.na(alldata$BsmtExposure)] <- 'TA'
alldata$BsmtExposure[is.na(alldata$BsmtExposure) & !is.na(alldata$BsmtCond)] <- 'No'
alldata$BsmtQual[is.na(alldata$BsmtQual) & !is.na(alldata$BsmtCond)] <- 'TA'
alldata[is.na(alldata$BsmtQual), Bsmnt.Cols[c(1,2,3,4,6)]] <- 'None'
alldata[is.na(alldata$BsmtQual), Bsmnt.Cols[c(5,7,8,9,10,11)]] <- 0
alldata[is.na(alldata$BsmtFinSF1), Bsmnt.Cols[c(5,7,8,9,10,11)]] <- 0
alldata$BsmtFinType2[is.na(alldata$BsmtFinType2) & !is.na(alldata$BsmtCond)] <- 'Unf'
alldata$BsmtFullBath[is.na(alldata$BsmtFullBath) & alldata$BsmtCond == 'None'] <- 0
alldata$BsmtHalfBath[is.na(alldata$BsmtHalfBath) & alldata$BsmtCond == 'None'] <- 0

We re-plot the missing values to see what features are still remaining.

missingvalue.plot(alldata)
## [1] 14

Excluding ‘SalePrice’, we still have 13 features with NAs. ‘LotFrontage’ and ‘MasVnrArea’ will be imputed with the mice package.

library(mice)
set.seed(129)
mice_mod <- mice(alldata[, !names(alldata) %in% c('Id', 'SalePrice')], method ='rf', printFlag = FALSE)
mice_output <- complete(mice_mod)

A comparison between the original ‘LotFrontage’ distribution and the distribution of the mice predictions is plotted below. We can conclude that mice mimics the original data very well.

par(mfrow=c(1,2))
hist(alldata$LotFrontage, freq=F, main='Original Data', 
     col='darkgreen', xlab = 'LotFrontage')
hist(mice_output$LotFrontage, freq=F, main='Mice Output', 
     col='lightgreen', xlab = 'LotFrontage')

alldata$LotFrontage[is.na(alldata$LotFrontage)] <- mice_output$LotFrontage[is.na(alldata$LotFrontage)]

Mice did a good job for ‘MasVnrArea’ as well:

par(mfrow=c(1,2))
hist(alldata$MasVnrArea, freq=F, main='Original Data', 
     col='darkred', xlab = 'MasVnrArea')
hist(mice_output$MasVnrArea, freq=F, main='MICE Output', 
     col='red', xlab = 'MasVnrArea')

alldata$MasVnrArea[is.na(alldata$MasVnrArea)] <- mice_output$MasVnrArea[is.na(alldata$MasVnrArea)]

We are going to populate the missing ‘MSZoning’ values with a random forest prediction. Initial models (not presented here) showed that the following five features rank highest in importance, so I use them (plus ‘MSZoning’ itself as the response) in the model.

library(randomForest) 
set.seed(754)
MSZoning_Cols <- c('MSZoning', 'MSSubClass', 'LotShape', 
                   'Neighborhood', 'BldgType', 'HouseStyle')
NoNA_MSZoning <- alldata[!is.na(alldata$MSZoning), MSZoning_Cols]
NoNA_MSZoning[MSZoning_Cols] <- lapply(NoNA_MSZoning[MSZoning_Cols], factor)
MSZoninf_rf <- randomForest(MSZoning ~ . ,data = NoNA_MSZoning)

The following plot shows the importance scores for these variables:

library(ggthemes)
importance    <- importance(MSZoninf_rf)
varImportance <- data.frame(Variables = row.names(importance), 
                            Importance = round(importance[ ,'MeanDecreaseGini'],2))
rankImportance <- varImportance %>%
  mutate(Rank = paste0('#',dense_rank(desc(Importance))))
ggplot(rankImportance, aes(x = reorder(Variables, Importance), 
                           y = Importance, fill = Importance)) +
  geom_bar(stat='identity') + 
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
            hjust=0, vjust=0.55, size = 4, colour = 'red') +
  labs(x = 'Variables') +
  coord_flip() + 
  theme_few()

As expected, ‘Neighborhood’, ‘MSSubClass’ and ‘BldgType’ rank highest in predicting ‘MSZoning’, the general zoning classification of the sale. We replace only the missing ‘MSZoning’ values with the random forest prediction.

alldata_temp <- lapply(alldata[MSZoning_Cols], factor)
alldata_temp$MSZoning<- predict(MSZoninf_rf, newdata = alldata_temp)
alldata$MSZoning[is.na(alldata$MSZoning)] <-as.character(alldata_temp$MSZoning[is.na(alldata$MSZoning)])
rm(alldata_temp, NoNA_MSZoning, MSZoning_Cols, importance, rankImportance)

We can treat the features with just one or two missing values using their most common value:

p1<- feature.plot(alldata, 'SaleType') 
p2<- feature.plot(alldata[alldata$Exterior1st=="VinylSd",], 'Exterior2nd') 

p3<- feature.plot(alldata, 'Utilities') 
p4<- feature.plot(alldata, 'Functional') 
layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4,layout = layout)

alldata$SaleType[is.na(alldata$SaleType)] <- "WD"
alldata[is.na(alldata$Exterior1st), c('Exterior1st', 'Exterior2nd')] <- 'VinylSd'
alldata$Utilities[is.na(alldata$Utilities)] <- 'AllPub'
alldata$Functional[is.na(alldata$Functional)] <- 'Typ'

Let’s take a look at the plot for “MasVnrType” and “MasVnrArea”:

foo <- na.omit(alldata[, c('MasVnrType', 'MasVnrArea')]) %>%
  group_by(MasVnrType) %>%
  mutate(MedianArea = median(MasVnrArea, na.rm = TRUE))
ggplot(foo, aes(x = MasVnrType)) +
  geom_bar(fill = 'cornflowerblue') + xlab("MasVnrType") +
  geom_text(aes(label = MedianArea), stat = 'count', vjust = -0.5)

The value on top of each bar shows the median “MasVnrArea”. Based on this plot, we fix the missing “MasVnrType” and “MasVnrArea” as follows. “MasVnrType” (masonry veneer type) is set to “None” if “MasVnrArea” (masonry veneer area) is zero. The area is set to zero if the type is “None” and the area is below 5 sqft, but if the type is “None” and the area exceeds 250 sqft we switch the type to the most frequent one, “BrkFace”. For all other missing “MasVnrType” values we use “BrkFace”.

alldata$MasVnrType[alldata$MasVnrArea == 0] <- 'None'
alldata$MasVnrArea[alldata$MasVnrType== 'None' & alldata$MasVnrArea < 5] <- 0
alldata$MasVnrType[alldata$MasVnrType== 'None' & alldata$MasVnrArea > 250] <- 'BrkFace'
alldata$MasVnrType[is.na(alldata$MasVnrType)] <- 'BrkFace'

We set all the NAs in “Fence”, “FireplaceQu”, “Alley”, and “MiscFeature” to “None”.

alldata$Fence[is.na(alldata$Fence)] <- 'None'
alldata$FireplaceQu[is.na(alldata$FireplaceQu)] <- 'None'
alldata$Alley[is.na(alldata$Alley)] <- 'None'
alldata$MiscFeature[is.na(alldata$MiscFeature)] <- 'None'

Let’s divide all the features into two groups: numeric and categorical:

Categ.vars <- names(which(sapply(alldata, is.character)))
Num.vars <- names(which(sapply(alldata, is.numeric)))
all.numeric <- alldata[Num.vars]

Before going further, a copy of alldata is saved as alldata.RF for the random forest model.

alldata.RF <- alldata
alldata.RF[Categ.vars] <- lapply(alldata.RF[Categ.vars], factor)

Now we need to find the most influential features on “SalePrice”. We plot each feature versus “SalePrice”, sorted in descending order of price, to explore any trend or link between that feature and the house price.

require(scales)
train <- alldata[!is.na(alldata$SalePrice),]
group.prices <- function(dataframe, col) {
  group.table <- dataframe[, c(col, 'SalePrice', 'OverallQual')] %>%
    group_by_(col) %>%
    summarise(mean.Quality = round(mean(OverallQual), 2),
              median.Price = median(SalePrice), n = n()) %>%
    arrange(mean.Quality)
  
  g1 <- qplot(x = reorder(group.table[[col]], -group.table[['median.Price']]),
              y = group.table[['median.Price']]) +
    geom_bar(stat = 'identity', fill = 'cornflowerblue') +
    theme_minimal() +
    scale_y_continuous(labels = dollar) +
    labs(x = col, y = 'Median Price') +  # median, not mean, of SalePrice
    theme(axis.text.x = element_text(angle = 45))
  
  return(g1)
}
p1 <- group.prices(train, 'FireplaceQu')
p2 <- group.prices(train, 'BsmtQual'   )
p3 <- group.prices(train, 'KitchenQual')
p4 <- group.prices(train, 'ExterQual'  )
p5 <- group.prices(train, 'OverallQual')
p6 <- group.prices(train, 'ExterCond'  )
p7 <- group.prices(train, 'GarageQual' )
p8 <- group.prices(train, 'GarageCond' )
p9 <- group.prices(train, 'HeatingQC'  )

layout <- matrix(c(1,2,3,4,5,6,7,8,9),3,3,byrow=TRUE)
multiplot(p1,p2,p3,p4,p5,p6,p7,p8,p9,layout = layout)

It seems all the features presented above are important in predicting the sale price, especially “OverallQual”, “ExterQual”, and “KitchenQual”. We will explore their correlation with the house price in more detail later. A numeric value (0 to 5) is assigned to all of the features containing “Qual”, “QC”, “Qu”, and “Cond”:

map.fcn <- function(cols, map.list, df){
  for (col in cols){
    df[col]<- as.numeric(map.list[alldata[,col]])
  }
  return(df)
}
qual.cols <- c('ExterQual', 'ExterCond', 'GarageQual', 'GarageCond', 'FireplaceQu', 'KitchenQual', 'HeatingQC', 'BsmtQual')
qual.list <- c('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
all.numeric <- map.fcn (qual.cols, qual.list, all.numeric)

Let’s move to the basement features:

p1 <- group.prices(train, 'BsmtExposure')
p2 <- group.prices(train, 'BsmtFinType1')
p3 <- group.prices(train, 'BsmtFinType2')
layout <- matrix(c(1,2,3),1,3,byrow=TRUE)
multiplot(p1,p2,p3,layout = layout)

bsmt.list <- c('None' = 0, 'No' = 1, 'Mn' = 2, 'Av' = 3, 'Gd' = 4)
all.numeric <- map.fcn (c('BsmtExposure'), bsmt.list, all.numeric)
bsmt.fin.list <- c('None' = 0, 'Unf' = 1, 'LwQ' = 2,'Rec'= 3, 'BLQ' = 4, 'ALQ' = 5, 'GLQ' = 6)
all.numeric <- map.fcn (c('BsmtFinType1','BsmtFinType2'), bsmt.fin.list, all.numeric)

The next three features are “Functional”, “GarageFinish”, and “Fence”:

p1 <- group.prices(train, 'Functional')
p2 <- group.prices(train, 'GarageFinish')
p3 <- group.prices(train, 'Fence')
layout <- matrix(c(1,2,3),1,3,byrow=TRUE)
multiplot(p1,p2,p3,layout = layout)

functional.list <- c('None' = 0, 'Sal' = 1, 'Sev' = 2, 'Maj2' = 3, 'Maj1' = 4, 
                     'Mod' = 5, 'Min2' = 6, 'Min1' = 7, 'Typ'= 8)
all.numeric     <- map.fcn (c('Functional'), functional.list, all.numeric)
garage.fin.list <- c('None' = 0,'Unf' = 1, 'RFn' = 2, 'Fin' =3)
all.numeric     <- map.fcn (c('GarageFinish'), garage.fin.list, all.numeric)
fence.list      <- c('None' = 0, 'MnWw' = 1, 'GdWo' = 2, 'MnPrv' = 2, 'GdPrv' = 3)
all.numeric['Fence'] <- as.numeric(fence.list[alldata$Fence])
group.prices(train, 'MSSubClass')

MSdwelling.list <- c('20' = 1, '30'= 1, '40' = 0, '45' = 1,'50' = 0, '60' = 2, '70' = 0, '75' = 0, '80' = 0, '85'= 0, '90' = 0, '120' = 2, '150' = 0, '160' = 0, '180' = 1, '190' = 0)
all.numeric['MSSubClass'] <- as.numeric(MSdwelling.list[as.character(alldata$MSSubClass)])
alldata.RF$MSSubClass     <- as.factor(all.numeric$MSSubClass)
p1 <- group.prices(train, 'MSZoning')
p2 <- group.prices(train, 'LandContour')
p3 <- group.prices(train, 'HouseStyle')
layout <- matrix(c(1,2,3,3),2,2,byrow=TRUE)
multiplot(p1,p2,p3,layout = layout)

MSZoning.list <- c('C (all)' = 0, 'RM' = 1, 'RH' = 2, 'RL' = 3, 'FV' = 4)
all.numeric['MSZoning'] <- as.numeric(MSZoning.list[alldata$MSZoning])
LandContour.list <- c('None' = 0, 'Lvl' = 2, 'Bnk' = 1, 'HLS' = 4, 'Low' = 3)
all.numeric['LandContour'] <- as.numeric(LandContour.list[alldata$LandContour])
HouseStyle.list <- c('None' = 0, '2.5Fin' = 4, '2.5Unf' = 2, '1.5Fin' = 2, '1.5Unf' = 1, '1Story' = 3, 'SFoyer' = 2, 'SLvl' = 3, '2Story' = 4)
all.numeric['HouseStyle'] <- as.numeric(HouseStyle.list[alldata$HouseStyle])

To determine which variables are strongly related to “SalePrice”, we calculate and plot their correlations with the response variable. We focus only on the features with a correlation greater than 0.5 or smaller than -0.5; correlations close to zero imply a weak relationship between the two variables.

require(corrplot)
corr.df <- all.numeric[1:1460,]
# only using the first 1460 rows - training data
correlations <- cor(corr.df)
# only want the columns that show strong correlations with SalePrice
corr.SalePrice <- as.matrix(sort(correlations[,'SalePrice'], decreasing = TRUE))
corr.idx <- names(which(apply(corr.SalePrice, 1, function(x) (x > 0.5 | x < -0.5))))
par(mfrow=c(1,1))
corrplot(as.matrix(correlations[corr.idx,corr.idx]), type = 'upper', 
         method='color', addCoef.col = 'black', tl.cex = .7,cl.cex = .7, number.cex=.8)

The features “OverallQual”, “GrLivArea”, “ExterQual”, and “KitchenQual” have the strongest correlation with “SalePrice”. The living area is no surprise: a larger house tends to be more expensive. While converting the categorical values to numeric, we already observed the strong correlation between the sale price and the overall, exterior and kitchen qualities; the correlation plot confirms that finding.

“GarageCars” and “GarageArea” come in second place; people seem to care about garage size. However, these two features are also highly correlated with each other. Next come the basement and first-floor areas. Again, it makes intuitive sense that anything related to the area of a house affects its price, but the basement and first-floor areas are highly correlated with each other. A beneficial engineered feature could combine the basement area, the above-ground living area and the garage area (see “TotalArea2” in the Feature Engineering section).

Finally, we see three year-related features: the year of house construction, the year of garage construction and the remodeling year. It is no big surprise that a newer or renovated house commands a higher price than a similar but older one. See the plot for the correlations of the other features with the sale price. To explore the relationship of the top six features with the sale price, I use a matrix of scatter plots along with a simple linear regression fit (blue line) and a local polynomial fit (red line).

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
require(GGally)
lm.plt <- function(data, mapping, ...){
  plt <- ggplot(data = data, mapping = mapping) +
    geom_point(shape = 20, alpha = 0.7, color = 'darkseagreen') +
    geom_smooth(method=loess, fill="red", color="red") +
    geom_smooth(method=lm, fill="blue", color="blue") +
    theme_minimal()
  return(plt)
}
ggpairs(corr.df, corr.idx[1:6], lower = list(continuous = lm.plt))

Interestingly, the highest value of each feature does not necessarily correspond to the highest house price. For example, the highest kitchen quality, exterior quality and garage capacity are not necessarily associated with the highest sale price.

Feature Engineering

The exploratory data analysis we have performed provides valuable information that helps us select the best features. Furthermore, I believe we can create some new features that may boost our prediction. I start with the year of construction: a house is new if it is built and sold in the same year. Similarly, if the year in which the house was remodeled differs from “YearBuilt”, we can assume the house was remodeled. I do not want to use the term “remodeled” for a new house!

all.numeric['NewHouse'] <- (alldata$YearBuilt == alldata$YrSold) * 1
all.numeric['Remodeled'] <- (alldata$YearBuilt != alldata$YearRemodAdd) * 1
alldata.RF[c('NewHouse','Remodeled')] <- lapply(all.numeric[c('NewHouse','Remodeled')], factor)

A zero square footage for an area-based feature means that feature is not present at all. For these features, we add an indicator column which is 1 for a house with an area greater than zero, and 0 otherwise.

cols.binary <- c('X2ndFlrSF', 'MasVnrArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'X3SsnPorch', 'ScreenPorch')
has.cols.binary <- str_c('Has',cols.binary)
for (col in cols.binary){
  all.numeric[str_c('Has',col)] <- (alldata[,col] != 0) * 1
}
alldata.RF[has.cols.binary] <- lapply(all.numeric[has.cols.binary], factor)

The time of year when a house is on the market or sold can be important; some months fall within the hot seasons, which can remarkably affect the market:

ggplot(alldata, aes(x=MoSold)) +
  geom_bar(fill = 'cornflowerblue') +
  geom_text(aes(label=..count..), stat='count', vjust = -.5) +
  theme_minimal() +
  scale_x_continuous(breaks = 1:12)

all.numeric['HighSeason']       <- (alldata$MoSold %in% c(4,5,6,7  )) * 1
all.numeric['AfterHighSeason']  <- (alldata$MoSold %in% c(8,9,10,11)) * 1
all.numeric['BeforeHighSeason'] <- (alldata$MoSold %in% c(1,2,3,12 )) * 1
season.col <- c('HighSeason', 'AfterHighSeason', 'BeforeHighSeason')
alldata.RF[season.col] <- lapply(all.numeric[season.col], factor)

Another idea is to combine the areas of different parts of the house. For example, what is the total living and basement area (“TotalArea1”)? What if the garage area is added to it (“TotalArea2”)? What is the proportion of “TotalArea1” to “LotArea”? We may also get an idea about the design of the house by dividing the number of bedrooms above ground by the total above-ground living area; 3 bedrooms in a 1800 sqft house means something different than 3 bedrooms in a 3000 sqft house. Likewise for the number of bathrooms relative to the total living area, or the total number of bathrooms relative to the bedrooms: a 4-bedroom house with 4 bathrooms is more popular than a 4-bedroom house with 2.5 bathrooms. The other question is how tight the garage space is: is it a large 2-car garage or a small one? The ratio of “GarageCars” to “GarageArea” gives us an idea.

all.numeric['TotalArea1'] <- as.numeric(rowSums(alldata[,c('TotalBsmtSF', 'GrLivArea')]))
all.numeric['TotalArea2'] <- as.numeric(rowSums(alldata[,c('TotalBsmtSF', 'GrLivArea', 'GarageArea')]))
all.numeric['TotalArea3'] <- as.numeric(rowSums(alldata[,c('X1stFlrSF', 'GarageArea')]))
all.numeric['TtlAr2PerTtlLot'] <- as.numeric(all.numeric$TotalArea2/all.numeric$LotArea)
all.numeric['TtlAr3PerTtlLot'] <- as.numeric(all.numeric$TotalArea3/all.numeric$LotArea)
all.numeric['BedroomPerLivArea'] <- as.numeric(all.numeric$BedroomAbvGr/all.numeric$GrLivArea)
all.numeric['X1stflrPerLivArea'] <- as.numeric(all.numeric$X1stFlrSF/all.numeric$GrLivArea)

all.numeric['TotalBath1'] <- as.numeric(rowSums(alldata[,c('FullBath', 'HalfBath')]))
all.numeric['TotalBath2'] <- as.numeric(rowSums(alldata[,c('BsmtFullBath', 'BsmtHalfBath','FullBath', 'HalfBath')]))
all.numeric['Bath2PerArea1'] <- as.numeric(all.numeric$TotalBath2/all.numeric$TotalArea1)
all.numeric['Bath1PerBedroom'] <- as.numeric(all.numeric$TotalBath1/(all.numeric$BedroomAbvGr+1))
all.numeric['Bath1PerRooms'] <- as.numeric(all.numeric$TotalBath1 / rowSums(alldata[,c('TotRmsAbvGrd', 'BedroomAbvGr')]))

all.numeric['GarageAPerLivArea'] <- as.numeric(all.numeric$GarageArea/all.numeric$GrLivArea)
all.numeric['GarageCarsPerBedroom'] <- as.numeric(all.numeric$GarageCars/(all.numeric$BedroomAbvGr+1))
all.numeric['CarsPerGarageAr'] <- as.numeric(all.numeric$GarageCars/(all.numeric$GarageArea+1))
all.numeric['PoolArPerLivAr'] <- as.numeric(all.numeric$PoolArea/(all.numeric$GrLivArea))
NewFeatures.col = c('TotalArea1','TotalArea2','TotalArea3','TtlAr2PerTtlLot','TtlAr3PerTtlLot','BedroomPerLivArea',
                    'X1stflrPerLivArea','TotalBath1','TotalBath2','Bath2PerArea1', 'Bath1PerBedroom', 'Bath1PerRooms', 'GarageAPerLivArea','GarageCarsPerBedroom','CarsPerGarageAr','PoolArPerLivAr')
alldata.RF[NewFeatures.col] <- all.numeric[NewFeatures.col]

There is one house whose “YrSold” precedes the construction year; it might be a pre-construction sale. I change the “YrSold” to the “YearBuilt” for that case. Then I calculate the age of each house as well as the number of years since it was remodeled.

all.numeric$YrSold[all.numeric$YrSold < all.numeric$YearBuilt] <- all.numeric$YearBuilt[all.numeric$YrSold < all.numeric$YearBuilt]
all.numeric['Age'] <- as.numeric(all.numeric$YrSold - all.numeric$YearBuilt)
all.numeric['YearSinceRemodel'] <- as.numeric(all.numeric$YrSold - all.numeric$YearRemodAdd)
NewFeature.Years <- c('YrSold', 'Age', 'YearSinceRemodel')
alldata.RF[NewFeature.Years] <- all.numeric[NewFeature.Years]

For the linear regression models we need all features to be numeric, so we create dummy variables from the categorical features. We do not do this for the random forest dataset.

library(caret)
dummy <- dummyVars(" ~ .",data=alldata[,Categ.vars])
all.categoric <- data.frame(predict(dummy,newdata=alldata[,Categ.vars])) 

Linear regression models work best when the predictors are approximately normally distributed, so we apply a non-linear log transformation to all numeric variables that are far from normal. The skewness of a distribution measures how close to or far from the normal distribution it is.

require(e1071) # skewness
all.numeric <- all.numeric[,!names(all.numeric) %in% c('Id', 'SalePrice')] # remove SalePrice and Id
skewed <- apply(all.numeric, 2, skewness)
skewed <- skewed[(skewed > 0.75) | (skewed < -0.75)]

for(col in names(skewed)){all.numeric[,col] <- log(1+all.numeric[,col])}
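Base R’s log1p() computes the same log(1 + x) transform with better numerical behavior near zero; an equivalent drop-in sketch (run in place of the loop above, not in addition to it):

# Equivalent form of the transform above using log1p():
for (col in names(skewed)) { all.numeric[, col] <- log1p(all.numeric[, col]) }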

In addition to this transformation, we standardize the data using the caret package’s preProcess function:

preobj <- preProcess(all.numeric, method = c("center", "scale"))
all.numeric <- predict(preobj, all.numeric)

We combine all the numerical and categorical variables to have all data in one data frame:

allnumcat    <- cbind(all.numeric, all.categoric)

Random Forest

Two more steps remain before modeling: removing the features with near-zero variance and adding the “SalePricePerArea” feature.

nzv.data <- nearZeroVar(alldata.RF, saveMetrics = TRUE)
# identify the near-zero-variance predictors
drop.cols <- rownames(nzv.data)[nzv.data$nzv == TRUE]
# keep only the non-near-zero-variance variables
alldata.RF <- alldata.RF[,!names(alldata.RF) %in% drop.cols]

We add one more feature, “SalePricePerArea”, to the random forest dataset. Instead of predicting the house price directly, the random forest will predict the price per square foot. A cross-validation test (not presented here) shows that the random forest predicts price/sqft more accurately than the sale price itself. We use “TotalArea2” to calculate the price per sqft so that the garage area is taken into account.

alldata.RF$SalePricePerArea <- alldata$SalePrice / alldata.RF$TotalArea2

The entire train set is randomly divided into a training set (80%) and a validation set (20%). We train the model on the training set and test it on the validation set:

set.seed(1362)
inSub     <- createDataPartition(y=alldata.RF$SalePrice[1:1460] , p=0.8, list = FALSE)
alltrain   <- alldata.RF[1:1460,]
alltest    <- alldata.RF[1461:nrow(alldata.RF),]
training   <- alltrain[inSub,]
validating <- alltrain[-inSub,]
require(Metrics) # rmse
set.seed(12345)
# Note: trainControl() and metric are caret::train() arguments; randomForest()
# does not use them, so we call randomForest() directly.
randomForest_model <- randomForest(SalePricePerArea ~ . - Id - SalePrice,
                                   data = training, ntree = 600, importance = TRUE)
RF_predict_valid = predict(randomForest_model, newdata=validating) * validating$TotalArea2
rmse(log(validating$SalePricePerArea * validating$TotalArea2 +1), log(RF_predict_valid+1))
## [1] 0.1107049

An RMSE of 0.111 is the accuracy of the random forest model on the validation set. The following plot shows the random forest error versus the number of trees; the accuracy does not vary significantly beyond 150-200 trees, but we keep ntree = 600 for the final training/prediction.

plot(randomForest_model)
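The flattening can also be checked numerically: for a regression forest, the mse component of the fitted object holds the out-of-bag MSE after each tree. A quick sketch:

# Sketch: out-of-bag MSE after 150, 200, 400 and 600 trees; if the error
# curve has flattened, these values should be nearly identical.
round(randomForest_model$mse[c(150, 200, 400, 600)], 5)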

The next plot illustrates the top 15 important features in the random forest model. About 40% of the top 15 variables are engineered features.

varImpPlot(randomForest_model,type=1, n.var = 15)

Linear Regression

For the linear regression models we predict “SalePrice”, not the price/sqft. Before moving forward, we need to remove all variables with near-zero variance, as we did with the random forest dataset.

nzv.data <- nearZeroVar(allnumcat, saveMetrics = TRUE)
drop.cols <- rownames(nzv.data)[nzv.data$nzv == TRUE]
allnumcat <- allnumcat[,!names(allnumcat) %in% drop.cols]
LogSalePrice <- log(train$SalePrice+1)
LogSalePrice_train <- LogSalePrice[inSub]
LogSalePrice_valid <- LogSalePrice[-inSub]
alltrain2 <- allnumcat[1:1460,]
alltest2   <- allnumcat[1461:nrow(allnumcat),]
training2  <- alltrain2[inSub,]
validating2  <- alltrain2[-inSub,]

Because of the multicollinearity among our independent variables, it is not a good idea to apply a simple linear regression model without a regularization term; that is why we use lasso and ridge regression. Fortunately, the glmnet package provides the cv.glmnet function, which performs cross-validation to find the lambda that minimizes the error. Setting alpha = 0 gives ridge regression and alpha = 1 gives the lasso; any value in between yields an elastic-net model.

require(glmnet) # ridge, lasso & elastinet
glm.cv.ridge <- cv.glmnet(as.matrix(training2),   LogSalePrice_train, alpha = 0)
glm.cv.lasso <- cv.glmnet(as.matrix(training2),   LogSalePrice_train, alpha = 1)
glm.cv.net   <- cv.glmnet(data.matrix(training2), LogSalePrice_train, alpha = 0.1)

cv.glmnet returns the lambda (shrinkage parameter) corresponding to the minimum error. The following plot shows the error versus lambda for each model.

# use the lambda that minimizes the cross-validated error
penalty.ridge <- glm.cv.ridge$lambda.min
penalty.lasso <- glm.cv.lasso$lambda.min
penalty.net   <- glm.cv.net$lambda.min

par(mfrow=c(1,3))
plot(glm.cv.ridge, main = "Ridge")
plot(glm.cv.lasso, main = "Lasso")
plot(glm.cv.net,   main = "Elastic Net")
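As a side note, cv.glmnet also stores lambda.1se, the largest lambda within one standard error of the minimum; it is a more conservative (more regularized) choice that we do not use here, but it is easy to compare (a sketch for the lasso model):

c(min = glm.cv.lasso$lambda.min, one_se = glm.cv.lasso$lambda.1se)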

Now, let's train our linear regression models with the selected lambdas and check the RMSEs.

glm.ridge <- glmnet(x = as.matrix(training2), y = LogSalePrice_train, alpha = 0,   lambda = penalty.ridge )
glm.lasso <- glmnet(x = as.matrix(training2), y = LogSalePrice_train, alpha = 1,   lambda = penalty.lasso)
glm.net   <- glmnet(x = as.matrix(training2), y = LogSalePrice_train, alpha = 0.1, lambda = penalty.net)
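One practical difference between the penalties is worth a quick look: the lasso's L1 term sets some coefficients exactly to zero, performing implicit feature selection, while ridge only shrinks them toward zero (a sketch on the fitted models above):

# count the predictors each model drops entirely (intercept excluded)
sum(as.vector(coef(glm.lasso))[-1] == 0)  # lasso: typically many zeros
sum(as.vector(coef(glm.ridge))[-1] == 0)  # ridge: typically none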

We predict the sale price for the training set to see how well each model fits the data it was trained on.

y_pred.ridge <- as.numeric(predict(glm.ridge, as.matrix(training2)))
y_pred.lasso <- as.numeric(predict(glm.lasso, as.matrix(training2)))
y_pred.net   <- as.numeric(predict(glm.net,   as.matrix(training2)))

rmse(LogSalePrice_train, y_pred.ridge)
## [1] 0.1195338
rmse(LogSalePrice_train, y_pred.lasso)
## [1] 0.1218315
rmse(LogSalePrice_train, y_pred.net)
## [1] 0.1219004

RMSE for all three linear models is about 0.12 on the training set. The ridge model fits slightly better, which might be a sign of overfitting. Checking the RMSE on the validating set will indicate how accurately each model predicts an unseen sample.

y_pred_valid.ridge <- as.double(predict(glm.ridge, as.matrix(validating2)))
y_pred_valid.lasso <- as.double(predict(glm.lasso, as.matrix(validating2)))
y_pred_valid.net   <- as.double(predict(glm.net,   as.matrix(validating2)))

rmse(LogSalePrice_valid, y_pred_valid.ridge)
## [1] 0.1030443
rmse(LogSalePrice_valid, y_pred_valid.lasso)
## [1] 0.1029623
rmse(LogSalePrice_valid, y_pred_valid.net)
## [1] 0.1021343

All three models compete head-to-head on the validating set; however, elastic-net leads this time.

XGBoost

There are several resources available for learning about this technique, so I will not go into the details of the model here. I apply the "repeatedcv" cross-validation technique; however, I did not find any improvement in accuracy using "repeats" greater than 1. The expand.grid function lets us try different values for the XGBoost parameters and find the best combination for the final training.

library(gridExtra)
library(xgboost)
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

dtrain <- xgb.DMatrix(as.matrix(training2), label = LogSalePrice_train)
dvalid <- xgb.DMatrix(as.matrix(validating2), label = LogSalePrice_valid)
  
cv.ctrl <- trainControl(method = "repeatedcv", repeats = 1 ,number = 4, allowParallel=T)

xgb.grid <- expand.grid(nrounds = 1000,
                        eta = 0.01,                  
                        max_depth = c(4,6,8),
                        colsample_bytree= c(0.8,1),  # subsample ratio of columns when constructing each tree
                        min_child_weight = c(1,2,3),
                        subsample=c(0.4, 0.6, 0.8),
                        gamma=0.03)

set.seed(53)
xgb_tune <- train(as.matrix(training2),
                  LogSalePrice_train, method="xgbTree", 
                  trControl=cv.ctrl, 
                  tuneGrid=xgb.grid, 
                  verbose=T, metric="RMSE", nthread =4)
stopCluster(cl)
xgb_tune$bestTune
##    nrounds max_depth  eta gamma colsample_bytree min_child_weight
## 14    1000         4 0.01  0.03                1                2
##    subsample
## 14       0.6
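Beyond bestTune, the full grid results live in xgb_tune$results; inspecting the top few combinations shows how flat (or sharp) the tuning surface is (a sketch):

res <- xgb_tune$results
head(res[order(res$RMSE),
         c("max_depth", "min_child_weight", "subsample", "colsample_bytree", "RMSE")], 5)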

The "bestTune" attribute of "xgb_tune" reports the parameter combination with the lowest cross-validated RMSE, which guides the final training (note that the parameter list below was further hand-adjusted and adds the alpha and lambda regularization terms, which were not part of the grid). We know that nrounds = 1000 is relatively small, especially with eta = 0.01, so we train the XGBoost model again with nrounds = 5000; it may stop before the 5000th iteration if no further improvement is gained. Finally, the validating set will measure the accuracy of XGBoost on new samples.

xgb_params <- list(
  booster = 'gbtree',
  objective = 'reg:linear',
  colsample_bytree=1,
  eta=0.01,
  max_depth=6,
  min_child_weight=1,
  alpha=0.3,
  lambda=0.4,
  gamma=0.03, # less overfit
  subsample=0.4,
  seed=5,
  silent=TRUE)

set.seed(53)
watchlist = list(train=dtrain, valid=dvalid)
xgb_bst <- xgb.train(xgb_params, dtrain, nrounds = 5000, early_stopping_rounds = 100, 
                 watchlist, verbose =1, print_every_n = 200)
## [1]  train-rmse:11.414948    valid-rmse:11.419795 
## Multiple eval metrics are present. Will use valid_rmse for early stopping.
## Will train until valid_rmse hasn't improved in 100 rounds.
## 
## [201]    train-rmse:1.550923 valid-rmse:1.552659 
## [401]    train-rmse:0.245676 valid-rmse:0.251799 
## [601]    train-rmse:0.098856 valid-rmse:0.116219 
## [801]    train-rmse:0.081332 valid-rmse:0.105758 
## [1001]   train-rmse:0.074697 valid-rmse:0.103903 
## [1201]   train-rmse:0.071006 valid-rmse:0.103040 
## [1401]   train-rmse:0.068442 valid-rmse:0.102426 
## [1601]   train-rmse:0.066736 valid-rmse:0.102190 
## [1801]   train-rmse:0.065454 valid-rmse:0.102062 
## [2001]   train-rmse:0.064687 valid-rmse:0.101963 
## Stopping. Best iteration:
## [2068]   train-rmse:0.064392 valid-rmse:0.101921
y_pred_valid.xgb <- predict(xgb_bst, dvalid)
rmse(LogSalePrice_valid, y_pred_valid.xgb)
## [1] 0.1019215

We obtained RMSE = 0.102 for the validating set, the best among all the models created so far. It is interesting to see the important variables in the XGBoost model:

model.names <- dimnames(dtrain)[[2]]
importance_matrix <- xgb.importance(model.names, model = xgb_bst)
# plot the 15 most important features contributing to the model
par(mfrow=c(1,1))
xgb.plot.importance(importance_matrix[1:15])

On the importance of creating new features, it suffices to say that 40% of the top 15 and 60% of the top 5 variables are engineered features! "TotalArea2" in first place indicates that not only the living area is critical, but people also care about the garage area. This is not a new discovery; the correlation plot in our exploratory data analysis showed it too.
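That percentage can also be computed directly from importance_matrix, assuming a vector of the engineered feature names; only "TotalArea2" is spelled out in this sketch, the rest would come from the earlier feature engineering steps:

engineered <- c("TotalArea2")  # hypothetical: extend with the other engineered names
top15 <- importance_matrix$Feature[1:15]
mean(top15 %in% engineered)    # share of engineered features in the top 15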

Each of the models we created has strengths and weaknesses. For instance, tree-based models such as XGBoost cannot extrapolate, so a linear model can better predict sale prices outside the range seen in the training set. And since random forest selects features randomly, it generally does not overfit and can smooth out other, overfitted models. Thus, a few combined models are generated and the best combination is selected based on the validating-set RMSE.

result_valid <- data.frame(xgb   = exp(y_pred_valid.xgb)-1,
                           ridge = exp(y_pred_valid.ridge)-1,
                           lasso = exp(y_pred_valid.lasso)-1,
                           net   = exp(y_pred_valid.net)-1,
                           RF    = RF_predict_valid,
                           True  = exp(LogSalePrice_valid)-1)
result_valid$avgLR        <- rowMeans(result_valid[,c('ridge', 'lasso', 'net')])
result_valid$xgb_avgLR    <- rowMeans(result_valid[,c('xgb', 'avgLR')])
result_valid$avgall       <- rowMeans(result_valid[,c('xgb', 'ridge', 'lasso', 'net', 'RF')])
result_valid$XGB_avgLR_RF <- rowMeans(result_valid[,c('xgb', 'avgLR', 'RF')])
result_valid$RF_avgLR     <- rowMeans(result_valid[,c('avgLR', 'RF')])

rmse(log(result_valid$RF_avgLR+1) ,     LogSalePrice_valid)
## [1] 0.09525749
rmse(log(result_valid$XGB_avgLR_RF+1) , LogSalePrice_valid)
## [1] 0.09221097
rmse(log(result_valid$avgall+1) ,       LogSalePrice_valid)
## [1] 0.09406158
rmse(log(result_valid$xgb_avgLR+1) ,    LogSalePrice_valid)
## [1] 0.09751553
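All the combinations above use equal weights. As an optional extension (not part of the original analysis), a coarse grid search over the weights of the three components can sometimes squeeze out a bit more accuracy; we stick with the equal-weight averages for the submission:

# sketch: blend = w_xgb*xgb + w_lr*avgLR + w_rf*RF with weights summing to 1
grid <- expand.grid(w_xgb = seq(0, 1, by = 0.1), w_lr = seq(0, 1, by = 0.1))
grid <- grid[grid$w_xgb + grid$w_lr <= 1, ]
grid$w_rf <- 1 - grid$w_xgb - grid$w_lr
grid$rmse <- apply(grid, 1, function(w) {
  blend <- w["w_xgb"] * result_valid$xgb + w["w_lr"] * result_valid$avgLR +
           w["w_rf"] * result_valid$RF
  rmse(log(blend + 1), LogSalePrice_valid)
})
head(grid[order(grid$rmse), ], 3)  # best few weightings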

The best combination, with RMSE = 0.092, comes from averaging XGBoost, the mean of the three linear regression models, and the random forest. The last step is the final prediction using the entire train set.

Final Prediction

## Random Forest
set.seed(12345)
randomForest_model_final <- randomForest(SalePricePerArea ~ . -Id -SalePrice, data=alltrain,
                                         ntree=600, importance=FALSE)
# the formula references these columns, so they must exist in newdata
alltest[, c("SalePrice","SalePricePerArea")] <- 0
RF_predict_final <- predict(randomForest_model_final, newdata=alltest) * alltest$TotalArea2

## Linear Regression Models
glm.ridge_final <- glmnet(x = as.matrix(alltrain2), y = LogSalePrice, alpha = 0,   lambda = penalty.ridge )
glm.lasso_final <- glmnet(x = as.matrix(alltrain2), y = LogSalePrice, alpha = 1,   lambda = penalty.lasso)
glm.net_final   <- glmnet(x = as.matrix(alltrain2), y = LogSalePrice, alpha = 0.1, lambda = penalty.net)
y_pred_final.ridge <- as.double(predict(glm.ridge_final, as.matrix(alltest2)))
y_pred_final.lasso <- as.double(predict(glm.lasso_final, as.matrix(alltest2)))
y_pred_final.net   <- as.double(predict(glm.net_final,   as.matrix(alltest2)))

# XGBoost
dtrain2 <- xgb.DMatrix(as.matrix(alltrain2), label = LogSalePrice)
dtest   <- xgb.DMatrix(as.matrix(alltest2))
watchlist2 <- list(train=dtrain2)
# with only the training set in the watchlist, early stopping monitors train-rmse,
# which keeps improving, so all 10000 rounds are run
xgb_final <- xgb.train(xgb_params, dtrain2, nrounds = 10000, early_stopping_rounds = 100, 
                 watchlist2, verbose =1, print_every_n = 1000)
## [1]  train-rmse:11.415985 
## Will train until train_rmse hasn't improved in 100 rounds.
## 
## [1001]   train-rmse:0.073809 
## [2001]   train-rmse:0.063711 
## [3001]   train-rmse:0.060949 
## [4001]   train-rmse:0.059638 
## [5001]   train-rmse:0.058720 
## [6001]   train-rmse:0.057957 
## [7001]   train-rmse:0.057460 
## [8001]   train-rmse:0.056947 
## [9001]   train-rmse:0.056580 
## [10000]  train-rmse:0.056360
y_pred_final.xgb <- predict(xgb_final, dtest)

## gathering all predictions in one data frame
result_final <- data.frame (xgb = exp(y_pred_final.xgb)-1, ridge = exp(y_pred_final.ridge)-1, lasso = exp(y_pred_final.lasso)-1, net = exp(y_pred_final.net)-1 , RF=RF_predict_final)
result_final$avgLR <- rowSums(result_final[,c('ridge', 'lasso', 'net')])/3.0
result_final$XGB_avgLR_RF <- rowSums(result_final[,c('xgb', 'avgLR', 'RF')])/3.0
result_final$XGB_avgLR <- rowSums(result_final[,c('xgb', 'avgLR')])/2.0

Submission File

PredictedPrice <- cbind(test['Id'], result_final$XGB_avgLR_RF)
colnames(PredictedPrice) <- c("Id", "SalePrice")
write.csv(PredictedPrice, file = "PredictedPrice_XGB_LRs_RF.csv", row.names = F)
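Before uploading, a couple of quick sanity checks help catch silent mistakes (a sketch; the expected row count comes from dim(test)):

stopifnot(nrow(PredictedPrice) == 1459)            # one row per test house
stopifnot(!any(is.na(PredictedPrice$SalePrice)))   # no missing predictions
summary(PredictedPrice$SalePrice)                  # prices should be positive and plausible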

This prediction scores RMSE = 0.125 on the Kaggle final test set. Thanks to Tanner Carbonati for sharing his Kaggle kernel, which inspired the present analysis.