The following is part of my submission for the Kaggle House Prices competition. The goal is to predict house prices using advanced machine learning methods.

For more information, see the competition site here.

Introduction

In this analysis I use the Ames Housing dataset, which describes the sale of individual residential properties in Ames, Iowa from 2006 to 2010. The dataset contains almost 3,000 observations and a large number of explanatory variables involved in assessing home values. The challenge is to predict the final price of each house.

The competition structure is as follows: a training set with known sale prices, and a test set for which the sale price must be predicted.


Packages

I use the following packages in this analysis:

library(ggplot2)
library(caret)
library(dplyr)
library(corrplot)
library(RCurl)
library(lattice)
library(knitr)
library(plotly)
library(leaflet)
library(zoo)
library(reshape2)

Understand the Data

In this section I try to get a better sense of the data, see if some cleaning and pre-processing methods are needed, and understand the patterns of NA values.

Start by loading the raw data.

train.raw <- read.csv("train.csv")
test.raw <- read.csv("test.csv")

The training data contain 1460 observations and 81 features. First, let's have a look at the structure of the training dataset.

dim(train.raw)
[1] 1460   81
str(train.raw)
'data.frame':   1460 obs. of  81 variables:
 $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
 $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
 $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
 $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
 $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
 $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
 $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
 $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
 $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
 $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
 $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
 $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
 $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
 $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
 $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
 $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
 $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
 $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
 $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
 $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
 $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
 $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
 $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
 $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
 $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
 $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
 $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
 $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
 $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
 $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
 $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
 $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
 $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
 $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
 $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
 $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
 $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
 $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
 $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
 $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
 $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
 $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
 $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
 $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
 $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
 $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
 $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
 $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
 $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
 $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
 $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
 $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
 $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
 $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
 $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
 $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
 $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
 $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
 $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
 $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
 $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
 $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
 $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
 $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
 $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
 $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
 $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
 $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
 $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

The explanatory variables (columns 1-80) include four types:

  • Nominal
  • Ordinal
  • Discrete
  • Continuous

As a first step, it would be effective to classify the explanatory variables into these categories. For this I use the dataset codebook.

codebook <- getURL("https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt")
# split the text file to strings:
codebook <- unlist(strsplit(codebook, "[ \n()]+"))
# utilizing the order of the variable description (identical to the variables order in the dataset).
# searching for the features' types pattern in the codebook text file:
grep.type <- grep("Nominal|Ordinal|Discrete|Continuous", codebook, value = T)
# arrange in a data.frame (ignoring the first element, which relates to the observation index):
variables <- data.frame(name = names(train.raw), type = grep.type[2:82])
continuous <- as.character(variables[variables$type == "Continuous",1])
discretes <- as.character(variables[variables$type == "Discrete",1])
nominals <- as.character(variables[variables$type == "Nominal",1])
nominals <- nominals[2:24] # removing ID variable from the predictors
ordinals <- as.character(variables[variables$type == "Ordinal",1])
# first 10 variables' types
head(variables, 10)
          name       type
1           Id    Nominal
2   MSSubClass    Nominal
3     MSZoning    Nominal
4  LotFrontage Continuous
5      LotArea Continuous
6       Street    Nominal
7        Alley    Nominal
8     LotShape    Ordinal
9  LandContour    Nominal
10   Utilities    Ordinal

Now, let's get a better sense of the data. It would be useful to evaluate variables of each type separately. First, let's see how many variables are in each type group:

table(variables$type)

Continuous   Discrete    Nominal    Ordinal 
        20         14         24         23 

Nominal predictors

Starting with the nominal predictors, let's look at the variables' levels:

par(mfrow = c(5,5), mar = c(2,1,3,3)) # adjusting  graphical parameters
sapply(names(train.raw[ , nominals]),
       function(x) barplot(table(train.raw[ ,x]), main = x))

It seems that some features have low variation across levels (e.g. Street, LandContour, Condition1, and more). The concern here is that these predictors may become zero-variance predictors when the data are split into cross-validation/bootstrap sub-samples, or that a few samples may have an undue influence on the model. In the following section I will identify these near-zero-variance predictors and adjust them before modeling. Also, the variable MSSubClass should be converted into a factor.

train.raw$MSSubClass <- as.factor(train.raw$MSSubClass)

Ordinal predictors

In order to maximize the information that can be derived from the ordinal variables, it is important to express the order of their levels. I will do this in the following sections. For now, let's look at the variation across the levels.

par(mfrow = c(5,5), mar = c(2,1,3,3)) # adjusting  graphical parameters
sapply(names(train.raw[ , ordinals]),
       function(x) barplot(table(train.raw[ ,x]), main = x))

It can be seen that in some variables the variation is relatively low; again, near-zero-variance predictors will need to be handled. It should be noted that the ordering of the levels is not necessarily reflected in the barplots.
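As a preview of how this ordering can be expressed, here is a minimal sketch using R's ordered factors, assuming the usual quality scale Po < Fa < TA < Gd < Ex (the vector x is illustrative):

# encode an ordinal variable with an explicit level order
x <- factor(c("TA", "Gd", "Ex", "Fa", "TA"),
            levels = c("Po", "Fa", "TA", "Gd", "Ex"),
            ordered = TRUE)
x < "Gd" # comparisons now respect the quality order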

Discrete predictors

The discrete predictors are mostly counts, such as the number of rooms, bathrooms, fireplaces, and garage capacity. Let's look at their level frequencies:

par(mfrow = c(3,5), mar = c(2,1,3,3)) # adjusting  graphical parameters
sapply(names(train.raw[ , discretes]),
       function(x) barplot(table(train.raw[ ,x]), main = x))

The discrete predictors' distributions look fine.

Continuous predictors

In some continuous explanatory variables, such as PoolArea or BsmtFinSF1, a value of 0 means that there is no measurement (e.g. no pool or no finished basement). In such cases the absence should be reported through a related categorical/ordinal variable so that the information does not get lost (I'll handle this later). For now, it is more useful to look only at the distribution of the positive values.

par(mfrow = c(4,5), mar = c(2,1,3,3)) # adjusting  graphical parameters
sapply(names(train.raw[ , continuous]),
       function(x) plot(density(train.raw[train.raw[ ,x]!=0,x],
                                na.rm = T), main = x))

It can be seen that some of the continuous variables' distributions are heavily skewed. I will address this issue in the Cleaning and PreProcessing section.

Missing values

Another important aspect of understanding the data is to understand the patterns of missing values. Here I provide a short analysis of the missing values in the dataset.

I'll start by calculating the share of NA values for each predictor:

vars.na.share <- sapply(train.raw, function(x) mean(is.na(x)))
variables$na.share <- vars.na.share
variables$na.exist <- vars.na.share > 0
par(mfrow = c(1,1))
na.table <- filter(variables, na.exist == T) %>% arrange(-na.share)
ggplot(data = na.table,
       aes(x = factor(name, levels = na.table$name),
           y = na.share)) +
    geom_bar(stat = "identity") + 
    coord_flip() +
    labs(title = "Share of NA values", x = "", y = "")

For many predictors the share of NA values is too large to ignore, and it seems there is no alternative but to dig into the data.

It is easy to see that several groups of variables share the same proportion of NA values. Let's take care of them first.

Starting with the garage-related variables, here are the first 6 NA rows:

garage <- grep("Garage", names(train.raw), value = T)
kable(head(train.raw[is.na(train.raw$GarageType),garage]), 
      row.names = F, format = "markdown", align = "c")
| GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond |
|:----------:|:-----------:|:------------:|:----------:|:----------:|:----------:|:----------:|
|     NA     |     NA      |      NA      |     0      |     0      |     NA     |     NA     |
|     NA     |     NA      |      NA      |     0      |     0      |     NA     |     NA     |
|     NA     |     NA      |      NA      |     0      |     0      |     NA     |     NA     |
|     NA     |     NA      |      NA      |     0      |     0      |     NA     |     NA     |
|     NA     |     NA      |      NA      |     0      |     0      |     NA     |     NA     |
|     NA     |     NA      |      NA      |     0      |     0      |     NA     |     NA     |

We see that GarageCars and GarageArea are both 0 where the NAs exist. Looking at the codebook, the common denominator is that NA represents No Garage in all of these variables.

Next, the basement-related variables:

basement <- grep("Bsmt", names(train.raw), value = T)
kable(head(train.raw[is.na(train.raw$BsmtFinType1), basement]), 
      row.names = F, format = "markdown", align = "c")
| BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | BsmtFullBath | BsmtHalfBath |
|:--------:|:--------:|:------------:|:------------:|:----------:|:------------:|:----------:|:---------:|:-----------:|:------------:|:------------:|
|    NA    |    NA    |      NA      |      NA      |     0      |      NA      |     0      |     0     |      0      |      0       |      0       |
|    NA    |    NA    |      NA      |      NA      |     0      |      NA      |     0      |     0     |      0      |      0       |      0       |
|    NA    |    NA    |      NA      |      NA      |     0      |      NA      |     0      |     0     |      0      |      0       |      0       |
|    NA    |    NA    |      NA      |      NA      |     0      |      NA      |     0      |     0     |      0      |      0       |      0       |
|    NA    |    NA    |      NA      |      NA      |     0      |      NA      |     0      |     0     |      0      |      0       |      0       |
|    NA    |    NA    |      NA      |      NA      |     0      |      NA      |     0      |     0     |      0      |      0       |      0       |

Here, too, the codebook states that in all of these variables NA values represent No Basement.

Last group: Masonry veneer related variables:

masvnr <- grep("MasVnr", names(train.raw), value = T)
kable(head(train.raw[is.na(train.raw$MasVnrType), masvnr]), 
      row.names = F, format = "markdown", align = "c")
| MasVnrType | MasVnrArea |
|:----------:|:----------:|
|     NA     |     NA     |
|     NA     |     NA     |
|     NA     |     NA     |
|     NA     |     NA     |
|     NA     |     NA     |
|     NA     |     NA     |

These two are the only masonry veneer related variables. The codebook gives no explanation for NA values here, and None already gets its own level. However, the number of NAs is negligible (less than 0.5%), so I'll assume that NA means "None" for MasVnrType and 0 for MasVnrArea.
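A direct way to encode this assumption is sketched below; the generic imputation loops in the Data Cleaning section handle it as well.

# add a "None" level to the factor, then recode the NAs
levels(train.raw$MasVnrType) <- c(levels(train.raw$MasVnrType), "None")
train.raw$MasVnrType[is.na(train.raw$MasVnrType)] <- "None"
train.raw$MasVnrArea[is.na(train.raw$MasVnrArea)] <- 0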

The rest of the NAs and their description from the codebook by type:

  • Ordinal / Nominal:
  • FireplaceQu - Fireplace quality
  • Fence - Fence quality
  • Alley - Type of alley access to property
  • MiscFeature - Miscellaneous feature not covered in other categories
  • PoolQC - Pool quality

In all of these variables, NA represents None (e.g. no fence, no pool, etc.).

  • Continuous:
  • LotFrontage - Linear feet of street connected to property

Here we don't have an explanation of the NA values, but it is reasonable to assume that NA means 0 (i.e. no street is connected). To check whether this makes sense, I look at the smallest values of LotFrontage:

quantile(train.raw$LotFrontage, 
         probs = seq(.01,.05,len = 5),
         na.rm = T)
## 1% 2% 3% 4% 5% 
## 21 24 24 32 34

There are reasonably small values not far from 0, so the assumption of NA = 0 makes sense.


Data Cleaning and PreProcessing

In this section I clean and arrange the data, and prepare it for prediction. It is important to perform exactly the same pre-processing steps on the testing set for future predictions.

# start by creating new data frames for training and testing
train.clean <- train.raw
test.clean <- test.raw

Imputing Missing values

As I have shown, the pattern of missing values can be identified. Here I transform the missing data into meaningful information, according to the description detailed in the Missing values section.

## training set
    # factors: addNA turns NA into an explicit level, which is then renamed "None"
for (i in na.table[na.table$type %in% c("Ordinal","Nominal"),"name"]){
    train.clean[ ,i] <- `levels<-`(addNA(train.clean[ ,i]),
                          c(levels(train.clean[ ,i]), "None"))
}
    # integers
for (i in na.table[na.table$type %in% c("Continuous","Discrete"),"name"]){
    train.clean[ ,i][is.na(train.clean[ ,i])] <- 0
}

## testing set
    # factors: addNA turns NA into an explicit level, which is then renamed "None"
for (i in na.table[na.table$type %in% c("Ordinal","Nominal"),"name"]){
    test.clean[ ,i] <- `levels<-`(addNA(test.clean[ ,i]),
                          c(levels(test.clean[ ,i]), "None"))
}
    # integers
for (i in na.table[na.table$type %in% c("Continuous","Discrete"),"name"]){
    test.clean[ ,i][is.na(test.clean[ ,i])] <- 0
}
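Since identical steps are applied to both sets, a small helper avoids the duplication. A sketch (impute.na is a name I introduce here):

impute.na <- function(df, na.table) {
    # factors: make NA an explicit level named "None"
    for (i in na.table[na.table$type %in% c("Ordinal", "Nominal"), "name"]) {
        df[ ,i] <- `levels<-`(addNA(df[ ,i]), c(levels(df[ ,i]), "None"))
    }
    # numeric variables: NA becomes 0
    for (i in na.table[na.table$type %in% c("Continuous", "Discrete"), "name"]) {
        df[ ,i][is.na(df[ ,i])] <- 0
    }
    df
}
# train.clean <- impute.na(train.clean, na.table)
# test.clean  <- impute.na(test.clean, na.table)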

Skewness of continuous predictors

As shown earlier, some of the continuous variables' distributions (of the non-zero values) are heavily skewed. To cope with this issue, a monotonic transformation of the continuous predictors can be a great idea. It should be noted that zero values represent no measurement, hence variables which contain zeros should be treated carefully. As mentioned, the absence of a continuous measurement is already reported through another variable (e.g. PoolQC == "None" means no pool, hence PoolArea == 0). Thus, I would like to keep the zeros where the related categorical variable value is "None". To do this, I apply the one-parameter Box-Cox transformation to the non-zero values only.

The one-parameter Box-Cox is a useful data transformation technique, used to stabilize variance, make the data more normal-distribution-like, and improve the validity of measures of association such as the Pearson correlation between variables. It is defined as:

\[ y_{i}^{(\lambda)} = \begin{cases} \frac{y_{i}^{\lambda} -1}{\lambda}, & \text{if }\lambda \ne 0 \\ \ln{y_{i}}, & \text{if }\lambda =0 \end{cases} \]

where the power parameter \(\lambda\) is estimated using the profile likelihood function. (If you are not familiar with the Box-Cox transformation, see this insightful page.)
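Before applying it to the housing data, here is a minimal illustration of caret's BoxCoxTrans on a simulated right-skewed vector (the object names are mine):

set.seed(1)
x <- rexp(1000, rate = 0.1) # a strictly positive, right-skewed sample
bc <- BoxCoxTrans(x)        # lambda estimated by profile likelihood
bc$lambda                   # the estimated power parameter
x.t <- predict(bc, x)       # transformed values, much closer to symmetric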

cont.pred <- continuous[-20] # transform the predictors only (SalePrice excluded)

## training set
train.clean[ ,cont.pred][train.clean[ ,cont.pred]==0] <- NA # masking zeros

for (i in cont.pred) {
    boxcox <- BoxCoxTrans(train.clean[ ,i], na.rm = T)
    train.clean[ ,i] <- predict(boxcox, train.clean[ ,i])
}
train.clean[ ,cont.pred][is.na(train.clean[ ,cont.pred])] <- 0 # restoring zeros

## testing set (strictly, the lambdas estimated on the training set should be
## reused here; re-estimating on the test set follows the original approach)
test.clean[ ,cont.pred][test.clean[ ,cont.pred]==0] <- NA # masking zeros

for (i in cont.pred) {
    boxcox <- BoxCoxTrans(test.clean[ ,i], na.rm = T)
    test.clean[ ,i] <- predict(boxcox, test.clean[ ,i])
}
test.clean[ ,cont.pred][is.na(test.clean[ ,cont.pred])] <- 0 # restoring zeros

Near-Zero-Variance predictors

After cleaning the data and before moving to prediction, it is useful to exclude variables with almost no variation, since they could be harmful to the prediction. I employ the nearZeroVar function to identify these predictors.

nzvMat <- nearZeroVar(train.clean, saveMetrics = T)
nzvMat
##                 freqRatio percentUnique zeroVar   nzv
## Id               1.000000   100.0000000   FALSE FALSE
## MSSubClass       1.792642     1.0273973   FALSE FALSE
## MSZoning         5.279817     0.3424658   FALSE FALSE
## LotFrontage      1.811189     7.6027397   FALSE FALSE
## LotArea          1.041667    73.4931507   FALSE FALSE
## Street         242.333333     0.1369863   FALSE  TRUE
## Alley           27.380000     0.2054795   FALSE  TRUE
## LotShape         1.911157     0.2739726   FALSE FALSE
## LandContour     20.809524     0.2739726   FALSE  TRUE
## Utilities     1459.000000     0.1369863   FALSE  TRUE
## LotConfig        4.000000     0.3424658   FALSE FALSE
## LandSlope       21.261538     0.2054795   FALSE  TRUE
## Neighborhood     1.500000     1.7123288   FALSE FALSE
## Condition1      15.555556     0.6164384   FALSE FALSE
## Condition2     240.833333     0.5479452   FALSE  TRUE
## BldgType        10.701754     0.3424658   FALSE FALSE
## HouseStyle       1.631461     0.5479452   FALSE FALSE
## OverallQual      1.061497     0.6849315   FALSE FALSE
## OverallCond      3.257937     0.6164384   FALSE FALSE
## YearBuilt        1.046875     7.6712329   FALSE FALSE
## YearRemodAdd     1.835052     4.1780822   FALSE FALSE
## RoofStyle        3.989510     0.4109589   FALSE FALSE
## RoofMatl       130.363636     0.5479452   FALSE  TRUE
## Exterior1st      2.319820     1.0273973   FALSE FALSE
## Exterior2nd      2.355140     1.0958904   FALSE FALSE
## MasVnrType       1.959551     0.2739726   FALSE FALSE
## MasVnrArea     108.625000    22.3972603   FALSE FALSE
## ExterQual        1.856557     0.2739726   FALSE FALSE
## ExterCond        8.780822     0.3424658   FALSE FALSE
## Foundation       1.020505     0.4109589   FALSE FALSE
## BsmtQual         1.050162     0.3424658   FALSE FALSE
## BsmtCond        20.169231     0.3424658   FALSE  TRUE
## BsmtExposure     4.312217     0.3424658   FALSE FALSE
## BsmtFinType1     1.028708     0.4794521   FALSE FALSE
## BsmtFinSF1      38.916667    43.6301370   FALSE FALSE
## BsmtFinType2    23.259259     0.4794521   FALSE  TRUE
## BsmtFinSF2     258.600000     9.8630137   FALSE  TRUE
## BsmtUnfSF       13.111111    53.4246575   FALSE FALSE
## TotalBsmtSF      1.057143    49.3835616   FALSE FALSE
## Heating         79.333333     0.4109589   FALSE  TRUE
## HeatingQC        1.731308     0.3424658   FALSE FALSE
## CentralAir      14.368421     0.1369863   FALSE FALSE
## Electrical      14.191489     0.4109589   FALSE FALSE
## X1stFlrSF        1.562500    51.5753425   FALSE FALSE
## X2ndFlrSF       82.900000    28.5616438   FALSE FALSE
## LowQualFinSF   478.000000     1.6438356   FALSE  TRUE
## GrLivArea        1.571429    58.9726027   FALSE FALSE
## BsmtFullBath     1.455782     0.2739726   FALSE FALSE
## BsmtHalfBath    17.225000     0.2054795   FALSE FALSE
## FullBath         1.181538     0.2739726   FALSE FALSE
## HalfBath         1.706542     0.2054795   FALSE FALSE
## BedroomAbvGr     2.245810     0.5479452   FALSE FALSE
## KitchenAbvGr    21.415385     0.2739726   FALSE  TRUE
## KitchenQual      1.254266     0.2739726   FALSE FALSE
## TotRmsAbvGrd     1.221884     0.8219178   FALSE FALSE
## Functional      40.000000     0.4794521   FALSE  TRUE
## Fireplaces       1.061538     0.2739726   FALSE FALSE
## FireplaceQu      1.815789     0.4109589   FALSE FALSE
## GarageType       2.248062     0.4794521   FALSE FALSE
## GarageYrBlt      1.246154     6.7123288   FALSE FALSE
## GarageFinish     1.433649     0.2739726   FALSE FALSE
## GarageCars       2.233062     0.3424658   FALSE FALSE
## GarageArea       1.653061    30.2054795   FALSE FALSE
## GarageQual      16.185185     0.4109589   FALSE FALSE
## GarageCond      16.370370     0.4109589   FALSE FALSE
## PavedDrive      14.888889     0.2054795   FALSE FALSE
## WoodDeckSF      20.026316    18.7671233   FALSE FALSE
## OpenPorchSF     22.620690    13.8356164   FALSE FALSE
## EnclosedPorch   83.466667     8.2191781   FALSE  TRUE
## X3SsnPorch     478.666667     1.3698630   FALSE  TRUE
## ScreenPorch    224.000000     5.2054795   FALSE  TRUE
## PoolArea      1453.000000     0.5479452   FALSE  TRUE
## PoolQC         484.333333     0.2739726   FALSE  TRUE
## Fence            7.509554     0.3424658   FALSE FALSE
## MiscFeature     28.693878     0.3424658   FALSE  TRUE
## MiscVal        128.000000     1.4383562   FALSE  TRUE
## MoSold           1.081197     0.8219178   FALSE FALSE
## YrSold           1.027356     0.3424658   FALSE FALSE
## SaleType        10.385246     0.6164384   FALSE FALSE
## SaleCondition    9.584000     0.4109589   FALSE FALSE
## SalePrice        1.176471    45.4109589   FALSE FALSE
nzvNames <- row.names(nzvMat[nzvMat$nzv==T, ])
train.clean <- train.clean[ ,!nzvMat$nzv]
test.clean <- test.clean[ ,!nzvMat$nzv[-81]]

Predictive analytics

As a first step, I divide the training data into 80% training and 20% testing/validation (note that the ts data take no part in model fitting). From this stage, any learning will use the training set only.

set.seed(1948)
intrain <- createDataPartition(train.clean$SalePrice, p = 0.8, list = F)
tr <- train.clean[intrain, ]
ts <- train.clean[-intrain, ]

Pre-Modeling: Finding patterns in the data

Before constructing the model, it is useful to look for patterns in the data to get some hints about the correlations between the variables.

Let's start by plotting the correlation matrix of the continuous variables:

conttoplot <- continuous[!(continuous %in% nzvNames)]
corrplot(cor(tr[ ,conttoplot]), method = "ellipse", tl.col = "black")

As might be expected, variables related to house size are positively correlated with the price, and with each other.

Next, let's look at the relation between the price and the ordinal variables which relate to the quality of elements in the house.

# define the order of quality levels:
q.order <- c("None", "Po", "Fa", "TA", "Gd", "Ex")
# find the relevant variables (Ex is one of the levels)
q.var <- sapply(tr, function(x) sum(levels(x)=="Ex")>0)
par(mfrow = c(3,3), mar = c(2,1,3,3))
for ( i in names(tr[ ,q.var])) {
    p <- tr[ , "SalePrice"]
    o <- factor(tr[ , i], levels = q.order)
    boxplot(p~o, main = i)
}

We can see the mostly monotonic effect of the quality levels, though some exceptions exist.

Let's look at the Overall Quality and Overall Condition variables in a fancier plot:

OA.df <- tr[ ,c("OverallQual","OverallCond","SalePrice")]
pl.q <- plot_ly(OA.df, y = ~SalePrice, x = ~OverallQual, 
                type = "box", name = "Overall Quality")
pl.c <- plot_ly(OA.df, y = ~SalePrice, x = ~OverallCond, 
                type = "box", name = "Overall Condition")
subplot(pl.q, pl.c)

It is interesting that while the Overall Quality correlation with the price is monotonic, the Overall Condition correlation is less clear. We can see the large variation of house prices in some condition levels (5 and 9). This may require treating the ordinal variables as categoricals, or using other techniques.

Next, we should look for time trends and serial correlation that might need to be controlled for.

time.df <- tr[ ,c("YrSold","MoSold", "SalePrice", "BedroomAbvGr")]
time.df$SalePrice <- time.df$SalePrice/100000
time.df$quarter <- cut(tr$MoSold,
                       breaks = c(0,3,6,9,12),
                       labels = c("Q1","Q2","Q3","Q4"))

time.df$YearQtr <- with(time.df, as.yearqtr(paste(YrSold,quarter)))

time.df$Bedrooms <- cut(time.df$BedroomAbvGr,
                        breaks = c(-1,2,3,Inf),
                        labels = c("0-2","3","4+"))

time.agg <- summarise(group_by(time.df, YearQtr, Bedrooms),
                      mean = mean(SalePrice),
                      sem = sd(SalePrice)/sqrt(n()),
                      sales = n())

ggplotly(ggplot(time.agg, aes(x = YearQtr, colour = Bedrooms))+
             geom_ribbon(aes(ymin = mean - sem,
                             ymax = mean + sem), fill = "grey70", alpha = 0.2) +
             geom_line(aes(y = mean))+
             facet_grid(Bedrooms~.) +
             labs(x = "", y = "") +
             coord_cartesian(ylim = c(1,3.5))) %>%
    layout(title = "Mean Sale Price ($100,000)")

It seems that the seasonality effects are not very clear.

Finally, let's have a look at the spatial structure. Thanks to JulienSiems, who posted the GPS coordinates of the neighborhoods (based on the school location in each district) in his script, we can look for spatial effects on house prices. For instance, we could use the distances between neighborhoods and key areas, such as employment and shopping areas, to capture effects that neighborhood dummies cannot.

coordinates <- data.frame(Neighborhood = levels(tr[ ,"Neighborhood"]),
                          lat = c(42.062806, 42.009408, 42.052500, 42.033590, 42.025425,
                                  42.021051, 42.025949, 42.022800, 42.027885, 42.019208, 
                                  41.991866, 42.031307, 42.042966, 42.050307, 42.050207,
                                  42.060356, 42.051321, 42.028863, 42.033611, 42.035540, 
                                  42.052191, 42.060752, 42.017578, 41.998132, 42.040106),
                          
                          lng = c(-93.639963, -93.645543, -93.628821, -93.627552, -93.675741, 
                                  -93.685643, -93.620215, -93.663040, -93.615692, -93.623401,
                                  -93.602441, -93.626967, -93.613556, -93.656045, -93.625827, 
                                  -93.657107, -93.633798, -93.615497, -93.669348, -93.685131,
                                  -93.643479, -93.628955, -93.651283, -93.648335, -93.657032))
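For example, a crude distance feature can be derived from these coordinates. A sketch, using the centroid of all neighborhoods as a stand-in for a key area (dist.center is a name I introduce here):

# Euclidean distance (in degrees) from each neighborhood to the overall centroid
center.lat <- mean(coordinates$lat)
center.lng <- mean(coordinates$lng)
coordinates$dist.center <- sqrt((coordinates$lat - center.lat)^2 +
                                (coordinates$lng - center.lng)^2)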

We can look at the geographical distribution of the house prices: circle size represents the number of sales in the neighborhood, and darker colors represent a higher average price.

nbh.price <- summarise(group_by(tr, Neighborhood),
                       sales = n(),
                       mean.price = mean(SalePrice))

coordinates <- merge(x = coordinates,
                     y = nbh.price, 
                     by = "Neighborhood",
                     all.x = T)

pal <- colorNumeric(palette = "Reds",
    domain = coordinates$mean.price)

Ames <- leaflet(coordinates) %>% 
    addTiles() %>% 
    addCircles(lat = ~lat,
               lng = ~lng, weight = 10,
               radius = ~sales*8,
               color = ~pal(coordinates$mean.price)) %>%
    addMarkers(popup = ~paste(Neighborhood,", Mean:",
                              round(mean.price),sep = ""))
Ames

Prediction

In this section I'll present some machine learning techniques to predict the price of a house.

The models are evaluated according to the competition rules, by the root mean squared error between the logarithms of the predicted and observed prices:

\[RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \ln{y_i} - \ln{\hat{y}_i} \right)^2 }\]

where \(y_i\) is the observed house sale price and \(\hat{y_i}\) is the predicted value.
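As a helper, here is a minimal sketch of this metric (rmse.log is a name I introduce here). Since the models below predict log-prices directly, the RMSE that postResample reports on the log scale is the same quantity.

# RMSE between the logs of the observed and predicted prices
rmse.log <- function(y, y.hat) sqrt(mean((log(y) - log(y.hat))^2))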

Linear Model

Here I present only three embarrassingly simple linear models. Later, I will post more complex models involving more sophisticated machine learning techniques.

Still, in many cases the best benchmark is linear least squares.

# predicting log(price)
tr$logp <- log(tr$SalePrice)
ts$logp <- log(ts$SalePrice)

lm1 <- train(logp~., data = tr[,-c(1,60)], method = "lm") # dropping Id and SalePrice
ts$pr1 <- predict(lm1, newdata = ts)

qplot(y = ts$pr1, x =ts$logp) + 
    labs(x = "Observed log(Price)", y = "Predicted log(Price)") +
    geom_abline(intercept = 0, slope = 1)

We can see that:

  • The linear model predictions are not very far off
  • There is a concern of heteroskedasticity
  • Outliers may exist

The RMSE:

postResample(ts$pr1, ts$logp)
##      RMSE  Rsquared 
## 0.1238644 0.9040120
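Equivalently, using the rmse.log helper sketched earlier, rmse.log(exp(ts$logp), exp(ts$pr1)) returns the same RMSE, since logp is already log(SalePrice).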

Now, let's use the ordinal variables as discrete numeric predictors, including polynomial effects (only for variables which explicitly express a quality or condition level):

tonum <- function(df) {
    for (var in names(df[ ,q.var])) {
        df[ ,var] <- as.numeric(
            recode_factor(df[ ,var],
                          None="0",Po="1",Fa="2",TA="3",Gd="4",Ex="5"))
        df[ ,paste(var,"2",sep = "")] <- df[ ,var]^2
        df[ ,paste(var,"3",sep = "")] <- df[ ,var]^3
    }
    df
}

tr.2<-tonum(tr)
ts.2<-tonum(ts)


lm2 <- train(logp~.,
             data = tr.2[ ,-c(1,60)], method = "lm")
ts.2$pr2 <- predict(lm2, newdata = ts.2)
postResample(ts.2$pr2, ts.2$logp)
##      RMSE  Rsquared 
## 0.1242235 0.9032776

It seems that using the ordinal variables as categoricals (i.e. a dummy for each level) is slightly more useful in terms of RMSE; however, the differences are not significant.

For the third linear model I add some variables.

First, I create the house age and the garage age at the time of sale, instead of YrSold, YearBuilt and GarageYrBlt. Also, let's create the time since the last renovation, instead of YearRemodAdd.

tr$houseage <- tr$YrSold - tr$YearBuilt
ts$houseage <- ts$YrSold - ts$YearBuilt

tr$houseage2 <- tr$houseage^2
ts$houseage2 <- ts$houseage^2

tr$garageage <- (tr$YrSold - tr$GarageYrBlt) * ifelse(tr$GarageYrBlt==0,0,1)
ts$garageage <- (ts$YrSold - ts$GarageYrBlt) * ifelse(ts$GarageYrBlt==0,0,1)

tr$timeremode <- tr$YrSold - tr$YearRemodAdd
ts$timeremode <- ts$YrSold - ts$YearRemodAdd

Now, let's have a look at the relationship between log(SalePrice) and the size variables to check whether polynomial effects are needed.

size.var <- grep("SF|Area", names(tr), value = T)
par(mfrow = c(3,4), mar = c(3,3,2,2))
for (v in size.var) {
    d <- data.frame(var = tr[,v], p = log(tr$SalePrice))
    d <- filter(d, var != 0) %>% arrange(-var)
    plot(d, main = v)
    model <- lm(p ~ var + I(var^2), data = d) # quadratic fit on the non-zero values
    lines(x = d$var, y = predict(model), col = "red", lwd = 2)
}

In order to avoid overfitting, I use only linear effects.

Also, we make sure MSSubClass is treated as a factor:

tr.3 <- tr
ts.3 <- ts

tr.3$MSSubClass <- as.factor(tr.3$MSSubClass)
ts.3$MSSubClass <- as.factor(ts.3$MSSubClass)

 # drop the variables replaced by the new age features, plus identifiers
tr.3 <- select(tr.3, -c(Id,YrSold,YearBuilt,GarageYrBlt,YearRemodAdd,
                        SalePrice))
ts.3 <- select(ts.3, -c(Id,YrSold,YearBuilt,GarageYrBlt,YearRemodAdd))

The third linear model:

lm3 <- train(logp~., data = tr.3, method = "lm")
ts.3$pr3 <- predict(lm3, newdata = ts.3)
postResample(ts.3$pr3, ts.3$logp)
##      RMSE  Rsquared 
## 0.1238903 0.9039452

To be continued.

Ron Sarafian.