The following is part of my submission for the Kaggle House Prices competition. The goal is to predict house prices using advanced machine learning methods.
For more information, see the competition site here.
In this analysis I use the Ames Housing dataset, which describes the sale of individual residential properties in Ames, Iowa from 2006 to 2010. The data set contains almost 3,000 observations and a large number of explanatory variables involved in assessing home values. The challenge is to predict the final price of each house.
The competition structure is as follows:
I use the following packages in this analysis:
library(ggplot2)
library(caret)
library(dplyr)
library(corrplot)
library(RCurl)
library(lattice)
library(knitr)
library(plotly)
library(leaflet)
library(zoo)
library(reshape2)
In this section I try to get a better sense of the data, see if some cleaning and pre-processing methods are needed, and understand the patterns of NA values.
Start by loading the raw data.
train.raw <- read.csv("train.csv")
test.raw <- read.csv("test.csv")
The training data contain 1460 observations and 81 features. First, let's have a look at the structure of the training dataset.
dim(train.raw)
[1] 1460 81
str(train.raw)
'data.frame': 1460 obs. of 81 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
$ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
$ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
$ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
$ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
$ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
$ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
$ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
$ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
$ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
$ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
$ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
$ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
$ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
$ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
$ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
$ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
$ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
$ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
$ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
$ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
$ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
$ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
$ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
$ ExterQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
$ ExterCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
$ BsmtQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
$ BsmtCond : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
$ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
$ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
$ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
$ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
$ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
$ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
$ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
$ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
$ HeatingQC : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
$ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
$ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
$ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
$ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
$ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
$ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
$ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
$ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
$ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
$ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
$ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
$ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
$ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
$ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
$ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
$ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
$ FireplaceQu : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
$ GarageType : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
$ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
$ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
$ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
$ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
$ GarageQual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
$ GarageCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
$ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
$ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
$ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
$ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
$ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
$ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
$ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
$ PoolQC : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
$ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
$ MiscFeature : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
$ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
$ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
$ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
$ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
$ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
$ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
The explanatory variables (columns 1:80) include four types: Nominal, Ordinal, Discrete, and Continuous.
As a first step, it would be useful to classify the explanatory variables into these categories. For this I use the dataset codebook.
codebook <- getURL("https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt")
# split the text file to strings:
codebook <- unlist(strsplit(codebook, "[ \n()]+"))
# utilizing the order of the variable description (identical to the variables order in the dataset).
# searching for the features' types pattern in the codebook text file:
grep.type <- grep("Nominal|Ordinal|Discrete|Continuous", codebook, value = T)
# arrange in a data.frame (ignoring the first element, which relates to the observation index):
variables <- data.frame(name = names(train.raw), type = grep.type[2:82])
continuous <- as.character(variables[variables$type == "Continuous",1])
discretes <- as.character(variables[variables$type == "Discrete",1])
nominals <- as.character(variables[variables$type == "Nominal",1])
nominals <- nominals[2:24] # removing ID variable from the predictors
ordinals <- as.character(variables[variables$type == "Ordinal",1])
# first 10 variables' types
head(variables, 10)
name type
1 Id Nominal
2 MSSubClass Nominal
3 MSZoning Nominal
4 LotFrontage Continuous
5 LotArea Continuous
6 Street Nominal
7 Alley Nominal
8 LotShape Ordinal
9 LandContour Nominal
10 Utilities Ordinal
Now, let's get a better sense of the data. It would be useful to evaluate the variables of each type separately. First, let's see how many variables there are in each type group:
table(variables$type)
Continuous Discrete Nominal Ordinal
20 14 24 23
Starting with the nominal variables, let's look at their levels:
par(mfrow = c(5,5), mar = c(2,1,3,3)) # adjusting graphical parameters
sapply(names(train.raw[ , nominals]),
function(x) barplot(table(train.raw[ ,x]), main = x))
It seems that some features have low variation across levels (e.g. Street, LandContour, Condition1, and more). The concern is that these predictors may become zero-variance predictors when the data are split into cross-validation/bootstrap sub-samples, or that a few samples may have an undue influence on the model. In the following sections I will identify these near-zero-variance predictors and adjust them before modeling. Also, the variable MSSubClass should be converted into a factor.
train.raw$MSSubClass <- as.factor(train.raw$MSSubClass)
In order to maximize the information that can be derived from the ordinal variables, it is important to express the order of their levels. I will do this in the following sections. For now, let's look at the variation across the levels:
par(mfrow = c(5,5), mar = c(2,1,3,3)) # adjusting graphical parameters
sapply(names(train.raw[ , ordinals]),
function(x) barplot(table(train.raw[ ,x]), main = x))
It can be seen that in some variables the variation is relatively low; again, near-zero-variance predictors will need to be handled. It should be noted that the ordering of the levels is not necessarily reflected in the barplots.
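As a small illustration of what expressing the level order means in R (a sketch only; the systematic recoding is done later in the analysis), here is one quality variable encoded as an ordered factor:
# Sketch: encode ExterQual as an ordered factor from worst to best,
# following the codebook's quality scale.
exterqual.ord <- factor(train.raw$ExterQual,
                        levels = c("Po", "Fa", "TA", "Gd", "Ex"),
                        ordered = TRUE)
table(exterqual.ord)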
Next, let's look at the discrete variables:
par(mfrow = c(3,5), mar = c(2,1,3,3)) # adjusting graphical parameters
sapply(names(train.raw[ , discretes]),
function(x) barplot(table(train.raw[ ,x]), main = x))
The discrete predictors' histograms look fine.
For some continuous explanatory variables such as PoolArea or BsmtFinSF1, a value of 0 means that there is no measurement (e.g. no pool or no finished basement). In these cases the absence should be reported through a related categorical/ordinal variable so that the information is not lost (I'll handle this later). For now, it is more useful to look only at the distribution of the positive values.
par(mfrow = c(4,5), mar = c(2,1,3,3)) # adjusting graphical parameters
sapply(names(train.raw[ , continuous]),
function(x) plot(density(train.raw[train.raw[ ,x]!=0,x],
na.rm = T), main = x))
It can be seen that some of the continuous variables' distributions are highly skewed. I will further address this issue in the Cleaning and PreProcessing section.
Another important aspect of understanding the data is to understand the patterns of missing values. Here I provide a short analysis of the missing values in the dataset.
I'll start by calculating the share of NA values for each predictor:
vars.na.share <- sapply(train.raw, function(x) mean(is.na(x)))
variables$na.share <- vars.na.share
variables$na.exist <- vars.na.share > 0
par(mfrow = c(1,1))
na.table <- filter(variables, na.exist == T) %>% arrange(-na.share)
ggplot(data = na.table,
aes(x = factor(name, levels = na.table$name),
y = na.share)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Share of NA values", x = "", y = "")
For many predictors the share of NA values is too large to ignore, and it seems there is no alternative but to dig into the data.
It is easy to see that several variables share the same proportion of NA values. Let's take care of them first.
Starting with the garage-related variables, here are the first 6 NA rows:
garage <- grep("Garage", names(train.raw), value = T)
kable(head(train.raw[is.na(train.raw$GarageType),garage]),
row.names = F, format = "markdown", align = "c")
GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond |
---|---|---|---|---|---|---|
NA | NA | NA | 0 | 0 | NA | NA |
NA | NA | NA | 0 | 0 | NA | NA |
NA | NA | NA | 0 | 0 | NA | NA |
NA | NA | NA | 0 | 0 | NA | NA |
NA | NA | NA | 0 | 0 | NA | NA |
NA | NA | NA | 0 | 0 | NA | NA |
We see that GarageCars and GarageArea are both 0 where the NAs occur. Looking at the codebook, the common denominator is that in all these variables NA represents No Garage.
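As a quick sketch of this pattern, we can count the NA values in each garage-related variable; the factor variables go missing together, while GarageCars and GarageArea are simply 0:
# number of NA values per garage-related variable (check only)
sapply(train.raw[ , garage], function(x) sum(is.na(x)))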
Next, the basement-related variables:
basement <- grep("Bsmt", names(train.raw), value = T)
kable(head(train.raw[is.na(train.raw$BsmtFinType1), basement]),
row.names = F, format = "markdown", align = "c")
BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | BsmtFullBath | BsmtHalfBath |
---|---|---|---|---|---|---|---|---|---|---|
NA | NA | NA | NA | 0 | NA | 0 | 0 | 0 | 0 | 0 |
NA | NA | NA | NA | 0 | NA | 0 | 0 | 0 | 0 | 0 |
NA | NA | NA | NA | 0 | NA | 0 | 0 | 0 | 0 | 0 |
NA | NA | NA | NA | 0 | NA | 0 | 0 | 0 | 0 | 0 |
NA | NA | NA | NA | 0 | NA | 0 | 0 | 0 | 0 | 0 |
NA | NA | NA | NA | 0 | NA | 0 | 0 | 0 | 0 | 0 |
Here also, the codebook states that in all these variables NA values represent No Basement.
Last group: Masonry veneer related variables:
masvnr <- grep("MasVnr", names(train.raw), value = T)
kable(head(train.raw[is.na(train.raw$MasVnrType), masvnr]),
row.names = F, format = "markdown", align = "c")
MasVnrType | MasVnrArea |
---|---|
NA | NA |
NA | NA |
NA | NA |
NA | NA |
NA | NA |
NA | NA |
These two are the only masonry-veneer-related variables. There is no explanation in the codebook of what an NA value means, and here None already gets its own level. However, the number of NAs is negligible (less than 0.5%), so I'll assume that NA means "None" for MasVnrType and 0 for MasVnrArea.
The rest of the NAs and their description from the codebook by type:
- Ordinal / Nominal:
  - FireplaceQu - Fireplace quality
  - Fence - Fence quality
  - Alley - Type of alley access to property
  - MiscFeature - Miscellaneous feature not covered in other categories
  - PoolQC - Pool quality
In all of these variables NA represents None (e.g. no fence, no pool, etc.).
For the continuous variable LotFrontage (linear feet of street connected to the property) we don't have an explanation of the NA value, but it is reasonable to assume that NA means 0 (i.e. no street is connected). To check whether this makes sense, I look at the smallest values of LotFrontage:
quantile(train.raw$LotFrontage,
probs = seq(.01,.05,len = 5),
na.rm = T)
## 1% 2% 3% 4% 5%
## 21 24 24 32 34
There are reasonably small values that are not far from 0, so the assumption of NA = 0 makes sense.
In this section I am going to clean and arrange the data, and prepare it for prediction. It is important to perform exactly the same pre-processing actions on the testing set for future predictions.
# start by creating new data frames for training and testing
train.clean <- train.raw
test.clean <- test.raw
As I have shown, the pattern of missing values can be identified. Here I transform the missing data into meaningful information, according to the descriptions detailed in the Missing values section.
## training set
# factors
for (i in na.table[na.table$type %in% c("Ordinal","Nominal"),"name"]){
train.clean[ ,i] <- `levels<-`(addNA(train.clean[ ,i]),
c(levels(train.clean[ ,i]), "None"))
}
# integers
for (i in na.table[na.table$type %in% c("Continuous","Discrete"),"name"]){
train.clean[ ,i][is.na(train.clean[ ,i])] <- 0
}
## testing set
# factors
for (i in na.table[na.table$type %in% c("Ordinal","Nominal"),"name"]){
test.clean[ ,i] <- `levels<-`(addNA(test.clean[ ,i]),
c(levels(test.clean[ ,i]), "None"))
}
# integers
for (i in na.table[na.table$type %in% c("Continuous","Discrete"),"name"]){
test.clean[ ,i][is.na(test.clean[ ,i])] <- 0
}
It can be seen that some of the continuous variables' distributions (of the non-zero values) are highly skewed. To cope with this issue, taking a monotonic transformation of the continuous predictors can be a good idea. It should be noted that zero values represent no measurement, so variables which contain zeros should be treated carefully. As mentioned, the absence of a continuous measurement is already reported through another variable (e.g. PoolQC == "None" means no pool, hence PoolArea == 0). Thus, I would like to keep the zeros when the categorical variable value is "None". To do this I apply the one-parameter Box-Cox transformation only to the non-zero values.
The one-parameter Box-Cox transformation is a useful technique, used to stabilize variance, make the data more normally distributed, and improve the validity of measures of association such as the Pearson correlation between variables. It is defined as:
\[ y_{i}^{(\lambda)} = \begin{cases} \frac{y_{i}^{\lambda} -1}{\lambda}, & \text{if }\lambda \ne 0 \\ \ln{y_{i}}, & \text{if }\lambda =0 \end{cases} \]
where the power parameter \(\lambda\) is estimated using the profile likelihood function. (if you are not familiar with the Box-Cox transformation see this insightful page)
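As a small illustration (a sketch, not part of the pipeline), caret's BoxCoxTrans can estimate \(\lambda\) for a single variable from its positive values; GrLivArea is used here only as an example:
# Sketch: estimate the Box-Cox lambda for GrLivArea using the positive values only
bc.example <- BoxCoxTrans(train.raw$GrLivArea[train.raw$GrLivArea > 0])
bc.example$lambda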
cont.pred <- continuous[-20] # transform the predictors only (exclude SalePrice)
## training set
train.clean[ ,cont.pred][train.clean[ ,cont.pred] == 0] <- NA # excluding zeros
for (i in cont.pred) {
boxcox <- BoxCoxTrans(train.clean[ ,i], na.rm = T)
train.clean[ ,i] <- predict(boxcox, train.clean[ ,i])
}
train.clean[ ,cont.pred][is.na(train.clean[ ,cont.pred])] <- 0 # restore zeros after all columns are transformed
## testing set
test.clean[ ,cont.pred][test.clean[ ,cont.pred] == 0] <- NA # excluding zeros
for (i in cont.pred) {
boxcox <- BoxCoxTrans(test.clean[ ,i], na.rm = T)
test.clean[ ,i] <- predict(boxcox, test.clean[ ,i])
}
test.clean[ ,cont.pred][is.na(test.clean[ ,cont.pred])] <- 0 # restore zeros after all columns are transformed
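As a quick sanity check (a sketch using a simple moment-based skewness, so no extra packages are needed), we can compare the skewness of a variable before and after the transformation; again GrLivArea is used only as an example:
# simple moment-based skewness, ignoring NAs and zeros (check only)
skew <- function(x) {
  x <- x[!is.na(x) & x != 0]
  mean((x - mean(x))^3) / sd(x)^3
}
c(before = skew(train.raw$GrLivArea), after = skew(train.clean$GrLivArea))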
After cleaning the data and before moving to prediction, it is useful to exclude variables which have almost no variation, since they could be harmful to the prediction. I use the nearZeroVar function to identify these predictors.
nzvMat <- nearZeroVar(train.clean, saveMetrics = T)
nzvMat
## freqRatio percentUnique zeroVar nzv
## Id 1.000000 100.0000000 FALSE FALSE
## MSSubClass 1.792642 1.0273973 FALSE FALSE
## MSZoning 5.279817 0.3424658 FALSE FALSE
## LotFrontage 1.811189 7.6027397 FALSE FALSE
## LotArea 1.041667 73.4931507 FALSE FALSE
## Street 242.333333 0.1369863 FALSE TRUE
## Alley 27.380000 0.2054795 FALSE TRUE
## LotShape 1.911157 0.2739726 FALSE FALSE
## LandContour 20.809524 0.2739726 FALSE TRUE
## Utilities 1459.000000 0.1369863 FALSE TRUE
## LotConfig 4.000000 0.3424658 FALSE FALSE
## LandSlope 21.261538 0.2054795 FALSE TRUE
## Neighborhood 1.500000 1.7123288 FALSE FALSE
## Condition1 15.555556 0.6164384 FALSE FALSE
## Condition2 240.833333 0.5479452 FALSE TRUE
## BldgType 10.701754 0.3424658 FALSE FALSE
## HouseStyle 1.631461 0.5479452 FALSE FALSE
## OverallQual 1.061497 0.6849315 FALSE FALSE
## OverallCond 3.257937 0.6164384 FALSE FALSE
## YearBuilt 1.046875 7.6712329 FALSE FALSE
## YearRemodAdd 1.835052 4.1780822 FALSE FALSE
## RoofStyle 3.989510 0.4109589 FALSE FALSE
## RoofMatl 130.363636 0.5479452 FALSE TRUE
## Exterior1st 2.319820 1.0273973 FALSE FALSE
## Exterior2nd 2.355140 1.0958904 FALSE FALSE
## MasVnrType 1.959551 0.2739726 FALSE FALSE
## MasVnrArea 108.625000 22.3972603 FALSE FALSE
## ExterQual 1.856557 0.2739726 FALSE FALSE
## ExterCond 8.780822 0.3424658 FALSE FALSE
## Foundation 1.020505 0.4109589 FALSE FALSE
## BsmtQual 1.050162 0.3424658 FALSE FALSE
## BsmtCond 20.169231 0.3424658 FALSE TRUE
## BsmtExposure 4.312217 0.3424658 FALSE FALSE
## BsmtFinType1 1.028708 0.4794521 FALSE FALSE
## BsmtFinSF1 38.916667 43.6301370 FALSE FALSE
## BsmtFinType2 23.259259 0.4794521 FALSE TRUE
## BsmtFinSF2 258.600000 9.8630137 FALSE TRUE
## BsmtUnfSF 13.111111 53.4246575 FALSE FALSE
## TotalBsmtSF 1.057143 49.3835616 FALSE FALSE
## Heating 79.333333 0.4109589 FALSE TRUE
## HeatingQC 1.731308 0.3424658 FALSE FALSE
## CentralAir 14.368421 0.1369863 FALSE FALSE
## Electrical 14.191489 0.4109589 FALSE FALSE
## X1stFlrSF 1.562500 51.5753425 FALSE FALSE
## X2ndFlrSF 82.900000 28.5616438 FALSE FALSE
## LowQualFinSF 478.000000 1.6438356 FALSE TRUE
## GrLivArea 1.571429 58.9726027 FALSE FALSE
## BsmtFullBath 1.455782 0.2739726 FALSE FALSE
## BsmtHalfBath 17.225000 0.2054795 FALSE FALSE
## FullBath 1.181538 0.2739726 FALSE FALSE
## HalfBath 1.706542 0.2054795 FALSE FALSE
## BedroomAbvGr 2.245810 0.5479452 FALSE FALSE
## KitchenAbvGr 21.415385 0.2739726 FALSE TRUE
## KitchenQual 1.254266 0.2739726 FALSE FALSE
## TotRmsAbvGrd 1.221884 0.8219178 FALSE FALSE
## Functional 40.000000 0.4794521 FALSE TRUE
## Fireplaces 1.061538 0.2739726 FALSE FALSE
## FireplaceQu 1.815789 0.4109589 FALSE FALSE
## GarageType 2.248062 0.4794521 FALSE FALSE
## GarageYrBlt 1.246154 6.7123288 FALSE FALSE
## GarageFinish 1.433649 0.2739726 FALSE FALSE
## GarageCars 2.233062 0.3424658 FALSE FALSE
## GarageArea 1.653061 30.2054795 FALSE FALSE
## GarageQual 16.185185 0.4109589 FALSE FALSE
## GarageCond 16.370370 0.4109589 FALSE FALSE
## PavedDrive 14.888889 0.2054795 FALSE FALSE
## WoodDeckSF 20.026316 18.7671233 FALSE FALSE
## OpenPorchSF 22.620690 13.8356164 FALSE FALSE
## EnclosedPorch 83.466667 8.2191781 FALSE TRUE
## X3SsnPorch 478.666667 1.3698630 FALSE TRUE
## ScreenPorch 224.000000 5.2054795 FALSE TRUE
## PoolArea 1453.000000 0.5479452 FALSE TRUE
## PoolQC 484.333333 0.2739726 FALSE TRUE
## Fence 7.509554 0.3424658 FALSE FALSE
## MiscFeature 28.693878 0.3424658 FALSE TRUE
## MiscVal 128.000000 1.4383562 FALSE TRUE
## MoSold 1.081197 0.8219178 FALSE FALSE
## YrSold 1.027356 0.3424658 FALSE FALSE
## SaleType 10.385246 0.6164384 FALSE FALSE
## SaleCondition 9.584000 0.4109589 FALSE FALSE
## SalePrice 1.176471 45.4109589 FALSE FALSE
nzvNames <- row.names(nzvMat[nzvMat$nzv==T, ])
train.clean <- train.clean[ ,!nzvMat$nzv]
test.clean <- test.clean[ ,!nzvMat$nzv[-81]]
As a first step, I divide the training data into 80% training and 20% testing/validation (note that the ts data is not used to fit the model). From this stage on, any learning will use the training set only.
set.seed(1948)
intrain <- createDataPartition(train.clean$SalePrice, p = 0.8, list = F)
tr <- train.clean[intrain, ]
ts <- train.clean[-intrain, ]
Before moving to the construction of the model, it is useful to look for patterns in the data and get some hints about the correlations between the variables.
Let's start by plotting the correlation matrix of the continuous variables:
conttoplot <- continuous[!(continuous %in% nzvNames)]
corrplot(cor(tr[ ,conttoplot]), method = "ellipse", tl.col = "black")
As might be expected, the variables related to house size are positively correlated with the price, as well as with the other size-related variables.
Next, let's look at the relation between the price and the ordinal variables which describe the quality of elements in the house:
# define the order of quality levels:
q.order <- c("None", "Po", "Fa", "TA", "Gd", "Ex")
# find the relevant variables (Ex is one of the levels)
q.var <- sapply(tr, function(x) sum(levels(x)=="Ex")>0)
par(mfrow = c(3,3), mar = c(2,1,3,3))
for ( i in names(tr[ ,q.var])) {
p <- tr[ , "SalePrice"]
o <- factor(tr[ , i], levels = q.order)
boxplot(p~o, main = i)
}
We can see the monotonic effect of the quality levels, although some exceptions exist.
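To complement the boxplots, here is a sketch that quantifies the monotonic relation: the Spearman rank correlation between each quality variable (coded from None up to Ex) and the sale price.
# Spearman correlation between ordered quality codes and SalePrice (sketch)
sapply(names(tr[ , q.var]), function(x) {
  o <- as.numeric(factor(tr[ , x], levels = q.order))
  cor(o, tr$SalePrice, method = "spearman", use = "complete.obs")
})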
Let's look at the Overall Quality and Overall Condition variables in a fancier plot:
OA.df <- tr[ ,c("OverallQual","OverallCond","SalePrice")]
pl.q <- plot_ly(OA.df, y = ~SalePrice, x = ~OverallQual,
type = "box", name = "Overall Quality")
pl.c <- plot_ly(OA.df, y = ~SalePrice, x = ~OverallCond,
type = "box", name = "Overall Condition")
subplot(pl.q, pl.c)
It is interesting that while the correlation of Overall Quality with the price is monotonic, the correlation of Overall Condition is less clear. We can also see the large variation of house prices within some Condition levels (5 and 9). This may require treating ordinal variables as categoricals, or using other techniques.
Next, we should look for time trends and serial correlation that might need to be controlled.
time.df <- tr[ ,c("YrSold","MoSold", "SalePrice", "BedroomAbvGr")]
time.df$SalePrice <- time.df$SalePrice/100000
time.df$quarter <- cut(tr$MoSold,
breaks = c(0,3,6,9,12),
labels = c("Q1","Q2","Q3","Q4"))
time.df$YearQtr <- with(time.df, as.yearqtr(paste(YrSold,quarter)))
time.df$Bedrooms <- cut(time.df$BedroomAbvGr,
breaks = c(-1,2,3,Inf),
labels = c("0-2","3","4+"))
time.agg <- summarise(group_by(time.df, YearQtr, Bedrooms),
mean = mean(SalePrice),
sem = sd(SalePrice)/sqrt(n()),
sales = n())
ggplotly(ggplot(time.agg, aes(x = YearQtr, colour = Bedrooms))+
geom_ribbon(aes(ymin = mean - sem,
ymax = mean + sem), fill = "grey70", alpha = 0.2) +
geom_line(aes(y = mean))+
facet_grid(Bedrooms~.) +
labs(x = "", y = "") +
coord_cartesian(ylim = c(1,3.5))) %>%
layout(title = "Mean Sale Price ($100,000)")
It seems that the seasonality effects are not very clear.
Finally, let's have a look at the spatial structure. Thanks to JulienSiems, who posted the GPS coordinates of the neighborhoods (based on the school location in each district) in his script, we can look for spatial effects on house prices. For instance, we could use the distances between neighborhoods and key areas such as employment and shopping centers, and other effects that cannot be captured by neighborhood dummies (see the sketch after the map below).
coordinates <- data.frame(Neighborhood = levels(tr[ ,"Neighborhood"]),
lat = c(42.062806, 42.009408, 42.052500, 42.033590, 42.025425,
42.021051, 42.025949, 42.022800, 42.027885, 42.019208,
41.991866, 42.031307, 42.042966, 42.050307, 42.050207,
42.060356, 42.051321, 42.028863, 42.033611, 42.035540,
42.052191, 42.060752, 42.017578, 41.998132, 42.040106),
lng = c(-93.639963, -93.645543, -93.628821, -93.627552, -93.675741,
-93.685643, -93.620215, -93.663040, -93.615692, -93.623401,
-93.602441, -93.626967, -93.613556, -93.656045, -93.625827,
-93.657107, -93.633798, -93.615497, -93.669348, -93.685131,
-93.643479, -93.628955, -93.651283, -93.648335, -93.657032))
We can look at the geographical distribution of the house prices: circle size represents the number of sales in the neighborhood, and darker colors represent a higher average price.
nbh.price <- summarise(group_by(tr, Neighborhood),
sales = n(),
mean.price = mean(SalePrice))
coordinates <- merge(x = coordinates,
y = nbh.price,
by = "Neighborhood",
all.x = T)
pal <- colorNumeric(palette = "Reds",
domain = coordinates$mean.price) # color scale over the neighborhood mean price
Ames <- leaflet(coordinates) %>%
addTiles() %>%
addCircles(lat = ~lat,
lng = ~lng, weight = 10,
radius = ~sales*8,
color = ~pal(coordinates$mean.price)) %>%
addMarkers(popup = ~paste(Neighborhood,", Mean:",
round(mean.price),sep = ""))
Ames
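Building on these coordinates, here is a sketch of one possible spatial feature mentioned above: a rough planar distance of each neighborhood from a reference point. The Iowa State University campus location used here (approximately lat 42.027, lng -93.646) is an assumption for illustration only.
# Sketch: approximate distance (in km) from each neighborhood to an assumed
# reference point (rough ISU campus coordinates; planar approximation).
ref.lat <- 42.027; ref.lng <- -93.646
coordinates$dist.ref <- sqrt(((coordinates$lat - ref.lat) * 111)^2 +
                             ((coordinates$lng - ref.lng) * 111 * cos(ref.lat * pi / 180))^2)
head(coordinates[ , c("Neighborhood", "dist.ref")])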
In this section I'll present some machine learning techniques to predict the price of a house.
The models are evaluated according to the competition rules:
\[RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \ln{y_i} - \ln{\hat{y}_i} \right)^2 }\]
where \(y_i\) is the observed house sale price and \(\hat{y_i}\) is the predicted value.
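For convenience, this metric can be written as a small helper function (a sketch; y and yhat are sale prices on the original scale):
# RMSE on the log scale, i.e. the competition metric (sketch)
rmse.log <- function(y, yhat) sqrt(mean((log(y) - log(yhat))^2))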
Here I present only three embarrassingly simple linear models. Later, I will post more complex models involving more sophisticated machine learning techniques. However, in many cases the best benchmark is linear least squares.
# predicting log(price)
tr$logp <- log(tr$SalePrice)
ts$logp <- log(ts$SalePrice)
# drop Id (column 1) and SalePrice (column 60); logp is the outcome
lm1 <- train(logp~., data = tr[,-c(1,60)], method = "lm")
ts$pr1 <- predict(lm1, newdata = ts)
qplot(y = ts$pr1, x =ts$logp) +
labs(x = "Observed log(Price)", y = "Predicted log(Price)") +
geom_abline(intercept = 0, slope = 1)
We can see that the predicted values track the observed log prices fairly closely around the 45-degree line.
The RMSE:
postResample(ts$pr1, ts$logp)
## RMSE Rsquared
## 0.1238644 0.9040120
Now, let's use the ordinal variables as discrete numeric predictors, including polynomial effects (only for the variables which explicitly express a quality or condition level):
tonum <- function(df) {
for (var in names(df[ ,q.var])) {
df[ ,var] <- as.numeric(
recode_factor(df[ ,var],
None="0",Po="1",Fa="2",TA="3",Gd="4",Ex="5"))
df[ ,paste(var,"2",sep = "")] <- df[ ,var]^2
df[ ,paste(var,"3",sep = "")] <- df[ ,var]^3
}
df
}
tr.2<-tonum(tr)
ts.2<-tonum(ts)
lm2 <- train(logp~.,
data = tr.2[ ,-c(1,60)], method = "lm")
ts.2$pr2 <- predict(lm2, newdata = ts.2)
postResample(ts.2$pr2, ts.2$logp)
## RMSE Rsquared
## 0.1242235 0.9032776
It seems that using the ordinal variables as categoricals (i.e. a dummy for each level) is slightly more useful in terms of RMSE; however, the differences are not significant.
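One way to look beyond the single hold-out split is to compare the two fits on the bootstrap resamples that caret's train collects by default (a sketch):
# compare the resampling performance of the two linear models (sketch)
summary(resamples(list(lm.dummies = lm1, lm.numeric = lm2)))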
For the third linear model I add some variables.
First, I create the house age and the garage age at the time of sale, instead of YrSold, YearBuilt and GarageYrBlt. Also, let's create the time since the last renovation, instead of YearRemodAdd.
tr$houseage <- tr$YrSold - tr$YearBuilt
ts$houseage <- ts$YrSold - ts$YearBuilt
tr$houseage2 <- tr$houseage^2
ts$houseage2 <- ts$houseage^2
tr$garageage <- (tr$YrSold - tr$GarageYrBlt) * ifelse(tr$GarageYrBlt==0,0,1)
ts$garageage <- (ts$YrSold - ts$GarageYrBlt) * ifelse(ts$GarageYrBlt==0,0,1)
tr$timeremode <- tr$YrSold - tr$YearRemodAdd
ts$timeremode <- ts$YrSold - ts$YearRemodAdd
Now, let's have a look at the relation between log(SalePrice) and the size variables, to check whether polynomial effects are needed.
size.var <- grep("SF|Area", names(tr), value = T)
par(mfrow = c(3,4), mar = c(3,3,2,2))
for (v in size.var) {
d <- data.frame(var = tr[,v], p = log(tr$SalePrice))
d <- filter(d, var != 0) %>% arrange(-var)
plot(d, main = v)
model <- lm(d$p ~ d$var + I(d$var^2))
pred.line <- predict(model, d)
lines(x = d$var, y = pred.line, col = "red", lwd = 2)
}
In order to avoid overfitting I use only linear effects.
Also, we make sure MSSubClass is treated as a factor in these copies:
tr.3 <- tr
ts.3 <- ts
tr.3$MSSubClass <- as.factor(tr.3$MSSubClass)
ts.3$MSSubClass <- as.factor(ts.3$MSSubClass)
# drop the replaced year variables (and SalePrice from the training copy)
tr.3 <- select(tr.3, -c(Id,YrSold,YearBuilt,GarageYrBlt,YearRemodAdd,
SalePrice))
ts.3 <- select(ts.3, -c(Id,YrSold,YearBuilt,GarageYrBlt,YearRemodAdd))
The third linear model:
lm3 <- train(logp~., data = tr.3, method = "lm")
ts.3$pr3 <- predict(lm3, newdata = ts.3)
postResample(ts.3$pr3, ts.3$logp)
## RMSE Rsquared
## 0.1238903 0.9039452
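Finally, a minimal sketch of how a submission file could be produced from lm3. Here test.3 is a hypothetical data frame: test.clean after repeating exactly the same feature engineering as tr.3 (numeric quality codes, houseage, garageage, timeremode, MSSubClass as a factor, and the same dropped columns).
# test.3 is hypothetical: test.clean with the same feature engineering as tr.3
submission <- data.frame(Id = test.raw$Id,
                         SalePrice = exp(predict(lm3, newdata = test.3)))
write.csv(submission, "submission.csv", row.names = FALSE)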
To be continued.
Ron Sarafian.