Project abstract

  • Lasso regressions performs best with a cross validation RMSE-score of 0.1121. Given the fact that there is a lot of multicolinearity among the variables, this was expected. Lasso does not select a substantial number of the available variables in its model, as it is supposed to do.
  • I have also tried Ridge, Elastic Net , and Random Forest models, but since lasso gives the best results of those 4 models I decided to only keeping the lasso model. Also Random Forest took so long to compute even with a small number of trees.
  • The XGBoost model also performs very well with a cross validation RMSE of 0.1162.
  • As those two algorithms are very different, averaging predictions is likely to improve the predictions. As the Lasso cross validated RMSE is better than XGBoost’s CV score, I decided to weight the Lasso results double.

Introduction

Kaggle describes this competition as follows:

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Loading and Exploring Data

Loading libraries required and reading the data into R

Loading R packages used besides base R.

library(knitr)
library(ggplot2)
library(plyr)
library(dplyr)
library(corrplot)
library(caret)
library(matrixcalc)
library(gridExtra)
library(scales)
library(matlib)
library(Rmisc)
library(Hmisc)
library(GGally)
library(ggrepel)
library(randomForest)
library(psych)
library(xgboost)
library(skimr)

Below, I am reading the csv’s as dataframes into R fetching the dataset from my GitHub repo.

train <- read.csv("https://raw.githubusercontent.com/salma71/houses_prices_kaggle/master/datasets/train.csv", stringsAsFactors = F)
head(train)
test <- read.csv("https://raw.githubusercontent.com/salma71/houses_prices_kaggle/master/datasets/test.csv", stringsAsFactors = F)
head(test)

Data size and structure

We need to obtain some summary statistics to understand the dataset. The train dataset consists of character and integer variables. Most of the character variables are actually (ordinal) factors, but I chose to read them into R as character strings as most of them require cleaning and/or feature engineering first.

In total, there are 81 columns/variables, of which the last one is the response variable (SalePrice). Below, I am displaying the variables. All of them are discussed in more detail afterwords.

dim(train)
## [1] 1460   81
dim(test)
## [1] 1459   80
library(knitr)
library(magrittr)
data.frame(variable = names(train),
           classe = sapply(train, typeof),
           first_values = sapply(train, function(x) paste0(head(x),  collapse = ", ")),
           row.names = NULL) %>% 
kable()
variable classe first_values
Id integer 1, 2, 3, 4, 5, 6
MSSubClass integer 60, 20, 60, 70, 60, 50
MSZoning character RL, RL, RL, RL, RL, RL
LotFrontage integer 65, 80, 68, 60, 84, 85
LotArea integer 8450, 9600, 11250, 9550, 14260, 14115
Street character Pave, Pave, Pave, Pave, Pave, Pave
Alley character NA, NA, NA, NA, NA, NA
LotShape character Reg, Reg, IR1, IR1, IR1, IR1
LandContour character Lvl, Lvl, Lvl, Lvl, Lvl, Lvl
Utilities character AllPub, AllPub, AllPub, AllPub, AllPub, AllPub
LotConfig character Inside, FR2, Inside, Corner, FR2, Inside
LandSlope character Gtl, Gtl, Gtl, Gtl, Gtl, Gtl
Neighborhood character CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mitchel
Condition1 character Norm, Feedr, Norm, Norm, Norm, Norm
Condition2 character Norm, Norm, Norm, Norm, Norm, Norm
BldgType character 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam
HouseStyle character 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin
OverallQual integer 7, 6, 7, 7, 8, 5
OverallCond integer 5, 8, 5, 5, 5, 5
YearBuilt integer 2003, 1976, 2001, 1915, 2000, 1993
YearRemodAdd integer 2003, 1976, 2002, 1970, 2000, 1995
RoofStyle character Gable, Gable, Gable, Gable, Gable, Gable
RoofMatl character CompShg, CompShg, CompShg, CompShg, CompShg, CompShg
Exterior1st character VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, VinylSd
Exterior2nd character VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, VinylSd
MasVnrType character BrkFace, None, BrkFace, None, BrkFace, None
MasVnrArea integer 196, 0, 162, 0, 350, 0
ExterQual character Gd, TA, Gd, TA, Gd, TA
ExterCond character TA, TA, TA, TA, TA, TA
Foundation character PConc, CBlock, PConc, BrkTil, PConc, Wood
BsmtQual character Gd, Gd, Gd, TA, Gd, Gd
BsmtCond character TA, TA, TA, Gd, TA, TA
BsmtExposure character No, Gd, Mn, No, Av, No
BsmtFinType1 character GLQ, ALQ, GLQ, ALQ, GLQ, GLQ
BsmtFinSF1 integer 706, 978, 486, 216, 655, 732
BsmtFinType2 character Unf, Unf, Unf, Unf, Unf, Unf
BsmtFinSF2 integer 0, 0, 0, 0, 0, 0
BsmtUnfSF integer 150, 284, 434, 540, 490, 64
TotalBsmtSF integer 856, 1262, 920, 756, 1145, 796
Heating character GasA, GasA, GasA, GasA, GasA, GasA
HeatingQC character Ex, Ex, Ex, Gd, Ex, Ex
CentralAir character Y, Y, Y, Y, Y, Y
Electrical character SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr
X1stFlrSF integer 856, 1262, 920, 961, 1145, 796
X2ndFlrSF integer 854, 0, 866, 756, 1053, 566
LowQualFinSF integer 0, 0, 0, 0, 0, 0
GrLivArea integer 1710, 1262, 1786, 1717, 2198, 1362
BsmtFullBath integer 1, 0, 1, 1, 1, 1
BsmtHalfBath integer 0, 1, 0, 0, 0, 0
FullBath integer 2, 2, 2, 1, 2, 1
HalfBath integer 1, 0, 1, 0, 1, 1
BedroomAbvGr integer 3, 3, 3, 3, 4, 1
KitchenAbvGr integer 1, 1, 1, 1, 1, 1
KitchenQual character Gd, TA, Gd, Gd, Gd, TA
TotRmsAbvGrd integer 8, 6, 6, 7, 9, 5
Functional character Typ, Typ, Typ, Typ, Typ, Typ
Fireplaces integer 0, 1, 1, 1, 1, 0
FireplaceQu character NA, TA, TA, Gd, TA, NA
GarageType character Attchd, Attchd, Attchd, Detchd, Attchd, Attchd
GarageYrBlt integer 2003, 1976, 2001, 1998, 2000, 1993
GarageFinish character RFn, RFn, RFn, Unf, RFn, Unf
GarageCars integer 2, 2, 2, 3, 3, 2
GarageArea integer 548, 460, 608, 642, 836, 480
GarageQual character TA, TA, TA, TA, TA, TA
GarageCond character TA, TA, TA, TA, TA, TA
PavedDrive character Y, Y, Y, Y, Y, Y
WoodDeckSF integer 0, 298, 0, 0, 192, 40
OpenPorchSF integer 61, 0, 42, 35, 84, 30
EnclosedPorch integer 0, 0, 0, 272, 0, 0
X3SsnPorch integer 0, 0, 0, 0, 0, 320
ScreenPorch integer 0, 0, 0, 0, 0, 0
PoolArea integer 0, 0, 0, 0, 0, 0
PoolQC character NA, NA, NA, NA, NA, NA
Fence character NA, NA, NA, NA, NA, MnPrv
MiscFeature character NA, NA, NA, NA, NA, Shed
MiscVal integer 0, 0, 0, 0, 0, 700
MoSold integer 2, 5, 9, 2, 12, 10
YrSold integer 2008, 2007, 2008, 2006, 2008, 2009
SaleType character WD, WD, WD, WD, WD, WD
SaleCondition character Normal, Normal, Normal, Abnorml, Normal, Normal
SalePrice integer 208500, 181500, 223500, 140000, 250000, 143000
# str(train) 
#Getting rid of the IDs but keeping the test IDs in a vector. These are needed to compose the submission file
test_labels <- test$Id
train$Id <- NULL
test$Id <- NULL

Identifying the dependant variables and combine test and train into one dataset

test$SalePrice <- NA
house <- rbind(train, test)
dim(house)
## [1] 2919   80
head(house)

Without the Id’s, the dataframe consists of 79 predictors and our response variable SalePrice.

Getting summary statistics

sum(sapply(house[,1:80], typeof) == "character")
## [1] 43
#Count the number of columns that consists of numerical data

sum(sapply(house[,1:80], typeof) == "integer")
## [1] 37

We have 43 columns as character and 37 columns are numerical. Let’s take a look at some descriptive statistics

# Obtain summary statistics
skim(train[,sapply(house[,1:80], typeof) == "integer"])
Data summary
Name train[, sapply(house[, 1:…
Number of rows 1460
Number of columns 37
_______________________
Column type frequency:
numeric 37
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
MSSubClass 0 1.00 56.90 42.30 20 20.00 50.0 70.00 190 ▇▅▂▁▁
LotFrontage 259 0.82 70.05 24.28 21 59.00 69.0 80.00 313 ▇▃▁▁▁
LotArea 0 1.00 10516.83 9981.26 1300 7553.50 9478.5 11601.50 215245 ▇▁▁▁▁
OverallQual 0 1.00 6.10 1.38 1 5.00 6.0 7.00 10 ▁▂▇▅▁
OverallCond 0 1.00 5.58 1.11 1 5.00 5.0 6.00 9 ▁▁▇▅▁
YearBuilt 0 1.00 1971.27 30.20 1872 1954.00 1973.0 2000.00 2010 ▁▂▃▆▇
YearRemodAdd 0 1.00 1984.87 20.65 1950 1967.00 1994.0 2004.00 2010 ▅▂▂▃▇
MasVnrArea 8 0.99 103.69 181.07 0 0.00 0.0 166.00 1600 ▇▁▁▁▁
BsmtFinSF1 0 1.00 443.64 456.10 0 0.00 383.5 712.25 5644 ▇▁▁▁▁
BsmtFinSF2 0 1.00 46.55 161.32 0 0.00 0.0 0.00 1474 ▇▁▁▁▁
BsmtUnfSF 0 1.00 567.24 441.87 0 223.00 477.5 808.00 2336 ▇▅▂▁▁
TotalBsmtSF 0 1.00 1057.43 438.71 0 795.75 991.5 1298.25 6110 ▇▃▁▁▁
X1stFlrSF 0 1.00 1162.63 386.59 334 882.00 1087.0 1391.25 4692 ▇▅▁▁▁
X2ndFlrSF 0 1.00 346.99 436.53 0 0.00 0.0 728.00 2065 ▇▃▂▁▁
LowQualFinSF 0 1.00 5.84 48.62 0 0.00 0.0 0.00 572 ▇▁▁▁▁
GrLivArea 0 1.00 1515.46 525.48 334 1129.50 1464.0 1776.75 5642 ▇▇▁▁▁
BsmtFullBath 0 1.00 0.43 0.52 0 0.00 0.0 1.00 3 ▇▆▁▁▁
BsmtHalfBath 0 1.00 0.06 0.24 0 0.00 0.0 0.00 2 ▇▁▁▁▁
FullBath 0 1.00 1.57 0.55 0 1.00 2.0 2.00 3 ▁▇▁▇▁
HalfBath 0 1.00 0.38 0.50 0 0.00 0.0 1.00 2 ▇▁▅▁▁
BedroomAbvGr 0 1.00 2.87 0.82 0 2.00 3.0 3.00 8 ▁▇▂▁▁
KitchenAbvGr 0 1.00 1.05 0.22 0 1.00 1.0 1.00 3 ▁▇▁▁▁
TotRmsAbvGrd 0 1.00 6.52 1.63 2 5.00 6.0 7.00 14 ▂▇▇▁▁
Fireplaces 0 1.00 0.61 0.64 0 0.00 1.0 1.00 3 ▇▇▁▁▁
GarageYrBlt 81 0.94 1978.51 24.69 1900 1961.00 1980.0 2002.00 2010 ▁▁▅▅▇
GarageCars 0 1.00 1.77 0.75 0 1.00 2.0 2.00 4 ▁▃▇▂▁
GarageArea 0 1.00 472.98 213.80 0 334.50 480.0 576.00 1418 ▂▇▃▁▁
WoodDeckSF 0 1.00 94.24 125.34 0 0.00 0.0 168.00 857 ▇▂▁▁▁
OpenPorchSF 0 1.00 46.66 66.26 0 0.00 25.0 68.00 547 ▇▁▁▁▁
EnclosedPorch 0 1.00 21.95 61.12 0 0.00 0.0 0.00 552 ▇▁▁▁▁
X3SsnPorch 0 1.00 3.41 29.32 0 0.00 0.0 0.00 508 ▇▁▁▁▁
ScreenPorch 0 1.00 15.06 55.76 0 0.00 0.0 0.00 480 ▇▁▁▁▁
PoolArea 0 1.00 2.76 40.18 0 0.00 0.0 0.00 738 ▇▁▁▁▁
MiscVal 0 1.00 43.49 496.12 0 0.00 0.0 0.00 15500 ▇▁▁▁▁
MoSold 0 1.00 6.32 2.70 1 5.00 6.0 8.00 12 ▃▆▇▃▃
YrSold 0 1.00 2007.82 1.33 2006 2007.00 2008.0 2009.00 2010 ▇▇▇▇▅
SalePrice 0 1.00 180921.20 79442.50 34900 129975.00 163000.0 214000.00 755000 ▇▅▁▁▁

As we can see from the summary statistics, there are some outliers needs to be proceessed. It would be more clear to identify which dependant variables that needs processing after the visualization step.

First of all, I would like to see which variables contain missing values.

# The percentage of data missing in house

cat(sprintf("Percentage of missing values in the overall house dataset: %s%s\n", round(length(which(is.na(house) == TRUE)) * 100 / (nrow(house) * ncol(house)), 2), "%"))
## Percentage of missing values in the overall house dataset: 6.61%
NAcol <- which(colSums(is.na(house)) > 0)
sort(colSums(sapply(house[NAcol], is.na)), decreasing = TRUE)
##       PoolQC  MiscFeature        Alley        Fence    SalePrice 
##         2909         2814         2721         2348         1459 
##  FireplaceQu  LotFrontage  GarageYrBlt GarageFinish   GarageQual 
##         1420          486          159          159          159 
##   GarageCond   GarageType     BsmtCond BsmtExposure     BsmtQual 
##          159          157           82           82           81 
## BsmtFinType2 BsmtFinType1   MasVnrType   MasVnrArea     MSZoning 
##           80           79           24           23            4 
##    Utilities BsmtFullBath BsmtHalfBath   Functional  Exterior1st 
##            2            2            2            2            1 
##  Exterior2nd   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF 
##            1            1            1            1            1 
##   Electrical  KitchenQual   GarageCars   GarageArea     SaleType 
##            1            1            1            1            1
cat('There are', length(NAcol), 'columns with missing values')
## There are 35 columns with missing values

Visualization step - Detailed EDA

SalePrice

I chose to start with exploring some of the most important variables, the response variable; SalePrice

As you can see, the sale prices are right skewed. This was expected as few people can afford very expensive houses. However, it forms a Guassian distribution which will help me later decide which ML algorithm to use.

ggplot(data=house[!is.na(house$SalePrice),], aes(x=SalePrice)) +
        geom_histogram(col = 'white' ,fill="brown", binwidth = 15000) +
        scale_x_continuous(breaks= seq(0, 800000, by=100000), labels = comma)

#Normalize distribution
ggplot(train, aes(x=log(SalePrice+1))) + 
  geom_histogram(col = 'white', fill = 'brown') + 
  theme_light()

summary(house$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   34900  129975  163000  180921  214000  755000    1459

The most important numeric predictors

The character variables need some work before I can use them. To get a feel for the dataset, I decided to first see which numeric variables have a high correlation with the SalePrice.

Correlations with SalePrice

Altogether, there are 10 numeric variables with a correlation of at least 0.5 with SalePrice. All those correlations are positive.

numericVars <- which(sapply(house, is.numeric)) #index vector numeric variables
numericVarNames <- names(numericVars) #saving names vector for use later on
cat('There are', length(numericVars), 'numeric variables')
## There are 37 numeric variables
house_numVar <- house[, numericVars]
cor_numVar <- cor(house_numVar, use="pairwise.complete.obs") #correlations of house numeric variables

#sort on decreasing correlations with SalePrice
cor_sorted <- as.matrix(sort(cor_numVar[,'SalePrice'], decreasing = TRUE))
 #select only high corelations
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
cor_numVar <- cor_numVar[CorHigh, CorHigh]

corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt")

It becomes clear the multicollinearity is an issue. For example: the correlation between GarageCars and GarageArea is very high (0.89), and both have similar (high) correlations with SalePrice. The other 6 six variables with a correlation higher than 0.5 with SalePrice are:

-TotalBsmtSF: Total square feet of basement area -1stFlrSF: First Floor square feet -FullBath: Full bathrooms above grade -TotRmsAbvGrd: Total rooms above grade (does not include bathrooms) -YearBuilt: Original construction date -YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

signif_col <- house %>%
  select(`YearBuilt`, `FullBath`, `TotalBsmtSF`, `X1stFlrSF`, `TotRmsAbvGrd`, `YearRemodAdd`, `SalePrice`)
# options(repr.plot.width = 20, repr.plot.height = 20)
ggpairs(signif_col)

Overall Quality

Overall Quality has the highest correlation with SalePrice among the numeric variables (0.79). It rates the overall material and finish of the house on a scale from 1 (very poor) to 10 (very excellent).

ggplot(data=house[!is.na(house$SalePrice),], aes(x=factor(OverallQual), y=SalePrice))+
        geom_boxplot(col='blue') + labs(x='Overall Quality') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)

The positive correlation is certainly there indeed, and seems to be a slightly upward curve. Regarding outliers, I do not see any extreme values. If there is a candidate to take out as an outlier later on, it seems to be the expensive house with grade 4.

Above Grade (Ground) Living Area (square feet)

The numeric variable with the second highest correlation with SalesPrice is the Above Grade Living Area. This make a lot of sense; big houses are generally more expensive.

ggplot(data=house[!is.na(house$SalePrice),], aes(x=GrLivArea, y=SalePrice))+
        geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_text_repel(aes(label = ifelse(house$GrLivArea[!is.na(house$SalePrice)]>4500, rownames(house), '')))

Especially the two houses with really big living areas and low SalePrices seem outliers (houses 524 and 1299). I will not take them out yet, as taking outliers can be dangerous. For instance, a low score on the Overall Quality could explain a low price. However, as you can see below, these two houses actually also score maximum points on Overall Quality. Therefore, I will keep houses 1299 and 524 in mind as prime candidates to take out as outliers.

house[c(524, 1299), c('SalePrice', 'GrLivArea', 'OverallQual')]

Exam questions

Correlation bettween three vars

Derive a correlation matrix for any three quantitative variables in the dataset.

cor_se <- house_numVar %>% 
  dplyr::select(SalePrice, OverallQual, GrLivArea)

ggcorr((cor_se), palette = "RdBu", label = TRUE)

Hypotheses testing

  • H0: There is no correlation between the tree variables - the correlation is zero.
  • H1: There is a correlation between the tree variables - the correlation is not zero.
  • Significance level is 0.05.
(ov_vs_sal <- cor.test(cor_se$OverallQual, cor_se$SalePrice, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  cor_se$OverallQual and cor_se$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7709644 0.8094376
## sample estimates:
##       cor 
## 0.7909816
(gr_vs_ov <- cor.test(cor_se$GrLivArea, cor_se$OverallQual, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  cor_se$GrLivArea and cor_se$OverallQual
## t = 37.97, df = 2917, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5503293 0.5989094
## sample estimates:
##       cor 
## 0.5751261
(gr_vs_sal <- cor.test(cor_se$GrLivArea, cor_se$SalePrice, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  cor_se$GrLivArea and cor_se$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6821200 0.7332695
## sample estimates:
##       cor 
## 0.7086245
# (tot_vs_sal <- cor.test(cor_se$TotRmsAbvGrd, cor_se$SalePrice, method = 'pearson'))

The p-value in the three varaibles is below significance level of 0.05 - around 2.2e-16. This implies that the correlation falls within the c(0.8, 0.2) confidence interval and we failed to accept the null hypothese which assumes that the correlation is zero.

The 80% confidence interval

library(psychometric)
(CIr(r = ov_vs_sal$estimate, n = nrow(cor_se), level = 0.8))
## [1] 0.7819292 0.7997005
(CIr(r = gr_vs_sal$estimate, n = nrow(cor_se), level = 0.8))
## [1] 0.6966094 0.7202421
(CIr(r = gr_vs_ov$estimate, n = nrow(cor_se), level = 0.8))
## [1] 0.5590270 0.5907918
# (CIr(r = tot_vs_sal$estimate, n = nrow(corr), level = 0.8))

Familywise error

To calculate the familywize error (type-I error) \(FWE <= 1-(1-alpha)^c\) where:
1. \(alpha\) is the alpha level for an individual test - in this case it is 0.05. 2. \(c\) is the number of comparisons, which is 3.

perc <- (1 - (1 - 0.05)^3)
cat(sprintf("The probability of a type-I error is : %s%s\n", perc, " "))
## The probability of a type-I error is : 0.142625

Yes, I would be worry about that probability, which is very high considering only three tests were performed.

Linear Algebra and Correlation

Invert correlation matrix

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

te <- rcorr(as.matrix(cor_se))
te$r
##             SalePrice OverallQual GrLivArea
## SalePrice   1.0000000   0.7909816 0.7086245
## OverallQual 0.7909816   1.0000000 0.5751261
## GrLivArea   0.7086245   0.5751261 1.0000000

The determinant of the correlation matrix not zero, the matrix has an inverse

det(te$r) # get determinant
## [1] 0.186156
(precision_mat <- round(solve(te$r), 2)) # this is the precision matrix get the inverse matrix
##             SalePrice OverallQual GrLivArea
## SalePrice        3.59       -2.06     -1.36
## OverallQual     -2.06        2.67     -0.08
## GrLivArea       -1.36       -0.08      2.01
(round(te$r%*%precision_mat, 2))
##             SalePrice OverallQual GrLivArea
## SalePrice           1        0.00         0
## OverallQual         0        0.99         0
## GrLivArea           0        0.00         1
(round(precision_mat%*%te$r, 2))
##             SalePrice OverallQual GrLivArea
## SalePrice           1        0.00         0
## OverallQual         0        0.99         0
## GrLivArea           0        0.00         1

LU decomposition

# LU decomposition 
(cor_lu <- lu.decomposition(te$r))
## $L
##           [,1]       [,2] [,3]
## [1,] 1.0000000 0.00000000    0
## [2,] 0.7909816 1.00000000    0
## [3,] 0.7086245 0.03904715    1
## 
## $U
##      [,1]      [,2]       [,3]
## [1,]    1 0.7909816 0.70862448
## [2,]    0 0.3743481 0.01461723
## [3,]    0 0.0000000 0.49728059
# multiply L and U 
(res_2 <- round(cor_lu$L %*% cor_lu$U, 2) )
##      [,1] [,2] [,3]
## [1,] 1.00 0.79 0.71
## [2,] 0.79 1.00 0.58
## [3,] 0.71 0.58 1.00

Confirm the results

round(cor_lu$L %*% cor_lu$U ,2) == round(te$r,2)
##             SalePrice OverallQual GrLivArea
## SalePrice        TRUE        TRUE      TRUE
## OverallQual      TRUE        TRUE      TRUE
## GrLivArea        TRUE        TRUE      TRUE

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Simulation

fit_data <- house$GrLivArea
fit_data <- fit_data[complete.cases(fit_data)]
hist(fit_data)

summary(fit_data)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1444    1501    1744    5642
library(MASS)
fit <- fitdistr(fit_data, densfun = "exponential")
# the optimal value
lam <- fit$estimate
sample <- rexp(1000, rate = lam) # create a sample of 1000
hist(sample, freq = F, main = 'Simulated GrLivArea varaible using calculated lamda',xlim = c(1,quantile(sample, 0.99)), xlab = "GrLivArea") # plot the results
curve(dexp(x, rate = lam), col = 'red', add = T)

hist(house$GrLivArea, freq = F,  main = 'GrLivArea exp vs. original', xlim = c(1, quantile(house$GrLivArea, 0.99)))
curve(dexp(x, rate = lam), col = 'red', add = T)

5th and 95th percentiles using(CDF)

# simulated
(cdf_lo <- log(1 - .05)/-lam)
##     rate 
## 76.97892
(cdf_up <- log(1 - .95)/-lam)
##     rate 
## 4495.875
(act_lo <- quantile(house$GrLivArea, 0.05))
##  5% 
## 861
(act_up <- quantile(house$GrLivArea, 0.95))
##    95% 
## 2464.2

95% confidence interval from the empirical data, assuming normality

CI(house$GrLivArea, 0.95)
##    upper     mean    lower 
## 1519.125 1500.760 1482.394

With 95% confidence, the mean of GrLivArea is between 1482.394 and 1519.125. The exponential distribution would not be a good fit in this case. We see that the center of the exponential distribution is shifted left as compared the empirical data. Additionally we see more spread in the exponential distribution.

Modeling

Missing data, label encoding, and factorizing variables

First of all, I would like to see which variables contain missing values.

NAcol <- which(colSums(is.na(house)) > 0)
sort(colSums(sapply(house[NAcol], is.na)), decreasing = TRUE)
##       PoolQC  MiscFeature        Alley        Fence    SalePrice 
##         2909         2814         2721         2348         1459 
##  FireplaceQu  LotFrontage  GarageYrBlt GarageFinish   GarageQual 
##         1420          486          159          159          159 
##   GarageCond   GarageType     BsmtCond BsmtExposure     BsmtQual 
##          159          157           82           82           81 
## BsmtFinType2 BsmtFinType1   MasVnrType   MasVnrArea     MSZoning 
##           80           79           24           23            4 
##    Utilities BsmtFullBath BsmtHalfBath   Functional  Exterior1st 
##            2            2            2            2            1 
##  Exterior2nd   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF 
##            1            1            1            1            1 
##   Electrical  KitchenQual   GarageCars   GarageArea     SaleType 
##            1            1            1            1            1
cat('There are', length(NAcol), 'columns with missing values')
## There are 35 columns with missing values

Of course, the 1459 NAs in SalePrice match the size of the test set perfectly. This means that I have to fix NAs in 34 predictor variables.

Imputing missing data

In this section, I am going to fix the 34 predictors that contains missing values. I will go through them working my way down from most NAs until I have fixed them all. If I stumble upon a variable that actually forms a group with other variables, I will also deal with them as a group. For instance, there are multiple variables that relate to Pool, Garage, and Basement.

As I want to keep the document as readable as possible, I decided to use the “Tabs” option that knitr provides. You can find a short analysis for each (group of) variables under each Tab. You don’t have to go through all sections, and can also just have a look at a few tabs. If you do so, I think that especially the Garage and Basement sections are worthwhile, as I have been carefull in determing which houses really do not have a basement or garage.

Besides making sure that the NAs are taken care off, I have also converted character variables into ordinal integers if there is clear ordinality, or into factors if levels are categories without ordinality. I will convert these factors into numeric later on by using one-hot encoding (using the model.matrix function).

Pool variables

Pool Quality and the PoolArea variable

The PoolQC is the variable with most NAs. The description is as follows:

PoolQC: Pool quality

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   NA   No Pool
   

So, it is obvious that I need to just assign ‘No Pool’ to the NAs. Also, the high number of NAs makes sense as normally only a small proportion of houses have a pool.

house$PoolQC[is.na(house$PoolQC)] <- 'None'

It is also clear that I can label encode this variable as the values are ordinal. As there a multiple variables that use the same quality levels, I am going to create a vector that I can reuse later on.

Qualities <- c('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)

Now, I can use the function ‘revalue’ to do the work for me.

house$PoolQC<-as.integer(revalue(house$PoolQC, Qualities))
table(house$PoolQC)
## 
##    0    2    4    5 
## 2909    2    4    4

However, there is a second variable that relates to Pools. This is the PoolArea variable (in square feet). As you can see below, there are 3 houses without PoolQC. First, I checked if there was a clear relation between the PoolArea and the PoolQC. As I did not see a clear relation (bigger of smaller pools with better PoolQC), I am going to impute PoolQC values based on the Overall Quality of the houses (which is not very high for those 3 houses).

house[house$PoolArea>0 & house$PoolQC==0, c('PoolArea', 'PoolQC', 'OverallQual')]
house$PoolQC[2421] <- 2
house$PoolQC[2504] <- 3
house$PoolQC[2600] <- 2

Miscellaneous Feature

Miscellaneous feature not covered in other categories

Within Miscellaneous Feature, there are 2814 NAs. As the values are not ordinal, I will convert MiscFeature into a factor. Values:

   Elev Elevator
   Gar2 2nd Garage (if not described in garage section)
   Othr Other
   Shed Shed (over 100 SF)
   TenC Tennis Court
   NA   None
house$MiscFeature[is.na(house$MiscFeature)] <- 'None'
house$MiscFeature <- as.factor(house$MiscFeature)

ggplot(house[!is.na(house$SalePrice),], aes(x=MiscFeature, y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

table(house$MiscFeature)
## 
## Gar2 None Othr Shed TenC 
##    5 2814    4   95    1

When looking at the frequencies, the variable seems irrelevant to me. Having a shed probably means ‘no Garage’, which would explain the lower sales price for Shed. Also, while it makes a lot of sense that a house with a Tennis court is expensive, there is only one house with a tennis court in the training set.

Alley

Type of alley access to property

Within Alley, there are 2721 NAs. As the values are not ordinal, I will convert Alley into a factor. Values:

   Grvl Gravel
   Pave Paved
   NA   No alley access
house$Alley[is.na(house$Alley)] <- 'None'
house$Alley <- as.factor(house$Alley)

ggplot(house[!is.na(house$SalePrice),], aes(x=Alley, y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue')+
        scale_y_continuous(breaks= seq(0, 200000, by=50000), labels = comma)

table(house$Alley)
## 
## Grvl None Pave 
##  120 2721   78

Fence

Fence quality

Within Fence, there are 2348 NAs. The values seem to be ordinal. Values:

   GdPrv    Good Privacy
   MnPrv    Minimum Privacy
   GdWo Good Wood
   MnWw Minimum Wood/Wire
   NA   No Fence
house$Fence[is.na(house$Fence)] <- 'None'
table(house$Fence)
## 
## GdPrv  GdWo MnPrv  MnWw  None 
##   118   112   329    12  2348
house[!is.na(house$SalePrice),] %>% group_by(Fence) %>% summarise(median = median(SalePrice), counts=n())

My conclusion is that the values do not seem ordinal (no fence is best). Therefore, I will convert Fence into a factor.

house$Fence <- as.factor(house$Fence)

Fireplace variables

Fireplace quality, and Number of fireplaces

Within Fireplace Quality, there are 1420 NAs. Number of fireplaces is complete.

Fireplace quality

The number of NAs in FireplaceQu matches the number of houses with 0 fireplaces. This means that I can safely replace the NAs in FireplaceQu with ‘no fireplace’. The values are ordinal, and I can use the Qualities vector that I have already created for the Pool Quality. Values:

   Ex   Excellent - Exceptional Masonry Fireplace
   Gd   Good - Masonry Fireplace in main level
   TA   Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
   Fa   Fair - Prefabricated Fireplace in basement
   Po   Poor - Ben Franklin Stove
   NA   No Fireplace
house$FireplaceQu[is.na(house$FireplaceQu)] <- 'None'
house$FireplaceQu<-as.integer(revalue(house$FireplaceQu, Qualities))
table(house$FireplaceQu)
## 
##    0    1    2    3    4    5 
## 1420   46   74  592  744   43

Number of fireplaces

Fireplaces is an integer variable, and there are no missing values.

table(house$Fireplaces)
## 
##    0    1    2    3    4 
## 1420 1268  219   11    1
sum(table(house$Fireplaces))
## [1] 2919

Please return to the 5.2 Tabs menu to select other (groups of) variables

Lot variables

3 variables. One with 1 NA, and 2 complete variables.

LotFrontage: Linear feet of street connected to property

486 NAs. The most reasonable imputation seems to take the median per neigborhood.

ggplot(house[!is.na(house$LotFrontage),], aes(x=as.factor(Neighborhood), y=LotFrontage)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

for (i in 1:nrow(house)){
        if(is.na(house$LotFrontage[i])){
               house$LotFrontage[i] <- as.integer(median(house$LotFrontage[house$Neighborhood==house$Neighborhood[i]], na.rm=TRUE)) 
        }
}

LotShape: General shape of property

No NAs. Values seem ordinal (Regular=best)

   Reg  Regular 
   IR1  Slightly irregular
   IR2  Moderately Irregular
   IR3  Irregular
house$LotShape<-as.integer(revalue(house$LotShape, c('IR3'=0, 'IR2'=1, 'IR1'=2, 'Reg'=3)))
table(house$LotShape)
## 
##    0    1    2    3 
##   16   76  968 1859
sum(table(house$LotShape))
## [1] 2919

LotConfig: Lot configuration

No NAs. The values seemed possibly ordinal to me, but the visualization does not show this. Therefore, I will convert the variable into a factor.

   Inside   Inside lot
   Corner   Corner lot
   CulDSac  Cul-de-sac
   FR2  Frontage on 2 sides of property
   FR3  Frontage on 3 sides of property
   
ggplot(house[!is.na(house$SalePrice),], aes(x=as.factor(LotConfig), y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue')+
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

house$LotConfig <- as.factor(house$LotConfig)
table(house$LotConfig)
## 
##  Corner CulDSac     FR2     FR3  Inside 
##     511     176      85      14    2133
sum(table(house$LotConfig))
## [1] 2919

Garage variables

Altogether, there are 7 variables related to garages

Two of those have one NA (GarageCars and GarageArea), one has 157 NAs (GarageType), 4 variables have 159 NAs.

First of all, I am going to replace all 159 missing GarageYrBlt: Year garage was built values with the values in YearBuilt (this is similar to YearRemodAdd, which also defaults to YearBuilt if no remodeling or additions).

house$GarageYrBlt[is.na(house$GarageYrBlt)] <- house$YearBuilt[is.na(house$GarageYrBlt)]

As NAs mean ‘No Garage’ for character variables, I now want to find out where the differences between the 157 NA GarageType and the other 3 character variables with 159 NAs come from.

#check if house 157 NAs are the same observations among the variables with 157/159 NAs
length(which(is.na(house$GarageType) & is.na(house$GarageFinish) & is.na(house$GarageCond) & is.na(house$GarageQual)))
## [1] 157
#Find the 2 additional NAs
kable(house[!is.na(house$GarageType) & is.na(house$GarageFinish), c('GarageCars', 'GarageArea', 'GarageType', 'GarageCond', 'GarageQual', 'GarageFinish')])
GarageCars GarageArea GarageType GarageCond GarageQual GarageFinish
2127 1 360 Detchd NA NA NA
2577 NA NA Detchd NA NA NA

The 157 NAs within GarageType all turn out to be NA in GarageCondition, GarageQuality, and GarageFinish as well. The differences are found in houses 2127 and 2577. As you can see, house 2127 actually does seem to have a Garage and house 2577 does not. Therefore, there should be 158 houses without a Garage. To fix house 2127, I will imputate the most common values (modes) for GarageCond, GarageQual, and GarageFinish.

#Imputing modes.
house$GarageCond[2127] <- names(sort(-table(house$GarageCond)))[1]
house$GarageQual[2127] <- names(sort(-table(house$GarageQual)))[1]
house$GarageFinish[2127] <- names(sort(-table(house$GarageFinish)))[1]

#display "fixed" house
kable(house[2127, c('GarageYrBlt', 'GarageCars', 'GarageArea', 'GarageType', 'GarageCond', 'GarageQual', 'GarageFinish')])
GarageYrBlt GarageCars GarageArea GarageType GarageCond GarageQual GarageFinish
2127 1910 1 360 Detchd TA TA Unf

GarageCars and GarageArea: Size of garage in car capacity and Size of garage in square

Both have 1 NA. As you can see above, it is house 2577 for both variables. The problem probably occured as the GarageType for this house is “detached”, while all other Garage-variables seem to indicate that this house has no Garage.

#fixing 3 values for house 2577
house$GarageCars[2577] <- 0
house$GarageArea[2577] <- 0
house$GarageType[2577] <- NA

#check if NAs of the character variables are now all 158
length(which(is.na(house$GarageType) & is.na(house$GarageFinish) & is.na(house$GarageCond) & is.na(house$GarageQual)))
## [1] 158

Now, the 4 character variables related to garage all have the same set of 158 NAs, which correspond to ‘No Garage’. I will fix all of them in the remainder of this section

GarageType: Garage location

The values do not seem ordinal, so I will convert into a factor.

   2Types   More than one type of garage
   Attchd   Attached to home
   Basment  Basement Garage
   BuiltIn  Built-In (Garage part of house - typically has room above garage)
   CarPort  Car Port
   Detchd   Detached from home
   NA   No Garage
house$GarageType[is.na(house$GarageType)] <- 'No Garage'
house$GarageType <- as.factor(house$GarageType)
table(house$GarageType)
## 
##    2Types    Attchd   Basment   BuiltIn   CarPort    Detchd No Garage 
##        23      1723        36       186        15       778       158

GarageFinish: Interior finish of the garage

The values are ordinal.

   Fin  Finished
   RFn  Rough Finished  
   Unf  Unfinished
   NA   No Garage       
house$GarageFinish[is.na(house$GarageFinish)] <- 'None'
Finish <- c('None'=0, 'Unf'=1, 'RFn'=2, 'Fin'=3)

house$GarageFinish<-as.integer(revalue(house$GarageFinish, Finish))
table(house$GarageFinish)
## 
##    0    1    2    3 
##  158 1231  811  719

GarageQual: Garage quality

Another variable than can be made ordinal with the Qualities vector.

   Ex   Excellent
   Gd   Good
   TA   Typical/Average
   Fa   Fair
   Po   Poor
   NA   No Garage
   
house$GarageQual[is.na(house$GarageQual)] <- 'None'
house$GarageQual<-as.integer(revalue(house$GarageQual, Qualities))
table(house$GarageQual)
## 
##    0    1    2    3    4    5 
##  158    5  124 2605   24    3

GarageCond: Garage condition

Another variable than can be made ordinal with the Qualities vector.

   Ex   Excellent
   Gd   Good
   TA   Typical/Average
   Fa   Fair
   Po   Poor
   NA   No Garage
house$GarageCond[is.na(house$GarageCond)] <- 'None'
house$GarageCond<-as.integer(revalue(house$GarageCond, Qualities))
table(house$GarageCond)
## 
##    0    1    2    3    4    5 
##  158   14   74 2655   15    3

Basement Variables

Altogether, there are 11 variables that relate to the Basement of a house

Five of those have 79-82 NAs, six have one or two NAs.

#check if house 79 NAs are the same observations among the variables with 80+ NAs
length(which(is.na(house$BsmtQual) & is.na(house$BsmtCond) & is.na(house$BsmtExposure) & is.na(house$BsmtFinType1) & is.na(house$BsmtFinType2)))
## [1] 79
#Find the additional NAs; BsmtFinType1 is the one with 79 NAs
house[!is.na(house$BsmtFinType1) & (is.na(house$BsmtCond)|is.na(house$BsmtQual)|is.na(house$BsmtExposure)|is.na(house$BsmtFinType2)), c('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2')]

So altogether, it seems as if there are 79 houses without a basement, because the basement variables of the other houses with missing values are all 80% complete (missing 1 out of 5 values). I am going to impute the modes to fix those 9 houses.

#Imputing modes.
house$BsmtFinType2[333] <- names(sort(-table(house$BsmtFinType2)))[1]
house$BsmtExposure[c(949, 1488, 2349)] <- names(sort(-table(house$BsmtExposure)))[1]
house$BsmtCond[c(2041, 2186, 2525)] <- names(sort(-table(house$BsmtCond)))[1]
house$BsmtQual[c(2218, 2219)] <- names(sort(-table(house$BsmtQual)))[1]

Now that the 5 variables considered agree upon 79 houses with ‘no basement’, I am going to factorize/hot encode them below.

BsmtQual: Evaluates the height of the basement

A variable than can be made ordinal with the Qualities vector.

   Ex   Excellent (100+ inches) 
   Gd   Good (90-99 inches)
   TA   Typical (80-89 inches)
   Fa   Fair (70-79 inches)
   Po   Poor (<70 inches
   NA   No Basement
house$BsmtQual[is.na(house$BsmtQual)] <- 'None'
house$BsmtQual<-as.integer(revalue(house$BsmtQual, Qualities))
table(house$BsmtQual)
## 
##    0    2    3    4    5 
##   79   88 1285 1209  258

BsmtCond: Evaluates the general condition of the basement

A variable than can be made ordinal with the Qualities vector.

   Ex   Excellent
   Gd   Good
   TA   Typical - slight dampness allowed
   Fa   Fair - dampness or some cracking or settling
   Po   Poor - Severe cracking, settling, or wetness
   NA   No Basement
house$BsmtCond[is.na(house$BsmtCond)] <- 'None'
house$BsmtCond<-as.integer(revalue(house$BsmtCond, Qualities))
table(house$BsmtCond)
## 
##    0    1    2    3    4 
##   79    5  104 2609  122

BsmtExposure: Refers to walkout or garden level walls

A variable than can be made ordinal.

   Gd   Good Exposure
   Av   Average Exposure (split levels or foyers typically score average or above)  
   Mn   Mimimum Exposure
   No   No Exposure
   NA   No Basement
house$BsmtExposure[is.na(house$BsmtExposure)] <- 'None'
Exposure <- c('None'=0, 'No'=1, 'Mn'=2, 'Av'=3, 'Gd'=4)

house$BsmtExposure<-as.integer(revalue(house$BsmtExposure, Exposure))
table(house$BsmtExposure)
## 
##    0    1    2    3    4 
##   79 1907  239  418  276

BsmtFinType1: Rating of basement finished area

A variable than can be made ordinal.

   GLQ  Good Living Quarters
   ALQ  Average Living Quarters
   BLQ  Below Average Living Quarters   
   Rec  Average Rec Room
   LwQ  Low Quality
   Unf  Unfinshed
   NA   No Basement
    
house$BsmtFinType1[is.na(house$BsmtFinType1)] <- 'None'
FinType <- c('None'=0, 'Unf'=1, 'LwQ'=2, 'Rec'=3, 'BLQ'=4, 'ALQ'=5, 'GLQ'=6)

house$BsmtFinType1<-as.integer(revalue(house$BsmtFinType1, FinType))
table(house$BsmtFinType1)
## 
##   0   1   2   3   4   5   6 
##  79 851 154 288 269 429 849

BsmtFinType2: Rating of basement finished area (if multiple types)

A variable than can be made ordinal with the FinType vector.

   GLQ  Good Living Quarters
   ALQ  Average Living Quarters
   BLQ  Below Average Living Quarters   
   Rec  Average Rec Room
   LwQ  Low Quality
   Unf  Unfinshed
   NA   No Basement
house$BsmtFinType2[is.na(house$BsmtFinType2)] <- 'None'
FinType <- c('None'=0, 'Unf'=1, 'LwQ'=2, 'Rec'=3, 'BLQ'=4, 'ALQ'=5, 'GLQ'=6)

house$BsmtFinType2<-as.integer(revalue(house$BsmtFinType2, FinType))
table(house$BsmtFinType2)
## 
##    0    1    2    3    4    5    6 
##   79 2494   87  105   68   52   34

Remaining Basement variabes with just a few NAs

I now still have to deal with those 6 variables that have 1 or 2 NAs.

#display remaining NAs. Using BsmtQual as a reference for the 79 houses without basement agreed upon earlier
house[(is.na(house$BsmtFullBath)|is.na(house$BsmtHalfBath)|is.na(house$BsmtFinSF1)|is.na(house$BsmtFinSF2)|is.na(house$BsmtUnfSF)|is.na(house$TotalBsmtSF)), c('BsmtQual', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF')]

It should be obvious that those remaining NAs all refer to ‘not present’. Below, I am fixing those remaining variables.

BsmtFullBath: Basement full bathrooms

An integer variable.

house$BsmtFullBath[is.na(house$BsmtFullBath)] <-0
table(house$BsmtFullBath)
## 
##    0    1    2    3 
## 1707 1172   38    2

BsmtHalfBath: Basement half bathrooms

An integer variable.

house$BsmtHalfBath[is.na(house$BsmtHalfBath)] <-0
table(house$BsmtHalfBath)
## 
##    0    1    2 
## 2744  171    4

BsmtFinSF1: Type 1 finished square feet

An integer variable.

house$BsmtFinSF1[is.na(house$BsmtFinSF1)] <-0

BsmtFinSF2: Type 2 finished square feet

An integer variable.

house$BsmtFinSF2[is.na(house$BsmtFinSF2)] <-0

BsmtUnfSF: Unfinished square feet of basement area

An integer variable.

house$BsmtUnfSF[is.na(house$BsmtUnfSF)] <-0

TotalBsmtSF: Total square feet of basement area

An integer variable.

house$TotalBsmtSF[is.na(house$TotalBsmtSF)] <-0

Please return to the 5.2 Tabs menu to select other (groups of) variables

Masonry variables

Masonry veneer type, and masonry veneer area

Masonry veneer type has 24 NAs. Masonry veneer area has 23 NAs. If a house has a veneer area, it should also have a masonry veneer type. Let’s fix this one first.

#check if the 23 houses with veneer area NA are also NA in the veneer type
length(which(is.na(house$MasVnrType) & is.na(house$MasVnrArea)))
## [1] 23
#find the one that should have a MasVnrType
house[is.na(house$MasVnrType) & !is.na(house$MasVnrArea), c('MasVnrType', 'MasVnrArea')]
#fix this veneer type by imputing the mode
house$MasVnrType[2611] <- names(sort(-table(house$MasVnrType)))[2] #taking the 2nd value as the 1st is 'none'
house[2611, c('MasVnrType', 'MasVnrArea')]

This leaves me with 23 houses that really have no masonry.

Masonry veneer type

Will check the ordinality below.

   BrkCmn   Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   None None
   Stone    Stone
house$MasVnrType[is.na(house$MasVnrType)] <- 'None'

house[!is.na(house$SalePrice),] %>% group_by(MasVnrType) %>% summarise(median = median(SalePrice), counts=n()) %>% arrange(median)

There seems to be a significant difference between “common brick/none” and the other types. I assume that simple stones and for instance wooden houses are just cheaper. I will make the ordinality accordingly.

Masonry <- c('None'=0, 'BrkCmn'=0, 'BrkFace'=1, 'Stone'=2)
house$MasVnrType<-as.integer(revalue(house$MasVnrType, Masonry))
table(house$MasVnrType)
## 
##    0    1    2 
## 1790  880  249

MasVnrArea: Masonry veneer area in square feet

An integer variable.

house$MasVnrArea[is.na(house$MasVnrArea)] <-0

Please return to the 5.2 Tabs menu to select other (groups of) variables

MS Zoning

MSZoning: Identifies the general zoning classification of the sale

4 NAs. Values are categorical.

   A    Agriculture
   C    Commercial
   FV   Floating Village Residential
   I    Industrial
   RH   Residential High Density
   RL   Residential Low Density
   RP   Residential Low Density Park 
   RM   Residential Medium Density
#imputing the mode
house$MSZoning[is.na(house$MSZoning)] <- names(sort(-table(house$MSZoning)))[1]
house$MSZoning <- as.factor(house$MSZoning)
table(house$MSZoning)
## 
## C (all)      FV      RH      RL      RM 
##      25     139      26    2269     460
sum(table(house$MSZoning))
## [1] 2919

Please return to the 5.2 Tabs menu to select other (groups of) variables

Kitchen variables

Kitchen quality and numer of Kitchens above grade

Kitchen quality has 1 NA. Number of Kitchens is complete.

Kitchen quality

1NA. Can be made ordinal with the qualities vector.

   Ex   Excellent
   Gd   Good
   TA   Typical/Average
   Fa   Fair
   Po   Poor
house$KitchenQual[is.na(house$KitchenQual)] <- 'TA' #replace with most common value
house$KitchenQual<-as.integer(revalue(house$KitchenQual, Qualities))
table(house$KitchenQual)
## 
##    2    3    4    5 
##   70 1493 1151  205
sum(table(house$KitchenQual))
## [1] 2919

Number of Kitchens above grade

An integer variable with no NAs.

table(house$KitchenAbvGr)
## 
##    0    1    2    3 
##    3 2785  129    2
sum(table(house$KitchenAbvGr))
## [1] 2919

Please return to the 5.2 Tabs menu to select other (groups of) variables

Utilities

Utilities: Type of utilities available

2 NAs. Ordinal as additional utilities is better.

   AllPub   All public Utilities (E,G,W,& S)    
   NoSewr   Electricity, Gas, and Water (Septic Tank)
   NoSeWa   Electricity and Gas Only
   ELO  Electricity only

However, the table below shows that only one house does not have all public utilities. This house is in the train set. Therefore, imputing ‘AllPub’ for the NAs means that all houses in the test set will have ‘AllPub’. This makes the variable useless for prediction. Consequently, I will get rid of it.

table(house$Utilities)
## 
## AllPub NoSeWa 
##   2916      1
kable(house[is.na(house$Utilities) | house$Utilities=='NoSeWa', 1:9])
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities
945 20 RL 82 14375 Pave None 2 Lvl NoSeWa
1916 30 RL 109 21780 Grvl None 3 Lvl NA
1946 20 RL 64 31220 Pave None 2 Bnk NA
house$Utilities <- NULL

Please return to the 5.2 Tabs menu to select other (groups of) variables

Home functionality

Functional: Home functionality

1NA. Can be made ordinal (salvage only is worst, typical is best).

   Typ  Typical Functionality
   Min1 Minor Deductions 1
   Min2 Minor Deductions 2
   Mod  Moderate Deductions
   Maj1 Major Deductions 1
   Maj2 Major Deductions 2
   Sev  Severely Damaged
   Sal  Salvage only
#impute mode for the 1 NA
house$Functional[is.na(house$Functional)] <- names(sort(-table(house$Functional)))[1]

house$Functional <- as.integer(revalue(house$Functional, c('Sal'=0, 'Sev'=1, 'Maj2'=2, 'Maj1'=3, 'Mod'=4, 'Min2'=5, 'Min1'=6, 'Typ'=7)))
table(house$Functional)
## 
##    1    2    3    4    5    6    7 
##    2    9   19   35   70   65 2719
sum(table(house$Functional))
## [1] 2919

Please return to the 5.2 Tabs menu to select other (groups of) variables

Exterior variables

There are 4 exterior variables

2 variables have 1 NA, 2 variables have no NAs.

Exterior1st: Exterior covering on house

1 NA. Values are categorical.

   AsbShng  Asbestos Shingles
   AsphShn  Asphalt Shingles
   BrkComm  Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   CemntBd  Cement Board
   HdBoard  Hard Board
   ImStucc  Imitation Stucco
   MetalSd  Metal Siding
   Other    Other
   Plywood  Plywood
   PreCast  PreCast 
   Stone    Stone
   Stucco   Stucco
   VinylSd  Vinyl Siding
   Wd Sdng  Wood Siding
   WdShing  Wood Shingles
#imputing mode
house$Exterior1st[is.na(house$Exterior1st)] <- names(sort(-table(house$Exterior1st)))[1]

house$Exterior1st <- as.factor(house$Exterior1st)
table(house$Exterior1st)
## 
## AsbShng AsphShn BrkComm BrkFace  CBlock CemntBd HdBoard ImStucc MetalSd 
##      44       2       6      87       2     126     442       1     450 
## Plywood   Stone  Stucco VinylSd Wd Sdng WdShing 
##     221       2      43    1026     411      56
sum(table(house$Exterior1st))
## [1] 2919

Exterior2nd: Exterior covering on house (if more than one material)

1 NA. Values are categorical.

   AsbShng  Asbestos Shingles
   AsphShn  Asphalt Shingles
   BrkComm  Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   CemntBd  Cement Board
   HdBoard  Hard Board
   ImStucc  Imitation Stucco
   MetalSd  Metal Siding
   Other    Other
   Plywood  Plywood
   PreCast  PreCast
   Stone    Stone
   Stucco   Stucco
   VinylSd  Vinyl Siding
   Wd Sdng  Wood Siding
   WdShing  Wood Shingles
#imputing mode
house$Exterior2nd[is.na(house$Exterior2nd)] <- names(sort(-table(house$Exterior2nd)))[1]

house$Exterior2nd <- as.factor(house$Exterior2nd)
table(house$Exterior2nd)
## 
## AsbShng AsphShn Brk Cmn BrkFace  CBlock CmentBd HdBoard ImStucc MetalSd 
##      38       4      22      47       3     126     406      15     447 
##   Other Plywood   Stone  Stucco VinylSd Wd Sdng Wd Shng 
##       1     270       6      47    1015     391      81
sum(table(house$Exterior2nd))
## [1] 2919

ExterQual: Evaluates the quality of the material on the exterior

No NAs. Can be made ordinal using the Qualities vector.

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   Po   Poor
   
house$ExterQual<-as.integer(revalue(house$ExterQual, Qualities))
table(house$ExterQual)
## 
##    2    3    4    5 
##   35 1798  979  107
sum(table(house$ExterQual))
## [1] 2919

ExterCond: Evaluates the present condition of the material on the exterior

No NAs. Can be made ordinal using the Qualities vector.

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   Po   Poor
house$ExterCond<-as.integer(revalue(house$ExterCond, Qualities))
table(house$ExterCond)
## 
##    1    2    3    4    5 
##    3   67 2538  299   12
sum(table(house$ExterCond))
## [1] 2919

Please return to the 5.2 Tabs menu to select other (groups of) variables

Electrical system

Electrical: Electrical system

1 NA. Values are categorical.

   SBrkr    Standard Circuit Breakers & Romex
   FuseA    Fuse Box over 60 AMP and all Romex wiring (Average) 
   FuseF    60 AMP Fuse Box and mostly Romex wiring (Fair)
   FuseP    60 AMP Fuse Box and mostly knob & tube wiring (poor)
   Mix  Mixed
#imputing mode
house$Electrical[is.na(house$Electrical)] <- names(sort(-table(house$Electrical)))[1]

house$Electrical <- as.factor(house$Electrical)
table(house$Electrical)
## 
## FuseA FuseF FuseP   Mix SBrkr 
##   188    50     8     1  2672
sum(table(house$Electrical))
## [1] 2919

Please return to the 5.2 Tabs menu to select other (groups of) variables

Sale Type and Condition

SaleType: Type of sale

1 NA. Values are categorical.

   WD   Warranty Deed - Conventional
   CWD  Warranty Deed - Cash
   VWD  Warranty Deed - VA Loan
   New  Home just constructed and sold
   COD  Court Officer Deed/Estate
   Con  Contract 15% Down payment regular terms
   ConLw    Contract Low Down payment and low interest
   ConLI    Contract Low Interest
   ConLD    Contract Low Down
   Oth  Other
#imputing mode
house$SaleType[is.na(house$SaleType)] <- names(sort(-table(house$SaleType)))[1]

house$SaleType <- as.factor(house$SaleType)
table(house$SaleType)
## 
##   COD   Con ConLD ConLI ConLw   CWD   New   Oth    WD 
##    87     5    26     9     8    12   239     7  2526
sum(table(house$SaleType))
## [1] 2919

SaleCondition: Condition of sale

No NAs. Values are categorical.

   Normal   Normal Sale
   Abnorml  Abnormal Sale -  trade, foreclosure, short sale
   AdjLand  Adjoining Land Purchase
   Alloca   Allocation - two linked properties with separate deeds, typically condo with a garage unit  
   Family   Sale between family members
   Partial  Home was not completed when last assessed (associated with New Homes)
house$SaleCondition <- as.factor(house$SaleCondition)
table(house$SaleCondition)
## 
## Abnorml AdjLand  Alloca  Family  Normal Partial 
##     190      12      24      46    2402     245
sum(table(house$SaleCondition))
## [1] 2919

Label encoding/factorizing the remaining character variables

At this point, I have made sure that all variables with NAs are taken care of. However, I still need to also take care of the remaining character variables that without missing values. Similar to the previous section, I have created Tabs for groups of variables.

Charcol <- names(house[,sapply(house, is.character)])
Charcol
##  [1] "Street"       "LandContour"  "LandSlope"    "Neighborhood"
##  [5] "Condition1"   "Condition2"   "BldgType"     "HouseStyle"  
##  [9] "RoofStyle"    "RoofMatl"     "Foundation"   "Heating"     
## [13] "HeatingQC"    "CentralAir"   "PavedDrive"
cat('There are', length(Charcol), 'remaining columns with character values')
## There are 15 remaining columns with character values

Foundation

Foundation: Type of foundation

    BrkTil          Brick & Tile
    CBlock          Cinder Block
    PConc           Poured Contrete 
    Slab            Slab
    Stone           Stone
    Wood            Wood
#No ordinality, so converting into factors
house$Foundation <- as.factor(house$Foundation)
table(house$Foundation)
## 
## BrkTil CBlock  PConc   Slab  Stone   Wood 
##    311   1235   1308     49     11      5
sum(table(house$Foundation))
## [1] 2919

Heating and airco

There are 2 heating variables, and one that indicates Airco Yes/No.

Heating: Type of heating

   Floor    Floor Furnace
   GasA Gas forced warm air furnace
   GasW Gas hot water or steam heat
   Grav Gravity furnace 
   OthW Hot water or steam heat other than gas
   Wall Wall furnace
   
#No ordinality, so converting into factors
house$Heating <- as.factor(house$Heating)
table(house$Heating)
## 
## Floor  GasA  GasW  Grav  OthW  Wall 
##     1  2874    27     9     2     6
sum(table(house$Heating))
## [1] 2919

HeatingQC: Heating quality and condition

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   Po   Poor
   
#making the variable ordinal using the Qualities vector
house$HeatingQC<-as.integer(revalue(house$HeatingQC, Qualities))
table(house$HeatingQC)
## 
##    1    2    3    4    5 
##    3   92  857  474 1493
sum(table(house$HeatingQC))
## [1] 2919

CentralAir: Central air conditioning

   N    No
   Y    Yes
house$CentralAir<-as.integer(revalue(house$CentralAir, c('N'=0, 'Y'=1)))
table(house$CentralAir)
## 
##    0    1 
##  196 2723
sum(table(house$CentralAir))
## [1] 2919

Please return to the 5.3 Tabs menu to select other (groups of) variables

Roof

There are 2 variables that deal with the roof of houses.

RoofStyle: Type of roof

   Flat Flat
   Gable    Gable
   Gambrel  Gabrel (Barn)
   Hip  Hip
   Mansard  Mansard
   Shed Shed
#No ordinality, so converting into factors
house$RoofStyle <- as.factor(house$RoofStyle)
table(house$RoofStyle)
## 
##    Flat   Gable Gambrel     Hip Mansard    Shed 
##      20    2310      22     551      11       5
sum(table(house$RoofStyle))
## [1] 2919

RoofMatl: Roof material

   ClyTile  Clay or Tile
   CompShg  Standard (Composite) Shingle
   Membran  Membrane
   Metal    Metal
   Roll Roll
   Tar&Grv  Gravel & Tar
   WdShake  Wood Shakes
   WdShngl  Wood Shingles
#No ordinality, so converting into factors
house$RoofMatl <- as.factor(house$RoofMatl)
table(house$RoofMatl)
## 
## ClyTile CompShg Membran   Metal    Roll Tar&Grv WdShake WdShngl 
##       1    2876       1       1       1      23       9       7
sum(table(house$RoofMatl))
## [1] 2919

Please return to the 5.3 Tabs menu to select other (groups of) variables

Land

2 variables that specify the flatness and slope of the propoerty.

LandContour: Flatness of the property

   Lvl  Near Flat/Level 
   Bnk  Banked - Quick and significant rise from street grade to building
   HLS  Hillside - Significant slope from side to side
   Low  Depression
#No ordinality, so converting into factors
house$LandContour <- as.factor(house$LandContour)
table(house$LandContour)
## 
##  Bnk  HLS  Low  Lvl 
##  117  120   60 2622
sum(table(house$LandContour))
## [1] 2919

LandSlope: Slope of property

   Gtl  Gentle slope
   Mod  Moderate Slope  
   Sev  Severe Slope
#Ordinal, so label encoding
house$LandSlope<-as.integer(revalue(house$LandSlope, c('Sev'=0, 'Mod'=1, 'Gtl'=2)))
table(house$LandSlope)
## 
##    0    1    2 
##   16  125 2778
sum(table(house$LandSlope))
## [1] 2919

Please return to the 5.3 Tabs menu to select other (groups of) variables

Dwelling

2 variables that specify the type and style of dwelling.

BldgType: Type of dwelling

   1Fam Single-family Detached  
   2FmCon   Two-family Conversion; originally built as one-family dwelling
   Duplx    Duplex
   TwnhsE   Townhouse End Unit
   TwnhsI   Townhouse Inside Unit

This seems ordinal to me (single family detached=best). Let’s check it with visualization.

ggplot(house[!is.na(house$SalePrice),], aes(x=as.factor(BldgType), y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue')+
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

However, the visualization does not show ordinality.

#No ordinality, so converting into factors
house$BldgType <- as.factor(house$BldgType)
table(house$BldgType)
## 
##   1Fam 2fmCon Duplex  Twnhs TwnhsE 
##   2425     62    109     96    227
sum(table(house$BldgType))
## [1] 2919

HouseStyle: Style of dwelling

   1Story   One story
   1.5Fin   One and one-half story: 2nd level finished
   1.5Unf   One and one-half story: 2nd level unfinished
   2Story   Two story
   2.5Fin   Two and one-half story: 2nd level finished
   2.5Unf   Two and one-half story: 2nd level unfinished
   SFoyer   Split Foyer
   SLvl Split Level
#No ordinality, so converting into factors
house$HouseStyle <- as.factor(house$HouseStyle)
table(house$HouseStyle)
## 
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer   SLvl 
##    314     19   1471      8     24    872     83    128
sum(table(house$HouseStyle))
## [1] 2919

Please return to the 5.3 Tabs menu to select other (groups of) variables

Neighborhood and Conditions

3 variables that specify the physical location, and the proximity of ‘conditions’.

Neighborhood: Physical locations within Ames city limits

Note: as the number of levels is really high, I will look into binning later on.

   Blmngtn  Bloomington Heights
   Blueste  Bluestem
   BrDale   Briardale
   BrkSide  Brookside
   ClearCr  Clear Creek
   CollgCr  College Creek
   Crawfor  Crawford
   Edwards  Edwards
   Gilbert  Gilbert
   IDOTRR   Iowa DOT and Rail Road
   MeadowV  Meadow Village
   Mitchel  Mitchell
   Names    North Ames
   NoRidge  Northridge
   NPkVill  Northpark Villa
   NridgHt  Northridge Heights
   NWAmes   Northwest Ames
   OldTown  Old Town
   SWISU    South & West of Iowa State University
   Sawyer   Sawyer
   SawyerW  Sawyer West
   Somerst  Somerset
   StoneBr  Stone Brook
   Timber   Timberland
   Veenker  Veenker
#No ordinality, so converting into factors
house$Neighborhood <- as.factor(house$Neighborhood)
table(house$Neighborhood)
## 
## Blmngtn Blueste  BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert 
##      28      10      30     108      44     267     103     194     165 
##  IDOTRR MeadowV Mitchel   NAmes NoRidge NPkVill NridgHt  NWAmes OldTown 
##      93      37     114     443      71      23     166     131     239 
##  Sawyer SawyerW Somerst StoneBr   SWISU  Timber Veenker 
##     151     125     182      51      48      72      24
sum(table(house$Neighborhood))
## [1] 2919

Condition1: Proximity to various conditions

   Artery   Adjacent to arterial street
   Feedr    Adjacent to feeder street   
   Norm Normal  
   RRNn Within 200' of North-South Railroad
   RRAn Adjacent to North-South Railroad
   PosN Near positive off-site feature--park, greenbelt, etc.
   PosA Adjacent to postive off-site feature
   RRNe Within 200' of East-West Railroad
   RRAe Adjacent to East-West Railroad
#No ordinality, so converting into factors
house$Condition1 <- as.factor(house$Condition1)
table(house$Condition1)
## 
## Artery  Feedr   Norm   PosA   PosN   RRAe   RRAn   RRNe   RRNn 
##     92    164   2511     20     39     28     50      6      9
sum(table(house$Condition1))
## [1] 2919

Condition2: Proximity to various conditions (if more than one is present)

   Artery   Adjacent to arterial street
   Feedr    Adjacent to feeder street   
   Norm Normal  
   RRNn Within 200' of North-South Railroad
   RRAn Adjacent to North-South Railroad
   PosN Near positive off-site feature--park, greenbelt, etc.
   PosA Adjacent to postive off-site feature
   RRNe Within 200' of East-West Railroad
   RRAe Adjacent to East-West Railroad
#No ordinality, so converting into factors
house$Condition2 <- as.factor(house$Condition2)
table(house$Condition2)
## 
## Artery  Feedr   Norm   PosA   PosN   RRAe   RRAn   RRNn 
##      5     13   2889      4      4      1      1      2
sum(table(house$Condition2))
## [1] 2919

Please return to the 5.3 Tabs menu to select other (groups of) variables

Pavement of Street & Driveway

2 variables

Street: Type of road access to property

   Grvl Gravel  
   Pave Paved
#Ordinal, so label encoding
house$Street<-as.integer(revalue(house$Street, c('Grvl'=0, 'Pave'=1)))
table(house$Street)
## 
##    0    1 
##   12 2907
sum(table(house$Street))
## [1] 2919

PavedDrive: Paved driveway

   Y    Paved 
   P    Partial Pavement
   N    Dirt/Gravel
#Ordinal, so label encoding
house$PavedDrive<-as.integer(revalue(house$PavedDrive, c('N'=0, 'P'=1, 'Y'=2)))
table(house$PavedDrive)
## 
##    0    1    2 
##  216   62 2641
sum(table(house$PavedDrive))
## [1] 2919

Please return to the 5.3 Tabs menu to select other (groups of) variables

Changing some numeric variables into factors

At this point, all variables are complete (No NAs), and all character variables are converted into either numeric labels of into factors. However, there are 3 variables that are recorded numeric but should actually be categorical.

Year and Month Sold

While oridinality within YearBuilt (or remodeled) makes sense (old houses are worth less), we are talking about only 5 years of sales. These years also include an economic crisis. For instance: Sale Prices in 2009 (after the collapse) are very likely to be much lower than in 2007. I wil convert YrSold into a factor before modeling, but as I need the numeric version of YrSold to create an Age variable, I am not doing that yet.

Month Sold is also an Integer variable. However, December is not “better” than January. Therefore, I will convert MoSold values back into factors.

str(house$YrSold)
##  int [1:2919] 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
str(house$MoSold)
##  int [1:2919] 2 5 9 2 12 10 8 11 4 1 ...
house$MoSold <- as.factor(house$MoSold)

Although possible a bit less steep than expected, the effects of the Banking crises that took place at the end of 2007 can be seen indeed. After the highest median prices in 2007, the prices gradually decreased. However, seasonality seems to play a bigger role, as you can see below.

ys <- ggplot(house[!is.na(house$SalePrice),], aes(x=as.factor(YrSold), y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue')+
        scale_y_continuous(breaks= seq(0, 800000, by=25000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..)) +
        coord_cartesian(ylim = c(0, 200000)) +
        geom_hline(yintercept=163000, linetype="dashed", color = "red") #dashed line is median SalePrice

ms <- ggplot(house[!is.na(house$SalePrice),], aes(x=MoSold, y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue')+
        scale_y_continuous(breaks= seq(0, 800000, by=25000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..)) +
        coord_cartesian(ylim = c(0, 200000)) +
        geom_hline(yintercept=163000, linetype="dashed", color = "red") #dashed line is median SalePrice

grid.arrange(ys, ms, widths=c(1,2))

MSSubClass

MSSubClass: Identifies the type of dwelling involved in the sale.

    20  1-STORY 1946 & NEWER ALL STYLES
    30  1-STORY 1945 & OLDER
    40  1-STORY W/FINISHED ATTIC ALL AGES
    45  1-1/2 STORY - UNFINISHED ALL AGES
    50  1-1/2 STORY FINISHED ALL AGES
    60  2-STORY 1946 & NEWER
    70  2-STORY 1945 & OLDER
    75  2-1/2 STORY ALL AGES
    80  SPLIT OR MULTI-LEVEL
    85  SPLIT FOYER
    90  DUPLEX - ALL STYLES AND AGES
   120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
   150  1-1/2 STORY PUD - ALL AGES
   160  2-STORY PUD - 1946 & NEWER
   180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
   190  2 FAMILY CONVERSION - ALL STYLES AND AGES

These classes are coded as numbers, but really are categories.

str(house$MSSubClass)
##  int [1:2919] 60 20 60 70 60 50 20 60 50 190 ...
house$MSSubClass <- as.factor(house$MSSubClass)

#revalue for better readability
house$MSSubClass<-revalue(house$MSSubClass, c('20'='1 story 1946+', '30'='1 story 1945-', '40'='1 story unf attic', '45'='1,5 story unf', '50'='1,5 story fin', '60'='2 story 1946+', '70'='2 story 1945-', '75'='2,5 story all ages', '80'='split/multi level', '85'='split foyer', '90'='duplex all style/age', '120'='1 story PUD 1946+', '150'='1,5 story PUD all', '160'='2 story PUD 1946+', '180'='PUD multilevel', '190'='2 family conversion'))

str(house$MSSubClass)
##  Factor w/ 16 levels "1 story 1946+",..: 6 1 6 7 6 5 1 6 5 16 ...

Visualization of important variables

I have now finally reached the point where all character variables have been converted into categorical factors or have been label encoded into numbers. In addition, 3 numeric variables have been converted into factors, and I deleted one variable (Utilities). As you can see below, the number of numerical variables is now 56 (including the response variable), and the remaining 23 variables are categorical.

numericVars <- which(sapply(house, is.numeric)) #index vector numeric variables
factorVars <- which(sapply(house, is.factor)) #index vector factor variables
cat('There are', length(numericVars), 'numeric variables, and', length(factorVars), 'categoric variables')
## There are 56 numeric variables, and 23 categoric variables

Correlations again

Below I am checking the correlations again. As you can see, the number of variables with a correlation of at least 0.5 with the SalePrice has increased from 10 (see section 4.2.1) to 16.

all_numVar <- house[, numericVars]
cor_numVar <- cor(all_numVar, use="pairwise.complete.obs") #correlations of all numeric variables

#sort on decreasing correlations with SalePrice
cor_sorted <- as.matrix(sort(cor_numVar[,'SalePrice'], decreasing = TRUE))
 #select only high corelations
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
cor_numVar <- cor_numVar[CorHigh, CorHigh]

corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt", tl.cex = 0.7,cl.cex = .7, number.cex=.7)

Finding variable importance with a quick Random Forest

Although the correlations are giving a good overview of the most important numeric variables and multicolinerity among those variables, I wanted to get an overview of the most important variables including the categorical variables before moving on to visualization.

I tried to get the relative importance of variables with a quick linear regression model with the calc.relimp function of package , and also tried the boruta function of package boruta which separates the variables into groups that are important or not. However, these method took a long time. As I only want to get an indication of the variable importance, I eventually decided to keep it simple and just use a quick and dirty Random Forest model with only 100 trees. This also does the job for me, and does not take very long as I can specify a (relatively) small number of trees.

set.seed(2018)
quick_RF <- randomForest(x=house[1:1460,-79], y=house$SalePrice[1:1460], ntree=100,importance=TRUE)
imp_RF <- importance(quick_RF)
imp_DF <- data.frame(Variables = row.names(imp_RF), MSE = imp_RF[,1])
imp_DF <- imp_DF[order(imp_DF$MSE, decreasing = TRUE),]

ggplot(imp_DF[1:20,], aes(x=reorder(Variables, MSE), y=MSE, fill=MSE)) + geom_bar(stat = 'identity') + labs(x = 'Variables', y= '% increase MSE if variable is randomly permuted') + coord_flip() + theme(legend.position="none")

Only 3 of those most important variables are categorical according to RF; Neighborhood, MSSubClass, and GarageType.

The most important categorical variable; Neighborhood

Th first graph shows the median SalePrice by Neighorhood. The frequency (number of houses) of each Neighborhood in the train set is shown in the labels.

The second graph below shows the frequencies across all data.

n1 <- ggplot(house[!is.na(house$SalePrice),], aes(x=Neighborhood, y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
        geom_hline(yintercept=163000, linetype="dashed", color = "red") #dashed line is median SalePrice
n2 <- ggplot(data=house, aes(x=Neighborhood)) +
        geom_histogram(stat='count')+
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3)+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(n1, n2)

Overall Quality, and other Quality variables

I have already visualized the relation between Overall Quality and SalePrice in my initial explorations, but I want to visualize the frequency distribution as well. As there are more quality measurements, I am taking the opportunity to bundle them in this section.

q1 <- ggplot(data=house, aes(x=as.factor(OverallQual))) +
        geom_histogram(stat='count')
q2 <- ggplot(data=house, aes(x=as.factor(ExterQual))) +
        geom_histogram(stat='count')
q3 <- ggplot(data=house, aes(x=as.factor(BsmtQual))) +
        geom_histogram(stat='count')
q4 <- ggplot(data=house, aes(x=as.factor(KitchenQual))) +
        geom_histogram(stat='count')
q5 <- ggplot(data=house, aes(x=as.factor(GarageQual))) +
        geom_histogram(stat='count')
q6 <- ggplot(data=house, aes(x=as.factor(FireplaceQu))) +
        geom_histogram(stat='count')
q7 <- ggplot(data=house, aes(x=as.factor(PoolQC))) +
        geom_histogram(stat='count')

layout <- matrix(c(1,2,8,3,4,8,5,6,7),3,3,byrow=TRUE)
multiplot(q1, q2, q3, q4, q5, q6, q7, layout=layout)

Overall Quality is very important, and also more granular than the other variables. External Quality is also improtant, but has a high correlation with Overall Quality (0.73). Kitchen Quality also seems one to keep, as all houses have a kitchen and there is a variance with some substance. Garage Quality does not seem to distinguish much, as the majority of garages have Q3. Fireplace Quality is in the list of high correlations, and in the important variables list. The PoolQC is just very sparse (the 13 pools cannot even be seen on this scale). I will look at creating a ‘has pool’ variable later on.

The second most important categorical variable; MSSubClass

The first visualization shows the median SalePrice by MSSubClass. The frequency (number of houses) of each MSSubClass in the train set is shown in the labels.

The histrogram shows the frequencies across all data. Most houses are relatively new, and have one or two stories.

ms1 <- ggplot(house[!is.na(house$SalePrice),], aes(x=MSSubClass, y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
        geom_hline(yintercept=163000, linetype="dashed", color = "red") #dashed line is median SalePrice
ms2 <- ggplot(data=house, aes(x=MSSubClass)) +
        geom_histogram(stat='count')+
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(ms1, ms2)

Garage variables

Several Garage variables have a high correlation with SalePrice, and are also in the top-20 list of the quick random forest. However, there is multicolinearity among them and I think that 7 garage variables is too many anyway. I feel that something like 3 variables should be sufficient (possibly GarageCars, GarageType, and a Quality measurement), but before I do any selection I am visualizing house of them in this section.

#correct error
house$GarageYrBlt[2593] <- 2007 #this must have been a typo. GarageYrBlt=2207, YearBuilt=2006, YearRemodAdd=2007.
g1 <- ggplot(data=house[house$GarageCars !=0,], aes(x=GarageYrBlt)) +
        geom_histogram()
g2 <- ggplot(data=house, aes(x=as.factor(GarageCars))) +
        geom_histogram(stat='count')
g3 <- ggplot(data= house, aes(x=GarageArea)) +
        geom_density()
g4 <- ggplot(data=house, aes(x=as.factor(GarageCond))) +
        geom_histogram(stat='count')
g5 <- ggplot(data=house, aes(x=GarageType)) +
        geom_histogram(stat='count')
g6 <- ggplot(data=house, aes(x=as.factor(GarageQual))) +
        geom_histogram(stat='count')
g7 <- ggplot(data=house, aes(x=as.factor(GarageFinish))) +
        geom_histogram(stat='count')

layout <- matrix(c(1,5,5,2,3,8,6,4,7),3,3,byrow=TRUE)
multiplot(g1, g2, g3, g4, g5, g6, g7, layout=layout)

As already mentioned in section 4.2, GarageCars and GarageArea are highly correlated. Here, GarageQual and GarageCond also seem highly correlated, and both are dominated by level =3.

Basement variables

Similar the garage variables, multiple basement variables are important in the correlations matrix and the Top 20 RF predictors list. However, 11 basement variables seems an overkill. Before I decide what I am going to do with them, I am visualizing 8 of them below. The 2 “Bathroom” variables are dealt with in Feature Engineering (section 7.1), and the “Basement square feet” is already discussed in section 6.2.1.

b1 <- ggplot(data=house, aes(x=BsmtFinSF1)) +
        geom_histogram() + labs(x='Type 1 finished square feet')
b2 <- ggplot(data=house, aes(x=BsmtFinSF2)) +
        geom_histogram()+ labs(x='Type 2 finished square feet')
b3 <- ggplot(data=house, aes(x=BsmtUnfSF)) +
        geom_histogram()+ labs(x='Unfinished square feet')
b4 <- ggplot(data=house, aes(x=as.factor(BsmtFinType1))) +
        geom_histogram(stat='count')+ labs(x='Rating of Type 1 finished area')
b5 <- ggplot(data=house, aes(x=as.factor(BsmtFinType2))) +
        geom_histogram(stat='count')+ labs(x='Rating of Type 2 finished area')
b6 <- ggplot(data=house, aes(x=as.factor(BsmtQual))) +
        geom_histogram(stat='count')+ labs(x='Height of the basement')
b7 <- ggplot(data=house, aes(x=as.factor(BsmtCond))) +
        geom_histogram(stat='count')+ labs(x='Rating of general condition')
b8 <- ggplot(data=house, aes(x=as.factor(BsmtExposure))) +
        geom_histogram(stat='count')+ labs(x='Walkout or garden level walls')

layout <- matrix(c(1,2,3,4,5,9,6,7,8),3,3,byrow=TRUE)
multiplot(b1, b2, b3, b4, b5, b6, b7, b8, layout=layout)

So it seemed as if the Total Basement Surface in square feet (TotalBsmtSF) is further broken down into finished areas (2 if more than one type of finish), and unfinished area. I did a check between the correlation of total of those 3 variables, and TotalBsmtSF. The correlation is exactely 1, so that’s a good thing (no errors or small discrepancies)!

Basement Quality is a confusing variable name, as it turns out that it specifically rates the Height of the basement.

Feature engineering

Total number of Bathrooms

There are 4 bathroom variables. Individually, these variables are not very important. However, I assume that I if I add them up into one predictor, this predictor is likely to become a strong one.

“A half-bath, also known as a powder room or guest bath, has only two of the four main bathroom components-typically a toilet and sink.” Consequently, I will also count the half bathrooms as half.

house$TotBathrooms <- house$FullBath + (house$HalfBath*0.5) + house$BsmtFullBath + (house$BsmtHalfBath*0.5)

As you can see in the first graph, there now seems to be a clear correlation (it’s 0.63). The frequency distribution of Bathrooms in all data is shown in the second graph.

tb1 <- ggplot(data=house[!is.na(house$SalePrice),], aes(x=as.factor(TotBathrooms), y=SalePrice))+
        geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
tb2 <- ggplot(data=house, aes(x=as.factor(TotBathrooms))) +
        geom_histogram(stat='count')
grid.arrange(tb1, tb2)

Adding ‘House Age’, ‘Remodeled (Yes/No)’, and IsNew variables

Altogether, there are 3 variables that are relevant with regards to the Age of a house; YearBlt, YearRemodAdd, and YearSold. YearRemodAdd defaults to YearBuilt if there has been no Remodeling/Addition. I will use YearRemodeled and YearSold to determine the Age. However, as parts of old constructions will always remain and only parts of the house might have been renovated, I will also introduce a Remodeled Yes/No variable. This should be seen as some sort of penalty parameter that indicates that if the Age is based on a remodeling date, it is probably worth less than houses that were built from scratch in that same year.

house$Remod <- ifelse(house$YearBuilt==house$YearRemodAdd, 0, 1) #0=No Remodeling, 1=Remodeling
house$Age <- as.numeric(house$YrSold)-house$YearRemodAdd
ggplot(data=house[!is.na(house$SalePrice),], aes(x=Age, y=SalePrice))+
        geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)

As expected, the graph shows a negative correlation with Age (old house are worth less).

cor(house$SalePrice[!is.na(house$SalePrice)], house$Age[!is.na(house$SalePrice)])
## [1] -0.5090787

As you can see below, houses that are remodeled are worth less indeed, as expected.

ggplot(house[!is.na(house$SalePrice),], aes(x=as.factor(Remod), y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=6) +
        scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
        theme_grey(base_size = 18) +
        geom_hline(yintercept=163000, linetype="dashed") #dashed line is median SalePrice

Finally, I am creating the IsNew variable below. Altogether, there are 116 new houses in the dataset.

house$IsNew <- ifelse(house$YrSold==house$YearBuilt, 1, 0)
table(house$IsNew)
## 
##    0    1 
## 2803  116

These 116 new houses are fairly evenly distributed among train and test set, and as you can see new houses are worth considerably more on average.

ggplot(house[!is.na(house$SalePrice),], aes(x=as.factor(IsNew), y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=6) +
        scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
        theme_grey(base_size = 18) +
        geom_hline(yintercept=163000, linetype="dashed") #dashed line is median SalePrice

house$YrSold <- as.factor(house$YrSold) #the numeric version is now not needed anymore

Binning Neighborhood

nb1 <- ggplot(house[!is.na(house$SalePrice),], aes(x=reorder(Neighborhood, SalePrice, FUN=median), y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') + labs(x='Neighborhood', y='Median SalePrice') +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
        geom_hline(yintercept=163000, linetype="dashed", color = "red") #dashed line is median SalePrice
nb2 <- ggplot(house[!is.na(house$SalePrice),], aes(x=reorder(Neighborhood, SalePrice, FUN=mean), y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "mean", fill='blue') + labs(x='Neighborhood', y="Mean SalePrice") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
        geom_hline(yintercept=163000, linetype="dashed", color = "red") #dashed line is median SalePrice
grid.arrange(nb1, nb2)

Both the median and mean Saleprices agree on 3 neighborhoods with substantially higher saleprices. The separation of the 3 relatively poor neighborhoods is less clear, but at least both graphs agree on the same 3 poor neighborhoods. Since I do not want to ‘overbin’, I am only creating categories for those ‘extremes’.

house$NeighRich[house$Neighborhood %in% c('StoneBr', 'NridgHt', 'NoRidge')] <- 2
house$NeighRich[!house$Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale', 'StoneBr', 'NridgHt', 'NoRidge')] <- 1
house$NeighRich[house$Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale')] <- 0
table(house$NeighRich)
## 
##    0    1    2 
##  160 2471  288

##3 Total Square Feet

As the total living space generally is very important when people buy houses, I am adding a predictors that adds up the living space above and below ground.

house$TotalSqFeet <- house$GrLivArea + house$TotalBsmtSF
ggplot(data=house[!is.na(house$SalePrice),], aes(x=TotalSqFeet, y=SalePrice))+
        geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_text_repel(aes(label = ifelse(house$GrLivArea[!is.na(house$SalePrice)]>4500, rownames(house), '')))

As expected, the correlation with SalePrice is very strong indeed (0.78).

cor(house$SalePrice, house$TotalSqFeet, use= "pairwise.complete.obs")
## [1] 0.7789588

The two potential outliers seem to ‘outlie’ even more than before. By taking out these two outliers, the correlation increases by 5%.

cor(house$SalePrice[-c(524, 1299)], house$TotalSqFeet[-c(524, 1299)], use= "pairwise.complete.obs")
## [1] 0.829042

Consolidating Porch variables

Below, I listed the variables that seem related regarding porches.

  • WoodDeckSF: Wood deck area in square feet

  • OpenPorchSF: Open porch area in square feet

  • EnclosedPorch: Enclosed porch area in square feet

  • 3SsnPorch: Three season porch area in square feet

  • ScreenPorch: Screen porch area in square feet

As far as I know, porches are sheltered areas outside of the house, and a wooden deck is unsheltered. Therefore, I am leaving WoodDeckSF alone, and are only consolidating the 4 porch variables.

house$TotalPorchSF <- house$OpenPorchSF + house$EnclosedPorch + house$X3SsnPorch + house$ScreenPorch

Although adding up these Porch areas makes sense (there should not be any overlap between areas), the correlation with SalePrice is not very strong.

cor(house$SalePrice, house$TotalPorchSF, use= "pairwise.complete.obs")
## [1] 0.1957389
ggplot(data=house[!is.na(house$SalePrice),], aes(x=TotalPorchSF, y=SalePrice))+
        geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)

Preparing data for modeling

Dropping highly correlated variables

First of all, I am dropping a variable if two variables are highly correlated. To find these correlated pairs, I have used the correlations matrix again (see section 6.1). For instance: GarageCars and GarageArea have a correlation of 0.89. Of those two, I am dropping the variable with the lowest correlation with SalePrice (which is GarageArea with a SalePrice correlation of 0.62. GarageCars has a SalePrice correlation of 0.64).

dropVars <- c('YearRemodAdd', 'GarageYrBlt', 'GarageArea', 'GarageCond', 'TotalBsmtSF', 'TotalRmsAbvGrd', 'BsmtFinSF1')

house <- house[,!(names(house) %in% dropVars)]

Removing outliers

For the time being, I am keeping it simple and just remove the two really big houses with low SalePrice manually. However, I intend to investigate this more thorough in a later stage (possibly using the ‘outliers’ package).

house <- house[-c(524, 1299),]

PreProcessing predictor variables

Before modeling I need to center and scale the ‘true numeric’ predictors (so not variables that have been label encoded), and create dummy variables for the categorical predictors. Below, I am splitting the dataframe into one with all (true) numeric variables, and another dataframe holding the (ordinal) factors.

numericVarNames <- numericVarNames[!(numericVarNames %in% c('MSSubClass', 'MoSold', 'YrSold', 'SalePrice', 'OverallQual', 'OverallCond'))] #numericVarNames was created before having done anything
numericVarNames <- append(numericVarNames, c('Age', 'TotalPorchSF', 'TotBathrooms', 'TotalSqFeet'))

DFnumeric <- house[, names(house) %in% numericVarNames]

DFfactors <- house[, !(names(house) %in% numericVarNames)]
DFfactors <- DFfactors[, names(DFfactors) != 'SalePrice']

cat('There are', length(DFnumeric), 'numeric variables, and', length(DFfactors), 'factor variables')
## There are 30 numeric variables, and 49 factor variables

Skewness and normalizing of the numeric predictors

Skewness Skewness is a measure of the symmetry in a distribution. A symmetrical dataset will have a skewness equal to 0. So, a normal distribution will have a skewness of 0. Skewness essentially measures the relative size of the two tails. As a rule of thumb, skewness should be between -1 and 1. In this range, data are considered fairly symmetrical. In order to fix the skewness, I am taking the log for all numeric predictors with an absolute skew greater than 0.8 (actually: log+1, to avoid division by zero issues).

for(i in 1:ncol(DFnumeric)){
        if (abs(skew(DFnumeric[,i]))>0.8){
                DFnumeric[,i] <- log(DFnumeric[,i] +1)
        }
}

Normalizing the data

PreNum <- preProcess(DFnumeric, method=c("center", "scale"))
print(PreNum)
## Created from 2917 samples and 30 variables
## 
## Pre-processing:
##   - centered (30)
##   - ignored (0)
##   - scaled (30)
DFnorm <- predict(PreNum, DFnumeric)
dim(DFnorm)
## [1] 2917   30

One hot encoding the categorical variables

DFdummies <- as.data.frame(model.matrix(~.-1, DFfactors))
dim(DFdummies)
## [1] 2917  201

Removing levels with few or no observations in train or test

#check if some values are absent in the test set
ZerocolTest <- which(colSums(DFdummies[(nrow(house[!is.na(house$SalePrice),])+1):nrow(house),])==0)
colnames(DFdummies[ZerocolTest])
##  [1] "Condition2RRAe"     "Condition2RRAn"     "Condition2RRNn"    
##  [4] "HouseStyle2.5Fin"   "RoofMatlMembran"    "RoofMatlMetal"     
##  [7] "RoofMatlRoll"       "Exterior1stImStucc" "Exterior1stStone"  
## [10] "Exterior2ndOther"   "HeatingOthW"        "ElectricalMix"     
## [13] "MiscFeatureTenC"
DFdummies <- DFdummies[,-ZerocolTest] #removing predictors
#check if some values are absent in the train set
ZerocolTrain <- which(colSums(DFdummies[1:nrow(house[!is.na(house$SalePrice),]),])==0)
colnames(DFdummies[ZerocolTrain])
## [1] "MSSubClass1,5 story PUD all"
DFdummies <- DFdummies[,-ZerocolTrain] #removing predictor

Also taking out variables with less than 10 ‘ones’ in the train set.

fewOnes <- which(colSums(DFdummies[1:nrow(house[!is.na(house$SalePrice),]),])<10)
colnames(DFdummies[fewOnes])
##  [1] "MSSubClass1 story unf attic" "LotConfigFR3"               
##  [3] "NeighborhoodBlueste"         "NeighborhoodNPkVill"        
##  [5] "Condition1PosA"              "Condition1RRNe"             
##  [7] "Condition1RRNn"              "Condition2Feedr"            
##  [9] "Condition2PosA"              "Condition2PosN"             
## [11] "RoofStyleMansard"            "RoofStyleShed"              
## [13] "RoofMatlWdShake"             "RoofMatlWdShngl"            
## [15] "Exterior1stAsphShn"          "Exterior1stBrkComm"         
## [17] "Exterior1stCBlock"           "Exterior2ndAsphShn"         
## [19] "Exterior2ndBrk Cmn"          "Exterior2ndCBlock"          
## [21] "Exterior2ndStone"            "FoundationStone"            
## [23] "FoundationWood"              "HeatingGrav"                
## [25] "HeatingWall"                 "ElectricalFuseP"            
## [27] "GarageTypeCarPort"           "MiscFeatureOthr"            
## [29] "SaleTypeCon"                 "SaleTypeConLD"              
## [31] "SaleTypeConLI"               "SaleTypeConLw"              
## [33] "SaleTypeCWD"                 "SaleTypeOth"                
## [35] "SaleConditionAdjLand"
DFdummies <- DFdummies[,-fewOnes] #removing predictors
dim(DFdummies)
## [1] 2917  152

Altogether, I have removed 49 one-hot encoded predictors with little or no variance. Altough this may seem a significant number, it is actually much less than the number of predictors that were taken out by using caret’snear zero variance function (using its default thresholds).

combined <- cbind(DFnorm, DFdummies) #combining all (now numeric) predictors into one dataframe 

Dealing with skewness of response variable

skew(house$SalePrice)
## [1] 1.877427
qqnorm(house$SalePrice)
qqline(house$SalePrice)

The skew of 1.87 indicates a right skew that is too high, and the Q-Q plot shows that sale prices are also not normally distributed. To fix this I am taking the log of SalePrice.

house$SalePrice <- log(house$SalePrice) #default is the natural logarithm, "+1" is not necessary as there are no 0's
skew(house$SalePrice)
## [1] 0.1213182

As you can see,the skew is now quite low and the Q-Q plot is also looking much better.

qqnorm(house$SalePrice)
qqline(house$SalePrice)

Composing train and test sets

train1 <- combined[!is.na(house$SalePrice),]
test1 <- combined[is.na(house$SalePrice),]

Compare different regression models

Lasso regression model

Below, I am using caret cross validation to find the best value for lambda, which is the only hyperparameter that needs to be tuned for the lasso model.

set.seed(27042018)
my_control <-trainControl(method="cv", number=5)
lassoGrid <- expand.grid(alpha = 1, lambda = seq(0.001,0.1,by = 0.0005))

lasso_mod <- train(x=train1, y=house$SalePrice[!is.na(house$SalePrice)], method='glmnet', trControl= my_control, tuneGrid=lassoGrid) 
lasso_mod$bestTune
min(lasso_mod$results$RMSE)
## [1] 0.1127325
lassoVarImp <- varImp(lasso_mod,scale=F)
lassoImportance <- lassoVarImp$importance

varsSelected <- length(which(lassoImportance$Overall!=0))
varsNotSelected <- length(which(lassoImportance$Overall==0))

cat('Lasso uses', varsSelected, 'variables in its model, and did not select', varsNotSelected, 'variables.')
## Lasso uses 132 variables in its model, and did not select 50 variables.

So lasso did what it is supposed to do: it seems to have dealt with multicolinearity well by not using about 45% of the available variables in the model.

LassoPred <- predict(lasso_mod, test1)
predictions_lasso <- exp(LassoPred) #need to reverse the log to the real values
head(predictions_lasso)
##     1461     1462     1463     1464     1465     1466 
## 115562.1 162498.8 179170.1 197566.9 205357.5 168499.1
length(predictions_lasso)
## [1] 1459

XGBoost model

xgb_grid = expand.grid(
nrounds = 1000,
eta = c(0.1, 0.05, 0.01),
max_depth = c(2, 3, 4, 5, 6),
gamma = 0,
colsample_bytree=1,
min_child_weight=c(1, 2, 3, 4 ,5),
subsample=1
)
label_train <- house$SalePrice[!is.na(house$SalePrice)]

# put our testing & training data into two seperates Dmatrixs objects
dtrain <- xgb.DMatrix(data = as.matrix(train1), label= label_train)
dtest <- xgb.DMatrix(data = as.matrix(test1))

In addition, I am taking over the best tuned values from the caret cross validation.

default_param<-list(
        objective = "reg:linear",
        booster = "gbtree",
        eta=0.05, #default = 0.3
        gamma=0,
        max_depth=3, #default=6
        min_child_weight=4, #default=1
        subsample=1,
        colsample_bytree=1
)

The next step is to do cross validation to determine the best number of rounds (for the given set of parameters).

xgbcv <- xgb.cv( params = default_param, data = dtrain, nrounds = 500, nfold = 5, showsd = T, stratified = T, print_every_n = 40, early_stopping_rounds = 10, maximize = F)
## [1]  train-rmse:10.955575+0.004295   test-rmse:10.955331+0.017846 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [41] train-rmse:1.428252+0.000738    test-rmse:1.429281+0.012684 
## [81] train-rmse:0.219593+0.001561    test-rmse:0.231238+0.009798 
## [121]    train-rmse:0.102289+0.003105    test-rmse:0.129010+0.012711 
## [161]    train-rmse:0.090270+0.003373    test-rmse:0.122363+0.013031 
## [201]    train-rmse:0.084180+0.003142    test-rmse:0.119779+0.013185 
## [241]    train-rmse:0.079681+0.002972    test-rmse:0.118523+0.013068 
## [281]    train-rmse:0.076025+0.002940    test-rmse:0.117808+0.013084 
## [321]    train-rmse:0.072816+0.002834    test-rmse:0.117464+0.013091 
## Stopping. Best iteration:
## [329]    train-rmse:0.072294+0.002884    test-rmse:0.117442+0.013036

Although it was a bit of work, the hyperparameter tuning definitly paid of, as the cross validated RMSE inproved considerably (from 0.1225 without the caret tuning, to 0.1162 in this version)!

#train the model using the best iteration found by cross validation
xgb_mod <- xgb.train(data = dtrain, params=default_param, nrounds = 454)
XGBpred <- predict(xgb_mod, dtest)
predictions_XGB <- exp(XGBpred) #need to reverse the log to the real values
head(predictions_XGB)
## [1] 116386.8 162307.3 186494.0 187440.4 187258.3 166241.4
#view variable importance plot
library(Ckmeans.1d.dp) #required for ggplot clustering
mat <- xgb.importance (feature_names = colnames(train1),model = xgb_mod)
xgb.ggplot.importance(importance_matrix = mat[1:20], rel_to_first = TRUE)

Averaging predictions

Since the lasso and XGBoost algorithms are very different, averaging predictions likely improves the scores. As the lasso model does better regarding the cross validated RMSE score (0.1121 versus 0.1162), I am weigting the lasso model double.

sub_avg <- data.frame(Id = test_labels, SalePrice = (predictions_XGB+2*predictions_lasso)/3)
head(sub_avg)
write.csv(sub_avg, file = 'Lasso.csv', row.names = F)

Kaggle submission