Intro

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition is a useful playground for predicting the final price of each home by identifying the features that most influence that price.

Data fields

Here’s a brief description of the fields in the data files (i.e. the train and test sets):

  • SalePrice: The property’s sale price in dollars. This is the target variable we’re trying to predict.
  • MSSubClass: The building class
  • MSZoning: The general zoning classification
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • Street: Type of road access
  • Alley: Type of alley access
  • LotShape: General shape of property
  • LandContour: Flatness of the property
  • Utilities: Type of utilities available
  • LotConfig: Lot configuration
  • LandSlope: Slope of property
  • Neighborhood: Physical locations within Ames city limits
  • Condition1: Proximity to main road or railroad
  • Condition2: Proximity to main road or railroad (if a second is present)
  • BldgType: Type of dwelling
  • HouseStyle: Style of dwelling
  • OverallQual: Overall material and finish quality
  • OverallCond: Overall condition rating
  • YearBuilt: Original construction date
  • YearRemodAdd: Remodel date
  • RoofStyle: Type of roof
  • RoofMatl: Roof material
  • Exterior1st: Exterior covering on house
  • Exterior2nd: Exterior covering on house (if more than one material)
  • MasVnrType: Masonry veneer type
  • MasVnrArea: Masonry veneer area in square feet
  • ExterQual: Exterior material quality
  • ExterCond: Present condition of the material on the exterior
  • Foundation: Type of foundation
  • BsmtQual: Height of the basement
  • BsmtCond: General condition of the basement
  • BsmtExposure: Walkout or garden level basement walls
  • BsmtFinType1: Quality of basement finished area
  • BsmtFinSF1: Type 1 finished square feet
  • BsmtFinType2: Quality of second finished area (if present)
  • BsmtFinSF2: Type 2 finished square feet
  • BsmtUnfSF: Unfinished square feet of basement area
  • TotalBsmtSF: Total square feet of basement area
  • Heating: Type of heating
  • HeatingQC: Heating quality and condition
  • CentralAir: Central air conditioning
  • Electrical: Electrical system
  • 1stFlrSF: First Floor square feet
  • 2ndFlrSF: Second floor square feet
  • LowQualFinSF: Low quality finished square feet (all floors)
  • GrLivArea: Above grade (ground) living area square feet
  • BsmtFullBath: Basement full bathrooms
  • BsmtHalfBath: Basement half bathrooms
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • Bedroom: Number of bedrooms above basement level
  • Kitchen: Number of kitchens
  • KitchenQual: Kitchen quality
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • Functional: Home functionality rating
  • Fireplaces: Number of fireplaces
  • FireplaceQu: Fireplace quality
  • GarageType: Garage location
  • GarageYrBlt: Year garage was built
  • GarageFinish: Interior finish of the garage
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet
  • GarageQual: Garage quality
  • GarageCond: Garage condition
  • PavedDrive: Paved driveway
  • WoodDeckSF: Wood deck area in square feet
  • OpenPorchSF: Open porch area in square feet
  • EnclosedPorch: Enclosed porch area in square feet
  • 3SsnPorch: Three season porch area in square feet
  • ScreenPorch: Screen porch area in square feet
  • PoolArea: Pool area in square feet
  • PoolQC: Pool quality
  • Fence: Fence quality
  • MiscFeature: Miscellaneous feature not covered in other categories
  • MiscVal: $Value of miscellaneous feature
  • MoSold: Month Sold
  • YrSold: Year Sold
  • SaleType: Type of sale
  • SaleCondition: Condition of sale
Major Findings

Ranked by Kaggle score on the test set, the models’ predictive accuracy was: SVR > Cluster-Then-Predict > Random Forest > CART > Multiple Linear Regression.

Competition link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview

Data preprocessing

Loading the data
train <- read.csv('train.csv', stringsAsFactors = TRUE)
test <- read.csv('test.csv', stringsAsFactors = TRUE)
# Helper function to make submissions
submit <- function(predictions, num = 1){
    df <- data.frame(Id=test$Id, SalePrice=predictions)
    filename <- paste("submit", as.character(num), ".csv", sep="")
    write.csv(df, filename, row.names = FALSE)
}
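For instance, a quick hypothetical baseline submission (not part of the original workflow) would predict the mean training price for every test house:

# Hypothetical baseline: predict the mean sale price for every test house
submit(rep(mean(train$SalePrice), nrow(test)), 0)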
Check the proportion of NAs by column; if it is greater than 0.4, remove the feature.

Columns removed: Alley, PoolQC, Fence, MiscFeature and FireplaceQu.

cols_to_remove <- colnames(train)[which(colMeans(is.na(train)) > .4)]
train_set <- train[, !names(train) %in% cols_to_remove]
cols_to_remove
## [1] "Alley"       "FireplaceQu" "PoolQC"      "Fence"       "MiscFeature"

Perform same operation on the test set.

test_set <- test[, !names(test) %in% cols_to_remove]

Remove Id column since it’s just an identifier.

train_set <- train_set[,-1]
test_set <- test_set[,-1]

Also get rid of the Utilities variable, since almost all its values are the same. Just by looking at the proportion of its possible values we can confirm this.

library(knitr)
kable(prop.table(table(train_set$Utilities)))
Var1 Freq
AllPub 0.9993151
NoSeWa 0.0006849
kable(prop.table(table(test_set$Utilities)))
Var1 Freq
AllPub 1
train_set <- subset(train_set, select = -c(Utilities))
test_set <- subset(test_set, select = -c(Utilities))

Remove near zero variance columns

library(caret)
near_zero_v <- nearZeroVar(train_set, saveMetrics = TRUE)
# Get names of nzvcolumns so they can also be removed from test set
nzv_cols <- rownames(near_zero_v)[near_zero_v$nzv]
train_set <- train_set[, !names(train_set) %in% nzv_cols]
test_set <- test_set[, !names(test_set) %in% nzv_cols]
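Optionally, you can inspect which columns were flagged as near-zero variance and how many columns remain:

# Optional check: near-zero-variance columns that were dropped, and the remaining dimensions
nzv_cols
dim(train_set)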

Impute missing values, since after removing the previous columns the maximum proportion of NAs in any column is only about 17.7%.

# Get columns that contain NAs
col_na.train <- train_set[which(colMeans(is.na(train_set)) > 0)]
col_na.test <- test_set[which(colMeans(is.na(test_set)) > 0)]
library(mice)
# Deal with NAs using a decision tree (CART) method.
set.seed(1004)
imputed.train <- complete(mice(col_na.train, method = "cart"))
imputed.test <- complete(mice(col_na.test, method = "cart"))

Replace train and test sets columns with imputed values.

for(col in colnames(imputed.train)){
    train_set[, col] <- imputed.train[, col]
}

for(col in colnames(imputed.test)){
    test_set[, col] <- imputed.test[, col]
}
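As a quick sanity check (a minimal sketch), confirm that the imputation left no missing values behind:

# Should both be 0 if the imputation filled every NA
sum(is.na(train_set))
sum(is.na(test_set))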

Exploratory Data Analysis

library(ggplot2)
library(dplyr)
col_types <- sapply(1:ncol(train_set), function(n) class(train_set[,n]))

Check correlation with outcome variable for numeric columns

train_numeric <- train_set[, which(!col_types %in% c("character", "factor"))]

Compute the correlation of each numeric column with the outcome; columns whose absolute correlation with SalePrice is below 0.15 will be dropped.

# Column 29 of train_numeric is the outcome (SalePrice)
corrs <- sapply(1:ncol(train_numeric[,-29]), function(n){
    cor(train_numeric[,n], train_numeric$SalePrice)
})
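As an optional peek (a sketch that assumes, like the code above, that SalePrice is column 29 of train_numeric), the numeric features most correlated with SalePrice can be listed using a named copy of corrs:

# Name the correlations and sort by absolute value (corrs itself is untouched)
corrs_named <- setNames(corrs, names(train_numeric)[-29])
head(sort(abs(corrs_named), decreasing = TRUE))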

Get variables to fit linear regression.

trainReg <- train_set[, !names(train_set) %in% names(train_numeric)[which(abs(corrs) < 0.15)]]
trainReg$SalePrice <- NULL
predictors_corr <- cor(trainReg[, which(sapply(trainReg, class) != "factor")])

Remove correlated predictors. The cutoff (threshold) used to flag them will be 0.7.

library(reshape2)
melted_predictors_corr <- melt(predictors_corr)
melted_predictors_corr %>% ggplot(aes(Var1, Var2, fill=value)) +
    geom_tile() + xlab(element_blank()) + ylab(element_blank()) +
    scale_fill_gradient2(name='Correlation', low = "blue", mid="white",
                         high="red", limit=c(-1,1)) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.2))

Find the independent variables that surpass the cutoff.

highlyCorPred <- findCorrelation(predictors_corr, cutoff = 0.7)
rownames(predictors_corr)[highlyCorPred]
## [1] "GrLivArea"  "GarageCars" "X1stFlrSF"  "YearBuilt"

Modeling

Multiple linear regression

Selecting the final variables before fitting the linear regression

Although findCorrelation flagged the four previous columns as being highly correlated with another variable, I’ll keep the ones that make more sense for the price of a house or that are easier to interpret. For example, I find it more relevant to keep the year in which the house was built (YearBuilt) rather than the year in which its garage was built (GarageYrBlt). Similar reasoning was applied to the other three correlated predictors.

trainReg <- select(trainReg, -c(TotRmsAbvGrd, GarageCars,
                                TotalBsmtSF, GarageYrBlt))
testReg <- test_set[, names(trainReg)]

Normalize data

# preProcess defaults to centering and scaling the numeric columns
preproc <- preProcess(trainReg)
trainReg.norm <- predict(preproc, trainReg)
testReg.norm <- predict(preproc, testReg)
# Add response variable again
trainReg.norm$SalePrice <- train_set$SalePrice
  • R squared ~= 0.8851
  • Adjusted R squared ~= 0.8703
linear_reg <- lm(SalePrice ~ ., data = trainReg.norm)
rsq <- summary(linear_reg)$r.squared
rsq_adj <- summary(linear_reg)$adj.r.squared
kable(data.frame(rsq, rsq_adj))
rsq rsq_adj
0.8850503 0.8702926

Helper function to compute the Root Mean Squared Error (RMSE) on the logarithms of the prices. Using logs means that errors on expensive houses are penalized on the same relative scale as errors on cheaper ones, and it matches the competition’s evaluation metric.

getRMSE <- function(pred, label){
    sqrt(mean((log(pred) - log(label))^2))
}
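For intuition (a small worked example with made-up prices), a 10% over-prediction is penalized identically for a cheap and an expensive house:

# Both calls give |log(1.1)| ~= 0.0953, regardless of the price level
getRMSE(110000, 100000)
getRMSE(1100000, 1000000)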
Multiple linear regression - RMSE on train set = 0.1326157.
getRMSE(predict(linear_reg), trainReg.norm$SalePrice)
## [1] 0.1326157
Model diagnostics

Check that the residuals are normally distributed and centered at zero.

hist(linear_reg$residuals, breaks = 20, xlab = "Residuals", ylab="Frequency", main="Residuals histogram")

Both the mean and the median of the residuals are essentially zero.

mean(linear_reg$residuals)
## [1] -3.326782e-13
median(linear_reg$residuals)
## [1] 3.91509e-11

Constant variability (homoscedasticity): the residuals show a somewhat fan-shaped pattern, but there is a large group that appears to be centered at zero.

linear_reg %>% ggplot(aes(x=.fitted,y=.resid)) + geom_point(alpha = 0.3) +
    geom_hline(yintercept = 0, linetype="dashed") +
    xlab("Fitted values") +
    ylab("Residuals")

Collinearity among predictors was already checked (with the correlation matrix), so it’s time to make predictions.

preds.linear_reg <- predict(linear_reg, testReg.norm)

First submission to Kaggle.

submit(preds.linear_reg, 1)

Regression tree

Fitting CART model
library(rpart)
library(rpart.plot)
houseTree <- rpart(SalePrice ~ ., data = train_set)

Regression tree - RMSE = 0.2172899.

getRMSE(predict(houseTree), train_set$SalePrice)
## [1] 0.2172899

Perform 10-fold cross-validation to choose the complexity parameter (cp); each fold trains on 90% of the training set.

numFolds <- trainControl(method="cv", number = 10, p=0.9)
cps <- expand.grid(.cp=seq(0.001, 0.2, 0.01))
houseTree2 <- train(SalePrice ~ ., method='rpart', data = train_set,
                    trControl=numFolds, tuneGrid=cps)
houseTree2$bestTune
##      cp
## 1 0.001

Plotting RMSE vs CP

plot(houseTree2)

Fit tree with tuned parameter.

# Extract the numeric cp value from bestTune
tree <- rpart(SalePrice ~ ., data = train_set, cp = houseTree2$bestTune$cp)
rpart.plot(tree)

RMSE decreased to 0.157749 with the tuned parameter.

getRMSE(predict(tree), train_set$SalePrice)
## [1] 0.157749

Make predictions

preds.tree <- predict(tree, newdata = test_set)

Submit

submit(preds.tree, 2)

Random forest Regression

library(randomForest)
set.seed(566)
rf <- randomForest(SalePrice ~ ., data=train_set, importance=TRUE,
                   nodesize=25, ntree=200)

Check importance of features according to randomForest (the higher the better).

rf$importance
##                    %IncMSE IncNodePurity
## MSSubClass      20453897.0  1.448571e+10
## MSZoning        10743712.9  1.008843e+10
## LotFrontage     11844714.9  3.154942e+10
## LotArea         89667406.7  1.247520e+11
## LotShape         3087501.2  7.611888e+09
## LotConfig        1998832.9  4.294643e+09
## Neighborhood   757030032.8  1.028312e+12
## Condition1      -3226426.6  6.171041e+09
## BldgType         5045697.8  5.143335e+09
## HouseStyle      22171431.9  1.730239e+10
## OverallQual   1285362192.1  2.094848e+12
## OverallCond     27291982.2  2.161470e+10
## YearBuilt      121437458.5  1.980947e+11
## YearRemodAdd    57652022.5  4.100587e+10
## RoofStyle        1172588.6  4.103026e+09
## Exterior1st     58084628.6  4.671893e+10
## Exterior2nd     50984870.0  5.499522e+10
## MasVnrType       3755760.0  5.735615e+09
## MasVnrArea      16832129.9  3.380504e+10
## ExterQual      316123417.9  7.836361e+11
## ExterCond        4643132.2  4.874868e+09
## Foundation       9440915.5  6.019268e+09
## BsmtQual        54445071.5  1.312335e+11
## BsmtExposure    10430152.2  1.524471e+10
## BsmtFinType1    19194202.1  1.899117e+10
## BsmtFinSF1     117197904.9  1.493709e+11
## BsmtUnfSF       17618454.4  2.323850e+10
## TotalBsmtSF    248765913.4  3.748821e+11
## HeatingQC        4122542.0  6.206179e+09
## CentralAir       8167305.6  1.449110e+10
## Electrical        561467.8  1.707556e+09
## X1stFlrSF      260830170.4  3.289810e+11
## X2ndFlrSF      131306891.1  2.376710e+11
## GrLivArea      845220274.9  9.034613e+11
## BsmtFullBath     8936817.9  8.458578e+09
## BsmtHalfBath     2658026.7  1.813562e+09
## FullBath        57993105.4  1.493176e+11
## HalfBath        10841026.8  8.217957e+09
## BedroomAbvGr    13446504.8  1.454909e+10
## KitchenQual    206663397.6  3.256485e+11
## TotRmsAbvGrd    33511840.2  1.129447e+11
## Fireplaces      39671400.3  5.416565e+10
## GarageType      27861044.6  2.265755e+10
## GarageYrBlt     57232467.8  2.646738e+10
## GarageFinish    13810839.2  1.135351e+10
## GarageCars     234541720.0  6.641011e+11
## GarageArea     139608623.4  3.127428e+11
## PavedDrive       1339956.3  3.259677e+09
## WoodDeckSF       5561618.6  1.873462e+10
## OpenPorchSF     11257359.5  1.944500e+10
## MoSold           1710909.9  1.299944e+10
## YrSold           -980489.9  3.622100e+09
## SaleType         2990606.5  6.597732e+09
## SaleCondition    8617902.1  1.574464e+10
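For a visual summary of the same information (optional), randomForest’s varImpPlot gives a plot of both importance measures:

# Plot %IncMSE and IncNodePurity for each predictor
varImpPlot(rf)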

Random forest - RMSE = 0.1431867.

getRMSE(predict(rf), train_set$SalePrice)
## [1] 0.1431867

Do this to fix the error “numbers of columns of arguments do not match”, which appears because the factor variables in the training and test sets have different levels: temporarily bind the first training row (without SalePrice) onto the test set so the factor levels match, then drop it again.

# Bind one training row (dropping SalePrice, column 55) so factor levels match, then remove it
test_set <- rbind(train_set[1, -55], test_set)
test_set <- test_set[-1, ]
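An alternative sketch (not used here) is to explicitly re-level the factor columns of a copy of the test set to the training levels; note that any test-set value unseen in training would become NA with this approach:

# Hypothetical alternative: align factor levels on a copy of the test set
test_set_aligned <- test_set
for (col in intersect(names(train_set), names(test_set_aligned))) {
    if (is.factor(train_set[[col]])) {
        test_set_aligned[[col]] <- factor(test_set_aligned[[col]],
                                          levels = levels(train_set[[col]]))
    }
}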

# Predict on test set and submit
preds.rf <- predict(rf, test_set)
submit(preds.rf, 3)

Support Vector Regression (SVR)

library(e1071)

svrReg <- svm(SalePrice ~ ., data=train_set, scale=TRUE,
              type='eps-regression')

SVR - RMSE = 0.1043786.

getRMSE(predict(svrReg), train_set$SalePrice)
## [1] 0.1043786

Predict and submit

preds.svrReg <- predict(svrReg, test_set)
submit(preds.svrReg, 4)

Cluster-Then-Predict

Regression for each cluster

Remove the outcome from the train set and normalize, because variables on larger scales would otherwise dominate the distance computations.

# Column 55 is SalePrice
train_set_clust <- train_set[, -55]
preproc2 <- preProcess(train_set_clust)
train_set_clust <- predict(preproc2, train_set_clust)
test_set_clust <- predict(preproc2, test_set)

Get numeric columns

train_set_clust.num <- train_set_clust[, which(sapply(train_set_clust, class) != 'factor')]
test_set_clust.num <- test_set_clust[, which(sapply(test_set_clust, class) != 'factor')]

Calculate euclidean distance between each point

distances <- dist(train_set_clust.num, method = "euclidean")

The ward.D method minimizes the variance inside each cluster/group.

houseHclust <- hclust(distances, method='ward.D')

Looking at the dendrogram, it seems like 4 clusters would be a good choice.

plot(houseHclust, labels = FALSE)
abline(h = 240, col="bisque2")
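As a quick check (a sketch), cutting the tree into 4 groups shows how the houses would be distributed among them:

# Cluster sizes when the dendrogram is cut into 4 groups
table(cutree(houseHclust, k = 4))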

Pass the numeric columns to as.kcca in order to obtain cluster assignments for the train and test sets.

library(flexclust)
hclustKcca <- as.kcca(houseHclust, data=train_set_clust.num, k=4)
clusterTrain <- predict(hclustKcca)
clusterTest <- predict(hclustKcca, newdata=test_set_clust.num)

Get the houses assigned to clusters 1, 2, 3 and 4 in training set.

houseTrain1 <- subset(train_set, clusterTrain == 1)
houseTrain2 <- subset(train_set, clusterTrain == 2)
houseTrain3 <- subset(train_set, clusterTrain == 3)
houseTrain4 <- subset(train_set, clusterTrain == 4)

Same for test set

houseTest1 <- subset(test_set, clusterTest == 1)
houseTest2 <- subset(test_set, clusterTest == 2)
houseTest3 <- subset(test_set, clusterTest == 3)
houseTest4 <- subset(test_set, clusterTest == 4)

Fit SVR for each cluster.

svm1 <- svm(SalePrice ~ ., data=houseTrain1, type='eps-regression')
svm2 <- svm(SalePrice ~ ., data=houseTrain2, type='eps-regression')
svm3 <- svm(SalePrice ~ ., data=houseTrain3, type='eps-regression')
svm4 <- svm(SalePrice ~ ., data=houseTrain4, type='eps-regression')

Compute predictions

predictTest1 <- predict(svm1, houseTest1)
predictTest2 <- predict(svm2, houseTest2)
predictTest3 <- predict(svm3, houseTest3)
predictTest4 <- predict(svm4, houseTest4)

Join the four predictions

AllPredictions <- c(predictTest1, predictTest2, predictTest3, predictTest4)

Fill the predictions back in the correct order, using the row names carried over from the original test file.

# One slot per test row; positions will be filled using the test_set row names
final_predictions <- rep(0, nrow(test))

# After the earlier rbind trick, test_set row names run from 2 to 1460,
# so each prediction is placed at its matching index
for (row in as.numeric(names(AllPredictions))){
    name <- as.character(row)
    final_predictions[row] <- AllPredictions[name]
}

# Drop the unused first slot so there is exactly one prediction per test row
final_predictions <- final_predictions[2:1460]
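A final sanity check (sketch): there should be exactly one prediction per row of the original test file.

# Should be TRUE
length(final_predictions) == nrow(test)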

Submit

submit(final_predictions, 5)