Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing almost every aspect of residential homes in Ames, Iowa, this competition results useful to try to predict the final price of each home by attempting to get the most relevant features that influence the price.
Here’s a brief description of the data files (i.e. train and test sets):
The order of the predictive accuracy by model on test set based on Kaggle score was in the following order: SVR > Cluster-Then-Predict > Random Forest > CART > Multiple Linear Regression.
Competition link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview
train <- read.csv('train.csv', stringsAsFactors = TRUE)
test <- read.csv('test.csv', stringsAsFactors = TRUE)
# Helper function to make submissions
submit <- function(predictions, num = 1){
df <- data.frame(Id=test$Id, SalePrice=predictions)
filename <- paste("submit", as.character(num), ".csv", sep="")
write.csv(df, filename, row.names = FALSE)
}
Columns removed: Alley, PoolQC, Fence, MiscFeature and FireplaceQu.
cols_to_remove <- colnames(train)[which(colMeans(is.na(train)) > .4)]
train_set <- train[, !names(train) %in% cols_to_remove]
cols_to_remove
## [1] "Alley" "FireplaceQu" "PoolQC" "Fence" "MiscFeature"
Perform same operation on the test set.
test_set <- test[, !names(test) %in% cols_to_remove]
Remove Id column since it’s just an identifier.
train_set <- train_set[,-1]
test_set <- test_set[,-1]
Also get rid of the Utilities variable since almost all its values are the same. Just by look at the proportion of its possible values we can confirm this.
library(knitr)
kable(prop.table(table(train_set$Utilities)))
| Var1 | Freq |
|---|---|
| AllPub | 0.9993151 |
| NoSeWa | 0.0006849 |
kable(prop.table(table(test_set$Utilities)))
| Var1 | Freq |
|---|---|
| AllPub | 1 |
train_set <- subset(train_set, select = -c(Utilities))
test_set <- subset(test_set, select = -c(Utilities))
Remove near zero variance columns
library(caret)
near_zero_v <- nearZeroVar(train_set, saveMetrics = TRUE)
# Get names of nzvcolumns so they can also be removed from test set
nzv_cols <- rownames(near_zero_v)[near_zero_v$nzv]
train_set <- train_set[, !names(train_set) %in% nzv_cols]
test_set <- test_set[, !names(test_set) %in% nzv_cols]
Impute missing values given that the maximum proportion of NAs by column is about 17.7% after removing the previous columns.
# Get columns that contain NAs
col_na.train <- train_set[which(colMeans(is.na(train_set)) > 0)]
col_na.test <- test_set[which(colMeans(is.na(test_set)) > 0)]
library(mice)
# Deal with NAs using a decision tree (CART) method.
set.seed(1004)
imputed.train <- complete(mice(col_na.train, method = "cart"))
imputed.test <- complete(mice(col_na.test, method = "cart"))
Replace train and test sets columns with imputed values.
for(col in colnames(imputed.train)){
train_set[, col] <- imputed.train[, col]
}
for(col in colnames(imputed.test)){
test_set[, col] <- imputed.test[, col]
}
library(ggplot2)
library(dplyr)
col_types <- sapply(1:ncol(train_set), function(n) class(train_set[,n]))
Check correlation with outcome variable for numeric columns
train_numeric <- train_set[, which(!col_types %in% c("character", "factor"))]
Get columns that have correlation greater than 0.15 with outcome.
corrs <- sapply(1:ncol(train_numeric[,-29]), function(n){
cor(train_numeric[,n], train_numeric$SalePrice)
})
Get variables to fit linear regression.
trainReg <- train_set[, !names(train_set) %in% names(train_numeric)[which(abs(corrs) < 0.15)]]
trainReg$SalePrice <- NULL
predictors_corr <- cor(trainReg[, which(sapply(trainReg, class) != "factor")])
Remove correlated predictors - The cutoff or threshold to remove them will be 0.7.
library(reshape2)
melted_predictors_corr <- melt(predictors_corr)
melted_predictors_corr %>% ggplot(aes(Var1, Var2, fill=value)) +
geom_tile() + xlab(element_blank()) + ylab(element_blank()) +
scale_fill_gradient2(name='Correlation', low = "blue", mid="white",
high="red", limit=c(-1,1)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.2))
Find the independent variables that surpass the cutoff.
highlyCorPred <- findCorrelation(predictors_corr, cutoff = 0.7)
rownames(predictors_corr)[highlyCorPred]
## [1] "GrLivArea" "GarageCars" "X1stFlrSF" "YearBuilt"
Although findCorrelation suggested the four previous columns that were correlated with another one, I’ll keep the ones that make more sense for the price of a house or that are easier to interpret. For example, I find more relevant to keep the year in which the house was built (YearBuilt) rather than the year that the house garage was built (GarageYrBlt). A similar reasoning was applied for the other three correlated predictors.
trainReg <- select(trainReg, -c(TotRmsAbvGrd, GarageCars,
TotalBsmtSF, GarageYrBlt))
testReg <- test_set[, names(trainReg)]
Normalize data
preproc <- preProcess(trainReg)
trainReg.norm <- predict(preproc, trainReg)
testReg.norm <- predict(preproc, testReg)
# Add response variable again
trainReg.norm$SalePrice <- train_set$SalePrice
linear_reg <- lm(SalePrice ~ ., data = trainReg.norm)
rsq <- summary(linear_reg)$r.squared
rsq_adj <- summary(linear_reg)$adj.r.squared
kable(data.frame(rsq, rsq_adj))
| rsq | rsq_adj |
|---|---|
| 0.8850503 | 0.8702926 |
Helper function to compute the Root Mean Squared Error (RMSE) based on logarithms of prices. The log is useful because mistakes on the expensive houses will get the same penalty as cheaper ones.
getRMSE <- function(pred, label){
sqrt(mean((log(pred) - log(label))^2))
}
getRMSE(predict(linear_reg), trainReg.norm$SalePrice)
## [1] 0.1326157
Constant residuals normally distributed and centered at zero.
hist(linear_reg$residuals, breaks = 20, xlab = "Residuals", ylab="Frequency", main="Residuals histogram")
The mean and median are pretty close to each other.
mean(linear_reg$residuals)
## [1] -3.326782e-13
median(linear_reg$residuals)
## [1] 3.91509e-11
Constant variability (homocedasticity) Residuals seem to behave in a fan-shaped manner but there is a big group that seems to be center at zero.
linear_reg %>% ggplot(aes(x=.fitted,y=.resid)) + geom_point(alpha = 0.3) +
geom_hline(yintercept = 0, linetype="dashed") +
xlab("Fitted values") +
ylab("Residuals")
Colinearity among predictors was already checked (with correlation matrix) so it’s time to make predictions.
preds.linear_reg <- predict(linear_reg, testReg.norm)
First submission to Kaggle.
submit(preds.linear_reg, 1)
library(rpart)
library(rpart.plot)
houseTree <- rpart(SalePrice ~ ., data = train_set)
Regression tree - RMSE = 0.2172899.
getRMSE(predict(houseTree), train_set$SalePrice)
## [1] 0.2172899
Perform 10-fold cross validation to choose complexity parameter (cp) with 90% of train set.
numFolds <- trainControl(method="cv", number = 10, p=0.9)
cps <- expand.grid(.cp=seq(0.001, 0.2, 0.01))
houseTree2 <- train(SalePrice ~ ., method='rpart', data = train_set,
trControl=numFolds, tuneGrid=cps)
houseTree2$bestTune
## cp
## 1 0.001
Plotting RMSE vs CP
plot(houseTree2)
Fit tree with tuned parameter.
tree <- rpart(SalePrice ~ ., data = train_set, cp=houseTree2$bestTune)
rpart.plot(tree)
RMSE decreased to 0.157749 with tuned parameter.
getRMSE(predict(tree), train_set$SalePrice)
## [1] 0.157749
Make predictions
preds.tree <- predict(tree, newdata = test_set)
Submit
submit(preds.tree, 2)
library(randomForest)
set.seed(566)
rf <- randomForest(SalePrice ~ ., data=train_set, importance=TRUE,
nodesize=25, ntree=200)
Check importance of features according to randomForest (the higher the better).
rf$importance
## %IncMSE IncNodePurity
## MSSubClass 20453897.0 1.448571e+10
## MSZoning 10743712.9 1.008843e+10
## LotFrontage 11844714.9 3.154942e+10
## LotArea 89667406.7 1.247520e+11
## LotShape 3087501.2 7.611888e+09
## LotConfig 1998832.9 4.294643e+09
## Neighborhood 757030032.8 1.028312e+12
## Condition1 -3226426.6 6.171041e+09
## BldgType 5045697.8 5.143335e+09
## HouseStyle 22171431.9 1.730239e+10
## OverallQual 1285362192.1 2.094848e+12
## OverallCond 27291982.2 2.161470e+10
## YearBuilt 121437458.5 1.980947e+11
## YearRemodAdd 57652022.5 4.100587e+10
## RoofStyle 1172588.6 4.103026e+09
## Exterior1st 58084628.6 4.671893e+10
## Exterior2nd 50984870.0 5.499522e+10
## MasVnrType 3755760.0 5.735615e+09
## MasVnrArea 16832129.9 3.380504e+10
## ExterQual 316123417.9 7.836361e+11
## ExterCond 4643132.2 4.874868e+09
## Foundation 9440915.5 6.019268e+09
## BsmtQual 54445071.5 1.312335e+11
## BsmtExposure 10430152.2 1.524471e+10
## BsmtFinType1 19194202.1 1.899117e+10
## BsmtFinSF1 117197904.9 1.493709e+11
## BsmtUnfSF 17618454.4 2.323850e+10
## TotalBsmtSF 248765913.4 3.748821e+11
## HeatingQC 4122542.0 6.206179e+09
## CentralAir 8167305.6 1.449110e+10
## Electrical 561467.8 1.707556e+09
## X1stFlrSF 260830170.4 3.289810e+11
## X2ndFlrSF 131306891.1 2.376710e+11
## GrLivArea 845220274.9 9.034613e+11
## BsmtFullBath 8936817.9 8.458578e+09
## BsmtHalfBath 2658026.7 1.813562e+09
## FullBath 57993105.4 1.493176e+11
## HalfBath 10841026.8 8.217957e+09
## BedroomAbvGr 13446504.8 1.454909e+10
## KitchenQual 206663397.6 3.256485e+11
## TotRmsAbvGrd 33511840.2 1.129447e+11
## Fireplaces 39671400.3 5.416565e+10
## GarageType 27861044.6 2.265755e+10
## GarageYrBlt 57232467.8 2.646738e+10
## GarageFinish 13810839.2 1.135351e+10
## GarageCars 234541720.0 6.641011e+11
## GarageArea 139608623.4 3.127428e+11
## PavedDrive 1339956.3 3.259677e+09
## WoodDeckSF 5561618.6 1.873462e+10
## OpenPorchSF 11257359.5 1.944500e+10
## MoSold 1710909.9 1.299944e+10
## YrSold -980489.9 3.622100e+09
## SaleType 2990606.5 6.597732e+09
## SaleCondition 8617902.1 1.574464e+10
Random forest - RMSE = 0.1431867.
getRMSE(predict(rf), train_set$SalePrice)
## [1] 0.1431867
Do this to fix error: “numbers of columns of arguments do not match”. This is because factor variables in training set and test set have different levels.
test_set <- rbind(train_set[1, -55] , test_set)
test_set <- test_set[-1,]
# Predict on test set and submit
preds.rf <- predict(rf, test_set)
submit(preds.rf, 3)
library(e1071)
svrReg <- svm(SalePrice ~ ., data=train_set, scale=TRUE,
type='eps-regression')
SVR - RMSE = 0.1043786.
getRMSE(predict(svrReg), train_set$SalePrice)
## [1] 0.1043786
Predict and submit
preds.svrReg <- predict(svrReg, test_set)
submit(preds.svrReg, 4)
Remove outcome from train set and normalize because variables with greater scales may have greater influence when computing distances.
train_set_clust <- train_set[, -55]
preproc2 <- preProcess(train_set_clust)
train_set_clust <- predict(preproc2, train_set_clust)
test_set_clust <- predict(preproc2, test_set)
Get numeric columns
train_set_clust.num <- train_set_clust[, which(sapply(train_set_clust, class) != 'factor')]
test_set_clust.num <- test_set_clust[, which(sapply(test_set_clust, class) != 'factor')]
Calculate euclidean distance between each point
distances <- dist(train_set_clust.num, method = "euclidean")
The ward.D method minimizes the variance inside each cluster/group.
houseHclust <- hclust(distances, method='ward.D')
Looking at the dendogram it seems like 4 clusters would be a good choice.
plot(houseHclust, labels = FALSE)
abline(h = 240, col="bisque2")
Pass numeric cols to as.kcca in order to make clusters for train and test sets.
library(flexclust)
hclustKcca <- as.kcca(houseHclust, data=train_set_clust.num, k=4)
clusterTrain <- predict(hclustKcca)
clusterTest <- predict(hclustKcca, newdata=test_set_clust.num)
Get the houses assigned to clusters 1, 2, 3 and 4 in training set.
houseTrain1 <- subset(train_set, clusterTrain == 1)
houseTrain2 <- subset(train_set, clusterTrain == 2)
houseTrain3 <- subset(train_set, clusterTrain == 3)
houseTrain4 <- subset(train_set, clusterTrain == 4)
Same for test set
houseTest1 <- subset(test_set, clusterTest == 1)
houseTest2 <- subset(test_set, clusterTest == 2)
houseTest3 <- subset(test_set, clusterTest == 3)
houseTest4 <- subset(test_set, clusterTest == 4)
Fit SVR for each cluster.
svm1 <- svm(SalePrice ~ ., data=houseTrain1, type='eps-regression')
svm2 <- svm(SalePrice ~ ., data=houseTrain2, type='eps-regression')
svm3 <- svm(SalePrice ~ ., data=houseTrain3, type='eps-regression')
svm4 <- svm(SalePrice ~ ., data=houseTrain4, type='eps-regression')
Compute predictions
predictTest1 <- predict(svm1, houseTest1)
predictTest2 <- predict(svm2, houseTest2)
predictTest3 <- predict(svm3, houseTest3)
predictTest4 <- predict(svm4, houseTest4)
Join the four predictions
AllPredictions <- c(predictTest1, predictTest2, predictTest3, predictTest4)
Fill predictions in the correct order from original test file
final_predictions <- rep(0, nrow(test))
for (row in as.numeric(names(AllPredictions))){
name <- as.character(row)
final_predictions[row] <- AllPredictions[name]
}
final_predictions <- final_predictions[2:1460]
Submit
submit(final_predictions, 5)