This notebook describes my approach to building a machine learning model to predict house prices in the Ames, Iowa housing dataset (a Kaggle competition). This is my first ever Kaggle competition and my first foray into practical data science.
Firstly I recode some of the raw data and impute missing values. Secondly I try various machine learning algorithms, including: random forests, linear regression, stochastic gradient boosting machines, extreme gradient boosting, regularised linear regression (glmnet), support vector machines and neural networks.
Finally I assess each model's performance, using cross-validation to make fair comparisons between models. The final model is simply the best-performing one, i.e. the model with the lowest root mean square error (RMSE).
This notebook assumes the reader has basic knowledge of the packages and algorithms used, as well as of concepts such as cross-validation. For those unfamiliar with any of these, plenty of documentation can be found online.
I predominantly used the caret (machine learning) and dplyr (data manipulation) packages in this project. The caret package provides versatility through its consistent interface to many individual machine learning packages (e.g. ranger (random forest), glmnet (regularised linear regression), nnet (neural network)).
library(caret, quietly = TRUE)
library(dplyr, quietly = TRUE)
Let’s first read the training and test sets supplied by the Kaggle competition into our R session.
setwd("~/IMPORTANT FILES - TO BE BACKED UP/Machine Learning/House Prices/House Prices Project/Raw data")
train = read.csv("train.csv", stringsAsFactors = FALSE)
test = read.csv("test.csv", stringsAsFactors = FALSE)
There are two issues with the raw data. Firstly, like most real-world data sets, it contains missing values. Secondly, the raw data contain “NA” entries that mean something different from what R normally takes “NA” to mean. For example an entry of “NA” in the FireplaceQu field means that the house has no fireplace, rather than that the data for that field is missing (which is how R normally interprets “NA”).
Let's address the latter issue first by replacing those “NA” entries in both the training and test sets with more meaningful values (e.g. “No fireplace”, “No basement”, “No alley access”).
#some "NA" entries in the training and test sets actually mean "no basement"
# etc. rather than representing a missing value. These values are replaced
# accordingly in both data sets
na_meanings = c(Alley = "No alley access",
                BsmtQual = "No basement", BsmtCond = "No basement",
                BsmtExposure = "No basement", BsmtFinType1 = "No basement",
                BsmtFinType2 = "No basement", FireplaceQu = "No fireplace",
                GarageType = "No garage", GarageFinish = "No garage",
                GarageQual = "No garage", GarageCond = "No garage",
                PoolQC = "No pool", Fence = "No fence", MiscFeature = "None")
for(col in names(na_meanings))
{
  train[[col]][is.na(train[[col]])] = na_meanings[[col]]
  test[[col]][is.na(test[[col]])] = na_meanings[[col]]
}
Now let's deal with the genuinely missing values. Some of these can be replaced by the most frequently occurring value in the given field. For example the most frequently occurring value in the Electrical field is “SBrkr” (standard circuit breaker), so that is used to replace the missing values in that particular field.
Note that some of the missing values only occur in the test set (e.g. in the Utilities field).
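To check which value occurs most frequently in a field, a one-liner like the following works (shown here for Electrical):
#tabulate a field and show its most common value
sort(table(train$Electrical), decreasing = TRUE)[1]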
#replace missing categorical values with the most frequently occurring value
#in each field. MasVnrType is missing in both sets, Electrical only in the
#training set and the rest only in the test set, so applying the fill to
#both sets is harmless
mode_fill = c(MasVnrType = "None", Electrical = "SBrkr", MSZoning = "RL",
              Utilities = "AllPub", Exterior1st = "VinylSd",
              Exterior2nd = "VinylSd", KitchenQual = "TA",
              Functional = "Min2", SaleType = "WD")
for(col in names(mode_fill))
{
  train[[col]][is.na(train[[col]])] = mode_fill[[col]]
  test[[col]][is.na(test[[col]])] = mode_fill[[col]]
}
“NA” entries in numeric fields relating to the basement can simply be replaced with zero. For example an “NA” entry in BsmtFinSF1 (square footage of type 1 finished basement area) simply means there is no basement to begin with, and hence no such square footage to speak of, i.e. zero square feet.
#replace missing values in the numeric basement fields in the test set with 0
#(these houses have no basement, so the correct value is zero)
bsmt_numeric = c("BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF",
                 "TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath")
for(col in bsmt_numeric)
{
  test[[col]][is.na(test[[col]])] = 0
}
The same can be done for numeric fields relating to the garage.
#replace missing value in GarageCars column in test set with 0 (no garage)
test$GarageCars[is.na(test$GarageCars)] = 0
#replace missing value in GarageArea column in test set with 0 (no garage)
test$GarageArea[is.na(test$GarageArea)] = 0
A method commonly used for replacing missing values in data sets is median imputation. This is used for the LotFrontage field (number of feet of street connected to the property).
#replace missing values in LotFrontage column in training and test sets with median
train$LotFrontage[is.na(train$LotFrontage)] = median(train$LotFrontage, na.rm = TRUE)
test$LotFrontage[is.na(test$LotFrontage)] = median(test$LotFrontage, na.rm = TRUE)
Unfortunately some of the missing values cannot be reasonably estimated (e.g. the year in which a house's garage was built). Such missing values are replaced with a numeric but nonsensical value such as -9999. Numeric values are easier for machine learning algorithms to handle, while the nonsensical value itself preserves the information that the data was missing.
#replace missing values in GarageYrBlt column in training and test sets with -9999
train$GarageYrBlt[is.na(train$GarageYrBlt)] = -9999
test$GarageYrBlt[is.na(test$GarageYrBlt)] = -9999
#replace missing values in MasVnrArea column in training and test sets with -9999
train$MasVnrArea[is.na(train$MasVnrArea)] = -9999
test$MasVnrArea[is.na(test$MasVnrArea)] = -9999
Because I am trying to predict a continuous numeric variable (SalePrice), regression-based algorithms are the most appropriate to try.
For the sake of reproducibility (cross-validation randomly partitions the training set into the folds used for performance testing) I set the seed to 42 (chosen arbitrarily).
set.seed(42)
The first algorithm I tried was a simple random forest - generating multiple decision trees and taking the mean prediction as the final output. This basic model simply takes all the variables in the training set apart from SalePrice as explanatory variables. It also tries only a single mtry value, i.e. caret evaluates just one setting for the number of variables randomly sampled as split candidates in each tree (16 here, as the output below shows), rather than tuning over several.
Note that I use five-fold cross-validation throughout this project.
myControl = trainControl(method = "cv", number = 5, verboseIter = FALSE)
model_rf = train(SalePrice ~ .,
data = train,
tuneLength = 1,
method = "ranger",
importance = 'impurity',
trControl = myControl)
model_rf
Random Forest
1460 samples
80 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1169, 1167, 1168, 1168, 1168
Resampling results:
RMSE Rsquared
30796.79 0.8612762
Tuning parameter 'mtry' was held constant at a value of 16
The second algorithm I tried was linear regression - again using all but the SalePrice variable as explanatory variables.
model_lm = train(SalePrice ~ .,
data = train,
method = "lm",
trControl = myControl)
Warning: prediction from a rank-deficient fit may be misleading
model_lm
Linear Regression
1460 samples
80 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1168, 1167, 1167, 1169, 1169
Resampling results:
RMSE Rsquared
51749.52 0.6785317
Tuning parameter 'intercept' was held constant at a value of TRUE
I then compared the performance of the two models, using root mean square error (RMSE) as my performance measure. The resamples function collects the cross-validation results of both models so that their fold-by-fold performance can be compared like-for-like. Box plots are generated for ease of comparison.
model_list <- list(lm = model_lm, rf = model_rf)
resamples = resamples(model_list)
summary(resamples)
Call:
summary.resamples(object = resamples)
Models: lm, rf
Number of resamples: 5
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lm 24560 43380 54410 51750 57870 78540 0
rf 25860 27890 30780 30800 33340 36110 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lm 0.4682 0.6177 0.6404 0.6785 0.7754 0.8909 0
rf 0.8304 0.8374 0.8754 0.8613 0.8814 0.8819 0
bwplot(resamples, metric = "RMSE")
rm(resamples, model_list)
As the box plot illustrates the random forest model performs much better than the linear regression model.
At this point I made predictions on the SalePrice variable for the test set using the random forest model. This was then submitted to Kaggle.
prediction_rf = predict(model_rf, test)
#make a submission file
submit <- data.frame(Id = test$Id, SalePrice = prediction_rf)
setwd("~/IMPORTANT FILES - TO BE BACKED UP/Machine Learning/House Prices/House Prices Project/Submissions")
write.csv(submit, file = "submission1.csv", row.names = FALSE)
rm(submit)
One way to improve the performance of random forest models is to experiment with their tuning parameters. In this instance I decided to experiment with the mtry parameter - using two mtry values instead of one. Everything else was kept the same.
model_rf2 = train(SalePrice ~ .,
data = train,
tuneLength = 2,
method = "ranger",
importance = 'impurity',
trControl = myControl)
model_rf2
Random Forest
1460 samples
80 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1167, 1168, 1168, 1168, 1169
Resampling results across tuning parameters:
mtry RMSE Rsquared
2 46987.57 0.7901497
260 29129.63 0.8685570
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 260.
Again I compared the model with the more basic random forest model produced earlier.
model_list <- list(rf = model_rf, rf2 = model_rf2)
resamples = resamples(model_list)
summary(resamples)
Call:
summary.resamples(object = resamples)
Models: rf, rf2
Number of resamples: 5
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rf 25860 27890 30780 30800 33340 36110 0
rf2 22030 22420 30350 29130 31420 39440 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rf 0.8304 0.8374 0.8754 0.8613 0.8814 0.8819 0
rf2 0.8218 0.8236 0.8689 0.8686 0.9142 0.9144 0
bwplot(resamples, metric = "RMSE")
rm(resamples, model_list)
As the box plot illustrates the second random forest model does indeed perform slightly better than the first. Once again I made predictions on the test set and made a submission to Kaggle.
prediction_rf2 = predict(model_rf2, test)
#make a submission file
submit <- data.frame(Id = test$Id, SalePrice = prediction_rf2)
setwd("~/IMPORTANT FILES - TO BE BACKED UP/Machine Learning/House Prices/House Prices Project/Submissions")
write.csv(submit, file = "submission2.csv", row.names = FALSE)
rm(submit)
Another way to improve a model is to use only the most important explanatory variables, i.e. those that provide the most information gain. Printing the output of caret's varImp function shows the 20 most important variables. For comparison I also ran a Boruta feature importance analysis, but it did not yield a better-performing model, so I omit it from this notebook.
varImp(model_rf)
ranger variable importance
only 20 most important variables shown (out of 260)
Overall
OverallQual 100.00
GrLivArea 88.01
X1stFlrSF 62.07
GarageArea 61.74
TotalBsmtSF 57.78
YearBuilt 54.96
GarageCars 51.34
ExterQualTA 41.51
GarageYrBlt 35.59
BsmtFinSF1 35.51
FullBath 34.34
X2ndFlrSF 33.32
LotArea 28.98
YearRemodAdd 28.25
TotRmsAbvGrd 24.47
MasVnrArea 23.92
KitchenQualTA 23.56
Fireplaces 22.28
FireplaceQuNo fireplace 21.55
ExterQualGd 18.48
I then built a random forest model that only used these particular variables as explanatory variables. I returned to using one mtry value.
Top20Variables = c("OverallQual", "GrLivArea", "TotalBsmtSF", "GarageArea", "GarageCars",
"X1stFlrSF", "YearBuilt", "ExterQual", "BsmtFinSF1", "FullBath",
"KitchenQual", "LotArea", "Fireplaces",
"FireplaceQu", "YearRemodAdd", "GarageYrBlt", "X2ndFlrSF",
"TotRmsAbvGrd", "MasVnrArea", "LotFrontage")
train_Top20Var = select(train, one_of(Top20Variables, "SalePrice"))
model_rf_Top20 = train(SalePrice ~ .,
data = train_Top20Var,
tuneLength = 1,
method = "ranger",
importance = 'impurity',
trControl = myControl)
I then compared the performance of this model with model 3.
model_list = list(rf2 = model_rf2, rf_Top20 = model_rf_Top20)
resamples = resamples(model_list)
summary(resamples)
Call:
summary.resamples(object = resamples)
Models: rf2, rf_Top20
Number of resamples: 5
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rf2 22030 22420 30350 29130 31420 39440 0
rf_Top20 24310 25170 29090 28520 29530 34490 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rf2 0.8218 0.8236 0.8689 0.8686 0.9142 0.9144 0
rf_Top20 0.8313 0.8393 0.8938 0.8739 0.9020 0.9030 0
bwplot(resamples, metric = "RMSE")
rm(resamples, model_list)
As the box plot shows the latest model performs better than the model using all possible explanatory variables and two mtry values. Would using two mtry values further improve performance? Let’s try it out.
model_rf_Top20_2mtry = train(SalePrice ~ .,
data = train_Top20Var,
tuneLength = 2,
method = "ranger",
importance = 'impurity',
trControl = myControl)
model_list = list(rf2 = model_rf2, rf_Top20 = model_rf_Top20, rf_Top20_2mtry = model_rf_Top20_2mtry)
resamples = resamples(model_list)
summary(resamples)
Call:
summary.resamples(object = resamples)
Models: rf2, rf_Top20, rf_Top20_2mtry
Number of resamples: 5
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rf2 22030 22420 30350 29130 31420 39440 0
rf_Top20 24310 25170 29090 28520 29530 34490 0
rf_Top20_2mtry 26720 26760 29360 30020 32740 34520 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rf2 0.8218 0.8236 0.8689 0.8686 0.9142 0.9144 0
rf_Top20 0.8313 0.8393 0.8938 0.8739 0.9020 0.9030 0
rf_Top20_2mtry 0.8119 0.8366 0.8603 0.8581 0.8904 0.8914 0
bwplot(resamples, metric = "RMSE")
rm(resamples, model_list)
Model 5 performs better than model 3, but worse than model 4 - currently the best-performing model.
Further increasing the number of mtry values tried, and increasing the number of cross-validation folds, did not yield better model performance, so I omit those experiments for the sake of brevity.
Perhaps it's now time to explore alternative algorithms.
A more refined tree-ensemble approach is the stochastic gradient boosting machine. Rather than averaging independently grown trees as a random forest does, boosting grows trees sequentially, with each new tree correcting the errors of the ensemble so far, and uses regularisation (shrinkage) to avoid overfitting the individual trees to the training set. A more detailed explanation can be found here - http://xgboost.readthedocs.io/en/latest/model.html.
This algorithm should, in theory, generate a model that performs better on unseen data than a basic random forest.
To begin with I use all available explanatory variables and set tuneLength = 2 - note that for gbm this tunes the boosting parameters (n.trees and interaction.depth) rather than mtry, which is a random forest parameter.
model_gbm = train(SalePrice ~ .,
                  data = train,
                  tuneLength = 2,
                  method = "gbm",
                  trControl = myControl,
                  verbose = FALSE) #suppress gbm's verbose iteration log
model_gbm
Stochastic Gradient Boosting
1460 samples
80 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1169, 1168, 1168, 1169, 1166
Resampling results across tuning parameters:
interaction.depth n.trees RMSE Rsquared
1 50 35345.88 0.8209247
1 100 31854.01 0.8447295
2 50 32357.34 0.8423511
2 100 30936.63 0.8549869
Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were n.trees = 100, interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
This model did not perform better than model 4. Perhaps adjusting the model's tuning parameters will improve performance. Let's give it a try.
This time I supply a custom tuning grid, which overrides tuneLength. Arriving at the best combination of parameter values is a case of trial and error; I experimented with many combinations to find one that improved performance, and for the sake of brevity I simply present my final tuning grid here. Once again I use all available explanatory variables.
gbmTuningGrid = expand.grid(interaction.depth = 4,
n.trees = c(50, 100, 150, 200),
shrinkage = 0.3,
n.minobsinnode = 20)
model_gbm2 = train(SalePrice ~ .,
                   data = train,
                   tuneLength = 3,
                   method = "gbm",
                   trControl = myControl,
                   tuneGrid = gbmTuningGrid,
                   verbose = FALSE) #suppress gbm's verbose iteration log
model_gbm2
Stochastic Gradient Boosting
1460 samples
80 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1167, 1168, 1169, 1168, 1168
Resampling results across tuning parameters:
n.trees RMSE Rsquared
50 29083.37 0.8683278
100 28775.27 0.8720094
150 28597.02 0.8735638
200 28705.88 0.8728585
Tuning parameter 'interaction.depth' was held constant at a value of 4
Tuning parameter 'shrinkage' was held constant at a value of 0.3
Tuning parameter 'n.minobsinnode' was held constant at a value of 20
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were n.trees = 150, interaction.depth = 4, shrinkage = 0.3 and n.minobsinnode = 20.
I compare the performance against previous models.
model_list = list(rf2 = model_rf2, rf_Top20_2mtry = model_rf_Top20_2mtry, gbm2 = model_gbm2)
resamples = resamples(model_list)
summary(resamples)
Call:
summary.resamples(object = resamples)
Models: rf2, rf_Top20_2mtry, gbm2
Number of resamples: 5
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rf2 22030 22420 30350 29130 31420 39440 0
rf_Top20_2mtry 26720 26760 29360 30020 32740 34520 0
gbm2 22490 24530 25600 28600 33420 36940 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rf2 0.8218 0.8236 0.8689 0.8686 0.9142 0.9144 0
rf_Top20_2mtry 0.8119 0.8366 0.8603 0.8581 0.8904 0.8914 0
gbm2 0.8451 0.8455 0.8868 0.8736 0.8915 0.8989 0
bwplot(resamples, metric = "RMSE")
rm(resamples, model_list)
This model performs better than model 4. Are there other algorithms that could possibly yield higher performance? Let's try extreme gradient boosting.
The extreme gradient boosting algorithm works in much the same way as the stochastic gradient boosting machine, but guards more aggressively against overfitting.
It's far easier to use extreme gradient boosting through the caret package than through the xgboost package directly, because the latter requires the training and test sets to be converted into numeric (sparse) matrices via one-hot encoding.
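For illustration only, here is a minimal sketch of what the direct xgboost route would involve (this is not code I ran; it assumes the xgboost package and the gblinear booster that underlies caret's "xgbLinear" method):
library(xgboost)
#one-hot encode the categorical predictors into a numeric model matrix
train_matrix = model.matrix(SalePrice ~ . - 1, data = train)
dtrain = xgb.DMatrix(data = train_matrix, label = train$SalePrice)
#train a linear booster; the parameters mirror the final values caret selects below
bst = xgb.train(params = list(booster = "gblinear", lambda = 0.4,
                              alpha = 0.2, eta = 0.3),
                data = dtrain, nrounds = 100)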
Given that a custom tuning grid improved the performance of model 7, I make use of one here too (as before, the grid overrides tuneLength). For the sake of brevity I have again omitted my experimentation with the tuning parameters, and once again I use all available explanatory variables.
xgbTuningGrid = expand.grid(nrounds = c(50, 100),
lambda = seq(0.1, 0.5, 0.1),
alpha = seq(0.1, 0.5, 0.1),
eta = c(0.3, 0.4))
model_xgb4 = train(SalePrice ~ .,
data = train,
tuneLength = 3,
method = "xgbLinear",
trControl = myControl,
tuneGrid = xgbTuningGrid)
model_xgb4
eXtreme Gradient Boosting
1460 samples
80 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1167, 1168, 1167, 1169, 1169
Resampling results across tuning parameters:
nrounds lambda alpha eta RMSE Rsquared
50 0.1 0.1 0.3 30536.40 0.8528058
50 0.1 0.1 0.4 30536.40 0.8528058
50 0.1 0.2 0.3 30536.40 0.8528058
50 0.1 0.2 0.4 30536.40 0.8528058
50 0.1 0.3 0.3 30568.94 0.8525058
50 0.1 0.3 0.4 30568.94 0.8525058
50 0.1 0.4 0.3 30568.93 0.8525058
50 0.1 0.4 0.4 30568.93 0.8525058
50 0.1 0.5 0.3 30568.93 0.8525058
50 0.1 0.5 0.4 30568.93 0.8525058
50 0.2 0.1 0.3 30281.94 0.8554049
50 0.2 0.1 0.4 30281.94 0.8554049
50 0.2 0.2 0.3 30281.94 0.8554049
50 0.2 0.2 0.4 30281.94 0.8554049
50 0.2 0.3 0.3 30281.67 0.8553927
50 0.2 0.3 0.4 30281.67 0.8553927
50 0.2 0.4 0.3 30281.67 0.8553927
50 0.2 0.4 0.4 30281.67 0.8553927
50 0.2 0.5 0.3 30281.67 0.8553927
50 0.2 0.5 0.4 30281.67 0.8553927
50 0.3 0.1 0.3 29236.31 0.8660122
50 0.3 0.1 0.4 29236.31 0.8660122
50 0.3 0.2 0.3 29232.86 0.8660500
50 0.3 0.2 0.4 29232.86 0.8660500
50 0.3 0.3 0.3 29232.86 0.8660500
50 0.3 0.3 0.4 29232.86 0.8660500
50 0.3 0.4 0.3 29232.86 0.8660500
50 0.3 0.4 0.4 29232.86 0.8660500
50 0.3 0.5 0.3 29232.86 0.8660500
50 0.3 0.5 0.4 29232.86 0.8660500
50 0.4 0.1 0.3 28854.48 0.8687587
50 0.4 0.1 0.4 28854.48 0.8687587
50 0.4 0.2 0.3 28851.32 0.8687861
50 0.4 0.2 0.4 28851.32 0.8687861
50 0.4 0.3 0.3 28851.32 0.8687861
50 0.4 0.3 0.4 28851.32 0.8687861
50 0.4 0.4 0.3 28851.32 0.8687861
50 0.4 0.4 0.4 28851.32 0.8687861
50 0.4 0.5 0.3 28851.32 0.8687861
50 0.4 0.5 0.4 28851.32 0.8687861
50 0.5 0.1 0.3 30608.58 0.8520108
50 0.5 0.1 0.4 30608.58 0.8520108
50 0.5 0.2 0.3 30608.58 0.8520108
50 0.5 0.2 0.4 30608.58 0.8520108
50 0.5 0.3 0.3 30608.58 0.8520108
50 0.5 0.3 0.4 30608.58 0.8520108
50 0.5 0.4 0.3 30608.58 0.8520108
50 0.5 0.4 0.4 30608.58 0.8520108
50 0.5 0.5 0.3 30608.58 0.8520108
50 0.5 0.5 0.4 30608.58 0.8520108
100 0.1 0.1 0.3 30532.22 0.8529242
100 0.1 0.1 0.4 30532.22 0.8529242
100 0.1 0.2 0.3 30535.00 0.8529202
100 0.1 0.2 0.4 30535.00 0.8529202
100 0.1 0.3 0.3 30582.60 0.8524487
100 0.1 0.3 0.4 30582.60 0.8524487
100 0.1 0.4 0.3 30582.59 0.8524487
100 0.1 0.4 0.4 30582.59 0.8524487
100 0.1 0.5 0.3 30589.57 0.8523823
100 0.1 0.5 0.4 30589.57 0.8523823
100 0.2 0.1 0.3 30223.54 0.8560345
100 0.2 0.1 0.4 30223.54 0.8560345
100 0.2 0.2 0.3 30223.54 0.8560345
100 0.2 0.2 0.4 30223.54 0.8560345
100 0.2 0.3 0.3 30233.27 0.8559046
100 0.2 0.3 0.4 30233.27 0.8559046
100 0.2 0.4 0.3 30233.27 0.8559046
100 0.2 0.4 0.4 30233.27 0.8559046
100 0.2 0.5 0.3 30214.12 0.8560560
100 0.2 0.5 0.4 30214.12 0.8560560
100 0.3 0.1 0.3 29217.52 0.8662653
100 0.3 0.1 0.4 29217.52 0.8662653
100 0.3 0.2 0.3 29247.79 0.8659507
100 0.3 0.2 0.4 29247.79 0.8659507
100 0.3 0.3 0.3 29247.79 0.8659507
100 0.3 0.3 0.4 29247.79 0.8659507
100 0.3 0.4 0.3 29239.71 0.8660205
100 0.3 0.4 0.4 29239.71 0.8660205
100 0.3 0.5 0.3 29239.71 0.8660205
100 0.3 0.5 0.4 29239.71 0.8660205
100 0.4 0.1 0.3 28801.41 0.8692857
100 0.4 0.1 0.4 28801.41 0.8692857
100 0.4 0.2 0.3 28784.63 0.8694422
100 0.4 0.2 0.4 28784.63 0.8694422
100 0.4 0.3 0.3 28802.17 0.8693151
100 0.4 0.3 0.4 28802.17 0.8693151
100 0.4 0.4 0.3 28797.92 0.8693516
100 0.4 0.4 0.4 28797.92 0.8693516
100 0.4 0.5 0.3 28797.43 0.8693550
100 0.4 0.5 0.4 28797.43 0.8693550
100 0.5 0.1 0.3 30618.66 0.8519223
100 0.5 0.1 0.4 30618.66 0.8519223
100 0.5 0.2 0.3 30618.66 0.8519223
100 0.5 0.2 0.4 30618.66 0.8519223
100 0.5 0.3 0.3 30618.66 0.8519223
100 0.5 0.3 0.4 30618.66 0.8519223
100 0.5 0.4 0.3 30617.74 0.8519310
100 0.5 0.4 0.4 30617.74 0.8519310
100 0.5 0.5 0.3 30617.40 0.8519334
100 0.5 0.5 0.4 30617.40 0.8519334
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nrounds = 100, lambda = 0.4, alpha = 0.2 and eta = 0.3.
As per usual I compare model performance.
model_list = list(gbm2 = model_gbm2, xgb4 = model_xgb4)
resamples = resamples(model_list)
summary(resamples)
Call:
summary.resamples(object = resamples)
Models: gbm2, xgb4
Number of resamples: 5
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
gbm2 22490 24530 25600 28600 33420 36940 0
xgb4 23070 25830 28640 28780 31290 35100 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
gbm2 0.8451 0.8455 0.8868 0.8736 0.8915 0.8989 0
xgb4 0.8037 0.8602 0.8800 0.8694 0.8846 0.9188 0
bwplot(resamples, metric = "RMSE")
rm(resamples, model_list)
As the box plot illustrates this model actually performs worse than model 7.
Having reached a plateau in my model development, I read various kernels (notebooks written by other Kaggle competitors) for this competition to identify areas for further development. One kernel (https://www.kaggle.com/sidraina89/house-prices-advanced-regression-techniques/regularized-regression-housing-pricing) suggested that removing outliers from the training set would improve model performance. Therefore I went ahead and did this - first plotting the above-ground living area (GrLivArea) against the sale price.
ggplot(train ,aes(y = SalePrice, x = GrLivArea)) + geom_point()
The plot revealed four outliers in which the above-ground living area exceeded 4000 square feet. I therefore removed those records from the training set.
train_trim = filter(train, GrLivArea <= 4000)
I repeated this process for the rest of the 20 most predictive explanatory variables (as identified in model 4), such as OverallQual, KitchenQual and YearBuilt. No further outliers were identified.
I then built a linear regression model using this trimmed training set and assessed its performance in the usual way.
model_lm2 = train(SalePrice ~ .,
data = train_trim,
method = "lm",
trControl = myControl)
Warning: prediction from a rank-deficient fit may be misleading
model_list = list(gbm2 = model_gbm2,lm2 = model_lm2)
resamples = resamples(model_list)
summary(resamples)
Call:
summary.resamples(object = resamples)
Models: gbm2, lm2
Number of resamples: 5
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
gbm2 22490 24530 25600 28600 33420 36940 0
lm2 23130 23950 24640 26000 26640 31660 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
gbm2 0.8451 0.8455 0.8868 0.8736 0.8915 0.8989 0
lm2 0.8208 0.8916 0.8949 0.8868 0.9063 0.9206 0
bwplot(resamples, metric = "RMSE")
rm(resamples, model_list)
As the box plot illustrates, this model performed better than model 7.
Given that it was the best-performing model on the untrimmed training set it seemed natural for me to try the extreme gradient boosting algorithm on the trimmed training set.
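The exact fitting call isn't reproduced in my notes; presumably it mirrored model_xgb4 with the trimmed training set swapped in, along these lines:
model_xgb5 = train(SalePrice ~ .,
                   data = train_trim,
                   method = "xgbLinear",
                   trControl = myControl,
                   tuneGrid = xgbTuningGrid)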
model_list = list(gbm2 = model_gbm2, lm2 = model_lm2, xgb5 = model_xgb5)
resamples = resamples(model_list)
summary(resamples)
Call:
summary.resamples(object = resamples)
Models: gbm2, lm2, xgb5
Number of resamples: 5
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
gbm2 22490 24530 25600 28600 33420 36940 0
lm2 23130 23950 24640 26000 26640 31660 0
xgb5 24310 25590 27140 26340 27270 27390 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
gbm2 0.8451 0.8455 0.8868 0.8736 0.8915 0.8989 0
lm2 0.8208 0.8916 0.8949 0.8868 0.9063 0.9206 0
xgb5 0.8631 0.8759 0.8790 0.8825 0.8940 0.9007 0
bwplot(resamples, metric = "RMSE")
rm(resamples, model_list)
To my surprise it performed worse (on median RMSE, the midline of the box plots) not only than model 7 but also than model 9 (a simple linear model). Further development of the data was required.
The same kernel I read earlier underlined the importance of understanding the assumptions underlying any machine learning algorithm being used. For example linear regression assumes, among other things, that the residuals are normally distributed - an assumption that heavily skewed variables tend to violate. I used the following code to identify any numeric variables with a skewness greater than 0.75 (the threshold given in the kernel).
library(e1071)
column_classes = sapply(names(train_trim), function(x){class(train_trim[[x]])})
numeric_columns = names(column_classes[column_classes == "integer"])
skew = sapply(numeric_columns, function(x){skewness(train_trim[[x]], na.rm = T)})
skew = skew[skew > 0.75]
As per the kernel I applied a log(1 + x) transformation to these highly skewed variables - including SalePrice itself - in both the training and test sets.
train_trim_log = train_trim
for(x in names(skew))
{
train_trim_log[[x]] = log(train_trim_log[[x]] + 1)
}
# do the same for the test set
test_logtrnsfrm = test
for(x in names(skew))
{
if(x != "SalePrice")
{
test_logtrnsfrm[[x]] = log(test_logtrnsfrm[[x]] + 1)
}
}
I then built a linear model using this adjusted training set.
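The fitting call isn't reproduced in my notes; a minimal sketch, assuming the same set-up as the earlier linear models but pointed at the transformed data:
model_lm3 = train(SalePrice ~ .,
                  data = train_trim_log,
                  method = "lm",
                  trControl = myControl)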
model_list = list(gbm2 = model_gbm2, lm3 = model_lm3)
resamples = resamples(model_list)
summary(resamples)
Call:
summary.resamples(object = resamples)
Models: gbm2, lm3
Number of resamples: 5
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
gbm2 2.249e+04 2.453e+04 2.560e+04 2.860e+04 3.342e+04 3.694e+04 0
lm3 1.092e-01 1.240e-01 1.255e-01 1.532e-01 1.255e-01 2.820e-01 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
gbm2 0.8451 0.8455 0.8868 0.8736 0.8915 0.8989 0
lm3 0.6280 0.9002 0.9050 0.8533 0.9117 0.9218 0
bwplot(resamples, metric = "RMSE")
rm(resamples, model_list)
As the table shows, the new model's RMSE is dramatically smaller - but note that it is now measured on the log scale of SalePrice, so it cannot be compared directly with the dollar-scale RMSE of earlier models. Conveniently, the competition itself is scored on root mean square logarithmic error, so from here on the cross-validated RMSE mirrors the Kaggle metric. On R-squared, which is scale-free, the new model is competitive with the previously best-performing model (the gradient boosting machine on the untrimmed, untransformed training set).
I tried the extreme gradient boosting algorithm once again.
model_xgb6 = train(SalePrice ~ .,
data = train_trim_log,
tuneLength = 3,
method = "xgbLinear",
trControl = myControl,
tuneGrid = xgbTuningGrid,
verbose = FALSE)
model_list = list(lm3 = model_lm3, xgb6 = model_xgb6)
resamples = resamples(model_list)
summary(resamples)
Call:
summary.resamples(object = resamples)
Models: lm3, xgb6
Number of resamples: 5
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lm3 0.1092 0.1240 0.1255 0.1532 0.1255 0.2820 0
xgb6 0.1213 0.1314 0.1372 0.1340 0.1397 0.1404 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lm3 0.6280 0.9002 0.9050 0.8533 0.9117 0.9218 0
xgb6 0.8758 0.8789 0.8816 0.8858 0.8959 0.8969 0
bwplot(resamples, metric = "RMSE")
rm(resamples, model_list)
This model performs worse than the simple linear model. Let us try regularised linear regression, as per the aforementioned Kaggle kernel.
glmnet blends lasso regression (an L1 penalty on the sum of the coefficients' absolute values, which shrinks some coefficients to exactly zero) with ridge regression (an L2 penalty on the sum of their squares) to prevent overfitting when building a generalised linear model. In caret's parameterisation alpha mixes the two penalties (0 = pure ridge, 1 = pure lasso) and lambda controls their overall strength - hence the two-dimensional tuning grid below.
glmnetTuningGrid = expand.grid(alpha = seq(0, 1, 0.2),
lambda = seq(0, 1, 0.2))
model_glmnet1 = train(SalePrice ~ .,
data = train_trim_log,
method = "glmnet",
trControl = myControl,
tuneGrid = glmnetTuningGrid)
There were missing values in resampled performance measures.
model_list = list(lm3 = model_lm3, xgb6 = model_xgb6, glmnet1 = model_glmnet1)
resamples = resamples(model_list)
summary(resamples)
Call:
summary.resamples(object = resamples)
Models: lm3, xgb6, glmnet1
Number of resamples: 5
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lm3 0.1092 0.1240 0.1255 0.1532 0.1255 0.2820 0
xgb6 0.1213 0.1314 0.1372 0.1340 0.1397 0.1404 0
glmnet1 0.1095 0.1102 0.1169 0.1192 0.1291 0.1301 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lm3 0.6280 0.9002 0.9050 0.8533 0.9117 0.9218 0
xgb6 0.8758 0.8789 0.8816 0.8858 0.8959 0.8969 0
glmnet1 0.8892 0.8992 0.9092 0.9092 0.9215 0.9268 0
bwplot(resamples, metric = "RMSE")
rm(resamples, model_list)
This model performs better than model 12 and is now my best-performing model. Next I see whether principal component analysis (PCA) of the training set can help improve performance further.
A Kaggle kernel by Nandy Singh (https://www.kaggle.com/nandys/house-prices-advanced-regression-techniques/how-to-decrease-0-1-rmse-score) suggested using PCA to engineer new features based on a combination of existing high-variance features. It can also be used to reduce the number of variables being used i.e. remove the ones with low/no variance and hence little/no information about the data relationships.
Firstly I identified variables in the training set that had zero or near-zero variance i.e. were broadly constant and hence had little predictive value.
nearZeroVar(train_trim, saveMetrics = T) #variables with zero or near-zero variance
I then removed these variables from the training set.
out = c("Id", "Street", "Alley", "LandContour", "Utilities", "LandSlope", "Condition2",
"RoofMatl", "BsmtCond", "BsmtFinType2", "BsmtFinSF2", "Heating", "LowQualFinSF",
"KitchenAbvGr", "Functional", "EnclosedPorch", "X3SsnPorch", "PoolArea",
"PoolQC", "MiscFeature", "MiscVal")
train_trim_log2 = train_trim_log[, !(names(train_trim_log) %in% out)]
Given that PCA only works on numeric variables I also removed any non-numeric variables from the training set.
train_trim_log2_numeric = train_trim_log2[, names(train_trim_log2) %in% numeric_columns]
I conducted PCA on this training set, keeping the first 15 principal components for further analysis.
pca = prcomp(train_trim_log2_numeric, scale = T, center = T)
eigenvalues = factoextra::get_eigenvalue(pca)
pcaVar = factoextra::get_pca_var(pca) #PCA variables
pcaVar = as.data.frame(c(pcaVar))[,1:15] #table of correlations between variables and top 15 principal components
To identify which features to combine I produced a table showing, for each of the top 15 principal components, the variables strongly (positively or negatively) correlated with it.
var = pcaVar[FALSE, ]
k = 1
for(i in colnames(pcaVar))
{
for(j in rownames(pcaVar))
{
if(abs(pcaVar[j , i]) >= 0.5) #variable j strongly +ve/-ve correlated with component i
{
var[k, i] = j
k = k + 1
}
}
k = 1
}
var = as.data.frame(var)
Based on the table produced above it made sense to combine the OverallQual and OverallCond features to create a TotalQual feature.
#Create new feature called TotQual - product of Overallqual and Overallcond
train_newfeatures = mutate(train_trim_log2, TotQual = (train_trim_log2$OverallCond)*(train_trim_log2$OverallQual))
A linear model was built and its performance compared with the previous two models.
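Again the exact call isn't reproduced here; a minimal sketch, assuming the usual linear-model set-up on the feature-engineered training set:
model_lm4 = train(SalePrice ~ .,
                  data = train_newfeatures,
                  method = "lm",
                  trControl = myControl)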
model_list = list(glmnet1 = model_glmnet1, lm4 = model_lm4)
resamples = resamples(model_list)
summary(resamples)
Call:
summary.resamples(object = resamples)
Models: glmnet1, lm4
Number of resamples: 5
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
glmnet1 0.1095 0.1102 0.1169 0.1192 0.1291 0.1301 0
lm4 0.1087 0.1103 0.1141 0.1182 0.1267 0.1312 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
glmnet1 0.8892 0.8992 0.9092 0.9092 0.9215 0.9268 0
lm4 0.8885 0.9035 0.9187 0.9118 0.9207 0.9277 0
bwplot(resamples, metric = "RMSE")
rm(resamples, model_list)
This model performs better than model 13 and is now my best-performing model.
Having heard the hype surrounding support vector machines I thought I would try one in this project.
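The fitting call is omitted from my notes; a plausible sketch using caret's radial-kernel SVM from the kernlab package (the method choice is an assumption on my part):
model_svm2 = train(SalePrice ~ .,
                   data = train_newfeatures,
                   method = "svmRadial", #assumed; kernlab's radial-kernel SVM
                   trControl = myControl)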
model_list = list(lm4 = model_lm4, svm2 = model_svm2)
resamples = resamples(model_list)
summary(resamples)
Call:
summary.resamples(object = resamples)
Models: lm4, svm2
Number of resamples: 5
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lm4 0.1087 0.1103 0.1141 0.1182 0.1267 0.1312 0
svm2 12.6300 13.6500 15.0600 22.5300 16.1200 55.1800 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lm4 0.888500 0.903500 0.91870 0.91180 0.92070 0.9277 0
svm2 0.001119 0.005127 0.07594 0.06006 0.08187 0.1363 0
bwplot(resamples, metric = "RMSE")
rm(resamples, model_list)
Despite the hype it did not perform better than the linear model. Further parameter tuning and pre-processing would likely be required for the support vector machine to perform well - SVMs are particularly sensitive to the scaling of their input features. Because I was only trying this algorithm out of curiosity I did not experiment with it any further.
Again, given the hype surrounding it and my own curiosity, I tried building a neural network.
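The call is again omitted from my notes; a plausible sketch using caret's "nnet" method with its defaults left in place:
model_nnet1 = train(SalePrice ~ .,
                    data = train_newfeatures,
                    method = "nnet",
                    trace = FALSE, #suppress the optimisation log
                    trControl = myControl)
The near-constant RMSE of around 11 in the results below, against a log-scale target of around 12, suggests the network's predictions collapsed to a near-constant value; nnet is very sensitive to input scaling, so preProcess = c("center", "scale") (and ensuring a linear output unit via linout = TRUE) would be the first things to try.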
model_list = list(lm4 = model_lm4, nnet1 = model_nnet1)
resamples = resamples(model_list)
summary(resamples)
Call:
summary.resamples(object = resamples)
Models: lm4, nnet1
Number of resamples: 5
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lm4 0.1087 0.1103 0.1141 0.1182 0.1267 0.1312 0
nnet1 11.0100 11.0100 11.0300 11.0300 11.0400 11.0400 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lm4 0.8885 0.9035 0.9187 0.9118 0.9207 0.9277 0
nnet1 NA NA NA NaN NA NA 5
bwplot(resamples, metric = "RMSE")
rm(resamples, model_list)
Once again, despite the hype, it did not perform better than the linear model. As with the support vector machine, further parameter tuning and data processing would likely be required to maximise its performance (different machine learning algorithms have different pre-processing requirements). Because I was only trying this algorithm out of curiosity I did not experiment with it any further.
Based on the experimentation described above I conclude that model 14 - linear regression on a training set with outliers removed, highly skewed variables (including SalePrice) log-transformed, near-zero-variance variables dropped and a new feature (TotQual = OverallCond * OverallQual) added - is my best-performing model on the house price dataset. For the record, here is the final side-by-side comparison of model 14 with the runner-up, model 13.
model_list = list(lm4 = model_lm4, glmnet1 = model_glmnet1)
resamples = resamples(model_list)
summary(resamples)
Call:
summary.resamples(object = resamples)
Models: lm4, glmnet1
Number of resamples: 5
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lm4 0.1087 0.1103 0.1141 0.1182 0.1267 0.1312 0
glmnet1 0.1095 0.1102 0.1169 0.1192 0.1291 0.1301 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lm4 0.8885 0.9035 0.9187 0.9118 0.9207 0.9277 0
glmnet1 0.8892 0.8992 0.9092 0.9092 0.9215 0.9268 0
bwplot(resamples, metric = "RMSE")
rm(resamples, model_list)
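One practical detail worth recording: since the final models were trained on log(1 + SalePrice), their predictions must be back-transformed before submission. A minimal sketch (test_newfeatures is a hypothetical name for the transformed test set with the TotQual feature added):
#add the engineered feature to the transformed test set (hypothetical name)
test_newfeatures = mutate(test_logtrnsfrm, TotQual = OverallCond * OverallQual)
#predict on the log scale, then back-transform with exp(x) - 1
prediction_lm4 = expm1(predict(model_lm4, test_newfeatures))
submit = data.frame(Id = test$Id, SalePrice = prediction_lm4)
write.csv(submit, file = "final_submission.csv", row.names = FALSE)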
My best Kaggle score (as of Monday 20th February 2017) for this competition - my first ever - was 0.12581 (the root mean square logarithmic error (RMSLE) of my model's predictions on 50% of an unseen test set). The benchmark score for the competition was 0.40890, which even my first submission outperformed. My ranking was 1,753 out of 4,454 - placing me in the top half of competitors (many of whom hold postgraduate qualifications in data-science-related subjects and/or are experienced data scientists - neither of which I am). For a sense of perspective, my final submission improved my previous best score by only around 0.002 yet raised my leaderboard ranking by almost 1,000 places.
This competition has taught me three things. Firstly, a considerable amount of data pre-processing (e.g. handling missing values, outliers and skewness) is needed to maximise model performance. Secondly, it is vital to understand the assumptions and statistical theory underlying the models and data one uses. Finally, a simple model (e.g. linear regression) can sometimes outperform a more advanced one (e.g. extreme gradient boosting).