This notebook describes my approach to building a machine learning model to predict house prices in the Ames, Iowa housing dataset (a Kaggle competition). This is my first ever Kaggle competition and my first foray into practical data science.

Firstly I recodify some of the raw data and impute missing values. Secondly I try various machine learning algorithms, including random forests, linear regression, stochastic gradient boosting machines and extreme gradient boosting.

Finally I assess each model’s performance, using cross-validation to make fair comparisons between models. The final model is the best-performing one, i.e. the model with the lowest root mean square error (RMSE).
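
For reference, RMSE is the square root of the average squared difference between predicted and actual sale prices; a one-line helper makes this concrete:

rmse = function(actual, predicted) sqrt(mean((actual - predicted)^2))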

This notebook assumes the reader has basic knowledge of the packages and algorithms used, as well as concepts such as cross-validation. For those unfamiliar with any of these concepts, plenty of documentation can be found online.

Packages used

I predominantly used the caret (machine learning) and dplyr (data manipulation) packages in this project. The caret package provides versatility through its unified interface to various individual machine learning packages (e.g. ranger (random forest), glmnet (regularised linear regression), nnet (neural network)).

library(caret, quietly = TRUE)
library(dplyr, quietly = TRUE)

Cleaning the data

Let’s first read the training and test sets supplied by the Kaggle competition into our R session.

setwd("~/IMPORTANT FILES - TO BE BACKED UP/Machine Learning/House Prices/House Prices Project/Raw data")
train = read.csv("train.csv", stringsAsFactors = FALSE)
test = read.csv("test.csv", stringsAsFactors = FALSE)

There are two issues with the raw data. Firstly, like most real-world data sets, it contains missing values. Secondly, the raw data contain “NA” entries that have a different meaning to how R normally handles “NA”. For example an entry of “NA” in the “FireplaceQu” field means that the house does not have a fireplace, rather than that the data for that field is missing (which is how R normally interprets “NA” entries).

Let’s address the latter issue first by replacing the “NA” entries in both the training and the test sets with more meaningful entries (e.g. “No fireplace”, “No basement”, “No alley access”).

#some NA entries in the training and test sets actually mean "no basement"
#etc. rather than representing a missing value. These values are replaced
#accordingly in both data sets
na_meanings = c(Alley = "No alley access",
                BsmtQual = "No basement", BsmtCond = "No basement",
                BsmtExposure = "No basement", BsmtFinType1 = "No basement",
                BsmtFinType2 = "No basement", FireplaceQu = "No fireplace",
                GarageType = "No garage", GarageFinish = "No garage",
                GarageQual = "No garage", GarageCond = "No garage",
                PoolQC = "No pool", Fence = "No fence", MiscFeature = "None")
for (col in names(na_meanings)) {
  train[[col]][is.na(train[[col]])] = na_meanings[[col]]
  test[[col]][is.na(test[[col]])] = na_meanings[[col]]
}

Now let’s deal with the missing values. Some of these can be replaced by the most frequently occurring value in the given field. For example the most frequently occurring value for the Electrical field is “SBrkr” (standard circuit breaker), so that is used to replace the missing values in that particular field.

Note that some of the missing values only occur in the test set (e.g. for the ‘Utilities’ field).
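
The replacement values used below were found by inspecting frequency tables; a small helper along these lines (mode_of is my own hypothetical name, not from any package) reproduces them:

#most frequently occurring (modal) value of a vector; table() ignores NAs by default
mode_of = function(x) names(which.max(table(x)))
mode_of(train$Electrical) #returns "SBrkr"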

#replace missing values in MasVnrType column in training and test sets with
#"None", which is the most commonly occurring type
train$MasVnrType[is.na(train$MasVnrType)] = "None"
test$MasVnrType[is.na(test$MasVnrType)] = "None"
#replace missing values in Electrical column in training set with "SBrkr",
#which is the most frequently occurring electrical set-up
train$Electrical[is.na(train$Electrical)] = "SBrkr"
#replace missing values in MSZoning column in test set with RL (most frequently occurring)
test$MSZoning[is.na(test$MSZoning)] = "RL"
#replace missing values in Utilities column in test set with AllPub (most frequently occurring)
test$Utilities[is.na(test$Utilities)] = "AllPub"
#replace missing values in Exterior1st column in test set with VinylSd (most frequently occurring)
test$Exterior1st[is.na(test$Exterior1st)] = "VinylSd"
#replace missing values in Exterior2nd column in test set with VinylSd (most frequently occurring)
test$Exterior2nd[is.na(test$Exterior2nd)] = "VinylSd"
#replace missing value in KitchenQual column in test set with TA (most common)
test$KitchenQual[is.na(test$KitchenQual)] = "TA"
#replace missing values in Functional column in test set with Min2 (most common)
test$Functional[is.na(test$Functional)] = "Min2"
#replace missing value in SaleType column in test set with WD (most common)
test$SaleType[is.na(test$SaleType)] = "WD"

“NA” entries in numeric fields relating to the basement can simply be replaced with a zero. For example an “NA” entry in the BsmtFinSF1 field (square foot area of type 1 finished basement area) simply means there is no basement to begin with, and hence no such square footage to speak of, i.e. zero square feet.

#replace missing value in BsmtFinSF1 column in test set with 0 (has no basement)
test$BsmtFinSF1[is.na(test$BsmtFinSF1)] = 0
#replace missing value in BsmtFinSF2 column in test set with 0 (has no basement)
test$BsmtFinSF2[is.na(test$BsmtFinSF2)] = 0
#replace missing value in BsmtUnfSF column in test set with 0 (has no basement)
test$BsmtUnfSF[is.na(test$BsmtUnfSF)] = 0
#replace missing value in TotalBsmtSF column in test set with 0 (has no basement)
test$TotalBsmtSF[is.na(test$TotalBsmtSF)] = 0
#replace missing values in BsmtFullBath column in test set with 0 (has no basement)
test$BsmtFullBath[is.na(test$BsmtFullBath)] = 0
#replace missing values in BsmtHalfBath column in test set with 0 (has no basement)
test$BsmtHalfBath[is.na(test$BsmtHalfBath)] = 0

The same can be done for numeric fields relating to the garage.

#replace missing value in GarageCars column in test set with 0 (no garage)
test$GarageCars[is.na(test$GarageCars)] = 0
#replace missing value in GarageArea column in test set with 0 (no garage)
test$GarageArea[is.na(test$GarageArea)] = 0

A method commonly used for replacing missing values in data sets is median imputation. This is used for the LotFrontage field (number of feet of street connected to the property).

#replace missing values in LotFrontage column in training and test sets with median
train$LotFrontage[is.na(train$LotFrontage)] = median(train$LotFrontage, na.rm = TRUE)
test$LotFrontage[is.na(test$LotFrontage)] = median(test$LotFrontage, na.rm = TRUE)

Unfortunately some of the missing values cannot be reasonably estimated (e.g. the year in which the garage of a house was built). Such missing values are replaced by a numeric but nonsensical value such as -9999. Numeric values are easier for machine learning algorithms to handle, and the fact that the value itself is nonsensical preserves the information that the data is missing.

#replace missing values in GarageYrBlt column in training and test sets with -9999
train$GarageYrBlt[is.na(train$GarageYrBlt)] = -9999
test$GarageYrBlt[is.na(test$GarageYrBlt)] = -9999
#replace missing values in MasVnrArea column in training and test sets with -9999
train$MasVnrArea[is.na(train$MasVnrArea)] = -9999
test$MasVnrArea[is.na(test$MasVnrArea)] = -9999

Experimenting with machine learning algorithms

Because I am trying to predict a continuous numeric variable (SalePrice), regression-based algorithms are the most appropriate to try.

For the sake of reproducibility (particularly relating to cross-validation, which randomly selects records from the training set to use for model performance testing) I set the seed to 42 (chosen arbitrarily).

set.seed(42)

Model 1 - basic random forest

The first algorithm I tried was a simple random forest - generating multiple decision trees and taking the mean prediction as the final output. This basic model simply takes all the variables in the training set apart from SalePrice as explanatory variables. Setting tuneLength = 1 means that only one value of the mtry parameter (the number of variables randomly sampled as candidates at each split in each decision tree) is evaluated.

Note that I use five-fold cross-validation throughout this project.

myControl = trainControl(method = "cv", number = 5, verboseIter = FALSE)
model_rf = train(SalePrice ~ ., 
              data = train,
              tuneLength = 1,
              method = "ranger",
              importance = 'impurity',
              trControl = myControl)
model_rf
Random Forest 

1460 samples
  80 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1169, 1167, 1168, 1168, 1168 
Resampling results:

  RMSE      Rsquared 
  30796.79  0.8612762

Tuning parameter 'mtry' was held constant at a value of 16
 

Model 2 - linear regression

The second algorithm I tried was linear regression - again using all but the SalePrice variable as explanatory variables.

model_lm = train(SalePrice ~ ., 
              data = train,
              method = "lm",
              trControl = myControl)
prediction from a rank-deficient fit may be misleading (warning repeated five times, once per fold)
model_lm
Linear Regression 

1460 samples
  80 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1168, 1167, 1167, 1169, 1169 
Resampling results:

  RMSE      Rsquared 
  51749.52  0.6785317

Tuning parameter 'intercept' was held constant at a value of TRUE
 

I then compared the performance of both models, using root mean square error (RMSE) as my performance measure. The resamples function collects the cross-validation results of both models so that a like-for-like comparison can be made between them. Box plots are generated for ease of comparison.

model_list <- list(lm = model_lm, rf = model_rf)
resamples = resamples(model_list)
summary(resamples)

Call:
summary.resamples(object = resamples)

Models: lm, rf 
Number of resamples: 5 

RMSE 
    Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
lm 24560   43380  54410 51750   57870 78540    0
rf 25860   27890  30780 30800   33340 36110    0

Rsquared 
     Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
lm 0.4682  0.6177 0.6404 0.6785  0.7754 0.8909    0
rf 0.8304  0.8374 0.8754 0.8613  0.8814 0.8819    0
bwplot(resamples, metric = "RMSE")

rm(resamples, model_list)

As the box plot illustrates, the random forest model performs much better than the linear regression model.

At this point I made predictions on the SalePrice variable for the test set using the random forest model. This was then submitted to Kaggle.

prediction_rf = predict(model_rf, test)
#make a submission file
submit <- data.frame(Id = test$Id, SalePrice = prediction_rf)
setwd("~/IMPORTANT FILES - TO BE BACKED UP/Machine Learning/House Prices/House Prices Project/Submissions")
write.csv(submit, file = "submission1.csv", row.names = FALSE)
rm(submit)

Model 3 - random forest with two mtry values

One way to improve the performance of random forest models is to experiment with their tuning parameters. In this instance I decided to experiment with the mtry parameter - using two mtry values instead of one. Everything else was kept the same.

model_rf2 = train(SalePrice ~ ., 
                 data = train,
                 tuneLength = 2,
                 method = "ranger",
                 importance = 'impurity',
                 trControl = myControl)
model_rf2
Random Forest 

1460 samples
  80 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1167, 1168, 1168, 1168, 1169 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared 
    2   46987.57  0.7901497
  260   29129.63  0.8685570

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was mtry = 260. 

Again I compared the model with the more basic random forest model produced earlier.

model_list <- list(rf = model_rf, rf2 = model_rf2)
resamples = resamples(model_list)
summary(resamples)

Call:
summary.resamples(object = resamples)

Models: rf, rf2 
Number of resamples: 5 

RMSE 
     Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
rf  25860   27890  30780 30800   33340 36110    0
rf2 22030   22420  30350 29130   31420 39440    0

Rsquared 
      Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
rf  0.8304  0.8374 0.8754 0.8613  0.8814 0.8819    0
rf2 0.8218  0.8236 0.8689 0.8686  0.9142 0.9144    0
bwplot(resamples, metric = "RMSE")

rm(resamples, model_list)

As the box plot illustrates, the second random forest model does indeed perform slightly better than the first. Once again I made predictions on the test set and submitted them to Kaggle.

prediction_rf2 = predict(model_rf2, test)
#make a submission file
submit <- data.frame(Id = test$Id, SalePrice = prediction_rf2)
setwd("~/IMPORTANT FILES - TO BE BACKED UP/Machine Learning/House Prices/House Prices Project/Submissions")
write.csv(submit, file = "submission2.csv", row.names = FALSE)
rm(submit)

Model 4 - random forest using 20 most important explanatory variables

Another way to improve a model is to use only the most important explanatory variables, i.e. the ones that provide the most information gain to the model. The varImp function from the caret package displays the 20 most important variables. For comparison I also ran a Boruta feature importance analysis (the call is sketched after the varImp output below), but it did not yield a better-performing model, so I omit its results from this notebook.

varImp(model_rf)
ranger variable importance

  only 20 most important variables shown (out of 260)

                        Overall
OverallQual              100.00
GrLivArea                 88.01
X1stFlrSF                 62.07
GarageArea                61.74
TotalBsmtSF               57.78
YearBuilt                 54.96
GarageCars                51.34
ExterQualTA               41.51
GarageYrBlt               35.59
BsmtFinSF1                35.51
FullBath                  34.34
X2ndFlrSF                 33.32
LotArea                   28.98
YearRemodAdd              28.25
TotRmsAbvGrd              24.47
MasVnrArea                23.92
KitchenQualTA             23.56
Fireplaces                22.28
FireplaceQuNo fireplace   21.55
ExterQualGd               18.48
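
For the curious, the Boruta analysis mentioned above boils down to a call like the following - a sketch, assuming the Boruta package with default settings:

library(Boruta)
set.seed(42)
#Boruta expects factors rather than character columns
train_factors = mutate_if(train, is.character, as.factor)
boruta_out = Boruta(SalePrice ~ ., data = train_factors, doTrace = 0)
#getSelectedAttributes(boruta_out) lists the variables Boruta confirmed as important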

I then built a random forest model that only used these particular variables as explanatory variables. I returned to using one mtry value.

Top20Variables = c("OverallQual", "GrLivArea", "TotalBsmtSF", "GarageArea", "GarageCars", 
                   "X1stFlrSF", "YearBuilt", "ExterQual", "BsmtFinSF1", "FullBath",
                   "KitchenQual", "LotArea", "Fireplaces",
                   "FireplaceQu", "YearRemodAdd", "GarageYrBlt", "X2ndFlrSF", 
                   "TotRmsAbvGrd", "MasVnrArea", "LotFrontage")
train_Top20Var = select(train, one_of(Top20Variables, "SalePrice"))
model_rf_Top20 = train(SalePrice ~ ., 
                  data = train_Top20Var,
                  tuneLength = 1,
                  method = "ranger",
                  importance = 'impurity',
                  trControl = myControl)

I then compared the performance of this model with model 3.

model_list = list(rf2 = model_rf2, rf_Top20 = model_rf_Top20)
resamples = resamples(model_list)
summary(resamples)

Call:
summary.resamples(object = resamples)

Models: rf2, rf_Top20 
Number of resamples: 5 

RMSE 
          Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
rf2      22030   22420  30350 29130   31420 39440    0
rf_Top20 24310   25170  29090 28520   29530 34490    0

Rsquared 
           Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
rf2      0.8218  0.8236 0.8689 0.8686  0.9142 0.9144    0
rf_Top20 0.8313  0.8393 0.8938 0.8739  0.9020 0.9030    0
bwplot(resamples, metric = "RMSE")

rm(resamples, model_list)

As the box plot shows, the latest model performs better than the model using all possible explanatory variables and two mtry values. Would using two mtry values here further improve performance? Let’s try it out.

Model 5 - random forest using 20 most important variables and 2 mtry values

model_rf_Top20_2mtry = train(SalePrice ~ ., 
                  data = train_Top20Var,
                  tuneLength = 2,
                  method = "ranger",
                  importance = 'impurity',
                  trControl = myControl)
model_list = list(rf2 = model_rf2, rf_Top20 = model_rf_Top20, rf_Top20_2mtry = model_rf_Top20_2mtry)
resamples = resamples(model_list)
summary(resamples)

Call:
summary.resamples(object = resamples)

Models: rf2, rf_Top20, rf_Top20_2mtry 
Number of resamples: 5 

RMSE 
                Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
rf2            22030   22420  30350 29130   31420 39440    0
rf_Top20       24310   25170  29090 28520   29530 34490    0
rf_Top20_2mtry 26720   26760  29360 30020   32740 34520    0

Rsquared 
                 Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
rf2            0.8218  0.8236 0.8689 0.8686  0.9142 0.9144    0
rf_Top20       0.8313  0.8393 0.8938 0.8739  0.9020 0.9030    0
rf_Top20_2mtry 0.8119  0.8366 0.8603 0.8581  0.8904 0.8914    0
bwplot(resamples, metric = "RMSE")

rm(resamples, model_list)

Going by median RMSE, model 5 performs better than model 3, but it is worse than model 4 - currently the best-performing model.

Further increasing the number of mtry values and increasing the number of folds in the cross-validation did not yield better model performance, so I omit those experiments from this notebook for the sake of brevity (the higher-fold control is sketched below).
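
Trying more folds only requires a different trainControl object - a sketch of the ten-fold variant I experimented with:

myControl10 = trainControl(method = "cv", number = 10)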

Perhaps it’s now time to explore alternative algorithms.

Model 6 - stochastic gradient boosting machine

A more refined approach to building tree ensembles is the stochastic gradient boosting machine, which uses regularisation (shrinkage) to prevent the individual decision trees from overfitting the training set. A more detailed explanation can be found here - http://xgboost.readthedocs.io/en/latest/model.html.

This algorithm should, in theory, generate a model that performs better on unseen data than basic random forests.

To begin with I use all available explanatory variables. I set tuneLength = 2, which evaluates two values of each tuning parameter (for gbm these are n.trees and interaction.depth rather than mtry), given that trying more parameter values produced a better-performing random forest earlier.

model_gbm = train(SalePrice ~ ., 
                 data = train,
                 tuneLength = 2,
                 method = "gbm",
                 trControl = myControl)
model_gbm
Stochastic Gradient Boosting 

1460 samples
  80 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1169, 1168, 1168, 1169, 1166 
Resampling results across tuning parameters:

  interaction.depth  n.trees  RMSE      Rsquared 
  1                   50      35345.88  0.8209247
  1                  100      31854.01  0.8447295
  2                   50      32357.34  0.8423511
  2                  100      30936.63  0.8549869

Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning
 parameter 'n.minobsinnode' was held constant at a value of 10
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were n.trees = 100, interaction.depth =
 2, shrinkage = 0.1 and n.minobsinnode = 10. 

This model did not perform better than model 4. Perhaps adjusting the tuning parameters of the model will improve performance. Let’s give it a try.

Model 7 - stochastic gradient boosting machine with custom tuning grid

This time I use a custom tuning grid (when a tuneGrid is supplied it overrides tuneLength). Arriving at the best combination of parameter values to use in the tuning grid is a case of trial and error. I experimented with many combinations to find one that improved model performance. For the sake of brevity I omit that experimentation from this notebook and simply present my final tuning grid. Once again I use all available explanatory variables.

gbmTuningGrid = expand.grid(interaction.depth = 4, 
                            n.trees = c(50, 100, 150, 200), 
                            shrinkage = 0.3,
                            n.minobsinnode = 20)
model_gbm2 = train(SalePrice ~ ., 
                  data = train,
                  tuneLength = 3,
                  method = "gbm",
                  trControl = myControl,
                  tuneGrid = gbmTuningGrid)
model_gbm2
Stochastic Gradient Boosting 

1460 samples
  80 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1167, 1168, 1169, 1168, 1168 
Resampling results across tuning parameters:

  n.trees  RMSE      Rsquared 
   50      29083.37  0.8683278
  100      28775.27  0.8720094
  150      28597.02  0.8735638
  200      28705.88  0.8728585

Tuning parameter 'interaction.depth' was held constant at a value of 4

Tuning parameter 'shrinkage' was held constant at a value of 0.3
Tuning
 parameter 'n.minobsinnode' was held constant at a value of 20
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were n.trees = 150, interaction.depth =
 4, shrinkage = 0.3 and n.minobsinnode = 20. 

I compare the performance against previous models.

model_list = list(rf2 = model_rf2, rf_Top20_2mtry = model_rf_Top20_2mtry, gbm2 = model_gbm2)
resamples = resamples(model_list)
summary(resamples)

Call:
summary.resamples(object = resamples)

Models: rf2, rf_Top20_2mtry, gbm2 
Number of resamples: 5 

RMSE 
                Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
rf2            22030   22420  30350 29130   31420 39440    0
rf_Top20_2mtry 26720   26760  29360 30020   32740 34520    0
gbm2           22490   24530  25600 28600   33420 36940    0

Rsquared 
                 Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
rf2            0.8218  0.8236 0.8689 0.8686  0.9142 0.9144    0
rf_Top20_2mtry 0.8119  0.8366 0.8603 0.8581  0.8904 0.8914    0
gbm2           0.8451  0.8455 0.8868 0.8736  0.8915 0.8989    0
bwplot(resamples, metric = "RMSE")

rm(resamples, model_list)

By median RMSE this model performs better than model 4. Are there other algorithms available that could possibly yield higher performance? Let’s try extreme gradient boosting.

Model 8 - extreme gradient boosting

The extreme gradient boosting algorithm works in much the same way as the stochastic gradient boosting machine, but guards against overfitting more aggressively.

It’s far easier to use extreme gradient boosting through the caret package than through the xgboost package directly, because the latter requires the training and test sets to be converted to sparse matrices using one-hot encoding.
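
To illustrate what the direct route involves, here is a minimal sketch using the xgboost and Matrix packages (the parameter values are illustrative, not tuned; older xgboost versions use objective = "reg:linear"):

library(xgboost)
library(Matrix)
#one-hot encode the categorical variables into a sparse model matrix
X_train = sparse.model.matrix(SalePrice ~ . - 1, data = train)
dtrain = xgb.DMatrix(data = X_train, label = train$SalePrice)
bst = xgb.train(params = list(objective = "reg:squarederror", eta = 0.3),
                data = dtrain, nrounds = 100)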

Given that using a custom tuning grid in model 7 improved performance I also make use of one in this model. Once again for the sake of brevity I have omitted my experimentations with the tuning parameters.

Again I use all available explanatory variables. (As before, the custom tuning grid overrides tuneLength.)

xgbTuningGrid = expand.grid(nrounds = c(50, 100), 
                            lambda = seq(0.1, 0.5, 0.1), 
                            alpha = seq(0.1, 0.5, 0.1),
                            eta = c(0.3, 0.4))
model_xgb4 = train(SalePrice ~ ., 
                   data = train,
                   tuneLength = 3,
                   method = "xgbLinear",
                   trControl = myControl,
                   tuneGrid = xgbTuningGrid)
model_xgb4
eXtreme Gradient Boosting 

1460 samples
  80 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1167, 1168, 1167, 1169, 1169 
Resampling results across tuning parameters:

  nrounds  lambda  alpha  eta  RMSE      Rsquared 
   50      0.1     0.1    0.3  30536.40  0.8528058
   50      0.1     0.1    0.4  30536.40  0.8528058
   50      0.1     0.2    0.3  30536.40  0.8528058
   50      0.1     0.2    0.4  30536.40  0.8528058
   50      0.1     0.3    0.3  30568.94  0.8525058
   50      0.1     0.3    0.4  30568.94  0.8525058
   50      0.1     0.4    0.3  30568.93  0.8525058
   50      0.1     0.4    0.4  30568.93  0.8525058
   50      0.1     0.5    0.3  30568.93  0.8525058
   50      0.1     0.5    0.4  30568.93  0.8525058
   50      0.2     0.1    0.3  30281.94  0.8554049
   50      0.2     0.1    0.4  30281.94  0.8554049
   50      0.2     0.2    0.3  30281.94  0.8554049
   50      0.2     0.2    0.4  30281.94  0.8554049
   50      0.2     0.3    0.3  30281.67  0.8553927
   50      0.2     0.3    0.4  30281.67  0.8553927
   50      0.2     0.4    0.3  30281.67  0.8553927
   50      0.2     0.4    0.4  30281.67  0.8553927
   50      0.2     0.5    0.3  30281.67  0.8553927
   50      0.2     0.5    0.4  30281.67  0.8553927
   50      0.3     0.1    0.3  29236.31  0.8660122
   50      0.3     0.1    0.4  29236.31  0.8660122
   50      0.3     0.2    0.3  29232.86  0.8660500
   50      0.3     0.2    0.4  29232.86  0.8660500
   50      0.3     0.3    0.3  29232.86  0.8660500
   50      0.3     0.3    0.4  29232.86  0.8660500
   50      0.3     0.4    0.3  29232.86  0.8660500
   50      0.3     0.4    0.4  29232.86  0.8660500
   50      0.3     0.5    0.3  29232.86  0.8660500
   50      0.3     0.5    0.4  29232.86  0.8660500
   50      0.4     0.1    0.3  28854.48  0.8687587
   50      0.4     0.1    0.4  28854.48  0.8687587
   50      0.4     0.2    0.3  28851.32  0.8687861
   50      0.4     0.2    0.4  28851.32  0.8687861
   50      0.4     0.3    0.3  28851.32  0.8687861
   50      0.4     0.3    0.4  28851.32  0.8687861
   50      0.4     0.4    0.3  28851.32  0.8687861
   50      0.4     0.4    0.4  28851.32  0.8687861
   50      0.4     0.5    0.3  28851.32  0.8687861
   50      0.4     0.5    0.4  28851.32  0.8687861
   50      0.5     0.1    0.3  30608.58  0.8520108
   50      0.5     0.1    0.4  30608.58  0.8520108
   50      0.5     0.2    0.3  30608.58  0.8520108
   50      0.5     0.2    0.4  30608.58  0.8520108
   50      0.5     0.3    0.3  30608.58  0.8520108
   50      0.5     0.3    0.4  30608.58  0.8520108
   50      0.5     0.4    0.3  30608.58  0.8520108
   50      0.5     0.4    0.4  30608.58  0.8520108
   50      0.5     0.5    0.3  30608.58  0.8520108
   50      0.5     0.5    0.4  30608.58  0.8520108
  100      0.1     0.1    0.3  30532.22  0.8529242
  100      0.1     0.1    0.4  30532.22  0.8529242
  100      0.1     0.2    0.3  30535.00  0.8529202
  100      0.1     0.2    0.4  30535.00  0.8529202
  100      0.1     0.3    0.3  30582.60  0.8524487
  100      0.1     0.3    0.4  30582.60  0.8524487
  100      0.1     0.4    0.3  30582.59  0.8524487
  100      0.1     0.4    0.4  30582.59  0.8524487
  100      0.1     0.5    0.3  30589.57  0.8523823
  100      0.1     0.5    0.4  30589.57  0.8523823
  100      0.2     0.1    0.3  30223.54  0.8560345
  100      0.2     0.1    0.4  30223.54  0.8560345
  100      0.2     0.2    0.3  30223.54  0.8560345
  100      0.2     0.2    0.4  30223.54  0.8560345
  100      0.2     0.3    0.3  30233.27  0.8559046
  100      0.2     0.3    0.4  30233.27  0.8559046
  100      0.2     0.4    0.3  30233.27  0.8559046
  100      0.2     0.4    0.4  30233.27  0.8559046
  100      0.2     0.5    0.3  30214.12  0.8560560
  100      0.2     0.5    0.4  30214.12  0.8560560
  100      0.3     0.1    0.3  29217.52  0.8662653
  100      0.3     0.1    0.4  29217.52  0.8662653
  100      0.3     0.2    0.3  29247.79  0.8659507
  100      0.3     0.2    0.4  29247.79  0.8659507
  100      0.3     0.3    0.3  29247.79  0.8659507
  100      0.3     0.3    0.4  29247.79  0.8659507
  100      0.3     0.4    0.3  29239.71  0.8660205
  100      0.3     0.4    0.4  29239.71  0.8660205
  100      0.3     0.5    0.3  29239.71  0.8660205
  100      0.3     0.5    0.4  29239.71  0.8660205
  100      0.4     0.1    0.3  28801.41  0.8692857
  100      0.4     0.1    0.4  28801.41  0.8692857
  100      0.4     0.2    0.3  28784.63  0.8694422
  100      0.4     0.2    0.4  28784.63  0.8694422
  100      0.4     0.3    0.3  28802.17  0.8693151
  100      0.4     0.3    0.4  28802.17  0.8693151
  100      0.4     0.4    0.3  28797.92  0.8693516
  100      0.4     0.4    0.4  28797.92  0.8693516
  100      0.4     0.5    0.3  28797.43  0.8693550
  100      0.4     0.5    0.4  28797.43  0.8693550
  100      0.5     0.1    0.3  30618.66  0.8519223
  100      0.5     0.1    0.4  30618.66  0.8519223
  100      0.5     0.2    0.3  30618.66  0.8519223
  100      0.5     0.2    0.4  30618.66  0.8519223
  100      0.5     0.3    0.3  30618.66  0.8519223
  100      0.5     0.3    0.4  30618.66  0.8519223
  100      0.5     0.4    0.3  30617.74  0.8519310
  100      0.5     0.4    0.4  30617.74  0.8519310
  100      0.5     0.5    0.3  30617.40  0.8519334
  100      0.5     0.5    0.4  30617.40  0.8519334

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were nrounds = 100, lambda = 0.4, alpha =
 0.2 and eta = 0.3. 

As usual, I compare model performance.

model_list = list(gbm2 = model_gbm2, xgb4 = model_xgb4)
resamples = resamples(model_list)
summary(resamples)

Call:
summary.resamples(object = resamples)

Models: gbm2, xgb4 
Number of resamples: 5 

RMSE 
      Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
gbm2 22490   24530  25600 28600   33420 36940    0
xgb4 23070   25830  28640 28780   31290 35100    0

Rsquared 
       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
gbm2 0.8451  0.8455 0.8868 0.8736  0.8915 0.8989    0
xgb4 0.8037  0.8602 0.8800 0.8694  0.8846 0.9188    0
bwplot(resamples, metric = "RMSE")

rm(resamples, model_list)

As the box plot illustrates, this model actually performs worse than model 7.

Model 9 - linear model with a trimmed training set

Having reached a plateau in my model development, I read various kernels (notebooks written by other Kaggle competitors) for this particular competition to identify areas for further development. One kernel (https://www.kaggle.com/sidraina89/house-prices-advanced-regression-techniques/regularized-regression-housing-pricing) suggested that removing outliers from the training set would improve model performance. Therefore I went ahead and did this - first plotting the above-ground living area (GrLivArea) against the sale price.

ggplot(train, aes(y = SalePrice, x = GrLivArea)) + geom_point()

This plot revealed four outliers where the above-ground living area was more than 4,000 square feet. I therefore removed those records from the training set.

train_trim = filter(train, GrLivArea <= 4000)

I repeated this process for the other variables among the 20 most important (as identified in model 4), such as OverallQual, KitchenQual and YearBuilt. No other outliers were identified.
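
The repeated checks were just a loop over the remaining numeric variables - a sketch, assuming ggplot2 (attached along with caret) and an illustrative subset of variables:

for (v in c("LotArea", "TotalBsmtSF", "X1stFlrSF", "GarageArea")) {
  print(ggplot(train_trim, aes_string(x = v, y = "SalePrice")) + geom_point())
}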

I then built a linear regression model using this trimmed training set and assessed its performance in the usual way.

model_lm2 = train(SalePrice ~ ., 
                  data = train_trim,
                  method = "lm",
                  trControl = myControl)
prediction from a rank-deficient fit may be misleading (warning repeated five times, once per fold)
model_list = list(gbm2 = model_gbm2,lm2 = model_lm2)
resamples = resamples(model_list)
summary(resamples)

Call:
summary.resamples(object = resamples)

Models: gbm2, lm2 
Number of resamples: 5 

RMSE 
      Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
gbm2 22490   24530  25600 28600   33420 36940    0
lm2  23130   23950  24640 26000   26640 31660    0

Rsquared 
       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
gbm2 0.8451  0.8455 0.8868 0.8736  0.8915 0.8989    0
lm2  0.8208  0.8916 0.8949 0.8868  0.9063 0.9206    0
bwplot(resamples, metric = "RMSE")

rm(resamples, model_list)

As the box plot illustrates, this model performed better than both model 7 and model 8.

Model 10 - extreme gradient boosting with a trimmed training set

Given how strongly the extreme gradient boosting algorithm performed on the untrimmed training set, it seemed natural to try it on the trimmed training set.
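
The model (model_xgb5) is trained just like model 8, but on the trimmed training set - a sketch, assuming the same tuning grid:

model_xgb5 = train(SalePrice ~ ., 
                   data = train_trim,
                   method = "xgbLinear",
                   trControl = myControl,
                   tuneGrid = xgbTuningGrid)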

model_list = list(gbm2 = model_gbm2, lm2 = model_lm2, xgb5 = model_xgb5)
resamples = resamples(model_list)
summary(resamples)

Call:
summary.resamples(object = resamples)

Models: gbm2, lm2, xgb5 
Number of resamples: 5 

RMSE 
      Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
gbm2 22490   24530  25600 28600   33420 36940    0
lm2  23130   23950  24640 26000   26640 31660    0
xgb5 24310   25590  27140 26340   27270 27390    0

Rsquared 
       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
gbm2 0.8451  0.8455 0.8868 0.8736  0.8915 0.8989    0
lm2  0.8208  0.8916 0.8949 0.8868  0.9063 0.9206    0
xgb5 0.8631  0.8759 0.8790 0.8825  0.8940 0.9007    0
bwplot(resamples, metric = "RMSE")

rm(resamples, model_list)

To my surprise, although trimming improved the extreme gradient boosting model relative to model 8, it still performed worse than model 9 (a simple linear model). Further development of the data was required.

Model 11 - linear model on trimmed training set with log-transformed skewed variables

The same kernel I read earlier underlined the importance of understanding the assumptions underlying any machine learning algorithms being used. For example linear regression assumes, among other things, that the residuals are normally distributed - an assumption that heavily skewed variables tend to violate. I used the following code to identify any variables with a skewness greater than 0.75 (the threshold given in the kernel).

library(e1071)
column_classes = sapply(names(train_trim), function(x){class(train_trim[[x]])})
numeric_columns = names(column_classes[column_classes == "integer"])
skew = sapply(numeric_columns, function(x){skewness(train_trim[[x]], na.rm = T)})
skew = skew[skew > 0.75]
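# note: skew > 0.75 flags only right-skewed variables; filtering on
# abs(skew) > 0.75 would also catch strongly left-skewed ones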

As per the kernel I applied a log(x + 1) transform to these highly skewed variables in both the training set and the test set (adding one avoids taking the logarithm of zero).

train_trim_log = train_trim
for(x in names(skew)) 
{
  train_trim_log[[x]] = log(train_trim_log[[x]] + 1)
  
}
# do the same for the test set
test_logtrnsfrm = test
for(x in names(skew)) 
{
  if(x != "SalePrice")
  {
    test_logtrnsfrm[[x]] = log(test_logtrnsfrm[[x]] + 1)
  }
}
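
As an illustrative sanity check (assuming GrLivArea was among the flagged variables), the transform should pull the skewness down sharply:

skewness(train_trim$GrLivArea, na.rm = TRUE)     # raw - strongly right-skewed
skewness(train_trim_log$GrLivArea, na.rm = TRUE) # after log(x + 1) - much closer to zero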

I then built a linear model using this adjusted training set.

model_lm3 = train(SalePrice ~ ., 
                  data = train_trim_log,
                  method = "lm",
                  trControl = myControl)

model_list = list(gbm2 = model_gbm2, lm3 = model_lm3)
resamples = resamples(model_list)
summary(resamples)

Call:
summary.resamples(object = resamples)

Models: gbm2, lm3 
Number of resamples: 5 

RMSE 
          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
gbm2 2.249e+04 24530.000 2.560e+04 2.860e+04 3.342e+04 36940.000    0
lm3  1.092e-01     0.124 1.255e-01 1.532e-01 1.255e-01     0.282    0

Rsquared 
       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
gbm2 0.8451  0.8455 0.8868 0.8736  0.8915 0.8989    0
lm3  0.6280  0.9002 0.9050 0.8533  0.9117 0.9218    0
bwplot(resamples, metric = "RMSE")

rm(resamples, model_list)

At first glance the new model performs vastly better than the previously best-performing model (the gradient boosting machine on the untrimmed, untransformed training set), but the RMSE values are not directly comparable: SalePrice was among the log-transformed variables, so lm3's errors are measured on the log scale while gbm2's are in dollars. Conveniently, the log scale is the one the competition is actually scored on, so from this point onwards the cross-validated RMSE tracks the Kaggle metric.
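
For reference, the competition metric is the root mean square logarithmic error,

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(\hat{y}_i + 1) - \log(y_i + 1)\bigr)^2},$$

which is exactly the RMSE of a model fitted to log(SalePrice + 1).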

Model 12 - extreme gradient boosting using trimmed and log-transformed training set

I tried the extreme gradient boosting algorithm once again.

model_xgb6 = train(SalePrice ~ ., 
                   data = train_trim_log,
                   tuneLength = 3,
                   method = "xgbLinear",
                   trControl = myControl,
                   tuneGrid = xgbTuningGrid,
                   verbose = FALSE)
model_list = list(lm3 = model_lm3, xgb6 = model_xgb6)
resamples = resamples(model_list)
summary(resamples)

Call:
summary.resamples(object = resamples)

Models: lm3, xgb6 
Number of resamples: 5 

RMSE 
       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
lm3  0.1092  0.1240 0.1255 0.1532  0.1255 0.2820    0
xgb6 0.1213  0.1314 0.1372 0.1340  0.1397 0.1404    0

Rsquared 
       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
lm3  0.6280  0.9002 0.9050 0.8533  0.9117 0.9218    0
xgb6 0.8758  0.8789 0.8816 0.8858  0.8959 0.8969    0
bwplot(resamples, metric = "RMSE")

rm(resamples, model_list)

Judged by the box plot (median RMSE) this model performs worse than the simple linear model, although its resamples are more consistent - lm3 owes its worse mean to a single poor fold. Let us try regularised linear regression as per the aforementioned Kaggle kernel.

Model 13 - regularised linear regression

glmnet fits a generalised linear model with an elastic-net penalty - a weighted combination of lasso regression (an L1 penalty on the sum of the absolute values of the coefficients, which can shrink some coefficients exactly to zero) and ridge regression (an L2 penalty on the sum of the squared coefficients) - to prevent overfitting.
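
Concretely, for a Gaussian response glmnet minimises

$$\frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^\top\beta\bigr)^2 + \lambda\left(\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right),$$

where alpha mixes the two penalties (alpha = 1 is pure lasso, alpha = 0 pure ridge) and lambda sets the overall penalty strength - exactly the two parameters explored in the tuning grid below.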

glmnetTuningGrid = expand.grid(alpha = seq(0, 1, 0.2),
                               lambda = seq(0, 1, 0.2))
model_glmnet1 = train(SalePrice ~ ., 
                   data = train_trim_log,
                   method = "glmnet",
                   trControl = myControl,
                   tuneGrid = glmnetTuningGrid)
Warning: There were missing values in resampled performance measures.
model_list = list(lm3 = model_lm3, xgb6 = model_xgb6, glmnet1 = model_glmnet1)
resamples = resamples(model_list)
summary(resamples)

Call:
summary.resamples(object = resamples)

Models: lm3, xgb6, glmnet1 
Number of resamples: 5 

RMSE 
          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
lm3     0.1092  0.1240 0.1255 0.1532  0.1255 0.2820    0
xgb6    0.1213  0.1314 0.1372 0.1340  0.1397 0.1404    0
glmnet1 0.1095  0.1102 0.1169 0.1192  0.1291 0.1301    0

Rsquared 
          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
lm3     0.6280  0.9002 0.9050 0.8533  0.9117 0.9218    0
xgb6    0.8758  0.8789 0.8816 0.8858  0.8959 0.8969    0
glmnet1 0.8892  0.8992 0.9092 0.9092  0.9215 0.9268    0
bwplot(resamples, metric = "RMSE")

rm(resamples, model_list)

This model performs better than model 12 and is now my best-performing model. Next I see whether principal component analysis (PCA) of the training set can help me improve model performance further.

Model 14 - PCA and linear model

A Kaggle kernel by Nandy Singh (https://www.kaggle.com/nandys/house-prices-advanced-regression-techniques/how-to-decrease-0-1-rmse-score) suggested using PCA to engineer new features based on a combination of existing high-variance features. It can also be used to reduce the number of variables being used i.e. remove the ones with low/no variance and hence little/no information about the data relationships.

Firstly I identified variables in the training set that had zero or near-zero variance i.e. were broadly constant and hence had little predictive value.

nearZeroVar(train_trim, saveMetrics = T) #variables with zero or near-zero variance 

I then removed these variables from the training set.

out = c("Id", "Street", "Alley", "LandContour", "Utilities", "LandSlope", "Condition2", 
        "RoofMatl", "BsmtCond", "BsmtFinType2", "BsmtFinSF2", "Heating", "LowQualFinSF", 
        "KitchenAbvGr", "Functional", "EnclosedPorch", "x3SsnPorch", "PoolArea", 
        "PoolQC", "MiscFeature", "MiscValue")
train_trim_log2 = train_trim_log[, !(names(train_trim_log) %in% out)]
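
Note that Id is dropped not because of its variance but because it is an arbitrary row key. A less error-prone alternative to typing the list by hand (a sketch - when saveMetrics is left at its default of FALSE, nearZeroVar returns the offending column positions directly):

nzv_cols = nearZeroVar(train_trim_log) # positions of zero/near-zero-variance columns
train_trim_log2_alt = train_trim_log[, -c(1, nzv_cols)] # column 1 is assumed to be Id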

Given that PCA only works on numeric variables I also removed any non-numeric variables from the training set.

train_trim_log2_numeric = train_trim_log2[, names(train_trim_log2) %in% numeric_columns]

I conducted PCA on this training set - returning the top 15 principal components.

pca = prcomp(train_trim_log2_numeric, scale = T, center = T)
eigenvalues = factoextra::get_eigenvalue(pca)
pcaVar = factoextra::get_pca_var(pca) #PCA variables
pcaVar = as.data.frame(c(pcaVar))[,1:15] #table of correlations between variables and top 15 principal components
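
As an illustrative check of how much information the 15 retained components carry (get_eigenvalue also reports the variance explained per component):

eigenvalues[1:15, ] # eigenvalue, variance.percent and cumulative.variance.percent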

To identify which features to combine I produced a table of strongly (positively or negatively) correlated features.

var = pcaVar[FALSE, ]
k = 1
for(i in colnames(pcaVar))
{
  for(j in rownames(pcaVar))
    {
      if(abs(pcaVar[j , i]) >= 0.5) #strong +ve/-ve correlation between variables j and i
        {
          var[k, i] = j
          k = k + 1
        }
    }
  k = 1
}
var = as.data.frame(var)

Based on the table produced above it made sense to combine the OverallQual and OverallCond features to create a TotQual feature.

#Create new feature called TotQual - the product of OverallQual and OverallCond
train_newfeatures = mutate(train_trim_log2, TotQual = OverallCond * OverallQual)

I then built a linear model on this augmented training set and compared its performance with the previous best model (model 13).

model_lm4 = train(SalePrice ~ ., 
                  data = train_newfeatures,
                  method = "lm",
                  trControl = myControl)

model_list = list(glmnet1 = model_glmnet1, lm4 = model_lm4)
resamples = resamples(model_list)
summary(resamples)

Call:
summary.resamples(object = resamples)

Models: glmnet1, lm4 
Number of resamples: 5 

RMSE 
          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
glmnet1 0.1095  0.1102 0.1169 0.1192  0.1291 0.1301    0
lm4     0.1087  0.1103 0.1141 0.1182  0.1267 0.1312    0

Rsquared 
          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
glmnet1 0.8892  0.8992 0.9092 0.9092  0.9215 0.9268    0
lm4     0.8885  0.9035 0.9187 0.9118  0.9207 0.9277    0
bwplot(resamples, metric = "RMSE")

rm(resamples, model_list)

This model performs better than model 13 and is now my best-performing model.

Model 15 - support vector machine

Having heard the hype surrounding support vector machines I thought I would try one in this project.

model_svm2 = train(SalePrice ~ ., 
                   data = train_trim_log,
                   tuneLength = 3,
                   method = "svmLinear",
                   trControl = myControl)

model_list = list(lm4 = model_lm4, svm2 = model_svm2)
resamples = resamples(model_list)
summary(resamples)

Call:
summary.resamples(object = resamples)

Models: lm4, svm2 
Number of resamples: 5 

RMSE 
        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. NA's
lm4   0.1087  0.1103  0.1141  0.1182  0.1267  0.1312    0
svm2 12.6300 13.6500 15.0600 22.5300 16.1200 55.1800    0

Rsquared 
         Min.  1st Qu.  Median    Mean 3rd Qu.   Max. NA's
lm4  0.888500 0.903500 0.91870 0.91180 0.92070 0.9277    0
svm2 0.001119 0.005127 0.07594 0.06006 0.08187 0.1363    0
bwplot(resamples, metric = "RMSE")

rm(resamples, model_list)

Despite the hype it did not perform better than the linear model. It is likely that further parameter tuning and data processing would be required for the support vector machine to perform better. Because I was only trying this algorithm out of curiosity I did not experiment with it any further.
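
Had I pursued it further, a typical next step would have been a radial-basis kernel with centred and scaled predictors - a sketch only, not something I ran:

model_svm3 = train(SalePrice ~ ., 
                   data = train_trim_log,
                   method = "svmRadial",
                   preProcess = c("center", "scale"), # SVMs are sensitive to feature scale
                   tuneLength = 5,                    # tunes the cost parameter C
                   trControl = myControl)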

Model 16 - neural network

Again, given the hype surrounding it and my own curiosity, I tried building a neural network.

model_nnet1 = train(SalePrice ~ ., 
                    data = train_trim_log2,
                    method = "nnet",
                    trControl = myControl)

model_list = list(lm4 = model_lm4, nnet1 = model_nnet1)
resamples = resamples(model_list)
summary(resamples)

Call:
summary.resamples(object = resamples)

Models: lm4, nnet1 
Number of resamples: 5 

RMSE 
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max. NA's
lm4    0.1087  0.1103  0.1141  0.1182  0.1267  0.1312    0
nnet1 11.0100 11.0100 11.0300 11.0300 11.0400 11.0400    0

Rsquared 
        Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
lm4   0.8885  0.9035 0.9187 0.9118  0.9207 0.9277    0
nnet1     NA      NA     NA    NaN      NA     NA    5
bwplot(resamples, metric = "RMSE")

rm(resamples, model_list)

Once again, despite the hype, it did not perform better than the linear model. The NA values for R-squared indicate that the network produced essentially constant predictions - a classic symptom of an untuned neural network fed unscaled data. As with the support vector machine, further parameter tuning and data processing would likely be required to maximise its performance (different machine learning algorithms have different requirements for performance maximisation). Because I was only trying this algorithm out of curiosity I did not experiment with it any further.
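
For illustration, a more carefully configured attempt might look like this - again a sketch only, not something I ran:

model_nnet2 = train(SalePrice ~ ., 
                    data = train_trim_log2,
                    method = "nnet",
                    preProcess = c("center", "scale"), # neural networks need scaled inputs
                    tuneGrid = expand.grid(size = c(3, 5, 7), decay = c(0.01, 0.1)),
                    MaxNWts = 5000, # raise nnet's weight cap - the dummy-encoded predictor set is wide
                    trace = FALSE,
                    trControl = myControl)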

Best-performing model

Based on the experimentation described above I conclude that model 14 - linear regression on a training set with outliers removed, highly skewed variables log-transformed, near-zero-variance variables dropped and a new feature (TotQual = OverallCond*OverallQual) added - is my best-performing model on the house price dataset.

model_list = list(lm4 = model_lm4, glmnet1 = model_glmnet1)
resamples = resamples(model_list)
summary(resamples)

Call:
summary.resamples(object = resamples)

Models: lm4, glmnet1 
Number of resamples: 5 

RMSE 
          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
lm4     0.1087  0.1103 0.1141 0.1182  0.1267 0.1312    0
glmnet1 0.1095  0.1102 0.1169 0.1192  0.1291 0.1301    0

Rsquared 
          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
lm4     0.8885  0.9035 0.9187 0.9118  0.9207 0.9277    0
glmnet1 0.8892  0.8992 0.9092 0.9092  0.9215 0.9268    0
bwplot(resamples, metric = "RMSE")

rm(resamples, model_list)
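
For completeness, here is a sketch of how this model's predictions would be turned into a Kaggle submission (the test set gets the same TotQual feature, the log(x + 1) transform on SalePrice must be inverted before submitting, and the file name is illustrative):

test_newfeatures = mutate(test_logtrnsfrm, TotQual = OverallCond * OverallQual)
prediction_lm4 = predict(model_lm4, test_newfeatures)
submit = data.frame(Id = test$Id, SalePrice = exp(prediction_lm4) - 1) # invert log(x + 1)
write.csv(submit, file = "submission_final.csv", row.names = FALSE)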

Outcome

My best Kaggle score (as of Monday 20th February 2017) for this competition - my first one ever - was 0.12581 (the root mean square logarithmic error (RMSLE) of my model on 50% of an unseen test set). The benchmark score for the competition was 0.40890, which even my first submission outperformed. My ranking was 1,753 out of 4,454 - placing me in the top half of competitors (many of whom hold postgraduate qualifications in data-science-related subjects and/or are experienced data scientists - neither of which I am). For a sense of perspective my final submission improved my previous best score by only around 0.002 yet raised my leaderboard ranking by almost 1,000 places.

Final thoughts

This competition has taught me three things. Firstly a considerable amount of data pre-processing needs to be conducted (e.g. dealing with missing values, dealing with outliers, dealing with skewness) in order to maximise model performance. Secondly it is vital to understand the assumptions and statistical theory underlying the models and data that one uses. Finally a simple model (e.g. linear regression) can outperform a more advanced model (e.g. extreme gradient boosting) in some cases.
