Objective:
The objective of this exercise is to predict the age of homes in the Boston data set and, in the process, compare multiple algorithms.
Data Set
The data set consists of 506 observations and 14 variables and is available in the MASS package. The variables range from per-capita crime by town to the pupil-teacher ratio by town. We will be using age as our response/target variable.
Descriptions of the 14 variables, along with more background on the data set, are available at https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html
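If the MASS package is installed, the same variable descriptions can also be read from R's help system:
# open the help page for the Boston data set (requires the MASS package)
help("Boston", package = "MASS")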
We are going to use the caret and MASS R packages: MASS holds the Boston data set, and we will build our models with caret.
Data splitting can be done in a variety of ways.
First, I am going to use the base R sample() function to split the data set.
Second, I am going to use the createDataPartition() function from the caret package.
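Before splitting, we load the two packages (this assumes both are already installed):
# load the required packages
library(MASS)   # provides the Boston data set
library(caret)  # data splitting, model training and resampling utilities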
#load the dataset
df <- Boston
str(df)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Data split (80/20) using the base R sample() function.
#store rows for partition
rows1 <- sample(1:nrow(df), size = floor(nrow(df) * 0.8), replace = FALSE)
#training data set
train_base <- df[rows1,]
#testing data set
test_base <- df[-rows1,]
# structure of the dataset
str(train_base)
'data.frame': 404 obs. of 14 variables:
$ crim : num 0.0205 18.811 0.122 13.3598 0.2221 ...
$ zn : num 85 0 0 0 0 40 0 0 55 0 ...
$ indus : num 0.74 18.1 2.89 18.1 10.01 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.41 0.597 0.445 0.693 0.547 0.429 0.74 0.532 0.484 0.573 ...
$ rm : num 6.38 4.63 6.62 5.89 6.09 ...
$ age : num 35.7 100 57.8 94.7 95.4 44.4 93.9 77 28.1 69.1 ...
$ dis : num 9.19 1.55 3.5 1.78 2.55 ...
$ rad : int 2 24 2 24 6 1 24 24 5 1 ...
$ tax : num 313 666 276 666 432 335 666 666 370 273 ...
$ ptratio: num 17.3 20.2 18 20.2 17.8 19.7 20.2 20.2 17.6 21 ...
$ black : num 396.9 28.8 358 396.9 396.9 ...
$ lstat : num 5.77 34.37 6.65 16.35 17.09 ...
$ medv : num 24.7 17.9 28.4 12.7 18.7 22.9 12.8 25 31.2 22.4 ...
str(test_base)
'data.frame': 102 obs. of 14 variables:
$ crim : num 0.17 0.784 0.726 0.988 0.956 ...
$ zn : num 12.5 0 0 0 0 0 0 0 0 0 ...
$ indus : num 7.87 8.14 8.14 8.14 8.14 8.14 8.14 8.14 8.14 5.96 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.524 0.538 0.538 0.538 0.538 0.538 0.538 0.538 0.538 0.499 ...
$ rm : num 6 5.99 5.73 5.81 6.05 ...
$ age : num 85.9 81.7 69.5 100 88.8 94.4 87.3 100 82 41.5 ...
$ dis : num 6.59 4.26 3.8 4.1 4.45 ...
$ rad : int 5 4 4 4 4 4 4 4 4 5 ...
$ tax : num 311 307 307 307 307 307 307 307 307 279 ...
$ ptratio: num 15.2 21 21 21 21 21 21 21 21 19.2 ...
$ black : num 387 387 391 395 306 ...
$ lstat : num 17.1 14.7 11.3 19.9 17.3 ...
$ medv : num 18.9 17.5 18.2 14.5 14.8 18.4 21 14.5 13.2 21 ...
Data split (70/30) using the createDataPartition() function from the caret package.
# Normally we would call set.seed(123) here for reproducibility; it is omitted because of a software issue that prevents publishing this document when it is included.
#store rows for partition
partition <- caret::createDataPartition(y = df$age, times = 1, p = 0.7, list = FALSE)
# create training data set
train_set <- df[partition,]
# create testing data set by dropping the partition rows, leaving the remaining 30% of the data
test_set <- df[-partition,]
str(train_set)
'data.frame': 356 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.03237 0.02985 0.08829 ...
$ zn : num 18 0 0 0 12.5 12.5 12.5 0 0 0 ...
$ indus : num 2.31 7.07 2.18 2.18 7.87 7.87 7.87 8.14 8.14 8.14 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.458 0.458 0.524 0.524 0.524 0.538 0.538 0.538 ...
$ rm : num 6.58 6.42 7 6.43 6.01 ...
$ age : num 65.2 78.9 45.8 58.7 66.6 85.9 94.3 61.8 29.3 81.7 ...
$ dis : num 4.09 4.97 6.06 6.06 5.56 ...
$ rad : int 1 2 3 3 5 5 5 4 4 4 ...
$ tax : num 296 242 222 222 311 311 311 307 307 307 ...
$ ptratio: num 15.3 17.8 18.7 18.7 15.2 15.2 15.2 21 21 21 ...
$ black : num 397 397 395 394 396 ...
$ lstat : num 4.98 9.14 2.94 5.21 12.43 ...
$ medv : num 24 21.6 33.4 28.7 22.9 18.9 15 20.4 23.1 17.5 ...
str(test_set)
'data.frame': 150 obs. of 14 variables:
$ crim : num 0.0273 0.069 0.1446 0.2112 0.1175 ...
$ zn : num 0 0 12.5 12.5 12.5 12.5 0 0 0 0 ...
$ indus : num 7.07 2.18 7.87 7.87 7.87 7.87 8.14 8.14 8.14 8.14 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.469 0.458 0.524 0.524 0.524 0.524 0.538 0.538 0.538 0.538 ...
$ rm : num 7.18 7.15 6.17 5.63 6.01 ...
$ age : num 61.1 54.2 96.1 100 82.9 39 84.5 56.5 98.1 89.2 ...
$ dis : num 4.97 6.06 5.95 6.08 6.23 ...
$ rad : int 2 3 5 5 5 5 4 4 4 4 ...
$ tax : num 242 222 311 311 311 311 307 307 307 307 ...
$ ptratio: num 17.8 18.7 15.2 15.2 15.2 15.2 21 21 21 21 ...
$ black : num 393 397 397 387 397 ...
$ lstat : num 4.03 5.33 19.15 29.93 13.27 ...
$ medv : num 34.7 36.2 27.1 16.5 18.9 21.7 18.2 19.9 13.6 19.6 ...
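createDataPartition samples within quantile groups of the outcome, so the distribution of age should look similar in the training and testing sets. A quick optional check (output not shown):
# compare the distribution of the target variable across the two partitions
summary(train_set$age)
summary(test_set$age)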
We are going to build three models: a classical linear regression model, a random forest, and a gradient boosting model (GBM).
We will then choose the best-performing model based on root mean square error (RMSE) to predict the age of houses in the test data set.
We will also use repeated cross-validation (5 folds, repeated 2 times), which gives a more reliable estimate of out-of-sample performance.
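Since all three models share the same resampling scheme, one tidy option (not used in the code below, which defines the control inline) is to create a single trainControl object and reuse it:
# 5-fold cross-validation, repeated 2 times, reusable across models
ctrl <- caret::trainControl(method = "repeatedcv", number = 5, repeats = 2)
# e.g. caret::train(age ~ ., data = train_set, method = "lm", trControl = ctrl)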
# linear regression, 5-fold CV repeated 2 times
model_lm <- caret::train(age ~ ., data = train_set,
                         method = "lm",
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 5, repeats = 2))
# random forest (ranger), same resampling scheme
model_forest <- caret::train(age ~ ., data = train_set,
                             method = "ranger",
                             trControl = trainControl(method = "repeatedcv",
                                                      number = 5, repeats = 2))
# gradient boosting (gbm), same resampling scheme
model_gbm <- caret::train(age ~ ., data = train_set,
                          method = "gbm",
                          trControl = trainControl(method = "repeatedcv",
                                                   number = 5, repeats = 2))
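Each fitted train object can also be inspected on its own: printing it shows the tuning grid caret evaluated, and bestTune holds the selected parameters. For example:
# tuning parameters chosen for the random forest (ranger)
model_forest$bestTune
# full resampling summary for the GBM
print(model_gbm)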
We use the resamples() function from the caret package to compare the three models side by side.
all_models <- caret::resamples(list(Random_Forest = model_forest,
Linear_Regression = model_lm,
GBM = model_gbm))
#boxplot
bwplot(all_models)
#dotplot
dotplot(all_models)
#summary
summary(all_models)
Call:
summary.resamples(object = all_models)
Models: Random_Forest, Linear_Regression, GBM
Number of resamples: 10
MAE
Min. 1st Qu. Median Mean 3rd Qu. Max.
Random_Forest 8.252112 9.094248 9.78110 9.643055 10.17148 10.70097
Linear_Regression 11.490282 12.514133 12.74935 12.851753 12.99444 15.06752
GBM 8.434679 9.483875 10.09975 9.934096 10.38567 11.70440
NA's
Random_Forest 0
Linear_Regression 0
GBM 0
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max.
Random_Forest 11.07166 12.34664 13.01666 12.88435 13.63783 14.27272
Linear_Regression 14.33443 15.94932 16.39045 16.42580 16.95850 17.83902
GBM 11.55671 12.14554 13.30905 13.38038 14.47298 15.69887
NA's
Random_Forest 0
Linear_Regression 0
GBM 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu.
Random_Forest 0.7183874 0.7592769 0.7754247 0.7842863 0.8105468
Linear_Regression 0.5927382 0.6042303 0.6578105 0.6495473 0.6793772
GBM 0.7110808 0.7290565 0.7645726 0.7668434 0.8077440
Max. NA's
Random_Forest 0.8533229 0
Linear_Regression 0.7170658 0
GBM 0.8204467 0
Looking at the mean RMSE values, Random_Forest and GBM both performed well and clearly better than the linear regression; two models finishing this close to each other is not always the case.
Let’s build our models again with some pre-processing first.
#### Center, Scale and Principal Component Analysis (PCA)
The caret package lets us add pre-processing steps with a single argument and does all the heavy lifting for us.
We will center and scale the predictors and then perform principal component analysis on them to reduce collinearity.
This may cost us a little RMSE, but the resulting models should be more stable.
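To see what this pre-processing does on its own, caret's preProcess function can be applied directly to the predictors; printing the result reports how many principal components were retained (a quick illustration, separate from the model fits below):
# center, scale and run PCA on the predictors only (age is the outcome)
pp <- caret::preProcess(train_set[, setdiff(names(train_set), "age")],
                        method = c("center", "scale", "pca"))
pp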
model_lm <- caret::train(age ~ ., data = train_set,
                         method = "lm",
                         trControl = trainControl(method = "repeatedcv", number = 5, repeats = 2),
                         preProcess = c("center", "scale", "pca"))
model_forest <- caret::train(age ~ ., data = train_set,
                             method = "ranger",
                             trControl = trainControl(method = "repeatedcv", number = 5, repeats = 2),
                             preProcess = c("center", "scale", "pca"))
model_gbm <- caret::train(age ~ ., data = train_set,
                          method = "gbm",
                          trControl = trainControl(method = "repeatedcv", number = 5, repeats = 2),
                          preProcess = c("center", "scale", "pca"))
all_models <- caret::resamples(list(Random_Forest = model_forest,
Linear_Regression = model_lm,
GBM = model_gbm))
#boxplot
bwplot(all_models)
#dotplot
dotplot(all_models)
#summary
summary(all_models)
Call:
summary.resamples(object = all_models)
Models: Random_Forest, Linear_Regression, GBM
Number of resamples: 10
MAE
Min. 1st Qu. Median Mean 3rd Qu. Max.
Random_Forest 7.487781 10.711419 10.73529 10.35820 10.87794 11.56189
Linear_Regression 11.721988 12.001735 12.63997 12.90499 13.63416 15.15267
GBM 9.013459 9.591878 10.22780 10.41425 10.71376 13.15846
NA's
Random_Forest 4
Linear_Regression 0
GBM 0
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max.
Random_Forest 10.21081 13.32904 14.19934 13.67404 14.53391 15.75037
Linear_Regression 14.27834 15.60454 16.21224 16.54711 17.82753 18.71726
GBM 11.54933 12.64307 13.01390 13.87030 15.13640 18.28568
NA's
Random_Forest 4
Linear_Regression 0
GBM 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu.
Random_Forest 0.6760441 0.7334036 0.7436857 0.7575291 0.7736068
Linear_Regression 0.5413179 0.6035431 0.6588522 0.6459879 0.6866185
GBM 0.6299870 0.7024621 0.7699587 0.7500499 0.7911640
Max. NA's
Random_Forest 0.8682025 4
Linear_Regression 0.7100061 0
GBM 0.8270775 0
Once again, the GBM and Random_Forest models outperformed the linear regression model. The mean RMSE for Random_Forest is the lowest, so the random forest edges out the other two.
To reach a concrete conclusion, let's predict age on the test data set with each of the three models and compare their RMSE values.
# predicting using Linear Regression Model
pred_lm <- stats::predict(object = model_lm, test_set)
#RMSE
error_lm <- pred_lm - test_set$age
rmse_lm <- sqrt(mean(error_lm^2))
rmse_lm
[1] 16.61794
The Linear Regression model did not perform well, as expected.
# predicting using GBM
pred_gbm <- stats::predict(object = model_gbm, test_set)
#RMSE
error_gbm <- pred_gbm - test_set$age
rmse_gbm <- sqrt(mean(error_gbm^2))
rmse_gbm
[1] 14.0249
The GBM model performed noticeably better than the linear regression model, and its test RMSE is close to the cross-validated estimate from training.
# predicting using Random Forest Model
pred_rf <- stats::predict(object = model_forest, test_set)
#RMSE
error_rf <- pred_rf - test_set$age
rmse_rf <- sqrt(mean(error_rf^2))
rmse_rf
[1] 13.99354
The random forest performed best of the three models on the test set, with a test RMSE in line with its cross-validated training estimate.
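As a cross-check, caret's postResample function returns RMSE, R-squared and MAE in one call and should agree with the manual calculation above:
# RMSE, Rsquared and MAE for the random forest predictions on the test set
caret::postResample(pred = pred_rf, obs = test_set$age)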
In conclusion, this workflow can be used to compare multiple models and select the best-performing one based on mean RMSE. The caret package can also fit many other model types in a few lines of code, depending on the business need or problem statement.
We could have added more intermediate steps, such as exploratory data analysis, residual diagnostics, extracting model coefficients, or additional plots along the way. This was meant to be a quick workflow to get you started on some machine learning concepts.
Happy Reading and Keep Learning.