In this regression modelling problem, we will investigate the Boston House Price dataset and try to predict the median value of owner-occupied homes (medv). The data is available in the ‘mlbench’ package, and each record describes a Boston suburb or town.
# Load packages
library(mlbench)
library(caret)
library(corrplot)
# Load data
data(BostonHousing)
Let’s split the loaded dataset into train and test sets. We will use 80% of the data to train our models and hold out the remaining 20% to test them.
set.seed(101)
sample <- createDataPartition(BostonHousing$medv, p=0.80, list = FALSE)
train <- BostonHousing[sample,]
test <- BostonHousing[-sample,]
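As a quick sanity check (our own addition): BostonHousing has 506 rows, so the 80/20 partition should leave about 407 rows for training and 99 for testing.
nrow(train)  # 407
nrow(test)   # 99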
Let’s look at the structure and summary of our training data to get a sense of what the data looks like.
str(train)
## 'data.frame': 407 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : num 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ b : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(train)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 0:380
## 1st Qu.: 0.07923 1st Qu.: 0.00 1st Qu.: 5.19 1: 27
## Median : 0.25356 Median : 0.00 Median : 9.69
## Mean : 3.78855 Mean : 11.17 Mean :11.21
## 3rd Qu.: 3.80591 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :88.97620 Max. :100.00 Max. :27.74
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4530 1st Qu.:5.882 1st Qu.: 45.25 1st Qu.: 2.083
## Median :0.5380 Median :6.202 Median : 77.70 Median : 3.092
## Mean :0.5581 Mean :6.285 Mean : 68.95 Mean : 3.730
## 3rd Qu.:0.6310 3rd Qu.:6.624 3rd Qu.: 94.05 3rd Qu.: 5.117
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio b
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 2.52
## 1st Qu.: 4.000 1st Qu.:277.0 1st Qu.:17.40 1st Qu.:376.36
## Median : 5.000 Median :334.0 Median :19.10 Median :392.18
## Mean : 9.688 Mean :412.2 Mean :18.46 Mean :360.68
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.31
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.730 Min. : 5.00
## 1st Qu.: 7.185 1st Qu.:17.05
## Median :11.640 Median :21.20
## Mean :12.815 Mean :22.55
## 3rd Qu.:17.115 3rd Qu.:25.00
## Max. :37.970 Max. :50.00
As we can see, the training dataset has 407 observations and 14 variables. One of the attributes (chas) is a factor while all of the others are numeric. Let’s convert ‘chas’ to numeric.
train$chas <- as.numeric(train$chas)
str(train$chas)
## num [1:407] 1 1 1 1 1 1 1 1 1 1 ...
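Note that as.numeric() on a factor returns the underlying level codes (1 and 2 here), not the original ‘0’/‘1’ labels. For a binary dummy this shift is harmless for correlation and modelling purposes, but if the original coding matters, the usual idiom converts via character first; a quick sketch on the raw data:
# Recovers the original 0/1 coding from the factor
head(as.numeric(as.character(BostonHousing$chas)), 10)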
Let’s look at histograms of each attribute to get a sense of the data distributions.
par(mfrow=c(3,5))
for(i in 1:13) {
  hist(train[,i], main=names(train)[i])
}
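Several attributes, such as crim and zn, look strongly right-skewed. If the e1071 package is installed (caret suggests it), we can quantify this; a quick sketch:
# Skewness of each numeric predictor; values far from 0 indicate skew
apply(train[,1:13], 2, e1071::skewness)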
Let’s look at the same distributions using density plots.
par(mfrow=c(3,5))
for(i in 1:13) {
  plot(density(train[,i]), main=names(train)[i])
}
Let’s look at the data with box and whisker plots of each attribute.
par(mfrow=c(3,5))
for (i in 1:13) {
  boxplot(train[,i], main=names(train)[i])
}
Let’s look at a scatter plot matrix to visualize interactions between variables.
pairs(train[,1:13])
Let’s look at a correlation plot to see how correlated our predictor variables are:
library(corrplot)
cordata <- cor(train[,1:13])
corrplot(cordata, method='circle')
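The plot highlights some strongly correlated pairs, such as rad and tax. To flag them programmatically, caret’s findCorrelation() returns the columns whose pairwise correlation exceeds a cutoff; a minimal sketch, with 0.75 as an arbitrary cutoff:
# Indices of predictors with high pairwise correlation (removal candidates)
highCor <- findCorrelation(cordata, cutoff=0.75)
names(train)[highCor]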
We will use 10-fold cross-validation with 3 repeats. This splits the training set into 10 parts; each model is trained on 9 parts and tested on the held-out part, rotating through all 10 folds, and the whole procedure is repeated 3 times (30 resamples in total).
Since the attributes are on very different scales, we will standardize the data (center and scale) for the baseline models. Models will be evaluated using RMSE and R2: RMSE indicates how far off the predictions are, in the units of medv, while R2 indicates how well the model explains the variance in the data.
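For reference, both metrics are easy to compute by hand. A minimal sketch (these helpers are our own; caret computes the metrics internally, and reports Rsquared as the squared correlation between observed and predicted values):
# Typical size of the prediction error, in the units of the response
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
# Squared correlation of observed vs. predicted, as caret reports it
r2 <- function(actual, predicted) cor(actual, predicted)^2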
# 10 fold cross validation
library(caret)
control <- trainControl(method='repeatedcv', number=10, repeats=3)
metric <- 'RMSE'
# Linear Regression (LR)
set.seed(101)
fit.lm <- train(medv~., data=train, method='lm', metric=metric,
preProc=c('center', 'scale'), trControl=control)
# Generalized Linear Model (GLM)
set.seed(101)
fit.glm <- train(medv~., data=train, method='glm', metric=metric,
preProc=c('center', 'scale'), trControl=control)
# Penalized Linear Regression (GLMNET)
set.seed(101)
fit.glmnet <- train(medv~., data=train, method='glmnet', metric=metric,
preProc=c('center', 'scale'), trControl=control)
# Classification and Regression Trees (CART)
set.seed(101)
grid <- expand.grid(.cp=c(0, 0.05, 0.1))
fit.cart <- train(medv~., data=train, method='rpart', metric=metric,
preProc=c('center', 'scale'), trControl=control,
tuneGrid=grid)
# Support Vector Machines (SVM)
set.seed(101)
fit.svm <- train(medv~., data=train, method='svmRadial', metric=metric,
preProc=c('center', 'scale'), trControl=control)
# k-Nearest Neighbors (KNN)
set.seed(101)
fit.knn <- train(medv~., data=train, method='knn', metric=metric,
preProc=c('center', 'scale'), trControl=control)
# Compare the results of these algorithms
boston.results <- resamples(list(lm=fit.lm, glm=fit.glm,
glmnet=fit.glmnet, cart=fit.cart,
svm=fit.svm, knn=fit.knn))
# Summary and Plot of Results
summary(boston.results)
##
## Call:
## summary.resamples(object = boston.results)
##
## Models: lm, glm, glmnet, cart, svm, knn
## Number of resamples: 30
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm 3.615 4.100 4.629 4.808 5.369 6.814 0
## glm 3.615 4.100 4.629 4.808 5.369 6.814 0
## glmnet 3.615 4.100 4.629 4.808 5.369 6.814 0
## cart 2.877 3.635 4.355 4.404 4.824 6.565 0
## svm 1.770 2.843 3.350 3.779 4.613 6.544 0
## knn 2.173 3.660 4.492 4.630 5.606 7.359 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm 0.6229 0.6754 0.7482 0.7349 0.7877 0.8725 0
## glm 0.6229 0.6754 0.7482 0.7349 0.7877 0.8725 0
## glmnet 0.6229 0.6754 0.7482 0.7349 0.7877 0.8725 0
## cart 0.4956 0.7164 0.7917 0.7702 0.8429 0.9210 0
## svm 0.5615 0.7951 0.8740 0.8357 0.8895 0.9541 0
## knn 0.5407 0.6856 0.7818 0.7604 0.8443 0.9436 0
dotplot(boston.results)
We know that some of the attributes are skewed and others have exponential-like distributions. One option would be to apply simple transformations such as log or square root to individual variables; another is to use a more general power transformation like Box-Cox to rescale the original data and then evaluate its effect on the same 6 algorithms.
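Before rerunning the models, we can peek at which predictors caret will transform and the lambda values it estimates. A quick sketch using preProcess() directly (we assume, as in current caret versions, that the fitted Box-Cox transforms are stored in the bc element of the preProcess object):
# Fit Box-Cox on the predictors alone; variables with zeros or too few
# unique values are skipped, since Box-Cox requires positive data
bcParams <- preProcess(train[,1:13], method='BoxCox')
bcParams
# Estimated lambda for each transformed variable
sapply(bcParams$bc, function(z) z$lambda)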
# 10 fold cross validation
library(caret)
control <- trainControl(method='repeatedcv', number=10, repeats=3)
metric <- 'RMSE'
# Linear Regression (LR)
set.seed(101)
fit.lm <- train(medv~., data=train, method='lm', metric=metric,
preProc=c('center','scale','BoxCox'), trControl=control)
# Generalized Linear Model (GLM)
set.seed(101)
fit.glm <- train(medv~., data=train, method='glm', metric=metric,
preProc=c('center','scale','BoxCox'), trControl=control)
# Penalized Linear Regression (GLMNET)
set.seed(101)
fit.glmnet <- train(medv~., data=train, method='glmnet',
metric=metric, preProc=c('center', 'scale', 'BoxCox'),
trControl=control)
# Classification and Regression Trees (CART)
set.seed(101)
grid <- expand.grid(.cp=c(0, 0.05, 0.1))
fit.cart <- train(medv~., data=train, method='rpart',
metric=metric, preProc=c('center', 'scale','BoxCox'),
trControl=control, tuneGrid=grid)
# Support Vector Machines (SVM)
set.seed(101)
fit.svm <- train(medv~., data=train, method='svmRadial',
metric=metric, preProc=c('center', 'scale','BoxCox'),
trControl=control)
# k-Nearest Neighbors (KNN)
set.seed(101)
fit.knn <- train(medv~., data=train, method='knn', metric=metric,
preProc=c('center', 'scale','BoxCox'), trControl=control)
# Compare the results of these algorithms
boston.results <- resamples(list(lm=fit.lm, glm=fit.glm,
glmnet=fit.glmnet, cart=fit.cart,
svm=fit.svm, knn=fit.knn))
# Summary and Plot of Results
summary(boston.results)
##
## Call:
## summary.resamples(object = boston.results)
##
## Models: lm, glm, glmnet, cart, svm, knn
## Number of resamples: 30
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm 3.089 3.961 4.347 4.470 4.913 6.074 0
## glm 3.089 3.961 4.347 4.470 4.913 6.074 0
## glmnet 3.089 3.961 4.347 4.470 4.913 6.074 0
## cart 2.877 3.635 4.355 4.404 4.824 6.565 0
## svm 1.855 2.619 3.160 3.604 4.466 6.355 0
## knn 2.309 3.731 4.465 4.661 5.697 7.183 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm 0.6182 0.7271 0.7738 0.7690 0.8114 0.8872 0
## glm 0.6182 0.7271 0.7738 0.7690 0.8114 0.8872 0
## glmnet 0.6182 0.7271 0.7738 0.7690 0.8114 0.8872 0
## cart 0.4956 0.7186 0.7917 0.7703 0.8429 0.9210 0
## svm 0.6381 0.8224 0.8886 0.8519 0.9076 0.9605 0
## knn 0.5439 0.6902 0.7892 0.7535 0.8397 0.9414 0
dotplot(boston.results)
We can try to improve the best-performing algorithms by tuning their hyperparameters. Let’s start with the SVM, which had the lowest RMSE so far:
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel
##
## 407 samples
## 13 predictor
##
## Pre-processing: centered (13), scaled (13), Box-Cox transformation (11)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 365, 367, 367, 366, 366, 366, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared
## 0.25 4.521581 0.7890796
## 0.50 4.031453 0.8208290
## 1.00 3.604246 0.8519076
##
## Tuning parameter 'sigma' was held constant at a value of 0.08776884
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.08776884 and C = 1.
‘C’ is the cost parameter: it controls how heavily the SVM penalizes training errors. The previous results show that C = 1 was the best of the default candidates, so let’s expand the grid around C = 1 and see whether a larger value improves RMSE.
‘sigma’ is the smoothing (bandwidth) parameter of the radial basis kernel. The default search held it constant at about 0.088, so let’s design our grid to include values on both sides of 0.1; the sketch below shows where that default estimate comes from.
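caret derives the held-constant sigma from kernlab’s sigest(), which returns a quantile-based range of plausible values estimated from the data; inspecting it is a reasonable way to bracket the grid. A minimal sketch:
library(kernlab)
set.seed(101)
# Returns the 0.1, 0.5 and 0.9 quantiles of plausible sigma values
sigest(medv ~ ., data=train)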
# Tune the SVM model to check performance
set.seed(101)
trainControl <- trainControl(method='repeatedcv',number=10,repeats=3)
metric <- 'RMSE'
svmGrid <- expand.grid(.sigma=c(0.025,0.05,0.1,0.15,0.25),
.C = seq(1, 10, by=1))
svmModel <- train(medv~., data=train, method='svmRadial',
preProc=c('BoxCox'), metric=metric, tuneGrid=svmGrid,
trControl=trainControl)
svmModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 407 samples
## 13 predictor
##
## Pre-processing: Box-Cox transformation (11)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 365, 367, 367, 366, 366, 366, ...
## Resampling results across tuning parameters:
##
## sigma C RMSE Rsquared
## 0.025 1 3.757107 0.8383220
## 0.025 2 3.539589 0.8522318
## 0.025 3 3.406374 0.8620623
## 0.025 4 3.333400 0.8676599
## 0.025 5 3.287577 0.8710665
## 0.025 6 3.253241 0.8736106
## 0.025 7 3.232637 0.8749116
## 0.025 8 3.215471 0.8759147
## 0.025 9 3.199100 0.8767889
## 0.025 10 3.184031 0.8777603
## 0.050 1 3.631151 0.8482103
## 0.050 2 3.359205 0.8668207
## 0.050 3 3.261465 0.8729552
## 0.050 4 3.182876 0.8782130
## 0.050 5 3.126305 0.8822203
## 0.050 6 3.086620 0.8851183
## 0.050 7 3.054302 0.8874128
## 0.050 8 3.028543 0.8891427
## 0.050 9 3.004788 0.8906301
## 0.050 10 2.989623 0.8914921
## 0.100 1 3.596656 0.8529271
## 0.100 2 3.225256 0.8763259
## 0.100 3 3.095094 0.8848562
## 0.100 4 3.042610 0.8885441
## 0.100 5 3.023906 0.8896181
## 0.100 6 3.005662 0.8905296
## 0.100 7 2.989058 0.8912053
## 0.100 8 2.977093 0.8915187
## 0.100 9 2.969461 0.8916046
## 0.100 10 2.972984 0.8912742
## 0.150 1 3.680767 0.8470838
## 0.150 2 3.253236 0.8743995
## 0.150 3 3.135324 0.8821825
## 0.150 4 3.105191 0.8843049
## 0.150 5 3.086993 0.8852963
## 0.150 6 3.081799 0.8853711
## 0.150 7 3.085234 0.8849822
## 0.150 8 3.092672 0.8846098
## 0.150 9 3.104102 0.8838316
## 0.150 10 3.117216 0.8828568
## 0.250 1 3.957359 0.8280356
## 0.250 2 3.508160 0.8572508
## 0.250 3 3.413738 0.8631026
## 0.250 4 3.404592 0.8641139
## 0.250 5 3.405156 0.8638888
## 0.250 6 3.415847 0.8627721
## 0.250 7 3.430505 0.8613876
## 0.250 8 3.445842 0.8600126
## 0.250 9 3.458396 0.8588488
## 0.250 10 3.468829 0.8579968
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.1 and C = 9.
plot(svmModel)
Let’s also evaluate some ensemble methods (Random Forest, Gradient Boosting Machines, and Cubist) using the same resampling setup.
trainControl <- trainControl(method='repeatedcv', number=10, repeats=3)
metric <- 'RMSE'
# Random Forest
set.seed(101)
fit.rf <- train(medv~., data=train, method='ranger', metric=metric,
preProc=c('BoxCox'), trControl = trainControl)
# Gradient Boosting Machines (GBM)
set.seed(101)
fit.gbm <- train(medv~., data=train, method='gbm', metric=metric,
preProc=c('BoxCox'), trControl = trainControl)
# CUBIST
set.seed(101)
fit.cubist <- train(medv~., data=train, method='cubist', metric=metric,
preProc=c('BoxCox'), trControl = trainControl)
# Compare Algorithms
ensembleResults <- resamples(list(rf=fit.rf, gbm=fit.gbm,
cubist=fit.cubist))
summary(ensembleResults)
##
## Call:
## summary.resamples(object = ensembleResults)
##
## Models: rf, gbm, cubist
## Number of resamples: 30
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 2.154 2.688 3.082 3.314 3.757 4.898 0
## gbm 2.287 2.714 3.106 3.321 3.928 5.135 0
## cubist 1.891 2.369 2.636 3.037 3.484 5.140 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 0.7610 0.8493 0.8859 0.8745 0.9190 0.9436 0
## gbm 0.6686 0.8379 0.8833 0.8710 0.9165 0.9412 0
## cubist 0.6954 0.8546 0.9190 0.8895 0.9365 0.9576 0
dotplot(ensembleResults)
Let’s tune our cubist model to see if we can achieve better results. Before tuning, let’s check which parameters can be tuned and the values the default model used.
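caret’s modelLookup() lists the tunable parameters for any method; for cubist these are the number of committees and the number of neighbors:
modelLookup('cubist')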
fit.cubist
## Cubist
##
## 407 samples
## 13 predictor
##
## Pre-processing: Box-Cox transformation (11)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 365, 367, 367, 366, 366, 366, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared
## 1 0 3.762387 0.8329693
## 1 5 3.427366 0.8608547
## 1 9 3.440933 0.8603692
## 10 0 3.296426 0.8720001
## 10 5 3.037153 0.8894882
## 10 9 3.058768 0.8881030
## 20 0 3.345520 0.8683533
## 20 5 3.087801 0.8864098
## 20 9 3.107175 0.8851953
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 10 and neighbors = 5.
We can see that the final model used committees = 10 and neighbors = 5.
Let’s create a grid search around the values used in our initial cubist model to further tune the model.
set.seed(101)
trainControl <- trainControl(method='repeatedcv', number=10, repeats=3)
metric <- 'RMSE'
grid <- expand.grid(.committees=seq(10,20,by=1), .neighbors=c(3,5,7))
tune.cubist <- train(medv~., data=train, method='cubist', metric=metric,
preProc=c('BoxCox'), tuneGrid=grid, trControl=trainControl)
tune.cubist
## Cubist
##
## 407 samples
## 13 predictor
##
## Pre-processing: Box-Cox transformation (11)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 365, 367, 367, 366, 366, 366, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared
## 10 3 2.985772 0.8930905
## 10 5 3.037153 0.8894882
## 10 7 3.049847 0.8886975
## 11 3 2.988827 0.8932478
## 11 5 3.044064 0.8893887
## 11 7 3.054479 0.8887430
## 12 3 2.994430 0.8924002
## 12 5 3.052855 0.8883584
## 12 7 3.063723 0.8876571
## 13 3 2.982689 0.8938442
## 13 5 3.042516 0.8897098
## 13 7 3.050520 0.8892734
## 14 3 3.014669 0.8911097
## 14 5 3.075240 0.8869582
## 14 7 3.083127 0.8865349
## 15 3 3.008350 0.8921653
## 15 5 3.068108 0.8881275
## 15 7 3.074275 0.8878248
## 16 3 3.014598 0.8912904
## 16 5 3.073795 0.8871653
## 16 7 3.082422 0.8866903
## 17 3 3.005728 0.8922957
## 17 5 3.068642 0.8880548
## 17 7 3.078560 0.8875011
## 18 3 3.009138 0.8918306
## 18 5 3.071875 0.8875600
## 18 7 3.082250 0.8869850
## 19 3 3.019241 0.8912426
## 19 5 3.081525 0.8869878
## 19 7 3.090376 0.8864601
## 20 3 3.025336 0.8906776
## 20 5 3.087801 0.8864098
## 20 7 3.098079 0.8858063
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 13 and neighbors = 3.
plot(tune.cubist)
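Rather than reading the winning combination off the long table, we can pull it straight from the fitted train object; a small sketch:
tune.cubist$bestTune           # committees = 13, neighbors = 3
min(tune.cubist$results$RMSE)  # best cross-validated RMSE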
We can see that tuning the cubist model further improved our results. The tuned model with committees = 13 and neighbors = 3 has an RMSE of 2.9827 and R2 of 0.8938, on par with our tuned SVM model (RMSE = 2.9695 and R2 = 0.8916). Let’s train a final cubist model on the full training set with these parameters and evaluate it on the held-out test data.
# Prepare the training data
set.seed(101)
x <- train[,1:13]
y <- train[,14]
# Preprocess the data using Box-Cox transformation
preParms <- preProcess(x, method=c('BoxCox'))
transX <- predict(preParms, x)
# Train the final model on the full training set
library(Cubist)
finalModel <- cubist(x=transX, y=y, committees=13)
# Apply the same Box-Cox transformation to the test data
set.seed(101)
testX <- test[,1:13]
testY <- test[,14]
testTransX <- predict(preParms, testX)
# Use final model to make prediction
prediction <- predict(finalModel, newdata=testTransX, neighbors=3)
# Calculate RMSE and R2 on the test set
RMSE <- sqrt(mean((testY - prediction)^2))
sse <- sum((testY - prediction)^2)
sst <- sum((mean(y) - testY)^2)  # baseline: training-set mean of medv
R2 <- 1 - sse/sst
RMSE
## [1] 2.566475
R2
## [1] 0.9205573
We can see that the RMSE on the unseen test data is 2.566, better than the cross-validated estimate of 2.9827. R2 has also improved to 0.9206.
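As a final visual check (our own addition, not part of the original analysis), plotting predictions against actuals on the test set makes any systematic bias easy to spot:
# Points near the red identity line are accurate predictions
plot(testY, prediction, xlab='Actual medv', ylab='Predicted medv')
abline(0, 1, col='red')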