Predicting Boston House Prices

In this regression modelling problem, we will investigate the Boston House Price dataset and try to predict the median value of owner occupied homes. The data is available in ‘mlbench’ library and each record in the database describes a Boston suburb or town.

The variables in the dataset are as follows:

CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centers
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk-0.63)^2 where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
MEDV: Median value of owner-occupied homes in $1000s

1. Load and Check Data

1.1 Let’s load the required packages and the dataset

# Load packages
library(mlbench)
library(caret)
library(corrplot)

# Load data
data(BostonHousing)

1.2 Let’s split the data

Let’s split the loaded dataset into train and test sets. We will use 80% of the data to train our models and 20% will be used to test the models.

set.seed(101)
sample <- createDataPartition(BostonHousing$medv, p=0.80, list = FALSE)
train <- BostonHousing[sample,]
test <- BostonHousing[-sample,]

1.3 Let’s summarize the data

Let’s look at the structure and summary of our training data to get some sense of how our data looks like.

str(train)

## 'data.frame':    407 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : num  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ b      : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

summary(train)

##       crim                zn             indus       chas   
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:380  
##  1st Qu.: 0.07923   1st Qu.:  0.00   1st Qu.: 5.19   1: 27  
##  Median : 0.25356   Median :  0.00   Median : 9.69          
##  Mean   : 3.78855   Mean   : 11.17   Mean   :11.21          
##  3rd Qu.: 3.80591   3rd Qu.: 12.50   3rd Qu.:18.10          
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74          
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4530   1st Qu.:5.882   1st Qu.: 45.25   1st Qu.: 2.083  
##  Median :0.5380   Median :6.202   Median : 77.70   Median : 3.092  
##  Mean   :0.5581   Mean   :6.285   Mean   : 68.95   Mean   : 3.730  
##  3rd Qu.:0.6310   3rd Qu.:6.624   3rd Qu.: 94.05   3rd Qu.: 5.117  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio            b         
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  2.52  
##  1st Qu.: 4.000   1st Qu.:277.0   1st Qu.:17.40   1st Qu.:376.36  
##  Median : 5.000   Median :334.0   Median :19.10   Median :392.18  
##  Mean   : 9.688   Mean   :412.2   Mean   :18.46   Mean   :360.68  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.31  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat             medv      
##  Min.   : 1.730   Min.   : 5.00  
##  1st Qu.: 7.185   1st Qu.:17.05  
##  Median :11.640   Median :21.20  
##  Mean   :12.815   Mean   :22.55  
##  3rd Qu.:17.115   3rd Qu.:25.00  
##  Max.   :37.970   Max.   :50.00

As we can see, the training dataset has 407 observations and 14 variables. One of the attributes (chas) is a factor while all of the others are numeric. Let’s convert ‘chas’ to numeric.

1.4 Let’s convert ‘chas’ to numeric

train$chas <- as.numeric(train$chas)
str(train$chas)

##  num [1:407] 1 1 1 1 1 1 1 1 1 1 ...

2. Visualize Data

We are going to look at two types of plots:

Unimodal plots to better understand each attribute
Multimodal plots to understand the relationships between attributes

2.1 Histogram (Unimodal)

Let’s look at histograms of each attribute to get a sense of the data distributions.

par(mfrow=c(3,5))
for(i in 1:13) {
  hist(train[,i], main=names(train)[i])
}

Looking at the plots, we can see that:

Attributes such as crim, zn, age may have an exponential distribution
Other attributes such as rad, tax may have bimodal distribution

2.2 Density Plot (Unimodal)

Let’s look at the same distributions using density plots

par(mfrow=c(3,5))
for(i in 1:13) {
  plot(density(train[,i]), main=names(train)[i])
}

Looking at the plots, we can see that:

Attributes such as nox, rm, lstat, ptratio may have skewed normal distribution

2.3 Box-Whisker Plot (Unimodal)

Let’s look at the data with box and whisker plots of each attribute.

par(mfrow=c(3,5))
for (i in 1:13) {
  boxplot(train[,i], main=names(train)[i])
}

2.4 Scatter Plot Matrix (Multimodal)

Let’s look at a scatter plot matrix to visualize interactions between variables.

pairs(train[,1:13])

Looking at the plots, we can see that:

Some of the correlated attributes like ‘nox and age’, ‘nox and dis’ do show predictable curved relationships.

2.5 Correlation Plot (Multimodal)

Let’s look at a correlation plot to see how correlated our predictor variables are:

library(corrplot)
cordata <- cor(train[,1:13])
corrplot(cordata, method='circle')

Looking at the plots, we can see that (except for the diagnal):

Attributes like ‘tax and rad’, ‘nox and tax’, ‘age and indus’ have positive correlation. Larger darker blue dots suggest storng positive relationship.
Attributes like ‘dis and nox’, ‘dis and indus’, ‘age and dis’ have negative correlation. Larger darker red dots suggest storng negative relationship.

3. Baseline Models

Let’s build some baseline models and evaluate them. We will perform the following in this step:

Build 6 different models to predict median value
Feature selection - remove highly correlated variables
Variable transformation -
- Standardize the data by combining the scale and center transforms
- Normalize the data to reduce the effect of different scaling
- Transform to see if flattening out some of the distributions improves accuracy.

3.1 Create Baseline Models

We will use 10-fold cross validation with 3 repeats. This will split our dataset into 10 parts, train in 9 and test on 1 and repeat for all combinations of train-test splits.

Let’s use a mixture of the following 6 algorithms:

Linear Algorithms: Linear Regression (LR), Generalized Linear Regression (GLM) and Penalized Linear Regression (GLMNET)
Nonlinear Algorithms: Classification and Regression Trees (CART), Support Vector Machines (SVM) and k-Nearest Neighbors (KNN)

Since the data has differing scales, we will standardize the data for the baseline models. Models will be evaluated using RMSE and R2. RMSE will give us an idea of how wrong the predictions are and R2 will tell us how well the models fit the data.

# 10 fold cross validation
library(caret)
control <- trainControl(method='repeatedcv', number=10, repeats=3)
metric <- 'RMSE'

# Linear Regression (LR)
set.seed(101)
fit.lm <- train(medv~., data=train, method='lm', metric=metric, 
                preProc=c('center', 'scale'), trControl=control)

# Generalized Linear Regression (GLM)
set.seed(101)
fit.glm <- train(medv~., data=train, method='glm', metric=metric, 
                 preProc=c('center', 'scale'), trControl=control)

# Penalized Linear Regression (GLMNET)
set.seed(101)
fit.glmnet <- train(medv~., data=train, method='glm', metric=metric, 
                    preProc=c('center', 'scale'), trControl=control)

# Classification and Regression Trees (CART)
set.seed(101)
grid <- expand.grid(.cp=c(0, 0.05, 0.1))
fit.cart <- train(medv~., data=train, method='rpart', metric=metric, 
                    preProc=c('center', 'scale'), trControl=control,
                    tuneGrid=grid)

# Support Vector Machines (SVM) 
set.seed(101)
fit.svm <- train(medv~., data=train, method='svmRadial', metric=metric,
                 preProc=c('center', 'scale'), trControl=control)

# k-Nearest Neighbors (KNN)
set.seed(101)
fit.knn <- train(medv~., data=train, method='knn', metric=metric, 
                    preProc=c('center', 'scale'), trControl=control)

# Compare the results of these algorithms
boston.results <- resamples(list(lm=fit.lm, glm=fit.glm,
                                 glmnet=fit.glmnet, cart=fit.cart,
                                 svm=fit.svm, knn=fit.knn))

# Summary and Plot of Results
summary(boston.results)

## 
## Call:
## summary.resamples(object = boston.results)
## 
## Models: lm, glm, glmnet, cart, svm, knn 
## Number of resamples: 30 
## 
## RMSE 
##         Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
## lm     3.615   4.100  4.629 4.808   5.369 6.814    0
## glm    3.615   4.100  4.629 4.808   5.369 6.814    0
## glmnet 3.615   4.100  4.629 4.808   5.369 6.814    0
## cart   2.877   3.635  4.355 4.404   4.824 6.565    0
## svm    1.770   2.843  3.350 3.779   4.613 6.544    0
## knn    2.173   3.660  4.492 4.630   5.606 7.359    0
## 
## Rsquared 
##          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## lm     0.6229  0.6754 0.7482 0.7349  0.7877 0.8725    0
## glm    0.6229  0.6754 0.7482 0.7349  0.7877 0.8725    0
## glmnet 0.6229  0.6754 0.7482 0.7349  0.7877 0.8725    0
## cart   0.4956  0.7164 0.7917 0.7702  0.8429 0.9210    0
## svm    0.5615  0.7951 0.8740 0.8357  0.8895 0.9541    0
## knn    0.5407  0.6856 0.7818 0.7604  0.8443 0.9436    0

dotplot(boston.results)

It looks like:

SVM has the lowest error (3.779), followed closely by the other nonlinear algorithms CART (4.404) and KNN (4.63).
Data seems to fit better to non linear algorithms - higher R2 (SVM = 0.836, CART = .7702)
The linear algorithms all appear to have slightly worse error.

3.2 Remove Highly Correlated Variables and build models

3.2.1 Let’s remove highly correlated variables and create a new dataset

# Highly correlated variables
correlated <- cor(train[,1:13])
highCorr <- findCorrelation(correlated, cutoff=0.70)

# Identify variables that are highly correlated
names(train[highCorr])

## [1] "indus" "nox"   "tax"   "dis"

# Create new dataset w/o highly correlated variables
corr_data <- train[,-highCorr]

3.2.2 Let’s create models with the new dataset

# 10 fold cross validation
library(caret)
control <- trainControl(method='repeatedcv', number=10, repeats=3)
metric <- 'RMSE'

# Linear Regression (LR)
set.seed(101)
fit.lm <- train(medv~., data=corr_data, method='lm', metric=metric, 
                preProc=c('center', 'scale'), trControl=control)

# Generalized Linear Regression (GLM)
set.seed(101)
fit.glm <- train(medv~., data=corr_data, method='glm', metric=metric,
                 preProc=c('center', 'scale'), trControl=control)

# Penalized Linear Regression (GLMNET)
set.seed(101)
fit.glmnet <- train(medv~., data=corr_data, method='glm',
                    metric=metric, preProc=c('center', 'scale'),
                    trControl=control)

# Classification and Regression Trees (CART)
set.seed(101)
grid <- expand.grid(.cp=c(0, 0.05, 0.1))
fit.cart <- train(medv~., data=corr_data, method='rpart',
                  metric=metric, preProc=c('center', 'scale'), 
                  trControl=control, tuneGrid=grid)

# Support Vector Machines (SVM) 
set.seed(101)
fit.svm <- train(medv~., data=corr_data, method='svmRadial',
                 metric=metric, preProc=c('center', 'scale'), 
                 trControl=control)

# k-Nearest Neighbors (KNN)
set.seed(101)
fit.knn <- train(medv~., data=corr_data, method='knn', metric=metric,
                 preProc=c('center', 'scale'), trControl=control)

# Compare the results of these algorithms
boston.results <- resamples(list(lm=fit.lm, glm=fit.glm,
                                 glmnet=fit.glmnet, cart=fit.cart,
                                 svm=fit.svm, knn=fit.knn))

# Summary and Plot of Results
summary(boston.results)

## 
## Call:
## summary.resamples(object = boston.results)
## 
## Models: lm, glm, glmnet, cart, svm, knn 
## Number of resamples: 30 
## 
## RMSE 
##         Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
## lm     3.992   4.447  4.906 5.192   5.699 7.151    0
## glm    3.992   4.447  4.906 5.192   5.699 7.151    0
## glmnet 3.992   4.447  4.906 5.192   5.699 7.151    0
## cart   3.010   3.542  4.045 4.370   4.852 6.606    0
## svm    2.456   3.436  3.829 4.209   4.862 7.061    0
## knn    2.644   3.537  3.947 4.380   5.270 6.656    0
## 
## Rsquared 
##          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## lm     0.5443  0.6190 0.7108 0.6883  0.7502 0.8150    0
## glm    0.5443  0.6190 0.7108 0.6883  0.7502 0.8150    0
## glmnet 0.5443  0.6190 0.7108 0.6883  0.7502 0.8150    0
## cart   0.4769  0.7077 0.7837 0.7747  0.8592 0.9144    0
## svm    0.4393  0.7577 0.8199 0.7956  0.8624 0.9479    0
## knn    0.5633  0.6960 0.8058 0.7798  0.8626 0.9259    0

dotplot(boston.results)

It looks like:

Removing highly correlated variables has made RMSE worse for linear and non-linear algorithms. RMSE for SVM has increased to 4.209 (from 3.779).
The attributes we removed are probably contributing to the accuracy of our models.

3.3 Transform Variables and build models

We know that some of the attributes are skewed and others have exponential distribution. One option would be to use square and log transformations. Other option would be to use a more powerful transformation like Box-Cox to rescale the original data and evaluate the effect on 6 algorithms.

3.3.1 Let’s use Box Cox transformation and run the algorithms

# 10 fold cross validation
library(caret)
control <- trainControl(method='repeatedcv', number=10, repeats=3)
metric <- 'RMSE'

# Linear Regression (LR)
set.seed(101)
fit.lm <- train(medv~., data=train, method='lm', metric=metric, 
            preProc=c('center','scale','BoxCox'), trControl=control)

# Generalized Linear Regression (GLM)
set.seed(101)
fit.glm <- train(medv~., data=train, method='glm', metric=metric,
            preProc=c('center','scale','BoxCox'), trControl=control)

# Penalized Linear Regression (GLMNET)
set.seed(101)
fit.glmnet <- train(medv~., data=train, method='glm',
              metric=metric, preProc=c('center', 'scale', 'BoxCox'),
              trControl=control)

# Classification and Regression Trees (CART)
set.seed(101)
grid <- expand.grid(.cp=c(0, 0.05, 0.1))
fit.cart <- train(medv~., data=train, method='rpart',
            metric=metric, preProc=c('center', 'scale','BoxCox'), 
                  trControl=control, tuneGrid=grid)

# Support Vector Machines (SVM) 
set.seed(101)
fit.svm <- train(medv~., data=train, method='svmRadial',
            metric=metric, preProc=c('center', 'scale','BoxCox'), 
                 trControl=control)

# k-Nearest Neighbors (KNN)
set.seed(101)
fit.knn <- train(medv~., data=train, method='knn', metric=metric,
            preProc=c('center', 'scale','BoxCox'), trControl=control)

# Compare the results of these algorithms
boston.results <- resamples(list(lm=fit.lm, glm=fit.glm,
                                 glmnet=fit.glmnet, cart=fit.cart,
                                 svm=fit.svm, knn=fit.knn))

# Summary and Plot of Results
summary(boston.results)

## 
## Call:
## summary.resamples(object = boston.results)
## 
## Models: lm, glm, glmnet, cart, svm, knn 
## Number of resamples: 30 
## 
## RMSE 
##         Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
## lm     3.089   3.961  4.347 4.470   4.913 6.074    0
## glm    3.089   3.961  4.347 4.470   4.913 6.074    0
## glmnet 3.089   3.961  4.347 4.470   4.913 6.074    0
## cart   2.877   3.635  4.355 4.404   4.824 6.565    0
## svm    1.855   2.619  3.160 3.604   4.466 6.355    0
## knn    2.309   3.731  4.465 4.661   5.697 7.183    0
## 
## Rsquared 
##          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## lm     0.6182  0.7271 0.7738 0.7690  0.8114 0.8872    0
## glm    0.6182  0.7271 0.7738 0.7690  0.8114 0.8872    0
## glmnet 0.6182  0.7271 0.7738 0.7690  0.8114 0.8872    0
## cart   0.4956  0.7186 0.7917 0.7703  0.8429 0.9210    0
## svm    0.6381  0.8224 0.8886 0.8519  0.9076 0.9605    0
## knn    0.5439  0.6902 0.7892 0.7535  0.8397 0.9414    0

dotplot(boston.results)

We can see that:

Applying Box-Cox transformation reduced RMSE and improved R2 on all models except for CART.
RMSE for SVM has reduced to 3.604 and R2 has improved to 0.8519. Let’s see if we can tune the SVM model to improve performance.

3.4 Tune the model

We can improve the efficiency of better performing algorithms by tuning their parameters.

3.4.1 Let’s look at the default parameters used by SVM model

print(fit.svm)

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 407 samples
##  13 predictor
## 
## Pre-processing: centered (13), scaled (13), Box-Cox transformation (11) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 365, 367, 367, 366, 366, 366, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE      Rsquared 
##   0.25  4.521581  0.7890796
##   0.50  4.031453  0.8208290
##   1.00  3.604246  0.8519076
## 
## Tuning parameter 'sigma' was held constant at a value of 0.08776884
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were sigma = 0.08776884 and C = 1.

We can see that:

The value of RMSE has reduced with increasing C (cost constraint)
The value of sigma (smoothing parameter) is around 0.08

3.4.2 Tune the SVM model

‘C’ is the cost constraint. We can see from the previous results that a value of C=1 can be a good starting point. Let’s expand the grid around C value of 1 and we might see improvement in RMSE with changing C.

‘Sigma’ (smoothing parameter) is another tuning parameter that we can tune. Good values of sigma start at 0.1. Let’s design our grid to include values before and after 0.1.

# Tune the SVM model to check performance
set.seed(101)
trainControl <- trainControl(method='repeatedcv',number=10,repeats=3)
metric <- 'RMSE'
svmGrid <- expand.grid(.sigma=c(0.025,0.05,0.1,0.15,0.25), 
                       .C = seq(1, 10, by=1))
svmModel <- train(medv~., data=train, method='svmRadial', 
                preProc=c('BoxCox'), metric=metric, tuneGrid=svmGrid,
                trControl=trainControl)
svmModel

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 407 samples
##  13 predictor
## 
## Pre-processing: Box-Cox transformation (11) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 365, 367, 367, 366, 366, 366, ... 
## Resampling results across tuning parameters:
## 
##   sigma  C   RMSE      Rsquared 
##   0.025   1  3.757107  0.8383220
##   0.025   2  3.539589  0.8522318
##   0.025   3  3.406374  0.8620623
##   0.025   4  3.333400  0.8676599
##   0.025   5  3.287577  0.8710665
##   0.025   6  3.253241  0.8736106
##   0.025   7  3.232637  0.8749116
##   0.025   8  3.215471  0.8759147
##   0.025   9  3.199100  0.8767889
##   0.025  10  3.184031  0.8777603
##   0.050   1  3.631151  0.8482103
##   0.050   2  3.359205  0.8668207
##   0.050   3  3.261465  0.8729552
##   0.050   4  3.182876  0.8782130
##   0.050   5  3.126305  0.8822203
##   0.050   6  3.086620  0.8851183
##   0.050   7  3.054302  0.8874128
##   0.050   8  3.028543  0.8891427
##   0.050   9  3.004788  0.8906301
##   0.050  10  2.989623  0.8914921
##   0.100   1  3.596656  0.8529271
##   0.100   2  3.225256  0.8763259
##   0.100   3  3.095094  0.8848562
##   0.100   4  3.042610  0.8885441
##   0.100   5  3.023906  0.8896181
##   0.100   6  3.005662  0.8905296
##   0.100   7  2.989058  0.8912053
##   0.100   8  2.977093  0.8915187
##   0.100   9  2.969461  0.8916046
##   0.100  10  2.972984  0.8912742
##   0.150   1  3.680767  0.8470838
##   0.150   2  3.253236  0.8743995
##   0.150   3  3.135324  0.8821825
##   0.150   4  3.105191  0.8843049
##   0.150   5  3.086993  0.8852963
##   0.150   6  3.081799  0.8853711
##   0.150   7  3.085234  0.8849822
##   0.150   8  3.092672  0.8846098
##   0.150   9  3.104102  0.8838316
##   0.150  10  3.117216  0.8828568
##   0.250   1  3.957359  0.8280356
##   0.250   2  3.508160  0.8572508
##   0.250   3  3.413738  0.8631026
##   0.250   4  3.404592  0.8641139
##   0.250   5  3.405156  0.8638888
##   0.250   6  3.415847  0.8627721
##   0.250   7  3.430505  0.8613876
##   0.250   8  3.445842  0.8600126
##   0.250   9  3.458396  0.8588488
##   0.250  10  3.468829  0.8579968
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were sigma = 0.1 and C = 9.

plot(svmModel)

We can see that:

Sigma flattens with larger values of C
Tuning has improved the performance of our SVM model to an RMSE = 3.0133 and R2 = 0.8864.
The final values used for the model were sigma = 0.1 and C = 10.

4. Ensemble Models

Let’s build some ensemble models to see if we can further reduce the RMSE. We will look at some boosting and bagging techniques like:

Random Forest (RF) - bagging
Gradient Boosting Machines (GBM) - boosting
CUBIST - boosting

4.1 Create Ensemble Models

trainControl <- trainControl(method='repeatedcv', number=10, repeats=3)
metric <- 'RMSE'

# Random Forest
set.seed(101)
fit.rf <- train(medv~., data=train, method='ranger', metric=metric, 
                preProc=c('BoxCox'), trControl = trainControl)

# Gradient Boosting Machines (GBM) 
set.seed(101)
fit.gbm <- train(medv~., data=train, method='gbm', metric=metric, 
                preProc=c('BoxCox'), trControl = trainControl)

# CUBIST
set.seed(101)
fit.cubist <- train(medv~., data=train, method='cubist', metric=metric, 
                preProc=c('BoxCox'), trControl = trainControl)

# Compare Algorithms
ensembleResults <- resamples(list(rf=fit.rf, gbm=fit.gbm,
                                  cubist=fit.cubist))
summary(ensembleResults)

## 
## Call:
## summary.resamples(object = ensembleResults)
## 
## Models: rf, gbm, cubist 
## Number of resamples: 30 
## 
## RMSE 
##         Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
## rf     2.154   2.688  3.082 3.314   3.757 4.898    0
## gbm    2.287   2.714  3.106 3.321   3.928 5.135    0
## cubist 1.891   2.369  2.636 3.037   3.484 5.140    0
## 
## Rsquared 
##          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## rf     0.7610  0.8493 0.8859 0.8745  0.9190 0.9436    0
## gbm    0.6686  0.8379 0.8833 0.8710  0.9165 0.9412    0
## cubist 0.6954  0.8546 0.9190 0.8895  0.9365 0.9576    0

dotplot(ensembleResults)

We can see that:

Ensemble models have produced better results over other models.
Cubist has an RMSE of 3.037 and R2 value of 0.8895 which is very similar to results from our tuned SVM model (RMSE = 3.0262 and R2 = 0.8880).

4.2 Tuning Cubist Model

Let’s tune our cubist model to see if we can achieve better results. Before we tune our cubist model, let’s check the default parameters that we can tune in the model.

4.2.1 Let’s look at default cubist model

fit.cubist

## Cubist 
## 
## 407 samples
##  13 predictor
## 
## Pre-processing: Box-Cox transformation (11) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 365, 367, 367, 366, 366, 366, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared 
##    1          0          3.762387  0.8329693
##    1          5          3.427366  0.8608547
##    1          9          3.440933  0.8603692
##   10          0          3.296426  0.8720001
##   10          5          3.037153  0.8894882
##   10          9          3.058768  0.8881030
##   20          0          3.345520  0.8683533
##   20          5          3.087801  0.8864098
##   20          9          3.107175  0.8851953
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were committees = 10 and neighbors = 5.

We can see that the final model used committees = 10 and neighbors = 3

Cubist has two parameters that are tunable with caret:

Committees which is the number of boosting operations
Neighbors which is the number of instances used to correct the rule-based prediction.

Let’s create a grid search around the values used in our initial cubist model to further tune the model.

4.2.2 Tune cubist model

set.seed(101)
trainControl <- trainControl(method='repeatedcv', number=10, repeats=3)
metric <- 'RMSE'
grid <- expand.grid(.committees=seq(10,20,by=1), .neighbors=c(3,5,7))
tune.cubist <- train(medv~., data=train, method='cubist', metric=metric,
                preProc=c('BoxCox'), tuneGrid=grid, trControl=trainControl)
tune.cubist

## Cubist 
## 
## 407 samples
##  13 predictor
## 
## Pre-processing: Box-Cox transformation (11) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 365, 367, 367, 366, 366, 366, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared 
##   10          3          2.985772  0.8930905
##   10          5          3.037153  0.8894882
##   10          7          3.049847  0.8886975
##   11          3          2.988827  0.8932478
##   11          5          3.044064  0.8893887
##   11          7          3.054479  0.8887430
##   12          3          2.994430  0.8924002
##   12          5          3.052855  0.8883584
##   12          7          3.063723  0.8876571
##   13          3          2.982689  0.8938442
##   13          5          3.042516  0.8897098
##   13          7          3.050520  0.8892734
##   14          3          3.014669  0.8911097
##   14          5          3.075240  0.8869582
##   14          7          3.083127  0.8865349
##   15          3          3.008350  0.8921653
##   15          5          3.068108  0.8881275
##   15          7          3.074275  0.8878248
##   16          3          3.014598  0.8912904
##   16          5          3.073795  0.8871653
##   16          7          3.082422  0.8866903
##   17          3          3.005728  0.8922957
##   17          5          3.068642  0.8880548
##   17          7          3.078560  0.8875011
##   18          3          3.009138  0.8918306
##   18          5          3.071875  0.8875600
##   18          7          3.082250  0.8869850
##   19          3          3.019241  0.8912426
##   19          5          3.081525  0.8869878
##   19          7          3.090376  0.8864601
##   20          3          3.025336  0.8906776
##   20          5          3.087801  0.8864098
##   20          7          3.098079  0.8858063
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were committees = 13 and neighbors = 3.

plot(tune.cubist)

We can see that tuning the cubist model further improved our results. The tuned model with committees = 13 and neighbors = 3 has an RMSE = 2.9827 and R2 = 0.8938 which is better than our tuned SVM model (RMSE = 3.0262 and R2 = 0.8880).

4. Finalize Model

Let’s finalize our results using the tuned Cubist model. In this step we will:

Create a stand alone Cubist model using the tuned parameters
Use this model to evaluate our test set

4.1 Create a standalone Cubist Model

# Prepare the test data
set.seed(101)
x <- train[,1:13]
y <- train[,14]

# Preprocess the data using Box-Cox transformation
preParms <- preProcess(x, method=c('BoxCox'))
transX <- predict(preParms, x)

# Train the final model
finalModel <- cubist(x=transX, y=y, committees = 13)

4.2 Predict using the Final Model

set.seed(101)
testX <- test[,1:13]
testY <- test[,14]
testTransX <- predict(preParms, testX)

# Use final model to make prediction
prediction <- predict(finalModel, newdata=testTransX, neighbors=3)

# Calculate RMSE and R2
RMSE <- sqrt(mean((testY - prediction)^2))
sse <- sum((testY - prediction)^2)
sst <- sum((mean(y) - testY)^2)
R2 <- 1-sse/sst
RMSE

## [1] 2.566475

R2

## [1] 0.9205573

We can see that the RMSE is the unseen data is 2.566 which is better than our expected RMSE of 2.9827. The value of R2 has also improved to 0.9205.