Objective:
The objective of this exercise is to predict the age of homes in the Boston data set and, in the process, compare multiple algorithms.
Data Set
The data set consists of 506 observations and 14 variables and is available in the MASS package. The variables range from per-capita crime by town to the pupil-teacher ratio by town. We will be using age as our response/target variable.
Descriptions of the 14 variables, along with more background on the data set, are available at https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html
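If the MASS package is installed, the same variable descriptions can also be read from R's help system:
# open the help page for the Boston data set (requires the MASS package)
help("Boston", package = "MASS")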
We are going to use the caret and MASS R packages: MASS holds the Boston data set, and we will build our models with caret.
Data splitting can be done in a variety of ways.
First, I am going to use the base R sample() function to split the data set.
Second, I am going to use the createDataPartition() function from the caret package.
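Before splitting, we load the two packages (this assumes both are already installed):
# load the required packages
library(MASS)   # provides the Boston data set
library(caret)  # data splitting, model training and resampling utilities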
#load the dataset
df <- Boston
str(df)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Data split (80/20) using the base R sample() function.
#store rows for partition
rows1 <- sample(1:nrow(df), size = floor(nrow(df) * 0.8), replace = FALSE)
#training data set
train_base <- df[rows1,]
#testing data set
test_base <- df[-rows1,]
# structure of the dataset
str(train_base)
'data.frame': 404 obs. of 14 variables:
$ crim : num 0.0205 18.811 0.122 13.3598 0.2221 ...
$ zn : num 85 0 0 0 0 40 0 0 55 0 ...
$ indus : num 0.74 18.1 2.89 18.1 10.01 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.41 0.597 0.445 0.693 0.547 0.429 0.74 0.532 0.484 0.573 ...
$ rm : num 6.38 4.63 6.62 5.89 6.09 ...
$ age : num 35.7 100 57.8 94.7 95.4 44.4 93.9 77 28.1 69.1 ...
$ dis : num 9.19 1.55 3.5 1.78 2.55 ...
$ rad : int 2 24 2 24 6 1 24 24 5 1 ...
$ tax : num 313 666 276 666 432 335 666 666 370 273 ...
$ ptratio: num 17.3 20.2 18 20.2 17.8 19.7 20.2 20.2 17.6 21 ...
$ black : num 396.9 28.8 358 396.9 396.9 ...
$ lstat : num 5.77 34.37 6.65 16.35 17.09 ...
$ medv : num 24.7 17.9 28.4 12.7 18.7 22.9 12.8 25 31.2 22.4 ...
str(test_base)
'data.frame': 102 obs. of 14 variables:
$ crim : num 0.17 0.784 0.726 0.988 0.956 ...
$ zn : num 12.5 0 0 0 0 0 0 0 0 0 ...
$ indus : num 7.87 8.14 8.14 8.14 8.14 8.14 8.14 8.14 8.14 5.96 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.524 0.538 0.538 0.538 0.538 0.538 0.538 0.538 0.538 0.499 ...
$ rm : num 6 5.99 5.73 5.81 6.05 ...
$ age : num 85.9 81.7 69.5 100 88.8 94.4 87.3 100 82 41.5 ...
$ dis : num 6.59 4.26 3.8 4.1 4.45 ...
$ rad : int 5 4 4 4 4 4 4 4 4 5 ...
$ tax : num 311 307 307 307 307 307 307 307 307 279 ...
$ ptratio: num 15.2 21 21 21 21 21 21 21 21 19.2 ...
$ black : num 387 387 391 395 306 ...
$ lstat : num 17.1 14.7 11.3 19.9 17.3 ...
$ medv : num 18.9 17.5 18.2 14.5 14.8 18.4 21 14.5 13.2 21 ...
Data split (70/30) using the createDataPartition() function from the caret package.
# Normally we would call set.seed(123) here for reproducibility; it is omitted because of a software issue that prevents publishing this document when it is included.
#store rows for partition
partition <- caret::createDataPartition(y = df$age, times = 1, p = 0.7, list = FALSE)
# create training data set
train_set <- df[partition,]
# create testing data set by dropping the partition rows, leaving the remaining 30% of the data
test_set <- df[-partition,]
str(train_set)
'data.frame': 356 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.03237 0.02985 0.08829 ...
$ zn : num 18 0 0 0 12.5 12.5 12.5 0 0 0 ...
$ indus : num 2.31 7.07 2.18 2.18 7.87 7.87 7.87 8.14 8.14 8.14 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.458 0.458 0.524 0.524 0.524 0.538 0.538 0.538 ...
$ rm : num 6.58 6.42 7 6.43 6.01 ...
$ age : num 65.2 78.9 45.8 58.7 66.6 85.9 94.3 61.8 29.3 81.7 ...
$ dis : num 4.09 4.97 6.06 6.06 5.56 ...
$ rad : int 1 2 3 3 5 5 5 4 4 4 ...
$ tax : num 296 242 222 222 311 311 311 307 307 307 ...
$ ptratio: num 15.3 17.8 18.7 18.7 15.2 15.2 15.2 21 21 21 ...
$ black : num 397 397 395 394 396 ...
$ lstat : num 4.98 9.14 2.94 5.21 12.43 ...
$ medv : num 24 21.6 33.4 28.7 22.9 18.9 15 20.4 23.1 17.5 ...
str(test_set)
'data.frame': 150 obs. of 14 variables:
$ crim : num 0.0273 0.069 0.1446 0.2112 0.1175 ...
$ zn : num 0 0 12.5 12.5 12.5 12.5 0 0 0 0 ...
$ indus : num 7.07 2.18 7.87 7.87 7.87 7.87 8.14 8.14 8.14 8.14 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.469 0.458 0.524 0.524 0.524 0.524 0.538 0.538 0.538 0.538 ...
$ rm : num 7.18 7.15 6.17 5.63 6.01 ...
$ age : num 61.1 54.2 96.1 100 82.9 39 84.5 56.5 98.1 89.2 ...
$ dis : num 4.97 6.06 5.95 6.08 6.23 ...
$ rad : int 2 3 5 5 5 5 4 4 4 4 ...
$ tax : num 242 222 311 311 311 311 307 307 307 307 ...
$ ptratio: num 17.8 18.7 15.2 15.2 15.2 15.2 21 21 21 21 ...
$ black : num 393 397 397 387 397 ...
$ lstat : num 4.03 5.33 19.15 29.93 13.27 ...
$ medv : num 34.7 36.2 27.1 16.5 18.9 21.7 18.2 19.9 13.6 19.6 ...
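createDataPartition samples within quantile groups of the outcome, so the distribution of age should look similar in the training and testing sets. A quick optional check (output not shown):
# compare the distribution of the target variable across the two partitions
summary(train_set$age)
summary(test_set$age)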
We are going to build three models: a classical linear regression model, a random forest, and a gradient boosting model (GBM).
We will then choose the best-performing model based on root mean square error (RMSE) to predict the age of houses in the test data set.
We will also use repeated cross-validation (5 folds, repeated 2 times), which gives a more reliable estimate of out-of-sample performance.
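Since all three models share the same resampling scheme, one tidy option (not used in the code below, which defines the control inline) is to create a single trainControl object and reuse it:
# 5-fold cross-validation, repeated 2 times, reusable across models
ctrl <- caret::trainControl(method = "repeatedcv", number = 5, repeats = 2)
# e.g. caret::train(age ~ ., data = train_set, method = "lm", trControl = ctrl)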
# linear regression, 5-fold CV repeated 2 times
model_lm <- caret::train(age ~ ., data = train_set,
                         method = "lm",
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 5, repeats = 2))
# random forest (ranger), same resampling scheme
model_forest <- caret::train(age ~ ., data = train_set,
                             method = "ranger",
                             trControl = trainControl(method = "repeatedcv",
                                                      number = 5, repeats = 2))
# gradient boosting (gbm), same resampling scheme
model_gbm <- caret::train(age ~ ., data = train_set,
                          method = "gbm",
                          trControl = trainControl(method = "repeatedcv",
                                                   number = 5, repeats = 2))
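Each fitted train object can also be inspected on its own: printing it shows the tuning grid caret evaluated, and bestTune holds the selected parameters. For example:
# tuning parameters chosen for the random forest (ranger)
model_forest$bestTune
# full resampling summary for the GBM
print(model_gbm)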
We use the resamples() function from the caret package to compare the three models side by side.
all_models <- caret::resamples(list(Random_Forest = model_forest,
Linear_Regression = model_lm,
GBM = model_gbm))
#boxplot
bwplot(all_models)
#dotplot
dotplot(all_models)
#summary
summary(all_models)
Call:
summary.resamples(object = all_models)
Models: Random_Forest, Linear_Regression, GBM
Number of resamples: 10
MAE
Min. 1st Qu. Median Mean 3rd Qu. Max.
Random_Forest 8.252112 9.094248 9.78110 9.643055 10.17148 10.70097
Linear_Regression 11.490282 12.514133 12.74935 12.851753 12.99444 15.06752
GBM 8.434679 9.483875 10.09975 9.934096 10.38567 11.70440
NA's
Random_Forest 0
Linear_Regression 0
GBM 0
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max.
Random_Forest 11.07166 12.34664 13.01666 12.88435 13.63783 14.27272
Linear_Regression 14.33443 15.94932 16.39045 16.42580 16.95850 17.83902
GBM 11.55671 12.14554 13.30905 13.38038 14.47298 15.69887
NA's
Random_Forest 0
Linear_Regression 0
GBM 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu.
Random_Forest 0.7183874 0.7592769 0.7754247 0.7842863 0.8105468
Linear_Regression 0.5927382 0.6042303 0.6578105 0.6495473 0.6793772
GBM 0.7110808 0.7290565 0.7645726 0.7668434 0.8077440
Max. NA's
Random_Forest 0.8533229 0
Linear_Regression 0.7170658 0
GBM 0.8204467 0
Looking at the mean RMSE values, Random_Forest and GBM both performed well and clearly better than the linear regression; two models finishing this close to each other is not always the case.
Let’s build our models again with some pre-processing first.
#### Center, Scale and Principal Component Analysis (PCA)
The caret package lets us add pre-processing steps with a single argument and does all the heavy lifting for us.
We will center and scale the predictors and then perform principal component analysis on them to reduce collinearity.
This may cost us a little RMSE, but the resulting models should be more stable.
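To see what this pre-processing does on its own, caret's preProcess function can be applied directly to the predictors; printing the result reports how many principal components were retained (a quick illustration, separate from the model fits below):
# center, scale and run PCA on the predictors only (age is the outcome)
pp <- caret::preProcess(train_set[, setdiff(names(train_set), "age")],
                        method = c("center", "scale", "pca"))
pp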
model_lm <- caret::train(age ~ ., data = train_set,
                         method = "lm",
                         trControl = trainControl(method = "repeatedcv", number = 5, repeats = 2),
                         preProcess = c("center", "scale", "pca"))
model_forest <- caret::train(age ~ ., data = train_set,
                             method = "ranger",
                             trControl = trainControl(method = "repeatedcv", number = 5, repeats = 2),
                             preProcess = c("center", "scale", "pca"))
model_gbm <- caret::train(age ~ ., data = train_set,
                          method = "gbm",
                          trControl = trainControl(method = "repeatedcv", number = 5, repeats = 2),
                          preProcess = c("center", "scale", "pca"))
all_models <- caret::resamples(list(Random_Forest = model_forest,
Linear_Regression = model_lm,
GBM = model_gbm))
#boxplot
bwplot(all_models)
#dotplot
dotplot(all_models)
#summary
summary(all_models)
Call:
summary.resamples(object = all_models)
Models: Random_Forest, Linear_Regression, GBM
Number of resamples: 10
MAE
Min. 1st Qu. Median Mean 3rd Qu. Max.
Random_Forest 7.487781 10.711419 10.73529 10.35820 10.87794 11.56189
Linear_Regression 11.721988 12.001735 12.63997 12.90499 13.63416 15.15267
GBM 9.013459 9.591878 10.22780 10.41425 10.71376 13.15846
NA's
Random_Forest 4
Linear_Regression 0
GBM 0
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max.
Random_Forest 10.21081 13.32904 14.19934 13.67404 14.53391 15.75037
Linear_Regression 14.27834 15.60454 16.21224 16.54711 17.82753 18.71726
GBM 11.54933 12.64307 13.01390 13.87030 15.13640 18.28568
NA's
Random_Forest 4
Linear_Regression 0
GBM 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu.
Random_Forest 0.6760441 0.7334036 0.7436857 0.7575291 0.7736068
Linear_Regression 0.5413179 0.6035431 0.6588522 0.6459879 0.6866185
GBM 0.6299870 0.7024621 0.7699587 0.7500499 0.7911640
Max. NA's
Random_Forest 0.8682025 4
Linear_Regression 0.7100061 0
GBM 0.8270775 0
Once again, the GBM and Random_Forest models outperformed the linear regression model. The mean RMSE for Random_Forest is the lowest, so the random forest edges out the other two.
To reach a concrete conclusion, let's predict age on the test data set with each of the three models and compare their RMSE values.
# predicting using Linear Regression Model
pred_lm <- stats::predict(object = model_lm, test_set)
#RMSE
error_lm <- pred_lm - test_set$age
rmse_lm <- sqrt(mean(error_lm^2))
rmse_lm
[1] 16.61794
The Linear Regression model did not perform well, as expected.
# predicting using GBM
pred_gbm <- stats::predict(object = model_gbm, test_set)
#RMSE
error_gbm <- pred_gbm - test_set$age
rmse_gbm <- sqrt(mean(error_gbm^2))
rmse_gbm
[1] 14.0249
The GBM model performed noticeably better than the linear regression model, and its test RMSE is close to the cross-validated estimate from training.
# predicting using Random Forest Model
pred_rf <- stats::predict(object = model_forest, test_set)
#RMSE
error_rf <- pred_rf - test_set$age
rmse_rf <- sqrt(mean(error_rf^2))
rmse_rf
[1] 13.99354
The random forest performed best of the three models on the test set, with a test RMSE in line with its cross-validated training estimate.
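As a cross-check, caret's postResample function returns RMSE, R-squared and MAE in one call and should agree with the manual calculation above:
# RMSE, Rsquared and MAE for the random forest predictions on the test set
caret::postResample(pred = pred_rf, obs = test_set$age)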
In conclusion, this workflow can be used to compare multiple models and select the best-performing one based on mean RMSE. The caret package can also fit many other model types in a few lines of code, depending on the business need or problem statement.
We could have added more intermediate steps, such as exploratory data analysis, residual diagnostics, extracting model coefficients, or additional plots along the way. This was meant to be a quick workflow to get you started on some machine learning concepts.
Happy Reading and Keep Learning.