Predicting Housing Prices Using Different Modelling Techniques

Packages Required

library(MASS)          #for obtaining Boston Housing Data
library(rpart)         #for regression trees
library(rpart.plot)    #for plotting regression trees
library(ipred)         #for bagging
library(randomForest)  #for random forest 
library(dplyr)         #for manipulation
library(gbm)           #for boosting
library(tidyverse)     #for tidying the data
library(kableExtra)    #for table presentation

Linear Regression

set.seed(2)
sample_index <- sample(nrow(Boston),nrow(Boston)*0.75)
boston_train <- Boston[sample_index,]
boston_test <- Boston[-sample_index,]

I would like to predict the median value of owner-occupied homes, medv. I start by regressing it on the other predictor variables and using multiple linear regression to predict its values.

I have used a combination of forward and backward stepwise variable selection to find the best subset of the 13 predictor variables.

The linear model gives me the following variables that are significant and help explain the variability in median housing value.

#fitting the null and full linear regression models
nullmodel=lm(medv~1, data=boston_train)
fullmodel=lm(medv~., data=boston_train)

#using AIC to find the best subsets of predictors 
#using a combination of forward and backward stepwise variable selection 
model_step_s <- step(nullmodel, scope=list(lower=nullmodel, 
                                           upper=fullmodel), direction='both')
## Start:  AIC=1686.05
## medv ~ 1
## 
##           Df Sum of Sq   RSS    AIC
## + lstat    1   17428.0 14811 1393.3
## + rm       1   15543.6 16696 1438.7
## + indus    1    8413.4 23826 1573.4
## + ptratio  1    6985.8 25253 1595.5
## + tax      1    6911.4 25328 1596.6
## + nox      1    6005.6 26234 1609.9
## + crim     1    5096.6 27143 1622.8
## + age      1    4651.6 27588 1629.0
## + rad      1    4555.2 27684 1630.3
## + zn       1    4260.8 27978 1634.3
## + black    1    3250.9 28988 1647.8
## + dis      1    2238.1 30001 1660.8
## + chas     1     588.4 31651 1681.1
## <none>                 32239 1686.0
## 
## Step:  AIC=1393.27
## medv ~ lstat
## 
##           Df Sum of Sq   RSS    AIC
## + rm       1    2993.9 11817 1309.7
## + ptratio  1    1647.8 13164 1350.6
## + dis      1     511.8 14299 1381.9
## + chas     1     298.4 14513 1387.5
## + black    1     219.1 14592 1389.6
## + age      1     217.9 14593 1389.7
## + crim     1     193.7 14618 1390.3
## + tax      1     176.3 14635 1390.7
## + indus    1     136.6 14675 1391.8
## + zn       1     128.3 14683 1392.0
## <none>                 14811 1393.3
## + rad      1      15.5 14796 1394.9
## + nox      1       4.5 14807 1395.2
## - lstat    1   17428.0 32239 1686.0
## 
## Step:  AIC=1309.68
## medv ~ lstat + rm
## 
##           Df Sum of Sq   RSS    AIC
## + ptratio  1    1061.5 10756 1276.0
## + crim     1     348.4 11469 1300.3
## + tax      1     310.2 11507 1301.6
## + black    1     309.7 11508 1301.6
## + chas     1     252.4 11565 1303.5
## + dis      1     222.1 11595 1304.5
## + rad      1     142.2 11675 1307.1
## + indus    1      68.0 11749 1309.5
## <none>                 11817 1309.7
## + zn       1      45.4 11772 1310.2
## + nox      1      10.7 11807 1311.3
## + age      1       7.6 11810 1311.4
## - rm       1    2993.9 14811 1393.3
## - lstat    1    4878.3 16696 1438.7
## 
## Step:  AIC=1276.01
## medv ~ lstat + rm + ptratio
## 
##           Df Sum of Sq   RSS    AIC
## + dis      1     343.3 10413 1265.7
## + black    1     245.1 10511 1269.3
## + chas     1     239.2 10517 1269.5
## + crim     1     166.6 10589 1272.1
## <none>                 10756 1276.0
## + tax      1      45.7 10710 1276.4
## + age      1      31.2 10725 1276.9
## + nox      1      14.1 10742 1277.5
## + zn       1      13.2 10743 1277.5
## + rad      1       0.9 10755 1278.0
## + indus    1       0.0 10756 1278.0
## - ptratio  1    1061.5 11817 1309.7
## - rm       1    2407.6 13164 1350.6
## - lstat    1    3880.4 14636 1390.8
## 
## Step:  AIC=1265.72
## medv ~ lstat + rm + ptratio + dis
## 
##           Df Sum of Sq   RSS    AIC
## + nox      1     497.6  9915 1249.2
## + black    1     334.1 10079 1255.4
## + crim     1     275.0 10138 1257.6
## + tax      1     203.0 10210 1260.3
## + indus    1     187.1 10226 1260.8
## + chas     1     158.1 10255 1261.9
## + zn       1      64.4 10348 1265.4
## + age      1      63.3 10349 1265.4
## <none>                 10413 1265.7
## + rad      1      24.7 10388 1266.8
## - dis      1     343.3 10756 1276.0
## - ptratio  1    1182.6 11595 1304.5
## - rm       1    2069.6 12482 1332.4
## - lstat    1    4062.5 14475 1388.6
## 
## Step:  AIC=1249.16
## medv ~ lstat + rm + ptratio + dis + nox
## 
##           Df Sum of Sq     RSS    AIC
## + black    1    204.92  9710.1 1243.2
## + chas     1    199.92  9715.1 1243.4
## + crim     1    192.93  9722.1 1243.7
## + zn       1     61.08  9854.0 1248.8
## <none>                  9915.0 1249.2
## + tax      1     25.24  9889.8 1250.2
## + indus    1     24.80  9890.2 1250.2
## + rad      1     18.68  9896.4 1250.4
## + age      1      4.49  9910.6 1251.0
## - nox      1    497.61 10412.7 1265.7
## - dis      1    826.80 10741.8 1277.5
## - ptratio  1   1354.56 11269.6 1295.7
## - rm       1   2030.00 11945.0 1317.8
## - lstat    1   2812.63 12727.7 1341.8
## 
## Step:  AIC=1243.24
## medv ~ lstat + rm + ptratio + dis + nox + black
## 
##           Df Sum of Sq     RSS    AIC
## + chas     1    174.50  9535.6 1238.4
## + crim     1    133.74  9576.4 1240.0
## + zn       1     74.53  9635.6 1242.3
## + rad      1     72.25  9637.9 1242.4
## <none>                  9710.1 1243.2
## + indus    1     13.75  9696.4 1244.7
## + age      1     13.25  9696.9 1244.7
## + tax      1      1.89  9708.2 1245.2
## - black    1    204.92  9915.0 1249.2
## - nox      1    368.44 10078.6 1255.4
## - dis      1    800.64 10510.8 1271.3
## - ptratio  1   1275.78 10985.9 1288.0
## - rm       1   2081.27 11791.4 1314.8
## - lstat    1   2590.29 12300.4 1330.9
## 
## Step:  AIC=1238.37
## medv ~ lstat + rm + ptratio + dis + nox + black + chas
## 
##           Df Sum of Sq     RSS    AIC
## + crim     1    119.71  9415.9 1235.6
## + zn       1     74.80  9460.8 1237.4
## + rad      1     62.06  9473.6 1237.9
## <none>                  9535.6 1238.4
## + age      1     17.22  9518.4 1239.7
## + indus    1     16.31  9519.3 1239.7
## + tax      1      1.85  9533.8 1240.3
## - chas     1    174.50  9710.1 1243.2
## - black    1    179.51  9715.1 1243.4
## - nox      1    407.76  9943.4 1252.2
## - dis      1    741.75 10277.4 1264.8
## - ptratio  1   1258.30 10793.9 1283.3
## - rm       1   2080.55 11616.2 1311.2
## - lstat    1   2419.79 11955.4 1322.1
## 
## Step:  AIC=1235.58
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + crim
## 
##           Df Sum of Sq     RSS    AIC
## + rad      1    200.99  9214.9 1229.4
## + zn       1    111.59  9304.3 1233.1
## <none>                  9415.9 1235.6
## + age      1     23.98  9391.9 1236.6
## + indus    1     17.20  9398.7 1236.9
## + tax      1      7.84  9408.1 1237.3
## - crim     1    119.71  9535.6 1238.4
## - black    1    128.18  9544.1 1238.7
## - chas     1    160.47  9576.4 1240.0
## - nox      1    362.93  9778.8 1247.9
## - dis      1    775.83 10191.7 1263.6
## - ptratio  1   1094.59 10510.5 1275.3
## - lstat    1   2121.41 11537.3 1310.6
## - rm       1   2149.29 11565.2 1311.5
## 
## Step:  AIC=1229.41
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + crim + 
##     rad
## 
##           Df Sum of Sq     RSS    AIC
## + tax      1    219.59  8995.3 1222.3
## + zn       1     71.17  9143.7 1228.5
## <none>                  9214.9 1229.4
## + indus    1     32.28  9182.6 1230.1
## + age      1     12.39  9202.5 1230.9
## - chas     1    133.22  9348.1 1232.8
## - black    1    200.84  9415.8 1235.6
## - rad      1    200.99  9415.9 1235.6
## - crim     1    258.65  9473.6 1237.9
## - nox      1    522.65  9737.6 1248.3
## - dis      1    827.53 10042.4 1260.0
## - ptratio  1   1294.50 10509.4 1277.2
## - rm       1   1857.31 11072.2 1297.0
## - lstat    1   2190.54 11405.5 1308.2
## 
## Step:  AIC=1222.26
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + crim + 
##     rad + tax
## 
##           Df Sum of Sq     RSS    AIC
## + zn       1    119.28  8876.1 1219.2
## <none>                  8995.3 1222.3
## + age      1     12.50  8982.8 1223.7
## + indus    1      0.28  8995.0 1224.2
## - chas     1    107.82  9103.2 1224.8
## - black    1    178.91  9174.2 1227.7
## - tax      1    219.59  9214.9 1229.4
## - crim     1    265.92  9261.3 1231.3
## - nox      1    383.61  9378.9 1236.1
## - rad      1    412.74  9408.1 1237.3
## - dis      1    852.97  9848.3 1254.6
## - ptratio  1   1252.79 10248.1 1269.7
## - rm       1   1676.30 10671.6 1285.0
## - lstat    1   2129.44 11124.8 1300.8
## 
## Step:  AIC=1219.21
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + crim + 
##     rad + tax + zn
## 
##           Df Sum of Sq     RSS    AIC
## <none>                  8876.1 1219.2
## + age      1      4.38  8871.7 1221.0
## + indus    1      0.01  8876.1 1221.2
## - chas     1    106.31  8982.4 1221.7
## - zn       1    119.28  8995.3 1222.3
## - black    1    174.12  9050.2 1224.6
## - tax      1    267.69  9143.7 1228.5
## - crim     1    294.04  9170.1 1229.6
## - nox      1    328.81  9204.9 1231.0
## - rad      1    428.02  9304.1 1235.0
## - ptratio  1    829.04  9705.1 1251.0
## - dis      1    961.09  9837.1 1256.2
## - rm       1   1538.72 10414.8 1277.8
## - lstat    1   2113.60 10989.7 1298.2
(model_summary <- summary(model_step_s))
## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + black + 
##     chas + crim + rad + tax + zn, data = boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.8197  -2.7675  -0.7004   1.6232  26.5551 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34.013831   6.097676   5.578 4.72e-08 ***
## lstat        -0.532238   0.056934  -9.348  < 2e-16 ***
## rm            3.971642   0.497929   7.976 1.94e-14 ***
## ptratio      -0.899235   0.153590  -5.855 1.06e-08 ***
## dis          -1.389423   0.220409  -6.304 8.34e-10 ***
## nox         -15.703142   4.258821  -3.687 0.000261 ***
## black         0.008873   0.003307   2.683 0.007623 ** 
## chas          2.379287   1.134825   2.097 0.036712 *  
## crim         -0.134317   0.038522  -3.487 0.000548 ***
## rad           0.347270   0.082549   4.207 3.26e-05 ***
## tax          -0.014500   0.004358  -3.327 0.000967 ***
## zn            0.036694   0.016523   2.221 0.026978 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.918 on 367 degrees of freedom
## Multiple R-squared:  0.7247, Adjusted R-squared:  0.7164 
## F-statistic: 87.82 on 11 and 367 DF,  p-value: < 2.2e-16

The in-sample MSE comes out to be 24.185.

predicted_val <- predict(object = model_step_s, newdata = boston_test)
lin_train_mse <- round((model_summary$sigma)^2,3) 
lin_test_mse <- round(mean((predicted_val - boston_test$medv)^2),3)

The MSE on the test data comes out to be 18.102.

Regression Trees

I will try to further improve the accuracy using other modelling techniques and compare them on the basis of mean squared error (MSE). Let's look at regression trees first.

boston_rpart <- rpart(formula = medv ~ ., data = boston_train, cp = 0.00001)
plotcp(boston_rpart)

printcp(boston_rpart)
## 
## Regression tree:
## rpart(formula = medv ~ ., data = boston_train, cp = 1e-05)
## 
## Variables actually used in tree construction:
##  [1] age     black   crim    dis     indus   lstat   nox     ptratio
##  [9] rm      tax    
## 
## Root node error: 32239/379 = 85.064
## 
## n= 379 
## 
##            CP nsplit rel error  xerror     xstd
## 1  0.44674924      0   1.00000 1.00591 0.094565
## 2  0.16631896      1   0.55325 0.57745 0.057263
## 3  0.06378413      2   0.38693 0.43737 0.050097
## 4  0.05587702      3   0.32315 0.41961 0.051179
## 5  0.03423855      4   0.26727 0.39771 0.048341
## 6  0.02516202      5   0.23303 0.34989 0.042127
## 7  0.01634600      6   0.20787 0.29575 0.037785
## 8  0.01297229      7   0.19152 0.27925 0.038362
## 9  0.00783403      8   0.17855 0.25512 0.037736
## 10 0.00656300      9   0.17072 0.24918 0.037665
## 11 0.00610276     10   0.16415 0.25175 0.039508
## 12 0.00550626     11   0.15805 0.25050 0.039488
## 13 0.00458964     12   0.15255 0.25058 0.039451
## 14 0.00398972     13   0.14796 0.24447 0.039346
## 15 0.00398407     14   0.14397 0.24051 0.039233
## 16 0.00398106     15   0.13998 0.24051 0.039233
## 17 0.00237774     16   0.13600 0.23626 0.039234
## 18 0.00219110     17   0.13362 0.23229 0.039074
## 19 0.00197157     18   0.13143 0.23060 0.038846
## 20 0.00179284     20   0.12749 0.23126 0.038837
## 21 0.00177033     21   0.12570 0.22897 0.038833
## 22 0.00131561     22   0.12393 0.23014 0.038860
## 23 0.00122621     23   0.12261 0.23355 0.039440
## 24 0.00122604     24   0.12138 0.23383 0.039439
## 25 0.00114331     25   0.12016 0.23433 0.039440
## 26 0.00102879     26   0.11901 0.23482 0.039617
## 27 0.00100222     27   0.11799 0.23342 0.039626
## 28 0.00096814     28   0.11698 0.23254 0.039599
## 29 0.00092994     29   0.11602 0.23411 0.039856
## 30 0.00058257     30   0.11509 0.23662 0.041150
## 31 0.00038713     31   0.11450 0.23731 0.041148
## 32 0.00027305     32   0.11412 0.23787 0.041164
## 33 0.00001000     33   0.11384 0.23787 0.041164

The cp value that minimizes the cross-validated error (xerror) is 0.0017703.
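
As a small sketch, this cp value can also be pulled from the fitted tree's cp table programmatically and used with prune() (a minimal illustration, assuming boston_rpart is the large tree grown above; best_cp and boston_pruned are names introduced only for this example):

#cp value with the lowest cross-validated error (xerror)
best_cp <- boston_rpart$cptable[which.min(boston_rpart$cptable[, "xerror"]), "CP"]
boston_pruned <- prune(boston_rpart, cp = best_cp)   #prune back to that complexity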

Building a pruned regression tree with a cp value of 0.01

boston_rpart <- rpart(formula = medv ~ ., data = boston_train, cp = 0.01)
prp(boston_rpart,digits = 4, extra = 1)

boston_train_pred_tree = predict(boston_rpart)
boston_test_pred_tree = predict(boston_rpart, boston_test)
reg_train_mse <- round(mean((boston_train_pred_tree - boston_train$medv)^2),3) 
reg_test_mse <- round(mean((boston_test_pred_tree - boston_test$medv)^2),3)

The in-sample MSE comes out to be 15.188, whereas the test-set MSE comes out to be 16.155.

Linear regression models fail in situations where the relationship between features and outcome is nonlinear or where features interact with each other.

Bagging

Next, I will use Bagging, a general approach that uses bootstrapping in conjunction with any regression model to construct an ensemble.

Bagging models provide several advantages over models that are not bagged. First, bagging effectively reduces the variance of a prediction through its aggregation process. For models that produce an unstable prediction, like regression trees, aggregating over many versions of the training data actually reduces the variance in the prediction and, hence, makes the prediction more stable.
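
To make the idea concrete, here is a minimal hand-rolled sketch of bagging built from rpart trees (purely illustrative; it is not how the ipred implementation used below works internally, and B, preds and bag_pred_manual are names introduced only for this example):

#bootstrap the training data, grow a tree on each resample, and average the predictions
set.seed(2)
B <- 50
preds <- matrix(NA, nrow = nrow(boston_test), ncol = B)
for (b in 1:B) {
  idx <- sample(nrow(boston_train), replace = TRUE)        #bootstrap resample
  tree_b <- rpart(medv ~ ., data = boston_train[idx, ])    #tree grown on the resample
  preds[, b] <- predict(tree_b, newdata = boston_test)
}
bag_pred_manual <- rowMeans(preds)                          #aggregate by averaging
mean((boston_test$medv - bag_pred_manual)^2)                #test MSE of the manual ensemble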

Selecting the optimal number of trees that minimizes the out-of-bag error

ntree <- seq(10, 200, 10)            #candidate numbers of bootstrap trees
oob_error <- rep(0, length(ntree))
for (i in seq_along(ntree)) {
  set.seed(2)
  #coob = TRUE requests the out-of-bag estimate of the prediction error
  oob_error[i] <- bagging(medv~., data = boston_train, nbagg = ntree[i], coob = T)$err
}
plot(ntree, oob_error, type = 'l', col=2, lwd=2, xaxt="n")
axis(1, at = ntree, las=1)
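
The number of trees with the smallest OOB error can also be read off directly (a small check; the value of 70 used below was chosen from the plot above):

ntree[which.min(oob_error)]   #nbagg value with the smallest OOB error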

Building the final model with 70 trees

boston_bag<- bagging(medv~., data = boston_train, nbagg= 70)
boston_train_bag_tree = predict(boston_bag)
boston_bag_pred<- predict(boston_bag, newdata = boston_test)
boston_bag_oob<- bagging(medv~., data = boston_train, coob=T, nbagg= 70)

bag_train_mse <- round(mean((boston_train_bag_tree - boston_train$medv)^2),3)
bag_test_mse <- round(mean((boston_test$medv-boston_bag_pred)^2),3)

The in-sample MSE comes out to be 18.043, whereas the test-set MSE comes out to be 8.828.

Thus, compared to a single regression tree, the test-set MSE has reduced significantly.

Another advantage of bagging models is that they can provide their own internal estimate of predictive performance, which correlates well with either cross-validation estimates or test-set estimates. The OOB estimate obtained here, reported as a root mean squared error, is 4.362.
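
Squaring this value puts it on the same scale as the MSEs reported above (a small sketch using boston_bag_oob, the model fitted with coob = T):

boston_bag_oob$err     #OOB root mean squared error
boston_bag_oob$err^2   #squared, for comparison with the MSE values above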

Random Forest

The trees in bagging are not completely independent of each other, since all of the original predictors are considered at every split of every tree. Reducing correlation among trees, known as de-correlating trees, is then the next logical step to improving the performance of bagging. Thus, we use Random Forest, where each tree considers only a random subset of k predictors at each split.

By default, k = P/3.
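
In randomForest this corresponds to the mtry argument; for the 13 predictors here the default works out to 4 candidate predictors per split (a quick check, not part of the original fit):

floor((ncol(boston_train) - 1) / 3)   #default mtry for regression with 13 predictors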

boston_rf<- randomForest(medv~., data = boston_train, importance=TRUE)
boston_rf_train<- predict(boston_rf)
boston_rf_pred<- predict(boston_rf, boston_test)
#boston_rf

We can now look at the variable importance measures -

boston_rf$importance 
##            %IncMSE IncNodePurity
## crim     8.8444623     2063.6917
## zn       0.5836399      213.8538
## indus    7.2368757     2111.4653
## chas     0.4285816      176.8836
## nox     13.2467309     2745.0711
## rm      32.8841101     8848.1181
## age      3.6841547      904.0701
## dis      8.3369999     2147.3566
## rad      1.6011582      327.4526
## tax      3.6387667      975.9978
## ptratio  4.7672099     1447.7767
## black    1.7108403      668.1993
## lstat   59.7962858     8632.2048

We can further look at the OOB error, which is the MSE recorded for every number of trees considered. We observe that the error stabilizes at around 300 trees.

plot(boston_rf$mse, type='l', col=2, lwd=2, xlab = "ntree", ylab = "OOB Error")

We can also compare the test set error with the OOB error.

oob.err<- rep(0, 13)
test.err<- rep(0, 13)
#try every possible mtry value (1 to 13 predictors considered at each split)
for(i in 1:13){
  fit<- randomForest(medv~., data = boston_train, mtry=i)
  oob.err[i]<- fit$mse[500]   #OOB MSE after all 500 trees
  test.err[i]<- mean((boston_test$medv-predict(fit, boston_test))^2)
  cat(i, " ")
}
## 1  2  3  4  5  6  7  8  9  10  11  12  13
matplot(cbind(test.err, oob.err), pch=15, col = c("red", "blue"), 
        type = "b", ylab = "MSE", xlab = "mtry")
legend("topright", legend = c("test Error", "OOB Error"),
       pch = 15, col = c("red", "blue"))

The optimal number of predictors to consider at each split (mtry) is approximately 6.
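
This can also be read off from the stored errors (a small check of the plot above):

which.min(test.err)   #mtry with the smallest test-set MSE
which.min(oob.err)    #mtry with the smallest OOB MSE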

Final model after obtaining the tuned parameters -

boston_rf<- randomForest(medv~., data = boston_train, importance=TRUE, ntree = 300, mtry = 6)
boston_rf_train<- predict(boston_rf)
boston_rf_pred<- predict(boston_rf, boston_test)

rf_train_mse <- round(mean((boston_train$medv-boston_rf_train)^2),3)
rf_test_mse <- round(mean((boston_test$medv-boston_rf_pred)^2),3)

The MSE on the training sample comes out to be 12.173, while the MSE on the test sample comes out to be 5.538.

The minimum OOB error comes out to be: 5.638

Boosting

So, with Random Forest, a set of independently grown trees is combined into a strong ensemble. While it is a great way to improve prediction performance, there is another powerful technique known as Boosting. Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set.

The motivation for boosting was a procedure that combines the outputs of many “weak” classifiers to produce a powerful “committee.”
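
To make the sequential idea concrete, here is a minimal hand-rolled boosting sketch for squared-error loss built from shallow rpart trees (purely illustrative; it is not how gbm works internally, and shrinkage, n_rounds, boost_df and pred are names introduced only for this example):

#repeatedly fit a small tree to the current residuals and add a shrunken version of its fit
set.seed(2)
shrinkage <- 0.01
n_rounds  <- 200
boost_df  <- boston_train
pred <- rep(mean(boost_df$medv), nrow(boost_df))           #start from the overall mean
for (m in 1:n_rounds) {
  boost_df$resid <- boost_df$medv - pred                   #current residuals
  weak_fit <- rpart(resid ~ . - medv, data = boost_df,
                    control = rpart.control(maxdepth = 2, cp = 0))   #a weak learner
  pred <- pred + shrinkage * predict(weak_fit)             #shrunken additive update
}
mean((boost_df$medv - pred)^2)                             #training MSE of the sketch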

boston.boost<- gbm(medv~., data = boston_train, distribution = "gaussian",
                   n.trees = 10000, shrinkage = 0.01, interaction.depth = 8)
summary(boston.boost)

##             var    rel.inf
## lstat     lstat 37.9517771
## rm           rm 27.2130163
## dis         dis  9.4770881
## nox         nox  5.6450559
## crim       crim  5.4662302
## age         age  4.3412137
## black     black  3.2406271
## ptratio ptratio  2.3356085
## tax         tax  1.6850969
## indus     indus  1.2366370
## rad         rad  0.6928438
## chas       chas  0.5990241
## zn           zn  0.1157814

We observe that lstat is the most important variable here.

We can also visualize how the test error changes with different numbers of trees.

ntree<- seq(100, 10000, 100)
predmat<- predict(boston.boost, newdata = boston_test, n.trees = ntree)
err<- apply((predmat-boston_test$medv)^2, 2, mean)
plot(ntree, err, type = 'l', col=2, lwd=2, xlab = "n.trees", ylab = "Test MSE")
abline(h=min(err), lty=2)

boston.boost<- gbm(medv~., data = boston_train, distribution = "gaussian",
                   n.trees = 2000, shrinkage = 0.01, interaction.depth = 8)
summary(boston.boost)

##             var     rel.inf
## lstat     lstat 36.67095769
## rm           rm 30.97032287
## dis         dis  8.75665950
## nox         nox  5.39464792
## crim       crim  4.85895807
## age         age  4.09968917
## black     black  2.44298478
## ptratio ptratio  2.31132672
## tax         tax  1.76325406
## indus     indus  1.30891230
## rad         rad  0.81304258
## chas       chas  0.53494510
## zn           zn  0.07429925
boston.boost.pred.train <- predict(boston.boost,  n.trees = 2000)
boston.boost.pred.test <- predict(boston.boost, boston_test, n.trees = 2000)

boost_train_mse <- round(mean((boston_train$medv-boston.boost.pred.train)^2),3) 
boost_test_mse <- round(mean((boston_test$medv-boston.boost.pred.test)^2),3)

The training-set MSE comes out to be 1.276, whereas the test-set MSE comes out to be 5.365.

However, gradient boosting machines can be susceptible to over-fitting, since the learner employed, even in its weak learning capacity, is tasked with optimally fitting the gradient. This means that boosting will select the optimal learner at each stage of the algorithm. Despite using weak learners, boosting still employs the greedy strategy of choosing the optimal weak learner at each stage. Although this strategy generates an optimal solution at the current stage, it has the drawbacks of not finding the optimal global model as well as over-fitting the training data.
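
One common safeguard, sketched below rather than used above, is to let gbm choose the number of trees by cross-validation (boston.boost.cv, best_iter and pred_cv are names introduced only for this example):

#choose n.trees by 5-fold cross-validation inside gbm
boston.boost.cv <- gbm(medv~., data = boston_train, distribution = "gaussian",
                       n.trees = 5000, shrinkage = 0.01, interaction.depth = 8,
                       cv.folds = 5)
best_iter <- gbm.perf(boston.boost.cv, method = "cv")   #CV-optimal number of trees
pred_cv   <- predict(boston.boost.cv, boston_test, n.trees = best_iter)
mean((boston_test$medv - pred_cv)^2)                    #test MSE at the CV-chosen size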

There are further improvements that can be made to the boosting mechanism.

Executive Summary

Linear Regression

At first, I used multiple linear regression to predict the response. I performed variable selection using a stepwise method, with AIC as the selection criterion.

Decision Trees

Regression trees were grown using CART to improve the prediction accuracy. They are best suited when the relationship between the variables is non-linear.

Bagging

While decision trees are easy to interpret, they tend to overfit (low bias, high variance). If we prune them to reduce overfitting, prediction accuracy is compromised. To improve the prediction accuracy we use Bootstrap Aggregating (bagging), where we fit trees to multiple bootstrap samples and combine them as an ensemble, thereby reducing the variance.

Random Forest

Bagging helps in reducing variance, but since the bagged trees are highly correlated with one another, the variance is not reduced as much as it could be. Random Forest provides a further improvement by forming the ensemble from de-correlated trees, which reduces variance significantly compared to Bagging.

Boosting

Boosting is another ensemble technique. Unlike Random Forest and Bagging, where a set of independently grown trees forms the ensemble, boosting grows trees sequentially: each new tree is fit to the residuals of the previously fitted trees and thereby tries to improve the accuracy. On this data set it provides the best prediction performance among the techniques considered.

model = factor(c("Linear Regression", "Decision Tree", "Bagging", 
              "Random Forest", "Boosting"),
              levels=c("Linear Regression", "Decision Tree", "Bagging", 
                       "Random Forest", "Boosting"))

train_mse <- c(lin_train_mse,
               reg_train_mse,
               bag_train_mse,
               rf_train_mse,
               boost_train_mse)

test_mse <- c(lin_test_mse,
               reg_test_mse,
               bag_test_mse,
               rf_test_mse,
               boost_test_mse)

table <- data.frame(model=model,
                                train_mse = train_mse,
                               test_mse = test_mse)


kable(table)
model               train_mse   test_mse
Linear Regression      24.185     18.102
Decision Tree          15.188     16.155
Bagging                18.043      8.828
Random Forest          12.173      5.538
Boosting                1.276      5.365

Thus, we see that the prediction accuracy on the test set continues to improve as we move to progressively more sophisticated techniques, with the minimum test MSE obtained for Boosting, followed by Random Forest and then Bagging.