This project aims to find the factors affecting domestic property values in the city of Boston. Factors such as per capita income, environmental conditions, educational facilities, and property size were taken into consideration to determine the most significant predictors.
To predict house prices, we use models from the tree family.
We will compare the performance of these methods to find the best model.
Part 1 deals with CART and BAGGING.
The following packages are required for the project:
## function to install and load multiple packages at once in R
install_load <- function(pack){
  ## Statement to check if the package has been previously installed
  new_pack_load <- pack[!(pack %in% installed.packages()[, "Package"])]
  if (length(new_pack_load))
    install.packages(new_pack_load, dependencies = TRUE)
  sapply(pack, require, character.only = TRUE)
}
package_load <- c("corrr", "gridExtra", "ggplot2", "tidyverse", "dplyr", "DT", "MASS", "leaps", "glmnet", "PerformanceAnalytics", "tree", "rpart", "rpart.plot","randomForest", "ipred", "gbm")
install_load(package_load)
## corrr gridExtra ggplot2
## TRUE TRUE TRUE
## tidyverse dplyr DT
## TRUE TRUE TRUE
## MASS leaps glmnet
## TRUE TRUE TRUE
## PerformanceAnalytics tree rpart
## TRUE TRUE TRUE
## rpart.plot randomForest ipred
## TRUE TRUE TRUE
## gbm
## TRUE
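The Boston housing data used throughout ships with the MASS package loaded above, so it is available directly; a quick availability and size check (a small sketch, output omitted):
data("Boston", package = "MASS")  # attach the Boston housing data from MASS
dim(Boston)                       # 506 rows, 14 columns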
Our data contains 506 observations of 14 variables. The data types are as follows:
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
A quick summary of the distribution of every variable in the data:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
We plot scatter plots of the target variable medv against each of the other variables; rm and lstat show a roughly parabolic relationship with medv.
Boston %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = value, y = medv)) +
  geom_point() +
  stat_smooth() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
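To put rough numbers on these relationships, the linear correlation of each predictor with medv can be pulled out with the corrr package loaded above; a quick sketch (output not shown):
Boston %>%
  correlate() %>%            # pairwise Pearson correlations
  focus(medv) %>%            # keep only the correlations with medv
  arrange(desc(abs(medv)))   # strongest linear associations first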
Boxplots show no significant outliers in the data:
Boston %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = '', y = value)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1) +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
Histograms show the shape of each predictor's distribution:
Boston %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
We split our data into training and test sets in an 80:20 ratio. We will use the training data for modelling and the test data for validation.
set.seed(12420352)
index <- sample(nrow(Boston),nrow(Boston)*0.80)
Boston.train <- Boston[index,]
Boston.test <- Boston[-index,]
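Model performance below is judged by mean squared (prediction) error on the training and test sets; a tiny helper like the one below (the name mse is ours, purely for convenience) could replace the inline mean((pred - actual)^2) calls that follow:
## Mean squared error between a vector of predictions and the observed values
mse <- function(pred, actual) mean((pred - actual)^2)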
We fit a regression tree to the training data.
We start by growing the deepest tree, keeping cp = 0.001. The fitted tree can be seen below:
Boston.tree <- rpart(medv ~ ., data = Boston.train, cp = 0.001)
rpart.plot(Boston.tree, type = 1, fallen.leaves = FALSE)
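The number of terminal nodes in this unpruned tree can be counted directly from the fitted rpart object; a small sketch:
## Leaves are rows of the frame whose splitting variable is "<leaf>"
sum(Boston.tree$frame$var == "<leaf>")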
To avoid overfitting and reduce variance, we need to prune back this large tree. We use cross-validation to select the best cp level. The plot below shows the relative cross-validation error for different tree sizes (numbers of terminal nodes). After 8 terminal nodes the decrease in error is no longer significant, so we prune the tree at this level.
The cp plot and table used to choose the pruning level:
plotcp(Boston.tree)
printcp(Boston.tree)
##
## Regression tree:
## rpart(formula = medv ~ ., data = Boston.train, cp = 0.001)
##
## Variables actually used in tree construction:
## [1] black crim dis indus lstat nox ptratio rm tax
##
## Root node error: 34807/404 = 86.155
##
## n= 404
##
## CP nsplit rel error xerror xstd
## 1 0.4394029 0 1.00000 1.00605 0.093348
## 2 0.1801891 1 0.56060 0.64455 0.066897
## 3 0.0707158 2 0.38041 0.43935 0.051076
## 4 0.0424304 3 0.30969 0.38269 0.050220
## 5 0.0345825 4 0.26726 0.37247 0.050483
## 6 0.0239214 5 0.23268 0.35471 0.049760
## 7 0.0236063 6 0.20876 0.33543 0.047298
## 8 0.0088629 7 0.18515 0.31061 0.047573
## 9 0.0085790 8 0.17629 0.30480 0.045155
## 10 0.0078425 9 0.16771 0.30122 0.045136
## 11 0.0065798 10 0.15987 0.29348 0.044691
## 12 0.0058768 11 0.15329 0.30020 0.045738
## 13 0.0044298 12 0.14741 0.28274 0.040699
## 14 0.0042918 13 0.14298 0.28074 0.040854
## 15 0.0032757 14 0.13869 0.27332 0.039521
## 16 0.0032652 15 0.13541 0.27335 0.039458
## 17 0.0023268 16 0.13215 0.27044 0.039449
## 18 0.0022818 17 0.12982 0.27062 0.039497
## 19 0.0017836 18 0.12754 0.27116 0.039486
## 20 0.0016586 19 0.12576 0.27257 0.039451
## 21 0.0016459 21 0.12244 0.27281 0.039459
## 22 0.0015443 22 0.12079 0.27206 0.039467
## 23 0.0013947 23 0.11925 0.27245 0.039484
## 24 0.0013701 24 0.11785 0.27250 0.039475
## 25 0.0011746 26 0.11511 0.27313 0.040010
## 26 0.0010000 27 0.11394 0.27125 0.039026
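Rather than reading the cp value off the plot and table by eye, it can also be extracted programmatically from the fitted tree's cptable, either as the cp that minimizes the cross-validated error or via the one-standard-error rule. A sketch (we keep the manual choice of 8 terminal nodes below):
cp_tab <- Boston.tree$cptable
cp_min <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]          # cp with the smallest xerror
xerr_cut <- min(cp_tab[, "xerror"]) +
  cp_tab[which.min(cp_tab[, "xerror"]), "xstd"]                # minimum xerror plus one std. error
cp_1se <- max(cp_tab[cp_tab[, "xerror"] <= xerr_cut, "CP"])    # largest cp within that band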
Pruning at cp = 0.0085790 gives a tree with the splits below:
prune.tree1 <- prune(Boston.tree, cp = 0.0085790)
prune.tree1
## n= 404
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 404 34806.8000 22.42599
## 2) lstat>=9.725 236 5997.1150 17.23475
## 4) lstat>=16.085 114 2184.3260 13.89386
## 8) dis< 2.0037 60 698.0260 11.33000 *
## 9) dis>=2.0037 54 653.6720 16.74259 *
## 5) lstat< 16.085 122 1351.4000 20.35656 *
## 3) lstat< 9.725 168 13515.4700 29.71845
## 6) rm< 7.0115 123 4635.5160 26.02276
## 12) dis>=2.0891 116 2225.6160 25.17155
## 24) rm< 6.5445 67 618.6687 22.89552 *
## 25) rm>=6.5445 49 785.2869 28.28367
## 50) lstat>=5.195 28 203.4668 26.11071 *
## 51) lstat< 5.195 21 273.3324 31.18095 *
## 13) dis< 2.0891 7 933.0343 40.12857 *
## 7) rm>=7.0115 45 2608.1520 39.82000
## 14) rm< 7.437 22 404.9277 34.53182 *
## 15) rm>=7.437 23 999.5191 44.87826 *
rpart.plot(prune.tree1, type =1, fallen.leaves = FALSE)
We use the training MSE and test MSPE as the metrics of tree performance. The pruned tree gives a test MSPE of 15.94623 and a training MSE of 15.18823.
boston.train.pred.tree = predict(prune.tree1)
boston.test.pred.tree = predict(prune.tree1, Boston.test)
mean((boston.test.pred.tree - Boston.test$medv)^2)
## [1] 15.94623
mean((boston.train.pred.tree - Boston.train$medv)^2)
## [1] 15.18823
Bagging (bootstrap aggregation) is a method for reducing variance by building multiple tree models on bootstrapped samples of the data and aggregating the results of all trees. The model below aggregates the predictions of 500 trees, considering all 13 predictors at each split.
Pre <- as.formula("medv ~ .")
bag.boston <- randomForest(Pre, data=Boston.train, mtry=13, importance =TRUE)
bag.boston
##
## Call:
## randomForest(formula = Pre, data = Boston.train, mtry = 13, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 10.94713
## % Var explained: 87.29
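To make the bootstrap-plus-aggregation idea concrete, bagging can also be written out by hand: fit one tree on each bootstrap resample of the training data and average their predictions. A minimal sketch (B = 100 resamples is an arbitrary choice here, and each tree uses rpart defaults):
set.seed(12420352)
B <- 100
boot_preds <- matrix(NA, nrow = nrow(Boston.test), ncol = B)
for (b in 1:B) {
  idx <- sample(nrow(Boston.train), replace = TRUE)   # bootstrap resample of the training rows
  fit <- rpart(medv ~ ., data = Boston.train[idx, ])  # grow one tree on the resample
  boot_preds[, b] <- predict(fit, Boston.test)        # predict the test set with this tree
}
mean((rowMeans(boot_preds) - Boston.test$medv)^2)     # aggregate by averaging, then compute test MSPE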
Plotting the out-of-bag error against the number of trees used for aggregation, we see that the error stops decreasing after about 100 trees, so using more trees gives no added advantage.
plot(bag.boston)
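The out-of-bag MSE at each ensemble size is stored in the fitted object, so the curve in this plot can also be inspected numerically; a sketch:
which.min(bag.boston$mse)   # number of trees at which the OOB MSE is lowest
min(bag.boston$mse)         # the corresponding OOB MSE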
Checking the training and test errors, we get a test MSPE of 7.960609 and a training error of 10.94713 (for a random forest fit, predict() without new data returns out-of-bag predictions, so this matches the OOB mean of squared residuals above). This is a significant improvement over the single-tree model.
boston.train.pred.tree = predict(bag.boston)
boston.test.pred.tree = predict(bag.boston, Boston.test)
mean((boston.test.pred.tree - Boston.test$medv)^2)
## [1] 7.960609
mean((boston.train.pred.tree - Boston.train$medv)^2)
## [1] 10.94713
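As a cross-check, the ipred package loaded at the start fits the same kind of bagged ensemble via bagging(); with coob = TRUE it reports an out-of-bag error estimate directly. A sketch (nbagg = 500 mirrors the 500 trees above):
bag.ipred <- bagging(medv ~ ., data = Boston.train, nbagg = 500, coob = TRUE)
bag.ipred                                                      # print method shows the out-of-bag error estimate
mean((predict(bag.ipred, Boston.test) - Boston.test$medv)^2)   # test MSPE for comparison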
We generate two plots to check variable importance. The former is based on the mean decrease in prediction accuracy on the out-of-bag samples when a given variable is permuted. The latter measures the total decrease in node RSS that results from splits over that variable, averaged over all trees. From the plots, we see that rm and lstat are the two most important variables.
importance(bag.boston)
## %IncMSE IncNodePurity
## crim 14.965464 1465.16882
## zn 3.423625 35.81986
## indus 10.451900 258.84124
## chas -1.237404 28.38546
## nox 16.774764 636.84081
## rm 46.651222 11743.19452
## age 12.152279 580.90133
## dis 28.662656 2540.36637
## rad 3.817848 125.78485
## tax 17.627977 554.42317
## ptratio 18.708013 577.95894
## black 8.483604 350.10429
## lstat 46.444565 15596.09061
varImpPlot(bag.boston)
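The same ranking can be read off numerically, for example the five largest %IncMSE values:
imp <- importance(bag.boston)
head(sort(imp[, "%IncMSE"], decreasing = TRUE), 5)   # top variables by permutation importance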
On comparing the performance of CART vs. bagging:
Bagging clearly outperforms CART, roughly halving the test MSPE (15.95 vs. 7.96).
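Collecting the errors reported above into one table makes the comparison explicit (numbers copied from the output above; the bagging training figure is the out-of-bag mean of squared residuals):
results <- data.frame(
  model     = c("CART (pruned)", "Bagging"),
  train.MSE = c(15.18823, 10.94713),
  test.MSPE = c(15.94623, 7.960609)
)
results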
We will compare these two techniques with RANDOM FOREST and BOOSTING in Part 2.