BOOSTING AND RANDOM FOREST

Introduction

This project aims to find the factors affecting domestic property values in the city of Boston. Factors such as per capita income, environmental conditions, educational facilities, and property size were taken into consideration to determine the most significant parameters.

To predict house prices, we use the tree family of methods:

  • CART
  • Bagging
  • Boosting
  • Random Forest

We will compare the performance of these methods to find the best model.

Part 2 deals with Boosting and Random Forest.

Packages Required

The following packages are required for the project:

## Function to install (if necessary) and load multiple packages at once in R

install_load <- function(pack){
  ## packages in `pack` that have not been installed yet
  new_pack_load <- pack[!(pack %in% installed.packages()[, "Package"])]
  if (length(new_pack_load)) {
    install.packages(new_pack_load, dependencies = TRUE)
  }
  ## load every requested package, returning TRUE/FALSE per package
  sapply(pack, require, character.only = TRUE)
}

package_load <- c("corrr", "gridExtra", "ggplot2", "tidyverse", "dplyr", "DT", "MASS", "leaps", "glmnet", "PerformanceAnalytics", "tree", "rpart", "rpart.plot","randomForest", "ipred", "gbm")
install_load(package_load)
##                corrr            gridExtra              ggplot2 
##                 TRUE                 TRUE                 TRUE 
##            tidyverse                dplyr                   DT 
##                 TRUE                 TRUE                 TRUE 
##                 MASS                leaps               glmnet 
##                 TRUE                 TRUE                 TRUE 
## PerformanceAnalytics                 tree                rpart 
##                 TRUE                 TRUE                 TRUE 
##           rpart.plot         randomForest                ipred 
##                 TRUE                 TRUE                 TRUE 
##                  gbm 
##                 TRUE

Data Exploration

Checking the data structure

Our data contains 506 observations of 14 variables. The data types are as follows:

glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

Data Summary

A quick summary of the distribution of every variable in the data:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Distributions

Predictor variables vs. target variable

Plotting the target variable medv against each of the other variables, we see that rm and lstat show a curved (roughly parabolic) relationship with medv.

Boston %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = value, y = medv)) +
  geom_point() +
  stat_smooth() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()

Boxplots for Predictors

The boxplots (with outliers highlighted in red) show no extreme values severe enough to require special treatment.

Boston %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = '',y = value)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1) +
  facet_wrap(~ var, scales = "free") +
  theme_bw()

Histograms for Predictors

The histograms of predictors give the following insights:

  • rad and tax each show two distinct peaks, with no data in between
  • rm follows an approximately normal distribution
  • most of the other distributions are skewed

Boston %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()

Splitting the Data

We split the data into training and test sets in an 80:20 ratio. The training data is used for modelling and the test data for validation.

set.seed(12420352)
index <- sample(nrow(Boston),nrow(Boston)*0.80)
Boston.train <- Boston[index,]
Boston.test <- Boston[-index,]

Random Forest

Random forest improves on bagging by decorrelating the individual trees: each split considers only a random subset of the predictors, which further reduces the variance of the ensemble. We need to tune two parameters:

  • Number of trees: we grow up to a maximum of 500 trees
  • Number of variables tried at each split (mtry): we choose the best value between 3 and 5

oob.err <- data.frame()

## Fit a random forest for each candidate mtry and record the OOB MSE
## after every tree grown (fit$mse has one entry per tree; default ntree = 500)
for(i in 3:5){
  fit <- randomForest(medv ~ ., data = Boston.train, mtry = i)
  oob.err <- rbind(oob.err, data.frame(mtry = i, ntree = seq(1, 500), oob_error = fit$mse))
  cat(i, " ")
}
## 3  4  5
oob.err$mtry <- as.factor(oob.err$mtry)

The plot below shows that mtry = 4 with around 100 trees is good enough for the model.

ggplot(data=oob.err, aes(x=ntree, y=oob_error, color=mtry)) + geom_point() + geom_line()
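
Rather than reading the best combination off the plot, we can also extract it from the oob.err data frame built above. This is a minimal sketch; averaging the OOB error over the last 100 trees, where the curves have flattened out, is just one reasonable heuristic:

## average OOB error per mtry over the last 100 trees grown
oob.err %>%
  filter(ntree > 400) %>%
  group_by(mtry) %>%
  summarise(mean_oob = mean(oob_error)) %>%
  arrange(mean_oob)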

Creating the model with the selected parameters

Pre <- as.formula("medv ~ .")
rf.boston <- randomForest(Pre,data=Boston.train, mtry=4, ntree=100, importance =TRUE)
rf.boston
## 
## Call:
##  randomForest(formula = Pre, data = Boston.train, mtry = 4, ntree = 100,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 11.59266
##                     % Var explained: 86.54

Performance

Checking the training MSE and the test MSPE, we get a test error of 7.571549 and a training (out-of-bag) error of 11.59266.

boston.train.pred.tree = predict(rf.boston)
boston.test.pred.tree = predict(rf.boston, Boston.test)
mean((boston.test.pred.tree - Boston.test$medv)^2)
## [1] 7.571549
mean((boston.train.pred.tree - Boston.train$medv)^2)
## [1] 11.59266
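
For a random forest, predict() without a newdata argument returns the out-of-bag predictions, so the "training" error above is really the OOB estimate and matches the mean of squared residuals printed in the model summary. A quick sketch to confirm this directly from the fitted object:

## OOB MSE after the final (100th) tree, stored in the randomForest object
rf.boston$mse[rf.boston$ntree]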

Variable Importance

rm and lstat are again the most important variables in the random forest model.

importance(rf.boston)
##           %IncMSE IncNodePurity
## crim     6.664825     2178.1090
## zn       2.317746      206.5644
## indus    5.962539     2162.5411
## chas     2.644446      252.1947
## nox      7.660876     2576.0658
## rm      16.407279    10604.4800
## age      4.802469      969.8919
## dis      7.666008     2384.9127
## rad      2.348645      355.1198
## tax      6.656205      876.0768
## ptratio  7.614382     2213.7355
## black    3.826313      752.7358
## lstat   12.060453     8629.4972
varImpPlot(rf.boston)

Boosting

We use the gradient boosting method (gbm) with the following parameters:

  • Number of trees: 10000
  • shrinkage: 0.01
  • Tree depth: 8

As the relative influence summary below shows, lstat and rm are the most influential variables.

boston.boost<- gbm(medv~., data = Boston.train, distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 8)
summary(boston.boost)

##             var    rel.inf
## lstat     lstat 39.5617585
## rm           rm 28.9728906
## dis         dis  9.3589794
## crim       crim  4.2978865
## nox         nox  3.8464416
## age         age  3.6606354
## ptratio ptratio  2.8917482
## black     black  2.6447396
## tax         tax  2.0662900
## indus     indus  1.0548192
## rad         rad  0.8875357
## chas       chas  0.6073448
## zn           zn  0.1489306

Tuning the Hyperparameters

Number of trees

Checking how the test error varies with the number of trees, we see that the error is lowest when the number of trees is around the 500-1000 mark.

ntree<- seq(100, 10000, 100)
predmat<- predict(boston.boost, newdata = Boston.test, n.trees = ntree)
err<- apply((predmat-Boston.test$medv)^2, 2, mean)
plot(ntree, err, type = 'l', col=2, lwd=2, xlab = "n.trees", ylab = "Test MSE")
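
gbm can also select the number of trees for us. The sketch below refits the same model with 5-fold cross-validation (the cv.folds argument is an addition not used elsewhere in this report) and lets gbm.perf() return the estimated optimal number of trees:

## refit with cross-validation so that gbm.perf() can pick n.trees
boston.boost.cv <- gbm(medv ~ ., data = Boston.train, distribution = "gaussian",
                       n.trees = 10000, shrinkage = 0.01, interaction.depth = 8,
                       cv.folds = 5)

## estimated optimal number of trees (also plots training vs CV error)
best.ntrees <- gbm.perf(boston.boost.cv, method = "cv")
best.ntrees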

Shrinkage and tree depth

We can also tune the other two parameters, shrinkage and tree depth (interaction.depth), by checking performance on the test data.

oob.err <- data.frame()   ## each row: one (shrinkage, n.trees, depth) combination and its test MSPE
shrinkage_val <- c(0.01, 0.02, 0.05, 0.1)
trees <- c(500, 700, 900, 1000, 1200, 1500)
depth <- c(2, 3, 5, 8)

for(i in shrinkage_val){
  for(k in depth){
    ## fit once with the maximum number of trees for this shrinkage/depth pair
    fit <- gbm(medv ~ ., data = Boston.train, distribution = "gaussian",
               n.trees = 2500, shrinkage = i, interaction.depth = k)

    ## evaluate the same fit at several smaller tree counts
    for(j in trees){
      fit.pred <- predict(fit, Boston.test, n.trees = j)
      mspe <- mean((fit.pred - Boston.test$medv)^2)
      oob.err <- rbind(oob.err, data.frame(shrink = i, ntree = j, depth_val = k, oob_error = mspe))
    }

    cat(i, " ")
  }
}
## 0.01  0.01  0.01  0.01  0.02  0.02  0.02  0.02  0.05  0.05  0.05  0.05  0.1  0.1  0.1  0.1
head(oob.err %>% arrange(oob_error), 10)
##    shrink ntree depth_val oob_error
## 1    0.02   500         5  6.713356
## 2    0.01   700         5  6.729115
## 3    0.01   900         5  6.764589
## 4    0.01   500         8  6.784178
## 5    0.01   900         8  6.785266
## 6    0.02   700         5  6.786077
## 7    0.01   500         5  6.799373
## 8    0.01   700         8  6.803835
## 9    0.01  1000         8  6.807028
## 10   0.01  1000         5  6.814400
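
The best combination can also be pulled out programmatically instead of reading it off the table; a minimal sketch using base R indexing on the tuning results:

## row of the tuning grid with the lowest test MSPE
oob.err[which.min(oob.err$oob_error), ]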

Performance

Checking the performance of the gradient boosting model with the following parameters:

  • Number of trees: 900
  • shrinkage: 0.01
  • Tree depth: 8

boston.boost<- gbm(medv~., data = Boston.train, distribution = "gaussian", n.trees = 900, shrinkage = 0.01, interaction.depth = 8)
summary(boston.boost)

##             var     rel.inf
## lstat     lstat 44.12305773
## rm           rm 28.04294547
## dis         dis  8.20558906
## crim       crim  4.26713113
## nox         nox  3.36487020
## age         age  2.95590726
## ptratio ptratio  2.87778696
## tax         tax  2.04182794
## black     black  1.84154431
## indus     indus  0.95251606
## rad         rad  0.78624255
## chas       chas  0.45296121
## zn           zn  0.08762012
boston.train.pred.tree = predict(boston.boost, n.trees = 900)
boston.test.pred.tree = predict(boston.boost, Boston.test, n.trees = 900)
mean((boston.test.pred.tree - Boston.test$medv)^2)
## [1] 6.647038
mean((boston.train.pred.tree - Boston.train$medv)^2)
## [1] 3.010795

Effect of the two most significant variables

Checking the partial dependence of medv on the two most influential variables:

  • as lstat increases, the predicted price decreases
  • as rm increases, the predicted price increases

par(mfrow=c(1,2))
plot(boston.boost, i="lstat")
plot(boston.boost, i="rm")

Summary

Comparing the performance of the different techniques (the CART and Bagging figures are carried over from Part 1; the Random Forest and Boosting figures are the ones reported above):

  • CART: Test error: 15.94623 and train error: 15.18823
  • BAGGING: Test error: 7.960609 and train error: 10.94713
  • RANDOM FOREST: Test error: 7.571549 and train error: 11.59266
  • BOOSTING: Test error: 6.647038 and train error: 3.010795

Boosting clearly outperforms all the other techniques on this dataset. Random Forest and Bagging give almost the same performance, while CART does not predict as well as the ensemble methods.
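
For reference, the errors above can be collected into a single comparison table. This is a minimal sketch that simply hard-codes the values reported in this summary (the CART and Bagging figures are carried over from Part 1):

## test and train errors for the four models, as reported above
results <- data.frame(
  model     = c("CART", "Bagging", "Random Forest", "Boosting"),
  test_mse  = c(15.94623, 7.960609, 7.571549, 6.647038),
  train_mse = c(15.18823, 10.94713, 11.59266, 3.010795)
)
results[order(results$test_mse), ]   ## sorted from best to worst test error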