A tree, in this sense, partitions the space of observations into several distinct regions, with a two-part goal: first, to minimize the differences within each region; second, to maximize the differences between regions. Splits are defined by conditions on any of the predictors (for example, a split could be placed where the refractive index is > 0.04 units). Splits are made one at a time, creating branches that grow downward until a desired stopping condition is reached (say, until no region contains more than five observations).
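As a quick illustration of the splitting process, here is a minimal single-tree sketch; it assumes the rpart package is available and uses the built-in iris data purely for illustration, not the glass data analysed below.
library(rpart)
# minsplit = 6 means a node holding 5 or fewer observations is never split
# again, roughly the size-based stopping rule described above; cp = 0 switches
# off the complexity-based early stopping so only the size rule applies.
toy_tree <- rpart(Species ~ ., data = iris,
                  control = rpart.control(minsplit = 6, cp = 0))
print(toy_tree)   # each line shows one split condition, e.g. Petal.Length < 2.45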
A random forest is a technique for building many different trees. First, it draws a number of bootstrap samples from the training data (typically hundreds). Then it builds one tree from each bootstrapped sample.
What makes random forests unique is that while each tree is being built, a random subset of the predictors is drawn at every split. Because not every predictor is considered for every split, the trees end up less correlated with one another.
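To make the bootstrapping and random predictor selection concrete, here is a hand-rolled sketch of the idea, again assuming rpart and the built-in iris data. For simplicity it draws the predictor subset once per tree, whereas a true random forest re-draws it at every split.
library(rpart)
set.seed(1)
n_trees <- 100
mtry    <- 2                                                  # predictors sampled for each tree
trees   <- vector("list", n_trees)
for (b in 1:n_trees) {
  boot_idx <- sample(nrow(iris), replace = TRUE)              # bootstrap the rows
  feats    <- sample(setdiff(names(iris), "Species"), mtry)   # random predictor subset
  trees[[b]] <- rpart(Species ~ ., data = iris[boot_idx, c(feats, "Species")])
}
# Predict by majority vote across all the trees
votes <- sapply(trees, function(tr) as.character(predict(tr, iris, type = "class")))
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == iris$Species)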
# Install any packages that are missing, then load everything
load.libraries <- c('tidyverse', 'GGally', 'Amelia', 'ranger', 'corrplot', 'plyr', 'randomForest',
                    'gridExtra')
install.lib <- load.libraries[!load.libraries %in% installed.packages()]
for(libs in install.lib) install.packages(libs, dependencies = TRUE)
sapply(load.libraries, require, character.only = TRUE)
## Loading required package: tidyverse
## -- Attaching packages ------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.6
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.3.0 v forcats 0.3.0
## -- Conflicts --------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: GGally
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
## Loading required package: Amelia
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
## Loading required package: ranger
## Warning: package 'ranger' was built under R version 3.5.2
## Loading required package: corrplot
## corrplot 0.84 loaded
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.5.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ranger':
##
## importance
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## tidyverse GGally Amelia ranger corrplot
## TRUE TRUE TRUE TRUE TRUE
## plyr randomForest gridExtra
## TRUE TRUE TRUE
glass <- read_csv("Glass.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## RI = col_double(),
## Na = col_double(),
## Mg = col_double(),
## Al = col_double(),
## Si = col_double(),
## K = col_double(),
## Ca = col_double(),
## Ba = col_double(),
## Fe = col_double(),
## Type = col_double()
## )
glimpse(glass)
## Observations: 214
## Variables: 11
## $ X1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ RI <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.5...
## $ Na <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 1...
## $ Mg <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.6...
## $ Al <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.3...
## $ Si <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 7...
## $ K <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.5...
## $ Ca <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.4...
## $ Ba <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Fe <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.1...
## $ Type <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
summary(glass)
## X1 RI Na Mg
## Min. : 1.00 Min. :1.511 Min. :10.73 Min. :0.000
## 1st Qu.: 54.25 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115
## Median :107.50 Median :1.518 Median :13.30 Median :3.480
## Mean :107.50 Mean :1.518 Mean :13.41 Mean :2.685
## 3rd Qu.:160.75 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600
## Max. :214.00 Max. :1.534 Max. :17.38 Max. :4.490
## Al Si K Ca
## Min. :0.290 Min. :69.81 Min. :0.0000 Min. : 5.430
## 1st Qu.:1.190 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240
## Median :1.360 Median :72.79 Median :0.5550 Median : 8.600
## Mean :1.445 Mean :72.65 Mean :0.4971 Mean : 8.957
## 3rd Qu.:1.630 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172
## Max. :3.500 Max. :75.41 Max. :6.2100 Max. :16.190
## Ba Fe Type
## Min. :0.000 Min. :0.00000 Min. :1.00
## 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:1.00
## Median :0.000 Median :0.00000 Median :2.00
## Mean :0.175 Mean :0.05701 Mean :2.78
## 3rd Qu.:0.000 3rd Qu.:0.10000 3rd Qu.:3.00
## Max. :3.150 Max. :0.51000 Max. :7.00
We have an index variable that we can eliminate. Column 2 gives the refractive index of the glass. The elemental compositions (the amount of each element present in the observation) occupy columns 3 through 10, and column 11 holds the type of glass, coded as an integer from 1 to 7.
missmap(glass)
## Warning in if (class(obj) == "amelia") {: the condition has length > 1 and
## only the first element will be used
## Warning: Unknown or uninitialised column: 'arguments'.
## Warning: Unknown or uninitialised column: 'arguments'.
## Warning: Unknown or uninitialised column: 'imputations'.
head(glass)
tail(glass)
There are no missing values or other obvious problems to work out. One thing to note: the glass types in this data range from 1 to 7, with the exception of type 4, which does not appear. I will recode the factor levels with descriptive names to make the analysis more meaningful.
unique(glass$Type)
## [1] 1 2 3 5 6 7
glass$Type <- as.factor(glass$Type)
glass$Type <- revalue(glass$Type, c("1" = "Building_Wind_Processed", "2" = "Building_Window_Non",
"3" = "Vehicle_Wind_Processed", "5" = "Container",
"6" = "Tableware", "7" = "Headlamps"))
levels(glass$Type)
## [1] "Building_Wind_Processed" "Building_Window_Non"
## [3] "Vehicle_Wind_Processed" "Container"
## [5] "Tableware" "Headlamps"
glass <- glass %>%
select(-"X1")
Let’s examine a few things. How are the elements correlated with one another?
glassCor <- cor(glass[-10])
corrplot(glassCor, method = "ellipse", type = "lower")
There is no strong correlation between most of the elements. We can see, however, that the amount of calcium is positively correlated with the refractive index, and more tightly so than most of the other pairs of variables.
What about the refractive index for each type of glass?
ggplot(glass, aes(x = RI)) +
geom_histogram(fill = "green") +
facet_wrap(~ Type)+
ggtitle("Refraction Index by Count Observed in Each Glass Type")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The range of refractive index varies considerably across types. Building windows have the most observations and span the widest interval.
To apply the random forest model, I will use a formula with glass type as the response and the refractive index plus the eight elemental concentrations as predictors, nine predictors in total.
set.seed(12306)
glass_model_rf <- randomForest(formula = Type ~ ., data = glass)
print(glass_model_rf)
##
## Call:
## randomForest(formula = Type ~ ., data = glass)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 18.69%
## Confusion matrix:
## Building_Wind_Processed Building_Window_Non
## Building_Wind_Processed 61 7
## Building_Window_Non 7 64
## Vehicle_Wind_Processed 6 3
## Container 0 3
## Tableware 0 2
## Headlamps 1 3
## Vehicle_Wind_Processed Container Tableware
## Building_Wind_Processed 2 0 0
## Building_Window_Non 1 1 1
## Vehicle_Wind_Processed 8 0 0
## Container 0 9 0
## Tableware 0 0 7
## Headlamps 0 0 0
## Headlamps class.error
## Building_Wind_Processed 0 0.1285714
## Building_Window_Non 2 0.1578947
## Vehicle_Wind_Processed 0 0.5294118
## Container 1 0.3076923
## Tableware 0 0.2222222
## Headlamps 25 0.1379310
The confusion matrix in the output tells us how the actual type in the data (the rows) compares to the predicted type (the columns). For example, 61 observations that were actually Building_Wind_Processed glass were also predicted as Building_Wind_Processed. The model incorrectly predicted Building_Wind_Processed 7 times for actual Building_Window_Non glass and 6 times for actual Vehicle_Wind_Processed glass, and so on.
The out-of-bag (OOB) estimate measures the model's prediction error: each observation is predicted using only the trees whose bootstrap samples did not include it. It tells us that roughly 18.7% of observations are expected to be misclassified.
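As a quick check on that number: for a randomForest object, calling predict() with no new data returns the out-of-bag predictions, so the OOB error rate can be reproduced directly.
oob_pred <- predict(glass_model_rf)   # no newdata, so these are the OOB predictions
mean(oob_pred != glass$Type)          # should match the ~18.7% OOB estimate above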
Tuning the random forest model comes down to finding good values for the number of trees to grow, the number of variables considered at each split, and the size of the bootstrapped samples.
Starting with the number of trees, we look at a plot of the out-of-bag error estimate as a function of the number of trees in the model. We see that the OOB estimate (the black line) more or less flattens out after about 100 trees, so the default of 500 was overkill for this model.
plot(glass_model_rf)
# Pull the error-rate matrix; the last row holds the final OOB error
err <- glass_model_rf$err.rate
oob_err <- err[nrow(err), "OOB"]
# Allow drawing outside the plot region so the legend can sit above the panel
par(xpd = TRUE)
legend(x = "top",
legend = colnames(err),
fill = 1:ncol(err))
Next, we can tune the number of predictors considered at each split (mtry). The default is the square root of the number of predictors, in this case sqrt(9) = 3. Interestingly, the default value does not give the lowest OOB error rate; we find instead that 2 is preferable.
set.seed(1)
res <- tuneRF(x = subset(glass, select = -Type),
y = glass$Type,
ntreeTry = 500)
## mtry = 3 OOB error = 21.5%
## Searching left ...
## mtry = 2 OOB error = 19.63%
## 0.08695652 0.05
## mtry = 1 OOB error = 20.09%
## -0.02380952 0.05
## Searching right ...
## mtry = 6 OOB error = 21.5%
## -0.0952381 0.05
print(res)
## mtry OOBError
## 1.OOB 1 0.2009346
## 2.OOB 2 0.1962617
## 3.OOB 3 0.2149533
## 6.OOB 6 0.2149533
mtry_opt <- res[,"mtry"][which.min(res[,"OOBError"])]
print(mtry_opt)
## 2.OOB
## 2
Finally, just to see how it performs, we can run a grid search over all of the relevant hyperparameters at once.
mtry <- seq(2, ncol(glass) * 0.8, 2)
nodesize <- seq(3, 8, 2)
sampsize <- nrow(glass) * c(0.7, 0.8)
hyper_grid <- expand.grid(mtry = mtry, nodesize = nodesize, sampsize = sampsize)
oob_err <- c()
# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:nrow(hyper_grid)) {
# Train a Random Forest model
model <- randomForest(formula = Type ~ .,
data = glass,
mtry = hyper_grid$mtry[i],
nodesize = hyper_grid$nodesize[i],
sampsize = hyper_grid$sampsize[i])
# Store OOB error for the model
oob_err[i] <- model$err.rate[nrow(model$err.rate), "OOB"]
}
# Identify optimal set of hyperparmeters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i,])
## mtry nodesize sampsize
## 1 2 3 149.8
The best parameters for the model? Two variables considered at each split (mtry = 2), a minimum terminal node size (nodesize) of 3 observations, and a bootstrap sample size of 149.8, i.e. 70% of the 214 rows. Note that sampsize comes out as a non-integer here, which is likely what triggers the matrix warnings in the refit below.
set.seed(12306)
mtry <- hyper_grid[opt_i, "mtry"]
nodesize <- hyper_grid[opt_i, "nodesize"]
sampsize <- hyper_grid[opt_i, "sampsize"]
glass_model_tuned <- randomForest(formula = Type ~ ., data = glass, mtry = mtry, nodesize = nodesize,
sampsize = sampsize)
## Warning in matrix(rfout$nodestatus, ncol = ntree): data length [150299] is
## not a sub-multiple or multiple of the number of rows [301]
## Warning in matrix(rfout$bestvar, ncol = ntree): data length [150299] is not
## a sub-multiple or multiple of the number of rows [301]
## Warning in matrix(rfout$nodepred, ncol = ntree): data length [150299] is
## not a sub-multiple or multiple of the number of rows [301]
## Warning in matrix(rfout$xbestsplit, ncol = ntree): data length [150299] is
## not a sub-multiple or multiple of the number of rows [301]
print(glass_model_tuned)
##
## Call:
## randomForest(formula = Type ~ ., data = glass, mtry = mtry, nodesize = nodesize, sampsize = sampsize)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 21.03%
## Confusion matrix:
## Building_Wind_Processed Building_Window_Non
## Building_Wind_Processed 62 7
## Building_Window_Non 11 62
## Vehicle_Wind_Processed 10 3
## Container 0 3
## Tableware 0 2
## Headlamps 1 3
## Vehicle_Wind_Processed Container Tableware
## Building_Wind_Processed 1 0 0
## Building_Window_Non 1 1 1
## Vehicle_Wind_Processed 4 0 0
## Container 0 9 0
## Tableware 0 0 7
## Headlamps 0 0 0
## Headlamps class.error
## Building_Wind_Processed 0 0.1142857
## Building_Window_Non 0 0.1842105
## Vehicle_Wind_Processed 0 0.7647059
## Container 1 0.3076923
## Tableware 0 0.2222222
## Headlamps 25 0.1379310
Tuning the model actually increased the OOB error rate, from 18.69% to 21.03% - weird! Part of the explanation is that the OOB error is itself a noisy estimate, and the grid search evaluated each hyperparameter combination only once, so the winning combination may simply have looked good by chance on that single fit.
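One way to check whether the difference is real or just noise is to refit both the default and the tuned model over several seeds and compare their OOB error rates. A minimal sketch, rounding sampsize to an integer to avoid the matrix warnings seen above:
seeds <- 1:10
oob_default <- oob_tuned <- numeric(length(seeds))
for (s in seq_along(seeds)) {
  set.seed(seeds[s])
  m_def <- randomForest(Type ~ ., data = glass)
  m_tun <- randomForest(Type ~ ., data = glass, mtry = mtry,
                        nodesize = nodesize, sampsize = round(sampsize))
  oob_default[s] <- m_def$err.rate[nrow(m_def$err.rate), "OOB"]
  oob_tuned[s]   <- m_tun$err.rate[nrow(m_tun$err.rate), "OOB"]
}
summary(oob_default)
summary(oob_tuned)
If the two sets of errors overlap heavily, the grid search's winner was probably just lucky on a single fit.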