A tree, in this sense, partitions the space of observations into several distinct regions, with a two-part goal: first, to minimize the differences within each region; second, to maximize the differences between regions. Splits are defined by conditions on any of the predictors (for example, a split could be placed where the refractive index is > 0.04 units). Splits are made one at a time, creating branches that grow downward until a desired stopping condition is reached (say, until no region contains more than five observations).
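As a quick illustration of the splitting process, here is a minimal single-tree sketch; it assumes the rpart package is available and uses the built-in iris data purely for illustration, not the glass data analysed below.
library(rpart)
# minsplit = 6 means a node holding 5 or fewer observations is never split
# again, roughly the size-based stopping rule described above; cp = 0 switches
# off the complexity-based early stopping so only the size rule applies.
toy_tree <- rpart(Species ~ ., data = iris,
                  control = rpart.control(minsplit = 6, cp = 0))
print(toy_tree)   # each line shows one split condition, e.g. Petal.Length < 2.45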
A random forest is a technique for building many different trees. First, it draws a number of bootstrap samples from the training data (typically hundreds). Then it builds one tree from each bootstrapped sample.
What makes random forests unique is that while each tree is being built, a random subset of the predictors is drawn at every split. Because not every predictor is considered for every split, the trees end up less correlated with one another.
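To make the bootstrapping and random predictor selection concrete, here is a hand-rolled sketch of the idea, again assuming rpart and the built-in iris data. For simplicity it draws the predictor subset once per tree, whereas a true random forest re-draws it at every split.
library(rpart)
set.seed(1)
n_trees <- 100
mtry    <- 2                                                  # predictors sampled for each tree
trees   <- vector("list", n_trees)
for (b in 1:n_trees) {
  boot_idx <- sample(nrow(iris), replace = TRUE)              # bootstrap the rows
  feats    <- sample(setdiff(names(iris), "Species"), mtry)   # random predictor subset
  trees[[b]] <- rpart(Species ~ ., data = iris[boot_idx, c(feats, "Species")])
}
# Predict by majority vote across all the trees
votes <- sapply(trees, function(tr) as.character(predict(tr, iris, type = "class")))
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == iris$Species)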
# Install any packages that are missing, then load everything
load.libraries <- c('tidyverse', 'GGally', 'Amelia', 'ranger', 'corrplot', 'plyr', 'randomForest',
                    'gridExtra')
install.lib <- load.libraries[!load.libraries %in% installed.packages()]
for(libs in install.lib) install.packages(libs, dependencies = TRUE)
sapply(load.libraries, require, character.only = TRUE)
## Loading required package: tidyverse
## -- Attaching packages ------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.6
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.3.0 v forcats 0.3.0
## -- Conflicts --------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: GGally
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
## Loading required package: Amelia
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
## Loading required package: ranger
## Warning: package 'ranger' was built under R version 3.5.2
## Loading required package: corrplot
## corrplot 0.84 loaded
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.5.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ranger':
##
## importance
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## tidyverse GGally Amelia ranger corrplot
## TRUE TRUE TRUE TRUE TRUE
## plyr randomForest gridExtra
## TRUE TRUE TRUE
glass <- read_csv("Glass.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## RI = col_double(),
## Na = col_double(),
## Mg = col_double(),
## Al = col_double(),
## Si = col_double(),
## K = col_double(),
## Ca = col_double(),
## Ba = col_double(),
## Fe = col_double(),
## Type = col_double()
## )
glimpse(glass)
## Observations: 214
## Variables: 11
## $ X1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ RI <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.5...
## $ Na <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 1...
## $ Mg <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.6...
## $ Al <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.3...
## $ Si <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 7...
## $ K <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.5...
## $ Ca <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.4...
## $ Ba <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Fe <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.1...
## $ Type <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
summary(glass)
## X1 RI Na Mg
## Min. : 1.00 Min. :1.511 Min. :10.73 Min. :0.000
## 1st Qu.: 54.25 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115
## Median :107.50 Median :1.518 Median :13.30 Median :3.480
## Mean :107.50 Mean :1.518 Mean :13.41 Mean :2.685
## 3rd Qu.:160.75 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600
## Max. :214.00 Max. :1.534 Max. :17.38 Max. :4.490
## Al Si K Ca
## Min. :0.290 Min. :69.81 Min. :0.0000 Min. : 5.430
## 1st Qu.:1.190 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240
## Median :1.360 Median :72.79 Median :0.5550 Median : 8.600
## Mean :1.445 Mean :72.65 Mean :0.4971 Mean : 8.957
## 3rd Qu.:1.630 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172
## Max. :3.500 Max. :75.41 Max. :6.2100 Max. :16.190
## Ba Fe Type
## Min. :0.000 Min. :0.00000 Min. :1.00
## 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:1.00
## Median :0.000 Median :0.00000 Median :2.00
## Mean :0.175 Mean :0.05701 Mean :2.78
## 3rd Qu.:0.000 3rd Qu.:0.10000 3rd Qu.:3.00
## Max. :3.150 Max. :0.51000 Max. :7.00
We have an index variable that we can eliminate. Column 2 gives the refractive index of the glass. The elemental compositions (the amount of each element present in the observation) occupy columns 3 through 10, and column 11 holds the type of glass, coded as an integer from 1 to 7.
missmap(glass)
## Warning in if (class(obj) == "amelia") {: the condition has length > 1 and
## only the first element will be used
## Warning: Unknown or uninitialised column: 'arguments'.
## Warning: Unknown or uninitialised column: 'arguments'.
## Warning: Unknown or uninitialised column: 'imputations'.
head(glass)
tail(glass)
There are no missing values or other obvious problems to work out. One thing to note: the glass types in this data range from 1 to 7, with the exception of type 4, which does not appear. I will recode the factor levels with descriptive names to make the analysis more meaningful.
unique(glass$Type)
## [1] 1 2 3 5 6 7
glass$Type <- as.factor(glass$Type)
glass$Type <- revalue(glass$Type, c("1" = "Building_Wind_Processed", "2" = "Building_Window_Non",
"3" = "Vehicle_Wind_Processed", "5" = "Container",
"6" = "Tableware", "7" = "Headlamps"))
levels(glass$Type)
## [1] "Building_Wind_Processed" "Building_Window_Non"
## [3] "Vehicle_Wind_Processed" "Container"
## [5] "Tableware" "Headlamps"
glass <- glass %>%
select(-"X1")
Let’s examine a few things. How are the elements correlated with one another?
glassCor <- cor(glass[-10])
corrplot(glassCor, method = "ellipse", type = "lower")
There is no strong correlation between most of the elements. We can see, however, that the amount of calcium is positively correlated with the refractive index, and more tightly so than most of the other pairs of variables.
What about the refractive index for each type of glass?
ggplot(glass, aes(x = RI)) +
geom_histogram(fill = "green") +
facet_wrap(~ Type)+
ggtitle("Refraction Index by Count Observed in Each Glass Type")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The range of refractive index varies considerably across types. Building windows have the most observations and span the widest interval.
To apply the random forest model, I will use a formula with glass type as the response and the refractive index plus the eight elemental concentrations as predictors, nine predictors in total.
set.seed(12306)
glass_model_rf <- randomForest(formula = Type ~ ., data = glass)
print(glass_model_rf)
##
## Call:
## randomForest(formula = Type ~ ., data = glass)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 18.69%
## Confusion matrix:
## Building_Wind_Processed Building_Window_Non
## Building_Wind_Processed 61 7
## Building_Window_Non 7 64
## Vehicle_Wind_Processed 6 3
## Container 0 3
## Tableware 0 2
## Headlamps 1 3
## Vehicle_Wind_Processed Container Tableware
## Building_Wind_Processed 2 0 0
## Building_Window_Non 1 1 1
## Vehicle_Wind_Processed 8 0 0
## Container 0 9 0
## Tableware 0 0 7
## Headlamps 0 0 0
## Headlamps class.error
## Building_Wind_Processed 0 0.1285714
## Building_Window_Non 2 0.1578947
## Vehicle_Wind_Processed 0 0.5294118
## Container 1 0.3076923
## Tableware 0 0.2222222
## Headlamps 25 0.1379310
The confusion matrix in the output tells us how the actual type in the data (the rows) compares to the predicted type (the columns). For example, 61 observations that were actually Building_Wind_Processed glass were also predicted as Building_Wind_Processed. The model incorrectly predicted Building_Wind_Processed 7 times for actual Building_Window_Non glass and 6 times for actual Vehicle_Wind_Processed glass, and so on.
The out-of-bag (OOB) estimate measures the model's prediction error: each observation is predicted using only the trees whose bootstrap samples did not include it. It tells us that roughly 18.7% of observations are expected to be misclassified.
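As a quick check on that number: for a randomForest object, calling predict() with no new data returns the out-of-bag predictions, so the OOB error rate can be reproduced directly.
oob_pred <- predict(glass_model_rf)   # no newdata, so these are the OOB predictions
mean(oob_pred != glass$Type)          # should match the ~18.7% OOB estimate above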
Tuning the random forest model comes down to finding good values for the number of trees to grow, the number of variables considered at each split, and the size of the bootstrapped samples.
Starting with the number of trees, we look at a plot of the out-of-bag error estimate as a function of the number of trees in the model. We see that the OOB estimate (the black line) more or less flattens out after about 100 trees, so the default of 500 was overkill for this model.
plot(glass_model_rf)
# Pull the error-rate matrix; the last row holds the final OOB error
err <- glass_model_rf$err.rate
oob_err <- err[nrow(err), "OOB"]
# Allow drawing outside the plot region so the legend can sit above the panel
par(xpd = TRUE)
legend(x = "top",
legend = colnames(err),
fill = 1:ncol(err))
Next, we can tune the number of predictors considered at each split (mtry). The default is the square root of the number of predictors, in this case sqrt(9) = 3. Interestingly, the default value does not give the lowest OOB error rate; we find instead that 2 is preferable.
set.seed(1)
res <- tuneRF(x = subset(glass, select = -Type),
y = glass$Type,
ntreeTry = 500)
## mtry = 3 OOB error = 21.5%
## Searching left ...
## mtry = 2 OOB error = 19.63%
## 0.08695652 0.05
## mtry = 1 OOB error = 20.09%
## -0.02380952 0.05
## Searching right ...
## mtry = 6 OOB error = 21.5%
## -0.0952381 0.05
print(res)
## mtry OOBError
## 1.OOB 1 0.2009346
## 2.OOB 2 0.1962617
## 3.OOB 3 0.2149533
## 6.OOB 6 0.2149533
mtry_opt <- res[,"mtry"][which.min(res[,"OOBError"])]
print(mtry_opt)
## 2.OOB
## 2
Finally, just to see how it performs, we can run a grid search over all of the relevant hyperparameters at once.
mtry <- seq(2, ncol(glass) * 0.8, 2)
nodesize <- seq(3, 8, 2)
sampsize <- nrow(glass) * c(0.7, 0.8)
hyper_grid <- expand.grid(mtry = mtry, nodesize = nodesize, sampsize = sampsize)
oob_err <- c()
# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:nrow(hyper_grid)) {
# Train a Random Forest model
model <- randomForest(formula = Type ~ .,
data = glass,
mtry = hyper_grid$mtry[i],
nodesize = hyper_grid$nodesize[i],
sampsize = hyper_grid$sampsize[i])
# Store OOB error for the model
oob_err[i] <- model$err.rate[nrow(model$err.rate), "OOB"]
}
# Identify optimal set of hyperparmeters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i,])
## mtry nodesize sampsize
## 1 2 3 149.8
The best parameters for the model? Two variables considered at each split (mtry = 2), a minimum terminal node size (nodesize) of 3 observations, and a bootstrap sample size of 149.8, i.e. 70% of the 214 rows. Note that sampsize comes out as a non-integer here, which is likely what triggers the matrix warnings in the refit below.
set.seed(12306)
mtry <- hyper_grid[opt_i, "mtry"]
nodesize <- hyper_grid[opt_i, "nodesize"]
sampsize <- hyper_grid[opt_i, "sampsize"]
glass_model_tuned <- randomForest(formula = Type ~ ., data = glass, mtry = mtry, nodesize = nodesize,
sampsize = sampsize)
## Warning in matrix(rfout$nodestatus, ncol = ntree): data length [150299] is
## not a sub-multiple or multiple of the number of rows [301]
## Warning in matrix(rfout$bestvar, ncol = ntree): data length [150299] is not
## a sub-multiple or multiple of the number of rows [301]
## Warning in matrix(rfout$nodepred, ncol = ntree): data length [150299] is
## not a sub-multiple or multiple of the number of rows [301]
## Warning in matrix(rfout$xbestsplit, ncol = ntree): data length [150299] is
## not a sub-multiple or multiple of the number of rows [301]
print(glass_model_tuned)
##
## Call:
## randomForest(formula = Type ~ ., data = glass, mtry = mtry, nodesize = nodesize, sampsize = sampsize)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 21.03%
## Confusion matrix:
## Building_Wind_Processed Building_Window_Non
## Building_Wind_Processed 62 7
## Building_Window_Non 11 62
## Vehicle_Wind_Processed 10 3
## Container 0 3
## Tableware 0 2
## Headlamps 1 3
## Vehicle_Wind_Processed Container Tableware
## Building_Wind_Processed 1 0 0
## Building_Window_Non 1 1 1
## Vehicle_Wind_Processed 4 0 0
## Container 0 9 0
## Tableware 0 0 7
## Headlamps 0 0 0
## Headlamps class.error
## Building_Wind_Processed 0 0.1142857
## Building_Window_Non 0 0.1842105
## Vehicle_Wind_Processed 0 0.7647059
## Container 1 0.3076923
## Tableware 0 0.2222222
## Headlamps 25 0.1379310
Tuning the model actually increased the OOB error rate, from 18.69% to 21.03% - weird! Part of the explanation is that the OOB error is itself a noisy estimate, and the grid search evaluated each hyperparameter combination only once, so the winning combination may simply have looked good by chance on that single fit.
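One way to check whether the difference is real or just noise is to refit both the default and the tuned model over several seeds and compare their OOB error rates. A minimal sketch, rounding sampsize to an integer to avoid the matrix warnings seen above:
seeds <- 1:10
oob_default <- oob_tuned <- numeric(length(seeds))
for (s in seq_along(seeds)) {
  set.seed(seeds[s])
  m_def <- randomForest(Type ~ ., data = glass)
  m_tun <- randomForest(Type ~ ., data = glass, mtry = mtry,
                        nodesize = nodesize, sampsize = round(sampsize))
  oob_default[s] <- m_def$err.rate[nrow(m_def$err.rate), "OOB"]
  oob_tuned[s]   <- m_tun$err.rate[nrow(m_tun$err.rate), "OOB"]
}
summary(oob_default)
summary(oob_tuned)
If the two sets of errors overlap heavily, the grid search's winner was probably just lucky on a single fit.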