Recreate the simulated data from Exercise 7.2:
library(mlbench)
library(caret)
library(kableExtra)
library(AppliedPredictiveModeling)
library(dplyr)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
The code for the RF model has been provided to us through the literature.
What is the code actually doing? We should dive into the theory. The importance calculation starts from the node importance:
\[ ni_j = w_j C_j - w_{left(j)} C_{left(j)} - w_{right(j)} C_{right(j)} \]
Here \(ni_j\) is the importance of node \(j\), \(w_j\) is the weighted number of samples reaching node \(j\), \(C_j\) is the impurity of node \(j\), and \(left(j)\) and \(right(j)\) are the child nodes produced by the left and right splits at node \(j\). Once \(ni_j\) is calculated, the importance of each feature on a single decision tree is given by formula I. Formula I is then normalized, as shown in formula II. The final feature importance is the average over all trees: we sum the normalized importances across the \(T\) trees and divide by \(T\), as shown in formula III.
\[ I) \ fi_i = \frac{\sum_{j:\ node \ j \ splits \ on \ feature \ i} ni_j}{\sum_{k \in all \ nodes} ni_k} \\ II) \ \parallel fi_i \parallel = \frac{fi_i}{\sum_{l \in all \ features} fi_l} \\ III) \ RFfi_i = \frac{\sum_{t \in all \ trees} \parallel fi_{i,t} \parallel}{T} \]
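As a tiny worked illustration of the node-importance formula (the numbers below are made up for illustration, not taken from the fitted model): suppose node j holds 100 weighted samples with impurity 0.40 and is split into a left child with 60 samples and impurity 0.20 and a right child with 40 samples and impurity 0.10.
w_j <- 100; C_j <- 0.40   # weighted samples and impurity at node j
w_l <- 60;  C_l <- 0.20   # left child after the split
w_r <- 40;  C_r <- 0.10   # right child after the split
ni_j <- w_j * C_j - w_l * C_l - w_r * C_r
ni_j  # 40 - 12 - 4 = 24; formulas I-III then sum and normalize these values over nodes and trees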
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
dt <- as.matrix(rfImp1)
dt %>% sort(decreasing = TRUE) %>% kable(caption = "RF Variable Importance") %>% kable_styling()
| Importance |
|---|
| 8.7322354 |
| 7.6151188 |
| 6.4153694 |
| 2.0235246 |
| 0.7635918 |
| 0.1651112 |
| -0.0059617 |
| -0.0749448 |
| -0.0952927 |
| -0.1663626 |
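As an aside, sort() on the importance matrix drops the predictor names, which is why the table above has no row labels. A minimal alternative (not part of the original code) that keeps the names, assuming rfImp1 has the single Overall column that varImp() returns for a randomForest regression fit:
rfImp1[order(rfImp1$Overall, decreasing = TRUE), , drop = FALSE] %>%  # sort rows, keep row names
  kable(caption = "RF Variable Importance") %>%
  kable_styling()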
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
The most important predictor is V1, with an importance score above 8; the remaining informative predictors V2-V5 follow, while the uninformative predictors V6-V10 have scores near zero. The random forest therefore did not make meaningful use of V6-V10.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1) %>% kable(caption = "Correlation of New Predictor") %>% kable_styling()
| Correlation |
|---|
| 0.9460206 |
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
dt <- as.matrix(rfImp2)
dt %>% sort(decreasing = TRUE) %>% kable(caption = "RF Variable Importance With New Variable") %>% kable_styling()
| Importance |
|---|
| 7.0475224 |
| 6.0689606 |
| 5.6911997 |
| 4.2833158 |
| 1.8723844 |
| 0.6297022 |
| 0.1356906 |
| 0.0289481 |
| 0.0084044 |
| -0.0134564 |
| -0.0437056 |
By adding a predictor that is highly correlated with V1, the importance of V1 drops noticeably, since the importance credit is now shared between V1 and duplicate1.
library(party)
model3 <- cforest(y ~ ., data=simulated)
varimp(model3) %>% sort(decreasing = TRUE) %>% kable(caption = "Unconditional CForest Model: Variable Importance") %>% kable_styling()
| Variable | Importance |
|---|---|
| V4 | 7.6223893 |
| V2 | 6.0579731 |
| duplicate1 | 5.0941897 |
| V1 | 4.6171159 |
| V5 | 1.7161194 |
| V7 | 0.0465375 |
| V9 | 0.0046062 |
| V3 | 0.0003116 |
| V6 | -0.0289427 |
| V10 | -0.0310326 |
| V8 | -0.0380966 |
dt1 <-as.data.frame(as.matrix(varimp(model3)))
varimp(model3, conditional = TRUE) %>% sort(decreasing = TRUE) %>% kable(caption = "Conditional CForest Model: Variable Importance") %>% kable_styling()
| Variable | Importance |
|---|---|
| V4 | 5.9438679 |
| V2 | 4.8497979 |
| duplicate1 | 1.9172148 |
| V1 | 1.8822368 |
| V5 | 1.0565051 |
| V6 | 0.0234120 |
| V3 | 0.0214755 |
| V9 | 0.0080890 |
| V7 | -0.0017224 |
| V10 | -0.0027082 |
| V8 | -0.0113309 |
dt2 <-as.data.frame(as.matrix(varImp(model3, conditional=T)))
Switching between the unconditional and conditional versions of cforest does not change the overall picture: the informative predictors still clearly outrank V6-V10. The changes we do see are near the top of the ranking, where the conditional importance shrinks the scores of the correlated pair V1 and duplicate1. What is going on in the background? When we set conditional = TRUE, a feature's importance is computed by permuting it within a grid defined by the covariates associated with that feature, rather than permuting it freely. A covariate enters this conditioning scheme only when the complement of the p-value (1 - p) of its association with the feature of interest exceeds a threshold (the threshold argument, 0.2 by default). The resulting score is analogous to the partial interpretation of beta coefficients in regression models.
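To make the idea concrete, here is a minimal sketch of the difference between unconditional and conditional permutation importance (this is not party's implementation, which conditions on a partition learned from the tree rather than fixed bins), assuming model2 and simulated from above are still in memory. It shuffles duplicate1 either freely or only within bins of its correlated partner V1, and reports the resulting increase in training MSE:
perm_imp <- function(fit, data, var, condition_on = NULL, n_bins = 4) {
  base_mse <- mean((predict(fit, data) - data$y)^2)          # error before permuting
  shuffled <- data
  if (is.null(condition_on)) {
    shuffled[[var]] <- sample(shuffled[[var]])                # unconditional shuffle
  } else {
    bins <- cut(data[[condition_on]], breaks = n_bins)        # shuffle only within strata of the covariate
    for (b in levels(bins)) {
      idx <- which(bins == b)
      if (length(idx) > 1) shuffled[[var]][idx] <- sample(shuffled[[var]][idx])
    }
  }
  mean((predict(fit, shuffled) - shuffled$y)^2) - base_mse    # increase in error = importance
}
perm_imp(model2, simulated, "duplicate1")                       # unconditional importance
perm_imp(model2, simulated, "duplicate1", condition_on = "V1")  # conditional on V1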
Let’s try a gradient boosting tree. Why does this tree matter? It is computationally fast. At a high level, the GB tree starts by taking the average of the target as the initial prediction. We then compute residuals, where a residual is simply the difference between the actual value and the predicted value. A decision tree is built on these residuals, and each leaf node stores a predicted residual; when several residuals land in the same leaf, the leaf's prediction is their average. The predictions are then updated, new residuals are computed, and the process repeats until convergence or until a predefined number of trees is reached. A hand-rolled sketch of that loop for squared-error loss is shown below.
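This sketch uses rpart trees on the simulated data from earlier; it illustrates the boosting idea only and is not the gbm implementation.
library(rpart)
boost_sketch <- function(x, y, n_trees = 100, shrinkage = 0.1, depth = 2) {
  pred  <- rep(mean(y), length(y))            # step 1: start from the average of the target
  trees <- vector("list", n_trees)
  df    <- x
  for (t in seq_len(n_trees)) {
    df$res <- y - pred                        # step 2: residual = actual - current prediction
    trees[[t]] <- rpart(res ~ ., data = df,   # step 3: fit a small tree to the residuals
                        control = rpart.control(maxdepth = depth))
    pred <- pred + shrinkage * predict(trees[[t]], df)  # step 4: take a small step toward the residuals
  }
  list(init = mean(y), trees = trees, shrinkage = shrinkage, fitted = pred)
}
sketch <- boost_sketch(simulated[, setdiff(names(simulated), "y")], simulated$y)
mean((sketch$fitted - simulated$y)^2)         # training MSE shrinks as trees are added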
library(gbm)
model4 <- gbm(y ~ ., data=simulated, distribution='gaussian')
summary(model4) %>% kable(caption = "Gradient Boosting Tree") %>% kable_styling()
| Variable | rel.inf |
|---|---|
| V4 | 29.8959700 |
| V2 | 24.2027154 |
| V1 | 15.1180860 |
| duplicate1 | 12.1496624 |
| V5 | 9.6480839 |
| V3 | 8.6586019 |
| V6 | 0.1812410 |
| V9 | 0.1456393 |
| V7 | 0.0000000 |
| V8 | 0.0000000 |
| V10 | 0.0000000 |
The summary returns each variable along with a measure of its influence on the target. From our GBM fit, V4 is the most influential, and V1 together with its correlated copy duplicate1 also rank highly. V6-V10 do not break into the top half of the list. summary() on a gbm object also generates a chart of relative influence automatically.
The next model we want to try is the Cubist model, a rather unique variation on trees. Each leaf of the tree contains a linear regression model, and the predictors available to the model in leaf n depend on the previous splits along the path to that leaf. The intermediate nodes between the root and the leaves also carry linear models, and the prediction made at a terminal node is smoothed using the models of those intermediate nodes.
library(Cubist)
model5 <- cubist(x=simulated[,-(ncol(simulated)-1)], y=simulated$y, committees=100)
dt3<-as.data.frame(as.matrix(varImp(model5)))
dt3 %>% kable(caption = "Cubist Model Regression Variable Importance") %>% kable_styling()
| Variable | Overall |
|---|---|
| V3 | 43.5 |
| V1 | 52.5 |
| V2 | 59.5 |
| duplicate1 | 27.5 |
| V4 | 46.0 |
| V8 | 4.0 |
| V5 | 27.0 |
| V6 | 10.0 |
| V10 | 1.0 |
| V7 | 0.0 |
| V9 | 0.0 |
V2 is the most important variable closely followed by V1. We still do not see variables V6-V10 increasing in importance.
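As a quick check of the rule-plus-linear-model structure described above, the fitted rules and the regression model attached to each rule can be printed from the Cubist object (output omitted here):
summary(model5)  # prints the rules and the per-rule linear models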
Use a simulation to show tree bias with different granularities.
We need to understand the purpose of this problem. The literature covers the selection bias that single trees exhibit: a predictor with many distinct values (fine granularity) offers many more candidate split points than a predictor with only a few distinct values, so the tree is drawn to it even when it is unrelated to the response. This biases the splits, and in some cases splits are made on noise; in the extreme, a noise variable can end up as the root node, which we would like to avoid.
Let's create a dataset where the predictors vary in their number of unique values and then run rpart.
library(rpart)
set.seed(100)
X1 <- rep(1:2,each=100)
X2 <- rnorm(200,mean=0,sd=2)
#X3 <- sample(0:100 / 100, 200, replace = T)
Y <- X1 + rnorm(200,mean=0,sd=4)
sim <- data.frame(Y=Y, X1=X1, X2=X2)
model6 <- rpart(Y ~ ., data = sim)
dt4<-as.data.frame(as.matrix(varImp(model6)))
dt4 %>% kable(caption = "Tree Bias") %>% kable_styling()
| Variable | Overall |
|---|---|
| X1 | 0.4226645 |
| X2 | 0.8863903 |
In our simulation, X1 was constructed to be related to Y while X2 was pure noise, yet X2 comes out as the more important variable. This illustrates the granularity bias described above: X2 has many distinct values and therefore many candidate split points, so the tree favors it over the two-valued X1.
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
There are some mechanics in the right-hand model that we should point out. First, there is a relationship between the learning rate and how “greedy” the model is. Recall that a greedy model tries to find the optimal step at each iteration. As the learning rate approaches one, each tree takes a larger step along the gradient, so the model commits to the dominant predictors early and no longer identifies as many variables related to the response. The bagging fraction works in the same direction: as it increases, each tree sees more of the training data. If data is the fuel of the model (the engine), more of it is consumed per tree as the bagging fraction increases, so the trees look more alike and importance concentrates on fewer predictors.
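The solubility data are not loaded in this report, so here is a hedged sketch of the comparison on the earlier simulated data instead; bag.fraction and shrinkage are the gbm arguments corresponding to the bagging fraction and learning rate, and n.minobsinnode is lowered so the small bagged samples can still be split.
library(gbm)
set.seed(100)
gbm_low  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)
# Per the text, relative influence should be spread over more predictors in the 0.1/0.1 fit
# and concentrated on a few dominant predictors in the 0.9/0.9 fit.
summary(gbm_low,  plotit = FALSE)
summary(gbm_high, plotit = FALSE)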
The literature also states that there is a relationship between interaction depth and feature importance: as tree depth increases, importance spreads more uniformly across the predictors. Visually, this lengthens the bars of the less important variables in the importance plots of Figure 8.24.
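Extending the same sketch to interaction depth (again on the simulated data, as an illustration rather than a reproduction of Figure 8.24):
gbm_shallow <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 100, interaction.depth = 1)
gbm_deep    <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 100, interaction.depth = 10)
# Per the text, relative influence should spread more evenly across predictors in the deeper trees.
summary(gbm_shallow, plotit = FALSE)
summary(gbm_deep,    plotit = FALSE)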
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
library(dplyr)
library(e1071)
data(ChemicalManufacturingProcess)
cd<-ChemicalManufacturingProcess
predictors <- subset(cd,select= -Yield)
yield <- subset(cd,select="Yield")
set.seed(517)
trainingRows <- createDataPartition(yield$Yield, p = 0.8 , list = FALSE)
x_train <- predictors[trainingRows,]
y_train <- yield[trainingRows,]
x_test <- predictors[-trainingRows,]
y_test <- yield[-trainingRows,]
#Pre-process trainPredictors and apply to trainPredictors and testPredictors
pp <- preProcess(x_train,method=c("BoxCox","center","scale","knnImpute"))
transform_x_train <- predict(pp,x_train)
transform_x_test <- predict(pp,x_test)
#Identify and remove NZV
nz_predictors <- nearZeroVar(transform_x_train)
transform_x_train <- transform_x_train[-nz_predictors]
transform_x_test <- transform_x_test[-nz_predictors]
#Identify and remove highly correlated predictors
pred_corr = cor(transform_x_train)
high_corr <- findCorrelation(pred_corr)
transform_x_train <-transform_x_train[, -high_corr]
transform_x_test <- transform_x_test[, -high_corr]
#Set-up trainControl boot method (CV takes way too long)
set.seed(517)
tune_param <- trainControl(method = "boot", number = 25)
#CART
set.seed(614)
rpart_param <- expand.grid(maxdepth = seq(1, 10, by = 1))
rpart_tune <- train(x = transform_x_train, y = y_train, method = "rpart2",
                    metric = "Rsquared", tuneGrid = rpart_param, trControl = tune_param)
#RF
set.seed(614)
rf_param <- expand.grid(mtry = seq(2, 38, by = 3))
rf_tune <- train(x = transform_x_train, y = y_train, method = "rf",
                 tuneGrid = rf_param, metric = "Rsquared",
                 importance = TRUE, trControl = tune_param)
#GBM
set.seed(614)
gbm_param <- expand.grid(interaction.depth = seq(1, 6, by = 1),
                         n.trees = c(25, 50, 100, 200),
                         shrinkage = c(0.01, 0.05, 0.1, 0.2),
                         n.minobsinnode = c(5, 10, 15))
gbm_tune <- train(x = transform_x_train, y = y_train, method = "gbm",
                  metric = "Rsquared", verbose = FALSE,
                  tuneGrid = gbm_param, trControl = tune_param)
#cubist
set.seed(614)
cubist_param <- expand.grid(committees = c(1, 5, 10, 20, 50, 100),
                            neighbors = c(0, 1, 3, 5, 7))
cubist_tune <- train(x = transform_x_train, y = y_train, method = "cubist",
                     verbose = FALSE, metric = "Rsquared",
                     tuneGrid = cubist_param, trControl = tune_param)
rpart_metric<-as.data.frame(round(rpart_tune$results$Rsquared[best(rpart_tune$results, "Rsquared", maximize = TRUE)],2))
colnames(rpart_metric) <- "r_squared"
row.names(rpart_metric) <- "CART"
rf_metric<-as.data.frame(round(rf_tune$results$Rsquared[best(rf_tune$results, "Rsquared", maximize = TRUE)],2))
colnames(rf_metric) <- "r_squared"
row.names(rf_metric) <- "RF"
gbm_metric<-as.data.frame(round(gbm_tune$results$Rsquared[best(gbm_tune$results, "Rsquared", maximize = TRUE)],2))
colnames(gbm_metric) <- "r_squared"
row.names(gbm_metric) <- "GBM"
cubist_metric<-as.data.frame(round(cubist_tune$results$Rsquared[best(cubist_tune$results, "Rsquared", maximize = TRUE)],2))
colnames(cubist_metric) <- "r_squared"
row.names(cubist_metric) <- "Cubist"
library(sqldf)
metrics<-sqldf("
select r_squared, 'CART' as model from rpart_metric
union all
select r_squared, 'RF' as model from rf_metric
union all
select r_squared, 'GBM' as model from gbm_metric
union all
select r_squared, 'Cubist' as model from cubist_metric
")
metrics %>% kable(caption = "R Squared Comparisons (Training)") %>% kable_styling()
| r_squared | model |
|---|---|
| 0.36 | CART |
| 0.55 | RF |
| 0.56 | GBM |
| 0.64 | Cubist |
Our initial performance inspection starts with how much variability in the response each model captures on the resampled training data. Comparing the R squared values, the Cubist model performs best, capturing roughly 64 percent of the variability, while CART captures the least of our four models. Let's test the performance against the test data.
rpart_pred <- predict(rpart_tune, newdata=transform_x_test)
rpart_res <- data.frame()
rpart_res <- rbind(rpart_res, t(postResample(pred=rpart_pred, obs=y_test)))
rf_pred <- predict(rf_tune, newdata=transform_x_test)
rf_res <- data.frame()
rf_res <- rbind(rf_res, t(postResample(pred=rf_pred, obs=y_test)))
gbm_pred <- predict(gbm_tune, newdata=transform_x_test)
gbm_res <- data.frame()
gbm_res <- rbind(gbm_res, t(postResample(pred=gbm_pred, obs=y_test)))
cubist_pred <- predict(cubist_tune, newdata=transform_x_test)
cubist_res <- data.frame()
cubist_res <- rbind(cubist_res, t(postResample(pred=cubist_pred, obs=y_test)))
test_metric<-sqldf(
"
select 'CART' as model, RMSE, Rsquared, MAE from rpart_res
union all
select 'RF' as model, RMSE, Rsquared, MAE from rf_res
union all
select 'GBM' as model, RMSE, Rsquared, MAE from gbm_res
union all
select 'Cubist' as model, RMSE, Rsquared, MAE from cubist_res
")
test_metric %>% kable(caption = "R Squared Comparisons (Test)") %>% kable_styling()
| model | RMSE | Rsquared | MAE |
|---|---|---|---|
| CART | 1.3775897 | 0.5074503 | 0.9737224 |
| RF | 1.0167800 | 0.7770856 | 0.7270947 |
| GBM | 0.9902281 | 0.7272257 | 0.7668699 |
| Cubist | 0.8313616 | 0.8165322 | 0.5784957 |
We also examine model performance on the test data to gauge each model's predictive performance. Against the test data, the Cubist model still has the highest R squared in addition to the lowest RMSE and MAE. I believe the Cubist model also outperforms the models from Chapters 6 and 7.
Now we test variable importance for our optimal model.
cubist_imp <- varImp(cubist_tune, scale = FALSE)
#bookTheme()
plot(cubist_imp, top=20, scales = list(y = list(cex = 0.8)))
The Cubist model was selected as our optimal model. ManufacturingProcess32 is the feature highlighted as the most important, followed by ManufacturingProcess09, 33, 17, and 13 to round out the top five. Compared with the variable importance from the optimal linear and nonlinear models, ManufacturingProcess32 is still within the top five most important predictors.
library(partykit)
plot(as.party(rpart_tune$finalModel),gp=gpar(fontsize=11))
CART is our optimal single-tree model, and we can use partykit to visualize the tree. ManufacturingProcess32 is the root node; according to the tree, higher values of ManufacturingProcess32 lead to an increased yield.
Manufacturing processes 32 and 13 are at the top of the tree, with higher values of process 32 associated with larger yields and lower values with smaller yields. However, a lower value of process 32 may be counteracted by a correspondingly lower value of process 13.