Assignment 3: KJ 8.1-8.3; KJ 8.7

8.1)

Recreate the simulated data from Exercise 7.2:

library(mlbench)
library(caret)
library(kableExtra)
library(AppliedPredictiveModeling)
library(dplyr)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

The code for the RF model is provided in the text.

What is the code actually doing? We should dive into the theory. The importance is calculated with the following formula:

\[ ni_j = w_j C_j - w_{left(j)} C_{left(j)} - w_{right(j)} C_{right(j)} \]

\(ni_j\) stands for node importance, \(w_j\) is the weighted number of samples reaching node \(j\), and \(C_j\) is the impurity value of node \(j\). \(left(j)\) is the child node from the left split on node \(j\) and \(right(j)\) is the child node from the right split on node \(j\). Once \(ni_j\) is calculated, the importance of each feature \(i\) on a single decision tree is calculated with formula I. Formula I is then normalized as shown in formula II. The final feature importance is the average over all trees: we sum the normalized importance values and divide by the number of trees \(T\), as shown in formula III.

\[ I) \ fi_i=\frac{\sum_{j \,:\, node \ j \ splits \ on \ feature \ i} ni_j }{ \sum_{k \,\in\, all \ nodes} ni_k}\\ II) \ \parallel fi_i \parallel = \frac{fi_i}{\sum_{all \ features}fi_i} \\ III) \ RFfi_i= \frac{ \sum_{all \ trees}\parallel fi_i \parallel}{T} \]
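To make formulas I-III concrete, here is a tiny hand-worked sketch in R. The node-level values below are hypothetical (they are not extracted from the fitted random forest); the point is only to show how per-node importances roll up into per-feature scores.

library(dplyr)

# Hypothetical per-node importances for two small trees (illustration only)
nodes <- data.frame(
  tree    = c(1, 1, 1, 2, 2),
  feature = c("V1", "V4", "V2", "V1", "V2"),
  ni      = c(10, 6, 4, 8, 7)
)

n_trees <- n_distinct(nodes$tree)   # T in formula III

nodes %>%
  group_by(tree, feature) %>%
  summarise(fi = sum(ni), .groups = "drop_last") %>%  # formula I: sum ni_j over nodes splitting on the feature
  mutate(fi = fi / sum(fi)) %>%                       # formulas I-II: divide by the tree's total, i.e. normalize
  group_by(feature) %>%
  summarise(RFfi = sum(fi) / n_trees)                 # formula III: average over all T trees

The per-feature values sum to one within each tree before averaging, mirroring the normalization in formula II.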

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, 
  importance = TRUE,
 ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

dt <- rfImp1[order(-rfImp1$Overall), , drop = FALSE]  # order by importance, keeping variable names

dt %>% kable(caption="RF Variable Importance") %>% kable_styling()
RF Variable Importance
Variable   Overall
V1          8.7322354
V4          7.6151188
V2          6.4153694
V5          2.0235246
V3          0.7635918
V6          0.1651112
V7         -0.0059617
V10        -0.0749448
V9         -0.0952927
V8         -0.1663626

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

The most important predictor is V1, with an importance score over 8, followed by V4 and V2. Variables V6 through V10 have near-zero (or negative) scores, so they are far less important than V1-V5.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1) %>% kable(caption="Correlation of New Predictor") %>% kable_styling()
Correlation of New Predictor
Correlation
0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated, 
  importance = TRUE,
  ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

dt <- rfImp2[order(-rfImp2$Overall), , drop = FALSE]  # order by importance, keeping variable names

dt %>% kable(caption="RF Variable Importance With New Variable") %>% kable_styling()
RF Variable Importance With New Variable
Variable      Overall
V4             7.0475224
V2             6.0689606
V1             5.6911997
duplicate1     4.2833158
V5             1.8723844
V3             0.6297022
V6             0.1356906
V10            0.0289481
V9             0.0084044
V7            -0.0134564
V8            -0.0437056

After adding the highly correlated variable, V1's importance drops from roughly 8.7 to roughly 5.7, and much of that importance is now shared with duplicate1 (about 4.3). Adding further predictors correlated with V1 would dilute its importance even more, as shown in the sketch below.
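To answer the second half of the question directly, here is a quick sketch (run on a copy of the data so the later models are unaffected): add a second predictor that is also highly correlated with V1 and refit. We would expect V1's importance to be spread even more thinly across V1, duplicate1, and duplicate2.

sim_extra <- simulated
sim_extra$duplicate2 <- sim_extra$V1 + rnorm(200) * .1   # second copy, highly correlated with V1

model2b <- randomForest(y ~ ., data = sim_extra,
                        importance = TRUE,
                        ntree = 1000)
varImp(model2b, scale = FALSE)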

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
model3 <- cforest(y ~ ., data=simulated)

varimp(model3) %>% sort(decreasing = T) %>% kable(caption="Unconditional CForest Model: Variable Importance") %>% kable_styling()
Unconditional CForest Model: Variable Importance
Variable   Importance
V4 7.6223893
V2 6.0579731
duplicate1 5.0941897
V1 4.6171159
V5 1.7161194
V7 0.0465375
V9 0.0046062
V3 0.0003116
V6 -0.0289427
V10 -0.0310326
V8 -0.0380966
varimp(model3, conditional=T) %>% sort(decreasing = T) %>% kable(caption="Conditional CForest Model: Variable Importance") %>% kable_styling()
Conditional CForest Model: Variable Importance
Variable   Importance
V4 5.9438679
V2 4.8497979
duplicate1 1.9172148
V1 1.8822368
V5 1.0565051
V6 0.0234120
V3 0.0214755
V9 0.0080890
V7 -0.0017224
V10 -0.0027082
V8 -0.0113309

The conditional and unconditional cforest importances show broadly the same pattern as the traditional random forest: V4, V2, V1, and V5 dominate while V6-V10 stay near zero. The main difference is in the correlated pair: with conditional = TRUE, the scores for duplicate1 and V1 shrink substantially (from roughly 5.1 and 4.6 down to about 1.9 each). What is going on in the background? When we set conditional to TRUE, the permutation used to compute importance is carried out within a grid defined by the covariates associated with the feature of interest, so a predictor's score is not inflated merely because it is correlated with a truly informative predictor. A covariate is included in that conditioning grid only when the complement of the p-value of its association test exceeds a threshold (an argument of varimp). The resulting score behaves like a partial rather than marginal measure of association, analogous to a beta coefficient in a multiple regression.
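For reference, that conditioning grid is controlled by varimp's threshold argument: a covariate enters the grid when 1 minus the p-value of its association test with the variable of interest exceeds the threshold (0.2 is the package default). A sketch of a stricter setting:

# Condition only on covariates with very strong evidence of association
# (1 - p-value > 0.95); smaller thresholds condition on more covariates.
varimp(model3, conditional = TRUE, threshold = 0.95)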

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Let’s try a gradient boosted tree. Why does this model matter? It is computationally fast and often very accurate. At a high level, boosting starts by predicting the average of the target. We then compute residuals, where a residual is simply the difference between the actual value and the predicted value. A shallow decision tree is fit to these residuals, and each leaf's prediction is the average of the residuals that fall into it (which is how the case of more residuals than leaves is handled). The ensemble prediction is then updated with a shrunken version of the new tree's output, new residuals are computed, and the process repeats until the residuals stop improving or a predefined number of trees is reached.
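The loop below is a minimal sketch of that idea (it is not how gbm() is implemented internally): start from the mean of y, repeatedly fit a shallow rpart tree to the current residuals, and add a shrunken copy of its predictions. It assumes the simulated data frame created above.

library(rpart)

boost_sketch <- function(data, n_trees = 50, shrinkage = 0.1) {
  pred <- rep(mean(data$y), nrow(data))            # step 1: predict the average of the target
  for (i in seq_len(n_trees)) {
    res <- data$y - pred                           # step 2: residuals = actual - predicted
    fit <- rpart(res ~ . - y, data = cbind(data, res = res),
                 maxdepth = 2)                     # step 3: shallow tree fit to the residuals
    pred <- pred + shrinkage * predict(fit, data)  # step 4: shrunken update of the predictions
  }
  pred
}

sqrt(mean((simulated$y - boost_sketch(simulated))^2))  # training RMSE of the sketch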

library(gbm)
model4 <- gbm(y ~ ., data=simulated, distribution='gaussian')

summary(model4) %>% kable(caption="Gradient Boosting Tree") %>% kable_styling()
Gradient Boosting Tree
Variable     rel.inf
V4           29.8959700
V2           24.2027154
V1           15.1180860
duplicate1   12.1496624
V5            9.6480839
V3            8.6586019
V6            0.1812410
V9            0.1456393
V7            0.0000000
V8            0.0000000
V10           0.0000000

The summary returns each variable along with its relative influence on the target. V4 is the most influential, followed by V2; V1 and its correlated copy duplicate1 also carry substantial influence. V6-V10 do not break into the top half of the list. summary() also automatically generates a chart of relative influence.

The next model we want to try is Cubist. Cubist is a rather unique, rule-based variation on trees: each rule's terminal node contains a linear regression model. The predictors available to a node's model depend on the splits made above it, so the selection of predictors at leaf n is shaped by the previous splits. Intermediate nodes along the path also carry linear models, and the final prediction is made with the linear model at the terminal node (adjusted, in Cubist, by smoothing with the parent models along the path).
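To see those rule-level linear models directly, a one-committee Cubist fit keeps the output short enough to read; summary() prints each rule's conditions together with the regression model attached to its terminal node (a quick sketch on the simulated data):

library(Cubist)

cb_demo <- cubist(x = simulated[, names(simulated) != "y"], y = simulated$y, committees = 1)
summary(cb_demo)   # rules (splits) and the linear model attached to each rule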

library(Cubist)
model5 <- cubist(x = simulated[, names(simulated) != "y"], y = simulated$y, committees = 100)

dt3<-as.data.frame(as.matrix(varImp(model5)))

dt3 %>% kable(caption="Cubist Model Regression Variable Importance") %>% kable_styling()
Cubist Model Regression Variable Importance
Variable   Overall
V3 43.5
V1 52.5
V2 59.5
duplicate1 27.5
V4 46.0
V8 4.0
V5 27.0
V6 10.0
V10 1.0
V7 0.0
V9 0.0

V2 is the most important variable, closely followed by V1, with V4 and V3 not far behind. We still do not see variables V6-V10 gaining importance.

8.2)

Use a simulation to show tree bias with different granularities.

We need to understand the purpose of this problem. The literature notes that tree-based models suffer from selection bias: a predictor with many distinct values (fine granularity) offers many more candidate split points than a predictor with only a few distinct values, so it is favored when splits are chosen. This can lead to splits being made on noise, and in extreme cases a noise variable can even claim the root node over a genuinely informative but coarse-grained predictor.

Let's create a dataset where the predictors vary in their number of unique values: X1 is informative but takes only two values, while X2 is pure noise measured on a continuous scale. We then run rpart.

library(rpart)

set.seed(100)
X1 <- rep(1:2,each=100)
X2 <- rnorm(200,mean=0,sd=2)
#X3 <- sample(0:100 / 100, 200, replace = T)
Y <- X1 + rnorm(200,mean=0,sd=4)
sim <- data.frame(Y=Y, X1=X1, X2=X2)

model6 <- rpart(Y ~ ., data = sim)

dt4<-as.data.frame(as.matrix(varImp(model6)))

dt4 %>% kable(caption="Tree Bias") %>% kable_styling()
Tree Bias
Variable   Overall
X1 0.4226645
X2 0.8863903

In our simulation, X1 was constructed to be related to Y while X2 was not, yet X2 receives the higher importance score. This illustrates the tree's selection bias toward predictors with more distinct values (finer granularity) in this small simulated dataset.

8.3)

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

There are some mechanics in the right-hand model that we should point out. First, there is a relationship between the learning rate and how “greedy” the model becomes. Recall that a greedy procedure commits to the locally optimal choice at each iteration. As the learning rate approaches one, each tree's contribution is barely shrunk, so the ensemble commits quickly to the first predictors it finds useful; as a consequence, fewer variables are identified as related to the response and importance concentrates on a handful of predictors. The bagging fraction works in the same direction. As the bagging fraction increases, each tree sees a larger share of the training data, so the trees look more alike and keep choosing the same dominant predictors. If data is the fuel of the model (engine), more of it is consumed by every tree as the bagging fraction increases, which again reduces the number of predictors that appear important.

The literature also notes a relationship between interaction depth and feature importance. As tree depth increases, we expect importance to spread more uniformly across the predictors; visually, this lengthens the bars for the lower-ranked predictors in importance plots like Figure 8.24.
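As a rough way to see these effects on the solubility data themselves, the two extreme settings of Figure 8.24 can be refit directly with gbm. This is only a sketch: it assumes the solTrainXtrans/solTrainY objects from the AppliedPredictiveModeling package and uses a modest number of trees.

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)   # provides solTrainXtrans and solTrainY

sol <- cbind(solTrainXtrans, Solubility = solTrainY)

gbm_low  <- gbm(Solubility ~ ., data = sol, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1,
                shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(Solubility ~ ., data = sol, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1,
                shrinkage = 0.9, bag.fraction = 0.9)

# With shrinkage and bag.fraction at 0.1, relative influence is spread over many
# predictors; at 0.9 it concentrates on a few. Increasing interaction.depth tends
# to spread importance back out across more predictors.
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)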

8.7)

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
library(dplyr)
library(e1071)

data(ChemicalManufacturingProcess)
cd<-ChemicalManufacturingProcess



predictors <- subset(cd,select= -Yield)
yield <- subset(cd,select="Yield")

set.seed(517)
trainingRows <- createDataPartition(yield$Yield, p = 0.8 , list = FALSE)

x_train <- predictors[trainingRows,]
y_train <- yield[trainingRows,]

x_test <- predictors[-trainingRows,]
y_test <- yield[-trainingRows,]

#Pre-process trainPredictors and apply to trainPredictors and testPredictors
pp <- preProcess(x_train,method=c("BoxCox","center","scale","knnImpute"))
transform_x_train <- predict(pp,x_train)
transform_x_test <- predict(pp,x_test)

#Identify and remove NZV
nz_predictors <- nearZeroVar(transform_x_train)
transform_x_train <- transform_x_train[-nz_predictors]
transform_x_test <- transform_x_test[-nz_predictors]

#Identify and remove highly correlated predictors
pred_corr = cor(transform_x_train)
high_corr <- findCorrelation(pred_corr)
transform_x_train <-transform_x_train[, -high_corr]
transform_x_test <- transform_x_test[, -high_corr]

#Set-up trainControl boot method (CV takes way too long)
set.seed(517)
tune_param <- trainControl(method = "boot", number = 25)
#CART
set.seed(614)
rpart_param <- expand.grid(maxdepth = seq(1, 10, by = 1))
rpart_tune <- train(x = transform_x_train, y = y_train, method = "rpart2",
                    metric = "Rsquared", tuneGrid = rpart_param, trControl = tune_param)
#RF
set.seed(614)
rf_param <- expand.grid(mtry = seq(2, 38, by = 3))
rf_tune <- train(x = transform_x_train, y = y_train, method = "rf",
                 tuneGrid = rf_param, metric = "Rsquared",
                 importance = TRUE, trControl = tune_param)
#GBM
set.seed(614)
gbm_param <- expand.grid(interaction.depth = seq(1, 6, by = 1),
                         n.trees = c(25, 50, 100, 200),
                         shrinkage = c(0.01, 0.05, 0.1, 0.2),
                         n.minobsinnode = c(5, 10, 15))
gbm_tune <- train(x = transform_x_train, y = y_train, method = "gbm",
                  metric = "Rsquared", verbose = FALSE,
                  tuneGrid = gbm_param, trControl = tune_param)
#cubist
set.seed(614)
cubist_param <- expand.grid(committees = c(1, 5, 10, 20, 50, 100),
                            neighbors = c(0, 1, 3, 5, 7))
cubist_tune <- train(x = transform_x_train, y = y_train, method = "cubist",
                     verbose = FALSE, metric = "Rsquared",
                     tuneGrid = cubist_param, trControl = tune_param)
rpart_metric<-as.data.frame(round(rpart_tune$results$Rsquared[best(rpart_tune$results, "Rsquared", maximize = TRUE)],2))
colnames(rpart_metric) <- "r_squared"
row.names(rpart_metric) <- "CART"

rf_metric<-as.data.frame(round(rf_tune$results$Rsquared[best(rf_tune$results, "Rsquared", maximize = TRUE)],2))
colnames(rf_metric) <- "r_squared"
row.names(rf_metric) <- "RF"

gbm_metric<-as.data.frame(round(gbm_tune$results$Rsquared[best(gbm_tune$results, "Rsquared", maximize = TRUE)],2))
colnames(gbm_metric) <- "r_squared"
row.names(gbm_metric) <- "GBM"

cubist_metric<-as.data.frame(round(cubist_tune$results$Rsquared[best(cubist_tune$results, "Rsquared", maximize = TRUE)],2))
colnames(cubist_metric) <- "r_squared"
row.names(cubist_metric) <- "Cubist"

library(sqldf)
metrics<-sqldf("
  select r_squared, 'CART' as model from rpart_metric
  union all 
  select r_squared, 'RF' as model from rf_metric
  union all 
  select r_squared, 'GBM' as model from gbm_metric
  union all 
  select r_squared, 'Cubist' as model from cubist_metric
               ")

metrics %>% kable(caption="R Squared Comparisons (Training)") %>% kable_styling()
R Squared Comparisons (Training)
r_squared model
0.36 CART
0.55 RF
0.56 GBM
0.64 Cubist

Our initial performance inspection starts with how much variability each model captures in the training resamples. Comparing the best bootstrap R-squared of each model, Cubist is the highest performer, capturing about 64 percent of the variability, while CART captures the least at about 36 percent. Let's test the performance against the test data.
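Before doing so, a quick aside (a sketch using the tuned objects above): caret's resamples() summarizes the full bootstrap distribution of R-squared for each model, rather than only the single best value reported in the table.

resamp <- resamples(list(CART = rpart_tune, RF = rf_tune,
                         GBM = gbm_tune, Cubist = cubist_tune))
summary(resamp)$statistics$Rsquared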

rpart_pred <- predict(rpart_tune, newdata=transform_x_test)
rpart_res <- data.frame()
rpart_res <- rbind(rpart_res, t(postResample(pred=rpart_pred, obs=y_test)))

rf_pred <- predict(rf_tune, newdata=transform_x_test)
rf_res <- data.frame()
rf_res <- rbind(rf_res, t(postResample(pred=rf_pred, obs=y_test)))

gbm_pred <- predict(gbm_tune, newdata=transform_x_test)
gbm_res <- data.frame()
gbm_res <- rbind(gbm_res, t(postResample(pred=gbm_pred, obs=y_test)))

cubist_pred <- predict(cubist_tune, newdata=transform_x_test)
cubist_res <- data.frame()
cubist_res <- rbind(cubist_res, t(postResample(pred=cubist_pred, obs=y_test)))

test_metric<-sqldf(
  "
  select 'CART' as model, RMSE, Rsquared, MAE from rpart_res
  union all
  select 'RF' as model, RMSE, Rsquared, MAE from rf_res
  union all 
  select 'GBM' as model, RMSE, Rsquared, MAE from gbm_res
  union all 
  select 'Cubist' as model, RMSE, Rsquared, MAE from cubist_res
  ")

test_metric %>% kable(caption="R Squared Comparisons (Test)") %>% kable_styling()
R Squared Comparisons (Test)
model RMSE Rsquared MAE
CART 1.3775897 0.5074503 0.9737224
RF 1.0167800 0.7770856 0.7270947
GBM 0.9902281 0.7272257 0.7668699
Cubist 0.8313616 0.8165322 0.5784957

We also examine model performance on the test data to gauge predictive performance. Against the test set, the Cubist model again has the highest R-squared in addition to the lowest RMSE and MAE. I believe this Cubist model also outperforms the models fit to these data in chapters 6 and 7.

Now we test variable importance for our optimal model.

cubist_imp <- varImp(cubist_tune, scale = FALSE)
#bookTheme()
plot(cubist_imp, top=20, scales = list(y = list(cex = 0.8)))

The Cubist model was selected as our optimal model. ManufacturingProcess32 is the most important feature, followed by ManufacturingProcess09, 33, 17, and 13 rounding out the top five. Compared with the variable importance from the optimal linear and nonlinear models, ManufacturingProcess32 is still within the top five most important predictors.

library(partykit)
plot(as.party(rpart_tune$finalModel),gp=gpar(fontsize=11))

CART is our single-tree model, and we can use partykit to visualize the tree. ManufacturingProcess32 is the root node, and according to the tree, higher values of ManufacturingProcess32 lead to higher predicted yield.

Manufacturing processes 32 and 13 sit at the top of the tree, with higher values of process 32 associated with larger yields and lower values associated with smaller yields. However, a lower value of process 32 may be counteracted by a correspondingly lower value of process 13.