DATA624: Homework 9
library(tidyverse)
library(party)
library(mlbench)
library(randomForest)
library(caret)
library(gbm)
library(rpart)
library(AppliedPredictiveModeling)
library(Cubist)
library(rpart.plot)
Objective
Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
8.1
Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Part A
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

kableExtra::kable(rfImp1)
 | Overall |
---|---|
V1 | 8.7322354 |
V2 | 6.4153694 |
V3 | 0.7635918 |
V4 | 7.6151188 |
V5 | 2.0235246 |
V6 | 0.1651112 |
V7 | -0.0059617 |
V8 | -0.1663626 |
V9 | -0.0952927 |
V10 | -0.0749448 |
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. The random forest model did not make significant use of the uninformative predictors V6 through V10; their importance scores are all near zero or negative.
Part B
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
set.seed(123)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9504983
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

kableExtra::kable(rfImp2)
 | Overall |
---|---|
V1 | 5.1832605 |
V2 | 6.3186107 |
V3 | 0.6972065 |
V4 | 6.9525399 |
V5 | 1.9866611 |
V6 | 0.2338298 |
V7 | -0.0067128 |
V8 | -0.0837440 |
V9 | -0.0209328 |
V10 | -0.0696097 |
duplicate1 | 4.3054797 |
set.seed(123)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9504983
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)

kableExtra::kable(rfImp3)
 | Overall |
---|---|
V1 | 4.1374642 |
V2 | 6.4273743 |
V3 | 0.5204054 |
V4 | 7.2625004 |
V5 | 1.9475777 |
V6 | 0.2383798 |
V7 | 0.0275598 |
V8 | -0.0475072 |
V9 | -0.0186150 |
V10 | -0.0231890 |
duplicate1 | 3.2627027 |
duplicate2 | 2.8812822 |
The importance score for V1 dropped after adding duplicate1, and adding a second highly correlated predictor (duplicate2) reduced it further, since the importance is now split among the correlated variables.
Part C
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varImp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
model4 <- cforest(y ~ ., data = simulated)
varImp(model4, conditional = FALSE)
## Overall
## V1 4.604711928
## V2 5.754812650
## V3 0.063192997
## V4 7.278888213
## V5 1.851751818
## V6 0.044066429
## V7 0.053314917
## V8 0.004184564
## V9 -0.017524552
## V10 -0.024138736
## duplicate1 3.427396047
## duplicate2 1.928271229
varImp(model4, conditional = TRUE)
## Overall
## V1 1.695390053
## V2 4.610713807
## V3 0.047851832
## V4 5.926964145
## V5 1.341996707
## V6 0.023093544
## V7 0.010071784
## V8 -0.007191415
## V9 0.006679198
## V10 -0.011566382
## duplicate1 1.339780765
## duplicate2 0.743524464
The importance scores differ between the conditional and unconditional measures, with the conditional version assigning noticeably less importance to V1 and its duplicates. The overall pattern still resembles the traditional random forest, however: the informative predictors dominate, and the uninformative features V6-V10 contribute very little, if anything, to the model.
Part D
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Boosted
The pattern for the uninformative features continues: most receive a relative influence of essentially 0. Notably, duplicate2 receives no importance weighting at all.
set.seed(123)
model5 <- gbm(y ~ ., data = simulated, distribution = "gaussian")

summary.gbm(model5)
## var rel.inf
## V4 V4 29.767165
## V2 V2 22.645801
## V1 V1 15.792289
## duplicate1 duplicate1 12.558058
## V5 V5 11.557021
## V3 V3 7.254762
## V6 V6 0.424903
## V7 V7 0.000000
## V8 V8 0.000000
## V9 V9 0.000000
## V10 V10 0.000000
## duplicate2 duplicate2 0.000000
Cubist
The Cubist model's variable importance ranking is similar to the gbm model's: the uninformative features are assigned little importance, and duplicate2 is again given a value of 0.
model6 <- cubist(x = simulated[, -11], y = simulated$y, committees = 100)

varImp(model6)
## Overall
## V1 50.0
## V3 46.0
## V2 58.0
## V4 47.5
## V5 30.5
## duplicate1 25.5
## V8 7.5
## V6 3.5
## V7 0.0
## V9 0.0
## V10 0.0
## duplicate2 0.0
8.2
Use a simulation to show tree bias with different granularities.
small  <- sample(0:10, 1000, replace = T)
medium <- sample(0:100, 1000, replace = T)
large  <- sample(0:1000, 1000, replace = T)

y <- small + medium + large + rnorm(200)
var(large)
## [1] 87808.23
sim.data <- data.frame(y, small, medium, large)

sim.tree <- rpart(y ~ ., data = sim.data)
sim.tree.imp <- varImp(sim.tree, scale = FALSE)

kableExtra::kable(sim.tree.imp)
 | Overall |
---|---|
large | 4.2620776 |
medium | 0.3133978 |
small | 0.0219418 |
The decision tree relied mostly on the large variable, which has the greatest variance (the widest range of distinct values). The medium variable was used for some splits, while the small variable, with the smallest variance, was hardly used at all. This illustrates the selection bias of tree models toward predictors with higher granularity, i.e. more distinct values and larger variance.
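As a quick check (a sketch, not part of the assigned exercise), the same simulated data could be fit with a conditional inference forest from the party package, whose unbiased split selection (Strobl et al., 2007) should be less prone to favoring the high-granularity predictor:

set.seed(123)
# Conditional inference forest with unbiased split selection (party package)
sim.cforest <- cforest(y ~ ., data = sim.data,
                       controls = cforest_unbiased(ntree = 100))
varImp(sim.cforest, conditional = FALSE)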
8.3
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Part A
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
The difference comes from how the two tuning parameters, the bagging fraction and the learning rate, are set. The bagging fraction controls how much of the training data is sampled to fit each subsequent tree, and the learning rate controls how large a step each tree takes toward the minimum of the loss. With both values at 0.1 (left-hand plot), each tree sees only a small, varied subsample of the data and contributes only a small increment, so the model learns slowly and extracts information from a wider set of predictors, spreading the importance across them. With both values at 0.9 (right-hand plot), each tree sees nearly all of the data and the model converges quickly, so it concentrates its importance on just a handful of the strongest predictors.
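To make this concrete, here is a hedged sketch (not part of the original submission) that refits boosted trees on the solubility data from the AppliedPredictiveModeling package with the two extreme settings from Fig. 8.24; per the figure, the 0.9/0.9 fit should concentrate its relative influence in far fewer predictors than the 0.1/0.1 fit:

data(solubility)  # provides solTrainXtrans, solTrainY, solTestXtrans, solTestY
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(123)
# Bagging fraction and learning rate both 0.1 (left-hand plot of Fig. 8.24)
gbm.low <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# Bagging fraction and learning rate both 0.9 (right-hand plot of Fig. 8.24)
gbm.high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

# Compare how concentrated the relative influence is in each fit
head(summary(gbm.low, plotit = FALSE), 10)
head(summary(gbm.high, plotit = FALSE), 10)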
Part B
Which model do you think would be more predictive of other samples?
I would expect the model on the left to be more predictive of other samples, since its smaller learning rate and bagging fraction make it less likely to have overfit the training set.
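Under the same assumptions as the sketch above (the gbm.low and gbm.high fits on the solubility data), that intuition could be checked against the held-out test set:

# Test-set performance for the two extreme settings
pred.low <- predict(gbm.low, newdata = solTestXtrans, n.trees = 100)
pred.high <- predict(gbm.high, newdata = solTestXtrans, n.trees = 100)

postResample(pred = pred.low, obs = solTestY)
postResample(pred = pred.high, obs = solTestY)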
Part C
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth lets each tree make more splits, so more predictors have a chance to be used within each tree. This spreads importance across more predictors and therefore lowers (flattens) the slope of the predictor importance plot for either model.
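Again building on the hypothetical gbm.low fit above, a larger interaction.depth could be used to check that importance spreads over more predictors:

set.seed(123)
# Deeper trees allow more predictors to enter each tree
gbm.deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                interaction.depth = 10)
head(summary(gbm.deep, plotit = FALSE), 10)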
8.7
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
data("ChemicalManufacturingProcess")
imputed.knn <- preProcess(ChemicalManufacturingProcess,
                          method = "knnImpute",
                          k = sqrt(nrow(ChemicalManufacturingProcess)))
imputed.data <- predict(imputed.knn, ChemicalManufacturingProcess)

near_zero <- nearZeroVar(imputed.data)
imputed.data <- imputed.data[, -near_zero]

set.seed(115)
train_test_split <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.7, list = FALSE)

train.data <- imputed.data[train_test_split, ]
test.data <- imputed.data[-train_test_split, ]
Part A
Which tree-based regression model gives the optimal resampling and test set performance?
Single
set.seed(123)
single <- train(Yield ~ .,
                data = train.data,
                method = "rpart",
                preProc = c("center", "scale"),
                tuneLength = 10,
                trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
pred1 <- predict(single, newdata = test.data)

rpart.metrics <- postResample(pred = pred1, obs = test.data$Yield)
rpart.metrics
## RMSE Rsquared MAE
## 0.8037771 0.3632390 0.6024717
Bagged
set.seed(123)
bagged <- train(Yield ~ .,
                data = train.data,
                method = "treebag",
                preProc = c("center", "scale"),
                tuneLength = 10,
                trControl = trainControl(method = "cv"))

pred2 <- predict(bagged, newdata = test.data)

bagged.metrics <- postResample(pred = pred2, obs = test.data$Yield)
bagged.metrics
## RMSE Rsquared MAE
## 0.7385620 0.3566364 0.5416379
Boosted
set.seed(123)
gb <- gbm(Yield ~ ., data = train.data, distribution = "gaussian")

pred3 <- predict(gb, newdata = test.data)
## Using 100 trees...
boosted.metrics <- postResample(pred = pred3, obs = test.data$Yield)
boosted.metrics
## RMSE Rsquared MAE
## 0.7108198 0.3972903 0.5259861
Random Forest
set.seed(123)
rf <- train(Yield ~ .,
            data = train.data,
            method = "rf",
            preProc = c("center", "scale"),
            tuneLength = 10,
            trControl = trainControl(method = "cv"))

pred4 <- predict(rf, newdata = test.data)

rf.metrics <- postResample(pred = pred4, obs = test.data$Yield)
rf.metrics
## RMSE Rsquared MAE
## 0.6381682 0.4644105 0.4773300
Cubist
cubist.model <- train(Yield ~ .,
                      data = train.data,
                      method = "cubist",
                      preProc = c("center", "scale"),
                      tuneLength = 10,
                      trControl = trainControl(method = "cv"))

pred5 <- predict(cubist.model, newdata = test.data)

cubist.metrics <- postResample(pred = pred5, obs = test.data$Yield)
cubist.metrics
## RMSE Rsquared MAE
## 0.5267067 0.6506320 0.3761447
Comparison
The Cubist model gives the best performance across all metrics on the test set.
kableExtra::kable(rbind(rpart.metrics,
                        bagged.metrics,
                        boosted.metrics,
                        rf.metrics,
                        cubist.metrics))
 | RMSE | Rsquared | MAE |
---|---|---|---|
rpart.metrics | 0.8037771 | 0.3632390 | 0.6024717 |
bagged.metrics | 0.7385620 | 0.3566364 | 0.5416379 |
boosted.metrics | 0.7108198 | 0.3972903 | 0.5259861 |
rf.metrics | 0.6381682 | 0.4644105 | 0.4773300 |
cubist.metrics | 0.5267067 | 0.6506320 | 0.3761447 |
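Since the rpart, bagged tree, random forest, and Cubist models were all trained through caret with 10-fold cross-validation, their resampling performance could also be compared directly (a sketch; the gbm model was fit outside of caret and is omitted here):

# Cross-validated resampling comparison of the caret-trained models
resamps <- resamples(list(rpart = single,
                          treebag = bagged,
                          rf = rf,
                          cubist = cubist.model))
summary(resamps)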
Part B
The manufacturing process variables dominate the importance list for this model. This differs from the previous linear and nonlinear models, where the top 10 predictors were split roughly 50/50 between biological material and manufacturing process variables.
varImp(cubist.model)
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess17 100.000
## ManufacturingProcess32 86.598
## ManufacturingProcess39 57.732
## ManufacturingProcess09 52.577
## ManufacturingProcess29 34.021
## BiologicalMaterial02 29.897
## ManufacturingProcess13 27.835
## ManufacturingProcess31 21.649
## BiologicalMaterial03 20.619
## BiologicalMaterial06 19.588
## BiologicalMaterial12 14.433
## ManufacturingProcess19 13.402
## ManufacturingProcess01 13.402
## BiologicalMaterial09 12.371
## ManufacturingProcess15 11.340
## ManufacturingProcess33 11.340
## ManufacturingProcess04 10.309
## BiologicalMaterial01 9.278
## BiologicalMaterial04 8.247
## ManufacturingProcess06 7.216
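A quick plot of these scores (a sketch using caret's plot method for varImp output) makes the dominance of the manufacturing process predictors easier to see:

# Visualize the top 20 predictors by importance
plot(varImp(cubist.model), top = 20)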
Part C
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
The single tree plot provides useful information about the data. The most decisive split occurs at the root node, ManufacturingProcess32. Lower yields are associated with splits on BiologicalMaterial12, ManufacturingProcess18, BiologicalMaterial04, and BiologicalMaterial11, while higher yields are associated with ManufacturingProcess31, BiologicalMaterial05, and ManufacturingProcess17.
rpart.model <- rpart(Yield ~ ., data = train.data)

rpart.plot(rpart.model)