Exercise 8.1 Recreate the simulated data from Exercise 7.2:
library(dplyr)
library(tidyr)
library(rpart)
library(ipred)
library(party)
library(randomForest)
library(gbm)
library(Cubist)
library(mlbench)
library(caret)
library(partykit)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
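The importance tables below include a predictor named duplicate1 that the code above does not create; it was presumably added as in Exercise 8.1(b), a noisy copy of V1. A minimal sketch of that step (the 0.1 noise multiplier is an assumption following the book's construction):
# assumed step: add a predictor highly correlated with V1, as in Exercise 8.1(b)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)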
(c) Use the cforest function in the party package to fit a random
forest model using conditional inference trees. The party package
function varimp can calculate predictor importance. The conditional
argument of that function toggles between the traditional importance
measure and the modified version described in Strobl et al. (2007). Do
these importances show the same pattern as the traditional random forest
model?
- Answer: Yes. The informative predictors (V4, V2, V1, duplicate1, and V5) still dominate while V6-V10 remain near zero, and the conditional measure again shrinks all of the scores.
# conditional inference forest; note that partykit is attached after party,
# so cforest() and varimp() here resolve to the partykit versions
cforestModel <- cforest(y ~ ., data = simulated)
cf <- data.frame(Overall = varimp(cforestModel, conditional = TRUE)) %>%
  arrange(desc(Overall))
cf
## Overall
## V4 5.78489850
## V2 5.22819343
## V1 3.33065561
## duplicate1 2.78120557
## V5 1.39002706
## V3 0.04936459
## V6 -0.07029802
## V9 -0.11458990
## V7 -0.15477759
## V10 -0.17347857
## V8 -0.35461203
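For comparison, the traditional (unconditional) permutation importance can be computed from the same fit; a minimal sketch using the default conditional = FALSE:
# traditional (unconditional) permutation importance, for comparison
uncond <- data.frame(Overall = varimp(cforestModel, conditional = FALSE)) %>%
  arrange(desc(Overall))
uncond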
(d) Repeat this process with different tree models, such as boosted
trees and Cubist. Does the same pattern occur?
- Answer: No, the patterns are not identical. In the boosted tree model, V4 has the highest importance score, whereas in the Cubist model V2 ranks highest. In both models, however, V6-V10 remain unimportant relative to the other predictors.
Boosted Trees
# boosted trees
boosted <- gbm(y ~ .,
               data = simulated,
               distribution = "gaussian")
summary.gbm(boosted)
[Figure: relative influence bar chart produced by summary.gbm]
## var rel.inf
## V4 V4 30.3669149
## V1 V1 24.4090151
## V2 V2 21.2680313
## V5 V5 11.4582221
## V3 V3 8.1351552
## duplicate1 duplicate1 4.1515584
## V10 V10 0.2111029
## V6 V6 0.0000000
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
Cubist
# cubist
cubistModel <- train(y ~ .,
                     data = simulated,
                     method = "cubist")
varImp(cubistModel)
## cubist variable importance
##
## Overall
## V2 100.00
## V1 89.52
## V4 80.65
## V3 67.74
## duplicate1 59.68
## V5 50.00
## V6 25.00
## V8 0.00
## V10 0.00
## V9 0.00
## V7 0.00
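To compare the three rankings at a glance, they can be lined up side by side; a quick sketch (plotit = FALSE suppresses gbm's bar chart):
# side-by-side predictor rankings from the three models
imp_cubist <- varImp(cubistModel)$importance
data.frame(cforest = rownames(cf),
           gbm = summary(boosted, plotit = FALSE)$var,
           cubist = rownames(imp_cubist)[order(imp_cubist$Overall,
                                               decreasing = TRUE)])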
Exercise 8.7 Refer to Exercises 6.3 and 7.5 which describe a
chemical manufacturing process. Use the same data imputation, data
splitting, and pre-processing steps as before and train several
tree-based models:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
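To see where those 106 missing values sit before imputing, a quick check:
# count missing values per column, keeping only columns with any NAs
miss <- colSums(is.na(ChemicalManufacturingProcess))
sort(miss[miss > 0], decreasing = TRUE)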
## impute missing values with k-nearest neighbors
## (caret's knnImpute also centers and scales all columns, including Yield)
imputed_df <- preProcess(ChemicalManufacturingProcess, "knnImpute")
imp_df <- predict(imputed_df, ChemicalManufacturingProcess)
sum(is.na(imp_df))
## [1] 0
# split the data 80/20 into training and test sets
set.seed(123)
train_index <- createDataPartition(imp_df$Yield, p = 0.8, list=FALSE)
train_data <- imp_df[train_index, ]
test_data <- imp_df[-train_index, ]
# separate predictors and response
X_train <- train_data[, -1] # Remove Yield column
y_train <- train_data$Yield
X_test <- test_data[, -1]
y_test <- test_data$Yield
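Part (a), training the candidate tree-based models, is not shown in this excerpt; the cubist_model object used in part (b) was presumably fit along these lines (a sketch using caret; the resampling settings are assumptions):
set.seed(123)
# hypothetical reconstruction of the part (a) Cubist fit referenced below
cubist_model <- train(x = X_train,
                      y = y_train,
                      method = "cubist",
                      trControl = trainControl(method = "cv", number = 10))
# test-set performance, for reference
postResample(predict(cubist_model, X_test), y_test)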
(b) Which predictors are most important in the optimal tree-based
regression model? Do either the biological or process variables dominate
the list? How do the top 10 important predictors compare to the top 10
predictors from the optimal linear and nonlinear models?
- Answer: ManufacturingProcess32 is the most important predictor in the model, followed by ManufacturingProcess17 and BiologicalMaterial06. Compared with the previous exercises, ManufacturingProcess32 remains the top predictor, and the lists of important predictors are broadly similar across the linear, nonlinear, and tree-based models.
plot(varImp(cubist_model), top = 10)
[Figure: top 10 predictor importances for the Cubist model]
(c) Plot the optimal single tree with the distribution of yield in
the terminal nodes. Does this view of the data provide additional
knowledge about the biological or process predictors and their
relationship with yield?
- Answer: Yes. The plot shows how the process and biological predictors split the data and how yield is distributed within each terminal node, which makes the predictor-yield relationships easier to see than an importance ranking alone.
set.seed(1234)
treemodel <- train(x = X_train,
                   y = y_train,
                   method = "rpart",
                   trControl = trainControl("cv", number = 10),
                   tuneLength = 10)
finaltreemodel <- treemodel$finalModel
# convert the rpart fit to a party object so the terminal nodes
# are drawn with their yield distributions
tree_party <- as.party(finaltreemodel)
plot(tree_party, main = "Single Tree Plot", gp = gpar(fontsize = 7))
[Figure: single regression tree with yield distributions in the terminal nodes]