Homework 9 - Regression Trees and Rule-Based Models

Name: Charles Ugiagbe.

Date: 11/27/23

library(tidyverse)
library(partykit)
library(party)
library(rpart)
library(mlbench)
library(AppliedPredictiveModeling)
library(randomForest)
library(caret)
library(gbm)
library(Cubist)
library(rpart.plot)

Question 8.1

Recreate the simulated data from Exercise 7.2:

set.seed(222)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Part 8.1A

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

kableExtra::kable(rfImp1)

## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 2 > 1' in coercion to 'logical(1)'

	Overall
V1	6.2731059
V2	7.5697862
V3	0.7202476
V4	12.2518902
V5	1.2145862
V6	-0.0155427
V7	-0.1992948
V8	-0.0606985
V9	-0.0264306
V10	-0.0584667

Question: Did the random forest model significantly use the uninformative predictors (V6 – V10)?

Answer: No, the random forest model does not significantly use the uninformative predictors V6 to V10.

Part 8.7B

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

set.seed(111)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)

## [1] 0.9425556

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

When another highly correlated predictor is added, the important scores for the other variables increase while the importance score for V1 decreased yet more.

model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

kableExtra::kable(rfImp2)

	Overall
V1	3.8090525
V2	7.1975556
V3	0.5529476
V4	11.9729937
V5	1.0865307
V6	-0.0290631
V7	-0.1067976
V8	-0.0684317
V9	-0.1140780
V10	-0.0062604
duplicate1	4.0596497

set.seed(111)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)

## [1] 0.9425556

model3 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)

kableExtra::kable(rfImp3)

	Overall
V1	2.7254180
V2	7.7360969
V3	0.4062591
V4	12.4416632
V5	1.2799369
V6	0.0046985
V7	-0.0481235
V8	-0.0595274
V9	-0.0940122
V10	-0.0356415
duplicate1	2.6684864
duplicate2	3.1529421

The importance score for V1 was reduced with the addition of duplicated1. The addition of another correlated variable to V1 further reduces the importance.

Part 8.1C

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varImp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

model4 <- cforest(y~., data = simulated)

varImp(model4, conditional = FALSE)

##                Overall
## V1          3.47903152
## V2          6.72312059
## V3          0.06836182
## V4         12.45489178
## V5          0.61216864
## V6         -0.02008054
## V7         -0.02639576
## V8         -0.01154652
## V9         -0.10373868
## V10        -0.05339805
## duplicate1  3.18612144
## duplicate2  1.44229700

varImp(model4, conditional = TRUE)

##                 Overall
## V1          1.782259041
## V2          6.340471627
## V3          0.017436678
## V4         11.002228344
## V5          0.485233570
## V6          0.006385456
## V7         -0.036130064
## V8         -0.042626261
## V9         -0.039970254
## V10        -0.001984277
## duplicate1  1.210519592
## duplicate2  0.409864153

The variable importances differ between the conditional and nonconditional parameter. It should be noted that the uninformative features, V6-V10, contribute very little if at all to the model outcome.

Part 8.7D

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted

The pattern of the uniformative features continued as most received and importance factor of about 0. A notable occurrence is that duplicate2 did not receive any weighting in variable importance.

set.seed(111)

model5 <- gbm(y~., data = simulated, distribution = "gaussian")

summary.gbm(model5)

##                   var    rel.inf
## V4                 V4 42.4706128
## V2                 V2 23.1043497
## duplicate1 duplicate1 16.4936039
## V3                 V3  6.3746279
## V1                 V1  5.7895306
## V5                 V5  5.0562346
## V9                 V9  0.2987317
## V10               V10  0.2629378
## V7                 V7  0.1493710
## V6                 V6  0.0000000
## V8                 V8  0.0000000
## duplicate2 duplicate2  0.0000000

Cubist

The cubist model had similar variable importance ranking with the gbm model. The less correlated features were assigned little importance. The variable duplicate2 was also given a value of 0.

model6 <- cubist(x = simulated[,-11], y = simulated$y, committees = 100)

varImp(model6)

##            Overall
## V2            63.0
## V3            45.5
## V1            55.5
## V4            35.5
## duplicate1     8.0
## V7             4.5
## V5            29.0
## V6             1.0
## V8             1.0
## V9             0.0
## V10            0.0
## duplicate2     0.0

Question 8.2

Use a simulation to show tree bias with different granularities.

Solution 8.2

In the case there is a variable that has higher number of unique values, the tree will select that variable over others. There is a higher chance the model will choose the noise variables over the informative variables in the top nodes.

For this simulation, the tree-based model selected the variables that have more distinct values as more important. It also selected the noisy or the variables with the most repetitive values as the top node.

set.seed(917)

a <- sample(1:10 / 10, 500, replace = TRUE)
b <- sample(1:100 / 100, 500, replace = TRUE)
c <- sample(1:1000 / 1000, 500, replace = TRUE)
d <- sample(1:10000 / 10000, 500, replace = TRUE)
e <- sample(1:100000 / 100000, 500, replace = TRUE)

y <- a + b + c + d + e

simData <- data.frame(a,b,c,d,e,y) 

rpartTree <- rpart(y ~ ., data = simData)

plot(as.party(rpartTree), gp = gpar(fontsize = 7))

varImp(rpartTree)

##    Overall
## a 2.122025
## b 3.780325
## c 3.273619
## d 2.981395
## e 3.514504

Question 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
Which model do you think would be more predictive of other samples?
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Solution 8.3

Soln 8.3A

The model on the right focuses its importance on just the first few of predictors and the model on the left spreads importance across more predictors is because to how the tuning parameters, bagging fraction and learning rate are assigned. A lower bagging fraction implies that less data is to be selected for training the subsequent tree whereas a higher bagging fraction would do the opposite. The learning rate affects the model fitting process by taking incremental steps toward a local minimum. The model on the left spreads the importance over more variables because it had a slower learning rate and used less data for subsequent trees which allowed the extraction of more information from other variables. The model on the right had a faster learning rate and used more data, forcing the convergence of the model to focus on a handful of predictors.

Soln 8.3B

I would think that the model on the left to be more predictive of other samples as it is less likely to have overfit to the training set.

Soln 8.3C

An increase in the interaction depth would provide the tree with more iterations to learn from other predictors, lowering the slope of the predictor importance as more predictors will hold importance.

Question 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and Spre-processing steps as before and train several tree-based models:

Which tree-based regression model gives the optimal re-sampling and test set performance?
The manufacturing process variables are the most important for this model. This differs from the previous linear and nonlinear models that had a 50 / 50 split within the top 10 between the biological material and manufacturing process.
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Solution 8.7

data("ChemicalManufacturingProcess")

imputed.knn <- preProcess(ChemicalManufacturingProcess,
           method = "knnImpute",
           k = sqrt(nrow(ChemicalManufacturingProcess))
           )

imputed.data <- predict(imputed.knn, ChemicalManufacturingProcess)

near_zero <- nearZeroVar(imputed.data)

imputed.data <- imputed.data[, -near_zero]

set.seed(917)
train_test_split <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.7, list = FALSE)

train.data <- imputed.data[train_test_split,]
test.data <- imputed.data[-train_test_split,]

Soln 8.7A

Single

set.seed(917)
single <- train(Yield ~ .,
                  data = train.data,
                  method = "rpart",
                  preProc = c("center", "scale"),
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

pred1 <- predict(single, newdata = test.data)

rpart.metrics <- postResample(pred = pred1, obs = test.data$Yield)

rpart.metrics

##      RMSE  Rsquared       MAE 
## 0.7943422 0.3450245 0.5800113

Bagged

set.seed(917)
bagged <- train(Yield ~ .,
                  data = train.data,
                  method = "treebag",
                  preProc = c("center", "scale"),
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))


pred2 <- predict(bagged, newdata = test.data)

bagged.metrics <- postResample(pred = pred2, obs = test.data$Yield)

bagged.metrics

##      RMSE  Rsquared       MAE 
## 0.5557336 0.6426890 0.4099427

Boosted

set.seed(917)
gb <- gbm(Yield~., data = train.data, distribution = "gaussian")


pred3 <- predict(gb, newdata = test.data)

## Using 100 trees...

boosted.metrics <- postResample(pred = pred3, obs = test.data$Yield)

boosted.metrics

##      RMSE  Rsquared       MAE 
## 0.5093102 0.7036780 0.3871666

Random Forest

set.seed(917)
rf <- train(Yield ~ .,
                  data = train.data,
                  method = "rf",
                  preProc = c("center", "scale"),
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))


pred4 <- predict(rf, newdata = test.data)

rf.metrics <- postResample(pred = pred4, obs = test.data$Yield)

rf.metrics

##      RMSE  Rsquared       MAE 
## 0.4879136 0.7743460 0.3955877

Cubist

cubist.model <- train(Yield ~ .,
                  data = train.data,
                  method = "cubist",
                  preProc = c("center", "scale"),
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))


pred5 <- predict(cubist.model, newdata = test.data)

cubist.metrics <- postResample(pred = pred5, obs = test.data$Yield)

cubist.metrics

##      RMSE  Rsquared       MAE 
## 0.4064884 0.8098096 0.3207813

Comparison

The best model across all metrics on the testing set data is the cubist model.

kableExtra::kable(rbind(rpart.metrics,
                        bagged.metrics,
                        boosted.metrics,
                        rf.metrics,
                        cubist.metrics))

	RMSE	Rsquared	MAE
rpart.metrics	0.7943422	0.3450245	0.5800113
bagged.metrics	0.5557336	0.6426890	0.4099427
boosted.metrics	0.5093102	0.7036780	0.3871666
rf.metrics	0.4879136	0.7743460	0.3955877
cubist.metrics	0.4064884	0.8098096	0.3207813

Soln 8.7B

varImp(cubist.model)

## cubist variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     62.79
## ManufacturingProcess13   62.79
## BiologicalMaterial02     45.35
## ManufacturingProcess37   41.86
## BiologicalMaterial03     38.37
## ManufacturingProcess33   31.40
## BiologicalMaterial10     29.07
## ManufacturingProcess39   29.07
## ManufacturingProcess17   27.91
## ManufacturingProcess04   26.74
## ManufacturingProcess09   18.60
## BiologicalMaterial12     17.44
## ManufacturingProcess15   16.28
## ManufacturingProcess02   16.28
## BiologicalMaterial04     15.12
## BiologicalMaterial09     12.79
## ManufacturingProcess06   11.63
## ManufacturingProcess30   11.63
## ManufacturingProcess43   10.47

Soln 8.7C

The single tree plot provides useful information about the data. The plot shows that the most decisive split occurs with the root node ManufacturingProcess32. The model shows that lower yeild is associated with BiologicalMaterial12, ManufacturingProcess18, BiologicalMaterial04, and BiologicalMaterial11. Higher Yields are associated with ManufacturingProcess31, BiologicalMater05, and ManufacturingProcess17.

rpart.model <-  rpart(Yield~., data = train.data)

rpart.plot(rpart.model)