HW_9 Instructions:

Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.

8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
## install.packages("randomForest")
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
rf_model1 <- randomForest(y ~ ., data = simulated,
                          importance = TRUE,
                          ntree = 1000)
rfImp1 <- varImp(rf_model1, scale = FALSE)

rfImp1
##         Overall
## V1   8.86329776
## V2   6.72851763
## V3   0.84145353
## V4   7.60284159
## V5   2.26864193
## V6   0.11268425
## V7   0.07374772
## V8  -0.07210708
## V9  -0.06913906
## V10 -0.10577619

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

Based on the output, the random forest did not make meaningful use of the uninformative predictors V6–V10: their importance scores are all essentially zero (below 0.12, and negative in some cases), while the informative predictors V1, V2, V4, and V5 score far higher.
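
As a quick visual check, the scores can be plotted so the near-zero group stands out (a minimal sketch in base R; imp1 is an illustrative name and the output is not shown here):

# Sketch: dot chart of the rf_model1 importance scores; V6-V10 should cluster near zero
imp1 <- rfImp1[order(rfImp1$Overall), , drop = FALSE]
dotchart(imp1$Overall, labels = rownames(imp1), xlab = "Permutation importance")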

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9356508
rf_model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(rf_model2, scale = FALSE)
rfImp2
##                Overall
## V1          5.78865821
## V2          6.48386619
## V3          0.55469974
## V4          6.78484850
## V5          1.96248183
## V6          0.10126938
## V7          0.14210730
## V8         -0.09726812
## V9         -0.08440763
## V10         0.04878300
## duplicate1  4.64551303

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

After the highly correlated predictor duplicate1 is introduced, the importance score for V1 drops noticeably (from about 8.86 to 5.79), with duplicate1 absorbing much of that importance (4.65). Adding yet another predictor highly correlated with V1 would dilute V1's importance even further, since the trees split the credit among the correlated copies.
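
To answer the last part of the question directly, here is a hedged sketch (not run above, so no output is shown; sim_extra, duplicate2, and rf_model2b are illustrative names). A copy of the data frame is used so the later cforest fit on simulated is unaffected:

# Sketch: add a second predictor highly correlated with V1 and refit;
# V1's importance should shrink again as the credit is split three ways
sim_extra <- simulated
sim_extra$duplicate2 <- sim_extra$V1 + rnorm(200) * .1
rf_model2b <- randomForest(y ~ ., data = sim_extra, importance = TRUE, ntree = 1000)
varImp(rf_model2b, scale = FALSE)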

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
# install.packages("party")
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
cf_model3 <- cforest(y ~ ., data = simulated, control = cforest_unbiased(ntree = 1000))

rfImp3 <- varimp(cf_model3, conditional = TRUE)

rfImp3
##           V1           V2           V3           V4           V5           V6 
##  3.116679316  5.177355616  0.021889615  6.083411529  1.049162844  0.004700388 
##           V7           V8           V9          V10   duplicate1 
##  0.022634172 -0.006653913 -0.006331198 -0.023419483  1.567060882
rfImp4 <- varimp(cf_model3, conditional = FALSE)

rfImp4
##           V1           V2           V3           V4           V5           V6 
##  6.715731666  6.562767795  0.021195652  7.582102576  1.633426540 -0.010168262 
##           V7           V8           V9          V10   duplicate1 
##  0.006876682 -0.045691676  0.019218762 -0.019957444  3.165066162

The scores show a pattern broadly similar to the traditional random forest: V1, V2, V4, and duplicate1 carry most of the importance, while the uninformative predictors V6–V10 stay near zero. As in the random forest fit that included duplicate1, V4 ranks first rather than V1, and the conditional measure shrinks the scores of V1 and duplicate1 further still because it adjusts for their mutual correlation.

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(100)

bt_model5 <- train(simulated[, 1:10],
                   simulated$y,
                   method = "gbm",
                   tuneGrid = gbmGrid,
                   verbose = FALSE)

rfImp5 <- varImp(bt_model5, scale = FALSE)

rfImp5
## gbm variable importance
## 
##     Overall
## V1   4634.0
## V2   4316.6
## V4   4287.1
## V5   1844.2
## V3   1310.7
## V6    416.4
## V7    368.9
## V10   250.6
## V9    246.1
## V8    224.5

With the boosted trees, V1 remains the most important predictor, but V2 moves up to second and V4 drops to third. The uninformative predictors V6–V10 again receive comparatively small importance scores.

library(Cubist)
set.seed(100)
cb_model6 <- train(simulated[, c(1:10)],
                     simulated$y,
                     method = "cubist")

rfImp6 <- varImp(cb_model6, scale = FALSE)

rfImp6
## cubist variable importance
## 
##     Overall
## V1     72.0
## V2     54.5
## V4     49.0
## V3     42.0
## V5     40.0
## V6     11.0
## V10     0.0
## V7      0.0
## V9      0.0
## V8      0.0

The Cubist model gives results similar to the boosted trees: V1, V2, and V4 lead the rankings, and most of the uninformative predictors receive zero importance.

8.2. Use a simulation to show tree bias with different granularities.

set.seed(100)

v1 <- runif(1000, 1, 1000)
v2 <- runif(1000, 1, 100)
v3 <- runif(1000, 1, 500)
y <- v1 + v3

df <- data.frame(v1, v2, v3, y)

head(df)
##          v1        v2       v3        y
## 1 308.45834  8.330781 432.0469 740.5052
## 2 258.41483 12.074769 384.0390 642.4539
## 3 552.77011 62.770456 440.7217 993.4919
## 4  57.32677 67.437098 112.7623 170.0890
## 5 469.08073 37.223527 361.0632 830.1440
## 6 484.28696 19.134957 346.5501 830.8371
sim_Model <- randomForest(y ~ ., data = df, importance = TRUE, ntree = 100)

varImp(sim_Model,scale = TRUE)
##        Overall
## v1 118.6832527
## v2   0.3729799
## v3  71.0517053

Notice that v1 has a higher importance score than v3, even though the response y was built from both, and that v2 (which does not enter y at all) receives essentially no importance. Part of the gap between v1 and v3 is expected, since v1 spans a wider range (1 to 1,000) than v3 (1 to 500) and therefore contributes more of y's variance. The broader point of the exercise is that tree-based models also favor predictors with more distinct values (finer granularity) when choosing splits; the sketch below isolates that effect by comparing a fine-grained and a coarse-grained version of the same signal.
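
A hedged sketch (x_fine, x_coarse, y2, and granTree are illustrative names; the output is not shown here):

# Sketch: give a single regression tree two versions of the same signal,
# one continuous and one rounded to a handful of levels
library(rpart)
set.seed(100)
x_fine <- runif(1000, 0, 1)         # roughly 1,000 distinct values
x_coarse <- round(x_fine * 4) / 4   # the same signal rounded to 5 distinct values
y2 <- x_fine + rnorm(1000, sd = 0.1)
granTree <- rpart(y2 ~ x_fine + x_coarse)
granTree$variable.importance        # the fine-grained copy should dominate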

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

Based on Figure 8.24 in the textbook, the model on the right assigns most of its importance to just a few predictors because both the bagging fraction and the learning rate are set to the high value of 0.9. A large learning rate lets each tree make big corrections, so the strongest predictors dominate early in the process, and a large bagging fraction means every tree sees nearly the same data. In contrast, the model on the left uses 0.1 for both parameters, which slows the learning and injects more randomness, allowing more predictors to contribute and producing a more balanced distribution of importance.

(b) Which model do you think would be more predictive of other samples?

The model on the left, with the lower bagging fraction and learning rate, is likely to perform better on new samples because it learns gradually and reduces the risk of overfitting. The model on the right is more likely to overfit the training data and therefore be less predictive on unseen data.
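
This claim could be checked empirically on the solubility data. The following hedged sketch (not run) assumes the solubility objects shipped with the AppliedPredictiveModeling package; gbmLow and gbmHigh are illustrative names:

# Sketch: compare held-out performance of the two extreme configurations
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)  # provides solTrainXtrans, solTrainY, solTestXtrans, solTestY
solDf <- cbind(solTrainXtrans, Solubility = solTrainY)
gbmLow <- gbm(Solubility ~ ., data = solDf, distribution = "gaussian",
              n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbmHigh <- gbm(Solubility ~ ., data = solDf, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
postResample(predict(gbmLow, solTestXtrans, n.trees = 100), solTestY)
postResample(predict(gbmHigh, solTestXtrans, n.trees = 100), solTestY)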

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

In my view, increasing the interaction depth lets the model capture more complex interactions among predictors. This typically distributes variable importance more evenly across predictors, reducing the dominance of the top variables and flattening the slope of the importance profile for both models.
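
To see the effect concretely, one could refit the high-parameter model from the previous sketch at a larger depth and compare the importance profiles (hedged sketch, not run; gbmHighDeep is an illustrative name and solDf/gbmHigh come from the sketch above):

# Sketch: same 0.9/0.9 settings as gbmHigh, but with deeper trees;
# a flatter importance profile is expected at depth 7
gbmHighDeep <- gbm(Solubility ~ ., data = solDf, distribution = "gaussian",
                   n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                   interaction.depth = 7)
head(summary(gbmHigh, plotit = FALSE), 10)      # depth 1 (gbm default)
head(summary(gbmHighDeep, plotit = FALSE), 10)  # depth 7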

8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
## [1] 176  58
#install.packages("RANN")
library(RANN)
set.seed(100)
preProcValues <- preProcess(ChemicalManufacturingProcess[, -1], method = "knnImpute") # Yield is the first column; exclude the response from imputation
data_imputed <- predict(preProcValues, ChemicalManufacturingProcess)
set.seed(123)

trainIndex <- createDataPartition(data_imputed$Yield, p = 0.75, list = FALSE) # split the imputed data into training (75%) and testing (25%) sets

trainData <- data_imputed[trainIndex, ]
testData <- data_imputed[-trainIndex, ]
(a) Which tree-based regression model gives the optimal resampling and test set performance?
# Bagged tree
library(ipred)
bagged_model <- bagging(Yield ~ ., data = trainData)
bagged_pred <- predict(bagged_model, testData)

# boosted tree
boosted_model <- gbm(Yield ~ .,
                     data = trainData,
                     distribution = "gaussian",
                     n.trees = 1000)
boosted_pred <- predict(boosted_model, testData)
## Using 1000 trees...
# cubist
cubist_model <- train(Yield ~ .,
                      data = trainData,
                      method = "cubist")
cubist_pred <- predict(cubist_model, testData)
postResample(bagged_pred, testData$Yield)
##      RMSE  Rsquared       MAE 
## 0.6752331 0.4908298 0.4840183
postResample(boosted_pred, testData$Yield)
##      RMSE  Rsquared       MAE 
## 0.6086258 0.5957798 0.4475895
postResample(cubist_pred, testData$Yield)
##      RMSE  Rsquared       MAE 
## 0.5608231 0.6632255 0.4408881

Based on these outputs, the Cubist model outperforms the bagged and boosted models: it has the lowest RMSE and MAE, meaning its predictions are closest to the actual values, and the highest R-squared, meaning it explains the most variance.
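
The question also asks about resampling performance. One hedged way to compare all three models on a common footing (not run; ctrl, baggedCV, boostedCV, and cubistCV are illustrative names) is to fit each through caret::train with the same cross-validation setup:

# Sketch: shared 10-fold CV so resampling results are comparable across models
ctrl <- trainControl(method = "cv", number = 10)
set.seed(123)
baggedCV <- train(Yield ~ ., data = trainData, method = "treebag", trControl = ctrl)
set.seed(123)
boostedCV <- train(Yield ~ ., data = trainData, method = "gbm",
                   trControl = ctrl, verbose = FALSE)
set.seed(123)
cubistCV <- train(Yield ~ ., data = trainData, method = "cubist", trControl = ctrl)
summary(resamples(list(bagged = baggedCV, boosted = boostedCV, cubist = cubistCV)))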

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
plot(varImp(cubist_model), 10)
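
The numeric ranking behind the plot can be printed as well (a small sketch; cubistImp is an illustrative name and the output is not shown here):

# Sketch: top 10 Cubist predictors by importance, in descending order
cubistImp <- varImp(cubist_model)$importance
head(cubistImp[order(-cubistImp$Overall), , drop = FALSE], 10)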

Below is the top-10 ranking from the earlier Partial Least Squares model, with each predictor's rank in the Cubist model in parentheses ("x" means it does not appear in the Cubist top 10):

ManufacturingProcess32 (Cubist rank 1)
ManufacturingProcess09 (Cubist rank 10)
ManufacturingProcess17 (Cubist rank 2)
ManufacturingProcess13 (Cubist rank 9)
ManufacturingProcess36 (Cubist rank x)
ManufacturingProcess06 (Cubist rank 4)
BiologicalMaterial02 (Cubist rank 3)
ManufacturingProcess33 (Cubist rank x)
ManufacturingProcess11 (Cubist rank x)
BiologicalMaterial06 (Cubist rank 5)

ManufacturingProcess32 is the top predictor for both the linear model (Partial Least Squares) and the tree-based Cubist model, although the ordering of the remaining predictors differs between the two. The process variables clearly dominate both lists, and the same two biological predictors (BiologicalMaterial02 and BiologicalMaterial06) appear in both top 10s. ManufacturingProcess17 also ranks highly in both models.

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(rpart)
rpartTree <- rpart(Yield ~ ., data = trainData) # fit the single tree on the same training split used above
rpart.plot::rpart.plot(rpartTree)
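
To connect the tree to the distribution of yield in its terminal nodes, one hedged option (not run; node is an illustrative name) is to box-plot Yield by the terminal node each training sample falls in, using rpart's where component:

# Sketch: distribution of Yield within each terminal node of the single tree
node <- factor(rpartTree$where)
boxplot(trainData$Yield ~ node, xlab = "Terminal node", ylab = "Yield")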