Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
library(mlbench)
library(Cubist)
library(partykit)
library(caret)
library(randomForest)
library(tidyverse)
library(party)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
No. The uninformative predictors V6-V10 were hardly used: their importance scores are far below those of V1-V5.
rfImp1 %>%
  mutate(var = rownames(rfImp1)) %>%
  # order the bars by importance; the stray third aes() argument is dropped
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = 'coral2') +
  labs(title = 'Importance', y = 'Vars')
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
The new predictor, duplicate1, is highly correlated with V1 (correlation of about 0.946).
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
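The observed correlation can be checked against theory. The sketch below assumes that V1 is Uniform(0, 1), as in the Friedman 1 simulation, so its variance is 1/12, and that the added noise has standard deviation 0.1. Under those assumptions the expected correlation is sqrt(Var(V1) / (Var(V1) + 0.1^2)). The seed and the sample size of 100,000 are arbitrary choices for illustration.

```r
# Theoretical correlation between V1 and V1 + noise,
# assuming V1 ~ Uniform(0, 1) and noise ~ Normal(0, 0.1^2)
var_v1    <- 1 / 12   # variance of a Uniform(0, 1) variable
var_noise <- 0.1^2
expected_cor <- sqrt(var_v1 / (var_v1 + var_noise))

# Large-sample check of the same quantity (arbitrary seed and size)
set.seed(1)
v1  <- runif(1e5)
dup <- v1 + rnorm(1e5) * 0.1
simulated_cor <- cor(v1, dup)

round(c(expected = expected_cor, simulated = simulated_cor), 4)
```

The value of roughly 0.945 is close to the 0.946 observed with only 200 samples, so the duplicate behaves as intended.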
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
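After adding the duplicated predictor and refitting, V1 dropped to the third most important variable, behind V4 and V2. This is expected: the trees can split on either V1 or duplicate1, so the importance that V1 had on its own is now shared between the two.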
# refit with the duplicated predictor included
newMod <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
newModImp <- varImp(newMod, scale = FALSE)
newModImp %>%
  mutate(var = rownames(newModImp)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = 'coral2') +
  labs(title = 'Importance', y = 'Vars')
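The second part of the question, adding another predictor that is also highly correlated with V1, can be explored with a minimal sketch like the one below. It rebuilds Friedman 1 style data in base R from the published formula, so the random draws differ from the data used above, and `v1_importance`, `dup1`, and `dup2` are hypothetical names introduced only for this illustration. It requires the randomForest package.

```r
library(randomForest)

set.seed(200)
n <- 200
# Ten Uniform(0, 1) predictors named V1..V10; only V1-V5 enter the response
x <- as.data.frame(matrix(runif(n * 10), ncol = 10))
y <- 10 * sin(pi * x$V1 * x$V2) + 20 * (x$V3 - 0.5)^2 +
  10 * x$V4 + 5 * x$V5 + rnorm(n)

# Hypothetical helper: unscaled permutation importance of V1
v1_importance <- function(predictors) {
  fit <- randomForest(x = predictors, y = y, importance = TRUE, ntree = 500)
  importance(fit, type = 1, scale = FALSE)["V1", 1]
}

imp_none <- v1_importance(x)            # no duplicates
x$dup1 <- x$V1 + rnorm(n) * 0.1
imp_one <- v1_importance(x)             # one correlated copy
x$dup2 <- x$V1 + rnorm(n) * 0.1
imp_two <- v1_importance(x)             # two correlated copies

round(c(none = imp_none, one = imp_one, two = imp_two), 2)
```

If the importance of V1 keeps falling as copies are added, that supports the view that correlated predictors share importance rather than each receiving full credit.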
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
With conditional importance, the scores for all predictors shrink, and the gap between the most and least important predictors narrows.
cforestMod <- cforest(y ~ ., data = simulated)
# conditional = TRUE requests the modified importance of Strobl et al. (2007)
cforestImp <- varimp(cforestMod, conditional = TRUE) %>% as.data.frame()
cforestImp %>%
  rename(Overall = '.') %>%
  mutate(var = rownames(cforestImp)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = 'coral2') +
  labs(title = 'Importance', y = 'Vars')
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
The Cubist model appears to assign importance only to the five most important predictors, which is unlike any of the models I ran previously.
library(Cubist)
# predictors are every column except y
cubeMod <- cubist(simulated[, !names(simulated) %in% 'y'], simulated[, 'y'])
cubeImp <- varImp(cubeMod, scale = FALSE)
cubeImp %>%
  mutate(var = rownames(cubeImp)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = 'coral2') +
  labs(title = 'Importance', y = 'Vars')
Use a simulation to show tree bias with different granularities.
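V3 has the highest importance, V2 is much lower, and V1 has almost none. This follows from the setup: y is the sum of V2 and V3, V3 has a far larger standard deviation (1000 versus 10), and V1 is unrelated to y.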
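set.seed(02180)
V1 <- runif(500, 2, 500)    # unrelated to y
V2 <- rnorm(500, 2, 10)
V3 <- rnorm(500, 1, 1000)
y <- V2 + V3
df <- data.frame(V1, V2, V3, y)
# partykit's cforest() takes ntree directly; party's cforest() does not,
# so the calls are namespaced because both packages are loaded
test_model <- partykit::cforest(y ~ ., data = df, ntree = 10)
test_model_imp <- partykit::varimp(test_model, conditional = FALSE)
barplot(sort(test_model_imp), horiz = TRUE, main = 'Exercise 8.2', col = 'coral')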
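Granularity here refers to the number of distinct values a predictor takes. A minimal sketch of the resulting selection bias, assuming a CART-style tree from rpart and a response that is pure noise: the predictor names `coarse` and `fine`, the seed, and the simulation sizes are all illustrative choices. Since neither predictor carries signal, an unbiased procedure would pick each about equally often at the root.

```r
library(rpart)

set.seed(624)     # arbitrary seed for reproducibility
n_sims <- 200     # number of simulated data sets
n_obs  <- 100     # observations per data set

root_split <- character(n_sims)
for (i in seq_len(n_sims)) {
  sim <- data.frame(
    coarse = rbinom(n_obs, 1, 0.5),  # only 2 distinct values
    fine   = runif(n_obs),           # about 100 distinct values
    y      = rnorm(n_obs)            # unrelated to both predictors
  )
  # Force a single split (cp = 0) and record which predictor was chosen
  stump <- rpart(y ~ coarse + fine, data = sim,
                 control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
  root_split[i] <- if (nrow(stump$frame) > 1) {
    as.character(stump$frame$var[1])
  } else {
    "none"
  }
}

# Share of simulations in which each predictor was used for the root split
split_share <- table(root_split) / n_sims
split_share
```

The fine-grained predictor offers many more candidate split points, so it has more chances to fit the noise. Conditional inference trees are designed to reduce this bias, which is why the exercise contrasts them with CART.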
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
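The model on the right (both parameters set to 0.9) has a higher bagging fraction and a higher learning rate than the model on the left (both set to 0.1). With a high learning rate, each tree contributes a large fraction of its prediction, so the first few trees, which split on the strongest predictors, dominate the ensemble. With a high bagging fraction, every tree sees nearly the same data and tends to choose the same predictors again. Together these concentrate importance in a few variables. With both parameters at 0.1, each tree takes a small step on a different small subsample, so more predictors get a chance to contribute and importance is spread more widely.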
Which model do you think would be more predictive of other samples?
I think the model on the left would be more predictive of other samples. The model with the higher bagging fraction and learning rate is more likely to overfit, because it takes large greedy steps and relies on far fewer predictors.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
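Increasing the interaction depth allows each tree to make more splits, so more predictors are likely to enter the model. Importance would then be shared among more predictors, which flattens the slope of the importance plot, and the additional predictors may also lower the RMSE.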
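The effect of these parameters can be explored with a small sketch. It is not the gbm implementation and it does not use the solubility data: `boost_sketch` is a hypothetical helper that does squared-error boosting with rpart trees on simulated Friedman 1 style data, and it accumulates each tree's split improvement as a rough stand-in for relative influence. The seed, the number of trees, and the data are illustrative assumptions.

```r
library(rpart)

# Hypothetical helper: squared-error stochastic gradient boosting with rpart trees
boost_sketch <- function(x, y, n_trees = 100, learning_rate = 0.1,
                         bag_fraction = 0.5, depth = 1) {
  pred <- rep(mean(y), length(y))                # start from the mean response
  gain <- setNames(numeric(ncol(x)), names(x))   # accumulated split improvement
  ctrl <- rpart.control(maxdepth = depth, cp = 0, xval = 0,
                        maxsurrogate = 0, maxcompete = 0)
  for (m in seq_len(n_trees)) {
    # Bagging fraction: each tree sees only a random subsample
    idx <- sample(nrow(x), size = floor(bag_fraction * nrow(x)))
    dat <- cbind(x[idx, , drop = FALSE], resid_y = (y - pred)[idx])
    tree <- rpart(resid_y ~ ., data = dat, control = ctrl)
    vi <- tree$variable.importance               # NULL when the tree has no split
    if (!is.null(vi)) gain[names(vi)] <- gain[names(vi)] + vi
    # Learning rate: only a fraction of each tree's prediction is added
    pred <- pred + learning_rate * predict(tree, newdata = x)
  }
  list(importance = sort(gain / sum(gain), decreasing = TRUE),
       train_rmse = sqrt(mean((y - pred)^2)))
}

set.seed(83)
n <- 200
x <- as.data.frame(matrix(runif(n * 10), ncol = 10))   # columns V1..V10
y <- 10 * sin(pi * x$V1 * x$V2) + 20 * (x$V3 - 0.5)^2 +
  10 * x$V4 + 5 * x$V5 + rnorm(n)

slow_fit <- boost_sketch(x, y, learning_rate = 0.1, bag_fraction = 0.1)
fast_fit <- boost_sketch(x, y, learning_rate = 0.9, bag_fraction = 0.9)

round(slow_fit$importance, 3)
round(fast_fit$importance, 3)
```

Comparing the two importance vectors shows how concentrated each fit is on this toy data. Rerunning with a larger `depth` gives a rough feel for how interaction depth changes the slope of the importance profile.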
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Which tree-based regression model gives the optimal resampling and test set performance?
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
impute_chem <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute'))
chm <- predict(impute_chem, ChemicalManufacturingProcess[,-c(1)])
set.seed(02180)
train_dp <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X_train <- chm[train_dp,]
y_train <- ChemicalManufacturingProcess$Yield[train_dp]
X_test <- chm[-train_dp,]
y_test <- ChemicalManufacturingProcess$Yield[-train_dp]
library(rpart)
set.seed(02180)
model_single_tree <- train(x= X_train, y= y_train, method="rpart", tuneLength=10, control= rpart.control(maxdepth=2))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
model_single_tree
## CART
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01160228 1.511349 0.3745372 1.174852
## 0.01279021 1.511349 0.3745372 1.174852
## 0.01932425 1.511349 0.3745372 1.174852
## 0.02141428 1.511349 0.3745372 1.174852
## 0.02407785 1.511349 0.3745372 1.174852
## 0.03617889 1.511349 0.3745372 1.174852
## 0.04802293 1.510092 0.3742895 1.176626
## 0.07531651 1.513436 0.3686719 1.181598
## 0.07791213 1.515951 0.3666780 1.184320
## 0.37992201 1.701083 0.3362906 1.364467
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.04802293.
single_pred <- predict(model_single_tree, X_test)
postResample(single_pred, y_test)
## RMSE Rsquared MAE
## 1.3785655 0.4955932 1.1273024
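For reference, the three statistics above can be reproduced by hand. The sketch below uses a hypothetical helper, `regression_metrics`, and a made-up four-point example. It assumes, as I understand caret's default behavior, that the reported R-squared is the squared correlation between predictions and observations.

```r
# Hypothetical helper mirroring the three statistics reported by postResample()
regression_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),   # root mean squared error
    Rsquared = cor(pred, obs)^2,             # squared correlation (assumed definition)
    MAE      = mean(abs(pred - obs)))        # mean absolute error
}

# Made-up values purely for illustration
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2, 2.5, 4.5)
toy_metrics <- regression_metrics(pred, obs)
toy_metrics
```

Calling `regression_metrics(single_pred, y_test)` should give values close to the postResample output, which makes it easier to see that RMSE penalizes large errors more heavily than MAE does.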
varImp(model_single_tree, scale = F)
## rpart variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess31 0.6694
## ManufacturingProcess13 0.5428
## BiologicalMaterial06 0.4717
## ManufacturingProcess32 0.3799
## ManufacturingProcess09 0.2616
## ManufacturingProcess06 0.2342
## BiologicalMaterial11 0.2269
## ManufacturingProcess36 0.2219
## BiologicalMaterial12 0.2210
## ManufacturingProcess30 0.2124
## ManufacturingProcess11 0.1913
## ManufacturingProcess33 0.0000
## ManufacturingProcess16 0.0000
## ManufacturingProcess08 0.0000
## ManufacturingProcess44 0.0000
## ManufacturingProcess14 0.0000
## ManufacturingProcess20 0.0000
## ManufacturingProcess07 0.0000
## BiologicalMaterial04 0.0000
## ManufacturingProcess22 0.0000
set.seed(02180)
rt_grid <- expand.grid(mtry=seq(2,38,by=2))
model_r_forest <- train(X_train, y_train, method = "rf", tuneGrid = rt_grid, metric = "Rsquared", importance = TRUE,
trControl = trainControl(method = "boot", number = 25))
forest_pred <- predict(model_r_forest, X_test)
postResample(forest_pred, y_test)
## RMSE Rsquared MAE
## 1.0302616 0.7594758 0.8511904
model_bagged <- ipred::ipredbagg(y_train, X_train)
bagPred <- predict(model_bagged, X_test)
postResample(bagPred, y_test)
## RMSE Rsquared MAE
## 1.0529113 0.7277518 0.8344543
varImp(model_r_forest, scale = F)
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 16.946
## ManufacturingProcess13 10.812
## BiologicalMaterial06 9.293
## ManufacturingProcess31 8.555
## BiologicalMaterial03 8.093
## BiologicalMaterial12 7.704
## ManufacturingProcess39 7.475
## ManufacturingProcess17 7.310
## ManufacturingProcess33 7.072
## BiologicalMaterial04 6.843
## BiologicalMaterial02 6.834
## ManufacturingProcess09 5.781
## ManufacturingProcess28 5.539
## BiologicalMaterial11 5.488
## BiologicalMaterial09 5.282
## BiologicalMaterial08 5.201
## ManufacturingProcess36 5.075
## BiologicalMaterial01 4.908
## ManufacturingProcess20 4.851
## ManufacturingProcess02 4.734
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
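Of the models I ran, the random forest performed best on the test set (RMSE 1.03, R-squared 0.76), with bagged trees a close second (RMSE 1.05, R-squared 0.73) and the single tree well behind (RMSE 1.38, R-squared 0.50). In the random forest, ManufacturingProcess32 is clearly the most important predictor (16.9 versus 10.8 for ManufacturingProcess13). Beyond that, the manufacturing and biological predictors are fairly evenly mixed: the top 10 contains six process and four biological predictors, so neither group dominates.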
forest_imp <- varImp(model_r_forest, scale = F)
plot(forest_imp, top = 20, scales = list(y = list(cex = 0.5)))
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
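The tree first splits on ManufacturingProcess32 at 159.5, the split that gives the largest reduction in deviance. Batches below that value have a mean yield of about 39.3, and batches at or above it have a mean yield of about 41.6. The lower branch then splits on BiologicalMaterial11 (at 145.075) and the upper branch on ManufacturingProcess13 (at 33.65), giving terminal-node mean yields between 38.6 and 42.7. This suggests that ManufacturingProcess32 is the most influential predictor in the chemical manufacturing process, with higher values associated with higher yield. A deeper tree could reveal further relationships.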
model_single_tree$finalModel
## n= 144
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 144 484.57250 40.18153
## 2) ManufacturingProcess32< 159.5 88 160.84460 39.27955
## 4) BiologicalMaterial11< 145.075 39 45.22569 38.55769 *
## 5) BiologicalMaterial11>=145.075 49 79.12258 39.85408 *
## 3) ManufacturingProcess32>=159.5 56 139.62810 41.59893
## 6) ManufacturingProcess13>=33.65 37 77.07199 41.01054 *
## 7) ManufacturingProcess13< 33.65 19 24.80207 42.74474 *
rpart.plot::rpart.plot(model_single_tree$finalModel, box.palette = "darkseagreen", shadow.col = "coral1", nn = TRUE)
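The printed tree shows only the mean yield in each terminal node. One way to look at the full distribution per node is sketched below. It uses the built-in mtcars data as a stand-in so that it runs on its own; `toy_tree` and `leaf_id` are illustrative names. It relies on the `where` component of an rpart fit, which records the row of `frame` holding each observation's terminal node.

```r
library(rpart)

# Stand-in example: mtcars replaces the chemical manufacturing data
toy_tree <- rpart(mpg ~ ., data = mtcars,
                  control = rpart.control(maxdepth = 2))

# For each observation, the row of toy_tree$frame holding its terminal node
leaf_id <- toy_tree$where

# Per-node mean response; these match the yval column of the fitted tree
leaf_means <- tapply(mtcars$mpg, leaf_id, mean)

# Distribution of the response within each terminal node
boxplot(mtcars$mpg ~ leaf_id,
        xlab = "Terminal node (row of tree frame)", ylab = "mpg")
```

With the objects from this report, the analogous call would presumably be `boxplot(y_train ~ model_single_tree$finalModel$where)`, assuming the final model keeps the training rows in their original order. Overlap between the boxes indicates how cleanly the splits separate yield.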