library(AppliedPredictiveModeling)
library(dplyr)
library(forecast)
library(ggplot2)
library(tidyr)
library(mice)
library(corrplot)
library(MASS)
library(earth)
library(RANN)

Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
order(-rfImp1)
## [1] 1 4 2 5 3 6 7 10 9 8
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
Answer: Based on ‘rfImp1’, the uninformative predictors V6–V10 were not used to any meaningful degree by the random forest model; their importance scores are essentially zero (several are slightly negative). The predictors the model relied on, in decreasing order of importance, are V1, V4, V2, V5, and V3.
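As a quick visual check, a small sketch using the randomForest package’s own plotting helper (assuming the model1 object fit above):
# Sketch: plot the permutation importances from model1; type = 1 selects
# the mean decrease in accuracy measure
varImpPlot(model1, type = 1, main = "Random forest importance (simulated data)")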
Create a copy of the original data and add a predictor that is highly correlated with V1:
simulated2 <- simulated
simulated2$duplicate1 <- simulated2$V1 + rnorm(200) * .1
cor(simulated2$duplicate1, simulated2$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated2, importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
order(-rfImp2)
## [1] 4 2 1 11 5 3 6 10 9 7 8
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
Answer: The importance score for V1 changed after adding the highly correlated predictor, dropping from about 8.73 to 5.69. V1 is no longer the most important predictor; it now ranks third, because the importance is split between V1 and duplicate1. Adding another predictor that is also highly correlated with V1 would be expected to dilute V1’s importance even further.
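The exercise also asks what happens when yet another highly correlated predictor is added. A hedged sketch (the object names duplicate2 and model2b are illustrative, not part of the analysis above):
# Sketch: add a second predictor correlated with V1 and refit the forest;
# the importance of V1 is expected to be diluted further across the copies
set.seed(201)
simulated3 <- simulated2
simulated3$duplicate2 <- simulated3$V1 + rnorm(200) * .1
model2b <- randomForest(y ~ ., data = simulated3, importance = TRUE, ntree = 1000)
varImp(model2b, scale = FALSE)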
library(party)
model3 <- cforest(y ~., data = simulated)
order(-varimp(model3, conditional = FALSE))
## [1] 1 4 2 5 7 3 9 6 10 8
order(-varimp(model3, conditional = TRUE))
## [1] 4 1 2 5 3 6 9 8 10 7
Answer: The cforest importances do not follow exactly the same pattern as the traditional random forest model: the predictors are not ranked in the same order, particularly when conditional = TRUE. The uninformative predictors V6–V10 still receive low importance, however.
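To see the conditional scores with the predictor names attached rather than just their positions, a small sketch using the model3 object above (varimp is stochastic, so the values will vary slightly between calls unless a seed is set):
# Sketch: named, sorted conditional importances from the cforest model
sort(varimp(model3, conditional = TRUE), decreasing = TRUE)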
Cubist
library(Cubist)
simulated_x <- subset(simulated, select = -c(y))
cubist_model <- cubist(x = simulated_x, y = simulated$y, committees = 100)
order(-varImp(cubist_model))
## [1] 1 3 4 2 5 6 7 8 9 10
Boosted Trees
library(gbm)
set.seed(801)
gbm_model <- gbm(y ~., data = simulated, distribution = "gaussian")
summary.gbm(gbm_model)
## var rel.inf
## V4 V4 29.1564185
## V1 V1 26.9686148
## V2 V2 24.2398322
## V5 V5 11.5192020
## V3 V3 7.8231482
## V6 V6 0.2927843
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
Answer: The pattern is not identical across the different tree models. In the Cubist model V1 is the most important predictor, whereas in the boosted tree model V4 is the most important. The uninformative predictors V6–V10, however, are not used to any meaningful degree in any of these models.
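To make the comparison easier to read, a sketch that lines up the rankings from the three models fit above (each column lists predictors from most to least important; assumes rfImp1, cubist_model, and gbm_model are still in the workspace):
# Sketch: side-by-side predictor rankings for the three tree-based models
data.frame(
  randomForest = rownames(rfImp1)[order(-rfImp1$Overall)],
  cubist = rownames(varImp(cubist_model))[order(-varImp(cubist_model)$Overall)],
  gbm = as.character(summary.gbm(gbm_model, plotit = FALSE)$var)
)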
Use a simulation to show tree bias with different granularities.
set.seed(802)
low <- sample(0:10, 500, replace = T)
medium <- sample(0:100, 500, replace = T)
high <- sample(0:1000, 500, replace = T)
y <- low + medium + high + rnorm(500)  # noise vector sized to match the 500 samples
var(low)
## [1] 9.795796
var(medium)
## [1] 810.4978
var(high)
## [1] 81218.55
sim_data <- data.frame(low, medium, high, y)
diff_gran_model <- randomForest(y ~., data = sim_data, importance = TRUE, ntree = 1000)
varImp(diff_gran_model, scale = FALSE)
## Overall
## low -392.6412
## medium 902.1563
## high 139468.8096
Answer: Although all three variables enter y with the same coefficient, the variable with the largest number of distinct values (high) receives by far the highest importance, while the coarsest variable (low) is ranked lowest. The fitted model therefore illustrates the selection bias of tree-based models toward predictors with finer granularity (more distinct values), which here is compounded by the fact that high also has the largest variance and so explains most of the variation in y.
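To isolate the granularity effect from the difference in variances above, a hedged sketch in which the response is pure noise and unrelated to either predictor; an unbiased splitting procedure should pick each predictor about equally often for the first split, but in practice the fine-grained predictor is chosen far more often (all object names here are illustrative):
library(rpart)
# Sketch: count which predictor a one-split tree chooses when y is pure noise
set.seed(803)
first_split <- replicate(200, {
  x_coarse <- sample(0:1, 500, replace = TRUE)   # 2 distinct values
  x_fine <- runif(500)                           # ~500 distinct values
  y_noise <- rnorm(500)                          # unrelated to both predictors
  stump <- rpart(y_noise ~ x_coarse + x_fine,
                 data = data.frame(x_coarse, x_fine, y_noise),
                 control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  as.character(stump$frame$var[1])
})
table(first_split)   # expected: x_fine selected for the root split far more often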
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Answer: The difference between the two plots is driven by the parameter settings. With the bagging fraction and learning rate both at 0.1 (left-hand plot), each tree sees only 10% of the training data and contributes only a small step toward the fit, so different predictors get a chance to be selected across trees and the importance is spread over many predictors. With both parameters at 0.9 (right-hand plot), nearly all of the data is used for every tree and each tree contributes a large step, so the same few dominant predictors are selected repeatedly and the importance is concentrated in the first handful of predictors.
Answer: The model on the left (bagging fraction and shrinkage of 0.1) is more likely to be predictive of other samples. Its importance is spread more evenly across predictors, so it depends less heavily on a small set of predictors and is less prone to overfitting than the right-hand model.
Answer: Increasing the interaction depth would allow each tree to make more splits and therefore involve more predictors. Variable importance would be spread across a larger number of predictors, flattening the slope of the importance plot in both figures.
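A hedged sketch of how this could be checked on the solubility data (solTrainXtrans and solTrainY are the object names provided by the AppliedPredictiveModeling package via data(solubility); the parameter values are illustrative, not the ones used for Figure 8.24):
# Sketch: refit the boosted model with a larger interaction.depth and inspect
# how the relative influence is spread across predictors
library(gbm)
data(solubility)
sol_train <- cbind(solTrainXtrans, Solubility = solTrainY)
set.seed(100)
gbm_deep <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, interaction.depth = 7, shrinkage = 0.1,
                bag.fraction = 0.5)
head(summary(gbm_deep, plotit = FALSE), 10)   # importance spread over more predictors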
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Load data:
data(ChemicalManufacturingProcess)
set.seed(807)
imputed_data <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute","nzv", "corr"))
full_data <- predict(imputed_data, ChemicalManufacturingProcess)
index_chem <- createDataPartition(full_data$Yield , p=.8, list=F)
train_chem <- full_data[index_chem,]
test_chem <- full_data[-index_chem,]
train_predictors <- train_chem[-c(1)]
test_predictors <- test_chem[-c(1)]

Random Forest
set.seed(817)
rf_model <- randomForest(train_predictors, train_chem$Yield, importance = TRUE, ntrees = 1000)
rf_model
##
## Call:
## randomForest(x = train_predictors, y = train_chem$Yield, importance = TRUE, ntrees = 1000)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 15
##
## Mean of squared residuals: 0.3685206
## % Var explained: 62.06
rfPred <- predict(rf_model, newdata = test_predictors)
postResample(pred = rfPred, obs = test_chem$Yield)
## RMSE Rsquared MAE
## 0.6415305 0.6845213 0.4895513
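As a quick diagnostic, a sketch of the predicted versus observed yield on the test set for the random forest (base graphics; values are on the preprocessed scale produced by knnImpute):
# Sketch: predicted vs. observed Yield for the random forest test predictions
plot(test_chem$Yield, rfPred,
     xlab = "Observed Yield", ylab = "Predicted Yield",
     main = "Random forest: test set")
abline(0, 1, lty = 2)   # reference line for perfect predictions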
Boosted Trees
set.seed(827)
gbm_model <- gbm.fit(train_predictors, train_chem$Yield, distribution = "gaussian")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.9705 nan 0.0010 0.0007
## 2 0.9698 nan 0.0010 0.0007
## 3 0.9690 nan 0.0010 0.0007
## 4 0.9683 nan 0.0010 0.0006
## 5 0.9676 nan 0.0010 0.0008
## 6 0.9670 nan 0.0010 0.0006
## 7 0.9663 nan 0.0010 0.0006
## 8 0.9655 nan 0.0010 0.0007
## 9 0.9647 nan 0.0010 0.0007
## 10 0.9639 nan 0.0010 0.0007
## 20 0.9569 nan 0.0010 0.0007
## 40 0.9434 nan 0.0010 0.0005
## 60 0.9298 nan 0.0010 0.0006
## 80 0.9172 nan 0.0010 0.0006
## 100 0.9051 nan 0.0010 0.0005
gbm_model
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 46 predictors of which 7 had non-zero influence.
gbmPred <- predict(gbm_model, newdata = test_predictors, n.trees = 100)  # use all 100 fitted trees
postResample(pred = gbmPred, obs = test_chem$Yield)
## RMSE Rsquared MAE
## 1.0173919 0.4896455 0.7944192
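The gbm.fit call above uses the package defaults (100 trees with a shrinkage of 0.001, as the trace shows), so the boosted model is likely undertrained. A hedged sketch of how it could instead be tuned with caret (the grid values are illustrative, not the settings used above):
# Sketch: tune the boosted tree over a small grid with 10-fold cross-validation
gbm_grid <- expand.grid(n.trees = c(100, 500, 1000),
                        interaction.depth = c(1, 3, 5),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = 10)
set.seed(827)
gbm_tuned <- train(x = train_predictors, y = train_chem$Yield,
                   method = "gbm", tuneGrid = gbm_grid,
                   trControl = trainControl(method = "cv", number = 10),
                   verbose = FALSE)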
Cubist
set.seed(837)
cube_model <- cubist(train_predictors, train_chem$Yield)
cube_model
##
## Call:
## cubist.default(x = train_predictors, y = train_chem$Yield)
##
## Number of samples: 144
## Number of predictors: 46
##
## Number of committees: 1
## Number of rules: 1
cubePred <- predict(cube_model, newdata = test_predictors)
postResample(pred = cubePred, obs = test_chem$Yield)
## RMSE Rsquared MAE
## 0.6267279 0.6609403 0.4886768
Answer: The Cubist model gives the best test-set performance of the three tree-based models, with the lowest RMSE (about 0.63) and an R2 of 0.66.
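For a side-by-side view, a sketch that collects the test-set metrics computed above into one table:
# Sketch: compare the three tree-based models on the held-out test set
rbind(RandomForest = postResample(rfPred, test_chem$Yield),
      BoostedTree = postResample(gbmPred, test_chem$Yield),
      Cubist = postResample(cubePred, test_chem$Yield))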
head(varImp(rf_model), 10)
## Overall
## BiologicalMaterial01 4.0327661
## BiologicalMaterial03 12.0298968
## BiologicalMaterial05 0.2910504
## BiologicalMaterial06 11.2978147
## BiologicalMaterial08 5.5829684
## BiologicalMaterial09 5.3869876
## BiologicalMaterial10 2.3758324
## BiologicalMaterial11 6.4869886
## ManufacturingProcess01 2.7009899
## ManufacturingProcess02 3.3163745
The predictors dominating this list are the ‘BiologicalMaterial’ variables, with 8 of the ten shown; the remaining two are ‘ManufacturingProcess’ variables. These top 10 predictors differ from the top 10 predictors of the optimal linear and nonlinear models, which contained more ‘ManufacturingProcess’ predictors.
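For a ranked, visual view of the same importances (a sketch using the rf_model object above; varImpPlot sorts the scores before plotting):
# Sketch: plot the ten largest random forest importances, sorted
varImpPlot(rf_model, type = 1, n.var = 10, scale = FALSE,
           main = "Random forest importance (chemical manufacturing)")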
library(rpart)
library(rpart.plot)
rpart_tree <- rpart(Yield ~., data = train_chem)
rpart.plot(rpart_tree)

Answer: This single-tree view further reinforces the importance of the process variables. The variables that appear most often in the plot are ‘ManufacturingProcess’ predictors, which suggests that the manufacturing process variables are strong predictors of the Yield response. The plot also gives a clearer view of which split values are associated with a higher yield for a given sample.
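To complement the plot, a small sketch that prints the fitted splits as plain-text rules (rpart.rules is part of the rpart.plot package; cover adds the percentage of training samples covered by each rule):
# Sketch: the rpart splits written out as rules, one row per terminal node
rpart.rules(rpart_tree, cover = TRUE)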