library(caret)
library(ggplot2)
library(fpp3)
library(randomForest)
library(party)
library(gbm)
library(ipred)
library(Cubist)
library(AppliedPredictiveModeling)
library(partykit)
library(rpart)
library(rpart.plot)
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.3.3
set.seed(249)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
a.)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = TRUE)
arrange(rfImp1, desc(Overall))
## Overall
## V4 67.4362233
## V2 52.1992635
## V1 38.0234115
## V5 25.3295901
## V3 22.2779311
## V10 2.4306944
## V9 2.2605172
## V8 0.5157776
## V7 -0.9535470
## V6 -1.6744404
The model didn’t find any importance in the uninformative predictors (v6 - v10)
b.)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9452518
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = TRUE)
rfImp2
## Overall
## V1 24.4087149
## V2 51.3866512
## V3 19.1343015
## V4 61.1884308
## V5 24.3731237
## V6 1.5022630
## V7 -0.6332609
## V8 0.6493199
## V9 1.3207666
## V10 1.5215275
## duplicate1 22.6552450
After adding a duplicate predictor, the importance of V1 has gone down significantly and its importance is about equal to the duplicate1 variable
C.)
model3 <- cforest(y ~ ., data = simulated[,c(1:11)])
rfImp3 <- varimp(model3, conditional = TRUE)
as.data.frame(rfImp3) %>% arrange(desc(rfImp3))
## rfImp3
## V4 12.65531889
## V2 7.91154037
## V1 4.05076021
## V5 2.35201831
## V3 0.81214493
## V9 0.31342247
## V10 -0.04943123
## V8 -0.23900633
## V7 -0.27175848
## V6 -0.45326208
rfImp1 %>% arrange(desc(Overall))
## Overall
## V4 67.4362233
## V2 52.1992635
## V1 38.0234115
## V5 25.3295901
## V3 22.2779311
## V10 2.4306944
## V9 2.2605172
## V8 0.5157776
## V7 -0.9535470
## V6 -1.6744404
The importance between the two models are the same order for those that carry any weight in prediction. In the new model though, V4 (the most important predictor) has twice the importance of V2 (the second most important), while the first model contains much more equal importance between the two.
d.)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1), n.minobsinnode = 10)
set.seed(400)
gbmTune <- train(simulated[,c(1:10)], simulated$y,
method = "gbm",
tuneGrid = gbmGrid,
verbose= FALSE)
rfImp4 <- varImp(gbmTune, conditional= TRUE)
rfImp4
## gbm variable importance
##
## Overall
## V4 100.0000
## V2 66.9868
## V1 43.0693
## V5 32.1570
## V3 25.9062
## V9 3.5688
## V10 1.4810
## V7 0.9635
## V6 0.4489
## V8 0.0000
cubistMod <- Cubist::cubist(simulated[,c(1:10)], simulated$y)
varImp5 <- varImp(cubistMod, conditional = TRUE)
varImp5
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## V3 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
The important predictors remain about the same for the boosted tree model. The cubist model is different as it only uses v1, v2, v4 and v5 for prediction and all of them hold the same importance to the model.V3 which holds importance in other models has 0 importance for the cubist model.
We can simulate this by creating predictors with the same range, but various amount of distinct values. We expect those with higher amounts of distinct values to have higher importance.
set.seed(124)
p1 <- sample(1:100 / 10, 500, replace= TRUE)
p2 <- sample(1: 1000 / 100, 500, replace = TRUE)
p3 <- sample(1:10000/ 1000, 500, replace = TRUE)
p4 <- sample(1:100000 / 10000, 500, replace = TRUE)
y <- p1 + p2 + p3 + p4
bias_data <- data.frame(p1, p2, p3, p4, y)
tree <- train(y ~ .,data = bias_data, method = "rpart")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
varImp(tree)
## rpart variable importance
##
## Overall
## p4 100.00
## p2 91.35
## p1 58.10
## p3 0.00
Although the model should assign equal importance to all predictors, the predictor (p4) with the largest amount of unique values has the highest importance.
a.)
Why does the the model on the right focus on just the first few predictors while the model on the left spreads importance accross more predictors?
The shrinkage parameter is the learning rate of th model, with 0.1 being a slower learning rate and 0.9 being a faster learning rate. If a model is learning slower then each additional fitted tree has a smaller impact. This means that many small steps are taken and the chance of a model being over fit is lower. The bagging fraction is the fraction of training set observations that are selected for the next tree. Knowing this, we can say that the model with the 0.9 bagging rate and 0.9 shrinkage is attributing more importance to lots of predictors because the model is being over fit to the training data. The model with the 0.1 bagging and shrinkage is instead doing a better job of generalizing the training data to other data and is selecting a smaller portion of predictors that are actually important for doing this.
b.)
The model with the 0.1 shrinkage and 0.1 bagging fraction would be more predictive of other samples
c.)
Interaction depth is the number of splits that are performed on a tree, starting from a single node. If more splits are happening for a tree, this means that decisions are being made using more predictors, increasing their importance. This means that increasing the interaction depth would increase the slope of predictor importance.
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
set.seed(1953)
process <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
imputed_chemical <- predict(process,ChemicalManufacturingProcess)
split <- createDataPartition(imputed_chemical$Yield, p = 0.8, list = FALSE)
y <- imputed_chemical[,1]
x <- imputed_chemical[,-1]
train_x <- x[split,]
train_y <- y[split]
test_x <- x[-split,]
test_y <- y[-split]
set.seed(250335)
tree_model <- train(train_x, train_y, method = "rpart",
tuneLength = 10,
trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
a.)
Tree Model:
tree_predict <- predict(tree_model, test_x)
tree <- postResample(tree_predict, test_y)
tree
## RMSE Rsquared MAE
## 1.3667648 0.5945126 1.1022925
Bagged Trees:
train <- train_x
train$y <- train_y
bagged_model <- bagging(y ~., data = train)
bagged_predict <- predict(bagged_model, test_x)
bagged <- postResample(bagged_predict, test_y)
bagged
## RMSE Rsquared MAE
## 1.3415432 0.6167599 1.0419778
Random forest:
rfModel <- randomForest(train_x, train_y)
rfPredict <- predict(rfModel, test_x)
random_forest <- postResample(rfPredict, test_y)
random_forest
## RMSE Rsquared MAE
## 1.2938670 0.6694588 1.0073341
Gradient Boost:
gbmTune <- train(train_x , train_y, method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
gbmPredict <- predict(gbmTune,test_x)
gradient_boost <- postResample(gbmPredict, test_y)
gradient_boost
## RMSE Rsquared MAE
## 1.2443838 0.6549174 0.9349364
Cubist:
cubistModel <- cubist(train_x, train_y)
cube_predict <-predict(cubistModel, test_x)
cubist <- postResample(cube_predict, test_y)
cubist
## RMSE Rsquared MAE
## 1.5470180 0.4544537 1.1933483
model_fits <- rbind(tree = tree, bagged = bagged, random_forest = random_forest, gradient_boost = gradient_boost, cubist = cubist)
model_fits <- data.frame(model_fits)
model_fits %>% arrange(RMSE)
## RMSE Rsquared MAE
## gradient_boost 1.244384 0.6549174 0.9349364
## random_forest 1.293867 0.6694588 1.0073341
## bagged 1.341543 0.6167599 1.0419778
## tree 1.366765 0.5945126 1.1022925
## cubist 1.547018 0.4544537 1.1933483
The optimal model would be the gradient boost, based off of the RMSE and Rsquared values.
varImp(gbmTune)
## gbm variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## BiologicalMaterial12 32.994
## ManufacturingProcess09 21.923
## ManufacturingProcess31 20.611
## ManufacturingProcess17 18.408
## ManufacturingProcess06 15.963
## ManufacturingProcess13 14.910
## BiologicalMaterial03 11.509
## ManufacturingProcess24 9.017
## BiologicalMaterial11 8.612
## BiologicalMaterial09 8.196
## BiologicalMaterial08 7.735
## ManufacturingProcess27 7.115
## ManufacturingProcess05 6.084
## BiologicalMaterial05 5.626
## ManufacturingProcess25 5.340
## ManufacturingProcess43 5.179
## ManufacturingProcess01 4.680
## ManufacturingProcess04 4.395
## ManufacturingProcess15 4.229
Manufacturing process 32, biological material 12, manufacturing process 9 and manufacturing process 17 are the most important processes. Manufacturing processes dominate the list. The most important predictors chosen are similar to that of the linear and nonlinear models. The difference is that the gradient boost model gives far more importance to the manufacturing process 32 relative to the other predictors.
tree_model$finalModel
## n= 144
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 144 454.8814000 40.13271
## 2) ManufacturingProcess32< 159.5 85 152.4430000 39.19294
## 4) BiologicalMaterial12< 19.735 33 37.7083900 38.30939
## 8) BiologicalMaterial05>=19.6 8 6.4923870 37.13375 *
## 9) BiologicalMaterial05< 19.6 25 16.6206200 38.68560
## 18) BiologicalMaterial09>=12.985 7 0.6794857 37.84857 *
## 19) BiologicalMaterial09< 12.985 18 9.1295780 39.01111 *
## 5) BiologicalMaterial12>=19.735 52 72.6242100 39.75365
## 10) ManufacturingProcess25>=4831.5 36 23.4217200 39.24722 *
## 11) ManufacturingProcess25< 4831.5 16 19.1951400 40.89312 *
## 3) ManufacturingProcess32>=159.5 59 119.2197000 41.48661
## 6) ManufacturingProcess09< 44.93 10 18.6880400 39.97600 *
## 7) ManufacturingProcess09>=44.93 49 73.0552200 41.79490
## 14) ManufacturingProcess06< 208.1 25 35.0710000 41.26000
## 28) ManufacturingProcess04< 933.5 18 20.1008300 40.86611 *
## 29) ManufacturingProcess04>=933.5 7 4.9963430 42.27286 *
## 15) ManufacturingProcess06>=208.1 24 23.3804000 42.35208
## 30) ManufacturingProcess17>=33.7 13 8.4249230 41.79462 *
## 31) ManufacturingProcess17< 33.7 11 6.1408910 43.01091 *
windows.options(width = 20, height = 20)
rpart.plot(tree_model$finalModel)
Based on this decision tree, we can see that it does potentially provide additional knowledge about the predictors and their relationship with yield. The first decision in our tree is based off of manufacturing process 32. The tree shows that when manufacturing process 32 has a value higher than 160, the yield tends to be higher as well. Overall, the highest yields tend to result from high values of manufacturing process 32, 09, 06, and 17.