library(mlbench)
set.seed(200)
# Simulate the Friedman 1 benchmark data (n = 200, sd = 1)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
No; the model focused on the predictors V1-V5, and the uninformative predictors V6-V10 received negligible importance.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.83890885
## V2 6.49023056
## V3 0.67583163
## V4 7.58822553
## V5 2.27426009
## V6 0.17436781
## V7 0.15136583
## V8 -0.03078937
## V9 -0.02989832
## V10 -0.08529218
The importances show a somewhat similar pattern, with V1-V5 again the most important, but this model assigns V3 noticeably less importance. It seems to emphasize the very most important variables and de-emphasize the less important ones, yielding a simpler model.
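The importance tables that follow include a predictor called duplicate1 that is highly correlated with V1, but the code that created it does not appear in this section. A minimal sketch of how such a predictor is typically added is shown below; the noise scale of 0.1 is an assumption, not the original code.
# Assumed reconstruction (not the original code): add a predictor that is
# highly correlated with V1 by adding a small amount of Gaussian noise.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)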
library(party)
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
##
## Attaching package: 'modeltools'
## The following object is masked from 'package:fabletools':
##
## refit
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
##
## index
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
##
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
##
## boundary
##
## Attaching package: 'party'
## The following object is masked from 'package:fabletools':
##
## response
model1 <- cforest(y ~ ., data = simulated)
rfImp1 <- varImp(model1)
rfImp1
## Overall
## V1 6.571632689
## V2 6.110131511
## V3 0.010450332
## V4 7.485796796
## V5 1.889117623
## V6 -0.006751589
## V7 0.007081381
## V8 -0.036022685
## V9 0.008976088
## V10 0.005356254
## duplicate1 2.722380892
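party also provides a conditional permutation importance that adjusts for correlated predictors such as duplicate1. A minimal sketch (not run for the results shown above, and it can be slow on larger forests):
# Conditional permutation importance; conditional = TRUE accounts for
# correlation among predictors (e.g. V1 and duplicate1).
cfImp_cond <- party::varimp(model1, conditional = TRUE)
sort(cfImp_cond, decreasing = TRUE)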
They follow a similar pattern, identifying the same variables as the most impactful, but Cubist assigned the same value of 50 to the top four variables rather than ranking them.
library(gbm)
## Loaded gbm 2.1.8
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmModel)
## var rel.inf
## V4 V4 31.2071181
## V2 V2 22.4595980
## V1 V1 14.1867814
## duplicate1 duplicate1 12.7266608
## V5 V5 10.5518956
## V3 V3 8.2074258
## V6 V6 0.5274133
## V7 V7 0.1331071
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
library(Cubist)
cube <- cubist(simulated[, -11], simulated[, 11])
c_imp <- varImp(cube)
c_imp
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## V3 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
## duplicate1 0
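Cubist's importance is based on how often a predictor is used in rule conditions and in the model equations; a tie at 50 typically arises when predictors appear in the linear models but not in any rule conditions. To see the underlying rules directly, the fitted model can be inspected (a quick sketch):
# Inspect the Cubist rules and linear models behind the usage-based importance.
summary(cube)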
V3-V5 are the most important, and they have moderate granularity. As the granularity becomes extreme at either end, a predictor loses importance.
library(rpart)
set.seed(10101)
# Predictors with decreasing granularity (number of distinct possible values)
V1 <- sample(0:1000000 / 1000000, 100, replace = TRUE)
V2 <- sample(0:100000 / 100000, 100, replace = TRUE)
V3 <- sample(0:10000 / 10000, 100, replace = TRUE)
V4 <- sample(0:1000 / 1000, 100, replace = TRUE)
V5 <- sample(0:100 / 100, 100, replace = TRUE)
V6 <- sample(0:10 / 10, 100, replace = TRUE)
y <- V1 + V2 + V6 + V5 + rnorm(100, mean = 0, sd = 20.5)
df <- data.frame(V1, V2, V3, V4, V5, V6, y)
r <- rpart(y ~ ., data = df)
varImp(r)
## Overall
## V1 0.007390939
## V2 0.002166609
## V3 0.012655076
## V4 0.030256099
## V5 0.014603885
## V6 0.009416652
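As a quick check on the granularity argument, the number of distinct values in each simulated predictor can be counted (a small sketch using the data frame built above):
# Distinct values per predictor: V1 is the most finely grained, V6 the coarsest.
sapply(df[, paste0("V", 1:6)], function(x) length(unique(x)))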
The model on the right has high values for both the bagging fraction and the learning rate. With a high learning rate, each tree makes a large correction to the fit, so the model converges quickly and later trees contribute little; the high bagging fraction means that nearly all of the data is used in each iteration, so successive trees see very similar samples. Together these settings concentrate the fit, and therefore the importance, on only a few variables and make the model prone to overfitting.
The model on the left would likely be more predictive of other samples, as it is much less overfit than the model on the right.
Increasing the interaction depth would likely produce an importance profile more like the model on the left, where multiple variables matter. Deeper trees allow more predictors to enter the splits, spreading importance across them, though this may also increase the risk of overfitting. A small illustrative sketch of the two parameter settings follows.
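In gbm these two settings correspond to the shrinkage and bag.fraction arguments, with interaction.depth controlling tree depth. The sketch below is run on the simulated data from above purely for illustration and uses values in the spirit of the two panels, not the exact settings behind the figure being discussed.
# Illustrative only: contrast a high-shrinkage / high-bagging-fraction fit
# with a low-shrinkage / low-bagging-fraction fit on the simulated data.
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 1, n.minobsinnode = 5)
gbm_low <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
               interaction.depth = 1, n.minobsinnode = 5)
# Compare how concentrated the relative influence is in each fit.
summary(gbm_high, plotit = FALSE)
summary(gbm_low, plotit = FALSE)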
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# preProcess object for median imputation (imputation is also handled inside
# train() below via its preProcess argument)
chem <- preProcess(ChemicalManufacturingProcess,
                   method = c("medianImpute"))
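The chem object is not used directly in the pipeline below, since imputation is handled by train()'s preProcess argument; if one wanted to apply it explicitly, a minimal sketch:
# Optional: apply the median-imputation recipe directly to the data.
chem_imputed <- predict(chem, ChemicalManufacturingProcess)
sum(is.na(chem_imputed))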
set.seed(6354)
partition <- createDataPartition(ChemicalManufacturingProcess[, 1], p = 0.75, list = FALSE)
training <- ChemicalManufacturingProcess[partition, -1]
y_training <- ChemicalManufacturingProcess[partition, 1]
testing <- ChemicalManufacturingProcess[-partition, -1]
y_testing <- ChemicalManufacturingProcess[-partition, 1]
# Tuning grid for the boosted tree (gbm) model fit below
grid <- expand.grid(n.trees = c(50, 100),
                    interaction.depth = c(1, 5, 10),
                    shrinkage = c(0.01, 0.1),
                    n.minobsinnode = c(5, 10))
r <- train(x = training, y = y_training, method = "rpart", preProcess = c("corr", "medianImpute"), tuneLength = 10)
## Warning in cor(x[, !(colnames(x) %in% c(method$ignore, method$remove)), : the
## standard deviation is zero
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
rf <- train(x = training, y = y_training, method = "rf", preProcess = c("corr", "medianImpute"), tuneLength = 10, verbose = FALSE)
## Warning in cor(x[, !(colnames(x) %in% c(method$ignore, method$remove)), : the
## standard deviation is zero
gbm <- train(x = training, y = y_training, method = "gbm", preProcess = c("corr", "medianImpute"), tuneGrid = grid, verbose = FALSE)
rPred <- predict(r, newdata = testing)
rfPred <- predict(rf, newdata = testing)
gbmPred <- predict(gbm, newdata = testing)
postResample(pred = rPred, obs = y_testing)
## RMSE Rsquared MAE
## 2.136208 NA 1.703349
postResample(pred = rfPred, obs = y_testing)
## RMSE Rsquared MAE
## 2.095278 NA 1.690864
postResample(pred = gbmPred, obs = y_testing)
## RMSE Rsquared MAE
## 2.239082 NA 1.833380
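A compact way to view the three hold-out results side by side (a small sketch built from the objects above):
# Collect the hold-out metrics for the three models into one table.
rbind(rpart = postResample(rPred, y_testing),
      rf = postResample(rfPred, y_testing),
      gbm = postResample(gbmPred, y_testing))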
varImp(rf)
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess13 34.132
## BiologicalMaterial12 27.319
## BiologicalMaterial03 19.035
## BiologicalMaterial06 16.731
## ManufacturingProcess17 15.675
## ManufacturingProcess09 10.543
## ManufacturingProcess06 10.453
## BiologicalMaterial11 9.587
## ManufacturingProcess36 7.775
## ManufacturingProcess31 5.599
## ManufacturingProcess05 4.920
## BiologicalMaterial04 4.808
## BiologicalMaterial02 4.781
## BiologicalMaterial09 4.458
## BiologicalMaterial05 4.307
## ManufacturingProcess11 4.215
## ManufacturingProcess01 3.777
## ManufacturingProcess04 3.656
## ManufacturingProcess15 3.631
The random forest gave the lowest RMSE and the lowest MAE, which makes it the best of the three in test-set performance.
Neither the biological nor the manufacturing predictors dominate the list, which differs from some earlier models that were largely dominated by one type or the other (usually manufacturing). ManufacturingProcess32 was far and away the most important predictor for the random forest model, with a mix of manufacturing (13, 17, 09, 06, 36) and biological (12, 03, 06, 11) predictors rounding out the top ten.
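To visualize how sharply the importance drops off after ManufacturingProcess32, the caret importance object can be plotted (a quick sketch):
# Plot the top 10 predictors from the random forest importance scores.
plot(varImp(rf), top = 10)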
Using the rpart.plot package, it is easier to see the relationships among the variables via a single tree. We can see yet again that ManufacturingProcess32 is easily the most important, in strong combination with BiologicalMaterial06. Throughout our work with this dataset we have found these to be very impactful variables.
library("rpart.plot")
model = rpart(y_training ~ ., data = training)
rpart.plot(model)