library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
model1 <- randomForest(y~., data = simulated,
importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 - V10)? No, the scores for V6-V10 were much lower than those of the other predictors.
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
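Note that the importance output below includes duplicate1 and duplicate2, two predictors highly correlated with V1 that this exercise adds before refitting. The code that created them isn't shown in this excerpt, but it presumably followed the exercise's standard form, something like:
# Assumed reconstruction: noisy copies of V1 (exact noise scale unknown)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1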
model4 <- cforest(y ~ ., data = simulated)
rfImp4 <- varimp(model4)
rfImp4
## V1 V2 V3 V4 V5 V6
## 4.11910478 5.81149273 0.01133118 7.16991462 1.62054023 -0.01818532
## V7 V8 V9 V10 duplicate1 duplicate2
## -0.02795336 -0.03663807 -0.04360946 -0.03454443 4.62957645 1.08104915
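The call above used the default (traditional) measure. The conditional version the question mentions can be requested explicitly; a sketch (output omitted here, and note it is considerably slower to compute):
# Conditional importance as described in Strobl et al. (2007)
rfImp4Cond <- varimp(model4, conditional = TRUE)
rfImp4Cond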
They do, but to a lesser degree: the informative predictors still dominate and V6-V10 remain near zero, though duplicate1 now absorbs much of V1's importance. The score for V1 is slightly higher in model4 than in model3.
library(gbm)
## Loaded gbm 2.1.5
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmModel)
## var rel.inf
## V4 V4 29.1891921
## V1 V1 24.2202771
## V2 V2 23.0396157
## V5 V5 10.6261851
## V3 V3 7.3239995
## duplicate1 duplicate1 5.0541842
## V6 V6 0.3612731
## duplicate2 duplicate2 0.1852732
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
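As an aside, summary() on a gbm object also draws a relative-influence bar chart by default; if only the table is wanted, the plot can be suppressed:
summary(gbmModel, plotit = FALSE)  # same rel.inf table, no bar chart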
library(Cubist)
cModel <- cubist(simulated[, -11], simulated[, 11])
varImp(cModel)
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## duplicate1 50
## V3 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
## duplicate2 0
The boosted tree follows the same pattern as the other models. The Cubist model arguably does as well, but its usage-based scores appear normalized (every predictor it uses gets a score of 50 here). In any case, the predictors Cubist considers important are the same ones the other models rank highly.
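One could also check whether the Cubist ranking changes under a larger rule ensemble; a sketch with an arbitrarily chosen number of committees (output not shown):
# More committees may spread usage across more predictors
cModel10 <- cubist(simulated[, -11], simulated[, 11], committees = 10)
varImp(cModel10)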
set.seed(42)
X1 <- sample(1:10, 100, replace = TRUE)       # informative, only 10 distinct values
X2 <- rnorm(100)                              # noise, effectively continuous
X3 <- rnorm(100, mean = 0, sd = 4)            # noise, continuous with wider spread
X4 <- sample(1:10 / 10, 100, replace = TRUE)  # noise, only 10 distinct values
Y <- X1 + rnorm(100, mean = 0, sd = .5)       # response depends on X1 alone
df <- data.frame(X1, X2, X3, X4, Y)
Given our predictors, we should see that X1 has the most importance, since Y is a direct function of X1, even though X1 is among the lowest-granularity predictors.
library(rpart)
simTree <- rpart(Y~., data = df)
varImp(simTree)
## Overall
## X1 3.3176171
## X2 0.3638746
## X3 0.5578960
## X4 0.2486994
Indeed, X1 has far higher importance than the others, as expected given that Y is a direct function of X1. Note, though, that the continuous noise predictors X2 and X3 still receive more importance than the equally uninformative but coarse X4, illustrating the selection bias toward high-granularity predictors.
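To make the granularity comparison concrete, we can count the distinct values each predictor takes; X1 and X4 have at most 10 apiece, while the draws for X2 and X3 are effectively unique per row:
# Number of unique values per predictor
sapply(df[, c("X1", "X2", "X3", "X4")], function(col) length(unique(col)))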
The model on the right has a larger bagging fraction and learning rate, which means more of the data set is used to build each tree and each tree contributes a larger fraction of the final prediction. This would suggest that the right model might have a better sense of which predictors are truly important. After all, the book itself notes “that small[er] values of the learning parameter (< 0.01) work best”.
That said, without testing either of the models, there is a strong chance that the model on the right overfits the data, as it disregards so many other predictors, especially predictors the left model considers important.
Given that interaction depth is really just tree depth, one would assume the importance slope would only grow steeper. The right model, which uses more of the data set per tree, would only reinforce its previous decisions with a greater tree depth. The same would then hold for the left model.
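For reference, the two settings in the figure could be reproduced along these lines; the 0.1/0.9 parameter values are assumed from the book's figure, and n.trees is chosen arbitrarily:
# Left panel: small bagging fraction and learning rate
gbmLeft <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               bag.fraction = 0.1, shrinkage = 0.1, n.trees = 1000)
# Right panel: large bagging fraction and learning rate
gbmRight <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                bag.fraction = 0.9, shrinkage = 0.9, n.trees = 1000)
summary(gbmLeft, plotit = FALSE)   # importance spread over many predictors
summary(gbmRight, plotit = FALSE)  # importance concentrated on a few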
library(AppliedPredictiveModeling)
library(impute)
library(ipred)
data("ChemicalManufacturingProcess")
# Impute missing values with k-nearest neighbors
cmp <- impute.knn(as.matrix(ChemicalManufacturingProcess))
cmp <- as.data.frame(cmp$data)
# 75/25 train/test split; the same indices must be used for the
# predictors and the response so that the rows stay aligned
smp <- floor(0.75 * nrow(cmp))
x <- sample(seq_len(nrow(cmp)), size = smp)
cmpTrain <- cmp[x, -1]
cmpTest <- cmp[-x, -1]
yTrain <- cmp[x, 1]
yTest <- cmp[-x, 1]
# Bagged Trees
cmpBT <- ipredbagg(yTrain, cmpTrain)
# Random Forest
cmpRF <- randomForest(cmpTrain, yTrain)
# Boosted Tree
cmpGBM <- gbm.fit(cmpTrain, yTrain, distribution = "gaussian")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 3.4844 nan 0.0010 0.0001
## 2 3.4838 nan 0.0010 -0.0000
## 3 3.4838 nan 0.0010 -0.0005
## 4 3.4834 nan 0.0010 -0.0002
## 5 3.4831 nan 0.0010 -0.0005
## 6 3.4826 nan 0.0010 -0.0001
## 7 3.4822 nan 0.0010 0.0003
## 8 3.4820 nan 0.0010 -0.0002
## 9 3.4815 nan 0.0010 -0.0002
## 10 3.4814 nan 0.0010 -0.0007
## 20 3.4767 nan 0.0010 0.0002
## 40 3.4674 nan 0.0010 0.0002
## 60 3.4589 nan 0.0010 0.0000
## 80 3.4512 nan 0.0010 0.0000
## 100 3.4426 nan 0.0010 -0.0004
# Cubist
cmpCub <- cubist(cmpTrain, yTrain)
cmpBT.pred <- predict(cmpBT, newdata = cmpTest)
cmpRF.pred <- predict(cmpRF, newdata = cmpTest)
cmpGBM.pred <- predict(cmpGBM, newdata = cmpTest,
n.trees = cmpGBM$n.trees)
cmpCub.pred <- predict(cmpCub, newdata = cmpTest)
postResample(cmpBT.pred, yTest)
## RMSE Rsquared MAE
## 1.855391440 0.001844473 1.577750438
postResample(cmpRF.pred, yTest)
## RMSE Rsquared MAE
## 1.855356831 0.002632157 1.587732136
postResample(cmpGBM.pred, yTest)
## RMSE Rsquared MAE
## 1.758457640 0.001630899 1.452252524
postResample(cmpCub.pred, yTest)
## RMSE Rsquared MAE
## 1.890693363 0.005350072 1.537437005
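To compare the four models at a glance, the same metrics can be stacked into a single table:
# Side-by-side test-set performance
rbind(Bagged = postResample(cmpBT.pred, yTest),
      RandomForest = postResample(cmpRF.pred, yTest),
      Boosted = postResample(cmpGBM.pred, yTest),
      Cubist = postResample(cmpCub.pred, yTest))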
Generally speaking, the random forest model might offer the best performance. It has neither the lowest RMSE (the boosted tree does) nor the highest \(R^2\) (Cubist does), but it offers a reasonable middle ground. That said, all four \(R^2\) values are close to zero, so none of these fits explains much of the test-set variance.
qqnorm(cmpBT.pred, main = "Bagged")
qqline(cmpBT.pred)
qqnorm(cmpRF.pred, main = "Random Forest")
qqline(cmpRF.pred)
qqnorm(cmpGBM.pred, main = "Boosted")
qqline(cmpGBM.pred)
qqnorm(cmpCub.pred, main = "Cubist")
qqline(cmpCub.pred)
The Q-Q plots of the predictions tend to agree, with the random forest having a generally better fit than most, save perhaps the Cubist model.
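Since the Q-Q plots above are drawn on the predictions themselves, a direct residual check is also worth a look; a sketch for the random forest:
# Residuals (observed minus predicted) against predictions
plot(cmpRF.pred, yTest - cmpRF.pred,
     xlab = "Predicted", ylab = "Residual", main = "Random Forest")
abline(h = 0, lty = 2)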
cmpRF$importance
## IncNodePurity
## BiologicalMaterial01 9.09659840
## BiologicalMaterial02 5.52837630
## BiologicalMaterial03 6.59033877
## BiologicalMaterial04 9.06924320
## BiologicalMaterial05 6.52243630
## BiologicalMaterial06 6.28282125
## BiologicalMaterial07 0.09778131
## BiologicalMaterial08 6.13796946
## BiologicalMaterial09 8.93826202
## BiologicalMaterial10 8.70184021
## BiologicalMaterial11 5.78476367
## BiologicalMaterial12 7.51486626
## ManufacturingProcess01 6.75065464
## ManufacturingProcess02 4.55837864
## ManufacturingProcess03 4.24484920
## ManufacturingProcess04 8.37197485
## ManufacturingProcess05 16.16392014
## ManufacturingProcess06 13.67153994
## ManufacturingProcess07 2.24781856
## ManufacturingProcess08 1.25810497
## ManufacturingProcess09 13.18060219
## ManufacturingProcess10 8.93316473
## ManufacturingProcess11 8.15220580
## ManufacturingProcess12 0.44250416
## ManufacturingProcess13 27.24671173
## ManufacturingProcess14 10.13834102
## ManufacturingProcess15 9.56799779
## ManufacturingProcess16 9.70307869
## ManufacturingProcess17 9.55001138
## ManufacturingProcess18 6.65482687
## ManufacturingProcess19 6.61292835
## ManufacturingProcess20 8.09854293
## ManufacturingProcess21 6.12772689
## ManufacturingProcess22 6.35809946
## ManufacturingProcess23 4.32597074
## ManufacturingProcess24 10.48159891
## ManufacturingProcess25 6.72807779
## ManufacturingProcess26 5.82566678
## ManufacturingProcess27 7.88480245
## ManufacturingProcess28 2.75801357
## ManufacturingProcess29 4.67023377
## ManufacturingProcess30 5.01011088
## ManufacturingProcess31 7.53221053
## ManufacturingProcess32 14.88077737
## ManufacturingProcess33 4.98193170
## ManufacturingProcess34 1.84282354
## ManufacturingProcess35 36.53553664
## ManufacturingProcess36 7.28035732
## ManufacturingProcess37 13.00341253
## ManufacturingProcess38 3.91188278
## ManufacturingProcess39 8.13208298
## ManufacturingProcess40 1.09883529
## ManufacturingProcess41 1.01062441
## ManufacturingProcess42 7.01059978
## ManufacturingProcess43 5.08586389
## ManufacturingProcess44 2.26434939
## ManufacturingProcess45 10.03921195
If we sort the importance matrix directly, we lose the row labels, but we can match the values against the table above visually (a label-preserving alternative is sketched after the output).
sort(cmpRF$importance, decreasing = TRUE)
## [1] 36.53553664 27.24671173 16.16392014 14.88077737 13.67153994
## [6] 13.18060219 13.00341253 10.48159891 10.13834102 10.03921195
## [11] 9.70307869 9.56799779 9.55001138 9.09659840 9.06924320
## [16] 8.93826202 8.93316473 8.70184021 8.37197485 8.15220580
## [21] 8.13208298 8.09854293 7.88480245 7.53221053 7.51486626
## [26] 7.28035732 7.01059978 6.75065464 6.72807779 6.65482687
## [31] 6.61292835 6.59033877 6.52243630 6.35809946 6.28282125
## [36] 6.13796946 6.12772689 5.82566678 5.78476367 5.52837630
## [41] 5.08586389 5.01011088 4.98193170 4.67023377 4.55837864
## [46] 4.32597074 4.24484920 3.91188278 2.75801357 2.26434939
## [51] 2.24781856 1.84282354 1.25810497 1.09883529 1.01062441
## [56] 0.44250416 0.09778131
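A label-preserving alternative: extracting the importance column first yields a named vector, so sort() keeps the names:
# Top ten predictors with their labels intact
head(sort(cmpRF$importance[, "IncNodePurity"], decreasing = TRUE), 10)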
ManufacturingProcess35 = 36.53553664
ManufacturingProcess13 = 27.24671173
ManufacturingProcess05 = 16.16392014
ManufacturingProcess32 = 14.88077737
ManufacturingProcess06 = 13.67153994
ManufacturingProcess09 = 13.18060219
ManufacturingProcess37 = 13.00341253
ManufacturingProcess24 = 10.48159891
ManufacturingProcess14 = 10.13834102
ManufacturingProcess45 = 10.03921195
The manufacturing processes dominate the list: every predictor in the top ten is a manufacturing variable, with the best biological material (BiologicalMaterial01 at 9.10) just missing the cut. That is not the same picture as the previous assignment, where ManufacturingProcess40 ended up as the most important predictor; in this model it doesn't make the top 10.
The randomForest package doesn't have a native plotting method for its trees. We can extract a single tree through the getTree function, though it's not obvious which tree would be the most representative.
getTree(cmpRF)
## left daughter right daughter split var split point status prediction
## 1 2 3 47 483.000000 -3 40.26735
## 2 4 5 36 11.500000 -3 42.05684
## 3 6 7 26 4739.500000 -3 39.96646
## 4 8 9 22 9.200000 -3 43.00500
## 5 10 11 47 474.500000 -3 40.43143
## 6 0 0 0 0.000000 -1 44.16000
## 7 12 13 25 32.550000 -3 39.85209
## 8 14 15 3 66.375000 -3 43.22222
## 9 0 0 0 0.000000 -1 42.35333
## 10 0 0 0 0.000000 -1 42.15500
## 11 0 0 0 0.000000 -1 39.74200
## 12 0 0 0 0.000000 -1 46.34000
## 13 16 17 10 2.335000 -3 39.73194
## 14 0 0 0 0.000000 -1 42.99200
## 15 0 0 0 0.000000 -1 43.51000
## 16 18 19 26 4886.500000 -3 38.44000
## 17 20 21 15 1.505000 -3 39.99033
## 18 22 23 8 16.415000 -3 38.19250
## 19 0 0 0 0.000000 -1 40.42000
## 20 0 0 0 0.000000 -1 37.71200
## 21 24 25 44 163.500000 -3 40.12435
## 22 0 0 0 0.000000 -1 36.47500
## 23 26 27 49 0.650000 -3 38.43786
## 24 28 29 47 508.000000 -3 40.40435
## 25 30 31 23 9.612698 -3 38.91687
## 26 0 0 0 0.000000 -1 37.66500
## 27 32 33 17 990.100000 -3 38.74700
## 28 34 35 1 6.915000 -3 40.66750
## 29 36 37 25 34.250000 -3 38.65000
## 30 38 39 44 164.500000 -3 39.86500
## 31 40 41 4 13.175000 -3 38.34800
## 32 0 0 0 0.000000 -1 38.96750
## 33 42 43 2 52.715000 -3 38.60000
## 34 44 45 49 2.050000 -3 40.95714
## 35 46 47 45 61.071429 -3 39.37727
## 36 0 0 0 0.000000 -1 37.65200
## 37 0 0 0 0.000000 -1 39.89750
## 38 0 0 0 0.000000 -1 38.42000
## 39 0 0 0 0.000000 -1 40.15400
## 40 0 0 0 0.000000 -1 38.82500
## 41 48 49 37 4879.500000 -3 38.03000
## 42 0 0 0 0.000000 -1 38.37000
## 43 0 0 0 0.000000 -1 38.64600
## 44 50 51 37 4868.500000 -3 40.83277
## 45 0 0 0 0.000000 -1 43.88000
## 46 0 0 0 0.000000 -1 43.38000
## 47 52 53 44 158.500000 -3 38.97700
## 48 0 0 0 0.000000 -1 37.94800
## 49 0 0 0 0.000000 -1 38.44000
## 50 54 55 2 48.520000 -3 40.49294
## 51 56 57 38 6080.500000 -3 41.72154
## 52 0 0 0 0.000000 -1 38.23800
## 53 0 0 0 0.000000 -1 39.71600
## 54 0 0 0 0.000000 -1 42.69000
## 55 58 59 54 11.100000 -3 40.42636
## 56 60 61 2 60.320000 -3 42.30000
## 57 0 0 0 0.000000 -1 40.42000
## 58 0 0 0 0.000000 -1 38.98000
## 59 62 63 11 143.795000 -3 40.51968
## 60 64 65 44 158.500000 -3 42.13500
## 61 0 0 0 0.000000 -1 43.62000
## 62 66 67 18 206.950000 -3 39.97111
## 63 68 69 4 12.260000 -3 40.74409
## 64 0 0 0 0.000000 -1 42.19800
## 65 0 0 0 0.000000 -1 42.03000
## 66 70 71 53 0.050000 -3 39.57667
## 67 0 0 0 0.000000 -1 40.76000
## 68 72 73 26 4869.000000 -3 40.96867
## 69 74 75 30 4763.000000 -3 40.26286
## 70 0 0 0 0.000000 -1 40.01000
## 71 0 0 0 0.000000 -1 39.14333
## 72 76 77 29 34.400000 -3 40.83364
## 73 0 0 0 0.000000 -1 41.34000
## 74 0 0 0 0.000000 -1 40.57000
## 75 0 0 0 0.000000 -1 40.14000
## 76 0 0 0 0.000000 -1 40.68000
## 77 78 79 53 0.050000 -3 40.92143
## 78 0 0 0 0.000000 -1 40.87000
## 79 0 0 0 0.000000 -1 40.96000
If we could view the tree, however, we would gain a new perspective on the relationship between the biological and manufacturing predictors. We would see how a particular tree, or in the question's case an optimal tree, groups the predictors, and which one sits at the root as the most important split for that tree.
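As a stand-in for plotting a forest member, a single CART tree fit to the same data can at least be drawn; a sketch using rpart (loaded earlier), which approximates but does not reproduce any one tree from the forest:
# Fit and plot one standalone regression tree on the training data
cmpTree <- rpart(y ~ ., data = cbind(cmpTrain, y = yTrain))
plot(cmpTree, margin = 0.1)  # dendrogram layout
text(cmpTree, cex = 0.7)     # split rules and leaf predictions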