Exercises 8.1, 8.2, 8.3 & 8.7 from the K&J (Applied Predictive Modeling) book. The RPubs version of this work can be found here, and source/data can be found on GitHub here.
library(knitr)
library(ggplot2)
library(tidyr)
library(AppliedPredictiveModeling)
library(partykit)
data(solubility)
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
library(caret)
model1 <- randomForest(y ~ .,
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
kable(rfImp1)
| | Overall |
|---|---|
| V1 | 8.7322354 |
| V2 | 6.4153694 |
| V3 | 0.7635918 |
| V4 | 7.6151188 |
| V5 | 2.0235246 |
| V6 | 0.1651112 |
| V7 | -0.0059617 |
| V8 | -0.1663626 |
| V9 | -0.0952927 |
| V10 | -0.0749448 |
No - as the table above clearly shows, the random forest model does not use \(V6\)–\(V10\) to any significant degree. The average importance score for \(V1\)–\(V5\) is 5.109968, whereas the average for \(V6\)–\(V10\) is -0.0354901.
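The next part of the exercise adds a predictor that is highly correlated with V1 and refits the forest. The original write-up references rfImp2 below without showing this step, so here is a reconstruction following the exercise prompt (the duplicate1 row in the later tables confirms it was run):
# Add a predictor highly correlated with V1 (per the exercise prompt),
# then refit the random forest and recompute importance.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
model2 <- randomForest(y ~ .,
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)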
library(party)
library(dplyr)
cforestModel <- cforest(y ~ ., data=simulated)
# cbind() matches rows by position, not by name, so keep the cforest
# importances in the same variable order as rfImp2 rather than sorting them.
vi <- cbind(rfImp2,
            varimp(cforestModel)[row.names(rfImp2)],
            varimp(cforestModel, conditional = TRUE)[row.names(rfImp2)])
colnames(vi) <- c("Traditional","Unconditional","Conditional")
vi <- data.frame(vi)
kable(vi)
| | Traditional | Unconditional | Conditional |
|---|---|---|---|
| V1 | 5.6911997 | 7.6223893 | 6.1905784 |
| V2 | 6.0689606 | 6.0579731 | 4.6889804 |
| V3 | 0.6297022 | 5.0941897 | 1.9267517 |
| V4 | 7.0475224 | 4.6171159 | 1.8079531 |
| V5 | 1.8723844 | 1.7161194 | 1.0516669 |
| V6 | 0.1356906 | 0.0465375 | 0.0281748 |
| V7 | -0.0134564 | 0.0046062 | 0.0151185 |
| V8 | -0.0437056 | 0.0003116 | 0.0128788 |
| V9 | 0.0084044 | -0.0289427 | -0.0043566 |
| V10 | 0.0289481 | -0.0310326 | -0.0117094 |
| duplicate1 | 4.2833158 | -0.0380966 | -0.0221906 |
vi$vars <- row.names(vi)
vi.tidy <- gather(vi,"condition","value",-vars)
ggplot(vi.tidy,aes(x=vars,y=value, fill = condition,color=condition))+
geom_bar(stat="identity",position="dodge")+
ggtitle("Variable Importance by Model")
Whereas the traditional model (green) shows a relatively high degree of importance for the duplicate variable, the cforest model does not. Otherwise, variable importance is broadly similar; the only other glaring difference is that the traditional model underweights V3 relative to the other two models. All three appear to ignore \(V6\)–\(V10\) equally.
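The drop in V1's traditional importance is expected: the duplicate tracks V1 very closely, so the forest splits credit between the two. A quick check (assuming the duplicate1 reconstruction above):
# With noise of sd = 0.1 added to a roughly uniform (0,1) predictor,
# the correlation should come out around 0.9.
cor(simulated$duplicate1, simulated$V1)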
library(Cubist)
library(gbm)
# Drop the response column by name rather than by position, to avoid
# off-by-one indexing now that duplicate1 has been appended.
model.cubist <- cubist(x = simulated[, names(simulated) != "y"],
                       y = simulated$y, committees = 10)
model.gbm <- gbm(y ~ ., data=simulated,n.trees=50, distribution='gaussian')
gbm.summary <- summary(model.gbm)
varImp(model.cubist)
# summary(gbm) sorts variables by rel.inf, so match the GBM importances back
# to vi's row order by name before binding the columns together.
vi <- cbind(vi,
            varImp(model.cubist)[row.names(vi), , drop = FALSE],
            gbm.summary[match(row.names(vi), gbm.summary$var), "rel.inf"])
vi <- data.frame(vi)
vi <- subset(vi, select = -c(vars))
colnames(vi) <- c("Traditional","Unconditional","Conditional","Cubist","GBM")
kable(vi)
| | Traditional | Unconditional | Conditional | Cubist | GBM |
|---|---|---|---|---|---|
| V1 | 5.6911997 | 7.6223893 | 6.1905784 | 32.5 | 31.00986 |
| V2 | 6.0689606 | 6.0579731 | 4.6889804 | 51.5 | 23.85183 |
| V3 | 0.6297022 | 5.0941897 | 1.9267517 | 58.5 | 20.07400 |
| V4 | 7.0475224 | 4.6171159 | 1.8079531 | 45.5 | 12.97916 |
| V5 | 1.8723844 | 1.7161194 | 1.0516669 | 49.5 | 10.16261 |
| V6 | 0.1356906 | 0.0465375 | 0.0281748 | 38.5 | 1.92255 |
| V7 | -0.0134564 | 0.0046062 | 0.0151185 | 21.5 | 0.00000 |
| V8 | -0.0437056 | 0.0003116 | 0.0128788 | 0.0 | 0.00000 |
| V9 | 0.0084044 | -0.0289427 | -0.0043566 | 0.0 | 0.00000 |
| V10 | 0.0289481 | -0.0310326 | -0.0117094 | 0.0 | 0.00000 |
| duplicate1 | 4.2833158 | -0.0380966 | -0.0221906 | 0.0 | 0.00000 |
vi$vars <- row.names(vi)
vi.tidy <- gather(vi,"condition","value",-vars)
ggplot(vi.tidy,aes(x=vars,y=value, fill = condition,color=condition))+
geom_bar(stat="identity",position="dodge")+
ggtitle("Variable Importance by Model")
As before, the traditional model is the only one that assigns any weight to the duplicate variable. The GBM model assigns weight in roughly the same places as the models tested previously, whereas the Cubist model assigns some importance to \(V6\) and \(V7\). I re-ran this several times and reproduced the result each time, so I am left wondering whether this is a feature of the model or operator error.
set.seed(19)
a <- sample.int(5 , 500, replace = TRUE)/5
b <- sample.int(10, 500, replace = TRUE)/10
c <- sample.int(25, 500, replace = TRUE)/25
d <- sample.int(50, 500, replace = TRUE)/50
e <- sample.int(100, 500, replace = TRUE)/100
target <- a + e * rnorm(500)
df <- data.frame(a, b, c, d, e, target)
str(df)
## 'data.frame': 500 obs. of 6 variables:
## $ a : num 1 0.4 0.6 1 0.4 0.2 0.6 1 0.2 1 ...
## $ b : num 1 0.6 0.2 1 0.7 0.5 0.6 0.9 0.3 1 ...
## $ c : num 0.52 0.08 0.84 0.44 0.56 0.32 0.04 0.12 0.88 0.44 ...
## $ d : num 0.04 0.6 0.16 0.48 0.7 0.14 0.46 0.18 0.12 0.92 ...
## $ e : num 0.81 0.02 0.1 0.41 0.64 0.23 0.32 0.84 0.99 0.78 ...
## $ target: num 0.64 0.424 0.678 0.278 -0.444 ...
library(rpart)
library(rpart.plot)
model.rpart <- rpart(target ~ ., data=df)
kable(varImp(model.rpart))
| | Overall |
|---|---|
| a | 0.1941175 |
| b | 0.2511647 |
| c | 0.1834566 |
| d | 0.0636462 |
| e | 0.2548975 |
rpart.plot(model.rpart)
In the above, we first generate a set of five random variables with differing granularity (a takes 5 distinct values, e takes 100). The target variable is then the coarsest variable, a, plus a noise term whose magnitude is scaled by the most granular variable, e.
After training the model, the varImp table clearly shows that variable e is rated as more important than variable a, even though a is the only predictor with a direct effect on the target - an illustration of the selection bias of trees toward predictors with more distinct values. The rpart plot is a little more cryptic, and in hindsight I perhaps should have used fewer variables.
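To isolate that bias (a sketch, not part of the original experiment): even when the target depends only on a coarse predictor, a fine-grained noise variable can still pick up importance because it offers far more candidate split points.
# Hypothetical follow-up: 'coarse' (5 distinct values) drives the target,
# 'fine' (continuous) is pure noise, yet it may still earn nonzero importance.
set.seed(19)
coarse <- sample.int(5, 500, replace = TRUE) / 5
fine <- runif(500)
bias.df <- data.frame(coarse, fine, y2 = coarse + rnorm(500, sd = 0.5))
model.bias <- rpart(y2 ~ coarse + fine, data = bias.df)
varImp(model.bias)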
Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
[Fig 8.24 from the text: variable importance for the solubility data - left panel with bagging fraction and learning rate both 0.1, right panel with both set to 0.9.]
In the context of bagging fraction and learning rate:
The learning rate governs the fraction of each new prediction that is added to the running prediction from previous iterations; the text suggests values of < 0.01. The bagging fraction determines the proportion of the training data seen at each iteration - the book suggests 50%.
We would expect the model on the right to perform better on analogous data sets, given its tighter fit. However, it will likely also be more sensitive to differences in the out-of-sample data.
Which model I would choose depends on the application (the consequences of being wrong, etc.). Where accuracy is not overly important but robustness matters a lot, the model on the left is preferred; otherwise, the better-tuned model on the right.
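To reproduce something like Figure 8.24, one could refit boosting on the solubility data at the two extremes. A minimal sketch using gbm's shrinkage and bag.fraction arguments (the parameter values come from the exercise; the object names and n.trees are my own assumptions):
# Left panel of Fig 8.24: shrinkage = bag.fraction = 0.1; right panel: both 0.9.
gbm.left  <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                     n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)
gbm.right <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                     n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)
head(summary(gbm.left,  plotit = FALSE), 10)   # importance spread across many predictors
head(summary(gbm.right, plotit = FALSE), 10)   # importance concentrated in a few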
g1 <- expand.grid(n.trees=100, interaction.depth=1, shrinkage=0.1, n.minobsinnode=10)
g2 <- expand.grid(n.trees=100, interaction.depth=10, shrinkage=0.1,n.minobsinnode=10)
model.gbm1 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = g1, verbose = FALSE)
model.gbm2 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = g2, verbose = FALSE)
var.imp <- cbind(varImp(model.gbm1)[[1]],varImp(model.gbm2)[[1]])
colnames(var.imp) <- c("Depth1", "Depth10")
kable(var.imp[order(-var.imp$Depth1),][1:25,])
| | Depth1 | Depth10 |
|---|---|---|
| NumCarbon | 100.0000000 | 100.0000000 |
| MolWeight | 71.2326619 | 75.7839450 |
| SurfaceArea2 | 40.6536164 | 22.3079238 |
| NumAromaticBonds | 36.5934625 | 16.3954527 |
| SurfaceArea1 | 22.4156010 | 42.0861640 |
| NumChlorine | 22.0324377 | 9.3635377 |
| HydrophilicFactor | 12.0430919 | 21.1485802 |
| NumNonHAtoms | 8.9283369 | 12.6230651 |
| FP172 | 8.2081308 | 2.5659970 |
| NumHalogen | 4.5214749 | 10.2322460 |
| NumRotBonds | 4.1502299 | 3.5024223 |
| FP059 | 3.8123259 | 2.0345170 |
| FP116 | 2.4919572 | 1.8422265 |
| FP112 | 2.1801740 | 1.7124038 |
| FP135 | 1.8358101 | 1.1442461 |
| NumOxygen | 1.6214354 | 2.9817220 |
| NumMultBonds | 1.5572537 | 3.9728826 |
| FP075 | 1.4608977 | 3.0037295 |
| FP206 | 1.3469506 | 0.8144075 |
| FP142 | 1.0667059 | 1.8600409 |
| FP125 | 0.8456011 | 0.2060303 |
| FP072 | 0.8013671 | 2.0490774 |
| FP161 | 0.7224100 | 0.6564066 |
| NumHydrogen | 0.6908099 | 5.4346728 |
| FP043 | 0.6587296 | 0.4635920 |
As we can see above, increasing the interaction depth flattens the importance profile, spreading weight across far more predictors further down the ranking.
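One way to see the flattening directly (a small addition, reusing the var.imp table above) is to plot each column's sorted importance profile:
# Sorted importance profiles: the depth-10 curve should decay more gradually.
plot(sort(var.imp$Depth1, decreasing = TRUE), type = "l",
     xlab = "Predictor rank", ylab = "Relative importance")
lines(sort(var.imp$Depth10, decreasing = TRUE), lty = 2)
legend("topright", legend = c("Depth 1", "Depth 10"), lty = 1:2)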
First we’ll load and split the data by recycling code from HW7 & 8.
set.seed(19)
data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess
#impute using knn
chem.imp <- preProcess(chem[,2:ncol(chem)], method=c('knnImpute'))
chem <- cbind(chem$Yield,predict(chem.imp, chem[,2:ncol(chem)]))
colnames(chem)[1] <- "Yield"
#split 70/30
n <- floor(0.70 * nrow(chem))
idx <- sample(seq_len(nrow(chem)), size = n)
train <- chem[idx, ]
test <- chem[-idx, ]
Next we train, analyze and compare several models:
# train all the models
grid.rf <- expand.grid(mtry=seq(5,40,by=5))
model.rf <- train(Yield ~.,
data = train,
method = "rf",
tuneGrid = grid.rf,
metric = "Rsquared",
importance = TRUE,
trControl = trainControl(method = "cv", number = 10))
grid.crf <- expand.grid(mtry=seq(5,50,by=5))
model.crf <- train(Yield ~.,
data = train,
method = "cforest",
tuneGrid = grid.crf,
metric = "Rsquared",
trControl = trainControl(method = "oob"))
grid.cube <- expand.grid(committees = c(1,5,10,15,20,25),
neighbors = c(0,1,3,4,5))
model.cube <- train(Yield ~.,
data = train,
method = "cubist",
metric = "Rsquared",
tuneGrid = grid.cube,
trControl = trainControl(method = "cv", number = 10))
grid.rpart <- expand.grid(maxdepth= seq(1,10,by=1))
model.rpart <- train(Yield ~.,
data = train,
method = "rpart2",
metric = "Rsquared",
tuneGrid = grid.rpart,
trControl = trainControl(method = "cv", number = 10))
grid.gbm <- expand.grid(n.trees=c(50, 100, 150, 200),
interaction.depth=c(1, 5, 10, 15),
shrinkage=c(0.01, 0.1, 0.5),
n.minobsinnode=c(5, 10, 15))
model.gbm <- train(Yield ~.,
data = train,
method = 'gbm',
tuneGrid = grid.gbm ,
verbose = FALSE)
pred.rf <- predict(model.rf, newdata = test[,-1])
pred.crf <- predict(model.crf, newdata = test[,-1])
pred.cube <- predict(model.cube, newdata = test[,-1])
pred.rpart <- predict(model.rpart, newdata = test[,-1])
pred.gbm <- predict(model.gbm, newdata = test[,-1])
train.results <- data.frame(rbind(getTrainPerf(model.rf),
getTrainPerf(model.crf),
getTrainPerf(model.cube),
getTrainPerf(model.rpart),
getTrainPerf(model.gbm)))
row.names(train.results) <- c("RandomForest", "cForest","Cubist","Rpart","GBM")
train.results
test.results <- data.frame(rbind(postResample(pred = pred.rf, obs = test$Yield),
postResample(pred = pred.crf, obs = test$Yield),
postResample(pred = pred.cube, obs = test$Yield),
postResample(pred = pred.rpart, obs = test$Yield),
postResample(pred = pred.gbm, obs = test$Yield)))
row.names(test.results) <- c("RandomForest", "cForest","Cubist","Rpart","GBM")
test.results
From the above tables we can see that the best model in both training and testing was the Cubist model.
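As a quick sanity check on the winner (a small addition, not in the original post), we can plot observed against predicted yields for the Cubist model on the test set:
# Points hugging the 45-degree line indicate good test-set calibration.
plot(test$Yield, pred.cube,
     xlab = "Observed Yield", ylab = "Predicted Yield",
     main = "Cubist: Test-Set Predictions")
abline(0, 1, lty = 2)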
plot(varImp(model.cube),
top=10,
scales = list(y = list(cex = 0.8)),
main="Variable Importance for Cubist")
plot(varImp(model.rf),
top=10, scales = list(y = list(cex = 0.8)),
main="Variable Importance for RandomForest")
plot(varImp(model.crf),
top=10, scales = list(y = list(cex = 0.8)),
main="Variable Importance for cForest")
plot(varImp(model.rpart),
top=10, scales = list(y = list(cex = 0.8)),
main="Variable Importance for Rpart")
plot(varImp(model.gbm),
top=10, scales = list(y = list(cex = 0.8)),
main="Variable Importance for GBM")
# cbind() matches rows by position, not by name, so index each model's
# importance table by a common set of row names to keep them aligned.
vars <- row.names(varImp(model.cube)[[1]])
var.imp <- data.frame(Cubist = varImp(model.cube)[[1]][vars, 1],
                      RandomForest = varImp(model.rf)[[1]][vars, 1],
                      cForest = varImp(model.crf)[[1]][vars, 1],
                      Rpart = varImp(model.rpart)[[1]][vars, 1],
                      GBM = varImp(model.gbm)[[1]][vars, 1],
                      row.names = vars)
kable(var.imp[1:10,])
| | Cubist | RandomForest | cForest | Rpart | GBM |
|---|---|---|---|---|---|
| BiologicalMaterial06 | 85.9375 | 45.97124 | 4.0486620 | 63.08239 | 4.863126 |
| ManufacturingProcess13 | 100.0000 | 63.30621 | 15.2548348 | 64.28698 | 6.351365 |
| ManufacturingProcess28 | 67.1875 | 72.06596 | 21.1195833 | 67.64603 | 19.563297 |
| BiologicalMaterial12 | 48.4375 | 60.63919 | 3.5450018 | 55.51352 | 3.567137 |
| BiologicalMaterial09 | 25.0000 | 52.80275 | 1.5252778 | 100.00000 | 21.163190 |
| ManufacturingProcess39 | 32.8125 | 74.58936 | 17.9040659 | 0.00000 | 16.727681 |
| ManufacturingProcess17 | 34.3750 | 29.60265 | 0.6560436 | 0.00000 | 0.000000 |
| ManufacturingProcess01 | 21.8750 | 59.85886 | 8.9780520 | 0.00000 | 3.149193 |
| ManufacturingProcess31 | 34.3750 | 54.20562 | 1.5336713 | 0.00000 | 18.388157 |
| ManufacturingProcess32 | 82.8125 | 43.09557 | 0.5293143 | 0.00000 | 12.732971 |
From the above, we can see that the most important variable in the best (Cubist) model is ManufacturingProcess13, with the top 10 dominated by manufacturing process variables in general (roughly 7 of 10). The other models also rank manufacturing variables highly, but show less of a bias towards them: just by eye, they look closer to an even manufacturing/biological split, whereas the Cubist top 10 leans heavily towards manufacturing.
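To put a number on that eyeball estimate (a sketch reusing the aligned var.imp table above), one can compute the manufacturing share of each model's top 10:
# Fraction of each model's 10 highest-ranked predictors that are process variables.
sapply(var.imp, function(imp) {
  top10 <- row.names(var.imp)[order(-imp)][1:10]
  mean(grepl("^Manufacturing", top10))
})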
plot(as.party(model.rpart$finalModel),gp=gpar(fontsize=11))
I could not find a way to plot the optimal (Cubist) model directly; however, given that the models are all reasonably similar, I chose to plot the rpart model via partykit's as.party(), which is relatively simple.
It appears as though the top of the tree (i.e. the big decisions) is governed by manufacturing variables - consistent with the variable importance results. It is not incredibly clear, but based on the terminal distributions it appears that the biological predictors may be associated with lower yield outcomes in general.
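Although there is no tree to draw for the Cubist model, its fitted rules can at least be printed; a minimal option using the Cubist package's summary method:
# Prints each committee's rules, with their conditions and linear models.
summary(model.cube$finalModel)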