library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
8.1 a
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
Based on the plot and the output of rfImp1, the random forest model did not significantly use the uninformative predictors V6-V10 (their importance scores are near zero or negative), while V1-V5 account for essentially all of the importance.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.83890885
## V2 6.49023056
## V3 0.67583163
## V4 7.58822553
## V5 2.27426009
## V6 0.17436781
## V7 0.15136583
## V8 -0.03078937
## V9 -0.02989832
## V10 -0.08529218
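As a quick numeric check of the claim above (a small addition reusing the rfImp1 object), the importance scores of the informative and uninformative predictors can be summarized separately:
# Compare importance of the informative (V1-V5) and uninformative (V6-V10) predictors
summary(rfImp1[paste0("V", 1:5), "Overall"])
summary(rfImp1[paste0("V", 6:10), "Overall"])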
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
rfImp1 %>%
  mutate(var = rownames(rfImp1)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = 'red') +
  labs(title = 'Result of varImp', y = 'Variable')
8.1 b
Now add a predictor that is highly correlated with one of the informative predictors, then fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216
Yes, the importance score for V1 dropped from about 8.8 to 6.3, with the new duplicate1 predictor picking up an importance of about 3.6. Adding yet another predictor that is highly correlated with V1 dilutes V1's importance further, since the correlated copies split the credit among themselves, as the sketch below illustrates.
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 6.29780744
## V2 6.08038134
## V3 0.58410718
## V4 6.93924427
## V5 2.03104094
## V6 0.07947642
## V7 -0.02566414
## V8 -0.11007435
## V9 -0.08839463
## V10 -0.00715093
## duplicate1 3.56411581
rfImp2 %>%
  mutate(var = rownames(rfImp2)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = 'red') +
  labs(title = 'Result of varImp', y = 'Variable')
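To illustrate what happens when yet another predictor correlated with V1 is added (a sketch on a copy of the data so the later parts are unaffected; duplicate2 and model2b are my own names):
# Add a second noisy copy of V1 and refit; the correlated copies share splits,
# so V1's importance is typically diluted even further.
sim2 <- simulated
sim2$duplicate2 <- sim2$V1 + rnorm(200) * .1
model2b <- randomForest(y ~ ., data = sim2, importance = TRUE, ntree = 1000)
varImp(model2b, scale = FALSE)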
8.1 c
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
The conditional importance scores are generally smaller in magnitude than those from the traditional random forest, but the overall pattern is similar: V1, V2, V4, and V5 dominate, the uninformative predictors V6-V10 stay near zero, and duplicate1 still draws importance away from V1. (The partykit implementations of cforest and varimp, loaded above, are used here.)
model3 <- cforest(y ~ ., data = simulated)
conditional3 <- varimp(model3, conditional = TRUE)
conditional3
## V1 V2 V3 V4 V5 V6
## 4.054042760 4.895393237 0.085021626 5.999816224 1.440787988 -0.149067984
## V7 V8 V9 V10 duplicate1
## -0.008241435 -0.219075571 -0.318497534 -0.254597181 1.798780859
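For a direct comparison (a small addition, not part of the original output), the traditional, unconditional scores from the same cforest fit can be placed next to the conditional ones:
# Unconditional vs. conditional importance from the same conditional inference forest
unconditional3 <- varimp(model3, conditional = FALSE)
round(cbind(unconditional = unconditional3, conditional = conditional3), 3)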
8.1 d
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
A similar pattern occurs: the uninformative predictors V6-V10 remain unimportant in both models. The boosted tree model still ranks V4, V2, and V1 highest, while Cubist spreads its importance evenly across V1, V2, V4, and V5 and drops V3 and duplicate1 entirely.
library(gbm)
## Loaded gbm 2.1.8.1
boostedtrees <- gbm(y ~., data = simulated, distribution = "gaussian")
summary.gbm(boostedtrees)
## var rel.inf
## V4 V4 30.5140029
## V2 V2 21.5389454
## V1 V1 21.3162619
## V5 V5 11.9240567
## V3 V3 7.8277277
## duplicate1 duplicate1 6.5970757
## V6 V6 0.2819298
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
library(Cubist)
cubistFit <- cubist(x = simulated[, -11], y = simulated[, 11])
# Print out the variable importance scores.
cubistImp <- varImp(cubistFit)
cubistImp
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## V3 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
## duplicate1 0
8.2. Use a simulation to show tree bias with different granularities.
V1 <- runif(100, 1, 1000)
V2 <- runif(100, 5, 500)
V3 <- rnorm(100, 100,10)
y <- V2 + V1
df <- data.frame(V1, V2, V3, y)
model82 <- cforest(y ~ ., data = df, ntree = 10)
cfImp_cond <- varimp(model82)
cfImp_cond
## V1 V2
## 96832.85 16147.46
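In this simulation V1 dominates partly because its wider range lets it explain more of the variance in y, so it does not isolate the granularity effect by itself. A complementary sketch (using rpart; not part of the original code) removes the signal entirely and varies only the number of distinct values, which is where the selection bias shows up:
# Neither predictor is related to y, yet splits tend to favor the predictor
# that offers more candidate split points (the finer-grained x_fine).
set.seed(200)
x_fine   <- runif(500)                        # roughly 500 distinct values
x_coarse <- sample(1:2, 500, replace = TRUE)  # only 2 distinct values (one possible split)
y_noise  <- rnorm(500)                        # response unrelated to both predictors
noise_df <- data.frame(x_fine, x_coarse, y_noise)
library(rpart)
bias_tree <- rpart(y_noise ~ ., data = noise_df,
                   control = rpart.control(cp = 0, minsplit = 20))
bias_tree$variable.importance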
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9.
(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
With the bagging fraction and learning rate both set to 0.9, each tree is fit on almost all of the training data and its contribution is added at nearly full strength, so the fit is dominated by the first few trees, which are built around the strongest predictors; the importance therefore concentrates on those few predictors. With both parameters set to 0.1, each tree sees only a small random subsample and takes only a small step along the gradient, so many more trees are needed, more predictors get selected along the way, and the importance is spread across a larger set of predictors.
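The effect can also be illustrated on the simulated data from 8.1 (a rough sketch, not the book's solubility data or Figure 8.24; the two settings are the extremes from the question):
# Boosting with both parameters small vs. both large; compare how concentrated
# the relative influence is in each fit.
set.seed(200)
gbm_low  <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 1000,
                shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 1000,
                shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)
summary(gbm_low,  plotit = FALSE)   # influence typically spread over more predictors
summary(gbm_high, plotit = FALSE)   # influence typically concentrated on a few predictors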
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
library(RANN)
estdata <- preProcess(ChemicalManufacturingProcess, "knnImpute")
chemdata <- predict(estdata, ChemicalManufacturingProcess)
chemdata <- chemdata[, -nearZeroVar(chemdata)]
ch_index <- createDataPartition(chemdata$Yield, p = .8, list = FALSE)
trainx <- chemdata[ch_index, -1]
trainy <- chemdata[ch_index, 1]
testx <- chemdata[-ch_index, -1]
testy <- chemdata[-ch_index, 1]
set.seed(100)
randomForest <- train(x = trainx,
y = trainy,
method = 'rf',
tuneLength = 10,
importance = TRUE,
trControl = trainControl(method = 'cv'))
randomForestPred <- predict(randomForest, testx)
postResample(pred = randomForestPred, obs = testy)
## RMSE Rsquared MAE
## 0.5344321 0.7493185 0.4330098
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = c(5, 10, 15))
set.seed(100)
gbmModel <- train(x = trainx,
y = trainy,
method = 'gbm',
tuneGrid = gbmGrid,
trControl = trainControl(method = 'cv'),
verbose = FALSE)
gbmModelPred <- predict(gbmModel, testx)
postResample(pred = gbmModelPred, obs = testy)
## RMSE Rsquared MAE
## 0.5933345 0.6318533 0.4864452
pre_process <- c("nzv", "corr", "center","scale", "medianImpute")
ctrl <- trainControl(method = "boot", number = 25)
CubGrid <- expand.grid(committees = c(1, 5, 10, 20, 50, 100),
neighbors = c(0, 1, 3, 5, 7))
set.seed(100)
cubistmodel <- train(x = trainx, y = trainy, method = "cubist",
                     verbose = FALSE, metric = "Rsquared", tuneGrid = CubGrid,
                     trControl = ctrl, preProcess = pre_process)
cubistModelPred <- predict(cubistmodel, testx)
postResample(pred = cubistModelPred, obs = testy)
## RMSE Rsquared MAE
## 0.5251509 0.7179397 0.3579232
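For an easier side-by-side comparison (a small sketch reusing the prediction objects above):
# Stack the three test-set results into one table
rbind(
  RandomForest = postResample(randomForestPred, testy),
  GBM          = postResample(gbmModelPred, testy),
  Cubist       = postResample(cubistModelPred, testy)
)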
The random forest model is the best choice: its test RMSE (0.534) is the second lowest and very close to the Cubist model's (0.525), while its R-squared (0.749) is the highest of the three.
varImp(randomForest)
## rf variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial12 53.20
## ManufacturingProcess17 51.25
## ManufacturingProcess13 51.17
## ManufacturingProcess31 45.90
## BiologicalMaterial03 45.40
## ManufacturingProcess09 44.13
## BiologicalMaterial06 41.65
## ManufacturingProcess11 38.06
## BiologicalMaterial11 38.01
## ManufacturingProcess06 37.52
## ManufacturingProcess36 37.02
## ManufacturingProcess39 35.40
## BiologicalMaterial02 35.20
## ManufacturingProcess29 32.50
## ManufacturingProcess33 32.21
## ManufacturingProcess01 32.18
## ManufacturingProcess28 31.64
## ManufacturingProcess30 29.34
## ManufacturingProcess27 28.86
ManufacturingProcess32 is by far the most important predictor, and the manufacturing process variables dominate the top of the list, with a handful of biological materials (BiologicalMaterial12, 03, and 06) also ranking highly.
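To make the biological-versus-process split explicit, a small sketch (reusing the varImp object above; the helper names rfImp87 and top10 are mine) tallies the top 10 predictors. From the list printed above, six of the ten are manufacturing process variables and four are biological materials:
# Count process vs. biological predictors among the 10 most important variables
rfImp87 <- varImp(randomForest)$importance
top10 <- rownames(rfImp87)[order(rfImp87$Overall, decreasing = TRUE)][1:10]
sum(grepl("^Manufacturing", top10))
sum(grepl("^Biological", top10))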
library(rpart)
library(rpart.plot)
# Fit a single regression tree to the training set and plot it to see which
# predictors drive the splits.
rpart_tree <- rpart(trainy ~ ., data = trainx)
rpart.plot(rpart_tree)