Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson.
Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
Did the random forest model significantly use the uninformative predictors (V6–V10)? ANS: No. Based on the importance scores and the varImpPlot below, the model did not make significant use of V6–V10; their importance values are near zero (and negative for some).
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = TRUE)
rfImp1
## Overall
## V1 57.4506930
## V2 46.0366873
## V3 9.8217121
## V4 52.7991593
## V5 22.2954807
## V6 3.2482485
## V7 2.7239894
## V8 -0.6437884
## V9 -0.6204323
## V10 -1.5041925
varImpPlot(model1)
Now add an additional predictor that is highly correlated with one of the informative predictors, for example a noisy copy of V1. Fit another random forest model to these data. Did the importance score for V1 change? ANS: Yes. With a predictor highly correlated with V1 in the model, the importance score for V1 dropped to roughly half of its original value (from about 57 to about 34), with the duplicate absorbing much of the remaining importance.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = TRUE)
rfImp2
## Overall
## V1 33.7040129
## V2 45.4568836
## V3 9.5329892
## V4 51.7654739
## V5 23.2785698
## V6 1.6701722
## V7 -0.4936772
## V8 -2.2341346
## V9 -1.9061987
## V10 -0.1393049
## duplicate1 24.7043635
varImpPlot(model2)
What happens when you add another predictor that is also highly correlated with V1? ANS: The importance of V1 is reduced again, and the importance is now shared across V1 and the two correlated duplicates.
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9312569
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = TRUE)
rfImp3
## Overall
## V1 30.4865414
## V2 50.7443596
## V3 10.7630613
## V4 56.1630122
## V5 25.4808517
## V6 3.1393385
## V7 2.0180717
## V8 -2.4439115
## V9 -0.1811037
## V10 0.4243420
## duplicate1 20.2813359
## duplicate2 19.5826330
varImpPlot(model3)
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
set.seed(200)
cf_sim <- cforest(y ~ ., data = simulated)
varimp(cf_sim)                      ## traditional (unconditional) importance
varimp(cf_sim, conditional = TRUE)  ## conditional importance (Strobl et al. 2007)
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur? ANS: Broadly yes. In both the boosted tree and the Cubist models the informative predictors (V1, V2, V4, V5) account for nearly all of the importance, while the uninformative predictors V6–V10 contribute little or nothing. Cubist, unlike the random forest, does not split its importance with the correlated duplicates; it simply does not use them.
library(gbm)
## Warning: package 'gbm' was built under R version 3.4.4
## Loaded gbm 2.1.5
set.seed(200)
gbmsim <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmsim)
## var rel.inf
## V4 V4 31.6854828
## V2 V2 22.3217552
## V1 V1 20.7886787
## V5 V5 11.9510758
## V3 V3 7.5762062
## duplicate1 duplicate1 4.7661355
## duplicate2 duplicate2 0.4340805
## V6 V6 0.3452699
## V8 V8 0.1313155
## V7 V7 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
library(Cubist)
## Warning: package 'Cubist' was built under R version 3.4.4
set.seed(200)
cub_sim <- cubist(simulated[,-11], simulated$y)
cub_sim$usage
## Conditions Model Variable
## 1 0 100 V1
## 2 0 100 V2
## 3 0 100 V4
## 4 0 100 V5
## 5 0 0 V3
## 6 0 0 V6
## 7 0 0 V7
## 8 0 0 V8
## 9 0 0 V9
## 10 0 0 V10
## 11 0 0 duplicate1
## 12 0 0 duplicate2
Use a simulation to show tree bias with different granularities.
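The original write-up did not include a simulation for this part. A minimal sketch of one way to demonstrate the bias is below: it repeatedly fits a one-split regression tree to a response that is pure noise, with predictors that differ only in how many distinct values they take, and tallies which predictor rpart chooses for the split. The object and variable names (first_split_var, x2, x10, xcont) are illustrative, not from the text.
library(rpart)
set.seed(200)
## Repeatedly fit a single-split tree to pure noise and record which predictor
## is chosen: predictors with more distinct values tend to be selected more often.
first_split_var <- replicate(200, {
n <- 100
dat <- data.frame(
x2    = sample(1:2,  n, replace = TRUE),   ## 2 distinct values
x10   = sample(1:10, n, replace = TRUE),   ## 10 distinct values
xcont = rnorm(n),                          ## effectively continuous
y     = rnorm(n)                           ## response unrelated to all predictors
)
fit <- rpart(y ~ ., data = dat, control = rpart.control(cp = 0, maxdepth = 1))
as.character(fit$frame$var[1])             ## variable used at the root split
})
table(first_split_var)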
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
ANS: From the book: the importance profile for boosting has a much steeper importance slope than the one for random forests. This is because the trees from boosting are dependent on one another and hence have correlated structures as the method follows the gradient, so many of the same predictors are selected across the trees, increasing their contribution to the importance metric. The higher the learning rate and bagging fraction, the more pronounced this behavior becomes, which is why the right-hand model (both parameters set to 0.9) concentrates its importance on just a few predictors. Reducing these two values allows more predictors to contribute importance to the model, as in the left-hand plot.
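The solubility data used for Fig. 8.24 are not loaded in this write-up, so the following sketch only illustrates the effect on the simulated data from above. It fits gbm with the two extreme settings (bagging fraction and shrinkage both 0.1 versus both 0.9); the object names and the lowered n.minobsinnode are choices made for this illustration, not a reproduction of the figure.
library(gbm)
set.seed(200)
## Both bagging fraction and learning rate (shrinkage) set to 0.1;
## n.minobsinnode is lowered so the small bags (200 * 0.1 = 20 rows) can still split
gbm_low <- gbm(y ~ ., data = simulated, distribution = "gaussian",
n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
## Both set to 0.9
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)
## Relative influence: the 0.9 / 0.9 fit should concentrate on fewer predictors
summary(gbm_low, plotit = FALSE)
summary(gbm_high, plotit = FALSE)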
Which model do you think would be more predictive of other samples? ANS: The model on the left, with the bagging fraction and learning rate both set to 0.1, would likely be more predictive of other samples. Its smaller learning rate and bagging fraction make the boosting process less aggressive and spread the contribution across more predictors, reducing the chance of overfitting, whereas the right-hand model relies heavily on just a few predictors and is more likely to overfit the training data.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
ANS: Increasing the interaction depth allows more predictors to enter each tree, which would spread the importance across more predictors and flatten the slope of predictor importance for both models; the flattening would be more noticeable for the model on the left than for the model on the right.
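As a rough check of this intuition, again on the simulated data rather than the solubility data, one could compare the relative influence from stumps against deeper trees; the interaction depths used below are arbitrary illustrative values.
library(gbm)
set.seed(200)
## Stumps (interaction.depth = 1) versus deeper trees (interaction.depth = 7)
gbm_depth1 <- gbm(y ~ ., data = simulated, distribution = "gaussian",
n.trees = 100, shrinkage = 0.1, interaction.depth = 1)
gbm_depth7 <- gbm(y ~ ., data = simulated, distribution = "gaussian",
n.trees = 100, shrinkage = 0.1, interaction.depth = 7)
## Deeper trees tend to spread relative influence over more predictors,
## flattening the importance slope
summary(gbm_depth1, plotit = FALSE)
summary(gbm_depth7, plotit = FALSE)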
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Which tree-based regression model gives the optimal resampling and test set performance? ANS: The Cubist model gives the optimal resampling and test set performance.
library(ggplot2)
library(magrittr)
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 3.4.4
library(caret)
data("ChemicalManufacturingProcess")
preP <- preProcess(ChemicalManufacturingProcess,
method = c("BoxCox", "knnImpute", "center", "scale"))
df <- predict(preP, ChemicalManufacturingProcess)
## Restore the response variable values to original
df$Yield = ChemicalManufacturingProcess$Yield
## Split the data into a training and a test set
trainRows <- createDataPartition(df$Yield, p = .80, list = FALSE)
df.train <- df[trainRows, ]
df.test <- df[-trainRows, ]
colYield <- which(colnames(df) == "Yield")
trainX <- df.train[, -colYield]
trainY <- df.train$Yield
testX <- df.test[, -colYield]
testY <- df.test$Yield
## Single Tree Models
## Model 1 tunes over the complexity parameter
st1Model <- train(trainX, trainY,
method = "rpart",
tuneLength = 10,
trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
st1Model.train.pred <- predict(st1Model)
st1Model.test.pred <- predict(st1Model, newdata = testX)
## Model 2 tunes over the maximum depth
st2Model <- train(trainX, trainY,
method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv"))
st2Model.train.pred <- predict(st2Model)
st2Model.test.pred <- predict(st2Model, newdata = testX)
## Random Forest Model
library(randomForest)
library(caret)
rfModel <- randomForest(trainY ~ .,
data = trainX,
importance = TRUE,
ntree = 1000)
rfModel.train.pred <- predict(rfModel)
rfModel.test.pred <- predict(rfModel, newdata = testX)
## Boosted Trees
library(gbm)
gbmModel <- gbm(trainY ~ ., data = trainX, distribution = "gaussian")
gbmModel.train.pred <- predict(gbmModel, newdata = trainX, n.trees = 100)
gbmModel.test.pred <- predict(gbmModel, newdata = testX, n.trees = 100)
## Cubist Model
cubistModel <- train(trainX,
trainY,
method = "cubist")
cubistModel.train.pred <- predict(cubistModel)
cubistModel.test.pred <- predict(cubistModel, newdata = testX)
rbind(
"st1Model" = postResample(pred = st1Model.train.pred, obs = trainY),
"st2Model" = postResample(pred = st2Model.train.pred, obs = trainY),
"rforest" = postResample(pred = rfModel.train.pred, obs = trainY),
"boosted" = postResample(pred = gbmModel.train.pred, obs = trainY),
"cubist" = postResample(pred = cubistModel.train.pred, obs = trainY)
)
## RMSE Rsquared MAE
## st1Model 1.4151050 0.3953993 1.1345621
## st2Model 1.4151050 0.3953993 1.1345621
## rforest 1.1233204 0.6410073 0.8803285
## boosted 0.8003022 0.8164133 0.6231775
## cubist 0.2005720 0.9896783 0.1565198
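The table above is computed on the training set (out-of-bag predictions in the case of the random forest). Since the test-set predictions were already generated for each model, a quick way to also compare held-out performance, reusing those objects, might look like this:
## Held-out test set performance for the same five models
rbind(
"st1Model" = postResample(pred = st1Model.test.pred, obs = testY),
"st2Model" = postResample(pred = st2Model.test.pred, obs = testY),
"rforest" = postResample(pred = rfModel.test.pred, obs = testY),
"boosted" = postResample(pred = gbmModel.test.pred, obs = testY),
"cubist" = postResample(pred = cubistModel.test.pred, obs = testY)
)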
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models? ANS: ManufacturingProcess32 is by far the most important predictor, and the process variables dominate the list: eight of the top 10 predictors are manufacturing process variables.
varImpSorted <- function(dfVarImp) {
varImpRows <- order(abs(dfVarImp$Overall), decreasing = TRUE)
dfResult <- data.frame(dfVarImp[varImpRows, 1],
row.names = rownames(dfVarImp)[varImpRows])
colnames(dfResult) <- colnames(dfVarImp)
return(dfResult)
}
mdlVarImp <- varImp(cubistModel)
plot(mdlVarImp)
# Top 10 Predictors
topVarImp <- varImpSorted(mdlVarImp$importance)
head(topVarImp, 10)
## Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess09 60.90909
## BiologicalMaterial06 46.36364
## ManufacturingProcess04 45.45455
## ManufacturingProcess17 40.90909
## ManufacturingProcess33 38.18182
## BiologicalMaterial02 34.54545
## ManufacturingProcess13 32.72727
## ManufacturingProcess39 30.90909
## ManufacturingProcess29 22.72727
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
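No plot was included in the original write-up for this part. Assuming the rpart fit stored in st1Model above is taken as the optimal single tree, one possible sketch converts it to a party object, whose default plot for a regression tree shows the distribution of Yield as a boxplot in each terminal node; the object name st1Tree is an illustrative choice.
library(partykit)
## Convert the caret/rpart final model to a party object; for a regression
## tree, plot() draws a boxplot of the response (Yield) in each terminal node
st1Tree <- as.party(st1Model$finalModel)
plot(st1Tree)
Nodes with clearly higher or lower yield distributions would indicate which split variables, biological or process, most cleanly separate high-yield from low-yield batches.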