Data 624 Assignment 9: APM Chapter 8

library(gt)
library(mlbench)
library(caret)
library(skimr)
library(AppliedPredictiveModeling)
library(rpart)
library(tidyverse)
library(tidymodels)
library(vip)
library(ggthemes)
library(randomForest)
library(gbm)
library(party)
library(Cubist)

Part A.

Create the data

set.seed(200)

simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Fit a random forest model to all the predictors

m1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE, ntree = 100)

Estimate the variable importance scores:

rfImp <- varImp(m1, scale = FALSE)

level <- rownames(rfImp)

rfImp %>% rownames_to_column(var="Variable") %>%
  mutate(Variable = factor(Variable, levels=level, ordered=TRUE)) %>%
  ggplot(aes(x=Overall, y=factor(Variable))) + geom_col(fill="steelblue") +
  theme_fivethirtyeight() + 
  labs(title="Importance", subtitle="Model: Random Forest",
       y="Variable", x="Importance")

Did the random forest model significantly use the uninformative predictors (V6-V10)?

No. Variables V6-V10 had a negligible impact on the model; their importance scores were very low.

Part B.

Add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)

## [1] 0.9402881
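
As a rough sanity check on the correlation above, the value can be compared with what theory predicts. This is only a sketch: it assumes V1 is uniform on [0, 1] (as documented for `mlbench.friedman1`) and that the added noise is independent of V1. The variable names below are illustrative and not part of the original analysis.

```r
# Theoretical correlation between V1 and V1 + 0.1 * Z, assuming
# V1 ~ Uniform(0, 1) and Z ~ N(0, 1) are independent.
var_v1    <- 1 / 12     # variance of a Uniform(0, 1) variable
var_noise <- 0.1^2      # variance of the added noise term
theoretical_cor <- sqrt(var_v1 / (var_v1 + var_noise))
theoretical_cor

# Empirical check on a fresh, larger sample (illustrative names only)
set.seed(1)
v1  <- runif(1e5)
dup <- v1 + rnorm(1e5) * 0.1
empirical_cor <- cor(v1, dup)
empirical_cor
```

The theoretical value is roughly 0.945, so the observed 0.94 from the 200 simulated rows is in line with expectation.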

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

m2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 100)

rfImp <- varImp(m2, scale = FALSE)

level <- rownames(rfImp)

rfImp %>% rownames_to_column(var="Variable") %>%
  mutate(Variable = factor(Variable, levels=level, ordered=TRUE)) %>%
  ggplot(aes(x=Overall, y=factor(Variable))) + geom_col(fill="steelblue") +
  theme_fivethirtyeight() + 
  labs(title="Importance", subtitle="Model: Random Forest M2",
       y="Variable", x="Importance")

Add another duplicate and refit

simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
m3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE, ntree = 100)

rfImp <- varImp(m3, scale = FALSE)

level <- rownames(rfImp)

rfImp %>% rownames_to_column(var="Variable") %>%
  mutate(Variable = factor(Variable, levels=level, ordered=TRUE)) %>%
  ggplot(aes(x=Overall, y=factor(Variable))) + geom_col(fill="steelblue") +
  theme_fivethirtyeight() +
  labs(title="Importance", subtitle="Model: Random Forest M3",
       y="Variable", x="Overall Importance")

Adding a second correlated predictor reduces the importance of V1 as well as the importance of the first correlated predictor. The importance of V3 also appears to have dropped slightly.
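
One way to look at this dilution numerically is to compare V1's own score with the combined score of V1 and its near-copies. The sketch below is an assumption-laden illustration, not the original analysis: it rebuilds the data from scratch, uses 500 trees instead of 100 for somewhat more stable scores, and reads the unscaled permutation importance (%IncMSE) directly from `randomForest::importance`. The helper name `rf_importance` is hypothetical.

```r
library(mlbench)
library(randomForest)

set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
sim_df <- as.data.frame(cbind(sim$x, sim$y))
colnames(sim_df)[ncol(sim_df)] <- "y"

# Hypothetical helper: unscaled permutation importance (%IncMSE) per predictor
rf_importance <- function(data) {
  fit <- randomForest(y ~ ., data = data, importance = TRUE, ntree = 500)
  importance(fit, type = 1, scale = FALSE)[, 1]
}

imp_base <- rf_importance(sim_df)

# Add two noisy copies of V1 and refit
sim_dup <- sim_df
sim_dup$duplicate1 <- sim_dup$V1 + rnorm(200) * 0.1
sim_dup$duplicate2 <- sim_dup$V1 + rnorm(200) * 0.1
imp_dup <- rf_importance(sim_dup)

v1_group <- c("V1", "duplicate1", "duplicate2")
comparison <- data.frame(
  model          = c("original", "with duplicates"),
  V1_alone       = unname(c(imp_base["V1"], imp_dup["V1"])),
  V1_group_total = unname(c(imp_base["V1"], sum(imp_dup[v1_group])))
)
comparison
```

If the importance is being shared among the correlated predictors, one would expect `V1_alone` to drop in the second row while `V1_group_total` stays closer to the original value, although the exact numbers vary with the seed.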

Part C.

Use the `cforest` function in the `party` package to fit a random forest model using conditional inference trees. The party package function `varimp` can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

m1_c <- cforest(y ~ ., data=simulated)

conImp <- varimp(m1_c, conditional=TRUE)
conImp <- as.data.frame(conImp)
names(conImp) <- "Overall"

conImp %>% rownames_to_column(var="Variable") %>%
  mutate(Variable = factor(Variable, levels=level, ordered=TRUE)) %>%
  ggplot(aes(x=Overall, y=factor(Variable))) + geom_col(fill="steelblue") +
  theme_fivethirtyeight() + 
  labs(title="Importance (Conditional)",subtitle="Model: Random Forest M1_C", y="Variable", x="Importance (Conditional)")

The conditional inference forest looks like a more extreme version of M3 above: V1, V3, duplicate1, and duplicate2 are all less important than in the previous plot in Part B.
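
Because the exercise asks how the `conditional` argument changes the picture, it can help to put the traditional and conditional scores side by side for the same fitted forest. The following is a minimal sketch under stated assumptions: it rebuilds the simulated data with a single duplicate, and it uses a deliberately small forest (`ntree = 50`, `mtry = 3`, both chosen here only to keep the conditional computation fast).

```r
library(mlbench)
library(party)

set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
sim_df <- as.data.frame(cbind(sim$x, sim$y))
colnames(sim_df)[ncol(sim_df)] <- "y"
sim_df$duplicate1 <- sim_df$V1 + rnorm(200) * 0.1

# Small conditional inference forest (settings are illustrative assumptions)
cf_fit <- cforest(y ~ ., data = sim_df,
                  controls = cforest_unbiased(ntree = 50, mtry = 3))

# Traditional vs. conditional permutation importance for the same forest
imp_compare <- data.frame(
  traditional = varimp(cf_fit, conditional = FALSE),
  conditional = varimp(cf_fit, conditional = TRUE)
)
imp_compare[order(-imp_compare$traditional), ]
```

Comparing the two columns row by row shows which predictors lose the most importance once correlation with the other predictors is accounted for.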

Part D.

CUBIST Model

set.seed(200)
dep <- simulated %>% select(y)
ind <- simulated %>% select(-y)
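# Tuning grid over committees and neighbors
param.cubist <- expand.grid(committees = seq(1, 10, by = 1),
                            neighbors = seq(1, 9, by = 2))

# 10-fold cross-validation (`number` spelled out rather than partially matched from `n`)
ctrl.cubist <- trainControl(method = "cv", number = 10)

m1_cubist <- train(x = ind, y = dep$y, method = "cubist",
                   trControl = ctrl.cubist, tuneGrid = param.cubist,
                   verbose = FALSE)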

m1_cubist$bestTune

##    committees neighbors
## 25          5         9

rfImp <- varImp(m1_cubist$finalModel, scale = FALSE)

level <- rownames(rfImp)

rfImp %>% rownames_to_column(var="Variable") %>%
  mutate(Variable = factor(Variable, levels=level, ordered=TRUE)) %>%
  ggplot(aes(x=Overall, y=factor(Variable))) + geom_col(fill="steelblue") +
  theme_fivethirtyeight() +
  labs(title="Importance", subtitle="Model: m1_cubist",
       y="Variable", x="Importance")

The Cubist model seems to have ignored the unimportant variables and was not as adversely affected by the two correlated variables. Well played.

GBM Model

set.seed(200)

# Set our range of tuning parameters
param.gbm <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                         n.trees = seq(100, 1000, by = 50),
                         shrinkage = seq(0.01, 0.1, by = 0.01),
                         n.minobsinnode = 10)
ctrl.gbm <- trainControl(method = "cv", number = 10)
m1_gb <- train(x = ind, y = dep$y, method = "gbm",
               trControl = ctrl.gbm, tuneGrid = param.gbm, verbose = FALSE)
m1_gb$bestTune

##     n.trees interaction.depth shrinkage n.minobsinnode
## 245     900                 1      0.04             10

rfImp <- varImp(m1_gb$finalModel, scale = FALSE)

level <- rownames(rfImp)

rfImp %>% rownames_to_column(var="Variable") %>%
  mutate(Variable = factor(Variable, levels=level, ordered=TRUE)) %>%
  ggplot(aes(x=Overall, y=factor(Variable))) + geom_col(fill="steelblue") +
  theme_fivethirtyeight() +
  labs(title="Importance", subtitle="Model: m1_gb",
       y="Variable", x="Importance")

The GB model was similar to the M3 Random Forest, but Variable 3 seems to be more important in the GB model compared to the RF model.

Of all the models, the Cubist model appears to be the best: it ignored the unimportant variables and also handled the correlated variables much better.

Use a simulation to show tree bias with different granularities.

Tree models are known to suffer from selection bias. Predictors with more distinct values are favored over predictors with fewer distinct values.

set.seed(200)
p1 <- sample(0:10000 / 10000, 2000, replace = TRUE)
p2 <- sample(0:1000 / 1000, 2000, replace = TRUE)
p3 <- sample(0:100 / 100, 2000, replace = TRUE)
p4 <- sample(0:10 / 10, 2000, replace = TRUE)

y <- p4 + p3 + p2 + p1

df <- data.frame(p1, p2, p3, p4, y)
skim(df)

Data summary
Name	df
Number of rows	2000
Number of columns	5
_______________________
Column type frequency:
numeric	5
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
p1	1	0.50	0.29	0.00	0.25	0.50	0.76	1.00	▇▇▇▇▇
p2	1	0.49	0.29	0.00	0.24	0.50	0.72	1.00	▇▇▇▇▇
p3	1	0.51	0.29	0.00	0.25	0.51	0.77	1.00	▇▇▇▇▇
p4	1	0.50	0.32	0.00	0.20	0.50	0.80	1.00	▇▆▅▆▅
y	1	2.00	0.60	0.23	1.59	2.00	2.41	3.75	▁▅▇▅▁

rpartTree <- rpart(y ~ ., data=df)

varImp(rpartTree)

##     Overall
## p1 4.958308
## p2 2.982601
## p3 2.450068
## p4 1.637977

Simulation Observation:

The simulation above demonstrates the tendency of tree models to show selection bias toward predictors with more distinct values. The importance scores were ranked from the predictor with the most distinct values (p1) down to the predictor with the fewest (p4). The bias was weaker when a fifth “error term” of the form rnorm(2000) was added to the equation.
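
A complementary way to isolate the bias is to make the response pure noise, so that no predictor deserves to be chosen, and then count which predictor offers the best first split. The sketch below is an assumption, not the method used above: it swaps `rpart` for a hand-rolled exhaustive search over single splits, and the function name `best_split_gain` and the simulation sizes are made up for illustration.

```r
# Largest reduction in SSE achievable by one binary split on x
best_split_gain <- function(x, y) {
  ord <- order(x)
  x <- x[ord]
  y <- y[ord]
  n <- length(y)
  left_n    <- seq_len(n - 1)
  left_sum  <- cumsum(y)[left_n]
  right_sum <- sum(y) - left_sum
  # SSE reduction for a cut after each sorted position
  gain <- left_sum^2 / left_n + right_sum^2 / (n - left_n) - sum(y)^2 / n
  valid <- x[left_n] < x[left_n + 1]   # cuts are only possible between distinct values
  if (!any(valid)) return(0)
  max(gain[valid])
}

set.seed(200)
n_obs <- 200
n_sim <- 500
grid_size <- c(p1 = 10000, p2 = 1000, p3 = 100, p4 = 10)

winners <- replicate(n_sim, {
  preds <- lapply(grid_size, function(k) sample(0:k / k, n_obs, replace = TRUE))
  y <- rnorm(n_obs)   # pure noise: no predictor is truly informative
  gains <- vapply(preds, best_split_gain, numeric(1), y = y)
  names(which.max(gains))
})

# Share of simulations in which each predictor gave the best first split
win_share <- table(factor(winners, levels = names(grid_size))) / n_sim
win_share
```

With no real signal, an unbiased procedure would pick each predictor about a quarter of the time. The coarse predictor p4 has at most ten candidate cut points, so it should win noticeably less often than the finer-grained predictors.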

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

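[Figure 8.24: variable importance plots for boosting on the solubility data. Left: bagging fraction and learning rate both 0.1. Right: both 0.9.]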

Part A

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

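The model on the left has a learning rate and bagging fraction of 0.1, so each tree makes only a small correction and is fit on a different small subsample of the data; importance therefore gets spread out over more predictors. The model on the right uses 0.9 for both, so the first few trees account for most of the fit and nearly every tree sees the same data; the higher learning rate focuses the importance on a smaller set of variables.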
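
The same comparison can be tried on simulated data. The following is a minimal sketch and not the solubility analysis behind the figure: it assumes a hand-built Friedman-style data set with 500 rows, fixes the number of trees at 100 and the interaction depth at 1, and sets the learning rate and bagging fraction to the same value as in the figure. The helper name `influence_shares` is hypothetical.

```r
library(gbm)

set.seed(200)
n <- 500
sim <- as.data.frame(matrix(runif(n * 10), ncol = 10))   # columns V1..V10
sim$y <- 10 * sin(pi * sim$V1 * sim$V2) + 20 * (sim$V3 - 0.5)^2 +
  10 * sim$V4 + 5 * sim$V5 + rnorm(n)

# Hypothetical helper: each predictor's share of total relative influence,
# with learning rate and bagging fraction both set to `rate`
influence_shares <- function(rate) {
  fit <- gbm(y ~ ., data = sim, distribution = "gaussian",
             n.trees = 100, interaction.depth = 1,
             shrinkage = rate, bag.fraction = rate)
  infl <- relative.influence(fit, n.trees = 100)
  sort(infl / sum(infl), decreasing = TRUE)
}

shares_low  <- influence_shares(0.1)
shares_high <- influence_shares(0.9)
round(shares_low, 3)
round(shares_high, 3)
```

Comparing how quickly the sorted shares fall off in the two vectors gives a numeric counterpart to the slope of the importance plots. Whether the pattern from the figure shows up as clearly depends on the data and the settings chosen here.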

Part B

Which model do you think would be more predictive of other samples?

The model on the left should do better. It is more likely to generalize, while the one on the right is more likely to overfit the training data. With a small learning rate, the ensemble is built from many weak learners that each contribute only a little, which tends to be more robust.

Part C

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth will spread the importance out more: each tree can grow deeper, which gives more predictors a chance to be used in the splitting process. Increasing the depth therefore reduces the slope of the importance plot.