Applied Predictive Modeling.
Instructions
Do problems 8.1, 8.2, 8.3
, and 8.7
in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .rmd code.
Exercises
8.1
Recreate the simulated data from Exercise 7.2
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a)
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative predictors ( V6 – V10
)?
Answer: Yes, see below.
First, let’s take a look at the correlations graph.
Plot: Simulated correlations.
Interesting to note the existence of some correlated variables but not considered significant.
The variable importance
scores were computed even though they are time consuming; importance = TRUE
generated the following values.
%IncMSE | IncNodePurity | |
---|---|---|
V1 | 8.7322354 | 1063.5046 |
V2 | 6.4153694 | 941.2445 |
V3 | 0.7635918 | 299.2105 |
V4 | 7.6151188 | 1077.4011 |
V5 | 2.0235246 | 478.4097 |
V6 | 0.1651112 | 182.3768 |
V7 | -0.0059617 | 204.3141 |
V8 | -0.1663626 | 147.7224 |
V9 | -0.0952927 | 154.3783 |
V10 | -0.0749448 | 184.4402 |
Let’s have a visualization of the %IncMSE importance.
Plot: randomForest fit %IncMSE importance.
Let’s have a visualization of the %IncNodePurity importance.
Plot: randomForest fit %IncNodePurity importance.
The reason why the random Forest used the uninformative predictors ( V6 – V10
) was to keep reducing the MSE as noted on the above table and graphs.
(b)
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
Plot: Highly correlated predictor added.
The correlated value in between duplicate1
and V1
is 0.9497025.
Fit another random forest model to these data.
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
The variable importance
scores were computed even though they are time consuming; importance = TRUE
generated the following values.
Predictor | %IncMSE Modified | IncNodePurity Modified | %IncMSE Original | IncNodePurity Original |
---|---|---|---|---|
duplicate1 | NA | NA | 4.0427750 | 667.4905 |
V1 | 8.7322354 | 1063.5046 | 6.1207162 | 802.1288 |
V10 | -0.0749448 | 184.4402 | -0.0090829 | 155.6008 |
V2 | 6.4153694 | 941.2445 | 6.1291602 | 853.3309 |
V3 | 0.7635918 | 299.2105 | 0.6187432 | 259.6605 |
V4 | 7.6151188 | 1077.4011 | 6.8201411 | 915.3690 |
V5 | 2.0235246 | 478.4097 | 2.0630925 | 445.3001 |
V6 | 0.1651112 | 182.3768 | 0.1553443 | 173.5134 |
V7 | -0.0059617 | 204.3141 | -0.0297387 | 173.5496 |
V8 | -0.1663626 | 147.7224 | -0.0735684 | 133.8088 |
V9 | -0.0952927 | 154.3783 | -0.0303934 | 144.4209 |
As noted on the above table, it is important to note that the predictors importance changed by having a highly correlated predictor to an important predictor.
Let’s have a visualization of the %IncMSE importance.
Plot: randomForest fit %IncMSE importance.
Let’s have a visualization of the %IncNodePurity importance.
Plot: randomForest fit %IncNodePurity importance.
Did the importance score for V1
change? Yes.
What happens when you add another predictor that is also highly correlated with V1
?
Uninformative predictors with high correlations to informative predictors have abnormally large importance values. In some cases, their importance can be greater than or equal to weakly important variables. Another impact in between-predictor correlations is to dilute the importance of key predictors.
Theory tells us that if we have a critical predictor of importance of X
. If another predictor is just as critical but is almost perfectly correlated as the first, the importance of these two predictors will be roughly X/2.
This can be seeing since in the highly correlated value, the V1
new value shows to have %IncMSE = 6.0070978
and %IncNodePurity = 741.9267
which is almost half from the first observed table in which the predictors were not correlated.
(c)
Use the cforest
function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007).
library(party)
# Calculating original simulation
model3_O <- cforest(y ~ ., data = simulated_Original,
controls = cforest_unbiased(ntree = 1000))
cfImp3_O <- varimp(model3_O, conditional = TRUE)
# Calculating Simulated with Correlated values
model3_S <- cforest(y ~ ., data = simulated,
controls = cforest_unbiased(ntree = 1000))
cfImp3_S <- varimp(model3_S, conditional = TRUE)
Let’s take a look at the comparison:
Row.names | Original | Simulated |
---|---|---|
duplicate1 | NA | 1.3013786 |
V1 | 5.5913454 | 3.2541200 |
V10 | -0.0135945 | -0.0104096 |
V2 | 5.3244317 | 4.8894377 |
V3 | 0.0352544 | 0.0092629 |
V4 | 6.5044757 | 6.2341991 |
V5 | 1.1661623 | 1.2562339 |
V6 | -0.0014171 | 0.0136426 |
V7 | 0.0222381 | 0.0181762 |
V8 | -0.0071525 | -0.0148796 |
V9 | 0.0013865 | 0.0111486 |
From the above table, we can also notice how the values change considerable when a highly correlated value is added to the model.
Plot: Original cforest importance fit.
Plot: Simulated cforest importance fit.
Do these importances show the same pattern as the traditional random forest model? Yes, it does show similar patterns, only part of it.
(d)
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Cubist
library(Cubist)
# Calculating Original Simulation
model4_O <- cubist(x = simulated_Original[,-11],
y = simulated_Original[,11],
committees = 100)
cubistImp4_O <- varImp(model4_O)
# Calculating Simulated with Correlated values
model4_S <- cubist(x = simulated[,-12],
y = simulated[,12],
committees = 100)
cubistImp4_S <- varImp(model4_S)
Let’s take a look at the below table to identify some possible changes:
Predictors | Original | Simulated |
---|---|---|
duplicate1 | NA | 0.0 |
V1 | 71.5 | 71.0 |
V10 | 0.0 | 0.0 |
V2 | 58.5 | 59.0 |
V3 | 47.0 | 46.5 |
V4 | 48.0 | 48.0 |
V5 | 33.0 | 32.5 |
V6 | 13.0 | 12.0 |
V7 | 0.0 | 0.0 |
V8 | 0.0 | 1.0 |
V9 | 0.0 | 0.0 |
From the above table, we notice how the values for the selected variables and order has changed in terms of importance.
Boosted trees
library(gbm)
model5_O <- gbm(y ~ ., data = simulated_Original, distribution = "gaussian")
model5_S <- gbm(y ~ ., data = simulated, distribution = "gaussian")
Let’s take a look and compare results:
Let’s take a look at the below table to identify some possible changes:
Predictors | Original | Simulated |
---|---|---|
duplicate1 | NA | 9.6873842 |
V1 | 27.7651820 | 19.3988095 |
V10 | 0.0000000 | 0.0000000 |
V2 | 21.7112939 | 21.9992854 |
V3 | 8.4111141 | 8.3701866 |
V4 | 30.5091258 | 29.1592214 |
V5 | 10.9236138 | 10.8430832 |
V6 | 0.3337437 | 0.5420297 |
V7 | 0.0000000 | 0.0000000 |
V8 | 0.0000000 | 0.0000000 |
V9 | 0.3459267 | 0.0000000 |
Banging
library(ipred)
# Calculating Original Simulation
model6_O <- bagging(y ~ ., data = simulated_Original)
banggingImp6_O <- varImp(model6_O)
# Calculating Simulated with Correlated values
model6_S <- bagging(y ~ ., data = simulated)
banggingImp6_S <- varImp(model6_S)
Let’s take a look at the below table to identify some possible changes:
Predictors | Original | Simulated |
---|---|---|
duplicate1 | NA | 1.3780229 |
V1 | 1.9106498 | 1.6802887 |
V10 | 0.7838910 | 0.6695382 |
V2 | 2.2263883 | 1.8601888 |
V3 | 1.2831348 | 1.0602350 |
V4 | 2.6696479 | 2.6727242 |
V5 | 2.3874442 | 2.2585687 |
V6 | 1.0001164 | 0.8214774 |
V7 | 0.9258563 | 0.9326516 |
V8 | 0.5921348 | 0.5508431 |
V9 | 0.6332503 | 0.6186524 |
Now, responding to the original question: Does the same pattern occur? we could answer that the same pattern (meaning the order of importance change) occur, not necessarily in the same order or with the same predictors.
8.2.
Use a simulation to show tree bias with different granularity.
Solution
From the text book:
Trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors (Loh and Shih 1997; Carolin et al. 2007; Loh 2010). Loh and Shih (1997) remarked that
“The danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have many more splits than the informative variables. Then there is a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce either a tree with misleading structure or no tree at all.”
Also, as the number of missing values increases, the selection of predictors becomes more biased (Carolin et al. 2007). It is worth noting that the variable importance scores for the solubility regression tree (Fig. 8.6) show that the model tends to rely more on continuous (i.e., less granular) predictors than the binary fingerprints. This could be due to the selection bias or the content of the variables.
Let’s create a simulated experiment in this case as follows:
Let’s create a data set as follows:
- Home Value.
- State (only 2 categories).
- Square feet (random values taken as 200 possible outcomes, not related to the home Price).
# Defining granular data
set.seed(123)
State <- rep(0:1,each=100)
# Defining relationsships
Price <- State + rnorm(200,mean=0,sd=4)
# Defining non-granular data
set.seed(231)
SFeet <- rnorm(200,mean=0,sd=2)
SData <- data.frame(Price,State,SFeet)
Example of relationship in data:
Boxplot: Simulated home values by State.
Boxplot: Simulated home values by State.
Let’s create our simulation to run 25 times using the rpart
function with 1 level.
library(rpart)
# Need to Store variables with highest importance
HImportance <- c()
for (i in 1:25){
#print(i)
# Defining granular data
State <- rep(1:2,each=100)
# Defining relationsships
set.seed(123 + i)
Price <- State + rnorm(200,mean=0,sd=4)
# Defining non-granular data
set.seed(200 + i)
SFeet <- rnorm(200,mean=0,sd=2)
# Defining simulated Data Frame
SData <- data.frame(Price,State,SFeet)
# Finding New Model
Smodel <- rpart(Price ~ ., data = SData, control = rpart.control(maxdepth=1))
# Finding importance
Imp <- varImp(Smodel)
# Find Highest Importance
HImportance[i] <- rownames(Imp)[which.max(Imp$Overall)]
}
Let’s see our test results from the simulation:
## HImportance
## SFeet State
## 18 7
References
Kuhn, M. & Johnson, K. 2018. Applied Predictive Modeling. USA: Pfizer Global R&D. http://appliedpredictivemodeling.com/.
R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.