\label{fig:figmain}Applied Predictive Modeling.

Applied Predictive Modeling.

Instructions

Do problems 8.1, 8.2, 8.3, and 8.7 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .rmd code.

URL: http://appliedpredictivemodeling.com/

Exercises

8.1

Recreate the simulated data from Exercise 7.2

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a)

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors ( V6 – V10 )?

Answer: Yes, see below.

First, let’s take a look at the correlations graph.

\label{fig:fig1}Plot: Simulated correlations.

Plot: Simulated correlations.

Interesting to note the existence of some correlated variables but not considered significant.

The variable importance scores were computed even though they are time consuming; importance = TRUE generated the following values.

%IncMSE IncNodePurity
V1 8.7322354 1063.5046
V2 6.4153694 941.2445
V3 0.7635918 299.2105
V4 7.6151188 1077.4011
V5 2.0235246 478.4097
V6 0.1651112 182.3768
V7 -0.0059617 204.3141
V8 -0.1663626 147.7224
V9 -0.0952927 154.3783
V10 -0.0749448 184.4402

Let’s have a visualization of the %IncMSE importance.

\label{fig:fig2}Plot: randomForest fit %IncMSE importance.

Plot: randomForest fit %IncMSE importance.

Let’s have a visualization of the %IncNodePurity importance.

\label{fig:fig3}Plot: randomForest fit %IncNodePurity importance.

Plot: randomForest fit %IncNodePurity importance.

The reason why the random Forest used the uninformative predictors ( V6 – V10 ) was to keep reducing the MSE as noted on the above table and graphs.

(b)

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
\label{fig:fig4}Plot: Highly correlated predictor added.

Plot: Highly correlated predictor added.

The correlated value in between duplicate1 and V1 is 0.9497025.

Fit another random forest model to these data.

model2 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE, 
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

The variable importance scores were computed even though they are time consuming; importance = TRUE generated the following values.

Predictor %IncMSE Modified IncNodePurity Modified %IncMSE Original IncNodePurity Original
duplicate1 NA NA 4.0427750 667.4905
V1 8.7322354 1063.5046 6.1207162 802.1288
V10 -0.0749448 184.4402 -0.0090829 155.6008
V2 6.4153694 941.2445 6.1291602 853.3309
V3 0.7635918 299.2105 0.6187432 259.6605
V4 7.6151188 1077.4011 6.8201411 915.3690
V5 2.0235246 478.4097 2.0630925 445.3001
V6 0.1651112 182.3768 0.1553443 173.5134
V7 -0.0059617 204.3141 -0.0297387 173.5496
V8 -0.1663626 147.7224 -0.0735684 133.8088
V9 -0.0952927 154.3783 -0.0303934 144.4209

As noted on the above table, it is important to note that the predictors importance changed by having a highly correlated predictor to an important predictor.

Let’s have a visualization of the %IncMSE importance.

\label{fig:b.fig2}Plot: randomForest fit %IncMSE importance.

Plot: randomForest fit %IncMSE importance.

Let’s have a visualization of the %IncNodePurity importance.

\label{fig:figb.3}Plot: randomForest fit %IncNodePurity importance.

Plot: randomForest fit %IncNodePurity importance.

Did the importance score for V1 change? Yes.

What happens when you add another predictor that is also highly correlated with V1 ?

Uninformative predictors with high correlations to informative predictors have abnormally large importance values. In some cases, their importance can be greater than or equal to weakly important variables. Another impact in between-predictor correlations is to dilute the importance of key predictors.

Theory tells us that if we have a critical predictor of importance of X. If another predictor is just as critical but is almost perfectly correlated as the first, the importance of these two predictors will be roughly X/2.

This can be seeing since in the highly correlated value, the V1 new value shows to have %IncMSE = 6.0070978 and %IncNodePurity = 741.9267 which is almost half from the first observed table in which the predictors were not correlated.

(c)

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007).

library(party)
# Calculating original simulation
model3_O <- cforest(y ~ ., data = simulated_Original,
                  controls = cforest_unbiased(ntree = 1000))

cfImp3_O <- varimp(model3_O, conditional = TRUE)

# Calculating Simulated with Correlated values
model3_S <- cforest(y ~ ., data = simulated,
                  controls = cforest_unbiased(ntree = 1000))

cfImp3_S <- varimp(model3_S, conditional = TRUE)

Let’s take a look at the comparison:

Row.names Original Simulated
duplicate1 NA 1.3013786
V1 5.5913454 3.2541200
V10 -0.0135945 -0.0104096
V2 5.3244317 4.8894377
V3 0.0352544 0.0092629
V4 6.5044757 6.2341991
V5 1.1661623 1.2562339
V6 -0.0014171 0.0136426
V7 0.0222381 0.0181762
V8 -0.0071525 -0.0148796
V9 0.0013865 0.0111486

From the above table, we can also notice how the values change considerable when a highly correlated value is added to the model.

\label{fig:figc.2}Plot: Original cforest importance fit.

Plot: Original cforest importance fit.

\label{fig:figc.2a}Plot: Simulated cforest importance fit.

Plot: Simulated cforest importance fit.

Do these importances show the same pattern as the traditional random forest model? Yes, it does show similar patterns, only part of it.

(d)

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Cubist

library(Cubist)

# Calculating Original Simulation
model4_O <- cubist(x = simulated_Original[,-11], 
                 y = simulated_Original[,11],
                 committees = 100)

cubistImp4_O <- varImp(model4_O)

# Calculating Simulated with Correlated values
model4_S <- cubist(x = simulated[,-12], 
                 y = simulated[,12],
                 committees = 100)

cubistImp4_S <- varImp(model4_S)

Let’s take a look at the below table to identify some possible changes:

Predictors Original Simulated
duplicate1 NA 0.0
V1 71.5 71.0
V10 0.0 0.0
V2 58.5 59.0
V3 47.0 46.5
V4 48.0 48.0
V5 33.0 32.5
V6 13.0 12.0
V7 0.0 0.0
V8 0.0 1.0
V9 0.0 0.0

From the above table, we notice how the values for the selected variables and order has changed in terms of importance.

Boosted trees

library(gbm)
model5_O <- gbm(y ~ ., data = simulated_Original, distribution = "gaussian")
model5_S <- gbm(y ~ ., data = simulated, distribution = "gaussian")

Let’s take a look and compare results:

Let’s take a look at the below table to identify some possible changes:

Predictors Original Simulated
duplicate1 NA 9.6873842
V1 27.7651820 19.3988095
V10 0.0000000 0.0000000
V2 21.7112939 21.9992854
V3 8.4111141 8.3701866
V4 30.5091258 29.1592214
V5 10.9236138 10.8430832
V6 0.3337437 0.5420297
V7 0.0000000 0.0000000
V8 0.0000000 0.0000000
V9 0.3459267 0.0000000

Banging

library(ipred)

# Calculating Original Simulation
model6_O <- bagging(y ~ ., data = simulated_Original)

banggingImp6_O <- varImp(model6_O)

# Calculating Simulated with Correlated values
model6_S <- bagging(y ~ ., data = simulated)

banggingImp6_S <- varImp(model6_S)

Let’s take a look at the below table to identify some possible changes:

Predictors Original Simulated
duplicate1 NA 1.3780229
V1 1.9106498 1.6802887
V10 0.7838910 0.6695382
V2 2.2263883 1.8601888
V3 1.2831348 1.0602350
V4 2.6696479 2.6727242
V5 2.3874442 2.2585687
V6 1.0001164 0.8214774
V7 0.9258563 0.9326516
V8 0.5921348 0.5508431
V9 0.6332503 0.6186524

Now, responding to the original question: Does the same pattern occur? we could answer that the same pattern (meaning the order of importance change) occur, not necessarily in the same order or with the same predictors.

8.2.

Use a simulation to show tree bias with different granularity.

Solution

From the text book:

Trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors (Loh and Shih 1997; Carolin et al. 2007; Loh 2010). Loh and Shih (1997) remarked that

“The danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have many more splits than the informative variables. Then there is a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce either a tree with misleading structure or no tree at all.”

Also, as the number of missing values increases, the selection of predictors becomes more biased (Carolin et al. 2007). It is worth noting that the variable importance scores for the solubility regression tree (Fig. 8.6) show that the model tends to rely more on continuous (i.e., less granular) predictors than the binary fingerprints. This could be due to the selection bias or the content of the variables.

Let’s create a simulated experiment in this case as follows:

Let’s create a data set as follows:

  • Home Value.
  • State (only 2 categories).
  • Square feet (random values taken as 200 possible outcomes, not related to the home Price).
# Defining granular data
set.seed(123)
State <-  rep(0:1,each=100)
# Defining relationsships
Price <- State + rnorm(200,mean=0,sd=4)
# Defining non-granular data
set.seed(231)
SFeet <- rnorm(200,mean=0,sd=2)

SData <- data.frame(Price,State,SFeet)

Example of relationship in data:

\label{fig:fig.2a}Boxplot: Simulated home values by State.

Boxplot: Simulated home values by State.

\label{fig:fig.22a}Boxplot: Simulated home values by State.

Boxplot: Simulated home values by State.

Let’s create our simulation to run 25 times using the rpart function with 1 level.

library(rpart)
# Need to Store variables with highest importance
HImportance <- c()

for (i in 1:25){

    #print(i)
    # Defining granular data
    State <-  rep(1:2,each=100)
    # Defining relationsships
    set.seed(123 + i)
    Price <- State + rnorm(200,mean=0,sd=4)
    # Defining non-granular data
    set.seed(200 + i)
    SFeet <- rnorm(200,mean=0,sd=2)
    
    # Defining simulated Data Frame
    SData <- data.frame(Price,State,SFeet)
    
    # Finding New Model
    Smodel <- rpart(Price ~ ., data = SData, control = rpart.control(maxdepth=1))
    
    # Finding importance
    Imp <- varImp(Smodel)
    
    # Find Highest Importance
    HImportance[i] <- rownames(Imp)[which.max(Imp$Overall)]
} 

Let’s see our test results from the simulation:

## HImportance
## SFeet State 
##    18     7

References

Kuhn, M. & Johnson, K. 2018. Applied Predictive Modeling. USA: Pfizer Global R&D. http://appliedpredictivemodeling.com/.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.