Applied Predictive Modeling.

Instructions

Do problems 8.1, 8.2, 8.3, and 8.7 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .rmd code.

URL: http://appliedpredictivemodeling.com/

Exercises

8.1

Recreate the simulated data from Exercise 7.2

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a)

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors ( V6 – V10 )?

Answer: Only marginally; see below.

First, let’s take a look at the correlations graph.

Plot: Simulated correlations.

It is interesting to note that some variables are correlated, though not to a degree considered significant.

Computing the variable importance scores is time consuming, but setting importance = TRUE produced the following values.

Predictor	%IncMSE	IncNodePurity
V1	8.7322354	1063.5046
V2	6.4153694	941.2445
V3	0.7635918	299.2105
V4	7.6151188	1077.4011
V5	2.0235246	478.4097
V6	0.1651112	182.3768
V7	-0.0059617	204.3141
V8	-0.1663626	147.7224
V9	-0.0952927	154.3783
V10	-0.0749448	184.4402

Let’s visualize the %IncMSE importance.

Plot: randomForest fit %IncMSE importance.

Let’s visualize the IncNodePurity importance.

Plot: randomForest fit IncNodePurity importance.

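The random forest did split on the uninformative predictors ( V6 – V10 ), as their nonzero IncNodePurity values show, but their %IncMSE scores are close to zero or negative. They therefore add little or nothing to predictive accuracy compared with V1, V2, V4, and V5.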
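For context, %IncMSE is a permutation-based measure: a predictor's values are shuffled and the resulting increase in prediction error is recorded. The sketch below illustrates that idea in base R only. It is not the randomForest implementation: a linear model stands in for the forest, a held-out split stands in for the out-of-bag samples, and the data set `toy` and the helper `mse` are hypothetical names introduced for illustration.

```r
# Permutation importance sketch (assumption: lm stands in for the forest,
# a held-out test split stands in for out-of-bag data).
set.seed(1)
n <- 200
toy <- data.frame(x1 = runif(n), x2 = runif(n), noise = runif(n))
toy$y <- 10 * toy$x1 + 5 * toy$x2 + rnorm(n)   # 'noise' is unrelated to y

train <- toy[1:150, ]
test  <- toy[151:200, ]
fit   <- lm(y ~ ., data = train)

# Mean squared error of a model on a data set
mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)
base_mse <- mse(fit, test)

perm_importance <- sapply(c("x1", "x2", "noise"), function(v) {
  # Average over several shuffles to reduce Monte Carlo noise
  mean(replicate(50, {
    shuffled <- test
    shuffled[[v]] <- sample(shuffled[[v]])   # break the link between v and y
    mse(fit, shuffled) - base_mse            # increase in MSE
  }))
})
round(perm_importance, 3)
```

A predictor that the model does not rely on should show an increase near zero, possibly slightly negative, which is the pattern V6 – V10 show in the table above.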

(b)

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)

Plot: Highly correlated predictor added.

The correlation between duplicate1 and V1 is 0.9497025.
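A correlation of this size is what the construction should give. The predictors produced by mlbench.friedman1 are uniform on [0, 1], so V1 has variance 1/12, and the added noise has variance 0.1^2. The expected correlation is then sqrt(Var(V1) / (Var(V1) + 0.01)), which is roughly 0.945. The sketch below checks this in base R; the variable names are illustrative only.

```r
# Theoretical correlation between V1 and V1 + noise
var_v1  <- 1 / 12     # variance of a Uniform(0, 1) variable
var_eps <- 0.1^2      # variance of rnorm(n) * 0.1
theory  <- sqrt(var_v1 / (var_v1 + var_eps))
theory

# Empirical check on a large sample (hypothetical stand-in for V1)
set.seed(42)
v1  <- runif(1e5)
dup <- v1 + rnorm(1e5) * 0.1
empirical <- cor(dup, v1)
empirical
```

With only 200 observations, as in the exercise, the sample correlation will vary somewhat around the theoretical value.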

Fit another random forest model to these data.

model2 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE, 
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

As before, importance = TRUE produced the variable importance scores; the table compares them with the scores from the first model.

Predictor	%IncMSE Original	IncNodePurity Original	%IncMSE Modified	IncNodePurity Modified
duplicate1	NA	NA	4.0427750	667.4905
V1	8.7322354	1063.5046	6.1207162	802.1288
V10	-0.0749448	184.4402	-0.0090829	155.6008
V2	6.4153694	941.2445	6.1291602	853.3309
V3	0.7635918	299.2105	0.6187432	259.6605
V4	7.6151188	1077.4011	6.8201411	915.3690
V5	2.0235246	478.4097	2.0630925	445.3001
V6	0.1651112	182.3768	0.1553443	173.5134
V7	-0.0059617	204.3141	-0.0297387	173.5496
V8	-0.1663626	147.7224	-0.0735684	133.8088
V9	-0.0952927	154.3783	-0.0303934	144.4209

As the table above shows, adding a predictor that is highly correlated with an important predictor changes the importance scores: the %IncMSE of V1 falls from 8.7322354 to 6.1207162, and duplicate1 itself receives 4.0427750.
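A side-by-side table like this can be assembled in base R by merging two importance tables on their row names. The sketch below uses three %IncMSE values copied from the table above; the object and column names are hypothetical.

```r
# Importance scores from the model without and with duplicate1
imp_without <- data.frame(Without_duplicate1 = c(8.7322354, 6.4153694),
                          row.names = c("V1", "V2"))
imp_with <- data.frame(With_duplicate1 = c(6.1207162, 6.1291602, 4.0427750),
                       row.names = c("V1", "V2", "duplicate1"))

# all = TRUE keeps predictors that appear in only one model (filled with NA)
comparison <- merge(imp_without, imp_with, by = "row.names", all = TRUE)
comparison
```

The merged key column is named Row.names, and duplicate1 gets NA for the model it was not part of.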

Let’s visualize the %IncMSE importance.

Plot: randomForest fit %IncMSE importance.

Let’s visualize the IncNodePurity importance.

Plot: randomForest fit IncNodePurity importance.

Did the importance score for V1 change? Yes, it decreased under both measures.

What happens when you add another predictor that is also highly correlated with V1 ?

Uninformative predictors with high correlations to informative predictors have abnormally large importance values. In some cases, their importance can be greater than or equal to weakly important variables. Another effect of between-predictor correlations is to dilute the importance of key predictors.

Theory tells us that if a critical predictor has an importance of X and a second predictor is just as critical but almost perfectly correlated with the first, the importance of each of the two predictors will be roughly X/2.

This can be seen in the refit with the highly correlated predictor: V1 now shows %IncMSE = 6.0070978 and IncNodePurity = 741.9267, about 30% lower than the values in the first table (8.7322354 and 1063.5046), where no predictor was correlated with V1.
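One way to examine the question directly is to add a second noisy copy of V1 and refit. The sketch below rebuilds the data the same way as at the start of the exercise and uses randomForest's own importance() function with type = 1 (the permutation measure) rather than caret's varImp. The name `duplicate2` and the reduced ntree are assumptions made for illustration, and the scores will differ from the tables above because the random draws differ.

```r
library(mlbench)
library(randomForest)

# Recreate the simulated data
set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
dat <- as.data.frame(cbind(sim$x, sim$y))
colnames(dat)[ncol(dat)] <- "y"

# Two noisy copies of V1 (duplicate2 is a hypothetical name)
dat$duplicate1 <- dat$V1 + rnorm(200) * 0.1
dat$duplicate2 <- dat$V1 + rnorm(200) * 0.1

# Refit and extract the unscaled permutation importance
fit <- randomForest(y ~ ., data = dat, importance = TRUE, ntree = 500)
imp <- importance(fit, type = 1, scale = FALSE)
imp[c("V1", "duplicate1", "duplicate2"), , drop = FALSE]
```

If the dilution argument holds, the importance that V1 carried alone should now be spread over three predictors.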

(c)

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007).

library(party)

# Assumption: simulated_Original is the data set before duplicate1 was added
simulated_Original <- simulated[, colnames(simulated) != "duplicate1"]

# Original simulation (no correlated duplicate)
model3_O <- cforest(y ~ ., data = simulated_Original,
                    controls = cforest_unbiased(ntree = 1000))

cfImp3_O <- varimp(model3_O, conditional = TRUE)

# Simulation with the correlated predictor duplicate1
model3_S <- cforest(y ~ ., data = simulated,
                    controls = cforest_unbiased(ntree = 1000))

cfImp3_S <- varimp(model3_S, conditional = TRUE)

Let’s take a look at the comparison:

Predictor	Original	Simulated
duplicate1	NA	1.3013786
V1	5.5913454	3.2541200
V10	-0.0135945	-0.0104096
V2	5.3244317	4.8894377
V3	0.0352544	0.0092629
V4	6.5044757	6.2341991
V5	1.1661623	1.2562339
V6	-0.0014171	0.0136426
V7	0.0222381	0.0181762
V8	-0.0071525	-0.0148796
V9	0.0013865	0.0111486

The table above also shows that the values change considerably when a highly correlated predictor is added to the model; V1, for example, drops from 5.5913454 to 3.2541200.

Plot: Original cforest importance fit.

Plot: Simulated cforest importance fit.

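Do these importances show the same pattern as the traditional random forest model? Yes, in part: the importance of V1 is again diluted once duplicate1 is added and V6 – V10 stay close to zero, but the individual values and the size of the changes differ.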

(d)

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Cubist

library(Cubist)

# Original simulation: select predictors by name rather than by column position
model4_O <- cubist(x = simulated_Original[, names(simulated_Original) != "y"],
                   y = simulated_Original$y,
                   committees = 100)

cubistImp4_O <- varImp(model4_O)

# Simulation with the correlated predictor duplicate1
model4_S <- cubist(x = simulated[, names(simulated) != "y"],
                   y = simulated$y,
                   committees = 100)

cubistImp4_S <- varImp(model4_S)

Let’s look at the table below to identify possible changes:

Predictors	Original	Simulated
duplicate1	NA	0.0
V1	71.5	71.0
V10	0.0	0.0
V2	58.5	59.0
V3	47.0	46.5
V4	48.0	48.0
V5	33.0	32.5
V6	13.0	12.0
V7	0.0	0.0
V8	0.0	1.0
V9	0.0	0.0

From the table above, the Cubist importance values barely change when duplicate1 is added: the ranking of the informative predictors V1 – V5 is the same, and duplicate1 itself receives an importance of 0.

Boosted trees

library(gbm)
model5_O <- gbm(y ~ ., data = simulated_Original, distribution = "gaussian")
model5_S <- gbm(y ~ ., data = simulated, distribution = "gaussian")

Let’s compare the results in the table below to identify possible changes:

Predictors	Original	Simulated
duplicate1	NA	9.6873842
V1	27.7651820	19.3988095
V10	0.0000000	0.0000000
V2	21.7112939	21.9992854
V3	8.4111141	8.3701866
V4	30.5091258	29.1592214
V5	10.9236138	10.8430832
V6	0.3337437	0.5420297
V7	0.0000000	0.0000000
V8	0.0000000	0.0000000
V9	0.3459267	0.0000000
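The gbm calls above do not show how the importance values were extracted. One common route is summary() on the fitted object, which returns the relative influence of each predictor, normalized to sum to 100 by default. The sketch below is self-contained and therefore generates Friedman 1 style data in base R instead of calling mlbench; the object names are hypothetical, and the numbers will not match the table above.

```r
library(gbm)

# Friedman 1 benchmark data: only V1 - V5 enter the response
set.seed(200)
n <- 200
x <- matrix(runif(n * 10), ncol = 10,
            dimnames = list(NULL, paste0("V", 1:10)))
y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
  10 * x[, 4] + 5 * x[, 5] + rnorm(n)
dat <- data.frame(x, y = y)

# Boosted trees with gbm's default settings
fit <- gbm(y ~ ., data = dat, distribution = "gaussian")

# Relative influence per predictor, without drawing the bar plot
infl <- summary(fit, plotit = FALSE)
infl
```

The result is a data frame with the columns var and rel.inf, sorted by decreasing influence.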

Bagging

library(ipred)

# Original simulation
model6_O <- bagging(y ~ ., data = simulated_Original)

baggingImp6_O <- varImp(model6_O)

# Simulation with the correlated predictor duplicate1
model6_S <- bagging(y ~ ., data = simulated)

baggingImp6_S <- varImp(model6_S)

Let’s look at the table below to identify possible changes:

Predictors	Original	Simulated
duplicate1	NA	1.3780229
V1	1.9106498	1.6802887
V10	0.7838910	0.6695382
V2	2.2263883	1.8601888
V3	1.2831348	1.0602350
V4	2.6696479	2.6727242
V5	2.3874442	2.2585687
V6	1.0001164	0.8214774
V7	0.9258563	0.9326516
V8	0.5921348	0.5508431
V9	0.6332503	0.6186524

Returning to the original question: does the same pattern occur? Partly. With boosted trees and bagging, V1 loses importance once duplicate1 is added and duplicate1 takes a share of it, as in the random forest. With Cubist, the scores are almost unchanged and duplicate1 receives an importance of 0.

8.2

Use a simulation to show tree bias with different granularity.

Solution

From the textbook:

Trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors (Loh and Shih 1997; Carolin et al. 2007; Loh 2010). Loh and Shih (1997) remarked that

“The danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have many more splits than the informative variables. Then there is a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce either a tree with misleading structure or no tree at all.”

Also, as the number of missing values increases, the selection of predictors becomes more biased (Carolin et al. 2007). It is worth noting that the variable importance scores for the solubility regression tree (Fig. 8.6) show that the model tends to rely more on continuous (i.e., less granular) predictors than the binary fingerprints. This could be due to the selection bias or the content of the variables.
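The mechanism behind this bias can be sketched in base R. An exhaustive split search evaluates every cut point of every predictor, so a predictor with many distinct values gets many more chances to find a split that looks good by luck. In the sketch below both predictors are pure noise with respect to the response. The helper `best_sse_reduction` is a hypothetical, simplified stand-in for a tree's split search (no minimum node size, squared-error criterion).

```r
set.seed(1)
n <- 200
binary_noise <- rep(0:1, each = n / 2)   # 2 distinct values -> 1 candidate split

# Best reduction in the sum of squared errors over all candidate cut points
best_sse_reduction <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- cuts[-length(cuts)]            # cutting at the maximum leaves one side empty
  total <- sum((y - mean(y))^2)
  reductions <- sapply(cuts, function(cut) {
    left  <- y[x <= cut]
    right <- y[x > cut]
    total - sum((left - mean(left))^2) - sum((right - mean(right))^2)
  })
  c(n_splits = length(cuts), best = max(reductions))
}

# Number of candidate splits for each type of predictor
y <- rnorm(n)
continuous_noise <- rnorm(n)
counts <- c(binary = best_sse_reduction(binary_noise, y)[["n_splits"]],
            continuous = best_sse_reduction(continuous_noise, y)[["n_splits"]])
counts

# How often does the continuous noise predictor offer the better first split?
wins <- replicate(200, {
  y <- rnorm(n)
  continuous_noise <- rnorm(n)
  best_sse_reduction(continuous_noise, y)[["best"]] >
    best_sse_reduction(binary_noise, y)[["best"]]
})
mean(wins)
```

Neither predictor carries any signal here, so an unbiased procedure would pick each about half the time; a share well above one half for the continuous predictor reflects the advantage of having more candidate splits.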

Let’s set up a simulated experiment with the following data set:

Price: home value (the response).
State: only 2 categories; Price depends on it.
SFeet: square feet; 200 random values, not related to the home price.

# Granular predictor: only 2 distinct values
set.seed(123)
State <- rep(0:1, each = 100)
# Relationship: Price depends on State plus noise
Price <- State + rnorm(200, mean = 0, sd = 4)
# Non-granular predictor: 200 distinct values, unrelated to Price
set.seed(231)
SFeet <- rnorm(200, mean = 0, sd = 2)

SData <- data.frame(Price, State, SFeet)

Example of relationship in data:

Boxplot: Simulated home values by State.

Let’s run the simulation 25 times, each time fitting an rpart tree limited to a single split (maxdepth = 1).

library(rpart)
library(caret)  # provides varImp()

# Store the predictor with the highest importance in each run
HImportance <- character(25)

for (i in 1:25) {
    # Granular predictor: only 2 distinct values
    State <- rep(1:2, each = 100)
    # Relationship: Price depends on State plus noise
    set.seed(123 + i)
    Price <- State + rnorm(200, mean = 0, sd = 4)
    # Non-granular predictor: 200 distinct values, unrelated to Price
    set.seed(200 + i)
    SFeet <- rnorm(200, mean = 0, sd = 2)

    # Simulated data frame
    SData <- data.frame(Price, State, SFeet)

    # Fit a tree limited to a single split
    Smodel <- rpart(Price ~ ., data = SData, control = rpart.control(maxdepth = 1))

    # Variable importance
    Imp <- varImp(Smodel)

    # Record the predictor with the highest importance
    HImportance[i] <- rownames(Imp)[which.max(Imp$Overall)]
}

# Tabulate how often each predictor ranked first
table(HImportance)

Let’s see our test results from the simulation:

## HImportance
## SFeet State 
##    18     7

SFeet, the noise predictor with 200 distinct values, has the highest importance in 18 of the 25 runs, while State, the only predictor actually related to Price, ranks first just 7 times. This illustrates the selection bias: the tree favors the predictor that offers more candidate splits.

References

Kuhn, M. & Johnson, K. 2013. Applied Predictive Modeling. New York: Springer. http://appliedpredictivemodeling.com/.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

CUNY School of Professional Studies