8.1

library(mlbench)
set.seed(586)

simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

a.

library(randomForest)
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
rf_model1 <- randomForest(y ~ .,
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rf_imp1 <- varImp(rf_model1, scale = FALSE)
rf_imp1
##          Overall
## V1   5.462118657
## V2   5.744648556
## V3   0.835727253
## V4   9.820484510
## V5   2.745409595
## V6   0.032484839
## V7   0.171309681
## V8   0.033181173
## V9  -0.094152899
## V10  0.005305403

The uninformative predictors have comparatively low values compared to V1-V5 which indicates that they are not are not very important to the model.

b.

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9454482
rf_model2 <- randomForest(y ~ .,
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rf_imp2 <- varImp(rf_model2, scale = FALSE)
rf_imp2
##                Overall
## V1          3.30129260
## V2          5.33946326
## V3          0.75700225
## V4          8.91150160
## V5          2.54108251
## V6         -0.10622025
## V7          0.09889892
## V8          0.12509313
## V9         -0.05079979
## V10         0.05737038
## duplicate1  4.03261946

The importance score for V1 dropped. This demonstrates why it’s a good practice to remove highly correlated variables so that the importance score isn’t diluted between two similar values.

c.

library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
cf_model <- cforest(y ~., data = simulated)
varImp(cf_model, conditional=TRUE)
##                 Overall
## V1          1.373485278
## V2          3.093126008
## V3          0.028893381
## V4          8.408861991
## V5          2.094538325
## V6          0.031409034
## V7          0.030863459
## V8          0.049552971
## V9         -0.008084814
## V10         0.012927856
## duplicate1  1.935761543

The results of the conditional inference tree rf model appear similar to the traditional rf methodology. One difference is that V4 now has the highest store, though V1-5 are still on top overall. V1 would certainly have a better score if we removed the ‘duplicate1’ column as it is designed to dilute the importance score of V1.

d.

library(gbm)
## Loaded gbm 2.1.8
gbm_model <- gbm(y ~ ., data=simulated, distribution='gaussian')
summary(gbm_model)

##                   var    rel.inf
## V4                 V4 36.5450353
## V2                 V2 19.0280570
## duplicate1 duplicate1 12.6754130
## V5                 V5 11.6837644
## V1                 V1 10.8880837
## V3                 V3  9.0642627
## V9                 V9  0.1153839
## V6                 V6  0.0000000
## V7                 V7  0.0000000
## V8                 V8  0.0000000
## V10               V10  0.0000000

We appear to have the same results with boosted trees – V6-10 having less importance to the model.

library(Cubist)
cubist_model <- cubist(x=simulated[,-(ncol(simulated)-1)], y=simulated$y, committees=100)
varImp(cubist_model)
##            Overall
## V3            50.0
## V2            61.5
## duplicate1    35.0
## V1            39.5
## V4            47.0
## V5            32.0
## V6            10.5
## V8             5.5
## V7             3.0
## V9             3.0
## V10            1.5

With the cubist model, V1-5 again share the trait of higher importance scores while V6-10 are lower.

8.2

We’ll simulate the data by creating columns of values that have have differing variance in their values, then we’ll use a tree model function to see how that affects importance scores. We also need a predictor variable y, which will be created by summing our random variables and adding an element of randomness.

high_var <- sample(500,1000, replace=TRUE)
med_var <- sample(250,1000, replace=TRUE)
low_var <- sample(10,1000, replace=TRUE)

sim_df <- data.frame(high_var,med_var,low_var)

sim_df$y <- sim_df$high_var+sim_df$med_var+sim_df$low_var+sample(100,nrow(sim_df), replace=TRUE)

gbm_model <- gbm(y ~ ., data=sim_df, distribution='gaussian')
summary(gbm_model)

##               var     rel.inf
## high_var high_var 80.27676817
## med_var   med_var 19.69847654
## low_var   low_var  0.02475529

We can see the the simulated data with higher granularity (fewer repeated values) has a much higher importance than those columns with medium or low granularity.

8.3

a.

The right-hand model, having the higher bagging fraction and learning rate, will place more significance on the impact of whichever predictors appear to have the best fit within a lesser number of iterations.

The left-hand model, will “spread out” that significance to the other predictors. The smaller bagging fraction of the left hand model means that a smaller subset of the data is tested each time, so the sample could appear very different on subsequent iterations. A higher bagging fraction, in this case 90%, means that each iteration is essentially testing the same dataset.

The learning rate is a measure of how much of each iterations prediction data persists through subsequent predictions. A higher learning rate favors fewer predictors that appear significant early on.

b.

Since the right-hand model’s higher learning rate and bagging fraction carries a greater risk of overfitting, the left-hand model would likely be more predictive of other samples since it is less specific to certain data.

c.

Increasing interaction depth increases the nodes per tree and the number of splits performed by a tree. This would result a greater bredth of interactions, and therefore more focus on more predictors. Predictor importance would likely be more even – a flatter slope.

8.7

library(AppliedPredictiveModeling)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data("ChemicalManufacturingProcess")

preP <- preProcess(ChemicalManufacturingProcess, 
                   method = c("BoxCox", "knnImpute", "center", "scale"))
df <- predict(preP, ChemicalManufacturingProcess)

df$Yield = ChemicalManufacturingProcess$Yield

trainRows <- createDataPartition(df$Yield, p=.80, list=FALSE)

train_df <- df[trainRows,]
test_df <- df[-trainRows,]

train_x <- select(train_df, -Yield)
test_x <- select(test_df, -Yield)

Random Forest

rf_model <- randomForest(train_x, train_df$Yield, importance=TRUE, ntrees=1000)

rf_model
## 
## Call:
##  randomForest(x = train_x, y = train_df$Yield, importance = TRUE,      ntrees = 1000) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 19
## 
##           Mean of squared residuals: 1.198396
##                     % Var explained: 66.91
rfPred <- predict(rf_model, newdata = test_x)
postResample(pred = rfPred, obs = test_df$Yield)
##      RMSE  Rsquared       MAE 
## 1.0904741 0.5049413 0.8873233

Boosted Tree

gbm_model <- gbm.fit(train_x, train_df$Yield, distribution = "gaussian")
gbmPred <- predict(gbm_model, newdata = test_x)
## Using 100 trees...
postResample(pred = gbmPred, obs = test_df$Yield)
##      RMSE  Rsquared       MAE 
## 1.4614067 0.5256973 1.2901887

Cubist

cubist_model <- cubist(train_x, train_df$Yield)
cubist_model
## 
## Call:
## cubist.default(x = train_x, y = train_df$Yield)
## 
## Number of samples: 144 
## Number of predictors: 57 
## 
## Number of committees: 1 
## Number of rules: 1
cubistPred <- predict(cubist_model, newdata = test_x)
postResample(pred = cubistPred, obs = test_df$Yield)
##      RMSE  Rsquared       MAE 
## 1.4910891 0.2839258 1.1673941

a.

The random forest model appeared to have the best results, lowest error, highest r-squared, and lowest MAE.

b.

imp <- as.data.frame(varImp(rf_model))
imp <- data.frame(overall = imp$Overall,
           names   = rownames(imp))
imp[order(imp$overall,decreasing = T),]
##        overall                  names
## 44 18.93631162 ManufacturingProcess32
## 21 12.28307440 ManufacturingProcess09
## 3  10.99863095   BiologicalMaterial03
## 25 10.57711137 ManufacturingProcess13
## 12  9.97718089   BiologicalMaterial12
## 6   9.05218459   BiologicalMaterial06
## 29  8.14072934 ManufacturingProcess17
## 23  7.38552448 ManufacturingProcess11
## 51  7.36232708 ManufacturingProcess39
## 2   6.39522542   BiologicalMaterial02
## 48  6.39353678 ManufacturingProcess36
## 4   6.32144797   BiologicalMaterial04
## 8   6.11297011   BiologicalMaterial08
## 43  6.05341410 ManufacturingProcess31
## 11  5.81556181   BiologicalMaterial11
## 32  5.79840622 ManufacturingProcess20
## 30  5.56620617 ManufacturingProcess18
## 40  5.30856018 ManufacturingProcess28
## 37  4.87945962 ManufacturingProcess25
## 5   4.72151995   BiologicalMaterial05
## 9   4.68158584   BiologicalMaterial09
## 41  4.25877090 ManufacturingProcess29
## 39  4.14357044 ManufacturingProcess27
## 42  4.09242574 ManufacturingProcess30
## 13  4.08915832 ManufacturingProcess01
## 18  4.08348414 ManufacturingProcess06
## 45  3.79271095 ManufacturingProcess33
## 1   3.68107127   BiologicalMaterial01
## 14  3.56694006 ManufacturingProcess02
## 54  3.20340533 ManufacturingProcess42
## 22  2.98565800 ManufacturingProcess10
## 55  2.94277707 ManufacturingProcess43
## 46  2.83844440 ManufacturingProcess34
## 10  2.51335943   BiologicalMaterial10
## 17  2.24327573 ManufacturingProcess05
## 28  1.97350048 ManufacturingProcess16
## 24  1.89819118 ManufacturingProcess12
## 33  1.69240778 ManufacturingProcess21
## 38  1.65215211 ManufacturingProcess26
## 35  1.61084530 ManufacturingProcess23
## 57  1.34368934 ManufacturingProcess45
## 56  1.29694275 ManufacturingProcess44
## 31  1.29675623 ManufacturingProcess19
## 36  1.19719607 ManufacturingProcess24
## 7   1.17840876   BiologicalMaterial07
## 15  1.07868497 ManufacturingProcess03
## 16  0.89997461 ManufacturingProcess04
## 34  0.85843830 ManufacturingProcess22
## 27  0.81301546 ManufacturingProcess15
## 50  0.62891479 ManufacturingProcess38
## 49  0.48669010 ManufacturingProcess37
## 26  0.24788064 ManufacturingProcess14
## 20  0.17158515 ManufacturingProcess08
## 19 -0.01404109 ManufacturingProcess07
## 47 -0.21198657 ManufacturingProcess35
## 53 -0.90615574 ManufacturingProcess41
## 52 -1.36633421 ManufacturingProcess40

The processes seem to dominate the top ten most important predictors, and occupy the top three spots.

The top predictors are mostly the same as with the non-linear and linear models, though manufacturing process 31 has not appeared as a top predictor of yield before.

library(rpart)
library(partykit)
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp
rpart_tree <- rpart(Yield ~., data=train_df)

plot(as.party(rpart_tree))