library(mlbench)
set.seed(586)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
rf_model1 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rf_imp1 <- varImp(rf_model1, scale = FALSE)
rf_imp1
## Overall
## V1 5.462118657
## V2 5.744648556
## V3 0.835727253
## V4 9.820484510
## V5 2.745409595
## V6 0.032484839
## V7 0.171309681
## V8 0.033181173
## V9 -0.094152899
## V10 0.005305403
The uninformative predictors have comparatively low values compared to V1-V5 which indicates that they are not are not very important to the model.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9454482
rf_model2 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rf_imp2 <- varImp(rf_model2, scale = FALSE)
rf_imp2
## Overall
## V1 3.30129260
## V2 5.33946326
## V3 0.75700225
## V4 8.91150160
## V5 2.54108251
## V6 -0.10622025
## V7 0.09889892
## V8 0.12509313
## V9 -0.05079979
## V10 0.05737038
## duplicate1 4.03261946
The importance score for V1 dropped. This demonstrates why it’s a good practice to remove highly correlated variables so that the importance score isn’t diluted between two similar values.
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
cf_model <- cforest(y ~., data = simulated)
varImp(cf_model, conditional=TRUE)
## Overall
## V1 1.373485278
## V2 3.093126008
## V3 0.028893381
## V4 8.408861991
## V5 2.094538325
## V6 0.031409034
## V7 0.030863459
## V8 0.049552971
## V9 -0.008084814
## V10 0.012927856
## duplicate1 1.935761543
The results of the conditional inference tree rf model appear similar to the traditional rf methodology. One difference is that V4 now has the highest store, though V1-5 are still on top overall. V1 would certainly have a better score if we removed the ‘duplicate1’ column as it is designed to dilute the importance score of V1.
library(gbm)
## Loaded gbm 2.1.8
gbm_model <- gbm(y ~ ., data=simulated, distribution='gaussian')
summary(gbm_model)
## var rel.inf
## V4 V4 36.5450353
## V2 V2 19.0280570
## duplicate1 duplicate1 12.6754130
## V5 V5 11.6837644
## V1 V1 10.8880837
## V3 V3 9.0642627
## V9 V9 0.1153839
## V6 V6 0.0000000
## V7 V7 0.0000000
## V8 V8 0.0000000
## V10 V10 0.0000000
We appear to have the same results with boosted trees – V6-10 having less importance to the model.
library(Cubist)
cubist_model <- cubist(x=simulated[,-(ncol(simulated)-1)], y=simulated$y, committees=100)
varImp(cubist_model)
## Overall
## V3 50.0
## V2 61.5
## duplicate1 35.0
## V1 39.5
## V4 47.0
## V5 32.0
## V6 10.5
## V8 5.5
## V7 3.0
## V9 3.0
## V10 1.5
With the cubist model, V1-5 again share the trait of higher importance scores while V6-10 are lower.
We’ll simulate the data by creating columns of values that have have differing variance in their values, then we’ll use a tree model function to see how that affects importance scores. We also need a predictor variable y, which will be created by summing our random variables and adding an element of randomness.
high_var <- sample(500,1000, replace=TRUE)
med_var <- sample(250,1000, replace=TRUE)
low_var <- sample(10,1000, replace=TRUE)
sim_df <- data.frame(high_var,med_var,low_var)
sim_df$y <- sim_df$high_var+sim_df$med_var+sim_df$low_var+sample(100,nrow(sim_df), replace=TRUE)
gbm_model <- gbm(y ~ ., data=sim_df, distribution='gaussian')
summary(gbm_model)
## var rel.inf
## high_var high_var 80.27676817
## med_var med_var 19.69847654
## low_var low_var 0.02475529
We can see the the simulated data with higher granularity (fewer repeated values) has a much higher importance than those columns with medium or low granularity.
The right-hand model, having the higher bagging fraction and learning rate, will place more significance on the impact of whichever predictors appear to have the best fit within a lesser number of iterations.
The left-hand model, will “spread out” that significance to the other predictors. The smaller bagging fraction of the left hand model means that a smaller subset of the data is tested each time, so the sample could appear very different on subsequent iterations. A higher bagging fraction, in this case 90%, means that each iteration is essentially testing the same dataset.
The learning rate is a measure of how much of each iterations prediction data persists through subsequent predictions. A higher learning rate favors fewer predictors that appear significant early on.
Since the right-hand model’s higher learning rate and bagging fraction carries a greater risk of overfitting, the left-hand model would likely be more predictive of other samples since it is less specific to certain data.
Increasing interaction depth increases the nodes per tree and the number of splits performed by a tree. This would result a greater bredth of interactions, and therefore more focus on more predictors. Predictor importance would likely be more even – a flatter slope.
library(AppliedPredictiveModeling)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data("ChemicalManufacturingProcess")
preP <- preProcess(ChemicalManufacturingProcess,
method = c("BoxCox", "knnImpute", "center", "scale"))
df <- predict(preP, ChemicalManufacturingProcess)
df$Yield = ChemicalManufacturingProcess$Yield
trainRows <- createDataPartition(df$Yield, p=.80, list=FALSE)
train_df <- df[trainRows,]
test_df <- df[-trainRows,]
train_x <- select(train_df, -Yield)
test_x <- select(test_df, -Yield)
rf_model <- randomForest(train_x, train_df$Yield, importance=TRUE, ntrees=1000)
rf_model
##
## Call:
## randomForest(x = train_x, y = train_df$Yield, importance = TRUE, ntrees = 1000)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 19
##
## Mean of squared residuals: 1.198396
## % Var explained: 66.91
rfPred <- predict(rf_model, newdata = test_x)
postResample(pred = rfPred, obs = test_df$Yield)
## RMSE Rsquared MAE
## 1.0904741 0.5049413 0.8873233
gbm_model <- gbm.fit(train_x, train_df$Yield, distribution = "gaussian")
gbmPred <- predict(gbm_model, newdata = test_x)
## Using 100 trees...
postResample(pred = gbmPred, obs = test_df$Yield)
## RMSE Rsquared MAE
## 1.4614067 0.5256973 1.2901887
cubist_model <- cubist(train_x, train_df$Yield)
cubist_model
##
## Call:
## cubist.default(x = train_x, y = train_df$Yield)
##
## Number of samples: 144
## Number of predictors: 57
##
## Number of committees: 1
## Number of rules: 1
cubistPred <- predict(cubist_model, newdata = test_x)
postResample(pred = cubistPred, obs = test_df$Yield)
## RMSE Rsquared MAE
## 1.4910891 0.2839258 1.1673941
The random forest model appeared to have the best results, lowest error, highest r-squared, and lowest MAE.
imp <- as.data.frame(varImp(rf_model))
imp <- data.frame(overall = imp$Overall,
names = rownames(imp))
imp[order(imp$overall,decreasing = T),]
## overall names
## 44 18.93631162 ManufacturingProcess32
## 21 12.28307440 ManufacturingProcess09
## 3 10.99863095 BiologicalMaterial03
## 25 10.57711137 ManufacturingProcess13
## 12 9.97718089 BiologicalMaterial12
## 6 9.05218459 BiologicalMaterial06
## 29 8.14072934 ManufacturingProcess17
## 23 7.38552448 ManufacturingProcess11
## 51 7.36232708 ManufacturingProcess39
## 2 6.39522542 BiologicalMaterial02
## 48 6.39353678 ManufacturingProcess36
## 4 6.32144797 BiologicalMaterial04
## 8 6.11297011 BiologicalMaterial08
## 43 6.05341410 ManufacturingProcess31
## 11 5.81556181 BiologicalMaterial11
## 32 5.79840622 ManufacturingProcess20
## 30 5.56620617 ManufacturingProcess18
## 40 5.30856018 ManufacturingProcess28
## 37 4.87945962 ManufacturingProcess25
## 5 4.72151995 BiologicalMaterial05
## 9 4.68158584 BiologicalMaterial09
## 41 4.25877090 ManufacturingProcess29
## 39 4.14357044 ManufacturingProcess27
## 42 4.09242574 ManufacturingProcess30
## 13 4.08915832 ManufacturingProcess01
## 18 4.08348414 ManufacturingProcess06
## 45 3.79271095 ManufacturingProcess33
## 1 3.68107127 BiologicalMaterial01
## 14 3.56694006 ManufacturingProcess02
## 54 3.20340533 ManufacturingProcess42
## 22 2.98565800 ManufacturingProcess10
## 55 2.94277707 ManufacturingProcess43
## 46 2.83844440 ManufacturingProcess34
## 10 2.51335943 BiologicalMaterial10
## 17 2.24327573 ManufacturingProcess05
## 28 1.97350048 ManufacturingProcess16
## 24 1.89819118 ManufacturingProcess12
## 33 1.69240778 ManufacturingProcess21
## 38 1.65215211 ManufacturingProcess26
## 35 1.61084530 ManufacturingProcess23
## 57 1.34368934 ManufacturingProcess45
## 56 1.29694275 ManufacturingProcess44
## 31 1.29675623 ManufacturingProcess19
## 36 1.19719607 ManufacturingProcess24
## 7 1.17840876 BiologicalMaterial07
## 15 1.07868497 ManufacturingProcess03
## 16 0.89997461 ManufacturingProcess04
## 34 0.85843830 ManufacturingProcess22
## 27 0.81301546 ManufacturingProcess15
## 50 0.62891479 ManufacturingProcess38
## 49 0.48669010 ManufacturingProcess37
## 26 0.24788064 ManufacturingProcess14
## 20 0.17158515 ManufacturingProcess08
## 19 -0.01404109 ManufacturingProcess07
## 47 -0.21198657 ManufacturingProcess35
## 53 -0.90615574 ManufacturingProcess41
## 52 -1.36633421 ManufacturingProcess40
The processes seem to dominate the top ten most important predictors, and occupy the top three spots.
The top predictors are mostly the same as with the non-linear and linear models, though manufacturing process 31 has not appeared as a top predictor of yield before.
library(rpart)
library(partykit)
## Loading required package: libcoin
##
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
## node_terminal, varimp
rpart_tree <- rpart(Yield ~., data=train_df)
plot(as.party(rpart_tree))