Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
y = 10 sin(πx1x2) + 20(x3 − 0.5)^2 + 10x4 + 5x5 + N(0, σ^2)
where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five other non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(mlbench)

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
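As a quick cross-check (a sketch, not part of the exercise code), the response can be generated directly from the equation above using the simulated predictors; the N(0, 1) noise is re-drawn, so the values will not match simulated$y exactly.
# Rebuild y from Friedman's equation (V1-V5 are the informative predictors)
x_check <- as.matrix(simulated[, 1:10])
y_check <- 10 * sin(pi * x_check[, 1] * x_check[, 2]) + 20 * (x_check[, 3] - 0.5)^2 +
  10 * x_check[, 4] + 5 * x_check[, 5] + rnorm(nrow(x_check), sd = 1)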
Answer:
library(randomForest)

model1 <- randomForest(y ~ .,
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
# model summary
print(model1)
##
## Call:
## randomForest(formula = y ~ ., data = simulated, importance = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 6.754258
## % Var explained: 72.3
print(" ")
## [1] " "
# Extract variable importance
# gives a table with metrics such as %IncMSE (permutation importance: the increase in MSE
# when a predictor is permuted) and IncNodePurity (total decrease in node impurity,
# measured by the residual sum of squares for regression)
rfImp1 <- importance(model1)
print( rfImp1)
## %IncMSE IncNodePurity
## V1 57.1035713 1063.5046
## V2 47.0088894 941.2445
## V3 10.8342807 299.2105
## V4 51.6960809 1077.4011
## V5 22.0568333 478.4097
## V6 2.9419111 182.3768
## V7 -0.1001392 204.3141
## V8 -3.2180039 147.7224
## V9 -1.8067946 154.3783
## V10 -1.3228315 184.4402
# Plot variable importance
varImpPlot(model1, main = "Random Forest Default model with trees = 1000")
The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
Answer:
Using the cforest function, the predictor importance with conditional = TRUE ranked V4, V2, V1 and then duplicate1 highest. The pattern changed relative to the traditional random forest model: conditional importance corrects for the bias introduced by correlated predictors, and the ranks were swapped.
This is because, when conditional = TRUE, the function permutes each variable within subgroups defined by its correlated predictors (V1 and duplicate1 in this case), so importance is measured after controlling for that correlation.
Using the cforest function with conditional = FALSE, the predictor importance ranked V4, V2, V1 and then duplicate1, duplicate2 highest. In this case too, the pattern differed from the traditional random forest model.
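A minimal check of that correlation (a sketch; it assumes duplicate1 was added to simulated earlier in this exercise as a noisy copy of V1):
# Correlation between V1 and its near-duplicate; this is the dependence that
# conditional permutation importance adjusts for
cor(simulated$V1, simulated$duplicate1)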
# The partykit implementation is used here: its cforest accepts ntree directly,
# and its varimp provides the conditional argument
library(partykit)

# Fit cforest model
model3 <- cforest(y ~ .,
                  data = simulated, ntree = 1000)

# Calculate variable importance with conditional = TRUE
cforestImp_true <- varimp(model3, conditional = TRUE)

# Calculate variable importance with conditional = FALSE
cforestImp_false <- varimp(model3, conditional = FALSE)

# Plot
barplot(sort(cforestImp_true, decreasing = TRUE),
        main = "Variable Importance from cforest (partykit) with conditional = TRUE",
        horiz = TRUE, las = 1, col = "blue")

# Plot
barplot(sort(cforestImp_false, decreasing = TRUE),
        main = "Variable Importance from cforest (partykit) with conditional = FALSE",
        horiz = TRUE, las = 1, col = "blue")
Answer:
Using the gradient boosting (gbm) model, the predictor importance by relative influence ranked V4, V2, V1 and then V5 highest. In this case, the pattern remained mostly the same as in the traditional random forest model.
Using the Cubist model, the predictor importance (after sorting the varImp table below, which is not printed in rank order) ranked V2, V1, V4 and V3 highest, followed by duplicate1 and V5. The same informative predictors dominate, but the ordering differs from the previous models, with V2 first and V3 relatively more important.
# boosted trees
# Distribution defines the type of loss function that will be optimized during boosting;
# for a continuous response it should be "gaussian".
library(gbm)

model4 <- gbm(y ~ .,
              data = simulated,
              distribution = "gaussian",
              n.trees = 1000)
# save the importance to a variable and view it
# shows relative influence of each variable.
gbmImp <- summary(model4, plotit = FALSE)
print(gbmImp)
## var rel.inf
## V4 V4 24.727673
## V2 V2 19.265064
## V1 V1 16.716224
## V5 V5 10.480562
## V3 V3 8.650080
## duplicate1 duplicate1 8.234744
## V7 V7 2.542862
## V6 V6 2.423038
## V10 V10 1.909883
## V9 V9 1.859935
## duplicate2 duplicate2 1.809675
## V8 V8 1.380259
# Plot
barplot(gbmImp$rel.inf,
names.arg = rownames(gbmImp),
las = 2,
main = "Variable Importance from GBM model",
col = "blue",
horiz = TRUE)
library(Cubist)

# Predictor columns 1:10 and 12; column 11 is the response y
model5 <- cubist(simulated[, c(1:10, 12)], simulated[, 11],
                 committees = 100)
# 1. Get variable importance
cubistImp <- varImp(model5)
print(cubistImp)
## Overall
## V3 43.5
## V1 52.5
## V2 59.5
## duplicate1 27.5
## V4 46.0
## V8 4.0
## V5 27.0
## V6 10.0
## V10 1.0
## V7 0.0
## V9 0.0
# 2. Plot simple barplot (sort values and labels together so they stay matched)
cubistImpSorted <- cubistImp[order(cubistImp$Overall, decreasing = TRUE), , drop = FALSE]
barplot(cubistImpSorted$Overall,
        names.arg = rownames(cubistImpSorted),
        las = 2,
        main = "Variable Importance (Cubist)",
        col = "lightgreen",
        horiz = TRUE)
# Simulating a dataset with one important predictor and some noise
library(rpart)
library(ggplot2)

set.seed(42)
n <- 1000 # Number of observations
x1 <- rnorm(n)
x2 <- rnorm(n) # Noise feature
y <- 2 * x1 + rnorm(n) * 0.5 # Linear relationship + some noise
# Creating a data frame
simulated_data <- data.frame(x1 = x1, x2 = x2, y = y)
# Train decision trees with different depths (granularities)
tree_depths <- c(1, 2, 3, 5, 10) # Different tree depths (granularities)
tree_predictions <- data.frame(x1 = x1, x2 = x2, y = y)
for (depth in tree_depths) {
model <- rpart(y ~ x1 + x2, data = simulated_data, control = rpart.control(maxdepth = depth))
tree_predictions[paste("depth", depth, sep = "_")] <- predict(model, simulated_data)
}
# Compute errors for each model to visualize bias (mean squared error)
errors <- sapply(tree_depths, function(depth) {
pred <- tree_predictions[[paste("depth", depth, sep = "_")]]
mean((pred - y)^2) # Mean Squared Error (MSE)
})
# Plot the errors to show how tree bias changes with depth
error_df <- data.frame(Depth = tree_depths, Error = errors)
ggplot(error_df, aes(x = Depth, y = Error)) +
geom_point() +
geom_line() +
labs(title = "Tree Bias and Granularity (Tree Depth)",
x = "Tree Depth (Granularity)",
y = "Mean Squared Error (MSE)") +
theme_minimal()
# 5. Plot predictions with different tree depths to visualize bias at each granularity
ggplot(tree_predictions, aes(x = x1, y = y)) +
geom_point(alpha = 0.3) + # Original data points
geom_line(aes(y = depth_1), color = "red", linetype = "dashed") +
geom_line(aes(y = depth_2), color = "blue", linetype = "dashed") +
geom_line(aes(y = depth_3), color = "green", linetype = "dashed") +
geom_line(aes(y = depth_5), color = "purple", linetype = "dashed") +
geom_line(aes(y = depth_10), color = "orange", linetype = "dashed") +
labs(title = "Effect of Tree Granularity on Predictions",
x = "Predictor x1", y = "Prediction y") +
theme_minimal()
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The model on the right has a higher bagging fraction (0.9) and a higher learning rate (0.9). Each tree sees 90% of the training data, so individual trees have less bias but the ensemble is more prone to overfitting. With a learning rate of 0.9, each tree makes large corrections and the model fits the training data very quickly. As a result it concentrates on a few variables (the top predictors dominate) while the other variables are largely ignored.
The model on the left has a low bagging fraction (0.1) and a low learning rate (0.1). Each tree sees only 10% of the data, which gives more randomness, more diverse trees and less overfitting, at the cost of slightly higher bias. With a learning rate of 0.1, each tree makes small corrections; the model builds up slowly and carefully, usually needs more trees, but tends to generalize better, allowing more predictors to contribute meaningfully over time.
The left-hand model (bagging fraction 0.1, learning rate 0.1) would therefore likely be more predictive on new, unseen data (better generalization), for the reasons above: it builds up slowly across many trees and spreads importance over more predictors.
If we increase the interaction depth:
In the left-hand model, we would likely see more variables gaining importance, because deeper trees can discover more complex relationships among multiple features, so the importance slope would get a little flatter.
In the right-hand model, deeper trees might further concentrate importance among the top few predictors, because they can exploit even finer-grained overfitting, so the dominance of the top features (and the slope) would get even steeper.
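The figure itself is not reproduced here, but the two extreme settings can be fit on the solubility data along the following lines (a sketch; the interaction depth and number of trees are assumptions, since the exercise fixes only the bagging fraction and learning rate).
library(AppliedPredictiveModeling)
library(gbm)

data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

# Left-hand settings: bagging fraction = 0.1, learning rate = 0.1
gbmLow <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
              n.trees = 100, interaction.depth = 1,
              shrinkage = 0.1, bag.fraction = 0.1)

# Right-hand settings: bagging fraction = 0.9, learning rate = 0.9
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, interaction.depth = 1,
               shrinkage = 0.9, bag.fraction = 0.9)

# The right-hand model should concentrate relative influence in far fewer predictors
head(summary(gbmLow, plotit = FALSE), 10)
head(summary(gbmHigh, plotit = FALSE), 10)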
library(AppliedPredictiveModeling)
library(caret)
data(ChemicalManufacturingProcess)
cmp <- data.frame(ChemicalManufacturingProcess)
# summary of missing values
paste0(" Missing values: ",sum(is.na(cmp)))
## [1] " Missing values: 106"
#data imputation
cmp_impute <- preProcess(cmp, method = "knnImpute")
cmp_impute <- predict(cmp_impute,cmp)
# summary post data imputation
paste0(" Missing values post data kNN Impute:",sum(is.na(cmp_impute)))
## [1] " Missing values post data kNN Impute:0"
# pre-process to filter predictor with low frequencies
cmp_transform <- cmp_impute[, -nearZeroVar(cmp_impute)]
# Check for skewness
library(e1071)
skewVals <- apply(cmp_transform, 2, skewness)
head(skewVals)
## Yield BiologicalMaterial01 BiologicalMaterial02
## 0.31095956 0.27331650 0.24412685
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## 0.02851075 1.73231530 0.30400532
#data transformation
cmp_boxcox <- preProcess(cmp_transform, method = c("BoxCox","knnImpute"))
cmp_transform <- predict(cmp_boxcox,cmp_transform)
set.seed(100)
# sample data for training
index <- createDataPartition(cmp_transform$Yield, p = .8, list = FALSE)
# train and test data
train_cmp <- cmp_transform[index, ]
test_cmp <- cmp_transform[-index, ]
# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)
#pre process
pre_process <- c("center","scale")
CART test-set performance: RMSE = 0.6943058, Rsquared = 0.4138988, MAE = 0.5543733.
cartModel <- train(train_cmp[ ,-1], train_cmp$Yield,
method = "rpart",
tuneLength = 10,
trControl = trainControl(method = "cv"),
preProcess=pre_process)
cartPred <- predict(cartModel, test_cmp[ ,-1])
postResample(cartPred, test_cmp$Yield)
## RMSE Rsquared MAE
## 0.6943058 0.4138988 0.5543733
Bagged tree test-set performance: RMSE = 0.6011820, Rsquared = 0.5487259, MAE = 0.4892362.
library(ipred)

baggedTree <- ipredbagg(train_cmp$Yield, train_cmp[ ,-1])
baggedPred <- predict(baggedTree, test_cmp[ ,-1])
postResample(baggedPred, test_cmp$Yield)
## RMSE Rsquared MAE
## 0.5377956 0.6597275 0.4358674
Random forest test-set performance: RMSE = 0.5337560, Rsquared = 0.6695344, MAE = 0.4505818.
# The default mtry for regression is the number of predictors divided by 3, about 18 here (56/3).
# mtry is the number of predictors sampled as split candidates at each node.
rfGrid <- expand.grid(mtry=seq(2,38,by=3) )
rfModel <- train(x = train_cmp[ ,-1], y = train_cmp$Yield,
method = "rf",
tuneGrid = rfGrid,
importance = TRUE,
trControl = ctrl,
ntree = 1000,
preProcess=pre_process)
# rfModel <- randomForest(train_cmp[ ,-1], train_cmp$Yield,
# importance = TRUE,
# tuneGrid = rfGrid)
#validation plot
plot(rfModel)
rfModel
## Random Forest
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 129, 131, 128, 132, 129, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.6648184 0.6460245 0.5304425
## 5 0.6136384 0.6847991 0.4861535
## 8 0.6004020 0.6906913 0.4711585
## 11 0.5966490 0.6869064 0.4659671
## 14 0.5980594 0.6807980 0.4638020
## 17 0.5951567 0.6830679 0.4624478
## 20 0.5956728 0.6802582 0.4612384
## 23 0.5939083 0.6803502 0.4569980
## 26 0.5906581 0.6819413 0.4532038
## 29 0.5974513 0.6716757 0.4578175
## 32 0.5952718 0.6743171 0.4548119
## 35 0.5961569 0.6716593 0.4556231
## 38 0.5993002 0.6643732 0.4577768
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 26.
#predict the Yield response for the test data
rfPred <- predict(rfModel, test_cmp[ ,-1])
#compare the predicted values to the test values
postResample(rfPred, test_cmp$Yield)
## RMSE Rsquared MAE
## 0.5329399 0.6711639 0.4516756
The final values used for the model were n.trees = 1000, interaction.depth = 7, shrinkage = 0.01 and n.minobsinnode = 5. GBM test-set performance: RMSE = 0.4232880, Rsquared = 0.7962436, MAE = 0.3474775.
# boosted trees
gbmGrid <- expand.grid(interaction.depth=seq(1,7,by=1),
n.trees=c(100,200,1000),
shrinkage=c(0.01,0.1,0.2),
n.minobsinnode=5)
# Distribution defines the type of loss function that will be optimized during boosting;
# for a continuous response it should be "gaussian".
gbmModel <-train(x = train_cmp[ ,-1], y = train_cmp$Yield,
method = "gbm",
verbose = FALSE,
tuneGrid = gbmGrid,
trControl = ctrl,
preProcess=pre_process)
# gbmModel <- gbm.fit(train_cmp[ ,-1], train_cmp$Yield,
# distribution = "gaussian")
gbmModel
## Stochastic Gradient Boosting
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 128, 130, 128, 130, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.01 1 100 0.7949873 0.5614275 0.6439822
## 0.01 1 200 0.7108323 0.6000413 0.5685745
## 0.01 1 1000 0.6447759 0.6453416 0.4942067
## 0.01 2 100 0.7575022 0.5867715 0.6151400
## 0.01 2 200 0.6776275 0.6242034 0.5360107
## 0.01 2 1000 0.6310216 0.6540692 0.4851540
## 0.01 3 100 0.7356672 0.6201522 0.5940267
## 0.01 3 200 0.6610748 0.6403695 0.5185862
## 0.01 3 1000 0.6180795 0.6675513 0.4740618
## 0.01 4 100 0.7322117 0.6042125 0.5892057
## 0.01 4 200 0.6563940 0.6359813 0.5118489
## 0.01 4 1000 0.6071720 0.6730395 0.4628975
## 0.01 5 100 0.7270426 0.6162607 0.5860861
## 0.01 5 200 0.6540353 0.6370434 0.5074453
## 0.01 5 1000 0.6092770 0.6649472 0.4603749
## 0.01 6 100 0.7140387 0.6320304 0.5721952
## 0.01 6 200 0.6454468 0.6496612 0.5005246
## 0.01 6 1000 0.5922546 0.6873008 0.4483384
## 0.01 7 100 0.7146530 0.6202685 0.5755882
## 0.01 7 200 0.6395350 0.6475126 0.4972033
## 0.01 7 1000 0.5960109 0.6809773 0.4557017
## 0.10 1 100 0.6629602 0.6326668 0.5189936
## 0.10 1 200 0.6763677 0.6213301 0.5252572
## 0.10 1 1000 0.7197298 0.5796031 0.5618885
## 0.10 2 100 0.6499817 0.6303965 0.5132366
## 0.10 2 200 0.6497993 0.6319189 0.5045489
## 0.10 2 1000 0.6549580 0.6277247 0.5010916
## 0.10 3 100 0.6325445 0.6459484 0.4875629
## 0.10 3 200 0.6367840 0.6472474 0.4882845
## 0.10 3 1000 0.6361177 0.6444604 0.4857790
## 0.10 4 100 0.6262783 0.6421746 0.4872065
## 0.10 4 200 0.6230969 0.6445489 0.4829454
## 0.10 4 1000 0.6233832 0.6434539 0.4843281
## 0.10 5 100 0.6100910 0.6714449 0.4747903
## 0.10 5 200 0.6056374 0.6693468 0.4742842
## 0.10 5 1000 0.6075889 0.6660154 0.4760716
## 0.10 6 100 0.6277625 0.6474907 0.4974891
## 0.10 6 200 0.6285653 0.6477977 0.4997165
## 0.10 6 1000 0.6281427 0.6482223 0.5009528
## 0.10 7 100 0.5971046 0.6712353 0.4492362
## 0.10 7 200 0.5967403 0.6684985 0.4497500
## 0.10 7 1000 0.5972372 0.6660236 0.4511523
## 0.20 1 100 0.6584252 0.6242897 0.5282094
## 0.20 1 200 0.6675373 0.6136230 0.5275327
## 0.20 1 1000 0.6813559 0.6052637 0.5336996
## 0.20 2 100 0.6773656 0.6021799 0.5203261
## 0.20 2 200 0.6797596 0.5942099 0.5148179
## 0.20 2 1000 0.6792201 0.5930233 0.5142170
## 0.20 3 100 0.6810359 0.5975683 0.5259855
## 0.20 3 200 0.6779337 0.6019289 0.5270753
## 0.20 3 1000 0.6773681 0.6020045 0.5274302
## 0.20 4 100 0.6413194 0.6298069 0.5082579
## 0.20 4 200 0.6425801 0.6289266 0.5073091
## 0.20 4 1000 0.6423629 0.6291289 0.5069582
## 0.20 5 100 0.6509071 0.6377616 0.4984452
## 0.20 5 200 0.6440456 0.6450716 0.4948838
## 0.20 5 1000 0.6429377 0.6461838 0.4943461
## 0.20 6 100 0.6881063 0.5541484 0.5282064
## 0.20 6 200 0.6868702 0.5557430 0.5252629
## 0.20 6 1000 0.6871605 0.5554477 0.5254399
## 0.20 7 100 0.6590602 0.6068400 0.5135123
## 0.20 7 200 0.6559479 0.6101664 0.5095901
## 0.20 7 1000 0.6557410 0.6104720 0.5094169
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
## 6, shrinkage = 0.01 and n.minobsinnode = 5.
#predict the Yield response for the test data
gbmPred <- predict(gbmModel, test_cmp[ ,-1])
#compare the predicted values to the test values
postResample(gbmPred, test_cmp$Yield)
## RMSE Rsquared MAE
## 0.4125580 0.8054418 0.3384450
Best tune: committees = 100 and neighbors = 1. Cubist test-set performance: RMSE = 0.4115425, Rsquared = 0.8076333, MAE = 0.2910282.
# committees is the number of rule-based models in the ensemble (boosting-like iterations)
# neighbors is an integer between 0 and 9; it adjusts predictions using nearby training points
cubistGrid <- expand.grid(committees = c(100, 200, 1000),
neighbors = c(0, 1, 3, 5, 7, 9))
cubistModel <- train(x = train_cmp[ ,-1], y = train_cmp$Yield,
method = "cubist",
verbose = FALSE,
tuneGrid = cubistGrid,
trControl = ctrl,
preProcess=pre_process)
# cubistModel <- cubist(train_cmp[ ,-1], train_cmp$Yield)
cubistModel$bestTune
## committees neighbors
## 2 100 1
#predict the Yield response for the test data
cubistPred <- predict(cubistModel, test_cmp[ ,-1])
#compare the predicted values to the test values
postResample(cubistPred, test_cmp$Yield)
## RMSE Rsquared MAE
## 0.3998226 0.8162177 0.2786997
Answer:
# Define the data
results_table <- data.frame(
  Model = c("CART", "Bagged", "Random Forest", "GBM", "Cubist"),
  RMSE = c(0.6943058, 0.6011820, 0.5304502, 0.4125580, 0.3998226),
  Rsquared = c(0.4138988, 0.5487259, 0.6762934, 0.8054418, 0.8162177)
)

# Display the table using kable, ordered by Rsquared (worst to best)
library(knitr)
library(dplyr)
kable(arrange(results_table, Rsquared), format = "markdown", align = "l")
| Model         | RMSE      | Rsquared  |
|---------------|-----------|-----------|
| CART          | 0.6943058 | 0.4138988 |
| Bagged        | 0.6011820 | 0.5487259 |
| Random Forest | 0.5304502 | 0.6762934 |
| GBM           | 0.4125580 | 0.8054418 |
| Cubist        | 0.3998226 | 0.8162177 |
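As a cross-check (a sketch), the cross-validation resamples of the caret-trained models can also be compared directly. The bagged model is left out because it was fit with ipredbagg rather than train(), and since the folds were not generated from a shared index the comparison is only approximate.
# Compare resampled CV performance of the train() fits
resamps <- resamples(list(CART = cartModel, RF = rfModel,
                          GBM = gbmModel, Cubist = cubistModel))
summary(resamps)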
Answer:
In the Cubist model, process variables (6) slightly outnumber biological variables (4) among the top 10 predictors. ManufacturingProcess32 is the most important, followed by ManufacturingProcess13 and BiologicalMaterial03.
In the optimal nonlinear model (neural network), the situation is similar: among the top 10, process variables (6) again outnumber biological variables (4). Neither type dominates the list outright, but process variables outnumber biological ones. For example, ManufacturingProcess32 had the highest positive correlation with Yield, followed by BiologicalMaterial06 and BiologicalMaterial03, while ManufacturingProcess13, 17 and 36 were negatively correlated with Yield.
In the optimal linear model (PLS), ManufacturingProcess32 has the highest positive correlation with Yield, followed by ManufacturingProcess09 and BiologicalMaterial02; ManufacturingProcess13, ManufacturingProcess36 and ManufacturingProcess17, which were in the top 5, are negatively correlated with Yield.
Hence the order of importance of the predictors differs across the models.
top_predictors <-data.frame(varImp(cubistModel,scale = TRUE)$importance)
top_predictors |>
arrange(desc(Overall))
## Overall
## ManufacturingProcess32 100.000000
## ManufacturingProcess13 49.438202
## BiologicalMaterial03 48.314607
## ManufacturingProcess09 43.820225
## BiologicalMaterial02 31.460674
## ManufacturingProcess04 30.337079
## BiologicalMaterial06 25.842697
## ManufacturingProcess17 22.471910
## ManufacturingProcess33 20.224719
## BiologicalMaterial04 16.853933
## ManufacturingProcess14 16.853933
## ManufacturingProcess29 16.853933
## ManufacturingProcess10 15.730337
## BiologicalMaterial08 13.483146
## BiologicalMaterial10 12.359551
## BiologicalMaterial09 12.359551
## BiologicalMaterial12 11.235955
## ManufacturingProcess15 11.235955
## ManufacturingProcess39 10.112360
## ManufacturingProcess01 10.112360
## ManufacturingProcess25 10.112360
## BiologicalMaterial11 8.988764
## ManufacturingProcess27 8.988764
## BiologicalMaterial05 7.865169
## ManufacturingProcess28 7.865169
## ManufacturingProcess11 7.865169
## ManufacturingProcess20 5.617978
## ManufacturingProcess26 5.617978
## ManufacturingProcess31 5.617978
## ManufacturingProcess19 4.494382
## ManufacturingProcess18 4.494382
## ManufacturingProcess42 4.494382
## BiologicalMaterial01 3.370787
## ManufacturingProcess22 3.370787
## ManufacturingProcess02 3.370787
## ManufacturingProcess12 3.370787
## ManufacturingProcess21 2.247191
## ManufacturingProcess45 2.247191
## ManufacturingProcess44 2.247191
## ManufacturingProcess23 2.247191
## ManufacturingProcess05 2.247191
## ManufacturingProcess30 2.247191
## ManufacturingProcess06 2.247191
## ManufacturingProcess34 1.123596
## ManufacturingProcess03 1.123596
## ManufacturingProcess07 1.123596
## ManufacturingProcess08 0.000000
## ManufacturingProcess16 0.000000
## ManufacturingProcess24 0.000000
## ManufacturingProcess35 0.000000
## ManufacturingProcess36 0.000000
## ManufacturingProcess37 0.000000
## ManufacturingProcess38 0.000000
## ManufacturingProcess40 0.000000
## ManufacturingProcess41 0.000000
## ManufacturingProcess43 0.000000
top_10_predictors <- top_predictors |>
arrange(desc(Overall)) |>
slice_head(n = 10)
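A quick tally (a sketch) of the process versus biological split described above, using the top_10_predictors table just computed:
# Count ManufacturingProcess vs. BiologicalMaterial predictors in the top 10
table(ifelse(grepl("^Manufacturing", row.names(top_10_predictors)),
             "Process", "Biological"))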
# correlation of Yield with the top 10 predictors
library(corrplot)

cmp_transform |>
  select(all_of(c("Yield", row.names(top_10_predictors)))) |>
  cor() |>
  corrplot()
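The correlations shown in the corrplot can also be read off numerically (a sketch):
# Correlation of each top-10 predictor with Yield, sorted from most positive to most negative
sort(cor(cmp_transform[, row.names(top_10_predictors)], cmp_transform$Yield)[, 1],
     decreasing = TRUE)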
Answer:
The optimal single-tree model used the same ManufacturingProcess32 variable for the first split, indicating that it too found that variable to be important. For the second split it used BiologicalMaterial12, which was not in the top 10 of the Cubist model above but appears lower on that list.
This view of the data is interpretable and provides additional knowledge of the predictors and their relationship with yield.
The first split, ManufacturingProcess32 < 0.199, is an early critical decision point: low values split off a different yield group.
The second split, BiologicalMaterial12 < -0.497, indicates that when ManufacturingProcess32 is low, the level of BiologicalMaterial12 strongly affects Yield.
ManufacturingProcess06 and ManufacturingProcess17 provide important thresholds that control whether yield is good or poor.
High yields (Yield ≈ 1.8) are achieved when ManufacturingProcess32 >= 0.199 and ManufacturingProcess17 < -0.699.
rpartGrid <- expand.grid(maxdepth= seq(1,10,by=1))
rp_model <- train(Yield ~ ., data = train_cmp, method = "rpart2", tuneGrid = rpartGrid,
trControl = ctrl, preProcess=pre_process)
#rpartTree <- rpart(Yield ~ ., data = train_cmp)
plot(as.party(rp_model$finalModel), ip_args = list(abbreviate = 4), gp = gpar(fontsize = 6))
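To see the exact split thresholds referenced above (0.199, -0.497, -0.699), the fitted tree can also be printed in text form (a sketch):
# Text listing of the tuned CART splits and node means
print(rp_model$finalModel)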
#final_cart <- cartModel$finalModel
# Plot the tree
#rpart.plot(final_cart,
# type = 3, # Draw split labels under nodes
# extra = 101, # Show number of obs and mean response at each node
# fallen.leaves = TRUE, # Put leaves at the bottom
# box.palette = "RdBu", # Color nodes by prediction
# tweak = 1.2, # Size of the plot
# main = "Optimal CART Tree with Yield Distribution")