In this assignment, Problems 8.1, 8.2, 8.3, and 8.7 from the Kuhn and Johnson book are solved.
library(mlbench)
library(randomForest)
library(caret)
library(party)
library(dplyr)
library(rpart)
#library(ipred)
library(partykit)
library(AppliedPredictiveModeling)
library(gbm)
library(Cubist)
#library(rpart.plot)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
head(simulated)
## V1 V2 V3 V4 V5 V6 V7
## 1 0.5337724 0.6478064 0.85078526 0.18159957 0.92903976 0.36179060 0.8266609
## 2 0.5837650 0.4381528 0.67272659 0.66924914 0.16379784 0.45305931 0.6489601
## 3 0.5895783 0.5879065 0.40967108 0.33812728 0.89409334 0.02681911 0.1785614
## 4 0.6910399 0.2259548 0.03335447 0.06691274 0.63744519 0.52500637 0.5133614
## 5 0.6673315 0.8188985 0.71676079 0.80324287 0.08306864 0.22344157 0.6644906
## 6 0.8392937 0.3862983 0.64618857 0.86105431 0.63038947 0.43703891 0.3360117
## V8 V9 V10 y
## 1 0.4214081 0.59111440 0.5886216 18.46398
## 2 0.8446239 0.92819306 0.7584008 16.09836
## 3 0.3495908 0.01759542 0.4441185 17.76165
## 4 0.7970260 0.68986918 0.4450716 13.78730
## 5 0.9038919 0.39696995 0.5500808 18.42984
## 6 0.6489177 0.53116033 0.9066182 20.85817
set.seed(200)
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.605365900
## V2 6.831259165
## V3 0.741534943
## V4 7.883384091
## V5 2.244750293
## V6 0.136054182
## V7 0.055950944
## V8 -0.068195812
## V9 0.003196175
## V10 -0.054705900
The negative or near-zero importance scores for V6–V10 indicate that the random forest recognized these predictors as uninformative and did not rely on them meaningfully when making predictions.
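The exercise next adds a predictor that is highly correlated with V1 and refits the forest; that step is not shown above, but the predictor removed below was presumably created along these lines (a sketch following the book's setup, not rerun here):
# Sketch (assumed from the exercise): add a predictor highly correlated with V1
set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)  # correlation should be high (around 0.9)
# Refitting the forest would show V1's importance shared with duplicate1
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
varImp(model2, scale = FALSE)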
# Remove the correlated predictor duplicate1 before refitting with conditional inference trees
simulated <- simulated |> select(-duplicate1)
set.seed(200)
model3 <- cforest(y ~ ., data = simulated,
ntree = 1000)
rfImp3 <- varimp(model3,conditional = TRUE)
rfImp3
## V1 V2 V3 V4 V5 V6
## 6.08762374 5.24473369 0.02006549 6.10961502 1.41465983 -0.19520548
## V7 V8 V9 V10
## -0.17739105 -0.34888062 -0.14389739 -0.20262827
As with the traditional random forest, predictors V1–V5 are again identified as important, and V6–V10, with negative or near-zero scores, are again identified as unimportant for predicting the response by the random forest built from conditional inference trees. However, the ordering of the importance scores changes: with the traditional random forest the top predictor was V1, whereas with conditional inference trees it became V4. The ascending orders of the importance scores for the two approaches are as follows:
Traditional:
order(as.numeric(unlist(rfImp1)))
## [1] 8 10 9 7 6 3 5 2 4 1
Conditional:
order(rfImp3)
## [1] 8 10 6 7 9 3 5 2 1 4
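For readability, the same orderings can be shown with predictor names rather than column indices (a small sketch using the objects computed above):
# Variables from least to most important, by name
rownames(rfImp1)[order(rfImp1$Overall)]   # traditional random forest
names(sort(rfImp3))                       # conditional inference forest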
Boosted trees:
set.seed(100)
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
gbmImp4 <- summary.gbm(gbmModel)
gbmImp4
## var rel.inf
## V4 V4 29.7926789
## V1 V1 26.0431191
## V2 V2 23.7400606
## V5 V5 11.1100184
## V3 V3 8.7680640
## V6 V6 0.3369801
## V8 V8 0.2090789
## V7 V7 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
Cubist:
cubistMod <- cubist(x=simulated[, 1:10],y=simulated$y,committees = 100)
cubistImp<-varImp(cubistMod)
cubistImp
## Overall
## V1 71.5
## V3 47.0
## V2 58.5
## V4 48.0
## V5 33.0
## V6 13.0
## V7 0.0
## V8 0.0
## V9 0.0
## V10 0.0
No, the pattern is not exactly the same for the boosted tree and Cubist models: although both also assign negligible importance to the uninformative predictors V6–V10, the ordering of the informative predictors differs from the random forest results (boosting ranks V4 highest, while Cubist ranks V1 highest).
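The differing orderings can be viewed side by side (a sketch built from the importance objects above):
# Predictors ranked from most to least important under each model
data.frame(
  rf     = rownames(rfImp1)[order(-rfImp1$Overall)],
  gbm    = as.character(gbmImp4$var),                      # already sorted by rel.inf
  cubist = rownames(cubistImp)[order(-cubistImp$Overall)]
)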
set.seed(100)
# Simulating different predictors
V1 <- sample(0:50, 200, replace = TRUE)
V2 <- sample(0:100, 200, replace = TRUE)
V3 <- sample(0:500, 200, replace = TRUE)
# Create response variable
y <- V1 + V2 + V3 + rnorm(200)
# Create simulated dataset
simulated2 <- data.frame(V1, V2, V3, y)
# Predictors' variances
var(V1)
## [1] 224.5988
var(V2)
## [1] 861.805
var(V3)
## [1] 22751.54
set.seed(100)
# fit random forest with simulated data
rfmodel1 <- randomForest(y ~., data = simulated2, importance = TRUE, ntree = 1000)
# See importance
varImp(rfmodel1, scale=FALSE)
## Overall
## V1 374.8066
## V2 1189.1852
## V3 37796.6048
In this simulation, tree bias across predictors of different granularities is examined with a random forest and three predictors, V1, V2, and V3, sampled from ranges of increasing size. V3 has the largest range and therefore the most distinct values, V2 a medium range, and V1 the smallest. The random forest assigns the highest importance to V3 and the lowest to V1, reflecting the tendency of tree-based models to favor predictors with higher granularity: predictors with more distinct values offer more candidate split points, so they are more likely to be selected for splits and to accumulate importance.
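The same bias can be seen in a single tree: a CART model fit to these data also favors the high-granularity predictor in its splits (a sketch; rpart is already loaded above):
# A single regression tree on the simulated data; variable.importance is
# dominated by V3, the predictor with the most distinct values and widest range
single_tree <- rpart(y ~ ., data = simulated2)
single_tree$variable.importance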
The model on the right side of Figure 8.24 uses a higher bagging fraction (0.9) and learning rate (0.9). With a higher bagging fraction, a larger portion of the training data is used to build each tree, so the samples underlying the trees are more alike and a few predictors come to dominate the importance rankings. The model on the left of Figure 8.24 uses a lower bagging fraction (0.1) and learning rate (0.1); the lower learning rate makes each tree's contribution smaller and the model less aggressive, which allows it to spread importance across more predictors.
The model with the lower bagging fraction and learning rate, i.e., the model on the left of Figure 8.24, would likely be more predictive of other samples, since it is less prone to overfitting the training data.
Increasing the interaction depth helps the model capture more complex patterns in the data. In the model with the higher bagging fraction (0.9) and learning rate (0.9), this would make the dominant predictors even more important, so the slope of the predictor-importance plot would become steeper. In the model with the lower bagging fraction (0.1) and learning rate (0.1), the effect would be smaller, since that model already spreads importance more evenly and is less aggressive in favoring particular predictors.
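One way to examine this empirically would be to refit the two settings from Figure 8.24 on the simulated Friedman data and compare how concentrated the relative influence becomes (a sketch; the interaction depth, number of trees, and n.minobsinnode used here are assumptions, not the book's exact settings):
set.seed(100)
gbm_left  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 bag.fraction = 0.1, shrinkage = 0.1,
                 n.trees = 1000, interaction.depth = 3, n.minobsinnode = 5)
gbm_right <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 bag.fraction = 0.9, shrinkage = 0.9,
                 n.trees = 1000, interaction.depth = 3, n.minobsinnode = 5)
summary(gbm_left,  plotit = FALSE)  # influence spread over more predictors
summary(gbm_right, plotit = FALSE)  # influence concentrated in a few predictors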
Get data and do required pre-processing:
data("ChemicalManufacturingProcess")
dim(ChemicalManufacturingProcess)
## [1] 176 58
#head(ChemicalManufacturingProcess)
The data set contains 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) for 176 manufacturing runs; the Yield column contains the percent yield of each run.
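A quick check of that breakdown (a small sketch):
# Count the biological and process predictors by column-name prefix
sum(grepl("^Biological", names(ChemicalManufacturingProcess)))     # 12 biological
sum(grepl("^Manufacturing", names(ChemicalManufacturingProcess)))  # 45 process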
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values:
# Check for missing values in each column
missing_values <- colSums(is.na(ChemicalManufacturingProcess))
#print(missing_values)
# Apply KNN imputation
knn_impute_chemical <- preProcess(ChemicalManufacturingProcess, method=c('knnImpute'))
# Imputed dataset
imputed_chemical_df <- predict(knn_impute_chemical, ChemicalManufacturingProcess)
# Calculate total number of missing values after imputation
total_missing <- sum(is.na(imputed_chemical_df))
#print(total_missing)
Remove near-zero-variance predictors and split the data into training and test sets:
#dim(imputed_chemical_df)
imputed_chemical_df <- imputed_chemical_df %>%
select_at(vars(-one_of(nearZeroVar(., names = TRUE))))
set.seed(100)
train_chemical <-createDataPartition(imputed_chemical_df$Yield, times = 1, p = .70, list = FALSE)
train_chemical_x <- imputed_chemical_df[train_chemical, ][, -c(1)]
test_chemical_x <- imputed_chemical_df[-train_chemical, ][, -c(1)]
train_chemical_y<- imputed_chemical_df[train_chemical, ]$Yield
test_chemical_y <- imputed_chemical_df[-train_chemical, ]$Yield
The optimal linear regression model from the previous homework was ridge regression, which is retuned below:
set.seed(135)
ridgegrid <- data.frame(.lambda = seq(0,0.1,length=15))
ridge_model <- train(x=train_chemical_x,y=train_chemical_y,
method='ridge',
tuneGrid=ridgegrid,
trControl=trainControl(method='cv'),
preProc = c('center','scale')
)
ridge_model
## Ridge Regression
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 111, 112, 112, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 4.054160 0.3911420 1.6139763
## 0.007142857 1.633270 0.5251733 0.8315422
## 0.014285714 1.767593 0.5375635 0.8611365
## 0.021428571 1.777055 0.5452884 0.8623404
## 0.028571429 1.764046 0.5507236 0.8590180
## 0.035714286 1.746125 0.5548928 0.8551684
## 0.042857143 1.727679 0.5582498 0.8507987
## 0.050000000 1.709956 0.5610352 0.8464616
## 0.057142857 1.693284 0.5633933 0.8425123
## 0.064285714 1.677696 0.5654189 0.8388062
## 0.071428571 1.663129 0.5671781 0.8355731
## 0.078571429 1.649496 0.5687194 0.8324824
## 0.085714286 1.636707 0.5700795 0.8295368
## 0.092857143 1.624683 0.5712868 0.8268756
## 0.100000000 1.613350 0.5723639 0.8243647
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
Make predictions on the test set and evaluate the performance of the ridge model:
ridgepred <- predict(ridge_model, newdata=test_chemical_x)
postResample(pred=ridgepred, obs=test_chemical_y)
## RMSE Rsquared MAE
## 1.6115319 0.1753037 0.8482122
Tune the optimal nonlinear regression model (SVM) from the previous homework:
set.seed(100)
svmRTuned <- train(x = train_chemical_x,
y = train_chemical_y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 112, 112, 111, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7950783 0.5019956 0.6355235
## 0.50 0.7206527 0.5675452 0.5786388
## 1.00 0.6620420 0.6391608 0.5329275
## 2.00 0.6199184 0.6705515 0.5039466
## 4.00 0.5997918 0.6797461 0.4884378
## 8.00 0.5916547 0.6873380 0.4857772
## 16.00 0.5916547 0.6873380 0.4857772
## 32.00 0.5916547 0.6873380 0.4857772
## 64.00 0.5916547 0.6873380 0.4857772
## 128.00 0.5916547 0.6873380 0.4857772
## 256.00 0.5916547 0.6873380 0.4857772
## 512.00 0.5916547 0.6873380 0.4857772
## 1024.00 0.5916547 0.6873380 0.4857772
## 2048.00 0.5916547 0.6873380 0.4857772
##
## Tuning parameter 'sigma' was held constant at a value of 0.01447582
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01447582 and C = 8.
Make predictions on the test set and evaluate the SVM model's performance:
svmRPred <- predict(svmRTuned, newdata = test_chemical_x)
postResample(pred = svmRPred, obs = test_chemical_y)
## RMSE Rsquared MAE
## 0.6809040 0.5165981 0.5419960
Fit a Recursive Partitioning Decision Tree:
set.seed(100)
rpartTune <- train(train_chemical_x, train_chemical_y,
method = "rpart2",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv"))
## note: only 9 possible values of the max tree depth from the initial fit.
## Truncating the grid to 9 .
rpartTune
## CART
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 112, 112, 111, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 0.8644371 0.3525877 0.6811590
## 2 0.8624244 0.3584603 0.7020611
## 3 0.8340208 0.4130831 0.6557903
## 4 0.7920130 0.4697482 0.6194825
## 5 0.7807961 0.4888475 0.6076105
## 6 0.7839218 0.4853527 0.5925733
## 7 0.8017760 0.4755253 0.5978002
## 8 0.8127784 0.4666723 0.6052341
## 9 0.8116171 0.4672931 0.6010899
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 5.
Make predictions on the test set and evaluate the performance of the Recursive Partitioning Decision Tree model:
rpartpred <- predict(rpartTune, newdata=test_chemical_x)
postResample(pred=rpartpred, obs=test_chemical_y)
## RMSE Rsquared MAE
## 0.7373570 0.3977679 0.5922980
Fit Random Forest:
set.seed(100)
# Note: randomForest's argument is `ntree` (singular); `ntrees = 1000` is silently ignored here,
# so the default 500 trees are grown (see the printed output below)
rfmodel <- randomForest(train_chemical_x, train_chemical_y, importance = TRUE, ntrees = 1000)
rfmodel
##
## Call:
## randomForest(x = train_chemical_x, y = train_chemical_y, importance = TRUE, ntrees = 1000)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 18
##
## Mean of squared residuals: 0.4016283
## % Var explained: 62.4
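Because ntrees is not a randomForest argument (the correct name is ntree), the fit above used the default 500 trees; a corrected call would look like the following (a sketch, not rerun here):
# Corrected call: `ntree` (singular) actually controls the number of trees grown
set.seed(100)
rfmodel_1000 <- randomForest(train_chemical_x, train_chemical_y,
                             importance = TRUE, ntree = 1000)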
Make predictions on the test set and evaluate the performance of the Random Forest:
rfpred <- predict(rfmodel, newdata=test_chemical_x)
postResample(pred=rfpred, obs=test_chemical_y)
## RMSE Rsquared MAE
## 0.5519336 0.6419386 0.4468046
Fit Boosted Trees:
gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
                       .n.trees = seq(100, 1000, by = 50),
                       .shrinkage = c(0.01, 0.1),
                       .n.minobsinnode = c(10, 20))
set.seed(100)
gbmTune <- train(train_chemical_x, train_chemical_y,
method = "gbm",
preProc = c("center", "scale"),
tuneGrid = gbmGrid,
verbose = FALSE)
gbmTune
## Stochastic Gradient Boosting
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 0.01 1 10 100 0.8619427 0.4961031
## 0.01 1 10 150 0.8147432 0.5128075
## 0.01 1 10 200 0.7835660 0.5246585
## 0.01 1 10 250 0.7629037 0.5335685
## 0.01 1 10 300 0.7505948 0.5389202
## 0.01 1 10 350 0.7417453 0.5419065
## 0.01 1 10 400 0.7366480 0.5436428
## 0.01 1 10 450 0.7332183 0.5448325
## 0.01 1 10 500 0.7299197 0.5462591
## 0.01 1 10 550 0.7274676 0.5472404
## 0.01 1 10 600 0.7251939 0.5480176
## 0.01 1 10 650 0.7233207 0.5488942
## 0.01 1 10 700 0.7222373 0.5494308
## 0.01 1 10 750 0.7200409 0.5511872
## 0.01 1 10 800 0.7194872 0.5514386
## 0.01 1 10 850 0.7182830 0.5518998
## 0.01 1 10 900 0.7178425 0.5521463
## 0.01 1 10 950 0.7175529 0.5519596
## 0.01 1 10 1000 0.7172759 0.5518846
## 0.01 1 20 100 0.8684749 0.4783788
## 0.01 1 20 150 0.8223527 0.4975089
## 0.01 1 20 200 0.7931652 0.5088669
## 0.01 1 20 250 0.7748864 0.5135702
## 0.01 1 20 300 0.7629090 0.5165410
## 0.01 1 20 350 0.7546083 0.5189928
## 0.01 1 20 400 0.7488313 0.5207284
## 0.01 1 20 450 0.7457348 0.5208224
## 0.01 1 20 500 0.7430661 0.5215792
## 0.01 1 20 550 0.7417082 0.5212001
## 0.01 1 20 600 0.7398886 0.5212005
## 0.01 1 20 650 0.7388621 0.5208721
## 0.01 1 20 700 0.7381801 0.5209677
## 0.01 1 20 750 0.7369238 0.5209005
## 0.01 1 20 800 0.7361333 0.5212499
## 0.01 1 20 850 0.7359617 0.5203852
## 0.01 1 20 900 0.7351879 0.5202388
## 0.01 1 20 950 0.7352199 0.5196898
## 0.01 1 20 1000 0.7348395 0.5194283
## 0.01 3 10 100 0.8101889 0.5229983
## 0.01 3 10 150 0.7690982 0.5327018
## 0.01 3 10 200 0.7468371 0.5398614
## 0.01 3 10 250 0.7342148 0.5452599
## 0.01 3 10 300 0.7261570 0.5495239
## 0.01 3 10 350 0.7202948 0.5534397
## 0.01 3 10 400 0.7162083 0.5557302
## 0.01 3 10 450 0.7136344 0.5570454
## 0.01 3 10 500 0.7107801 0.5593219
## 0.01 3 10 550 0.7082383 0.5613770
## 0.01 3 10 600 0.7064542 0.5626714
## 0.01 3 10 650 0.7050697 0.5636954
## 0.01 3 10 700 0.7036995 0.5651753
## 0.01 3 10 750 0.7026979 0.5659385
## 0.01 3 10 800 0.7017441 0.5668405
## 0.01 3 10 850 0.7013269 0.5670750
## 0.01 3 10 900 0.7012487 0.5671458
## 0.01 3 10 950 0.7005533 0.5675776
## 0.01 3 10 1000 0.7000768 0.5680647
## 0.01 3 20 100 0.8552634 0.4900206
## 0.01 3 20 150 0.8104619 0.5027118
## 0.01 3 20 200 0.7844226 0.5094961
## 0.01 3 20 250 0.7686843 0.5131210
## 0.01 3 20 300 0.7579430 0.5170305
## 0.01 3 20 350 0.7525156 0.5172783
## 0.01 3 20 400 0.7479366 0.5185796
## 0.01 3 20 450 0.7442613 0.5202874
## 0.01 3 20 500 0.7421446 0.5204887
## 0.01 3 20 550 0.7399813 0.5207975
## 0.01 3 20 600 0.7390178 0.5205588
## 0.01 3 20 650 0.7375460 0.5210134
## 0.01 3 20 700 0.7356886 0.5222227
## 0.01 3 20 750 0.7351315 0.5215755
## 0.01 3 20 800 0.7344454 0.5216805
## 0.01 3 20 850 0.7349301 0.5204034
## 0.01 3 20 900 0.7345114 0.5201655
## 0.01 3 20 950 0.7342266 0.5197294
## 0.01 3 20 1000 0.7342032 0.5190023
## 0.01 5 10 100 0.8099838 0.5196701
## 0.01 5 10 150 0.7664955 0.5336396
## 0.01 5 10 200 0.7443165 0.5424552
## 0.01 5 10 250 0.7315031 0.5486593
## 0.01 5 10 300 0.7229947 0.5534912
## 0.01 5 10 350 0.7165208 0.5574049
## 0.01 5 10 400 0.7119047 0.5607382
## 0.01 5 10 450 0.7089361 0.5625588
## 0.01 5 10 500 0.7060637 0.5645443
## 0.01 5 10 550 0.7040043 0.5658574
## 0.01 5 10 600 0.7025471 0.5668620
## 0.01 5 10 650 0.7013123 0.5679026
## 0.01 5 10 700 0.7002210 0.5688047
## 0.01 5 10 750 0.6988815 0.5698984
## 0.01 5 10 800 0.6978051 0.5709989
## 0.01 5 10 850 0.6968338 0.5719899
## 0.01 5 10 900 0.6957553 0.5731314
## 0.01 5 10 950 0.6956943 0.5732012
## 0.01 5 10 1000 0.6951357 0.5737641
## 0.01 5 20 100 0.8551111 0.4897872
## 0.01 5 20 150 0.8087220 0.5046218
## 0.01 5 20 200 0.7825776 0.5116426
## 0.01 5 20 250 0.7669653 0.5147037
## 0.01 5 20 300 0.7570251 0.5178158
## 0.01 5 20 350 0.7512534 0.5190385
## 0.01 5 20 400 0.7469635 0.5202314
## 0.01 5 20 450 0.7440647 0.5201334
## 0.01 5 20 500 0.7417145 0.5206497
## 0.01 5 20 550 0.7394494 0.5218859
## 0.01 5 20 600 0.7384177 0.5214764
## 0.01 5 20 650 0.7366376 0.5218347
## 0.01 5 20 700 0.7363315 0.5207254
## 0.01 5 20 750 0.7354775 0.5208958
## 0.01 5 20 800 0.7344882 0.5213555
## 0.01 5 20 850 0.7341582 0.5210159
## 0.01 5 20 900 0.7340244 0.5203763
## 0.01 5 20 950 0.7341274 0.5196927
## 0.01 5 20 1000 0.7335606 0.5197196
## 0.01 7 10 100 0.8082362 0.5248709
## 0.01 7 10 150 0.7646140 0.5388778
## 0.01 7 10 200 0.7416104 0.5484260
## 0.01 7 10 250 0.7283160 0.5540308
## 0.01 7 10 300 0.7203198 0.5583439
## 0.01 7 10 350 0.7154354 0.5604361
## 0.01 7 10 400 0.7113687 0.5632397
## 0.01 7 10 450 0.7077097 0.5664971
## 0.01 7 10 500 0.7054048 0.5679274
## 0.01 7 10 550 0.7032892 0.5693644
## 0.01 7 10 600 0.7017489 0.5702326
## 0.01 7 10 650 0.7004403 0.5710407
## 0.01 7 10 700 0.6997275 0.5714053
## 0.01 7 10 750 0.6989119 0.5719676
## 0.01 7 10 800 0.6981644 0.5725451
## 0.01 7 10 850 0.6974258 0.5732495
## 0.01 7 10 900 0.6967265 0.5739977
## 0.01 7 10 950 0.6965644 0.5742331
## 0.01 7 10 1000 0.6961831 0.5745076
## 0.01 7 20 100 0.8552297 0.4883830
## 0.01 7 20 150 0.8101163 0.5014834
## 0.01 7 20 200 0.7845376 0.5077401
## 0.01 7 20 250 0.7684001 0.5127391
## 0.01 7 20 300 0.7579139 0.5158798
## 0.01 7 20 350 0.7516809 0.5175866
## 0.01 7 20 400 0.7480381 0.5181748
## 0.01 7 20 450 0.7449675 0.5190462
## 0.01 7 20 500 0.7431517 0.5188939
## 0.01 7 20 550 0.7417291 0.5190313
## 0.01 7 20 600 0.7402981 0.5190832
## 0.01 7 20 650 0.7385103 0.5199205
## 0.01 7 20 700 0.7377899 0.5198690
## 0.01 7 20 750 0.7367947 0.5198067
## 0.01 7 20 800 0.7364050 0.5192006
## 0.01 7 20 850 0.7355238 0.5194831
## 0.01 7 20 900 0.7347639 0.5197915
## 0.01 7 20 950 0.7347416 0.5192253
## 0.01 7 20 1000 0.7350005 0.5184241
## 0.10 1 10 100 0.7275323 0.5352510
## 0.10 1 10 150 0.7279266 0.5323091
## 0.10 1 10 200 0.7295162 0.5310154
## 0.10 1 10 250 0.7338571 0.5269244
## 0.10 1 10 300 0.7334167 0.5279079
## 0.10 1 10 350 0.7369279 0.5244824
## 0.10 1 10 400 0.7394030 0.5216317
## 0.10 1 10 450 0.7408112 0.5203014
## 0.10 1 10 500 0.7401792 0.5214431
## 0.10 1 10 550 0.7418537 0.5199436
## 0.10 1 10 600 0.7439024 0.5180999
## 0.10 1 10 650 0.7446695 0.5173160
## 0.10 1 10 700 0.7455351 0.5161426
## 0.10 1 10 750 0.7462623 0.5152511
## 0.10 1 10 800 0.7464859 0.5152783
## 0.10 1 10 850 0.7474904 0.5145378
## 0.10 1 10 900 0.7478471 0.5139936
## 0.10 1 10 950 0.7487633 0.5131569
## 0.10 1 10 1000 0.7489365 0.5129501
## 0.10 1 20 100 0.7431509 0.5119040
## 0.10 1 20 150 0.7488784 0.5003224
## 0.10 1 20 200 0.7504600 0.4988738
## 0.10 1 20 250 0.7550166 0.4944737
## 0.10 1 20 300 0.7602555 0.4883015
## 0.10 1 20 350 0.7629998 0.4868599
## 0.10 1 20 400 0.7673899 0.4827199
## 0.10 1 20 450 0.7712596 0.4792313
## 0.10 1 20 500 0.7715267 0.4794554
## 0.10 1 20 550 0.7742775 0.4773005
## 0.10 1 20 600 0.7758234 0.4762470
## 0.10 1 20 650 0.7768118 0.4759240
## 0.10 1 20 700 0.7773467 0.4752894
## 0.10 1 20 750 0.7778800 0.4751854
## 0.10 1 20 800 0.7785847 0.4749088
## 0.10 1 20 850 0.7795916 0.4746016
## 0.10 1 20 900 0.7810432 0.4732491
## 0.10 1 20 950 0.7814590 0.4733112
## 0.10 1 20 1000 0.7828688 0.4722748
## 0.10 3 10 100 0.7088970 0.5526858
## 0.10 3 10 150 0.7056364 0.5571894
## 0.10 3 10 200 0.7055293 0.5584932
## 0.10 3 10 250 0.7056879 0.5582212
## 0.10 3 10 300 0.7054920 0.5586714
## 0.10 3 10 350 0.7056000 0.5587617
## 0.10 3 10 400 0.7054466 0.5591601
## 0.10 3 10 450 0.7057178 0.5589761
## 0.10 3 10 500 0.7053254 0.5593909
## 0.10 3 10 550 0.7053087 0.5595411
## 0.10 3 10 600 0.7050890 0.5598573
## 0.10 3 10 650 0.7051730 0.5598414
## 0.10 3 10 700 0.7050711 0.5599593
## 0.10 3 10 750 0.7050454 0.5600081
## 0.10 3 10 800 0.7050121 0.5600747
## 0.10 3 10 850 0.7050387 0.5600621
## 0.10 3 10 900 0.7050327 0.5600945
## 0.10 3 10 950 0.7050146 0.5601346
## 0.10 3 10 1000 0.7050362 0.5601241
## 0.10 3 20 100 0.7392738 0.5117369
## 0.10 3 20 150 0.7406613 0.5084034
## 0.10 3 20 200 0.7455712 0.5031134
## 0.10 3 20 250 0.7485909 0.5018524
## 0.10 3 20 300 0.7504264 0.5011576
## 0.10 3 20 350 0.7537975 0.4991533
## 0.10 3 20 400 0.7543302 0.4982676
## 0.10 3 20 450 0.7552320 0.4980858
## 0.10 3 20 500 0.7563803 0.4977284
## 0.10 3 20 550 0.7576535 0.4966238
## 0.10 3 20 600 0.7581226 0.4970064
## 0.10 3 20 650 0.7592556 0.4961807
## 0.10 3 20 700 0.7602633 0.4957032
## 0.10 3 20 750 0.7614354 0.4947545
## 0.10 3 20 800 0.7617726 0.4944720
## 0.10 3 20 850 0.7614385 0.4950398
## 0.10 3 20 900 0.7619100 0.4946527
## 0.10 3 20 950 0.7621800 0.4944959
## 0.10 3 20 1000 0.7623670 0.4945085
## 0.10 5 10 100 0.7096003 0.5572463
## 0.10 5 10 150 0.7086674 0.5575044
## 0.10 5 10 200 0.7077940 0.5589958
## 0.10 5 10 250 0.7083593 0.5585496
## 0.10 5 10 300 0.7083080 0.5588205
## 0.10 5 10 350 0.7081298 0.5591577
## 0.10 5 10 400 0.7083298 0.5591138
## 0.10 5 10 450 0.7084492 0.5591287
## 0.10 5 10 500 0.7084908 0.5592184
## 0.10 5 10 550 0.7083985 0.5593860
## 0.10 5 10 600 0.7084622 0.5593270
## 0.10 5 10 650 0.7085878 0.5592666
## 0.10 5 10 700 0.7086858 0.5592076
## 0.10 5 10 750 0.7087874 0.5591617
## 0.10 5 10 800 0.7088329 0.5591629
## 0.10 5 10 850 0.7088602 0.5591647
## 0.10 5 10 900 0.7088670 0.5591817
## 0.10 5 10 950 0.7089035 0.5591615
## 0.10 5 10 1000 0.7089559 0.5591108
## 0.10 5 20 100 0.7451326 0.5042425
## 0.10 5 20 150 0.7436587 0.5053513
## 0.10 5 20 200 0.7481337 0.5007987
## 0.10 5 20 250 0.7501660 0.4999680
## 0.10 5 20 300 0.7521845 0.4983645
## 0.10 5 20 350 0.7565896 0.4945845
## 0.10 5 20 400 0.7572699 0.4944394
## 0.10 5 20 450 0.7586502 0.4940749
## 0.10 5 20 500 0.7585619 0.4944812
## 0.10 5 20 550 0.7611195 0.4922538
## 0.10 5 20 600 0.7612814 0.4922562
## 0.10 5 20 650 0.7617362 0.4920931
## 0.10 5 20 700 0.7622562 0.4913072
## 0.10 5 20 750 0.7623991 0.4914863
## 0.10 5 20 800 0.7630963 0.4912227
## 0.10 5 20 850 0.7634993 0.4911920
## 0.10 5 20 900 0.7632487 0.4915922
## 0.10 5 20 950 0.7635985 0.4913546
## 0.10 5 20 1000 0.7642794 0.4906482
## 0.10 7 10 100 0.7083174 0.5552304
## 0.10 7 10 150 0.7054586 0.5596078
## 0.10 7 10 200 0.7048000 0.5612095
## 0.10 7 10 250 0.7042707 0.5620460
## 0.10 7 10 300 0.7043668 0.5624224
## 0.10 7 10 350 0.7043567 0.5626495
## 0.10 7 10 400 0.7039353 0.5632376
## 0.10 7 10 450 0.7041271 0.5631395
## 0.10 7 10 500 0.7041667 0.5631268
## 0.10 7 10 550 0.7041838 0.5632392
## 0.10 7 10 600 0.7041713 0.5632585
## 0.10 7 10 650 0.7042707 0.5631922
## 0.10 7 10 700 0.7042538 0.5633161
## 0.10 7 10 750 0.7042096 0.5633967
## 0.10 7 10 800 0.7042487 0.5633671
## 0.10 7 10 850 0.7042830 0.5633789
## 0.10 7 10 900 0.7043372 0.5633227
## 0.10 7 10 950 0.7043226 0.5633839
## 0.10 7 10 1000 0.7043076 0.5634143
## 0.10 7 20 100 0.7472499 0.4985965
## 0.10 7 20 150 0.7483026 0.4959972
## 0.10 7 20 200 0.7543290 0.4900388
## 0.10 7 20 250 0.7585461 0.4861161
## 0.10 7 20 300 0.7591736 0.4862691
## 0.10 7 20 350 0.7613460 0.4843779
## 0.10 7 20 400 0.7600581 0.4871606
## 0.10 7 20 450 0.7622616 0.4853936
## 0.10 7 20 500 0.7643479 0.4835444
## 0.10 7 20 550 0.7651031 0.4830860
## 0.10 7 20 600 0.7660236 0.4825360
## 0.10 7 20 650 0.7665230 0.4824743
## 0.10 7 20 700 0.7668062 0.4823541
## 0.10 7 20 750 0.7670085 0.4824451
## 0.10 7 20 800 0.7676877 0.4821189
## 0.10 7 20 850 0.7681026 0.4818954
## 0.10 7 20 900 0.7682819 0.4818895
## 0.10 7 20 950 0.7679894 0.4819427
## 0.10 7 20 1000 0.7684987 0.4818401
## MAE
## 0.6693536
## 0.6274496
## 0.5990362
## 0.5806300
## 0.5692617
## 0.5618710
## 0.5578578
## 0.5549705
## 0.5522950
## 0.5500891
## 0.5481299
## 0.5463266
## 0.5455160
## 0.5437695
## 0.5435762
## 0.5429259
## 0.5427975
## 0.5427577
## 0.5426541
## 0.6810479
## 0.6405847
## 0.6143237
## 0.5971396
## 0.5859434
## 0.5782820
## 0.5725587
## 0.5694167
## 0.5666103
## 0.5649419
## 0.5628340
## 0.5611790
## 0.5599578
## 0.5583834
## 0.5571357
## 0.5567332
## 0.5558540
## 0.5557011
## 0.5555212
## 0.6205456
## 0.5835808
## 0.5622831
## 0.5508115
## 0.5437736
## 0.5390406
## 0.5363979
## 0.5344535
## 0.5323755
## 0.5304821
## 0.5294357
## 0.5283121
## 0.5277132
## 0.5270900
## 0.5266832
## 0.5265111
## 0.5268211
## 0.5265770
## 0.5265215
## 0.6695530
## 0.6300367
## 0.6064017
## 0.5909745
## 0.5803969
## 0.5750286
## 0.5708530
## 0.5670699
## 0.5648574
## 0.5628058
## 0.5613395
## 0.5595374
## 0.5577044
## 0.5569553
## 0.5562781
## 0.5567306
## 0.5564601
## 0.5561453
## 0.5559356
## 0.6197370
## 0.5796928
## 0.5603241
## 0.5494230
## 0.5429427
## 0.5378745
## 0.5340388
## 0.5317976
## 0.5295642
## 0.5280563
## 0.5267435
## 0.5260471
## 0.5254545
## 0.5246981
## 0.5239339
## 0.5237143
## 0.5231572
## 0.5233957
## 0.5232101
## 0.6699282
## 0.6279786
## 0.6043387
## 0.5885560
## 0.5788536
## 0.5733976
## 0.5692084
## 0.5662876
## 0.5638733
## 0.5616239
## 0.5605242
## 0.5588479
## 0.5582222
## 0.5573152
## 0.5565258
## 0.5561486
## 0.5559258
## 0.5561572
## 0.5557056
## 0.6177703
## 0.5781390
## 0.5574636
## 0.5452604
## 0.5388041
## 0.5349349
## 0.5319535
## 0.5291170
## 0.5270831
## 0.5255024
## 0.5247650
## 0.5240813
## 0.5236829
## 0.5232816
## 0.5230103
## 0.5227653
## 0.5225928
## 0.5227819
## 0.5227327
## 0.6695407
## 0.6287360
## 0.6045540
## 0.5889442
## 0.5794382
## 0.5735577
## 0.5703986
## 0.5676093
## 0.5654919
## 0.5641587
## 0.5625836
## 0.5610258
## 0.5599017
## 0.5581880
## 0.5579995
## 0.5569359
## 0.5566738
## 0.5565242
## 0.5568133
## 0.5535232
## 0.5565581
## 0.5597103
## 0.5631183
## 0.5644581
## 0.5681129
## 0.5711066
## 0.5719502
## 0.5720739
## 0.5736196
## 0.5755803
## 0.5759443
## 0.5766939
## 0.5778021
## 0.5782349
## 0.5790959
## 0.5794930
## 0.5806018
## 0.5808018
## 0.5635400
## 0.5685202
## 0.5702298
## 0.5751029
## 0.5797174
## 0.5845101
## 0.5893326
## 0.5937401
## 0.5955435
## 0.5988326
## 0.6013750
## 0.6027988
## 0.6040039
## 0.6050816
## 0.6060279
## 0.6076136
## 0.6093571
## 0.6103493
## 0.6120680
## 0.5395250
## 0.5389886
## 0.5394795
## 0.5397209
## 0.5399441
## 0.5400663
## 0.5401641
## 0.5405018
## 0.5403660
## 0.5404457
## 0.5403869
## 0.5405612
## 0.5405987
## 0.5406563
## 0.5407018
## 0.5407625
## 0.5407826
## 0.5408205
## 0.5408707
## 0.5610795
## 0.5645953
## 0.5690792
## 0.5731626
## 0.5761319
## 0.5795435
## 0.5813704
## 0.5827195
## 0.5844897
## 0.5862410
## 0.5877619
## 0.5893819
## 0.5902934
## 0.5916969
## 0.5923694
## 0.5926506
## 0.5930527
## 0.5935363
## 0.5942392
## 0.5371885
## 0.5389483
## 0.5405084
## 0.5424078
## 0.5426563
## 0.5427944
## 0.5433265
## 0.5437527
## 0.5438916
## 0.5439480
## 0.5441288
## 0.5442476
## 0.5443388
## 0.5444813
## 0.5445662
## 0.5446081
## 0.5446137
## 0.5446503
## 0.5447035
## 0.5639665
## 0.5654285
## 0.5717677
## 0.5754112
## 0.5784644
## 0.5835478
## 0.5840624
## 0.5861608
## 0.5872658
## 0.5904127
## 0.5914346
## 0.5921018
## 0.5931346
## 0.5937509
## 0.5949836
## 0.5960804
## 0.5959579
## 0.5968853
## 0.5978519
## 0.5351406
## 0.5361751
## 0.5371364
## 0.5380257
## 0.5384851
## 0.5387527
## 0.5387313
## 0.5390157
## 0.5392294
## 0.5393146
## 0.5393858
## 0.5395080
## 0.5395514
## 0.5396037
## 0.5397062
## 0.5397496
## 0.5398181
## 0.5398299
## 0.5398482
## 0.5683145
## 0.5693486
## 0.5755584
## 0.5797045
## 0.5817843
## 0.5846819
## 0.5852263
## 0.5872347
## 0.5900088
## 0.5918224
## 0.5932891
## 0.5941301
## 0.5944857
## 0.5952697
## 0.5961450
## 0.5966344
## 0.5970321
## 0.5970143
## 0.5977415
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
## 5, shrinkage = 0.01 and n.minobsinnode = 10.
Make predictions on the test set and evaluate the performance of the Boosted Trees:
gbmpred <- predict(gbmTune, newdata=test_chemical_x)
postResample(pred=gbmpred, obs=test_chemical_y)
## RMSE Rsquared MAE
## 0.6095757 0.6183966 0.4866248
Fit Cubist model:
cubistGrid <- expand.grid(committees = c(1, 5, 10),
neighbors = c(0, 1, 3, 5))
set.seed(100)
cubistTune <- train(train_chemical_x, train_chemical_y,
method = "cubist",
preProc = c("center", "scale"),
tuneGrid = cubistGrid,
verbose = FALSE)
cubistTune
## Cubist
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.0002956 0.3554524 0.7248086
## 1 1 0.9899484 0.3824150 0.7010254
## 1 3 0.9770161 0.3818919 0.6939003
## 1 5 0.9829963 0.3743484 0.7036435
## 5 0 0.7638949 0.5146604 0.5882052
## 5 1 0.7386180 0.5445762 0.5574313
## 5 3 0.7401463 0.5381512 0.5619807
## 5 5 0.7449943 0.5324229 0.5690466
## 10 0 0.7207429 0.5569843 0.5578867
## 10 1 0.6919859 0.5889563 0.5218133
## 10 3 0.6947846 0.5836795 0.5306527
## 10 5 0.7001805 0.5775887 0.5379893
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 10 and neighbors = 1.
Make predictions on the test set and evaluate the performance of the Cubist model:
cubistpred <- predict(cubistTune, newdata=test_chemical_x)
postResample(pred=cubistpred, obs=test_chemical_y)
## RMSE Rsquared MAE
## 0.7045971 0.5916504 0.4809221
Combine results of different models into a single table:
# Get results
#ridge_res <- postResample(pred=ridgepred, obs=test_chemical_y)
#svm_res<-postResample(pred = svmRPred, obs = test_chemical_y)
rpart_res<- postResample(pred=rpartpred, obs=test_chemical_y)
rf_res <- postResample(pred=rfpred, obs=test_chemical_y)
gbm_res<- postResample(pred=gbmpred, obs=test_chemical_y)
cubist_res<-postResample(pred=cubistpred, obs=test_chemical_y)
# Combine into a single table
all_results<- rbind(
#ridge_linear = ridge_res,
#svm_nonlinear = svm_res,
rpart = rpart_res,
"random forest" = rf_res,
gbm = gbm_res,
cubist= cubist_res
)
# Convert to a data frame
results <- as.data.frame(all_results)
# See results
print(results)
## RMSE Rsquared MAE
## rpart 0.7373570 0.3977679 0.5922980
## random forest 0.5519336 0.6419386 0.4468046
## gbm 0.6095757 0.6183966 0.4866248
## cubist 0.7045971 0.5916504 0.4809221
The Random Forest model is observed to perform best, as it has the lowest RMSE value and the highest R-squared value among all models.
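For convenience, the same table can be sorted by test-set RMSE (a small sketch):
# Models ordered from best (smallest RMSE) to worst
results[order(results$RMSE), ]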
Top 10 predictors for Random Forest:
# Get predictors importance from random forest model
imp_scores <- varImp(rfmodel)
top10 <- head(imp_scores[order(-imp_scores$Overall), , drop = FALSE], 10)
top10
## Overall
## ManufacturingProcess32 21.930907
## BiologicalMaterial12 10.243346
## ManufacturingProcess31 9.851923
## ManufacturingProcess17 9.773773
## ManufacturingProcess13 8.650226
## ManufacturingProcess36 7.739404
## ManufacturingProcess09 7.258026
## BiologicalMaterial06 7.092092
## BiologicalMaterial03 7.089226
## BiologicalMaterial02 6.648094
Most important predictors for the optimal linear model (ridge):
varImp(ridge_model)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 94.64
## BiologicalMaterial06 76.19
## ManufacturingProcess17 73.26
## ManufacturingProcess31 69.73
## ManufacturingProcess09 69.65
## BiologicalMaterial12 66.34
## ManufacturingProcess36 66.21
## BiologicalMaterial02 66.06
## BiologicalMaterial03 63.32
## ManufacturingProcess11 57.92
## ManufacturingProcess06 53.69
## ManufacturingProcess30 50.51
## BiologicalMaterial04 50.26
## ManufacturingProcess29 44.79
## BiologicalMaterial08 44.11
## BiologicalMaterial09 42.22
## BiologicalMaterial11 39.01
## ManufacturingProcess33 38.69
## ManufacturingProcess02 37.76
Most important predictors for the optimal nonlinear model (SVM):
varImp(svmRTuned)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 94.64
## BiologicalMaterial06 76.19
## ManufacturingProcess17 73.26
## ManufacturingProcess31 69.73
## ManufacturingProcess09 69.65
## BiologicalMaterial12 66.34
## ManufacturingProcess36 66.21
## BiologicalMaterial02 66.06
## BiologicalMaterial03 63.32
## ManufacturingProcess11 57.92
## ManufacturingProcess06 53.69
## ManufacturingProcess30 50.51
## BiologicalMaterial04 50.26
## ManufacturingProcess29 44.79
## BiologicalMaterial08 44.11
## BiologicalMaterial09 42.22
## BiologicalMaterial11 39.01
## ManufacturingProcess33 38.69
## ManufacturingProcess02 37.76
The top five predictors for the Random Forest model, namely ManufacturingProcess32, BiologicalMaterial12, ManufacturingProcess31, ManufacturingProcess17, and ManufacturingProcess13, show that process variables dominate its list. Across all three models (random forest, optimal linear (ridge), and optimal nonlinear (SVM)), the set of top-10 variables is the same; only the ordering of nine of them differs, while the top predictor, ManufacturingProcess32, is consistently ranked first. Process variables also dominate the importance lists of all three models. (The ridge and SVM listings are identical because caret falls back to the same model-free, loess R-squared importance filter for both model types.)
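The overlap of the top-10 sets can be checked directly (a sketch using the objects computed above):
# Compare the top-10 predictor sets of the random forest and the ridge model
rf_top10    <- rownames(top10)
ridge_imp   <- varImp(ridge_model)$importance
ridge_top10 <- rownames(ridge_imp)[order(-ridge_imp$Overall)][1:10]
intersect(rf_top10, ridge_top10)  # the ten names appearing in both sets
setdiff(rf_top10, ridge_top10)    # character(0) when the sets coincide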
Plot the optimal single Recursive Partitioning Decision Tree:
plot(as.party(rpartTune$finalModel),gp=gpar(fontsize=8))
This view of the data provides valuable insight into the roles of the biological and process predictors in determining yield. The plot shows that the tree's root node splits on ManufacturingProcess32, so a process variable produces the first, broadest separation of the runs. Further down the tree, biological variables appear more often, and seven of the ten terminal nodes are reached through splits on biological variables. This suggests that the biological variables are important for the more detailed, specific predictions of yield.