In this assignment, the problems 6.2 and 6.3 have been solved from the Kuhn and Johnson book.
library(AppliedPredictiveModeling)
library(lars)
## Loaded lars 1.3
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(MASS)
library(elasticnet)
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
library(RANN)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
data(permeability)
#head(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
dim(fingerprints)
## [1] 165 1107
# Find out low frequency column numbers
pre_lf <- nearZeroVar(fingerprints)
# Remove low frequency columns
pred1 <- fingerprints[,-pre_lf]
dim(pred1)
## [1] 165 388
The number of predictors remaining after filtering out those with low frequency is 388.
set.seed(123)
# Create a 70-30 train-test split
split_perm<- sample(c(rep(0, 0.7 * nrow(permeability)),
rep(1, 0.3 * nrow(permeability))))
finger_train <- pred1[split_perm == 0,]
finger_test <- pred1[split_perm == 1,]
perm_train <- permeability[split_perm == 0]
perm_test <- permeability[split_perm == 1]
# Get tuned PLS model
pls_model <- train(x=finger_train, y=perm_train,
method='pls', metric='Rsquared',
tuneLength=20,
trControl=trainControl(method='cv'),
preProcess=c('center', 'scale')
)
pls_model
## Partial Least Squares
##
## 115 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 103, 103, 104, 104, 105, 103, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.13277 0.4070822 10.475643
## 2 11.26798 0.5349631 8.494166
## 3 11.69540 0.5137372 9.164204
## 4 11.66819 0.5037188 9.138646
## 5 11.77796 0.5151984 8.836781
## 6 12.15963 0.4809347 9.316472
## 7 12.28613 0.4808918 9.371363
## 8 12.21954 0.4932274 9.077101
## 9 12.05612 0.5100212 8.958103
## 10 12.21703 0.5106510 9.034134
## 11 12.21932 0.5177616 9.209332
## 12 12.12359 0.5155698 9.220382
## 13 12.11052 0.5071781 9.184265
## 14 12.17779 0.4870462 9.254787
## 15 12.22873 0.4889741 9.194144
## 16 12.31458 0.4822532 9.259095
## 17 12.51942 0.4729862 9.388909
## 18 12.72522 0.4637895 9.499965
## 19 13.05632 0.4484861 9.711818
## 20 13.15870 0.4469599 9.692208
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 2.
The optimal number of latent variables (ncomp) is 5 and the corresponding re-sampled R-squared value is 0.524 and the RMSE value is 11.27, which is the smallest in this set. This indicates that with these 5 components, the model achieves its highest R-squared and captures the most variance in the data among the evaluated models.
pls_model$results|> filter(ncomp == 5)
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 5 11.77796 0.5151984 8.836781 4.477242 0.2961044 3.652371
# Generate predictions using the PLS model on test data
pls_finger_predict <- predict(pls_model,newdata=finger_test)
# Evaluate model performance with resampling metrics
pls<-postResample(pred=pls_finger_predict, obs=perm_test)
pls
## RMSE Rsquared MAE
## 12.5447910 0.3434813 8.1748256
The test set estimate of R-squared is 0.41013. This value is lower than the training set R-squared value of 0.5242.
set.seed(246)
# Find custom set of penalties
ridge_grid <- data.frame(.lambda = seq(0, .1, length = 15))
# Fit ridge regression model
ridge_model <- train(x=finger_train,
y=perm_train,
method='ridge',
tuneGrid=ridge_grid,#Fit the model over many penalty values
trControl=trainControl(method='cv'),
preProcess=c('center','scale')# put the predictors on the same scale
)
## Warning: model fit failed for Fold01: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold03: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold05: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
ridge_model
## Ridge Regression
##
## 115 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 104, 103, 103, 103, 103, 104, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 55.21114 0.4752174 44.911261
## 0.007142857 928.06887 0.4488383 679.486669
## 0.014285714 13.40101 0.4635154 9.915916
## 0.021428571 12.87662 0.4757953 9.587425
## 0.028571429 12.55192 0.4905130 9.429138
## 0.035714286 12.25474 0.5007713 9.207660
## 0.042857143 12.07771 0.5078405 9.096379
## 0.050000000 11.94192 0.5135906 9.006202
## 0.057142857 11.84380 0.5177573 8.941061
## 0.064285714 11.72812 0.5242420 8.868106
## 0.071428571 11.64596 0.5289827 8.796775
## 0.078571429 11.60338 0.5314857 8.771784
## 0.085714286 11.59480 0.5324738 8.789822
## 0.092857143 11.55570 0.5352570 8.767197
## 0.100000000 11.53473 0.5375239 8.758483
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
The optimal ridge regression model was selected based on the smallest RMSE value of 11.535, with a final lambda value of 0.1. The corresponding R-squred value at this point is 0.54.
Predict the response for the test set by ridge model:
# Generate predictions using the Ridge model on test data
ridge_finger_predict <- predict(ridge_model,newdata=finger_test)
# Evaluate model performance with resampling metrics
ridge<-postResample(pred=ridge_finger_predict, obs=perm_test)
ridge
## RMSE Rsquared MAE
## 12.3120682 0.3923648 9.5858249
set.seed(246)
# Grid search for optimized penalty lambdas
enet_grid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
# Fit penalized regression model
enet_model <- train(finger_train, perm_train,
method = "enet",
tuneGrid = enet_grid,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale"))
## Warning: model fit failed for Fold01: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold03: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold05: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
enet_model
## Elasticnet
##
## 115 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 104, 103, 103, 103, 103, 104, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 14.151111 0.6312289 10.421341
## 0.00 0.10 15.943787 0.6279431 11.513413
## 0.00 0.15 17.769514 0.6228903 12.685448
## 0.00 0.20 19.262009 0.6351900 13.708168
## 0.00 0.25 20.873851 0.6394342 15.147687
## 0.00 0.30 22.708174 0.6383821 16.801769
## 0.00 0.35 24.738976 0.6267321 18.817938
## 0.00 0.40 26.816843 0.6124962 20.856474
## 0.00 0.45 28.967773 0.5996391 22.886737
## 0.00 0.50 31.185890 0.5833154 24.905094
## 0.00 0.55 33.515329 0.5678710 26.862720
## 0.00 0.60 35.805584 0.5557674 28.764330
## 0.00 0.65 38.087091 0.5453041 30.673345
## 0.00 0.70 40.518435 0.5331935 32.782948
## 0.00 0.75 43.061828 0.5208817 35.003538
## 0.00 0.80 45.559322 0.5130966 37.093992
## 0.00 0.85 47.978368 0.5070999 39.034944
## 0.00 0.90 50.626297 0.4882228 41.254331
## 0.00 0.95 52.918592 0.4813814 43.077956
## 0.00 1.00 55.211142 0.4752174 44.911261
## 0.01 0.05 29.378745 0.4565886 21.518862
## 0.01 0.10 47.614456 0.4929702 35.104099
## 0.01 0.15 66.416995 0.5041494 49.788016
## 0.01 0.20 87.060071 0.5140643 65.123568
## 0.01 0.25 106.145482 0.5233051 78.704054
## 0.01 0.30 124.055199 0.5122281 90.790054
## 0.01 0.35 142.433777 0.5026247 103.011489
## 0.01 0.40 161.095550 0.4910332 115.336788
## 0.01 0.45 179.997372 0.4791362 127.659326
## 0.01 0.50 199.239356 0.4679310 140.059035
## 0.01 0.55 218.706830 0.4553725 152.481449
## 0.01 0.60 238.284789 0.4399234 167.215278
## 0.01 0.65 257.900953 0.4268302 182.097605
## 0.01 0.70 277.589834 0.4157174 196.961145
## 0.01 0.75 297.259016 0.4071169 211.756895
## 0.01 0.80 316.917593 0.3998459 226.531946
## 0.01 0.85 336.609509 0.3921753 241.321548
## 0.01 0.90 357.036416 0.3850288 256.659349
## 0.01 0.95 378.064341 0.3786432 272.412310
## 0.01 1.00 399.458742 0.3738213 288.393964
## 0.10 0.05 12.155275 0.5318898 9.228464
## 0.10 0.10 11.540557 0.5397695 8.158448
## 0.10 0.15 11.266053 0.5646036 7.809112
## 0.10 0.20 10.722894 0.5866221 7.490231
## 0.10 0.25 10.404309 0.5940981 7.483926
## 0.10 0.30 10.066515 0.6042563 7.385388
## 0.10 0.35 9.849010 0.6114751 7.332344
## 0.10 0.40 9.780107 0.6126701 7.364652
## 0.10 0.45 9.803324 0.6122784 7.425639
## 0.10 0.50 9.879791 0.6073968 7.502528
## 0.10 0.55 10.043954 0.5968707 7.623457
## 0.10 0.60 10.251238 0.5865432 7.774031
## 0.10 0.65 10.429900 0.5782867 7.928979
## 0.10 0.70 10.597833 0.5713734 8.065470
## 0.10 0.75 10.782808 0.5640196 8.211557
## 0.10 0.80 10.946394 0.5578514 8.329209
## 0.10 0.85 11.097780 0.5523454 8.439981
## 0.10 0.90 11.239763 0.5474603 8.536174
## 0.10 0.95 11.386675 0.5425306 8.646261
## 0.10 1.00 11.534731 0.5375239 8.758483
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.1.
The elastic net used a penalty of lambda 0.1 and a fraction of 0.40, which led to the minimal RMSE value of 9.780 and an R-squared of 0.613.
In terms of training set performance, the elastic net shows the best predictive performance with the lowest RMSE of 9.780 and the highest R-squared value of 0.613. The partial least squares (PLS) model follows with an R-squared of 0.524 and an RMSE of 11.270. The ridge regression model has an RMSE of 11.535 and a corresponding R-squared value of 0.540. Hence, the elastic net is the most effective model based on the training data results.
Predict the response for the test set by elastic net model:
# Generate predictions using the Ridge model on test data
enet_finger_predict <- predict(enet_model,newdata=finger_test)
# Evaluate model performance with resampling metrics
enet<-postResample(pred=enet_finger_predict, obs=perm_test)
enet
## RMSE Rsquared MAE
## 12.0675531 0.3969818 9.1975157
test_evaluation <-rbind(pls,ridge,enet)
knitr::kable(test_evaluation)
| RMSE | Rsquared | MAE | |
|---|---|---|---|
| pls | 12.54479 | 0.3434813 | 8.174826 |
| ridge | 12.31207 | 0.3923648 | 9.585825 |
| enet | 12.06755 | 0.3969818 | 9.197516 |
Again, among the models evaluated on the test data, the elastic net exhibits the best predictive performance with the lowest root mean squared error (RMSE) of 12.07 and the highest R-squared value of 0.397. This indicates that the elastic net model explains a greater proportion of the variance in the unseen data compared to the partial least squares (PLS) model, which has an RMSE of 12.545 and an R-squared of 0.344, and the ridge regression model, which has an RMSE of 12.312 and an R-squared of 0.392. Therefore, the elastic net is the most effective model for predicting outcomes on the test dataset.
Given the predictive performance results, I will not recommend to fully replace the permeability laboratory experiment with the models at this time. The models exhibit relatively low R-squared values, particularly the PLS and Ridge regression models show that they explain only a limited portion of the variability in the data. Though, the Elastic Net model shows slightly better performance with an R-squared of 0.397, but it still falls short of providing a strong predictive capability. Therefore, I think the laboratory experiments remain crucial for accurate permeability measurements in this case. Additional model refinement or additional data may be necessary here to improve predictive accuracy.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
data("ChemicalManufacturingProcess")
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
dim(ChemicalManufacturingProcess)
## [1] 176 58
head(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## 5 42.49 7.47 63.33 72.25
## 6 43.57 6.12 58.36 65.31
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## 5 14.02 17.91 54.66
## 6 15.17 21.79 51.23
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 100 16.66 11.44
## 2 100 19.04 12.55
## 3 100 19.04 12.55
## 4 100 19.04 12.55
## 5 100 18.22 12.80
## 6 100 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 3.46 138.09 18.83
## 2 3.46 153.67 21.05
## 3 3.46 153.67 21.05
## 4 3.46 153.67 21.05
## 5 3.05 147.61 21.05
## 6 3.78 151.88 20.76
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 NA NA NA
## 2 0.0 0 NA
## 3 0.0 0 NA
## 4 0.0 0 NA
## 5 10.7 0 NA
## 6 12.0 0 NA
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 NA NA NA
## 2 917 1032.2 210.0
## 3 912 1003.6 207.1
## 4 911 1014.6 213.3
## 5 918 1027.5 205.7
## 6 924 1016.8 208.9
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 NA NA 43.00
## 2 177 178 46.57
## 3 178 178 45.07
## 4 177 177 44.92
## 5 178 178 44.96
## 6 178 178 45.32
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 NA NA NA
## 2 NA NA 0
## 3 NA NA 0
## 4 NA NA 0
## 5 NA NA 0
## 6 NA NA 0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 35.5 4898 6108
## 2 34.0 4869 6095
## 3 34.8 4878 6087
## 4 34.8 4897 6102
## 5 34.6 4992 6233
## 6 34.0 4985 6222
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 4682 35.5 4865
## 2 4617 34.0 4867
## 3 4617 34.8 4877
## 4 4635 34.8 4872
## 5 4733 33.9 4886
## 6 4786 33.4 4862
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 6049 4665 0.0
## 2 6097 4621 0.0
## 3 6078 4621 0.0
## 4 6073 4611 0.0
## 5 6102 4659 -0.7
## 6 6115 4696 -0.6
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 NA NA NA
## 2 3 0 3
## 3 4 1 4
## 4 5 2 5
## 5 8 4 18
## 6 9 1 1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 4873 6074 4685
## 2 4869 6107 4630
## 3 4897 6116 4637
## 4 4892 6111 4630
## 5 4930 6151 4684
## 6 4871 6128 4687
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 10.7 21.0 9.9
## 2 11.2 21.4 9.9
## 3 11.1 21.3 9.4
## 4 11.1 21.3 9.4
## 5 11.3 21.6 9.0
## 6 11.4 21.7 10.1
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 69.1 156 66
## 2 68.7 169 66
## 3 69.3 173 66
## 4 69.3 171 68
## 5 69.4 171 70
## 6 68.2 173 70
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 2.4 486 0.019
## 2 2.6 508 0.019
## 3 2.6 509 0.018
## 4 2.5 496 0.018
## 5 2.5 468 0.017
## 6 2.5 490 0.018
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 0.5 3 7.2
## 2 2.0 2 7.2
## 3 0.7 2 7.2
## 4 1.2 2 7.2
## 5 0.2 2 7.3
## 6 0.4 2 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 NA NA 11.6
## 2 0.1 0.15 11.1
## 3 0.0 0.00 12.0
## 4 0.0 0.00 10.6
## 5 0.0 0.00 11.0
## 6 0.0 0.00 11.5
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 3.0 1.8 2.4
## 2 0.9 1.9 2.2
## 3 1.0 1.8 2.3
## 4 1.1 1.8 2.1
## 5 1.1 1.7 2.1
## 6 2.2 1.8 2.0
# Check for missing values in each column
missing_values <- colSums(is.na(ChemicalManufacturingProcess))
print(missing_values)
## Yield BiologicalMaterial01 BiologicalMaterial02
## 0 0 0
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## 0 0 0
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## 0 0 0
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## 0 0 0
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## 0 1 3
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## 15 1 1
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## 2 1 1
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## 0 9 10
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## 1 0 1
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## 0 0 0
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## 0 0 0
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## 0 1 1
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## 1 5 5
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## 5 5 5
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## 5 5 0
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## 5 5 5
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## 5 0 0
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## 0 1 1
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## 0 0 0
## ManufacturingProcess45
## 0
# Apply KNN imputation
knn_impute_chemical <- preProcess(ChemicalManufacturingProcess, method=c('knnImpute'))
# Imputed dataset
imputed_chemical_df <- predict(knn_impute_chemical, ChemicalManufacturingProcess)
# Calculate total number of missing values after imputation
total_missing <- sum(is.na(imputed_chemical_df))
print(total_missing)
## [1] 0
Split the data into train test set:
dim(imputed_chemical_df)
## [1] 176 58
set.seed(100)
train_chemical <-createDataPartition(imputed_chemical_df$Yield, times = 1, p = .70, list = FALSE)
train_chemical_x <- imputed_chemical_df[train_chemical, ][, -c(1)]
test_chemical_x <- imputed_chemical_df[-train_chemical, ][, -c(1)]
train_chemical_y<- imputed_chemical_df[train_chemical, ]$Yield
test_chemical_y <- imputed_chemical_df[-train_chemical, ]$Yield
To get a bias variance balance I will apply a ridge regression here:
set.seed(135)
ridgegrid <- data.frame(.lambda = seq(0,0.1,length=15))
ridge_chemical <- train(x=train_chemical_x,y=train_chemical_y,
method='ridge',
tuneGrid=ridgegrid,
trControl=trainControl(method='cv'),
preProc = c('center','scale')
)
ridge_chemical
## Ridge Regression
##
## 124 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 111, 112, 112, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 3.274546 0.3894348 1.3886320
## 0.007142857 1.952203 0.5198298 0.9212209
## 0.014285714 1.988899 0.5343489 0.9244613
## 0.021428571 1.951873 0.5426475 0.9136532
## 0.028571429 1.911371 0.5483202 0.9020967
## 0.035714286 1.875157 0.5525935 0.8923681
## 0.042857143 1.843596 0.5559911 0.8845154
## 0.050000000 1.815971 0.5587837 0.8778078
## 0.057142857 1.791533 0.5611311 0.8717044
## 0.064285714 1.769673 0.5631361 0.8662582
## 0.071428571 1.749921 0.5648698 0.8615315
## 0.078571429 1.731921 0.5663834 0.8573224
## 0.085714286 1.715396 0.5677153 0.8534072
## 0.092857143 1.700131 0.5688950 0.8497503
## 0.100000000 1.685958 0.5699458 0.8464050
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
The optimal value of the performance metric is found for lambda 0.1 for the ridge regression, which gave the smallest RMSE value of 1.69 and the corresponding R-squared value is 0.57.
Make prediction on test set:
predict_chemical <- predict(ridge_chemical, newdata=test_chemical_x)
postResample(pred=predict_chemical, obs=test_chemical_y)
## RMSE Rsquared MAE
## 1.5895179 0.1800329 0.8326279
For ridge regression, the test set RMSE is 1.59 and the corresponding R-squared value is 0.18. The RMSE value is slightly lower than the training set RMSE of 1.69, indicating a bit improvement in error. However, the R-squared on the test set drops significantly to 0.18 from 0.57 on the training set, showing a sharp decline in explanatory power. This suggests that though the model achieves a similar error rate on new data, it struggles to generalize and capture the underlying patterns effectively. Probably, this is happened due to overfitting to the training data.
plot(varImp(ridge_chemical, scale = FALSE), top=30, scales = list(y = list(cex = 0.8)))
The manufacturing processes seem to be the most influential predictors. Among them, manufacturingprocess32 and manufacturingprocess13 stand out as especially important as these two predictors show significantly higher importance compared to the others available.
# Calculate correlation matrix
correlation <- cor(dplyr::select(imputed_chemical_df, ManufacturingProcess32, ManufacturingProcess13, BiologicalMaterial06, ManufacturingProcess17, ManufacturingProcess31, Yield))
correlation
## ManufacturingProcess32 ManufacturingProcess13
## ManufacturingProcess32 1.000000000 -0.10120679
## ManufacturingProcess13 -0.101206789 1.00000000
## BiologicalMaterial06 0.600595801 -0.12186756
## ManufacturingProcess17 0.016041778 0.78241345
## ManufacturingProcess31 -0.009650671 0.07650659
## Yield 0.608332150 -0.50367972
## BiologicalMaterial06 ManufacturingProcess17
## ManufacturingProcess32 0.600595801 0.016041778
## ManufacturingProcess13 -0.121867557 0.782413453
## BiologicalMaterial06 1.000000000 0.006004003
## ManufacturingProcess17 0.006004003 1.000000000
## ManufacturingProcess31 -0.048798568 0.038426259
## Yield 0.478163422 -0.425806872
## ManufacturingProcess31 Yield
## ManufacturingProcess32 -0.009650671 0.60833215
## ManufacturingProcess13 0.076506590 -0.50367972
## BiologicalMaterial06 -0.048798568 0.47816342
## ManufacturingProcess17 0.038426259 -0.42580687
## ManufacturingProcess31 1.000000000 -0.07085698
## Yield -0.070856979 1.00000000
# Visualize correlation matrix
corrplot::corrplot(correlation, method = 'square', type = "upper")
The correlations of some of the most important predictors with the response variable have been determined. It is seen that ManufacturingProcess32 is a strong positive influencer of yield, while BiologicalMaterial06 has moderate positive effect on yield. ManufacturingProcess13 and ManufacturingProcess17 have moderate negative effects on yield. ManufacturingProcess31 appears to have a negligible impact. These correlations can help us to take decisions on which processes or materials to prioritize or avoid to optimize yield in the manufacturing process.