data( "permeability")
Answer:
fingerprints predictors in the original dataset is 1107 for 165 compounds.
fingerprints predictors in the dataset after using nearZeroVar function is 338 for 165 compounds.
# Check the initial observations and predictors
dim(fingerprints)
## [1] 165 1107
# Filter out the predictors that have low frequencies
fingerprints <- fingerprints[, -nearZeroVar(fingerprints)]
# Display the predictors left
dim(fingerprints)
## [1] 165 388
Answer:
set.seed(100)
# sample data for training
sample <- permeability |> createDataPartition(p = .8, list = FALSE)
# train data for both datasets
train_pm <- permeability[sample, ]
train_fp <- fingerprints[sample, ]
# test data for both datasets
test_pm <- permeability[-sample, ]
test_fp <- fingerprints [-sample, ]
# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)
plsTune <- train(train_fp, train_pm, method = "pls", metric = "Rsquared",
tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))
plot(plsTune)
plsTune
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 121, 119, 120, 120, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.89902 0.3366671 9.723079
## 2 11.57073 0.4528188 8.008647
## 3 11.74004 0.4540994 8.638716
## 4 11.93784 0.4435949 8.785540
## 5 11.72550 0.4550101 8.696446
## 6 11.50509 0.4693658 8.546633
## 7 11.39297 0.4871664 8.653850
## 8 11.13762 0.5051883 8.558643
## 9 11.18859 0.5013221 8.642897
## 10 11.29239 0.4985107 8.727017
## 11 11.47258 0.4836285 8.824869
## 12 11.42934 0.4880340 8.975986
## 13 11.74486 0.4702584 9.189660
## 14 12.08391 0.4502299 9.380611
## 15 12.27229 0.4497430 9.560355
## 16 12.53866 0.4429663 9.685105
## 17 12.57991 0.4380633 9.684354
## 18 12.60036 0.4371346 9.779651
## 19 12.65043 0.4323267 9.815696
## 20 12.94446 0.4207868 10.028232
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 8.
getTrainPerf(plsTune)
## TrainRMSE TrainRsquared TrainMAE method
## 1 11.13762 0.5051883 8.558643 pls
Answer:
Predicted the response for the test set using predict() function, ncomp = 6.
The test set estimate of R2 using PLS model is 0.5357105.
fp_predict <- predict(plsTune, test_fp)
postResample(fp_predict, test_pm)
## RMSE Rsquared MAE
## 11.0223004 0.5357105 7.9250542
Answer:
Elastic Net model computation took a bit more time than PLS model build.
The test set estimate of R2 using elasticnet model is 0.460516 with lambda = 0.01 and .fraction = 0.05.
Lasso Regression model computation is L1 regularization.
The test set estimate of R2 using Lasso model is 0.4774312 with .fraction = 0.05.
Based on the R2 comparison, the PLS model offers better predictive performance as it has the highest R2 .
# grid of penalties
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
# tuning penalized regression(elastic net ) model using metric = "Rsquared"
enetTune <- train(train_fp, train_pm, method = "enet",metric = "Rsquared",
tuneGrid = enetGrid, trControl = ctrl, preProc = c("center", "scale"))
#validation plot
plot(enetTune)
#summary
enetTune
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 120, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 15.92454 0.4149977 11.250097
## 0.00 0.10 20.23916 0.4268401 13.146906
## 0.00 0.15 25.27825 0.4014007 16.074842
## 0.00 0.20 30.66808 0.3862955 19.391415
## 0.00 0.25 36.61766 0.3743805 22.867382
## 0.00 0.30 42.29704 0.3624048 26.285371
## 0.00 0.35 48.74081 0.3454254 29.391911
## 0.00 0.40 54.99337 0.3369527 32.547410
## 0.00 0.45 61.01564 0.3311644 35.657877
## 0.00 0.50 66.98059 0.3255013 38.748701
## 0.00 0.55 72.89343 0.3234535 41.651433
## 0.00 0.60 78.80292 0.3183460 44.476682
## 0.00 0.65 84.64774 0.3142939 47.294215
## 0.00 0.70 90.49634 0.3110215 50.100267
## 0.00 0.75 96.50695 0.3064360 52.955658
## 0.00 0.80 102.54072 0.3011334 55.830134
## 0.00 0.85 108.69852 0.2986829 58.762840
## 0.00 0.90 114.88765 0.2963885 61.697929
## 0.00 0.95 121.08212 0.2946134 64.655689
## 0.00 1.00 127.14467 0.2925550 67.711128
## 0.01 0.05 11.13390 0.4875269 7.808608
## 0.01 0.10 11.02644 0.4697844 7.760532
## 0.01 0.15 11.19436 0.4587664 8.171019
## 0.01 0.20 11.51798 0.4473346 8.425328
## 0.01 0.25 11.72125 0.4451314 8.496069
## 0.01 0.30 12.02011 0.4390133 8.693026
## 0.01 0.35 12.32496 0.4276333 8.869635
## 0.01 0.40 12.51186 0.4209352 8.995068
## 0.01 0.45 12.68564 0.4120608 9.145245
## 0.01 0.50 12.87752 0.3987716 9.264984
## 0.01 0.55 13.14639 0.3806198 9.428356
## 0.01 0.60 13.42127 0.3645401 9.623630
## 0.01 0.65 13.65940 0.3520916 9.846036
## 0.01 0.70 13.90300 0.3397872 10.073486
## 0.01 0.75 14.20370 0.3285202 10.302569
## 0.01 0.80 14.48125 0.3193193 10.529721
## 0.01 0.85 14.72357 0.3118963 10.749190
## 0.01 0.90 14.93236 0.3058452 10.926728
## 0.01 0.95 15.11693 0.3012204 11.082565
## 0.01 1.00 15.28554 0.2970509 11.218829
## 0.10 0.05 12.03362 0.4599280 9.057243
## 0.10 0.10 11.11587 0.4835982 7.742628
## 0.10 0.15 11.01906 0.4809139 7.555460
## 0.10 0.20 11.06195 0.4728057 7.794452
## 0.10 0.25 11.13881 0.4669137 8.028810
## 0.10 0.30 11.30545 0.4577672 8.294670
## 0.10 0.35 11.49460 0.4486985 8.503234
## 0.10 0.40 11.62638 0.4436208 8.615526
## 0.10 0.45 11.77095 0.4368198 8.689516
## 0.10 0.50 11.88887 0.4309562 8.728833
## 0.10 0.55 12.00117 0.4257441 8.789869
## 0.10 0.60 12.10288 0.4213316 8.881068
## 0.10 0.65 12.17475 0.4189686 8.938232
## 0.10 0.70 12.23948 0.4165820 8.989967
## 0.10 0.75 12.29727 0.4145380 9.045190
## 0.10 0.80 12.35420 0.4125055 9.090148
## 0.10 0.85 12.40335 0.4108638 9.120956
## 0.10 0.90 12.44991 0.4092878 9.143724
## 0.10 0.95 12.51563 0.4067511 9.178773
## 0.10 1.00 12.60301 0.4029965 9.220182
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.05 and lambda = 0.01.
#predict the permeability response for the test data
enet_predict <- predict(enetTune, test_fp)
#compare the variance of the predicted values to the test values.
postResample(enet_predict, test_pm)
## RMSE Rsquared MAE
## 11.925564 0.460516 8.762171
# Lasso Regression model
larsTune <- train(train_fp, train_pm, method = "lars", metric = "Rsquared",
tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))
plot(larsTune)
larsTune
## Least Angle Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 117, 120, 121, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.05 10.78089 0.4884988 7.621235
## 0.10 11.22069 0.4658293 8.155298
## 0.15 11.85998 0.4389142 8.710538
## 0.20 12.56456 0.4081787 9.161977
## 0.25 13.28352 0.3782736 9.579537
## 0.30 14.03239 0.3496227 10.049710
## 0.35 14.74708 0.3177346 10.571696
## 0.40 15.24085 0.2967624 11.001037
## 0.45 15.90103 0.2775492 11.479329
## 0.50 16.83813 0.2546089 11.973205
## 0.55 17.94281 0.2350213 12.707837
## 0.60 19.08541 0.2186174 13.394901
## 0.65 20.33394 0.2064060 14.106612
## 0.70 21.52771 0.1948684 14.795779
## 0.75 22.59915 0.1870641 15.447077
## 0.80 23.54321 0.1822464 16.042508
## 0.85 24.34344 0.1776840 16.557470
## 0.90 25.43092 0.1757257 17.277698
## 0.95 26.48296 0.1728901 17.913372
## 1.00 27.51909 0.1708402 18.541743
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.
lars_predict <- predict(larsTune, test_fp)
postResample(lars_predict, test_pm)
## RMSE Rsquared MAE
## 11.7132875 0.4774312 8.3278765
Answer:
As the R2 value is low around 0.50,it only explains 50% of the variability in the response variable. I do not recommend any of the models to replace the permeability laboratory experiment. Perhaps we need to look into other models for better accuracy.
Answer: The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
cmp <- data.frame(ChemicalManufacturingProcess)
# summary of missing values
paste0(" Missing values: ",sum(is.na(cmp)))
## [1] " Missing values: 106"
#data imputation
cmp_impute <- preProcess(cmp, method = "knnImpute")
cmp_impute <- predict(cmp_impute,cmp)
# summary post data imputation
paste0(" Missing values post data kNN Impute:",sum(is.na(cmp_impute)))
## [1] " Missing values post data kNN Impute:0"
# pre-process to filter predictor with low frequencies
cmp_transform <- cmp_impute[, -nearZeroVar(cmp_impute)]
# Check for skewness
library(e1071)
skewVals <- apply(cmp_transform, 2, skewness)
head(skewVals)
## Yield BiologicalMaterial01 BiologicalMaterial02
## 0.31095956 0.27331650 0.24412685
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## 0.02851075 1.73231530 0.30400532
#data transformation
cmp_impute <- preProcess(cmp, method = c("BoxCox","knnImpute"))
cmp_impute <- predict(cmp_impute,cmp)
set.seed(100)
# sample data for training
index <- createDataPartition(cmp_transform$Yield, p = .8, list = FALSE)
# train and test data
train_cmp <- cmp_transform[index, ]
test_cmp <- cmp_transform[-index, ]
# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)
plsTune <- train(Yield ~ ., train_cmp , method = "pls", metric = "RMSE",
tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))
#Validation plot
plot(plsTune)
#optimal PLS components
plsTune$bestTune
## ncomp
## 1 1
#performance
getTrainPerf(plsTune)
## TrainRMSE TrainRsquared TrainMAE method
## 1 0.755647 0.4850805 0.618129 pls
Answer: R2 of the test set is 0.1807132 which is lower than the train set R2 which was 0.4780728. The diagnostic plots shows us visually to assess how close the predicted values are to the actual values.
cmp_predict <- predict(plsTune, test_cmp[ ,-1])
postResample(cmp_predict, test_cmp[ ,1])
## RMSE Rsquared MAE
## 1.2109611 0.1856010 0.7573744
xyplot( predict(plsTune) ~ train_cmp$Yield , type = c("p","g"))
xyplot( predict(plsTune) ~ resid(plsTune) , type = c("p","g"))
Answer:
Of the top 20 predictors, there is a mix of both Manufacturing Process and Biological Material. However, the most of the predictors in this are Manufacturing Processes predictors.
varImp(plsTune)
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess36 84.21
## ManufacturingProcess13 82.93
## ManufacturingProcess09 80.75
## BiologicalMaterial02 80.34
## BiologicalMaterial06 79.06
## BiologicalMaterial03 75.65
## BiologicalMaterial04 71.71
## ManufacturingProcess17 70.37
## ManufacturingProcess33 70.19
## BiologicalMaterial08 62.74
## ManufacturingProcess06 62.29
## ManufacturingProcess12 59.52
## BiologicalMaterial01 59.15
## BiologicalMaterial11 58.35
## BiologicalMaterial12 57.64
## ManufacturingProcess11 55.73
## ManufacturingProcess28 42.73
## ManufacturingProcess04 41.49
## ManufacturingProcess15 41.48
Answer:
According to the correlation plot, ManufacturingProcess32 has the highest positive correlation with Yield, followed by ManufacturingProcess09 and BiologicalMaterial02. ManufacturingProcess13,ManufacturingProcess36 and ManufacturingProcess17 which were in the Top5 are negatively correlated with Yield. This information can be helpful improving yield in future runs of the manufacturing process, as these were the predictors that affected the yield. If they want to maximize or improve their yield, they way want to increase the measurements of these manufacturing process and these biological measurements of the raw materials which have positive correlation and decrease the measurements which have negative correlation.
top_predictors <- varImp(plsTune)$importance |>
arrange(-Overall) |>
head(10)
cmp_transform |>
select(c("Yield", row.names(top_predictors))) |>
cor() |>
corrplot()