Developing a model to predict permeability could save significant resources for a pharmaceutical company while more rapidly identifying molecules that have sufficient permeability to become a drug.
a. Start R and use these commands to load the data:
#loaded relevant libraries
library(AppliedPredictiveModeling)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(caTools)
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.3
library(lars)
library(MASS)
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
library(stats)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fpp3)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tsibble 1.1.5 ✔ feasts 0.4.1
## ✔ tsibbledata 0.4.1 ✔ fable 0.4.0
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ fabletools::MAE() masks caret::MAE()
## ✖ fabletools::RMSE() masks caret::RMSE()
## ✖ dplyr::select() masks MASS::select()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(fable)
library(ggplot2)
library(e1071)
##
## Attaching package: 'e1071'
##
## The following object is masked from 'package:fabletools':
##
## interpolate
library(lattice)
library(corrplot)
## corrplot 0.95 loaded
##
## Attaching package: 'corrplot'
##
## The following object is masked from 'package:pls':
##
## corrplot
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
##
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
##
## The following object is masked from 'package:datasets':
##
## sleep
data(permeability)
#loaded data put into data frame
p_df <- as.data.frame(permeability)
class(permeability) #matrix
## [1] "matrix" "array"
class(p_df) #data frame
## [1] "data.frame"
fingerprints <- fingerprints #self-assignment (a no-op): data(permeability) already loads the fingerprints matrix
head(p_df) #dataframe of the permeability
## permeability
## 1 12.520
## 2 1.120
## 3 19.405
## 4 1.730
## 5 1.680
## 6 0.510
The data are loaded: permeability holds the response and fingerprints holds the predictors.
#removing the columns with near zero variance / low frequency diversity
filter_fingerprints <- fingerprints[, -nearZeroVar(fingerprints)]
#how many predictors were there?
dim(fingerprints)
## [1] 165 1107
#how many predictors are left?
dim(filter_fingerprints)
## [1] 165 388
After filtering, 388 of the original 1107 predictors remain.
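As a rough illustration of what the filter checks, the sketch below computes the two statistics that nearZeroVar() documents (the ratio of the most common value's count to the second most common, and the percentage of unique values) on a toy matrix in base R. The helper nzv_stats and the toy data are hypothetical, the cutoffs 95/5 and 10 are taken to be nearZeroVar()'s documented defaults, and this is not caret's actual code.

```r
# Hypothetical helper: frequency ratio and percent unique for one column
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Toy binary "fingerprint" columns (made up for illustration)
toy <- cbind(
  balanced = rep(c(0, 1), each = 50),  # 50/50 split
  rare     = c(rep(0, 99), 1),         # 99 to 1, near-zero variance
  constant = rep(1, 100)               # zero variance
)

stats <- t(apply(toy, 2, nzv_stats))
flagged <- which(stats[, "freq_ratio"] > 95 / 5 & stats[, "pct_unique"] < 10)

# Select by a keep index: a negative index built from an empty
# 'flagged' vector would drop every column instead of none
keep <- setdiff(seq_len(ncol(toy)), flagged)
toy_filtered <- toy[, keep, drop = FALSE]
colnames(toy_filtered)
```

The same keep-index pattern could be applied to fingerprints, which guards against the case where nearZeroVar() flags nothing.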
set.seed(987654321)
#splitting the permeability data into a 75% training and a 25% test set
#select indices to use for extracting the training data
train_indices <- createDataPartition(p_df$permeability, p = 0.75, list = FALSE)
# use the indices to select the data
train_p <- permeability[train_indices, ]
test_p <- permeability[-train_indices, ]
train_fp <- filter_fingerprints[train_indices, ]
test_fp <- filter_fingerprints[-train_indices, ]
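createDataPartition() does not sample rows purely at random: for a numeric outcome it groups the values by percentile and samples within each group, so the training and test sets have similar outcome distributions. A minimal base R sketch of that idea follows. The function stratified_split, the choice of 5 groups, and the simulated y are assumptions for illustration, not caret's implementation.

```r
# Hypothetical helper: sample a fraction p of rows within quantile bins of y
stratified_split <- function(y, p = 0.75, groups = 5) {
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)
  picked <- lapply(split(seq_along(y), bins), function(rows) {
    # index via sample.int() so a bin holding a single row is handled safely
    rows[sample.int(length(rows), ceiling(p * length(rows)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(987654321)
y <- rexp(100, rate = 0.1)               # skewed toy outcome
train_rows <- stratified_split(y)
test_rows <- setdiff(seq_along(y), train_rows)

# Both subsets should span a similar range of the outcome
summary(y[train_rows])
summary(y[test_rows])
```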
#20-fold cross-validation for the PLS model
ctrl <- trainControl(method= "cv", number= 20)
#implement the model
plsTune <- train(x = train_fp, y = train_p,
method= "pls",
tuneGrid = expand.grid(ncomp = 1:20), #evaluate 1 to 20 latent variables
trControl = ctrl,
preProc = c("center", "scale")
) #by default train() selects the model with the smallest RMSE
#view info of the model
print(plsTune)
## Partial Least Squares
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 119, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.25976 0.4464616 10.358836
## 2 12.00416 0.4872695 8.841882
## 3 11.87550 0.4924602 9.254769
## 4 11.43491 0.5164738 9.116062
## 5 11.16276 0.5290583 8.804034
## 6 11.13298 0.5525563 8.774824
## 7 11.02419 0.5726444 8.619423
## 8 10.77778 0.5980230 8.576113
## 9 10.68594 0.5855937 8.398050
## 10 10.59289 0.5959762 8.263912
## 11 10.68505 0.5886411 8.364086
## 12 10.67276 0.5970798 8.292205
## 13 10.82059 0.5846147 8.455137
## 14 11.22155 0.5698572 8.788628
## 15 11.27470 0.5657656 8.805156
## 16 11.64483 0.5549272 9.108120
## 17 11.83424 0.5497440 9.165310
## 18 12.12880 0.5405549 9.290222
## 19 12.15433 0.5342537 9.371949
## 20 12.38511 0.5274388 9.552238
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 10.
#see the iterations of the latent variables and their performance plotted
plot(plsTune)
#fit a second model that selects ncomp by the largest Rsquared
plsTune2 <- train(x = train_fp, y = train_p,
method= "pls",
tuneGrid = expand.grid(ncomp = 1:20),
metric = 'Rsquared', #optimizing Rsquared
trControl = ctrl,
preProc = c("center", "scale")
)
print(plsTune2)
## Partial Least Squares
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 119, 118, 119, 119, 121, 118, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.27651 0.3819973 10.242590
## 2 11.93108 0.5312996 8.955680
## 3 11.76173 0.5223206 9.282503
## 4 11.69984 0.5377856 9.450407
## 5 11.47535 0.5417391 9.147189
## 6 11.43199 0.5295197 9.024695
## 7 11.30985 0.5559088 8.987647
## 8 11.27594 0.5740494 9.053436
## 9 11.42119 0.5845358 8.949448
## 10 11.41917 0.5638812 8.936092
## 11 11.70490 0.5539041 9.099776
## 12 11.83443 0.5437615 9.110245
## 13 12.01633 0.5314545 9.321229
## 14 12.20600 0.5074832 9.477129
## 15 12.50586 0.4957734 9.714224
## 16 12.80412 0.4768489 9.942856
## 17 13.03100 0.4794901 10.123784
## 18 13.13572 0.4762658 10.172648
## 19 13.31113 0.4670994 10.187537
## 20 13.31420 0.4661427 10.283183
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 9.
plot(plsTune2)
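With ncomp tuned over 1 to 20 under 20-fold cross-validation, the model
selected by the default criterion (lowest RMSE) used 10 latent variables,
with a cross-validated RMSE of 10.59289 and an Rsquared of 0.5959762. The
model selected by the highest Rsquared used 9 latent variables, with an
Rsquared of 0.5845358. The two calls were resampled with different folds
(the fold sizes in the two printouts differ), so the two tables are not
directly comparable. Because the 10-component model has the lowest
cross-validated RMSE and a slightly higher Rsquared than the 9-component
model, it is probably the better model.
d. Predict the response for the test set. What is the test set estimate
of Rsquared?
My initial expectation was that the model would perform about as well on
the test set as it did in cross-validation, with an Rsquared of about 0.59.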
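Before turning to the test set, the cross-validated table for the first run can be cross-checked with a few lines of base R. The sketch below re-enters the RMSE and Rsquared columns printed for plsTune and looks up the best row under each criterion. The data frame cv is a hand-copied stand-in for plsTune$results, not the object itself.

```r
# Values copied by hand from the print(plsTune) output above
cv <- data.frame(
  ncomp = 1:20,
  RMSE = c(13.25976, 12.00416, 11.87550, 11.43491, 11.16276,
           11.13298, 11.02419, 10.77778, 10.68594, 10.59289,
           10.68505, 10.67276, 10.82059, 11.22155, 11.27470,
           11.64483, 11.83424, 12.12880, 12.15433, 12.38511),
  Rsquared = c(0.4464616, 0.4872695, 0.4924602, 0.5164738, 0.5290583,
               0.5525563, 0.5726444, 0.5980230, 0.5855937, 0.5959762,
               0.5886411, 0.5970798, 0.5846147, 0.5698572, 0.5657656,
               0.5549272, 0.5497440, 0.5405549, 0.5342537, 0.5274388)
)

best_rmse <- cv$ncomp[which.min(cv$RMSE)]      # ncomp with the lowest RMSE
best_r2 <- cv$ncomp[which.max(cv$Rsquared)]    # ncomp with the highest Rsquared
c(best_rmse = best_rmse, best_r2 = best_r2)
```

On these numbers the lowest RMSE is at ncomp = 10, while the highest Rsquared within this run is at ncomp = 8 (0.5980230), narrowly ahead of ncomp = 12 and ncomp = 10.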
#predict with the first PLS model that optimized RMSE
fp_predict <- predict(plsTune, test_fp)
#test to see how well the predictions did
postResample(fp_predict, test_p)
## RMSE Rsquared MAE
## 10.887182 0.457854 8.165262
# predict with the PLS model that optimized Rsquared
fp_predict2 <- predict(plsTune2, test_fp)
#test predictions
postResample(fp_predict2, test_p)
## RMSE Rsquared MAE
## 10.6318507 0.4760463 7.9407355
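Surprisingly, the model with fewer latent variables (9) performed slightly better on the test set (Rsquared 0.476 vs. 0.458, RMSE 10.63 vs. 10.89), even though it performed worse in cross-validation on the training set. The two models are close, and both fall short of the cross-validated Rsquared of about 0.59.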
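For reference, the three numbers that postResample() reports can be reproduced in base R. The sketch below assumes caret's default behaviour of reporting Rsquared as the squared Pearson correlation between predictions and observations. The function name metrics_by_hand and the toy vectors are made up for illustration.

```r
# Hypothetical helper mirroring the RMSE / Rsquared / MAE summary
metrics_by_hand <- function(pred, obs) {
  c(RMSE = sqrt(mean((pred - obs)^2)),
    Rsquared = cor(pred, obs)^2,     # squared correlation (assumed default)
    MAE = mean(abs(pred - obs)))
}

obs <- c(1, 2, 3, 4, 5)               # toy observed values
pred <- c(1.5, 1.5, 3.5, 3.5, 5.5)    # toy predictions, each off by 0.5
m <- metrics_by_hand(pred, obs)
m
```

Because this Rsquared is a squared correlation, it measures how closely the predictions track the observations rather than whether they are on the right scale, which is one reason to read it alongside RMSE.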
Try other models: do any perform better?
set.seed(3)
lmTune <- train(x = train_fp, y = train_p,
method= "lm",
trControl = ctrl,
preProc = c("center", "scale")
)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
print(lmTune)
## Linear Regression
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 119, 118, 119, 120, 118, 120, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 60.45876 0.3249167 36.69025
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
fp_predict3 <- predict(lmTune, test_fp)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
postResample(fp_predict3, test_p)
## RMSE Rsquared MAE
## 52.3926831 0.1303428 32.3763365
ridgeGrid <- data.frame(.lambda = seq(0.0001,0.3,length = 10))
ridgeTune <- train(x = train_fp, y = train_p,
method= "ridge",
tuneGrid = ridgeGrid,
trControl = ctrl,
preProc = c("center", "scale")
)
print(ridgeTune)
## Ridge Regression
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 120, 119, 120, 119, 118, 118, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00010000 163858.79275 0.1896401 93642.221368
## 0.03342222 13.15790 0.4706041 10.324337
## 0.06674444 12.53399 0.4914283 9.766323
## 0.10006667 12.19513 0.5138133 9.437761
## 0.13338889 12.07589 0.5254832 9.345561
## 0.16671111 12.03991 0.5341925 9.345011
## 0.20003333 12.04456 0.5407807 9.355552
## 0.23335556 12.07982 0.5460861 9.384765
## 0.26667778 12.13601 0.5504579 9.427785
## 0.30000000 12.21082 0.5540042 9.493195
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1667111.
plot(ridgeTune)
fp_predict4 <- predict(ridgeTune, test_fp)
postResample(fp_predict4, test_p)
## RMSE Rsquared MAE
## 11.0353444 0.4835548 8.5750659
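The linear regression model had the worst Rsquared: about 0.32 in cross-validation and about 0.13 on the test set. It also had by far the highest RMSE (60.46 in cross-validation, 52.39 on the test set). With 388 predictors and only 125 training samples the fit is rank-deficient, as the warnings indicate.
The ridge regression model had the highest Rsquared on the test set (0.4836). Its test set RMSE (11.04) was slightly worse than that of the PLS models (10.89 and 10.63) but far better than that of the linear regression. The final value used for the model was lambda = 0.1667111, with a cross-validated RMSE of 12.03991 and Rsquared of 0.5341925.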
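The gap between linear regression and ridge regression here comes from having more predictors than samples. The base R sketch below reproduces that situation on simulated data: lm() cannot estimate all coefficients, while the penalized ridge solution is well defined. The simulated X and y, the penalty value of 1, and the closed-form formula are illustrative assumptions; the lambda used by method = "ridge" is not guaranteed to be on the same scale.

```r
set.seed(3)
n <- 20; p <- 50                          # more predictors than samples
X <- scale(matrix(rnorm(n * p), n, p))    # centered and scaled predictors
y <- drop(X[, 1:3] %*% c(2, -1, 0.5)) + rnorm(n, sd = 0.1)

# Ordinary least squares: rank-deficient, so some coefficients come back NA
ols <- lm(y ~ X)
n_na <- sum(is.na(coef(ols)))

# Ridge: solve (X'X + lambda * I) beta = X'(y - mean(y))
lambda <- 1
beta_ridge <- solve(crossprod(X) + lambda * diag(p),
                    crossprod(X, y - mean(y)))

n_na                 # number of coefficients lm() could not estimate
range(beta_ridge)    # every ridge coefficient is finite
```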
Start R and use these commands to load the data:
data("ChemicalManufacturingProcess")
#the book refers to processPredictors, but no such object exists after loading the data, so it is built from ChemicalManufacturingProcess below
apropos('processPredictors')
## character(0)
#how many na values?
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
Impute the predictors. The impute package is no longer available on CRAN, so VIM was used instead.
#impute with k nearest neighbors, k = 5
imputeCMP <- kNN(ChemicalManufacturingProcess, k = 5)
#imputation removed nas
sum(is.na(imputeCMP))
## [1] 0
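Missing values were imputed from the 5 nearest neighbors. kNN() also appends a logical _imp indicator column for every variable, which is why the models below report 115 predictors, of which only 57 are centered and scaled.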
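If the _imp indicator columns that VIM's kNN() appends by default are not wanted as predictors, they can be dropped by name. The base R sketch below shows one way to do that after the fact. The small data frame toy_imputed is a made-up stand-in for imputeCMP, and passing imp_var = FALSE to kNN() should avoid creating the columns in the first place.

```r
# Made-up stand-in for the kNN() output: values plus logical *_imp flags
toy_imputed <- data.frame(
  Yield = c(40.1, 41.3, 39.8),
  BiologicalMaterial01 = c(6.2, 6.5, 6.1),
  ManufacturingProcess01 = c(10.9, 11.2, 11.0),
  Yield_imp = c(FALSE, FALSE, FALSE),
  BiologicalMaterial01_imp = c(FALSE, FALSE, FALSE),
  ManufacturingProcess01_imp = c(FALSE, TRUE, FALSE)
)

# Drop every indicator column, then drop the response
is_flag <- grepl("_imp$", names(toy_imputed))
toy_clean <- toy_imputed[, !is_flag, drop = FALSE]
toy_predictors <- toy_clean[, names(toy_clean) != "Yield", drop = FALSE]
names(toy_predictors)
```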
#created processPredictors since it does not seem to be available after loading the data
#removed the output and kept the predictors
processPredictors <- select(imputeCMP, -"Yield")
set.seed(987654321)
#splitting the chemical manufacturing data into a 75% training and a 25% test set
#select indices to use for extracting the training data
train_indice <- createDataPartition(imputeCMP$Yield, p = 0.75, list = FALSE)
# use the indices to select the data
train_cmp <- imputeCMP[train_indice, ]
test_cmp <- imputeCMP[-train_indice, ]
train_cmpp <- processPredictors[train_indice, ]
test_cmpp <- processPredictors[-train_indice, ]
#20-fold cross-validation for the PLS model
ctrl <- trainControl(method= "cv", number= 20)
#implement the model
plsTunecmp <- train(x = train_cmpp, y = train_cmp$Yield,
method= "pls",
tuneGrid = expand.grid(ncomp = 1:20), #evaluate 1 to 20 latent variables
trControl = ctrl,
preProc = c("center", "scale")
) #by default train() selects the model with the smallest RMSE
#info of the model
print(plsTunecmp)
## Partial Least Squares
##
## 132 samples
## 115 predictors
##
## Pre-processing: centered (57), scaled (57), ignore (58)
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 125, 124, 124, 126, 128, 126, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.353285 0.5325094 1.121985
## 2 1.410579 0.5684854 1.081299
## 3 1.353358 0.6275744 1.052066
## 4 1.650966 0.6288592 1.178190
## 5 1.891649 0.6280510 1.249222
## 6 1.897481 0.6350393 1.249742
## 7 1.926913 0.6325423 1.269054
## 8 1.954366 0.6260732 1.293714
## 9 1.965167 0.6144576 1.305091
## 10 2.090428 0.6003619 1.366453
## 11 2.210097 0.5925558 1.426800
## 12 2.301631 0.5861027 1.469666
## 13 2.294127 0.5762289 1.463510
## 14 2.282930 0.5681934 1.467621
## 15 2.248300 0.5684622 1.454498
## 16 2.214395 0.5582624 1.439491
## 17 2.315414 0.5546140 1.485624
## 18 2.538266 0.5527585 1.576116
## 19 2.652645 0.5441030 1.624933
## 20 2.769165 0.5450234 1.675027
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.
plot(plsTunecmp)
#fit a second model that selects ncomp by the largest Rsquared
plsTunecmp2 <- train(x = train_cmpp, y = train_cmp$Yield,
method= "pls",
tuneGrid = expand.grid(ncomp = 1:20),
metric = 'Rsquared', #optimizing Rsquared
trControl = ctrl,
preProc = c("center", "scale")
)
# checked model that optimized Rsquared
print(plsTunecmp2)
## Partial Least Squares
##
## 132 samples
## 115 predictors
##
## Pre-processing: centered (57), scaled (57), ignore (58)
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 127, 126, 124, 125, 124, 125, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.362851 0.5346905 1.141341
## 2 1.393145 0.5258934 1.066228
## 3 1.289711 0.6444817 1.013242
## 4 1.515723 0.6327822 1.100800
## 5 1.728108 0.6479596 1.154141
## 6 1.756615 0.6290911 1.157878
## 7 1.798649 0.6177643 1.182533
## 8 1.798173 0.6079081 1.188746
## 9 1.828769 0.5946988 1.218899
## 10 1.939382 0.5831397 1.269384
## 11 2.028877 0.5842301 1.324990
## 12 2.113436 0.5839007 1.358965
## 13 2.112742 0.5875895 1.360486
## 14 2.086512 0.5833899 1.363457
## 15 2.054611 0.5753634 1.355360
## 16 2.016909 0.5702812 1.343235
## 17 2.076828 0.5657995 1.370536
## 18 2.142933 0.5660274 1.396796
## 19 2.206281 0.5663064 1.423895
## 20 2.297630 0.5648803 1.466473
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 5.
plot(plsTunecmp2)
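In the first run, ncomp = 1 gave the lowest cross-validated RMSE (1.353285), but its Rsquared was relatively low (0.5325094).
Based on both Rsquared and RMSE, I believe ncomp = 3 should be the best choice. Its RMSE in the first run (1.353358) is almost identical to the minimum while its Rsquared is much higher (0.6275744), and in the second run it had the lowest RMSE and the second-highest Rsquared. I expect 1 latent variable to perform worse than 3, so a model with ncomp fixed at 3 was fit and tested.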
#PLS model with 3 latent variables
plsTunecmp3 <- train(x = train_cmpp, y = train_cmp$Yield,
method= "pls",
tuneGrid = expand.grid(ncomp = 3), #force it to use 3 latent variables
metric = 'Rsquared', #optimizing Rsquared
trControl = ctrl,
preProc = c("center", "scale")
)
print(plsTunecmp3)
## Partial Least Squares
##
## 132 samples
## 115 predictors
##
## Pre-processing: centered (57), scaled (57), ignore (58)
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 125, 125, 124, 126, 125, 125, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.38745 0.6231064 1.0418
##
## Tuning parameter 'ncomp' was held constant at a value of 3
#check performance of 3 latent variables model on test data set
cmp_predict <- predict(plsTunecmp3, test_cmpp)
postResample(cmp_predict, test_cmp$Yield)
## RMSE Rsquared MAE
## 1.5332791 0.4786589 1.2090102
#check performance of the 1 latent variable PLS model on the test data set
cmp_predict1 <- predict(plsTunecmp, test_cmpp)
postResample(cmp_predict1, test_cmp$Yield)
## RMSE Rsquared MAE
## 1.9094688 0.2028172 1.4093548
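The model with 3 latent variables (ncomp) performed better on the test set than the model with 1:
ncomp = 3: RMSE 1.5332791, Rsquared 0.4786589, MAE 1.2090102
ncomp = 1: RMSE 1.9094688, Rsquared 0.2028172, MAE 1.4093548
Both models perform worse on the test set than they did in cross-validation on the training set.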
I then decided to check ncomp = 5, the value selected by the highest Rsquared.
cmp_predict2 <- predict(plsTunecmp2, test_cmpp)
postResample(cmp_predict2, test_cmp$Yield)
## RMSE Rsquared MAE
## 2.4504133 0.1541251 1.5157113
ncomp = 5 performed the worst on the test set (RMSE 2.4504133, Rsquared 0.1541251).
Which predictors are most important?
#stores the one latent variable Model
onecomp <- plsTunecmp$finalModel
#stores the three latent variable Model
threecomp <- plsTunecmp3$finalModel
#shows the latent variables
onecomp$loadings
##
## Loadings:
## Comp 1
## BiologicalMaterial01 0.257
## BiologicalMaterial02 0.296
## BiologicalMaterial03 0.249
## BiologicalMaterial04 0.256
## BiologicalMaterial05 0.113
## BiologicalMaterial06 0.284
## BiologicalMaterial07
## BiologicalMaterial08 0.269
## BiologicalMaterial09 0.101
## BiologicalMaterial10 0.193
## BiologicalMaterial11 0.254
## BiologicalMaterial12 0.253
## ManufacturingProcess01
## ManufacturingProcess02 -0.187
## ManufacturingProcess03
## ManufacturingProcess04 -0.178
## ManufacturingProcess05
## ManufacturingProcess06 0.152
## ManufacturingProcess07
## ManufacturingProcess08
## ManufacturingProcess09 0.179
## ManufacturingProcess10 0.107
## ManufacturingProcess11 0.151
## ManufacturingProcess12 0.129
## ManufacturingProcess13 -0.160
## ManufacturingProcess14
## ManufacturingProcess15 0.122
## ManufacturingProcess16
## ManufacturingProcess17 -0.101
## ManufacturingProcess18
## ManufacturingProcess19 0.108
## ManufacturingProcess20
## ManufacturingProcess21
## ManufacturingProcess22
## ManufacturingProcess23
## ManufacturingProcess24 -0.125
## ManufacturingProcess25
## ManufacturingProcess26
## ManufacturingProcess27
## ManufacturingProcess28 0.188
## ManufacturingProcess29
## ManufacturingProcess30 0.102
## ManufacturingProcess31
## ManufacturingProcess32 0.230
## ManufacturingProcess33 0.195
## ManufacturingProcess34
## ManufacturingProcess35
## ManufacturingProcess36 -0.204
## ManufacturingProcess37
## ManufacturingProcess38
## ManufacturingProcess39
## ManufacturingProcess40
## ManufacturingProcess41
## ManufacturingProcess42
## ManufacturingProcess43
## ManufacturingProcess44
## ManufacturingProcess45
## Yield_impTRUE
## BiologicalMaterial01_impTRUE
## BiologicalMaterial02_impTRUE
## BiologicalMaterial03_impTRUE
## BiologicalMaterial04_impTRUE
## BiologicalMaterial05_impTRUE
## BiologicalMaterial06_impTRUE
## BiologicalMaterial07_impTRUE
## BiologicalMaterial08_impTRUE
## BiologicalMaterial09_impTRUE
## BiologicalMaterial10_impTRUE
## BiologicalMaterial11_impTRUE
## BiologicalMaterial12_impTRUE
## ManufacturingProcess01_impTRUE
## ManufacturingProcess02_impTRUE
## ManufacturingProcess03_impTRUE
## ManufacturingProcess04_impTRUE
## ManufacturingProcess05_impTRUE
## ManufacturingProcess06_impTRUE
## ManufacturingProcess07_impTRUE
## ManufacturingProcess08_impTRUE
## ManufacturingProcess09_impTRUE
## ManufacturingProcess10_impTRUE
## ManufacturingProcess11_impTRUE
## ManufacturingProcess12_impTRUE
## ManufacturingProcess13_impTRUE
## ManufacturingProcess14_impTRUE
## ManufacturingProcess15_impTRUE
## ManufacturingProcess16_impTRUE
## ManufacturingProcess17_impTRUE
## ManufacturingProcess18_impTRUE
## ManufacturingProcess19_impTRUE
## ManufacturingProcess20_impTRUE
## ManufacturingProcess21_impTRUE
## ManufacturingProcess22_impTRUE
## ManufacturingProcess23_impTRUE
## ManufacturingProcess24_impTRUE
## ManufacturingProcess25_impTRUE
## ManufacturingProcess26_impTRUE
## ManufacturingProcess27_impTRUE
## ManufacturingProcess28_impTRUE
## ManufacturingProcess29_impTRUE
## ManufacturingProcess30_impTRUE
## ManufacturingProcess31_impTRUE
## ManufacturingProcess32_impTRUE
## ManufacturingProcess33_impTRUE
## ManufacturingProcess34_impTRUE
## ManufacturingProcess35_impTRUE
## ManufacturingProcess36_impTRUE
## ManufacturingProcess37_impTRUE
## ManufacturingProcess38_impTRUE
## ManufacturingProcess39_impTRUE
## ManufacturingProcess40_impTRUE
## ManufacturingProcess41_impTRUE
## ManufacturingProcess42_impTRUE
## ManufacturingProcess43_impTRUE
## ManufacturingProcess44_impTRUE
## ManufacturingProcess45_impTRUE
##
## Comp 1
## SS loadings 1.112
## Proportion Var 0.010
#3 latent variables model
threecomp$loadings
##
## Loadings:
## Comp 1 Comp 2 Comp 3
## BiologicalMaterial01 0.257 -0.191
## BiologicalMaterial02 0.296 -0.127
## BiologicalMaterial03 0.249
## BiologicalMaterial04 0.256 -0.186
## BiologicalMaterial05 0.113 -0.149
## BiologicalMaterial06 0.284 -0.102
## BiologicalMaterial07 -0.270
## BiologicalMaterial08 0.269 -0.174
## BiologicalMaterial09 0.101
## BiologicalMaterial10 0.193 -0.238
## BiologicalMaterial11 0.254 -0.133
## BiologicalMaterial12 0.253 -0.114 -0.144
## ManufacturingProcess01 0.120
## ManufacturingProcess02 -0.187 0.261
## ManufacturingProcess03
## ManufacturingProcess04 -0.178
## ManufacturingProcess05 -0.122
## ManufacturingProcess06 0.152 0.150
## ManufacturingProcess07
## ManufacturingProcess08
## ManufacturingProcess09 0.179 0.282 -0.218
## ManufacturingProcess10 0.107 0.102 -0.229
## ManufacturingProcess11 0.151 0.185 -0.231
## ManufacturingProcess12 0.129 0.187
## ManufacturingProcess13 -0.160 -0.339 0.116
## ManufacturingProcess14 -0.179 0.247
## ManufacturingProcess15 0.122 -0.148 0.210
## ManufacturingProcess16
## ManufacturingProcess17 -0.101 -0.394
## ManufacturingProcess18 -0.324 0.295
## ManufacturingProcess19 0.108 -0.320 0.264
## ManufacturingProcess20 -0.322 0.173
## ManufacturingProcess21 -0.198
## ManufacturingProcess22
## ManufacturingProcess23
## ManufacturingProcess24 -0.125 0.117
## ManufacturingProcess25 -0.121 0.140
## ManufacturingProcess26 -0.114 0.135
## ManufacturingProcess27 -0.124 0.131
## ManufacturingProcess28 0.188 -0.170
## ManufacturingProcess29 -0.160 0.145
## ManufacturingProcess30 0.102
## ManufacturingProcess31 0.124
## ManufacturingProcess32 0.230 0.311
## ManufacturingProcess33 0.195 -0.110 0.280
## ManufacturingProcess34 0.130
## ManufacturingProcess35
## ManufacturingProcess36 -0.204 -0.307
## ManufacturingProcess37 -0.167
## ManufacturingProcess38 0.126
## ManufacturingProcess39 0.164
## ManufacturingProcess40
## ManufacturingProcess41
## ManufacturingProcess42 0.200
## ManufacturingProcess43 -0.143 0.154
## ManufacturingProcess44 0.182
## ManufacturingProcess45 0.186
## Yield_impTRUE
## BiologicalMaterial01_impTRUE
## BiologicalMaterial02_impTRUE
## BiologicalMaterial03_impTRUE
## BiologicalMaterial04_impTRUE
## BiologicalMaterial05_impTRUE
## BiologicalMaterial06_impTRUE
## BiologicalMaterial07_impTRUE
## BiologicalMaterial08_impTRUE
## BiologicalMaterial09_impTRUE
## BiologicalMaterial10_impTRUE
## BiologicalMaterial11_impTRUE
## BiologicalMaterial12_impTRUE
## ManufacturingProcess01_impTRUE
## ManufacturingProcess02_impTRUE
## ManufacturingProcess03_impTRUE
## ManufacturingProcess04_impTRUE
## ManufacturingProcess05_impTRUE
## ManufacturingProcess06_impTRUE
## ManufacturingProcess07_impTRUE
## ManufacturingProcess08_impTRUE
## ManufacturingProcess09_impTRUE
## ManufacturingProcess10_impTRUE
## ManufacturingProcess11_impTRUE
## ManufacturingProcess12_impTRUE
## ManufacturingProcess13_impTRUE
## ManufacturingProcess14_impTRUE
## ManufacturingProcess15_impTRUE
## ManufacturingProcess16_impTRUE
## ManufacturingProcess17_impTRUE
## ManufacturingProcess18_impTRUE
## ManufacturingProcess19_impTRUE
## ManufacturingProcess20_impTRUE
## ManufacturingProcess21_impTRUE
## ManufacturingProcess22_impTRUE
## ManufacturingProcess23_impTRUE
## ManufacturingProcess24_impTRUE
## ManufacturingProcess25_impTRUE
## ManufacturingProcess26_impTRUE
## ManufacturingProcess27_impTRUE
## ManufacturingProcess28_impTRUE
## ManufacturingProcess29_impTRUE
## ManufacturingProcess30_impTRUE
## ManufacturingProcess31_impTRUE
## ManufacturingProcess32_impTRUE
## ManufacturingProcess33_impTRUE
## ManufacturingProcess34_impTRUE
## ManufacturingProcess35_impTRUE
## ManufacturingProcess36_impTRUE
## ManufacturingProcess37_impTRUE
## ManufacturingProcess38_impTRUE
## ManufacturingProcess39_impTRUE
## ManufacturingProcess40_impTRUE
## ManufacturingProcess41_impTRUE
## ManufacturingProcess42_impTRUE
## ManufacturingProcess43_impTRUE
## ManufacturingProcess44_impTRUE
## ManufacturingProcess45_impTRUE
##
## Comp 1 Comp 2 Comp 3
## SS loadings 1.112 1.369 1.242
## Proportion Var 0.010 0.012 0.011
## Cumulative Var 0.010 0.022 0.032
Two PLS models were examined: ncomp = 3 and ncomp = 1. Every biological material except BiologicalMaterial07 has a printed loading on component 1 in both models; BiologicalMaterial07 appears only on component 3. Of the 45 manufacturing processes, 17 have a printed loading on component 1, and others appear only on components 2 and 3, so around half are used in total. This may indicate that the biological materials have the most influence on the yield.
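A quick way to compare the loadings-based reading with a numeric ranking is to sort variable importance scores. The sketch below does this in base R for the eight largest Overall scores that varImp(threecomp) reports in this section, entered by hand. The data frame imp is a stand-in; with the real object, the same ordering could be applied to the full varImp() output.

```r
# Overall scores copied by hand from the varImp(threecomp) output
imp <- data.frame(
  Overall = c(0.0855289360, 0.0821258726, 0.0811498501,
              0.1211090684, 0.1119905526, 0.1106298804,
              0.1021141532, 0.0990486372),
  row.names = c("BiologicalMaterial02", "BiologicalMaterial06",
                "BiologicalMaterial03", "ManufacturingProcess32",
                "ManufacturingProcess13", "ManufacturingProcess36",
                "ManufacturingProcess09", "ManufacturingProcess17")
)

# Sort from most to least important, keeping the row names attached
ranked <- imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE]
ranked
```

In this subset the five highest scores belong to manufacturing processes, so the importance ranking is worth reading alongside the loadings before concluding which group of predictors matters most.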
varImp(threecomp) #variable importance in the 3 latent variable model
## Overall
## BiologicalMaterial01 0.0719437141
## BiologicalMaterial02 0.0855289360
## BiologicalMaterial03 0.0811498501
## BiologicalMaterial04 0.0708180298
## BiologicalMaterial05 0.0304753507
## BiologicalMaterial06 0.0821258726
## BiologicalMaterial07 0.0412765647
## BiologicalMaterial08 0.0772627370
## BiologicalMaterial09 0.0337012505
## BiologicalMaterial10 0.0524852896
## BiologicalMaterial11 0.0748167555
## BiologicalMaterial12 0.0758378014
## ManufacturingProcess01 0.0268683197
## ManufacturingProcess02 0.0505015392
## ManufacturingProcess03 0.0101553570
## ManufacturingProcess04 0.0540081087
## ManufacturingProcess05 0.0280197715
## ManufacturingProcess06 0.0766476793
## ManufacturingProcess07 0.0272803696
## ManufacturingProcess08 0.0062560395
## ManufacturingProcess09 0.1021141532
## ManufacturingProcess10 0.0443906051
## ManufacturingProcess11 0.0663840081
## ManufacturingProcess12 0.0646665065
## ManufacturingProcess13 0.1119905526
## ManufacturingProcess14 0.0121168424
## ManufacturingProcess15 0.0412995544
## ManufacturingProcess16 0.0067601415
## ManufacturingProcess17 0.0990486372
## ManufacturingProcess18 0.0374727510
## ManufacturingProcess19 0.0397113185
## ManufacturingProcess20 0.0435332092
## ManufacturingProcess21 0.0158719950
## ManufacturingProcess22 0.0072871263
## ManufacturingProcess23 0.0227509721
## ManufacturingProcess24 0.0372525332
## ManufacturingProcess25 0.0076477163
## ManufacturingProcess26 0.0114006449
## ManufacturingProcess27 0.0068426037
## ManufacturingProcess28 0.0548534061
## ManufacturingProcess29 0.0283002328
## ManufacturingProcess30 0.0356758676
## ManufacturingProcess31 0.0135885194
## ManufacturingProcess32 0.1211090684
## ManufacturingProcess33 0.0703036345
## ManufacturingProcess34 0.0594050721
## ManufacturingProcess35 0.0406110182
## ManufacturingProcess36 0.1106298804
## ManufacturingProcess37 0.0233492310
## ManufacturingProcess38 0.0192908740
## ManufacturingProcess39 0.0108529941
## ManufacturingProcess40 0.0044560263
## ManufacturingProcess41 0.0027232624
## ManufacturingProcess42 0.0151153456
## ManufacturingProcess43 0.0211919069
## ManufacturingProcess44 0.0093542375
## ManufacturingProcess45 0.0086845947
## Yield_impTRUE 0.0000000000
## BiologicalMaterial01_impTRUE 0.0000000000
## BiologicalMaterial02_impTRUE 0.0000000000
## BiologicalMaterial03_impTRUE 0.0000000000
## BiologicalMaterial04_impTRUE 0.0000000000
## BiologicalMaterial05_impTRUE 0.0000000000
## BiologicalMaterial06_impTRUE 0.0000000000
## BiologicalMaterial07_impTRUE 0.0000000000
## BiologicalMaterial08_impTRUE 0.0000000000
## BiologicalMaterial09_impTRUE 0.0000000000
## BiologicalMaterial10_impTRUE 0.0000000000
## BiologicalMaterial11_impTRUE 0.0000000000
## BiologicalMaterial12_impTRUE 0.0000000000
## ManufacturingProcess01_impTRUE 0.0017519877
## ManufacturingProcess02_impTRUE 0.0017519877
## ManufacturingProcess03_impTRUE 0.0077269176
## ManufacturingProcess04_impTRUE 0.0017519877
## ManufacturingProcess05_impTRUE 0.0017519877
## ManufacturingProcess06_impTRUE 0.0017519877
## ManufacturingProcess07_impTRUE 0.0017519877
## ManufacturingProcess08_impTRUE 0.0017519877
## ManufacturingProcess09_impTRUE 0.0000000000
## ManufacturingProcess10_impTRUE 0.0046171518
## ManufacturingProcess11_impTRUE 0.0047070469
## ManufacturingProcess12_impTRUE 0.0017519877
## ManufacturingProcess13_impTRUE 0.0000000000
## ManufacturingProcess14_impTRUE 0.0004758256
## ManufacturingProcess15_impTRUE 0.0000000000
## ManufacturingProcess16_impTRUE 0.0000000000
## ManufacturingProcess17_impTRUE 0.0000000000
## ManufacturingProcess18_impTRUE 0.0000000000
## ManufacturingProcess19_impTRUE 0.0000000000
## ManufacturingProcess20_impTRUE 0.0000000000
## ManufacturingProcess21_impTRUE 0.0000000000
## ManufacturingProcess22_impTRUE 0.0017519877
## ManufacturingProcess23_impTRUE 0.0017519877
## ManufacturingProcess24_impTRUE 0.0017519877
## ManufacturingProcess25_impTRUE 0.0014452909
## ManufacturingProcess26_impTRUE 0.0014452909
## ManufacturingProcess27_impTRUE 0.0014452909
## ManufacturingProcess28_impTRUE 0.0014452909
## ManufacturingProcess29_impTRUE 0.0014452909
## ManufacturingProcess30_impTRUE 0.0014452909
## ManufacturingProcess31_impTRUE 0.0014452909
## ManufacturingProcess32_impTRUE 0.0000000000
## ManufacturingProcess33_impTRUE 0.0014452909
## ManufacturingProcess34_impTRUE 0.0014452909
## ManufacturingProcess35_impTRUE 0.0014452909
## ManufacturingProcess36_impTRUE 0.0014452909
## ManufacturingProcess37_impTRUE 0.0000000000
## ManufacturingProcess38_impTRUE 0.0000000000
## ManufacturingProcess39_impTRUE 0.0000000000
## ManufacturingProcess40_impTRUE 0.0017519877
## ManufacturingProcess41_impTRUE 0.0017519877
## ManufacturingProcess42_impTRUE 0.0000000000
## ManufacturingProcess43_impTRUE 0.0000000000
## ManufacturingProcess44_impTRUE 0.0000000000
## ManufacturingProcess45_impTRUE 0.0000000000
#arrange the variables by importance
varImp(plsTunecmp3)$importance |>
arrange(-Overall)
## Overall
## ManufacturingProcess32 100.0000000
## ManufacturingProcess13 92.4708233
## ManufacturingProcess36 91.3473135
## ManufacturingProcess09 84.3158605
## ManufacturingProcess17 81.7846579
## BiologicalMaterial02 70.6214136
## BiologicalMaterial06 67.8114973
## BiologicalMaterial03 67.0055935
## BiologicalMaterial08 63.7959965
## ManufacturingProcess06 63.2881421
## BiologicalMaterial12 62.6194243
## BiologicalMaterial11 61.7763447
## BiologicalMaterial01 59.4040687
## BiologicalMaterial04 58.4745889
## ManufacturingProcess33 58.0498517
## ManufacturingProcess11 54.8134083
## ManufacturingProcess12 53.3952637
## ManufacturingProcess34 49.0508868
## ManufacturingProcess28 45.2925672
## ManufacturingProcess04 44.5946034
## BiologicalMaterial10 43.3372086
## ManufacturingProcess02 41.6992220
## ManufacturingProcess10 36.6534114
## ManufacturingProcess20 35.9454579
## ManufacturingProcess15 34.1011247
## BiologicalMaterial07 34.0821420
## ManufacturingProcess35 33.5325989
## ManufacturingProcess19 32.7897151
## ManufacturingProcess18 30.9413254
## ManufacturingProcess24 30.7594912
## ManufacturingProcess30 29.4576352
## BiologicalMaterial09 27.8271899
## BiologicalMaterial05 25.1635581
## ManufacturingProcess29 23.3675588
## ManufacturingProcess05 23.1359814
## ManufacturingProcess07 22.5254557
## ManufacturingProcess01 22.1852253
## ManufacturingProcess37 19.2795067
## ManufacturingProcess23 18.7855232
## ManufacturingProcess43 17.4981999
## ManufacturingProcess38 15.9285132
## ManufacturingProcess21 13.1055380
## ManufacturingProcess42 12.4807711
## ManufacturingProcess31 11.2200676
## ManufacturingProcess14 10.0049010
## ManufacturingProcess26 9.4135353
## ManufacturingProcess39 8.9613389
## ManufacturingProcess03 8.3852986
## ManufacturingProcess44 7.7238126
## ManufacturingProcess45 7.1708872
## ManufacturingProcess03_impTRUE 6.3801313
## ManufacturingProcess25 6.3147346
## ManufacturingProcess22 6.0169948
## ManufacturingProcess27 5.6499516
## ManufacturingProcess16 5.5818624
## ManufacturingProcess08 5.1656243
## ManufacturingProcess11_impTRUE 3.8866180
## ManufacturingProcess10_impTRUE 3.8123914
## ManufacturingProcess40 3.6793498
## ManufacturingProcess41 2.2486032
## ManufacturingProcess01_impTRUE 1.4466198
## ManufacturingProcess02_impTRUE 1.4466198
## ManufacturingProcess04_impTRUE 1.4466198
## ManufacturingProcess05_impTRUE 1.4466198
## ManufacturingProcess06_impTRUE 1.4466198
## ManufacturingProcess07_impTRUE 1.4466198
## ManufacturingProcess08_impTRUE 1.4466198
## ManufacturingProcess12_impTRUE 1.4466198
## ManufacturingProcess22_impTRUE 1.4466198
## ManufacturingProcess23_impTRUE 1.4466198
## ManufacturingProcess24_impTRUE 1.4466198
## ManufacturingProcess40_impTRUE 1.4466198
## ManufacturingProcess41_impTRUE 1.4466198
## ManufacturingProcess25_impTRUE 1.1933796
## ManufacturingProcess26_impTRUE 1.1933796
## ManufacturingProcess27_impTRUE 1.1933796
## ManufacturingProcess28_impTRUE 1.1933796
## ManufacturingProcess29_impTRUE 1.1933796
## ManufacturingProcess30_impTRUE 1.1933796
## ManufacturingProcess31_impTRUE 1.1933796
## ManufacturingProcess33_impTRUE 1.1933796
## ManufacturingProcess34_impTRUE 1.1933796
## ManufacturingProcess35_impTRUE 1.1933796
## ManufacturingProcess36_impTRUE 1.1933796
## ManufacturingProcess14_impTRUE 0.3928902
## Yield_impTRUE 0.0000000
## BiologicalMaterial01_impTRUE 0.0000000
## BiologicalMaterial02_impTRUE 0.0000000
## BiologicalMaterial03_impTRUE 0.0000000
## BiologicalMaterial04_impTRUE 0.0000000
## BiologicalMaterial05_impTRUE 0.0000000
## BiologicalMaterial06_impTRUE 0.0000000
## BiologicalMaterial07_impTRUE 0.0000000
## BiologicalMaterial08_impTRUE 0.0000000
## BiologicalMaterial09_impTRUE 0.0000000
## BiologicalMaterial10_impTRUE 0.0000000
## BiologicalMaterial11_impTRUE 0.0000000
## BiologicalMaterial12_impTRUE 0.0000000
## ManufacturingProcess09_impTRUE 0.0000000
## ManufacturingProcess13_impTRUE 0.0000000
## ManufacturingProcess15_impTRUE 0.0000000
## ManufacturingProcess16_impTRUE 0.0000000
## ManufacturingProcess17_impTRUE 0.0000000
## ManufacturingProcess18_impTRUE 0.0000000
## ManufacturingProcess19_impTRUE 0.0000000
## ManufacturingProcess20_impTRUE 0.0000000
## ManufacturingProcess21_impTRUE 0.0000000
## ManufacturingProcess32_impTRUE 0.0000000
## ManufacturingProcess37_impTRUE 0.0000000
## ManufacturingProcess38_impTRUE 0.0000000
## ManufacturingProcess39_impTRUE 0.0000000
## ManufacturingProcess42_impTRUE 0.0000000
## ManufacturingProcess43_impTRUE 0.0000000
## ManufacturingProcess44_impTRUE 0.0000000
## ManufacturingProcess45_impTRUE 0.0000000
The most important predictors are listed above, ranked from highest to
lowest importance.
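The Overall scores above look like a rescaling of the raw scores in the earlier listing: each raw value is divided by the largest one (ManufacturingProcess32) and multiplied by 100. That matches the 0 to 100 scaling that caret's `varImp()` applies by default. The sketch below checks this on a few rows copied from the output; treating the first listing as the unscaled importance is an assumption inferred from the numbers.

```r
# Raw scores copied from the first listing above (a few rows only).
raw <- c(
  ManufacturingProcess32 = 0.1211090684,
  ManufacturingProcess13 = 0.1119905526,
  ManufacturingProcess36 = 0.1106298804,
  ManufacturingProcess09 = 0.1021141532,
  Yield_impTRUE          = 0.0000000000
)

# Min-max rescaling to the 0-100 range; the minimum here is 0,
# so this reduces to raw / max(raw) * 100.
scaled <- (raw - min(raw)) / (max(raw) - min(raw)) * 100

round(scaled, 4)
```

The rounded values can be compared against the Overall column for the same predictors.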
#keep the ten most important predictors
toppredict <- varImp(plsTunecmp3)$importance |>
  arrange(desc(Overall)) |>
  head(10)
#correlation heatmap of Yield and the ten most important predictors
imputeCMP |>
  select(all_of(c("Yield", row.names(toppredict)))) |>
  cor() |>
  corrplot::corrplot(method = "number", number.cex = 0.7, type = "upper")
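The heatmap makes the correlations visible, but a sorted table can be easier to compare against the importance ranking. A minimal sketch in base R follows; `rank_by_correlation()` is a hypothetical helper, not part of caret or corrplot, and the toy data frame is made up purely to exercise it.

```r
# Hypothetical helper: rank predictors by absolute correlation with a response.
rank_by_correlation <- function(data, response, predictors) {
  # Pearson correlation of each predictor with the response column
  r <- vapply(
    predictors,
    function(p) cor(data[[p]], data[[response]]),
    numeric(1)
  )
  out <- data.frame(
    predictor = as.character(predictors),
    correlation = unname(r),
    stringsAsFactors = FALSE
  )
  # Strongest relationship first, regardless of sign
  out <- out[order(-abs(out$correlation)), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Toy data (invented for illustration, not from the chemical process data)
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, 6, 5),
  n = c(6, 5, 4, 3, 1, 2),
  y = c(1.1, 1.9, 3.2, 3.9, 5.1, 6.0)
)

ranked <- rank_by_correlation(toy, "y", c("b", "n", "a"))
ranked
```

With the objects in this analysis, the call would presumably be `rank_by_correlation(imputeCMP, "Yield", row.names(toppredict))`.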
Manufacturing process 32 appears to have the highest absolute
correlation with yield, and it also has the highest importance. The top
manufacturing processes have somewhat higher absolute correlations with
yield than the top biological materials, but not by much. Manufacturing
processes take six of the top ten places, yet relative to group size a
larger share of the biological materials is represented: 4 of 12,
against 6 of 45 manufacturing processes. The manufacturing processes
that do rank highly have both a strong correlation with yield and high
importance.
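How the two predictor groups are represented in the top ten can be tallied directly from the names in the importance ranking. A small sketch, with the ten names copied from the output above and the group sizes (12 biological materials, 45 manufacturing processes) read off the same listing:

```r
# Ten highest-ranked predictors, copied from the importance output above.
top_names <- c(
  "ManufacturingProcess32", "ManufacturingProcess13",
  "ManufacturingProcess36", "ManufacturingProcess09",
  "ManufacturingProcess17", "BiologicalMaterial02",
  "BiologicalMaterial06", "BiologicalMaterial03",
  "BiologicalMaterial08", "ManufacturingProcess06"
)

# Strip the trailing digits to get each predictor's group.
group <- sub("[0-9]+$", "", top_names)

# Number of top-ten predictors per group.
counts <- vapply(split(top_names, group), length, integer(1))

# Predictors available in each group, read off the listing above.
totals <- c(BiologicalMaterial = 12, ManufacturingProcess = 45)

# Share of each group that reaches the top ten.
share <- counts[names(totals)] / totals

counts
round(share, 3)
```

Manufacturing processes hold more of the top-ten places (6 against 4), while the biological materials are represented at a higher rate (about 33% of the group against about 13%).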
The PLS model assigns the greatest importance to a handful of
manufacturing processes and biological materials. Monitoring and
improving those processes and materials may help produce higher yields.
More resources could potentially go to the materials and processes most
strongly correlated with yield, and fewer to the weakly correlated
ones.