knitr::include_graphics('6-1.png')
library(AppliedPredictiveModeling)
library(caret)       # nearZeroVar, createDataPartition, train, postResample, varImp
library(knitr)       # kable
library(kableExtra)  # kable_styling
library(magrittr)    # %>%
data(permeability)
summary(permeability)
## permeability
## Min. : 0.06
## 1st Qu.: 1.55
## Median : 4.91
## Mean :12.24
## 3rd Qu.:15.47
## Max. :55.60
The fingerprints matrix contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
low_frequency <- nearZeroVar(fingerprints)  # indices of near-zero-variance (low-frequency) predictors
X <- fingerprints[,-low_frequency]          # drop those columns
print(paste0(dim(X)[2], " columns are left after removing ", length(low_frequency),
             " columns using nearZeroVar function"))
## [1] "388 columns are left after removing 719 columns using nearZeroVar function"
Out of the original 1,107 fingerprint columns only 388 remain; the other 719 were removed because the nearZeroVar function flagged them as near-zero-variance (low-frequency) predictors.
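As a side note, here is a minimal sketch of the criterion nearZeroVar applies: for a binary fingerprint column, the frequency ratio is the count of the most common value divided by the count of the second most common value, and caret's defaults flag columns whose ratio exceeds 95/5 and whose percentage of unique values is low.
freq_ratio <- apply(fingerprints, 2, function(x) {
  tab <- sort(table(x), decreasing = TRUE)                 # counts of 0s and 1s in this column
  if (length(tab) == 1) Inf else unname(tab[1] / tab[2])   # most common / second most common
})
head(sort(freq_ratio, decreasing = TRUE))                  # the most imbalanced fingerprints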
set.seed(100)
# Splitting the data into training and test
splitt <- createDataPartition(permeability, p=0.8, list=FALSE)
# Training
X_train <- X[splitt, ]
y_train <- permeability[splitt, ]
# Test
X_test <- X[-splitt, ]
y_test <- permeability[-splitt, ]
# PLS Method
model_pls <- train(X_train, y_train, method='pls', metric='RMSE',
                   tuneLength=20, trControl = trainControl(method='cv'),
                   preProcess = c('center','scale'))
model_pls
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 121, 119, 120, 120, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.89902 0.3366671 9.723079
## 2 11.57073 0.4528188 8.008647
## 3 11.74004 0.4540994 8.638716
## 4 11.93784 0.4435949 8.785540
## 5 11.72550 0.4550101 8.696446
## 6 11.50509 0.4693658 8.546633
## 7 11.39297 0.4871664 8.653850
## 8 11.13762 0.5051883 8.558643
## 9 11.18859 0.5013221 8.642897
## 10 11.29239 0.4985107 8.727017
## 11 11.47258 0.4836285 8.824869
## 12 11.42934 0.4880340 8.975986
## 13 11.74486 0.4702584 9.189660
## 14 12.08391 0.4502299 9.380611
## 15 12.27229 0.4497430 9.560355
## 16 12.53866 0.4429663 9.685105
## 17 12.57991 0.4380633 9.684354
## 18 12.60036 0.4371346 9.779651
## 19 12.65043 0.4323267 9.815696
## 20 12.94446 0.4207868 10.028232
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.
plot(model_pls)
I split the data into training and test sets using the createDataPartition function from caret and pre-processed the predictors by centering and scaling them. The train function tunes the number of PLS components from 1 to 20 using 10-fold cross-validation and picks the best setting automatically. After fitting, the best model uses ncomp = 8, which gives the lowest cross-validated RMSE of 11.13762 and an R2 of 0.5051883.
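The chosen tuning value and its resampled metrics can also be pulled straight from the train object; a quick sketch using the model_pls fit above:
model_pls$bestTune                                            # ncomp selected by cross-validation
subset(model_pls$results, ncomp == model_pls$bestTune$ncomp)  # RMSE, Rsquared and MAE for that ncomp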
postResample(predict(model_pls, X_test), obs=y_test)
## RMSE Rsquared MAE
## 11.0223004 0.5357105 7.9250542
I used postResample with predict to evaluate the test set: the RMSE is 11.0223004 and the R2 is 0.5357105.
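For reference, a sketch of what postResample is computing here, done by hand on the same test-set predictions:
pred <- predict(model_pls, X_test)
sqrt(mean((pred - y_test)^2))   # RMSE
cor(pred, y_test)^2             # R-squared as the squared correlation, which is what caret reports
mean(abs(pred - y_test))        # MAE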
Next I will fit ridge regression, lasso regression, and an elastic net to compare their performance.
set.seed(102)
# Ridge regression fit
ridge_fit <- train(X_train, y_train, method='ridge', metric='Rsquared',
                   tuneGrid = data.frame(.lambda = seq(0, 1, by=0.1)),
                   trControl = trainControl(method = 'cv'), preProcess = c('center','scale'))
ridge_fit
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 117, 121, 121, 121, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 16.68100 0.3535319 12.392045
## 0.1 12.37848 0.4760924 9.419578
## 0.2 11.99326 0.5193754 9.156013
## 0.3 12.02169 0.5367031 9.236369
## 0.4 12.25428 0.5444773 9.463318
## 0.5 12.60832 0.5481801 9.813055
## 0.6 13.05235 0.5497518 10.202162
## 0.7 13.56447 0.5499816 10.631959
## 0.8 14.13176 0.5495345 11.072799
## 0.9 14.74415 0.5486228 11.539933
## 1.0 15.39105 0.5474532 12.028484
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.7.
plot(ridge_fit)
# Predicting
ridge_pred <- predict(ridge_fit, X_test)
ridge_pred
## 1 5 9 13 16 20
## 6.8715290 -12.3603643 43.1185578 9.9532611 -0.5725331 2.7589316
## 21 27 31 33 39 55
## 9.2087234 11.7819743 2.2031998 -2.4673333 0.3994258 -2.3324883
## 64 68 71 82 87 88
## 55.0792497 24.2195355 8.5771918 0.7493490 9.4232598 21.0940189
## 92 98 111 118 120 126
## 19.4365734 -14.1884664 50.6256634 55.0001298 42.3894960 29.3488493
## 133 138 140 141 147 152
## 43.7844537 -1.5604796 -13.4900706 45.3146828 -7.8234895 0.9416249
## 161 165
## 7.6667275 -11.6859589
set.seed(1003)
lasso_fit <- train(X_train, y_train, method='lasso', metric='Rsquared',
tuneGrid = data.frame(.fraction = seq(0,0.5, by=0.05)),
trControl = trainControl(method='cv'),
preProcess = c('center','scale'))
lasso_fit
## The lasso
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 120, 120, 119, 120, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00 15.59280 NaN 12.269671
## 0.05 12.77994 0.4943310 9.341717
## 0.10 12.49320 0.4776698 8.894579
## 0.15 12.35461 0.4591157 8.924000
## 0.20 12.14665 0.4605743 8.850390
## 0.25 12.10852 0.4557808 8.841609
## 0.30 12.20951 0.4514769 8.966546
## 0.35 12.40588 0.4459776 9.159343
## 0.40 12.59000 0.4409435 9.264564
## 0.45 12.70559 0.4407685 9.308617
## 0.50 12.79818 0.4417921 9.327518
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.
plot(lasso_fit)
# Predicting
lasso_pred <- predict(lasso_fit, X_test)
lasso_pred
## 1 5 9 13 16 20 21 27
## 9.254919 1.121138 27.646237 9.071857 2.685629 7.113777 13.683067 7.258316
## 31 33 39 55 64 68 71 82
## 11.940185 4.573050 4.850462 7.113777 36.390157 17.914883 11.639968 11.962899
## 87 88 92 98 111 118 120 126
## 5.889171 14.345882 14.345882 1.121138 34.502736 29.392821 27.505400 20.299933
## 133 138 140 141 147 152 161 165
## 29.820867 9.254919 5.531406 29.392821 5.549286 9.001198 9.071857 2.548079
set.seed(1330)
elasticnet_fit <- train(X_train, y_train, method ='enet', metric='Rsquared',
tuneGrid = expand.grid(.fraction=seq(0,1,by=0.1),
.lambda=seq(0,1,by=0.1)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale'))
elasticnet_fit
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 120, 121, 119, 120, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.0 0.0 15.10056 NaN 11.950769
## 0.0 0.1 10.90709 0.5261732 7.858133
## 0.0 0.2 10.55378 0.5322219 8.031803
## 0.0 0.3 10.59789 0.5278275 8.098240
## 0.0 0.4 11.04379 0.5033320 8.327753
## 0.0 0.5 11.33859 0.4856997 8.534471
## 0.0 0.6 11.77705 0.4611452 8.908106
## 0.0 0.7 12.27343 0.4395358 9.260702
## 0.0 0.8 12.82588 0.4168607 9.613527
## 0.0 0.9 13.33086 0.3968621 9.878898
## 0.0 1.0 13.83232 0.3776926 10.182851
## 0.1 0.0 15.20724 NaN 12.119440
## 0.1 0.1 11.06042 0.5447559 7.845591
## 0.1 0.2 10.56758 0.5573242 7.518855
## 0.1 0.3 10.56288 0.5523973 7.759646
## 0.1 0.4 10.84704 0.5335333 7.986106
## 0.1 0.5 11.19471 0.5125235 8.238357
## 0.1 0.6 11.41615 0.4998777 8.442198
## 0.1 0.7 11.56228 0.4917630 8.557423
## 0.1 0.8 11.69565 0.4848098 8.645344
## 0.1 0.9 11.77921 0.4799380 8.708494
## 0.1 1.0 11.89337 0.4732632 8.787934
## 0.2 0.0 15.20724 NaN 12.119440
## 0.2 0.1 11.31352 0.5257065 8.077974
## 0.2 0.2 10.79116 0.5615412 7.499780
## 0.2 0.3 10.67474 0.5606996 7.671780
## 0.2 0.4 10.81792 0.5550249 7.915661
## 0.2 0.5 11.08292 0.5428790 8.117785
## 0.2 0.6 11.33912 0.5307468 8.347349
## 0.2 0.7 11.51217 0.5224006 8.492870
## 0.2 0.8 11.60398 0.5177535 8.589217
## 0.2 0.9 11.67153 0.5153241 8.653937
## 0.2 1.0 11.74219 0.5121920 8.706994
## 0.3 0.0 15.20724 NaN 12.119440
## 0.3 0.1 11.44068 0.5143715 8.137754
## 0.3 0.2 10.98539 0.5612163 7.529199
## 0.3 0.3 10.88073 0.5640987 7.682321
## 0.3 0.4 10.99772 0.5629668 7.931742
## 0.3 0.5 11.22546 0.5554816 8.228256
## 0.3 0.6 11.47231 0.5460987 8.493921
## 0.3 0.7 11.68951 0.5375374 8.698589
## 0.3 0.8 11.78649 0.5341309 8.802670
## 0.3 0.9 11.87150 0.5312367 8.875783
## 0.3 1.0 11.93985 0.5294215 8.915696
## 0.4 0.0 15.20724 NaN 12.119440
## 0.4 0.1 11.49414 0.5081250 8.118392
## 0.4 0.2 11.10564 0.5616590 7.567687
## 0.4 0.3 11.14932 0.5652887 7.762486
## 0.4 0.4 11.29399 0.5652505 8.112423
## 0.4 0.5 11.49350 0.5618952 8.449932
## 0.4 0.6 11.74696 0.5543219 8.728406
## 0.4 0.7 11.97129 0.5468773 8.928236
## 0.4 0.8 12.10688 0.5434337 9.046400
## 0.4 0.9 12.20181 0.5412077 9.133802
## 0.4 1.0 12.28547 0.5395755 9.195989
## 0.5 0.0 15.20724 NaN 12.119440
## 0.5 0.1 11.54667 0.5035266 8.090540
## 0.5 0.2 11.23999 0.5616389 7.579910
## 0.5 0.3 11.44031 0.5666527 7.895838
## 0.5 0.4 11.63243 0.5665093 8.338674
## 0.5 0.5 11.83545 0.5649853 8.699247
## 0.5 0.6 12.10242 0.5589166 9.008868
## 0.5 0.7 12.33964 0.5527586 9.244038
## 0.5 0.8 12.51590 0.5490720 9.391092
## 0.5 0.9 12.62325 0.5473255 9.489546
## 0.5 1.0 12.72370 0.5459095 9.572861
## 0.6 0.0 15.20724 NaN 12.119440
## 0.6 0.1 11.55989 0.5026139 8.046826
## 0.6 0.2 11.39380 0.5613658 7.590022
## 0.6 0.3 11.75036 0.5673233 8.057736
## 0.6 0.4 12.00887 0.5671143 8.578379
## 0.6 0.5 12.23404 0.5666414 8.990433
## 0.6 0.6 12.52258 0.5615676 9.337129
## 0.6 0.7 12.77554 0.5567109 9.602596
## 0.6 0.8 12.98087 0.5533827 9.791754
## 0.6 0.9 13.10927 0.5516446 9.923882
## 0.6 1.0 13.22669 0.5503819 10.028408
## 0.7 0.0 15.20724 NaN 12.119440
## 0.7 0.1 11.55659 0.5033404 7.999209
## 0.7 0.2 11.57395 0.5606529 7.625056
## 0.7 0.3 12.08126 0.5676174 8.261798
## 0.7 0.4 12.41786 0.5672975 8.876303
## 0.7 0.5 12.67924 0.5674377 9.319651
## 0.7 0.6 12.98739 0.5633118 9.682834
## 0.7 0.7 13.26065 0.5593920 9.973055
## 0.7 0.8 13.48860 0.5564425 10.193431
## 0.7 0.9 13.64559 0.5547649 10.357475
## 0.7 1.0 13.78095 0.5535627 10.481751
## 0.8 0.0 15.20724 NaN 12.119440
## 0.8 0.1 11.55962 0.5033489 7.963638
## 0.8 0.2 11.77199 0.5595254 7.690778
## 0.8 0.3 12.43677 0.5676985 8.529432
## 0.8 0.4 12.85697 0.5672278 9.235567
## 0.8 0.5 13.15985 0.5678457 9.684884
## 0.8 0.6 13.49226 0.5644743 10.091048
## 0.8 0.7 13.78930 0.5612664 10.389205
## 0.8 0.8 14.03796 0.5587117 10.633623
## 0.8 0.9 14.22639 0.5570616 10.822550
## 0.8 1.0 14.37887 0.5559092 10.962326
## 0.9 0.0 15.20724 NaN 12.119440
## 0.9 0.1 11.56374 0.5030984 7.925300
## 0.9 0.2 11.96410 0.5583837 7.745091
## 0.9 0.3 12.80943 0.5675944 8.834157
## 0.9 0.4 13.31956 0.5669248 9.604394
## 0.9 0.5 13.67390 0.5677131 10.087027
## 0.9 0.6 14.02932 0.5652010 10.519444
## 0.9 0.7 14.35122 0.5625630 10.845240
## 0.9 0.8 14.62520 0.5602674 11.093939
## 0.9 0.9 14.84047 0.5588207 11.310634
## 0.9 1.0 15.01112 0.5576687 11.472406
## 1.0 0.0 15.20724 NaN 12.119440
## 1.0 0.1 11.57529 0.5026705 7.886830
## 1.0 0.2 12.16238 0.5568689 7.810669
## 1.0 0.3 13.19717 0.5672067 9.157990
## 1.0 0.4 13.80287 0.5664559 9.981477
## 1.0 0.5 14.21182 0.5673689 10.493906
## 1.0 0.6 14.59134 0.5655703 10.961196
## 1.0 0.7 14.94110 0.5633577 11.308077
## 1.0 0.8 15.24101 0.5614017 11.567233
## 1.0 0.9 15.48141 0.5602249 11.804238
## 1.0 1.0 15.67178 0.5589978 11.988423
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.5 and lambda = 0.8.
plot(elasticnet_fit)
# Predicting
elasticnet_pred <- predict(elasticnet_fit, X_test)
elasticnet_pred
## 1 5 9 13 16 20
## 3.9425450 -10.5292553 41.9768735 9.6745091 -1.2174161 2.8654812
## 21 27 31 33 39 55
## 9.7190122 8.2920811 5.1076710 -1.7813726 2.3125160 0.7474772
## 64 68 71 82 87 88
## 54.4335689 17.4342385 10.6866868 2.3277868 3.7901522 15.5845091
## 92 98 111 118 120 126
## 18.5634237 -12.7178723 51.7544277 51.4280954 44.0467638 27.2006802
## 133 138 140 141 147 152
## 46.3417286 2.2342046 -9.4605524 45.4928978 -4.9647705 2.5595302
## 161 165
## 7.9132472 -9.0596473
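For completeness, the same test-set check used for PLS could be applied to the three penalized models as well; a sketch reusing the predictions computed above:
rbind(ridge      = postResample(ridge_pred, y_test),
      lasso      = postResample(lasso_pred, y_test),
      elasticnet = postResample(elasticnet_pred, y_test))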
# Resampled (cross-validated) metrics at each model's chosen tuning values
PLS_ <- c(11.137, 0.5051883, 8.558643)
ridge_ <- c(13.56447, 0.5499816, 10.631959)
lasso_ <- c(12.77994, 0.4943310, 9.341717)
elasticnet_ <- c(13.15985, 0.5678457, 9.684884)
models_all <- data.frame(PLS_, ridge_, lasso_, elasticnet_)
row.names(models_all) <- c("RMSE", "R2", "MAE")
models_all %>% kable() %>% kable_styling(full_width = FALSE)
|      |       PLS_ |     ridge_ |    lasso_ | elasticnet_ |
|:-----|-----------:|-----------:|----------:|------------:|
| RMSE | 11.1370000 | 13.5644700 | 12.779940 |  13.1598500 |
| R2   |  0.5051883 |  0.5499816 |  0.494331 |   0.5678457 |
| MAE  |  8.5586430 | 10.6319590 |  9.341717 |   9.6848840 |
The PLS model does better than the others on RMSE and MAE. Its R-squared is lower than the other models', but I put less weight on R-squared because it can be inflated simply by adding predictors, even insignificant ones. I would choose the PLS model here.
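An alternative way to compare the resampled metrics side by side is caret's resamples() function; a sketch assuming all four train objects are still in memory (for a properly paired comparison the models would ideally share the same cross-validation folds via a common trainControl index):
resamps <- resamples(list(PLS = model_pls, Ridge = ridge_fit,
                          Lasso = lasso_fit, ENet = elasticnet_fit))
summary(resamps)   # distribution of RMSE, Rsquared and MAE across the 10 folds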
That said, I would hesitate to recommend any of the models above as-is, because the histogram below shows that most of the permeability values fall below 10, so the response is strongly right-skewed.
hist(permeability)
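To quantify the skew, a quick sketch of how much of the response sits below 10:
mean(permeability < 10)          # proportion of compounds with permeability below 10
summary(as.vector(permeability))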
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.
Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
I'll use the bagImpute method of caret's preProcess function to impute the missing values; it fits a bagged tree model for each predictor to do so.
# Fit the bagged-tree imputation on the predictors only (Yield is column 1)
chemical <- preProcess(ChemicalManufacturingProcess[, -c(1)], method="bagImpute")
chemical_imp <- predict(chemical, ChemicalManufacturingProcess[,-c(1)])
print(paste0("Total missing values after imputation are ", sum(is.na(chemical_imp))))
## [1] "Total missing values after imputation are 0"
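For reference, a quick sketch of where those missing values were before imputation, using the raw data:
na_counts <- colSums(is.na(ChemicalManufacturingProcess))  # missing cells per column
sort(na_counts[na_counts > 0], decreasing = TRUE)          # only the columns that had gaps
sum(is.na(ChemicalManufacturingProcess))                   # total missing cells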
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
set.seed(440)
# Splitting data into training and test datasets
splitt <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X_train <- chemical_imp[splitt, ]
y_train <- ChemicalManufacturingProcess$Yield[splitt]
X_test <- chemical_imp[-splitt, ]
y_test <- ChemicalManufacturingProcess$Yield[-splitt]
Since the PLS model performed well relative to the other models in the previous exercise, I'll use PLS regression here as well, and I'll tune on RMSE rather than R2, which I consider the better accuracy criterion for this problem.
model_pls <- train(X_train, y_train, method='pls', metric='RMSE',
tuneLength=20, trControl = trainControl(method='cv'),
preProcess= c('center','scale'))
model_pls
## Partial Least Squares
##
## 144 samples
## 57 predictors
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 131, 129, 131, 132, 129, 128, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.647698 0.4051549 1.255832
## 2 1.906286 0.5059496 1.235695
## 3 1.803567 0.5291919 1.193270
## 4 2.140077 0.5267616 1.312367
## 5 2.472295 0.5069412 1.401038
## 6 2.653739 0.4946784 1.447325
## 7 2.934339 0.4930722 1.531310
## 8 3.060311 0.4746942 1.591578
## 9 3.428803 0.4663961 1.721432
## 10 3.736364 0.4431318 1.841348
## 11 4.017309 0.4329641 1.929933
## 12 4.291445 0.4173452 1.999973
## 13 4.616753 0.4134367 2.069126
## 14 5.018988 0.4071779 2.139774
## 15 5.240959 0.4051173 2.173287
## 16 5.358958 0.4084378 2.192843
## 17 5.532612 0.4035234 2.255547
## 18 5.637576 0.4055526 2.288323
## 19 5.754911 0.4070059 2.326917
## 20 5.826497 0.4069770 2.354792
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.
plot(model_pls)
As noted, RMSE was used to select the optimal model. The optimal value is ncomp = 1, which gives a cross-validated RMSE of 1.647698, lower than that of any model with more components.
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
pls_pred <- predict(model_pls, X_test)
postResample(pls_pred, y_test)
## RMSE Rsquared MAE
## 1.3434943 0.3511318 1.0221352
The test-set RMSE is 1.3434943, which is lower than the resampled RMSE of the trained PLS model, so the performance metric actually improved on the test set. R-squared dropped, but I put less weight on it since it can be inflated by adding insignificant variables. Both the RMSE and MAE values are lower on the test data.
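A small follow-up sketch: compare the apparent (training-set) fit with the resampled and test-set numbers, and plot predictions against observations on the test set.
postResample(predict(model_pls, X_train), y_train)   # apparent performance on the training data
plot(y_test, pls_pred, xlab = "Observed yield", ylab = "Predicted yield",
     main = "PLS predictions on the test set")
abline(0, 1, lty = 2)                                # reference line for perfect predictions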
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
model_pls$finalModel$coefficients
## , , 1 comps
##
## .outcome
## BiologicalMaterial01 0.0854810556
## BiologicalMaterial02 0.1115580006
## BiologicalMaterial03 0.1003412250
## BiologicalMaterial04 0.0868070308
## BiologicalMaterial05 0.0456141633
## BiologicalMaterial06 0.1107286277
## BiologicalMaterial07 -0.0268460331
## BiologicalMaterial08 0.0864336416
## BiologicalMaterial09 0.0158896155
## BiologicalMaterial10 0.0464771420
## BiologicalMaterial11 0.0809494826
## BiologicalMaterial12 0.0830604261
## ManufacturingProcess01 -0.0204798789
## ManufacturingProcess02 -0.0535088367
## ManufacturingProcess03 -0.0151599670
## ManufacturingProcess04 -0.0558731267
## ManufacturingProcess05 0.0246767347
## ManufacturingProcess06 0.0800819588
## ManufacturingProcess07 -0.0147907630
## ManufacturingProcess08 0.0067922794
## ManufacturingProcess09 0.1066632299
## ManufacturingProcess10 0.0456662725
## ManufacturingProcess11 0.0721968848
## ManufacturingProcess12 0.0695261383
## ManufacturingProcess13 -0.1106316154
## ManufacturingProcess14 -0.0025378278
## ManufacturingProcess15 0.0529035525
## ManufacturingProcess16 -0.0077364395
## ManufacturingProcess17 -0.0914489164
## ManufacturingProcess18 -0.0130760771
## ManufacturingProcess19 0.0386235393
## ManufacturingProcess20 -0.0137715142
## ManufacturingProcess21 0.0005223598
## ManufacturingProcess22 0.0029867588
## ManufacturingProcess23 -0.0226077506
## ManufacturingProcess24 -0.0483953851
## ManufacturingProcess25 0.0029429257
## ManufacturingProcess26 0.0094337918
## ManufacturingProcess27 0.0017725134
## ManufacturingProcess28 0.0627865551
## ManufacturingProcess29 0.0337582807
## ManufacturingProcess30 0.0466022880
## ManufacturingProcess31 -0.0131702809
## ManufacturingProcess32 0.1338298795
## ManufacturingProcess33 0.0951190142
## ManufacturingProcess34 0.0325844134
## ManufacturingProcess35 -0.0429243105
## ManufacturingProcess36 -0.1188083008
## ManufacturingProcess37 -0.0366988501
## ManufacturingProcess38 -0.0354165908
## ManufacturingProcess39 0.0050329889
## ManufacturingProcess40 -0.0108934911
## ManufacturingProcess41 -0.0039833630
## ManufacturingProcess42 -0.0089037859
## ManufacturingProcess43 0.0367389479
## ManufacturingProcess44 0.0093815735
## ManufacturingProcess45 0.0021861853
varImp(model_pls)
## pls variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess36 88.73
## BiologicalMaterial02 83.29
## BiologicalMaterial06 82.67
## ManufacturingProcess13 82.60
## ManufacturingProcess09 79.62
## BiologicalMaterial03 74.88
## ManufacturingProcess33 70.96
## ManufacturingProcess17 68.21
## BiologicalMaterial04 64.73
## BiologicalMaterial08 64.45
## BiologicalMaterial01 63.73
## BiologicalMaterial12 61.92
## BiologicalMaterial11 60.33
## ManufacturingProcess06 59.68
## ManufacturingProcess11 53.77
## ManufacturingProcess12 51.76
## ManufacturingProcess28 46.71
## ManufacturingProcess04 41.52
## ManufacturingProcess02 39.75
Looking at the coefficients and the variable importance scores above, ManufacturingProcess32 has the largest coefficient (0.1338298795) and the highest importance. Overall, the manufacturing process predictors slightly dominate the list.
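A small sketch to tally how many of the top 20 predictors are process versus biological measurements, using the varImp scores above:
imp <- varImp(model_pls)$importance
top20 <- head(imp[order(-imp$Overall), , drop = FALSE], 20)
table(ifelse(grepl("^Manufacturing", rownames(top20)), "Process", "Biological"))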
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
A positive coefficient indicates a positive relationship between the predictor and the response, and a negative coefficient the opposite; the magnitude of the coefficient indicates the strength of that relationship. According to the coefficients above, ManufacturingProcess32 has the strongest positive relationship with yield, followed by ManufacturingProcess36 (a negative relationship), and BiologicalMaterial02 and BiologicalMaterial06 (both positive). At the bigger picture, several of the top manufacturing process predictors relate negatively to yield while the biological material predictors relate positively. Since the biological predictors cannot be changed, this suggests that yield could be improved in future runs by adjusting the controllable process variables with the largest coefficients: increasing those with positive coefficients and reducing those with negative ones.
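To explore these relationships directly, one option is to look at correlations and scatter plots of a few of the top predictors against yield; a sketch using the imputed predictor set (note this uses all 176 runs, not just the training split):
top_preds <- c("ManufacturingProcess32", "ManufacturingProcess36",
               "BiologicalMaterial02", "BiologicalMaterial06")
yield <- ChemicalManufacturingProcess$Yield
round(cor(chemical_imp[, top_preds], yield), 3)   # sign and strength of each relationship
par(mfrow = c(2, 2))
for (p in top_preds) plot(chemical_imp[[p]], yield, xlab = p, ylab = "Yield")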