Exercise 6.2

knitr::include_graphics('6-1.png')

(a)

library(AppliedPredictiveModeling)
library(caret)       # nearZeroVar, createDataPartition, train, postResample, varImp
library(knitr)       # kable
library(kableExtra)  # kable_styling and the %>% pipe used for the comparison table
data(permeability)
summary(permeability)
##   permeability  
##  Min.   : 0.06  
##  1st Qu.: 1.55  
##  Median : 4.91  
##  Mean   :12.24  
##  3rd Qu.:15.47  
##  Max.   :55.60

The fingerprints matrix contains the 1107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response for each compound.
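As a quick sanity check on those dimensions (a minimal sketch, not part of the original output; it assumes data(permeability) has loaded both objects):

# Confirm the stated dimensions of the predictor matrix and the response
dim(fingerprints)      # should be 165 compounds by 1107 binary predictors
NROW(permeability)     # should be 165 permeability measurements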

(b)

low_frequency <- nearZeroVar(fingerprints) # indices of near-zero-variance (sparse, unbalanced) predictors
X <- fingerprints[,-low_frequency] # drop the near-zero-variance columns
print(paste0(dim(X)[2], " columns are left after removing 719 columns using nearZeroVar function"))
## [1] "388 columns are left after removing 719 columns using nearZeroVar function"

Out of the original 1107 columns, only 388 remain; the other 719 near-zero-variance predictors (fingerprints that are almost entirely 0s or 1s) were identified and removed by the nearZeroVar function.
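To see why these columns are flagged, nearZeroVar can also return its diagnostics. This is a small sketch (assuming caret is loaded as above); saveMetrics = TRUE returns the frequency ratio and percent-unique values that the filter thresholds on:

# Inspect the near-zero-variance diagnostics instead of just the column indices
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])   # flagged columns: high freqRatio, low percentUnique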

(c)

set.seed(100)


# Splitting the data into training and test
splitt <- createDataPartition(permeability, p=0.8, list=FALSE)

# Training
X_train <- X[splitt, ]
y_train <- permeability[splitt, ]

# Test
X_test <- X[-splitt, ]
y_test <- permeability[-splitt, ]

# PLS Method
model_pls <- train(X_train, y_train, method='pls', metric='RMSE',
                   tuneLength=20, trControl = trainControl(method='cv'),
                   preProcess= c('center','scale'))

model_pls
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 121, 119, 120, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.89902  0.3366671   9.723079
##    2     11.57073  0.4528188   8.008647
##    3     11.74004  0.4540994   8.638716
##    4     11.93784  0.4435949   8.785540
##    5     11.72550  0.4550101   8.696446
##    6     11.50509  0.4693658   8.546633
##    7     11.39297  0.4871664   8.653850
##    8     11.13762  0.5051883   8.558643
##    9     11.18859  0.5013221   8.642897
##   10     11.29239  0.4985107   8.727017
##   11     11.47258  0.4836285   8.824869
##   12     11.42934  0.4880340   8.975986
##   13     11.74486  0.4702584   9.189660
##   14     12.08391  0.4502299   9.380611
##   15     12.27229  0.4497430   9.560355
##   16     12.53866  0.4429663   9.685105
##   17     12.57991  0.4380633   9.684354
##   18     12.60036  0.4371346   9.779651
##   19     12.65043  0.4323267   9.815696
##   20     12.94446  0.4207868  10.028232
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.
plot(model_pls)

I split the data into training and test sets using the createDataPartition function from caret and pre-processed the predictors by centering and scaling them. The train function tunes the number of PLS components from 1 to 20 using 10-fold cross-validation and automatically selects the best value. After fitting, the best model uses ncomp = 8, which gives the lowest cross-validated RMSE of 11.13762 and an R2 of 0.5051883.
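The selected tuning value and its cross-validated metrics can also be pulled out of the fitted object directly (a short sketch using the model_pls object above):

# Confirm the chosen number of components and its resampled performance
model_pls$bestTune
model_pls$results[model_pls$results$ncomp == model_pls$bestTune$ncomp,
                  c("ncomp", "RMSE", "Rsquared", "MAE")]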

(d)

postResample(predict(model_pls, X_test), obs=y_test)
##       RMSE   Rsquared        MAE 
## 11.0223004  0.5357105  7.9250542

I used postResample with predict to calculate the test-set performance: the test R2 is 0.5357, with an RMSE of 11.02 and an MAE of 7.93.
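For reference, postResample reports R2 as the squared correlation between predictions and observed values; a minimal sketch of the same calculation by hand (using the objects defined above):

# Equivalent 'by hand' test-set metrics; R2 here is the squared correlation used by caret
pls_test_pred <- predict(model_pls, X_test)
c(RMSE = sqrt(mean((pls_test_pred - y_test)^2)),
  Rsquared = cor(pls_test_pred, y_test)^2,
  MAE = mean(abs(pls_test_pred - y_test)))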

(e)

I will also fit ridge regression, lasso regression, and elastic net models to compare their performance against the PLS model.

Ridge Regression Method

set.seed(102)

# Ridge Method Fit
ridge_fit <- train(X_train, y_train, method='ridge', metric='Rsquared',
                   tuneGrid = data.frame(.lambda= seq(0,1, by=0.1)),
                   trControl = trainControl(method = 'cv'), preProcess = c('center','scale'))
ridge_fit
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 117, 121, 121, 121, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE      
##   0.0     16.68100  0.3535319  12.392045
##   0.1     12.37848  0.4760924   9.419578
##   0.2     11.99326  0.5193754   9.156013
##   0.3     12.02169  0.5367031   9.236369
##   0.4     12.25428  0.5444773   9.463318
##   0.5     12.60832  0.5481801   9.813055
##   0.6     13.05235  0.5497518  10.202162
##   0.7     13.56447  0.5499816  10.631959
##   0.8     14.13176  0.5495345  11.072799
##   0.9     14.74415  0.5486228  11.539933
##   1.0     15.39105  0.5474532  12.028484
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.7.
plot(ridge_fit)

# Predicting
ridge_pred <- predict(ridge_fit, X_test)
ridge_pred
##           1           5           9          13          16          20 
##   6.8715290 -12.3603643  43.1185578   9.9532611  -0.5725331   2.7589316 
##          21          27          31          33          39          55 
##   9.2087234  11.7819743   2.2031998  -2.4673333   0.3994258  -2.3324883 
##          64          68          71          82          87          88 
##  55.0792497  24.2195355   8.5771918   0.7493490   9.4232598  21.0940189 
##          92          98         111         118         120         126 
##  19.4365734 -14.1884664  50.6256634  55.0001298  42.3894960  29.3488493 
##         133         138         140         141         147         152 
##  43.7844537  -1.5604796 -13.4900706  45.3146828  -7.8234895   0.9416249 
##         161         165 
##   7.6667275 -11.6859589

Lasso Regression Fit

set.seed(1003)
lasso_fit <- train(X_train, y_train, method='lasso', metric='Rsquared', 
                   tuneGrid = data.frame(.fraction = seq(0,0.5, by=0.05)),
                   trControl = trainControl(method='cv'),
                   preProcess = c('center','scale'))

lasso_fit
## The lasso 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 119, 120, 120, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.00      15.59280        NaN  12.269671
##   0.05      12.77994  0.4943310   9.341717
##   0.10      12.49320  0.4776698   8.894579
##   0.15      12.35461  0.4591157   8.924000
##   0.20      12.14665  0.4605743   8.850390
##   0.25      12.10852  0.4557808   8.841609
##   0.30      12.20951  0.4514769   8.966546
##   0.35      12.40588  0.4459776   9.159343
##   0.40      12.59000  0.4409435   9.264564
##   0.45      12.70559  0.4407685   9.308617
##   0.50      12.79818  0.4417921   9.327518
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.
plot(lasso_fit)

# Predicting
lasso_pred <- predict(lasso_fit, X_test)
lasso_pred
##         1         5         9        13        16        20        21        27 
##  9.254919  1.121138 27.646237  9.071857  2.685629  7.113777 13.683067  7.258316 
##        31        33        39        55        64        68        71        82 
## 11.940185  4.573050  4.850462  7.113777 36.390157 17.914883 11.639968 11.962899 
##        87        88        92        98       111       118       120       126 
##  5.889171 14.345882 14.345882  1.121138 34.502736 29.392821 27.505400 20.299933 
##       133       138       140       141       147       152       161       165 
## 29.820867  9.254919  5.531406 29.392821  5.549286  9.001198  9.071857  2.548079

Elastic Net Method

set.seed(1330)

elasticnet_fit <- train(X_train, y_train, method ='enet', metric='Rsquared',
                        tuneGrid = expand.grid(.fraction=seq(0,1,by=0.1),
                        .lambda=seq(0,1,by=0.1)),
                        trControl=trainControl(method='cv'),
                        preProcess=c('center','scale'))

elasticnet_fit
## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 120, 121, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.0     0.0       15.10056        NaN  11.950769
##   0.0     0.1       10.90709  0.5261732   7.858133
##   0.0     0.2       10.55378  0.5322219   8.031803
##   0.0     0.3       10.59789  0.5278275   8.098240
##   0.0     0.4       11.04379  0.5033320   8.327753
##   0.0     0.5       11.33859  0.4856997   8.534471
##   0.0     0.6       11.77705  0.4611452   8.908106
##   0.0     0.7       12.27343  0.4395358   9.260702
##   0.0     0.8       12.82588  0.4168607   9.613527
##   0.0     0.9       13.33086  0.3968621   9.878898
##   0.0     1.0       13.83232  0.3776926  10.182851
##   0.1     0.0       15.20724        NaN  12.119440
##   0.1     0.1       11.06042  0.5447559   7.845591
##   0.1     0.2       10.56758  0.5573242   7.518855
##   0.1     0.3       10.56288  0.5523973   7.759646
##   0.1     0.4       10.84704  0.5335333   7.986106
##   0.1     0.5       11.19471  0.5125235   8.238357
##   0.1     0.6       11.41615  0.4998777   8.442198
##   0.1     0.7       11.56228  0.4917630   8.557423
##   0.1     0.8       11.69565  0.4848098   8.645344
##   0.1     0.9       11.77921  0.4799380   8.708494
##   0.1     1.0       11.89337  0.4732632   8.787934
##   0.2     0.0       15.20724        NaN  12.119440
##   0.2     0.1       11.31352  0.5257065   8.077974
##   0.2     0.2       10.79116  0.5615412   7.499780
##   0.2     0.3       10.67474  0.5606996   7.671780
##   0.2     0.4       10.81792  0.5550249   7.915661
##   0.2     0.5       11.08292  0.5428790   8.117785
##   0.2     0.6       11.33912  0.5307468   8.347349
##   0.2     0.7       11.51217  0.5224006   8.492870
##   0.2     0.8       11.60398  0.5177535   8.589217
##   0.2     0.9       11.67153  0.5153241   8.653937
##   0.2     1.0       11.74219  0.5121920   8.706994
##   0.3     0.0       15.20724        NaN  12.119440
##   0.3     0.1       11.44068  0.5143715   8.137754
##   0.3     0.2       10.98539  0.5612163   7.529199
##   0.3     0.3       10.88073  0.5640987   7.682321
##   0.3     0.4       10.99772  0.5629668   7.931742
##   0.3     0.5       11.22546  0.5554816   8.228256
##   0.3     0.6       11.47231  0.5460987   8.493921
##   0.3     0.7       11.68951  0.5375374   8.698589
##   0.3     0.8       11.78649  0.5341309   8.802670
##   0.3     0.9       11.87150  0.5312367   8.875783
##   0.3     1.0       11.93985  0.5294215   8.915696
##   0.4     0.0       15.20724        NaN  12.119440
##   0.4     0.1       11.49414  0.5081250   8.118392
##   0.4     0.2       11.10564  0.5616590   7.567687
##   0.4     0.3       11.14932  0.5652887   7.762486
##   0.4     0.4       11.29399  0.5652505   8.112423
##   0.4     0.5       11.49350  0.5618952   8.449932
##   0.4     0.6       11.74696  0.5543219   8.728406
##   0.4     0.7       11.97129  0.5468773   8.928236
##   0.4     0.8       12.10688  0.5434337   9.046400
##   0.4     0.9       12.20181  0.5412077   9.133802
##   0.4     1.0       12.28547  0.5395755   9.195989
##   0.5     0.0       15.20724        NaN  12.119440
##   0.5     0.1       11.54667  0.5035266   8.090540
##   0.5     0.2       11.23999  0.5616389   7.579910
##   0.5     0.3       11.44031  0.5666527   7.895838
##   0.5     0.4       11.63243  0.5665093   8.338674
##   0.5     0.5       11.83545  0.5649853   8.699247
##   0.5     0.6       12.10242  0.5589166   9.008868
##   0.5     0.7       12.33964  0.5527586   9.244038
##   0.5     0.8       12.51590  0.5490720   9.391092
##   0.5     0.9       12.62325  0.5473255   9.489546
##   0.5     1.0       12.72370  0.5459095   9.572861
##   0.6     0.0       15.20724        NaN  12.119440
##   0.6     0.1       11.55989  0.5026139   8.046826
##   0.6     0.2       11.39380  0.5613658   7.590022
##   0.6     0.3       11.75036  0.5673233   8.057736
##   0.6     0.4       12.00887  0.5671143   8.578379
##   0.6     0.5       12.23404  0.5666414   8.990433
##   0.6     0.6       12.52258  0.5615676   9.337129
##   0.6     0.7       12.77554  0.5567109   9.602596
##   0.6     0.8       12.98087  0.5533827   9.791754
##   0.6     0.9       13.10927  0.5516446   9.923882
##   0.6     1.0       13.22669  0.5503819  10.028408
##   0.7     0.0       15.20724        NaN  12.119440
##   0.7     0.1       11.55659  0.5033404   7.999209
##   0.7     0.2       11.57395  0.5606529   7.625056
##   0.7     0.3       12.08126  0.5676174   8.261798
##   0.7     0.4       12.41786  0.5672975   8.876303
##   0.7     0.5       12.67924  0.5674377   9.319651
##   0.7     0.6       12.98739  0.5633118   9.682834
##   0.7     0.7       13.26065  0.5593920   9.973055
##   0.7     0.8       13.48860  0.5564425  10.193431
##   0.7     0.9       13.64559  0.5547649  10.357475
##   0.7     1.0       13.78095  0.5535627  10.481751
##   0.8     0.0       15.20724        NaN  12.119440
##   0.8     0.1       11.55962  0.5033489   7.963638
##   0.8     0.2       11.77199  0.5595254   7.690778
##   0.8     0.3       12.43677  0.5676985   8.529432
##   0.8     0.4       12.85697  0.5672278   9.235567
##   0.8     0.5       13.15985  0.5678457   9.684884
##   0.8     0.6       13.49226  0.5644743  10.091048
##   0.8     0.7       13.78930  0.5612664  10.389205
##   0.8     0.8       14.03796  0.5587117  10.633623
##   0.8     0.9       14.22639  0.5570616  10.822550
##   0.8     1.0       14.37887  0.5559092  10.962326
##   0.9     0.0       15.20724        NaN  12.119440
##   0.9     0.1       11.56374  0.5030984   7.925300
##   0.9     0.2       11.96410  0.5583837   7.745091
##   0.9     0.3       12.80943  0.5675944   8.834157
##   0.9     0.4       13.31956  0.5669248   9.604394
##   0.9     0.5       13.67390  0.5677131  10.087027
##   0.9     0.6       14.02932  0.5652010  10.519444
##   0.9     0.7       14.35122  0.5625630  10.845240
##   0.9     0.8       14.62520  0.5602674  11.093939
##   0.9     0.9       14.84047  0.5588207  11.310634
##   0.9     1.0       15.01112  0.5576687  11.472406
##   1.0     0.0       15.20724        NaN  12.119440
##   1.0     0.1       11.57529  0.5026705   7.886830
##   1.0     0.2       12.16238  0.5568689   7.810669
##   1.0     0.3       13.19717  0.5672067   9.157990
##   1.0     0.4       13.80287  0.5664559   9.981477
##   1.0     0.5       14.21182  0.5673689  10.493906
##   1.0     0.6       14.59134  0.5655703  10.961196
##   1.0     0.7       14.94110  0.5633577  11.308077
##   1.0     0.8       15.24101  0.5614017  11.567233
##   1.0     0.9       15.48141  0.5602249  11.804238
##   1.0     1.0       15.67178  0.5589978  11.988423
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.5 and lambda = 0.8.
plot(elasticnet_fit)

# Predicting
elasticnet_pred <- predict(elasticnet_fit, X_test)
elasticnet_pred
##           1           5           9          13          16          20 
##   3.9425450 -10.5292553  41.9768735   9.6745091  -1.2174161   2.8654812 
##          21          27          31          33          39          55 
##   9.7190122   8.2920811   5.1076710  -1.7813726   2.3125160   0.7474772 
##          64          68          71          82          87          88 
##  54.4335689  17.4342385  10.6866868   2.3277868   3.7901522  15.5845091 
##          92          98         111         118         120         126 
##  18.5634237 -12.7178723  51.7544277  51.4280954  44.0467638  27.2006802 
##         133         138         140         141         147         152 
##  46.3417286   2.2342046  -9.4605524  45.4928978  -4.9647705   2.5595302 
##         161         165 
##   7.9132472  -9.0596473

Comparison

PLS_ <- c(11.137, 0.5051883, 8.558643)
ridge_ <- c(13.56447, 0.5499816, 10.631959)
lasso_ <- c(12.77994, 0.4943310, 9.341717)
elasticnet_ <- c(13.15985, 0.5678457, 9.684884)

models_all <- rbind(data.frame(PLS_, ridge_, lasso_, elasticnet_))
row.names(models_all) <- c("RMSE", "R2", "MAE")
models_all %>% kable() %>% kable_styling(full_width = FALSE)
       PLS_        ridge_      lasso_     elasticnet_
RMSE   11.1370000  13.5644700  12.779940  13.1598500
R2      0.5051883   0.5499816   0.494331   0.5678457
MAE     8.5586430  10.6319590   9.341717   9.6848840

The PLS model performs best on RMSE and MAE compared with the other models. Its cross-validated R2 is lower than that of ridge and elastic net, but I do not rely heavily on R2 alone, since it can be inflated by adding even insignificant predictors. I would choose the PLS model here.
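The comparison above uses each model's cross-validated metrics at its selected tuning value. As an additional check (a sketch, assuming the four fitted objects above are still available), all four models can be scored on the same held-out test set:

# Test-set performance of all four models on the same hold-out data
test_perf <- sapply(list(PLS = model_pls, Ridge = ridge_fit,
                         Lasso = lasso_fit, ElasticNet = elasticnet_fit),
                    function(fit) postResample(predict(fit, X_test), y_test))
round(test_perf, 4)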

(f)

I would not recommend any of the models above to replace the permeability laboratory experiment because, as the histogram below shows, most of the permeability values fall below 10, while the models' test-set RMSE is around 11; errors of that size are too large relative to typical permeability values for the predictions to be trusted.

hist(permeability)
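To quantify the skew visible in the histogram (a quick check that was not part of the original write-up):

# Fraction of compounds with permeability below 10, and the upper tail of the distribution
mean(permeability < 10)
quantile(permeability, probs = c(0.50, 0.75, 0.90))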

Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.

(a)

Start R and use these commands to load the data:

data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

(b)

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

I’ll use the bagImpute method of caret's preProcess function to impute the missing values. It fits a bagged tree model for each predictor using the other predictors and uses those models to fill in the missing cells.

chemical <- preProcess(ChemicalManufacturingProcess[, -c(1)], method="bagImpute")
chemical_imp <- predict(chemical, ChemicalManufacturingProcess[,-c(1)])

print(paste0("Total missing values after imputation are ", sum(is.na(chemical_imp))))
## [1] "Total missing values after imputation are 0"

(c)

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

set.seed(440)

# Splitting data into training and test datasets
splitt <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X_train <- chemical_imp[splitt, ]
y_train <- ChemicalManufacturingProcess$Yield[splitt]

X_test <- chemical_imp[-splitt, ]
y_test <- ChemicalManufacturingProcess$Yield[-splitt]

Since the PLS model performed well relative to the other models in the previous exercise, I'll use PLS regression here as well, and I'll tune on RMSE rather than R2, which is a better accuracy criterion for selecting the model.

model_pls <- train(X_train, y_train, method='pls', metric='RMSE',
                   tuneLength=20, trControl = trainControl(method='cv'),
                   preProcess= c('center','scale'))

model_pls
## Partial Least Squares 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 131, 129, 131, 132, 129, 128, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.647698  0.4051549  1.255832
##    2     1.906286  0.5059496  1.235695
##    3     1.803567  0.5291919  1.193270
##    4     2.140077  0.5267616  1.312367
##    5     2.472295  0.5069412  1.401038
##    6     2.653739  0.4946784  1.447325
##    7     2.934339  0.4930722  1.531310
##    8     3.060311  0.4746942  1.591578
##    9     3.428803  0.4663961  1.721432
##   10     3.736364  0.4431318  1.841348
##   11     4.017309  0.4329641  1.929933
##   12     4.291445  0.4173452  1.999973
##   13     4.616753  0.4134367  2.069126
##   14     5.018988  0.4071779  2.139774
##   15     5.240959  0.4051173  2.173287
##   16     5.358958  0.4084378  2.192843
##   17     5.532612  0.4035234  2.255547
##   18     5.637576  0.4055526  2.288323
##   19     5.754911  0.4070059  2.326917
##   20     5.826497  0.4069770  2.354792
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.
plot(model_pls)

As stated, RMSE was used to select the optimal model: ncomp = 1 was chosen, giving a cross-validated RMSE of 1.6477, which is lower than that of any model with more components.

(d)

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

pls_pred <- predict(model_pls, X_test)
postResample(pls_pred, y_test)
##      RMSE  Rsquared       MAE 
## 1.3434943 0.3511318 1.0221352

The test-set RMSE is 1.3435, which is lower than the resampled RMSE of 1.6477 from the training set, so the performance metric actually improved on the test set. Although Rsquared dropped (0.351 versus 0.405), I place less weight on it, since it can give a superficially better value when insignificant variables are added. Both RMSE and MAE are lower on the test data.
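A compact side-by-side view of the resampled (training) metrics at the chosen ncomp and the test-set metrics (a sketch built from the objects already created above):

# Resampled CV metrics at ncomp = 1 versus held-out test-set metrics
cv_best <- model_pls$results[model_pls$results$ncomp == model_pls$bestTune$ncomp, ]
rbind(CV_training = unlist(cv_best[c("RMSE", "Rsquared", "MAE")]),
      Test_set    = postResample(pls_pred, y_test))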

(e)

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

model_pls$finalModel$coefficients
## , , 1 comps
## 
##                             .outcome
## BiologicalMaterial01    0.0854810556
## BiologicalMaterial02    0.1115580006
## BiologicalMaterial03    0.1003412250
## BiologicalMaterial04    0.0868070308
## BiologicalMaterial05    0.0456141633
## BiologicalMaterial06    0.1107286277
## BiologicalMaterial07   -0.0268460331
## BiologicalMaterial08    0.0864336416
## BiologicalMaterial09    0.0158896155
## BiologicalMaterial10    0.0464771420
## BiologicalMaterial11    0.0809494826
## BiologicalMaterial12    0.0830604261
## ManufacturingProcess01 -0.0204798789
## ManufacturingProcess02 -0.0535088367
## ManufacturingProcess03 -0.0151599670
## ManufacturingProcess04 -0.0558731267
## ManufacturingProcess05  0.0246767347
## ManufacturingProcess06  0.0800819588
## ManufacturingProcess07 -0.0147907630
## ManufacturingProcess08  0.0067922794
## ManufacturingProcess09  0.1066632299
## ManufacturingProcess10  0.0456662725
## ManufacturingProcess11  0.0721968848
## ManufacturingProcess12  0.0695261383
## ManufacturingProcess13 -0.1106316154
## ManufacturingProcess14 -0.0025378278
## ManufacturingProcess15  0.0529035525
## ManufacturingProcess16 -0.0077364395
## ManufacturingProcess17 -0.0914489164
## ManufacturingProcess18 -0.0130760771
## ManufacturingProcess19  0.0386235393
## ManufacturingProcess20 -0.0137715142
## ManufacturingProcess21  0.0005223598
## ManufacturingProcess22  0.0029867588
## ManufacturingProcess23 -0.0226077506
## ManufacturingProcess24 -0.0483953851
## ManufacturingProcess25  0.0029429257
## ManufacturingProcess26  0.0094337918
## ManufacturingProcess27  0.0017725134
## ManufacturingProcess28  0.0627865551
## ManufacturingProcess29  0.0337582807
## ManufacturingProcess30  0.0466022880
## ManufacturingProcess31 -0.0131702809
## ManufacturingProcess32  0.1338298795
## ManufacturingProcess33  0.0951190142
## ManufacturingProcess34  0.0325844134
## ManufacturingProcess35 -0.0429243105
## ManufacturingProcess36 -0.1188083008
## ManufacturingProcess37 -0.0366988501
## ManufacturingProcess38 -0.0354165908
## ManufacturingProcess39  0.0050329889
## ManufacturingProcess40 -0.0108934911
## ManufacturingProcess41 -0.0039833630
## ManufacturingProcess42 -0.0089037859
## ManufacturingProcess43  0.0367389479
## ManufacturingProcess44  0.0093815735
## ManufacturingProcess45  0.0021861853
varImp(model_pls)
## pls variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess36   88.73
## BiologicalMaterial02     83.29
## BiologicalMaterial06     82.67
## ManufacturingProcess13   82.60
## ManufacturingProcess09   79.62
## BiologicalMaterial03     74.88
## ManufacturingProcess33   70.96
## ManufacturingProcess17   68.21
## BiologicalMaterial04     64.73
## BiologicalMaterial08     64.45
## BiologicalMaterial01     63.73
## BiologicalMaterial12     61.92
## BiologicalMaterial11     60.33
## ManufacturingProcess06   59.68
## ManufacturingProcess11   53.77
## ManufacturingProcess12   51.76
## ManufacturingProcess28   46.71
## ManufacturingProcess04   41.52
## ManufacturingProcess02   39.75

Looking at the coefficients and variable importance above, ManufacturingProcess32 is the most important predictor and has the largest coefficient (0.1338). Overall, the manufacturing process predictors slightly dominate the list: they hold the top two spots and a small majority of the 20 most important variables.
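To make the question of which type dominates concrete, the top-20 importance scores can be tallied by predictor type (a sketch using the varImp output above):

# Tally biological vs. process predictors among the 20 most important variables
imp <- varImp(model_pls)$importance
top20 <- head(rownames(imp)[order(imp$Overall, decreasing = TRUE)], 20)
table(ifelse(grepl("^Biological", top20), "Biological", "Process"))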

(f)

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Positive coefficients indicate that increasing the predictor is associated with higher yield, and negative coefficients the opposite; the magnitude of a coefficient reflects the strength of the relationship. According to the coefficients above, ManufacturingProcess32 has the strongest positive relationship, followed by ManufacturingProcess36 (negative), and BiologicalMaterial02 and BiologicalMaterial06 (both positive). Looking at the bigger picture, several of the important manufacturing process predictors have negative relationships with yield, while the biological material predictors are almost all positive. Because the process predictors can actually be changed, future runs could aim to increase the processes with positive coefficients (such as ManufacturingProcess32 and 09) and reduce those with negative coefficients (such as ManufacturingProcess36, 13, and 17) to improve yield.
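One simple way to explore these relationships directly (a sketch; the predictor names are taken from the importance table above) is to look at each top predictor's correlation with yield:

# Correlation of the most important predictors with the response
top_preds <- c("ManufacturingProcess32", "ManufacturingProcess36",
               "BiologicalMaterial02", "BiologicalMaterial06",
               "ManufacturingProcess13", "ManufacturingProcess09")
cor(chemical_imp[, top_preds], ChemicalManufacturingProcess$Yield)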