Homework 7

Daniel DeBonis

6.2 Developing a model to predict permeability (see Sect.1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(permeability)

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
high_freq <- fingerprints[,-nearZeroVar(fingerprints)]
dim(high_freq)
## [1] 165 388

After filtering out the predictors with low variance, there are 388 remaining.

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(121)
split <- createDataPartition(permeability, p = 0.8, list = FALSE, times = 1)
Xtrain.data  <- high_freq[split, ] #fingerprints train
xtest.data <- high_freq[-split, ] #fingerprints test
Ytrain.data  <- permeability[split, ] #permability train
ytest.data <- permeability[-split, ]
ctrl <- trainControl(method = "cv", number = 10)
pls.mod <- train(x = Xtrain.data, 
                y = Ytrain.data, method = "pls", 
                tuneLength = 20, 
                trControl = ctrl, 
                preProc = c("center", "scale"))
pls.mod
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 121, 119, 119, 119, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.98794  0.3459993  10.088589
##    2     11.76199  0.4739001   8.522928
##    3     11.82525  0.4737298   8.910669
##    4     11.50972  0.4863403   8.766358
##    5     11.20084  0.5123139   8.408934
##    6     10.91283  0.5374273   8.322364
##    7     10.93839  0.5483197   8.260723
##    8     11.19758  0.5470639   8.625416
##    9     11.45343  0.5323240   8.718073
##   10     11.61172  0.5241004   8.753287
##   11     11.62767  0.5257602   8.940583
##   12     11.78028  0.5171329   9.120635
##   13     12.19547  0.5035613   9.370479
##   14     12.64865  0.4787144   9.710662
##   15     13.03644  0.4569241   9.998892
##   16     13.44504  0.4318441  10.312141
##   17     13.83538  0.4174899  10.536723
##   18     13.90907  0.4113511  10.565914
##   19     14.14390  0.4070009  10.565672
##   20     14.22143  0.4058685  10.558467
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.

The model that gives us the lowest RMSE is ncomp = 6. The value for R^2 in this model is 0.5374273. Surprisingly, this is not the highest value of R^2 seen across the models.

plot(pls.mod)

Predict the response for the test set. What is the test set estimate of R2?

plsPred <- predict(pls.mod, newdata=xtest.data)
postResample(pred=plsPred, obs=ytest.data)
##       RMSE   Rsquared        MAE 
## 12.0438820  0.4348518  8.8937768

The estimate for R^2 this time is .4348518.

Try building other models discussed in this chapter. Do any have better predictive performance?

Three models mentioned in the chapter that are appropriate to try are the Ridge, Lasso, and elastic net models. Each of these models requires a tuning parameter, which can be calculated in these packages.

library(elasticnet)
## Loading required package: lars
## Loaded lars 1.3
set.seed(121)
rFit <- train(x=Xtrain.data,
                  y=Ytrain.data,
                  method='ridge',
                  metric='Rsquared',
                  tuneGrid=data.frame(.lambda = seq(0, 2, by=0.1)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )
## Warning: model fit failed for Fold06: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
rFit
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 119, 119, 121, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE      
##   0.0     46.19026  0.3253513  19.650645
##   0.1     11.94113  0.4779242   9.065819
##   0.2     12.08944  0.4898421   9.256008
##   0.3     12.43501  0.4956773   9.492480
##   0.4     12.85888  0.4991488   9.775096
##   0.5     13.35473  0.5014317  10.110119
##   0.6     13.91240  0.5028683  10.544420
##   0.7     14.50678  0.5037906  11.007365
##   0.8     15.14342  0.5045929  11.495508
##   0.9     15.80643  0.5046419  12.040654
##   1.0     16.49417  0.5047758  12.634879
##   1.1     17.20184  0.5047593  13.251666
##   1.2     17.92537  0.5046424  13.863099
##   1.3     18.66175  0.5044508  14.469191
##   1.4     19.40845  0.5042034  15.103153
##   1.5     20.16340  0.5039141  15.752667
##   1.6     20.92484  0.5035937  16.414177
##   1.7     21.69131  0.5032502  17.081964
##   1.8     22.46158  0.5028900  17.765829
##   1.9     23.23460  0.5025177  18.443540
##   2.0     24.00951  0.5021371  19.115252
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.

At the optimized lambda of 1, the ridge regression model gives us an RMSE of 16.49417 and an R^2 of .5047758. This means that the fit was better in the PLS model since that one gave us a lower RMSE and a higher R^2

set.seed(121)
lFit <- train(x=Xtrain.data,
                  y=Ytrain.data,
                  method='lasso',
                  metric='Rsquared',
                  tuneGrid=data.frame(.fraction = seq(0, .1, by=0.01)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )
## Warning: model fit failed for Fold06: fraction=0.1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## Warning in train.default(x = Xtrain.data, y = Ytrain.data, method = "lasso", :
## missing values found in aggregated results
lFit
## The lasso 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 119, 119, 121, 120, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.00      15.48794        NaN  12.306539
##   0.01      13.74275  0.5101493  10.846539
##   0.02      12.98883  0.5143246  10.304896
##   0.03      12.41819  0.5211371   9.712647
##   0.04      12.24017  0.5106165   9.393944
##   0.05      12.21253  0.4999390   9.203348
##   0.06      12.27696  0.4892846   9.141869
##   0.07      12.38404  0.4806316   9.164475
##   0.08      12.45188  0.4764676   9.190038
##   0.09      12.49103  0.4741956   9.182362
##   0.10      12.53263  0.4726914   9.205190
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.03.

In the lasso regression model at the optimal fraction of .03, the RMSE is 12.41819 and the R^2 is .5211371. The lasso regression gives us a better fit than the ridge regression, but not as good as the PLS model.

set.seed(121)
enetFit <- train(x=Xtrain.data,
                 y=Ytrain.data,
                 method='enet',
                 metric='Rsquared',
                 tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                      .lambda = seq(0, 1, by=0.1)),
                 trControl=trainControl(method='cv'),
                 preProcess=c('center','scale')
                  )
## Warning: model fit failed for Fold06: lambda=0.0, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## Warning in train.default(x = Xtrain.data, y = Ytrain.data, method = "enet", :
## missing values found in aggregated results
enetFit
## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 119, 119, 121, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.0     0.0       15.48794        NaN  12.306539
##   0.0     0.1       12.53263  0.4726914   9.205190
##   0.0     0.2       13.37124  0.4464626   9.470231
##   0.0     0.3       16.07839  0.3982089  10.536091
##   0.0     0.4       20.82657  0.3676119  12.091321
##   0.0     0.5       24.66945  0.3592001  13.200135
##   0.0     0.6       28.56979  0.3572732  14.264707
##   0.0     0.7       32.83586  0.3547052  15.486208
##   0.0     0.8       37.17556  0.3464053  16.804062
##   0.0     0.9       41.64325  0.3361191  18.172117
##   0.0     1.0       46.19026  0.3253513  19.650645
##   0.1     0.0       15.23652        NaN  12.140960
##   0.1     0.1       12.29167  0.4443243   8.886001
##   0.1     0.2       11.56270  0.4774777   8.355892
##   0.1     0.3       11.14460  0.4991002   8.252875
##   0.1     0.4       11.31593  0.4957519   8.501932
##   0.1     0.5       11.54821  0.4915198   8.715412
##   0.1     0.6       11.68556  0.4882573   8.818469
##   0.1     0.7       11.75523  0.4859669   8.873831
##   0.1     0.8       11.83917  0.4820786   8.964666
##   0.1     0.9       11.89992  0.4789536   9.028914
##   0.1     1.0       11.94113  0.4779242   9.065819
##   0.2     0.0       15.23652        NaN  12.140960
##   0.2     0.1       12.36814  0.4439155   9.016793
##   0.2     0.2       11.85422  0.4702428   8.427597
##   0.2     0.3       11.34154  0.4982855   8.287876
##   0.2     0.4       11.32767  0.5048431   8.442298
##   0.2     0.5       11.54716  0.5016530   8.651618
##   0.2     0.6       11.79296  0.4959509   8.879779
##   0.2     0.7       11.93072  0.4933927   9.024751
##   0.2     0.8       12.02888  0.4908546   9.148048
##   0.2     0.9       12.08610  0.4887420   9.229331
##   0.2     1.0       12.08944  0.4898421   9.256008
##   0.3     0.0       15.23652        NaN  12.140960
##   0.3     0.1       12.38445  0.4462789   9.048442
##   0.3     0.2       12.04712  0.4665965   8.485871
##   0.3     0.3       11.57877  0.4938469   8.379134
##   0.3     0.4       11.48521  0.5075740   8.491967
##   0.3     0.5       11.71066  0.5077032   8.763252
##   0.3     0.6       12.00234  0.5014957   9.008633
##   0.3     0.7       12.19079  0.4988686   9.192528
##   0.3     0.8       12.32401  0.4962424   9.346347
##   0.3     0.9       12.41241  0.4944781   9.445787
##   0.3     1.0       12.43501  0.4956773   9.492480
##   0.4     0.0       15.23652        NaN  12.140960
##   0.4     0.1       12.36580  0.4468242   9.027024
##   0.4     0.2       12.19016  0.4673554   8.514826
##   0.4     0.3       11.79679  0.4914212   8.471929
##   0.4     0.4       11.69860  0.5085207   8.575156
##   0.4     0.5       11.94449  0.5111364   8.883579
##   0.4     0.6       12.26536  0.5055823   9.179451
##   0.4     0.7       12.50677  0.5024583   9.419210
##   0.4     0.8       12.68585  0.4999392   9.596733
##   0.4     0.9       12.80781  0.4981629   9.719895
##   0.4     1.0       12.85888  0.4991488   9.775096
##   0.5     0.0       15.23652        NaN  12.140960
##   0.5     0.1       12.37803  0.4473183   9.025175
##   0.5     0.2       12.28691  0.4684369   8.534237
##   0.5     0.3       12.02709  0.4894102   8.597701
##   0.5     0.4       11.96715  0.5075662   8.744280
##   0.5     0.5       12.25651  0.5117740   9.068066
##   0.5     0.6       12.60864  0.5081465   9.405587
##   0.5     0.7       12.89905  0.5046777   9.686795
##   0.5     0.8       13.10924  0.5024042   9.902366
##   0.5     0.9       13.27810  0.5004156  10.045580
##   0.5     1.0       13.35473  0.5014317  10.110119
##   0.6     0.0       15.23652        NaN  12.140960
##   0.6     0.1       12.37111  0.4495273   9.006977
##   0.6     0.2       12.37970  0.4698334   8.547917
##   0.6     0.3       12.27345  0.4875657   8.765048
##   0.6     0.4       12.27781  0.5057442   8.991529
##   0.6     0.5       12.61796  0.5114498   9.339637
##   0.6     0.6       13.01293  0.5094180   9.704055
##   0.6     0.7       13.35439  0.5060132  10.032838
##   0.6     0.8       13.59905  0.5037954  10.272586
##   0.6     0.9       13.81101  0.5018255  10.456400
##   0.6     1.0       13.91240  0.5028683  10.544420
##   0.7     0.0       15.23652        NaN  12.140960
##   0.7     0.1       12.35846  0.4510659   8.978109
##   0.7     0.2       12.47792  0.4704114   8.560264
##   0.7     0.3       12.52439  0.4860267   8.955062
##   0.7     0.4       12.61527  0.5036900   9.269168
##   0.7     0.5       13.01230  0.5106658   9.655736
##   0.7     0.6       13.45977  0.5097252  10.058454
##   0.7     0.7       13.84946  0.5067171  10.433913
##   0.7     0.8       14.12848  0.5046667  10.680996
##   0.7     0.9       14.37318  0.5029897  10.890067
##   0.7     1.0       14.50678  0.5037906  11.007365
##   0.8     0.0       15.23652        NaN  12.140960
##   0.8     0.1       12.35039  0.4529550   8.942577
##   0.8     0.2       12.58242  0.4715987   8.584235
##   0.8     0.3       12.78093  0.4858173   9.163693
##   0.8     0.4       12.97646  0.5024329   9.554853
##   0.8     0.5       13.44516  0.5102259   9.989378
##   0.8     0.6       13.95486  0.5097891  10.481704
##   0.8     0.7       14.38757  0.5072529  10.880732
##   0.8     0.8       14.70217  0.5053797  11.140051
##   0.8     0.9       14.97387  0.5040870  11.351495
##   0.8     1.0       15.14342  0.5045929  11.495508
##   0.9     0.0       15.23652        NaN  12.140960
##   0.9     0.1       12.32497  0.4554050   8.893853
##   0.9     0.2       12.69110  0.4716981   8.621928
##   0.9     0.3       13.06596  0.4840724   9.369324
##   0.9     0.4       13.37276  0.4995295   9.839987
##   0.9     0.5       13.91206  0.5081996  10.373523
##   0.9     0.6       14.48140  0.5087106  10.942880
##   0.9     0.7       14.94883  0.5071795  11.356560
##   0.9     0.8       15.30195  0.5054687  11.629613
##   0.9     0.9       15.59941  0.5044801  11.869143
##   0.9     1.0       15.80643  0.5046419  12.040654
##   1.0     0.0       15.23652        NaN  12.140960
##   1.0     0.1       12.30693  0.4572780   8.843362
##   1.0     0.2       12.81078  0.4723433   8.676224
##   1.0     0.3       13.36616  0.4830828   9.596825
##   1.0     0.4       13.78825  0.4977065  10.147079
##   1.0     0.5       14.40359  0.5067869  10.784527
##   1.0     0.6       15.03323  0.5078845  11.407113
##   1.0     0.7       15.53235  0.5072256  11.838048
##   1.0     0.8       15.92916  0.5055444  12.157805
##   1.0     0.9       16.25355  0.5047812  12.435100
##   1.0     1.0       16.49417  0.5047758  12.634879
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.5 and lambda = 0.5.

Optimized with values that are both .5, this model gives us a RMSE of 12.25651 and an R^2 of .5117740. After looking at each model, the lowest RMSE and highest R^2 are found using the PLS model.

Would you recommend any of your models to replace the permeability laboratory experiment?

I do not believe that any of these models are strong enough to replace the laboratory experiment. With R^2 values that top out at .54, our model can at best explain 54% of the variance. Our MAE values also range around 8-9, meaning our margin of error in predictions is about 8-9, which is too wide of a window for meaningful application when it comes to medicine.

6.3 A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

Start R and use these commands to load the data

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect.3.8).

library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.3, built: 2024-11-07)
## ## Copyright (C) 2005-2025 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(ChemicalManufacturingProcess, col = c("yellow", "navy"))

We can use this function to visualize where data is missing in our data set. Here, about 1% of all values are missing. One way to deal with missing data that seems appropriate in this case is using K Nearest Neighbors. The preProcess function in the carat packages contains ways to ready the data for this implementation and then complete it.

library(RANN)
knn <- preProcess(ChemicalManufacturingProcess,
  method = c("center","scale","knnImpute"))
knnpre <- predict(knn, ChemicalManufacturingProcess)
missmap(knnpre, col = c("yellow", "navy"))

With the missing map graphic, we can now be confident that all missing values have appropriate replacements. Now to replicate the process from the previous problem, starting with eliminating variables with near-zero variance.

near_zero_chem <- nearZeroVar(knnpre)
filtered_chem <- knnpre[, -near_zero_chem]
ncol(filtered_chem)
## [1] 57

Only one variable was removed. #### Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric? For this exercise, I will chose a Partial Least Squares model to build upon the example in the previous question.

set.seed(79)

index_chem <- createDataPartition(filtered_chem$Yield, p = .8, list = FALSE)

# We need to exclude 'Yield' column since it is the predicted variable)
X_train_chem <- filtered_chem[index_chem, !(names(filtered_chem) %in% "Yield")]
X_test_chem  <- filtered_chem[-index_chem, !(names(filtered_chem) %in% "Yield")]

# Only using the 'Yield' column)
y_train_chem <- filtered_chem[index_chem, "Yield", drop = TRUE]
y_test_chem  <- filtered_chem[-index_chem, "Yield", drop = TRUE]

pls_chem <- train(
  x = X_train_chem,
  y = y_train_chem,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  trControl = trainControl(method = "cv", number = 10) 
)

plot(pls_chem)

pls_chem
## Partial Least Squares 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 130, 129, 131, 129, 130, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.8474935  0.3825554  0.6548616
##    2     1.1260325  0.4217226  0.7096134
##    3     0.7416650  0.5378682  0.5907047
##    4     0.7834128  0.5126599  0.6039552
##    5     0.9363218  0.5008578  0.6368069
##    6     1.0067108  0.4922164  0.6675886
##    7     1.0682939  0.4954744  0.6858345
##    8     1.1192581  0.4912483  0.7140196
##    9     1.1732719  0.4677467  0.7406973
##   10     1.2706852  0.4533025  0.7665604
##   11     1.4800818  0.4417582  0.8281665
##   12     1.6785742  0.4406087  0.8804082
##   13     1.8827794  0.4256143  0.9459933
##   14     2.0877310  0.4305886  0.9957205
##   15     2.1776560  0.4303086  1.0172406
##   16     2.2571005  0.4267955  1.0396893
##   17     2.3021289  0.4255633  1.0501958
##   18     2.3624749  0.4262620  1.0669353
##   19     2.4675894  0.4208767  1.0970309
##   20     2.5814665  0.4155298  1.1318633
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

The lowest RMSE is found where ncomp=3. This also corresponds to the highest R^2 in this case of .5378682.

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

predicted_chem <- predict(pls_chem, X_test_chem)

predicted_chem_y <- data.frame(obs = y_test_chem, pred = predicted_chem)
defaultSummary(predicted_chem_y)
##      RMSE  Rsquared       MAE 
## 0.5182009 0.7592100 0.4290475

Our R^2 value is greater here compared to our training data. It seems like this model is a good fit.

plot(predicted_chem_y, col = "darkgreen", main = "Observed vs. Predicted", xlab = "", ylab = "Predictions")
par(new = TRUE)
plot(y_test_chem, col = "blue", axes=F, ylab = "", xlab="Observed") 

We have no clear patterns in comparing the observed values and the predictions. This combined with the rather high value for R^2 we observe tells us that this model does a good job of capturing the variance without any major bias.

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

varImp(pls_chem)
## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   85.17
## ManufacturingProcess17   76.76
## ManufacturingProcess09   73.51
## ManufacturingProcess36   72.01
## ManufacturingProcess33   60.82
## ManufacturingProcess06   59.93
## ManufacturingProcess12   57.37
## BiologicalMaterial02     55.86
## BiologicalMaterial08     55.36
## BiologicalMaterial06     53.58
## BiologicalMaterial03     52.87
## BiologicalMaterial12     52.20
## BiologicalMaterial11     51.33
## BiologicalMaterial01     47.62
## ManufacturingProcess11   45.43
## ManufacturingProcess28   44.50
## BiologicalMaterial04     43.74
## ManufacturingProcess04   40.91
## ManufacturingProcess10   35.06

The most important markers seem to be those related to the manufacturing process, making up 8 of the top 10 variables based on their importance to the model.

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

A correlation matrix would be the best way to explore the relationships between top predictors. There are too many possible predictors in our data set, so we should filter out only those with the biggest impact to explore.

library(corrplot)
## corrplot 0.95 loaded
## 
## Attaching package: 'corrplot'
## The following object is masked from 'package:pls':
## 
##     corrplot
chem_imp <- varImp(pls_chem, scale = FALSE)$importance |>
  dplyr::arrange(-Overall)
top10 <- head(chem_imp, 10)

correlation_matrix <- filtered_chem |>
  dplyr::select(Yield, all_of(rownames(top10))) |>
  cor(use = "complete.obs")
corrplot::corrplot(correlation_matrix, 
                   method = "color", 
                   type = "upper", 
                   order = "hclust",
                   tl.cex = 0.8)

We can see that the first three manufacturing processes listed have a negative correlation with yield, so it would be worth exploring minimizing the use or impact of these process if we want yield to be maximized. MP 32 seems to have the highest positive correlation with yield, so that a process whose impact should be maximized due to that positive correlation.