Questions 6.2 and 6.3 from the Kuhn & Johnson book

library(caret)
library(corrplot)
library(dplyr)
library(data.table)
library(e1071)
library(fable)
library(forecast)
library(fpp2)
library(fpp3)
library(ggplot2)
library(GGally)
library(knitr)
library(latex2exp) 
library(mlbench)
library(MASS)
library(purrr)
library(pls)
library(patchwork)
library(RANN) 
library(stats)
library(tsibble)
library(tidyr)

--------------------------------------------------------------------------------

(6.2) Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(permeability)
head(permeability)
##   permeability
## 1       12.520
## 2        1.120
## 3       19.405
## 4        1.730
## 5        1.680
## 6        0.510

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

dim(fingerprints)
## [1]  165 1107
high_freq <- fingerprints[, -nearZeroVar(fingerprints)]
dim(high_freq)
## [1] 165 388

Filtering out the low-frequency predictors reduces the set from 1,107 to 388 predictors.
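nearZeroVar flags predictors whose most common value overwhelmingly dominates (by default, a frequency ratio above 95/5 together with a low percentage of unique values). Those per-predictor metrics can be inspected with saveMetrics = TRUE; a small sketch (nzv_info is just an illustrative name):

# Inspect the metrics behind the near-zero-variance filter
nzv_info <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_info)     # freqRatio, percentUnique, zeroVar, nzv for each predictor
sum(nzv_info$nzv)  # number of predictors flagged as near-zero variance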

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?

set.seed(12345)

# Train Test Split
index <- createDataPartition(permeability, p = .8, list = FALSE)

X_train <- high_freq[index, ]
X_test <- high_freq[-index, ]

y_train <- permeability[index, ]
y_test <- permeability[-index, ]

# Check for missing values
any(is.na(X_train)) | any(is.na(X_test))
## [1] FALSE
dim(X_train)
## [1] 133 388
dim(X_test)
## [1]  32 388
# Fit PLS model
ctrl <- trainControl(method = "cv", number = 10)

plsTune <- train(X_train, 
                 y_train,
                 method = "pls",
                 tuneLength = 20,
                 trControl = ctrl,
                 preProc = c("center", "scale")
                 )

plot(plsTune)

# View component breakdown
plsTune
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 118, 119, 121, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.71638  0.3726546   9.803750
##    2     11.43482  0.4812433   8.287554
##    3     11.56869  0.4775294   8.635061
##    4     11.82234  0.4622131   9.000811
##    5     11.96746  0.4722405   9.004607
##    6     11.80342  0.4763650   8.712800
##    7     11.75711  0.4782825   8.794521
##    8     11.57750  0.4943179   8.741823
##    9     11.42685  0.5092455   8.573922
##   10     11.65230  0.5024878   8.601296
##   11     12.19187  0.4755751   8.866223
##   12     12.24185  0.4766699   8.927797
##   13     12.56914  0.4611196   9.122023
##   14     12.87329  0.4497174   9.246870
##   15     13.26387  0.4322739   9.658887
##   16     13.21115  0.4341617   9.699480
##   17     13.45490  0.4270706   9.834210
##   18     13.65659  0.4172002  10.008847
##   19     13.68960  0.4115089  10.210977
##   20     13.50473  0.4179000  10.122616
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 9.

After tuning the PLS model, the optimal number of latent variables is 9, with a corresponding resampled \(R^2\) estimate of 0.5092455.
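For reference, the selected component count and its resampled metrics can be pulled straight from the train object; a minimal sketch using the plsTune fit above (best_ncomp is just an illustrative name):

# Extract the chosen ncomp and its cross-validated RMSE and R^2
best_ncomp <- plsTune$bestTune$ncomp
subset(plsTune$results, ncomp == best_ncomp, select = c(ncomp, RMSE, Rsquared))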

(d) Predict the response for the test set. What is the test set estimate of \(R^2\)?

pls_pred <- predict(plsTune, X_test)

pls_values <- data.frame(obs = y_test, pred = pls_pred)
defaultSummary(pls_values)
##       RMSE   Rsquared        MAE 
## 12.1949589  0.5063305  9.2982092

The PLS prediction for the test set gives an \(R^2\) of 0.5063305, and an RMSE of 12.1949589.
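Equivalently, caret's postResample() computes the same test-set summary directly from the predicted and observed vectors; a minimal sketch:

# RMSE, Rsquared, and MAE for the PLS test-set predictions
postResample(pred = pls_pred, obs = y_test)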

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

# Other options include Ridge, Lasso, Elastic Net, or PCR (principal component regression) (Pg. 109)

################################################################################
# Ridge
set.seed(12345)

# Grid Search for optimal Lambda to minimize RMSE. 
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))

ridgeRegFit <- train(X_train, 
                     y_train,
                     method = "ridge",
                     tuneGrid = ridgeGrid,
                     trControl = trainControl(method = "cv", number = 10),
                     preProc = c("center", "scale")
                     )

plot(ridgeRegFit)

ridge_pred <- predict(ridgeRegFit, X_test)
ridge_values <- data.frame(obs = y_test, pred = ridge_pred)
defaultSummary(ridge_values)
##       RMSE   Rsquared        MAE 
## 12.1620996  0.5235568  8.1476488
################################################################################
# Elastic Net
set.seed(12345)

# Penalty Grid
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))

enetfit <- train(X_train, 
                 y_train, 
                 method = "enet",
                 tuneGrid = enetGrid, 
                 trControl = ctrl,
                 preProc = c("center", "scale")
                 )

plot(enetfit)

enet_pred <- predict(enetfit, X_test)
enet_values <- data.frame(obs = y_test, pred = enet_pred)
defaultSummary(enet_values)
##       RMSE   Rsquared        MAE 
## 11.2847478  0.5456092  8.7630728
################################################################################
# PCR (principal component regression)
set.seed(12345)

pcafit <- train(X_train,
                y_train,
                method = "pcr",
                preProcess = c("center", "scale"),
                trControl = ctrl,
                tuneLength=20)

plot(pcafit)

pca_pred <- predict(pcafit, X_test)
pca_values <- data.frame(obs = y_test, pred = pca_pred)
defaultSummary(pca_values)
##       RMSE   Rsquared        MAE 
## 12.2442554  0.3829767  8.8667536

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

PLS has \(R^2\) = 0.5063305 and RMSE = 12.1949589. Ridge has \(R^2\) = 0.5235568 and RMSE = 12.1620996. Elastic Net has \(R^2\) = 0.5456092 and RMSE = 11.2847478. PCR has \(R^2\) = 0.3829767 and RMSE = 12.2442554, making it the weakest of the four models on both metrics. I'd recommend the Elastic Net model, since it has both the highest \(R^2\) and the lowest RMSE of the models tried, a better trade-off than either PLS or Ridge regression.
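To make that comparison easier to scan, the four sets of test-set metrics can be stacked into one table; a small sketch, assuming the *_values data frames created above are still in memory:

# Collect test-set metrics for all four models and sort by RMSE
results <- rbind(
  PLS        = defaultSummary(pls_values),
  Ridge      = defaultSummary(ridge_values),
  ElasticNet = defaultSummary(enet_values),
  PCR        = defaultSummary(pca_values)
)
results[order(results[, "RMSE"]), ]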

--------------------------------------------------------------------------------

(6.3) A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately $100,000 per batch:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
head(ChemicalManufacturingProcess)
##   Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00                 6.25                49.58                56.97
## 2 42.44                 8.01                60.97                67.48
## 3 42.03                 8.01                60.97                67.48
## 4 41.42                 8.01                60.97                67.48
## 5 42.49                 7.47                63.33                72.25
## 6 43.57                 6.12                58.36                65.31
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1                12.74                19.51                43.73
## 2                14.65                19.36                53.14
## 3                14.65                19.36                53.14
## 4                14.65                19.36                53.14
## 5                14.02                17.91                54.66
## 6                15.17                21.79                51.23
##   BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1                  100                16.66                11.44
## 2                  100                19.04                12.55
## 3                  100                19.04                12.55
## 4                  100                19.04                12.55
## 5                  100                18.22                12.80
## 6                  100                18.30                12.13
##   BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1                 3.46               138.09                18.83
## 2                 3.46               153.67                21.05
## 3                 3.46               153.67                21.05
## 4                 3.46               153.67                21.05
## 5                 3.05               147.61                21.05
## 6                 3.78               151.88                20.76
##   ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1                     NA                     NA                     NA
## 2                    0.0                      0                     NA
## 3                    0.0                      0                     NA
## 4                    0.0                      0                     NA
## 5                   10.7                      0                     NA
## 6                   12.0                      0                     NA
##   ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1                     NA                     NA                     NA
## 2                    917                 1032.2                  210.0
## 3                    912                 1003.6                  207.1
## 4                    911                 1014.6                  213.3
## 5                    918                 1027.5                  205.7
## 6                    924                 1016.8                  208.9
##   ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1                     NA                     NA                  43.00
## 2                    177                    178                  46.57
## 3                    178                    178                  45.07
## 4                    177                    177                  44.92
## 5                    178                    178                  44.96
## 6                    178                    178                  45.32
##   ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1                     NA                     NA                     NA
## 2                     NA                     NA                      0
## 3                     NA                     NA                      0
## 4                     NA                     NA                      0
## 5                     NA                     NA                      0
## 6                     NA                     NA                      0
##   ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1                   35.5                   4898                   6108
## 2                   34.0                   4869                   6095
## 3                   34.8                   4878                   6087
## 4                   34.8                   4897                   6102
## 5                   34.6                   4992                   6233
## 6                   34.0                   4985                   6222
##   ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1                   4682                   35.5                   4865
## 2                   4617                   34.0                   4867
## 3                   4617                   34.8                   4877
## 4                   4635                   34.8                   4872
## 5                   4733                   33.9                   4886
## 6                   4786                   33.4                   4862
##   ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1                   6049                   4665                    0.0
## 2                   6097                   4621                    0.0
## 3                   6078                   4621                    0.0
## 4                   6073                   4611                    0.0
## 5                   6102                   4659                   -0.7
## 6                   6115                   4696                   -0.6
##   ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1                     NA                     NA                     NA
## 2                      3                      0                      3
## 3                      4                      1                      4
## 4                      5                      2                      5
## 5                      8                      4                     18
## 6                      9                      1                      1
##   ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1                   4873                   6074                   4685
## 2                   4869                   6107                   4630
## 3                   4897                   6116                   4637
## 4                   4892                   6111                   4630
## 5                   4930                   6151                   4684
## 6                   4871                   6128                   4687
##   ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1                   10.7                   21.0                    9.9
## 2                   11.2                   21.4                    9.9
## 3                   11.1                   21.3                    9.4
## 4                   11.1                   21.3                    9.4
## 5                   11.3                   21.6                    9.0
## 6                   11.4                   21.7                   10.1
##   ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1                   69.1                    156                     66
## 2                   68.7                    169                     66
## 3                   69.3                    173                     66
## 4                   69.3                    171                     68
## 5                   69.4                    171                     70
## 6                   68.2                    173                     70
##   ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1                    2.4                    486                  0.019
## 2                    2.6                    508                  0.019
## 3                    2.6                    509                  0.018
## 4                    2.5                    496                  0.018
## 5                    2.5                    468                  0.017
## 6                    2.5                    490                  0.018
##   ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1                    0.5                      3                    7.2
## 2                    2.0                      2                    7.2
## 3                    0.7                      2                    7.2
## 4                    1.2                      2                    7.2
## 5                    0.2                      2                    7.3
## 6                    0.4                      2                    7.2
##   ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1                     NA                     NA                   11.6
## 2                    0.1                   0.15                   11.1
## 3                    0.0                   0.00                   12.0
## 4                    0.0                   0.00                   10.6
## 5                    0.0                   0.00                   11.0
## 6                    0.0                   0.00                   11.5
##   ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1                    3.0                    1.8                    2.4
## 2                    0.9                    1.9                    2.2
## 3                    1.0                    1.8                    2.3
## 4                    1.1                    1.8                    2.1
## 5                    1.1                    1.7                    2.1
## 6                    2.2                    1.8                    2.0

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. Yield contains the percent yield for each run.
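As a quick check on those counts, the predictors can be tallied by name prefix; a small sketch, assuming the BiologicalMaterial/ManufacturingProcess naming shown in the head() output above (pred_names is just an illustrative name):

dim(ChemicalManufacturingProcess)

# Count predictors by group (Yield is the response)
pred_names <- setdiff(names(ChemicalManufacturingProcess), "Yield")
table(ifelse(grepl("^Biological", pred_names), "Biological", "Process"))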

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

sum(is.na(ChemicalManufacturingProcess))
## [1] 106
# Impute missing values with k-nearest neighbors
# (knnImpute in caret also centers and scales the data, so the completed
#  data frame, including Yield, ends up on a standardized scale)
CMP_impute <- ChemicalManufacturingProcess %>%
  preProcess("knnImpute")

CMP_complete <- predict(CMP_impute, ChemicalManufacturingProcess)
# sum(is.na(CMP_complete))

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

dim(CMP_complete)
## [1] 176  58
# Remove near-zero-variance predictors
high_freq <- CMP_complete[, -nearZeroVar(CMP_complete)]
dim(high_freq)
## [1] 176  57
# Train Test Split
set.seed(12345)

index <- createDataPartition(CMP_complete$Yield, p = .8, list = FALSE)

train_predictors <- CMP_complete[index,-1]
train_yield <- CMP_complete[index, 1]

test_predictors <- CMP_complete[-index,-1]
test_yield <- CMP_complete[-index, 1]

################################################################################
# PLS model (fit via the formula interface on the full imputed data, CMP_complete)
set.seed(12345)

ctrl <- trainControl(method = "cv", number = 10)

plsfit <- train(Yield ~ ., 
                CMP_complete , 
                method = "pls", 
                tuneLength = 20, 
                trControl = ctrl, 
                preProc = c("center", "scale")
                )

plot(plsfit)

# component breakdown
plsfit
## Partial Least Squares 
## 
## 176 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 159, 158, 158, 158, 156, 156, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.8070701  0.4590283  0.6363546
##    2     1.0476282  0.4608495  0.6578625
##    3     0.7485621  0.5414792  0.5733654
##    4     0.8707881  0.5056252  0.6101097
##    5     1.0820833  0.4911390  0.6604954
##    6     1.1413661  0.4879997  0.6694010
##    7     1.2866645  0.4596829  0.7163126
##    8     1.3283092  0.4638250  0.7292310
##    9     1.4990963  0.4501553  0.7754994
##   10     1.6064668  0.4409531  0.8065579
##   11     1.7073812  0.4338370  0.8391927
##   12     1.7701905  0.4250131  0.8612871
##   13     1.7937892  0.4231621  0.8687128
##   14     1.8565371  0.4204732  0.8863970
##   15     1.8655700  0.4251664  0.8791829
##   16     1.8786654  0.4252873  0.8806641
##   17     1.8964791  0.4218860  0.8876833
##   18     1.8948848  0.4176976  0.8884454
##   19     1.8905119  0.4184225  0.8882248
##   20     1.8638482  0.4204240  0.8835634
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
# Optimal value for ncomp = 3, RMSE = 0.7485621, R^2 = 0.5414792, MAE = 0.5733654

################################################################################
# Elastic Net
set.seed(12345)

# Penalty Grid
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))

enetfit <- train(Yield ~ .,
                 CMP_complete ,
                 method = "enet", 
                 tuneGrid = enetGrid, 
                 trControl = ctrl, 
                 preProc = c("center", "scale")
                 )

plot(enetfit)

# The final values used for the model were fraction = 0.1 and lambda = 0. RMSE= 0.6270263, R^2 = 0.6105673, MAE = 0.5124015
enetfit
## Elasticnet 
## 
## 176 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 159, 158, 158, 158, 156, 156, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE       Rsquared   MAE      
##   0.00    0.05      0.6737675  0.6025991  0.5478705
##   0.00    0.10      0.6270263  0.6105673  0.5124015
##   0.00    0.15      0.8037959  0.5382868  0.5625304
##   0.00    0.20      0.9830845  0.5022518  0.6177938
##   0.00    0.25      1.1143308  0.4835048  0.6571924
##   0.00    0.30      1.1966371  0.4715096  0.6831595
##   0.00    0.35      1.1842749  0.4735835  0.6868155
##   0.00    0.40      1.0839818  0.4881027  0.6668150
##   0.00    0.45      0.9790663  0.5200490  0.6447380
##   0.00    0.50      0.9584618  0.5315659  0.6392019
##   0.00    0.55      0.9976951  0.5017038  0.6633437
##   0.00    0.60      1.0494990  0.4737168  0.6841317
##   0.00    0.65      1.0682234  0.4592798  0.6946494
##   0.00    0.70      1.0885606  0.4439481  0.7046945
##   0.00    0.75      1.1289802  0.4339416  0.7187261
##   0.00    0.80      1.1419888  0.4201253  0.7250237
##   0.00    0.85      1.3398572  0.4000811  0.7758828
##   0.00    0.90      1.5275164  0.4004955  0.8217257
##   0.00    0.95      1.7033632  0.4063405  0.8624566
##   0.00    1.00      1.8963090  0.4138529  0.9037371
##   0.01    0.05      0.8271767  0.5825466  0.6716286
##   0.01    0.10      0.7000189  0.6047191  0.5707346
##   0.01    0.15      0.6487755  0.5976067  0.5323076
##   0.01    0.20      0.6524335  0.5845118  0.5228021
##   0.01    0.25      0.6562770  0.5877414  0.5159059
##   0.01    0.30      0.6951182  0.5762913  0.5229829
##   0.01    0.35      0.7410304  0.5663286  0.5363403
##   0.01    0.40      0.8069713  0.5430867  0.5651266
##   0.01    0.45      0.9088646  0.5072632  0.5978919
##   0.01    0.50      1.0310360  0.4898899  0.6295445
##   0.01    0.55      1.1239325  0.4809325  0.6551173
##   0.01    0.60      1.2093383  0.4726674  0.6812461
##   0.01    0.65      1.3044945  0.4639658  0.7099902
##   0.01    0.70      1.3853729  0.4578050  0.7331399
##   0.01    0.75      1.4330752  0.4535866  0.7478460
##   0.01    0.80      1.4654250  0.4503605  0.7595614
##   0.01    0.85      1.4996206  0.4474169  0.7712193
##   0.01    0.90      1.5258061  0.4446021  0.7804890
##   0.01    0.95      1.5441125  0.4421963  0.7877313
##   0.01    1.00      1.5617891  0.4402367  0.7945131
##   0.10    0.05      0.8917987  0.5241858  0.7238055
##   0.10    0.10      0.8018328  0.5936881  0.6528618
##   0.10    0.15      0.7279522  0.6012604  0.5940311
##   0.10    0.20      0.6730841  0.6071303  0.5525083
##   0.10    0.25      0.6510524  0.5960564  0.5356208
##   0.10    0.30      0.6505919  0.5868922  0.5240445
##   0.10    0.35      0.6497864  0.5885496  0.5192295
##   0.10    0.40      0.6737667  0.5794641  0.5247874
##   0.10    0.45      0.7244950  0.5708264  0.5363648
##   0.10    0.50      0.7688899  0.5669224  0.5467472
##   0.10    0.55      0.8180306  0.5636699  0.5592141
##   0.10    0.60      0.8817797  0.5467213  0.5852615
##   0.10    0.65      0.9521850  0.5262343  0.6083696
##   0.10    0.70      1.0255037  0.5098619  0.6314284
##   0.10    0.75      1.0834823  0.4987084  0.6504042
##   0.10    0.80      1.1259987  0.4909686  0.6639728
##   0.10    0.85      1.1618713  0.4855072  0.6756364
##   0.10    0.90      1.1901173  0.4819024  0.6848276
##   0.10    0.95      1.2102977  0.4786543  0.6918864
##   0.10    1.00      1.2256645  0.4758650  0.6973098
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.1 and lambda = 0.

Elastic Net performs better than PLS on this chemical manufacturing data. The final values used for the model were fraction = 0.1 and lambda = 0, with RMSE = 0.6270263, \(R^2\) = 0.6105673, and MAE = 0.5124015.
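Because the same seed and trainControl were used for both fits, their cross-validated resamples should line up and can be compared side by side with caret's resamples(); a brief sketch, assuming both fits generated the same 10 folds (cv_compare is just an illustrative name):

# Compare resampled CV performance of the PLS and Elastic Net fits
cv_compare <- resamples(list(PLS = plsfit, ENET = enetfit))
summary(cv_compare)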

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

enet_pred <- predict(enetfit, test_predictors)
enet_values <- data.frame(obs = test_yield, pred = enet_pred)
defaultSummary(enet_values)
##      RMSE  Rsquared       MAE 
## 0.5254707 0.7166146 0.4108514

The Elastic Net test-set metrics are RMSE = 0.5254707, \(R^2\) = 0.7166146, and MAE = 0.4108514. These are better than the resampled training estimates (RMSE = 0.6270263, \(R^2\) = 0.6105673), so on the test set the model shows lower error and explains more of the variance in chemical yield than it did during resampling.

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

varImp(enetfit)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   90.02
## BiologicalMaterial06     84.56
## ManufacturingProcess17   74.88
## ManufacturingProcess36   74.68
## BiologicalMaterial03     73.53
## ManufacturingProcess09   70.37
## BiologicalMaterial12     67.97
## BiologicalMaterial02     65.32
## ManufacturingProcess31   63.93
## ManufacturingProcess06   58.00
## ManufacturingProcess33   48.99
## BiologicalMaterial11     48.11
## ManufacturingProcess11   47.77
## BiologicalMaterial04     47.12
## BiologicalMaterial08     41.87
## BiologicalMaterial01     39.13
## ManufacturingProcess30   34.83
## ManufacturingProcess12   33.32
## BiologicalMaterial09     32.41

Manufacturing process predictors dominate the list: of the top 20 predictors, 11 are manufacturing process predictors and 9 are biological material predictors, with summed importances of roughly 697 and 500, respectively. The three most important predictors are ManufacturingProcess32, ManufacturingProcess13, and BiologicalMaterial06.
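The group counts and totals quoted above can be tallied programmatically from the importance scores; a short sketch in base R (which preserves the predictor row names; imp, top20_imp, and group are just illustrative names):

# Tally the top-20 importances by predictor group
imp <- varImp(enetfit)$importance
top20_imp <- imp[order(-imp$Overall), , drop = FALSE][1:20, , drop = FALSE]
group <- ifelse(grepl("^Manufacturing", rownames(top20_imp)), "Process", "Biological")
table(group)                           # how many process vs. biological predictors
tapply(top20_imp$Overall, group, sum)  # summed importance by group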

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

# dplyr verbs may drop data frame row names, so keep the predictor
# names in an explicit column before sorting
top20 <- varImp(enetfit)$importance %>%
  tibble::rownames_to_column("Predictor") %>%
  arrange(desc(Overall)) %>%
  head(20)

# Extract the predictor names of top20 as a character vector
top20_names <- top20$Predictor

# Create a character vector of columns to select: "Yield" and the top20 predictors
cols_to_select <- c("Yield", top20_names)

# Select the columns and calculate correlations
cor_matrix <- CMP_complete[ , cols_to_select] %>%
  cor()

# Visualize the correlation of "Yield" with the top 20 predictors
library(ggplot2)
# Only get correlations with Yield
cor_data <- as.data.frame(cor_matrix["Yield", -1])  

# Rename columns 
colnames(cor_data) <- "Correlation with Yield"

# Create a bar plot of the correlations
ggplot(cor_data, aes(x = rownames(cor_data), y = `Correlation with Yield`)) +
  geom_bar(stat = "identity") +
  coord_flip() + 
  labs(title = "Correlation of Top 20 Predictors with Yield",
       x = "Predictor",
       y = "Correlation Coefficient") +
  theme_minimal()

# Top 5
ggplot(CMP_complete, aes(x = ManufacturingProcess32, y = Yield)) + geom_point()

ggplot(CMP_complete, aes(x = ManufacturingProcess13, y = Yield)) + geom_point()

ggplot(CMP_complete, aes(x = BiologicalMaterial06, y = Yield)) + geom_point()

ggplot(CMP_complete, aes(x = ManufacturingProcess17, y = Yield)) + geom_point()

ggplot(CMP_complete, aes(x = ManufacturingProcess36, y = Yield)) + geom_point()

Knowing which biological and manufacturing process predictors have the greatest impact on yield, whether positive or negative, lets us focus on those specific processes in future runs. Since the manufacturing process predictors can actually be adjusted, concentrating on the most influential of them offers a direct route to increasing yield in future chemical manufacturing runs.
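One way to act on this is to look at the sign and size of each top predictor's correlation with yield, so future runs can push the adjustable process settings in the favorable direction; a minimal sketch reusing cor_matrix from above:

# Signed correlations of the top-20 predictors with Yield, strongest first
yield_cor <- sort(cor_matrix["Yield", -1], decreasing = TRUE)
round(yield_cor, 2)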