In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.
Exercise 6.2 Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
[1] 165 1107
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
permeability
1 12.520
2 1.120
3 19.405
4 1.730
5 1.680
6 0.510
6.2 (b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
# low frequency instances
low_frequency <- nearZeroVar(fingerprints)
#remove low frequency columns
predictors <- fingerprints[,-low_frequency]
#388 predictors remaining
dim(predictors)
[1] 165 388
Applying the nearZeroVar function and filtering out the low-frequency predictors, we are left with 388 of the original 1,107.
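To make the filtering rule concrete, here is a minimal base-R sketch of the kind of check nearZeroVar performs. It assumes the documented caret defaults (freqCut = 95/5, uniqueCut = 10); the helper name flag_sparse and the toy matrix are hypothetical and are not part of caret or of the permeability data.

```r
# Hypothetical helper (not caret code): flags a column when the most common value
# is more than freq_cut times as frequent as the second most common value AND the
# share of distinct values is below unique_cut percent. Constant columns are flagged too.
flag_sparse <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  flagged <- vapply(seq_len(ncol(x)), function(j) {
    counts <- sort(table(x[, j]), decreasing = TRUE)
    if (length(counts) < 2) return(TRUE)             # constant column
    freq_ratio <- counts[[1]] / counts[[2]]          # most common / second most common
    pct_unique <- 100 * length(counts) / nrow(x)     # distinct values as % of rows
    freq_ratio > freq_cut && pct_unique < unique_cut
  }, logical(1))
  which(flagged)
}

# Toy binary "fingerprints" for 165 made-up compounds
toy <- cbind(
  common   = rep(c(0, 1), length.out = 165),   # roughly balanced, kept
  rare     = c(rep(1, 3), rep(0, 162)),        # 162 / 3 = 54 > 19, flagged
  constant = rep(0, 165)                       # no variation, flagged
)
flagged  <- flag_sparse(toy)
toy_kept <- toy[, -flagged, drop = FALSE]
dim(toy_kept)
```

One design caveat for negative indexing of this kind: if nothing is flagged, the index is empty and `x[, -integer(0)]` selects zero columns, so it is safer to guard with `if (length(flagged) > 0)`.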
6.2 (c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
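As a hedged alternative to building a vector of 0/1 labels, a 70/30 split can be sketched by sampling row indices directly, which guarantees that every row lands in exactly one partition. The variable names train_idx and test_idx are illustrative assumptions, and a split made this way would not reproduce the results printed in this write-up.

```r
set.seed(1234)
n <- 165                                         # number of compounds, per dim(fingerprints)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

# Sizes of the two partitions
c(train = length(train_idx), test = length(test_idx))
```

The index vectors would then be used as `predictors[train_idx, ]` and `permeability[train_idx]` for training, and likewise with `test_idx` for the hold-out set.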
set.seed(1234)
#70 30 split
split1<- sample(c(rep(0, 0.7 * nrow(permeability)),
rep(1, 0.3 * nrow(permeability))))
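#rep() truncates 115.5 and 49.5, so split1 has only 164 labels (115 zeros, 49 ones) for 165 rows;
#R recycles the logical index, so row 165 reuses split1[1]: training = 116 observations (see output), test = 49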
#table(split1)
X_train <- predictors[split1 == 0,]
X_test <- predictors[split1 == 1,]
y_train <- permeability[split1 == 0]
y_test <- permeability[split1 == 1]
#PLS model
plsTune <- train(X_train, y_train,
method='pls', metric='Rsquared',
tuneLength=20,
trControl=trainControl(method='cv'),
preProc=c('center', 'scale')
)
plsTune
Partial Least Squares
116 samples
388 predictors
Pre-processing: centered (388), scaled (388)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 104, 104, 104, 104, 105, 104, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 13.38840 0.2959225 10.431093
2 12.35381 0.4399532 9.050859
3 12.13249 0.4668407 9.365398
4 12.02310 0.4763183 9.268427
5 12.02263 0.4676213 9.055663
6 12.16466 0.4625271 9.240923
7 12.20446 0.4574523 9.328355
8 12.41799 0.4539250 9.517162
9 12.83428 0.4214665 9.562445
10 12.76405 0.4373230 9.544015
11 12.99256 0.4309400 9.732026
12 13.07068 0.4398532 9.804314
13 13.12504 0.4281296 10.001348
14 13.24639 0.4202037 10.165836
15 13.32535 0.4242194 10.289753
16 13.31609 0.4364889 10.287679
17 13.70001 0.4190135 10.574290
18 13.70872 0.4190474 10.538441
19 13.89268 0.4243703 10.552879
20 14.11390 0.4176743 10.731040
Rsquared was used to select the optimal model using the largest value.
The final value used for the model was ncomp = 4.
ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 4 12.0231 0.4763183 9.268427 4.082438 0.1981892 3.040907
The best tune was found at ncomp = 4 with an R2 value of 0.4763183.
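The choice of ncomp depends on which metric drives the selection. The sketch below copies the first six rows of the resampling table above into a data frame (the name pls_results is an assumption) and picks the best row under each metric.

```r
# Resampling results for ncomp = 1..6, copied from the table above
pls_results <- data.frame(
  ncomp    = 1:6,
  RMSE     = c(13.38840, 12.35381, 12.13249, 12.02310, 12.02263, 12.16466),
  Rsquared = c(0.2959225, 0.4399532, 0.4668407, 0.4763183, 0.4676213, 0.4625271)
)

best_by_r2   <- pls_results$ncomp[which.max(pls_results$Rsquared)]
best_by_rmse <- pls_results$ncomp[which.min(pls_results$RMSE)]
c(by_Rsquared = best_by_r2, by_RMSE = best_by_rmse)
```

With metric = 'Rsquared' the selection is ncomp = 4, whereas the smallest RMSE in the table sits at ncomp = 5 by a very small margin, so the two criteria nearly agree here.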
6.2 (d) Predict the response for the test set. What is the test set estimate of R2?
#generate prediction using model and testing data
plsPred <- predict(plsTune, newdata=X_test)
#evaluation metrics
postResample(pred=plsPred, obs=y_test)
RMSE Rsquared MAE
12.7796708 0.4216779 9.6126076
The predictions on the test set yield an R2 of 0.4216779, which is lower than the cross-validated R2 of 0.4763183 obtained during tuning.
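For reference, the three numbers reported by postResample can be sketched in base R. This assumes, as I understand caret's default behaviour, that Rsquared is the squared Pearson correlation between predictions and observations; the helper name regression_metrics and the toy vectors are hypothetical.

```r
# Hypothetical helper mirroring the three metrics reported above
regression_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),   # root mean squared error
    Rsquared = cor(pred, obs)^2,             # squared Pearson correlation
    MAE      = mean(abs(pred - obs)))        # mean absolute error
}

# Toy values, not taken from the permeability data
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 1.5, 3.5, 3.5)
metrics <- regression_metrics(pred, obs)
metrics
```

Because this R2 is a squared correlation, it measures how well predictions track the observations in rank and scale-free terms, and it can stay fairly high even when RMSE is large.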
6.2 (e) Try building other models discussed in this chapter. Do any have better predictive performance?
We will try building a ridge regression and an elastic net model, both of which penalize the size of the regression coefficients to reduce variance and, ideally, RMSE.
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
set.seed(100)
ridgeRegFit <- train(X_train, y_train,
method = "ridge",
## Fit the model over many penalty values
tuneGrid = ridgeGrid,
trControl = trainControl(method = "cv", number = 10),
## put the predictors on the same scale
preProc = c("center", "scale"))
ridgeRegFit
Ridge Regression
116 samples
388 predictors
Pre-processing: centered (388), scaled (388)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 104, 104, 104, 104, 105, 105, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.000000000 14.55588 0.3173449 10.417979
0.007142857 631.30131 0.2569984 469.683387
0.014285714 570.22386 0.3601733 405.938914
0.021428571 14.57913 0.3964656 10.210058
0.028571429 124.48933 0.3747589 95.703556
0.035714286 13.88572 0.4200497 9.889313
0.042857143 13.79456 0.4239522 9.836511
0.050000000 13.65230 0.4282102 9.746425
0.057142857 13.57523 0.4310020 9.704939
0.064285714 13.59675 0.4321774 9.761754
0.071428571 13.45120 0.4361686 9.649546
0.078571429 13.40835 0.4378574 9.634705
0.085714286 13.37705 0.4398602 9.616871
0.092857143 13.34792 0.4402707 9.617400
0.100000000 13.69949 0.4373412 10.026444
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was lambda = 0.09285714.
The ridge regression selected a penalty of lambda = 0.09285714, which yielded a resampled RMSE of 13.34792 and an R2 of 0.4402707.
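The effect of the ridge penalty can be illustrated with the textbook closed form, beta = (X'X + lambda I)^{-1} X'y, on toy data. This is only a sketch under stated assumptions: the helper ridge_coef is hypothetical, the data are simulated, and the lambda scale used by the "ridge" method in caret may differ from this parameterization.

```r
# Closed-form ridge on centered and scaled toy data
ridge_coef <- function(X, y, lambda) {
  X <- scale(X)                      # center and scale predictors, as in the preProc step
  y <- y - mean(y)                   # center the response
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

set.seed(100)
X <- matrix(rnorm(40 * 5), nrow = 40)
X[, 2] <- X[, 1] + rnorm(40, sd = 0.05)   # two nearly collinear columns
y <- X[, 1] + rnorm(40)

# Length of the coefficient vector for increasing penalties
norms <- sapply(c(0, 1, 10), function(l) sqrt(sum(ridge_coef(X, y, l)^2)))
norms
```

The coefficient vector shrinks as lambda grows, which is how the penalty stabilizes fits with correlated predictors. The erratic RMSE values at small lambda in the table above (for example 631.30131 at lambda = 0.007142857) are consistent with an ill-conditioned fit when predictors outnumber samples, although that explanation is an assumption rather than something the output confirms.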
set.seed(122)
enetTune <- train(X_train, y_train,
method = "enet",
tuneGrid = enetGrid,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale"))
enetTune
Elasticnet
116 samples
388 predictors
Pre-processing: centered (388), scaled (388)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 105, 104, 104, 104, 105, 104, ...
Resampling results across tuning parameters:
lambda fraction RMSE Rsquared MAE
0.00 0.05 12.91033 0.5116222 9.632147
0.00 0.10 12.58828 0.4870914 9.188126
0.00 0.15 12.25991 0.5098889 8.923334
0.00 0.20 12.18518 0.5152225 9.012106
0.00 0.25 12.15724 0.5196486 9.074315
0.00 0.30 12.27674 0.5182757 9.114047
0.00 0.35 12.50294 0.5061502 9.178764
0.00 0.40 12.79939 0.4938982 9.311635
0.00 0.45 13.02000 0.4851729 9.439620
0.00 0.50 13.19886 0.4821319 9.541227
0.00 0.55 13.37451 0.4813503 9.656710
0.00 0.60 13.59166 0.4761981 9.792077
0.00 0.65 13.90464 0.4634218 10.038511
0.00 0.70 14.23016 0.4512323 10.261893
0.00 0.75 14.54342 0.4391963 10.478922
0.00 0.80 14.81480 0.4307129 10.727167
0.00 0.85 15.09997 0.4224009 10.999141
0.00 0.90 15.37839 0.4141348 11.298986
0.00 0.95 15.66220 0.4048057 11.583570
0.00 1.00 15.90872 0.3963515 11.820231
0.01 0.05 20.88627 0.3794407 14.467340
0.01 0.10 28.80878 0.4394761 19.914592
0.01 0.15 36.89494 0.4620998 25.304035
0.01 0.20 45.63375 0.4606999 31.947842
0.01 0.25 54.45547 0.4626485 39.038946
0.01 0.30 63.94273 0.4552503 46.611731
0.01 0.35 72.70676 0.4440229 53.276174
0.01 0.40 81.25977 0.4375340 59.315006
0.01 0.45 89.90216 0.4373184 65.757286
0.01 0.50 97.64734 0.4426080 71.898104
0.01 0.55 105.19338 0.4487837 77.884536
0.01 0.60 112.70963 0.4531425 83.846840
0.01 0.65 120.24349 0.4539379 89.820891
0.01 0.70 127.81762 0.4504208 95.936445
0.01 0.75 135.35266 0.4439039 101.983263
0.01 0.80 142.81631 0.4392931 108.034378
0.01 0.85 150.25850 0.4352925 114.028172
0.01 0.90 157.74289 0.4327008 120.091048
0.01 0.95 165.20865 0.4310425 126.161160
0.01 1.00 172.56353 0.4289499 132.061754
0.10 0.05 12.67864 0.4310056 9.479508
0.10 0.10 12.59539 0.4364951 9.010060
0.10 0.15 12.43198 0.4650376 8.906089
0.10 0.20 12.12494 0.4892180 8.878067
0.10 0.25 12.03487 0.4966274 8.829683
0.10 0.30 12.03545 0.4993618 8.801046
0.10 0.35 12.07555 0.5003554 8.866265
0.10 0.40 12.14208 0.5020144 8.939771
0.10 0.45 12.24923 0.5004362 9.034995
0.10 0.50 12.39010 0.4974425 9.167412
0.10 0.55 12.54927 0.4932563 9.310846
0.10 0.60 12.71382 0.4894136 9.432550
0.10 0.65 12.83567 0.4871285 9.539284
0.10 0.70 12.92938 0.4861329 9.642519
0.10 0.75 12.99312 0.4872212 9.749181
0.10 0.80 13.03259 0.4893210 9.825620
0.10 0.85 13.05166 0.4918069 9.875434
0.10 0.90 13.06176 0.4940823 9.905915
0.10 0.95 13.06979 0.4960422 9.928735
0.10 1.00 13.09991 0.4964066 9.961618
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.25 and lambda = 0.1.
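The elastic net used an optimal penalty of lambda = 0.1 and fraction = 0.25, which yielded an RMSE of 12.03487 and an R2 of 0.4966274.
On the resampled estimates, the elastic net has the highest R2 of the three models (0.497, versus 0.476 for PLS and 0.440 for ridge), while its RMSE (12.03) is essentially tied with PLS (12.02) and clearly better than ridge (13.35). Because each model was tuned with a different seed and therefore different folds, the small gap between the elastic net and PLS should be read with caution.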
6.2 (f) Would you recommend any of your models to replace the permeability laboratory experiment?
RMSE Rsquared MAE
10.6736992 0.5748519 7.9853073
With an R2 of about 0.57 for the predictions on our test set from our best model, I don’t feel confident that we could replace the laboratory experiments with any of these models.
6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
library(AppliedPredictiveModeling)
# data(chemicalManufacturing)
data(ChemicalManufacturingProcess)
# View basic information
str(ChemicalManufacturingProcess)
'data.frame': 176 obs. of 58 variables:
$ Yield : num 38 42.4 42 41.4 42.5 ...
$ BiologicalMaterial01 : num 6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
$ BiologicalMaterial02 : num 49.6 61 61 61 63.3 ...
$ BiologicalMaterial03 : num 57 67.5 67.5 67.5 72.2 ...
$ BiologicalMaterial04 : num 12.7 14.6 14.6 14.6 14 ...
$ BiologicalMaterial05 : num 19.5 19.4 19.4 19.4 17.9 ...
$ BiologicalMaterial06 : num 43.7 53.1 53.1 53.1 54.7 ...
$ BiologicalMaterial07 : num 100 100 100 100 100 100 100 100 100 100 ...
$ BiologicalMaterial08 : num 16.7 19 19 19 18.2 ...
$ BiologicalMaterial09 : num 11.4 12.6 12.6 12.6 12.8 ...
$ BiologicalMaterial10 : num 3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
$ BiologicalMaterial11 : num 138 154 154 154 148 ...
$ BiologicalMaterial12 : num 18.8 21.1 21.1 21.1 21.1 ...
$ ManufacturingProcess01: num NA 0 0 0 10.7 12 11.5 12 12 12 ...
$ ManufacturingProcess02: num NA 0 0 0 0 0 0 0 0 0 ...
$ ManufacturingProcess03: num NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
$ ManufacturingProcess04: num NA 917 912 911 918 924 933 929 928 938 ...
$ ManufacturingProcess05: num NA 1032 1004 1015 1028 ...
$ ManufacturingProcess06: num NA 210 207 213 206 ...
$ ManufacturingProcess07: num NA 177 178 177 178 178 177 178 177 177 ...
$ ManufacturingProcess08: num NA 178 178 177 178 178 178 178 177 177 ...
$ ManufacturingProcess09: num 43 46.6 45.1 44.9 45 ...
$ ManufacturingProcess10: num NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
$ ManufacturingProcess11: num NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
$ ManufacturingProcess12: num NA 0 0 0 0 0 0 0 0 0 ...
$ ManufacturingProcess13: num 35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
$ ManufacturingProcess14: num 4898 4869 4878 4897 4992 ...
$ ManufacturingProcess15: num 6108 6095 6087 6102 6233 ...
$ ManufacturingProcess16: num 4682 4617 4617 4635 4733 ...
$ ManufacturingProcess17: num 35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
$ ManufacturingProcess18: num 4865 4867 4877 4872 4886 ...
$ ManufacturingProcess19: num 6049 6097 6078 6073 6102 ...
$ ManufacturingProcess20: num 4665 4621 4621 4611 4659 ...
$ ManufacturingProcess21: num 0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
$ ManufacturingProcess22: num NA 3 4 5 8 9 1 2 3 4 ...
$ ManufacturingProcess23: num NA 0 1 2 4 1 1 2 3 1 ...
$ ManufacturingProcess24: num NA 3 4 5 18 1 1 2 3 4 ...
$ ManufacturingProcess25: num 4873 4869 4897 4892 4930 ...
$ ManufacturingProcess26: num 6074 6107 6116 6111 6151 ...
$ ManufacturingProcess27: num 4685 4630 4637 4630 4684 ...
$ ManufacturingProcess28: num 10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
$ ManufacturingProcess29: num 21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
$ ManufacturingProcess30: num 9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
$ ManufacturingProcess31: num 69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
$ ManufacturingProcess32: num 156 169 173 171 171 173 159 161 160 164 ...
$ ManufacturingProcess33: num 66 66 66 68 70 70 65 65 65 66 ...
$ ManufacturingProcess34: num 2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
$ ManufacturingProcess35: num 486 508 509 496 468 490 475 478 491 488 ...
$ ManufacturingProcess36: num 0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
$ ManufacturingProcess37: num 0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
$ ManufacturingProcess38: num 3 2 2 2 2 2 2 2 3 3 ...
$ ManufacturingProcess39: num 7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
$ ManufacturingProcess40: num NA 0.1 0 0 0 0 0 0 0 0 ...
$ ManufacturingProcess41: num NA 0.15 0 0 0 0 0 0 0 0 ...
$ ManufacturingProcess42: num 11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
$ ManufacturingProcess43: num 3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
$ ManufacturingProcess44: num 1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
$ ManufacturingProcess45: num 2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
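# Separate predictors and response
# Caution: Yield is the FIRST column (see the str() output above), so dropping the last column
# removes ManufacturingProcess45 and leaves Yield among the 57 "predictors".
# Selecting by name, e.g. names(ChemicalManufacturingProcess) != "Yield", would exclude the response instead.
predictors <- ChemicalManufacturingProcess[, -ncol(ChemicalManufacturingProcess)]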
response <- ChemicalManufacturingProcess$Yield
After this step the environment holds ChemicalManufacturingProcess (176 obs. of 58 variables) and predictors (176 obs. of 57 variables).
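Because the str() output lists Yield as the first of the 58 columns, it is worth confirming which column a positional drop actually removes. The following is a minimal sketch on a toy data frame with the same column order; the three columns and their values are taken from the first rows shown by str(), and the names by_position and by_name are illustrative.

```r
# Toy data frame with the same layout as the source: response first, predictors after
toy <- data.frame(
  Yield                  = c(38, 42.4, 42),
  BiologicalMaterial01   = c(6.25, 8.01, 8.01),
  ManufacturingProcess45 = c(2.4, 2.2, 2.3)
)

by_position <- toy[, -ncol(toy)]                         # drops the LAST column
by_name     <- toy[, names(toy) != "Yield", drop = FALSE] # drops the response by name

"Yield" %in% names(by_position)   # TRUE: the response is still among the predictors
"Yield" %in% names(by_name)       # FALSE
```

Selecting by name is robust to column order, and a one-line check such as `stopifnot(!"Yield" %in% names(predictors))` before modeling guards against the response leaking into the predictor set.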
6.3 (b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
# Load caret for preprocessing and imputation
library(caret)
# Use the preProcess function for median imputation
preprocess <- preProcess(predictors, method = "medianImpute")
imputed_predictors <- predict(preprocess, predictors)
# Check if missing values are handled
sum(is.na(imputed_predictors)) # Should return 0
[1] 0
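We used caret's preProcess function with method = "medianImpute", which replaces each missing value with the median of its column; the check above confirms that no missing values remain.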
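What median imputation does can be sketched in a few lines of base R. The helper median_impute is hypothetical (it is not the caret implementation), and the toy values are the first few entries that str() displays for two of the process columns.

```r
# Hypothetical helper: replace each NA with the median of its own column
median_impute <- function(df) {
  df[] <- lapply(df, function(col) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
    col
  })
  df
}

toy <- data.frame(
  ManufacturingProcess01 = c(NA, 0, 0, 10.7, 12),
  ManufacturingProcess03 = c(NA, NA, 1.56, 1.55, 1.56)
)
imputed <- median_impute(toy)
sum(is.na(imputed))
```

One design choice to keep in mind: computing the medians on the training rows only, and then applying them to the test rows, avoids letting test-set values influence the preprocessing.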
6.3 (c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
# Split the data into training (80%) and testing (20%) sets
set.seed(123) # For reproducibility
train_index <- createDataPartition(response, p = 0.8, list = FALSE)
# Training and testing sets
train_predictors <- imputed_predictors[train_index, ]
test_predictors <- imputed_predictors[-train_index, ]
train_response <- response[train_index]
test_response <- response[-train_index]
# Train a Random Forest model with cross-validation
rf_model <- train(
x = train_predictors,
y = train_response,
method = "rf",
tuneLength = 5, # Tune over 5 values of mtry
trControl = trainControl(method = "cv", number = 10) # 10-fold cross-validation
)
# Print the model details
rf_model
Random Forest
144 samples
57 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 131, 130, 130, 129, 131, 129, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 1.1088034 0.7478813 0.8695597
15 0.6229766 0.9287201 0.4366745
29 0.3853600 0.9703916 0.2339461
43 0.2701465 0.9826543 0.1406445
57 0.2378435 0.9873862 0.1102605
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 57.
The Random Forest model was tuned across five values of mtry (the number of predictors randomly sampled at each split). Based on the results, the following metrics were obtained for the optimal model:
Optimal Tuning Parameter (mtry): The optimal value is 57, which gave the lowest resampled RMSE. Performance Metrics: RMSE: 0.2378, R2: 0.9874, MAE: 0.1103. Since RMSE was the selection metric, the optimal value of the performance metric is an RMSE of 0.2378, with a corresponding R2 of 0.9874. These figures should not be read as genuine predictive accuracy, however: the predictor set was built by dropping the last column rather than the first, so Yield itself is one of the 57 predictors, and with mtry = 57 every split can use it. The near-perfect fit largely reflects this leakage of the response.
6.3 (d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
# Predict yield for the test set
test_predictions <- predict(rf_model, newdata = test_predictors)
# Calculate RMSE for the test set
test_rmse <- sqrt(mean((test_predictions - test_response)^2))
# Print test set RMSE
test_rmse
[1] 0.1392793
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 2 1.1088034 0.7478813 0.8695597 0.2683554 0.12966646 0.17441405
2 15 0.6229766 0.9287201 0.4366745 0.2627290 0.03881346 0.14306262
3 29 0.3853600 0.9703916 0.2339461 0.2492077 0.02488059 0.10467149
4 43 0.2701465 0.9826543 0.1406445 0.2443090 0.02298169 0.08468996
5 57 0.2378435 0.9873862 0.1102605 0.2116371 0.01817200 0.06253861
The performance of the Random Forest model on the test set was evaluated using the Root Mean Squared Error (RMSE).
Test Set Performance Metric: The test set RMSE is 0.1393.
Comparison with Resampled Performance Metric on the Training Set: From the cross-validated training phase, the lowest RMSE for the model (with mtry = 57) was 0.2378, with a standard deviation of 0.2116 across folds. The test set RMSE (0.1393) is lower than the resampled RMSE, but the gap is less than half of that standard deviation, so the two estimates are consistent with each other rather than evidence that the model does better on new data.
Both values are optimistic for the same reason: Yield is still among the predictors, so the model is partly predicting the response from itself, and neither number shows how well yield can be predicted from the biological and process measurements alone.
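The comparison between the resampled and test RMSE can be put on the scale of the fold-to-fold variation. The sketch below only re-uses numbers printed above (the mtry = 57 row of the results table and the test RMSE); the variable names are illustrative.

```r
cv_rmse    <- 0.2378435   # resampled RMSE at mtry = 57
cv_rmse_sd <- 0.2116371   # RMSESD across the 10 folds
test_rmse  <- 0.1392793   # RMSE on the held-out test set

# Gap between the two estimates, in units of the resampling standard deviation
gap_in_sd <- (cv_rmse - test_rmse) / cv_rmse_sd
gap_in_sd
```

A gap of well under one standard deviation suggests the two estimates agree to within resampling noise, which is a more cautious reading than treating the lower test RMSE as a real improvement.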
6.3 (e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
importance <- varImp(rf_model, scale = TRUE)
# Extract variable importance as a data frame
importance_df <- as.data.frame(importance$importance)
# Add row names as a column for predictors
importance_df <- cbind(Predictor = rownames(importance_df), importance_df)
# Exclude 'Yield' from the importance rankings
filtered_importance <- importance_df[importance_df$Predictor != "Yield", ]
# Order by importance
filtered_importance <- filtered_importance[order(filtered_importance$Overall, decreasing = TRUE), ]
# Select the top 10 predictors
top_predictors_to_plot <- head(filtered_importance, 10)
# Plot using ggplot2
library(ggplot2)
ggplot(data = top_predictors_to_plot, aes(x = reorder(Predictor, -Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Top Predictors by Importance (Excluding Yield)",
x = "Predictor", y = "Importance") +
theme_minimal()
From the variable importance rankings, the Random Forest model identified the following top predictors for determining product yield:
Most Important Predictors: The top 10 list, ranked by importance with Yield excluded, mixes biological and process predictors; nine of them are named here. Biological Predictors: BiologicalMaterial02 (highest-ranked predictor overall), BiologicalMaterial11, BiologicalMaterial03, and BiologicalMaterial12. Process Predictors: ManufacturingProcess13, ManufacturingProcess17, ManufacturingProcess04, ManufacturingProcess06, and ManufacturingProcess09.
Dominance of Predictors: Neither group clearly dominates. Process predictors are slightly more numerous among those named (five versus four biological), while the highest-ranked predictor overall, BiologicalMaterial02, is biological.
Conclusion: Biological predictors are critical for ensuring raw material quality, which directly impacts yield. Process predictors highlight opportunities to refine and optimize key manufacturing steps, further improving yield.
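The split between the two predictor families can be tallied directly from the names. This small sketch uses the nine predictors named in the discussion above; the vector name named_top is an assumption.

```r
# The nine predictors named in the importance discussion
named_top <- c(
  "BiologicalMaterial02", "BiologicalMaterial11", "BiologicalMaterial03",
  "BiologicalMaterial12", "ManufacturingProcess13", "ManufacturingProcess17",
  "ManufacturingProcess04", "ManufacturingProcess06", "ManufacturingProcess09"
)

# Classify each name by its prefix and count the two groups
group  <- ifelse(grepl("^Biological", named_top), "Biological", "Process")
counts <- table(group)
counts
```

The same prefix test could be applied to the Predictor column of the importance data frame to tally any top-n cut-off.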
6.3 (f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
suppressMessages({
suppressWarnings({
# Load necessary libraries
library(ggplot2)
library(gridExtra)
# Define the top predictors to plot
top_predictors_to_plot <- c(
"BiologicalMaterial02", "BiologicalMaterial03", "BiologicalMaterial11",
"ManufacturingProcess13", "ManufacturingProcess17", "ManufacturingProcess04",
"BiologicalMaterial12", "ManufacturingProcess06", "ManufacturingProcess09"
)
# Create an empty list to store plots
plots <- list()
# Iterate over the top predictors and create individual plots
for (predictor in top_predictors_to_plot) {
p <- ggplot(data = ChemicalManufacturingProcess,
aes_string(x = predictor, y = "Yield")) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "yellow") +
ggtitle(paste("Relationship between", predictor, "and Yield")) +
theme_minimal() +
theme(plot.title = element_text(size = 5, face = 'bold'),
axis.title = element_text(size = 7),
axis.text = element_text(size = 5)) # Adjust the size as needed
# Append each plot to the list
plots[[predictor]] <- p
}
# Combine all plots into a single view using grid.arrange
do.call(grid.arrange, c(plots, ncol = 3)) # Adjust ncol for the layout
})
})
The relationships between the top predictors and yield highlight areas for improvement in both biological materials and manufacturing processes:
Biological Predictors: Predictors like BiologicalMaterial02, BiologicalMaterial03, and BiologicalMaterial11 exhibit a positive correlation with yield; BiologicalMaterial12 shows a weaker positive trend. Because biological predictors cannot be changed, these measurements are most useful for assessing raw material lots before processing, for example to anticipate which batches are likely to yield less.
Process Predictors: Variables such as ManufacturingProcess13, ManufacturingProcess17, and ManufacturingProcess04 show a negative correlation with yield, while others, like ManufacturingProcess06, show a positive trend. Since process predictors can be adjusted, these are the natural candidates for targeted experiments, keeping in mind that a fitted linear trend shows association, not proof that changing the setting will change yield.
How this helps improve yield: Use the biological measurements to screen raw inputs for consistency and quality. Investigate the manufacturing steps associated with lower yield, for instance by adjusting their settings in controlled runs. Combining both, quality inputs and well-tuned processes, offers a systematic route to higher yield in future runs.
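The direction of each relationship can also be summarized numerically alongside the scatter plots. The sketch below is only illustrative: it uses the first five values that str() displays (rounded) for Yield and two predictors, so the correlations it returns describe this tiny excerpt, not the full 176 runs.

```r
# First five displayed values from the str() output, used here as a toy excerpt
toy <- data.frame(
  Yield                  = c(38.0, 42.4, 42.0, 41.4, 42.5),
  BiologicalMaterial02   = c(49.6, 61.0, 61.0, 61.0, 63.3),
  ManufacturingProcess13 = c(35.5, 34.0, 34.8, 34.8, 34.6)
)

# Pearson correlation of each predictor with Yield, ignoring missing pairs
predictors_of_interest <- setdiff(names(toy), "Yield")
r <- sapply(toy[predictors_of_interest], cor,
            y = toy$Yield, use = "pairwise.complete.obs")
sign(r)
```

On the full data the same call, applied to the columns of ChemicalManufacturingProcess, would give one correlation per top predictor; use = "pairwise.complete.obs" matters there because several process columns contain missing values.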