Assignment7

Homework6

In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.

Problem 6.2

Developing a model to predict permeability (see Sect.1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

Start R and use these commands to load the data:

In addition to this step, I will also load all the libraries and seed that I will use throughout the exercise.

library(AppliedPredictiveModeling)
library(caret)
library(pls)      
library(glmnet)   
library(earth)

set.seed(602)

data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

I will extract the predictors and response into X and y respectively, and check their dimensions.

data(permeability)
X <- fingerprints      
y <- permeability  
cat("(a) Data dimensions:\n")

## (a) Data dimensions:

cat(" - n samples:", nrow(X), "\n")

##  - n samples: 165

cat(" - p predictors:", ncol(X), "\n\n")

##  - p predictors: 1107

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

nearZeroVar: this function identifies predictors that have near zero variance, which are often uninformative for modeling. By removing these predictors, we can reduce the dimensionality of the dataset and potentially improve model performance.

nzv_info <- nearZeroVar(fingerprints, saveMetrics = TRUE)
filtered <- fingerprints[, !nzv_info$nzv]
cat("(b) After nearZeroVar filtering:\n")

## (b) After nearZeroVar filtering:

cat(" - predictors remaining:", ncol(filtered), "\n")

##  - predictors remaining: 388

cat(" - removed:", sum(nzv_info$nzv), "\n\n")

##  - removed: 719

The nearZeroVar function identified and removed 719 predictors with near-zero variance, leaving 388 predictors for modeling.

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

createDataPartition: This function creates a balanced split of the data into training and test sets, ensuring that the distribution of the response variable is maintained in both sets.

set.seed(602)
trainIndex <- createDataPartition(permeability, p = 0.8, list = FALSE)

Training set: 80% of the data (132 samples)
Test set: 20% of the data (33 samples)

Partial Least Squares (PLS) regression is a technique that finds the fundamental relations between two matrices (predictors and response) by projecting the predictors into a new space. It is particularly useful when dealing with multicollinearity or when the number of predictors exceeds the number of observations.

X_train <- fingerprints[trainIndex, , drop = FALSE]
X_test  <- fingerprints[-trainIndex, , drop = FALSE]
y_train <- permeability[trainIndex]
y_test  <- permeability[-trainIndex]

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

pls_fit <- train(
  x = X_train, y = y_train,
  method = "pls",
  preProcess = c("center","scale"),
  trControl = ctrl,
  tuneLength = 30,
  metric = "Rsquared"
)

Caret’s train function is used to train the PLS model with 10 fold cross validation repeated 5 times. The data is centered and scaled as part of the preprocessing steps. The tuneLength parameter specifies that up to 30 latent variables should be considered during tuning. trainControl will set up the resampling method and performance metric (R2) for model evaluation.

best_ncomp <- pls_fit$bestTune$ncomp
cv_row     <- subset(pls_fit$results, ncomp == best_ncomp)

cat("(c) Optimal number of latent variables:", best_ncomp, "\n")

## (c) Optimal number of latent variables: 2

cat("    Resampled R2 estimate:", round(cv_row$Rsquared, 4), "\n")

##     Resampled R2 estimate: 0.4748

cat("    Resampled RMSE:", round(cv_row$RMSE, 4), "\n\n")

##     Resampled RMSE: 11.71

The PLS model explains approximately half of the variation in permeability, providing moderate predictive accuracy suitable for preliminary screening but not yet for precise quantitative prediction.

Predict the response for the test set. What is the test set estimate of R2?

After we tune the model using crossvalidation on the training data, we must evaluate it on completely unseen molecules (the test set) to understand how well it generalizes. Using the PLS model from previous step, we can predict the permeability for the test set and calculate the R2 value to assess predictive performance.

pls_pred <- predict(pls_fit, X_test)
test_perf <- postResample(pred = pls_pred, obs = y_test)
test_perf

##       RMSE   Rsquared        MAE 
## 12.8253155  0.4292181  7.8820576

The test-set R2 = 0.429 means that, on unseen molecules, the model explains roughly 43 % of the variability in measured permeability values.This is somewhat lower than the crossvalidated R2 = 0.4814 from training, but still close enough to suggest the model generalizes rather than overfits
The RMSE = 12.83 indicates that, on average, predicted permeability values deviate from true experimental values by about approximate 12.8 units.
The MAE = 7.9 tells us the average absolute prediction error.

The PLS model explains about 43 % of the variance in new data, with an average error near +-12.8 permeability units, indicating fair generalization but limited precision.

Try building other models discussed in this chapter. Do any have better predictive performance?

In this part , I will explore other linear and nonlinear modeling techniques to see if they can outperform the PLS model in predicting permeability.

PCR: Principal Components Regression
Elastic Net (glmnet): a regularized regression method that combines Lasso and Ridge penalties to handle multicollinearity and perform variable selection.
MARS (Multivariate Adaptive Regression Splines): a flexible modeling technique that captures nonlinear relationships and interactions between predictors.
k-Nearest Neighbors (for comparison): a simple, non parametric method that predicts the response based on the average of the k-nearest training samples in the predictor space.

ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  summaryFunction = defaultSummary
)

set.seed(602)
pcr_fit <- train(
  x = X_train, y = y_train,
  method = "pcr",
  preProcess = c("center", "scale"),
  trControl = ctrl,
  tuneLength = 30,
  metric = "Rsquared"
)

set.seed(602)
enet_grid <- expand.grid(
  alpha = c(0, 0.5, 1),                 
  lambda = 10^seq(2, -4, length.out = 60)
)
glmnet_fit <- train(
  x = X_train, y = y_train,
  method = "glmnet",
  preProcess = c("center", "scale"),
  trControl = ctrl,
  tuneGrid = enet_grid,
  metric = "Rsquared"
)

set.seed(602)
mars_fit <- train(
  x = X_train, y = y_train,
  method = "earth",
  preProcess = c("center", "scale"),
  trControl = ctrl,
  tuneLength = 15,
  metric = "Rsquared"
)

set.seed(602)
knn_fit <- train(
  x = X_train, y = y_train,
  method = "knn",
  preProcess = c("center", "scale"),
  trControl = ctrl,
  tuneLength = 30,
  metric = "Rsquared"
)

all_models <- list(
  PLS  = pls_fit,
  PCR  = pcr_fit,
  ENET = glmnet_fit,
  MARS = mars_fit,
  KNN  = knn_fit
)

resamp <- resamples(all_models)
summary(resamp)

## 
## Call:
## summary.resamples(object = resamp)
## 
## Models: PLS, PCR, ENET, MARS, KNN 
## Number of resamples: 50 
## 
## MAE 
##          Min.  1st Qu.   Median     Mean   3rd Qu.     Max. NA's
## PLS  4.490202 7.135379 8.729333 8.493329  9.299483 14.35262    0
## PCR  5.390501 7.226752 8.451080 8.726320 10.089294 12.51803    0
## ENET 5.725900 8.193631 9.041096 8.903995  9.876332 12.52417    0
## MARS 3.620401 5.833789 6.880511 7.108949  8.629239 11.46743    4
## KNN  4.975308 6.954875 7.951666 8.399316  9.383573 17.06103    0
## 
## RMSE 
##          Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## PLS  5.384167 10.434321 11.78355 11.70999 13.15682 19.47829    0
## PCR  6.851210 10.057080 11.90287 11.74837 13.95191 17.02145    0
## ENET 5.949551 10.357726 11.78543 11.66920 13.28737 16.57782    0
## MARS 5.050738  9.037124 10.09698 10.63569 12.26298 17.27696    4
## KNN  6.935717  9.509982 11.75077 11.82494 13.41416 21.48359    0
## 
## Rsquared 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS  0.010266707 0.3295869 0.4824714 0.4748033 0.6664270 0.8337259    0
## PCR  0.010148735 0.2806332 0.4417306 0.4635471 0.6620481 0.8674690    0
## ENET 0.004998217 0.3861168 0.5725982 0.5422090 0.7319950 0.9663744    0
## MARS 0.024181799 0.3214183 0.5759825 0.5227619 0.7321760 0.9142742    4
## KNN  0.007971010 0.3285069 0.4597507 0.4739322 0.6614803 0.9421662    0

perf_tbl <- data.frame(Model = names(all_models), RMSE = NA, R2 = NA, MAE = NA)
for (i in seq_along(all_models)) {
  pred <- predict(all_models[[i]], X_test)
  m <- postResample(pred, y_test)
  perf_tbl[i, 2:4] <- m
}
perf_tbl[order(-perf_tbl$R2), ]

##   Model     RMSE        R2      MAE
## 2   PCR 12.84413 0.4594581 9.029116
## 4  MARS 12.17000 0.4560603 7.200874
## 3  ENET 13.20902 0.4518567 9.641743
## 1   PLS 12.82532 0.4292181 7.882058
## 5   KNN 13.75565 0.3136338 8.977039

MARS (Multivariate Adaptive Regression Splines) achieved the lowest RMSE (12.17) and lowest MAE (7.20), showing slightly better prediction accuracy on average compared with other methods.
PCR performed very close to MARS in terms of R² (0.459), but had a slightly higher error (RMSE = 12.84).
Elastic Net (ENET), which was the strongest performer in cross validation, dropped slightly on the test set (R² = 0.452).
PLS and KNN lagged behind in both accuracy and explained variance, confirming their moderate predictive power.

Several alternative models were trained, including PCR, Elastic Net, MARS, and kNN. Among them, MARS achieved the best overall test set performance (R2 = 0.456, RMSE = 12.17, MAE = 7.20), slightly outperforming PCR and Elastic Net. However, the differences were small, all models explain roughly 45 % of the variance in permeability. This suggests that while nonlinear or regularized models provide marginal gains, none offer a dramatic improvement over the original PLS model.

Would you recommend any of your models to replace the permeability laboratory experiment?

Based on the results, I wouldn’t replace the permeability lab test with any of these models right now. Even though models like MARS and PCR performed a bit better than the others, they still only explained around 45% of the variation in permeability.

That said, the models can still be very helpful as early screening tools. They could help scientists quickly identify which compounds are worth testing in the lab, saving both time and money. So while they shouldn’t replace the lab experiments, they could definitely support and speed up the early stages of the research process.

Problem 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

Start R and use these commands to load the data:

I will load the necessary libraries and the dataset, after that I will inspect the structure.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

dim(ChemicalManufacturingProcess)

## [1] 176  58

head(ChemicalManufacturingProcess)

##   Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00                 6.25                49.58                56.97
## 2 42.44                 8.01                60.97                67.48
## 3 42.03                 8.01                60.97                67.48
## 4 41.42                 8.01                60.97                67.48
## 5 42.49                 7.47                63.33                72.25
## 6 43.57                 6.12                58.36                65.31
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1                12.74                19.51                43.73
## 2                14.65                19.36                53.14
## 3                14.65                19.36                53.14
## 4                14.65                19.36                53.14
## 5                14.02                17.91                54.66
## 6                15.17                21.79                51.23
##   BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1                  100                16.66                11.44
## 2                  100                19.04                12.55
## 3                  100                19.04                12.55
## 4                  100                19.04                12.55
## 5                  100                18.22                12.80
## 6                  100                18.30                12.13
##   BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1                 3.46               138.09                18.83
## 2                 3.46               153.67                21.05
## 3                 3.46               153.67                21.05
## 4                 3.46               153.67                21.05
## 5                 3.05               147.61                21.05
## 6                 3.78               151.88                20.76
##   ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1                     NA                     NA                     NA
## 2                    0.0                      0                     NA
## 3                    0.0                      0                     NA
## 4                    0.0                      0                     NA
## 5                   10.7                      0                     NA
## 6                   12.0                      0                     NA
##   ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1                     NA                     NA                     NA
## 2                    917                 1032.2                  210.0
## 3                    912                 1003.6                  207.1
## 4                    911                 1014.6                  213.3
## 5                    918                 1027.5                  205.7
## 6                    924                 1016.8                  208.9
##   ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1                     NA                     NA                  43.00
## 2                    177                    178                  46.57
## 3                    178                    178                  45.07
## 4                    177                    177                  44.92
## 5                    178                    178                  44.96
## 6                    178                    178                  45.32
##   ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1                     NA                     NA                     NA
## 2                     NA                     NA                      0
## 3                     NA                     NA                      0
## 4                     NA                     NA                      0
## 5                     NA                     NA                      0
## 6                     NA                     NA                      0
##   ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1                   35.5                   4898                   6108
## 2                   34.0                   4869                   6095
## 3                   34.8                   4878                   6087
## 4                   34.8                   4897                   6102
## 5                   34.6                   4992                   6233
## 6                   34.0                   4985                   6222
##   ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1                   4682                   35.5                   4865
## 2                   4617                   34.0                   4867
## 3                   4617                   34.8                   4877
## 4                   4635                   34.8                   4872
## 5                   4733                   33.9                   4886
## 6                   4786                   33.4                   4862
##   ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1                   6049                   4665                    0.0
## 2                   6097                   4621                    0.0
## 3                   6078                   4621                    0.0
## 4                   6073                   4611                    0.0
## 5                   6102                   4659                   -0.7
## 6                   6115                   4696                   -0.6
##   ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1                     NA                     NA                     NA
## 2                      3                      0                      3
## 3                      4                      1                      4
## 4                      5                      2                      5
## 5                      8                      4                     18
## 6                      9                      1                      1
##   ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1                   4873                   6074                   4685
## 2                   4869                   6107                   4630
## 3                   4897                   6116                   4637
## 4                   4892                   6111                   4630
## 5                   4930                   6151                   4684
## 6                   4871                   6128                   4687
##   ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1                   10.7                   21.0                    9.9
## 2                   11.2                   21.4                    9.9
## 3                   11.1                   21.3                    9.4
## 4                   11.1                   21.3                    9.4
## 5                   11.3                   21.6                    9.0
## 6                   11.4                   21.7                   10.1
##   ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1                   69.1                    156                     66
## 2                   68.7                    169                     66
## 3                   69.3                    173                     66
## 4                   69.3                    171                     68
## 5                   69.4                    171                     70
## 6                   68.2                    173                     70
##   ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1                    2.4                    486                  0.019
## 2                    2.6                    508                  0.019
## 3                    2.6                    509                  0.018
## 4                    2.5                    496                  0.018
## 5                    2.5                    468                  0.017
## 6                    2.5                    490                  0.018
##   ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1                    0.5                      3                    7.2
## 2                    2.0                      2                    7.2
## 3                    0.7                      2                    7.2
## 4                    1.2                      2                    7.2
## 5                    0.2                      2                    7.3
## 6                    0.4                      2                    7.2
##   ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1                     NA                     NA                   11.6
## 2                    0.1                   0.15                   11.1
## 3                    0.0                   0.00                   12.0
## 4                    0.0                   0.00                   10.6
## 5                    0.0                   0.00                   11.0
## 6                    0.0                   0.00                   11.5
##   ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1                    3.0                    1.8                    2.4
## 2                    0.9                    1.9                    2.2
## 3                    1.0                    1.8                    2.3
## 4                    1.1                    1.8                    2.1
## 5                    1.1                    1.7                    2.1
## 6                    2.2                    1.8                    2.0

summary(ChemicalManufacturingProcess$Yield)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35.25   38.75   39.97   40.18   41.48   46.34

The dataset ChemicalManufacturingProcess from the AppliedPredictiveModeling package was loaded successfully. It contains 176 manufacturing batches and 58 variables (57 predictors and one response, Yield).

We’re working with data from a pharmaceutical manufacturing process, where each observation represents one production batch. For each batch, there are 57 predictor measurements:

12 describe the raw material’s biological quality .
45 describe controllable manufacturing process settings .

processPredictors <- ChemicalManufacturingProcess[, -58]  
yield <- ChemicalManufacturingProcess$Yield  


cat("(a) Data dimensions:\n")

## (a) Data dimensions:

cat(" - n samples:", nrow(processPredictors), "\n")

##  - n samples: 176

cat(" - p predictors:", ncol(processPredictors), "\n\n")

##  - p predictors: 57

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect.3.8).

Before modeling, it’s important to handle any missing data in the predictors.

Lets validate how many missing values are present.

sum(is.na(processPredictors))

## [1] 106

cat("(b) Total missing values in predictors:", sum(is.na(processPredictors)), "\n\n")

## (b) Total missing values in predictors: 106

The predictors had 106 missing values spread across several variables, a very small proportion of the total data.

Now we will use preProcess from the caret package to impute these missing values using k-nearest neighbors (KNN) imputation, along with centering and scaling the data.

#install.packages("RANN")
preProcValues <- preProcess(processPredictors,
                            method = c("center", "scale", "knnImpute"))
processPredictors_imputed <- predict(preProcValues, processPredictors)
sum(is.na(processPredictors_imputed))

## [1] 0

A total of 106 missing predictor values (about 1% of all data) were identified. Using the caret package’s preProcess() function with knnImpute, center, and scale, all missing values were successfully filled.

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

The objective is to build a predictive model for yield using the imputed predictors. I will split the data into training (80%) and test (20%) sets, then train a model.

set.seed(631)

data_imputed <- cbind(processPredictors_imputed, Yield = yield)

set.seed(631)
trainIndex <- createDataPartition(data_imputed$Yield, p = 0.8, list = FALSE)
training <- data_imputed[trainIndex, ]
testing  <- data_imputed[-trainIndex, ]

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 5)

set.seed(631)
pls_fit <- train(Yield ~ ., 
                 data = training,
                 method = "pls",
                 tuneLength = 20,
                 trControl = ctrl,
                 metric = "Rsquared",
                 preProcess = c("center", "scale"))

pls_fit

## Partial Least Squares 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 130, 128, 131, 129, 129, 131, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.8223954  0.4201587  0.6512392
##    2     1.0555546  0.4607526  0.6731944
##    3     0.8092787  0.5232743  0.5954760
##    4     0.9022249  0.5156774  0.6341939
##    5     0.9731157  0.5002784  0.6547891
##    6     0.9821266  0.4912454  0.6540495
##    7     1.0185733  0.5004027  0.6677131
##    8     1.0199938  0.5018035  0.6698446
##    9     1.0785701  0.4998744  0.6881153
##   10     1.1396380  0.4918636  0.7087891
##   11     1.1949658  0.4868934  0.7239973
##   12     1.2775644  0.4756464  0.7530424
##   13     1.3346971  0.4648140  0.7745308
##   14     1.3818003  0.4531712  0.7901397
##   15     1.4100913  0.4548370  0.7973616
##   16     1.4549144  0.4540403  0.8079149
##   17     1.4911680  0.4461375  0.8206459
##   18     1.5232076  0.4413947  0.8320037
##   19     1.5583011  0.4414793  0.8448329
##   20     1.6079868  0.4389825  0.8618122
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 3.

After splitting the dataset into training and testing portions and using repeated 10-fold cross validation to tune a Partial Least Squares model, the best performance occurred with 3 latent components. This model explained about 52 % (R2 = 0.523, RMSE = 0.81, and MAE = 0.59) of the variation in product yield, with an average prediction error (RMSE) of around 0.8 yield units.

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

pls_pred <- predict(pls_fit, newdata = testing)

test_perf <- postResample(pred = pls_pred, obs = testing$Yield)
test_perf

##      RMSE  Rsquared       MAE 
## 0.5847297 0.6681721 0.4850774

On the test set, the PLS model with 3 latent components achieved an RMSE = 0.585, R2 = 0.668, and MAE = 0.485. These results are even better than the cross-validated training metrics (R² = 0.523, RMSE = 0.809), showing that the model generalizes well and explains about 67 % of the variation in yield. The model predicts product yield within roughly ±0.5 percentage points, indicating strong and reliable performance.

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

In this step we need to determin ate which predictors are most influential in predicting yield according to the trained PLS model.

importance <- varImp(pls_fit, scale = TRUE)
importance_df <- importance$importance

head(importance_df[order(-importance_df$Overall), , drop = FALSE], 10)

##                          Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess13  91.02735
## ManufacturingProcess17  83.71351
## ManufacturingProcess09  81.41170
## ManufacturingProcess36  80.88777
## ManufacturingProcess06  62.61069
## BiologicalMaterial02    62.17932
## ManufacturingProcess33  61.52891
## ManufacturingProcess12  60.13153
## BiologicalMaterial06    59.17009

The most important predictors in the trained PLS model were primarily manufacturing process variables, including ManufacturingProcess32, 13, 17, 09, and 36. Only two biological predictors (BiologicalMaterial02 and 06) appeared among the top ten. Therefore, process predictors dominate, indicating that yield is influenced mainly by controllable process parameters rather than by the biological properties of the raw materials.

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

I will create a plot to visualize the relationship between the top predictors and yield. In this part to avoid continue having the issue with the duplicate columns I will ensure dataset has unique names.

names(training) <- make.names(names(training), unique = TRUE)

top_vars <- c("ManufacturingProcess32", "ManufacturingProcess13", 
              "ManufacturingProcess17", "ManufacturingProcess09", 
              "ManufacturingProcess36")

for (var in top_vars) {
  p <- ggplot(training, aes_string(x = var, y = "Yield")) +
    geom_point(alpha = 0.7, color = "steelblue") +
    geom_smooth(method = "loess", color = "darkorange", se = FALSE) +
    labs(title = paste("Relationship between", var, "and Yield"),
         x = var, y = "Product Yield (%)") +
    theme_minimal()
  print(p)
}

By plotting the five most influential process variables against product yield, we can see distinct patterns that reveal how changes in each setting affect performance:

ManufacturingProcess32 shows a positive relationship with yield. As this variable increases, yield rises until it reaches a plateau. This suggests that maintaining this parameter in the higher range without exceeding the stable zone can boost yield.
ManufacturingProcess13 has a strong negative relationship with yield. Higher values consistently reduce output, meaning that lowering this parameter could improve results.
ManufacturingProcess17 displays a curved relationship: yield is higher at mid-to-low levels but decreases when the setting is pushed too high.
ManufacturingProcess09 shows a steady positive effect, with yield increasing gradually as this variable rises. Small upward adjustments may therefore lead to consistent improvements.
ManufacturingProcess36 has a complex, wave-like pattern, suggesting there may be specific optimal operating points rather than a simple linear trend. Fine tuning around those peaks could yield the best outcomes.

The top process variables show clear and interpretable relationships with product yield some positive, some negative, and some nonlinear. By identifying the ranges where yield is highest and controlling these parameters accordingly, the company can systematically improve future production runs. This analysis confirms that optimizing manufacturing process conditions offers the most direct and profitable path to higher yield.

Assignment7

Alinzon Simon

2025-11-01

Homework6

Problem 6.2

Problem 6.3