In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.
Developing a model to predict permeability (see Sect.1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
In addition to this step, I will also load all the libraries and seed that I will use throughout the exercise.
library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(glmnet)
library(earth)
set.seed(602)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
I will extract the predictors and response into X and y respectively, and check their dimensions.
data(permeability)
X <- fingerprints
y <- permeability
cat("(a) Data dimensions:\n")
## (a) Data dimensions:
cat(" - n samples:", nrow(X), "\n")
## - n samples: 165
cat(" - p predictors:", ncol(X), "\n\n")
## - p predictors: 1107
nearZeroVar: this function identifies predictors that have near zero variance, which are often uninformative for modeling. By removing these predictors, we can reduce the dimensionality of the dataset and potentially improve model performance.
nzv_info <- nearZeroVar(fingerprints, saveMetrics = TRUE)
filtered <- fingerprints[, !nzv_info$nzv]
cat("(b) After nearZeroVar filtering:\n")
## (b) After nearZeroVar filtering:
cat(" - predictors remaining:", ncol(filtered), "\n")
## - predictors remaining: 388
cat(" - removed:", sum(nzv_info$nzv), "\n\n")
## - removed: 719
The nearZeroVar function identified and removed 719 predictors with near-zero variance, leaving 388 predictors for modeling.
set.seed(602)
trainIndex <- createDataPartition(permeability, p = 0.8, list = FALSE)
Partial Least Squares (PLS) regression is a technique that finds the fundamental relations between two matrices (predictors and response) by projecting the predictors into a new space. It is particularly useful when dealing with multicollinearity or when the number of predictors exceeds the number of observations.
X_train <- fingerprints[trainIndex, , drop = FALSE]
X_test <- fingerprints[-trainIndex, , drop = FALSE]
y_train <- permeability[trainIndex]
y_test <- permeability[-trainIndex]
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
pls_fit <- train(
x = X_train, y = y_train,
method = "pls",
preProcess = c("center","scale"),
trControl = ctrl,
tuneLength = 30,
metric = "Rsquared"
)
Caret’s train function is used to train the PLS model with 10 fold cross validation repeated 5 times. The data is centered and scaled as part of the preprocessing steps. The tuneLength parameter specifies that up to 30 latent variables should be considered during tuning. trainControl will set up the resampling method and performance metric (R2) for model evaluation.
best_ncomp <- pls_fit$bestTune$ncomp
cv_row <- subset(pls_fit$results, ncomp == best_ncomp)
cat("(c) Optimal number of latent variables:", best_ncomp, "\n")
## (c) Optimal number of latent variables: 2
cat(" Resampled R2 estimate:", round(cv_row$Rsquared, 4), "\n")
## Resampled R2 estimate: 0.4748
cat(" Resampled RMSE:", round(cv_row$RMSE, 4), "\n\n")
## Resampled RMSE: 11.71
The PLS model explains approximately half of the variation in permeability, providing moderate predictive accuracy suitable for preliminary screening but not yet for precise quantitative prediction.
After we tune the model using crossvalidation on the training data, we must evaluate it on completely unseen molecules (the test set) to understand how well it generalizes. Using the PLS model from previous step, we can predict the permeability for the test set and calculate the R2 value to assess predictive performance.
pls_pred <- predict(pls_fit, X_test)
test_perf <- postResample(pred = pls_pred, obs = y_test)
test_perf
## RMSE Rsquared MAE
## 12.8253155 0.4292181 7.8820576
The test-set R2 = 0.429 means that, on unseen molecules, the model explains roughly 43 % of the variability in measured permeability values.This is somewhat lower than the crossvalidated R2 = 0.4814 from training, but still close enough to suggest the model generalizes rather than overfits
The RMSE = 12.83 indicates that, on average, predicted permeability values deviate from true experimental values by about approximate 12.8 units.
The MAE = 7.9 tells us the average absolute prediction error.
The PLS model explains about 43 % of the variance in new data, with an average error near +-12.8 permeability units, indicating fair generalization but limited precision.
In this part , I will explore other linear and nonlinear modeling techniques to see if they can outperform the PLS model in predicting permeability.
PCR: Principal Components Regression
Elastic Net (glmnet): a regularized regression method that combines Lasso and Ridge penalties to handle multicollinearity and perform variable selection.
MARS (Multivariate Adaptive Regression Splines): a flexible modeling technique that captures nonlinear relationships and interactions between predictors.
k-Nearest Neighbors (for comparison): a simple, non parametric method that predicts the response based on the average of the k-nearest training samples in the predictor space.
ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 5,
summaryFunction = defaultSummary
)
set.seed(602)
pcr_fit <- train(
x = X_train, y = y_train,
method = "pcr",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 30,
metric = "Rsquared"
)
set.seed(602)
enet_grid <- expand.grid(
alpha = c(0, 0.5, 1),
lambda = 10^seq(2, -4, length.out = 60)
)
glmnet_fit <- train(
x = X_train, y = y_train,
method = "glmnet",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneGrid = enet_grid,
metric = "Rsquared"
)
set.seed(602)
mars_fit <- train(
x = X_train, y = y_train,
method = "earth",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 15,
metric = "Rsquared"
)
set.seed(602)
knn_fit <- train(
x = X_train, y = y_train,
method = "knn",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 30,
metric = "Rsquared"
)
all_models <- list(
PLS = pls_fit,
PCR = pcr_fit,
ENET = glmnet_fit,
MARS = mars_fit,
KNN = knn_fit
)
resamp <- resamples(all_models)
summary(resamp)
##
## Call:
## summary.resamples(object = resamp)
##
## Models: PLS, PCR, ENET, MARS, KNN
## Number of resamples: 50
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 4.490202 7.135379 8.729333 8.493329 9.299483 14.35262 0
## PCR 5.390501 7.226752 8.451080 8.726320 10.089294 12.51803 0
## ENET 5.725900 8.193631 9.041096 8.903995 9.876332 12.52417 0
## MARS 3.620401 5.833789 6.880511 7.108949 8.629239 11.46743 4
## KNN 4.975308 6.954875 7.951666 8.399316 9.383573 17.06103 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 5.384167 10.434321 11.78355 11.70999 13.15682 19.47829 0
## PCR 6.851210 10.057080 11.90287 11.74837 13.95191 17.02145 0
## ENET 5.949551 10.357726 11.78543 11.66920 13.28737 16.57782 0
## MARS 5.050738 9.037124 10.09698 10.63569 12.26298 17.27696 4
## KNN 6.935717 9.509982 11.75077 11.82494 13.41416 21.48359 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.010266707 0.3295869 0.4824714 0.4748033 0.6664270 0.8337259 0
## PCR 0.010148735 0.2806332 0.4417306 0.4635471 0.6620481 0.8674690 0
## ENET 0.004998217 0.3861168 0.5725982 0.5422090 0.7319950 0.9663744 0
## MARS 0.024181799 0.3214183 0.5759825 0.5227619 0.7321760 0.9142742 4
## KNN 0.007971010 0.3285069 0.4597507 0.4739322 0.6614803 0.9421662 0
perf_tbl <- data.frame(Model = names(all_models), RMSE = NA, R2 = NA, MAE = NA)
for (i in seq_along(all_models)) {
pred <- predict(all_models[[i]], X_test)
m <- postResample(pred, y_test)
perf_tbl[i, 2:4] <- m
}
perf_tbl[order(-perf_tbl$R2), ]
## Model RMSE R2 MAE
## 2 PCR 12.84413 0.4594581 9.029116
## 4 MARS 12.17000 0.4560603 7.200874
## 3 ENET 13.20902 0.4518567 9.641743
## 1 PLS 12.82532 0.4292181 7.882058
## 5 KNN 13.75565 0.3136338 8.977039
MARS (Multivariate Adaptive Regression Splines) achieved the lowest RMSE (12.17) and lowest MAE (7.20), showing slightly better prediction accuracy on average compared with other methods.
PCR performed very close to MARS in terms of R² (0.459), but had a slightly higher error (RMSE = 12.84).
Elastic Net (ENET), which was the strongest performer in cross validation, dropped slightly on the test set (R² = 0.452).
PLS and KNN lagged behind in both accuracy and explained variance, confirming their moderate predictive power.
Several alternative models were trained, including PCR, Elastic Net, MARS, and kNN. Among them, MARS achieved the best overall test set performance (R2 = 0.456, RMSE = 12.17, MAE = 7.20), slightly outperforming PCR and Elastic Net. However, the differences were small, all models explain roughly 45 % of the variance in permeability. This suggests that while nonlinear or regularized models provide marginal gains, none offer a dramatic improvement over the original PLS model.
Based on the results, I wouldn’t replace the permeability lab test with any of these models right now. Even though models like MARS and PCR performed a bit better than the others, they still only explained around 45% of the variation in permeability.
That said, the models can still be very helpful as early screening tools. They could help scientists quickly identify which compounds are worth testing in the lab, saving both time and money. So while they shouldn’t replace the lab experiments, they could definitely support and speed up the early stages of the research process.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
I will load the necessary libraries and the dataset, after that I will inspect the structure.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
dim(ChemicalManufacturingProcess)
## [1] 176 58
head(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## 5 42.49 7.47 63.33 72.25
## 6 43.57 6.12 58.36 65.31
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## 5 14.02 17.91 54.66
## 6 15.17 21.79 51.23
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 100 16.66 11.44
## 2 100 19.04 12.55
## 3 100 19.04 12.55
## 4 100 19.04 12.55
## 5 100 18.22 12.80
## 6 100 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 3.46 138.09 18.83
## 2 3.46 153.67 21.05
## 3 3.46 153.67 21.05
## 4 3.46 153.67 21.05
## 5 3.05 147.61 21.05
## 6 3.78 151.88 20.76
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 NA NA NA
## 2 0.0 0 NA
## 3 0.0 0 NA
## 4 0.0 0 NA
## 5 10.7 0 NA
## 6 12.0 0 NA
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 NA NA NA
## 2 917 1032.2 210.0
## 3 912 1003.6 207.1
## 4 911 1014.6 213.3
## 5 918 1027.5 205.7
## 6 924 1016.8 208.9
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 NA NA 43.00
## 2 177 178 46.57
## 3 178 178 45.07
## 4 177 177 44.92
## 5 178 178 44.96
## 6 178 178 45.32
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 NA NA NA
## 2 NA NA 0
## 3 NA NA 0
## 4 NA NA 0
## 5 NA NA 0
## 6 NA NA 0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 35.5 4898 6108
## 2 34.0 4869 6095
## 3 34.8 4878 6087
## 4 34.8 4897 6102
## 5 34.6 4992 6233
## 6 34.0 4985 6222
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 4682 35.5 4865
## 2 4617 34.0 4867
## 3 4617 34.8 4877
## 4 4635 34.8 4872
## 5 4733 33.9 4886
## 6 4786 33.4 4862
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 6049 4665 0.0
## 2 6097 4621 0.0
## 3 6078 4621 0.0
## 4 6073 4611 0.0
## 5 6102 4659 -0.7
## 6 6115 4696 -0.6
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 NA NA NA
## 2 3 0 3
## 3 4 1 4
## 4 5 2 5
## 5 8 4 18
## 6 9 1 1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 4873 6074 4685
## 2 4869 6107 4630
## 3 4897 6116 4637
## 4 4892 6111 4630
## 5 4930 6151 4684
## 6 4871 6128 4687
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 10.7 21.0 9.9
## 2 11.2 21.4 9.9
## 3 11.1 21.3 9.4
## 4 11.1 21.3 9.4
## 5 11.3 21.6 9.0
## 6 11.4 21.7 10.1
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 69.1 156 66
## 2 68.7 169 66
## 3 69.3 173 66
## 4 69.3 171 68
## 5 69.4 171 70
## 6 68.2 173 70
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 2.4 486 0.019
## 2 2.6 508 0.019
## 3 2.6 509 0.018
## 4 2.5 496 0.018
## 5 2.5 468 0.017
## 6 2.5 490 0.018
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 0.5 3 7.2
## 2 2.0 2 7.2
## 3 0.7 2 7.2
## 4 1.2 2 7.2
## 5 0.2 2 7.3
## 6 0.4 2 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 NA NA 11.6
## 2 0.1 0.15 11.1
## 3 0.0 0.00 12.0
## 4 0.0 0.00 10.6
## 5 0.0 0.00 11.0
## 6 0.0 0.00 11.5
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 3.0 1.8 2.4
## 2 0.9 1.9 2.2
## 3 1.0 1.8 2.3
## 4 1.1 1.8 2.1
## 5 1.1 1.7 2.1
## 6 2.2 1.8 2.0
summary(ChemicalManufacturingProcess$Yield)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.25 38.75 39.97 40.18 41.48 46.34
The dataset ChemicalManufacturingProcess from the AppliedPredictiveModeling package was loaded successfully. It contains 176 manufacturing batches and 58 variables (57 predictors and one response, Yield).
We’re working with data from a pharmaceutical manufacturing process, where each observation represents one production batch. For each batch, there are 57 predictor measurements:
12 describe the raw material’s biological quality .
45 describe controllable manufacturing process settings .
processPredictors <- ChemicalManufacturingProcess[, -58]
yield <- ChemicalManufacturingProcess$Yield
cat("(a) Data dimensions:\n")
## (a) Data dimensions:
cat(" - n samples:", nrow(processPredictors), "\n")
## - n samples: 176
cat(" - p predictors:", ncol(processPredictors), "\n\n")
## - p predictors: 57
Before modeling, it’s important to handle any missing data in the predictors.
Lets validate how many missing values are present.
sum(is.na(processPredictors))
## [1] 106
cat("(b) Total missing values in predictors:", sum(is.na(processPredictors)), "\n\n")
## (b) Total missing values in predictors: 106
The predictors had 106 missing values spread across several variables, a very small proportion of the total data.
Now we will use preProcess from the caret package to impute these missing values using k-nearest neighbors (KNN) imputation, along with centering and scaling the data.
#install.packages("RANN")
preProcValues <- preProcess(processPredictors,
method = c("center", "scale", "knnImpute"))
processPredictors_imputed <- predict(preProcValues, processPredictors)
sum(is.na(processPredictors_imputed))
## [1] 0
A total of 106 missing predictor values (about 1% of all data) were identified. Using the caret package’s preProcess() function with knnImpute, center, and scale, all missing values were successfully filled.
The objective is to build a predictive model for yield using the imputed predictors. I will split the data into training (80%) and test (20%) sets, then train a model.
set.seed(631)
data_imputed <- cbind(processPredictors_imputed, Yield = yield)
set.seed(631)
trainIndex <- createDataPartition(data_imputed$Yield, p = 0.8, list = FALSE)
training <- data_imputed[trainIndex, ]
testing <- data_imputed[-trainIndex, ]
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5)
set.seed(631)
pls_fit <- train(Yield ~ .,
data = training,
method = "pls",
tuneLength = 20,
trControl = ctrl,
metric = "Rsquared",
preProcess = c("center", "scale"))
pls_fit
## Partial Least Squares
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 130, 128, 131, 129, 129, 131, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.8223954 0.4201587 0.6512392
## 2 1.0555546 0.4607526 0.6731944
## 3 0.8092787 0.5232743 0.5954760
## 4 0.9022249 0.5156774 0.6341939
## 5 0.9731157 0.5002784 0.6547891
## 6 0.9821266 0.4912454 0.6540495
## 7 1.0185733 0.5004027 0.6677131
## 8 1.0199938 0.5018035 0.6698446
## 9 1.0785701 0.4998744 0.6881153
## 10 1.1396380 0.4918636 0.7087891
## 11 1.1949658 0.4868934 0.7239973
## 12 1.2775644 0.4756464 0.7530424
## 13 1.3346971 0.4648140 0.7745308
## 14 1.3818003 0.4531712 0.7901397
## 15 1.4100913 0.4548370 0.7973616
## 16 1.4549144 0.4540403 0.8079149
## 17 1.4911680 0.4461375 0.8206459
## 18 1.5232076 0.4413947 0.8320037
## 19 1.5583011 0.4414793 0.8448329
## 20 1.6079868 0.4389825 0.8618122
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 3.
After splitting the dataset into training and testing portions and using repeated 10-fold cross validation to tune a Partial Least Squares model, the best performance occurred with 3 latent components. This model explained about 52 % (R2 = 0.523, RMSE = 0.81, and MAE = 0.59) of the variation in product yield, with an average prediction error (RMSE) of around 0.8 yield units.
pls_pred <- predict(pls_fit, newdata = testing)
test_perf <- postResample(pred = pls_pred, obs = testing$Yield)
test_perf
## RMSE Rsquared MAE
## 0.5847297 0.6681721 0.4850774
On the test set, the PLS model with 3 latent components achieved an RMSE = 0.585, R2 = 0.668, and MAE = 0.485. These results are even better than the cross-validated training metrics (R² = 0.523, RMSE = 0.809), showing that the model generalizes well and explains about 67 % of the variation in yield. The model predicts product yield within roughly ±0.5 percentage points, indicating strong and reliable performance.
In this step we need to determin ate which predictors are most influential in predicting yield according to the trained PLS model.
importance <- varImp(pls_fit, scale = TRUE)
importance_df <- importance$importance
head(importance_df[order(-importance_df$Overall), , drop = FALSE], 10)
## Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess13 91.02735
## ManufacturingProcess17 83.71351
## ManufacturingProcess09 81.41170
## ManufacturingProcess36 80.88777
## ManufacturingProcess06 62.61069
## BiologicalMaterial02 62.17932
## ManufacturingProcess33 61.52891
## ManufacturingProcess12 60.13153
## BiologicalMaterial06 59.17009
The most important predictors in the trained PLS model were primarily manufacturing process variables, including ManufacturingProcess32, 13, 17, 09, and 36. Only two biological predictors (BiologicalMaterial02 and 06) appeared among the top ten. Therefore, process predictors dominate, indicating that yield is influenced mainly by controllable process parameters rather than by the biological properties of the raw materials.
I will create a plot to visualize the relationship between the top predictors and yield. In this part to avoid continue having the issue with the duplicate columns I will ensure dataset has unique names.
names(training) <- make.names(names(training), unique = TRUE)
top_vars <- c("ManufacturingProcess32", "ManufacturingProcess13",
"ManufacturingProcess17", "ManufacturingProcess09",
"ManufacturingProcess36")
for (var in top_vars) {
p <- ggplot(training, aes_string(x = var, y = "Yield")) +
geom_point(alpha = 0.7, color = "steelblue") +
geom_smooth(method = "loess", color = "darkorange", se = FALSE) +
labs(title = paste("Relationship between", var, "and Yield"),
x = var, y = "Product Yield (%)") +
theme_minimal()
print(p)
}
By plotting the five most influential process variables against product yield, we can see distinct patterns that reveal how changes in each setting affect performance:
ManufacturingProcess32 shows a positive relationship with yield. As this variable increases, yield rises until it reaches a plateau. This suggests that maintaining this parameter in the higher range without exceeding the stable zone can boost yield.
ManufacturingProcess13 has a strong negative relationship with yield. Higher values consistently reduce output, meaning that lowering this parameter could improve results.
ManufacturingProcess17 displays a curved relationship: yield is higher at mid-to-low levels but decreases when the setting is pushed too high.
ManufacturingProcess09 shows a steady positive effect, with yield increasing gradually as this variable rises. Small upward adjustments may therefore lead to consistent improvements.
ManufacturingProcess36 has a complex, wave-like pattern, suggesting there may be specific optimal operating points rather than a simple linear trend. Fine tuning around those peaks could yield the best outcomes.
The top process variables show clear and interpretable relationships with product yield some positive, some negative, and some nonlinear. By identifying the ranges where yield is highest and controlling these parameters accordingly, the company can systematically improve future production runs. This analysis confirms that optimizing manufacturing process conditions offers the most direct and profitable path to higher yield.