Data 624: HW 7

Nakesha Fray

2025-04-01

In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.

6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug: (a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)   # permeability and ChemicalManufacturingProcess data
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.3
library(caret)                       # nearZeroVar, createDataPartition, train, varImp, postResample
library(tidyr)                       # pivot_longer
library(ggplot2)                     # scatter plots of predictors vs. the response
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
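As a quick sanity check (a sketch only; the output is not shown here), the stated dimensions can be confirmed directly:

dim(fingerprints)   # expected: 165 compounds x 1,107 binary fingerprint predictors
dim(permeability)   # expected: 165 x 1 matrix holding the permeability response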

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

After filtering out the predictors that have low frequencies using the nearZeroVar function from the caret package, there were 388 binary molecular predictors left for modeling.

# drop the near-zero-variance fingerprint columns
filtered_finger <- fingerprints[, -nearZeroVar(fingerprints)]
ncol(filtered_finger)
## [1] 388
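Out of the original 1,107 fingerprints, this means 719 near-zero-variance predictors were removed; a one-line check (a sketch, output not shown):

length(nearZeroVar(fingerprints))   # expected: 719 predictors flagged as near-zero variance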

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

First, I combined the filtered fingerprint predictors and the permeability response into a single dataset. I then used the createDataPartition function to create a stratified random split, allocating 80% of the data to the training set, and subset the data into training and test set objects. The predictor data were pre-processed by centering and scaling through the preProc argument. After tuning a PLS model with 10-fold cross-validation, the optimal number of latent variables is 9, and the corresponding resampled estimate of R2 is 0.5051022.

pharma <- cbind(as.data.frame(filtered_finger), permeability)

set.seed(1000)
splits <- createDataPartition(pharma$permeability, p = .80, times = 1, list = FALSE)

trainingSet <- pharma[ splits,]
testSet <- pharma[-splits,]

pls_model <- train(permeability ~ ., data = trainingSet, method = "pls",
  preProc = c("center", "scale"),               # center and scale the predictors
  tuneLength = 20,                              # evaluate 1-20 latent variables
  trControl = trainControl("cv", number = 10)   # 10-fold cross-validation
)

pls_model
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 120, 119, 119, 121, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.59536  0.3693244   9.805256
##    2     11.36884  0.4454927   8.049067
##    3     11.58699  0.4561760   8.509318
##    4     11.68694  0.4946321   8.780709
##    5     11.89345  0.4830362   8.874407
##    6     11.53064  0.4965231   8.550423
##    7     11.33409  0.5018929   8.557250
##    8     11.35862  0.4993897   8.649191
##    9     11.30355  0.5051022   8.341151
##   10     11.81290  0.4805622   8.893693
##   11     12.06506  0.4766764   9.035158
##   12     12.33083  0.4575478   9.232594
##   13     12.80795  0.4327702   9.377509
##   14     13.07629  0.4162078   9.627518
##   15     13.18796  0.4137421   9.678683
##   16     13.21733  0.4156452   9.762953
##   17     13.54689  0.4064093   9.996089
##   18     13.79667  0.3938512  10.283401
##   19     14.01036  0.3865552  10.307815
##   20     14.24004  0.3812984  10.440209
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 9.
plot(pls_model)

(d) Predict the response for the test set. What is the test set estimate of R2?

The test set estimate of R2 is 0.5999064, meaning the PLS model explains about 59.99% of the variance in the test-set response (permeability).

postResample((predict(pls_model, testSet)), obs=testSet$permeability)
##       RMSE   Rsquared        MAE 
## 11.3008376  0.5999064  8.8856568

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

It seems PLS still gives the best predictive performance compared to ridge, lasso, and elastic net, whose test-set R2 values are 0.4342637, 0.4552866, and 0.5177496, respectively, each indicating how much of the variance in permeability the model explains. Elastic net has the highest R2 of these three, so it performs best among the penalized models, but it still falls short of the PLS model above. (A side-by-side table of the test-set metrics is sketched after the individual model fits below.)

# Ridge
set.seed(1000)
ridge_model <- train(permeability ~ ., data = trainingSet, method = "glmnet",
  preProcess = c("center", "scale"),
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 0,                 # alpha = 0 gives the ridge penalty
                         lambda = seq(0, 1, 0.1))   # shrinkage values to evaluate
)

postResample((predict(ridge_model, testSet)), obs=testSet$permeability)
##       RMSE   Rsquared        MAE 
## 12.9695390  0.4342637  9.1532720
ridge_model
## glmnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 119, 120, 120, 121, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE     
##   0.0     11.08798  0.4956449  8.135367
##   0.1     11.08798  0.4956449  8.135367
##   0.2     11.08798  0.4956449  8.135367
##   0.3     11.08798  0.4956449  8.135367
##   0.4     11.08798  0.4956449  8.135367
##   0.5     11.08798  0.4956449  8.135367
##   0.6     11.08798  0.4956449  8.135367
##   0.7     11.08798  0.4956449  8.135367
##   0.8     11.08798  0.4956449  8.135367
##   0.9     11.08798  0.4956449  8.135367
##   1.0     11.08798  0.4956449  8.135367
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 1.
plot(ridge_model)

# Lasso
set.seed(1000)
lasso_model <- train(permeability ~ ., data = trainingSet, method = "glmnet",
  preProcess = c("center", "scale"),
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 1,                 # alpha = 1 gives the lasso penalty
                         lambda = seq(0, 1, 0.1))   # shrinkage values to evaluate
)

postResample((predict(lasso_model, testSet)), obs=testSet$permeability)
##       RMSE   Rsquared        MAE 
## 12.6709171  0.4552866  8.9368366
lasso_model
## glmnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 119, 120, 120, 121, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE     
##   0.0     12.45251  0.4522927  9.009786
##   0.1     12.36336  0.4562001  8.969798
##   0.2     11.23368  0.5012798  8.322382
##   0.3     11.23981  0.4895268  8.198963
##   0.4     11.35663  0.4788788  8.149659
##   0.5     11.49720  0.4684971  8.128068
##   0.6     11.42171  0.4739773  8.044443
##   0.7     11.32164  0.4785990  7.955179
##   0.8     11.21430  0.4856152  7.914310
##   0.9     11.14458  0.4908056  7.871912
##   1.0     11.08018  0.4953501  7.833244
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 1.
plot(lasso_model)

# Elastic net
set.seed(1000)
enet_model <- train(permeability ~ ., data = trainingSet, method = "glmnet",
  preProcess = c("center", "scale"),
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 0.5,               # even blend of ridge and lasso penalties
                         lambda = seq(0, 1, 0.1))   # shrinkage values to evaluate
)

postResample((predict(enet_model, testSet)), obs=testSet$permeability)
##       RMSE   Rsquared        MAE 
## 12.2239282  0.5177496  8.9762007
enet_model
## glmnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 119, 120, 120, 121, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE     
##   0.0     12.04129  0.4747986  8.735653
##   0.1     12.04129  0.4747986  8.735653
##   0.2     11.99044  0.4767187  8.710899
##   0.3     11.45212  0.4972083  8.359730
##   0.4     11.21300  0.5044402  8.217114
##   0.5     11.19053  0.4986365  8.178341
##   0.6     11.20679  0.4940095  8.138357
##   0.7     11.24730  0.4898042  8.102803
##   0.8     11.27879  0.4875098  8.057439
##   0.9     11.31990  0.4846443  8.023564
##   1.0     11.33841  0.4836704  7.983602
## 
## Tuning parameter 'alpha' was held constant at a value of 0.5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.5 and lambda = 0.5.
plot(enet_model)
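To make the comparison in part (e) easier to read, the test-set metrics for all four models can be collected into one table. This is a small sketch using the model objects fitted above; model_list and test_metrics are just illustrative names, and the output is not shown here.

# gather the fitted models and compute test-set RMSE, Rsquared, and MAE for each
model_list <- list(PLS = pls_model, Ridge = ridge_model,
                   Lasso = lasso_model, ElasticNet = enet_model)
test_metrics <- sapply(model_list, function(m)
  postResample(predict(m, testSet), obs = testSet$permeability))
round(t(test_metrics), 4)   # one row per model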

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

I would not recommend any of my models to replace the permeability laboratory experiment. Even the best performer, the PLS model, reaches a test-set R2 of only about 0.60, which leaves a large share of the variance in permeability unexplained. The elastic net produced the second-best results, with an R2 of 0.5177496, still underperforming the PLS model.

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

To handle the missing data, I used the preProcess() function with method "knnImpute", which uses K-nearest neighbors (KNN) to impute missing values: each missing cell is filled in with the average of its nearest neighboring observations. I then used predict() to apply the imputation to the ChemicalManufacturingProcess dataset. Note that knnImpute in preProcess also centers and scales all of the variables, including Yield, so the imputed data and the error metrics reported later are on a standardized scale. (A quick check of the data dimensions and missing-value counts follows the code below.)

impute <- preProcess(ChemicalManufacturingProcess, "knnImpute")
bio <- predict(impute, ChemicalManufacturingProcess)
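A quick before-and-after check (a sketch; the counts are not shown here) confirms the shape of the data and that the imputation removed all missing cells:

dim(ChemicalManufacturingProcess)          # 176 runs, 58 columns (Yield + 57 predictors)
sum(is.na(ChemicalManufacturingProcess))   # number of missing cells before imputation
sum(is.na(bio))                            # should be 0 after knnImpute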

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

I again used the createDataPartition function to create a stratified random split, allocating 80% of the data to the training set, and subset the data into training and test set objects. Near-zero-variance columns were first removed with nearZeroVar, and the predictor data were centered and scaled through the preProc argument. After tuning a PLS model, I found that the optimal number of latent variables is 3 and the corresponding resampled estimate of R2 is 0.6267188.

filtered_bio <- bio[, -nearZeroVar(bio)]

set.seed(1000)
splits2 <- createDataPartition(filtered_bio$Yield, p = .80, times = 1, list = FALSE)

trainingSet2 <- filtered_bio[ splits2,]
testSet2 <- filtered_bio[-splits2,]

pls_model2 <- train(Yield ~ ., data = trainingSet2, method = "pls",
  preProc = c("center", "scale"),               # center and scale the predictors
  tuneLength = 20,                              # evaluate 1-20 latent variables
  trControl = trainControl("cv", number = 10)   # 10-fold cross-validation
)

pls_model2
## Partial Least Squares 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 129, 131, 130, 129, 129, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.7269863  0.4814300  0.5933813
##    2     0.8419938  0.5124438  0.5941560
##    3     0.6307537  0.6267188  0.5247899
##    4     0.7514369  0.5625502  0.5723596
##    5     0.9623931  0.5160661  0.6378539
##    6     1.0189097  0.5137361  0.6476890
##    7     1.1941463  0.5045709  0.6885589
##    8     1.2327254  0.5038693  0.7076048
##    9     1.2525801  0.4986959  0.7151345
##   10     1.2701005  0.4884081  0.7220001
##   11     1.2342606  0.4838535  0.7167019
##   12     1.2546725  0.4850881  0.7165534
##   13     1.2715086  0.4811654  0.7206945
##   14     1.3217803  0.4771116  0.7344409
##   15     1.4368269  0.4681273  0.7770038
##   16     1.5539048  0.4618554  0.8108984
##   17     1.6282422  0.4560727  0.8298962
##   18     1.7457725  0.4525802  0.8656009
##   19     1.8100557  0.4533691  0.8852009
##   20     1.9087912  0.4598116  0.9109586
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
plot(pls_model2)

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

For the training set, the resampled RMSE at ncomp = 3 is 0.6307537 with an R2 of 0.6267188. For the test set, the RMSE is 1.6320156 and the R2 is 0.1789398 (both on the standardized Yield scale produced by knnImpute). The model performs considerably worse on the test set than in resampling, particularly in the much lower R2, which suggests it does not generalize well. It may be worth trying a model other than PLS to see whether the RMSE and R2 improve; one alternative is sketched after the test-set results below.

postResample((predict(pls_model2, testSet2)), obs=testSet2$Yield)
##      RMSE  Rsquared       MAE 
## 1.6320156 0.1789398 0.8194090
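As one possibility (a sketch only; it was not run here, so no tuning values or metrics are reported), an elastic net could be tuned on the same training split with the same resampling setup. enet_chem is just an illustrative name:

set.seed(1000)
enet_chem <- train(Yield ~ ., data = trainingSet2, method = "glmnet",
  preProc = c("center", "scale"),
  tuneGrid = expand.grid(alpha = seq(0, 1, 0.25),    # blend of ridge and lasso penalties
                         lambda = seq(0, 1, 0.1)),   # shrinkage strength
  trControl = trainControl("cv", number = 10)
)
postResample(predict(enet_chem, testSet2), obs = testSet2$Yield)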

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

The twenty most important predictors in the model, in order, are ManufacturingProcess32, ManufacturingProcess36, ManufacturingProcess13, ManufacturingProcess09, ManufacturingProcess17, BiologicalMaterial02, ManufacturingProcess06, BiologicalMaterial06, ManufacturingProcess33, BiologicalMaterial03, BiologicalMaterial08, ManufacturingProcess11, BiologicalMaterial12, BiologicalMaterial04, BiologicalMaterial01, BiologicalMaterial11, ManufacturingProcess12, ManufacturingProcess28, ManufacturingProcess04, and ManufacturingProcess34. The process predictors dominate the list, accounting for 12 of the top 20 and all of the top five. (A small tally sketch follows the importance plot below.)

imp <- varImp(pls_model2)
imp
## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess36   90.38
## ManufacturingProcess13   74.00
## ManufacturingProcess09   72.21
## ManufacturingProcess17   67.09
## BiologicalMaterial02     63.07
## ManufacturingProcess06   62.76
## BiologicalMaterial06     60.78
## ManufacturingProcess33   60.04
## BiologicalMaterial03     53.91
## BiologicalMaterial08     53.63
## ManufacturingProcess11   53.38
## BiologicalMaterial12     50.59
## BiologicalMaterial04     50.28
## BiologicalMaterial01     48.81
## BiologicalMaterial11     48.13
## ManufacturingProcess12   48.08
## ManufacturingProcess28   43.18
## ManufacturingProcess04   40.11
## ManufacturingProcess34   39.22
ggplot(varImp(pls_model2), top = 15)
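To back up the claim that process predictors dominate, a small tally (a sketch; imp_df and top20 are illustrative names, output not shown) counts each type among the 20 highest-ranked predictors:

imp_df <- varImp(pls_model2)$importance
imp_df$Predictor <- rownames(imp_df)
top20 <- head(imp_df[order(-imp_df$Overall), ], 20)
# label each predictor as Process or Biological and count them
table(ifelse(grepl("^Manufacturing", top20$Predictor), "Process", "Biological"))
# expected from the table above: 12 process, 8 biological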

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

The correlation results show that most of the relationships between the top predictors and the response are moderate or weak. ManufacturingProcess32 (Corr = 0.6343), ManufacturingProcess09 (Corr = 0.4982), BiologicalMaterial02 (Corr = 0.5049), BiologicalMaterial06 (Corr = 0.4990), ManufacturingProcess06 (Corr = 0.4308), and ManufacturingProcess33 (Corr = 0.4339) are all moderately positively correlated with Yield. ManufacturingProcess36 (Corr = -0.5846), ManufacturingProcess13 (Corr = -0.4872), and ManufacturingProcess17 (Corr = -0.4020) are moderately negatively correlated. Lastly, ManufacturingProcess11 (Corr = 0.3757) has a fairly weak positive relationship with Yield. A positive association generally means that increasing the value of the predictor would tend to increase the yield, while a negative association means that decreasing the value of the predictor would tend to increase it. However, none of these relationships are very strong, so this will not hold in every case.

This information could help improve yield in future runs of the manufacturing process, since knowing which predictors are most strongly correlated with product yield points to where production can be adjusted. Because only the process predictors can be changed, the manufacturer may want to raise the settings of the positively correlated process variables and lower those with negative correlations, then confirm the effect in follow-up runs, since these correlations alone do not establish causation.

top_predictors <- (c("ManufacturingProcess32", "ManufacturingProcess36","ManufacturingProcess13", "ManufacturingProcess09", "ManufacturingProcess17", "ManufacturingProcess06", "BiologicalMaterial02", "BiologicalMaterial06","ManufacturingProcess11", "ManufacturingProcess33"))

cor_results <- cor(trainingSet2[, top_predictors], trainingSet2$Yield)
cor_results
##                              [,1]
## ManufacturingProcess32  0.6343301
## ManufacturingProcess36 -0.5846408
## ManufacturingProcess13 -0.4871624
## ManufacturingProcess09  0.4982479
## ManufacturingProcess17 -0.4019911
## ManufacturingProcess06  0.4308473
## BiologicalMaterial02    0.5049309
## BiologicalMaterial06    0.4990380
## ManufacturingProcess11  0.3756784
## ManufacturingProcess33  0.4338702
plot_data <- trainingSet2[, c(top_predictors, "Yield")]

plot_data_long <- pivot_longer(plot_data, cols = all_of(top_predictors),
                               names_to = "Predictor", values_to = "Value")
ggplot(plot_data_long, aes(x = Value, y = Yield)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue", se = FALSE) +
  facet_wrap(~ Predictor, scales = "free_x") +  
  labs(title = "Relationship Between Top Predictors and Yield",
       x = "Predictor Value",
       y = "Yield")
## `geom_smooth()` using formula = 'y ~ x'