In Kuhn and Johnson, do problems 6.2 and 6.3.

6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
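As a quick sanity check (a minimal sketch; the expected dimensions come from the problem statement above), the shapes of the two objects can be confirmed after loading:
dim(fingerprints)   # expected: 165 compounds x 1,107 binary predictors
dim(permeability)   # expected: a 165 x 1 matrix of permeability values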

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
library(caret)
library(ggplot2)
nzvfp <- nearZeroVar(fingerprints)
nonzvfp <- fingerprints[,-nzvfp]
ncol(nonzvfp)
## [1] 388
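Optionally, the criteria behind the filter can be inspected with caret's saveMetrics option (a sketch; freqCut = 95/5 and uniqueCut = 10 are caret's documented defaults):
# Frequency ratio and percent-unique metrics that drive the near-zero-variance filter
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics)
table(nzv_metrics$nzv)   # TRUE = flagged as near-zero variance and removed above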
(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R squared?
# 80/20 train/test split of the filtered predictors and the permeability response
split <- createDataPartition(permeability, p = 0.8, list = FALSE, times = 1)

Xtrain.data  <- nonzvfp[split, ]      
Xtest.data   <- nonzvfp[-split, ]      

Ytrain.data  <- permeability[split]     
Ytest.data   <- permeability[-split]  

# Tune a PLS model with 10-fold cross-validation, centering and scaling the predictors
ctrl <- trainControl(method = "cv", number = 10)
plsmod <- train(x = Xtrain.data, y = Ytrain.data,
                method = "pls",
                tuneLength = 15,         
                trControl = ctrl,       
                preProc = c("center", "scale"))
print(plsmod)
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 121, 119, 120, 121, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     12.59309  0.3188302  9.598363
##    2     11.36710  0.4595255  7.749247
##    3     11.07388  0.4746904  8.057209
##    4     11.16464  0.4785656  8.382748
##    5     11.13931  0.4855600  8.043630
##    6     11.02535  0.5135675  7.981960
##    7     11.06765  0.5073874  8.127469
##    8     10.98450  0.5229158  8.197463
##    9     11.23015  0.5068022  8.431096
##   10     11.49945  0.4834073  8.591727
##   11     11.83014  0.4725379  8.710300
##   12     11.89082  0.4662821  8.809819
##   13     12.04049  0.4616730  8.933141
##   14     12.42299  0.4413192  9.361306
##   15     12.56859  0.4383558  9.479632
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.
plot(plsmod)
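The chosen number of latent variables and its resampled R-squared can also be extracted programmatically from the train object rather than read off the table above (a sketch):
# Optimal number of PLS components selected by RMSE
plsmod$bestTune
# Resampled RMSE and R-squared at that setting
subset(plsmod$results, ncomp == plsmod$bestTune$ncomp)[, c("ncomp", "RMSE", "Rsquared")]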

(d) Predict the response for the test set. What is the test set estimate for R squared?

test_predictions <- predict(plsmod, newdata = Xtest.data)
test_r_squared <- cor(Ytest.data, test_predictions)^2

print(paste("Test set R-squared:", round(test_r_squared, 4))) 
## [1] "Test set R-squared: 0.5313"
(e) Try building other models discussed in this chapter. Do any have better predictive performance?
# 10-fold cross-validation control
ctrl <- trainControl(method = "cv", number = 10)

# Ridge regression over a grid of lambda values, centering and scaling the predictors
ridgeGrid <- data.frame(.lambda = seq(0, 1, by = 0.1))
ridgeRegFit <- train(Xtrain.data, Ytrain.data, 
                     method = "ridge", 
                     tuneGrid = ridgeGrid, 
                     trControl = ctrl, 
                     preProc = c("center", "scale", "knnImpute"))
print(ridgeRegFit)
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388), nearest neighbor imputation (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 118, 120, 121, 120, 121, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE          Rsquared   MAE         
##   0.0     1.125874e+18  0.2800089  6.823853e+17
##   0.1     1.212636e+01  0.4170406  8.568913e+00
##   0.2     1.196254e+01  0.4449319  8.418901e+00
##   0.3     1.214782e+01  0.4569978  8.603576e+00
##   0.4     1.249251e+01  0.4629229  8.967930e+00
##   0.5     1.293384e+01  0.4661798  9.399849e+00
##   0.6     1.344670e+01  0.4681095  9.890764e+00
##   0.7     1.401751e+01  0.4691712  1.046584e+01
##   0.8     1.463229e+01  0.4696932  1.104786e+01
##   0.9     1.528873e+01  0.4699563  1.164365e+01
##   1.0     1.597278e+01  0.4700258  1.223602e+01
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.2.
ctrl <- trainControl(method = "cv", number = 10)

# Lasso via glmnet (alpha = 1) over a grid of lambda values, centering and scaling the predictors
lassoGrid <- expand.grid(.alpha = 1, .lambda = seq(0, 1, by = 0.1))
lassoRegFit <- train(Xtrain.data, Ytrain.data, 
                     method = "glmnet", 
                     tuneGrid = lassoGrid, 
                     trControl = ctrl, 
                     preProc = c("center", "scale", "knnImpute"))
print(lassoRegFit)
## glmnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388), nearest neighbor imputation (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 118, 119, 120, 120, 118, 121, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE     
##   0.0     12.24832  0.4611300  8.866728
##   0.1     12.23527  0.4614725  8.833927
##   0.2     11.71666  0.4965321  8.361525
##   0.3     11.62843  0.5125720  8.362615
##   0.4     11.72340  0.5054104  8.436689
##   0.5     11.76697  0.5000980  8.443677
##   0.6     11.88020  0.4889762  8.497880
##   0.7     12.02198  0.4758262  8.564787
##   0.8     12.07811  0.4701341  8.584719
##   0.9     12.05770  0.4732123  8.560465
##   1.0     12.07119  0.4743039  8.563313
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.3.
# Compute test set R-squared for each fitted model
models_list <- list(PLS = plsmod, Ridge = ridgeRegFit, Lasso = lassoRegFit)
r_squared_results <- lapply(models_list, function(model) {
  cor(Ytest.data, predict(model, newdata = Xtest.data))^2
})

print("Test set R-squared values:")
## [1] "Test set R-squared values:"
print(r_squared_results)
## $PLS
## [1] 0.5312741
## 
## $Ridge
## [1] 0.4884316
## 
## $Lasso
## [1] 0.4750754
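Since all three models were tuned with 10-fold cross-validation, their resampled training performance can also be compared side by side with caret's resamples() (a sketch; the folds were regenerated for each model, so this is an approximate comparison of the resampling distributions, not the test set):
# Compare cross-validated performance of the three fitted models
resamps <- resamples(list(PLS = plsmod, Ridge = ridgeRegFit, Lasso = lassoRegFit))
summary(resamps)
bwplot(resamps, metric = "Rsquared")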
(f) Would you recommend any of your models to replace the permeability laboratory experiment?

The best-performing model is PLS, which has the highest test set R-squared (about 0.531) and the lowest RMSE of the models compared, so it would be the natural candidate to recommend. Section 1.4 notes that the laboratory assays "are effective at quantifying a compound's permeability, but the assay is expensive labor intensive," so a sufficiently accurate predictive model could be suggested as a replacement. However, even though PLS outperformed the other models, an R-squared of roughly 0.53 explains only about half of the variance in permeability, which is not sufficient; I would not recommend replacing the permeability laboratory experiment with any of these models.

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
library(caret)
library(RANN)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
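Note that in current versions of the AppliedPredictiveModeling package the response and predictors are bundled in a single data frame, ChemicalManufacturingProcess, rather than shipped as separate processPredictors and yield objects; a quick check (a sketch) confirms its shape:
# Yield plus the 57 predictors for the 176 manufacturing runs
dim(ChemicalManufacturingProcess)          # expected: 176 x 58
names(ChemicalManufacturingProcess)[1:4]   # Yield, then the biological predictors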

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
# processPredictors is not loaded as a separate object (it initially came up as NULL),
# so pull the response and predictors out of the ChemicalManufacturingProcess data frame
yield <- ChemicalManufacturingProcess$Yield
processPredictors <- ChemicalManufacturingProcess[, -1]

# Impute using knn
preProc <- preProcess(processPredictors, method = "knnImpute")
processPredictors_imputed <- predict(preProc, processPredictors)

print(sum(is.na(processPredictors_imputed)))
## [1] 0
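For context, the extent of the missingness before imputation can be summarized as well (a sketch):
# How many cells were missing before imputation, and which columns had the most
sum(is.na(processPredictors))
head(sort(colSums(is.na(processPredictors)), decreasing = TRUE))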
(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
# Identify and remove near-zero-variance predictors before splitting
nzv_columns <- nearZeroVar(processPredictors_imputed)
Predictors_cleaned <- processPredictors_imputed[, -nzv_columns]

data_combined <- data.frame(Yield = yield, Predictors_cleaned)

trainIndex <- createDataPartition(data_combined$Yield, p = .8, 
                                  list = FALSE, 
                                  times = 1)
trainData <- data_combined[trainIndex, ]
testData <- data_combined[-trainIndex, ]

# Prepare Ridge Regression
x_train <- as.matrix(trainData[, -1])
y_train <- trainData$Yield

x_test <- as.matrix(testData[, -1])
y_test <- testData$Yield

ridge_model <- glmnet(x_train, y_train, alpha = 0)

# Cross-validate to choose the regularization penalty lambda
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
best_lambda <- cv_ridge$lambda.min

print(best_lambda)
## [1] 1.039974
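The optimal value of the performance metric itself can be read from the cv.glmnet object (a sketch; for a Gaussian response cv.glmnet's default loss is mean squared error, so its square root is the cross-validated RMSE at the selected lambda):
# Cross-validated RMSE at lambda.min
cv_rmse <- sqrt(cv_ridge$cvm[cv_ridge$lambda == best_lambda])
cv_rmse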
(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

The training RMSE is lower than the test RMSE, as expected since the model was fit to the training data; here the gap is modest (roughly 1.02 versus 1.17).

y_train_pred <- predict(cv_ridge, s = best_lambda, newx = x_train)

y_test_pred <- predict(cv_ridge, s = best_lambda, newx = x_test)

rmse_train <- sqrt(mean((y_train - y_train_pred)^2))
print(paste("Training RMSE:", rmse_train))
## [1] "Training RMSE: 1.02169306158907"
rmse_test <- sqrt(mean((y_test - y_test_pred)^2))
print(paste("Test RMSE:", rmse_test))
## [1] "Test RMSE: 1.16522526334198"
(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

Process predictors dominate the list. This is determined by extracting the ridge coefficients, sorting them by absolute value, and comparing the summed absolute coefficients of the biological and process predictor groups.

# Extract ridge coefficients at the cross-validated lambda
ridge_coef <- coef(ridge_model, s = best_lambda)
importance <- as.data.frame(as.matrix(ridge_coef))
importance <- importance[-1, , drop = FALSE]
importance <- importance[order(-abs(importance[, 1])), , drop = FALSE] 

biological_importance <- importance[grep("BiologicalMaterial", rownames(importance)), , drop = FALSE]
process_importance <- importance[grep("ManufacturingProcess", rownames(importance)), , drop = FALSE]

biological_importance_text <- paste(rownames(biological_importance), 
                                     round(biological_importance[, 1], 3), 
                                     sep = ": ")
process_importance_text <- paste(rownames(process_importance), 
                                  round(process_importance[, 1], 3), 
                                  sep = ": ")

dominance <- sum(abs(biological_importance)) > sum(abs(process_importance))
if (dominance) {
    cat("\nBiological predictors dominate the model.\n")
} else {
    cat("\nProcess predictors dominate the model.\n")
}
## 
## Process predictors dominate the model.
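Listing the largest absolute coefficients makes the same point predictor by predictor (a sketch using the importance table built above):
# Top ten predictors by absolute ridge coefficient
head(importance, 10)
# How many of the top ten are process predictors
sum(grepl("ManufacturingProcess", rownames(importance)[1:10]))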
(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

The importance ranking above already identifies the top predictors; the code below loops over them, plots each against Yield, and adds a linear regression line to show the strength of the relationship (a numeric correlation check follows the plot output). The visualizations show that the top predictors are ManufacturingProcess26 (concentrated around zero on the x-axis), ManufacturingProcess20 (concentrated around zero on the x-axis), ManufacturingProcess32 (positive linear relationship), ManufacturingProcess29 (concentrated around zero on the x-axis), and ManufacturingProcess13 (negative linear relationship). Three of these variables show little spread and probably carry limited practical importance, so I included the top seven, which adds ManufacturingProcess36 and ManufacturingProcess06. This information could help improve yield in future runs by concentrating limited time, cost, and personnel on these variables (or perhaps the top ten), and more generally on the manufacturing process predictors rather than the biological materials, since only the process settings can be changed.

top_predictors <- rownames(importance)[1:7] 
for (predictor in top_predictors) {
    if (predictor %in% colnames(data_combined)) {
        plot <- ggplot(data_combined, aes(x = .data[[predictor]], y = Yield)) +
            geom_point() +
            geom_smooth(method = "lm", se = FALSE, color = "blue") +  
            labs(title = paste("Relationship between", predictor, "and Yield"),
                 x = predictor,
                 y = "Yield") +
            theme_minimal()
        
        print(plot)  
        ggsave(paste0("relationship_", predictor, ".png"), plot = plot) 
    } else {
        cat(paste("Warning: Predictor", predictor, "not found in data.\n"))
    }
}
## (One scatter plot with a fitted linear trend line is produced and saved as a
## PNG file for each of the seven top predictors.)
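As a numeric complement to the plots, the simple correlation of each top predictor with Yield summarizes the direction and strength of the relationships described above (a sketch):
# Correlation of each top predictor with Yield (sign indicates direction of the relationship)
sapply(top_predictors, function(p) cor(data_combined[[p]], data_combined$Yield))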