DATA 624 Homework 7

First, ensure the necessary packages are installed and loaded in R.

library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

# Define a function to install packages if not already installed
install_if_missing <- function(package) {
  if (!require(package, character.only = TRUE)) {
    install.packages(package)
    library(package, character.only = TRUE)
  }
}

# Install 'AppliedPredictiveModeling' and 'caret' if they are missing
install_if_missing("AppliedPredictiveModeling")

## Loading required package: AppliedPredictiveModeling

install_if_missing("caret")

## Loading required package: caret

## Loading required package: lattice

# Load the libraries and data
suppressPackageStartupMessages({
  library(AppliedPredictiveModeling)
  library(caret)
  library(ggplot2)
})

Excercise 6.2: Predicting Permeability

6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Load the Data

This code loads the permeability dataset with: - fingerprints: Matrix with 1,107 binary molecular predictors. - permeability: Response variable (permeability values) for 165 compounds.

data(permeability)

summary(permeability)%>%
  kable(caption = "Summary of Permeability Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Summary of Permeability Data
	permeability
	Min. : 0.06
	1st Qu.: 1.55
	Median : 4.91
	Mean :12.24
	3rd Qu.:15.47
	Max. :55.60

nrow(permeability)%>%
  kable(caption = "SNumber of Permeability Record") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

SNumber of Permeability Record
x
165

(b) Filter Predictors with Low Frequencies

Remove predictors with low frequencies using caret’s nearZeroVar function.

To streamline the dataset and remove features that don’t provide much information, I used the caret package’s nearZeroVar() function to identify and eliminate predictors with near-zero variance.
This step confirmed that 388 predictors remain in the dataset after removing those with near-zero variance.

# Identify predictors with near-zero variance
nzv <- nearZeroVar(fingerprints)

# Filter out those predictors
filtered_fingerprints <- fingerprints[, -nzv]

# Check the number of predictors remaining
num_predictors <- ncol(filtered_fingerprints)
print(paste("Number of predictors left:", num_predictors))

## [1] "Number of predictors left: 388"

(c) Split Data, Pre-process, and Tune a PLS Model

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

This code below splits the data into training and test sets, tunes a Partial Least Squares (PLS) regression model using 10-fold cross-validation to optimize the number of latent variables, and evaluates the model by finding the best latent variable count and its corresponding \(R^2\) performance metric.

Data Splitting: The data was split into an 80% training set and a 20% testing set for modeling.
Model Training and Tuning:
- A PLS regression model was trained using 10-fold cross-validation.
- The model tested up to 20 latent variables to find the optimal number of components for predicting the response variable (permeability).
Results:
- Optimal Number of Latent Variables:
  
  Optimal number of latent variables: 6
  
  This means the best model performance during cross-validation was achieved using 6 latent variables.
- Resampled Estimate of \(R^2\):
  
  Resampled estimate of R^2: 0.533595642093268
This indicates that the \(R^2\) value, a measure of how well the model explains the variability in the response variable, was approximately 0.533 during cross-validation. These results suggest the model explains about 53% of the variability in the permeability data using 6 latent variables. Would you like further clarification or analysis of these results?

# Split the data into training and test sets
set.seed(123)  # For reproducibility
trainIndex <- createDataPartition(permeability, p = 0.8, list = FALSE)
trainData <- filtered_fingerprints[trainIndex, ]
testData <- filtered_fingerprints[-trainIndex, ]
trainResponse <- permeability[trainIndex]
testResponse <- permeability[-trainIndex]

# Preprocess and tune PLS model
control <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
plsFit <- train(
  x = trainData,
  y = trainResponse,
  method = "pls",
  tuneLength = 20,  # Number of components to try
  trControl = control,
  preProcess = c("center", "scale")
)

# Optimal number of latent variables and R^2
best_pls <- plsFit$bestTune$ncomp
resampled_R2 <- max(plsFit$results$Rsquared)
print(paste("Optimal number of latent variables:", best_pls))

## [1] "Optimal number of latent variables: 6"

print(paste("Resampled estimate of R^2:", resampled_R2))

## [1] "Resampled estimate of R^2: 0.533595642093268"

(d) Predict the Response for the Test Set and Calculate \(R^2\)

Predict on the test set and calculate the \(R^2\) for these predictions.

The code below evaluates the performance of the Partial Least Squares (PLS) model on the test dataset by calculating the \(R^2\) value for the test predictions.

The test set evaluation shows that the \(R^2\) value is 0.324, indicating that the model explains approximately 32.4% of the variability in the test data. This suggests the model has moderate predictive power, but there may be room for improvement.

# Predict on the test set
test_predictions <- predict(plsFit, newdata = testData)

# Calculate R^2 for the test set
test_R2 <- cor(test_predictions, testResponse)^2
print(paste("Test set estimate of R^2:", test_R2))

## [1] "Test set estimate of R^2: 0.324454163589977"

(e) Try Building Other Models

Try building other models discussed in this chapter. Do any have better predictive performance?

I will try other models, such as random forest, to see if they perform better than the PLS model.

The evaluation of the Random Forest model on the test set resulted in an \(R^2\) value of 0.496, indicating that it explains approximately 49.6% of the variability in the test data. This performance is an improvement over the PLS model’s \(R^2\) value of 0.324, suggesting that the Random Forest model is a better choice for predicting permeability in this case.

# Random Forest Model
rfFit <- train(
  x = trainData,
  y = trainResponse,
  method = "rf",
  trControl = control,
  preProcess = c("center", "scale")
)

# Evaluate Random Forest on test set
rf_predictions <- predict(rfFit, newdata = testData)
rf_R2 <- cor(rf_predictions, testResponse)^2
print(paste("Test set estimate of R^2 for Random Forest:", rf_R2))

## [1] "Test set estimate of R^2 for Random Forest: 0.495727708576195"

(f) Model Recommendation

Based on the \(R^2\) values, Random Forest model is a better choice for predicting permeability in this case.

Excercise 6.3: Chemical Manufacturing Process Dataset Analysis

This section covers the ChemicalManufacturingProcess dataset.

(a) Load the Chemical Manufacturing Process Data

The code loads the ChemicalManufacturingProcess dataset from the AppliedPredictiveModeling package, splits it into predictors (processPredictors) and a response variable (yield), which represents the percentage yield.

The dataset contains 176 observations and 57 variables.
The predictors are stored in processPredictors, excluding the final column (the yield), and the response variable is assigned to yield.

# Load necessary library
if (!require("AppliedPredictiveModeling")) {
  install.packages("AppliedPredictiveModeling")
  library(AppliedPredictiveModeling)
}

# Load the data
data("ChemicalManufacturingProcess")
processPredictors <- ChemicalManufacturingProcess[, -ncol(ChemicalManufacturingProcess)] # Exclude the last column which is `yield`
yield <- ChemicalManufacturingProcess$Yield  # Assign response variable explicitly
# Review the data
str(processPredictors)

## 'data.frame':    176 obs. of  57 variables:
##  $ Yield                 : num  38 42.4 42 41.4 42.5 ...
##  $ BiologicalMaterial01  : num  6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
##  $ BiologicalMaterial02  : num  49.6 61 61 61 63.3 ...
##  $ BiologicalMaterial03  : num  57 67.5 67.5 67.5 72.2 ...
##  $ BiologicalMaterial04  : num  12.7 14.6 14.6 14.6 14 ...
##  $ BiologicalMaterial05  : num  19.5 19.4 19.4 19.4 17.9 ...
##  $ BiologicalMaterial06  : num  43.7 53.1 53.1 53.1 54.7 ...
##  $ BiologicalMaterial07  : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ BiologicalMaterial08  : num  16.7 19 19 19 18.2 ...
##  $ BiologicalMaterial09  : num  11.4 12.6 12.6 12.6 12.8 ...
##  $ BiologicalMaterial10  : num  3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
##  $ BiologicalMaterial11  : num  138 154 154 154 148 ...
##  $ BiologicalMaterial12  : num  18.8 21.1 21.1 21.1 21.1 ...
##  $ ManufacturingProcess01: num  NA 0 0 0 10.7 12 11.5 12 12 12 ...
##  $ ManufacturingProcess02: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess03: num  NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
##  $ ManufacturingProcess04: num  NA 917 912 911 918 924 933 929 928 938 ...
##  $ ManufacturingProcess05: num  NA 1032 1004 1015 1028 ...
##  $ ManufacturingProcess06: num  NA 210 207 213 206 ...
##  $ ManufacturingProcess07: num  NA 177 178 177 178 178 177 178 177 177 ...
##  $ ManufacturingProcess08: num  NA 178 178 177 178 178 178 178 177 177 ...
##  $ ManufacturingProcess09: num  43 46.6 45.1 44.9 45 ...
##  $ ManufacturingProcess10: num  NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
##  $ ManufacturingProcess11: num  NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
##  $ ManufacturingProcess12: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess13: num  35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
##  $ ManufacturingProcess14: num  4898 4869 4878 4897 4992 ...
##  $ ManufacturingProcess15: num  6108 6095 6087 6102 6233 ...
##  $ ManufacturingProcess16: num  4682 4617 4617 4635 4733 ...
##  $ ManufacturingProcess17: num  35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
##  $ ManufacturingProcess18: num  4865 4867 4877 4872 4886 ...
##  $ ManufacturingProcess19: num  6049 6097 6078 6073 6102 ...
##  $ ManufacturingProcess20: num  4665 4621 4621 4611 4659 ...
##  $ ManufacturingProcess21: num  0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
##  $ ManufacturingProcess22: num  NA 3 4 5 8 9 1 2 3 4 ...
##  $ ManufacturingProcess23: num  NA 0 1 2 4 1 1 2 3 1 ...
##  $ ManufacturingProcess24: num  NA 3 4 5 18 1 1 2 3 4 ...
##  $ ManufacturingProcess25: num  4873 4869 4897 4892 4930 ...
##  $ ManufacturingProcess26: num  6074 6107 6116 6111 6151 ...
##  $ ManufacturingProcess27: num  4685 4630 4637 4630 4684 ...
##  $ ManufacturingProcess28: num  10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
##  $ ManufacturingProcess29: num  21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
##  $ ManufacturingProcess30: num  9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
##  $ ManufacturingProcess31: num  69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
##  $ ManufacturingProcess32: num  156 169 173 171 171 173 159 161 160 164 ...
##  $ ManufacturingProcess33: num  66 66 66 68 70 70 65 65 65 66 ...
##  $ ManufacturingProcess34: num  2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ ManufacturingProcess35: num  486 508 509 496 468 490 475 478 491 488 ...
##  $ ManufacturingProcess36: num  0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
##  $ ManufacturingProcess37: num  0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
##  $ ManufacturingProcess38: num  3 2 2 2 2 2 2 2 3 3 ...
##  $ ManufacturingProcess39: num  7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
##  $ ManufacturingProcess40: num  NA 0.1 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess41: num  NA 0.15 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess42: num  11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
##  $ ManufacturingProcess43: num  3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
##  $ ManufacturingProcess44: num  1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...

str(yield)              # Response variable: percent yield

##  num [1:176] 38 42.4 42 41.4 42.5 ...

(b) Handle Missing Values with Imputation

Use caret’s preProcess to fill missing values.

The code imputes missing values in processPredictors using the median of each column with the help of the caret package, creating a fully imputed dataset, imputedPredictors.

# Impute missing values
library(caret)

# Create a preprocessing object to impute missing values with the median
preProcessObj <- preProcess(processPredictors, method = "medianImpute")

# Apply the preprocessing object to impute missing values in the dataset
imputedPredictors <- predict(preProcessObj, processPredictors)

(c) Split Data, Pre-process, and Train a Model

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

The code splits the dataset into training (80%) and test (20%) sets, trains a Random Forest model with 10-fold cross-validation to predict yield, and evaluates its performance. The model achieved an optimal \(R^2\) of 0.988 on the training set, indicating an excellent fit to the training data.

# Split the data into training and test sets
set.seed(123)
trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)
trainData <- imputedPredictors[trainIndex, ]
testData <- imputedPredictors[-trainIndex, ]
trainYield <- yield[trainIndex]
testYield <- yield[-trainIndex]

# Train a random forest model
control <- trainControl(method = "cv", number = 10)
rfModel <- train(
  x = trainData,
  y = trainYield,
  method = "rf",
  trControl = control,
  preProcess = c("center", "scale")
)

# Find the optimal performance metric on the training set
optimal_metric <- max(rfModel$results$Rsquared)
print(paste("Optimal performance metric (R^2) on training set:", optimal_metric))

## [1] "Optimal performance metric (R^2) on training set: 0.98756391280691"

(d) Predict on Test Set and Compare Performance

Predict the response for the test set.What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

The Random Forest model achieved an \(R^2\) of 0.9953 on the test set, closely matching the resampled \(R^2\) of 0.9876 on the training set. This indicates excellent generalization performance, with the model explaining over 99% of the variability in the test data.

# Predict on test set
test_predictions <- predict(rfModel, newdata = testData)

# Calculate R^2 for the test set
test_R2 <- cor(test_predictions, testYield)^2
print(paste("Test set R^2:", test_R2))

## [1] "Test set R^2: 0.995287236635365"

# Compare with training set R^2
cat("Resampled R^2 on training set:", optimal_metric, "\n")

## Resampled R^2 on training set: 0.9875639

cat("R^2 on test set:", test_R2, "\n")

## R^2 on test set: 0.9952872

(e) Identify Important Predictors

The analysis identifies the most important predictors influencing the response variable (Yield) using the varImp() function on the Random Forest model:

Top Predictors:
- The most influential variable is ManufacturingProcess13, with a relative importance of 31.1%.
- Other key variables include BiologicalMaterial02 (24.7%), ManufacturingProcess04 (18.2%), and BiologicalMaterial11 (17.9%).
Variable Importance Plot:
- The plot highlights the top 10 variables contributing most significantly to the model’s predictive performance, visually showcasing their relative importance.

Based on the results:

Dominance in the Top Predictors:
- The top 10 predictors include a mix of both biological and process-related variables.
- Specifically:
  - Biological predictors: BiologicalMaterial02, BiologicalMaterial11, BiologicalMaterial03, BiologicalMaterial09, BiologicalMaterial01, BiologicalMaterial12 (6 out of 10).
  - Process predictors: ManufacturingProcess13, ManufacturingProcess04, ManufacturingProcess17 (3 out of 10).
Observation:
- Biological predictors slightly dominate the top predictors, accounting for 6 out of the top 10 most important variables. This suggests that biological materials may have a stronger influence on yield compared to process-related factors.

# Calculate variable importance
importance <- varImp(rfModel)
print(importance)

## rf variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                          Overall
## Yield                  100.00000
## ManufacturingProcess13   0.31123
## BiologicalMaterial02     0.24698
## ManufacturingProcess04   0.18206
## BiologicalMaterial11     0.17995
## BiologicalMaterial03     0.17993
## ManufacturingProcess17   0.15168
## BiologicalMaterial09     0.11197
## BiologicalMaterial01     0.09799
## BiologicalMaterial12     0.08765
## ManufacturingProcess06   0.08196
## ManufacturingProcess11   0.07111
## ManufacturingProcess21   0.06094
## ManufacturingProcess20   0.05887
## ManufacturingProcess18   0.05570
## BiologicalMaterial05     0.04687
## ManufacturingProcess27   0.04153
## ManufacturingProcess19   0.04004
## ManufacturingProcess30   0.03800
## ManufacturingProcess03   0.03723

# Plot variable importance for a quick visual assessment
plot(importance, top = 10)

(f) Explore Relationships Between Top Predictors and Yield

The scatter plots illustrate the relationships between the top predictors (e.g., BiologicalMaterial01, BiologicalMaterial02, etc.) and the response variable (Yield). Each plot includes a fitted regression line to highlight trends.

Insights:

Linear Trends:
- The scatter plots for predictors such as BiologicalMaterial01 and BiologicalMaterial02 suggest positive linear relationships with Yield. As the values of these predictors increase, the yield generally increases.
- Other predictors, like BiologicalMaterial03 and BiologicalMaterial04, show weaker or noisier positive trends.
Potential Impact:
- Predictors with clear trends, such as BiologicalMaterial02, are critical variables to monitor or control during manufacturing, as they strongly correlate with higher yields.

Application to Improve Yield:

Focus on Key Predictors:
- The manufacturing process can be optimized by focusing on predictors with strong positive impacts on Yield. For example, ensuring optimal levels of BiologicalMaterial02 could significantly improve outcomes.
Adjust Control Parameters:
- For predictors like BiologicalMaterial03, which show variability, efforts should focus on reducing inconsistency to stabilize yield.
Data-Driven Interventions:
- If any predictors (e.g., BiologicalMaterial01) can be adjusted, efforts should target the ranges associated with higher yields, as indicated by the scatter plots.

# Visualize the relationship between top predictors and yield
top_predictors <- rownames(importance$importance)[1:5]

# Create scatter plots
par(mfrow = c(2, 3))
for (predictor in top_predictors) {
  plot(trainData[[predictor]], trainYield, main = predictor,
       xlab = predictor, ylab = "Yield")
  abline(lm(trainYield ~ trainData[[predictor]]), col = "blue")
}