Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
Here we load the data. Note, permeability is a measure of a molecule’s ability to cross a membrane.
From Section 1.4, there are three distinguishing features of the data that make it challenging to model with linear regression. First, permeability is highly skewed, with the vast majority of compounds having little or no permeability, whereas ideally we would see a normally distributed target variable. Second, the predictors are sparse, with only 15.5% of the fingerprint entries present, which gives us a dimensionality problem. Third, there is a high degree of collinearity among the predictors.
Here we calculate the skewness of permeability, which is 1.396559:
# Load required packages
library(AppliedPredictiveModeling)
library(e1071)
library(knitr)
library(caret)
library(corrplot)
library(ggplot2)
library(gridExtra)
library(pls)
library(elasticnet)
library(RANN)
# Load data
data(permeability)
# Calculate skewness of our target variable
kable(data.frame(Skewness = skewness(permeability)), align="c")
| Skewness |
|:--------:|
| 1.396559 |
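As a quick check on the sparsity claim from Section 1.4, the fraction of fingerprint entries that are present can be computed directly (a small sketch using the fingerprints matrix loaded above); it comes out to roughly 15.5%:
# Fraction of fingerprint entries equal to 1 ("substructure present")
mean(fingerprints == 1)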
Next we transform the permeability vector using Box-Cox. This didn’t work the way I first expected: BoxCoxTrans returns a transformation object (a list of six elements) rather than the transformed values. However, its estimated lambda is zero, so we know we can simply take the log of permeability when it comes time to model.
# Perform box-cox transformation
permbc <- BoxCoxTrans(permeability)
kable(data.frame(Lambda = permbc$lambda), align="c")
| Lambda |
|:------:|
| 0 |
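As a sanity check (a sketch; the predict method for BoxCoxTrans objects applies the fitted transformation), we can confirm that a lambda of zero reproduces the natural log:
# With lambda = 0, the fitted Box-Cox transform reduces to the natural log
transformed <- predict(permbc, as.vector(permeability))
all.equal(transformed, as.vector(log(permeability)))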
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
Predictors identified by nearZeroVar have too few non-zero values and should be removed.
In our case we have 719 problematic predictors and after removing them we have 388 predictors left.
# Count number of problematic predictors
#length(nearZeroVar(fingerprints))
# Remove problematic predictors
fp <- fingerprints[, -nearZeroVar(fingerprints)]
# Display number of remaining predictors
kable(data.frame(Predictors = ncol(fp)), align="c")
| Predictors |
|:----------:|
| 388 |
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
This is a very long part because I tried everything in the book! I ended up removing a lot of near-zero and collinear variables before splitting my data into train and test sets. By the second exercise I realized I could include the near-zero and collinear variable exclusions in the preprocessing step.
The optimal number of latent variables with PLS was three; this was where the cross-validated error was lowest and corresponded to a relatively high R-squared.
The corresponding resampled R-squared with three latent variables was 0.5846841, meaning about 58% of the variance was explained by the model, so there is plenty of room for improvement.
Here we show the log transformation of permeability, which we will do later to make the problem more tractable while modeling.
While it doesn’t look Gaussian (normally distributed) after the log transformation, it does look closer to Gaussian.
Interestingly, later we discover that there are significant improvements in our final model if we don’t take the log of permeability while fitting the models. See the final table in Exercise 6.2 for comparison.
dfp <- data.frame(permeability = as.vector(permeability))
# Original histogram
p1 <- ggplot(dfp, aes(x = permeability)) +
geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
ggtitle("Original Data") +
theme_minimal()
# Log-transformed histogram
dfp$log_permeability <- log(dfp$permeability)
p2 <- ggplot(dfp, aes(x = log_permeability)) +
geom_histogram(bins = 30, fill = "green", color = "black", alpha = 0.7) +
ggtitle("Log-Transformed Data") +
theme_minimal()
# Display side-by-side
grid.arrange(p1, p2, ncol = 2)
Here we check for missing values and find there are none.
# Check for missing values
sum(is.na(fp))
## [1] 0
Even after removing the near-zero-variance predictors we still have more predictors (388) than observations (165). It’s possible that we could fit a regularized linear regression and still converge on a model, but we have more tools for dimensionality reduction.
If we were to use PCA we would need 27 components to capture 95% of the variance.
#Let's try PCA
pcaObject <- prcomp(fp, center=TRUE, scale.=TRUE)
percentVariance <- pcaObject$sd^2/sum(pcaObject$sd^2)*100
sum(percentVariance[1:27])
## [1] 95.02577
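A cumulative scree plot (a sketch built from the percentVariance vector just computed) makes the 95% threshold easy to see:
# Cumulative variance explained; the dashed line marks 95%
cumVar <- cumsum(percentVariance)
plot(cumVar, type = "b", xlab = "Principal component", ylab = "Cumulative % of variance")
abline(h = 95, lty = 2)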
Here we look at the correlation matrix and see an almost tartan-like pattern, suggesting the molecular fingerprints are stored in some systematic (taxonomic) order in which neighboring substructures tend to be correlated with one another.
As is clearly visible in the correlation matrix below, there is a high degree of collinearity in our predictors that we should address before trying PCA or PLS.
Note, the correlation matrix is 388 x 388, so we display only the upper-left 70 x 70 submatrix here: enough to see the pattern of collinearity, but not so much that the plot is indiscernible.
# Correlation matrix plot
correlations <- cor(fp)
corrplot::corrplot(correlations[1:70, 1:70], method="square", type = "lower", order = 'original')
# The dimensionality is 388 x 388
#dim(correlations)
We need to eliminate some of the collinear predictors to obtain a more stable, relevant model. We’ll use the approach implemented in caret’s findCorrelation: repeatedly find the pair of predictors with the highest pair-wise correlation, remove the one with the larger average correlation to the remaining predictors, and continue until no pair-wise correlations exceed a chosen threshold.
After running the following code at several thresholds to populate this table, we settled on a pair-wise correlation threshold of 0.75, which removes 333 predictors and leaves 55.
| Pair-Wise Correlation Threshold | Predictors Removed | Predictors Left |
|:-------------------------------:|:------------------:|:---------------:|
| 0.95 | 261 | 127 |
| 0.90 | 278 | 110 |
| 0.85 | 300 | 88 |
| 0.80 | 318 | 70 |
| 0.75 | 333 | 55 |
| 0.70 | 342 | 46 |
| 0.65 | 348 | 40 |
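The table above can be reproduced with a small loop over candidate cutoffs (a sketch using the correlations matrix computed above):
# Sketch: count predictors removed/remaining at several pair-wise correlation cutoffs
cutoffs <- seq(0.95, 0.65, by = -0.05)
removed <- sapply(cutoffs, function(ct) length(findCorrelation(correlations, cutoff = ct)))
kable(data.frame(Cutoff = cutoffs, Removed = removed, Remaining = ncol(fp) - removed), align = "c")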
# filter out highly correlated predictors above the 0.75 threshold
highCorr <- findCorrelation(correlations, cutoff = .75)
filteredfp <- fp[, -highCorr]
# Code for identifying the number of predictors removed and remaining
#length(highCorr)
#ncol(filteredfp)
Now that we’ve filtered for near-zero and highly collinear variables, we actually need more components, 31 instead of 27, to capture 95% of the variance. This makes sense: removing the highly correlated predictors removes redundancy, so the remaining variance is spread more evenly across predictors and each component captures a smaller share of it.
I would try this step with PLS instead, but PLS needs the response, so we will hold off until after the train-test split, when we fit the PLS model with caret.
# How to run PCA
preProcess(filteredfp, method = c("center", "scale", "pca"))
## Created from 165 samples and 55 variables
##
## Pre-processing:
## - centered (55)
## - ignored (0)
## - principal component signal extraction (55)
## - scaled (55)
##
## PCA needed 31 components to capture 95 percent of the variance
Here we split the data into a training and a test set using an 80-20 split.
# Split data
set.seed(22)
trainingRows <- createDataPartition(permeability, p = .80, list=FALSE)
trainPredictors <- filteredfp[trainingRows, ]
trainTarget <- permeability[trainingRows]
testPredictors <- filteredfp[-trainingRows, ]
testTarget <- permeability[-trainingRows]
Here we ran PLS on our training data using cross-validation for more stable performance metrics.
The optimal number of PLS components, or latent variables, is three, since that results in the lowest cross-validated error (RMSE is the metric caret uses to select the model).
The corresponding resampled R-squared value was 0.5846841. Since only about 58% of the variance is captured by our model, there’s room to improve the amount of variance explained with a different model.
# Make a combined dataframe for the train data
trainData <- as.data.frame(trainTarget)
trainData <- cbind(trainData, trainPredictors)
colnames(trainData)[1] <- "permeability"
# Define control as 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
# Train the PLS model and tune for the optimal number of components
plsFit <- train(
log(permeability) ~ .,
data = trainData,
method = "pls",
tuneLength = 20, # Evaluate up to 20 components
trControl = train_control, # Cross-validation control
preProcess = c("center", "scale") # Center and scale the data
)
# Display the results
print(plsFit)
## Partial Least Squares
##
## 133 samples
## 55 predictor
##
## Pre-processing: centered (55), scaled (55)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 119, 119, 118, 120, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.126915 0.5116974 0.9414516
## 2 1.047013 0.5707059 0.8544444
## 3 1.041003 0.5846841 0.8503616
## 4 1.045734 0.5881774 0.8587506
## 5 1.064919 0.5825260 0.8761451
## 6 1.088411 0.5644504 0.8970565
## 7 1.111668 0.5521913 0.9148290
## 8 1.125418 0.5421839 0.9334434
## 9 1.142132 0.5337365 0.9510120
## 10 1.173360 0.5191300 0.9763978
## 11 1.200423 0.5092493 1.0107145
## 12 1.197646 0.5127768 1.0052812
## 13 1.206623 0.5113733 1.0110937
## 14 1.215358 0.5077655 1.0092907
## 15 1.215990 0.5083837 1.0066813
## 16 1.228777 0.5019447 1.0165204
## 17 1.225961 0.5041743 1.0122425
## 18 1.229201 0.5001290 1.0133949
## 19 1.226646 0.5036160 1.0060259
## 20 1.218589 0.5058999 0.9961022
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
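caret’s plot method on the fitted train object shows the same resampling profile graphically (a short sketch):
# Resampling profile: RMSE versus number of PLS components
plot(plsFit, metric = "RMSE")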
Predict the response for the test set. What is the test set estimate of R2?
The test set estimate of R-squared is 0.4237932. This indicates ~42% of the variance in our test data was explained by the model.
Because of PLS’s low performance, we should pursue other models.
Here we produce the predicted values for the test set data.
# Make a combined dataframe for the test data
testData <- as.data.frame(testTarget)
testData <- cbind(testData, testPredictors)
colnames(testData)[1] <- "permeability"
# Predict the response for the test set; caret's predict() automatically
# uses the final tuned model (ncomp = 3)
test_predictions <- predict(plsFit, testData)
test_predictions
## [1] 1.32719709 1.67946162 0.96984192 1.34403742 0.03377816 0.65803119
## [7] 0.70576651 1.44171145 2.58160205 1.04819758 0.41081963 1.59493015
## [13] 1.29680717 1.56272211 2.28559343 1.54668888 3.35326387 3.01554713
## [19] -0.19198962 -0.15815997 3.09574706 2.85797032 3.46362093 2.39827440
## [25] 3.53214450 2.31682496 0.65081003 -1.13102856 0.73289462 0.90660692
## [31] 0.34570919 3.06087165
We calculate R-squared by squaring the correlation between the predicted and actual values in the test set data.
Our partial least squares model had an R-squared of 0.4237932 on the test data, indicating ~42% of the variance in the test data was explained by our model, compared to the resampled estimate of ~58% from the training data.
# Calculate the square of the correlation between the predicted and actual values
cor(test_predictions, testData$permeability)^2
## [1] 0.4237932
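An alternative (a sketch; note the model was fit on the log scale, so here we compare against log-transformed permeability, which can give a slightly different value) is caret’s postResample, which returns RMSE, R-squared, and MAE in one call:
# RMSE, R-squared and MAE for the test set, computed on the log scale
postResample(pred = test_predictions, obs = log(testData$permeability))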
Try building other models discussed in this chapter. Do any have better predictive performance?
We tried three additional models: ridge, lasso, and elastic-net, and all of them had better predictive performance using R-squared on the test data as the metric.
The ridge model ultimately performed the best, and this may be due to our initial efforts in the preprocessing stage to remove degenerate and collinear predictors: ridge works best when most of the remaining predictors carry some signal, since it shrinks coefficients without dropping any of them.
Using a grid search we arrive at an optimal ridge lambda of 0.22 and a resampled R-squared of 0.6084128.
The ridge model returned an R-squared of 0.4810941 on the test data.
# Trying the Ridge model
ridgeGrid <- data.frame(.lambda = seq(.1,.3, length=11))
set.seed(222)
ridgeRegFit <- train(
  x = trainData[, -1],
  y = log(trainData$permeability),
  method = "ridge",
  tuneGrid = ridgeGrid,
  trControl = train_control,
  preProc = c("center", "scale")
)
ridgeRegFit
## Ridge Regression
##
## 133 samples
## 55 predictor
##
## Pre-processing: centered (55), scaled (55)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 121, 119, 119, 121, 120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.10 1.042469 0.5866936 0.8671657
## 0.12 1.035862 0.5921830 0.8607188
## 0.14 1.031135 0.5966770 0.8559178
## 0.16 1.027821 0.6004057 0.8524658
## 0.18 1.025610 0.6035308 0.8497191
## 0.20 1.024287 0.6061703 0.8478302
## 0.22 1.023695 0.6084128 0.8469614
## 0.24 1.023717 0.6103262 0.8466635
## 0.26 1.024263 0.6119638 0.8466136
## 0.28 1.025263 0.6133679 0.8467754
## 0.30 1.026661 0.6145726 0.8474149
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.22.
# Predict test data using ridge model
ridgePred <- predict(ridgeRegFit, newdata = testData[,-1])
# Determine R-squared for test data
cor(as.data.frame(ridgePred), testData$permeability)^2
## [,1]
## ridgePred 0.4810941
Using a grid search we arrive at an optimal lasso fraction of 0.22 (the lasso’s tuning parameter in caret is the fraction of the full solution, not lambda) and a resampled R-squared of 0.6051579.
The lasso model returned an R-squared of 0.4617224 on the test data.
# Trying the Lasso model
lassoGrid <- data.frame(fraction = seq(.1,.3, length=11))
set.seed(322)
lassoRegFit <- train(
  x = trainData[, -1],
  y = log(trainData$permeability),
  method = "lasso",
  tuneGrid = lassoGrid,
  trControl = train_control,
  preProc = c("center", "scale")
)
lassoRegFit
## The lasso
##
## 133 samples
## 55 predictor
##
## Pre-processing: centered (55), scaled (55)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 120, 119, 119, 120, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.10 1.0145171 0.5622959 0.8175052
## 0.12 0.9984494 0.5720645 0.8026214
## 0.14 0.9855996 0.5843299 0.7994443
## 0.16 0.9794831 0.5902332 0.8040254
## 0.18 0.9728947 0.5968510 0.8063597
## 0.20 0.9672096 0.6019765 0.8042530
## 0.22 0.9646317 0.6051579 0.8039533
## 0.24 0.9679313 0.6051342 0.8086465
## 0.26 0.9756211 0.6024830 0.8177834
## 0.28 0.9841088 0.5985019 0.8269686
## 0.30 0.9937083 0.5941721 0.8365533
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.22.
# Predict test data using the lasso model
lassoPred <- predict(lassoRegFit, newdata = testData[,-1])
# Determine R-squared for test data
cor(as.data.frame(lassoPred), testData$permeability)^2
## [,1]
## lassoPred 0.4617224
Using a grid search we arrive at optimal elastic-net parameters of lambda = 0.55 (the ridge penalty) and fraction = 0.40 (the lasso fraction), with a resampled R-squared of 0.5979077.
The elastic-net model returned an R-squared of 0.4385897 on the test data.
# Trying the Elastic-net model
enetGrid <- expand.grid(.fraction = seq(.3,.5, length=5), .lambda = seq(.4,.6, length=5))
set.seed(422)
enetRegFit <- train(
  x = trainData[, -1],
  y = log(trainData$permeability),
  method = "enet",
  tuneGrid = enetGrid,
  trControl = train_control,
  preProc = c("center", "scale")
)
enetRegFit
## Elasticnet
##
## 133 samples
## 55 predictor
##
## Pre-processing: centered (55), scaled (55)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 120, 120, 120, 120, 120, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.40 0.30 1.0305831 0.5723879 0.8248979
## 0.40 0.35 0.9932371 0.5892383 0.7877729
## 0.40 0.40 0.9756366 0.5961431 0.7751026
## 0.40 0.45 0.9805989 0.5907875 0.7837304
## 0.40 0.50 0.9913728 0.5850508 0.7932941
## 0.45 0.30 1.0315541 0.5715499 0.8256755
## 0.45 0.35 0.9939202 0.5888437 0.7882235
## 0.45 0.40 0.9749147 0.5970255 0.7733718
## 0.45 0.45 0.9787254 0.5928965 0.7814887
## 0.45 0.50 0.9918978 0.5859708 0.7926742
## 0.50 0.30 1.0320147 0.5708428 0.8258400
## 0.50 0.35 0.9943240 0.5884560 0.7883018
## 0.50 0.40 0.9745769 0.5975387 0.7723703
## 0.50 0.45 0.9775398 0.5946432 0.7798136
## 0.50 0.50 0.9926432 0.5868816 0.7923254
## 0.55 0.30 1.0320526 0.5703302 0.8254835
## 0.55 0.35 0.9945049 0.5881268 0.7881579
## 0.55 0.40 0.9743835 0.5979077 0.7718374
## 0.55 0.45 0.9772675 0.5956860 0.7790713
## 0.55 0.50 0.9936671 0.5877067 0.7927737
## 0.60 0.30 1.0317889 0.5699474 0.8246873
## 0.60 0.35 0.9945134 0.5878235 0.7878573
## 0.60 0.40 0.9745012 0.5979531 0.7715564
## 0.60 0.45 0.9773159 0.5964559 0.7788176
## 0.60 0.50 0.9948644 0.5884861 0.7933127
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.55.
# Predict test data using the elastic-net model
enetPred <- predict(enetRegFit, newdata = testData[,-1])
# Determine R-squared for test data
cor(as.data.frame(enetPred), testData$permeability)^2
## [,1]
## enetPred 0.4385897
Would you recommend any of your models to replace the permeability laboratory experiment?
Based on these results we appear to be looking at a relationship that is at least partly non-linear, so there are limits to how well a linear model can explain it. Any recommendation of a linear model in this context would mainly be to explain potential drivers of the data’s behavior, and we should switch to nonlinear models to continue exploring this data.
Of the three penalized regression models, ridge regression returned the highest R-squared on the test data, 0.4810941, so we would recommend the ridge model over the PLS model initially fit; however, there is still room for further improvement.
I looked in Section 1.4 for a description of the permeability laboratory experiment and for an R-squared or RMSE in the text to compare against, but couldn’t find one; there may be an additional benchmark to consider that is not modeled here.
I’ll save it for a future exercise, but below I include the R-squared values for predictions on the test data when we don’t take the log of permeability while fitting the models. We see a substantial improvement in the penalized regression models: 0.5733212 instead of 0.4810941 for the ridge model. Maybe taking the log of our data caused it to overfit, or we need to confirm these results by trying different train-test splits.
| Model | Test data R-squared (no log) | Test data R-squared (log) |
|:-----:|:----------------------------:|:-------------------------:|
| Ridge | 0.5733 | 0.4811 |
| Lasso | 0.5546 | 0.4617 |
| Elastic-Net | 0.4677 | 0.4386 |
| PLS | 0.4046 | 0.4238 |
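A sketch of how the no-log column could be reproduced for the ridge model, reusing the grid, resampling control, and train-test split defined above (the lasso and elastic-net models follow the same pattern):
# Refit ridge on the raw (untransformed) response and recompute the test-set R-squared
set.seed(222)
ridgeRegFitRaw <- train(
  x = trainData[, -1],
  y = trainData$permeability,   # raw response, no log
  method = "ridge",
  tuneGrid = ridgeGrid,
  trControl = train_control,
  preProc = c("center", "scale")
)
ridgePredRaw <- predict(ridgeRegFitRaw, newdata = testData[, -1])
cor(ridgePredRaw, testData$permeability)^2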
A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs, while yield contains the percent yield for each run.
Here we load the data.
# Load the data
data("ChemicalManufacturingProcess")
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Here we impute the missing values using kNN and then check that the 106 missing values are filled in.
Note: this was a sticky step. Here is how I battled through.
There was no clear example in Section 3.8. Page 42 discussed using a K-nearest neighbor model to impute missing values. Section 3.8 showed that the impute package has a function impute.knn and that the preProcess function has imputation methods using kNN or bagged trees. impute is not available as a package in my version of R, so I’m going with preProcess. I found the method type (“knnImpute”) from ??caret::preProcess (there are also bagImpute and medianImpute) and finally connected that I had to follow the trans <- preProcess(datawMissing, method = c("knnImpute")) then data <- predict(trans, datawMissing) format. Lastly I had to install RANN (it has fast k-nearest neighbor search algorithms) to run the knnImpute method in preProcess.
# There are 106 missing values
#sum(is.na(ChemicalManufacturingProcess))
# Produce the model that can impute values using kNN
imputeModel <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))
# Impute the missing values
CMP <- predict(imputeModel, ChemicalManufacturingProcess)
# No more missing values
#sum(is.na(CMP))
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Using elastic-net, our optimal RMSE is 0.6319838 with a corresponding R-squared of 0.5992708. This is with a lambda (ridge penalty) of 0.90 and a fraction (lasso tuning parameter) of 0.35.
Subsequent parts ask about which predictors have the biggest impact on the model and how. This would be done most easily with ordinary or penalized linear regression because we would have a coefficient for each predictor. I was curious to try it with PLS, but I could only find varImp(plsFit) as a way to see which variables had the biggest impact, and it couldn’t tell me whether the impact was positive or negative.
Here we split the data into a training and a test set using an 80-20 split.
# Split data
set.seed(522)
trainingRows2 <- createDataPartition(CMP[,1], p = .80, list=FALSE)
trainPredictors2 <- CMP[trainingRows2,-1]
trainTarget2 <- CMP[trainingRows2,1]
testPredictors2 <- CMP[-trainingRows2,-1]
testTarget2 <- CMP[-trainingRows2,1]
In addition to the imputation of missing data done previously, we center and scale the variables and remove any predictors (eleven of them) that have near-zero variance or are too collinear.
Using elastic-net, our optimal RMSE is 0.6319838 with a corresponding R-squared of 0.5992708. This is with a lambda (ridge penalty) of 0.90 and a fraction of 0.35.
#Excluded because using same definition as in Exercise 6.2
#train_control <- trainControl(method = "cv", number = 10)
# Fitting an elastic-net model
enetGrid2 <- expand.grid(.fraction = seq(.3,.5, length=5), .lambda = seq(.8,1, length=5))
set.seed(622)
enetRegFit2 <- train(
x = trainPredictors2,
y = trainTarget2,
method = "enet",
tuneGrid = enetGrid2,
trControl = train_control,
preProc = c("center","scale","nzv","corr"))
enetRegFit2
## Elasticnet
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (46), scaled (46), remove (11)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 131, 130, 129, 130, 129, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.80 0.30 0.6419498 0.6006973 0.5326539
## 0.80 0.35 0.6321112 0.5997685 0.5172543
## 0.80 0.40 0.6378945 0.5972613 0.5207585
## 0.80 0.45 0.6470469 0.5983609 0.5251385
## 0.80 0.50 0.6558894 0.6007408 0.5288848
## 0.85 0.30 0.6415661 0.6003010 0.5317892
## 0.85 0.35 0.6320170 0.5995313 0.5168066
## 0.85 0.40 0.6381209 0.5972205 0.5210214
## 0.85 0.45 0.6487120 0.5979212 0.5269532
## 0.85 0.50 0.6585650 0.6003136 0.5312913
## 0.90 0.30 0.6412076 0.5999004 0.5313458
## 0.90 0.35 0.6319838 0.5992708 0.5163612
## 0.90 0.40 0.6384923 0.5971439 0.5213966
## 0.90 0.45 0.6503998 0.5975653 0.5285233
## 0.90 0.50 0.6614089 0.5998798 0.5338540
## 0.95 0.30 0.6408650 0.5994802 0.5309382
## 0.95 0.35 0.6320066 0.5989931 0.5159858
## 0.95 0.40 0.6389794 0.5970669 0.5217422
## 0.95 0.45 0.6521483 0.5972605 0.5299552
## 0.95 0.50 0.6644577 0.5994047 0.5365967
## 1.00 0.30 0.6405284 0.5990653 0.5304803
## 1.00 0.35 0.6320611 0.5987481 0.5156063
## 1.00 0.40 0.6395563 0.5969996 0.5221203
## 1.00 0.45 0.6540158 0.5969182 0.5313423
## 1.00 0.50 0.6676725 0.5989283 0.5396577
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.35 and lambda = 0.9.
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
The R-squared of the prediction on the test set is 0.5265838 compared to the resampled R-squared on the training set of 0.5992708. Additionally, the RMSE on the test set is 0.6835943 compared to the training set RMSE of 0.6319838.
Since the RMSE is higher and the R-squared is lower on the test data, the model may be somewhat overfit, and we should explore different models or additional preprocessing options.
# Predict test data using the elastic-net model
enetPred2 <- predict(enetRegFit2, newdata = testPredictors2)
# Determine R-squared for test data
test_r2 <- cor(as.data.frame(enetPred2), testTarget2)^2
# Determine RMSE for test data
test_rmse <- sqrt(mean((enetPred2 - testTarget2)^2))
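For readability, a small table (a sketch reusing the values just computed) displays the test-set metrics:
# Display the test-set RMSE and R-squared
kable(data.frame(RMSE = test_rmse, Rsquared = as.numeric(test_r2)), align = "c")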
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
I expected to be able to see the model coefficients, as in an ordinary linear regression summary, but they don’t seem to be reported directly.
Instead I found the “Sequence of moves”, which represents the order in which variables entered the elastic-net regression model. We can reasonably assume that a variable entered earlier is more influential than one entered later.
Of the first ten predictors entered into the elastic-net model, eight were manufacturing process variables, and the six most important predictors for yield were all manufacturing processes.
# try to find information on variables.
enetRegFit2$finalModel
##
## Call:
## elasticnet::enet(x = as.matrix(x), y = y, lambda = param$lambda)
## Sequence of moves:
## ManufacturingProcess32 ManufacturingProcess13 ManufacturingProcess09
## Var 35 21 17
## Step 1 2 3
## ManufacturingProcess36 ManufacturingProcess17 ManufacturingProcess06
## Var 39 25 14
## Step 4 5 6
## BiologicalMaterial06 BiologicalMaterial03 ManufacturingProcess11
## Var 4 2 19
## Step 7 8 9
## ManufacturingProcess33 BiologicalMaterial11 BiologicalMaterial08
## Var 36 8 5
## Step 10 11 12
## ManufacturingProcess12 ManufacturingProcess34 ManufacturingProcess39
## Var 20 37 42
## Step 13 14 15
## ManufacturingProcess30 ManufacturingProcess37 ManufacturingProcess44
## Var 34 40 45
## Step 16 17 18
## ManufacturingProcess24 ManufacturingProcess15 BiologicalMaterial05
## Var 31 23 3
## Step 19 20 21
## ManufacturingProcess45 ManufacturingProcess43 BiologicalMaterial01
## Var 46 44 1
## Step 22 23 24
## ManufacturingProcess10 ManufacturingProcess07 ManufacturingProcess20
## Var 18 15 27
## Step 25 26 27
## ManufacturingProcess19 ManufacturingProcess21 ManufacturingProcess35
## Var 26 28 38
## Step 28 29 30
## ManufacturingProcess28 ManufacturingProcess05 ManufacturingProcess04
## Var 33 13 12
## Step 31 32 33
## ManufacturingProcess16 ManufacturingProcess01 ManufacturingProcess40
## Var 24 9 43
## Step 34 35 36
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess02
## Var 29 30 10
## Step 37 38 39
## BiologicalMaterial09 ManufacturingProcess08 ManufacturingProcess03
## Var 6 16 11
## Step 40 41 42
## BiologicalMaterial10 ManufacturingProcess14 ManufacturingProcess38
## Var 7 22 41
## Step 43 44 45
## ManufacturingProcess26
## Var 32 47
## Step 46 47
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
I found the final coefficients for our model, and they confirm what we found in the prior section using the order of predictor entry. There are 13 predictors in the final model, nine of which are manufacturing processes and four of which are biological materials. The top six, sorted by absolute value, are all manufacturing processes.
Three of the manufacturing processes have negative coefficients. It’s possible these steps need to be adjusted to reduce their negative impact on final product yield.
If we’re spending money on biological materials other than 06, 03, 11, or 08, then perhaps we should shift those expenditures toward the materials that have had a demonstrable, positive association with yield.
ManufacturingProcess32 has the most significant impact on yield. A/B testing changes in this process would be the most likely way to garner a kaizen increase in yield (and to validate whether the predictor does indeed drive yield improvements!).
# Find and display non-zero coefficients
coes <- predict.enet(enetRegFit2$finalModel, s=enetRegFit2$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients
non_zero_coefs <- coes[coes != 0]
sorted_coefs <- sort(non_zero_coefs, decreasing = TRUE)
print(sorted_coefs)
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess06
## 0.285046129 0.171770522 0.094278734
## BiologicalMaterial06 ManufacturingProcess11 BiologicalMaterial03
## 0.049009044 0.040148179 0.039423886
## ManufacturingProcess33 BiologicalMaterial11 BiologicalMaterial08
## 0.029209661 0.019739564 0.007307787
## ManufacturingProcess12 ManufacturingProcess17 ManufacturingProcess36
## 0.001930146 -0.118650286 -0.136463539
## ManufacturingProcess13
## -0.179598257
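One way to explore these relationships directly (a sketch, assuming the imputed CMP data frame and the ggplot2/gridExtra packages loaded earlier; values are centered and scaled because of the knnImpute preprocessing) is to plot the most influential predictors against yield:
# Scatter plots of the two largest-magnitude coefficients' predictors against Yield
p32 <- ggplot(CMP, aes(x = ManufacturingProcess32, y = Yield)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()
p13 <- ggplot(CMP, aes(x = ManufacturingProcess13, y = Yield)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()
grid.arrange(p32, p13, ncol = 2)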
These exercises come from ‘Applied Predictive Modeling’ by Max Kuhn and Kjell Johnson.