Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
Here we load the data. Note, permeability is a measure of a molecule’s ability to cross a membrane.
From Section 1.4, there are three distinguishing features of the data that make it challenging to model with linear regression. First, permeability is highly skewed, with the vast majority of compounds having little or no permeability, whereas ideally we would see a normally distributed target variable. Second, the predictors are sparse, with only 15.5% of the fingerprint entries present, which gives us a dimensionality problem. Third, there is a high degree of collinearity among the predictors.
Here we calculate the skewness of permeability, which is 1.396559:
# Load required packages
library(AppliedPredictiveModeling)
library(e1071)
library(knitr)
library(caret)
library(corrplot)
library(ggplot2)
library(gridExtra)
library(pls)
library(elasticnet)
library(RANN)
# Load data
data(permeability)
# Calculate skewness of our target variable
kable(data.frame(Skewness = skewness(permeability)), align="c")
| Skewness |
|:--------:|
| 1.396559 |
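As a quick check on the sparsity claim from Section 1.4, the fraction of fingerprint entries that are present can be computed directly (a small sketch using the fingerprints matrix loaded above); it comes out to roughly 15.5%:
# Fraction of fingerprint entries equal to 1 ("substructure present")
mean(fingerprints == 1)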
Next we transform the permeability vector using Box-Cox. This didn’t work the way I first expected: BoxCoxTrans returns a transformation object (a list of six elements) rather than the transformed values. However, its estimated lambda is zero, so we know we can simply take the log of permeability when it comes time to model.
# Perform box-cox transformation
permbc <- BoxCoxTrans(permeability)
kable(data.frame(Lambda = permbc$lambda), align="c")
| Lambda |
|:------:|
| 0 |
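As a sanity check (a sketch; the predict method for BoxCoxTrans objects applies the fitted transformation), we can confirm that a lambda of zero reproduces the natural log:
# With lambda = 0, the fitted Box-Cox transform reduces to the natural log
transformed <- predict(permbc, as.vector(permeability))
all.equal(transformed, as.vector(log(permeability)))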
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
Predictors identified by nearZeroVar have too few non-zero values and should be removed.
In our case we have 719 problematic predictors and after removing them we have 388 predictors left.
# Count number of problematic predictors
#length(nearZeroVar(fingerprints))
# Remove problematic predictors
fp <- fingerprints[, -nearZeroVar(fingerprints)]
# Display number of remaining predictors
kable(data.frame(Predictors = ncol(fp)), align="c")
| Predictors |
|:----------:|
| 388 |
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
This is a very long part because I tried everything in the book! I ended up removing a lot of near-zero and collinear variables before splitting my data into train and test sets. By the second exercise I realized I could include the near-zero and collinear variable exclusions in the preprocessing step.
The optimal number of latent variables with PLS was three; this was where the cross-validated error was lowest and corresponded to a relatively high R-squared.
The corresponding resampled R-squared with three latent variables was 0.5846841, meaning about 58% of the variance was explained by the model, so there is plenty of room for improvement.
Here we show the log transformation of permeability, which we will do later to make the problem more tractable while modeling.
While it doesn’t look Gaussian (normally distributed) after the log transformation, it does look closer to Gaussian.
Interestingly, later we discover that there are significant improvements in our final model if we don’t take the log of permeability while fitting the models. See the final table in Exercise 6.2 for comparison.
dfp <- data.frame(permeability = as.vector(permeability))
# Original histogram
p1 <- ggplot(dfp, aes(x = permeability)) +
geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
ggtitle("Original Data") +
theme_minimal()
# Log-transformed histogram
dfp$log_permeability <- log(dfp$permeability)
p2 <- ggplot(dfp, aes(x = log_permeability)) +
geom_histogram(bins = 30, fill = "green", color = "black", alpha = 0.7) +
ggtitle("Log-Transformed Data") +
theme_minimal()
# Display side-by-side
grid.arrange(p1, p2, ncol = 2)
Here we check for missing values and find there are none.
# Check for missing values
sum(is.na(fp))
## [1] 0
Even after removing the near-zero-variance predictors we still have more predictors (388) than observations (165). It’s possible that we could fit a regularized linear regression and still converge on a model, but we have more tools for dimensionality reduction.
If we were to use PCA we would need 27 components to capture 95% of the variance.
#Let's try PCA
pcaObject <- prcomp(fp, center=TRUE, scale.=TRUE)
percentVariance <- pcaObject$sd^2/sum(pcaObject$sd^2)*100
sum(percentVariance[1:27])
## [1] 95.02577
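A cumulative scree plot (a sketch built from the percentVariance vector just computed) makes the 95% threshold easy to see:
# Cumulative variance explained; the dashed line marks 95%
cumVar <- cumsum(percentVariance)
plot(cumVar, type = "b", xlab = "Principal component", ylab = "Cumulative % of variance")
abline(h = 95, lty = 2)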
Here we look at the correlation matrix and see an almost tartan-like pattern, suggesting the molecular fingerprints are stored in some systematic (taxonomic) order in which neighboring substructures tend to be correlated with one another.
As is clearly visible in the correlation matrix below, there is a high degree of collinearity in our predictors that we should address before trying PCA or PLS.
Note, the correlation matrix is 388 x 388, so we display only the upper-left 70 x 70 submatrix here: enough to see the pattern of collinearity, but not so much that the plot is indiscernible.
# Correlation matrix plot
correlations <- cor(fp)
corrplot::corrplot(correlations[1:70, 1:70], method="square", type = "lower", order = 'original')
# The dimensionality is 388 x 388
#dim(correlations)
We need to eliminate some of the collinear predictors to obtain a more stable, relevant model. We’ll use the approach implemented in caret’s findCorrelation: repeatedly find the pair of predictors with the highest pair-wise correlation, remove the one with the larger average correlation to the remaining predictors, and continue until no pair-wise correlations exceed a chosen threshold.
After running the following code at several thresholds to populate this table, we settled on a pair-wise correlation threshold of 0.75, which removes 333 predictors and leaves 55.
| Pair-Wise Correlation Threshold | Predictors Removed | Predictors Left |
|:-------------------------------:|:------------------:|:---------------:|
| 0.95 | 261 | 127 |
| 0.90 | 278 | 110 |
| 0.85 | 300 | 88 |
| 0.80 | 318 | 70 |
| 0.75 | 333 | 55 |
| 0.70 | 342 | 46 |
| 0.65 | 348 | 40 |
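The table above can be reproduced with a small loop over candidate cutoffs (a sketch using the correlations matrix computed above):
# Sketch: count predictors removed/remaining at several pair-wise correlation cutoffs
cutoffs <- seq(0.95, 0.65, by = -0.05)
removed <- sapply(cutoffs, function(ct) length(findCorrelation(correlations, cutoff = ct)))
kable(data.frame(Cutoff = cutoffs, Removed = removed, Remaining = ncol(fp) - removed), align = "c")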
# filter out highly correlated predictors above the 0.75 threshold
highCorr <- findCorrelation(correlations, cutoff = .75)
filteredfp <- fp[, -highCorr]
# Code for identifying the number of predictors removed and remaining
#length(highCorr)
#ncol(filteredfp)
Now that we’ve filtered for near-zero and highly collinear variables, we actually need more components, 31 instead of 27, to capture 95% of the variance. This makes sense: removing the highly correlated predictors removes redundancy, so the remaining variance is spread more evenly across predictors and each component captures a smaller share of it.
I would try this step with PLS instead, but PLS needs the response, so we will hold off until after the train-test split, when we fit the PLS model with caret.
# How to run PCA
preProcess(filteredfp, method = c("center", "scale", "pca"))
## Created from 165 samples and 55 variables
##
## Pre-processing:
## - centered (55)
## - ignored (0)
## - principal component signal extraction (55)
## - scaled (55)
##
## PCA needed 31 components to capture 95 percent of the variance
Here we split the data into a training and a test set using an 80-20 split.
# Split data
set.seed(22)
trainingRows <- createDataPartition(permeability, p = .80, list=FALSE)
trainPredictors <- filteredfp[trainingRows, ]
trainTarget <- permeability[trainingRows]
testPredictors <- filteredfp[-trainingRows, ]
testTarget <- permeability[-trainingRows]
Here we ran PLS on our training data using cross-validation for more stable performance metrics.
The optimal number of PLS components, or latent variables, is three, since that results in the lowest cross-validated error (RMSE is the metric caret uses to select the model).
The corresponding resampled R-squared value was 0.5846841. Since only about 58% of the variance is captured by our model, there’s room to improve the amount of variance explained with a different model.
# Make a combined dataframe for the train data
trainData <- as.data.frame(trainTarget)
trainData <- cbind(trainData, trainPredictors)
colnames(trainData)[1] <- "permeability"
# Define control as 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
# Train the PLS model and tune for the optimal number of components
plsFit <- train(
log(permeability) ~ .,
data = trainData,
method = "pls",
tuneLength = 20, # Evaluate up to 20 components
trControl = train_control, # Cross-validation control
preProcess = c("center", "scale") # Center and scale the data
)
# Display the results
print(plsFit)
## Partial Least Squares
##
## 133 samples
## 55 predictor
##
## Pre-processing: centered (55), scaled (55)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 119, 119, 118, 120, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.126915 0.5116974 0.9414516
## 2 1.047013 0.5707059 0.8544444
## 3 1.041003 0.5846841 0.8503616
## 4 1.045734 0.5881774 0.8587506
## 5 1.064919 0.5825260 0.8761451
## 6 1.088411 0.5644504 0.8970565
## 7 1.111668 0.5521913 0.9148290
## 8 1.125418 0.5421839 0.9334434
## 9 1.142132 0.5337365 0.9510120
## 10 1.173360 0.5191300 0.9763978
## 11 1.200423 0.5092493 1.0107145
## 12 1.197646 0.5127768 1.0052812
## 13 1.206623 0.5113733 1.0110937
## 14 1.215358 0.5077655 1.0092907
## 15 1.215990 0.5083837 1.0066813
## 16 1.228777 0.5019447 1.0165204
## 17 1.225961 0.5041743 1.0122425
## 18 1.229201 0.5001290 1.0133949
## 19 1.226646 0.5036160 1.0060259
## 20 1.218589 0.5058999 0.9961022
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
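caret’s plot method on the fitted train object shows the same resampling profile graphically (a short sketch):
# Resampling profile: RMSE versus number of PLS components
plot(plsFit, metric = "RMSE")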
Predict the response for the test set. What is the test set estimate of R2?
The test set estimate of R-squared is 0.4237932. This indicates ~42% of the variance in our test data was explained by the model.
Because of PLS’s low performance, we should pursue other models.
Here we produce the predicted values for the test set data.
# Make a combined dataframe for the test data
testData <- as.data.frame(testTarget)
testData <- cbind(testData, testPredictors)
colnames(testData)[1] <- "permeability"
# Predict the response for the test set; caret's predict() automatically
# uses the final tuned model (ncomp = 3)
test_predictions <- predict(plsFit, testData)
test_predictions
## [1] 1.32719709 1.67946162 0.96984192 1.34403742 0.03377816 0.65803119
## [7] 0.70576651 1.44171145 2.58160205 1.04819758 0.41081963 1.59493015
## [13] 1.29680717 1.56272211 2.28559343 1.54668888 3.35326387 3.01554713
## [19] -0.19198962 -0.15815997 3.09574706 2.85797032 3.46362093 2.39827440
## [25] 3.53214450 2.31682496 0.65081003 -1.13102856 0.73289462 0.90660692
## [31] 0.34570919 3.06087165
We calculate R-squared by squaring the correlation between the predicted and actual values in the test set data.
Our partial least squares model had an R-squared of 0.4237932 on the test data, indicating ~42% of the variance in the test data was explained by our model, compared to the resampled estimate of ~58% from the training data.
# Calculate the square of the correlation between the predicted and actual values
cor(test_predictions, testData$permeability)^2
## [1] 0.4237932
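An alternative (a sketch; note the model was fit on the log scale, so here we compare against log-transformed permeability, which can give a slightly different value) is caret’s postResample, which returns RMSE, R-squared, and MAE in one call:
# RMSE, R-squared and MAE for the test set, computed on the log scale
postResample(pred = test_predictions, obs = log(testData$permeability))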
Try building other models discussed in this chapter. Do any have better predictive performance?
We tried three additional models: ridge, lasso, and elastic-net, and all of them had better predictive performance using R-squared on the test data as the metric.
The ridge model ultimately performed the best, and this may be due to our initial efforts in the preprocessing stage to remove degenerate and collinear predictors: ridge works best when most of the remaining predictors carry some signal, since it shrinks coefficients without dropping any of them.
Using a grid search we arrive at an optimal ridge lambda of 0.22 and a resampled R-squared of 0.6084128.
The ridge model returned an R-squared of 0.4810941 on the test data.
# Trying the Ridge model
ridgeGrid <- data.frame(.lambda = seq(.1,.3, length=11))
set.seed(222)
ridgeRegFit <- train(
  x = trainData[, -1],
  y = log(trainData$permeability),
  method = "ridge",
  tuneGrid = ridgeGrid,
  trControl = train_control,
  preProc = c("center", "scale")
)
ridgeRegFit
## Ridge Regression
##
## 133 samples
## 55 predictor
##
## Pre-processing: centered (55), scaled (55)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 121, 119, 119, 121, 120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.10 1.042469 0.5866936 0.8671657
## 0.12 1.035862 0.5921830 0.8607188
## 0.14 1.031135 0.5966770 0.8559178
## 0.16 1.027821 0.6004057 0.8524658
## 0.18 1.025610 0.6035308 0.8497191
## 0.20 1.024287 0.6061703 0.8478302
## 0.22 1.023695 0.6084128 0.8469614
## 0.24 1.023717 0.6103262 0.8466635
## 0.26 1.024263 0.6119638 0.8466136
## 0.28 1.025263 0.6133679 0.8467754
## 0.30 1.026661 0.6145726 0.8474149
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.22.
# Predict test data using ridge model
ridgePred <- predict(ridgeRegFit, newdata = testData[,-1])
# Determine R-squared for test data
cor(as.data.frame(ridgePred), testData$permeability)^2
## [,1]
## ridgePred 0.4810941
Using a grid search we arrive at an optimal lasso fraction of 0.22 (the lasso’s tuning parameter in caret is the fraction of the full solution, not lambda) and a resampled R-squared of 0.6051579.
The lasso model returned an R-squared of 0.4617224 on the test data.
# Trying the Lasso model
lassoGrid <- data.frame(fraction = seq(.1,.3, length=11))
set.seed(322)
lassoRegFit <- train(
  x = trainData[, -1],
  y = log(trainData$permeability),
  method = "lasso",
  tuneGrid = lassoGrid,
  trControl = train_control,
  preProc = c("center", "scale")
)
lassoRegFit
## The lasso
##
## 133 samples
## 55 predictor
##
## Pre-processing: centered (55), scaled (55)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 120, 119, 119, 120, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.10 1.0145171 0.5622959 0.8175052
## 0.12 0.9984494 0.5720645 0.8026214
## 0.14 0.9855996 0.5843299 0.7994443
## 0.16 0.9794831 0.5902332 0.8040254
## 0.18 0.9728947 0.5968510 0.8063597
## 0.20 0.9672096 0.6019765 0.8042530
## 0.22 0.9646317 0.6051579 0.8039533
## 0.24 0.9679313 0.6051342 0.8086465
## 0.26 0.9756211 0.6024830 0.8177834
## 0.28 0.9841088 0.5985019 0.8269686
## 0.30 0.9937083 0.5941721 0.8365533
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.22.
# Predict test data using the lasso model
lassoPred <- predict(lassoRegFit, newdata = testData[,-1])
# Determine R-squared for test data
cor(as.data.frame(lassoPred), testData$permeability)^2
## [,1]
## lassoPred 0.4617224
Using a grid search we arrive at optimal elastic-net parameters of lambda = 0.55 (the ridge penalty) and fraction = 0.40 (the lasso fraction), with a resampled R-squared of 0.5979077.
The elastic-net model returned an R-squared of 0.4385897 on the test data.
# Trying the Elastic-net model
enetGrid <- expand.grid(.fraction = seq(.3,.5, length=5), .lambda = seq(.4,.6, length=5))
set.seed(422)
enetRegFit <- train(
  x = trainData[, -1],
  y = log(trainData$permeability),
  method = "enet",
  tuneGrid = enetGrid,
  trControl = train_control,
  preProc = c("center", "scale")
)
enetRegFit
## Elasticnet
##
## 133 samples
## 55 predictor
##
## Pre-processing: centered (55), scaled (55)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 120, 120, 120, 120, 120, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.40 0.30 1.0305831 0.5723879 0.8248979
## 0.40 0.35 0.9932371 0.5892383 0.7877729
## 0.40 0.40 0.9756366 0.5961431 0.7751026
## 0.40 0.45 0.9805989 0.5907875 0.7837304
## 0.40 0.50 0.9913728 0.5850508 0.7932941
## 0.45 0.30 1.0315541 0.5715499 0.8256755
## 0.45 0.35 0.9939202 0.5888437 0.7882235
## 0.45 0.40 0.9749147 0.5970255 0.7733718
## 0.45 0.45 0.9787254 0.5928965 0.7814887
## 0.45 0.50 0.9918978 0.5859708 0.7926742
## 0.50 0.30 1.0320147 0.5708428 0.8258400
## 0.50 0.35 0.9943240 0.5884560 0.7883018
## 0.50 0.40 0.9745769 0.5975387 0.7723703
## 0.50 0.45 0.9775398 0.5946432 0.7798136
## 0.50 0.50 0.9926432 0.5868816 0.7923254
## 0.55 0.30 1.0320526 0.5703302 0.8254835
## 0.55 0.35 0.9945049 0.5881268 0.7881579
## 0.55 0.40 0.9743835 0.5979077 0.7718374
## 0.55 0.45 0.9772675 0.5956860 0.7790713
## 0.55 0.50 0.9936671 0.5877067 0.7927737
## 0.60 0.30 1.0317889 0.5699474 0.8246873
## 0.60 0.35 0.9945134 0.5878235 0.7878573
## 0.60 0.40 0.9745012 0.5979531 0.7715564
## 0.60 0.45 0.9773159 0.5964559 0.7788176
## 0.60 0.50 0.9948644 0.5884861 0.7933127
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.55.
# Predict test data using the elastic-net model
enetPred <- predict(enetRegFit, newdata = testData[,-1])
# Determine R-squared for test data
cor(as.data.frame(enetPred), testData$permeability)^2
## [,1]
## enetPred 0.4385897
Would you recommend any of your models to replace the permeability laboratory experiment?
Based on these results we appear to be looking at a relationship that is at least partly non-linear, so there are limits to how well a linear model can explain it. Any recommendation of a linear model in this context would mainly be to explain potential drivers of the data’s behavior, and we should switch to nonlinear models to continue exploring this data.
Of the three penalized regression models, ridge regression returned the highest R-squared on the test data, 0.4810941, so we would recommend the ridge model over the PLS model initially fit; however, there is still room for further improvement.
I looked in Section 1.4 for a description of the permeability laboratory experiment and for an R-squared or RMSE in the text to compare against, but couldn’t find one; there may be an additional benchmark to consider that is not modeled here.
I’ll save it for a future exercise, but below I include the R-squared values for predictions on the test data when we don’t take the log of permeability while fitting the models. We see a substantial improvement in the penalized regression models: 0.5733212 instead of 0.4810941 for the ridge model. Maybe taking the log of our data caused it to overfit, or we need to confirm these results by trying different train-test splits.
| Model | Test data R-squared (no log) | Test data R-squared (log) |
|:-----:|:----------------------------:|:-------------------------:|
| Ridge | 0.5733 | 0.4811 |
| Lasso | 0.5546 | 0.4617 |
| Elastic-Net | 0.4677 | 0.4386 |
| PLS | 0.4046 | 0.4238 |
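A sketch of how the no-log column could be reproduced for the ridge model, reusing the grid, resampling control, and train-test split defined above (the lasso and elastic-net models follow the same pattern):
# Refit ridge on the raw (untransformed) response and recompute the test-set R-squared
set.seed(222)
ridgeRegFitRaw <- train(
  x = trainData[, -1],
  y = trainData$permeability,   # raw response, no log
  method = "ridge",
  tuneGrid = ridgeGrid,
  trControl = train_control,
  preProc = c("center", "scale")
)
ridgePredRaw <- predict(ridgeRegFitRaw, newdata = testData[, -1])
cor(ridgePredRaw, testData$permeability)^2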
A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs, while yield contains the percent yield for each run.
Here we load the data.
# Load the data
data("ChemicalManufacturingProcess")
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Here we impute the missing values using kNN and then check that the 106 missing values are filled in.
Note: this was a sticky step. Here is how I battled through.
There was no clear example in Section 3.8. Page 42 discussed using a K-nearest neighbor model to impute missing values. Section 3.8 showed that the impute package has a function impute.knn and that the preProcess function has imputation methods using kNN or bagged trees. impute is not available as a package in my version of R, so I’m going with preProcess. I found the method type (“knnImpute”) from ??caret::preProcess (there are also bagImpute and medianImpute) and finally connected that I had to follow the trans <- preProcess(datawMissing, method = c("knnImpute")) then data <- predict(trans, datawMissing) format. Lastly I had to install RANN (it has fast k-nearest neighbor search algorithms) to run the knnImpute method in preProcess.
# There are 106 missing values
#sum(is.na(ChemicalManufacturingProcess))
# Produce the model that can impute values using kNN
imputeModel <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))
# Impute the missing values
CMP <- predict(imputeModel, ChemicalManufacturingProcess)
# No more missing values
#sum(is.na(CMP))
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Using elastic-net, our optimal RMSE is 0.6319838 with a corresponding R-squared of 0.5992708. This is with a lambda (ridge penalty) of 0.90 and a fraction (lasso tuning parameter) of 0.35.
Subsequent parts ask about which predictors have the biggest impact on the model and how. This would be done most easily with ordinary or penalized linear regression because we would have a coefficient for each predictor. I was curious to try it with PLS, but I could only find varImp(plsFit) as a way to see which variables had the biggest impact, and it couldn’t tell me whether the impact was positive or negative.
Here we split the data into a training and a test set using an 80-20 split.
# Split data
set.seed(522)
trainingRows2 <- createDataPartition(CMP[,1], p = .80, list=FALSE)
trainPredictors2 <- CMP[trainingRows2,-1]
trainTarget2 <- CMP[trainingRows2,1]
testPredictors2 <- CMP[-trainingRows2,-1]
testTarget2 <- CMP[-trainingRows2,1]
In addition to the imputation of missing data done previously, we center and scale the variables and remove any predictors (eleven of them) that have near-zero variance or are too collinear.
Using elastic-net, our optimal RMSE is 0.6319838 with a corresponding R-squared of 0.5992708. This is with a lambda (ridge penalty) of 0.90 and a fraction of 0.35.
#Excluded because using same definition as in Exercise 6.2
#train_control <- trainControl(method = "cv", number = 10)
# Fitting an elastic-net model
enetGrid2 <- expand.grid(.fraction = seq(.3,.5, length=5), .lambda = seq(.8,1, length=5))
set.seed(622)
enetRegFit2 <- train(
x = trainPredictors2,
y = trainTarget2,
method = "enet",
tuneGrid = enetGrid2,
trControl = train_control,
preProc = c("center","scale","nzv","corr"))
enetRegFit2
## Elasticnet
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (46), scaled (46), remove (11)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 131, 130, 129, 130, 129, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.80 0.30 0.6419498 0.6006973 0.5326539
## 0.80 0.35 0.6321112 0.5997685 0.5172543
## 0.80 0.40 0.6378945 0.5972613 0.5207585
## 0.80 0.45 0.6470469 0.5983609 0.5251385
## 0.80 0.50 0.6558894 0.6007408 0.5288848
## 0.85 0.30 0.6415661 0.6003010 0.5317892
## 0.85 0.35 0.6320170 0.5995313 0.5168066
## 0.85 0.40 0.6381209 0.5972205 0.5210214
## 0.85 0.45 0.6487120 0.5979212 0.5269532
## 0.85 0.50 0.6585650 0.6003136 0.5312913
## 0.90 0.30 0.6412076 0.5999004 0.5313458
## 0.90 0.35 0.6319838 0.5992708 0.5163612
## 0.90 0.40 0.6384923 0.5971439 0.5213966
## 0.90 0.45 0.6503998 0.5975653 0.5285233
## 0.90 0.50 0.6614089 0.5998798 0.5338540
## 0.95 0.30 0.6408650 0.5994802 0.5309382
## 0.95 0.35 0.6320066 0.5989931 0.5159858
## 0.95 0.40 0.6389794 0.5970669 0.5217422
## 0.95 0.45 0.6521483 0.5972605 0.5299552
## 0.95 0.50 0.6644577 0.5994047 0.5365967
## 1.00 0.30 0.6405284 0.5990653 0.5304803
## 1.00 0.35 0.6320611 0.5987481 0.5156063
## 1.00 0.40 0.6395563 0.5969996 0.5221203
## 1.00 0.45 0.6540158 0.5969182 0.5313423
## 1.00 0.50 0.6676725 0.5989283 0.5396577
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.35 and lambda = 0.9.
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
The R-squared of the prediction on the test set is 0.5265838 compared to the resampled R-squared on the training set of 0.5992708. Additionally, the RMSE on the test set is 0.6835943 compared to the training set RMSE of 0.6319838.
Since the RMSE is higher and the R-squared is lower on the test data, the model may be somewhat overfit, and we should explore different models or additional preprocessing options.
# Predict test data using the elastic-net model
enetPred2 <- predict(enetRegFit2, newdata = testPredictors2)
# Determine R-squared for test data
test_r2 <- cor(as.data.frame(enetPred2), testTarget2)^2
# Determine RMSE for test data
test_rmse <- sqrt(mean((enetPred2 - testTarget2)^2))
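For readability, a small table (a sketch reusing the values just computed) displays the test-set metrics:
# Display the test-set RMSE and R-squared
kable(data.frame(RMSE = test_rmse, Rsquared = as.numeric(test_r2)), align = "c")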
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
I expected to be able to see the model coefficients, as in an ordinary linear regression summary, but they don’t seem to be reported directly.
Instead I found the “Sequence of moves”, which represents the order in which variables entered the elastic-net regression model. We can reasonably assume that a variable entered earlier is more influential than one entered later.
Of the first ten predictors entered into the elastic-net model, eight were manufacturing process variables, and the six most important predictors for yield were all manufacturing processes.
# try to find information on variables.
enetRegFit2$finalModel
##
## Call:
## elasticnet::enet(x = as.matrix(x), y = y, lambda = param$lambda)
## Sequence of moves:
## ManufacturingProcess32 ManufacturingProcess13 ManufacturingProcess09
## Var 35 21 17
## Step 1 2 3
## ManufacturingProcess36 ManufacturingProcess17 ManufacturingProcess06
## Var 39 25 14
## Step 4 5 6
## BiologicalMaterial06 BiologicalMaterial03 ManufacturingProcess11
## Var 4 2 19
## Step 7 8 9
## ManufacturingProcess33 BiologicalMaterial11 BiologicalMaterial08
## Var 36 8 5
## Step 10 11 12
## ManufacturingProcess12 ManufacturingProcess34 ManufacturingProcess39
## Var 20 37 42
## Step 13 14 15
## ManufacturingProcess30 ManufacturingProcess37 ManufacturingProcess44
## Var 34 40 45
## Step 16 17 18
## ManufacturingProcess24 ManufacturingProcess15 BiologicalMaterial05
## Var 31 23 3
## Step 19 20 21
## ManufacturingProcess45 ManufacturingProcess43 BiologicalMaterial01
## Var 46 44 1
## Step 22 23 24
## ManufacturingProcess10 ManufacturingProcess07 ManufacturingProcess20
## Var 18 15 27
## Step 25 26 27
## ManufacturingProcess19 ManufacturingProcess21 ManufacturingProcess35
## Var 26 28 38
## Step 28 29 30
## ManufacturingProcess28 ManufacturingProcess05 ManufacturingProcess04
## Var 33 13 12
## Step 31 32 33
## ManufacturingProcess16 ManufacturingProcess01 ManufacturingProcess40
## Var 24 9 43
## Step 34 35 36
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess02
## Var 29 30 10
## Step 37 38 39
## BiologicalMaterial09 ManufacturingProcess08 ManufacturingProcess03
## Var 6 16 11
## Step 40 41 42
## BiologicalMaterial10 ManufacturingProcess14 ManufacturingProcess38
## Var 7 22 41
## Step 43 44 45
## ManufacturingProcess26
## Var 32 47
## Step 46 47
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
I found the final coefficients for our model, and they confirm what we found in the prior section using the order of predictor entry. There are 13 predictors in the final model, nine of which are manufacturing processes and four of which are biological materials. The top six, sorted by absolute value, are all manufacturing processes.
Three of the manufacturing processes have negative coefficients. It’s possible these steps need to be adjusted to reduce their negative impact on final product yield.
If we’re spending money on biological materials other than 06, 03, 11, or 08, then perhaps we should shift those expenditures toward the materials that have had a demonstrable, positive association with yield.
ManufacturingProcess32 has the most significant impact on yield. A/B testing changes in this process would be the most likely way to garner a kaizen increase in yield (and to validate whether the predictor does indeed drive yield improvements!).
# Find and display non-zero coefficients
coes <- predict.enet(enetRegFit2$finalModel, s=enetRegFit2$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients
non_zero_coefs <- coes[coes != 0]
sorted_coefs <- sort(non_zero_coefs, decreasing = TRUE)
print(sorted_coefs)
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess06
## 0.285046129 0.171770522 0.094278734
## BiologicalMaterial06 ManufacturingProcess11 BiologicalMaterial03
## 0.049009044 0.040148179 0.039423886
## ManufacturingProcess33 BiologicalMaterial11 BiologicalMaterial08
## 0.029209661 0.019739564 0.007307787
## ManufacturingProcess12 ManufacturingProcess17 ManufacturingProcess36
## 0.001930146 -0.118650286 -0.136463539
## ManufacturingProcess13
## -0.179598257
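One way to explore these relationships directly (a sketch, assuming the imputed CMP data frame and the ggplot2/gridExtra packages loaded earlier; values are centered and scaled because of the knnImpute preprocessing) is to plot the most influential predictors against yield:
# Scatter plots of the two largest-magnitude coefficients' predictors against Yield
p32 <- ggplot(CMP, aes(x = ManufacturingProcess32, y = Yield)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()
p13 <- ggplot(CMP, aes(x = ManufacturingProcess13, y = Yield)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()
grid.arrange(p32, p13, ncol = 2)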
These exercises come from ‘Applied Predictive Modeling’ by Max Kuhn and Kjell Johnson.