• Exercise 6.2
  • Exercise 6.3

Exercises from Chapter 6 of textbook Applied Predictive Modeling by Kuhn & Johnson

Exercise 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data. The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

data(permeability)
fingerprints <- as.data.frame(fingerprints)

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

After creating a new dataframe fingerprints_thinned that shows the dataframe after nearZeroVar is used, there are 388 predictor variables remaining.

#remove predictors with low frequencies
fingerprints_thinned <- fingerprints[, -nearZeroVar(fingerprints)]

#check new predictor count
ncol(fingerprints_thinned)
## [1] 388

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

#add permeability/target varaible
dataset <- cbind(data.frame(permeability), fingerprints_thinned)

#split train/test
set.seed(3190)
sample_set <- sample(nrow(dataset), round(nrow(dataset)*0.75), replace = FALSE)
fingerprints_train <- dataset[sample_set, ]
fingerprints_test <- dataset[-sample_set, ]

It looks like 7 is the optimal number of latent variables.

# build the PLS model
pls_model <- train(fingerprints_train[,-1],
                   fingerprints_train$permeability,
                   method = "pls", 
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))

#plot RMSE of ncomps
plot(pls_model$results$RMSE)

In looking at the full results, I see the low RMSE value of 10.142 on ncomp 7 as identified above, and the associated R-squared value is 0.608.

pls_model$results
##    ncomp     RMSE  Rsquared       MAE   RMSESD RsquaredSD    MAESD
## 1      1 13.48118 0.3324390 10.387461 3.542885  0.2538493 2.924443
## 2      2 11.73656 0.5044877  8.620059 4.070957  0.2451729 3.145744
## 3      3 11.49670 0.5014269  8.615089 3.146592  0.1555225 2.359770
## 4      4 10.95246 0.5648879  8.680918 2.863782  0.1592836 2.471589
## 5      5 10.59649 0.5851518  8.197448 2.996456  0.1732325 2.315941
## 6      6 10.49329 0.5884128  8.126480 2.637389  0.1555512 2.048783
## 7      7 10.14244 0.6077336  7.894780 2.702364  0.1571095 2.179632
## 8      8 10.15322 0.6075330  7.889303 2.464001  0.1463963 2.160220
## 9      9 10.42128 0.5844003  8.044052 2.673910  0.1578883 2.336455
## 10    10 10.70107 0.5738913  8.121887 2.922505  0.1771802 2.430940

(d) Predict the response for the test set. What is the test set estimate of R2?

Taking a look at the predictions from this model in comparison to the test set, I see an R-squared of 0.355 meaning the model can account for 35.5% of the variance in the data. The RMSE is 14.131.

pred <- predict(pls_model, fingerprints_test)

postResample(pred = pred, obs = fingerprints_test$permeability)
##       RMSE   Rsquared        MAE 
## 14.1307091  0.3547684 10.3893889

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

I’ll try a PCR model first. This has a worse R-Squared which means this model can only account for 28.5% of the variance in the dataset. The RMSE is lower than the PLS though.

pcr_model <- train(fingerprints_train[,-1],
                   fingerprints_train$permeability,
                   method = "pcr", 
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))

pred <- predict(pcr_model, fingerprints_test)

postResample(pred = pred, obs = fingerprints_test$permeability)
##       RMSE   Rsquared        MAE 
## 12.1046996  0.2816046  8.5982847

Finally I try an elastic net model, which has a better than PCR but worse the PLS R-squared of 33.9% of variance in the dataset explained. This has an RMSE between the two previous models as well, at 13.958.

enet_model <- train(fingerprints_train[,-1],
                   fingerprints_train$permeability,
                   method = "enet", 
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))

pred <- predict(enet_model, fingerprints_test)

postResample(pred = pred, obs = fingerprints_test$permeability)
##       RMSE   Rsquared        MAE 
## 13.9583366  0.3391313  9.7450271

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

This is a little tricky, as while the PLS has the best R-squared (highest) it has the worst RMSE value. The PCR model has the lower RMSE but quite a bad R-squared at 28.2%. I believe I would choose the original model, as the elastic net model has only a slightly lower RMSE but also loses about 1.5% points of R-Squared.

Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors),measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand,manufacturing process predictors can be changed in the manufacturing process.Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data. The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors)for the 176 manufacturing runs. yield contains the percent yield for each run.

data(ChemicalManufacturingProcess)
chem <- as.data.frame(ChemicalManufacturingProcess)

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

I use the RANN library to impute the missing values. When I do a skim of the new dataframe I see no missing values.

knn_imp <- preProcess(chem, "knnImpute")
chem_imp <- predict(knn_imp, chem)

skim(chem_imp)
Data summary
Name chem_imp
Number of rows 176
Number of columns 58
_______________________
Column type frequency:
numeric 58
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Yield 0 1 0.00 1.00 -2.67 -0.77 -0.11 0.70 3.34 ▁▇▇▃▁
BiologicalMaterial01 0 1 0.00 1.00 -2.57 -0.61 -0.15 0.64 3.36 ▂▇▇▂▁
BiologicalMaterial02 0 1 0.00 1.00 -2.19 -0.75 -0.15 0.76 2.25 ▂▇▆▅▃
BiologicalMaterial03 0 1 0.00 1.00 -2.68 -0.68 -0.12 0.68 2.64 ▂▅▇▆▁
BiologicalMaterial04 0 1 0.00 1.00 -1.67 -0.62 -0.14 0.49 6.05 ▇▆▁▁▁
BiologicalMaterial05 0 1 0.00 1.00 -2.91 -0.74 -0.06 0.71 3.39 ▁▅▇▃▁
BiologicalMaterial06 0 1 0.00 1.00 -2.22 -0.76 -0.12 0.65 2.79 ▂▇▆▅▁
BiologicalMaterial07 0 1 0.00 1.00 -0.13 -0.13 -0.13 -0.13 7.57 ▇▁▁▁▁
BiologicalMaterial08 0 1 0.00 1.00 -2.39 -0.64 0.02 0.57 2.43 ▁▅▇▃▂
BiologicalMaterial09 0 1 0.00 1.00 -3.40 -0.60 -0.04 0.67 2.96 ▁▃▇▅▁
BiologicalMaterial10 0 1 0.00 1.00 -1.72 -0.57 -0.15 0.32 6.79 ▇▅▁▁▁
BiologicalMaterial11 0 1 0.00 1.00 -2.31 -0.65 -0.18 0.55 2.44 ▂▆▇▃▂
BiologicalMaterial12 0 1 0.00 1.00 -2.39 -0.61 -0.10 0.71 2.60 ▂▆▇▃▂
ManufacturingProcess01 0 1 0.00 1.00 -6.15 -0.22 0.11 0.50 1.59 ▁▁▁▅▇
ManufacturingProcess02 0 1 0.01 0.99 -1.97 0.31 0.51 0.57 0.69 ▂▁▁▁▇
ManufacturingProcess03 0 1 0.04 0.97 -3.11 -0.43 0.38 0.47 2.70 ▁▂▆▇▁
ManufacturingProcess04 0 1 0.00 1.00 -3.32 -0.61 0.34 0.66 2.25 ▁▂▃▇▂
ManufacturingProcess05 0 1 0.00 1.00 -2.58 -0.49 -0.09 0.23 5.69 ▁▇▁▁▁
ManufacturingProcess06 0 1 -0.01 1.00 -1.63 -0.63 -0.22 0.48 7.41 ▇▃▁▁▁
ManufacturingProcess07 0 1 0.00 1.00 -0.96 -0.96 -0.96 1.04 1.04 ▇▁▁▁▇
ManufacturingProcess08 0 1 0.00 1.00 -1.11 -1.11 0.89 0.89 0.89 ▆▁▁▁▇
ManufacturingProcess09 0 1 0.00 1.00 -4.38 -0.50 0.05 0.55 2.39 ▁▁▅▇▂
ManufacturingProcess10 0 1 0.02 0.99 -2.19 -0.62 -0.10 0.55 3.16 ▂▇▇▂▁
ManufacturingProcess11 0 1 0.03 1.00 -2.63 -0.54 0.02 0.72 2.95 ▁▇▇▆▁
ManufacturingProcess12 0 1 0.00 1.00 -0.48 -0.48 -0.48 -0.48 2.07 ▇▁▁▁▂
ManufacturingProcess13 0 1 0.00 1.00 -2.37 -0.60 0.09 0.68 4.03 ▃▆▇▁▁
ManufacturingProcess14 0 1 0.00 1.00 -2.80 -0.49 0.03 0.52 3.69 ▁▅▇▂▁
ManufacturingProcess15 0 1 0.00 1.00 -2.31 -0.50 -0.13 0.38 3.33 ▂▇▆▂▁
ManufacturingProcess16 0 1 0.00 1.00 -12.98 -0.01 0.06 0.15 0.81 ▁▁▁▁▇
ManufacturingProcess17 0 1 0.00 1.00 -2.44 -0.68 0.05 0.61 4.53 ▂▇▆▁▁
ManufacturingProcess18 0 1 0.00 1.00 -13.09 0.01 0.07 0.14 0.44 ▁▁▁▁▇
ManufacturingProcess19 0 1 0.00 1.00 -3.03 -0.60 -0.14 0.48 2.58 ▁▃▇▃▂
ManufacturingProcess20 0 1 0.00 1.00 -13.06 -0.01 0.07 0.15 0.58 ▁▁▁▁▇
ManufacturingProcess21 0 1 0.00 1.00 -2.10 -0.56 -0.17 0.21 4.84 ▂▇▂▁▁
ManufacturingProcess22 0 1 0.00 1.00 -1.62 -0.72 -0.12 0.78 1.98 ▇▇▇▅▅
ManufacturingProcess23 0 1 0.00 1.00 -1.81 -0.61 -0.01 0.59 1.79 ▇▆▇▆▇
ManufacturingProcess24 0 1 0.01 1.00 -1.52 -0.83 -0.14 0.89 2.44 ▇▇▅▆▁
ManufacturingProcess25 0 1 0.00 0.99 -12.93 0.00 0.07 0.13 0.43 ▁▁▁▁▇
ManufacturingProcess26 0 1 0.00 0.99 -12.94 0.01 0.06 0.12 0.31 ▁▁▁▁▇
ManufacturingProcess27 0 1 0.00 0.99 -12.89 0.00 0.07 0.13 0.42 ▁▁▁▁▇
ManufacturingProcess28 0 1 -0.03 1.00 -1.26 -1.26 0.73 0.78 0.94 ▅▁▁▁▇
ManufacturingProcess29 0 1 0.00 0.99 -12.03 -0.19 -0.07 0.23 1.20 ▁▁▁▁▇
ManufacturingProcess30 0 1 0.01 0.99 -9.39 -0.37 0.04 0.55 2.09 ▁▁▁▅▇
ManufacturingProcess31 0 1 0.00 0.99 -12.63 -0.02 0.11 0.22 0.42 ▁▁▁▁▇
ManufacturingProcess32 0 1 0.00 1.00 -2.87 -0.64 -0.09 0.65 2.69 ▁▃▇▃▁
ManufacturingProcess33 0 1 -0.01 0.99 -3.04 -0.62 0.18 0.59 2.60 ▁▃▇▅▁
ManufacturingProcess34 0 1 -0.01 0.99 -3.56 0.12 0.12 0.12 1.96 ▁▂▁▇▁
ManufacturingProcess35 0 1 -0.02 1.00 -3.01 -0.52 -0.06 0.50 2.44 ▁▃▇▅▂
ManufacturingProcess36 0 1 -0.01 0.99 -2.94 -0.66 -0.08 0.49 2.78 ▂▇▁▇▃
ManufacturingProcess37 0 1 0.00 1.00 -2.28 -0.70 -0.03 0.64 2.89 ▂▇▇▃▁
ManufacturingProcess38 0 1 0.00 1.00 -3.90 -0.82 0.72 0.72 0.72 ▁▁▁▅▇
ManufacturingProcess39 0 1 0.00 1.00 -4.55 0.17 0.23 0.30 0.43 ▁▁▁▁▇
ManufacturingProcess40 0 1 0.00 1.00 -0.46 -0.46 -0.46 -0.46 2.15 ▇▁▁▁▂
ManufacturingProcess41 0 1 0.00 1.00 -0.44 -0.44 -0.44 -0.44 3.28 ▇▁▁▁▁
ManufacturingProcess42 0 1 0.00 1.00 -5.77 0.10 0.20 0.25 0.46 ▁▁▁▁▇
ManufacturingProcess43 0 1 0.00 1.00 -1.05 -0.36 -0.13 0.13 11.62 ▇▁▁▁▁
ManufacturingProcess44 0 1 0.00 1.00 -5.61 -0.02 0.29 0.29 0.92 ▁▁▁▁▇
ManufacturingProcess45 0 1 0.00 1.00 -5.25 -0.09 0.15 0.40 1.14 ▁▁▁▂▇

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

#split train/test
set.seed(3190)
sample_set <- sample(nrow(chem_imp), round(nrow(chem_imp)*0.75), replace = FALSE)
chem_train <- chem_imp[sample_set, ]
chem_test <- chem_imp[-sample_set, ]

I’ll build a PLS model for this dataset. I see the optimal value of ncomps is 3 in this case.

# build the PLS model
pls_model <- train(chem_train[,-1],
                   chem_train$Yield,
                   method = "pls", 
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))

#plot RMSE of ncomps
plot(pls_model$results$RMSE)

Further, at the level of 3 the RMSE is 0.604 and the R-squared states the model can account for 63.46% of the variance in the dataset.

pls_model$results
##    ncomp      RMSE  Rsquared       MAE     RMSESD RsquaredSD      MAESD
## 1      1 0.7001241 0.5177166 0.5793990 0.11519441  0.1728657 0.10464902
## 2      2 0.6174564 0.5998932 0.5162735 0.11499137  0.1512413 0.09839459
## 3      3 0.6035627 0.6345528 0.5088571 0.08866118  0.1364007 0.08299434
## 4      4 0.6105764 0.6445618 0.5086705 0.08645226  0.1436748 0.08094136
## 5      5 0.6241008 0.6388256 0.5235059 0.09492814  0.1416651 0.09419574
## 6      6 0.6323922 0.6353155 0.5195763 0.08685005  0.1340769 0.09081020
## 7      7 0.6442984 0.6372570 0.5272632 0.08708980  0.1325100 0.08681575
## 8      8 0.6558119 0.6356937 0.5391171 0.08788075  0.1367980 0.08201953
## 9      9 0.6646960 0.6300397 0.5510012 0.09263265  0.1394335 0.08702770
## 10    10 0.6796563 0.6148485 0.5540910 0.09946885  0.1462734 0.09034939

(d) Predict the response for the test set.What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

Now considering the test data, the RMSE is 0.73 with a lower R-squared of 48.6%.

pred <- predict(pls_model, chem_test)

postResample(pred = pred, obs = chem_test$Yield)
##      RMSE  Rsquared       MAE 
## 0.7298442 0.4862441 0.5645561

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

I found an interesting function varImp() that can be used to show the importance of variables in different model types. With regard to PLS the score is calculated by: “the variable importance measure here is based on weighted sums of the absolute regression coefficients. The weights are a function of the reduction of the sums of squares across the number of PLS components and are computed separately for each outcome. Therefore, the contribution of the coefficients are weighted proportionally to the reduction in the sums of squares.”

More here: https://topepo.github.io/caret/variable-importance.html

In my case, it appears the top 5 predictors are process, followed by 2 biologic, and then another 3 process - so the process measures are dominating in this model.

varImp(pls_model)
## pls variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   88.55
## ManufacturingProcess36   86.82
## ManufacturingProcess13   80.74
## ManufacturingProcess17   78.58
## BiologicalMaterial02     68.25
## BiologicalMaterial06     66.27
## ManufacturingProcess11   65.49
## ManufacturingProcess06   60.47
## ManufacturingProcess33   60.40
## BiologicalMaterial08     58.44
## BiologicalMaterial12     58.34
## BiologicalMaterial03     55.59
## BiologicalMaterial04     53.11
## BiologicalMaterial11     53.03
## BiologicalMaterial01     50.44
## ManufacturingProcess15   47.97
## ManufacturingProcess12   47.42
## ManufacturingProcess04   46.70
## ManufacturingProcess34   44.71

(f) Explore the relationships between each of the top predictors and the response.How could this information be helpful in improving yield in future runs of the manufacturing process?

Going back to the original dataset and selecting just the top 5 predictors from above, I build a correlation plot. There appear to be strong positive correlations between Yield and ManufacturingProcess32, and negative correlations between Yield and ManufacturingProcess36, ManufacturingProcess13, and ManufacturingProcess17.

chem_top5 <- chem_imp %>%
  select(c(ManufacturingProcess32, ManufacturingProcess09, ManufacturingProcess36, ManufacturingProcess13, ManufacturingProcess17,Yield))

correlation <- cor(chem_top5)


# plot correlations
corrplot.mixed(correlation, tl.col = 'black', tl.pos = 'lt', 
               number.cex= .9, tl.cex = .7)

There are many methods for combing variables that could be used to leverage this information, further the entire dataset could be checked for high correlation amid the predictor variables, with redundant variables removed.