Questions 6.2 and 6.3 from the Kuhn & Johnson book
library(caret)
library(corrplot)
library(dplyr)
library(data.table)
library(e1071)
library(fable)
library(forecast)
library(fpp2)
library(fpp3)
library(ggplot2)
library(GGally)
library(knitr)
library(latex2exp)
library(mlbench)
library(MASS)
library(purrr)
library(pls)
library(patchwork)
library(RANN)
library(stats)
library(tsibble)
library(tidyr)
(a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)
head(permeability)
## permeability
## 1 12.520
## 2 1.120
## 3 19.405
## 4 1.730
## 5 1.680
## 6 0.510
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
dim(fingerprints)
## [1] 165 1107
high_freq <- fingerprints[, -nearZeroVar(fingerprints)]
dim(high_freq)
## [1] 165 388
Filtering the predictors with low frequencies takes us from 1107 to 388 predictors.
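To see why so many columns are dropped, nearZeroVar can also return its diagnostics with saveMetrics = TRUE; a minimal sketch of that check:
# Inspect the near-zero-variance diagnostics instead of just the column indices
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics)    # freqRatio, percentUnique, zeroVar, nzv for each predictor
sum(nzv_metrics$nzv) # number of predictors flagged as near-zero variance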
(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
set.seed(12345)
# Train Test Split
index <- createDataPartition(permeability, p = .8, list = FALSE)
X_train <- high_freq[index, ]
X_test <- high_freq[-index, ]
y_train <- permeability[index, ]
y_test <- permeability[-index, ]
# Check for missing values
any(is.na(X_train)) | any(is.na(X_test))
## [1] FALSE
dim(X_train)
## [1] 133 388
dim(X_test)
## [1] 32 388
# Fit PLS model
ctrl <- trainControl(method = "cv", number = 10)
plsTune <- train(X_train,
y_train,
method = "pls",
tuneLength = 20,
trControl = ctrl,
preProc = c("center", "scale")
)
plot(plsTune)
# View component breakdown
plsTune
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 118, 119, 121, 120, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.71638 0.3726546 9.803750
## 2 11.43482 0.4812433 8.287554
## 3 11.56869 0.4775294 8.635061
## 4 11.82234 0.4622131 9.000811
## 5 11.96746 0.4722405 9.004607
## 6 11.80342 0.4763650 8.712800
## 7 11.75711 0.4782825 8.794521
## 8 11.57750 0.4943179 8.741823
## 9 11.42685 0.5092455 8.573922
## 10 11.65230 0.5024878 8.601296
## 11 12.19187 0.4755751 8.866223
## 12 12.24185 0.4766699 8.927797
## 13 12.56914 0.4611196 9.122023
## 14 12.87329 0.4497174 9.246870
## 15 13.26387 0.4322739 9.658887
## 16 13.21115 0.4341617 9.699480
## 17 13.45490 0.4270706 9.834210
## 18 13.65659 0.4172002 10.008847
## 19 13.68960 0.4115089 10.210977
## 20 13.50473 0.4179000 10.122616
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 9.
After tuning the PLS model, the optimal number of latent components chosen was 9, and the corresponding resampled estimate of \(R^2\) is 0.5092455.
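Rather than reading the value off the printed table, the resampled estimate for the chosen tune can be pulled straight from the train object; a small sketch:
# Resampling results for the selected number of components
best_ncomp <- plsTune$bestTune$ncomp
subset(plsTune$results, ncomp == best_ncomp)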
(d) Predict the response for the test set. What is the test set estimate of \(R^2\)?
pls_pred <- predict(plsTune, X_test)
pls_values <- data.frame(obs = y_test, pred = pls_pred)
defaultSummary(pls_values)
## RMSE Rsquared MAE
## 12.1949589 0.5063305 9.2982092
The PLS prediction for the test set gives an \(R^2\) of 0.5063305, and an RMSE of 12.1949589.
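caret's postResample gives the same summary directly from the prediction and observation vectors; an equivalent one-liner:
# Same RMSE / Rsquared / MAE as defaultSummary, computed from the raw vectors
postResample(pred = pls_pred, obs = y_test)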
(e) Try building other models discussed in this chapter. Do any have better predictive performance?
# Other options include Ridge, Lasso, Elastic Net, or principal component regression (PCR) (pg. 109)
################################################################################
# Ridge
set.seed(12345)
# Grid Search for optimal Lambda to minimize RMSE.
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
ridgeRegFit <- train(X_train,
y_train,
method = "ridge",
tuneGrid = ridgeGrid,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale")
)
plot(ridgeRegFit)
ridge_pred <- predict(ridgeRegFit, X_test)
ridge_values <- data.frame(obs = y_test, pred = ridge_pred)
defaultSummary(ridge_values)
## RMSE Rsquared MAE
## 12.1620996 0.5235568 8.1476488
################################################################################
# Elastic Net
set.seed(12345)
# Penalty Grid
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
enetfit <- train(X_train,
y_train,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProc = c("center", "scale")
)
plot(enetfit)
enet_pred <- predict(enetfit, X_test)
enet_values <- data.frame(obs = y_test, pred = enet_pred)
defaultSummary(enet_values)
## RMSE Rsquared MAE
## 11.2847478 0.5456092 8.7630728
################################################################################
# PCR (principal component regression)
set.seed(12345)
pcafit <- train(X_train,
y_train,
method = "pcr",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength=20)
plot(pcafit)
pca_pred <- predict(pcafit, X_test)
pca_values <- data.frame(obs = y_test, pred = pca_pred)
defaultSummary(pca_values)
## RMSE Rsquared MAE
## 12.2442554 0.3829767 8.8667536
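To compare the four models side by side, the test-set metrics can be collected into one table; a small sketch reusing the prediction data frames created above:
# Test-set performance of each model in a single comparison table
test_metrics <- rbind(
PLS        = defaultSummary(pls_values),
Ridge      = defaultSummary(ridge_values),
ElasticNet = defaultSummary(enet_values),
PCR        = defaultSummary(pca_values)
)
round(test_metrics, 4)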
(f) Would you recommend any of your models to replace the permeability laboratory experiment?
On the test set, PLS has \(R^2\) = 0.5063305 and RMSE = 12.1949589; Ridge has \(R^2\) = 0.5235568 and RMSE = 12.1620996; Elastic Net has \(R^2\) = 0.5456092 and RMSE = 11.2847478; and PCR has \(R^2\) = 0.3829767 and RMSE = 12.2442554, making it the weakest of the four on both metrics. I'd recommend the Elastic Net model, as it offers the best trade-off, with a higher \(R^2\) and a lower RMSE than both PLS and Ridge regression.
(a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
head(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## 5 42.49 7.47 63.33 72.25
## 6 43.57 6.12 58.36 65.31
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## 5 14.02 17.91 54.66
## 6 15.17 21.79 51.23
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 100 16.66 11.44
## 2 100 19.04 12.55
## 3 100 19.04 12.55
## 4 100 19.04 12.55
## 5 100 18.22 12.80
## 6 100 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 3.46 138.09 18.83
## 2 3.46 153.67 21.05
## 3 3.46 153.67 21.05
## 4 3.46 153.67 21.05
## 5 3.05 147.61 21.05
## 6 3.78 151.88 20.76
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 NA NA NA
## 2 0.0 0 NA
## 3 0.0 0 NA
## 4 0.0 0 NA
## 5 10.7 0 NA
## 6 12.0 0 NA
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 NA NA NA
## 2 917 1032.2 210.0
## 3 912 1003.6 207.1
## 4 911 1014.6 213.3
## 5 918 1027.5 205.7
## 6 924 1016.8 208.9
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 NA NA 43.00
## 2 177 178 46.57
## 3 178 178 45.07
## 4 177 177 44.92
## 5 178 178 44.96
## 6 178 178 45.32
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 NA NA NA
## 2 NA NA 0
## 3 NA NA 0
## 4 NA NA 0
## 5 NA NA 0
## 6 NA NA 0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 35.5 4898 6108
## 2 34.0 4869 6095
## 3 34.8 4878 6087
## 4 34.8 4897 6102
## 5 34.6 4992 6233
## 6 34.0 4985 6222
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 4682 35.5 4865
## 2 4617 34.0 4867
## 3 4617 34.8 4877
## 4 4635 34.8 4872
## 5 4733 33.9 4886
## 6 4786 33.4 4862
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 6049 4665 0.0
## 2 6097 4621 0.0
## 3 6078 4621 0.0
## 4 6073 4611 0.0
## 5 6102 4659 -0.7
## 6 6115 4696 -0.6
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 NA NA NA
## 2 3 0 3
## 3 4 1 4
## 4 5 2 5
## 5 8 4 18
## 6 9 1 1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 4873 6074 4685
## 2 4869 6107 4630
## 3 4897 6116 4637
## 4 4892 6111 4630
## 5 4930 6151 4684
## 6 4871 6128 4687
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 10.7 21.0 9.9
## 2 11.2 21.4 9.9
## 3 11.1 21.3 9.4
## 4 11.1 21.3 9.4
## 5 11.3 21.6 9.0
## 6 11.4 21.7 10.1
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 69.1 156 66
## 2 68.7 169 66
## 3 69.3 173 66
## 4 69.3 171 68
## 5 69.4 171 70
## 6 68.2 173 70
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 2.4 486 0.019
## 2 2.6 508 0.019
## 3 2.6 509 0.018
## 4 2.5 496 0.018
## 5 2.5 468 0.017
## 6 2.5 490 0.018
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 0.5 3 7.2
## 2 2.0 2 7.2
## 3 0.7 2 7.2
## 4 1.2 2 7.2
## 5 0.2 2 7.3
## 6 0.4 2 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 NA NA 11.6
## 2 0.1 0.15 11.1
## 3 0.0 0.00 12.0
## 4 0.0 0.00 10.6
## 5 0.0 0.00 11.0
## 6 0.0 0.00 11.5
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 3.0 1.8 2.4
## 2 0.9 1.9 2.2
## 3 1.0 1.8 2.3
## 4 1.1 1.8 2.1
## 5 1.1 1.7 2.1
## 6 2.2 1.8 2.0
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. Yield contains the percent yield for each run.
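The 12/45 split quoted above can be confirmed directly from the column names; a quick sketch:
# Count the biological material vs. manufacturing process predictors
predictor_names <- setdiff(names(ChemicalManufacturingProcess), "Yield")
table(ifelse(grepl("^Biological", predictor_names), "Biological", "Process"))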
(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
# Impute missing values with k-nearest neighbors (caret's knnImpute also centers and scales the data)
CMP_impute <- ChemicalManufacturingProcess %>%
preProcess("knnImpute")
CMP_complete <- predict(CMP_impute, ChemicalManufacturingProcess)
# sum(is.na(CMP_complete))
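Before imputing, it is worth seeing which columns the 106 missing values fall in; a quick sketch:
# Missing-value count per column, showing only columns with at least one NA
na_counts <- colSums(is.na(ChemicalManufacturingProcess))
sort(na_counts[na_counts > 0], decreasing = TRUE)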
(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
dim(CMP_complete)
## [1] 176 58
# Remove near zero Variance predictors
high_freq <- CMP_complete[, -nearZeroVar(CMP_complete)]
dim(high_freq)
## [1] 176 57
# Train Test Split
set.seed(12345)
index <- createDataPartition(CMP_complete$Yield, p = .8, list = FALSE)
train_predictors <- CMP_complete[index,-1]
train_yield <- CMP_complete[index, 1]
test_predictors <- CMP_complete[-index,-1]
test_yield <- CMP_complete[-index, 1]
################################################################################
# PLS model (note: tuned on the full CMP_complete data, so the resampling summary below uses all 176 samples)
set.seed(12345)
ctrl <- trainControl(method = "cv", number = 10)
plsfit <- train(Yield ~ .,
CMP_complete ,
method = "pls",
tuneLength = 20,
trControl = ctrl,
preProc = c("center", "scale")
)
plot(plsfit)
# component breakdown
plsfit
## Partial Least Squares
##
## 176 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 159, 158, 158, 158, 156, 156, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.8070701 0.4590283 0.6363546
## 2 1.0476282 0.4608495 0.6578625
## 3 0.7485621 0.5414792 0.5733654
## 4 0.8707881 0.5056252 0.6101097
## 5 1.0820833 0.4911390 0.6604954
## 6 1.1413661 0.4879997 0.6694010
## 7 1.2866645 0.4596829 0.7163126
## 8 1.3283092 0.4638250 0.7292310
## 9 1.4990963 0.4501553 0.7754994
## 10 1.6064668 0.4409531 0.8065579
## 11 1.7073812 0.4338370 0.8391927
## 12 1.7701905 0.4250131 0.8612871
## 13 1.7937892 0.4231621 0.8687128
## 14 1.8565371 0.4204732 0.8863970
## 15 1.8655700 0.4251664 0.8791829
## 16 1.8786654 0.4252873 0.8806641
## 17 1.8964791 0.4218860 0.8876833
## 18 1.8948848 0.4176976 0.8884454
## 19 1.8905119 0.4184225 0.8882248
## 20 1.8638482 0.4204240 0.8835634
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
# Optimal value for ncomp = 3, RMSE = 0.7485621, R^2 = 0.5414792, MAE = 0.5733654
################################################################################
# Elastic Net
set.seed(12345)
# Penalty Grid
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
enetfit <- train(Yield ~ .,
CMP_complete ,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProc = c("center", "scale")
)
plot(enetfit)
# The final values used for the model were fraction = 0.1 and lambda = 0. RMSE = 0.6270263, R^2 = 0.6105673, MAE = 0.5124015
enetfit
## Elasticnet
##
## 176 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 159, 158, 158, 158, 156, 156, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 0.6737675 0.6025991 0.5478705
## 0.00 0.10 0.6270263 0.6105673 0.5124015
## 0.00 0.15 0.8037959 0.5382868 0.5625304
## 0.00 0.20 0.9830845 0.5022518 0.6177938
## 0.00 0.25 1.1143308 0.4835048 0.6571924
## 0.00 0.30 1.1966371 0.4715096 0.6831595
## 0.00 0.35 1.1842749 0.4735835 0.6868155
## 0.00 0.40 1.0839818 0.4881027 0.6668150
## 0.00 0.45 0.9790663 0.5200490 0.6447380
## 0.00 0.50 0.9584618 0.5315659 0.6392019
## 0.00 0.55 0.9976951 0.5017038 0.6633437
## 0.00 0.60 1.0494990 0.4737168 0.6841317
## 0.00 0.65 1.0682234 0.4592798 0.6946494
## 0.00 0.70 1.0885606 0.4439481 0.7046945
## 0.00 0.75 1.1289802 0.4339416 0.7187261
## 0.00 0.80 1.1419888 0.4201253 0.7250237
## 0.00 0.85 1.3398572 0.4000811 0.7758828
## 0.00 0.90 1.5275164 0.4004955 0.8217257
## 0.00 0.95 1.7033632 0.4063405 0.8624566
## 0.00 1.00 1.8963090 0.4138529 0.9037371
## 0.01 0.05 0.8271767 0.5825466 0.6716286
## 0.01 0.10 0.7000189 0.6047191 0.5707346
## 0.01 0.15 0.6487755 0.5976067 0.5323076
## 0.01 0.20 0.6524335 0.5845118 0.5228021
## 0.01 0.25 0.6562770 0.5877414 0.5159059
## 0.01 0.30 0.6951182 0.5762913 0.5229829
## 0.01 0.35 0.7410304 0.5663286 0.5363403
## 0.01 0.40 0.8069713 0.5430867 0.5651266
## 0.01 0.45 0.9088646 0.5072632 0.5978919
## 0.01 0.50 1.0310360 0.4898899 0.6295445
## 0.01 0.55 1.1239325 0.4809325 0.6551173
## 0.01 0.60 1.2093383 0.4726674 0.6812461
## 0.01 0.65 1.3044945 0.4639658 0.7099902
## 0.01 0.70 1.3853729 0.4578050 0.7331399
## 0.01 0.75 1.4330752 0.4535866 0.7478460
## 0.01 0.80 1.4654250 0.4503605 0.7595614
## 0.01 0.85 1.4996206 0.4474169 0.7712193
## 0.01 0.90 1.5258061 0.4446021 0.7804890
## 0.01 0.95 1.5441125 0.4421963 0.7877313
## 0.01 1.00 1.5617891 0.4402367 0.7945131
## 0.10 0.05 0.8917987 0.5241858 0.7238055
## 0.10 0.10 0.8018328 0.5936881 0.6528618
## 0.10 0.15 0.7279522 0.6012604 0.5940311
## 0.10 0.20 0.6730841 0.6071303 0.5525083
## 0.10 0.25 0.6510524 0.5960564 0.5356208
## 0.10 0.30 0.6505919 0.5868922 0.5240445
## 0.10 0.35 0.6497864 0.5885496 0.5192295
## 0.10 0.40 0.6737667 0.5794641 0.5247874
## 0.10 0.45 0.7244950 0.5708264 0.5363648
## 0.10 0.50 0.7688899 0.5669224 0.5467472
## 0.10 0.55 0.8180306 0.5636699 0.5592141
## 0.10 0.60 0.8817797 0.5467213 0.5852615
## 0.10 0.65 0.9521850 0.5262343 0.6083696
## 0.10 0.70 1.0255037 0.5098619 0.6314284
## 0.10 0.75 1.0834823 0.4987084 0.6504042
## 0.10 0.80 1.1259987 0.4909686 0.6639728
## 0.10 0.85 1.1618713 0.4855072 0.6756364
## 0.10 0.90 1.1901173 0.4819024 0.6848276
## 0.10 0.95 1.2102977 0.4786543 0.6918864
## 0.10 1.00 1.2256645 0.4758650 0.6973098
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.1 and lambda = 0.
Elastic Net performs better than PLS on this chemical manufacturing data under cross-validation. The final values used for the model were fraction = 0.1 and lambda = 0, giving RMSE = 0.6270263, \(R^2\) = 0.6105673, and MAE = 0.5124015.
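The resampled performance at each model's selected tune can also be extracted programmatically; a small sketch using caret's getTrainPerf:
# Resampled RMSE / Rsquared / MAE at the chosen tuning parameters
getTrainPerf(plsfit)
getTrainPerf(enetfit)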
(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
enet_pred <- predict(enetfit, test_predictors)
enet_values <- data.frame(obs = test_yield, pred = enet_pred)
defaultSummary(enet_values)
## RMSE Rsquared MAE
## 0.5254707 0.7166146 0.4108514
The metrics for the Elastic Net predictions on the test set are RMSE = 0.5254707, \(R^2\) = 0.7166146, and MAE = 0.4108514. The test-set error is lower, and the model explains more of the variance in yield, than the resampled training estimates suggested (RMSE = 0.6270263, \(R^2\) = 0.6105673).
(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
varImp(enetfit)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 90.02
## BiologicalMaterial06 84.56
## ManufacturingProcess17 74.88
## ManufacturingProcess36 74.68
## BiologicalMaterial03 73.53
## ManufacturingProcess09 70.37
## BiologicalMaterial12 67.97
## BiologicalMaterial02 65.32
## ManufacturingProcess31 63.93
## ManufacturingProcess06 58.00
## ManufacturingProcess33 48.99
## BiologicalMaterial11 48.11
## ManufacturingProcess11 47.77
## BiologicalMaterial04 47.12
## BiologicalMaterial08 41.87
## BiologicalMaterial01 39.13
## ManufacturingProcess30 34.83
## ManufacturingProcess12 33.32
## BiologicalMaterial09 32.41
The predictors with the greatest influence in the model are mostly manufacturing process variables: of the top 20 predictors, 11 are manufacturing process and 9 are biological material predictors, with total importance scores of roughly 697 and 500, respectively. The three most important predictors are ManufacturingProcess32, ManufacturingProcess13, and BiologicalMaterial06.
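The counts and totals above can be reproduced directly from the varImp scores; a minimal sketch grouping the top 20 by predictor type:
# Split the 20 largest importance scores into biological vs. process predictors
imp <- varImp(enetfit)$importance
imp20 <- head(imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE], 20)
type <- ifelse(grepl("^Biological", rownames(imp20)), "Biological", "Process")
table(type)                      # number of predictors of each type in the top 20
tapply(imp20$Overall, type, sum) # total importance score of each type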
(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
top20 <- varImp(enetfit)$importance %>%
tibble::rownames_to_column("Predictor") %>% # keep predictor names; arrange() drops row names
arrange(desc(Overall)) %>%
head(20)
# Extract the predictor names of the top 20 as a character vector
top20_names <- top20$Predictor
# Create a character vector of columns to select: "Yield" and the top20 predictors
cols_to_select <- c("Yield", top20_names)
# Select the columns and calculate correlations
cor_matrix <- CMP_complete[ , cols_to_select] %>%
cor()
# Visualize the correlation of "Yield" with the top 20 predictors
library(ggplot2)
# Only get correlations with Yield
cor_data <- as.data.frame(cor_matrix["Yield", -1])
# Rename columns
colnames(cor_data) <- "Correlation with Yield"
# Create a bar plot of the correlations
ggplot(cor_data, aes(x = rownames(cor_data), y = `Correlation with Yield`)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Correlation of Top 20 Predictors with Yield",
x = "Predictor",
y = "Correlation Coefficient") +
theme_minimal()
# Top 5
ggplot(CMP_complete, aes(x = ManufacturingProcess32, y = Yield)) + geom_point()
ggplot(CMP_complete, aes(x = ManufacturingProcess13, y = Yield)) + geom_point()
ggplot(CMP_complete, aes(x = BiologicalMaterial06, y = Yield)) + geom_point()
ggplot(CMP_complete, aes(x = ManufacturingProcess17, y = Yield)) + geom_point()
ggplot(CMP_complete, aes(x = ManufacturingProcess36, y = Yield)) + geom_point()
Knowing which biological materials and manufacturing processes have the greatest impact on yield, whether positive or negative, lets us focus on those specific inputs and process steps in future runs: predictors positively correlated with yield can be increased or prioritized, while those negatively correlated can be reduced or more tightly controlled. This provides a concrete path to improving yield in future manufacturing runs.
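For the five scatterplots above, the direction of each relationship can be read off the signed correlations with Yield; a quick sketch using the correlation matrix computed earlier:
# Signed correlation of the five most important predictors with Yield
top5 <- c("ManufacturingProcess32", "ManufacturingProcess13",
"BiologicalMaterial06", "ManufacturingProcess17", "ManufacturingProcess36")
round(cor_matrix["Yield", top5], 2)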