Question 6.2 A
library(AppliedPredictiveModeling)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tibble)
library(tidyr)
library(ggplot2)
data(permeability)
Part B
After subtracting using NearZeroVar, we only have 388 predictors left.
nzv <- nearZeroVar(fingerprints)
fingerprints_filtered <- fingerprints[, -nzv]
dim(fingerprints_filtered)
## [1] 165 388
Part C Split the data into training and test sets and then fit a PLS model on the training data using 10 fold cross validation. The tuning parameter was evaluated from 1 to 20 components. Based on our resampling results the optimal model selected by cross-validation used 9 latent variables and had a resampled R^2 of 0.4373. The highest observed cross-validated R^2 was 0.4443 at 5 components, but the lowest RMSE occurred at 9 components, which is why this would be our optimal number of latent variables.
#First we set our seed and
set.seed(62462)
x <- fingerprints[, -nzv]
y <- permeability
#Split data 80%/20%
Trainsplit <- createDataPartition(y, p = 0.8, list = FALSE)
train_x <- x[Trainsplit, ]
test_x <- x[-Trainsplit, ]
train_y <- y[Trainsplit]
test_y <- y[-Trainsplit]
ctrl <- trainControl(method = "cv", number = 10)
pls_model <- train(
x = train_x,
y = train_y,
method = "pls",
preProcess = c("center", "scale"),
tuneLength = 20,
trControl = ctrl
)
pls_model
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 118, 121, 119, 119, 118, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.58830 0.2701593 10.484186
## 2 12.51249 0.3866277 8.874014
## 3 12.35552 0.4091924 9.303344
## 4 12.31505 0.4283548 9.169979
## 5 12.16807 0.4443312 9.063389
## 6 12.15799 0.4365662 9.162581
## 7 12.29198 0.4244414 9.384539
## 8 12.30901 0.4229727 9.429921
## 9 12.16223 0.4373211 9.224731
## 10 12.52104 0.4312315 9.417766
## 11 12.62894 0.4260757 9.503149
## 12 12.95416 0.4101189 9.819188
## 13 13.35056 0.3912642 10.006916
## 14 13.76414 0.3731831 10.282876
## 15 14.40740 0.3380444 10.610761
## 16 14.98058 0.3103440 10.922962
## 17 15.30738 0.2999879 11.048193
## 18 15.53869 0.2917046 11.230411
## 19 15.51836 0.2908946 11.262496
## 20 15.95305 0.2750077 11.645145
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
Part D The tuned PLS model was then used to predict permeability on the held-out test set. The test-set performance was RMSE = 8.4201, MAE = 6.4669, and R^2 = 0.6805. The model explained about 68% of the variation indicating strong predictive performance on the test set. I would have expected it to be a bit closer to the test RMSE so I must assume that these were very strong predictors or there may have been some more variable compounds in the prediction dataset as well. It is likely a combination of these that caused this outcome.
pls_pred <- predict(pls_model, newdata = test_x)
postResample(pred = pls_pred, obs = test_y)
## RMSE Rsquared MAE
## 8.4200782 0.6804608 6.4668817
Part E
I compared the tuned PLS model with ridge regression, lasso regression, and elastic net using the same training and test split. On the test set, the PLS model had RMSE = 8.42, R^2 = 0.6805, and MAE = 6.47. Ridge regression had a higher R^2 = 0.7785, but it also had a worse RMSE = 10.43 and MAE = 7.64. Elastic net also had worse prediction errors than PLS, with RMSE = 10.50 and MAE = 7.07. The lasso model produced extremely large errors, suggesting an unstable or poor fit, so it was not considered here. Our results suggest PLS provided the best predictive performance on this split because it had the lowest test-set error measures.
# Ridge regression
ridge_grid <- expand.grid(lambda = seq(0, 0.10, length = 20))
ridge_model <- train(
x = train_x,
y = train_y,
method = "ridge",
preProcess = c("center", "scale"),
tuneGrid = ridge_grid,
trControl = ctrl
)
ridge_model
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 119, 121, 118, 121, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 1.570704e+26 0.1357507 5.654761e+25
## 0.005263158 4.916346e+01 0.2261837 2.893677e+01
## 0.010526316 4.114272e+01 0.2778971 2.554478e+01
## 0.015789474 1.692617e+01 0.2875033 1.220246e+01
## 0.021052632 1.924429e+03 0.2917511 1.073490e+03
## 0.026315789 1.576733e+01 0.3031166 1.150796e+01
## 0.031578947 1.538408e+01 0.3149947 1.126935e+01
## 0.036842105 1.520018e+01 0.3194208 1.119145e+01
## 0.042105263 1.487557e+01 0.3337439 1.096383e+01
## 0.047368421 1.468251e+01 0.3425391 1.085727e+01
## 0.052631579 1.453659e+01 0.3494269 1.076514e+01
## 0.057894737 1.456310e+01 0.3480253 1.076029e+01
## 0.063157895 1.431162e+01 0.3629589 1.064745e+01
## 0.068421053 1.421177e+01 0.3687409 1.058548e+01
## 0.073684211 1.412136e+01 0.3740305 1.052565e+01
## 0.078947368 1.405344e+01 0.3783054 1.048605e+01
## 0.084210526 1.397682e+01 0.3834068 1.042852e+01
## 0.089473684 1.389516e+01 0.3883824 1.036535e+01
## 0.094736842 1.383332e+01 0.3923924 1.031843e+01
## 0.100000000 1.379852e+01 0.3955399 1.029681e+01
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
ridge_model$bestTune
## lambda
## 20 0.1
ridge_pred <- predict(ridge_model, newdata = test_x)
ridge_test <- postResample(pred = ridge_pred, obs = test_y)
ridge_test
## RMSE Rsquared MAE
## 8.8923630 0.6637158 7.0802748
# Lasso regression
lasso_grid <- expand.grid(fraction = seq(0.05, 1.00, length = 20))
lasso_model <- train(
x = train_x,
y = train_y,
method = "lasso",
preProcess = c("center", "scale"),
tuneGrid = lasso_grid,
trControl = ctrl
)
lasso_model
## The lasso
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 118, 119, 120, 119, 121, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.05 7.125985e+14 0.3913785 2.790099e+14
## 0.10 1.436233e+15 0.3830256 5.612075e+14
## 0.15 2.159869e+15 0.3874314 8.434050e+14
## 0.20 2.883505e+15 0.3887548 1.125603e+15
## 0.25 3.607141e+15 0.3840712 1.407800e+15
## 0.30 4.330778e+15 0.3811369 1.689998e+15
## 0.35 5.054414e+15 0.3764348 1.972195e+15
## 0.40 5.778051e+15 0.3678653 2.254393e+15
## 0.45 6.501687e+15 0.3581032 2.536590e+15
## 0.50 7.225324e+15 0.3483971 2.818788e+15
## 0.55 7.948960e+15 0.3434420 3.100986e+15
## 0.60 8.672597e+15 0.3365944 3.383183e+15
## 0.65 9.396234e+15 0.3337979 3.665381e+15
## 0.70 1.011987e+16 0.3349652 3.947578e+15
## 0.75 1.084351e+16 0.3348344 4.229776e+15
## 0.80 1.156714e+16 0.3317065 4.511973e+15
## 0.85 1.229078e+16 0.3276144 4.794171e+15
## 0.90 1.301442e+16 0.3251199 5.076368e+15
## 0.95 1.373805e+16 0.3188556 5.358565e+15
## 1.00 1.446169e+16 0.3111218 5.640761e+15
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.05.
lasso_model$bestTune
## fraction
## 1 0.05
lasso_pred <- predict(lasso_model, newdata = test_x)
lasso_test <- postResample(pred = lasso_pred, obs = test_y)
lasso_test
## RMSE Rsquared MAE
## 1.146737e+04 5.586020e-02 3.472114e+03
# Elastic net
enet_grid <- expand.grid(
fraction = seq(0.05, 1.00, length = 10),
lambda = seq(0.001, 0.10, length = 10)
)
enet_model <- train(
x = train_x,
y = train_y,
method = "enet",
preProcess = c("center", "scale"),
tuneGrid = enet_grid,
trControl = ctrl
)
enet_model
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 121, 121, 120, 118, 119, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.001 0.0500000 118.00689 0.2045016 69.856020
## 0.001 0.1555556 329.05047 0.1912810 182.247634
## 0.001 0.2611111 538.15192 0.1717057 301.049642
## 0.001 0.3666667 758.65020 0.1545316 427.466080
## 0.001 0.4722222 981.53946 0.1388149 552.433267
## 0.001 0.5777778 1203.42782 0.1264437 674.703059
## 0.001 0.6833333 1427.41287 0.1155196 797.223614
## 0.001 0.7888889 1653.89416 0.1093290 921.038571
## 0.001 0.8944444 1883.44311 0.1061554 1045.340636
## 0.001 1.0000000 2110.19188 0.1053302 1168.933946
## 0.012 0.0500000 14.75561 0.3976749 10.729054
## 0.012 0.1555556 22.94609 0.3934940 16.102293
## 0.012 0.2611111 31.28186 0.3768562 21.242523
## 0.012 0.3666667 39.09975 0.3508812 25.538881
## 0.012 0.4722222 47.39641 0.3255760 29.884548
## 0.012 0.5777778 55.78782 0.3043199 34.330082
## 0.012 0.6833333 64.20138 0.2929942 38.751785
## 0.012 0.7888889 72.63021 0.2877169 43.413641
## 0.012 0.8944444 80.98266 0.2832547 47.992833
## 0.012 1.0000000 89.27559 0.2803421 52.491375
## 0.023 0.0500000 12.44939 0.4170920 8.892478
## 0.023 0.1555556 12.39041 0.4048264 8.855836
## 0.023 0.2611111 12.50736 0.4085065 9.125323
## 0.023 0.3666667 12.94110 0.3890827 9.342383
## 0.023 0.4722222 13.44217 0.3648879 9.535783
## 0.023 0.5777778 13.90789 0.3441756 9.906125
## 0.023 0.6833333 14.43625 0.3269003 10.333523
## 0.023 0.7888889 14.97204 0.3142979 10.736179
## 0.023 0.8944444 15.43489 0.3060791 11.107463
## 0.023 1.0000000 15.82875 0.2991945 11.402738
## 0.034 0.0500000 12.47636 0.4218148 9.038863
## 0.034 0.1555556 12.53653 0.3942653 8.865798
## 0.034 0.2611111 12.52110 0.4086231 9.126980
## 0.034 0.3666667 12.86031 0.3969946 9.358264
## 0.034 0.4722222 13.21704 0.3819855 9.466610
## 0.034 0.5777778 13.59810 0.3658623 9.663813
## 0.034 0.6833333 13.92188 0.3570046 9.910022
## 0.034 0.7888889 14.34750 0.3454552 10.257172
## 0.034 0.8944444 14.74981 0.3343056 10.603892
## 0.034 1.0000000 15.10433 0.3254961 10.879889
## 0.045 0.0500000 12.56344 0.4250384 9.203076
## 0.045 0.1555556 12.54785 0.3943201 8.814199
## 0.045 0.2611111 12.39672 0.4163723 9.036362
## 0.045 0.3666667 12.72419 0.4024322 9.278699
## 0.045 0.4722222 12.97930 0.3946764 9.363929
## 0.045 0.5777778 13.29848 0.3816799 9.472416
## 0.045 0.6833333 13.49552 0.3779408 9.577994
## 0.045 0.7888889 13.83033 0.3693305 9.812806
## 0.045 0.8944444 14.19647 0.3570046 10.111674
## 0.045 1.0000000 14.51402 0.3482228 10.376781
## 0.056 0.0500000 12.60020 0.4251990 9.281039
## 0.056 0.1555556 12.51650 0.3971114 8.779484
## 0.056 0.2611111 12.32401 0.4208399 8.961922
## 0.056 0.3666667 12.64842 0.4069165 9.215172
## 0.056 0.4722222 12.84404 0.4027014 9.290319
## 0.056 0.5777778 13.11463 0.3932367 9.367557
## 0.056 0.6833333 13.31621 0.3894322 9.462072
## 0.056 0.7888889 13.60181 0.3826295 9.653833
## 0.056 0.8944444 13.93910 0.3701522 9.920964
## 0.056 1.0000000 14.23457 0.3612952 10.143099
## 0.067 0.0500000 12.65488 0.4288773 9.391299
## 0.067 0.1555556 12.55285 0.3963772 8.782547
## 0.067 0.2611111 12.37359 0.4189880 8.963664
## 0.067 0.3666667 12.65950 0.4080633 9.233022
## 0.067 0.4722222 12.83073 0.4040766 9.294671
## 0.067 0.5777778 13.03496 0.3981701 9.332239
## 0.067 0.6833333 13.20324 0.3963600 9.394521
## 0.067 0.7888889 13.41139 0.3936857 9.496656
## 0.067 0.8944444 13.71684 0.3828756 9.733943
## 0.067 1.0000000 13.98900 0.3740494 9.959850
## 0.078 0.0500000 12.69368 0.4302793 9.457133
## 0.078 0.1555556 12.54608 0.3979219 8.771262
## 0.078 0.2611111 12.37537 0.4185778 8.924040
## 0.078 0.3666667 12.62398 0.4109466 9.199665
## 0.078 0.4722222 12.82236 0.4051801 9.300259
## 0.078 0.5777778 12.98363 0.4021263 9.315936
## 0.078 0.6833333 13.12536 0.4019012 9.366369
## 0.078 0.7888889 13.29809 0.4004761 9.448096
## 0.078 0.8944444 13.56184 0.3921885 9.616327
## 0.078 1.0000000 13.82490 0.3831107 9.834757
## 0.089 0.0500000 12.76805 0.4301631 9.553760
## 0.089 0.1555556 12.52206 0.4011014 8.740802
## 0.089 0.2611111 12.40075 0.4169507 8.904942
## 0.089 0.3666667 12.60812 0.4134609 9.194771
## 0.089 0.4722222 12.82023 0.4061166 9.310851
## 0.089 0.5777778 12.94414 0.4056864 9.307764
## 0.089 0.6833333 13.06684 0.4063374 9.346933
## 0.089 0.7888889 13.21561 0.4057463 9.413205
## 0.089 0.8944444 13.43999 0.3996871 9.536338
## 0.089 1.0000000 13.68994 0.3912227 9.729456
## 0.100 0.0500000 12.81281 0.4304290 9.624466
## 0.100 0.1555556 12.51065 0.4035470 8.720264
## 0.100 0.2611111 12.43278 0.4145059 8.894274
## 0.100 0.3666667 12.61310 0.4130824 9.184413
## 0.100 0.4722222 12.84834 0.4027745 9.325883
## 0.100 0.5777778 12.94386 0.4050024 9.314861
## 0.100 0.6833333 13.03499 0.4084090 9.343808
## 0.100 0.7888889 13.15677 0.4093087 9.397065
## 0.100 0.8944444 13.35397 0.4051781 9.486202
## 0.100 1.0000000 13.58956 0.3976765 9.656569
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.2611111 and lambda
## = 0.056.
enet_model$bestTune
## fraction lambda
## 53 0.2611111 0.056
enet_pred <- predict(enet_model, newdata = test_x)
enet_test <- postResample(pred = enet_pred, obs = test_y)
enet_test
## RMSE Rsquared MAE
## 9.7528530 0.5801447 7.8950955
#Results
results <- rbind(
PLS = c(RMSE = 8.4200782, Rsquared = 0.6804608, MAE = 6.4668817),
Ridge = ridge_test,
Lasso = lasso_test,
Enet = enet_test
)
results <- as.data.frame(results)
results
## RMSE Rsquared MAE
## PLS 8.420078 0.6804608 6.466882
## Ridge 8.892363 0.6637158 7.080275
## Lasso 11467.370062 0.0558602 3472.114238
## Enet 9.752853 0.5801447 7.895095
Part F
In this analysis, the PLS model showed reasonably good predictive ability on the test set, but not enough certainty to fully replace laboratory measurement. Our best R^2 was .68 and RMSE of 8.42 I don’t believe would be accurate enough to actually replace our permeability lab experiment. It would be more appropriate as a way to reduce cost and time by identifying compounds to be tested in the permeability experiment/
6.3 A
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
Part B Missing predictor values were imputed using median imputation. After splitting the data into training and test datasets I applied preProcess() with medianImpute, center, and scale to the predictors.
chem <- ChemicalManufacturingProcess
ProcessPredictors <- chem[, names(chem) != "Yield"]
yield <- chem$Yield
Training_split <- createDataPartition(yield, p = 0.80, list = FALSE)
trainsplit_x <- ProcessPredictors[Training_split, ]
testsplit_x <- ProcessPredictors[-Training_split, ]
trainsplit_y <- yield[Training_split]
testsplit_y <- yield[-Training_split]
medprocess <- preProcess(
trainsplit_x,
method = c("medianImpute", "center", "scale")
)
training_x_f <- predict(medprocess, trainsplit_x)
test_x_f <- predict(medprocess, testsplit_x)
Part C The optimal model used 4 latent variables, based on the smallest RMSE. The outcome at 4 was RMSE = 1.1945, R^2 = 0.6342, and MAE = 0.9822. In our PLS we seem to gain in RMSE and R^2 values up to 4 with a drop off beginning after that point.
ctrl <- trainControl(
method = "cv",
number = 10
)
chem_pls <- train(
x = training_x_f,
y = trainsplit_y,
method = "pls",
tuneLength = 20,
metric = "RMSE",
trControl = ctrl
)
chem_pls
## Partial Least Squares
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 131, 129, 128, 129, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.390444 0.4808496 1.1440788
## 2 1.274229 0.5542536 1.0295985
## 3 1.223829 0.5801451 0.9732079
## 4 1.196021 0.6049435 0.9724990
## 5 1.200570 0.6174447 0.9847294
## 6 1.211527 0.6093765 0.9917656
## 7 1.201692 0.6240207 0.9940188
## 8 1.244765 0.6106207 1.0317198
## 9 1.256870 0.6103914 1.0415963
## 10 1.275109 0.6071681 1.0547934
## 11 1.277922 0.6013721 1.0495775
## 12 1.278733 0.6042408 1.0545566
## 13 1.285990 0.6011582 1.0657477
## 14 1.288896 0.6030001 1.0683622
## 15 1.298976 0.5993073 1.0773815
## 16 1.302490 0.5960829 1.0727553
## 17 1.297523 0.5985248 1.0672503
## 18 1.300664 0.5954899 1.0676782
## 19 1.306795 0.5911569 1.0712583
## 20 1.309907 0.5876253 1.0715018
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 4.
Part D The PLS model was used to predict yield on the test set. The test set showed was RMSE = 2.4906, R^2 = 0.2189, and MAE = 1.4280. Compared with the cross validated training from part c, this shows a noticeable decline in accuracy. The resampled training results were RMSE = 1.1945 and R^2 = 0.6342 which suggests the model performed much worse on unseen data. This concludes that the model did not generalize as well as expected due to the substantial R^2 difference.
pls_pred <- predict(chem_pls, newdata = test_x_f)
postResample(pred = pls_pred, obs = testsplit_y)
## RMSE Rsquared MAE
## 2.2962372 0.3001268 1.1916439
Part E
The PLS model indicated that the most important predictors were ManufacturingProcess32, ManufacturingProcess36, ManufacturingProcess13, ManufacturingProcess09, ManufacturingProcess17, and ManufacturingProcess33, followed by BiologicalMaterial02, BiologicalMaterial03, BiologicalMaterial08, and BiologicalMaterial06. The manufacturing process predictors were the majority of the list with 6 of the top 10 predictors coming from the process variables and the 6 most important predictors overall were all manufacturing variables. This suggests that manufacturing process conditions (process predicters) probably have a stronger relationship with yield than the biological characteristics of the materials.
chem_imp <- varImp(chem_pls, scale = TRUE)
## Warning: package 'pls' was built under R version 4.5.2
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
chem_imp
## pls variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 87.13
## ManufacturingProcess09 82.17
## ManufacturingProcess36 81.32
## ManufacturingProcess17 77.34
## ManufacturingProcess06 62.30
## BiologicalMaterial02 60.89
## BiologicalMaterial03 60.01
## BiologicalMaterial06 58.51
## BiologicalMaterial08 58.19
## ManufacturingProcess12 58.00
## ManufacturingProcess31 55.69
## ManufacturingProcess33 55.26
## BiologicalMaterial12 54.60
## BiologicalMaterial11 52.12
## ManufacturingProcess11 49.63
## ManufacturingProcess30 49.60
## BiologicalMaterial04 49.34
## ManufacturingProcess29 48.99
## ManufacturingProcess28 47.72
plot(chem_imp, top = 15)
Part F The relationships between the most important predictors and yield
can help us see which variables are associated with the strongest with a
higher or lower yield. Since the top predictors are mostly manufacturing
process variables this suggests that improvements in yield would come
more from optimizing processes than from differences in raw material
biological factors. Biological predictors are still useful because they
can help assess the quality of the raw materials before production
begins. Examining scatterplots of the top predictors versus yield may
reveal trends, thresholds, or optimal operating ranges that can guide
future process adjustments and improve product yield. This may also be a
way to chose predictors for other models.
top_vars <- c(
"ManufacturingProcess32",
"ManufacturingProcess36",
"ManufacturingProcess13",
"ManufacturingProcess09",
"ManufacturingProcess17",
"ManufacturingProcess33",
"BiologicalMaterial02",
"BiologicalMaterial03"
)
plot_data <- data.frame(training_x_f, Yield = trainsplit_y) |>
select(all_of(top_vars), Yield) |>
pivot_longer(
cols = -Yield,
names_to = "Predictor",
values_to = "Value"
)
ggplot(plot_data, aes(x = Value, y = Yield)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE) +
facet_wrap(~ Predictor, scales = "free_x") +
labs(
title = "Yield vs Top Predictors",
x = "Predictor Value",
y = "Yield"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Note that the echo = FALSE parameter was added to the
code chunk to prevent printing of the R code that generated the
plot.