library(AppliedPredictiveModeling)
data(permeability)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
high_freq <- fingerprints[,-nearZeroVar(fingerprints)]
dim(high_freq)
## [1] 165 388
After filtering out the predictors with low variance, there are 388 remaining.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
set.seed(121)
split <- createDataPartition(permeability, p = 0.8, list = FALSE, times = 1)
Xtrain.data <- high_freq[split, ] #fingerprints train
xtest.data <- high_freq[-split, ] #fingerprints test
Ytrain.data <- permeability[split, ] #permability train
ytest.data <- permeability[-split, ]
ctrl <- trainControl(method = "cv", number = 10)
pls.mod <- train(x = Xtrain.data,
y = Ytrain.data, method = "pls",
tuneLength = 20,
trControl = ctrl,
preProc = c("center", "scale"))
pls.mod
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 121, 119, 119, 119, 121, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.98794 0.3459993 10.088589
## 2 11.76199 0.4739001 8.522928
## 3 11.82525 0.4737298 8.910669
## 4 11.50972 0.4863403 8.766358
## 5 11.20084 0.5123139 8.408934
## 6 10.91283 0.5374273 8.322364
## 7 10.93839 0.5483197 8.260723
## 8 11.19758 0.5470639 8.625416
## 9 11.45343 0.5323240 8.718073
## 10 11.61172 0.5241004 8.753287
## 11 11.62767 0.5257602 8.940583
## 12 11.78028 0.5171329 9.120635
## 13 12.19547 0.5035613 9.370479
## 14 12.64865 0.4787144 9.710662
## 15 13.03644 0.4569241 9.998892
## 16 13.44504 0.4318441 10.312141
## 17 13.83538 0.4174899 10.536723
## 18 13.90907 0.4113511 10.565914
## 19 14.14390 0.4070009 10.565672
## 20 14.22143 0.4058685 10.558467
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
The model that gives us the lowest RMSE is ncomp = 6. The value for R^2 in this model is 0.5374273. Surprisingly, this is not the highest value of R^2 seen across the models.
plot(pls.mod)
plsPred <- predict(pls.mod, newdata=xtest.data)
postResample(pred=plsPred, obs=ytest.data)
## RMSE Rsquared MAE
## 12.0438820 0.4348518 8.8937768
The estimate for R^2 this time is .4348518.
Three models mentioned in the chapter that are appropriate to try are the Ridge, Lasso, and elastic net models. Each of these models requires a tuning parameter, which can be calculated in these packages.
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.3
set.seed(121)
rFit <- train(x=Xtrain.data,
y=Ytrain.data,
method='ridge',
metric='Rsquared',
tuneGrid=data.frame(.lambda = seq(0, 2, by=0.1)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
## Warning: model fit failed for Fold06: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
rFit
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 119, 119, 121, 120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 46.19026 0.3253513 19.650645
## 0.1 11.94113 0.4779242 9.065819
## 0.2 12.08944 0.4898421 9.256008
## 0.3 12.43501 0.4956773 9.492480
## 0.4 12.85888 0.4991488 9.775096
## 0.5 13.35473 0.5014317 10.110119
## 0.6 13.91240 0.5028683 10.544420
## 0.7 14.50678 0.5037906 11.007365
## 0.8 15.14342 0.5045929 11.495508
## 0.9 15.80643 0.5046419 12.040654
## 1.0 16.49417 0.5047758 12.634879
## 1.1 17.20184 0.5047593 13.251666
## 1.2 17.92537 0.5046424 13.863099
## 1.3 18.66175 0.5044508 14.469191
## 1.4 19.40845 0.5042034 15.103153
## 1.5 20.16340 0.5039141 15.752667
## 1.6 20.92484 0.5035937 16.414177
## 1.7 21.69131 0.5032502 17.081964
## 1.8 22.46158 0.5028900 17.765829
## 1.9 23.23460 0.5025177 18.443540
## 2.0 24.00951 0.5021371 19.115252
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.
At the optimized lambda of 1, the ridge regression model gives us an RMSE of 16.49417 and an R^2 of .5047758. This means that the fit was better in the PLS model since that one gave us a lower RMSE and a higher R^2
set.seed(121)
lFit <- train(x=Xtrain.data,
y=Ytrain.data,
method='lasso',
metric='Rsquared',
tuneGrid=data.frame(.fraction = seq(0, .1, by=0.01)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
## Warning: model fit failed for Fold06: fraction=0.1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## Warning in train.default(x = Xtrain.data, y = Ytrain.data, method = "lasso", :
## missing values found in aggregated results
lFit
## The lasso
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 119, 119, 121, 120, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00 15.48794 NaN 12.306539
## 0.01 13.74275 0.5101493 10.846539
## 0.02 12.98883 0.5143246 10.304896
## 0.03 12.41819 0.5211371 9.712647
## 0.04 12.24017 0.5106165 9.393944
## 0.05 12.21253 0.4999390 9.203348
## 0.06 12.27696 0.4892846 9.141869
## 0.07 12.38404 0.4806316 9.164475
## 0.08 12.45188 0.4764676 9.190038
## 0.09 12.49103 0.4741956 9.182362
## 0.10 12.53263 0.4726914 9.205190
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.03.
In the lasso regression model at the optimal fraction of .03, the RMSE is 12.41819 and the R^2 is .5211371. The lasso regression gives us a better fit than the ridge regression, but not as good as the PLS model.
set.seed(121)
enetFit <- train(x=Xtrain.data,
y=Ytrain.data,
method='enet',
metric='Rsquared',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1),
.lambda = seq(0, 1, by=0.1)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
## Warning: model fit failed for Fold06: lambda=0.0, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## Warning in train.default(x = Xtrain.data, y = Ytrain.data, method = "enet", :
## missing values found in aggregated results
enetFit
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 119, 119, 121, 120, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.0 0.0 15.48794 NaN 12.306539
## 0.0 0.1 12.53263 0.4726914 9.205190
## 0.0 0.2 13.37124 0.4464626 9.470231
## 0.0 0.3 16.07839 0.3982089 10.536091
## 0.0 0.4 20.82657 0.3676119 12.091321
## 0.0 0.5 24.66945 0.3592001 13.200135
## 0.0 0.6 28.56979 0.3572732 14.264707
## 0.0 0.7 32.83586 0.3547052 15.486208
## 0.0 0.8 37.17556 0.3464053 16.804062
## 0.0 0.9 41.64325 0.3361191 18.172117
## 0.0 1.0 46.19026 0.3253513 19.650645
## 0.1 0.0 15.23652 NaN 12.140960
## 0.1 0.1 12.29167 0.4443243 8.886001
## 0.1 0.2 11.56270 0.4774777 8.355892
## 0.1 0.3 11.14460 0.4991002 8.252875
## 0.1 0.4 11.31593 0.4957519 8.501932
## 0.1 0.5 11.54821 0.4915198 8.715412
## 0.1 0.6 11.68556 0.4882573 8.818469
## 0.1 0.7 11.75523 0.4859669 8.873831
## 0.1 0.8 11.83917 0.4820786 8.964666
## 0.1 0.9 11.89992 0.4789536 9.028914
## 0.1 1.0 11.94113 0.4779242 9.065819
## 0.2 0.0 15.23652 NaN 12.140960
## 0.2 0.1 12.36814 0.4439155 9.016793
## 0.2 0.2 11.85422 0.4702428 8.427597
## 0.2 0.3 11.34154 0.4982855 8.287876
## 0.2 0.4 11.32767 0.5048431 8.442298
## 0.2 0.5 11.54716 0.5016530 8.651618
## 0.2 0.6 11.79296 0.4959509 8.879779
## 0.2 0.7 11.93072 0.4933927 9.024751
## 0.2 0.8 12.02888 0.4908546 9.148048
## 0.2 0.9 12.08610 0.4887420 9.229331
## 0.2 1.0 12.08944 0.4898421 9.256008
## 0.3 0.0 15.23652 NaN 12.140960
## 0.3 0.1 12.38445 0.4462789 9.048442
## 0.3 0.2 12.04712 0.4665965 8.485871
## 0.3 0.3 11.57877 0.4938469 8.379134
## 0.3 0.4 11.48521 0.5075740 8.491967
## 0.3 0.5 11.71066 0.5077032 8.763252
## 0.3 0.6 12.00234 0.5014957 9.008633
## 0.3 0.7 12.19079 0.4988686 9.192528
## 0.3 0.8 12.32401 0.4962424 9.346347
## 0.3 0.9 12.41241 0.4944781 9.445787
## 0.3 1.0 12.43501 0.4956773 9.492480
## 0.4 0.0 15.23652 NaN 12.140960
## 0.4 0.1 12.36580 0.4468242 9.027024
## 0.4 0.2 12.19016 0.4673554 8.514826
## 0.4 0.3 11.79679 0.4914212 8.471929
## 0.4 0.4 11.69860 0.5085207 8.575156
## 0.4 0.5 11.94449 0.5111364 8.883579
## 0.4 0.6 12.26536 0.5055823 9.179451
## 0.4 0.7 12.50677 0.5024583 9.419210
## 0.4 0.8 12.68585 0.4999392 9.596733
## 0.4 0.9 12.80781 0.4981629 9.719895
## 0.4 1.0 12.85888 0.4991488 9.775096
## 0.5 0.0 15.23652 NaN 12.140960
## 0.5 0.1 12.37803 0.4473183 9.025175
## 0.5 0.2 12.28691 0.4684369 8.534237
## 0.5 0.3 12.02709 0.4894102 8.597701
## 0.5 0.4 11.96715 0.5075662 8.744280
## 0.5 0.5 12.25651 0.5117740 9.068066
## 0.5 0.6 12.60864 0.5081465 9.405587
## 0.5 0.7 12.89905 0.5046777 9.686795
## 0.5 0.8 13.10924 0.5024042 9.902366
## 0.5 0.9 13.27810 0.5004156 10.045580
## 0.5 1.0 13.35473 0.5014317 10.110119
## 0.6 0.0 15.23652 NaN 12.140960
## 0.6 0.1 12.37111 0.4495273 9.006977
## 0.6 0.2 12.37970 0.4698334 8.547917
## 0.6 0.3 12.27345 0.4875657 8.765048
## 0.6 0.4 12.27781 0.5057442 8.991529
## 0.6 0.5 12.61796 0.5114498 9.339637
## 0.6 0.6 13.01293 0.5094180 9.704055
## 0.6 0.7 13.35439 0.5060132 10.032838
## 0.6 0.8 13.59905 0.5037954 10.272586
## 0.6 0.9 13.81101 0.5018255 10.456400
## 0.6 1.0 13.91240 0.5028683 10.544420
## 0.7 0.0 15.23652 NaN 12.140960
## 0.7 0.1 12.35846 0.4510659 8.978109
## 0.7 0.2 12.47792 0.4704114 8.560264
## 0.7 0.3 12.52439 0.4860267 8.955062
## 0.7 0.4 12.61527 0.5036900 9.269168
## 0.7 0.5 13.01230 0.5106658 9.655736
## 0.7 0.6 13.45977 0.5097252 10.058454
## 0.7 0.7 13.84946 0.5067171 10.433913
## 0.7 0.8 14.12848 0.5046667 10.680996
## 0.7 0.9 14.37318 0.5029897 10.890067
## 0.7 1.0 14.50678 0.5037906 11.007365
## 0.8 0.0 15.23652 NaN 12.140960
## 0.8 0.1 12.35039 0.4529550 8.942577
## 0.8 0.2 12.58242 0.4715987 8.584235
## 0.8 0.3 12.78093 0.4858173 9.163693
## 0.8 0.4 12.97646 0.5024329 9.554853
## 0.8 0.5 13.44516 0.5102259 9.989378
## 0.8 0.6 13.95486 0.5097891 10.481704
## 0.8 0.7 14.38757 0.5072529 10.880732
## 0.8 0.8 14.70217 0.5053797 11.140051
## 0.8 0.9 14.97387 0.5040870 11.351495
## 0.8 1.0 15.14342 0.5045929 11.495508
## 0.9 0.0 15.23652 NaN 12.140960
## 0.9 0.1 12.32497 0.4554050 8.893853
## 0.9 0.2 12.69110 0.4716981 8.621928
## 0.9 0.3 13.06596 0.4840724 9.369324
## 0.9 0.4 13.37276 0.4995295 9.839987
## 0.9 0.5 13.91206 0.5081996 10.373523
## 0.9 0.6 14.48140 0.5087106 10.942880
## 0.9 0.7 14.94883 0.5071795 11.356560
## 0.9 0.8 15.30195 0.5054687 11.629613
## 0.9 0.9 15.59941 0.5044801 11.869143
## 0.9 1.0 15.80643 0.5046419 12.040654
## 1.0 0.0 15.23652 NaN 12.140960
## 1.0 0.1 12.30693 0.4572780 8.843362
## 1.0 0.2 12.81078 0.4723433 8.676224
## 1.0 0.3 13.36616 0.4830828 9.596825
## 1.0 0.4 13.78825 0.4977065 10.147079
## 1.0 0.5 14.40359 0.5067869 10.784527
## 1.0 0.6 15.03323 0.5078845 11.407113
## 1.0 0.7 15.53235 0.5072256 11.838048
## 1.0 0.8 15.92916 0.5055444 12.157805
## 1.0 0.9 16.25355 0.5047812 12.435100
## 1.0 1.0 16.49417 0.5047758 12.634879
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.5 and lambda = 0.5.
Optimized with values that are both .5, this model gives us a RMSE of 12.25651 and an R^2 of .5117740. After looking at each model, the lowest RMSE and highest R^2 are found using the PLS model.
I do not believe that any of these models are strong enough to replace the laboratory experiment. With R^2 values that top out at .54, our model can at best explain 54% of the variance. Our MAE values also range around 8-9, meaning our margin of error in predictions is about 8-9, which is too wide of a window for meaningful application when it comes to medicine.
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.3, built: 2024-11-07)
## ## Copyright (C) 2005-2025 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(ChemicalManufacturingProcess, col = c("yellow", "navy"))
We can use this function to visualize where data is missing in our data set. Here, about 1% of all values are missing. One way to deal with missing data that seems appropriate in this case is using K Nearest Neighbors. The preProcess function in the carat packages contains ways to ready the data for this implementation and then complete it.
library(RANN)
knn <- preProcess(ChemicalManufacturingProcess,
method = c("center","scale","knnImpute"))
knnpre <- predict(knn, ChemicalManufacturingProcess)
missmap(knnpre, col = c("yellow", "navy"))
With the missing map graphic, we can now be confident that all missing values have appropriate replacements. Now to replicate the process from the previous problem, starting with eliminating variables with near-zero variance.
near_zero_chem <- nearZeroVar(knnpre)
filtered_chem <- knnpre[, -near_zero_chem]
ncol(filtered_chem)
## [1] 57
Only one variable was removed. #### Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric? For this exercise, I will chose a Partial Least Squares model to build upon the example in the previous question.
set.seed(79)
index_chem <- createDataPartition(filtered_chem$Yield, p = .8, list = FALSE)
# We need to exclude 'Yield' column since it is the predicted variable)
X_train_chem <- filtered_chem[index_chem, !(names(filtered_chem) %in% "Yield")]
X_test_chem <- filtered_chem[-index_chem, !(names(filtered_chem) %in% "Yield")]
# Only using the 'Yield' column)
y_train_chem <- filtered_chem[index_chem, "Yield", drop = TRUE]
y_test_chem <- filtered_chem[-index_chem, "Yield", drop = TRUE]
pls_chem <- train(
x = X_train_chem,
y = y_train_chem,
method = "pls",
preProcess = c("center", "scale"),
tuneLength = 20,
trControl = trainControl(method = "cv", number = 10)
)
plot(pls_chem)
pls_chem
## Partial Least Squares
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 130, 129, 131, 129, 130, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.8474935 0.3825554 0.6548616
## 2 1.1260325 0.4217226 0.7096134
## 3 0.7416650 0.5378682 0.5907047
## 4 0.7834128 0.5126599 0.6039552
## 5 0.9363218 0.5008578 0.6368069
## 6 1.0067108 0.4922164 0.6675886
## 7 1.0682939 0.4954744 0.6858345
## 8 1.1192581 0.4912483 0.7140196
## 9 1.1732719 0.4677467 0.7406973
## 10 1.2706852 0.4533025 0.7665604
## 11 1.4800818 0.4417582 0.8281665
## 12 1.6785742 0.4406087 0.8804082
## 13 1.8827794 0.4256143 0.9459933
## 14 2.0877310 0.4305886 0.9957205
## 15 2.1776560 0.4303086 1.0172406
## 16 2.2571005 0.4267955 1.0396893
## 17 2.3021289 0.4255633 1.0501958
## 18 2.3624749 0.4262620 1.0669353
## 19 2.4675894 0.4208767 1.0970309
## 20 2.5814665 0.4155298 1.1318633
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
The lowest RMSE is found where ncomp=3. This also corresponds to the highest R^2 in this case of .5378682.
predicted_chem <- predict(pls_chem, X_test_chem)
predicted_chem_y <- data.frame(obs = y_test_chem, pred = predicted_chem)
defaultSummary(predicted_chem_y)
## RMSE Rsquared MAE
## 0.5182009 0.7592100 0.4290475
Our R^2 value is greater here compared to our training data. It seems like this model is a good fit.
plot(predicted_chem_y, col = "darkgreen", main = "Observed vs. Predicted", xlab = "", ylab = "Predictions")
par(new = TRUE)
plot(y_test_chem, col = "blue", axes=F, ylab = "", xlab="Observed")
We have no clear patterns in comparing the observed values and the predictions. This combined with the rather high value for R^2 we observe tells us that this model does a good job of capturing the variance without any major bias.
varImp(pls_chem)
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 85.17
## ManufacturingProcess17 76.76
## ManufacturingProcess09 73.51
## ManufacturingProcess36 72.01
## ManufacturingProcess33 60.82
## ManufacturingProcess06 59.93
## ManufacturingProcess12 57.37
## BiologicalMaterial02 55.86
## BiologicalMaterial08 55.36
## BiologicalMaterial06 53.58
## BiologicalMaterial03 52.87
## BiologicalMaterial12 52.20
## BiologicalMaterial11 51.33
## BiologicalMaterial01 47.62
## ManufacturingProcess11 45.43
## ManufacturingProcess28 44.50
## BiologicalMaterial04 43.74
## ManufacturingProcess04 40.91
## ManufacturingProcess10 35.06
The most important markers seem to be those related to the manufacturing process, making up 8 of the top 10 variables based on their importance to the model.
A correlation matrix would be the best way to explore the relationships between top predictors. There are too many possible predictors in our data set, so we should filter out only those with the biggest impact to explore.
library(corrplot)
## corrplot 0.95 loaded
##
## Attaching package: 'corrplot'
## The following object is masked from 'package:pls':
##
## corrplot
chem_imp <- varImp(pls_chem, scale = FALSE)$importance |>
dplyr::arrange(-Overall)
top10 <- head(chem_imp, 10)
correlation_matrix <- filtered_chem |>
dplyr::select(Yield, all_of(rownames(top10))) |>
cor(use = "complete.obs")
corrplot::corrplot(correlation_matrix,
method = "color",
type = "upper",
order = "hclust",
tl.cex = 0.8)
We can see that the first three manufacturing processes listed have a negative correlation with yield, so it would be worth exploring minimizing the use or impact of these process if we want yield to be maximized. MP 32 seems to have the highest positive correlation with yield, so that a process whose impact should be maximized due to that positive correlation.