In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.
Developing a model to predict permeability (see Sect. 1.4) could save sig nificant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response. Kuhn, Max; Johnson, Kjell. Applied Predictive Modeling (p. 138). Springer New York. Kindle Edition.
head(describe(fingerprints), 10) #limit 10
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 165 0.23 0.42 0 0.17 0 0 1 1 1.27 -0.39 0.03
## X2 2 165 0.22 0.42 0 0.16 0 0 1 1 1.31 -0.28 0.03
## X3 3 165 0.21 0.41 0 0.14 0 0 1 1 1.40 -0.05 0.03
## X4 4 165 0.21 0.41 0 0.14 0 0 1 1 1.40 -0.05 0.03
## X5 5 165 0.21 0.41 0 0.14 0 0 1 1 1.40 -0.05 0.03
## X6 6 165 0.45 0.50 0 0.44 0 0 1 1 0.21 -1.97 0.04
## X7 7 165 1.00 0.00 1 1.00 0 1 1 0 NaN NaN 0.00
## X8 8 165 1.00 0.00 1 1.00 0 1 1 0 NaN NaN 0.00
## X9 9 165 0.02 0.13 0 0.00 0 0 1 1 7.15 49.38 0.01
## X10 10 165 0.02 0.15 0 0.00 0 0 1 1 6.13 35.80 0.01
str(fingerprints)
## num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
#head(fingerprints)
low_freq<- caret::nearZeroVar(fingerprints)
pred_df <-fingerprints[,-low_freq]
dim(pred_df)
## [1] 165 388
When we exclude low frequency predictors, 388 are found.
set.seed(432)
split_1<- sample(c(rep(0, 0.7 * nrow(permeability)),
rep(1, 0.3 * nrow(permeability))))
table(split_1)
## split_1
## 0 1
## 115 49
train_1 <- pred_df[split_1 == 0,]
test_1 <- pred_df[split_1 == 1,]
train_val1 <- permeability[split_1 == 0]
test_val1 <- permeability[split_1 == 1]
tune1 <- train(train_1, train_val1,
method='pls', metric='Rsquared',
tuneLength=20,
trControl=trainControl(method='cv'),
preProc=c('center', 'scale')
)
tune1
## Partial Least Squares
##
## 116 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 105, 105, 104, 105, 104, 104, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.43068 0.3371880 10.319406
## 2 12.35354 0.4319338 8.980061
## 3 12.72791 0.4215996 9.721106
## 4 13.22950 0.3943144 9.868499
## 5 12.84057 0.4166923 9.820530
## 6 12.89208 0.4238546 9.879228
## 7 12.34849 0.4577443 9.357330
## 8 12.19153 0.4781810 9.059569
## 9 12.28542 0.4789290 9.076210
## 10 12.37766 0.4842363 9.259815
## 11 12.81321 0.4673728 9.665919
## 12 13.21172 0.4424502 10.028955
## 13 14.03620 0.4073698 10.724542
## 14 14.71433 0.3780076 11.219147
## 15 15.16874 0.3515616 11.557017
## 16 15.40113 0.3382075 11.692555
## 17 16.12396 0.3116538 12.133942
## 18 16.92900 0.2783572 12.781934
## 19 17.34429 0.2631386 13.187677
## 20 17.88137 0.2424371 13.691859
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 10.
This function determines ncomp 9 as the optimal model.
#run prediction and eval
pred_1 <- predict(tune1, newdata=test_1)
postResample(pred=pred_1, obs=test_val1)
## RMSE Rsquared MAE
## 11.2545378 0.5132289 8.2131883
R-squared = .5242
#let's try ridge regression
ridge_grid <- data.frame(.lambda = seq(0, .1, length = 15))
#enet_grid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
set.seed(432)
ridge_reg_fit <- train(train_1, train_val1,
method = "ridge",
tuneGrid = ridge_grid,
trControl = trainControl(method = "cv", number = 10, verboseIter = FALSE),
#predictors
preProc = c("center", "scale"))
## Warning: model fit failed for Fold03: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold05: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
ridge_reg_fit
## Ridge Regression
##
## 116 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 104, 104, 104, 104, 104, 104, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 16.17472 0.3642978 12.21149
## 0.007142857 40.95803 0.1379076 29.87720
## 0.014285714 112.37215 0.1633977 84.36172
## 0.021428571 320.18077 0.1937930 229.85171
## 0.028571429 18.00521 0.2704993 12.78184
## 0.035714286 17.27937 0.2924620 12.34718
## 0.042857143 16.79044 0.3110419 12.05392
## 0.050000000 16.37375 0.3280473 11.80373
## 0.057142857 16.08923 0.3402429 11.63551
## 0.064285714 15.89527 0.3491905 11.53285
## 0.071428571 15.66014 0.3600055 11.40208
## 0.078571429 15.65052 0.3607620 11.39481
## 0.085714286 15.51272 0.3683991 11.38592
## 0.092857143 15.23332 0.3816867 11.21204
## 0.100000000 15.09919 0.3887969 11.13863
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
Our ridge regression employed a penalty of lambda = .1, RMSE = 15.1
#troubleshooting
#anyNA(train_1)
#anyNA(train_val1)
# Adjust the tuning grid to avoid zero lambda or problematic fraction values
enet_grid <- expand.grid(
.lambda = seq(0.001, 0.1, length = 10),
.fraction = seq(0.1, 0.9, length = 10)
)
#elastic net next
# first set range of penalities and mixing params
#enet_grid <- expand.grid(.lambda = seq(0, 0.1, length = 15),
# .fraction = seq(0.05, 1, length = 20))
#I've simplified the model so that it stops throwing errors related to (zmat <gamhat)
set.seed(432)
elastic_net_fit <- train(
train_1, train_val1,
method = "enet",
tuneGrid = enet_grid,
trControl = trainControl(method = "cv", number = 5),
preProc = c("center", "scale")
)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813
## Warning: model fit failed for Fold3: lambda=0.100, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) :
## Some of the columns of x have zero variance
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813
## Warning: model fit failed for Fold3: lambda=0.089, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) :
## Some of the columns of x have zero variance
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813
## Warning: model fit failed for Fold3: lambda=0.078, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) :
## Some of the columns of x have zero variance
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813
## Warning: model fit failed for Fold3: lambda=0.067, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) :
## Some of the columns of x have zero variance
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813
## Warning: model fit failed for Fold3: lambda=0.056, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) :
## Some of the columns of x have zero variance
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813
## Warning: model fit failed for Fold3: lambda=0.045, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) :
## Some of the columns of x have zero variance
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813
## Warning: model fit failed for Fold3: lambda=0.034, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) :
## Some of the columns of x have zero variance
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813
## Warning: model fit failed for Fold3: lambda=0.023, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) :
## Some of the columns of x have zero variance
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813
## Warning: model fit failed for Fold3: lambda=0.012, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) :
## Some of the columns of x have zero variance
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813
## Warning: model fit failed for Fold3: lambda=0.001, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) :
## Some of the columns of x have zero variance
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
#elastic_net_fit
# run and evaluate predictions
pred_enet <- predict(elastic_net_fit, newdata = test_1)
test_results_1<- postResample(pred = pred_enet, obs = test_val1)
test_results_1
## RMSE Rsquared MAE
## 9.6272475 0.5883668 7.4348566
The models all perform reasonably well, but the elastic net model enjoys the lowest RMSE and highest R-squared. However, I’m concerned I’ve mitigated the analytical value of this model by simplifying it above.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(package = "AppliedPredictiveModeling")$results[, "Item"]
## [1] "ChemicalManufacturingProcess" "abalone"
## [3] "bio (hepatic)" "cars2010 (FuelEconomy)"
## [5] "cars2011 (FuelEconomy)" "cars2012 (FuelEconomy)"
## [7] "chem (hepatic)" "classes (twoClassData)"
## [9] "concrete" "diagnosis (AlzheimerDisease)"
## [11] "fingerprints (permeability)" "injury (hepatic)"
## [13] "logisticCreditPredictions" "mixtures (concrete)"
## [15] "permeability" "predictors (AlzheimerDisease)"
## [17] "predictors (twoClassData)" "schedulingData"
## [19] "segmentationOriginal" "solTestX (solubility)"
## [21] "solTestXtrans (solubility)" "solTestY (solubility)"
## [23] "solTrainX (solubility)" "solTrainXtrans (solubility)"
## [25] "solTrainY (solubility)"
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors
(12 describing the input biological material and 45 describing the
process predictors) for the 176 manufacturing runs. yield contains the
percent yield for each run.
# I think I'm going to convert the data to a matrix first and visualize it
matrix_1 <- as.matrix(ChemicalManufacturingProcess)
aggr(matrix_1,
numbers = TRUE,
col = c("lightblue", "purple"),
combined = TRUE,
sortVars = TRUE,
labels = names(ChemicalManufacturingProcess),
cex.axis = 0.7,
gap = 2,
ylab = c("Missing Data", "Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## ManufacturingProcess03 15
## ManufacturingProcess11 10
## ManufacturingProcess10 9
## ManufacturingProcess25 5
## ManufacturingProcess26 5
## ManufacturingProcess27 5
## ManufacturingProcess28 5
## ManufacturingProcess29 5
## ManufacturingProcess30 5
## ManufacturingProcess31 5
## ManufacturingProcess33 5
## ManufacturingProcess34 5
## ManufacturingProcess35 5
## ManufacturingProcess36 5
## ManufacturingProcess02 3
## ManufacturingProcess06 2
## ManufacturingProcess01 1
## ManufacturingProcess04 1
## ManufacturingProcess05 1
## ManufacturingProcess07 1
## ManufacturingProcess08 1
## ManufacturingProcess12 1
## ManufacturingProcess14 1
## ManufacturingProcess22 1
## ManufacturingProcess23 1
## ManufacturingProcess24 1
## ManufacturingProcess40 1
## ManufacturingProcess41 1
## Yield 0
## BiologicalMaterial01 0
## BiologicalMaterial02 0
## BiologicalMaterial03 0
## BiologicalMaterial04 0
## BiologicalMaterial05 0
## BiologicalMaterial06 0
## BiologicalMaterial07 0
## BiologicalMaterial08 0
## BiologicalMaterial09 0
## BiologicalMaterial10 0
## BiologicalMaterial11 0
## BiologicalMaterial12 0
## ManufacturingProcess09 0
## ManufacturingProcess13 0
## ManufacturingProcess15 0
## ManufacturingProcess16 0
## ManufacturingProcess17 0
## ManufacturingProcess18 0
## ManufacturingProcess19 0
## ManufacturingProcess20 0
## ManufacturingProcess21 0
## ManufacturingProcess32 0
## ManufacturingProcess37 0
## ManufacturingProcess38 0
## ManufacturingProcess39 0
## ManufacturingProcess42 0
## ManufacturingProcess43 0
## ManufacturingProcess44 0
## ManufacturingProcess45 0
# impute and re-asses
imputed_data <- mice(matrix_1, method = "pmm", m = 5, maxit = 5, printFlag = FALSE)
## Warning: Number of logged events: 675
complete_data <- complete(imputed_data)
# Check NAs
anyNA(complete_data)
## [1] FALSE
The proportion of missing data is small and dispersed across many predictors, so I chose the pmm method (predictive mean matching) from the mice library for imputation. This method predicts values for missing data and randomly selects and imputes donors from complete cases with similar predicted values, assuming the same distribution for missing and observed data. We successfully eliminated missing data per the above check.
# filter and check dimensions
chem_data <- ChemicalManufacturingProcess
chem_data_filtered <- chem_data[, -nearZeroVar(chem_data)]
dim(chem_data_filtered)
## [1] 176 57
# generate 80/20 training and test sets
set.seed(432)
train_indices <- createDataPartition(chem_data_filtered$Yield, times = 1, p = 0.80, list = FALSE)
train_predictors <- chem_data_filtered[train_indices, -c(1)] # Exclude 'Yield' column
test_predictors <- chem_data_filtered[-train_indices, -c(1)]
train_response <- chem_data_filtered[train_indices, ]$Yield
test_response <- chem_data_filtered[-train_indices, ]$Yield
# train
(perf_fit <- train(
x = train_predictors,
y = train_response,
method = "pls",
metric = "Rsquared",
tuneLength = 25,
trControl = trainControl(method = "cv", number = 10),
preProcess = c('center', 'scale')
))
## Partial Least Squares
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 130, 128, 129, 130, 130, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.366557 0.4834273 1.108564
## 2 1.318038 0.5215160 1.052882
## 3 1.682071 0.5375627 1.137754
## 4 2.187599 0.5580511 1.269234
## 5 2.323014 0.5632677 1.301737
## 6 2.619363 0.5563855 1.404710
## 7 2.656325 0.5324331 1.430982
## 8 2.892100 0.5225133 1.490543
## 9 2.952069 0.5086699 1.517270
## 10 3.005844 0.4945833 1.535283
## 11 3.062673 0.4605866 1.580011
## 12 3.227874 0.4353043 1.650421
## 13 3.227337 0.4435162 1.642451
## 14 3.137262 0.4524246 1.609899
## 15 3.166196 0.4685060 1.605729
## 16 3.294770 0.4710852 1.639593
## 17 3.556879 0.4660242 1.708761
## 18 3.709272 0.4689027 1.749011
## 19 3.899968 0.4745939 1.789447
## 20 3.982159 0.4782400 1.806003
## 21 4.081416 0.4784149 1.835560
## 22 4.206296 0.4769404 1.873522
## 23 4.503215 0.4762817 1.960228
## 24 4.794765 0.4718393 2.046734
## 25 4.999727 0.4649212 2.111659
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 5.
perf_fit
## Partial Least Squares
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 130, 128, 129, 130, 130, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.366557 0.4834273 1.108564
## 2 1.318038 0.5215160 1.052882
## 3 1.682071 0.5375627 1.137754
## 4 2.187599 0.5580511 1.269234
## 5 2.323014 0.5632677 1.301737
## 6 2.619363 0.5563855 1.404710
## 7 2.656325 0.5324331 1.430982
## 8 2.892100 0.5225133 1.490543
## 9 2.952069 0.5086699 1.517270
## 10 3.005844 0.4945833 1.535283
## 11 3.062673 0.4605866 1.580011
## 12 3.227874 0.4353043 1.650421
## 13 3.227337 0.4435162 1.642451
## 14 3.137262 0.4524246 1.609899
## 15 3.166196 0.4685060 1.605729
## 16 3.294770 0.4710852 1.639593
## 17 3.556879 0.4660242 1.708761
## 18 3.709272 0.4689027 1.749011
## 19 3.899968 0.4745939 1.789447
## 20 3.982159 0.4782400 1.806003
## 21 4.081416 0.4784149 1.835560
## 22 4.206296 0.4769404 1.873522
## 23 4.503215 0.4762817 1.960228
## 24 4.794765 0.4718393 2.046734
## 25 4.999727 0.4649212 2.111659
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 5.
The optimal model is PLS model ncomp = 2 with the highest R-squared value (0.5299). It expalins the most variance.
predict_test_y2 <- predict(perf_fit, test_predictors)
performance_results <- postResample(pred = predict_test_y2, obs = test_response)
performance_results
## RMSE Rsquared MAE
## 1.1886974 0.6450927 0.9534105
The test set performance metrics show that the model enjoys a higher R-squared (0.6451) and a comparable RMSE (1.1887) to the training set, indicating decent predictive accuracy.
variable_importance <- varImp(perf_fit, scale = TRUE)
## Warning: package 'pls' was built under R version 4.3.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:corrplot':
##
## corrplot
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
variable_importance
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 95.52
## ManufacturingProcess13 89.86
## ManufacturingProcess17 84.67
## ManufacturingProcess36 78.40
## BiologicalMaterial06 70.47
## BiologicalMaterial02 70.31
## BiologicalMaterial03 69.07
## ManufacturingProcess28 62.72
## ManufacturingProcess12 62.09
## BiologicalMaterial01 58.89
## ManufacturingProcess06 58.75
## ManufacturingProcess33 57.75
## BiologicalMaterial11 56.92
## BiologicalMaterial08 56.54
## BiologicalMaterial04 55.35
## BiologicalMaterial12 53.72
## ManufacturingProcess34 53.06
## ManufacturingProcess11 51.37
## ManufacturingProcess37 49.69
The most important predictors in the model are mostly manufacturing process variables, with only a few biological predictors appearing in the top 10.
library(corrplot)
correlation_1 <- chem_data_filtered %>%
select('Yield', "ManufacturingProcess32", "ManufacturingProcess36",
"ManufacturingProcess09", "ManufacturingProcess13",
"ManufacturingProcess33", "ManufacturingProcess17",
"BiologicalMaterial03", "BiologicalMaterial02",
"BiologicalMaterial06", "BiologicalMaterial08",
"ManufacturingProcess06", "BiologicalMaterial12",
"ManufacturingProcess12", "ManufacturingProcess11",
"BiologicalMaterial04", "ManufacturingProcess34",
"BiologicalMaterial01", "ManufacturingProcess29",
"ManufacturingProcess30", "BiologicalMaterial11")
plot_corr_1 <- cor(correlation_1)
corrplot.mixed(plot_corr_1, tl.col = 'red', tl.pos = 'lt',
upper = "number", lower="circle")
ManufacturingProcess32 shows a strong positive correlation (0.61) with Yield, suggesting it significantly impacts the output, while other key process and biological variables have moderate correlations. This analysis indicates that adjusting ManufacturingProcess32 and related variables could help improve the manufacturing yield.
Note that the echo = FALSE parameter was added to the
code chunk to prevent printing of the R code that generated the
plot.