Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
#install.packages('AppliedPredictiveModeling')
library(AppliedPredictiveModeling) #caret library
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.0.5
data(permeability)
#view(permeability)
#view(fingerprints)
#table(fingerprints)
str(fingerprints)
## num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
#dim(fingerprints)
vec1 <- nearZeroVar(fingerprints)
str(vec1)
## int [1:719] 7 8 9 10 13 14 17 18 19 22 ...
cat("The number of predictors removed by nearZeroVar() is : 719\n")
## The number of predictors removed by nearZeroVar() is : 719
cat("\nThe predictors left for modeling are: ",( ncol(fingerprints)- 719))
##
## The predictors left for modeling are: 388
# When predictors should be removed, a vector of integers is
# returned that indicates which columns should be removed.
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
Using Partial Least Squares , pls package with PLS and PCR functions.
L_fingerprints <- fingerprints[, -nearZeroVar(fingerprints)]
trainDf <- createDataPartition(permeability, p=0.8, list = FALSE)
trainX <-L_fingerprints[trainDf, ]
trainY <- permeability[trainDf, ]
plsTune <- train(trainX, trainY,
method = "pls",
## The default tuning grid evaluates
## components 1... tuneLength
tuneLength = 20,
trControl = trainControl(method = 'cv'),
preProc = c("center", "scale"))
plsTune
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 119, 120, 118, 120, 118, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.35336 0.2938959 10.220062
## 2 12.25110 0.4363139 8.592127
## 3 12.31136 0.4193319 9.230250
## 4 12.37045 0.4237796 9.314451
## 5 12.36047 0.4351206 9.164026
## 6 12.22259 0.4519223 9.113898
## 7 11.95712 0.4694642 9.045313
## 8 11.94385 0.4773805 9.101885
## 9 12.07763 0.4721779 9.165854
## 10 12.21463 0.4730678 9.418333
## 11 12.65542 0.4548923 9.716033
## 12 12.80730 0.4475218 9.927495
## 13 13.02832 0.4403498 10.112927
## 14 13.36616 0.4273066 10.422379
## 15 13.84797 0.4092395 10.785683
## 16 14.03591 0.4160907 10.975621
## 17 14.44770 0.4028446 11.395831
## 18 14.81086 0.3938069 11.672057
## 19 14.99114 0.3890475 11.802900
## 20 15.21497 0.3835949 11.935622
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.
plot(plsTune)
testX <- L_fingerprints[-trainDf,]
testY <- permeability[-trainDf,]
postResample(pred = predict(plsTune, newdata=testX), obs = testY)
## RMSE Rsquared MAE
## 10.1860552 0.5734765 7.5226062
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 120, 120, 120, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 2.092689e+06 0.2337601 1.372793e+06
## 0.007142857 1.739555e+01 0.2626825 1.249489e+01
## 0.014285714 1.630146e+01 0.2925794 1.171292e+01
## 0.021428571 1.574087e+01 0.3050547 1.139997e+01
## 0.028571429 1.563652e+01 0.3113932 1.131897e+01
## 0.035714286 1.512477e+01 0.3276945 1.102041e+01
## 0.042857143 1.500001e+01 0.3334625 1.094598e+01
## 0.050000000 1.480649e+01 0.3428727 1.085559e+01
## 0.057142857 1.465880e+01 0.3495139 1.075191e+01
## 0.064285714 1.454606e+01 0.3557285 1.067354e+01
## 0.071428571 1.446967e+01 0.3605672 1.062030e+01
## 0.078571429 1.434900e+01 0.3673231 1.055879e+01
## 0.085714286 1.426747e+01 0.3723594 1.050630e+01
## 0.092857143 1.421732e+01 0.3760147 1.047285e+01
## 0.100000000 1.416065e+01 0.3804344 1.043440e+01
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
Robust linear regression
set.seed(15668)
rlmPCA <- train(trainX, trainY,
method = "rlm",
preProcess = "pca",
trControl = trainControl(method = 'cv'))
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
rlmPCA
## Robust Linear Model
##
## 133 samples
## 388 predictors
##
## Pre-processing: principal component signal extraction (388), centered
## (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 119, 119, 119, 120, ...
## Resampling results across tuning parameters:
##
## intercept psi RMSE Rsquared MAE
## FALSE psi.huber 17.37841 0.4059832 13.687887
## FALSE psi.hampel 17.27178 0.4229198 13.822532
## FALSE psi.bisquare 17.28827 0.4176797 13.678140
## TRUE psi.huber 12.43291 0.4329646 9.015585
## TRUE psi.hampel 12.28054 0.4336035 9.145003
## TRUE psi.bisquare 13.26879 0.4223061 8.803127
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.hampel.
plot(rlmPCA)
elasticnet model with enet and glmnet package
sum(is.na(trainX))
## [1] 0
enetGrid <- expand.grid(.lambda = c(0, 1, by = .1),
.fraction = seq(0, 1, by=0.1))
set.seed(100)
enetTune <- train(trainX, trainY,
method = "enet",
tuneGrid = enetGrid,
trControl = trainControl(method = 'cv'),
preProc = c("center", "scale"))
## Warning: model fit failed for Fold01: lambda=0.0, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold05: lambda=0.0, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
Something wrongwith elasticnet model…error…There were missing values in resampled performance measures.
for pls model, Rsquared = 0.5401011, ridge-regression model , rsquared = 0.5515002, for the robust linear model , rsquared = 0.5055565 , therefore the ridge-regression model offer a better predictive performance.
6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
library(AppliedPredictiveModeling)
#data(chemicalManufacturing) data set �chemicalManufacturing� not found
data(ChemicalManufacturingProcess)
view(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess
sum(is.na(df))
## [1] 106
missing.values <- function(df){
df %>%
gather(key = "variables", value = "val") %>%
mutate(is.missing = is.na(val)) %>%
group_by(variables, is.missing) %>%
dplyr::summarise(number.missing = n()) %>%
filter(is.missing==T) %>%
dplyr::select(-is.missing) %>%
arrange(desc(number.missing))
}
#missing.values(insuranceT_df1)%>% kable()
# plot missing values
missing.values(df) %>%
ggplot() +
geom_bar(aes(x=variables, y=number.missing), stat = 'identity', col='blue') +
labs(x='variables', y="number of missing values", title='Number of missing values per Variable') +
theme(axis.text.x = element_text(angle = 100, hjust = 0.3)) +
coord_flip()
## `summarise()` has grouped output by 'variables'. You can override using the `.groups` argument.
#vis_miss(training_df)
#gg_miss_var(df, show_pct = TRUE) + labs(y = "Missing Values in % to total record")+ theme()
#colSums(is.na(df))%>% kable()
cat("\n The table below shows the total number of missing values per variable")
##
## The table below shows the total number of missing values per variable
apply(is.na(df), 2, sum)
## Yield BiologicalMaterial01 BiologicalMaterial02
## 0 0 0
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## 0 0 0
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## 0 0 0
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## 0 0 0
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## 0 1 3
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## 15 1 1
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## 2 1 1
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## 0 9 10
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## 1 0 1
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## 0 0 0
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## 0 0 0
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## 0 1 1
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## 1 5 5
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## 5 5 5
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## 5 5 0
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## 5 5 5
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## 5 0 0
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## 0 1 1
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## 0 0 0
## ManufacturingProcess45
## 0
# To impute missing values, the impute package has a function, impute.knn, that uses K-nearest neighbors to estimate the missing data. The previously mentioned
# preProcess function applies imputation methods based on K-nearest neigh-bors or bagged trees.
trans <- preProcess(df[, -c(1)], method = c("center", "scale"))
trans
## Created from 152 samples and 57 variables
##
## Pre-processing:
## - centered (57)
## - ignored (0)
## - scaled (57)
set.seed(23421)
pred <- predict(trans, df[, -c(1)], method=c("center", "scale"))
trainDf <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list = FALSE)
trainX <-pred[trainDf, ]
trainY <- ChemicalManufacturingProcess$Yield[trainDf]
plsTune <- train(trainX, trainY,
method = "pls",
## The default tuning grid evaluates
## components 1... tuneLength
tuneLength = 20,
trControl = trainControl(method = 'cv'),
preProc = c("center", "scale"))
plsTune
## Partial Least Squares
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 131, 129, 130, 129, 129, 130, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.556353 0.4100442 1.237883
## 2 1.705038 0.5387117 1.209708
## 3 1.385879 0.5849881 1.123247
## 4 1.897593 0.5997805 1.249398
## 5 1.988025 0.5969115 1.297806
## 6 2.205661 0.6011127 1.352923
## 7 2.307503 0.5800153 1.397091
## 8 2.632779 0.5644277 1.506758
## 9 2.856106 0.5366897 1.605498
## 10 3.041707 0.5324215 1.672764
## 11 3.282507 0.5220131 1.756760
## 12 3.450897 0.5226061 1.800760
## 13 3.489631 0.5245820 1.807724
## 14 3.601547 0.5239084 1.854144
## 15 3.704923 0.5265612 1.885565
## 16 3.745047 0.5243378 1.904323
## 17 3.919538 0.5206776 1.961643
## 18 4.010953 0.5172996 1.991563
## 19 4.220760 0.5118486 2.056230
## 20 4.377621 0.5065974 2.104799
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
plot(plsTune)
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
testX <- pred[-trainDf,]
testY <- ChemicalManufacturingProcess$Yield[-trainDf]
postResample(pred = predict(plsTune, newdata=testX), obs = testY)
## RMSE Rsquared MAE
## 0.9556998 0.7477124 0.7687460
varImp(plsTune)
## Warning: package 'pls' was built under R version 4.0.5
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
## pls variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess09 100.00
## ManufacturingProcess32 95.38
## ManufacturingProcess13 91.09
## ManufacturingProcess17 87.39
## ManufacturingProcess36 71.46
## BiologicalMaterial02 68.05
## BiologicalMaterial06 65.79
## BiologicalMaterial03 64.78
## ManufacturingProcess11 61.10
## ManufacturingProcess12 60.81
## ManufacturingProcess33 59.19
## ManufacturingProcess06 58.17
## BiologicalMaterial08 55.46
## BiologicalMaterial11 55.17
## BiologicalMaterial01 54.98
## BiologicalMaterial04 54.85
## BiologicalMaterial12 52.66
## ManufacturingProcess28 47.82
## ManufacturingProcess24 44.89
## BiologicalMaterial10 39.58
Based on the loess rsquared 20 most important variables, there are more ManufacturingProcess than BiologicalMaterial, meaning process predictors dominate the list. Also, ManufacturingProcess predictors are more important in the model we have trained.
ManufacturingProcess09 , ManufacturingProcess32, ManufacturingProcess13,ManufacturingProcess17, ManufacturingProcess36, BiologicalMaterial02 predictor and response variable Yield
fit <- lm(Yield ~ ManufacturingProcess09 + ManufacturingProcess32 + ManufacturingProcess13 + ManufacturingProcess17 + ManufacturingProcess36 + BiologicalMaterial02, data = ChemicalManufacturingProcess, method = "qr")
summary(fit)
##
## Call:
## lm(formula = Yield ~ ManufacturingProcess09 + ManufacturingProcess32 +
## ManufacturingProcess13 + ManufacturingProcess17 + ManufacturingProcess36 +
## BiologicalMaterial02, data = ChemicalManufacturingProcess,
## method = "qr")
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6783 -0.8038 0.0640 0.6644 3.1673
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.77108 11.52456 1.542 0.12500
## ManufacturingProcess09 0.33244 0.09937 3.346 0.00102 **
## ManufacturingProcess32 0.15945 0.02830 5.634 7.49e-08 ***
## ManufacturingProcess13 -0.17988 0.16061 -1.120 0.26435
## ManufacturingProcess17 -0.29548 0.11631 -2.540 0.01200 *
## ManufacturingProcess36 -194.43411 162.30594 -1.198 0.23267
## BiologicalMaterial02 0.03898 0.02960 1.317 0.18979
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.115 on 164 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.6534, Adjusted R-squared: 0.6408
## F-statistic: 51.54 on 6 and 164 DF, p-value: < 2.2e-16
#Diagnostic Plots
plot(fit)
Comparing Models with Anova
anova(fit)
## Analysis of Variance Table
##
## Response: Yield
## Df Sum Sq Mean Sq F value Pr(>F)
## ManufacturingProcess09 1 161.369 161.369 129.8606 < 2e-16 ***
## ManufacturingProcess32 1 203.787 203.787 163.9957 < 2e-16 ***
## ManufacturingProcess13 1 8.149 8.149 6.5578 0.01134 *
## ManufacturingProcess17 1 6.426 6.426 5.1716 0.02426 *
## ManufacturingProcess36 1 2.367 2.367 1.9047 0.16943
## BiologicalMaterial02 1 2.154 2.154 1.7336 0.18979
## Residuals 164 203.792 1.243
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variable Selection
Selecting a subset of predictor variables from a larger set (e.g., stepwise selection) is a controversial topic. You can perform stepwise selection (forward, backward, both) using the stepAIC( ) function from the MASS package. stepAIC( ) performs stepwise model selection by exact AIC # {r, warning=FALSE} # # # Stepwise Regression # #library(MASS) # step <- stepAIC(fit, direction="both") # step$anova # display results # #
file:///C:/Users/owner/OneDrive/Documents/Data624/applied%20predictive%20modelling(1).pdf