Data624

Github Link
Web Link

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

Start R and use these commands to load the data:

#install.packages('AppliedPredictiveModeling')
library(AppliedPredictiveModeling)  #caret library

## Warning: package 'AppliedPredictiveModeling' was built under R version 4.0.5

data(permeability)
#view(permeability)
#view(fingerprints)

#table(fingerprints)
str(fingerprints)

##  num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

#dim(fingerprints)
vec1 <- nearZeroVar(fingerprints)
str(vec1)

##  int [1:719] 7 8 9 10 13 14 17 18 19 22 ...

cat("The number of predictors removed by nearZeroVar() is : 719\n")

## The number of predictors removed by nearZeroVar() is : 719

cat("\nThe predictors left for modeling are: ",( ncol(fingerprints)- 719))

## 
## The predictors left for modeling are:  388

# When predictors should be removed, a vector of integers is
# returned that indicates which columns should be removed.

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

Using Partial Least Squares , pls package with PLS and PCR functions.

L_fingerprints <- fingerprints[, -nearZeroVar(fingerprints)]
trainDf <- createDataPartition(permeability, p=0.8, list = FALSE)
trainX <-L_fingerprints[trainDf, ]
trainY <- permeability[trainDf, ]

plsTune <- train(trainX, trainY,

 method = "pls",

 ## The default tuning grid evaluates

 ## components 1... tuneLength

 tuneLength = 20,

 trControl = trainControl(method = 'cv'),

 preProc = c("center", "scale"))

plsTune

## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 119, 120, 118, 120, 118, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.35336  0.2938959  10.220062
##    2     12.25110  0.4363139   8.592127
##    3     12.31136  0.4193319   9.230250
##    4     12.37045  0.4237796   9.314451
##    5     12.36047  0.4351206   9.164026
##    6     12.22259  0.4519223   9.113898
##    7     11.95712  0.4694642   9.045313
##    8     11.94385  0.4773805   9.101885
##    9     12.07763  0.4721779   9.165854
##   10     12.21463  0.4730678   9.418333
##   11     12.65542  0.4548923   9.716033
##   12     12.80730  0.4475218   9.927495
##   13     13.02832  0.4403498  10.112927
##   14     13.36616  0.4273066  10.422379
##   15     13.84797  0.4092395  10.785683
##   16     14.03591  0.4160907  10.975621
##   17     14.44770  0.4028446  11.395831
##   18     14.81086  0.3938069  11.672057
##   19     14.99114  0.3890475  11.802900
##   20     15.21497  0.3835949  11.935622
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.

plot(plsTune)

Predict the response for the test set. What is the test set estimate of R2?

testX <- L_fingerprints[-trainDf,]
testY <- permeability[-trainDf,]
postResample(pred = predict(plsTune, newdata=testX), obs = testY)

##       RMSE   Rsquared        MAE 
## 10.1860552  0.5734765  7.5226062

Try building other models discussed in this chapter. Do any have better predictive performance? -Ridge-regression Model with lm.ridge function from MASS package

## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 120, 120, 120, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE          Rsquared   MAE         
##   0.000000000  2.092689e+06  0.2337601  1.372793e+06
##   0.007142857  1.739555e+01  0.2626825  1.249489e+01
##   0.014285714  1.630146e+01  0.2925794  1.171292e+01
##   0.021428571  1.574087e+01  0.3050547  1.139997e+01
##   0.028571429  1.563652e+01  0.3113932  1.131897e+01
##   0.035714286  1.512477e+01  0.3276945  1.102041e+01
##   0.042857143  1.500001e+01  0.3334625  1.094598e+01
##   0.050000000  1.480649e+01  0.3428727  1.085559e+01
##   0.057142857  1.465880e+01  0.3495139  1.075191e+01
##   0.064285714  1.454606e+01  0.3557285  1.067354e+01
##   0.071428571  1.446967e+01  0.3605672  1.062030e+01
##   0.078571429  1.434900e+01  0.3673231  1.055879e+01
##   0.085714286  1.426747e+01  0.3723594  1.050630e+01
##   0.092857143  1.421732e+01  0.3760147  1.047285e+01
##   0.100000000  1.416065e+01  0.3804344  1.043440e+01
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

Robust linear regression

set.seed(15668)
rlmPCA <- train(trainX, trainY,

 method = "rlm",
 preProcess = "pca",

 trControl = trainControl(method = 'cv'))

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

rlmPCA

## Robust Linear Model 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: principal component signal extraction (388), centered
##  (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 119, 119, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   intercept  psi           RMSE      Rsquared   MAE      
##   FALSE      psi.huber     17.37841  0.4059832  13.687887
##   FALSE      psi.hampel    17.27178  0.4229198  13.822532
##   FALSE      psi.bisquare  17.28827  0.4176797  13.678140
##    TRUE      psi.huber     12.43291  0.4329646   9.015585
##    TRUE      psi.hampel    12.28054  0.4336035   9.145003
##    TRUE      psi.bisquare  13.26879  0.4223061   8.803127
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.hampel.

plot(rlmPCA)

elasticnet model with enet and glmnet package

sum(is.na(trainX))

## [1] 0

enetGrid <- expand.grid(.lambda = c(0, 1, by = .1),
 .fraction = seq(0, 1, by=0.1))

 set.seed(100)

 enetTune <- train(trainX, trainY,

 method = "enet",

 tuneGrid = enetGrid,

 trControl = trainControl(method = 'cv'),

 preProc = c("center", "scale"))

## Warning: model fit failed for Fold01: lambda=0.0, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed

## Warning: model fit failed for Fold05: lambda=0.0, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.

Something wrongwith elasticnet model…error…There were missing values in resampled performance measures.

for pls model, Rsquared = 0.5401011, ridge-regression model , rsquared = 0.5515002, for the robust linear model , rsquared = 0.5055565 , therefore the ridge-regression model offer a better predictive performance.

Would you recommend any of your models to replace the permeability laboratory experiment? I would be inclined to recommend ridge-regression model to replace the permeability experiment, but the Rsquared of 0.55 is not strong. this is just to indicate that the data approximate 55% fit the linear model. In other words, there are 55% of the variability in the outcome data(permeability) cannot be explained by the model.

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

Start R and use these commands to load the data Chemical manufacturing data: This data set contains information about a chemical manufacturing process, in which the goal is to understand the relationship between the process and the resulting final product yield and can be found in the in the AppliedPredictiveModeling R package.

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

library(AppliedPredictiveModeling)
#data(chemicalManufacturing) data set �chemicalManufacturing� not found
data(ChemicalManufacturingProcess)
view(ChemicalManufacturingProcess)

df <- ChemicalManufacturingProcess

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8). ##### To impute missing values, the impute package has a function, impute.knn, that uses K-nearest neighbors to estimate the missing data. The previously mentioned ##### preProcess function applies imputation methods based on K-nearest neigh-bors or bagged trees.

sum(is.na(df))

## [1] 106

missing.values <- function(df){
    df %>%
    gather(key = "variables", value = "val") %>%
    mutate(is.missing = is.na(val)) %>%
    group_by(variables, is.missing) %>%
    dplyr::summarise(number.missing = n()) %>%
    filter(is.missing==T) %>%
    dplyr::select(-is.missing) %>%
    arrange(desc(number.missing)) 
}

#missing.values(insuranceT_df1)%>% kable()


# plot missing values
 missing.values(df) %>%
   ggplot() +
     geom_bar(aes(x=variables, y=number.missing), stat = 'identity', col='blue') +
     labs(x='variables', y="number of missing values", title='Number of missing values per Variable') +
   theme(axis.text.x = element_text(angle = 100, hjust = 0.3)) + 
   coord_flip()

## `summarise()` has grouped output by 'variables'. You can override using the `.groups` argument.

#vis_miss(training_df)
#gg_miss_var(df, show_pct = TRUE) + labs(y = "Missing Values in % to total record")+ theme()
#colSums(is.na(df))%>% kable()
cat("\n The table below shows the total number of missing values per variable")

## 
##  The table below shows the total number of missing values per variable

apply(is.na(df), 2, sum)

##                  Yield   BiologicalMaterial01   BiologicalMaterial02 
##                      0                      0                      0 
##   BiologicalMaterial03   BiologicalMaterial04   BiologicalMaterial05 
##                      0                      0                      0 
##   BiologicalMaterial06   BiologicalMaterial07   BiologicalMaterial08 
##                      0                      0                      0 
##   BiologicalMaterial09   BiologicalMaterial10   BiologicalMaterial11 
##                      0                      0                      0 
##   BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02 
##                      0                      1                      3 
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 
##                     15                      1                      1 
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 
##                      2                      1                      1 
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11 
##                      0                      9                     10 
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14 
##                      1                      0                      1 
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17 
##                      0                      0                      0 
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20 
##                      0                      0                      0 
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23 
##                      0                      1                      1 
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 
##                      1                      5                      5 
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 
##                      5                      5                      5 
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32 
##                      5                      5                      0 
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 
##                      5                      5                      5 
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38 
##                      5                      0                      0 
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41 
##                      0                      1                      1 
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44 
##                      0                      0                      0 
## ManufacturingProcess45 
##                      0

# To impute missing values, the impute package has a function, impute.knn, that uses K-nearest neighbors to estimate the missing data. The previously mentioned
# preProcess function applies imputation methods based on K-nearest neigh-bors or bagged trees.

trans <- preProcess(df[, -c(1)],  method = c("center", "scale"))

trans

## Created from 152 samples and 57 variables
## 
## Pre-processing:
##   - centered (57)
##   - ignored (0)
##   - scaled (57)

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

set.seed(23421)
pred <- predict(trans, df[, -c(1)], method=c("center", "scale"))

trainDf <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list = FALSE)
trainX <-pred[trainDf, ]
trainY <- ChemicalManufacturingProcess$Yield[trainDf]

plsTune <- train(trainX, trainY,

 method = "pls",

 ## The default tuning grid evaluates

 ## components 1... tuneLength

 tuneLength = 20,

 trControl = trainControl(method = 'cv'),

 preProc = c("center", "scale"))

plsTune

## Partial Least Squares 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 131, 129, 130, 129, 129, 130, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.556353  0.4100442  1.237883
##    2     1.705038  0.5387117  1.209708
##    3     1.385879  0.5849881  1.123247
##    4     1.897593  0.5997805  1.249398
##    5     1.988025  0.5969115  1.297806
##    6     2.205661  0.6011127  1.352923
##    7     2.307503  0.5800153  1.397091
##    8     2.632779  0.5644277  1.506758
##    9     2.856106  0.5366897  1.605498
##   10     3.041707  0.5324215  1.672764
##   11     3.282507  0.5220131  1.756760
##   12     3.450897  0.5226061  1.800760
##   13     3.489631  0.5245820  1.807724
##   14     3.601547  0.5239084  1.854144
##   15     3.704923  0.5265612  1.885565
##   16     3.745047  0.5243378  1.904323
##   17     3.919538  0.5206776  1.961643
##   18     4.010953  0.5172996  1.991563
##   19     4.220760  0.5118486  2.056230
##   20     4.377621  0.5065974  2.104799
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

plot(plsTune)

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

testX <- pred[-trainDf,]
testY <- ChemicalManufacturingProcess$Yield[-trainDf]
postResample(pred = predict(plsTune, newdata=testX), obs = testY)

##      RMSE  Rsquared       MAE 
## 0.9556998 0.7477124 0.7687460

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

varImp(plsTune)

## Warning: package 'pls' was built under R version 4.0.5

## 
## Attaching package: 'pls'

## The following object is masked from 'package:caret':
## 
##     R2

## The following object is masked from 'package:stats':
## 
##     loadings

## pls variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess09  100.00
## ManufacturingProcess32   95.38
## ManufacturingProcess13   91.09
## ManufacturingProcess17   87.39
## ManufacturingProcess36   71.46
## BiologicalMaterial02     68.05
## BiologicalMaterial06     65.79
## BiologicalMaterial03     64.78
## ManufacturingProcess11   61.10
## ManufacturingProcess12   60.81
## ManufacturingProcess33   59.19
## ManufacturingProcess06   58.17
## BiologicalMaterial08     55.46
## BiologicalMaterial11     55.17
## BiologicalMaterial01     54.98
## BiologicalMaterial04     54.85
## BiologicalMaterial12     52.66
## ManufacturingProcess28   47.82
## ManufacturingProcess24   44.89
## BiologicalMaterial10     39.58

Based on the loess rsquared 20 most important variables, there are more ManufacturingProcess than BiologicalMaterial, meaning process predictors dominate the list. Also, ManufacturingProcess predictors are more important in the model we have trained.

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process? The predictors with better AIC values would be used to get new predictors which would improve Yield in the future runs of the manufacturing proces.

ManufacturingProcess09 , ManufacturingProcess32, ManufacturingProcess13,ManufacturingProcess17, ManufacturingProcess36, BiologicalMaterial02 predictor and response variable Yield

fit <- lm(Yield ~ ManufacturingProcess09 + ManufacturingProcess32 + ManufacturingProcess13 + ManufacturingProcess17 + ManufacturingProcess36 + BiologicalMaterial02, data = ChemicalManufacturingProcess, method = "qr")
summary(fit)

## 
## Call:
## lm(formula = Yield ~ ManufacturingProcess09 + ManufacturingProcess32 + 
##     ManufacturingProcess13 + ManufacturingProcess17 + ManufacturingProcess36 + 
##     BiologicalMaterial02, data = ChemicalManufacturingProcess, 
##     method = "qr")
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6783 -0.8038  0.0640  0.6644  3.1673 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              17.77108   11.52456   1.542  0.12500    
## ManufacturingProcess09    0.33244    0.09937   3.346  0.00102 ** 
## ManufacturingProcess32    0.15945    0.02830   5.634 7.49e-08 ***
## ManufacturingProcess13   -0.17988    0.16061  -1.120  0.26435    
## ManufacturingProcess17   -0.29548    0.11631  -2.540  0.01200 *  
## ManufacturingProcess36 -194.43411  162.30594  -1.198  0.23267    
## BiologicalMaterial02      0.03898    0.02960   1.317  0.18979    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.115 on 164 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.6534, Adjusted R-squared:  0.6408 
## F-statistic: 51.54 on 6 and 164 DF,  p-value: < 2.2e-16

#Diagnostic Plots 
plot(fit)

Comparing Models with Anova

anova(fit)

## Analysis of Variance Table
## 
## Response: Yield
##                         Df  Sum Sq Mean Sq  F value  Pr(>F)    
## ManufacturingProcess09   1 161.369 161.369 129.8606 < 2e-16 ***
## ManufacturingProcess32   1 203.787 203.787 163.9957 < 2e-16 ***
## ManufacturingProcess13   1   8.149   8.149   6.5578 0.01134 *  
## ManufacturingProcess17   1   6.426   6.426   5.1716 0.02426 *  
## ManufacturingProcess36   1   2.367   2.367   1.9047 0.16943    
## BiologicalMaterial02     1   2.154   2.154   1.7336 0.18979    
## Residuals              164 203.792   1.243                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable Selection

Selecting a subset of predictor variables from a larger set (e.g., stepwise selection) is a controversial topic. You can perform stepwise selection (forward, backward, both) using the stepAIC( ) function from the MASS package. stepAIC( ) performs stepwise model selection by exact AIC # {r, warning=FALSE} # # # Stepwise Regression # #library(MASS) # step <- stepAIC(fit, direction="both") # step$anova # display results # #

file:///C:/Users/owner/OneDrive/Documents/Data624/applied%20predictive%20modelling(1).pdf

https://www.statmethods.net/stats/regression.html

Data624_HW7

Alexis Mekueko

11/4/2021