Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  1. Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.3
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

  1. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

    Filtering with nearZeroVar reduces the predictor set from 1,107 to 388 predictors.

  2. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

    Nine components were optimal, with a resampled R2 of 0.5195.

  3. Predict the response for the test set. What is the test set estimate of R2?

    The test-set R2 is 0.3673.

  4. Try building other models discussed in this chapter. Do any have better predictive performance?

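    An elastic net was also tuned. On the test set, PLS has the slightly higher R2 (0.3673 versus 0.3517), while the elastic net has the lower RMSE (12.24 versus 12.59), so PLS is better only by R2 and the two models are close overall.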

  5. Would you recommend any of your models to replace the permeability laboratory experiment?

    I recommend the partial least squares model because it had the higher test-set R2.

    library(caret)
    ## Warning: package 'caret' was built under R version 4.4.3
    ## Loading required package: ggplot2
    ## Loading required package: lattice
    library(dplyr)
    ## 
    ## Attaching package: 'dplyr'
    ## The following objects are masked from 'package:stats':
    ## 
    ##     filter, lag
    ## The following objects are masked from 'package:base':
    ## 
    ##     intersect, setdiff, setequal, union
    dim(fingerprints)
    ## [1]  165 1107
    fingerprints <- fingerprints[, -nearZeroVar(fingerprints)]
    
    dim(fingerprints)
    ## [1] 165 388
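
The filtering criterion can be illustrated on a toy matrix. The sketch below is a base-R approximation, assuming the documented defaults of caret's nearZeroVar (freqCut = 95/5, uniqueCut = 10); the helper `is_near_zero_var` and the toy matrix `x` are hypothetical and are not the caret implementation.

```r
# Toy binary "fingerprint" matrix: 100 compounds, 3 substructure indicators
set.seed(1)
x <- cbind(
  common = rbinom(100, 1, 0.5),   # present in roughly half of the compounds
  rare   = c(rep(0, 98), 1, 1),   # present in only 2 of 100 compounds
  absent = rep(0, 100)            # never present (zero variance)
)

# Hypothetical helper that mimics the documented nearZeroVar defaults
is_near_zero_var <- function(v, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(v), decreasing = TRUE)
  if (length(counts) == 1) return(TRUE)            # a single value: zero variance
  freq_ratio <- counts[[1]] / counts[[2]]          # most common vs. second most common
  pct_unique <- 100 * length(counts) / length(v)   # distinct values as % of samples
  freq_ratio > freq_cut && pct_unique < unique_cut
}

flagged <- apply(x, 2, is_near_zero_var)
flagged

# Logical indexing keeps every column when nothing is flagged
x_filtered <- x[, !flagged, drop = FALSE]
dim(x_filtered)
```

Indexing with `!flagged` rather than a negative index is a deliberate choice here: a negative index built from an empty result, such as `x[, -integer(0)]`, selects zero columns instead of all of them.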
set.seed(111)

index <- createDataPartition(permeability, p = .75, list = FALSE)

# training response (train1) and predictors (train2)
train1 <- permeability[index, ]
train2 <- fingerprints[index, ]
# test response (test1) and predictors (test2)
test1 <- permeability[-index, ]
test2 <- fingerprints[-index, ]

ctrl <- trainControl(method = "cv", number = 8)

plotting <- train(train2, train1, method = "pls", metric = "Rsquared",
             tuneLength = 15, trControl = ctrl, preProc = c("center", "scale"))

plotting
## Partial Least Squares 
## 
## 125 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (8 fold) 
## Summary of sample sizes: 109, 109, 110, 109, 110, 109, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.76592  0.3776333   9.815992
##    2     11.40519  0.5078635   8.253262
##    3     11.53084  0.4868592   8.778235
##    4     11.97248  0.4611414   9.434752
##    5     11.96809  0.4664318   8.991471
##    6     11.92453  0.4785022   8.744012
##    7     11.94447  0.4810436   8.791578
##    8     11.81916  0.4983004   8.715324
##    9     11.79335  0.5195373   8.414370
##   10     12.22976  0.4992911   8.837210
##   11     12.57865  0.4857728   9.112221
##   12     12.71654  0.4872855   9.106980
##   13     13.02366  0.4777870   9.403427
##   14     13.63921  0.4575121   9.920551
##   15     14.11936  0.4341340  10.318592
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 9.
plot(plotting) 

# Resampled performance for the selected number of components
plotting$results |> inner_join(plotting$bestTune)
## Joining with `by = join_by(ncomp)`
# Predict permeability for the held-out compounds
plotp <- predict(plotting, test2)

# Compare test-set predictions with the test-set response (test1, not train1)
postResample(plotp, test1)
##       RMSE   Rsquared        MAE 
## 12.5931592  0.3673063  9.2945799
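
As a rough illustration of what these three numbers measure, the sketch below computes them by hand on toy vectors, assuming (as the caret documentation describes) that Rsquared is the squared correlation between predictions and observations. The vectors `obs` and `pred` are made up for the example.

```r
obs  <- c(2, 4, 6, 8, 10)    # toy observed values
pred <- c(3, 3, 7, 7, 11)    # toy predictions for the same five samples

# Predictions and observations must describe the same samples
stopifnot(length(pred) == length(obs))

rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error
r2   <- cor(pred, obs)^2             # squared Pearson correlation
mae  <- mean(abs(pred - obs))        # mean absolute error

c(RMSE = rmse, Rsquared = r2, MAE = mae)
```

The length check matters: scoring test-set predictions against a response vector of a different length makes R recycle values in `pred - obs`, so the resulting metrics are not meaningful.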
# grid of penalties
enetg <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))

# tuning penalized regression model
enetT <- train(train2, train1, method = "enet",
                  tuneGrid = enetg, trControl = ctrl, preProc = c("center", "scale"))
## Warning: model fit failed for Fold4: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
enetT
## Elasticnet 
## 
## 125 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (8 fold) 
## Summary of sample sizes: 109, 110, 110, 109, 109, 110, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE          Rsquared   MAE         
##   0.00    0.05       30481.27963  0.4292787  8.153720e+03
##   0.00    0.10       59520.83323  0.4418802  1.703814e+04
##   0.00    0.15       88661.49911  0.4240393  2.588565e+04
##   0.00    0.20      117803.06516  0.3970320  3.470807e+04
##   0.00    0.25      146957.62969  0.3838957  4.353028e+04
##   0.00    0.30      176118.67526  0.3813290  5.235238e+04
##   0.00    0.35      205283.57554  0.3696072  6.117459e+04
##   0.00    0.40      234450.88510  0.3512829  6.999693e+04
##   0.00    0.45      263619.84565  0.3330654  7.881933e+04
##   0.00    0.50      292789.83988  0.3247918  8.764165e+04
##   0.00    0.55      321960.52673  0.3253290  9.646387e+04
##   0.00    0.60      351131.86379  0.3243580  1.052861e+05
##   0.00    0.65      380303.72506  0.3213973  1.141084e+05
##   0.00    0.70      409476.11739  0.3086267  1.229307e+05
##   0.00    0.75      438648.80642  0.2969719  1.317530e+05
##   0.00    0.80      467821.70505  0.2876825  1.405754e+05
##   0.00    0.85      496994.77190  0.2803339  1.493977e+05
##   0.00    0.90      526167.96034  0.2758102  1.582199e+05
##   0.00    0.95      555341.39003  0.2666238  1.670423e+05
##   0.00    1.00      584514.94958  0.2593920  1.758647e+05
##   0.01    0.05          11.97090  0.4993364  8.503784e+00
##   0.01    0.10          11.35435  0.5257815  8.244001e+00
##   0.01    0.15          11.10789  0.5324969  8.242124e+00
##   0.01    0.20          11.45246  0.5063530  8.587873e+00
##   0.01    0.25          11.87519  0.4837519  8.866184e+00
##   0.01    0.30          12.33365  0.4583002  9.245963e+00
##   0.01    0.35          12.58593  0.4499965  9.450912e+00
##   0.01    0.40          12.85662  0.4387065  9.643957e+00
##   0.01    0.45          13.06077  0.4296960  9.767804e+00
##   0.01    0.50          13.23662  0.4235138  9.884790e+00
##   0.01    0.55          13.44376  0.4169569  1.004188e+01
##   0.01    0.60          13.64430  0.4103143  1.016496e+01
##   0.01    0.65          13.84249  0.4038919  1.027265e+01
##   0.01    0.70          14.07515  0.3941451  1.037727e+01
##   0.01    0.75          14.26050  0.3874334  1.045456e+01
##   0.01    0.80          14.42779  0.3830204  1.052523e+01
##   0.01    0.85          14.61389  0.3774797  1.063585e+01
##   0.01    0.90          14.85299  0.3701958  1.078713e+01
##   0.01    0.95          15.11980  0.3636500  1.098777e+01
##   0.01    1.00          15.34354  0.3585219  1.117518e+01
##   0.10    0.05          11.94762  0.5253833  9.111061e+00
##   0.10    0.10          11.87160  0.5079963  8.356451e+00
##   0.10    0.15          11.83472  0.5044326  8.323712e+00
##   0.10    0.20          11.55149  0.5216011  8.305906e+00
##   0.10    0.25          11.38479  0.5314412  8.341782e+00
##   0.10    0.30          11.34914  0.5352393  8.391422e+00
##   0.10    0.35          11.43840  0.5297517  8.509078e+00
##   0.10    0.40          11.63593  0.5185904  8.655131e+00
##   0.10    0.45          11.84261  0.5072624  8.782906e+00
##   0.10    0.50          12.00880  0.4981932  8.876809e+00
##   0.10    0.55          12.15593  0.4915649  8.977954e+00
##   0.10    0.60          12.29890  0.4861654  9.079283e+00
##   0.10    0.65          12.43018  0.4813938  9.190681e+00
##   0.10    0.70          12.55208  0.4770321  9.283027e+00
##   0.10    0.75          12.66045  0.4723571  9.338851e+00
##   0.10    0.80          12.76165  0.4680022  9.386194e+00
##   0.10    0.85          12.85474  0.4640445  9.426468e+00
##   0.10    0.90          12.94552  0.4598024  9.473281e+00
##   0.10    0.95          13.05098  0.4556771  9.534642e+00
##   0.10    1.00          13.14528  0.4521864  9.580251e+00
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.15 and lambda = 0.01.
plot(enetT)

enetp <- predict(enetT, test2)

postResample(enetp, test1)
##       RMSE   Rsquared        MAE 
## 12.2403848  0.3517408  8.1202141

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

  1. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

    There are 106 missing values, which were filled in with bagged-tree imputation (bagImpute).

  2. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

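    For the PLS model, the optimal resampled RMSE is 1.4909 (ncomp = 1); the elastic net reaches a lower resampled RMSE of 1.1927 (lambda = 0.1, fraction = 0.4).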

  3. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

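    Using the PLS model, the test-set RMSE is 1.2760 and the R2 is 0.2746. The RMSE is lower than the resampled RMSE of 1.4909, but the R2 is below the resampled value of 0.4234.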

  4. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

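    ManufacturingProcess32, ManufacturingProcess36, ManufacturingProcess13, and ManufacturingProcess09 are the most important predictors, followed by BiologicalMaterial02, BiologicalMaterial06, and BiologicalMaterial03. Process predictors hold the top four positions and 12 of the top 20, so they dominate the list.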

  5. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

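    The correlation plot shows that many of the top predictors are negatively correlated with yield; only a few, such as BiologicalMaterial09 and ManufacturingProcess09, are positively correlated with it. Because the manufacturing process predictors can be changed, these relationships indicate which settings to adjust in future runs to improve yield.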

ChemicalManufacturingProcess
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
missing <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
chemical <- predict(missing, ChemicalManufacturingProcess)

sum(is.na(chemical))
## [1] 0
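
The count-then-impute pattern above can be sketched in base R on a toy data frame. This example swaps in median imputation as a simpler stand-in for bagImpute (caret's preProcess also offers a "medianImpute" method); the data frame `df` is invented for illustration.

```r
# Toy data frame with a few missing cells (simulated, not the chemical data)
df <- data.frame(a = c(1, NA, 3, 4), b = c(NA, 2, 2, 8), c = c(5, 6, 7, 8))

colSums(is.na(df))   # missing values per column
sum(is.na(df))       # total number of missing cells

# Median imputation: replace each missing cell with its column median
imputed <- df
for (col in names(imputed)) {
  miss <- is.na(imputed[[col]])
  imputed[[col]][miss] <- median(imputed[[col]], na.rm = TRUE)
}

sum(is.na(imputed))  # no missing cells should remain
```

Median imputation ignores relationships between predictors, which is the main thing bagged-tree imputation adds at a higher computational cost.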
# filtering low frequencies
chemical <- chemical[, -nearZeroVar(chemical)]

# index for training
index <- createDataPartition(chemical$Yield, p = .9, list = FALSE)

# train 
train_chem <- chemical[index, ]

# test
test_chem <- chemical[-index, ]
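
For the test set to give an honest performance estimate, the model should be fit to the training rows only. The following is a minimal base-R sketch of that hold-out workflow on simulated data; the data frame `dat`, its columns, and the 25% split are assumptions made for the example, and `lm` stands in for whichever model is tuned.

```r
set.seed(42)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n, sd = 0.5)   # simulated response

# Hold out 25 rows; the model never sees them during fitting
test_idx <- sample(n, size = 25)
train_df <- dat[-test_idx, ]
test_df  <- dat[test_idx, ]

fit  <- lm(y ~ ., data = train_df)          # fit on the training rows only
pred <- predict(fit, newdata = test_df)     # predict the held-out rows

test_rmse <- sqrt(mean((pred - test_df$y)^2))
test_rmse
```

If the model were fit to `dat` instead of `train_df`, the held-out rows would already have influenced the fit and the test RMSE would tend to be optimistic.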

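# Note: this call fits on all 176 runs in chemical; fitting on train_chem instead
# would keep test_chem fully held out
plsT <- train(Yield ~ ., chemical, method = "pls", 
             tuneLength = 15, trControl = ctrl, preProc = c("center", "scale"))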
plsT 
## Partial Least Squares 
## 
## 176 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (8 fold) 
## Summary of sample sizes: 155, 155, 153, 155, 154, 153, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.490913  0.4234020  1.177767
##    2     1.854481  0.5037120  1.242113
##    3     1.520139  0.5710656  1.075256
##    4     1.624426  0.5474911  1.154110
##    5     1.825080  0.5353433  1.219167
##    6     1.874234  0.5227565  1.230690
##    7     1.930315  0.5155914  1.232501
##    8     2.020598  0.5036775  1.258564
##    9     2.041942  0.4935666  1.269258
##   10     2.065987  0.4921261  1.265742
##   11     1.997603  0.5002811  1.245103
##   12     1.972232  0.5025815  1.245685
##   13     1.880550  0.5247905  1.206295
##   14     1.824223  0.5388444  1.172811
##   15     1.802969  0.5422419  1.152112
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.
plot(plsT) 

enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.1, 1, length = 25))

enetT <- train(Yield ~ ., chemical , method = "enet", 
                  tuneGrid = enetGrid, trControl = ctrl, preProc = c("center", "scale"))

enetT
## Elasticnet 
## 
## 176 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (8 fold) 
## Summary of sample sizes: 153, 156, 155, 153, 153, 155, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.00    0.1000    1.207331  0.5978623  0.9569458
##   0.00    0.1375    1.324607  0.5621426  0.9895011
##   0.00    0.1750    1.538919  0.5355144  1.0310770
##   0.00    0.2125    1.589234  0.5344367  1.0520268
##   0.00    0.2500    1.730925  0.4954886  1.1134145
##   0.00    0.2875    1.707424  0.4915925  1.1207487
##   0.00    0.3250    1.703017  0.4859474  1.1293331
##   0.00    0.3625    1.642489  0.4928759  1.1234124
##   0.00    0.4000    1.577436  0.5068484  1.1115689
##   0.00    0.4375    1.476431  0.5370977  1.0876597
##   0.00    0.4750    1.430996  0.5531546  1.0701728
##   0.00    0.5125    1.780415  0.4982220  1.1585399
##   0.00    0.5500    2.251001  0.4803639  1.2662560
##   0.00    0.5875    2.616397  0.4725896  1.3441325
##   0.00    0.6250    3.004431  0.4630931  1.4256121
##   0.00    0.6625    3.437026  0.4373990  1.5299531
##   0.00    0.7000    3.891834  0.4121174  1.6343379
##   0.00    0.7375    4.354967  0.3916589  1.7394552
##   0.00    0.7750    4.822157  0.3764792  1.8462864
##   0.00    0.8125    5.290092  0.3656410  1.9532045
##   0.00    0.8500    5.774969  0.3585026  2.0632708
##   0.00    0.8875    6.224207  0.3530210  2.1654957
##   0.00    0.9250    6.661895  0.3486991  2.2633130
##   0.00    0.9625    7.160535  0.3449146  2.3731546
##   0.00    1.0000    7.627006  0.3416514  2.4757928
##   0.01    0.1000    1.303930  0.6132325  1.0605619
##   0.01    0.1375    1.222196  0.6034488  0.9950007
##   0.01    0.1750    1.227819  0.5830197  0.9771749
##   0.01    0.2125    1.218043  0.5907455  0.9633317
##   0.01    0.2500    1.205897  0.6002927  0.9515431
##   0.01    0.2875    1.214403  0.6009651  0.9480287
##   0.01    0.3250    1.235343  0.5981000  0.9528313
##   0.01    0.3625    1.254530  0.5977652  0.9555235
##   0.01    0.4000    1.299327  0.5924323  0.9632277
##   0.01    0.4375    1.326394  0.5859124  0.9820789
##   0.01    0.4750    1.400062  0.5633033  1.0112497
##   0.01    0.5125    1.531019  0.5291596  1.0605355
##   0.01    0.5500    1.687262  0.4919515  1.1090516
##   0.01    0.5875    1.792510  0.4759005  1.1388082
##   0.01    0.6250    1.844829  0.4692182  1.1547790
##   0.01    0.6625    1.879985  0.4640821  1.1672257
##   0.01    0.7000    1.920132  0.4591206  1.1804327
##   0.01    0.7375    1.970980  0.4542185  1.1952388
##   0.01    0.7750    2.033463  0.4488508  1.2127914
##   0.01    0.8125    2.002314  0.4494825  1.2089180
##   0.01    0.8500    1.917560  0.4571726  1.1928867
##   0.01    0.8875    1.847321  0.4688804  1.1792176
##   0.01    0.9250    1.782173  0.4858622  1.1645221
##   0.01    0.9625    1.728535  0.5024984  1.1472102
##   0.01    1.0000    1.715237  0.5018991  1.1427518
##   0.10    0.1000    1.494869  0.5972803  1.2124796
##   0.10    0.1375    1.392957  0.6095156  1.1321717
##   0.10    0.1750    1.309280  0.6107641  1.0653720
##   0.10    0.2125    1.247076  0.6127554  1.0157471
##   0.10    0.2500    1.214549  0.6047069  0.9887146
##   0.10    0.2875    1.208770  0.5958584  0.9743990
##   0.10    0.3250    1.203215  0.5966161  0.9645080
##   0.10    0.3625    1.193808  0.6027521  0.9556947
##   0.10    0.4000    1.192736  0.6054448  0.9504059
##   0.10    0.4375    1.193442  0.6073676  0.9473647
##   0.10    0.4750    1.230561  0.5975903  0.9552203
##   0.10    0.5125    1.300753  0.5857970  0.9709291
##   0.10    0.5500    1.368343  0.5789347  0.9875984
##   0.10    0.5875    1.435716  0.5762059  1.0018755
##   0.10    0.6250    1.476726  0.5745453  1.0155836
##   0.10    0.6625    1.535181  0.5643222  1.0323130
##   0.10    0.7000    1.612147  0.5464156  1.0585143
##   0.10    0.7375    1.696472  0.5259372  1.0878757
##   0.10    0.7750    1.780811  0.5078655  1.1153995
##   0.10    0.8125    1.808678  0.5020110  1.1260408
##   0.10    0.8500    1.830015  0.4976426  1.1351344
##   0.10    0.8875    1.862698  0.4938794  1.1460052
##   0.10    0.9250    1.893092  0.4909857  1.1561965
##   0.10    0.9625    1.919196  0.4886921  1.1648443
##   0.10    1.0000    1.935204  0.4868378  1.1707266
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.1.
plot(enetT)

lm_model <- lm(Yield ~ ., chemical)

summary(lm_model)
## 
## Call:
## lm(formula = Yield ~ ., data = chemical)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.16927 -0.54577 -0.03172  0.50904  2.00004 
## 
## Coefficients: (1 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.960e+00  8.736e+01   0.057  0.95481    
## BiologicalMaterial01    2.529e-01  3.346e-01   0.756  0.45131    
## BiologicalMaterial02   -1.130e-01  1.280e-01  -0.883  0.37913    
## BiologicalMaterial03    1.541e-01  2.363e-01   0.652  0.51537    
## BiologicalMaterial04   -1.013e-01  5.288e-01  -0.191  0.84847    
## BiologicalMaterial05    1.512e-01  1.067e-01   1.417  0.15920    
## BiologicalMaterial06    1.578e-02  3.021e-01   0.052  0.95844    
## BiologicalMaterial08    3.848e-01  6.381e-01   0.603  0.54756    
## BiologicalMaterial09   -7.929e-01  1.372e+00  -0.578  0.56434    
## BiologicalMaterial10    5.728e-02  1.380e+00   0.042  0.96696    
## BiologicalMaterial11   -8.874e-02  8.291e-02  -1.070  0.28663    
## BiologicalMaterial12    3.298e-01  6.346e-01   0.520  0.60426    
## ManufacturingProcess01  6.666e-02  9.485e-02   0.703  0.48355    
## ManufacturingProcess02  1.404e-02  4.459e-02   0.315  0.75338    
## ManufacturingProcess03 -3.335e+00  5.121e+00  -0.651  0.51618    
## ManufacturingProcess04  6.350e-02  2.943e-02   2.158  0.03292 *  
## ManufacturingProcess05  7.740e-04  3.862e-03   0.200  0.84149    
## ManufacturingProcess06  3.297e-02  4.326e-02   0.762  0.44750    
## ManufacturingProcess07 -1.725e-01  2.134e-01  -0.808  0.42066    
## ManufacturingProcess08 -7.092e-02  2.521e-01  -0.281  0.77892    
## ManufacturingProcess09  2.638e-01  1.796e-01   1.469  0.14445    
## ManufacturingProcess10 -1.017e-01  5.728e-01  -0.177  0.85942    
## ManufacturingProcess11  1.899e-01  7.123e-01   0.267  0.79019    
## ManufacturingProcess12  3.486e-05  1.027e-04   0.340  0.73474    
## ManufacturingProcess13 -2.556e-01  3.826e-01  -0.668  0.50539    
## ManufacturingProcess14  8.961e-04  1.114e-02   0.080  0.93605    
## ManufacturingProcess15  1.426e-03  8.929e-03   0.160  0.87335    
## ManufacturingProcess16 -4.999e-05  3.193e-04  -0.157  0.87587    
## ManufacturingProcess17 -1.469e-01  3.012e-01  -0.488  0.62672    
## ManufacturingProcess18  4.254e-03  4.454e-03   0.955  0.34144    
## ManufacturingProcess19 -2.604e-03  7.305e-03  -0.356  0.72214    
## ManufacturingProcess20 -4.506e-03  4.713e-03  -0.956  0.34097    
## ManufacturingProcess21         NA         NA      NA       NA    
## ManufacturingProcess22 -1.571e-02  4.197e-02  -0.374  0.70884    
## ManufacturingProcess23 -4.324e-02  8.345e-02  -0.518  0.60526    
## ManufacturingProcess24 -1.906e-02  2.337e-02  -0.816  0.41633    
## ManufacturingProcess25 -7.629e-03  1.395e-02  -0.547  0.58556    
## ManufacturingProcess26  6.883e-03  1.058e-02   0.651  0.51638    
## ManufacturingProcess27 -6.951e-03  7.822e-03  -0.889  0.37595    
## ManufacturingProcess28 -7.652e-02  3.111e-02  -2.460  0.01534 *  
## ManufacturingProcess29  1.406e+00  9.034e-01   1.556  0.12233    
## ManufacturingProcess30 -4.007e-01  6.269e-01  -0.639  0.52388    
## ManufacturingProcess31  5.243e-02  1.209e-01   0.434  0.66534    
## ManufacturingProcess32  3.322e-01  6.928e-02   4.795 4.71e-06 ***
## ManufacturingProcess33 -3.975e-01  1.301e-01  -3.054  0.00278 ** 
## ManufacturingProcess34 -1.324e+00  2.802e+00  -0.473  0.63735    
## ManufacturingProcess35 -1.955e-02  1.763e-02  -1.109  0.26959    
## ManufacturingProcess36  2.977e+02  3.095e+02   0.962  0.33805    
## ManufacturingProcess37 -6.997e-01  2.898e-01  -2.414  0.01729 *  
## ManufacturingProcess38 -1.846e-01  2.419e-01  -0.763  0.44701    
## ManufacturingProcess39  7.242e-02  1.311e-01   0.552  0.58165    
## ManufacturingProcess40  4.544e-01  6.581e+00   0.069  0.94507    
## ManufacturingProcess41  2.881e-01  4.761e+00   0.061  0.95184    
## ManufacturingProcess42  2.910e-02  2.125e-01   0.137  0.89130    
## ManufacturingProcess43  2.282e-01  1.187e-01   1.922  0.05702 .  
## ManufacturingProcess44 -4.892e-01  1.181e+00  -0.414  0.67946    
## ManufacturingProcess45  9.578e-01  5.422e-01   1.766  0.07989 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.041 on 120 degrees of freedom
## Multiple R-squared:  0.7818, Adjusted R-squared:  0.6818 
## F-statistic: 7.816 on 55 and 120 DF,  p-value: < 2.2e-16
lm_model
## 
## Call:
## lm(formula = Yield ~ ., data = chemical)
## 
## Coefficients:
##            (Intercept)    BiologicalMaterial01    BiologicalMaterial02  
##              4.960e+00               2.529e-01              -1.130e-01  
##   BiologicalMaterial03    BiologicalMaterial04    BiologicalMaterial05  
##              1.541e-01              -1.013e-01               1.512e-01  
##   BiologicalMaterial06    BiologicalMaterial08    BiologicalMaterial09  
##              1.578e-02               3.848e-01              -7.929e-01  
##   BiologicalMaterial10    BiologicalMaterial11    BiologicalMaterial12  
##              5.728e-02              -8.874e-02               3.298e-01  
## ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  
##              6.666e-02               1.404e-02              -3.335e+00  
## ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  
##              6.350e-02               7.740e-04               3.297e-02  
## ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess09  
##             -1.725e-01              -7.092e-02               2.638e-01  
## ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  
##             -1.017e-01               1.899e-01               3.486e-05  
## ManufacturingProcess13  ManufacturingProcess14  ManufacturingProcess15  
##             -2.556e-01               8.961e-04               1.426e-03  
## ManufacturingProcess16  ManufacturingProcess17  ManufacturingProcess18  
##             -4.999e-05              -1.469e-01               4.254e-03  
## ManufacturingProcess19  ManufacturingProcess20  ManufacturingProcess21  
##             -2.604e-03              -4.506e-03                      NA  
## ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  
##             -1.571e-02              -4.324e-02              -1.906e-02  
## ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  
##             -7.629e-03               6.883e-03              -6.951e-03  
## ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  
##             -7.652e-02               1.406e+00              -4.007e-01  
## ManufacturingProcess31  ManufacturingProcess32  ManufacturingProcess33  
##              5.243e-02               3.322e-01              -3.975e-01  
## ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  
##             -1.324e+00              -1.955e-02               2.977e+02  
## ManufacturingProcess37  ManufacturingProcess38  ManufacturingProcess39  
##             -6.997e-01              -1.846e-01               7.242e-02  
## ManufacturingProcess40  ManufacturingProcess41  ManufacturingProcess42  
##              4.544e-01               2.881e-01               2.910e-02  
## ManufacturingProcess43  ManufacturingProcess44  ManufacturingProcess45  
##              2.282e-01              -4.892e-01               9.578e-01
lm_predict <- predict(lm_model, test_chem[ ,-1])

postResample(lm_predict, test_chem[ ,1])
##      RMSE  Rsquared       MAE 
## 0.8719894 0.6547831 0.6328814
pls_predict <- predict(plsT, test_chem[ ,-1])

postResample(pls_predict, test_chem[ ,1])
##      RMSE  Rsquared       MAE 
## 1.2759739 0.2746225 1.0764468
enet_predict <- predict(enetT, test_chem[ ,-1])

postResample(enet_predict, test_chem[ ,1])
##      RMSE  Rsquared       MAE 
## 1.0480148 0.4872262 0.8797834
varImp(plsT)
## Warning: package 'pls' was built under R version 4.4.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess36   86.88
## ManufacturingProcess13   82.77
## ManufacturingProcess09   82.73
## BiologicalMaterial02     79.12
## BiologicalMaterial06     78.57
## BiologicalMaterial03     73.12
## ManufacturingProcess33   70.72
## ManufacturingProcess17   69.95
## ManufacturingProcess06   64.34
## BiologicalMaterial08     62.56
## BiologicalMaterial04     62.38
## BiologicalMaterial12     60.35
## BiologicalMaterial01     58.94
## BiologicalMaterial11     58.27
## ManufacturingProcess12   57.05
## ManufacturingProcess11   53.14
## ManufacturingProcess28   43.30
## ManufacturingProcess04   41.85
## ManufacturingProcess30   37.22
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 4.4.3
chemical |> select(Yield , ends_with(c("32","13","36","09")  )) |>
  plot_correlation()
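
The same relationships can be read as numbers rather than from a heat map. The sketch below, on simulated data, ranks predictors by their correlation with the response; the data frame `toy` and the predictor names `p1` and `p2` are hypothetical and only illustrate the pattern.

```r
set.seed(7)
n <- 50
toy <- data.frame(p1 = rnorm(n), p2 = rnorm(n))
# Simulated yield that rises with p1 and falls with p2
toy$Yield <- 40 + 1.5 * toy$p1 - 1.0 * toy$p2 + rnorm(n, sd = 0.5)

# Correlation of each predictor with the response, strongest first
cors <- cor(toy)[c("p1", "p2"), "Yield"]
cors[order(abs(cors), decreasing = TRUE)]
```

The sign of each correlation suggests the direction in which a controllable predictor might be moved, although correlation alone does not establish that changing the setting will change yield.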