In this assignment, the problems 6.2 and 6.3 have been solved from the Kuhn and Johnson book.

library(AppliedPredictiveModeling)
library(lars)
## Loaded lars 1.3
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(MASS)
library(elasticnet)
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
library(RANN)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Problem-6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  1. Start R and use these commands to load the data:
data(permeability)
#head(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

  1. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
dim(fingerprints)
## [1]  165 1107
# Find out low frequency column numbers
pre_lf <- nearZeroVar(fingerprints)

# Remove low frequency columns 
pred1 <- fingerprints[,-pre_lf]
dim(pred1)
## [1] 165 388

The number of predictors remaining after filtering out those with low frequency is 388.

  1. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
set.seed(123)

# Create a 70-30 train-test split
split_perm<- sample(c(rep(0, 0.7 * nrow(permeability)), 
                  rep(1, 0.3 * nrow(permeability))))

finger_train <- pred1[split_perm == 0,]
finger_test <- pred1[split_perm == 1,]

perm_train <- permeability[split_perm == 0]
perm_test <- permeability[split_perm == 1]

# Get tuned PLS model 
pls_model <- train(x=finger_train, y=perm_train, 
                method='pls', metric='Rsquared',
                tuneLength=20, 
                trControl=trainControl(method='cv'),
                preProcess=c('center', 'scale')
                )

pls_model
## Partial Least Squares 
## 
## 115 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 103, 103, 104, 104, 105, 103, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.13277  0.4070822  10.475643
##    2     11.26798  0.5349631   8.494166
##    3     11.69540  0.5137372   9.164204
##    4     11.66819  0.5037188   9.138646
##    5     11.77796  0.5151984   8.836781
##    6     12.15963  0.4809347   9.316472
##    7     12.28613  0.4808918   9.371363
##    8     12.21954  0.4932274   9.077101
##    9     12.05612  0.5100212   8.958103
##   10     12.21703  0.5106510   9.034134
##   11     12.21932  0.5177616   9.209332
##   12     12.12359  0.5155698   9.220382
##   13     12.11052  0.5071781   9.184265
##   14     12.17779  0.4870462   9.254787
##   15     12.22873  0.4889741   9.194144
##   16     12.31458  0.4822532   9.259095
##   17     12.51942  0.4729862   9.388909
##   18     12.72522  0.4637895   9.499965
##   19     13.05632  0.4484861   9.711818
##   20     13.15870  0.4469599   9.692208
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 2.

The optimal number of latent variables (ncomp) is 5 and the corresponding re-sampled R-squared value is 0.524 and the RMSE value is 11.27, which is the smallest in this set. This indicates that with these 5 components, the model achieves its highest R-squared and captures the most variance in the data among the evaluated models.

pls_model$results|> filter(ncomp == 5)
##   ncomp     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1     5 11.77796 0.5151984 8.836781 4.477242  0.2961044 3.652371
  1. Predict the response for the test set. What is the test set estimate of R2?
# Generate predictions using the PLS model on test data
pls_finger_predict <- predict(pls_model,newdata=finger_test)

# Evaluate model performance with resampling metrics
pls<-postResample(pred=pls_finger_predict, obs=perm_test)
pls
##       RMSE   Rsquared        MAE 
## 12.5447910  0.3434813  8.1748256

The test set estimate of R-squared is 0.41013. This value is lower than the training set R-squared value of 0.5242.

  1. Try building other models discussed in this chapter. Do any have better predictive performance?
set.seed(246)

# Find custom set of penalties
ridge_grid <- data.frame(.lambda = seq(0, .1, length = 15))

# Fit ridge regression model
ridge_model <- train(x=finger_train,
                  y=perm_train,
                  method='ridge',
                  tuneGrid=ridge_grid,#Fit the model over many penalty values
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')# put the predictors on the same scale
                  )
## Warning: model fit failed for Fold01: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold03: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold05: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
ridge_model
## Ridge Regression 
## 
## 115 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 104, 103, 103, 103, 103, 104, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared   MAE       
##   0.000000000   55.21114  0.4752174   44.911261
##   0.007142857  928.06887  0.4488383  679.486669
##   0.014285714   13.40101  0.4635154    9.915916
##   0.021428571   12.87662  0.4757953    9.587425
##   0.028571429   12.55192  0.4905130    9.429138
##   0.035714286   12.25474  0.5007713    9.207660
##   0.042857143   12.07771  0.5078405    9.096379
##   0.050000000   11.94192  0.5135906    9.006202
##   0.057142857   11.84380  0.5177573    8.941061
##   0.064285714   11.72812  0.5242420    8.868106
##   0.071428571   11.64596  0.5289827    8.796775
##   0.078571429   11.60338  0.5314857    8.771784
##   0.085714286   11.59480  0.5324738    8.789822
##   0.092857143   11.55570  0.5352570    8.767197
##   0.100000000   11.53473  0.5375239    8.758483
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

The optimal ridge regression model was selected based on the smallest RMSE value of 11.535, with a final lambda value of 0.1. The corresponding R-squred value at this point is 0.54.

Predict the response for the test set by ridge model:

# Generate predictions using the Ridge model on test data
ridge_finger_predict <- predict(ridge_model,newdata=finger_test)

# Evaluate model performance with resampling metrics
ridge<-postResample(pred=ridge_finger_predict, obs=perm_test)
ridge
##       RMSE   Rsquared        MAE 
## 12.3120682  0.3923648  9.5858249
set.seed(246)

# Grid search for optimized penalty lambdas
enet_grid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))

# Fit penalized regression model
enet_model <- train(finger_train, perm_train,
method = "enet",
tuneGrid = enet_grid,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale"))
## Warning: model fit failed for Fold01: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold03: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold05: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
enet_model
## Elasticnet 
## 
## 115 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 104, 103, 103, 103, 103, 104, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE        Rsquared   MAE       
##   0.00    0.05       14.151111  0.6312289   10.421341
##   0.00    0.10       15.943787  0.6279431   11.513413
##   0.00    0.15       17.769514  0.6228903   12.685448
##   0.00    0.20       19.262009  0.6351900   13.708168
##   0.00    0.25       20.873851  0.6394342   15.147687
##   0.00    0.30       22.708174  0.6383821   16.801769
##   0.00    0.35       24.738976  0.6267321   18.817938
##   0.00    0.40       26.816843  0.6124962   20.856474
##   0.00    0.45       28.967773  0.5996391   22.886737
##   0.00    0.50       31.185890  0.5833154   24.905094
##   0.00    0.55       33.515329  0.5678710   26.862720
##   0.00    0.60       35.805584  0.5557674   28.764330
##   0.00    0.65       38.087091  0.5453041   30.673345
##   0.00    0.70       40.518435  0.5331935   32.782948
##   0.00    0.75       43.061828  0.5208817   35.003538
##   0.00    0.80       45.559322  0.5130966   37.093992
##   0.00    0.85       47.978368  0.5070999   39.034944
##   0.00    0.90       50.626297  0.4882228   41.254331
##   0.00    0.95       52.918592  0.4813814   43.077956
##   0.00    1.00       55.211142  0.4752174   44.911261
##   0.01    0.05       29.378745  0.4565886   21.518862
##   0.01    0.10       47.614456  0.4929702   35.104099
##   0.01    0.15       66.416995  0.5041494   49.788016
##   0.01    0.20       87.060071  0.5140643   65.123568
##   0.01    0.25      106.145482  0.5233051   78.704054
##   0.01    0.30      124.055199  0.5122281   90.790054
##   0.01    0.35      142.433777  0.5026247  103.011489
##   0.01    0.40      161.095550  0.4910332  115.336788
##   0.01    0.45      179.997372  0.4791362  127.659326
##   0.01    0.50      199.239356  0.4679310  140.059035
##   0.01    0.55      218.706830  0.4553725  152.481449
##   0.01    0.60      238.284789  0.4399234  167.215278
##   0.01    0.65      257.900953  0.4268302  182.097605
##   0.01    0.70      277.589834  0.4157174  196.961145
##   0.01    0.75      297.259016  0.4071169  211.756895
##   0.01    0.80      316.917593  0.3998459  226.531946
##   0.01    0.85      336.609509  0.3921753  241.321548
##   0.01    0.90      357.036416  0.3850288  256.659349
##   0.01    0.95      378.064341  0.3786432  272.412310
##   0.01    1.00      399.458742  0.3738213  288.393964
##   0.10    0.05       12.155275  0.5318898    9.228464
##   0.10    0.10       11.540557  0.5397695    8.158448
##   0.10    0.15       11.266053  0.5646036    7.809112
##   0.10    0.20       10.722894  0.5866221    7.490231
##   0.10    0.25       10.404309  0.5940981    7.483926
##   0.10    0.30       10.066515  0.6042563    7.385388
##   0.10    0.35        9.849010  0.6114751    7.332344
##   0.10    0.40        9.780107  0.6126701    7.364652
##   0.10    0.45        9.803324  0.6122784    7.425639
##   0.10    0.50        9.879791  0.6073968    7.502528
##   0.10    0.55       10.043954  0.5968707    7.623457
##   0.10    0.60       10.251238  0.5865432    7.774031
##   0.10    0.65       10.429900  0.5782867    7.928979
##   0.10    0.70       10.597833  0.5713734    8.065470
##   0.10    0.75       10.782808  0.5640196    8.211557
##   0.10    0.80       10.946394  0.5578514    8.329209
##   0.10    0.85       11.097780  0.5523454    8.439981
##   0.10    0.90       11.239763  0.5474603    8.536174
##   0.10    0.95       11.386675  0.5425306    8.646261
##   0.10    1.00       11.534731  0.5375239    8.758483
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.1.

The elastic net used a penalty of lambda 0.1 and a fraction of 0.40, which led to the minimal RMSE value of 9.780 and an R-squared of 0.613.

In terms of training set performance, the elastic net shows the best predictive performance with the lowest RMSE of 9.780 and the highest R-squared value of 0.613. The partial least squares (PLS) model follows with an R-squared of 0.524 and an RMSE of 11.270. The ridge regression model has an RMSE of 11.535 and a corresponding R-squared value of 0.540. Hence, the elastic net is the most effective model based on the training data results.

Predict the response for the test set by elastic net model:

# Generate predictions using the Ridge model on test data
enet_finger_predict <- predict(enet_model,newdata=finger_test)

# Evaluate model performance with resampling metrics
enet<-postResample(pred=enet_finger_predict, obs=perm_test)
enet
##       RMSE   Rsquared        MAE 
## 12.0675531  0.3969818  9.1975157
test_evaluation <-rbind(pls,ridge,enet)
knitr::kable(test_evaluation)
RMSE Rsquared MAE
pls 12.54479 0.3434813 8.174826
ridge 12.31207 0.3923648 9.585825
enet 12.06755 0.3969818 9.197516

Again, among the models evaluated on the test data, the elastic net exhibits the best predictive performance with the lowest root mean squared error (RMSE) of 12.07 and the highest R-squared value of 0.397. This indicates that the elastic net model explains a greater proportion of the variance in the unseen data compared to the partial least squares (PLS) model, which has an RMSE of 12.545 and an R-squared of 0.344, and the ridge regression model, which has an RMSE of 12.312 and an R-squared of 0.392. Therefore, the elastic net is the most effective model for predicting outcomes on the test dataset.

  1. Would you recommend any of your models to replace the permeability laboratory experiment?

Given the predictive performance results, I will not recommend to fully replace the permeability laboratory experiment with the models at this time. The models exhibit relatively low R-squared values, particularly the PLS and Ridge regression models show that they explain only a limited portion of the variability in the data. Though, the Elastic Net model shows slightly better performance with an R-squared of 0.397, but it still falls short of providing a strong predictive capability. Therefore, I think the laboratory experiments remain crucial for accurate permeability measurements in this case. Additional model refinement or additional data may be necessary here to improve predictive accuracy.

Problem-6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data:
data("ChemicalManufacturingProcess")

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

dim(ChemicalManufacturingProcess)
## [1] 176  58
  1. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
head(ChemicalManufacturingProcess)
##   Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00                 6.25                49.58                56.97
## 2 42.44                 8.01                60.97                67.48
## 3 42.03                 8.01                60.97                67.48
## 4 41.42                 8.01                60.97                67.48
## 5 42.49                 7.47                63.33                72.25
## 6 43.57                 6.12                58.36                65.31
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1                12.74                19.51                43.73
## 2                14.65                19.36                53.14
## 3                14.65                19.36                53.14
## 4                14.65                19.36                53.14
## 5                14.02                17.91                54.66
## 6                15.17                21.79                51.23
##   BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1                  100                16.66                11.44
## 2                  100                19.04                12.55
## 3                  100                19.04                12.55
## 4                  100                19.04                12.55
## 5                  100                18.22                12.80
## 6                  100                18.30                12.13
##   BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1                 3.46               138.09                18.83
## 2                 3.46               153.67                21.05
## 3                 3.46               153.67                21.05
## 4                 3.46               153.67                21.05
## 5                 3.05               147.61                21.05
## 6                 3.78               151.88                20.76
##   ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1                     NA                     NA                     NA
## 2                    0.0                      0                     NA
## 3                    0.0                      0                     NA
## 4                    0.0                      0                     NA
## 5                   10.7                      0                     NA
## 6                   12.0                      0                     NA
##   ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1                     NA                     NA                     NA
## 2                    917                 1032.2                  210.0
## 3                    912                 1003.6                  207.1
## 4                    911                 1014.6                  213.3
## 5                    918                 1027.5                  205.7
## 6                    924                 1016.8                  208.9
##   ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1                     NA                     NA                  43.00
## 2                    177                    178                  46.57
## 3                    178                    178                  45.07
## 4                    177                    177                  44.92
## 5                    178                    178                  44.96
## 6                    178                    178                  45.32
##   ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1                     NA                     NA                     NA
## 2                     NA                     NA                      0
## 3                     NA                     NA                      0
## 4                     NA                     NA                      0
## 5                     NA                     NA                      0
## 6                     NA                     NA                      0
##   ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1                   35.5                   4898                   6108
## 2                   34.0                   4869                   6095
## 3                   34.8                   4878                   6087
## 4                   34.8                   4897                   6102
## 5                   34.6                   4992                   6233
## 6                   34.0                   4985                   6222
##   ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1                   4682                   35.5                   4865
## 2                   4617                   34.0                   4867
## 3                   4617                   34.8                   4877
## 4                   4635                   34.8                   4872
## 5                   4733                   33.9                   4886
## 6                   4786                   33.4                   4862
##   ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1                   6049                   4665                    0.0
## 2                   6097                   4621                    0.0
## 3                   6078                   4621                    0.0
## 4                   6073                   4611                    0.0
## 5                   6102                   4659                   -0.7
## 6                   6115                   4696                   -0.6
##   ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1                     NA                     NA                     NA
## 2                      3                      0                      3
## 3                      4                      1                      4
## 4                      5                      2                      5
## 5                      8                      4                     18
## 6                      9                      1                      1
##   ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1                   4873                   6074                   4685
## 2                   4869                   6107                   4630
## 3                   4897                   6116                   4637
## 4                   4892                   6111                   4630
## 5                   4930                   6151                   4684
## 6                   4871                   6128                   4687
##   ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1                   10.7                   21.0                    9.9
## 2                   11.2                   21.4                    9.9
## 3                   11.1                   21.3                    9.4
## 4                   11.1                   21.3                    9.4
## 5                   11.3                   21.6                    9.0
## 6                   11.4                   21.7                   10.1
##   ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1                   69.1                    156                     66
## 2                   68.7                    169                     66
## 3                   69.3                    173                     66
## 4                   69.3                    171                     68
## 5                   69.4                    171                     70
## 6                   68.2                    173                     70
##   ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1                    2.4                    486                  0.019
## 2                    2.6                    508                  0.019
## 3                    2.6                    509                  0.018
## 4                    2.5                    496                  0.018
## 5                    2.5                    468                  0.017
## 6                    2.5                    490                  0.018
##   ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1                    0.5                      3                    7.2
## 2                    2.0                      2                    7.2
## 3                    0.7                      2                    7.2
## 4                    1.2                      2                    7.2
## 5                    0.2                      2                    7.3
## 6                    0.4                      2                    7.2
##   ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1                     NA                     NA                   11.6
## 2                    0.1                   0.15                   11.1
## 3                    0.0                   0.00                   12.0
## 4                    0.0                   0.00                   10.6
## 5                    0.0                   0.00                   11.0
## 6                    0.0                   0.00                   11.5
##   ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1                    3.0                    1.8                    2.4
## 2                    0.9                    1.9                    2.2
## 3                    1.0                    1.8                    2.3
## 4                    1.1                    1.8                    2.1
## 5                    1.1                    1.7                    2.1
## 6                    2.2                    1.8                    2.0
# Check for missing values in each column
missing_values <- colSums(is.na(ChemicalManufacturingProcess))
print(missing_values)
##                  Yield   BiologicalMaterial01   BiologicalMaterial02 
##                      0                      0                      0 
##   BiologicalMaterial03   BiologicalMaterial04   BiologicalMaterial05 
##                      0                      0                      0 
##   BiologicalMaterial06   BiologicalMaterial07   BiologicalMaterial08 
##                      0                      0                      0 
##   BiologicalMaterial09   BiologicalMaterial10   BiologicalMaterial11 
##                      0                      0                      0 
##   BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02 
##                      0                      1                      3 
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 
##                     15                      1                      1 
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 
##                      2                      1                      1 
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11 
##                      0                      9                     10 
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14 
##                      1                      0                      1 
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17 
##                      0                      0                      0 
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20 
##                      0                      0                      0 
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23 
##                      0                      1                      1 
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 
##                      1                      5                      5 
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 
##                      5                      5                      5 
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32 
##                      5                      5                      0 
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 
##                      5                      5                      5 
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38 
##                      5                      0                      0 
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41 
##                      0                      1                      1 
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44 
##                      0                      0                      0 
## ManufacturingProcess45 
##                      0
# Apply KNN imputation
knn_impute_chemical <- preProcess(ChemicalManufacturingProcess, method=c('knnImpute'))

# Imputed dataset
imputed_chemical_df <- predict(knn_impute_chemical, ChemicalManufacturingProcess)

# Calculate total number of missing values after imputation
total_missing <- sum(is.na(imputed_chemical_df))
print(total_missing)
## [1] 0
  1. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Split the data into train test set:

dim(imputed_chemical_df)
## [1] 176  58
set.seed(100)
train_chemical <-createDataPartition(imputed_chemical_df$Yield, times = 1, p = .70, list = FALSE)

train_chemical_x <- imputed_chemical_df[train_chemical, ][, -c(1)] 
test_chemical_x <- imputed_chemical_df[-train_chemical, ][, -c(1)] 
train_chemical_y<- imputed_chemical_df[train_chemical, ]$Yield
test_chemical_y <- imputed_chemical_df[-train_chemical, ]$Yield

To get a bias variance balance I will apply a ridge regression here:

set.seed(135)

ridgegrid <- data.frame(.lambda = seq(0,0.1,length=15))

ridge_chemical <- train(x=train_chemical_x,y=train_chemical_y,
                       method='ridge',
                       tuneGrid=ridgegrid,
                       trControl=trainControl(method='cv'),
                       preProc = c('center','scale')
                       )

ridge_chemical
## Ridge Regression 
## 
## 124 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 111, 112, 111, 112, 112, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE      Rsquared   MAE      
##   0.000000000  3.274546  0.3894348  1.3886320
##   0.007142857  1.952203  0.5198298  0.9212209
##   0.014285714  1.988899  0.5343489  0.9244613
##   0.021428571  1.951873  0.5426475  0.9136532
##   0.028571429  1.911371  0.5483202  0.9020967
##   0.035714286  1.875157  0.5525935  0.8923681
##   0.042857143  1.843596  0.5559911  0.8845154
##   0.050000000  1.815971  0.5587837  0.8778078
##   0.057142857  1.791533  0.5611311  0.8717044
##   0.064285714  1.769673  0.5631361  0.8662582
##   0.071428571  1.749921  0.5648698  0.8615315
##   0.078571429  1.731921  0.5663834  0.8573224
##   0.085714286  1.715396  0.5677153  0.8534072
##   0.092857143  1.700131  0.5688950  0.8497503
##   0.100000000  1.685958  0.5699458  0.8464050
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

The optimal value of the performance metric is found for lambda 0.1 for the ridge regression, which gave the smallest RMSE value of 1.69 and the corresponding R-squared value is 0.57.

  1. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

Make prediction on test set:

predict_chemical <- predict(ridge_chemical, newdata=test_chemical_x)
postResample(pred=predict_chemical, obs=test_chemical_y)
##      RMSE  Rsquared       MAE 
## 1.5895179 0.1800329 0.8326279

For ridge regression, the test set RMSE is 1.59 and the corresponding R-squared value is 0.18. The RMSE value is slightly lower than the training set RMSE of 1.69, indicating a bit improvement in error. However, the R-squared on the test set drops significantly to 0.18 from 0.57 on the training set, showing a sharp decline in explanatory power. This suggests that though the model achieves a similar error rate on new data, it struggles to generalize and capture the underlying patterns effectively. Probably, this is happened due to overfitting to the training data.

  1. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
plot(varImp(ridge_chemical, scale = FALSE), top=30, scales = list(y = list(cex = 0.8)))

The manufacturing processes seem to be the most influential predictors. Among them, manufacturingprocess32 and manufacturingprocess13 stand out as especially important as these two predictors show significantly higher importance compared to the others available.

  1. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
# Calculate correlation matrix
correlation <- cor(dplyr::select(imputed_chemical_df, ManufacturingProcess32, ManufacturingProcess13, BiologicalMaterial06, ManufacturingProcess17, ManufacturingProcess31, Yield))
correlation
##                        ManufacturingProcess32 ManufacturingProcess13
## ManufacturingProcess32            1.000000000            -0.10120679
## ManufacturingProcess13           -0.101206789             1.00000000
## BiologicalMaterial06              0.600595801            -0.12186756
## ManufacturingProcess17            0.016041778             0.78241345
## ManufacturingProcess31           -0.009650671             0.07650659
## Yield                             0.608332150            -0.50367972
##                        BiologicalMaterial06 ManufacturingProcess17
## ManufacturingProcess32          0.600595801            0.016041778
## ManufacturingProcess13         -0.121867557            0.782413453
## BiologicalMaterial06            1.000000000            0.006004003
## ManufacturingProcess17          0.006004003            1.000000000
## ManufacturingProcess31         -0.048798568            0.038426259
## Yield                           0.478163422           -0.425806872
##                        ManufacturingProcess31       Yield
## ManufacturingProcess32           -0.009650671  0.60833215
## ManufacturingProcess13            0.076506590 -0.50367972
## BiologicalMaterial06             -0.048798568  0.47816342
## ManufacturingProcess17            0.038426259 -0.42580687
## ManufacturingProcess31            1.000000000 -0.07085698
## Yield                            -0.070856979  1.00000000
# Visualize correlation matrix
corrplot::corrplot(correlation, method = 'square', type = "upper")

The correlations of some of the most important predictors with the response variable have been determined. It is seen that ManufacturingProcess32 is a strong positive influencer of yield, while BiologicalMaterial06 has moderate positive effect on yield. ManufacturingProcess13 and ManufacturingProcess17 have moderate negative effects on yield. ManufacturingProcess31 appears to have a negligible impact. These correlations can help us to take decisions on which processes or materials to prioritize or avoid to optimize yield in the manufacturing process.