Homework 7

Author: Naomi Buell

library(tidyverse)
library(AppliedPredictiveModeling)
library(skimr)
library(caret)
library(GGally)

set.seed(813098)

6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  (a) Start R and use these commands to load the data:
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

  (b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
nzv <- nearZeroVar(fingerprints)
filtered_fingerprints <- fingerprints[, -nzv]
n_predictors_left <- ncol(filtered_fingerprints)
n_predictors_left
[1] 388

There are 388 predictors left for modeling after filtering out the near-zero variance predictors.
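
By default, nearZeroVar flags a predictor when its most common value greatly outnumbers the second most common and it has few unique values. A minimal sketch of how to inspect those diagnostics, using the same fingerprints matrix and the default freqCut and uniqueCut thresholds:

# Return per-predictor diagnostics instead of just the column indices:
# the frequency ratio, percent of unique values, and the zeroVar/nzv flags.
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])

# Sanity check: the number of flagged predictors matches the indices dropped above.
sum(nzv_metrics$nzv) == length(nzv)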

  (c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R²?
# Split the data into a training and a test set
trainingRows <- createDataPartition(
    permeability,
    p = .80,
    list = FALSE
)
head(trainingRows)
     Resample1
[1,]         1
[2,]         4
[3,]         5
[4,]         6
[5,]         7
[6,]         8
# Subset the data into objects for training using integer subsetting.
trainPredictors <- filtered_fingerprints[trainingRows, ]
trainClasses <- permeability[trainingRows]
# Do the same for the test set using negative integers.
testPredictors <- filtered_fingerprints[-trainingRows, ]
testClasses <- permeability[-trainingRows]

# Pre-process the data and tune a PLS model
ctrl <- trainControl(method = "cv", number = 10)

plsTune <- train(
    trainPredictors,
    trainClasses,
    method = "pls",
    tuneLength = 20,
    trControl = ctrl,
    preProc = c("center", "scale")
)
plsTune
Partial Least Squares 

133 samples
388 predictors

Pre-processing: centered (388), scaled (388) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 119, 118, 120, 121, 120, 120, ... 
Resampling results across tuning parameters:

  ncomp  RMSE      Rsquared   MAE      
   1     13.04495  0.3889410   9.855879
   2     12.07025  0.4167704   8.578359
   3     11.76663  0.4344682   8.818290
   4     11.82458  0.4049298   9.049706
   5     11.50111  0.4259716   8.743398
   6     11.81385  0.4098471   9.017622
   7     11.72507  0.4174966   9.009777
   8     11.79303  0.4172816   8.951220
   9     11.83927  0.4234698   8.873060
  10     11.96716  0.4186799   8.918212
  11     12.34722  0.4011802   9.121489
  12     12.33673  0.4030007   9.230783
  13     12.58661  0.3882106   9.417677
  14     12.80737  0.3829754   9.531202
  15     12.86936  0.3874828   9.690045
  16     13.22190  0.3753557   9.977137
  17     13.34250  0.3748184  10.015780
  18     13.53016  0.3825441  10.148752
  19     13.61111  0.3886850  10.214828
  20     13.65062  0.3864161  10.207473

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 5.
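
The resampling profile can also be inspected visually; caret provides a ggplot method for train objects, so a quick sketch plots the cross-validated RMSE against the number of components:

# Plot cross-validated RMSE as a function of ncomp; the curve bottoms out at 5.
ggplot(plsTune)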
  (d) Predict the response for the test set. What is the test set estimate of R²?
# Predict the response for the test set
predictions <- predict(plsTune, newdata = testPredictors)

# Get test set estimate of R^2
plsValues <- data.frame(obs = testClasses, pred = predictions)
defaultSummaryValues <- defaultSummary(plsValues)
Rsquared <- defaultSummaryValues["Rsquared"]

The test set estimate of R² is 0.5815661.
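
A plot of observed versus predicted values complements the single R² number; a minimal sketch using the plsValues data frame built above:

# Observed vs. predicted permeability on the test set; points close to the
# dashed 45-degree line indicate accurate predictions.
ggplot(plsValues, aes(x = pred, y = obs)) +
    geom_point() +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
    labs(x = "Predicted permeability", y = "Observed permeability")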

  (e) Try building other models discussed in this chapter. Do any have better predictive performance?
# Ordinary Linear Regression model
lmFit <- train(
    x = trainPredictors,
    y = trainClasses,
    method = "lm",
    trControl = ctrl
)
lmFit
Linear Regression 

133 samples
388 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 119, 119, 120, 121, 118, 120, ... 
Resampling results:

  RMSE     Rsquared   MAE     
  42.3375  0.2665748  27.43291

Tuning parameter 'intercept' was held constant at a value of TRUE
lm_predictions <- predict(lmFit, newdata = testPredictors)
lm_values <- data.frame(obs = testClasses, pred = lm_predictions)
lm_defaultSummaryValues <- defaultSummary(lm_values)
lm_Rsquared <- lm_defaultSummaryValues["Rsquared"]

# Robust linear regression model
rlmPCA <- train(
    trainPredictors,
    trainClasses,
    method = "rlm",
    preProcess = "pca",
    trControl = ctrl
)
rlmPCA
Robust Linear Model 

133 samples
388 predictors

Pre-processing: principal component signal extraction (388), centered
 (388), scaled (388) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 119, 121, 121, 118, 121, 119, ... 
Resampling results across tuning parameters:

  intercept  psi           RMSE      Rsquared   MAE      
  FALSE      psi.huber     17.20786  0.4532193  13.683070
  FALSE      psi.hampel    17.20015  0.4663720  13.885365
  FALSE      psi.bisquare  17.15836  0.4501079  13.611705
   TRUE      psi.huber     11.85322  0.5044184   8.740722
   TRUE      psi.hampel    12.33251  0.4653272   9.248913
   TRUE      psi.bisquare  12.60861  0.5347240   8.304136

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were intercept = TRUE and psi = psi.huber.
rlm_predictions <- predict(rlmPCA, newdata = testPredictors)
rlm_values <- data.frame(obs = testClasses, pred = rlm_predictions)
rlm_defaultSummaryValues <- defaultSummary(rlm_values)
rlm_Rsquared <- rlm_defaultSummaryValues["Rsquared"]

# Elastic net model
enetGrid <- expand.grid(
    .lambda = c(0, 0.01, .1),
    .fraction = seq(.05, 1, length = 20)
)
enetTune <- train(
    trainPredictors,
    trainClasses,
    method = "enet",
    tuneGrid = enetGrid,
    trControl = ctrl,
    preProc = c("center", "scale")
)
enetTune
Elasticnet 

133 samples
388 predictors

Pre-processing: centered (388), scaled (388) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 120, 120, 120, 119, 120, 120, ... 
Resampling results across tuning parameters:

  lambda  fraction  RMSE       Rsquared   MAE       
  0.00    0.05       12.26290  0.5122591    9.240691
  0.00    0.10       12.04153  0.5026511    8.860188
  0.00    0.15       11.89661  0.5102722    8.687994
  0.00    0.20       11.74986  0.5202451    8.574839
  0.00    0.25       11.87210  0.5118214    8.665897
  0.00    0.30       11.97120  0.5119878    8.862492
  0.00    0.35       12.22086  0.5016257    9.094315
  0.00    0.40       12.53193  0.4896627    9.265691
  0.00    0.45       12.77545  0.4798137    9.452927
  0.00    0.50       13.01862  0.4712435    9.597859
  0.00    0.55       13.32409  0.4625014    9.827843
  0.00    0.60       13.62263  0.4568342   10.057174
  0.00    0.65       14.07472  0.4458742   10.334246
  0.00    0.70       14.59850  0.4326191   10.613620
  0.00    0.75       14.91685  0.4230139   10.780312
  0.00    0.80       15.19519  0.4145982   10.936113
  0.00    0.85       15.45128  0.4077523   11.048518
  0.00    0.90       15.73520  0.4023391   11.231894
  0.00    0.95       15.96312  0.4009913   11.330718
  0.00    1.00       16.18310  0.3998642   11.426169
  0.01    0.05       46.69467  0.4231756   29.412260
  0.01    0.10       79.96583  0.4187060   49.968999
  0.01    0.15      113.31488  0.4240885   70.378471
  0.01    0.20      145.27725  0.4157101   90.274706
  0.01    0.25      177.47351  0.4053567  110.837883
  0.01    0.30      210.07390  0.3965803  131.557844
  0.01    0.35      242.90840  0.3851313  152.421193
  0.01    0.40      275.92797  0.3749525  173.360531
  0.01    0.45      308.96992  0.3702313  194.258573
  0.01    0.50      342.05749  0.3706891  215.181464
  0.01    0.55      375.61419  0.3713373  236.419309
  0.01    0.60      409.24263  0.3712460  257.661287
  0.01    0.65      442.97486  0.3664618  279.082377
  0.01    0.70      476.78166  0.3604166  300.572167
  0.01    0.75      510.64767  0.3545795  322.060905
  0.01    0.80      544.56064  0.3482289  343.570192
  0.01    0.85      576.88024  0.3431127  364.598634
  0.01    0.90      608.83972  0.3372916  385.518813
  0.01    0.95      640.78670  0.3328813  406.396622
  0.01    1.00      672.70170  0.3284291  427.229363
  0.10    0.05       12.32161  0.4789667    9.235035
  0.10    0.10       11.84629  0.4930441    8.596401
  0.10    0.15       11.84183  0.5007407    8.631976
  0.10    0.20       11.71222  0.5114511    8.618887
  0.10    0.25       11.64390  0.5146486    8.626041
  0.10    0.30       11.64322  0.5136992    8.618967
  0.10    0.35       11.59209  0.5168968    8.542478
  0.10    0.40       11.61510  0.5170314    8.548679
  0.10    0.45       11.63622  0.5186747    8.584014
  0.10    0.50       11.70063  0.5168669    8.666544
  0.10    0.55       11.75876  0.5148511    8.743543
  0.10    0.60       11.82424  0.5128873    8.841233
  0.10    0.65       11.89822  0.5105326    8.936577
  0.10    0.70       11.95469  0.5104441    9.009389
  0.10    0.75       12.01456  0.5102456    9.072426
  0.10    0.80       12.07212  0.5098505    9.131490
  0.10    0.85       12.13026  0.5095282    9.190574
  0.10    0.90       12.19561  0.5082315    9.255955
  0.10    0.95       12.26695  0.5065606    9.316829
  0.10    1.00       12.34175  0.5048552    9.378287

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.35 and lambda = 0.1.
enet_predictions <- predict(enetTune, newdata = testPredictors)
enet_values <- data.frame(obs = testClasses, pred = enet_predictions)
enet_defaultSummaryValues <- defaultSummary(enet_values)
enet_Rsquared <- enet_defaultSummaryValues["Rsquared"]

# Create a data frame to compare R-squared values
model_comparison <- data.frame(
    model = c("PLS", "Linear", "Robust Linear", "Elastic Net"),
    r_squared = c(
        Rsquared,
        lm_Rsquared,
        rlm_Rsquared,
        enet_Rsquared
    )
) |>
    arrange(desc(r_squared))
model_comparison
          model  r_squared
1           PLS 0.58156609
2   Elastic Net 0.56599643
3 Robust Linear 0.48831242
4        Linear 0.04017562

The model with the highest test-set R² was PLS, at 0.5815661. The ordinary linear model performs far worse than the others, as expected when an unregularized regression is fit to 388 predictors with only 133 training samples.
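
Since all four models were tuned with 10-fold cross-validation, their resampling distributions can also be compared with caret's resamples() function. A minimal sketch (note that the folds were not fixed with a shared index here, so the comparison is only approximate):

# Collect the resampling results from the four fitted models and summarize them.
resamps <- resamples(list(
    PLS = plsTune,
    Linear = lmFit,
    RobustLinear = rlmPCA,
    ElasticNet = enetTune
))
summary(resamps)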

  (f) Would you recommend any of your models to replace the permeability laboratory experiment?
# PLS
defaultSummaryValues
     RMSE  Rsquared       MAE 
9.9102548 0.5815661 6.3187502 
# LM
lm_defaultSummaryValues
       RMSE    Rsquared         MAE 
28.42499691  0.04017562 15.47919899 
# RLM
rlm_defaultSummaryValues
      RMSE   Rsquared        MAE 
11.5870686  0.4883124  7.4740797 
# Elastic net
enet_defaultSummaryValues
      RMSE   Rsquared        MAE 
10.0095069  0.5659964  7.1854308 

No, I would not recommend any of my other models (LM, RLM, or elastic net) over the PLS model to replace the permeability laboratory experiment, since PLS has the lowest RMSE, highest R², and lowest MAE of the four.
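
One rough way to put the test-set error in context is to compare the PLS RMSE to the spread of the observed permeability values; a ratio near 1 would mean the model barely improves on predicting the mean. A minimal sketch using the objects above:

# Ratio of PLS test RMSE to the standard deviation of the observed response;
# values well below 1 indicate genuine predictive signal.
defaultSummaryValues["RMSE"] / sd(testClasses)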

6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  (a) Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)
set.seed(813098)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

  (b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
# First, explore the data to locate the missing values
ChemicalManufacturingProcess |> skim()
Data summary
Name ChemicalManufacturingProc…
Number of rows 176
Number of columns 58
_______________________
Column type frequency:
numeric 58
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Yield 0 1.00 40.18 1.85 35.25 38.75 39.97 41.48 46.34 ▁▇▇▃▁
BiologicalMaterial01 0 1.00 6.41 0.71 4.58 5.98 6.30 6.87 8.81 ▂▇▇▂▁
BiologicalMaterial02 0 1.00 55.69 4.03 46.87 52.68 55.09 58.74 64.75 ▂▇▆▅▃
BiologicalMaterial03 0 1.00 67.70 4.00 56.97 64.98 67.22 70.43 78.25 ▂▅▇▆▁
BiologicalMaterial04 0 1.00 12.35 1.77 9.38 11.25 12.10 13.22 23.09 ▇▆▁▁▁
BiologicalMaterial05 0 1.00 18.60 1.84 13.24 17.24 18.49 19.90 24.85 ▁▅▇▃▁
BiologicalMaterial06 0 1.00 48.91 3.75 40.60 46.06 48.46 51.34 59.38 ▂▇▆▅▁
BiologicalMaterial07 0 1.00 100.01 0.11 100.00 100.00 100.00 100.00 100.83 ▇▁▁▁▁
BiologicalMaterial08 0 1.00 17.49 0.68 15.88 17.06 17.51 17.88 19.14 ▁▅▇▃▂
BiologicalMaterial09 0 1.00 12.85 0.42 11.44 12.60 12.84 13.13 14.08 ▁▃▇▅▁
BiologicalMaterial10 0 1.00 2.80 0.60 1.77 2.46 2.71 2.99 6.87 ▇▅▁▁▁
BiologicalMaterial11 0 1.00 146.95 4.82 135.81 143.82 146.08 149.60 158.73 ▂▆▇▃▂
BiologicalMaterial12 0 1.00 20.20 0.77 18.35 19.73 20.12 20.75 22.21 ▂▆▇▃▂
ManufacturingProcess01 1 0.99 11.21 1.82 0.00 10.80 11.40 12.15 14.10 ▁▁▁▅▇
ManufacturingProcess02 3 0.98 16.68 8.47 0.00 19.30 21.00 21.50 22.50 ▂▁▁▁▇
ManufacturingProcess03 15 0.91 1.54 0.02 1.47 1.53 1.54 1.55 1.60 ▁▃▆▇▁
ManufacturingProcess04 1 0.99 931.85 6.27 911.00 928.00 934.00 936.00 946.00 ▁▂▃▇▁
ManufacturingProcess05 1 0.99 1001.69 30.53 923.00 986.75 999.20 1008.85 1175.30 ▁▇▁▁▁
ManufacturingProcess06 2 0.99 207.40 2.70 203.00 205.70 206.80 208.70 227.40 ▇▃▁▁▁
ManufacturingProcess07 1 0.99 177.48 0.50 177.00 177.00 177.00 178.00 178.00 ▇▁▁▁▇
ManufacturingProcess08 1 0.99 177.55 0.50 177.00 177.00 178.00 178.00 178.00 ▆▁▁▁▇
ManufacturingProcess09 0 1.00 45.66 1.55 38.89 44.89 45.73 46.52 49.36 ▁▁▅▇▂
ManufacturingProcess10 9 0.95 9.18 0.77 7.50 8.70 9.10 9.55 11.60 ▂▇▆▂▁
ManufacturingProcess11 10 0.94 9.39 0.72 7.50 9.00 9.40 9.90 11.50 ▂▆▇▅▁
ManufacturingProcess12 1 0.99 857.81 1784.53 0.00 0.00 0.00 0.00 4549.00 ▇▁▁▁▂
ManufacturingProcess13 0 1.00 34.51 1.02 32.10 33.90 34.60 35.20 38.60 ▃▇▇▁▁
ManufacturingProcess14 1 0.99 4853.87 54.52 4701.00 4828.00 4856.00 4882.50 5055.00 ▁▅▇▂▁
ManufacturingProcess15 0 1.00 6038.92 58.31 5904.00 6010.00 6031.50 6061.00 6233.00 ▂▇▆▂▁
ManufacturingProcess16 0 1.00 4565.80 351.70 0.00 4560.75 4588.00 4619.00 4852.00 ▁▁▁▁▇
ManufacturingProcess17 0 1.00 34.34 1.25 31.30 33.50 34.40 35.10 40.00 ▂▇▆▁▁
ManufacturingProcess18 0 1.00 4809.68 367.48 0.00 4813.00 4835.00 4862.00 4971.00 ▁▁▁▁▇
ManufacturingProcess19 0 1.00 6028.20 45.58 5890.00 6000.75 6022.00 6050.25 6146.00 ▁▃▇▃▂
ManufacturingProcess20 0 1.00 4556.46 349.01 0.00 4552.75 4582.00 4609.50 4759.00 ▁▁▁▁▇
ManufacturingProcess21 0 1.00 -0.16 0.78 -1.80 -0.60 -0.30 0.00 3.60 ▂▇▂▁▁
ManufacturingProcess22 1 0.99 5.41 3.33 0.00 3.00 5.00 8.00 12.00 ▇▇▇▅▅
ManufacturingProcess23 1 0.99 3.02 1.66 0.00 2.00 3.00 4.00 6.00 ▇▆▇▆▇
ManufacturingProcess24 1 0.99 8.83 5.80 0.00 4.00 8.00 14.00 23.00 ▇▇▅▆▁
ManufacturingProcess25 5 0.97 4828.18 373.48 0.00 4832.00 4855.00 4877.00 4990.00 ▁▁▁▁▇
ManufacturingProcess26 5 0.97 6015.60 464.87 0.00 6019.50 6047.00 6070.50 6161.00 ▁▁▁▁▇
ManufacturingProcess27 5 0.97 4562.51 353.98 0.00 4560.00 4587.00 4609.00 4710.00 ▁▁▁▁▇
ManufacturingProcess28 5 0.97 6.59 5.25 0.00 0.00 10.40 10.75 11.50 ▅▁▁▁▇
ManufacturingProcess29 5 0.97 20.01 1.66 0.00 19.70 19.90 20.40 22.00 ▁▁▁▁▇
ManufacturingProcess30 5 0.97 9.16 0.98 0.00 8.80 9.10 9.70 11.20 ▁▁▁▅▇
ManufacturingProcess31 5 0.97 70.18 5.56 0.00 70.10 70.80 71.40 72.50 ▁▁▁▁▇
ManufacturingProcess32 0 1.00 158.47 5.40 143.00 155.00 158.00 162.00 173.00 ▁▃▇▃▁
ManufacturingProcess33 5 0.97 63.54 2.48 56.00 62.00 64.00 65.00 70.00 ▁▃▇▅▁
ManufacturingProcess34 5 0.97 2.49 0.05 2.30 2.50 2.50 2.50 2.60 ▁▂▁▇▁
ManufacturingProcess35 5 0.97 495.60 10.82 463.00 490.00 495.00 501.50 522.00 ▁▂▇▅▂
ManufacturingProcess36 5 0.97 0.02 0.00 0.02 0.02 0.02 0.02 0.02 ▂▇▇▁▃
ManufacturingProcess37 0 1.00 1.01 0.45 0.00 0.70 1.00 1.30 2.30 ▂▇▇▃▁
ManufacturingProcess38 0 1.00 2.53 0.65 0.00 2.00 3.00 3.00 3.00 ▁▁▁▅▇
ManufacturingProcess39 0 1.00 6.85 1.51 0.00 7.10 7.20 7.30 7.50 ▁▁▁▁▇
ManufacturingProcess40 1 0.99 0.02 0.04 0.00 0.00 0.00 0.00 0.10 ▇▁▁▁▂
ManufacturingProcess41 1 0.99 0.02 0.05 0.00 0.00 0.00 0.00 0.20 ▇▁▁▁▁
ManufacturingProcess42 0 1.00 11.21 1.94 0.00 11.40 11.60 11.70 12.10 ▁▁▁▁▇
ManufacturingProcess43 0 1.00 0.91 0.87 0.00 0.60 0.80 1.02 11.00 ▇▁▁▁▁
ManufacturingProcess44 0 1.00 1.81 0.32 0.00 1.80 1.90 1.90 2.10 ▁▁▁▁▇
ManufacturingProcess45 0 1.00 2.14 0.41 0.00 2.10 2.20 2.30 2.60 ▁▁▁▂▇

The skim() output shows that the ChemicalManufacturingProcess data set has 176 rows and 58 columns, and that a number of the ManufacturingProcess predictors contain missing values.

impute <- preProcess(
    ChemicalManufacturingProcess,
    method = c("knnImpute")
)
impute
Created from 152 samples and 58 variables

Pre-processing:
  - centered (58)
  - ignored (0)
  - 5 nearest neighbor imputation (58)
  - scaled (58)
chemical_impute <- predict(
    impute,
    ChemicalManufacturingProcess
)

# Browse the imputed data to confirm completeness.
chemical_impute |> skim()
Data summary
Name chemical_impute
Number of rows 176
Number of columns 58
_______________________
Column type frequency:
numeric 58
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Yield 0 1 0.00 1.00 -2.67 -0.77 -0.11 0.70 3.34 ▁▇▇▃▁
BiologicalMaterial01 0 1 0.00 1.00 -2.57 -0.61 -0.15 0.64 3.36 ▂▇▇▂▁
BiologicalMaterial02 0 1 0.00 1.00 -2.19 -0.75 -0.15 0.76 2.25 ▂▇▆▅▃
BiologicalMaterial03 0 1 0.00 1.00 -2.68 -0.68 -0.12 0.68 2.64 ▂▅▇▆▁
BiologicalMaterial04 0 1 0.00 1.00 -1.67 -0.62 -0.14 0.49 6.05 ▇▆▁▁▁
BiologicalMaterial05 0 1 0.00 1.00 -2.91 -0.74 -0.06 0.71 3.39 ▁▅▇▃▁
BiologicalMaterial06 0 1 0.00 1.00 -2.22 -0.76 -0.12 0.65 2.79 ▂▇▆▅▁
BiologicalMaterial07 0 1 0.00 1.00 -0.13 -0.13 -0.13 -0.13 7.57 ▇▁▁▁▁
BiologicalMaterial08 0 1 0.00 1.00 -2.39 -0.64 0.02 0.57 2.43 ▁▅▇▃▂
BiologicalMaterial09 0 1 0.00 1.00 -3.40 -0.60 -0.04 0.67 2.96 ▁▃▇▅▁
BiologicalMaterial10 0 1 0.00 1.00 -1.72 -0.57 -0.15 0.32 6.79 ▇▅▁▁▁
BiologicalMaterial11 0 1 0.00 1.00 -2.31 -0.65 -0.18 0.55 2.44 ▂▆▇▃▂
BiologicalMaterial12 0 1 0.00 1.00 -2.39 -0.61 -0.10 0.71 2.60 ▂▆▇▃▂
ManufacturingProcess01 0 1 0.00 1.00 -6.15 -0.22 0.11 0.50 1.59 ▁▁▁▅▇
ManufacturingProcess02 0 1 0.01 0.99 -1.97 0.31 0.51 0.57 0.69 ▂▁▁▁▇
ManufacturingProcess03 0 1 0.04 0.97 -3.11 -0.43 0.38 0.47 2.70 ▁▂▆▇▁
ManufacturingProcess04 0 1 0.00 1.00 -3.32 -0.61 0.34 0.66 2.25 ▁▂▃▇▂
ManufacturingProcess05 0 1 0.00 1.00 -2.58 -0.49 -0.09 0.23 5.69 ▁▇▁▁▁
ManufacturingProcess06 0 1 -0.01 1.00 -1.63 -0.63 -0.22 0.48 7.41 ▇▃▁▁▁
ManufacturingProcess07 0 1 0.00 1.00 -0.96 -0.96 -0.96 1.04 1.04 ▇▁▁▁▇
ManufacturingProcess08 0 1 0.00 1.00 -1.11 -1.11 0.89 0.89 0.89 ▆▁▁▁▇
ManufacturingProcess09 0 1 0.00 1.00 -4.38 -0.50 0.05 0.55 2.39 ▁▁▅▇▂
ManufacturingProcess10 0 1 0.02 0.99 -2.19 -0.62 -0.10 0.55 3.16 ▂▇▇▂▁
ManufacturingProcess11 0 1 0.03 1.00 -2.63 -0.54 0.02 0.72 2.95 ▁▇▇▆▁
ManufacturingProcess12 0 1 0.00 1.00 -0.48 -0.48 -0.48 -0.48 2.07 ▇▁▁▁▂
ManufacturingProcess13 0 1 0.00 1.00 -2.37 -0.60 0.09 0.68 4.03 ▃▆▇▁▁
ManufacturingProcess14 0 1 0.00 1.00 -2.80 -0.49 0.03 0.52 3.69 ▁▅▇▂▁
ManufacturingProcess15 0 1 0.00 1.00 -2.31 -0.50 -0.13 0.38 3.33 ▂▇▆▂▁
ManufacturingProcess16 0 1 0.00 1.00 -12.98 -0.01 0.06 0.15 0.81 ▁▁▁▁▇
ManufacturingProcess17 0 1 0.00 1.00 -2.44 -0.68 0.05 0.61 4.53 ▂▇▆▁▁
ManufacturingProcess18 0 1 0.00 1.00 -13.09 0.01 0.07 0.14 0.44 ▁▁▁▁▇
ManufacturingProcess19 0 1 0.00 1.00 -3.03 -0.60 -0.14 0.48 2.58 ▁▃▇▃▂
ManufacturingProcess20 0 1 0.00 1.00 -13.06 -0.01 0.07 0.15 0.58 ▁▁▁▁▇
ManufacturingProcess21 0 1 0.00 1.00 -2.10 -0.56 -0.17 0.21 4.84 ▂▇▂▁▁
ManufacturingProcess22 0 1 0.00 1.00 -1.62 -0.72 -0.12 0.78 1.98 ▇▇▇▅▅
ManufacturingProcess23 0 1 0.00 1.00 -1.81 -0.61 -0.01 0.59 1.79 ▇▆▇▆▇
ManufacturingProcess24 0 1 0.01 1.00 -1.52 -0.83 -0.14 0.89 2.44 ▇▇▅▆▁
ManufacturingProcess25 0 1 0.00 0.99 -12.93 0.00 0.07 0.13 0.43 ▁▁▁▁▇
ManufacturingProcess26 0 1 0.00 0.99 -12.94 0.01 0.06 0.12 0.31 ▁▁▁▁▇
ManufacturingProcess27 0 1 0.00 0.99 -12.89 0.00 0.07 0.13 0.42 ▁▁▁▁▇
ManufacturingProcess28 0 1 -0.03 1.00 -1.26 -1.26 0.73 0.78 0.94 ▅▁▁▁▇
ManufacturingProcess29 0 1 0.00 0.99 -12.03 -0.19 -0.07 0.23 1.20 ▁▁▁▁▇
ManufacturingProcess30 0 1 0.01 0.99 -9.39 -0.37 0.04 0.55 2.09 ▁▁▁▅▇
ManufacturingProcess31 0 1 0.00 0.99 -12.63 -0.02 0.11 0.22 0.42 ▁▁▁▁▇
ManufacturingProcess32 0 1 0.00 1.00 -2.87 -0.64 -0.09 0.65 2.69 ▁▃▇▃▁
ManufacturingProcess33 0 1 -0.01 0.99 -3.04 -0.62 0.18 0.59 2.60 ▁▃▇▅▁
ManufacturingProcess34 0 1 -0.01 0.99 -3.56 0.12 0.12 0.12 1.96 ▁▂▁▇▁
ManufacturingProcess35 0 1 -0.02 1.00 -3.01 -0.52 -0.06 0.50 2.44 ▁▃▇▅▂
ManufacturingProcess36 0 1 -0.01 0.99 -2.94 -0.66 -0.08 0.49 2.78 ▂▇▁▇▃
ManufacturingProcess37 0 1 0.00 1.00 -2.28 -0.70 -0.03 0.64 2.89 ▂▇▇▃▁
ManufacturingProcess38 0 1 0.00 1.00 -3.90 -0.82 0.72 0.72 0.72 ▁▁▁▅▇
ManufacturingProcess39 0 1 0.00 1.00 -4.55 0.17 0.23 0.30 0.43 ▁▁▁▁▇
ManufacturingProcess40 0 1 0.00 1.00 -0.46 -0.46 -0.46 -0.46 2.15 ▇▁▁▁▂
ManufacturingProcess41 0 1 0.00 1.00 -0.44 -0.44 -0.44 -0.44 3.28 ▇▁▁▁▁
ManufacturingProcess42 0 1 0.00 1.00 -5.77 0.10 0.20 0.25 0.46 ▁▁▁▁▇
ManufacturingProcess43 0 1 0.00 1.00 -1.05 -0.36 -0.13 0.13 11.62 ▇▁▁▁▁
ManufacturingProcess44 0 1 0.00 1.00 -5.61 -0.02 0.29 0.29 0.92 ▁▁▁▁▇
ManufacturingProcess45 0 1 0.00 1.00 -5.25 -0.09 0.15 0.40 1.14 ▁▁▁▂▇

I used KNN imputation to fill in the missing values; the data are now 100% complete. Note that caret's knnImpute also centers and scales every column (as the preProcess output above shows), which is why the imputed variables now have mean 0 and standard deviation 1.
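
A quick programmatic check confirms completeness; a minimal sketch counting missing cells before and after imputation:

# Count missing cells before and after imputation; the second sum should be 0.
sum(is.na(ChemicalManufacturingProcess))
sum(is.na(chemical_impute))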

  (c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
# Remove near-zero variance predictors
nzv <- nearZeroVar(chemical_impute)
filtered_chemical <- chemical_impute[, -nzv]
filtered_chemical |> ncol()
[1] 57
# Split the data into a training and a test set
trainingRows <- createDataPartition(
    filtered_chemical$Yield,
    p = .80,
    list = FALSE
)
chemical_train <- filtered_chemical[trainingRows, ]
chemical_test <- filtered_chemical[-trainingRows, ]

# Preprocess and tune a PLS model
plsTune <- train(
    chemical_train[
        ,
        !names(chemical_train) %in% "Yield"
    ],
    chemical_train$Yield,
    method = "pls",
    tuneLength = 20,
    trControl = ctrl,
    preProc = c("center", "scale")
)
plsTune
Partial Least Squares 

144 samples
 56 predictor

Pre-processing: centered (56), scaled (56) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 129, 129, 131, 130, 130, 129, ... 
Resampling results across tuning parameters:

  ncomp  RMSE       Rsquared   MAE      
   1     0.7735224  0.4192347  0.6342773
   2     0.8287404  0.5360269  0.6014268
   3     0.7082333  0.5900823  0.5361202
   4     0.8332205  0.5694967  0.5802218
   5     0.8900838  0.5486265  0.6036344
   6     0.8909049  0.5520740  0.6064889
   7     0.9067980  0.5417347  0.6193941
   8     0.9717348  0.5325744  0.6454026
   9     0.9587565  0.5346397  0.6445537
  10     0.9692858  0.5230863  0.6510060
  11     0.9455037  0.5193371  0.6492300
  12     0.9189663  0.5335323  0.6425027
  13     0.9533670  0.5408934  0.6468623
  14     1.0262358  0.5237221  0.6747596
  15     1.0751780  0.5114347  0.6884491
  16     1.0517792  0.5036778  0.6841665
  17     1.0757681  0.4974535  0.6884337
  18     1.1338358  0.4942898  0.7045341
  19     1.1811687  0.4917023  0.7160439
  20     1.2733157  0.4895829  0.7401724

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 3.
RMSE <- plsTune$results$RMSE[plsTune$bestTune$ncomp]

The optimal value of the performance metric, RMSE, was 0.7082333 at ncomp = 3. Note that this is a resampled (10-fold cross-validation) estimate from the training data, so it may still be somewhat optimistic relative to performance on genuinely new data.
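
A slightly more robust way to pull that value is to match on the tuning parameter instead of relying on row position; a minimal sketch:

# Extract the full row of resampling results for the selected number of components.
plsTune$results[plsTune$results$ncomp == plsTune$bestTune$ncomp, ]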

  (d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
# Predict the response for the test set
predictions <- predict(plsTune, newdata = chemical_test)

# Get value of the performance metric
plsValues <- data.frame(obs = chemical_test$Yield, pred = predictions)
performance_metric <- defaultSummary(plsValues)["RMSE"]

The test set RMSE is 0.8726064, which is higher (worse) than the resampled estimate of 0.7082333 on the training set, as is typical when moving from resampling to held-out data.
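
caret's postResample() computes the same test-set metrics directly from the vectors of predictions and observations; an equivalent sketch:

# Equivalent test-set RMSE, R-squared, and MAE in one call.
postResample(pred = predictions, obs = chemical_test$Yield)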

  (e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
importance <- varImp(plsTune, scale = FALSE)$importance |> arrange(-Overall)
top_predictors <- head(importance)
top_predictors
                          Overall
ManufacturingProcess32 0.06759997
ManufacturingProcess36 0.05713581
ManufacturingProcess13 0.05702366
ManufacturingProcess17 0.05467426
ManufacturingProcess09 0.05089636
ManufacturingProcess06 0.04741952

All of the top predictors in this model are manufacturing process variables, as printed above; the process predictors clearly dominate the list.
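
The full importance profile can also be plotted; caret provides a plot method for varImp objects, so a short sketch displays, for example, the top 15 predictors:

# Plot the 15 most important predictors from the tuned PLS model.
plot(varImp(plsTune, scale = FALSE), top = 15)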

  (f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
top_predictors_data <- ChemicalManufacturingProcess[, c(
    top_predictors |> rownames(),
    "Yield"
)]

# Create correlation plot
ggpairs(
    top_predictors_data
) +
    theme_classic()

I used ggpairs() to plot the relationships between each of the top predictors and the response: the bottom row shows scatterplots of Yield against each predictor, and the rightmost column gives the corresponding correlations. This information helps identify which predictors have the strongest effects on yield, guiding future improvements in the manufacturing process. For example, ManufacturingProcess32 has the highest positive correlation with Yield (as ManufacturingProcess32 increases, Yield increases), so increasing it would be a reasonable target for raising yield in future runs.
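
The same correlations can be computed directly; because the un-imputed data still contain missing values, pairwise-complete observations are used in this sketch:

# Correlation of each top predictor with Yield, using pairwise-complete
# observations since the raw data still contain NAs.
cor(top_predictors_data, use = "pairwise.complete.obs")["Yield", ] |>
    sort(decreasing = TRUE)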