Description

In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.

6.2

Developing a model to predict permeability (see Sect. 1.4) could save sig nificant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response. Kuhn, Max; Johnson, Kjell. Applied Predictive Modeling (p. 138). Springer New York. Kindle Edition.

head(describe(fingerprints), 10)  #limit 10

##     vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1     1 165 0.23 0.42      0    0.17   0   0   1     1 1.27    -0.39 0.03
## X2     2 165 0.22 0.42      0    0.16   0   0   1     1 1.31    -0.28 0.03
## X3     3 165 0.21 0.41      0    0.14   0   0   1     1 1.40    -0.05 0.03
## X4     4 165 0.21 0.41      0    0.14   0   0   1     1 1.40    -0.05 0.03
## X5     5 165 0.21 0.41      0    0.14   0   0   1     1 1.40    -0.05 0.03
## X6     6 165 0.45 0.50      0    0.44   0   0   1     1 0.21    -1.97 0.04
## X7     7 165 1.00 0.00      1    1.00   0   1   1     0  NaN      NaN 0.00
## X8     8 165 1.00 0.00      1    1.00   0   1   1     0  NaN      NaN 0.00
## X9     9 165 0.02 0.13      0    0.00   0   0   1     1 7.15    49.38 0.01
## X10   10 165 0.02 0.15      0    0.00   0   0   1     1 6.13    35.80 0.01

str(fingerprints)

##  num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...

#head(fingerprints)

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

low_freq<- caret::nearZeroVar(fingerprints)
pred_df <-fingerprints[,-low_freq]
dim(pred_df)

## [1] 165 388

When we exclude low frequency predictors, 388 are found.

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

set.seed(432)

split_1<- sample(c(rep(0, 0.7 * nrow(permeability)), 
                  rep(1, 0.3 * nrow(permeability))))

table(split_1)

## split_1
##   0   1 
## 115  49

train_1 <- pred_df[split_1 == 0,]
test_1 <- pred_df[split_1 == 1,]

train_val1 <- permeability[split_1 == 0]
test_val1 <- permeability[split_1 == 1]

tune1 <- train(train_1, train_val1, 
                method='pls', metric='Rsquared',
                tuneLength=20, 
                trControl=trainControl(method='cv'),
                preProc=c('center', 'scale')
                )
tune1

## Partial Least Squares 
## 
## 116 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 105, 105, 104, 105, 104, 104, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.43068  0.3371880  10.319406
##    2     12.35354  0.4319338   8.980061
##    3     12.72791  0.4215996   9.721106
##    4     13.22950  0.3943144   9.868499
##    5     12.84057  0.4166923   9.820530
##    6     12.89208  0.4238546   9.879228
##    7     12.34849  0.4577443   9.357330
##    8     12.19153  0.4781810   9.059569
##    9     12.28542  0.4789290   9.076210
##   10     12.37766  0.4842363   9.259815
##   11     12.81321  0.4673728   9.665919
##   12     13.21172  0.4424502  10.028955
##   13     14.03620  0.4073698  10.724542
##   14     14.71433  0.3780076  11.219147
##   15     15.16874  0.3515616  11.557017
##   16     15.40113  0.3382075  11.692555
##   17     16.12396  0.3116538  12.133942
##   18     16.92900  0.2783572  12.781934
##   19     17.34429  0.2631386  13.187677
##   20     17.88137  0.2424371  13.691859
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 10.

This function determines ncomp 9 as the optimal model.

(d) Predict the response for the test set. What is the test set estimate of R2?

#run prediction and eval
pred_1 <- predict(tune1, newdata=test_1)

postResample(pred=pred_1, obs=test_val1)

##       RMSE   Rsquared        MAE 
## 11.2545378  0.5132289  8.2131883

R-squared = .5242

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

#let's try ridge regression

ridge_grid <- data.frame(.lambda = seq(0, .1, length = 15))

#enet_grid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
set.seed(432)

ridge_reg_fit <- train(train_1, train_val1,
method = "ridge",
tuneGrid = ridge_grid,
trControl = trainControl(method = "cv", number = 10, verboseIter = FALSE),
#predictors
preProc = c("center", "scale"))

## Warning: model fit failed for Fold03: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed

## Warning: model fit failed for Fold05: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

ridge_reg_fit

## Ridge Regression 
## 
## 116 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 104, 104, 104, 104, 104, 104, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared   MAE      
##   0.000000000   16.17472  0.3642978   12.21149
##   0.007142857   40.95803  0.1379076   29.87720
##   0.014285714  112.37215  0.1633977   84.36172
##   0.021428571  320.18077  0.1937930  229.85171
##   0.028571429   18.00521  0.2704993   12.78184
##   0.035714286   17.27937  0.2924620   12.34718
##   0.042857143   16.79044  0.3110419   12.05392
##   0.050000000   16.37375  0.3280473   11.80373
##   0.057142857   16.08923  0.3402429   11.63551
##   0.064285714   15.89527  0.3491905   11.53285
##   0.071428571   15.66014  0.3600055   11.40208
##   0.078571429   15.65052  0.3607620   11.39481
##   0.085714286   15.51272  0.3683991   11.38592
##   0.092857143   15.23332  0.3816867   11.21204
##   0.100000000   15.09919  0.3887969   11.13863
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

Our ridge regression employed a penalty of lambda = .1, RMSE = 15.1

#troubleshooting
#anyNA(train_1)
#anyNA(train_val1)

# Adjust the tuning grid to avoid zero lambda or problematic fraction values
enet_grid <- expand.grid(
  .lambda = seq(0.001, 0.1, length = 10),  
  .fraction = seq(0.1, 0.9, length = 10)   
)

#elastic net next
# first set range of penalities and mixing params
#enet_grid <- expand.grid(.lambda = seq(0, 0.1, length = 15),  
 #                        .fraction = seq(0.05, 1, length = 20))

#I've simplified the model so that it stops throwing errors related to (zmat <gamhat)
set.seed(432)
elastic_net_fit <- train(
  train_1, train_val1,
  method = "enet",
  tuneGrid = enet_grid,
  trControl = trainControl(method = "cv", number = 5),
  preProc = c("center", "scale")
)

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813

## Warning: model fit failed for Fold3: lambda=0.100, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
##   Some of the columns of x have zero variance

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813

## Warning: model fit failed for Fold3: lambda=0.089, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
##   Some of the columns of x have zero variance

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813

## Warning: model fit failed for Fold3: lambda=0.078, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
##   Some of the columns of x have zero variance

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813

## Warning: model fit failed for Fold3: lambda=0.067, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
##   Some of the columns of x have zero variance

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813

## Warning: model fit failed for Fold3: lambda=0.056, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
##   Some of the columns of x have zero variance

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813

## Warning: model fit failed for Fold3: lambda=0.045, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
##   Some of the columns of x have zero variance

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813

## Warning: model fit failed for Fold3: lambda=0.034, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
##   Some of the columns of x have zero variance

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813

## Warning: model fit failed for Fold3: lambda=0.023, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
##   Some of the columns of x have zero variance

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813

## Warning: model fit failed for Fold3: lambda=0.012, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
##   Some of the columns of x have zero variance

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X780, X782, X792, X793, X800, X801,
## X806, X813

## Warning: model fit failed for Fold3: lambda=0.001, fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
##   Some of the columns of x have zero variance

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

#elastic_net_fit

# run and evaluate predictions
pred_enet <- predict(elastic_net_fit, newdata = test_1)
test_results_1<- postResample(pred = pred_enet, obs = test_val1)
test_results_1

##      RMSE  Rsquared       MAE 
## 9.6272475 0.5883668 7.4348566

6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(package = "AppliedPredictiveModeling")$results[, "Item"]

##  [1] "ChemicalManufacturingProcess"  "abalone"                      
##  [3] "bio (hepatic)"                 "cars2010 (FuelEconomy)"       
##  [5] "cars2011 (FuelEconomy)"        "cars2012 (FuelEconomy)"       
##  [7] "chem (hepatic)"                "classes (twoClassData)"       
##  [9] "concrete"                      "diagnosis (AlzheimerDisease)" 
## [11] "fingerprints (permeability)"   "injury (hepatic)"             
## [13] "logisticCreditPredictions"     "mixtures (concrete)"          
## [15] "permeability"                  "predictors (AlzheimerDisease)"
## [17] "predictors (twoClassData)"     "schedulingData"               
## [19] "segmentationOriginal"          "solTestX (solubility)"        
## [21] "solTestXtrans (solubility)"    "solTestY (solubility)"        
## [23] "solTrainX (solubility)"        "solTrainXtrans (solubility)"  
## [25] "solTrainY (solubility)"

data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

# I think I'm going to convert the data to a matrix first and visualize it
matrix_1 <- as.matrix(ChemicalManufacturingProcess)
aggr(matrix_1, 
     numbers = TRUE, 
     col = c("lightblue", "purple"),  
     combined = TRUE, 
     sortVars = TRUE,  
     labels = names(ChemicalManufacturingProcess), 
     cex.axis = 0.7, 
     gap = 2, 
     ylab = c("Missing Data", "Pattern"))

## 
##  Variables sorted by number of missings: 
##                Variable Count
##  ManufacturingProcess03    15
##  ManufacturingProcess11    10
##  ManufacturingProcess10     9
##  ManufacturingProcess25     5
##  ManufacturingProcess26     5
##  ManufacturingProcess27     5
##  ManufacturingProcess28     5
##  ManufacturingProcess29     5
##  ManufacturingProcess30     5
##  ManufacturingProcess31     5
##  ManufacturingProcess33     5
##  ManufacturingProcess34     5
##  ManufacturingProcess35     5
##  ManufacturingProcess36     5
##  ManufacturingProcess02     3
##  ManufacturingProcess06     2
##  ManufacturingProcess01     1
##  ManufacturingProcess04     1
##  ManufacturingProcess05     1
##  ManufacturingProcess07     1
##  ManufacturingProcess08     1
##  ManufacturingProcess12     1
##  ManufacturingProcess14     1
##  ManufacturingProcess22     1
##  ManufacturingProcess23     1
##  ManufacturingProcess24     1
##  ManufacturingProcess40     1
##  ManufacturingProcess41     1
##                   Yield     0
##    BiologicalMaterial01     0
##    BiologicalMaterial02     0
##    BiologicalMaterial03     0
##    BiologicalMaterial04     0
##    BiologicalMaterial05     0
##    BiologicalMaterial06     0
##    BiologicalMaterial07     0
##    BiologicalMaterial08     0
##    BiologicalMaterial09     0
##    BiologicalMaterial10     0
##    BiologicalMaterial11     0
##    BiologicalMaterial12     0
##  ManufacturingProcess09     0
##  ManufacturingProcess13     0
##  ManufacturingProcess15     0
##  ManufacturingProcess16     0
##  ManufacturingProcess17     0
##  ManufacturingProcess18     0
##  ManufacturingProcess19     0
##  ManufacturingProcess20     0
##  ManufacturingProcess21     0
##  ManufacturingProcess32     0
##  ManufacturingProcess37     0
##  ManufacturingProcess38     0
##  ManufacturingProcess39     0
##  ManufacturingProcess42     0
##  ManufacturingProcess43     0
##  ManufacturingProcess44     0
##  ManufacturingProcess45     0

# impute and re-asses
imputed_data <- mice(matrix_1, method = "pmm", m = 5, maxit = 5, printFlag = FALSE)

## Warning: Number of logged events: 675

complete_data <- complete(imputed_data)
# Check NAs
anyNA(complete_data)

## [1] FALSE

The proportion of missing data is small and dispersed across many predictors, so I chose the pmm method (predictive mean matching) from the mice library for imputation. This method predicts values for missing data and randomly selects and imputes donors from complete cases with similar predicted values, assuming the same distribution for missing and observed data. We successfully eliminated missing data per the above check.

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

# filter and check dimensions
chem_data <- ChemicalManufacturingProcess
chem_data_filtered <- chem_data[, -nearZeroVar(chem_data)]
dim(chem_data_filtered)

## [1] 176  57

# generate 80/20 training and test sets
set.seed(432)
train_indices <- createDataPartition(chem_data_filtered$Yield, times = 1, p = 0.80, list = FALSE)
train_predictors <- chem_data_filtered[train_indices, -c(1)]  # Exclude 'Yield' column
test_predictors <- chem_data_filtered[-train_indices, -c(1)]
train_response <- chem_data_filtered[train_indices, ]$Yield
test_response <- chem_data_filtered[-train_indices, ]$Yield

# train
(perf_fit <- train(
  x = train_predictors, 
  y = train_response,
  method = "pls",
  metric = "Rsquared",
  tuneLength = 25,
  trControl = trainControl(method = "cv", number = 10),
  preProcess = c('center', 'scale')
))

## Partial Least Squares 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 130, 128, 129, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.366557  0.4834273  1.108564
##    2     1.318038  0.5215160  1.052882
##    3     1.682071  0.5375627  1.137754
##    4     2.187599  0.5580511  1.269234
##    5     2.323014  0.5632677  1.301737
##    6     2.619363  0.5563855  1.404710
##    7     2.656325  0.5324331  1.430982
##    8     2.892100  0.5225133  1.490543
##    9     2.952069  0.5086699  1.517270
##   10     3.005844  0.4945833  1.535283
##   11     3.062673  0.4605866  1.580011
##   12     3.227874  0.4353043  1.650421
##   13     3.227337  0.4435162  1.642451
##   14     3.137262  0.4524246  1.609899
##   15     3.166196  0.4685060  1.605729
##   16     3.294770  0.4710852  1.639593
##   17     3.556879  0.4660242  1.708761
##   18     3.709272  0.4689027  1.749011
##   19     3.899968  0.4745939  1.789447
##   20     3.982159  0.4782400  1.806003
##   21     4.081416  0.4784149  1.835560
##   22     4.206296  0.4769404  1.873522
##   23     4.503215  0.4762817  1.960228
##   24     4.794765  0.4718393  2.046734
##   25     4.999727  0.4649212  2.111659
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 5.

perf_fit

## Partial Least Squares 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 130, 128, 129, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.366557  0.4834273  1.108564
##    2     1.318038  0.5215160  1.052882
##    3     1.682071  0.5375627  1.137754
##    4     2.187599  0.5580511  1.269234
##    5     2.323014  0.5632677  1.301737
##    6     2.619363  0.5563855  1.404710
##    7     2.656325  0.5324331  1.430982
##    8     2.892100  0.5225133  1.490543
##    9     2.952069  0.5086699  1.517270
##   10     3.005844  0.4945833  1.535283
##   11     3.062673  0.4605866  1.580011
##   12     3.227874  0.4353043  1.650421
##   13     3.227337  0.4435162  1.642451
##   14     3.137262  0.4524246  1.609899
##   15     3.166196  0.4685060  1.605729
##   16     3.294770  0.4710852  1.639593
##   17     3.556879  0.4660242  1.708761
##   18     3.709272  0.4689027  1.749011
##   19     3.899968  0.4745939  1.789447
##   20     3.982159  0.4782400  1.806003
##   21     4.081416  0.4784149  1.835560
##   22     4.206296  0.4769404  1.873522
##   23     4.503215  0.4762817  1.960228
##   24     4.794765  0.4718393  2.046734
##   25     4.999727  0.4649212  2.111659
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 5.

The optimal model is PLS model ncomp = 2 with the highest R-squared value (0.5299). It expalins the most variance.

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

predict_test_y2 <- predict(perf_fit, test_predictors)

performance_results <- postResample(pred = predict_test_y2, obs = test_response)

performance_results

##      RMSE  Rsquared       MAE 
## 1.1886974 0.6450927 0.9534105

The test set performance metrics show that the model enjoys a higher R-squared (0.6451) and a comparable RMSE (1.1887) to the training set, indicating decent predictive accuracy.

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

variable_importance <- varImp(perf_fit, scale = TRUE)

## Warning: package 'pls' was built under R version 4.3.3

## 
## Attaching package: 'pls'

## The following object is masked from 'package:corrplot':
## 
##     corrplot

## The following object is masked from 'package:caret':
## 
##     R2

## The following object is masked from 'package:stats':
## 
##     loadings

variable_importance

## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   95.52
## ManufacturingProcess13   89.86
## ManufacturingProcess17   84.67
## ManufacturingProcess36   78.40
## BiologicalMaterial06     70.47
## BiologicalMaterial02     70.31
## BiologicalMaterial03     69.07
## ManufacturingProcess28   62.72
## ManufacturingProcess12   62.09
## BiologicalMaterial01     58.89
## ManufacturingProcess06   58.75
## ManufacturingProcess33   57.75
## BiologicalMaterial11     56.92
## BiologicalMaterial08     56.54
## BiologicalMaterial04     55.35
## BiologicalMaterial12     53.72
## ManufacturingProcess34   53.06
## ManufacturingProcess11   51.37
## ManufacturingProcess37   49.69

The most important predictors in the model are mostly manufacturing process variables, with only a few biological predictors appearing in the top 10.

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

library(corrplot)

correlation_1 <- chem_data_filtered %>% 
  select('Yield', "ManufacturingProcess32", "ManufacturingProcess36", 
         "ManufacturingProcess09", "ManufacturingProcess13",
         "ManufacturingProcess33", "ManufacturingProcess17", 
         "BiologicalMaterial03", "BiologicalMaterial02", 
         "BiologicalMaterial06", "BiologicalMaterial08",
         "ManufacturingProcess06", "BiologicalMaterial12", 
         "ManufacturingProcess12", "ManufacturingProcess11", 
         "BiologicalMaterial04", "ManufacturingProcess34", 
         "BiologicalMaterial01", "ManufacturingProcess29", 
         "ManufacturingProcess30", "BiologicalMaterial11")

plot_corr_1 <- cor(correlation_1)

corrplot.mixed(plot_corr_1, tl.col = 'red', tl.pos = 'lt', 
         upper = "number", lower="circle")

ManufacturingProcess32 shows a strong positive correlation (0.61) with Yield, suggesting it significantly impacts the output, while other key process and biological variables have moderate correlations. This analysis indicates that adjusting ManufacturingProcess32 and related variables could help improve the manufacturing yield.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

DATA 624 HW7

Evan McLaughlin

2024-11-03

Description

6.2

(a) Start R and use these commands to load the data:

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

(d) Predict the response for the test set. What is the test set estimate of R2?

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

6.3

(a) Start R and use these commands to load the data:

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

DATA 624 HW7

Evan McLaughlin

2024-11-03

Description

6.2

(a) Start R and use these commands to load the data:

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

(d) Predict the response for the test set. What is the test set estimate of R2?

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

6.3

(a) Start R and use these commands to load the data:

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?