Do problems 7.2 and 7.5 in Kuhn and Johnson. There are only two but they have many parts. Please submit both a link to your Rpubs and the .rmd file.
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: \(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\) where the x values are random variables uniformly distributed between 0 and 1. The package mlbench contains a function called mlbench.friedman1 that simulates these data:
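For intuition, here is a minimal hand-rolled sketch of the same simulation, written directly from the formula above (mlbench.friedman1 packages this up, so the block is purely illustrative):
# Illustrative re-implementation of the Friedman 1 simulation
simFriedman1 <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10)  # 10 uniform(0, 1) predictors; only x1-x5 enter the formula
  y <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5] +
    rnorm(n, sd = sd)  # additive N(0, sd^2) noise
  list(x = x, y = y)
}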
library(caret)
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
library(caret)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
Since a KNN model has already been fit above, we will train neural network, MARS, and SVM models to see which gives the best performance.
# Neural network model
library(nnet)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
trainX <- trainingData$x
trainY <- trainingData$y
testX <- testData$x
testY <- testData$y
# Center and scale the data
preProc <- preProcess(trainX, method = c("center", "scale"))
trainXtrans <- predict(preProc, trainX)
testXtrans <- predict(preProc, testX)
# Remove highly correlated predictors
tooHigh <- findCorrelation(cor(trainXtrans), cutoff = 0.75)
if(length(tooHigh) > 0) {
trainXnnet <- trainXtrans[, -tooHigh, drop = FALSE]
testXnnet <- testXtrans[, -tooHigh, drop = FALSE]
} else {
trainXnnet <- trainXtrans
testXnnet <- testXtrans
}
# Use a 10 fold Cross validation to tune the optimal parameters
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
verboseIter = FALSE)
# Create several sizes and decays
nnetGrid <- expand.grid(.size = 1:10,
.decay = c(0, 0.01, 0.1),
.bag = FALSE)
p <- ncol(trainXnnet)
maxwts_needed <- max(nnetGrid$.size) * (p + 2) + 1
set.seed(100)
nnetTune <- train(x = trainXnnet,
y = trainY,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = NULL, # already preprocessed
linout = TRUE, # regression output, not classification
trace = FALSE,
MaxNWts = maxwts_needed + 50, # add margin
maxit = 500,
repeats = 5) # average 5 networks per tuning fit
print(nnetTune)
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0.00 2.728604 0.7203091 2.152663
## 1 0.01 2.719536 0.7214992 2.124635
## 1 0.10 2.699292 0.7244461 2.094968
## 2 0.00 2.645576 0.7345835 2.088920
## 2 0.01 2.686169 0.7253877 2.113754
## 2 0.10 2.674174 0.7280434 2.093252
## 3 0.00 2.242475 0.8052934 1.750041
## 3 0.01 2.334757 0.7887846 1.816012
## 3 0.10 2.387936 0.7836009 1.875259
## 4 0.00 2.142435 0.8263282 1.700758
## 4 0.01 2.197012 0.8162037 1.710410
## 4 0.10 2.271115 0.8053792 1.801021
## 5 0.00 2.402633 0.7882899 1.856423
## 5 0.01 2.302976 0.8040184 1.824209
## 5 0.10 2.258198 0.8081794 1.789014
## 6 0.00 2.979536 0.7160739 2.181607
## 6 0.01 2.499688 0.7670440 1.955686
## 6 0.10 2.448278 0.7796720 1.940492
## 7 0.00 4.355370 0.5707408 2.847004
## 7 0.01 2.545489 0.7635256 2.008557
## 7 0.10 2.444450 0.7786116 1.931350
## 8 0.00 5.739978 0.4845702 3.482587
## 8 0.01 2.659024 0.7436265 2.089078
## 8 0.10 2.535702 0.7613009 1.976352
## 9 0.00 4.315142 0.5661795 2.863657
## 9 0.01 2.725524 0.7239232 2.127311
## 9 0.10 2.496450 0.7647860 1.961595
## 10 0.00 4.136404 0.6067477 2.809888
## 10 0.01 2.792631 0.7116502 2.197621
## 10 0.10 2.469993 0.7678994 1.966281
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0 and bag = FALSE.
# Predict on test set
nnetPred <- predict(nnetTune, newdata = testXnnet)
postResample(pred = nnetPred, obs = testY)
## RMSE Rsquared MAE
## 1.9551301 0.8464735 1.5121497
Cross-validation selected a neural network with size = 4 and decay = 0, as it produced the lowest resampled RMSE. When this model was evaluated on the test set, the RMSE was 1.96 and the R^2 was 0.85. In comparison, the KNN model had a test RMSE of 3.20 and an R^2 of 0.68. This indicates that the neural network outperformed the KNN model, producing more accurate predictions and explaining more of the variance.
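To visualize the tuning results, caret's plot method for train objects shows the resampled RMSE profile across the size and decay grid:
# Resampled RMSE over the size/decay grid
plot(nnetTune)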
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
trainX <- trainingData$x
trainY <- trainingData$y
testX <- testData$x
testY <- testData$y
library(earth)
# Fit MARS model
marsFit <- earth(trainX, trainY)
print(marsFit)
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
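As a quick check of which predictors the initial earth fit actually uses, earth's evimp() function reports variable importance under several criteria:
# Variable importance for the untuned earth fit (nsubsets, GCV, and RSS criteria)
evimp(marsFit)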
# Define the candidate hyperparameters for tuning
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Train MARS with cross-validation
set.seed(100)
marsTuned <- train(
x = trainX,
y = trainY,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv")
)
# Results
print(marsTuned)
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.327937 0.2544880 3.600474
## 1 3 3.572450 0.4912720 2.895811
## 1 4 2.596841 0.7183600 2.106341
## 1 5 2.370161 0.7659777 1.918669
## 1 6 2.276141 0.7881481 1.810001
## 1 7 1.766728 0.8751831 1.390215
## 1 8 1.780946 0.8723243 1.401345
## 1 9 1.665091 0.8819775 1.325515
## 1 10 1.663804 0.8821283 1.327657
## 1 11 1.657738 0.8822967 1.331730
## 1 12 1.653784 0.8827903 1.331504
## 1 13 1.648496 0.8823663 1.316407
## 1 14 1.639073 0.8841742 1.312833
## 1 15 1.639073 0.8841742 1.312833
## 1 16 1.639073 0.8841742 1.312833
## 1 17 1.639073 0.8841742 1.312833
## 1 18 1.639073 0.8841742 1.312833
## 1 19 1.639073 0.8841742 1.312833
## 1 20 1.639073 0.8841742 1.312833
## 1 21 1.639073 0.8841742 1.312833
## 1 22 1.639073 0.8841742 1.312833
## 1 23 1.639073 0.8841742 1.312833
## 1 24 1.639073 0.8841742 1.312833
## 1 25 1.639073 0.8841742 1.312833
## 1 26 1.639073 0.8841742 1.312833
## 1 27 1.639073 0.8841742 1.312833
## 1 28 1.639073 0.8841742 1.312833
## 1 29 1.639073 0.8841742 1.312833
## 1 30 1.639073 0.8841742 1.312833
## 1 31 1.639073 0.8841742 1.312833
## 1 32 1.639073 0.8841742 1.312833
## 1 33 1.639073 0.8841742 1.312833
## 1 34 1.639073 0.8841742 1.312833
## 1 35 1.639073 0.8841742 1.312833
## 1 36 1.639073 0.8841742 1.312833
## 1 37 1.639073 0.8841742 1.312833
## 1 38 1.639073 0.8841742 1.312833
## 2 2 4.327937 0.2544880 3.600474
## 2 3 3.572450 0.4912720 2.895811
## 2 4 2.661826 0.7070510 2.173471
## 2 5 2.404015 0.7578971 1.975387
## 2 6 2.243927 0.7914805 1.783072
## 2 7 1.856336 0.8605482 1.435682
## 2 8 1.754607 0.8763186 1.396841
## 2 9 1.603578 0.8938666 1.261361
## 2 10 1.492421 0.9084998 1.168700
## 2 11 1.317350 0.9292504 1.033926
## 2 12 1.304327 0.9320133 1.019108
## 2 13 1.277510 0.9323681 1.002927
## 2 14 1.269626 0.9350024 1.003346
## 2 15 1.266217 0.9359400 1.013893
## 2 16 1.268470 0.9354868 1.011414
## 2 17 1.268470 0.9354868 1.011414
## 2 18 1.268470 0.9354868 1.011414
## 2 19 1.268470 0.9354868 1.011414
## 2 20 1.268470 0.9354868 1.011414
## 2 21 1.268470 0.9354868 1.011414
## 2 22 1.268470 0.9354868 1.011414
## 2 23 1.268470 0.9354868 1.011414
## 2 24 1.268470 0.9354868 1.011414
## 2 25 1.268470 0.9354868 1.011414
## 2 26 1.268470 0.9354868 1.011414
## 2 27 1.268470 0.9354868 1.011414
## 2 28 1.268470 0.9354868 1.011414
## 2 29 1.268470 0.9354868 1.011414
## 2 30 1.268470 0.9354868 1.011414
## 2 31 1.268470 0.9354868 1.011414
## 2 32 1.268470 0.9354868 1.011414
## 2 33 1.268470 0.9354868 1.011414
## 2 34 1.268470 0.9354868 1.011414
## 2 35 1.268470 0.9354868 1.011414
## 2 36 1.268470 0.9354868 1.011414
## 2 37 1.268470 0.9354868 1.011414
## 2 38 1.268470 0.9354868 1.011414
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
# Variable importance
marsVarImp <- varImp(marsTuned)
print(marsVarImp)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.24
## X2 48.73
## X5 15.52
## X3 0.00
# Evaluate on test set
marsPred <- predict(marsTuned, newdata = testX)
rmse <- sqrt(mean((marsPred - testY)^2))
rsq <- cor(marsPred, testY)^2
cat("Test RMSE:", rmse, "\n")
## Test RMSE: 1.158995
cat("Test R^2:", rsq, "\n")
## Test R^2: 0.9460418
Cross-validation selected a MARS model with degree = 2 and nprune = 15, which produced a test RMSE of 1.16 and an R^2 of 0.95. These results are an improvement over both the KNN and neural network models. Variable importance indicated that the model relied primarily on predictors X1, X4, and X2, while X5 contributed minimally and X3 ranked lowest.
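To see the selected model in concrete terms, we can print the hinge functions and coefficients retained by the final tuned fit:
# Basis functions and coefficients of the tuned MARS model
summary(marsTuned$finalModel)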
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
trainX <- trainingData$x
trainY <- trainingData$y
testX <- testData$x
testY <- testData$y
set.seed(100)
svmRTuned <- train(
x = trainX,
y = trainY,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv")
)
# Results
print(svmRTuned)
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.530787 0.7922715 2.013175
## 0.50 2.259539 0.8064569 1.789962
## 1.00 2.099789 0.8274242 1.656154
## 2.00 2.002943 0.8412934 1.583791
## 4.00 1.943618 0.8504425 1.546586
## 8.00 1.918711 0.8547582 1.532981
## 16.00 1.920651 0.8536189 1.536116
## 32.00 1.920651 0.8536189 1.536116
## 64.00 1.920651 0.8536189 1.536116
## 128.00 1.920651 0.8536189 1.536116
## 256.00 1.920651 0.8536189 1.536116
## 512.00 1.920651 0.8536189 1.536116
## 1024.00 1.920651 0.8536189 1.536116
## 2048.00 1.920651 0.8536189 1.536116
##
## Tuning parameter 'sigma' was held constant at a value of 0.06509124
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06509124 and C = 8.
print(svmRTuned$finalModel)
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 8
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0650912392130387
##
## Number of Support Vectors : 152
##
## Objective Function Value : -69.5828
## Training error : 0.009013
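Plotting the tuning profile with cost on a log scale shows the RMSE flattening once C reaches 8:
# Resampled RMSE versus cost (log-2 x-axis)
plot(svmRTuned, scales = list(x = list(log = 2)))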
# Variable Importance
svmVarImp <- varImp(svmRTuned)
print(svmVarImp)
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
# Evaluate on test set
svmPred <- predict(svmRTuned, newdata = testX)
testRMSE <- sqrt(mean((svmPred - testY)^2))
testR2 <- cor(svmPred, testY)^2
testMAE <- mean(abs(svmPred - testY))
cat("Test RMSE:", round(testRMSE, 4), "\n")
## Test RMSE: 2.0632
cat("Test R^2:", round(testR2, 6), "\n")
## Test R^2: 0.827574
cat("Test MAE:", round(testMAE, 4), "\n")
## Test MAE: 1.5662
The SVM model with a radial basis function kernel selected optimal tuning parameters of C = 8 and sigma = 0.065, which resulted in a test RMSE of 2.06 and an R^2 of 0.83. These results are similar to those of the neural network, indicating that both models captured the nonlinear patterns in the data moderately well. However, the SVM did not outperform the MARS model, which achieved a lower RMSE and a higher R^2. Variable importance for the SVM indicated that X4, X1, and X2 were the most influential predictors, while X6 and X7 contributed little to nothing.
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)? The MARS model performed best, with the lowest RMSE and the highest R^2 compared to the neural network, SVM, and KNN models. MARS did select the informative predictors X1–X5, ranking X1, X4, and X2 as the most important and leaving the noise predictors X6–X10 essentially unused.
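Collecting the test-set results reported above into a single table makes the ranking explicit (the values are the rounded metrics from the earlier chunks):
# Test-set metrics as reported above, rounded to two decimals
data.frame(
  Model = c("KNN", "NeuralNetwork", "MARS", "SVM"),
  Test_RMSE = c(3.20, 1.96, 1.16, 2.06),
  Test_R2 = c(0.68, 0.85, 0.95, 0.83)
)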
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
From exercise 6.3:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Separating the target (Yield) and predictor variables
yield <- ChemicalManufacturingProcess$Yield
processPredictors <- ChemicalManufacturingProcess[, -1]
# Impute missing values with the median
preProc <- preProcess(processPredictors, method = "medianImpute")
processPredictors_imputed <- predict(preProc, processPredictors)
set.seed(1234)
trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)
trainX <- processPredictors_imputed[trainIndex, ]
testX <- processPredictors_imputed[-trainIndex, ]
trainY <- yield[trainIndex]
testY <- yield[-trainIndex]
ctrl <- trainControl(method = "cv", number = 10)
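As a quick sanity check, counting missing values before and after imputation confirms that the median imputation filled every gap:
# Total NA count before vs. after median imputation
c(before = sum(is.na(processPredictors)),
  after = sum(is.na(processPredictors_imputed)))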
Which nonlinear regression model gives the optimal resampling and test set performance?
# SVM
svmFit <- train(
x = trainX,
y = trainY,
method = "svmRadial",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 10
)
# Neural Network
nnFit <- train(
x = trainX,
y = trainY,
method = "nnet",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 10,
linout = TRUE,
trace = FALSE,
MaxNWts = 2000 # raise nnet's default weight cap so the larger hidden sizes in the grid can be fit
)
# MARS
marsFit <- train(
x = trainX,
y = trainY,
method = "earth",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 10
)
# KNN
knnFit <- train(
x = trainX,
y = trainY,
method = "knn",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 10
)
# Compare CV performance
resamps <- resamples(list(SVM = svmFit, NN = nnFit, MARS = marsFit, KNN = knnFit))
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: SVM, NN, MARS, KNN
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 0.6177582 0.7550328 0.9581260 0.9288214 1.089252 1.180166 0
## NN 0.9301265 1.0429602 1.1003361 1.1392891 1.281513 1.309942 0
## MARS 0.6118668 0.8785885 0.9706923 0.9677945 1.105998 1.216419 0
## KNN 0.7694286 0.8888896 1.0219026 1.0581776 1.162969 1.459000 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 0.8255247 0.9376857 1.138403 1.160573 1.342507 1.585396 0
## NN 1.1674381 1.3326443 1.391468 1.423152 1.506624 1.828782 0
## MARS 0.8504532 1.0351167 1.136094 1.177096 1.375990 1.489143 0
## KNN 0.9622690 1.1060749 1.254626 1.321469 1.454005 1.865675 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 0.3900108 0.6122117 0.6965212 0.6447713 0.7204971 0.8315143 0
## NN 0.2889553 0.3916717 0.5078031 0.4787012 0.5536956 0.6172612 0
## MARS 0.3538662 0.5466136 0.6536517 0.6418118 0.7874507 0.8276661 0
## KNN 0.3917981 0.5519785 0.5846776 0.5759732 0.6405615 0.7384615 0
# Evaluate on test set
pred_svm <- predict(svmFit, testX); svm_rmse <- RMSE(pred_svm, testY)
pred_nn <- predict(nnFit, testX); nn_rmse <- RMSE(pred_nn, testY)
pred_mars <- predict(marsFit, testX); mars_rmse <- RMSE(pred_mars, testY)
pred_knn <- predict(knnFit, testX); knn_rmse <- RMSE(pred_knn, testY)
data.frame(
Model = c("SVM", "NeuralNetwork", "MARS", "KNN"),
Test_RMSE = c(svm_rmse, nn_rmse, mars_rmse, knn_rmse)
)
## Model Test_RMSE
## 1 SVM 1.133661
## 2 NeuralNetwork 1.702510
## 3 MARS 1.128461
## 4 KNN 1.195133
The SVM model achieved the best performance during resampling, with the lowest mean RMSE (1.16) and the highest mean R² (0.64). On the test set, MARS slightly outperformed SVM (RMSE 1.128 vs. 1.134), but the difference is minimal, suggesting both models capture the nonlinear structure of the data well.
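Side-by-side resampling distributions make this comparison easier to see; resamples objects support lattice plots directly:
# Box-and-whisker plot of cross-validated RMSE for the four models
bwplot(resamps, metric = "RMSE")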
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
# Get variable importance for SVM
svm_varimp <- varImp(svmFit, scale = TRUE)
svm_varimp
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess13 100.00
## ManufacturingProcess32 96.24
## BiologicalMaterial06 94.88
## BiologicalMaterial03 85.18
## BiologicalMaterial12 82.77
## ManufacturingProcess17 82.16
## ManufacturingProcess36 78.96
## ManufacturingProcess09 78.20
## ManufacturingProcess31 72.24
## BiologicalMaterial02 72.12
## ManufacturingProcess06 64.75
## BiologicalMaterial04 58.82
## BiologicalMaterial11 55.77
## BiologicalMaterial08 53.15
## ManufacturingProcess33 47.45
## ManufacturingProcess11 44.38
## BiologicalMaterial01 43.11
## ManufacturingProcess30 39.27
## BiologicalMaterial09 35.36
## ManufacturingProcess18 32.08
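For a visual counterpart to this table, and to mirror the PLS plot below, we can plot the SVM's ten most important predictors:
# Top 10 predictors for the SVM model
plot(svm_varimp, top = 10, main = "SVM Top 10 Predictors")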
# PLS model
plsFit <- train(
x = trainX,
y = trainY,
method = "pls",
preProc = c("center", "scale"),
tuneLength = 20,
trControl = ctrl
)
# Top 10 predictors
vip <- varImp(plsFit)
plot(vip, top = 10, main = "PLS Top 10 Predictors")
The top predictors in the optimal nonlinear regression model (SVM) are ManufacturingProcess13, ManufacturingProcess32, and BiologicalMaterial06. Among the top ten, six are process variables and four are biological variables, so neither type clearly dominates the list. The linear (PLS) model shows a similar mix: six manufacturing process variables and four biological variables. This indicates that both the linear and nonlinear models are driven by a combination of biological and process factors.
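To pin down which top-ten predictors are unique to the nonlinear model, we can compare the two rankings programmatically (a short sketch using the svm_varimp and vip objects created above):
# Top 10 predictor names from each importance object
svm_top10 <- rownames(svm_varimp$importance)[order(-svm_varimp$importance$Overall)][1:10]
pls_top10 <- rownames(vip$importance)[order(-vip$importance$Overall)][1:10]
# Predictors in the SVM top 10 that are absent from the PLS top 10
setdiff(svm_top10, pls_top10)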
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
The top predictors that are unique to the optimal nonlinear regression model are BiologicalMaterial12 and ManufacturingProcess31. To see the relationship between these two predictors and the yield, we can plot them.
# BiologicalMaterial12 vs Yield
ggplot(ChemicalManufacturingProcess, aes(x = BiologicalMaterial12, y = Yield)) +
geom_point() +
geom_smooth(method = "loess", color = "blue") +
labs(title = "Relationship between BiologicalMaterial12 and Yield",
x = "BiologicalMaterial12", y = "Yield")
The plot for BiologicalMaterial12 shows a nonlinear relationship with Yield. The yield increases as BiologicalMaterial12 rises toward the 19-21 range, peaks around 21, and then steadily decreases. This 19-21 window is the optimal range; values outside of it reduce the yield.
# ManufacturingProcess31 vs Yield
ggplot(ChemicalManufacturingProcess, aes(x = ManufacturingProcess31, y = Yield)) +
geom_point() +
geom_smooth(method = "loess", color = "blue") +
labs(title = "Relationship between ManufacturingProcess31 and Yield",
x = "ManufacturingProcess31", y = "Yield")
The plot for ManufacturingProcess31 has a couple of outliers that skew the plot, so let's filter them out using the IQR rule.
x <- ChemicalManufacturingProcess$ManufacturingProcess31
# Identify non-outlier rows using the 1.5 * IQR rule
q1 <- quantile(x, 0.25, na.rm = TRUE)
q3 <- quantile(x, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
# which() drops NA comparisons, so rows with missing values are also excluded
filtered <- ChemicalManufacturingProcess[which(x >= lower & x <= upper), ]
# Plot
ggplot(filtered, aes(x = ManufacturingProcess31, y = Yield)) +
geom_point() +
geom_smooth(method = "loess", color = "blue") +
labs(title = "Relationship between ManufacturingProcess31 and Yield")
After removing the outliers, we can see that most of the values fall within 68-72. As ManufacturingProcess31 increases from 68 to 70, the yield remains relatively flat. Once it rises past 70, however, the yield drops from around 42 to 40 and then levels off. Based on this plot, the yield is highest when ManufacturingProcess31 stays below 70.
These plots highlight the nonlinear relationships between individual biological or process predictors and the yield. Some predictors have a range in which they produce the optimal yield: as shown for ManufacturingProcess31 and BiologicalMaterial12, once these predictors move past a certain threshold, the yield drops off.