Do problems 7.2 and 7.5 in Kuhn and Johnson. There are only two but they have many parts. Please submit both a link to your Rpubs and the .rmd file.
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: \(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\) where the x values are random variables uniformly distributed between 0 and 1. The package mlbench contains a function called mlbench.friedman1 that simulates these data:
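For intuition, here is a minimal hand-rolled sketch of the same simulation, written directly from the formula above (mlbench.friedman1 packages this up, so the block is purely illustrative):
# Illustrative re-implementation of the Friedman 1 simulation
simFriedman1 <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10)  # 10 uniform(0, 1) predictors; only x1-x5 enter the formula
  y <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5] +
    rnorm(n, sd = sd)  # additive N(0, sd^2) noise
  list(x = x, y = y)
}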
library(caret)
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
library(caret)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
Since a KNN model has already been fit above, we will train neural network, MARS, and SVM models to see which gives the best performance.
# Neural network model
library(nnet)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
trainX <- trainingData$x
trainY <- trainingData$y
testX <- testData$x
testY <- testData$y
# Center and scale the data
preProc <- preProcess(trainX, method = c("center", "scale"))
trainXtrans <- predict(preProc, trainX)
testXtrans <- predict(preProc, testX)
# Remove highly correlated predictors
tooHigh <- findCorrelation(cor(trainXtrans), cutoff = 0.75)
if(length(tooHigh) > 0) {
trainXnnet <- trainXtrans[, -tooHigh, drop = FALSE]
testXnnet <- testXtrans[, -tooHigh, drop = FALSE]
} else {
trainXnnet <- trainXtrans
testXnnet <- testXtrans
}
# Use a 10 fold Cross validation to tune the optimal parameters
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
verboseIter = FALSE)
# Create several sizes and decays
nnetGrid <- expand.grid(.size = 1:10,
.decay = c(0, 0.01, 0.1),
.bag = FALSE)
p <- ncol(trainXnnet)
maxwts_needed <- max(nnetGrid$.size) * (p + 2) + 1
set.seed(100)
nnetTune <- train(x = trainXnnet,
y = trainY,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = NULL, # already preprocessed
linout = TRUE, # regression output, not classification
trace = FALSE,
MaxNWts = maxwts_needed + 50, # add margin
maxit = 500,
repeats = 5) # average 5 networks per tuning fit
print(nnetTune)
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0.00 2.728604 0.7203091 2.152663
## 1 0.01 2.719536 0.7214992 2.124635
## 1 0.10 2.699292 0.7244461 2.094968
## 2 0.00 2.645576 0.7345835 2.088920
## 2 0.01 2.686169 0.7253877 2.113754
## 2 0.10 2.674174 0.7280434 2.093252
## 3 0.00 2.242475 0.8052934 1.750041
## 3 0.01 2.334757 0.7887846 1.816012
## 3 0.10 2.387936 0.7836009 1.875259
## 4 0.00 2.142435 0.8263282 1.700758
## 4 0.01 2.197012 0.8162037 1.710410
## 4 0.10 2.271115 0.8053792 1.801021
## 5 0.00 2.402633 0.7882899 1.856423
## 5 0.01 2.302976 0.8040184 1.824209
## 5 0.10 2.258198 0.8081794 1.789014
## 6 0.00 2.979536 0.7160739 2.181607
## 6 0.01 2.499688 0.7670440 1.955686
## 6 0.10 2.448278 0.7796720 1.940492
## 7 0.00 4.355370 0.5707408 2.847004
## 7 0.01 2.545489 0.7635256 2.008557
## 7 0.10 2.444450 0.7786116 1.931350
## 8 0.00 5.739978 0.4845702 3.482587
## 8 0.01 2.659024 0.7436265 2.089078
## 8 0.10 2.535702 0.7613009 1.976352
## 9 0.00 4.315142 0.5661795 2.863657
## 9 0.01 2.725524 0.7239232 2.127311
## 9 0.10 2.496450 0.7647860 1.961595
## 10 0.00 4.136404 0.6067477 2.809888
## 10 0.01 2.792631 0.7116502 2.197621
## 10 0.10 2.469993 0.7678994 1.966281
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0 and bag = FALSE.
# Predict on test set
nnetPred <- predict(nnetTune, newdata = testXnnet)
postResample(pred = nnetPred, obs = testY)
## RMSE Rsquared MAE
## 1.9551301 0.8464735 1.5121497
Cross-validation selected a neural network with size = 4 and decay = 0, as it produced the lowest resampled RMSE. When this model was evaluated on the test set, the RMSE was 1.96 and the R^2 was 0.85. In comparison, the KNN model had a test RMSE of 3.20 and an R^2 of 0.68. This indicates that the neural network outperformed the KNN model, producing more accurate predictions and explaining more of the variance.
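To visualize the tuning results, caret's plot method for train objects shows the resampled RMSE profile across the size and decay grid:
# Resampled RMSE over the size/decay grid
plot(nnetTune)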
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
trainX <- trainingData$x
trainY <- trainingData$y
testX <- testData$x
testY <- testData$y
library(earth)
# Fit MARS model
marsFit <- earth(trainX, trainY)
print(marsFit)
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
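As a quick check of which predictors the initial earth fit actually uses, earth's evimp() function reports variable importance under several criteria:
# Variable importance for the untuned earth fit (nsubsets, GCV, and RSS criteria)
evimp(marsFit)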
# Define the candidate hyperparameters for tuning
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Train MARS with cross-validation
set.seed(100)
marsTuned <- train(
x = trainX,
y = trainY,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv")
)
# Results
print(marsTuned)
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.327937 0.2544880 3.600474
## 1 3 3.572450 0.4912720 2.895811
## 1 4 2.596841 0.7183600 2.106341
## 1 5 2.370161 0.7659777 1.918669
## 1 6 2.276141 0.7881481 1.810001
## 1 7 1.766728 0.8751831 1.390215
## 1 8 1.780946 0.8723243 1.401345
## 1 9 1.665091 0.8819775 1.325515
## 1 10 1.663804 0.8821283 1.327657
## 1 11 1.657738 0.8822967 1.331730
## 1 12 1.653784 0.8827903 1.331504
## 1 13 1.648496 0.8823663 1.316407
## 1 14 1.639073 0.8841742 1.312833
## 1 15 1.639073 0.8841742 1.312833
## 1 16 1.639073 0.8841742 1.312833
## 1 17 1.639073 0.8841742 1.312833
## 1 18 1.639073 0.8841742 1.312833
## 1 19 1.639073 0.8841742 1.312833
## 1 20 1.639073 0.8841742 1.312833
## 1 21 1.639073 0.8841742 1.312833
## 1 22 1.639073 0.8841742 1.312833
## 1 23 1.639073 0.8841742 1.312833
## 1 24 1.639073 0.8841742 1.312833
## 1 25 1.639073 0.8841742 1.312833
## 1 26 1.639073 0.8841742 1.312833
## 1 27 1.639073 0.8841742 1.312833
## 1 28 1.639073 0.8841742 1.312833
## 1 29 1.639073 0.8841742 1.312833
## 1 30 1.639073 0.8841742 1.312833
## 1 31 1.639073 0.8841742 1.312833
## 1 32 1.639073 0.8841742 1.312833
## 1 33 1.639073 0.8841742 1.312833
## 1 34 1.639073 0.8841742 1.312833
## 1 35 1.639073 0.8841742 1.312833
## 1 36 1.639073 0.8841742 1.312833
## 1 37 1.639073 0.8841742 1.312833
## 1 38 1.639073 0.8841742 1.312833
## 2 2 4.327937 0.2544880 3.600474
## 2 3 3.572450 0.4912720 2.895811
## 2 4 2.661826 0.7070510 2.173471
## 2 5 2.404015 0.7578971 1.975387
## 2 6 2.243927 0.7914805 1.783072
## 2 7 1.856336 0.8605482 1.435682
## 2 8 1.754607 0.8763186 1.396841
## 2 9 1.603578 0.8938666 1.261361
## 2 10 1.492421 0.9084998 1.168700
## 2 11 1.317350 0.9292504 1.033926
## 2 12 1.304327 0.9320133 1.019108
## 2 13 1.277510 0.9323681 1.002927
## 2 14 1.269626 0.9350024 1.003346
## 2 15 1.266217 0.9359400 1.013893
## 2 16 1.268470 0.9354868 1.011414
## 2 17 1.268470 0.9354868 1.011414
## 2 18 1.268470 0.9354868 1.011414
## 2 19 1.268470 0.9354868 1.011414
## 2 20 1.268470 0.9354868 1.011414
## 2 21 1.268470 0.9354868 1.011414
## 2 22 1.268470 0.9354868 1.011414
## 2 23 1.268470 0.9354868 1.011414
## 2 24 1.268470 0.9354868 1.011414
## 2 25 1.268470 0.9354868 1.011414
## 2 26 1.268470 0.9354868 1.011414
## 2 27 1.268470 0.9354868 1.011414
## 2 28 1.268470 0.9354868 1.011414
## 2 29 1.268470 0.9354868 1.011414
## 2 30 1.268470 0.9354868 1.011414
## 2 31 1.268470 0.9354868 1.011414
## 2 32 1.268470 0.9354868 1.011414
## 2 33 1.268470 0.9354868 1.011414
## 2 34 1.268470 0.9354868 1.011414
## 2 35 1.268470 0.9354868 1.011414
## 2 36 1.268470 0.9354868 1.011414
## 2 37 1.268470 0.9354868 1.011414
## 2 38 1.268470 0.9354868 1.011414
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
# Variable importance
marsVarImp <- varImp(marsTuned)
print(marsVarImp)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.24
## X2 48.73
## X5 15.52
## X3 0.00
# Evaluate on test set
marsPred <- predict(marsTuned, newdata = testX)
rmse <- sqrt(mean((marsPred - testY)^2))
rsq <- cor(marsPred, testY)^2
cat("Test RMSE:", rmse, "\n")
## Test RMSE: 1.158995
cat("Test R^2:", rsq, "\n")
## Test R^2: 0.9460418
Cross-validation selected a MARS model with degree = 2 and nprune = 15, which produced a test RMSE of 1.16 and an R^2 of 0.95. These results are an improvement over both the KNN and neural network models. Variable importance indicated that the model relied primarily on predictors X1, X4, and X2, while X5 contributed minimally and X3 ranked lowest.
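To see the selected model in concrete terms, we can print the hinge functions and coefficients retained by the final tuned fit:
# Basis functions and coefficients of the tuned MARS model
summary(marsTuned$finalModel)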
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
trainX <- trainingData$x
trainY <- trainingData$y
testX <- testData$x
testY <- testData$y
set.seed(100)
svmRTuned <- train(
x = trainX,
y = trainY,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv")
)
# Results
print(svmRTuned)
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.530787 0.7922715 2.013175
## 0.50 2.259539 0.8064569 1.789962
## 1.00 2.099789 0.8274242 1.656154
## 2.00 2.002943 0.8412934 1.583791
## 4.00 1.943618 0.8504425 1.546586
## 8.00 1.918711 0.8547582 1.532981
## 16.00 1.920651 0.8536189 1.536116
## 32.00 1.920651 0.8536189 1.536116
## 64.00 1.920651 0.8536189 1.536116
## 128.00 1.920651 0.8536189 1.536116
## 256.00 1.920651 0.8536189 1.536116
## 512.00 1.920651 0.8536189 1.536116
## 1024.00 1.920651 0.8536189 1.536116
## 2048.00 1.920651 0.8536189 1.536116
##
## Tuning parameter 'sigma' was held constant at a value of 0.06509124
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06509124 and C = 8.
print(svmRTuned$finalModel)
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 8
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0650912392130387
##
## Number of Support Vectors : 152
##
## Objective Function Value : -69.5828
## Training error : 0.009013
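Plotting the tuning profile with cost on a log scale shows the RMSE flattening once C reaches 8:
# Resampled RMSE versus cost (log-2 x-axis)
plot(svmRTuned, scales = list(x = list(log = 2)))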
# Variable Importance
svmVarImp <- varImp(svmRTuned)
print(svmVarImp)
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
# Evaluate on test set
svmPred <- predict(svmRTuned, newdata = testX)
testRMSE <- sqrt(mean((svmPred - testY)^2))
testR2 <- cor(svmPred, testY)^2
testMAE <- mean(abs(svmPred - testY))
cat("Test RMSE:", round(testRMSE, 4), "\n")
## Test RMSE: 2.0632
cat("Test R^2:", round(testR2, 6), "\n")
## Test R^2: 0.827574
cat("Test MAE:", round(testMAE, 4), "\n")
## Test MAE: 1.5662
The SVM model with a radial basis function kernel selected optimal tuning parameters of C = 8 and sigma = 0.065, which resulted in a test RMSE of 2.06 and an R^2 of 0.83. These results are similar to those of the neural network, indicating that both models captured the nonlinear patterns in the data moderately well. However, the SVM did not outperform the MARS model, which achieved a lower RMSE and a higher R^2. Variable importance for the SVM indicated that X4, X1, and X2 were the most influential predictors, while X6 and X7 contributed little to nothing.
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)? The MARS model performed best, with the lowest RMSE and the highest R^2 compared to the neural network, SVM, and KNN models. MARS did select the informative predictors X1–X5, ranking X1, X4, and X2 as the most important and leaving the noise predictors X6–X10 essentially unused.
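Collecting the test-set results reported above into a single table makes the ranking explicit (the values are the rounded metrics from the earlier chunks):
# Test-set metrics as reported above, rounded to two decimals
data.frame(
  Model = c("KNN", "NeuralNetwork", "MARS", "SVM"),
  Test_RMSE = c(3.20, 1.96, 1.16, 2.06),
  Test_R2 = c(0.68, 0.85, 0.95, 0.83)
)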
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
From exercise 6.3:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Separating the target (Yield) and predictor variables
yield <- ChemicalManufacturingProcess$Yield
processPredictors <- ChemicalManufacturingProcess[, -1]
# Impute missing values with the median
preProc <- preProcess(processPredictors, method = "medianImpute")
processPredictors_imputed <- predict(preProc, processPredictors)
set.seed(1234)
trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)
trainX <- processPredictors_imputed[trainIndex, ]
testX <- processPredictors_imputed[-trainIndex, ]
trainY <- yield[trainIndex]
testY <- yield[-trainIndex]
ctrl <- trainControl(method = "cv", number = 10)
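As a quick sanity check, counting missing values before and after imputation confirms that the median imputation filled every gap:
# Total NA count before vs. after median imputation
c(before = sum(is.na(processPredictors)),
  after = sum(is.na(processPredictors_imputed)))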
Which nonlinear regression model gives the optimal resampling and test set performance?
# SVM
svmFit <- train(
x = trainX,
y = trainY,
method = "svmRadial",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 10
)
# Neural Network
nnFit <- train(
x = trainX,
y = trainY,
method = "nnet",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 10,
linout = TRUE,
trace = FALSE,
MaxNWts = 2000 # raise nnet's default weight cap so the larger hidden sizes in the grid can be fit
)
# MARS
marsFit <- train(
x = trainX,
y = trainY,
method = "earth",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 10
)
# KNN
knnFit <- train(
x = trainX,
y = trainY,
method = "knn",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 10
)
# Compare CV performance
resamps <- resamples(list(SVM = svmFit, NN = nnFit, MARS = marsFit, KNN = knnFit))
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: SVM, NN, MARS, KNN
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 0.6177582 0.7550328 0.9581260 0.9288214 1.089252 1.180166 0
## NN 0.9301265 1.0429602 1.1003361 1.1392891 1.281513 1.309942 0
## MARS 0.6118668 0.8785885 0.9706923 0.9677945 1.105998 1.216419 0
## KNN 0.7694286 0.8888896 1.0219026 1.0581776 1.162969 1.459000 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 0.8255247 0.9376857 1.138403 1.160573 1.342507 1.585396 0
## NN 1.1674381 1.3326443 1.391468 1.423152 1.506624 1.828782 0
## MARS 0.8504532 1.0351167 1.136094 1.177096 1.375990 1.489143 0
## KNN 0.9622690 1.1060749 1.254626 1.321469 1.454005 1.865675 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 0.3900108 0.6122117 0.6965212 0.6447713 0.7204971 0.8315143 0
## NN 0.2889553 0.3916717 0.5078031 0.4787012 0.5536956 0.6172612 0
## MARS 0.3538662 0.5466136 0.6536517 0.6418118 0.7874507 0.8276661 0
## KNN 0.3917981 0.5519785 0.5846776 0.5759732 0.6405615 0.7384615 0
# Evaluate on test set
pred_svm <- predict(svmFit, testX); svm_rmse <- RMSE(pred_svm, testY)
pred_nn <- predict(nnFit, testX); nn_rmse <- RMSE(pred_nn, testY)
pred_mars <- predict(marsFit, testX); mars_rmse <- RMSE(pred_mars, testY)
pred_knn <- predict(knnFit, testX); knn_rmse <- RMSE(pred_knn, testY)
data.frame(
Model = c("SVM", "NeuralNetwork", "MARS", "KNN"),
Test_RMSE = c(svm_rmse, nn_rmse, mars_rmse, knn_rmse)
)
## Model Test_RMSE
## 1 SVM 1.133661
## 2 NeuralNetwork 1.702510
## 3 MARS 1.128461
## 4 KNN 1.195133
The SVM model achieved the best performance during resampling, with the lowest mean RMSE (1.16) and the highest mean R² (0.64). On the test set, MARS slightly outperformed SVM (RMSE 1.128 vs. 1.134), but the difference is minimal, suggesting both models capture the nonlinear structure of the data well.
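Side-by-side resampling distributions make this comparison easier to see; resamples objects support lattice plots directly:
# Box-and-whisker plot of cross-validated RMSE for the four models
bwplot(resamps, metric = "RMSE")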
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
# Get variable importance for SVM
svm_varimp <- varImp(svmFit, scale = TRUE)
svm_varimp
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess13 100.00
## ManufacturingProcess32 96.24
## BiologicalMaterial06 94.88
## BiologicalMaterial03 85.18
## BiologicalMaterial12 82.77
## ManufacturingProcess17 82.16
## ManufacturingProcess36 78.96
## ManufacturingProcess09 78.20
## ManufacturingProcess31 72.24
## BiologicalMaterial02 72.12
## ManufacturingProcess06 64.75
## BiologicalMaterial04 58.82
## BiologicalMaterial11 55.77
## BiologicalMaterial08 53.15
## ManufacturingProcess33 47.45
## ManufacturingProcess11 44.38
## BiologicalMaterial01 43.11
## ManufacturingProcess30 39.27
## BiologicalMaterial09 35.36
## ManufacturingProcess18 32.08
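For a visual counterpart to this table, and to mirror the PLS plot below, we can plot the SVM's ten most important predictors:
# Top 10 predictors for the SVM model
plot(svm_varimp, top = 10, main = "SVM Top 10 Predictors")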
# PLS model
plsFit <- train(
x = trainX,
y = trainY,
method = "pls",
preProc = c("center", "scale"),
tuneLength = 20,
trControl = ctrl
)
# Top 10 predictors
vip <- varImp(plsFit)
plot(vip, top = 10, main = "PLS Top 10 Predictors")
The top predictors in the optimal nonlinear regression model (SVM) are ManufacturingProcess13, ManufacturingProcess32, and BiologicalMaterial06. Among the top ten, six are process variables and four are biological variables, so neither type clearly dominates the list. The linear (PLS) model shows a similar mix: six manufacturing process variables and four biological variables. This indicates that both the linear and nonlinear models are driven by a combination of biological and process factors.
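To pin down which top-ten predictors are unique to the nonlinear model, we can compare the two rankings programmatically (a short sketch using the svm_varimp and vip objects created above):
# Top 10 predictor names from each importance object
svm_top10 <- rownames(svm_varimp$importance)[order(-svm_varimp$importance$Overall)][1:10]
pls_top10 <- rownames(vip$importance)[order(-vip$importance$Overall)][1:10]
# Predictors in the SVM top 10 that are absent from the PLS top 10
setdiff(svm_top10, pls_top10)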
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
The top predictors that are unique to the optimal nonlinear regression model are BiologicalMaterial12 and ManufacturingProcess31. To see the relationship between these two predictors and the yield, we can plot them.
# BiologicalMaterial12 vs Yield
ggplot(ChemicalManufacturingProcess, aes(x = BiologicalMaterial12, y = Yield)) +
geom_point() +
geom_smooth(method = "loess", color = "blue") +
labs(title = "Relationship between BiologicalMaterial12 and Yield",
x = "BiologicalMaterial12", y = "Yield")
The plot for BiologicalMaterial12 shows a nonlinear relationship with Yield. The yield increases as BiologicalMaterial12 rises toward the 19-21 range, peaks around 21, and then steadily decreases. This 19-21 window is the optimal range; values outside of it reduce the yield.
# ManufacturingProcess31 vs Yield
ggplot(ChemicalManufacturingProcess, aes(x = ManufacturingProcess31, y = Yield)) +
geom_point() +
geom_smooth(method = "loess", color = "blue") +
labs(title = "Relationship between ManufacturingProcess31 and Yield",
x = "ManufacturingProcess31", y = "Yield")
The plot for ManufacturingProcess31 has a couple of outliers that skew the plot, so let's filter them out using the IQR rule.
x <- ChemicalManufacturingProcess$ManufacturingProcess31
# Identify non-outlier rows using the 1.5 * IQR rule
q1 <- quantile(x, 0.25, na.rm = TRUE)
q3 <- quantile(x, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
# which() drops NA comparisons, so rows with missing values are also excluded
filtered <- ChemicalManufacturingProcess[which(x >= lower & x <= upper), ]
# Plot
ggplot(filtered, aes(x = ManufacturingProcess31, y = Yield)) +
geom_point() +
geom_smooth(method = "loess", color = "blue") +
labs(title = "Relationship between ManufacturingProcess31 and Yield")
After removing the outliers, we can see that most of the values fall within 68-72. As ManufacturingProcess31 increases from 68 to 70, the yield remains relatively flat. Once it rises past 70, however, the yield drops from around 42 to 40 and then levels off. Based on this plot, the yield is highest when ManufacturingProcess31 stays below 70.
These plots highlight the nonlinear relationships between individual biological or process predictors and the yield. Some predictors have a range in which they produce the optimal yield: as shown for ManufacturingProcess31 and BiologicalMaterial12, once these predictors move past a certain threshold, the yield drops off.