Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to generate data: y = 10 sin(π x1 x2) + 20 (x3 − 0.5)² + 10 x4 + 5 x5 + N(0, σ²), where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five additional non-informative predictors, x6 through x10).
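To make the data-generating process concrete, here is a minimal hand-rolled sketch of the same simulation; friedman1_sim is a hypothetical helper, not the actual internals of mlbench:
# Minimal sketch of the Friedman (1991) generator (hypothetical helper)
friedman1_sim <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10)   # 10 uniform predictors on [0, 1]
  y <- 10 * sin(pi * x[, 1] * x[, 2]) +   # only x1-x5 carry signal
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5] +
    rnorm(n, sd = sd)                     # Gaussian noise N(0, sd^2)
  list(x = x, y = y)                      # same shape as mlbench.friedman1
}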
The package mlbench contains a function called mlbench.friedman1 that simulates these data.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## mlbench.friedman1 creates a list with a vector 'y' and a matrix
## of predictors 'x'. We convert the 'x' data from a matrix to a
## data frame; one reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## Also simulate a large test set to estimate the true error
## rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
#install.packages("earth")
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsModel <- earth(x = trainingData$x, y = trainingData$y)
summary(marsModel) # Check for predictor importance
## Call: earth(x=trainingData$x, y=trainingData$y)
##
## coefficients
## (Intercept) 18.451984
## h(0.621722-X1) -11.074396
## h(0.601063-X2) -10.744225
## h(X3-0.281766) 20.607853
## h(0.447442-X3) 17.880232
## h(X3-0.447442) -23.282007
## h(X3-0.636458) 15.150350
## h(0.734892-X4) -10.027487
## h(X4-0.734892) 9.092045
## h(0.850094-X5) -4.723407
## h(X5-0.850094) 10.832932
## h(X6-0.361791) -1.956821
##
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
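In the summary above, h(.) denotes the MARS hinge function; a minimal definition for reference:
h <- function(x) pmax(0, x)  # hinge: h(x) = max(0, x), elementwise
# e.g. the basis term h(0.621722 - X1) is nonzero only when X1 < 0.621722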
# Understanding the influence of the predictors
plotmo(marsModel)
## plotmo grid: X1 X2 X3 X4 X5 X6 X7
## 0.5139349 0.5106664 0.537307 0.4445841 0.5343299 0.4975981 0.4688035
## X8 X9 X10
## 0.497961 0.5288716 0.5359218
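earth also ships evimp(), which ranks the predictors used by the model; a quick check (output not shown):
evimp(marsModel)  # importance by the nsubsets, GCV, and RSS criteria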
library(magrittr)
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.631337 0.4736866 2.952051
## 7 3.459177 0.5226908 2.803521
## 9 3.345159 0.5615803 2.708064
## 11 3.266588 0.5936149 2.647474
## 13 3.241379 0.6129913 2.612002
## 15 3.231158 0.6289031 2.604464
## 17 3.247220 0.6348270 2.614427
## 19 3.264636 0.6423761 2.639587
## 21 3.271335 0.6510671 2.650748
## 23 3.282586 0.6575603 2.664574
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
# Testing the kNN model on the large test set
predict(knnModel, testData$x) %>%
postResample(pred = ., obs = testData$y)
## RMSE Rsquared MAE
## 3.1750657 0.6785946 2.5443169
# Training a linear regression model
lmModel <- train(x = trainingData$x,
                 y = trainingData$y,
                 method = "lm",
                 preProc = c("center", "scale"),
                 tuneLength = 10)
lmModel
## Linear Regression
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.466242 0.7610647 1.955361
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
While k-nearest neighbors (KNN) predicts from the proximity of points in feature space, treating every predictor as equally important (so the five non-informative variables dilute its distance metric), linear regression achieves better results here because it explicitly models the relationship between each predictor and the response, effectively down-weighting the uninformative ones.
predict(lmModel, testData$x) %>%
postResample(pred = ., obs = testData$y)
## RMSE Rsquared MAE
## 2.6970680 0.7084666 2.0600540
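A convenient way to compare the tuned models is caret's resamples(); a sketch (output not shown; strictly, the models should share resampling indices via trainControl(index = ...) for the comparison to be exact):
# Compare the bootstrap resampling distributions side by side
comp <- resamples(list(kNN = knnModel, LM = lmModel))
summary(comp)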
svmrModel <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 14)
svmrModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.591797 0.7652097 2.040022
## 0.50 2.383115 0.7830132 1.858793
## 1.00 2.243293 0.8018514 1.738839
## 2.00 2.131571 0.8182703 1.655132
## 4.00 2.093475 0.8232072 1.626784
## 8.00 2.061948 0.8277847 1.602788
## 16.00 2.051696 0.8294611 1.594581
## 32.00 2.051355 0.8295187 1.594280
## 64.00 2.051355 0.8295187 1.594280
## 128.00 2.051355 0.8295187 1.594280
## 256.00 2.051355 0.8295187 1.594280
## 512.00 2.051355 0.8295187 1.594280
## 1024.00 2.051355 0.8295187 1.594280
## 2048.00 2.051355 0.8295187 1.594280
##
## Tuning parameter 'sigma' was held constant at a value of 0.05732269
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.05732269 and C = 32.
# Testing the radial SVM model
predict(svmrModel, testData$x) %>%
postResample(pred = ., obs = testData$y)
## RMSE Rsquared MAE
## 2.0617418 0.8276253 1.5668772
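Since the earlier MARS fit is still available, its test set performance can be checked the same way (output not shown):
predict(marsModel, testData$x) %>%
  postResample(pred = ., obs = testData$y)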
Exercise 7.5: Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models. (a) Which nonlinear regression model gives the optimal resampling and test set performance? (b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model? (c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
Exercise 6.3 (for reference): A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials as predictors, measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.
# Steps used to prepare the data in Exercise 6.3
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
preProcess(ChemicalManufacturingProcess,
           method = c("knnImpute", "BoxCox", "center", "scale")) |>
  predict(ChemicalManufacturingProcess) -> cmp
part <- createDataPartition(cmp$Yield, p = 0.75, list = FALSE)
cmp_train <- cmp[part,]
cmp_test <- cmp[-part,]
dim(cmp_train)
## [1] 132 58
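Note that preProcess() above was fit on the full data set before splitting, so the test rows influence the imputation and scaling. A leakage-free sketch, assuming the partition `part` is created first, would fit the recipe on the training rows only:
# Fit the preprocessing recipe on the training partition only,
# then apply it to both partitions (avoids test-set leakage)
pp <- preProcess(ChemicalManufacturingProcess[part, ],
                 method = c("knnImpute", "BoxCox", "center", "scale"))
cmp_train <- predict(pp, ChemicalManufacturingProcess[part, ])
cmp_test <- predict(pp, ChemicalManufacturingProcess[-part, ])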
# Training a kNN model
knnModel <- train(x = cmp_train[,-1],
                  y = cmp_train$Yield,
                  method = "knn",
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.7966925 0.3596492 0.6357015
## 7 0.7736043 0.3885559 0.6232738
## 9 0.7642366 0.4054645 0.6148636
## 11 0.7627292 0.4088394 0.6168555
## 13 0.7688236 0.3989811 0.6261143
## 15 0.7682704 0.4024564 0.6265736
## 17 0.7713378 0.4012508 0.6259949
## 19 0.7757745 0.3951154 0.6282825
## 21 0.7831882 0.3855136 0.6332646
## 23 0.7891889 0.3787429 0.6398937
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
# Testing the kNN model on the held-out test set
predict(knnModel, cmp_test[,-1]) %>%
postResample(pred = ., obs = cmp_test$Yield)
## RMSE Rsquared MAE
## 0.7691319 0.5412774 0.6394610
# Training a radial SVM model
svmrModel <- train(x = cmp_train[,-1],
                   y = cmp_train$Yield,
                   method = "svmRadial",
                   tuneLength = 14)
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
## (identical warning repeated for every resample; duplicates collapsed)
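The warnings above arise because some bootstrap resamples contain a constant predictor column, which kernlab cannot scale. One way to avoid them, as a sketch, is to drop near-zero-variance predictors before training with caret's nearZeroVar():
# Drop near-zero-variance predictors, if any, before training
nzv <- nearZeroVar(cmp_train[, -1])
if (length(nzv) > 0) cmp_train_x <- cmp_train[, -1][, -nzv]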
svmrModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7310350 0.4774658 0.6020568
## 0.50 0.6894586 0.5148293 0.5621725
## 1.00 0.6616257 0.5452638 0.5324502
## 2.00 0.6499563 0.5532766 0.5203021
## 4.00 0.6533609 0.5485653 0.5227168
## 8.00 0.6528786 0.5499635 0.5231998
## 16.00 0.6523688 0.5506442 0.5226165
## 32.00 0.6523688 0.5506442 0.5226165
## 64.00 0.6523688 0.5506442 0.5226165
## 128.00 0.6523688 0.5506442 0.5226165
## 256.00 0.6523688 0.5506442 0.5226165
## 512.00 0.6523688 0.5506442 0.5226165
## 1024.00 0.6523688 0.5506442 0.5226165
## 2048.00 0.6523688 0.5506442 0.5226165
##
## Tuning parameter 'sigma' was held constant at a value of 0.01130335
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01130335 and C = 2.
predict(svmrModel, cmp_test[,-1]) %>%
postResample(pred = ., obs = cmp_test$Yield)
## RMSE Rsquared MAE
## 0.6755068 0.5925273 0.5773412
The radial SVM produced better results than kNN (test RMSE 0.676 vs. 0.769). Following this, a linear support vector machine was trained. Its RMSE was substantially higher than that of both the radial SVM and kNN, providing evidence that the relationship between the predictors and yield is likely nonlinear, rendering a linear kernel suboptimal.
svmModel <- train(x = cmp_train[,-1],
                  y = cmp_train$Yield,
                  method = "svmLinear",
                  preProc = c("center", "scale"),
                  tuneLength = 14)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
## (this pair of warnings repeats across resamples; duplicates collapsed)
svmModel
## Support Vector Machines with Linear Kernel
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3.467424 0.1940873 1.169497
##
## Tuning parameter 'C' was held constant at a value of 1
# Testing the linear SVM model
predict(svmModel, cmp_test[,-1]) %>%
postResample(pred = ., obs = cmp_test$Yield)
## RMSE Rsquared MAE
## 2.5636119 0.1260975 0.9884771
Next, a MARS model is trained over a grid of interaction degrees and pruning sizes.
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:14)
marsModel <- train(x = cmp_train[,-1],
                   y = cmp_train$Yield,
                   method = "earth",
                   tuneGrid = marsGrid)
marsModel
## Multivariate Adaptive Regression Spline
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.8138017 0.3531127 0.6463840
## 1 3 0.6704387 0.5553252 0.5448596
## 1 4 0.6916390 0.5356241 0.5519542
## 1 5 0.7315700 0.5115560 0.5723532
## 1 6 0.7388899 0.5070988 0.5741744
## 1 7 0.7835975 0.4719523 0.6003233
## 1 8 0.8039370 0.4620699 0.6090559
## 1 9 0.8746786 0.4495768 0.6168025
## 1 10 0.8674741 0.4373438 0.6208033
## 1 11 0.8791291 0.4284318 0.6308472
## 1 12 0.8948957 0.4293195 0.6356232
## 1 13 0.8965397 0.4380694 0.6358690
## 1 14 1.1667197 0.4061846 0.6925281
## 2 2 0.8412510 0.3109642 0.6719476
## 2 3 0.7173671 0.4971387 0.5748290
## 2 4 0.7483763 0.4717177 0.5868740
## 2 5 0.7493999 0.4711173 0.5866434
## 2 6 0.7871970 0.4490325 0.6136162
## 2 7 1.0303484 0.4328551 0.6496558
## 2 8 0.8025959 0.4516194 0.6188355
## 2 9 0.8036918 0.4616256 0.6175595
## 2 10 0.8308936 0.4462361 0.6307570
## 2 11 1.0493039 0.4302409 0.6735992
## 2 12 1.1103994 0.4115791 0.6901853
## 2 13 1.1371685 0.3987766 0.7074884
## 2 14 0.9352758 0.4124943 0.6781520
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 1.
# Test set performance of the MARS model
predict(marsModel, cmp_test[,-1]) %>%
postResample(pred = ., obs = cmp_test$Yield)
## RMSE Rsquared MAE
## 0.6192644 0.6547952 0.5041792
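Since MARS posts the best test RMSE so far, its variable importance is worth inspecting as well; a sketch (output not shown):
varImp(marsModel)             # caret wraps earth's importance measures
plot(varImp(marsModel), top = 20)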
# Training a Neural Network Model
nnetModel <- train(x = cmp_train[,-1],
                   y = cmp_train$Yield,
                   method = "nnet",
                   trace = FALSE,
                   linout = TRUE)
nnetModel
## Neural Network
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0e+00 1.0125380 0.2507108 0.7998582
## 1 1e-04 1.0036946 0.2562198 0.7984913
## 1 1e-01 0.9582983 0.3286442 0.7520286
## 3 0e+00 1.1046055 0.2654114 0.8763299
## 3 1e-04 1.0092634 0.3062618 0.8105773
## 3 1e-01 0.8516790 0.4190122 0.6789180
## 5 0e+00 1.0022394 0.3250860 0.8010552
## 5 1e-04 0.9625833 0.3579471 0.7608243
## 5 1e-01 0.7835509 0.4776804 0.6231332
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5 and decay = 0.1.
# The neural network's test RMSE and R-squared fall mid-pack among the models tried.
predict(nnetModel, cmp_test[,-1]) %>%
postResample(pred = ., obs = cmp_test$Yield)
## RMSE Rsquared MAE
## 0.8921480 0.5300504 0.6703499
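To answer part (a) systematically, the resampling distributions of all five models can be compared in one place; a sketch (output not shown; an exact comparison again assumes shared resampling indices):
comp <- resamples(list(kNN = knnModel, SVMradial = svmrModel,
                       SVMlinear = svmModel, MARS = marsModel,
                       NNet = nnetModel))
summary(comp)
bwplot(comp, metric = "RMSE")  # lattice box-and-whisker plot of RMSE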
Turning to the questions posed in the exercise:
(a) Among the nonlinear models, the radial SVM achieved the lowest resampling RMSE (0.650), while MARS achieved the lowest test set RMSE (0.619), with the radial SVM close behind (0.676). The linear PLS model from Exercise 6.3 (reported test RMSE 0.634) remained competitive, which suggests that substantial linear structure still drives the predictions.
(b) The analysis revealed that the two most important predictors are manufacturing process variables, and manufacturing process variables outnumber biological ones in the top ten. Notably, the PLS model from Exercise 6.3 identified its top six predictors as exclusively manufacturing processes, followed by a marked drop in importance where the biological materials began to appear.
With the exception of BiologicalMaterial12 entering the top ten in place of BiologicalMaterial08, the list of important predictors remained consistent across the two models, although the ordering differed. This illustrates both the overlap and the differences between linear and nonlinear approaches to ranking feature importance.
# Variable importance for the radial SVM model
plot(varImp(svmrModel), top = 20)
ggplot(data = ChemicalManufacturingProcess,
       aes(y = Yield, x = BiologicalMaterial12)) +
  geom_point() +
  labs(title = "Yield vs. Biological Material 12",
       y = "Yield (%)",
       x = "Biological Material 12 (Units)",
       caption = "Source: Chemical Manufacturing Process Data") +
  theme_minimal()
(c) Visualizing BiologicalMaterial12, the predictor unique to the nonlinear model's top ten, makes it clear why it carried more weight there than in the linear model. Although a slight positive linear trend is visible, the relationship looks closer to parabolic: yield rises as BiologicalMaterial12 increases from about 18 to 21 units, then tends to fall beyond that point. To maximize yield, holding BiologicalMaterial12 near 21 units therefore appears advisable.
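A smoother makes the curvature easier to judge; a sketch using ggplot2's loess fit:
ggplot(ChemicalManufacturingProcess,
       aes(x = BiologicalMaterial12, y = Yield)) +
  geom_point() +
  geom_smooth(method = "loess", se = TRUE) +  # local fit highlights the bend
  theme_minimal()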