1 Exercise 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\[ y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \mathcal{N}(0, \sigma^2) \]

where the \(x\) values are random variables uniformly distributed on \([0, 1]\) (five other non-informative variables are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data. Which models appear to give the best performance? Does MARS select the informative predictors (those named x1-x5)?
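To make the generating equation concrete, here is a minimal base R sketch of the same simulation. The helper name `friedman1_sketch` is hypothetical and is not the mlbench implementation; it only assumes the equation as stated above, with ten independent Uniform(0, 1) predictors of which the first five enter the response.

```r
# Hypothetical helper: simulate data from the Friedman equation stated above.
friedman1_sketch <- function(n, sd = 1, p = 10) {
  stopifnot(p >= 5)
  # every predictor is Uniform(0, 1); only columns 1-5 enter the response
  x <- matrix(runif(n * p), nrow = n, ncol = p)
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  # additive Gaussian noise with standard deviation `sd`
  list(x = x, y = signal + rnorm(n, mean = 0, sd = sd))
}

set.seed(1)
sim <- friedman1_sketch(200, sd = 1)
dim(sim$x)
length(sim$y)

# without noise the response is bounded between 0 and 30
noiseless <- friedman1_sketch(50, sd = 0)
range(noiseless$y)
```

Columns 6 through 10 are generated but never used in `signal`, which is what makes them non-informative.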

1.1 Simulate the data

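# packages assumed for this document (the original setup chunk is not shown)
library(mlbench)
library(caret)
library(earth)
library(dplyr)
library(tibble)
library(knitr)
library(AppliedPredictiveModeling)

set.seed(523)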

# training data with 200 observations 
training_data <- mlbench.friedman1(200, sd = 1)
training_data$x <- as.data.frame(training_data$x)

# large test set for a stable estimate of test performance
test_data <- mlbench.friedman1(5000, sd = 1)
test_data$x <- as.data.frame(test_data$x)

dim(training_data$x)

## [1] 200  10

head(training_data$x)

head(training_data$y)

## [1] 13.79226 14.26570 12.75271 10.42632 17.89828 13.88635

1.2 Train the models

1.2.1 KNN

set.seed(711)

# KNN needs centering and scaling because distance matters
knn_model <- train(
  x = training_data$x,
  y = training_data$y,
  method = "knn",
  preProcess = c("center", "scale"),
  tuneLength = 10
)

# make predictions on the test set
knn_pred <- predict(knn_model, newdata = test_data$x)

# evaluate test performance
knn_perf <- postResample(pred = knn_pred, obs = test_data$y)
knn_perf

##     RMSE Rsquared      MAE 
## 3.067497 0.670777 2.447238
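The three metrics reported by postResample can be reproduced by hand. The sketch below uses a hypothetical helper, `metrics_sketch`, and toy vectors that are not taken from this analysis; it assumes that Rsquared is the squared correlation between predictions and observations.

```r
# Hypothetical helper mirroring the three metrics shown above.
metrics_sketch <- function(pred, obs) {
  resid <- obs - pred
  c(
    RMSE = sqrt(mean(resid^2)),   # root mean squared error
    Rsquared = cor(pred, obs)^2,  # squared correlation (assumed definition)
    MAE = mean(abs(resid))        # mean absolute error
  )
}

# toy values for illustration only
obs <- c(10, 12, 15, 11, 18)
pred <- c(11, 12, 14, 13, 17)
m <- metrics_sketch(pred, obs)
m
```

Because a squared correlation ignores any constant offset in the predictions, RMSE and MAE are the more direct measures of how far predictions fall from the observed values.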

1.3 Train MARS

set.seed(711)

# MARS models nonlinear effects with hinge functions; with tuneLength only
# nprune is varied, so degree stays at 1 and no interaction terms are fitted
mars_model <- train(
  x = training_data$x,
  y = training_data$y,
  method = "earth",
  tuneLength = 10
)

mars_model

## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE      Rsquared   MAE     
##    2      4.228762  0.3565304  3.457980
##    3      3.799064  0.4876387  3.036414
##    4      3.429216  0.5941799  2.716858
##    6      2.691678  0.7417994  2.131112
##    7      2.266327  0.8147079  1.765858
##    9      1.913156  0.8708335  1.505295
##   10      1.852916  0.8786086  1.449118
##   12      1.833128  0.8818429  1.412838
##   13      1.864663  0.8771211  1.432693
##   15      1.880327  0.8757602  1.453217
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 12 and degree = 1.

mars_pred <- predict(mars_model, newdata = test_data$x)
mars_perf <- postResample(pred = mars_pred, obs = test_data$y)
mars_perf

##      RMSE  Rsquared       MAE 
## 1.8549447 0.8657453 1.4339010

1.3.1 Radial SVM

set.seed(711)

# SVM also benefits from centering and scaling
svm_model <- train(
  x = training_data$x,
  y = training_data$y,
  method = "svmRadial",
  preProcess = c("center", "scale"),
  tuneLength = 8
)

svm_model

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   C      RMSE      Rsquared   MAE     
##    0.25  2.943605  0.7343084  2.336887
##    0.50  2.652884  0.7600243  2.121254
##    1.00  2.448094  0.7880937  1.955265
##    2.00  2.286777  0.8130825  1.817966
##    4.00  2.226374  0.8229231  1.776091
##    8.00  2.231532  0.8224507  1.786547
##   16.00  2.232409  0.8221225  1.785002
##   32.00  2.232409  0.8221225  1.785002
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06468623
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06468623 and C = 4.

svm_pred <- predict(svm_model, newdata = test_data$x)
svm_perf <- postResample(pred = svm_pred, obs = test_data$y)
svm_perf

##      RMSE  Rsquared       MAE 
## 2.0063671 0.8423032 1.5450528

1.4 Compare test performance

perf_compare <- data.frame(
  Model = c("KNN", "MARS", "SVM Radial"),
  RMSE = c(knn_perf["RMSE"], mars_perf["RMSE"], svm_perf["RMSE"]),
  Rsquared = c(knn_perf["Rsquared"], mars_perf["Rsquared"], svm_perf["Rsquared"]),
  MAE = c(knn_perf["MAE"], mars_perf["MAE"], svm_perf["MAE"])
)

kable(perf_compare, digits = 3, caption = "Exercise 7.2 Test Set Performance")

Exercise 7.2 Test Set Performance
Model	RMSE	Rsquared	MAE
KNN	3.067	0.671	2.447
MARS	1.855	0.866	1.434
SVM Radial	2.006	0.842	1.545

Based on the test-set results, MARS performed best among the three nonlinear models. It achieved the lowest RMSE (1.855), the highest \(R^2\) (0.866), and the lowest MAE (1.434). The radial SVM model ranked second, with RMSE = 2.006, \(R^2 = 0.842\), and MAE = 1.545, while KNN gave the weakest performance, with RMSE = 3.067, \(R^2 = 0.671\), and MAE = 2.447.

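These results suggest that MARS was best able to recover the structure of the simulated Friedman data. This is reasonable because three of the five informative predictors (\(x_3\), \(x_4\), and \(x_5\)) enter the response additively, which piecewise-linear hinge functions approximate well. Note that the tuned model is additive (degree was held at 1), so the \(x_1 x_2\) interaction inside the sine term is captured only through the main effects of those two predictors.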

1.5 Check whether MARS selected the informative predictors

# variable importance helps us see which predictors MARS used most
mars_imp <- varImp(mars_model)$importance %>%
  rownames_to_column("Predictor") %>%
  arrange(desc(Overall))

kable(mars_imp, digits = 2, caption = "Exercise 7.2 MARS Variable Importance")

Exercise 7.2 MARS Variable Importance
Predictor	Overall
V4	100.00
V1	63.17
V2	41.56
V5	21.79
V3	0.00

# refit the final MARS model to inspect which variables were used
mars_fit <- earth(
  x = training_data$x,
  y = training_data$y,
  nprune = mars_model$bestTune$nprune,
  degree = mars_model$bestTune$degree
)

summary(mars_fit)

## Call: earth(x=training_data$x, y=training_data$y,
##             degree=mars_model$bestTune$degree,
##             nprune=mars_model$bestTune$nprune)
## 
##                coefficients
## (Intercept)       29.406789
## h(0.413077-V1)   -13.606744
## h(V1-0.413077)     5.596838
## h(V1-0.783373)   -17.457891
## h(0.660692-V2)    -9.588465
## h(0.488376-V3)    11.939197
## h(V3-0.488376)    11.232006
## h(V4-0.127252)   -11.766258
## h(0.943702-V4)   -21.938402
## h(0.695462-V5)    -5.156652
## h(V5-0.695462)     8.301894
## 
## Selected 11 of 18 terms, and 5 of 10 predictors (nprune=12)
## Termination condition: Reached nk 21
## Importance: V4, V1, V2, V5, V3, V6-unused, V7-unused, V8-unused, V9-unused, ...
## Number of terms at each degree of interaction: 1 10 (additive model)
## GCV 2.798772    RSS 448.3773    GRSq 0.8989645    RSq 0.9182526

The MARS variable-importance output shows that the model relied on V4, V1, V2, V5, and V3, which correspond to the informative predictors \(x_1\) through \(x_5\) (as.data.frame names the columns V1 through V10). The final model summary states that 5 of 10 predictors were used, and the unused variables were V6 through V10. This means that MARS successfully identified the truly informative predictors and ignored the five noise variables.

Although V3 had the smallest importance value (0.00 on the scaled measure), both of its hinge terms, h(0.488376-V3) and h(V3-0.488376), appear in the final MARS model summary, so the model did select all five informative predictors.
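The hinge terms in the summary can be read as piecewise-linear slopes and compared with the generating equation. The sketch below copies the V4 and V5 coefficients and knots from the printed summary; the helper names (`hinge`, `v4_effect`, `v5_effect`) are hypothetical and only illustrate how the terms combine.

```r
# h(u) = max(0, u): the hinge basis function used by MARS
hinge <- function(u) pmax(0, u)

# partial contribution of V4, from the two V4 terms in the summary
v4_effect <- function(v4) {
  -11.766258 * hinge(v4 - 0.127252) - 21.938402 * hinge(0.943702 - v4)
}

# partial contribution of V5, from the two V5 terms in the summary
v5_effect <- function(v5) {
  -5.156652 * hinge(0.695462 - v5) + 8.301894 * hinge(v5 - 0.695462)
}

# between the two V4 knots both hinges are active, so the slope is
# -11.766258 + 21.938402
slope_v4 <- (v4_effect(0.6) - v4_effect(0.5)) / 0.1

# below the V5 knot only the first hinge is active, so the slope is 5.156652
slope_v5 <- (v5_effect(0.5) - v5_effect(0.4)) / 0.1

c(V4 = slope_v4, V5 = slope_v5)
```

The fitted slopes, roughly 10.17 for V4 and 5.16 for V5, are close to the true coefficients of 10 and 5 on \(x_4\) and \(x_5\) over those ranges.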

2 Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Which nonlinear regression model gives the optimal resampling and test set performance?

2.1 Load Data

data(ChemicalManufacturingProcess)

predictors <- ChemicalManufacturingProcess %>% select(-Yield)
yield <- ChemicalManufacturingProcess$Yield

data_overview <- data.frame(
  Rows = nrow(predictors),
  Predictors = ncol(predictors),
  Yield_Mean = mean(yield),
  Yield_SD = sd(yield)
)

kable(data_overview, digits = 2, caption = "Exercise 7.5 Data Overview")

Exercise 7.5 Data Overview
Rows	Predictors	Yield_Mean	Yield_SD
176	57	40.18	1.85

2.2 Split the data

set.seed(901)

training_rows <- createDataPartition(yield, p = 0.7, list = FALSE)

train_predictors <- predictors[training_rows, ]
train_yield <- yield[training_rows]

test_predictors <- predictors[-training_rows, ]
test_yield <- yield[-training_rows]

2.3 Pre-process the data

# estimate every transformation on the training predictors only,
# then apply the fitted transformations to both training and test sets
pp <- preProcess(
  train_predictors,
  method = c("YeoJohnson", "center", "scale", "knnImpute")
)

pp_train_predictors <- predict(pp, train_predictors)
pp_test_predictors <- predict(pp, test_predictors)

# remove near-zero variance predictors
nzvpp <- nearZeroVar(pp_train_predictors)
if(length(nzvpp) > 0) {
  pp_train_predictors <- pp_train_predictors[, -nzvpp]
  pp_test_predictors <- pp_test_predictors[, -nzvpp]
}

# remove highly correlated predictors (findCorrelation's default cutoff is 0.9)
predcorr <- cor(pp_train_predictors)
highCorrpp <- findCorrelation(predcorr)

if(length(highCorrpp) > 0) {
  pp_train_predictors <- pp_train_predictors[, -highCorrpp]
  pp_test_predictors <- pp_test_predictors[, -highCorrpp]
}

preprocess_overview <- data.frame(
  Training_Rows = nrow(pp_train_predictors),
  Training_Predictors = ncol(pp_train_predictors),
  Test_Rows = nrow(pp_test_predictors),
  Test_Predictors = ncol(pp_test_predictors)
)

kable(preprocess_overview, caption = "Exercise 7.5 Data After Preprocessing")

Exercise 7.5 Data After Preprocessing
Training_Rows	Training_Predictors	Test_Rows	Test_Predictors
124	46	52	46

2.4 Set resampling method

set.seed(901)

# bootstrap resampling
ctrl <- trainControl(method = "boot", number = 25)
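The bootstrap scheme requested here can be sketched in base R. The example assumes the usual out-of-bag convention, in which each model is fit to a resample drawn with replacement and scored on the rows that were not drawn; the toy data and the lm fit are hypothetical stand-ins, not part of this analysis.

```r
set.seed(901)

# toy data standing in for the 124 training rows (illustration only)
n <- 124
toy <- data.frame(x = rnorm(n))
toy$y <- 2 * toy$x + rnorm(n)

boot_rmse <- vapply(seq_len(25), function(i) {
  in_bag <- sample(n, size = n, replace = TRUE)  # resample of the same size as the data
  out_of_bag <- setdiff(seq_len(n), in_bag)      # rows never drawn are held out
  fit <- lm(y ~ x, data = toy[in_bag, ])
  pred <- predict(fit, newdata = toy[out_of_bag, ])
  sqrt(mean((toy$y[out_of_bag] - pred)^2))       # RMSE on the held-out rows
}, numeric(1))

length(boot_rmse)
mean(boot_rmse)
```

Each resample has the same number of rows as the training set, which is why the train output lists the sample sizes as 124, 124, 124, and so on.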

2.5 Train nonlinear models

2.5.1 MARS

set.seed(415)

mars_chem_grid <- expand.grid(
  degree = 1:2,
  nprune = 2:10
)

mars_chem_tune <- train(
  x = pp_train_predictors,
  y = train_yield,
  method = "earth",
  tuneGrid = mars_chem_grid,
  trControl = ctrl
)

mars_chem_tune

## Multivariate Adaptive Regression Spline 
## 
## 124 samples
##  46 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.421239  0.3555088  1.1274650
##   1        3      1.317376  0.4776361  0.9971114
##   1        4      1.710509  0.4716973  1.0557262
##   1        5      1.782419  0.4954470  1.0612259
##   1        6      1.851577  0.4485230  1.1130193
##   1        7      3.106691  0.4233550  1.3307674
##   1        8      3.061793  0.4445200  1.2969232
##   1        9      4.435348  0.4029023  1.5229229
##   1       10      4.181450  0.4323397  1.4693020
##   2        2      1.414421  0.3625596  1.1218329
##   2        3      1.307061  0.4503190  1.0339823
##   2        4      1.410912  0.4692309  1.0426146
##   2        5      1.422248  0.4713860  1.0506640
##   2        6      1.500273  0.4369123  1.0979399
##   2        7      1.662403  0.4365046  1.1417185
##   2        8      1.737990  0.4215994  1.1918019
##   2        9      1.729978  0.4319843  1.1980618
##   2       10      1.782667  0.4261091  1.2146607
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 2.

2.5.2 Polynomial SVM

set.seed(415)

psvm_tune_grid <- expand.grid(
  degree = c(1, 2),
  scale = c(0.25, 0.5, 1),
  C = c(0.01, 0.05, 0.1)
)

psvm_chem_tune <- train(
  x = pp_train_predictors,
  y = train_yield,
  method = "svmPoly",
  trControl = ctrl,
  tuneGrid = psvm_tune_grid
)

psvm_chem_tune

## Support Vector Machines with Polynomial Kernel 
## 
## 124 samples
##  46 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   degree  scale  C     RMSE       Rsquared   MAE     
##   1       0.25   0.01   1.606975  0.3404999  1.172610
##   1       0.25   0.05   2.131784  0.3240448  1.220070
##   1       0.25   0.10   2.655741  0.2941084  1.329853
##   1       0.50   0.01   1.733798  0.3400106  1.153279
##   1       0.50   0.05   2.655704  0.2941243  1.329840
##   1       0.50   0.10   3.152961  0.2736269  1.434653
##   1       1.00   0.01   1.991684  0.3327148  1.192845
##   1       1.00   0.05   3.152936  0.2736284  1.434646
##   1       1.00   0.10   3.855858  0.2607261  1.572457
##   2       0.25   0.01   7.590887  0.2767245  2.062433
##   2       0.25   0.05   8.029920  0.2274698  2.159394
##   2       0.25   0.10   8.000849  0.2218754  2.165128
##   2       0.50   0.01   9.688751  0.2067014  2.452452
##   2       0.50   0.05   9.711800  0.1885894  2.479941
##   2       0.50   0.10   9.711800  0.1885894  2.479941
##   2       1.00   0.01  11.245890  0.1704498  2.768059
##   2       1.00   0.05  11.245890  0.1704498  2.768059
##   2       1.00   0.10  11.245890  0.1704498  2.768059
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 1, scale = 0.25 and C = 0.01.

2.5.3 KNN

set.seed(415)

knn_chem_tune <- train(
  x = pp_train_predictors,
  y = train_yield,
  method = "knn",
  tuneLength = 10,
  trControl = ctrl
)

knn_chem_tune

## k-Nearest Neighbors 
## 
## 124 samples
##  46 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.405772  0.3745983  1.145204
##    7  1.376566  0.3880314  1.125607
##    9  1.345643  0.4161709  1.101996
##   11  1.339045  0.4229833  1.098540
##   13  1.343674  0.4205954  1.102025
##   15  1.343667  0.4250135  1.100131
##   17  1.351936  0.4207890  1.110264
##   19  1.353119  0.4219766  1.112926
##   21  1.362666  0.4192643  1.121974
##   23  1.371645  0.4146175  1.130539
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
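As a reminder of what the tuned value of k controls, here is a minimal base R sketch of KNN regression. `knn_predict_sketch` is a hypothetical helper rather than caret's implementation, and the toy data are for illustration only.

```r
# Hypothetical helper: predict by averaging the responses of the k nearest
# training rows under Euclidean distance.
knn_predict_sketch <- function(train_x, train_y, new_x, k) {
  train_x <- as.matrix(train_x)
  new_x <- as.matrix(new_x)
  apply(new_x, 1, function(point) {
    d <- sqrt(rowSums(sweep(train_x, 2, point)^2))  # distance to every training row
    mean(train_y[order(d)[seq_len(k)]])             # average of the k closest responses
  })
}

# toy one-predictor example
train_x <- matrix(c(0, 1, 2, 3, 10), ncol = 1)
train_y <- c(1, 2, 3, 4, 20)
preds <- knn_predict_sketch(train_x, train_y,
                            new_x = matrix(c(1.1, 100), ncol = 1), k = 3)
preds
```

Even for the extreme query point at 100, the prediction is an average of observed training responses, so KNN predictions always stay within the range of the training response.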

2.6 Best tuning parameters

best_tune_75 <- bind_rows(
  data.frame(Model = "MARS", mars_chem_tune$bestTune),
  data.frame(Model = "SVM Poly", psvm_chem_tune$bestTune),
  data.frame(Model = "KNN", knn_chem_tune$bestTune)
)

kable(best_tune_75, caption = "Exercise 7.5 Best Tuning Parameters")

Exercise 7.5 Best Tuning Parameters
Model	nprune	degree	scale	C	k
MARS	3	2	NA	NA	NA
SVM Poly	NA	1	0.25	0.01	NA
KNN	NA	NA	NA	NA	11

2.7 Compare resampling results

resamp <- resamples(list(
  MARS = mars_chem_tune,
  SVM_Poly = psvm_chem_tune,
  KNN = knn_chem_tune
))

summary(resamp)

## 
## Call:
## summary.resamples(object = resamp)
## 
## Models: MARS, SVM_Poly, KNN 
## Number of resamples: 25 
## 
## MAE 
##               Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## MARS     0.7141849 0.9542644 1.030962 1.033982 1.108773 1.405970    0
## SVM_Poly 0.9438352 1.0896824 1.148821 1.172610 1.317141 1.414758    0
## KNN      0.8845607 0.9910457 1.112735 1.098540 1.190447 1.326746    0
## 
## RMSE 
##              Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## MARS     0.930888 1.228278 1.287889 1.307061 1.399731 1.768076    0
## SVM_Poly 1.144107 1.324773 1.501958 1.606975 1.775840 3.186983    0
## KNN      1.140694 1.226457 1.328108 1.339045 1.406592 1.572455    0
## 
## Rsquared 
##                Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## MARS     0.21268136 0.3846315 0.4310200 0.4503190 0.5018176 0.7071742    0
## SVM_Poly 0.04309999 0.1939825 0.3431186 0.3404999 0.4660888 0.6265679    0
## KNN      0.25623654 0.3579855 0.4105657 0.4229833 0.4814752 0.6256223    0

bwplot(resamp, metric = "RMSE")

dotplot(resamp, metric = "Rsquared")

2.8 Compare test-set performance

mars_pred <- predict(mars_chem_tune, newdata = pp_test_predictors)
svm_pred  <- predict(psvm_chem_tune, newdata = pp_test_predictors)
knn_pred  <- predict(knn_chem_tune, newdata = pp_test_predictors)

mars_perf <- postResample(mars_pred, test_yield)
svm_perf  <- postResample(svm_pred, test_yield)
knn_perf  <- postResample(knn_pred, test_yield)

chem_compare <- data.frame(
  Model = c("MARS", "SVM Poly", "KNN"),
  RMSE = c(mars_perf["RMSE"], svm_perf["RMSE"], knn_perf["RMSE"]),
  Rsquared = c(mars_perf["Rsquared"], svm_perf["Rsquared"], knn_perf["Rsquared"]),
  MAE = c(mars_perf["MAE"], svm_perf["MAE"], knn_perf["MAE"])
)

chem_compare_display <- chem_compare %>%
  mutate(
    RMSE = format(round(RMSE, 3), big.mark = ",", scientific = FALSE),
    Rsquared = sprintf("%.3f", Rsquared),
    MAE = format(round(MAE, 3), big.mark = ",", scientific = FALSE)
  )

kable(chem_compare_display, caption = "Exercise 7.5 Test-Set Performance")

Exercise 7.5 Test-Set Performance
Model	RMSE	Rsquared	MAE
MARS	1.294	0.611	1.065
SVM Poly	22,487,073,217.711	0.001	3,118,395,982.684
KNN	1.580	0.403	1.297

Among the nonlinear models, MARS gave the best overall performance. In the resampling comparison, MARS had the lowest average RMSE and the highest average \(R^2\). On the test set, MARS also performed best, with RMSE = 1.294, \(R^2 = 0.611\), and MAE = 1.065.

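KNN ranked second on the test set, with RMSE = 1.580, \(R^2 = 0.403\), and MAE = 1.297. In contrast, the polynomial SVM performed extremely poorly on the test set, with an RMSE on the order of \(10^{10}\) and almost no explanatory power (\(R^2 = 0.001\)). It was already the weakest of the three models in resampling (mean RMSE = 1.607, with one resample reaching 3.187), but errors of this size are far outside the range of the yield values and suggest unstable extrapolation on at least one test observation rather than an ordinary loss of accuracy.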
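The reported SVM Poly test-set RMSE and MAE can be checked against each other with a short calculation. For any set of errors, the square root of the sum of squared errors cannot exceed the sum of absolute errors, and the two are equal only when a single error is nonzero. The sketch below uses the figures from the tables above; the interpretation is an inference from those numbers, not something the fitted model output confirms directly.

```r
n_test <- 52                   # test rows, from the preprocessing table
svm_rmse <- 22487073217.711    # SVM Poly test RMSE as reported
svm_mae <- 3118395982.684      # SVM Poly test MAE as reported

l2_norm <- svm_rmse * sqrt(n_test)  # sqrt(sum of squared errors)
l1_norm <- svm_mae * n_test         # sum of absolute errors
ratio <- l2_norm / l1_norm
ratio

# for comparison: four equal errors give a ratio of 0.5
equal_errors <- c(1, 1, 1, 1)
toy_ratio <- sqrt(sum(equal_errors^2)) / sum(abs(equal_errors))
toy_ratio
```

A ratio this close to 1 indicates that nearly all of the error comes from a single test observation, so inspecting the individual SVM predictions and the preprocessed predictor values for that row would be a sensible next step.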

Overall, MARS is the best nonlinear model for this dataset.

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

mars_imp <- varImp(mars_chem_tune)$importance %>%
  rownames_to_column("Predictor") %>%
  arrange(desc(Overall)) %>%
  filter(Overall > 0 | row_number() <= 2)

kable(mars_imp, digits = 3, caption = "Key Predictors in the Final MARS Model")

Key Predictors in the Final MARS Model
Predictor	Overall
ManufacturingProcess32	100
ManufacturingProcess09	0

# top predictors from the optimal linear model
linear_top10 <- tibble(
  Predictor = c(
    "ManufacturingProcess32",
    "ManufacturingProcess36",
    "ManufacturingProcess13",
    "ManufacturingProcess09",
    "ManufacturingProcess17",
    "BiologicalMaterial06",
    "ManufacturingProcess33",
    "BiologicalMaterial08",
    "BiologicalMaterial01",
    "BiologicalMaterial03"
  ),
  Linear_Importance = c(
    0.15328137,
    0.12314752,
    0.12221124,
    0.11978856,
    0.11768643,
    0.10420115,
    0.10329736,
    0.09362826,
    0.09014032,
    0.08934011
  ),
  Predictor_Type = c(
    "Process", "Process", "Process", "Process", "Process",
    "Biological", "Process", "Biological", "Biological", "Biological"
  )
)

kable(linear_top10, digits = 3, caption = "Top 10 Predictors from the Optimal Linear Model")

Top 10 Predictors from the Optimal Linear Model
Predictor	Linear_Importance	Predictor_Type
ManufacturingProcess32	0.153	Process
ManufacturingProcess36	0.123	Process
ManufacturingProcess13	0.122	Process
ManufacturingProcess09	0.120	Process
ManufacturingProcess17	0.118	Process
BiologicalMaterial06	0.104	Biological
ManufacturingProcess33	0.103	Process
BiologicalMaterial08	0.094	Biological
BiologicalMaterial01	0.090	Biological
BiologicalMaterial03	0.089	Biological

comparison_75 <- full_join(
  mars_imp %>% rename(Nonlinear_Importance = Overall),
  linear_top10,
  by = "Predictor"
)

kable(comparison_75, digits = 3, caption = "Comparison of Nonlinear and Linear Model Predictors")

Comparison of Nonlinear and Linear Model Predictors
Predictor	Nonlinear_Importance	Linear_Importance	Predictor_Type
ManufacturingProcess32	100	0.153	Process
ManufacturingProcess09	0	0.120	Process
ManufacturingProcess36	NA	0.123	Process
ManufacturingProcess13	NA	0.122	Process
ManufacturingProcess17	NA	0.118	Process
BiologicalMaterial06	NA	0.104	Biological
ManufacturingProcess33	NA	0.103	Process
BiologicalMaterial08	NA	0.094	Biological
BiologicalMaterial01	NA	0.090	Biological
BiologicalMaterial03	NA	0.089	Biological

linear_type_summary <- linear_top10 %>%
  count(Predictor_Type)

kable(linear_type_summary, caption = "Variable Types in the Top 10 Linear Predictors")

Variable Types in the Top 10 Linear Predictors
Predictor_Type	n
Biological	4
Process	6

The variable-importance output indicates that ManufacturingProcess32 is the dominant predictor in the final MARS model, with a scaled importance of 100. ManufacturingProcess09 is the only other predictor listed, and its scaled importance is reported as 0, so its contribution is small relative to ManufacturingProcess32. Because both listed predictors are manufacturing process variables, the final nonlinear model is clearly dominated by process variables.

Compared with the optimal linear model from Exercise 6.3, there is clear overlap. The optimal linear model was a PLS model, and its top predictors included ManufacturingProcess32, ManufacturingProcess36, ManufacturingProcess13, ManufacturingProcess09, ManufacturingProcess17, BiologicalMaterial06, ManufacturingProcess33, BiologicalMaterial08, BiologicalMaterial01, and BiologicalMaterial03. Among those top 10 linear predictors, 6 were process variables and 4 were biological variables, and the first five predictors in the linear ranking were all process variables.

This means that the nonlinear model and the linear model agree on the importance of ManufacturingProcess32 and ManufacturingProcess09. The nonlinear MARS model is more compact because it retains only a small number of predictors, while the linear PLS model spreads importance across a broader group of variables. Even so, both models suggest that process variables are the main drivers of yield.
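The overlap between the two predictor lists can be confirmed with set operations. The vectors below are copied from the tables above; the variable names are chosen for this sketch only.

```r
mars_predictors <- c("ManufacturingProcess32", "ManufacturingProcess09")

linear_predictors <- c(
  "ManufacturingProcess32", "ManufacturingProcess36", "ManufacturingProcess13",
  "ManufacturingProcess09", "ManufacturingProcess17", "BiologicalMaterial06",
  "ManufacturingProcess33", "BiologicalMaterial08", "BiologicalMaterial01",
  "BiologicalMaterial03"
)

shared <- intersect(mars_predictors, linear_predictors)    # in both lists
only_mars <- setdiff(mars_predictors, linear_predictors)   # unique to MARS

shared
length(only_mars)
```

Both MARS predictors fall in the linear top ten and none is unique to the nonlinear model.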

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

top2 <- mars_imp$Predictor[1:2]
top2

## [1] "ManufacturingProcess32" "ManufacturingProcess09"

plot_data <- data.frame(
  x1 = pp_train_predictors[[top2[1]]],
  x2 = pp_train_predictors[[top2[2]]],
  Yield = train_yield
)

ggplot(plot_data, aes(x = x1, y = Yield)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = paste("Yield vs", top2[1]), x = top2[1], y = "Yield")

ggplot(plot_data, aes(x = x2, y = Yield)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = paste("Yield vs", top2[2]), x = top2[2], y = "Yield")

The scatterplots and LOESS smooth curves show that both ManufacturingProcess32 and ManufacturingProcess09 are positively associated with yield overall, although the patterns are not identical. For ManufacturingProcess32, the relationship appears mildly nonlinear: yield is lower in the middle-left region, rises as the predictor increases, and then levels off slightly at the high end. For ManufacturingProcess09, the relationship looks more clearly increasing, with yield tending to rise as the predictor increases.

These plots suggest that the relationship between important process predictors and yield is not perfectly linear, especially for ManufacturingProcess32, where the smooth curve shows noticeable curvature. This helps explain why a nonlinear model such as MARS performed well in this comparison.

At the same time, these two predictors are not unique to the nonlinear model, because both ManufacturingProcess32 and ManufacturingProcess09 were also among the top predictors in the optimal linear model from Exercise 6.3. Therefore, the plots do not show a completely different set of nonlinear-only drivers. Instead, they show that the nonlinear model places stronger emphasis on two process predictors that were already important in the linear analysis.

Data 624 HW 8

Jayden Jiang

2026-04-15