Applied Predictive Modeling - Chapter 7 Exercises: 7.2, 7.5

7.2. Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data:

\(y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)\)

where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

Tune several models on these data. For example:

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(mlbench)

set.seed(200)

trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

#Neural Network
nnetModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "nnet",
  preProcess = c("center", "scale"),
  tuneLength = 5,
  trace = FALSE
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
nnetModel
## Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared   MAE     
##   1     0e+00  14.41790        NaN  13.56150
##   1     1e-04  14.41790  0.4542748  13.56150
##   1     1e-03  14.41791  0.3664249  13.56150
##   1     1e-02  14.41792  0.6623630  13.56152
##   1     1e-01  14.41807  0.7597399  13.56168
##   3     0e+00  14.41790        NaN  13.56150
##   3     1e-04  14.41790  0.3173180  13.56150
##   3     1e-03  14.41791  0.2631604  13.56150
##   3     1e-02  14.41792  0.7118890  13.56151
##   3     1e-01  14.41801  0.7604438  13.56162
##   5     0e+00  14.41790        NaN  13.56150
##   5     1e-04  14.41790  0.2987806  13.56150
##   5     1e-03  14.41790  0.1618633  13.56150
##   5     1e-02  14.41791  0.6888859  13.56151
##   5     1e-01  14.41799  0.7612978  13.56160
##   7     0e+00  14.41790        NaN  13.56150
##   7     1e-04  14.41790  0.2484722  13.56150
##   7     1e-03  14.41790  0.1047706  13.56150
##   7     1e-02  14.41791  0.6883861  13.56151
##   7     1e-01  14.41798  0.7605236  13.56159
##   9     0e+00  14.41790        NaN  13.56150
##   9     1e-04  14.41790  0.2591097  13.56150
##   9     1e-03  14.41790  0.1359100  13.56150
##   9     1e-02  14.41791  0.6756231  13.56151
##   9     1e-01  14.41797  0.7609534  13.56158
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.
#SVM
svmModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "svmRadial",
  preProcess = c("center", "scale"),
  tuneLength = 5
)

svmModel
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE      Rsquared   MAE     
##   0.25  2.611982  0.7851367  2.073638
##   0.50  2.356043  0.8036432  1.857602
##   1.00  2.200593  0.8214443  1.726521
##   2.00  2.101961  0.8324907  1.640215
##   4.00  2.047560  0.8378579  1.600164
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06169286
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06169286 and C = 4.
#MARS
marsModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "earth",
  tuneLength = 5
)
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsModel
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE      Rsquared   MAE     
##    2      4.405736  0.2100417  3.624081
##    5      2.576285  0.7272177  2.056392
##    8      1.901169  0.8511616  1.490399
##   11      1.825508  0.8624122  1.417457
##   15      1.840871  0.8620852  1.428086
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 11 and degree = 1.
#KNN
knnModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "knn",
  preProcess = c("center", "scale"),
  tuneLength = 10
)

knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.581490  0.4852674  2.893313
##    7  3.471696  0.5193268  2.824060
##    9  3.389064  0.5491322  2.758147
##   11  3.301997  0.5869503  2.669257
##   13  3.261358  0.6096858  2.631723
##   15  3.245500  0.6308239  2.614982
##   17  3.256677  0.6385922  2.625094
##   19  3.268671  0.6429010  2.636666
##   21  3.265805  0.6532672  2.645450
##   23  3.281141  0.6577555  2.660849
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
#Neural Network
nnetPred <- predict(nnetModel, testData$x)
postResample(nnetPred, testData$y)
##     RMSE Rsquared      MAE 
## 14.27693       NA 13.38691
#SVM
svmPred <- predict(svmModel, testData$x)
postResample(svmPred, testData$y)
##      RMSE  Rsquared       MAE 
## 2.0558058 0.8292187 1.5527979
#MARS
marsPred <- predict(marsModel, testData$x)
postResample(marsPred, testData$y)
##      RMSE  Rsquared       MAE 
## 1.8163128 0.8673823 1.3894608
#KNN
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.1750657 0.6785946 2.5443169
#Results
results <- rbind(
  KNN  = postResample(predict(knnModel, testData$x), testData$y),
  NNET = postResample(predict(nnetModel, testData$x), testData$y),
  SVM  = postResample(predict(svmModel, testData$x), testData$y),
  MARS = postResample(predict(marsModel, testData$x), testData$y)
)

results
##           RMSE  Rsquared       MAE
## KNN   3.175066 0.6785946  2.544317
## NNET 14.276927        NA 13.386908
## SVM   2.055806 0.8292187  1.552798
## MARS  1.816313 0.8673823  1.389461
#Variable Importance
varImp(marsModel)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   82.24
## X2   63.10
## X5   38.54
## X3   26.30
## X6    0.00

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

Based on the test set results, the MARS model appears to give the best performance, with the lowest RMSE (1.8163) and the highest \(R^2\) (0.8674). Yes, MARS selects the informative predictors. Yes, MARS selects X1–X5, as they have the highest importance values, while noise variables like X6 have zero importance.

7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

library(caret)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(AppliedPredictiveModeling)

data(ChemicalManufacturingProcess)

set.seed(123)


#Test vs Train
train_index <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.8, list = FALSE)

train_data <- ChemicalManufacturingProcess[train_index, ]
test_data  <- ChemicalManufacturingProcess[-train_index, ]

x_train <- train_data %>% select(-Yield)
y_train <- train_data$Yield

x_test <- test_data %>% select(-Yield)
y_test <- test_data$Yield

# Preprocess
pre <- preProcess(x_train, method = c("medianImpute", "center", "scale"))

x_train_proc <- predict(pre, x_train)
x_test_proc  <- predict(pre, x_test)

ctrl <- trainControl(method = "cv", number = 5)

#KNN
knnModel <- train(
  x = x_train_proc,
  y = y_train,
  method = "knn",
  trControl = ctrl,
  tuneLength = 10
)

#NN
nnetModel <- train(
  x = x_train_proc,
  y = y_train,
  method = "nnet",
  trControl = ctrl,
  tuneLength = 5,
  trace = FALSE
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
#SVM
svmModel <- train(
  x = x_train_proc,
  y = y_train,
  method = "svmRadial",
  trControl = ctrl,
  tuneLength = 5
)

#MARS
marsModel <- train(
  x = x_train_proc,
  y = y_train,
  method = "earth",
  trControl = ctrl,
  tuneLength = 5
)

#Test set performance
results <- rbind(
  KNN  = postResample(predict(knnModel, x_test_proc), y_test),
  NNET = postResample(predict(nnetModel, x_test_proc), y_test),
  SVM  = postResample(predict(svmModel, x_test_proc), y_test),
  MARS = postResample(predict(marsModel, x_test_proc), y_test)
)

results
##           RMSE  Rsquared       MAE
## KNN   1.402052 0.4285288  1.166375
## NNET 39.146857        NA 39.104687
## SVM   1.216960 0.5627652  1.014307
## MARS  1.279300 0.5273828  1.051239
  1. Which nonlinear regression model gives the optimal resampling and test set performance?

Based on the test set results, the SVM model appears to give the best performance, with the lowest RMSE (1.217) and the highest \(R^2\) (0.563).

  1. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
#Variable Importance (Non Linear Model)
svm_imp <- varImp(svmModel, scale = TRUE)
svm_imp
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     94.06
## ManufacturingProcess36   81.54
## BiologicalMaterial03     81.27
## ManufacturingProcess13   80.63
## ManufacturingProcess31   78.52
## BiologicalMaterial02     76.04
## ManufacturingProcess17   75.92
## ManufacturingProcess09   73.04
## BiologicalMaterial12     69.48
## ManufacturingProcess06   66.24
## BiologicalMaterial11     59.72
## ManufacturingProcess33   57.06
## ManufacturingProcess29   54.40
## BiologicalMaterial04     53.93
## BiologicalMaterial01     45.62
## BiologicalMaterial08     44.93
## ManufacturingProcess30   42.47
## BiologicalMaterial09     40.88
## ManufacturingProcess11   38.38
svm_top10 <- svm_imp$importance %>%
  arrange(desc(Overall)) %>%
  head(10)

svm_top10
##                          Overall
## ManufacturingProcess32 100.00000
## BiologicalMaterial06    94.06439
## ManufacturingProcess36  81.53513
## BiologicalMaterial03    81.26510
## ManufacturingProcess13  80.63144
## ManufacturingProcess31  78.52140
## BiologicalMaterial02    76.03669
## ManufacturingProcess17  75.91748
## ManufacturingProcess09  73.04357
## BiologicalMaterial12    69.48078

The most important predictors include both process and biological variables, but process variables slightly dominate (6 out of 10). There is overlap with the linear model, especially for key process variables. Process variables dominate in both models, but the nonlinear SVM model also highlights additional variables, indicating it captures more complex relationships.

  1. Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?)
# ManufacturingProcess31
ggplot(ChemicalManufacturingProcess, aes(x = ManufacturingProcess31, y = Yield)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm") +
  labs(title = "Yield vs ManufacturingProcess31")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).

# BiologicalMaterial02
ggplot(ChemicalManufacturingProcess, aes(x = BiologicalMaterial02, y = Yield)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm") +
  labs(title = "Yield vs BiologicalMaterial02")
## `geom_smooth()` using formula = 'y ~ x'

# BiologicalMaterial12
ggplot(ChemicalManufacturingProcess, aes(x = BiologicalMaterial12, y = Yield)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm") +
  labs(title = "Yield vs BiologicalMaterial12")
## `geom_smooth()` using formula = 'y ~ x'

ManufacturingProcess31 shows a weak relationship with yield, while BiologicalMaterial02 and 12 show moderate positive trends. This suggests some predictors have nonlinear effects, which explains why the SVM model identified them as important.