7.2. Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data:
\(y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)\)
where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
Tune several models on these data. For example:
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
#Neural Network
nnetModel <- train(
x = trainingData$x,
y = trainingData$y,
method = "nnet",
preProcess = c("center", "scale"),
tuneLength = 5,
trace = FALSE
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
nnetModel
## Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0e+00 14.41790 NaN 13.56150
## 1 1e-04 14.41790 0.4542748 13.56150
## 1 1e-03 14.41791 0.3664249 13.56150
## 1 1e-02 14.41792 0.6623630 13.56152
## 1 1e-01 14.41807 0.7597399 13.56168
## 3 0e+00 14.41790 NaN 13.56150
## 3 1e-04 14.41790 0.3173180 13.56150
## 3 1e-03 14.41791 0.2631604 13.56150
## 3 1e-02 14.41792 0.7118890 13.56151
## 3 1e-01 14.41801 0.7604438 13.56162
## 5 0e+00 14.41790 NaN 13.56150
## 5 1e-04 14.41790 0.2987806 13.56150
## 5 1e-03 14.41790 0.1618633 13.56150
## 5 1e-02 14.41791 0.6888859 13.56151
## 5 1e-01 14.41799 0.7612978 13.56160
## 7 0e+00 14.41790 NaN 13.56150
## 7 1e-04 14.41790 0.2484722 13.56150
## 7 1e-03 14.41790 0.1047706 13.56150
## 7 1e-02 14.41791 0.6883861 13.56151
## 7 1e-01 14.41798 0.7605236 13.56159
## 9 0e+00 14.41790 NaN 13.56150
## 9 1e-04 14.41790 0.2591097 13.56150
## 9 1e-03 14.41790 0.1359100 13.56150
## 9 1e-02 14.41791 0.6756231 13.56151
## 9 1e-01 14.41797 0.7609534 13.56158
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.
#SVM
svmModel <- train(
x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProcess = c("center", "scale"),
tuneLength = 5
)
svmModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.611982 0.7851367 2.073638
## 0.50 2.356043 0.8036432 1.857602
## 1.00 2.200593 0.8214443 1.726521
## 2.00 2.101961 0.8324907 1.640215
## 4.00 2.047560 0.8378579 1.600164
##
## Tuning parameter 'sigma' was held constant at a value of 0.06169286
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06169286 and C = 4.
#MARS
marsModel <- train(
x = trainingData$x,
y = trainingData$y,
method = "earth",
tuneLength = 5
)
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsModel
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 2 4.405736 0.2100417 3.624081
## 5 2.576285 0.7272177 2.056392
## 8 1.901169 0.8511616 1.490399
## 11 1.825508 0.8624122 1.417457
## 15 1.840871 0.8620852 1.428086
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 11 and degree = 1.
#KNN
knnModel <- train(
x = trainingData$x,
y = trainingData$y,
method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10
)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.581490 0.4852674 2.893313
## 7 3.471696 0.5193268 2.824060
## 9 3.389064 0.5491322 2.758147
## 11 3.301997 0.5869503 2.669257
## 13 3.261358 0.6096858 2.631723
## 15 3.245500 0.6308239 2.614982
## 17 3.256677 0.6385922 2.625094
## 19 3.268671 0.6429010 2.636666
## 21 3.265805 0.6532672 2.645450
## 23 3.281141 0.6577555 2.660849
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
#Neural Network
nnetPred <- predict(nnetModel, testData$x)
postResample(nnetPred, testData$y)
## RMSE Rsquared MAE
## 14.27693 NA 13.38691
#SVM
svmPred <- predict(svmModel, testData$x)
postResample(svmPred, testData$y)
## RMSE Rsquared MAE
## 2.0558058 0.8292187 1.5527979
#MARS
marsPred <- predict(marsModel, testData$x)
postResample(marsPred, testData$y)
## RMSE Rsquared MAE
## 1.8163128 0.8673823 1.3894608
#KNN
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.1750657 0.6785946 2.5443169
#Results
results <- rbind(
KNN = postResample(predict(knnModel, testData$x), testData$y),
NNET = postResample(predict(nnetModel, testData$x), testData$y),
SVM = postResample(predict(svmModel, testData$x), testData$y),
MARS = postResample(predict(marsModel, testData$x), testData$y)
)
results
## RMSE Rsquared MAE
## KNN 3.175066 0.6785946 2.544317
## NNET 14.276927 NA 13.386908
## SVM 2.055806 0.8292187 1.552798
## MARS 1.816313 0.8673823 1.389461
#Variable Importance
varImp(marsModel)
## earth variable importance
##
## Overall
## X1 100.00
## X4 82.24
## X2 63.10
## X5 38.54
## X3 26.30
## X6 0.00
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
Based on the test set results, the MARS model appears to give the best performance, with the lowest RMSE (1.8163) and the highest \(R^2\) (0.8674). Yes, MARS selects the informative predictors. Yes, MARS selects X1–X5, as they have the highest importance values, while noise variables like X6 have zero importance.
7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
library(caret)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
set.seed(123)
#Test vs Train
train_index <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.8, list = FALSE)
train_data <- ChemicalManufacturingProcess[train_index, ]
test_data <- ChemicalManufacturingProcess[-train_index, ]
x_train <- train_data %>% select(-Yield)
y_train <- train_data$Yield
x_test <- test_data %>% select(-Yield)
y_test <- test_data$Yield
# Preprocess
pre <- preProcess(x_train, method = c("medianImpute", "center", "scale"))
x_train_proc <- predict(pre, x_train)
x_test_proc <- predict(pre, x_test)
ctrl <- trainControl(method = "cv", number = 5)
#KNN
knnModel <- train(
x = x_train_proc,
y = y_train,
method = "knn",
trControl = ctrl,
tuneLength = 10
)
#NN
nnetModel <- train(
x = x_train_proc,
y = y_train,
method = "nnet",
trControl = ctrl,
tuneLength = 5,
trace = FALSE
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
#SVM
svmModel <- train(
x = x_train_proc,
y = y_train,
method = "svmRadial",
trControl = ctrl,
tuneLength = 5
)
#MARS
marsModel <- train(
x = x_train_proc,
y = y_train,
method = "earth",
trControl = ctrl,
tuneLength = 5
)
#Test set performance
results <- rbind(
KNN = postResample(predict(knnModel, x_test_proc), y_test),
NNET = postResample(predict(nnetModel, x_test_proc), y_test),
SVM = postResample(predict(svmModel, x_test_proc), y_test),
MARS = postResample(predict(marsModel, x_test_proc), y_test)
)
results
## RMSE Rsquared MAE
## KNN 1.402052 0.4285288 1.166375
## NNET 39.146857 NA 39.104687
## SVM 1.216960 0.5627652 1.014307
## MARS 1.279300 0.5273828 1.051239
Based on the test set results, the SVM model appears to give the best performance, with the lowest RMSE (1.217) and the highest \(R^2\) (0.563).
#Variable Importance (Non Linear Model)
svm_imp <- varImp(svmModel, scale = TRUE)
svm_imp
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 94.06
## ManufacturingProcess36 81.54
## BiologicalMaterial03 81.27
## ManufacturingProcess13 80.63
## ManufacturingProcess31 78.52
## BiologicalMaterial02 76.04
## ManufacturingProcess17 75.92
## ManufacturingProcess09 73.04
## BiologicalMaterial12 69.48
## ManufacturingProcess06 66.24
## BiologicalMaterial11 59.72
## ManufacturingProcess33 57.06
## ManufacturingProcess29 54.40
## BiologicalMaterial04 53.93
## BiologicalMaterial01 45.62
## BiologicalMaterial08 44.93
## ManufacturingProcess30 42.47
## BiologicalMaterial09 40.88
## ManufacturingProcess11 38.38
svm_top10 <- svm_imp$importance %>%
arrange(desc(Overall)) %>%
head(10)
svm_top10
## Overall
## ManufacturingProcess32 100.00000
## BiologicalMaterial06 94.06439
## ManufacturingProcess36 81.53513
## BiologicalMaterial03 81.26510
## ManufacturingProcess13 80.63144
## ManufacturingProcess31 78.52140
## BiologicalMaterial02 76.03669
## ManufacturingProcess17 75.91748
## ManufacturingProcess09 73.04357
## BiologicalMaterial12 69.48078
The most important predictors include both process and biological variables, but process variables slightly dominate (6 out of 10). There is overlap with the linear model, especially for key process variables. Process variables dominate in both models, but the nonlinear SVM model also highlights additional variables, indicating it captures more complex relationships.
# ManufacturingProcess31
ggplot(ChemicalManufacturingProcess, aes(x = ManufacturingProcess31, y = Yield)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm") +
labs(title = "Yield vs ManufacturingProcess31")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
# BiologicalMaterial02
ggplot(ChemicalManufacturingProcess, aes(x = BiologicalMaterial02, y = Yield)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm") +
labs(title = "Yield vs BiologicalMaterial02")
## `geom_smooth()` using formula = 'y ~ x'
# BiologicalMaterial12
ggplot(ChemicalManufacturingProcess, aes(x = BiologicalMaterial12, y = Yield)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm") +
labs(title = "Yield vs BiologicalMaterial12")
## `geom_smooth()` using formula = 'y ~ x'
ManufacturingProcess31 shows a weak relationship with yield, while BiologicalMaterial02 and 12 show moderate positive trends. This suggests some predictors have nonlinear effects, which explains why the SVM model identified them as important.