1-) Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

$$y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)$$

where the x values are random variables uniformly distributed on [0, 1] (five other non-informative variables are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(caret)
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
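To make the simulation equation concrete, it can be approximated in base R without mlbench. This is a minimal sketch of the equation stated in the exercise, not the package's implementation: the helper name `simulate_friedman1` and its arguments are hypothetical, and the random draws will not match those of `mlbench.friedman1` even with the same seed.

```r
# Sketch of the Friedman (1991) benchmark in base R.
# `simulate_friedman1` is a hypothetical helper, not part of mlbench.
simulate_friedman1 <- function(n, sd = 1, p = 10) {
  stopifnot(p >= 5)
  # All predictors are independent Uniform(0, 1); only the first five enter y.
  x <- matrix(runif(n * p), nrow = n, ncol = p)
  colnames(x) <- paste0("X", seq_len(p))
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  # Additive Gaussian noise with standard deviation `sd`
  list(x = as.data.frame(x), y = signal + rnorm(n, mean = 0, sd = sd))
}

set.seed(200)
sim <- simulate_friedman1(200, sd = 1)
dim(sim$x)     # 200 rows, 10 columns
length(sim$y)  # 200
```

Columns X6 to X10 never appear in `signal`, which is what makes them non-informative.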
k-Nearest Neighbors
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
   k  RMSE      Rsquared   MAE
   5  3.466085  0.5121775  2.816838
   7  3.349428  0.5452823  2.727410
   9  3.264276  0.5785990  2.660026
  11  3.214216  0.6024244  2.603767
  13  3.196510  0.6176570  2.591935
  15  3.184173  0.6305506  2.577482
  17  3.183130  0.6425367  2.567787
  19  3.198752  0.6483184  2.592683
  21  3.188993  0.6611428  2.588787
  23  3.200458  0.6638353  2.604529
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
     RMSE  Rsquared       MAE
3.2040595 0.6819919 2.5683461
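The three test-set metrics can also be computed by hand, which helps when checking what each number means. The sketch below assumes that Rsquared is the squared Pearson correlation between predictions and observations, which is our understanding of caret's default; `regression_metrics` is a hypothetical helper name and the four-point example data are made up for illustration.

```r
# Hand-computed versions of the three reported metrics.
# `regression_metrics` is a hypothetical helper, not a caret function.
regression_metrics <- function(pred, obs) {
  c(
    RMSE     = sqrt(mean((pred - obs)^2)),
    # Assumption: squared Pearson correlation, not 1 - SSE/SST
    Rsquared = cor(pred, obs)^2,
    MAE      = mean(abs(pred - obs))
  )
}

# Toy values for illustration only; every prediction is off by 0.5.
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2.5, 2.5, 4.5)
metrics <- regression_metrics(pred, obs)
metrics
```

Because every absolute error in the toy data is 0.5, RMSE and MAE coincide there; on real data RMSE is at least as large as MAE, as in the kNN results above.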
Neural Network Method
# remove predictors to ensure maximum abs pairwise corr between predictors < 0.75
tooHigh <- findCorrelation(cor(trainingData$x), cutoff = .75)
# returns an empty variable
# create a tuning grid
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10))
# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)
set.seed(100)
# tune
nnetTune <- train(trainingData$x, trainingData$y,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                  maxit = 500)
nnetTune
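The MaxNWts expression in the call above counts the weights of the largest network in the grid. A small sketch of that arithmetic, assuming a single hidden layer, one output unit, and no skip-layer connections; `n_weights` is a hypothetical helper name.

```r
# Number of weights in a single-hidden-layer regression network with
# p inputs, h hidden units, and one output (no skip-layer connections).
# `n_weights` is a hypothetical helper, not part of nnet or caret.
n_weights <- function(p, h) {
  h * (p + 1) +  # each hidden unit: p input weights plus a bias
    h + 1        # output unit: h hidden-unit weights plus a bias
}

n_weights(p = 10, h = 10)  # 121, i.e. 10 * (10 + 1) + 10 + 1
```

Setting the limit from the largest `.size` in the grid (10 here) means every smaller network in the grid also fits under it.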
Support Vector Machines with Radial Basis Function Kernel
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
Resampling results across tuning parameters:
        C  RMSE      Rsquared   MAE
     0.25  2.518451  0.7977688  2.010337
     0.50  2.271316  0.8116556  1.804299
     1.00  2.106614  0.8331518  1.662671
     2.00  2.019537  0.8441622  1.570576
     4.00  1.939589  0.8559148  1.516878
     8.00  1.904125  0.8612118  1.497448
    16.00  1.900928  0.8620090  1.502851
    32.00  1.900928  0.8620090  1.502851
    64.00  1.900928  0.8620090  1.502851
   128.00  1.900928  0.8620090  1.502851
   256.00  1.900928  0.8620090  1.502851
   512.00  1.900928  0.8620090  1.502851
  1024.00  1.900928  0.8620090  1.502851
  2048.00  1.900928  0.8620090  1.502851
Tuning parameter 'sigma' was held constant at a value of 0.06172165
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.06172165 and C = 16.
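The cost values in the table are consecutive powers of two, and the resampled RMSE is identical for every C from 16 upward. The sketch below rebuilds the grid and shows that taking the first of the tied minima gives C = 16, which is consistent with the reported choice. The SVM training call is not shown in this report, so how the grid was requested (for example via a tuneLength of 14) is an assumption; the RMSE values are copied from the table.

```r
# Cost values listed in the table: 2^-2, 2^-1, ..., 2^11.
cost_grid <- 2^(-2:11)

# Resampled RMSE copied from the table; identical from C = 16 onward.
rmse <- c(2.518451, 2.271316, 2.106614, 2.019537, 1.939589, 1.904125,
          rep(1.900928, 8))

# which.min() returns the first index among ties.
best_cost <- cost_grid[which.min(rmse)]
best_cost  # 16
```

A flat profile like this suggests that raising the cost beyond 16 no longer changes the fitted model on these data, so the larger values add computation without changing the result.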
Comment: MARS demonstrates the strongest performance, achieving the lowest RMSE and MAE along with the highest R² value. SVM ranks second in terms of performance.
Does MARS select the informative predictors (those named X1–X5)?
Comment: MARS selects the informative predictors (those named X1–X5) even though X3 has an importance of zero.
2-) Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# imputation
missing <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(missing, ChemicalManufacturingProcess)
# filtering low frequencies
Chemical <- Chemical[, -nearZeroVar(Chemical)]
set.seed(1122)
# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)
# train
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]
# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]
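One caveat about the filtering step: negative indexing with an empty index vector drops every column rather than none, so the pattern is only safe when the filter flags at least one column. A guarded version can be sketched in base R as follows; `drop_columns` is a hypothetical helper, `idx` stands in for the integer vector returned by a filter such as nearZeroVar(), and the toy data frame is for illustration only.

```r
# Pitfall sketch: negative indexing with an empty index vector.
# `drop_columns` is a hypothetical helper, not a caret function.
drop_columns <- function(df, idx) {
  if (length(idx) == 0) {
    return(df)  # nothing flagged: keep every column
  }
  df[, -idx, drop = FALSE]
}

toy <- data.frame(a = 1:3, b = 4:6, c = 7:9)
ncol(toy[, -integer(0)])             # 0: all columns are lost
ncol(drop_columns(toy, integer(0)))  # 3: unchanged
ncol(drop_columns(toy, 2L))          # 2: column b removed
```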
(a) Which nonlinear regression model gives the optimal resampling and test set performance?
k-Nearest Neighbors
144 samples
56 predictor
Pre-processing: centered (56), scaled (56)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
Resampling results across tuning parameters:
   k  RMSE      Rsquared   MAE
   5  1.396642  0.3744077  1.114509
   7  1.376192  0.3917046  1.112997
   9  1.374055  0.3930803  1.124211
  11  1.368435  0.3997261  1.124392
  13  1.374638  0.3988759  1.134410
  15  1.379299  0.3990339  1.139894
  17  1.370991  0.4118993  1.134783
  19  1.373526  0.4128174  1.138457
  21  1.387453  0.4049173  1.150880
  23  1.393734  0.4067580  1.157569
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 11.
Support Vector Machines with Radial Basis Function Kernel
144 samples
56 predictor
Pre-processing: centered (56), scaled (56)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 130, 130, 128, 131, 130, 130, ...
Resampling results across tuning parameters:
        C  RMSE      Rsquared   MAE
     0.25  1.334962  0.5230124  1.1059145
     0.50  1.235421  0.5541532  1.0318823
     1.00  1.177799  0.5889356  0.9935824
     2.00  1.114080  0.6334924  0.9355337
     4.00  1.113288  0.6336954  0.9350635
     8.00  1.146250  0.6038205  0.9595706
    16.00  1.157939  0.5957328  0.9707883
    32.00  1.157939  0.5957328  0.9707883
    64.00  1.157939  0.5957328  0.9707883
   128.00  1.157939  0.5957328  0.9707883
   256.00  1.157939  0.5957328  0.9707883
   512.00  1.157939  0.5957328  0.9707883
  1024.00  1.157939  0.5957328  0.9707883
  2048.00  1.157939  0.5957328  0.9707883
Tuning parameter 'sigma' was held constant at a value of 0.0131875
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.0131875 and C = 4.
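To compare the two fits shown above side by side, their best resampled results can be collected into one small table. This is only a sketch: the numbers are copied from the printed summaries, and because kNN was resampled with the bootstrap while the SVM used 10-fold cross-validation, the comparison is approximate rather than like for like.

```r
# Best resampled results copied from the two summaries above.
resampled <- data.frame(
  model      = c("kNN (k = 11)", "SVM radial (C = 4)"),
  resampling = c("bootstrap, 25 reps", "10-fold CV"),
  RMSE       = c(1.368435, 1.113288),
  Rsquared   = c(0.3997261, 0.6336954)
)

# Order from lowest to highest resampled RMSE.
ranked <- resampled[order(resampled$RMSE), ]
ranked
```

Passing the same trainControl object to every train() call would put the models on the same resampling scheme and make such a table easier to interpret.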
Comment: Process variables make up the majority, with a ratio of 11 to 9, mirroring the distribution observed in the optimal linear model from Homework 7.
- How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
missing <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(missing, ChemicalManufacturingProcess)
set.seed(9987)
larsTune <- train(Yield ~ ., Chemical,
                  method = "lars",
                  metric = "Rsquared",
                  tuneLength = 20,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
plot(varImp(larsTune), top = 10,
     main = "Linear: Top 10 Important Predictors")
plot(varImp(svmRTune), top = 10,
     main = "Nonlinear: Top 10 Important Predictors")
Comment: The top ten most important predictors match those identified in the optimal linear model (the LARS model), with the exception of one predictor that does not appear in both lists.
(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
Comment: From the correlation plot, it turns out that ManufacturingProcess32 has the highest positive correlation with Yield. Three of the top ten variables are negatively correlated with Yield.
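The signs and sizes read off a correlation plot can be cross-checked numerically. The sketch below ranks predictors by their Pearson correlation with the response; `rank_by_correlation` is a hypothetical helper, and the toy data are made up for illustration and are not the chemical manufacturing data.

```r
# `rank_by_correlation` is a hypothetical helper: it orders numeric
# predictors by their Pearson correlation with the response.
rank_by_correlation <- function(x, y) {
  r <- vapply(x, function(col) cor(col, y), numeric(1))
  sort(r, decreasing = TRUE)
}

# Toy data for illustration only.
toy_x <- data.frame(up = 1:10, down = 10:1, flat = rep(c(1, 2), 5))
toy_y <- (1:10) * 2
ranking <- rank_by_correlation(toy_x, toy_y)
ranking
```

Applied to this exercise, the call would take the columns of train_x that correspond to the top predictors together with train_y. Correlation only captures linear association, so scatter plots remain useful for the nonlinear relationships an SVM can exploit.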