Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\(y = 10 sin(\pi x_1x_2) + 20(x_3 − 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\)
where the x values are random variables uniformly distributed between
[0, 1] (there are also 5 other non-informative variables created in the
simulation). The package mlbench contains a function called
mlbench.friedman1 that simulates these data:
#library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data.
For example:
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to bet the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
tooHigh <- findCorrelation(cor(trainingData$x), cutoff = 0.75)
nnetGrid <- expand.grid(decay = c(0.0, 0.01, 0.1),
size = c(1:10)
)
ctrl <- trainControl(method = "cv")
set.seed(100)
nnetTune <- train(trainingData$x,
trainingData$y,
method = "nnet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("center","scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500
)
nnetTune
## Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.540546 0.7254252 2.008197
## 0.00 2 2.655140 0.7062546 2.133980
## 0.00 3 2.308948 0.7717065 1.813897
## 0.00 4 2.268677 0.8065299 1.791016
## 0.00 5 2.491449 0.7556790 1.938877
## 0.00 6 3.445760 0.6172989 2.291658
## 0.00 7 5.259894 0.5137135 3.140884
## 0.00 8 5.096729 0.4494295 3.299397
## 0.00 9 6.724966 0.5040399 4.091065
## 0.00 10 3.529274 0.6008843 2.765820
## 0.01 1 2.385136 0.7603460 1.887587
## 0.01 2 2.583767 0.7260485 2.018814
## 0.01 3 2.282621 0.7815267 1.812073
## 0.01 4 2.274770 0.7901402 1.842674
## 0.01 5 2.653199 0.7237241 2.117160
## 0.01 6 2.753791 0.7153836 2.190768
## 0.01 7 2.799525 0.7123252 2.209083
## 0.01 8 3.342579 0.6050358 2.630422
## 0.01 9 3.795128 0.6115537 2.873952
## 0.01 10 3.453153 0.6008848 2.824870
## 0.10 1 2.394058 0.7596252 1.894319
## 0.10 2 2.618767 0.7152952 2.117662
## 0.10 3 2.580239 0.7353788 2.005527
## 0.10 4 2.442308 0.7777448 1.907659
## 0.10 5 2.617403 0.7227439 2.059628
## 0.10 6 2.543814 0.7549811 2.060699
## 0.10 7 2.811804 0.6959408 2.149668
## 0.10 8 2.900555 0.6937553 2.336332
## 0.10 9 3.101142 0.6579730 2.481152
## 0.10 10 2.973902 0.6608165 2.360836
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4 and decay = 0.
nnetPred <- predict(nnetTune, newdata = testData$x)
postResample(pred = nnetPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.4700280 0.7762282 1.9078363
Create a MARS grid, assigning the degrees and number of pruning parameters, then use it to tune a model.
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsTuned <- train(trainingData$x,
trainingData$y,
method = "earth",
tuneGrid = marsGrid,
trControl = ctrl)
summary(marsTuned)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=2,
## nprune=13)
##
## coefficients
## (Intercept) 22.050690
## h(0.621722-X1) -15.001651
## h(X1-0.621722) 10.878737
## h(0.601063-X2) -18.830135
## h(0.447442-X3) 9.940077
## h(X3-0.606015) 12.999390
## h(0.734892-X4) -9.877554
## h(X4-0.734892) 10.414930
## h(0.850094-X5) -5.604897
## h(X1-0.621722) * h(X2-0.295997) -43.245766
## h(0.649253-X1) * h(0.601063-X2) 26.218297
##
## Selected 11 of 18 terms, and 5 of 10 predictors (nprune=13)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 8 2
## GCV 1.747495 RSS 264.5358 GRSq 0.929051 RSq 0.9457576
Examine the importance of the variables in the model.
marsPred <- predict(marsTuned, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.2803060 0.9335241 1.0168673
svmTuned <- train(trainingData$x,
trainingData$y,
method = "svmRadial",
preProcess = c("center","scale"),
tuneLength = 14,
trControl = ctrl)
svmTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.506264 0.8036998 1.986174
## 0.50 2.279290 0.8140317 1.786308
## 1.00 2.116618 0.8349158 1.659155
## 2.00 1.997289 0.8496435 1.547397
## 4.00 1.923603 0.8598428 1.494257
## 8.00 1.864844 0.8675510 1.468198
## 16.00 1.864796 0.8672641 1.472051
## 32.00 1.864322 0.8673259 1.471146
## 64.00 1.864322 0.8673259 1.471146
## 128.00 1.864322 0.8673259 1.471146
## 256.00 1.864322 0.8673259 1.471146
## 512.00 1.864322 0.8673259 1.471146
## 1024.00 1.864322 0.8673259 1.471146
## 2048.00 1.864322 0.8673259 1.471146
##
## Tuning parameter 'sigma' was held constant at a value of 0.0627191
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0627191 and C = 32.
svmTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 32
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0627190975351777
##
## Number of Support Vectors : 152
##
## Objective Function Value : -73.9918
## Training error : 0.008495
svmPred <- predict(svmTuned, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0731179 0.8257494 1.5747339
Which models appear to give the best performance?
The MARS model gives the best performance based on having the lowest RMSE and the highest Rsquared.
rbind(
"knn" = postResample(pred = knnPred, obs = testData$y),
"net" = postResample(pred = nnetPred, obs = testData$y),
"svm" = postResample(pred = svmPred, obs = testData$y),
"mars" = postResample(pred = marsPred, obs = testData$y)
)
## RMSE Rsquared MAE
## knn 3.204059 0.6819919 2.568346
## net 2.470028 0.7762282 1.907836
## svm 2.073118 0.8257494 1.574734
## mars 1.280306 0.9335241 1.016867
Does MARS select the informative predictors (those named
X1–X5)?
MARS does select the informative predictors named X1–X5
although X3 has an importance very close to zero.
varImp(marsTuned)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.33
## X2 48.88
## X5 15.63
## X3 0.00
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
First acquire the data.
data("ChemicalManufacturingProcess")
chem_mfg <- ChemicalManufacturingProcess
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter.
Transform, impute missing values, center and scale the data, then check for low frequency values and remove them. Only one variable is removed
set.seed(404)
preProc <- preProcess(chem_mfg, method = c("BoxCox","knnImpute","center","scale"))
chem_pred <- predict(preProc,chem_mfg)
# identify predictors with low frequencies
lowFreqPredictors <- nearZeroVar(chem_pred)
# remove the above set from the data and store the result in a dataframe
chem_pred <- chem_pred[,-lowFreqPredictors]
trainingRows <- sample(c(rep(0, 0.75 * nrow(chem_pred)),
rep(1, 0.25 * nrow(chem_pred))))
chem_train <- chem_pred[trainingRows == 0,]
chem_test <- chem_pred[trainingRows == 1,]
With the data split, train each of the model types from the chapter, first KNN is trained.
trainX <- chem_train %>% select(-Yield)
trainY <- chem_train$Yield
testX <- chem_test %>% select(-Yield)
testY <- chem_test$Yield
knnModel2 <- train(x = trainX,
y = trainY,
method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10
)
knnPred2 <- predict(knnModel2, newdata = testX)
knnModel2
## k-Nearest Neighbors
##
## 132 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.7849372 0.3512194 0.6254559
## 7 0.7686649 0.3681700 0.6220105
## 9 0.7665498 0.3708665 0.6250032
## 11 0.7777691 0.3562217 0.6417412
## 13 0.7823360 0.3568635 0.6446992
## 15 0.7835197 0.3631654 0.6498450
## 17 0.7830489 0.3712792 0.6512016
## 19 0.7833626 0.3758749 0.6528065
## 21 0.7855431 0.3789807 0.6541842
## 23 0.7851436 0.3860551 0.6546798
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
Next Neural Network is trained.
tooHigh2 <- findCorrelation(cor(trainX), cutoff = 0.75)
trainXnnet <- trainX[,-tooHigh2]
nnetGrid2 <- expand.grid(decay = c(0.0, 0.01, 0.1),
size = c(1:6)
)
set.seed(101)
nnetTune2 <- train(trainX,
trainY,
method = "nnet",
tuneGrid = nnetGrid2,
trControl = ctrl,
preProc = c("center","scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainXnnet) + 1) + 10 + 1,
maxit = 500
)
nnetTune2
## Neural Network
##
## 132 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 120, 119, 118, 119, 118, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 0.9344072 0.3326020 0.7414160
## 0.00 2 0.8590479 0.3602825 0.6682005
## 0.00 3 0.9311655 0.3109729 0.7343087
## 0.00 4 1.0177424 0.3439668 0.8293547
## 0.00 5 0.9824927 0.3485994 0.7932048
## 0.00 6 0.9228106 0.3484658 0.7276560
## 0.01 1 0.9008567 0.3092818 0.6912542
## 0.01 2 1.1299205 0.2780768 0.8949143
## 0.01 3 1.0227262 0.3137698 0.8256810
## 0.01 4 0.8114177 0.4690071 0.6618649
## 0.01 5 0.8006894 0.4558987 0.6284714
## 0.01 6 0.7503117 0.5000343 0.6063890
## 0.10 1 0.7492391 0.4607657 0.5871559
## 0.10 2 0.7695902 0.4989661 0.6057562
## 0.10 3 0.8649446 0.4472045 0.6714006
## 0.10 4 0.8129531 0.4691542 0.6418103
## 0.10 5 0.7667720 0.4785186 0.6163900
## 0.10 6 0.6703694 0.5724950 0.5478685
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 6 and decay = 0.1.
Next MARS is trained.
marsGrid2 <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsTuned2 <- train(trainX,
trainY,
method = "earth",
tuneGrid = marsGrid2,
trControl = ctrl)
summary(marsTuned2)
## Call: earth(x=data.frame[132,56], y=c(-1.216,1.21,1...), keepxy=TRUE, degree=1,
## nprune=3)
##
## coefficients
## (Intercept) -0.5194954
## h(0.998218-ManufacturingProcess11) -0.3726384
## h(ManufacturingProcess32- -1.21828) 0.7018203
##
## Selected 3 of 21 terms, and 2 of 56 predictors (nprune=3)
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: ManufacturingProcess32, ManufacturingProcess11, ...
## Number of terms at each degree of interaction: 1 2 (additive model)
## GCV 0.4063841 RSS 49.65583 GRSq 0.5296448 RSq 0.5579302
Last SVM is trained.
svmTuned2 <- train(trainX,
trainY,
method = "svmRadial",
preProcess = c("center","scale"),
tuneLength = 10,
trControl = ctrl)
svmTuned2
## Support Vector Machines with Radial Basis Function Kernel
##
## 132 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 118, 119, 119, 120, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7057397 0.5318341 0.5954018
## 0.50 0.6602977 0.5603907 0.5507888
## 1.00 0.6128876 0.6116305 0.5045976
## 2.00 0.5874948 0.6367443 0.4753599
## 4.00 0.5758092 0.6468031 0.4714576
## 8.00 0.5708086 0.6508088 0.4719308
## 16.00 0.5666914 0.6557679 0.4692121
## 32.00 0.5666914 0.6557679 0.4692121
## 64.00 0.5666914 0.6557679 0.4692121
## 128.00 0.5666914 0.6557679 0.4692121
##
## Tuning parameter 'sigma' was held constant at a value of 0.01166393
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01166393 and C = 16.
What is the optimal value of the performance metric?
Overall the optimal value of the performance metric is \(R^2 = 0.9980564\) in the Neural Network model.
Which nonlinear regression model gives the optimal resampling and test set performance?
The Neural Net model gives the best performance based on having the lowest RMSE and the highest Rsquared.
rbind(
"nnet" = postResample(pred = predict(nnetTune2), obs = trainY),
"knn" = postResample(pred = predict(knnModel2), obs = trainY),
"mars" = postResample(pred = predict(marsTuned2), obs = trainY),
"svm" = postResample(pred = predict(svmTuned2), obs = trainY)
)
## RMSE Rsquared MAE
## nnet 0.04288538 0.9980564 0.03299418
## knn 0.62525567 0.5753418 0.51608229
## mars 0.61333555 0.5579302 0.49940287
## svm 0.08804417 0.9934878 0.08580215
Which predictors are most important in the optimal nonlinear regression model?
In the SVM model the top five most important predictors are ManufacturingProcesses 32 and 36, and BiologicalMaterial 06, 03 and 01.
varImp(svmTuned2)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 68.15
## BiologicalMaterial03 57.69
## ManufacturingProcess36 55.55
## BiologicalMaterial01 50.58
## ManufacturingProcess31 50.07
## BiologicalMaterial02 49.51
## ManufacturingProcess29 46.93
## BiologicalMaterial04 45.54
## ManufacturingProcess33 45.48
## ManufacturingProcess13 45.39
## BiologicalMaterial12 44.54
## ManufacturingProcess02 43.34
## ManufacturingProcess09 40.54
## ManufacturingProcess06 36.00
## BiologicalMaterial08 31.46
## ManufacturingProcess17 29.54
## ManufacturingProcess26 29.03
## BiologicalMaterial11 28.94
## ManufacturingProcess11 28.50
Do either the biological or process variables dominate the list?
With only 8 biological variables but 12 process variables, the process variables edge out the biological ones for dominance.
How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
Looking at the top 20 predictors from the other models, the Neural Net model’s top 20 variables show only 5 biological and 15 process variables that dominate the list.
varImp(nnetTune2)
## nnet variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess23 100.00
## ManufacturingProcess32 96.43
## ManufacturingProcess34 92.81
## ManufacturingProcess33 84.75
## BiologicalMaterial09 82.03
## ManufacturingProcess01 80.49
## ManufacturingProcess03 80.04
## BiologicalMaterial11 78.73
## ManufacturingProcess28 76.03
## ManufacturingProcess45 73.56
## ManufacturingProcess36 72.53
## ManufacturingProcess06 71.24
## BiologicalMaterial12 70.66
## ManufacturingProcess17 70.64
## ManufacturingProcess37 70.00
## ManufacturingProcess41 69.68
## ManufacturingProcess08 69.22
## BiologicalMaterial10 68.23
## BiologicalMaterial04 67.77
## ManufacturingProcess22 67.55
The MARS model shows 2 process variables dominating the top slots.
varImp(marsTuned2)
## earth variable importance
##
## Overall
## ManufacturingProcess32 100
## ManufacturingProcess11 0
The KNN model matches the SVM model with 8 biological and 12 process variables dominating the list.
varImp(knnModel2)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 68.15
## BiologicalMaterial03 57.69
## ManufacturingProcess36 55.55
## BiologicalMaterial01 50.58
## ManufacturingProcess31 50.07
## BiologicalMaterial02 49.51
## ManufacturingProcess29 46.93
## BiologicalMaterial04 45.54
## ManufacturingProcess33 45.48
## ManufacturingProcess13 45.39
## BiologicalMaterial12 44.54
## ManufacturingProcess02 43.34
## ManufacturingProcess09 40.54
## ManufacturingProcess06 36.00
## BiologicalMaterial08 31.46
## ManufacturingProcess17 29.54
## ManufacturingProcess26 29.03
## BiologicalMaterial11 28.94
## ManufacturingProcess11 28.50
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
Plot the top 10 predictors for the Neural Network model as identified above. Then make a correlation plot to Yield.
The correlation plot shows that ManufacturingProcesses 32, 34, 33 and 28 appear to have positive correlations with Yield, while ManufacturingProcesses 23, 01, 03 and 45 appear to have a negative correlation with Yield. That said, both BiologicalMaterials 09 and 11 in the top 10 predictors appear to have a positive correlation with Yield.
top10X <- chem_pred %>%
select(c("ManufacturingProcess23","ManufacturingProcess32","ManufacturingProcess34","ManufacturingProcess33","BiologicalMaterial09","ManufacturingProcess01","ManufacturingProcess03","BiologicalMaterial11","ManufacturingProcess28","ManufacturingProcess45","Yield"))
correlation <- cor(top10X)
corrplot(correlation)