Do problems 7.2 and 7.5 in Kuhn and Johnson. There are only two but they have many parts. Please submit both a link to your Rpubs and the .rmd file.
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\(y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)\)
where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
> library(mlbench)
> set.seed(200)
> trainingData <- mlbench.friedman1(200, sd = 1)
> ## We convert the 'x' data from a matrix to a data frame
> ## One reason is that this will give the columns names.
> trainingData$x <- data.frame(trainingData$x)
> ## Look at the data using
> featurePlot(trainingData$x, trainingData$y)
> ## or other methods.
>
> ## This creates a list with a vector 'y' and a matrix
> ## of predictors 'x'. Also simulate a large test set to
> ## estimate the true error rate with good precision:
> testData <- mlbench.friedman1(5000, sd = 1)
> testData$x <- data.frame(testData$x)
>
Tune several models on these data. For example:
> library(caret)
> knnModel <- train(x = trainingData$x,
+                   y = trainingData$y,
+                   method = "knn",
+                   preProc = c("center", "scale"),
+                   tuneLength = 10)
> knnModel
200 samples
 10 predictors

Pre-processing: centered, scaled
Resampling: Bootstrap (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, …
Resampling results across tuning parameters:

   k  RMSE  Rsquared  RMSE SD  Rsquared SD
   5  3.51  0.496     0.238    0.0641
   7  3.36  0.536     0.24     0.0617
   9  3.3   0.559     0.251    0.0546
  11  3.24  0.586     0.252    0.0501
  13  3.2   0.61      0.234    0.0465
  15  3.19  0.623     0.264    0.0496
  17  3.19  0.63      0.286    0.0528
  19  3.18  0.643     0.274    0.048
  21  3.2   0.646     0.269    0.0464
  23  3.2   0.652     0.267    0.0465

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 19.

> knnPred <- predict(knnModel, newdata = testData$x)
> ## The function 'postResample' can be used to get the test set
> ## performance values
> postResample(pred = knnPred, obs = testData$y)
     RMSE  Rsquared
3.2286834 0.6871735
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.3
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.3
## Loading required package: lattice
library(earth)
## Warning: package 'earth' was built under R version 4.4.3
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.4.3
## Loading required package: plotrix
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(gbm)
## Warning: package 'gbm' was built under R version 4.4.3
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(nnet)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
## Warning: package 'knitr' was built under R version 4.4.3
After loading the packages, I simulate a toy regression data set in which only x1–x5 affect y and x6–x10 are pure noise. I use 200 points for training and 5,000 points for testing, with random noise of sd = 1 added to the response.
set.seed(200)
tr <- mlbench.friedman1(n = 200, sd = 1)
te <- mlbench.friedman1(n = 5000, sd = 1)
x_names <- paste0("x", 1:10)
training <- data.frame(tr$x)
colnames(training) <- x_names
training$y <- tr$y
testing <- data.frame(te$x)
colnames(testing) <- x_names
testing$y <- te$y
str(training)
## 'data.frame': 200 obs. of 11 variables:
## $ x1 : num 0.534 0.584 0.59 0.691 0.667 ...
## $ x2 : num 0.648 0.438 0.588 0.226 0.819 ...
## $ x3 : num 0.8508 0.6727 0.4097 0.0334 0.7168 ...
## $ x4 : num 0.1816 0.6692 0.3381 0.0669 0.8032 ...
## $ x5 : num 0.929 0.1638 0.8941 0.6374 0.0831 ...
## $ x6 : num 0.3618 0.4531 0.0268 0.525 0.2234 ...
## $ x7 : num 0.827 0.649 0.179 0.513 0.664 ...
## $ x8 : num 0.421 0.845 0.35 0.797 0.904 ...
## $ x9 : num 0.5911 0.9282 0.0176 0.6899 0.397 ...
## $ x10: num 0.589 0.758 0.444 0.445 0.55 ...
## $ y : num 18.5 16.1 17.8 13.8 18.4 ...
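To make the "only x1–x5 matter" point concrete, here is a minimal base-R sketch of the noise-free part of the Friedman equation from the problem statement. The helper name `friedman1_signal` and the toy matrix are my own assumptions for illustration; this is not the internals of mlbench.friedman1.

```r
# Noise-free Friedman 1 signal, written from the equation in the problem statement.
# `friedman1_signal` is a hypothetical helper name, not part of mlbench.
friedman1_signal <- function(x) {
  10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
}

set.seed(1)
x <- matrix(runif(50 * 10), ncol = 10)   # 50 rows, 10 uniform predictors
y_before <- friedman1_signal(x)

# Overwrite the five noise columns; the signal should not change at all.
x[, 6:10] <- runif(50 * 5)
y_after <- friedman1_signal(x)

identical(y_before, y_after)
```

Because columns 6 to 10 never enter the formula, replacing them leaves the signal untouched, which is what a good model should learn from the data.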
Next comes training. I use caret with bootstrap resampling (25 resamples) to tune each model, and I center and scale the predictors for the models that are sensitive to predictor scale (KNN, SVM, and the neural net). With that setup in place, I fit KNN, MARS (earth), SVM with an RBF kernel (svmRadial), random forest, GBM, and a simple neural net.
ctrl <- trainControl(method = "boot", number = 25)
pp <- c("center", "scale")
models <- list()
# KNN
set.seed(123)
models$knn <- train(
y ~ ., data = training,
method = "knn",
preProcess = pp,
trControl = ctrl,
tuneLength = 10
)
# MARS
set.seed(123)
models$mars <- train(
y ~ ., data = training,
method = "earth",
trControl = ctrl,
tuneGrid = expand.grid(
degree = 1:3,
nprune = 2:25
)
)
# SVM
set.seed(123)
models$svmR <- train(
y ~ ., data = training,
method = "svmRadial",
preProcess = pp,
trControl = ctrl,
tuneLength = 10
)
# Random Forest
set.seed(123)
models$rf <- train(
y ~ ., data = training,
method = "rf",
trControl = ctrl,
tuneLength = 10,
importance = TRUE
)
## note: only 9 unique complexity parameters in default grid. Truncating the grid to 9 .
# GBM
set.seed(123)
models$gbm <- train(
y ~ ., data = training,
method = "gbm",
trControl = ctrl,
verbose = FALSE,
tuneLength = 10
)
# Neural Network
set.seed(123)
models$nnet <- train(
y ~ ., data = training,
method = "nnet",
preProcess = pp,
trControl = ctrl,
linout = TRUE, trace = FALSE,
tuneLength = 10
)
models
## $knn
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.548433 0.4919564 2.888283
## 7 3.425531 0.5255725 2.778090
## 9 3.346026 0.5523023 2.704791
## 11 3.252313 0.5875603 2.620492
## 13 3.232552 0.6000482 2.601113
## 15 3.205067 0.6203296 2.586704
## 17 3.172791 0.6408339 2.566738
## 19 3.183306 0.6494300 2.587220
## 21 3.190873 0.6556293 2.596793
## 23 3.202234 0.6597746 2.604279
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
##
## $mars
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.379381 0.2301740 3.575902
## 1 3 3.649438 0.4583683 2.944879
## 1 4 2.769352 0.6876944 2.223704
## 1 5 2.529007 0.7399204 2.018331
## 1 6 2.366383 0.7734368 1.888582
## 1 7 1.988717 0.8380231 1.581362
## 1 8 1.883468 0.8556729 1.484933
## 1 9 1.827116 0.8637619 1.443208
## 1 10 1.788268 0.8690065 1.410531
## 1 11 1.810724 0.8662722 1.424227
## 1 12 1.815936 0.8656814 1.422587
## 1 13 1.824463 0.8644827 1.433229
## 1 14 1.842692 0.8615394 1.450268
## 1 15 1.856755 0.8590033 1.460392
## 1 16 1.855987 0.8591456 1.456274
## 1 17 1.861692 0.8581446 1.459553
## 1 18 1.861692 0.8581446 1.459553
## 1 19 1.861692 0.8581446 1.459553
## 1 20 1.861692 0.8581446 1.459553
## 1 21 1.861692 0.8581446 1.459553
## 1 22 1.861692 0.8581446 1.459553
## 1 23 1.861692 0.8581446 1.459553
## 1 24 1.861692 0.8581446 1.459553
## 1 25 1.861692 0.8581446 1.459553
## 2 2 4.380494 0.2299576 3.575048
## 2 3 3.629884 0.4646310 2.926350
## 2 4 2.778371 0.6865919 2.224771
## 2 5 2.566876 0.7332750 2.054810
## 2 6 2.382859 0.7709321 1.896330
## 2 7 2.048769 0.8295692 1.624635
## 2 8 1.913359 0.8499676 1.511701
## 2 9 1.767324 0.8723918 1.391842
## 2 10 1.671011 0.8872298 1.316228
## 2 11 1.597987 0.8969986 1.256555
## 2 12 1.517892 0.9073210 1.191216
## 2 13 1.474002 0.9134780 1.159746
## 2 14 1.469111 0.9143875 1.150608
## 2 15 1.496019 0.9116532 1.166780
## 2 16 1.492804 0.9118417 1.167786
## 2 17 1.498717 0.9113877 1.172241
## 2 18 1.494332 0.9118672 1.168045
## 2 19 1.491873 0.9118443 1.164638
## 2 20 1.491873 0.9118443 1.164638
## 2 21 1.491873 0.9118443 1.164638
## 2 22 1.491873 0.9118443 1.164638
## 2 23 1.491873 0.9118443 1.164638
## 2 24 1.491873 0.9118443 1.164638
## 2 25 1.491873 0.9118443 1.164638
## 3 2 4.380494 0.2299576 3.575048
## 3 3 3.629884 0.4646310 2.926350
## 3 4 2.778371 0.6865919 2.224771
## 3 5 2.567865 0.7331820 2.055592
## 3 6 2.383233 0.7710261 1.900749
## 3 7 2.049557 0.8295167 1.625178
## 3 8 1.895175 0.8519974 1.499687
## 3 9 1.746761 0.8746763 1.379256
## 3 10 1.640293 0.8911817 1.295797
## 3 11 1.577209 0.8996651 1.243485
## 3 12 1.537710 0.9045652 1.205956
## 3 13 1.491772 0.9109255 1.173306
## 3 14 1.492584 0.9113387 1.165290
## 3 15 1.522638 0.9082233 1.181381
## 3 16 1.505946 0.9101743 1.176136
## 3 17 1.519061 0.9087574 1.186519
## 3 18 1.519184 0.9088871 1.187386
## 3 19 1.517421 0.9088800 1.183905
## 3 20 1.518389 0.9088264 1.185146
## 3 21 1.518389 0.9088264 1.185146
## 3 22 1.518389 0.9088264 1.185146
## 3 23 1.518389 0.9088264 1.185146
## 3 24 1.518389 0.9088264 1.185146
## 3 25 1.518389 0.9088264 1.185146
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
##
## $svmR
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.521970 0.7754171 2.000615
## 0.50 2.334701 0.7878369 1.837681
## 1.00 2.215481 0.8032030 1.738507
## 2.00 2.135470 0.8138612 1.679711
## 4.00 2.107537 0.8177454 1.651024
## 8.00 2.100487 0.8189859 1.648012
## 16.00 2.100901 0.8189207 1.648818
## 32.00 2.100901 0.8189207 1.648818
## 64.00 2.100901 0.8189207 1.648818
## 128.00 2.100901 0.8189207 1.648818
##
## Tuning parameter 'sigma' was held constant at a value of 0.06510592
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06510592 and C = 8.
##
## $rf
## Random Forest
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 2.939397 0.7792566 2.400400
## 3 2.743268 0.7807734 2.244765
## 4 2.664879 0.7707821 2.182894
## 5 2.633024 0.7596175 2.162161
## 6 2.624467 0.7492846 2.153683
## 7 2.638857 0.7378569 2.158893
## 8 2.637784 0.7335712 2.163582
## 9 2.659298 0.7247490 2.177576
## 10 2.674422 0.7176961 2.185493
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
##
## $gbm
## Stochastic Gradient Boosting
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 2.662927 0.7669351 2.155691
## 1 100 2.136730 0.8284599 1.715460
## 1 150 1.990956 0.8414438 1.592837
## 1 200 1.949177 0.8447809 1.560267
## 1 250 1.934782 0.8461753 1.550482
## 1 300 1.932574 0.8462382 1.549500
## 1 350 1.933538 0.8455336 1.549050
## 1 400 1.935569 0.8452545 1.550463
## 1 450 1.941269 0.8442172 1.554403
## 1 500 1.947469 0.8430192 1.555970
## 2 50 2.259010 0.8125994 1.820147
## 2 100 1.972925 0.8441382 1.581289
## 2 150 1.929312 0.8482113 1.544319
## 2 200 1.910190 0.8502221 1.525771
## 2 250 1.896076 0.8521578 1.513182
## 2 300 1.893888 0.8522390 1.509556
## 2 350 1.893187 0.8520427 1.509233
## 2 400 1.893784 0.8519167 1.509371
## 2 450 1.893104 0.8520214 1.509618
## 2 500 1.895153 0.8516943 1.511954
## 3 50 2.131435 0.8269735 1.709878
## 3 100 1.963621 0.8445106 1.575209
## 3 150 1.938152 0.8467056 1.553579
## 3 200 1.926838 0.8481352 1.544587
## 3 250 1.925908 0.8480251 1.544353
## 3 300 1.922480 0.8484169 1.539545
## 3 350 1.923510 0.8480165 1.541188
## 3 400 1.923422 0.8480129 1.540206
## 3 450 1.923526 0.8479550 1.540493
## 3 500 1.924342 0.8477260 1.541208
## 4 50 2.151210 0.8194326 1.749935
## 4 100 2.003060 0.8372337 1.621161
## 4 150 1.984440 0.8391518 1.603440
## 4 200 1.975062 0.8400893 1.595649
## 4 250 1.970456 0.8405529 1.591093
## 4 300 1.970681 0.8403586 1.590703
## 4 350 1.968761 0.8405830 1.589445
## 4 400 1.968652 0.8405646 1.589682
## 4 450 1.969487 0.8403925 1.590496
## 4 500 1.969543 0.8403299 1.590424
## 5 50 2.129360 0.8222135 1.706991
## 5 100 2.022195 0.8346287 1.622360
## 5 150 2.006717 0.8363211 1.610935
## 5 200 1.999619 0.8370751 1.604096
## 5 250 1.997688 0.8370989 1.602274
## 5 300 1.997392 0.8370523 1.602808
## 5 350 1.997579 0.8369950 1.602795
## 5 400 1.998657 0.8366967 1.603907
## 5 450 1.999441 0.8365490 1.604727
## 5 500 1.999514 0.8364955 1.604745
## 6 50 2.141636 0.8207296 1.729567
## 6 100 2.031486 0.8344218 1.632916
## 6 150 2.009323 0.8369521 1.615006
## 6 200 2.003077 0.8377095 1.610907
## 6 250 2.001749 0.8376926 1.609989
## 6 300 2.001376 0.8376650 1.609890
## 6 350 2.000561 0.8377161 1.610021
## 6 400 2.000613 0.8376923 1.610314
## 6 450 2.000891 0.8376305 1.610756
## 6 500 2.000591 0.8376486 1.610762
## 7 50 2.147796 0.8170121 1.728816
## 7 100 2.045462 0.8310030 1.642742
## 7 150 2.027714 0.8326445 1.628203
## 7 200 2.022155 0.8333263 1.623764
## 7 250 2.019609 0.8335130 1.621476
## 7 300 2.020193 0.8332604 1.623893
## 7 350 2.019976 0.8332055 1.624626
## 7 400 2.020696 0.8330308 1.625349
## 7 450 2.020919 0.8329466 1.625767
## 7 500 2.020928 0.8329237 1.625955
## 8 50 2.134665 0.8205264 1.715710
## 8 100 2.040522 0.8324108 1.632554
## 8 150 2.023147 0.8346841 1.618988
## 8 200 2.019630 0.8350097 1.616400
## 8 250 2.018333 0.8350745 1.614718
## 8 300 2.017688 0.8350880 1.614124
## 8 350 2.017636 0.8349422 1.614186
## 8 400 2.017815 0.8348525 1.614336
## 8 450 2.017487 0.8348920 1.613893
## 8 500 2.016970 0.8349546 1.613204
## 9 50 2.149750 0.8197409 1.738022
## 9 100 2.037922 0.8339700 1.644096
## 9 150 2.024284 0.8351677 1.633434
## 9 200 2.015943 0.8363526 1.628613
## 9 250 2.017149 0.8358945 1.629243
## 9 300 2.014786 0.8360855 1.627102
## 9 350 2.014319 0.8360793 1.627469
## 9 400 2.013502 0.8361514 1.627219
## 9 450 2.013162 0.8361894 1.627011
## 9 500 2.012633 0.8362407 1.626413
## 10 50 2.181351 0.8146208 1.764322
## 10 100 2.078980 0.8271963 1.678231
## 10 150 2.054503 0.8301874 1.659809
## 10 200 2.050290 0.8304564 1.656022
## 10 250 2.047502 0.8306832 1.653793
## 10 300 2.046938 0.8306972 1.654210
## 10 350 2.046900 0.8306213 1.654248
## 10 400 2.046991 0.8304888 1.654283
## 10 450 2.046110 0.8305817 1.653799
## 10 500 2.046233 0.8305084 1.653734
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 450, interaction.depth =
## 2, shrinkage = 0.1 and n.minobsinnode = 10.
##
## $nnet
## Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0.0000000000 2.720821 0.6963528 2.144372
## 1 0.0001000000 2.740495 0.6879415 2.159635
## 1 0.0002371374 2.644606 0.7104242 2.078619
## 1 0.0005623413 2.570416 0.7304802 2.006435
## 1 0.0013335214 2.721690 0.6944735 2.147888
## 1 0.0031622777 2.614148 0.7199003 2.050357
## 1 0.0074989421 2.635328 0.7130946 2.067298
## 1 0.0177827941 2.555706 0.7318441 1.991173
## 1 0.0421696503 2.472042 0.7517463 1.921555
## 1 0.1000000000 2.477171 0.7502891 1.923842
## 3 0.0000000000 2.748545 0.7072398 2.153599
## 3 0.0001000000 2.843439 0.6906492 2.227506
## 3 0.0002371374 3.000057 0.6595436 2.348786
## 3 0.0005623413 2.865831 0.6900342 2.240437
## 3 0.0013335214 2.998019 0.6590715 2.350431
## 3 0.0031622777 2.916226 0.6754999 2.259149
## 3 0.0074989421 2.800455 0.6962105 2.202414
## 3 0.0177827941 2.773423 0.7043736 2.186727
## 3 0.0421696503 2.712292 0.7116412 2.132861
## 3 0.1000000000 2.719941 0.7122206 2.129439
## 5 0.0000000000 3.842557 0.5228318 2.757129
## 5 0.0001000000 3.857055 0.5299531 2.800108
## 5 0.0002371374 3.230973 0.6245512 2.486117
## 5 0.0005623413 4.043926 0.5100135 2.902935
## 5 0.0013335214 3.694755 0.5674325 2.715365
## 5 0.0031622777 3.780857 0.5697064 2.720876
## 5 0.0074989421 3.507333 0.6046869 2.592462
## 5 0.0177827941 3.534284 0.5898856 2.664891
## 5 0.0421696503 3.335676 0.6234519 2.571558
## 5 0.1000000000 3.078458 0.6479865 2.412689
## 7 0.0000000000 4.262041 0.4949653 3.135508
## 7 0.0001000000 4.185739 0.5238830 3.101229
## 7 0.0002371374 4.070085 0.5327162 3.002808
## 7 0.0005623413 4.245490 0.4872326 3.140667
## 7 0.0013335214 4.304565 0.5077572 3.050619
## 7 0.0031622777 4.087502 0.5330464 3.049046
## 7 0.0074989421 3.990086 0.5191206 2.971460
## 7 0.0177827941 3.913309 0.5313526 2.957436
## 7 0.0421696503 3.743472 0.5584149 2.855628
## 7 0.1000000000 3.456147 0.6020078 2.671900
## 9 0.0000000000 3.815721 0.5436299 2.991900
## 9 0.0001000000 3.639058 0.5548910 2.849963
## 9 0.0002371374 3.624585 0.5724005 2.869579
## 9 0.0005623413 3.573753 0.5614835 2.834294
## 9 0.0013335214 3.828343 0.5290028 2.928182
## 9 0.0031622777 3.714919 0.5456330 2.914720
## 9 0.0074989421 3.784196 0.5367378 2.972530
## 9 0.0177827941 3.848398 0.5504220 3.018004
## 9 0.0421696503 3.444928 0.5907449 2.710247
## 9 0.1000000000 3.583667 0.5771935 2.745304
## 11 0.0000000000 3.516020 0.5822296 2.822225
## 11 0.0001000000 3.605001 0.5602203 2.911470
## 11 0.0002371374 3.549931 0.5854071 2.838226
## 11 0.0005623413 3.421415 0.5968753 2.740715
## 11 0.0013335214 3.676684 0.5508806 2.939927
## 11 0.0031622777 3.585169 0.5743332 2.837713
## 11 0.0074989421 3.564169 0.5790511 2.828052
## 11 0.0177827941 3.540136 0.5723844 2.828299
## 11 0.0421696503 3.362786 0.6041004 2.680440
## 11 0.1000000000 3.138938 0.6453742 2.512598
## 13 0.0000000000 3.646660 0.5565645 2.944417
## 13 0.0001000000 3.774192 0.5433150 2.997645
## 13 0.0002371374 3.641519 0.5582860 2.915688
## 13 0.0005623413 3.617439 0.5598625 2.887121
## 13 0.0013335214 3.710111 0.5340659 2.969806
## 13 0.0031622777 3.646036 0.5754119 2.902364
## 13 0.0074989421 3.509097 0.5826774 2.804368
## 13 0.0177827941 3.398125 0.5924129 2.725466
## 13 0.0421696503 3.328108 0.6060886 2.660725
## 13 0.1000000000 3.059411 0.6617937 2.440026
## 15 0.0000000000 3.539170 0.5727292 2.850442
## 15 0.0001000000 3.376298 0.6087465 2.656220
## 15 0.0002371374 3.647400 0.5537032 2.933967
## 15 0.0005623413 3.575715 0.5629717 2.843720
## 15 0.0013335214 3.354746 0.6020220 2.686683
## 15 0.0031622777 3.419412 0.5913773 2.720389
## 15 0.0074989421 3.228368 0.6187226 2.598103
## 15 0.0177827941 3.274306 0.6140641 2.608672
## 15 0.0421696503 3.070763 0.6455898 2.452687
## 15 0.1000000000 3.035844 0.6594592 2.405830
## 17 0.0000000000 3.407766 0.5851464 2.716077
## 17 0.0001000000 3.439381 0.5915626 2.738417
## 17 0.0002371374 3.285265 0.6108649 2.629568
## 17 0.0005623413 3.396658 0.6004148 2.712050
## 17 0.0013335214 3.306699 0.6193559 2.681876
## 17 0.0031622777 3.294924 0.6180026 2.622051
## 17 0.0074989421 3.281099 0.6174269 2.618965
## 17 0.0177827941 3.178124 0.6377113 2.515145
## 17 0.0421696503 2.984242 0.6771062 2.356519
## 17 0.1000000000 2.907278 0.6778058 2.325290
## 19 0.0000000000 3.227202 0.6256920 2.576206
## 19 0.0001000000 3.308582 0.6092116 2.656080
## 19 0.0002371374 3.306855 0.6045088 2.632505
## 19 0.0005623413 3.327809 0.6008964 2.656143
## 19 0.0013335214 3.302237 0.6099785 2.619051
## 19 0.0031622777 3.155991 0.6371207 2.499281
## 19 0.0074989421 3.143573 0.6383025 2.481631
## 19 0.0177827941 3.133332 0.6403877 2.490832
## 19 0.0421696503 2.850779 0.6877266 2.288207
## 19 0.1000000000 2.828308 0.6954259 2.248009
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.04216965.
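The RMSE columns above are averages over 25 bootstrap resamples. As a rough sketch of what that resampling does, the base-R code below draws rows with replacement, fits on the drawn rows, and scores on the rows that were never drawn. The toy data, the linear model, and the helper name `boot_rmse` are assumptions made for illustration; caret handles all of this internally.

```r
# Toy data and a linear model stand in for the real models (assumption).
set.seed(1)
n <- 200
toy <- data.frame(x = runif(n))
toy$y <- 3 * toy$x + rnorm(n, sd = 0.5)

# `boot_rmse` is a hypothetical helper: one held-out RMSE per bootstrap resample.
boot_rmse <- function(data, reps = 25) {
  vapply(seq_len(reps), function(i) {
    in_bag  <- sample(nrow(data), replace = TRUE)      # rows drawn with replacement
    out_bag <- setdiff(seq_len(nrow(data)), in_bag)    # rows never drawn
    fit  <- lm(y ~ x, data = data[in_bag, ])
    pred <- predict(fit, newdata = data[out_bag, ])
    sqrt(mean((data$y[out_bag] - pred)^2))
  }, numeric(1))
}

rmse_reps <- boot_rmse(toy)
mean(rmse_reps)   # the kind of average reported in the RMSE column
```

Each resample has 200 in-bag rows, which matches the "Summary of sample sizes: 200, 200, ..." line in the output.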
For the test-set evaluation, I predict on the separate 5,000-row test set that was not used for training and compare three metrics: RMSE, MAE, and Rsquared. I treat the model with the smallest RMSE and largest Rsquared as the best overall, since those values show how well the model generalizes to new data.
testX <- testing[, paste0("x", 1:10)]
obs <- testing$y
test_tbl <- lapply(models, function(m) {
as.data.frame(t(postResample(predict(m, newdata = testX), obs)))
})
test_tbl <- do.call(rbind, test_tbl)
test_tbl$model <- rownames(test_tbl)
row.names(test_tbl) <- NULL
test_tbl[order(test_tbl$RMSE), c("model","RMSE","Rsquared","MAE")]
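For reference, the three metrics can be computed by hand. The sketch below assumes that Rsquared is the squared correlation between predictions and observations, which is my understanding of what postResample reports for numeric outcomes. The helper name `regression_metrics` and the three-point example are hypothetical.

```r
# Hand-rolled versions of the three metrics (hypothetical helper, base R only).
# Assumption: Rsquared is the squared Pearson correlation of pred and obs.
regression_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),
    Rsquared = cor(pred, obs)^2,
    MAE      = mean(abs(pred - obs)))
}

# Tiny made-up example: two perfect predictions and one that is off by 1.
m <- regression_metrics(pred = c(1, 2, 3), obs = c(1, 2, 4))
m
```

RMSE penalizes the single large miss more heavily than MAE does, which is why the two can rank models differently.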
For the fitted MARS model, I use varImp() to see how important each predictor is. Bigger scores mean the variable helped the model more. In this data, I expect x1–x5 to have high importance and x6–x10 to be near zero, since only x1–x5 truly affect y.
mars_imp <- varImp(models$mars, scale = FALSE)$importance
mars_imp[order(-mars_imp$Overall), , drop = FALSE]
Here I list the basis functions that the final MARS model kept, using the row names of its coefficient matrix (calling labels() on the fitted object only returns the names of its list components, not the terms). The predictor names that appear in those terms are the ones the model actually uses. If the terms involve only x1–x5 (and not x6–x10), that confirms MARS selected the informative predictors and ignored the noise.
rownames(models$mars$finalModel$coefficients)
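MARS basis functions are hinge functions of the form max(0, x - knot), sometimes multiplied together for interactions. The sketch below shows a hinge and one way to pull predictor names out of term labels. The labels in `term_labels` are hypothetical examples written in the style that earth prints; they are not taken from the fitted model.

```r
# Hinge function: zero below the knot, linear above it.
hinge <- function(x, knot) pmax(0, x - knot)

hinge(c(0.2, 0.5, 0.9), knot = 0.5)

# Hypothetical term labels for illustration only (not the fitted model's terms).
term_labels <- c("(Intercept)", "h(x1-0.62)", "h(0.62-x1)", "h(x4-0.3)*h(x2-0.5)")

# Collect every predictor name of the form x<digits> that appears in any term.
used <- sort(unique(unlist(regmatches(term_labels, gregexpr("x[0-9]+", term_labels)))))
used

# Any names left here would be noise predictors that made it into the model.
setdiff(used, paste0("x", 1:5))
```

An empty result from the last line is the check that matters here: it would mean no term involves x6–x10.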
Overall, on the Friedman 1 data (200 training / 5,000 test rows), the best performance came from MARS, GBM, and SVM with an RBF kernel, while KNN performed worst. The resampling results above agree: MARS had the lowest bootstrap RMSE (1.47 at degree = 2, nprune = 14), followed by GBM (1.89) and SVM (2.10), with KNN last (3.17). MARS variable importance ranked x1, x4, x2, x5, x3 highest and the final MARS terms used only x1–x5, which tells me that MARS selected the informative predictors and ignored the noise variables x6–x10.
7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models. (a) Which nonlinear regression model gives the optimal resampling and test set performance?
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.3
data("ChemicalManufacturingProcess")
dat <- ChemicalManufacturingProcess
set.seed(123)
idx <- createDataPartition(dat$Yield, p = 0.75, list = FALSE)
trainDat <- dat[idx, ]
testDat <- dat[-idx, ]
pp_obj <- caret::preProcess(trainDat, method = c("bagImpute","center","scale"))
trainDat_pp <- predict(pp_obj, trainDat)
testDat_pp <- predict(pp_obj, testDat)
# train controls
ctrl <- caret::trainControl(method = "boot", number = 25)
# train models on processed data
models75 <- list()
set.seed(123)
models75$svmR <- caret::train(Yield ~ ., data = trainDat_pp, method = "svmRadial",
                              trControl = ctrl, tuneLength = 10)
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
set.seed(123)
models75$mars <- caret::train(Yield ~ ., data = trainDat_pp, method = "earth",
                              trControl = ctrl,
                              tuneGrid = expand.grid(degree = 1:2, nprune = 2:30))
set.seed(123)
models75$rf <- caret::train(Yield ~ ., data = trainDat_pp, method = "rf",
                            trControl = ctrl, tuneLength = 10, importance = TRUE)
set.seed(123)
models75$gbm <- caret::train(Yield ~ ., data = trainDat_pp, method = "gbm",
                             trControl = ctrl, verbose = FALSE, tuneLength = 10)
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
# So what's the best nonlinear regression model?
obs <- testDat_pp$Yield
testX <- testDat_pp[, setdiff(names(testDat_pp), "Yield")]
metric_row <- function(m) as.data.frame(t(postResample(predict(m, testX), obs)))
tbl <- do.call(rbind, lapply(models75, metric_row))
tbl$model <- rownames(tbl); rownames(tbl) <- NULL
tbl <- tbl[order(tbl$RMSE), c("model","RMSE","Rsquared","MAE")]
tbl
On the test set, MARS performs best (RMSE 0.6862265, Rsquared 0.6311315), followed closely by SVM-RBF (RMSE 0.6911658, Rsquared 0.6260545). GBM (RMSE 0.7064456, Rsquared 0.6123435) and RF (RMSE 0.7416347, Rsquared 0.5933656) perform worst. Because preProcess() was applied to the whole training data frame, Yield itself was centered and scaled, so these RMSE values are in standard-deviation units of the training yield rather than in the original yield units.
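The pre-processing above estimates its centering and scaling values on the training rows only and then applies them to the test rows. A minimal base-R sketch of that idea, on made-up data with a single hypothetical column `a`, looks like this. It also shows how a standardized value maps back to the original units.

```r
# Made-up train/test data with one column; names and values are illustrative.
set.seed(1)
train_x <- data.frame(a = rnorm(100, mean = 10, sd = 2))
test_x  <- data.frame(a = rnorm(30,  mean = 10, sd = 2))

# Estimate the transformation on the training data only.
center <- colMeans(train_x)
spread <- vapply(train_x, sd, numeric(1))

# Apply the same training-based values to both sets.
train_std <- scale(train_x, center = center, scale = spread)
test_std  <- scale(test_x,  center = center, scale = spread)

# Map standardized values back to original units: value * sd + mean.
back <- test_std[, "a"] * spread[["a"]] + center[["a"]]
```

The same back-transformation applies to a standardized RMSE: multiplying it by the training standard deviation of the response gives an error in the original units.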
(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
vi_mars <- varImp(models75$mars, scale = FALSE)$importance
vi_mars$var <- rownames(vi_mars)
top10_mars <- head(vi_mars[order(-vi_mars$Overall), c("var","Overall")], 10)
top10_mars
bio_ct <- sum(grepl("^Biological", top10_mars$var))
proc_ct <- sum(grepl("^ManufacturingProcess", top10_mars$var))
cat(sprintf("MARS top-10 counts — Biological: %d, Process: %d\n", bio_ct, proc_ct))
## MARS top-10 counts — Biological: 0, Process: 2
set.seed(123)
lin_mod <- train(Yield ~ ., data = trainDat_pp, method = "glmnet",
trControl = ctrl, tuneLength = 25)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
vi_lin <- varImp(lin_mod, scale = FALSE)$importance
vi_lin$var <- rownames(vi_lin)
top10_lin <- head(vi_lin[order(-vi_lin$Overall), c("var","Overall")], 10)
top10_lin
overlap <- intersect(top10_mars$var, top10_lin$var)
cat(sprintf("Overlap between MARS and linear top-10: %d of 10\n", length(overlap)))
## Overlap between MARS and linear top-10: 2 of 10
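When one model uses far fewer than ten predictors, an overlap reported "of 10" can understate how much the two lists agree. The sketch below, on hypothetical importance scores, drops unused predictors first and then reports the overlap as a share of the shorter list. The helper `top_k` and the names p1 to p4 are assumptions for illustration.

```r
# Hypothetical importance scores; names and values are illustrative only.
imp_a <- c(p1 = 0.9, p2 = 0.4, p3 = 0.0, p4 = 0.0)   # sparse model: uses 2 predictors
imp_b <- c(p1 = 0.5, p2 = 0.1, p3 = 0.7, p4 = 0.3)   # dense model: uses all 4

# `top_k` is a hypothetical helper: top k predictors with non-zero importance.
top_k <- function(imp, k) {
  imp <- imp[imp > 0]
  names(sort(imp, decreasing = TRUE))[seq_len(min(k, length(imp)))]
}

top_a <- top_k(imp_a, 3)
top_b <- top_k(imp_b, 3)

shared <- intersect(top_a, top_b)
share_of_shorter <- length(shared) / min(length(top_a), length(top_b))
```

Reporting both the raw count and the share of the shorter list makes it clear whether a small overlap reflects disagreement or simply a sparse model.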
The most important predictor in the optimal nonlinear model (MARS) is ManufacturingProcess32, which has the highest Overall importance score (0.371893411). The counts printed above (0 biological, 2 process) indicate that the MARS importance table lists only two predictors, both process variables, so process variables dominate and no biological variables appear. Compared with the linear model, the overlap is 2 of 10: the shared predictors are ManufacturingProcess32 and ManufacturingProcess36, and the remaining eight predictors in the linear top ten are not used by MARS.
(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
From what I can tell, the plots of the MARS predictors show curved effects. For ManufacturingProcess32 and ManufacturingProcess36, yield rises with the value and then levels off. No biological predictors are unique to the MARS model, so process settings appear to drive yield, and the practical aim would be to operate in the rising or plateau zones.