library(tidyverse)
library(corrplot)
library(caret)
library(e1071)
library(kernlab)
library(earth)
library(nnet)
library(RSNNS)
library(mlbench)
library(AppliedPredictiveModeling)
Friedman (1991) introduced several benchmark data sets created by simulation. ... Tune several models on these data. Which models appear to give the best performance? Does MARS select the informative predictors (those named X1-X5)?
set.seed(624)
train_f1 <- mlbench.friedman1(200, sd = 1) #200 observations
train_f1$x <- data.frame(train_f1$x)
test_f1 <- mlbench.friedman1(5000, sd = 1) #5000 observations
test_f1$x <- data.frame(test_f1$x)
featurePlot(train_f1$x, train_f1$y)
In general, the predictors do not appear strongly related to the response. Notably, predictor X4 displays a weakly positive relationship with the response.
sum(is.na(train_f1$x))
## [1] 0
train_f1$x %>%
  gather() %>%
  ggplot(aes(x = "", y = value)) +
  facet_wrap(~ key, scales = "free") +
  geom_boxplot() +
  coord_flip() +
  labs(x = NULL, y = NULL)
corr <- cor(train_f1$x, method = "spearman")
corrplot::corrplot(corr)
The data are simulated, so zero missingness is expected, and each predictor's distribution is symmetrical and essentially the same in spread. Further, per Spearman's correlation, none of the predictors are highly correlated with any others. Beyond centering and scaling, which are generally recommended for the subsequent nonlinear methods, additional pre-processing appears unnecessary.
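Since every model below reuses the same resampling scheme and the same pre-processing recipe, both can be defined once up front. The lines below are a minimal sketch of that idea (the names ctrl_cv and pp are placeholders, not part of the analysis that follows):
ctrl_cv <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation, reusable in each train() call
pp      <- c("center", "scale")                      # standardize predictors before fitting
# e.g., caret::train(x = train_f1$x, y = train_f1$y, method = "knn",
#                    preProc = pp, tuneLength = 10, trControl = ctrl_cv)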
set.seed(624)
(knn_f1 <- caret::train(x = train_f1$x,
                        y = train_f1$y,
                        method = "knn",
                        preProc = c("center", "scale"),
                        tuneLength = 10,
                        trControl = trainControl(method = "cv")))
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.069204 0.6088491 2.433501
## 7 3.011768 0.6567422 2.399211
## 9 3.037390 0.6618269 2.452561
## 11 3.038967 0.6719378 2.478658
## 13 3.084122 0.6617994 2.526191
## 15 3.152288 0.6453826 2.556748
## 17 3.148509 0.6620085 2.560961
## 19 3.123090 0.6905532 2.541291
## 21 3.172754 0.6951162 2.578561
## 23 3.230072 0.6895217 2.625250
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
Using 10-fold cross-validation to tune the k parameter, the optimal KNN model with 7 neighbors minimizes RMSE at approximately 3.01.
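As a quick visual check of the tuning results, caret's ggplot method can draw the RMSE profile across k directly from the fitted train object; a minimal sketch:
ggplot(knn_f1) +
  labs(title = "KNN tuning profile (10-fold CV)", y = "RMSE")  # RMSE vs. number of neighbors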
set.seed(624)
knnpred_f1 <- predict(knn_f1, newdata = test_f1$x)
postResample(pred = knnpred_f1, obs = test_f1$y)
## RMSE Rsquared MAE
## 3.285951 0.618920 2.636732
Evaluating on the test set returns an RMSE of approximately 3.29, which is, as expected, higher than the cross-validated training RMSE.
set.seed(624)
(svm_f1 <- caret::train(x = train_f1$x,
                        y = train_f1$y,
                        method = "svmRadial",
                        preProc = c("center", "scale"),
                        tuneLength = 14,
                        trControl = trainControl(method = "cv")))
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.584231 0.7797669 2.091152
## 0.50 2.360069 0.7912991 1.943972
## 1.00 2.113758 0.8196184 1.750612
## 2.00 2.017285 0.8317397 1.648217
## 4.00 1.965667 0.8395102 1.582769
## 8.00 1.984661 0.8331146 1.567490
## 16.00 2.025350 0.8255122 1.602328
## 32.00 2.027394 0.8251813 1.605477
## 64.00 2.027394 0.8251813 1.605477
## 128.00 2.027394 0.8251813 1.605477
## 256.00 2.027394 0.8251813 1.605477
## 512.00 2.027394 0.8251813 1.605477
## 1024.00 2.027394 0.8251813 1.605477
## 2048.00 2.027394 0.8251813 1.605477
##
## Tuning parameter 'sigma' was held constant at a value of 0.05960274
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.05960274 and C = 4.
Kuhn and Johnson note that, of the various kernel functions, the radial basis function is generally effective. Using that kernel along with 10-fold cross-validation to tune \(\sigma\) and cost, the optimal SVM model with \(\sigma\) of approximately 0.06 and cost of 4 minimizes RMSE at approximately 1.97.
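The fixed \(\sigma\) reported above is not tuned over a grid; caret's svmRadial method estimates it analytically from the data. A sketch of the presumed underlying calculation, using kernlab::sigest:
sig_range <- kernlab::sigest(as.matrix(train_f1$x), scaled = TRUE)  # low/median/high percentile estimates of sigma
sig_range
mean(sig_range[-2])  # roughly the value held constant during tuning (~0.06 for these data)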
set.seed(624)
svmpred_f1 <- predict(svm_f1, newdata = test_f1$x)
postResample(pred = svmpred_f1, obs = test_f1$y)
## RMSE Rsquared MAE
## 2.1310844 0.8281178 1.6484015
Evaluating on the test set returns an RMSE of approximately 2.13, again higher than the cross-validated training RMSE but lower than the corresponding test RMSE for the optimal KNN model.
set.seed(624)
(mars_f1 <- caret::train(x = train_f1$x,
                         y = train_f1$y,
                         method = "earth",
                         trControl = trainControl(method = "cv")))
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 2 3.922423 0.3603280 3.329203
## 8 1.886486 0.8643516 1.521238
## 15 1.907564 0.8652765 1.544721
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 8 and degree = 1.
Using 10-fold cross-validation to tune the number of retained terms (degree is held constant at 1), the optimal, first-degree MARS model with 8 terms minimizes RMSE at approximately 1.89.
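The grid above relies on earth's default of degree = 1. If interactions among predictors were of interest, the degree could be tuned as well; a sketch (not run here) of an expanded grid:
mars_grid <- expand.grid(degree = 1:2, nprune = 2:20)  # allow second-degree hinge interactions
# mars2_f1 <- caret::train(x = train_f1$x, y = train_f1$y, method = "earth",
#                          tuneGrid = mars_grid, trControl = trainControl(method = "cv"))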
set.seed(624)
marspred_f1 <- predict(mars_f1, newdata = test_f1$x)
postResample(pred = marspred_f1, obs = test_f1$y)
## RMSE Rsquared MAE
## 1.9506231 0.8595922 1.5182805
Evaluating on the test set returns an RMSE of approximately 1.95, which is, as expected, higher than the cross-validated training RMSE. It is also lower than the test RMSE for both the KNN and SVM models.
varImp(mars_f1)
## earth variable importance
##
## Overall
## X4 100.00
## X1 70.32
## X2 46.62
## X5 28.97
## X3 0.00
The optimal MARS model does select the informative predictors: its importance table contains only X1-X5, and the uninformative X6-X10 are excluded entirely, although X3 receives zero importance in this first-degree fit.
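earth's own importance measures and the retained hinge terms tell the same story; a short sketch for cross-checking the caret-based ranking above:
earth::evimp(mars_f1$finalModel)  # nsubsets / GCV / RSS importance from earth itself
summary(mars_f1$finalModel)       # lists the hinge functions kept after pruning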
set.seed(624)
nnetGrid <- expand.grid(.decay = c(0, 0.01, 1),
                        .size = 1:10)
(nnet_f1 <- caret::train(x = train_f1$x,
                         y = train_f1$y,
                         method = "nnet",
                         tuneGrid = nnetGrid,
                         preProc = c("center", "scale"),
                         trControl = trainControl(method = "cv"),
                         linout = TRUE,
                         trace = FALSE))
## Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.752369 0.6832636 2.204780
## 0.00 2 2.745469 0.6902325 2.184996
## 0.00 3 2.660248 0.6957751 2.187869
## 0.00 4 2.462959 0.7707206 1.973344
## 0.00 5 2.827899 0.7053522 2.285789
## 0.00 6 2.933040 0.7208237 2.285767
## 0.00 7 3.173320 0.6119132 2.570983
## 0.00 8 3.320302 0.6200110 2.596053
## 0.00 9 3.515911 0.6126600 2.681617
## 0.00 10 3.764635 0.5856651 3.045548
## 0.01 1 2.570137 0.7233154 2.093350
## 0.01 2 2.697154 0.6935491 2.224286
## 0.01 3 2.434795 0.7538776 1.974477
## 0.01 4 2.341151 0.7808289 1.865100
## 0.01 5 2.608191 0.7329522 2.146297
## 0.01 6 3.162233 0.6445225 2.455594
## 0.01 7 3.408134 0.5635911 2.591209
## 0.01 8 3.767024 0.5421410 2.783466
## 0.01 9 3.273421 0.6185443 2.597598
## 0.01 10 3.258314 0.6202967 2.537709
## 1.00 1 2.535983 0.7299049 2.042844
## 1.00 2 2.452274 0.7476484 1.955862
## 1.00 3 2.196975 0.7940627 1.790236
## 1.00 4 2.194446 0.8023363 1.785340
## 1.00 5 2.262647 0.7870007 1.836427
## 1.00 6 2.279762 0.7931883 1.804620
## 1.00 7 2.228026 0.7867747 1.799120
## 1.00 8 2.610867 0.7127715 2.080618
## 1.00 9 2.262709 0.7864614 1.842691
## 1.00 10 2.246225 0.7957875 1.825513
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4 and decay = 1.
Using the book-suggested candidate set of models as well as 10-fold cross-validation to tune the parameters, the optimal neural network model with size of 4 and decay of 1 minimizes RMSE at approximately 2.19.
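Kuhn and Johnson also suggest averaging several networks fit from different random starting weights, which caret exposes as method = "avNNet". A sketch of that variant (not run here; the bag = FALSE and maxit settings are assumptions):
# avnnet_f1 <- caret::train(x = train_f1$x, y = train_f1$y,
#                           method = "avNNet",
#                           tuneGrid = expand.grid(.decay = c(0, 0.01, 1),
#                                                  .size = 1:10,
#                                                  .bag = FALSE),
#                           preProc = c("center", "scale"),
#                           trControl = trainControl(method = "cv"),
#                           linout = TRUE, trace = FALSE, maxit = 500)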
set.seed(624)
nnetpred_f1 <- predict(nnet_f1, newdata = test_f1$x)
postResample(pred = nnetpred_f1, obs = test_f1$y)
## RMSE Rsquared MAE
## 2.3875982 0.7794368 1.8641057
Evaluating on the test set returns an RMSE of approximately 2.39, higher than the cross-validated training RMSE and also higher than the MARS model's test RMSE.
Of the four models, the MARS model returns the smallest test RMSE (~1.95). It is also the model that selects the informative predictors (X1-X5) while assigning no importance to the noise predictors X6-X10.
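Because each model was tuned with the same seed and the same 10-fold scheme, the fold-level resampling results should line up and can be compared directly; a sketch using caret's resamples():
resamps_f1 <- resamples(list(KNN = knn_f1, SVM = svm_f1, MARS = mars_f1, NNet = nnet_f1))
summary(resamps_f1)                   # fold-level RMSE / R-squared / MAE for each model
dotplot(resamps_f1, metric = "RMSE")  # visual comparison of cross-validated RMSE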
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
The code and descriptions below are adapted directly from the Week 10 Homework assignment.
data("ChemicalManufacturingProcess")
imputed <- impute::impute.knn(as.matrix(ChemicalManufacturingProcess), rng.seed = 624)
cmp <- as.data.frame(imputed$data)
sum(is.na(cmp))
## [1] 0
The data set contains 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) for 176 manufacturing runs. Yield contains the percent yield for each run.
Data are assumed to be (at least) missing at random for the purposes of this exercise, and KNN imputation with 10 neighbors is used to estimate the 106 missing values in the set.
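For reference, a quick sketch of where those missing values sit before imputation:
colSums(is.na(ChemicalManufacturingProcess)) %>%
  sort(decreasing = TRUE) %>%
  head(10)   # predictors with the most missing values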
set.seed(624)
index <- createDataPartition(cmp$Yield, p = .80, list = FALSE)
cmp_train <- cmp[index,] # 144 observations
cmp_test <- cmp[-index,] # 32 observations
An 80/20 split is used to create a training set of 144 runs and a test set of 32 runs.
ex <- nearZeroVar(cmp_train[-1], saveMetrics = TRUE)
ex %>% arrange(-freqRatio, percentUnique, -nzv) %>% head()
## freqRatio percentUnique zeroVar nzv
## BiologicalMaterial07 71.000000 1.388889 FALSE TRUE
## ManufacturingProcess41 6.500000 2.777778 FALSE FALSE
## ManufacturingProcess28 5.400000 14.583333 FALSE FALSE
## ManufacturingProcess12 4.760000 1.388889 FALSE FALSE
## ManufacturingProcess34 4.636364 6.250000 FALSE FALSE
## ManufacturingProcess40 4.333333 1.388889 FALSE FALSE
sum(ex$nzv)
## [1] 1
A check for near-zero variance predictors returns just one: BiologicalMaterial07, with a frequency ratio of approximately 71.
corr <- cor((cmp_train %>% select(-c("Yield","BiologicalMaterial07"))), method = "spearman")
corrplot::corrplot(corr)
hicorr <- findCorrelation(corr)
set.seed(624)
cmp_train_slim <- cmp_train %>%
  select(-c(Yield, BiologicalMaterial07)) %>%
  select(-all_of(hicorr))
cmp_train_transform <- cmp_train_slim %>%
  preProcess(method = c("BoxCox", "center", "scale")) %>%
  predict(cmp_train_slim) %>%
  cbind(cmp_train$Yield) %>%
  rename(Yield = "cmp_train$Yield")

cmp_test_slim <- cmp_test %>%
  select(-c(Yield, BiologicalMaterial07)) %>%
  select(-all_of(hicorr))
cmp_test_transform <- cmp_test_slim %>%
  preProcess(method = c("BoxCox", "center", "scale")) %>%
  predict(cmp_test_slim) %>%
  cbind(cmp_test$Yield) %>%
  rename(Yield = "cmp_test$Yield")
In pre-processing, predictors with near-zero variance or high correlations (using Spearman's \(\rho\)) are removed, and remaining predictors undergo a Box-Cox transformation as well as centering and scaling. The resulting training and test sets feature 51 predictors.
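A common alternative, shown only as a sketch here (the object names pp_cmp and cmp_test_alt are placeholders), is to fit the transformation on the training predictors and reuse that same object on the test predictors, so the test set never informs the Box-Cox, centering, or scaling parameters:
pp_cmp <- preProcess(cmp_train_slim, method = c("BoxCox", "center", "scale"))  # fit on training data only
cmp_test_alt <- predict(pp_cmp, cmp_test_slim) %>%
  mutate(Yield = cmp_test$Yield)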
K-Nearest Neighbors (KNN)
set.seed(624)
(knn_cmp <- caret::train(Yield ~ .,
                         data = cmp_train_transform,
                         method = "knn",
                         tuneLength = 10,
                         trControl = trainControl(method = "repeatedcv", repeats = 5)))
## k-Nearest Neighbors
##
## 144 samples
## 51 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 130, 130, 130, 130, 129, 130, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.260486 0.5369754 1.048647
## 7 1.298180 0.5102898 1.078990
## 9 1.344889 0.4683830 1.119324
## 11 1.372963 0.4420748 1.144277
## 13 1.375529 0.4469999 1.142038
## 15 1.371445 0.4582999 1.133800
## 17 1.378592 0.4601082 1.139604
## 19 1.394591 0.4527576 1.154889
## 21 1.408886 0.4457260 1.167704
## 23 1.414787 0.4500894 1.166490
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
knnpred_cmp <- predict(knn_cmp, newdata = cmp_test_transform)
postResample(pred = knnpred_cmp, obs = cmp_test_transform$Yield)
## RMSE Rsquared MAE
## 1.3113527 0.6600128 1.0633125
Using repeated 10-fold cross-validation to tune the k parameter, the optimal KNN model with 5 neighbors minimizes RMSE at approximately 1.26. Evaluating on the test set returns an RMSE of approximately 1.31.
Support Vector Machine (SVM)
set.seed(624)
(svm_cmp <- caret::train(Yield ~ .,
                         data = cmp_train_transform,
                         method = "svmRadial",
                         tuneLength = 14,
                         trControl = trainControl(method = "repeatedcv", repeats = 5)))
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 51 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 130, 130, 130, 130, 129, 130, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.386272 0.4938595 1.1409501
## 0.50 1.279571 0.5341400 1.0539208
## 1.00 1.201502 0.5779977 0.9830663
## 2.00 1.158854 0.6055503 0.9477725
## 4.00 1.137874 0.6161072 0.9250061
## 8.00 1.136175 0.6166123 0.9212457
## 16.00 1.143851 0.6112416 0.9272377
## 32.00 1.143851 0.6112416 0.9272377
## 64.00 1.143851 0.6112416 0.9272377
## 128.00 1.143851 0.6112416 0.9272377
## 256.00 1.143851 0.6112416 0.9272377
## 512.00 1.143851 0.6112416 0.9272377
## 1024.00 1.143851 0.6112416 0.9272377
## 2048.00 1.143851 0.6112416 0.9272377
##
## Tuning parameter 'sigma' was held constant at a value of 0.01381135
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01381135 and C = 8.
svmpred_cmp <- predict(svm_cmp, newdata = cmp_test_transform)
postResample(pred = svmpred_cmp, obs = cmp_test_transform$Yield)
## RMSE Rsquared MAE
## 1.2362638 0.7072089 0.8728152
Using the radial basis kernel function along with repeated 10-fold cross-validation to tune \(\sigma\) and cost, the optimal SVM model with \(\sigma\) of approximately 0.01 and cost of 8 minimizes RMSE at approximately 1.14. Evaluating on the test set returns an RMSE of approximately 1.24.
Multivariate Adaptive Regression Splines (MARS)
set.seed(624)
(mars_cmp <- caret::train(Yield ~ .,
                          data = cmp_train_transform,
                          method = "earth",
                          trControl = trainControl(method = "repeatedcv", repeats = 5)))
## Multivariate Adaptive Regression Spline
##
## 144 samples
## 51 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 130, 130, 130, 130, 129, 130, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 2 1.392380 0.4413063 1.107060
## 9 1.302645 0.5279347 1.056024
## 17 1.382676 0.5027627 1.131236
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 9 and degree = 1.
marspred_cmp <- predict(mars_cmp, newdata = cmp_test_transform)
postResample(pred = marspred_cmp, obs = cmp_test_transform$Yield)
## RMSE Rsquared MAE
## 1.8727433 0.5251501 1.2875771
Using repeated 10-fold cross-validation to tune the number of retained terms (degree is held constant at 1), the optimal, first-degree MARS model with 9 terms minimizes RMSE at approximately 1.30. Evaluating on the test set returns an RMSE of approximately 1.87.
Neural Network
set.seed(624)
nnetGrid <- expand.grid(.decay = c(0, 0.01, 1),
                        .size = 1:10)
(nnet_cmp <- caret::train(Yield ~ .,
                          data = cmp_train_transform,
                          method = "nnet",
                          tuneGrid = nnetGrid,
                          trControl = trainControl(method = "repeatedcv", repeats = 5),
                          linout = TRUE,
                          trace = FALSE))
## Neural Network
##
## 144 samples
## 51 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 130, 130, 130, 130, 129, 130, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 1.670009 0.1806237 1.374443
## 0.00 2 1.657755 0.2454971 1.356228
## 0.00 3 2.664469 0.2315111 1.969068
## 0.00 4 2.962083 0.2194136 2.306097
## 0.00 5 3.170609 0.2055507 2.496803
## 0.00 6 3.864795 0.1529160 2.957014
## 0.00 7 4.869968 0.1751694 3.596567
## 0.00 8 5.723303 0.1671704 3.995768
## 0.00 9 7.574280 0.1278161 5.138112
## 0.00 10 11.668788 0.1288191 7.232222
## 0.01 1 1.455108 0.3961894 1.166489
## 0.01 2 1.783189 0.3756717 1.363308
## 0.01 3 2.188949 0.3405498 1.647693
## 0.01 4 2.709422 0.2216519 2.105344
## 0.01 5 3.130002 0.1736650 2.432992
## 0.01 6 3.922071 0.1329444 3.023401
## 0.01 7 4.649875 0.1823120 3.342745
## 0.01 8 5.930056 0.1174045 4.170495
## 0.01 9 7.566669 0.1211547 5.034450
## 0.01 10 10.825161 0.1300644 6.800165
## 1.00 1 2.604097 0.2515688 1.717115
## 1.00 2 2.701495 0.2529938 1.886564
## 1.00 3 2.926110 0.2609018 2.111644
## 1.00 4 2.709226 0.2385605 2.074288
## 1.00 5 2.512027 0.2802414 1.989836
## 1.00 6 2.700054 0.2521452 2.135603
## 1.00 7 3.276308 0.2301583 2.495083
## 1.00 8 4.158249 0.2249644 2.908250
## 1.00 9 5.241916 0.1671743 3.566585
## 1.00 10 5.839900 0.1555386 3.967025
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.01.
nnetpred_cmp <- predict(nnet_cmp, newdata = cmp_test_transform)
postResample(pred = nnetpred_cmp, obs = cmp_test_transform$Yield)
## RMSE Rsquared MAE
## 1.6005267 0.4401126 1.1712126
Using the book-suggested candidate set of models as well as repeated 10-fold cross-validation to tune the parameters, the optimal neural network model with size of 1 and decay of 0.01 minimizes RMSE at approximately 1.46. Evaluating on the test set returns an RMSE of approximately 1.60.
Of the four models, the SVM model performs the best on the test set. Its test set RMSE is approximately 1.24.
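As with the simulated data, the cross-validated results for the four models were produced with a shared seed and resampling scheme, so they can be compared on their common folds; a sketch:
resamps_cmp <- resamples(list(KNN = knn_cmp, SVM = svm_cmp, MARS = mars_cmp, NNet = nnet_cmp))
summary(resamps_cmp)
bwplot(resamps_cmp, metric = "RMSE")  # box-and-whisker comparison of fold-level RMSE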
varImp(svm_cmp)$importance %>%
  arrange(-Overall) %>%
  rownames_to_column("predictor") %>%
  top_n(10) %>%
  ggplot(aes(x = reorder(predictor, Overall), y = Overall)) +
  geom_col() +
  ggtitle("Top ten predictors of product yield, by importance in SVM model") +
  xlab(NULL) +
  ylab("Importance") +
  coord_flip()
set.seed(624)
cmp_pls <- caret::train(Yield ~ .,
                        data = cmp_train_transform,
                        method = "pls",
                        tuneLength = 20,
                        trControl = trainControl(method = "repeatedcv", repeats = 5))
varImp(cmp_pls)$importance %>%
  arrange(-Overall) %>%
  rownames_to_column("predictor") %>%
  top_n(10) %>%
  ggplot(aes(x = reorder(predictor, Overall), y = Overall)) +
  geom_col() +
  ggtitle("Top ten predictors of product yield, by importance in PLS model") +
  xlab(NULL) +
  ylab("Importance") +
  coord_flip()
Generally, the same set of predictors is important in both the nonlinear SVM and linear partial least squares (PLS) models. Both models are led by ManufacturingProcess32 and place greater importance on the manufacturing predictors. However, the biological predictors carry more weight in the SVM model, with two of them, BiologicalMaterial06 and BiologicalMaterial03, among the three most important predictors. These two are also important in the PLS model, though less so.
svm_top10 <- varImp(svm_cmp)$importance %>%
  arrange(-Overall) %>%
  rownames_to_column("predictor") %>%
  top_n(10) %>%
  select(-Overall)
pls_top10 <- varImp(cmp_pls)$importance %>%
  arrange(-Overall) %>%
  rownames_to_column("predictor") %>%
  top_n(10) %>%
  select(-Overall)
(svm_unique <- unname(unlist(anti_join(svm_top10, pls_top10))))
## [1] "ManufacturingProcess31" "BiologicalMaterial12"
featurePlot(cmp_train_transform[, svm_unique], cmp_train_transform$Yield)
cor(cmp_train_transform[, svm_unique], cmp_train_transform$Yield, method = "spearman")
## [,1]
## ManufacturingProcess31 -0.3800313
## BiologicalMaterial12 0.4206077
Two of the top ten predictors for the SVM model are unique relative to the PLS model: BiologicalMaterial12 and ManufacturingProcess31. Using the training set, plotting Yield against these two predictors reveals a weak, if any, relationship for the biological predictor and at least one clear outlier for the manufacturing predictor; the latter was not addressed by the Box-Cox transformation in pre-processing. Spearman's \(\rho\) for each predictor confirms the visual patterns, though the outlier may still influence the estimate for the manufacturing predictor. Overall, these plots do not reveal much about the relationships between the predictors and Yield, and additional cleaning might be useful.
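One possible follow-up, sketched below (the boxplot rule is an arbitrary choice for flagging the outlying run, not part of the original analysis), is to drop the flagged observation(s) and recompute Spearman's \(\rho\) for ManufacturingProcess31:
mp31 <- cmp_train_transform$ManufacturingProcess31
out_rows <- which(mp31 %in% boxplot.stats(mp31)$out)            # flag boxplot-rule outliers
keep <- if (length(out_rows)) -out_rows else seq_along(mp31)    # guard against an empty flag set
cor(mp31[keep], cmp_train_transform$Yield[keep], method = "spearman")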
Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer. https://doi.org/10.1007/978-1-4614-6849-3