In this exercise we work with a simulated dataset from the mlbench library. We use the mlbench.friedman1 function to simulate data for ten predictor variables \(X_1\) through \(X_{10}\) and one response variable \(Y\) from the following nonlinear equation:
\[Y = 10 \sin (\pi X_1 X_2) + 20 (X_3 -0.5)^2 + 10 X_4 + 5 X_5 + \epsilon\]
where the \(X_j\) are independent random variables uniformly distributed on [0, 1] and \(\epsilon \sim N(0, \sigma^2)\) is a normally distributed error term. Note that only the first five \(X_j\) variables enter the equation for the response \(Y\); the remaining \(X_j\) variables are non-informative (noise) variables.
We will generate a training set and a test set by simulation, and then fit several models on the training set including parameter tuning. Finally we evaluate performance on the test set across the fitted models.
First we generate 200 instances that will comprise the training dataset. This data includes:

- x: 10 predictor variables, which we label X1 through X10
- y: the response variable

We can confirm from the summary statistics, the feature plot, and the pairs plot of the data that:

- the predictors X1 through X10 are each approximately uniform on [0, 1];
- only X1 through X5 show a visible relationship with the response, while the noise predictors do not;
- the distribution of the response Y is roughly bell-shaped, as the density plot with a normal overlay shows.
For model testing and evaluation, we also generate a test set of 5000 simulated instances.
library(mlbench)
library(caret)    # featurePlot, train, varImp, postResample
library(GGally)   # ggpairs
library(ggplot2)
library(dplyr)    # pipes & data manipulation
library(knitr)    # kable
library(plotmo)   # plotmo
set.seed(200)
# simulate training set of 200 instances
trainingData <- mlbench.friedman1(200, sd = 1)
# convert x matrix to dataframe
trainingData$x <- data.frame(trainingData$x)
summary(data.frame(Y = trainingData$y, trainingData$x))
## Y X1 X2
## Min. : 3.556 Min. :0.002806 Min. :0.0004409
## 1st Qu.:10.756 1st Qu.:0.217222 1st Qu.:0.2795877
## Median :14.556 Median :0.513935 Median :0.5106664
## Mean :14.416 Mean :0.481461 Mean :0.5126936
## 3rd Qu.:17.970 3rd Qu.:0.680814 3rd Qu.:0.7326078
## Max. :28.382 Max. :0.998992 Max. :0.9840194
## X3 X4 X5
## Min. :0.003296 Min. :0.00819 Min. :0.001756
## 1st Qu.:0.284872 1st Qu.:0.23178 1st Qu.:0.285408
## Median :0.537307 Median :0.44458 Median :0.534330
## Mean :0.508571 Mean :0.46939 Mean :0.517948
## 3rd Qu.:0.746155 3rd Qu.:0.68454 3rd Qu.:0.748180
## Max. :0.999923 Max. :0.99929 Max. :0.991577
## X6 X7 X8
## Min. :0.002901 Min. :0.0003388 Min. :0.004698
## 1st Qu.:0.276340 1st Qu.:0.2092393 1st Qu.:0.289413
## Median :0.497598 Median :0.4688035 Median :0.497961
## Mean :0.496841 Mean :0.4636166 Mean :0.512865
## 3rd Qu.:0.749615 3rd Qu.:0.6752972 3rd Qu.:0.727624
## Max. :0.998987 Max. :0.9943478 Max. :0.999561
## X9 X10
## Min. :0.0176 Min. :0.003172
## 1st Qu.:0.2304 1st Qu.:0.317826
## Median :0.5289 Median :0.535922
## Mean :0.5016 Mean :0.542409
## 3rd Qu.:0.7218 3rd Qu.:0.789736
## Max. :0.9910 Max. :0.999992
# feature plot of y vs x_j
featurePlot(trainingData$x, trainingData$y,
            labels = c("Predictor", "Response"),
            main = "Feature plot of friedman1 training data")
# pairs plot
ggpairs(data.frame(Y = trainingData$y, trainingData$x)) +
labs(title = "Pairs plot of friedman1 training data")
# compare the density of Y against a normal density with matching mean & sd
xx <- seq(0, 30, length.out = 100)
yy <- dnorm(xx, mean = mean(trainingData$y), sd = sd(trainingData$y))
ggplot(data.frame(Y = trainingData$y)) + geom_density(aes(x = Y)) +
  geom_line(data = data.frame(xx, yy), aes(x = xx, y = yy), col = "red") +
  labs(title = "Distribution of Y",
       x = "Y", y = "Density")
#plot(density(trainingData$y, bw = "SJ"), ylim = c(0, 0.085), main = "Distribution of Y")
#lines(xx, yy, col = "red")
# simulate test set of 5000 instances
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
#summary(data.frame(Y = testData$y, testData$x))
Next we fit several models to the training data. We use the caret library for model fitting with the following algorithms and tuning parameters:
- avNNet (model-averaged neural network) from the nnet library; tune size (number of hidden units) and decay (weight decay), with bag held at FALSE
- earth (MARS) from the earth library; tune nprune (number of retained terms) and degree (product degree)
- svmRadial from the kernlab library; tune C (cost), with sigma set to the default analytical estimate
- knn (k-nearest neighbors); tune k (number of neighbors)

We specify a common pre-processing and tuning methodology for all models within the train function:

- center and scale the predictors;
- resample with 10-fold cross-validation;
- select the optimal tuning parameters by smallest RMSE.

For each model, we show below the resampling output, the tuning profile, the final model summary, and the variable importance plot.
# center & scale predictors
prep <- c("center", "scale")
# 10-fold cross validation
ctrl <- trainControl(method = "cv", number = 10)
##############################################################################
# fit nnet model; tune decay & size
##############################################################################
# fit & tune model
set.seed(227)
nnetGrid <- expand.grid(decay = c(0, 0.01, 0.1), size = 1:10, bag = FALSE)
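# note: MaxNWts = 100 caps the total number of network weights; a 10-H-1 network
# with a linear output uses (10 + 1) * H + (H + 1) weights, so H = 9 (109 weights)
# and H = 10 (121 weights) exceed the cap, fail to fit, and appear as NaN below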
nnetModel <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "avNNet", linout = TRUE, trace = FALSE, MaxNWts = 100,
                   preProcess = prep,
                   #tuneLength = 15,
                   tuneGrid = nnetGrid,
                   trControl = ctrl)
nnetModel
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.453486 0.7628392 1.939061
## 0.00 2 2.466229 0.7553506 1.935683
## 0.00 3 2.120237 0.8222812 1.646684
## 0.00 4 2.145174 0.8066845 1.688108
## 0.00 5 2.156400 0.8235444 1.694179
## 0.00 6 2.229606 0.8058147 1.747128
## 0.00 7 2.446361 0.7680410 1.970646
## 0.00 8 2.479591 0.7672991 1.949891
## 0.00 9 NaN NaN NaN
## 0.00 10 NaN NaN NaN
## 0.01 1 2.373626 0.7700550 1.846585
## 0.01 2 2.471793 0.7520689 1.918923
## 0.01 3 2.153588 0.8219606 1.665498
## 0.01 4 2.051319 0.8400975 1.597216
## 0.01 5 2.064509 0.8287173 1.576470
## 0.01 6 2.231345 0.8074821 1.747262
## 0.01 7 2.284785 0.7981623 1.768654
## 0.01 8 2.305885 0.7848879 1.853840
## 0.01 9 NaN NaN NaN
## 0.01 10 NaN NaN NaN
## 0.10 1 2.389372 0.7671926 1.856095
## 0.10 2 2.461489 0.7564560 1.923188
## 0.10 3 2.205762 0.8155714 1.699392
## 0.10 4 2.056510 0.8347077 1.604546
## 0.10 5 2.122134 0.8246035 1.678843
## 0.10 6 2.195238 0.8112132 1.759069
## 0.10 7 2.206892 0.8087481 1.762441
## 0.10 8 2.285393 0.8020192 1.828814
## 0.10 9 NaN NaN NaN
## 0.10 10 NaN NaN NaN
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.01 and bag
## = FALSE.
ggplot(nnetModel, highlight = TRUE) +
labs(title = paste0("Tuning profile: ", nnetModel$modelInfo$label))
# final model & variable importance
nnetModel$finalModel
## Model Averaged Neural Network with 5 Repeats
##
## a 10-4-1 network with 49 weights
## options were - linear output units decay=0.01
ggplot(varImp(nnetModel)) +
labs(title = paste0("Variable importance: ", nnetModel$modelInfo$label))
# test set predictions
nnetPred <- predict(nnetModel, newdata = testData$x)
nnetPerf <- postResample(pred = nnetPred, obs = testData$y)
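postResample returns the test-set RMSE, R-squared, and MAE of the predictions. As a sanity check, the RMSE component can be reproduced by hand; a minimal sketch (not part of the original output):
# manual RMSE; should match nnetPerf["RMSE"]
sqrt(mean((nnetPred - testData$y)^2))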
##############################################################################
# fit mars model; tune degree & nprune
##############################################################################
# fit & tune model
set.seed(227)
marsGrid <- expand.grid(degree = 1:2, nprune = 2:20)
marsModel <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "earth",
                   preProcess = prep,
                   #tuneLength = 15,
                   tuneGrid = marsGrid,
                   trControl = ctrl)
marsModel
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.230804 0.3061204 3.4769718
## 1 3 3.446241 0.5179791 2.7747003
## 1 4 2.654600 0.7159102 2.1247908
## 1 5 2.338892 0.7748074 1.9259473
## 1 6 2.285647 0.7858393 1.8165413
## 1 7 1.854760 0.8633398 1.4529610
## 1 8 1.781946 0.8686876 1.3887742
## 1 9 1.722643 0.8780344 1.3564789
## 1 10 1.706516 0.8800101 1.3488078
## 1 11 1.666552 0.8842921 1.2863180
## 1 12 1.580302 0.8971240 1.2317369
## 1 13 1.584415 0.8966347 1.2271373
## 1 14 1.590337 0.8963451 1.2271217
## 1 15 1.590955 0.8963736 1.2291679
## 1 16 1.590955 0.8963736 1.2291679
## 1 17 1.590955 0.8963736 1.2291679
## 1 18 1.590955 0.8963736 1.2291679
## 1 19 1.590955 0.8963736 1.2291679
## 1 20 1.590955 0.8963736 1.2291679
## 2 2 4.283258 0.2959497 3.5188891
## 2 3 3.477719 0.5126430 2.8105998
## 2 4 2.703525 0.7084096 2.1650536
## 2 5 2.393345 0.7675030 1.9738942
## 2 6 2.390236 0.7677042 1.8810199
## 2 7 1.898494 0.8571073 1.5042126
## 2 8 1.716441 0.8837614 1.3492161
## 2 9 1.532824 0.9060610 1.2407400
## 2 10 1.463240 0.9142113 1.1711952
## 2 11 1.354253 0.9248471 1.0794948
## 2 12 1.251030 0.9314224 0.9915422
## 2 13 1.235344 0.9351578 0.9836994
## 2 14 1.218751 0.9366662 0.9496131
## 2 15 1.228174 0.9363321 0.9636462
## 2 16 1.206815 0.9378182 0.9571746
## 2 17 1.206815 0.9378182 0.9571746
## 2 18 1.206815 0.9378182 0.9571746
## 2 19 1.206815 0.9378182 0.9571746
## 2 20 1.206815 0.9378182 0.9571746
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 16 and degree = 2.
ggplot(marsModel, highlight = TRUE) +
labs(title = paste0("Tuning profile: ", marsModel$modelInfo$label))
# final model & variable importance
marsModel$finalModel
## Selected 16 of 21 terms, and 5 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, ...
## Number of terms at each degree of interaction: 1 8 7
## GCV 1.642635 RSS 214.2181 GRSq 0.9333083 RSq 0.9560751
summary(marsModel$finalModel)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE,
## degree=2, nprune=16)
##
## coefficients
## (Intercept) 22.042898
## h(0.507267-X1) -4.261526
## h(X1-0.507267) 2.661009
## h(0.325504-X2) -5.256624
## h(-0.216741-X3) 2.909153
## h(X3- -0.216741) 1.657839
## h(0.953812-X4) -2.810030
## h(X4-0.953812) 2.833141
## h(1.17878-X5) -1.503531
## h(-0.951872-X1) * h(X2-0.325504) -3.617965
## h(X1- -0.951872) * h(X2-0.325504) -1.505955
## h(X1-0.507267) * h(X2- -0.798188) -2.198535
## h(0.606835-X1) * h(0.325504-X2) 1.980041
## h(0.325504-X2) * h(X3-0.795427) 2.143490
## h(X2-0.325504) * h(X3- -0.917499) 1.340520
## h(X2-0.325504) * h(-0.917499-X3) 4.483181
##
## Selected 16 of 21 terms, and 5 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, ...
## Number of terms at each degree of interaction: 1 8 7
## GCV 1.642635 RSS 214.2181 GRSq 0.9333083 RSq 0.9560751
ggplot(varImp(marsModel)) +
labs(title = paste0("Variable importance: ", marsModel$modelInfo$label))
# plot response vs each predictor
plotmo(marsModel$finalModel, pt.col = 2,
caption = paste0("Response relationships with the predictors: ",
marsModel$modelInfo$label))
## plotmo grid: X1 X2 X3 X4 X5
## 0.1174442 -0.007467267 0.1018901 -0.08911631 0.05813967
## X6 X7 X8 X9 X10
## 0.002665016 0.01819572 -0.05372308 0.09088895 -0.02272411
# test set predictions
marsPred <- predict(marsModel, newdata = testData$x)
marsPerf <- postResample(pred = marsPred, obs = testData$y)
##############################################################################
# fit svm model; tune C (use automatic estimate of sigma)
##############################################################################
# fit & tune model
set.seed(227)
svmModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "svmRadial",
                  preProcess = prep,
                  tuneLength = 10,
                  trControl = ctrl)
svmModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.502355 0.8014214 1.983965
## 0.50 2.253343 0.8151982 1.778912
## 1.00 2.066064 0.8394041 1.604967
## 2.00 1.967567 0.8493073 1.519869
## 4.00 1.890994 0.8558305 1.484066
## 8.00 1.878264 0.8564414 1.494233
## 16.00 1.878702 0.8565861 1.493896
## 32.00 1.878702 0.8565861 1.493896
## 64.00 1.878702 0.8565861 1.493896
## 128.00 1.878702 0.8565861 1.493896
##
## Tuning parameter 'sigma' was held constant at a value of 0.06682628
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06682628 and C = 8.
ggplot(svmModel, highlight = TRUE) +
labs(title = paste0("Tuning profile: ", svmModel$modelInfo$label))
# final model & variable importance
svmModel$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 8
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0668262839433369
##
## Number of Support Vectors : 153
##
## Objective Function Value : -67.2636
## Training error : 0.008899
ggplot(varImp(svmModel)) +
labs(title = paste0("Variable importance: ", svmModel$modelInfo$label))
# test set predictions
svmPred <- predict(svmModel, newdata = testData$x)
svmPerf <- postResample(pred = svmPred, obs = testData$y)
##############################################################################
# fit knn model; tune k
##############################################################################
# fit & tune model
set.seed(227)
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProcess = prep,
                  tuneLength = 10,
                  trControl = ctrl)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.245439 0.5892310 2.692806
## 7 3.115972 0.6349210 2.539613
## 9 3.095140 0.6618790 2.507735
## 11 3.067196 0.6844121 2.496442
## 13 3.091888 0.6882998 2.471658
## 15 3.096069 0.6989642 2.479912
## 17 3.116225 0.7075858 2.500198
## 19 3.147661 0.7102043 2.542344
## 21 3.161254 0.7144204 2.552672
## 23 3.186812 0.7136769 2.592388
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
ggplot(knnModel, highlight = TRUE) +
labs(title = paste0("Tuning profile: ", knnModel$modelInfo$label))
# final model & variable importance
knnModel$finalModel
## 11-nearest neighbor regression model
ggplot(varImp(knnModel)) +
labs(title = paste0("Variable importance: ", knnModel$modelInfo$label))
# test set predictions
knnPred <- predict(knnModel, newdata = testData$x)
knnPerf <- postResample(pred = knnPred, obs = testData$y)
It is evident from the variable importance plots above that all four models successfully identified X1 through X5 as the most important predictors. The MARS model is unique in that it assigns an importance of exactly zero to X6 through X10, since these variables were dropped from the model during the pruning procedure. It is interesting to note that the MARS model orders the predictors (by decreasing importance) as X1, X4, X2, X5, X3, whereas the other three models order them as X4, X1, X2, X5, X3 (swapping the first two predictors).
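As a quick check on these orderings, the ranked predictor names can be extracted directly from each fitted model. A minimal sketch (the rankImp helper is our own, not part of the analysis above):
# extract predictor names ranked by decreasing importance from a caret model
rankImp <- function(mod) {
  imp <- varImp(mod)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)]
}
# top 5 predictors per model
sapply(list(NNet = nnetModel, MARS = marsModel, SVM = svmModel, KNN = knnModel),
       function(m) head(rankImp(m), 5))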
Finally we compare performance metrics of the four models on the test data. From the table below, we see that the MARS model has the strongest predictive performance, with the lowest RMSE and MAE and the highest R-squared. The NNet and SVM models are the runners-up, with comparable performance metrics, and the weakest model overall is k-nearest neighbors. The strong performance of the MARS model likely results in part from the pruning procedure, which dropped the five non-informative (noise) variables. In contrast, the noise variables remain in the other models, which degrades the model-fitting process and weakens predictive performance. For instance, in the KNN model the informative predictors and the noise variables contribute equally to the Euclidean distance calculation, so similarity among instances is based in part on random noise.
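To illustrate the drag from the noise variables, we could refit the KNN model using only the five informative predictors and compare test-set performance; a hedged sketch reusing prep and ctrl from above (not part of the original analysis):
# refit KNN on X1-X5 only; we would expect a noticeably lower RMSE
set.seed(227)
knnInfModel <- train(x = trainingData$x[, 1:5],
                     y = trainingData$y,
                     method = "knn",
                     preProcess = prep,
                     tuneLength = 10,
                     trControl = ctrl)
postResample(pred = predict(knnInfModel, newdata = testData$x[, 1:5]),
             obs = testData$y)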
rbind(NNet = nnetPerf, MARS = marsPerf, SVM = svmPerf, KNN = knnPerf) %>%
  data.frame() %>%
  kable(digits = 3,
        caption = "Comparison of predictive performance on test set")
|      | RMSE | Rsquared | MAE |
|---|---|---|---|
| NNet | 2.035 | 0.835 | 1.527 |
| MARS | 1.279 | 0.934 | 1.009 |
| SVM | 2.071 | 0.826 | 1.573 |
| KNN | 3.122 | 0.669 | 2.496 |
In this exercise we work with the ChemicalManufacturingProcess dataset from the AppliedPredictiveModeling library, with model fitting via the caret library. We apply the same imputation, data splitting, and pre-processing steps that we used in Exercise 6.3 from Homework 7:

- impute missing values with bagged-tree imputation via the preProcess function
- split the data 75/25 into training and test sets via the createDataPartition function

We then fit and evaluate four models using the following algorithms and tuning parameters:

- avNNet (model-averaged neural network) from the nnet library; tune size (number of hidden units) and decay (weight decay), with bag held at FALSE
- earth (MARS) from the earth library; tune nprune (number of retained terms) and degree (product degree)
- svmRadial from the kernlab library; tune C (cost), with sigma set to the default analytical estimate
- knn (k-nearest neighbors); tune k (number of neighbors)

We specify a common resampling and tuning methodology for all models within the train function:

- center and scale the predictors, and remove near-zero-variance predictors;
- resample with 10-fold cross-validation;
- select the optimal tuning parameters by smallest RMSE.

For each model, we show below the resampling output, the tuning profile, the final model summary, and the variable importance plot.
### load data
library(AppliedPredictiveModeling)  # provides ChemicalManufacturingProcess
data("ChemicalManufacturingProcess")
#str(ChemicalManufacturingProcess)
# for ease of manipulation, split response and predictors & relabel columns
yield <- ChemicalManufacturingProcess[1]
cmp <- ChemicalManufacturingProcess[-1]
colnames(cmp) <- c(paste0("B", 1:12), paste0("M", 1:45))
#summary(yield)
#summary(cmp)
yield <- yield[[1]] # numeric
cmp <- as.matrix(cmp) # matrix
### impute missing values
set.seed(502)
impute <- preProcess(cmp, method = "bagImpute")
cmp <- predict(impute, cmp)
#summary(cmp)
### training/test split
set.seed(314)
train_idx <- createDataPartition(yield, p = 0.75, list = FALSE)
yield_train <- yield[train_idx]
yield_test <- yield[-train_idx]
cmp_train <- cmp[train_idx, ]
cmp_test <- cmp[-train_idx, ]
#summary(yield_train)
#summary(yield_test)
### pre-processing and resampling method
#prep <- c("BoxCox", "center", "scale", "nzv") <<< BOX-COX CAUSES NaN / Inf ERRORS
prep <- c("center", "scale", "nzv")
#ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5) <<< TAKES TOO LONG
ctrl <- trainControl(method = "cv", number = 10)
##############################################################################
# fit nnet model; tune decay & size
##############################################################################
# fit & tune model
set.seed(227)
nnetGrid <- expand.grid(decay = c(0, 0.01, 0.1), size = 1:10, bag = FALSE)
nnetModel <- train(x = cmp_train,
                   y = yield_train,
                   preProcess = prep,
                   trControl = ctrl,
                   method = "avNNet",
                   linout = TRUE, trace = FALSE, #MaxNWts = 1000, maxit = 500,
                   tuneGrid = nnetGrid)
nnetModel
## Model Averaged Neural Network
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (56), scaled (56), remove (1)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 120, 119, 118, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 1.682049 0.2982074 1.384177
## 0.00 2 1.584458 0.3800543 1.291777
## 0.00 3 1.506951 0.4142568 1.207772
## 0.00 4 1.747114 0.4457139 1.373870
## 0.00 5 1.899288 0.4058747 1.529646
## 0.00 6 2.055590 0.3389588 1.619027
## 0.00 7 2.040068 0.4411346 1.672809
## 0.00 8 2.924962 0.3353221 2.113708
## 0.00 9 3.175315 0.2872611 2.446941
## 0.00 10 4.391504 0.2308797 3.047893
## 0.01 1 1.376158 0.5150667 1.087506
## 0.01 2 1.291495 0.5823766 1.071629
## 0.01 3 1.484925 0.4789278 1.179208
## 0.01 4 1.909784 0.3532708 1.550551
## 0.01 5 1.762034 0.4477962 1.489518
## 0.01 6 1.833767 0.4727546 1.514892
## 0.01 7 2.262710 0.4589132 1.745205
## 0.01 8 2.274261 0.4634922 1.772300
## 0.01 9 3.135075 0.1970172 2.241917
## 0.01 10 4.648200 0.2469061 3.156940
## 0.10 1 1.388187 0.5200085 1.116170
## 0.10 2 1.290917 0.5660612 1.077061
## 0.10 3 1.373767 0.5181534 1.116043
## 0.10 4 1.644580 0.3946546 1.303468
## 0.10 5 1.584487 0.4428402 1.334289
## 0.10 6 1.723461 0.4462425 1.367406
## 0.10 7 1.877026 0.3707661 1.534372
## 0.10 8 2.304221 0.4301959 1.780566
## 0.10 9 2.891484 0.3774345 2.179779
## 0.10 10 4.408392 0.2092302 3.014256
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 2, decay = 0.1 and bag
## = FALSE.
ggplot(nnetModel, highlight = TRUE) +
labs(title = paste0("Tuning profile: ", nnetModel$modelInfo$label))
# final model & variable importance
nnetModel$finalModel
## Model Averaged Neural Network with 5 Repeats
##
## a 56-2-1 network with 117 weights
## options were - linear output units decay=0.1
ggplot(varImp(nnetModel), top = 20) +
labs(title = paste0("Variable importance: ", nnetModel$modelInfo$label))
# test set predictions
nnetPred <- predict(nnetModel, newdata = cmp_test)
nnetPerf <- postResample(pred = nnetPred, obs = yield_test)
##############################################################################
# fit mars model; tune degree & nprune
##############################################################################
# fit & tune model
set.seed(227)
marsGrid <- expand.grid(degree = 1:2, nprune = 2:15)
marsModel <- train(x = cmp_train,
                   y = yield_train,
                   preProcess = prep,
                   trControl = ctrl,
                   method = "earth",
                   tuneGrid = marsGrid)
marsModel
## Multivariate Adaptive Regression Spline
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (56), scaled (56), remove (1)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 120, 119, 118, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 1.466767 0.4701481 1.1727359
## 1 3 1.231910 0.6376362 1.0160188
## 1 4 1.228827 0.6177270 1.0235419
## 1 5 1.240476 0.6040729 1.0215185
## 1 6 1.251077 0.5936310 1.0224391
## 1 7 1.169742 0.6396604 0.9557255
## 1 8 1.158192 0.6525749 0.9423312
## 1 9 1.168130 0.6469967 0.9460839
## 1 10 1.195275 0.6280185 0.9711703
## 1 11 1.156803 0.6524510 0.9454132
## 1 12 1.154094 0.6600484 0.9418102
## 1 13 1.153212 0.6626030 0.9446418
## 1 14 1.150778 0.6624886 0.9452896
## 1 15 1.155696 0.6608265 0.9485330
## 2 2 1.465726 0.4712874 1.1696152
## 2 3 1.381434 0.5784857 1.1208429
## 2 4 1.254312 0.6061607 1.0397770
## 2 5 1.261880 0.6057280 1.0278797
## 2 6 1.269116 0.6258132 1.0315846
## 2 7 1.354249 0.5417722 1.0912465
## 2 8 1.286775 0.5947254 1.0385618
## 2 9 1.381173 0.5494190 1.1231933
## 2 10 1.527708 0.4898369 1.1740842
## 2 11 1.525072 0.4940744 1.1735497
## 2 12 1.500675 0.5215782 1.1686466
## 2 13 1.514728 0.5258566 1.1847230
## 2 14 1.496330 0.5346941 1.1690166
## 2 15 1.503170 0.5300052 1.1634684
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 1.
ggplot(marsModel, highlight = TRUE) +
labs(title = paste0("Tuning profile: ", marsModel$modelInfo$label))
# final model & variable importance
marsModel$finalModel
## Selected 14 of 21 terms, and 10 of 56 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: M32, M9, M13, M39, M1, B6, M3, M28, B10, B11, B1-unused, ...
## Number of terms at each degree of interaction: 1 13 (additive model)
## GCV 1.167506 RSS 97.51332 GRSq 0.682546 RSq 0.7960532
summary(marsModel$finalModel)
## Call: earth(x=matrix[132,56], y=c(38,42.44,42.0...), keepxy=TRUE,
## degree=1, nprune=14)
##
## coefficients
## (Intercept) 38.551547
## h(-0.862743-B6) -2.817781
## h(-0.847275-B10) -2.175900
## h(-0.549404-B11) 0.949173
## h(M1- -0.719584) 0.686995
## h(M3-0.100062) -0.584934
## h(-1.13526-M9) -1.459686
## h(M9- -1.13526) 0.544228
## h(-1.03772-M13) 2.495349
## h(0.745945-M28) 0.314882
## h(M28-0.745945) 5.771669
## h(M32- -0.855981) 0.947897
## h(0.100088-M39) -0.496338
## h(M39-0.100088) -4.148723
##
## Selected 14 of 21 terms, and 10 of 56 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: M32, M9, M13, M39, M1, B6, M3, M28, B10, B11, B1-unused, ...
## Number of terms at each degree of interaction: 1 13 (additive model)
## GCV 1.167506 RSS 97.51332 GRSq 0.682546 RSq 0.7960532
ggplot(varImp(marsModel), top = 20) +
labs(title = paste0("Variable importance: ", marsModel$modelInfo$label))
# plot response vs each predictor
plotmo(marsModel$finalModel, pt.col = 2,
caption = paste0("Response relationships with the predictors: ",
marsModel$modelInfo$label))
## plotmo grid: B1 B2 B3 B4 B5
## -0.1480932 -0.05676868 -0.1450697 -0.07996953 -0.09038694
## B6 B8 B9 B10 B11 B12
## -0.03417493 0.00738872 -0.02465333 -0.1207101 -0.1833683 0.002799627
## M1 M2 M3 M4 M5 M6
## 0.1381053 0.5106584 0.008802588 0.3646308 -0.1131089 -0.2541179
## M7 M8 M9 M10 M11 M12
## -0.3955405 0.8844957 0.06122472 -0.1721256 -0.03316336 -0.5132428
## M13 M14 M15 M16 M17 M18
## 0.1098922 0.03421284 -0.1305892 0.07222311 0.02579683 0.07813078
## M19 M20 M21 M22 M23 M24
## -0.1556479 0.0780851 -0.1638226 -0.07010995 0.01918121 -0.1633761
## M25 M26 M27 M28 M29 M30
## -0.02998408 -0.1050628 -0.06494512 0.7068991 -0.3533464 -0.1139045
## M31 M32 M33 M34 M35 M36
## 0.1159743 -0.04763013 0.1177296 0.1867818 -0.05910451 0.5342972
## M37 M38 M39 M40 M41 M42 M43
## -0.09274239 0.701741 0.2328687 -0.4696155 -0.4496138 0.2107708 -0.1443466
## M44 M45
## 0.2796578 0.1400684
# test set predictions
marsPred <- as.numeric(predict(marsModel, newdata = cmp_test)) # convert matrix to vector
marsPerf <- postResample(pred = marsPred, obs = yield_test)
##############################################################################
# fit svm model; tune C (use automatic estimate of sigma)
##############################################################################
# fit & tune model
set.seed(227)
svmModel <- train(x = cmp_train,
                  y = yield_train,
                  preProcess = prep,
                  trControl = ctrl,
                  method = "svmRadial",
                  tuneLength = 10)
svmModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (56), scaled (56), remove (1)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 120, 119, 118, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.503402 0.4801001 1.2270443
## 0.50 1.382362 0.5365246 1.1298529
## 1.00 1.256283 0.6067089 1.0215349
## 2.00 1.164566 0.6584831 0.9436202
## 4.00 1.139106 0.6673601 0.9051311
## 8.00 1.135549 0.6684797 0.9118079
## 16.00 1.135008 0.6688247 0.9113934
## 32.00 1.135008 0.6688247 0.9113934
## 64.00 1.135008 0.6688247 0.9113934
## 128.00 1.135008 0.6688247 0.9113934
##
## Tuning parameter 'sigma' was held constant at a value of 0.01483334
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01483334 and C = 16.
ggplot(svmModel, highlight = TRUE) +
labs(title = paste0("Tuning profile: ", svmModel$modelInfo$label))
# final model & variable importance
svmModel$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 16
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0148333380558955
##
## Number of Support Vectors : 116
##
## Objective Function Value : -74.5875
## Training error : 0.009049
ggplot(varImp(svmModel), top = 20) +
labs(title = paste0("Variable importance: ", svmModel$modelInfo$label))
# test set predictions
svmPred <- predict(svmModel, newdata = cmp_test)
svmPerf <- postResample(pred = svmPred, obs = yield_test)
##############################################################################
# fit knn model; tune k
##############################################################################
# fit & tune model
set.seed(227)
knnGrid <- data.frame(k = 1:20)
knnModel <- train(x = cmp_train,
                  y = yield_train,
                  preProcess = prep,
                  trControl = ctrl,
                  method = "knn",
                  tuneGrid = knnGrid)
knnModel
## k-Nearest Neighbors
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (56), scaled (56), remove (1)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 120, 119, 118, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 1.562855 0.4189043 1.231211
## 2 1.366561 0.4972392 1.079467
## 3 1.249477 0.5918773 1.001200
## 4 1.306878 0.5759650 1.049647
## 5 1.344157 0.5477255 1.099069
## 6 1.353635 0.5352235 1.102728
## 7 1.373425 0.5258133 1.119679
## 8 1.371015 0.5335119 1.121872
## 9 1.362660 0.5395666 1.128054
## 10 1.386958 0.5170868 1.146023
## 11 1.389200 0.5143559 1.143645
## 12 1.392736 0.5192199 1.152277
## 13 1.401376 0.5171839 1.155483
## 14 1.419992 0.5048721 1.172474
## 15 1.413551 0.5200757 1.162223
## 16 1.439093 0.4952872 1.176301
## 17 1.444029 0.4856420 1.176658
## 18 1.457741 0.4796633 1.190801
## 19 1.464435 0.4699197 1.192174
## 20 1.472937 0.4640040 1.199591
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
ggplot(knnModel, highlight = TRUE) +
labs(title = paste0("Tuning profile: ", knnModel$modelInfo$label))
# final model & variable importance
knnModel$finalModel
## 3-nearest neighbor regression model
ggplot(varImp(knnModel), top = 20) +
labs(title = paste0("Variable importance: ", knnModel$modelInfo$label))
# test set predictions
knnPred <- predict(knnModel, newdata = cmp_test)
knnPerf <- postResample(pred = knnPred, obs = yield_test)
##############################################################################
# fit linear model for comparison: elastic net
##############################################################################
# fit model using tuned parameters from exercise 6.3
set.seed(227)
enetGrid <- data.frame(fraction = 0.2184211, lambda = 0.005)
enetModel <- train(x = cmp_train,
                   y = yield_train,
                   preProcess = prep,
                   trControl = ctrl,
                   method = "enet",
                   tuneGrid = enetGrid)
# test set predictions
enetPred <- predict(enetModel, newdata = cmp_test)
enetPerf <- postResample(pred = enetPred, obs = yield_test)
We evaluate predictive performance for the models based on the root mean squared error (RMSE), R-squared, and mean absolute error (MAE) metrics. We summarize these performance metrics in the table below based on the resampled holdout sets (prefixed as “res_”) as well as the test set (prefixed as “test_”), for all four non-linear models; we also include the linear model from Exercise 6.3 (elastic net model) for comparison.
We note several observations:

- The SVM model has the strongest resampled performance (lowest RMSE and MAE, highest R-squared), with MARS and ENET close behind.
- On the test set, SVM and NNET perform best and are essentially tied on RMSE; all four non-linear models outperform the linear elastic net.
- The MARS and especially the ENET metrics degrade from resampling to the test set, whereas NNET, SVM, and KNN perform better on the test set than their resampled estimates suggest.
We also plot below the resampled metrics for the various models.
# resampling & test set metrics
res_perf <- rbind(NNET = colMeans(nnetModel$resample[-4]),
                  MARS = colMeans(marsModel$resample[-4]),
                  SVM = colMeans(svmModel$resample[-4]),
                  KNN = colMeans(knnModel$resample[-4]),
                  ENET = colMeans(enetModel$resample[-4]))
colnames(res_perf) <- paste0("res_", colnames(res_perf))
test_perf <- rbind(NNET = nnetPerf, MARS = marsPerf, SVM = svmPerf,
                   KNN = knnPerf, ENET = enetPerf)
colnames(test_perf) <- paste0("test_", colnames(test_perf))
cbind(res_perf, test_perf) %>%
  kable(digits = 3,
        caption = "Resampled & test set performance metrics")
|      | res_RMSE | res_Rsquared | res_MAE | test_RMSE | test_Rsquared | test_MAE |
|---|---|---|---|---|---|---|
| NNET | 1.291 | 0.566 | 1.077 | 1.022 | 0.641 | 0.764 |
| MARS | 1.151 | 0.662 | 0.945 | 1.212 | 0.504 | 0.972 |
| SVM | 1.135 | 0.669 | 0.911 | 1.020 | 0.627 | 0.856 |
| KNN | 1.249 | 0.592 | 1.001 | 1.116 | 0.537 | 0.902 |
| ENET | 1.165 | 0.652 | 0.948 | 1.430 | 0.396 | 0.967 |
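Using the res_perf and test_perf objects built above, the change from resampled to test-set RMSE can be quantified directly; a minimal sketch:
# positive values indicate worse test-set RMSE than the resampled estimate
round(test_perf[, "test_RMSE"] - res_perf[, "res_RMSE"], 3)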
# plot resampled metrics
list(NNET = nnetModel, MARS = marsModel, SVM = svmModel,
     KNN = knnModel, ENET = enetModel) %>%
  resamples() %>%
  bwplot(metric = c("RMSE", "Rsquared", "MAE"),
         layout = c(3, 1),
         between = list(x = 1),
         main = "Comparison of resampled performance metrics")
From the variable importance plot above, we see that the top 10 important predictors in the SVM model include 6 manufacturing process and 4 biological variables:

- manufacturing process: M32, M13, M17, M9, M36, M6
- biological: B6, B3, B12, B2
The top 10 predictors in each model (the four non-linear models plus the linear model for comparison) are summarized in the table below. It is interesting that the top 10 predictors are identical in the NNET, SVM, KNN, and ENET models, whereas they differ in the MARS model, which retains only 10 predictors because of the feature selection performed during fitting. The help file for the varImp function explains why: for MARS, variable importance is computed with a model-specific method based on the fitted MARS model, whereas for the other models it falls back to a model-free calculation (a loess fit of the response against each predictor) that does not use the fitted model at all. Hence the variable importance for the non-MARS models is a generic measure based only on the dataset.
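This model-free calculation is exposed directly as caret's filterVarImp function; a minimal sketch to reproduce the shared ranking (this comparison is ours, not shown in the original output):
# model-free importance: R^2 of a loess fit of yield on each predictor
loessImp <- filterVarImp(x = data.frame(cmp_train), y = yield_train, nonpara = TRUE)
head(loessImp[order(loessImp$Overall, decreasing = TRUE), , drop = FALSE], 10)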
In any case, reviewing variable importance for the SVM model, we observe that:

- manufacturing process variables occupy the top three positions (M32, M13, M17), with M32 clearly the most important;
- manufacturing process variables outnumber biological variables 6 to 4 among the top 10, suggesting the process variables are, overall, the more influential group for predicting yield.
Reviewing the top predictors for the MARS model, we note that the manufacturing process variables dominate the top of the list. This supports the conclusion that the manufacturing variables are more important overall than the biological variables for predicting yield.
# rank predictors by importance, label each with its score, and return the top 10
top10 <- function(mod) {
  varImp(mod)$importance %>%
    mutate(Var = paste(rownames(varImp(mod)$importance),
                       round(Overall, 1), sep = " - ")) %>%
    arrange(desc(Overall)) %>%
    dplyr::select(Var) %>%
    head(10)
}
cbind(NNET = top10(nnetModel),
      MARS = top10(marsModel),
      SVM = top10(svmModel),
      KNN = top10(knnModel),
      ENET = top10(enetModel)) %>%
  kable(col.names = c("NNET", "MARS", "SVM", "KNN", "ENET"),
        caption = "Top 10 important variables")
| NNET | MARS | SVM | KNN | ENET |
|---|---|---|---|---|
| M32 - 100 | M32 - 100 | M32 - 100 | M32 - 100 | M32 - 100 |
| M13 - 89.9 | M9 - 66.9 | M13 - 89.9 | M13 - 89.9 | M13 - 89.9 |
| M17 - 82.6 | M13 - 46.2 | M17 - 82.6 | M17 - 82.6 | M17 - 82.6 |
| B6 - 70.9 | M39 - 29 | B6 - 70.9 | B6 - 70.9 | B6 - 70.9 |
| B3 - 63.9 | M1 - 24.5 | B3 - 63.9 | B3 - 63.9 | B3 - 63.9 |
| M9 - 62.9 | B6 - 20.3 | M9 - 62.9 | M9 - 62.9 | M9 - 62.9 |
| M36 - 61.4 | M3 - 13.8 | M36 - 61.4 | M36 - 61.4 | M36 - 61.4 |
| M6 - 60.2 | B10 - 11 | M6 - 60.2 | M6 - 60.2 | M6 - 60.2 |
| B12 - 59.6 | M28 - 9.5 | B12 - 59.6 | B12 - 59.6 | B12 - 59.6 |
| B2 - 51.2 | B11 - 3.7 | B2 - 51.2 | B2 - 51.2 | B2 - 51.2 |
Next we explore the bivariate relationships between the top predictors and the yield response. We do this graphically with scatter plots of the test data, overlaying loess fits of the observed yield and of the predicted yields from the SVM, MARS, and ENET models.
For the top 5 predictors in the SVM and ENET models, we note the following relationships:

- yield generally increases with M32 (consistent with the positive MARS hinge coefficient for this variable);
- yield generally decreases as M13 increases, and the loess fits suggest a similar negative relationship for M17;
- the loess fits suggest yield is positively associated with the biological measures B6 and B3.
For the top 5 predictors in the MARS model that are not included above, we also note:

- yield increases with M9 and with M1, both of which enter the MARS model with positive slopes over most of their range;
- the relationship with M39 is non-monotonic: the MARS fit rises to a knot near the middle of the range and drops off sharply beyond it.
These relationships suggest that to improve the yield of the manufacturing process, certain variables should be reviewed and considered for adjustment:

- M32 could be increased, and M13 and M17 reduced, to the extent the process allows;
- M39 could be held near the middle of its observed range, where the fitted yield peaks;
- the biological measures (e.g. B3 and B6) reflect properties of the raw material, so they may be more useful for screening incoming material than for adjustment.
# plot bivariate predictor-response relationships
g <- ggplot(data.frame(Y = yield_test, P1 = svmPred, P2 = marsPred, P3 = enetPred, cmp_test))
g + geom_point(aes(x = M32, y = Y)) +
geom_smooth(aes(x = M32, y = Y, col = "Data"), se = FALSE) +
geom_smooth(aes(x = M32, y = P1, col = "SVM"), se = FALSE) +
geom_smooth(aes(x = M32, y = P2, col = "MARS"), se = FALSE) +
geom_smooth(aes(x = M32, y = P3, col = "ENET"), se = FALSE) +
labs(title = "Test set: Observed and predicted Yield versus M32", col = "Loess fit") +
scale_color_discrete(limits = c("Data", "SVM", "MARS", "ENET"))
g + geom_point(aes(x = M13, y = Y)) +
geom_smooth(aes(x = M13, y = Y, col = "Data"), se = FALSE) +
geom_smooth(aes(x = M13, y = P1, col = "SVM"), se = FALSE) +
geom_smooth(aes(x = M13, y = P2, col = "MARS"), se = FALSE) +
geom_smooth(aes(x = M13, y = P3, col = "ENET"), se = FALSE) +
labs(title = "Test set: Observed and predicted Yield versus M13", col = "Loess fit") +
scale_color_discrete(limits = c("Data", "SVM", "MARS", "ENET"))
g + geom_point(aes(x = M17, y = Y)) +
geom_smooth(aes(x = M17, y = Y, col = "Data"), se = FALSE) +
geom_smooth(aes(x = M17, y = P1, col = "SVM"), se = FALSE) +
geom_smooth(aes(x = M17, y = P2, col = "MARS"), se = FALSE) +
geom_smooth(aes(x = M17, y = P3, col = "ENET"), se = FALSE) +
labs(title = "Test set: Observed and predicted Yield versus M17", col = "Loess fit") +
scale_color_discrete(limits = c("Data", "SVM", "MARS", "ENET"))
g + geom_point(aes(x = B6, y = Y)) +
geom_smooth(aes(x = B6, y = Y, col = "Data"), se = FALSE) +
geom_smooth(aes(x = B6, y = P1, col = "SVM"), se = FALSE) +
geom_smooth(aes(x = B6, y = P2, col = "MARS"), se = FALSE) +
geom_smooth(aes(x = B6, y = P3, col = "ENET"), se = FALSE) +
labs(title = "Test set: Observed and predicted Yield versus B6", col = "Loess fit") +
scale_color_discrete(limits = c("Data", "SVM", "MARS", "ENET"))
g + geom_point(aes(x = B3, y = Y)) +
geom_smooth(aes(x = B3, y = Y, col = "Data"), se = FALSE) +
geom_smooth(aes(x = B3, y = P1, col = "SVM"), se = FALSE) +
geom_smooth(aes(x = B3, y = P2, col = "MARS"), se = FALSE) +
geom_smooth(aes(x = B3, y = P3, col = "ENET"), se = FALSE) +
labs(title = "Test set: Observed and predicted Yield versus B3", col = "Loess fit") +
scale_color_discrete(limits = c("Data", "SVM", "MARS", "ENET"))
g + geom_point(aes(x = M9, y = Y)) +
geom_smooth(aes(x = M9, y = Y, col = "Data"), se = FALSE) +
geom_smooth(aes(x = M9, y = P1, col = "SVM"), se = FALSE) +
geom_smooth(aes(x = M9, y = P2, col = "MARS"), se = FALSE) +
geom_smooth(aes(x = M9, y = P3, col = "ENET"), se = FALSE) +
labs(title = "Test set: Observed and predicted Yield versus M9", col = "Loess fit") +
scale_color_discrete(limits = c("Data", "SVM", "MARS", "ENET"))
g + geom_point(aes(x = M39, y = Y)) +
geom_smooth(aes(x = M39, y = Y, col = "Data"), se = FALSE) +
geom_smooth(aes(x = M39, y = P1, col = "SVM"), se = FALSE) +
geom_smooth(aes(x = M39, y = P2, col = "MARS"), se = FALSE) +
geom_smooth(aes(x = M39, y = P3, col = "ENET"), se = FALSE) +
labs(title = "Test set: Observed and predicted Yield versus M39", col = "Loess fit") +
scale_color_discrete(limits = c("Data", "SVM", "MARS", "ENET"))
g + geom_point(aes(x = M1, y = Y)) +
geom_smooth(aes(x = M1, y = Y, col = "Data"), se = FALSE) +
geom_smooth(aes(x = M1, y = P1, col = "SVM"), se = FALSE) +
geom_smooth(aes(x = M1, y = P2, col = "MARS"), se = FALSE) +
geom_smooth(aes(x = M1, y = P3, col = "ENET"), se = FALSE) +
labs(title = "Test set: Observed and predicted Yield versus M1", col = "Loess fit") +
scale_color_discrete(limits = c("Data", "SVM", "MARS", "ENET"))
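The eight plot blocks above differ only in the predictor on the x-axis. A minimal sketch of a helper that produces the same plot for any predictor (the plot_obs_pred name and the use of ggplot2's .data pronoun are our own, not from the original code):
# observed & predicted Yield vs a named predictor, with loess fits
plot_obs_pred <- function(var) {
  g + geom_point(aes(x = .data[[var]], y = Y)) +
    geom_smooth(aes(x = .data[[var]], y = Y, col = "Data"), se = FALSE) +
    geom_smooth(aes(x = .data[[var]], y = P1, col = "SVM"), se = FALSE) +
    geom_smooth(aes(x = .data[[var]], y = P2, col = "MARS"), se = FALSE) +
    geom_smooth(aes(x = .data[[var]], y = P3, col = "ENET"), se = FALSE) +
    labs(title = paste("Test set: Observed and predicted Yield versus", var),
         col = "Loess fit") +
    scale_color_discrete(limits = c("Data", "SVM", "MARS", "ENET"))
}
# e.g. lapply(c("M32", "M13", "M17", "B6", "B3", "M9", "M39", "M1"), plot_obs_pred)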