Friedman (1991) introduced several benchmark data sets create by
simulation. One of these simulations used the following nonlinear
equation to create data: \(y = 10 sin(Ï€x_1x_2)
+ 20(x_3 − 0.5)^2 + 10x_4 + 5x_5 + N(0, σ^2)\) where the x values
are random variables uniformly distributed between [0, 1] (there are
also 5 other non-informative variables also created in the simulation).
The package mlbench contains a function called
mlbench.friedman1 that simulates these data:
#From text p. 169
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
library(corrplot)
## corrplot 0.95 loaded
corrplot(cor(trainingData$x),
type = "upper",
addCoef.col = "black",
number.cex = 0.8)
The plot reveals the correlation between the independent variables (\(x_1\) to \(x_10\)) and the dependent variable \(y\).There are noticeable upward trend in some of the plots.
The following predictors appears to be positively correlated with \(y\): \(x_1\), \(x_2\), \(x_4\), \(x_5\), The rest of the predictors appear to have no significant linear correlation.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
KNN is sensitive to scaling as it relied heavily on distance metrics. So its crucial to scale and center the data. The optimal k is 17.
# Provide by Text
set.seed(1)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.520525 0.4855093 2.871451
## 7 3.404179 0.5186746 2.766570
## 9 3.294231 0.5596678 2.675804
## 11 3.249701 0.5831641 2.648116
## 13 3.231680 0.5985127 2.637329
## 15 3.202284 0.6159300 2.615179
## 17 3.174852 0.6335173 2.600173
## 19 3.181772 0.6396367 2.610065
## 21 3.180229 0.6495883 2.604871
## 23 3.194633 0.6531814 2.615673
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
plot(knnModel)
varImp(knnModel)
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
The model explains about 68% of the variance of the test data. That’s not ideal. Let’s run a diagnostic of the model to see why it’s not performing strongly.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
The following plot compares the predicted values and residuals. An ideal plot would show points randomly scattered around the horizontal line at Residuals = 0. There is an upward curve in the residuals. This confirms KNN is not fully capturing the nonlineear pattern of the underlying Friedman function.
I guess the upward trend is expected as KNN is local averaging.
plot(knnPred, testData$y - knnPred,
xlab = "Predicted", ylab = "Residuals",
abline(h=0, col = "red"))
The model is underestimating high y values and overestimating low y values.
plot(testData$y, knnPred,
xlab = "Actual", ylab = "Predicted",
abline(0, 1, col = "red"))
We can tune KNN furthur with distance weighting. We set distance = 2 to use Euclidean Distance. We can use Euclidean distance since our data is scaled and center.
Triangular kernel consistenly outperforms rectangular (unweighted) and epanechnikov.
RMSE decreases as kmax increases for the triangular kernel, then stabilizes around 3.53 for kmax = 15, 20, and 25.
\(R^2\) improves from about 0.40 (rectangular) to 0.53 (triangular), revealing that weighting neighbors does help improve the modeling.
The following model was chosen: kmax=25, distance = 2, kernel = triangular.
# Provide by Text
set.seed(2)
weightedknnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "kknn",
preProc = c("center", "scale"),
tuneGrid = expand.grid(kmax = seq(5,25,5),
distance = 2,
kernel = c("rectangular", "triangular","epanechnikov")))
weightedknnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## kmax kernel RMSE Rsquared MAE
## 5 rectangular 4.158591 0.4048743 3.409026
## 5 triangular 3.742454 0.4828793 3.062012
## 5 epanechnikov 3.790434 0.4728843 3.106315
## 10 rectangular 4.162592 0.4037107 3.412065
## 10 triangular 3.545847 0.5294473 2.903909
## 10 epanechnikov 3.699420 0.4958138 3.030837
## 15 rectangular 4.162592 0.4037107 3.412065
## 15 triangular 3.530446 0.5346312 2.895038
## 15 epanechnikov 3.677834 0.5031468 3.018106
## 20 rectangular 4.162592 0.4037107 3.412065
## 20 triangular 3.530446 0.5346312 2.895038
## 20 epanechnikov 3.677834 0.5031468 3.018106
## 25 rectangular 4.162592 0.4037107 3.412065
## 25 triangular 3.530446 0.5346312 2.895038
## 25 epanechnikov 3.677834 0.5031468 3.018106
##
## Tuning parameter 'distance' was held constant at a value of 2
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were kmax = 25, distance = 2 and kernel
## = triangular.
The model explains about 70% of the variance of the test data. That’s an improvement from the unweighted knn model. Let’s run a diagnostic of the model.
weightedknnPred <- predict(weightedknnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = weightedknnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.0445793 0.7041268 2.4351706
Even with weighted KNN and uning, a slight upward trend in residuals persist on the dataset. I guess the upward trend is expected as KNN is local averaging. It cannot perfectly capater the highly nonlinear interactions. In addition, theres the curse of dimensionality in KNN where distances becomes less informative with high dimensional data (the data have 10 predictors).
plot(weightedknnPred, testData$y - weightedknnPred,
xlab = "Predicted", ylab = "Residuals",
abline(h=0, col = "red"))
This model trains one neural network with - 5 hidden units - weight decay regularization decay = 0.01 - linear output layer
set.seed(3)
# Refer to text p.162 following their suggested basic model
nnGrid <- expand.grid(size = 5, decay = 0.01)
nnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "nnet",
preProc = c("center", "scale"),
tuneGrid = nnGrid,
linout = TRUE,
trace = FALSE,
maxit = 500)
nnModel
## Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3.373044 0.6172798 2.614574
##
## Tuning parameter 'size' was held constant at a value of 5
## Tuning
## parameter 'decay' was held constant at a value of 0.01
The model explains about 70% of the variance of the test data. This performed similar to the weighted KNN.
nnPred <- predict(nnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = nnPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.8200095 0.7061463 2.1315529
The following plot compares the predicted values and residuals. An ideal plot would show points randomly scattered around the horizontal line at Residuals = 0.
The vertical spread of the residuals around y = 0 is much more uniform than the ones we saw in the KNN models. It has sucessfully capture the nonlinear relationship that KNN couldn’t. However, the spread of residuals is not completely symmetric. For higher predicted values, there are more negative residuals. Perhaps it is overpredicting.
plot(nnPred, testData$y - nnPred,
xlab = "Predicted", ylab = "Residuals",
abline(h=0, col = "red"))
plot(testData$y, nnPred,
xlab = "Actual", ylab = "Predicted",
abline(0, 1, col = "red"))
The dataset is nonlinear but noisy, so averaging smooths out overfitting. This model trains multiple neural networks.
set.seed(4)
# Refer to text p.162 following their suggested basic model
avnnGrid <- expand.grid(size = c(3,5,7),
decay = c(0.001,0.01,0.1),
bag = FALSE)
avnnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "avNNet",
preProc = c("center", "scale"),
tuneGrid = avnnGrid,
linout = TRUE,
trace = FALSE,
maxit = 500,
repeats = 5)
## Warning: executing %dopar% sequentially: no parallel backend registered
avnnModel
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 3 0.001 2.325336 0.7796476 1.810240
## 3 0.010 2.342806 0.7785945 1.842522
## 3 0.100 2.329693 0.7814102 1.820285
## 5 0.001 2.707160 0.7160330 2.108805
## 5 0.010 2.568562 0.7401671 1.992545
## 5 0.100 2.477041 0.7568092 1.937406
## 7 0.001 2.911014 0.6876818 2.281852
## 7 0.010 2.830626 0.6999453 2.205818
## 7 0.100 2.594108 0.7382855 2.035762
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3, decay = 0.001 and bag
## = FALSE.
The model explains about 85% of the variance of the test data. This model performed better than all KNN models we tried.
avnnPred <- predict(avnnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = avnnPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.9321277 0.8517172 1.4562245
plot(avnnPred, testData$y - avnnPred,
xlab = "Predicted", ylab = "Residuals",
abline(h=0, col = "red"))
varImp(avnnModel)
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
set.seed(5)
#refer to text p. 165
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsModel <- train(x = trainingData$x,
y = trainingData$y,
method = "earth",
tuneGrid = marsGrid,
preProc = c("center", "scale"),
trControl = trainControl(method = "cv"))
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsModel
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.423532 0.2309187 3.5890565
## 1 3 3.713198 0.4478548 3.0221726
## 1 4 2.605485 0.7344600 2.0966374
## 1 5 2.308050 0.7909448 1.8223255
## 1 6 2.313981 0.7856945 1.8202195
## 1 7 1.789424 0.8787532 1.3759182
## 1 8 1.746432 0.8836548 1.3824411
## 1 9 1.641724 0.8942825 1.3077422
## 1 10 1.645349 0.8953105 1.2842713
## 1 11 1.636401 0.8949921 1.2853430
## 1 12 1.647497 0.8942315 1.2810812
## 1 13 1.659909 0.8919944 1.2928655
## 1 14 1.662979 0.8916411 1.2942356
## 1 15 1.662979 0.8916411 1.2942356
## 1 16 1.662979 0.8916411 1.2942356
## 1 17 1.662979 0.8916411 1.2942356
## 1 18 1.662979 0.8916411 1.2942356
## 1 19 1.662979 0.8916411 1.2942356
## 1 20 1.662979 0.8916411 1.2942356
## 1 21 1.662979 0.8916411 1.2942356
## 1 22 1.662979 0.8916411 1.2942356
## 1 23 1.662979 0.8916411 1.2942356
## 1 24 1.662979 0.8916411 1.2942356
## 1 25 1.662979 0.8916411 1.2942356
## 1 26 1.662979 0.8916411 1.2942356
## 1 27 1.662979 0.8916411 1.2942356
## 1 28 1.662979 0.8916411 1.2942356
## 1 29 1.662979 0.8916411 1.2942356
## 1 30 1.662979 0.8916411 1.2942356
## 1 31 1.662979 0.8916411 1.2942356
## 1 32 1.662979 0.8916411 1.2942356
## 1 33 1.662979 0.8916411 1.2942356
## 1 34 1.662979 0.8916411 1.2942356
## 1 35 1.662979 0.8916411 1.2942356
## 1 36 1.662979 0.8916411 1.2942356
## 1 37 1.662979 0.8916411 1.2942356
## 1 38 1.662979 0.8916411 1.2942356
## 2 2 4.423532 0.2309187 3.5890565
## 2 3 3.713198 0.4478548 3.0221726
## 2 4 2.605485 0.7344600 2.0966374
## 2 5 2.330502 0.7855198 1.8542756
## 2 6 2.262752 0.7959896 1.8248388
## 2 7 1.751654 0.8806273 1.3485399
## 2 8 1.701033 0.8879443 1.2910422
## 2 9 1.507255 0.9139185 1.1882540
## 2 10 1.365326 0.9269532 1.0994145
## 2 11 1.261851 0.9374852 1.0427289
## 2 12 1.286904 0.9369358 1.0228168
## 2 13 1.295120 0.9366703 1.0219969
## 2 14 1.254485 0.9400388 0.9973163
## 2 15 1.219310 0.9427229 0.9625907
## 2 16 1.203802 0.9447799 0.9488964
## 2 17 1.201141 0.9453652 0.9474940
## 2 18 1.201141 0.9453652 0.9474940
## 2 19 1.201141 0.9453652 0.9474940
## 2 20 1.201141 0.9453652 0.9474940
## 2 21 1.201141 0.9453652 0.9474940
## 2 22 1.201141 0.9453652 0.9474940
## 2 23 1.201141 0.9453652 0.9474940
## 2 24 1.201141 0.9453652 0.9474940
## 2 25 1.201141 0.9453652 0.9474940
## 2 26 1.201141 0.9453652 0.9474940
## 2 27 1.201141 0.9453652 0.9474940
## 2 28 1.201141 0.9453652 0.9474940
## 2 29 1.201141 0.9453652 0.9474940
## 2 30 1.201141 0.9453652 0.9474940
## 2 31 1.201141 0.9453652 0.9474940
## 2 32 1.201141 0.9453652 0.9474940
## 2 33 1.201141 0.9453652 0.9474940
## 2 34 1.201141 0.9453652 0.9474940
## 2 35 1.201141 0.9453652 0.9474940
## 2 36 1.201141 0.9453652 0.9474940
## 2 37 1.201141 0.9453652 0.9474940
## 2 38 1.201141 0.9453652 0.9474940
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 17 and degree = 2.
The model explains about 93% of the variance of the test data. This model performed better all previous models.
marsPred <- predict(marsModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.2793868 0.9343367 1.0091132
The following plot compares the predicted values and residuals. An ideal plot would show points randomly scattered around the horizontal line at Residuals = 0. The vertical spread of the residuals around y = 0 is much more uniform than the ones we saw in the KNN models. It has sucessfully capture the nonlinear relationship that KNN couldn’t.
plot(marsPred, testData$y - marsPred,
xlab = "Predicted", ylab = "Residuals",
abline(h=0, col = "red"))
plot(testData$y, marsPred,
xlab = "Actual", ylab = "Predicted",
abline(0, 1, col = "red"))
varImp(marsModel)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.25
## X2 48.76
## X5 15.54
## X3 0.00
set.seed(6)
#refer to text p. 166
svmModel <- train(x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.491394 0.7989985 2.003983
## 0.50 2.262016 0.8120133 1.812875
## 1.00 2.110200 0.8331361 1.664682
## 2.00 1.991951 0.8522979 1.554581
## 4.00 1.894129 0.8670136 1.462134
## 8.00 1.882129 0.8683749 1.473288
## 16.00 1.875460 0.8699510 1.480442
## 32.00 1.877943 0.8695272 1.484317
## 64.00 1.877943 0.8695272 1.484317
## 128.00 1.877943 0.8695272 1.484317
## 256.00 1.877943 0.8695272 1.484317
## 512.00 1.877943 0.8695272 1.484317
## 1024.00 1.877943 0.8695272 1.484317
## 2048.00 1.877943 0.8695272 1.484317
##
## Tuning parameter 'sigma' was held constant at a value of 0.05635751
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.05635751 and C = 16.
The model explains about 83% of the variance of the test data. This model performed better the KNN models but MARS model performed the best overall.
svmPred <- predict(svmModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = svmPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0536835 0.8289696 1.5615277
The following plot compares the predicted values and residuals. An ideal plot would show points randomly scattered around the horizontal line at Residuals = 0.
plot(svmPred, testData$y - svmPred,
xlab = "Predicted", ylab = "Residuals",
abline(h=0, col = "red"))
plot(testData$y, svmPred,
xlab = "Actual", ylab = "Predicted",
abline(0, 1, col = "red"))
varImp(svmModel)
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
MARS outperformed all other models with it has the highest \(R^2\) and lowest RMSE. It was able to explain about 93% of the variance of the test data and its predictions was the closest to the true values.
| Model | \(R^2\) | RMSE |
|---|---|---|
| MARS | 0.9343367 | 1.2793868 |
| Average Neural Network | 0.8517172 | 1.9321277 |
| Simple Neural Network | 0.7061463 | 2.8200095 |
| SVM | 0.8289696 | 2.0536835 |
| Weighted KNN | 0.7041268 | 3.0445793 |
| KNN | 0.6819919 | 3.2040595 |
\(x_1\) is the most important predictor, followed by \(x_4\), \(x_2\), and \(x_5\). MARS did not use \(x_3\). It also ignored \(x_6\) - \(x_10\), as these were irrelevant predictors.
varImp(marsModel)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.25
## X2 48.76
## X5 15.54
## X3 0.00
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Refering text p. 54
trans <- preProcess(ChemicalManufacturingProcess[,-1],
method = "knnImpute" )
transformed <- predict(trans, ChemicalManufacturingProcess[,-1])
sum(is.na(transformed))
## [1] 0
#library(corrplot)
#cor(transformed,ChemicalManufacturingProcess$Yield )
chem_filtered <- transformed[, -nearZeroVar(transformed)]
set.seed(6)
trainingRows <- createDataPartition(ChemicalManufacturingProcess$Yield,
p = .70,
list= FALSE)
TrainX <- chem_filtered[trainingRows, ]
TrainY <- ChemicalManufacturingProcess$Yield[trainingRows]
TestX <- chem_filtered[-trainingRows, ]
TestY <- ChemicalManufacturingProcess$Yield[-trainingRows]
tooHigh <- findCorrelation(cor(TrainX), cutoff = .8)
trainXnnet <- TrainX[, -tooHigh]
testXnnet <- TestX[, -tooHigh]
Which nonlinear regression model gives the optimal resampling and test set performance?
# Provide by Text
set.seed(111)
weightedknnModel2 <- train(x = TrainX,
y = TrainY,
method = "kknn",
preProc = c("center", "scale", "pca"),
tuneGrid = expand.grid(kmax = seq(5,25,5),
distance = 2,
kernel = c("rectangular", "triangular","epanechnikov")))
weightedknnModel2
## k-Nearest Neighbors
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56), principal component
## signal extraction (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ...
## Resampling results across tuning parameters:
##
## kmax kernel RMSE Rsquared MAE
## 5 rectangular 1.699732 0.2860582 1.306812
## 5 triangular 1.594407 0.3072062 1.230722
## 5 epanechnikov 1.618735 0.2967283 1.242189
## 10 rectangular 1.699732 0.2860582 1.306812
## 10 triangular 1.553165 0.3255548 1.214788
## 10 epanechnikov 1.599011 0.3028114 1.237059
## 15 rectangular 1.699732 0.2860582 1.306812
## 15 triangular 1.549762 0.3273930 1.213288
## 15 epanechnikov 1.599011 0.3028114 1.237059
## 20 rectangular 1.699732 0.2860582 1.306812
## 20 triangular 1.549762 0.3273930 1.213288
## 20 epanechnikov 1.599011 0.3028114 1.237059
## 25 rectangular 1.699732 0.2860582 1.306812
## 25 triangular 1.549762 0.3273930 1.213288
## 25 epanechnikov 1.599011 0.3028114 1.237059
##
## Tuning parameter 'distance' was held constant at a value of 2
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were kmax = 25, distance = 2 and kernel
## = triangular.
The model explains about 47% of the variance of the test data. Let’s run a diagnostic of the model.
weightedknnPred2 <- predict(weightedknnModel2, newdata = TestX)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = weightedknnPred2, obs = TestY)
## RMSE Rsquared MAE
## 1.4676074 0.4733189 1.1051550
plot(weightedknnPred2, TestY - weightedknnPred2,
xlab = "Predicted", ylab = "Residuals",
abline(h=0, col = "red"))
plot(TestY, weightedknnPred2,
xlab = "Actual", ylab = "Predicted",
abline(0, 1, col = "red"))
set.seed(112)
# Refer to text p.162 following their suggested basic model
avnnGrid <- expand.grid(size = c(1:5),
decay = c(0.001,0.01),
bag = FALSE)
avnnModel2 <- train(x = TrainX,
y = TrainY,
method = "avNNet",
preProc = c("center", "scale"),
tuneGrid = avnnGrid,
linout = TRUE,
trace = FALSE,
maxit = 500,
trControl = trainControl(method = "cv", number = 5))
avnnModel2
## Model Averaged Neural Network
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 100, 98, 100, 99, 99
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0.001 1.783858 0.4591729 1.2755065
## 1 0.010 1.353396 0.5024565 1.0378731
## 2 0.001 1.215155 0.5631878 0.9658776
## 2 0.010 1.879248 0.3600162 1.4395281
## 3 0.001 1.243871 0.5653154 0.9813878
## 3 0.010 1.740887 0.4017771 1.3025143
## 4 0.001 1.197696 0.5803829 0.9345391
## 4 0.010 1.898971 0.3617917 1.3753743
## 5 0.001 1.385897 0.5003783 1.1059915
## 5 0.010 1.947285 0.2961363 1.4381434
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.001 and bag
## = FALSE.
The model explains about 61% of the variance of the test data. That’s an improvement from the unweighted knn model. Let’s run a diagnostic of the model.
avnnPred2 <- predict(avnnModel2, newdata = TestX)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = avnnPred2, obs = TestY)
## RMSE Rsquared MAE
## 1.3390678 0.6091029 0.9959886
plot(avnnPred2, TestY - avnnPred2,
xlab = "Predicted", ylab = "Residuals",
abline(h=0, col = "red"))
plot(TestY, avnnPred2,
xlab = "Actual", ylab = "Predicted",
abline(0, 1, col = "red"))
set.seed(113)
#refer to text p. 166
svmModel2 <- train(x = TrainX,
y = TrainY,
method = "svmRadial",
preProc = c("center", "scale", "pca"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmModel2
## Support Vector Machines with Radial Basis Function Kernel
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56), principal component
## signal extraction (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 110, 112, 112, 112, 112, 112, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.450085 0.5568588 1.1997266
## 0.50 1.332047 0.5647384 1.0980277
## 1.00 1.238790 0.6006898 1.0179477
## 2.00 1.200795 0.6201847 0.9773347
## 4.00 1.205578 0.6058129 0.9752402
## 8.00 1.222136 0.5942627 0.9882264
## 16.00 1.221243 0.5954775 0.9870167
## 32.00 1.221243 0.5954775 0.9870167
## 64.00 1.221243 0.5954775 0.9870167
## 128.00 1.221243 0.5954775 0.9870167
## 256.00 1.221243 0.5954775 0.9870167
## 512.00 1.221243 0.5954775 0.9870167
## 1024.00 1.221243 0.5954775 0.9870167
## 2048.00 1.221243 0.5954775 0.9870167
##
## Tuning parameter 'sigma' was held constant at a value of 0.02748059
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02748059 and C = 2.
The model explains about 62% of the variance of the test data. That’s an improvement from the weighted knn model. But it is still not ideal. Let’s run a diagnostic of the model.
svmPred2 <- predict(svmModel2, newdata = TestX)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = svmPred2, obs = TestY)
## RMSE Rsquared MAE
## 1.2695080 0.6191610 0.9728057
plot(svmPred2, TestY - svmPred2,
xlab = "Predicted", ylab = "Residuals",
abline(h=0, col = "red"))
plot(TestY,svmPred2,
xlab = "Predicted", ylab = "Residuals",
abline(0, 1, col = "red"))
set.seed(114)
#refer to text p. 165
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsModel2 <- train(x = TrainX,
y = TrainY,
method = "earth",
tuneGrid = marsGrid,
preProc = c("center", "scale"),
trControl = trainControl(method = "cv"))
marsModel2
## Multivariate Adaptive Regression Spline
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 112, 111, 112, 112, 112, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 1.420245 0.4161646 1.1223649
## 1 3 1.192370 0.6063616 0.9763954
## 1 4 1.171674 0.6163737 0.9433324
## 1 5 1.169215 0.6034139 0.9388695
## 1 6 1.223304 0.5669649 0.9703588
## 1 7 1.185700 0.6027056 0.9510157
## 1 8 1.180833 0.5932560 0.9827166
## 1 9 1.180792 0.5976978 0.9827534
## 1 10 1.185384 0.6048656 0.9685761
## 1 11 1.184590 0.6095915 0.9674617
## 1 12 1.373554 0.5963351 1.0829173
## 1 13 1.395727 0.5843234 1.0984551
## 1 14 1.404500 0.5844436 1.1004205
## 1 15 1.395147 0.5864584 1.0949697
## 1 16 1.405719 0.5837766 1.1066869
## 1 17 1.392827 0.5902839 1.0942656
## 1 18 1.392827 0.5902839 1.0942656
## 1 19 1.392827 0.5902839 1.0942656
## 1 20 1.415115 0.5837947 1.1080326
## 1 21 1.421045 0.5836208 1.1194441
## 1 22 1.421045 0.5836208 1.1194441
## 1 23 1.421045 0.5836208 1.1194441
## 1 24 1.421045 0.5836208 1.1194441
## 1 25 1.421045 0.5836208 1.1194441
## 1 26 1.421045 0.5836208 1.1194441
## 1 27 1.421045 0.5836208 1.1194441
## 1 28 1.421045 0.5836208 1.1194441
## 1 29 1.421045 0.5836208 1.1194441
## 1 30 1.421045 0.5836208 1.1194441
## 1 31 1.421045 0.5836208 1.1194441
## 1 32 1.421045 0.5836208 1.1194441
## 1 33 1.421045 0.5836208 1.1194441
## 1 34 1.421045 0.5836208 1.1194441
## 1 35 1.421045 0.5836208 1.1194441
## 1 36 1.421045 0.5836208 1.1194441
## 1 37 1.421045 0.5836208 1.1194441
## 1 38 1.421045 0.5836208 1.1194441
## 2 2 1.420245 0.4161646 1.1223649
## 2 3 1.184745 0.6076965 0.9602160
## 2 4 1.209515 0.5854936 0.9681314
## 2 5 1.254889 0.5406905 0.9757740
## 2 6 1.183539 0.5938054 0.9470884
## 2 7 1.156842 0.6058061 0.9096095
## 2 8 1.159008 0.6048214 0.9155860
## 2 9 1.158606 0.6017263 0.9002778
## 2 10 1.212862 0.5807610 0.9210734
## 2 11 1.398343 0.5475084 0.9810743
## 2 12 1.489289 0.5383656 1.0335915
## 2 13 1.510972 0.5529365 1.0257113
## 2 14 1.533395 0.5414225 1.0553174
## 2 15 1.532111 0.5254442 1.0506624
## 2 16 1.546251 0.5066151 1.0598119
## 2 17 1.516852 0.5173416 1.0311456
## 2 18 1.578750 0.5075698 1.0671888
## 2 19 1.574408 0.5085381 1.0617242
## 2 20 1.571305 0.5131123 1.0582440
## 2 21 1.583670 0.5129312 1.0444304
## 2 22 1.576575 0.5135234 1.0351308
## 2 23 1.585898 0.5104487 1.0460993
## 2 24 1.585898 0.5104487 1.0460993
## 2 25 1.585898 0.5104487 1.0460993
## 2 26 1.585898 0.5104487 1.0460993
## 2 27 1.585898 0.5104487 1.0460993
## 2 28 1.582272 0.5062776 1.0514079
## 2 29 1.594988 0.5027155 1.0586435
## 2 30 1.594988 0.5027155 1.0586435
## 2 31 1.594988 0.5027155 1.0586435
## 2 32 1.594988 0.5027155 1.0586435
## 2 33 1.594988 0.5027155 1.0586435
## 2 34 1.594988 0.5027155 1.0586435
## 2 35 1.594988 0.5027155 1.0586435
## 2 36 1.594988 0.5027155 1.0586435
## 2 37 1.594988 0.5027155 1.0586435
## 2 38 1.594988 0.5027155 1.0586435
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 7 and degree = 2.
The model explains about 62% of the variance of the test data. That’s an improvement from the unweighted knn model. Let’s run a diagnostic of the model.
marsPred2 <- predict(marsModel2, newdata = TestX)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = marsPred2, obs = TestY)
## RMSE Rsquared MAE
## 1.2278715 0.6159963 0.9913622
plot(marsPred2, TestY - marsPred2,
xlab = "Predicted", ylab = "Residuals",
abline(h=0, col = "red"))
plot(TestY, marsPred2,
xlab = "Predicted", ylab = "Residuals",
abline(0, 1, col = "red"))
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
| Model | \(R^2\) | RMSE |
|---|---|---|
| SVM | 0.6191610 | 1.2695080 |
| MARS | 0.6159963 | 1.2278715 |
| Average Neural Network | 0.6091029 | 1.3390678 |
| Weighted KNN | 0.4733189 | 1.4676074 |
The SVM model outperform all the other models as it has the highest \(R^2\). However, it is still not an ideal model as it only explains about 62% of the variance of the test data.
Manufacturing process variables slightly dominates the top 10 predictors. Both types are present. Manufacturing Process 32 consistently top 1 predictor in both nonlinear and linear optimal model.
varImp(svmModel2)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 89.27
## BiologicalMaterial12 82.55
## BiologicalMaterial03 79.75
## ManufacturingProcess13 75.28
## BiologicalMaterial02 71.85
## ManufacturingProcess09 70.68
## ManufacturingProcess31 62.16
## ManufacturingProcess06 59.72
## ManufacturingProcess17 53.63
## BiologicalMaterial11 53.11
## ManufacturingProcess29 52.53
## ManufacturingProcess36 52.11
## ManufacturingProcess02 48.59
## BiologicalMaterial08 47.31
## ManufacturingProcess11 46.62
## ManufacturingProcess30 46.31
## BiologicalMaterial04 45.70
## BiologicalMaterial01 43.72
## ManufacturingProcess33 36.14
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
Biological predictors tend to have positive relationship with yield while manufacturing process predictors can have either positive or negative relationship with yield.
Biological predictors are highly correlated (multicollinearity?), which could explain why our linear model performed poorly. Nonlinear models are able to capture these interactions.
library(dplyr)
library(corrplot)
top_predictors <- varImp(svmModel2)$importance %>%
mutate(Predictor = rownames(.)) %>%
arrange(desc(Overall)) %>%
head(10)
top_predictor_names <- top_predictors$Predictor
cor(TrainX[, top_predictor_names], TrainY)
## [,1]
## ManufacturingProcess32 0.5746692
## BiologicalMaterial06 0.5539187
## BiologicalMaterial12 0.4374056
## BiologicalMaterial03 0.4986530
## ManufacturingProcess13 -0.4835901
## BiologicalMaterial02 0.5261501
## ManufacturingProcess09 0.5235815
## ManufacturingProcess31 -0.3753337
## ManufacturingProcess06 0.4211420
## ManufacturingProcess17 -0.3454498
corrplot(cor(TrainX[, top_predictor_names]))
featurePlot(TrainX[, top_predictor_names], TrainY)