Exercise 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to generate data: \(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\), where the \(x\) values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 additional non-informative predictors). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
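
To make the data-generating equation concrete, here is a minimal hand-rolled sketch of the same simulation (my own illustration; the mlbench call below is what we actually use):

# Direct simulation of the Friedman (1991) equation, for intuition only;
# mlbench.friedman1() below does the same and adds 5 noise predictors.
set.seed(100)
n <- 200
x <- matrix(runif(n * 5), ncol = 5)       # x1..x5 ~ Uniform(0, 1)
y <- 10 * sin(pi * x[, 1] * x[, 2]) +     # interaction term
  20 * (x[, 3] - 0.5)^2 +                 # quadratic term
  10 * x[, 4] + 5 * x[, 5] +              # linear terms
  rnorm(n, sd = 1)                        # N(0, sigma^2) noise with sigma = 1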

#From text p. 169
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
 ## We convert the 'x' data from a matrix to a data frame
 ## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
 ## Look at the data using
featurePlot(trainingData$x, trainingData$y)

 ## or other methods.
library(corrplot)
## corrplot 0.95 loaded
corrplot(cor(trainingData$x),
         type = "upper",
         addCoef.col = "black",
         number.cex = 0.8)

The feature plot shows each predictor (\(x_1\) to \(x_{10}\)) against the response \(y\); there is a noticeable upward trend in some of the panels. The correlation plot shows that the predictors themselves are essentially uncorrelated with one another, as expected for independently simulated variables.

The following predictors appear to be positively correlated with \(y\): \(x_1\), \(x_2\), \(x_4\), and \(x_5\). The remaining predictors show no meaningful linear correlation.
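
A quick numeric check of these correlations (my own addition, not part of the text's code):

# Linear correlation of each predictor with y: x1, x2, x4, and x5 stand out.
# Note that x3's quadratic effect is symmetric about 0.5, so its *linear*
# correlation with y is near zero even though x3 is informative.
sort(cor(trainingData$x, trainingData$y)[, 1], decreasing = TRUE)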

 ## This creates a list with a vector 'y' and a matrix
 ## of predictors 'x'. Also simulate a large test set to
 ## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

KNN

KNN is sensitive to scaling because it relies heavily on distance metrics, so it is crucial to center and scale the data. The optimal k turns out to be 17.
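
As a toy illustration of why scaling matters (my own example), a feature with a large range dominates the raw Euclidean distance:

# Feature x2 has a much larger scale than x1, so it dominates the raw
# distances; after centering and scaling, both features contribute.
toy <- data.frame(x1 = c(0.1, 0.5, 0.9), x2 = c(100, 300, 500))
dist(toy)         # driven almost entirely by x2
dist(scale(toy))  # comparable contributions from x1 and x2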

# Provided by the text
set.seed(1)
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.520525  0.4855093  2.871451
##    7  3.404179  0.5186746  2.766570
##    9  3.294231  0.5596678  2.675804
##   11  3.249701  0.5831641  2.648116
##   13  3.231680  0.5985127  2.637329
##   15  3.202284  0.6159300  2.615179
##   17  3.174852  0.6335173  2.600173
##   19  3.181772  0.6396367  2.610065
##   21  3.180229  0.6495883  2.604871
##   23  3.194633  0.6531814  2.615673
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
plot(knnModel)

varImp(knnModel)
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

The model explains about 68% of the variance of the test data. That’s not ideal. Let’s run a diagnostic of the model to see why it’s not performing strongly.

knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

The following plot compares the predicted values and residuals. An ideal plot would show points randomly scattered around the horizontal line at Residuals = 0. Instead, there is an upward curve in the residuals, which confirms that KNN is not fully capturing the nonlinear pattern of the underlying Friedman function.

The upward trend is somewhat expected: KNN predicts by local averaging, which pulls predictions toward the mean at the extremes.

plot(knnPred, testData$y - knnPred,
     xlab = "Predicted", ylab = "Residuals")
abline(h = 0, col = "red")

The model is underestimating high y values and overestimating low y values.
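
We can verify this compression numerically (a quick check of my own): the mean residual is negative in the lowest third of the actual values and positive in the highest third.

# Mean residual (actual - predicted) by tercile of the actual response:
# negative at the bottom (overestimation), positive at the top
# (underestimation).
tapply(testData$y - knnPred, cut(testData$y, breaks = 3), mean)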

plot(testData$y, knnPred,
     xlab = "Actual", ylab = "Predicted")
abline(0, 1, col = "red")

Weighted KNN

We can tune KNN further with distance weighting. We set distance = 2 to use Euclidean distance, which is reasonable since the data are centered and scaled.

The triangular kernel consistently outperforms the rectangular (unweighted) and Epanechnikov kernels.

RMSE decreases as kmax increases for the triangular kernel, then stabilizes around 3.53 for kmax = 15, 20, and 25.

\(R^2\) improves from about 0.40 (rectangular) to 0.53 (triangular), showing that weighting the neighbors does improve the model.

The following model was chosen: kmax = 25, distance = 2, kernel = triangular.
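
For intuition, the triangular kernel weights each neighbor linearly by its standardized distance, so closer neighbors count more. A hand-computed sketch with made-up numbers (kknn does this internally, standardizing by the (k+1)-th neighbor's distance):

# Triangular kernel: w(d) = 1 - d for standardized distances d in [0, 1].
d <- c(0.15, 0.40, 0.75, 0.95)            # hypothetical standardized distances
w <- 1 - d                                # triangular weights
y_neighbors <- c(12.1, 14.3, 9.8, 19.6)   # hypothetical neighbor responses
weighted.mean(y_neighbors, w)             # weighted prediction
mean(y_neighbors)                         # rectangular (unweighted) prediction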

# Provided by the text
set.seed(2)
weightedknnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "kknn",
                  preProc = c("center", "scale"),
                  tuneGrid = expand.grid(kmax = seq(5,25,5), 
                                         distance = 2,
                                         kernel = c("rectangular", "triangular","epanechnikov")))
weightedknnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   kmax  kernel        RMSE      Rsquared   MAE     
##    5    rectangular   4.158591  0.4048743  3.409026
##    5    triangular    3.742454  0.4828793  3.062012
##    5    epanechnikov  3.790434  0.4728843  3.106315
##   10    rectangular   4.162592  0.4037107  3.412065
##   10    triangular    3.545847  0.5294473  2.903909
##   10    epanechnikov  3.699420  0.4958138  3.030837
##   15    rectangular   4.162592  0.4037107  3.412065
##   15    triangular    3.530446  0.5346312  2.895038
##   15    epanechnikov  3.677834  0.5031468  3.018106
##   20    rectangular   4.162592  0.4037107  3.412065
##   20    triangular    3.530446  0.5346312  2.895038
##   20    epanechnikov  3.677834  0.5031468  3.018106
##   25    rectangular   4.162592  0.4037107  3.412065
##   25    triangular    3.530446  0.5346312  2.895038
##   25    epanechnikov  3.677834  0.5031468  3.018106
## 
## Tuning parameter 'distance' was held constant at a value of 2
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were kmax = 25, distance = 2 and kernel
##  = triangular.

The model explains about 70% of the variance of the test data, an improvement over the unweighted KNN model. Let’s run a diagnostic of the model.

weightedknnPred <- predict(weightedknnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = weightedknnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.0445793 0.7041268 2.4351706

Even with distance weighting and tuning, a slight upward trend in the residuals persists. As before, this is expected because KNN predicts by local averaging and cannot perfectly capture the highly nonlinear interactions. In addition, KNN suffers from the curse of dimensionality: distances become less informative in high-dimensional data (here, 10 predictors, half of which are pure noise).
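
A small simulation of that distance-concentration effect (my own sketch): as the dimension grows, the farthest and nearest pairs of points become nearly equidistant, so a "neighborhood" carries less information.

# Ratio of the largest to the smallest pairwise distance among 100 uniform
# random points; it shrinks toward 1 as the dimension grows.
set.seed(10)
sapply(c(2, 10, 50, 200), function(p) {
  d <- dist(matrix(runif(100 * p), ncol = p))
  max(d) / min(d)
})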

plot(weightedknnPred, testData$y - weightedknnPred,
     xlab = "Predicted", ylab = "Residuals")
abline(h = 0, col = "red")

Simple Neural Network

This model trains one neural network with:

- 5 hidden units
- weight decay regularization (decay = 0.01)
- a linear output layer

set.seed(3)

# Refer to text p.162 following their suggested basic model

nnGrid <- expand.grid(size = 5, decay = 0.01)

nnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "nnet",
                  preProc = c("center", "scale"),
                  tuneGrid = nnGrid,
                 linout = TRUE,
                 trace = FALSE,
                 maxit = 500)
nnModel
## Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   3.373044  0.6172798  2.614574
## 
## Tuning parameter 'size' was held constant at a value of 5
## Tuning
##  parameter 'decay' was held constant at a value of 0.01

The model explains about 70% of the variance of the test data, performing similarly to the weighted KNN.

nnPred <- predict(nnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = nnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.8200095 0.7061463 2.1315529

The following plot compares the predicted values and residuals. An ideal plot would show points randomly scattered around the horizontal line at Residuals = 0.

The vertical spread of the residuals around y = 0 is much more uniform than what we saw in the KNN models: the network has successfully captured nonlinear structure that KNN could not. However, the spread is not completely symmetric. For higher predicted values there are more negative residuals, suggesting the model overpredicts at the high end.

plot(nnPred, testData$y - nnPred,
     xlab = "Predicted", ylab = "Residuals")
abline(h = 0, col = "red")

plot(testData$y, nnPred,
     xlab = "Actual", ylab = "Predicted")
abline(0, 1, col = "red")

Average Neural Network Ensemble

The data set is nonlinear but noisy, so averaging several fits smooths out overfitting. This model trains multiple neural networks from different random starting weights and averages their predictions, as sketched below.
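
A minimal hand-rolled sketch of the same idea (my own illustration, omitting the centering and scaling that the caret call handles):

# Average several independently initialized nets to reduce the variance
# of any single fit; this mirrors repeats = 5 in the caret call below.
library(nnet)
fits <- lapply(1:5, function(i) {
  set.seed(i)
  nnet(trainingData$x, trainingData$y, size = 3, decay = 0.001,
       linout = TRUE, trace = FALSE, maxit = 500)
})
avgPred <- rowMeans(sapply(fits, predict, newdata = testData$x))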

set.seed(4)

# Refer to text p.162 following their suggested basic model

avnnGrid <- expand.grid(size = c(3,5,7),
                        decay = c(0.001,0.01,0.1),
                        bag = FALSE)

avnnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "avNNet",
                  preProc = c("center", "scale"),
                  tuneGrid = avnnGrid,
                 linout = TRUE,
                 trace = FALSE,
                 maxit = 500, 
                 repeats = 5)
## Warning: executing %dopar% sequentially: no parallel backend registered
avnnModel
## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared   MAE     
##   3     0.001  2.325336  0.7796476  1.810240
##   3     0.010  2.342806  0.7785945  1.842522
##   3     0.100  2.329693  0.7814102  1.820285
##   5     0.001  2.707160  0.7160330  2.108805
##   5     0.010  2.568562  0.7401671  1.992545
##   5     0.100  2.477041  0.7568092  1.937406
##   7     0.001  2.911014  0.6876818  2.281852
##   7     0.010  2.830626  0.6999453  2.205818
##   7     0.100  2.594108  0.7382855  2.035762
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3, decay = 0.001 and bag
##  = FALSE.

The model explains about 85% of the variance of the test data. This model performed better than all KNN models we tried.

avnnPred <- predict(avnnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = avnnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 1.9321277 0.8517172 1.4562245
plot(avnnPred, testData$y - avnnPred,
     xlab = "Predicted", ylab = "Residuals")
abline(h = 0, col = "red")

varImp(avnnModel)
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

Multivariate Adaptive Regression Splines (MARS)

set.seed(5)
#refer to text p. 165
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

marsModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "earth",
                  tuneGrid = marsGrid,
                  preProc = c("center", "scale"),
                  trControl = trainControl(method = "cv"))
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsModel
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      4.423532  0.2309187  3.5890565
##   1        3      3.713198  0.4478548  3.0221726
##   1        4      2.605485  0.7344600  2.0966374
##   1        5      2.308050  0.7909448  1.8223255
##   1        6      2.313981  0.7856945  1.8202195
##   1        7      1.789424  0.8787532  1.3759182
##   1        8      1.746432  0.8836548  1.3824411
##   1        9      1.641724  0.8942825  1.3077422
##   1       10      1.645349  0.8953105  1.2842713
##   1       11      1.636401  0.8949921  1.2853430
##   1       12      1.647497  0.8942315  1.2810812
##   1       13      1.659909  0.8919944  1.2928655
##   1       14      1.662979  0.8916411  1.2942356
##   1       15      1.662979  0.8916411  1.2942356
##   1       16      1.662979  0.8916411  1.2942356
##   1       17      1.662979  0.8916411  1.2942356
##   1       18      1.662979  0.8916411  1.2942356
##   1       19      1.662979  0.8916411  1.2942356
##   1       20      1.662979  0.8916411  1.2942356
##   1       21      1.662979  0.8916411  1.2942356
##   1       22      1.662979  0.8916411  1.2942356
##   1       23      1.662979  0.8916411  1.2942356
##   1       24      1.662979  0.8916411  1.2942356
##   1       25      1.662979  0.8916411  1.2942356
##   1       26      1.662979  0.8916411  1.2942356
##   1       27      1.662979  0.8916411  1.2942356
##   1       28      1.662979  0.8916411  1.2942356
##   1       29      1.662979  0.8916411  1.2942356
##   1       30      1.662979  0.8916411  1.2942356
##   1       31      1.662979  0.8916411  1.2942356
##   1       32      1.662979  0.8916411  1.2942356
##   1       33      1.662979  0.8916411  1.2942356
##   1       34      1.662979  0.8916411  1.2942356
##   1       35      1.662979  0.8916411  1.2942356
##   1       36      1.662979  0.8916411  1.2942356
##   1       37      1.662979  0.8916411  1.2942356
##   1       38      1.662979  0.8916411  1.2942356
##   2        2      4.423532  0.2309187  3.5890565
##   2        3      3.713198  0.4478548  3.0221726
##   2        4      2.605485  0.7344600  2.0966374
##   2        5      2.330502  0.7855198  1.8542756
##   2        6      2.262752  0.7959896  1.8248388
##   2        7      1.751654  0.8806273  1.3485399
##   2        8      1.701033  0.8879443  1.2910422
##   2        9      1.507255  0.9139185  1.1882540
##   2       10      1.365326  0.9269532  1.0994145
##   2       11      1.261851  0.9374852  1.0427289
##   2       12      1.286904  0.9369358  1.0228168
##   2       13      1.295120  0.9366703  1.0219969
##   2       14      1.254485  0.9400388  0.9973163
##   2       15      1.219310  0.9427229  0.9625907
##   2       16      1.203802  0.9447799  0.9488964
##   2       17      1.201141  0.9453652  0.9474940
##   2       18      1.201141  0.9453652  0.9474940
##   2       19      1.201141  0.9453652  0.9474940
##   2       20      1.201141  0.9453652  0.9474940
##   2       21      1.201141  0.9453652  0.9474940
##   2       22      1.201141  0.9453652  0.9474940
##   2       23      1.201141  0.9453652  0.9474940
##   2       24      1.201141  0.9453652  0.9474940
##   2       25      1.201141  0.9453652  0.9474940
##   2       26      1.201141  0.9453652  0.9474940
##   2       27      1.201141  0.9453652  0.9474940
##   2       28      1.201141  0.9453652  0.9474940
##   2       29      1.201141  0.9453652  0.9474940
##   2       30      1.201141  0.9453652  0.9474940
##   2       31      1.201141  0.9453652  0.9474940
##   2       32      1.201141  0.9453652  0.9474940
##   2       33      1.201141  0.9453652  0.9474940
##   2       34      1.201141  0.9453652  0.9474940
##   2       35      1.201141  0.9453652  0.9474940
##   2       36      1.201141  0.9453652  0.9474940
##   2       37      1.201141  0.9453652  0.9474940
##   2       38      1.201141  0.9453652  0.9474940
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 17 and degree = 2.

The model explains about 93% of the variance of the test data. This model performed better than all previous models.

marsPred <- predict(marsModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = marsPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 1.2793868 0.9343367 1.0091132

The following plot compares the predicted values and residuals. An ideal plot would show points randomly scattered around the horizontal line at Residuals = 0. The vertical spread of the residuals around y = 0 is much more uniform than what we saw in the KNN models; MARS has successfully captured the nonlinear relationship that KNN couldn’t.

plot(marsPred, testData$y - marsPred,
     xlab = "Predicted", ylab = "Residuals")
abline(h = 0, col = "red")

plot(testData$y, marsPred,
     xlab = "Actual", ylab = "Predicted")
abline(0, 1, col = "red")

varImp(marsModel)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.25
## X2   48.76
## X5   15.54
## X3    0.00

Support Vector Machines (SVM)

set.seed(6)
#refer to text p. 166
svmModel <- train(x = trainingData$x,
                  y = trainingData$y,
                    method = "svmRadial",
                    preProc = c("center", "scale"),
                    tuneLength = 14,
                    trControl = trainControl(method = "cv"))
svmModel
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.491394  0.7989985  2.003983
##      0.50  2.262016  0.8120133  1.812875
##      1.00  2.110200  0.8331361  1.664682
##      2.00  1.991951  0.8522979  1.554581
##      4.00  1.894129  0.8670136  1.462134
##      8.00  1.882129  0.8683749  1.473288
##     16.00  1.875460  0.8699510  1.480442
##     32.00  1.877943  0.8695272  1.484317
##     64.00  1.877943  0.8695272  1.484317
##    128.00  1.877943  0.8695272  1.484317
##    256.00  1.877943  0.8695272  1.484317
##    512.00  1.877943  0.8695272  1.484317
##   1024.00  1.877943  0.8695272  1.484317
##   2048.00  1.877943  0.8695272  1.484317
## 
## Tuning parameter 'sigma' was held constant at a value of 0.05635751
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.05635751 and C = 16.

The model explains about 83% of the variance of the test data. It performed better than the KNN models, but the MARS model performed best overall.

svmPred <- predict(svmModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = svmPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.0536835 0.8289696 1.5615277

The following plot compares the predicted values and residuals. An ideal plot would show points randomly scattered around the horizontal line at Residuals = 0.

plot(svmPred, testData$y - svmPred,
     xlab = "Predicted", ylab = "Residuals")
abline(h = 0, col = "red")

plot(testData$y, svmPred,
     xlab = "Actual", ylab = "Predicted")
abline(0, 1, col = "red")

varImp(svmModel)
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

Best Model

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

MARS outperformed all the other models, with the highest \(R^2\) and the lowest RMSE. It explained about 93% of the variance of the test data, and its predictions were the closest to the true values. The table below sorts all six models by test-set \(R^2\).

Model                    \(R^2\)      RMSE
MARS                     0.9343367    1.2793868
Average Neural Network   0.8517172    1.9321277
SVM                      0.8289696    2.0536835
Simple Neural Network    0.7061463    2.8200095
Weighted KNN             0.7041268    3.0445793
KNN                      0.6819919    3.2040595
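
The table can also be assembled programmatically from the stored predictions (a sketch, assuming the *Pred objects created above are still in the workspace):

# Collect the test-set metrics for all six models into one table,
# sorted by R-squared (best first).
preds <- list(MARS = marsPred,
              `Average Neural Network` = avnnPred,
              SVM = svmPred,
              `Simple Neural Network` = nnPred,
              `Weighted KNN` = weightedknnPred,
              KNN = knnPred)
results <- t(sapply(preds, postResample, obs = testData$y))
results[order(-results[, "Rsquared"]), ]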

\(x_1\) is the most important predictor, followed by \(x_4\), \(x_2\), and \(x_5\). MARS assigned \(x_3\) zero importance and ignored \(x_6\) through \(x_{10}\) entirely, as these were the irrelevant predictors.

varImp(marsModel)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.25
## X2   48.76
## X5   15.54
## X3    0.00
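
The specific hinge terms retained by the pruned model can be inspected on the underlying earth object (output omitted here; summary() is a standard earth method):

# List the basis functions in the final MARS model; only terms built
# from the informative predictors X1-X5 should appear.
summary(marsModel$finalModel)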

Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Referring to text p. 54
trans <- preProcess(ChemicalManufacturingProcess[,-1], 
                     method = "knnImpute" )

transformed <- predict(trans, ChemicalManufacturingProcess[,-1])

sum(is.na(transformed))
## [1] 0
#library(corrplot)
#cor(transformed,ChemicalManufacturingProcess$Yield )
chem_filtered <- transformed[, -nearZeroVar(transformed)]

set.seed(6)
trainingRows <- createDataPartition(ChemicalManufacturingProcess$Yield,
                                    p = .70,
                                    list= FALSE)

TrainX <- chem_filtered[trainingRows, ]
TrainY <- ChemicalManufacturingProcess$Yield[trainingRows]

TestX <- chem_filtered[-trainingRows, ]
TestY <- ChemicalManufacturingProcess$Yield[-trainingRows]


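# Drop predictors with pairwise correlations above 0.80; the filtered
# copies are kept for collinearity-sensitive models such as neural nets.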
tooHigh <- findCorrelation(cor(TrainX), cutoff = .8)
trainXnnet <- TrainX[, -tooHigh]
testXnnet <- TestX[, -tooHigh]

part a

Which nonlinear regression model gives the optimal resampling and test set performance?

Weighted KNN

# Provided by the text
set.seed(111)
weightedknnModel2 <- train(x = TrainX,
                  y = TrainY,
                  method = "kknn",
                  preProc = c("center", "scale", "pca"),
                  tuneGrid = expand.grid(kmax = seq(5,25,5), 
                                         distance = 2,
                                         kernel = c("rectangular", "triangular","epanechnikov")))
weightedknnModel2
## k-Nearest Neighbors 
## 
## 124 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56), principal component
##  signal extraction (56) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   kmax  kernel        RMSE      Rsquared   MAE     
##    5    rectangular   1.699732  0.2860582  1.306812
##    5    triangular    1.594407  0.3072062  1.230722
##    5    epanechnikov  1.618735  0.2967283  1.242189
##   10    rectangular   1.699732  0.2860582  1.306812
##   10    triangular    1.553165  0.3255548  1.214788
##   10    epanechnikov  1.599011  0.3028114  1.237059
##   15    rectangular   1.699732  0.2860582  1.306812
##   15    triangular    1.549762  0.3273930  1.213288
##   15    epanechnikov  1.599011  0.3028114  1.237059
##   20    rectangular   1.699732  0.2860582  1.306812
##   20    triangular    1.549762  0.3273930  1.213288
##   20    epanechnikov  1.599011  0.3028114  1.237059
##   25    rectangular   1.699732  0.2860582  1.306812
##   25    triangular    1.549762  0.3273930  1.213288
##   25    epanechnikov  1.599011  0.3028114  1.237059
## 
## Tuning parameter 'distance' was held constant at a value of 2
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were kmax = 25, distance = 2 and kernel
##  = triangular.

The model explains about 47% of the variance of the test data. Let’s run a diagnostic of the model.

weightedknnPred2 <- predict(weightedknnModel2, newdata = TestX)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = weightedknnPred2, obs = TestY)
##      RMSE  Rsquared       MAE 
## 1.4676074 0.4733189 1.1051550
plot(weightedknnPred2, TestY - weightedknnPred2,
     xlab = "Predicted", ylab = "Residuals")
abline(h = 0, col = "red")

plot(TestY, weightedknnPred2,
     xlab = "Actual", ylab = "Predicted")
abline(0, 1, col = "red")

Average Neural Network Ensemble

set.seed(112)

# Refer to text p.162 following their suggested basic model

avnnGrid <- expand.grid(size = c(1:5),
                        decay = c(0.001,0.01),
                        bag = FALSE)

avnnModel2 <- train(x = TrainX,
                  y = TrainY,
                  method = "avNNet",
                  preProc = c("center", "scale"),
                  tuneGrid = avnnGrid,
                 linout = TRUE,
                 trace = FALSE,
                 maxit = 500, 
                 trControl = trainControl(method = "cv", number = 5))
avnnModel2
## Model Averaged Neural Network 
## 
## 124 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 100, 98, 100, 99, 99 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared   MAE      
##   1     0.001  1.783858  0.4591729  1.2755065
##   1     0.010  1.353396  0.5024565  1.0378731
##   2     0.001  1.215155  0.5631878  0.9658776
##   2     0.010  1.879248  0.3600162  1.4395281
##   3     0.001  1.243871  0.5653154  0.9813878
##   3     0.010  1.740887  0.4017771  1.3025143
##   4     0.001  1.197696  0.5803829  0.9345391
##   4     0.010  1.898971  0.3617917  1.3753743
##   5     0.001  1.385897  0.5003783  1.1059915
##   5     0.010  1.947285  0.2961363  1.4381434
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.001 and bag
##  = FALSE.

The model explains about 61% of the variance of the test data, an improvement over the weighted KNN model. Let’s run a diagnostic of the model.

avnnPred2 <- predict(avnnModel2, newdata = TestX)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = avnnPred2, obs = TestY)
##      RMSE  Rsquared       MAE 
## 1.3390678 0.6091029 0.9959886
plot(avnnPred2, TestY - avnnPred2,
     xlab = "Predicted", ylab = "Residuals")
abline(h = 0, col = "red")

plot(TestY, avnnPred2,
     xlab = "Actual", ylab = "Predicted")
abline(0, 1, col = "red")

Support Vector Machines (SVM)

set.seed(113)
#refer to text p. 166
svmModel2 <- train(x = TrainX,
                  y = TrainY,
                    method = "svmRadial",
                    preProc = c("center", "scale", "pca"),
                    tuneLength = 14,
                    trControl = trainControl(method = "cv"))
svmModel2
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 124 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56), principal component
##  signal extraction (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 110, 112, 112, 112, 112, 112, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE      
##      0.25  1.450085  0.5568588  1.1997266
##      0.50  1.332047  0.5647384  1.0980277
##      1.00  1.238790  0.6006898  1.0179477
##      2.00  1.200795  0.6201847  0.9773347
##      4.00  1.205578  0.6058129  0.9752402
##      8.00  1.222136  0.5942627  0.9882264
##     16.00  1.221243  0.5954775  0.9870167
##     32.00  1.221243  0.5954775  0.9870167
##     64.00  1.221243  0.5954775  0.9870167
##    128.00  1.221243  0.5954775  0.9870167
##    256.00  1.221243  0.5954775  0.9870167
##    512.00  1.221243  0.5954775  0.9870167
##   1024.00  1.221243  0.5954775  0.9870167
##   2048.00  1.221243  0.5954775  0.9870167
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02748059
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02748059 and C = 2.

The model explains about 62% of the variance of the test data, an improvement over the weighted KNN model, though still not ideal. Let’s run a diagnostic of the model.

svmPred2 <- predict(svmModel2, newdata = TestX)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = svmPred2, obs = TestY)
##      RMSE  Rsquared       MAE 
## 1.2695080 0.6191610 0.9728057
plot(svmPred2, TestY - svmPred2,
     xlab = "Predicted", ylab = "Residuals")
abline(h = 0, col = "red")

plot(TestY, svmPred2,
     xlab = "Actual", ylab = "Predicted")
abline(0, 1, col = "red")

Multivariate Adaptive Regression Splines (MARS)

set.seed(114)
#refer to text p. 165
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

marsModel2 <- train(x = TrainX,
                  y = TrainY,
                  method = "earth",
                  tuneGrid = marsGrid,
                  preProc = c("center", "scale"),
                  trControl = trainControl(method = "cv"))
marsModel2
## Multivariate Adaptive Regression Spline 
## 
## 124 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 112, 111, 112, 112, 112, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.420245  0.4161646  1.1223649
##   1        3      1.192370  0.6063616  0.9763954
##   1        4      1.171674  0.6163737  0.9433324
##   1        5      1.169215  0.6034139  0.9388695
##   1        6      1.223304  0.5669649  0.9703588
##   1        7      1.185700  0.6027056  0.9510157
##   1        8      1.180833  0.5932560  0.9827166
##   1        9      1.180792  0.5976978  0.9827534
##   1       10      1.185384  0.6048656  0.9685761
##   1       11      1.184590  0.6095915  0.9674617
##   1       12      1.373554  0.5963351  1.0829173
##   1       13      1.395727  0.5843234  1.0984551
##   1       14      1.404500  0.5844436  1.1004205
##   1       15      1.395147  0.5864584  1.0949697
##   1       16      1.405719  0.5837766  1.1066869
##   1       17      1.392827  0.5902839  1.0942656
##   1       18      1.392827  0.5902839  1.0942656
##   1       19      1.392827  0.5902839  1.0942656
##   1       20      1.415115  0.5837947  1.1080326
##   1       21      1.421045  0.5836208  1.1194441
##   1       22      1.421045  0.5836208  1.1194441
##   1       23      1.421045  0.5836208  1.1194441
##   1       24      1.421045  0.5836208  1.1194441
##   1       25      1.421045  0.5836208  1.1194441
##   1       26      1.421045  0.5836208  1.1194441
##   1       27      1.421045  0.5836208  1.1194441
##   1       28      1.421045  0.5836208  1.1194441
##   1       29      1.421045  0.5836208  1.1194441
##   1       30      1.421045  0.5836208  1.1194441
##   1       31      1.421045  0.5836208  1.1194441
##   1       32      1.421045  0.5836208  1.1194441
##   1       33      1.421045  0.5836208  1.1194441
##   1       34      1.421045  0.5836208  1.1194441
##   1       35      1.421045  0.5836208  1.1194441
##   1       36      1.421045  0.5836208  1.1194441
##   1       37      1.421045  0.5836208  1.1194441
##   1       38      1.421045  0.5836208  1.1194441
##   2        2      1.420245  0.4161646  1.1223649
##   2        3      1.184745  0.6076965  0.9602160
##   2        4      1.209515  0.5854936  0.9681314
##   2        5      1.254889  0.5406905  0.9757740
##   2        6      1.183539  0.5938054  0.9470884
##   2        7      1.156842  0.6058061  0.9096095
##   2        8      1.159008  0.6048214  0.9155860
##   2        9      1.158606  0.6017263  0.9002778
##   2       10      1.212862  0.5807610  0.9210734
##   2       11      1.398343  0.5475084  0.9810743
##   2       12      1.489289  0.5383656  1.0335915
##   2       13      1.510972  0.5529365  1.0257113
##   2       14      1.533395  0.5414225  1.0553174
##   2       15      1.532111  0.5254442  1.0506624
##   2       16      1.546251  0.5066151  1.0598119
##   2       17      1.516852  0.5173416  1.0311456
##   2       18      1.578750  0.5075698  1.0671888
##   2       19      1.574408  0.5085381  1.0617242
##   2       20      1.571305  0.5131123  1.0582440
##   2       21      1.583670  0.5129312  1.0444304
##   2       22      1.576575  0.5135234  1.0351308
##   2       23      1.585898  0.5104487  1.0460993
##   2       24      1.585898  0.5104487  1.0460993
##   2       25      1.585898  0.5104487  1.0460993
##   2       26      1.585898  0.5104487  1.0460993
##   2       27      1.585898  0.5104487  1.0460993
##   2       28      1.582272  0.5062776  1.0514079
##   2       29      1.594988  0.5027155  1.0586435
##   2       30      1.594988  0.5027155  1.0586435
##   2       31      1.594988  0.5027155  1.0586435
##   2       32      1.594988  0.5027155  1.0586435
##   2       33      1.594988  0.5027155  1.0586435
##   2       34      1.594988  0.5027155  1.0586435
##   2       35      1.594988  0.5027155  1.0586435
##   2       36      1.594988  0.5027155  1.0586435
##   2       37      1.594988  0.5027155  1.0586435
##   2       38      1.594988  0.5027155  1.0586435
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 7 and degree = 2.

The model explains about 62% of the variance of the test data, an improvement over the weighted KNN model. Let’s run a diagnostic of the model.

marsPred2 <- predict(marsModel2, newdata = TestX)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = marsPred2, obs = TestY)
##      RMSE  Rsquared       MAE 
## 1.2278715 0.6159963 0.9913622
plot(marsPred2, TestY - marsPred2,
     xlab = "Predicted", ylab = "Residuals")
abline(h = 0, col = "red")

plot(TestY, marsPred2,
     xlab = "Actual", ylab = "Predicted")
abline(0, 1, col = "red")

part b

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Model                    \(R^2\)      RMSE
SVM                      0.6191610    1.2695080
MARS                     0.6159963    1.2278715
Average Neural Network   0.6091029    1.3390678
Weighted KNN             0.4733189    1.4676074

The SVM model outperformed all the other models on test-set \(R^2\), although MARS achieved a slightly lower test RMSE. Even so, it is not an ideal model: it explains only about 62% of the variance of the test data.

Manufacturing process variables slightly dominate the top 10 predictors (six process versus four biological), though both types are present. ManufacturingProcess32 is consistently the top predictor in both the optimal nonlinear and the optimal linear model.

varImp(svmModel2)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     89.27
## BiologicalMaterial12     82.55
## BiologicalMaterial03     79.75
## ManufacturingProcess13   75.28
## BiologicalMaterial02     71.85
## ManufacturingProcess09   70.68
## ManufacturingProcess31   62.16
## ManufacturingProcess06   59.72
## ManufacturingProcess17   53.63
## BiologicalMaterial11     53.11
## ManufacturingProcess29   52.53
## ManufacturingProcess36   52.11
## ManufacturingProcess02   48.59
## BiologicalMaterial08     47.31
## ManufacturingProcess11   46.62
## ManufacturingProcess30   46.31
## BiologicalMaterial04     45.70
## BiologicalMaterial01     43.72
## ManufacturingProcess33   36.14
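
To compare directly with Exercise 6.3, one could line up the two top-10 lists (a sketch; linearModel2 is a placeholder name for whatever object holds the optimal linear model from 6.3):

# Hypothetical side-by-side of the top 10 predictors from the nonlinear
# SVM model and the optimal linear model from Exercise 6.3
# ('linearModel2' is a placeholder, not an object defined above).
top10 <- function(model) {
  imp <- varImp(model)$importance
  head(rownames(imp)[order(-imp$Overall)], 10)
}
data.frame(Nonlinear = top10(svmModel2), Linear = top10(linearModel2))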

part c

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Biological predictors tend to have a positive relationship with yield, while manufacturing process predictors can have either a positive or a negative relationship with yield.

The biological predictors are highly correlated with one another (multicollinearity), which could explain why our linear model performed poorly. Nonlinear models are better able to handle these correlated, interacting predictors.

library(dplyr)
library(corrplot)
top_predictors <- varImp(svmModel2)$importance %>%
  mutate(Predictor = rownames(.)) %>%
   arrange(desc(Overall)) %>%
  head(10)

top_predictor_names <- top_predictors$Predictor

cor(TrainX[, top_predictor_names], TrainY)
##                              [,1]
## ManufacturingProcess32  0.5746692
## BiologicalMaterial06    0.5539187
## BiologicalMaterial12    0.4374056
## BiologicalMaterial03    0.4986530
## ManufacturingProcess13 -0.4835901
## BiologicalMaterial02    0.5261501
## ManufacturingProcess09  0.5235815
## ManufacturingProcess31 -0.3753337
## ManufacturingProcess06  0.4211420
## ManufacturingProcess17 -0.3454498
corrplot(cor(TrainX[, top_predictor_names]))

featurePlot(TrainX[, top_predictor_names], TrainY)