Exercise 7.2

In this exercise we work with a simulated dataset from the mlbench library. We use the mlbench.friedman1 function to simulate data for ten predictor variables \(X_1\) through \(X_{10}\) and one response variable \(Y\) from the following nonlinear equation:

\[Y = 10 \sin (\pi X_1 X_2) + 20 (X_3 -0.5)^2 + 10 X_4 + 5 X_5 + \epsilon\]

where the \(X_j\) are independent random variables uniformly distributed on [0, 1] and \(\epsilon \sim N(0, \sigma^2)\) is a normally distributed error term. Note that only the first five \(X_j\) enter the equation for the response \(Y\); the remaining \(X_j\) are non-informative noise variables.
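
To make the data-generating process concrete, here is a minimal sketch that simulates the same structure directly; mlbench.friedman1 (used below) does essentially this, and the helper name sim_friedman1 is ours:

# minimal sketch of the friedman1 data-generating process
sim_friedman1 <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10)   # X1..X10 ~ Uniform(0, 1)
  y <- 10 * sin(pi * x[, 1] * x[, 2]) +   # informative: X1, X2
       20 * (x[, 3] - 0.5)^2 +            # informative: X3
       10 * x[, 4] + 5 * x[, 5] +         # informative: X4, X5
       rnorm(n, sd = sd)                  # epsilon ~ N(0, sd^2)
  list(x = x, y = y)                      # X6..X10 do not enter y
}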

We will generate a training set and a test set by simulation, and then fit several models on the training set, tuning each model's parameters. Finally we evaluate performance of the fitted models on the test set.

(a) Data simulation

First we generate 200 instances that will comprise the training dataset. This data includes:

  • x: 10 predictor variables, which we label X1 through X10
  • y: the response variable.

We can confirm from the summary statistics, the feature plot, and the pairs plot of the data that:

  • Each \(X_j\) is uniformly distributed on [0, 1]
  • The \(X_j\) appear to be independent, with pairwise correlations approximately zero
  • The response \(Y\) appears almost normally distributed, centered at 14.5 and ranging from 3.6 to 28.4
  • Some of the \(X_j\) exhibit apparent relationships with \(Y\) (e.g., X1, X4, and X5 appear to have positive relationships with Y).

For model testing and evaluation, we also generate a test set of 5000 simulated instances.

library(mlbench)  
library(caret)    # featurePlot, train, varImp, postResample
library(GGally)   # ggpairs
library(ggplot2)  # ggplot
library(plotmo)   # plotmo
library(dplyr)    # %>% and data manipulation
library(knitr)    # kable
set.seed(200)  
# simulate training set of 200 instances
trainingData <- mlbench.friedman1(200, sd = 1)  
# convert x matrix to dataframe
trainingData$x <- data.frame(trainingData$x)  
summary(data.frame(Y = trainingData$y, trainingData$x))
##        Y                X1                 X2           
##  Min.   : 3.556   Min.   :0.002806   Min.   :0.0004409  
##  1st Qu.:10.756   1st Qu.:0.217222   1st Qu.:0.2795877  
##  Median :14.556   Median :0.513935   Median :0.5106664  
##  Mean   :14.416   Mean   :0.481461   Mean   :0.5126936  
##  3rd Qu.:17.970   3rd Qu.:0.680814   3rd Qu.:0.7326078  
##  Max.   :28.382   Max.   :0.998992   Max.   :0.9840194  
##        X3                 X4                X5          
##  Min.   :0.003296   Min.   :0.00819   Min.   :0.001756  
##  1st Qu.:0.284872   1st Qu.:0.23178   1st Qu.:0.285408  
##  Median :0.537307   Median :0.44458   Median :0.534330  
##  Mean   :0.508571   Mean   :0.46939   Mean   :0.517948  
##  3rd Qu.:0.746155   3rd Qu.:0.68454   3rd Qu.:0.748180  
##  Max.   :0.999923   Max.   :0.99929   Max.   :0.991577  
##        X6                 X7                  X8          
##  Min.   :0.002901   Min.   :0.0003388   Min.   :0.004698  
##  1st Qu.:0.276340   1st Qu.:0.2092393   1st Qu.:0.289413  
##  Median :0.497598   Median :0.4688035   Median :0.497961  
##  Mean   :0.496841   Mean   :0.4636166   Mean   :0.512865  
##  3rd Qu.:0.749615   3rd Qu.:0.6752972   3rd Qu.:0.727624  
##  Max.   :0.998987   Max.   :0.9943478   Max.   :0.999561  
##        X9              X10          
##  Min.   :0.0176   Min.   :0.003172  
##  1st Qu.:0.2304   1st Qu.:0.317826  
##  Median :0.5289   Median :0.535922  
##  Mean   :0.5016   Mean   :0.542409  
##  3rd Qu.:0.7218   3rd Qu.:0.789736  
##  Max.   :0.9910   Max.   :0.999992
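# quick check of the near-zero pairwise correlations noted above
max(abs(cor(trainingData$x)[upper.tri(diag(10))]))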
# feature plot of y vs x_j  
featurePlot(trainingData$x, trainingData$y, 
            labels = c("Predictor", "Response"), 
            main = "Feature plot of friedman1 training data")  

# pairs plot
ggpairs(data.frame(Y = trainingData$y, trainingData$x)) + 
  labs(title = "Pairs plot of friedman1 training data")

# compare the density of Y (black) to a normal density with matching mean and sd (red)
xx <- seq(0, 30, length.out = 100)
yy <- dnorm(xx, mean = mean(trainingData$y), sd = sd(trainingData$y))
ggplot(data.frame(y = trainingData$y)) + 
  geom_density(aes(x = y)) + 
  geom_line(data = data.frame(xx, yy), aes(x = xx, y = yy), col = "red") + 
  labs(title = "Distribution of Y", 
       x = "Y", y = "Density")

#plot(density(trainingData$y, bw = "SJ"), ylim = c(0, 0.085), main = "Distribution of Y") 
#lines(xx, yy, col = "red")

# simulate test set of 5000 instances  
testData <- mlbench.friedman1(5000, sd = 1)  
testData$x <- data.frame(testData$x) 
#summary(data.frame(Y = testData$y, testData$x))

(b) Modeling

Next we fit several models to the training data. We use the caret library for model fitting with the following algorithms and tuning parameters:

  • Model-averaged neural network (NNet): “avNNet” method (caret’s wrapper that averages several nnet fits); tune the size (number of hidden units) and decay (weight decay) parameters, with bag held at FALSE
  • Multivariate adaptive regression spline (MARS): “earth” method from the earth library; tune the nprune (number of terms) and degree (product degree) parameters
  • Support vector machine with radial basis function kernel (SVM): “svmRadial” method from the kernlab library; tune the C (cost) parameter, with sigma held at its default analytic estimate
  • K-nearest neighbors (KNN): “knn” method (caret’s knnreg for regression); tune the k parameter.

We specify a common pre-processing and tuning methodology for all models within the train function:

  • Pre-processing: center and scale all predictors
  • Resampling method: 10-fold cross-validation
  • Tuning method: tuning parameters are chosen to minimize RMSE on the holdout samples.

For each model, we show below the resampling output, the tuning profile, the final model summary, and the variable importance plot.

# center & scale predictors
prep <- c("center", "scale")
# 10-fold cross validation
ctrl <- trainControl(method = "cv", 
                     number = 10)

##############################################################################
# fit nnet model; tune decay & size
##############################################################################
# fit & tune model
set.seed(227)
nnetGrid <- expand.grid(decay = c(0, 0.01, 0.1), size = 1:10, bag = FALSE)
nnetModel <- train(x = trainingData$x,  
                   y = trainingData$y, 
                   method = "avNNet", linout = TRUE, trace = FALSE, MaxNWts = 100, 
                   preProcess = prep, 
                   #tuneLength = 15, 
                   tuneGrid = nnetGrid,
                   trControl = ctrl)
nnetModel
## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.453486  0.7628392  1.939061
##   0.00    2    2.466229  0.7553506  1.935683
##   0.00    3    2.120237  0.8222812  1.646684
##   0.00    4    2.145174  0.8066845  1.688108
##   0.00    5    2.156400  0.8235444  1.694179
##   0.00    6    2.229606  0.8058147  1.747128
##   0.00    7    2.446361  0.7680410  1.970646
##   0.00    8    2.479591  0.7672991  1.949891
##   0.00    9         NaN        NaN       NaN
##   0.00   10         NaN        NaN       NaN
##   0.01    1    2.373626  0.7700550  1.846585
##   0.01    2    2.471793  0.7520689  1.918923
##   0.01    3    2.153588  0.8219606  1.665498
##   0.01    4    2.051319  0.8400975  1.597216
##   0.01    5    2.064509  0.8287173  1.576470
##   0.01    6    2.231345  0.8074821  1.747262
##   0.01    7    2.284785  0.7981623  1.768654
##   0.01    8    2.305885  0.7848879  1.853840
##   0.01    9         NaN        NaN       NaN
##   0.01   10         NaN        NaN       NaN
##   0.10    1    2.389372  0.7671926  1.856095
##   0.10    2    2.461489  0.7564560  1.923188
##   0.10    3    2.205762  0.8155714  1.699392
##   0.10    4    2.056510  0.8347077  1.604546
##   0.10    5    2.122134  0.8246035  1.678843
##   0.10    6    2.195238  0.8112132  1.759069
##   0.10    7    2.206892  0.8087481  1.762441
##   0.10    8    2.285393  0.8020192  1.828814
##   0.10    9         NaN        NaN       NaN
##   0.10   10         NaN        NaN       NaN
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.01 and bag
##  = FALSE.
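# note: the NaN rows for sizes 9-10 are expected -- a 10-size-1 network needs
# (10 + 1)*size + (size + 1) = 12*size + 1 weights, which exceeds the
# MaxNWts = 100 cap for those sizes, so those fits fail and caret records NaN
12 * (1:10) + 1   # weights required by size; sizes 9 and 10 need 109 and 121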
ggplot(nnetModel, highlight = TRUE) + 
  labs(title = paste0("Tuning profile: ", nnetModel$modelInfo$label))

# final model & variable importance
nnetModel$finalModel
## Model Averaged Neural Network with 5 Repeats  
## 
## a 10-4-1 network with 49 weights
## options were - linear output units  decay=0.01
ggplot(varImp(nnetModel)) + 
  labs(title = paste0("Variable importance: ", nnetModel$modelInfo$label))

# test set predictions
nnetPred <- predict(nnetModel, newdata = testData$x)  
nnetPerf <- postResample(pred = nnetPred, obs = testData$y) 
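# postResample returns RMSE, Rsquared, and MAE; a sketch of the same metrics
# computed by hand (caret's Rsquared is the squared obs-pred correlation):
err <- testData$y - nnetPred
c(RMSE = sqrt(mean(err^2)), Rsquared = cor(testData$y, nnetPred)^2, MAE = mean(abs(err)))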

##############################################################################
# fit mars model; tune degree & nprune
##############################################################################
# fit & tune model
set.seed(227)
marsGrid <- expand.grid(degree = 1:2, nprune = 2:20)
marsModel <- train(x = trainingData$x,  
                   y = trainingData$y, 
                   method = "earth", 
                   preProcess = prep, 
                   #tuneLength = 15, 
                   tuneGrid = marsGrid,
                   trControl = ctrl)
marsModel
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      4.230804  0.3061204  3.4769718
##   1        3      3.446241  0.5179791  2.7747003
##   1        4      2.654600  0.7159102  2.1247908
##   1        5      2.338892  0.7748074  1.9259473
##   1        6      2.285647  0.7858393  1.8165413
##   1        7      1.854760  0.8633398  1.4529610
##   1        8      1.781946  0.8686876  1.3887742
##   1        9      1.722643  0.8780344  1.3564789
##   1       10      1.706516  0.8800101  1.3488078
##   1       11      1.666552  0.8842921  1.2863180
##   1       12      1.580302  0.8971240  1.2317369
##   1       13      1.584415  0.8966347  1.2271373
##   1       14      1.590337  0.8963451  1.2271217
##   1       15      1.590955  0.8963736  1.2291679
##   1       16      1.590955  0.8963736  1.2291679
##   1       17      1.590955  0.8963736  1.2291679
##   1       18      1.590955  0.8963736  1.2291679
##   1       19      1.590955  0.8963736  1.2291679
##   1       20      1.590955  0.8963736  1.2291679
##   2        2      4.283258  0.2959497  3.5188891
##   2        3      3.477719  0.5126430  2.8105998
##   2        4      2.703525  0.7084096  2.1650536
##   2        5      2.393345  0.7675030  1.9738942
##   2        6      2.390236  0.7677042  1.8810199
##   2        7      1.898494  0.8571073  1.5042126
##   2        8      1.716441  0.8837614  1.3492161
##   2        9      1.532824  0.9060610  1.2407400
##   2       10      1.463240  0.9142113  1.1711952
##   2       11      1.354253  0.9248471  1.0794948
##   2       12      1.251030  0.9314224  0.9915422
##   2       13      1.235344  0.9351578  0.9836994
##   2       14      1.218751  0.9366662  0.9496131
##   2       15      1.228174  0.9363321  0.9636462
##   2       16      1.206815  0.9378182  0.9571746
##   2       17      1.206815  0.9378182  0.9571746
##   2       18      1.206815  0.9378182  0.9571746
##   2       19      1.206815  0.9378182  0.9571746
##   2       20      1.206815  0.9378182  0.9571746
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 16 and degree = 2.
ggplot(marsModel, highlight = TRUE) + 
  labs(title = paste0("Tuning profile: ", marsModel$modelInfo$label))

# final model & variable importance
marsModel$finalModel
## Selected 16 of 21 terms, and 5 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, ...
## Number of terms at each degree of interaction: 1 8 7
## GCV 1.642635    RSS 214.2181    GRSq 0.9333083    RSq 0.9560751
summary(marsModel$finalModel)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE,
##             degree=2, nprune=16)
## 
##                                   coefficients
## (Intercept)                          22.042898
## h(0.507267-X1)                       -4.261526
## h(X1-0.507267)                        2.661009
## h(0.325504-X2)                       -5.256624
## h(-0.216741-X3)                       2.909153
## h(X3- -0.216741)                      1.657839
## h(0.953812-X4)                       -2.810030
## h(X4-0.953812)                        2.833141
## h(1.17878-X5)                        -1.503531
## h(-0.951872-X1) * h(X2-0.325504)     -3.617965
## h(X1- -0.951872) * h(X2-0.325504)    -1.505955
## h(X1-0.507267) * h(X2- -0.798188)    -2.198535
## h(0.606835-X1) * h(0.325504-X2)       1.980041
## h(0.325504-X2) * h(X3-0.795427)       2.143490
## h(X2-0.325504) * h(X3- -0.917499)     1.340520
## h(X2-0.325504) * h(-0.917499-X3)      4.483181
## 
## Selected 16 of 21 terms, and 5 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, ...
## Number of terms at each degree of interaction: 1 8 7
## GCV 1.642635    RSS 214.2181    GRSq 0.9333083    RSq 0.9560751
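# the h() terms in the summary above are hinge functions, h(u) = max(0, u),
# evaluated on the centered-and-scaled predictors; each basis function is zero
# on one side of its knot, e.g. h(0.507267 - X1) is active only where scaled
# X1 < 0.507267
h <- function(u) pmax(0, u)
h(0.507267 - c(-1, 0, 1))   # 1.507267 0.507267 0.000000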
ggplot(varImp(marsModel)) + 
  labs(title = paste0("Variable importance: ", marsModel$modelInfo$label))

# plot response vs each predictor
plotmo(marsModel$finalModel, pt.col = 2, 
       caption = paste0("Response relationships with the predictors: ", 
                        marsModel$modelInfo$label))
##  plotmo grid:    X1           X2        X3          X4         X5
##           0.1174442 -0.007467267 0.1018901 -0.08911631 0.05813967
##           X6         X7          X8         X9         X10
##  0.002665016 0.01819572 -0.05372308 0.09088895 -0.02272411

# test set predictions
marsPred <- predict(marsModel, newdata = testData$x)  
marsPerf <- postResample(pred = marsPred, obs = testData$y) 

##############################################################################
# fit svm model; tune C (use automatic estimate of sigma)
##############################################################################
# fit & tune model
set.seed(227)
svmModel <- train(x = trainingData$x,  
                  y = trainingData$y, 
                  method = "svmRadial", 
                  preProcess = prep, 
                  tuneLength = 10, 
                  trControl = ctrl)
svmModel
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE     
##     0.25  2.502355  0.8014214  1.983965
##     0.50  2.253343  0.8151982  1.778912
##     1.00  2.066064  0.8394041  1.604967
##     2.00  1.967567  0.8493073  1.519869
##     4.00  1.890994  0.8558305  1.484066
##     8.00  1.878264  0.8564414  1.494233
##    16.00  1.878702  0.8565861  1.493896
##    32.00  1.878702  0.8565861  1.493896
##    64.00  1.878702  0.8565861  1.493896
##   128.00  1.878702  0.8565861  1.493896
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06682628
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06682628 and C = 8.
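# note: sigma is not tuned; caret estimates it once from the training
# predictors via kernlab::sigest (sketch below -- caret derives its single
# value from the returned range, so it need not match 0.0668 exactly)
set.seed(227)
kernlab::sigest(as.matrix(trainingData$x), scaled = TRUE)   # low/mid/high estimates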
ggplot(svmModel, highlight = TRUE) + 
  labs(title = paste0("Tuning profile: ", svmModel$modelInfo$label))

# final model & variable importance
svmModel$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 8 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0668262839433369 
## 
## Number of Support Vectors : 153 
## 
## Objective Function Value : -67.2636 
## Training error : 0.008899
ggplot(varImp(svmModel)) + 
  labs(title = paste0("Variable importance: ", svmModel$modelInfo$label))

# test set predictions
svmPred <- predict(svmModel, newdata = testData$x)  
svmPerf <- postResample(pred = svmPred, obs = testData$y) 

##############################################################################
# fit knn model; tune k 
##############################################################################
# fit & tune model
set.seed(227)
knnModel <- train(x = trainingData$x,  
                  y = trainingData$y,  
                  method = "knn", 
                  preProc = prep,  
                  tuneLength = 10, 
                  trControl = ctrl)  
knnModel 
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.245439  0.5892310  2.692806
##    7  3.115972  0.6349210  2.539613
##    9  3.095140  0.6618790  2.507735
##   11  3.067196  0.6844121  2.496442
##   13  3.091888  0.6882998  2.471658
##   15  3.096069  0.6989642  2.479912
##   17  3.116225  0.7075858  2.500198
##   19  3.147661  0.7102043  2.542344
##   21  3.161254  0.7144204  2.552672
##   23  3.186812  0.7136769  2.592388
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
ggplot(knnModel, highlight = TRUE) + 
  labs(title = paste0("Tuning profile: ", knnModel$modelInfo$label))

# final model & variable importance
knnModel$finalModel
## 11-nearest neighbor regression model
ggplot(varImp(knnModel)) + 
  labs(title = paste0("Variable importance: ", knnModel$modelInfo$label))

# test set predictions
knnPred <- predict(knnModel, newdata = testData$x)  
knnPerf <- postResample(pred = knnPred, obs = testData$y) 

(c) Model evaluation

It is evident from the variable importance plots above that all four models successfully identified X1 through X5 as the most important predictors. The MARS model is unique in that it assigns exactly zero importance to X6 through X10, since these variables were dropped from the model during the pruning procedure. It’s also interesting that the MARS model orders the predictors (by decreasing importance) as X1, X4, X2, X5, and X3, whereas the other three models order them as X4, X1, X2, X5, and X3 (swapping the first two predictors).

Finally we compare performance metrics of the four models on the test data. From the table below, we see that the MARS model has the strongest predictive performance, with the lowest RMSE and MAE and the highest R-squared. The NNet and SVM models are the runners-up, with comparable performance metrics, and the k-nearest neighbors model is the weakest overall. The strong performance of the MARS model likely stems in part from the pruning procedure, which dropped the five non-informative (noise) variables. In contrast, the noise variables remain in the other models, which degrades the model-fitting process and weakens predictive performance. In the KNN model, for instance, the informative predictors and the noise variables contribute equally to the (Euclidean) distance calculation, so similarity among instances is based in part on random noise.
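
As a quick illustration of that last point: with centered and scaled predictors, the five noise variables contribute roughly as much to the Euclidean distance between two instances as the five informative ones do, on average. A minimal sketch:

# distance between the first two training instances, split into the
# contributions of the informative (X1-X5) and noise (X6-X10) predictors
Xs <- scale(trainingData$x)
dist(Xs[1:2, 1:5])    # informative predictors
dist(Xs[1:2, 6:10])   # noise predictors: typically comparable in magnitude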

rbind(NNet = nnetPerf, MARS = marsPerf, SVM = svmPerf, KNN = knnPerf) %>% 
  data.frame() %>% 
  kable(digits = 3, 
         caption = "Comparison of predictive performance on test set")
Comparison of predictive performance on test set

        RMSE  Rsquared    MAE
NNet   2.035     0.835  1.527
MARS   1.279     0.934  1.009
SVM    2.071     0.826  1.573
KNN    3.122     0.669  2.496

Exercise 7.5

In this exercise we work with the ChemicalManufacturingProcess dataset using the caret library. We apply the same imputation, data splitting, and pre-processing steps that we used in Exercise 6.3 from Homework 7:

  • Imputation for missing values: bagged tree method using the preProcess function
  • Data splitting: 75/25 partition into training and test sets using the createDataPartition function
  • Pre-processing: remove near-zero variance predictors (only 1 variable) and center and scale all remaining predictors.

We then fit and evaluate four models using the following algorithms and tuning parameters:

  • Model-averaged neural network (NNet): “avNNet” method (caret’s wrapper that averages several nnet fits); tune the size (number of hidden units) and decay (weight decay) parameters, with bag held at FALSE
  • Multivariate adaptive regression spline (MARS): “earth” method from the earth library; tune the nprune (number of terms) and degree (product degree) parameters
  • Support vector machine with radial basis function kernel (SVM): “svmRadial” method from the kernlab library; tune the C (cost) parameter, with sigma held at its default analytic estimate
  • K-nearest neighbors (KNN): “knn” method (caret’s knnreg for regression); tune the k parameter.

We specify a common resampling and tuning methodology for all models within the train function:

  • Resampling method: 10-fold cross-validation
  • Tuning method: tuning parameters are chosen to minimize RMSE on the holdout samples.

For each model, we show below the resampling output, the tuning profile, the final model summary, and the variable importance plot.

### load data
data("ChemicalManufacturingProcess")
#str(ChemicalManufacturingProcess)
# for ease of manipulation, split response and predictors & relabel columns
# (B1-B12 = BiologicalMaterial01-12, M1-M45 = ManufacturingProcess01-45)
yield <- ChemicalManufacturingProcess[1]
cmp <- ChemicalManufacturingProcess[-1]
colnames(cmp) <- c(paste0("B", 1:12), paste0("M", 1:45)) 
#summary(yield)
#summary(cmp)
yield <- yield[[1]]     # numeric
cmp <- as.matrix(cmp)   # matrix

### impute missing values
set.seed(502)
impute <- preProcess(cmp, method = "bagImpute")
cmp <- predict(impute, cmp)
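# sanity check: imputation should leave no missing values
stopifnot(!anyNA(cmp))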
#summary(cmp)

### training/test split
set.seed(314)
train_idx <- createDataPartition(yield, p = 0.75, list = FALSE)
yield_train <- yield[train_idx]
yield_test <- yield[-train_idx]
cmp_train <- cmp[train_idx, ]
cmp_test <- cmp[-train_idx, ]
#summary(yield_train)
#summary(yield_test)

### pre-processing and resampling method
# note: adding "BoxCox" to the pre-processing list caused NaN/Inf errors, so it was dropped
#prep <- c("BoxCox", "center", "scale", "nzv")
prep <- c("center", "scale", "nzv")
# note: repeated CV (10-fold, 5 repeats) took too long, so plain 10-fold CV is used
#ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
ctrl <- trainControl(method = "cv", number = 10)

##############################################################################
# fit nnet model; tune decay & size
##############################################################################
# fit & tune model
set.seed(227)
nnetGrid <- expand.grid(decay = c(0, 0.01, 0.1), size = 1:10, bag = FALSE)
nnetModel <- train(x = cmp_train, 
                   y = yield_train, 
                   preProcess = prep, 
                   trControl = ctrl, 
                   method = "avNNet", 
                   linout = TRUE, trace = FALSE, #MaxNWts = 1000, maxit = 500, 
                   tuneGrid = nnetGrid)
nnetModel
## Model Averaged Neural Network 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (56), scaled (56), remove (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 120, 119, 118, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    1.682049  0.2982074  1.384177
##   0.00    2    1.584458  0.3800543  1.291777
##   0.00    3    1.506951  0.4142568  1.207772
##   0.00    4    1.747114  0.4457139  1.373870
##   0.00    5    1.899288  0.4058747  1.529646
##   0.00    6    2.055590  0.3389588  1.619027
##   0.00    7    2.040068  0.4411346  1.672809
##   0.00    8    2.924962  0.3353221  2.113708
##   0.00    9    3.175315  0.2872611  2.446941
##   0.00   10    4.391504  0.2308797  3.047893
##   0.01    1    1.376158  0.5150667  1.087506
##   0.01    2    1.291495  0.5823766  1.071629
##   0.01    3    1.484925  0.4789278  1.179208
##   0.01    4    1.909784  0.3532708  1.550551
##   0.01    5    1.762034  0.4477962  1.489518
##   0.01    6    1.833767  0.4727546  1.514892
##   0.01    7    2.262710  0.4589132  1.745205
##   0.01    8    2.274261  0.4634922  1.772300
##   0.01    9    3.135075  0.1970172  2.241917
##   0.01   10    4.648200  0.2469061  3.156940
##   0.10    1    1.388187  0.5200085  1.116170
##   0.10    2    1.290917  0.5660612  1.077061
##   0.10    3    1.373767  0.5181534  1.116043
##   0.10    4    1.644580  0.3946546  1.303468
##   0.10    5    1.584487  0.4428402  1.334289
##   0.10    6    1.723461  0.4462425  1.367406
##   0.10    7    1.877026  0.3707661  1.534372
##   0.10    8    2.304221  0.4301959  1.780566
##   0.10    9    2.891484  0.3774345  2.179779
##   0.10   10    4.408392  0.2092302  3.014256
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 2, decay = 0.1 and bag
##  = FALSE.
ggplot(nnetModel, highlight = TRUE) + 
  labs(title = paste0("Tuning profile: ", nnetModel$modelInfo$label))

# final model & variable importance
nnetModel$finalModel
## Model Averaged Neural Network with 5 Repeats  
## 
## a 56-2-1 network with 117 weights
## options were - linear output units  decay=0.1
ggplot(varImp(nnetModel), top = 20) + 
  labs(title = paste0("Variable importance: ", nnetModel$modelInfo$label))

# test set predictions
nnetPred <- predict(nnetModel, newdata = cmp_test)  
nnetPerf <- postResample(pred = nnetPred, obs = yield_test) 

##############################################################################
# fit mars model; tune degree & nprune
##############################################################################
# fit & tune model
set.seed(227)
marsGrid <- expand.grid(degree = 1:2, nprune = 2:15)
marsModel <- train(x = cmp_train, 
                   y = yield_train, 
                   preProcess = prep, 
                   trControl = ctrl, 
                   method = "earth", 
                   tuneGrid = marsGrid)
marsModel
## Multivariate Adaptive Regression Spline 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (56), scaled (56), remove (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 120, 119, 118, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.466767  0.4701481  1.1727359
##   1        3      1.231910  0.6376362  1.0160188
##   1        4      1.228827  0.6177270  1.0235419
##   1        5      1.240476  0.6040729  1.0215185
##   1        6      1.251077  0.5936310  1.0224391
##   1        7      1.169742  0.6396604  0.9557255
##   1        8      1.158192  0.6525749  0.9423312
##   1        9      1.168130  0.6469967  0.9460839
##   1       10      1.195275  0.6280185  0.9711703
##   1       11      1.156803  0.6524510  0.9454132
##   1       12      1.154094  0.6600484  0.9418102
##   1       13      1.153212  0.6626030  0.9446418
##   1       14      1.150778  0.6624886  0.9452896
##   1       15      1.155696  0.6608265  0.9485330
##   2        2      1.465726  0.4712874  1.1696152
##   2        3      1.381434  0.5784857  1.1208429
##   2        4      1.254312  0.6061607  1.0397770
##   2        5      1.261880  0.6057280  1.0278797
##   2        6      1.269116  0.6258132  1.0315846
##   2        7      1.354249  0.5417722  1.0912465
##   2        8      1.286775  0.5947254  1.0385618
##   2        9      1.381173  0.5494190  1.1231933
##   2       10      1.527708  0.4898369  1.1740842
##   2       11      1.525072  0.4940744  1.1735497
##   2       12      1.500675  0.5215782  1.1686466
##   2       13      1.514728  0.5258566  1.1847230
##   2       14      1.496330  0.5346941  1.1690166
##   2       15      1.503170  0.5300052  1.1634684
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 1.
ggplot(marsModel, highlight = TRUE) + 
  labs(title = paste0("Tuning profile: ", marsModel$modelInfo$label))

# final model & variable importance
marsModel$finalModel
## Selected 14 of 21 terms, and 10 of 56 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: M32, M9, M13, M39, M1, B6, M3, M28, B10, B11, B1-unused, ...
## Number of terms at each degree of interaction: 1 13 (additive model)
## GCV 1.167506    RSS 97.51332    GRSq 0.682546    RSq 0.7960532
summary(marsModel$finalModel)
## Call: earth(x=matrix[132,56], y=c(38,42.44,42.0...), keepxy=TRUE,
##             degree=1, nprune=14)
## 
##                   coefficients
## (Intercept)          38.551547
## h(-0.862743-B6)      -2.817781
## h(-0.847275-B10)     -2.175900
## h(-0.549404-B11)      0.949173
## h(M1- -0.719584)      0.686995
## h(M3-0.100062)       -0.584934
## h(-1.13526-M9)       -1.459686
## h(M9- -1.13526)       0.544228
## h(-1.03772-M13)       2.495349
## h(0.745945-M28)       0.314882
## h(M28-0.745945)       5.771669
## h(M32- -0.855981)     0.947897
## h(0.100088-M39)      -0.496338
## h(M39-0.100088)      -4.148723
## 
## Selected 14 of 21 terms, and 10 of 56 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: M32, M9, M13, M39, M1, B6, M3, M28, B10, B11, B1-unused, ...
## Number of terms at each degree of interaction: 1 13 (additive model)
## GCV 1.167506    RSS 97.51332    GRSq 0.682546    RSq 0.7960532
ggplot(varImp(marsModel), top = 20) + 
  labs(title = paste0("Variable importance: ", marsModel$modelInfo$label))

# plot response vs each predictor
plotmo(marsModel$finalModel, pt.col = 2, 
       caption = paste0("Response relationships with the predictors: ", 
                        marsModel$modelInfo$label))
##  plotmo grid:    B1          B2         B3          B4          B5
##          -0.1480932 -0.05676868 -0.1450697 -0.07996953 -0.09038694
##           B6         B8          B9        B10        B11         B12
##  -0.03417493 0.00738872 -0.02465333 -0.1207101 -0.1833683 0.002799627
##         M1        M2          M3        M4         M5         M6
##  0.1381053 0.5106584 0.008802588 0.3646308 -0.1131089 -0.2541179
##          M7        M8         M9        M10         M11        M12
##  -0.3955405 0.8844957 0.06122472 -0.1721256 -0.03316336 -0.5132428
##        M13        M14        M15        M16        M17        M18
##  0.1098922 0.03421284 -0.1305892 0.07222311 0.02579683 0.07813078
##         M19       M20        M21         M22        M23        M24
##  -0.1556479 0.0780851 -0.1638226 -0.07010995 0.01918121 -0.1633761
##          M25        M26         M27       M28        M29        M30
##  -0.02998408 -0.1050628 -0.06494512 0.7068991 -0.3533464 -0.1139045
##        M31         M32       M33       M34         M35       M36
##  0.1159743 -0.04763013 0.1177296 0.1867818 -0.05910451 0.5342972
##          M37      M38       M39        M40        M41       M42        M43
##  -0.09274239 0.701741 0.2328687 -0.4696155 -0.4496138 0.2107708 -0.1443466
##        M44       M45
##  0.2796578 0.1400684

# test set predictions
marsPred <- as.numeric(predict(marsModel, newdata = cmp_test))  # convert matrix to vector  
marsPerf <- postResample(pred = marsPred, obs = yield_test) 

##############################################################################
# fit svm model; tune C (use automatic estimate of sigma)
##############################################################################
# fit & tune model
set.seed(227)
svmModel <- train(x = cmp_train, 
                  y = yield_train, 
                  preProcess = prep, 
                  trControl = ctrl, 
                  method = "svmRadial", 
                  tuneLength = 10)
svmModel
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (56), scaled (56), remove (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 120, 119, 118, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE      
##     0.25  1.503402  0.4801001  1.2270443
##     0.50  1.382362  0.5365246  1.1298529
##     1.00  1.256283  0.6067089  1.0215349
##     2.00  1.164566  0.6584831  0.9436202
##     4.00  1.139106  0.6673601  0.9051311
##     8.00  1.135549  0.6684797  0.9118079
##    16.00  1.135008  0.6688247  0.9113934
##    32.00  1.135008  0.6688247  0.9113934
##    64.00  1.135008  0.6688247  0.9113934
##   128.00  1.135008  0.6688247  0.9113934
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01483334
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01483334 and C = 16.
ggplot(svmModel, highlight = TRUE) + 
  labs(title = paste0("Tuning profile: ", svmModel$modelInfo$label))

# final model & variable importance
svmModel$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 16 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0148333380558955 
## 
## Number of Support Vectors : 116 
## 
## Objective Function Value : -74.5875 
## Training error : 0.009049
ggplot(varImp(svmModel), top = 20) + 
  labs(title = paste0("Variable importance: ", svmModel$modelInfo$label))

# test set predictions
svmPred <- predict(svmModel, newdata = cmp_test)  
svmPerf <- postResample(pred = svmPred, obs = yield_test) 

##############################################################################
# fit knn model; tune k 
##############################################################################
# fit & tune model
set.seed(227)
knnGrid <- data.frame(k = 1:20)
knnModel <- train(x = cmp_train, 
                  y = yield_train, 
                  preProcess = prep, 
                  trControl = ctrl, 
                  method = "knn", 
                  tuneGrid = knnGrid)  
knnModel 
## k-Nearest Neighbors 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (56), scaled (56), remove (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 120, 119, 118, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    1  1.562855  0.4189043  1.231211
##    2  1.366561  0.4972392  1.079467
##    3  1.249477  0.5918773  1.001200
##    4  1.306878  0.5759650  1.049647
##    5  1.344157  0.5477255  1.099069
##    6  1.353635  0.5352235  1.102728
##    7  1.373425  0.5258133  1.119679
##    8  1.371015  0.5335119  1.121872
##    9  1.362660  0.5395666  1.128054
##   10  1.386958  0.5170868  1.146023
##   11  1.389200  0.5143559  1.143645
##   12  1.392736  0.5192199  1.152277
##   13  1.401376  0.5171839  1.155483
##   14  1.419992  0.5048721  1.172474
##   15  1.413551  0.5200757  1.162223
##   16  1.439093  0.4952872  1.176301
##   17  1.444029  0.4856420  1.176658
##   18  1.457741  0.4796633  1.190801
##   19  1.464435  0.4699197  1.192174
##   20  1.472937  0.4640040  1.199591
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
ggplot(knnModel, highlight = TRUE) + 
  labs(title = paste0("Tuning profile: ", knnModel$modelInfo$label))

# final model & variable importance
knnModel$finalModel
## 3-nearest neighbor regression model
ggplot(varImp(knnModel), top = 20) + 
  labs(title = paste0("Variable importance: ", knnModel$modelInfo$label))

# test set predictions
knnPred <- predict(knnModel, newdata = cmp_test)  
knnPerf <- postResample(pred = knnPred, obs = yield_test) 

##############################################################################
# fit linear model for comparison: elastic net  
##############################################################################
# fit model using tuned parameters from exercise 6.3
set.seed(227)
enetGrid <- data.frame(fraction = 0.2184211, lambda = 0.005)
enetModel <- train(x = cmp_train, 
                   y = yield_train, 
                   preProcess = prep, 
                   trControl = ctrl, 
                   method = "enet", 
                   tuneGrid = enetGrid)

# test set predictions
enetPred <- predict(enetModel, newdata = cmp_test)  
enetPerf <- postResample(pred = enetPred, obs = yield_test) 

(a) Optimal model performance

We evaluate predictive performance for the models based on the root mean squared error (RMSE), R-squared, and mean absolute error (MAE) metrics. We summarize these performance metrics in the table below based on the resampled holdout sets (prefixed as “res_”) as well as the test set (prefixed as “test_”), for all four non-linear models; we also include the linear model from Exercise 6.3 (elastic net model) for comparison.

We note several observations:

  • For most models, the RMSE and MAE metrics are weaker (higher) on the resampled holdout sets than on the test set; MARS and ENET are the exceptions. This may relate to the small size of the holdout samples: the training set includes 132 cases, so each holdout set in the 10-fold cross-validation includes only about 13 cases, whereas the test set includes 44 cases.
  • Overall, the comparative performance of the models is broadly consistent between the resampled metrics and the test metrics. In particular, all four non-linear models clearly outperform the linear model (ENET) on the test set, although the gap narrows on the resamples, where ENET’s resampled RMSE is actually lower than those of NNET and KNN.
  • Based on the test set RMSE, the SVM with radial basis function kernel is the strongest model overall, with NNET a very close runner-up. MARS is the weakest of the non-linear models on the test metrics.

We also plot below the resampled metrics for the various models.

# resampling & test set metrics
# resampled metrics: average across the 10 CV folds ([-4] drops the Resample ID column)
res_perf <- rbind(NNET = colMeans(nnetModel$resample[-4]), 
                  MARS = colMeans(marsModel$resample[-4]), 
                  SVM = colMeans(svmModel$resample[-4]), 
                  KNN = colMeans(knnModel$resample[-4]), 
                  ENET = colMeans(enetModel$resample[-4]))
colnames(res_perf) <- paste0("res_", colnames(res_perf))
test_perf <- rbind(NNET = nnetPerf, MARS = marsPerf, SVM = svmPerf, 
                   KNN = knnPerf, ENET = enetPerf) 
colnames(test_perf) <- paste0("test_", colnames(test_perf))
cbind(res_perf, test_perf) %>% 
  kable(digits = 3, 
        caption = "Resampled & test set performance metrics")
Resampled & test set performance metrics

       res_RMSE  res_Rsquared  res_MAE  test_RMSE  test_Rsquared  test_MAE
NNET      1.291         0.566    1.077      1.022          0.641     0.764
MARS      1.151         0.662    0.945      1.212          0.504     0.972
SVM       1.135         0.669    0.911      1.020          0.627     0.856
KNN       1.249         0.592    1.001      1.116          0.537     0.902
ENET      1.165         0.652    0.948      1.430          0.396     0.967

# plot resampled metrics
list(NNET = nnetModel, MARS = marsModel, SVM = svmModel, 
     KNN = knnModel, ENET = enetModel) %>% 
  resamples() %>% 
  bwplot(metric = c("RMSE", "Rsquared", "MAE"), 
         layout = c(3,1), 
         between = list(x = 1), 
         main = "Comparison of resampled performance metrics")

(b) Most important predictors

From the variable importance plot above, we see that the top 10 important predictors in the SVM model include 6 manufacturing process and 4 biological variables:

  • M32
  • M13
  • M17
  • B6
  • B3
  • M9
  • M36
  • M6
  • B12
  • B2

The top 10 predictors for each model (the four non-linear models plus the linear model for comparison) are summarized in the table below. Interestingly, the top 10 predictors are identical for the NNET, SVM, KNN, and ENET models, whereas they differ for the MARS model, which retains only 10 predictors because of the feature selection performed during fitting. The help file for the varImp function explains why: for MARS, importance is computed with a model-specific method based on the fitted earth object, whereas for the other models caret falls back to a model-free measure computed directly from the data, using a loess fit of the response on each predictor. Hence the importance scores for the non-MARS models are a generic, model-agnostic measure based only on the dataset, which is why they coincide exactly across those models.
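
As a sketch of that model-free calculation, a similar ranking can be obtained directly with caret’s filterVarImp (note that varImp additionally rescales scores to a 0–100 range, so the magnitudes will differ):

# loess-based, model-free importance of each predictor for yield
fv <- filterVarImp(x = data.frame(cmp_train), y = yield_train, nonpara = TRUE)
head(fv[order(-fv$Overall), , drop = FALSE], 5)   # top 5 predictors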

In any case, reviewing variable importance for the SVM model, we observe that:

  • Among the top 5 and top 10 variables, the manufacturing process and biological variables are relatively evenly represented. For instance, 2 of the top 5 and 4 of the top 10 are biological variables.
  • Overall, taking into account the relative order and magnitude of the importance scores, it appears that the manufacturing variables are more important on balance than the biological variables. The top 3 are all manufacturing variables, with importance scores that are much higher than those for the remaining variables.
  • The top 10 important predictors are identical between the SVM model and the ENET model for the reason cited above.

Reviewing the top predictors for the MARS model, we note that the manufacturing process variables dominate the top of the list. This supports the conclusion that the manufacturing variables are more important overall than the biological variables for predicting yield.

top10 <- function(mod) {
  varImp(mod)$importance %>%
    mutate(Var = paste(rownames(varImp(mod)$importance), 
                       round(Overall, 1), sep = " - ")) %>%
    arrange(desc(Overall)) %>% 
    dplyr::select(Var) %>% 
    head(10)
}
cbind(NNET = top10(nnetModel), 
      MARS = top10(marsModel), 
      SVM = top10(svmModel), 
      KNN = top10(knnModel), 
      ENET = top10(enetModel)) %>% 
  kable(col.names = c("NNET", "MARS", "SVM", "KNN", "ENET"), 
        caption = "Top 10 important variables")
Top 10 important variables

  NNET        MARS        SVM         KNN         ENET
  M32 - 100   M32 - 100   M32 - 100   M32 - 100   M32 - 100
  M13 - 89.9  M9 - 66.9   M13 - 89.9  M13 - 89.9  M13 - 89.9
  M17 - 82.6  M13 - 46.2  M17 - 82.6  M17 - 82.6  M17 - 82.6
  B6 - 70.9   M39 - 29    B6 - 70.9   B6 - 70.9   B6 - 70.9
  B3 - 63.9   M1 - 24.5   B3 - 63.9   B3 - 63.9   B3 - 63.9
  M9 - 62.9   B6 - 20.3   M9 - 62.9   M9 - 62.9   M9 - 62.9
  M36 - 61.4  M3 - 13.8   M36 - 61.4  M36 - 61.4  M36 - 61.4
  M6 - 60.2   B10 - 11    M6 - 60.2   M6 - 60.2   M6 - 60.2
  B12 - 59.6  M28 - 9.5   B12 - 59.6  B12 - 59.6  B12 - 59.6
  B2 - 51.2   B11 - 3.7   B2 - 51.2   B2 - 51.2   B2 - 51.2

(c) Top predictor-response relationships

Next we explore the bivariate relationships between the top predictors and the yield response. We do this graphically with scatter plots of the test data, overlaid with loess fits of the observed yield and of the predicted yields from the SVM, MARS, and ENET models.

For the top 5 predictors in the SVM and ENET models, we note the following relationships:

  • M32: positive relationship to yield (i.e., increasing M32 is associated with increasing yield)
  • M13: mostly negative relationship
  • M17: non-linear relationship (cyclical pattern)
  • B6: mostly positive relationship
  • B3: mostly positive relationship

For the top 5 predictors in the MARS model that are not included above, we also note:

  • M9: increasing relationship
  • M39: indeterminate relationship
  • M1: non-linear relationship

These relationships suggest that to improve the yield of the manufacturing process, certain variables should be reviewed and considered for adjustment:

  • Increase M32
  • Decrease M13
  • Adjust M17 depending on location on the curve
  • Increase B6 if possible (e.g., use different raw materials)
  • Increase B3 if possible (e.g., use different raw materials)
  • Increase M9.

# plot bivariate predictor-response relationships: observed points plus loess
# fits of observed and predicted yield, for each top predictor in turn
plotdat <- data.frame(Y = yield_test, P1 = svmPred, P2 = marsPred, P3 = enetPred, cmp_test)
plot_relationship <- function(var) {
  ggplot(plotdat, aes(x = .data[[var]])) + 
    geom_point(aes(y = Y)) + 
    geom_smooth(aes(y = Y, col = "Data"), se = FALSE) + 
    geom_smooth(aes(y = P1, col = "SVM"), se = FALSE) + 
    geom_smooth(aes(y = P2, col = "MARS"), se = FALSE) + 
    geom_smooth(aes(y = P3, col = "ENET"), se = FALSE) + 
    labs(title = paste0("Test set: Observed and predicted Yield versus ", var), 
         col = "Loess fit") + 
    scale_color_discrete(limits = c("Data", "SVM", "MARS", "ENET"))
}
for (v in c("M32", "M13", "M17", "B6", "B3", "M9", "M39", "M1")) print(plot_relationship(v))