Nonlinear Regression Models

Libraries

library(tidyverse)
library(kableExtra)
library(corrplot)
library(reshape2)
library(Amelia)
library(dlookr)
library(fpp2)
library(plotly)
library(gridExtra)
library(readxl)
library(ggplot2)
library(urca)
library(tseries)
library(AppliedPredictiveModeling)
library(RANN)
library(psych)
library(e1071)
library(corrplot)
library(glmnet)
library(mlbench)
library(caret)
library(earth)

Exercise 7.2

Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data:

\(y = 10 sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\)

where the \(x\) values are random variables uniformly distributed between \([0, 1]\) (there are also \(5\) other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

(This exercise is based on library(mlbench), which I included in libraries at the top.)

set.seed(200)

trainingData <- mlbench.friedman1(200, sd = 1)

## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.

trainingData$x <- data.frame(trainingData$x)

## Look at the data using

featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

(I included library caret in libraries at the top.)

knnModel <- train(x = trainingData$x, y = trainingData$y, method = "knn", preProc = c("center", "scale"), tuneLength = 10)
knnModel

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.

knnPred <- predict(knnModel, newdata = testData$x)

## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = knnPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

Answer:

We observed above that the RMSE of kNN model is 3.2932153. In the following, we’ll explore all other models, Neural Networks, MARS and SVM, mentioned in the book, in this order.

Neural Networks

Used code from page 163 of textbook.

nnetGrid <- expand.grid(.decay = c(0, 0.01, .1), .size = c(1:10), .bag = FALSE)

nnetTune <- train(x = trainingData$x, y = trainingData$y, method = "avNNet", tuneGrid = nnetGrid,
                  trControl = trainControl(method = "cv"), preProc = c("center", "scale"),
                  linout = TRUE, trace = FALSE,
                  MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                  maxit = 500)

## Warning: executing %dopar% sequentially: no parallel backend registered

nnetTune

## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.434845  0.7683498  1.921367
##   0.00    2    2.497822  0.7558233  1.993325
##   0.00    3    2.037885  0.8419795  1.609413
##   0.00    4    1.900063  0.8584928  1.529545
##   0.00    5    2.176661  0.8092998  1.628603
##   0.00    6    2.743381  0.7255103  1.988222
##   0.00    7    3.496229  0.6401273  2.493454
##   0.00    8    4.034891  0.5941657  2.749735
##   0.00    9    4.221796  0.5137164  2.800450
##   0.00   10    4.682342  0.5848908  2.818883
##   0.01    1    2.437231  0.7689665  1.934978
##   0.01    2    2.510986  0.7596191  1.988260
##   0.01    3    1.999944  0.8419567  1.555751
##   0.01    4    2.003357  0.8445288  1.549723
##   0.01    5    2.104801  0.8296459  1.664982
##   0.01    6    2.314704  0.7997307  1.857949
##   0.01    7    2.341101  0.8072335  1.872758
##   0.01    8    2.205611  0.8163107  1.748153
##   0.01    9    2.262921  0.8146166  1.776693
##   0.01   10    2.453311  0.7709666  1.981977
##   0.10    1    2.450897  0.7652309  1.942945
##   0.10    2    2.489399  0.7606443  1.997060
##   0.10    3    2.200693  0.8155496  1.786599
##   0.10    4    2.059322  0.8432340  1.651716
##   0.10    5    2.189025  0.8133603  1.729453
##   0.10    6    2.215091  0.8128993  1.757966
##   0.10    7    2.209521  0.8196474  1.786772
##   0.10    8    2.317124  0.8010433  1.826655
##   0.10    9    2.286711  0.7928430  1.849002
##   0.10   10    2.238560  0.8113030  1.787851
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0 and bag = FALSE.

Neural_pred <- postResample(pred = predict(nnetTune, newdata = testData$x), obs = testData$y)
Neural_pred

##     RMSE Rsquared      MAE 
## 2.496722 0.784618 1.685182

Observation: RMSE of Neural Networks is 2.496722. It’s way higher than what we obtained in kNN (3.2040595).

varImp(nnetTune)

## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

The top 5 variables are X4, X1, X2, X5, X3.

MARS

Used code from page 165 of textbook. Included library(earth) in libraries at the top.

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

marsTuned <- train(x = trainingData$x, y = trainingData$y, method = "earth", tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))

marsTuned

## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.334325  0.2599883  3.607719
##   1        3      3.599334  0.4805557  2.888987
##   1        4      2.637145  0.7290848  2.087677
##   1        5      2.283872  0.7939684  1.817343
##   1        6      2.125875  0.8183677  1.647491
##   1        7      1.766013  0.8733619  1.410328
##   1        8      1.671282  0.8842102  1.324258
##   1        9      1.645406  0.8867947  1.322041
##   1       10      1.597968  0.8926582  1.297518
##   1       11      1.540109  0.8996361  1.237949
##   1       12      1.545349  0.8992979  1.243771
##   1       13      1.535169  0.9010122  1.233571
##   1       14      1.529405  0.9018457  1.223874
##   1       15      1.529405  0.9018457  1.223874
##   1       16      1.529405  0.9018457  1.223874
##   1       17      1.529405  0.9018457  1.223874
##   1       18      1.529405  0.9018457  1.223874
##   1       19      1.529405  0.9018457  1.223874
##   1       20      1.529405  0.9018457  1.223874
##   1       21      1.529405  0.9018457  1.223874
##   1       22      1.529405  0.9018457  1.223874
##   1       23      1.529405  0.9018457  1.223874
##   1       24      1.529405  0.9018457  1.223874
##   1       25      1.529405  0.9018457  1.223874
##   1       26      1.529405  0.9018457  1.223874
##   1       27      1.529405  0.9018457  1.223874
##   1       28      1.529405  0.9018457  1.223874
##   1       29      1.529405  0.9018457  1.223874
##   1       30      1.529405  0.9018457  1.223874
##   1       31      1.529405  0.9018457  1.223874
##   1       32      1.529405  0.9018457  1.223874
##   1       33      1.529405  0.9018457  1.223874
##   1       34      1.529405  0.9018457  1.223874
##   1       35      1.529405  0.9018457  1.223874
##   1       36      1.529405  0.9018457  1.223874
##   1       37      1.529405  0.9018457  1.223874
##   1       38      1.529405  0.9018457  1.223874
##   2        2      4.334325  0.2599883  3.607719
##   2        3      3.599334  0.4805557  2.888987
##   2        4      2.637145  0.7290848  2.087677
##   2        5      2.271844  0.7927888  1.823675
##   2        6      2.114868  0.8200184  1.659485
##   2        7      1.780140  0.8733216  1.429346
##   2        8      1.663164  0.8891928  1.294968
##   2        9      1.460976  0.9122520  1.180387
##   2       10      1.399692  0.9175376  1.122526
##   2       11      1.380002  0.9216251  1.110556
##   2       12      1.312883  0.9284253  1.063321
##   2       13      1.285612  0.9343029  1.014216
##   2       14      1.328520  0.9286650  1.052185
##   2       15      1.322954  0.9298515  1.045527
##   2       16      1.341454  0.9283961  1.053190
##   2       17      1.344590  0.9280972  1.054209
##   2       18      1.340821  0.9285264  1.050274
##   2       19      1.340821  0.9285264  1.050274
##   2       20      1.340821  0.9285264  1.050274
##   2       21      1.340821  0.9285264  1.050274
##   2       22      1.340821  0.9285264  1.050274
##   2       23      1.340821  0.9285264  1.050274
##   2       24      1.340821  0.9285264  1.050274
##   2       25      1.340821  0.9285264  1.050274
##   2       26      1.340821  0.9285264  1.050274
##   2       27      1.340821  0.9285264  1.050274
##   2       28      1.340821  0.9285264  1.050274
##   2       29      1.340821  0.9285264  1.050274
##   2       30      1.340821  0.9285264  1.050274
##   2       31      1.340821  0.9285264  1.050274
##   2       32      1.340821  0.9285264  1.050274
##   2       33      1.340821  0.9285264  1.050274
##   2       34      1.340821  0.9285264  1.050274
##   2       35      1.340821  0.9285264  1.050274
##   2       36      1.340821  0.9285264  1.050274
##   2       37      1.340821  0.9285264  1.050274
##   2       38      1.340821  0.9285264  1.050274
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 13 and degree = 2.

MARS_pred <- postResample(pred = predict(marsTuned, newdata = testData$x), obs = testData$y)
MARS_pred

##      RMSE  Rsquared       MAE 
## 1.2803060 0.9335241 1.0168673

Observation: RMSE of MARS is 1.2803060. It’s least and best so far.

varImp(marsTuned)

## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.33
## X2   48.88
## X5   15.63
## X3    0.00

The top 5 variables are X1, X4, X2, X5, X3.

Support Vector Machines

Used code from page 167 of textbook.

svmRTuned <- train(x = trainingData$x, y = trainingData$y,
                   method = "svmRadial", preProc = c("center", "scale"),
                   tuneLength = 14, trControl = trainControl(method = "cv"))

svmRTuned

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.504105  0.7940789  1.987142
##      0.50  2.219946  0.8148914  1.750249
##      1.00  2.028115  0.8388693  1.590383
##      2.00  1.899331  0.8561464  1.486326
##      4.00  1.815632  0.8669708  1.424246
##      8.00  1.798299  0.8702910  1.427678
##     16.00  1.797165  0.8702715  1.431259
##     32.00  1.795246  0.8705225  1.429235
##     64.00  1.795246  0.8705225  1.429235
##    128.00  1.795246  0.8705225  1.429235
##    256.00  1.795246  0.8705225  1.429235
##    512.00  1.795246  0.8705225  1.429235
##   1024.00  1.795246  0.8705225  1.429235
##   2048.00  1.795246  0.8705225  1.429235
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06104815
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06104815 and C = 32.

SVM_pred <- postResample(pred = predict(svmRTuned, newdata = testData$x), obs = testData$y)
SVM_pred

##      RMSE  Rsquared       MAE 
## 2.0693488 0.8263553 1.5718972

Observation: RMSE of SVM is 2.0469184.

varImp(svmRTuned)

## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

The top 5 variables are X4, X1, X2, X5, X3.

Summary

results <- data.frame(t(postResample(pred = knnPred, obs = testData$y))) %>% mutate("Model" = "KNN")

results <- data.frame(t(Neural_pred)) %>% mutate("Model"= "Neural Networks") %>% bind_rows(results)

results <- data.frame(t(MARS_pred)) %>% mutate("Model"= "MARS") %>% bind_rows(results)

results <- data.frame(t(SVM_pred)) %>% mutate("Model"= "Support Vector Machines") %>% bind_rows(results)

results %>% select(Model, RMSE, Rsquared, MAE) %>% arrange(RMSE)

##                     Model     RMSE  Rsquared      MAE
## 1                    MARS 1.280306 0.9335241 1.016867
## 2 Support Vector Machines 2.069349 0.8263553 1.571897
## 3         Neural Networks 2.496722 0.7846180 1.685182
## 4                     KNN 3.204059 0.6819919 2.568346

Conclusion

MARS outperformed the others.

Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Answer:

Before we begin answring the questions, let’s pre-process the data. In previous Homework 7, this data had to be imputed and Box-coxed. I’ll do the same this time, but will not inspect with histograms, because we already know about that.

data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)

## [1] 176  58

I’ll impute the data using the same technique i.e. mice().

data_imputed <- mice(ChemicalManufacturingProcess, m = 1, method = "pmm", print = F) %>% complete()

## Warning: Number of logged events: 135

any(is.na(data_imputed))

## [1] FALSE

At this point, the data is imputed. I’ll proceed to Box-Cox it.

# Initially identifying columns, whose skewness are not less than 1.
transform_cols <- c()

for(i in seq(from = 1, to = length(data_imputed), by = 1)) {
  if(abs(skewness(data_imputed[, i])) >= 1) {
    transform_cols <- append(transform_cols, names(data_imputed[i]))
  }
}

# Applying Box-cox.
lambda <- NULL
data_imputed_2 <- data_imputed

for (i in 1:length(transform_cols)) {
  lambda[transform_cols[i]] <- BoxCox.lambda(abs(data_imputed[, transform_cols[i]]))

  data_imputed_2[c(transform_cols[i])] <- BoxCox(data_imputed[transform_cols[i]], lambda[transform_cols[i]])
}

At this point, the data is pre-processed. The pre-processed data is stored in the variable data_imputed_2.

So, I’ll proceed to split the data into train and test in 80/20 ratio.

set.seed(200)

split_index <- createDataPartition(data_imputed_2$Yield, p = 0.8, list = FALSE)
X_train <- data_imputed_2[split_index, ]
y_train <- data_imputed_2$Yield[split_index]

X_test <- data_imputed_2[-split_index, ]
y_test <- data_imputed_2$Yield[-split_index]

(a)Which nonlinear regression model gives the optimal resampling and test set performance?

kNN

knnModel <- train(x = X_train, y = y_train, method = "knn", preProc = c("center", "scale"), tuneLength = 10)
knnModel

## k-Nearest Neighbors 
## 
## 144 samples
##  58 predictor
## 
## Pre-processing: centered (58), scaled (58) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE      
##    5  1.210081  0.5601133  0.9444384
##    7  1.200903  0.5694118  0.9409716
##    9  1.203957  0.5686644  0.9411430
##   11  1.220546  0.5574447  0.9575731
##   13  1.234408  0.5520165  0.9643078
##   15  1.258274  0.5344378  0.9900129
##   17  1.266584  0.5344163  1.0004605
##   19  1.272185  0.5356336  1.0082211
##   21  1.279878  0.5371340  1.0153591
##   23  1.295413  0.5283148  1.0270582
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.

knnPred <- postResample(pred = predict(knnModel, newdata = X_test), obs = y_test)
knnPred

##      RMSE  Rsquared       MAE 
## 1.3990997 0.6558133 1.1541518

Neural Networks

set.seed(200)

nnetGrid <- expand.grid(.decay = c(0, 0.01, .1), .size = c(1:10), .bag = FALSE)

nnetTune <- train(x = X_train,
                  y = y_train,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl = trainControl(method = "cv"),
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(X_train) + 1) + 10 + 1,
                  maxit = 500)

nnetTune

## Model Averaged Neural Network 
## 
## 144 samples
##  58 predictor
## 
## Pre-processing: centered (58), scaled (58) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE       Rsquared   MAE      
##   0.00    1    1.4751204  0.4111896  1.2105103
##   0.00    2    1.3290201  0.4785784  1.0824154
##   0.00    3    1.4476805  0.6626447  1.0855225
##   0.00    4    1.4011939  0.6021817  1.0737305
##   0.00    5    1.4544828  0.6302205  1.1474825
##   0.00    6    1.8831243  0.4510348  1.5056761
##   0.00    7    2.0811185  0.5146939  1.6128725
##   0.00    8    3.4300327  0.2643533  2.6183578
##   0.00    9    5.1618619  0.3198379  3.4635853
##   0.00   10    5.1964111  0.2722493  3.6164326
##   0.01    1    0.2813606  0.9639782  0.1557607
##   0.01    2    0.4477717  0.9334548  0.2668179
##   0.01    3    0.7257841  0.8620893  0.4595370
##   0.01    4    1.2584315  0.6998765  0.7876371
##   0.01    5    1.2438278  0.7242635  0.8201195
##   0.01    6    0.9605542  0.7539316  0.7604256
##   0.01    7    0.9565450  0.7295212  0.7563796
##   0.01    8    1.1977661  0.6866119  0.8956017
##   0.01    9    1.9605391  0.5404868  1.4096342
##   0.01   10    2.0286259  0.5069970  1.4715894
##   0.10    1    0.4523260  0.9383174  0.3212159
##   0.10    2    1.0942445  0.7856302  0.7113967
##   0.10    3    1.2244591  0.7651362  0.7456411
##   0.10    4    1.1764762  0.7744547  0.7043238
##   0.10    5    1.2866176  0.7286886  0.7297159
##   0.10    6    1.3021583  0.7274787  0.7643164
##   0.10    7    1.1773769  0.7163146  0.8046963
##   0.10    8    1.6074084  0.6065895  1.0162976
##   0.10    9    1.2906389  0.6685646  0.9049381
##   0.10   10    1.1863507  0.6751777  0.9125173
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1, decay = 0.01 and bag = FALSE.

Neural_pred <- postResample(pred = predict(nnetTune, newdata = X_test), obs = y_test)
Neural_pred

##      RMSE  Rsquared       MAE 
## 0.6956048 0.8890842 0.2403163

MARS

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

marsTuned <- train(x = X_train, y = y_train, method = "earth", tuneGrid = marsGrid, preProc = c("center", "scale"), trControl = trainControl(method = "cv"))

marsTuned

## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  58 predictor
## 
## Pre-processing: centered (58), scaled (58) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 131, 130, 130, 128, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE          Rsquared  MAE         
##   1        2      9.457448e-15  1         9.179015e-15
##   1        3      9.457448e-15  1         9.179015e-15
##   1        4      9.457448e-15  1         9.179015e-15
##   1        5      9.457448e-15  1         9.179015e-15
##   1        6      9.457448e-15  1         9.179015e-15
##   1        7      9.457448e-15  1         9.179015e-15
##   1        8      9.457448e-15  1         9.179015e-15
##   1        9      9.457448e-15  1         9.179015e-15
##   1       10      9.457448e-15  1         9.179015e-15
##   1       11      9.457448e-15  1         9.179015e-15
##   1       12      9.457448e-15  1         9.179015e-15
##   1       13      9.457448e-15  1         9.179015e-15
##   1       14      9.457448e-15  1         9.179015e-15
##   1       15      9.457448e-15  1         9.179015e-15
##   1       16      9.457448e-15  1         9.179015e-15
##   1       17      9.457448e-15  1         9.179015e-15
##   1       18      9.457448e-15  1         9.179015e-15
##   1       19      9.457448e-15  1         9.179015e-15
##   1       20      9.457448e-15  1         9.179015e-15
##   1       21      9.457448e-15  1         9.179015e-15
##   1       22      9.457448e-15  1         9.179015e-15
##   1       23      9.457448e-15  1         9.179015e-15
##   1       24      9.457448e-15  1         9.179015e-15
##   1       25      9.457448e-15  1         9.179015e-15
##   1       26      9.457448e-15  1         9.179015e-15
##   1       27      9.457448e-15  1         9.179015e-15
##   1       28      9.457448e-15  1         9.179015e-15
##   1       29      9.457448e-15  1         9.179015e-15
##   1       30      9.457448e-15  1         9.179015e-15
##   1       31      9.457448e-15  1         9.179015e-15
##   1       32      9.457448e-15  1         9.179015e-15
##   1       33      9.457448e-15  1         9.179015e-15
##   1       34      9.457448e-15  1         9.179015e-15
##   1       35      9.457448e-15  1         9.179015e-15
##   1       36      9.457448e-15  1         9.179015e-15
##   1       37      9.457448e-15  1         9.179015e-15
##   1       38      9.457448e-15  1         9.179015e-15
##   2        2      9.457448e-15  1         9.179015e-15
##   2        3      9.457448e-15  1         9.179015e-15
##   2        4      9.457448e-15  1         9.179015e-15
##   2        5      9.457448e-15  1         9.179015e-15
##   2        6      9.457448e-15  1         9.179015e-15
##   2        7      9.457448e-15  1         9.179015e-15
##   2        8      9.457448e-15  1         9.179015e-15
##   2        9      9.457448e-15  1         9.179015e-15
##   2       10      9.457448e-15  1         9.179015e-15
##   2       11      9.457448e-15  1         9.179015e-15
##   2       12      9.457448e-15  1         9.179015e-15
##   2       13      9.457448e-15  1         9.179015e-15
##   2       14      9.457448e-15  1         9.179015e-15
##   2       15      9.457448e-15  1         9.179015e-15
##   2       16      9.457448e-15  1         9.179015e-15
##   2       17      9.457448e-15  1         9.179015e-15
##   2       18      9.457448e-15  1         9.179015e-15
##   2       19      9.457448e-15  1         9.179015e-15
##   2       20      9.457448e-15  1         9.179015e-15
##   2       21      9.457448e-15  1         9.179015e-15
##   2       22      9.457448e-15  1         9.179015e-15
##   2       23      9.457448e-15  1         9.179015e-15
##   2       24      9.457448e-15  1         9.179015e-15
##   2       25      9.457448e-15  1         9.179015e-15
##   2       26      9.457448e-15  1         9.179015e-15
##   2       27      9.457448e-15  1         9.179015e-15
##   2       28      9.457448e-15  1         9.179015e-15
##   2       29      9.457448e-15  1         9.179015e-15
##   2       30      9.457448e-15  1         9.179015e-15
##   2       31      9.457448e-15  1         9.179015e-15
##   2       32      9.457448e-15  1         9.179015e-15
##   2       33      9.457448e-15  1         9.179015e-15
##   2       34      9.457448e-15  1         9.179015e-15
##   2       35      9.457448e-15  1         9.179015e-15
##   2       36      9.457448e-15  1         9.179015e-15
##   2       37      9.457448e-15  1         9.179015e-15
##   2       38      9.457448e-15  1         9.179015e-15
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 2 and degree = 1.

MARS_pred <- postResample(pred = predict(marsTuned, newdata = X_test), obs = y_test)
MARS_pred

##         RMSE     Rsquared          MAE 
## 7.105427e-15 1.000000e+00 7.105427e-15

SVM

svmRTuned <- train(x = X_train, y = y_train,
                   method = "svmRadial", preProc = c("center", "scale"),
                   tuneLength = 14, trControl = trainControl(method = "cv"))

svmRTuned

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  58 predictor
## 
## Pre-processing: centered (58), scaled (58) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 131, 128, 128, 131, 129, 129, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE      
##      0.25  1.1014916  0.7383513  0.8993541
##      0.50  0.8745268  0.8330189  0.6988335
##      1.00  0.7057515  0.8932710  0.5543114
##      2.00  0.5963458  0.9191256  0.4730832
##      4.00  0.5828162  0.9207911  0.4634073
##      8.00  0.5828068  0.9207979  0.4633994
##     16.00  0.5828068  0.9207979  0.4633994
##     32.00  0.5828068  0.9207979  0.4633994
##     64.00  0.5828068  0.9207979  0.4633994
##    128.00  0.5828068  0.9207979  0.4633994
##    256.00  0.5828068  0.9207979  0.4633994
##    512.00  0.5828068  0.9207979  0.4633994
##   1024.00  0.5828068  0.9207979  0.4633994
##   2048.00  0.5828068  0.9207979  0.4633994
## 
## Tuning parameter 'sigma' was held constant at a value of 0.0122229
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0122229 and C = 8.

SVM_pred <- postResample(pred = predict(svmRTuned, newdata = X_test), obs = y_test)
SVM_pred

##      RMSE  Rsquared       MAE 
## 0.7961616 0.9025694 0.5261874

Summary

results <- data.frame(t(knnPred)) %>% mutate("Model" = "KNN")

results <- data.frame(t(Neural_pred)) %>% mutate("Model"= "Neural Networks") %>% bind_rows(results)

results <- data.frame(t(MARS_pred)) %>% mutate("Model"= "MARS") %>% bind_rows(results)

results <- data.frame(t(SVM_pred)) %>% mutate("Model"= "Support Vector Machines") %>% bind_rows(results)

results %>% select(Model, RMSE, Rsquared, MAE) %>% arrange(RMSE)

##                     Model         RMSE  Rsquared          MAE
## 1                    MARS 7.105427e-15 1.0000000 7.105427e-15
## 2         Neural Networks 6.956048e-01 0.8890842 2.403163e-01
## 3 Support Vector Machines 7.961616e-01 0.9025694 5.261874e-01
## 4                     KNN 1.399100e+00 0.6558133 1.154152e+00

Conclusion

MARS outperformed the others.

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

varImp(marsTuned)

## earth variable importance

## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf

##       Overall
## Yield     NaN

varImp(marsTuned) did not return any predictor. So, let me try the second best mode i.e. Neural Networks.

varImp(nnetTune)

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 58)
## 
##                        Overall
## Yield                   100.00
## ManufacturingProcess32   38.65
## BiologicalMaterial06     33.94
## ManufacturingProcess13   30.24
## BiologicalMaterial03     29.55
## BiologicalMaterial12     26.73
## ManufacturingProcess17   26.54
## ManufacturingProcess31   26.05
## ManufacturingProcess09   24.78
## ManufacturingProcess36   24.59
## ManufacturingProcess06   22.09
## BiologicalMaterial02     20.68
## BiologicalMaterial11     18.81
## ManufacturingProcess11   16.95
## ManufacturingProcess29   16.08
## ManufacturingProcess33   15.83
## ManufacturingProcess30   15.17
## BiologicalMaterial09     14.82
## BiologicalMaterial04     14.82
## BiologicalMaterial08     14.26

In the case of Neural Networks, 6 of the top ten predictors are ManufacturingProcess predictors and 3 are BiologicalMaterial. So, the ManufacturingProcess predictors dominate.

PLS_MODEL

set.seed(200)
pls_model <- train(x = X_train, y = y_train, method = "pls", preProc = c("center", "scale"), trControl = trainControl("cv", number = 10), tuneLength = 25)

pls_prediction <- predict(pls_model, newdata = X_test)

results <- data.frame(Model = "PLS", RMSE = caret::RMSE(pls_prediction, y_test), Rsquared = caret::R2(pls_prediction, y_test))
results

##   Model      RMSE  Rsquared
## 1   PLS 0.3416962 0.9718459

varImp(pls_model)

## Warning: package 'pls' was built under R version 3.6.3

## 
## Attaching package: 'pls'

## The following object is masked from 'package:caret':
## 
##     R2

## The following object is masked from 'package:fpp2':
## 
##     gasoline

## The following object is masked from 'package:corrplot':
## 
##     corrplot

## The following object is masked from 'package:stats':
## 
##     loadings

## pls variable importance
## 
##   only 20 most important variables shown (out of 58)
## 
##                        Overall
## Yield                   100.00
## ManufacturingProcess32   42.39
## ManufacturingProcess13   35.62
## ManufacturingProcess36   35.22
## ManufacturingProcess17   33.77
## ManufacturingProcess09   32.21
## BiologicalMaterial02     26.84
## ManufacturingProcess12   26.33
## ManufacturingProcess06   26.30
## BiologicalMaterial06     24.60
## BiologicalMaterial08     24.18
## ManufacturingProcess33   23.48
## ManufacturingProcess28   23.46
## BiologicalMaterial12     23.17
## BiologicalMaterial04     22.97
## ManufacturingProcess29   22.31
## ManufacturingProcess31   22.12
## ManufacturingProcess04   21.24
## BiologicalMaterial11     21.12
## BiologicalMaterial03     21.02

In nonlinear models, MARS performed best. The RMSE was very close to zero. But PLS_MODEL, which is linear, returned an RMSE of 0.1308365 is higher than MARS’s RMSE. So, it faired worse.

However, in linear model PLS_MODEL, among the top 10 variables, ManufacturingProcess is dominant.

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

By comparing 20 varImp(nnetTune) with 20 varImp(pls_model), the former being nonlinear and latter liner, we get the following predictors as unique to nonlinear.

BiologicalMaterial03 BiologicalMaterial09 ManufacturingProcess11 ManufacturingProcess30

This was done offline manually on Bash-shell.

The plots of each of these variables that are unique to nonlinear set are shown below.

ggplot(X_train, aes(BiologicalMaterial03, Yield)) + geom_point() + ggtitle("BiologicalMaterial03")

ggplot(X_train, aes(ManufacturingProcess34, Yield)) + geom_point() + ggtitle("BiologicalMaterial09")

ggplot(X_train, aes(ManufacturingProcess34, Yield)) + geom_point() + ggtitle("ManufacturingProcess11")

ggplot(X_train, aes(ManufacturingProcess34, Yield)) + geom_point() + ggtitle("ManufacturingProcess30")

These don’t indicate any special relationship.

Marker: 624-09