Chapter 7: Nonlinear Regression Models

7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\[y=10 \sin \left(\pi x_{1} x_{2}\right)+20\left(x_{3}-0.5\right)^{2}+10 x_{4}+5 x_{5}+N\left(0, \sigma^{2}\right)\]

where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five additional non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
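A minimal sketch of the simulation calls (the seed and the 5000-sample test set are assumptions consistent with the output that follows; the 200-sample training set matches it):

```r
library(mlbench)
library(caret)

set.seed(200)  # assumed seed
trainingData <- mlbench.friedman1(200, sd = 1)
# Convert the predictor matrix to a data frame, one column per predictor
trainingData$x <- data.frame(trainingData$x)

# A large simulated test set to estimate the true error rate
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
```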

k-Nearest Neighbors

k-Nearest Neighbors can be used for either regression or classification

https://www.jeremyjordan.me/k-nearest-neighbors/

The first algorithm to fit is the k-Nearest Neighbors model, which can be used for classification or regression problems. It assumes that similar data points lie close together: for a given observation it finds the 'k' nearest points by distance and, in the case of regression, predicts the mean of those 'k' neighbors' responses.
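A sketch of the caret call that would produce the resampling summary below; tuneLength = 10 yields the odd k values from 5 to 23 shown in the output (the seed is an assumption):

```r
set.seed(200)  # assumed seed
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),  # center and scale, as reported below
                  tuneLength = 10)
knnModel
```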

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.565620  0.4887976  2.886629
##    7  3.422420  0.5300524  2.752964
##    9  3.368072  0.5536927  2.715310
##   11  3.323010  0.5779056  2.669375
##   13  3.275835  0.6030846  2.628663
##   15  3.261864  0.6163510  2.621192
##   17  3.261973  0.6267032  2.616956
##   19  3.286299  0.6281075  2.640585
##   21  3.280950  0.6390386  2.643807
##   23  3.292397  0.6440392  2.656080
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.

Resampling found the optimum results at k = 15: the value of each observation is estimated from the mean of its 15 closest neighbors.
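To estimate test-set performance for the tuned model (used in the comparison table later in this section), a minimal sketch:

```r
# Predict the large simulated test set and summarize accuracy
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
```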

Support Vector Machine - Regression (SVR)

A Support Vector Machine tries to create optimal boundaries between objects. In a classification problem, the boundaries separate objects of different classes. In a regression problem, a margin is placed around a hyperplane (the fitted curve) so that the points captured within that margin contribute the least error.

Support Vector Regression Tutorial for Machine Learning

https://www.analyticsvidhya.com/blog/2020/03/support-vector-regression-tutorial-for-machine-learning/
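A sketch of the two fits that produce the output below: a plain kernlab ksvm() with its default settings (epsilon = 0.1, cost C = 1), followed by a caret tuning run over the cost parameter with 5-fold cross-validation (tuneLength = 9 gives C from 0.25 to 64; the seed is an assumption):

```r
library(kernlab)

# Default radial-basis SVR fit (eps-svr, C = 1)
svmFit <- ksvm(x = as.matrix(trainingData$x), y = trainingData$y)
svmFit

# Tune the cost parameter C with 5-fold cross-validation via caret
set.seed(200)  # assumed seed
svmRTuned <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 9,
                   trControl = trainControl(method = "cv", number = 5))
svmRTuned
```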

## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.066645238611971 
## 
## Number of Support Vectors : 158 
## 
## Objective Function Value : -40.302 
## Training error : 0.067262
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 160, 160, 160, 160, 160 
## Resampling results across tuning parameters:
## 
##   C      RMSE      Rsquared   MAE     
##    0.25  2.552384  0.8036595  2.031945
##    0.50  2.262344  0.8197733  1.779908
##    1.00  2.081715  0.8400180  1.633355
##    2.00  1.981734  0.8541422  1.558551
##    4.00  1.911043  0.8617172  1.513505
##    8.00  1.886361  0.8645105  1.510759
##   16.00  1.890404  0.8640035  1.516985
##   32.00  1.890404  0.8640035  1.516985
##   64.00  1.890404  0.8640035  1.516985
## 
## Tuning parameter 'sigma' was held constant at a value of 0.0691853
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0691853 and C = 8.

Multivariate Adaptive Regression Splines (MARS)

Multivariate adaptive regression splines (MARS) provides a way to capture the non-linear relationship between a predictor and the response by combining a set of piecewise linear segments fit to the data and joined at a series of 'knots'.

Different numbers of 'knots' breaking up a fit curve into multiple linear segments

Multivariate Adaptive Regression Splines

https://bradleyboehmke.github.io/HOML/mars.html

The earth package includes useful plots that show how the MARS model was optimized. The plot below indicates that the optimum number of terms found was 12 for this particular data set.
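A sketch of the earth fit matching the call echoed in the output below:

```r
library(earth)

# Fit an additive MARS model to the simulated training data
marsFit <- earth(x = trainingData$x, y = trainingData$y)
plot(marsFit)      # model-selection plot: GRSq vs. number of terms
marsFit            # selected terms, importance, GCV/RSS
summary(marsFit)   # hinge-function coefficients
```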

## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982

## Call: earth(x=trainingData$x, y=trainingData$y)
## 
##                coefficients
## (Intercept)       18.451984
## h(0.621722-X1)   -11.074396
## h(0.601063-X2)   -10.744225
## h(X3-0.281766)    20.607853
## h(0.447442-X3)    17.880232
## h(X3-0.447442)   -23.282007
## h(X3-0.636458)    15.150350
## h(0.734892-X4)   -10.027487
## h(X4-0.734892)     9.092045
## h(0.850094-X5)    -4.723407
## h(X5-0.850094)    10.832932
## h(X6-0.361791)    -1.956821
## 
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982

Neural Network

A neural network tries to 'learn' the relationship between features and outputs using an input layer, one or more hidden layers, and an output layer. The input layer contains the data features, the hidden layers do the 'learning', and the output layer produces the predictions.

Basic Representation of a Neural Network

Deep Learning

https://bradleyboehmke.github.io/HOML/deep-learning.html
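No training log is shown for this model; a hedged sketch of a model-averaged neural network fit with caret's avNNet method (the tuning grid, seed, and iteration limit are assumptions):

```r
# Candidate weight-decay and hidden-unit settings
nnetGrid <- expand.grid(decay = c(0, 0.01, 0.1),
                        size = 1:10,
                        bag = FALSE)

set.seed(200)  # assumed seed
nnetTune <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  preProc = c("center", "scale"),
                  linout = TRUE,    # linear output units for regression
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                  maxit = 500)
```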

Accuracy of Non-Linear Regression Models

All models except k-Nearest Neighbors showed good prediction performance on the test data set. The best-performing algorithm is the model-averaged neural network fit with the avNNet() function.
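A hedged sketch of how the comparison table below might be assembled, assuming the fitted objects from the previous sections (testPerf is a hypothetical helper):

```r
# Hypothetical helper: test-set RMSE and R-squared for one fitted model
testPerf <- function(model, name) {
  pred <- as.vector(predict(model, newdata = testData$x))
  m <- postResample(pred = pred, obs = testData$y)
  data.frame(Model = name, RMSE = m[["RMSE"]], Rsquare = m[["Rsquared"]])
}

rbind(testPerf(knnModel,  "knn"),
      testPerf(svmRTuned, "SVM_2"),
      testPerf(marsFit,   "MARS"),
      testPerf(nnetTune,  "Neural Network avNNet"))
```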

##                   Model     RMSE   Rsquare
## 1                   knn 3.175066 0.6785946
## 2                   SVM 2.256252 0.8002645
## 3                 SVM_2 2.081703 0.8245831
## 4                  MARS 1.813647 0.8677298
## 5 Neural Network avNNet 1.722982 0.8804869
## 6   Neural Network nnet 1.793138 0.8725856

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

Below, the importance of the different predictors is ranked using the Generalized Cross-Validation (GCV) and Residual Sums of Squares (RSS) criteria:
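The ranking can be produced with the earth package's evimp() function; a minimal sketch:

```r
# Variable importance by number of subsets, GCV, and RSS
evimp(marsFit)
```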

MARS has indeed chosen the informative predictors as the most relevant to the fit of the model, in decreasing order of importance: X1, X4, X2, X5.

7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

This data set contains 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) that explain the production yield column for the 176 observed manufacturing runs.

Before modelling, the data will be pre-processed to (see the sketch after this list):

  • Remove predictors with near-zero variance.
  • Impute missing values using the predictive mean matching (pmm) algorithm.
  • Transform the predictor values with a Box-Cox transformation, center them to a mean of zero, and rescale them so that all predictors have a similar range of values.
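A hedged sketch of these steps, assuming the data come from the AppliedPredictiveModeling package and the columns have already been renamed to the Yield/BM*/MP* scheme seen in the mice log below:

```r
library(AppliedPredictiveModeling)
library(caret)
library(mice)

data(ChemicalManufacturingProcess)
CMP <- ChemicalManufacturingProcess  # assume columns renamed to Yield, BM01..., MP01...

# 1. Drop near-zero-variance predictors
CMP <- CMP[, -nearZeroVar(CMP)]

# 2. Impute missing values with predictive mean matching
imputed <- mice(CMP, method = "pmm", m = 1, seed = 42)  # assumed seed
CMP <- complete(imputed)

# 3. Box-Cox transform, center, and scale
pp <- preProcess(CMP, method = c("BoxCox", "center", "scale"))
CMP_t <- predict(pp, CMP)
```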
## 
##  iter imp variable
##   1   1  MP01  MP02  MP03  MP04  MP05  MP06  MP07  MP08  MP10  MP11  MP12  MP14  MP22  MP23  MP24  MP25  MP26  MP27  MP28  MP29  MP30  MP31  MP33  MP34  MP35  MP36  MP40  MP41
##   2   1  MP01  MP02  MP03  MP04  MP05  MP06  MP07  MP08  MP10  MP11  MP12  MP14  MP22  MP23  MP24  MP25  MP26  MP27  MP28  MP29  MP30  MP31  MP33  MP34  MP35  MP36  MP40  MP41
##   3   1  MP01  MP02  MP03  MP04  MP05  MP06  MP07  MP08  MP10  MP11  MP12  MP14  MP22  MP23  MP24  MP25  MP26  MP27  MP28  MP29  MP30  MP31  MP33  MP34  MP35  MP36  MP40  MP41
##   4   1  MP01  MP02  MP03  MP04  MP05  MP06  MP07  MP08  MP10  MP11  MP12  MP14  MP22  MP23  MP24  MP25  MP26  MP27  MP28  MP29  MP30  MP31  MP33  MP34  MP35  MP36  MP40  MP41
##   5   1  MP01  MP02  MP03  MP04  MP05  MP06  MP07  MP08  MP10  MP11  MP12  MP14  MP22  MP23  MP24  MP25  MP26  MP27  MP28  MP29  MP30  MP31  MP33  MP34  MP35  MP36  MP40  MP41

The data set will be split into a training and test set with a split ratio of 0.8.
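A sketch of the split using caret's createDataPartition (the seed is an assumption; p = 0.8 roughly reproduces the 144-row training set seen in the output below):

```r
set.seed(42)  # assumed seed
inTrain <- createDataPartition(CMP_t$Yield, p = 0.8, list = FALSE)
CMP_t_Train <- CMP_t[inTrain, ]
CMP_t_Test  <- CMP_t[-inTrain, ]
```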

Accuracy of Linear Regression Models

These are the accuracy metrics from four linear regression models (Linear Regression, Robust Linear Model, Partial Least Squares, and Ridge Regression) applied to the chemical manufacturing process data set:

##                   Model      RMSE   Rsquare
## 1     Linear Regression 0.6039194 0.7519582
## 2   Robust Linear Model 0.5688108 0.7728901
## 3 Partial Least Squares 0.5128704 0.8081410
## 4      Ridge-regression 0.5494331 0.7918470

Even the linear models show a good Rsquare performance metric, with the best performer being the Partial Least Squares algorithm.

  1. Which nonlinear regression model gives the optimal resampling and test set performance?

k-Nearest Neighbors

## k-Nearest Neighbors 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    5  0.8036573  0.3530434  0.6358175
##    7  0.7822577  0.3854600  0.6280176
##    9  0.7737430  0.3999663  0.6230066
##   11  0.7707369  0.4062028  0.6262722
##   13  0.7719195  0.4075877  0.6312388
##   15  0.7830572  0.3935152  0.6434488
##   17  0.7855383  0.3995409  0.6456167
##   19  0.7881950  0.4018174  0.6497210
##   21  0.7949948  0.3952094  0.6552899
##   23  0.8010489  0.3924348  0.6589202
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.

In the plot below we see how the calculated RMSE on the test data set depends on the number of neighbors k; the calculated optimal value is k = 11.

##      RMSE  Rsquared       MAE 
## 0.6566325 0.5828532 0.5383922


Support Vector Machine

##                    Model    RMSE   Rsquare
## 1 Support Vector Machine 0.54684 0.7955492

Multivariate Adaptive Regression Splines (MARS)

## Selected 2 of 2 terms, and 1 of 57 predictors
## Termination condition: Reached maximum RSq 0.9990 at 2 terms
## Importance: Yield, BM01-unused, BM02-unused, BM03-unused, BM04-unused, ...
## Number of terms at each degree of interaction: 1 1 (additive model)
## GCV 0    RSS 0    GRSq 1    RSq 1

## Call: earth(x=CMP_t_Train, y=CMP_t_Train$Yield)
## 
##              coefficients
## (Intercept) -5.782412e-19
## Yield        1.000000e+00
## 
## Selected 2 of 2 terms, and 1 of 57 predictors
## Termination condition: Reached maximum RSq 0.9990 at 2 terms
## Importance: Yield, BM01-unused, BM02-unused, BM03-unused, BM04-unused, ...
## Number of terms at each degree of interaction: 1 1 (additive model)
## GCV 0    RSS 0    GRSq 1    RSq 1
## Call: earth(x=data.frame[144,57], y=c(-1.216,1.21,0...), keepxy=TRUE, degree=1,
##             nprune=2)
## 
##              coefficients
## (Intercept) -5.782412e-19
## Yield        1.000000e+00
## 
## Selected 2 of 2 terms, and 1 of 57 predictors (nprune=2)
## Termination condition: Reached maximum RSq 0.9990 at 2 terms
## Importance: Yield, BM01-unused, BM02-unused, BM03-unused, BM04-unused, ...
## Number of terms at each degree of interaction: 1 1 (additive model)
## GCV 0    RSS 0    GRSq 1    RSq 1

Neural Network

##         Length Class  Mode     
## model    5     -none- list     
## repeats  1     -none- numeric  
## bag      1     -none- logical  
## seeds    5     -none- numeric  
## names   57     -none- character

Accuracy of Non-Linear Regression Models

The Non-Linear Regression Models show a greater variation in accuracy as compared to the Linear Models. The k-Nearest Neighbors model performed worse than the worst-performing Linear Model (Linear Regression). The avNNet and nnet Neural Network algorithms had very good performance, with a calculated Rsquare close to 0.99. The MARS models show a suspiciously perfect Rsquare value of 1. The earth() call shown above explains why: the full training data frame, including the response Yield, was passed as the predictor matrix, so the model simply selected Yield itself as its only term (data leakage), as the coefficient table confirms.

##                    Model         RMSE   Rsquare
## 1    k-Nearest Neighbors 7.925693e-01 0.6068033
## 2  Neural Network avNNet 1.317401e-01 0.9868748
## 3    Neural Network nnet 1.688800e-01 0.9791014
## 4             MARS Tuned 2.336804e-16 1.0000000
## 5                   MARS 2.336804e-16 1.0000000
## 6 Support Vector Machine 5.468400e-01 0.7955492
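The fix is to drop the response column from the predictor matrix before fitting; a minimal sketch (marsFit2 is a hypothetical name):

```r
# Exclude Yield from the predictors to avoid the leakage shown above
marsFit2 <- earth(x = subset(CMP_t_Train, select = -Yield),
                  y = CMP_t_Train$Yield)
summary(marsFit2)
```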
  1. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Top-Predictors Non-Linear Models

We see that the top predictor (Manufacturing Process MP32) is the same for the Linear and Non-Linear Models. Different models with a high Rsquare value assign the remaining predictors different degrees of importance: non-linear models give more weight to Biological Material (BM) predictors after MP32, while linear models mix BM and MP predictors after MP32.

Top-Predictors Linear Models
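A hedged sketch of how these two importance plots might be produced with caret's varImp(), assuming nnetTune and plsTune are the fitted non-linear and linear model objects for this data set:

```r
# Top ten predictors for the best non-linear and linear models
plot(varImp(nnetTune), top = 10, main = "Top predictors: non-linear model")
plot(varImp(plsTune),  top = 10, main = "Top predictors: linear model")
```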

  1. Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

We see from the plot below that the Manufacturing Process predictors have a higher correlation with the response variable Yield than the Biological Material predictors.

Top Predictor In All Models

We can observe a clear and positive linear relationship between Yield and MP32.
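A minimal sketch of this scatter plot (assuming the transformed training data and the MP32 column name):

```r
library(ggplot2)

ggplot(CMP_t_Train, aes(x = MP32, y = Yield)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # linear trend line
  labs(title = "Yield vs. MP32")
```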

Top Non-Linear Model Predictors

We cannot observe any strong pattern in the relationship between Yield and the top Manufacturing Process (MP) or Biological Material (BM) predictors; at most a weak linear relationship appears for either group. However, the MP predictors appear less scattered (lower variance) than the BM predictors.

Top Linear Model Predictors