Applied Predictive Modeling.

Instructions

Do problems 7.2 and 7.5 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .rmd code.

URL: http://appliedpredictivemodeling.com/

Exercises

7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\[y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\] where the \(x\) values are random variables uniformly distributed on \([0, 1]\) (the simulation also creates 5 other non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the ' x ' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.

## This creates a list with a vector ' y ' and a matrix
## of predictors ' x ' . Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

library(caret)
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
              
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.565620  0.4887976  2.886629
##    7  3.422420  0.5300524  2.752964
##    9  3.368072  0.5536927  2.715310
##   11  3.323010  0.5779056  2.669375
##   13  3.275835  0.6030846  2.628663
##   15  3.261864  0.6163510  2.621192
##   17  3.261973  0.6267032  2.616956
##   19  3.286299  0.6281075  2.640585
##   21  3.280950  0.6390386  2.643807
##   23  3.292397  0.6440392  2.656080
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
knnPred <- predict(knnModel, newdata = testData$x)
## The function ' postResample ' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)

Let’s visualize the results:

           Test set (postResample)   knnModel (resampled)    Diff
RMSE       3.1750657                 3.261864               -0.087
Rsquared   0.6785946                 0.616351                0.062
MAE        2.5443169                 2.621192               -0.077

First, I would like to visualize the correlations.
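A correlation chart like the one referenced below could be produced, for example, with the corrplot package (a sketch; the original plotting code is not shown, so the package and options here are assumptions):

library(corrplot)

# Pairwise correlations among the ten simulated predictors
corMat <- cor(trainingData$x)
corrplot(corMat, method = "circle", order = "hclust")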

Correlation chart for the given data set.

From the correlation chart, we can see that the linear correlations between the predictors are low.

nnet Neural Network Model

# Neural Network Model 

# Fix the seed so that the results can be reproduced
set.seed(100)
nnetModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "nnet",
                  preProc = c("center", "scale"),
                  tuneLength = 10,
                  # The linear relationship between the hidden
                  # units and the prediction can be used with the 
                  # option linout = TRUE .
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                  maxit = 500)
## Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   size  decay         RMSE      Rsquared   MAE     
##    1    0.0000000000  2.497600  0.7545522  1.956054
##    1    0.0001000000  2.734632  0.6972571  2.162137
##    1    0.0002371374  2.638259  0.7210747  2.067387
##    1    0.0005623413  2.611553  0.7289837  2.049921
##    1    0.0013335214  2.562094  0.7383177  2.015443
##    1    0.0031622777  2.605143  0.7298392  2.050391
##    1    0.0074989421  2.636422  0.7231220  2.085583
##    1    0.0177827941  2.496140  0.7543174  1.953291
##    1    0.0421696503  2.498996  0.7534972  1.955317
##    1    0.1000000000  2.506223  0.7517278  1.961803
##    3    0.0000000000  2.856496  0.6962672  2.223649
##    3    0.0001000000  2.906191  0.6874898  2.264771
##    3    0.0002371374  2.896291  0.6850854  2.286116
##    3    0.0005623413  2.834275  0.7101224  2.182843
##    3    0.0013335214  3.015298  0.6619003  2.357832
##    3    0.0031622777  3.054260  0.6589361  2.387193
##    3    0.0074989421  2.887075  0.6859427  2.268697
##    3    0.0177827941  2.906213  0.6831763  2.312451
##    3    0.0421696503  2.789268  0.7068723  2.219884
##    3    0.1000000000  2.760720  0.7140945  2.164362
##    5    0.0000000000  4.266845  0.5531609  2.853385
##    5    0.0001000000  3.546638  0.5842393  2.641530
##    5    0.0002371374  3.997344  0.5360713  2.962281
##    5    0.0005623413  4.026938  0.5429662  2.965001
##    5    0.0013335214  3.314771  0.6272433  2.611271
##    5    0.0031622777  3.420932  0.6030587  2.652169
##    5    0.0074989421  3.475582  0.6078295  2.688666
##    5    0.0177827941  3.249637  0.6204248  2.564657
##    5    0.0421696503  3.170282  0.6461321  2.523590
##    5    0.1000000000  3.137383  0.6495947  2.466052
##    7    0.0000000000  8.002307  0.3381807  4.353010
##    7    0.0001000000  4.430284  0.5056887  3.209074
##    7    0.0002371374  4.135117  0.5297763  3.132095
##    7    0.0005623413  4.232626  0.5060780  3.281377
##    7    0.0013335214  4.443191  0.4897367  3.355825
##    7    0.0031622777  4.080581  0.5214544  3.110683
##    7    0.0074989421  3.837268  0.5616421  3.017291
##    7    0.0177827941  3.848676  0.5401837  3.018204
##    7    0.0421696503  3.528795  0.6016891  2.794963
##    7    0.1000000000  3.350476  0.6285659  2.653889
##    9    0.0000000000  4.983956  0.4497053  3.591067
##    9    0.0001000000  4.887547  0.4711915  3.536666
##    9    0.0002371374  3.904550  0.5247603  3.103848
##    9    0.0005623413  4.163710  0.5107779  3.266036
##    9    0.0013335214  4.105538  0.5096156  3.218040
##    9    0.0031622777  4.037495  0.5201388  3.190533
##    9    0.0074989421  3.976522  0.5112530  3.135026
##    9    0.0177827941  3.863966  0.5394006  3.063223
##    9    0.0421696503  3.721862  0.5504238  2.979736
##    9    0.1000000000  3.407499  0.5978234  2.716034
##   11    0.0000000000       NaN        NaN       NaN
##   11    0.0001000000       NaN        NaN       NaN
##   11    0.0002371374       NaN        NaN       NaN
##   11    0.0005623413       NaN        NaN       NaN
##   11    0.0013335214       NaN        NaN       NaN
##   11    0.0031622777       NaN        NaN       NaN
##   11    0.0074989421       NaN        NaN       NaN
##   11    0.0177827941       NaN        NaN       NaN
##   11    0.0421696503       NaN        NaN       NaN
##   11    0.1000000000       NaN        NaN       NaN
##   13    0.0000000000       NaN        NaN       NaN
##   13    0.0001000000       NaN        NaN       NaN
##   13    0.0002371374       NaN        NaN       NaN
##   13    0.0005623413       NaN        NaN       NaN
##   13    0.0013335214       NaN        NaN       NaN
##   13    0.0031622777       NaN        NaN       NaN
##   13    0.0074989421       NaN        NaN       NaN
##   13    0.0177827941       NaN        NaN       NaN
##   13    0.0421696503       NaN        NaN       NaN
##   13    0.1000000000       NaN        NaN       NaN
##   15    0.0000000000       NaN        NaN       NaN
##   15    0.0001000000       NaN        NaN       NaN
##   15    0.0002371374       NaN        NaN       NaN
##   15    0.0005623413       NaN        NaN       NaN
##   15    0.0013335214       NaN        NaN       NaN
##   15    0.0031622777       NaN        NaN       NaN
##   15    0.0074989421       NaN        NaN       NaN
##   15    0.0177827941       NaN        NaN       NaN
##   15    0.0421696503       NaN        NaN       NaN
##   15    0.1000000000       NaN        NaN       NaN
##   17    0.0000000000       NaN        NaN       NaN
##   17    0.0001000000       NaN        NaN       NaN
##   17    0.0002371374       NaN        NaN       NaN
##   17    0.0005623413       NaN        NaN       NaN
##   17    0.0013335214       NaN        NaN       NaN
##   17    0.0031622777       NaN        NaN       NaN
##   17    0.0074989421       NaN        NaN       NaN
##   17    0.0177827941       NaN        NaN       NaN
##   17    0.0421696503       NaN        NaN       NaN
##   17    0.1000000000       NaN        NaN       NaN
##   19    0.0000000000       NaN        NaN       NaN
##   19    0.0001000000       NaN        NaN       NaN
##   19    0.0002371374       NaN        NaN       NaN
##   19    0.0005623413       NaN        NaN       NaN
##   19    0.0013335214       NaN        NaN       NaN
##   19    0.0031622777       NaN        NaN       NaN
##   19    0.0074989421       NaN        NaN       NaN
##   19    0.0177827941       NaN        NaN       NaN
##   19    0.0421696503       NaN        NaN       NaN
##   19    0.1000000000       NaN        NaN       NaN
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.01778279.
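Note that the rows of NaN above correspond to network sizes whose total number of weights exceeds the MaxNWts limit set in the call, so those candidate models could not be fit.

The test-set column in the comparison below can be obtained the same way as for knn, for example (a sketch):

# Predict on the large simulated test set and summarize performance,
# analogous to the knn example above
nnetPred <- predict(nnetModel, newdata = testData$x)
postResample(pred = nnetPred, obs = testData$y)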

Let’s visualize the results:

           Test set (postResample)   nnetModel (resampled)   Diff
RMSE       2.6435865                 2.4961400                0.147
Rsquared   0.7193278                 0.7543174               -0.035
MAE        2.0236815                 1.9532910                0.070

SVM Support Vector Machines Model

#  Support Vector Machines Model 

# Fix the seed so that the results can be reproduced
set.seed(100)
svmModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.534788  0.7882081  2.034824
##      0.50  2.292127  0.8029516  1.819981
##      1.00  2.091598  0.8284381  1.657402
##      2.00  1.967193  0.8457471  1.546737
##      4.00  1.883133  0.8561761  1.482054
##      8.00  1.863807  0.8588797  1.468328
##     16.00  1.834215  0.8633819  1.456738
##     32.00  1.836471  0.8632508  1.459909
##     64.00  1.836471  0.8632508  1.459909
##    128.00  1.836471  0.8632508  1.459909
##    256.00  1.836471  0.8632508  1.459909
##    512.00  1.836471  0.8632508  1.459909
##   1024.00  1.836471  0.8632508  1.459909
##   2048.00  1.836471  0.8632508  1.459909
## 
## Tuning parameter 'sigma' was held constant at a value of 0.0552698
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0552698 and C = 16.

Let’s visualize the results:

           Test set (postResample)   svmModel (resampled)    Diff
RMSE       2.0490047                 1.8342150                0.215
Rsquared   0.8297577                 0.8633819               -0.034
MAE        1.5586106                 1.4567380                0.102

MARS Multivariate Adaptive Regression Splines Model

#  Multivariate Adaptive Regression Splines Model 

# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

# Fix the seed so that the results can be reproduced
set.seed(100)
marsModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "earth",
                  preProc = c("center", "scale"),
                  # Explicitly declare the candidate models to test
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      4.489470  0.2020919  3.6881383
##   1        3      3.804210  0.4141260  3.0607824
##   1        4      2.622468  0.7176090  2.0807296
##   1        5      2.284475  0.7795541  1.8371183
##   1        6      2.287789  0.7792202  1.7862257
##   1        7      1.754222  0.8744211  1.3842606
##   1        8      1.701785  0.8808238  1.3087709
##   1        9      1.710506  0.8808018  1.3269353
##   1       10      1.684064  0.8833678  1.3218697
##   1       11      1.616665  0.8902847  1.2700615
##   1       12      1.620843  0.8883284  1.2784627
##   1       13      1.615887  0.8888966  1.2711103
##   1       14      1.615887  0.8888966  1.2711103
##   1       15      1.622796  0.8875917  1.2771573
##   1       16      1.622796  0.8875917  1.2771573
##   1       17      1.622796  0.8875917  1.2771573
##   1       18      1.622796  0.8875917  1.2771573
##   1       19      1.622796  0.8875917  1.2771573
##   1       20      1.622796  0.8875917  1.2771573
##   1       21      1.622796  0.8875917  1.2771573
##   1       22      1.622796  0.8875917  1.2771573
##   1       23      1.622796  0.8875917  1.2771573
##   1       24      1.622796  0.8875917  1.2771573
##   1       25      1.622796  0.8875917  1.2771573
##   1       26      1.622796  0.8875917  1.2771573
##   1       27      1.622796  0.8875917  1.2771573
##   1       28      1.622796  0.8875917  1.2771573
##   1       29      1.622796  0.8875917  1.2771573
##   1       30      1.622796  0.8875917  1.2771573
##   1       31      1.622796  0.8875917  1.2771573
##   1       32      1.622796  0.8875917  1.2771573
##   1       33      1.622796  0.8875917  1.2771573
##   1       34      1.622796  0.8875917  1.2771573
##   1       35      1.622796  0.8875917  1.2771573
##   1       36      1.622796  0.8875917  1.2771573
##   1       37      1.622796  0.8875917  1.2771573
##   1       38      1.622796  0.8875917  1.2771573
##   2        2      4.489470  0.2020919  3.6881383
##   2        3      3.804210  0.4141260  3.0607824
##   2        4      2.622468  0.7176090  2.0807296
##   2        5      2.284475  0.7795541  1.8371183
##   2        6      2.312578  0.7782746  1.8037749
##   2        7      1.780599  0.8724334  1.4062049
##   2        8      1.712181  0.8801027  1.3038033
##   2        9      1.535110  0.9026584  1.2201285
##   2       10      1.357614  0.9218402  1.0470553
##   2       11      1.271188  0.9371200  0.9916035
##   2       12      1.238666  0.9412852  0.9680962
##   2       13      1.258187  0.9375168  0.9837376
##   2       14      1.271254  0.9366262  1.0024425
##   2       15      1.253367  0.9375668  0.9901281
##   2       16      1.256205  0.9376482  1.0077633
##   2       17      1.256014  0.9378510  0.9982979
##   2       18      1.256014  0.9378510  0.9982979
##   2       19      1.256014  0.9378510  0.9982979
##   2       20      1.256014  0.9378510  0.9982979
##   2       21      1.256014  0.9378510  0.9982979
##   2       22      1.256014  0.9378510  0.9982979
##   2       23      1.256014  0.9378510  0.9982979
##   2       24      1.256014  0.9378510  0.9982979
##   2       25      1.256014  0.9378510  0.9982979
##   2       26      1.256014  0.9378510  0.9982979
##   2       27      1.256014  0.9378510  0.9982979
##   2       28      1.256014  0.9378510  0.9982979
##   2       29      1.256014  0.9378510  0.9982979
##   2       30      1.256014  0.9378510  0.9982979
##   2       31      1.256014  0.9378510  0.9982979
##   2       32      1.256014  0.9378510  0.9982979
##   2       33      1.256014  0.9378510  0.9982979
##   2       34      1.256014  0.9378510  0.9982979
##   2       35      1.256014  0.9378510  0.9982979
##   2       36      1.256014  0.9378510  0.9982979
##   2       37      1.256014  0.9378510  0.9982979
##   2       38      1.256014  0.9378510  0.9982979
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 12 and degree = 2.

Let’s visualize the results:

           Test set (postResample)   marsModel (resampled)   Diff
RMSE       1.3227340                 1.2386660                0.084
Rsquared   0.9291489                 0.9412852               -0.012
MAE        1.0524686                 0.9680962                0.084

Which models appear to give the best performance?

Let’s compare the returned values from the postResample function:

           knn         nnet        svm         mars
RMSE       3.1750657   2.6435865   2.0490047   1.3227340
Rsquared   0.6785946   0.7193278   0.8297577   0.9291489
MAE        2.5443169   2.0236815   1.5586106   1.0524686
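This comparison could be assembled along these lines (a sketch; knnPred and nnetPred come from the earlier predictions, while svmPred and marsPred are illustrative names for predictions created the same way):

# Collect test-set metrics for the four tuned models into one table
svmPred  <- predict(svmModel,  newdata = testData$x)
marsPred <- predict(marsModel, newdata = testData$x)

results <- data.frame(
  knn  = postResample(knnPred,  testData$y),
  nnet = postResample(nnetPred, testData$y),
  svm  = postResample(svmPred,  testData$y),
  mars = postResample(marsPred, testData$y)
)
round(results, 4)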

As the table shows, the MARS model returns the lowest RMSE along with the highest \(R^2\) on the test set.

Let’s compare the resampling estimates from the original models.

           knn        nnet        svm         mars
RMSE       3.261864   2.4961400   1.8342150   1.2386660
Rsquared   0.616351   0.7543174   0.8633819   0.9412852
MAE        2.621192   1.9532910   1.4567380   0.9680962

As before, the MARS model provides the lowest RMSE along with the highest \(R^2\).

Does MARS select the informative predictors (those named X1 – X5 )?

Let’s visualize the results from the MARS model.
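The summary below could be printed with, for example, summary(marsModel$finalModel); varImp(marsModel) would give a comparable importance ranking (a sketch of the call, since the exact code is not shown):

# Summary of the final earth (MARS) model selected by caret
summary(marsModel$finalModel)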

## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE,
##             degree=2, nprune=12)
## 
##                                   coefficients
## (Intercept)                          21.690154
## h(0.507267-X1)                       -4.203744
## h(X1-0.507267)                        3.072355
## h(0.325504-X2)                       -5.314859
## h(-0.216741-X3)                       3.320304
## h(X3- -0.216741)                      2.321760
## h(0.953812-X4)                       -2.775288
## h(X4-0.953812)                        2.778320
## h(1.17878-X5)                        -1.607769
## h(X1-0.507267) * h(X2- -0.798188)    -3.199202
## h(0.606835-X1) * h(0.325504-X2)       2.030856
## h(0.325504-X2) * h(X3-0.795427)       1.369704
## 
## Selected 12 of 21 terms, and 5 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, ...
## Number of terms at each degree of interaction: 1 8 3
## GCV 1.842426    RSS 270.9495    GRSq 0.9251967    RSq 0.9444425

Answer:

From the summary above, the MARS model selected 5 of the 10 predictors, and its importance ranking lists X1, X4, X2, X5, and X3, while X6–X10 are unused. Hence, MARS selects exactly the informative predictors X1–X5.

7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Process

Let’s visualize the missing values to get a better understanding of the missing data.
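A minimal sketch of how the data could be loaded and the missingness summarized (assuming, as in Exercise 6.3, that the data come from the AppliedPredictiveModeling package; the original plots are not shown):

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

# Missing values per column, largest counts first
naCounts <- colSums(is.na(ChemicalManufacturingProcess))
sort(naCounts[naCounts > 0], decreasing = TRUE)

# Overall proportion of missing cells (roughly 1%)
mean(is.na(ChemicalManufacturingProcess))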

Since several predictors have only a small number of missing values, I believe it is better to replace the missing entries with the mean of the respective predictor. I believe this to be the best approach because:

  • Using only complete cases would dramatically reduce the data set, with the downside of losing potentially valuable information currently present in the data.

  • Replacing just a few records per predictor with the mean has little impact on the data set, since only about 1% of the data is missing.

# Procedure to replace NAs with the mean of the respective predictor.
for (i in 1:ncol(ChemicalManufacturingProcess)) {
  
  totalNA <- sum(is.na(ChemicalManufacturingProcess[, i]))
  
  if (totalNA > 0) {
    meanColumn <- mean(ChemicalManufacturingProcess[, i], na.rm = TRUE)
    ChemicalManufacturingProcess[is.na(ChemicalManufacturingProcess[, i]), i] <- meanColumn
  }
}

Preprocess Center & Scale

To gain some computational advantage, I will pre-process the data set before fitting any model. Note that this could also be done as part of the model-training call, but I prefer to do it beforehand.

To pre-process the data, we center and scale each predictor:

\[x^{\text{scaled}}_i = \frac{x_i - \mu}{\sigma}\]

where \(\mu\) and \(\sigma\) are the mean and standard deviation of that predictor.

We can achieve this by employing the preProcess function from the caret library. This function has the ability to transform, center, scale, or impute values.

# Function to pre-process data.
library(caret)
trans <- preProcess(ChemicalManufacturingProcess, 
                    method = c("center", "scale"))

# Need to obtain new transformed values
CMP.trans <- predict(trans, ChemicalManufacturingProcess)

Let’s find correlations:

The list below gives the column numbers of the predictors that are highly correlated (pairwise correlation \(> 0.9\)).

##  [1]  3  5 13 42 55 38 40 44 31 53

From the above, we notice some strong linear correlations between predictors.

Based on that, I will remove the highly correlated predictors (\(> 0.9\)).
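One way to identify and drop these columns is caret::findCorrelation (a sketch; the exact call that produced the list above is not shown, and the index mapping here is an assumption):

# Flag predictors with pairwise correlations above 0.9 (Yield excluded)
corMatrix <- cor(CMP.trans[, -1])
highCorr  <- findCorrelation(corMatrix, cutoff = 0.9)

# Drop the flagged predictors (shift by 1 because Yield is column 1)
CMP.trans_reduced <- CMP.trans[, -(highCorr + 1)]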

Let’s find near zero variance

Now, I will proceed to find predictors that have a near zero variance.

From the analysis, it is determined that BiologicalMaterial07 has a near zero variance. This predictor will be removed as well.
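This check could be done with caret::nearZeroVar, for example (a sketch):

# Identify predictors with near-zero variance
nzv <- nearZeroVar(CMP.trans_reduced)
colnames(CMP.trans_reduced)[nzv]  # expected to flag BiologicalMaterial07

# Drop the flagged predictor(s)
if (length(nzv) > 0) CMP.trans_reduced <- CMP.trans_reduced[, -nzv]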

Split train & test

Now, I will split the data into 75% training and 25% test.

# Now, I will split the data as follows:
# Training 75%
# Test     25%
set.seed(123)
n <- nrow(CMP.trans_reduced)
trainIndex <- sample(1:n, size = round(0.75*n), replace=FALSE)
CMPtrain <- CMP.trans_reduced[trainIndex ,]
CMPtest <- CMP.trans_reduced[-trainIndex ,]

knn K-Nearest Neighbors

Let’s fit this model.

# Fix the seed so that the results can be reproduced
set.seed(100)
knnModel <- train(x = CMPtrain[,-1], # Yield is in column 1
                 y = CMPtrain$Yield,  # Yield is in column 1
                 method = "knn",
                 # Center and scaling will occur for new predictions too
                 #preProc = c("center", "scale"), #already centered & scaled 
                 tuneGrid = data.frame(.k = 1:20),
                 trControl = trainControl(method = "cv"))
## k-Nearest Neighbors 
## 
## 132 samples
##  46 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 118, 118, 117, 119, 119, 119, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    1  0.6794000  0.5757569  0.5223422
##    2  0.6176822  0.6305230  0.5164498
##    3  0.6605027  0.5753067  0.5369146
##    4  0.6708427  0.5593300  0.5453075
##    5  0.6765834  0.5631368  0.5509474
##    6  0.6987497  0.5397590  0.5620890
##    7  0.7144364  0.5193643  0.5730330
##    8  0.7210814  0.5174631  0.5779199
##    9  0.7201191  0.5096748  0.5694338
##   10  0.7346904  0.4861745  0.5743579
##   11  0.7383440  0.4882183  0.5835242
##   12  0.7490996  0.4706029  0.5912526
##   13  0.7452880  0.4745400  0.5916742
##   14  0.7554552  0.4606574  0.6063589
##   15  0.7660497  0.4410709  0.6140884
##   16  0.7656175  0.4426411  0.6163335
##   17  0.7723910  0.4338911  0.6201163
##   18  0.7734641  0.4381213  0.6212987
##   19  0.7736354  0.4383681  0.6194655
##   20  0.7776377  0.4349992  0.6222963
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 2.

Let’s test the above model on the held-out test data.
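For example (a sketch, using the CMPtest split created above):

# Predict Yield on the test set and summarize performance
knnPred <- predict(knnModel, newdata = CMPtest[, -1])
postResample(pred = knnPred, obs = CMPtest$Yield)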

Let’s visualize the results:

           Test set (postResample)   knnModel (resampled)    Diff
RMSE       0.7857604                 0.6176822                0.168
Rsquared   0.3984343                 0.6305230               -0.232
MAE        0.6049798                 0.5164498                0.089

nnet Neural Network Model

# Neural Network Model 

# Fix the seed so that the results can be reproduced
set.seed(100)
nnetModel <- train(x = CMPtrain[,-1], # Yield is in column 1
                   y = CMPtrain$Yield,  # Yield is in column 1
                  method = "nnet",
                  #preProc = c("center", "scale"),
                  tuneLength = 10,
                  # The linear relationship between the hidden
                  # units and the prediction can be used with the 
                  # option linout = TRUE .
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(CMPtrain[,-1]) + 1) + 10 + 1,
                  maxit = 500)
## Neural Network 
## 
## 132 samples
##  46 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   size  decay         RMSE       Rsquared   MAE      
##    1    0.0000000000  1.1796248  0.1968185  0.9062800
##    1    0.0001000000  1.1253088  0.2051884  0.8982873
##    1    0.0002371374  1.1626150  0.2031684  0.9284455
##    1    0.0005623413  1.1351927  0.2147849  0.9059153
##    1    0.0013335214  1.1721067  0.1854326  0.9446886
##    1    0.0031622777  1.1353498  0.2046183  0.9137300
##    1    0.0074989421  1.1131284  0.2320241  0.8830995
##    1    0.0177827941  1.0932424  0.2432816  0.8712249
##    1    0.0421696503  1.0577946  0.2598119  0.8481323
##    1    0.1000000000  1.0130936  0.2976289  0.8144259
##    3    0.0000000000  1.3954744  0.2265805  1.1065449
##    3    0.0001000000  1.4409364  0.2105104  1.1204449
##    3    0.0002371374  1.2194177  0.2322903  0.9592999
##    3    0.0005623413  1.1915407  0.2467710  0.9367788
##    3    0.0013335214  1.1402344  0.2461006  0.8986199
##    3    0.0031622777  1.0552791  0.3057729  0.8359662
##    3    0.0074989421  1.0074765  0.3461289  0.7945078
##    3    0.0177827941  0.9646867  0.3443929  0.7721008
##    3    0.0421696503  0.9779622  0.3463749  0.7694378
##    3    0.1000000000  0.9155119  0.3827726  0.7291705
##    5    0.0000000000  1.2433499  0.2102741  0.9849966
##    5    0.0001000000  0.9954243  0.3134169  0.8012460
##    5    0.0002371374  0.9433686  0.3597121  0.7424088
##    5    0.0005623413  0.8977300  0.3745157  0.7079160
##    5    0.0013335214  0.9091758  0.3877541  0.7159963
##    5    0.0031622777  0.8740793  0.4121510  0.6968985
##    5    0.0074989421  0.8567089  0.4137448  0.6854498
##    5    0.0177827941  0.8536331  0.4061556  0.6860885
##    5    0.0421696503  0.8693627  0.3962918  0.6858137
##    5    0.1000000000  0.8465231  0.4034549  0.6798335
##    7    0.0000000000  1.0347538  0.2907921  0.8207750
##    7    0.0001000000  0.8695586  0.3931258  0.6882341
##    7    0.0002371374  0.8462636  0.4192968  0.6720169
##    7    0.0005623413  0.8507805  0.4024947  0.6852187
##    7    0.0013335214  0.8412553  0.4172394  0.6644665
##    7    0.0031622777  0.8330734  0.4387487  0.6688182
##    7    0.0074989421  0.8410750  0.4257502  0.6737509
##    7    0.0177827941  0.8041428  0.4598524  0.6365908
##    7    0.0421696503  0.8147847  0.4401499  0.6480599
##    7    0.1000000000  0.8068015  0.4462963  0.6451722
##    9    0.0000000000  0.9996566  0.3298175  0.7968458
##    9    0.0001000000  0.8214779  0.4394031  0.6525946
##    9    0.0002371374  0.8525284  0.4174344  0.6800225
##    9    0.0005623413  0.8379829  0.4197734  0.6780517
##    9    0.0013335214  0.8204291  0.4400485  0.6528655
##    9    0.0031622777  0.8277418  0.4281831  0.6545178
##    9    0.0074989421  0.8160242  0.4408754  0.6511295
##    9    0.0177827941  0.8271611  0.4262726  0.6558029
##    9    0.0421696503  0.8040919  0.4432069  0.6394613
##    9    0.1000000000  0.8117241  0.4433677  0.6445780
##   11    0.0000000000        NaN        NaN        NaN
##   11    0.0001000000        NaN        NaN        NaN
##   11    0.0002371374        NaN        NaN        NaN
##   11    0.0005623413        NaN        NaN        NaN
##   11    0.0013335214        NaN        NaN        NaN
##   11    0.0031622777        NaN        NaN        NaN
##   11    0.0074989421        NaN        NaN        NaN
##   11    0.0177827941        NaN        NaN        NaN
##   11    0.0421696503        NaN        NaN        NaN
##   11    0.1000000000        NaN        NaN        NaN
##   13    0.0000000000        NaN        NaN        NaN
##   13    0.0001000000        NaN        NaN        NaN
##   13    0.0002371374        NaN        NaN        NaN
##   13    0.0005623413        NaN        NaN        NaN
##   13    0.0013335214        NaN        NaN        NaN
##   13    0.0031622777        NaN        NaN        NaN
##   13    0.0074989421        NaN        NaN        NaN
##   13    0.0177827941        NaN        NaN        NaN
##   13    0.0421696503        NaN        NaN        NaN
##   13    0.1000000000        NaN        NaN        NaN
##   15    0.0000000000        NaN        NaN        NaN
##   15    0.0001000000        NaN        NaN        NaN
##   15    0.0002371374        NaN        NaN        NaN
##   15    0.0005623413        NaN        NaN        NaN
##   15    0.0013335214        NaN        NaN        NaN
##   15    0.0031622777        NaN        NaN        NaN
##   15    0.0074989421        NaN        NaN        NaN
##   15    0.0177827941        NaN        NaN        NaN
##   15    0.0421696503        NaN        NaN        NaN
##   15    0.1000000000        NaN        NaN        NaN
##   17    0.0000000000        NaN        NaN        NaN
##   17    0.0001000000        NaN        NaN        NaN
##   17    0.0002371374        NaN        NaN        NaN
##   17    0.0005623413        NaN        NaN        NaN
##   17    0.0013335214        NaN        NaN        NaN
##   17    0.0031622777        NaN        NaN        NaN
##   17    0.0074989421        NaN        NaN        NaN
##   17    0.0177827941        NaN        NaN        NaN
##   17    0.0421696503        NaN        NaN        NaN
##   17    0.1000000000        NaN        NaN        NaN
##   19    0.0000000000        NaN        NaN        NaN
##   19    0.0001000000        NaN        NaN        NaN
##   19    0.0002371374        NaN        NaN        NaN
##   19    0.0005623413        NaN        NaN        NaN
##   19    0.0013335214        NaN        NaN        NaN
##   19    0.0031622777        NaN        NaN        NaN
##   19    0.0074989421        NaN        NaN        NaN
##   19    0.0177827941        NaN        NaN        NaN
##   19    0.0421696503        NaN        NaN        NaN
##   19    0.1000000000        NaN        NaN        NaN
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 9 and decay = 0.04216965.

Let’s visualize the results:

           Test set (postResample)   nnetModel (resampled)   Diff
RMSE       0.7093090                 0.8040919               -0.095
Rsquared   0.6042081                 0.4432069                0.161
MAE        0.5762007                 0.6394613               -0.063

SVM Support Vector Machines Model

#  Support Vector Machines Model 

# Fix the seed so that the results can be reproduced
set.seed(100)
svmModel <- train(x = CMPtrain[,-1], # Yield is in column 1
                  y = CMPtrain$Yield,  # Yield is in column 1
                  method = "svmRadial",
                  #preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 132 samples
##  46 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 118, 118, 117, 119, 119, 119, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE      
##      0.25  0.7871002  0.4502506  0.6302134
##      0.50  0.7313778  0.4912644  0.5982827
##      1.00  0.6801686  0.5443121  0.5534225
##      2.00  0.6463510  0.5834226  0.5292535
##      4.00  0.6099049  0.6251177  0.5072709
##      8.00  0.6028416  0.6385953  0.5063231
##     16.00  0.6012018  0.6410990  0.5052518
##     32.00  0.6012018  0.6410990  0.5052518
##     64.00  0.6012018  0.6410990  0.5052518
##    128.00  0.6012018  0.6410990  0.5052518
##    256.00  0.6012018  0.6410990  0.5052518
##    512.00  0.6012018  0.6410990  0.5052518
##   1024.00  0.6012018  0.6410990  0.5052518
##   2048.00  0.6012018  0.6410990  0.5052518
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01733987
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01733987 and C = 16.

Let’s visualize the results:

           Test set (postResample)   svmModel (resampled)    Diff
RMSE       0.6132105                 0.6012018                0.012
Rsquared   0.6348329                 0.6410990               -0.006
MAE        0.4950236                 0.5052518               -0.010

MARS Multivariate Adaptive Regression Splines Model

#  Multivariate Adaptive Regression Splines Model 

# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

# Fix the seed so that the results can be reproduced
set.seed(100)
marsModel <- train(x = CMPtrain[,-1], # Yield is in column 1
                  y = CMPtrain$Yield,  # Yield is in column 1
                  method = "earth",
                  #preProc = c("center", "scale"),
                  # Explicitly declare the candidate models to test
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))
## Multivariate Adaptive Regression Spline 
## 
## 132 samples
##  46 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 118, 118, 117, 119, 119, 119, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE       Rsquared   MAE      
##   1        2      0.7608702  0.4765659  0.5940557
##   1        3      0.7402650  0.5062131  0.5873922
##   1        4      0.6913599  0.5683094  0.5487347
##   1        5      0.6656055  0.5739232  0.5260355
##   1        6      0.6921667  0.5493613  0.5444108
##   1        7      0.6699219  0.5800144  0.5197985
##   1        8      0.6821632  0.5629115  0.5277025
##   1        9      0.6851523  0.5551823  0.5319844
##   1       10      0.6900857  0.5338118  0.5408950
##   1       11      0.6902712  0.5426019  0.5368663
##   1       12      0.6802351  0.5874977  0.5171855
##   1       13      0.6743747  0.5875594  0.5225518
##   1       14      0.6724292  0.5542790  0.5312809
##   1       15      0.6755544  0.5501683  0.5357989
##   1       16      0.6782277  0.5456692  0.5385850
##   1       17      0.6814800  0.5407681  0.5392786
##   1       18      0.6836116  0.5370742  0.5391684
##   1       19      0.6888690  0.5299316  0.5418341
##   1       20      0.6878532  0.5285923  0.5394346
##   1       21      0.6980804  0.5186997  0.5457016
##   1       22      0.7028141  0.5146597  0.5478886
##   1       23      0.7028141  0.5146597  0.5478886
##   1       24      0.7011137  0.5152072  0.5492070
##   1       25      0.7059355  0.5121715  0.5525025
##   1       26      0.7059355  0.5121715  0.5525025
##   1       27      0.7059355  0.5121715  0.5525025
##   1       28      0.7059355  0.5121715  0.5525025
##   1       29      0.7059355  0.5121715  0.5525025
##   1       30      0.7059355  0.5121715  0.5525025
##   1       31      0.7059355  0.5121715  0.5525025
##   1       32      0.7059355  0.5121715  0.5525025
##   1       33      0.7059355  0.5121715  0.5525025
##   1       34      0.7059355  0.5121715  0.5525025
##   1       35      0.7059355  0.5121715  0.5525025
##   1       36      0.7059355  0.5121715  0.5525025
##   1       37      0.7059355  0.5121715  0.5525025
##   1       38      0.7059355  0.5121715  0.5525025
##   2        2      0.7608702  0.4765659  0.5940557
##   2        3      0.7758241  0.4579480  0.6058932
##   2        4      0.7154082  0.5357724  0.5668441
##   2        5      0.7698298  0.4759728  0.5998113
##   2        6      0.8106828  0.4252869  0.6170168
##   2        7      0.8523308  0.4062154  0.6596470
##   2        8      0.8850669  0.3915969  0.6621907
##   2        9      0.8887112  0.3992320  0.6706609
##   2       10      0.8761736  0.4217475  0.6520429
##   2       11      0.8948231  0.4161342  0.6709036
##   2       12      0.8706716  0.4380729  0.6464227
##   2       13      0.8713399  0.4454836  0.6286619
##   2       14      0.8445940  0.4618239  0.6181703
##   2       15      0.8565715  0.4606281  0.6209452
##   2       16      1.6149989  0.4068201  0.8872433
##   2       17      1.5865843  0.4348400  0.8666864
##   2       18      1.5694411  0.4486501  0.8529629
##   2       19      1.5533104  0.4637339  0.8410737
##   2       20      1.5453743  0.4647219  0.8346232
##   2       21      1.5507991  0.4703622  0.8378604
##   2       22      1.5614952  0.4661222  0.8450828
##   2       23      1.5594290  0.4657456  0.8449717
##   2       24      1.5594290  0.4657456  0.8449717
##   2       25      1.5594290  0.4657456  0.8449717
##   2       26      1.5594290  0.4657456  0.8449717
##   2       27      1.5594290  0.4657456  0.8449717
##   2       28      1.5594290  0.4657456  0.8449717
##   2       29      1.5594290  0.4657456  0.8449717
##   2       30      1.5594290  0.4657456  0.8449717
##   2       31      1.5594290  0.4657456  0.8449717
##   2       32      1.5594290  0.4657456  0.8449717
##   2       33      1.5594290  0.4657456  0.8449717
##   2       34      1.5594290  0.4657456  0.8449717
##   2       35      1.5594290  0.4657456  0.8449717
##   2       36      1.5594290  0.4657456  0.8449717
##   2       37      1.5594290  0.4657456  0.8449717
##   2       38      1.5594290  0.4657456  0.8449717
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 1.

Let’s visualize the results:

           Test set (postResample)   marsModel (resampled)   Diff
RMSE       0.6385416                 0.6656055               -0.027
Rsquared   0.6001174                 0.5739232                0.026
MAE        0.5051098                 0.5260355               -0.021

(a)

Which nonlinear regression model gives the optimal resampling and test set performance?

Let’s compare the returned values from the postResample function:

           knn         nnet        svm         mars
RMSE       0.7857604   0.7093090   0.6132105   0.6385416
Rsquared   0.3984343   0.6042081   0.6348329   0.6001174
MAE        0.6049798   0.5762007   0.4950236   0.5051098

As the table shows, the SVM model returns the lowest RMSE along with the highest \(R^2\) on the test set.

Let’s compare the returned values from the original training models.

           knn         nnet        svm         mars
RMSE       0.6176822   0.8040919   0.6012018   0.6656055
Rsquared   0.6305230   0.4432069   0.6410990   0.5739232
MAE        0.5164498   0.6394613   0.5052518   0.5260355

In effect, the SVM model also returns the lowest RMSE along with the highest \(R^2\), as seen in the table above.

(b)

Which predictors are most important in the optimal nonlinear regression model?
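The final SVM fit shown below can be inspected via the underlying ksvm object, for example:

# Print the final kernlab model stored by caret
svmModel$finalModel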

## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 16 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0173398678411205 
## 
## Number of Support Vectors : 119 
## 
## Objective Function Value : -74.8895 
## Training error : 0.009279

From the above output, the SVM model uses 119 of the training data points as support vectors (about 90% of the training set).

Do either the biological or process variables dominate the list?

##  [1] "Yield"                  "BiologicalMaterial01"  
##  [3] "BiologicalMaterial03"   "BiologicalMaterial05"  
##  [5] "BiologicalMaterial06"   "BiologicalMaterial08"  
##  [7] "BiologicalMaterial09"   "BiologicalMaterial10"  
##  [9] "BiologicalMaterial11"   "ManufacturingProcess01"
## [11] "ManufacturingProcess02" "ManufacturingProcess03"
## [13] "ManufacturingProcess04" "ManufacturingProcess05"
## [15] "ManufacturingProcess06" "ManufacturingProcess07"
## [17] "ManufacturingProcess08" "ManufacturingProcess09"
## [19] "ManufacturingProcess10" "ManufacturingProcess11"
## [21] "ManufacturingProcess12" "ManufacturingProcess13"
## [23] "ManufacturingProcess14" "ManufacturingProcess15"
## [25] "ManufacturingProcess16" "ManufacturingProcess17"
## [27] "ManufacturingProcess19" "ManufacturingProcess20"
## [29] "ManufacturingProcess21" "ManufacturingProcess22"
## [31] "ManufacturingProcess23" "ManufacturingProcess24"
## [33] "ManufacturingProcess26" "ManufacturingProcess28"
## [35] "ManufacturingProcess30" "ManufacturingProcess32"
## [37] "ManufacturingProcess33" "ManufacturingProcess34"
## [39] "ManufacturingProcess35" "ManufacturingProcess36"
## [41] "ManufacturingProcess37" "ManufacturingProcess38"
## [43] "ManufacturingProcess39" "ManufacturingProcess41"
## [45] "ManufacturingProcess43" "ManufacturingProcess44"
## [47] "ManufacturingProcess45"

In this case, the process (Manufacturing) variables dominate the list.
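This split can also be counted directly (a sketch):

# Count biological vs. manufacturing process predictors used by the model
predNames <- colnames(CMPtrain[, -1])
table(ifelse(grepl("^Biological", predNames), "Biological", "Process"))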

How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Let’s look at the importance of each variable:

importance <- varImp(svmModel)

importance
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 46)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   94.05
## BiologicalMaterial06     77.39
## BiologicalMaterial03     75.78
## ManufacturingProcess36   71.04
## ManufacturingProcess17   70.86
## ManufacturingProcess09   60.99
## ManufacturingProcess33   51.12
## BiologicalMaterial11     50.21
## ManufacturingProcess06   49.88
## BiologicalMaterial09     38.50
## BiologicalMaterial08     38.19
## ManufacturingProcess11   35.36
## BiologicalMaterial01     31.09
## ManufacturingProcess30   30.90
## ManufacturingProcess12   27.28
## ManufacturingProcess26   25.87
## ManufacturingProcess28   22.47
## ManufacturingProcess01   19.85
## BiologicalMaterial10     17.75

From the table above, 7 of the top 10 predictors are ManufacturingProcess variables and 3 are BiologicalMaterial variables. I believe this is important because we cannot disregard the role that these biological variables play in the chemical manufacturing process.

(c)

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
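These relationships could be plotted, for example, with caret::featurePlot, using the top predictors from the importance table (a sketch; the original plotting code is not shown, and the choice of predictors here is illustrative):

# Scatter plots (with smoothers) of top predictors against Yield
topPredictors <- c("ManufacturingProcess32", "ManufacturingProcess13",
                   "BiologicalMaterial06", "BiologicalMaterial03")

featurePlot(x = CMP.trans_reduced[, topPredictors],
            y = CMP.trans_reduced$Yield,
            plot = "scatter",
            type = c("p", "smooth"))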

Yes, the intuition is evident in these relationships: the top predictors not only appear to be roughly linearly related to Yield, but those relationships also appear to be statistically significant.

References

Kuhn, M. & Johnson, K. 2018. Applied Predictive Modeling. USA: Pfizer Global R&D. http://appliedpredictivemodeling.com/.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.