Data624


7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

$$y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)$$

where the x values are random variables uniformly distributed on [0, 1] (five other non-informative variables are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
library(caret)  ## featurePlot() used below is a caret function

## Warning: package 'mlbench' was built under R version 4.0.5

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.

trainingData$x <- data.frame(trainingData$x)

## Look at the data using

featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:

testData <- mlbench.friedman1(5000, sd = 1)

testData$x <- data.frame(testData$x)
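For intuition, the generating equation can also be written out directly in base R. This is only a sketch: `simulate_friedman1` is a hypothetical helper, not part of mlbench, and it is not expected to reproduce the exact random draws of `mlbench.friedman1`.

```r
## Hypothetical helper (not from mlbench): simulate Friedman's benchmark by hand
simulate_friedman1 <- function(n, sd = 1) {
  ## Ten uniform predictors; only the first five enter the equation
  x <- matrix(runif(n * 10), ncol = 10)
  colnames(x) <- paste0("X", 1:10)
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  ## Add Gaussian noise with standard deviation 'sd'
  list(x = data.frame(x), y = signal + rnorm(n, sd = sd))
}

set.seed(200)
sim <- simulate_friedman1(200, sd = 1)
dim(sim$x)
length(sim$y)
```

Writing the equation out makes it easy to see that X6 to X10 never touch the response, which is what the models below are expected to discover.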

Tune several models on these data. For example:

library(caret)

knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

knnModel

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
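The preProc = c("center", "scale") step matters for KNN because neighbors are chosen by distance. A small base R sketch with made-up values shows how a predictor on a large scale can dominate raw Euclidean distances:

```r
## Two toy predictors on very different scales (made-up values)
toy <- data.frame(small = c(0.1, 0.9, 0.2),
                  large = c(1000, 1010, 3000))

## Raw Euclidean distances are driven almost entirely by 'large'
rawDist <- as.matrix(dist(toy))
rawDist

## After centering and scaling, each column has unit standard deviation,
## so both predictors can influence which rows count as neighbors
scaledToy <- scale(toy)
scaledDist <- as.matrix(dist(scaledToy))
scaledDist
```

The Friedman predictors are already on a common [0, 1] scale, so the effect is small here, but the same train call is reused later on data where the columns differ much more.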

knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values

postResample(pred = knnPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461
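The three statistics can be computed by hand to see what they measure. The sketch below uses a hypothetical helper, `regression_metrics`, and assumes R-squared is taken as the squared correlation between predictions and observations, which is my understanding of caret's default; the toy vectors are made up.

```r
## Hypothetical helper mirroring the three statistics reported above
regression_metrics <- function(pred, obs) {
  c(RMSE = sqrt(mean((pred - obs)^2)),   # root mean squared error
    Rsquared = cor(pred, obs)^2,         # squared correlation (assumed definition)
    MAE = mean(abs(pred - obs)))         # mean absolute error
}

obs <- c(3, 5, 7, 9)
pred <- c(2, 5, 8, 10)
metrics <- regression_metrics(pred, obs)
metrics
```

Because R-squared defined this way ignores bias and scale, a model can have a high R-squared and still a large RMSE, so the two are worth reading together.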

varImp(knnModel)

## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000
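The header "loess r-squared variable importance" indicates that these scores come from a model-free filter rather than from the KNN fit itself. A rough base R sketch of that idea follows, assuming the score is the R-squared of a loess smoother fit to each predictor separately; `loess_r2` is a hypothetical helper and the data are simulated for illustration.

```r
set.seed(1)
n <- 200
x_signal <- runif(n)              # predictor that drives the response
x_noise <- runif(n)               # predictor unrelated to the response
y <- 10 * x_signal + rnorm(n)

## Hypothetical helper: R-squared of a one-predictor loess fit
loess_r2 <- function(x, y) {
  fit <- loess(y ~ x)
  1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
}

scores <- c(signal = loess_r2(x_signal, y),
            noise = loess_r2(x_noise, y))
scores
```

Since each predictor is scored on its own, a filter like this cannot see interactions, and it returns the same ranking for any model that lacks its own importance method.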

Multivariate Adaptive Regression Splines (MARS)

MARS models are in several packages, but the most extensive implementation is in the earth package. The MARS model using the nominal forward pass and pruning step can be called simply.

library(earth)

## Warning: package 'earth' was built under R version 4.0.5

## Loading required package: Formula

## Loading required package: plotmo

## Warning: package 'plotmo' was built under R version 4.0.5

## Loading required package: plotrix

## Loading required package: TeachingDemos

## Warning: package 'TeachingDemos' was built under R version 4.0.5

## 
## Attaching package: 'plotmo'

## The following object is masked from 'package:urca':
## 
##     plotres

marsFit <- earth(trainingData$x, trainingData$y)
marsFit

## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982

The summary method generates more extensive output.

summary(marsFit)

## Call: earth(x=trainingData$x, y=trainingData$y)
## 
##                coefficients
## (Intercept)       18.451984
## h(0.621722-X1)   -11.074396
## h(0.601063-X2)   -10.744225
## h(X3-0.281766)    20.607853
## h(0.447442-X3)    17.880232
## h(X3-0.447442)   -23.282007
## h(X3-0.636458)    15.150350
## h(0.734892-X4)   -10.027487
## h(X4-0.734892)     9.092045
## h(0.850094-X5)    -4.723407
## h(X5-0.850094)    10.832932
## h(X6-0.361791)    -1.956821
## 
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982
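Each h(...) term in the summary is a hinge function, h(z) = max(0, z). As a small sketch, the two X4 terms can be evaluated by hand using the coefficients printed above (`x4_contribution` is a hypothetical helper name):

```r
## Hinge function used by MARS basis terms
h <- function(z) pmax(0, z)

## Contribution of the two X4 terms, with coefficients copied from the summary
x4_contribution <- function(x4) {
  -10.027487 * h(0.734892 - x4) + 9.092045 * h(x4 - 0.734892)
}

## Below the knot, at the knot, and above the knot
contrib <- x4_contribution(c(0.5, 0.734892, 0.9))
contrib
```

Below the knot the slope with respect to X4 is about 10.03 and above it about 9.09, both close to the true coefficient of 10 on x4 in the simulation equation.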

To tune the model using external resampling, the train function can be used.

# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Fix the seed so that the results can be reproduced

set.seed(1340)

marsTuned <- train(trainingData$x, trainingData$y,
                   method = "earth", 
                   # Explicitly declare the candidate models to test
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))

marsTuned

## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.594324  0.1963843  3.833072
##   1        3      3.836801  0.4233298  3.139225
##   1        4      2.697201  0.7151127  2.134218
##   1        5      2.384270  0.7718287  1.877921
##   1        6      2.292469  0.7939161  1.752696
##   1        7      1.873309  0.8658661  1.420039
##   1        8      1.774391  0.8812817  1.353393
##   1        9      1.731318  0.8898088  1.346609
##   1       10      1.675093  0.8972301  1.308108
##   1       11      1.683475  0.8950124  1.304817
##   1       12      1.630683  0.9016304  1.257779
##   1       13      1.658973  0.8996103  1.271139
##   1       14      1.663902  0.8982894  1.277294
##   1       15      1.663902  0.8982894  1.277294
##   1       16      1.663902  0.8982894  1.277294
##   1       17      1.663902  0.8982894  1.277294
##   1       18      1.663902  0.8982894  1.277294
##   1       19      1.663902  0.8982894  1.277294
##   1       20      1.663902  0.8982894  1.277294
##   1       21      1.663902  0.8982894  1.277294
##   1       22      1.663902  0.8982894  1.277294
##   1       23      1.663902  0.8982894  1.277294
##   1       24      1.663902  0.8982894  1.277294
##   1       25      1.663902  0.8982894  1.277294
##   1       26      1.663902  0.8982894  1.277294
##   1       27      1.663902  0.8982894  1.277294
##   1       28      1.663902  0.8982894  1.277294
##   1       29      1.663902  0.8982894  1.277294
##   1       30      1.663902  0.8982894  1.277294
##   1       31      1.663902  0.8982894  1.277294
##   1       32      1.663902  0.8982894  1.277294
##   1       33      1.663902  0.8982894  1.277294
##   1       34      1.663902  0.8982894  1.277294
##   1       35      1.663902  0.8982894  1.277294
##   1       36      1.663902  0.8982894  1.277294
##   1       37      1.663902  0.8982894  1.277294
##   1       38      1.663902  0.8982894  1.277294
##   2        2      4.594324  0.1963843  3.833072
##   2        3      3.836801  0.4233298  3.139225
##   2        4      2.697201  0.7151127  2.134218
##   2        5      2.406939  0.7716633  1.890850
##   2        6      2.348855  0.7858024  1.826896
##   2        7      1.865997  0.8649874  1.417428
##   2        8      1.718563  0.8879092  1.330179
##   2        9      1.494916  0.9123315  1.180176
##   2       10      1.418147  0.9229322  1.138736
##   2       11      1.361989  0.9301450  1.072554
##   2       12      1.329251  0.9330332  1.042525
##   2       13      1.295076  0.9359551  1.025561
##   2       14      1.286339  0.9383256  1.021240
##   2       15      1.279173  0.9392309  1.023911
##   2       16      1.295970  0.9382018  1.035481
##   2       17      1.311907  0.9368463  1.050086
##   2       18      1.311907  0.9368463  1.050086
##   2       19      1.322933  0.9358727  1.059458
##   2       20      1.322933  0.9358727  1.059458
##   2       21      1.322933  0.9358727  1.059458
##   2       22      1.322933  0.9358727  1.059458
##   2       23      1.322933  0.9358727  1.059458
##   2       24      1.322933  0.9358727  1.059458
##   2       25      1.322933  0.9358727  1.059458
##   2       26      1.322933  0.9358727  1.059458
##   2       27      1.322933  0.9358727  1.059458
##   2       28      1.322933  0.9358727  1.059458
##   2       29      1.322933  0.9358727  1.059458
##   2       30      1.322933  0.9358727  1.059458
##   2       31      1.322933  0.9358727  1.059458
##   2       32      1.322933  0.9358727  1.059458
##   2       33      1.322933  0.9358727  1.059458
##   2       34      1.322933  0.9358727  1.059458
##   2       35      1.322933  0.9358727  1.059458
##   2       36      1.322933  0.9358727  1.059458
##   2       37      1.322933  0.9358727  1.059458
##   2       38      1.322933  0.9358727  1.059458
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
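The size of this table follows from the tuning grid. A quick check of how many candidate settings expand.grid generates:

```r
## Same grid as above: 2 interaction degrees x 37 pruning sizes
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
nrow(marsGrid)
table(marsGrid$.degree)
```

Many rows in the results are identical because, as far as I understand it, nprune only caps the number of retained terms: once it exceeds the number of terms the forward pass produced, raising it further leaves the model unchanged. A narrower range such as 2:20 would likely give the same selection here.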

head(predict(marsTuned, testData$x))

##              y
## [1,] 18.586161
## [2,] 21.272051
## [3,] 12.274064
## [4,]  7.736178
## [5,] 10.592710
## [6,] 14.124571

marsPred <- predict(marsTuned, testData$x)

## The function 'postResample' can be used to get the test set
## performance values

postResample(pred = marsPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 1.1589948 0.9460418 0.9250230

There are two functions that estimate the importance of each predictor in the MARS model: evimp in the earth package and varImp in the caret package (although the latter calls the former):

varImp(marsTuned)

## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.24
## X2   48.73
## X5   15.52
## X3    0.00

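According to the MARS importance scores, only X1 to X5 contribute to the tuned model. X3 is listed with a score of 0, and the non-informative predictors X6 to X10 do not appear at all.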

Neural Networks (nnet)

To fit a regression model, the nnet function takes both the formula and non-formula interfaces. For regression, the linear relationship between the hidden units and the prediction can be used with the option linout = TRUE.

tooHigh <- findCorrelation(cor(trainingData$x), cutoff = .75)

trainXnnet <- trainingData$x[, -tooHigh]

testXnnet <- testData$x[, -tooHigh]
## Create a specific candidate set of models to evaluate:

nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10),
                        ## The next option is to use bagging (see the
                        ## next chapter) instead of different random
                        ## seeds.
                        .bag = FALSE)

set.seed(31500)

nnetTuned <- train(trainingData$x, trainingData$y,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl = trainControl(method = "cv"),
                  ## Automatically standardize data prior to modeling
                  ## and prediction
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 5 * (ncol(trainXnnet) + 1) + 10 + 1,
                  maxit = 50)
nnetTuned

## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.660379  0.7364356  2.129650
##   0.00    2         NaN        NaN       NaN
##   0.00    3         NaN        NaN       NaN
##   0.00    4         NaN        NaN       NaN
##   0.00    5         NaN        NaN       NaN
##   0.00    6         NaN        NaN       NaN
##   0.00    7         NaN        NaN       NaN
##   0.00    8         NaN        NaN       NaN
##   0.00    9         NaN        NaN       NaN
##   0.00   10         NaN        NaN       NaN
##   0.01    1    2.658120  0.7257255  2.138039
##   0.01    2         NaN        NaN       NaN
##   0.01    3         NaN        NaN       NaN
##   0.01    4         NaN        NaN       NaN
##   0.01    5         NaN        NaN       NaN
##   0.01    6         NaN        NaN       NaN
##   0.01    7         NaN        NaN       NaN
##   0.01    8         NaN        NaN       NaN
##   0.01    9         NaN        NaN       NaN
##   0.01   10         NaN        NaN       NaN
##   0.10    1    2.467318  0.7592730  1.956655
##   0.10    2         NaN        NaN       NaN
##   0.10    3         NaN        NaN       NaN
##   0.10    4         NaN        NaN       NaN
##   0.10    5         NaN        NaN       NaN
##   0.10    6         NaN        NaN       NaN
##   0.10    7         NaN        NaN       NaN
##   0.10    8         NaN        NaN       NaN
##   0.10    9         NaN        NaN       NaN
##   0.10   10         NaN        NaN       NaN
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1, decay = 0.1 and bag = FALSE.
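One plausible explanation for the NaN rows, assuming findCorrelation returned an empty vector for these nearly uncorrelated predictors: indexing with -tooHigh then drops every column, so ncol(trainXnnet) is 0 and MaxNWts allows only the smallest network. The sketch below shows the indexing behavior and the weight count for a single-hidden-layer network with one output; `keep_uncorrelated` and `n_weights` are hypothetical helpers.

```r
x <- data.frame(X1 = 1:3, X2 = 4:6, X3 = 7:9)
tooHigh <- integer(0)          # nothing flagged as highly correlated

## Negating an empty index selects no columns at all
ncol(x[, -tooHigh])            # 0

## Hypothetical guard: only drop columns when something was flagged
keep_uncorrelated <- function(x, drop_idx) {
  if (length(drop_idx) > 0) x[, -drop_idx, drop = FALSE] else x
}
p <- ncol(keep_uncorrelated(x, tooHigh))
p                              # 3

## Weights in a network with p inputs, 'size' hidden units, one output:
## (p + 1) weights per hidden unit plus (size + 1) for the output unit
n_weights <- function(p, size) size * (p + 1) + size + 1
weights_needed <- n_weights(10, 1:10)
weights_needed

## Limit implied by the MaxNWts formula above if ncol(trainXnnet) were 0
max_allowed <- 5 * (0 + 1) + 10 + 1
which(weights_needed <= max_allowed)
```

Under that assumption only size = 1 fits within the limit, which would match the table. Sizing the limit for the largest network in the grid, for example 10 * (ncol(trainingData$x) + 1) + 10 + 1, would let every candidate be fit.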

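## Note: earth() fits a MARS model, so nnetFit reproduces marsFit
## rather than summarizing the neural network stored in nnetTuned.
nnetFit <- earth(trainingData$x, trainingData$y)
nnetFit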

## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982

summary(nnetFit)

## Call: earth(x=trainingData$x, y=trainingData$y)
## 
##                coefficients
## (Intercept)       18.451984
## h(0.621722-X1)   -11.074396
## h(0.601063-X2)   -10.744225
## h(X3-0.281766)    20.607853
## h(0.447442-X3)    17.880232
## h(X3-0.447442)   -23.282007
## h(X3-0.636458)    15.150350
## h(0.734892-X4)   -10.027487
## h(X4-0.734892)     9.092045
## h(0.850094-X5)    -4.723407
## h(X5-0.850094)    10.832932
## h(X6-0.361791)    -1.956821
## 
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982

head(predict(nnetTuned, testData$x))

##         1         2         3         4         5         6 
## 17.815453 17.580442 12.202086  8.501727 15.206499 13.941385

nnetPred <- predict(nnetTuned, testData$x)

## The function 'postResample' can be used to get the test set
## performance values

postResample(pred = nnetPred, obs = testData$y)

varImp(nnetTuned)

## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

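Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

MARS gives the best performance. Its test set RMSE is 1.16 (R-squared 0.946), compared with 3.20 (R-squared 0.682) for KNN, and the best averaged neural network reached a cross-validated RMSE of 2.47 (R-squared 0.759). The tuned MARS model also selects only the informative predictors X1–X5, although X3 receives an importance of 0. The importance scores reported for KNN and the neural network rank all ten predictors, so they do not narrow the set to five the way MARS does.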

7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

set.seed(34392)
library(AppliedPredictiveModeling)

## Warning: package 'AppliedPredictiveModeling' was built under R version 4.0.5

library(RANN)

## Warning: package 'RANN' was built under R version 4.0.5

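library(dplyr)  ## provides %>%, select_at(), vars() and one_of() used below

data(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess
#sum(is.na(df))
## Impute missing values with k-nearest neighbors
trans <- preProcess(df, method = "knnImpute")
#sum(is.na(trans))
pred <- predict(trans, df)
## Drop near-zero-variance columns
pred <- pred %>% select_at(vars(-one_of(nearZeroVar(., names = TRUE))))

## Note: pred still contains Yield, so trainX (and testX below) keep the
## response as one of the 57 columns passed to the models.
trainDf <- createDataPartition(pred$Yield, p = 0.8, times = 1, list = FALSE)
trainX <- pred[trainDf, ]
trainY <- pred$Yield[trainDf]
#sum(is.na(trainX))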
plsTune <- train(trainX, trainY,
                 method = "pls",
                 ## The default tuning grid evaluates
                 ## components 1... tuneLength
                 tuneLength = 20,
                 trControl = trainControl(method = 'cv'),
                 preProc = c("center", "scale"))

plsTune

## Partial Least Squares 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 130, 131, 128, 128, 130, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE        Rsquared   MAE       
##    1     0.67852890  0.5941896  0.54887418
##    2     0.63548816  0.6847727  0.47397765
##    3     0.63643355  0.7621490  0.40839579
##    4     0.66167229  0.7942985  0.38549797
##    5     0.57660761  0.8513484  0.29922932
##    6     0.48017797  0.8775200  0.25072728
##    7     0.31537720  0.9063116  0.17967528
##    8     0.21673598  0.9389757  0.13253697
##    9     0.11834743  0.9852244  0.08810216
##   10     0.11620387  0.9798759  0.07497350
##   11     0.11671649  0.9703266  0.06567956
##   12     0.10582328  0.9698120  0.05535640
##   13     0.08437493  0.9803105  0.04493301
##   14     0.05493883  0.9941788  0.03454552
##   15     0.03184658  0.9990082  0.02420603
##   16     0.03150690  0.9988056  0.02189858
##   17     0.03530588  0.9975295  0.02130832
##   18     0.03132606  0.9981684  0.01965167
##   19     0.02353572  0.9989905  0.01439933
##   20     0.01547202  0.9996614  0.01031025
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 20.

plot(plsTune)

testX <- pred[-trainDf,]
testY <- pred$Yield[-trainDf]
postResample(pred = predict(plsTune, newdata=testX), obs = testY)

##       RMSE   Rsquared        MAE 
## 0.06449150 0.99368528 0.02210791

Neural Network

# tooHigh <- findCorrelation(cor(trainX), cutoff = .75)
# 
# trainXnnet <- trainX[, -tooHigh]
# 
# testXnnet <- testData$x[, -tooHigh]
## Create a specific candidate set of models to evaluate:

nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10),
                        ## The next option is to use bagging (see the
                        ## next chapter) instead of different random
                        ## seeds.
                        .bag = FALSE)

set.seed(31500)
# nnetTuned <- train(Yield ~ ., trainX,
#                   method = "avNNet",
#                   tuneGrid = nnetGrid,
#                   trControl = trainControl(method = "cv"),
#                   ## Automatically standardize data prior to modeling
#                   ## and prediction
#                   preProc = c("center", "scale"),
#                   linout = TRUE,
#                   trace = FALSE,
#                   MaxNWts = 5 * (ncol(trainDf) + 1) + 5 + 1,
#                   maxit = 50)
# 
# nnetTuned
# plot(nnetTuned)

#postResample(pred = predict(nnetTuned, newdata=testX), obs = testY)

KNN model

knnModel <- train(x = trainX,
                  y = trainY,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

knnModel

## k-Nearest Neighbors 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    5  0.7154769  0.5471728  0.5667266
##    7  0.7134805  0.5556671  0.5683334
##    9  0.7212001  0.5474275  0.5819166
##   11  0.7284377  0.5428464  0.5893612
##   13  0.7308178  0.5426277  0.5911040
##   15  0.7336511  0.5442605  0.5954581
##   17  0.7336175  0.5491911  0.5965774
##   19  0.7349090  0.5554128  0.5950657
##   21  0.7407129  0.5534573  0.6018688
##   23  0.7449997  0.5545652  0.6039193
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.

plot(knnModel)

postResample(pred = predict(knnModel, newdata=testX), obs = testY)

##      RMSE  Rsquared       MAE 
## 0.5032695 0.5990757 0.4012294

MARS model

# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Fix the seed so that the results can be reproduced

set.seed(1340)

marsTuned <- train(trainX, trainY,
                   method = "earth", 
                   # Explicitly declare the candidate models to test
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))

marsTuned

## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 131, 129, 129, 129, 129, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE          Rsquared  MAE         
##   1        2      5.491765e-16  1         4.569087e-16
##   1        3      5.491765e-16  1         4.569087e-16
##   1        4      5.491765e-16  1         4.569087e-16
##   1        5      5.491765e-16  1         4.569087e-16
##   1        6      5.491765e-16  1         4.569087e-16
##   1        7      5.491765e-16  1         4.569087e-16
##   1        8      5.491765e-16  1         4.569087e-16
##   1        9      5.491765e-16  1         4.569087e-16
##   1       10      5.491765e-16  1         4.569087e-16
##   1       11      5.491765e-16  1         4.569087e-16
##   1       12      5.491765e-16  1         4.569087e-16
##   1       13      5.491765e-16  1         4.569087e-16
##   1       14      5.491765e-16  1         4.569087e-16
##   1       15      5.491765e-16  1         4.569087e-16
##   1       16      5.491765e-16  1         4.569087e-16
##   1       17      5.491765e-16  1         4.569087e-16
##   1       18      5.491765e-16  1         4.569087e-16
##   1       19      5.491765e-16  1         4.569087e-16
##   1       20      5.491765e-16  1         4.569087e-16
##   1       21      5.491765e-16  1         4.569087e-16
##   1       22      5.491765e-16  1         4.569087e-16
##   1       23      5.491765e-16  1         4.569087e-16
##   1       24      5.491765e-16  1         4.569087e-16
##   1       25      5.491765e-16  1         4.569087e-16
##   1       26      5.491765e-16  1         4.569087e-16
##   1       27      5.491765e-16  1         4.569087e-16
##   1       28      5.491765e-16  1         4.569087e-16
##   1       29      5.491765e-16  1         4.569087e-16
##   1       30      5.491765e-16  1         4.569087e-16
##   1       31      5.491765e-16  1         4.569087e-16
##   1       32      5.491765e-16  1         4.569087e-16
##   1       33      5.491765e-16  1         4.569087e-16
##   1       34      5.491765e-16  1         4.569087e-16
##   1       35      5.491765e-16  1         4.569087e-16
##   1       36      5.491765e-16  1         4.569087e-16
##   1       37      5.491765e-16  1         4.569087e-16
##   1       38      5.491765e-16  1         4.569087e-16
##   2        2      5.491765e-16  1         4.569087e-16
##   2        3      5.491765e-16  1         4.569087e-16
##   2        4      5.491765e-16  1         4.569087e-16
##   2        5      5.491765e-16  1         4.569087e-16
##   2        6      5.491765e-16  1         4.569087e-16
##   2        7      5.491765e-16  1         4.569087e-16
##   2        8      5.491765e-16  1         4.569087e-16
##   2        9      5.491765e-16  1         4.569087e-16
##   2       10      5.491765e-16  1         4.569087e-16
##   2       11      5.491765e-16  1         4.569087e-16
##   2       12      5.491765e-16  1         4.569087e-16
##   2       13      5.491765e-16  1         4.569087e-16
##   2       14      5.491765e-16  1         4.569087e-16
##   2       15      5.491765e-16  1         4.569087e-16
##   2       16      5.491765e-16  1         4.569087e-16
##   2       17      5.491765e-16  1         4.569087e-16
##   2       18      5.491765e-16  1         4.569087e-16
##   2       19      5.491765e-16  1         4.569087e-16
##   2       20      5.491765e-16  1         4.569087e-16
##   2       21      5.491765e-16  1         4.569087e-16
##   2       22      5.491765e-16  1         4.569087e-16
##   2       23      5.491765e-16  1         4.569087e-16
##   2       24      5.491765e-16  1         4.569087e-16
##   2       25      5.491765e-16  1         4.569087e-16
##   2       26      5.491765e-16  1         4.569087e-16
##   2       27      5.491765e-16  1         4.569087e-16
##   2       28      5.491765e-16  1         4.569087e-16
##   2       29      5.491765e-16  1         4.569087e-16
##   2       30      5.491765e-16  1         4.569087e-16
##   2       31      5.491765e-16  1         4.569087e-16
##   2       32      5.491765e-16  1         4.569087e-16
##   2       33      5.491765e-16  1         4.569087e-16
##   2       34      5.491765e-16  1         4.569087e-16
##   2       35      5.491765e-16  1         4.569087e-16
##   2       36      5.491765e-16  1         4.569087e-16
##   2       37      5.491765e-16  1         4.569087e-16
##   2       38      5.491765e-16  1         4.569087e-16
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 2 and degree = 1.

#plot(marsTuned)
postResample(pred = predict(marsTuned, newdata=testX), obs = testY)

##         RMSE     Rsquared          MAE 
## 5.086513e-16 1.000000e+00 4.391019e-16

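Which nonlinear regression model gives the optimal resampling and test set performance?

MARS gives the lowest resampling and test set error, with a test set RMSE of 5.086513e-16 and R-squared of 1, ahead of PLS (test RMSE 0.0645) and KNN (test RMSE 0.5033). An error at machine precision is not a realistic fit, though. trainX and testX are taken from pred, which still contains the Yield column, so every model here receives the response as a predictor, and MARS reproduces it with nprune = 2. These figures therefore overstate how well any of the models would predict yield from the process and biological variables alone.
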
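Since trainX is built as pred[trainDf, ], the Yield column travels along with the predictors. A minimal sketch of keeping the response out of the predictor set, shown on a made-up toy data frame (the column names ProcessA and MaterialB and the toy* objects are hypothetical):

```r
## Toy stand-in for the imputed data: one response and two predictors
toy <- data.frame(Yield = c(1.2, 0.4, -0.3, 0.9),
                  ProcessA = c(0.1, 0.5, 0.2, 0.7),
                  MaterialB = c(1.0, 0.8, 1.1, 0.9))
train_rows <- c(1, 2, 3)

## Select predictors by name so the response can never slip in
predictor_cols <- setdiff(names(toy), "Yield")

toyTrainX <- toy[train_rows, predictor_cols, drop = FALSE]
toyTrainY <- toy$Yield[train_rows]
toyTestX <- toy[-train_rows, predictor_cols, drop = FALSE]
toyTestY <- toy$Yield[-train_rows]

"Yield" %in% names(toyTrainX)
```

Applied to the chemical data, the same pattern would leave 56 predictors, and the resampling and test set numbers would need to be regenerated before the models can be compared fairly.
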
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

#varImp(marsTuned)
#varImp(nnetTuned)
varImp(plsTune)

## Warning: package 'pls' was built under R version 4.0.5

## 
## Attaching package: 'pls'

## The following object is masked from 'package:caret':
## 
##     R2

## The following object is masked from 'package:stats':
## 
##     loadings

## pls variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## Yield                   100.00
## ManufacturingProcess32   43.05
## ManufacturingProcess13   39.27
## ManufacturingProcess36   36.66
## ManufacturingProcess17   36.31
## ManufacturingProcess09   34.64
## BiologicalMaterial02     33.30
## BiologicalMaterial06     31.38
## BiologicalMaterial08     30.86
## BiologicalMaterial12     29.61
## BiologicalMaterial03     29.25
## ManufacturingProcess33   29.20
## BiologicalMaterial11     28.98
## BiologicalMaterial01     27.17
## ManufacturingProcess06   27.14
## BiologicalMaterial04     26.78
## ManufacturingProcess12   25.48
## ManufacturingProcess11   25.31
## ManufacturingProcess04   23.70
## ManufacturingProcess28   22.16

varImp(knnModel)

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## Yield                   100.00
## ManufacturingProcess32   37.86
## ManufacturingProcess13   36.69
## BiologicalMaterial06     35.28
## BiologicalMaterial12     31.62
## ManufacturingProcess17   31.25
## BiologicalMaterial03     30.61
## BiologicalMaterial02     28.91
## ManufacturingProcess09   27.57
## ManufacturingProcess36   27.23
## BiologicalMaterial11     24.66
## ManufacturingProcess06   23.12
## ManufacturingProcess31   22.35
## BiologicalMaterial04     20.71
## BiologicalMaterial08     19.88
## ManufacturingProcess11   19.18
## ManufacturingProcess33   19.12
## ManufacturingProcess29   18.05
## BiologicalMaterial01     18.02
## ManufacturingProcess02   16.06

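We ran into an issue extracting variable importance from the MARS model, so the comparison uses plsTune and knnModel, whose rankings are about the same. Yield tops both lists only because the response was left among the predictors. Setting it aside, the ManufacturingProcess variables hold the top ranks (ManufacturingProcess32 and ManufacturingProcess13 lead both lists), just as in Exercise 6.3, although biological variables make up half of the top ten in each list.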
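The overlap between the two rankings can be checked directly. The sketch below copies the top ten names (excluding Yield) from the two varImp listings above and compares them with base R set operations:

```r
## Top ten predictors, excluding Yield, copied from varImp(plsTune)
plsTop <- c("ManufacturingProcess32", "ManufacturingProcess13",
            "ManufacturingProcess36", "ManufacturingProcess17",
            "ManufacturingProcess09", "BiologicalMaterial02",
            "BiologicalMaterial06", "BiologicalMaterial08",
            "BiologicalMaterial12", "BiologicalMaterial03")

## Top ten predictors, excluding Yield, copied from varImp(knnModel)
knnTop <- c("ManufacturingProcess32", "ManufacturingProcess13",
            "BiologicalMaterial06", "BiologicalMaterial12",
            "ManufacturingProcess17", "BiologicalMaterial03",
            "BiologicalMaterial02", "ManufacturingProcess09",
            "ManufacturingProcess36", "BiologicalMaterial11")

shared <- intersect(plsTop, knnTop)
length(shared)
setdiff(plsTop, knnTop)   # in the PLS top ten only
setdiff(knnTop, plsTop)   # in the KNN top ten only

## Count process versus biological variables by stripping the numeric suffix
table(sub("[0-9]+$", "", plsTop))
```

Nine of the ten names are shared; the lists differ only in BiologicalMaterial08 (PLS) versus BiologicalMaterial11 (KNN).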

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

plot(trainX$ManufacturingProcess32, trainX$Yield , pch = 19)

#lines(trainX$ManufacturingProcess32, trainX$Yield, type = "b", col = 3, lwd = 4, pch = 2 )

Very interesting!

df <- data.frame(trainX$ManufacturingProcess32, trainX$ManufacturingProcess13, trainX$ManufacturingProcess36, trainX$ManufacturingProcess17, trainX$ManufacturingProcess09, trainX$BiologicalMaterial02 , trainX$BiologicalMaterial06, trainX$BiologicalMaterial08, trainX$BiologicalMaterial12, trainX$BiologicalMaterial03, trainX$Yield)

x <- dplyr::select(df , -trainX.Yield)

featurePlot( x, df$trainX.Yield)

Data624_HW8

Alexis Mekueko

11/13/2021