Data 624 HW 8

Maryluz Cruz

2021-04-26

7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

                 y = 10 sin(π x1 x2) + 20 (x3 − 0.5)^2 + 10 x4 + 5 x5 + N(0, σ^2)

where the x values are random variables uniformly distributed on [0, 1] (there are also five other non-informative predictors created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(caret)
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
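As a quick sanity check (not part of the original exercise), the noise-free Friedman surface can be recomputed from the simulated predictors and compared with the noisy response; with sd = 1 the correlation should be close to 1. This sketch assumes the default X1–X10 column names that data.frame() assigns to the unnamed matrix.

## Recompute the noise-free Friedman mean function and compare to y
friedmanMean <- with(trainingData$x,
                     10 * sin(pi * X1 * X2) + 20 * (X3 - 0.5)^2 + 10 * X4 + 5 * X5)
cor(friedmanMean, trainingData$y)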

KNN Model

Tune several models on these data. For example:

knnModel <- train(x = trainingData$x,
            y = trainingData$y,
            method = "knn",
            preProc = c("center", "scale"),
            tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.

The optimal KNN model uses k = 17, with a resampled RMSE of 3.183130 and R^2 of 0.6425367.

knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

On the test set, the RMSE is 3.2040595 and the R^2 is 0.6819919.

plot(varImp(knnModel))

X4 is the most important predictor in the KNN model.

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

Neural Networks

nnetGrid <- expand.grid(.decay = c(0, 0.01, 0.1), .size = 1:10, .bag = FALSE)

set.seed(100)
nnetmodel <- train(x = trainingData$x,
                    y = trainingData$y,
            method = "avNNet",
            tuneGrid = nnetGrid,
            trControl = trainControl(method = "cv"),
            ## Automatically standardize data prior to modeling
            ## and prediction
            preProc = c("center", "scale"),
            linout = TRUE,
            trace = FALSE,
            MaxNWts = 10 * (ncol(trainingData$x) + 1) + 5 + 1,
            maxit = 500)
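One thing worth flagging (my note, not part of the assignment): a single-hidden-layer network with H hidden units and p predictors has H*(p + 1) + H + 1 weights, so MaxNWts should be sized for the largest network in the grid. The cap above allows 116 weights, while a size = 10 network needs 121, which is likely why the size = 10 rows below resample to NaN. A small sketch of the required cap:

p <- ncol(trainingData$x)           # 10 predictors
maxSize <- max(nnetGrid$.size)      # largest hidden-layer size in the grid
maxSize * (p + 1) + maxSize + 1     # 121 weights needed; the cap above is 116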

nnetmodel
## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.392711  0.7610354  1.897330
##   0.00    2    2.410532  0.7567109  1.907478
##   0.00    3    2.043957  0.8224281  1.630751
##   0.00    4    2.289347  0.8130639  1.749187
##   0.00    5    2.445600  0.7709399  1.824446
##   0.00    6    2.898295  0.7388800  2.052725
##   0.00    7    3.351563  0.6644147  2.460366
##   0.00    8    6.513566  0.4418645  3.563297
##   0.00    9    4.484215  0.5644107  2.877950
##   0.00   10         NaN        NaN       NaN
##   0.01    1    2.385381  0.7602926  1.887906
##   0.01    2    2.425125  0.7510903  1.935991
##   0.01    3    2.151209  0.8016018  1.701951
##   0.01    4    2.091925  0.8154383  1.676653
##   0.01    5    2.169742  0.7999255  1.738715
##   0.01    6    2.262032  0.8056619  1.817195
##   0.01    7    2.318301  0.7861811  1.856908
##   0.01    8    2.413847  0.7772629  1.938009
##   0.01    9    2.317190  0.7847500  1.857641
##   0.01   10         NaN        NaN       NaN
##   0.10    1    2.393965  0.7596431  1.894191
##   0.10    2    2.423612  0.7525959  1.935872
##   0.10    3    2.169914  0.7982380  1.726854
##   0.10    4    2.059080  0.8224160  1.648610
##   0.10    5    1.975656  0.8394000  1.578979
##   0.10    6    2.152198  0.8098015  1.693056
##   0.10    7    2.161512  0.8163011  1.693526
##   0.10    8    2.273716  0.7922525  1.822713
##   0.10    9    2.315333  0.7811273  1.785409
##   0.10   10         NaN        NaN       NaN
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.1 and bag = FALSE.

The optimal model has size = 5 and decay = 0.1, with a resampled RMSE of 1.975656 and R^2 of 0.8394000.

nnetpredict <- predict(nnetmodel, newdata = testData$x)
postResample(pred = nnetpredict, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.1113956 0.8277556 1.5739011

On the test set, the RMSE is 2.1113956 and the R^2 is 0.8277556, so the resampled estimates were slightly better than the test set performance.

plot(varImp(nnetmodel))

X4 is again the most important predictor.

Multivariate Adaptive Regression Splines

# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Fix the seed so that the results can be reproduced
set.seed(100)
marsmodel <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"),
                   preProc = c("center", "scale"),
                   tuneLength = 10)

marsmodel
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      4.327937  0.2544880  3.6004742
##   1        3      3.572450  0.4912720  2.8958113
##   1        4      2.596841  0.7183600  2.1063410
##   1        5      2.370161  0.7659777  1.9186686
##   1        6      2.276141  0.7881481  1.8100006
##   1        7      1.766728  0.8751831  1.3902146
##   1        8      1.780946  0.8723243  1.4013449
##   1        9      1.665091  0.8819775  1.3255147
##   1       10      1.663804  0.8821283  1.3276573
##   1       11      1.657738  0.8822967  1.3317299
##   1       12      1.653784  0.8827903  1.3315041
##   1       13      1.648496  0.8823663  1.3164065
##   1       14      1.639073  0.8841742  1.3128329
##   1       15      1.639073  0.8841742  1.3128329
##   1       16      1.639073  0.8841742  1.3128329
##   1       17      1.639073  0.8841742  1.3128329
##   1       18      1.639073  0.8841742  1.3128329
##   1       19      1.639073  0.8841742  1.3128329
##   1       20      1.639073  0.8841742  1.3128329
##   1       21      1.639073  0.8841742  1.3128329
##   1       22      1.639073  0.8841742  1.3128329
##   1       23      1.639073  0.8841742  1.3128329
##   1       24      1.639073  0.8841742  1.3128329
##   1       25      1.639073  0.8841742  1.3128329
##   1       26      1.639073  0.8841742  1.3128329
##   1       27      1.639073  0.8841742  1.3128329
##   1       28      1.639073  0.8841742  1.3128329
##   1       29      1.639073  0.8841742  1.3128329
##   1       30      1.639073  0.8841742  1.3128329
##   1       31      1.639073  0.8841742  1.3128329
##   1       32      1.639073  0.8841742  1.3128329
##   1       33      1.639073  0.8841742  1.3128329
##   1       34      1.639073  0.8841742  1.3128329
##   1       35      1.639073  0.8841742  1.3128329
##   1       36      1.639073  0.8841742  1.3128329
##   1       37      1.639073  0.8841742  1.3128329
##   1       38      1.639073  0.8841742  1.3128329
##   2        2      4.327937  0.2544880  3.6004742
##   2        3      3.572450  0.4912720  2.8958113
##   2        4      2.661826  0.7070510  2.1734709
##   2        5      2.404015  0.7578971  1.9753867
##   2        6      2.243927  0.7914805  1.7830717
##   2        7      1.856336  0.8605482  1.4356822
##   2        8      1.754607  0.8763186  1.3968406
##   2        9      1.653859  0.8870129  1.2813884
##   2       10      1.434159  0.9166537  1.1339203
##   2       11      1.320482  0.9289120  1.0347278
##   2       12      1.317547  0.9306879  1.0359899
##   2       13      1.296910  0.9306902  1.0146112
##   2       14      1.221407  0.9395223  0.9631486
##   2       15      1.230516  0.9390469  0.9761484
##   2       16      1.236911  0.9387407  0.9745362
##   2       17      1.236911  0.9387407  0.9745362
##   2       18      1.236911  0.9387407  0.9745362
##   2       19      1.236911  0.9387407  0.9745362
##   2       20      1.236911  0.9387407  0.9745362
##   2       21      1.236911  0.9387407  0.9745362
##   2       22      1.236911  0.9387407  0.9745362
##   2       23      1.236911  0.9387407  0.9745362
##   2       24      1.236911  0.9387407  0.9745362
##   2       25      1.236911  0.9387407  0.9745362
##   2       26      1.236911  0.9387407  0.9745362
##   2       27      1.236911  0.9387407  0.9745362
##   2       28      1.236911  0.9387407  0.9745362
##   2       29      1.236911  0.9387407  0.9745362
##   2       30      1.236911  0.9387407  0.9745362
##   2       31      1.236911  0.9387407  0.9745362
##   2       32      1.236911  0.9387407  0.9745362
##   2       33      1.236911  0.9387407  0.9745362
##   2       34      1.236911  0.9387407  0.9745362
##   2       35      1.236911  0.9387407  0.9745362
##   2       36      1.236911  0.9387407  0.9745362
##   2       37      1.236911  0.9387407  0.9745362
##   2       38      1.236911  0.9387407  0.9745362
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.

The optimal model has nprune = 14 and degree = 2, with a resampled RMSE of 1.221407 and R^2 of 0.9395223.

marspredict <- predict(marsmodel, newdata = testData$x)
postResample(pred = marspredict, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 1.2779993 0.9338365 1.0147070

On the test set, the RMSE is 1.2779993 and the R^2 is 0.9338365, slightly worse than the resampled estimates.

plot(varImp(marsmodel))

X1 is the most important predictor in the MARS model.
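To answer the second part of the question more directly, a hedged sketch (assuming the final earth fit is stored in marsmodel$finalModel, as caret normally does): evimp() from the earth package lists the predictors retained by the pruned MARS model, so it shows whether only the informative X1–X5 were selected.

library(earth)
## Variables actually used by the pruned MARS model
evimp(marsmodel$finalModel)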

Support Vector Machines

svmmodel <- train(x = trainingData$x,
                    y = trainingData$y,
             method = "svmRadial",
             preProc = c("center", "scale"),
             tuneLength = 14,
             trControl = trainControl(method = "cv"))

svmmodel
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.490737  0.8009120  1.982118
##      0.50  2.246868  0.8153042  1.774454
##      1.00  2.051872  0.8400992  1.614368
##      2.00  1.949707  0.8534618  1.524201
##      4.00  1.886125  0.8610205  1.465373
##      8.00  1.849240  0.8654699  1.436630
##     16.00  1.834604  0.8673639  1.429807
##     32.00  1.833221  0.8675754  1.428687
##     64.00  1.833221  0.8675754  1.428687
##    128.00  1.833221  0.8675754  1.428687
##    256.00  1.833221  0.8675754  1.428687
##    512.00  1.833221  0.8675754  1.428687
##   1024.00  1.833221  0.8675754  1.428687
##   2048.00  1.833221  0.8675754  1.428687
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06315483
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06315483 and C = 32.

The optimal model has sigma = 0.06315483 and C = 32, with a resampled RMSE of 1.833221 and R^2 of 0.8675754.
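The reason sigma is held constant is that caret's "svmRadial" method estimates the kernel width analytically with kernlab's sigest() and only tunes the cost C. A rough sketch of that estimate (an illustration only; train() uses a summary of these quantiles rather than any single value):

library(kernlab)
set.seed(200)
## Quantiles of a plausible range for the RBF sigma parameter
sigest(as.matrix(trainingData$x), scaled = TRUE)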

svmpredict <- predict(svmmodel, newdata = testData$x)
postResample(pred = svmpredict, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.0741473 0.8255848 1.5755185

On the test set, the RMSE is 2.0741473 and the R^2 is 0.8255848, somewhat worse than the resampled estimates.

plot(varImp(svmmodel))

library(kableExtra)
results <- rbind(
  "KNN" = postResample(pred = knnPred, obs = testData$y),
  "NNET" = postResample(pred = nnetpredict, obs = testData$y),
  "MARS" = postResample(pred = marspredict, obs = testData$y),
  "SVM" = postResample(pred = svmpredict, obs = testData$y)
)

results %>%
  kable() %>%
  kable_styling()
        RMSE      Rsquared   MAE
KNN     3.204060  0.6819919  2.568346
NNET    2.111396  0.8277556  1.573901
MARS    1.277999  0.9338365  1.014707
SVM     2.074147  0.8255848  1.575519

The MARS model performed best on the test set, with an RMSE of 1.277999 and an R^2 of 0.9338365.

7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
library(tidyverse)

preP <- preProcess(ChemicalManufacturingProcess, 
                   method = c( "knnImpute", "center", "scale"))
df <- predict(preP, ChemicalManufacturingProcess)
## Restore the response variable values to original
df$Yield = ChemicalManufacturingProcess$Yield

## Split the data into a training and a test set
## (no seed is set before the partition, so the split will differ between runs)
trainRows <- createDataPartition(df$Yield, p = .80, list = FALSE)
df.train <- df[trainRows, ]
df.test <- df[-trainRows, ]


colYield <- which(colnames(df) == "Yield")
trainingX <- df.train[, -colYield]
trainingY <- df.train$Yield
testingX <- df.test[, -colYield]
testingY <- df.test$Yield
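A couple of quick checks (a small sketch, not required by the exercise): confirm that the imputation left no missing values and look at the sizes of the 80/20 split.

sum(is.na(df))        # should be 0 after knnImpute
dim(df.train)
dim(df.test)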
  a. Which nonlinear regression model gives the optimal resampling and test set performance?

KNN Model

knnmodelcmp <- train(x = trainingX,
            y = trainingY,
            method = "knn",
            preProc = c("center", "scale"),
            tuneLength = 10)
knnmodelcmp
## k-Nearest Neighbors 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.407356  0.4212720  1.112452
##    7  1.424206  0.4010071  1.136401
##    9  1.411586  0.4117033  1.129366
##   11  1.406807  0.4191247  1.133421
##   13  1.403986  0.4248304  1.132403
##   15  1.416444  0.4191244  1.145144
##   17  1.417817  0.4229133  1.145318
##   19  1.428014  0.4193412  1.155316
##   21  1.439530  0.4129175  1.160116
##   23  1.453014  0.4042676  1.175374
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 13.
knncmppredict <- predict(knnmodelcmp, newdata = testingX)
postResample(pred = knncmppredict, obs = testingY)
##      RMSE  Rsquared       MAE 
## 1.2788738 0.6050203 1.0462036

Neural Networks

nnetGrid <- expand.grid(.decay = c(0, 0.01, 0.1), .size = 1:10, .bag = FALSE)

nnetmodelcmp <- train(x = trainingX,
                    y = trainingY,
            method = "avNNet",
            tuneGrid = nnetGrid,
            trControl = trainControl(method = "cv"),
            ## Automatically standardize data prior to modeling
            ## and prediction
            preProc = c("center", "scale"),
            linout = TRUE,
            trace = FALSE,
            ## NOTE: this cap still references trainingData$x from Exercise
            ## 7.2 (10 predictors); see the sketch after this chunk.
            MaxNWts = 5 * (ncol(trainingData$x) + 1) + 5 + 1,
            maxit = 500)
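As with the earlier fit, the weight cap matters here (my note): it is still computed from the 7.2 data and allows only 61 weights, far too few for this data set's 57 predictors, so only the size = 1 networks can be fit. That matches the NaN rows for every larger size below. A sketch of a cap sized for this grid:

pCMP <- ncol(trainingX)               # 57 predictors
maxSize <- max(nnetGrid$.size)        # 10 hidden units
maxSize * (pCMP + 1) + maxSize + 1    # weights needed for the largest network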

nnetmodelcmp
## Model Averaged Neural Network 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 131, 129, 129, 131, 129, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    1.584571  0.3130882  1.318742
##   0.00    2         NaN        NaN       NaN
##   0.00    3         NaN        NaN       NaN
##   0.00    4         NaN        NaN       NaN
##   0.00    5         NaN        NaN       NaN
##   0.00    6         NaN        NaN       NaN
##   0.00    7         NaN        NaN       NaN
##   0.00    8         NaN        NaN       NaN
##   0.00    9         NaN        NaN       NaN
##   0.00   10         NaN        NaN       NaN
##   0.01    1    1.556010  0.3951200  1.209399
##   0.01    2         NaN        NaN       NaN
##   0.01    3         NaN        NaN       NaN
##   0.01    4         NaN        NaN       NaN
##   0.01    5         NaN        NaN       NaN
##   0.01    6         NaN        NaN       NaN
##   0.01    7         NaN        NaN       NaN
##   0.01    8         NaN        NaN       NaN
##   0.01    9         NaN        NaN       NaN
##   0.01   10         NaN        NaN       NaN
##   0.10    1    1.416094  0.4921079  1.132066
##   0.10    2         NaN        NaN       NaN
##   0.10    3         NaN        NaN       NaN
##   0.10    4         NaN        NaN       NaN
##   0.10    5         NaN        NaN       NaN
##   0.10    6         NaN        NaN       NaN
##   0.10    7         NaN        NaN       NaN
##   0.10    8         NaN        NaN       NaN
##   0.10    9         NaN        NaN       NaN
##   0.10   10         NaN        NaN       NaN
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1, decay = 0.1 and bag = FALSE.
nnetcmppredictcmp <- predict(nnetmodelcmp, newdata = testingX)
postResample(pred = nnetcmppredictcmp, obs = testingY)
##      RMSE  Rsquared       MAE 
## 1.5994384 0.4379603 1.2752048

Multivariate Adaptive Regression Splines

# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Fix the seed so that the results can be reproduced
set.seed(100)
marsmodelcmp <- train(x = trainingX,
                      y = trainingY,
                      method = "earth",
                      tuneGrid = marsGrid,
                      trControl = trainControl(method = "cv"),
                      preProc = c("center", "scale"),
                      tuneLength = 10)

marsmodelcmp
## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.444061  0.4013748  1.1603587
##   1        3      1.175249  0.5894302  0.9428001
##   1        4      1.160211  0.6175903  0.9544192
##   1        5      1.252364  0.5833326  1.0365835
##   1        6      1.166502  0.6234591  0.9586637
##   1        7      1.166563  0.6274583  0.9398742
##   1        8      1.188837  0.5913858  0.9746283
##   1        9      1.208097  0.5815979  0.9870916
##   1       10      1.253156  0.5632424  1.0138561
##   1       11      1.218883  0.5798604  0.9773617
##   1       12      1.205409  0.5906900  0.9582228
##   1       13      1.167951  0.6147467  0.9358355
##   1       14      1.171180  0.6111584  0.9431921
##   1       15      1.172861  0.6069284  0.9546878
##   1       16      1.178332  0.6041202  0.9623805
##   1       17      1.184181  0.6003157  0.9647359
##   1       18      1.184181  0.6003157  0.9647359
##   1       19      1.188177  0.5949445  0.9595038
##   1       20      1.205153  0.5862832  0.9716499
##   1       21      1.210471  0.5834874  0.9765732
##   1       22      1.203311  0.5868768  0.9759124
##   1       23      1.202759  0.5866147  0.9770236
##   1       24      1.197362  0.5900525  0.9706005
##   1       25      1.204300  0.5887813  0.9779134
##   1       26      1.204062  0.5885443  0.9777640
##   1       27      1.204062  0.5885443  0.9777640
##   1       28      1.204062  0.5885443  0.9777640
##   1       29      1.204062  0.5885443  0.9777640
##   1       30      1.204062  0.5885443  0.9777640
##   1       31      1.204062  0.5885443  0.9777640
##   1       32      1.204062  0.5885443  0.9777640
##   1       33      1.204062  0.5885443  0.9777640
##   1       34      1.204062  0.5885443  0.9777640
##   1       35      1.204062  0.5885443  0.9777640
##   1       36      1.204062  0.5885443  0.9777640
##   1       37      1.204062  0.5885443  0.9777640
##   1       38      1.204062  0.5885443  0.9777640
##   2        2      1.444061  0.4013748  1.1603587
##   2        3      1.185007  0.5792362  0.9527514
##   2        4      1.152807  0.6069138  0.9256951
##   2        5      1.202202  0.5910730  0.9576904
##   2        6      1.177337  0.6101047  0.9237307
##   2        7      1.195214  0.6003549  0.9526190
##   2        8      1.182864  0.6039753  0.9322286
##   2        9      1.200815  0.5660698  0.9443230
##   2       10      1.193127  0.5873743  0.9398984
##   2       11      1.224756  0.5986929  0.9710711
##   2       12      1.245054  0.5701015  0.9942423
##   2       13      1.197486  0.5992033  0.9702788
##   2       14      1.356648  0.5265972  1.0741137
##   2       15      1.424230  0.4999087  1.0887157
##   2       16      1.463153  0.4680344  1.1104824
##   2       17      1.476454  0.4767288  1.0899221
##   2       18      1.501645  0.4633398  1.1117723
##   2       19      1.511346  0.4563785  1.1178177
##   2       20      1.510317  0.4549507  1.1167499
##   2       21      1.509077  0.4533001  1.1173055
##   2       22      1.507403  0.4562626  1.1204202
##   2       23      1.515518  0.4537028  1.1217657
##   2       24      1.526465  0.4458162  1.1132488
##   2       25      1.527829  0.4453549  1.1127053
##   2       26      1.534977  0.4433749  1.1136937
##   2       27      1.534977  0.4433749  1.1136937
##   2       28      1.534977  0.4433749  1.1136937
##   2       29      1.534977  0.4433749  1.1136937
##   2       30      1.534977  0.4433749  1.1136937
##   2       31      1.534977  0.4433749  1.1136937
##   2       32      1.534977  0.4433749  1.1136937
##   2       33      1.534977  0.4433749  1.1136937
##   2       34      1.534977  0.4433749  1.1136937
##   2       35      1.534977  0.4433749  1.1136937
##   2       36      1.534977  0.4433749  1.1136937
##   2       37      1.534977  0.4433749  1.1136937
##   2       38      1.534977  0.4433749  1.1136937
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 4 and degree = 2.
marspredictcmp <- predict(marsmodelcmp, newdata = testingX)
postResample(pred = marspredictcmp, obs = testingY)
##      RMSE  Rsquared       MAE 
## 2.1493604 0.1937177 1.5058004

Support Vector Machines

svmmodelcmp <- train(x = trainingX,
                    y = trainingY,
             method = "svmRadial",
             preProc = c("center", "scale"),
             tuneLength = 14,
             trControl = trainControl(method = "cv"))

svmmodelcmp
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 129, 130, 131, 130, 128, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE      
##      0.25  1.2997257  0.5920603  1.0581932
##      0.50  1.1465549  0.6535431  0.9292275
##      1.00  1.0502310  0.6850454  0.8501671
##      2.00  1.0068258  0.7008040  0.8080636
##      4.00  0.9823643  0.7137321  0.7960622
##      8.00  0.9738763  0.7172855  0.7876829
##     16.00  0.9744713  0.7168080  0.7868434
##     32.00  0.9744713  0.7168080  0.7868434
##     64.00  0.9744713  0.7168080  0.7868434
##    128.00  0.9744713  0.7168080  0.7868434
##    256.00  0.9744713  0.7168080  0.7868434
##    512.00  0.9744713  0.7168080  0.7868434
##   1024.00  0.9744713  0.7168080  0.7868434
##   2048.00  0.9744713  0.7168080  0.7868434
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01440193
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01440193 and C = 8.
svmpredictcmp <- predict(svmmodelcmp, newdata = testingX)
postResample(pred = svmpredictcmp, obs = testingY)
##      RMSE  Rsquared       MAE 
## 1.5199166 0.4181622 1.2680693

Resampling Models

resample <- rbind(
  "KNN" = postResample(pred = predict(knnmodelcmp), obs = trainingY),
  "N NET" = postResample(pred = predict(nnetmodelcmp), obs = trainingY),
  "MARS" = postResample(pred = predict(marsmodelcmp), obs = trainingY),
  "SVM" = postResample(pred = predict(svmmodelcmp), obs = trainingY)
)
resample %>%
  kable() %>%
  kable_styling()
        RMSE       Rsquared   MAE
KNN     1.2480907  0.5713744  1.0147753
N NET   0.7500020  0.8347106  0.6072691
MARS    1.0720993  0.6521974  0.8412929
SVM     0.1753772  0.9920798  0.1677212

The SVM model looks best here, although these values come from re-predicting the training data (predict() without newdata), which favors the most flexible fit.
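A cleaner way to compare resampling performance (a hedged sketch): caret's resamples() collects the held-out resample results instead of re-predicting the training data. It requires every model to share the same resampling scheme, so the KNN model, which used the default bootstrap above, would need to be refit with trainControl(method = "cv") before it could be added to the list.

resamp <- resamples(list(NNET = nnetmodelcmp,
                         MARS = marsmodelcmp,
                         SVM  = svmmodelcmp))
summary(resamp)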

Test Set Models

predictcmp <- rbind(
  "KNN" = postResample(pred = knncmppredict, obs = testingY),
  "N NET" = postResample(pred = nnetcmppredictcmp, obs = testingY),
  "MARS" = postResample(pred = marspredictcmp, obs = testingY),
  "SVM" = postResample(pred = svmpredictcmp, obs = testingY)
)
predictcmp %>%
  kable() %>%
  kable_styling()
        RMSE      Rsquared   MAE
KNN     1.278874  0.6050203  1.046204
N NET   1.599438  0.4379603  1.275205
MARS    2.149360  0.1937177  1.505800
SVM     1.519917  0.4181622  1.268069

On the test set, the KNN model performs best.

  b. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
varImp(knnmodelcmp)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   97.95
## ManufacturingProcess13   95.63
## BiologicalMaterial06     90.79
## BiologicalMaterial03     87.58
## ManufacturingProcess17   82.67
## ManufacturingProcess06   82.18
## BiologicalMaterial12     80.61
## ManufacturingProcess31   75.49
## ManufacturingProcess36   72.65
## ManufacturingProcess11   71.96
## BiologicalMaterial02     58.48
## BiologicalMaterial11     58.39
## ManufacturingProcess18   53.40
## BiologicalMaterial09     52.76
## ManufacturingProcess25   49.90
## BiologicalMaterial04     42.53
## ManufacturingProcess29   41.18
## ManufacturingProcess30   40.60
## BiologicalMaterial08     40.31

ManufacturingProcess32 is the most important predictor for the KNN model, and the manufacturing process variables dominate the list.

plot(varImp(knnmodelcmp))

varImp(nnetmodelcmp)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   97.95
## ManufacturingProcess13   95.63
## BiologicalMaterial06     90.79
## BiologicalMaterial03     87.58
## ManufacturingProcess17   82.67
## ManufacturingProcess06   82.18
## BiologicalMaterial12     80.61
## ManufacturingProcess31   75.49
## ManufacturingProcess36   72.65
## ManufacturingProcess11   71.96
## BiologicalMaterial02     58.48
## BiologicalMaterial11     58.39
## ManufacturingProcess18   53.40
## BiologicalMaterial09     52.76
## ManufacturingProcess25   49.90
## BiologicalMaterial04     42.53
## ManufacturingProcess29   41.18
## ManufacturingProcess30   40.60
## BiologicalMaterial08     40.31

ManufacturingProcess32 is the most important predictor for the neural network model, and the manufacturing process variables dominate the list.

plot(varImp(nnetmodelcmp))

varImp(marsmodelcmp)
## earth variable importance
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   49.81
## ManufacturingProcess17    0.00
## BiologicalMaterial04      0.00

ManufacturingProcess32 is the most important predictor in the MARS model, which retains only a handful of predictors; the manufacturing process variables dominate.

plot(varImp(marsmodelcmp))

varImp(svmmodelcmp)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   97.95
## ManufacturingProcess13   95.63
## BiologicalMaterial06     90.79
## BiologicalMaterial03     87.58
## ManufacturingProcess17   82.67
## ManufacturingProcess06   82.18
## BiologicalMaterial12     80.61
## ManufacturingProcess31   75.49
## ManufacturingProcess36   72.65
## ManufacturingProcess11   71.96
## BiologicalMaterial02     58.48
## BiologicalMaterial11     58.39
## ManufacturingProcess18   53.40
## BiologicalMaterial09     52.76
## ManufacturingProcess25   49.90
## BiologicalMaterial04     42.53
## ManufacturingProcess29   41.18
## ManufacturingProcess30   40.60
## BiologicalMaterial08     40.31

ManufacturingProcess32 is the most important predictor for the SVM model, and the manufacturing process variables dominate the list.

plot(varImp(svmmodelcmp))

Across all of the models, ManufacturingProcess32 is the most important predictor, with ManufacturingProcess09 second. Note that the KNN, neural network, and SVM rankings are identical because caret falls back to the same model-free loess R^2 importance for all three, as the varImp() output headers show.

  c. Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
mimp <- varImp(knnmodelcmp)$importance
imp10 <- head(rownames(mimp)[order(-mimp$Overall)], 10)
as.data.frame(imp10)
##                     imp10
## 1  ManufacturingProcess32
## 2  ManufacturingProcess09
## 3  ManufacturingProcess13
## 4    BiologicalMaterial06
## 5    BiologicalMaterial03
## 6  ManufacturingProcess17
## 7  ManufacturingProcess06
## 8    BiologicalMaterial12
## 9  ManufacturingProcess31
## 10 ManufacturingProcess36
X <- df[,imp10]
Y <- df[,colYield]

## Shorten the variable names for readability
colnames(X) <- gsub("(Process|Material)", "", colnames(X))
featurePlot(X, Y)

These plots show the relationships between the top ten predictors and the yield. The optimal model was the KNN, and the plots suggest the relationships are mostly linear; the top two predictors show a few outliers. Understanding these relationships helps decide how the data might need to be transformed or cleaned before modeling.
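To back up the visual impression that the relationships are mostly linear, a small follow-up sketch computes the correlation of each of the top ten predictors with the yield (using the shortened column names created above):

## Correlation of each top-ten predictor with Yield
round(cor(X, Y), 2)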