Data 624 HW 8

library(knitr)
library(rmdformats)

## Global options
options(max.print="85")
opts_chunk$set(cache=TRUE,
               prompt=FALSE,
               tidy=TRUE,
               comment=NA,
               message=FALSE,
               warning=FALSE)
opts_knit$set(width=31)

library(mlbench)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(tidyselect)
library(AppliedPredictiveModeling)

7.2

Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data:

y = 10 sin(πx1x2) + 20(x3 − 0.5)2 + 10x4 + 5x5 + N(0, σ2)

where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)

KNN Model

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

knnModel <- train(x = trainingData$x, y = trainingData$y, method = "knn", preProc = c("center", 
    "scale"), tuneLength = 10)
knnModel
k-Nearest Neighbors 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE     
   5  3.466085  0.5121775  2.816838
   7  3.349428  0.5452823  2.727410
   9  3.264276  0.5785990  2.660026
  11  3.214216  0.6024244  2.603767
  13  3.196510  0.6176570  2.591935
  15  3.184173  0.6305506  2.577482
  17  3.183130  0.6425367  2.567787
  19  3.198752  0.6483184  2.592683
  21  3.188993  0.6611428  2.588787
  23  3.200458  0.6638353  2.604529

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
     RMSE  Rsquared       MAE 
3.2040595 0.6819919 2.5683461 

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

MARS Model

options(max.print = 1e+06)

mars.grid <- expand.grid(.degree = 1:2, .nprune = 2:15)

mars.model <- train(x = trainingData$x, y = trainingData$y, method = "earth", tuneGrid = mars.grid, 
    preProcess = c("center", "scale"), tuneLength = 10)

mars.model
Multivariate Adaptive Regression Spline 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  degree  nprune  RMSE      Rsquared   MAE     
  1        2      4.383438  0.2405683  3.597961
  1        3      3.645469  0.4745962  2.930453
  1        4      2.727602  0.7035031  2.184240
  1        5      2.449243  0.7611230  1.939231
  1        6      2.331605  0.7835496  1.833420
  1        7      1.976830  0.8421599  1.562591
  1        8      1.870959  0.8585503  1.464551
  1        9      1.804342  0.8683110  1.410395
  1       10      1.787676  0.8711960  1.386944
  1       11      1.790700  0.8707740  1.393076
  1       12      1.821005  0.8670619  1.419893
  1       13      1.858688  0.8617344  1.445459
  1       14      1.862343  0.8623072  1.446050
  1       15      1.871033  0.8607099  1.457618
  2        2      4.383438  0.2405683  3.597961
  2        3      3.644919  0.4742570  2.929647
  2        4      2.730222  0.7028372  2.183075
  2        5      2.481291  0.7545789  1.965749
  2        6      2.338369  0.7827873  1.825542
  2        7      2.030065  0.8328250  1.602024
  2        8      1.890997  0.8551326  1.477422
  2        9      1.742626  0.8757904  1.371910
  2       10      1.608221  0.8943432  1.255416
  2       11      1.474325  0.9111463  1.157848
  2       12      1.437483  0.9157967  1.120977
  2       13      1.439395  0.9164721  1.128309
  2       14      1.428565  0.9184503  1.118634
  2       15      1.434093  0.9182413  1.121622

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 14 and degree = 2.

The best MARS model on training set is when nprune = 13 and degree = 2.

mars_pred <- predict(mars.model, newdata = testData$x)
postResample(pred = mars_pred, obs = testData$y)
     RMSE  Rsquared       MAE 
1.2779993 0.9338365 1.0147070 
varImp(mars.model)
earth variable importance

   Overall
X1  100.00
X4   75.40
X2   49.00
X5   15.72
X3    0.00

The 4 most important variables are X1, X4, X2, X5.

SVM Model

svm.model <- train(x = trainingData$x, y = trainingData$y, method = "svmRadial", 
    tuneLength = 10, preProcess = c("center", "scale"), trControl = trainControl(method = "repeatedcv", 
        repeats = 8))
svm.model
Support Vector Machines with Radial Basis Function Kernel 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Cross-Validated (10 fold, repeated 8 times) 
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
Resampling results across tuning parameters:

  C       RMSE      Rsquared   MAE     
    0.25  2.502311  0.8009301  1.996146
    0.50  2.245108  0.8163218  1.779447
    1.00  2.059569  0.8381964  1.622551
    2.00  1.960543  0.8507107  1.529816
    4.00  1.888723  0.8597190  1.487947
    8.00  1.861418  0.8630404  1.480071
   16.00  1.856273  0.8635021  1.479585
   32.00  1.856273  0.8635021  1.479585
   64.00  1.856273  0.8635021  1.479585
  128.00  1.856273  0.8635021  1.479585

Tuning parameter 'sigma' was held constant at a value of 0.06670077
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.06670077 and C = 16.

The optimal SVM model is attained wen C = 8.00 and sigma is held constant at 0.06.

svm_pred <- predict(svm.model, newdata = testData$x)
postResample(pred = svm_pred, obs = testData$y)
     RMSE  Rsquared       MAE 
2.0829707 0.8242096 1.5826017 
varImp(svm.model)
loess r-squared variable importance

     Overall
X4  100.0000
X1   95.5047
X2   89.6186
X5   45.2170
X3   29.9330
X9    6.3299
X10   5.5182
X8    3.2527
X6    0.8884
X7    0.0000

SVM did pick the 5 most informative predictors as the 5 most important predictors.

Neural Network Model

nnet.grid <- expand.grid(.decay = c(0, 0.01, 0.1, 0.5, 0.9), .size = c(1, 10), .bag = FALSE)

nnet.model <- train(x = trainingData$x, y = trainingData$y, method = "avNNet", preProcess = c("center", 
    "scale"), tuneGrid = nnet.grid, trControl = trainControl(method = "repeatedcv", 
    repeats = 1), trace = FALSE, linout = TRUE, maxit = 500)

nnet.model
Model Averaged Neural Network 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Cross-Validated (10 fold, repeated 1 times) 
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
Resampling results across tuning parameters:

  decay  size  RMSE      Rsquared   MAE     
  0.00    1    2.433344  0.7867541  1.916758
  0.00   10    4.884038  0.5672747  3.141966
  0.01    1    2.406902  0.7889631  1.877413
  0.01   10    2.454159  0.7774285  1.953892
  0.10    1    2.396847  0.7886603  1.869417
  0.10   10    2.301708  0.8003090  1.815960
  0.50    1    2.422011  0.7813350  1.901694
  0.50   10    2.212246  0.8209941  1.758052
  0.90    1    2.444657  0.7759943  1.923258
  0.90   10    2.052534  0.8457960  1.625324

Tuning parameter 'bag' was held constant at a value of FALSE
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were size = 10, decay = 0.9 and bag = FALSE.
nnet_pred <- predict(nnet.model, newdata = testData$x)
postResample(pred = nnet_pred, obs = testData$y)
     RMSE  Rsquared       MAE 
2.0122533 0.8384209 1.5424496 
varImp(nnet.model)
loess r-squared variable importance

     Overall
X4  100.0000
X1   95.5047
X2   89.6186
X5   45.2170
X3   29.9330
X9    6.3299
X10   5.5182
X8    3.2527
X6    0.8884
X7    0.0000

X4, X1, X2, X5, X3 were listed as the 5 most important predictors for NN model.

Performance Evaluation

results <- data.frame(t(postResample(pred = knnPred, obs = testData$y))) %>% mutate(Model = "KNN")

results <- data.frame(t(postResample(pred = mars_pred, obs = testData$y))) %>% mutate(Model = "MARS") %>% 
    bind_rows(results)

results <- data.frame(t(postResample(pred = svm_pred, obs = testData$y))) %>% mutate(Model = "SVM") %>% 
    bind_rows(results)

results <- data.frame(t(postResample(pred = nnet_pred, obs = testData$y))) %>% mutate(Model = "Neural Network") %>% 
    bind_rows(results)

results %>% dplyr::select(Model, RMSE, Rsquared, MAE) %>% dplyr::arrange(RMSE) %>% 
    kable() %>% kable_styling()
Model RMSE Rsquared MAE
MARS 1.277999 0.9338365 1.014707
Neural Network 2.012253 0.8384209 1.542450
SVM 2.082971 0.8242096 1.582602
KNN 3.204060 0.6819919 2.568346

MARS model performed the best. It selected the informative predictors as most important. It has the best metrics across the broad.

7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

(a)

Which nonlinear regression model gives the optimal resampling and test set performance?

Ans:

data(ChemicalManufacturingProcess)
# Used bagImpute method to impute the missing values.
bagImpute <- preProcess(ChemicalManufacturingProcess[, -c(1)], method = c("bagImpute"))
CMP <- predict(bagImpute, ChemicalManufacturingProcess)
# Make this reproducible
set.seed(45)
# Removing the zero variance variables from the dataset CMP
CMP_2 <- CMP %>% dplyr::select(-one_of(nearZeroVar(., names = T)))
# Create train and test split
train <- createDataPartition(CMP_2$Yield, times = 1, p = 0.8, list = FALSE)
CMP2_train_df <- CMP_2[train, ]
CMP2_test_df <- CMP_2[-train, ]

KNN Model

knn_model <- train(Yield ~ ., data = CMP2_train_df, method = "knn", preProcess = c("center", 
    "scale"), trControl = trainControl("repeatedcv", repeats = 14), tuneLength = 10)
knn_model
k-Nearest Neighbors 

144 samples
 56 predictor

Pre-processing: centered (56), scaled (56) 
Resampling: Cross-Validated (10 fold, repeated 14 times) 
Summary of sample sizes: 130, 129, 131, 130, 129, 130, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE      
   5  1.238316  0.5711530  0.9920695
   7  1.275954  0.5500808  1.0284643
   9  1.278544  0.5535930  1.0421709
  11  1.285869  0.5541710  1.0533851
  13  1.303112  0.5417439  1.0684796
  15  1.310939  0.5419706  1.0689230
  17  1.326826  0.5363184  1.0785755
  19  1.344597  0.5273826  1.0944736
  21  1.351137  0.5286896  1.0996305
  23  1.363928  0.5262034  1.1086103

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.


knn_pred <- predict(knn_model, CMP2_test_df)

results <- data.frame()
results <- data.frame(t(postResample(pred = knn_pred, obs = CMP2_test_df$Yield))) %>% 
    mutate(Model = "KNN") %>% rbind(results)

MARS Model

mars_grid <- expand.grid(.degree = 1:2, .nprune = 2:14)

mars_model <- train(Yield ~ ., data = CMP2_train_df, method = "earth", tuneGrid = mars_grid, 
    preProcess = c("center", "scale"))
mars_model
Multivariate Adaptive Regression Spline 

144 samples
 56 predictor

Pre-processing: centered (56), scaled (56) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
Resampling results across tuning parameters:

  degree  nprune  RMSE      Rsquared   MAE      
  1        2      1.497808  0.3462486  1.1604640
  1        3      1.264061  0.5271330  0.9945759
  1        4      1.188350  0.5775696  0.9360398
  1        5      1.214630  0.5647259  0.9574269
  1        6      1.357955  0.5256184  1.0094276
  1        7      1.708676  0.4934250  1.0773009
  1        8      1.755209  0.4768041  1.1037725
  1        9      1.782182  0.4747345  1.1099410
  1       10      1.781493  0.4741570  1.1165152
  1       11      1.748795  0.4784380  1.1197517
  1       12      1.751767  0.4674375  1.1321467
  1       13      1.746962  0.4520736  1.1474127
  1       14      1.710127  0.4545385  1.1433512
  2        2      1.497808  0.3462486  1.1604640
  2        3      1.277601  0.5198498  0.9985240
  2        4      1.213490  0.5596745  0.9471576
  2        5      1.240414  0.5495335  0.9776753
 [ reached getOption("max.print") -- omitted 9 rows ]

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 4 and degree = 1.
mars_pred <- predict(mars_model, CMP2_test_df)

results <- data.frame(t(postResample(pred = mars_pred, obs = CMP2_test_df$Yield))) %>% 
    mutate(Model = "MARS") %>% rbind(results)

SVM Model

svm_model <- train(Yield ~ ., data = CMP2_train_df, method = "svmRadial", preProc = c("center", 
    "scale"), trControl = trainControl(method = "repeatedcv", repeats = 8), tuneLength = 14)
svm_model
Support Vector Machines with Radial Basis Function Kernel 

144 samples
 56 predictor

Pre-processing: centered (56), scaled (56) 
Resampling: Cross-Validated (10 fold, repeated 8 times) 
Summary of sample sizes: 129, 131, 131, 130, 130, 129, ... 
Resampling results across tuning parameters:

  C        RMSE      Rsquared   MAE      
     0.25  1.341938  0.5529206  1.0870890
     0.50  1.205938  0.6087110  0.9779722
     1.00  1.110874  0.6521890  0.8953666
     2.00  1.058657  0.6774674  0.8387258
     4.00  1.069424  0.6717891  0.8447317
     8.00  1.065245  0.6772205  0.8397986
    16.00  1.060000  0.6802050  0.8359710
    32.00  1.060000  0.6802050  0.8359710
    64.00  1.060000  0.6802050  0.8359710
   128.00  1.060000  0.6802050  0.8359710
   256.00  1.060000  0.6802050  0.8359710
   512.00  1.060000  0.6802050  0.8359710
  1024.00  1.060000  0.6802050  0.8359710
  2048.00  1.060000  0.6802050  0.8359710

Tuning parameter 'sigma' was held constant at a value of 0.01244262
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.01244262 and C = 2.
svm_pred <- predict(svm_model, CMP2_test_df)

results <- data.frame(t(postResample(pred = svm_pred, obs = CMP2_test_df$Yield))) %>% 
    mutate(Model = "SVM") %>% rbind(results)

Neural Network Model

nnet_grid <- expand.grid(.decay = c(0, 0.01, 0.1), .size = c(1, 5, 10), .bag = FALSE)

nnet_model <- train(Yield ~ ., data = CMP2_train_df, method = "avNNet", tuneGrid = nnet_grid, 
    preProcess = c("center", "scale"), trControl = trainControl(method = "repeatedcv", 
        repeats = 1), trace = FALSE, linout = TRUE, MaxNWts = 301, maxit = 500)

nnet_model
Model Averaged Neural Network 

144 samples
 56 predictor

Pre-processing: centered (56), scaled (56) 
Resampling: Cross-Validated (10 fold, repeated 1 times) 
Summary of sample sizes: 131, 130, 130, 129, 130, 129, ... 
Resampling results across tuning parameters:

  decay  size  RMSE      Rsquared   MAE     
  0.00    1    1.513358  0.4390102  1.234170
  0.00    5    1.873332  0.3647700  1.450868
  0.00   10         NaN        NaN       NaN
  0.01    1    1.435553  0.4773275  1.149338
  0.01    5    1.763261  0.4347583  1.238102
  0.01   10         NaN        NaN       NaN
  0.10    1    1.784088  0.4670214  1.220596
  0.10    5    1.824278  0.4046101  1.364862
  0.10   10         NaN        NaN       NaN

Tuning parameter 'bag' was held constant at a value of FALSE
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were size = 1, decay = 0.01 and bag = FALSE.


nnet_pred <- predict(nnet_model, CMP2_test_df)

results <- data.frame(t(postResample(pred = nnet_pred, obs = CMP2_test_df$Yield))) %>% 
    mutate(Model = "Neural Network") %>% rbind(results)
# results <- results [-1, ]
results %>% dplyr::select(Model, RMSE, Rsquared, MAE) %>% arrange(RMSE) %>% kable() %>% 
    kable_styling()
Model RMSE Rsquared MAE
SVM 1.388832 0.4899527 1.110552
MARS 1.468070 0.4309830 1.206402
KNN 1.497138 0.4412967 1.248188
Neural Network 1.567783 0.4009922 1.298610

MARS turns out to the best non-linear model by far, in terms of RMSE.

(b)

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

caret::varImp(mars_model, 10)
earth variable importance

                       Overall
ManufacturingProcess32  100.00
ManufacturingProcess09   49.87
ManufacturingProcess13    0.00
# sourcing the pls_model from HW 7
pls_model <- train(Yield ~ ., data = CMP2_train_df, method = "pls", tuneLength = 24, 
    trControl = trainControl(method = "repeatedcv", repeats = 8), preProcess = c("center", 
        "scale"))
caret::varImp(pls_model, 10)
pls variable importance

  only 20 most important variables shown (out of 56)

                       Overall
ManufacturingProcess32  100.00
ManufacturingProcess13   93.07
ManufacturingProcess17   88.16
ManufacturingProcess09   85.37
ManufacturingProcess36   76.32
ManufacturingProcess06   69.43
BiologicalMaterial02     68.17
BiologicalMaterial06     65.69
ManufacturingProcess33   61.27
BiologicalMaterial08     61.12
ManufacturingProcess12   60.49
BiologicalMaterial03     58.61
BiologicalMaterial12     57.87
BiologicalMaterial11     57.04
ManufacturingProcess11   56.23
BiologicalMaterial01     54.31
BiologicalMaterial04     54.30
ManufacturingProcess28   46.93
ManufacturingProcess04   46.41
ManufacturingProcess10   39.21

When we compare the the top 10 predictors from the most optimal model (mars_model) and top 10 from the pls_model, we do see that ManufacturingProcess32 tops the list. So manufacturing Process definitely dominated the list with 7 out of top 10 in the pls_model is Manufacturing Process.

(c)

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model.

Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Ans: There was only one important predictor chosen by MARS, I’m just going to examine its relationship with Yield to completely wrapping up the answer to this question.

ggplot(CMP2_train_df, aes(ManufacturingProcess32, Yield)) + geom_point()

Apparently, the Yield is maximized when the ManufacturingProcess32 is around 167 and, in general, there is a positive correlation between yield, the response variable, and the predictor, ManufacturingProcess32. So the key is to maximize ManufacturingProcess32 and have it attained its value closer to 167.