7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)$$

where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 other, non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
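To make the formula concrete, the simulation can be sketched by hand in base R. The function name sim_friedman1 and its arguments are hypothetical, chosen for illustration; this is not the mlbench source, only a direct transcription of the equation above.

```r
# Hypothetical helper: simulate Friedman's benchmark directly from the equation.
# Assumes 10 uniform predictors, of which only the first 5 enter the response.
sim_friedman1 <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10,
              dimnames = list(NULL, paste0("X", 1:10)))
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  # X6 to X10 never enter the signal, so they are pure noise predictors
  list(x = data.frame(x), y = signal + rnorm(n, mean = 0, sd = sd))
}

set.seed(200)
sim <- sim_friedman1(200, sd = 1)
dim(sim$x)      # 200 rows, 10 columns
summary(sim$y)
```

With sd = 0 the response is bounded between 0 and 30, which gives a sense of scale for the RMSE values reported by the models.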

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.2
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(tidyr)
library(mice)
## Warning: package 'mice' was built under R version 4.3.3
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.3.2
library(caret)
## Warning: package 'caret' was built under R version 4.3.1
## Loading required package: lattice
library(earth)
## Warning: package 'earth' was built under R version 4.3.3
## Loading required package: Formula
## Warning: package 'Formula' was built under R version 4.3.1
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.3.3
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.3.2
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)

featurePlot(trainingData$x, trainingData$y)

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

library(caret)
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProcess = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461
marsModel <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "earth",
                   preProcess = c("center", "scale"),
                   tuneLength = 10)
marsModel
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE      Rsquared   MAE     
##    2      4.383438  0.2405683  3.597961
##    3      3.645469  0.4745962  2.930453
##    4      2.727602  0.7035031  2.184240
##    6      2.331605  0.7835496  1.833420
##    7      1.976830  0.8421599  1.562591
##    9      1.804342  0.8683110  1.410395
##   10      1.787676  0.8711960  1.386944
##   12      1.821005  0.8670619  1.419893
##   13      1.858688  0.8617344  1.445459
##   15      1.871033  0.8607099  1.457618
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 10 and degree = 1.
marsPred <- predict(marsModel, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
##     RMSE Rsquared      MAE 
## 1.776575 0.872700 1.358367

Of the two models fit above (KNN and MARS), the tuned MARS model gives the best performance: on the test set its RMSE is 1.78 and its R-squared is 0.873, against an RMSE of 3.20 and an R-squared of 0.682 for KNN. R-squared alone is not always the best measure of model performance, but here RMSE and MAE point the same way, so the tuned MARS model can be taken as the better of the two. The MARS model also selects the informative predictors X1–X5.
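One way to check the claim about X1–X5 is to list the predictors that a fitted MARS model actually uses. The sketch below refits MARS with earth() directly on a freshly simulated training set rather than through train(), so the object name mars_fit and the exact terms it keeps are illustrative only. On the tuned model above, the analogous calls would be summary(marsModel$finalModel) and varImp(marsModel).

```r
library(mlbench)
library(earth)

set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
x <- data.frame(sim$x)            # columns are named X1 ... X10

# Additive MARS fit (degree = 1), the degree that train() held constant above
mars_fit <- earth(x, sim$y, degree = 1)

# evimp() lists only the predictors that survive the pruning pass
used <- rownames(evimp(mars_fit))
used

# Any name printed here is a noise predictor (X6 to X10) that MARS kept
setdiff(used, paste0("X", 1:5))
```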

7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.3.3
data("ChemicalManufacturingProcess")
dim(ChemicalManufacturingProcess)
## [1] 176  58
# knnImpute also centers and scales every numeric column, Yield included
imputed_df <- preProcess(ChemicalManufacturingProcess, "knnImpute")
imputed_full_df <- predict(imputed_df, ChemicalManufacturingProcess)
# drop near-zero-variance columns using base R df[rows, columns]
val_low <- nearZeroVar(imputed_full_df)
chem_df <- imputed_full_df[, -val_low]
# 80/20 train/test split, stratified on Yield
chem_index <- createDataPartition(chem_df$Yield, p = 0.8, list = FALSE)
chem_train <- chem_df[chem_index, ]
chem_test <- chem_df[-chem_index, ]
# MARS tuning grid: additive and two-way interaction models, 2 to 38 terms
chem_marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

chem_MarsTuned <- train(Yield~. ,
                  data = chem_train,
                   method = "earth",
                   tuneGrid = chem_marsGrid,
                   trControl = trainControl(method = "cv"))

chem_MarsTunePred <- predict(chem_MarsTuned,  chem_test)
postResample(chem_MarsTunePred, chem_test$Yield)
##      RMSE  Rsquared       MAE 
## 1.0695878 0.2982139 0.7288033
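# Note: data = chem_test fits the SVM on the same rows that are scored below,
# so the metrics that follow are in-sample rather than held-out
chem_SVMTuned <- train(Yield ~ .,
                       data = chem_test,
                       method = "svmRadial",
                       tuneLength = 15,
                       trControl = trainControl(method = "cv"))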

chem_SVMTunePred <- predict(chem_SVMTuned,  chem_test)
postResample(chem_SVMTunePred, chem_test$Yield)
##      RMSE  Rsquared       MAE 
## 0.3583862 0.9382534 0.2603542
# Note: as with the SVM, this model is fit on chem_test and then scored on it
knnModel <- train(Yield ~ .,
                  data = chem_test,
                  method = "knn",
                  preProcess = c("center", "scale"),
                  tuneLength = 10)

knnPred <- predict(knnModel,  chem_test)
postResample(pred = knnPred, obs = chem_test$Yield)
##      RMSE  Rsquared       MAE 
## 0.7010159 0.7304704 0.5595637
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10), .bag = FALSE)

nnetTune <- train(Yield~., 
                  data = chem_train, 
                  method = "avNNet", 
                  tuneGrid = nnetGrid,
                  #trControl = trainControl(method = "cv"), 
                  linout = TRUE,trace = FALSE,
                  MaxNWts = 10 * (ncol(chem_train) + 1) + 10 + 1, 
                  maxit = 100)
## Warning: executing %dopar% sequentially: no parallel backend registered
nnetPred <- predict(nnetTune, chem_test)
postResample(nnetPred, chem_test$Yield)
##      RMSE  Rsquared       MAE 
## 0.6108302 0.7232285 0.4535190

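On the test set, the tuned SVM reports the lowest RMSE (0.358) and the highest R-squared (0.938), followed by the averaged neural network (RMSE 0.611, R-squared 0.723), KNN (RMSE 0.701, R-squared 0.730), and MARS (RMSE 1.070, R-squared 0.298). The SVM and KNN models, however, were fit on chem_test, the same rows used to score them, so their figures are in-sample estimates and are not directly comparable with the held-out results for MARS and the neural network.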
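A like-for-like comparison needs every model fit on the training rows with the same resampling folds. The following is a minimal sketch of that workflow for two of the models, not a rerun of the results above: the names train_df, test_df, ctrl, svm_fit, and knn_fit are hypothetical, the seed is arbitrary, and tuneLength is kept small so the example runs quickly. Its numbers will differ from those printed earlier.

```r
library(caret)
library(AppliedPredictiveModeling)

data("ChemicalManufacturingProcess")
set.seed(100)  # arbitrary seed, used only to make this sketch repeatable

# Same preparation steps as above: knn imputation, then drop near-zero-variance columns
df <- predict(preProcess(ChemicalManufacturingProcess, "knnImpute"),
              ChemicalManufacturingProcess)
nzv <- nearZeroVar(df)
if (length(nzv) > 0) df <- df[, -nzv]

idx <- createDataPartition(df$Yield, p = 0.8, list = FALSE)
train_df <- df[idx, ]
test_df <- df[-idx, ]

# Fix the CV folds once so both models are resampled on identical splits
folds <- createFolds(train_df$Yield, k = 10, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = folds)

# Data are already centered and scaled by knnImpute, so no extra preProcess here
svm_fit <- train(Yield ~ ., data = train_df, method = "svmRadial",
                 tuneLength = 5, trControl = ctrl)
knn_fit <- train(Yield ~ ., data = train_df, method = "knn",
                 tuneLength = 5, trControl = ctrl)

models <- list(SVM = svm_fit, KNN = knn_fit)

# Resampling performance on the training set
summary(resamples(models))

# Held-out performance: one column per model, rows RMSE / Rsquared / MAE
test_perf <- sapply(models, function(m) {
  postResample(predict(m, test_df), test_df$Yield)
})
test_perf
```

Passing the same index list to trainControl() is what lets resamples() pair the fold-level results across models.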

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Of the top 20 predictors, 12 are BiologicalMaterial and 8 are ManufacturingProcess. Within the top 10, however, 6 are ManufacturingProcess, and the first three are ManufacturingProcess32, ManufacturingProcess36, and ManufacturingProcess05. Neither group dominates the list, although the ManufacturingProcess predictors hold the greater influence at the top. In the optimal linear model, the ManufacturingProcess predictors are more important still.

plot(varImp(chem_SVMTuned), top=10)
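The split between the two predictor groups can also be tallied in code rather than read off the plot. Because svmRadial has no model-specific importance measure, caret's varImp() falls back to a model-free filter that scores each predictor on its own against the response. The sketch below calls that filter, filterVarImp(), directly on the full imputed data set, so its ranking may differ from the plot above. The names imp, top10, and group_counts are hypothetical.

```r
library(caret)
library(AppliedPredictiveModeling)

data("ChemicalManufacturingProcess")
df <- predict(preProcess(ChemicalManufacturingProcess, "knnImpute"),
              ChemicalManufacturingProcess)
nzv <- nearZeroVar(df)
if (length(nzv) > 0) df <- df[, -nzv]

# Model-free importance: each predictor is scored by a univariate fit to Yield
x <- df[, setdiff(names(df), "Yield")]
imp <- filterVarImp(x, df$Yield, nonpara = TRUE)

# Ten highest-scoring predictors
top10 <- head(rownames(imp)[order(imp$Overall, decreasing = TRUE)], 10)
top10

# Strip the trailing digits to count predictors per group
group_counts <- table(sub("[0-9]+$", "", top10))
group_counts
```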

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

A correlation plot (corrplot) was used to explore these relationships. The response, Yield, is not strongly correlated with most of the important predictors noted previously. The strongest positive correlations are with ManufacturingProcess32 and ManufacturingProcess09 (0.61 and 0.50, respectively), and the strongest negative correlations are with ManufacturingProcess36 and ManufacturingProcess13 (-0.53 and -0.50, respectively). Yield has no strong correlation with any of the important BiologicalMaterial predictors identified in the previous step.

corr_vals <- chem_df %>% 
  dplyr::select('Yield', 'ManufacturingProcess32','ManufacturingProcess36',
         'BiologicalMaterial06','ManufacturingProcess13',
         'BiologicalMaterial03','ManufacturingProcess17',
         'BiologicalMaterial02','BiologicalMaterial12',
         'ManufacturingProcess09','ManufacturingProcess31')

corr_plot_vals <- cor(corr_vals)

corrplot.mixed(corr_plot_vals, tl.col = 'black', tl.pos = 'lt', 
         upper = "number", lower="circle")
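A correlation coefficient only captures linear association, so scatterplots with a smoother are a useful complement for a nonlinear model. The sketch below ranks the four process predictors discussed above by their correlation with Yield and then plots each against Yield. It rebuilds the imputed data from scratch, so the names df, preds, and yield_cor are hypothetical and the values may differ slightly from those quoted.

```r
library(caret)
library(AppliedPredictiveModeling)

data("ChemicalManufacturingProcess")
df <- predict(preProcess(ChemicalManufacturingProcess, "knnImpute"),
              ChemicalManufacturingProcess)

# Process predictors named in the discussion above
preds <- c("ManufacturingProcess32", "ManufacturingProcess09",
           "ManufacturingProcess36", "ManufacturingProcess13")

# Pearson correlation of each predictor with Yield, largest first
yield_cor <- sort(cor(df[, preds], df$Yield)[, 1], decreasing = TRUE)
round(yield_cor, 2)

# Scatterplots with a smoother can show curvature that a correlation hides
featurePlot(df[, preds], df$Yield, plot = "scatter",
            type = c("p", "smooth"), layout = c(2, 2))
```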