Kuhn and Johnson

7.2) Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data.

y = 10sin(\(\pi\)\(x_1x_2\)) + 20(\(x_3\) - 0.5)\(^2\) + 10\(x_4\) + 5\(x_5\) + N(0,\(\sigma^2\))

where the x values are random variables uniformly distributed between [0,1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data.

library(mlbench)
library(caret)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd=1)
##We conver the 'x' data from a matrix to a dataframe.
##One reason is that this will give us column names.
trainingData$x <- data.frame(trainingData$x)
##Look at the data using
featurePlot(trainingData$x, trainingData$y)

##This creates a list with a vector 'y' and a matrix of predictors 'x'.

There appears to be a positive correlation between the predictor variables x1, x2, x4, and x5 with y. There does not appear to be any correlation between the predictor variables x3, x6, x7, x8, x9 and x10 with y.

Estimate the true error rate with good precision:

testData <- mlbench.friedman1(5000, sd=1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

knnModel <- train(x=trainingData$x,
                  y=trainingData$y,
                  method="knn",
                  preProc=c("center","scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.565620  0.4887976  2.886629
##    7  3.422420  0.5300524  2.752964
##    9  3.368072  0.5536927  2.715310
##   11  3.323010  0.5779056  2.669375
##   13  3.275835  0.6030846  2.628663
##   15  3.261864  0.6163510  2.621192
##   17  3.261973  0.6267032  2.616956
##   19  3.286299  0.6281075  2.640585
##   21  3.280950  0.6390386  2.643807
##   23  3.292397  0.6440392  2.656080
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
knnPred <- predict(knnModel, newdata = testData$x)
##The function 'postResample' can be used to get the test set performance values
postResample(pred=knnPred,obs=testData$y)
##      RMSE  Rsquared       MAE 
## 3.1750657 0.6785946 2.5443169

MARS (Multivariate Adaptive Regression Splines)

library(earth)
marsFit <- earth(trainingData$x, trainingData$y)
marsFit
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982
varImp(marsFit)
##       Overall
## X1  100.00000
## X4   84.21578
## X2   67.21639
## X5   45.44416
## X3   34.63259
## X6   11.90397
## X7    0.00000
## X8    0.00000
## X9    0.00000
## X10   0.00000
marsPred_simple <- predict(marsFit, newdata = testData$x)
postResample(pred=marsPred_simple,obs=testData$y)
##      RMSE  Rsquared       MAE 
## 1.8136467 0.8677298 1.3911836
marsGrid <- expand.grid(.degree=1:4, .nprune=2:30)
set.seed(50)
marsTuned <- train(trainingData$x,
                   trainingData$y,
                   method="earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method="cv"))
#marsTuned
varImp(marsTuned)
## earth variable importance
## 
##     Overall
## X1   100.00
## X4    85.13
## X2    69.22
## X5    49.28
## X3    39.95
## X7     0.00
## X9     0.00
## X6     0.00
## X10    0.00
## X8     0.00
plot(marsTuned)                   

marsPred <- predict(marsTuned, newdata = testData$x)
##The function 'postResample' can be used to get the test set performance values
postResample(pred=marsPred,obs=testData$y)                   
##      RMSE  Rsquared       MAE 
## 1.1722635 0.9448890 0.9324923

Support Vector Machines

library(kernlab)

svmRTuned <- train(trainingData$x, trainingData$y,
                   method="svmRadial", 
                   preProc=c("center","scale"),
                   tuneLength = 14,
                   trControl = trainControl(method="cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.499568  0.8106425  2.005232
##      0.50  2.239312  0.8302888  1.799795
##      1.00  2.064409  0.8521404  1.637527
##      2.00  1.976213  0.8609107  1.553383
##      4.00  1.907789  0.8668323  1.505581
##      8.00  1.878847  0.8704239  1.508344
##     16.00  1.877761  0.8708067  1.508296
##     32.00  1.877761  0.8708067  1.508296
##     64.00  1.877761  0.8708067  1.508296
##    128.00  1.877761  0.8708067  1.508296
##    256.00  1.877761  0.8708067  1.508296
##    512.00  1.877761  0.8708067  1.508296
##   1024.00  1.877761  0.8708067  1.508296
##   2048.00  1.877761  0.8708067  1.508296
## 
## Tuning parameter 'sigma' was held constant at a value of 0.0702961
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0702961 and C = 16.
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 16 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0702961005968861 
## 
## Number of Support Vectors : 152 
## 
## Objective Function Value : -63.3221 
## Training error : 0.008588
svmPred <- predict(svmRTuned, newdata = testData$x)
postResample(pred=svmPred,obs=testData$y)  
##      RMSE  Rsquared       MAE 
## 2.0930338 0.8226734 1.5906572

Linear Kernel

svmRTuned <- train(trainingData$x, trainingData$y,
                   method="svmLinear", 
                   preProc=c("center","scale"),
                   tuneLength = 14,
                   trControl = trainControl(method="cv"))
svmRTuned
## Support Vector Machines with Linear Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2.456188  0.7596526  1.962155
## 
## Tuning parameter 'C' was held constant at a value of 1
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 172 
## 
## Objective Function Value : -54.9759 
## Training error : 0.216921
svmPred <- predict(svmRTuned, newdata = testData$x)
postResample(pred=svmPred,obs=testData$y)  
##      RMSE  Rsquared       MAE 
## 2.7633860 0.6973384 2.0970616

Neural Networks

library(nnet)
nnetFit <- nnet(trainingData$x, trainingData$y,
                size=5,
                decay=0.01,
                trace=FALSE,
                maxit=500,
                MaxNWts=5*(ncol(trainingData$x)+1)+5+1)

nnetFit
## a 10-5-1 network with 61 weights
## options were - decay=0.01
nnetPred <- predict(nnetFit, newdata = testData$x)
postResample(pred=nnetPred,obs=testData$y)
##       RMSE   Rsquared        MAE 
## 14.2769353  0.2779778 13.3869179
tooHigh <- findCorrelation(cor(trainingData$x),cutoff=0.7)

trainXnnet <- trainingData$x[,-tooHigh]
testXnnet <- testData$x[,-tooHigh]

nnetGrid <- expand.grid(.decay=c(0,0.01,.1),
                       .size=c(1:10),
                       .bag=FALSE)
set.seed(100)
nnetTune <- train(trainingData$x, trainingData$y,
                method="avNNet",
                tuneGrid = nnetGrid,
                trControl = trainControl(method="cv",number=10),
                preProc=c("center","scale"),
                linout=TRUE,
                trace=FALSE,
                MaxNWts =10*(ncol(trainingData$x)+1) +10 +1,
                maxit=500)

nnetTune
## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.451722  0.7625351  1.896021
##   0.00    2    2.474140  0.7564072  1.936452
##   0.00    3    2.091278  0.8235551  1.659173
##   0.00    4    2.054404  0.8209165  1.606800
##   0.00    5    2.331057  0.7749136  1.741685
##   0.00    6    2.828831  0.7173701  2.095881
##   0.00    7    3.172453  0.6663624  2.218665
##   0.00    8    4.496342  0.5434195  3.067567
##   0.00    9    5.994543  0.4233180  3.375984
##   0.00   10    4.130111  0.5554640  2.803012
##   0.01    1    2.427639  0.7621863  1.887533
##   0.01    2    2.503912  0.7476021  1.943163
##   0.01    3    2.074827  0.8228616  1.604177
##   0.01    4    2.112135  0.8173594  1.688663
##   0.01    5    2.110145  0.8198569  1.657632
##   0.01    6    2.299108  0.7890686  1.823859
##   0.01    7    2.348609  0.7760174  1.872983
##   0.01    8    2.519307  0.7463523  2.042896
##   0.01    9    2.511074  0.7368007  1.991015
##   0.01   10    2.486297  0.7631709  1.941586
##   0.10    1    2.435364  0.7608881  1.890014
##   0.10    2    2.471661  0.7510586  1.915104
##   0.10    3    2.239584  0.7983200  1.766737
##   0.10    4    2.175860  0.8134068  1.765179
##   0.10    5    2.083732  0.8249385  1.654930
##   0.10    6    2.123293  0.8107443  1.676428
##   0.10    7    2.117684  0.8216322  1.656338
##   0.10    8    2.322703  0.7771985  1.813793
##   0.10    9    2.293910  0.7812604  1.850508
##   0.10   10    2.437919  0.7608963  1.971687
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0 and bag
##  = FALSE.
nnetPred <- predict(nnetTune, newdata = testData$x)
postResample(pred=nnetPred,obs=testData$y)
##      RMSE  Rsquared       MAE 
## 2.1619014 0.8160074 1.6498984
library(knitr)
compare_table <- data.frame(list(knn=3.18,MARS_Fit=1.81, MARS_Tuned=1.17, SVM_radial=2.08, SVM_linear=2.76, NeuralNet_Fit=14.3, NeuralNet_Tune=2.16), stringsAsFactors=FALSE)
rownames(compare_table) <- c("RMSE")
kable(compare_table, caption="Model Comparison")
Model Comparison
knn MARS_Fit MARS_Tuned SVM_radial SVM_linear NeuralNet_Fit NeuralNet_Tune
RMSE 3.18 1.81 1.17 2.08 2.76 14.3 2.16

The model with the lowest root mean square error is the tuned MARS model. The model had 14 terms (nprune = 14) and was of degree = 2. MARS selects the informative predictors: x1, x4, x2, x5 and x3.

7.5) Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting and pre-processing steps as before and train several nonlinear regression models.

  1. Which nonlinear regression model gives the optimal resampling and test set performance?

The data set ChemicalManufacturingProcess is loaded.
I imputed the median of a column for missing values. The data is separated into a training set and testing set. Preprocess data to remove 19 highly correlated variables.

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
impute.median <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))
for (i in 1:ncol(ChemicalManufacturingProcess)){
  ChemicalManufacturingProcess[,i] <- impute.median(ChemicalManufacturingProcess[,i])
}
train_chem <- ChemicalManufacturingProcess[1:130,]
test_chem <- ChemicalManufacturingProcess[131:176,]

highCorr_chem <- findCorrelation(cor(train_chem[,2:58]), cutoff=.7, exact=TRUE)
length(highCorr_chem)
## [1] 19
filtered_train_chem <- train_chem[,-highCorr_chem]
ncol(filtered_train_chem)
## [1] 39
correlation_chem <- cor(filtered_train_chem, method = "pearson")
#summary(ChemicalManufacturingProcess)

Nonlinear Regression Models

k Nearest Neighbors

knnModel_chem <- train(x=train_chem[,-1],
                  y=train_chem$Yield,
                  method="knn",
                  preProc=c("center","scale"),
                  tuneLength = 10)
knnModel_chem
## k-Nearest Neighbors 
## 
## 130 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 130, 130, 130, 130, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.294415  0.4952211  1.045348
##    7  1.315476  0.4812050  1.064978
##    9  1.330273  0.4770898  1.084025
##   11  1.361879  0.4566953  1.114846
##   13  1.380155  0.4459067  1.126567
##   15  1.390935  0.4451186  1.133090
##   17  1.402719  0.4420705  1.140357
##   19  1.412617  0.4397468  1.147891
##   21  1.412668  0.4452244  1.148591
##   23  1.423823  0.4438304  1.158805
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
knnPred_chem <- predict(knnModel_chem, newdata = test_chem[,-1])
##The function 'postResample' can be used to get the test set performance values
postResample(pred=knnPred_chem,obs=test_chem$Yield)
##      RMSE  Rsquared       MAE 
## 1.8164938 0.1849927 1.4630870

MARS

marsGrid <- expand.grid(.degree=1:4, .nprune=2:30)
set.seed(50)
marsTuned <- train(x=train_chem[,-1],
                   y=train_chem$Yield,
                   method="earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method="cv"))
#marsTuned
varImp(marsTuned)
## earth variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess17  100.00
## ManufacturingProcess28   64.29
## ManufacturingProcess32   61.23
## BiologicalMaterial06     44.07
## ManufacturingProcess13   40.73
## ManufacturingProcess09   40.73
## ManufacturingProcess16   40.73
## BiologicalMaterial12     22.54
## BiologicalMaterial02     18.20
## ManufacturingProcess04   12.69
## ManufacturingProcess36    0.00
## BiologicalMaterial07      0.00
## ManufacturingProcess33    0.00
## BiologicalMaterial04      0.00
## ManufacturingProcess05    0.00
## ManufacturingProcess38    0.00
## ManufacturingProcess45    0.00
## ManufacturingProcess27    0.00
## ManufacturingProcess34    0.00
## ManufacturingProcess39    0.00
plot(marsTuned)                   

marsPred <- predict(marsTuned, newdata = test_chem[,-1])
##The function 'postResample' can be used to get the test set performance values
postResample(pred=marsPred,obs=test_chem$Yield)                   
##       RMSE   Rsquared        MAE 
## 2.36737380 0.06714247 2.05351782

Support Vector Machines Radial Basis Kernel

svmRTuned_rad <- train(x=train_chem[,-1],
                   y=train_chem$Yield,
                   method="svmRadial", 
                   preProc=c("center","scale"),
                   tuneLength = 14,
                   trControl = trainControl(method="cv"))
svmRTuned_rad
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 130 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 117, 118, 118, 116, 117, 117, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE      
##      0.25  1.3464649  0.6176193  1.0788984
##      0.50  1.1961665  0.6568694  0.9709388
##      1.00  1.0767435  0.7008372  0.8671836
##      2.00  1.0186969  0.7155605  0.8127638
##      4.00  1.0014491  0.7159753  0.7851659
##      8.00  0.9868678  0.7215477  0.7704698
##     16.00  0.9821933  0.7243430  0.7657744
##     32.00  0.9821933  0.7243430  0.7657744
##     64.00  0.9821933  0.7243430  0.7657744
##    128.00  0.9821933  0.7243430  0.7657744
##    256.00  0.9821933  0.7243430  0.7657744
##    512.00  0.9821933  0.7243430  0.7657744
##   1024.00  0.9821933  0.7243430  0.7657744
##   2048.00  0.9821933  0.7243430  0.7657744
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01092131
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01092131 and C = 16.
svmRTuned_rad$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 16 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0109213068494943 
## 
## Number of Support Vectors : 110 
## 
## Objective Function Value : -74.2534 
## Training error : 0.00899
svmPred_rad <- predict(svmRTuned_rad, newdata = train_chem[,-1])
postResample(pred=svmPred_rad,obs=train_chem$Yield)  
##      RMSE  Rsquared       MAE 
## 0.1744533 0.9926527 0.1697655

Linear Kernel

svmRTuned <- train(x=train_chem[,-1],
                   y=train_chem$Yield,
                   method="svmLinear", 
                   preProc=c("center","scale"),
                   tuneLength = 14,
                   trControl = trainControl(method="cv"))
svmRTuned
## Support Vector Machines with Linear Kernel 
## 
## 130 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 118, 117, 116, 118, 117, 118, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE    
##   2.265437  0.5117143  1.20683
## 
## Tuning parameter 'C' was held constant at a value of 1
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 110 
## 
## Objective Function Value : -22.4379 
## Training error : 0.136722
svmPred_lin <- predict(svmRTuned, newdata = train_chem[,-1])
postResample(pred=svmPred_lin,obs=train_chem$Yield)  
##      RMSE  Rsquared       MAE 
## 0.6803201 0.8625641 0.4640489

Neural Network

nnetFit_chem <- nnet(x=train_chem[,-1],
                y=train_chem$Yield,
                size=5,
                decay=c(0,0.01,.1),
                trace=FALSE,
                maxit=500,
                MaxNWts=5*(ncol(train_chem[,-1])+1)+5+1)

nnetFit_chem
## a 57-5-1 network with 296 weights
## options were -
nnetPred_chem <- predict(nnetFit_chem, newdata = test_chem[,-1])
postResample(pred=nnetPred_chem,obs=test_chem$Yield)
##     RMSE Rsquared      MAE 
## 37.98425       NA 37.96478
compare_table <- data.frame(list(knn=1.82, MARS_Tuned=2.37, SVM_radial=0.17, SVM_linear=0.68, NeuralNet=37.98), stringsAsFactors=FALSE)
rownames(compare_table) <- c("RMSE")
kable(compare_table, caption="Model Comparison")
Model Comparison
knn MARS_Tuned SVM_radial SVM_linear NeuralNet
RMSE 1.82 2.37 0.17 0.68 37.98

The Support Vector Machine using a radial basis function gave a model with the lowest root mean square error.

  1. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
Imp_chem <- varImp(svmRTuned_rad)
plot(Imp_chem, top = 10)

The manufacturing process variables are the most important predictors in the SVM model.

Biological Material 06 and Manufacturing Process 36 are among the top 10 predictors for the top linear model and top nonlinear models.

  1. Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
top_predictors <- ChemicalManufacturingProcess[,c("Yield","ManufacturingProcess17","ManufacturingProcess13","ManufacturingProcess09","ManufacturingProcess32","BiologicalMaterial06","ManufacturingProcess06","ManufacturingProcess11","ManufacturingProcess36","ManufacturingProcess30","ManufacturingProcess03")]
#head(top_predictors)
correlation_chem_top_pred <- cor(top_predictors, method = "pearson")
corrplot::corrplot(correlation_chem_top_pred, type="upper", method="color")

The yield is positively correlated to biological process 06, manufacturing process 09,32,06,11 and 30. The yield is negatively correlated to manufacturing process 17, 13 and 36. The correlations are fairly weak.