1. 7.2 Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data: \(y = 10sin(\pi x_1x_2)+20(x3 - 0.5)^2 +10x_4 +5x_5 +N(0,\sigma ^2)\) where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(caret)
library(mlbench)
#help("mlbench.friedman1")

The regression problem Friedman 1 as described in Friedman (1991) and Breiman (1996). Inputs are 10 independent variables uniformly distributed on the interval [0,1], only 5 out of these 10 are actually used. Outputs are created according to the formula

Usage => mlbench.friedman1(n, sd=1)
n = number of patterns to create
sd = Standard deviation of noise
x = input values (independent variables)
y = output values (dependent variable)

set.seed(200)
training.data <- mlbench.friedman1(200, sd=1)
training.data$x <- data.frame(training.data$x)
featurePlot(training.data$x, training.data$y)

Now, estimate the true error rate with good precision:

test.data <- mlbench.friedman1(5000, sd = 1)
test.data$x <- data.frame(test.data$x)

KNN Model - Model tuning technique

set.seed(1)
knnModel <- train(x = training.data$x,
                  y = training.data$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.626071  0.4766532  2.954620
##    7  3.463866  0.5231142  2.823065
##    9  3.387872  0.5552706  2.744969
##   11  3.345212  0.5754355  2.706971
##   13  3.317962  0.5964912  2.696927
##   15  3.291614  0.6123880  2.668077
##   17  3.293991  0.6238074  2.665458
##   19  3.300870  0.6342793  2.668902
##   21  3.297502  0.6472145  2.674792
##   23  3.298045  0.6534698  2.676965
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
plot(knnModel)

knn.Predict <- predict(knnModel, newdata = test.data$x)
postResample(pred = knn.Predict, obs = test.data$y)
##      RMSE  Rsquared       MAE 
## 3.1750657 0.6785946 2.5443169

The result of KNN RMSE is 3.117

MARS

library(earth)
set.seed(101)  
marsGrid <- expand.grid(degree =1:2, nprune=seq(2,14,by=2))
marsModel <- train(x = training.data$x, y = training.data$y, method='earth', tuneGrid = marsGrid, trControl = trainControl(method = "cv"))
marsModel
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.417014  0.2350732  3.656244
##   1        4      2.606487  0.7293074  2.069700
##   1        6      2.192319  0.8094378  1.742533
##   1        8      1.700573  0.8889867  1.313446
##   1       10      1.640852  0.8967593  1.277871
##   1       12      1.641261  0.8950167  1.297462
##   1       14      1.642807  0.8964578  1.287479
##   2        2      4.417014  0.2350732  3.656244
##   2        4      2.606487  0.7293074  2.069700
##   2        6      2.218192  0.8021446  1.729223
##   2        8      1.750853  0.8789636  1.370531
##   2       10      1.449480  0.9177568  1.111537
##   2       12      1.316041  0.9299167  1.026115
##   2       14      1.321235  0.9300051  1.048692
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 12 and degree = 2.
plot(marsModel)

Feature Importance

varImp(marsModel)
## earth variable importance
## 
##     Overall
## X1   100.00
## X4    85.05
## X2    69.03
## X5    48.88
## X3    39.40
## X10    0.00
## X6     0.00
## X9     0.00
## X7     0.00
## X8     0.00
mars.Predict <- predict (marsModel, test.data$x)
postResample(pred = mars.Predict, obs = test.data$y)
##      RMSE  Rsquared       MAE 
## 1.2803060 0.9335241 1.0168673

MARS_RMSE = 1.2803060 is definitely better than KNN. Let’s check other models as well.

SVM

set.seed(1)
svmTuned <- train(x = training.data$x, 
                   y = training.data$y,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))
svmTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.531367  0.8022380  2.018644
##      0.50  2.274330  0.8152517  1.796103
##      1.00  2.096272  0.8344065  1.648492
##      2.00  1.963889  0.8528505  1.545874
##      4.00  1.893957  0.8618554  1.488518
##      8.00  1.869287  0.8642229  1.482789
##     16.00  1.867230  0.8644765  1.485749
##     32.00  1.867230  0.8644765  1.485749
##     64.00  1.867230  0.8644765  1.485749
##    128.00  1.867230  0.8644765  1.485749
##    256.00  1.867230  0.8644765  1.485749
##    512.00  1.867230  0.8644765  1.485749
##   1024.00  1.867230  0.8644765  1.485749
##   2048.00  1.867230  0.8644765  1.485749
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06444911
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06444911 and C = 16.
svmTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 16 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0644491109951026 
## 
## Number of Support Vectors : 151 
## 
## Objective Function Value : -71.1627 
## Training error : 0.008521
svm.Predict <- predict(svmTuned, newdata = test.data$x)
postResample(pred = svm.Predict, obs = test.data$y)
##      RMSE  Rsquared       MAE 
## 2.0771707 0.8251101 1.5779291

SVM RMSE value is 2.0771707, so it is not performing better than MARS.

Neural Network

library(nnet)
## Warning: package 'nnet' was built under R version 3.4.4
set.seed(1)
nnetFit <- nnet(training.data$x,
                training.data$y,
                size = 5,
                decay = 0.01,
                linout = TRUE,
                trace = FALSE,
                maxit = 500,
                MaxNWts = 5 * (ncol(training.data$x) + 1) + 5 + 1)
nnetFit
## a 10-5-1 network with 61 weights
## options were - linear output units  decay=0.01
nnet.Predict <- predict(nnetFit, newdata = test.data$x)
postResample(pred = nnet.Predict, obs = test.data$y)
##      RMSE  Rsquared       MAE 
## 2.6580454 0.7348028 1.8595160

Neural net also has high RMSE value of 2.6580454, so MARS is definitely a winner here with lowest RMSE value.

  1. 7.5 Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models

A. Which nonlinear regression model gives the optimal resampling and test set performance?

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
library(DMwR)
## Loading required package: grid
knn.df <- knnImputation(ChemicalManufacturingProcess[, 1:57], k = 3, meth = "weighAvg")
anyNA(knn.df)
## [1] FALSE

First getting rid od missing values

near_zero <- nearZeroVar(knn.df)
knn.df <- knn.df[,-near_zero]
knn.df[,2:(ncol(knn.df))] <- scale(knn.df[,2:(ncol(knn.df))])
set.seed(1)

inTraining <- createDataPartition(knn.df$Yield, p = 0.80, list=FALSE)
training <- knn.df[ inTraining,]
testing <- knn.df[-inTraining,]

X_train <- training[,2:(length(training))]
Y_train <- training$Yield

X_test <- testing[,2:(length(testing))]
Y_test <- testing$Yield

KNN

set.seed(1)
knnModel <- train(x = X_train,
                  y = Y_train,
                  method = "knn",
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 144 samples
##  55 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.481059  0.3814268  1.169101
##    7  1.461855  0.3927446  1.164240
##    9  1.446106  0.4108643  1.160198
##   11  1.441563  0.4189638  1.157509
##   13  1.437135  0.4284012  1.161061
##   15  1.451620  0.4178187  1.178368
##   17  1.457394  0.4149991  1.186073
##   19  1.463996  0.4131880  1.190119
##   21  1.465359  0.4165220  1.192271
##   23  1.474984  0.4111554  1.195094
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 13.
plot(knnModel)

knnPredict <- predict(knnModel, newdata = X_test)
postResample(pred = knnPredict, obs = Y_test)
##      RMSE  Rsquared       MAE 
## 1.3438388 0.5473626 1.0784564

KNN RMSE value is 1.3438388 and we’ll check for other model as well

MARS

set.seed(1)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsModel <- train(x = X_train,
                   y = Y_train,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))
marsModel
## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  55 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 130, 130, 131, 129, 130, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.363458  0.4558256  1.0689820
##   1        3      1.271861  0.5201694  1.0106995
##   1        4      1.244503  0.5458146  0.9833013
##   1        5      1.238886  0.5546325  0.9841253
##   1        6      1.299142  0.5236747  1.0200561
##   1        7      1.223629  0.5697152  0.9702813
##   1        8      1.203007  0.5985392  0.9565488
##   1        9      1.220538  0.5834291  0.9684082
##   1       10      1.227145  0.5768417  0.9702812
##   1       11      1.286950  0.5436602  0.9708968
##   1       12      1.271059  0.5559122  0.9644951
##   1       13      1.308718  0.5238114  1.0072276
##   1       14      1.298446  0.5340145  1.0073789
##   1       15      1.372028  0.5195326  1.0299789
##   1       16      1.372028  0.5195326  1.0299789
##   1       17      1.321986  0.5308154  1.0143525
##   1       18      1.329896  0.5283427  1.0172826
##   1       19      1.317402  0.5343547  1.0103932
##   1       20      1.325867  0.5318402  1.0134283
##   1       21      1.328594  0.5327647  1.0139789
##   1       22      1.328594  0.5327647  1.0139789
##   1       23      1.328594  0.5327647  1.0139789
##   1       24      1.328594  0.5327647  1.0139789
##   1       25      1.328594  0.5327647  1.0139789
##   1       26      1.328594  0.5327647  1.0139789
##   1       27      1.328594  0.5327647  1.0139789
##   1       28      1.355204  0.5237057  1.0282422
##   1       29      1.354034  0.5244783  1.0267683
##   1       30      1.359385  0.5194798  1.0336388
##   1       31      1.357839  0.5216488  1.0395689
##   1       32      1.357400  0.5215103  1.0374248
##   1       33      1.356468  0.5222010  1.0277678
##   1       34      1.353343  0.5237420  1.0249071
##   1       35      1.381416  0.5144563  1.0454328
##   1       36      1.381416  0.5144563  1.0454328
##   1       37      1.400371  0.5142342  1.0479364
##   1       38      1.400371  0.5142342  1.0479364
##   2        2      1.363458  0.4558256  1.0689820
##   2        3      1.300992  0.4950000  1.0266650
##   2        4      1.342267  0.4950703  1.0570482
##   2        5      1.352331  0.4642087  1.0690975
##   2        6      1.354055  0.4921812  1.0980620
##   2        7      1.341966  0.4998700  1.0875600
##   2        8      1.371679  0.4920043  1.1089012
##   2        9      1.827149  0.4795344  1.2727604
##   2       10      1.892205  0.4624672  1.3152066
##   2       11      1.805761  0.4863751  1.2596316
##   2       12      1.341988  0.5111124  1.0931730
##   2       13      1.946641  0.4289931  1.3086528
##   2       14      1.918164  0.4381737  1.2851618
##   2       15      1.950743  0.4353344  1.2949747
##   2       16      2.098043  0.4133506  1.3404870
##   2       17      2.141945  0.4045376  1.3682530
##   2       18      2.242581  0.3862486  1.4101273
##   2       19      2.276982  0.3700169  1.4315775
##   2       20      2.275432  0.3711643  1.4234879
##   2       21      2.270367  0.3770002  1.4233780
##   2       22      2.259491  0.3801502  1.4122387
##   2       23      2.257904  0.3795131  1.4106334
##   2       24      2.260496  0.3769517  1.4106286
##   2       25      2.186676  0.3752172  1.3974153
##   2       26      2.178747  0.3845206  1.3954909
##   2       27      2.183187  0.3824566  1.3991446
##   2       28      2.183187  0.3824566  1.3991446
##   2       29      2.183187  0.3824566  1.3991446
##   2       30      2.183187  0.3824566  1.3991446
##   2       31      2.183187  0.3824566  1.3991446
##   2       32      2.183187  0.3824566  1.3991446
##   2       33      2.183187  0.3824566  1.3991446
##   2       34      2.183187  0.3824566  1.3991446
##   2       35      2.183187  0.3824566  1.3991446
##   2       36      2.183187  0.3824566  1.3991446
##   2       37      2.183187  0.3824566  1.3991446
##   2       38      2.183187  0.3824566  1.3991446
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 8 and degree = 1.
plot(marsModel)

marsPredict <- predict(marsModel, newdata = X_test)
postResample(pred = marsPredict, obs = Y_test)
##      RMSE  Rsquared       MAE 
## 1.1870875 0.6182487 0.9690215

MARS lower RMSE value is definitely better than KNN and that makes it a better choice than KNN

Neural Net

set.seed(1)
nnetFit <- nnet(X_train,
                Y_train,
                size = 5,
                decay = 0.01,
                linout = TRUE,
                trace = FALSE,
                maxit = 500,
                MaxNWts = 5 * (ncol(X_train) + 1) + 5 + 1)
nnetFit
## a 55-5-1 network with 286 weights
## options were - linear output units  decay=0.01
nnetFit.Predict <- predict(nnetFit, newdata = X_test)
postResample(pred = nnetFit.Predict, obs = Y_test)
##      RMSE  Rsquared       MAE 
## 2.1204917 0.5387486 1.6984016

Neuranet also does not seem to be better than MARS.

SVM

set.seed(1)
svmModel <- train(x = X_train, 
                   y = Y_train,
                   method = "svmRadial",
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))
svmModel
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  55 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 130, 130, 131, 129, 130, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE      
##      0.25  1.416423  0.5028799  1.1605561
##      0.50  1.319232  0.5346143  1.0833591
##      1.00  1.235756  0.5760082  1.0064023
##      2.00  1.222457  0.5768255  0.9843147
##      4.00  1.227817  0.5698084  0.9874162
##      8.00  1.226535  0.5750713  0.9946620
##     16.00  1.227513  0.5748177  0.9962901
##     32.00  1.227513  0.5748177  0.9962901
##     64.00  1.227513  0.5748177  0.9962901
##    128.00  1.227513  0.5748177  0.9962901
##    256.00  1.227513  0.5748177  0.9962901
##    512.00  1.227513  0.5748177  0.9962901
##   1024.00  1.227513  0.5748177  0.9962901
##   2048.00  1.227513  0.5748177  0.9962901
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01559259
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01559259 and C = 2.
svm.Predict <- predict(svmModel, newdata = X_test)
postResample(pred = svm.Predict, obs = Y_test)
##      RMSE  Rsquared       MAE 
## 0.9822078 0.7455355 0.7853845

It turns out that SVM RMSE value is the lowest and hence the best model here.

B. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Feature Importance

varImp(marsModel)
## earth variable importance
## 
##   only 20 most important variables shown (out of 55)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   58.42
## ManufacturingProcess13   40.07
## ManufacturingProcess22   28.34
## ManufacturingProcess01   28.34
## ManufacturingProcess39   21.96
## ManufacturingProcess12    0.00
## BiologicalMaterial10      0.00
## ManufacturingProcess04    0.00
## ManufacturingProcess36    0.00
## ManufacturingProcess37    0.00
## ManufacturingProcess28    0.00
## ManufacturingProcess15    0.00
## ManufacturingProcess10    0.00
## ManufacturingProcess20    0.00
## BiologicalMaterial03      0.00
## BiologicalMaterial01      0.00
## ManufacturingProcess18    0.00
## ManufacturingProcess06    0.00
## ManufacturingProcess11    0.00

Manufacturing process seems to be leading the feature importance list with ManufacturingProcess32 being the overall highest.

varImp(svmModel)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 55)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     84.75
## ManufacturingProcess36   71.84
## BiologicalMaterial03     70.69
## ManufacturingProcess13   68.69
## BiologicalMaterial02     64.16
## ManufacturingProcess31   63.17
## BiologicalMaterial12     60.11
## ManufacturingProcess17   57.19
## ManufacturingProcess09   55.42
## ManufacturingProcess33   54.25
## BiologicalMaterial04     51.35
## ManufacturingProcess29   47.17
## BiologicalMaterial11     46.74
## ManufacturingProcess06   46.73
## BiologicalMaterial01     41.19
## BiologicalMaterial08     38.75
## ManufacturingProcess11   29.49
## ManufacturingProcess26   29.35
## BiologicalMaterial09     26.01

For SVM, both Manufacturing process and Biological Material are making into the top 10 feature importance list, even distribution, with ManufacturingProcess32 being the overall highest.

C. Explore the relationship between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Let’s take a look at the SVM model

svmModel$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 2 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0155925867216759 
## 
## Number of Support Vectors : 126 
## 
## Objective Function Value : -67.9252 
## Training error : 0.095408
xyplot(Y_test ~ predict(svmModel, X_test),
type = c("p", "r"),
xlab = "Predicted", ylab = "Observed")

predicted <- predict(svmModel, X_test)
actual <-Y_test
xyplot((predicted-actual) ~ predicted,
type = c("p", "g"),
xlab = "Predicted radial svm", ylab = "Residuals")

The residual plots doesnot indicate any significant pattern but the ‘Actual Vs. Predicted’ plot shows very close correlation.

dotPlot(varImp(svmModel), top=20)

The above dot plot shows how top 20 features rank for SVM Model. We can see there is very even distribution of Manufacturing Process and Biological Material and ManufacturingProcess32 topping the feature importance list.

Now , let’s try to plot feautre plt for top 6 features and find out the correlation.

predictors.val <- c("ManufacturingProcess32", "BiologicalMaterial06", "ManufacturingProcess36", "BiologicalMaterial03", "ManufacturingProcess13", "BiologicalMaterial02")
featurePlot(X_train[,predictors.val], Y_train)

cor(X_train[, predictors.val], Y_train)
##                              [,1]
## ManufacturingProcess32  0.6391015
## BiologicalMaterial06    0.5037752
## ManufacturingProcess36 -0.5417157
## BiologicalMaterial03    0.4578673
## ManufacturingProcess13 -0.4470691
## BiologicalMaterial02    0.5060191

By looking at the above metric it appears that the predictors and yield have the correlation. Manufacturing process have some negative correlation with overll Yield value whereas Biological Maeterial have posittive and balanced correlation.

References:
Breiman, Leo (1996) Bagging predictors. Machine Learning 24, pages 123-140.
Friedman, Jerome H. (1991) Multivariate adaptive regression splines. The Annals of Statistics 19 (1), pages 1-67.