Exercise 7.2

Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data:

y = 10sin(πx1x2)+20(x3 −0.5)2 +10x4 +5x5 +N(0,σ2)

where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame 
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

KNN Model

library(caret)
knnModel <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "knn",
                   preProc = c("center", "scale"),
                   tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.565620  0.4887976  2.886629
##    7  3.422420  0.5300524  2.752964
##    9  3.368072  0.5536927  2.715310
##   11  3.323010  0.5779056  2.669375
##   13  3.275835  0.6030846  2.628663
##   15  3.261864  0.6163510  2.621192
##   17  3.261973  0.6267032  2.616956
##   19  3.286299  0.6281075  2.640585
##   21  3.280950  0.6390386  2.643807
##   23  3.292397  0.6440392  2.656080
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.

Evaluating the model’s performance:

knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set perforamnce values
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.1750657 0.6785946 2.5443169

SVM Model

The next model I will tune and evaluate is SVM model - I will use pre-process to center and scale the data and will use tune length of 10.

set.seed(200)
svmTuned = train(x = trainingData$x, 
                 y=trainingData$y, 
                 method="svmRadial", 
                 preProc=c("center", "scale"), 
                 tuneLength=10,
                 trControl = trainControl(method = "repeatedcv"))

svmTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE     
##     0.25  2.509258  0.8007195  2.008079
##     0.50  2.265063  0.8151861  1.787201
##     1.00  2.090305  0.8365821  1.636255
##     2.00  1.995634  0.8495588  1.557253
##     4.00  1.888238  0.8634860  1.501387
##     8.00  1.845986  0.8701926  1.486047
##    16.00  1.842402  0.8704345  1.485279
##    32.00  1.842402  0.8704345  1.485279
##    64.00  1.842402  0.8704345  1.485279
##   128.00  1.842402  0.8704345  1.485279
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06574662
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06574662 and C = 16.

Evaluating the model’s performance:

svmPred = predict(svmTuned, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y) 
##      RMSE  Rsquared       MAE 
## 2.0804235 0.8246038 1.5805232

MARS

The next model I will tune and evaluate is MARS model - I will use pre-process to center and scale the data.

set.seed(200)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTune <- train(x = trainingData$x, 
                  y = trainingData$y, 
                  method = "earth",
                  preProc=c("center", "scale"),
                  tuneGrid= marsGrid,
                  trControl = trainControl(method = "cv"))
marsTune
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      4.350138  0.2513413  3.5784792
##   1        3      3.644350  0.4731295  2.9614267
##   1        4      2.546913  0.7523393  2.0639448
##   1        5      2.209407  0.8000495  1.7664615
##   1        6      2.098155  0.8270593  1.6534878
##   1        7      1.719366  0.8812337  1.3882290
##   1        8      1.695204  0.8831099  1.3488029
##   1        9      1.742115  0.8770325  1.3682250
##   1       10      1.735453  0.8790493  1.3614383
##   1       11      1.703769  0.8857205  1.3551069
##   1       12      1.684606  0.8879534  1.3341565
##   1       13      1.690241  0.8868812  1.3355205
##   1       14      1.677364  0.8877485  1.3241164
##   1       15      1.689320  0.8860203  1.3337850
##   1       16      1.689320  0.8860203  1.3337850
##   1       17      1.689320  0.8860203  1.3337850
##   1       18      1.689320  0.8860203  1.3337850
##   1       19      1.689320  0.8860203  1.3337850
##   1       20      1.689320  0.8860203  1.3337850
##   2        2      4.350138  0.2513413  3.5784792
##   2        3      3.644350  0.4731295  2.9614267
##   2        4      2.546913  0.7523393  2.0639448
##   2        5      2.209407  0.8000495  1.7664615
##   2        6      2.147030  0.8163048  1.6856627
##   2        7      1.694724  0.8853223  1.3533072
##   2        8      1.588283  0.8969166  1.2356377
##   2        9      1.442477  0.9148789  1.1339275
##   2       10      1.393764  0.9212369  1.1120797
##   2       11      1.279830  0.9364074  1.0098606
##   2       12      1.217682  0.9427854  0.9598982
##   2       13      1.204097  0.9460219  0.9422898
##   2       14      1.196080  0.9463109  0.9386478
##   2       15      1.206115  0.9441910  0.9558298
##   2       16      1.212026  0.9446917  0.9549862
##   2       17      1.209095  0.9447813  0.9531492
##   2       18      1.209095  0.9447813  0.9531492
##   2       19      1.209095  0.9447813  0.9531492
##   2       20      1.209095  0.9447813  0.9531492
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.

Evaluating the model’s performance:

marsPred = predict(marsTune, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y) 
##      RMSE  Rsquared       MAE 
## 1.2779993 0.9338365 1.0147070

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.1750657 0.6785946 2.5443169
postResample(pred = svmPred, obs = testData$y) 
##      RMSE  Rsquared       MAE 
## 2.0804235 0.8246038 1.5805232
postResample(pred = marsPred, obs = testData$y) 
##      RMSE  Rsquared       MAE 
## 1.2779993 0.9338365 1.0147070
varImp(marsTune)
## earth variable importance
## 
##     Overall
## X1   100.00
## X4    84.98
## X2    68.87
## X5    48.55
## X3    38.96
## X6     0.00
## X10    0.00
## X8     0.00
## X7     0.00
## X9     0.00

MARS model gives the best performance with lowers RMSE and highest R^2 of 93.38. I use varImp to identify the most important MARS model predictors and confirm that those include the top 5 informative predictors X1 - X5.

Exercise 7.5.

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

The first step is to read the data into a dataframe and using knn to impute the data.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
df1<-ChemicalManufacturingProcess
dim(df1)
## [1] 176  58
#head(df1)

df1pp <- preProcess(df1[,2:ncol(df1)], method=c('knnImpute'))
df1 <- cbind(df1$Yield,predict(df1pp, df1[,2:ncol(df1)]))
colnames(df1)[1] <- "Yield"

The second step is splitting the data into test and train datasets - I will use 80%/20% split.

set.seed(333)
trainingRows <- createDataPartition(df1$Yield, p = 0.8, list = FALSE)

df1_train <- df1[trainingRows, ]
df1_test <- df1[-trainingRows, ]

KNN Model

The first model I will tune and evaluate is KNN model - I will use pre-process to center and scale the data and will use tune length of 6.

library(caret)
set.seed(200)
knnModel <- suppressWarnings(train(x = df1_train[,2:ncol(df1_train)],
                   y = df1_train$Yield,
                   method = "knn",
                   preProc = c("center", "scale"),
                   tuneLength = 6,
                   trControl = trainControl(method = "repeatedcv")))
knnModel
## k-Nearest Neighbors 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 128, 129, 130, 129, 132, 129, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.296841  0.5588651  1.049013
##    7  1.335754  0.5386837  1.097550
##    9  1.324421  0.5752357  1.094814
##   11  1.327724  0.5714342  1.094795
##   13  1.350020  0.5509203  1.107042
##   15  1.363992  0.5438233  1.116888
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.

Evaluating the model’s performance:

knnPred <- predict(knnModel, newdata = df1_test[,2:ncol(df1_test)])
postResample(pred = knnPred, obs = df1_test$Yield)
##      RMSE  Rsquared       MAE 
## 1.3031733 0.5503026 0.9813125

SVM Model

The next model I will tune and evaluate is SVM model - I will use pre-process to center and scale the data and will use tune length of 10.

set.seed(200)
svmTuned = train(x = df1_train[,2:ncol(df1_train)],
                 y = df1_train$Yield,
                 method="svmRadial", 
                 preProc=c("center", "scale"), 
                 tuneLength=10,
                 trControl = trainControl(method = "repeatedcv"))

svmTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 128, 129, 130, 129, 132, 129, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE      
##     0.25  1.386341  0.5543104  1.1455128
##     0.50  1.257204  0.5993286  1.0318796
##     1.00  1.134554  0.6549640  0.9261774
##     2.00  1.064801  0.6887278  0.8641585
##     4.00  1.051533  0.7004169  0.8568931
##     8.00  1.054729  0.7026112  0.8648906
##    16.00  1.055415  0.7017858  0.8649271
##    32.00  1.055415  0.7017858  0.8649271
##    64.00  1.055415  0.7017858  0.8649271
##   128.00  1.055415  0.7017858  0.8649271
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01372596
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01372596 and C = 4.

Evaluating the model’s performance:

svmPred <- predict(svmTuned, newdata = df1_test[,2:ncol(df1_test)])
postResample(pred = svmPred, obs = df1_test$Yield) 
##      RMSE  Rsquared       MAE 
## 1.2208660 0.6179264 0.8668962

The model is performing better than KNN.

MARS

The next model I will tune and evaluate is MARS model - I will use pre-process to center and scale the data.

set.seed(200)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTune <- train(df1_train[,2:ncol(df1_train)],
                  y = df1_train$Yield, 
                  method = "earth",
                  preProc=c("center", "scale"),
                  tuneGrid= marsGrid,
                  trControl = trainControl(method = "cv"))
marsTune
## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 129, 130, 129, 132, 129, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.469694  0.4099538  1.1524962
##   1        3      1.270394  0.5461842  1.0135638
##   1        4      1.237360  0.5645598  0.9940006
##   1        5      1.243428  0.5465747  0.9978504
##   1        6      1.243240  0.5625398  0.9675818
##   1        7      1.281018  0.5448316  1.0106055
##   1        8      1.261312  0.5602795  1.0181434
##   1        9      1.267466  0.5564907  1.0329659
##   1       10      1.302172  0.5403139  1.0488237
##   1       11      1.296341  0.5445436  1.0586088
##   1       12      1.273445  0.5541629  1.0393424
##   1       13      1.252598  0.5698303  1.0246146
##   1       14      1.229202  0.5807273  1.0091250
##   1       15      1.232702  0.5796333  1.0146955
##   1       16      1.231305  0.5806299  1.0147462
##   1       17      1.235551  0.5768138  1.0097048
##   1       18      1.235551  0.5768138  1.0097048
##   1       19      1.235551  0.5768138  1.0097048
##   1       20      1.235551  0.5768138  1.0097048
##   2        2      1.469694  0.4099538  1.1524962
##   2        3      1.281032  0.5539317  1.0244959
##   2        4      1.245999  0.5686795  1.0027749
##   2        5      1.239588  0.5755069  1.0083431
##   2        6      1.277475  0.5596173  1.0404891
##   2        7      1.290410  0.5601103  1.0537775
##   2        8      1.284861  0.5524504  1.0509238
##   2        9      1.278701  0.5694702  1.0456271
##   2       10      1.294947  0.5617347  1.0661414
##   2       11      1.251266  0.5946481  1.0199754
##   2       12      1.296675  0.5954595  1.0328179
##   2       13      1.313246  0.5910848  1.0133991
##   2       14      1.310340  0.5943687  1.0090514
##   2       15      1.311538  0.5921334  1.0090578
##   2       16      1.320320  0.5905494  1.0068345
##   2       17      1.336474  0.5847754  1.0142035
##   2       18      1.326958  0.5945070  1.0169338
##   2       19      1.314935  0.5768725  1.0182396
##   2       20      1.354841  0.5597531  1.0389166
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 1.

Evaluating the model’s performance:

marsPred = predict(marsTune, newdata = df1_test[,2:ncol(df1_test)])
postResample(pred = marsPred, obs = df1_test$Yield) 
##      RMSE  Rsquared       MAE 
## 1.0494165 0.6925441 0.8367074

(a)

Which nonlinear regression model gives the optimal resampling and test set performance?

Let’s comparing the models and determine which has the best performance:

postResample(pred = knnPred, obs = df1_test$Yield)
##      RMSE  Rsquared       MAE 
## 1.3031733 0.5503026 0.9813125
postResample(pred = svmPred, obs = df1_test$Yield) 
##      RMSE  Rsquared       MAE 
## 1.2208660 0.6179264 0.8668962
postResample(pred = marsPred, obs = df1_test$Yield) 
##      RMSE  Rsquared       MAE 
## 1.0494165 0.6925441 0.8367074

MARS model gives the best performance with lowers RMSE and highest R^2 of 69.25.

(b)

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Most important MARS Predictors:

varImp(marsTune)
## earth variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess09  68.202
## ManufacturingProcess13  36.528
## ManufacturingProcess28  28.445
## ManufacturingProcess33  28.445
## ManufacturingProcess15  27.292
## ManufacturingProcess39  26.812
## ManufacturingProcess04  23.173
## BiologicalMaterial03    12.184
## ManufacturingProcess23   6.087
## ManufacturingProcess10   0.000
## BiologicalMaterial06     0.000
## ManufacturingProcess40   0.000
## ManufacturingProcess41   0.000
## ManufacturingProcess36   0.000
## BiologicalMaterial05     0.000
## BiologicalMaterial08     0.000
## ManufacturingProcess25   0.000
## ManufacturingProcess19   0.000
## BiologicalMaterial04     0.000

Most important predictors are process variables, there is only one biological variable and it is #9 on the list so it is relatively not important.

Linear Model Coefficients

df1_enet <-  train(df1_train[,2:ncol(df1_train)], df1_train$Yield,
                 method='enet',
                 metric='RMSE',
                 trControl = trainControl(method = "repeatedcv"), 
                 preProcess=c('center','scale'))

enetCoef<- predict(df1_enet$finalModel, newx = as.matrix(df1_test[,2:ncol(df1_test)]),
                      s = .1, mode = "fraction",
                      type = "coefficients")
head(sort(enetCoef$coefficients),3)
## ManufacturingProcess13 ManufacturingProcess37 ManufacturingProcess17 
##            -0.25153249            -0.11488070            -0.06413573
tail(sort(enetCoef$coefficients),7)
##   BiologicalMaterial03 ManufacturingProcess44 ManufacturingProcess06 
##             0.07826640             0.09057314             0.10343127 
## ManufacturingProcess15 ManufacturingProcess34 ManufacturingProcess09 
##             0.15819222             0.17985524             0.51029411 
## ManufacturingProcess32 
##             0.72011230

Process 32, 9, 34, 15, 6, 44, 3(biol), 17, 37, and 13 were the stongest predictors in my ENET linear model.

Five of the top 10 important predictors are on the list of the top ten predictors from the optimal linear model. Linear and Nonlinear models are able to identify 5 of the same main predictors but overall Nonlinear model is showing a better performance.

(c)

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

The 5 predictors which were not identified as top predictors in the linear model are the following: ManufacturingProcess33, ManufacturingProcess28, ManufacturingProcess39, ManufacturingProces04, ManufacturingProcess23

featurePlot(df1_test$ManufacturingProcess33, df1_test$Yield)

featurePlot(df1_test$ManufacturingProcess28, df1_test$Yield)

featurePlot(df1_test$ManufacturingProcess39, df1_test$Yield)

featurePlot(df1_test$ManufacturingProcess04, df1_test$Yield)

featurePlot(df1_test$ManufacturingProcess23, df1_test$Yield)

ManufacturingProcess33 appears to have a positive linear relationship with Yield, ManufacturingProcess04 appears to have a negative linear relationship with Yield. ManufacturingProcess28, ManufacturingProcess39, and ManufacturingProcess23 seem to have nonlinear relationship. None of the top predictors unique to the model are biological, so these graphs don’t reveal anything about their relationship with yield.