Exercises from Chapter 7 of the textbook Applied Predictive Modeling by Kuhn & Johnson

Exercise 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\(y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)\)

where the x values are random variables uniformly distributed on the interval [0, 1] (five additional non-informative predictors are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
library(caret)  # for featurePlot(), train(), and postResample()
set.seed(200)
## mlbench.friedman1() returns a list with a vector 'y'
## and a matrix of predictors 'x'
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame;
## one reason is that this gives the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.

## Also simulate a large test set to estimate the
## true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
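To make the data-generating process concrete, here is a minimal by-hand sketch of the same simulation using only base R; mlbench.friedman1() does the equivalent internally (the names x_sim and y_sim are my own):

# Hand-rolled version of the Friedman (1991) benchmark: 10 uniform predictors,
# of which only the first five enter the response.
x_sim <- matrix(runif(200 * 10), ncol = 10)
y_sim <- 10 * sin(pi * x_sim[, 1] * x_sim[, 2]) + 20 * (x_sim[, 3] - 0.5)^2 +
  10 * x_sim[, 4] + 5 * x_sim[, 5] + rnorm(200, sd = 1)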

(a) Tune several models on these data.

KNN
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

knnModel$finalModel
## 17-nearest neighbor regression model
Evaluate KNN
pred_knn <- predict(knnModel$finalModel, newdata = testData$x)
PR_knn <- postResample(pred = pred_knn, obs = testData$y)

# store results in a data frame
compare <- data.frame(model    = "knn",
                      rsquared = PR_knn[["Rsquared"]],
                      rmse     = PR_knn[["RMSE"]])

compare
##   model  rsquared     rmse
## 1   knn 0.5011098 5.956498
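Since no trControl was supplied, train() tuned k with its default bootstrap resampling; the profile that led to the 17-nearest-neighbor choice can be inspected directly:

# Resampled RMSE and R-squared for each candidate k
knnModel$results[, c("k", "RMSE", "Rsquared")]
# or plot the tuning profile
plot(knnModel)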
SVM

Next I’ll try an SVM model.

model_svm <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))

model_svm$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 16 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0680216365076835 
## 
## Number of Support Vectors : 152 
## 
## Objective Function Value : -66.0924 
## Training error : 0.008551
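Note that for method = "svmRadial", caret fixes the kernel width sigma analytically (via kernlab's sigest) and cross-validates only over a grid of cost values, which is why a single sigma appears above. The winning combination is stored in bestTune:

# The sigma/C pair chosen by cross-validation
model_svm$bestTune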
Evaluate SVM
pred_svm <- predict(model_svm$finalModel, newdata = testData$x)
PR_svm <- postResample(pred = pred_svm, obs = testData$y)

# append results; use a data.frame row so the numeric columns keep their type
compare <- rbind(compare,
                 data.frame(model    = "svm",
                            rsquared = PR_svm[["Rsquared"]],
                            rmse     = PR_svm[["RMSE"]]))


compare
##   model  rsquared     rmse
## 1   knn 0.5011098 5.956498
## 2   svm 0.6544290 7.156075
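One caveat about the evaluations above: predicting on $finalModel bypasses the center/scale preprocessing that caret applied while training, so both the KNN and SVM test predictions were made on unscaled data; that likely explains the odd combination of a higher R-squared but a worse RMSE for the SVM. Predicting on the train object itself re-applies the stored preprocessing (the MARS fit below requests no preProc, so it is unaffected). A sketch, with pred_svm_pp as my own name:

# predict.train() applies the stored center/scale transform to new data
pred_svm_pp <- predict(model_svm, newdata = testData$x)
postResample(pred = pred_svm_pp, obs = testData$y)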
MARS

Last I’ll try a MARS model.

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
model_mars <- train(x = trainingData$x,
                    y = trainingData$y,
                    method = "earth",
                    tuneGrid = marsGrid,
                    trControl = trainControl(method = "cv",
                                             number = 10))

model_mars$finalModel
## Selected 14 of 18 terms, and 5 of 10 predictors (nprune=14)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 4
## GCV 1.62945    RSS 225.8601    GRSq 0.9338437    RSq 0.953688
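The selected hinge functions themselves, and which predictors they involve, can be seen with the earth package's summary method:

# Coefficients for each retained hinge term
summary(model_mars$finalModel)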
Evaluate MARS
pred_mars <- predict(model_mars$finalModel, newdata = testData$x)
PR_mars <- postResample(pred = pred_mars, obs = testData$y)

# append results
compare <- rbind(compare,
                 data.frame(model    = "mars",
                            rsquared = PR_mars[["Rsquared"]],
                            rmse     = PR_mars[["RMSE"]]))

compare
##   model  rsquared     rmse
## 1   knn 0.5011098 5.956498
## 2   svm 0.6544290 7.156075
## 3  mars 0.9448890 1.172263

(b) Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

Looking at the R-squared and RMSE values stored below, MARS clearly gives the best performance, although an R-squared this high would be a bit alarming in real-world data. In the readout above, MARS selected X1-X5 as its top five predictors (though not in numeric order), so it both performs best and recovers exactly the informative variables.

compare
##   model  rsquared     rmse
## 1   knn 0.5011098 5.956498
## 2   svm 0.6544290 7.156075
## 3  mars 0.9448890 1.172263
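The selection can also be checked programmatically: caret's predictors() returns exactly the variables that survive the final pruning pass.

# Variables used by the final pruned MARS model (should be X1 through X5)
predictors(model_mars)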

Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

First I copy in the code from my previous work.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
chem <- as.data.frame(ChemicalManufacturingProcess)

# impute missing values with k-nearest neighbors
# (note: knnImpute also centers and scales every column)
knn_imp <- preProcess(chem, method = "knnImpute")
chem_imp <- predict(knn_imp, chem)

# split train/test (75/25 random sample)
set.seed(3190)
sample_set <- sample(nrow(chem_imp), round(nrow(chem_imp) * 0.75), replace = FALSE)

chem_train   <- chem_imp[sample_set, ]
chem_train_x <- chem_train[, -1]
chem_train_y <- chem_train[, 1]

chem_test   <- chem_imp[-sample_set, ]
chem_test_x <- chem_test[, -1]
chem_test_y <- chem_test[, 1]
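As an aside, a stratified split on the response is often preferred for a sample this small; a sketch using caret's createDataPartition() with the same 75/25 proportions (the _s names are my own and are not used below):

# Alternative split, stratified on Yield
set.seed(3190)
in_train     <- createDataPartition(chem_imp$Yield, p = 0.75, list = FALSE)
chem_train_s <- chem_imp[in_train, ]
chem_test_s  <- chem_imp[-in_train, ]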

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

KNN
knnModel <- train(x = chem_train_x,
                  y = chem_train_y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

pred_knn <- predict(knnModel$finalModel, newdata = chem_test_x)
PR_knn <- postResample(pred = pred_knn, obs = chem_test_y)

# store results in a data frame
compare <- data.frame(model    = "knn",
                      rsquared = PR_knn[["Rsquared"]],
                      rmse     = PR_knn[["RMSE"]])

compare
##   model  rsquared      rmse
## 1   knn 0.3484061 0.8354028
SVM
svmModel <- train(x = chem_train_x,
                  y = chem_train_y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))

pred_svm <- predict(svmModel$finalModel, newdata = chem_test_x)
PR_svm <- postResample(pred = pred_svm, obs = chem_test_y)

# append results
compare <- rbind(compare,
                 data.frame(model    = "svm",
                            rsquared = PR_svm[["Rsquared"]],
                            rmse     = PR_svm[["RMSE"]]))

compare
##   model  rsquared      rmse
## 1   knn 0.3484061 0.8354028
## 2   svm 0.4891070 0.7325408
MARS
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsModel <- train(x = chem_train_x,
                   y = chem_train_y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv",
                                            number = 10))

pred_mars <- predict(marsModel$finalModel, newdata = chem_test_x)
PR_mars <- postResample(pred = pred_mars, obs = chem_test_y)

# append results
compare <- rbind(compare,
                 data.frame(model    = "mars",
                            rsquared = PR_mars[["Rsquared"]],
                            rmse     = PR_mars[["RMSE"]]))

compare
##   model  rsquared      rmse
## 1   knn 0.3484061 0.8354028
## 2   svm 0.4891070 0.7325408
## 3  mars 0.4714601 0.7339297
NNET
nnetGrid <- expand.grid(decay = c(0, 0.01, 0.1),
                        size = c(1, 5, 10),
                        bag = FALSE)

nnetModel <- train(x = chem_train_x,
                   y = chem_train_y,
                   method = "avNNet",
                   tuneGrid = nnetGrid,
                   trControl = trainControl(method = "cv",
                                            number = 10),
                   preProc = c("center", "scale"),
                   linout = TRUE,
                   trace = FALSE,
                   maxit = 10)

pred_nnet <- predict(nnetModel$finalModel, newdata = chem_test_x)
PR_nnet <- postResample(pred = pred_nnet, obs = chem_test_y)

# append results
compare <- rbind(compare,
                 data.frame(model    = "nnet",
                            rsquared = PR_nnet[["Rsquared"]],
                            rmse     = PR_nnet[["RMSE"]]))

compare
##   model  rsquared      rmse
## 1   knn 0.3484061 0.8354028
## 2   svm 0.4891070 0.7325408
## 3  mars 0.4714601 0.7339297
## 4  nnet 0.5053262 0.7050684
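One caveat before comparing: maxit = 10 stops the underlying nnet optimizer very early, so the averaged network above is likely undertrained. A variant with a more realistic iteration budget would look like this (nnetModel2 is my own name; same grid as above):

# Same averaged network, but let the optimizer run much longer
nnetModel2 <- train(x = chem_train_x,
                    y = chem_train_y,
                    method = "avNNet",
                    tuneGrid = nnetGrid,
                    trControl = trainControl(method = "cv", number = 10),
                    preProc = c("center", "scale"),
                    linout = TRUE, trace = FALSE,
                    maxit = 500)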
Evaluating Performance

Looking at the table below, the averaged neural network actually gives the best test-set performance on both measures, explaining about 50.5% of the variance with the lowest RMSE; SVM and MARS are close behind. Since the three are nearly tied and MARS produces by far the most interpretable fit (and selects predictors explicitly), I will use the MARS model for the variable-importance questions that follow.

compare
##   model  rsquared      rmse
## 1   knn 0.3484061 0.8354028
## 2   svm 0.4891070 0.7325408
## 3  mars 0.4714601 0.7339297
## 4  nnet 0.5053262 0.7050684
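Test-set metrics on roughly 44 held-out samples are noisy, so it is also worth comparing the cross-validated resampling distributions directly. A sketch using caret's resamples(); strictly, the models should share resampling indices (set via the index argument of trainControl), so with separately seeded folds this comparison is only approximate, and the bootstrap-tuned KNN model is left out because its resampling scheme differs:

# Side-by-side CV performance for the three 10-fold-CV models
cv_models <- resamples(list(SVM = svmModel, MARS = marsModel, NNet = nnetModel))
summary(cv_models)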

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

The MARS model ranks six process variables as most important, followed by two biological variables and two more process variables, as seen below. This is very similar to my linear model from the last assignment (eight process variables and two biological variables), with the top two, ManufacturingProcess32 and ManufacturingProcess09, the same in both.

varImp(marsModel$finalModel)
##                          Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess09  63.26523
## ManufacturingProcess24  21.25427
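Calling varImp() on the train object instead of the finalModel gives the same ranking on a 0 to 100 scale and has a plot method, which makes the top-ten comparison with the linear model easier to eyeball:

# Top ten predictors by scaled importance
plot(varImp(marsModel), top = 10)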

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

The only variables in the MARS top ten above that weren't in my earlier PLS model's top ten are ManufacturingProcess01, ManufacturingProcess39, BiologicalMaterial03, BiologicalMaterial04, and ManufacturingProcess10.

I see that BiologicalMaterial03 and BiologicalMaterial04 have a positive correlation with each other and with the Yield target variable. The manufacturing process variables here seem to have only slight correlations with Yield, most of them negative. Looking back at my correlation plot from the last assignment, I again see negative correlations for the biggest process variables these two models have in common. Roughly, then, the process variables are negatively correlated with Yield while the biological variables are positively correlated with it.

library(dplyr)     # for select() and %>%
library(corrplot)  # for corrplot.mixed()

chem_top5 <- chem_imp %>%
  select(ManufacturingProcess01, ManufacturingProcess39,
         BiologicalMaterial03, BiologicalMaterial04,
         ManufacturingProcess10, Yield)

correlation <- cor(chem_top5)

# plot correlations
corrplot.mixed(correlation, tl.col = 'black', tl.pos = 'lt',
               number.cex = .9, tl.cex = .7)
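Finally, scatterplots against Yield make the direction of each relationship easier to read than the correlation matrix alone; a minimal sketch using caret's featurePlot with a smoother overlaid:

# Each of the five predictors against Yield, with a smoother
featurePlot(x = chem_top5[, -6],  # drop the Yield column
            y = chem_top5$Yield,
            plot = "scatter",
            type = c("p", "smooth"))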