Assignment: Exercises 7.2 and 7.5 from the KJ textbook

library(caTools)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(nnet)

7.2. Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data:

y = 10sin(πx1x2) + 20(x3 − 0.5)2 + 10x4 + 5x5 + N(0, σ2)

where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

knnModel <- train(x = trainingData$x, y = trainingData$y, method = "knn", preProc = c("center", "scale"), tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

We also try out a neural network:

nnetFit <- nnet(trainingData$x, trainingData$y, size = 5, decay = 0.01, linout = TRUE,
## Reduce the amount of printed output
trace = FALSE,
## Expand the number of iterations to find
## parameter estimates..
maxit = 500,
## and the number of parameters used by the model
MaxNWts = 5 * (ncol(trainingData$x) + 1) + 5 + 1)
nNetPred <- predict(nnetFit, newdata = testData$x)
postResample(pred = nNetPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.8564065 0.7091704 1.9600728

Lastly, we try out a MARS model:

marsFit <- train(x = trainingData$x, y = trainingData$y, method = "earth", preProc = c("center", "scale"), tuneLength = 10)
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
marsPred <- predict(marsFit, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
##     RMSE Rsquared      MAE 
## 1.776575 0.872700 1.358367
summary(marsFit$finalModel)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=1,
##             nprune=10)
## 
##                  coefficients
## (Intercept)        20.3958041
## h(0.507267-X1)     -3.0209971
## h(0.325504-X2)     -2.8963069
## h(X3- -0.804171)    1.1187319
## h(-0.216741-X3)     3.4950111
## h(X3-0.453446)      2.1548596
## h(0.953812-X4)     -2.7559239
## h(X4-0.953812)      2.8600536
## h(1.17878-X5)      -1.5056208
## h(X6- -0.47556)    -0.5025995
## 
## Selected 10 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 (additive model)
## GCV 2.731203    RSS 447.3848    GRSq 0.889112    RSq 0.9082649

The neural network is the best performing model with the smallest RMSE and largest R^2 value, and it is followed by MARS and then KNN. The MARS model selects all 5 of the informative predictors as well as X6.

7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

#load data
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")

#imputation
cmp <- knnImputation(ChemicalManufacturingProcess[, !names(ChemicalManufacturingProcess) %in% "yield"])

#data splitting
set.seed(101) 
sample = sample.split(cmp$BiologicalMaterial01, SplitRatio = .8)
cmp_train = subset(cmp, sample == TRUE)
cmp_test  = subset(cmp, sample == FALSE)

cmp_train_X = subset(cmp_train, select = -Yield)
cmp_train_y = cmp_train[,'Yield']

cmp_test_X = subset(cmp_test, select = -Yield)
cmp_test_y = cmp_test[,'Yield']

KNN model:

cmpknn <- train(x = cmp_train_X, y = cmp_train_y, method = "knn", preProc = c("center", "scale"), tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
cmpknn
## k-Nearest Neighbors 
## 
## 140 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 140, 140, 140, 140, 140, 140, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.423326  0.4208469  1.128291
##    7  1.414026  0.4330362  1.128222
##    9  1.388053  0.4618164  1.116660
##   11  1.390159  0.4643388  1.123480
##   13  1.393569  0.4616407  1.130071
##   15  1.401858  0.4573811  1.141917
##   17  1.405514  0.4607165  1.146322
##   19  1.416220  0.4540900  1.149758
##   21  1.423137  0.4546066  1.156157
##   23  1.430939  0.4519784  1.162060
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
cmpknnPred <- predict(cmpknn, newdata = cmp_test_X)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = cmpknnPred, obs = cmp_test_y)
##      RMSE  Rsquared       MAE 
## 1.2778493 0.4141794 0.9856636

Neural Network model:

cmpnnet <- nnet(x = cmp_train_X, y = cmp_train_y, size = 5, decay = 0.01, linout = TRUE, trace = FALSE,
maxit = 500, MaxNWts = 5 * (ncol(cmp_train_X) + 1) + 5 + 1)
cmpnnetPred <- predict(cmpnnet, newdata = cmp_test_X)
postResample(pred = cmpnnetPred, obs = cmp_test_y)
##      RMSE  Rsquared       MAE 
## 1.7343071 0.2831666 1.3282372

MARS model:

cmpmars <- train(x = cmp_train_X, y = cmp_train_y, method = "earth", preProc = c("center", "scale"), tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
cmpmarsPred <- predict(cmpmars, newdata = cmp_test_X)
postResample(pred = cmpmarsPred, obs = cmp_test_y)
##      RMSE  Rsquared       MAE 
## 1.2272495 0.4881219 0.9595875

SVM model:

cmpsvm <- train(x = cmp_train_X, y = cmp_train_y, method = "svmRadial", preProc = c("center", "scale"), tuneLength = 10, trControl = trainControl(method = "cv"))
cmpsvmPred <- predict(cmpsvm, newdata = cmp_test_X)
postResample(pred = cmpsvmPred, obs = cmp_test_y)
##      RMSE  Rsquared       MAE 
## 1.2149832 0.4700383 0.9384320
  1. Which nonlinear regression model gives the optimal resampling and test set performance?

SVM gives the optimal resampling and test set performance with a RMSE of 1.21.

  1. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Here are the 10 predictors from the optimal linear model (elastic net) ranked in order of importance:

ManufacturingProcess36 ManufacturingProcess09 ManufacturingProcess17 ManufacturingProcess32 ManufacturingProcess13 BiologicalMaterial08 ManufacturingProcess06 ManufacturingProcess24 BiologicalMaterial05 BiologicalMaterial02

Here are the most important predictors from the optimal nonlinear model (SVM) ranked in order of importance:

varImp(cmpsvm)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess13  100.00
## ManufacturingProcess17   92.67
## ManufacturingProcess32   88.59
## ManufacturingProcess09   82.10
## BiologicalMaterial06     76.04
## ManufacturingProcess36   74.17
## BiologicalMaterial12     69.50
## BiologicalMaterial03     67.46
## ManufacturingProcess06   64.05
## BiologicalMaterial02     59.77
## ManufacturingProcess31   58.15
## ManufacturingProcess11   55.36
## BiologicalMaterial08     49.21
## BiologicalMaterial11     46.73
## BiologicalMaterial04     43.96
## ManufacturingProcess29   42.49
## ManufacturingProcess33   37.97
## ManufacturingProcess30   36.40
## BiologicalMaterial01     35.79
## ManufacturingProcess18   35.06

In both of the optimal models, manufacturing processes dominate the list. There is a lot of overlap between the two models, but the SVM model has one additonal biological material predictor in its top 10 list. The SVM model also incorporates several more features than the Elastic Net model.

  1. Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

When we compare the top 10 predictors, there are 3 that are unique to the SVM model in comparison to the Elastic Net model, and all 3 are biological material predictors that have a positive association with yield. The SVM model reveals that the biological materials may have more of an influence on yield than previously assumed. Of the 3, biological material #6 has the strongest correlation with yield.

cmp2 <- subset(cmp, select = c(BiologicalMaterial06, BiologicalMaterial12, BiologicalMaterial03, Yield))
cor(cmp2, cmp[,'Yield'])
##                           [,1]
## BiologicalMaterial06 0.4781634
## BiologicalMaterial12 0.3674976
## BiologicalMaterial03 0.4450860
## Yield                1.0000000