Exercise 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\[y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\]

where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five additional non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

library(caret)
knnModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "knn",
  preProc = c("center", "scale"),
  tuneLength = 10
)

knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1-X5)?

Fitting a few models below:

Averaged Neural Network - 5 Hidden Units

We'll use bootstrap aggregation (bagging) by setting bag = TRUE. Note that size = 5 specifies five units in a single hidden layer, not five layers:

library(nnet)
set.seed(19940211)

hidden_units <- 5  # number of units in the single hidden layer

avnnet_fit <- avNNet(
  trainingData$x,
  trainingData$y,
  decay = 0.01,
  size = hidden_units,
  repeats = 5,
  maxit = 500,
  bag = TRUE
)
## Fitting Repeat 1 
## 
## # weights:  61
## initial  value 42775.910945 
## iter  10 value 38543.439520
## iter  20 value 38504.409075
## iter  30 value 38503.073659
## iter  40 value 38502.972683
## final  value 38502.964073 
## converged
## Fitting Repeat 2 
## 
## # weights:  61
## initial  value 39463.360555 
## iter  10 value 38087.440914
## iter  20 value 38050.799600
## iter  30 value 38049.896831
## final  value 38049.879498 
## converged
## Fitting Repeat 3 
## 
## # weights:  61
## initial  value 43545.963636 
## iter  10 value 40071.691922
## iter  20 value 40009.637252
## iter  30 value 40008.200691
## final  value 40008.070969 
## converged
## Fitting Repeat 4 
## 
## # weights:  61
## initial  value 39715.043007 
## iter  10 value 36848.723532
## iter  20 value 36834.073583
## iter  30 value 36833.972812
## final  value 36833.970266 
## converged
## Fitting Repeat 5 
## 
## # weights:  61
## initial  value 42792.513826 
## iter  10 value 40908.945606
## iter  20 value 40828.624468
## iter  30 value 40827.004898
## iter  40 value 40826.853513
## iter  40 value 40826.853303
## iter  40 value 40826.853211
## final  value 40826.853211 
## converged
nnet_pred <- predict(
  avnnet_fit,
  newdata = testData$x
)

postResample(
  pred = nnet_pred,
  obs = testData$y
)
##       RMSE   Rsquared        MAE 
## 14.2769357  0.2747165 13.3869183
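The very poor fit above is likely a consequence of the output activation: avNNet passes its extra arguments to nnet(), which defaults to a sigmoid output unit (linout = FALSE), so predictions are confined to [0, 1] while the simulated response is far larger. A minimal sketch of a corrected fit (results not shown; linout = TRUE gives the linear output unit appropriate for regression):

set.seed(19940211)
# Same averaged network, but with a linear output unit for regression
avnnet_linout_fit <- avNNet(
  trainingData$x,
  trainingData$y,
  decay = 0.01,
  size = hidden_units,
  repeats = 5,
  maxit = 500,
  bag = TRUE,
  linout = TRUE,  # linear output instead of the sigmoid default
  trace = FALSE   # suppress the per-repeat fitting log
)

postResample(
  pred = predict(avnnet_linout_fit, newdata = testData$x),
  obs = testData$y
)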

Multivariate Adaptive Regression Splines (MARS)

library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
mars_fit <- earth(
  trainingData$x,
  trainingData$y
)

summary(mars_fit)
## Call: earth(x=trainingData$x, y=trainingData$y)
## 
##                coefficients
## (Intercept)       18.451984
## h(0.621722-X1)   -11.074396
## h(0.601063-X2)   -10.744225
## h(X3-0.281766)    20.607853
## h(0.447442-X3)    17.880232
## h(X3-0.447442)   -23.282007
## h(X3-0.636458)    15.150350
## h(0.734892-X4)   -10.027487
## h(X4-0.734892)     9.092045
## h(0.850094-X5)    -4.723407
## h(X5-0.850094)    10.832932
## h(X6-0.361791)    -1.956821
## 
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982
plotmo(mars_fit)
##  plotmo grid:    X1        X2       X3        X4        X5        X6        X7
##           0.5139349 0.5106664 0.537307 0.4445841 0.5343299 0.4975981 0.4688035
##        X8        X9       X10
##  0.497961 0.5288716 0.5359218
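Each h(·) term in the summary above is a hinge function, \(h(u) = \max(0, u)\): for example, h(0.621722 - X1) is zero whenever X1 ≥ 0.621722 and increases linearly as X1 falls below that knot. A quick illustration:

# Hinge function used in MARS basis terms: h(u) = max(0, u)
h <- function(u) pmax(0, u)
h(0.621722 - c(0.2, 0.8))  # 0.421722 for X1 = 0.2; 0 for X1 = 0.8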

mars_pred <- predict(
  mars_fit,
  newdata = testData$x
)

postResample(
  pred = mars_pred,
  obs = testData$y
)
##      RMSE  Rsquared       MAE 
## 1.8136467 0.8677298 1.3911836

Next, tuning MARS over a grid of .degree and .nprune values with cross-validation:

mars_iter_fit <- train(
  trainingData$x,
  trainingData$y,
  method = "earth",
  tuneGrid = expand.grid(.degree = 1:4, .nprune = 2:50),
  trControl = trainControl(method = "cv")
)

mars_iter_fit$finalModel
## Selected 15 of 18 terms, and 5 of 10 predictors (nprune=15)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 10 4
## GCV 1.618197    RSS 217.6151    GRSq 0.9343005    RSq 0.9553786
plot(varImp(mars_iter_fit))

mars_iter_pred <- predict(
  mars_iter_fit,
  newdata = testData$x
)

postResample(
  pred = mars_iter_pred,
  obs = testData$y
)
##      RMSE  Rsquared       MAE 
## 1.1589948 0.9460418 0.9250230

Support Vector Machines (SVM)

library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
svm_fit <- train(
  trainingData$x,
  trainingData$y,
  method = "svmRadial",
  tuneLength = 14,
  trControl = trainControl(method = "cv")
)

svm_fit$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 32 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.057842388170764 
## 
## Number of Support Vectors : 154 
## 
## Objective Function Value : -83.6329 
## Training error : 0.008407
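Note that sigma is not part of the tuning grid: caret fixes it analytically using kernlab::sigest() and tunes only the cost C over the 14 candidate values. A sketch of that analytic estimate (the returned quantiles should bracket the sigma reported above):

# Analytic estimate of the RBF kernel width used by caret's svmRadial
kernlab::sigest(as.matrix(trainingData$x))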
svm_pred <- predict(
  svm_fit,
  newdata = testData$x
)

postResample(
  pred = svm_pred,
  obs = testData$y
)
##      RMSE  Rsquared       MAE 
## 2.0628371 0.8274383 1.5675766

K-Nearest Neighbors (KNN)

knn_fit <- train(
  trainingData$x,
  trainingData$y,
  method = "knn",
  tuneGrid = data.frame(.k = 1:20),
  trControl = trainControl(method = "cv")
)

knn_fit$finalModel
## 10-nearest neighbor regression model
knn_pred <- predict(
  knn_fit,
  newdata = testData$x
)

postResample(
  pred = knn_pred,
  obs = testData$y
)
##      RMSE  Rsquared       MAE 
## 3.1532609 0.6548118 2.5249599

From the models above, the iteratively tuned MARS model had the best test-set \(R^2\) (0.95); it also had the lowest RMSE (1.16) and MAE (0.93), making it the best performer on every metric reported here.
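For reference, the test-set metrics computed above can be tabulated side by side using the prediction objects already in hand:

# Collect test-set performance for all five models from Exercise 7.2
rbind(
  avNNet     = postResample(nnet_pred, testData$y),
  MARS       = postResample(mars_pred, testData$y),
  MARS_tuned = postResample(mars_iter_pred, testData$y),
  SVM        = postResample(svm_pred, testData$y),
  KNN        = postResample(knn_pred, testData$y)
)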

Revisiting the iteratively tuned MARS model, we can see that it selected exactly the informative predictors X1 through X5 (X6-X10 were unused); X1, X4, X2, and X5 rank highest in importance, while X3 contributes the least.
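This can be confirmed numerically; only X1 through X5 should receive nonzero importance:

# Numeric variable importance for the tuned MARS model
varImp(mars_iter_fit)$importance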

Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

The preProcess function accepts an imputation method; K-nearest neighbors imputation (method = "knnImpute") is a reasonable choice here:

sum(is.na(ChemicalManufacturingProcess[, -c(1)]))
## [1] 106
# Remove predictors with near-zero variance
ChemicalManufacturingProcess <- ChemicalManufacturingProcess[, -nearZeroVar(ChemicalManufacturingProcess)]

# Filling NAs using KNN
knn_impute <- preProcess(
  ChemicalManufacturingProcess[, -c(1)],
  method = "knnImpute"
)

cmp_independent <- predict(
  knn_impute,
  ChemicalManufacturingProcess[, -c(1)]
)

# Keep the outcome as a numeric vector so the row subsetting below works
cmp_dependent <- ChemicalManufacturingProcess$Yield

sum(is.na(cmp_independent))
## [1] 0
set.seed(19940211)
# Partition the data into a sample of 80% of the full dataset
# (note: createDataPartition is conventionally stratified on the
# outcome, cmp_dependent, rather than on a predictor)
cmp_train_rows <- createDataPartition(
  cmp_independent$BiologicalMaterial01,
  p = 0.8,
  list = FALSE
)

# Use the sample to create a training dataset
cmp_train_ind <- cmp_independent[cmp_train_rows, ]
cmp_train_dep <- cmp_dependent[cmp_train_rows]

# Use the sample to create a testing dataset
cmp_test_ind <- cmp_independent[-cmp_train_rows, ]
cmp_test_dep <- cmp_dependent[-cmp_train_rows]
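A quick sanity check that the partition is roughly 80/20:

# Confirm the training/testing split sizes
c(train = nrow(cmp_train_ind), test = nrow(cmp_test_ind))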

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

SVM

cmp_svm_fit <- train(
  cmp_train_ind,
  cmp_train_dep,
  preProc = c("center", "scale"),
  method = "svmRadial",
  tuneLength = 14,
  trControl = trainControl(method = "cv")
)

postResample(
  pred = predict(
    cmp_svm_fit,
    cmp_test_ind
  ),
  cmp_test_dep
)
##      RMSE  Rsquared       MAE 
## 1.3441133 0.5808327 1.0089950

KNN

For KNN, I'll tune over a grid of k values from 1 to 20:

knn_grid <- expand.grid(.k = c(1:20))

cmp_knn_fit <- train(
  x = cmp_train_ind,
  y = cmp_train_dep,
  method = "knn",
  preProc = c("center", "scale"),
  tuneGrid = knn_grid
)

postResample(
  pred = predict(
    cmp_knn_fit,
    cmp_test_ind
  ),
  cmp_test_dep
)
##      RMSE  Rsquared       MAE 
## 1.6081149 0.6100623 1.2726768

MARS - Iterative

Since the iterative version of MARS was better in the last exercise, we will use the same grid:

mars_grid <- expand.grid(.degree = 1:4, .nprune = 2:50)

cmp_mars_fit <- train(
  cmp_train_ind,
  cmp_train_dep,
  method = "earth",
  tuneGrid = mars_grid,
  trControl = trainControl(method = "cv")
)

postResample(
  pred = predict(
    cmp_mars_fit,
    cmp_test_ind
  ),
  cmp_test_dep
)
##      RMSE  Rsquared       MAE 
## 1.3409233 0.5974291 1.0108987
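To see which terms and predictors this MARS model retained, the final earth fit can be inspected (output omitted):

# Inspect hinge terms and selected predictors of the tuned model
summary(cmp_mars_fit$finalModel)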

Neural Net

For a neural network, we first remove highly correlated predictors, then drop those columns from both the training and test sets.

I'm also going to search over a range of decay and size values:

nn_highly_correlated <- findCorrelation(cor(cmp_train_ind), cutoff = .65)

nn_train_ind <- cmp_train_ind[, -nn_highly_correlated]
nn_test_ind <- cmp_test_ind[, -nn_highly_correlated]

nnet_grid <- expand.grid(
  .decay = seq(0, 0.2, by = 0.01),
  .size = 3:10
)

cmp_nn_fit <- train(
  nn_train_ind,
  cmp_train_dep,
  method = "nnet",
  tuneGrid = nnet_grid,
  trControl = trainControl(method = "cv", number = 10),
  preProc = c("center", "scale"),
  MaxNWts = 10 * (ncol(nn_train_ind) + 1) + 10 + 1,
  maxit = 500,
  linout = TRUE,
  trace = FALSE
)

postResample(
  pred = predict(
    cmp_nn_fit,
    nn_test_ind
  ),
  cmp_test_dep
)
##      RMSE  Rsquared       MAE 
## 1.7337086 0.5255814 1.3588858

From the models trained above, the KNN model had the best test-set \(R^2\) (0.61), although its RMSE and MAE were somewhat higher than those of the SVM and MARS models; judging by \(R^2\), KNN is the optimal nonlinear model here.
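For reference, the chemical models' test-set metrics can be tabulated side by side from the fits above:

# Collect test-set performance for the four chemical models
rbind(
  SVM  = postResample(predict(cmp_svm_fit, cmp_test_ind), cmp_test_dep),
  KNN  = postResample(predict(cmp_knn_fit, cmp_test_ind), cmp_test_dep),
  MARS = postResample(predict(cmp_mars_fit, cmp_test_ind), cmp_test_dep),
  NNet = postResample(predict(cmp_nn_fit, nn_test_ind), cmp_test_dep)
)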

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

plot(varImp(cmp_knn_fit))

From the plot above, we can see that ManufacturingProcess13, ManufacturingProcess32, and ManufacturingProcess09 are the most important, and that the majority of the top variables are manufacturing processes. The optimal linear model from Exercise 6.3 showed a similar result: its most important variables were also manufacturing processes.

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Since our optimal model is a KNN model, there are no coefficients to inspect as there would be with a parametric model; KNN is non-parametric, so the fitted relationships between predictors and response are implicit.

To still extract some insight, I'll plot each of the top 10 predictors against the predicted Yield over the test set to look for visible trends:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
top_10_predictors <- varImp(cmp_knn_fit)$importance |>
  arrange(desc(Overall)) |>
  head(10)

cmp_knn_predictions <- predict(
  cmp_knn_fit,
  cmp_test_ind
)

relationship_df <- cmp_test_ind |>
  mutate(Yield = cmp_knn_predictions)

for (predictor in rownames(top_10_predictors)) {
  # .data[[predictor]] is the tidy-eval replacement for the
  # deprecated aes_string()
  plot <- ggplot(relationship_df, aes(x = .data[[predictor]], y = Yield)) +
    geom_point(color = "blue", alpha = 0.6) +
    labs(
      title = paste("Relationship between", predictor, "and Yield"),
      x = predictor,
      y = "Yield"
    ) +
    theme_minimal()

  print(plot)
}

From the charts above it's apparent that the KNN model is capturing complex interactions, as most of the charts show no obvious pattern between the strongest predictors and predicted Yield. There does, however, appear to be some relationship for:

  1. ManufacturingProcess32 - Positive
  2. BiologicalMaterial06 - Positive
  3. BiologicalMaterial03 - Positive
  4. ManufacturingProcess31 - Negative
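These apparent trends can be checked numerically with rank correlations between each of these predictors and the predicted Yield (a quick sketch over the test set):

# Spearman correlations between the noted predictors and predicted Yield
sapply(
  c("ManufacturingProcess32", "BiologicalMaterial06",
    "BiologicalMaterial03", "ManufacturingProcess31"),
  function(p) cor(relationship_df[[p]], relationship_df$Yield,
                  method = "spearman")
)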