7.2 Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\[ y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2) \]

where the \(x\) values are random variables uniformly distributed on \([0,1]\) (the simulation also creates five additional non-informative predictors). The package mlbench contains a function mlbench.friedman1 that simulates these data.
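This analysis uses functions from several packages beyond mlbench; the chunk below is our assumption about what the original (likely hidden) setup chunk loaded:

# Assumed setup: packages used throughout this document
library(tidyverse)   # purrr, dplyr, ggplot2, tidyr
library(tidymodels)  # rsample, recipes, parsnip, tune, workflows, yardstick
library(caret)       # featurePlot()
library(vip)         # variable importance plots
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), scroll_box()
# The earth, nnet, and kknn packages must also be installed for the model engines.

With that in place, we simulate the training and test sets: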

library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd=1)

trainingData$x <- data.frame(trainingData$x)

featurePlot(trainingData$x, trainingData$y)

testData <- mlbench.friedman1(5000, sd=1)
testData$x <- data.frame(testData$x)
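For reference, the simulated response can be computed directly from the equation above. This small sketch (ours, not part of mlbench) just makes the data-generating process explicit; mlbench.friedman1 is what we actually use:

# Direct implementation of the Friedman (1991) equation; only x1-x5
# influence the response, the remaining five predictors are pure noise
friedman1 <- function(x, sd = 1) {
  10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5] +
    rnorm(nrow(x), mean = 0, sd = sd)
}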

Tune several models on these data. Which models appear to give the best performance? Does MARS select the informative predictors (those named X1 - X5)?

First, we’ll look at the predictors to see if any data are missing:

# Check for NA variables
map_df(trainingData$x, is.na) %>% map_dbl(sum)
##  X1  X2  X3  X4  X5  X6  X7  X8  X9 X10 
##   0   0   0   0   0   0   0   0   0   0

So far so good! Let’s start tuning some models.

# Combine predictors and response into single data frames for easier modeling
trainingData <- bind_cols(trainingData$x, Y = as.vector(trainingData$y))

n <- c(paste0("X", 1:10), "Y")
names(trainingData) <- n

testData <- bind_cols(testData$x, Y = as.vector(testData$y))
names(testData) <- n

MARS

We’ll start with a MARS model. Using 10-fold cross-validation, we will tune the number of retained terms (num_terms) and the product degree (prod_degree).

set.seed(9912)

# 10-fold CV
folds <- vfold_cv(trainingData, v = 10)

# Recipe (no centering or scaling is needed here)
mars_recipe <- recipe(Y ~ ., data = trainingData) %>% prep()

# Set the model type and engine
friedman_mars <- mars(num_terms = tune(),
                      prod_degree = tune(),
                      prune_method = "backward") %>%
  set_engine("earth") %>%
  set_mode("regression")

# Grid of tuning values to try
mars_tune <- grid_regular(num_terms(range=c(1,10)),
                          prod_degree(),
                          levels = 10)

# Build the workflow
friedman_workflow <- workflow() %>% add_model(friedman_mars) %>%
  add_recipe(mars_recipe)

# Results of the tuning
mars_tuned <- friedman_workflow %>% 
  tune_grid(
    resamples = folds,
    grid = mars_tune
    )

# Get the best model from our tuning
mars_best <- select_best(mars_tuned, metric = "rmse")

# Finalize the workflow
friedman_workflow <- friedman_workflow %>% finalize_workflow(mars_best)

mars_fit <- 
  friedman_workflow %>% 
  fit(trainingData)

pull_workflow_fit(mars_fit)
## parsnip model object
## 
## Fit time:  16ms 
## Selected 9 of 18 terms, and 5 of 10 predictors (nprune=9)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 6 2
## GCV 2.454262    RSS 393.185    GRSq 0.9003559    RSq 0.9193784

We see that the model (with the tuned hyperparameters) is showing good performance with the training data. Let’s see how it does on the test data:

mars_test <- 
  mars_fit %>% predict(testData)

cbind(mars_test, actual = testData$Y) %>%
  metrics(truth = actual, estimate = .pred)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       1.52 
## 2 rsq     standard       0.906
## 3 mae     standard       1.20
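
As an optional visual sanity check (a sketch using the predictions we already have), we can plot the MARS predictions against the actual test values:

# Predicted vs. actual for the MARS model on the test set
cbind(mars_test, actual = testData$Y) %>%
  ggplot(aes(x = actual, y = .pred)) +
  geom_point(alpha = 0.3) +
  geom_abline(linetype = "dashed") +
  labs(title = "MARS: Predicted vs. Actual (Test Set)",
       x = "Actual Y", y = "Predicted Y")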

The RMSE and \(R^2\) are pretty good for the MARS model. That might be tough to beat, but we’ll try another model.

Neural Net

Next we will try a neural network, tuning hyperparameters as we did with the MARS model above.

# Model recipe - this time we need to center and scale the predictors
nn_recipe <- recipe(Y ~ ., data = trainingData) %>%
  step_normalize(all_predictors()) %>%  # step_normalize() both centers and scales
  prep()

# Set the model type and engine
friedman_nn <- mlp(hidden_units = tune(),
                   penalty = tune()) %>%
  set_engine("nnet") %>%
  set_mode("regression")

# Grid of tuning values to try
nn_tune <- grid_regular(hidden_units(),
                        penalty(),
                        levels = 5)

# Build the workflow
friedman_workflow <- workflow() %>% add_model(friedman_nn) %>%
  add_recipe(nn_recipe)

# Results of the tuning
nn_tuned <- friedman_workflow %>% 
  tune_grid(
    resamples = folds,
    grid = nn_tune
    )

nn_best <- select_best(nn_tuned, metric = "rmse")

# Finalize the workflow
friedman_workflow <- friedman_workflow %>% finalize_workflow(nn_best)

nn_fit <- 
  friedman_workflow %>% 
  fit(trainingData)

nn_fit %>% pull_workflow_fit()
## parsnip model object
## 
## Fit time:  28ms 
## a 10-7-1 network with 85 weights
## inputs: X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 
## output(s): ..y 
## options were - linear output units  decay=1

Let’s check the test data to see how the model does:

nn_test <- 
  nn_fit %>% predict(testData)

cbind(nn_test, actual = testData$Y) %>%
  metrics(truth = actual, estimate = .pred)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       2.12 
## 2 rsq     standard       0.825
## 3 mae     standard       1.61

The neural network model did well, but not as well as the MARS model above.
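
To see the two models side by side, here is a small convenience sketch that stacks the test-set metrics we already computed:

# Combine test-set metrics for both models into one table
bind_rows(
  cbind(mars_test, actual = testData$Y) %>%
    metrics(truth = actual, estimate = .pred) %>%
    mutate(model = "MARS"),
  cbind(nn_test, actual = testData$Y) %>%
    metrics(truth = actual, estimate = .pred) %>%
    mutate(model = "Neural Net")
)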

Variable Importance

Now we’ll compare the variable selection between the two models. Which variables did each model consider most important?

mars_fit %>% pull_workflow_fit() %>% vip() + labs(title="MARS Model")

nn_fit %>% pull_workflow_fit() %>% vip() + labs(title="Neural Net")

The MARS model was successful in finding the important variables (X1 - X5) and ignoring the superfluous ones. The Neural Network, however, was less successful in that endeavor.
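
As a cross-check on the vip() plot, we can ask the earth package directly for its importance table (a sketch using earth::evimp(), which ranks predictors from the pruned model's subsets):

# earth's built-in variable importance for the fitted MARS model
mars_fit %>%
  pull_workflow_fit() %>%
  pluck("fit") %>%   # the underlying earth object
  earth::evimp()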

With its superior performance and its ability to correctly identify the important variables, the MARS model is the clear winner.






7.5 Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

# Load the chemical manufacturing data
library(AppliedPredictiveModeling)

data("ChemicalManufacturingProcess")

Looking at the data again, we will check for missing values:

numNA <- function(x){
  return(sum(is.na(x)))
}

map_df(ChemicalManufacturingProcess, numNA) %>%
  pivot_longer(cols=everything(), names_to="variable") %>%
  filter(value > 0) %>% arrange(-value) %>%
  select("Variable Name" = variable, "# NAs" = value) %>%
  kable() %>% kable_styling(bootstrap_options = "striped", full_width = F) %>%
  scroll_box(height = "300px")
Variable Name # NAs
ManufacturingProcess03 15
ManufacturingProcess11 10
ManufacturingProcess10 9
ManufacturingProcess25 5
ManufacturingProcess26 5
ManufacturingProcess27 5
ManufacturingProcess28 5
ManufacturingProcess29 5
ManufacturingProcess30 5
ManufacturingProcess31 5
ManufacturingProcess33 5
ManufacturingProcess34 5
ManufacturingProcess35 5
ManufacturingProcess36 5
ManufacturingProcess02 3
ManufacturingProcess06 2
ManufacturingProcess01 1
ManufacturingProcess04 1
ManufacturingProcess05 1
ManufacturingProcess07 1
ManufacturingProcess08 1
ManufacturingProcess12 1
ManufacturingProcess14 1
ManufacturingProcess22 1
ManufacturingProcess23 1
ManufacturingProcess24 1
ManufacturingProcess40 1
ManufacturingProcess41 1

We will need to impute these missing values or remove the affected cases. Removing them would leave only 152 of the 176 records, and may cost us important information, so we’ll rely on imputation.
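
A quick check (sketch) confirms the complete-case count quoted above:

# Rows with no missing values vs. total rows
sum(complete.cases(ChemicalManufacturingProcess))
nrow(ChemicalManufacturingProcess)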

# Create a train/test split
set.seed(1108)

chem_split <- initial_split(ChemicalManufacturingProcess, prop=0.75)
chem_train <- training(chem_split)
chem_test <- testing(chem_split)

# Create our generic recipe with KNN imputation
chem_recipe <- chem_train %>% recipe(Yield ~ .) %>%
  step_knnimpute(all_predictors()) %>% prep()

# Get 10 folds for CV model tuning
folds <- vfold_cv(chem_train, v=10)

MARS Model

First we’ll try a MARS model, as it performed well in the exercise above and also did well at discovering important variables (i.e., feature selection).

# Model specification
chem_mars <- mars(num_terms = tune(),
                  prod_degree = 1,
                  prune_method = "backward") %>%
  set_engine("earth") %>% 
  set_mode("regression")

# Tuning grid
chem_tune <- grid_regular(num_terms(range=c(1,57)),
                          levels = 10)

# Create our workflow for the MARS model
mars_wf <- workflow() %>%
  add_recipe(chem_recipe) %>% add_model(chem_mars)

# Tune our model
mars_cv <- mars_wf %>% tune_grid(resamples = folds, grid = chem_tune)

autoplot(mars_cv)

Our RMSE and \(R^2\) values point to using a model with 13 terms.
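
We can confirm that numerically as well (a quick sketch):

# Top candidates ranked by cross-validated RMSE
mars_cv %>% show_best(metric = "rmse")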

# Finalize the workflow with the best model
bestMARS <- select_best(mars_cv, metric="rmse")

mars_wf <- mars_wf %>% finalize_workflow(bestMARS)

# Fit the selected model to the training data
chem_fit <- mars_wf %>% fit(data=chem_train)

# Get variable importance
importance <- chem_fit %>% pull_workflow_fit() %>% vip()

chem_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: mars()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## ● step_knnimpute()
## 
## ── Model ───────────────────────────────────────────────────────────────────────────────
## Selected 12 of 22 terms, and 9 of 57 predictors (nprune=13)
## Termination condition: RSq changed by less than 0.001 at 22 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 0.7998748    RSS 71.99479    GRSq 0.736781    RSq 0.8177668

Our final model looks good on the training set. Now we’ll look at its performance on the test set:

mars_final <- mars_wf %>% last_fit(chem_split)

mars_final %>% collect_metrics()
## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       1.48 
## 2 rsq     standard       0.520

The \(R^2\) dropped quite a bit, and the RMSE is higher than on the training set (where it works out to 0.7385222).
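
For reference, here is one way to compute that training-set RMSE (a sketch; the workflow's recipe handles the KNN imputation when predicting):

# Training-set RMSE for the fitted MARS workflow
chem_fit %>%
  predict(new_data = chem_train) %>%
  bind_cols(chem_train %>% select(Yield)) %>%
  rmse(truth = Yield, estimate = .pred)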


KNN

Next we’ll try a KNN model:

# Update the recipe to center and scale our predictors
chem_recipe <- chem_recipe %>% step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>% prep()

# Model specification
chem_knn <- nearest_neighbor(neighbors = tune(),
                  weight_func = "inv",
                  dist_power = tune()) %>%
  set_engine("kknn") %>% 
  set_mode("regression")

# Tuning grid
chem_tune <- grid_regular(neighbors(),
                          dist_power(),
                          levels = 10)

# Create our workflow for the KNN model
knn_wf <- workflow() %>%
  add_recipe(chem_recipe) %>% add_model(chem_knn)

# Tune our model
knn_cv <- knn_wf %>% tune_grid(resamples = folds, grid = chem_tune)

autoplot(knn_cv)

knn_cv %>% show_best(metric = "rmse")
## # A tibble: 5 x 8
##   neighbors dist_power .metric .estimator  mean     n std_err .config 
##       <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>   
## 1         3       1    rmse    standard    1.10    10   0.119 Model003
## 2         4       1    rmse    standard    1.12    10   0.119 Model004
## 3         5       1.11 rmse    standard    1.12    10   0.116 Model015
## 4         2       1    rmse    standard    1.12    10   0.119 Model002
## 5         5       1    rmse    standard    1.12    10   0.113 Model005

The model tuning shows the best fit is 3 neighbors with a distance power of 1, so we’ll use that combination.

# New model with our optimum values
bestKNN <- select_best(knn_cv, metric = "rmse")

# Update workflow for the optimal model
knn_wf <- knn_wf %>% finalize_workflow(bestKNN)

# Fit the selected model to the training data
chem_fit <- knn_wf %>% fit(data=chem_train)

chem_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: nearest_neighbor()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## ● step_knnimpute()
## ● step_center()
## ● step_scale()
## 
## ── Model ───────────────────────────────────────────────────────────────────────────────
## 
## Call:
## kknn::train.kknn(formula = ..y ~ ., data = data, ks = ~3L, distance = ~1,     kernel = ~"inv")
## 
## Type of response variable: continuous
## minimal mean absolute error: 0.8170208
## Minimal mean squared error: 1.298037
## Best kernel: inv
## Best k: 3

Now we fit the model to the testing set:

knn_final <- knn_wf %>% last_fit(chem_split)

knn_final %>% collect_metrics()
## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       1.41 
## 2 rsq     standard       0.577

The KNN model did slightly better than the MARS model on both RMSE and \(R^2\).
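
Side by side (a small sketch reusing the collected metrics):

# Test-set metrics for both chemical models
bind_rows(
  mars_final %>% collect_metrics() %>% mutate(model = "MARS"),
  knn_final %>% collect_metrics() %>% mutate(model = "KNN")
)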



a) Which nonlinear regression model gives the optimal resampling and test set performance?

That distinction belongs to the KNN model here. However, I would still use the MARS model: the difference in performance metrics is small, and MARS gives us the ability to investigate the most important variables and thus target changes to the chemical manufacturing process that improve Yield.



b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Using the MARS model we can look at which variables it found most important:

importance

We see that the manufacturing process variables dominate the list of most important variables (only 2 of the 9 selected predictors are biological material predictors).
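
We can tally that directly from the importance plot's underlying data (a sketch; it assumes vip() stored its table in the ggplot object's data slot with Variable and Importance columns, and the grepl() pattern relies on the predictor naming convention):

# Count biological vs. process predictors among the plotted variables
importance$data %>%
  mutate(type = if_else(grepl("^Biological", Variable),
                        "Biological", "Process")) %>%
  count(type)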

Compared to the previous homework (PLS model), the same top 2 variables were chosen (32 and 09). Of the others, only 33 was in the top 10 of the PLS model.

So, the MARS model chose some of the same variables, but not all.



c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Again using the MARS model (since we cannot determine the most influential variables in the KNN model), we’ll plot several of its predictors against Yield. We focus on ManufacturingProcess13, 22, and 33, which were less important in the PLS model from the previous homework, along with a few other predictors for comparison:

# Process 13
ChemicalManufacturingProcess %>%
  ggplot(aes(x=ManufacturingProcess13, y=Yield)) +
  geom_point() + theme_light() +
  labs(title="Process 13 vs. Yield",
       y="Yield", x="Process 13")

Yield does seem to vary with ManufacturingProcess13 somewhat linearly.

# Process 39
ChemicalManufacturingProcess %>%
  ggplot(aes(x=ManufacturingProcess39, y=Yield)) +
  geom_point() + theme_light() +
  labs(title="Process 39 vs. Yield",
       y="Yield", x="Process 39")

Interestingly, ManufacturingProcess39 doesn’t seem to have a lot of impact on Yield.

# Process 01
ChemicalManufacturingProcess %>%
  ggplot(aes(x=ManufacturingProcess01, y=Yield)) +
  geom_point() + theme_light() +
  labs(title="Process 01 vs. Yield",
       y="Yield", x="Process 01")

ManufacturingProcess01 also doesn’t seem to have any relationship with the outcome Yield.

# Material 06
ChemicalManufacturingProcess %>%
  ggplot(aes(x=BiologicalMaterial06, y=Yield)) +
  geom_point() + theme_light() +
  labs(title="Material 06 vs. Yield",
       y="Yield", x="Material 06")

BiologicalMaterial06 seems to show some linear relationship with the outcome Yield.

# Material 02
ChemicalManufacturingProcess %>%
  ggplot(aes(x=BiologicalMaterial02, y=Yield)) +
  geom_point() + theme_light() +
  labs(title="Material 02 vs. Yield",
       y="Yield", x="Material 02")

BiologicalMaterial02 shows a relationship to Yield similar to Material 06 above.

# Process 33
ChemicalManufacturingProcess %>%
  ggplot(aes(x=ManufacturingProcess33, y=Yield)) +
  geom_point() + theme_light() +
  labs(title="Process 33 vs. Yield",
       y="Yield", x="Process 33")

There is some weak relationship with the ManufacturingProcess33 variable and Yield.

# Process 22
ChemicalManufacturingProcess %>%
  ggplot(aes(x=ManufacturingProcess22, y=Yield)) +
  geom_point() + theme_light() +
  labs(title="Process 22 vs. Yield",
       y="Yield", x="Process 22")

ManufacturingProcess22, however, shows hardly any relationship to Yield, which makes it odd that the model chose it. Perhaps it interacts with other variables?
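
One way to probe for an interaction (a sketch; conditioning on ManufacturingProcess32 is our choice, since it was the top predictor) is to color the Process 22 scatterplot by another important variable:

# Process 22 vs. Yield, colored by the top predictor Process 32
ChemicalManufacturingProcess %>%
  ggplot(aes(x = ManufacturingProcess22, y = Yield,
             color = ManufacturingProcess32)) +
  geom_point() + theme_light() +
  labs(title = "Process 22 vs. Yield, Colored by Process 32",
       x = "Process 22", color = "Process 32")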


***