\[ y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2) \]
The mlbench package contains the function mlbench.friedman1, which simulates these data:
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd=1)
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y) # caret's featurePlot() shows each predictor against the response
testData <- mlbench.friedman1(5000, sd=1)
testData$x <- data.frame(testData$x)
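As a sanity check on the simulation, the noiseless part of the response can be computed directly from the formula above (a minimal sketch; friedman_surface is just a throwaway helper, not part of mlbench):
# The noiseless Friedman surface from the formula above
friedman_surface <- function(x) {
10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
10 * x[, 4] + 5 * x[, 5]
}
# With sd = 1, each y should differ from this surface only by N(0, 1) noise
head(trainingData$y - friedman_surface(trainingData$x))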
First, we’ll look at the predictors to see if any data are missing:
# Check for NA variables
map_df(trainingData$x, is.na) %>% map_dbl(sum)
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## 0 0 0 0 0 0 0 0 0 0
So far so good! Let’s start tuning some models.
# Transform data into a single data frame for easier modeling
trainingData$y <- as.numeric(trainingData$y)
trainingData <- bind_cols(trainingData$x, trainingData$y)
n <- c(paste0("X", 1:10), "Y")
names(trainingData) <- n
testData$y <- as.numeric(testData$y)
testData <- bind_cols(testData$x, testData$y)
names(testData) <- n
We’ll start with a MARS model. Using 10-fold cross-validation we will tune parameters on the number of terms used and the degree.
set.seed(9912)
# 10-fold CV
folds <- vfold_cv(trainingData, v = 10)
# Recipe (no centering or normalizing is needed here)
mars_recipe <- recipe(Y ~ ., data = trainingData) %>% prep()
# Set the model type and engine
friedman_mars <- mars(num_terms = tune(),
prod_degree = tune(),
prune_method = "backward") %>%
set_engine("earth") %>%
set_mode("regression")
# Grid of tuning values to try
mars_tune <- grid_regular(num_terms(range=c(1,10)),
prod_degree(),
levels = 10)
# Build the workflow
friedman_workflow <- workflow() %>% add_model(friedman_mars) %>%
add_recipe(mars_recipe)
# Results of the tuning
mars_tuned <- friedman_workflow %>%
tune_grid(
resamples = folds,
grid = mars_tune
)
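Before picking a winner, it can help to look at the full resampling profile. A quick sketch using the same autoplot() and show_best() helpers we lean on later:
# Visualize performance across the tuning grid and list the top candidates
autoplot(mars_tuned)
mars_tuned %>% show_best(metric = "rmse")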
# Get the best model from our tuning
mars_best <- select_best(mars_tuned, metric = "rmse")
# Finalize the workflow
friedman_workflow <- friedman_workflow %>% finalize_workflow(mars_best)
mars_fit <-
friedman_workflow %>%
fit(trainingData)
pull_workflow_fit(mars_fit)
## parsnip model object
##
## Fit time: 16ms
## Selected 9 of 18 terms, and 5 of 10 predictors (nprune=9)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 6 2
## GCV 2.454262 RSS 393.185 GRSq 0.9003559 RSq 0.9193784
We see that the model (with the tuned hyperparameters) is showing good performance with the training data. Let’s see how it does on the test data:
mars_test <-
mars_fit %>% predict(testData)
cbind(mars_test, actual = testData$Y) %>%
metrics(truth = actual, estimate = .pred)
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.52
## 2 rsq standard 0.906
## 3 mae standard 1.20
The RMSE and \(R^2\) are pretty good for the MARS model. That might be tough to beat, but we’ll try another model.
Next we will try a neural network, tuning hyperparameters as we did with the MARS model above.
# Model recipe - this time we need to center and scale the predictors
# (step_normalize centers and scales, so one step covers both)
nn_recipe <- recipe(Y ~ ., data = trainingData) %>%
step_normalize(all_predictors()) %>%
prep()
# Set the model type and engine
friedman_nn <- mlp(hidden_units = tune(),
penalty = tune()) %>%
set_engine("nnet") %>%
set_mode("regression")
# Grid of tuning values to try
nn_tune <- grid_regular(hidden_units(),
penalty(),
levels = 5)
# Build the workflow
friedman_workflow <- workflow() %>% add_model(friedman_nn) %>%
add_recipe(nn_recipe)
# Results of the tuning
nn_tuned <- friedman_workflow %>%
tune_grid(
resamples = folds,
grid = nn_tune
)
nn_best <- select_best(nn_tuned, metric = "rmse")
# Finalize the workflow
friedman_workflow <- friedman_workflow %>% finalize_workflow(nn_best)
nn_fit <-
friedman_workflow %>%
fit(trainingData)
nn_fit %>% pull_workflow_fit()
## parsnip model object
##
## Fit time: 28ms
## a 10-7-1 network with 85 weights
## inputs: X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## output(s): ..y
## options were - linear output units decay=1
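As a sanity check on that summary, the weight count works out: a 10-7-1 network has \(10 \times 7 + 7 = 77\) weights feeding the hidden layer (including biases) and \(7 \times 1 + 1 = 8\) feeding the output, for the reported 85 total.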
Let’s check the test data to see how the model does:
nn_test <-
nn_fit %>% predict(testData)
cbind(nn_test, actual = testData$Y) %>%
metrics(truth = actual, estimate = .pred)
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.12
## 2 rsq standard 0.825
## 3 mae standard 1.61
The neural network model did well, but not as well as the MARS model above.
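For a direct side-by-side view, the two sets of test metrics can be stacked into a single tibble (a small sketch reusing the predictions computed above):
# Stack both models' test-set metrics for comparison
bind_rows(
cbind(mars_test, actual = testData$Y) %>%
metrics(truth = actual, estimate = .pred) %>% mutate(model = "MARS"),
cbind(nn_test, actual = testData$Y) %>%
metrics(truth = actual, estimate = .pred) %>% mutate(model = "Neural Net")
)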
Now we’ll compare the variable selection between the two models. Which variables did each model consider most important?
mars_fit %>% pull_workflow_fit() %>% vip() + labs(title="MARS Model")
nn_fit %>% pull_workflow_fit() %>% vip() + labs(title="Neural Net")
The MARS model was successful in finding the important variables (X1 - X5) and ignoring the superfluous ones. The Neural Network, however, was less successful in that endeavor.
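If we want the numeric scores behind those plots rather than the charts, vip also exposes vi(), which returns the importances as a tibble (assuming it accepts the same fitted objects as vip() does above):
# Variable importance scores as a tibble instead of a plot
mars_fit %>% pull_workflow_fit() %>% vi()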
With its superior performance and its ability to correctly identify the important variables, I would say the MARS model is the clear winner.
# Load the chemical manufacturing data
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
Looking at the data again, we will check for missing values:
numNA <- function(x){
return(sum(is.na(x)))
}
map_df(ChemicalManufacturingProcess, numNA) %>%
pivot_longer(cols=everything(), names_to="variable") %>%
filter(value > 0) %>% arrange(-value) %>%
select("Variable Name" = variable, "# NAs" = value) %>%
kable() %>% kable_styling(bootstrap_options = "striped", full_width = F) %>%
scroll_box(height = "300px")
Variable Name | # NAs |
---|---|
ManufacturingProcess03 | 15 |
ManufacturingProcess11 | 10 |
ManufacturingProcess10 | 9 |
ManufacturingProcess25 | 5 |
ManufacturingProcess26 | 5 |
ManufacturingProcess27 | 5 |
ManufacturingProcess28 | 5 |
ManufacturingProcess29 | 5 |
ManufacturingProcess30 | 5 |
ManufacturingProcess31 | 5 |
ManufacturingProcess33 | 5 |
ManufacturingProcess34 | 5 |
ManufacturingProcess35 | 5 |
ManufacturingProcess36 | 5 |
ManufacturingProcess02 | 3 |
ManufacturingProcess06 | 2 |
ManufacturingProcess01 | 1 |
ManufacturingProcess04 | 1 |
ManufacturingProcess05 | 1 |
ManufacturingProcess07 | 1 |
ManufacturingProcess08 | 1 |
ManufacturingProcess12 | 1 |
ManufacturingProcess14 | 1 |
ManufacturingProcess22 | 1 |
ManufacturingProcess23 | 1 |
ManufacturingProcess24 | 1 |
ManufacturingProcess40 | 1 |
ManufacturingProcess41 | 1 |
We will need to use imputation to correct these missing values or remove those cases. If we remove them, we would have 152 of 176 records remaining. However, removing them may cause us to lose important information, so we’ll rely on imputation for now.
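The complete-case count quoted above is easy to verify:
# 176 rows in total, 152 with no missing values
nrow(ChemicalManufacturingProcess)
sum(complete.cases(ChemicalManufacturingProcess))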
# Create a train/test split
set.seed(1108)
chem_split <- initial_split(ChemicalManufacturingProcess, prop=0.75)
chem_train <- training(chem_split)
chem_test <- testing(chem_split)
# Create our generic recipe with KNN imputation
chem_recipe <- chem_train %>% recipe(Yield ~ .) %>%
step_knnimpute(all_predictors()) %>% prep()
# Get 10 folds for CV model tuning
folds <- vfold_cv(chem_train, v=10)
First we’ll try a MARS model as it performed well in the exercise above and also did well in discovering important variables (i.e. feature selection).
# Model specification
chem_mars <- mars(num_terms = tune(),
prod_degree = 1,
prune_method = "backward") %>%
set_engine("earth") %>%
set_mode("regression")
# Tuning grid
chem_tune <- grid_regular(num_terms(range=c(1,57)),
levels = 10)
# Create our workflow for the MARS model
mars_wf <- workflow() %>%
add_recipe(chem_recipe) %>% add_model(chem_mars)
# Tune our model
mars_cv <- mars_wf %>% tune_grid(resamples = folds, grid = chem_tune)
autoplot(mars_cv)
Our RMSE and \(R^2\) values point to using a model with 13 terms.
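We can back up that read of the plot numerically with show_best(), the same helper we use for the KNN model below:
# Top candidates from the tuning grid by RMSE
mars_cv %>% show_best(metric = "rmse")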
# Finalize the workflow with the best model
bestMARS <- select_best(mars_cv, metric="rmse")
mars_wf <- mars_wf %>% finalize_workflow(bestMARS)
# Fit the selected model to the training data
chem_fit <- mars_wf %>% fit(data=chem_train)
# Get variable importance
importance <- chem_fit %>% pull_workflow_fit() %>% vip()
chem_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: mars()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## ● step_knnimpute()
##
## ── Model ───────────────────────────────────────────────────────────────────────────────
## Selected 12 of 22 terms, and 9 of 57 predictors (nprune=13)
## Termination condition: RSq changed by less than 0.001 at 22 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 0.7998748 RSS 71.99479 GRSq 0.736781 RSq 0.8177668
Our final model looks good on the training set. Now we’ll look at its performance on the test set:
mars_final <- mars_wf %>% last_fit(chem_split)
mars_final %>% collect_metrics()
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.48
## 2 rsq standard 0.520
The \(R^2\) dropped quite a bit, and the RMSE is higher than on the training set (where it was, if you calculate it, 0.7385222), suggesting some overfitting to the training data.
Next we’ll try a KNN model:
# Update the recipe to center and scale our predictors
chem_recipe <- chem_recipe %>% step_center(all_predictors()) %>%
step_scale(all_predictors()) %>% prep()
# Model specification
chem_knn <- nearest_neighbor(neighbors = tune(),
weight_func = "inv",
dist_power = tune()) %>%
set_engine("kknn") %>%
set_mode("regression")
# Tuning grid
chem_tune <- grid_regular(neighbors(),
dist_power(),
levels = 10)
# Create our workflow for the KNN model
knn_wf <- workflow() %>%
add_recipe(chem_recipe) %>% add_model(chem_knn)
# Tune our model
knn_cv <- knn_wf %>% tune_grid(resamples = folds, grid = chem_tune)
autoplot(knn_cv)
knn_cv %>% show_best(metric = "rmse")
## # A tibble: 5 x 8
## neighbors dist_power .metric .estimator mean n std_err .config
## <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 3 1 rmse standard 1.10 10 0.119 Model003
## 2 4 1 rmse standard 1.12 10 0.119 Model004
## 3 5 1.11 rmse standard 1.12 10 0.116 Model015
## 4 2 1 rmse standard 1.12 10 0.119 Model002
## 5 5 1 rmse standard 1.12 10 0.113 Model005
The model tuning shows the best fit is 3 neighbors with a distance power of 1. So, we’ll use that one.
# New model with our optimum values
bestKNN <- select_best(knn_cv, metric = "rmse")
# Update workflow for the optimal model
knn_wf <- knn_wf %>% finalize_workflow(bestKNN)
# Fit the selected model to the training data
chem_fit <- knn_wf %>% fit(data=chem_train)
chem_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: nearest_neighbor()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## ● step_knnimpute()
## ● step_center()
## ● step_scale()
##
## ── Model ───────────────────────────────────────────────────────────────────────────────
##
## Call:
## kknn::train.kknn(formula = ..y ~ ., data = data, ks = ~3L, distance = ~1, kernel = ~"inv")
##
## Type of response variable: continuous
## minimal mean absolute error: 0.8170208
## Minimal mean squared error: 1.298037
## Best kernel: inv
## Best k: 3
Now we evaluate the final model on the testing set:
knn_final <- knn_wf %>% last_fit(chem_split)
knn_final %>% collect_metrics()
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.41
## 2 rsq standard 0.577
The KNN model did slightly better than the MARS model in both RMSE and \(R^2\).
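Stacking the collected metrics makes the comparison explicit (a small sketch reusing the two last_fit() results):
# Final test-set metrics from both workflows in one tibble
bind_rows(
mars_final %>% collect_metrics() %>% mutate(model = "MARS"),
knn_final %>% collect_metrics() %>% mutate(model = "KNN")
)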
That distinction belongs to the KNN model here. However, I would use the MARS model: the difference in performance metrics is small, and MARS gives us the ability to investigate the most important variables, and thus to adjust the chemical manufacturing process to improve Yield.
Using the MARS model we can look at which variables it found most important:
importance
We see that the Manufacturing processes dominate the list of most important variables (only 2 of 9 are biological material predictors).
Compared to the previous homework (PLS model), the same top 2 variables were chosen (32 and 09). Of the others, only 33 was in the top 10 of the PLS model.
So, the MARS model chose some of the same variables, but not all.
Again using the MARS model (since we cannot determine the most influential variables in the KNN model), we’ll focus on ManufacturingProcesses 13, 22, and 33, as they were less important in the PLS model from the previous homework:
# Process 13
ChemicalManufacturingProcess %>%
ggplot(aes(x=ManufacturingProcess13, y=Yield)) +
geom_point() + theme_light() +
labs(title="Process 13 vs. Yield",
y="Yield", x="Process 13")
Yield does seem to vary with ManufacturingProcess13 somewhat linearly.
# Process 39
ChemicalManufacturingProcess %>%
ggplot(aes(x=ManufacturingProcess39, y=Yield)) +
geom_point() + theme_light() +
labs(title="Process 39 vs. Yield",
y="Yield", x="Process 39")
Interestingly, ManufacturingProcess39 doesn’t seem to have a lot of impact on Yield.
# Process 01
ChemicalManufacturingProcess %>%
ggplot(aes(x=ManufacturingProcess01, y=Yield)) +
geom_point() + theme_light() +
labs(title="Process 01 vs. Yield",
y="Yield", x="Process 01")
ManufacturingProcess01 also doesn’t seem to have any relationship with the outcome Yield.
# Material 06
ChemicalManufacturingProcess %>%
ggplot(aes(x=BiologicalMaterial06, y=Yield)) +
geom_point() + theme_light() +
labs(title="Material 06 vs. Yield",
y="Yield", x="Material 06")
BiologicalMaterial06 seems to show some linear relationship with the outcome Yield.
# Material 02
ChemicalManufacturingProcess %>%
ggplot(aes(x=BiologicalMaterial02, y=Yield)) +
geom_point() + theme_light() +
labs(title="Material 02 vs. Yield",
y="Yield", x="Material 02")
BiologicalMaterial02 shows some relationship to Yield, much like BiologicalMaterial06 above.
# Process 33
ChemicalManufacturingProcess %>%
ggplot(aes(x=ManufacturingProcess33, y=Yield)) +
geom_point() + theme_light() +
labs(title="Process 33 vs. Yield",
y="Yield", x="Process 33")
There is a weak relationship between ManufacturingProcess33 and Yield.
# Process 22
ChemicalManufacturingProcess %>%
ggplot(aes(x=ManufacturingProcess22, y=Yield)) +
geom_point() + theme_light() +
labs(title="Process 22 vs. Yield",
y="Yield", x="Process 22")
ManufacturingProcess22, however, does not show any real relationship to Yield, which makes it odd that the model chose it. Perhaps it interacts with other variables?
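One quick, informal way to probe that interaction hypothesis is to color the same scatterplot by the top predictor, ManufacturingProcess32 (a sketch, not a formal interaction test):
# Does the Process 22 / Yield pattern shift with the level of Process 32?
ChemicalManufacturingProcess %>%
ggplot(aes(x=ManufacturingProcess22, y=Yield, color=ManufacturingProcess32)) +
geom_point() + theme_light() +
labs(title="Process 22 vs. Yield, colored by Process 32",
y="Yield", x="Process 22", color="Process 32")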
***