Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to generate data:
\[y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\]
where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five additional non-informative predictors). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(mlbench)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
library(caret)
knnModel <- train(
x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10
)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
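To see how the resampled RMSE changes with k, we can also plot the tuning profile (output not shown):
# Resampling profile across the candidate k values
plot(knnModel)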
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1-X5)?
We fit a few models below:
Averaged Neural Net - 5 Hidden Units
Note that size = 5 specifies five hidden units in a single hidden layer, not five layers. We’re going to employ bootstrap aggregation by setting bag = TRUE:
library(nnet)
set.seed(19940211)
# Number of hidden units in the single hidden layer
hidden_units <- 5
avnnet_fit <- avNNet(
trainingData$x,
trainingData$y,
decay = .01,
size = hidden_units,
repeats = 5,
maxit = 500,
bag = TRUE
)
## Fitting Repeat 1
##
## # weights: 61
## initial value 42775.910945
## iter 10 value 38543.439520
## iter 20 value 38504.409075
## iter 30 value 38503.073659
## iter 40 value 38502.972683
## final value 38502.964073
## converged
## Fitting Repeat 2
##
## # weights: 61
## initial value 39463.360555
## iter 10 value 38087.440914
## iter 20 value 38050.799600
## iter 30 value 38049.896831
## final value 38049.879498
## converged
## Fitting Repeat 3
##
## # weights: 61
## initial value 43545.963636
## iter 10 value 40071.691922
## iter 20 value 40009.637252
## iter 30 value 40008.200691
## final value 40008.070969
## converged
## Fitting Repeat 4
##
## # weights: 61
## initial value 39715.043007
## iter 10 value 36848.723532
## iter 20 value 36834.073583
## iter 30 value 36833.972812
## final value 36833.970266
## converged
## Fitting Repeat 5
##
## # weights: 61
## initial value 42792.513826
## iter 10 value 40908.945606
## iter 20 value 40828.624468
## iter 30 value 40827.004898
## iter 40 value 40826.853513
## iter 40 value 40826.853303
## iter 40 value 40826.853211
## final value 40826.853211
## converged
nnet_pred <- predict(
avnnet_fit,
newdata = testData$x
)
postResample(
pred = nnet_pred,
obs = testData$y
)
## RMSE Rsquared MAE
## 14.2769357 0.2747165 13.3869183
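This poor fit is expected: avNNet passes its arguments through to nnet, whose output unit defaults to a logistic sigmoid (linout = FALSE), so predictions are confined to (0, 1) while the response averages roughly 14, which lines up with the MAE of about 13.4. A minimal sketch of a re-fit with a linear output unit (results not shown):
# Same fit, but with a linear output unit appropriate for regression
set.seed(19940211)
avnnet_linout_fit <- avNNet(
trainingData$x,
trainingData$y,
decay = .01,
size = hidden_units,
repeats = 5,
maxit = 500,
bag = TRUE,
linout = TRUE # passed through to nnet
)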
Multivariate Adaptive Regression Splines (MARS)
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
mars_fit <- earth(
trainingData$x,
trainingData$y
)
summary(mars_fit)
## Call: earth(x=trainingData$x, y=trainingData$y)
##
## coefficients
## (Intercept) 18.451984
## h(0.621722-X1) -11.074396
## h(0.601063-X2) -10.744225
## h(X3-0.281766) 20.607853
## h(0.447442-X3) 17.880232
## h(X3-0.447442) -23.282007
## h(X3-0.636458) 15.150350
## h(0.734892-X4) -10.027487
## h(X4-0.734892) 9.092045
## h(0.850094-X5) -4.723407
## h(X5-0.850094) 10.832932
## h(X6-0.361791) -1.956821
##
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
plotmo(mars_fit)
## plotmo grid: X1 X2 X3 X4 X5 X6 X7
## 0.5139349 0.5106664 0.537307 0.4445841 0.5343299 0.4975981 0.4688035
## X8 X9 X10
## 0.497961 0.5288716 0.5359218
mars_pred <- predict(
mars_fit,
newdata = testData$x
)
postResample(
pred = mars_pred,
obs = testData$y
)
## RMSE Rsquared MAE
## 1.8136467 0.8677298 1.3911836
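earth also provides evimp, which ranks the retained predictors by how many pruning subsets include them, along with GCV and RSS criteria (output not shown):
# Variable importance for the predictors kept by the pruning pass
evimp(mars_fit)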
Next, trying a tuned MARS model, iterating over a grid of .degree and .nprune values:
mars_iter_fit <- train(
trainingData$x,
trainingData$y,
method = "earth",
tuneGrid = expand.grid(.degree = 1:4, .nprune = 2:50),
trControl = trainControl(method = "cv")
)
mars_iter_fit$finalModel
## Selected 15 of 18 terms, and 5 of 10 predictors (nprune=15)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 10 4
## GCV 1.618197 RSS 217.6151 GRSq 0.9343005 RSq 0.9553786
plot(varImp(mars_iter_fit))
mars_iter_pred <- predict(
mars_iter_fit,
newdata = testData$x
)
postResample(
pred = mars_iter_pred,
obs = testData$y
)
## RMSE Rsquared MAE
## 1.1589948 0.9460418 0.9250230
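The winning combination of degree and nprune can be read directly off the train object (output not shown):
# Tuning parameters chosen by cross-validation
mars_iter_fit$bestTune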
Support Vector Machines (SVM)
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
svm_fit <- train(
trainingData$x,
trainingData$y,
method = "svmRadial",
tuneLength = 14,
trControl = trainControl(method = "cv")
)
svm_fit$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 32
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.057842388170764
##
## Number of Support Vectors : 154
##
## Objective Function Value : -83.6329
## Training error : 0.008407
svm_pred <- predict(
svm_fit,
newdata = testData$x
)
postResample(
pred = svm_pred,
obs = testData$y
)
## RMSE Rsquared MAE
## 2.0628371 0.8274383 1.5675766
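Because caret estimates sigma once (via kernlab’s sigest) and tunes only the cost parameter, the cross-validation profile over C can be plotted directly (output not shown):
# RMSE across the cost grid
plot(svm_fit)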
K-Nearest Neighbors (KNN)
knn_fit <- train(
trainingData$x,
trainingData$y,
method = "knn",
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = "cv")
)
knn_fit$finalModel
## 10-nearest neighbor regression model
knn_pred <- predict(
knn_fit,
newdata = testData$x
)
postResample(
pred = knn_pred,
obs = testData$y
)
## RMSE Rsquared MAE
## 3.1532609 0.6548118 2.5249599
Of the models above, the tuned MARS model performed best, with a test-set \(R^2\) of 0.95; it also had the lowest RMSE and MAE, so the ranking is consistent across all three metrics.
Revisiting the tuned MARS model, it selected exactly the informative predictors X1 through X5: X1, X4, X2, and X5 rank high in importance, while X3 contributes the least.
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
In the preProcess function, we can specify a method for filling NAs; KNN imputation is a good choice here:
sum(is.na(ChemicalManufacturingProcess[, -c(1)]))
## [1] 106
# Removing entries with low variance
ChemicalManufacturingProcess <- ChemicalManufacturingProcess[, -nearZeroVar(ChemicalManufacturingProcess)]
# Filling NAs using KNN
knn_impute <- preProcess(
ChemicalManufacturingProcess[, -c(1)],
method = "knnImpute"
)
cmp_independent <- predict(
knn_impute,
ChemicalManufacturingProcess[, -c(1)]
)
# Keep Yield as a numeric vector so it can be row-indexed below
cmp_dependent <- ChemicalManufacturingProcess[, 1]
sum(is.na(cmp_independent))
## [1] 0
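Note that knnImpute also centers and scales the predictors before finding neighbors; printing the preProcess object shows exactly which transformations were applied (output not shown):
# Summary of the transformations bundled into this preProcess object
knn_impute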
set.seed(19940211)
# Partition the data into a sample of 80% of the full dataset
cmp_train_rows <- createDataPartition(
cmp_independent$BiologicalMaterial01,
p = 0.8,
list = FALSE
)
# Use the sample to create a training dataset
cmp_train_ind <- cmp_independent[cmp_train_rows, ]
cmp_train_dep <- cmp_dependent[cmp_train_rows]
# Use the sample to create a testing dataset
cmp_test_ind <- cmp_independent[-cmp_train_rows, ]
cmp_test_dep <- cmp_dependent[-cmp_train_rows]
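A quick sanity check that the split is roughly 80/20 (output not shown):
# Row counts for the training and test partitions
nrow(cmp_train_ind)
nrow(cmp_test_ind)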
SVM
cmp_svm_fit <- train(
cmp_train_ind,
cmp_train_dep,
preProc = c("center", "scale"),
method = "svmRadial",
tuneLength = 14,
trControl = trainControl(method = "cv")
)
postResample(
pred = predict(
cmp_svm_fit,
cmp_test_ind
),
cmp_test_dep
)
## RMSE Rsquared MAE
## 1.3441133 0.5808327 1.0089950
KNN
For KNN, I’ll tune over a grid of k values from 1 to 20:
knn_grid <- expand.grid(.k = c(1:20))
cmp_knn_fit <- train(
x = cmp_train_ind,
y = cmp_train_dep,
method = "knn",
preProc = c("center", "scale"),
tuneGrid = knn_grid
)
postResample(
pred = predict(
cmp_knn_fit,
cmp_test_ind
),
cmp_test_dep
)
## RMSE Rsquared MAE
## 1.6081149 0.6100623 1.2726768
MARS - Iterative
Since the iterative (tuned) version of MARS performed better in the last exercise, we will reuse the same tuning grid:
mars_grid <- expand.grid(.degree = 1:4, .nprune = 2:50)
cmp_mars_fit <- train(
cmp_train_ind,
cmp_train_dep,
method = "earth",
tuneGrid = mars_grid,
trControl = trainControl(method = "cv")
)
postResample(
pred = predict(
cmp_mars_fit,
cmp_test_ind
),
cmp_test_dep
)
## RMSE Rsquared MAE
## 1.3409233 0.5974291 1.0108987
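It’s also worth checking which predictors this MARS model retained; summary on the final earth model lists the selected terms (output not shown):
# Terms and predictors kept by the final pruned model
summary(cmp_mars_fit$finalModel)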
Neural Net
For a neural net, we first remove highly correlated predictors, then drop those same columns from both the training and test sets. We’ll also tune over a series of decay and size values:
nn_highly_correlated <- findCorrelation(cor(cmp_train_ind), cutoff = .65)
nn_train_ind <- cmp_train_ind[, -nn_highly_correlated]
nn_test_ind <- cmp_test_ind[, -nn_highly_correlated]
nnet_grid <- expand.grid(
.decay = seq(0, 0.2, by = 0.01), # weight decay from 0 to 0.2 in steps of 0.01
.size = c(3:10)
)
cmp_nn_fit <- train(
nn_train_ind,
cmp_train_dep,
method = "nnet",
tuneGrid = nnet_grid,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale"),
MaxNWts = 10 * (ncol(nn_train_ind) + 1) + 10 + 1,
maxit = 500,
linout = TRUE,
trace = FALSE
)
postResample(
pred = predict(
cmp_nn_fit,
nn_test_ind
),
cmp_test_dep
)
## RMSE Rsquared MAE
## 1.7337086 0.5255814 1.3588858
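With two tuning parameters, caret’s plot method draws one RMSE curve per size across the decay grid, which helps confirm the grid was wide enough (output not shown):
# Resampling profile over decay, one curve per hidden-unit count
plot(cmp_nn_fit)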
Of the models trained above, the KNN model had the best \(R^2\) at 0.61, although its RMSE and MAE were somewhat higher than those of the SVM and MARS fits.
plot(varImp(cmp_knn_fit))
From the plot, ManufacturingProcess13, ManufacturingProcess32, and ManufacturingProcess09 are the most important predictors, and manufacturing process variables make up the majority of the top of the list. Our linear model, which can be found here, produced a similar result: the most important variables were the manufacturing processes.
Since our optimal model is a KNN model, we don’t have coefficients as we do with parametric models; the relationships between the independent and dependent variables are captured non-parametrically and can be complex.
To still get some insight, I’ll plot each of the top 10 predictors against the KNN model’s predicted Yield on the test set to look for any visible correlation:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
top_10_predictors <- varImp(cmp_knn_fit)$importance |>
arrange(desc(Overall)) |>
head(10)
cmp_knn_predictions <- predict(
cmp_knn_fit,
cmp_test_ind
)
relationship_df <- cmp_test_ind |>
mutate(Yield = cmp_knn_predictions) # predicted Yield from the KNN model
for (predictor in rownames(top_10_predictors)) {
plot <- ggplot(relationship_df, aes(x = .data[[predictor]], y = Yield)) +
geom_point(color = "blue", alpha = 0.6) +
labs(
title = paste("Relationship between", predictor, "and Yield"),
x = predictor,
y = "Yield"
) +
theme_minimal()
print(plot)
}
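To quantify what the scatterplots suggest, here is a quick sketch computing the linear correlation of each top predictor with the predicted Yield (output not shown; small magnitudes would match the lack of visible pattern):
# Pearson correlation of each top-10 predictor with the predicted Yield
sapply(
rownames(top_10_predictors),
function(p) cor(relationship_df[[p]], relationship_df$Yield)
)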
From the charts above, it’s apparent that the KNN model is capturing complex interactions, as most of these charts show no obvious pattern between the strongest predictors and Yield. There is somewhat of a relationship between: