library(mlbench)
library(caret)
library(reshape2)
library(ggplot2)
library(tidyr)
library(lattice)
library(earth)
library(kernlab)
library(e1071)
library(tidyverse)
7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: \[ y = 10 \sin(\pi X_1 X_2) + 20 (X_3 - 0.5)^2 + 10 X_4 + 5 X_5 + \varepsilon,\quad \varepsilon \sim \mathcal{N}(0, \sigma^2) \]
where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five non-informative predictors).
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
#featurePlot(trainingData$x[, 1:5], trainingData$y, cex = 0.35)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
str(trainingData)
## List of 2
## $ x:'data.frame': 200 obs. of 10 variables:
## ..$ X1 : num [1:200] 0.534 0.584 0.59 0.691 0.667 ...
## ..$ X2 : num [1:200] 0.648 0.438 0.588 0.226 0.819 ...
## ..$ X3 : num [1:200] 0.8508 0.6727 0.4097 0.0334 0.7168 ...
## ..$ X4 : num [1:200] 0.1816 0.6692 0.3381 0.0669 0.8032 ...
## ..$ X5 : num [1:200] 0.929 0.1638 0.8941 0.6374 0.0831 ...
## ..$ X6 : num [1:200] 0.3618 0.4531 0.0268 0.525 0.2234 ...
## ..$ X7 : num [1:200] 0.827 0.649 0.179 0.513 0.664 ...
## ..$ X8 : num [1:200] 0.421 0.845 0.35 0.797 0.904 ...
## ..$ X9 : num [1:200] 0.5911 0.9282 0.0176 0.6899 0.397 ...
## ..$ X10: num [1:200] 0.589 0.758 0.444 0.445 0.55 ...
## $ y: num [1:200] 18.5 16.1 17.8 13.8 18.4 ...
cor_values <- cor(trainingData$x, trainingData$y)
cor_values
## [,1]
## X1 0.509288830
## X2 0.473594892
## X3 0.008278159
## X4 0.516904120
## X5 0.355704067
## X6 0.023701700
## X7 -0.042081910
## X8 -0.013629682
## X9 0.067044796
## X10 -0.028661255
As we see, X1, X2 and X4 have moderate linear correlations with y, while X3 surprisingly has almost no correlation. This is explained by the fact that X3 enters the formula through a quadratic term, which a simple linear correlation cannot capture. X5's correlation is below the moderate level, and the noise predictors X6-X10 all have near-zero correlations with y.
This shows that correlations alone can be misleading when the relationship between predictors and the outcome is nonlinear.
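As a quick illustrative check (not part of the original analysis), correlating the quadratic transform of X3 implied by the formula with y should recover the signal that the raw linear correlation misses:
# Hypothetical check: correlate the known quadratic transform of X3 with y
cor((trainingData$x$X3 - 0.5)^2, trainingData$y)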
trainingData_long <- trainingData$x |>
pivot_longer(cols = X1:X10,
names_to = 'variable',
values_to = 'value')
trainingData_long$y <- rep(trainingData$y, each = 10)  # each observation expands to 10 long-format rows
trainingData_long$variable <- factor(
trainingData_long$variable,
levels = paste0("X", 1:10)
)
ggplot(trainingData_long, aes(x = value, y = y))+
geom_point(size = 0.5, alpha = 0.4)+
geom_smooth(method = 'lm', se = FALSE)+
facet_wrap(~variable)+
labs(
title = "Relationship Between Predictors and y",
x = "Predictor Value",
y = "y"
)
The plots make it look like there is barely anything going on. X1, X2 and X4 show some linear trend, but everything else looks like noise. Based on these plots alone, I would probably conclude that only X1, X2 and X4 matter, and ignore the rest.
But in this case that would be misleading.
The problem is that the data is not generated in a simple linear way. It is generated by this formula:
\[ y = 10 \sin(\pi X_1 X_2) + 20 (X_3 - 0.5)^2 + 10 X_4 + 5 X_5 + \varepsilon,\quad \varepsilon \sim \mathcal{N}(0, \sigma^2) \]
As we see, X1 and X2 only matter through a sine interaction, so when we look at them separately they appear nearly useless. X3 has a quadratic effect on the outcome, while X4 and X5 have linear relationships with it.
So these plots suggest weak linear signals and a lot of noise, but as the formula shows there are strong nonlinear and interaction effects that one-variable plots cannot reveal.
In my opinion, it can be dangerous to decide which variables are important based on these plots alone.
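One alternative screen (a sketch, assuming caret's filterVarImp() with nonpara = TRUE, which scores each predictor with a loess fit rather than a straight line) can pick up curvature such as the X3 effect, though it still evaluates one predictor at a time and therefore cannot see the X1-X2 interaction itself:
# Loess-based univariate importance; curvature counts, unlike a plain correlation
filterVarImp(trainingData$x, trainingData$y, nonpara = TRUE)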
hist(trainingData$y, breaks = 20, main = 'Distribution of y')
plot(density(trainingData$y), main = 'Distribution of y')
The distribution of y looks roughly bell-shaped with a slight right skew and no extreme outliers. The histogram shows some small spikes around values like 6, 8, and 9, but these are probably due to random sampling variation rather than true structure in the data. The density plot is smooth and unimodal, so overall the distribution of y is still roughly bell-shaped with a slight right skew. I do not see strong evidence of multiple peaks or other serious issues.
Since the data is generated using nonlinear and interaction terms, y is not expected to be perfectly normal anyway. Also, models like kNN and MARS do not require normality assumptions, so checking its normality here is not very critical.
Here I use a linear regression model as a benchmark.
set.seed(200)
lm_Model <- train(
x = trainingData$x,
y = trainingData$y,
method = "lm"
)
lm_Model
## Linear Regression
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.566445 0.744027 2.031539
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
lm_Pred <- predict(lm_Model, newdata = testData$x)
postResample(pred = lm_Pred, obs = testData$y)
## RMSE Rsquared MAE
## 2.6970680 0.7084666 2.0600540
This model only captures linear relationships, so I do not expect it to fully capture the true structure of the data. The real formula has a sine interaction between X1 and X2 and a quadratic effect for X3, which linear regression cannot model directly. The test RMSE was about 2.70 and R-squared about 0.71, so the model is not bad, but an \(R^2\) of 0.71 means it explains only about 71% of the variance in y.
library(ggplot2)
library(patchwork)
plot_model_diagnostics <- function(obs, pred, model_name = "Model") {
diag_df <- data.frame(
observed = as.numeric(obs),
predicted = as.numeric(pred),
residuals = as.numeric(obs) - as.numeric(pred)
)
p1 <- ggplot(data = diag_df, aes(x = predicted, y = observed)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, linetype = 2) +
labs(
title = paste("Predicted vs Observed:", model_name),
x = "Predicted",
y = "Observed"
)
p2 <- ggplot(data = diag_df, aes(x = predicted, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
labs(
title = paste("Residuals vs Predicted:", model_name),
x = "Predicted",
y = "Residuals"
)
p1 + p2
}
plot_model_diagnostics(testData$y, lm_Pred, model_name = 'LM')
As we see from the residuals plot, the linear regression model tends to overpredict for higher values of y (around 15 and above). This happens because the model is trying to fit a straight line to data that is actually nonlinear.
In the true formula, X3 has a quadratic effect, and X1 and X2
interact through a sine function. These nonlinear components create
curvature in the relationship between predictors and y.
Linear regression cannot capture this curvature, so it tries to
approximate it with a straight line.
Because of that, the model overpredicts some values and underpredicts others, which shows up as a pattern in the residuals instead of random scatter.
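As a small sanity check (illustrative only, since it uses knowledge of the true simulation formula), adding the quadratic X3 term and the sine interaction by hand should noticeably improve the plain linear fit:
# Augmented linear model with the known nonlinear terms from the simulation
lm_aug <- lm(y ~ . + I((X3 - 0.5)^2) + sin(pi * X1 * X2),
data = cbind(trainingData$x, y = trainingData$y))
postResample(pred = predict(lm_aug, newdata = testData$x), obs = testData$y)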
KNN
set.seed(200)
knn_Model <- train(x = trainingData$x,
y = trainingData$y,
method = 'knn',
preProcess = c('center', 'scale'),
tuneLength = 10)
knn_Model
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.654912 0.4779838 2.958475
## 7 3.529432 0.5118581 2.861742
## 9 3.446330 0.5425096 2.780756
## 11 3.378049 0.5723793 2.719410
## 13 3.332339 0.5953773 2.692863
## 15 3.309235 0.6111389 2.663046
## 17 3.317408 0.6201421 2.678898
## 19 3.311667 0.6333800 2.682098
## 21 3.316340 0.6407537 2.688887
## 23 3.326040 0.6491480 2.705915
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
knn_pred <- predict(knn_Model, newdata = testData$x)
postResample(pred = knn_pred, obs = testData$y)
## RMSE Rsquared MAE
## 3.1750657 0.6785946 2.5443169
I centered and scaled predictors because kNN is distance-based, and variables with larger scales could otherwise dominate the distance calculation. The best value selected is k = 15. On the test set, kNN had RMSE about 3.18 and \(R^2\) about 0.68, which is worse than linear regression in this run.
In this dataset kNN does not perform well because the data has nonlinear interactions and noise variables, and the sample size is fairly small for 10 predictors. Because of that, kNN struggles to capture the true structure of the data.
plot_model_diagnostics(testData$y, knn_pred, model_name = 'KNN k = 15')
The residuals vs predicted plot shows an upward trend between residuals and predicted values: most predictions under 15 are overpredictions, and there is a tendency toward underprediction for values over 15.
ctrl <- trainControl(method = 'cv',
number = 5)
nnet_grid <- expand.grid(.decay = c(0, 0.015, 0.1),
.size = c(1:5),
.bag = FALSE)
set.seed(200)
nnet_Tune <- train(trainingData$x, trainingData$y,
method = 'avNNet',
tuneGrid = nnet_grid,
trControl = ctrl,
preProcess = c('center', 'scale'),
linout = TRUE,
trace = FALSE,
maxNWts = 5 * (ncol(trainingData$x) +1) + 5 + 1)
nnet_Tune
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 160, 160, 160, 160, 160
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.000 1 2.600371 0.7421312 2.026669
## 0.000 2 2.447822 0.7581238 1.939906
## 0.000 3 2.448799 0.7574100 1.923828
## 0.000 4 2.203343 0.8024991 1.695293
## 0.000 5 2.121211 0.8087015 1.684621
## 0.015 1 2.459430 0.7561741 1.899830
## 0.015 2 2.444665 0.7590263 1.932931
## 0.015 3 2.504256 0.7478216 1.933915
## 0.015 4 2.050420 0.8187224 1.593522
## 0.015 5 2.193792 0.7908606 1.741207
## 0.100 1 2.463826 0.7555734 1.906107
## 0.100 2 2.510902 0.7463898 1.970243
## 0.100 3 2.421044 0.7618647 1.968473
## 0.100 4 2.135048 0.8133837 1.669197
## 0.100 5 2.062175 0.8175254 1.594456
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.015 and bag
## = FALSE.
nnet_pred <- predict(nnet_Tune, newdata = testData$x)
postResample(pred = nnet_pred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0995136 0.8288469 1.5715180
I used a model averaged neural network avNNet instead of
a single neural network because neural networks can be unstable and
sensitive to random initialization. By averaging multiple networks, the
model becomes more stable and reduces variance, which usually leads to
better and more reliable performance.
I used a relatively small neural network and tuned the number of hidden neurons from 1 to 5. Since the training set is small, I kept the network small to reduce overfitting.
The best model used size = 4 and decay = 0.015. Here, size = 4 means the model uses 4 hidden neurons, which gives it enough flexibility to learn the nonlinear structure. The decay = 0.015 value adds some regularization, so the model does not chase noise too much, since five of the predictors are pure noise.
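For reference, a single-hidden-layer network with \(p\) inputs, \(H\) hidden units and one linear output has \(H(p + 1) + H + 1\) weights; with \(p = 10\) and \(H = 5\) that is \(5 \times 11 + 5 + 1 = 61\), which is exactly the value passed to maxNWts in the call above.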
The test set results are RMSE = 2.10 and \(R^2\) = 0.829. This is a clear improvement over kNN, and it shows that the neural network captures more of the nonlinear relationship in the data. But it does not directly tell us which variables are important, so it is harder to interpret than the other models.
plot_model_diagnostics(testData$y, nnet_pred, model_name = 'NNET')
The residual plot shows some large negative residuals, which means the model overpredicts certain outcomes, especially for higher values of y (> 15). Overall, most of the residuals are centered around zero and there is no strong overall trend (unlike what we saw with kNN). This shows that the neural network captures most of the structure but still struggles with some extreme values.
For the SVM model I tried two kernels: svmPoly and svmRadial.
SVM with polynomial kernel
set.seed(200)
svm_poly_Fit <- train(trainingData$x, trainingData$y,
method = 'svmPoly',
preProcess = c('center', 'scale'),
tuneLength = 5,
trControl = ctrl)
svm_poly_Fit
## Support Vector Machines with Polynomial Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 160, 160, 160, 160, 160
## Resampling results across tuning parameters:
##
## degree scale C RMSE Rsquared MAE
## 1 1e-03 0.25 4.788894 0.7287966 3.926068
## 1 1e-03 0.50 4.657569 0.7274512 3.814372
## 1 1e-03 1.00 4.410710 0.7286102 3.602965
## 1 1e-03 2.00 3.981065 0.7283411 3.248080
## 1 1e-03 4.00 3.414155 0.7251268 2.801435
## 1 1e-02 0.25 3.792596 0.7335241 3.094837
## 1 1e-02 0.50 3.197057 0.7304503 2.621917
## 1 1e-02 1.00 2.680618 0.7446363 2.173495
## 1 1e-02 2.00 2.544925 0.7475250 2.068575
## 1 1e-02 4.00 2.505794 0.7465956 2.018848
## 1 1e-01 0.25 2.545823 0.7450915 2.070284
## 1 1e-01 0.50 2.517244 0.7449964 2.024616
## 1 1e-01 1.00 2.530525 0.7426148 2.035408
## 1 1e-01 2.00 2.517979 0.7444989 2.021332
## 1 1e-01 4.00 2.524239 0.7437125 2.026954
## 1 1e+00 0.25 2.519442 0.7445359 2.022988
## 1 1e+00 0.50 2.522680 0.7444811 2.019380
## 1 1e+00 1.00 2.522673 0.7445740 2.019351
## 1 1e+00 2.00 2.526739 0.7442826 2.020256
## 1 1e+00 4.00 2.526394 0.7443473 2.020039
## 1 1e+01 0.25 2.525565 0.7444513 2.019642
## 1 1e+01 0.50 2.526656 0.7443152 2.020158
## 1 1e+01 1.00 2.526583 0.7443044 2.020085
## 1 1e+01 2.00 2.527549 0.7441156 2.020774
## 1 1e+01 4.00 2.527399 0.7441395 2.020707
## 2 1e-03 0.25 4.657564 0.7274752 3.814375
## 2 1e-03 0.50 4.410725 0.7286361 3.602980
## 2 1e-03 1.00 3.981095 0.7283303 3.248116
## 2 1e-03 2.00 3.414020 0.7252689 2.801287
## 2 1e-03 4.00 2.811319 0.7407523 2.281266
## 2 1e-02 0.25 3.188948 0.7322752 2.614651
## 2 1e-02 0.50 2.673669 0.7462253 2.165134
## 2 1e-02 1.00 2.499520 0.7576447 2.027619
## 2 1e-02 2.00 2.462868 0.7557251 1.975107
## 2 1e-02 4.00 2.415934 0.7631762 1.925124
## 2 1e-01 0.25 2.188293 0.8023017 1.695820
## 2 1e-01 0.50 2.106535 0.8178808 1.617347
## 2 1e-01 1.00 2.063797 0.8276610 1.574419
## 2 1e-01 2.00 2.055213 0.8368667 1.611858
## 2 1e-01 4.00 2.102256 0.8330575 1.665243
## 2 1e+00 0.25 2.069165 0.8429957 1.661879
## 2 1e+00 0.50 2.075454 0.8429175 1.668946
## 2 1e+00 1.00 2.074971 0.8429547 1.668533
## 2 1e+00 2.00 2.070968 0.8434377 1.664572
## 2 1e+00 4.00 2.075195 0.8432415 1.670618
## 2 1e+01 0.25 2.076522 0.8431996 1.671451
## 2 1e+01 0.50 2.145109 0.8344878 1.707600
## 2 1e+01 1.00 2.079004 0.8499692 1.677193
## 2 1e+01 2.00 2.851431 0.7204057 2.201769
## 2 1e+01 4.00 2.645239 0.7729003 2.108228
## 3 1e-03 0.25 4.529887 0.7267870 3.705756
## 3 1e-03 0.50 4.198457 0.7250303 3.425093
## 3 1e-03 1.00 3.655681 0.7296779 2.985993
## 3 1e-03 2.00 3.021550 0.7383438 2.472401
## 3 1e-03 4.00 2.630723 0.7467472 2.125796
## 3 1e-02 0.25 2.833060 0.7461427 2.304056
## 3 1e-02 0.50 2.538168 0.7569400 2.051218
## 3 1e-02 1.00 2.433022 0.7637982 1.957615
## 3 1e-02 2.00 2.370815 0.7714677 1.887298
## 3 1e-02 4.00 2.297829 0.7824667 1.803943
## 3 1e-01 0.25 2.103926 0.8192460 1.617207
## 3 1e-01 0.50 2.141827 0.8188975 1.630857
## 3 1e-01 1.00 2.173471 0.8180646 1.667498
## 3 1e-01 2.00 2.223833 0.8126568 1.716705
## 3 1e-01 4.00 2.248065 0.8100883 1.739353
## 3 1e+00 0.25 3.450231 0.6101087 2.603997
## 3 1e+00 0.50 3.450231 0.6101087 2.603997
## 3 1e+00 1.00 3.450231 0.6101087 2.603997
## 3 1e+00 2.00 3.450231 0.6101087 2.603997
## 3 1e+00 4.00 3.450231 0.6101087 2.603997
## 3 1e+01 0.25 4.914557 0.4191887 3.707646
## 3 1e+01 0.50 4.914557 0.4191887 3.707646
## 3 1e+01 1.00 4.914557 0.4191887 3.707646
## 3 1e+01 2.00 4.914557 0.4191887 3.707646
## 3 1e+01 4.00 4.914557 0.4191887 3.707646
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 2, scale = 0.1 and C = 2.
svm_poly_pred <- predict(svm_poly_Fit, newdata = testData$x)
postResample(pred = svm_poly_pred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0800597 0.8290049 1.5778621
The polynomial SVM performed well because it models nonlinear relationships through the polynomial kernel. The best tuning values were degree = 2, scale = 0.1, and C = 2. The chosen degree = 2 makes sense because the true data has nonlinear structure, including a quadratic term.
On the test set, the polynomial SVM had RMSE = 2.08 and \(R^2\) = 0.83, which is slightly better than
the neural network.
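For reference (based on kernlab's polydot kernel, which method = 'svmPoly' uses, with the offset left at its default of 1), the kernel being tuned has the form \[ K(x, x') = \left(\text{scale}\,\langle x, x'\rangle + 1\right)^{\text{degree}}, \] so the degree parameter directly controls how much curvature and interaction the model can represent.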
plot_model_diagnostics(testData$y, svm_poly_pred, model_name = '\nSVM Polynomial kernel')
The observed vs predicted plots show a strong relationship between
observed and predicted values.
But the residual plot shows a slight downward trend, which means the model tends to underpredict smaller values and overpredict larger values. There is also a visible funnel shape: the residuals spread further into negative territory as the predicted values increase. This indicates heteroscedasticity, meaning the model has larger errors for higher values of y. Even though the SVM with polynomial kernel captures nonlinear structure better than simpler models, it still does not fully capture the complexity of the data, especially for extreme values.
SVM with radial basis kernel
set.seed(200)
svm_radial_Fit <- train(trainingData$x, trainingData$y,
method = 'svmRadial',
preProcess = c('center', 'scale'),
tuneLength = 5,
trControl = ctrl)
svm_radial_Fit
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 160, 160, 160, 160, 160
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.612355 0.7740090 2.054715
## 0.50 2.361054 0.7851687 1.844923
## 1.00 2.208398 0.8028435 1.711724
## 2.00 2.100613 0.8186640 1.628059
## 4.00 2.061009 0.8245232 1.587537
##
## Tuning parameter 'sigma' was held constant at a value of 0.06299324
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06299324 and C = 4.
svm_radial_Pred <- predict(svm_radial_Fit, newdata = testData$x)
postResample(obs = testData$y, svm_radial_Pred)
## RMSE Rsquared MAE
## 2.0584633 0.8288348 1.5549667
The SVM with radial kernel performs slightly better than the polynomial SVM and the neural network. It is more flexible and can capture nonlinear patterns locally, which works well for this dataset.
The test results (RMSE = 2.06, \(R^2\) = 0.83) show a slight improvement over the polynomial SVM and neural network models.
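For reference, the radial basis kernel here is \[ K(x, x') = \exp\left(-\sigma \lVert x - x'\rVert^2\right), \] where \(\sigma\) is typically estimated from the data by kernlab's sigest() heuristic (held constant at about 0.063 in the output above) and only the cost C is tuned.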
plot_model_diagnostics(testData$y, svm_radial_Pred, model_name = '\nSVM Radial kernel')
The observed vs predicted plot looks tight, and residuals are mostly centered around zero, but there is still some increasing spread for higher predicted values.
mars_Grid <- expand.grid(.degree = 1:2,
.nprune = 2:15)
set.seed(200)
mars_Tune <- train(trainingData$x, trainingData$y,
method = 'earth',
tuneGrid = mars_Grid,
trControl = ctrl)
mars_Tune
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 160, 160, 160, 160, 160
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.394459 0.2239155 3.596815
## 1 3 3.651986 0.4753989 2.945152
## 1 4 2.618756 0.7223387 2.099531
## 1 5 2.387794 0.7644045 1.869810
## 1 6 2.287581 0.7847355 1.791471
## 1 7 1.740268 0.8663442 1.351337
## 1 8 1.711458 0.8728070 1.352405
## 1 9 1.658720 0.8782945 1.334443
## 1 10 1.666839 0.8769241 1.343719
## 1 11 1.639955 0.8807926 1.319442
## 1 12 1.681916 0.8764066 1.367712
## 1 13 1.654276 0.8799149 1.335111
## 1 14 1.655555 0.8798240 1.340932
## 1 15 1.655555 0.8798240 1.340932
## 2 2 4.394459 0.2239155 3.596815
## 2 3 3.651986 0.4753989 2.945152
## 2 4 2.618756 0.7223387 2.099531
## 2 5 2.387794 0.7644045 1.869810
## 2 6 2.312313 0.7759877 1.814310
## 2 7 1.755055 0.8604940 1.362423
## 2 8 1.626000 0.8828583 1.274149
## 2 9 1.469163 0.9108711 1.149545
## 2 10 1.383385 0.9214971 1.092583
## 2 11 1.290267 0.9302051 1.056804
## 2 12 1.368149 0.9223962 1.085336
## 2 13 1.341996 0.9242350 1.075120
## 2 14 1.338496 0.9234476 1.064074
## 2 15 1.368330 0.9199668 1.079207
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 11 and degree = 2.
mars_pred <- predict(mars_Tune, newdata = testData$x)
postResample(obs = testData$y, pred = mars_pred)
## RMSE Rsquared MAE
## 1.2803060 0.9335241 1.0168673
The test results show RMSE = 1.28 and \(R^2\) = 0.93, which is a big improvement
over all previous models.
As we see, the MARS model clearly performs much better than the previous models. It is well suited for this dataset because it can capture both nonlinear relationships and interactions using piecewise linear functions.
The final model used degree = 2 (meaning it allows multiplicative combinations of hinge functions) and nprune = 11, which means it permits interaction terms while keeping only the most important basis functions. This gives enough flexibility to model the true structure without overfitting.
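Concretely, each MARS basis function is a hinge \(h(u) = \max(0, u)\), so with degree = 2 the fitted model is a sum of terms of the form \(\beta\, h(x_j - c)\) plus products of two hinges such as \(\beta\, h(x_j - c_1)\, h(x_k - c_2)\); this is how MARS can approximate both the quadratic X3 effect and the X1-X2 interaction.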
plot_model_diagnostics(testData$y, mars_pred, model_name = '\nMARS degree = 2')
The observed vs predicted plot shows a very tight alignment along the
diagonal, which means predictions are very close to actual values.
The residuals vs predicted plot looks much more random and centered around zero compared to the other models. Almost all of the residuals fall within (-4, 4). There is no clear pattern or trend, and the spread of the residuals is smaller.
This suggests that the model captures the underlying structure very well.
summary(mars_Tune)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=2,
## nprune=11)
##
## coefficients
## (Intercept) 22.050690
## h(0.621722-X1) -15.001651
## h(X1-0.621722) 10.878737
## h(0.601063-X2) -18.830135
## h(0.447442-X3) 9.940077
## h(X3-0.606015) 12.999390
## h(0.734892-X4) -9.877554
## h(X4-0.734892) 10.414930
## h(0.850094-X5) -5.604897
## h(X1-0.621722) * h(X2-0.295997) -43.245766
## h(0.649253-X1) * h(0.601063-X2) 26.218297
##
## Selected 11 of 18 terms, and 5 of 10 predictors (nprune=11)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 8 2
## GCV 1.747495 RSS 264.5358 GRSq 0.929051 RSq 0.9457576
The MARS summary shows that the model selected 5 out of 10
predictors, which are X1, X2, X3, X4, and X5.
This matches the true data-generating formula, since they are the only
informative predictors. The noise variables (X6–X10) were not
used.
The model also includes multiplicative interaction terms between X1 and
X2, which captures the sine interaction in the true formula.
The MARS model not only delivers the best test performance but also correctly identifies the important variables and their relationships.
test_preds <- list(
"Linear Regression" = lm_Pred,
"kNN" = knn_pred,
"Averaged Neural Network" = nnet_pred,
"SVM Polynomial" = svm_poly_pred,
"SVM Radial" = svm_radial_Pred,
"MARS" = mars_pred
)
# Compute RMSE, Rsquared and MAE for every model on the same test set
model_results <- data.frame(
Model = names(test_preds),
t(sapply(test_preds, postResample, obs = testData$y)),
row.names = NULL
)
model_results <- model_results |>
arrange(RMSE)
model_results
## Model RMSE Rsquared MAE
## 1 MARS 1.280306 0.9335241 1.016867
## 2 SVM Radial 2.058463 0.8288348 1.554967
## 3 SVM Polynomial 2.080060 0.8290049 1.577862
## 4 Averaged Neural Network 2.099514 0.8288469 1.571518
## 5 Linear Regression 2.697068 0.7084666 2.060054
## 6 kNN 3.175066 0.6785946 2.544317
As we see, the model comparison table shows that MARS clearly had the best test performance, with the lowest RMSE and highest \(R^2\). This makes sense because the true formula contains nonlinear and interaction effects, and MARS is better at capturing this kind of structure. The SVM and neural network models also performed well, but they did not get close to the MARS results.
For this exercise I used linear regression as a baseline model, and kNN
performed worse than I expected, probably because the data has noise
variables and complex interactions. Overall, MARS was the best model for
this dataset.
library(AppliedPredictiveModeling)
library(caret)
library(glmnet)
library(pls)
library(dplyr)
library(ggplot2)
library(lattice)
library(RANN)
library(purrr)
library(e1071)
library(patchwork)
library(corrplot)
library(writexl)
options(scipen = 999)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 5,
allowParallel = TRUE
)
# Extract the cross-validated performance metrics for a model's best tuning parameters
get_cv_results <- function(model, model_name) {
best <- model$bestTune
res <- model$results
for (param in names(best)) {
res <- res[res[[param]] == best[[param]], ]
}
data.frame(
Model = model_name,
CV_RMSE = res$RMSE,
CV_R2 = res$Rsquared,
CV_MAE = res$MAE
)
}
# Compute test-set performance metrics for a set of predictions
get_test_results <- function(obs, pred, model_name) {
res <- postResample(pred = pred, obs = obs)
data.frame(
Model = model_name,
Test_RMSE = res["RMSE"],
Test_R2 = res["Rsquared"],
Test_MAE = res["MAE"]
)
}
data("ChemicalManufacturingProcess")
dim(ChemicalManufacturingProcess)
## [1] 176 58
yield <- ChemicalManufacturingProcess[, 1]
processPredictors <- ChemicalManufacturingProcess[, -1]
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8)
sum(is.na(processPredictors))
## [1] 106
The first step was checking the predictor matrix for missing values. There were 106 missing cells across all measurements, which is a relatively small amount compared with the full data set (57 predictors and 176 observations). To impute the NAs I chose bagged-tree imputation because it can better preserve nonlinear relationships between variables.
set.seed(200)
imputation <- preProcess(
processPredictors,
method = 'bagImpute'
)
processPredictors_imputed <- predict(imputation, processPredictors)
sum(is.na(processPredictors_imputed))
## [1] 0
nzv_vars <- nearZeroVar(processPredictors_imputed)
nzv_vars
## [1] 7
processPredictors_imputed <- processPredictors_imputed[, -nzv_vars]
Near-zero variance filtering removed one predictor that contained almost no useful variation.
I split the data into training and test sets using a 75/25 split.
set.seed(200)
train_index <- createDataPartition(yield, p= 0.75, list = FALSE)
train_procPred <- processPredictors_imputed[train_index,]
test_procPred <- processPredictors_imputed[-train_index,]
train_yield <- yield[train_index]
test_yield <- yield[-train_index]
skewness(train_yield)
## [1] 0.2865166
hist(train_yield, breaks = 20)
Before fitting the model, I checked the distribution of the response variable. The yield variable only showed mild right skew, with skewness equal to 0.287, and the histogram looked close to symmetric. Because of this, I kept the response on its original scale to make the final predictions easier to interpret.
cor_matrix <- cor(train_procPred, use = "pairwise.complete.obs")
corrplot(
cor_matrix,
order = 'hclust',
tl.cex = 0.2
)
The correlation matrix shows several strong clusters among variables.
high_corr <- findCorrelation(cor_matrix, cutoff = 0.9)
train_procPred_filtered <- train_procPred[, -high_corr]
test_procPred_filtered <- test_procPred[, -high_corr]
Using a 0.9 cutoff, 10 highly correlated predictors were identified as redundant and removed from both the training and test sets.
In a previous assignment I made the mistake of centering and scaling the data before training and testing the models. Here, preprocessing is instead specified inside train(), so it is applied during model training and estimated within resampling.
set.seed(200)
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
.fraction = seq(.01, 1, length = 20))
enet_tune <- train(train_procPred_filtered, train_yield,
method = 'enet',
tuneGrid = enetGrid,
trControl = ctrl,
preProcess = c('center', 'scale', 'YeoJohnson')
)
enet_tune$results[row.names(enet_tune$bestTune), ]
## lambda fraction RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 45 0.1 0.2184211 1.253232 0.5988309 1.03886 0.2126629 0.1616821 0.1736726
enet_pred <- predict(enet_tune, newdata = test_procPred_filtered)
postResample(enet_pred, test_yield)
## RMSE Rsquared MAE
## 1.2260515 0.6362144 0.9828899
Elastic Net is carried over from Exercise 6.3 and is used here as a baseline model. The model selected a combination of L1 and L2 penalties and achieved a test RMSE = 1.226 and \(R^2\) = 0.636.
plot_model_diagnostics(test_yield, enet_pred, model_name = "ENET")
In the predicted vs observed plot, most points follow the diagonal line, which indicates the model captures the general trend. But there is noticeable spread in both plots, and as the observed values increase the ENET model tends to underestimate them.
The residuals vs predicted plot shows no strong pattern, but the spread is relatively wide.
set.seed(200)
knn_model <- train(
x = train_procPred_filtered,
y = train_yield,
method = 'knn',
tuneLength = 10,
trControl = ctrl,
preProcess = c('center', 'scale', 'YeoJohnson')
)
knn_model
## k-Nearest Neighbors
##
## 132 samples
## 46 predictor
##
## Pre-processing: centered (46), scaled (46), Yeo-Johnson transformation (22)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.301887 0.5130484 1.050365
## 7 1.354823 0.4795851 1.095830
## 9 1.365607 0.4702248 1.103777
## 11 1.372830 0.4732010 1.118912
## 13 1.383267 0.4698308 1.134357
## 15 1.390315 0.4708005 1.142312
## 17 1.397245 0.4712144 1.149876
## 19 1.411680 0.4688991 1.159028
## 21 1.426508 0.4646536 1.163456
## 23 1.436087 0.4639830 1.170425
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
knn_pred <- predict(knn_model, newdata = test_procPred_filtered)
postResample(obs = test_yield, pred = knn_pred)
## RMSE Rsquared MAE
## 1.4281925 0.4310553 1.1614091
kNN performed worse than the ENET model, with a test RMSE = 1.428. An \(R^2\) of 0.43 tells us the kNN model struggles to explain even half of the variance in yield.
Even though it is a nonlinear method, it relies heavily on local distances, which become less informative in high-dimensional data. Because this data set has many predictors, the distance-based approach suffers from the curse of dimensionality, which explains the poor test performance.
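As a small illustrative sketch (not part of the original analysis), the ratio between the distance to the nearest and to the farthest neighbour drifts toward 1 as the number of predictors grows, which is one way to see why distance-based methods degrade here (46 is the number of predictors after filtering):
# Distance concentration: nearest/farthest distance ratio for random uniform data
set.seed(200)
dist_ratio <- function(p, n = 100) {
d <- as.matrix(dist(matrix(runif(n * p), ncol = p)))
diag(d) <- NA
mean(apply(d, 1, min, na.rm = TRUE) / apply(d, 1, max, na.rm = TRUE))
}
sapply(c(2, 10, 46), dist_ratio)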
plot_model_diagnostics(test_yield, knn_pred, model_name = 'KNN')
The predicted vs observed plot shows wider spread than the ENET model, meaning the predictions are less accurate; the points are not tightly aligned with the diagonal line.
The residual plot also shows large variability, which confirms that the model is unstable and does not capture the underlying structure well.
mars_grid <- expand.grid(
.degree = 1:2,
.nprune = 2:30
)
set.seed(200)
mars_model <- train(
x = train_procPred,
y = train_yield,
method = 'earth',
tuneGrid = mars_grid,
trControl = ctrl,
preProcess = c('center', 'scale', 'YeoJohnson')
)
mars_model
## Multivariate Adaptive Regression Spline
##
## 132 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56), Yeo-Johnson transformation (25)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 1.375777 0.4596863 1.0978774
## 1 3 1.169033 0.6062765 0.9598891
## 1 4 1.151194 0.6245793 0.9504233
## 1 5 1.169103 0.6172718 0.9578723
## 1 6 1.182881 0.6063307 0.9657971
## 1 7 1.205244 0.5903349 0.9773674
## 1 8 1.224660 0.5882430 0.9928473
## 1 9 1.243509 0.5828915 1.0068009
## 1 10 1.249794 0.5788619 1.0098672
## 1 11 1.266644 0.5729139 1.0247848
## 1 12 1.272552 0.5753734 1.0309235
## 1 13 1.277499 0.5745308 1.0278588
## 1 14 1.295876 0.5685540 1.0412033
## 1 15 1.293342 0.5679382 1.0389234
## 1 16 1.293244 0.5681805 1.0405552
## 1 17 1.359317 0.5570959 1.0621675
## 1 18 1.356408 0.5587051 1.0597379
## 1 19 1.353462 0.5604875 1.0566285
## 1 20 1.353462 0.5604875 1.0566285
## 1 21 1.359667 0.5592852 1.0624910
## 1 22 1.359667 0.5592852 1.0624910
## 1 23 1.359667 0.5592852 1.0624910
## 1 24 1.359667 0.5592852 1.0624910
## 1 25 1.359667 0.5592852 1.0624910
## 1 26 1.359667 0.5592852 1.0624910
## 1 27 1.359667 0.5592852 1.0624910
## 1 28 1.359667 0.5592852 1.0624910
## 1 29 1.359667 0.5592852 1.0624910
## 1 30 1.359667 0.5592852 1.0624910
## 2 2 1.375777 0.4596863 1.0978774
## 2 3 1.234344 0.5590277 1.0018585
## 2 4 1.187404 0.5928335 0.9684019
## 2 5 1.179816 0.6053931 0.9566722
## 2 6 1.173829 0.6109156 0.9539316
## 2 7 1.228271 0.5898192 1.0018101
## 2 8 1.221640 0.5873668 0.9908607
## 2 9 1.250387 0.5891500 1.0019043
## 2 10 1.274317 0.5865343 1.0187710
## 2 11 1.278324 0.5814646 1.0253128
## 2 12 1.291718 0.5808346 1.0415082
## 2 13 1.319707 0.5742255 1.0552566
## 2 14 1.406358 0.5582495 1.1104033
## 2 15 1.416597 0.5475534 1.1183876
## 2 16 1.436650 0.5426552 1.1340803
## 2 17 1.463073 0.5359593 1.1465020
## 2 18 1.485921 0.5299050 1.1606301
## 2 19 1.494435 0.5287815 1.1695633
## 2 20 1.494668 0.5285554 1.1685367
## 2 21 1.517682 0.5244953 1.1829374
## 2 22 1.517187 0.5232374 1.1838747
## 2 23 1.526537 0.5206569 1.1911888
## 2 24 1.526241 0.5209456 1.1898427
## 2 25 1.526241 0.5209456 1.1898427
## 2 26 1.526241 0.5209456 1.1898427
## 2 27 1.525170 0.5218507 1.1902312
## 2 28 1.525170 0.5218507 1.1902312
## 2 29 1.525170 0.5218507 1.1902312
## 2 30 1.525170 0.5218507 1.1902312
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 4 and degree = 1.
mars_pred <- predict(mars_model, newdata = test_procPred)
postResample(obs = test_yield, pred = mars_pred)
## RMSE Rsquared MAE
## 1.1285523 0.6370613 0.8805678
MARS produced the best performance among all models, with the lowest
test RMSE = 1.129 and \(R^2\) = 0.637.
The optimal model used degree = 1 and nprune = 4. The degree = 1 setting tells us the chosen MARS model is additive, without multiplicative interaction terms.
This is interesting because it shows that the relationships are nonlinear, but at the same time there is no need for complex interactions. Instead, hinged linear terms are enough to capture the structure of the data.
plot_model_diagnostics(obs = test_yield, pred = mars_pred, model_name = 'MARS')
The predicted vs observed plot shows stronger alignment along the diagonal than for the previous models, indicating better prediction accuracy, and the points are more tightly clustered.
The residual plot shows points are evenly distributed around zero. This suggests that the model is capturing most of the signal in the data and leaving less unexplained variation.
set.seed(200)
svm_radial_model <- train(train_procPred, train_yield,
method = 'svmRadial',
tuneLength = 14,
trControl = ctrl,
preProcess = c('center', 'scale', 'YeoJohnson')
)
svm_radial_model
## Support Vector Machines with Radial Basis Function Kernel
##
## 132 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56), Yeo-Johnson transformation (25)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.417162 0.4831737 1.1560115
## 0.50 1.306631 0.5320544 1.0676134
## 1.00 1.232920 0.5704469 1.0016118
## 2.00 1.177839 0.6004753 0.9532968
## 4.00 1.166820 0.6041736 0.9456848
## 8.00 1.159502 0.6054002 0.9386184
## 16.00 1.157852 0.6060202 0.9372875
## 32.00 1.157852 0.6060202 0.9372875
## 64.00 1.157852 0.6060202 0.9372875
## 128.00 1.157852 0.6060202 0.9372875
## 256.00 1.157852 0.6060202 0.9372875
## 512.00 1.157852 0.6060202 0.9372875
## 1024.00 1.157852 0.6060202 0.9372875
## 2048.00 1.157852 0.6060202 0.9372875
##
## Tuning parameter 'sigma' was held constant at a value of 0.01343501
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01343501 and C = 16.
svm_radial_pred <- predict(svm_radial_model, newdata = test_procPred)
postResample(obs = test_yield, pred = svm_radial_pred)
## RMSE Rsquared MAE
## 1.1678754 0.6219256 0.9192249
The SVM with radial kernel also performed well and is the second-best model, with a test RMSE = 1.168, close to MARS (1.13).
plot_model_diagnostics(test_yield, svm_radial_pred, model_name = 'SVM')
The predicted vs observed plot shows good alignment with the
diagonal, but slightly more spread than MARS.
The residual plot does not show a strong pattern, but the variability is slightly higher than for MARS, and some values are slightly underpredicted.
nnet_grid <- expand.grid(
.decay = c(0, 0.01, 0.1),
.size = 1:15,
.bag = FALSE
)
set.seed(200)
nnet_model <- train(
x = train_procPred_filtered,
y = train_yield,
method = "avNNet",
tuneGrid = nnet_grid,
trControl = ctrl,
preProc = c("center", "scale", 'YeoJohnson'),
linout = TRUE,
trace = FALSE,
maxNWts = 3000
)
nnet_model
## Model Averaged Neural Network
##
## 132 samples
## 46 predictor
##
## Pre-processing: centered (46), scaled (46), Yeo-Johnson transformation (22)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 1.703248 0.3778959 1.304958
## 0.00 2 1.508959 0.3939445 1.240394
## 0.00 3 1.612317 0.4007101 1.244382
## 0.00 4 2.030056 0.2849494 1.600580
## 0.00 5 1.921538 0.3533147 1.471030
## 0.00 6 1.859060 0.3623278 1.472607
## 0.00 7 2.105965 0.3196056 1.687440
## 0.00 8 2.579368 0.2869939 2.005186
## 0.00 9 3.250863 0.2546037 2.478087
## 0.00 10 3.952873 0.1998514 2.815648
## 0.00 11 4.905718 0.1782711 3.466201
## 0.00 12 6.371266 0.1711152 4.012285
## 0.00 13 7.737730 0.1580155 4.954700
## 0.00 14 8.626960 0.1538539 5.302241
## 0.00 15 9.107009 0.1531409 5.378354
## 0.01 1 1.361628 0.4733408 1.113470
## 0.01 2 1.291615 0.5357111 1.041846
## 0.01 3 1.519702 0.4539587 1.219488
## 0.01 4 1.877391 0.3490668 1.462645
## 0.01 5 1.994101 0.3109844 1.576038
## 0.01 6 1.924628 0.3405632 1.541169
## 0.01 7 2.042256 0.3303952 1.643767
## 0.01 8 2.562601 0.3018398 1.946254
## 0.01 9 3.195850 0.2980176 2.335708
## 0.01 10 4.241455 0.2408660 2.991853
## 0.01 11 5.021692 0.2092511 3.437460
## 0.01 12 6.706996 0.1730011 4.283838
## 0.01 13 7.615320 0.1429849 4.697240
## 0.01 14 8.549283 0.1501687 5.248403
## 0.01 15 8.801523 0.1696280 5.036857
## 0.10 1 1.364692 0.4850919 1.113154
## 0.10 2 1.336829 0.5090254 1.071598
## 0.10 3 1.509504 0.4602740 1.201903
## 0.10 4 1.852849 0.3515562 1.420515
## 0.10 5 1.807087 0.3963275 1.425934
## 0.10 6 1.780482 0.3876776 1.426311
## 0.10 7 1.978927 0.3766534 1.565521
## 0.10 8 2.280847 0.3420362 1.776463
## 0.10 9 2.927127 0.2543762 2.170576
## 0.10 10 3.504037 0.2309713 2.527744
## 0.10 11 5.065381 0.1944928 3.319595
## 0.10 12 5.548514 0.1908132 3.665042
## 0.10 13 6.863536 0.1494704 4.218530
## 0.10 14 7.934455 0.1767128 4.742481
## 0.10 15 7.326321 0.1547291 4.484152
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 2, decay = 0.01 and bag = FALSE.
nnet_pred <- predict(nnet_model, newdata = test_procPred_filtered)
postResample(obs = test_yield, pred = nnet_pred)
## RMSE Rsquared MAE
## 1.2328764 0.5725431 0.9725628
The neural network model showed moderate performance with test RMSE =
1.233 and lower \(R^2\) = 0.57 compared
to MARS and SVM models. This could be due to the relatively small
dataset size (only 176 observations), which is not ideal for training
neural networks effectively.
Also, training neural networks takes much longer and uses more computational resources, so I do not think it would be the best model choice here even if its RMSE and \(R^2\) were close to the MARS and SVM results.
plot_model_diagnostics(test_yield, nnet_pred, model_name = 'NNET')
The predicted vs observed plot shows reasonable alignment but
slightly wider spread compared to MARS and SVM.
The residual plot's spread is mostly even around zero, but it shows that higher observed values were underpredicted by the NNET model.
stopCluster(cl)
registerDoSEQ()
cv_results <- dplyr::bind_rows(
get_cv_results(enet_tune, "ENET"),
get_cv_results(knn_model, "KNN"),
get_cv_results(mars_model, "MARS"),
get_cv_results(svm_radial_model, "SVM"),
get_cv_results(nnet_model, "NNET")
)
test_results <- dplyr::bind_rows(
get_test_results(test_yield, enet_pred, "ENET"),
get_test_results(test_yield, knn_pred, "KNN"),
get_test_results(test_yield, mars_pred, "MARS"),
get_test_results(test_yield, svm_radial_pred, "SVM"),
get_test_results(test_yield, nnet_pred, "NNET")
)
Do any of the nonlinear models outperform the optimal linear model you previously developed in Exercise 6.2? If so, what might this tell you about the underlying relationship between the predictors and the response?
final_results <- dplyr::left_join(cv_results, test_results, by = "Model")
final_results <- final_results |>
arrange(Test_RMSE)
final_results
## Model CV_RMSE CV_R2 CV_MAE Test_RMSE Test_R2 Test_MAE
## 1 MARS 1.151194 0.6245793 0.9504233 1.128552 0.6370613 0.8805678
## 2 SVM 1.157852 0.6060202 0.9372875 1.167875 0.6219256 0.9192249
## 3 ENET 1.253232 0.5988309 1.0388603 1.226051 0.6362144 0.9828899
## 4 NNET 1.291615 0.5357111 1.0418459 1.232876 0.5725431 0.9725628
## 5 KNN 1.301887 0.5130484 1.0503651 1.428192 0.4310553 1.1614091
The comparison of nonlinear models shows that:
MARS provides the best overall performance. It achieved the lowest test RMSE (1.129) and \(R^2\) = 0.637, indicating the most accurate predictions on test data.
SVM is the second-best model, with performance very close to MARS, which suggests that both methods are effective at capturing nonlinear relationships in the data.
ENET and NNET show moderate performance. Although ENET has a relatively high test \(R^2\) = 0.636 (very close to MARS), its test RMSE is worse than MARS and SVM, which indicates that a linear structure is not sufficient to fully capture the underlying patterns. The neural network also does not outperform simpler nonlinear methods, likely due to the small dataset size and limited ability to generalize.
KNN performed worst compared to the rest of the models, with the highest test RMSE and lowest R². This suggests that distance-based methods struggle in this high-dimensional setting and do not predict well enough.
Overall, MARS is the best model: it provides the most accurate predictions and is also easier to interpret than the SVM and NNET models.
summary(mars_model)
## Call: earth(x=data.frame[132,56], y=c(38,42.44,42.0...), keepxy=TRUE, degree=1,
## nprune=4)
##
## coefficients
## (Intercept) 39.718231
## h(1.05347-ManufacturingProcess09) -0.757467
## h(-1.1859-ManufacturingProcess13) 2.237452
## h(ManufacturingProcess32- -0.815622) 1.257127
##
## Selected 4 of 22 terms, and 3 of 56 predictors (nprune=4)
## Termination condition: RSq changed by less than 0.001 at 22 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 3 (additive model)
## GCV 1.282191 RSS 151.7746 GRSq 0.6246959 RSq 0.6582876
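Written out (with the predictors on their centered, scaled, Yeo-Johnson-transformed scale, since preprocessing was applied inside train()), the fitted model from the summary above is approximately \[ \widehat{\text{Yield}} = 39.72 - 0.76\, h(1.05 - \text{MP09}) + 2.24\, h(-1.19 - \text{MP13}) + 1.26\, h(\text{MP32} + 0.82), \] where \(h(u) = \max(0, u)\) and MP09, MP13, MP32 abbreviate the corresponding ManufacturingProcess predictors.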
Do either the biological or process variables dominate the list?
mars_imp <- varImp(mars_model)
mars_top10 <- mars_imp$importance |>
tibble::rownames_to_column("Predictor") |>
arrange(desc(Overall)) |>
slice(1:10)
mars_top10
## Predictor Overall
## 1 ManufacturingProcess32 100.00000
## 2 ManufacturingProcess09 46.53772
## 3 ManufacturingProcess13 0.00000
The MARS model selected only 3 of the 56 available predictors. This is because MARS performs built-in variable selection: a forward pass adds candidate terms, and the subsequent backward pruning removes terms that do not significantly improve model performance.
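As an optional cross-check (a sketch assuming the earth package's evimp() helper, which ranks predictors by how often they survive pruning and by their GCV/RSS contribution), the same conclusion can be read directly off the final model:
# Variable importance from earth's own pruning-based criteria
evimp(mars_model$finalModel)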
The final model kept only ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess13, since these variables contributed most to reducing the prediction error.
This result suggests that yield is mostly driven by this small subset of manufacturing process variables, while the other predictors are either redundant or carry little predictive information.
As we see, these three process predictors clearly dominate the MARS model.
enet_imp <- varImp(enet_tune)
enet_top10 <- enet_imp$importance |>
tibble::rownames_to_column("Predictor") |>
arrange(desc(Overall)) |>
slice(1:10)
enet_top10
## Predictor Overall
## 1 ManufacturingProcess32 100.00000
## 2 ManufacturingProcess17 80.31063
## 3 BiologicalMaterial06 75.08969
## 4 ManufacturingProcess13 74.46564
## 5 ManufacturingProcess36 68.75818
## 6 BiologicalMaterial03 67.10754
## 7 ManufacturingProcess06 66.37746
## 8 ManufacturingProcess09 65.28374
## 9 ManufacturingProcess33 46.90589
## 10 BiologicalMaterial08 44.97856
intersect(mars_top10$Predictor, enet_top10$Predictor)
## [1] "ManufacturingProcess32" "ManufacturingProcess09" "ManufacturingProcess13"
The comparison between the top predictors from the MARS model and
optimal linear model (ENET) shows overlap and differences at the same
time. The MARS model selected only three important predictors:
ManufacturingProcess32,
ManufacturingProcess09, and
ManufacturingProcess13. All three of these variables also
appear in the top ten predictors of the ENET model.
From these findings I can suggest that these three manufacturing process variables are consistently the most important predictors of yield, regardless of whether a linear or nonlinear modeling approach is used.
However, the ENET model includes a broader set of predictors in its
top ten list, including several biological variables such as
BiologicalMaterial06, BiologicalMaterial03, and
BiologicalMaterial08. This indicates that the linear model
spreads importance across more variables, while the MARS model focuses
only on the strongest signals.
Overall, the nonlinear MARS model is more selective and relies on a smaller subset of predictors, while the linear ENET model captures additional weaker relationships across both manufacturing and biological variables.
Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
unique_mars_vars <- intersect(mars_top10$Predictor, enet_top10$Predictor)
plot_df <- train_procPred |>
select(all_of(unique_mars_vars)) |>
mutate(Yield = train_yield)
plot_df_long <- plot_df |>
pivot_longer(
cols = all_of(unique_mars_vars),
names_to = 'Predictor',
values_to = 'Value'
)
ggplot(plot_df_long, aes(x = Value, y = Yield)) +
geom_point(alpha = 0.7) +
geom_smooth(method = 'lm', se = FALSE) +
facet_wrap(~ Predictor, scales = 'free_x') +
labs(
title = 'Relationships between MARS predictors and yield',
x = 'Predictor value',
y = 'Yield'
)
The plots show how the most important predictors are related to yield:
ManufacturingProcess32 has a clear positive trend: as it increases, yield increases fairly consistently.
ManufacturingProcess09 also has a positive relationship, but with more variability, so the effect is a little weaker.
ManufacturingProcess13 shows a negative trend, meaning higher values are associated with lower yield.
Overall, these patterns help explain why MARS picked these variables. They have clear, strong relationships with yield, and MARS can capture these relationships better than a simple linear model.