library(mlbench)
library(caret)
library(reshape2)
library(ggplot2)
library(tidyr)
library(lattice)
library(earth)
library(kernlab)
library(e1071)
library(tidyverse)
7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: \[ y = 10 \sin(\pi X_1 X_2) + 20 (X_3 - 0.5)^2 + 10 X_4 + 5 X_5 + \varepsilon,\quad \varepsilon \sim \mathcal{N}(0, \sigma^2) \]
where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five non-informative predictors).
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
#featurePlot(trainingData$x[, 1:5], trainingData$y, cex = 0.35)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
str(trainingData)
## List of 2
## $ x:'data.frame': 200 obs. of 10 variables:
## ..$ X1 : num [1:200] 0.534 0.584 0.59 0.691 0.667 ...
## ..$ X2 : num [1:200] 0.648 0.438 0.588 0.226 0.819 ...
## ..$ X3 : num [1:200] 0.8508 0.6727 0.4097 0.0334 0.7168 ...
## ..$ X4 : num [1:200] 0.1816 0.6692 0.3381 0.0669 0.8032 ...
## ..$ X5 : num [1:200] 0.929 0.1638 0.8941 0.6374 0.0831 ...
## ..$ X6 : num [1:200] 0.3618 0.4531 0.0268 0.525 0.2234 ...
## ..$ X7 : num [1:200] 0.827 0.649 0.179 0.513 0.664 ...
## ..$ X8 : num [1:200] 0.421 0.845 0.35 0.797 0.904 ...
## ..$ X9 : num [1:200] 0.5911 0.9282 0.0176 0.6899 0.397 ...
## ..$ X10: num [1:200] 0.589 0.758 0.444 0.445 0.55 ...
## $ y: num [1:200] 18.5 16.1 17.8 13.8 18.4 ...
cor_values <- cor(trainingData$x, trainingData$y)
cor_values
## [,1]
## X1 0.509288830
## X2 0.473594892
## X3 0.008278159
## X4 0.516904120
## X5 0.355704067
## X6 0.023701700
## X7 -0.042081910
## X8 -0.013629682
## X9 0.067044796
## X10 -0.028661255
As we see, X1, X2 and X4 have moderate linear correlations with y, while X3 surprisingly has almost no correlation. This is explained by the fact that X3 enters the formula through a quadratic term, which a simple linear correlation cannot capture. X5's correlation is below the moderate level, and the noise predictors X6-X10 all have near-zero correlations with y.
This shows that correlations alone can be misleading when the relationship between predictors and the outcome is nonlinear.
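As a quick illustrative check (not part of the original analysis), correlating the quadratic transform of X3 implied by the formula with y should recover the signal that the raw linear correlation misses:
# Hypothetical check: correlate the known quadratic transform of X3 with y
cor((trainingData$x$X3 - 0.5)^2, trainingData$y)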
trainingData_long <- trainingData$x |>
pivot_longer(cols = X1:X10,
names_to = 'variable',
values_to = 'value')
trainingData_long$y <- rep(trainingData$y, each = 10)  # each observation expands to 10 long-format rows
trainingData_long$variable <- factor(
trainingData_long$variable,
levels = paste0("X", 1:10)
)
ggplot(trainingData_long, aes(x = value, y = y))+
geom_point(size = 0.5, alpha = 0.4)+
geom_smooth(method = 'lm', se = FALSE)+
facet_wrap(~variable)+
labs(
title = "Relationship Between Predictors and y",
x = "Predictor Value",
y = "y"
)
The plots make it look like there is barely anything going on. X1, X2 and X4 show some linear trend, but everything else looks like noise. Based on these plots alone, I would probably conclude that only X1, X2 and X4 matter, and ignore the rest.
But in this case that would be misleading.
The problem is that the data is not generated in a simple linear way. It is generated by this formula:
\[ y = 10 \sin(\pi X_1 X_2) + 20 (X_3 - 0.5)^2 + 10 X_4 + 5 X_5 + \varepsilon,\quad \varepsilon \sim \mathcal{N}(0, \sigma^2) \]
As we see, X1 and X2 only matter through a sine interaction, so when we look at them separately they appear nearly useless. X3 has a quadratic effect on the outcome, while X4 and X5 have linear relationships with it.
So these plots suggest weak linear signals and a lot of noise, but as the formula shows there are strong nonlinear and interaction effects that one-variable plots cannot reveal.
In my opinion, it can be dangerous to decide which variables are important based on these plots alone.
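One alternative screen (a sketch, assuming caret's filterVarImp() with nonpara = TRUE, which scores each predictor with a loess fit rather than a straight line) can pick up curvature such as the X3 effect, though it still evaluates one predictor at a time and therefore cannot see the X1-X2 interaction itself:
# Loess-based univariate importance; curvature counts, unlike a plain correlation
filterVarImp(trainingData$x, trainingData$y, nonpara = TRUE)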
hist(trainingData$y, breaks = 20, main = 'Distribution of y')
plot(density(trainingData$y), main = 'Distribution of y')
The distribution of y looks roughly bell-shaped with a slight right skew and no extreme outliers. The histogram shows some small spikes around values like 6, 8, and 9, but these are probably due to random sampling variation rather than true structure in the data. The density plot is smooth and unimodal, so overall the distribution of y is still roughly bell-shaped with a slight right skew. I do not see strong evidence of multiple peaks or other serious issues.
Since the data is generated using nonlinear and interaction terms, y is not expected to be perfectly normal anyway. Also, models like kNN and MARS do not require normality assumptions, so checking its normality here is not very critical.
Here I use a linear regression model as a benchmark.
set.seed(200)
lm_Model <- train(
x = trainingData$x,
y = trainingData$y,
method = "lm"
)
lm_Model
## Linear Regression
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.566445 0.744027 2.031539
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
lm_Pred <- predict(lm_Model, newdata = testData$x)
postResample(pred = lm_Pred, obs = testData$y)
## RMSE Rsquared MAE
## 2.6970680 0.7084666 2.0600540
This model only captures linear relationships, so I do not expect it to fully capture the true structure of the data. The real formula has a sine interaction between X1 and X2 and a quadratic effect for X3, which linear regression cannot model directly. The test RMSE was about 2.70 and R-squared about 0.71, so the model is not bad, but an \(R^2\) of 0.71 means it explains only about 71% of the variance in y.
library(ggplot2)
library(patchwork)
plot_model_diagnostics <- function(obs, pred, model_name = "Model") {
diag_df <- data.frame(
observed = as.numeric(obs),
predicted = as.numeric(pred),
residuals = as.numeric(obs) - as.numeric(pred)
)
p1 <- ggplot(data = diag_df, aes(x = predicted, y = observed)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, linetype = 2) +
labs(
title = paste("Predicted vs Observed:", model_name),
x = "Predicted",
y = "Observed"
)
p2 <- ggplot(data = diag_df, aes(x = predicted, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
labs(
title = paste("Residuals vs Predicted:", model_name),
x = "Predicted",
y = "Residuals"
)
p1 + p2
}
plot_model_diagnostics(testData$y, lm_Pred, model_name = 'LM')
As we see from the residuals plot, the linear regression model tends to overpredict for higher values of y (around 15 and above). This happens because the model is trying to fit a straight line to data that is actually nonlinear.
In the true formula, X3 has a quadratic effect, and X1 and X2
interact through a sine function. These nonlinear components create
curvature in the relationship between predictors and y.
Linear regression cannot capture this curvature, so it tries to
approximate it with a straight line.
Because of that, the model overpredicts some values and underpredicts others, which shows up as a pattern in the residuals instead of random scatter.
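As a small sanity check (illustrative only, since it uses knowledge of the true simulation formula), adding the quadratic X3 term and the sine interaction by hand should noticeably improve the plain linear fit:
# Augmented linear model with the known nonlinear terms from the simulation
lm_aug <- lm(y ~ . + I((X3 - 0.5)^2) + sin(pi * X1 * X2),
data = cbind(trainingData$x, y = trainingData$y))
postResample(pred = predict(lm_aug, newdata = testData$x), obs = testData$y)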
KNN
set.seed(200)
knn_Model <- train(x = trainingData$x,
y = trainingData$y,
method = 'knn',
preProcess = c('center', 'scale'),
tuneLength = 10)
knn_Model
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.654912 0.4779838 2.958475
## 7 3.529432 0.5118581 2.861742
## 9 3.446330 0.5425096 2.780756
## 11 3.378049 0.5723793 2.719410
## 13 3.332339 0.5953773 2.692863
## 15 3.309235 0.6111389 2.663046
## 17 3.317408 0.6201421 2.678898
## 19 3.311667 0.6333800 2.682098
## 21 3.316340 0.6407537 2.688887
## 23 3.326040 0.6491480 2.705915
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
knn_pred <- predict(knn_Model, newdata = testData$x)
postResample(pred = knn_pred, obs = testData$y)
## RMSE Rsquared MAE
## 3.1750657 0.6785946 2.5443169
I centered and scaled predictors because kNN is distance-based, and variables with larger scales could otherwise dominate the distance calculation. The best value selected is k = 15. On the test set, kNN had RMSE about 3.18 and \(R^2\) about 0.68, which is worse than linear regression in this run.
In this dataset kNN does not perform well because the data has nonlinear interactions and noise variables, and the sample size is fairly small for 10 predictors. Because of that, kNN struggles to capture the true structure of the data.
plot_model_diagnostics(testData$y, knn_pred, model_name = 'KNN k = 15')
The residuals vs predicted plot shows an upward trend between residuals and predicted values: most predictions under 15 are overpredictions, and there is a tendency toward underprediction for values over 15.
ctrl <- trainControl(method = 'cv',
number = 5)
nnet_grid <- expand.grid(.decay = c(0, 0.015, 0.1),
.size = c(1:5),
.bag = FALSE)
set.seed(200)
nnet_Tune <- train(trainingData$x, trainingData$y,
method = 'avNNet',
tuneGrid = nnet_grid,
trControl = ctrl,
preProcess = c('center', 'scale'),
linout = TRUE,
trace = FALSE,
maxNWts = 5 * (ncol(trainingData$x) +1) + 5 + 1)
nnet_Tune
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 160, 160, 160, 160, 160
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.000 1 2.600371 0.7421312 2.026669
## 0.000 2 2.447822 0.7581238 1.939906
## 0.000 3 2.448799 0.7574100 1.923828
## 0.000 4 2.203343 0.8024991 1.695293
## 0.000 5 2.121211 0.8087015 1.684621
## 0.015 1 2.459430 0.7561741 1.899830
## 0.015 2 2.444665 0.7590263 1.932931
## 0.015 3 2.504256 0.7478216 1.933915
## 0.015 4 2.050420 0.8187224 1.593522
## 0.015 5 2.193792 0.7908606 1.741207
## 0.100 1 2.463826 0.7555734 1.906107
## 0.100 2 2.510902 0.7463898 1.970243
## 0.100 3 2.421044 0.7618647 1.968473
## 0.100 4 2.135048 0.8133837 1.669197
## 0.100 5 2.062175 0.8175254 1.594456
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.015 and bag
## = FALSE.
nnet_pred <- predict(nnet_Tune, newdata = testData$x)
postResample(pred = nnet_pred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0995136 0.8288469 1.5715180
I used a model averaged neural network avNNet instead of
a single neural network because neural networks can be unstable and
sensitive to random initialization. By averaging multiple networks, the
model becomes more stable and reduces variance, which usually leads to
better and more reliable performance.
I used a relatively small neural network and tuned the number of hidden neurons from 1 to 5. Since the training set is small, I kept the network small to reduce overfitting.
The best model used size = 4 and decay = 0.015. Here, size = 4 means the model uses 4 hidden neurons, which gives it enough flexibility to learn the nonlinear structure. The decay = 0.015 value adds some regularization, so the model does not chase noise too much, since five of the predictors are pure noise.
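For reference, a single-hidden-layer network with \(p\) inputs, \(H\) hidden units and one linear output has \(H(p + 1) + H + 1\) weights; with \(p = 10\) and \(H = 5\) that is \(5 \times 11 + 5 + 1 = 61\), which is exactly the value passed to maxNWts in the call above.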
The test set results are RMSE = 2.10 and \(R^2\) = 0.829. This is a clear improvement over kNN, and it shows that the neural network captures more of the nonlinear relationship in the data. But it does not directly tell us which variables are important, so it is harder to interpret than the other models.
plot_model_diagnostics(testData$y, nnet_pred, model_name = 'NNET')
The residual plot shows some large negative residuals, which means the model overpredicts certain outcomes, especially for higher values of y (> 15). Overall, most of the residuals are centered around zero and there is no strong overall trend (unlike what we saw with kNN). This shows that the neural network captures most of the structure but still struggles with some extreme values.
For the SVM model I tried two kernels: svmPoly and svmRadial.
SVM with polynomial kernel
set.seed(200)
svm_poly_Fit <- train(trainingData$x, trainingData$y,
method = 'svmPoly',
preProcess = c('center', 'scale'),
tuneLength = 5,
trControl = ctrl)
svm_poly_Fit
## Support Vector Machines with Polynomial Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 160, 160, 160, 160, 160
## Resampling results across tuning parameters:
##
## degree scale C RMSE Rsquared MAE
## 1 1e-03 0.25 4.788894 0.7287966 3.926068
## 1 1e-03 0.50 4.657569 0.7274512 3.814372
## 1 1e-03 1.00 4.410710 0.7286102 3.602965
## 1 1e-03 2.00 3.981065 0.7283411 3.248080
## 1 1e-03 4.00 3.414155 0.7251268 2.801435
## 1 1e-02 0.25 3.792596 0.7335241 3.094837
## 1 1e-02 0.50 3.197057 0.7304503 2.621917
## 1 1e-02 1.00 2.680618 0.7446363 2.173495
## 1 1e-02 2.00 2.544925 0.7475250 2.068575
## 1 1e-02 4.00 2.505794 0.7465956 2.018848
## 1 1e-01 0.25 2.545823 0.7450915 2.070284
## 1 1e-01 0.50 2.517244 0.7449964 2.024616
## 1 1e-01 1.00 2.530525 0.7426148 2.035408
## 1 1e-01 2.00 2.517979 0.7444989 2.021332
## 1 1e-01 4.00 2.524239 0.7437125 2.026954
## 1 1e+00 0.25 2.519442 0.7445359 2.022988
## 1 1e+00 0.50 2.522680 0.7444811 2.019380
## 1 1e+00 1.00 2.522673 0.7445740 2.019351
## 1 1e+00 2.00 2.526739 0.7442826 2.020256
## 1 1e+00 4.00 2.526394 0.7443473 2.020039
## 1 1e+01 0.25 2.525565 0.7444513 2.019642
## 1 1e+01 0.50 2.526656 0.7443152 2.020158
## 1 1e+01 1.00 2.526583 0.7443044 2.020085
## 1 1e+01 2.00 2.527549 0.7441156 2.020774
## 1 1e+01 4.00 2.527399 0.7441395 2.020707
## 2 1e-03 0.25 4.657564 0.7274752 3.814375
## 2 1e-03 0.50 4.410725 0.7286361 3.602980
## 2 1e-03 1.00 3.981095 0.7283303 3.248116
## 2 1e-03 2.00 3.414020 0.7252689 2.801287
## 2 1e-03 4.00 2.811319 0.7407523 2.281266
## 2 1e-02 0.25 3.188948 0.7322752 2.614651
## 2 1e-02 0.50 2.673669 0.7462253 2.165134
## 2 1e-02 1.00 2.499520 0.7576447 2.027619
## 2 1e-02 2.00 2.462868 0.7557251 1.975107
## 2 1e-02 4.00 2.415934 0.7631762 1.925124
## 2 1e-01 0.25 2.188293 0.8023017 1.695820
## 2 1e-01 0.50 2.106535 0.8178808 1.617347
## 2 1e-01 1.00 2.063797 0.8276610 1.574419
## 2 1e-01 2.00 2.055213 0.8368667 1.611858
## 2 1e-01 4.00 2.102256 0.8330575 1.665243
## 2 1e+00 0.25 2.069165 0.8429957 1.661879
## 2 1e+00 0.50 2.075454 0.8429175 1.668946
## 2 1e+00 1.00 2.074971 0.8429547 1.668533
## 2 1e+00 2.00 2.070968 0.8434377 1.664572
## 2 1e+00 4.00 2.075195 0.8432415 1.670618
## 2 1e+01 0.25 2.076522 0.8431996 1.671451
## 2 1e+01 0.50 2.145109 0.8344878 1.707600
## 2 1e+01 1.00 2.079004 0.8499692 1.677193
## 2 1e+01 2.00 2.851431 0.7204057 2.201769
## 2 1e+01 4.00 2.645239 0.7729003 2.108228
## 3 1e-03 0.25 4.529887 0.7267870 3.705756
## 3 1e-03 0.50 4.198457 0.7250303 3.425093
## 3 1e-03 1.00 3.655681 0.7296779 2.985993
## 3 1e-03 2.00 3.021550 0.7383438 2.472401
## 3 1e-03 4.00 2.630723 0.7467472 2.125796
## 3 1e-02 0.25 2.833060 0.7461427 2.304056
## 3 1e-02 0.50 2.538168 0.7569400 2.051218
## 3 1e-02 1.00 2.433022 0.7637982 1.957615
## 3 1e-02 2.00 2.370815 0.7714677 1.887298
## 3 1e-02 4.00 2.297829 0.7824667 1.803943
## 3 1e-01 0.25 2.103926 0.8192460 1.617207
## 3 1e-01 0.50 2.141827 0.8188975 1.630857
## 3 1e-01 1.00 2.173471 0.8180646 1.667498
## 3 1e-01 2.00 2.223833 0.8126568 1.716705
## 3 1e-01 4.00 2.248065 0.8100883 1.739353
## 3 1e+00 0.25 3.450231 0.6101087 2.603997
## 3 1e+00 0.50 3.450231 0.6101087 2.603997
## 3 1e+00 1.00 3.450231 0.6101087 2.603997
## 3 1e+00 2.00 3.450231 0.6101087 2.603997
## 3 1e+00 4.00 3.450231 0.6101087 2.603997
## 3 1e+01 0.25 4.914557 0.4191887 3.707646
## 3 1e+01 0.50 4.914557 0.4191887 3.707646
## 3 1e+01 1.00 4.914557 0.4191887 3.707646
## 3 1e+01 2.00 4.914557 0.4191887 3.707646
## 3 1e+01 4.00 4.914557 0.4191887 3.707646
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 2, scale = 0.1 and C = 2.
svm_poly_pred <- predict(svm_poly_Fit, newdata = testData$x)
postResample(pred = svm_poly_pred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0800597 0.8290049 1.5778621
The polynomial SVM performed well because it models nonlinear relationships through the polynomial kernel. The best tuning values were degree = 2, scale = 0.1, and C = 2. The chosen degree = 2 makes sense because the true data has nonlinear structure, including a quadratic term.
On the test set, the polynomial SVM had RMSE = 2.08 and \(R^2\) = 0.83, which is slightly better than
the neural network.
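For reference (based on kernlab's polydot kernel, which method = 'svmPoly' uses, with the offset left at its default of 1), the kernel being tuned has the form \[ K(x, x') = \left(\text{scale}\,\langle x, x'\rangle + 1\right)^{\text{degree}}, \] so the degree parameter directly controls how much curvature and interaction the model can represent.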
plot_model_diagnostics(testData$y, svm_poly_pred, model_name = '\nSVM Polynomial kernel')
The observed vs predicted plots show a strong relationship between
observed and predicted values.
But the residual plot shows a slight downward trend, which means the model tends to underpredict smaller values and overpredict larger values. There is also a visible funnel shape: the residuals spread further into negative territory as the predicted values increase. This indicates heteroscedasticity, meaning the model has larger errors for higher values of y. Even though the SVM with polynomial kernel captures nonlinear structure better than simpler models, it still does not fully capture the complexity of the data, especially for extreme values.
SVM with radial basis kernel
set.seed(200)
svm_radial_Fit <- train(trainingData$x, trainingData$y,
method = 'svmRadial',
preProcess = c('center', 'scale'),
tuneLength = 5,
trControl = ctrl)
svm_radial_Fit
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 160, 160, 160, 160, 160
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.612355 0.7740090 2.054715
## 0.50 2.361054 0.7851687 1.844923
## 1.00 2.208398 0.8028435 1.711724
## 2.00 2.100613 0.8186640 1.628059
## 4.00 2.061009 0.8245232 1.587537
##
## Tuning parameter 'sigma' was held constant at a value of 0.06299324
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06299324 and C = 4.
svm_radial_Pred <- predict(svm_radial_Fit, newdata = testData$x)
postResample(obs = testData$y, svm_radial_Pred)
## RMSE Rsquared MAE
## 2.0584633 0.8288348 1.5549667
The SVM with radial kernel performs slightly better than the polynomial SVM and the neural network. It is more flexible and can capture nonlinear patterns locally, which works well for this dataset.
The test results (RMSE = 2.06, \(R^2\) = 0.83) show a slight improvement over the polynomial SVM and neural network models.
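For reference, the radial basis kernel here is \[ K(x, x') = \exp\left(-\sigma \lVert x - x'\rVert^2\right), \] where \(\sigma\) is typically estimated from the data by kernlab's sigest() heuristic (held constant at about 0.063 in the output above) and only the cost C is tuned.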
plot_model_diagnostics(testData$y, svm_radial_Pred, model_name = '\nSVM Radial kernel')
The observed vs predicted plot looks tight, and residuals are mostly centered around zero, but there is still some increasing spread for higher predicted values.
mars_Grid <- expand.grid(.degree = 1:2,
.nprune = 2:15)
set.seed(200)
mars_Tune <- train(trainingData$x, trainingData$y,
method = 'earth',
tuneGrid = mars_Grid,
trControl = ctrl)
mars_Tune
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 160, 160, 160, 160, 160
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.394459 0.2239155 3.596815
## 1 3 3.651986 0.4753989 2.945152
## 1 4 2.618756 0.7223387 2.099531
## 1 5 2.387794 0.7644045 1.869810
## 1 6 2.287581 0.7847355 1.791471
## 1 7 1.740268 0.8663442 1.351337
## 1 8 1.711458 0.8728070 1.352405
## 1 9 1.658720 0.8782945 1.334443
## 1 10 1.666839 0.8769241 1.343719
## 1 11 1.639955 0.8807926 1.319442
## 1 12 1.681916 0.8764066 1.367712
## 1 13 1.654276 0.8799149 1.335111
## 1 14 1.655555 0.8798240 1.340932
## 1 15 1.655555 0.8798240 1.340932
## 2 2 4.394459 0.2239155 3.596815
## 2 3 3.651986 0.4753989 2.945152
## 2 4 2.618756 0.7223387 2.099531
## 2 5 2.387794 0.7644045 1.869810
## 2 6 2.312313 0.7759877 1.814310
## 2 7 1.755055 0.8604940 1.362423
## 2 8 1.626000 0.8828583 1.274149
## 2 9 1.469163 0.9108711 1.149545
## 2 10 1.383385 0.9214971 1.092583
## 2 11 1.290267 0.9302051 1.056804
## 2 12 1.368149 0.9223962 1.085336
## 2 13 1.341996 0.9242350 1.075120
## 2 14 1.338496 0.9234476 1.064074
## 2 15 1.368330 0.9199668 1.079207
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 11 and degree = 2.
mars_pred <- predict(mars_Tune, newdata = testData$x)
postResample(obs = testData$y, pred = mars_pred)
## RMSE Rsquared MAE
## 1.2803060 0.9335241 1.0168673
The test results show RMSE = 1.28 and \(R^2\) = 0.93, which is a big improvement
over all previous models.
As we see, the MARS model clearly performs much better than the previous models. It is well suited for this dataset because it can capture both nonlinear relationships and interactions using piecewise linear functions.
The final model used degree = 2 (meaning it allows multiplicative combinations of hinge functions) and nprune = 11, which means it permits interaction terms while keeping only the most important basis functions. This gives enough flexibility to model the true structure without overfitting.
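Concretely, each MARS basis function is a hinge \(h(u) = \max(0, u)\), so with degree = 2 the fitted model is a sum of terms of the form \(\beta\, h(x_j - c)\) plus products of two hinges such as \(\beta\, h(x_j - c_1)\, h(x_k - c_2)\); this is how MARS can approximate both the quadratic X3 effect and the X1-X2 interaction.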
plot_model_diagnostics(testData$y, mars_pred, model_name = '\nMARS degree = 2')
The observed vs predicted plot shows a very tight alignment along the
diagonal, which means predictions are very close to actual values.
The residuals vs predicted plot looks much more random and centered around zero compared to the other models. Almost all of the residuals fall within (-4, 4). There is no clear pattern or trend, and the spread of the residuals is smaller.
This suggests that the model captures the underlying structure very well.
summary(mars_Tune)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=2,
## nprune=11)
##
## coefficients
## (Intercept) 22.050690
## h(0.621722-X1) -15.001651
## h(X1-0.621722) 10.878737
## h(0.601063-X2) -18.830135
## h(0.447442-X3) 9.940077
## h(X3-0.606015) 12.999390
## h(0.734892-X4) -9.877554
## h(X4-0.734892) 10.414930
## h(0.850094-X5) -5.604897
## h(X1-0.621722) * h(X2-0.295997) -43.245766
## h(0.649253-X1) * h(0.601063-X2) 26.218297
##
## Selected 11 of 18 terms, and 5 of 10 predictors (nprune=11)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 8 2
## GCV 1.747495 RSS 264.5358 GRSq 0.929051 RSq 0.9457576
The MARS summary shows that the model selected 5 out of 10
predictors, which are X1, X2, X3, X4, and X5.
This matches the true data-generating formula, since they are the only
informative predictors. The noise variables (X6–X10) were not
used.
The model also includes multiplicative interaction terms between X1 and
X2, which captures the sine interaction in the true formula.
The MARS model not only delivers the best test performance but also correctly identifies the important variables and their relationships.
test_preds <- list(
"Linear Regression" = lm_Pred,
"kNN" = knn_pred,
"Averaged Neural Network" = nnet_pred,
"SVM Polynomial" = svm_poly_pred,
"SVM Radial" = svm_radial_Pred,
"MARS" = mars_pred
)
# Compute RMSE, Rsquared and MAE for every model on the same test set
model_results <- data.frame(
Model = names(test_preds),
t(sapply(test_preds, postResample, obs = testData$y)),
row.names = NULL
)
model_results <- model_results |>
arrange(RMSE)
model_results
## Model RMSE Rsquared MAE
## 1 MARS 1.280306 0.9335241 1.016867
## 2 SVM Radial 2.058463 0.8288348 1.554967
## 3 SVM Polynomial 2.080060 0.8290049 1.577862
## 4 Averaged Neural Network 2.099514 0.8288469 1.571518
## 5 Linear Regression 2.697068 0.7084666 2.060054
## 6 kNN 3.175066 0.6785946 2.544317
As we see, the model comparison table shows that MARS clearly had the best test performance, with the lowest RMSE and highest \(R^2\). This makes sense because the true formula contains nonlinear and interaction effects, and MARS is better at capturing this kind of structure. The SVM and neural network models also performed well, but they did not get close to the MARS results.
For this exercise I used linear regression as a baseline model, and kNN
performed worse than I expected, probably because the data has noise
variables and complex interactions. Overall, MARS was the best model for
this dataset.
library(AppliedPredictiveModeling)
library(caret)
library(glmnet)
library(pls)
library(dplyr)
library(ggplot2)
library(lattice)
library(RANN)
library(purrr)
library(e1071)
library(patchwork)
library(corrplot)
library(writexl)
options(scipen = 999)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 5,
allowParallel = TRUE
)
# Extract the cross-validated performance metrics for a model's best tuning parameters
get_cv_results <- function(model, model_name) {
best <- model$bestTune
res <- model$results
for (param in names(best)) {
res <- res[res[[param]] == best[[param]], ]
}
data.frame(
Model = model_name,
CV_RMSE = res$RMSE,
CV_R2 = res$Rsquared,
CV_MAE = res$MAE
)
}
# Compute test-set performance metrics for a set of predictions
get_test_results <- function(obs, pred, model_name) {
res <- postResample(pred = pred, obs = obs)
data.frame(
Model = model_name,
Test_RMSE = res["RMSE"],
Test_R2 = res["Rsquared"],
Test_MAE = res["MAE"]
)
}
data("ChemicalManufacturingProcess")
dim(ChemicalManufacturingProcess)
## [1] 176 58
yield <- ChemicalManufacturingProcess[, 1]
processPredictors <- ChemicalManufacturingProcess[, -1]
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8)
sum(is.na(processPredictors))
## [1] 106
The first step was checking the predictor matrix for missing values. There were 106 missing cells across all measurements, which is a relatively small amount compared with the full data set (57 predictors and 176 observations). To impute the NAs I chose bagged-tree imputation because it can better preserve nonlinear relationships between variables.
set.seed(200)
imputation <- preProcess(
processPredictors,
method = 'bagImpute'
)
processPredictors_imputed <- predict(imputation, processPredictors)
sum(is.na(processPredictors_imputed))
## [1] 0
nzv_vars <- nearZeroVar(processPredictors_imputed)
nzv_vars
## [1] 7
processPredictors_imputed <- processPredictors_imputed[, -nzv_vars]
Near-zero variance filtering removed one predictor that contained almost no useful variation.
I split the data into training and test sets using a 75/25 split.
set.seed(200)
train_index <- createDataPartition(yield, p= 0.75, list = FALSE)
train_procPred <- processPredictors_imputed[train_index,]
test_procPred <- processPredictors_imputed[-train_index,]
train_yield <- yield[train_index]
test_yield <- yield[-train_index]
skewness(train_yield)
## [1] 0.2865166
hist(train_yield, breaks = 20)
Before fitting the model, I checked the distribution of the response variable. The yield variable only showed mild right skew, with skewness equal to 0.287, and the histogram looked close to symmetric. Because of this, I kept the response on its original scale to make the final predictions easier to interpret.
cor_matrix <- cor(train_procPred, use = "pairwise.complete.obs")
corrplot(
cor_matrix,
order = 'hclust',
tl.cex = 0.2
)
The correlation matrix shows several strong clusters among variables.
high_corr <- findCorrelation(cor_matrix, cutoff = 0.9)
train_procPred_filtered <- train_procPred[, -high_corr]
test_procPred_filtered <- test_procPred[, -high_corr]
Using a 0.9 cutoff, 10 highly correlated predictors were identified as redundant and removed from both the training and test sets.
In a previous assignment I made the mistake of centering and scaling the data before training and testing the models. Here, preprocessing is instead specified inside train(), so it is applied during model training and estimated within resampling.
set.seed(200)
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
.fraction = seq(.01, 1, length = 20))
enet_tune <- train(train_procPred_filtered, train_yield,
method = 'enet',
tuneGrid = enetGrid,
trControl = ctrl,
preProcess = c('center', 'scale', 'YeoJohnson')
)
enet_tune$results[row.names(enet_tune$bestTune), ]
## lambda fraction RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 45 0.1 0.2184211 1.253232 0.5988309 1.03886 0.2126629 0.1616821 0.1736726
enet_pred <- predict(enet_tune, newdata = test_procPred_filtered)
postResample(enet_pred, test_yield)
## RMSE Rsquared MAE
## 1.2260515 0.6362144 0.9828899
Elastic Net is carried over from Exercise 6.3 and is used here as a baseline model. The model selected a combination of L1 and L2 penalties and achieved a test RMSE = 1.226 and \(R^2\) = 0.636.
plot_model_diagnostics(test_yield, enet_pred, model_name = "ENET")
In the predicted vs observed plot, most points follow the diagonal line, which indicates the model captures the general trend. But there is noticeable spread in both plots, and as the observed values increase the ENET model tends to underestimate them.
The residuals vs predicted plot shows no strong pattern, but the spread is relatively wide.
set.seed(200)
knn_model <- train(
x = train_procPred_filtered,
y = train_yield,
method = 'knn',
tuneLength = 10,
trControl = ctrl,
preProcess = c('center', 'scale', 'YeoJohnson')
)
knn_model
## k-Nearest Neighbors
##
## 132 samples
## 46 predictor
##
## Pre-processing: centered (46), scaled (46), Yeo-Johnson transformation (22)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.301887 0.5130484 1.050365
## 7 1.354823 0.4795851 1.095830
## 9 1.365607 0.4702248 1.103777
## 11 1.372830 0.4732010 1.118912
## 13 1.383267 0.4698308 1.134357
## 15 1.390315 0.4708005 1.142312
## 17 1.397245 0.4712144 1.149876
## 19 1.411680 0.4688991 1.159028
## 21 1.426508 0.4646536 1.163456
## 23 1.436087 0.4639830 1.170425
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
knn_pred <- predict(knn_model, newdata = test_procPred_filtered)
postResample(obs = test_yield, pred = knn_pred)
## RMSE Rsquared MAE
## 1.4281925 0.4310553 1.1614091
kNN performed worse than the ENET model, with a test RMSE = 1.428. An \(R^2\) of 0.43 tells us the kNN model struggles to explain even half of the variance in yield.
Even though it is a nonlinear method, it relies heavily on local distances, which become less informative in high-dimensional data. Because this data set has many predictors, the distance-based approach suffers from the curse of dimensionality, which explains the poor test performance.
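As a small illustrative sketch (not part of the original analysis), the ratio between the distance to the nearest and to the farthest neighbour drifts toward 1 as the number of predictors grows, which is one way to see why distance-based methods degrade here (46 is the number of predictors after filtering):
# Distance concentration: nearest/farthest distance ratio for random uniform data
set.seed(200)
dist_ratio <- function(p, n = 100) {
d <- as.matrix(dist(matrix(runif(n * p), ncol = p)))
diag(d) <- NA
mean(apply(d, 1, min, na.rm = TRUE) / apply(d, 1, max, na.rm = TRUE))
}
sapply(c(2, 10, 46), dist_ratio)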
plot_model_diagnostics(test_yield, knn_pred, model_name = 'KNN')
The predicted vs observed plot shows wider spread than the ENET model, meaning the predictions are less accurate; the points are not tightly aligned with the diagonal line.
The residual plot also shows large variability, which confirms that the model is unstable and does not capture the underlying structure well.
mars_grid <- expand.grid(
.degree = 1:2,
.nprune = 2:30
)
set.seed(200)
mars_model <- train(
x = train_procPred,
y = train_yield,
method = 'earth',
tuneGrid = mars_grid,
trControl = ctrl,
preProcess = c('center', 'scale', 'YeoJohnson')
)
mars_model
## Multivariate Adaptive Regression Spline
##
## 132 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56), Yeo-Johnson transformation (25)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 1.375777 0.4596863 1.0978774
## 1 3 1.169033 0.6062765 0.9598891
## 1 4 1.151194 0.6245793 0.9504233
## 1 5 1.169103 0.6172718 0.9578723
## 1 6 1.182881 0.6063307 0.9657971
## 1 7 1.205244 0.5903349 0.9773674
## 1 8 1.224660 0.5882430 0.9928473
## 1 9 1.243509 0.5828915 1.0068009
## 1 10 1.249794 0.5788619 1.0098672
## 1 11 1.266644 0.5729139 1.0247848
## 1 12 1.272552 0.5753734 1.0309235
## 1 13 1.277499 0.5745308 1.0278588
## 1 14 1.295876 0.5685540 1.0412033
## 1 15 1.293342 0.5679382 1.0389234
## 1 16 1.293244 0.5681805 1.0405552
## 1 17 1.359317 0.5570959 1.0621675
## 1 18 1.356408 0.5587051 1.0597379
## 1 19 1.353462 0.5604875 1.0566285
## 1 20 1.353462 0.5604875 1.0566285
## 1 21 1.359667 0.5592852 1.0624910
## 1 22 1.359667 0.5592852 1.0624910
## 1 23 1.359667 0.5592852 1.0624910
## 1 24 1.359667 0.5592852 1.0624910
## 1 25 1.359667 0.5592852 1.0624910
## 1 26 1.359667 0.5592852 1.0624910
## 1 27 1.359667 0.5592852 1.0624910
## 1 28 1.359667 0.5592852 1.0624910
## 1 29 1.359667 0.5592852 1.0624910
## 1 30 1.359667 0.5592852 1.0624910
## 2 2 1.375777 0.4596863 1.0978774
## 2 3 1.234344 0.5590277 1.0018585
## 2 4 1.187404 0.5928335 0.9684019
## 2 5 1.179816 0.6053931 0.9566722
## 2 6 1.173829 0.6109156 0.9539316
## 2 7 1.228271 0.5898192 1.0018101
## 2 8 1.221640 0.5873668 0.9908607
## 2 9 1.250387 0.5891500 1.0019043
## 2 10 1.274317 0.5865343 1.0187710
## 2 11 1.278324 0.5814646 1.0253128
## 2 12 1.291718 0.5808346 1.0415082
## 2 13 1.319707 0.5742255 1.0552566
## 2 14 1.406358 0.5582495 1.1104033
## 2 15 1.416597 0.5475534 1.1183876
## 2 16 1.436650 0.5426552 1.1340803
## 2 17 1.463073 0.5359593 1.1465020
## 2 18 1.485921 0.5299050 1.1606301
## 2 19 1.494435 0.5287815 1.1695633
## 2 20 1.494668 0.5285554 1.1685367
## 2 21 1.517682 0.5244953 1.1829374
## 2 22 1.517187 0.5232374 1.1838747
## 2 23 1.526537 0.5206569 1.1911888
## 2 24 1.526241 0.5209456 1.1898427
## 2 25 1.526241 0.5209456 1.1898427
## 2 26 1.526241 0.5209456 1.1898427
## 2 27 1.525170 0.5218507 1.1902312
## 2 28 1.525170 0.5218507 1.1902312
## 2 29 1.525170 0.5218507 1.1902312
## 2 30 1.525170 0.5218507 1.1902312
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 4 and degree = 1.
mars_pred <- predict(mars_model, newdata = test_procPred)
postResample(obs = test_yield, pred = mars_pred)
## RMSE Rsquared MAE
## 1.1285523 0.6370613 0.8805678
MARS produced the best performance among all models, with the lowest
test RMSE = 1.129 and \(R^2\) = 0.637.
The optimal model used degree = 1 and nprune = 4. The degree = 1 setting tells us the chosen MARS model is additive, without multiplicative interaction terms.
This is interesting because it shows that the relationships are nonlinear, but at the same time there is no need for complex interactions. Instead, hinged linear terms are enough to capture the structure of the data.
plot_model_diagnostics(obs = test_yield, pred = mars_pred, model_name = 'MARS')
The predicted vs observed plot shows stronger alignment along the diagonal than for the previous models, indicating better prediction accuracy, and the points are more tightly clustered.
The residual plot shows points are evenly distributed around zero. This suggests that the model is capturing most of the signal in the data and leaving less unexplained variation.
set.seed(200)
svm_radial_model <- train(train_procPred, train_yield,
method = 'svmRadial',
tuneLength = 14,
trControl = ctrl,
preProcess = c('center', 'scale', 'YeoJohnson')
)
svm_radial_model
## Support Vector Machines with Radial Basis Function Kernel
##
## 132 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56), Yeo-Johnson transformation (25)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.417162 0.4831737 1.1560115
## 0.50 1.306631 0.5320544 1.0676134
## 1.00 1.232920 0.5704469 1.0016118
## 2.00 1.177839 0.6004753 0.9532968
## 4.00 1.166820 0.6041736 0.9456848
## 8.00 1.159502 0.6054002 0.9386184
## 16.00 1.157852 0.6060202 0.9372875
## 32.00 1.157852 0.6060202 0.9372875
## 64.00 1.157852 0.6060202 0.9372875
## 128.00 1.157852 0.6060202 0.9372875
## 256.00 1.157852 0.6060202 0.9372875
## 512.00 1.157852 0.6060202 0.9372875
## 1024.00 1.157852 0.6060202 0.9372875
## 2048.00 1.157852 0.6060202 0.9372875
##
## Tuning parameter 'sigma' was held constant at a value of 0.01343501
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01343501 and C = 16.
svm_radial_pred <- predict(svm_radial_model, newdata = test_procPred)
postResample(obs = test_yield, pred = svm_radial_pred)
## RMSE Rsquared MAE
## 1.1678754 0.6219256 0.9192249
The SVM with radial kernel also performed well and is the second-best model, with a test RMSE = 1.168, close to MARS (1.13).
plot_model_diagnostics(test_yield, svm_radial_pred, model_name = 'SVM')
The predicted vs observed plot shows good alignment with the
diagonal, but slightly more spread than MARS.
The residual plot does not show a strong pattern, but the variability is slightly higher than for MARS, and some values are slightly underpredicted.
nnet_grid <- expand.grid(
.decay = c(0, 0.01, 0.1),
.size = 1:15,
.bag = FALSE
)
set.seed(200)
nnet_model <- train(
x = train_procPred_filtered,
y = train_yield,
method = "avNNet",
tuneGrid = nnet_grid,
trControl = ctrl,
preProc = c("center", "scale", 'YeoJohnson'),
linout = TRUE,
trace = FALSE,
maxNWts = 3000
)
nnet_model
## Model Averaged Neural Network
##
## 132 samples
## 46 predictor
##
## Pre-processing: centered (46), scaled (46), Yeo-Johnson transformation (22)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 1.703248 0.3778959 1.304958
## 0.00 2 1.508959 0.3939445 1.240394
## 0.00 3 1.612317 0.4007101 1.244382
## 0.00 4 2.030056 0.2849494 1.600580
## 0.00 5 1.921538 0.3533147 1.471030
## 0.00 6 1.859060 0.3623278 1.472607
## 0.00 7 2.105965 0.3196056 1.687440
## 0.00 8 2.579368 0.2869939 2.005186
## 0.00 9 3.250863 0.2546037 2.478087
## 0.00 10 3.952873 0.1998514 2.815648
## 0.00 11 4.905718 0.1782711 3.466201
## 0.00 12 6.371266 0.1711152 4.012285
## 0.00 13 7.737730 0.1580155 4.954700
## 0.00 14 8.626960 0.1538539 5.302241
## 0.00 15 9.107009 0.1531409 5.378354
## 0.01 1 1.361628 0.4733408 1.113470
## 0.01 2 1.291615 0.5357111 1.041846
## 0.01 3 1.519702 0.4539587 1.219488
## 0.01 4 1.877391 0.3490668 1.462645
## 0.01 5 1.994101 0.3109844 1.576038
## 0.01 6 1.924628 0.3405632 1.541169
## 0.01 7 2.042256 0.3303952 1.643767
## 0.01 8 2.562601 0.3018398 1.946254
## 0.01 9 3.195850 0.2980176 2.335708
## 0.01 10 4.241455 0.2408660 2.991853
## 0.01 11 5.021692 0.2092511 3.437460
## 0.01 12 6.706996 0.1730011 4.283838
## 0.01 13 7.615320 0.1429849 4.697240
## 0.01 14 8.549283 0.1501687 5.248403
## 0.01 15 8.801523 0.1696280 5.036857
## 0.10 1 1.364692 0.4850919 1.113154
## 0.10 2 1.336829 0.5090254 1.071598
## 0.10 3 1.509504 0.4602740 1.201903
## 0.10 4 1.852849 0.3515562 1.420515
## 0.10 5 1.807087 0.3963275 1.425934
## 0.10 6 1.780482 0.3876776 1.426311
## 0.10 7 1.978927 0.3766534 1.565521
## 0.10 8 2.280847 0.3420362 1.776463
## 0.10 9 2.927127 0.2543762 2.170576
## 0.10 10 3.504037 0.2309713 2.527744
## 0.10 11 5.065381 0.1944928 3.319595
## 0.10 12 5.548514 0.1908132 3.665042
## 0.10 13 6.863536 0.1494704 4.218530
## 0.10 14 7.934455 0.1767128 4.742481
## 0.10 15 7.326321 0.1547291 4.484152
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 2, decay = 0.01 and bag = FALSE.
nnet_pred <- predict(nnet_model, newdata = test_procPred_filtered)
postResample(obs = test_yield, pred = nnet_pred)
## RMSE Rsquared MAE
## 1.2328764 0.5725431 0.9725628
The neural network model showed moderate performance with test RMSE =
1.233 and lower \(R^2\) = 0.57 compared
to MARS and SVM models. This could be due to the relatively small
dataset size (only 176 observations), which is not ideal for training
neural networks effectively.
Also, training neural networks takes much longer and uses more computational resources, so I do not think it would be the best model choice here even if its RMSE and \(R^2\) were close to the MARS and SVM results.
plot_model_diagnostics(test_yield, nnet_pred, model_name = 'NNET')
The predicted vs observed plot shows reasonable alignment but
slightly wider spread compared to MARS and SVM.
The residual plot's spread is mostly even around zero, but it shows that higher observed values were underpredicted by the NNET model.
stopCluster(cl)
registerDoSEQ()
cv_results <- dplyr::bind_rows(
get_cv_results(enet_tune, "ENET"),
get_cv_results(knn_model, "KNN"),
get_cv_results(mars_model, "MARS"),
get_cv_results(svm_radial_model, "SVM"),
get_cv_results(nnet_model, "NNET")
)
test_results <- dplyr::bind_rows(
get_test_results(test_yield, enet_pred, "ENET"),
get_test_results(test_yield, knn_pred, "KNN"),
get_test_results(test_yield, mars_pred, "MARS"),
get_test_results(test_yield, svm_radial_pred, "SVM"),
get_test_results(test_yield, nnet_pred, "NNET")
)
Do any of the nonlinear models outperform the optimal linear model you previously developed in Exercise 6.2? If so, what might this tell you about the underlying relationship between the predictors and the response?
final_results <- dplyr::left_join(cv_results, test_results, by = "Model")
final_results <- final_results |>
arrange(Test_RMSE)
final_results
## Model CV_RMSE CV_R2 CV_MAE Test_RMSE Test_R2 Test_MAE
## 1 MARS 1.151194 0.6245793 0.9504233 1.128552 0.6370613 0.8805678
## 2 SVM 1.157852 0.6060202 0.9372875 1.167875 0.6219256 0.9192249
## 3 ENET 1.253232 0.5988309 1.0388603 1.226051 0.6362144 0.9828899
## 4 NNET 1.291615 0.5357111 1.0418459 1.232876 0.5725431 0.9725628
## 5 KNN 1.301887 0.5130484 1.0503651 1.428192 0.4310553 1.1614091
The comparison of nonlinear models shows that:
MARS provides the best overall performance. It achieved the lowest test RMSE (1.129) and \(R^2\) = 0.637, indicating the most accurate predictions on test data.
SVM is the second-best model, with performance very close to MARS, which suggests that both methods are effective at capturing nonlinear relationships in the data.
ENET and NNET show moderate performance. Although ENET has a relatively high test \(R^2\) = 0.636 (very close to MARS), its test RMSE is worse than MARS and SVM, which indicates that a linear structure is not sufficient to fully capture the underlying patterns. The neural network also does not outperform simpler nonlinear methods, likely due to the small dataset size and limited ability to generalize.
KNN performed worst compared to the rest of the models, with the highest test RMSE and lowest R². This suggests that distance-based methods struggle in this high-dimensional setting and do not predict well enough.
Overall, MARS is the best model: it provides the most accurate predictions and is also easier to interpret than the SVM and NNET models.
summary(mars_model)
## Call: earth(x=data.frame[132,56], y=c(38,42.44,42.0...), keepxy=TRUE, degree=1,
## nprune=4)
##
## coefficients
## (Intercept) 39.718231
## h(1.05347-ManufacturingProcess09) -0.757467
## h(-1.1859-ManufacturingProcess13) 2.237452
## h(ManufacturingProcess32- -0.815622) 1.257127
##
## Selected 4 of 22 terms, and 3 of 56 predictors (nprune=4)
## Termination condition: RSq changed by less than 0.001 at 22 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 3 (additive model)
## GCV 1.282191 RSS 151.7746 GRSq 0.6246959 RSq 0.6582876
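Written out (with the predictors on their centered, scaled, Yeo-Johnson-transformed scale, since preprocessing was applied inside train()), the fitted model from the summary above is approximately \[ \widehat{\text{Yield}} = 39.72 - 0.76\, h(1.05 - \text{MP09}) + 2.24\, h(-1.19 - \text{MP13}) + 1.26\, h(\text{MP32} + 0.82), \] where \(h(u) = \max(0, u)\) and MP09, MP13, MP32 abbreviate the corresponding ManufacturingProcess predictors.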
Do either the biological or process variables dominate the list?
mars_imp <- varImp(mars_model)
mars_top10 <- mars_imp$importance |>
tibble::rownames_to_column("Predictor") |>
arrange(desc(Overall)) |>
slice(1:10)
mars_top10
## Predictor Overall
## 1 ManufacturingProcess32 100.00000
## 2 ManufacturingProcess09 46.53772
## 3 ManufacturingProcess13 0.00000
The MARS model selected only 3 of the 56 available predictors. This is because MARS performs built-in variable selection: a forward pass adds candidate terms, and the subsequent backward pruning removes terms that do not significantly improve model performance.
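As an optional cross-check (a sketch assuming the earth package's evimp() helper, which ranks predictors by how often they survive pruning and by their GCV/RSS contribution), the same conclusion can be read directly off the final model:
# Variable importance from earth's own pruning-based criteria
evimp(mars_model$finalModel)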
The final model kept only ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess13, since these variables contributed most to reducing the prediction error.
This result suggests that yield is mostly driven by this small subset of manufacturing process variables, while the other predictors are either redundant or carry little predictive information.
As we see, these three process predictors clearly dominate the MARS model.
enet_imp <- varImp(enet_tune)
enet_top10 <- enet_imp$importance |>
tibble::rownames_to_column("Predictor") |>
arrange(desc(Overall)) |>
slice(1:10)
enet_top10
## Predictor Overall
## 1 ManufacturingProcess32 100.00000
## 2 ManufacturingProcess17 80.31063
## 3 BiologicalMaterial06 75.08969
## 4 ManufacturingProcess13 74.46564
## 5 ManufacturingProcess36 68.75818
## 6 BiologicalMaterial03 67.10754
## 7 ManufacturingProcess06 66.37746
## 8 ManufacturingProcess09 65.28374
## 9 ManufacturingProcess33 46.90589
## 10 BiologicalMaterial08 44.97856
intersect(mars_top10$Predictor, enet_top10$Predictor)
## [1] "ManufacturingProcess32" "ManufacturingProcess09" "ManufacturingProcess13"
The comparison between the top predictors from the MARS model and
optimal linear model (ENET) shows overlap and differences at the same
time. The MARS model selected only three important predictors:
ManufacturingProcess32,
ManufacturingProcess09, and
ManufacturingProcess13. All three of these variables also
appear in the top ten predictors of the ENET model.
From these findings I can suggest that these three manufacturing process variables are consistently the most important predictors of yield, regardless of whether a linear or nonlinear modeling approach is used.
However, the ENET model includes a broader set of predictors in its
top ten list, including several biological variables such as
BiologicalMaterial06, BiologicalMaterial03, and
BiologicalMaterial08. This indicates that the linear model
spreads importance across more variables, while the MARS model focuses
only on the strongest signals.
Overall, the nonlinear MARS model is more selective and relies on a smaller subset of predictors, while the linear ENET model captures additional weaker relationships across both manufacturing and biological variables.
Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
unique_mars_vars <- intersect(mars_top10$Predictor, enet_top10$Predictor)
plot_df <- train_procPred |>
select(all_of(unique_mars_vars)) |>
mutate(Yield = train_yield)
plot_df_long <- plot_df |>
pivot_longer(
cols = all_of(unique_mars_vars),
names_to = 'Predictor',
values_to = 'Value'
)
ggplot(plot_df_long, aes(x = Value, y = Yield)) +
geom_point(alpha = 0.7) +
geom_smooth(method = 'lm', se = FALSE) +
facet_wrap(~ Predictor, scales = 'free_x') +
labs(
title = 'Relationships between MARS predictors and yield',
x = 'Predictor value',
y = 'Yield'
)
The plots show how the most important predictors are related to yield:
ManufacturingProcess32 has a clear positive trend: as it increases, yield increases fairly consistently.
ManufacturingProcess09 also has a positive relationship, but with more variability, so the effect is a little weaker.
ManufacturingProcess13 shows a negative trend, meaning higher values are associated with lower yield.
Overall, these patterns help explain why MARS picked these variables. They have clear, strong relationships with yield, and MARS can capture these relationships better than a simple linear model.