Do problems 7.2 and 7.5 in Kuhn and Johnson. There are only two problems, but they have many parts. Please submit both a link to your RPubs and the .rmd file.
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)$$
where the x values are random variables uniformly distributed on [0, 1] (five additional non-informative variables are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(caret)    # featurePlot() used below comes from caret
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
library(caret)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
Using RMSE as the selection criterion, the resampling results above choose k = 17 as the final value for the KNN model.
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
In order to answer these questions, I will train several models including MARS, Random Forest, and Linear Regression on the training data and evaluate their performance on the test data.
set.seed(200)
train_df <- data.frame(trainingData$x, y = trainingData$y)
test_df <- data.frame(testData$x, y = testData$y)
lmModel <- train(y ~ ., data = train_df, method = "lm")
print(lmModel)
## Linear Regression
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.566445 0.744027 2.031539
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
set.seed(200)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3) # Use robust control
marsModel <- train(y ~ ., data = train_df,
method = "earth",
preProc = c("center", "scale"),
tuneGrid = expand.grid(degree = 1:2, nprune = seq(2, 30, by = 2)),
trControl = control)
print(marsModel)
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.247447 0.2756934 3.5260787
## 1 4 2.580467 0.7344722 2.0685389
## 1 6 2.252083 0.7931510 1.8036255
## 1 8 1.662867 0.8844888 1.3113260
## 1 10 1.660520 0.8848714 1.3141704
## 1 12 1.591109 0.8948359 1.2543975
## 1 14 1.598874 0.8932603 1.2613831
## 1 16 1.598970 0.8932958 1.2615549
## 1 18 1.598970 0.8932958 1.2615549
## 1 20 1.598970 0.8932958 1.2615549
## 1 22 1.598970 0.8932958 1.2615549
## 1 24 1.598970 0.8932958 1.2615549
## 1 26 1.598970 0.8932958 1.2615549
## 1 28 1.598970 0.8932958 1.2615549
## 1 30 1.598970 0.8932958 1.2615549
## 2 2 4.287418 0.2602224 3.5794159
## 2 4 2.619861 0.7205608 2.1004505
## 2 6 2.250216 0.7895421 1.7766735
## 2 8 1.695971 0.8790873 1.3113416
## 2 10 1.461383 0.9084711 1.1652269
## 2 12 1.317812 0.9263098 1.0382087
## 2 14 1.257471 0.9346033 0.9992458
## 2 16 1.243059 0.9357832 0.9863852
## 2 18 1.245413 0.9355890 0.9877777
## 2 20 1.245413 0.9355890 0.9877777
## 2 22 1.245413 0.9355890 0.9877777
## 2 24 1.245413 0.9355890 0.9877777
## 2 26 1.245413 0.9355890 0.9877777
## 2 28 1.245413 0.9355890 0.9877777
## 2 30 1.245413 0.9355890 0.9877777
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 16 and degree = 2.
set.seed(200)
rfModel <- train(y ~ ., data = train_df,
method = "rf",
preProc = c("center", "scale"),
tuneLength = 5,
trControl = control)
print(rfModel)
## Random Forest
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 2.802378 0.8221917 2.331314
## 4 2.482422 0.8173783 2.063980
## 6 2.413718 0.8031661 2.017034
## 8 2.394165 0.7962892 1.991593
## 10 2.381139 0.7937642 1.975090
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 10.
After training the models, I will evaluate their performance on the test data. First, I will create a list of the trained models and then loop through it to calculate the RMSE for each model on the test set. Storing the models in a list keeps the code organized, makes it easy to add more models later, and applies the same evaluation metric (RMSE) to every model.
library(dplyr)   # bind_rows() is used in the loop below
models <- list(
LM = lmModel,
KNN = knnModel,
MARS = marsModel,
RF = rfModel
)
test_results <- data.frame(Model = character(), Test_RMSE = numeric())
for (name in names(models)) {
pred <- predict(models[[name]], newdata = test_df)
rmse <- sqrt(mean((pred - test_df$y)^2))
test_results <- test_results |>
bind_rows(data.frame(Model = name, Test_RMSE = rmse))
}
final_model_compare <- test_results |> arrange(Test_RMSE)
print(final_model_compare)
## Model Test_RMSE
## 1 MARS 1.279387
## 2 RF 2.387389
## 3 LM 2.697068
## 4 KNN 3.204059
The model with the lowest RMSE on the test set is considered the best performing model.
MARS (1.28) appears to give the best performance, with the lowest RMSE on the test set. By using hinge basis functions and automatically selecting where to place the knots, MARS was able to closely approximate the complex \(\sin(\pi x_1x_2)\) and \((x_3 - 0.5)^2\) terms.
RF (2.39) also performed well, leveraging ensemble learning to capture nonlinear relationships. It handles complex interactions and noise effectively, but it was outperformed by the explicit function fitting of MARS.
LM (2.70) and KNN (3.20) lagged behind. The linear model (which assumes additive, straight-line effects) and KNN (which relies on distances in a space diluted by five noise predictors) could not capture the true structure. Their higher RMSE values show that when the relationship is strongly nonlinear, simple linear or purely distance-based tools fall short.
In conclusion, after tuning several regression models on the Friedman (1991) simulated data, the MARS model achieved the lowest test RMSE (1.28), outperforming Random Forest (2.39), Linear Regression (2.70), and KNN (3.20). This result confirms that MARS provides the best fit for capturing these nonlinear relationships with interactions. In addition, inspecting the MARS variable importance (see the check below) shows that the model focuses on predictors X1–X5, the true informative features, while ignoring X6–X10, which are purely noise. This highlights MARS's strength in both prediction accuracy and variable selection in nonlinear settings.
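As a check on that claim, the importance scores and the retained terms of the tuned MARS model can be inspected directly. This is a minimal sketch (output not shown); with these simulated data the retained hinge terms are expected to involve only X1 through X5.
varImp(marsModel)              # scaled importance of the predictors MARS retained
summary(marsModel$finalModel)  # hinge terms in the final earth model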
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
In this part, I will load the ChemicalManufacturingProcess data and handle missing values by imputing each numeric column with its median.
library(caret)
library(dplyr)
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
process_data <- as.data.frame(ChemicalManufacturingProcess)
process_data <- process_data |>
mutate(across(where(is.numeric),
~ tidyr::replace_na(.x, median(.x, na.rm = TRUE))))
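As a quick sanity check on the imputation, the data frame should contain no remaining missing values (a minimal check; it should return 0):
sum(is.na(process_data))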
Next, I will split the data into training (80%) and test (20%) sets.
set.seed(42)
inTrain <- createDataPartition(process_data$Yield, p = 0.8, list = FALSE)
train_df <- process_data[inTrain, ]
test_df <- process_data[-inTrain, ]
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
Now, I will train several nonlinear regression models, including KNN, SVM, MARS, and Random Forest, on the training data and evaluate their performance on the test data.
set.seed(42)
knnModel <- train(Yield ~ ., data = train_df,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
set.seed(42)
svmModel <- train(Yield ~ ., data = train_df,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 5,
trControl = control)
set.seed(42)
marsModel <- train(Yield ~ ., data = train_df,
method = "earth",
preProc = c("center", "scale"),
tuneGrid = expand.grid(degree = 1:2, nprune = seq(2, 30, by = 2)),
trControl = control)
set.seed(42)
rfModel <- train(Yield ~ ., data = train_df,
method = "rf",
preProc = c("center", "scale"),
tuneLength = 5,
trControl = control)
After training the models, I will evaluate their performance on the test data. I will create a list of the trained models and then calculate the RMSE (Root Mean Squared Error) for each model on the test set.
models <- list(
KNN = knnModel,
SVM = svmModel,
MARS = marsModel,
RF = rfModel
)
test_results <- data.frame(Model = character(), Test_RMSE = numeric())
for (name in names(models)) {
pred <- predict(models[[name]], newdata = test_df)
rmse <- sqrt(mean((pred - test_df$Yield)^2))
test_results <- test_results |>
bind_rows(data.frame(Model = name, Test_RMSE = rmse))
}
test_results |> arrange(Test_RMSE)
## Model Test_RMSE
## 1 RF 1.165219
## 2 SVM 1.238902
## 3 KNN 1.278229
## 4 MARS 1.328648
The nonlinear regression model that provides the optimal resampling and test set performance for the Chemical Manufacturing Process data is the Random Forest (RF) model. This result reflects Random Forest's ability to model complex, high-order nonlinear interactions among process variables, a common feature in manufacturing and chemical process data. The RF model achieved the lowest RMSE (1.17) on the test set, compared to SVM (1.24), KNN (1.28), and MARS (1.33), indicating superior predictive accuracy. Its ensemble approach, which aggregates many decision trees, allows the model to capture nonlinear relationships while mitigating overfitting and handling noise effectively.
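To also compare resampling performance rather than only test set RMSE, caret's resamples() can summarize the cross-validation results for the models that shared the same trainControl. This is a hedged sketch; KNN is left out because it was tuned with the default bootstrap resampling, so its resamples are not directly comparable.
resamps <- resamples(list(SVM = svmModel, MARS = marsModel, RF = rfModel))
summary(resamps)   # resampled RMSE, R-squared, and MAE for each model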
For the Random Forest model, I will extract and visualize the variable importance to identify the most important predictors.
var_importance <- varImp(rfModel)
print(var_importance)
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## BiologicalMaterial12 35.835
## ManufacturingProcess13 29.924
## BiologicalMaterial03 26.130
## ManufacturingProcess17 20.859
## ManufacturingProcess09 17.234
## BiologicalMaterial06 13.124
## ManufacturingProcess31 9.813
## ManufacturingProcess06 9.254
## BiologicalMaterial11 8.260
## BiologicalMaterial09 7.524
## BiologicalMaterial05 7.350
## BiologicalMaterial02 6.929
## ManufacturingProcess11 6.553
## ManufacturingProcess36 5.873
## BiologicalMaterial08 5.493
## BiologicalMaterial04 4.785
## ManufacturingProcess39 4.528
## ManufacturingProcess25 4.227
## ManufacturingProcess24 4.112
plot(var_importance, top = 10, main = "Top 10 Predictors by Importance (RF Model)")
Based on the variable importance analysis, the Random Forest model has identified the most crucial drivers of the chemical yield. The list of top predictors is dominated by process variables (6 out of the top 10), with ManufacturingProcess32 being the single most influential factor, assigned the maximum scaled importance score of 100. This high ranking for process controls is logical, as the final Yield is more sensitive to the settings and conditions under which the reaction is run than to the raw materials. Compared to a simpler linear model, this nonlinear ranking is more informative because it captures complex, non-additive influences and thresholds that a straight-line model would miss, giving a more reliable picture of which factors truly control the manufacturing outcome.
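For a direct comparison with a linear model's ranking, the sketch below trains a PLS model and plots its top ten predictors next to the RF ranking. The choice of PLS is an assumption standing in for whichever linear model was tuned in Exercise 6.3; substitute that model if it differs.
set.seed(42)
plsModel <- train(Yield ~ ., data = train_df,
                  method = "pls",               # assumed linear benchmark; requires the pls package
                  preProc = c("center", "scale"),
                  tuneLength = 20,
                  trControl = control)
plot(varImp(plsModel), top = 10, main = "Top 10 Predictors by Importance (Linear PLS Model)")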
In order to explore the relationships between the top predictors and the response (Yield), I will create scatter plots with smoothing lines for the predictors that are unique to the optimal nonlinear regression model.
library(ggplot2)
plot1 <- ggplot(train_df, aes(x = ManufacturingProcess32, y = Yield)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "gam", formula = y ~ s(x), color = "red", se = FALSE) +
geom_smooth(method = "lm", color = "blue", se = FALSE, linetype = "dashed") +
labs(title = "Nonlinear Relationship: Yield vs. ManufacturingProcess32",
subtitle = "Red line (GAM) shows true non-linearity vs. Blue line (LM)")
print(plot1)
In this plot, the blue dashed line represents the linear model, which assumes that yield keeps increasing at a steady rate as ManufacturingProcess32 goes up. According to this model, if we push ManufacturingProcess32 to its highest level (around 175), yield should just keep rising. But that is not what actually happens; the data tell a different story. The linear model oversimplifies the relationship, missing its natural curvature, which is why it ends up with a higher error.
The red solid line shows the Generalized Additive Model (GAM) fit, which does a much better job of capturing the true pattern. It shows that yield increases as ManufacturingProcess32 rises, but only up to a point. After around 165–167, the trend levels off, suggesting an optimal operating value for this process variable. Beyond that range, pushing it further does not meaningfully improve yield.
This plot also helps explain why models like MARS, Random Forest, and SVM performed better than the simple linear model. These nonlinear methods can detect that kind of plateau effect and find the point where yield maxes out, while the linear model just keeps drawing a straight line through the data.
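As a rough way to quantify where the smooth fit levels off, one can fit the same kind of GAM directly with mgcv (the engine geom_smooth uses for method = "gam") and locate the predicted maximum. This is only a sketch; the exact value depends on the train/test split.
library(mgcv)
gam_fit <- gam(Yield ~ s(ManufacturingProcess32), data = train_df)
mp32_grid <- data.frame(
  ManufacturingProcess32 = seq(min(train_df$ManufacturingProcess32),
                               max(train_df$ManufacturingProcess32),
                               length.out = 200))
mp32_grid$pred_yield <- predict(gam_fit, newdata = mp32_grid)
mp32_grid$ManufacturingProcess32[which.max(mp32_grid$pred_yield)]  # approximate location of the plateau/peak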
plot2 <- ggplot(train_df, aes(x = ManufacturingProcess32, y = Yield, color = BiologicalMaterial12)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Exploring Interaction: MP32 effect conditional on BiologicalMaterial12",
subtitle = "Look for non-parallel lines (different slopes)") +
theme(legend.position = "bottom")
print(plot2)
This plot explores how ManufacturingProcess32 (the top process variable) interacts with BiologicalMaterial12, one of the most influential biological predictors, in shaping yield. The blue regression line shows a steady positive trend overall: higher values of ManufacturingProcess32 are associated with higher yields.
However, the color gradient representing BiologicalMaterial12 levels tells a more subtle story. While the points do not form distinctly different slopes, suggesting no obvious linear interaction, the distribution hints that BiologicalMaterial12 still influences the strength of the ManufacturingProcess32-Yield relationship. In other words, BiologicalMaterial12 might not change the direction of the relationship, but it could amplify or moderate its effect depending on its level.
This pattern supports what the nonlinear models (like Random Forest and SVM) discovered: rather than simple additive effects, the data likely contain thresholds and conditional relationships.
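As a simple follow-up check, a linear interaction term can be tested directly; a weak or non-significant coefficient here would be consistent with the interaction being nonlinear or threshold-like rather than linear. This is an illustrative sketch, not part of the tuned models above.
interaction_fit <- lm(Yield ~ ManufacturingProcess32 * BiologicalMaterial12, data = train_df)
summary(interaction_fit)$coefficients  # inspect the ManufacturingProcess32:BiologicalMaterial12 row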
In summary, the plots show that ManufacturingProcess32 has a clear nonlinear relationship with yield: it rises up to a point and then levels off, suggesting there is an optimal process range rather than a simple straight-line effect. When we include BiologicalMaterial12, the relationship becomes more complex. While there is not a strong visible linear interaction, the data suggest that BiologicalMaterial12 influences how ManufacturingProcess32 impacts yield, likely through subtle threshold or conditional effects.
Overall, both process and biological factors play important roles in determining yield, but their influence is not purely additive. These interactions help explain why nonlinear models like Random Forest and SVM perform better: they can capture the real-world complexity and the “sweet spots” in the manufacturing process that linear models overlook.