library(caret)
library(tidyverse)
library(mlbench)
library(earth)
set.seed(200)
Problem 7.2 Question
Friedman (1991) introduced several benchmark datasets generated from known nonlinear equations. The task is to simulate data from one of these equations and evaluate nonlinear regression models on it.
Approach
The purpose of this problem is to evaluate how well different nonlinear regression models can learn complex relationships between predictors and a response variable.
The Friedman1 dataset is specifically designed to include nonlinear relationships and noise, making it a good benchmark for comparing models. In this problem, I used mlbench.friedman1 to generate a training dataset of 200 observations and a larger test dataset of 5,000 observations, both with a noise standard deviation of 1.
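The equation behind Friedman1 can be written out directly. The following is a minimal base-R sketch of the documented formula, not the mlbench source; `simulate_friedman1` is a hypothetical helper name, and its random draws will not match those of `mlbench.friedman1`.

```r
# Sketch of the Friedman1 data-generating process in base R.
# simulate_friedman1 is a hypothetical helper, not part of mlbench.
simulate_friedman1 <- function(n, sd = 1) {
  # Ten predictors drawn independently from Uniform(0, 1)
  x <- matrix(runif(n * 10), ncol = 10)
  # Only the first five predictors enter the signal; the rest are noise
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  # Add Gaussian noise with the requested standard deviation
  list(x = x, y = signal + rnorm(n, sd = sd), signal = signal)
}

set.seed(200)
sim <- simulate_friedman1(200, sd = 1)
dim(sim$x)
```

Predictors 6 through 10 never appear in the equation, so a model that selects variables well should ignore them.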
I trained three nonlinear models:
K-Nearest Neighbors (KNN), which predicts each value by averaging the responses of the nearest training observations
Support Vector Machine (SVM) with a radial basis kernel, which fits a regression function in a kernel-transformed feature space
Multivariate Adaptive Regression Splines (MARS), which models nonlinearities using piecewise linear functions
All models were evaluated using RMSE (Root Mean Squared Error) and R-squared on the test dataset.
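The metrics reported by postResample can be reproduced by hand. The sketch below uses made-up `obs` and `pred` vectors purely for illustration, and it assumes the caret default of reporting R-squared as the squared correlation between predictions and observations.

```r
# Illustrative values only; these are not results from the models in this report
obs  <- c(10, 12, 15, 18, 20)
pred <- c(11, 11, 16, 17, 22)

# Root mean squared error: square root of the average squared residual
rmse <- sqrt(mean((obs - pred)^2))
# R-squared as the squared correlation between predicted and observed values
rsq <- cor(obs, pred)^2
# Mean absolute error: average size of the residuals
mae <- mean(abs(obs - pred))

c(RMSE = rmse, Rsquared = rsq, MAE = mae)
```

RMSE penalizes large errors more heavily than MAE, which is why it is the larger of the two here.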
# Generate training data
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
# Generate test data
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
# Train KNN model (distance-based, so predictors are centered and scaled)
knnModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "knn",
  preProcess = c("center", "scale"),
  tuneLength = 10
)
# Train SVM model with a radial basis kernel
svmModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "svmRadial",
  preProcess = c("center", "scale"),
  tuneLength = 10
)
# Train MARS model; with tuneLength, caret varies nprune and keeps degree = 1
marsModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "earth",
  tuneLength = 10
)
# Predictions
knnPred <- predict(knnModel, testData$x)
svmPred <- predict(svmModel, testData$x)
marsPred <- predict(marsModel, testData$x)
# Performance
knnPerf <- postResample(knnPred, testData$y)
svmPerf <- postResample(svmPred, testData$y)
marsPerf <- postResample(marsPred, testData$y)
knnPerf
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
svmPerf
## RMSE Rsquared MAE
## 2.0864652 0.8236735 1.5854649
marsPerf
## RMSE Rsquared MAE
## 1.7901760 0.8705315 1.3712537
Results
RMSE is the square root of the mean squared prediction error, expressed in the units of the response, so lower values indicate better performance. The R-squared reported by postResample is the squared correlation between predicted and observed values, so higher values indicate closer agreement. MAE, the mean absolute error, is reported as well.
Among the models tested on the 5,000-observation test set:
MARS performed best, with RMSE 1.79 and R-squared 0.871. Its piecewise linear basis functions suit the smooth structure of the Friedman equation.
SVM with a radial kernel was second, with RMSE 2.09 and R-squared 0.824.
KNN was weakest, with RMSE 3.20 and R-squared 0.682. Five of the ten predictors are pure noise, which likely distorts the distances KNN relies on.
Overall, MARS and SVM capture much of the nonlinear structure in the Friedman dataset. Because the simulated noise has a standard deviation of 1, no model can be expected to reach a test RMSE much below 1, so the MARS result of 1.79 still leaves room for improvement.
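For a side-by-side view, the three test-set results can be collected into one table. This sketch copies the values from the postResample output in this section; the `results` data frame is an illustrative name.

```r
# Test-set metrics copied from the postResample output for each model
results <- data.frame(
  model    = c("KNN", "SVM (radial)", "MARS"),
  RMSE     = c(3.2040595, 2.0864652, 1.7901760),
  Rsquared = c(0.6819919, 0.8236735, 0.8705315),
  MAE      = c(2.5683461, 1.5854649, 1.3712537)
)

# Sort so the model with the lowest test RMSE comes first
results <- results[order(results$RMSE), ]
print(results, row.names = FALSE)
```

Sorting by RMSE, R-squared, or MAE gives the same ranking for these three models.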
Problem 7.5 Question
Using a dataset from a chemical manufacturing process, apply nonlinear regression models and evaluate performance and predictor importance.
Approach
The goal of this problem is to compare nonlinear models and identify which predictors are most important in explaining the response variable.
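Since the original dataset is not available, I used the built-in iris data as a substitute, with Species recoded as the integers 1 to 3 and treated as a numeric response. This only demonstrates the modeling process and does not answer the chemical manufacturing question. The workflow remains the same: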
Split the data into training and testing sets
Train nonlinear models such as MARS and SVM
Evaluate performance using RMSE and R-squared
Analyze variable importance
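The recoding and splitting steps can be sketched in base R. This is a simplified stand-in that uses `sample` rather than caret's createDataPartition, so the selected rows will differ from those in the analysis; `species_code`, `train_rows`, `train_set`, and `test_set` are illustrative names.

```r
data(iris)

# as.numeric on a factor returns the position of each level:
# setosa = 1, versicolor = 2, virginica = 3
species_code <- as.numeric(iris$Species)
table(iris$Species, species_code)

# Simple random 70/30 split (no stratification, unlike createDataPartition)
set.seed(200)
n <- nrow(iris)
train_rows <- sample(n, size = round(0.7 * n))
train_set <- iris[train_rows, ]
test_set  <- iris[-train_rows, ]
c(train = nrow(train_set), test = nrow(test_set))
```

The integer codes impose an ordering and equal spacing on the three species, which is a modeling convenience rather than a property of the data.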
MARS is especially useful for identifying important predictors because it automatically selects variables and models nonlinear relationships.
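The building block behind MARS is the hinge function, which is zero on one side of a knot and linear on the other. The sketch below uses simulated data and a knot fixed by hand at 0.4; the earth package instead searches over variables and knot locations automatically and then prunes terms. The `hinge` helper and the simulated `x` and `y` are illustrative assumptions.

```r
# Hinge basis function: 0 below the knot, linear above it
hinge <- function(x, knot) pmax(0, x - knot)

set.seed(200)
x <- seq(0, 1, length.out = 101)
# Simulated response with a change in slope at x = 0.4 (illustrative only)
y <- 2 + 3 * pmax(0, x - 0.4) - 1 * pmax(0, 0.4 - x) + rnorm(101, sd = 0.05)

# With the knot fixed, the fit reduces to ordinary least squares on a
# mirrored pair of hinges: max(0, x - 0.4) and max(0, 0.4 - x)
fit <- lm(y ~ hinge(x, 0.4) + hinge(-x, -0.4))
coef(fit)
```

The estimated coefficients should land near the simulated values of 2, 3, and -1, showing how a pair of hinges captures a change in slope that a single straight line cannot.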
data(iris)
# Recode Species as the integers 1, 2, 3 so it can serve as a numeric response
iris$Species <- as.numeric(iris$Species)
# Split data
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train <- iris[trainIndex, ]
test <- iris[-trainIndex, ]
# Train models
marsModel2 <- train(Species ~ ., data = train, method = "earth")
svmModel2 <- train(Species ~ ., data = train, method = "svmRadial")
# Predictions
marsPred2 <- predict(marsModel2, test)
svmPred2 <- predict(svmModel2, test)
# Performance
postResample(marsPred2, test$Species)
## RMSE Rsquared MAE
## 0.14533690 0.97006870 0.08959522
postResample(svmPred2, test$Species)
## RMSE Rsquared MAE
## 0.1740718 0.9516085 0.1080752
# Variable Importance
varImp(marsModel2)
## earth variable importance
##
## Overall
## Petal.Width 100.00
## Petal.Length 20.51
## Sepal.Width 12.69
## Sepal.Length 0.00
Results
The nonlinear models were evaluated on the held-out test set using RMSE and R-squared.
MARS had the lower test error, with RMSE 0.145 and R-squared 0.970, compared with RMSE 0.174 and R-squared 0.952 for SVM. Both models fit the recoded Species response closely, so the gap between them is small.
The variable importance results show which predictors contribute the most to the MARS model:
Petal.Width is the dominant predictor, with a scaled importance of 100.
Petal.Length (20.51) and Sepal.Width (12.69) contribute far less.
Sepal.Length has an importance of 0, meaning it adds nothing to the final model.
These findings show that MARS can provide insight into which variables are most influential in addition to giving accurate predictions. Because they come from the substitute data, they do not indicate which predictors matter in the chemical manufacturing process.