Data 624 Homework 8

##7.2

7.2. Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data: y = 10 sin(πx1x2) + 20(x3 − 0.5)2 + 10x4 + 5x5 + N(0, σ2) where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(mlbench)

## Warning: package 'mlbench' was built under R version 4.3.3

library(caret)

## Warning: package 'caret' was built under R version 4.3.3

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)



library(caret)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)

knnModel

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.

set.seed(200)
 knnPred <- predict(knnModel, newdata = testData$x)
# The function 'postResample' can be used to get the test set
# perforamnce values
postResample(pred = knnPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

The Friedman dataset has informative predictors (X1X1 to X5X5) and non-informative predictors (X6X6 to X10X10). MARS, when trained on this data, performs feature selection by identifying these relationships and assigning higher importance to X1X1 to X5X to see below

Fit a MARS

marsModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "earth", # MARS model
  tuneLength = 10
)

## Loading required package: earth

## Warning: package 'earth' was built under R version 4.3.3

## Loading required package: Formula

## Loading required package: plotmo

## Warning: package 'plotmo' was built under R version 4.3.3

## Loading required package: plotrix

varImp(marsModel)

## earth variable importance
## 
##    Overall
## X1  100.00
## X4   82.92
## X2   64.47
## X5   40.67
## X3   28.65
## X6    0.00

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models. (a) Which nonlinear regression model gives the optimal resampling and test set performance?

library(AppliedPredictiveModeling)

## Warning: package 'AppliedPredictiveModeling' was built under R version 4.3.3

data(ChemicalManufacturingProcess)
chemical <- ChemicalManufacturingProcess

# Imputation
preProc <- preProcess(chemical, method = "knnImpute")
chemical_imputed <- predict(preProc, chemical)

# Data splitting
set.seed(123)
sample <- sample(c(TRUE, FALSE), nrow(chemical_imputed), replace = TRUE, prob = c(0.8, 0.2))
train <- chemical_imputed[sample, ]
test <- chemical_imputed[!sample, ]

# Pre-processing
preProc <- preProcess(train, method = c("center", "scale"))
train_preprocessed <- predict(preProc, train)
test_preprocessed <- predict(preProc, test)

# Train nonlinear models
set.seed(123)
models <- list()

# Random Forest
models$rf <- train(
  Yield ~ ., 
  data = train_preprocessed,
  method = "rf",
  trControl = trainControl(method = "cv", number = 10),
  metric = "RMSE"
)

# Gradient Boosting Machine
models$gbm <- train(
  Yield ~ ., 
  data = train_preprocessed,
  method = "gbm",
  trControl = trainControl(method = "cv", number = 10),
  verbose = FALSE,
  metric = "RMSE"
)

# Support Vector Machines
models$svm <- train(
  Yield ~ ., 
  data = train_preprocessed,
  method = "svmRadial",
  trControl = trainControl(method = "cv", number = 10),
  metric = "RMSE"
)

# Compare model performance
resamples <- resamples(models)
summary(resamples)

## 
## Call:
## summary.resamples(object = resamples)
## 
## Models: rf, gbm, svm 
## Number of resamples: 10 
## 
## MAE 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rf  0.3205756 0.4310047 0.4635988 0.4648531 0.5288804 0.5716890    0
## gbm 0.3363901 0.4297352 0.4426877 0.4592677 0.5169445 0.6084667    0
## svm 0.3436503 0.4701333 0.4930998 0.5155739 0.6222974 0.6764083    0
## 
## RMSE 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rf  0.4277200 0.5715860 0.6126162 0.6041201 0.6436861 0.8156937    0
## gbm 0.4644638 0.5244978 0.5876787 0.5869634 0.6286567 0.7927376    0
## svm 0.4068963 0.5411064 0.5938343 0.6205820 0.7226760 0.8293966    0
## 
## Rsquared 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rf  0.5146884 0.5927702 0.6436740 0.6565916 0.7035083 0.8290242    0
## gbm 0.5428877 0.6194338 0.6667787 0.6856818 0.7683215 0.8254076    0
## svm 0.4816081 0.5890361 0.7014865 0.6809040 0.7856210 0.8186759    0

# Predict on test set for the best model
best_model <- models[[which.min(sapply(models, function(x) min(x$results$RMSE)))]]
test_predictions <- predict(best_model, test_preprocessed)
rmse_test <- sqrt(mean((test_predictions - test_preprocessed$Yield)^2))
cat("Test RMSE for the best model:", rmse_test, "\n")

## Test RMSE for the best model: 0.4386147

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the

 library(AppliedPredictiveModeling)
library(gbm)

## Warning: package 'gbm' was built under R version 4.3.3

## Loaded gbm 2.2.2

## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

data("ChemicalManufacturingProcess")

# Preprocessing steps with new object names
dataPreProcessor <- preProcess(ChemicalManufacturingProcess, 
                                method = c("BoxCox", "knnImpute", "center", "scale"))
processedData <- predict(dataPreProcessor, ChemicalManufacturingProcess)

processedData$Yield <- ChemicalManufacturingProcess$Yield

# Splitting into training and test sets with new object names
set.seed(123)  # For reproducibility
splitIndex <- sample(seq_len(nrow(processedData)), size = floor(0.85 * nrow(processedData)))
trainingSet <- processedData[splitIndex, ]
testingSet <- processedData[-splitIndex, ]

list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

# Variable importance for the best model
importance <- varImp(best_model, scale = TRUE)
print(importance)

## gbm variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess13  28.353
## BiologicalMaterial03    26.084
## ManufacturingProcess09  19.061
## BiologicalMaterial12    15.785
## BiologicalMaterial09    13.687
## ManufacturingProcess17  10.344
## ManufacturingProcess06   9.620
## ManufacturingProcess30   9.372
## BiologicalMaterial05     8.398
## ManufacturingProcess31   8.309
## ManufacturingProcess01   8.134
## BiologicalMaterial11     7.311
## ManufacturingProcess20   6.857
## ManufacturingProcess04   6.287
## ManufacturingProcess14   5.942
## ManufacturingProcess05   5.408
## ManufacturingProcess26   5.168
## ManufacturingProcess37   4.835
## ManufacturingProcess45   4.785

# Plot variable importance
plot(importance, top = 10)

# Train linear regression model
linear_model <- train(
  Yield ~ ., 
  data = train_preprocessed,
  method = "lm",
  trControl = trainControl(method = "cv", number = 10),
  metric = "RMSE"
)

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

# Extract variable importance for the linear model
linear_importance <- varImp(linear_model, scale = TRUE)
print(linear_importance)

## lm variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess33   61.85
## ManufacturingProcess04   61.42
## BiologicalMaterial05     46.38
## ManufacturingProcess28   39.39
## BiologicalMaterial02     36.94
## ManufacturingProcess09   36.50
## ManufacturingProcess37   33.80
## ManufacturingProcess45   33.07
## BiologicalMaterial11     32.87
## ManufacturingProcess29   30.12
## ManufacturingProcess12   29.77
## BiologicalMaterial07     29.45
## ManufacturingProcess23   25.02
## ManufacturingProcess38   24.99
## ManufacturingProcess35   21.34
## ManufacturingProcess36   21.27
## ManufacturingProcess07   20.81
## ManufacturingProcess03   19.21
## ManufacturingProcess43   18.67

# Plot variable importance for the linear model
plot(linear_importance, top = 10)

# Get top 10 predictors for nonlinear model
top_predictors_nonlinear <- importance$importance %>%
  arrange(desc(Overall)) %>%
  head(10)

# Get top 10 predictors for linear model
top_predictors_linear <- linear_importance$importance %>%
  arrange(desc(Overall)) %>%
  head(10)

# Compare top predictors
print("Top predictors from nonlinear model:")

## [1] "Top predictors from nonlinear model:"

print(top_predictors_nonlinear)

##                           Overall
## ManufacturingProcess32 100.000000
## ManufacturingProcess13  28.353130
## BiologicalMaterial03    26.083579
## ManufacturingProcess09  19.061239
## BiologicalMaterial12    15.784535
## BiologicalMaterial09    13.687353
## ManufacturingProcess17  10.343768
## ManufacturingProcess06   9.619833
## ManufacturingProcess30   9.371528
## BiologicalMaterial05     8.398010

print("Top predictors from linear model:")

## [1] "Top predictors from linear model:"

print(top_predictors_linear)

##                          Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess33  61.84879
## ManufacturingProcess04  61.42045
## BiologicalMaterial05    46.37958
## ManufacturingProcess28  39.38728
## BiologicalMaterial02    36.93659
## ManufacturingProcess09  36.50330
## ManufacturingProcess37  33.79848
## ManufacturingProcess45  33.07245
## BiologicalMaterial11    32.86803

# Analyze overlap
common_predictors <- intersect(
  rownames(top_predictors_nonlinear),
  rownames(top_predictors_linear)
)
cat("Common predictors between nonlinear and linear models:", common_predictors, "\n")

## Common predictors between nonlinear and linear models: ManufacturingProcess32 ManufacturingProcess09 BiologicalMaterial05

The most important variable is ManufacturingProcess32, with a scaled importance of 100. Process variables dominate the top positions (7 out of 10), suggesting that they contribute more to the optimal nonlinear model. Similar to the nonlinear model, ManufacturingProcess32 appears as the most important predictor, followed by other process variables like ManufacturingProcess13 and ManufacturingProcess09.Manufacturing Process variables are more dominant in both models, particularly in the nonlinear regression model, where they constitute the majority of the top predictors.

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

# Identify predictors unique to the nonlinear model
unique_predictors_nonlinear <- setdiff(
  rownames(top_predictors_nonlinear),
  rownames(top_predictors_linear)
)
print("Unique predictors to the nonlinear model:")

## [1] "Unique predictors to the nonlinear model:"

print(unique_predictors_nonlinear)

## [1] "ManufacturingProcess13" "BiologicalMaterial03"   "BiologicalMaterial12"  
## [4] "BiologicalMaterial09"   "ManufacturingProcess17" "ManufacturingProcess06"
## [7] "ManufacturingProcess30"

# Scatter plots for unique nonlinear predictors
library(ggplot2)
for (predictor in unique_predictors_nonlinear) {
  plot <- ggplot(chemical_imputed, aes_string(x = predictor, y = "Yield")) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "loess", color = "blue") +
    labs(title = paste("Relationship between", predictor, "and Yield"),
         x = predictor, y = "Yield") +
    theme_minimal()
  print(plot)
}

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

Both biological and process variables show strong nonlinear effects on Yield.Process variables exhibit more complex relationships, such as U-shapes and thresholds, suggesting the nonlinear model captures nuanced operational interactions Biological variables contribute significantly but tend to have smoother, more predictable relationships with Yield. This analysis highlights the strength of the nonlinear model in capturing these intricate patterns, which could guide both operational adjustments and material selection to maximize production efficiency.

Data 624 Homework 8

Bishoy Sokkar

2024-11-11