library(caret)
library(mlbench)
library(tidyverse)
library(earth)
library(nnet)

Assignment:

Do problems 7.2 and 7.5 in Kuhn and Johnson. There are only two but they have many parts. Please submit both a link to your Rpubs and the .rmd file.

Problem 7.2:

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: … where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 other non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
 
trainingData$x <- data.frame(trainingData$x)
 
featurePlot(trainingData$x, trainingData$y)

Set up test data

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

KNN Model from book

knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProcess = c("center", "scale"))
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE     
##   5  3.466085  0.5121775  2.816838
##   7  3.349428  0.5452823  2.727410
##   9  3.264276  0.5785990  2.660026
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
knnPredict <- predict(knnModel, newdata = testData$x)

knnResults <- postResample(pred = knnPredict, obs = testData$y)
print(knnResults)
##      RMSE  Rsquared       MAE 
## 3.1172319 0.6556622 2.4899907
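
The three numbers returned by postResample can be reproduced by hand, which helps when reading the comparisons that follow. The sketch below uses small made-up vectors (`obs` and `pred` are hypothetical, not the Friedman data) and assumes R-squared is taken as the squared correlation between observed and predicted values.

```r
# Hypothetical toy vectors, not taken from the Friedman simulation.
obs  <- c(10, 12, 15, 18, 20)
pred <- c(11, 11, 16, 17, 22)

rmse <- sqrt(mean((obs - pred)^2))  # root mean squared error
rsq  <- cor(obs, pred)^2            # squared Pearson correlation
mae  <- mean(abs(obs - pred))       # mean absolute error

round(c(RMSE = rmse, Rsquared = rsq, MAE = mae), 4)
```

RMSE penalizes large misses more heavily than MAE, so a model can rank differently on the two.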

MARS

marsModel <- earth(trainingData$x, trainingData$y)
summary(marsModel)
## Call: earth(x=trainingData$x, y=trainingData$y)
## 
##                coefficients
## (Intercept)       18.451984
## h(0.621722-X1)   -11.074396
## h(0.601063-X2)   -10.744225
## h(X3-0.281766)    20.607853
## h(0.447442-X3)    17.880232
## h(X3-0.447442)   -23.282007
## h(X3-0.636458)    15.150350
## h(0.734892-X4)   -10.027487
## h(X4-0.734892)     9.092045
## h(0.850094-X5)    -4.723407
## h(X5-0.850094)    10.832932
## h(X6-0.361791)    -1.956821
## 
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982
marsPredict <- predict(marsModel, newdata = testData$x)

MARS_Results <- postResample(pred = marsPredict, obs = testData$y)
print(MARS_Results)
##      RMSE  Rsquared       MAE 
## 1.8136467 0.8677298 1.3911836
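
Each h(...) term in the MARS summary is a hinge function, h(z) = max(0, z), so a term such as h(0.621722-X1) is active only when X1 is below the knot. A minimal sketch of how one term contributes to a prediction, using the X1 coefficient and knot printed above and a few hypothetical X1 values:

```r
# Hinge function used by MARS: zero for negative input, linear otherwise.
h <- function(z) pmax(0, z)

# Coefficient and knot copied from the h(0.621722-X1) row of the summary.
coef_x1 <- -11.074396
knot_x1 <- 0.621722

x1 <- c(0.10, 0.621722, 0.90)              # hypothetical X1 values
contribution <- coef_x1 * h(knot_x1 - x1)  # contribution of this term only
contribution
```

The term lowers the prediction for small X1 and contributes nothing at or above the knot. The full prediction is the intercept plus the sum of all such terms.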

Neural Network

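# linout = TRUE requests the linear output unit needed for regression.
# The results printed below came from a run that spelled it `lineout`,
# which nnet() silently ignores, leaving a logistic (0 to 1) output unit.
neuralModel <- avNNet(x = trainingData$x,
                    y = trainingData$y,
                    decay = 0.01,
                    size = 5,
                    repeats = 5,
                    linout = TRUE,
                    trace = FALSE,
                    maxit = 500)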
## Warning: executing %dopar% sequentially: no parallel backend registered
summary(neuralModel)
##         Length Class  Mode     
## model    5     -none- list     
## repeats  1     -none- numeric  
## bag      1     -none- logical  
## seeds    5     -none- numeric  
## names   10     -none- character
neuralNetPrediction <- predict(neuralModel, newdata = testData$x)

NeuralNetwork_Results <- postResample(pred = neuralNetPrediction, obs = testData$y)
print(NeuralNetwork_Results)
##       RMSE   Rsquared        MAE 
## 14.2769353  0.2791236 13.3869179
model_results <- rbind(knnResults, MARS_Results, NeuralNetwork_Results)
print(model_results)
##                            RMSE  Rsquared       MAE
## knnResults             3.117232 0.6556622  2.489991
## MARS_Results           1.813647 0.8677298  1.391184
## NeuralNetwork_Results 14.276935 0.2791236 13.386918

Which models appear to give the best performance?

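MARS gives the best test-set performance, with the lowest RMSE (1.81) and MAE (1.39) and the highest R-squared (0.87). KNN is a distant second (RMSE 3.12, R-squared 0.66). The neural network performed very poorly: its RMSE of 14.28 is about as large as the response values themselves, which suggests a problem with how the network was fit rather than a limitation of the method.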
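
An RMSE near 14 with an MAE near 13 is the pattern one would expect if the network's output were bounded between 0 and 1, as happens when the linear-output option is not in effect. One way that can occur, assuming the fitting function has a `...` catch-all, is a misspelled argument such as `lineout` for `linout`: R does not raise an error, it just keeps the default. The function `fit_stub` below is a hypothetical stand-in used only to show that behavior; it is not part of nnet or caret.

```r
# Hypothetical stand-in for a fitting function that has a `linout`
# argument and a `...` catch-all for extra arguments.
fit_stub <- function(x, linout = FALSE, ...) {
  if (linout) "linear output" else "logistic output (0 to 1)"
}

fit_stub(1, linout = TRUE)   # correctly spelled: option takes effect
fit_stub(1, lineout = TRUE)  # misspelled: absorbed by `...`, default kept
```

Checking the fitted predictions with range() is a quick way to spot this kind of silent fallback.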

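Does MARS select the informative predictors (those named X1–X5)?

Yes. MARS selects all of X1–X5 and also pulls in one uninformative predictor, X6, which has by far the lowest importance (11.9 on the 0 to 100 scale).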

varImp(marsModel)
##      Overall
## X1 100.00000
## X4  84.21578
## X2  67.21639
## X5  45.44416
## X3  34.63259
## X6  11.90397
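
The comparison between the selected and the informative predictors can also be done programmatically. This is a small sketch that types in the predictor names from the importance table above; the variable names `selected` and `informative` are illustrative.

```r
# Predictors listed by varImp() above, in order of importance.
selected    <- c("X1", "X4", "X2", "X5", "X3", "X6")
informative <- paste0("X", 1:5)  # X1 to X5 drive the simulated response

setdiff(informative, selected)   # informative predictors MARS missed
setdiff(selected, informative)   # non-informative predictors MARS kept
```

An empty first result means no informative predictor was dropped; the second result lists the extra terms.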

Problem 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Which nonlinear regression model gives the optimal resampling and test set performance?

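On the test set, the MARS model has the highest R-squared (0.685) and the lowest MAE (0.497), while the SVM model has the lowest RMSE (0.630, against 0.634 for MARS) and the second-highest R-squared (0.613). KNN is worst on all three metrics. The RMSE gap between MARS and SVM is negligible, so the two are effectively tied on that measure, with MARS ahead on R-squared and MAE. These figures are test-set results in standardized Yield units; resampling results are not shown.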

library(AppliedPredictiveModeling) #from previous assignment
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.3.3
data("ChemicalManufacturingProcess")
set.seed(02180)
preProcessChem <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute", "center", "scale"))
imputeChem <- predict(preProcessChem, ChemicalManufacturingProcess)
index <- createDataPartition(imputeChem$Yield, p=0.8, list = F)
train_chem <- imputeChem[index,]
test_chem <- imputeChem[-index,]

Models for Problem 7.5

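# tuneLength = 10 asks train() to evaluate ten candidate values of k.
# The results printed below came from a run that spelled it `tuneLenth`,
# which train() does not recognize, so the default grid was used.
knnModel_p2 <- train(Yield ~ ., data = train_chem,
                     method = "knn",
                     preProcess = c("center", "scale"),
                     tuneLength = 10)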
knnPredict_p2 <- predict(knnModel_p2, newdata = test_chem)
KNN_Results_p2 <- postResample(pred = knnPredict_p2, obs = test_chem$Yield)

marsModel_p2 <- earth(train_chem[,2:58], train_chem$Yield)
marsPredict_p2 <- predict(marsModel_p2, newdata = test_chem[,2:58])
MARS_Results_p2 <- postResample(pred = marsPredict_p2, obs = test_chem$Yield)

## SVM Model
svmModel <- train(Yield ~., data=train_chem,
                   method = "svmRadial",
                   tuneLength = 15,
                   trControl = trainControl(method = "cv"))
svmPredict <- predict(svmModel, newdata = test_chem)
svm_Results_p2 <- postResample(pred = svmPredict, obs = test_chem$Yield)
model_results_p2 <- rbind(KNN_Results_p2, MARS_Results_p2, svm_Results_p2)
print(model_results_p2)
##                      RMSE  Rsquared       MAE
## KNN_Results_p2  0.7295615 0.5108710 0.5967808
## MARS_Results_p2 0.6337139 0.6851108 0.4971686
## svm_Results_p2  0.6300980 0.6131959 0.5345734
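
To make the comparison easier to read, the metrics can be ranked per column. The sketch below re-enters the test-set values from the table above by hand; the data frame `results` and its rank columns are illustrative names.

```r
# Test-set metrics copied from the table above.
results <- data.frame(
  model    = c("KNN", "MARS", "SVM"),
  RMSE     = c(0.7295615, 0.6337139, 0.6300980),
  Rsquared = c(0.5108710, 0.6851108, 0.6131959),
  MAE      = c(0.5967808, 0.4971686, 0.5345734)
)

results$rank_RMSE <- rank(results$RMSE)       # 1 = lowest error
results$rank_Rsq  <- rank(-results$Rsquared)  # 1 = highest R-squared
results$rank_MAE  <- rank(results$MAE)        # 1 = lowest error
results[order(results$rank_RMSE), ]
```

SVM ranks first on RMSE only, by well under 1 percent, while MARS ranks first on R-squared and MAE.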

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list?

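For consistency I will use the MARS model, despite its negligible difference from the SVM. Manufacturing process variables dominate the importance list: they account for seven of the ten predictors shown, including all of the top five, while only three biological materials appear (BiologicalMaterial05, BiologicalMaterial03, and BiologicalMaterial02).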

varImp(marsModel_p2)
##                          Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess09  65.12906
## ManufacturingProcess13  41.85819
## ManufacturingProcess39  29.59117
## ManufacturingProcess01  25.08185
## BiologicalMaterial05    19.49698
## BiologicalMaterial03    15.05761
## BiologicalMaterial02    10.97102
## ManufacturingProcess28  14.23491
## ManufacturingProcess33  10.83446

Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Yield is not strongly correlated with most of the important predictors identified above. ManufacturingProcess32 and ManufacturingProcess09 have the highest positive correlations with Yield, and ManufacturingProcess13 has the strongest negative correlation, at about -0.50.

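# Yield plus the ten predictors ranked by varImp() above (duplicate name removed).
top_vars <- c("ManufacturingProcess32", "ManufacturingProcess09",
              "ManufacturingProcess13", "ManufacturingProcess39",
              "ManufacturingProcess01", "BiologicalMaterial05",
              "BiologicalMaterial03", "BiologicalMaterial02",
              "ManufacturingProcess28", "ManufacturingProcess33", "Yield")
# The raw data contain missing values, so use pairwise-complete observations.
correlation <- cor(ChemicalManufacturingProcess[, top_vars],
                   use = "pairwise.complete.obs")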
corrplot::corrplot(correlation, method='square')
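
Because the raw ChemicalManufacturingProcess table contains missing values, it matters how cor() treats them. The sketch below uses a tiny made-up data frame (`toy` is hypothetical, not the chemical data) to show the difference between the default and pairwise-complete handling.

```r
# Hypothetical toy data with one missing value in column b.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, NA, 8, 10),
  y = c(5, 4, 3, 2, 1)
)

m_default  <- cor(toy)                                 # NA for pairs involving b
m_pairwise <- cor(toy, use = "pairwise.complete.obs")  # drops incomplete rows per pair

m_default["b", "y"]
m_pairwise["b", "y"]
```

With the default setting a single missing value turns every correlation involving that column into NA, which then shows up as a gap in the correlation plot.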