library(caret)
library(mlbench)
library(tidyverse)
library(earth)
library(nnet)
Do problems 7.2 and 7.5 in Kuhn and Johnson. There are only two but they have many parts. Please submit both a link to your Rpubs and the .rmd file.
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: … where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 other non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProcess = c("center", "scale"))
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
knnPredict <- predict(knnModel, newdata = testData$x)
knnResults <- postResample(pred = knnPredict, obs = testData$y)
print(knnResults)
## RMSE Rsquared MAE
## 3.1172319 0.6556622 2.4899907
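The three numbers reported by `postResample` can be reproduced by hand. The sketch below assumes caret's documented definitions (RMSE, R-squared as the squared Pearson correlation between predictions and observations, and MAE). The helper name `manual_post_resample` and the toy vectors are illustrative and not part of the original analysis.

```r
# Hypothetical helper mirroring the three metrics postResample() reports.
manual_post_resample <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),   # root mean squared error
    Rsquared = cor(pred, obs)^2,             # squared Pearson correlation
    MAE      = mean(abs(pred - obs)))        # mean absolute error
}

# Toy vectors, made up for illustration only.
obs  <- c(10, 12, 15, 20)
pred <- c(11, 11, 16, 18)
metrics <- manual_post_resample(pred, obs)
print(metrics)
```

Because R-squared is computed here as a squared correlation, it measures how well predictions track the observations, not whether they are on the right scale, so it is worth reading alongside RMSE.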
marsModel <- earth(trainingData$x, trainingData$y)
summary(marsModel)
## Call: earth(x=trainingData$x, y=trainingData$y)
##
## coefficients
## (Intercept) 18.451984
## h(0.621722-X1) -11.074396
## h(0.601063-X2) -10.744225
## h(X3-0.281766) 20.607853
## h(0.447442-X3) 17.880232
## h(X3-0.447442) -23.282007
## h(X3-0.636458) 15.150350
## h(0.734892-X4) -10.027487
## h(X4-0.734892) 9.092045
## h(0.850094-X5) -4.723407
## h(X5-0.850094) 10.832932
## h(X6-0.361791) -1.956821
##
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
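The GCV and GRSq values in the summary follow from the RSS and the number of selected terms. The sketch below assumes earth's documented GCV formula with its default penalty of 2 for additive (degree 1) models. The variable names are illustrative, and the total sum of squares is backed out from the printed RSq rather than computed from the data.

```r
# Values copied from the earth summary above.
n      <- 200        # training samples
nterms <- 12         # selected terms, including the intercept
rss    <- 397.9654
rsq    <- 0.9183982

# Assumption: earth's default penalty is 2 when degree = 1 (additive model).
penalty    <- 2
eff_params <- nterms + penalty * (nterms - 1) / 2   # 12 + 11 = 23

# GCV inflates the mean squared residual to penalize model complexity.
gcv <- rss / (n * (1 - eff_params / n)^2)

# GRSq compares the model GCV with the GCV of an intercept-only model.
tss      <- rss / (1 - rsq)              # total sum of squares implied by RSq
gcv_null <- tss / (n * (1 - 1 / n)^2)    # intercept-only: 1 effective parameter
grsq     <- 1 - gcv / gcv_null

print(c(GCV = gcv, GRSq = grsq))
```

Under these assumptions the results should agree with the printed GCV and GRSq up to rounding of the inputs.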
marsPredict <- predict(marsModel, newdata = testData$x)
MARS_Results <- postResample(pred = marsPredict, obs = testData$y)
print(MARS_Results)
## RMSE Rsquared MAE
## 1.8136467 0.8677298 1.3911836
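# nnet's linear-output argument is spelled "linout". The results printed below
# come from a run where it was misspelled "lineout", which nnet silently ignores.
neuralModel <- avNNet(x = trainingData$x,
                      y = trainingData$y,
                      decay = 0.01,
                      size = 5,
                      repeats = 5,
                      linout = TRUE,
                      trace = FALSE,
                      maxit = 500)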
## Warning: executing %dopar% sequentially: no parallel backend registered
summary(neuralModel)
## Length Class Mode
## model 5 -none- list
## repeats 1 -none- numeric
## bag 1 -none- logical
## seeds 5 -none- numeric
## names 10 -none- character
neuralNetPrediction <- predict(neuralModel, newdata = testData$x)
NeuralNetwork_Results <- postResample(pred = neuralNetPrediction, obs = testData$y)
print(NeuralNetwork_Results)
## RMSE Rsquared MAE
## 14.2769353 0.2791236 13.3869179
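A test RMSE this large is consistent with a network whose output unit is logistic, since its predictions cannot leave [0, 1]. The sketch below illustrates that behavior with `nnet` directly. It assumes, per the nnet documentation, that the argument is spelled `linout` and defaults to `FALSE`, and that an unrecognized argument such as `lineout` is absorbed by `...` without a warning. The toy data are made up for illustration.

```r
library(nnet)

set.seed(1)
# Toy regression data whose response lies well outside [0, 1].
x <- matrix(runif(200), ncol = 2)
y <- 10 + 10 * x[, 1] + 5 * x[, 2]

# Misspelled argument: absorbed by `...`, so the output unit stays logistic.
fit_typo <- nnet(x, y, size = 3, lineout = TRUE, trace = FALSE, maxit = 200)

# Correct spelling: linear output unit, appropriate for regression.
fit_linear <- nnet(x, y, size = 3, linout = TRUE, trace = FALSE, maxit = 200)

range_typo   <- range(predict(fit_typo, x))
range_linear <- range(predict(fit_linear, x))
print(range_typo)    # confined to [0, 1]
print(range_linear)  # free to follow the scale of y
```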
model_results <- rbind(knnResults, MARS_Results, NeuralNetwork_Results)
print(model_results)
## RMSE Rsquared MAE
## knnResults 3.117232 0.6556622 2.489991
## MARS_Results 1.813647 0.8677298 1.391184
## NeuralNetwork_Results 14.276935 0.2791236 13.386918
Which models appear to give the best performance?
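The MARS model clearly performs best on the test set, with the lowest RMSE (1.81) and the highest R-squared (0.87). KNN is a distant second (RMSE 3.12, R-squared 0.66). The neural network performed very poorly (RMSE 14.28, R-squared 0.28), most likely because nnet's linear-output argument is spelled linout, so the lineout setting used in the run was ignored and the predictions were confined to [0, 1].
Does MARS select the informative predictors (those named X1–X5)? Yes. MARS selects X1–X5 and also pulls in X6, which has by far the lowest importance score.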
varImp(marsModel)
## Overall
## X1 100.00000
## X4 84.21578
## X2 67.21639
## X5 45.44416
## X3 34.63259
## X6 11.90397
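To see why X6 ranks last, it helps to evaluate the hinge terms directly. The sketch below assumes the standard MARS hinge h(z) = max(0, z) and uses the knots and coefficients printed in the model summary. The helper names `h`, `x1_term`, and `x6_term` are illustrative.

```r
# MARS hinge function: zero on one side of the knot, linear on the other.
h <- function(z) pmax(0, z)

# Single-term contributions, with knots and coefficients from summary(marsModel).
x1_term <- function(x1) -11.074396 * h(0.621722 - x1)
x6_term <- function(x6)  -1.956821 * h(x6 - 0.361791)

# Evaluate each term across the [0, 1] range of the predictors.
grid <- seq(0, 1, by = 0.01)
x1_span <- diff(range(x1_term(grid)))   # how far the X1 term can move a prediction
x6_span <- diff(range(x6_term(grid)))   # same for the X6 term
print(c(X1 = x1_span, X6 = x6_span))
```

With these coefficients the X6 term can shift a prediction by a little over 1 unit, against nearly 7 units for the X1 term, which matches its low importance score.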
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
On the test set, the MARS model has the highest R-squared (0.685) and the second-lowest RMSE (0.634), while the SVM model has the lowest RMSE (0.630) and the second-highest R-squared (0.613). The RMSE gap is negligible, so the two models perform about equally well, and both outperform KNN.
library(AppliedPredictiveModeling) #from previous assignment
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.3.3
data("ChemicalManufacturingProcess")
set.seed(02180)
preProcessChem <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute", "center", "scale"))
imputeChem <- predict(preProcessChem, ChemicalManufacturingProcess)
index <- createDataPartition(imputeChem$Yield, p=0.8, list = F)
train_chem <- imputeChem[index,]
test_chem <- imputeChem[-index,]
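# The argument was spelled "tuneLenth" in the run that produced the results below,
# so that run presumably used caret's default tuning grid.
knnModel_p2 <- train(Yield ~ ., data = train_chem,
                     method = "knn",
                     preProcess = c("center", "scale"),
                     tuneLength = 10)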
knnPredict_p2 <- predict(knnModel_p2, newdata = test_chem)
KNN_Results_p2 <- postResample(pred = knnPredict_p2, obs = test_chem$Yield)
marsModel_p2 <- earth(train_chem[,2:58], train_chem$Yield)
marsPredict_p2 <- predict(marsModel_p2, newdata = test_chem[,2:58])
MARS_Results_p2 <- postResample(pred = marsPredict_p2, obs = test_chem$Yield)
## SVM Model
svmModel <- train(Yield ~ ., data = train_chem,
                  method = "svmRadial",
                  tuneLength = 15,
                  trControl = trainControl(method = "cv"))
svmPredict <- predict(svmModel, newdata = test_chem)
svm_Results_p2 <- postResample(pred = svmPredict, obs = test_chem$Yield)
model_results_p2 <- rbind(KNN_Results_p2, MARS_Results_p2, svm_Results_p2)
print(model_results_p2)
## RMSE Rsquared MAE
## KNN_Results_p2 0.7295615 0.5108710 0.5967808
## MARS_Results_p2 0.6337139 0.6851108 0.4971686
## svm_Results_p2 0.6300980 0.6131959 0.5345734
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list?
For consistency's sake I will use the MARS model, given its negligible difference from the SVM. Manufacturing process variables dominate the importance list: seven of the ten predictors are process variables and only three are biological materials, and the top five are all process variables.
varImp(marsModel_p2)
## Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess09 65.12906
## ManufacturingProcess13 41.85819
## ManufacturingProcess39 29.59117
## ManufacturingProcess01 25.08185
## BiologicalMaterial05 19.49698
## BiologicalMaterial03 15.05761
## BiologicalMaterial02 10.97102
## ManufacturingProcess28 14.23491
## ManufacturingProcess33 10.83446
Yield does not correlate strongly with most of the key predictors identified above. ManufacturingProcess32 and ManufacturingProcess09 have the highest positive correlations with Yield, and the strongest negative correlation is with ManufacturingProcess13, at about -0.50.
# Pairwise-complete correlations, since the raw data contain missing values.
correlation <- cor(select(ChemicalManufacturingProcess,
                          "ManufacturingProcess32", "ManufacturingProcess09",
                          "ManufacturingProcess13", "ManufacturingProcess39",
                          "ManufacturingProcess01", "BiologicalMaterial05",
                          "BiologicalMaterial03", "BiologicalMaterial02",
                          "ManufacturingProcess28", "ManufacturingProcess33",
                          "Yield"),
                   use = "pairwise.complete.obs")
corrplot::corrplot(correlation, method='square')
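A correlation plot can be complemented by ranking the predictors by their correlation with the response. The sketch below uses a small made-up data frame (`toy`, with hypothetical columns `p1`, `p2`, `p3`) and not the chemical data, so the numbers it prints say nothing about the actual process. It only shows one way the ranking could be computed.

```r
set.seed(42)
n <- 100
# Made-up data standing in for a response and three predictors.
toy <- data.frame(p1 = rnorm(n), p2 = rnorm(n), p3 = rnorm(n))
toy$Yield <- 2 * toy$p1 - 1.5 * toy$p2 + rnorm(n)
toy$p3[c(3, 17)] <- NA   # a few missing values, to exercise the `use` argument

predictors <- setdiff(names(toy), "Yield")
# Correlation of each predictor with the response, ignoring missing pairs.
yield_cor <- cor(toy[, predictors], toy$Yield, use = "pairwise.complete.obs")[, 1]
# Order by absolute correlation so strong negative relationships rank high too.
ranked <- yield_cor[order(abs(yield_cor), decreasing = TRUE)]
print(round(ranked, 2))
```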