7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: y = 10 sin(πx1x2) + 20(x3 − 0.5)^2 + 10x4 + 5x5 + N(0, σ^2), where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 other non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame. One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix of predictors 'x'. Also simulate a large test set to estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
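As a sanity check, the data-generating equation above can be coded directly. This is only a sketch (mlbench.friedman1's internals may differ in details such as column naming), but it makes the five informative and five noise predictors explicit:
# Hand-rolled Friedman #1 generator mirroring the equation above (sketch only)
friedman1_manual <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10)  # 10 uniform predictors; columns 6-10 are noise
  y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5] + rnorm(n, sd = sd)  # plus N(0, sd^2) noise
  list(x = x, y = y)
}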
Tune several models on these data. For example:
kNN Model: Cross-validation selected k = 11, with a CV RMSE of 3.086639. On the test set the RMSE was 3.1222641, the R2 was 0.6690472 (66.9% of the variance in y explained by the kNN model), and the MAE, the average absolute difference between predictions and actual values, was 2.496365. From these numbers we can say the model is decent, but not outstanding, at modeling these data.
set.seed(200)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.238598 0.5836232 2.705822
## 7 3.117335 0.6295372 2.561052
## 9 3.100423 0.6590940 2.524483
## 11 3.086639 0.6822198 2.506584
## 13 3.094904 0.6902613 2.504433
## 15 3.116059 0.7045172 2.516131
## 17 3.129874 0.7133067 2.529370
## 19 3.151840 0.7183283 2.546422
## 21 3.175787 0.7209301 2.574113
## 23 3.208213 0.7146199 2.611285
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.1222641 0.6690472 2.4963650
SVM Model: Cross-validation selected sigma = 0.06299324 and C = 8, with a CV RMSE of 1.915665. On the test set the RMSE was 2.0541197, the R2 was 0.8290353 (82.9% of the variance in y explained by the SVM model), and the MAE was 1.5586411. From these numbers we can say the model performed much better than the kNN model, with a lower RMSE and a higher R2.
set.seed(200)
svmRTuned <- train(x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.525164 0.7810576 2.010680
## 0.50 2.270567 0.7944850 1.794902
## 1.00 2.099356 0.8155574 1.659376
## 2.00 2.005858 0.8302852 1.578799
## 4.00 1.934650 0.8435677 1.528373
## 8.00 1.915665 0.8475605 1.528648
## 16.00 1.923914 0.8463074 1.535991
## 32.00 1.923914 0.8463074 1.535991
## 64.00 1.923914 0.8463074 1.535991
## 128.00 1.923914 0.8463074 1.535991
##
## Tuning parameter 'sigma' was held constant at a value of 0.06299324
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06299324 and C = 8.
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 8
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0629932410345396
##
## Number of Support Vectors : 152
##
## Objective Function Value : -72.63
## Training error : 0.009177
svmPred <- predict(svmRTuned, newdata = testData$x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = svmPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0541197 0.8290353 1.5586411
MARS: Cross-validation selected nprune = 14 and degree = 2, with a CV RMSE of 1.261797. On the test set the RMSE was 1.1722635, the R2 was 0.9448890 (about 94.5% of the variance in y explained by the MARS model), and the MAE was 0.9324923. From these numbers we can say the model performed better than both the kNN model and the SVM model, with an even lower RMSE and an even higher R2.
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Fix the seed so that the results can be reproduced
set.seed(200)
marsTuned <- train(x = trainingData$x,
y = trainingData$y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsTuned
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.188280 0.3042527 3.460689
## 1 3 3.551182 0.4999832 2.837116
## 1 4 2.653143 0.7167280 2.128222
## 1 5 2.405769 0.7562160 1.948161
## 1 6 2.295006 0.7754603 1.853199
## 1 7 1.771950 0.8611767 1.391357
## 1 8 1.647182 0.8774867 1.299564
## 1 9 1.609816 0.8837307 1.299705
## 1 10 1.635035 0.8798236 1.309436
## 1 11 1.571915 0.8896147 1.260711
## 1 12 1.571561 0.8898750 1.253077
## 1 13 1.567577 0.8906927 1.250795
## 1 14 1.571673 0.8909652 1.245508
## 1 15 1.571673 0.8909652 1.245508
## 1 16 1.571673 0.8909652 1.245508
## 1 17 1.571673 0.8909652 1.245508
## 1 18 1.571673 0.8909652 1.245508
## 1 19 1.571673 0.8909652 1.245508
## 1 20 1.571673 0.8909652 1.245508
## 1 21 1.571673 0.8909652 1.245508
## 1 22 1.571673 0.8909652 1.245508
## 1 23 1.571673 0.8909652 1.245508
## 1 24 1.571673 0.8909652 1.245508
## 1 25 1.571673 0.8909652 1.245508
## 1 26 1.571673 0.8909652 1.245508
## 1 27 1.571673 0.8909652 1.245508
## 1 28 1.571673 0.8909652 1.245508
## 1 29 1.571673 0.8909652 1.245508
## 1 30 1.571673 0.8909652 1.245508
## 1 31 1.571673 0.8909652 1.245508
## 1 32 1.571673 0.8909652 1.245508
## 1 33 1.571673 0.8909652 1.245508
## 1 34 1.571673 0.8909652 1.245508
## 1 35 1.571673 0.8909652 1.245508
## 1 36 1.571673 0.8909652 1.245508
## 1 37 1.571673 0.8909652 1.245508
## 1 38 1.571673 0.8909652 1.245508
## 2 2 4.188280 0.3042527 3.460689
## 2 3 3.551182 0.4999832 2.837116
## 2 4 2.615256 0.7216809 2.128763
## 2 5 2.344223 0.7683855 1.890080
## 2 6 2.275048 0.7762472 1.807779
## 2 7 1.841464 0.8418935 1.457945
## 2 8 1.641647 0.8839822 1.288520
## 2 9 1.535119 0.9002991 1.214772
## 2 10 1.473254 0.9101555 1.158761
## 2 11 1.379476 0.9207735 1.080991
## 2 12 1.285380 0.9283193 1.033426
## 2 13 1.267261 0.9328905 1.014726
## 2 14 1.261797 0.9327541 1.009821
## 2 15 1.266663 0.9320714 1.005751
## 2 16 1.270858 0.9322465 1.009757
## 2 17 1.263778 0.9327687 1.007653
## 2 18 1.263778 0.9327687 1.007653
## 2 19 1.263778 0.9327687 1.007653
## 2 20 1.263778 0.9327687 1.007653
## 2 21 1.263778 0.9327687 1.007653
## 2 22 1.263778 0.9327687 1.007653
## 2 23 1.263778 0.9327687 1.007653
## 2 24 1.263778 0.9327687 1.007653
## 2 25 1.263778 0.9327687 1.007653
## 2 26 1.263778 0.9327687 1.007653
## 2 27 1.263778 0.9327687 1.007653
## 2 28 1.263778 0.9327687 1.007653
## 2 29 1.263778 0.9327687 1.007653
## 2 30 1.263778 0.9327687 1.007653
## 2 31 1.263778 0.9327687 1.007653
## 2 32 1.263778 0.9327687 1.007653
## 2 33 1.263778 0.9327687 1.007653
## 2 34 1.263778 0.9327687 1.007653
## 2 35 1.263778 0.9327687 1.007653
## 2 36 1.263778 0.9327687 1.007653
## 2 37 1.263778 0.9327687 1.007653
## 2 38 1.263778 0.9327687 1.007653
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
marsPred <- predict(marsTuned, newdata = testData$x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.1722635 0.9448890 0.9324923
Neural Networks: Cross-validation selected size = 4, decay = 0.01, and bag = FALSE, with a CV RMSE of 1.926277. On the test set the RMSE was 2.0603901, the R2 was 0.8320669 (83.2% of the variance in y explained by the averaged neural network), and the MAE was 1.5289876. From these numbers we can say the model performed better than the kNN model and very close to the SVM model, but underperformed the MARS model.
# Check for highly correlated predictors (the result is not used to filter columns here)
tooHigh <- findCorrelation(cor(trainingData$x), cutoff = .75)
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1), .size = 1:10, .bag = FALSE)
set.seed(200)
nnetTune <- train(x = trainingData$x,
y = trainingData$y,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500)
## Warning: executing %dopar% sequentially: no parallel backend registered
nnetTune
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.409283 0.7621183 1.899438
## 0.00 2 2.422970 0.7596059 1.940174
## 0.00 3 2.050680 0.8164736 1.639116
## 0.00 4 1.945570 0.8359932 1.553962
## 0.00 5 2.459608 0.7843926 1.852817
## 0.00 6 3.619609 0.6518455 2.367746
## 0.00 7 3.732346 0.5465485 2.505487
## 0.00 8 5.341921 0.4925666 3.027228
## 0.00 9 4.448181 0.5628776 2.637284
## 0.00 10 3.736545 0.6189292 2.529286
## 0.01 1 2.380838 0.7641924 1.871203
## 0.01 2 2.456920 0.7487966 1.925584
## 0.01 3 2.152614 0.8037282 1.690702
## 0.01 4 1.926277 0.8453345 1.547266
## 0.01 5 2.129094 0.8103702 1.698914
## 0.01 6 2.140650 0.8117168 1.698805
## 0.01 7 2.414589 0.7646608 1.911319
## 0.01 8 2.366556 0.7741779 1.873354
## 0.01 9 2.368192 0.7641654 1.781225
## 0.01 10 2.336267 0.7855705 1.857180
## 0.10 1 2.392300 0.7614548 1.873846
## 0.10 2 2.437038 0.7557138 1.918843
## 0.10 3 2.136582 0.8043180 1.702665
## 0.10 4 2.009698 0.8245209 1.574401
## 0.10 5 2.015296 0.8345946 1.586740
## 0.10 6 2.038841 0.8283220 1.586032
## 0.10 7 2.129954 0.8133844 1.707177
## 0.10 8 2.148353 0.8099883 1.690601
## 0.10 9 2.254190 0.7942883 1.759231
## 0.10 10 2.359693 0.7719593 1.873586
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.01 and bag = FALSE.
nnetPred <- predict(nnetTune, newdata = testData$x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = nnetPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0603901 0.8320669 1.5289876
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
The MARS (Multivariate Adaptive Regression Splines) model gave the best performance, with the lowest test RMSE (1.1722635) and MAE (0.9324923) and the highest R2 (explaining ~94.5% of the variance). This tells us the MARS model predicted the actual values accurately, since the R2 was very close to 1, and the low MAE tells us the differences between actual and predicted values were small.
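This ranking can be cross-checked against the resampling profiles: each model was tuned with set.seed(200) and 10-fold cross-validation, so caret's resamples can collect their CV results side by side. A minimal sketch using the train objects above (output not shown):
# Compare cross-validated performance of the four tuned models (sketch)
cvResults <- resamples(list(kNN = knnModel, SVM = svmRTuned,
                            MARS = marsTuned, avNNet = nnetTune))
summary(cvResults)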
From the final model output below, we can see that MARS did select the informative predictors (those named X1–X5). This shows that MARS performed feature selection, excluding the noise predictors (X6–X10).
marsTuned$finalModel
## Selected 14 of 18 terms, and 5 of 10 predictors (nprune=14)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 4
## GCV 1.62945 RSS 225.8601 GRSq 0.9338437 RSq 0.953688
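Variable importance tells the same story; as a quick check (output omitted), only X1–X5 should receive nonzero importance:
# Importance ranking for the tuned MARS model; the noise predictors
# X6-X10 should contribute nothing
varImp(marsTuned)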
7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
data(ChemicalManufacturingProcess)
impute <- preProcess(ChemicalManufacturingProcess, "knnImpute")
bio <- predict(impute, ChemicalManufacturingProcess)
filtered_bio <- bio[, -nearZeroVar(bio)]
set.seed(1000)
splits2 <- createDataPartition(filtered_bio$Yield, p = .80, times = 1, list = FALSE)
training <- filtered_bio[splits2, ]
testing <- filtered_bio[-splits2, ]
# Separate predictors and outcome
training_x <- training[, names(training) != "Yield"]
training_y <- training$Yield
test_x <- testing[, names(testing) != "Yield"]
test_y <- testing$Yield
kNN Model: Cross-validation selected k = 7, with a CV RMSE of 0.7315242. On the test set the RMSE was 0.707341, the R2 was 0.439959 (44.0% of the variance in y explained by the kNN model), and the MAE was 0.597708. From these numbers we can say this is probably not the best model for predicting this response.
set.seed(200)
knnModel2 <- train(x = training_x,
y = training_y,
method = "knn",
preProc = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10)
knnModel2
## k-Nearest Neighbors
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.7323182 0.5025668 0.5986016
## 7 0.7315242 0.5243653 0.5820750
## 9 0.7339834 0.5156104 0.5870805
## 11 0.7454995 0.5164099 0.6068725
## 13 0.7481804 0.5179680 0.6128097
## 15 0.7558579 0.5084821 0.6188031
## 17 0.7590126 0.5077806 0.6160498
## 19 0.7691120 0.4923293 0.6231535
## 21 0.7743640 0.4994307 0.6290775
## 23 0.7843231 0.4926474 0.6354539
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
knnPred2 <- predict(knnModel2, newdata = test_x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = knnPred2, obs = test_y)
## RMSE Rsquared MAE
## 0.707341 0.439959 0.597708
Neural Networks: Cross-validation selected size = 6, decay = 0.01, and bag = FALSE, with a CV RMSE of 0.6297313. On the test set the RMSE was 0.9090523, the R2 was 0.2950409 (only 29.5% of the variance in y explained by the averaged neural network), and the MAE was 0.6834901. The test RMSE is much worse than both its own CV RMSE and the kNN test RMSE, suggesting the network generalizes poorly here - overall, not good at predicting the actual values.
tooHigh <- findCorrelation(cor(training_x), cutoff = .75)
train_x_nnet <- training_x[, -tooHigh]
test_x_nnet <- test_x[, -tooHigh]
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1), .size = 1:10, .bag = FALSE)
set.seed(200)
nnetTune2 <- train(x = train_x_nnet,
y = training_y,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(train_x_nnet) + 1) + 10 + 1,
maxit = 500)
nnetTune2
## Model Averaged Neural Network
##
## 144 samples
## 34 predictor
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 0.8102564 0.4207664 0.6456538
## 0.00 2 0.7521505 0.5237831 0.5982847
## 0.00 3 0.8127144 0.4671907 0.6529255
## 0.00 4 0.7426528 0.5466168 0.5990348
## 0.00 5 0.7526093 0.5564751 0.6030258
## 0.00 6 0.8217485 0.4643396 0.6693577
## 0.00 7 0.7481465 0.5319397 0.6242967
## 0.00 8 0.7077327 0.5665477 0.5666518
## 0.00 9 0.7014477 0.5776377 0.5820450
## 0.00 10 0.7032134 0.5589602 0.5634196
## 0.01 1 0.8006842 0.4430332 0.6422199
## 0.01 2 0.6982785 0.5887453 0.5513743
## 0.01 3 0.7343272 0.5454770 0.5945528
## 0.01 4 0.7574609 0.5002943 0.6176227
## 0.01 5 0.6733725 0.6117187 0.5538290
## 0.01 6 0.6297313 0.6457762 0.5171542
## 0.01 7 0.6617782 0.6179006 0.5332240
## 0.01 8 0.6787046 0.5982176 0.5505471
## 0.01 9 0.6733624 0.6075216 0.5464198
## 0.01 10 0.6790164 0.6060510 0.5616365
## 0.10 1 0.8042218 0.4596355 0.6580454
## 0.10 2 0.6911396 0.5825341 0.5555202
## 0.10 3 0.7390385 0.5473842 0.5904390
## 0.10 4 0.6830727 0.6272789 0.5499197
## 0.10 5 0.6885004 0.6025470 0.5720166
## 0.10 6 0.6760863 0.6090577 0.5508936
## 0.10 7 0.6835250 0.6062134 0.5475977
## 0.10 8 0.6730756 0.6008292 0.5648896
## 0.10 9 0.6489555 0.6273300 0.5321331
## 0.10 10 0.6834932 0.5914035 0.5600145
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 6, decay = 0.01 and bag = FALSE.
nnetPred2 <- predict(nnetTune2, newdata = test_x_nnet)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = nnetPred2, obs = test_y)
## RMSE Rsquared MAE
## 0.9090523 0.2950409 0.6834901
MARS: Cross-validation selected nprune = 3 and degree = 2, with a CV RMSE of 0.6034844. On the test set the RMSE was 0.7284024, the R2 was 0.4554852 (45.5% of the variance in y explained by the MARS model), and the MAE was 0.5794772. From these numbers we can say the model clearly outperformed the neural network and had a higher R2 than the kNN model, though its test RMSE was slightly higher than kNN's - overall, still not great at predicting the actual values.
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Fix the seed so that the results can be reproduced
set.seed(200)
marsTuned2 <- train(x = training_x,
y = training_y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsTuned2
## Multivariate Adaptive Regression Spline
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.7351587 0.4816841 0.5881771
## 1 3 0.6645810 0.5759239 0.5427819
## 1 4 0.6554259 0.5858700 0.5459351
## 1 5 0.6438461 0.6011463 0.5399213
## 1 6 0.6894082 0.5576078 0.5676534
## 1 7 0.7406958 0.5318539 0.6035565
## 1 8 0.7767046 0.4956401 0.6383589
## 1 9 0.7756256 0.4982759 0.6222046
## 1 10 0.7464739 0.5462954 0.5793487
## 1 11 0.7701961 0.5202253 0.5946750
## 1 12 0.8004951 0.4842799 0.6199716
## 1 13 0.8142147 0.4803955 0.6254273
## 1 14 0.8388045 0.4802877 0.6328067
## 1 15 0.8453222 0.4751444 0.6326985
## 1 16 0.8414692 0.4800565 0.6301831
## 1 17 0.8414692 0.4800565 0.6301831
## 1 18 0.8378011 0.4872107 0.6305787
## 1 19 0.8312784 0.4913901 0.6281800
## 1 20 0.8312784 0.4913901 0.6281800
## 1 21 0.8312784 0.4913901 0.6281800
## 1 22 0.8312784 0.4913901 0.6281800
## 1 23 0.8312784 0.4913901 0.6281800
## 1 24 0.8312784 0.4913901 0.6281800
## 1 25 0.8312784 0.4913901 0.6281800
## 1 26 0.8312784 0.4913901 0.6281800
## 1 27 0.8312784 0.4913901 0.6281800
## 1 28 0.8312784 0.4913901 0.6281800
## 1 29 0.8312784 0.4913901 0.6281800
## 1 30 0.8312784 0.4913901 0.6281800
## 1 31 0.8312784 0.4913901 0.6281800
## 1 32 0.8312784 0.4913901 0.6281800
## 1 33 0.8312784 0.4913901 0.6281800
## 1 34 0.8312784 0.4913901 0.6281800
## 1 35 0.8312784 0.4913901 0.6281800
## 1 36 0.8312784 0.4913901 0.6281800
## 1 37 0.8312784 0.4913901 0.6281800
## 1 38 0.8312784 0.4913901 0.6281800
## 2 2 0.7351587 0.4816841 0.5881771
## 2 3 0.6034844 0.6550300 0.4948902
## 2 4 0.6256130 0.6304572 0.5143866
## 2 5 0.6778739 0.5913431 0.5477765
## 2 6 0.6726250 0.5905856 0.5411828
## 2 7 0.6909968 0.5598967 0.5686644
## 2 8 0.7556908 0.4943548 0.5923824
## 2 9 0.7571614 0.5012385 0.5898781
## 2 10 0.7275396 0.5369153 0.5643234
## 2 11 0.7498740 0.5294244 0.5786155
## 2 12 0.7411371 0.5387883 0.5658056
## 2 13 0.7691828 0.5233715 0.5777534
## 2 14 0.7870945 0.5093099 0.5873846
## 2 15 0.8030363 0.4914380 0.6045851
## 2 16 0.8131474 0.4946378 0.6150707
## 2 17 0.8233712 0.4879887 0.6202039
## 2 18 0.8355710 0.4763152 0.6278927
## 2 19 0.8485194 0.4692869 0.6305074
## 2 20 0.8526728 0.4658218 0.6316808
## 2 21 0.8749133 0.4510601 0.6392113
## 2 22 0.8749133 0.4510601 0.6392113
## 2 23 0.8749133 0.4510601 0.6392113
## 2 24 0.8749133 0.4510601 0.6392113
## 2 25 0.8749133 0.4510601 0.6392113
## 2 26 0.8749133 0.4510601 0.6392113
## 2 27 0.8786266 0.4505728 0.6435548
## 2 28 0.8786266 0.4505728 0.6435548
## 2 29 0.8786266 0.4505728 0.6435548
## 2 30 0.8786266 0.4505728 0.6435548
## 2 31 0.8786266 0.4505728 0.6435548
## 2 32 0.8786266 0.4505728 0.6435548
## 2 33 0.8786266 0.4505728 0.6435548
## 2 34 0.8786266 0.4505728 0.6435548
## 2 35 0.8786266 0.4505728 0.6435548
## 2 36 0.8786266 0.4505728 0.6435548
## 2 37 0.8786266 0.4505728 0.6435548
## 2 38 0.8786266 0.4505728 0.6435548
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 2.
marsPred2 <- predict(marsTuned2, newdata = test_x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = marsPred2, obs = test_y)
## RMSE Rsquared MAE
## 0.7284024 0.4554852 0.5794772
SVM Model: Cross-validation selected sigma = 0.01399323 and C = 4, with a CV RMSE of 0.5964908. On the test set the RMSE was 0.6802913, the R2 was 0.5225403 (52.3% of the variance in y explained by the SVM model), and the MAE was 0.5200695. From these numbers we can say the model performed the best of all the models, with the lowest RMSE and the highest R2.
set.seed(200)
svmRTuned2 <- train(x = training_x,
y = training_y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned2
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7580621 0.5492036 0.6176256
## 0.50 0.6939783 0.5981826 0.5693265
## 1.00 0.6457835 0.6530946 0.5296596
## 2.00 0.6134673 0.6836176 0.5011625
## 4.00 0.5964908 0.6925414 0.4835798
## 8.00 0.6016593 0.6770286 0.4868337
## 16.00 0.5989691 0.6802140 0.4863216
## 32.00 0.5989691 0.6802140 0.4863216
## 64.00 0.5989691 0.6802140 0.4863216
## 128.00 0.5989691 0.6802140 0.4863216
## 256.00 0.5989691 0.6802140 0.4863216
## 512.00 0.5989691 0.6802140 0.4863216
## 1024.00 0.5989691 0.6802140 0.4863216
## 2048.00 0.5989691 0.6802140 0.4863216
##
## Tuning parameter 'sigma' was held constant at a value of 0.01399323
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01399323 and C = 4.
svmRTuned2$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 4
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0139932256245438
##
## Number of Support Vectors : 128
##
## Objective Function Value : -77.2058
## Training error : 0.026496
svmPred2 <- predict(svmRTuned2, newdata = test_x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = svmPred2, obs = test_y)
## RMSE Rsquared MAE
## 0.6802913 0.5225403 0.5200695
(a) Which nonlinear regression model gives the optimal resampling and test set performance? Based on the resampling and test set results across the nonlinear regression models, the SVM was the best-performing model overall, with the lowest RMSE (0.6802913) and MAE (0.5200695) and the highest R2 (explaining ~52.3% of the variance). This tells us that although the SVM model performed best of the four, it was still not very accurate at predicting the actual values: an R2 around 0.5 leaves a large amount of variance unexplained.
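The resampling profiles support the same conclusion; a minimal sketch collecting the cross-validated results of the four train objects above (output not shown):
# Side-by-side CV comparison for the chemical manufacturing models (sketch)
chemResults <- resamples(list(kNN = knnModel2, avNNet = nnetTune2,
                              MARS = marsTuned2, SVM = svmRTuned2))
summary(chemResults)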
(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
The most important predictors in the nonlinear SVM model trained above are:
ManufacturingProcess32
BiologicalMaterial06
ManufacturingProcess36
ManufacturingProcess13
BiologicalMaterial03
BiologicalMaterial12
ManufacturingProcess17
BiologicalMaterial02
ManufacturingProcess31
ManufacturingProcess09
The most important predictors from the optimal linear model (Exercise 6.3) were:
ManufacturingProcess32
ManufacturingProcess36
ManufacturingProcess13
ManufacturingProcess09
ManufacturingProcess17
ManufacturingProcess06
BiologicalMaterial02
BiologicalMaterial06
ManufacturingProcess11
ManufacturingProcess33
The process predictors dominate the list over the biological predictors for both the nonlinear and linear models, although the linear model's top 10 contains more process predictors than the nonlinear model's (8 vs. 6). The top predictor for both models was ManufacturingProcess32. The two models also share several other top predictors, just not in the same order of importance (see the sketch after this list for how both rankings were retrieved); these include:
ManufacturingProcess36
ManufacturingProcess13
ManufacturingProcess09
ManufacturingProcess17
BiologicalMaterial06
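The nonlinear ranking below was obtained with varImp on the tuned SVM. The linear ranking was pulled the same way from the Exercise 6.3 model; a sketch, where linearTuned is a hypothetical placeholder for that exercise's train object:
# varImp on the optimal linear model from Exercise 6.3
# ('linearTuned' is a placeholder; substitute that exercise's actual object)
varImp(linearTuned)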
varImp(svmRTuned2)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 87.83
## ManufacturingProcess36 84.94
## ManufacturingProcess13 80.51
## BiologicalMaterial03 73.41
## BiologicalMaterial12 68.92
## ManufacturingProcess17 68.72
## BiologicalMaterial02 66.62
## ManufacturingProcess31 64.95
## ManufacturingProcess09 63.74
## ManufacturingProcess06 63.35
## ManufacturingProcess11 49.94
## BiologicalMaterial04 49.41
## ManufacturingProcess33 46.77
## ManufacturingProcess29 45.64
## BiologicalMaterial11 42.08
## BiologicalMaterial09 39.23
## BiologicalMaterial08 39.20
## BiologicalMaterial01 35.96
## ManufacturingProcess30 33.76
ggplot(varImp(svmRTuned2), top = 15)
(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
We can see from the correlation results that most of the relationships between the top predictors and the response are moderate or weak: ManufacturingProcess32 (corr = 0.6343301), BiologicalMaterial06 (corr = 0.4990380), ManufacturingProcess36 (corr = -0.5846408), ManufacturingProcess13 (corr = -0.4871624), BiologicalMaterial03 (corr = 0.4518573), and BiologicalMaterial12 (corr = 0.3565016).
The relationship between ManufacturingProcess32 (corr = 0.6343301) and the response is positive and moderately strong. The relationships for BiologicalMaterial06 (corr = 0.4990380), BiologicalMaterial03 (corr = 0.4518573), and BiologicalMaterial12 (corr = 0.3565016) are positive and moderate to weak. The relationships for ManufacturingProcess36 (corr = -0.5846408) and ManufacturingProcess13 (corr = -0.4871624) are negative and moderate. A positive association generally means increasing the value of the predictor would increase the yield, while a negative association means increasing the predictor would decrease the yield. However, since these relationships are only moderate, this will not hold uniformly across predictors and observations.
These plots do reveal some intuition about the biological and process predictors and their relationship with yield: knowing which predictors correlate with product yield can help improve production. The manufacturer may want to raise the settings of the positively correlated process predictors, and reduce or monitor those with negative correlations, to push yield higher.
top_predictors <- c("ManufacturingProcess32", "BiologicalMaterial06", "ManufacturingProcess36",
                    "ManufacturingProcess13", "BiologicalMaterial03", "BiologicalMaterial12")
cor_results <- cor(training[, top_predictors], training$Yield)
cor_results
## [,1]
## ManufacturingProcess32 0.6343301
## BiologicalMaterial06 0.4990380
## ManufacturingProcess36 -0.5846408
## ManufacturingProcess13 -0.4871624
## BiologicalMaterial03 0.4518573
## BiologicalMaterial12 0.3565016
plot_data <- training[, c(top_predictors, "Yield")]
plot_data_long <- pivot_longer(plot_data, cols = all_of(top_predictors),
                               names_to = "Predictor", values_to = "Value")
ggplot(plot_data_long, aes(x = Value, y = Yield)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", se = FALSE) +
facet_wrap(~ Predictor, scales = "free_x") +
labs(title = "Relationship Between Top Predictors and Yield",
x = "Predictor Value",
y = "Yield")
## `geom_smooth()` using formula = 'y ~ x'