Homework 8

All libraries needed for the Homework

library(fpp3)

## Warning: package 'fpp3' was built under R version 4.3.3

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr

## -- Attaching packages -------------------------------------------- fpp3 1.0.0 --

## v tibble      3.2.1     v tsibble     1.1.5
## v dplyr       1.1.2     v tsibbledata 0.4.1
## v tidyr       1.3.0     v feasts      0.3.2
## v lubridate   1.9.2     v fable       0.3.4
## v ggplot2     3.5.1     v fabletools  0.4.2

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'tsibble' was built under R version 4.3.3

## Warning: package 'tsibbledata' was built under R version 4.3.3

## Warning: package 'feasts' was built under R version 4.3.3

## Warning: package 'fabletools' was built under R version 4.3.3

## Warning: package 'fable' was built under R version 4.3.3

## -- Conflicts ------------------------------------------------- fpp3_conflicts --
## x lubridate::date()    masks base::date()
## x dplyr::filter()      masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval()  masks lubridate::interval()
## x dplyr::lag()         masks stats::lag()
## x tsibble::setdiff()   masks base::setdiff()
## x tsibble::union()     masks base::union()

library(forecast)

## Warning: package 'forecast' was built under R version 4.3.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(tidyverse)

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v forcats 1.0.0     v readr   2.1.4
## v purrr   1.0.1     v stringr 1.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter()     masks stats::filter()
## x tsibble::interval() masks lubridate::interval()
## x dplyr::lag()        masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(lubridate)
library(tsibble)
library(pracma)

## Warning: package 'pracma' was built under R version 4.3.3

## 
## Attaching package: 'pracma'
## 
## The following object is masked from 'package:purrr':
## 
##     cross

library(mlbench)

## Warning: package 'mlbench' was built under R version 4.3.3

library(corrplot)

## corrplot 0.92 loaded

library(e1071)

## Warning: package 'e1071' was built under R version 4.3.1

## 
## Attaching package: 'e1071'
## 
## The following object is masked from 'package:pracma':
## 
##     sigmoid
## 
## The following object is masked from 'package:fabletools':
## 
##     interpolate

library(psych)

## Warning: package 'psych' was built under R version 4.3.1

## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:pracma':
## 
##     logit, polar
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(imputeTS)

## Warning: package 'imputeTS' was built under R version 4.3.3

library(gridExtra)

## Warning: package 'gridExtra' was built under R version 4.3.1

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

library(caret)

## Warning: package 'caret' was built under R version 4.3.1

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
## 
## The following objects are masked from 'package:fabletools':
## 
##     MAE, RMSE

7.2 - Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data:

y=10sin(πx1x2)+20(x3−0.5)2+10x4+5x5+N(0,σ2)

where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

Tuning the Support Vector Machine (SVM) model

set.seed(100)

#tuning the svm model
svm_model <- train(trainingData$x, trainingData$y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))

#displaying model results
svm_model

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.530787  0.7922715  2.013175
##      0.50  2.259539  0.8064569  1.789962
##      1.00  2.099789  0.8274242  1.656154
##      2.00  2.002943  0.8412934  1.583791
##      4.00  1.943618  0.8504425  1.546586
##      8.00  1.918711  0.8547582  1.532981
##     16.00  1.920651  0.8536189  1.536116
##     32.00  1.920651  0.8536189  1.536116
##     64.00  1.920651  0.8536189  1.536116
##    128.00  1.920651  0.8536189  1.536116
##    256.00  1.920651  0.8536189  1.536116
##    512.00  1.920651  0.8536189  1.536116
##   1024.00  1.920651  0.8536189  1.536116
##   2048.00  1.920651  0.8536189  1.536116
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06509124
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06509124 and C = 8.

#Making predictions using the test data and retrieving the MSE,Rsquared and MAE metrics
svm_prediction <- predict(svm_model, testData$x)
postResample(svm_prediction, testData$y)

##      RMSE  Rsquared       MAE 
## 2.0631908 0.8275736 1.5662213

Tuning the Multivariate Adaptive Regression Splines (MARS) model

# tuning grid
mars_grid <- expand.grid(.degree = 1:2, .nprune = 2:38)

set.seed(100)


#tuning the svm model
mars_model <- train(trainingData$x, trainingData$y,
                  method = "earth",
                  tuneGrid = mars_grid,
                  trControl = trainControl(method = "cv"))

## Loading required package: earth

## Warning: package 'earth' was built under R version 4.3.3

## Loading required package: Formula

## Loading required package: plotmo

## Warning: package 'plotmo' was built under R version 4.3.3

## Loading required package: plotrix

## Warning: package 'plotrix' was built under R version 4.3.2

## 
## Attaching package: 'plotrix'

## The following object is masked from 'package:psych':
## 
##     rescale

#displaying model results
mars_model

## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.327937  0.2544880  3.600474
##   1        3      3.572450  0.4912720  2.895811
##   1        4      2.596841  0.7183600  2.106341
##   1        5      2.370161  0.7659777  1.918669
##   1        6      2.276141  0.7881481  1.810001
##   1        7      1.766728  0.8751831  1.390215
##   1        8      1.780946  0.8723243  1.401345
##   1        9      1.665091  0.8819775  1.325515
##   1       10      1.663804  0.8821283  1.327657
##   1       11      1.657738  0.8822967  1.331730
##   1       12      1.653784  0.8827903  1.331504
##   1       13      1.648496  0.8823663  1.316407
##   1       14      1.639073  0.8841742  1.312833
##   1       15      1.639073  0.8841742  1.312833
##   1       16      1.639073  0.8841742  1.312833
##   1       17      1.639073  0.8841742  1.312833
##   1       18      1.639073  0.8841742  1.312833
##   1       19      1.639073  0.8841742  1.312833
##   1       20      1.639073  0.8841742  1.312833
##   1       21      1.639073  0.8841742  1.312833
##   1       22      1.639073  0.8841742  1.312833
##   1       23      1.639073  0.8841742  1.312833
##   1       24      1.639073  0.8841742  1.312833
##   1       25      1.639073  0.8841742  1.312833
##   1       26      1.639073  0.8841742  1.312833
##   1       27      1.639073  0.8841742  1.312833
##   1       28      1.639073  0.8841742  1.312833
##   1       29      1.639073  0.8841742  1.312833
##   1       30      1.639073  0.8841742  1.312833
##   1       31      1.639073  0.8841742  1.312833
##   1       32      1.639073  0.8841742  1.312833
##   1       33      1.639073  0.8841742  1.312833
##   1       34      1.639073  0.8841742  1.312833
##   1       35      1.639073  0.8841742  1.312833
##   1       36      1.639073  0.8841742  1.312833
##   1       37      1.639073  0.8841742  1.312833
##   1       38      1.639073  0.8841742  1.312833
##   2        2      4.327937  0.2544880  3.600474
##   2        3      3.572450  0.4912720  2.895811
##   2        4      2.661826  0.7070510  2.173471
##   2        5      2.404015  0.7578971  1.975387
##   2        6      2.243927  0.7914805  1.783072
##   2        7      1.856336  0.8605482  1.435682
##   2        8      1.754607  0.8763186  1.396841
##   2        9      1.603578  0.8938666  1.261361
##   2       10      1.492421  0.9084998  1.168700
##   2       11      1.317350  0.9292504  1.033926
##   2       12      1.304327  0.9320133  1.019108
##   2       13      1.277510  0.9323681  1.002927
##   2       14      1.269626  0.9350024  1.003346
##   2       15      1.266217  0.9359400  1.013893
##   2       16      1.268470  0.9354868  1.011414
##   2       17      1.268470  0.9354868  1.011414
##   2       18      1.268470  0.9354868  1.011414
##   2       19      1.268470  0.9354868  1.011414
##   2       20      1.268470  0.9354868  1.011414
##   2       21      1.268470  0.9354868  1.011414
##   2       22      1.268470  0.9354868  1.011414
##   2       23      1.268470  0.9354868  1.011414
##   2       24      1.268470  0.9354868  1.011414
##   2       25      1.268470  0.9354868  1.011414
##   2       26      1.268470  0.9354868  1.011414
##   2       27      1.268470  0.9354868  1.011414
##   2       28      1.268470  0.9354868  1.011414
##   2       29      1.268470  0.9354868  1.011414
##   2       30      1.268470  0.9354868  1.011414
##   2       31      1.268470  0.9354868  1.011414
##   2       32      1.268470  0.9354868  1.011414
##   2       33      1.268470  0.9354868  1.011414
##   2       34      1.268470  0.9354868  1.011414
##   2       35      1.268470  0.9354868  1.011414
##   2       36      1.268470  0.9354868  1.011414
##   2       37      1.268470  0.9354868  1.011414
##   2       38      1.268470  0.9354868  1.011414
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.

#Making predictions using the test data and retrieving the MSE,Rsquared and MAE metrics
mars_prediction <- predict(mars_model, testData$x)
postResample(mars_prediction, testData$y)

##      RMSE  Rsquared       MAE 
## 1.1589948 0.9460418 0.9250230

Tuning the K-Nearest Neighbors (KNN) model

set.seed(100)


#tuning the KNN model
knn_model <- train(x = trainingData$x, 
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
#displaying model results
knn_model

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.569425  0.4967799  2.910695
##    7  3.456938  0.5295171  2.810768
##    9  3.360125  0.5683839  2.734616
##   11  3.305196  0.5962254  2.687646
##   13  3.294275  0.6130595  2.670403
##   15  3.275696  0.6306881  2.648741
##   17  3.273680  0.6423614  2.650513
##   19  3.275232  0.6543585  2.641627
##   21  3.278413  0.6633100  2.643754
##   23  3.301181  0.6662048  2.667520
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.

#Making predictions using the test data and retrieving the MSE,Rsquared and MAE metrics
knn_prediction <- predict(knn_model, newdata = testData$x)
postResample(pred = knn_prediction, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Tuning the Neural Network (NN) model

set.seed(100)

#Neural Network grid
neural_network_grid <- expand.grid(.decay = c(0, 0.01, .1),.size = c(1:10))


# cross-validation to better estimates
ctrl <- trainControl(method = "cv", number = 11)

#tuning the Neural Network model
neural_network_model <- train(trainingData$x, trainingData$y,
                  method = "nnet",
                  tuneGrid = neural_network_grid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                  maxit = 500)
#displaying model results
neural_network_model

## Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (11 fold) 
## Summary of sample sizes: 181, 182, 181, 182, 183, 182, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.363586  0.7737360  1.894008
##   0.00    2    2.538511  0.7393855  2.083142
##   0.00    3    2.209800  0.7875106  1.785130
##   0.00    4    2.320228  0.7779683  1.901525
##   0.00    5    2.495764  0.7632688  1.955534
##   0.00    6    3.504768  0.6839476  2.296570
##   0.00    7    3.073165  0.6813657  2.380531
##   0.00    8    5.504927  0.4825886  3.469773
##   0.00    9    5.494302  0.6302821  3.553296
##   0.00   10    3.753644  0.5661750  3.031287
##   0.01    1    2.362630  0.7733250  1.892652
##   0.01    2    2.634150  0.7279157  2.161138
##   0.01    3    2.170091  0.8116434  1.747423
##   0.01    4    2.186273  0.8135195  1.768255
##   0.01    5    2.576585  0.7606083  2.092351
##   0.01    6    2.937286  0.6772682  2.353048
##   0.01    7    2.940417  0.6992182  2.292223
##   0.01    8    3.229277  0.6494204  2.558359
##   0.01    9    4.032219  0.5785116  3.166650
##   0.01   10    3.504213  0.5937873  2.775801
##   0.10    1    2.371857  0.7701753  1.902779
##   0.10    2    2.437559  0.7614497  1.965764
##   0.10    3    2.276381  0.7915799  1.930249
##   0.10    4    2.160201  0.8173378  1.684239
##   0.10    5    2.554366  0.7485888  2.053954
##   0.10    6    2.547257  0.7667097  2.100520
##   0.10    7    2.814326  0.7171493  2.215859
##   0.10    8    3.130764  0.6800749  2.551020
##   0.10    9    3.215940  0.6538239  2.594899
##   0.10   10    3.345395  0.6186179  2.684631
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4 and decay = 0.1.

#Making predictions using the test data and retrieving the MSE,Rsquared and MAE metrics
neural_network_prediction <- predict(neural_network_model, testData$x)
postResample(pred = neural_network_prediction, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 2.0507310 0.8364239 1.5519612

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

Based on the predictions of all the models, the MARS model has the lowest MAE and the lowest RMSE of all the models. Thus, the MARS models has the best performance and does select the informative predictors.

7.5 - Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

library(AppliedPredictiveModeling)

## Warning: package 'AppliedPredictiveModeling' was built under R version 4.3.3

data("ChemicalManufacturingProcess")

#imputation using the BoxCox and bagImpute methods
preProcess <- preProcess(ChemicalManufacturingProcess, method = c("BoxCox","bagImpute"))
PreProcess_prediction <- predict(preProcess, ChemicalManufacturingProcess)

#creating a training and testing set
index <- sample(seq_len(nrow(PreProcess_prediction )), size = floor(0.85 * nrow(PreProcess_prediction)))
train <- PreProcess_prediction[index, ]
test <- PreProcess_prediction[-index, ]

(a).Which nonlinear regression model gives the optimal resampling and test set performance?

Tuning the Support Vector Machine (SVM) model

svm_model_manufacturing <- train(Yield ~., data=train,
                   method = "svmRadial",
                   tuneLength = 15,
                   trControl = trainControl(method = "cv"))
svm_model_manufacturing_prediction <- predict(svm_model_manufacturing, newdata = test)

Tuning the Multivariate Adaptive Regression Splines (MARS) model

mars_model_manufacturing <- train(Yield ~., data=train,
                   method = "earth",
                   tuneGrid = expand.grid(degree = 1:3, nprune = seq(5, 15, by = 2)),
                   trControl = trainControl(method = "cv"))

## Warning: duplicate term name "h(ManufacturingProcess19-0.5)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(ManufacturingProcess19-0.5)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(ManufacturingProcess19-0.5)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(ManufacturingProcess19-0.5)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(ManufacturingProcess19-0.5)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(ManufacturingProcess19-0.5)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(BiologicalMaterial11-0.499977)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(BiologicalMaterial11-0.499977)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(BiologicalMaterial11-0.499977)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(BiologicalMaterial11-0.499977)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(BiologicalMaterial11-0.499977)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(BiologicalMaterial11-0.499977)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(BiologicalMaterial11-0.499976)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(BiologicalMaterial11-0.499976)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(BiologicalMaterial11-0.499976)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(BiologicalMaterial11-0.499976)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(BiologicalMaterial11-0.499976)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

## Warning: duplicate term name "h(BiologicalMaterial11-0.499976)"
## This is usually caused by cuts that are very close to each other
## Remedy: use options(digits=NDIGITS), typically NDIGITS has to be at least 7 (currently NDIGITS=7)

mars_model_manufacturing_prediction <- predict(mars_model_manufacturing, newdata = test)

Tuning the K-Nearest Neighbors (KNN) model

knn_model_manufacturing  <- train(Yield ~., data = train,
                  method = "knn",
                  preProcess = c("center", "scale"),
                  tuneLength = 10)

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

knn_model_manufacturing_prediction <- predict(knn_model_manufacturing, newdata = test)

Tuning the Neural Network (NN) model

neural_network_model_manufacturing  <- train(Yield ~., data = train,
                  method = "avNNet",
                  trControl = trainControl(method = "cv"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 5 * (ncol(train)) + 5 + 1,
                  maxit = 500)

## Warning: executing %dopar% sequentially: no parallel backend registered

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

neural_network_model_manufacturing_prediction<- predict(neural_network_model_manufacturing , newdata = test)

a).Which nonlinear regression model gives the optimal resampling and test set performance?

#Displaying the RMSE, Rsquared and MAE of the above models in tabular form
as.data.frame(rbind(
  "SVM model" = postResample(pred = svm_model_manufacturing_prediction, obs = test$Yield),
  "Mars model" = postResample(pred = mars_model_manufacturing_prediction, obs = test$Yield),
  "KNN model" = postResample(pred = knn_model_manufacturing_prediction, obs = test$Yield),
  "Neural Network model" = postResample(pred = neural_network_model_manufacturing_prediction,
  obs=test$Yield)
  
)) %>% arrange(RMSE)

##                              RMSE   Rsquared          MAE
## SVM model            0.0001112950 0.74621113 8.931143e-05
## Mars model           0.0001284610 0.58258991 9.628243e-05
## KNN model            0.0001286492 0.72902833 9.938732e-05
## Neural Network model 0.0002079953 0.09375666 1.620971e-04

The SVM model gives the optimal resampling and test set performance since it has the lowest MAE and RMSE out of all 4 models.

b). Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

varImp(svm_model_manufacturing)

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     91.25
## ManufacturingProcess13   88.58
## ManufacturingProcess09   85.72
## BiologicalMaterial03     83.78
## BiologicalMaterial12     80.91
## ManufacturingProcess17   79.57
## ManufacturingProcess36   78.41
## ManufacturingProcess06   72.10
## ManufacturingProcess11   72.07
## BiologicalMaterial02     67.82
## BiologicalMaterial11     62.54
## ManufacturingProcess31   52.55
## ManufacturingProcess29   49.21
## ManufacturingProcess18   48.98
## BiologicalMaterial04     48.77
## ManufacturingProcess33   46.36
## ManufacturingProcess30   44.81
## BiologicalMaterial08     43.46
## BiologicalMaterial09     42.14

The manufacturing process variables dominate the biological process variables with a ratio of 12 to 8. I believe the manufacturing process variables from the linear model also dominated the biological ones but with a ratio of 11 to 9. The domination seems bigger in non-linear models.

c). Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

varimp <- varImp(svm_model_manufacturing)$importance
varimp <- cbind(rownames(varimp), data.frame(varimp , row.names=NULL))
colnames(varimp) <- c("Predictor","Overall")
varimp <- varimp %>% arrange(Overall) %>% tail(10) %>% select(Predictor)
variables <- as.vector(varimp$Predictor)
featurePlot(PreProcess_prediction[,variables], PreProcess_prediction$Yield)

Firstly, ManufacturingProcess32 has the highest positive correlation with Yield. Manufacturing Process13 and ManufacturingProcess have negative correlations with Yield.Biological predictors all seem to have a positive correlation with Yield.All biological variables are in sort of clusters but some like: ManufacturingProcess06 and ManufacturingProcess36 have complex, non-linear correlations.

Homework 8

Vladimir Nimchenko

2024-11-09