Exercises:

7.2

a) Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

y = 10 sin(π x1 x2) + 20 (x3 − 0.5)^2 + 10 x4 + 5 x5 + N(0, σ^2)

where the x values are random variables uniformly distributed on [0, 1] (five additional non-informative variables are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)   # for mlbench.friedman1
library(caret)     # for featurePlot, train, etc.

set.seed(200)

# simulate the training data with mlbench.friedman1
trainingData <- mlbench.friedman1(200, sd = 1)

# convert the x matrix to a data frame
trainingData$x <- data.frame(trainingData$x)

# visualize the data
featurePlot(trainingData$x, trainingData$y)

# simulate a large data set for testing
testData <- mlbench.friedman1(5000, sd = 1)

# convert the x matrix to a data frame
testData$x <- data.frame(testData$x)
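For reference, the response can also be generated directly from the equation above. The following is a minimal sketch of what mlbench.friedman1 does internally (the function name simulate_friedman1 is mine, not part of mlbench):

simulate_friedman1 <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10)   # X1-X10 ~ Uniform(0, 1)
  y <- 10 * sin(pi * x[, 1] * x[, 2]) +   # informative: X1, X2
    20 * (x[, 3] - 0.5)^2 +               # informative: X3
    10 * x[, 4] + 5 * x[, 5] +            # informative: X4, X5
    rnorm(n, sd = sd)                     # N(0, sd^2) noise
  list(x = data.frame(x), y = y)          # X6-X10 stay non-informative
}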

b) Tune several models on these data.

kNN model

Answer:

  • The training data contain 200 samples and 10 predictors.

  • Using RMSE as the performance metric, the kNN method selected k = 17 for optimal performance. The corresponding resampled RMSE of the trained model is 3.183130.

  • After predicting the test data using the kNN model, the test-set RMSE is 3.2040595.

  • The test RMSE is close to the resampled training RMSE, so the model predicts new data nearly as well as the training data, with slightly lower accuracy.

# check for sparse and unbalanced predictors
knnDescr <- nearZeroVar(trainingData$x)
paste0("Sparse and unbalanced predictors to be removed:", knnDescr)
## [1] "Sparse and unbalanced predictors to be removed:"
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
plot(knnModel) 

getTrainPerf(knnModel)
##   TrainRMSE TrainRsquared TrainMAE method
## 1   3.18313     0.6425367 2.567787    knn
knnPred <- predict(knnModel, testData$x)

postResample(knnPred, testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

MARS Model

Answer:

  • The training data contain 200 samples and 10 predictors.

  • Using RMSE as the performance metric, the final values used for the MARS model were nprune = 14 and degree = 2. The corresponding resampled RMSE of the trained model is 1.256694.

  • After predicting the test data using the MARS model, the test-set RMSE is 1.2779993.

  • As the difference is quite small (about 0.02), the model isn't overfitting; it generalizes well to new data.

  • It's better than the kNN model.

  • The MARS model selects only the informative predictors (X1–X5), as shown by the varImp() output below.

# cross-validation as the resampling method for model training
control <- trainControl(method = "cv")

# grid of hyperparameters to tune the MARS model
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:15)

# train the model
marsModel <- train(trainingData$x, trainingData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   preProcess = c("center", "scale"),
                   trControl = control)

marsModel
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.462296  0.2176253  3.697979
##   1        3      3.720663  0.4673821  2.949121
##   1        4      2.680039  0.7094916  2.123848
##   1        5      2.333538  0.7781559  1.856629
##   1        6      2.367933  0.7754329  1.901509
##   1        7      1.809983  0.8656526  1.414985
##   1        8      1.692656  0.8838936  1.333678
##   1        9      1.704958  0.8845683  1.339517
##   1       10      1.688559  0.8842495  1.309838
##   1       11      1.669043  0.8886165  1.296522
##   1       12      1.645066  0.8892796  1.271981
##   1       13      1.655570  0.8886896  1.271232
##   1       14      1.666354  0.8879143  1.285545
##   1       15      1.666354  0.8879143  1.285545
##   2        2      4.440854  0.2204755  3.686796
##   2        3      3.697203  0.4714312  2.938566
##   2        4      2.664266  0.7149235  2.119566
##   2        5      2.313371  0.7837374  1.852172
##   2        6      2.335796  0.7875253  1.841919
##   2        7      1.833248  0.8623489  1.461538
##   2        8      1.695822  0.8883658  1.329030
##   2        9      1.555106  0.9028532  1.221365
##   2       10      1.497805  0.9088251  1.158054
##   2       11      1.419280  0.9207646  1.139722
##   2       12      1.326566  0.9315939  1.066200
##   2       13      1.266877  0.9354482  1.002983
##   2       14      1.256694  0.9349307  1.006273
##   2       15      1.311401  0.9316487  1.039213
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
# plot the fitted response against each predictor (plotmo)
plotmo(marsModel)
##  plotmo grid:    X1        X2       X3        X4        X5        X6        X7
##           0.5139349 0.5106664 0.537307 0.4445841 0.5343299 0.4975981 0.4688035
##        X8        X9       X10
##  0.497961 0.5288716 0.5359218

getTrainPerf(marsModel)
##   TrainRMSE TrainRsquared TrainMAE method
## 1  1.256694     0.9349307 1.006273  earth
marsPred <- predict(marsModel, testData$x)

postResample(marsPred, testData$y)
##      RMSE  Rsquared       MAE 
## 1.2779993 0.9338365 1.0147070
#Important Predictors
varImp(marsModel)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.40
## X2   49.00
## X5   15.72
## X3    0.00

Neural Network Model

Answer:

  • The training data contain 200 samples and 10 predictors.

  • As shown by the tooHigh variable, no predictor pair has an absolute correlation greater than 0.75, so nothing is removed.

  • Using RMSE as the performance metric, the final values used for the model were size = 5, decay = 0.1 and bag = FALSE. The corresponding resampled RMSE of the trained model is 2.032561.

  • After predicting the test data using the averaged neural network, the test-set RMSE is 2.0695282.

  • As the difference is small, the model isn't overfitting; it generalizes well to new data.

  • It's better than the kNN model but not as good as the MARS model.

  • varImp() ranks the informative predictors (X1–X5) highest, though the non-informative variables also receive small scores; note that this importance measure is model-free (loess R²). The NaN rows in the tuning output are explained in the sketch right after this list.
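The NaN rows in the resampling output below arise because MaxNWts is sized for at most 5 hidden units, so the fits for size 6–10 exceed the weight limit and fail. A quick check, using the standard weight count for a single-hidden-layer regression network:

# weights = size * (p + 1) for the hidden layer, plus size + 1 for the
# linear output unit; compare against MaxNWts = 5 * (p + 1) + 5 + 1 = 61
p <- 10                                  # number of predictors
sizes <- 1:10
nWts <- sizes * (p + 1) + sizes + 1
data.frame(size = sizes, weights = nWts, fits = nWts <= 61)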

# identify predictors with pairwise absolute correlation above 0.75 (none found)
tooHigh <- findCorrelation(cor(trainingData$x), cutoff = .75)

print (tooHigh)
## integer(0)
# define the hyperparameters
nNetGrid <- expand.grid(.decay = c(0, 0.01, .1), .size = c(1:10), .bag = FALSE)

# maximum number of weights: sized for a network with at most 5 hidden
# units, so grid entries with size > 5 exceed the limit and produce NaN
nNetMaxnwts <- 5 * (ncol(trainingData$x) + 1) + 5 + 1

# use the model-averaged neural network
# linout = TRUE gives linear output units, as needed for regression

nnetAvgModel <- train(x = trainingData$x,
                    y = trainingData$y,
                    method = "avNNet",
                    preProcess = c("center", "scale"),
                    tuneGrid = nNetGrid,
                    trControl = control,
                    linout = TRUE,
                    trace = FALSE, # reduce the amount of printed output
                    MaxNWts = nNetMaxnwts,
                    maxit = 500) # the number of iterations to find parameter estimates.
nnetAvgModel
## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.421837  0.7675787  1.905824
##   0.00    2    2.501165  0.7510185  1.975237
##   0.00    3    2.047942  0.8377685  1.596232
##   0.00    4    2.144009  0.8143402  1.631462
##   0.00    5    2.894774  0.7216396  1.993228
##   0.00    6         NaN        NaN       NaN
##   0.00    7         NaN        NaN       NaN
##   0.00    8         NaN        NaN       NaN
##   0.00    9         NaN        NaN       NaN
##   0.00   10         NaN        NaN       NaN
##   0.01    1    2.435259  0.7634789  1.891214
##   0.01    2    2.508561  0.7580129  1.959482
##   0.01    3    2.089809  0.8287090  1.653015
##   0.01    4    2.133132  0.8249811  1.651165
##   0.01    5    2.092225  0.8254972  1.635836
##   0.01    6         NaN        NaN       NaN
##   0.01    7         NaN        NaN       NaN
##   0.01    8         NaN        NaN       NaN
##   0.01    9         NaN        NaN       NaN
##   0.01   10         NaN        NaN       NaN
##   0.10    1    2.443828  0.7620641  1.898056
##   0.10    2    2.521015  0.7496509  1.967252
##   0.10    3    2.155159  0.8205917  1.686712
##   0.10    4    2.197423  0.8181142  1.714613
##   0.10    5    2.032561  0.8412067  1.586589
##   0.10    6         NaN        NaN       NaN
##   0.10    7         NaN        NaN       NaN
##   0.10    8         NaN        NaN       NaN
##   0.10    9         NaN        NaN       NaN
##   0.10   10         NaN        NaN       NaN
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.1 and bag = FALSE.
plot(nnetAvgModel) 

getTrainPerf(nnetAvgModel)
##   TrainRMSE TrainRsquared TrainMAE method
## 1  2.032561     0.8412067 1.586589 avNNet
nnetAvgPred <- predict(nnetAvgModel, testData$x)

postResample(nnetAvgPred, testData$y)
##      RMSE  Rsquared       MAE 
## 2.0695282 0.8343683 1.5481826
#Important Predictors
varImp(nnetAvgModel)
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

SVM Model

Answer:

  • The training data contain 200 samples and 10 predictors.

  • Using RMSE as the performance metric, the final values used for the model were sigma = 0.06607581 and C = 4. The model used 155 training-set points as support vectors. The corresponding resampled RMSE of the trained model is 1.862065.

  • After predicting the test data using the SVM, the test-set RMSE is 2.0668766.

  • Performance degrades somewhat on the test data relative to resampling.

  • It's better than the kNN model and comparable to the neural network, but not as good as the MARS model.

  • varImp() ranks the informative predictors (X1–X5) highest, with small scores for the non-informative variables; as with the neural network, this measure is model-free (loess R²). A sketch of how the constant sigma is estimated follows this list.
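In the tuning output below, sigma is held constant at an analytical estimate rather than tuned. A sketch of that estimate, assuming caret delegates to kernlab's sigest (which is my understanding of its svmRadial internals):

library(kernlab)

# sigest() returns 10th/50th/90th percentile estimates of the inverse
# kernel width on the (scaled) predictors; caret's constant sigma is
# derived from this range.
sigest(as.matrix(trainingData$x), scaled = TRUE)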

# The kernel parameters are unknown; caret can fit different kernels via
# svmRadial, svmLinear, and svmPoly. The radial basis kernel is used here.
svmModel <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "svmRadial",
                   preProcess = c("center", "scale"),
                   tuneLength = 10,
                   trControl = control)
svmModel
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE     
##     0.25  2.491825  0.8004439  1.995525
##     0.50  2.226745  0.8177629  1.763324
##     1.00  2.040354  0.8389020  1.625726
##     2.00  1.912564  0.8524876  1.530802
##     4.00  1.862065  0.8583504  1.523556
##     8.00  1.868188  0.8584491  1.540959
##    16.00  1.873205  0.8577527  1.539899
##    32.00  1.873205  0.8577527  1.539899
##    64.00  1.873205  0.8577527  1.539899
##   128.00  1.873205  0.8577527  1.539899
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06607581
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06607581 and C = 4.
#validation plot
plot(svmModel)

#summary
svmModel$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 4 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0660758131802768 
## 
## Number of Support Vectors : 155 
## 
## Objective Function Value : -64.0293 
## Training error : 0.01385
# predict the response for the test data
svmpredict <- predict(svmModel, testData$x)

# compare the predictions to the test-set response values
postResample(svmpredict, testData$y)
##      RMSE  Rsquared       MAE 
## 2.0668766 0.8276165 1.5625893
#Important Predictors
varImp(svmModel)
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

Best Model

Answer:

Based on RMSE as the performance metric, the MARS model performed best, as it had the lowest training and test RMSE.

library(knitr)   # for kable()

# Collect the resampled (training) and test-set RMSE values from above
results_table <- data.frame(
  Model = c("kNN", "MARS", "Neural Network", "SVM"),
  Trained_RMSE = c(3.183130, 1.256694, 2.032561, 1.862065),
  Test_RMSE = c(3.2040595, 1.2779993, 2.0695282, 2.0668766)
)

# Display the table using kable
kable(results_table, format = "markdown", align = "l")
| Model          | Trained_RMSE | Test_RMSE |
|:---------------|:-------------|:----------|
| kNN            | 3.183130     | 3.204060  |
| MARS           | 1.256694     | 1.277999  |
| Neural Network | 2.032561     | 2.069528  |
| SVM            | 1.862065     | 2.066877  |
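A more formal comparison could use caret's resamples() on the cross-validated fits. A sketch (kNN is omitted because it was tuned with bootstrap resampling rather than the shared 10-fold CV control):

# compare the resampling distributions of the CV-tuned models
cvResults <- resamples(list(MARS = marsModel,
                            avNNet = nnetAvgModel,
                            SVM = svmModel))
summary(cvResults)
bwplot(cvResults, metric = "RMSE")   # side-by-side RMSE distributions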

7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Load the data for chemical manufacturing process

library(AppliedPredictiveModeling)

data(ChemicalManufacturingProcess)

cmp <- data.frame(ChemicalManufacturingProcess)

Data imputation to fill in the missing values.

# summary of missing values
paste0(" Missing values: ",sum(is.na(cmp)))
## [1] " Missing values: 106"
#data imputation
cmp_impute <- preProcess(cmp, method = "knnImpute")
cmp_impute  <- predict(cmp_impute,cmp)


# summary post data imputation
paste0(" Missing values post data kNN Impute:",sum(is.na(cmp_impute)))
## [1] " Missing values post data kNN Impute:0"

Split the data into a training and a test set, pre-process the data.

  • For evaluation I used RMSE as the performance metric.
  • Filtered out predictors with near-zero variance.
  • A skewness check showed that Yield and several BiologicalMaterial predictors are skewed (BiologicalMaterial04 most strongly), so a Box-Cox transformation is applied.
# pre-process to filter predictor with low frequencies
cmp_transform <- cmp_impute[, -nearZeroVar(cmp_impute)]

# Check for skewness
library(e1071)
skewVals <- apply(cmp_transform, 2, skewness)
head(skewVals)
##                Yield BiologicalMaterial01 BiologicalMaterial02 
##           0.31095956           0.27331650           0.24412685 
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05 
##           0.02851075           1.73231530           0.30400532
#data transformation 
cmp_boxcox <- preProcess(cmp_transform, method = c("BoxCox","knnImpute"))
cmp_transform  <- predict(cmp_boxcox,cmp_transform)

set.seed(100)

# sample data for training
index <- createDataPartition(cmp_transform$Yield, p = .8, list = FALSE)

# train and test data
train_cmp <- cmp_transform[index, ]
test_cmp <- cmp_transform[-index, ]

# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)

kNN Model

Answer:

  • The training data contain 144 samples and 56 predictors.

  • Using RMSE as the performance metric, the kNN method selected k = 5 for optimal performance. The corresponding resampled RMSE of the trained model is 0.7900462.

  • After predicting the test data using the kNN model, the test-set RMSE is 0.6645289.

  • The test RMSE is lower than the resampled training RMSE, so the model generalizes well, though its overall accuracy is modest (test R² ≈ 0.46).

# check for sparse and unbalanced predictors
knnDescr <- nearZeroVar(train_cmp[ , -1])
paste0("Sparse and unbalanced predictors to be removed:", knnDescr)
## [1] "Sparse and unbalanced predictors to be removed:"
knnModel <- train(x = train_cmp[ , -1],
                  y = train_cmp$Yield,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

knnModel
## k-Nearest Neighbors 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    5  0.7900462  0.4007219  0.6292123
##    7  0.7988250  0.3869367  0.6396344
##    9  0.7965544  0.3914364  0.6408470
##   11  0.7961205  0.3934229  0.6436726
##   13  0.7993345  0.3930332  0.6490228
##   15  0.8041721  0.3851576  0.6527714
##   17  0.8047452  0.3889346  0.6533437
##   19  0.8084362  0.3831876  0.6571909
##   21  0.8086075  0.3889180  0.6558489
##   23  0.8100661  0.3925148  0.6569792
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
#Validation plot
plot(knnModel) 

# optimal tuning parameter (k)
knnModel$bestTune
##   k
## 1 5
#performance
getTrainPerf(knnModel)
##   TrainRMSE TrainRsquared  TrainMAE method
## 1 0.7900462     0.4007219 0.6292123    knn
cmp_predict <- predict(knnModel, test_cmp[ ,-1])
postResample(cmp_predict, test_cmp[ ,1])
##      RMSE  Rsquared       MAE 
## 0.6645289 0.4623051 0.5320910
# predicted vs. observed on the training data
xyplot(predict(knnModel) ~ train_cmp$Yield, type = c("p", "g"))

# predicted vs. residuals
xyplot(predict(knnModel) ~ resid(knnModel), type = c("p", "g"))

MARS Model

Answer:

  • The training data contain 144 samples and 56 predictors.

  • Using RMSE as the performance metric, the final values used for the MARS model were nprune = 5 and degree = 1. The corresponding resampled RMSE of the trained model is 0.6664529.

  • After predicting the test data using the MARS model, the test-set RMSE is 0.6316697.

  • As the test RMSE is slightly lower than the training RMSE, the model isn't overfitting; it generalizes well to new data.

  • It's better than the kNN model.

# grid of hyperparameters to tune the MARS model
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)

marsModel <- train(x = train_cmp[ , -1],
                   y = train_cmp$Yield,
                   method = "earth",
                   tuneGrid = marsGrid,
                   preProcess = c("center", "scale"),
                   trControl = ctrl)


#Validation plot
plot(marsModel) 

# optimal tuning parameters (nprune, degree)
marsModel$bestTune
##   nprune degree
## 4      5      1
#performance
getTrainPerf(marsModel)
##   TrainRMSE TrainRsquared  TrainMAE method
## 1 0.6664529     0.6068185 0.5468289  earth
cmp_predict <- predict(marsModel, test_cmp[ ,-1])
postResample(cmp_predict, test_cmp[ ,1])
##      RMSE  Rsquared       MAE 
## 0.6316697 0.5300194 0.5101996
# predicted vs. observed on the training data
xyplot(predict(marsModel) ~ train_cmp$Yield, type = c("p", "g"))

# predicted vs. residuals
xyplot(predict(marsModel) ~ resid(marsModel), type = c("p", "g"))

Neural Network Model

Answer:

  • The training data contain 144 samples and 56 predictors.

  • Using RMSE as the performance metric, the resampled RMSE of the trained model is 0.6267866.

  • After predicting the test data using the neural network model, the test-set RMSE is 0.5336337.

  • As the test RMSE is lower than the training RMSE, the model isn't overfitting; it generalizes well to new data.

  • It's better than both the kNN and MARS models. Its important predictors are examined in part (b) below.

# define the hyperparameters
nNetGrid <- expand.grid(.decay = c(0, 0.01, .1), .size = c(1:10), .bag = FALSE)

# maximum number of weights, sized for a network with at most 5 hidden units
nNetMaxnwts <- 5 * (ncol(train_cmp[ , -1]) + 1) + 5 + 1

# use the model-averaged neural network
# linout = TRUE gives linear output units, as needed for regression

nnetAvgModel <- train(x = train_cmp[ ,-1],
                    y = train_cmp$Yield, 
                    method = "avNNet",
                    preProcess = c("center", "scale"),
                    tuneGrid = nNetGrid,
                    trControl = ctrl,
                    linout = TRUE,
                    trace = FALSE, # reduce the amount of printed output
                    MaxNWts = nNetMaxnwts,
                    maxit = 500) # the number of iterations to find parameter estimates.


plot(nnetAvgModel) 

getTrainPerf(nnetAvgModel)
##   TrainRMSE TrainRsquared  TrainMAE method
## 1 0.6267866     0.6703062 0.5149613 avNNet
nnetAvgPred <- predict(nnetAvgModel, test_cmp[ ,-1])

postResample(nnetAvgPred, test_cmp[ ,1])
##      RMSE  Rsquared       MAE 
## 0.5336337 0.6969595 0.4225320

SVM model

Answer:

  • The training data contain 144 samples and 56 predictors.

  • Using RMSE as the performance metric, the final values used for the model were sigma = 0.01275155 and C = 16. The model used 122 training-set points as support vectors. The corresponding resampled RMSE of the trained model is 0.5934408.

  • After predicting the test data using the SVM, the test-set RMSE is 0.5822715.

  • The model performed well on the test data.

  • It's better than the kNN and MARS models but not as good as the neural network model.

svmModel <- train(x = train_cmp[ , -1],
                  y = train_cmp$Yield,
                  method = "svmRadial",
                  preProcess = c("center", "scale"),
                  tuneLength = 10,
                  trControl = ctrl)
svmModel
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 128, 130, 129, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE       Rsquared   MAE      
##     0.25  0.7674302  0.5263657  0.6206958
##     0.50  0.6902547  0.5832043  0.5686566
##     1.00  0.6374649  0.6387539  0.5292632
##     2.00  0.6147021  0.6605693  0.5080481
##     4.00  0.5965025  0.6828171  0.4872453
##     8.00  0.5957551  0.6912832  0.4798324
##    16.00  0.5934408  0.6938640  0.4778767
##    32.00  0.5934408  0.6938640  0.4778767
##    64.00  0.5934408  0.6938640  0.4778767
##   128.00  0.5934408  0.6938640  0.4778767
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01275155
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01275155 and C = 16.
#validation plot
plot(svmModel)

#summary
svmModel$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 16 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0127515535955409 
## 
## Number of Support Vectors : 122 
## 
## Objective Function Value : -96.8568 
## Training error : 0.00892
getTrainPerf(svmModel)
##   TrainRMSE TrainRsquared  TrainMAE    method
## 1 0.5934408      0.693864 0.4778767 svmRadial
# predict the Yield response for the test data
svmpredict <- predict(svmModel, test_cmp[ ,-1])

# compare the predictions to the test-set response values
postResample(svmpredict, test_cmp[ ,1])
##      RMSE  Rsquared       MAE 
## 0.5822715 0.5862931 0.4596495

a) Which nonlinear regression model gives the optimal resampling and test set performance?

Answer:

  • Based on RMSE as the performance metric, the averaged neural network (avNNet) model performed best in both resampling and test-set performance, as it had the lowest RMSE.

  • SVM was relatively close to the neural network, and both performed better than the linear models from the earlier exercise, which had higher RMSE and lower R-squared (below 0.5).

# Collect the resampled (training) and test-set RMSE values from above
results_table <- data.frame(
  Model = c("kNN", "MARS", "Neural Network", "SVM"),
  Trained_RMSE = c(0.7900462, 0.6664529, 0.6267866, 0.5934408),
  Test_RMSE = c(0.6645289, 0.6316697, 0.5336337, 0.5822715)
)

# Display the table using kable
kable(results_table, format = "markdown", align = "l")
| Model          | Trained_RMSE | Test_RMSE |
|:---------------|:-------------|:----------|
| kNN            | 0.7900462    | 0.6645289 |
| MARS           | 0.6664529    | 0.6316697 |
| Neural Network | 0.6267866    | 0.5336337 |
| SVM            | 0.5934408    | 0.5822715 |
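As in 7.2, caret's resamples() can compare the CV-tuned fits, and diff() adds pairwise comparisons of the fold-level RMSEs. A sketch (strictly paired inference would require passing the same index to trainControl for every model):

# pairwise differences of fold-level RMSE across the tuned models
cvResults <- resamples(list(MARS = marsModel,
                            avNNet = nnetAvgModel,
                            SVM = svmModel))
summary(diff(cvResults, metric = "RMSE"))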

b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Answer:

  • A similar number of biological and manufacturing-process predictors are important in the neural-network model. Of the top 10, six are process variables and four are biological, so neither type dominates the list, though process variables outnumber biological ones.

  • In the optimal linear model (PLS), ManufacturingProcess32 has the strongest positive correlation with Yield, followed by ManufacturingProcess09 and BiologicalMaterial02; ManufacturingProcess13, ManufacturingProcess36, and ManufacturingProcess17, which were also in the top five, are negatively correlated with Yield.

  • In the optimal nonlinear model (avNNet), ManufacturingProcess32 again has the strongest positive correlation with Yield, followed by BiologicalMaterial06 and BiologicalMaterial03; ManufacturingProcess13, 17, and 36 are negatively correlated with Yield.

  • Hence the two models order the predictors' importance differently; a sketch for reproducing the PLS ranking follows this list.
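The PLS rankings cited above come from the earlier linear-model exercise; a sketch of how they could be reproduced for a side-by-side comparison (the tuneLength of 20 is an assumption, not necessarily the original setting):

library(dplyr)   # for arrange() and slice_head()

# hypothetical sketch: refit the linear PLS model on the same split and
# extract its top-10 predictor ranking for comparison with avNNet
plsModel <- train(x = train_cmp[ , -1], y = train_cmp$Yield,
                  method = "pls", tuneLength = 20,
                  preProcess = c("center", "scale"), trControl = ctrl)
varImp(plsModel)$importance |>
  arrange(desc(Overall)) |>
  slice_head(n = 10)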

top_predictors <- data.frame(varImp(nnetAvgModel)$importance)

top_10_predictors <- top_predictors |>
  arrange(desc(Overall)) |>
  slice_head(n = 10)

top_10_predictors
##                          Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess13  97.83640
## BiologicalMaterial06    82.21744
## ManufacturingProcess17  77.26777
## BiologicalMaterial03    76.21094
## ManufacturingProcess36  70.96498
## BiologicalMaterial02    68.78876
## ManufacturingProcess09  67.86384
## BiologicalMaterial12    63.36203
## ManufacturingProcess06  55.15443
library(corrplot)   # for corrplot()

cmp_transform |>
  select(all_of(c("Yield", row.names(top_10_predictors)))) |>
  cor() |>
  corrplot()

c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Answer:

  • Using the top 10 predictors, I plotted their relationships with Yield on the pre-processed data.

  • Scatter plots with LOESS smoothing indicate non-linear effects.

  • Most of these plots show clustered data, except ManufacturingProcess32, whose Yield values concentrate at four predictor values.

  • These patterns were not captured by the linear model, highlighting the strength of neural networks in modeling complex interactions.

  • The partial dependence plots show how the prediction changes as each variable changes.

# Top predictors
top_10_predictors <- c("ManufacturingProcess32", "ManufacturingProcess13", 
                    "BiologicalMaterial06", "ManufacturingProcess17", 
                    "BiologicalMaterial03", "ManufacturingProcess36", 
                    "BiologicalMaterial02", "ManufacturingProcess09", 
                    "BiologicalMaterial12", "ManufacturingProcess06")


library(ggplot2)   # for ggplot()

# relationship of each important predictor to Yield
for (var in top_10_predictors) {
  p <- ggplot(cmp_transform, aes(x = .data[[var]], y = Yield)) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "loess", se = FALSE, color = "blue") +
    labs(title = paste("Yield vs", var),
         x = var,
         y = "Yield")
  print(p)
}

# partial dependence plots to show how the prediction changes as the variable changes.
for (var in top_10_predictors) {
  pd <- pdp::partial(nnetAvgModel, pred.var = var, train = train_cmp[ , -1])
  plot(pd, main = paste("Partial Dependence on", var))
}