$$y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)$$
where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five additional non-informative predictors). The mlbench package contains a function called mlbench.friedman1 that simulates these data:
# packages used throughout: caret for training and resampling, mlbench for the simulated data
library(caret)
library(mlbench)
set.seed(200)
# using function mlbench.friedman1 that simulates this data:
trainingData <- mlbench.friedman1(200, sd = 1)
# convert the x matrix to a dataframe
trainingData$x <- data.frame(trainingData$x)
# Visualize the data
featurePlot(trainingData$x, trainingData$y)
# simulating a large dataset for testing
testData <- mlbench.friedman1(5000, sd = 1)
# convert the test x matrix to a data frame as well
testData$x <- data.frame(testData$x)
Answer:
The training data contain 200 samples and 10 predictors.
Using RMSE as the performance metric, the kNN method selected k = 17 as the optimal tuning parameter; the corresponding resampled RMSE of the trained model is 3.183130.
Predicting the test data with the kNN model gives an RMSE of 3.2040595.
The test-set error is close to the resampled training error, although accuracy on the test set is slightly lower.
# check to see if any sparse and unbalanced predictors
knnDescr <- as.integer(nearZeroVar(trainingData$x))
paste0("Sparse and unbalanced predictors to be removed:",knnDescr)
## [1] "Sparse and unbalanced predictors to be removed:"
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
plot(knnModel)
getTrainPerf(knnModel)
## TrainRMSE TrainRsquared TrainMAE method
## 1 3.18313 0.6425367 2.567787 knn
knnPred <- predict(knnModel, testData$x)
postResample(knnPred, testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
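For reference, the RMSE reported by postResample() is the ordinary root-mean-squared error; here is a minimal sketch of computing it directly (the helper function rmse is illustrative, not part of caret):
# direct RMSE computation; should match the RMSE column of postResample()
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(testData$y, knnPred) # ~3.2040595 per the output above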
Answer:
The training data contain 200 samples and 10 predictors.
Using RMSE as the performance metric, the final values used for the MARS model were nprune = 14 and degree = 2; the corresponding resampled RMSE of the trained model is 1.256694.
Predicting the test data with the MARS model gives an RMSE of 1.2779993.
Since the difference is quite small (about 0.02), the model is not overfitting; it generalizes well to new data.
It performs better than the kNN model.
The MARS model selects the informative predictors, as shown by the varImp() output, which ranks X1, X4, X2, and X5 highest (X3 receives a score of zero).
# cross-validation as the resampling method for model training
control <- trainControl(method = "cv")
# grid of hyperparameters to tune the MARS model
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:15)
# train the model
marsModel <- train(trainingData$x, trainingData$y,
method = "earth",
tuneGrid = marsGrid,
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = control)
marsModel
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.462296 0.2176253 3.697979
## 1 3 3.720663 0.4673821 2.949121
## 1 4 2.680039 0.7094916 2.123848
## 1 5 2.333538 0.7781559 1.856629
## 1 6 2.367933 0.7754329 1.901509
## 1 7 1.809983 0.8656526 1.414985
## 1 8 1.692656 0.8838936 1.333678
## 1 9 1.704958 0.8845683 1.339517
## 1 10 1.688559 0.8842495 1.309838
## 1 11 1.669043 0.8886165 1.296522
## 1 12 1.645066 0.8892796 1.271981
## 1 13 1.655570 0.8886896 1.271232
## 1 14 1.666354 0.8879143 1.285545
## 1 15 1.666354 0.8879143 1.285545
## 2 2 4.440854 0.2204755 3.686796
## 2 3 3.697203 0.4714312 2.938566
## 2 4 2.664266 0.7149235 2.119566
## 2 5 2.313371 0.7837374 1.852172
## 2 6 2.335796 0.7875253 1.841919
## 2 7 1.833248 0.8623489 1.461538
## 2 8 1.695822 0.8883658 1.329030
## 2 9 1.555106 0.9028532 1.221365
## 2 10 1.497805 0.9088251 1.158054
## 2 11 1.419280 0.9207646 1.139722
## 2 12 1.326566 0.9315939 1.066200
## 2 13 1.266877 0.9354482 1.002983
## 2 14 1.256694 0.9349307 1.006273
## 2 15 1.311401 0.9316487 1.039213
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
# produce model-response plots (requires the plotmo package)
library(plotmo)
plotmo(marsModel)
## plotmo grid: X1 X2 X3 X4 X5 X6 X7
## 0.5139349 0.5106664 0.537307 0.4445841 0.5343299 0.4975981 0.4688035
## X8 X9 X10
## 0.497961 0.5288716 0.5359218
getTrainPerf(marsModel)
## TrainRMSE TrainRsquared TrainMAE method
## 1 1.256694 0.9349307 1.006273 earth
marsPred <- predict(marsModel, testData$x)
postResample(marsPred, testData$y)
## RMSE Rsquared MAE
## 1.2779993 0.9338365 1.0147070
#Important Predictors
varImp(marsModel)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.40
## X2 49.00
## X5 15.72
## X3 0.00
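To see which hinge terms the final MARS model actually retained, one could also inspect the underlying earth fit; a minimal sketch (output not shown in the original run):
# list the retained basis functions and coefficients of the final earth model
summary(marsModel$finalModel)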
Answer:
The training data contain 200 samples and 10 predictors.
As shown by the tooHigh variable, no predictor has an absolute pairwise correlation greater than 0.75, so none needs to be removed.
Using RMSE as the performance metric, the final values used for the model were size = 5, decay = 0.1, and bag = FALSE; the corresponding resampled RMSE of the trained model is 2.032561.
Predicting the test data with the averaged neural network gives an RMSE of 2.0695282.
As the difference is small, the model is not overfitting; it generalizes well to new data.
It performs better than the kNN model but not as well as the MARS model.
The model ranks the informative predictors (X1-X5) highest, as shown by varImp(); however, the non-informative predictors (X6-X10) also receive small nonzero scores.
# Remove predictors to ensure absolute correlation is less than 0.75.
tooHigh <- findCorrelation(cor(trainingData$x), cutoff = .75)
print (tooHigh)
## integer(0)
# define the hyperparameters
nNetGrid <- expand.grid(.decay = c(0, 0.01, .1), .size = c(1:10), .bag = FALSE)
# maximum number of weights, sized for a network with at most 5 hidden units:
# 5 * (predictors + 1) hidden-layer weights plus 5 + 1 output weights.
# Sizes above 5 exceed this cap, which is why sizes 6-10 produce NaN below.
nNetMaxnwts <- 5 * (ncol(trainingData$x) + 1) + 5 + 1
# use the model-averaged neural network (avNNet)
# linout = TRUE gives a linear output unit, appropriate for regression
nnetAvgModel <- train(x = trainingData$x,
y = trainingData$y,
method = "avNNet",
preProcess = c("center", "scale"),
tuneGrid = nNetGrid,
trControl = control,
linout = TRUE,
trace = FALSE, # reduce the amount of printed output
MaxNWts = nNetMaxnwts,
maxit = 500) # the number of iterations to find parameter estimates.
nnetAvgModel
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.421837 0.7675787 1.905824
## 0.00 2 2.501165 0.7510185 1.975237
## 0.00 3 2.047942 0.8377685 1.596232
## 0.00 4 2.144009 0.8143402 1.631462
## 0.00 5 2.894774 0.7216396 1.993228
## 0.00 6 NaN NaN NaN
## 0.00 7 NaN NaN NaN
## 0.00 8 NaN NaN NaN
## 0.00 9 NaN NaN NaN
## 0.00 10 NaN NaN NaN
## 0.01 1 2.435259 0.7634789 1.891214
## 0.01 2 2.508561 0.7580129 1.959482
## 0.01 3 2.089809 0.8287090 1.653015
## 0.01 4 2.133132 0.8249811 1.651165
## 0.01 5 2.092225 0.8254972 1.635836
## 0.01 6 NaN NaN NaN
## 0.01 7 NaN NaN NaN
## 0.01 8 NaN NaN NaN
## 0.01 9 NaN NaN NaN
## 0.01 10 NaN NaN NaN
## 0.10 1 2.443828 0.7620641 1.898056
## 0.10 2 2.521015 0.7496509 1.967252
## 0.10 3 2.155159 0.8205917 1.686712
## 0.10 4 2.197423 0.8181142 1.714613
## 0.10 5 2.032561 0.8412067 1.586589
## 0.10 6 NaN NaN NaN
## 0.10 7 NaN NaN NaN
## 0.10 8 NaN NaN NaN
## 0.10 9 NaN NaN NaN
## 0.10 10 NaN NaN NaN
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.1 and bag = FALSE.
plot(nnetAvgModel)
getTrainPerf(nnetAvgModel)
## TrainRMSE TrainRsquared TrainMAE method
## 1 2.032561 0.8412067 1.586589 avNNet
nnetAvgPred <- predict(nnetAvgModel, testData$x)
postResample(nnetAvgPred, testData$y)
## RMSE Rsquared MAE
## 2.0695282 0.8343683 1.5481826
#Important Predictors
varImp(nnetAvgModel)
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
Answer:
The training data contain 200 samples and 10 predictors.
Using RMSE as the performance metric, the final values used for the model were sigma = 0.06607581 and C = 4; the model retained 155 training set points as support vectors. The corresponding resampled RMSE of the trained model is 1.862065.
Predicting the test data with the SVM gives an RMSE of 2.0668766.
The test error is somewhat higher than the resampled training error, so performance degrades slightly on new data.
It performs better than the kNN and neural network models but not as well as the MARS model.
The (loess-based) variable importance again ranks the informative predictors (X1-X5) highest; however, the non-informative predictors also receive small nonzero scores.
# sigma and C are unknown; svmRadial estimates sigma analytically and tunes C over a grid.
# Linear and polynomial kernels (svmLinear, svmPoly) could be fit the same way; see the sketch further below.
svmModel <- train(x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = control)
svmModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.491825 0.8004439 1.995525
## 0.50 2.226745 0.8177629 1.763324
## 1.00 2.040354 0.8389020 1.625726
## 2.00 1.912564 0.8524876 1.530802
## 4.00 1.862065 0.8583504 1.523556
## 8.00 1.868188 0.8584491 1.540959
## 16.00 1.873205 0.8577527 1.539899
## 32.00 1.873205 0.8577527 1.539899
## 64.00 1.873205 0.8577527 1.539899
## 128.00 1.873205 0.8577527 1.539899
##
## Tuning parameter 'sigma' was held constant at a value of 0.06607581
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06607581 and C = 4.
#validation plot
plot(svmModel)
#summary
svmModel$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 4
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0660758131802768
##
## Number of Support Vectors : 155
##
## Objective Function Value : -64.0293
## Training error : 0.01385
# predict the response for the test data
svmpredict <- predict(svmModel, testData$x)
# compute test-set performance (RMSE, R-squared, MAE)
postResample(svmpredict, testData$y)
## RMSE Rsquared MAE
## 2.0668766 0.8276165 1.5625893
#Important Predictors
varImp(svmModel)
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
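The comment before the radial-kernel fit mentions linear and polynomial kernels as well; here is a minimal sketch of tuning them with the same cross-validation controller (hypothetical runs, not part of the original analysis; for a strict comparison the same fold indices should be reused across calls):
# alternative SVM kernels for comparison (sketch)
svmLinearModel <- train(x = trainingData$x, y = trainingData$y,
method = "svmLinear",
preProcess = c("center", "scale"),
tuneGrid = data.frame(C = 2^(-2:7)),
trControl = control)
svmPolyModel <- train(x = trainingData$x, y = trainingData$y,
method = "svmPoly",
preProcess = c("center", "scale"),
tuneLength = 3,
trControl = control)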
Answer:
Based on RMSE as the performance metric, the MARS model performed best: it has the lowest RMSE on both the resampled training data and the test set.
# summary table of resampled training RMSE and test-set RMSE
results_table <- data.frame(
Model = c("kNN", "MARS", "Neural Network", "SVM"),
Trained_RMSE = c(3.183130, 1.256694, 2.032561, 1.862065),
Test_RMSE = c(3.2040595, 1.2779993, 2.0695282, 2.0668766)
)
# Display the table using kable (requires knitr)
library(knitr)
kable(results_table, format = "markdown", align = "l")
| Model          | Trained_RMSE | Test_RMSE |
|:---------------|:-------------|:----------|
| kNN            | 3.183130     | 3.204060  |
| MARS           | 1.256694     | 1.277999  |
| Neural Network | 2.032561     | 2.069528  |
| SVM            | 1.862065     | 2.066877  |
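Because the MARS, neural network, and SVM fits all used 10-fold cross-validation, their resampling distributions could also be compared directly with caret's resamples(); a minimal sketch (kNN is excluded since it was fit with bootstrap resampling, and for a strict comparison each train() call should share the same seed or fold indices):
# side-by-side comparison of the cross-validated models
resamps <- resamples(list(MARS = marsModel, avNNet = nnetAvgModel, SVM = svmModel))
summary(resamps)
bwplot(resamps, metric = "RMSE")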
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
cmp <- data.frame(ChemicalManufacturingProcess)
# summary of missing values
paste0(" Missing values: ",sum(is.na(cmp)))
## [1] " Missing values: 106"
#data imputation
cmp_impute <- preProcess(cmp, method = "knnImpute")
cmp_impute <- predict(cmp_impute,cmp)
# summary post data imputation
paste0(" Missing values post data kNN Impute:",sum(is.na(cmp_impute)))
## [1] " Missing values post data kNN Impute:0"
# remove near-zero-variance predictors
cmp_transform <- cmp_impute[, -nearZeroVar(cmp_impute)]
# Check for skewness
library(e1071)
skewVals <- apply(cmp_transform, 2, skewness)
head(skewVals)
## Yield BiologicalMaterial01 BiologicalMaterial02
## 0.31095956 0.27331650 0.24412685
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## 0.02851075 1.73231530 0.30400532
# Box-Cox transformation (estimated only for strictly positive predictors)
cmp_boxcox <- preProcess(cmp_transform, method = c("BoxCox","knnImpute"))
cmp_transform <- predict(cmp_boxcox,cmp_transform)
set.seed(100)
# sample data for training
index <- createDataPartition(cmp_transform$Yield, p = .8, list = FALSE)
# train and test data
train_cmp <- cmp_transform[index, ]
test_cmp <- cmp_transform[-index, ]
# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)
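A quick sanity check on the 80/20 partition; the training-set size of 144 matches the model summaries below:
# confirm the split sizes produced by createDataPartition
nrow(train_cmp) # 144
nrow(test_cmp) # the remaining samples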
Answer:
The training data contain 144 samples and 56 predictors.
Using RMSE as the performance metric, the kNN method selected k = 5 as the optimal tuning parameter; the corresponding resampled RMSE of the trained model is 0.7900462.
Predicting the test data with the kNN model gives an RMSE of 0.6645289.
The test-set RMSE is actually lower than the resampled training RMSE, so the model generalizes well to the held-out data.
# check to see if any sparse and unbalanced predictors
knnDescr <- as.integer(nearZeroVar(train_cmp[ ,-1]))
paste0("Sparse and unbalanced predictors to be removed:",knnDescr)
## [1] "Sparse and unbalanced predictors to be removed:"
knnModel <- train(x = train_cmp[ ,-1],
y = train_cmp$Yield ,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.7900462 0.4007219 0.6292123
## 7 0.7988250 0.3869367 0.6396344
## 9 0.7965544 0.3914364 0.6408470
## 11 0.7961205 0.3934229 0.6436726
## 13 0.7993345 0.3930332 0.6490228
## 15 0.8041721 0.3851576 0.6527714
## 17 0.8047452 0.3889346 0.6533437
## 19 0.8084362 0.3831876 0.6571909
## 21 0.8086075 0.3889180 0.6558489
## 23 0.8100661 0.3925148 0.6569792
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
#Validation plot
plot(knnModel)
# optimal tuning parameter (k)
knnModel$bestTune
## k
## 1 5
#performance
getTrainPerf(knnModel)
## TrainRMSE TrainRsquared TrainMAE method
## 1 0.7900462 0.4007219 0.6292123 knn
cmp_predict <- predict(knnModel, test_cmp[ ,-1])
postResample(cmp_predict, test_cmp[ ,1])
## RMSE Rsquared MAE
## 0.6645289 0.4623051 0.5320910
# observed vs predicted on the training data
xyplot( predict(knnModel) ~ train_cmp$Yield , type = c("p","g"))
# predictions against residuals as a diagnostic
xyplot( predict(knnModel) ~ resid(knnModel) , type = c("p","g"))
Answer:
The training data contain 144 samples and 56 predictors.
Using RMSE as the performance metric, the final values used for the MARS model were nprune = 5 and degree = 1; the corresponding resampled RMSE of the trained model is 0.6664529.
Predicting the test data with the MARS model gives an RMSE of 0.6316697.
The test RMSE is slightly lower than the resampled training RMSE, so the model is not overfitting; it generalizes well to new data.
It performs better than the kNN model on both the resampled and test data.
MARS performs internal feature selection, so only a subset of the predictors is retained in the final model.
# grid of hyperparameters to tune the MARS model
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsModel <- train(x = train_cmp[ ,-1],
y = train_cmp$Yield ,
method = "earth",
tuneGrid = marsGrid,
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = ctrl)
#Validation plot
plot(marsModel)
# optimal MARS tuning parameters (nprune, degree)
marsModel$bestTune
## nprune degree
## 4 5 1
#performance
getTrainPerf(marsModel)
## TrainRMSE TrainRsquared TrainMAE method
## 1 0.6664529 0.6068185 0.5468289 earth
cmp_predict <- predict(marsModel, test_cmp[ ,-1])
postResample(cmp_predict, test_cmp[ ,1])
## RMSE Rsquared MAE
## 0.6316697 0.5300194 0.5101996
# observed vs predicted on the training data
xyplot( predict(marsModel) ~ train_cmp$Yield , type = c("p","g"))
# predictions against residuals as a diagnostic
xyplot( predict(marsModel) ~ resid(marsModel) , type = c("p","g"))
Answer:
The training data contain 144 samples and 56 predictors.
Using RMSE as the performance metric, the tuned model-averaged neural network achieves a resampled training RMSE of 0.6267866 (the selected size and decay are not printed above; see the sketch after the test-set results below).
Predicting the test data with the neural network gives an RMSE of 0.5336337.
The test error is lower than the resampled training error, so the model is not overfitting; it generalizes well to new data.
It performs better than both the kNN and MARS models.
# define the hyperparameters
nNetGrid <- expand.grid(.decay = c(0, 0.01, .1), .size = c(1:10), .bag = FALSE)
# maximum number of weights, again sized for at most 5 hidden units
nNetMaxnwts <- 5 * (ncol(train_cmp[ ,-1]) + 1) + 5 + 1
# use the model-averaged neural network (avNNet)
# linout = TRUE gives a linear output unit, appropriate for regression
nnetAvgModel <- train(x = train_cmp[ ,-1],
y = train_cmp$Yield,
method = "avNNet",
preProcess = c("center", "scale"),
tuneGrid = nNetGrid,
trControl = ctrl,
linout = TRUE,
trace = FALSE, # reduce the amount of printed output
MaxNWts = nNetMaxnwts,
maxit = 500) # the number of iterations to find parameter estimates.
plot(nnetAvgModel)
getTrainPerf(nnetAvgModel)
## TrainRMSE TrainRsquared TrainMAE method
## 1 0.6267866 0.6703062 0.5149613 avNNet
nnetAvgPred <- predict(nnetAvgModel, test_cmp[ ,-1])
postResample(nnetAvgPred, test_cmp[ ,1])
## RMSE Rsquared MAE
## 0.5336337 0.6969595 0.4225320
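The winning size and decay values are not printed in this run; they can be recovered from the fitted object, as in this sketch:
# recover the selected tuning parameters (values not shown in the original output)
nnetAvgModel$bestTune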
Answer:
The training data contain 144 samples and 56 predictors.
Using RMSE as the performance metric, the final values used for the model were sigma = 0.01275155 and C = 16; the model retained 122 training points as support vectors. The corresponding resampled RMSE of the trained model is 0.5934408.
Predicting the test data with the SVM gives an RMSE of 0.5822715.
The model performed well on the test data, with test RMSE slightly below the resampled training RMSE.
It performs better than the kNN and MARS models, and its test RMSE is close to, though not as low as, the neural network's.
svmModel <- train(x = train_cmp[ ,-1],
y = train_cmp$Yield,
method = "svmRadial",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = ctrl)
svmModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 128, 130, 129, 130, 130, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7674302 0.5263657 0.6206958
## 0.50 0.6902547 0.5832043 0.5686566
## 1.00 0.6374649 0.6387539 0.5292632
## 2.00 0.6147021 0.6605693 0.5080481
## 4.00 0.5965025 0.6828171 0.4872453
## 8.00 0.5957551 0.6912832 0.4798324
## 16.00 0.5934408 0.6938640 0.4778767
## 32.00 0.5934408 0.6938640 0.4778767
## 64.00 0.5934408 0.6938640 0.4778767
## 128.00 0.5934408 0.6938640 0.4778767
##
## Tuning parameter 'sigma' was held constant at a value of 0.01275155
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01275155 and C = 16.
#validation plot
plot(svmModel)
#summary
svmModel$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 16
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0127515535955409
##
## Number of Support Vectors : 122
##
## Objective Function Value : -96.8568
## Training error : 0.00892
getTrainPerf(svmModel)
## TrainRMSE TrainRsquared TrainMAE method
## 1 0.5934408 0.693864 0.4778767 svmRadial
# predict the yield for the test data
svmpredict <- predict(svmModel, test_cmp[ ,-1])
# compute test-set performance (RMSE, R-squared, MAE)
postResample(svmpredict, test_cmp[ ,1])
## RMSE Rsquared MAE
## 0.5822715 0.5862931 0.4596495
Answer:
Based on RMSE as the performance metric, the model-averaged neural network performed best on the test set (RMSE 0.5336337), while the SVM achieved the lowest resampled training RMSE (0.5934408).
The SVM's test performance was close to the neural network's, and both nonlinear models performed better than the linear models fit previously, which had higher RMSE and lower R-squared (below 0.5).
# summary table of resampled training RMSE and test-set RMSE
results_table <- data.frame(
Model = c("kNN", "MARS", "Neural Network", "SVM"),
Trained_RMSE = c(0.7900462, 0.6664529, 0.6267866, 0.5934408),
Test_RMSE = c(0.6645289, 0.6316697, 0.5336337, 0.5822715)
)
# Display the table using kable
kable(results_table, format = "markdown", align = "l")
| Model          | Trained_RMSE | Test_RMSE |
|:---------------|:-------------|:----------|
| kNN            | 0.7900462    | 0.6645289 |
| MARS           | 0.6664529    | 0.6316697 |
| Neural Network | 0.6267866    | 0.5336337 |
| SVM            | 0.5934408    | 0.5822715 |
Answer:
Similar numbers of biological and manufacturing-process predictors appear among the important variables in the neural network model. In the top-10 list, process variables (6) outnumber biological variables (4), so neither group dominates, though process variables have the edge.
In the optimal linear model (PLS), ManufacturingProcess32 has the strongest positive correlation with Yield, followed by ManufacturingProcess09 and BiologicalMaterial02; ManufacturingProcess13, ManufacturingProcess36, and ManufacturingProcess17, which were in the top five, are negatively correlated with Yield.
In the optimal nonlinear model (avNNet), ManufacturingProcess32 again has the strongest positive correlation with Yield, followed by BiologicalMaterial06 and BiologicalMaterial03; ManufacturingProcess13, 17, and 36 are negatively correlated with Yield.
Hence the two models rank the predictors in a different order of importance.
# requires dplyr for arrange() and slice_head()
library(dplyr)
top_predictors <- data.frame(varImp(nnetAvgModel)$importance)
top_10_predictors <- top_predictors |>
arrange(desc(Overall)) |>
slice_head(n = 10)
top_10_predictors
## Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess13 97.83640
## BiologicalMaterial06 82.21744
## ManufacturingProcess17 77.26777
## BiologicalMaterial03 76.21094
## ManufacturingProcess36 70.96498
## BiologicalMaterial02 68.78876
## ManufacturingProcess09 67.86384
## BiologicalMaterial12 63.36203
## ManufacturingProcess06 55.15443
# correlation matrix of Yield and the top 10 predictors (requires corrplot)
library(corrplot)
cmp_transform |>
select(c("Yield", row.names(top_10_predictors))) |>
cor() |>
corrplot()
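The correlations visualized above can also be extracted numerically to confirm the signs discussed earlier; a minimal sketch:
# numeric correlations of the top 10 predictors with Yield, sorted
cors <- cor(cmp_transform[, row.names(top_10_predictors)], cmp_transform$Yield)
sort(cors[, 1], decreasing = TRUE)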
Answer:
To validate the relationship of the top 10 predictors with Yield, I plotted each against Yield using the transformed modeling data.
Scatter plots with LOESS smoothing indicate non-linear effects.
Most of these plots show clustered data, except ManufacturingProcess32, where the Yield values are concentrated at four distinct predictor values.
Such patterns are not captured by a linear model, highlighting the strength of neural networks in modeling complex interactions.
The partial dependence plots show how the model's prediction changes as each variable changes.
# Top predictors
top_10_predictors <- c("ManufacturingProcess32", "ManufacturingProcess13",
"BiologicalMaterial06", "ManufacturingProcess17",
"BiologicalMaterial03", "ManufacturingProcess36",
"BiologicalMaterial02", "ManufacturingProcess09",
"BiologicalMaterial12", "ManufacturingProcess06")
# relationship of each important predictor to Yield
# (.data[[var]] is the tidy-eval replacement for the deprecated aes_string())
for (var in top_10_predictors) {
  p <- ggplot(cmp_transform, aes(x = .data[[var]], y = Yield)) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "loess", se = FALSE, color = "blue") +
    labs(title = paste("Yield vs", var),
         x = var,
         y = "Yield")
  print(p)
}
# partial dependence plots to show how the prediction changes as the variable changes.
for (var in top_10_predictors) {
pd <- pdp::partial(nnetAvgModel, pred.var = var, train = train_cmp[ , -1])
plot(pd, main = paste("Partial Dependence on", var))
}