In this assignment, Problems 7.2 and 7.5 from Kuhn and Johnson's Applied Predictive Modeling are solved.
library(caret)
library(nnet)
library(earth)
library(kernlab)
library(mlbench)
library(AppliedPredictiveModeling)
library(dplyr)
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: y = 10 sin(π x1 x2) + 20 (x3 − 0.5)^2 + 10 x4 + 5 x5 + N(0, σ^2), where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five additional non-informative predictors). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data
featurePlot(trainingData$x, trainingData$y)
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
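As a quick sanity check (a small sketch, not part of the original exercise), the marginal correlations with y show the linear signal in X1, X2, X4, and X5; note that X3's effect, being quadratic and symmetric about 0.5, is nearly invisible to a simple correlation, and the noise predictors X6-X10 should sit near zero:
# Marginal correlations of each simulated predictor with the response
round(cor(trainingData$x, trainingData$y), 2)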
Tune several models on these data:
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
Make predictions on the test data and evaluate the KNN model's performance:
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
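Optionally, the resampling profile can be visualized with caret's plot method for train objects (a quick sketch, not part of the required output):
# Bootstrap RMSE across the candidate values of k
plot(knnModel)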
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
set.seed(100)
marsTuned <- train(x = trainingData$x,
y = trainingData$y,
method = "earth",
# Explicitly declare the candidate models to test
tuneGrid = marsGrid,
preProcess = c("center", "scale"),
tuneLength = 10)
marsTuned
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.476111 0.2204191 3.673471
## 1 3 3.785524 0.4453226 3.060822
## 1 4 2.888006 0.6739447 2.299054
## 1 5 2.617975 0.7305555 2.088069
## 1 6 2.503672 0.7565683 1.993318
## 1 7 2.118338 0.8216259 1.677739
## 1 8 1.929299 0.8520020 1.524792
## 1 9 1.831854 0.8673078 1.449707
## 1 10 1.803948 0.8715808 1.427875
## 1 11 1.794652 0.8728269 1.415841
## 1 12 1.812615 0.8703780 1.433389
## 1 13 1.811880 0.8707304 1.430757
## 1 14 1.825082 0.8688957 1.443386
## 1 15 1.837181 0.8672550 1.449804
## 1 16 1.854541 0.8647653 1.464247
## 1 17 1.853712 0.8648111 1.461572
## 1 18 1.853712 0.8648111 1.461572
## 1 19 1.853712 0.8648111 1.461572
## 1 20 1.853712 0.8648111 1.461572
## 2 2 4.476111 0.2204191 3.673471
## 2 3 3.768037 0.4504537 3.042638
## 2 4 2.944726 0.6620582 2.337447
## 2 5 2.633521 0.7276488 2.091193
## 2 6 2.510978 0.7543740 1.977686
## 2 7 2.172705 0.8141037 1.716339
## 2 8 1.997426 0.8422921 1.576842
## 2 9 1.846270 0.8649287 1.459079
## 2 10 1.750285 0.8791091 1.387328
## 2 11 1.612432 0.8971939 1.282390
## 2 12 1.524046 0.9076011 1.204662
## 2 13 1.485129 0.9118790 1.171148
## 2 14 1.522225 0.9080030 1.182493
## 2 15 1.521753 0.9082379 1.181495
## 2 16 1.530159 0.9073195 1.188689
## 2 17 1.513610 0.9091232 1.176109
## 2 18 1.515488 0.9090281 1.180086
## 2 19 1.515035 0.9091647 1.179373
## 2 20 1.517907 0.9088130 1.181517
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 13 and degree = 2.
Make predictions on the test data and evaluate the MARS model's performance:
marsPred <- predict(marsTuned, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.3227340 0.9291489 1.0524686
Examine variable importance:
varImp(marsTuned)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.40
## X2 49.00
## X5 15.72
## X3 0.00
MARS selected the predictors X1, X2, X4, and X5 as informative and excluded X3 as non-informative. Note that X3 appearing in the table with a score of 0 means the algorithm considered it during model building but ultimately found it unimportant and used it in no model terms.
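To confirm this directly, the hinge functions of the final earth model can be listed (a quick sketch; X3 should appear in none of the selected terms):
# Show the basis functions and coefficients of the tuned MARS model
summary(marsTuned$finalModel)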
set.seed(100)
svmRTuned <- train(x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.530787 0.7922715 2.013175
## 0.50 2.259539 0.8064569 1.789962
## 1.00 2.099789 0.8274242 1.656154
## 2.00 2.002943 0.8412934 1.583791
## 4.00 1.943618 0.8504425 1.546586
## 8.00 1.918711 0.8547582 1.532981
## 16.00 1.920651 0.8536189 1.536116
## 32.00 1.920651 0.8536189 1.536116
## 64.00 1.920651 0.8536189 1.536116
## 128.00 1.920651 0.8536189 1.536116
## 256.00 1.920651 0.8536189 1.536116
## 512.00 1.920651 0.8536189 1.536116
## 1024.00 1.920651 0.8536189 1.536116
## 2048.00 1.920651 0.8536189 1.536116
##
## Tuning parameter 'sigma' was held constant at a value of 0.06509124
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06509124 and C = 8.
Make predictions on the test data and evaluate the SVM model's performance:
svmRPred <- predict(svmRTuned, newdata = testData$x)
postResample(pred = svmRPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0631908 0.8275736 1.5662213
## Create a specific candidate set of models to evaluate:
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10),
.bag = FALSE)
set.seed(100)
nnetTune <- train(trainingData$x,
y = trainingData$y,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = trainControl(method = "cv"),
## Automatically standardize data prior to modeling and prediction
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
## Cap the number of weights at that of a size-5 network;
## sizes 6-10 exceed this cap, which produces the NaN rows below
MaxNWts = 5 * (ncol(trainingData$x) + 1) + 5 + 1,
maxit = 500)
nnetTune
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.392711 0.7610354 1.897330
## 0.00 2 2.410532 0.7567109 1.907478
## 0.00 3 2.043957 0.8224281 1.630751
## 0.00 4 2.289347 0.8130639 1.749187
## 0.00 5 2.445600 0.7709399 1.824446
## 0.00 6 NaN NaN NaN
## 0.00 7 NaN NaN NaN
## 0.00 8 NaN NaN NaN
## 0.00 9 NaN NaN NaN
## 0.00 10 NaN NaN NaN
## 0.01 1 2.385381 0.7602926 1.887906
## 0.01 2 2.425125 0.7510903 1.935991
## 0.01 3 2.151209 0.8016018 1.701951
## 0.01 4 2.091925 0.8154383 1.676653
## 0.01 5 2.169742 0.7999255 1.738715
## 0.01 6 NaN NaN NaN
## 0.01 7 NaN NaN NaN
## 0.01 8 NaN NaN NaN
## 0.01 9 NaN NaN NaN
## 0.01 10 NaN NaN NaN
## 0.10 1 2.393965 0.7596431 1.894191
## 0.10 2 2.423612 0.7525959 1.935872
## 0.10 3 2.169914 0.7982380 1.726854
## 0.10 4 2.059080 0.8224160 1.648610
## 0.10 5 1.975656 0.8394000 1.578979
## 0.10 6 NaN NaN NaN
## 0.10 7 NaN NaN NaN
## 0.10 8 NaN NaN NaN
## 0.10 9 NaN NaN NaN
## 0.10 10 NaN NaN NaN
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.1 and bag = FALSE.
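A note on the NaN rows: MaxNWts above allows only as many weights as a size-5 network, and nnet refuses to fit any network exceeding that cap, so caret records NaN for sizes 6-10. A minimal sketch of the weight arithmetic for a single-hidden-layer network with p inputs, h hidden units, and one linear output:
# h*(p+1) input-to-hidden weights and biases, plus h+1 hidden-to-output weights
p <- ncol(trainingData$x) # 10 predictors
h <- 1:10
data.frame(size = h,
n_weights = h * (p + 1) + (h + 1),
within_cap = h * (p + 1) + (h + 1) <= 5 * (p + 1) + 5 + 1)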
Make predictions on the test data and evaluate the neural network model's performance:
nnetPred <- predict(nnetTune, newdata = testData$x)
postResample(pred = nnetPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.1113956 0.8277556 1.5739011
Combine results of different models into a single table:
# Get results
knn_res <- postResample(pred = knnPred, obs = testData$y)
mars_res <- postResample(pred = marsPred, obs = testData$y)
svm_res <- postResample(pred = svmRPred, obs = testData$y)
nnet_res <- postResample(pred = nnetPred, obs = testData$y)
# Combine into a single table
all_results <- rbind(
KNN = knn_res,
MARS = mars_res,
SVM = svm_res,
NNET = nnet_res
)
# Convert to a data frame
results <- as.data.frame(all_results)
# See results
print(results)
## RMSE Rsquared MAE
## KNN 3.204059 0.6819919 2.568346
## MARS 1.322734 0.9291489 1.052469
## SVM 2.063191 0.8275736 1.566221
## NNET 2.111396 0.8277556 1.573901
Based on the RMSE and MAE values, MARS is the best-performing model here. Its R-squared is also the highest, indicating that it captures the most variability in the test data.
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
Load the data and perform the required pre-processing:
data("ChemicalManufacturingProcess")
dim(ChemicalManufacturingProcess)
## [1] 176 58
#head(ChemicalManufacturingProcess)
The data set contains 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) for the 176 manufacturing runs; the column Yield holds the percent yield of each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values:
# Check for missing values in each column
missing_values <- colSums(is.na(ChemicalManufacturingProcess))
print(missing_values)
## Yield BiologicalMaterial01 BiologicalMaterial02
## 0 0 0
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## 0 0 0
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## 0 0 0
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## 0 0 0
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## 0 1 3
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## 15 1 1
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## 2 1 1
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## 0 9 10
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## 1 0 1
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## 0 0 0
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## 0 0 0
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## 0 1 1
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## 1 5 5
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## 5 5 5
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## 5 5 0
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## 5 5 5
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## 5 0 0
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## 0 1 1
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## 0 0 0
## ManufacturingProcess45
## 0
# Apply KNN imputation
knn_impute_chemical <- preProcess(ChemicalManufacturingProcess, method=c('knnImpute'))
# Imputed dataset
imputed_chemical_df <- predict(knn_impute_chemical, ChemicalManufacturingProcess)
# Calculate total number of missing values after imputation
total_missing <- sum(is.na(imputed_chemical_df))
print(total_missing)
## [1] 0
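Note that preProcess with method = "knnImpute" also centers and scales every column, since K-nearest-neighbor imputation is distance-based; the response Yield is therefore standardized as well, which is why the RMSE values reported below are much smaller than raw percent-yield units. A quick check (sketch):
# Yield should now be standardized (mean ~0, sd ~1)
c(mean = mean(imputed_chemical_df$Yield), sd = sd(imputed_chemical_df$Yield))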
Remove near-zero-variance predictors, then split the data into training and test sets:
# Drop near-zero-variance predictors
imputed_chemical_df <- imputed_chemical_df %>%
select_at(vars(-one_of(nearZeroVar(., names = TRUE))))
set.seed(100)
# 70/30 split, stratified on the response
train_chemical <- createDataPartition(imputed_chemical_df$Yield, times = 1, p = .70, list = FALSE)
# Column 1 is Yield, so drop it from the predictor sets
train_chemical_x <- imputed_chemical_df[train_chemical, ][, -c(1)]
test_chemical_x <- imputed_chemical_df[-train_chemical, ][, -c(1)]
train_chemical_y <- imputed_chemical_df[train_chemical, ]$Yield
test_chemical_y <- imputed_chemical_df[-train_chemical, ]$Yield
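A quick dimension check (a small sketch; the 70/30 split of 176 rows should leave about 124 training and 52 test observations):
# Confirm split sizes and predictor count
dim(train_chemical_x)
dim(test_chemical_x)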
The optimal linear model from Exercise 6.3 (previous homework) was ridge regression, which is refit below:
set.seed(135)
ridgegrid <- data.frame(.lambda = seq(0, 0.1, length = 15))
ridge_model <- train(x = train_chemical_x, y = train_chemical_y,
method='ridge',
tuneGrid=ridgegrid,
trControl=trainControl(method='cv'),
preProc = c('center','scale')
)
ridge_model
## Ridge Regression
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 111, 112, 112, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 4.054160 0.3911420 1.6139763
## 0.007142857 1.633270 0.5251733 0.8315422
## 0.014285714 1.767593 0.5375635 0.8611365
## 0.021428571 1.777055 0.5452884 0.8623404
## 0.028571429 1.764046 0.5507236 0.8590180
## 0.035714286 1.746125 0.5548928 0.8551684
## 0.042857143 1.727679 0.5582498 0.8507987
## 0.050000000 1.709956 0.5610352 0.8464616
## 0.057142857 1.693284 0.5633933 0.8425123
## 0.064285714 1.677696 0.5654189 0.8388062
## 0.071428571 1.663129 0.5671781 0.8355731
## 0.078571429 1.649496 0.5687194 0.8324824
## 0.085714286 1.636707 0.5700795 0.8295368
## 0.092857143 1.624683 0.5712868 0.8268756
## 0.100000000 1.613350 0.5723639 0.8243647
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
Make predictions on the test set and evaluate the ridge model's performance:
ridgepred <- predict(ridge_model, newdata=test_chemical_x)
postResample(pred=ridgepred, obs=test_chemical_y)
## RMSE Rsquared MAE
## 1.6115319 0.1753037 0.8482122
Which nonlinear regression model gives the optimal resampling and test set performance?
knnModel <- train(x = train_chemical_x,
y = train_chemical_y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.8063072 0.4113745 0.6276575
## 7 0.8021466 0.4134935 0.6349301
## 9 0.7957383 0.4264981 0.6303948
## 11 0.8017544 0.4221905 0.6369931
## 13 0.8120202 0.4112094 0.6434460
## 15 0.8166274 0.4077386 0.6434984
## 17 0.8128181 0.4192364 0.6398746
## 19 0.8226790 0.4055193 0.6468352
## 21 0.8239424 0.4095640 0.6493374
## 23 0.8239422 0.4160486 0.6493893
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
Make predictions on the test data and evaluate the KNN model's performance:
knnPred <- predict(knnModel, newdata = test_chemical_x)
postResample(pred = knnPred, obs = test_chemical_y)
## RMSE Rsquared MAE
## 0.6605832 0.4693909 0.5401542
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
set.seed(100)
marsTuned <- train(x = train_chemical_x,
y = train_chemical_y,
method = "earth",
# Explicitly declare the candidate models to test
tuneGrid = marsGrid,
preProcess = c("center", "scale"),
tuneLength = 10)
marsTuned
## Multivariate Adaptive Regression Spline
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.8837262 0.3311784 0.6854087
## 1 3 0.7510781 0.5694676 0.5754854
## 1 4 0.8271809 0.5527540 0.5794891
## 1 5 0.8203153 0.5426763 0.5845587
## 1 6 0.8139182 0.5451256 0.5873172
## 1 7 0.8295880 0.5318086 0.5961639
## 1 8 0.8747891 0.4938993 0.6180948
## 1 9 0.8759510 0.4857531 0.6226605
## 1 10 0.8933521 0.4650080 0.6391421
## 1 11 0.9305400 0.4470554 0.6635047
## 1 12 0.9740609 0.4287921 0.6853149
## 1 13 0.9918454 0.4282800 0.6890905
## 1 14 1.0744403 0.4041813 0.7105806
## 1 15 1.1944929 0.3994023 0.7320364
## 1 16 1.3949698 0.3877695 0.7705839
## 1 17 1.3931039 0.3860506 0.7699150
## 1 18 1.3913721 0.3842815 0.7697398
## 1 19 1.4007599 0.3842434 0.7736725
## 1 20 1.4285178 0.3832159 0.7794604
## 2 2 0.8710108 0.3474798 0.6750623
## 2 3 0.6786687 0.5931803 0.5490885
## 2 4 0.7800390 0.5590575 0.5738683
## 2 5 0.7153508 0.5605823 0.5728347
## 2 6 1.0484891 0.5133484 0.6343887
## 2 7 0.8273236 0.5057285 0.6019359
## 2 8 1.1792370 0.4717877 0.6640681
## 2 9 1.1962236 0.4580054 0.6713780
## 2 10 1.1752026 0.4376325 0.6801762
## 2 11 1.2226689 0.4074134 0.7091184
## 2 12 1.6010727 0.3740950 0.7819973
## 2 13 1.4870836 0.3782513 0.7807328
## 2 14 1.5275899 0.3684877 0.7990793
## 2 15 1.6032317 0.3564796 0.8206816
## 2 16 3.4184827 0.3388729 1.0888041
## 2 17 3.8235480 0.3310718 1.1558747
## 2 18 4.4741956 0.3045501 1.2716957
## 2 19 4.6309617 0.2919685 1.3085895
## 2 20 4.5462829 0.2872523 1.3036636
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 2.
Make predictions on the test data and evaluate the MARS model's performance:
marsPred <- predict(marsTuned, newdata = test_chemical_x)
postResample(pred = marsPred, obs = test_chemical_y)
## RMSE Rsquared MAE
## 0.7677886 0.4789852 0.6238710
set.seed(100)
svmRTuned <- train(x = train_chemical_x,
y = train_chemical_y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 112, 112, 111, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7950783 0.5019956 0.6355235
## 0.50 0.7206527 0.5675452 0.5786388
## 1.00 0.6620420 0.6391608 0.5329275
## 2.00 0.6199184 0.6705515 0.5039466
## 4.00 0.5997918 0.6797461 0.4884378
## 8.00 0.5916547 0.6873380 0.4857772
## 16.00 0.5916547 0.6873380 0.4857772
## 32.00 0.5916547 0.6873380 0.4857772
## 64.00 0.5916547 0.6873380 0.4857772
## 128.00 0.5916547 0.6873380 0.4857772
## 256.00 0.5916547 0.6873380 0.4857772
## 512.00 0.5916547 0.6873380 0.4857772
## 1024.00 0.5916547 0.6873380 0.4857772
## 2048.00 0.5916547 0.6873380 0.4857772
##
## Tuning parameter 'sigma' was held constant at a value of 0.01447582
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01447582 and C = 8.
Make predictions on the test data and evaluate the SVM model's performance:
svmRPred <- predict(svmRTuned, newdata = test_chemical_x)
postResample(pred = svmRPred, obs = test_chemical_y)
## RMSE Rsquared MAE
## 0.6809040 0.5165981 0.5419960
## Create a specific candidate set of models to evaluate:
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10),
.bag = FALSE)
set.seed(100)
nnetTune <- train(train_chemical_x,
y = train_chemical_y,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = trainControl(method = "cv"),
## Automatically standardize data prior to modeling and prediction
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
## Cap the number of weights at that of a size-5 network;
## as before, sizes 6-10 exceed the cap and yield NaN below
MaxNWts = 5 * (ncol(train_chemical_x) + 1) + 5 + 1,
maxit = 500)
nnetTune
## Model Averaged Neural Network
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 112, 112, 111, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 0.7824281 0.4987313 0.6096973
## 0.00 2 0.7740575 0.5301156 0.5949447
## 0.00 3 0.7637564 0.5563183 0.6079613
## 0.00 4 0.6671040 0.6423764 0.5292693
## 0.00 5 0.7234008 0.5864772 0.5851181
## 0.00 6 NaN NaN NaN
## 0.00 7 NaN NaN NaN
## 0.00 8 NaN NaN NaN
## 0.00 9 NaN NaN NaN
## 0.00 10 NaN NaN NaN
## 0.01 1 0.7766836 0.5170718 0.6302954
## 0.01 2 0.7279324 0.5982842 0.5794846
## 0.01 3 0.5953180 0.7173802 0.4635498
## 0.01 4 0.6637780 0.6741588 0.5202170
## 0.01 5 0.5655143 0.7073605 0.4652646
## 0.01 6 NaN NaN NaN
## 0.01 7 NaN NaN NaN
## 0.01 8 NaN NaN NaN
## 0.01 9 NaN NaN NaN
## 0.01 10 NaN NaN NaN
## 0.10 1 0.7135024 0.5696593 0.5793630
## 0.10 2 0.5756759 0.7373548 0.4665686
## 0.10 3 0.6176515 0.6932432 0.4829121
## 0.10 4 0.5601016 0.7345852 0.4494931
## 0.10 5 0.5631801 0.7336356 0.4651236
## 0.10 6 NaN NaN NaN
## 0.10 7 NaN NaN NaN
## 0.10 8 NaN NaN NaN
## 0.10 9 NaN NaN NaN
## 0.10 10 NaN NaN NaN
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.1 and bag = FALSE.
Make predictions on the test data and evaluate the neural network model's performance:
nnetPred <- predict(nnetTune, newdata = test_chemical_x)
postResample(pred = nnetPred, obs = test_chemical_y)
## RMSE Rsquared MAE
## 0.8963296 0.3950717 0.7084741
Combine results of different models into a single table:
# Get results
knn_res1 <- postResample(pred = knnPred, obs = test_chemical_y)
mars_res1<- postResample(pred = marsPred, obs = test_chemical_y)
svm_res1 <- postResample(pred = svmRPred, obs = test_chemical_y)
nnet_res1 <- postResample(pred = nnetPred, obs = test_chemical_y)
# Combine into a single table
all_results1 <- rbind(
KNN = knn_res1,
MARS = mars_res1,
SVM = svm_res1,
NNET = nnet_res1
)
# Convert to a data frame
results1 <- as.data.frame(all_results1)
# See results
print(results1)
## RMSE Rsquared MAE
## KNN 0.6605832 0.4693909 0.5401542
## MARS 0.7677886 0.4789852 0.6238710
## SVM 0.6809040 0.5165981 0.5419960
## NNET 0.8963296 0.3950717 0.7084741
Taking resampling and test-set performance together, the SVM model is the best choice. Its cross-validated RMSE (0.592) is nearly the lowest, and on the test set it achieves the highest R-squared (0.517) with an RMSE close to KNN's. The neural network resamples slightly better (0.560) but generalizes worst, and KNN's small edge in test RMSE comes with a clearly lower R-squared, so the SVM offers the best balance of predictive accuracy and fit.
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
List the variable importances for the optimal nonlinear model (SVM):
varImp(svmRTuned)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 94.64
## BiologicalMaterial06 76.19
## ManufacturingProcess17 73.26
## ManufacturingProcess31 69.73
## ManufacturingProcess09 69.65
## BiologicalMaterial12 66.34
## ManufacturingProcess36 66.21
## BiologicalMaterial02 66.06
## BiologicalMaterial03 63.32
## ManufacturingProcess11 57.92
## ManufacturingProcess06 53.69
## ManufacturingProcess30 50.51
## BiologicalMaterial04 50.26
## ManufacturingProcess29 44.79
## BiologicalMaterial08 44.11
## BiologicalMaterial09 42.22
## BiologicalMaterial11 39.01
## ManufacturingProcess33 38.69
## ManufacturingProcess02 37.76
List the variable importances for the optimal linear model (ridge regression):
varImp(ridge_model)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 94.64
## BiologicalMaterial06 76.19
## ManufacturingProcess17 73.26
## ManufacturingProcess31 69.73
## ManufacturingProcess09 69.65
## BiologicalMaterial12 66.34
## ManufacturingProcess36 66.21
## BiologicalMaterial02 66.06
## BiologicalMaterial03 63.32
## ManufacturingProcess11 57.92
## ManufacturingProcess06 53.69
## ManufacturingProcess30 50.51
## BiologicalMaterial04 50.26
## ManufacturingProcess29 44.79
## BiologicalMaterial08 44.11
## BiologicalMaterial09 42.22
## BiologicalMaterial11 39.01
## ManufacturingProcess33 38.69
## ManufacturingProcess02 37.76
For the optimal nonlinear model, ManufacturingProcess32, ManufacturingProcess13, BiologicalMaterial06, ManufacturingProcess17, ManufacturingProcess31, and ManufacturingProcess09 are the most important predictors, and the process variables dominate the list. The top ten predictors are identical for the nonlinear and linear models; this is expected here, because neither svmRadial nor ridge has a model-specific importance measure, so caret falls back to the same model-free loess R-squared ranking for both (hence the "loess r-squared variable importance" header above).
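To quantify the dominance of the process variables (a small sketch using the varImp output above):
# Tally biological vs. process predictors among the top 20
imp <- varImp(svmRTuned)$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^Biological", top20), "Biological", "Process"))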
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
The relationships between the five most important predictors and the response are explored below:
# Define the top five important predictors
imp_var <- c('ManufacturingProcess32','ManufacturingProcess13',
'BiologicalMaterial06','ManufacturingProcess17',
'ManufacturingProcess31')
featurePlot(train_chemical_x[,imp_var], train_chemical_y)
Get correlations:
cor(train_chemical_x[,imp_var], train_chemical_y)
## [,1]
## ManufacturingProcess32 0.59616057
## ManufacturingProcess13 -0.55704820
## BiologicalMaterial06 0.49749646
## ManufacturingProcess17 -0.46822624
## ManufacturingProcess31 -0.06849684
The plots and correlations of the top five predictors with the response show that ManufacturingProcess32 has a strong positive influence on yield and BiologicalMaterial06 a moderate positive one, while ManufacturingProcess13 has a strong negative influence and ManufacturingProcess17 a moderate negative one. ManufacturingProcess31, despite its high importance ranking, shows a negligible linear correlation with yield. These relationships suggest which process settings and materials to prioritize, or avoid, when trying to optimize yield in the manufacturing process.