In this assignment, Problems 7.2 and 7.5 from Kuhn and Johnson's Applied Predictive Modeling are solved.
library(caret)
library(nnet)
library(earth)
library(kernlab)
library(mlbench)
library(AppliedPredictiveModeling)
library(dplyr)
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: y = 10 sin(π x1 x2) + 20 (x3 − 0.5)^2 + 10 x4 + 5 x5 + N(0, σ^2), where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five additional non-informative predictors). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data
featurePlot(trainingData$x, trainingData$y)
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
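As a quick sanity check (a small sketch, not part of the original exercise), the marginal correlations with y show the linear signal in X1, X2, X4, and X5; note that X3's effect, being quadratic and symmetric about 0.5, is nearly invisible to a simple correlation, and the noise predictors X6-X10 should sit near zero:
# Marginal correlations of each simulated predictor with the response
round(cor(trainingData$x, trainingData$y), 2)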
Tune several models on these data:
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
Make predictions on the test data and evaluate the KNN model's performance:
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
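Optionally, the resampling profile can be visualized with caret's plot method for train objects (a quick sketch, not part of the required output):
# Bootstrap RMSE across the candidate values of k
plot(knnModel)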
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
set.seed(100)
marsTuned <- train(x = trainingData$x,
y = trainingData$y,
method = "earth",
# Explicitly declare the candidate models to test
tuneGrid = marsGrid,
preProcess = c("center", "scale"),
tuneLength = 10)
marsTuned
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.476111 0.2204191 3.673471
## 1 3 3.785524 0.4453226 3.060822
## 1 4 2.888006 0.6739447 2.299054
## 1 5 2.617975 0.7305555 2.088069
## 1 6 2.503672 0.7565683 1.993318
## 1 7 2.118338 0.8216259 1.677739
## 1 8 1.929299 0.8520020 1.524792
## 1 9 1.831854 0.8673078 1.449707
## 1 10 1.803948 0.8715808 1.427875
## 1 11 1.794652 0.8728269 1.415841
## 1 12 1.812615 0.8703780 1.433389
## 1 13 1.811880 0.8707304 1.430757
## 1 14 1.825082 0.8688957 1.443386
## 1 15 1.837181 0.8672550 1.449804
## 1 16 1.854541 0.8647653 1.464247
## 1 17 1.853712 0.8648111 1.461572
## 1 18 1.853712 0.8648111 1.461572
## 1 19 1.853712 0.8648111 1.461572
## 1 20 1.853712 0.8648111 1.461572
## 2 2 4.476111 0.2204191 3.673471
## 2 3 3.768037 0.4504537 3.042638
## 2 4 2.944726 0.6620582 2.337447
## 2 5 2.633521 0.7276488 2.091193
## 2 6 2.510978 0.7543740 1.977686
## 2 7 2.172705 0.8141037 1.716339
## 2 8 1.997426 0.8422921 1.576842
## 2 9 1.846270 0.8649287 1.459079
## 2 10 1.750285 0.8791091 1.387328
## 2 11 1.612432 0.8971939 1.282390
## 2 12 1.524046 0.9076011 1.204662
## 2 13 1.485129 0.9118790 1.171148
## 2 14 1.522225 0.9080030 1.182493
## 2 15 1.521753 0.9082379 1.181495
## 2 16 1.530159 0.9073195 1.188689
## 2 17 1.513610 0.9091232 1.176109
## 2 18 1.515488 0.9090281 1.180086
## 2 19 1.515035 0.9091647 1.179373
## 2 20 1.517907 0.9088130 1.181517
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 13 and degree = 2.
Make predictions on the test data and evaluate the MARS model's performance:
marsPred <- predict(marsTuned, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.3227340 0.9291489 1.0524686
Examine variable importance:
varImp(marsTuned)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.40
## X2 49.00
## X5 15.72
## X3 0.00
MARS selected the predictors X1, X2, X4, and X5 as informative and excluded X3 as non-informative. Note that X3 appearing in the table with a score of 0 means the algorithm considered it during model building but ultimately found it unimportant and used it in no model terms.
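To confirm this directly, the hinge functions of the final earth model can be listed (a quick sketch; X3 should appear in none of the selected terms):
# Show the basis functions and coefficients of the tuned MARS model
summary(marsTuned$finalModel)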
set.seed(100)
svmRTuned <- train(x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.530787 0.7922715 2.013175
## 0.50 2.259539 0.8064569 1.789962
## 1.00 2.099789 0.8274242 1.656154
## 2.00 2.002943 0.8412934 1.583791
## 4.00 1.943618 0.8504425 1.546586
## 8.00 1.918711 0.8547582 1.532981
## 16.00 1.920651 0.8536189 1.536116
## 32.00 1.920651 0.8536189 1.536116
## 64.00 1.920651 0.8536189 1.536116
## 128.00 1.920651 0.8536189 1.536116
## 256.00 1.920651 0.8536189 1.536116
## 512.00 1.920651 0.8536189 1.536116
## 1024.00 1.920651 0.8536189 1.536116
## 2048.00 1.920651 0.8536189 1.536116
##
## Tuning parameter 'sigma' was held constant at a value of 0.06509124
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06509124 and C = 8.
Make predictions on the test data and evaluate the SVM model's performance:
svmRPred <- predict(svmRTuned, newdata = testData$x)
postResample(pred = svmRPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0631908 0.8275736 1.5662213
## Create a specific candidate set of models to evaluate:
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10),
.bag = FALSE)
set.seed(100)
nnetTune <- train(trainingData$x,
y = trainingData$y,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = trainControl(method = "cv"),
## Automatically standardize data prior to modeling and prediction
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
## Cap the number of weights at that of a size-5 network;
## sizes 6-10 exceed this cap, which produces the NaN rows below
MaxNWts = 5 * (ncol(trainingData$x) + 1) + 5 + 1,
maxit = 500)
nnetTune
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.392711 0.7610354 1.897330
## 0.00 2 2.410532 0.7567109 1.907478
## 0.00 3 2.043957 0.8224281 1.630751
## 0.00 4 2.289347 0.8130639 1.749187
## 0.00 5 2.445600 0.7709399 1.824446
## 0.00 6 NaN NaN NaN
## 0.00 7 NaN NaN NaN
## 0.00 8 NaN NaN NaN
## 0.00 9 NaN NaN NaN
## 0.00 10 NaN NaN NaN
## 0.01 1 2.385381 0.7602926 1.887906
## 0.01 2 2.425125 0.7510903 1.935991
## 0.01 3 2.151209 0.8016018 1.701951
## 0.01 4 2.091925 0.8154383 1.676653
## 0.01 5 2.169742 0.7999255 1.738715
## 0.01 6 NaN NaN NaN
## 0.01 7 NaN NaN NaN
## 0.01 8 NaN NaN NaN
## 0.01 9 NaN NaN NaN
## 0.01 10 NaN NaN NaN
## 0.10 1 2.393965 0.7596431 1.894191
## 0.10 2 2.423612 0.7525959 1.935872
## 0.10 3 2.169914 0.7982380 1.726854
## 0.10 4 2.059080 0.8224160 1.648610
## 0.10 5 1.975656 0.8394000 1.578979
## 0.10 6 NaN NaN NaN
## 0.10 7 NaN NaN NaN
## 0.10 8 NaN NaN NaN
## 0.10 9 NaN NaN NaN
## 0.10 10 NaN NaN NaN
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.1 and bag = FALSE.
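A note on the NaN rows: MaxNWts above allows only as many weights as a size-5 network, and nnet refuses to fit any network exceeding that cap, so caret records NaN for sizes 6-10. A minimal sketch of the weight arithmetic for a single-hidden-layer network with p inputs, h hidden units, and one linear output:
# h*(p+1) input-to-hidden weights and biases, plus h+1 hidden-to-output weights
p <- ncol(trainingData$x) # 10 predictors
h <- 1:10
data.frame(size = h,
n_weights = h * (p + 1) + (h + 1),
within_cap = h * (p + 1) + (h + 1) <= 5 * (p + 1) + 5 + 1)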
Make predictions on the test data and evaluate the neural network model's performance:
nnetPred <- predict(nnetTune, newdata = testData$x)
postResample(pred = nnetPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.1113956 0.8277556 1.5739011
Combine results of different models into a single table:
# Get results
knn_res <- postResample(pred = knnPred, obs = testData$y)
mars_res <- postResample(pred = marsPred, obs = testData$y)
svm_res <- postResample(pred = svmRPred, obs = testData$y)
nnet_res <- postResample(pred = nnetPred, obs = testData$y)
# Combine into a single table
all_results <- rbind(
KNN = knn_res,
MARS = mars_res,
SVM = svm_res,
NNET = nnet_res
)
# Convert to a data frame
results <- as.data.frame(all_results)
# See results
print(results)
## RMSE Rsquared MAE
## KNN 3.204059 0.6819919 2.568346
## MARS 1.322734 0.9291489 1.052469
## SVM 2.063191 0.8275736 1.566221
## NNET 2.111396 0.8277556 1.573901
Based on the RMSE and MAE values, MARS is the best-performing model here. Its R-squared is also the highest, indicating that it captures the most variability in the test data.
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
Load the data and perform the required pre-processing:
data("ChemicalManufacturingProcess")
dim(ChemicalManufacturingProcess)
## [1] 176 58
#head(ChemicalManufacturingProcess)
The data set contains 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) for the 176 manufacturing runs; the column Yield holds the percent yield of each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values:
# Check for missing values in each column
missing_values <- colSums(is.na(ChemicalManufacturingProcess))
print(missing_values)
## Yield BiologicalMaterial01 BiologicalMaterial02
## 0 0 0
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## 0 0 0
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## 0 0 0
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## 0 0 0
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## 0 1 3
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## 15 1 1
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## 2 1 1
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## 0 9 10
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## 1 0 1
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## 0 0 0
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## 0 0 0
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## 0 1 1
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## 1 5 5
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## 5 5 5
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## 5 5 0
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## 5 5 5
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## 5 0 0
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## 0 1 1
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## 0 0 0
## ManufacturingProcess45
## 0
# Apply KNN imputation
knn_impute_chemical <- preProcess(ChemicalManufacturingProcess, method=c('knnImpute'))
# Imputed dataset
imputed_chemical_df <- predict(knn_impute_chemical, ChemicalManufacturingProcess)
# Calculate total number of missing values after imputation
total_missing <- sum(is.na(imputed_chemical_df))
print(total_missing)
## [1] 0
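Note that preProcess with method = "knnImpute" also centers and scales every column, since K-nearest-neighbor imputation is distance-based; the response Yield is therefore standardized as well, which is why the RMSE values reported below are much smaller than raw percent-yield units. A quick check (sketch):
# Yield should now be standardized (mean ~0, sd ~1)
c(mean = mean(imputed_chemical_df$Yield), sd = sd(imputed_chemical_df$Yield))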
Remove near-zero-variance predictors, then split the data into training and test sets:
# Drop near-zero-variance predictors
imputed_chemical_df <- imputed_chemical_df %>%
select_at(vars(-one_of(nearZeroVar(., names = TRUE))))
set.seed(100)
# 70/30 split, stratified on the response
train_chemical <- createDataPartition(imputed_chemical_df$Yield, times = 1, p = .70, list = FALSE)
# Column 1 is Yield, so drop it from the predictor sets
train_chemical_x <- imputed_chemical_df[train_chemical, ][, -c(1)]
test_chemical_x <- imputed_chemical_df[-train_chemical, ][, -c(1)]
train_chemical_y <- imputed_chemical_df[train_chemical, ]$Yield
test_chemical_y <- imputed_chemical_df[-train_chemical, ]$Yield
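A quick dimension check (a small sketch; the 70/30 split of 176 rows should leave about 124 training and 52 test observations):
# Confirm split sizes and predictor count
dim(train_chemical_x)
dim(test_chemical_x)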
The optimal linear model from Exercise 6.3 (previous homework) was ridge regression, which is refit below:
set.seed(135)
ridgegrid <- data.frame(.lambda = seq(0, 0.1, length = 15))
ridge_model <- train(x = train_chemical_x, y = train_chemical_y,
method='ridge',
tuneGrid=ridgegrid,
trControl=trainControl(method='cv'),
preProc = c('center','scale')
)
ridge_model
## Ridge Regression
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 111, 112, 112, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 4.054160 0.3911420 1.6139763
## 0.007142857 1.633270 0.5251733 0.8315422
## 0.014285714 1.767593 0.5375635 0.8611365
## 0.021428571 1.777055 0.5452884 0.8623404
## 0.028571429 1.764046 0.5507236 0.8590180
## 0.035714286 1.746125 0.5548928 0.8551684
## 0.042857143 1.727679 0.5582498 0.8507987
## 0.050000000 1.709956 0.5610352 0.8464616
## 0.057142857 1.693284 0.5633933 0.8425123
## 0.064285714 1.677696 0.5654189 0.8388062
## 0.071428571 1.663129 0.5671781 0.8355731
## 0.078571429 1.649496 0.5687194 0.8324824
## 0.085714286 1.636707 0.5700795 0.8295368
## 0.092857143 1.624683 0.5712868 0.8268756
## 0.100000000 1.613350 0.5723639 0.8243647
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
Make predictions on the test set and evaluate the ridge model's performance:
ridgepred <- predict(ridge_model, newdata=test_chemical_x)
postResample(pred=ridgepred, obs=test_chemical_y)
## RMSE Rsquared MAE
## 1.6115319 0.1753037 0.8482122
Which nonlinear regression model gives the optimal resampling and test set performance?
knnModel <- train(x = train_chemical_x,
y = train_chemical_y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.8063072 0.4113745 0.6276575
## 7 0.8021466 0.4134935 0.6349301
## 9 0.7957383 0.4264981 0.6303948
## 11 0.8017544 0.4221905 0.6369931
## 13 0.8120202 0.4112094 0.6434460
## 15 0.8166274 0.4077386 0.6434984
## 17 0.8128181 0.4192364 0.6398746
## 19 0.8226790 0.4055193 0.6468352
## 21 0.8239424 0.4095640 0.6493374
## 23 0.8239422 0.4160486 0.6493893
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
Make predictions on the test data and evaluate the KNN model's performance:
knnPred <- predict(knnModel, newdata = test_chemical_x)
postResample(pred = knnPred, obs = test_chemical_y)
## RMSE Rsquared MAE
## 0.6605832 0.4693909 0.5401542
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
set.seed(100)
marsTuned <- train(x = train_chemical_x,
y = train_chemical_y,
method = "earth",
# Explicitly declare the candidate models to test
tuneGrid = marsGrid,
preProcess = c("center", "scale"),
tuneLength = 10)
marsTuned
## Multivariate Adaptive Regression Spline
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.8837262 0.3311784 0.6854087
## 1 3 0.7510781 0.5694676 0.5754854
## 1 4 0.8271809 0.5527540 0.5794891
## 1 5 0.8203153 0.5426763 0.5845587
## 1 6 0.8139182 0.5451256 0.5873172
## 1 7 0.8295880 0.5318086 0.5961639
## 1 8 0.8747891 0.4938993 0.6180948
## 1 9 0.8759510 0.4857531 0.6226605
## 1 10 0.8933521 0.4650080 0.6391421
## 1 11 0.9305400 0.4470554 0.6635047
## 1 12 0.9740609 0.4287921 0.6853149
## 1 13 0.9918454 0.4282800 0.6890905
## 1 14 1.0744403 0.4041813 0.7105806
## 1 15 1.1944929 0.3994023 0.7320364
## 1 16 1.3949698 0.3877695 0.7705839
## 1 17 1.3931039 0.3860506 0.7699150
## 1 18 1.3913721 0.3842815 0.7697398
## 1 19 1.4007599 0.3842434 0.7736725
## 1 20 1.4285178 0.3832159 0.7794604
## 2 2 0.8710108 0.3474798 0.6750623
## 2 3 0.6786687 0.5931803 0.5490885
## 2 4 0.7800390 0.5590575 0.5738683
## 2 5 0.7153508 0.5605823 0.5728347
## 2 6 1.0484891 0.5133484 0.6343887
## 2 7 0.8273236 0.5057285 0.6019359
## 2 8 1.1792370 0.4717877 0.6640681
## 2 9 1.1962236 0.4580054 0.6713780
## 2 10 1.1752026 0.4376325 0.6801762
## 2 11 1.2226689 0.4074134 0.7091184
## 2 12 1.6010727 0.3740950 0.7819973
## 2 13 1.4870836 0.3782513 0.7807328
## 2 14 1.5275899 0.3684877 0.7990793
## 2 15 1.6032317 0.3564796 0.8206816
## 2 16 3.4184827 0.3388729 1.0888041
## 2 17 3.8235480 0.3310718 1.1558747
## 2 18 4.4741956 0.3045501 1.2716957
## 2 19 4.6309617 0.2919685 1.3085895
## 2 20 4.5462829 0.2872523 1.3036636
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 2.
Make predictions on the test data and evaluate the MARS model's performance:
marsPred <- predict(marsTuned, newdata = test_chemical_x)
postResample(pred = marsPred, obs = test_chemical_y)
## RMSE Rsquared MAE
## 0.7677886 0.4789852 0.6238710
set.seed(100)
svmRTuned <- train(x = train_chemical_x,
y = train_chemical_y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 112, 112, 111, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7950783 0.5019956 0.6355235
## 0.50 0.7206527 0.5675452 0.5786388
## 1.00 0.6620420 0.6391608 0.5329275
## 2.00 0.6199184 0.6705515 0.5039466
## 4.00 0.5997918 0.6797461 0.4884378
## 8.00 0.5916547 0.6873380 0.4857772
## 16.00 0.5916547 0.6873380 0.4857772
## 32.00 0.5916547 0.6873380 0.4857772
## 64.00 0.5916547 0.6873380 0.4857772
## 128.00 0.5916547 0.6873380 0.4857772
## 256.00 0.5916547 0.6873380 0.4857772
## 512.00 0.5916547 0.6873380 0.4857772
## 1024.00 0.5916547 0.6873380 0.4857772
## 2048.00 0.5916547 0.6873380 0.4857772
##
## Tuning parameter 'sigma' was held constant at a value of 0.01447582
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01447582 and C = 8.
Make predictions on the test data and evaluate the SVM model's performance:
svmRPred <- predict(svmRTuned, newdata = test_chemical_x)
postResample(pred = svmRPred, obs = test_chemical_y)
## RMSE Rsquared MAE
## 0.6809040 0.5165981 0.5419960
## Create a specific candidate set of models to evaluate:
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10),
.bag = FALSE)
set.seed(100)
nnetTune <- train(train_chemical_x,
y = train_chemical_y,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = trainControl(method = "cv"),
## Automatically standardize data prior to modeling and prediction
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
## Cap the number of weights at that of a size-5 network;
## as before, sizes 6-10 exceed the cap and yield NaN below
MaxNWts = 5 * (ncol(train_chemical_x) + 1) + 5 + 1,
maxit = 500)
nnetTune
## Model Averaged Neural Network
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 112, 112, 111, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 0.7824281 0.4987313 0.6096973
## 0.00 2 0.7740575 0.5301156 0.5949447
## 0.00 3 0.7637564 0.5563183 0.6079613
## 0.00 4 0.6671040 0.6423764 0.5292693
## 0.00 5 0.7234008 0.5864772 0.5851181
## 0.00 6 NaN NaN NaN
## 0.00 7 NaN NaN NaN
## 0.00 8 NaN NaN NaN
## 0.00 9 NaN NaN NaN
## 0.00 10 NaN NaN NaN
## 0.01 1 0.7766836 0.5170718 0.6302954
## 0.01 2 0.7279324 0.5982842 0.5794846
## 0.01 3 0.5953180 0.7173802 0.4635498
## 0.01 4 0.6637780 0.6741588 0.5202170
## 0.01 5 0.5655143 0.7073605 0.4652646
## 0.01 6 NaN NaN NaN
## 0.01 7 NaN NaN NaN
## 0.01 8 NaN NaN NaN
## 0.01 9 NaN NaN NaN
## 0.01 10 NaN NaN NaN
## 0.10 1 0.7135024 0.5696593 0.5793630
## 0.10 2 0.5756759 0.7373548 0.4665686
## 0.10 3 0.6176515 0.6932432 0.4829121
## 0.10 4 0.5601016 0.7345852 0.4494931
## 0.10 5 0.5631801 0.7336356 0.4651236
## 0.10 6 NaN NaN NaN
## 0.10 7 NaN NaN NaN
## 0.10 8 NaN NaN NaN
## 0.10 9 NaN NaN NaN
## 0.10 10 NaN NaN NaN
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.1 and bag = FALSE.
Make predictions on the test data and evaluate the neural network model's performance:
nnetPred <- predict(nnetTune, newdata = test_chemical_x)
postResample(pred = nnetPred, obs = test_chemical_y)
## RMSE Rsquared MAE
## 0.8963296 0.3950717 0.7084741
Combine results of different models into a single table:
# Get results
knn_res1 <- postResample(pred = knnPred, obs = test_chemical_y)
mars_res1<- postResample(pred = marsPred, obs = test_chemical_y)
svm_res1 <- postResample(pred = svmRPred, obs = test_chemical_y)
nnet_res1 <- postResample(pred = nnetPred, obs = test_chemical_y)
# Combine into a single table
all_results1 <- rbind(
KNN = knn_res1,
MARS = mars_res1,
SVM = svm_res1,
NNET = nnet_res1
)
# Convert to a data frame
results1 <- as.data.frame(all_results1)
# See results
print(results1)
## RMSE Rsquared MAE
## KNN 0.6605832 0.4693909 0.5401542
## MARS 0.7677886 0.4789852 0.6238710
## SVM 0.6809040 0.5165981 0.5419960
## NNET 0.8963296 0.3950717 0.7084741
Taking resampling and test-set performance together, the SVM model is the best choice. Its cross-validated RMSE (0.592) is nearly the lowest, and on the test set it achieves the highest R-squared (0.517) with an RMSE close to KNN's. The neural network resamples slightly better (0.560) but generalizes worst, and KNN's small edge in test RMSE comes with a clearly lower R-squared, so the SVM offers the best balance of predictive accuracy and fit.
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
List the variable importances for the optimal nonlinear model (SVM):
varImp(svmRTuned)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 94.64
## BiologicalMaterial06 76.19
## ManufacturingProcess17 73.26
## ManufacturingProcess31 69.73
## ManufacturingProcess09 69.65
## BiologicalMaterial12 66.34
## ManufacturingProcess36 66.21
## BiologicalMaterial02 66.06
## BiologicalMaterial03 63.32
## ManufacturingProcess11 57.92
## ManufacturingProcess06 53.69
## ManufacturingProcess30 50.51
## BiologicalMaterial04 50.26
## ManufacturingProcess29 44.79
## BiologicalMaterial08 44.11
## BiologicalMaterial09 42.22
## BiologicalMaterial11 39.01
## ManufacturingProcess33 38.69
## ManufacturingProcess02 37.76
List the variable importances for the optimal linear model (ridge regression):
varImp(ridge_model)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 94.64
## BiologicalMaterial06 76.19
## ManufacturingProcess17 73.26
## ManufacturingProcess31 69.73
## ManufacturingProcess09 69.65
## BiologicalMaterial12 66.34
## ManufacturingProcess36 66.21
## BiologicalMaterial02 66.06
## BiologicalMaterial03 63.32
## ManufacturingProcess11 57.92
## ManufacturingProcess06 53.69
## ManufacturingProcess30 50.51
## BiologicalMaterial04 50.26
## ManufacturingProcess29 44.79
## BiologicalMaterial08 44.11
## BiologicalMaterial09 42.22
## BiologicalMaterial11 39.01
## ManufacturingProcess33 38.69
## ManufacturingProcess02 37.76
For the optimal nonlinear model, ManufacturingProcess32, ManufacturingProcess13, BiologicalMaterial06, ManufacturingProcess17, ManufacturingProcess31, and ManufacturingProcess09 are the most important predictors, and the process variables dominate the list. The top ten predictors are identical for the nonlinear and linear models; this is expected here, because neither svmRadial nor ridge has a model-specific importance measure, so caret falls back to the same model-free loess R-squared ranking for both (hence the "loess r-squared variable importance" header above).
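To quantify the dominance of the process variables (a small sketch using the varImp output above):
# Tally biological vs. process predictors among the top 20
imp <- varImp(svmRTuned)$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^Biological", top20), "Biological", "Process"))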
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
The relationships between the five most important predictors and the response are explored below:
# Define the top five important predictors
imp_var <- c('ManufacturingProcess32','ManufacturingProcess13',
'BiologicalMaterial06','ManufacturingProcess17',
'ManufacturingProcess31')
featurePlot(train_chemical_x[,imp_var], train_chemical_y)
Get correlations:
cor(train_chemical_x[,imp_var], train_chemical_y)
## [,1]
## ManufacturingProcess32 0.59616057
## ManufacturingProcess13 -0.55704820
## BiologicalMaterial06 0.49749646
## ManufacturingProcess17 -0.46822624
## ManufacturingProcess31 -0.06849684
The plots and correlations of the top five predictors with the response show that ManufacturingProcess32 has a strong positive influence on yield and BiologicalMaterial06 a moderate positive one, while ManufacturingProcess13 has a strong negative influence and ManufacturingProcess17 a moderate negative one. ManufacturingProcess31, despite its high importance ranking, shows a negligible linear correlation with yield. These relationships suggest which process settings and materials to prioritize, or avoid, when trying to optimize yield in the manufacturing process.