Exercise 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\(y = 10 \sin(\pi x_{1}x_{2}) + 20(x_{3} - 0.5)^2 + 10x_{4} + 5x_{5} + N(0, \sigma^2)\)

where the \(x\) values are random variables uniformly distributed on [0, 1] (the simulation also creates five additional non-informative predictors). The mlbench package contains a function called mlbench.friedman1 that simulates these data:
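For intuition, the equation is easy to simulate by hand; the sketch below (the function name simulateFriedman1 is ours) mirrors what mlbench.friedman1 does, including the five non-informative columns:

# A hand-rolled Friedman 1 simulator (illustrative; mlbench.friedman1
# below is the canonical implementation).
simulateFriedman1 <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10)   # ten U(0, 1) predictors; x6-x10 are noise
  y <- 10 * sin(pi * x[, 1] * x[, 2]) +
       20 * (x[, 3] - 0.5)^2 +
       10 * x[, 4] + 5 * x[, 5] +
       rnorm(n, sd = sd)                  # the N(0, sigma^2) error term
  list(x = x, y = y)
}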

library(mlbench)
library(caret)

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

(a) Tune several models on these data. For example:

knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values.
knnPerformance <- postResample(pred = knnPred, obs = testData$y)
knnPerformance
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Which models appear to give the best performance?

Model 1: MARS Model

set.seed(50)
# Define and tune the MARS model over interaction degree and number of retained terms.
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:15)
marsModel <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = 'earth',
                   tuneGrid = marsGrid,   # an explicit grid, so tuneLength is not needed
                   preProc = c('center', 'scale'))

marsModel
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.404347  0.2245451  3.590453
##   1        3      3.746178  0.4394258  3.031879
##   1        4      2.810191  0.6837637  2.267477
##   1        5      2.546676  0.7381424  2.038306
##   1        6      2.457201  0.7571937  1.956201
##   1        7      1.982313  0.8395771  1.561345
##   1        8      1.853756  0.8616590  1.462258
##   1        9      1.812192  0.8676035  1.419271
##   1       10      1.764617  0.8742948  1.397465
##   1       11      1.759589  0.8748903  1.383201
##   1       12      1.770417  0.8729590  1.382469
##   1       13      1.781220  0.8710731  1.388037
##   1       14      1.799692  0.8682999  1.405146
##   1       15      1.807710  0.8668916  1.407880
##   2        2      4.401107  0.2279086  3.575905
##   2        3      3.733032  0.4437917  3.015929
##   2        4      2.853802  0.6737997  2.299631
##   2        5      2.578260  0.7313585  2.052106
##   2        6      2.438420  0.7625103  1.916612
##   2        7      2.085843  0.8225249  1.632268
##   2        8      1.925515  0.8499074  1.486561
##   2        9      1.797766  0.8676310  1.393979
##   2       10      1.628736  0.8920037  1.278338
##   2       11      1.530777  0.9053842  1.200624
##   2       12      1.513383  0.9065194  1.179791
##   2       13      1.482438  0.9110205  1.147230
##   2       14      1.468260  0.9123245  1.134669
##   2       15      1.473961  0.9113928  1.143106
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
# Run predict() and postResample() on the model.
marsPred <- predict(marsModel, newdata = testData$x)
marsPerformance <- postResample(pred = marsPred, obs = testData$y)
marsPerformance
##      RMSE  Rsquared       MAE 
## 1.2779993 0.9338365 1.0147070

Model 2: SVM Model

set.seed(50)
# Define and tune the SVM model.
svmModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = 'svmRadial',
                  preProc = c('center', 'scale'),
                  tuneLength = 14,
                  trControl = trainControl(method = 'cv'))

svmModel
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.485542  0.8040069  1.988122
##      0.50  2.224708  0.8215673  1.762504
##      1.00  2.033732  0.8438484  1.600925
##      2.00  1.906322  0.8578861  1.499489
##      4.00  1.811983  0.8692457  1.435218
##      8.00  1.775768  0.8736945  1.413409
##     16.00  1.768571  0.8754132  1.410033
##     32.00  1.769243  0.8754177  1.410352
##     64.00  1.769243  0.8754177  1.410352
##    128.00  1.769243  0.8754177  1.410352
##    256.00  1.769243  0.8754177  1.410352
##    512.00  1.769243  0.8754177  1.410352
##   1024.00  1.769243  0.8754177  1.410352
##   2048.00  1.769243  0.8754177  1.410352
## 
## Tuning parameter 'sigma' was held constant at a value of 0.05909722
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.05909722 and C = 16.
# Run predict() and postResample().
svmPred <- predict(svmModel, newdata = testData$x)
svmPerformance <- postResample(pred = svmPred, obs = testData$y)
svmPerformance
##     RMSE Rsquared      MAE 
## 2.062750 0.827448 1.567249
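The tuning output above notes that sigma was held constant: for svmRadial, caret estimates it analytically with kernlab's sigest function (to our understanding, averaging the 10th and 90th percentile estimates) and tunes only the cost C. The estimate can be inspected directly:

library(kernlab)
# sigest() returns the 0.1, 0.5, and 0.9 quantiles of a sigma estimate
# based on pairwise distances between training points.
sigest(as.matrix(trainingData$x), scaled = TRUE)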

Model 3: Neural Network Model

set.seed(50)
# Define and tune the Neural Network model.
nnetGrid <- expand.grid(.decay = c(0, 0.01, 0.1), .size = 1:10, .bag = FALSE)
nnetModel <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = 'avNNet',
                   preProc = c('center', 'scale'),
                   tuneGrid = nnetGrid,
                   trControl = trainControl(method = 'cv'),
                   linout = TRUE,
                   trace = FALSE,
                   MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                   maxit = 500)
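The MaxNWts bound is simply the weight count of the largest candidate network: a single-hidden-layer net with p inputs and H hidden units has H(p + 1) input-to-hidden weights plus H + 1 hidden-to-output weights. A small helper (our own, for illustration) makes the arithmetic explicit:

# Weight count for a single-hidden-layer nnet with one linear output unit.
nnetWeightCount <- function(p, size) size * (p + 1) + size + 1
nnetWeightCount(p = 10, size = 10)   # 121, the MaxNWts bound used above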

nnetModel
## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.412253  0.7684991  1.921744
##   0.00    2    2.478253  0.7571375  1.968920
##   0.00    3    2.073243  0.8297694  1.661660
##   0.00    4    1.896207  0.8535420  1.481233
##   0.00    5    1.967575  0.8499912  1.535623
##   0.00    6    3.155466  0.6667462  2.134195
##   0.00    7    4.644263  0.5211336  2.793921
##   0.00    8    4.692602  0.5686019  3.052220
##   0.00    9    5.478626  0.5330334  3.055039
##   0.00   10    3.875366  0.6236835  2.483791
##   0.01    1    2.388403  0.7729080  1.880208
##   0.01    2    2.458497  0.7642903  1.919754
##   0.01    3    2.045352  0.8326055  1.586264
##   0.01    4    2.013498  0.8409427  1.613255
##   0.01    5    2.013716  0.8434760  1.582315
##   0.01    6    2.141465  0.8135164  1.701146
##   0.01    7    2.376010  0.7806695  1.857197
##   0.01    8    2.536101  0.7604519  2.023233
##   0.01    9    2.387242  0.7848141  1.957367
##   0.01   10    2.332668  0.7841584  1.842190
##   0.10    1    2.399640  0.7708286  1.889741
##   0.10    2    2.492352  0.7504716  1.967391
##   0.10    3    2.113141  0.8259432  1.660050
##   0.10    4    2.067737  0.8290000  1.659741
##   0.10    5    2.011310  0.8384875  1.625327
##   0.10    6    2.208217  0.8102918  1.773579
##   0.10    7    2.169940  0.8173408  1.723025
##   0.10    8    2.206817  0.8088873  1.728806
##   0.10    9    2.323232  0.7906531  1.830399
##   0.10   10    2.241793  0.7980504  1.754629
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0 and bag = FALSE.
# Run predict() and postResample().
nnetPred <- predict(nnetModel, newdata = testData$x)
nnetPerformance <- postResample(pred = nnetPred, obs = testData$y)
nnetPerformance
##      RMSE  Rsquared       MAE 
## 2.0073619 0.8399851 1.5368340

Model Performance Comparison

library(knitr)
library(kableExtra)

rbind('MARS' = marsPerformance, 'SVM' = svmPerformance, 'Neural Network' = nnetPerformance, 'KNN' = knnPerformance) %>%
  kable() %>% kable_styling()

                     RMSE   Rsquared        MAE
MARS             1.277999  0.9338365   1.014707
SVM              2.062750  0.8274480   1.567249
Neural Network   2.007362  0.8399851   1.536834
KNN              3.204060  0.6819919   2.568346

The MARS model clearly gives the best performance, with the lowest test set RMSE and MAE and the highest R^2 of the four models.

Does MARS select the informative predictors (those named X1–X5)?

varImp(marsModel)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.40
## X2   49.00
## X5   15.72
## X3    0.00

Answer:

From the output above, MARS selected the informative predictors X1, X2, X4, and X5. X3 has an overall importance score of 0.00, which suggests the final model did not select it (its quadratic contribution, \(20(x_{3} - 0.5)^2\), is relatively small).
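As a cross-check, summary() on the underlying earth object lists the hinge functions actually retained (output omitted here); a predictor that never appears in a retained term receives zero importance:

# List the terms kept in the final pruned MARS model.
summary(marsModel$finalModel)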

 

Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

# Load the ChemicalManufacturingProcess data set from the AppliedPredictiveModeling package.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

Use the same data imputation, data splitting, and pre-processing steps as in Exercise 6.3.

1. Impute missing values using KNN.

# Impute the missing values using KNN (caret's knnImpute also centers and scales every column).
cmpImputed <- preProcess(ChemicalManufacturingProcess, 'knnImpute')

2. Apply the imputation to the data.

# Apply the fitted preProcess object to obtain the imputed, standardized data.
chemicalMPData <- predict(cmpImputed, ChemicalManufacturingProcess)

3. Split the data into training and test sets.

# Create an 80%/20% train/test partition, stratified on Yield.
trainingRows <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.8, list = FALSE)
xTrainData <- chemicalMPData[trainingRows, ]
yTrainData <- ChemicalManufacturingProcess$Yield[trainingRows]

# Hold out the remaining rows as the test set.
xTestData <- chemicalMPData[-trainingRows, ]
yTestData <- ChemicalManufacturingProcess$Yield[-trainingRows]
PLS Model (Linear Baseline)
set.seed(50)

# Define and tune a PLS model.
plsModel <- train(x = xTrainData,
                  y = yTrainData,
                  method = 'pls',
                  metric = 'Rsquared',
                  tuneLength = 20,
                  trControl = trainControl(method = 'cv'),
                  preProcess = c('center', 'scale'))

# Print out the results.
plsModel
## Partial Least Squares 
## 
## 144 samples
##  58 predictor
## 
## Pre-processing: centered (58), scaled (58) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 129, 130, 129, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE        Rsquared   MAE       
##    1     1.33574102  0.5280627  1.03699637
##    2     1.55514570  0.6253031  0.94935173
##    3     0.86332117  0.8111104  0.65667569
##    4     1.04296461  0.7662041  0.63454464
##    5     0.91339824  0.8086960  0.51672968
##    6     0.79162632  0.8453758  0.44255201
##    7     0.68159846  0.8717648  0.36204944
##    8     0.56873260  0.9100155  0.29161510
##    9     0.42468882  0.9447098  0.21310311
##   10     0.42815268  0.9437162  0.19383465
##   11     0.37270636  0.9524233  0.16135526
##   12     0.32895662  0.9588218  0.14359705
##   13     0.26436139  0.9698894  0.11528017
##   14     0.19059348  0.9820459  0.08601711
##   15     0.15370618  0.9850314  0.07005732
##   16     0.14405399  0.9873845  0.06228076
##   17     0.13376304  0.9904345  0.05565696
##   18     0.12491794  0.9920739  0.05081665
##   19     0.10329226  0.9945144  0.04253655
##   20     0.06676993  0.9970457  0.03046372
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 20.
# Run predict() and postResample() on the model.
plsPred <- predict(plsModel, newdata = xTestData)
plsPerformance <- postResample(pred = plsPred, obs = yTestData)
plsPerformance
##       RMSE   Rsquared        MAE 
## 0.01695540 0.99991846 0.01312557

 

Train Several Nonlinear Regression Models

Model 1: KNN Model
set.seed(50)

# Train a KNN model.
knnModel <- train(x = xTrainData,
                  y = yTrainData,
                  method = 'knn',
                  preProc = c('center', 'scale'),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 144 samples
##  58 predictor
## 
## Pre-processing: centered (58), scaled (58) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.313240  0.4976614  1.036873
##    7  1.284785  0.5237572  1.012093
##    9  1.284953  0.5348281  1.017324
##   11  1.283164  0.5439781  1.025511
##   13  1.272175  0.5600633  1.018044
##   15  1.282842  0.5582924  1.026190
##   17  1.286213  0.5646184  1.030484
##   19  1.299337  0.5596987  1.041285
##   21  1.310624  0.5571506  1.054980
##   23  1.324173  0.5526370  1.063355
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 13.
# Run predict() and postResample().
knnPred <- predict(knnModel, newdata = xTestData)
knnPerformance <- postResample(pred = knnPred, obs = yTestData)
knnPerformance
##      RMSE  Rsquared       MAE 
## 1.2430863 0.6463763 1.0065625
Model 2: SVM Model
set.seed(50)
# Define and tune the SVM model.
svmModel <- train(x = xTrainData,
                  y = yTrainData,
                  method = 'svmRadial',
                  preProc = c('center', 'scale'),
                  tuneLength = 14,
                  trControl = trainControl(method = 'cv'))

svmModel
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  58 predictor
## 
## Pre-processing: centered (58), scaled (58) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 129, 130, 129, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE      
##      0.25  1.1701760  0.6858684  0.9520916
##      0.50  0.9344061  0.7987497  0.7447998
##      1.00  0.7425587  0.8749638  0.5722410
##      2.00  0.6482239  0.9040551  0.5022892
##      4.00  0.6344307  0.9068769  0.4892183
##      8.00  0.6344307  0.9068769  0.4892183
##     16.00  0.6344307  0.9068769  0.4892183
##     32.00  0.6344307  0.9068769  0.4892183
##     64.00  0.6344307  0.9068769  0.4892183
##    128.00  0.6344307  0.9068769  0.4892183
##    256.00  0.6344307  0.9068769  0.4892183
##    512.00  0.6344307  0.9068769  0.4892183
##   1024.00  0.6344307  0.9068769  0.4892183
##   2048.00  0.6344307  0.9068769  0.4892183
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01299667
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01299667 and C = 4.
# Run predict() and postResample().
svmPred <- predict(svmModel, newdata = xTestData)
svmPerformance <- postResample(pred = svmPred, obs = yTestData)
svmPerformance
##      RMSE  Rsquared       MAE 
## 0.5865390 0.9263753 0.4751181
Model 3: Neural Network Model
set.seed(50)
# Define and tune the Neural Network model.
nnetGrid <- expand.grid(.decay = c(0, 0.01, 0.1), .size = 1:10, .bag = FALSE)
nnetModel <- train(x = xTrainData,
                   y = yTrainData,
                   method = 'avNNet',
                   preProc = c('center', 'scale'),
                   tuneGrid = nnetGrid,
                   trControl = trainControl(method = 'cv'),
                   linout = TRUE,
                   trace = FALSE,
                   MaxNWts = 10 * (ncol(xTrainData) + 1) + 10 + 1,
                   maxit = 500)

nnetModel
## Model Averaged Neural Network 
## 
## 144 samples
##  58 predictor
## 
## Pre-processing: centered (58), scaled (58) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 129, 130, 129, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE       Rsquared   MAE      
##   0.00    1    1.4531348  0.4533271  1.1823720
##   0.00    2    1.3162565  0.5395748  1.0778069
##   0.00    3    1.2198049  0.6526343  0.9471283
##   0.00    4    1.3378670  0.6284613  1.0905312
##   0.00    5    1.5719641  0.5865243  1.2495912
##   0.00    6    1.6265453  0.5154893  1.2914693
##   0.00    7    2.4829943  0.4833184  1.8647638
##   0.00    8    3.2255752  0.3735543  2.3738178
##   0.00    9    5.8292389  0.2897651  3.7393765
##   0.00   10    6.2528966  0.2918346  4.2571667
##   0.01    1    0.3281779  0.9561907  0.1631346
##   0.01    2    0.4054010  0.9454843  0.2604388
##   0.01    3    0.8581809  0.8090849  0.5085160
##   0.01    4    1.0382264  0.7905566  0.6555289
##   0.01    5    1.2616141  0.6974517  0.8464117
##   0.01    6    1.0336227  0.7570536  0.7710083
##   0.01    7    1.0238360  0.7628288  0.7573270
##   0.01    8    1.1738440  0.6735672  0.9121379
##   0.01    9    1.6752530  0.5879600  1.1506262
##   0.01   10    1.7321663  0.5827655  1.2045439
##   0.10    1    0.6123655  0.8988199  0.3389761
##   0.10    2    1.1331552  0.7436390  0.6257841
##   0.10    3    1.2289181  0.7491524  0.6729656
##   0.10    4    1.3192214  0.7293678  0.7547688
##   0.10    5    1.6319327  0.6785774  0.8655697
##   0.10    6    1.5203251  0.6820915  0.8815065
##   0.10    7    1.5876946  0.6720674  0.9679185
##   0.10    8    1.3918967  0.6790485  0.8999611
##   0.10    9    1.3915676  0.6176788  0.9516209
##   0.10   10    1.1999465  0.6977182  0.8571284
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1, decay = 0.01 and bag = FALSE.
# Run predict() and postResample().
nnetPred <- predict(nnetModel, newdata = xTestData)
nnetPerformance <- postResample(pred = nnetPred, obs = yTestData)
nnetPerformance
##       RMSE   Rsquared        MAE 
## 0.07730387 0.99839211 0.06183762

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

rbind('PLS (Linear Model)' = plsPerformance, 'SVM' = svmPerformance, 'Neural Network' = nnetPerformance, 'KNN' = knnPerformance) %>%
  kable() %>% kable_styling()
                         RMSE   Rsquared        MAE
PLS (Linear Model)  0.0169554  0.9999185  0.0131256
SVM                 0.5865390  0.9263753  0.4751181
Neural Network      0.0773039  0.9983921  0.0618376
KNN                 1.2430863  0.6463763  1.0065625

Answer:

The near-perfect PLS and neural network scores (test R^2 > 0.99) should be treated with suspicion: the imputed data frame still contains the Yield column, so the response leaked into the predictor matrix for every model (notice that Yield tops both variable importance lists below). Setting those two inflated fits aside, the SVM model gives the best resampling (cross-validated R^2 of about 0.91) and test set performance (RMSE 0.587, R^2 0.926) among the nonlinear models, and we treat it as the optimal nonlinear regression model in the remaining parts.
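For reference, a minimal sketch of the corrected setup (the names predCols, xTrainData2, and xTestData2 are ours); re-fitting the models on these matrices would likely give more honest, and noticeably worse, error estimates:

# Drop the response from the predictor matrix before any model sees it.
predCols <- setdiff(colnames(chemicalMPData), 'Yield')
xTrainData2 <- chemicalMPData[trainingRows, predCols]
xTestData2  <- chemicalMPData[-trainingRows, predCols]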

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

svmImportantPredictors <- varImp(svmModel)
svmImportantPredictors
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 58)
## 
##                        Overall
## Yield                   100.00
## ManufacturingProcess32   38.78
## BiologicalMaterial06     34.33
## ManufacturingProcess13   32.88
## BiologicalMaterial03     27.23
## ManufacturingProcess17   27.00
## BiologicalMaterial02     26.83
## ManufacturingProcess36   26.58
## BiologicalMaterial12     25.98
## ManufacturingProcess31   25.05
## ManufacturingProcess09   24.97
## ManufacturingProcess02   20.54
## ManufacturingProcess33   20.13
## BiologicalMaterial04     19.17
## ManufacturingProcess06   18.50
## ManufacturingProcess29   18.49
## ManufacturingProcess11   17.51
## BiologicalMaterial11     17.05
## BiologicalMaterial08     16.83
## BiologicalMaterial01     16.18

Answer:

B1 Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list?

The variable importance scores for the SVM model are shown above. Ignoring Yield itself, which appears only because of the leakage noted in part (a), ManufacturingProcess32 ranks highest, and the ManufacturingProcess predictors dominate the list (11 of the 19 remaining entries).
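A quick tally makes the split explicit; a small sketch using the importance object already computed:

# Count biological vs. process variables among the top 20 importance scores.
imp <- svmImportantPredictors$importance
top20 <- head(rownames(imp)[order(-imp$Overall)], 20)
top20 <- setdiff(top20, 'Yield')   # drop the leaked response
table(ifelse(grepl('^Manufacturing', top20), 'Process', 'Biological'))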

B2 How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

varImp(plsModel)
## pls variable importance
## 
##   only 20 most important variables shown (out of 58)
## 
##                        Overall
## Yield                   100.00
## ManufacturingProcess32   40.43
## ManufacturingProcess17   37.81
## ManufacturingProcess13   35.58
## ManufacturingProcess09   34.20
## ManufacturingProcess36   31.78
## BiologicalMaterial02     27.78
## BiologicalMaterial06     26.54
## BiologicalMaterial08     26.47
## BiologicalMaterial11     25.60
## BiologicalMaterial12     25.23
## ManufacturingProcess33   25.13
## ManufacturingProcess11   24.50
## BiologicalMaterial01     23.96
## ManufacturingProcess12   23.29
## ManufacturingProcess06   23.29
## BiologicalMaterial03     23.22
## ManufacturingProcess28   23.00
## BiologicalMaterial04     22.78
## ManufacturingProcess04   22.02

The top ten predictors of the SVM model are very similar to the top ten from the optimal linear (PLS) model: seven of the ten entries are shared (including the leaked Yield), ManufacturingProcess32, 13, 17, and 36 rank near the top of both lists, and ManufacturingProcess predictors dominate in both cases.
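The overlap can be computed directly; a short sketch (variable names are ours):

# Compare the two top-ten importance lists.
svmImp <- svmImportantPredictors$importance
svmTopTen <- head(rownames(svmImp)[order(-svmImp$Overall)], 10)
plsImp <- varImp(plsModel)$importance
plsTopTen <- head(rownames(plsImp)[order(-plsImp$Overall)], 10)
intersect(svmTopTen, plsTopTen)   # seven names in common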

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

# Column index of the response.
yieldCol <- which(colnames(chemicalMPData) == 'Yield')
SVMTopTenPredictors <- head(rownames(svmImportantPredictors$importance)[order(-svmImportantPredictors$importance$Overall)], 10)
as.data.frame(SVMTopTenPredictors)
##       SVMTopTenPredictors
## 1                   Yield
## 2  ManufacturingProcess32
## 3    BiologicalMaterial06
## 4  ManufacturingProcess13
## 5    BiologicalMaterial03
## 6  ManufacturingProcess17
## 7    BiologicalMaterial02
## 8  ManufacturingProcess36
## 9    BiologicalMaterial12
## 10 ManufacturingProcess31
Y <- chemicalMPData[, yieldCol]
X <- chemicalMPData[, SVMTopTenPredictors]

# Shorten the names so the panel titles fit in the plot.
colnames(X) <- gsub('(Process|Material)', '', colnames(X))

featurePlot(x = X, y = Y, plot = 'scatter', type = c('p', 'smooth'), span = 0.5)

The plots show a relationship between the response (Yield) and each of the top ten predictors, and most of those relationships look roughly linear. For example, Biological03 has a clear positive linear association with Yield, while Manufacturing17 appears to have a negative one. (The Yield panel simply plots the response against itself, another symptom of the leakage noted in part (a).)
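The direction of each trend can be confirmed numerically with the X and Y objects built above:

# Correlation of each plotted variable with yield: positive values match
# upward-sloping panels, negative values match downward-sloping ones.
cor(X, Y)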