Data624HW8

7.2. Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data: y = 10 sin(πx1x2) + 20(x3 − 0.5)2 + 10x4 + 5x5 + N(0, σ2) where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

knnModel <- train(x =trainingData$x, 
                   y= trainingData$y, 
                   method = "knn",
                   preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.

RMSE was used to select the optimal model using the smallest value. The final value used for the model was k = 19.

knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = knnPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

I add MARS model as comparison.

MARS has a better performance with lower RMSE and higher Rsquared.

Also, MARS selet X1-X5 plus X6.

knn_fit<- train(x=trainingData$x,y=trainingData$y, method = "knn", preProcess = c("center", "scale"), tuneLength = 10)

mars_fit<- train(x = trainingData$x, y = trainingData$y, method = "earth",
  preProcess = c("center", "scale"), tuneLength = 10)

## Loading required package: earth

## Loading required package: Formula

## Loading required package: plotmo

## Loading required package: plotrix

## Loading required package: TeachingDemos

knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

marsPred <- predict(mars_fit, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)

##     RMSE Rsquared      MAE 
## 1.776575 0.872700 1.358367

varImp((knn_fit))

## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

varImp((mars_fit))

## earth variable importance
## 
##    Overall
## X1  100.00
## X4   82.78
## X2   64.18
## X5   40.21
## X3   28.14
## X6    0.00

7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
library(RANN)

estdata <- preProcess(ChemicalManufacturingProcess, "knnImpute")

chemdata <- predict(estdata, ChemicalManufacturingProcess)

chemdata <- chemdata[, -nearZeroVar(chemdata)]

ch_index <- createDataPartition(chemdata$Yield, p = .8, list = FALSE)

trainx <- chemdata[ch_index, -1]
trainy <- chemdata[ch_index, 1]

testx <- chemdata[-ch_index, -1]
testy <- chemdata[-ch_index, 1]

Which nonlinear regression model gives the optimal resampling and test set performance?

4 model was create, KNN, MARS, Neural Network and SVM. SVM is the preferable model since it has the lowest RMSE and 2nd highest Rsquared.

knnModelC <- train(trainx, trainy,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

knnModelC

## k-Nearest Neighbors 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    5  0.7663166  0.4267656  0.6103715
##    7  0.7620570  0.4334381  0.6085957
##    9  0.7606730  0.4383593  0.6112026
##   11  0.7637161  0.4369804  0.6112213
##   13  0.7702460  0.4338978  0.6183574
##   15  0.7757912  0.4294637  0.6246815
##   17  0.7793033  0.4268261  0.6265074
##   19  0.7857410  0.4192433  0.6310951
##   21  0.7876125  0.4219037  0.6318536
##   23  0.7926886  0.4171869  0.6380595
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.

MARS

set.seed(777)

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38) #p.165
marsTuneC <- train(trainx, trainy,
                  method = "earth",
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))

marsTuneC

## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE       Rsquared   MAE      
##   1        2      0.7571365  0.4384424  0.5890694
##   1        3      0.6209432  0.6171880  0.5001021
##   1        4      0.6125417  0.6149778  0.4974624
##   1        5      0.6086976  0.6193433  0.5002367
##   1        6      0.6112194  0.6220564  0.5001236
##   1        7      0.5833192  0.6601286  0.4707630
##   1        8      0.5954229  0.6548777  0.4773941
##   1        9      0.5908520  0.6688769  0.4641166
##   1       10      0.5922562  0.6729081  0.4589072
##   1       11      0.6046974  0.6547836  0.4578478
##   1       12      0.6021642  0.6585914  0.4529060
##   1       13      0.5982064  0.6552723  0.4642189
##   1       14      0.5993061  0.6579914  0.4710578
##   1       15      0.6003698  0.6629187  0.4686158
##   1       16      0.5846706  0.6788217  0.4511537
##   1       17      0.5860203  0.6802277  0.4511400
##   1       18      0.5849925  0.6786591  0.4492209
##   1       19      0.5849925  0.6786591  0.4492209
##   1       20      0.5849925  0.6786591  0.4492209
##   1       21      0.5849925  0.6786591  0.4492209
##   1       22      0.5849925  0.6786591  0.4492209
##   1       23      0.5849925  0.6786591  0.4492209
##   1       24      0.5849925  0.6786591  0.4492209
##   1       25      0.5849925  0.6786591  0.4492209
##   1       26      0.5849925  0.6786591  0.4492209
##   1       27      0.5849925  0.6786591  0.4492209
##   1       28      0.5849925  0.6786591  0.4492209
##   1       29      0.5849925  0.6786591  0.4492209
##   1       30      0.5849925  0.6786591  0.4492209
##   1       31      0.5849925  0.6786591  0.4492209
##   1       32      0.5849925  0.6786591  0.4492209
##   1       33      0.5849925  0.6786591  0.4492209
##   1       34      0.5849925  0.6786591  0.4492209
##   1       35      0.5849925  0.6786591  0.4492209
##   1       36      0.5849925  0.6786591  0.4492209
##   1       37      0.5849925  0.6786591  0.4492209
##   1       38      0.5849925  0.6786591  0.4492209
##   2        2      0.7571365  0.4384424  0.5890694
##   2        3      0.6389273  0.5892642  0.5242676
##   2        4      0.6247083  0.5967464  0.5181887
##   2        5      0.6152453  0.6099734  0.5094673
##   2        6      0.6315286  0.6175948  0.5132345
##   2        7      0.5898082  0.6628224  0.4812386
##   2        8      0.5773520  0.6710845  0.4698095
##   2        9      0.5696061  0.6801360  0.4646234
##   2       10      0.5848359  0.6626196  0.4765138
##   2       11      0.5694192  0.6879157  0.4649821
##   2       12      0.5634597  0.6920767  0.4564068
##   2       13      0.5640860  0.6887346  0.4576884
##   2       14      0.5697185  0.6883572  0.4531146
##   2       15      0.5919275  0.6731146  0.4702897
##   2       16      0.6167678  0.6586522  0.4874193
##   2       17      0.6198244  0.6591712  0.4951976
##   2       18      0.6108703  0.6755062  0.4937246
##   2       19      0.6094393  0.6755931  0.4951412
##   2       20      0.5999557  0.6841172  0.4869600
##   2       21      0.5945895  0.6886322  0.4845766
##   2       22      0.5967784  0.6969207  0.4885505
##   2       23      0.6044671  0.6961590  0.4938530
##   2       24      0.5974824  0.6984470  0.4878614
##   2       25      0.5974824  0.6984470  0.4878614
##   2       26      0.6126017  0.6966108  0.4983812
##   2       27      0.6126017  0.6966108  0.4983812
##   2       28      0.6126017  0.6966108  0.4983812
##   2       29      0.6126017  0.6966108  0.4983812
##   2       30      0.6126017  0.6966108  0.4983812
##   2       31      0.6126017  0.6966108  0.4983812
##   2       32      0.6126017  0.6966108  0.4983812
##   2       33      0.6126017  0.6966108  0.4983812
##   2       34      0.6126017  0.6966108  0.4983812
##   2       35      0.6126017  0.6966108  0.4983812
##   2       36      0.6126017  0.6966108  0.4983812
##   2       37      0.6126017  0.6966108  0.4983812
##   2       38      0.6126017  0.6966108  0.4983812
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 12 and degree = 2.

tooHigh <- findCorrelation(cor(trainx), cutoff = .75) #p.163
train_x_nnet <- trainx[, -tooHigh]
test_x_nnet <- testx[, -tooHigh]
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10))

ctrl <- trainControl(method = "cv", number = 10)

set.seed(777)

nnetTune <- train(train_x_nnet, trainy,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(train_x_nnet) + 1) + 10 + 1,
                  maxit = 500)

nnetTune

## Neural Network 
## 
## 144 samples
##  37 predictor
## 
## Pre-processing: centered (37), scaled (37) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE       Rsquared   MAE      
##   0.00    1    1.2163074  0.3136695  0.8295162
##   0.00    2    1.0184622  0.3425981  0.8306294
##   0.00    3    1.0404403  0.3236338  0.8351704
##   0.00    4    1.4772631  0.1900349  1.1425086
##   0.00    5    1.5537554  0.1408028  1.2441427
##   0.00    6    1.3815231  0.2459903  1.0489401
##   0.00    7    1.1281311  0.3749623  0.9003955
##   0.00    8    1.1785144  0.2517516  0.8937500
##   0.00    9    1.0238042  0.3370734  0.7755890
##   0.00   10    0.9625880  0.3778590  0.7979186
##   0.01    1    0.8925761  0.3749872  0.7201091
##   0.01    2    1.0991754  0.2649870  0.8736625
##   0.01    3    1.1941416  0.2768011  0.9195688
##   0.01    4    1.1621122  0.3145344  0.9196618
##   0.01    5    1.0365727  0.3331291  0.8558945
##   0.01    6    0.9795945  0.3633113  0.7548577
##   0.01    7    0.8225926  0.4790358  0.6654358
##   0.01    8    0.8018038  0.4532290  0.6450975
##   0.01    9    0.8380935  0.4301259  0.6838092
##   0.01   10    0.8684114  0.4008014  0.6963858
##   0.10    1    0.7830112  0.4641974  0.6340823
##   0.10    2    0.9658965  0.3696723  0.7614198
##   0.10    3    0.9047863  0.4129417  0.6915331
##   0.10    4    0.9122687  0.4288613  0.7240297
##   0.10    5    0.9064383  0.3640349  0.7144517
##   0.10    6    0.9134298  0.3881678  0.7042020
##   0.10    7    0.7538289  0.5057989  0.6098617
##   0.10    8    0.8314645  0.4263620  0.6550355
##   0.10    9    0.8200322  0.4127259  0.6588729
##   0.10   10    0.8372057  0.4245492  0.6801626
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 7 and decay = 0.1.

set.seed(777)

svmRTune <- train(trainx, trainy,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))

svmRTune

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE      
##      0.25  0.7468900  0.5336373  0.6009050
##      0.50  0.6821002  0.5748808  0.5559519
##      1.00  0.6212096  0.6302097  0.5074772
##      2.00  0.5995841  0.6471372  0.4826408
##      4.00  0.6036013  0.6314314  0.4873860
##      8.00  0.6177886  0.6152405  0.4983435
##     16.00  0.6172946  0.6165007  0.4984441
##     32.00  0.6172946  0.6165007  0.4984441
##     64.00  0.6172946  0.6165007  0.4984441
##    128.00  0.6172946  0.6165007  0.4984441
##    256.00  0.6172946  0.6165007  0.4984441
##    512.00  0.6172946  0.6165007  0.4984441
##   1024.00  0.6172946  0.6165007  0.4984441
##   2048.00  0.6172946  0.6165007  0.4984441
## 
## Tuning parameter 'sigma' was held constant at a value of 0.0141958
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0141958 and C = 2.

knnPred <- predict(knnModelC, testx)
postResample(pred = knnPred, obs = testy)

##      RMSE  Rsquared       MAE 
## 0.7366242 0.4621173 0.5993582

marsPred <- predict(marsTuneC, testx)
postResample(pred = marsPred, obs = testy)

##      RMSE  Rsquared       MAE 
## 0.7051474 0.5211750 0.5567698

svmRPred <- predict(svmRTune, testx)
postResample(pred = svmRPred, obs = testy)

##      RMSE  Rsquared       MAE 
## 0.6917503 0.5138788 0.5533730

nnPred <- predict(nnetTune, testx)
postResample(pred = nnPred, obs = testy)

##      RMSE  Rsquared       MAE 
## 0.7500078 0.4442805 0.6100607

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

The Top 10 important predictors are the same for 2 model.

varImp(svmRTune)

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   94.77
## BiologicalMaterial06     82.41
## ManufacturingProcess17   78.71
## BiologicalMaterial03     78.71
## ManufacturingProcess31   71.77
## ManufacturingProcess06   68.38
## BiologicalMaterial12     66.59
## BiologicalMaterial02     65.86
## ManufacturingProcess09   65.47
## ManufacturingProcess36   62.43
## BiologicalMaterial04     53.46
## BiologicalMaterial11     52.00
## ManufacturingProcess11   51.25
## ManufacturingProcess33   47.86
## BiologicalMaterial08     46.86
## ManufacturingProcess29   43.34
## ManufacturingProcess02   41.33
## ManufacturingProcess30   40.49
## BiologicalMaterial09     40.28

set.seed(777)

larsTune <- train(trainx, trainy, 
                  method = "lars", 
                  metric = "Rsquared",
                  tuneLength = 20, 
                  trControl = ctrl, 
                  preProc = c("center", "scale"))

lars_predict <- predict(larsTune, testx)

plot(varImp(larsTune), top = 10)

plot(varImp(svmRTune), top = 10)

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield

As we can see ManufacturingProcess36 and 13 has neg correlations while other has prositve.

library(magrittr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(corrplot)

## corrplot 0.92 loaded

top10 <- varImp(svmRTune)$importance %>%
  arrange(-Overall) %>%
  head(10)


chemdata %>%
  select(c("Yield", row.names(top10))) %>%
  cor() %>%
  corrplot()

Data624HW8

Jay Lee

2023-04-23