library(tidyverse)
library(rpart)
library(rpart.plot)
library(caret)
library(AppliedPredictiveModeling)
library(missForest)
library(Cubist)
library(mlbench)
library(nnet)
library(partykit)
library(earth)
library(corrgram)

7.2 Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data:

\(y = 10sin(\pi x_1x_2)+20(x3 - 0.5)^2 +10x_4 +5x_5 +N(0,\sigma ^2)\)

where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

set.seed(200)
training.data <- mlbench.friedman1(200, sd=1)
training.data$x <- data.frame(training.data$x)
featurePlot(training.data$x, training.data$y)

test.data <- mlbench.friedman1(5000, sd = 1)
test.data$x <- data.frame(test.data$x)
knnModel <- train(x = training.data$x,
                  y = training.data$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.565620  0.4887976  2.886629
##    7  3.422420  0.5300524  2.752964
##    9  3.368072  0.5536927  2.715310
##   11  3.323010  0.5779056  2.669375
##   13  3.275835  0.6030846  2.628663
##   15  3.261864  0.6163510  2.621192
##   17  3.261973  0.6267032  2.616956
##   19  3.286299  0.6281075  2.640585
##   21  3.280950  0.6390386  2.643807
##   23  3.292397  0.6440392  2.656080
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
knn.Predict <- predict(knnModel, newdata = test.data$x)
postResample(pred = knn.Predict, obs = test.data$y)
##      RMSE  Rsquared       MAE 
## 3.1750657 0.6785946 2.5443169
marsModel <- train(x = training.data$x, y = training.data$y, method='earth', tuneLength=10, trControl = trainControl(method = "cv"))

marsModel
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE      Rsquared   MAE     
##    2      4.297119  0.2569346  3.548983
##    3      3.702964  0.4528644  2.964117
##    4      2.534931  0.7453994  2.086018
##    6      2.336194  0.7903845  1.848704
##    7      1.798280  0.8706570  1.417550
##    9      1.644565  0.8900009  1.297656
##   10      1.630223  0.8919648  1.265712
##   12      1.638057  0.8927116  1.270530
##   13      1.644756  0.8907795  1.270365
##   15      1.648248  0.8903664  1.271487
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 10 and degree = 1.
mars.Predict <- predict (marsModel, test.data$x)
postResample(pred = mars.Predict, obs = test.data$y)
##     RMSE Rsquared      MAE 
## 1.776575 0.872700 1.358367

The RMSE for the MARS model is much better than that of the KNN model.

varImp(marsModel)
## earth variable importance
## 
##     Overall
## X1  100.000
## X4   84.065
## X2   66.860
## X5   44.678
## X3   33.508
## X6    7.475
## X8    0.000
## X10   0.000
## X7    0.000
## X9    0.000

Which models appear to give the best performance? Does MARS select the informative predictors? The MARS model seems to give the best performance. The X1-X5 predictors are selected by the MARS model.

7.5 7.5 Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models

data(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess
#data imputation
df1 <- missForest(df)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
df2 <- df1$ximp
head(df2)
##   Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00                 6.25                49.58                56.97
## 2 42.44                 8.01                60.97                67.48
## 3 42.03                 8.01                60.97                67.48
## 4 41.42                 8.01                60.97                67.48
## 5 42.49                 7.47                63.33                72.25
## 6 43.57                 6.12                58.36                65.31
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1                12.74                19.51                43.73
## 2                14.65                19.36                53.14
## 3                14.65                19.36                53.14
## 4                14.65                19.36                53.14
## 5                14.02                17.91                54.66
## 6                15.17                21.79                51.23
##   BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1                  100                16.66                11.44
## 2                  100                19.04                12.55
## 3                  100                19.04                12.55
## 4                  100                19.04                12.55
## 5                  100                18.22                12.80
## 6                  100                18.30                12.13
##   BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1                 3.46               138.09                18.83
## 2                 3.46               153.67                21.05
## 3                 3.46               153.67                21.05
## 4                 3.46               153.67                21.05
## 5                 3.05               147.61                21.05
## 6                 3.78               151.88                20.76
##   ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1                 10.834                  10.14                 1.5334
## 2                  0.000                   0.00                 1.5446
## 3                  0.000                   0.00                 1.5431
## 4                  0.000                   0.00                 1.5434
## 5                 10.700                   0.00                 1.5372
## 6                 12.000                   0.00                 1.5350
##   ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1                 928.39               1002.855                206.608
## 2                 917.00               1032.200                210.000
## 3                 912.00               1003.600                207.100
## 4                 911.00               1014.600                213.300
## 5                 918.00               1027.500                205.700
## 6                 924.00               1016.800                208.900
##   ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1                 177.58                 177.74                  43.00
## 2                 177.00                 178.00                  46.57
## 3                 178.00                 178.00                  45.07
## 4                 177.00                 177.00                  44.92
## 5                 178.00                 178.00                  44.96
## 6                 178.00                 178.00                  45.32
##   ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1                  9.104                  9.431                 363.92
## 2                  9.359                  9.905                   0.00
## 3                  9.166                  9.580                   0.00
## 4                  9.258                  9.625                   0.00
## 5                  9.062                  9.359                   0.00
## 6                  9.269                 10.009                   0.00
##   ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1                   35.5                   4898                   6108
## 2                   34.0                   4869                   6095
## 3                   34.8                   4878                   6087
## 4                   34.8                   4897                   6102
## 5                   34.6                   4992                   6233
## 6                   34.0                   4985                   6222
##   ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1                   4682                   35.5                   4865
## 2                   4617                   34.0                   4867
## 3                   4617                   34.8                   4877
## 4                   4635                   34.8                   4872
## 5                   4733                   33.9                   4886
## 6                   4786                   33.4                   4862
##   ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1                   6049                   4665                    0.0
## 2                   6097                   4621                    0.0
## 3                   6078                   4621                    0.0
## 4                   6073                   4611                    0.0
## 5                   6102                   4659                   -0.7
## 6                   6115                   4696                   -0.6
##   ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1                   4.42                   3.26                  12.77
## 2                   3.00                   0.00                   3.00
## 3                   4.00                   1.00                   4.00
## 4                   5.00                   2.00                   5.00
## 5                   8.00                   4.00                  18.00
## 6                   9.00                   1.00                   1.00
##   ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1                   4873                   6074                   4685
## 2                   4869                   6107                   4630
## 3                   4897                   6116                   4637
## 4                   4892                   6111                   4630
## 5                   4930                   6151                   4684
## 6                   4871                   6128                   4687
##   ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1                   10.7                   21.0                    9.9
## 2                   11.2                   21.4                    9.9
## 3                   11.1                   21.3                    9.4
## 4                   11.1                   21.3                    9.4
## 5                   11.3                   21.6                    9.0
## 6                   11.4                   21.7                   10.1
##   ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1                   69.1                    156                     66
## 2                   68.7                    169                     66
## 3                   69.3                    173                     66
## 4                   69.3                    171                     68
## 5                   69.4                    171                     70
## 6                   68.2                    173                     70
##   ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1                    2.4                    486                  0.019
## 2                    2.6                    508                  0.019
## 3                    2.6                    509                  0.018
## 4                    2.5                    496                  0.018
## 5                    2.5                    468                  0.017
## 6                    2.5                    490                  0.018
##   ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1                    0.5                      3                    7.2
## 2                    2.0                      2                    7.2
## 3                    0.7                      2                    7.2
## 4                    1.2                      2                    7.2
## 5                    0.2                      2                    7.3
## 6                    0.4                      2                    7.2
##   ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1                  0.013                  0.015                   11.6
## 2                  0.100                  0.150                   11.1
## 3                  0.000                  0.000                   12.0
## 4                  0.000                  0.000                   10.6
## 5                  0.000                  0.000                   11.0
## 6                  0.000                  0.000                   11.5
##   ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1                    3.0                    1.8                    2.4
## 2                    0.9                    1.9                    2.2
## 3                    1.0                    1.8                    2.3
## 4                    1.1                    1.8                    2.1
## 5                    1.1                    1.7                    2.1
## 6                    2.2                    1.8                    2.0
#set test/train data
data <- df2[, 2:58]
target <- df2[,1]
train <- createDataPartition(target, p=0.75)
train_pred <- data[train$Resample1,]
train_target <- target[train$Resample]
test_pred <- data[-train$Resample1,]
test_target <- target[-train$Resample1]
control <- trainControl(method = "cv", number=10)
knnModel <- train(x = train_pred,
                  y = train_target,
                  method = "knn",
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 132 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.776862  0.2103631  1.411766
##    7  1.749685  0.2136185  1.407276
##    9  1.714869  0.2261357  1.372198
##   11  1.710873  0.2247247  1.379369
##   13  1.715927  0.2168172  1.381832
##   15  1.719385  0.2115744  1.383668
##   17  1.712961  0.2172051  1.377429
##   19  1.721686  0.2139728  1.386131
##   21  1.737534  0.2009739  1.402955
##   23  1.752321  0.1918709  1.414137
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
knn.Predict <- predict(knnModel, newdata = test_pred)
postResample(pred = knn.Predict, obs = test_target)
##      RMSE  Rsquared       MAE 
## 1.3048525 0.3024465 1.0678512

MARS Model

marsModel <- train(x = train_pred, y = train_target, method='earth', tuneLength=10, trControl = trainControl(method = "cv"))

marsModel
## Multivariate Adaptive Regression Spline 
## 
## 132 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 118, 119, 118, 120, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE      Rsquared   MAE     
##    2      1.491723  0.4078425  1.206024
##    3      1.353735  0.5070462  1.097948
##    5      1.308553  0.5556791  1.068443
##    7      1.249427  0.5966211  1.043078
##    8      1.240367  0.6026297  1.041784
##   10      1.241248  0.6113224  1.034270
##   12      1.254623  0.6134204  1.032793
##   13      1.250880  0.6203455  1.040137
##   15      1.259988  0.6284302  1.045013
##   17      1.276721  0.6253071  1.060521
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 8 and degree = 1.
mars.Predict <- predict (marsModel, test_pred)
postResample(pred = mars.Predict, obs = test_target)
##      RMSE  Rsquared       MAE 
## 0.9664855 0.6959803 0.7997188

Neural Net

nnetModel <- nnet(train_pred, train_target, size=5, decay=0.01, linout= T, trace=F, maxit = 500 ,
                  MaxNWts = 5 * (ncol(train_pred) + 1) + 5 + 1)
nnetPredict <- predict(nnetModel, test_pred)
postResample(pred = nnetPredict, obs = test_target)
##        RMSE    Rsquared         MAE 
## 1.562087433 0.008574579 1.321736856

SVM Model

svmModel <- train(x = train_pred, 
                  y = train_target, 
                  method='svmRadial', 
                  tuneLength=14, 
                  trControl = trainControl(method = "cv"),
                  preProc = c("center", "scale"))
svmModel
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 118, 119, 117, 120, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE      
##      0.25  1.540944  0.4982792  1.2476873
##      0.50  1.418246  0.5494338  1.1520354
##      1.00  1.310090  0.5957758  1.0684414
##      2.00  1.242287  0.6145278  1.0028349
##      4.00  1.205616  0.6283394  0.9673304
##      8.00  1.186749  0.6368772  0.9582191
##     16.00  1.191156  0.6344858  0.9613057
##     32.00  1.191156  0.6344858  0.9613057
##     64.00  1.191156  0.6344858  0.9613057
##    128.00  1.191156  0.6344858  0.9613057
##    256.00  1.191156  0.6344858  0.9613057
##    512.00  1.191156  0.6344858  0.9613057
##   1024.00  1.191156  0.6344858  0.9613057
##   2048.00  1.191156  0.6344858  0.9613057
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01458004
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01458004 and C = 8.
svm.Predict <- predict (svmModel, test_pred)
postResample(pred = svm.Predict, obs = test_target)
##      RMSE  Rsquared       MAE 
## 0.9665332 0.6873189 0.7479273

The SVM Model this time slightly outperforms the MARS model.

B.) Which predictors are most important in the optimal nonlinear regres-sion model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten from the optimal linear model?

varImp(svmModel)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   98.64
## ManufacturingProcess17   90.81
## BiologicalMaterial12     88.93
## BiologicalMaterial03     88.45
## ManufacturingProcess09   86.63
## BiologicalMaterial06     82.85
## ManufacturingProcess36   82.46
## ManufacturingProcess06   64.47
## BiologicalMaterial02     64.02
## BiologicalMaterial11     63.28
## ManufacturingProcess31   63.11
## ManufacturingProcess29   55.52
## BiologicalMaterial08     49.55
## ManufacturingProcess11   48.91
## BiologicalMaterial04     46.52
## ManufacturingProcess33   45.86
## BiologicalMaterial09     45.45
## BiologicalMaterial01     45.08
## ManufacturingProcess30   40.10
varImp(marsModel)
## earth variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   64.04
## ManufacturingProcess13   40.20
## ManufacturingProcess33   28.61
## BiologicalMaterial03     25.01
## ManufacturingProcess28   15.78
## ManufacturingProcess04    0.00
## ManufacturingProcess29    0.00
## ManufacturingProcess21    0.00
## ManufacturingProcess25    0.00
## ManufacturingProcess05    0.00
## ManufacturingProcess38    0.00
## BiologicalMaterial06      0.00
## ManufacturingProcess36    0.00
## ManufacturingProcess15    0.00
## BiologicalMaterial10      0.00
## ManufacturingProcess12    0.00
## ManufacturingProcess23    0.00
## ManufacturingProcess42    0.00
## ManufacturingProcess24    0.00
plot(varImp(marsModel), top=10)

plot(varImp(svmModel), top=10)

The predictors that are most important in both models are the ManufacturingProcess32 and ManufacturingProcess13. 32 is in first place spot for both models, while 13 is in both models top 3. Manufacturing and Biological both are represented within the top 10.

C. Explore the relationship between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield? train_pred

cor_model <- train_pred %>% select(ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess17, BiologicalMaterial12, BiologicalMaterial03, ManufacturingProcess09, BiologicalMaterial06, ManufacturingProcess36, ManufacturingProcess06, BiologicalMaterial02) 
corrgram(cor_model, order=TRUE, upper.panel=panel.cor)

The corrplot shows the correlations between variables in the dataset.