library(tidyverse)
library(rpart)
library(rpart.plot)
library(caret)
library(AppliedPredictiveModeling)
library(missForest)
library(Cubist)
library(mlbench)
library(nnet)
library(partykit)
library(earth)
library(corrgram)
7.2 Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data:
\(y = 10sin(\pi x_1x_2)+20(x3 - 0.5)^2 +10x_4 +5x_5 +N(0,\sigma ^2)\)
where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
set.seed(200)
training.data <- mlbench.friedman1(200, sd=1)
training.data$x <- data.frame(training.data$x)
featurePlot(training.data$x, training.data$y)
test.data <- mlbench.friedman1(5000, sd = 1)
test.data$x <- data.frame(test.data$x)
knnModel <- train(x = training.data$x,
y = training.data$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.565620 0.4887976 2.886629
## 7 3.422420 0.5300524 2.752964
## 9 3.368072 0.5536927 2.715310
## 11 3.323010 0.5779056 2.669375
## 13 3.275835 0.6030846 2.628663
## 15 3.261864 0.6163510 2.621192
## 17 3.261973 0.6267032 2.616956
## 19 3.286299 0.6281075 2.640585
## 21 3.280950 0.6390386 2.643807
## 23 3.292397 0.6440392 2.656080
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
knn.Predict <- predict(knnModel, newdata = test.data$x)
postResample(pred = knn.Predict, obs = test.data$y)
## RMSE Rsquared MAE
## 3.1750657 0.6785946 2.5443169
marsModel <- train(x = training.data$x, y = training.data$y, method='earth', tuneLength=10, trControl = trainControl(method = "cv"))
marsModel
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 2 4.297119 0.2569346 3.548983
## 3 3.702964 0.4528644 2.964117
## 4 2.534931 0.7453994 2.086018
## 6 2.336194 0.7903845 1.848704
## 7 1.798280 0.8706570 1.417550
## 9 1.644565 0.8900009 1.297656
## 10 1.630223 0.8919648 1.265712
## 12 1.638057 0.8927116 1.270530
## 13 1.644756 0.8907795 1.270365
## 15 1.648248 0.8903664 1.271487
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 10 and degree = 1.
mars.Predict <- predict (marsModel, test.data$x)
postResample(pred = mars.Predict, obs = test.data$y)
## RMSE Rsquared MAE
## 1.776575 0.872700 1.358367
The RMSE for the MARS model is much better than that of the KNN model.
varImp(marsModel)
## earth variable importance
##
## Overall
## X1 100.000
## X4 84.065
## X2 66.860
## X5 44.678
## X3 33.508
## X6 7.475
## X8 0.000
## X10 0.000
## X7 0.000
## X9 0.000
Which models appear to give the best performance? Does MARS select the informative predictors? The MARS model seems to give the best performance. The X1-X5 predictors are selected by the MARS model.
7.5 7.5 Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models
data(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess
#data imputation
df1 <- missForest(df)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
df2 <- df1$ximp
head(df2)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## 5 42.49 7.47 63.33 72.25
## 6 43.57 6.12 58.36 65.31
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## 5 14.02 17.91 54.66
## 6 15.17 21.79 51.23
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 100 16.66 11.44
## 2 100 19.04 12.55
## 3 100 19.04 12.55
## 4 100 19.04 12.55
## 5 100 18.22 12.80
## 6 100 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 3.46 138.09 18.83
## 2 3.46 153.67 21.05
## 3 3.46 153.67 21.05
## 4 3.46 153.67 21.05
## 5 3.05 147.61 21.05
## 6 3.78 151.88 20.76
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 10.834 10.14 1.5334
## 2 0.000 0.00 1.5446
## 3 0.000 0.00 1.5431
## 4 0.000 0.00 1.5434
## 5 10.700 0.00 1.5372
## 6 12.000 0.00 1.5350
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 928.39 1002.855 206.608
## 2 917.00 1032.200 210.000
## 3 912.00 1003.600 207.100
## 4 911.00 1014.600 213.300
## 5 918.00 1027.500 205.700
## 6 924.00 1016.800 208.900
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 177.58 177.74 43.00
## 2 177.00 178.00 46.57
## 3 178.00 178.00 45.07
## 4 177.00 177.00 44.92
## 5 178.00 178.00 44.96
## 6 178.00 178.00 45.32
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 9.104 9.431 363.92
## 2 9.359 9.905 0.00
## 3 9.166 9.580 0.00
## 4 9.258 9.625 0.00
## 5 9.062 9.359 0.00
## 6 9.269 10.009 0.00
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 35.5 4898 6108
## 2 34.0 4869 6095
## 3 34.8 4878 6087
## 4 34.8 4897 6102
## 5 34.6 4992 6233
## 6 34.0 4985 6222
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 4682 35.5 4865
## 2 4617 34.0 4867
## 3 4617 34.8 4877
## 4 4635 34.8 4872
## 5 4733 33.9 4886
## 6 4786 33.4 4862
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 6049 4665 0.0
## 2 6097 4621 0.0
## 3 6078 4621 0.0
## 4 6073 4611 0.0
## 5 6102 4659 -0.7
## 6 6115 4696 -0.6
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 4.42 3.26 12.77
## 2 3.00 0.00 3.00
## 3 4.00 1.00 4.00
## 4 5.00 2.00 5.00
## 5 8.00 4.00 18.00
## 6 9.00 1.00 1.00
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 4873 6074 4685
## 2 4869 6107 4630
## 3 4897 6116 4637
## 4 4892 6111 4630
## 5 4930 6151 4684
## 6 4871 6128 4687
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 10.7 21.0 9.9
## 2 11.2 21.4 9.9
## 3 11.1 21.3 9.4
## 4 11.1 21.3 9.4
## 5 11.3 21.6 9.0
## 6 11.4 21.7 10.1
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 69.1 156 66
## 2 68.7 169 66
## 3 69.3 173 66
## 4 69.3 171 68
## 5 69.4 171 70
## 6 68.2 173 70
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 2.4 486 0.019
## 2 2.6 508 0.019
## 3 2.6 509 0.018
## 4 2.5 496 0.018
## 5 2.5 468 0.017
## 6 2.5 490 0.018
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 0.5 3 7.2
## 2 2.0 2 7.2
## 3 0.7 2 7.2
## 4 1.2 2 7.2
## 5 0.2 2 7.3
## 6 0.4 2 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 0.013 0.015 11.6
## 2 0.100 0.150 11.1
## 3 0.000 0.000 12.0
## 4 0.000 0.000 10.6
## 5 0.000 0.000 11.0
## 6 0.000 0.000 11.5
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 3.0 1.8 2.4
## 2 0.9 1.9 2.2
## 3 1.0 1.8 2.3
## 4 1.1 1.8 2.1
## 5 1.1 1.7 2.1
## 6 2.2 1.8 2.0
#set test/train data
data <- df2[, 2:58]
target <- df2[,1]
train <- createDataPartition(target, p=0.75)
train_pred <- data[train$Resample1,]
train_target <- target[train$Resample]
test_pred <- data[-train$Resample1,]
test_target <- target[-train$Resample1]
control <- trainControl(method = "cv", number=10)
knnModel <- train(x = train_pred,
y = train_target,
method = "knn",
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.776862 0.2103631 1.411766
## 7 1.749685 0.2136185 1.407276
## 9 1.714869 0.2261357 1.372198
## 11 1.710873 0.2247247 1.379369
## 13 1.715927 0.2168172 1.381832
## 15 1.719385 0.2115744 1.383668
## 17 1.712961 0.2172051 1.377429
## 19 1.721686 0.2139728 1.386131
## 21 1.737534 0.2009739 1.402955
## 23 1.752321 0.1918709 1.414137
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
knn.Predict <- predict(knnModel, newdata = test_pred)
postResample(pred = knn.Predict, obs = test_target)
## RMSE Rsquared MAE
## 1.3048525 0.3024465 1.0678512
MARS Model
marsModel <- train(x = train_pred, y = train_target, method='earth', tuneLength=10, trControl = trainControl(method = "cv"))
marsModel
## Multivariate Adaptive Regression Spline
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 118, 119, 118, 120, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 2 1.491723 0.4078425 1.206024
## 3 1.353735 0.5070462 1.097948
## 5 1.308553 0.5556791 1.068443
## 7 1.249427 0.5966211 1.043078
## 8 1.240367 0.6026297 1.041784
## 10 1.241248 0.6113224 1.034270
## 12 1.254623 0.6134204 1.032793
## 13 1.250880 0.6203455 1.040137
## 15 1.259988 0.6284302 1.045013
## 17 1.276721 0.6253071 1.060521
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 8 and degree = 1.
mars.Predict <- predict (marsModel, test_pred)
postResample(pred = mars.Predict, obs = test_target)
## RMSE Rsquared MAE
## 0.9664855 0.6959803 0.7997188
Neural Net
nnetModel <- nnet(train_pred, train_target, size=5, decay=0.01, linout= T, trace=F, maxit = 500 ,
MaxNWts = 5 * (ncol(train_pred) + 1) + 5 + 1)
nnetPredict <- predict(nnetModel, test_pred)
postResample(pred = nnetPredict, obs = test_target)
## RMSE Rsquared MAE
## 1.562087433 0.008574579 1.321736856
SVM Model
svmModel <- train(x = train_pred,
y = train_target,
method='svmRadial',
tuneLength=14,
trControl = trainControl(method = "cv"),
preProc = c("center", "scale"))
svmModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 119, 117, 120, 120, 120, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.540944 0.4982792 1.2476873
## 0.50 1.418246 0.5494338 1.1520354
## 1.00 1.310090 0.5957758 1.0684414
## 2.00 1.242287 0.6145278 1.0028349
## 4.00 1.205616 0.6283394 0.9673304
## 8.00 1.186749 0.6368772 0.9582191
## 16.00 1.191156 0.6344858 0.9613057
## 32.00 1.191156 0.6344858 0.9613057
## 64.00 1.191156 0.6344858 0.9613057
## 128.00 1.191156 0.6344858 0.9613057
## 256.00 1.191156 0.6344858 0.9613057
## 512.00 1.191156 0.6344858 0.9613057
## 1024.00 1.191156 0.6344858 0.9613057
## 2048.00 1.191156 0.6344858 0.9613057
##
## Tuning parameter 'sigma' was held constant at a value of 0.01458004
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01458004 and C = 8.
svm.Predict <- predict (svmModel, test_pred)
postResample(pred = svm.Predict, obs = test_target)
## RMSE Rsquared MAE
## 0.9665332 0.6873189 0.7479273
The SVM Model this time slightly outperforms the MARS model.
B.) Which predictors are most important in the optimal nonlinear regres-sion model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten from the optimal linear model?
varImp(svmModel)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 98.64
## ManufacturingProcess17 90.81
## BiologicalMaterial12 88.93
## BiologicalMaterial03 88.45
## ManufacturingProcess09 86.63
## BiologicalMaterial06 82.85
## ManufacturingProcess36 82.46
## ManufacturingProcess06 64.47
## BiologicalMaterial02 64.02
## BiologicalMaterial11 63.28
## ManufacturingProcess31 63.11
## ManufacturingProcess29 55.52
## BiologicalMaterial08 49.55
## ManufacturingProcess11 48.91
## BiologicalMaterial04 46.52
## ManufacturingProcess33 45.86
## BiologicalMaterial09 45.45
## BiologicalMaterial01 45.08
## ManufacturingProcess30 40.10
varImp(marsModel)
## earth variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 64.04
## ManufacturingProcess13 40.20
## ManufacturingProcess33 28.61
## BiologicalMaterial03 25.01
## ManufacturingProcess28 15.78
## ManufacturingProcess04 0.00
## ManufacturingProcess29 0.00
## ManufacturingProcess21 0.00
## ManufacturingProcess25 0.00
## ManufacturingProcess05 0.00
## ManufacturingProcess38 0.00
## BiologicalMaterial06 0.00
## ManufacturingProcess36 0.00
## ManufacturingProcess15 0.00
## BiologicalMaterial10 0.00
## ManufacturingProcess12 0.00
## ManufacturingProcess23 0.00
## ManufacturingProcess42 0.00
## ManufacturingProcess24 0.00
plot(varImp(marsModel), top=10)
plot(varImp(svmModel), top=10)
The predictors that are most important in both models are the ManufacturingProcess32 and ManufacturingProcess13. 32 is in first place spot for both models, while 13 is in both models top 3. Manufacturing and Biological both are represented within the top 10.
C. Explore the relationship between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield? train_pred
cor_model <- train_pred %>% select(ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess17, BiologicalMaterial12, BiologicalMaterial03, ManufacturingProcess09, BiologicalMaterial06, ManufacturingProcess36, ManufacturingProcess06, BiologicalMaterial02)
corrgram(cor_model, order=TRUE, upper.panel=panel.cor)
The corrplot shows the correlations between variables in the dataset.