Data 624 HW 8
library(knitr)
library(rmdformats)
## Global options
options(max.print="85")
opts_chunk$set(cache=TRUE,
prompt=FALSE,
tidy=TRUE,
comment=NA,
message=FALSE,
warning=FALSE)
opts_knit$set(width=31)
library(mlbench)
library(caret)## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
7.2
Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data:
y = 10 sin(πx1x2) + 20(x3 − 0.5)2 + 10x4 + 5x5 + N(0, σ2)
where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)KNN Model
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
knnModel <- train(x = trainingData$x, y = trainingData$y, method = "knn", preProc = c("center",
"scale"), tuneLength = 10)
knnModelk-Nearest Neighbors
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 3.466085 0.5121775 2.816838
7 3.349428 0.5452823 2.727410
9 3.264276 0.5785990 2.660026
11 3.214216 0.6024244 2.603767
13 3.196510 0.6176570 2.591935
15 3.184173 0.6305506 2.577482
17 3.183130 0.6425367 2.567787
19 3.198752 0.6483184 2.592683
21 3.188993 0.6611428 2.588787
23 3.200458 0.6638353 2.604529
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 17.
RMSE Rsquared MAE
3.2040595 0.6819919 2.5683461
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
MARS Model
options(max.print = 1e+06)
mars.grid <- expand.grid(.degree = 1:2, .nprune = 2:15)
mars.model <- train(x = trainingData$x, y = trainingData$y, method = "earth", tuneGrid = mars.grid,
preProcess = c("center", "scale"), tuneLength = 10)
mars.modelMultivariate Adaptive Regression Spline
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
degree nprune RMSE Rsquared MAE
1 2 4.383438 0.2405683 3.597961
1 3 3.645469 0.4745962 2.930453
1 4 2.727602 0.7035031 2.184240
1 5 2.449243 0.7611230 1.939231
1 6 2.331605 0.7835496 1.833420
1 7 1.976830 0.8421599 1.562591
1 8 1.870959 0.8585503 1.464551
1 9 1.804342 0.8683110 1.410395
1 10 1.787676 0.8711960 1.386944
1 11 1.790700 0.8707740 1.393076
1 12 1.821005 0.8670619 1.419893
1 13 1.858688 0.8617344 1.445459
1 14 1.862343 0.8623072 1.446050
1 15 1.871033 0.8607099 1.457618
2 2 4.383438 0.2405683 3.597961
2 3 3.644919 0.4742570 2.929647
2 4 2.730222 0.7028372 2.183075
2 5 2.481291 0.7545789 1.965749
2 6 2.338369 0.7827873 1.825542
2 7 2.030065 0.8328250 1.602024
2 8 1.890997 0.8551326 1.477422
2 9 1.742626 0.8757904 1.371910
2 10 1.608221 0.8943432 1.255416
2 11 1.474325 0.9111463 1.157848
2 12 1.437483 0.9157967 1.120977
2 13 1.439395 0.9164721 1.128309
2 14 1.428565 0.9184503 1.118634
2 15 1.434093 0.9182413 1.121622
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 14 and degree = 2.
The best MARS model on training set is when nprune = 13 and degree = 2.
mars_pred <- predict(mars.model, newdata = testData$x)
postResample(pred = mars_pred, obs = testData$y) RMSE Rsquared MAE
1.2779993 0.9338365 1.0147070
earth variable importance
Overall
X1 100.00
X4 75.40
X2 49.00
X5 15.72
X3 0.00
The 4 most important variables are X1, X4, X2, X5.
SVM Model
svm.model <- train(x = trainingData$x, y = trainingData$y, method = "svmRadial",
tuneLength = 10, preProcess = c("center", "scale"), trControl = trainControl(method = "repeatedcv",
repeats = 8))
svm.modelSupport Vector Machines with Radial Basis Function Kernel
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Cross-Validated (10 fold, repeated 8 times)
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 2.502311 0.8009301 1.996146
0.50 2.245108 0.8163218 1.779447
1.00 2.059569 0.8381964 1.622551
2.00 1.960543 0.8507107 1.529816
4.00 1.888723 0.8597190 1.487947
8.00 1.861418 0.8630404 1.480071
16.00 1.856273 0.8635021 1.479585
32.00 1.856273 0.8635021 1.479585
64.00 1.856273 0.8635021 1.479585
128.00 1.856273 0.8635021 1.479585
Tuning parameter 'sigma' was held constant at a value of 0.06670077
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.06670077 and C = 16.
The optimal SVM model is attained wen C = 8.00 and sigma is held constant at 0.06.
svm_pred <- predict(svm.model, newdata = testData$x)
postResample(pred = svm_pred, obs = testData$y) RMSE Rsquared MAE
2.0829707 0.8242096 1.5826017
loess r-squared variable importance
Overall
X4 100.0000
X1 95.5047
X2 89.6186
X5 45.2170
X3 29.9330
X9 6.3299
X10 5.5182
X8 3.2527
X6 0.8884
X7 0.0000
SVM did pick the 5 most informative predictors as the 5 most important predictors.
Neural Network Model
nnet.grid <- expand.grid(.decay = c(0, 0.01, 0.1, 0.5, 0.9), .size = c(1, 10), .bag = FALSE)
nnet.model <- train(x = trainingData$x, y = trainingData$y, method = "avNNet", preProcess = c("center",
"scale"), tuneGrid = nnet.grid, trControl = trainControl(method = "repeatedcv",
repeats = 1), trace = FALSE, linout = TRUE, maxit = 500)
nnet.modelModel Averaged Neural Network
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
Resampling results across tuning parameters:
decay size RMSE Rsquared MAE
0.00 1 2.433344 0.7867541 1.916758
0.00 10 4.884038 0.5672747 3.141966
0.01 1 2.406902 0.7889631 1.877413
0.01 10 2.454159 0.7774285 1.953892
0.10 1 2.396847 0.7886603 1.869417
0.10 10 2.301708 0.8003090 1.815960
0.50 1 2.422011 0.7813350 1.901694
0.50 10 2.212246 0.8209941 1.758052
0.90 1 2.444657 0.7759943 1.923258
0.90 10 2.052534 0.8457960 1.625324
Tuning parameter 'bag' was held constant at a value of FALSE
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were size = 10, decay = 0.9 and bag = FALSE.
nnet_pred <- predict(nnet.model, newdata = testData$x)
postResample(pred = nnet_pred, obs = testData$y) RMSE Rsquared MAE
2.0122533 0.8384209 1.5424496
loess r-squared variable importance
Overall
X4 100.0000
X1 95.5047
X2 89.6186
X5 45.2170
X3 29.9330
X9 6.3299
X10 5.5182
X8 3.2527
X6 0.8884
X7 0.0000
X4, X1, X2, X5, X3 were listed as the 5 most important predictors for NN model.
Performance Evaluation
results <- data.frame(t(postResample(pred = knnPred, obs = testData$y))) %>% mutate(Model = "KNN")
results <- data.frame(t(postResample(pred = mars_pred, obs = testData$y))) %>% mutate(Model = "MARS") %>%
bind_rows(results)
results <- data.frame(t(postResample(pred = svm_pred, obs = testData$y))) %>% mutate(Model = "SVM") %>%
bind_rows(results)
results <- data.frame(t(postResample(pred = nnet_pred, obs = testData$y))) %>% mutate(Model = "Neural Network") %>%
bind_rows(results)
results %>% dplyr::select(Model, RMSE, Rsquared, MAE) %>% dplyr::arrange(RMSE) %>%
kable() %>% kable_styling()| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| MARS | 1.277999 | 0.9338365 | 1.014707 |
| Neural Network | 2.012253 | 0.8384209 | 1.542450 |
| SVM | 2.082971 | 0.8242096 | 1.582602 |
| KNN | 3.204060 | 0.6819919 | 2.568346 |
MARS model performed the best. It selected the informative predictors as most important. It has the best metrics across the broad.
7.5
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
(a)
Which nonlinear regression model gives the optimal resampling and test set performance?
Ans:
data(ChemicalManufacturingProcess)
# Used bagImpute method to impute the missing values.
bagImpute <- preProcess(ChemicalManufacturingProcess[, -c(1)], method = c("bagImpute"))# Make this reproducible
set.seed(45)
# Removing the zero variance variables from the dataset CMP
CMP_2 <- CMP %>% dplyr::select(-one_of(nearZeroVar(., names = T)))
# Create train and test split
train <- createDataPartition(CMP_2$Yield, times = 1, p = 0.8, list = FALSE)
CMP2_train_df <- CMP_2[train, ]
CMP2_test_df <- CMP_2[-train, ]KNN Model
knn_model <- train(Yield ~ ., data = CMP2_train_df, method = "knn", preProcess = c("center",
"scale"), trControl = trainControl("repeatedcv", repeats = 14), tuneLength = 10)
knn_modelk-Nearest Neighbors
144 samples
56 predictor
Pre-processing: centered (56), scaled (56)
Resampling: Cross-Validated (10 fold, repeated 14 times)
Summary of sample sizes: 130, 129, 131, 130, 129, 130, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 1.238316 0.5711530 0.9920695
7 1.275954 0.5500808 1.0284643
9 1.278544 0.5535930 1.0421709
11 1.285869 0.5541710 1.0533851
13 1.303112 0.5417439 1.0684796
15 1.310939 0.5419706 1.0689230
17 1.326826 0.5363184 1.0785755
19 1.344597 0.5273826 1.0944736
21 1.351137 0.5286896 1.0996305
23 1.363928 0.5262034 1.1086103
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
MARS Model
mars_grid <- expand.grid(.degree = 1:2, .nprune = 2:14)
mars_model <- train(Yield ~ ., data = CMP2_train_df, method = "earth", tuneGrid = mars_grid,
preProcess = c("center", "scale"))
mars_modelMultivariate Adaptive Regression Spline
144 samples
56 predictor
Pre-processing: centered (56), scaled (56)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
Resampling results across tuning parameters:
degree nprune RMSE Rsquared MAE
1 2 1.497808 0.3462486 1.1604640
1 3 1.264061 0.5271330 0.9945759
1 4 1.188350 0.5775696 0.9360398
1 5 1.214630 0.5647259 0.9574269
1 6 1.357955 0.5256184 1.0094276
1 7 1.708676 0.4934250 1.0773009
1 8 1.755209 0.4768041 1.1037725
1 9 1.782182 0.4747345 1.1099410
1 10 1.781493 0.4741570 1.1165152
1 11 1.748795 0.4784380 1.1197517
1 12 1.751767 0.4674375 1.1321467
1 13 1.746962 0.4520736 1.1474127
1 14 1.710127 0.4545385 1.1433512
2 2 1.497808 0.3462486 1.1604640
2 3 1.277601 0.5198498 0.9985240
2 4 1.213490 0.5596745 0.9471576
2 5 1.240414 0.5495335 0.9776753
[ reached getOption("max.print") -- omitted 9 rows ]
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 4 and degree = 1.
SVM Model
svm_model <- train(Yield ~ ., data = CMP2_train_df, method = "svmRadial", preProc = c("center",
"scale"), trControl = trainControl(method = "repeatedcv", repeats = 8), tuneLength = 14)
svm_modelSupport Vector Machines with Radial Basis Function Kernel
144 samples
56 predictor
Pre-processing: centered (56), scaled (56)
Resampling: Cross-Validated (10 fold, repeated 8 times)
Summary of sample sizes: 129, 131, 131, 130, 130, 129, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 1.341938 0.5529206 1.0870890
0.50 1.205938 0.6087110 0.9779722
1.00 1.110874 0.6521890 0.8953666
2.00 1.058657 0.6774674 0.8387258
4.00 1.069424 0.6717891 0.8447317
8.00 1.065245 0.6772205 0.8397986
16.00 1.060000 0.6802050 0.8359710
32.00 1.060000 0.6802050 0.8359710
64.00 1.060000 0.6802050 0.8359710
128.00 1.060000 0.6802050 0.8359710
256.00 1.060000 0.6802050 0.8359710
512.00 1.060000 0.6802050 0.8359710
1024.00 1.060000 0.6802050 0.8359710
2048.00 1.060000 0.6802050 0.8359710
Tuning parameter 'sigma' was held constant at a value of 0.01244262
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.01244262 and C = 2.
Neural Network Model
nnet_grid <- expand.grid(.decay = c(0, 0.01, 0.1), .size = c(1, 5, 10), .bag = FALSE)
nnet_model <- train(Yield ~ ., data = CMP2_train_df, method = "avNNet", tuneGrid = nnet_grid,
preProcess = c("center", "scale"), trControl = trainControl(method = "repeatedcv",
repeats = 1), trace = FALSE, linout = TRUE, MaxNWts = 301, maxit = 500)
nnet_modelModel Averaged Neural Network
144 samples
56 predictor
Pre-processing: centered (56), scaled (56)
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 131, 130, 130, 129, 130, 129, ...
Resampling results across tuning parameters:
decay size RMSE Rsquared MAE
0.00 1 1.513358 0.4390102 1.234170
0.00 5 1.873332 0.3647700 1.450868
0.00 10 NaN NaN NaN
0.01 1 1.435553 0.4773275 1.149338
0.01 5 1.763261 0.4347583 1.238102
0.01 10 NaN NaN NaN
0.10 1 1.784088 0.4670214 1.220596
0.10 5 1.824278 0.4046101 1.364862
0.10 10 NaN NaN NaN
Tuning parameter 'bag' was held constant at a value of FALSE
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were size = 1, decay = 0.01 and bag = FALSE.
nnet_pred <- predict(nnet_model, CMP2_test_df)
results <- data.frame(t(postResample(pred = nnet_pred, obs = CMP2_test_df$Yield))) %>%
mutate(Model = "Neural Network") %>% rbind(results)# results <- results [-1, ]
results %>% dplyr::select(Model, RMSE, Rsquared, MAE) %>% arrange(RMSE) %>% kable() %>%
kable_styling()| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| SVM | 1.388832 | 0.4899527 | 1.110552 |
| MARS | 1.468070 | 0.4309830 | 1.206402 |
| KNN | 1.497138 | 0.4412967 | 1.248188 |
| Neural Network | 1.567783 | 0.4009922 | 1.298610 |
MARS turns out to the best non-linear model by far, in terms of RMSE.
(b)
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
earth variable importance
Overall
ManufacturingProcess32 100.00
ManufacturingProcess09 49.87
ManufacturingProcess13 0.00
# sourcing the pls_model from HW 7
pls_model <- train(Yield ~ ., data = CMP2_train_df, method = "pls", tuneLength = 24,
trControl = trainControl(method = "repeatedcv", repeats = 8), preProcess = c("center",
"scale"))
caret::varImp(pls_model, 10)pls variable importance
only 20 most important variables shown (out of 56)
Overall
ManufacturingProcess32 100.00
ManufacturingProcess13 93.07
ManufacturingProcess17 88.16
ManufacturingProcess09 85.37
ManufacturingProcess36 76.32
ManufacturingProcess06 69.43
BiologicalMaterial02 68.17
BiologicalMaterial06 65.69
ManufacturingProcess33 61.27
BiologicalMaterial08 61.12
ManufacturingProcess12 60.49
BiologicalMaterial03 58.61
BiologicalMaterial12 57.87
BiologicalMaterial11 57.04
ManufacturingProcess11 56.23
BiologicalMaterial01 54.31
BiologicalMaterial04 54.30
ManufacturingProcess28 46.93
ManufacturingProcess04 46.41
ManufacturingProcess10 39.21
When we compare the the top 10 predictors from the most optimal model (mars_model) and top 10 from the pls_model, we do see that ManufacturingProcess32 tops the list. So manufacturing Process definitely dominated the list with 7 out of top 10 in the pls_model is Manufacturing Process.
(c)
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model.
Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
Ans: There was only one important predictor chosen by MARS, I’m just going to examine its relationship with Yield to completely wrapping up the answer to this question.
Apparently, the Yield is maximized when the ManufacturingProcess32 is around 167 and, in general, there is a positive correlation between yield, the response variable, and the predictor, ManufacturingProcess32. So the key is to maximize ManufacturingProcess32 and have it attained its value closer to 167.