Code
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, error = FALSE)knitr::opts_chunk$set(echo = TRUE, warning = FALSE, error = FALSE)Do problems 7.2 and 7.5 in Kuhn and Johnson. There are only two but they have many parts. Please submit both a link to your Rpubs and the .rmd file.
7.2. Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data: y = 10 sin(πx1x2) + 20(x3 − 0.5)2 + 10x4 + 5x5 + N(0, σ2) where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data
ANSWER:
library(mlbench)
library(Seurat)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will five the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
#featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)Tune several models on these data. For example:
library(caret)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10)
knnModelk-Nearest Neighbors
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 3.466085 0.5121775 2.816838
7 3.349428 0.5452823 2.727410
9 3.264276 0.5785990 2.660026
11 3.214216 0.6024244 2.603767
13 3.196510 0.6176570 2.591935
15 3.184173 0.6305506 2.577482
17 3.183130 0.6425367 2.567787
19 3.198752 0.6483184 2.592683
21 3.188993 0.6611428 2.588787
23 3.200458 0.6638353 2.604529
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
##The function 'postResample' can be used to get the test set > ##perforamnce values
postResample(pred = knnPred, obs = testData$y) RMSE Rsquared MAE
3.2040595 0.6819919 2.5683461
Which models appear to give the best performance?
ANSWER: Here we have trained several models below.
MARS Choose and optimal RMSE when nprune = 13 and degree is 2.
MARS_param <- expand.grid(.degree = 1:2, .nprune = 2:15)
MARS_model <- train(x = trainingData$x,
y = trainingData$y,
method = "earth",
tuneGrid = MARS_param,
preProcess = c("center", "scale"),
tuneLength = 10)
MARS_modelMultivariate Adaptive Regression Spline
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
degree nprune RMSE Rsquared MAE
1 2 4.383438 0.2405683 3.597961
1 3 3.645469 0.4745962 2.930453
1 4 2.727602 0.7035031 2.184240
1 5 2.449243 0.7611230 1.939231
1 6 2.331605 0.7835496 1.833420
1 7 1.976830 0.8421599 1.562591
1 8 1.870959 0.8585503 1.464551
1 9 1.804342 0.8683110 1.410395
1 10 1.787676 0.8711960 1.386944
1 11 1.790700 0.8707740 1.393076
1 12 1.821005 0.8670619 1.419893
1 13 1.858688 0.8617344 1.445459
1 14 1.862343 0.8623072 1.446050
1 15 1.871033 0.8607099 1.457618
2 2 4.383438 0.2405683 3.597961
2 3 3.644919 0.4742570 2.929647
2 4 2.730222 0.7028372 2.183075
2 5 2.481291 0.7545789 1.965749
2 6 2.338369 0.7827873 1.825542
2 7 2.030065 0.8328250 1.602024
2 8 1.890997 0.8551326 1.477422
2 9 1.742626 0.8757904 1.371910
2 10 1.608221 0.8943432 1.255416
2 11 1.474325 0.9111463 1.157848
2 12 1.437483 0.9157967 1.120977
2 13 1.439395 0.9164721 1.128309
2 14 1.428565 0.9184503 1.118634
2 15 1.434093 0.9182413 1.121622
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 14 and degree = 2.
The RMSE for MARS is much lower than KNN model
MARS_pred <- predict(MARS_model, newdata = testData$x)
postResample(pred = MARS_pred, obs = testData$y) RMSE Rsquared MAE
1.2779993 0.9338365 1.0147070
Does MARS select the informative predictors (those named X1–X5)?
ANSWER: The variable importance are 4 variables below. Yes, the MARS model selects 4 variables as most important.
varImp(MARS_model)earth variable importance
Overall
X1 100.00
X4 75.40
X2 49.00
X5 15.72
X3 0.00
SVM_model <- train(x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv"))
SVM_modelSupport Vector Machines with Radial Basis Function Kernel
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 2.548095 0.7967225 2.001583
0.50 2.290826 0.8118225 1.779407
1.00 2.095918 0.8337552 1.634192
2.00 1.989469 0.8462100 1.539721
4.00 1.901332 0.8565135 1.510431
8.00 1.879815 0.8588126 1.514502
16.00 1.878866 0.8591568 1.518094
32.00 1.878866 0.8591568 1.518094
64.00 1.878866 0.8591568 1.518094
128.00 1.878866 0.8591568 1.518094
Tuning parameter 'sigma' was held constant at a value of 0.06670077
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.06670077 and C = 16.
SVM_predict <- predict(SVM_model, newdata = testData$x)
postResample(pred = SVM_predict, obs = testData$y) RMSE Rsquared MAE
2.0829707 0.8242096 1.5826017
varImp(SVM_model)loess r-squared variable importance
Overall
X4 100.0000
X1 95.5047
X2 89.6186
X5 45.2170
X3 29.9330
X9 6.3299
X10 5.5182
X8 3.2527
X6 0.8884
X7 0.0000
#install.packages(c('neuralnet'),dependencies = T)
library(neuralnet)
nnet_model <- neuralnet(trainingData$y~.,trainingData,
hidden=c(7,3),
linear.output = FALSE)
plot(nnet_model,rep = "best")nnet <- predict(nnet_model, newdata = testData$x)
postResample(pred = nnet, obs = testData$y) RMSE Rsquared MAE
14.27692927 0.01560489 13.38691083
ANSWER: The MARS model as the best Rsquared value of 0.9291489 and a lower RMSE , therefore we select this model.
results <- data.frame(t(postResample(pred = knnPred, obs = testData$y))) %>%
mutate("Model" = "KNN")
results <- data.frame(t(postResample(pred = MARS_pred, obs = testData$y))) %>%
mutate("Model"= "MARS") %>%
bind_rows(results)
results <- data.frame(t(postResample(pred = SVM_predict, obs = testData$y))) %>%
mutate("Model"= "SVM") %>%
bind_rows(results)
results <- data.frame(t(postResample(pred = nnet, obs = testData$y))) %>%
mutate("Model"= "Neural Network") %>%
bind_rows(results)
results %>%
dplyr::select(Model, RMSE, Rsquared, MAE) %>%
arrange(RMSE) %>%
kable() %>%
kable_styling()| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| MARS | 1.277999 | 0.9338365 | 1.014707 |
| SVM | 2.082971 | 0.8242096 | 1.582602 |
| KNN | 3.204060 | 0.6819919 | 2.568346 |
| Neural Network | 14.276929 | 0.0156049 | 13.386911 |
7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
(a) Which nonlinear regression model gives the optimal resampling and test set performance?
ANSWER: After trying several models, its seems that the PLS model has the highest Rquared value of 67.7% and the RMSE 0.619.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Make this reproducible
set.seed(42)
knn <- preProcess(ChemicalManufacturingProcess, "knnImpute")
df1 <- predict(knn, ChemicalManufacturingProcess)
df <- df1 %>%
select_at(vars(-one_of(nearZeroVar(., names = TRUE))))
in_train <- createDataPartition(df$Yield, times = 1, p = 0.8, list = FALSE)
train_df <- df[in_train, ]
test_df <- df[-in_train, ]
pls_model <- train(
Yield ~ ., data = train_df, method = "pls",
center = TRUE,
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 25
)
pls_modelPartial Least Squares
144 samples
56 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 130, 129, 128, 129, 130, 129, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 0.8824790 0.3779221 0.6711462
2 1.1458456 0.4219806 0.7086431
3 0.7363066 0.5244517 0.5688553
4 0.8235294 0.5298005 0.5933120
5 0.9670735 0.4846010 0.6371199
6 0.9959036 0.4776684 0.6427478
7 0.9119517 0.4986338 0.6200233
8 0.9068621 0.5012144 0.6293371
9 0.8517370 0.5220166 0.6163795
10 0.8919356 0.5062912 0.6332243
11 0.9173758 0.4934557 0.6463164
12 0.9064125 0.4791526 0.6485663
13 0.9255289 0.4542181 0.6620193
14 1.0239913 0.4358371 0.6944056
15 1.0754710 0.4365214 0.7077991
16 1.1110579 0.4269065 0.7135684
17 1.1492855 0.4210485 0.7222868
18 1.1940639 0.4132534 0.7396357
19 1.2271867 0.4079005 0.7494818
20 1.2077102 0.4022859 0.7470327
21 1.2082648 0.4026711 0.7452969
22 1.2669285 0.3987044 0.7634170
23 1.3663033 0.3970188 0.7957514
24 1.4531634 0.3898475 0.8243034
25 1.5624265 0.3820102 0.8612935
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 3.
pls <- predict(pls_model, test_df)
results <- data.frame(t(postResample(pred = pls, obs = test_df$Yield))) %>%
mutate("Model"= "PLS")
plot(pls_model)MARS_grid <- expand.grid(.degree = 1:2, .nprune = 2:15)
MARS_model <- train(
Yield ~ ., data = train_df, method = "earth",
tuneGrid = MARS_grid,
# If the following lines are uncommented, it throws an error
#center = TRUE,
#scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 25
)
MARS_modelMultivariate Adaptive Regression Spline
144 samples
56 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 130, 129, 130, 130, 130, 131, ...
Resampling results across tuning parameters:
degree nprune RMSE Rsquared MAE
1 2 0.7727673 0.4024317 0.6055172
1 3 0.6617934 0.5462836 0.5301205
1 4 0.6414403 0.5782650 0.5175873
1 5 0.6612068 0.5491861 0.5191206
1 6 0.6578997 0.5541443 0.5065749
1 7 0.6431856 0.5731468 0.5094160
1 8 0.6365886 0.5818343 0.5009426
1 9 0.6489700 0.5690713 0.5198826
1 10 0.6289426 0.6018481 0.5107398
1 11 0.6254798 0.6109447 0.4938393
1 12 0.8175335 0.5491371 0.5524263
1 13 0.8485872 0.5202012 0.5742223
1 14 0.9496041 0.4980894 0.6161224
1 15 0.9447012 0.4963106 0.6150941
2 2 0.7727673 0.4024317 0.6055172
2 3 0.7017341 0.4897201 0.5560432
2 4 0.6723104 0.5388118 0.5456391
2 5 0.6720395 0.5643163 0.5260406
2 6 0.6054599 0.6330120 0.4700175
2 7 0.5879508 0.6609611 0.4649161
2 8 0.5579780 0.6910772 0.4407506
2 9 0.5605913 0.6918493 0.4401424
2 10 0.5766080 0.6752904 0.4484094
2 11 0.5743771 0.6754628 0.4479538
2 12 0.5773643 0.6804448 0.4451957
2 13 0.5743328 0.6899189 0.4354640
2 14 0.6066695 0.6585134 0.4483969
2 15 0.6195630 0.6490841 0.4512092
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 8 and degree = 2.
MARS_pred <- predict(MARS_model, test_df)
results <- data.frame(t(postResample(pred = MARS_pred, obs = test_df$Yield))) %>%
mutate("Model"= "MARS") %>% rbind(results)
plot(MARS_model)SVM_model <- train(
Yield ~ ., data = train_df, method = "svmRadial",
center = TRUE,
scale = TRUE,
trControl = trainControl(method = "cv"),
tuneLength = 25
)
SVM_modelSupport Vector Machines with Radial Basis Function Kernel
144 samples
56 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 130, 130, 129, 129, 129, 130, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 0.7686126 0.4567865 0.6341644
0.50 0.7187220 0.4883557 0.5932541
1.00 0.6756247 0.5370899 0.5532656
2.00 0.6502763 0.5676720 0.5259804
4.00 0.6577992 0.5614980 0.5317055
8.00 0.6316941 0.5959658 0.5141839
16.00 0.6291552 0.5994754 0.5145133
32.00 0.6291552 0.5994754 0.5145133
64.00 0.6291552 0.5994754 0.5145133
128.00 0.6291552 0.5994754 0.5145133
256.00 0.6291552 0.5994754 0.5145133
512.00 0.6291552 0.5994754 0.5145133
1024.00 0.6291552 0.5994754 0.5145133
2048.00 0.6291552 0.5994754 0.5145133
4096.00 0.6291552 0.5994754 0.5145133
8192.00 0.6291552 0.5994754 0.5145133
16384.00 0.6291552 0.5994754 0.5145133
32768.00 0.6291552 0.5994754 0.5145133
65536.00 0.6291552 0.5994754 0.5145133
131072.00 0.6291552 0.5994754 0.5145133
262144.00 0.6291552 0.5994754 0.5145133
524288.00 0.6291552 0.5994754 0.5145133
1048576.00 0.6291552 0.5994754 0.5145133
2097152.00 0.6291552 0.5994754 0.5145133
4194304.00 0.6291552 0.5994754 0.5145133
Tuning parameter 'sigma' was held constant at a value of 0.0137077
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.0137077 and C = 16.
SVM_pred <- predict(SVM_model, test_df)
results <- data.frame(t(postResample(pred = SVM_pred, obs = test_df$Yield))) %>%
mutate("Model"= "SVM") %>% rbind(results)
plot(SVM_model)knn_model <- train(
Yield ~ ., data = train_df, method = "knn",
center = TRUE,
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 25
)
knn_modelk-Nearest Neighbors
144 samples
56 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 132, 130, 128, 129, 128, 129, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 0.6828734 0.5168641 0.5604420
7 0.7347844 0.4695838 0.6141467
9 0.7460879 0.4583270 0.6240751
11 0.7476629 0.4585902 0.6358480
13 0.7490971 0.4610305 0.6317689
15 0.7591816 0.4463879 0.6369478
17 0.7711858 0.4294319 0.6461468
19 0.7778231 0.4224278 0.6543177
21 0.7800437 0.4336780 0.6499864
23 0.7816982 0.4390426 0.6504034
25 0.7904133 0.4288995 0.6559515
27 0.8021603 0.4151312 0.6675529
29 0.8077564 0.4143896 0.6739736
31 0.8159533 0.4104242 0.6797708
33 0.8200913 0.4070435 0.6797057
35 0.8228305 0.4099405 0.6821038
37 0.8260244 0.4191607 0.6829727
39 0.8319769 0.4057005 0.6858128
41 0.8360941 0.4025312 0.6889323
43 0.8354433 0.4036491 0.6876839
45 0.8335381 0.4158074 0.6845551
47 0.8379981 0.4122304 0.6914769
49 0.8435241 0.4030341 0.6951803
51 0.8439107 0.4086331 0.6962559
53 0.8455989 0.4070050 0.6962722
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
knn_predict <- predict(knn_model, test_df)
results <- data.frame(t(postResample(pred = knn_predict, obs = test_df$Yield))) %>%
mutate("Model"= "KNN") %>% rbind(results)
plot(knn_model)results1 <- data.frame(t(postResample(pred = knn_predict, obs = test_df$Yield))) %>%
mutate("Model" = "KNN")
results1 <- data.frame(t(postResample(pred = MARS_pred, obs = test_df$Yield))) %>%
mutate("Model"= "MARS") %>%
bind_rows(results1)
results1 <- data.frame(t(postResample(pred = SVM_pred, obs = test_df$Yield))) %>%
mutate("Model"= "SVM") %>%
bind_rows(results1)
results1 <- data.frame(t(postResample(pred = pls, obs = test_df$Yield))) %>%
mutate("Model"= "PLS") %>%
bind_rows(results1)
results1 %>%
dplyr::select(Model, RMSE, Rsquared, MAE) %>%
arrange(RMSE) %>%
kable() %>%
kable_styling()| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| PLS | 0.6192577 | 0.6771122 | 0.5059984 |
| SVM | 0.6794955 | 0.6245393 | 0.4935834 |
| KNN | 0.7149819 | 0.5894607 | 0.5047973 |
| MARS | 0.7199111 | 0.5920601 | 0.5651110 |
(b) Which predictors are most important in the optimal nonlinear regression model?
ANSWER: In the PLS model, the Manufacturing process variables (32, 9 , 36, 13,17) are the highest
library(dplyr)
top <- varImp(pls_model)$importance %>%
arrange(-Overall) %>%
head(10) %>% kable() %>% kable_styling()
top1 <- varImp(SVM_model)$importance %>%
arrange(-Overall) %>%
head(10) %>% kable() %>% kable_styling()
top| Overall | |
|---|---|
| ManufacturingProcess32 | 100.00000 |
| ManufacturingProcess09 | 88.04157 |
| ManufacturingProcess36 | 82.20485 |
| ManufacturingProcess13 | 82.11199 |
| ManufacturingProcess17 | 80.24860 |
| ManufacturingProcess06 | 59.05610 |
| ManufacturingProcess11 | 55.92794 |
| BiologicalMaterial02 | 55.45684 |
| BiologicalMaterial06 | 54.64083 |
| BiologicalMaterial03 | 54.49979 |
top1| Overall | |
|---|---|
| ManufacturingProcess32 | 100.00000 |
| ManufacturingProcess13 | 93.81970 |
| ManufacturingProcess09 | 89.92917 |
| ManufacturingProcess17 | 88.19815 |
| BiologicalMaterial06 | 82.60809 |
| BiologicalMaterial03 | 79.43762 |
| ManufacturingProcess36 | 73.84749 |
| BiologicalMaterial12 | 72.36048 |
| ManufacturingProcess06 | 69.00493 |
| ManufacturingProcess11 | 62.33657 |
Do either the biological or process variables dominate the list?
ANSWER: For the PLS model with the higher RSQUARED, the manufacturing process variables dominate the list.
How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
ANSWER: Both PLS and SVM seem to have process variables at the top of the importance list. However, for the SVM model has few biological material variables (06 and 03) deemed important.
(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model.
Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
ggplot(train_df, aes(BiologicalMaterial12, Yield)) +
geom_point()ggplot(train_df, aes(ManufacturingProcess17, Yield)) +
geom_point()ggplot(train_df, aes(ManufacturingProcess09, Yield)) +
geom_point()ggplot(train_df, aes(BiologicalMaterial06, Yield)) +
geom_point()ggplot(train_df, aes(BiologicalMaterial03, Yield)) +
geom_point()# Get top-10 feature names
importance <- varImp(SVM_model)$importance
importance$feature <- rownames(importance)
importance <- importance[order(importance$Overall, decreasing=TRUE), ]
# Get importance feature names
important_f <- rownames(head(importance, 10))
# Create correlelogram of imputed chemical data
correlation <- cor(train_df[, c(important_f, "Yield")])
# corrplot(correlation, method="circle")
ggcorrplot(correlation, type = "upper", outline.color = "white",
ggtheme = theme_classic,
#colors = c("#6D9EC1", "white", "#E46726"),
lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3)#
# # Calculate the correlation matrix
# correlation_matrix <- cor(numeric_data, use="complete.obs")
# print(correlation_matrix)
# corrplot(correlation_matrix, method="circle")