Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\(y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)\)
where the \(x\) values are random variables uniformly distributed on \([0, 1]\) (the simulation also creates five additional non-informative predictors). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(caret)    # featurePlot() used below comes from caret
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame.
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
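To make the connection to the equation explicit, here is a hand-rolled version of the simulation. This is just a sketch (the function name friedman1_sim is mine); mlbench.friedman1 is the canonical implementation and does the same thing internally.
friedman1_sim <- function(n, sd = 1) {
  ## Ten uniform predictors; only the first five enter the response
  x <- matrix(runif(n * 10), ncol = 10)
  y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5] + rnorm(n, sd = sd)
  list(x = data.frame(x), y = y)
}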
Tune several models on these data. For example:
library(caret)
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProcess = c("center", "scale"),
                  tuneLength = 10)
knnModel
#> k-Nearest Neighbors
#>
#> 200 samples
#> 10 predictor
#>
#> Pre-processing: centered (10), scaled (10)
#> Resampling: Bootstrapped (25 reps)
#> Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
#> Resampling results across tuning parameters:
#>
#> k RMSE Rsquared MAE
#> 5 3.466085 0.5121775 2.816838
#> 7 3.349428 0.5452823 2.727410
#> 9 3.264276 0.5785990 2.660026
#> 11 3.214216 0.6024244 2.603767
#> 13 3.196510 0.6176570 2.591935
#> 15 3.184173 0.6305506 2.577482
#> 17 3.183130 0.6425367 2.567787
#> 19 3.198752 0.6483184 2.592683
#> 21 3.188993 0.6611428 2.588787
#> 23 3.200458 0.6638353 2.604529
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
#> RMSE Rsquared MAE
#> 3.2040595 0.6819919 2.5683461
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1-X5)?
MARS_grid <- expand.grid(.degree = 1:2, .nprune = 2:15)
MARS_model <- train(x = trainingData$x,
                    y = trainingData$y,
                    method = "earth",
                    tuneGrid = MARS_grid,  # an explicit grid makes tuneLength unnecessary
                    preProcess = c("center", "scale"))
#> Loading required package: earth
#> Loading required package: Formula
#> Loading required package: plotmo
#> Loading required package: plotrix
#> Loading required package: TeachingDemos
MARS_model
#> Multivariate Adaptive Regression Spline
#>
#> 200 samples
#> 10 predictor
#>
#> Pre-processing: centered (10), scaled (10)
#> Resampling: Bootstrapped (25 reps)
#> Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
#> Resampling results across tuning parameters:
#>
#> degree nprune RMSE Rsquared MAE
#> 1 2 4.383438 0.2405683 3.597961
#> 1 3 3.645469 0.4745962 2.930453
#> 1 4 2.727602 0.7035031 2.184240
#> 1 5 2.449243 0.7611230 1.939231
#> 1 6 2.331605 0.7835496 1.833420
#> 1 7 1.976830 0.8421599 1.562591
#> 1 8 1.870959 0.8585503 1.464551
#> 1 9 1.804342 0.8683110 1.410395
#> 1 10 1.787676 0.8711960 1.386944
#> 1 11 1.790700 0.8707740 1.393076
#> 1 12 1.821005 0.8670619 1.419893
#> 1 13 1.858688 0.8617344 1.445459
#> 1 14 1.862343 0.8623072 1.446050
#> 1 15 1.871033 0.8607099 1.457618
#> 2 2 4.383438 0.2405683 3.597961
#> 2 3 3.644919 0.4742570 2.929647
#> 2 4 2.730222 0.7028372 2.183075
#> 2 5 2.481291 0.7545789 1.965749
#> 2 6 2.338369 0.7827873 1.825542
#> 2 7 2.030065 0.8328250 1.602024
#> 2 8 1.890997 0.8551326 1.477422
#> 2 9 1.742626 0.8757904 1.371910
#> 2 10 1.608221 0.8943432 1.255416
#> 2 11 1.474325 0.9111463 1.157848
#> 2 12 1.437483 0.9157967 1.120977
#> 2 13 1.439395 0.9164721 1.128309
#> 2 14 1.428565 0.9184503 1.118634
#> 2 15 1.434093 0.9182413 1.121622
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were nprune = 14 and degree = 2.
The optimal MARS model minimized the RMSE at nprune = 14 and degree = 2.
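To see exactly which hinge terms survived the pruning, and which predictors they involve, the final earth fit can be inspected directly (output omitted here):
summary(MARS_model$finalModel)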
MARS_predictions <- predict(MARS_model, newdata = testData$x)
postResample(pred = MARS_predictions, obs = testData$y)
#> RMSE Rsquared MAE
#> 1.2779993 0.9338365 1.0147070
The RMSE of the MARS model is much lower than that of the KNN model. Let's see which variables are important.
varImp(MARS_model)
#> earth variable importance
#>
#> Overall
#> X1 100.00
#> X4 75.40
#> X2 49.00
#> X5 15.72
#> X3 0.00
The MARS model picks up the X1-X5 variables, with X3 contributing the least.
SVM_model <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "svmRadial",
                   preProcess = c("center", "scale"),
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))
SVM_model
#> Support Vector Machines with Radial Basis Function Kernel
#>
#> 200 samples
#> 10 predictor
#>
#> Pre-processing: centered (10), scaled (10)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
#> Resampling results across tuning parameters:
#>
#> C RMSE Rsquared MAE
#> 0.25 2.548095 0.7967225 2.001583
#> 0.50 2.290818 0.8118247 1.779392
#> 1.00 2.095918 0.8337552 1.634192
#> 2.00 1.989469 0.8462100 1.539721
#> 4.00 1.901315 0.8565159 1.510413
#> 8.00 1.879808 0.8588137 1.514492
#> 16.00 1.878851 0.8591592 1.518105
#> 32.00 1.878851 0.8591592 1.518105
#> 64.00 1.878851 0.8591592 1.518105
#> 128.00 1.878851 0.8591592 1.518105
#>
#> Tuning parameter 'sigma' was held constant at a value of 0.06670077
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were sigma = 0.06670077 and C = 16.
The optimal SVM model has a \(\sigma\) of 0.06670077 and a C of 16.
SVM_predictions <- predict(SVM_model, newdata = testData$x)
postResample(pred = SVM_predictions, obs = testData$y)
#> RMSE Rsquared MAE
#> 2.0829707 0.8242096 1.5826017
varImp(SVM_model)
#> loess r-squared variable importance
#>
#> Overall
#> X4 100.0000
#> X1 95.5047
#> X2 89.6186
#> X5 45.2170
#> X3 29.9330
#> X9 6.3299
#> X10 5.5182
#> X8 3.2527
#> X6 0.8884
#> X7 0.0000
The SVM picked up the important variables. (Note that caret has no model-specific importance measure for SVMs, so varImp falls back to a model-free loess \(R^2\) filter; the same happens for the neural network below, which is why the two importance tables match.)
The best neural network has a size of 4 and a decay of 0.1.
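The chunk that fit the neural network didn't make it into this post. A sketch of how such a model could be trained with caret follows; the grid values are my assumptions, chosen to bracket the reported optimum of size = 4 and decay = 0.1.
library(nnet)
nnet_grid <- expand.grid(.size = 1:10, .decay = c(0, 0.01, 0.1))
nnet_model <- train(x = trainingData$x,
                    y = trainingData$y,
                    method = "nnet",
                    tuneGrid = nnet_grid,
                    preProcess = c("center", "scale"),
                    linout = TRUE,   # linear output unit for regression
                    trace = FALSE,   # suppress optimizer output
                    maxit = 500)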
nnet_predictions <- predict(nnet_model, newdata = testData$x)
postResample(pred = nnet_predictions, obs = testData$y)
#> RMSE Rsquared MAE
#> 2.0532730 0.8346365 1.5444574
varImp(nnet_model)
#> loess r-squared variable importance
#>
#> Overall
#> X4 100.0000
#> X1 95.5047
#> X2 89.6186
#> X5 45.2170
#> X3 29.9330
#> X9 6.3299
#> X10 5.5182
#> X8 3.2527
#> X6 0.8884
#> X7 0.0000
The top five variables are the informative X1-X5 predictors, exactly the ones we want to see listed.
library(dplyr)
library(knitr)
library(kableExtra)
results <- data.frame(t(postResample(pred = knnPred, obs = testData$y))) %>%
  mutate("Model" = "KNN")
results <- data.frame(t(postResample(pred = MARS_predictions, obs = testData$y))) %>%
  mutate("Model" = "MARS") %>%
  bind_rows(results)
results <- data.frame(t(postResample(pred = SVM_predictions, obs = testData$y))) %>%
  mutate("Model" = "SVM") %>%
  bind_rows(results)
results <- data.frame(t(postResample(pred = nnet_predictions, obs = testData$y))) %>%
  mutate("Model" = "Neural Network") %>%
  bind_rows(results)
results %>%
  select(Model, RMSE, Rsquared, MAE) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
| Model          | RMSE     | Rsquared  | MAE      |
|----------------|----------|-----------|----------|
| MARS           | 1.277999 | 0.9338365 | 1.014707 |
| Neural Network | 2.053273 | 0.8346365 | 1.544457 |
| SVM            | 2.082971 | 0.8242096 | 1.582602 |
| KNN            | 3.204060 | 0.6819919 | 2.568346 |
The MARS model performed best and identified the right variables as the important ones. Its \(R^2\) is by far the highest and its RMSE the lowest, and that advantage carries over to the test set.
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
I will run the data through the models from the chapter, keeping each one's setup as close as possible to that of the original linear model.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Make this reproducible
set.seed(42)
# kNN imputation; preProcess's knnImpute also centers and scales the data
impute_model <- preProcess(ChemicalManufacturingProcess, "knnImpute")
df <- predict(impute_model, ChemicalManufacturingProcess)
# Drop near-zero-variance predictors
df <- df %>%
  select_at(vars(-one_of(nearZeroVar(., names = TRUE))))
in_train <- createDataPartition(df$Yield, times = 1, p = 0.8, list = FALSE)
train_df <- df[in_train, ]
test_df <- df[-in_train, ]
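A quick sanity check (my addition) confirms that the imputation left no missing values and that the split is roughly 80/20:
sum(is.na(df))   # 0 after kNN imputation
nrow(train_df)   # about 80% of the rows
nrow(test_df)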
pls_model <- train(
  Yield ~ ., data = train_df, method = "pls",
  center = TRUE,
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 25
)
pls_model
#> Partial Least Squares
#>
#> 144 samples
#> 56 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 130, 129, 128, 129, 130, 129, ...
#> Resampling results across tuning parameters:
#>
#> ncomp RMSE Rsquared MAE
#> 1 0.8824790 0.3779221 0.6711462
#> 2 1.1458456 0.4219806 0.7086431
#> 3 0.7363066 0.5244517 0.5688553
#> 4 0.8235294 0.5298005 0.5933120
#> 5 0.9670735 0.4846010 0.6371199
#> 6 0.9959036 0.4776684 0.6427478
#> 7 0.9119517 0.4986338 0.6200233
#> 8 0.9068621 0.5012144 0.6293371
#> 9 0.8517370 0.5220166 0.6163795
#> 10 0.8919356 0.5062912 0.6332243
#> 11 0.9173758 0.4934557 0.6463164
#> 12 0.9064125 0.4791526 0.6485663
#> 13 0.9255289 0.4542181 0.6620193
#> 14 1.0239913 0.4358371 0.6944056
#> 15 1.0754710 0.4365214 0.7077991
#> 16 1.1110579 0.4269065 0.7135684
#> 17 1.1492855 0.4210485 0.7222868
#> 18 1.1940639 0.4132534 0.7396357
#> 19 1.2271867 0.4079005 0.7494818
#> 20 1.2077102 0.4022859 0.7470327
#> 21 1.2082648 0.4026711 0.7452969
#> 22 1.2669285 0.3987044 0.7634170
#> 23 1.3663033 0.3970188 0.7957514
#> 24 1.4531634 0.3898475 0.8243034
#> 25 1.5624265 0.3820102 0.8612935
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was ncomp = 3.
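caret's plot method for train objects shows the RMSE profile across the number of components, a quick visual check that ncomp = 3 sits at the minimum:
plot(pls_model)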
pls_predictions <- predict(pls_model, test_df)
results <- data.frame(t(postResample(pred = pls_predictions, obs = test_df$Yield))) %>%
  mutate("Model" = "PLS")
Which nonlinear regression model gives the optimal resampling and test set performance?
knn_model <- train(
  Yield ~ ., data = train_df, method = "knn",
  center = TRUE,
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 25
)
knn_model
#> k-Nearest Neighbors
#>
#> 144 samples
#> 56 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 130, 129, 130, 130, 130, 131, ...
#> Resampling results across tuning parameters:
#>
#> k RMSE Rsquared MAE
#> 5 0.7085987 0.4846711 0.5709345
#> 7 0.7331379 0.4559107 0.5981516
#> 9 0.7417741 0.4556845 0.6124459
#> 11 0.7505969 0.4526780 0.6277377
#> 13 0.7450412 0.4691830 0.6181457
#> 15 0.7539451 0.4604423 0.6246762
#> 17 0.7663763 0.4474991 0.6306860
#> 19 0.7694326 0.4480450 0.6315047
#> 21 0.7706841 0.4508441 0.6288689
#> 23 0.7780866 0.4419184 0.6350889
#> 25 0.7834333 0.4320541 0.6382165
#> 27 0.7877890 0.4339581 0.6454522
#> 29 0.7981302 0.4187796 0.6533767
#> 31 0.8060716 0.3977540 0.6610186
#> 33 0.8063566 0.4128701 0.6600828
#> 35 0.8144040 0.3964037 0.6671734
#> 37 0.8160580 0.4009313 0.6662999
#> 39 0.8220176 0.3884786 0.6693883
#> 41 0.8274778 0.3750873 0.6746765
#> 43 0.8288000 0.3820179 0.6758220
#> 45 0.8285016 0.3878549 0.6747216
#> 47 0.8331504 0.3808342 0.6774480
#> 49 0.8368096 0.3816669 0.6829726
#> 51 0.8391074 0.3804316 0.6857959
#> 53 0.8442763 0.3704957 0.6915276
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 5.
knn_predictions <- predict(knn_model, test_df)
results <- data.frame(t(postResample(pred = knn_predictions, obs = test_df$Yield))) %>%
  mutate("Model" = "KNN") %>%
  rbind(results)
MARS_grid <- expand.grid(.degree = 1:2, .nprune = 2:15)
MARS_model <- train(
  Yield ~ ., data = train_df, method = "earth",
  tuneGrid = MARS_grid,
  # center and scale are not arguments of earth(), so passing them
  # through train() throws an error; they are omitted here
  trControl = trainControl("cv", number = 10)
)
MARS_model
#> Multivariate Adaptive Regression Spline
#>
#> 144 samples
#> 56 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 129, 130, 130, 128, 130, 128, ...
#> Resampling results across tuning parameters:
#>
#> degree nprune RMSE Rsquared MAE
#> 1 2 0.7744594 0.4159942 0.6084363
#> 1 3 0.7003597 0.5682276 0.5530374
#> 1 4 0.6469215 0.6086750 0.5154043
#> 1 5 0.6404520 0.6073071 0.5042864
#> 1 6 0.6960964 0.5396157 0.5588565
#> 1 7 0.6997921 0.5521698 0.5669813
#> 1 8 0.7147341 0.5423472 0.5621022
#> 1 9 0.7173559 0.5508650 0.5641610
#> 1 10 0.7487512 0.5325936 0.5869088
#> 1 11 0.7415722 0.5353676 0.5847493
#> 1 12 0.7353149 0.5519142 0.5768069
#> 1 13 0.7450704 0.5357967 0.5887685
#> 1 14 0.7419274 0.5400866 0.5844385
#> 1 15 0.7262459 0.5553445 0.5775694
#> 2 2 0.7814439 0.4052275 0.6150579
#> 2 3 0.6961165 0.5526862 0.5636101
#> 2 4 0.6947315 0.5595844 0.5539150
#> 2 5 0.6779962 0.5808479 0.5327456
#> 2 6 0.6727275 0.5828458 0.5188054
#> 2 7 0.6679783 0.5914730 0.5207659
#> 2 8 0.6108172 0.6495008 0.4727198
#> 2 9 0.5989850 0.6605148 0.4657677
#> 2 10 0.6026849 0.6641443 0.4591813
#> 2 11 0.6077285 0.6574897 0.4719777
#> 2 12 0.6015413 0.6605563 0.4640395
#> 2 13 0.6154576 0.6602372 0.4727339
#> 2 14 0.6222890 0.6555194 0.4751916
#> 2 15 0.6102534 0.6703285 0.4650639
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were nprune = 9 and degree = 2.
MARS_predictions <- predict(MARS_model, test_df)
results <- data.frame(t(postResample(pred = MARS_predictions, obs = test_df$Yield))) %>%
  mutate("Model" = "MARS") %>%
  rbind(results)
SVM_model <- train(
  Yield ~ ., data = train_df, method = "svmRadial",
  center = TRUE,
  scale = TRUE,
  trControl = trainControl(method = "cv"),
  tuneLength = 25
)
SVM_model
#> Support Vector Machines with Radial Basis Function Kernel
#>
#> 144 samples
#> 56 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 129, 129, 129, 130, 131, 131, ...
#> Resampling results across tuning parameters:
#>
#> C RMSE Rsquared MAE
#> 0.25 0.7683015 0.4646282 0.6392703
#> 0.50 0.7174781 0.4848381 0.5965095
#> 1.00 0.6686679 0.5330949 0.5530530
#> 2.00 0.6409061 0.5552794 0.5275947
#> 4.00 0.6356640 0.5619233 0.5166512
#> 8.00 0.6286151 0.5761807 0.5115837
#> 16.00 0.6287131 0.5760690 0.5116965
#> 32.00 0.6287131 0.5760690 0.5116965
#> 64.00 0.6287131 0.5760690 0.5116965
#> 128.00 0.6287131 0.5760690 0.5116965
#> 256.00 0.6287131 0.5760690 0.5116965
#> 512.00 0.6287131 0.5760690 0.5116965
#> 1024.00 0.6287131 0.5760690 0.5116965
#> 2048.00 0.6287131 0.5760690 0.5116965
#> 4096.00 0.6287131 0.5760690 0.5116965
#> 8192.00 0.6287131 0.5760690 0.5116965
#> 16384.00 0.6287131 0.5760690 0.5116965
#> 32768.00 0.6287131 0.5760690 0.5116965
#> 65536.00 0.6287131 0.5760690 0.5116965
#> 131072.00 0.6287131 0.5760690 0.5116965
#> 262144.00 0.6287131 0.5760690 0.5116965
#> 524288.00 0.6287131 0.5760690 0.5116965
#> 1048576.00 0.6287131 0.5760690 0.5116965
#> 2097152.00 0.6287131 0.5760690 0.5116965
#> 4194304.00 0.6287131 0.5760690 0.5116965
#>
#> Tuning parameter 'sigma' was held constant at a value of 0.015148
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were sigma = 0.015148 and C = 8.
SVM_predictions <- predict(SVM_model, test_df)
results <- data.frame(t(postResample(pred = SVM_predictions, obs = test_df$Yield))) %>%
  mutate("Model" = "SVM") %>%
  rbind(results)
| Model          | RMSE      | Rsquared  | MAE       |
|----------------|-----------|-----------|-----------|
| PLS            | 0.6192577 | 0.6771122 | 0.5059984 |
| MARS           | 0.6797644 | 0.6313203 | 0.5452501 |
| SVM            | 0.6815833 | 0.6251941 | 0.4926105 |
| KNN            | 0.7149819 | 0.5894607 | 0.5047973 |
| Neural Network | 0.7368477 | 0.5465295 | 0.5848484 |
The PLS model was the best overall, with the highest \(R^2\) (67.7%) and the lowest RMSE (0.619) on the test set; among the nonlinear models, MARS and SVM are nearly tied just behind it.
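The table above covers test-set performance. For the resampling half of the question, caret's resamples() can summarize the cross-validated metrics side by side; a sketch using the fitted objects above (for a strict comparison the models would need to share the same trainControl resampling indices):
resamp <- resamples(list(PLS = pls_model, KNN = knn_model,
                         MARS = MARS_model, SVM = SVM_model))
summary(resamp)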
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
varImp(SVM_model)
#> loess r-squared variable importance
#>
#> only 20 most important variables shown (out of 56)
#>
#> Overall
#> ManufacturingProcess32 100.00
#> ManufacturingProcess13 93.82
#> ManufacturingProcess09 89.93
#> ManufacturingProcess17 88.20
#> BiologicalMaterial06 82.61
#> BiologicalMaterial03 79.44
#> ManufacturingProcess36 73.85
#> BiologicalMaterial12 72.36
#> ManufacturingProcess06 69.00
#> ManufacturingProcess11 62.34
#> ManufacturingProcess31 56.39
#> BiologicalMaterial02 50.34
#> BiologicalMaterial11 48.53
#> BiologicalMaterial09 44.76
#> ManufacturingProcess30 41.87
#> BiologicalMaterial08 40.24
#> ManufacturingProcess29 38.54
#> ManufacturingProcess33 38.16
#> BiologicalMaterial04 36.92
#> ManufacturingProcess25 36.83
varImp(pls_model)
#> pls variable importance
#>
#> only 20 most important variables shown (out of 56)
#>
#> Overall
#> ManufacturingProcess32 100.00
#> ManufacturingProcess09 88.04
#> ManufacturingProcess36 82.20
#> ManufacturingProcess13 82.11
#> ManufacturingProcess17 80.25
#> ManufacturingProcess06 59.06
#> ManufacturingProcess11 55.93
#> BiologicalMaterial02 55.46
#> BiologicalMaterial06 54.64
#> BiologicalMaterial03 54.50
#> ManufacturingProcess33 53.91
#> ManufacturingProcess12 52.04
#> BiologicalMaterial08 49.76
#> BiologicalMaterial12 47.40
#> ManufacturingProcess34 45.47
#> BiologicalMaterial11 45.05
#> BiologicalMaterial01 44.18
#> BiologicalMaterial04 42.95
#> ManufacturingProcess04 39.94
#> ManufacturingProcess28 36.61
The SVM importance ranking is very similar to the PLS one, and in both cases the manufacturing process variables dominate the list. The main difference is that the SVM model ranks BiologicalMaterial12 noticeably higher and BiologicalMaterial02 lower than the PLS model does.
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
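The exploration plot itself didn't survive into this writeup. A sketch of one way to draw it with ggplot2, using the imputed (and therefore centered and scaled) data frame df from above:
library(ggplot2)
ggplot(df, aes(x = BiologicalMaterial02, y = Yield)) +
  geom_point() +
  geom_smooth(method = "loess") +   # smoother to expose the trend
  labs(x = "BiologicalMaterial02 (standardized)", y = "Yield (standardized)")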
The plot indicates a positive relationship between BiologicalMaterial02 and the yield, although there appears to be a sweet spot, a point of diminishing marginal returns on the material, since the highest yields are not the furthest to the right on the graph.