library(AppliedPredictiveModeling)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(mlbench)
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(ggplot2)
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: … where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 other non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
# Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
#KNN Model
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
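As a check on what postResample reports, the three statistics can be reproduced by hand. The sketch below uses small made-up vectors (obsToy and predToy are hypothetical, not taken from the simulation) and assumes caret's usual definitions: RMSE as the root mean squared difference, Rsquared as the squared Pearson correlation, and MAE as the mean absolute difference.

```r
library(caret)

# Hypothetical toy values, used only to illustrate the calculation
obsToy  <- c(3.0, 5.0, 7.0, 9.0)
predToy <- c(2.5, 5.5, 6.0, 10.0)

# Statistics as returned by caret
fromCaret <- postResample(pred = predToy, obs = obsToy)

# The same statistics computed directly
byHand <- c(RMSE     = sqrt(mean((obsToy - predToy)^2)),
            Rsquared = cor(obsToy, predToy)^2,
            MAE      = mean(abs(obsToy - predToy)))

print(fromCaret)
print(byHand)
```

Because Rsquared here is a squared correlation, it describes how well the predictions track the observations, while RMSE and MAE carry the information about error size on the outcome's scale.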
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
# MARS Model
set.seed(0505)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsFit <- train(x = trainingData$x,
y = trainingData$y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsFit
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.473707 0.2059112 3.735380
## 1 3 3.674517 0.4563730 2.961494
## 1 4 2.693987 0.7219221 2.161466
## 1 5 2.354295 0.7839794 1.883063
## 1 6 2.261965 0.7996703 1.796800
## 1 7 1.842051 0.8681038 1.415254
## 1 8 1.667184 0.8936275 1.303338
## 1 9 1.658109 0.8960439 1.301385
## 1 10 1.642522 0.8967562 1.301455
## 1 11 1.638873 0.8981547 1.273918
## 1 12 1.627332 0.9007205 1.269979
## 1 13 1.641691 0.8994554 1.281886
## 1 14 1.668962 0.8957737 1.308823
## 1 15 1.668986 0.8956228 1.308573
## 1 16 1.668986 0.8956228 1.308573
## 1 17 1.668986 0.8956228 1.308573
## 1 18 1.668986 0.8956228 1.308573
## 1 19 1.668986 0.8956228 1.308573
## 1 20 1.668986 0.8956228 1.308573
## 1 21 1.668986 0.8956228 1.308573
## 1 22 1.668986 0.8956228 1.308573
## 1 23 1.668986 0.8956228 1.308573
## 1 24 1.668986 0.8956228 1.308573
## 1 25 1.668986 0.8956228 1.308573
## 1 26 1.668986 0.8956228 1.308573
## 1 27 1.668986 0.8956228 1.308573
## 1 28 1.668986 0.8956228 1.308573
## 1 29 1.668986 0.8956228 1.308573
## 1 30 1.668986 0.8956228 1.308573
## 1 31 1.668986 0.8956228 1.308573
## 1 32 1.668986 0.8956228 1.308573
## 1 33 1.668986 0.8956228 1.308573
## 1 34 1.668986 0.8956228 1.308573
## 1 35 1.668986 0.8956228 1.308573
## 1 36 1.668986 0.8956228 1.308573
## 1 37 1.668986 0.8956228 1.308573
## 1 38 1.668986 0.8956228 1.308573
## 2 2 4.473707 0.2059112 3.735380
## 2 3 3.674517 0.4563730 2.961494
## 2 4 2.644768 0.7301597 2.124856
## 2 5 2.258397 0.8004069 1.792321
## 2 6 2.206230 0.8065141 1.724864
## 2 7 1.792779 0.8767625 1.417441
## 2 8 1.626351 0.8975182 1.255933
## 2 9 1.417336 0.9210696 1.124743
## 2 10 1.364662 0.9281539 1.112445
## 2 11 1.289844 0.9363794 1.034030
## 2 12 1.307324 0.9350859 1.034410
## 2 13 1.345946 0.9316704 1.081343
## 2 14 1.321258 0.9325492 1.064430
## 2 15 1.326130 0.9317542 1.063100
## 2 16 1.330557 0.9314304 1.058284
## 2 17 1.336075 0.9308362 1.056546
## 2 18 1.336075 0.9308362 1.056546
## 2 19 1.336075 0.9308362 1.056546
## 2 20 1.336075 0.9308362 1.056546
## 2 21 1.336075 0.9308362 1.056546
## 2 22 1.336075 0.9308362 1.056546
## 2 23 1.336075 0.9308362 1.056546
## 2 24 1.336075 0.9308362 1.056546
## 2 25 1.336075 0.9308362 1.056546
## 2 26 1.336075 0.9308362 1.056546
## 2 27 1.336075 0.9308362 1.056546
## 2 28 1.336075 0.9308362 1.056546
## 2 29 1.336075 0.9308362 1.056546
## 2 30 1.336075 0.9308362 1.056546
## 2 31 1.336075 0.9308362 1.056546
## 2 32 1.336075 0.9308362 1.056546
## 2 33 1.336075 0.9308362 1.056546
## 2 34 1.336075 0.9308362 1.056546
## 2 35 1.336075 0.9308362 1.056546
## 2 36 1.336075 0.9308362 1.056546
## 2 37 1.336075 0.9308362 1.056546
## 2 38 1.336075 0.9308362 1.056546
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 11 and degree = 2.
#The final values used for the model were nprune = 11 and degree = 2.
marsFit$bestTune
## nprune degree
## 47 11 2
marsPred <- predict(marsFit, newdata = testData$x)
marsResults <- postResample(pred = marsPred, obs = testData$y)
marsResults
## RMSE Rsquared MAE
## 1.2803060 0.9335241 1.0168673
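Whether MARS keeps only the informative predictors (X1–X5) can be checked by listing the variables that survive pruning. The sketch below is one way to do that, assuming a direct earth fit with the tuning values selected above (degree = 2, nprune = 11) is an adequate stand-in for the caret object; marsDirect and usedVars are names introduced here for illustration.

```r
library(mlbench)
library(earth)

# Re-create the training data so the example runs on its own
set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
x <- data.frame(sim$x)   # columns X1..X10
y <- sim$y

# Refit MARS directly with the tuning values chosen by caret above
marsDirect <- earth(x, y, degree = 2, nprune = 11)

# evimp() ranks the predictors that remain in the pruned model
usedVars <- rownames(evimp(marsDirect))
print(usedVars)

# Any non-informative predictor (X6..X10) that made it into the model
print(setdiff(usedVars, paste0("X", 1:5)))
```

If the second printed vector is empty, the pruned model uses only X1–X5. Calling varImp(marsFit) on the caret object should give a comparable ranking.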
#SVM Model
set.seed(0506)
svmFit <- train(x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv"))
svmFit
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.475858 0.8121084 1.981112
## 0.50 2.212206 0.8276713 1.753422
## 1.00 2.048716 0.8481672 1.609954
## 2.00 1.906683 0.8678043 1.490096
## 4.00 1.824943 0.8763489 1.414680
## 8.00 1.805692 0.8800417 1.400559
## 16.00 1.803490 0.8814764 1.407560
## 32.00 1.804262 0.8813492 1.408806
## 64.00 1.804262 0.8813492 1.408806
## 128.00 1.804262 0.8813492 1.408806
##
## Tuning parameter 'sigma' was held constant at a value of 0.05936526
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.05936526 and C = 16.
#The final values used for the model were sigma = 0.05936526 and C = 16.
svmFit$bestTune
## sigma C
## 7 0.05936526 16
svmPred <- predict(svmFit, newdata = testData$x)
svmResults <- postResample(pred = svmPred, obs = testData$y)
svmResults
## RMSE Rsquared MAE
## 2.063427 0.827336 1.567688
#comparing all three models
data.frame(
Model = c("KNN", "MARS", "SVM"),
rbind(
postResample(pred = knnPred, obs = testData$y),
postResample(pred = marsPred, obs = testData$y),
postResample(pred = svmPred, obs = testData$y)))
## Model RMSE Rsquared MAE
## 1 KNN 3.204059 0.6819919 2.568346
## 2 MARS 1.280306 0.9335241 1.016867
## 3 SVM 2.063427 0.8273360 1.567688
On the 5,000-sample test set, MARS has the best predictive accuracy of the three models, with the lowest RMSE (1.28), highest R-squared (0.93), and lowest MAE (1.02). SVM is second (RMSE 2.06) and KNN is last (RMSE 3.20).
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
data(ChemicalManufacturingProcess)
# Pre-processing from Exercise 6.3: impute missing values with k-nearest neighbors
set.seed(0506)
cmp <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
cmp_imputed <- predict(cmp, ChemicalManufacturingProcess)
any(is.na(cmp_imputed))
## [1] FALSE
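One detail of the imputation step is easy to miss: when "knnImpute" is requested, caret's preProcess also centers and scales the columns it is given. Since the whole data frame is passed here, that would include Yield, so the error values reported below are presumably in standardized units rather than the original yield units. The sketch below illustrates the behavior on a small made-up data frame (toy and its columns are hypothetical); some caret versions rely on the RANN package for the neighbor search.

```r
library(caret)

# Hypothetical toy data: two predictors and an outcome on very different scales
set.seed(1)
toy <- data.frame(a     = rnorm(50, mean = 10,  sd = 2),
                  b     = rnorm(50, mean = 100, sd = 20),
                  Yield = rnorm(50, mean = 40,  sd = 2))
toy$a[c(3, 17)] <- NA   # introduce a couple of missing values

# Request only kNN imputation
pp  <- preProcess(toy, method = "knnImpute")
out <- predict(pp, toy)

# Inspect which steps were actually applied, and the resulting scale of Yield
print(names(pp$method))
print(c(mean = mean(out$Yield), sd = sd(out$Yield)))
```

To keep Yield on its original scale, one option is to pass only the predictor columns to preProcess and predict, then re-attach the outcome afterwards.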
# Remove near-zero-variance predictors
cmp_filtered <- cmp_imputed[,-nearZeroVar(cmp_imputed)]
set.seed(0507)
cmp_train_index <- createDataPartition(cmp_filtered$Yield, p= 0.8, list = FALSE)
cmp_train <- cmp_filtered[cmp_train_index,]
cmp_test <- cmp_filtered[-cmp_train_index,]
#MARS model
cmp_mars <- train(cmp_train[, !names(cmp_train) %in% "Yield"],
cmp_train$Yield,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
cmp_mars
## Multivariate Adaptive Regression Spline
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 129, 128, 130, 129, 130, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.7752911 0.4275023 0.6120243
## 1 3 0.6032026 0.6463356 0.4839294
## 1 4 0.6265973 0.6185799 0.4896057
## 1 5 0.6118502 0.6378310 0.4824863
## 1 6 0.6393335 0.6083814 0.5094254
## 1 7 0.6373448 0.6209967 0.5059538
## 1 8 0.6052538 0.6584140 0.4823153
## 1 9 0.6097183 0.6517113 0.4872536
## 1 10 0.6330484 0.6296726 0.5018900
## 1 11 0.6418055 0.6240124 0.5115936
## 1 12 0.6387257 0.6177895 0.5145750
## 1 13 0.6247542 0.6358544 0.5033395
## 1 14 0.6296166 0.6242255 0.4995423
## 1 15 0.6355888 0.6210653 0.5009699
## 1 16 0.6369560 0.6193788 0.5015336
## 1 17 0.6434263 0.6202956 0.5119013
## 1 18 0.6257724 0.6377410 0.5051716
## 1 19 0.6255045 0.6374628 0.5042739
## 1 20 0.6271235 0.6372024 0.5051993
## 1 21 0.6271235 0.6372024 0.5051993
## 1 22 0.6271235 0.6372024 0.5051993
## 1 23 0.6271235 0.6372024 0.5051993
## 1 24 0.6271235 0.6372024 0.5051993
## 1 25 0.6271235 0.6372024 0.5051993
## 1 26 0.6271235 0.6372024 0.5051993
## 1 27 0.6271235 0.6372024 0.5051993
## 1 28 0.6271235 0.6372024 0.5051993
## 1 29 0.6271235 0.6372024 0.5051993
## 1 30 0.6271235 0.6372024 0.5051993
## 1 31 0.6271235 0.6372024 0.5051993
## 1 32 0.6271235 0.6372024 0.5051993
## 1 33 0.6271235 0.6372024 0.5051993
## 1 34 0.6271235 0.6372024 0.5051993
## 1 35 0.6271235 0.6372024 0.5051993
## 1 36 0.6271235 0.6372024 0.5051993
## 1 37 0.6271235 0.6372024 0.5051993
## 1 38 0.6271235 0.6372024 0.5051993
## 2 2 0.7752911 0.4275023 0.6120243
## 2 3 0.6326674 0.5975588 0.4993006
## 2 4 0.6257613 0.6279919 0.4927481
## 2 5 0.6805299 0.5581148 0.5380409
## 2 6 0.6796221 0.5643818 0.5440488
## 2 7 0.7037744 0.5509892 0.5476012
## 2 8 0.7064711 0.5546050 0.5399432
## 2 9 0.6927406 0.5684102 0.5352198
## 2 10 0.6937045 0.5922264 0.5114920
## 2 11 0.7224928 0.5715142 0.5307765
## 2 12 0.7665474 0.5649394 0.5503282
## 2 13 0.7237082 0.6089367 0.5307078
## 2 14 0.7719478 0.5759897 0.5492151
## 2 15 0.7414732 0.6028284 0.5395357
## 2 16 0.9468814 0.5641198 0.6214396
## 2 17 0.7852803 0.6028986 0.5560268
## 2 18 0.7681858 0.6158080 0.5442563
## 2 19 0.9819668 0.5849192 0.6281134
## 2 20 0.9596390 0.5861546 0.6179385
## 2 21 0.9526725 0.5848057 0.6195384
## 2 22 0.9790833 0.5798822 0.6333510
## 2 23 0.9773756 0.5825638 0.6344239
## 2 24 0.9749866 0.5839369 0.6381853
## 2 25 0.9540998 0.5862773 0.6319302
## 2 26 0.9623307 0.5832487 0.6395288
## 2 27 0.9687205 0.5803439 0.6454061
## 2 28 0.9687205 0.5803439 0.6454061
## 2 29 0.9687205 0.5803439 0.6454061
## 2 30 0.9710568 0.5803154 0.6441091
## 2 31 0.9710568 0.5803154 0.6441091
## 2 32 0.9710568 0.5803154 0.6441091
## 2 33 0.9710568 0.5803154 0.6441091
## 2 34 0.9710568 0.5803154 0.6441091
## 2 35 0.9710568 0.5803154 0.6441091
## 2 36 0.9710568 0.5803154 0.6441091
## 2 37 0.9710568 0.5803154 0.6441091
## 2 38 0.9710568 0.5803154 0.6441091
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 1.
#The final values used for the model were nprune = 3 and degree = 1.
cmp_marsPred <- predict(cmp_mars, newdata = cmp_test)
cmp_marsResults <- postResample(pred = cmp_marsPred, obs = cmp_test$Yield)
cmp_marsResults
## RMSE Rsquared MAE
## 0.7191385 0.4663119 0.5946564
# KNN Model
set.seed(0507)
cmp_knnFit <- train(Yield ~ .,
data = cmp_train,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
cmp_knnFit
## k-Nearest Neighbors
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.7984837 0.4037482 0.6275656
## 7 0.7858486 0.4259869 0.6253203
## 9 0.7813229 0.4430184 0.6210572
## 11 0.7894727 0.4322128 0.6315689
## 13 0.7985669 0.4192702 0.6396295
## 15 0.7990100 0.4285618 0.6404208
## 17 0.8022355 0.4291068 0.6446410
## 19 0.8054929 0.4302893 0.6463274
## 21 0.8075964 0.4349964 0.6463214
## 23 0.8140798 0.4314819 0.6526266
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
#The final value used for the model was k = 9.
cmp_knnPred <- predict(cmp_knnFit, newdata = cmp_test)
cmp_knnResults <- postResample(pred = cmp_knnPred, obs = cmp_test$Yield)
cmp_knnResults
## RMSE Rsquared MAE
## 0.7102753 0.5300573 0.5627296
#SVM Model
set.seed(0507)
cmp_svmFit <- train(Yield ~ .,
data = cmp_train,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv"))
cmp_svmFit
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 129, 131, 130, 129, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7591297 0.5331625 0.6239359
## 0.50 0.7032320 0.5790092 0.5749790
## 1.00 0.6510864 0.6223945 0.5253027
## 2.00 0.6178926 0.6448386 0.4907204
## 4.00 0.6140758 0.6398688 0.4757480
## 8.00 0.6124280 0.6411986 0.4822256
## 16.00 0.6070621 0.6496302 0.4798367
## 32.00 0.6070621 0.6496302 0.4798367
## 64.00 0.6070621 0.6496302 0.4798367
## 128.00 0.6070621 0.6496302 0.4798367
##
## Tuning parameter 'sigma' was held constant at a value of 0.01360243
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01360243 and C = 16.
#The final values used for the model were sigma = 0.01360243 and C = 16.
cmp_svmPred <- predict(cmp_svmFit, newdata = cmp_test)
cmp_svmResults <- postResample(pred = cmp_svmPred, obs = cmp_test$Yield)
cmp_svmResults
## RMSE Rsquared MAE
## 0.6202813 0.6358902 0.5351300
#comparing all three models
cmp_models <- data.frame(
Model = c("KNN", "MARS", "SVM"),
rbind(
postResample(pred = cmp_knnPred, obs = cmp_test$Yield),
postResample(pred = cmp_marsPred, obs = cmp_test$Yield),
postResample(pred = cmp_svmPred, obs = cmp_test$Yield)))
cmp_models
## Model RMSE Rsquared MAE
## 1 KNN 0.7102753 0.5300573 0.5627296
## 2 MARS 0.7191385 0.4663119 0.5946564
## 3 SVM 0.6202813 0.6358902 0.5351300
Among the three nonlinear regression models, the Support Vector Machine (SVM) performed best on the held-out test set, with the lowest RMSE (0.62), highest R-squared (0.64), and lowest MAE (0.54). In resampling, MARS (RMSE 0.60) and SVM (RMSE 0.61) were nearly tied, but MARS did not hold up as well on the test set (RMSE 0.72).
cmp_predictors <- varImp(cmp_svmFit, scale = FALSE)
plot(cmp_predictors, top = 10)
The most important predictors were mainly process variables. The
top predictors included ManufacturingProcess13, ManufacturingProcess32,
ManufacturingProcess09, and ManufacturingProcess17, indicating a strong
influence of manufacturing process factors on yield. A few biological
variables also appeared in the top ten, including BiologicalMaterial06,
BiologicalMaterial03, and BiologicalMaterial12, but overall the list was
dominated by process-related features. These were also present in the
optimal linear model.
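A caveat on how to read this ranking: caret does not appear to define a model-specific importance measure for svmRadial, so, as the caret documentation describes it, varImp falls back to a model-free filter. For a numeric outcome, each predictor is scored on its own by the R-squared of a loess fit against the outcome. The ranking therefore reflects univariate relationships with yield rather than the fitted SVM itself. The sketch below calls filterVarImp directly on the Friedman simulation from the earlier exercise (re-simulated so it runs on its own) to show the kind of score being plotted; it is not meant to reproduce the chart above.

```r
library(caret)
library(mlbench)

# Re-create the simulated training data
set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
x <- data.frame(sim$x)
y <- sim$y

# Score each predictor separately with a loess-based R-squared (nonpara = TRUE)
filterImp <- filterVarImp(x = x, y = y, nonpara = TRUE)

# Sort from most to least important
filterImp <- filterImp[order(filterImp$Overall, decreasing = TRUE), , drop = FALSE]
print(filterImp)
```

Because each predictor is scored in isolation, a variable that matters only through an interaction can rank low under this filter even if the model relies on it.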
Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
unique_predictors <- c("ManufacturingProcess13",
"ManufacturingProcess32",
"ManufacturingProcess09")
# Loop over predictors and draw scatterplots with a LOESS smooth line.
# aes() with the .data pronoun replaces the deprecated aes_string().
for (pred in unique_predictors) {
  p <- ggplot(cmp_train, aes(x = .data[[pred]], y = Yield)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "loess", se = FALSE, color = "blue", linewidth = 1) +
    labs(title = paste("Yield vs.", pred),
         x = pred,
         y = "Yield") +
    theme_minimal()
  print(p)
}
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
All three top predictors show a clearly nonlinear relationship with
yield. ManufacturingProcess13 had a strong inverse nonlinear
relationship. ManufacturingProcess32 peaked at moderate values and
dropped at the extremes. ManufacturingProcess09 showed a positive
relationship with a noticeable increase after a certain point. These
trends suggest nonlinear process effects that the linear model
missed.