Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\(y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\)
where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 other non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(mlbench)
## caret provides featurePlot(), train(), and postResample()
library(caret)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## mlbench.friedman1() returns a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
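For reference, the simulation equation above can be written out directly in base R. This is only a sketch: `simulate_friedman1` is a hypothetical helper, not part of mlbench, and it will not reproduce the same random draws as `mlbench.friedman1`. The column names X1 to X10 are assigned here for readability.

```r
# Hypothetical helper (not from mlbench): simulate data from the equation above.
# Five informative uniform predictors plus `n_noise` non-informative ones.
simulate_friedman1 <- function(n, sd = 1, n_noise = 5) {
  p <- 5 + n_noise
  x <- matrix(runif(n * p), nrow = n, ncol = p)
  colnames(x) <- paste0("X", seq_len(p))
  # Only the first five columns enter the response
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  list(x = x, y = signal + rnorm(n, sd = sd))
}

set.seed(1)
sim <- simulate_friedman1(200, sd = 1)
dim(sim$x)     # 200 rows, 10 columns
length(sim$y)  # 200
```

Columns X6 to X10 never appear in `signal`, which is what makes them non-informative.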
Tune several models on these data. For example:
KNN Model
library(caret)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
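The three numbers reported above can be computed by hand. The sketch below assumes that R-squared is the squared correlation between predictions and observations; `regression_metrics` is a hypothetical helper written for illustration, not a caret function.

```r
# Hypothetical helper: the three regression summaries used throughout.
# Assumption: R-squared is taken as the squared Pearson correlation.
regression_metrics <- function(pred, obs) {
  c(RMSE = sqrt(mean((pred - obs)^2)),
    Rsquared = cor(pred, obs)^2,
    MAE = mean(abs(pred - obs)))
}

obs <- c(1, 2, 3, 4)
pred <- c(1.5, 1.5, 3.5, 3.5)
metrics <- regression_metrics(pred, obs)
metrics  # RMSE 0.5, Rsquared 0.8, MAE 0.5
```

Because every error in this toy example has magnitude 0.5, RMSE and MAE coincide; they differ once the errors vary in size, since RMSE weights large errors more heavily.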
NN Model
nnetGrid <- expand.grid(.decay = c(0,0.01,.1),
.size = c(1:5),
.bag = FALSE)
nnetFit <- train(trainingData$x, trainingData$y,
method = 'avNNet',
tuneGrid = nnetGrid,
preProc = c('center','scale'),
linout = TRUE,
trace = FALSE,
MaxNWts = 5 * (ncol(trainingData$x) + 1) + 5 + 1,
maxit = 100
)
nnetFit
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.565214 0.7424510 2.011320
## 0.00 2 2.577406 0.7382329 2.028408
## 0.00 3 2.337417 0.7825332 1.839583
## 0.00 4 2.729590 0.7185699 2.068292
## 0.00 5 2.629962 0.7367168 2.040611
## 0.01 1 2.556163 0.7421467 1.992682
## 0.01 2 2.562614 0.7432748 1.996182
## 0.01 3 2.351816 0.7804277 1.860291
## 0.01 4 2.439330 0.7691751 1.927696
## 0.01 5 2.572200 0.7424606 2.028893
## 0.10 1 2.518290 0.7488257 1.952082
## 0.10 2 2.543715 0.7440260 2.007897
## 0.10 3 2.283904 0.7929747 1.807032
## 0.10 4 2.371849 0.7798611 1.883289
## 0.10 5 2.474275 0.7627753 1.959351
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3, decay = 0.1 and bag = FALSE.
nnetPred <- predict(nnetFit, newdata = testData$x)
postResample(pred = nnetPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.2087518 0.8066003 1.6628115
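The MaxNWts argument has to be at least as large as the number of weights in the biggest network in the tuning grid. A minimal sketch of that count, assuming a single hidden layer, one linear output, bias terms on both layers, and no skip-layer connections (`n_weights` is a hypothetical helper):

```r
# Hypothetical helper: weights in a single-hidden-layer network with
# p inputs, h hidden units, and one output, including bias terms.
n_weights <- function(p, h) {
  h * (p + 1) +  # input-to-hidden weights plus one bias per hidden unit
    h + 1        # hidden-to-output weights plus the output bias
}

n_weights(p = 10, h = 5)   # 61
n_weights(p = 35, h = 10)  # 371
```

Any MaxNWts value at or above this count lets the largest model in the grid be fitted; a larger value only loosens the limit and does not change the fitted weights.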
MARS Model
# create a tuning grid
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(100)
# tune
marsTune <- train(trainingData$x, trainingData$y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsTune
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.327937 0.2544880 3.600474
## 1 3 3.572450 0.4912720 2.895811
## 1 4 2.596841 0.7183600 2.106341
## 1 5 2.370161 0.7659777 1.918669
## 1 6 2.276141 0.7881481 1.810001
## 1 7 1.766728 0.8751831 1.390215
## 1 8 1.780946 0.8723243 1.401345
## 1 9 1.665091 0.8819775 1.325515
## 1 10 1.663804 0.8821283 1.327657
## 1 11 1.657738 0.8822967 1.331730
## 1 12 1.653784 0.8827903 1.331504
## 1 13 1.648496 0.8823663 1.316407
## 1 14 1.639073 0.8841742 1.312833
## 1 15 1.639073 0.8841742 1.312833
## 1 16 1.639073 0.8841742 1.312833
## 1 17 1.639073 0.8841742 1.312833
## 1 18 1.639073 0.8841742 1.312833
## 1 19 1.639073 0.8841742 1.312833
## 1 20 1.639073 0.8841742 1.312833
## 1 21 1.639073 0.8841742 1.312833
## 1 22 1.639073 0.8841742 1.312833
## 1 23 1.639073 0.8841742 1.312833
## 1 24 1.639073 0.8841742 1.312833
## 1 25 1.639073 0.8841742 1.312833
## 1 26 1.639073 0.8841742 1.312833
## 1 27 1.639073 0.8841742 1.312833
## 1 28 1.639073 0.8841742 1.312833
## 1 29 1.639073 0.8841742 1.312833
## 1 30 1.639073 0.8841742 1.312833
## 1 31 1.639073 0.8841742 1.312833
## 1 32 1.639073 0.8841742 1.312833
## 1 33 1.639073 0.8841742 1.312833
## 1 34 1.639073 0.8841742 1.312833
## 1 35 1.639073 0.8841742 1.312833
## 1 36 1.639073 0.8841742 1.312833
## 1 37 1.639073 0.8841742 1.312833
## 1 38 1.639073 0.8841742 1.312833
## 2 2 4.327937 0.2544880 3.600474
## 2 3 3.572450 0.4912720 2.895811
## 2 4 2.661826 0.7070510 2.173471
## 2 5 2.404015 0.7578971 1.975387
## 2 6 2.243927 0.7914805 1.783072
## 2 7 1.856336 0.8605482 1.435682
## 2 8 1.754607 0.8763186 1.396841
## 2 9 1.603578 0.8938666 1.261361
## 2 10 1.492421 0.9084998 1.168700
## 2 11 1.317350 0.9292504 1.033926
## 2 12 1.304327 0.9320133 1.019108
## 2 13 1.277510 0.9323681 1.002927
## 2 14 1.269626 0.9350024 1.003346
## 2 15 1.266217 0.9359400 1.013893
## 2 16 1.268470 0.9354868 1.011414
## 2 17 1.268470 0.9354868 1.011414
## 2 18 1.268470 0.9354868 1.011414
## 2 19 1.268470 0.9354868 1.011414
## 2 20 1.268470 0.9354868 1.011414
## 2 21 1.268470 0.9354868 1.011414
## 2 22 1.268470 0.9354868 1.011414
## 2 23 1.268470 0.9354868 1.011414
## 2 24 1.268470 0.9354868 1.011414
## 2 25 1.268470 0.9354868 1.011414
## 2 26 1.268470 0.9354868 1.011414
## 2 27 1.268470 0.9354868 1.011414
## 2 28 1.268470 0.9354868 1.011414
## 2 29 1.268470 0.9354868 1.011414
## 2 30 1.268470 0.9354868 1.011414
## 2 31 1.268470 0.9354868 1.011414
## 2 32 1.268470 0.9354868 1.011414
## 2 33 1.268470 0.9354868 1.011414
## 2 34 1.268470 0.9354868 1.011414
## 2 35 1.268470 0.9354868 1.011414
## 2 36 1.268470 0.9354868 1.011414
## 2 37 1.268470 0.9354868 1.011414
## 2 38 1.268470 0.9354868 1.011414
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
marsPred <- predict(marsTune, testData$x)
postResample(marsPred, testData$y)
## RMSE Rsquared MAE
## 1.1589948 0.9460418 0.9250230
varImp(marsTune)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.24
## X2 48.73
## X5 15.52
## X3 0.00
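The degree and nprune parameters tuned above are easier to interpret with the building block of MARS in view: the hinge function. The sketch below is an illustration in base R, not the earth implementation; `hinge` and the toy data are assumptions made for the example.

```r
# Hypothetical helper: a hinge function, max(0, x - knot) or max(0, knot - x)
hinge <- function(x, knot, direction = 1) {
  pmax(0, direction * (x - knot))
}

x <- c(0, 0.25, 0.5, 0.75, 1)
right <- hinge(x, knot = 0.5)                  # max(0, x - 0.5)
left  <- hinge(x, knot = 0.5, direction = -1)  # max(0, 0.5 - x)

# A piecewise-linear response is recovered exactly by a linear model
# on the two hinge features (an additive, degree = 1 model)
y <- 1 + 2 * right + 3 * left
fit <- lm(y ~ right + left)
round(coef(fit), 6)  # intercept 1, right 2, left 3

# A degree = 2 term is the product of hinges on two different predictors
x_other <- c(1, 0.8, 0.6, 0.4, 0.2)
interaction_term <- hinge(x, 0.5) * hinge(x_other, 0.3)
```

In these terms, nprune caps how many such terms remain after pruning, and degree = 2 permits product terms like `interaction_term`, which suits the \(\sin(\pi x_1 x_2)\) interaction in the simulated data.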
SVM Model
set.seed(100)
# tune
svmRTune <- train(trainingData$x, trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTune
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.530787 0.7922715 2.013175
## 0.50 2.259539 0.8064569 1.789962
## 1.00 2.099789 0.8274242 1.656154
## 2.00 2.002943 0.8412934 1.583791
## 4.00 1.943618 0.8504425 1.546586
## 8.00 1.918711 0.8547582 1.532981
## 16.00 1.920651 0.8536189 1.536116
## 32.00 1.920651 0.8536189 1.536116
## 64.00 1.920651 0.8536189 1.536116
## 128.00 1.920651 0.8536189 1.536116
## 256.00 1.920651 0.8536189 1.536116
## 512.00 1.920651 0.8536189 1.536116
## 1024.00 1.920651 0.8536189 1.536116
## 2048.00 1.920651 0.8536189 1.536116
##
## Tuning parameter 'sigma' was held constant at a value of 0.06509124
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06509124 and C = 8.
svmRPred <- predict(svmRTune, testData$x)
postResample(svmRPred, testData$y)
## RMSE Rsquared MAE
## 2.0631908 0.8275736 1.5662213
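The C column in the SVM output above doubles at every step, from 0.25 up to 2048. A short sketch that reproduces those 14 values as powers of two, together with a radial basis function kernel in which sigma multiplies the squared distance. The kernel parameterization is an assumption about how the reported sigma is used, and `rbf_kernel` is a hypothetical helper.

```r
# Powers-of-two cost grid matching the 14 values of C printed above
cost_grid <- 2^(seq_len(14) - 3)
range(cost_grid)  # 0.25 2048

# Hypothetical helper: RBF kernel, assumed form exp(-sigma * ||x - y||^2)
rbf_kernel <- function(x, y, sigma) {
  exp(-sigma * sum((x - y)^2))
}

rbf_kernel(c(0, 0), c(1, 1), sigma = 0.5)  # exp(-1), about 0.368
rbf_kernel(c(1, 2), c(1, 2), sigma = 0.5)  # identical points give 1
```

The resampled results stop changing from C = 16 onward in the output above, so the upper part of the grid adds nothing for these data.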
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
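The MARS model gives the best performance, with the lowest test set RMSE (1.159) and MAE (0.925) and the highest R-squared (0.946). The SVM (RMSE 2.063) and the averaged neural network (RMSE 2.209) follow, and KNN performs worst (RMSE 3.204). The variable importance table lists only X1 through X5, so the MARS model selects the informative predictors and none of the non-informative ones.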
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
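# ChemicalManufacturingProcess ships with the AppliedPredictiveModeling package
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)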
# imputation
miss <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(miss, ChemicalManufacturingProcess)
# remove near-zero-variance predictors
Chemical <- Chemical[, -nearZeroVar(Chemical)]
set.seed(624)
# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)
# train
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]
# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]
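One caveat about the negative-index filter used above: it relies on at least one column being flagged, which appears to be the case here. If the index vector were empty, a negative subscript would drop every column. A base R sketch of the pitfall and a guard, where `drop_columns` is a hypothetical helper and `flagged` stands in for the filter result:

```r
df <- data.frame(a = 1:3, b = c(2.5, 1.0, 4.2), c = c(9, 8, 7))

# Stand-in for a filter that flags nothing
flagged <- integer(0)
ncol(df[, -flagged])  # 0: -integer(0) is still integer(0), so no column is kept

# Hypothetical helper: only subset when there is something to drop
drop_columns <- function(data, idx) {
  if (length(idx) > 0) data[, -idx, drop = FALSE] else data
}

ncol(drop_columns(df, flagged))  # 3
ncol(drop_columns(df, 2L))       # 2
```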
KNN Model
knnModel <- train(train_x, train_y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.471125 0.3330992 1.161484
## 7 1.447346 0.3519621 1.150975
## 9 1.439505 0.3614781 1.153856
## 11 1.440067 0.3597565 1.157491
## 13 1.446347 0.3556436 1.165135
## 15 1.437409 0.3693582 1.165991
## 17 1.448196 0.3618152 1.176400
## 19 1.452990 0.3601724 1.182114
## 21 1.456702 0.3606783 1.183356
## 23 1.457503 0.3658775 1.185981
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
knnPred <- predict(knnModel, test_x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = test_y)
## RMSE Rsquared MAE
## 1.5262067 0.6187302 1.1800625
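To make concrete what the KNN model does at prediction time, here is a minimal base R sketch: center and scale with training statistics, find the k closest training rows by Euclidean distance, and average their outcomes. `knn_predict` is a hypothetical helper, not the caret implementation, and it assumes no predictor is constant.

```r
# Hypothetical helper: k-nearest-neighbors regression with centering and scaling
knn_predict <- function(train_x, train_y, new_x, k) {
  train_x <- as.matrix(train_x)
  new_x <- as.matrix(new_x)
  # Standardize both sets with the training means and standard deviations
  center <- colMeans(train_x)
  spread <- apply(train_x, 2, sd)
  train_std <- scale(train_x, center = center, scale = spread)
  new_std <- scale(new_x, center = center, scale = spread)
  vapply(seq_len(nrow(new_std)), function(i) {
    # Euclidean distance from new row i to every training row
    dists <- sqrt(colSums((t(train_std) - new_std[i, ])^2))
    # Average the outcome over the k nearest training rows
    mean(train_y[order(dists)[seq_len(k)]])
  }, numeric(1))
}

toy_x <- data.frame(x1 = c(0, 1, 2, 10), x2 = c(0, 1, 2, 10))
toy_y <- c(1, 2, 3, 10)
knn_predict(toy_x, toy_y, data.frame(x1 = 1, x2 = 1), k = 3)  # 2
```

Scaling matters because the distance would otherwise be dominated by predictors measured on larger scales, which is why the models above are trained with centering and scaling.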
NN Model
# remove predictors to ensure maximum abs pairwise corr between predictors < 0.75
tooHigh <- findCorrelation(cor(train_x), cutoff = .75)
# removing 21 variables
train_x_nnet <- train_x[, -tooHigh]
test_x_nnet <- test_x[, -tooHigh]
# create a tuning grid
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10))
# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)
set.seed(100)
# tune
nnetTune <- train(train_x_nnet, train_y,
method = "nnet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(train_x_nnet) + 1) + 10 + 1,
maxit = 500)
nnetTune
## Neural Network
##
## 144 samples
## 35 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 1.653183 0.2181442 1.345706
## 0.00 2 2.534424 0.2369899 1.836159
## 0.00 3 3.171155 0.2926531 2.287708
## 0.00 4 3.711481 0.1209223 2.917563
## 0.00 5 3.431171 0.1500184 2.741639
## 0.00 6 4.519146 0.1324394 3.247006
## 0.00 7 4.572852 0.1511347 3.586819
## 0.00 8 4.897815 0.1777553 3.195740
## 0.00 9 6.323278 0.1664817 4.360992
## 0.00 10 8.667370 0.1152399 5.988899
## 0.01 1 1.667606 0.3154025 1.388167
## 0.01 2 2.265838 0.1993149 1.714076
## 0.01 3 2.332248 0.2440199 1.842895
## 0.01 4 3.002585 0.1568784 2.241891
## 0.01 5 2.559003 0.2072487 1.958315
## 0.01 6 2.615888 0.2014615 2.003966
## 0.01 7 2.704167 0.1873030 2.115345
## 0.01 8 2.884852 0.1905781 2.225486
## 0.01 9 2.664823 0.2242711 2.139448
## 0.01 10 3.327161 0.2317687 2.496622
## 0.10 1 1.618516 0.3543468 1.325739
## 0.10 2 1.852789 0.3901490 1.390430
## 0.10 3 2.907373 0.1839412 2.024455
## 0.10 4 2.664748 0.1941929 1.965063
## 0.10 5 2.946770 0.2249056 2.098324
## 0.10 6 2.533430 0.2964670 1.884063
## 0.10 7 2.175251 0.3069320 1.702581
## 0.10 8 2.696990 0.1964820 1.966580
## 0.10 9 2.282723 0.2395294 1.862798
## 0.10 10 2.780285 0.1640235 2.034794
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.1.
nnPred <- predict(nnetTune, test_x_nnet)
postResample(nnPred, test_y)
## RMSE Rsquared MAE
## 1.5064579 0.5140357 1.1159762
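The correlation filter applied before fitting the neural network can be pictured with a simplified greedy procedure. The sketch below is a stand-in written for illustration, not the algorithm caret uses: `drop_correlated` is a hypothetical helper, and the choice of which member of a pair to drop (the one with the larger mean absolute correlation) is an assumption.

```r
# Hypothetical helper: repeatedly take the most correlated pair and drop
# one member until no absolute pairwise correlation exceeds the cutoff.
drop_correlated <- function(x, cutoff = 0.75) {
  cm <- abs(cor(x))
  diag(cm) <- 0
  dropped <- character(0)
  while (max(cm) > cutoff) {
    # Locate the most strongly correlated pair
    idx <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    pair <- colnames(cm)[idx]
    # Drop the member with the larger mean absolute correlation overall
    worst <- pair[which.max(rowMeans(cm[pair, , drop = FALSE]))]
    dropped <- c(dropped, worst)
    keep <- setdiff(colnames(cm), worst)
    cm <- cm[keep, keep, drop = FALSE]
  }
  dropped
}

a <- 1:10
noise <- rep(c(1, -1), 5)
toy <- cbind(a = a, b = a + noise, c = noise)
removed <- drop_correlated(toy, cutoff = 0.75)
remaining <- setdiff(colnames(toy), removed)
```

In the toy data only `a` and `b` are strongly correlated, so exactly one of them is removed and `c` is kept.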
MARS Model
# create a tuning grid
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(100)
# tune
marsTune <- train(train_x, train_y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsTune
## Multivariate Adaptive Regression Spline
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 1.382295 0.4386629 1.1032611
## 1 3 1.240867 0.5448952 0.9985512
## 1 4 1.259935 0.5341424 1.0107010
## 1 5 1.245790 0.5272274 1.0113559
## 1 6 1.269935 0.5136793 1.0204522
## 1 7 1.310209 0.5055710 1.0295204
## 1 8 1.288293 0.5221112 1.0036609
## 1 9 1.293021 0.5193283 1.0156268
## 1 10 1.286486 0.5258144 1.0107051
## 1 11 1.350612 0.5108572 1.0494019
## 1 12 1.354690 0.5164837 1.0502417
## 1 13 1.371710 0.5124198 1.0535178
## 1 14 1.386234 0.5064731 1.0729218
## 1 15 1.377159 0.5169364 1.0708723
## 1 16 1.377159 0.5169364 1.0708723
## 1 17 1.377159 0.5169364 1.0708723
## 1 18 1.377159 0.5169364 1.0708723
## 1 19 1.377159 0.5169364 1.0708723
## 1 20 1.377159 0.5169364 1.0708723
## 1 21 1.377159 0.5169364 1.0708723
## 1 22 1.377159 0.5169364 1.0708723
## 1 23 1.377159 0.5169364 1.0708723
## 1 24 1.377159 0.5169364 1.0708723
## 1 25 1.377159 0.5169364 1.0708723
## 1 26 1.377159 0.5169364 1.0708723
## 1 27 1.377159 0.5169364 1.0708723
## 1 28 1.377159 0.5169364 1.0708723
## 1 29 1.377159 0.5169364 1.0708723
## 1 30 1.377159 0.5169364 1.0708723
## 1 31 1.377159 0.5169364 1.0708723
## 1 32 1.377159 0.5169364 1.0708723
## 1 33 1.377159 0.5169364 1.0708723
## 1 34 1.377159 0.5169364 1.0708723
## 1 35 1.377159 0.5169364 1.0708723
## 1 36 1.377159 0.5169364 1.0708723
## 1 37 1.377159 0.5169364 1.0708723
## 1 38 1.377159 0.5169364 1.0708723
## 2 2 1.382295 0.4386629 1.1032611
## 2 3 1.237952 0.5375297 1.0083290
## 2 4 1.253568 0.5221886 1.0335088
## 2 5 1.204199 0.5507043 0.9713244
## 2 6 1.241877 0.5180123 1.0022903
## 2 7 1.228535 0.5360710 0.9772064
## 2 8 1.236188 0.5297973 0.9891217
## 2 9 1.224202 0.5377333 0.9943605
## 2 10 1.196350 0.5532418 0.9855648
## 2 11 1.217007 0.5502910 1.0105749
## 2 12 1.236600 0.5473328 1.0021900
## 2 13 1.227170 0.5587354 0.9909744
## 2 14 1.263470 0.5599646 1.0158323
## 2 15 1.230580 0.5620079 1.0103784
## 2 16 1.241609 0.5506318 0.9964320
## 2 17 1.233933 0.5689345 0.9858733
## 2 18 1.241566 0.5806316 1.0029570
## 2 19 1.236775 0.5859195 0.9987440
## 2 20 1.317821 0.5266260 1.0648319
## 2 21 1.388138 0.5126592 1.1035179
## 2 22 1.402762 0.5068048 1.1134955
## 2 23 1.396884 0.5054997 1.1196368
## 2 24 1.380184 0.5113281 1.1059875
## 2 25 1.380184 0.5113281 1.1059875
## 2 26 1.386388 0.5070473 1.1174699
## 2 27 1.380683 0.5101973 1.1123044
## 2 28 1.361918 0.5211907 1.0932094
## 2 29 1.366147 0.5191619 1.0957169
## 2 30 1.366147 0.5191619 1.0957169
## 2 31 1.366147 0.5191619 1.0957169
## 2 32 1.360840 0.5200205 1.0921339
## 2 33 1.360840 0.5200205 1.0921339
## 2 34 1.360840 0.5200205 1.0921339
## 2 35 1.360840 0.5200205 1.0921339
## 2 36 1.360840 0.5200205 1.0921339
## 2 37 1.360840 0.5200205 1.0921339
## 2 38 1.360840 0.5200205 1.0921339
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 10 and degree = 2.
marsPred <- predict(marsTune, test_x)
postResample(marsPred, test_y)
## RMSE Rsquared MAE
## 1.3464789 0.6138875 0.9826902
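The "Summary of sample sizes: 129, 130, 130, ..." line in the output above follows from splitting 144 training rows into 10 folds. A base R sketch of that arithmetic, using plain random fold assignment; this is an assumption made for illustration, and the actual fold construction may additionally balance folds on the outcome.

```r
n <- 144  # training rows
k <- 10   # folds

set.seed(100)
# Assign each row to one of k folds as evenly as possible, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each resample trains on everything outside the held-out fold
train_sizes <- n - tabulate(fold_id, nbins = k)
sort(train_sizes)  # four folds train on 129 rows, six on 130
```

Since 144 is not a multiple of 10, four folds hold out 15 rows and six hold out 14, giving the mix of 129 and 130 training rows.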
SVM Model
set.seed(100)
# tune
svmRTune <- train(train_x, train_y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTune
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.413177 0.4630126 1.1760898
## 0.50 1.314625 0.5018046 1.0947625
## 1.00 1.217731 0.5647210 1.0095889
## 2.00 1.164634 0.5994161 0.9630243
## 4.00 1.124391 0.6199423 0.9192936
## 8.00 1.119796 0.6170091 0.9287431
## 16.00 1.118734 0.6174115 0.9308110
## 32.00 1.118734 0.6174115 0.9308110
## 64.00 1.118734 0.6174115 0.9308110
## 128.00 1.118734 0.6174115 0.9308110
## 256.00 1.118734 0.6174115 0.9308110
## 512.00 1.118734 0.6174115 0.9308110
## 1024.00 1.118734 0.6174115 0.9308110
## 2048.00 1.118734 0.6174115 0.9308110
##
## Tuning parameter 'sigma' was held constant at a value of 0.0139359
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0139359 and C = 16.
svmRPred <- predict(svmRTune, test_x)
postResample(svmRPred, test_y)
## RMSE Rsquared MAE
## 1.1412463 0.7513994 0.8006586
rbind(knn = postResample(knnPred, test_y),
nn = postResample(nnPred, test_y),
mars = postResample(marsPred, test_y),
svmR = postResample(svmRPred, test_y))
## RMSE Rsquared MAE
## knn 1.526207 0.6187302 1.1800625
## nn 1.506458 0.5140357 1.1159762
## mars 1.346479 0.6138875 0.9826902
## svmR 1.141246 0.7513994 0.8006586
The SVM with a radial basis function kernel gives the best performance. It has the lowest resampled RMSE (1.119), and on the test set it has the lowest RMSE (1.141) and MAE (0.801) and the highest R-squared (0.751).
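The comparison can also be done programmatically. The sketch below re-enters the test set values from the table above and picks the best model per metric; the object names are chosen for this example only.

```r
# Test set results copied from the table above
test_results <- rbind(
  knn  = c(RMSE = 1.526207, Rsquared = 0.6187302, MAE = 1.1800625),
  nn   = c(RMSE = 1.506458, Rsquared = 0.5140357, MAE = 1.1159762),
  mars = c(RMSE = 1.346479, Rsquared = 0.6138875, MAE = 0.9826902),
  svmR = c(RMSE = 1.141246, Rsquared = 0.7513994, MAE = 0.8006586)
)

# Lower is better for RMSE and MAE, higher is better for R-squared
best <- c(
  RMSE = rownames(test_results)[which.min(test_results[, "RMSE"])],
  Rsquared = rownames(test_results)[which.max(test_results[, "Rsquared"])],
  MAE = rownames(test_results)[which.min(test_results[, "MAE"])]
)
best  # "svmR" for all three metrics

# Models ordered from best to worst test set RMSE
test_results[order(test_results[, "RMSE"]), ]
```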
varImp(svmRTune)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 89.32
## BiologicalMaterial03 77.48
## ManufacturingProcess36 76.64
## ManufacturingProcess09 73.90
## ManufacturingProcess13 73.24
## ManufacturingProcess31 67.06
## BiologicalMaterial02 66.92
## BiologicalMaterial12 64.94
## ManufacturingProcess06 59.23
## ManufacturingProcess17 53.07
## BiologicalMaterial11 49.11
## BiologicalMaterial04 48.27
## ManufacturingProcess11 45.42
## ManufacturingProcess29 45.31
## ManufacturingProcess33 44.62
## BiologicalMaterial01 40.70
## BiologicalMaterial08 38.19
## ManufacturingProcess30 35.52
## BiologicalMaterial09 29.60
plot(varImp(svmRTune), top = 20)
set.seed(100)
larsTune <- train(train_x, train_y,
method = "lars",
metric = "Rsquared",
tuneLength = 20,
trControl = ctrl,
preProc = c("center", "scale"))
lars_predict <- predict(larsTune, test_x)
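The ten most important predictors for the SVM are the same as the ten most important predictors from the optimal linear model, which was the LARS model. The header of the varImp output, "loess r-squared variable importance", indicates that these scores come from a model-independent filter rather than from the fitted SVM itself, so the agreement may partly reflect a shared scoring method.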
plot(varImp(svmRTune), top = 10,
main = "Nonlinear: Top 10 Important Predictors")
plot(varImp(larsTune), top = 10,
main = "Linear: Top 10 Important Predictors")
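The "loess r-squared variable importance" header in the varImp output suggests that each predictor is scored on its own by how well a loess smooth of the outcome on that predictor fits. The sketch below shows one way such a score could be computed and rescaled to 0-100. The pseudo R-squared formula, the min-max rescaling, and the names `loess_r2`, `x_signal`, and `x_noise` are assumptions for illustration, not the caret source.

```r
# Hypothetical helper: pseudo R-squared of a loess fit of y on one predictor
loess_r2 <- function(x, y) {
  fit <- loess(y ~ x)
  1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
}

set.seed(1)
n <- 200
x_signal <- runif(n)  # related to the outcome
x_noise <- runif(n)   # unrelated to the outcome
y <- sin(2 * pi * x_signal) + rnorm(n, sd = 0.2)

scores <- c(signal = loess_r2(x_signal, y), noise = loess_r2(x_noise, y))

# Rescale so the best predictor scores 100 and the worst scores 0
importance <- 100 * (scores - min(scores)) / (max(scores) - min(scores))
importance
```

A score of this kind depends only on the data and not on the fitted model, so two different models scored this way on the same training set would receive the same ranking.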
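# dplyr supplies %>%, arrange(), and select(); corrplot supplies corrplot()
library(dplyr)
library(corrplot)
top10 <- varImp(svmRTune)$importance %>%
arrange(desc(Overall)) %>%
head(10)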
Chemical %>%
select(c("Yield", row.names(top10))) %>%
cor() %>%
corrplot()
train_x %>%
select(row.names(top10)) %>%
featurePlot(., train_y)
ManufacturingProcess32 has the strongest relationship with Yield, as seen in the correlation plot. Two of the top 10 predictors are negatively correlated with Yield. The biological predictors all appear to be positively related to Yield, whereas the relationship between the manufacturing process predictors and Yield varies. For example, ManufacturingProcess36 appears to take only a small number of distinct levels, and ManufacturingProcess31 is concentrated mostly around a single value.