7.2. Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data: y = 10 sin(πx1x2) + 20(x3 − 0.5)2 + 10x4 + 5x5 + N(0, σ2) where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(mlbench)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
knnModel <- train(x =trainingData$x,
y= trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
RMSE was used to select the optimal model using the smallest value. The final value used for the model was k = 19.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
I add MARS model as comparison.
MARS has a better performance with lower RMSE and higher Rsquared.
Also, MARS selet X1-X5 plus X6.
knn_fit<- train(x=trainingData$x,y=trainingData$y, method = "knn", preProcess = c("center", "scale"), tuneLength = 10)
mars_fit<- train(x = trainingData$x, y = trainingData$y, method = "earth",
preProcess = c("center", "scale"), tuneLength = 10)
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
marsPred <- predict(mars_fit, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.776575 0.872700 1.358367
varImp((knn_fit))
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
varImp((mars_fit))
## earth variable importance
##
## Overall
## X1 100.00
## X4 82.78
## X2 64.18
## X5 40.21
## X3 28.14
## X6 0.00
7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
library(RANN)
estdata <- preProcess(ChemicalManufacturingProcess, "knnImpute")
chemdata <- predict(estdata, ChemicalManufacturingProcess)
chemdata <- chemdata[, -nearZeroVar(chemdata)]
ch_index <- createDataPartition(chemdata$Yield, p = .8, list = FALSE)
trainx <- chemdata[ch_index, -1]
trainy <- chemdata[ch_index, 1]
testx <- chemdata[-ch_index, -1]
testy <- chemdata[-ch_index, 1]
4 model was create, KNN, MARS, Neural Network and SVM. SVM is the preferable model since it has the lowest RMSE and 2nd highest Rsquared.
knnModelC <- train(trainx, trainy,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModelC
## k-Nearest Neighbors
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.7663166 0.4267656 0.6103715
## 7 0.7620570 0.4334381 0.6085957
## 9 0.7606730 0.4383593 0.6112026
## 11 0.7637161 0.4369804 0.6112213
## 13 0.7702460 0.4338978 0.6183574
## 15 0.7757912 0.4294637 0.6246815
## 17 0.7793033 0.4268261 0.6265074
## 19 0.7857410 0.4192433 0.6310951
## 21 0.7876125 0.4219037 0.6318536
## 23 0.7926886 0.4171869 0.6380595
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
MARS
set.seed(777)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38) #p.165
marsTuneC <- train(trainx, trainy,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsTuneC
## Multivariate Adaptive Regression Spline
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.7571365 0.4384424 0.5890694
## 1 3 0.6209432 0.6171880 0.5001021
## 1 4 0.6125417 0.6149778 0.4974624
## 1 5 0.6086976 0.6193433 0.5002367
## 1 6 0.6112194 0.6220564 0.5001236
## 1 7 0.5833192 0.6601286 0.4707630
## 1 8 0.5954229 0.6548777 0.4773941
## 1 9 0.5908520 0.6688769 0.4641166
## 1 10 0.5922562 0.6729081 0.4589072
## 1 11 0.6046974 0.6547836 0.4578478
## 1 12 0.6021642 0.6585914 0.4529060
## 1 13 0.5982064 0.6552723 0.4642189
## 1 14 0.5993061 0.6579914 0.4710578
## 1 15 0.6003698 0.6629187 0.4686158
## 1 16 0.5846706 0.6788217 0.4511537
## 1 17 0.5860203 0.6802277 0.4511400
## 1 18 0.5849925 0.6786591 0.4492209
## 1 19 0.5849925 0.6786591 0.4492209
## 1 20 0.5849925 0.6786591 0.4492209
## 1 21 0.5849925 0.6786591 0.4492209
## 1 22 0.5849925 0.6786591 0.4492209
## 1 23 0.5849925 0.6786591 0.4492209
## 1 24 0.5849925 0.6786591 0.4492209
## 1 25 0.5849925 0.6786591 0.4492209
## 1 26 0.5849925 0.6786591 0.4492209
## 1 27 0.5849925 0.6786591 0.4492209
## 1 28 0.5849925 0.6786591 0.4492209
## 1 29 0.5849925 0.6786591 0.4492209
## 1 30 0.5849925 0.6786591 0.4492209
## 1 31 0.5849925 0.6786591 0.4492209
## 1 32 0.5849925 0.6786591 0.4492209
## 1 33 0.5849925 0.6786591 0.4492209
## 1 34 0.5849925 0.6786591 0.4492209
## 1 35 0.5849925 0.6786591 0.4492209
## 1 36 0.5849925 0.6786591 0.4492209
## 1 37 0.5849925 0.6786591 0.4492209
## 1 38 0.5849925 0.6786591 0.4492209
## 2 2 0.7571365 0.4384424 0.5890694
## 2 3 0.6389273 0.5892642 0.5242676
## 2 4 0.6247083 0.5967464 0.5181887
## 2 5 0.6152453 0.6099734 0.5094673
## 2 6 0.6315286 0.6175948 0.5132345
## 2 7 0.5898082 0.6628224 0.4812386
## 2 8 0.5773520 0.6710845 0.4698095
## 2 9 0.5696061 0.6801360 0.4646234
## 2 10 0.5848359 0.6626196 0.4765138
## 2 11 0.5694192 0.6879157 0.4649821
## 2 12 0.5634597 0.6920767 0.4564068
## 2 13 0.5640860 0.6887346 0.4576884
## 2 14 0.5697185 0.6883572 0.4531146
## 2 15 0.5919275 0.6731146 0.4702897
## 2 16 0.6167678 0.6586522 0.4874193
## 2 17 0.6198244 0.6591712 0.4951976
## 2 18 0.6108703 0.6755062 0.4937246
## 2 19 0.6094393 0.6755931 0.4951412
## 2 20 0.5999557 0.6841172 0.4869600
## 2 21 0.5945895 0.6886322 0.4845766
## 2 22 0.5967784 0.6969207 0.4885505
## 2 23 0.6044671 0.6961590 0.4938530
## 2 24 0.5974824 0.6984470 0.4878614
## 2 25 0.5974824 0.6984470 0.4878614
## 2 26 0.6126017 0.6966108 0.4983812
## 2 27 0.6126017 0.6966108 0.4983812
## 2 28 0.6126017 0.6966108 0.4983812
## 2 29 0.6126017 0.6966108 0.4983812
## 2 30 0.6126017 0.6966108 0.4983812
## 2 31 0.6126017 0.6966108 0.4983812
## 2 32 0.6126017 0.6966108 0.4983812
## 2 33 0.6126017 0.6966108 0.4983812
## 2 34 0.6126017 0.6966108 0.4983812
## 2 35 0.6126017 0.6966108 0.4983812
## 2 36 0.6126017 0.6966108 0.4983812
## 2 37 0.6126017 0.6966108 0.4983812
## 2 38 0.6126017 0.6966108 0.4983812
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 12 and degree = 2.
tooHigh <- findCorrelation(cor(trainx), cutoff = .75) #p.163
train_x_nnet <- trainx[, -tooHigh]
test_x_nnet <- testx[, -tooHigh]
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10))
ctrl <- trainControl(method = "cv", number = 10)
set.seed(777)
nnetTune <- train(train_x_nnet, trainy,
method = "nnet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(train_x_nnet) + 1) + 10 + 1,
maxit = 500)
nnetTune
## Neural Network
##
## 144 samples
## 37 predictor
##
## Pre-processing: centered (37), scaled (37)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 1.2163074 0.3136695 0.8295162
## 0.00 2 1.0184622 0.3425981 0.8306294
## 0.00 3 1.0404403 0.3236338 0.8351704
## 0.00 4 1.4772631 0.1900349 1.1425086
## 0.00 5 1.5537554 0.1408028 1.2441427
## 0.00 6 1.3815231 0.2459903 1.0489401
## 0.00 7 1.1281311 0.3749623 0.9003955
## 0.00 8 1.1785144 0.2517516 0.8937500
## 0.00 9 1.0238042 0.3370734 0.7755890
## 0.00 10 0.9625880 0.3778590 0.7979186
## 0.01 1 0.8925761 0.3749872 0.7201091
## 0.01 2 1.0991754 0.2649870 0.8736625
## 0.01 3 1.1941416 0.2768011 0.9195688
## 0.01 4 1.1621122 0.3145344 0.9196618
## 0.01 5 1.0365727 0.3331291 0.8558945
## 0.01 6 0.9795945 0.3633113 0.7548577
## 0.01 7 0.8225926 0.4790358 0.6654358
## 0.01 8 0.8018038 0.4532290 0.6450975
## 0.01 9 0.8380935 0.4301259 0.6838092
## 0.01 10 0.8684114 0.4008014 0.6963858
## 0.10 1 0.7830112 0.4641974 0.6340823
## 0.10 2 0.9658965 0.3696723 0.7614198
## 0.10 3 0.9047863 0.4129417 0.6915331
## 0.10 4 0.9122687 0.4288613 0.7240297
## 0.10 5 0.9064383 0.3640349 0.7144517
## 0.10 6 0.9134298 0.3881678 0.7042020
## 0.10 7 0.7538289 0.5057989 0.6098617
## 0.10 8 0.8314645 0.4263620 0.6550355
## 0.10 9 0.8200322 0.4127259 0.6588729
## 0.10 10 0.8372057 0.4245492 0.6801626
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 7 and decay = 0.1.
set.seed(777)
svmRTune <- train(trainx, trainy,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTune
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7468900 0.5336373 0.6009050
## 0.50 0.6821002 0.5748808 0.5559519
## 1.00 0.6212096 0.6302097 0.5074772
## 2.00 0.5995841 0.6471372 0.4826408
## 4.00 0.6036013 0.6314314 0.4873860
## 8.00 0.6177886 0.6152405 0.4983435
## 16.00 0.6172946 0.6165007 0.4984441
## 32.00 0.6172946 0.6165007 0.4984441
## 64.00 0.6172946 0.6165007 0.4984441
## 128.00 0.6172946 0.6165007 0.4984441
## 256.00 0.6172946 0.6165007 0.4984441
## 512.00 0.6172946 0.6165007 0.4984441
## 1024.00 0.6172946 0.6165007 0.4984441
## 2048.00 0.6172946 0.6165007 0.4984441
##
## Tuning parameter 'sigma' was held constant at a value of 0.0141958
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0141958 and C = 2.
knnPred <- predict(knnModelC, testx)
postResample(pred = knnPred, obs = testy)
## RMSE Rsquared MAE
## 0.7366242 0.4621173 0.5993582
marsPred <- predict(marsTuneC, testx)
postResample(pred = marsPred, obs = testy)
## RMSE Rsquared MAE
## 0.7051474 0.5211750 0.5567698
svmRPred <- predict(svmRTune, testx)
postResample(pred = svmRPred, obs = testy)
## RMSE Rsquared MAE
## 0.6917503 0.5138788 0.5533730
nnPred <- predict(nnetTune, testx)
postResample(pred = nnPred, obs = testy)
## RMSE Rsquared MAE
## 0.7500078 0.4442805 0.6100607
The Top 10 important predictors are the same for 2 model.
varImp(svmRTune)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 94.77
## BiologicalMaterial06 82.41
## ManufacturingProcess17 78.71
## BiologicalMaterial03 78.71
## ManufacturingProcess31 71.77
## ManufacturingProcess06 68.38
## BiologicalMaterial12 66.59
## BiologicalMaterial02 65.86
## ManufacturingProcess09 65.47
## ManufacturingProcess36 62.43
## BiologicalMaterial04 53.46
## BiologicalMaterial11 52.00
## ManufacturingProcess11 51.25
## ManufacturingProcess33 47.86
## BiologicalMaterial08 46.86
## ManufacturingProcess29 43.34
## ManufacturingProcess02 41.33
## ManufacturingProcess30 40.49
## BiologicalMaterial09 40.28
set.seed(777)
larsTune <- train(trainx, trainy,
method = "lars",
metric = "Rsquared",
tuneLength = 20,
trControl = ctrl,
preProc = c("center", "scale"))
lars_predict <- predict(larsTune, testx)
plot(varImp(larsTune), top = 10)
plot(varImp(svmRTune), top = 10)
As we can see ManufacturingProcess36 and 13 has neg correlations while other has prositve.
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(corrplot)
## corrplot 0.92 loaded
top10 <- varImp(svmRTune)$importance %>%
arrange(-Overall) %>%
head(10)
chemdata %>%
select(c("Yield", row.names(top10))) %>%
cor() %>%
corrplot()