Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\[y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\]
where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 other non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
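library(mlbench)  # provides mlbench.friedman1()
library(caret)    # provides featurePlot(), train(), postResample(), varImp()
set.seed(200)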
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
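To make the link between the equation and the simulated data concrete, the following is a minimal base-R sketch of the same data-generating process. The helper name simulate_friedman1 is hypothetical and is not part of mlbench; it only assumes ten uniform predictors, of which the first five enter the formula.

```r
# Hypothetical helper (not part of mlbench): simulate Friedman's benchmark by hand.
simulate_friedman1 <- function(n, sd = 1) {
  # Ten predictors, each uniform on [0, 1]; only columns 1-5 affect y
  x <- matrix(runif(n * 10), ncol = 10)
  y <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5] +
    rnorm(n, mean = 0, sd = sd)  # the N(0, sigma^2) noise term
  list(x = x, y = y)
}

set.seed(1)
sim <- simulate_friedman1(200, sd = 1)
dim(sim$x)
length(sim$y)
```

Columns 6 to 10 never appear in the formula, which is why a good model should assign them little or no importance.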
KNN
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
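As a rough illustration of what the tuned model does at prediction time, the sketch below implements k-nearest-neighbor regression by hand: it averages the responses of the k closest training points. The function knn_predict and the toy data are assumptions made for this example; caret's own implementation additionally centers and scales the predictors first.

```r
# Hypothetical helper: predict one new point as the mean response
# of its k nearest training points (Euclidean distance).
knn_predict <- function(train_x, train_y, new_x, k) {
  # Subtract new_x from every row, then compute row-wise distances
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  mean(train_y[order(d)[seq_len(k)]])
}

train_x <- matrix(c(0, 1, 2, 10), ncol = 1)
train_y <- c(1, 2, 3, 100)

# The three nearest neighbors of 0.9 are x = 1, 0 and 2, so the
# prediction is mean(c(2, 1, 3)); the distant point x = 10 is ignored.
knn_pred <- knn_predict(train_x, train_y, new_x = 0.9, k = 3)
knn_pred
```

Larger values of k average over more neighbors, which smooths the predictions; this is the trade-off the resampled RMSE profile above is searching over.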
SVM
svmRTuned <- train(x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.505383 0.8031869 1.999381
## 0.50 2.290725 0.8103140 1.829703
## 1.00 2.105086 0.8302040 1.677851
## 2.00 2.014620 0.8418576 1.598814
## 4.00 1.965196 0.8491165 1.567327
## 8.00 1.927649 0.8538945 1.542267
## 16.00 1.924262 0.8545293 1.539275
## 32.00 1.924262 0.8545293 1.539275
## 64.00 1.924262 0.8545293 1.539275
## 128.00 1.924262 0.8545293 1.539275
## 256.00 1.924262 0.8545293 1.539275
## 512.00 1.924262 0.8545293 1.539275
## 1024.00 1.924262 0.8545293 1.539275
## 2048.00 1.924262 0.8545293 1.539275
##
## Tuning parameter 'sigma' was held constant at a value of 0.06802164
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06802164 and C = 16.
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 16
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0680216365076835
##
## Number of Support Vectors : 152
##
## Objective Function Value : -66.0924
## Training error : 0.008551
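The cost values in the tuning table are powers of two, and the kernel reported for the final model is the Gaussian radial basis function. The sketch below reproduces both in base R. The function rbf_kernel is a hypothetical illustration that assumes the parameterisation exp(-sigma * ||u - v||^2), which is how kernlab defines its RBF kernel.

```r
# The 14 cost values in the table run from 2^-2 to 2^11
cost_grid <- 2^((1:14) - 3)
cost_grid

# Hypothetical helper: Gaussian RBF kernel, exp(-sigma * ||u - v||^2)
rbf_kernel <- function(u, v, sigma) {
  exp(-sigma * sum((u - v)^2))
}

# Identical points have similarity 1; similarity decays with distance
rbf_kernel(c(0, 0), c(0, 0), sigma = 0.068)
rbf_kernel(c(0, 0), c(1, 1), sigma = 0.068)
```

Because sigma is held constant, only the cost is tuned here, and the resampled error stops changing once the cost reaches 16.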
MARS
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsTuned <- train(x = trainingData$x,
y = trainingData$y,
method = "earth",
tuneGrid = marsGrid,
preProc = c("center", "scale"),
trControl = trainControl(method = "cv"))
marsTuned
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.555815 0.2211499 3.7894920
## 1 3 3.926474 0.3967924 3.1774338
## 1 4 2.590871 0.7259766 2.0669237
## 1 5 2.294241 0.7891189 1.8337749
## 1 6 2.199933 0.8070077 1.7339981
## 1 7 1.742343 0.8732965 1.3795558
## 1 8 1.666015 0.8845896 1.2982794
## 1 9 1.642477 0.8883691 1.2818259
## 1 10 1.648030 0.8895147 1.2904522
## 1 11 1.617092 0.8944590 1.2649496
## 1 12 1.588637 0.8989528 1.2362045
## 1 13 1.616912 0.8958467 1.2653165
## 1 14 1.617859 0.8959075 1.2598825
## 1 15 1.626588 0.8945949 1.2766998
## 1 16 1.626588 0.8945949 1.2766998
## 1 17 1.626588 0.8945949 1.2766998
## 1 18 1.626588 0.8945949 1.2766998
## 1 19 1.626588 0.8945949 1.2766998
## 1 20 1.626588 0.8945949 1.2766998
## 1 21 1.626588 0.8945949 1.2766998
## 1 22 1.626588 0.8945949 1.2766998
## 1 23 1.626588 0.8945949 1.2766998
## 1 24 1.626588 0.8945949 1.2766998
## 1 25 1.626588 0.8945949 1.2766998
## 1 26 1.626588 0.8945949 1.2766998
## 1 27 1.626588 0.8945949 1.2766998
## 1 28 1.626588 0.8945949 1.2766998
## 1 29 1.626588 0.8945949 1.2766998
## 1 30 1.626588 0.8945949 1.2766998
## 1 31 1.626588 0.8945949 1.2766998
## 1 32 1.626588 0.8945949 1.2766998
## 1 33 1.626588 0.8945949 1.2766998
## 1 34 1.626588 0.8945949 1.2766998
## 1 35 1.626588 0.8945949 1.2766998
## 1 36 1.626588 0.8945949 1.2766998
## 1 37 1.626588 0.8945949 1.2766998
## 1 38 1.626588 0.8945949 1.2766998
## 2 2 4.632387 0.1948616 3.8525357
## 2 3 3.917493 0.4004048 3.1441082
## 2 4 2.640513 0.7253577 2.1289055
## 2 5 2.363920 0.7845149 1.8807776
## 2 6 2.296960 0.7944450 1.8480708
## 2 7 1.882405 0.8579384 1.4883305
## 2 8 1.685100 0.8885201 1.2864012
## 2 9 1.600628 0.8964225 1.2475233
## 2 10 1.409385 0.9204173 1.0882356
## 2 11 1.345238 0.9273259 1.0531338
## 2 12 1.261378 0.9348132 1.0092697
## 2 13 1.235285 0.9376491 0.9899744
## 2 14 1.214095 0.9393634 0.9874565
## 2 15 1.204429 0.9409055 0.9709448
## 2 16 1.189766 0.9415400 0.9628053
## 2 17 1.207788 0.9400920 0.9687374
## 2 18 1.207788 0.9400920 0.9687374
## 2 19 1.207788 0.9400920 0.9687374
## 2 20 1.207788 0.9400920 0.9687374
## 2 21 1.207788 0.9400920 0.9687374
## 2 22 1.207788 0.9400920 0.9687374
## 2 23 1.207788 0.9400920 0.9687374
## 2 24 1.207788 0.9400920 0.9687374
## 2 25 1.207788 0.9400920 0.9687374
## 2 26 1.207788 0.9400920 0.9687374
## 2 27 1.207788 0.9400920 0.9687374
## 2 28 1.207788 0.9400920 0.9687374
## 2 29 1.207788 0.9400920 0.9687374
## 2 30 1.207788 0.9400920 0.9687374
## 2 31 1.207788 0.9400920 0.9687374
## 2 32 1.207788 0.9400920 0.9687374
## 2 33 1.207788 0.9400920 0.9687374
## 2 34 1.207788 0.9400920 0.9687374
## 2 35 1.207788 0.9400920 0.9687374
## 2 36 1.207788 0.9400920 0.9687374
## 2 37 1.207788 0.9400920 0.9687374
## 2 38 1.207788 0.9400920 0.9687374
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 16 and degree = 2.
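The two tuning parameters are easier to interpret with the building blocks of MARS in view. The sketch below, in base R, shows the hinge functions that MARS terms are made of; nprune caps the number of retained terms, and degree = 2 permits products of two hinges. The helper hinge and the knot values are assumptions chosen for illustration and are not taken from the fitted model.

```r
# Hypothetical helper: hinge function max(0, direction * (x - knot))
hinge <- function(x, knot, direction = 1) {
  pmax(0, direction * (x - knot))
}

x <- seq(0, 1, by = 0.25)
right <- hinge(x, knot = 0.5)                 # active above the knot
left  <- hinge(x, knot = 0.5, direction = -1) # active below the knot
right
left

# A degree-2 term is a product of hinges on two different predictors
x1 <- c(0.2, 0.6, 0.9)
x2 <- c(0.1, 0.1, 0.8)
interaction_term <- hinge(x1, 0.5) * hinge(x2, 0.3, direction = -1)
interaction_term
```

Since the simulated response contains the product x1 * x2 inside the sine term, it is plausible that allowing interaction terms (degree = 2) helps, which is consistent with the lower resampled RMSE in the table.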
NEURAL NETWORK
nnetGrid <- expand.grid(.decay=c(0, 0.01, 0.1),
.size=c(1, 5, 10),
.bag=FALSE)
nnetTune <- train(x = trainingData$x,
y = trainingData$y,
method = "avNNet",
tuneGrid = nnetGrid,
preProc = c("center", "scale"),
trace=FALSE,
linout=TRUE,
maxit=500)
nnetTune
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.618680 0.7313522 2.058850
## 0.00 5 3.530129 0.6271191 2.471984
## 0.00 10 3.192839 0.6573003 2.380899
## 0.01 1 2.588779 0.7351027 2.020064
## 0.01 5 2.510018 0.7576369 1.977819
## 0.01 10 2.802374 0.6986265 2.226414
## 0.10 1 2.578377 0.7379213 2.006248
## 0.10 5 2.436043 0.7706012 1.912769
## 0.10 10 2.536595 0.7479290 2.007334
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.1 and bag = FALSE.
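Model averaging is the idea behind avNNet: several networks are fitted from different random starting weights and their predictions are averaged. The sketch below imitates this with the nnet package on toy data. The data, the choice of five networks, and the size and decay values are assumptions for illustration only, not a reproduction of caret's internals.

```r
library(nnet)

set.seed(1)
# Toy regression problem with two predictors
x <- matrix(runif(200), ncol = 2)
y <- sin(2 * pi * x[, 1]) + x[, 2] + rnorm(100, sd = 0.1)

# Fit five single-hidden-layer networks from different random starts
fits <- lapply(1:5, function(i) {
  nnet(x, y, size = 3, decay = 0.1, linout = TRUE, trace = FALSE, maxit = 200)
})

# One column of predictions per network, then average across networks
pred_matrix <- sapply(fits, function(fit) as.vector(predict(fit, x)))
avg_pred <- rowMeans(pred_matrix)
head(avg_pred)
```

Averaging reduces the dependence on any one set of starting weights, which matters because individual networks can converge to different local optima.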
knnPred <- predict(knnModel, newdata = testData$x)
svmPred <- predict(svmRTuned, newdata = testData$x)
marsPred <- predict(marsTuned, newdata = testData$x)
nnetPred <- predict(nnetTune, newdata = testData$x)
KNN <- postResample(pred = knnPred, obs = testData$y)
SVM <- postResample(pred = svmPred, obs = testData$y)
MARS <- postResample(pred = marsPred, obs = testData$y)
AvgNN <- postResample(pred = nnetPred, obs = testData$y)
rbind(KNN,SVM,MARS,AvgNN)
## RMSE Rsquared MAE
## KNN 3.204059 0.6819919 2.568346
## SVM 2.086465 0.8236735 1.585465
## MARS 1.279387 0.9343367 1.009113
## AvgNN 2.075269 0.8312549 1.551190
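The three columns in this table can be computed by hand. The sketch below uses made-up observed and predicted values (assumptions for illustration) and mirrors what postResample reports for regression: root mean squared error, the squared correlation between predictions and observations, and mean absolute error.

```r
# Toy observed and predicted values, chosen for easy arithmetic
obs  <- c(3, 5, 7, 9)
pred <- c(2.5, 5.5, 6, 10)

metrics <- c(
  RMSE     = sqrt(mean((pred - obs)^2)),  # penalizes large errors more
  Rsquared = cor(pred, obs)^2,            # squared correlation
  MAE      = mean(abs(pred - obs))        # average absolute error
)
metrics
```

Lower RMSE and MAE indicate smaller prediction errors, while a higher R-squared indicates a stronger linear relationship between predicted and observed values.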
MARS has the lowest test-set RMSE (1.28) and MAE (1.01) and the highest R-squared (0.93) of the four models. As the importance scores below show, it also selected only the informative predictors (X1-X5), while the noise predictors X6-X10 received zero importance.
varImp(marsTuned)
## earth variable importance
##
## Overall
## X1 100.00
## X4 85.12
## X2 69.20
## X5 49.23
## X3 39.89
## X7 0.00
## X9 0.00
## X8 0.00
## X6 0.00
## X10 0.00
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
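library(AppliedPredictiveModeling)  # provides the ChemicalManufacturingProcess data
library(dplyr)                      # provides %>% and select()
library(DMwR)                       # assumed source of knnImputation(); DMwR2 also provides it
set.seed(123)
# Data load, imputation, and split
data(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess
# Drop near-zero-variance predictors; guard against an empty index,
# because df[, -integer(0)] would drop every column
zero.var <- nearZeroVar(df)
if (length(zero.var) > 0) df <- df[, -zero.var]
df.chem <- knnImputation(df)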
X <- as.data.frame(df.chem %>% select(-Yield))
Y <- df.chem %>% select(Yield)
train_index <- sample(1:nrow(df.chem), nrow(df.chem)*.75)
x.train <- X[train_index,]
y.train <- df.chem$Yield[train_index]
x.test <- X[-train_index,]
y.test <- df.chem$Yield[-train_index]
# Models
chem.knnModel <- train(x = x.train,
y = y.train,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
chem.svmRTuned <- train(x = x.train,
y = y.train,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
chem.marsTuned <- train(x = x.train,
y = y.train,
method = "earth",
tuneGrid = marsGrid,
preProc = c("center", "scale"),
trControl = trainControl(method = "cv"))
chem.nnetTune <- train(x = x.train,
y = y.train,
method = "avNNet",
tuneGrid = nnetGrid,
preProc = c("center", "scale"),
trace=FALSE,
linout=TRUE,
maxit=500)
# Predictions and Performance
chem.knnPred <- predict(chem.knnModel, newdata = x.test)
chem.svmPred <- predict(chem.svmRTuned, newdata = x.test)
chem.marsPred <- predict(chem.marsTuned, newdata = x.test)
chem.nnetPred <- predict(chem.nnetTune, newdata = x.test)
KNN.Model <- postResample(pred = chem.knnPred, obs = y.test)
SVM.Model <- postResample(pred = chem.svmPred, obs = y.test)
MARS.Model <- postResample(pred = chem.marsPred, obs = y.test)
AvgNN.Model <- postResample(pred = chem.nnetPred, obs = y.test)
rbind(KNN.Model, SVM.Model, MARS.Model, AvgNN.Model)
## RMSE Rsquared MAE
## KNN.Model 1.524302 0.4740547 1.2638312
## SVM.Model 1.105444 0.7852650 0.8085453
## MARS.Model 1.588791 0.3665373 1.2775890
## AvgNN.Model 1.740405 0.3852604 1.4623066
Among the four nonlinear models, SVM performs best on the chemical manufacturing test data, with the lowest RMSE (1.11) and MAE (0.81) and the highest R-squared (0.79).
p <- varImp(chem.svmRTuned); p
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 94.13
## ManufacturingProcess13 90.25
## BiologicalMaterial03 83.97
## ManufacturingProcess17 79.13
## BiologicalMaterial12 69.79
## BiologicalMaterial02 68.69
## ManufacturingProcess09 67.89
## ManufacturingProcess31 65.63
## ManufacturingProcess36 64.71
## ManufacturingProcess06 64.46
## ManufacturingProcess33 50.55
## BiologicalMaterial04 50.41
## BiologicalMaterial11 47.80
## ManufacturingProcess30 47.10
## ManufacturingProcess11 46.02
## ManufacturingProcess29 44.84
## ManufacturingProcess02 43.49
## BiologicalMaterial01 41.83
## BiologicalMaterial08 37.51
plot(p, top = 10)
In the previous assignment on linear models, 9 of the top 10 variables were manufacturing process predictors. Here the manufacturing predictors still dominate, but to a lesser degree: 6 of the top 10 are manufacturing predictors and the remaining 4 are biological predictors, a 60%-40% split. Overall, for both linear and nonlinear models, the manufacturing processes appear to have more influence on the response.
The plots suggest moderate correlations between the top predictors and the response. ManufacturingProcess32 stands out with a positive correlation of about 0.6 with Yield; the remaining manufacturing and biological predictors have positive or negative correlations of about 0.4 to 0.5 in magnitude.
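library(ggcorrplot)  # provides ggcorrplot()
# Names of the 10 predictors with the highest importance scores
top10.var <- rownames(data.frame(p$importance))[order(p$importance$Overall, decreasing = TRUE)][1:10]
top10.x <- df.chem[top10.var]
Yield <- df.chem$Yield
top10 <- cbind(top10.x, Yield)
# Pairwise correlations, rounded to one decimal place for the plot labels
corr <- round(cor(top10), 1)
featurePlot(top10.x, Yield)
ggcorrplot(corr, lab = TRUE)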
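The 6-to-4 split between manufacturing and biological predictors quoted earlier can be checked with a short count. The sketch below transcribes the ten highest-ranked names from the importance output above into a vector (the vector name top10_names is hypothetical) and tallies them by predictor family.

```r
# Top 10 predictors, transcribed from the importance output above
top10_names <- c(
  "ManufacturingProcess32", "BiologicalMaterial06", "ManufacturingProcess13",
  "BiologicalMaterial03", "ManufacturingProcess17", "BiologicalMaterial12",
  "BiologicalMaterial02", "ManufacturingProcess09", "ManufacturingProcess31",
  "ManufacturingProcess36"
)

# Strip the trailing digits to recover the predictor family, then count
family <- sub("[0-9]+$", "", top10_names)
family_counts <- table(family)
family_counts
```

The same two lines could be applied to top10.var directly, which avoids retyping the names if the ranking changes.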