library(caret)
library(mlbench)
#help("mlbench.friedman1")The regression problem Friedman 1 as described in Friedman (1991) and Breiman (1996). Inputs are 10 independent variables uniformly distributed on the interval [0,1], only 5 out of these 10 are actually used. Outputs are created according to the formula
Usage => mlbench.friedman1(n, sd=1)
n = number of patterns to create
sd = Standard deviation of noise
x = input values (independent variables)
y = output values (dependent variable)
set.seed(200)
training.data <- mlbench.friedman1(200, sd=1)
training.data$x <- data.frame(training.data$x)
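Since the noise term has sd = 1, no model can reliably achieve a test RMSE much below 1.0. As a quick sanity check, here is a sketch that recomputes the noiseless signal from the formula above (data.frame() names the columns X1 through X10):
# Sketch: recompute the noiseless Friedman 1 signal and check that the
# residual standard deviation is close to the simulated sd = 1.
x <- training.data$x
signal <- 10 * sin(pi * x$X1 * x$X2) + 20 * (x$X3 - 0.5)^2 + 10 * x$X4 + 5 * x$X5
sd(training.data$y - signal)  # ~1: the lowest RMSE any model can approach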
featurePlot(training.data$x, training.data$y)
Now, estimate the true error rate with good precision:
test.data <- mlbench.friedman1(5000, sd = 1)
test.data$x <- data.frame(test.data$x)
KNN Model - Model tuning technique
set.seed(1)
knnModel <- train(x = training.data$x,
y = training.data$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.626071 0.4766532 2.954620
## 7 3.463866 0.5231142 2.823065
## 9 3.387872 0.5552706 2.744969
## 11 3.345212 0.5754355 2.706971
## 13 3.317962 0.5964912 2.696927
## 15 3.291614 0.6123880 2.668077
## 17 3.293991 0.6238074 2.665458
## 19 3.300870 0.6342793 2.668902
## 21 3.297502 0.6472145 2.674792
## 23 3.298045 0.6534698 2.676965
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
plot(knnModel)
knn.Predict <- predict(knnModel, newdata = test.data$x)
postResample(pred = knn.Predict, obs = test.data$y)
## RMSE Rsquared MAE
## 3.1750657 0.6785946 2.5443169
The KNN test-set RMSE is 3.1751, well above the ~1.0 noise floor.
MARS
library(earth)
set.seed(101)
marsGrid <- expand.grid(degree = 1:2, nprune = seq(2, 14, by = 2))
marsModel <- train(x = training.data$x,
y = training.data$y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsModel
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.417014 0.2350732 3.656244
## 1 4 2.606487 0.7293074 2.069700
## 1 6 2.192319 0.8094378 1.742533
## 1 8 1.700573 0.8889867 1.313446
## 1 10 1.640852 0.8967593 1.277871
## 1 12 1.641261 0.8950167 1.297462
## 1 14 1.642807 0.8964578 1.287479
## 2 2 4.417014 0.2350732 3.656244
## 2 4 2.606487 0.7293074 2.069700
## 2 6 2.218192 0.8021446 1.729223
## 2 8 1.750853 0.8789636 1.370531
## 2 10 1.449480 0.9177568 1.111537
## 2 12 1.316041 0.9299167 1.026115
## 2 14 1.321235 0.9300051 1.048692
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 12 and degree = 2.
plot(marsModel)
Feature Importance
varImp(marsModel)
## earth variable importance
##
## Overall
## X1 100.00
## X4 85.05
## X2 69.03
## X5 48.88
## X3 39.40
## X10 0.00
## X6 0.00
## X9 0.00
## X7 0.00
## X8 0.00
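MARS recovers exactly the five informative predictors X1 through X5 and zeroes out the rest. To see how it encodes them, the hinge functions of the final fit can be printed (a sketch; the exact terms and coefficients will vary with the resampling):
# Sketch: list the selected basis functions, e.g. h(0.6 - X1), and their
# coefficients from the underlying earth object.
summary(marsModel$finalModel)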
mars.Predict <- predict(marsModel, test.data$x)
postResample(pred = mars.Predict, obs = test.data$y)
## RMSE Rsquared MAE
## 1.2803060 0.9335241 1.0168673
With a test-set RMSE of 1.2803, close to the noise floor, MARS is clearly better than KNN. Let's check the other models as well.
SVM
set.seed(1)
svmTuned <- train(x = training.data$x,
y = training.data$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.531367 0.8022380 2.018644
## 0.50 2.274330 0.8152517 1.796103
## 1.00 2.096272 0.8344065 1.648492
## 2.00 1.963889 0.8528505 1.545874
## 4.00 1.893957 0.8618554 1.488518
## 8.00 1.869287 0.8642229 1.482789
## 16.00 1.867230 0.8644765 1.485749
## 32.00 1.867230 0.8644765 1.485749
## 64.00 1.867230 0.8644765 1.485749
## 128.00 1.867230 0.8644765 1.485749
## 256.00 1.867230 0.8644765 1.485749
## 512.00 1.867230 0.8644765 1.485749
## 1024.00 1.867230 0.8644765 1.485749
## 2048.00 1.867230 0.8644765 1.485749
##
## Tuning parameter 'sigma' was held constant at a value of 0.06444911
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06444911 and C = 16.
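caret does not tune sigma for svmRadial; it holds it at an analytic estimate obtained from kernlab's sigest(), which returns low/mid/high percentile estimates of the inverse kernel width. A sketch of that estimate (which value caret picks from the range is an internal detail):
# Sketch: estimate a reasonable sigma range from the predictor matrix.
library(kernlab)
set.seed(1)
sigest(as.matrix(training.data$x))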
svmTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 16
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0644491109951026
##
## Number of Support Vectors : 151
##
## Objective Function Value : -71.1627
## Training error : 0.008521
svm.Predict <- predict(svmTuned, newdata = test.data$x)
postResample(pred = svm.Predict, obs = test.data$y)
## RMSE Rsquared MAE
## 2.0771707 0.8251101 1.5779291
The SVM test-set RMSE is 2.0772, so it does not perform better than MARS.
Neural Network
library(nnet)
## Warning: package 'nnet' was built under R version 3.4.4
set.seed(1)
nnetFit <- nnet(training.data$x,
training.data$y,
size = 5,
decay = 0.01,
linout = TRUE,
trace = FALSE,
maxit = 500,
MaxNWts = 5 * (ncol(training.data$x) + 1) + 5 + 1)
nnetFit
## a 10-5-1 network with 61 weights
## options were - linear output units decay=0.01
nnet.Predict <- predict(nnetFit, newdata = test.data$x)
postResample(pred = nnet.Predict, obs = test.data$y)
## RMSE Rsquared MAE
## 2.6580454 0.7348028 1.8595160
The neural network's test-set RMSE of 2.6580 is also high, so MARS is the winner here with the lowest RMSE.
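To make the comparison explicit, here is a small sketch that collects the test-set metrics computed above into a single table (it reuses the prediction objects already in memory):
# Sketch: gather the postResample() results for all four models and sort
# by test-set RMSE.
results <- rbind(
  KNN  = postResample(knn.Predict,  test.data$y),
  MARS = postResample(mars.Predict, test.data$y),
  SVM  = postResample(svm.Predict,  test.data$y),
  NNet = postResample(nnet.Predict, test.data$y))
results[order(results[, "RMSE"]), ]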
A. Which nonlinear regression model gives the optimal resampling and test set performance?
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")library(DMwR)## Loading required package: grid
knn.df <- knnImputation(ChemicalManufacturingProcess[, 1:57], k = 3, meth = "weighAvg")
anyNA(knn.df)
## [1] FALSE
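For reference, here is a sketch counting missing values per column in the raw data (the unimputed ChemicalManufacturingProcess object is still intact):
# Sketch: the ten columns with the most missing values before imputation.
sort(colSums(is.na(ChemicalManufacturingProcess)), decreasing = TRUE)[1:10]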
With the missing values imputed via k-nearest neighbors, we drop near-zero-variance predictors, scale the remaining predictors, and split the data 80/20:
near_zero <- nearZeroVar(knn.df)
knn.df <- knn.df[, -near_zero]
knn.df[, 2:ncol(knn.df)] <- scale(knn.df[, 2:ncol(knn.df)])
set.seed(1)
inTraining <- createDataPartition(knn.df$Yield, p = 0.80, list=FALSE)
training <- knn.df[ inTraining,]
testing <- knn.df[-inTraining,]
X_train <- training[,2:(length(training))]
Y_train <- training$Yield
X_test <- testing[,2:(length(testing))]
Y_test <- testing$Yield
KNN
set.seed(1)
knnModel <- train(x = X_train,
y = Y_train,
method = "knn",
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 144 samples
## 55 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.481059 0.3814268 1.169101
## 7 1.461855 0.3927446 1.164240
## 9 1.446106 0.4108643 1.160198
## 11 1.441563 0.4189638 1.157509
## 13 1.437135 0.4284012 1.161061
## 15 1.451620 0.4178187 1.178368
## 17 1.457394 0.4149991 1.186073
## 19 1.463996 0.4131880 1.190119
## 21 1.465359 0.4165220 1.192271
## 23 1.474984 0.4111554 1.195094
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 13.
plot(knnModel)
knnPredict <- predict(knnModel, newdata = X_test)
postResample(pred = knnPredict, obs = Y_test)
## RMSE Rsquared MAE
## 1.3438388 0.5473626 1.0784564
The KNN test-set RMSE is 1.3438; let's check the other models as well.
MARS
set.seed(1)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsModel <- train(x = X_train,
y = Y_train,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsModel
## Multivariate Adaptive Regression Spline
##
## 144 samples
## 55 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 130, 130, 131, 129, 130, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 1.363458 0.4558256 1.0689820
## 1 3 1.271861 0.5201694 1.0106995
## 1 4 1.244503 0.5458146 0.9833013
## 1 5 1.238886 0.5546325 0.9841253
## 1 6 1.299142 0.5236747 1.0200561
## 1 7 1.223629 0.5697152 0.9702813
## 1 8 1.203007 0.5985392 0.9565488
## 1 9 1.220538 0.5834291 0.9684082
## 1 10 1.227145 0.5768417 0.9702812
## 1 11 1.286950 0.5436602 0.9708968
## 1 12 1.271059 0.5559122 0.9644951
## 1 13 1.308718 0.5238114 1.0072276
## 1 14 1.298446 0.5340145 1.0073789
## 1 15 1.372028 0.5195326 1.0299789
## 1 16 1.372028 0.5195326 1.0299789
## 1 17 1.321986 0.5308154 1.0143525
## 1 18 1.329896 0.5283427 1.0172826
## 1 19 1.317402 0.5343547 1.0103932
## 1 20 1.325867 0.5318402 1.0134283
## 1 21 1.328594 0.5327647 1.0139789
## 1 22 1.328594 0.5327647 1.0139789
## 1 23 1.328594 0.5327647 1.0139789
## 1 24 1.328594 0.5327647 1.0139789
## 1 25 1.328594 0.5327647 1.0139789
## 1 26 1.328594 0.5327647 1.0139789
## 1 27 1.328594 0.5327647 1.0139789
## 1 28 1.355204 0.5237057 1.0282422
## 1 29 1.354034 0.5244783 1.0267683
## 1 30 1.359385 0.5194798 1.0336388
## 1 31 1.357839 0.5216488 1.0395689
## 1 32 1.357400 0.5215103 1.0374248
## 1 33 1.356468 0.5222010 1.0277678
## 1 34 1.353343 0.5237420 1.0249071
## 1 35 1.381416 0.5144563 1.0454328
## 1 36 1.381416 0.5144563 1.0454328
## 1 37 1.400371 0.5142342 1.0479364
## 1 38 1.400371 0.5142342 1.0479364
## 2 2 1.363458 0.4558256 1.0689820
## 2 3 1.300992 0.4950000 1.0266650
## 2 4 1.342267 0.4950703 1.0570482
## 2 5 1.352331 0.4642087 1.0690975
## 2 6 1.354055 0.4921812 1.0980620
## 2 7 1.341966 0.4998700 1.0875600
## 2 8 1.371679 0.4920043 1.1089012
## 2 9 1.827149 0.4795344 1.2727604
## 2 10 1.892205 0.4624672 1.3152066
## 2 11 1.805761 0.4863751 1.2596316
## 2 12 1.341988 0.5111124 1.0931730
## 2 13 1.946641 0.4289931 1.3086528
## 2 14 1.918164 0.4381737 1.2851618
## 2 15 1.950743 0.4353344 1.2949747
## 2 16 2.098043 0.4133506 1.3404870
## 2 17 2.141945 0.4045376 1.3682530
## 2 18 2.242581 0.3862486 1.4101273
## 2 19 2.276982 0.3700169 1.4315775
## 2 20 2.275432 0.3711643 1.4234879
## 2 21 2.270367 0.3770002 1.4233780
## 2 22 2.259491 0.3801502 1.4122387
## 2 23 2.257904 0.3795131 1.4106334
## 2 24 2.260496 0.3769517 1.4106286
## 2 25 2.186676 0.3752172 1.3974153
## 2 26 2.178747 0.3845206 1.3954909
## 2 27 2.183187 0.3824566 1.3991446
## 2 28 2.183187 0.3824566 1.3991446
## 2 29 2.183187 0.3824566 1.3991446
## 2 30 2.183187 0.3824566 1.3991446
## 2 31 2.183187 0.3824566 1.3991446
## 2 32 2.183187 0.3824566 1.3991446
## 2 33 2.183187 0.3824566 1.3991446
## 2 34 2.183187 0.3824566 1.3991446
## 2 35 2.183187 0.3824566 1.3991446
## 2 36 2.183187 0.3824566 1.3991446
## 2 37 2.183187 0.3824566 1.3991446
## 2 38 2.183187 0.3824566 1.3991446
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 8 and degree = 1.
plot(marsModel)
marsPredict <- predict(marsModel, newdata = X_test)
postResample(pred = marsPredict, obs = Y_test)
## RMSE Rsquared MAE
## 1.1870875 0.6182487 0.9690215
MARS's lower test-set RMSE of 1.1871 makes it a better choice than KNN.
Neural Net
set.seed(1)
nnetFit <- nnet(X_train,
Y_train,
size = 5,
decay = 0.01,
linout = TRUE,
trace = FALSE,
maxit = 500,
MaxNWts = 5 * (ncol(X_train) + 1) + 5 + 1)
nnetFit
## a 55-5-1 network with 286 weights
## options were - linear output units decay=0.01
nnetFit.Predict <- predict(nnetFit, newdata = X_test)
postResample(pred = nnetFit.Predict, obs = Y_test)
## RMSE Rsquared MAE
## 2.1204917 0.5387486 1.6984016
The neural network does not beat MARS either.
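Note that this is a single nnet fit with a fixed size and decay; a fairer comparison would tune both through caret. A sketch (assuming the same X_train/Y_train split; not run here):
# Sketch: resample over a small grid of hidden-unit counts and weight decay.
set.seed(1)
nnetTuned <- train(x = X_train, y = Y_train,
                   method = "nnet",
                   preProc = c("center", "scale"),
                   tuneGrid = expand.grid(size = c(3, 5, 7),
                                          decay = c(0, 0.01, 0.1)),
                   linout = TRUE, trace = FALSE, maxit = 500,
                   trControl = trainControl(method = "cv"))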
SVM
set.seed(1)
svmModel <- train(x = X_train,
y = Y_train,
method = "svmRadial",
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 55 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 130, 130, 131, 129, 130, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.416423 0.5028799 1.1605561
## 0.50 1.319232 0.5346143 1.0833591
## 1.00 1.235756 0.5760082 1.0064023
## 2.00 1.222457 0.5768255 0.9843147
## 4.00 1.227817 0.5698084 0.9874162
## 8.00 1.226535 0.5750713 0.9946620
## 16.00 1.227513 0.5748177 0.9962901
## 32.00 1.227513 0.5748177 0.9962901
## 64.00 1.227513 0.5748177 0.9962901
## 128.00 1.227513 0.5748177 0.9962901
## 256.00 1.227513 0.5748177 0.9962901
## 512.00 1.227513 0.5748177 0.9962901
## 1024.00 1.227513 0.5748177 0.9962901
## 2048.00 1.227513 0.5748177 0.9962901
##
## Tuning parameter 'sigma' was held constant at a value of 0.01559259
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01559259 and C = 2.
svm.Predict <- predict(svmModel, newdata = X_test)
postResample(pred = svm.Predict, obs = Y_test)
## RMSE Rsquared MAE
## 0.9822078 0.7455355 0.7853845
The SVM test-set RMSE of 0.9822 is the lowest, making it the best model here.
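Again, a small sketch collecting the test-set metrics into one table makes the ranking explicit:
# Sketch: gather the test-set metrics for all four models and sort by RMSE.
results <- rbind(
  KNN  = postResample(knnPredict,      Y_test),
  MARS = postResample(marsPredict,     Y_test),
  NNet = postResample(nnetFit.Predict, Y_test),
  SVM  = postResample(svm.Predict,     Y_test))
results[order(results[, "RMSE"]), ]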
B. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
Feature Importance
varImp(marsModel)
## earth variable importance
##
## only 20 most important variables shown (out of 55)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 58.42
## ManufacturingProcess13 40.07
## ManufacturingProcess22 28.34
## ManufacturingProcess01 28.34
## ManufacturingProcess39 21.96
## ManufacturingProcess12 0.00
## BiologicalMaterial10 0.00
## ManufacturingProcess04 0.00
## ManufacturingProcess36 0.00
## ManufacturingProcess37 0.00
## ManufacturingProcess28 0.00
## ManufacturingProcess15 0.00
## ManufacturingProcess10 0.00
## ManufacturingProcess20 0.00
## BiologicalMaterial03 0.00
## BiologicalMaterial01 0.00
## ManufacturingProcess18 0.00
## ManufacturingProcess06 0.00
## ManufacturingProcess11 0.00
Manufacturing process variables dominate the MARS importance list, with ManufacturingProcess32 ranked highest overall.
varImp(svmModel)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 55)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 84.75
## ManufacturingProcess36 71.84
## BiologicalMaterial03 70.69
## ManufacturingProcess13 68.69
## BiologicalMaterial02 64.16
## ManufacturingProcess31 63.17
## BiologicalMaterial12 60.11
## ManufacturingProcess17 57.19
## ManufacturingProcess09 55.42
## ManufacturingProcess33 54.25
## BiologicalMaterial04 51.35
## ManufacturingProcess29 47.17
## BiologicalMaterial11 46.74
## ManufacturingProcess06 46.73
## BiologicalMaterial01 41.19
## BiologicalMaterial08 38.75
## ManufacturingProcess11 29.49
## ManufacturingProcess26 29.35
## BiologicalMaterial09 26.01
For the SVM, both manufacturing process and biological material variables make the top 10, in a fairly even split, with ManufacturingProcess32 again highest overall.
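The question also asks how these compare with the top ten predictors from the optimal linear model. That model is not refit here, but as a sketch (assuming PLS as the linear benchmark, fit on the same split), one could run:
# Sketch: fit a PLS model on the same training split and compare its
# top-ten importance ranking with the SVM list above.
set.seed(1)
plsModel <- train(x = X_train, y = Y_train,
                  method = "pls",
                  tuneLength = 20,
                  trControl = trainControl(method = "cv"))
plot(varImp(plsModel), top = 10)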
C. Explore the relationship between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
Let’s take a look at the SVM model
svmModel$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 2
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0155925867216759
##
## Number of Support Vectors : 126
##
## Objective Function Value : -67.9252
## Training error : 0.095408
xyplot(Y_test ~ predict(svmModel, X_test),
type = c("p", "r"),
xlab = "Predicted", ylab = "Observed")predicted <- predict(svmModel, X_test)
actual <- Y_test
xyplot((predicted-actual) ~ predicted,
type = c("p", "g"),
xlab = "Predicted radial svm", ylab = "Residuals")The residual plots doesnot indicate any significant pattern but the ‘Actual Vs. Predicted’ plot shows very close correlation.
dotPlot(varImp(svmModel), top = 20)
The dot plot above shows how the top 20 features rank for the SVM model: manufacturing process and biological material variables are evenly distributed, with ManufacturingProcess32 topping the list.
Now let's build a feature plot for the top six predictors and examine their correlation with yield.
predictors.val <- c("ManufacturingProcess32", "BiologicalMaterial06", "ManufacturingProcess36", "BiologicalMaterial03", "ManufacturingProcess13", "BiologicalMaterial02")
featurePlot(X_train[, predictors.val], Y_train)
cor(X_train[, predictors.val], Y_train)
## [,1]
## ManufacturingProcess32 0.6391015
## BiologicalMaterial06 0.5037752
## ManufacturingProcess36 -0.5417157
## BiologicalMaterial03 0.4578673
## ManufacturingProcess13 -0.4470691
## BiologicalMaterial02 0.5060191
The top predictors show moderate correlation with yield: ManufacturingProcess36 and ManufacturingProcess13 are negatively correlated with yield, while ManufacturingProcess32 and the biological material predictors are positively correlated.
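To see the shape of each relationship rather than just its linear strength, a smoother can be overlaid on the scatter plots (a sketch using featurePlot's lattice pass-through options):
# Sketch: scatter plots of the top six predictors vs. yield with a loess
# smoother on each panel.
featurePlot(X_train[, predictors.val], Y_train,
            type = c("p", "smooth"), span = 0.7, layout = c(3, 2))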
References:
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1-67.