Kuhn and Johnson
7.2) Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data.
y = 10sin(\(\pi\)\(x_1x_2\)) + 20(\(x_3\) - 0.5)\(^2\) + 10\(x_4\) + 5\(x_5\) + N(0,\(\sigma^2\))
where the x values are random variables uniformly distributed between [0,1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data.
library(mlbench)
library(caret)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd=1)
##We conver the 'x' data from a matrix to a dataframe.
##One reason is that this will give us column names.
trainingData$x <- data.frame(trainingData$x)
##Look at the data using
featurePlot(trainingData$x, trainingData$y)
##This creates a list with a vector 'y' and a matrix of predictors 'x'.
There appears to be a positive correlation between the predictor variables x1, x2, x4, and x5 with y. There does not appear to be any correlation between the predictor variables x3, x6, x7, x8, x9 and x10 with y.
Estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd=1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
knnModel <- train(x=trainingData$x,
y=trainingData$y,
method="knn",
preProc=c("center","scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.565620 0.4887976 2.886629
## 7 3.422420 0.5300524 2.752964
## 9 3.368072 0.5536927 2.715310
## 11 3.323010 0.5779056 2.669375
## 13 3.275835 0.6030846 2.628663
## 15 3.261864 0.6163510 2.621192
## 17 3.261973 0.6267032 2.616956
## 19 3.286299 0.6281075 2.640585
## 21 3.280950 0.6390386 2.643807
## 23 3.292397 0.6440392 2.656080
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
knnPred <- predict(knnModel, newdata = testData$x)
##The function 'postResample' can be used to get the test set performance values
postResample(pred=knnPred,obs=testData$y)
## RMSE Rsquared MAE
## 3.1750657 0.6785946 2.5443169
MARS (Multivariate Adaptive Regression Splines)
library(earth)
marsFit <- earth(trainingData$x, trainingData$y)
marsFit
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
varImp(marsFit)
## Overall
## X1 100.00000
## X4 84.21578
## X2 67.21639
## X5 45.44416
## X3 34.63259
## X6 11.90397
## X7 0.00000
## X8 0.00000
## X9 0.00000
## X10 0.00000
marsPred_simple <- predict(marsFit, newdata = testData$x)
postResample(pred=marsPred_simple,obs=testData$y)
## RMSE Rsquared MAE
## 1.8136467 0.8677298 1.3911836
marsGrid <- expand.grid(.degree=1:4, .nprune=2:30)
set.seed(50)
marsTuned <- train(trainingData$x,
trainingData$y,
method="earth",
tuneGrid = marsGrid,
trControl = trainControl(method="cv"))
#marsTuned
varImp(marsTuned)
## earth variable importance
##
## Overall
## X1 100.00
## X4 85.13
## X2 69.22
## X5 49.28
## X3 39.95
## X7 0.00
## X9 0.00
## X6 0.00
## X10 0.00
## X8 0.00
plot(marsTuned)
marsPred <- predict(marsTuned, newdata = testData$x)
##The function 'postResample' can be used to get the test set performance values
postResample(pred=marsPred,obs=testData$y)
## RMSE Rsquared MAE
## 1.1722635 0.9448890 0.9324923
Support Vector Machines
library(kernlab)
svmRTuned <- train(trainingData$x, trainingData$y,
method="svmRadial",
preProc=c("center","scale"),
tuneLength = 14,
trControl = trainControl(method="cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.499568 0.8106425 2.005232
## 0.50 2.239312 0.8302888 1.799795
## 1.00 2.064409 0.8521404 1.637527
## 2.00 1.976213 0.8609107 1.553383
## 4.00 1.907789 0.8668323 1.505581
## 8.00 1.878847 0.8704239 1.508344
## 16.00 1.877761 0.8708067 1.508296
## 32.00 1.877761 0.8708067 1.508296
## 64.00 1.877761 0.8708067 1.508296
## 128.00 1.877761 0.8708067 1.508296
## 256.00 1.877761 0.8708067 1.508296
## 512.00 1.877761 0.8708067 1.508296
## 1024.00 1.877761 0.8708067 1.508296
## 2048.00 1.877761 0.8708067 1.508296
##
## Tuning parameter 'sigma' was held constant at a value of 0.0702961
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0702961 and C = 16.
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 16
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0702961005968861
##
## Number of Support Vectors : 152
##
## Objective Function Value : -63.3221
## Training error : 0.008588
svmPred <- predict(svmRTuned, newdata = testData$x)
postResample(pred=svmPred,obs=testData$y)
## RMSE Rsquared MAE
## 2.0930338 0.8226734 1.5906572
Linear Kernel
svmRTuned <- train(trainingData$x, trainingData$y,
method="svmLinear",
preProc=c("center","scale"),
tuneLength = 14,
trControl = trainControl(method="cv"))
svmRTuned
## Support Vector Machines with Linear Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.456188 0.7596526 1.962155
##
## Tuning parameter 'C' was held constant at a value of 1
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 172
##
## Objective Function Value : -54.9759
## Training error : 0.216921
svmPred <- predict(svmRTuned, newdata = testData$x)
postResample(pred=svmPred,obs=testData$y)
## RMSE Rsquared MAE
## 2.7633860 0.6973384 2.0970616
Neural Networks
library(nnet)
nnetFit <- nnet(trainingData$x, trainingData$y,
size=5,
decay=0.01,
trace=FALSE,
maxit=500,
MaxNWts=5*(ncol(trainingData$x)+1)+5+1)
nnetFit
## a 10-5-1 network with 61 weights
## options were - decay=0.01
nnetPred <- predict(nnetFit, newdata = testData$x)
postResample(pred=nnetPred,obs=testData$y)
## RMSE Rsquared MAE
## 14.2769353 0.2779778 13.3869179
tooHigh <- findCorrelation(cor(trainingData$x),cutoff=0.7)
trainXnnet <- trainingData$x[,-tooHigh]
testXnnet <- testData$x[,-tooHigh]
nnetGrid <- expand.grid(.decay=c(0,0.01,.1),
.size=c(1:10),
.bag=FALSE)
set.seed(100)
nnetTune <- train(trainingData$x, trainingData$y,
method="avNNet",
tuneGrid = nnetGrid,
trControl = trainControl(method="cv",number=10),
preProc=c("center","scale"),
linout=TRUE,
trace=FALSE,
MaxNWts =10*(ncol(trainingData$x)+1) +10 +1,
maxit=500)
nnetTune
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.451722 0.7625351 1.896021
## 0.00 2 2.474140 0.7564072 1.936452
## 0.00 3 2.091278 0.8235551 1.659173
## 0.00 4 2.054404 0.8209165 1.606800
## 0.00 5 2.331057 0.7749136 1.741685
## 0.00 6 2.828831 0.7173701 2.095881
## 0.00 7 3.172453 0.6663624 2.218665
## 0.00 8 4.496342 0.5434195 3.067567
## 0.00 9 5.994543 0.4233180 3.375984
## 0.00 10 4.130111 0.5554640 2.803012
## 0.01 1 2.427639 0.7621863 1.887533
## 0.01 2 2.503912 0.7476021 1.943163
## 0.01 3 2.074827 0.8228616 1.604177
## 0.01 4 2.112135 0.8173594 1.688663
## 0.01 5 2.110145 0.8198569 1.657632
## 0.01 6 2.299108 0.7890686 1.823859
## 0.01 7 2.348609 0.7760174 1.872983
## 0.01 8 2.519307 0.7463523 2.042896
## 0.01 9 2.511074 0.7368007 1.991015
## 0.01 10 2.486297 0.7631709 1.941586
## 0.10 1 2.435364 0.7608881 1.890014
## 0.10 2 2.471661 0.7510586 1.915104
## 0.10 3 2.239584 0.7983200 1.766737
## 0.10 4 2.175860 0.8134068 1.765179
## 0.10 5 2.083732 0.8249385 1.654930
## 0.10 6 2.123293 0.8107443 1.676428
## 0.10 7 2.117684 0.8216322 1.656338
## 0.10 8 2.322703 0.7771985 1.813793
## 0.10 9 2.293910 0.7812604 1.850508
## 0.10 10 2.437919 0.7608963 1.971687
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0 and bag
## = FALSE.
nnetPred <- predict(nnetTune, newdata = testData$x)
postResample(pred=nnetPred,obs=testData$y)
## RMSE Rsquared MAE
## 2.1619014 0.8160074 1.6498984
library(knitr)
compare_table <- data.frame(list(knn=3.18,MARS_Fit=1.81, MARS_Tuned=1.17, SVM_radial=2.08, SVM_linear=2.76, NeuralNet_Fit=14.3, NeuralNet_Tune=2.16), stringsAsFactors=FALSE)
rownames(compare_table) <- c("RMSE")
kable(compare_table, caption="Model Comparison")
knn | MARS_Fit | MARS_Tuned | SVM_radial | SVM_linear | NeuralNet_Fit | NeuralNet_Tune | |
---|---|---|---|---|---|---|---|
RMSE | 3.18 | 1.81 | 1.17 | 2.08 | 2.76 | 14.3 | 2.16 |
The model with the lowest root mean square error is the tuned MARS model. The model had 14 terms (nprune = 14) and was of degree = 2. MARS selects the informative predictors: x1, x4, x2, x5 and x3.
7.5) Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting and pre-processing steps as before and train several nonlinear regression models.
The data set ChemicalManufacturingProcess is loaded.
I imputed the median of a column for missing values. The data is separated into a training set and testing set. Preprocess data to remove 19 highly correlated variables.
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
impute.median <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))
for (i in 1:ncol(ChemicalManufacturingProcess)){
ChemicalManufacturingProcess[,i] <- impute.median(ChemicalManufacturingProcess[,i])
}
train_chem <- ChemicalManufacturingProcess[1:130,]
test_chem <- ChemicalManufacturingProcess[131:176,]
highCorr_chem <- findCorrelation(cor(train_chem[,2:58]), cutoff=.7, exact=TRUE)
length(highCorr_chem)
## [1] 19
filtered_train_chem <- train_chem[,-highCorr_chem]
ncol(filtered_train_chem)
## [1] 39
correlation_chem <- cor(filtered_train_chem, method = "pearson")
#summary(ChemicalManufacturingProcess)
Nonlinear Regression Models
k Nearest Neighbors
knnModel_chem <- train(x=train_chem[,-1],
y=train_chem$Yield,
method="knn",
preProc=c("center","scale"),
tuneLength = 10)
knnModel_chem
## k-Nearest Neighbors
##
## 130 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 130, 130, 130, 130, 130, 130, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.294415 0.4952211 1.045348
## 7 1.315476 0.4812050 1.064978
## 9 1.330273 0.4770898 1.084025
## 11 1.361879 0.4566953 1.114846
## 13 1.380155 0.4459067 1.126567
## 15 1.390935 0.4451186 1.133090
## 17 1.402719 0.4420705 1.140357
## 19 1.412617 0.4397468 1.147891
## 21 1.412668 0.4452244 1.148591
## 23 1.423823 0.4438304 1.158805
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
knnPred_chem <- predict(knnModel_chem, newdata = test_chem[,-1])
##The function 'postResample' can be used to get the test set performance values
postResample(pred=knnPred_chem,obs=test_chem$Yield)
## RMSE Rsquared MAE
## 1.8164938 0.1849927 1.4630870
MARS
marsGrid <- expand.grid(.degree=1:4, .nprune=2:30)
set.seed(50)
marsTuned <- train(x=train_chem[,-1],
y=train_chem$Yield,
method="earth",
tuneGrid = marsGrid,
trControl = trainControl(method="cv"))
#marsTuned
varImp(marsTuned)
## earth variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess17 100.00
## ManufacturingProcess28 64.29
## ManufacturingProcess32 61.23
## BiologicalMaterial06 44.07
## ManufacturingProcess13 40.73
## ManufacturingProcess09 40.73
## ManufacturingProcess16 40.73
## BiologicalMaterial12 22.54
## BiologicalMaterial02 18.20
## ManufacturingProcess04 12.69
## ManufacturingProcess36 0.00
## BiologicalMaterial07 0.00
## ManufacturingProcess33 0.00
## BiologicalMaterial04 0.00
## ManufacturingProcess05 0.00
## ManufacturingProcess38 0.00
## ManufacturingProcess45 0.00
## ManufacturingProcess27 0.00
## ManufacturingProcess34 0.00
## ManufacturingProcess39 0.00
plot(marsTuned)
marsPred <- predict(marsTuned, newdata = test_chem[,-1])
##The function 'postResample' can be used to get the test set performance values
postResample(pred=marsPred,obs=test_chem$Yield)
## RMSE Rsquared MAE
## 2.36737380 0.06714247 2.05351782
Support Vector Machines Radial Basis Kernel
svmRTuned_rad <- train(x=train_chem[,-1],
y=train_chem$Yield,
method="svmRadial",
preProc=c("center","scale"),
tuneLength = 14,
trControl = trainControl(method="cv"))
svmRTuned_rad
## Support Vector Machines with Radial Basis Function Kernel
##
## 130 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 117, 118, 118, 116, 117, 117, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.3464649 0.6176193 1.0788984
## 0.50 1.1961665 0.6568694 0.9709388
## 1.00 1.0767435 0.7008372 0.8671836
## 2.00 1.0186969 0.7155605 0.8127638
## 4.00 1.0014491 0.7159753 0.7851659
## 8.00 0.9868678 0.7215477 0.7704698
## 16.00 0.9821933 0.7243430 0.7657744
## 32.00 0.9821933 0.7243430 0.7657744
## 64.00 0.9821933 0.7243430 0.7657744
## 128.00 0.9821933 0.7243430 0.7657744
## 256.00 0.9821933 0.7243430 0.7657744
## 512.00 0.9821933 0.7243430 0.7657744
## 1024.00 0.9821933 0.7243430 0.7657744
## 2048.00 0.9821933 0.7243430 0.7657744
##
## Tuning parameter 'sigma' was held constant at a value of 0.01092131
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01092131 and C = 16.
svmRTuned_rad$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 16
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0109213068494943
##
## Number of Support Vectors : 110
##
## Objective Function Value : -74.2534
## Training error : 0.00899
svmPred_rad <- predict(svmRTuned_rad, newdata = train_chem[,-1])
postResample(pred=svmPred_rad,obs=train_chem$Yield)
## RMSE Rsquared MAE
## 0.1744533 0.9926527 0.1697655
Linear Kernel
svmRTuned <- train(x=train_chem[,-1],
y=train_chem$Yield,
method="svmLinear",
preProc=c("center","scale"),
tuneLength = 14,
trControl = trainControl(method="cv"))
svmRTuned
## Support Vector Machines with Linear Kernel
##
## 130 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 117, 116, 118, 117, 118, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.265437 0.5117143 1.20683
##
## Tuning parameter 'C' was held constant at a value of 1
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 110
##
## Objective Function Value : -22.4379
## Training error : 0.136722
svmPred_lin <- predict(svmRTuned, newdata = train_chem[,-1])
postResample(pred=svmPred_lin,obs=train_chem$Yield)
## RMSE Rsquared MAE
## 0.6803201 0.8625641 0.4640489
Neural Network
nnetFit_chem <- nnet(x=train_chem[,-1],
y=train_chem$Yield,
size=5,
decay=c(0,0.01,.1),
trace=FALSE,
maxit=500,
MaxNWts=5*(ncol(train_chem[,-1])+1)+5+1)
nnetFit_chem
## a 57-5-1 network with 296 weights
## options were -
nnetPred_chem <- predict(nnetFit_chem, newdata = test_chem[,-1])
postResample(pred=nnetPred_chem,obs=test_chem$Yield)
## RMSE Rsquared MAE
## 37.98425 NA 37.96478
compare_table <- data.frame(list(knn=1.82, MARS_Tuned=2.37, SVM_radial=0.17, SVM_linear=0.68, NeuralNet=37.98), stringsAsFactors=FALSE)
rownames(compare_table) <- c("RMSE")
kable(compare_table, caption="Model Comparison")
knn | MARS_Tuned | SVM_radial | SVM_linear | NeuralNet | |
---|---|---|---|---|---|
RMSE | 1.82 | 2.37 | 0.17 | 0.68 | 37.98 |
The Support Vector Machine using a radial basis function gave a model with the lowest root mean square error.
Imp_chem <- varImp(svmRTuned_rad)
plot(Imp_chem, top = 10)
The manufacturing process variables are the most important predictors in the SVM model.
Biological Material 06 and Manufacturing Process 36 are among the top 10 predictors for the top linear model and top nonlinear models.
top_predictors <- ChemicalManufacturingProcess[,c("Yield","ManufacturingProcess17","ManufacturingProcess13","ManufacturingProcess09","ManufacturingProcess32","BiologicalMaterial06","ManufacturingProcess06","ManufacturingProcess11","ManufacturingProcess36","ManufacturingProcess30","ManufacturingProcess03")]
#head(top_predictors)
correlation_chem_top_pred <- cor(top_predictors, method = "pearson")
corrplot::corrplot(correlation_chem_top_pred, type="upper", method="color")
The yield is positively correlated to biological process 06, manufacturing process 09,32,06,11 and 30. The yield is negatively correlated to manufacturing process 17, 13 and 36. The correlations are fairly weak.