where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
knn_fit<- train(x=trainingData$x,y=trainingData$y, method = "knn", preProcess = c("center", "scale"), tuneLength = 10)
knnPred <- predict(knn_fit, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.1750657 0.6785946 2.5443169
varImp((knn_fit))
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
mars_fit<- train(x = trainingData$x, y = trainingData$y, method = "earth",
preProcess = c("center", "scale"), tuneLength = 10)
marsPred <- predict(mars_fit, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.776575 0.872700 1.358367
impv_mars <- varImp(mars_fit)
impv_mars
## earth variable importance
##
## Overall
## X1 100.000
## X4 84.065
## X2 66.860
## X5 44.678
## X3 33.508
## X6 7.475
## X7 0.000
## X8 0.000
## X9 0.000
## X10 0.000
MARS outperforms the KNN model with least RMSE and higher R-squared. IN addition, MARS is using only X1-X5 as the informative predictors.
Load the data using Applied Predictive Modeling library and extract matrix processPredictors as predictors and yield as response.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# http://appliedpredictivemodeling.com/errata
processPredictors = as.matrix(ChemicalManufacturingProcess[,2:58])
yield = ChemicalManufacturingProcess[,1]
Using the similar approach followed in 6.3, we will split data and preprocess.
We will use createDataPartition function from caret package to split the data into training and testing sets. Using general rule, we will split in 75-25 ration for our testing and trainig data.
Preprocess: We will use preProcess function from the caret package to process our data for training the model. Within that function, nzv identifies the numeric predictors with near zero variance and remove them rom the model, corr identifies highly correlated predictors and filter them out, center subtracts the mean of the predictor’s data from the predictor values, scale method divides values by the standard deviation of the predictor, BoxCox will be simply transforming the data, and medianimpute will impute the missing values.
https://machinelearningmastery.com/pre-process-your-dataset-in-r/
train_r <- createDataPartition(yield, p=0.75, list=F)
pp_train <- ChemicalManufacturingProcess[train_r,-1]
y_train <- ChemicalManufacturingProcess[train_r,1]
pp_test <- ChemicalManufacturingProcess[-train_r,-1]
y_test <- ChemicalManufacturingProcess[-train_r,1]
p_pro <- c("nzv", "corr", "center","scale", "medianImpute")
The data is splitted into training and test sets. Preporcessing methods are defined and will be utitlized during model building. #### Run the PLS model for comparison
t_ctrl <- trainControl(method = "repeatedcv", repeats = 5)
pls_fit<-train(pp_train, y_train, method="pls", tuneLength = 10,preProcess=p_pro, trainControl=t_ctrl)
pls_fit
## Partial Least Squares
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (46), scaled (46), median imputation (46),
## remove (11)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.649499 0.3473544 1.234225
## 2 2.237693 0.3036385 1.314258
## 3 1.826627 0.3778304 1.243423
## 4 2.105085 0.3719592 1.276270
## 5 2.451145 0.3550204 1.355700
## 6 2.735526 0.3245468 1.429938
## 7 3.099563 0.2953942 1.514624
## 8 3.455185 0.2804582 1.586962
## 9 3.842599 0.2779483 1.661117
## 10 4.172390 0.2622035 1.734641
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.
plot(pls_fit)
pls_pred <- predict(pls_fit, pp_test)
postResample(pred = pls_pred, obs = y_test)
## RMSE Rsquared MAE
## 1.3549402 0.4366454 1.0930409
knn_fit <- train(pp_train, y_train, method="knn", preProcess=p_pro, tuneLength=10, trainControl=t_ctrl)
knn_pred <- predict(knn_fit, newdata=pp_test)
knn_fit
## k-Nearest Neighbors
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (46), scaled (46), median imputation (46),
## remove (11)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.503325 0.3789061 1.193593
## 7 1.482005 0.3884756 1.189932
## 9 1.490838 0.3778357 1.196135
## 11 1.482797 0.3810628 1.195776
## 13 1.463919 0.3992910 1.181296
## 15 1.464999 0.4017237 1.190885
## 17 1.461220 0.4110234 1.187636
## 19 1.460545 0.4203195 1.186977
## 21 1.460777 0.4270189 1.184834
## 23 1.472240 0.4230811 1.189777
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 19.
plot(knn_fit)
knn_m <-postResample(pred=knn_pred,y_test)
svm_fit <- train(pp_train, y_train, method="svmRadial", preProcess=p_pro, tuneLength=10, trainControl=t_ctrl)
svm_fit
## Support Vector Machines with Radial Basis Function Kernel
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (46), scaled (46), median imputation (46),
## remove (11)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.516047 0.4313404 1.214062
## 0.50 1.438925 0.4526654 1.155476
## 1.00 1.377357 0.4829551 1.108218
## 2.00 1.351905 0.4944348 1.089402
## 4.00 1.357344 0.4896563 1.089482
## 8.00 1.360609 0.4859543 1.087695
## 16.00 1.361029 0.4854763 1.087915
## 32.00 1.361029 0.4854763 1.087915
## 64.00 1.361029 0.4854763 1.087915
## 128.00 1.361029 0.4854763 1.087915
##
## Tuning parameter 'sigma' was held constant at a value of 0.01686745
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01686745 and C = 2.
plot(svm_fit)
svm_pred <- predict(svm_fit, newdata=pp_test)
svm_m <- postResample(pred=svm_pred,y_test)
mars_fit <- train(pp_train, y_train, method="earth", preProcess=p_pro, tuneLength=10)
mars_fit
## Multivariate Adaptive Regression Spline
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (46), scaled (46), median imputation (46),
## remove (11)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 2 1.542819 0.3387305 1.217911
## 3 1.432818 0.4343790 1.154482
## 5 1.799623 0.4205390 1.235728
## 7 2.020888 0.3969565 1.303598
## 8 2.453527 0.3724294 1.387364
## 10 2.851682 0.3460168 1.469651
## 12 3.030238 0.3236064 1.576174
## 13 3.008100 0.3268116 1.582054
## 15 3.608152 0.3136720 1.690135
## 17 3.600956 0.3173296 1.682631
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 1.
plot(mars_fit)
mars_pred <- predict(mars_fit, newdata=pp_test)
mars_m <- postResample(pred=mars_pred,y_test)
## Create a specific candidate set of models to evaluate
nnGrid <- expand.grid(size=seq(from = 1, to = 10, by = 1),
decay = seq(from = 0.1, to = 0.5, by = 0.1))
nnet_fit <- train(pp_train, y_train, method="nnet", tuneGrid = nnGrid,
preProcess= p_pro, linout=T,trace=F,
MaxNWts=10 * (ncol(pp_train)+1) + 10 + 1, maxit=500)
nnet_fit
## Neural Network
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (46), scaled (46), median imputation (46),
## remove (11)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0.1 2.177066 0.2631097 1.651179
## 1 0.2 2.483223 0.2337543 1.735573
## 1 0.3 2.718187 0.2005200 1.801851
## 1 0.4 2.506660 0.2264149 1.664771
## 1 0.5 2.892222 0.1848099 1.860306
## 2 0.1 2.934009 0.1984823 2.128258
## 2 0.2 3.110881 0.1575318 2.174205
## 2 0.3 3.550210 0.1302530 2.481438
## 2 0.4 3.251780 0.1478676 2.263563
## 2 0.5 3.353747 0.1481891 2.340123
## 3 0.1 3.227824 0.1513529 2.224513
## 3 0.2 2.945425 0.1661075 2.107094
## 3 0.3 3.214234 0.1550977 2.230540
## 3 0.4 3.112763 0.1474562 2.162661
## 3 0.5 2.995537 0.1612249 2.062713
## 4 0.1 2.881199 0.1533350 2.082091
## 4 0.2 2.808203 0.2014249 1.978360
## 4 0.3 2.904353 0.1898917 1.957424
## 4 0.4 2.868694 0.1855947 1.935946
## 4 0.5 2.757338 0.2008770 1.882012
## 5 0.1 2.803381 0.1910496 1.990928
## 5 0.2 2.676518 0.1973473 1.908357
## 5 0.3 2.666967 0.1922784 1.861332
## 5 0.4 2.619213 0.2204461 1.802472
## 5 0.5 2.603762 0.2215553 1.793889
## 6 0.1 2.707496 0.1952622 1.967732
## 6 0.2 2.511764 0.2104622 1.820560
## 6 0.3 2.669908 0.2093892 1.796526
## 6 0.4 2.519129 0.2215016 1.773140
## 6 0.5 2.594186 0.2346017 1.758269
## 7 0.1 2.609833 0.1751048 1.948440
## 7 0.2 2.642270 0.1869776 1.841622
## 7 0.3 2.603007 0.2078686 1.813747
## 7 0.4 2.555743 0.2040786 1.743659
## 7 0.5 2.704462 0.2092016 1.775572
## 8 0.1 2.834882 0.1296304 2.172001
## 8 0.2 2.857053 0.1407029 2.025424
## 8 0.3 2.626982 0.2055718 1.785611
## 8 0.4 2.568197 0.2360317 1.740569
## 8 0.5 2.485839 0.2148556 1.723779
## 9 0.1 3.002221 0.1076479 2.248735
## 9 0.2 2.588830 0.1709720 1.909529
## 9 0.3 2.669724 0.1757115 1.914257
## 9 0.4 2.605848 0.2041841 1.819743
## 9 0.5 2.399118 0.2436973 1.644230
## 10 0.1 2.903608 0.1361786 2.256942
## 10 0.2 2.762159 0.1887588 2.047491
## 10 0.3 2.593729 0.1952917 1.903201
## 10 0.4 2.476836 0.1926449 1.827659
## 10 0.5 2.537011 0.2102454 1.734376
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.1.
plot(nnet_fit)
nnet_pred <- predict(nnet_fit, newdata=pp_test)
nnet_m <- postResample(pred=nnet_pred,y_test)
comp_mat <- data.frame(rbind(knn_m, svm_m,mars_m, nnet_m))
comp_mat$MODEL <- c("KNN","SVM","MARS","NN")
comp_mat <- comp_mat %>% select(MODEL,RMSE, Rsquared,MAE)
rownames(comp_mat)<- c()
comp_mat
## MODEL RMSE Rsquared MAE
## 1 KNN 1.379848 0.5021641 1.1180816
## 2 SVM 1.046721 0.7062339 0.7903611
## 3 MARS 1.172835 0.6120368 0.8761763
## 4 NN 1.648574 0.3786380 1.3142114
SVM has better performance than other models with lowest RMSE and highest Rsquared value in all four models. We used similar preprocessing as it was used in the PLS model.
impv_pls <- varImp(pls_fit)
## Warning: package 'pls' was built under R version 3.5.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
impv_pls
## pls variable importance
##
## only 20 most important variables shown (out of 46)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess36 89.33
## BiologicalMaterial06 84.01
## ManufacturingProcess13 81.24
## BiologicalMaterial03 74.58
## ManufacturingProcess09 74.08
## ManufacturingProcess33 68.95
## BiologicalMaterial12 67.74
## BiologicalMaterial08 67.15
## ManufacturingProcess17 64.59
## ManufacturingProcess06 61.27
## BiologicalMaterial01 60.64
## ManufacturingProcess12 57.88
## ManufacturingProcess28 52.89
## ManufacturingProcess04 50.35
## ManufacturingProcess43 47.76
## ManufacturingProcess11 47.65
## ManufacturingProcess15 44.17
## ManufacturingProcess02 41.31
## BiologicalMaterial05 39.49
plot(impv_pls, top=10)
impv_nonlinear <- varImp(svm_fit)
plot(impv_nonlinear, top=10)
impv_nonlinear
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 75.79
## ManufacturingProcess13 74.84
## ManufacturingProcess36 71.63
## BiologicalMaterial12 69.93
## ManufacturingProcess31 66.41
## BiologicalMaterial02 63.24
## BiologicalMaterial03 61.14
## ManufacturingProcess17 60.63
## ManufacturingProcess29 48.52
## BiologicalMaterial11 48.33
## ManufacturingProcess09 48.15
## ManufacturingProcess11 47.44
## BiologicalMaterial04 45.72
## ManufacturingProcess33 44.72
## ManufacturingProcess06 44.50
## BiologicalMaterial08 39.41
## BiologicalMaterial01 34.51
## ManufacturingProcess10 31.71
## ManufacturingProcess30 31.35
ManufacturingProcess32 is the most important predictor in both cases, whether it was a pls model or our non-linear SVM model. The remaining dominant predictors have little different importance in both models, but it was quite similar. Also the Manufacturing and Biological Predictors’ ratio was similar in PLS and SVM. For example in our top 10 predictor variables in both models, the ratio for Manufacturing to Biolpgical was 60:40.
imp_train <- pp_train %>%select(ManufacturingProcess32,BiologicalMaterial06, ManufacturingProcess13,ManufacturingProcess36, BiologicalMaterial12)
cor(imp_train)
## ManufacturingProcess32 BiologicalMaterial06
## ManufacturingProcess32 1.0000000 0.6434305
## BiologicalMaterial06 0.6434305 1.0000000
## ManufacturingProcess13 -0.1212948 -0.2082535
## ManufacturingProcess36 NA NA
## BiologicalMaterial12 0.4567480 0.8207278
## ManufacturingProcess13 ManufacturingProcess36
## ManufacturingProcess32 -0.1212948 NA
## BiologicalMaterial06 -0.2082535 NA
## ManufacturingProcess13 1.0000000 NA
## ManufacturingProcess36 NA 1
## BiologicalMaterial12 -0.1287040 NA
## BiologicalMaterial12
## ManufacturingProcess32 0.4567480
## BiologicalMaterial06 0.8207278
## ManufacturingProcess13 -0.1287040
## ManufacturingProcess36 NA
## BiologicalMaterial12 1.0000000
cor(imp_train, y_train)
## [,1]
## ManufacturingProcess32 0.6092474
## BiologicalMaterial06 0.5121039
## ManufacturingProcess13 -0.4952973
## ManufacturingProcess36 NA
## BiologicalMaterial12 0.4132523
We generated a list of top five dominant variables and check the correlation between the variables itself and response variables. Exploring this will provide us better picture to pick bewteen the dominat ratio of biolgical vs manufacturing predictors. Negative correlations and positive correlations between the predictors and response will further provide us an overview for tuning the model and reducing the predictors in order to simplify the models. We explored the dominant models in step above, we got an inference that manufacturing predictors were dominant than bilogical. The BiologicalMaterial12 and BiologicalMaterial12 has correlation of 0.80 that mans they are providing the same information and it is redundant to includee them in the model.