Libraries
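The library calls themselves are not shown here; based on the functions used below, the following set is presumably what was loaded (caret also loads method-specific packages such as pls, kernlab, and nnet when train() needs them):

# Presumed packages, inferred from the functions used in this document
library(caret)      # train, featurePlot, varImp, postResample, resamples
library(mlbench)    # mlbench.friedman1
library(earth)      # MARS models (method = "earth")
library(dplyr)      # %>% and select() for the comparison table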

7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create the data:

y = 10 sin(πx1x2) + 20(x3 − 0.5)^2 + 10x4 + 5x5 + N(0, σ^2)

where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five other non-informative predictors). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)

trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

knn_fit <- train(x = trainingData$x, y = trainingData$y, method = "knn", preProcess = c("center", "scale"), tuneLength = 10)
knnPred <- predict(knn_fit, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.1750657 0.6785946 2.5443169
varImp(knn_fit)
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000
mars_fit <- train(x = trainingData$x, y = trainingData$y, method = "earth",
  preProcess = c("center", "scale"), tuneLength = 10)

marsPred <- predict(mars_fit, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
##     RMSE Rsquared      MAE 
## 1.776575 0.872700 1.358367
impv_mars <- varImp(mars_fit)
impv_mars
## earth variable importance
## 
##     Overall
## X1  100.000
## X4   84.065
## X2   66.860
## X5   44.678
## X3   33.508
## X6    7.475
## X7    0.000
## X8    0.000
## X9    0.000
## X10   0.000

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1-X5)?

MARS outperforms the KNN model, with a lower RMSE and a higher R-squared on the test set. In addition, MARS assigns nearly all of its importance to the informative predictors X1-X5, with only a negligible contribution from X6.
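
For a compact side-by-side view, the two sets of test-set metrics computed above can be stacked into a single table (a small sketch reusing knnPred and marsPred):

# Stack the test-set metrics of the KNN and MARS fits into one comparison table
rbind(KNN  = postResample(pred = knnPred,  obs = testData$y),
      MARS = postResample(pred = marsPred, obs = testData$y))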

7.5 Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Load the data using the AppliedPredictiveModeling package and extract the matrix processPredictors as the predictors and yield as the response.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# http://appliedpredictivemodeling.com/errata
processPredictors = as.matrix(ChemicalManufacturingProcess[,2:58])
yield = ChemicalManufacturingProcess[,1]  

Following a similar approach to that used in 6.3, we split the data and preprocess it.

We will use the createDataPartition function from the caret package to split the data into training and testing sets. As a general rule, we use a 75/25 split for the training and testing data.

Preprocess: We will use the preProcess option of the caret package (passed to train) to prepare the data for model training. Within it, nzv identifies numeric predictors with near-zero variance and removes them from the model, corr identifies highly correlated predictors and filters them out, center subtracts the mean of each predictor from its values, scale divides each predictor's values by its standard deviation, and medianImpute imputes missing values with the predictor medians. (A BoxCox transformation could also be added to reduce skewness, but it is not included in the preprocessing vector below.)

https://machinelearningmastery.com/pre-process-your-dataset-in-r/

train_r <- createDataPartition(yield, p=0.75, list=F)
pp_train <- ChemicalManufacturingProcess[train_r,-1]
y_train <-  ChemicalManufacturingProcess[train_r,1]
pp_test <- ChemicalManufacturingProcess[-train_r,-1]
y_test <-  ChemicalManufacturingProcess[-train_r,1]
p_pro <- c("nzv",  "corr", "center","scale", "medianImpute")

The data is split into training and test sets. The preprocessing methods are defined and will be applied during model building.

#### Run the PLS model for comparison

t_ctrl <- trainControl(method = "repeatedcv", repeats = 5)
pls_fit<-train(pp_train, y_train, method="pls", tuneLength = 10,preProcess=p_pro, trainControl=t_ctrl)
pls_fit
## Partial Least Squares 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (46), scaled (46), median imputation (46),
##  remove (11) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.649499  0.3473544  1.234225
##    2     2.237693  0.3036385  1.314258
##    3     1.826627  0.3778304  1.243423
##    4     2.105085  0.3719592  1.276270
##    5     2.451145  0.3550204  1.355700
##    6     2.735526  0.3245468  1.429938
##    7     3.099563  0.2953942  1.514624
##    8     3.455185  0.2804582  1.586962
##    9     3.842599  0.2779483  1.661117
##   10     4.172390  0.2622035  1.734641
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.
plot(pls_fit)

pls_pred <- predict(pls_fit, pp_test)
postResample(pred = pls_pred, obs = y_test)
##      RMSE  Rsquared       MAE 
## 1.3549402 0.4366454 1.0930409

Build Non-linear regression models:

knn_fit <- train(pp_train, y_train, method="knn", preProcess=p_pro, tuneLength=10, trainControl=t_ctrl)
knn_pred <- predict(knn_fit, newdata=pp_test)
knn_fit
## k-Nearest Neighbors 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (46), scaled (46), median imputation (46),
##  remove (11) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.503325  0.3789061  1.193593
##    7  1.482005  0.3884756  1.189932
##    9  1.490838  0.3778357  1.196135
##   11  1.482797  0.3810628  1.195776
##   13  1.463919  0.3992910  1.181296
##   15  1.464999  0.4017237  1.190885
##   17  1.461220  0.4110234  1.187636
##   19  1.460545  0.4203195  1.186977
##   21  1.460777  0.4270189  1.184834
##   23  1.472240  0.4230811  1.189777
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 19.
plot(knn_fit)

knn_m <- postResample(pred = knn_pred, obs = y_test)
svm_fit <- train(pp_train, y_train, method="svmRadial", preProcess=p_pro, tuneLength=10, trainControl=t_ctrl)
svm_fit
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (46), scaled (46), median imputation (46),
##  remove (11) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE     
##     0.25  1.516047  0.4313404  1.214062
##     0.50  1.438925  0.4526654  1.155476
##     1.00  1.377357  0.4829551  1.108218
##     2.00  1.351905  0.4944348  1.089402
##     4.00  1.357344  0.4896563  1.089482
##     8.00  1.360609  0.4859543  1.087695
##    16.00  1.361029  0.4854763  1.087915
##    32.00  1.361029  0.4854763  1.087915
##    64.00  1.361029  0.4854763  1.087915
##   128.00  1.361029  0.4854763  1.087915
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01686745
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01686745 and C = 2.
plot(svm_fit)

svm_pred <- predict(svm_fit, newdata=pp_test)
svm_m <- postResample(pred=svm_pred,y_test)
mars_fit <- train(pp_train, y_train, method="earth", preProcess=p_pro, tuneLength=10)

mars_fit
## Multivariate Adaptive Regression Spline 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (46), scaled (46), median imputation (46),
##  remove (11) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE      Rsquared   MAE     
##    2      1.542819  0.3387305  1.217911
##    3      1.432818  0.4343790  1.154482
##    5      1.799623  0.4205390  1.235728
##    7      2.020888  0.3969565  1.303598
##    8      2.453527  0.3724294  1.387364
##   10      2.851682  0.3460168  1.469651
##   12      3.030238  0.3236064  1.576174
##   13      3.008100  0.3268116  1.582054
##   15      3.608152  0.3136720  1.690135
##   17      3.600956  0.3173296  1.682631
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 1.
plot(mars_fit)

mars_pred <- predict(mars_fit, newdata=pp_test)
mars_m <- postResample(pred=mars_pred,y_test)
## Create a specific candidate set of models to evaluate
nnGrid <- expand.grid(size=seq(from = 1, to = 10, by = 1),
                      decay = seq(from = 0.1, to = 0.5, by = 0.1))
nnet_fit <- train(pp_train, y_train, method="nnet",  tuneGrid = nnGrid, 
                 preProcess= p_pro, linout=T,trace=F, 
                 MaxNWts=10 * (ncol(pp_train)+1) + 10 + 1, maxit=500)
nnet_fit
## Neural Network 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (46), scaled (46), median imputation (46),
##  remove (11) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared   MAE     
##    1    0.1    2.177066  0.2631097  1.651179
##    1    0.2    2.483223  0.2337543  1.735573
##    1    0.3    2.718187  0.2005200  1.801851
##    1    0.4    2.506660  0.2264149  1.664771
##    1    0.5    2.892222  0.1848099  1.860306
##    2    0.1    2.934009  0.1984823  2.128258
##    2    0.2    3.110881  0.1575318  2.174205
##    2    0.3    3.550210  0.1302530  2.481438
##    2    0.4    3.251780  0.1478676  2.263563
##    2    0.5    3.353747  0.1481891  2.340123
##    3    0.1    3.227824  0.1513529  2.224513
##    3    0.2    2.945425  0.1661075  2.107094
##    3    0.3    3.214234  0.1550977  2.230540
##    3    0.4    3.112763  0.1474562  2.162661
##    3    0.5    2.995537  0.1612249  2.062713
##    4    0.1    2.881199  0.1533350  2.082091
##    4    0.2    2.808203  0.2014249  1.978360
##    4    0.3    2.904353  0.1898917  1.957424
##    4    0.4    2.868694  0.1855947  1.935946
##    4    0.5    2.757338  0.2008770  1.882012
##    5    0.1    2.803381  0.1910496  1.990928
##    5    0.2    2.676518  0.1973473  1.908357
##    5    0.3    2.666967  0.1922784  1.861332
##    5    0.4    2.619213  0.2204461  1.802472
##    5    0.5    2.603762  0.2215553  1.793889
##    6    0.1    2.707496  0.1952622  1.967732
##    6    0.2    2.511764  0.2104622  1.820560
##    6    0.3    2.669908  0.2093892  1.796526
##    6    0.4    2.519129  0.2215016  1.773140
##    6    0.5    2.594186  0.2346017  1.758269
##    7    0.1    2.609833  0.1751048  1.948440
##    7    0.2    2.642270  0.1869776  1.841622
##    7    0.3    2.603007  0.2078686  1.813747
##    7    0.4    2.555743  0.2040786  1.743659
##    7    0.5    2.704462  0.2092016  1.775572
##    8    0.1    2.834882  0.1296304  2.172001
##    8    0.2    2.857053  0.1407029  2.025424
##    8    0.3    2.626982  0.2055718  1.785611
##    8    0.4    2.568197  0.2360317  1.740569
##    8    0.5    2.485839  0.2148556  1.723779
##    9    0.1    3.002221  0.1076479  2.248735
##    9    0.2    2.588830  0.1709720  1.909529
##    9    0.3    2.669724  0.1757115  1.914257
##    9    0.4    2.605848  0.2041841  1.819743
##    9    0.5    2.399118  0.2436973  1.644230
##   10    0.1    2.903608  0.1361786  2.256942
##   10    0.2    2.762159  0.1887588  2.047491
##   10    0.3    2.593729  0.1952917  1.903201
##   10    0.4    2.476836  0.1926449  1.827659
##   10    0.5    2.537011  0.2102454  1.734376
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.1.
plot(nnet_fit)

nnet_pred <- predict(nnet_fit, newdata=pp_test)

nnet_m <- postResample(pred=nnet_pred,y_test)
comp_mat <- data.frame(rbind(knn_m, svm_m,mars_m, nnet_m))
comp_mat$MODEL <-  c("KNN","SVM","MARS","NN") 
comp_mat <- comp_mat %>% select(MODEL,RMSE, Rsquared,MAE) 
rownames(comp_mat)<- c()
comp_mat
##   MODEL     RMSE  Rsquared       MAE
## 1   KNN 1.379848 0.5021641 1.1180816
## 2   SVM 1.046721 0.7062339 0.7903611
## 3  MARS 1.172835 0.6120368 0.8761763
## 4    NN 1.648574 0.3786380 1.3142114
  1. Which nonlinear regression model gives the optimal resampling and test set performance?

SVM gives the best performance of the four models, with the lowest RMSE and the highest R-squared on the test set. We used the same preprocessing as for the PLS model.
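
The resampling performance can be compared directly as well; since all four fits above report 25 bootstrap resamples, caret's resamples() can collect them into one summary (a small sketch reusing the fitted objects):

# Gather the bootstrap resampling results of the four models and summarize them
resamp <- resamples(list(KNN = knn_fit, SVM = svm_fit, MARS = mars_fit, NN = nnet_fit))
summary(resamp)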

  2. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
impv_pls <- varImp(pls_fit)
impv_pls
## pls variable importance
## 
##   only 20 most important variables shown (out of 46)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess36   89.33
## BiologicalMaterial06     84.01
## ManufacturingProcess13   81.24
## BiologicalMaterial03     74.58
## ManufacturingProcess09   74.08
## ManufacturingProcess33   68.95
## BiologicalMaterial12     67.74
## BiologicalMaterial08     67.15
## ManufacturingProcess17   64.59
## ManufacturingProcess06   61.27
## BiologicalMaterial01     60.64
## ManufacturingProcess12   57.88
## ManufacturingProcess28   52.89
## ManufacturingProcess04   50.35
## ManufacturingProcess43   47.76
## ManufacturingProcess11   47.65
## ManufacturingProcess15   44.17
## ManufacturingProcess02   41.31
## BiologicalMaterial05     39.49
plot(impv_pls, top=10)

impv_nonlinear <- varImp(svm_fit)
plot(impv_nonlinear, top=10)

impv_nonlinear
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     75.79
## ManufacturingProcess13   74.84
## ManufacturingProcess36   71.63
## BiologicalMaterial12     69.93
## ManufacturingProcess31   66.41
## BiologicalMaterial02     63.24
## BiologicalMaterial03     61.14
## ManufacturingProcess17   60.63
## ManufacturingProcess29   48.52
## BiologicalMaterial11     48.33
## ManufacturingProcess09   48.15
## ManufacturingProcess11   47.44
## BiologicalMaterial04     45.72
## ManufacturingProcess33   44.72
## ManufacturingProcess06   44.50
## BiologicalMaterial08     39.41
## BiologicalMaterial01     34.51
## ManufacturingProcess10   31.71
## ManufacturingProcess30   31.35

ManufacturingProcess32 is the most important predictor in both cases, whether in the PLS model or in our nonlinear SVM model. The remaining dominant predictors receive slightly different importance scores in the two models, but the rankings are quite similar. The ratio of manufacturing to biological predictors is also similar for PLS and SVM: among the top 10 predictors of each model, the split between manufacturing and biological variables is roughly 60:40.
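
That 60:40 split can be verified directly from the varImp objects; a quick sketch that tallies predictor types in each model's top 10 (using the impv_pls and impv_nonlinear objects computed above):

# Pull each model's top 10 predictors and tally manufacturing vs. biological variables
top10_pls <- rownames(impv_pls$importance)[order(impv_pls$importance$Overall, decreasing = TRUE)[1:10]]
top10_svm <- rownames(impv_nonlinear$importance)[order(impv_nonlinear$importance$Overall, decreasing = TRUE)[1:10]]
table(ifelse(grepl("^Manufacturing", top10_pls), "Manufacturing", "Biological"))
table(ifelse(grepl("^Manufacturing", top10_svm), "Manufacturing", "Biological"))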

  3. Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
imp_train <- pp_train %>% select(ManufacturingProcess32, BiologicalMaterial06, ManufacturingProcess13, ManufacturingProcess36, BiologicalMaterial12)
cor(imp_train)
##                        ManufacturingProcess32 BiologicalMaterial06
## ManufacturingProcess32              1.0000000            0.6434305
## BiologicalMaterial06                0.6434305            1.0000000
## ManufacturingProcess13             -0.1212948           -0.2082535
## ManufacturingProcess36                     NA                   NA
## BiologicalMaterial12                0.4567480            0.8207278
##                        ManufacturingProcess13 ManufacturingProcess36
## ManufacturingProcess32             -0.1212948                     NA
## BiologicalMaterial06               -0.2082535                     NA
## ManufacturingProcess13              1.0000000                     NA
## ManufacturingProcess36                     NA                      1
## BiologicalMaterial12               -0.1287040                     NA
##                        BiologicalMaterial12
## ManufacturingProcess32            0.4567480
## BiologicalMaterial06              0.8207278
## ManufacturingProcess13           -0.1287040
## ManufacturingProcess36                   NA
## BiologicalMaterial12              1.0000000
cor(imp_train, y_train)
##                              [,1]
## ManufacturingProcess32  0.6092474
## BiologicalMaterial06    0.5121039
## ManufacturingProcess13 -0.4952973
## ManufacturingProcess36         NA
## BiologicalMaterial12    0.4132523

We generated a list of the top five dominant variables and checked the correlations among those variables and with the response. Exploring this gives a better picture for choosing between the dominant biological and manufacturing predictors. The negative and positive correlations between the predictors and the response also provide an overview useful for tuning the model and reducing the number of predictors to simplify it. From the importance rankings above, we inferred that manufacturing predictors dominate over biological ones. The NA entries for ManufacturingProcess36 appear because that predictor contains missing values in the raw training data; imputation happens only inside the modeling pipeline. BiologicalMaterial06 and BiologicalMaterial12 have a correlation of about 0.82, which means they carry largely the same information, and it is redundant to include both in the model.
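
To visualize the relationships the question asks about, a minimal sketch using caret's featurePlot on the five predictors selected above (rows with missing ManufacturingProcess36 values are simply not plotted):

# Scatter plots of the five most important predictors against yield,
# with a loess smoother to hint at the shape of each relationship
featurePlot(x = imp_train, y = y_train,
            plot = "scatter",
            type = c("p", "smooth"),
            layout = c(3, 2))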