Applied Predictive Modeling.
Instructions
Do problems 7.2
and 7.5
in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .rmd code.
Exercises
7.2
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\[y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\] where the \(x\) values are random variables uniformly distributed between \([0, 1]\) (there are also 5 other non-informative variables also created in the simulation). The package mlbench
contains a function called mlbench.friedman1
that simulates these data:
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the ' x ' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector ' y ' and a matrix
## of predictors ' x ' . Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
library(caret)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.565620 0.4887976 2.886629
## 7 3.422420 0.5300524 2.752964
## 9 3.368072 0.5536927 2.715310
## 11 3.323010 0.5779056 2.669375
## 13 3.275835 0.6030846 2.628663
## 15 3.261864 0.6163510 2.621192
## 17 3.261973 0.6267032 2.616956
## 19 3.286299 0.6281075 2.640585
## 21 3.280950 0.6390386 2.643807
## 23 3.292397 0.6440392 2.656080
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
knnPred <- predict(knnModel, newdata = testData$x)
## The function ' postResample ' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
Let’s visualize the results:
|  | knnpostResample | knnModel | Diff |
|---|---|---|---|
| RMSE | 3.1750657 | 3.261864 | -0.087 |
| Rsquared | 0.6785946 | 0.616351 | 0.062 |
| MAE | 2.5443169 | 2.621192 | -0.077 |
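The exact code that assembled this comparison is not shown; a minimal sketch of how it could be built, assuming knitr::kable for display (caret's getTrainPerf pulls the resampled metrics of the chosen k):
# Sketch (assumed): pair the test-set metrics with the best resampled metrics.
knnTest  <- postResample(pred = knnPred, obs = testData$y)   # RMSE, Rsquared, MAE
knnTrain <- unlist(getTrainPerf(knnModel)[1, 1:3])           # TrainRMSE, TrainRsquared, TrainMAE
knitr::kable(data.frame(knnpostResample = knnTest,
                        knnModel        = knnTrain,
                        Diff            = round(knnTest - knnTrain, 3)))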
First, I would like to visualize the correlations between the predictors.
Correlation chart for the given data set.
From the correlation chart, it is determined that there are only low linear correlations between the predictors.
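The code for the correlation chart is not shown above; a chart like it could be produced, for example, with the corrplot package (my assumption; any correlation-plot function would do):
# Sketch (assumed): visualize pairwise correlations among the simulated predictors.
library(corrplot)
corrplot(cor(trainingData$x), method = "circle", type = "upper")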
nnet
Neural Network Model
# Neural Network Model
# Fix the seed so that the results can be reproduced
set.seed(100)
nnetModel <- train(x = trainingData$x,
y = trainingData$y,
method = "nnet",
preProc = c("center", "scale"),
tuneLength = 10,
# The linear relationship between the hidden
# units and the prediction can be used with the
# option linout = TRUE .
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500)
## Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0.0000000000 2.497600 0.7545522 1.956054
## 1 0.0001000000 2.734632 0.6972571 2.162137
## 1 0.0002371374 2.638259 0.7210747 2.067387
## 1 0.0005623413 2.611553 0.7289837 2.049921
## 1 0.0013335214 2.562094 0.7383177 2.015443
## 1 0.0031622777 2.605143 0.7298392 2.050391
## 1 0.0074989421 2.636422 0.7231220 2.085583
## 1 0.0177827941 2.496140 0.7543174 1.953291
## 1 0.0421696503 2.498996 0.7534972 1.955317
## 1 0.1000000000 2.506223 0.7517278 1.961803
## 3 0.0000000000 2.856496 0.6962672 2.223649
## 3 0.0001000000 2.906191 0.6874898 2.264771
## 3 0.0002371374 2.896291 0.6850854 2.286116
## 3 0.0005623413 2.834275 0.7101224 2.182843
## 3 0.0013335214 3.015298 0.6619003 2.357832
## 3 0.0031622777 3.054260 0.6589361 2.387193
## 3 0.0074989421 2.887075 0.6859427 2.268697
## 3 0.0177827941 2.906213 0.6831763 2.312451
## 3 0.0421696503 2.789268 0.7068723 2.219884
## 3 0.1000000000 2.760720 0.7140945 2.164362
## 5 0.0000000000 4.266845 0.5531609 2.853385
## 5 0.0001000000 3.546638 0.5842393 2.641530
## 5 0.0002371374 3.997344 0.5360713 2.962281
## 5 0.0005623413 4.026938 0.5429662 2.965001
## 5 0.0013335214 3.314771 0.6272433 2.611271
## 5 0.0031622777 3.420932 0.6030587 2.652169
## 5 0.0074989421 3.475582 0.6078295 2.688666
## 5 0.0177827941 3.249637 0.6204248 2.564657
## 5 0.0421696503 3.170282 0.6461321 2.523590
## 5 0.1000000000 3.137383 0.6495947 2.466052
## 7 0.0000000000 8.002307 0.3381807 4.353010
## 7 0.0001000000 4.430284 0.5056887 3.209074
## 7 0.0002371374 4.135117 0.5297763 3.132095
## 7 0.0005623413 4.232626 0.5060780 3.281377
## 7 0.0013335214 4.443191 0.4897367 3.355825
## 7 0.0031622777 4.080581 0.5214544 3.110683
## 7 0.0074989421 3.837268 0.5616421 3.017291
## 7 0.0177827941 3.848676 0.5401837 3.018204
## 7 0.0421696503 3.528795 0.6016891 2.794963
## 7 0.1000000000 3.350476 0.6285659 2.653889
## 9 0.0000000000 4.983956 0.4497053 3.591067
## 9 0.0001000000 4.887547 0.4711915 3.536666
## 9 0.0002371374 3.904550 0.5247603 3.103848
## 9 0.0005623413 4.163710 0.5107779 3.266036
## 9 0.0013335214 4.105538 0.5096156 3.218040
## 9 0.0031622777 4.037495 0.5201388 3.190533
## 9 0.0074989421 3.976522 0.5112530 3.135026
## 9 0.0177827941 3.863966 0.5394006 3.063223
## 9 0.0421696503 3.721862 0.5504238 2.979736
## 9 0.1000000000 3.407499 0.5978234 2.716034
## 11 0.0000000000 NaN NaN NaN
## 11 0.0001000000 NaN NaN NaN
## 11 0.0002371374 NaN NaN NaN
## 11 0.0005623413 NaN NaN NaN
## 11 0.0013335214 NaN NaN NaN
## 11 0.0031622777 NaN NaN NaN
## 11 0.0074989421 NaN NaN NaN
## 11 0.0177827941 NaN NaN NaN
## 11 0.0421696503 NaN NaN NaN
## 11 0.1000000000 NaN NaN NaN
## 13 0.0000000000 NaN NaN NaN
## 13 0.0001000000 NaN NaN NaN
## 13 0.0002371374 NaN NaN NaN
## 13 0.0005623413 NaN NaN NaN
## 13 0.0013335214 NaN NaN NaN
## 13 0.0031622777 NaN NaN NaN
## 13 0.0074989421 NaN NaN NaN
## 13 0.0177827941 NaN NaN NaN
## 13 0.0421696503 NaN NaN NaN
## 13 0.1000000000 NaN NaN NaN
## 15 0.0000000000 NaN NaN NaN
## 15 0.0001000000 NaN NaN NaN
## 15 0.0002371374 NaN NaN NaN
## 15 0.0005623413 NaN NaN NaN
## 15 0.0013335214 NaN NaN NaN
## 15 0.0031622777 NaN NaN NaN
## 15 0.0074989421 NaN NaN NaN
## 15 0.0177827941 NaN NaN NaN
## 15 0.0421696503 NaN NaN NaN
## 15 0.1000000000 NaN NaN NaN
## 17 0.0000000000 NaN NaN NaN
## 17 0.0001000000 NaN NaN NaN
## 17 0.0002371374 NaN NaN NaN
## 17 0.0005623413 NaN NaN NaN
## 17 0.0013335214 NaN NaN NaN
## 17 0.0031622777 NaN NaN NaN
## 17 0.0074989421 NaN NaN NaN
## 17 0.0177827941 NaN NaN NaN
## 17 0.0421696503 NaN NaN NaN
## 17 0.1000000000 NaN NaN NaN
## 19 0.0000000000 NaN NaN NaN
## 19 0.0001000000 NaN NaN NaN
## 19 0.0002371374 NaN NaN NaN
## 19 0.0005623413 NaN NaN NaN
## 19 0.0013335214 NaN NaN NaN
## 19 0.0031622777 NaN NaN NaN
## 19 0.0074989421 NaN NaN NaN
## 19 0.0177827941 NaN NaN NaN
## 19 0.0421696503 NaN NaN NaN
## 19 0.1000000000 NaN NaN NaN
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.01778279.
Let’s visualize the results:
|  | nnetpostResample | nnetModel | Diff |
|---|---|---|---|
| RMSE | 2.6435865 | 2.4961400 | 0.147 |
| Rsquared | 0.7193278 | 0.7543174 | -0.035 |
| MAE | 2.0236815 | 1.9532910 | 0.070 |
SVM
Support Vector Machines Model
# Support Vector Machines Model
# Fix the seed so that the results can be reproduced
set.seed(100)
svmModel <- train(x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.534788 0.7882081 2.034824
## 0.50 2.292127 0.8029516 1.819981
## 1.00 2.091598 0.8284381 1.657402
## 2.00 1.967193 0.8457471 1.546737
## 4.00 1.883133 0.8561761 1.482054
## 8.00 1.863807 0.8588797 1.468328
## 16.00 1.834215 0.8633819 1.456738
## 32.00 1.836471 0.8632508 1.459909
## 64.00 1.836471 0.8632508 1.459909
## 128.00 1.836471 0.8632508 1.459909
## 256.00 1.836471 0.8632508 1.459909
## 512.00 1.836471 0.8632508 1.459909
## 1024.00 1.836471 0.8632508 1.459909
## 2048.00 1.836471 0.8632508 1.459909
##
## Tuning parameter 'sigma' was held constant at a value of 0.0552698
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0552698 and C = 16.
Let’s visualize the results:
|  | svmpostResample | svmModel | Diff |
|---|---|---|---|
| RMSE | 2.0490047 | 1.8342150 | 0.215 |
| Rsquared | 0.8297577 | 0.8633819 | -0.034 |
| MAE | 1.5586106 | 1.4567380 | 0.102 |
MARS
Multivariate Adaptive Regression Splines Model
# Multivariate Adaptive Regression Splines Model
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Fix the seed so that the results can be reproduced
set.seed(100)
marsModel <- train(x = trainingData$x,
y = trainingData$y,
method = "earth",
preProc = c("center", "scale"),
# Explicitly declare the candidate models to test
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.489470 0.2020919 3.6881383
## 1 3 3.804210 0.4141260 3.0607824
## 1 4 2.622468 0.7176090 2.0807296
## 1 5 2.284475 0.7795541 1.8371183
## 1 6 2.287789 0.7792202 1.7862257
## 1 7 1.754222 0.8744211 1.3842606
## 1 8 1.701785 0.8808238 1.3087709
## 1 9 1.710506 0.8808018 1.3269353
## 1 10 1.684064 0.8833678 1.3218697
## 1 11 1.616665 0.8902847 1.2700615
## 1 12 1.620843 0.8883284 1.2784627
## 1 13 1.615887 0.8888966 1.2711103
## 1 14 1.615887 0.8888966 1.2711103
## 1 15 1.622796 0.8875917 1.2771573
## 1 16 1.622796 0.8875917 1.2771573
## 1 17 1.622796 0.8875917 1.2771573
## 1 18 1.622796 0.8875917 1.2771573
## 1 19 1.622796 0.8875917 1.2771573
## 1 20 1.622796 0.8875917 1.2771573
## 1 21 1.622796 0.8875917 1.2771573
## 1 22 1.622796 0.8875917 1.2771573
## 1 23 1.622796 0.8875917 1.2771573
## 1 24 1.622796 0.8875917 1.2771573
## 1 25 1.622796 0.8875917 1.2771573
## 1 26 1.622796 0.8875917 1.2771573
## 1 27 1.622796 0.8875917 1.2771573
## 1 28 1.622796 0.8875917 1.2771573
## 1 29 1.622796 0.8875917 1.2771573
## 1 30 1.622796 0.8875917 1.2771573
## 1 31 1.622796 0.8875917 1.2771573
## 1 32 1.622796 0.8875917 1.2771573
## 1 33 1.622796 0.8875917 1.2771573
## 1 34 1.622796 0.8875917 1.2771573
## 1 35 1.622796 0.8875917 1.2771573
## 1 36 1.622796 0.8875917 1.2771573
## 1 37 1.622796 0.8875917 1.2771573
## 1 38 1.622796 0.8875917 1.2771573
## 2 2 4.489470 0.2020919 3.6881383
## 2 3 3.804210 0.4141260 3.0607824
## 2 4 2.622468 0.7176090 2.0807296
## 2 5 2.284475 0.7795541 1.8371183
## 2 6 2.312578 0.7782746 1.8037749
## 2 7 1.780599 0.8724334 1.4062049
## 2 8 1.712181 0.8801027 1.3038033
## 2 9 1.535110 0.9026584 1.2201285
## 2 10 1.357614 0.9218402 1.0470553
## 2 11 1.271188 0.9371200 0.9916035
## 2 12 1.238666 0.9412852 0.9680962
## 2 13 1.258187 0.9375168 0.9837376
## 2 14 1.271254 0.9366262 1.0024425
## 2 15 1.253367 0.9375668 0.9901281
## 2 16 1.256205 0.9376482 1.0077633
## 2 17 1.256014 0.9378510 0.9982979
## 2 18 1.256014 0.9378510 0.9982979
## 2 19 1.256014 0.9378510 0.9982979
## 2 20 1.256014 0.9378510 0.9982979
## 2 21 1.256014 0.9378510 0.9982979
## 2 22 1.256014 0.9378510 0.9982979
## 2 23 1.256014 0.9378510 0.9982979
## 2 24 1.256014 0.9378510 0.9982979
## 2 25 1.256014 0.9378510 0.9982979
## 2 26 1.256014 0.9378510 0.9982979
## 2 27 1.256014 0.9378510 0.9982979
## 2 28 1.256014 0.9378510 0.9982979
## 2 29 1.256014 0.9378510 0.9982979
## 2 30 1.256014 0.9378510 0.9982979
## 2 31 1.256014 0.9378510 0.9982979
## 2 32 1.256014 0.9378510 0.9982979
## 2 33 1.256014 0.9378510 0.9982979
## 2 34 1.256014 0.9378510 0.9982979
## 2 35 1.256014 0.9378510 0.9982979
## 2 36 1.256014 0.9378510 0.9982979
## 2 37 1.256014 0.9378510 0.9982979
## 2 38 1.256014 0.9378510 0.9982979
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 12 and degree = 2.
Let’s visualize the results:
|  | marspostResample | marsModel | Diff |
|---|---|---|---|
| RMSE | 1.3227340 | 1.2386660 | 0.084 |
| Rsquared | 0.9291489 | 0.9412852 | -0.012 |
| MAE | 1.0524686 | 0.9680962 | 0.084 |
Which models appear to give the best performance?
Let’s compare the returned values from the postResample
function:
|  | knn | nnet | svm | mars |
|---|---|---|---|---|
| RMSE | 3.1750657 | 2.6435865 | 2.0490047 | 1.3227340 |
| Rsquared | 0.6785946 | 0.7193278 | 0.8297577 | 0.9291489 |
| MAE | 2.5443169 | 2.0236815 | 1.5586106 | 1.0524686 |
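A sketch of how the table above could be produced from the four tuned models (the table-building code is assumed; knitr::kable is used only for display):
# Sketch (assumed): collect test-set performance for the four tuned models.
models <- list(knn = knnModel, nnet = nnetModel, svm = svmModel, mars = marsModel)
testPerf <- sapply(models, function(m)
  postResample(pred = predict(m, newdata = testData$x), obs = testData$y))
knitr::kable(round(testPerf, 7))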
As the table above shows, the MARS model returns the lowest RMSE alongside the highest \(R^2\).
Let’s compare the returned values from the original models.
|  | knn | nnet | svm | mars |
|---|---|---|---|---|
| RMSE | 3.261864 | 2.4961400 | 1.8342150 | 1.2386660 |
| Rsquared | 0.616351 | 0.7543174 | 0.8633819 | 0.9412852 |
| MAE | 2.621192 | 1.9532910 | 1.4567380 | 0.9680962 |
As before, the MARS approach provides the lowest RMSE alongside the highest \(R^2\).
Does MARS select the informative predictors (those named X1 – X5 )?
Let’s visualize the results from the MARS model.
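The output below was presumably produced by summarizing the final earth fit stored inside the caret object; a minimal sketch:
# Sketch (assumed): inspect the terms and predictors kept by the final MARS fit.
summary(marsModel$finalModel)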
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE,
## degree=2, nprune=12)
##
## coefficients
## (Intercept) 21.690154
## h(0.507267-X1) -4.203744
## h(X1-0.507267) 3.072355
## h(0.325504-X2) -5.314859
## h(-0.216741-X3) 3.320304
## h(X3- -0.216741) 2.321760
## h(0.953812-X4) -2.775288
## h(X4-0.953812) 2.778320
## h(1.17878-X5) -1.607769
## h(X1-0.507267) * h(X2- -0.798188) -3.199202
## h(0.606835-X1) * h(0.325504-X2) 2.030856
## h(0.325504-X2) * h(X3-0.795427) 1.369704
##
## Selected 12 of 21 terms, and 5 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, ...
## Number of terms at each degree of interaction: 1 8 3
## GCV 1.842426 RSS 270.9495 GRSq 0.9251967 RSq 0.9444425
Answer:
From the above results, we can see that the informative predictors X1–X5 are the only predictors selected by the MARS model; the non-informative predictors X6–X10 are left unused.
7.5
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
Process
Let’s visualize the missing values to get a better understanding of the missing data.
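The data loading and missing-value summary are not shown above; a minimal sketch, assuming the data come from the AppliedPredictiveModeling package:
# Sketch (assumed): load the data and count missing values per column.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
missingCounts <- colSums(is.na(ChemicalManufacturingProcess))
sort(missingCounts[missingCounts > 0], decreasing = TRUE)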
Since several predictors have only a small number of missing values, I believe it is better to replace the missing entries with the mean of the respective predictor. I believe this is the best approach because:
- Using only complete cases would dramatically reduce the data set, with the downside of losing potentially valuable information currently present in the data.
- Replacing just a few records per predictor with the mean has low impact, since only about 1% of the data is missing, so the data set is not dramatically affected.
# Procedure to replace NAs with the respective mean of the predictor.
for (i in 1:ncol(ChemicalManufacturingProcess)) {
  totalNA <- sum(is.na(ChemicalManufacturingProcess[, i]))
  if (totalNA > 0) {
    print(i)  # report which column contained missing values
    meanColumn <- mean(ChemicalManufacturingProcess[, i], na.rm = TRUE)
    ChemicalManufacturingProcess[i][is.na(ChemicalManufacturingProcess[i])] <- meanColumn
  }
}
Preprocess Center & Scale
To gain some computational advantage, I will pre-process the data set before running any model. Note that this could also be done within each call to train, but I prefer to do it beforehand.
To pre-process the data, each predictor is centered and scaled as follows:
\[x_i^{*} = \frac{x_i - \mu}{\sigma}\]
where \(\mu\) and \(\sigma\) are the mean and standard deviation of the respective predictor.
We can achieve this by employing the preProcess
function from the caret
library. This function has the ability to transform, center, scale, or impute values.
# Function to pre-process data.
library(caret)
trans <- preProcess(ChemicalManufacturingProcess,
method = c("center", "scale"))
# Need to obtain new transformed values
CMP.trans <- predict(trans, ChemicalManufacturingProcess)
Let’s find correlations:
The list below shows the column numbers in which highly correlated data (\(> 0.9\)) is present.
## [1] 3 5 13 42 55 38 40 44 31 53
From the above, we notice some strong linear correlations between predictors.
Based on that, I will remove the highly correlated predictors (\(> 0.9\)).
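A sketch of how the column list above could have been obtained (the 0.9 cutoff comes from the text; the exact code is assumed):
# Sketch (assumed): flag columns with pairwise |correlation| > 0.9.
highCorr <- findCorrelation(cor(CMP.trans), cutoff = 0.9)
highCorr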
Let’s find near-zero variance predictors
Now, I will proceed to find predictors that have near-zero variance.
From the analysis, it is determined that BiologicalMaterial07 has near-zero variance. This predictor will be removed as well.
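A sketch of how the reduced data set used below could be assembled, assuming highCorr from the previous sketch:
# Sketch (assumed): drop the highly correlated columns and the near-zero
# variance predictor to form the reduced data set used for modeling.
nzv <- nearZeroVar(CMP.trans)   # per the analysis above, this flags BiologicalMaterial07
CMP.trans_reduced <- CMP.trans[, -unique(c(highCorr, nzv))]
dim(CMP.trans_reduced)          # expected: 176 x 47 (Yield plus 46 predictors)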
Split train & test
Now, I will split the data into 75% training and 25% test sets.
# Now, I will split the data as follows:
# Training 75%
# Test 25%
set.seed(123)
n <- nrow(CMP.trans_reduced)
trainIndex <- sample(1:n, size = round(0.75*n), replace=FALSE)
CMPtrain <- CMP.trans_reduced[trainIndex ,]
CMPtest <- CMP.trans_reduced[-trainIndex ,]
knn
K-Nearest Neighbors
Let’s find this model.
# Fix the seed so that the results can be reproduced
set.seed(100)
knnModel <- train(x = CMPtrain[,-1], # Yield is located in the column 1
y = CMPtrain$Yield, # Yield is located in the column 1
method = "knn",
# Center and scaling will occur for new predictions too
#preProc = c("center", "scale"), #already centered & scaled
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = "cv"))
## k-Nearest Neighbors
##
## 132 samples
## 46 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 118, 117, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.6794000 0.5757569 0.5223422
## 2 0.6176822 0.6305230 0.5164498
## 3 0.6605027 0.5753067 0.5369146
## 4 0.6708427 0.5593300 0.5453075
## 5 0.6765834 0.5631368 0.5509474
## 6 0.6987497 0.5397590 0.5620890
## 7 0.7144364 0.5193643 0.5730330
## 8 0.7210814 0.5174631 0.5779199
## 9 0.7201191 0.5096748 0.5694338
## 10 0.7346904 0.4861745 0.5743579
## 11 0.7383440 0.4882183 0.5835242
## 12 0.7490996 0.4706029 0.5912526
## 13 0.7452880 0.4745400 0.5916742
## 14 0.7554552 0.4606574 0.6063589
## 15 0.7660497 0.4410709 0.6140884
## 16 0.7656175 0.4426411 0.6163335
## 17 0.7723910 0.4338911 0.6201163
## 18 0.7734641 0.4381213 0.6212987
## 19 0.7736354 0.4383681 0.6194655
## 20 0.7776377 0.4349992 0.6222963
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 2.
Let’s evaluate the above model on the test data.
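A minimal sketch of how the test-set metrics shown below could be obtained (object names are assumptions):
# Sketch (assumed): predict the held-out Yield and compute test-set metrics.
knnPred <- predict(knnModel, newdata = CMPtest[, -1])
knnpostResample <- postResample(pred = knnPred, obs = CMPtest$Yield)
knnpostResample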
Let’s visualize the results:
|  | knnpostResample | knnModel | Diff |
|---|---|---|---|
| RMSE | 0.7857604 | 0.6176822 | 0.168 |
| Rsquared | 0.3984343 | 0.6305230 | -0.232 |
| MAE | 0.6049798 | 0.5164498 | 0.089 |
nnet
Neural Network Model
# Neural Network Model
# Fix the seed so that the results can be reproduced
set.seed(100)
nnetModel <- train(x = CMPtrain[,-1], # Yield is located in the column 1
y = CMPtrain$Yield, # Yield is located in the column 1
method = "nnet",
#preProc = c("center", "scale"),
tuneLength = 10,
# The linear relationship between the hidden
# units and the prediction can be used with the
# option linout = TRUE .
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(CMPtrain[,-1]) + 1) + 10 + 1,
maxit = 500)
## Neural Network
##
## 132 samples
## 46 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0.0000000000 1.1796248 0.1968185 0.9062800
## 1 0.0001000000 1.1253088 0.2051884 0.8982873
## 1 0.0002371374 1.1626150 0.2031684 0.9284455
## 1 0.0005623413 1.1351927 0.2147849 0.9059153
## 1 0.0013335214 1.1721067 0.1854326 0.9446886
## 1 0.0031622777 1.1353498 0.2046183 0.9137300
## 1 0.0074989421 1.1131284 0.2320241 0.8830995
## 1 0.0177827941 1.0932424 0.2432816 0.8712249
## 1 0.0421696503 1.0577946 0.2598119 0.8481323
## 1 0.1000000000 1.0130936 0.2976289 0.8144259
## 3 0.0000000000 1.3954744 0.2265805 1.1065449
## 3 0.0001000000 1.4409364 0.2105104 1.1204449
## 3 0.0002371374 1.2194177 0.2322903 0.9592999
## 3 0.0005623413 1.1915407 0.2467710 0.9367788
## 3 0.0013335214 1.1402344 0.2461006 0.8986199
## 3 0.0031622777 1.0552791 0.3057729 0.8359662
## 3 0.0074989421 1.0074765 0.3461289 0.7945078
## 3 0.0177827941 0.9646867 0.3443929 0.7721008
## 3 0.0421696503 0.9779622 0.3463749 0.7694378
## 3 0.1000000000 0.9155119 0.3827726 0.7291705
## 5 0.0000000000 1.2433499 0.2102741 0.9849966
## 5 0.0001000000 0.9954243 0.3134169 0.8012460
## 5 0.0002371374 0.9433686 0.3597121 0.7424088
## 5 0.0005623413 0.8977300 0.3745157 0.7079160
## 5 0.0013335214 0.9091758 0.3877541 0.7159963
## 5 0.0031622777 0.8740793 0.4121510 0.6968985
## 5 0.0074989421 0.8567089 0.4137448 0.6854498
## 5 0.0177827941 0.8536331 0.4061556 0.6860885
## 5 0.0421696503 0.8693627 0.3962918 0.6858137
## 5 0.1000000000 0.8465231 0.4034549 0.6798335
## 7 0.0000000000 1.0347538 0.2907921 0.8207750
## 7 0.0001000000 0.8695586 0.3931258 0.6882341
## 7 0.0002371374 0.8462636 0.4192968 0.6720169
## 7 0.0005623413 0.8507805 0.4024947 0.6852187
## 7 0.0013335214 0.8412553 0.4172394 0.6644665
## 7 0.0031622777 0.8330734 0.4387487 0.6688182
## 7 0.0074989421 0.8410750 0.4257502 0.6737509
## 7 0.0177827941 0.8041428 0.4598524 0.6365908
## 7 0.0421696503 0.8147847 0.4401499 0.6480599
## 7 0.1000000000 0.8068015 0.4462963 0.6451722
## 9 0.0000000000 0.9996566 0.3298175 0.7968458
## 9 0.0001000000 0.8214779 0.4394031 0.6525946
## 9 0.0002371374 0.8525284 0.4174344 0.6800225
## 9 0.0005623413 0.8379829 0.4197734 0.6780517
## 9 0.0013335214 0.8204291 0.4400485 0.6528655
## 9 0.0031622777 0.8277418 0.4281831 0.6545178
## 9 0.0074989421 0.8160242 0.4408754 0.6511295
## 9 0.0177827941 0.8271611 0.4262726 0.6558029
## 9 0.0421696503 0.8040919 0.4432069 0.6394613
## 9 0.1000000000 0.8117241 0.4433677 0.6445780
## 11 0.0000000000 NaN NaN NaN
## 11 0.0001000000 NaN NaN NaN
## 11 0.0002371374 NaN NaN NaN
## 11 0.0005623413 NaN NaN NaN
## 11 0.0013335214 NaN NaN NaN
## 11 0.0031622777 NaN NaN NaN
## 11 0.0074989421 NaN NaN NaN
## 11 0.0177827941 NaN NaN NaN
## 11 0.0421696503 NaN NaN NaN
## 11 0.1000000000 NaN NaN NaN
## 13 0.0000000000 NaN NaN NaN
## 13 0.0001000000 NaN NaN NaN
## 13 0.0002371374 NaN NaN NaN
## 13 0.0005623413 NaN NaN NaN
## 13 0.0013335214 NaN NaN NaN
## 13 0.0031622777 NaN NaN NaN
## 13 0.0074989421 NaN NaN NaN
## 13 0.0177827941 NaN NaN NaN
## 13 0.0421696503 NaN NaN NaN
## 13 0.1000000000 NaN NaN NaN
## 15 0.0000000000 NaN NaN NaN
## 15 0.0001000000 NaN NaN NaN
## 15 0.0002371374 NaN NaN NaN
## 15 0.0005623413 NaN NaN NaN
## 15 0.0013335214 NaN NaN NaN
## 15 0.0031622777 NaN NaN NaN
## 15 0.0074989421 NaN NaN NaN
## 15 0.0177827941 NaN NaN NaN
## 15 0.0421696503 NaN NaN NaN
## 15 0.1000000000 NaN NaN NaN
## 17 0.0000000000 NaN NaN NaN
## 17 0.0001000000 NaN NaN NaN
## 17 0.0002371374 NaN NaN NaN
## 17 0.0005623413 NaN NaN NaN
## 17 0.0013335214 NaN NaN NaN
## 17 0.0031622777 NaN NaN NaN
## 17 0.0074989421 NaN NaN NaN
## 17 0.0177827941 NaN NaN NaN
## 17 0.0421696503 NaN NaN NaN
## 17 0.1000000000 NaN NaN NaN
## 19 0.0000000000 NaN NaN NaN
## 19 0.0001000000 NaN NaN NaN
## 19 0.0002371374 NaN NaN NaN
## 19 0.0005623413 NaN NaN NaN
## 19 0.0013335214 NaN NaN NaN
## 19 0.0031622777 NaN NaN NaN
## 19 0.0074989421 NaN NaN NaN
## 19 0.0177827941 NaN NaN NaN
## 19 0.0421696503 NaN NaN NaN
## 19 0.1000000000 NaN NaN NaN
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 9 and decay = 0.04216965.
Let’s visualize the results:
|  | nnetpostResample | nnetModel | Diff |
|---|---|---|---|
| RMSE | 0.7093090 | 0.8040919 | -0.095 |
| Rsquared | 0.6042081 | 0.4432069 | 0.161 |
| MAE | 0.5762007 | 0.6394613 | -0.063 |
SVM
Support Vector Machines Model
# Support Vector Machines Model
# Fix the seed so that the results can be reproduced
set.seed(100)
svmModel <- train(x = CMPtrain[,-1], # Yield is located in the column 1
y = CMPtrain$Yield, # Yield is located in the column 1
method = "svmRadial",
#preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
## Support Vector Machines with Radial Basis Function Kernel
##
## 132 samples
## 46 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 118, 117, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7871002 0.4502506 0.6302134
## 0.50 0.7313778 0.4912644 0.5982827
## 1.00 0.6801686 0.5443121 0.5534225
## 2.00 0.6463510 0.5834226 0.5292535
## 4.00 0.6099049 0.6251177 0.5072709
## 8.00 0.6028416 0.6385953 0.5063231
## 16.00 0.6012018 0.6410990 0.5052518
## 32.00 0.6012018 0.6410990 0.5052518
## 64.00 0.6012018 0.6410990 0.5052518
## 128.00 0.6012018 0.6410990 0.5052518
## 256.00 0.6012018 0.6410990 0.5052518
## 512.00 0.6012018 0.6410990 0.5052518
## 1024.00 0.6012018 0.6410990 0.5052518
## 2048.00 0.6012018 0.6410990 0.5052518
##
## Tuning parameter 'sigma' was held constant at a value of 0.01733987
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01733987 and C = 16.
Let’s visualize the results:
|  | svmpostResample | svmModel | Diff |
|---|---|---|---|
| RMSE | 0.6132105 | 0.6012018 | 0.012 |
| Rsquared | 0.6348329 | 0.6410990 | -0.006 |
| MAE | 0.4950236 | 0.5052518 | -0.010 |
MARS
Multivariate Adaptive Regression Splines Model
# Multivariate Adaptive Regression Splines Model
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Fix the seed so that the results can be reproduced
set.seed(100)
marsModel <- train(x = CMPtrain[,-1], # Yield is located in the column 1
y = CMPtrain$Yield, # Yield is located in the column 1
method = "earth",
#preProc = c("center", "scale"),
# Explicitly declare the candidate models to test
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
## Multivariate Adaptive Regression Spline
##
## 132 samples
## 46 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 118, 117, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.7608702 0.4765659 0.5940557
## 1 3 0.7402650 0.5062131 0.5873922
## 1 4 0.6913599 0.5683094 0.5487347
## 1 5 0.6656055 0.5739232 0.5260355
## 1 6 0.6921667 0.5493613 0.5444108
## 1 7 0.6699219 0.5800144 0.5197985
## 1 8 0.6821632 0.5629115 0.5277025
## 1 9 0.6851523 0.5551823 0.5319844
## 1 10 0.6900857 0.5338118 0.5408950
## 1 11 0.6902712 0.5426019 0.5368663
## 1 12 0.6802351 0.5874977 0.5171855
## 1 13 0.6743747 0.5875594 0.5225518
## 1 14 0.6724292 0.5542790 0.5312809
## 1 15 0.6755544 0.5501683 0.5357989
## 1 16 0.6782277 0.5456692 0.5385850
## 1 17 0.6814800 0.5407681 0.5392786
## 1 18 0.6836116 0.5370742 0.5391684
## 1 19 0.6888690 0.5299316 0.5418341
## 1 20 0.6878532 0.5285923 0.5394346
## 1 21 0.6980804 0.5186997 0.5457016
## 1 22 0.7028141 0.5146597 0.5478886
## 1 23 0.7028141 0.5146597 0.5478886
## 1 24 0.7011137 0.5152072 0.5492070
## 1 25 0.7059355 0.5121715 0.5525025
## 1 26 0.7059355 0.5121715 0.5525025
## 1 27 0.7059355 0.5121715 0.5525025
## 1 28 0.7059355 0.5121715 0.5525025
## 1 29 0.7059355 0.5121715 0.5525025
## 1 30 0.7059355 0.5121715 0.5525025
## 1 31 0.7059355 0.5121715 0.5525025
## 1 32 0.7059355 0.5121715 0.5525025
## 1 33 0.7059355 0.5121715 0.5525025
## 1 34 0.7059355 0.5121715 0.5525025
## 1 35 0.7059355 0.5121715 0.5525025
## 1 36 0.7059355 0.5121715 0.5525025
## 1 37 0.7059355 0.5121715 0.5525025
## 1 38 0.7059355 0.5121715 0.5525025
## 2 2 0.7608702 0.4765659 0.5940557
## 2 3 0.7758241 0.4579480 0.6058932
## 2 4 0.7154082 0.5357724 0.5668441
## 2 5 0.7698298 0.4759728 0.5998113
## 2 6 0.8106828 0.4252869 0.6170168
## 2 7 0.8523308 0.4062154 0.6596470
## 2 8 0.8850669 0.3915969 0.6621907
## 2 9 0.8887112 0.3992320 0.6706609
## 2 10 0.8761736 0.4217475 0.6520429
## 2 11 0.8948231 0.4161342 0.6709036
## 2 12 0.8706716 0.4380729 0.6464227
## 2 13 0.8713399 0.4454836 0.6286619
## 2 14 0.8445940 0.4618239 0.6181703
## 2 15 0.8565715 0.4606281 0.6209452
## 2 16 1.6149989 0.4068201 0.8872433
## 2 17 1.5865843 0.4348400 0.8666864
## 2 18 1.5694411 0.4486501 0.8529629
## 2 19 1.5533104 0.4637339 0.8410737
## 2 20 1.5453743 0.4647219 0.8346232
## 2 21 1.5507991 0.4703622 0.8378604
## 2 22 1.5614952 0.4661222 0.8450828
## 2 23 1.5594290 0.4657456 0.8449717
## 2 24 1.5594290 0.4657456 0.8449717
## 2 25 1.5594290 0.4657456 0.8449717
## 2 26 1.5594290 0.4657456 0.8449717
## 2 27 1.5594290 0.4657456 0.8449717
## 2 28 1.5594290 0.4657456 0.8449717
## 2 29 1.5594290 0.4657456 0.8449717
## 2 30 1.5594290 0.4657456 0.8449717
## 2 31 1.5594290 0.4657456 0.8449717
## 2 32 1.5594290 0.4657456 0.8449717
## 2 33 1.5594290 0.4657456 0.8449717
## 2 34 1.5594290 0.4657456 0.8449717
## 2 35 1.5594290 0.4657456 0.8449717
## 2 36 1.5594290 0.4657456 0.8449717
## 2 37 1.5594290 0.4657456 0.8449717
## 2 38 1.5594290 0.4657456 0.8449717
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 1.
Let’s visualize the results:
|  | marspostResample | marsModel | Diff |
|---|---|---|---|
| RMSE | 0.6385416 | 0.6656055 | -0.027 |
| Rsquared | 0.6001174 | 0.5739232 | 0.026 |
| MAE | 0.5051098 | 0.5260355 | -0.021 |
(a)
Which nonlinear regression model gives the optimal resampling and test set performance?
Let’s compare the returned values from the postResample
function:
|  | knn | nnet | svm | mars |
|---|---|---|---|---|
| RMSE | 0.7857604 | 0.7093090 | 0.6132105 | 0.6385416 |
| Rsquared | 0.3984343 | 0.6042081 | 0.6348329 | 0.6001174 |
| MAE | 0.6049798 | 0.5762007 | 0.4950236 | 0.5051098 |
As the table above shows, the SVM model returns the lowest RMSE alongside the highest \(R^2\).
Let’s compare the returned values from the original training models.
|  | knn | nnet | svm | mars |
|---|---|---|---|---|
| RMSE | 0.6176822 | 0.8040919 | 0.6012018 | 0.6656055 |
| Rsquared | 0.6305230 | 0.4432069 | 0.6410990 | 0.5739232 |
| MAE | 0.5164498 | 0.6394613 | 0.5052518 | 0.5260355 |
Likewise, on the resampling results the SVM model returned the lowest RMSE alongside the highest \(R^2\), as seen in the table above.
(b)
Which predictors are most important in the optimal nonlinear regression model?
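The kernlab summary below was presumably obtained by printing the final fitted model underneath the tuned caret object; a minimal sketch:
# Sketch (assumed): inspect the final ksvm fit of the tuned SVM model.
svmModel$finalModel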
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 16
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0173398678411205
##
## Number of Support Vectors : 119
##
## Objective Function Value : -74.8895
## Training error : 0.009279
From the above results, the SVM model uses 119 of the 132 training set data points as support vectors (about 90% of the training set).
Do either the biological or process variables dominate the list?
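The list of retained columns shown below can be printed directly from the reduced data set (a sketch, assuming the objects defined earlier):
# Sketch (assumed): list the columns that survived the correlation and
# near-zero variance filters (Yield plus 46 predictors).
names(CMP.trans_reduced)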
## [1] "Yield" "BiologicalMaterial01"
## [3] "BiologicalMaterial03" "BiologicalMaterial05"
## [5] "BiologicalMaterial06" "BiologicalMaterial08"
## [7] "BiologicalMaterial09" "BiologicalMaterial10"
## [9] "BiologicalMaterial11" "ManufacturingProcess01"
## [11] "ManufacturingProcess02" "ManufacturingProcess03"
## [13] "ManufacturingProcess04" "ManufacturingProcess05"
## [15] "ManufacturingProcess06" "ManufacturingProcess07"
## [17] "ManufacturingProcess08" "ManufacturingProcess09"
## [19] "ManufacturingProcess10" "ManufacturingProcess11"
## [21] "ManufacturingProcess12" "ManufacturingProcess13"
## [23] "ManufacturingProcess14" "ManufacturingProcess15"
## [25] "ManufacturingProcess16" "ManufacturingProcess17"
## [27] "ManufacturingProcess19" "ManufacturingProcess20"
## [29] "ManufacturingProcess21" "ManufacturingProcess22"
## [31] "ManufacturingProcess23" "ManufacturingProcess24"
## [33] "ManufacturingProcess26" "ManufacturingProcess28"
## [35] "ManufacturingProcess30" "ManufacturingProcess32"
## [37] "ManufacturingProcess33" "ManufacturingProcess34"
## [39] "ManufacturingProcess35" "ManufacturingProcess36"
## [41] "ManufacturingProcess37" "ManufacturingProcess38"
## [43] "ManufacturingProcess39" "ManufacturingProcess41"
## [45] "ManufacturingProcess43" "ManufacturingProcess44"
## [47] "ManufacturingProcess45"
In this case, the process (Manufacturing) variables dominate the list.
How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
Let’s see how the importance for each variable is:
importance <- varImp(svmModel)
importance
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 46)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 94.05
## BiologicalMaterial06 77.39
## BiologicalMaterial03 75.78
## ManufacturingProcess36 71.04
## ManufacturingProcess17 70.86
## ManufacturingProcess09 60.99
## ManufacturingProcess33 51.12
## BiologicalMaterial11 50.21
## ManufacturingProcess06 49.88
## BiologicalMaterial09 38.50
## BiologicalMaterial08 38.19
## ManufacturingProcess11 35.36
## BiologicalMaterial01 31.09
## ManufacturingProcess30 30.90
## ManufacturingProcess12 27.28
## ManufacturingProcess26 25.87
## ManufacturingProcess28 22.47
## ManufacturingProcess01 19.85
## BiologicalMaterial10 17.75
From the above, we notice that 7 of the top 10 predictors are ManufacturingProcess variables and 3 are BiologicalMaterial variables. I believe this is important because we cannot disregard the role these biological variables play in the chemical manufacturing process.
(c)
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
Yes, the intuition is evident in the relationships seen above. Not only do they appear to be roughly linear, but they also appear to be statistically significant.
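The plots referred to above are not reproduced here; a sketch of how they could be drawn for the top predictors of the SVM model, using caret's featurePlot (the predictor names are taken from the importance table above):
# Sketch (assumed): scatter plots of the most important predictors against Yield.
topPredictors <- c("ManufacturingProcess32", "ManufacturingProcess13",
                   "BiologicalMaterial06", "BiologicalMaterial03")
featurePlot(x = CMP.trans_reduced[, topPredictors],
            y = CMP.trans_reduced$Yield,
            plot = "scatter",
            type = c("p", "smooth"))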
References
Kuhn, M. & Johnson, K. 2018. Applied Predictive Modeling. USA: Pfizer Global R&D. http://appliedpredictivemodeling.com/.
R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.