CUNY 624 Homework 8

Joel Park

Importing libraries:

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
library(e1071)
library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(nnet)

7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\[y = 10 sin(\pi x_1x_2) + 20(x_3-0.5)^2 + 10x^4 + 5x_5 + N(0,\sigma^2)\] where the x values are random variables uniformly distributed between [0,1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(caret)
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd=1)
## We conver the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to 
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

We will be implementing the non-linear models that we have learned in Chapter 7.

KNN

According to 7.4, the KNN approach simply predicts a new sample using the K-closest samples from the training set. The basic KNN method depends on how the user defines distance between samples. Minkowski distance is typically utilized:

\[(\Sigma_{j-1}^{P}(x_{aj}-x_{bj})^2)^{\frac{1}{2}}\]

Many other distance metrics exist, such as Tanimoto, Hamming, and cosine, and are more appropriate for specific types of predictors and in specific scientific contexts.

Let us build using the training data set and the KNN model.

set.seed(1)
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.626071  0.4766532  2.954620
##    7  3.463866  0.5231142  2.823065
##    9  3.387872  0.5552706  2.744969
##   11  3.345212  0.5754355  2.706971
##   13  3.317962  0.5964912  2.696927
##   15  3.291614  0.6123880  2.668077
##   17  3.293991  0.6238074  2.665458
##   19  3.300870  0.6342793  2.668902
##   21  3.297502  0.6472145  2.674792
##   23  3.298045  0.6534698  2.676965
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
plot(knnModel)

This model suggests that using 15 nearest neighbors produced the lowest RMSE. As a result, we will use parameter. Please note that we needed to standardize and scale our dataset prior to applying the KNN as distance is important. Let us see how well this model applies to the testing data set.

knnPred <- predict(knnModel, newdata = testData$x)
## The function `postResample` can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.1750657 0.6785946 2.5443169

Overall, the model performs reasonally well with an RMSE of 3.175 and an R-squared value of 0.679, suggesting that nearly 68% of the datasets variability can be explained by this model.

Support Vector Machines

According to 7.3, Support Vector Machines are a class of powerful and highly flexible modeling techniques. For this model, we will be utilizing the regression form. This model tends to be noted as robust. The model in particular that we will be focusing on will be the \(\epsilon\)-insensitive regression.

Given a threshold set by the user (denoted as \(\epsilon\)), data points with residuals within the threshold do not contribute to the regression fit while data points with an absolute difference greater than the threshold contribute a linear-scale amount.

There are several consequences to this approach. First, since the squared residuals are not used, large outliers have a limited effect on the regression equation. Second, samples that the model fits well have no effect on the regression equation.

We should also note that SVM supports kernel functions. The kernel function can be used to generalized the regression model and encompass nonlinear function of the predictors.

\[\text{polynomial} = (\phi (x'u) + 1)^{\text{degree}}\] \[\text{radial basis function} = exp(-\sigma||x - u||^2)\] \[\text{hyperbolic tangent} = tanh(\phi (x'u) + 1)\]

Let us see if we can tune our SVM model using cross validation and a radial basis function.

set.seed(1)
svmRTuned <- train(x = trainingData$x, 
                   y = trainingData$y,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.531367  0.8022380  2.018644
##      0.50  2.274330  0.8152517  1.796103
##      1.00  2.096272  0.8344065  1.648492
##      2.00  1.963889  0.8528505  1.545874
##      4.00  1.893957  0.8618554  1.488518
##      8.00  1.869287  0.8642229  1.482789
##     16.00  1.867271  0.8644676  1.485789
##     32.00  1.867271  0.8644676  1.485789
##     64.00  1.867271  0.8644676  1.485789
##    128.00  1.867271  0.8644676  1.485789
##    256.00  1.867271  0.8644676  1.485789
##    512.00  1.867271  0.8644676  1.485789
##   1024.00  1.867271  0.8644676  1.485789
##   2048.00  1.867271  0.8644676  1.485789
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06444911
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06444911 and C = 16.
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 16 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0644491109951026 
## 
## Number of Support Vectors : 151 
## 
## Objective Function Value : -71.1627 
## Training error : 0.008521

The final model that SVM had formed was an epsilon value of 0.1, cost C = 16 using 151 support vectors. The sigma value for the radial basis kernel function was 0.0644. Overall, the training error was quite low, which could potentially represent overfitting. However, let us examine the testing data set and see how the model performs.

svmPred <- predict(svmRTuned, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.0772708 0.8250944 1.5780035

Overall the support vector machine performs quite well with an RMSE of 2.0772780 and R-squared value of 82.51%. This model performs significantly better than KNN.

Neural Networks

According to 7.1, Neural networks are powerful nonlinear regression techniques. The outcome is modeled by an intermediary set of unobserved variables called hidden units. Each hidden unit is a linear combination of some or all of the predictor variables. However, this linear combination is typically transformed by a nonlinear function \(g(.)\), such as the logistic function. Once the number of hidden units is defined, each unit must be related to the outcome.

For this type of network model and P predictors, there are a total of H(P + 1) + H + 1 total parameters being estimated. The parameters are usually optimized to minimize the sum of the squared residuals. As neural networks has a tendendency to overfit, different techniques can be deployed to combat this issue including early stopping and weight decay. The weight decay is a penalization method to regularize the model similar to ridge regression.

Let us fit a neural network model.

set.seed(1)
nnetFit <- nnet(trainingData$x,
                trainingData$y,
                size = 5,
                decay = 0.01,
                linout = TRUE,
                ## Reduce the amount of printed output
                trace = FALSE,
                ## Expand the number of iterations to find parameter estimates
                maxit = 500,
                ## and the number of parameters used by the model
                MaxNWts = 5 * (ncol(trainingData$x) + 1) + 5 + 1)
nnetFit
## a 10-5-1 network with 61 weights
## options were - linear output units  decay=0.01

Let us now see how the model performs with testing data.

nnetFitPred <- predict(nnetFit, newdata = testData$x)
postResample(pred = nnetFitPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.6581943 0.7347806 1.8595643

Overall, the model performed well. Not quite as well as the support vector machine. However, we can try to use an average output of multiple different models and see if we have an improved RMSE. As neural networks utilize forward feed and then back propagation, the networks tend to run the risk of converging on the local minima and not the global minima. As a result, the initial randomization of parameters are large factors into its convergence. Therefore, by initializing several models and taking its average, we can hope to potentially find a better fit.

set.seed(1)
nnetAvg <- avNNet(trainingData$x,
                  trainingData$y,
                  size = 5,
                  decay = 0.01,
                  ## Specify how many models to average
                  repeats = 5,
                  linout = TRUE,
                  ## Reduce the amount of printed output
                  trace = FALSE,
                  ## expand the number of iterations to find paramater estimates)
                  maxit = 500,
                  ## and the number of parameters used by the model
                  MaxNWts = 5 * (ncol(trainingData$x) + 1) + 5 + 1)
## Warning: executing %dopar% sequentially: no parallel backend registered
nnetAvg
## Model Averaged Neural Network with 5 Repeats  
## 
## a 10-5-1 network with 61 weights
## options were - linear output units  decay=0.01
nnetAvgPred <- predict(nnetAvg, newdata = testData$x)
postResample(pred = nnetAvgPred, obs = testData$y)
##     RMSE Rsquared      MAE 
## 1.669999 0.888789 1.253323

The average model performed than the single neural network model. So far, the average neural network model works better than all of the previous models noted above.

Multivariate Adaptive Regression Splines

According to section 7.2, MARS uses surrogate features instead of the origianl features. MARS creates two contrasted versions of a predictor to enter the model. The nature of the MARS features breaks the predictors into two groups and models linear relationships between the predictor and the outcome in each group. In effect, this scheme creates a piecewise linear model where each new feature models an isolated portion of the original data.

Each data point for each predictor is evaluated as a candidate cut point by creating a linear regression model with the candidate features, and the corresponding model error is calculated. The predictor / cut point combination that achieves the smallest error is then used for the model. The nature of the predictor transformation makes such a large number of linear regressions computationally feasible. The nature of the predictor transformation makes such a large number of linear regressions computationally feasible.

After the initial model is created with the first two features, the model conducts another exhaustive search to find the next set of features that, given the initial set, yield the best model fit. The process continues until a stopping point is reached.

Once the full set of features has been created, the algorithm sequentially removes individual features that do not contribute significantly to the model equation. This “pruning” procedure assesses each predictor variable and esti- mates how much the error rate was decreased by including it in the model. To determine the contribution of each feature to the model, the GCV statistic is used.

Let us build our model using MARS.

# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Fix the seed so that the results can be reproduced
set.seed(1)
marsTuned <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "earth",
                   # Explicitly declare the candidate models to test
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))
marsTuned
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      4.300066  0.2727572  3.5381053
##   1        3      3.703179  0.4654129  2.9872991
##   1        4      2.581276  0.7364416  2.0665764
##   1        5      2.337532  0.7783412  1.8686573
##   1        6      2.225019  0.8069865  1.7661424
##   1        7      1.706797  0.8912095  1.3548039
##   1        8      1.642564  0.8998049  1.2789497
##   1        9      1.591858  0.9089374  1.2703021
##   1       10      1.644124  0.9007157  1.3063644
##   1       11      1.651664  0.8970682  1.3214880
##   1       12      1.668817  0.8939019  1.3152870
##   1       13      1.664278  0.8948394  1.3161942
##   1       14      1.646341  0.8969544  1.2956599
##   1       15      1.646341  0.8969544  1.2956599
##   1       16      1.646341  0.8969544  1.2956599
##   1       17      1.646341  0.8969544  1.2956599
##   1       18      1.646341  0.8969544  1.2956599
##   1       19      1.646341  0.8969544  1.2956599
##   1       20      1.646341  0.8969544  1.2956599
##   1       21      1.646341  0.8969544  1.2956599
##   1       22      1.646341  0.8969544  1.2956599
##   1       23      1.646341  0.8969544  1.2956599
##   1       24      1.646341  0.8969544  1.2956599
##   1       25      1.646341  0.8969544  1.2956599
##   1       26      1.646341  0.8969544  1.2956599
##   1       27      1.646341  0.8969544  1.2956599
##   1       28      1.646341  0.8969544  1.2956599
##   1       29      1.646341  0.8969544  1.2956599
##   1       30      1.646341  0.8969544  1.2956599
##   1       31      1.646341  0.8969544  1.2956599
##   1       32      1.646341  0.8969544  1.2956599
##   1       33      1.646341  0.8969544  1.2956599
##   1       34      1.646341  0.8969544  1.2956599
##   1       35      1.646341  0.8969544  1.2956599
##   1       36      1.646341  0.8969544  1.2956599
##   1       37      1.646341  0.8969544  1.2956599
##   1       38      1.646341  0.8969544  1.2956599
##   2        2      4.409681  0.2412484  3.5980118
##   2        3      3.707764  0.4648067  2.9883864
##   2        4      2.581636  0.7362217  2.0618036
##   2        5      2.269974  0.7948037  1.8302011
##   2        6      2.237822  0.8029324  1.7700901
##   2        7      1.733594  0.8850348  1.3695287
##   2        8      1.644366  0.8972816  1.2693666
##   2        9      1.494714  0.9121742  1.1667691
##   2       10      1.446932  0.9181004  1.1573821
##   2       11      1.344525  0.9282764  1.0911973
##   2       12      1.260571  0.9367803  0.9954347
##   2       13      1.270994  0.9361170  0.9851090
##   2       14      1.227449  0.9406268  0.9454779
##   2       15      1.225592  0.9407452  0.9553970
##   2       16      1.228297  0.9409774  0.9650007
##   2       17      1.227121  0.9408193  0.9640853
##   2       18      1.227121  0.9408193  0.9640853
##   2       19      1.227121  0.9408193  0.9640853
##   2       20      1.227121  0.9408193  0.9640853
##   2       21      1.227121  0.9408193  0.9640853
##   2       22      1.227121  0.9408193  0.9640853
##   2       23      1.227121  0.9408193  0.9640853
##   2       24      1.227121  0.9408193  0.9640853
##   2       25      1.227121  0.9408193  0.9640853
##   2       26      1.227121  0.9408193  0.9640853
##   2       27      1.227121  0.9408193  0.9640853
##   2       28      1.227121  0.9408193  0.9640853
##   2       29      1.227121  0.9408193  0.9640853
##   2       30      1.227121  0.9408193  0.9640853
##   2       31      1.227121  0.9408193  0.9640853
##   2       32      1.227121  0.9408193  0.9640853
##   2       33      1.227121  0.9408193  0.9640853
##   2       34      1.227121  0.9408193  0.9640853
##   2       35      1.227121  0.9408193  0.9640853
##   2       36      1.227121  0.9408193  0.9640853
##   2       37      1.227121  0.9408193  0.9640853
##   2       38      1.227121  0.9408193  0.9640853
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.

The final values that were used were nprune = 15 and degree = 2.

plot(marsTuned)

marsPred <- predict(marsTuned, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 1.1589948 0.9460418 0.9250230

The MARS model worked best here with an astounding R-squared value and low RMSE. Let us determine and see which variables were considered important.

# `varImp` in caret package can estimate the importance of each predictor in the MARS model
varImp(marsTuned)
## earth variable importance
## 
##     Overall
## X1   100.00
## X4    85.14
## X2    69.24
## X5    49.31
## X3    40.00
## X10    0.00
## X8     0.00
## X9     0.00
## X6     0.00
## X7     0.00

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1-X5)?

In comparison (utilizing RMSE), the MARS model appeared to have the lowest RMSE and highest R-squared. Therefore, the MARS model gave the best performance.

The MARS model does appear to have X1 - X5 as its most informative predictors (as noted in the varImp function).

7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Loading in the data.

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")

Reference: http://www.rpubs.com/jcp9010/483992

Please refer to my previous homework page as noted above in the reference. The data contained many missing values. Given the numberous amount of predictor variables, I had suggested using KNN to impute in the missing values.

Imputation of the missing data points using KNN.

library(DMwR)
## Loading required package: grid
df <- knnImputation(ChemicalManufacturingProcess[, 1:57], k = 3, meth = "weighAvg")

Now that all of the missing values have been filled, I had decided to remove zero or near zero variables as those variables can lead to model instability.

Removing zero or near zero variables.

near_zero <- nearZeroVar(df)
df <- df[,-near_zero]

As many of the models require a scaling of the predictors, I will go ahead and standardize and scale the predictors.

Standardize and scale the predictors:

df[,2:(ncol(df))] <- scale(df[,2:(ncol(df))])
head(df)
##   Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00           -0.2261036           -1.5140979          -2.68303622
## 2 42.44            2.2391498            1.3089960          -0.05623504
## 3 42.03            2.2391498            1.3089960          -0.05623504
## 4 41.42            2.2391498            1.3089960          -0.05623504
## 5 42.49            1.4827653            1.8939391           1.13594780
## 6 43.57           -0.4081962            0.6620886          -0.59859075
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1            0.2201765            0.4941942           -1.3828880
## 2            1.2964386            0.4128555            1.1290767
## 3            1.2964386            0.4128555            1.1290767
## 4            1.2964386            0.4128555            1.1290767
## 5            0.9414412           -0.3734185            1.5348350
## 6            1.5894524            1.7305423            0.6192092
##   BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 1            -1.233131           -3.3962895            1.1005296
## 2             2.282619           -0.7227225            1.1005296
## 3             2.282619           -0.7227225            1.1005296
## 4             2.282619           -0.7227225            1.1005296
## 5             1.071310           -0.1205678            0.4162193
## 6             1.189487           -1.7343424            1.6346255
##   BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 1            -1.838655           -1.7709224              0.2006342
## 2             1.393395            1.0989855             -6.1677821
## 3             1.393395            1.0989855             -6.1677821
## 4             1.393395            1.0989855             -6.1677821
## 5             0.136256            1.0989855             -0.2803476
## 6             1.022062            0.7240877              0.4349481
##   ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 1              0.5551856              0.3045192              0.2349609
## 2             -1.9906511              0.5708010             -2.3747305
## 3             -1.9906511              0.1938106             -3.1737741
## 4             -1.9906511              0.5427779             -3.3335828
## 5             -1.9906511              0.4262861             -2.2149218
## 6             -1.9906511              0.4262861             -1.2560694
##   ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 1            -0.43413255             -0.4693269             -0.4256500
## 2             1.00413856              0.9730078             -0.9578363
## 3             0.06509023             -0.1046607              1.0427233
## 4             0.42626267              2.1993201             -0.9578363
## 5             0.84981943             -0.6249144              1.0427233
## 6             0.49849715              0.5642370              1.0427233
##   ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 1             -0.5761169             -1.7201524            -0.29888818
## 2              0.8991659              0.5883746             0.48941917
## 3              0.8991659             -0.3815947             0.19393012
## 4             -1.1108074             -0.4785917             0.09437624
## 5              0.8991659             -0.4527258            -0.37534711
## 6              0.8991659             -0.2199332            -0.32443045
##   ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 1             -0.3232328             -0.4790178             0.97711512
## 2              1.1432407             -0.4790178            -0.50030980
## 3              0.2640909             -0.4790178             0.28765016
## 4              1.0545956             -0.4790178             0.28765016
## 5              0.5336292             -0.4790178             0.09066017
## 6              0.5765004             -0.4790178            -0.50030980
##   ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 1              0.8141279              1.1846438              0.3303945
## 2              0.2811695              0.9617071              0.1455765
## 3              0.4465704              0.8245152              0.1455765
## 4              0.7957500              1.0817499              0.1967569
## 5              2.5416480              3.3282665              0.4754056
## 6              2.4130029              3.1396277              0.6261033
##   ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 1              0.9263296              0.1505348              0.4563798
## 2             -0.2753953              0.1559773              1.5095063
## 3              0.3655246              0.1831898              1.0926437
## 4              0.3655246              0.1695836              0.9829430
## 5             -0.3555103              0.2076811              1.6192070
## 6             -0.7560852              0.1423710              1.9044287
##   ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 1              0.3109942              0.2109804              0.3855865
## 2              0.1849230              0.2109804             -0.7262673
## 3              0.1849230              0.2109804             -0.4252906
## 4              0.1562704              0.2109804             -0.1243139
## 5              0.2938027             -0.6884239              0.7786162
## 6              0.3998171             -0.5599376              1.0795929
##   ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 1              0.7815967              0.4717009              0.1216354
## 2             -1.8212444             -1.0109484              0.1107691
## 3             -1.2190925             -0.8381332              0.1868337
## 4             -0.6169407             -0.6653181              0.1732507
## 5              0.5873630              1.5812789              0.2764812
## 6             -1.2190925             -1.3565787              0.1162023
##   ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 1              0.1274268              0.3500191              0.8103308
## 2              0.1994510              0.1923811              0.9052417
## 3              0.2190939              0.2124441              0.8862595
## 4              0.2081812              0.1923811              0.8862595
## 5              0.2954832              0.3471529              0.9242238
## 6              0.2452845              0.3557513              0.9432060
##   ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 1              0.6071537              0.7568194             -0.2010964
## 2              0.8509759              0.7568194             -0.2741327
## 3              0.7900203              0.2379678             -0.1645783
## 4              0.7900203              0.2379678             -0.1645783
## 5              0.9728869             -0.1771134             -0.1463192
## 6              1.0338425              0.9643601             -0.3654280
##   ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 1             -0.4568829              0.9999809             -1.7011967
## 2              1.9517531              0.9999809              1.9880597
## 3              2.6928719              0.9999809              1.9880597
## 4              2.3223125              1.8158588              0.1434315
## 5              2.3223125              2.6317367              0.1434315
## 6              2.6928719              2.6317367              0.1434315
##   ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 1            -0.87382216             -0.6515638             -1.1540243
## 2             1.17282515             -0.6515638              2.2161351
## 3             1.26585457             -1.8089817             -0.7046697
## 4             0.05647207             -1.8089817              0.4187168
## 5            -2.54835178             -2.9663997             -1.8280562
## 6            -0.50170447             -1.8089817             -1.3787016
##   ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 1              0.7174727              0.2317270             -0.4610622
## 2             -0.8224687              0.2317270              2.1565813
## 3             -0.8224687              0.2317270             -0.4610622
## 4             -0.8224687              0.2317270             -0.4610622
## 5             -0.8224687              0.2981503             -0.4610622
## 6             -0.8224687              0.2317270             -0.4610622
##   ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 1             -0.4390981             0.20279570             2.40564734
## 2              2.3542004            -0.05472265            -0.01374656
## 3             -0.4390981             0.40881037             0.10146268
## 4             -0.4390981            -0.31224099             0.21667191
## 5             -0.4390981            -0.10622632             0.21667191
## 6             -0.4390981             0.15129203             1.48397347
##   ManufacturingProcess44
## 1            -0.01588055
## 2             0.29467248
## 3            -0.01588055
## 4            -0.01588055
## 5            -0.32643359
## 6            -0.01588055

Now that I have successfully imputed and scaled the data, it is now time to split the data into training and testing data sets.

Splitting the data into training and testing data sets.

set.seed(1)
inTraining <- createDataPartition(df$Yield, p = 0.80, list=FALSE)
training <- df[ inTraining,]
testing <- df[-inTraining,]

X_train <- training[,2:(length(training))]
Y_train <- training$Yield

X_test <- testing[,2:(length(testing))]
Y_test <- testing$Yield

The data has now been imputed, splitted and pre-processed.

A. Which nonlinear regression model gives the optimal resampling and test set performance?

As of note, we will be using the same models as previously performed in question 7.2. We should note that response variable here is:yield.

K-Nearest Neighbors

set.seed(1)
knnModel1 <- train(x = X_train,
                  y = Y_train,
                  method = "knn",
                  #preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel1
## k-Nearest Neighbors 
## 
## 144 samples
##  55 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.481059  0.3814268  1.169101
##    7  1.461855  0.3927446  1.164240
##    9  1.446106  0.4108643  1.160198
##   11  1.441563  0.4189638  1.157509
##   13  1.437135  0.4284012  1.161061
##   15  1.451620  0.4178187  1.178368
##   17  1.457394  0.4149991  1.186073
##   19  1.463996  0.4131880  1.190119
##   21  1.465359  0.4165220  1.192271
##   23  1.474984  0.4111554  1.195094
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 13.
plot(knnModel1)

The model has chosen k = 13 as the optimal model choice for k nearest neighbors. Let’s see how it performs on testing data.

knnPred1 <- predict(knnModel1, newdata = X_test)
postResample(pred = knnPred1, obs = Y_test)
##      RMSE  Rsquared       MAE 
## 1.3438388 0.5473626 1.0784564

Overall, the model performs reasonably well.

Support Vector Machines

set.seed(1)
svmRTuned1 <- train(x = X_train, 
                   y = Y_train,
                   method = "svmRadial",
                   #preProc = c("center", "scale"),
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))
svmRTuned1
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  55 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 130, 130, 131, 129, 130, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE      
##      0.25  1.416423  0.5028799  1.1605561
##      0.50  1.319232  0.5346143  1.0833591
##      1.00  1.235756  0.5760082  1.0064023
##      2.00  1.222457  0.5768255  0.9843147
##      4.00  1.227817  0.5698084  0.9874162
##      8.00  1.226535  0.5750713  0.9946620
##     16.00  1.227513  0.5748177  0.9962901
##     32.00  1.227513  0.5748177  0.9962901
##     64.00  1.227513  0.5748177  0.9962901
##    128.00  1.227513  0.5748177  0.9962901
##    256.00  1.227513  0.5748177  0.9962901
##    512.00  1.227513  0.5748177  0.9962901
##   1024.00  1.227513  0.5748177  0.9962901
##   2048.00  1.227513  0.5748177  0.9962901
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01559259
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01559259 and C = 2.
svmRTuned1$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 2 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0155925867216759 
## 
## Number of Support Vectors : 126 
## 
## Objective Function Value : -67.9252 
## Training error : 0.095408

The SVM model is optimized with parameters epsilon = 0.1 and cost C = 2. This is using a kernel function called Gaussian Radial Basis kernel function with sigma = 0.01559 (the radial basis kernel was specified by us). Let us see how it performs on testing data.

svmPred1 <- predict(svmRTuned1, newdata = X_test)
postResample(pred = svmPred1, obs = Y_test)
##      RMSE  Rsquared       MAE 
## 0.9822078 0.7455355 0.7853845

The SVM appears to perform better than the KNN model.

Neural Networks

We will explore the single neural network model and then the average neural network model.

set.seed(1)
nnetFit1 <- nnet(X_train,
                Y_train,
                size = 5,
                decay = 0.01,
                linout = TRUE,
                ## Reduce the amount of printed output
                trace = FALSE,
                ## Expand the number of iterations to find parameter estimates
                maxit = 500,
                ## and the number of parameters used by the model
                MaxNWts = 5 * (ncol(X_train) + 1) + 5 + 1)

nnetFit1
## a 55-5-1 network with 286 weights
## options were - linear output units  decay=0.01
nnetFitPred1 <- predict(nnetFit1, newdata = X_test)
postResample(pred = nnetFitPred1, obs = Y_test)
##      RMSE  Rsquared       MAE 
## 2.0812159 0.5576348 1.6526974

The single neural network unfortunately does not perform quite as well as we hoped. Let us take a look at the average neural network.

set.seed(1)
nnetAvg1 <- avNNet(X_train,
                  Y_train,
                  size = 5,
                  decay = 0.01,
                  ## Specify how many models to average
                  repeats = 5,
                  linout = TRUE,
                  ## Reduce the amount of printed output
                  trace = FALSE,
                  ## expand the number of iterations to find paramater estimates)
                  maxit = 500,
                  ## and the number of parameters used by the model
                  MaxNWts = 5 * (ncol(X_train) + 1) + 5 + 1)

nnetAvg1
## Model Averaged Neural Network with 5 Repeats  
## 
## a 55-5-1 network with 286 weights
## options were - linear output units  decay=0.01
nnetAvgPred1 <- predict(nnetAvg1, newdata = X_test)
postResample(pred = nnetAvgPred1, obs = Y_test)
##      RMSE  Rsquared       MAE 
## 1.5638320 0.4884307 1.1923043

Interestingly, the average neural network performs worse than the single model. This suggests that the neural network model is really not the optimal model to choose for this dataset.

Multivariate Adaptive Regression Splines

As MARS performed the best with our previous question, let us see if MARS can duplicate similar success.

# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Fix the seed so that the results can be reproduced
set.seed(1)
marsTuned1 <- train(x = X_train,
                   y = Y_train,
                   method = "earth",
                   # Explicitly declare the candidate models to test
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))
marsTuned1
## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  55 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 130, 130, 131, 129, 130, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.363458  0.4558256  1.0689820
##   1        3      1.271861  0.5201694  1.0106995
##   1        4      1.244503  0.5458146  0.9833013
##   1        5      1.238886  0.5546325  0.9841253
##   1        6      1.299142  0.5236747  1.0200561
##   1        7      1.223629  0.5697152  0.9702813
##   1        8      1.203007  0.5985392  0.9565488
##   1        9      1.220538  0.5834291  0.9684082
##   1       10      1.227145  0.5768417  0.9702812
##   1       11      1.286950  0.5436602  0.9708968
##   1       12      1.271059  0.5559122  0.9644951
##   1       13      1.308718  0.5238114  1.0072276
##   1       14      1.298446  0.5340145  1.0073789
##   1       15      1.372028  0.5195326  1.0299789
##   1       16      1.372028  0.5195326  1.0299789
##   1       17      1.321986  0.5308154  1.0143525
##   1       18      1.329896  0.5283427  1.0172826
##   1       19      1.317402  0.5343547  1.0103932
##   1       20      1.325867  0.5318402  1.0134283
##   1       21      1.328594  0.5327647  1.0139789
##   1       22      1.328594  0.5327647  1.0139789
##   1       23      1.328594  0.5327647  1.0139789
##   1       24      1.328594  0.5327647  1.0139789
##   1       25      1.328594  0.5327647  1.0139789
##   1       26      1.328594  0.5327647  1.0139789
##   1       27      1.328594  0.5327647  1.0139789
##   1       28      1.355204  0.5237057  1.0282422
##   1       29      1.354034  0.5244783  1.0267683
##   1       30      1.359385  0.5194798  1.0336388
##   1       31      1.357839  0.5216488  1.0395689
##   1       32      1.357400  0.5215103  1.0374248
##   1       33      1.356468  0.5222010  1.0277678
##   1       34      1.353343  0.5237420  1.0249071
##   1       35      1.381416  0.5144563  1.0454328
##   1       36      1.381416  0.5144563  1.0454328
##   1       37      1.400371  0.5142342  1.0479364
##   1       38      1.400371  0.5142342  1.0479364
##   2        2      1.363458  0.4558256  1.0689820
##   2        3      1.300992  0.4950000  1.0266650
##   2        4      1.342267  0.4950703  1.0570482
##   2        5      1.366914  0.4565658  1.0811839
##   2        6      1.413882  0.4549845  1.1351104
##   2        7      1.391165  0.4701968  1.1161580
##   2        8      1.414190  0.4657550  1.1370014
##   2        9      1.852827  0.4650250  1.2825380
##   2       10      1.903292  0.4544349  1.3156051
##   2       11      1.849950  0.4646529  1.2628435
##   2       12      1.380283  0.4982493  1.0957051
##   2       13      1.984937  0.4161301  1.3111849
##   2       14      1.952098  0.4275927  1.2962357
##   2       15      1.984969  0.4254058  1.2999391
##   2       16      2.138145  0.4004278  1.3499275
##   2       17      2.182047  0.3916148  1.3776934
##   2       18      2.268426  0.3796668  1.4170635
##   2       19      2.302797  0.3622969  1.4409992
##   2       20      2.301247  0.3634443  1.4329096
##   2       21      2.294231  0.3665744  1.4312673
##   2       22      2.283355  0.3697245  1.4201279
##   2       23      2.283355  0.3697245  1.4201279
##   2       24      2.283355  0.3697245  1.4201279
##   2       25      2.206790  0.3696703  1.4045992
##   2       26      2.198842  0.3789270  1.4022743
##   2       27      2.198842  0.3789270  1.4022743
##   2       28      2.198842  0.3789270  1.4022743
##   2       29      2.198842  0.3789270  1.4022743
##   2       30      2.198842  0.3789270  1.4022743
##   2       31      2.198842  0.3789270  1.4022743
##   2       32      2.198842  0.3789270  1.4022743
##   2       33      2.198842  0.3789270  1.4022743
##   2       34      2.198842  0.3789270  1.4022743
##   2       35      2.198842  0.3789270  1.4022743
##   2       36      2.198842  0.3789270  1.4022743
##   2       37      2.198842  0.3789270  1.4022743
##   2       38      2.198842  0.3789270  1.4022743
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 8 and degree = 1.
plot(marsTuned1)

It appears that the optimal parameters are npruned = 8 and degree = 1. How does this perform with the testing data set.

marsPred1 <- predict(marsTuned1, newdata = X_test)
postResample(pred = marsPred1, obs = Y_test)
##      RMSE  Rsquared       MAE 
## 1.1870875 0.6182487 0.9690215

While this model performed better than the neural network, this model has not performed as the support vector machine.

The SVM model had the lowest RMSE and highest R-squared, making the SVM model the best non-linear model for this data set.

B. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Using the varImp function from caret, this can help determine the importance of each predictor in the SVM model.

# `varImp` in caret package can estimate the importance of each predictor in the SVM model
varImp(svmRTuned1)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 55)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     84.75
## ManufacturingProcess36   71.84
## BiologicalMaterial03     70.69
## ManufacturingProcess13   68.69
## BiologicalMaterial02     64.16
## ManufacturingProcess31   63.17
## BiologicalMaterial12     60.11
## ManufacturingProcess17   57.19
## ManufacturingProcess09   55.42
## ManufacturingProcess33   54.25
## BiologicalMaterial04     51.35
## ManufacturingProcess29   47.17
## BiologicalMaterial11     46.74
## ManufacturingProcess06   46.73
## BiologicalMaterial01     41.19
## BiologicalMaterial08     38.75
## ManufacturingProcess11   29.49
## ManufacturingProcess26   29.35
## BiologicalMaterial09     26.01

Out of the top 20, it is near equal split for Manufacturing and Biological Material in regards to variable importance in the SVM model. The most important feature though is manufacturing process 32. Let us compare this to our optimal linear model.

If we recall from the previous assignment, the lasso model was the most optimal linear model. Let us rebuild that model here.

library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
lassoModel <- enet(x = as.matrix(X_train), y = Y_train,
                   lambda = 0.0, normalize = TRUE)

lassoCoef <- predict(lassoModel, newx = as.matrix(X_test),
                     s=.1, mode = "fraction", type = "coefficients")

list_coef <- lassoCoef$coefficients
sort(list_coef[list_coef != 0])
## ManufacturingProcess17 ManufacturingProcess37 ManufacturingProcess07 
##           -0.292509854           -0.264746168           -0.145392555 
## ManufacturingProcess28   BiologicalMaterial10 ManufacturingProcess33 
##           -0.130398423           -0.092349719           -0.077063799 
## ManufacturingProcess13 ManufacturingProcess35 ManufacturingProcess23 
##           -0.072017994           -0.068392958           -0.021126290 
## ManufacturingProcess22 ManufacturingProcess11 ManufacturingProcess01 
##           -0.001667586            0.004284190            0.009531290 
## ManufacturingProcess29 ManufacturingProcess44 ManufacturingProcess18 
##            0.022486782            0.022556238            0.024365489 
## ManufacturingProcess15 ManufacturingProcess19   BiologicalMaterial03 
##            0.061132305            0.063689988            0.069128515 
## ManufacturingProcess42 ManufacturingProcess34 ManufacturingProcess39 
##            0.117869928            0.120223921            0.135753598 
## ManufacturingProcess43 ManufacturingProcess06   BiologicalMaterial06 
##            0.140424600            0.153505477            0.187477120 
##   BiologicalMaterial05 ManufacturingProcess04 ManufacturingProcess09 
##            0.231054233            0.267281197            0.608444498 
## ManufacturingProcess32 
##            1.017039797

As we notice, the lasso model appaers to weight significantly more on the manufacturing process (notice that all but 4 are manufacturing), whereas the SVM model had a near even split between the manufacturing and biological processes.

C. Explore the relationship between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Below is the final model for the tuned SVM.

svmRTuned1$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 2 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0155925867216759 
## 
## Number of Support Vectors : 126 
## 
## Objective Function Value : -67.9252 
## Training error : 0.095408

Let’s plot out and see how well it compares for the actual vs. predicted, and then the residual plot.

xyplot(Y_test ~ predict(svmRTuned1, X_test),
type = c("p", "r"),
xlab = "Predicted", ylab = "Observed")

predicted <- predict(svmRTuned1, X_test)
actual <-Y_test
xyplot((predicted-actual) ~ predicted,
type = c("p", "g"),
xlab = "Predicted radial svm", ylab = "Residuals")

Both plots appear to do well. The residual plots demonstrate no obvious pattern and the “actual vs. predicted” demonstrates fairly close correlation.

Here is the dot plot regarding the variable importance for this model.

dotPlot(varImp(svmRTuned1), top=15)

The above demonstrates the individual variable importance in a support vector machine. The higher the importance, the more valuable the predictor variable is when it comes to contributing to the model.

However, let us explore the relationship between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model.

We will use the featurePlot for the top 6 predictors and see how it correlates to the response variable yield.

features <- c("ManufacturingProcess32", "BiologicalMaterial06", "ManufacturingProcess36", "BiologicalMaterial03", "ManufacturingProcess13", "BiologicalMaterial02")
featurePlot(X_train[,features], Y_train)

cor(X_train[, features], Y_train)
##                              [,1]
## ManufacturingProcess32  0.6391015
## BiologicalMaterial06    0.5037752
## ManufacturingProcess36 -0.5417157
## BiologicalMaterial03    0.4578673
## ManufacturingProcess13 -0.4470691
## BiologicalMaterial02    0.5060191

The above is a visualization and actual numbers of the correlation of each feature to the response variable. As we can see, each variable corresponds differently (either positively or negative) with the response variable. Because the processes or materials are just numbers, it is difficult to tell what each process of materials does, so obtaining intuition regarding this is somewhat evasive. That being said, there does appear to be in each step some sort of correlation in regards to the Y response value.