library(dplyr)
library(forecast)
library(ggplot2)
library(VIM) #missing data visualization
library(tidyr)
library(mice)
library(corrplot)
library(MASS)
library(Boruta) #feature selection
library(mlbench)
library(caret)
library(earth) #MARS
library(nnet) #neural network

Background

The purpose of this assignment was to explore Non-Linear Regression exercises from Applied Predictive Modeling.


7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used a nonlinear equation [given in the course text] to create data in which the x values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 additional non-informative predictors). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
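
For reference, the simulated response underlying mlbench.friedman1 (per the mlbench documentation) can be sketched as follows; this is shown only for orientation and is not used in the modeling:

# the Friedman (1991) benchmark function (a sketch for reference):
#   y = 10*sin(pi*x1*x2) + 20*(x3 - 0.5)^2 + 10*x4 + 5*x5 + e,  e ~ N(0, sd^2)
# x6 through x10 carry no signal
friedman1_mean <- function(x) {
  10 * sin(pi * x[1] * x[2]) + 20 * (x[3] - 0.5)^2 + 10 * x[4] + 5 * x[5]
}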

set.seed(333)

## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData <- mlbench.friedman1(200, sd=1)
trainingData$x <- data.frame(trainingData$x)

## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd=1)
testData$x <- data.frame(testData$x)

KNN

Tune several models on these data. For example:

knnModel <- train(x = trainingData$x, y = trainingData$y, 
                  method = 'knn',
                  preProc = c("center","scale"),
                  tuneLength = 10)

knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.655040  0.5617369  2.947070
##    7  3.612413  0.5783865  2.909076
##    9  3.586721  0.5997805  2.900490
##   11  3.550710  0.6216398  2.855431
##   13  3.568226  0.6277351  2.884006
##   15  3.598420  0.6304719  2.917630
##   17  3.615982  0.6406437  2.933561
##   19  3.633097  0.6477973  2.949347
##   21  3.656781  0.6511308  2.975054
##   23  3.669868  0.6584910  2.987740
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.1134993 0.6605619 2.4805851

We see that our KNN model, with k = 11, produced an RMSE of 3.11 and an R-squared of 0.66. These appear to be good performance metrics, but we really don’t know until we compare. For comparison we’ll run Neural Network, Multivariate Adaptive Regression Splines (MARS), and Support Vector Machine (SVM) models …

Neural Network

We start by training a basic neural network model, in a form nearly identical to the textbook example, and then output the test data performance metrics:

#basic neural network function call:
nnetFit <- nnet(trainingData$x, trainingData$y,
                size = 5,
                decay = 0.01,
                linout = TRUE,
                trace = FALSE,
                maxit = 500,
                MaxNWts = 5 * (ncol(trainingData$x) + 1) + 5 + 1) #max weights for 5 hidden units

nnetPred <- predict(nnetFit, newdata = testData$x)
postResample(pred = nnetPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 1.8480667 0.8675255 1.4461882

Our simple neural network model has a lower RMSE and a higher R-squared than the KNN model, and so appears to be a far better fit to these data.
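
A more thorough approach, closer to the tuned example in the course text, would search over the number of hidden units and the weight decay using caret's model-averaged neural network. A sketch (not run here; the grid values are my own choices):

nnetGrid <- expand.grid(decay = c(0, 0.01, 0.1),
                        size = 1:10,
                        bag = FALSE)
nnetTuned <- train(trainingData$x, trainingData$y,
                   method = "avNNet",
                   tuneGrid = nnetGrid,
                   preProc = c("center", "scale"),
                   linout = TRUE,
                   trace = FALSE,
                   MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                   maxit = 500)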

MARS

We continue with a MARS model:

marsFit <- earth(trainingData$x, trainingData$y)
marsFit
## Selected 13 of 17 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X4, X1, X2, X5, X3, X10, X6-unused, X7-unused, X8-unused, ...
## Number of terms at each degree of interaction: 1 12 (additive model)
## GCV 3.472269    RSS 531.6912    GRSq 0.8839551    RSq 0.910258

Notably, the MARS model selected all of the informative X1–X5 predictors (along with X10, which carries no signal).

Next we consider performance metrics:

marsPred <- predict(marsFit, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 1.7701179 0.8754754 1.3840929

Judging by the lower RMSE and higher R-squared, our MARS model performed even better than our neural net model.
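
The earth() call above fits the default additive model; MARS can also be tuned over the interaction degree and the number of retained terms through caret. A sketch (not run here; the parameter ranges are my own choices):

marsGrid <- expand.grid(degree = 1:2, nprune = 2:38)
marsTuned <- train(trainingData$x, trainingData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))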

SVM

Finally, we consider a support vector machine model:

svmRTuned <- train(trainingData$x, trainingData$y,
                   method = "svmRadial",
                   preProc = c("center","scale"),
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))

svmRTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.950677  0.7626755  2.318960
##      0.50  2.642286  0.7817966  2.061376
##      1.00  2.409545  0.8131294  1.885171
##      2.00  2.274644  0.8324003  1.773595
##      4.00  2.158366  0.8493427  1.687086
##      8.00  2.120974  0.8549863  1.660130
##     16.00  2.105689  0.8570113  1.650397
##     32.00  2.105689  0.8570113  1.650397
##     64.00  2.105689  0.8570113  1.650397
##    128.00  2.105689  0.8570113  1.650397
##    256.00  2.105689  0.8570113  1.650397
##    512.00  2.105689  0.8570113  1.650397
##   1024.00  2.105689  0.8570113  1.650397
##   2048.00  2.105689  0.8570113  1.650397
## 
## Tuning parameter 'sigma' was held constant at a value of 0.07281999
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.07281999 and C = 16.

With sigma ≈ 0.0728 we find that our optimal model has C = 16, an RMSE of 2.106, and an R-squared of 0.857. These are promising training metrics, so we generate predictions on the test set and evaluate their accuracy:

svmPred <- predict(svmRTuned, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.1480184 0.8155776 1.6976820

While the SVM is not a bad model, its test-set RMSE is higher than both the neural network’s and the MARS model’s (though it clearly outperforms KNN). The MARS model remains our strongest model on these data.
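
To make the comparison explicit, the test-set metrics can be collected into one table (a sketch that re-uses the prediction objects created above):

rbind(KNN  = postResample(knnPred,  testData$y),
      NNet = postResample(nnetPred, testData$y),
      MARS = postResample(marsPred, testData$y),
      SVM  = postResample(svmPred,  testData$y))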

7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

As in Exercise 6.3, we load the data, convert it to a matrix, impute missing values, remove highly correlated features, center and scale, convert back to a data frame (so that we have column names), and then perform a 75/25 train/test split (75% training data):

# Load the data
library(AppliedPredictiveModeling)
data(package = "AppliedPredictiveModeling")$results[, "Item"] #list datasets available in the package
data(ChemicalManufacturingProcess)

#convert data to matrix
cmp_matrix <- as.matrix(ChemicalManufacturingProcess)

#impute missing values
set.seed(333)
imputed_cmp_matrix <- mice(cmp_matrix, printFlag=F, method="pmm")
cmp_comp <- complete(imputed_cmp_matrix)

#remove highly correlated features (pairwise |r| > 0.80)
highly_correlated <- findCorrelation(cor(cmp_comp), 0.80)
cmp_comp1 <- cmp_comp[, -highly_correlated]

#center and scale
cmp_comp2 <- scale(cmp_comp1, center = TRUE, scale = TRUE)

#convert to dataframe (so that we have column names)
cmp_df <- as.data.frame(cmp_comp2) #convert to df

#select features from model identified during stepAIC optimization - COMMENTED OUT
# cmp_df_simplified <- cmp_df[, c('Yield','ManufacturingProcess10','ManufacturingProcess06','ManufacturingProcess28', 'BiologicalMaterial05', 'ManufacturingProcess43', 'ManufacturingProcess20', 'ManufacturingProcess07', 'ManufacturingProcess02', 'ManufacturingProcess04', 'ManufacturingProcess13', 'ManufacturingProcess39', 'BiologicalMaterial10', 'BiologicalMaterial03', 'ManufacturingProcess34', 'ManufacturingProcess37', 
# 'ManufacturingProcess09', 'ManufacturingProcess32')]
# cmp_df_simplified #verify 

#train test split - 75/25 split
sample_size = round(nrow(cmp_df)*.75)
index <- sample(seq_len(nrow(cmp_df)), size = sample_size)
 
train <- cmp_df[index, ]
test <- cmp_df[-index, ]

Note: I commented out the stepAIC-based feature selection because carrying it forward would skew my later responses; a large part of this question is to see how the nonlinear model’s selected features and results differ from the linear model’s.
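
As a quick sanity check (a sketch), we can confirm that imputation left no missing values and inspect the split sizes:

sum(is.na(cmp_df)) #expect 0 after imputation
dim(train)         #~75% of the rows
dim(test)          #remaining ~25%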

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

As in the preceding section, we’ll consider KNN, Neural Network, MARS and SVM models.

KNN

set.seed(333)

#remove yield from sets to feed models
train.x <- train[-c(1)]
test.x <- test[-c(1)]

knnModel <- train(x = train.x, y = train$Yield, 
                  method = 'knn',
                  preProc = c("center","scale"),
                  tuneLength = 10)

knnModel
## k-Nearest Neighbors 
## 
## 132 samples
##  40 predictor
## 
## Pre-processing: centered (40), scaled (40) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    5  0.7770971  0.3969565  0.6066163
##    7  0.7796069  0.3927465  0.6174896
##    9  0.7728522  0.4043210  0.6194590
##   11  0.7731338  0.4074788  0.6227107
##   13  0.7772395  0.4038092  0.6287639
##   15  0.7773724  0.4112243  0.6293491
##   17  0.7777031  0.4210834  0.6281060
##   19  0.7864108  0.4176115  0.6366022
##   21  0.7895602  0.4218009  0.6392570
##   23  0.7927567  0.4235250  0.6427964
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.

k = 9 produced our optimal KNN model, with what may be a reasonable RMSE but a rather poor R-squared. We consider the verification / test set performance:

knnPred <- predict(knnModel, newdata = test.x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = test$Yield)
##      RMSE  Rsquared       MAE 
## 0.7458738 0.6030843 0.5837452

Oddly enough, our KNN model (with k = 9) performed better on the test data than it did in resampling. The RMSE and R-squared appear rather strong, and we proceed with our model exploration with this baseline in mind.

Neural Network

#basic neural network function call:
nnetFit <- nnet(train.x, train$Yield,
                size = 5,
                decay = 0.01,
                linout = TRUE,
                trace = FALSE,
                maxit = 500,
                MaxNWts = 5 * (ncol(train.x) + 1) + 5 + 1) #max weights for 5 hidden units

nnetPred <- predict(nnetFit, newdata = test.x)
postResample(pred = nnetPred, obs = test$Yield)
##      RMSE  Rsquared       MAE 
## 0.8597067 0.6599363 0.6976073

Our simple neural network model has a higher RMSE but also a higher R-squared than the KNN model. In selecting the strongest model we want the combination of lowest error (RMSE) and highest predictive accuracy (R-squared), so there appears to be room for improvement here.

As such, we consider a MARS model:

MARS

marsFit <- earth(train.x, train$Yield)
marsFit
## Selected 13 of 21 terms, and 9 of 40 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 12 (additive model)
## GCV 0.3445371    RSS 29.88337    GRSq 0.6324041    RSq 0.7547576

Based on the training output (i.e., an RSq of about 0.75), there appears to be a lot of promise here. Let’s consider the test performance metrics:

marsPred <- predict(marsFit, newdata = test.x)
postResample(pred = marsPred, obs = test$Yield)
##      RMSE  Rsquared       MAE 
## 0.6595294 0.6466317 0.5181177

The MARS model has the lowest test RMSE so far and the second-highest R-squared. This makes it a toss-up between our neural network model (better R-squared) and MARS model (better RMSE). Before drawing any conclusions, we consider an SVM model:

SVM

svmRTuned <- train(train.x, train$Yield,
                   method = "svmRadial",
                   preProc = c("center","scale"),
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))

svmRTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 132 samples
##  40 predictor
## 
## Pre-processing: centered (40), scaled (40) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 117, 118, 119, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE      
##      0.25  0.7649027  0.4616070  0.6229629
##      0.50  0.7150301  0.5016506  0.5757713
##      1.00  0.6667735  0.5603855  0.5323655
##      2.00  0.6295205  0.6009277  0.5065152
##      4.00  0.5920877  0.6443093  0.4883642
##      8.00  0.5832544  0.6540218  0.4826263
##     16.00  0.5832544  0.6540218  0.4826263
##     32.00  0.5832544  0.6540218  0.4826263
##     64.00  0.5832544  0.6540218  0.4826263
##    128.00  0.5832544  0.6540218  0.4826263
##    256.00  0.5832544  0.6540218  0.4826263
##    512.00  0.5832544  0.6540218  0.4826263
##   1024.00  0.5832544  0.6540218  0.4826263
##   2048.00  0.5832544  0.6540218  0.4826263
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02392857
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02392857 and C = 8.

With sigma ≈ 0.0239 and C = 8, we have an RMSE of 0.583 and an R-squared of 0.654. These are promising training metrics, so we generate predictions on the test set and evaluate their accuracy:

svmPred <- predict(svmRTuned, newdata = test.x)
postResample(pred = svmPred, obs = test$Yield)
##      RMSE  Rsquared       MAE 
## 0.6725301 0.6507183 0.4973401

Its test RMSE (0.673) is slightly higher than the MARS model’s (0.660), but its R-squared (0.651) is slightly higher and its cross-validated resampling performance was the strongest of the models considered. Balancing resampling and test-set performance, I elect the SVM model as my optimal nonlinear regression model.
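
For completeness, the test-set metrics for the four nonlinear models can be tabulated side by side and ordered by RMSE (a sketch that re-uses the prediction objects above):

perf <- rbind(KNN  = postResample(knnPred,  test$Yield),
              NNet = postResample(nnetPred, test$Yield),
              MARS = postResample(marsPred, test$Yield),
              SVM  = postResample(svmPred,  test$Yield))
perf[order(perf[, "RMSE"]), ]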

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

For variable importance we use the varImp() function on our optimal SVM model and produce the following:

svmImp <- varImp(svmRTuned, scale = FALSE)
svmImp
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                        Overall
## ManufacturingProcess32 0.35450
## ManufacturingProcess13 0.28288
## ManufacturingProcess09 0.26357
## ManufacturingProcess17 0.25501
## ManufacturingProcess36 0.24516
## BiologicalMaterial03   0.24342
## BiologicalMaterial11   0.17693
## ManufacturingProcess06 0.17206
## ManufacturingProcess11 0.16828
## ManufacturingProcess30 0.13826
## ManufacturingProcess12 0.11933
## BiologicalMaterial09   0.11409
## ManufacturingProcess04 0.09219
## ManufacturingProcess16 0.08699
## ManufacturingProcess26 0.07998
## ManufacturingProcess01 0.07410
## ManufacturingProcess35 0.07069
## ManufacturingProcess28 0.05898
## ManufacturingProcess10 0.05842
## BiologicalMaterial10   0.05474

As was the case for our linear regression model, the process variables dominate the list.

When we compare the top 10 variables for our nonlinear regression model against our linear regression model (DATA 624 HW7), we observe that ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess09, and ManufacturingProcess06 appear in both, whereas ManufacturingProcess17, ManufacturingProcess36, BiologicalMaterial03, BiologicalMaterial11, ManufacturingProcess11, and ManufacturingProcess30 are unique to the nonlinear SVM model.
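
This comparison can also be made programmatically. A sketch, assuming a hypothetical vector linear_top10 holding the top-ten predictor names saved from the HW7 linear model:

svm_top10 <- rownames(svmImp$importance)[order(svmImp$importance$Overall,
                                               decreasing = TRUE)][1:10]
# linear_top10 <- c("ManufacturingProcess32", ...) #hypothetical: top-10 names from the linear model
intersect(svm_top10, linear_top10) #predictors shared with the linear model
setdiff(svm_top10, linear_top10)   #predictors unique to the nonlinear SVM model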

(c) Explore the relationship between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

To explore the relationship between the response and the top predictors that are unique to the optimal nonlinear regression model (our SVM model), I elect to visualize via a correlation matrix and feature plots:

#subset the top predictors unique to the SVM model from the training data
unique_train <- train.x[, c('ManufacturingProcess17', 'ManufacturingProcess36', 'BiologicalMaterial03', 'BiologicalMaterial11', 'ManufacturingProcess11', 'ManufacturingProcess30')]

#correlation matrix (Yield included so predictor-response correlations can be read off)
resp_cor <- cor(cbind(Yield = train$Yield, unique_train))
corrplot(resp_cor, method = 'number', type = 'lower')

#feature plots
colnames(unique_train) <- gsub("(Process|Material)", "", colnames(unique_train))
featurePlot(unique_train, train$Yield)

From the correlation matrix we see the strength of the correlation between Yield and each top predictor (as well as among the predictors themselves), and from the feature plots we see the relationship between each top predictor and the response in an x ~ y fashion.

We see roughly linear, positive relationships between Yield and Biological03, Biological11, and Manufacturing11, whereas Manufacturing17 appears to have a negative relationship (higher values of Manufacturing17 correspond to lower Yield), Manufacturing30 is scattered about 0 with no clear trend, and Manufacturing36 takes on a small number of discrete levels. From these plots, we can determine how we might tweak the manufacturing variables within our control to increase yield. For example, we might reduce Manufacturing17 and 36, since a negative correlation with Yield appears to be at play.
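
For a complementary view (a sketch), the same relationships can be drawn with ggplot2, adding a loess smoother per predictor:

train %>%
  dplyr::select(Yield, ManufacturingProcess17, ManufacturingProcess36,
                BiologicalMaterial03, BiologicalMaterial11,
                ManufacturingProcess11, ManufacturingProcess30) %>%
  pivot_longer(-Yield, names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = value, y = Yield)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE) +
  facet_wrap(~ predictor, scales = "free_x")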