library(dplyr)
library(forecast)
library(ggplot2)
library(VIM) #missing data visualization
library(tidyr)
library(mice)
library(corrplot)
library(MASS)
library(Boruta) #feature selection
library(mlbench)
library(caret)
library(earth) #MARS
library(nnet) #neural network
The purpose of this assignment was to explore Non-Linear Regression exercises from Applied Predictive Modeling.
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data [available in the course text], where the x values are random variables uniformly distributed on [0, 1] (five additional non-informative variables are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data.
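For reference, the mlbench documentation gives the underlying equation as y = 10 sin(pi * x1 * x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + e, where e is Gaussian noise with the specified sd and x6 through x10 carry no signal. The simulation below draws 200 training observations with sd = 1: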
set.seed(333)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
knnModel <- train(x = trainingData$x, y = trainingData$y,
                  method = 'knn',
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.655040 0.5617369 2.947070
## 7 3.612413 0.5783865 2.909076
## 9 3.586721 0.5997805 2.900490
## 11 3.550710 0.6216398 2.855431
## 13 3.568226 0.6277351 2.884006
## 15 3.598420 0.6304719 2.917630
## 17 3.615982 0.6406437 2.933561
## 19 3.633097 0.6477973 2.949347
## 21 3.656781 0.6511308 2.975054
## 23 3.669868 0.6584910 2.987740
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.1134993 0.6605619 2.4805851
We see that our KNN model, with k = 11, produced an RMSE of 3.11 and an R-squared of 0.66. These appear to be good performance metrics, but we really don't know until we compare. For comparison, we'll run Neural Network, Multivariate Adaptive Regression Splines (MARS), and Support Vector Machine (SVM) models.
We start by training a basic neural network model, in a form nearly identical to the textbook example, and then output its test-set performance metrics:
# basic neural network function call:
nnetFit <- nnet(trainingData$x, trainingData$y,
                size = 5,
                decay = 0.01,
                linout = TRUE,
                trace = FALSE,
                maxit = 500,
                # MaxNWts: upper bound on weights = 5 hidden units * (10 predictors + 1) + 5 + 1
                MaxNWts = 5 * (ncol(trainingData$x) + 1) + 5 + 1)
nnetPred <- predict(nnetFit, newdata = testData$x)
postResample(pred = nnetPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.8480667 0.8675255 1.4461882
Our simple neural network model has a lower RMSE and a higher R-squared, and appears to be a far better model than our KNN model.
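Note that this nnet fit uses a single, hand-picked size and decay. A fuller treatment, closer in spirit to the textbook, would tune those values by resampling; a minimal sketch (not run here; nnetGrid, nnetTuned, and the grid values are illustrative choices, not the textbook's exact grid):
# candidate hidden-layer sizes and weight-decay values
nnetGrid <- expand.grid(size = c(1, 5, 10), decay = c(0, 0.01, 0.1))
nnetTuned <- train(x = trainingData$x, y = trainingData$y,
                   method = "nnet",
                   tuneGrid = nnetGrid,
                   preProc = c("center", "scale"),
                   linout = TRUE, trace = FALSE, maxit = 500,
                   MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1)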
We continue with a MARS model:
marsFit <- earth(trainingData$x, trainingData$y)
marsFit
## Selected 13 of 17 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X4, X1, X2, X5, X3, X10, X6-unused, X7-unused, X8-unused, ...
## Number of terms at each degree of interaction: 1 12 (additive model)
## GCV 3.472269 RSS 531.6912 GRSq 0.8839551 RSq 0.910258
The MARS model did select the informative X1 - X5 predictors, leaving the non-informative X6 - X10 unused.
Next we consider performance metrics:
marsPred <- predict(marsFit, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.7701179 0.8754754 1.3840929
Judging by the lower RMSE and higher R-squared, our MARS model performed even better than our neural net model.
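Note that earth() above was fit with its defaults (an additive, degree-1 model). A more exhaustive approach would tune the pruning and interaction settings by resampling; a minimal sketch (not run here; marsGrid, marsTuned, and the grid values are illustrative choices):
# tune the number of retained terms and the interaction degree
marsGrid <- expand.grid(degree = 1:2, nprune = 2:20)
marsTuned <- train(x = trainingData$x, y = trainingData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))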
Finally, we consider a support vector machine model:
svmRTuned <- train(trainingData$x, trainingData$y,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.950677 0.7626755 2.318960
## 0.50 2.642286 0.7817966 2.061376
## 1.00 2.409545 0.8131294 1.885171
## 2.00 2.274644 0.8324003 1.773595
## 4.00 2.158366 0.8493427 1.687086
## 8.00 2.120974 0.8549863 1.660130
## 16.00 2.105689 0.8570113 1.650397
## 32.00 2.105689 0.8570113 1.650397
## 64.00 2.105689 0.8570113 1.650397
## 128.00 2.105689 0.8570113 1.650397
## 256.00 2.105689 0.8570113 1.650397
## 512.00 2.105689 0.8570113 1.650397
## 1024.00 2.105689 0.8570113 1.650397
## 2048.00 2.105689 0.8570113 1.650397
##
## Tuning parameter 'sigma' was held constant at a value of 0.07281999
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.07281999 and C = 16.
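If desired, the underlying kernlab fit (kernel parameter, cost, and the number of support vectors) can also be printed directly from the caret object (output not shown):
# inspect the final kernlab model stored by caret
svmRTuned$finalModel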
With sigma of ~0.0728, we find that our optimal model has C = 16, an RMSE of 2.106, and an R-squared of 0.857. These are promising resampling metrics, and so we proceed to generating predictions and comparing their accuracy against our test data:
svmPred <- predict(svmRTuned, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.1480184 0.8155776 1.6976820
While the SVM model is not at all bad, on the test set it trails both the neural network and MARS models (though it clearly beats KNN). The MARS model was our strongest model on these data.
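To make the ranking explicit, the test-set metrics can be stacked into a single table (a small sketch reusing the prediction objects above; 'results' is just a scratch name):
# collect test-set performance for all four models and sort by RMSE
results <- rbind(
  KNN  = postResample(pred = knnPred,  obs = testData$y),
  NNet = postResample(pred = nnetPred, obs = testData$y),
  MARS = postResample(pred = marsPred, obs = testData$y),
  SVM  = postResample(pred = svmPred,  obs = testData$y))
results[order(results[, "RMSE"]), ]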
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
As in Exercise 6.3, we load our data, convert it to a matrix, impute missing values, remove highly correlated features, center and scale, convert back to a data frame (so that we have column names), and then split the data into training and test sets using a 75/25 split (75% training data):
# Load the data
library(AppliedPredictiveModeling)
data(package = "AppliedPredictiveModeling")$results[, "Item"]
data(ChemicalManufacturingProcess)
# convert data to matrix
cmp_matrix <- as.matrix(ChemicalManufacturingProcess)

# impute missing values
set.seed(333)
imputed_cmp_matrix <- mice(cmp_matrix, printFlag = F, method = "pmm")
cmp_comp <- complete(imputed_cmp_matrix)

# remove highly correlated features
highly_correlated = findCorrelation(cor(cmp_comp), 0.80)
cmp_comp1 = cmp_comp[, -highly_correlated]

# center and scale
cmp_comp2 <- scale(cmp_comp1, center = TRUE, scale = TRUE)

# convert to dataframe (so that columns keep their names)
cmp_df <- as.data.frame(cmp_comp2)
#select features from model identified during stepAIC optimization - COMMENTED OUT
# cmp_df_simplified <- cmp_df[, c('Yield','ManufacturingProcess10','ManufacturingProcess06','ManufacturingProcess28', 'BiologicalMaterial05', 'ManufacturingProcess43', 'ManufacturingProcess20', 'ManufacturingProcess07', 'ManufacturingProcess02', 'ManufacturingProcess04', 'ManufacturingProcess13', 'ManufacturingProcess39', 'BiologicalMaterial10', 'BiologicalMaterial03', 'ManufacturingProcess34', 'ManufacturingProcess37',
# 'ManufacturingProcess09', 'ManufacturingProcess32')]
# cmp_df_simplified #verify
# train/test split - 75/25 split
sample_size = round(nrow(cmp_df) * .75)
index <- sample(seq_len(nrow(cmp_df)), size = sample_size)

train <- cmp_df[index, ]
test <- cmp_df[-index, ]
Note: I commented out the feature selection determined via stepAIC optimization because I believe it would skew my later responses, and a large part of this question is to see how the nonlinear model's features and results differ from a linear model's.
(a) Which nonlinear regression model gives the optimal resampling and test set performance?
As in the preceding section, we’ll consider KNN, Neural Network, MARS and SVM models.
set.seed(333)

# remove yield from the sets fed to the models
train.x <- train[-c(1)]
test.x <- test[-c(1)]

knnModel <- train(x = train.x, y = train$Yield,
                  method = 'knn',
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 132 samples
## 40 predictor
##
## Pre-processing: centered (40), scaled (40)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.7770971 0.3969565 0.6066163
## 7 0.7796069 0.3927465 0.6174896
## 9 0.7728522 0.4043210 0.6194590
## 11 0.7731338 0.4074788 0.6227107
## 13 0.7772395 0.4038092 0.6287639
## 15 0.7773724 0.4112243 0.6293491
## 17 0.7777031 0.4210834 0.6281060
## 19 0.7864108 0.4176115 0.6366022
## 21 0.7895602 0.4218009 0.6392570
## 23 0.7927567 0.4235250 0.6427964
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
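As a quick visual check, the resampling profile across the candidate k values can be plotted directly from the caret object (plot not shown here):
# RMSE profile across the tuning grid
plot(knnModel)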
k = 9 produced our optimal KNN model, with what may be a good RMSE value but a rather poor R-squared value. We consider the verification / test set performance:
knnPred <- predict(knnModel, newdata = test.x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = test$Yield)
## RMSE Rsquared MAE
## 0.7458738 0.6030843 0.5837452
Oddly enough, our KNN model (with k = 9) performed better on the test data than in resampling. The RMSE and R-squared appear to be rather strong, and so we proceed with our model exploration with this in mind.
# basic neural network function call:
nnetFit <- nnet(train.x, train$Yield,
                size = 5,
                decay = 0.01,
                linout = TRUE,
                trace = FALSE,
                maxit = 500,
                MaxNWts = 5 * (ncol(train.x) + 1) + 5 + 1)
nnetPred <- predict(nnetFit, newdata = test.x)
postResample(pred = nnetPred, obs = test$Yield)
## RMSE Rsquared MAE
## 0.8597067 0.6599363 0.6976073
Our simple neural network model has a higher RMSE but also a higher R-squared than the KNN model. In selecting the strongest model we want the combination of lowest error (RMSE) and highest explained variance (R-squared), and there appears to be room for improvement here.
As such, we consider a MARS model:
marsFit <- earth(train.x, train$Yield)
marsFit
## Selected 13 of 21 terms, and 9 of 40 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 12 (additive model)
## GCV 0.3445371 RSS 29.88337 GRSq 0.6324041 RSq 0.7547576
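For a closer look at the fitted MARS model, the selected hinge functions and their coefficients can also be listed (output omitted here):
# list the selected basis functions and their coefficients
summary(marsFit)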
Based on the training output (i.e., RSq), there appears to be a lot of promise here. Let's consider the test performance metrics:
marsPred <- predict(marsFit, newdata = test.x)
postResample(pred = marsPred, obs = test$Yield)
## RMSE Rsquared MAE
## 0.6595294 0.6466317 0.5181177
It has the lowest RMSE so far and the second-highest R-squared. This makes it a toss-up between our neural network model (better R-squared) and our MARS model (better RMSE). Before drawing any conclusions, we consider an SVM model:
svmRTuned <- train(train.x, train$Yield,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 132 samples
## 40 predictor
##
## Pre-processing: centered (40), scaled (40)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 117, 118, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7649027 0.4616070 0.6229629
## 0.50 0.7150301 0.5016506 0.5757713
## 1.00 0.6667735 0.5603855 0.5323655
## 2.00 0.6295205 0.6009277 0.5065152
## 4.00 0.5920877 0.6443093 0.4883642
## 8.00 0.5832544 0.6540218 0.4826263
## 16.00 0.5832544 0.6540218 0.4826263
## 32.00 0.5832544 0.6540218 0.4826263
## 64.00 0.5832544 0.6540218 0.4826263
## 128.00 0.5832544 0.6540218 0.4826263
## 256.00 0.5832544 0.6540218 0.4826263
## 512.00 0.5832544 0.6540218 0.4826263
## 1024.00 0.5832544 0.6540218 0.4826263
## 2048.00 0.5832544 0.6540218 0.4826263
##
## Tuning parameter 'sigma' was held constant at a value of 0.02392857
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02392857 and C = 8.
With sigma = 0.0239 and C = 8, we have an RMSE of 0.583 and an R-squared of 0.654. These are promising resampling metrics, and so we proceed to generating predictions and comparing their accuracy against our test data:
svmPred <- predict(svmRTuned, newdata = test.x)
postResample(pred = svmPred, obs = test$Yield)
## RMSE Rsquared MAE
## 0.6725301 0.6507183 0.4973401
Based on the combination of RMSE and R-squared values used in selecting models (the best cross-validated RMSE, with test-set metrics comparable to MARS), this appears to be our strongest model yet. I select the SVM model as my optimal nonlinear regression model.
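For completeness, the same kind of side-by-side test-set table can be built for the chemical manufacturing models (a small sketch reusing the prediction objects above; 'cmp_results' is just a scratch name):
# collect test-set performance for the four models and sort by RMSE
cmp_results <- rbind(
  KNN  = postResample(pred = knnPred,  obs = test$Yield),
  NNet = postResample(pred = nnetPred, obs = test$Yield),
  MARS = postResample(pred = marsPred, obs = test$Yield),
  SVM  = postResample(pred = svmPred,  obs = test$Yield))
cmp_results[order(cmp_results[, "RMSE"]), ]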
(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
For variable importance we use the varImp() function on our optimal SVM model and produce the following:
svmImp <- varImp(svmRTuned, scale = FALSE)
svmImp
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ManufacturingProcess32 0.35450
## ManufacturingProcess13 0.28288
## ManufacturingProcess09 0.26357
## ManufacturingProcess17 0.25501
## ManufacturingProcess36 0.24516
## BiologicalMaterial03 0.24342
## BiologicalMaterial11 0.17693
## ManufacturingProcess06 0.17206
## ManufacturingProcess11 0.16828
## ManufacturingProcess30 0.13826
## ManufacturingProcess12 0.11933
## BiologicalMaterial09 0.11409
## ManufacturingProcess04 0.09219
## ManufacturingProcess16 0.08699
## ManufacturingProcess26 0.07998
## ManufacturingProcess01 0.07410
## ManufacturingProcess35 0.07069
## ManufacturingProcess28 0.05898
## ManufacturingProcess10 0.05842
## BiologicalMaterial10 0.05474
As was the case for our linear regression model, the process variables dominate the list.
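To quantify that dominance, we can tally the variable types among the 20 most important predictors (a minimal sketch reusing svmImp from above; 'imp' and 'top20' are scratch names):
# count process vs. biological variables among the 20 most important predictors
imp <- svmImp$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^Manufacturing", top20), "Process", "Biological"))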
When we consider the top 10 variables for our nonlinear regression model versus our linear regression model (DATA 624 HW7), we observe that ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess09, and ManufacturingProcess06 were in both models, whereas ManufacturingProcess17, ManufacturingProcess36, BiologicalMaterial03, BiologicalMaterial11, ManufacturingProcess11, and ManufacturingProcess30 are unique to the nonlinear SVM model.
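The top ten can also be pulled programmatically from the importance scores for this comparison (a small sketch reusing svmImp from above):
# ten most important predictors in the SVM model
head(rownames(svmImp$importance)[order(svmImp$importance$Overall, decreasing = TRUE)], 10)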
(c) Explore the relationship between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
To explore the relationship between the response and the top predictors that are unique to the optimal nonlinear regression model (our SVM model), I elect to visualize via a correlation matrix and feature plots:
# subset the predictors unique to the SVM model (from train.x; cmp_df would also work)
unique_train <- train.x[, c('ManufacturingProcess17', 'ManufacturingProcess36', 'BiologicalMaterial03', 'BiologicalMaterial11', 'ManufacturingProcess11', 'ManufacturingProcess30')]

# correlation matrix among the unique predictors
# (Yield was already dropped from train.x, so this shows inter-predictor correlation)
resp_cor <- cor(unique_train)
corrplot(resp_cor, method = 'number', type = 'lower')

# feature plots (shorten variable names for readability)
colnames(unique_train) <- gsub("(Process|Material)", "", colnames(unique_train))
featurePlot(unique_train, train$Yield)
From the correlation matrix we see the strength of the correlations among these top predictors, and from the feature plots we see the relationship between each predictor and the response in an x ~ y fashion.
We see positive, roughly linear relationships between Yield and Biological03, Biological11, and Manufacturing11, whereas Manufacturing17 appears to have a negative relationship (higher values of Manufacturing17 appear to lead to lower Yield), Manufacturing30 is distributed about 0, and Manufacturing36 takes only a few distinct levels. From these plots, we can determine how we might tweak the Manufacturing variables within our control to increase yield. For example, we might reduce Manufacturing17 and 36, since there appears to be a negative correlation at play.
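As a quick numeric check on that directional reading (a small sketch reusing unique_train and train from the chunk above):
# signed correlations of each unique predictor with Yield
round(cor(unique_train, train$Yield), 2)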