The purpose of this assignment is to explore Non-Linear Regression exercises from Applied Predictive Modeling book.

Libraries Required

library(caret)
library(mlbench)
library(earth)
library(nnet)
library(corrplot)
library(mice)
library(dplyr)

Excercise 7.2

7.2. Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data: y = 10 sin(πx1x2) + 20(x3 − 0.5)2 + 10x4 + 5x5 + N(0, σ2) where the x values are random variables uniformly distributed between 0, 1. The package mlbench contains a function called mlbench.friedman1 that simulates these data:

set.seed(200)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

KNN

Tune several models on these data. For example:

knnModel <- train(x = trainingData$x,
 y = trainingData$y,
 method = "knn",
 preProc = c("center", "scale"),
 tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

The KNN model above gives an RMSE of 3.20 and an Rsquared of 0.68 with k = 17.

This is a good metrics but I will compare this result to other models (Neural Network, Multivariate Adaptive Regression Spine (MARS), and Support Vector Machine (SVMS))

Neural Network

nnetFit <- nnet(trainingData$x, trainingData$y,
                size = 5,
                decay = 0.01,
                linout = TRUE,
                trace = FALSE,
                maxit = 500,
                MaxNWts = 5 * (ncol(trainingData$x) + 1) +5 +1)

nnetFit
## a 10-5-1 network with 61 weights
## options were - linear output units  decay=0.01
nnetPred <- predict(nnetFit, newdata = testData$x)
postResample(pred = nnetPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.5244877 0.7621849 1.8722728

Neural Network model has a lower RMSE (2.52) and higher R-squared(0.76) than KNN model.

MARS

marsmodel <- earth(trainingData$x, trainingData$y)
marsmodel
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982

Yes, the MARS model selected the informative X1 - X5 predictors.

Next we consider performance metrics:

marsPred <- predict(marsmodel, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 1.8136467 0.8677298 1.3911836

MARS model has a lower RMSE (1.81) than Neural Network and KNN model.

SVM

svmRTuned <- train(trainingData$x, trainingData$y,
                   method = "svmRadial",
                   preProc = c("center","scale"),
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))

svmRTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.540674  0.8047835  2.019031
##      0.50  2.325191  0.8134761  1.830817
##      1.00  2.149210  0.8347898  1.678792
##      2.00  2.068179  0.8435275  1.603251
##      4.00  1.978121  0.8589616  1.552173
##      8.00  1.954282  0.8637113  1.552491
##     16.00  1.949830  0.8651355  1.562072
##     32.00  1.947301  0.8654435  1.560529
##     64.00  1.947301  0.8654435  1.560529
##    128.00  1.947301  0.8654435  1.560529
##    256.00  1.947301  0.8654435  1.560529
##    512.00  1.947301  0.8654435  1.560529
##   1024.00  1.947301  0.8654435  1.560529
##   2048.00  1.947301  0.8654435  1.560529
## 
## Tuning parameter 'sigma' was held constant at a value of 0.05897103
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.05897103 and C = 32.
svmPred <- predict(svmRTuned, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.0649869 0.8270753 1.5689338

SVM model gives a lower RMSE than Neural Network and KNN, but it is higher than MARS model.

Since MARS model gives a lower RMSE and higher R-squared than KNN model, MARS model is considered to be the strongest model.

Exercise 7.5

7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

#Load the data
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.1.3
data("ChemicalManufacturingProcess")
#convert data to matrix
Data_matrix <- as.matrix(ChemicalManufacturingProcess)

#impute missing values
set.seed(333)
imputed_Data_matrix <- mice(Data_matrix, printFlag=F, method="pmm")
## Warning: Number of logged events: 675
data_comp <- complete(imputed_Data_matrix)

#remove highly correlated features
highly_correlated = findCorrelation(cor(data_comp), 0.80)
data_comp1 = data_comp[, -highly_correlated]

#center and scale
data_comp2 <- scale(data_comp1, center = TRUE, scale = TRUE)

#convert to dataframe - COMMENTED OUT
cmp_df <- as.data.frame(data_comp2) #convert to df

#train test split - 75/25 split
sample_size = round(nrow(cmp_df)*.75)
index <- sample(seq_len(nrow(cmp_df)), size = sample_size)
 
train <- cmp_df[index, ]
test <- cmp_df[-index, ]
#remove yield from sets to feed models
train.x <- train[-c(1)]
test.x <- test[-c(1)]

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

KNN

set.seed(333)
knnModel <- train(x = train.x, y = train$Yield, 
                  method = 'knn',
                  preProc = c("center","scale"),
                  tuneLength = 10)

knnModel
## k-Nearest Neighbors 
## 
## 132 samples
##  40 predictor
## 
## Pre-processing: centered (40), scaled (40) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    5  0.7770971  0.3969565  0.6066163
##    7  0.7796069  0.3927465  0.6174896
##    9  0.7728522  0.4043210  0.6194590
##   11  0.7731338  0.4074788  0.6227107
##   13  0.7772395  0.4038092  0.6287639
##   15  0.7773724  0.4112243  0.6293491
##   17  0.7777031  0.4210834  0.6281060
##   19  0.7864108  0.4176115  0.6366022
##   21  0.7895602  0.4218009  0.6392570
##   23  0.7927567  0.4235250  0.6427964
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.

The final value used for the model was k = 9, produced our optimal KNN model with what may be a good RMSE value but we rather poor R-squared value.

knnPred <- predict(knnModel, newdata = test.x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = test$Yield)
##      RMSE  Rsquared       MAE 
## 0.7458738 0.6030843 0.5837452

KNN model(with k = 9) gives a good performance on the test data. The RMSE of 0.74 and Rsquared of 0.603 seems to be strong.

Neural Network

#basic neural network function call:
nnetFit <- nnet(train.x, train$Yield,
                size = 5,
                decay = 0.01,
                linout = TRUE,
                trace = FALSE,
                maxit = 500,
                MaxNWts = 5 * (ncol(train.x) + 1) +5 +1)

nnetPred <- predict(nnetFit, newdata = test.x)
postResample(pred = nnetPred, obs = test$Yield)
##      RMSE  Rsquared       MAE 
## 0.8597067 0.6599363 0.6976073

Neural Network model has a higher RMSE and R-squared. In order to select the strongest model, we look for a combination of lowest error (RMSE) and highest predictive accuracy (Rsquared) and there appears to be room for improvement here.

MARS

marsFit <- earth(train.x, train$Yield)
#marsFit
marsPred <- predict(marsFit, newdata = test.x)
postResample(pred = marsPred, obs = test$Yield)
##      RMSE  Rsquared       MAE 
## 0.6595294 0.6466317 0.5181177

MARS model has so far the lowest RMSE (0.65) and 2nd highest Rsquared.

SVM

svmRTuned <- train(train.x, train$Yield,
                   method = "svmRadial",
                   preProc = c("center","scale"),
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))

svmRTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 132 samples
##  40 predictor
## 
## Pre-processing: centered (40), scaled (40) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 117, 118, 119, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE      
##      0.25  0.7649027  0.4616070  0.6229629
##      0.50  0.7150301  0.5016506  0.5757713
##      1.00  0.6667735  0.5603855  0.5323655
##      2.00  0.6295205  0.6009277  0.5065152
##      4.00  0.5920877  0.6443093  0.4883642
##      8.00  0.5832544  0.6540218  0.4826263
##     16.00  0.5832544  0.6540218  0.4826263
##     32.00  0.5832544  0.6540218  0.4826263
##     64.00  0.5832544  0.6540218  0.4826263
##    128.00  0.5832544  0.6540218  0.4826263
##    256.00  0.5832544  0.6540218  0.4826263
##    512.00  0.5832544  0.6540218  0.4826263
##   1024.00  0.5832544  0.6540218  0.4826263
##   2048.00  0.5832544  0.6540218  0.4826263
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02392857
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02392857 and C = 8.
svmPred <- predict(svmRTuned, newdata = test.x)
postResample(pred = svmPred, obs = test$Yield)
##      RMSE  Rsquared       MAE 
## 0.6725301 0.6507183 0.4973401

Based on the combination of RMSE and Rsquared values in selecting models, MARS and SVM models have a very similar results. Therefore, I choose the SVM model as my optimal nonlinear regression model since it has a lower RMSE value.

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

For variable importance we use the varImp() function on our optimal SVM model.

svmImp <- varImp(svmRTuned, scale = FALSE)
svmImp
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                        Overall
## ManufacturingProcess32 0.35450
## ManufacturingProcess13 0.28288
## ManufacturingProcess09 0.26357
## ManufacturingProcess17 0.25501
## ManufacturingProcess36 0.24516
## BiologicalMaterial03   0.24342
## BiologicalMaterial11   0.17693
## ManufacturingProcess06 0.17206
## ManufacturingProcess11 0.16828
## ManufacturingProcess30 0.13826
## ManufacturingProcess12 0.11933
## BiologicalMaterial09   0.11409
## ManufacturingProcess04 0.09219
## ManufacturingProcess16 0.08699
## ManufacturingProcess26 0.07998
## ManufacturingProcess01 0.07410
## ManufacturingProcess35 0.07069
## ManufacturingProcess28 0.05898
## ManufacturingProcess10 0.05842
## BiologicalMaterial10   0.05474

As was the case for our linear regression model, the manufacturing process variables dominate the list.

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

#Feature plots
unique_train <- train.x[, c('ManufacturingProcess17', 'ManufacturingProcess36', 'BiologicalMaterial03', 'BiologicalMaterial11', 'ManufacturingProcess11', 'ManufacturingProcess30')]
colnames(unique_train) <- gsub("(Process|Material)", "", colnames(unique_train))
featurePlot(unique_train, train$Yield)

The feature plots display a visualization of the relationship between top predictor variables and the response in an x ~ y fashion.

#correlation matrix
resp_cor <- cor(unique_train, train.x$Yield)
corrplot(resp_cor, method = 'number', type = 'lower')

The correlation matrix display the strength of correlation between top predictors.

There is a linear relationships between Yield and Biological03, Biological11, and Manufacturing11 whereas Manufacturing17 appears to have a negative relationship (higher values of Manufacturing17 appear to lead to lower Yield), Manufacturing30 is distributed about 0 and Manufacturing36 has levels.