__Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to generate data:__
\(y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)\)
Here the x values are random variables uniformly distributed on [0, 1]. The simulation also creates five non-informative (noise) variables. Part of our aim is to determine which models screen out these noisy features.
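To make the generating equation concrete, the data can also be simulated by hand; the following sketch mirrors what mlbench.friedman1 does internally (variable names here are illustrative):
# Hand-rolled version of the Friedman #1 simulation (illustrative sketch)
n <- 200
x <- matrix(runif(n * 10), ncol = 10)  # ten uniform predictors on [0, 1]
y <- 10 * sin(pi * x[, 1] * x[, 2]) +  # only x1-x5 enter the mean function
  20 * (x[, 3] - 0.5)^2 + 10 * x[, 4] + 5 * x[, 5] +
  rnorm(n, sd = 1)                     # Gaussian noise with sd = 1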
The package mlbench contains a function called mlbench.friedman1 that simulates these data. This creates a list with a vector ‘y’ and a matrix of predictors ‘x’.
set.seed(200)
trainData <- mlbench.friedman1(200, sd = 1)
trainData$x <- data.frame(trainData$x)
set.seed(300)
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
trainData.comp <- cbind(trainData$x, y = trainData$y) # bind predictors and response for formula-interface models
testData.comp <- cbind(testData$x, y = testData$y)
#featurePlot(trainData$x, trainData$y)
We tune the following models on these Friedman data, measuring model R-squared and RMSE on the test set.
# Defining the resampling scheme: repeated 10-fold cross-validation
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTuned <- train(trainData$x, trainData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = fitControl)
#marsTuned
RMSE was used to select the optimal model (smallest value); exact results may vary with the random seed.
Without listing the full results, the final values used for the model were nprune = 14 and degree = 2, with RMSE of 1.259610.
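The cross-validation profile can also be inspected visually from the caret object (assuming marsTuned from above):
# Resampled RMSE across the (degree, nprune) grid
plot(marsTuned)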
marsFit <- earth(x=trainData$x, y=trainData$y, degree=2, nprune=14)
marsFit
## Selected 14 of 18 terms, and 5 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 4
## GCV 1.62945 RSS 225.8601 GRSq 0.9338437 RSq 0.953688
R-squared is 95.37%, indicating an excellent fit on the training data.
Notably, the RMSE is even lower on the test data of 5,000 observations.
pred.mars <- predict(marsFit, testData$x)
rmse(pred.mars, testData$y)
## [1] 1.173534
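The rmse() helper is assumed to come from the Metrics package; the same value can be computed without it:
# Direct RMSE computation, equivalent to rmse(pred.mars, testData$y)
sqrt(mean((testData$y - pred.mars)^2))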
We chart the residuals (actual minus predicted values) on the test set, noting a symmetrical, near-normal distribution.
hist(testData$y-pred.mars, breaks=20)
knnGrid <- expand.grid(k = 1:20)
knnFit <- train(y ~ ., data = trainData.comp,
                method = "knn",
                trControl = fitControl,
                tuneGrid = knnGrid)
knnFit
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 4.220129 0.3879525 3.455557
## 2 3.609581 0.4952757 2.991598
## 3 3.417852 0.5315833 2.805540
## 4 3.309265 0.5594254 2.708890
## 5 3.284147 0.5699628 2.715187
## 6 3.187697 0.6031450 2.617294
## 7 3.174441 0.6183403 2.591660
## 8 3.156761 0.6349250 2.579192
## 9 3.140537 0.6513271 2.564352
## 10 3.129836 0.6603483 2.549580
## 11 3.122135 0.6744231 2.537512
## 12 3.107032 0.6915949 2.518858
## 13 3.114743 0.6972825 2.530954
## 14 3.123569 0.7008701 2.535676
## 15 3.139958 0.7006634 2.549244
## 16 3.155335 0.7028706 2.562318
## 17 3.156246 0.7076837 2.561867
## 18 3.150326 0.7167097 2.556622
## 19 3.160656 0.7194376 2.568619
## 20 3.183634 0.7160898 2.591439
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 12.
With k optimized at 12 neighbors, the resampled RMSE is 3.1070 and R-squared is 69.16%.
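caret's plot method shows the resampled RMSE across candidate values of k, bottoming out near k = 12:
# RMSE as a function of the number of neighbors
plot(knnFit)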
pred.knn <- predict(knnFit, newdata = testData$x)
#rmse(pred.knn, testData$y)
postResample(pred = pred.knn, obs = testData$y)
## RMSE Rsquared MAE
## 3.2385229 0.6611621 2.5985362
RMSE on the test set is 3.2385.
hist(testData$y-pred.knn, breaks=20)
Although the residuals appear roughly normal and centered at zero, their spread is much wider than for MARS.
fitControl2 <- trainControl(method = "repeatedcv",
                            number = 5,
                            repeats = 5)
svmGrid <- expand.grid(cost = c(1, 1000))
svmFit <- train(y ~ ., data = trainData.comp,
                #type='eps-regression',
                method = 'svmLinear2',
                trControl = fitControl2,
                tuneGrid = svmGrid)
svmFit
## Support Vector Machines with Linear Kernel
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 160, 160, 160, 160, 160, 160, ...
## Resampling results across tuning parameters:
##
## cost RMSE Rsquared MAE
## 1 2.490718 0.7561396 2.000542
## 1000 2.494851 0.7561338 2.004330
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cost = 1.
With a cost parameter of 1, the resampled R-squared works out to 75.61%.
The RMSE on the test set is significantly higher than with MARS.
pred.svm <- predict(svmFit, testData$x)
rmse(pred.svm, testData$y)
## [1] 2.839774
hist(testData$y-pred.svm, breaks=20)
The histogram of the residuals between the model predictions and actual observations shows a negative skew. The spread is in general much wider than for the MARS model, suggesting underfitting.
svmGrid <- expand.grid(C = c(1, 1000))
svmFit2 <- train(y ~ ., data = trainData.comp,
                 #type='eps-regression',
                 method = 'svmRadialCost',
                 trControl = fitControl,
                 tuneGrid = svmGrid)
svmFit2
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 1 2.092172 0.8330171 1.650050
## 1000 1.874927 0.8619636 1.491091
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was C = 1000.
With C = 1000, the resampled R-squared is 86.20%, yet the RMSE on the test set is still higher than with MARS.
pred.svm2 <- predict(svmFit2, testData$x)
rmse(pred.svm2, testData$y)
## [1] 2.061092
hist(testData$y-pred.svm2, breaks=20)
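Because the selected cost (C = 1000) sits at the edge of the two-point grid, a finer grid might locate a better value. A sketch with hypothetical object names (the repeated 10-fold CV makes this expensive to run):
svmGrid.fine <- expand.grid(C = 2^(0:12)) # hypothetical finer cost grid
svmFit2b <- train(y ~ ., data = trainData.comp,
                  method = 'svmRadialCost',
                  trControl = fitControl,
                  tuneGrid = svmGrid.fine)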
svmGrid <- expand.grid(degree = 1:3,
                       scale = 1,
                       C = c(1, 1000))
svmFit3 <- train(y ~ ., data = trainData.comp,
                 #type='eps-regression',
                 method = 'svmPoly',
                 trControl = fitControl,
                 tuneGrid = svmGrid)
svmFit3
## Support Vector Machines with Polynomial Kernel
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree C RMSE Rsquared MAE
## 1 1 2.444470 0.7673159 1.971865
## 1 1000 2.448654 0.7665822 1.974222
## 2 1 2.001730 0.8541411 1.611020
## 2 1000 2.289189 0.8190326 1.805272
## 3 1 3.235382 0.6627515 2.513787
## 3 1000 3.235382 0.6627515 2.513787
##
## Tuning parameter 'scale' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 2, scale = 1 and C = 1.
With a degree of 2 (quadratic) and a cost parameter of 1, the resampled RMSE is 2.0017 with R-squared of 85.41%.
pred.svm3 <- predict(svmFit3, testData$x)
rmse(pred.svm3, testData$y)
## [1] 2.130626
hist(testData$y-pred.svm3, breaks=20)
Which models appear to give the best performance?
In order of test-set RMSE we have:
- MARS (1.174)
- SVM, radial kernel (2.061)
- SVM, polynomial kernel (2.131)
- SVM, linear kernel (2.840)
- KNN (3.239)
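The same ranking can be assembled programmatically from the prediction vectors computed above (a convenience sketch using caret's postResample()):
# Test-set metrics for each model, sorted by RMSE
perf <- rbind(
  MARS       = postResample(as.vector(pred.mars), testData$y),
  SVM.radial = postResample(pred.svm2, testData$y),
  SVM.poly   = postResample(pred.svm3, testData$y),
  SVM.linear = postResample(pred.svm, testData$y),
  KNN        = postResample(pred.knn, testData$y)
)
perf[order(perf[, "RMSE"]), ]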
Does MARS select the informative predictors (those named X1-X5)?
We can confirm the relative importance of the variables with the varImp() function, which assigns nonzero weights to variables X1 through X5 and no importance to the superfluous 'noise' variables.
varImp(marsFit)
## Overall
## X1 100.00000
## X4 85.13408
## X2 69.22036
## X5 49.27524
## X3 39.95037
## X6 0.00000
## X7 0.00000
## X8 0.00000
## X9 0.00000
## X10 0.00000
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
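The earlier imputation and assembly of wholeData are not reproduced in this section; the sketch below shows one plausible reconstruction, assuming caret's knnImpute was used as in Exercise 6.3 (the actual earlier steps may differ):
library(AppliedPredictiveModeling)
library(caret)
data(ChemicalManufacturingProcess)
# k-nearest-neighbor imputation of missing predictor values;
# note that knnImpute also centers and scales as a side effect
pp <- preProcess(ChemicalManufacturingProcess[, -1], method = "knnImpute")
wholeData <- cbind(Yield = ChemicalManufacturingProcess$Yield,
                   predict(pp, ChemicalManufacturingProcess[, -1]))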
indexTrain <- sample(
  x = c(TRUE, FALSE),
  size = nrow(wholeData),
  replace = TRUE,
  prob = c(0.75, 0.25)
)
dtTrain <- wholeData[indexTrain, ]  # roughly 75% of rows
dtTest <- wholeData[!indexTrain, ]  # remaining rows
(a) Which nonlinear regression model gives the optimal resampling and test set performance?
For this exercise we center and scale all the training data, executing the same models as above:
# Defining the resampling scheme: repeated 10-fold cross-validation
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTuned <- train(dtTrain[, !names(dtTrain) == "Yield"], dtTrain$Yield,
                   method = "earth",
                   preProc = c("center", "scale"),
                   tuneGrid = marsGrid,
                   trControl = fitControl)
#marsTuned
Without listing the full results, the final values used for the model were nprune = 14 and degree = 1, with RMSE of 1.214416.
marsFit <- earth(x = dtTrain[, !names(dtTrain) == "Yield"], y = dtTrain$Yield, degree = 1, nprune = 14)
marsFit
## Selected 13 of 21 terms, and 10 of 56 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 12 (additive model)
## GCV 1.148277 RSS 112.9219 GRSq 0.6716648 RSq 0.7726268
R-squared is 77.26%, suggesting a reasonably good fit.
The RMSE on the test data is 0.8579, notably lower than the resampled training RMSE.
pred.mars <- predict(marsFit, dtTest[, !names(dtTest) == "Yield"])
rmse(pred.mars, dtTest$Yield)
## [1] 0.8578827
We chart the residuals (actual minus predicted values) on the test set, noting a distribution somewhat skewed to the left.
hist(dtTest$Yield-pred.mars, breaks=8)
knnGrid <- expand.grid(k = 1:20)
knnFit <- train(Yield ~ ., data = dtTrain,
                method = "knn",
                preProc = c("center", "scale"),
                trControl = fitControl,
                tuneGrid = knnGrid)
knnFit
## k-Nearest Neighbors
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 129, 130, 129, 129, 130, 131, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 1.521975 0.4254698 1.176087
## 2 1.365167 0.4868462 1.075733
## 3 1.282147 0.5397880 1.037852
## 4 1.303584 0.5265957 1.049758
## 5 1.325820 0.5103958 1.061337
## 6 1.331319 0.5106140 1.059659
## 7 1.347644 0.4981640 1.065979
## 8 1.350827 0.4973417 1.070509
## 9 1.353273 0.4962336 1.080475
## 10 1.359884 0.4943106 1.091082
## 11 1.361113 0.4979771 1.093921
## 12 1.362040 0.5010546 1.097695
## 13 1.364518 0.5033710 1.101295
## 14 1.374542 0.4999748 1.111113
## 15 1.375818 0.5018921 1.111342
## 16 1.383547 0.4972612 1.117190
## 17 1.387281 0.4970723 1.122453
## 18 1.399251 0.4873602 1.132101
## 19 1.407962 0.4828828 1.141523
## 20 1.412946 0.4827424 1.147856
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
With k optimized at 3 neighbors, the resampled RMSE is 1.2821 and R-squared is 53.98%.
The RMSE on the test data, 1.0336, is oddly lower than the resampled training RMSE.
pred.knn <- predict(knnFit, dtTest[, !names(dtTest) == "Yield"])
rmse(pred.knn, dtTest$Yield)
## [1] 1.033602
We chart the residuals on the test set, again noting a distribution skewed to the left.
hist(dtTest$Yield-pred.knn, breaks=8)
fitControl2 <- trainControl(method = "repeatedcv",
                            number = 5,
                            repeats = 5)
svmGrid <- expand.grid(cost = c(1, 1000))
svmFit <- train(Yield ~ ., data = dtTrain,
                method = 'svmLinear2',
                preProc = c("center", "scale"),
                trControl = fitControl2,
                tuneGrid = svmGrid)
svmFit
## Support Vector Machines with Linear Kernel
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 116, 116, 113, 115, 116, 116, ...
## Resampling results across tuning parameters:
##
## cost RMSE Rsquared MAE
## 1 3.091966 0.3288432 1.657914
## 1000 9.199447 0.2039639 3.150608
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cost = 1.
With a cost parameter of 1, the resampled R-squared works out to 32.88%.
The RMSE on the test data is 1.2921.
pred.svm <- predict(svmFit , dtTest[, !names(dtTest) == "Yield"])
rmse(pred.svm, dtTest$Yield)
## [1] 1.292104
We chart the residuals on the test set, noting a slight skew to the right; the spread is much wider than for the MARS model, suggesting underfitting.
hist(dtTest$Yield-pred.svm, breaks=8)
svmGrid <- expand.grid(C = c(1, 1000))
svmFit2 <- train(Yield ~ ., data = dtTrain,
                 method = 'svmRadialCost',
                 preProc = c("center", "scale"),
                 trControl = fitControl,
                 tuneGrid = svmGrid)
svmFit2
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 129, 129, 130, 128, 130, 131, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 1 1.206274 0.6097420 0.9735681
## 1000 1.117890 0.6544848 0.8990701
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was C = 1000.
R-squared is 65.45% when the cost parameter is 1000, suggesting a reasonably high level of goodness of fit.
The RMSE on the test data is 0.9029.
pred.svm2 <- predict(svmFit2, dtTest[, !names(dtTest) == "Yield"])
rmse(pred.svm2, dtTest$Yield)
## [1] 0.9028991
We chart the residuals on the test set, noting a slight skew to the left.
hist(dtTest$Yield-pred.svm2, breaks=8)
fitControl2 <- trainControl(method = "repeatedcv",
                            number = 5,
                            repeats = 5)
svmGrid <- expand.grid(degree = 1:3,
                       scale = 1,
                       C = c(1, 1000))
svmFit3 <- train(Yield ~ ., data = dtTrain,
                 #type='eps-regression',
                 method = 'svmPoly',
                 preProc = c("center", "scale"),
                 trControl = fitControl2,
                 tuneGrid = svmGrid)
#svmFit3
With an optimal degree of 1, the polynomial kernel offers no benefit over a linear one.
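Since the full print of svmFit3 is suppressed above, the tuning outcome can be verified directly on the caret object:
# Best tuning parameters and the resampled metrics per candidate
svmFit3$bestTune
svmFit3$results[, c("degree", "C", "RMSE", "Rsquared")]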
(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
Our best-fitting model is the SVM with a radial kernel. However, with non-linear kernels the separating hyperplane lives in a transformed feature space (the result of the kernel mapping of the original space), so its coefficients are not directly related to the input predictors. Thus, we cannot directly assess the contribution of individual predictors to the model fit.
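That said, caret's varImp() falls back to a model-free, filter-based importance (via filterVarImp()) for models without built-in importance scores, which can serve as a rough proxy:
# Filter-based importance for the radial SVM (rough proxy only)
plot(varImp(svmFit2), top = 10)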
For the ranking below, however, we use the most important predictors according to the MARS model, as in the first exercise, using the varImp() function within the caret package.
import.mars <- varImp(marsFit)
import.mars$class <- rownames(import.mars)
mars.top <- as.list(import.mars[order(-import.mars$Overall),][1:10,]['class'])
We compare this to the linear regression model derived from the step() function, which selects the optimal model by the Akaike Information Criterion (AIC) using stepwise selection. The step() function gives a model of 15 variables, which is further reduced to the top 10 in the ranking.
## Applying step() to find the most important predictors via stepwise selection
# intercept_only_model <- lm(Yield ~ 1, data = dtTrain)
# lmMod <- lm(Yield~.,data=dtTrain)
# finalMod <- step(intercept_only_model, direction = 'both', scope = formula(lmMod))
#Top 10 Predictors - Linear Model
lmMod.optimal <- lm(Yield ~ ManufacturingProcess32 + ManufacturingProcess09 +
                      ManufacturingProcess33 + ManufacturingProcess37 +
                      ManufacturingProcess07 + BiologicalMaterial06 +
                      ManufacturingProcess17 + ManufacturingProcess04 +
                      ManufacturingProcess21 + ManufacturingProcess28 +
                      ManufacturingProcess44 + ManufacturingProcess15 +
                      ManufacturingProcess43 + ManufacturingProcess22 +
                      ManufacturingProcess34, data = dtTrain)
#summary(lmMod.optimal) #Adjusted R-squared: 0.714
import.linreg <- varImp(lmMod.optimal)
import.linreg$class <- rownames(import.linreg)
linreg.top <- as.list(import.linreg[order(-import.linreg$Overall),][1:10,]['class'])
rank <- data.frame(mars.top, linreg.top)
colnames(rank) <- c('MARS.top','LinReg.top')
rank
## MARS.top LinReg.top
## 1 ManufacturingProcess32 ManufacturingProcess32
## 2 ManufacturingProcess09 ManufacturingProcess17
## 3 ManufacturingProcess13 ManufacturingProcess37
## 4 ManufacturingProcess33 ManufacturingProcess21
## 5 ManufacturingProcess01 ManufacturingProcess15
## 6 BiologicalMaterial03 ManufacturingProcess09
## 7 ManufacturingProcess28 ManufacturingProcess04
## 8 ManufacturingProcess39 ManufacturingProcess33
## 9 ManufacturingProcess22 BiologicalMaterial06
## 10 BiologicalMaterial12 ManufacturingProcess07
The MARS model places two Biological Materials among the top-ten most impactful explanatory variables, while the linear regression model has one. The variables appearing in both top-ten lists are:
- ManufacturingProcess32,
- ManufacturingProcess09,
- ManufacturingProcess33
With seven of the ten leading predictors differing, this suggests a substantial difference in model structure between the two.
(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
unique.predictors <- mars.top$`class`[c(2,3,7,8,9,10)] # subset of the MARS top ten chosen for closer examination
Glimpsing at a few of the variables, we notice a moderately positive linear correlation between ManufacturingProcess09 and Yield…
vals <- unlist(dtTrain[colnames(dtTrain) == unique.predictors[1]])
plot(x=vals, y=dtTrain$Yield, xlab=unique.predictors[1])
cor(dtTrain$ManufacturingProcess09, dtTrain$Yield)
## [1] 0.5035476
… and a moderately negative linear correlation between ManufacturingProcess13 and Yield.
vals <- unlist(dtTrain[colnames(dtTrain) == unique.predictors[2]])
plot(x=vals, y=dtTrain$Yield, xlab=unique.predictors[2])
cor(dtTrain$ManufacturingProcess13, dtTrain$Yield)
## [1] -0.5066883
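The same scatterplot-plus-correlation check can be extended over all of the selected predictors with a short loop (a sketch):
# Plot each selected predictor against Yield, annotating the correlation
for (p in unique.predictors) {
  r <- cor(dtTrain[[p]], dtTrain$Yield)
  plot(dtTrain[[p]], dtTrain$Yield, xlab = p, ylab = "Yield",
       main = sprintf("%s (r = %.2f)", p, r))
}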