__Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to generate data:__
\(y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)\)
Here the x values are random variables uniformly distributed on [0, 1]. The simulation also creates five non-informative (noise) variables. Part of our aim is to determine which models screen out these noisy features.
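To make the generating equation concrete, the data can also be simulated by hand; the following sketch mirrors what mlbench.friedman1 does internally (variable names here are illustrative):
# Hand-rolled version of the Friedman #1 simulation (illustrative sketch)
n <- 200
x <- matrix(runif(n * 10), ncol = 10)  # ten uniform predictors on [0, 1]
y <- 10 * sin(pi * x[, 1] * x[, 2]) +  # only x1-x5 enter the mean function
  20 * (x[, 3] - 0.5)^2 + 10 * x[, 4] + 5 * x[, 5] +
  rnorm(n, sd = 1)                     # Gaussian noise with sd = 1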
The package mlbench contains a function called mlbench.friedman1 that simulates these data. This creates a list with a vector ‘y’ and a matrix of predictors ‘x’.
set.seed(200)
trainData <- mlbench.friedman1(200, sd = 1)
trainData$x <- data.frame(trainData$x)
set.seed(300)
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
trainData.comp <- cbind(trainData$x, y = trainData$y) # bind predictors and response for formula-interface models
testData.comp <- cbind(testData$x, y = testData$y)
#featurePlot(trainData$x, trainData$y)
We tune the following models on these Friedman data, measuring model R-squared and RMSE on the test set.
# Defining the resampling scheme: repeated 10-fold cross-validation
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTuned <- train(trainData$x, trainData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = fitControl)
#marsTuned
RMSE was used to select the optimal model (smallest value); exact results may vary with the random seed.
Without listing the full results, the final values used for the model were nprune = 14 and degree = 2, with RMSE of 1.259610.
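The cross-validation profile can also be inspected visually from the caret object (assuming marsTuned from above):
# Resampled RMSE across the (degree, nprune) grid
plot(marsTuned)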
marsFit <- earth(x=trainData$x, y=trainData$y, degree=2, nprune=14)
marsFit
## Selected 14 of 18 terms, and 5 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 4
## GCV 1.62945 RSS 225.8601 GRSq 0.9338437 RSq 0.953688
R-squared is 95.37%, indicating an excellent fit on the training data.
Notably, the RMSE is even lower on the test data of 5,000 observations.
pred.mars <- predict(marsFit, testData$x)
rmse(pred.mars, testData$y)
## [1] 1.173534
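The rmse() helper is assumed to come from the Metrics package; the same value can be computed without it:
# Direct RMSE computation, equivalent to rmse(pred.mars, testData$y)
sqrt(mean((testData$y - pred.mars)^2))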
We chart the residuals (actual minus predicted values) on the test set, noting a symmetrical, near-normal distribution.
hist(testData$y-pred.mars, breaks=20)
knnGrid <- expand.grid(k = 1:20)
knnFit <- train(y ~ ., data = trainData.comp,
                method = "knn",
                trControl = fitControl,
                tuneGrid = knnGrid)
knnFit
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 4.220129 0.3879525 3.455557
## 2 3.609581 0.4952757 2.991598
## 3 3.417852 0.5315833 2.805540
## 4 3.309265 0.5594254 2.708890
## 5 3.284147 0.5699628 2.715187
## 6 3.187697 0.6031450 2.617294
## 7 3.174441 0.6183403 2.591660
## 8 3.156761 0.6349250 2.579192
## 9 3.140537 0.6513271 2.564352
## 10 3.129836 0.6603483 2.549580
## 11 3.122135 0.6744231 2.537512
## 12 3.107032 0.6915949 2.518858
## 13 3.114743 0.6972825 2.530954
## 14 3.123569 0.7008701 2.535676
## 15 3.139958 0.7006634 2.549244
## 16 3.155335 0.7028706 2.562318
## 17 3.156246 0.7076837 2.561867
## 18 3.150326 0.7167097 2.556622
## 19 3.160656 0.7194376 2.568619
## 20 3.183634 0.7160898 2.591439
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 12.
With k optimized at 12 neighbors, the resampled RMSE is 3.1070 and R-squared is 69.16%.
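caret's plot method shows the resampled RMSE across candidate values of k, bottoming out near k = 12:
# RMSE as a function of the number of neighbors
plot(knnFit)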
pred.knn <- predict(knnFit, newdata = testData$x)
#rmse(pred.knn, testData$y)
postResample(pred = pred.knn, obs = testData$y)
## RMSE Rsquared MAE
## 3.2385229 0.6611621 2.5985362
RMSE on the test set is 3.2385.
hist(testData$y-pred.knn, breaks=20)
Although the residuals appear roughly normal and centered at zero, their spread is much wider than for MARS.
fitControl2 <- trainControl(method = "repeatedcv",
                            number = 5,
                            repeats = 5)
svmGrid <- expand.grid(cost = c(1, 1000))
svmFit <- train(y ~ ., data = trainData.comp,
                #type='eps-regression',
                method = 'svmLinear2',
                trControl = fitControl2,
                tuneGrid = svmGrid)
svmFit
## Support Vector Machines with Linear Kernel
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 160, 160, 160, 160, 160, 160, ...
## Resampling results across tuning parameters:
##
## cost RMSE Rsquared MAE
## 1 2.490718 0.7561396 2.000542
## 1000 2.494851 0.7561338 2.004330
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cost = 1.
With a cost parameter of 1, the resampled R-squared works out to 75.61%.
The RMSE on the test set is significantly higher than with MARS.
pred.svm <- predict(svmFit, testData$x)
rmse(pred.svm, testData$y)
## [1] 2.839774
hist(testData$y-pred.svm, breaks=20)
The histogram of the residuals between the model predictions and actual observations shows a negative skew. The spread is in general much wider than for the MARS model, suggesting underfitting.
svmGrid <- expand.grid(C = c(1, 1000))
svmFit2 <- train(y ~ ., data = trainData.comp,
                 #type='eps-regression',
                 method = 'svmRadialCost',
                 trControl = fitControl,
                 tuneGrid = svmGrid)
svmFit2
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 1 2.092172 0.8330171 1.650050
## 1000 1.874927 0.8619636 1.491091
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was C = 1000.
With C = 1000, the resampled R-squared is 86.20%, yet the RMSE on the test set is still higher than with MARS.
pred.svm2 <- predict(svmFit2, testData$x)
rmse(pred.svm2, testData$y)
## [1] 2.061092
hist(testData$y-pred.svm2, breaks=20)
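Because the selected cost (C = 1000) sits at the edge of the two-point grid, a finer grid might locate a better value. A sketch with hypothetical object names (the repeated 10-fold CV makes this expensive to run):
svmGrid.fine <- expand.grid(C = 2^(0:12)) # hypothetical finer cost grid
svmFit2b <- train(y ~ ., data = trainData.comp,
                  method = 'svmRadialCost',
                  trControl = fitControl,
                  tuneGrid = svmGrid.fine)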
svmGrid <- expand.grid(degree = 1:3,
                       scale = 1,
                       C = c(1, 1000))
svmFit3 <- train(y ~ ., data = trainData.comp,
                 #type='eps-regression',
                 method = 'svmPoly',
                 trControl = fitControl,
                 tuneGrid = svmGrid)
svmFit3
## Support Vector Machines with Polynomial Kernel
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree C RMSE Rsquared MAE
## 1 1 2.444470 0.7673159 1.971865
## 1 1000 2.448654 0.7665822 1.974222
## 2 1 2.001730 0.8541411 1.611020
## 2 1000 2.289189 0.8190326 1.805272
## 3 1 3.235382 0.6627515 2.513787
## 3 1000 3.235382 0.6627515 2.513787
##
## Tuning parameter 'scale' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 2, scale = 1 and C = 1.
With a degree of 2 (quadratic) and a cost parameter of 1, the resampled RMSE is 2.0017 with R-squared of 85.41%.
pred.svm3 <- predict(svmFit3, testData$x)
rmse(pred.svm3, testData$y)
## [1] 2.130626
hist(testData$y-pred.svm3, breaks=20)
Which models appear to give the best performance?
In order of test-set RMSE we have:
- MARS (1.174)
- SVM, radial kernel (2.061)
- SVM, polynomial kernel (2.131)
- SVM, linear kernel (2.840)
- KNN (3.239)
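The same ranking can be assembled programmatically from the prediction vectors computed above (a convenience sketch using caret's postResample()):
# Test-set metrics for each model, sorted by RMSE
perf <- rbind(
  MARS       = postResample(as.vector(pred.mars), testData$y),
  SVM.radial = postResample(pred.svm2, testData$y),
  SVM.poly   = postResample(pred.svm3, testData$y),
  SVM.linear = postResample(pred.svm, testData$y),
  KNN        = postResample(pred.knn, testData$y)
)
perf[order(perf[, "RMSE"]), ]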
Does MARS select the informative predictors (those named X1-X5)?
We can confirm the relative importance of the variables with the varImp() function, which assigns nonzero weights to variables X1 through X5 and no importance to the superfluous 'noise' variables.
varImp(marsFit)
## Overall
## X1 100.00000
## X4 85.13408
## X2 69.22036
## X5 49.27524
## X3 39.95037
## X6 0.00000
## X7 0.00000
## X8 0.00000
## X9 0.00000
## X10 0.00000
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
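The earlier imputation and assembly of wholeData are not reproduced in this section; the sketch below shows one plausible reconstruction, assuming caret's knnImpute was used as in Exercise 6.3 (the actual earlier steps may differ):
library(AppliedPredictiveModeling)
library(caret)
data(ChemicalManufacturingProcess)
# k-nearest-neighbor imputation of missing predictor values;
# note that knnImpute also centers and scales as a side effect
pp <- preProcess(ChemicalManufacturingProcess[, -1], method = "knnImpute")
wholeData <- cbind(Yield = ChemicalManufacturingProcess$Yield,
                   predict(pp, ChemicalManufacturingProcess[, -1]))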
indexTrain <- sample(
  x = c(TRUE, FALSE),
  size = nrow(wholeData),
  replace = TRUE,
  prob = c(0.75, 0.25)
)
dtTrain <- wholeData[indexTrain, ]  # roughly 75% of rows
dtTest <- wholeData[!indexTrain, ]  # remaining rows
(a) Which nonlinear regression model gives the optimal resampling and test set performance?
For this exercise we center and scale all the training data, executing the same models as above:
# Defining the resampling scheme: repeated 10-fold cross-validation
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTuned <- train(dtTrain[, !names(dtTrain) == "Yield"], dtTrain$Yield,
                   method = "earth",
                   preProc = c("center", "scale"),
                   tuneGrid = marsGrid,
                   trControl = fitControl)
#marsTuned
Without listing the full results, the final values used for the model were nprune = 14 and degree = 1, with RMSE of 1.214416.
marsFit <- earth(x = dtTrain[, !names(dtTrain) == "Yield"], y = dtTrain$Yield, degree = 1, nprune = 14)
marsFit
## Selected 13 of 21 terms, and 10 of 56 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 12 (additive model)
## GCV 1.148277 RSS 112.9219 GRSq 0.6716648 RSq 0.7726268
R-squared is 77.26%, suggesting a reasonably good fit.
The RMSE on the test data is 0.8579, notably lower than the resampled training RMSE.
pred.mars <- predict(marsFit, dtTest[, !names(dtTest) == "Yield"])
rmse(pred.mars, dtTest$Yield)
## [1] 0.8578827
We chart the residuals (actual minus predicted values) on the test set, noting a distribution somewhat skewed to the left.
hist(dtTest$Yield-pred.mars, breaks=8)
knnGrid <- expand.grid(k = 1:20)
knnFit <- train(Yield ~ ., data = dtTrain,
                method = "knn",
                preProc = c("center", "scale"),
                trControl = fitControl,
                tuneGrid = knnGrid)
knnFit
## k-Nearest Neighbors
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 129, 130, 129, 129, 130, 131, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 1.521975 0.4254698 1.176087
## 2 1.365167 0.4868462 1.075733
## 3 1.282147 0.5397880 1.037852
## 4 1.303584 0.5265957 1.049758
## 5 1.325820 0.5103958 1.061337
## 6 1.331319 0.5106140 1.059659
## 7 1.347644 0.4981640 1.065979
## 8 1.350827 0.4973417 1.070509
## 9 1.353273 0.4962336 1.080475
## 10 1.359884 0.4943106 1.091082
## 11 1.361113 0.4979771 1.093921
## 12 1.362040 0.5010546 1.097695
## 13 1.364518 0.5033710 1.101295
## 14 1.374542 0.4999748 1.111113
## 15 1.375818 0.5018921 1.111342
## 16 1.383547 0.4972612 1.117190
## 17 1.387281 0.4970723 1.122453
## 18 1.399251 0.4873602 1.132101
## 19 1.407962 0.4828828 1.141523
## 20 1.412946 0.4827424 1.147856
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
With k optimized at 3 neighbors, the resampled RMSE is 1.2821 and R-squared is 53.98%.
The RMSE on the test data, 1.0336, is oddly lower than the resampled training RMSE.
pred.knn <- predict(knnFit, dtTest[, !names(dtTest) == "Yield"])
rmse(pred.knn, dtTest$Yield)
## [1] 1.033602
We chart the residuals on the test set, again noting a distribution skewed to the left.
hist(dtTest$Yield-pred.knn, breaks=8)
fitControl2 <- trainControl(method = "repeatedcv",
                            number = 5,
                            repeats = 5)
svmGrid <- expand.grid(cost = c(1, 1000))
svmFit <- train(Yield ~ ., data = dtTrain,
                method = 'svmLinear2',
                preProc = c("center", "scale"),
                trControl = fitControl2,
                tuneGrid = svmGrid)
svmFit
## Support Vector Machines with Linear Kernel
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 116, 116, 113, 115, 116, 116, ...
## Resampling results across tuning parameters:
##
## cost RMSE Rsquared MAE
## 1 3.091966 0.3288432 1.657914
## 1000 9.199447 0.2039639 3.150608
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cost = 1.
With a cost parameter of 1, the resampled R-squared works out to 32.88%.
The RMSE on the test data is 1.2921.
pred.svm <- predict(svmFit , dtTest[, !names(dtTest) == "Yield"])
rmse(pred.svm, dtTest$Yield)
## [1] 1.292104
We chart the residuals on the test set, noting a slight skew to the right; the spread is much wider than for the MARS model, suggesting underfitting.
hist(dtTest$Yield-pred.svm, breaks=8)
svmGrid <- expand.grid(C = c(1, 1000))
svmFit2 <- train(Yield ~ ., data = dtTrain,
                 method = 'svmRadialCost',
                 preProc = c("center", "scale"),
                 trControl = fitControl,
                 tuneGrid = svmGrid)
svmFit2
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 129, 129, 130, 128, 130, 131, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 1 1.206274 0.6097420 0.9735681
## 1000 1.117890 0.6544848 0.8990701
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was C = 1000.
R-squared is 65.45% when the cost parameter is 1000, suggesting a reasonably high level of goodness of fit.
The RMSE on the test data is 0.9029.
pred.svm2 <- predict(svmFit2, dtTest[, !names(dtTest) == "Yield"])
rmse(pred.svm2, dtTest$Yield)
## [1] 0.9028991
We chart the residuals on the test set, noting a slight skew to the left.
hist(dtTest$Yield-pred.svm2, breaks=8)
fitControl2 <- trainControl(method = "repeatedcv",
                            number = 5,
                            repeats = 5)
svmGrid <- expand.grid(degree = 1:3,
                       scale = 1,
                       C = c(1, 1000))
svmFit3 <- train(Yield ~ ., data = dtTrain,
                 #type='eps-regression',
                 method = 'svmPoly',
                 preProc = c("center", "scale"),
                 trControl = fitControl2,
                 tuneGrid = svmGrid)
#svmFit3
With an optimal degree of 1, the polynomial kernel offers no benefit over a linear one.
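Since the full print of svmFit3 is suppressed above, the tuning outcome can be verified directly on the caret object:
# Best tuning parameters and the resampled metrics per candidate
svmFit3$bestTune
svmFit3$results[, c("degree", "C", "RMSE", "Rsquared")]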
(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
Our best-fitting model is the SVM with a radial kernel. However, with non-linear kernels the separating hyperplane lives in a transformed feature space (the result of the kernel mapping of the original space), so its coefficients are not directly related to the input predictors. Thus, we cannot directly assess the contribution of individual predictors to the model fit.
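That said, caret's varImp() falls back to a model-free, filter-based importance (via filterVarImp()) for models without built-in importance scores, which can serve as a rough proxy:
# Filter-based importance for the radial SVM (rough proxy only)
plot(varImp(svmFit2), top = 10)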
For the ranking below, however, we use the most important predictors according to the MARS model, as in the first exercise, using the varImp() function within the caret package.
import.mars <- varImp(marsFit)
import.mars$class <- rownames(import.mars)
mars.top <- as.list(import.mars[order(-import.mars$Overall),][1:10,]['class'])
We compare this to the linear regression model derived from the step() function, which selects the optimal model by the Akaike Information Criterion (AIC) using stepwise selection. The step() function gives a model of 15 variables, which is further reduced to the top 10 in the ranking.
## Applying step() to find the most important predictors via stepwise selection
# intercept_only_model <- lm(Yield ~ 1, data = dtTrain)
# lmMod <- lm(Yield~.,data=dtTrain)
# finalMod <- step(intercept_only_model, direction = 'both', scope = formula(lmMod))
#Top 10 Predictors - Linear Model
lmMod.optimal <- lm(Yield ~ ManufacturingProcess32 + ManufacturingProcess09 +
                      ManufacturingProcess33 + ManufacturingProcess37 +
                      ManufacturingProcess07 + BiologicalMaterial06 +
                      ManufacturingProcess17 + ManufacturingProcess04 +
                      ManufacturingProcess21 + ManufacturingProcess28 +
                      ManufacturingProcess44 + ManufacturingProcess15 +
                      ManufacturingProcess43 + ManufacturingProcess22 +
                      ManufacturingProcess34, data = dtTrain)
#summary(lmMod.optimal) #Adjusted R-squared: 0.714
import.linreg <- varImp(lmMod.optimal)
import.linreg$class <- rownames(import.linreg)
linreg.top <- as.list(import.linreg[order(-import.linreg$Overall),][1:10,]['class'])
rank <- data.frame(mars.top, linreg.top)
colnames(rank) <- c('MARS.top','LinReg.top')
rank
## MARS.top LinReg.top
## 1 ManufacturingProcess32 ManufacturingProcess32
## 2 ManufacturingProcess09 ManufacturingProcess17
## 3 ManufacturingProcess13 ManufacturingProcess37
## 4 ManufacturingProcess33 ManufacturingProcess21
## 5 ManufacturingProcess01 ManufacturingProcess15
## 6 BiologicalMaterial03 ManufacturingProcess09
## 7 ManufacturingProcess28 ManufacturingProcess04
## 8 ManufacturingProcess39 ManufacturingProcess33
## 9 ManufacturingProcess22 BiologicalMaterial06
## 10 BiologicalMaterial12 ManufacturingProcess07
The MARS model places two Biological Materials among the top-ten most impactful explanatory variables, while the linear regression model has one. The variables appearing in both top-ten lists are:
- ManufacturingProcess32,
- ManufacturingProcess09,
- ManufacturingProcess33
With seven of the ten leading predictors differing, this suggests a substantial difference in model structure between the two.
(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
unique.predictors <- mars.top$`class`[c(2,3,7,8,9,10)] # subset of the MARS top ten chosen for closer examination
Glimpsing at a few of the variables, we notice a moderately positive linear correlation between ManufacturingProcess09 and Yield…
vals <- unlist(dtTrain[colnames(dtTrain) == unique.predictors[1]])
plot(x=vals, y=dtTrain$Yield, xlab=unique.predictors[1])
cor(dtTrain$ManufacturingProcess09, dtTrain$Yield)
## [1] 0.5035476
… and a moderately negative linear correlation between ManufacturingProcess13 and Yield.
vals <- unlist(dtTrain[colnames(dtTrain) == unique.predictors[2]])
plot(x=vals, y=dtTrain$Yield, xlab=unique.predictors[2])
cor(dtTrain$ManufacturingProcess13, dtTrain$Yield)
## [1] -0.5066883
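The same scatterplot-plus-correlation check can be extended over all of the selected predictors with a short loop (a sketch):
# Plot each selected predictor against Yield, annotating the correlation
for (p in unique.predictors) {
  r <- cor(dtTrain[[p]], dtTrain$Yield)
  plot(dtTrain[[p]], dtTrain$Yield, xlab = p, ylab = "Yield",
       main = sprintf("%s (r = %.2f)", p, r))
}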