Assignment: Exercises 7.2 and 7.5 from the KJ textbook
library(caTools)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(nnet)
7.2. Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data:
y = 10sin(πx1x2) + 20(x3 − 0.5)2 + 10x4 + 5x5 + N(0, σ2)
where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
knnModel <- train(x = trainingData$x, y = trainingData$y, method = "knn", preProc = c("center", "scale"), tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
We also try out a neural network:
nnetFit <- nnet(trainingData$x, trainingData$y, size = 5, decay = 0.01, linout = TRUE,
## Reduce the amount of printed output
trace = FALSE,
## Expand the number of iterations to find
## parameter estimates..
maxit = 500,
## and the number of parameters used by the model
MaxNWts = 5 * (ncol(trainingData$x) + 1) + 5 + 1)
nNetPred <- predict(nnetFit, newdata = testData$x)
postResample(pred = nNetPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.8564065 0.7091704 1.9600728
Lastly, we try out a MARS model:
marsFit <- train(x = trainingData$x, y = trainingData$y, method = "earth", preProc = c("center", "scale"), tuneLength = 10)
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
marsPred <- predict(marsFit, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.776575 0.872700 1.358367
summary(marsFit$finalModel)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=1,
## nprune=10)
##
## coefficients
## (Intercept) 20.3958041
## h(0.507267-X1) -3.0209971
## h(0.325504-X2) -2.8963069
## h(X3- -0.804171) 1.1187319
## h(-0.216741-X3) 3.4950111
## h(X3-0.453446) 2.1548596
## h(0.953812-X4) -2.7559239
## h(X4-0.953812) 2.8600536
## h(1.17878-X5) -1.5056208
## h(X6- -0.47556) -0.5025995
##
## Selected 10 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 (additive model)
## GCV 2.731203 RSS 447.3848 GRSq 0.889112 RSq 0.9082649
The neural network is the best performing model with the smallest RMSE and largest R^2 value, and it is followed by MARS and then KNN. The MARS model selects all 5 of the informative predictors as well as X6.
7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
#load data
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
#imputation
cmp <- knnImputation(ChemicalManufacturingProcess[, !names(ChemicalManufacturingProcess) %in% "yield"])
#data splitting
set.seed(101)
sample = sample.split(cmp$BiologicalMaterial01, SplitRatio = .8)
cmp_train = subset(cmp, sample == TRUE)
cmp_test = subset(cmp, sample == FALSE)
cmp_train_X = subset(cmp_train, select = -Yield)
cmp_train_y = cmp_train[,'Yield']
cmp_test_X = subset(cmp_test, select = -Yield)
cmp_test_y = cmp_test[,'Yield']
KNN model:
cmpknn <- train(x = cmp_train_X, y = cmp_train_y, method = "knn", preProc = c("center", "scale"), tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
cmpknn
## k-Nearest Neighbors
##
## 140 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 140, 140, 140, 140, 140, 140, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.423326 0.4208469 1.128291
## 7 1.414026 0.4330362 1.128222
## 9 1.388053 0.4618164 1.116660
## 11 1.390159 0.4643388 1.123480
## 13 1.393569 0.4616407 1.130071
## 15 1.401858 0.4573811 1.141917
## 17 1.405514 0.4607165 1.146322
## 19 1.416220 0.4540900 1.149758
## 21 1.423137 0.4546066 1.156157
## 23 1.430939 0.4519784 1.162060
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
cmpknnPred <- predict(cmpknn, newdata = cmp_test_X)
## The function 'postResample' can be used to get the test set
## perforamnce values
postResample(pred = cmpknnPred, obs = cmp_test_y)
## RMSE Rsquared MAE
## 1.2778493 0.4141794 0.9856636
Neural Network model:
cmpnnet <- nnet(x = cmp_train_X, y = cmp_train_y, size = 5, decay = 0.01, linout = TRUE, trace = FALSE,
maxit = 500, MaxNWts = 5 * (ncol(cmp_train_X) + 1) + 5 + 1)
cmpnnetPred <- predict(cmpnnet, newdata = cmp_test_X)
postResample(pred = cmpnnetPred, obs = cmp_test_y)
## RMSE Rsquared MAE
## 1.7343071 0.2831666 1.3282372
MARS model:
cmpmars <- train(x = cmp_train_X, y = cmp_train_y, method = "earth", preProc = c("center", "scale"), tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
cmpmarsPred <- predict(cmpmars, newdata = cmp_test_X)
postResample(pred = cmpmarsPred, obs = cmp_test_y)
## RMSE Rsquared MAE
## 1.2272495 0.4881219 0.9595875
SVM model:
cmpsvm <- train(x = cmp_train_X, y = cmp_train_y, method = "svmRadial", preProc = c("center", "scale"), tuneLength = 10, trControl = trainControl(method = "cv"))
cmpsvmPred <- predict(cmpsvm, newdata = cmp_test_X)
postResample(pred = cmpsvmPred, obs = cmp_test_y)
## RMSE Rsquared MAE
## 1.2149832 0.4700383 0.9384320
SVM gives the optimal resampling and test set performance with a RMSE of 1.21.
Here are the 10 predictors from the optimal linear model (elastic net) ranked in order of importance:
ManufacturingProcess36 ManufacturingProcess09 ManufacturingProcess17 ManufacturingProcess32 ManufacturingProcess13 BiologicalMaterial08 ManufacturingProcess06 ManufacturingProcess24 BiologicalMaterial05 BiologicalMaterial02
Here are the most important predictors from the optimal nonlinear model (SVM) ranked in order of importance:
varImp(cmpsvm)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess13 100.00
## ManufacturingProcess17 92.67
## ManufacturingProcess32 88.59
## ManufacturingProcess09 82.10
## BiologicalMaterial06 76.04
## ManufacturingProcess36 74.17
## BiologicalMaterial12 69.50
## BiologicalMaterial03 67.46
## ManufacturingProcess06 64.05
## BiologicalMaterial02 59.77
## ManufacturingProcess31 58.15
## ManufacturingProcess11 55.36
## BiologicalMaterial08 49.21
## BiologicalMaterial11 46.73
## BiologicalMaterial04 43.96
## ManufacturingProcess29 42.49
## ManufacturingProcess33 37.97
## ManufacturingProcess30 36.40
## BiologicalMaterial01 35.79
## ManufacturingProcess18 35.06
In both of the optimal models, manufacturing processes dominate the list. There is a lot of overlap between the two models, but the SVM model has one additonal biological material predictor in its top 10 list. The SVM model also incorporates several more features than the Elastic Net model.
When we compare the top 10 predictors, there are 3 that are unique to the SVM model in comparison to the Elastic Net model, and all 3 are biological material predictors that have a positive association with yield. The SVM model reveals that the biological materials may have more of an influence on yield than previously assumed. Of the 3, biological material #6 has the strongest correlation with yield.
cmp2 <- subset(cmp, select = c(BiologicalMaterial06, BiologicalMaterial12, BiologicalMaterial03, Yield))
cor(cmp2, cmp[,'Yield'])
## [,1]
## BiologicalMaterial06 0.4781634
## BiologicalMaterial12 0.3674976
## BiologicalMaterial03 0.4450860
## Yield 1.0000000