Exercises from Chapter 7 of textbook Applied Predictive Modeling by Kuhn & Johnson
\(y = 10 sin(\pi x_1x_2) + 20(x_3 − 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\)
where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
#library(mlbench)
set.seed(200)
trainingData = mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x = data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData = mlbench.friedman1(5000, sd = 1)
testData$x = data.frame(testData$x)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel$finalModel
## 17-nearest neighbor regression model
pred_knn <- predict(knnModel$finalModel, newdata = testData$x)
PR_knn <- postResample(pred = pred_knn, obs = testData$y)
#store in df
compare <- data.frame(model = "knn",
rsquared = PR_knn[["Rsquared"]],
rmse = PR_knn[["RMSE"]])
compare
## model rsquared rmse
## 1 knn 0.5011098 5.956498
Next I’ll try an SVM model.
model_svm = train(x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
model_svm$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 16
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0680216365076835
##
## Number of Support Vectors : 152
##
## Objective Function Value : -66.0924
## Training error : 0.008551
pred_svm <- predict(model_svm$finalModel, newdata = testData$x)
PR_svm <- postResample(pred = pred_svm, obs = testData$y)
#store in df
svm_row <- c("svm",
PR_svm[["Rsquared"]],
PR_svm[["RMSE"]])
compare <-rbind(compare, svm_row)
compare
## model rsquared rmse
## 1 knn 0.501109776944346 5.95649765101139
## 2 svm 0.654429001004934 7.15607503427323
Last I’ll try a MARS model.
marsGrid = expand.grid(.degree = 1:2, .nprune = 2:38)
model_mars = train(x = trainingData$x,
y = trainingData$y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv",
number = 10))
model_mars$finalModel
## Selected 14 of 18 terms, and 5 of 10 predictors (nprune=14)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 4
## GCV 1.62945 RSS 225.8601 GRSq 0.9338437 RSq 0.953688
pred_mars <- predict(model_mars$finalModel, newdata = testData$x)
PR_mars <- postResample(pred = pred_mars, obs = testData$y)
#store in df
mars_row <- c("mars",
PR_mars[["Rsquared"]],
PR_mars[["RMSE"]])
compare <-rbind(compare, mars_row)
compare
## model rsquared rmse
## 1 knn 0.501109776944346 5.95649765101139
## 2 svm 0.654429001004934 7.15607503427323
## 3 mars 0.944889049379685 1.1722634531051
In looking just at the r-squared and RMSE values I stored below, MARS appears to have the best performance, although an r-squared that high would be a bit alarming in read-world data. In the readout above I do see that MARS selected X1-X5 as it’s top 5 predictors (in a different order than numeric). Effectively, it appears to have the best performance.
compare
## model rsquared rmse
## 1 knn 0.501109776944346 5.95649765101139
## 2 svm 0.654429001004934 7.15607503427323
## 3 mars 0.944889049379685 1.1722634531051
First I copy in the code from my previous work.
data(ChemicalManufacturingProcess)
chem <- as.data.frame(ChemicalManufacturingProcess)
knn_imp <- preProcess(chem, "knnImpute")
chem_imp <- predict(knn_imp, chem)
#split train/test
set.seed(3190)
sample_set <- sample(nrow(chem_imp), round(nrow(chem_imp)*0.75), replace = FALSE)
chem_train <-chem_imp[sample_set, ]
chem_train_x <- chem_train[, -1]
chem_train_y <- chem_train[, 1]
chem_test <-chem_imp[-sample_set, ]
chem_test_x <- chem_test[, -1]
chem_test_y <- chem_test[, 1]
knnModel <- train(x = chem_train_x,
y = chem_train_y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
pred_knn <- predict(knnModel$finalModel, newdata = chem_test_x)
PR_knn <- postResample(pred = pred_knn, obs = chem_test_y)
#store in df
compare <- data.frame(model = "knn",
rsquared = PR_knn[["Rsquared"]],
rmse = PR_knn[["RMSE"]])
compare
## model rsquared rmse
## 1 knn 0.3484061 0.8354028
svmModel <- train(x = chem_train_x,
y = chem_train_y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
pred_svm <- predict(svmModel$finalModel, newdata = chem_test_x)
PR_svm <- postResample(pred = pred_svm, obs = chem_test_y)
#store in df
svm_row <- c("svm",
PR_svm[["Rsquared"]],
PR_svm[["RMSE"]])
compare <-rbind(compare, svm_row)
compare
## model rsquared rmse
## 1 knn 0.348406114678821 0.83540279661946
## 2 svm 0.489107000574625 0.732540773784981
marsGrid = expand.grid(.degree = 1:2, .nprune = 2:38)
marsModel <- train(x = chem_train_x,
y = chem_train_y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv",
number = 10))
pred_mars <- predict(marsModel$finalModel, newdata = chem_test_x)
PR_mars <- postResample(pred = pred_mars, obs = chem_test_y)
#store in df
mars_row <- c("mars",
PR_mars[["Rsquared"]],
PR_mars[["RMSE"]])
compare <-rbind(compare, mars_row)
compare
## model rsquared rmse
## 1 knn 0.348406114678821 0.83540279661946
## 2 svm 0.489107000574625 0.732540773784981
## 3 mars 0.47146012210608 0.733929706841938
nnetGrid = expand.grid(decay = c(0, 0.01, .1),
size = c(1,5,10), bag = FALSE)
nnetModel = train(x = chem_train_x,
y = chem_train_y,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = trainControl(method = "cv",
number = 10),
preProc = c("center", "scale"),
linout = TRUE, trace = FALSE,
maxit = 10)
pred_nnet <- predict(nnetModel$finalModel, newdata = chem_test_x)
PR_nnet <- postResample(pred = pred_nnet, obs = chem_test_y)
#store in df
nnet_row <- c("nnet",
PR_nnet[["Rsquared"]],
PR_nnet[["RMSE"]])
compare <-rbind(compare, nnet_row)
compare
## model rsquared rmse
## 1 knn 0.348406114678821 0.83540279661946
## 2 svm 0.489107000574625 0.732540773784981
## 3 mars 0.47146012210608 0.733929706841938
## 4 nnet 0.505326229731184 0.705068411959686
With regard to the R-Squared value, the MARS model performs best, explaining 53.2% of the variance in the dataset. It also has the lowest RMSE - so it has the best performance in this case.
compare
## model rsquared rmse
## 1 knn 0.348406114678821 0.83540279661946
## 2 svm 0.489107000574625 0.732540773784981
## 3 mars 0.47146012210608 0.733929706841938
## 4 nnet 0.505326229731184 0.705068411959686
For the MARS model it selected 6 process variables as the most important, followed by 2 biological and another 2 process - as seen below. This is very similar to what I saw in my linear model on the last assignment, 8 process variables and 2 biological variables, and the top two `ManufacturingProcess32 and ManufacturingProcess09 being the same.
varImp(marsModel$finalModel)
## Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess09 63.26523
## ManufacturingProcess24 21.25427
The only variables in the above top 10 MARS model that weren’t in my PLS model top 10 previously are: ManufacturingProcess01, ManufacturingProcess39, BiologicalMaterial03, BiologicalMaterial04, and ManufacturingProcess10.
I see that BiologicalMaterial03 and `BiologicalMaterial04 have a positive correlation with each other, and with the Yield target variable. The manufacturing process variable here seem to have only slight correlations with Yield, with most being negative. In looking back at my correlation plot from the last assignment, again see negative correlations for the biggest process variables these two modelsh ave in common. It appears that, roughly, the process variables are negatively correlated with Yield while the biologic variables are positively corelated with Yield.
chem_top5 <- chem_imp %>%
select(c(`ManufacturingProcess01`, `ManufacturingProcess39`, `BiologicalMaterial03`, `BiologicalMaterial04`, `ManufacturingProcess10`,Yield))
correlation <- cor(chem_top5)
# plot correlations
corrplot.mixed(correlation, tl.col = 'black', tl.pos = 'lt',
number.cex= .9, tl.cex = .7)