Exercises 7.2 and 7.5
7.2 This exercise uses a benchmark data set created by simulation (the Friedman 1 benchmark). The simulation uses the following nonlinear equation:
\[y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\]
where the x values are random variables uniformly distributed on [0, 1]. Five additional non-informative variables are also created by the simulation. The mlbench package contains a function called mlbench.friedman1 that simulates these data:
library(mlbench)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(200)
trainingData<-mlbench.friedman1(200, sd = 1)
## convert the 'x' data from a matrix to a data frame
## one reason is that this will give column names
trainingData$x<-data.frame(trainingData$x)
## look at the data using
featurePlot(trainingData$x, trainingData$y)
## this creates a list with vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData<-mlbench.friedman1(500, sd = 1)
testData$x<-data.frame(testData$x)
Tune several models on these data. For example:
library(caret)
knnModel<-train(x = trainingData$x,
y = trainingData$y,
method = 'knn',
preProc = c('center', 'scale'),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.587483 0.5016797 2.925321
## 7 3.449601 0.5457896 2.820733
## 9 3.366991 0.5780134 2.745968
## 11 3.319312 0.6028553 2.692924
## 13 3.338963 0.6074127 2.709832
## 15 3.324987 0.6225090 2.703993
## 17 3.312809 0.6390199 2.684094
## 19 3.320180 0.6507115 2.689365
## 21 3.348821 0.6496607 2.717876
## 23 3.355189 0.6584384 2.726909
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <-predict(knnModel, newdata = testData$x)
## the function 'postResample' can be used to get the test set
## performance values
knn_metrics<-postResample(pred = knnPred, obs = testData$y)
Let's check the variable importance for the kNN model.
varImp(knnModel)
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
Interestingly, some variables we know are uninformative (because we created them) still receive nonzero importance scores. The kNN model also performs relatively poorly here: kNN depends entirely on distances in predictor space, and the five noise predictors dilute those distances, blurring the neighborhood structure the model needs.
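As a quick sanity check (my addition, not part of the original analysis), we could refit kNN using only the five informative predictors. If the noise variables really are diluting the distance calculations, the resampled RMSE should improve noticeably:
set.seed(200)
# Hypothetical check: same kNN tuning, but restricted to X1-X5.
knnInformative <- train(x = trainingData$x[, 1:5],
                        y = trainingData$y,
                        method = 'knn',
                        preProc = c('center', 'scale'),
                        tuneLength = 10)
# Best resampled RMSE with the noise predictors removed.
min(knnInformative$results$RMSE)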
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1-X5)?
Let's build a MARS model, tuning over the degree of interaction and the number of retained terms (nprune).
marsGrid <- expand.grid(degree = 1:2, nprune = seq(2,14, by =2))
set.seed(921)
marsModel<-train(x = trainingData$x,
y = trainingData$y,
method = 'earth',
preProc = c('center','scale'),
tuneGrid = marsGrid)
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
marsPred <-predict(marsModel, newdata = testData$x)
mars_metrics<-postResample(pred = marsPred, obs = testData$y)
Let's check the variable importance for the MARS model.
varImp(marsModel)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.40
## X2 49.00
## X5 15.72
## X3 0.00
The MARS model correctly pulls out only the informative predictors, although X3 receives no importance in the final pruned model.
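To see which hinge functions the final model actually kept (a quick inspection I'm adding, using earth's summary method on the fitted object):
# The retained basis functions and their coefficients; only the
# informative predictors should appear in the terms.
summary(marsModel$finalModel)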
Now let's build and tune an SVM model with a radial kernel. Afterwards we use the plotmo package to plot each predictor against the outcome.
set.seed(921)
svmRmodel<-train(x = trainingData$x,
y = trainingData$y,
method = 'svmRadial',
preProc = c('center','scale'),
tuneLength = 8)
svmRpred<-predict(svmRmodel, newdata = testData$x)
varImp(svmRmodel)
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
The important variables are again rated highest, but some noise from the uninformative predictors leaks in.
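Note the header of that table: "loess r-squared variable importance". kernlab's SVM has no built-in importance measure, so varImp falls back to caret's model-free filter, which is why this ranking is identical to the kNN one. The same numbers can be computed directly (shown for illustration):
# Model-free importance: loess R-squared of y against each predictor
# individually, independent of any fitted model.
filterVarImp(x = trainingData$x, y = trainingData$y, nonpara = TRUE)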
plotmo(svmRmodel)
## plotmo grid: X1 X2 X3 X4 X5 X6 X7
## 0.5139349 0.5106664 0.537307 0.4445841 0.5343299 0.4975981 0.4688035
## X8 X9 X10
## 0.497961 0.5288716 0.5359218
Looking at the plotmo plots, we can see that the effects of the uninformative variables are virtually flat (essentially zero).
svm_metrics<-postResample(pred = svmRpred, obs = testData$y)
Now let's look at the test-set metrics to determine which model performed best.
library(tidyverse)
df<-as.data.frame(rbind(knn_metrics, mars_metrics, svm_metrics))
df%>%arrange(desc(Rsquared))
## RMSE Rsquared MAE
## mars_metrics 1.323799 0.9317627 1.027932
## svm_metrics 2.193731 0.8128594 1.634044
## knn_metrics 3.333159 0.6741453 2.653670
Overall the MARS model works best. This is a bit surprising, because one of the SVM's selling points is that the radial kernel can capture smooth, sinusoidal relationships like the one in this simulation.
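As a complement to the single test-set split (a sketch I'm adding; for a properly paired comparison the three models would need identical resampling indices, e.g. the same seed and trainControl), caret's resamples() can summarize the bootstrap performance distributions side by side:
# Collect the 25 bootstrap resamples from each tuned model and compare.
modelComparison <- resamples(list(kNN = knnModel,
                                  MARS = marsModel,
                                  SVM = svmRmodel))
summary(modelComparison)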
7.5 (a) Start by loading the ChemicalManufacturingProcess data:
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.
Let's take a look at the missing values.
library(visdat)
vis_miss(ChemicalManufacturingProcess)
Let's first see if we can remove some columns with nearZeroVar and how that affects the missingness.
chem_dat<-ChemicalManufacturingProcess[,-nearZeroVar(ChemicalManufacturingProcess)]
vis_miss(chem_dat)
Interestingly, this increased the overall percentage of missing data! That's because the removed near-zero-variance columns were more fully populated than average, so it isn't a problem.
Now let's get a list of the columns that contain missing data.
names(which(colSums(is.na(chem_dat))>0))
## [1] "ManufacturingProcess01" "ManufacturingProcess02" "ManufacturingProcess03"
## [4] "ManufacturingProcess04" "ManufacturingProcess05" "ManufacturingProcess06"
## [7] "ManufacturingProcess07" "ManufacturingProcess08" "ManufacturingProcess10"
## [10] "ManufacturingProcess11" "ManufacturingProcess12" "ManufacturingProcess14"
## [13] "ManufacturingProcess22" "ManufacturingProcess23" "ManufacturingProcess24"
## [16] "ManufacturingProcess25" "ManufacturingProcess26" "ManufacturingProcess27"
## [19] "ManufacturingProcess28" "ManufacturingProcess29" "ManufacturingProcess30"
## [22] "ManufacturingProcess31" "ManufacturingProcess33" "ManufacturingProcess34"
## [25] "ManufacturingProcess35" "ManufacturingProcess36" "ManufacturingProcess40"
## [28] "ManufacturingProcess41"
So a large share of the ManufacturingProcess columns contain NA values. Since the missing data seem fairly spread out, we will simply impute during preprocessing for our model.
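To back up the "fairly spread out" claim, we can check the per-column missing rates directly (a quick check I'm adding):
# Fraction of missing values in each affected column, highest first.
# All rates are small, so imputing inside preProcess() is reasonable.
sort(colMeans(is.na(chem_dat))[colSums(is.na(chem_dat)) > 0], decreasing = TRUE)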
Let's now split the data into predictors and response.
library(RANN)
response<-chem_dat%>%select(Yield)
features<-chem_dat%>%select(-Yield)
set.seed(123)
train_rows<-createDataPartition(response$Yield,
p = .7,
list = FALSE)
train_pred<-features[train_rows,]
train_resp<-response[train_rows,]
test_pred<-features[-train_rows,]
test_resp<-response[-train_rows,]
#preprocess data
pp<-preProcess(train_pred, method = c('BoxCox', 'center','scale','knnImpute'))
pp_train_pred<-predict(pp,train_pred)
pp_test_pred<-predict(pp,test_pred)
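A quick check (my addition) that knnImpute filled in everything:
# After imputation there should be no missing values left in either set.
sum(is.na(pp_train_pred))
sum(is.na(pp_test_pred))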
Since we already removed the near-zero-variance columns, we should remove highly correlated predictors as well.
pred_cor<-cor(pp_train_pred)
high_corr_pp<-findCorrelation(pred_cor)
pp_train_pred<-pp_train_pred[,-high_corr_pp]
pp_test_pred<-pp_test_pred[,-high_corr_pp]
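To see how much this trimmed the predictor set (another quick check I'm adding):
# Number of predictors dropped for high pairwise correlation, and how
# many columns remain for modeling.
length(high_corr_pp)
ncol(pp_train_pred)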
(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
I would not expect these data to have the kind of local neighborhood structure that kNN relies on, so I won't use kNN. I also like MARS's built-in pruning of predictors, so I'll start with a MARS model.
ctrl <- trainControl(method = "repeatedcv", repeats = 5)
marsGrid <- expand.grid(degree = 1:2, nprune = seq(2,30, by =2))
set.seed(921)
marsModel<-train(x = pp_train_pred,
y = train_resp,
method = 'earth',
preProc = c('center','scale'),
trControl = ctrl,
tuneGrid = marsGrid)
marsPred <-predict(marsModel, newdata = pp_test_pred)
postResample(pred = marsPred, obs = test_resp)
## RMSE Rsquared MAE
## 1.0949893 0.6432769 0.8859736
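The resampled optimum can also be read directly off the tuned object (a small sketch using the fitted marsModel from above):
# Best tuning parameters and the corresponding cross-validated RMSE.
marsModel$bestTune
min(marsModel$results$RMSE, na.rm = TRUE)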
On the test set, the tuned MARS model reaches an RMSE of about 1.09 with an R-squared of about 0.64. Let's look at which predictors the MARS model selected as most important.
varImp(marsModel)
## earth variable importance
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 59.69
## ManufacturingProcess13 21.22
## ManufacturingProcess01 0.00
plotmo(marsModel)
## plotmo grid: BiologicalMaterial01 BiologicalMaterial03 BiologicalMaterial04
## -0.04498146 -0.166129 -0.009073643
## BiologicalMaterial05 BiologicalMaterial06 BiologicalMaterial08
## -0.06373363 -0.05786777 0.07287398
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## -0.0338531 -0.1758822 -0.08481589
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 0.08200158 0.4864915 0.03424434
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 0.3312441 -0.09860299 -0.2402823
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## -0.9038186 0.9336492 0.04686526
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## -0.1058449 -0.06053385 -0.5129976
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 0.1326757 0.0851508 -0.06382909
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess19
## 0.08746425 0.08163835 -0.1607842
## ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 0.09026192 -0.19479 -0.1742694
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess26
## 0.01003158 -0.128005 0.07788206
## ManufacturingProcess28 ManufacturingProcess30 ManufacturingProcess32
## 0.7153686 -0.003758284 -0.08208428
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## 0.1813987 0.1113707 -0.07283974
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## -0.432863 -0.072112 0.6817466
## ManufacturingProcess39 ManufacturingProcess41 ManufacturingProcess43
## 0.2167978 -0.455224 -0.1210006
## ManufacturingProcess44 ManufacturingProcess45
## 0.2662542 0.1323683
I really like using the MARS model here: it keeps the original predictors interpretable, and it will be easy to explain to decision makers which steps of the manufacturing process we could change.
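For that conversation with decision makers, the retained hinge functions give concrete cut points on the process variables (a sketch; format() on the earth fit prints the fitted equation):
# Print the fitted MARS equation; the hinge terms show the thresholds on
# ManufacturingProcess32, 09, and 13 where predicted yield changes slope.
cat(format(marsModel$finalModel))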