This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations require us to understand our manufacturing process and its predictive factors, and to report our predictive model of PH to them.
Our data has already been split into a training and testing dataset, so these are loaded in easily.
# import train and test data
train <- read_csv("training_data.csv")
test <- read_csv("testing_data.csv")
Exploring the train dataset, we see there are 2,571 cases
and 33 variables, all of which are numeric except for one character
variable. The character variable is Brand Code, which has
4 unique values.
With regard to missingness, no variable has an alarming amount of
missing data: MFR has the most, with 212 values missing, for a
complete rate of 91.8%. We'll consider imputing missing values
depending on the types of models we choose to try later. We also observe
drastically different scales and ranges of values across the numeric
variables, so scaling and centering will be considered depending on the
models selected later.
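As a quick numerical check of these figures (a minimal sketch, assuming the column names shown in the summary below), we can count missing values per variable and confirm MFR's complete rate before running the full summary:
# count missing values per variable and confirm MFR's complete rate
colSums(is.na(train)) %>% sort(decreasing = TRUE) %>% head()
mean(!is.na(train$MFR)) # approximately 0.918, i.e., 91.8% complete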
skim(train)
| Data summary |  |
|---|---|
| Name | train |
| Number of rows | 2571 |
| Number of columns | 33 |
| Column type frequency: character | 1 |
| Column type frequency: numeric | 32 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
| Fill Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
| Hyd Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Filler Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Filler Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
| Carb Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
| Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
| Alch Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
| Carb Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
| Balling Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
To take a closer look at our missingness, we check for any frequent
patterns in the missing data. For 100 cases only the MFR
value is missing, and for 91 cases only the Brand Code is
missing. These are the two most common patterns of missingness, and we feel
comfortable that these values are 'missing completely at random' rather than
the result of an underlying interaction or phenomenon.
# plotting patterns in missingness, top 15 patterns
# plot_na_intersect() comes from the dlookr package (n_intersacts is the package's argument spelling)
train %>%
plot_na_intersect(only_na = TRUE, typographic = FALSE, n_intersacts = 15)
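As a cross-check on those two counts without dlookr, a small base-R sketch (assuming the column names MFR and Brand Code):
# tally rows where exactly one value is missing and that value is MFR or Brand Code
na.mat <- is.na(train)
sum(na.mat[, "MFR"] & rowSums(na.mat) == 1) # cases where only MFR is missing (~100)
sum(na.mat[, "Brand Code"] & rowSums(na.mat) == 1) # cases where only Brand Code is missing (~91)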
For our numeric variables with missing values, we first check the shape of these variables to see if a mean or median imputation might suffice. The histograms show many non-normally distributed variables, several with outliers. This rules out mean imputation being useful; while a median imputation could suffice, we'll choose KNN imputation when model-building since, with this relatively small dataset, it won't be computationally challenging.
train %>%
#histograms only for numeric variables
subset(select = -`Brand Code`) %>%
#reshape
gather() %>%
ggplot(aes(value)) +
geom_histogram() +
facet_wrap(~ key, scales = "free", ncol = 4) +
labs(title = "Checking Distribution of Numeric Predictor Variables")
To see if any of the numeric variables may be better served as factors, we look at the histograms above and the output below, which shows the count of unique values for each variable. None of the numeric variables appear to be dichotomous or to have only a few levels, so we will not convert any of them to factors.
var_unique <- lapply(train, unique)
lengths(var_unique)
## Brand Code Carb Volume Fill Ounces PC Volume
## 5 102 93 455
## Carb Pressure Carb Temp PSC PSC Fill
## 107 124 130 33
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## 14 488 141 109
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## 246 208 193 41
## Filler Level Filler Speed Temperature Usage cont
## 289 245 57 482
## Carb Flow Density MFR Balling
## 534 79 588 218
## Pressure Vacuum PH Oxygen Filler Bowl Setpoint
## 16 53 339 12
## Pressure Setpoint Air Pressurer Alch Rel Carb Rel
## 9 32 54 43
## Balling Lvl
## 83
As it's hard to visualize all of these variables and their possible
correlations, we numerically check for multicollinearity in the training
data. We find many variable pairs with correlations greater than (+/-) 0.8.
Knowing this, we'll employ the step_corr feature of model recipes to
determine which features should be removed due to collinearity.
train %>%
#filter_if(is.numeric)%>%
correlate() %>%
filter(coef_corr > .8 | coef_corr < -.8)%>% # set thresholds to limit output
arrange(desc(abs(coef_corr))) %>%
flextable() %>%
set_caption('Table 1. Correlated Variables')
var1 | var2 | coef_corr |
Balling Lvl | Balling | 0.9782085 |
Balling | Balling Lvl | 0.9782085 |
Balling | Density | 0.9551553 |
Density | Balling | 0.9551553 |
Bowl Setpoint | Filler Level | 0.9489345 |
Filler Level | Bowl Setpoint | 0.9489345 |
Balling Lvl | Density | 0.9479195 |
Density | Balling Lvl | 0.9479195 |
MFR | Filler Speed | 0.9297119 |
Filler Speed | MFR | 0.9297119 |
Balling Lvl | Alch Rel | 0.9258951 |
Alch Rel | Balling Lvl | 0.9258951 |
Hyd Pressure3 | Hyd Pressure2 | 0.9248823 |
Hyd Pressure2 | Hyd Pressure3 | 0.9248823 |
Alch Rel | Balling | 0.9247934 |
Balling | Alch Rel | 0.9247934 |
Alch Rel | Density | 0.9027398 |
Density | Alch Rel | 0.9027398 |
Balling Lvl | Carb Rel | 0.8447733 |
Carb Rel | Balling Lvl | 0.8447733 |
Carb Rel | Alch Rel | 0.8435849 |
Alch Rel | Carb Rel | 0.8435849 |
Carb Rel | Density | 0.8243548 |
Density | Carb Rel | 0.8243548 |
Carb Rel | Balling | 0.8224442 |
Balling | Carb Rel | 0.8224442 |
Carb Temp | Carb Pressure | 0.8141717 |
Carb Pressure | Carb Temp | 0.8141717 |
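For reference, the same 0.8 cutoff can be checked outside the recipe with caret::findCorrelation() (a sketch assuming the caret package is available); step_corr() applies essentially this logic during preprocessing:
# flag the numeric predictors that a 0.8 correlation cutoff would remove
num.cor <- train %>%
select(where(is.numeric), -PH) %>% # predictors only
cor(use = "pairwise.complete.obs")
caret::findCorrelation(num.cor, cutoff = 0.8, names = TRUE)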
Based on findings in our initial exploratory data analysis, we will evaluate a set of machine learning models to predict PH that make no assumptions about the underlying variable distributions, are appropriate for a continuous response variable, can fit nonlinear relationships, and are relatively robust to outliers; missing values will be handled through imputation in preprocessing.
The following predictive models will be trained and evaluated here:
Model Characteristics
Decision Tree: captures nonlinear relationships, requires little data preprocessing and no normalizing, is easy to interpret, and gives a sense of feature importance; however, it is sensitive to collinearity and to changes in the input data, and tends to have lower accuracy than the other models.
Random Forest: one of the most accurate tree-based models; works well on large datasets, can be used to extract variable importance, and does not require feature engineering (scaling and normalization); however, it tends to overfit noisy data, is hard to interpret, and requires careful tuning.
SVM: handles complex, high-dimensional data, low risk of overfitting, memory efficient, but difficult to interpret.
NNET: works with little data preprocessing, good with nonlinear data and a large number of inputs, parallel processing ability, low interpretability, computationally intensive.
MARS: automatic feature selection, interpretability, little data preprocessing required, handles continuous and categorical predictors.
Our data has already been split into a training and testing dataset. To use these data in a Tidymodels modeling process, we will combine the two and reverse engineer our split.
# The following code is adapted from: https://stackoverflow.com/questions/64004132/tidymodels-creating-an-rsplit-object-from-training-and-testing-data
# combine train and test data
data<-rbind(train,test) # combine data
data%<>%
rename(BrandCode = 'Brand Code')%>%
filter(!is.na(PH))%>%
mutate_if(is.character, as.factor)
# calculate proportion of data from train
prop<-nrow(train)/(nrow(train)+nrow(test))
# create tidymodels split
data.split<-initial_time_split(data, prop=prop) # split without randomizing
train.split<-training(data.split) # create training set from 'data'
test.split<-testing(data.split) # create test set to graph results of last_fit
# check that train and train.split are identical
all_equal(train, train.split)
## Cols in `y` but not `x`: `BrandCode`.
## Cols in `x` but not `y`: `Brand Code`.
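Depending on the rsample version installed, an alternative is to build the rsplit directly from the two cleaned data frames with make_splits(), avoiding the reverse-engineered time split; a hedged sketch:
# apply the same cleaning to each set, then construct the rsplit directly
clean <- function(df) {
df %>%
rename(BrandCode = 'Brand Code') %>%
filter(!is.na(PH)) %>%
mutate_if(is.character, as.factor)
}
alt.split <- make_splits(x = clean(train), assessment = clean(test))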
We start model development by setting up our resampling scheme for parameter optimization and model selection. Here we will use 10-fold cross-validation and identify PH as our response variable.
# create resamples for model tuning
set.seed(0501)
cv<-
vfold_cv(train.split,
v=10,
strata= 'PH')
Next, we conduct our data preprocessing steps on the training data using the Recipes package. Two recipes are used: one for our MARS model and another for the remaining models.
We note that multicollinearity can impact tree-based models and that SVM requires normalizing (scaling and centering) prior to model training, so the shared recipe includes a correlation filter and a normalization step. We also impute missing values in both recipes to enable parameter estimation via grid tuning; our earlier tuning attempts failed due to missingness.
# build recipe for dtree, svm, nnet, rf models
train.recipe<-
recipe(PH ~., data=train.split)%>%
step_impute_knn(all_predictors())%>%
step_normalize(all_numeric())%>% # center and scale for svm
step_dummy(all_nominal()) %>%
step_corr(all_numeric_predictors(), threshold = 0.8) # drop highly correlated predictors (trees)
# build recipe for mars model
mars.recipe<-
recipe(PH ~ ., data = train.split) %>%
step_impute_knn(all_predictors())%>%
step_dummy(all_nominal())
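As a sanity check on the shared recipe, we can prep it and list which columns step_corr() actually removes (a sketch; step_corr is the fourth step in the recipe above):
# prep the recipe on the training data and inspect the correlation filter
prepped <- prep(train.recipe, training = train.split)
tidy(prepped, number = 4) # columns removed by step_corr
bake(prepped, new_data = NULL) %>% glimpse() # the processed training set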
Our model specifications include selected tuning parameters specific to each model. These parameters will be optimized via a racing grid search (tune_race_anova(), called through workflow_map() below).
#Specify Tree
dtree.spec<-decision_tree(
tree_depth = tune(), # creating tuneable parameters
min_n = tune(),
cost_complexity = tune())%>%
set_engine("rpart")%>% # ctree - tree_depth, min_n
set_mode("regression")
# Specify SVM
svm.spec<-svm_rbf(
cost = tune(),
margin = tune(),
rbf_sigma = tune()) %>%
set_engine("kernlab") %>%
set_mode("regression")
# specify random forest
rf.spec <-rand_forest(
mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_mode("regression")%>%
set_engine("ranger")
# Specify Neural Net
nnet.spec<-mlp(
hidden_units = tune(),
penalty = tune(),
epochs = tune()) %>%
set_engine("nnet", MaxNWts = 2600) %>%
set_mode("regression")
# a neural network should have up to 27 hidden units in the hidden layer (Kuhn & Johnson)
nnet.param <-
nnet.spec %>%
extract_parameter_set_dials() %>%
update(hidden_units = hidden_units(c(1, 27)))
# Specify MARS
mars.spec <-
mars(mode = "regression",
num_terms = tune(),
prod_degree = tune()) %>%
set_mode("regression")%>%
set_engine("earth")
We now construct our initial model workflows and combine them using the workflow_set() function to streamline our training process. The workflows themselves link each model's recipe and specification.
#build workflow for recipe with normalization
models.wf<-
workflow_set(
preproc=list(normalized = train.recipe),
models = list(Dec.Tree = dtree.spec, SVM.radial = svm.spec, R.Forest = rf.spec, NNet = nnet.spec)
)
mars.wf<-
workflow_set(
preproc=list(simple = mars.recipe),
models = list(MARS = mars.spec)
)
#combine workflows
combined.wf<-
bind_rows(models.wf, mars.wf)%>%
mutate(wflow_id = gsub("(simple_)|(normalized_)", "", wflow_id))
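Note that the hidden-unit range defined in nnet.param is not picked up automatically by the workflow set; a sketch (using workflowsets::option_add()) of attaching it to the NNet workflow so the tuning grid respects the 1-27 range:
# attach the custom parameter set to the neural network workflow before tuning
combined.wf <-
combined.wf %>%
option_add(param_info = nnet.param, id = "NNet")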
Next, we optimize our models' parameters via a 20-point grid and fit our training data in one step. The fitting process is time-intensive, so we also save our results to an .RData file that can be reloaded later.
# racing controls to speed up grid search (finetune)
race.ctrl <-
control_race(
save_pred = TRUE,
save_workflow = TRUE
)
grid.results <-
combined.wf %>%
workflow_map(
'tune_race_anova', # ANOVA-based racing from the finetune package
seed = 05402112,
resamples = cv,
grid = 20, # setting up grid size of 20
control = race.ctrl,
verbose = TRUE
)
# review and save tuning results
save(grid.results, file='grid.results.Rdata')
#load('grid.results.Rdata')
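Because tune_race_anova() discards poorly performing candidates early, we can optionally inspect the racing process for any single model with finetune::plot_race() (a sketch, shown here for the random forest workflow):
# visualize which random forest candidates survived the racing procedure
grid.results %>%
extract_workflow_set_result("R.Forest") %>%
plot_race()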
We use the RMSE (root mean squared error) metric to evaluate and compare the models' performance in fitting the training data.
From Table 2 and Figure 1, it is clear that the MARS model outperformed the other models, posting the lowest RMSE and the smallest standard error.
# rank results
rank_results(grid.results, rank_metric = "rmse")%>%
flextable()%>%
set_caption('Table 2. Model Performance Metrics for Grid Tuning')
wflow_id | .config | .metric | mean | std_err | n | preprocessor | model | rank |
MARS | Preprocessor1_Model3 | rmse | 0.1406986 | 0.002399975 | 10 | recipe | mars | 1 |
MARS | Preprocessor1_Model3 | rsq | 0.3468590 | 0.017078520 | 10 | recipe | mars | 1 |
R.Forest | Preprocessor1_Model09 | rmse | 0.5392000 | 0.013303407 | 10 | recipe | rand_forest | 2 |
R.Forest | Preprocessor1_Model09 | rsq | 0.7164233 | 0.014597414 | 10 | recipe | rand_forest | 2 |
Dec.Tree | Preprocessor1_Model09 | rmse | 0.6534581 | 0.014266312 | 10 | recipe | decision_tree | 3 |
Dec.Tree | Preprocessor1_Model09 | rsq | 0.5788396 | 0.017417534 | 10 | recipe | decision_tree | 3 |
Dec.Tree | Preprocessor1_Model15 | rmse | 0.6542463 | 0.013902747 | 10 | recipe | decision_tree | 4 |
Dec.Tree | Preprocessor1_Model15 | rsq | 0.5784452 | 0.017196541 | 10 | recipe | decision_tree | 4 |
SVM.radial | Preprocessor1_Model18 | rmse | 0.6799684 | 0.011453536 | 10 | recipe | svm_rbf | 5 |
SVM.radial | Preprocessor1_Model18 | rsq | 0.5419071 | 0.012546694 | 10 | recipe | svm_rbf | 5 |
NNet | Preprocessor1_Model12 | rmse | 0.7011618 | 0.012398132 | 10 | recipe | mlp | 6 |
NNet | Preprocessor1_Model12 | rsq | 0.5120298 | 0.012716749 | 10 | recipe | mlp | 6 |
NNet | Preprocessor1_Model15 | rmse | 0.7067832 | 0.014009440 | 10 | recipe | mlp | 7 |
NNet | Preprocessor1_Model15 | rsq | 0.5147337 | 0.016094132 | 10 | recipe | mlp | 7 |
NNet | Preprocessor1_Model02 | rmse | 0.7112156 | 0.010517122 | 10 | recipe | mlp | 8 |
NNet | Preprocessor1_Model02 | rsq | 0.5077373 | 0.010154447 | 10 | recipe | mlp | 8 |
NNet | Preprocessor1_Model18 | rmse | 0.7117098 | 0.016039306 | 10 | recipe | mlp | 9 |
NNet | Preprocessor1_Model18 | rsq | 0.5123723 | 0.020327149 | 10 | recipe | mlp | 9 |
# plot rmse tuning results for each model
autoplot(
grid.results,
rank_metric = "rmse", # <- how to order models
metric = "rmse", # <- which metric to visualize
select_best = TRUE) + # <- one point per workflow
lims(y = c(0, 0.75)) +
labs(title='Figure 1. Estimated RMSE for Best Model Configuration', x='')+
theme_light()
We can also extract and compare the best-performing tuned configuration for each algorithm across the resamples. These estimates are included in Table 3.
# list best rmse results by model
collect_metrics(grid.results)%>%
filter(.metric == c("rmse"))%>%
dplyr::select(model, .metric, mean, .config) %>%
group_by(model)%>%
slice(which.min(mean))%>%
rename(c(Model = model, Estimate = 'mean', 'CV-Fold' = .config))%>%
mutate(Metric = 'RMSE')%>%
relocate(Metric, .after=Model)%>%
mutate(Model = ifelse(Model %in% 'rand_forest', 'Random Forest', Model))%>%
mutate(Model = ifelse(Model %in% 'svm_rbf', 'SVM', Model))%>%
mutate(Model = ifelse(Model %in% 'mlp', 'NNET', Model))%>%
mutate(Model = ifelse(Model %in% 'decision_tree', 'Decision Tree', Model))%>%
mutate(Model = ifelse(Model %in% 'mars', 'MARS', Model))%>%
#mutate(Model = toupper(Model))%>%
arrange(Estimate)%>%
select(!.metric)%>%
flextable()%>%
set_caption('Table 3. Selected Models Based on Training Fit')
Model | Metric | Estimate | CV-Fold |
MARS | RMSE | 0.1398617 | Preprocessor1_Model2 |
Random Forest | RMSE | 0.5177498 | Preprocessor1_Model17 |
Decision Tree | RMSE | 0.6534581 | Preprocessor1_Model09 |
SVM | RMSE | 0.6799684 | Preprocessor1_Model18 |
NNET | RMSE | 0.7011618 | Preprocessor1_Model12 |
Our final model evaluation process includes the following steps: select the best-tuned configuration for each model, finalize each workflow with those parameters, refit on the full training set, and evaluate on the held-out test data with last_fit().
Table 4 shows the results of our model evaluation on the test data. It's clear, based on RMSE scores, that the MARS model outperformed the others. It is also worth noting that the RMSE for the MARS model on the test data is slightly lower than on the training data, which lends some assurance that the model is not over-fitting; if it were over-fitting, we would expect the test error to be noticeably greater than the training error.
# select best model fits
#MARS
mars.best <-
grid.results %>%
extract_workflow_set_result("MARS") %>%
select_best(metric = "rmse")
#Random Forest
rforest.best <-
grid.results %>%
extract_workflow_set_result("R.Forest") %>%
select_best(metric = "rmse")
#Decision Tree
dtree.best <-
grid.results %>%
extract_workflow_set_result("Dec.Tree") %>%
select_best(metric = "rmse")
# SVM
svm.best <-
grid.results %>%
extract_workflow_set_result("SVM.radial") %>%
select_best(metric = "rmse")
#NNET
nnet.best <-
grid.results %>%
extract_workflow_set_result("NNet") %>%
select_best(metric = "rmse")
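The five select_best() calls above follow the same pattern, so they could equivalently be collapsed into a single loop over workflow IDs (a sketch using purrr):
# collect the best parameters for every workflow in one pass
wflow.ids <- c("MARS", "R.Forest", "Dec.Tree", "SVM.radial", "NNet")
best.params <- purrr::map(
purrr::set_names(wflow.ids),
~ grid.results %>%
extract_workflow_set_result(.x) %>%
select_best(metric = "rmse")
)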
# Finalize workflows and fit test data with last_fit
mars.test<-
grid.results%>%
extract_workflow("MARS") %>%
finalize_workflow(mars.best)%>%
last_fit(split = data.split)
rforest.test<-
grid.results%>%
extract_workflow("R.Forest") %>%
finalize_workflow(rforest.best)%>%
last_fit(split = data.split)
dtree.test<-
grid.results%>%
extract_workflow("Dec.Tree") %>%
finalize_workflow(dtree.best)%>%
last_fit(split = data.split)
svm.test<-
grid.results%>%
extract_workflow("SVM.radial") %>%
finalize_workflow(svm.best)%>%
last_fit(split = data.split)
nnet.test<-
grid.results%>%
extract_workflow("NNet") %>%
finalize_workflow(nnet.best)%>%
last_fit(split = data.split)
# collect model rmse for fit comparison
metrics1<-mars.test%>%
collect_metrics()%>%
select(.metric, .estimate)%>%
filter(.metric == c("rmse"))%>%
rename(Metric = .metric, Estimate=.estimate)%>%
mutate(Model = 'MARS')%>%
relocate(Model, .before='Metric')
metrics2<-rforest.test%>%
collect_metrics()%>%
select(.metric, .estimate)%>%
filter(.metric == c("rmse"))%>%
rename(Metric = .metric, Estimate=.estimate)%>%
mutate(Model = 'Random Forest')%>%
relocate(Model, .before='Metric')
metrics3<-dtree.test%>%
collect_metrics()%>%
select(.metric, .estimate)%>%
filter(.metric == c("rmse"))%>%
rename(Metric = .metric, Estimate=.estimate)%>%
mutate(Model = 'Decision Tree')%>%
relocate(Model, .before='Metric')
metrics4<-svm.test%>%
collect_metrics()%>%
select(.metric, .estimate)%>%
filter(.metric == c("rmse"))%>%
rename(Metric = .metric, Estimate=.estimate)%>%
mutate(Model = 'SVM')%>%
relocate(Model, .before='Metric')
metrics5<-nnet.test%>%
collect_metrics()%>%
select(.metric, .estimate)%>%
filter(.metric == c("rmse"))%>%
rename(Metric = .metric, Estimate=.estimate)%>%
mutate(Model = 'Neural Network')%>%
relocate(Model, .before='Metric')
# print table of test fit results
rbind(metrics1, metrics2, metrics3, metrics4, metrics5)%>%
arrange(Estimate)%>%
flextable()%>%
set_caption('Table 4. Comparison of Model Fit on Test Data')
Model | Metric | Estimate |
MARS | rmse | 0.1398370 |
Neural Network | rmse | 0.8170020 |
Random Forest | rmse | 0.8347546 |
SVM | rmse | 0.8579408 |
Decision Tree | rmse | 0.9389812 |
One advantage of MARS models is their interpretability. Table 5 shows the variables selected for the final model along with their respective coefficients.
The final model form, with rounded coefficients, is:
PH = 8.466 + 0.001 * h(0.2 - Mnf Flow) - 0.113 * BrandCode_C - 0.045 * h(Usage cont - 22.14) + 0.028 * h(68.2 - Temperature)
where h is the hinge function, h(x) = max(0, x).
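To make the hinge terms concrete, a small worked example using the rounded coefficients above and hypothetical (but in-range) predictor values:
# hinge function and an example prediction; the input values below are hypothetical
h <- function(x) pmax(0, x)
mnf.flow <- -100; brand.c <- 0; usage.cont <- 21; temperature <- 66
8.466 + 0.001*h(0.2 - mnf.flow) - 0.113*brand.c -
0.045*h(usage.cont - 22.14) + 0.028*h(68.2 - temperature)
# = 8.466 + 0.001*100.2 - 0 - 0 + 0.028*2.2, or roughly 8.63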
Figure 2 displays the partial dependence between PH and each of the predictors in the MARS model. In this context, each plot describes the effect of a predictor with all other covariates held at their average values.
Each plot also reveals the hinge behavior characteristic of MARS: the predicted PH is flat on one side of the knot and changes linearly on the other.
# Print model
output<-extract_fit_engine(mars.test)%>%
summary()
output[12]%>% # element 12 of the earth summary holds the coefficient table shown below
data.frame()%>%
rownames_to_column(var="Model Components")%>%
rename(Coefficient = ..y)%>%
flextable()%>%
set_caption('Table 5. Final MARS Model: Features & Coefficients')
Model Components | Coefficient |
(Intercept) | 8.465801028 |
h(0.2-`Mnf Flow`) | 0.001256308 |
BrandCode_C | -0.112952249 |
h(`Usage cont`-22.14) | -0.044961352 |
h(68.2-Temperature) | 0.027763745 |
# plot partial dependence of features (from plotmo package in Earth -- plot regression surfaces)
plotmo(output, pmethod='partdep', caption='Figure 2. Partial dependence plots for continuous predictors using MARS')
## calculating partdep for Mnf Flow
## calculating partdep for Temperature
## calculating partdep for Usage cont
## calculating partdep for BrandCode_C
To effectively visualize the model results, we can plot our observed PH values against the model's PH predictions for both the training and test sets. These results are shown in Figure 3 and Table 6. Our errors are lowest around PH = 8.5; on the test data, the model tends to overpredict at lower observed PH values and underpredict at higher values.
# adapted from https://stackoverflow.com/questions/68124804/tidymodels-get-predictions-and-metrics-on-training-data-using-workflow-recipe
# pull trained workflow
fit.preds <- mars.test$.workflow[[1]]
# augment on test and training data
model.predictions <- bind_rows(
augment(fit.preds, new_data = test.split) %>%
mutate(type = "Test Data"),
augment(fit.preds, new_data = train.split) %>%
mutate(type = "Training Data"))
# plot observed vs predicted PH for train and test sets
model.predictions %>%
ggplot(aes(PH, .pred)) +
geom_point(aes(color=type), alpha=.95, show.legend = FALSE) + # constant alpha belongs outside aes()
scale_color_manual(values = c("Training Data" = "#4DB6D0", "Test Data" = "#BECA55"))+
labs(x='PH (Observed)', y= 'PH (Predicted)', title='Figure 3. Observed vs Predicted PH Values: MARS Model')+
theme_minimal() +
theme(plot.title = element_text(vjust = 10))+ # applied after theme_minimal() so it isn't overridden
facet_wrap(~type)
# print fit metrics for train and test
model.predictions %>%
group_by(type) %>%
metrics(PH, .pred)%>%
select(!.estimator)%>%
rename(Metric=.metric, Estimate=.estimate, Data=type)%>%
pivot_wider(names_from = Metric, values_from = Estimate)%>%
select(!mae)%>%
flextable()%>%
set_caption('Table 6. Performance Metrics on Final Training and Test Datasets: MARS')
Data | rmse | rsq |
Test Data | 0.1398370 | 0.2199699 |
Training Data | 0.1406417 | 0.3458952 |
Finally, we can rank the relative importance of each covariate in terms of its contribution as a predictor of PH. From Figure 4, we see that Mnf Flow makes the greatest contribution to the model's predictions. For this dataset, only four covariates were included in the final model.
# Plot variable importance for final model and test data
mars.test%>%
extract_fit_parsnip()%>%
vi_model(type='gcv')%>% # we can control importance type using vi_model instead of vip
filter(Importance > 0)%>%
arrange(desc(Importance))%>%
mutate(Variable = factor(Variable), Variable = fct_reorder(Variable, Importance))%>%
ggplot(aes(x=Importance, y = Variable))+
geom_col( fill='midnightblue', alpha=0.7)+
labs(title='Figure 4. Variable Importance: MARS Model', subtitle='PH')+
theme_light()
We conclude that any investments to manipulate PH as part of the manufacturing process should focus on adjustments to these four covariates, and we recommend that a cost-benefit analysis be conducted before committing company resources for this purpose.