Project Premise

This is a role-playing exercise. I am your new boss, in charge of production at ABC Beverage, and you are a team of data scientists reporting to me. My leadership has told me that new regulations require us to understand our manufacturing process and its predictive factors, and to report to them our predictive model of PH.

Data Exploration

Our data has already been split into a training and testing dataset, so these are loaded in easily.

# import train and test data
train <- read_csv("training_data.csv")
test <- read_csv("testing_data.csv")

Exploring the train dataset, we see there are 2,571 cases and 33 variables, all of which are numeric except for one character variable, Brand Code, which has 4 unique values.

With regard to missingness, no variable has an alarming amount of missing data; MFR has the most, with 212 values missing and a complete-case rate of 91.8%. We’ll consider imputing missing values depending on the types of models we choose to try later. We also observe drastically different scales and ranges among the numeric variables, so scaling and centering will be considered depending on the models selected.

skim(train)
Data summary
Name train
Number of rows 2571
Number of columns 33
_______________________
Column type frequency:
character 1
numeric 32
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Brand Code 120 0.95 1 1 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Carb Volume 10 1.00 5.37 0.11 5.04 5.29 5.35 5.45 5.70 ▁▆▇▅▁
Fill Ounces 38 0.99 23.97 0.09 23.63 23.92 23.97 24.03 24.32 ▁▂▇▂▁
PC Volume 39 0.98 0.28 0.06 0.08 0.24 0.27 0.31 0.48 ▁▃▇▂▁
Carb Pressure 27 0.99 68.19 3.54 57.00 65.60 68.20 70.60 79.40 ▁▅▇▃▁
Carb Temp 26 0.99 141.09 4.04 128.60 138.40 140.80 143.80 154.00 ▁▅▇▃▁
PSC 33 0.99 0.08 0.05 0.00 0.05 0.08 0.11 0.27 ▆▇▃▁▁
PSC Fill 23 0.99 0.20 0.12 0.00 0.10 0.18 0.26 0.62 ▆▇▃▁▁
PSC CO2 39 0.98 0.06 0.04 0.00 0.02 0.04 0.08 0.24 ▇▅▂▁▁
Mnf Flow 2 1.00 24.57 119.48 -100.20 -100.00 65.20 140.80 229.40 ▇▁▁▇▂
Carb Pressure1 32 0.99 122.59 4.74 105.60 119.00 123.20 125.40 140.20 ▁▃▇▂▁
Fill Pressure 22 0.99 47.92 3.18 34.60 46.00 46.40 50.00 60.40 ▁▁▇▂▁
Hyd Pressure1 11 1.00 12.44 12.43 -0.80 0.00 11.40 20.20 58.00 ▇▅▂▁▁
Hyd Pressure2 15 0.99 20.96 16.39 0.00 0.00 28.60 34.60 59.40 ▇▂▇▅▁
Hyd Pressure3 15 0.99 20.46 15.98 -1.20 0.00 27.60 33.40 50.00 ▇▁▃▇▁
Hyd Pressure4 30 0.99 96.29 13.12 52.00 86.00 96.00 102.00 142.00 ▁▃▇▂▁
Filler Level 20 0.99 109.25 15.70 55.80 98.30 118.40 120.00 161.20 ▁▃▅▇▁
Filler Speed 57 0.98 3687.20 770.82 998.00 3888.00 3982.00 3998.00 4030.00 ▁▁▁▁▇
Temperature 14 0.99 65.97 1.38 63.60 65.20 65.60 66.40 76.20 ▇▃▁▁▁
Usage cont 5 1.00 20.99 2.98 12.08 18.36 21.79 23.75 25.90 ▁▃▅▃▇
Carb Flow 2 1.00 2468.35 1073.70 26.00 1144.00 3028.00 3186.00 5104.00 ▂▅▆▇▁
Density 1 1.00 1.17 0.38 0.24 0.90 0.98 1.62 1.92 ▁▅▇▂▆
MFR 212 0.92 704.05 73.90 31.40 706.30 724.00 731.00 868.60 ▁▁▁▂▇
Balling 1 1.00 2.20 0.93 -0.17 1.50 1.65 3.29 4.01 ▁▇▇▁▇
Pressure Vacuum 0 1.00 -5.22 0.57 -6.60 -5.60 -5.40 -5.00 -3.60 ▂▇▆▂▁
PH 4 1.00 8.55 0.17 7.88 8.44 8.54 8.68 9.36 ▁▅▇▂▁
Oxygen Filler 12 1.00 0.05 0.05 0.00 0.02 0.03 0.06 0.40 ▇▁▁▁▁
Bowl Setpoint 2 1.00 109.33 15.30 70.00 100.00 120.00 120.00 140.00 ▁▂▃▇▁
Pressure Setpoint 12 1.00 47.62 2.04 44.00 46.00 46.00 50.00 52.00 ▁▇▁▆▁
Air Pressurer 0 1.00 142.83 1.21 140.80 142.20 142.60 143.00 148.20 ▅▇▁▁▁
Alch Rel 9 1.00 6.90 0.51 5.28 6.54 6.56 7.24 8.62 ▁▇▂▃▁
Carb Rel 10 1.00 5.44 0.13 4.96 5.34 5.40 5.54 6.06 ▁▇▇▂▁
Balling Lvl 1 1.00 2.05 0.87 0.00 1.38 1.48 3.14 3.66 ▁▇▂▁▆

To take a closer look at our missingness, we check for frequent patterns in the missing data. For 100 cases only the MFR value is missing, and for 91 cases only the Brand Code is missing. These are the top two patterns of missingness, and we feel comfortable that these values are ‘missing completely at random’ rather than the result of some underlying interaction or phenomenon.

# plotting patterns in missingness, top 15 patterns
train %>% 
  plot_na_intersect(only_na = TRUE, typographic = FALSE, n_intersacts = 15)

For our numeric variables with missing values, we first check the shapes of these variables to see whether a mean or median imputation might suffice. From the histograms we see many variables that are not normally distributed and many with outliers. This rules out mean imputation being useful, and while a median imputation could suffice, we’ll choose knn imputation when model-building since, with this smaller dataset, it won’t be computationally challenging.

train %>%
  #histograms only for numeric variables
  subset(select = -`Brand Code`) %>%
  #reshape
  gather() %>% 
  ggplot(aes(value)) +
  geom_histogram() +
  facet_wrap(~ key, scales = "free", ncol = 4) +
  labs(title = "Checking Distribution of Numeric Predictor Variables")
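
As a quick preview of the knn imputation we just committed to (a minimal sketch using recipes’ step_impute_knn(); the same step is applied formally inside the model recipes later), we can confirm that it leaves no missing values among the predictors:

# illustrative check: knn-impute the predictors and count remaining missing values
impute.check <- recipe(PH ~ ., data = train) %>%
  step_impute_knn(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

colSums(is.na(impute.check)) # predictors show 0; PH, the outcome, keeps its 4 missing values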

To see if any of the numeric variables may be better served as factors, we look at the above histograms and the following output that shows the count of unique values for each variable. We don’t see any numeric variables that appear to be dichotomous or with only a few levels/values, so we will not switch any of the numeric variables to factors.

var_unique <- lapply(train, unique)
lengths(var_unique)
##        Brand Code       Carb Volume       Fill Ounces         PC Volume 
##                 5               102                93               455 
##     Carb Pressure         Carb Temp               PSC          PSC Fill 
##               107               124               130                33 
##           PSC CO2          Mnf Flow    Carb Pressure1     Fill Pressure 
##                14               488               141               109 
##     Hyd Pressure1     Hyd Pressure2     Hyd Pressure3     Hyd Pressure4 
##               246               208               193                41 
##      Filler Level      Filler Speed       Temperature        Usage cont 
##               289               245                57               482 
##         Carb Flow           Density               MFR           Balling 
##               534                79               588               218 
##   Pressure Vacuum                PH     Oxygen Filler     Bowl Setpoint 
##                16                53               339                12 
## Pressure Setpoint     Air Pressurer          Alch Rel          Carb Rel 
##                 9                32                54                43 
##       Balling Lvl 
##                83

As it’s hard to visualize all of these variables and their possible correlations, we check numerically for multicollinearity in the training data. We find many variable pairs with correlation coefficients greater than 0.8 in absolute value. Knowing this, we’ll employ the step_corr() step from the recipes package to determine which features should be removed due to collinearity.

train %>%
    #filter_if(is.numeric)%>%
    correlate() %>%
    filter(coef_corr > .8 | coef_corr < -.8)%>% # set thresholds to limit output 
    arrange(desc(abs(coef_corr))) %>%
    flextable() %>%
  set_caption('Table 1. Correlated Variables')

Model Development

Based on findings in our initial exploratory data analysis, we will evaluate a set of machine learning models to predict PH that make no assumptions about the underlying variable distributions, are appropriate for continuous response variables, can fit nonlinear relationships, and are robust to both missing values and outliers.

The following predictive models will be trained and evaluated here:

  • Basic Decision Tree
  • Random Forest
  • Support Vector Machine (SVM)
  • Neural Net (single layer, forward) (NNET)
  • Multivariate Adaptive Regression Splines (MARS)

Model Characteristics

Decision Tree: can capture nonlinear relationships, needs little data preprocessing (no normalizing required), is easy to interpret, and gives a sense of feature importance; however, it is sensitive to collinearity and to changes in the input data, and tends to be less accurate than the other models.

Random Forest: one of the most accurate tree-based models; works well on large datasets, can be used to extract variable importance, and does not require feature engineering (scaling and normalization); however, it tends to overfit noisy data, is hard to interpret, and requires careful tuning.

SVM: handles complex, high-dimensional data with a low risk of overfitting and is memory efficient, but is difficult to interpret.

NNET: works with little data preprocessing, handles nonlinear data with a large number of inputs, and can exploit parallel processing; however, it has low interpretability and is computationally intensive.

MARS: automatic feature selection, good interpretability, little data preprocessing required, and handles both continuous and categorical predictors.

Train and Test Splits

Our data has already been split into a training and testing dataset. To use these data in a Tidymodels modeling process, we will combine the two and reverse engineer our split.

# The following code is adapted from: https://stackoverflow.com/questions/64004132/tidymodels-creating-an-rsplit-object-from-training-and-testing-data



# combine train and test data

data<-rbind(train,test) # combine data

data%<>%
  rename(BrandCode = 'Brand Code')%>%
  filter(!is.na(PH))%>%
  mutate_if(is.character, as.factor)

# calculate proportion of data from train

prop<-nrow(train)/(nrow(train)+nrow(test)) 

# create tidymodels split

data.split<-initial_time_split(data, prop=prop) # split without randomizing

train.split<-training(data.split) # create training set from 'data'

test.split<-testing(data.split) # create test set to graph results of last_fit

# compare train and train.split (they should differ only in the renamed Brand Code column)

all_equal(train, train.split)
## Cols in `y` but not `x`: `BrandCode`.
## Cols in `x` but not `y`: `Brand Code`.

Cross-Validation

We start model development by setting up our resampling scheme for parameter optimization and model selection. Here we will use 10-fold cross-validation, stratified on our response variable PH.

# create resamples for model tuning

set.seed(0501)

cv<-
  vfold_cv(train.split, 
           v=10,
           strata= 'PH')

Tidymodel Recipes

Next, we conduct our data preprocessing steps on the training data using the Recipes package. Two recipes are represented here: one for our MARS model and another for the remaining models. These steps include the following:

  1. MARS: impute missing values, dummy code factor variables
  2. Others: impute missing values, normalize numeric data, dummy-code factor variables, remove multicollinear predictors

We note that multicollinearity can impact tree-based models and that SVM requires normalizing (scaling and centering) prior to model training. We also impute missing values to enable parameter estimation via grid-tuning. Our earlier tuning attempts failed due to missingness.

# build recipe for dtree, svm, nnet, rf models

train.recipe<-
  recipe(PH ~., data=train.split)%>%
  step_impute_knn(all_predictors())%>%
  step_normalize(all_numeric_predictors())%>% # center and scale the predictors (needed for svm and nnet); leave the outcome PH on its original scale
  step_dummy(all_nominal()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.8) # drop highly correlated predictors

# build recipe for mars model

mars.recipe<- 
  recipe(PH ~ ., data = train.split) %>%
  step_impute_knn(all_predictors())%>%
  step_dummy(all_nominal()) 
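
Though not required for the workflows below, we can sanity-check this preprocessing by prepping the recipe on the training split and asking which predictors the correlation filter removes (a quick sketch using the standard prep()/bake()/tidy() helpers from recipes; step number 4 refers to step_corr’s position in train.recipe):

# prep the recipe on the training split and inspect the result
prepped <- prep(train.recipe, training = train.split)

tidy(prepped, number = 4)            # predictors dropped by step_corr
ncol(bake(prepped, new_data = NULL)) # columns remaining after preprocessing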

Model Specification

Our model specifications include selected tuning parameters specific to each model. These parameters will be optimized over a tuning grid during the racing step below.

#Specify Tree

dtree.spec<-decision_tree(
  tree_depth = tune(), # creating tuneable parameters
  min_n = tune(),
  cost_complexity = tune())%>%
    set_engine("rpart")%>% # ctree - tree_depth, min_n
    set_mode("regression")

# Specify SVM

svm.spec<-svm_rbf(
  cost = tune(), 
  margin = tune(),
  rbf_sigma = tune()) %>% 
  set_engine("kernlab") %>% 
  set_mode("regression")

# specify random forest

rf.spec <-rand_forest(
    mtry = tune(),
    trees = tune(),
    min_n = tune()) %>%
    set_mode("regression")%>%
    set_engine("ranger")

# Specify Neural Net

nnet.spec<-mlp(
  hidden_units = tune(), 
  penalty = tune(),
  epochs = tune()) %>% 
  set_engine("nnet", MaxNWts = 2600) %>% 
  set_mode("regression")

# a single-layer neural network should have up to 27 hidden units (Kuhn & Johnson)

nnet.param <- 
   nnet.spec %>% 
   extract_parameter_set_dials() %>% 
   update(hidden_units = hidden_units(c(1, 27)))

# Specify MARS

mars.spec <-
  mars(mode = "regression", 
       num_terms = tune(), 
       prod_degree = tune()) %>%
  set_mode("regression")%>%
  set_engine("earth")

Model Workflows

We now construct our model workflows with the workflow_set() function and then combine the two sets with bind_rows() to streamline our training process. The workflows themselves link each model’s recipe and specification.

#build workflow for recipe with normalization

models.wf<-
  workflow_set(
    preproc=list(normalized = train.recipe),
    models = list(Dec.Tree = dtree.spec, SVM.radial = svm.spec, R.Forest = rf.spec, NNet = nnet.spec)
  )

mars.wf<-
  workflow_set(
    preproc=list(simple = mars.recipe),
    models = list(MARS = mars.spec)
  )

#combine workflows 

combined.wf<-
  bind_rows(models.wf, mars.wf)%>%
  mutate(wflow_id = gsub("(simple_)|(normalized_)", "", wflow_id))
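
One note on the nnet.param object defined earlier: on its own it does not change the tuning ranges. A way to have the tuning step honor the 1-27 hidden-unit bound is to attach the parameter set to the neural net workflow as an option (a sketch assuming the workflowsets option_add() interface; the id matches the workflow id after the renaming above):

# attach the custom parameter set so tuning respects the hidden_units bound
combined.wf <- combined.wf %>%
  option_add(param_info = nnet.param, id = "NNet")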

Model Parameter Tuning

Next, we optimize each model’s parameters over a 20-point grid and fit the training data in one step, using ANOVA-based racing (tune_race_anova() from the finetune package) to drop poorly performing parameter combinations early. The fitting process is still time intensive, so we also save the results to an .RData file that can be reloaded later.

# speed grid estimation

race.ctrl <-
   control_race(
      save_pred = TRUE,
      save_workflow = TRUE
   )

grid.results <-
   combined.wf %>%
   workflow_map(     
      'tune_race_anova', # ANOVA-based racing from the finetune package
      seed = 05402112,
      resamples = cv,
      grid = 20, # setting up grid size of 20
      control = race.ctrl,
      verbose = TRUE
   )

# review and save tuning results

save(grid.results, file='grid.results.Rdata')

#load('grid.results.Rdata')
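
Because the racing step is expensive, on subsequent runs we can skip it whenever a cached result already exists. A minimal sketch, assuming the grid.results.Rdata file written by the save() call above sits in the working directory:

# reuse cached tuning results when available instead of re-fitting
if (file.exists('grid.results.Rdata')) {
  load('grid.results.Rdata')
} else {
  message('No cached tuning results found; run the tuning step above first.')
}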

Training Fit: Performance Metrics

We use RMSE (root mean squared error) to evaluate and compare each model’s performance in fitting the training data.
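
For reference, RMSE is the square root of the mean squared difference between observed and predicted PH. A tiny illustration with made-up values, using yardstick’s rmse_vec():

# RMSE = sqrt(mean((observed - predicted)^2)); each residual here is 0.1, so RMSE = 0.1
rmse_vec(truth = c(8.5, 8.6, 8.4), estimate = c(8.4, 8.7, 8.5))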

From Table 2 and Figure 1, it is clear that the MARS model outperformed the other models, posting the lowest RMSE and the smallest standard error.

# rank results

rank_results(grid.results, rank_metric = "rmse")%>%
  flextable()%>%
  set_caption('Table 2. Model Performance Metrics for Grid Tuning')
# plot rmse tuning results for each model

autoplot(
  grid.results,
  rank_metric = "rmse",  # <- how to order models
  metric = "rmse",       # <- which metric to visualize
  select_best = TRUE) +     # <- one point per workflow
  lims(y = c(0, 0.75)) +
  labs(title='Figure 1. Estimated RMSE for Best Model Configuration', x='')+
  theme_light()

We can also extract and compare the best-performing tuned configuration for each algorithm. These estimates are included in Table 3.

 # list best rmse results by model

collect_metrics(grid.results)%>%
  filter(.metric == c("rmse"))%>%
  dplyr::select(model, .metric, mean, .config) %>%
  group_by(model)%>%
  slice(which.min(mean))%>%
  rename(c(Model = model, Estimate = 'mean', Configuration = .config))%>%
  mutate(Metric = 'RMSE')%>%
  relocate(Metric, .after=Model)%>%
  mutate(Model = ifelse(Model %in% 'rand_forest', 'Random Forest', Model))%>%
  mutate(Model = ifelse(Model %in% 'svm_rbf', 'SVM', Model))%>%
  mutate(Model = ifelse(Model %in% 'mlp', 'NNET', Model))%>%
  mutate(Model = ifelse(Model %in% 'decision_tree', 'Decision Tree', Model))%>%
  mutate(Model = ifelse(Model %in% 'mars', 'MARS', Model))%>%
  #mutate(Model = toupper(Model))%>%
  arrange(Estimate)%>%
  select(!.metric)%>%
  flextable()%>%
  set_caption('Table 3. Selected Models Based on Training Fit')

Model Evaluation

Our final model evaluation process includes the following steps:

  1. Update each model with its highest-performing set of tuned parameters.
  2. Update the workflows to include these finalized models.
  3. Fit each finalized workflow on the full training set and evaluate it on the test set in one step (via last_fit()).
  4. Collect and compare performance metrics for each model on the test data.
  5. Select the highest-performing model and inspect its predictions.

Table 4 shows the results of our model evaluation on the test data. It’s clear, based on RMSE scores, that the MARS model outperformed the others. It is also worth noting that the MARS RMSE on the test data is lower than on the training data, which gives us some assurance that the model is not over-fitting; if it were, we would expect the test error to be noticeably larger than the training error.

# select best model fits

#MARS

mars.best <- 
   grid.results %>% 
   extract_workflow_set_result("MARS") %>% 
   select_best(metric = "rmse")

#Random Forest

rforest.best <- 
   grid.results %>% 
   extract_workflow_set_result("R.Forest") %>% 
   select_best(metric = "rmse")

#Decision Tree

dtree.best <- 
   grid.results %>% 
   extract_workflow_set_result("Dec.Tree") %>% 
   select_best(metric = "rmse")

# SVM

svm.best <- 
   grid.results %>% 
   extract_workflow_set_result("SVM.radial") %>% 
   select_best(metric = "rmse")

#NNET

nnet.best <- 
   grid.results %>% 
   extract_workflow_set_result("NNet") %>% 
   select_best(metric = "rmse")


# Finalize workflows and fit test data with last_fit

mars.test<-
  grid.results%>%
  extract_workflow("MARS") %>% 
  finalize_workflow(mars.best)%>% 
  last_fit(split = data.split)

rforest.test<-
  grid.results%>%
  extract_workflow("R.Forest") %>% 
  finalize_workflow(rforest.best)%>% 
  last_fit(split = data.split)


dtree.test<-
  grid.results%>%
  extract_workflow("Dec.Tree") %>% 
  finalize_workflow(dtree.best)%>% 
  last_fit(split = data.split)

svm.test<-
  grid.results%>%
  extract_workflow("SVM.radial") %>% 
  finalize_workflow(svm.best)%>% 
  last_fit(split = data.split)

nnet.test<-
  grid.results%>%
  extract_workflow("NNet") %>% 
  finalize_workflow(nnet.best)%>% 
  last_fit(split = data.split)

# collect model rmse for fit comparison

metrics1<-mars.test%>% 
  collect_metrics()%>%
  select(.metric, .estimate)%>%
  filter(.metric == c("rmse"))%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  mutate(Model = 'MARS')%>%
  relocate(Model, .before='Metric')
  
metrics2<-rforest.test%>% 
  collect_metrics()%>%
  select(.metric, .estimate)%>%
  filter(.metric == c("rmse"))%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  mutate(Model = 'Random Forest')%>%
  relocate(Model, .before='Metric')

metrics3<-dtree.test%>% 
  collect_metrics()%>%
  select(.metric, .estimate)%>%
  filter(.metric == c("rmse"))%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  mutate(Model = 'Decision Tree')%>%
  relocate(Model, .before='Metric')

metrics4<-svm.test%>% 
  collect_metrics()%>%
  select(.metric, .estimate)%>%
  filter(.metric == c("rmse"))%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  mutate(Model = 'SVM')%>%
  relocate(Model, .before='Metric')

metrics5<-nnet.test%>% 
  collect_metrics()%>%
  select(.metric, .estimate)%>%
  filter(.metric == c("rmse"))%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  mutate(Model = 'Neural Network')%>%
  relocate(Model, .before='Metric')

# print table of test fit results
  
rbind(metrics1, metrics2, metrics3, metrics4, metrics5)%>%
   arrange(Estimate)%>%
   flextable()%>%
   set_caption('Table 4. Comparison of Model Fit on Test Data')

Model Interpretation

One advantage of MARS models is their interpretability. Table 5 shows the variables selected for the final model along with their respective coefficients.

The final model form is:

PH = 8.466 + 0.001 h(0.2 - Mnf Flow) - 0.113 BrandCode_C - 0.045 h(Usage cont - 22.14) + 0.028 h(68.2 - Temperature), where h is the hinge function, h(x) = max(0, x).
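
To make the hinge notation concrete, here is an illustrative hand calculation with made-up predictor values (Mnf Flow = -100, Usage cont = 24, Temperature = 66, and a brand other than C):

# hinge function used by MARS
h <- function(x) pmax(0, x)

# plugging the hypothetical values into the model above returns roughly 8.54
8.466 + 0.001*h(0.2 - (-100)) - 0.113*0 - 0.045*h(24 - 22.14) + 0.028*h(68.2 - 66)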

Figure 2 displays the partial dependence between PH and each of the predictors in the MARS model. In this context, each plot describes the effect of a predictor with all other covariates held at their average values.

Each plot also reveals the hinge behavior characteristic of MARS, in which a term’s contribution is flat (zero) beyond its knot:

  • the Mnf Flow term is zero above 0.2
  • the Usage cont term is zero below 22.14
  • the Temperature term is zero above 68.2
  • BrandCode_C enters as a 0/1 dummy rather than a hinge term, so its partial dependence appears as a single step (near 0.45 in the plot)

# Print model
output<-extract_fit_engine(mars.test)%>%
  summary()

output[12]%>%
  data.frame()%>%
  rownames_to_column(var="Model Components")%>%
  rename(Coefficient = ..y)%>%
  flextable%>%
  set_caption('Table 5. Final MARS Model: Features & Coefficients')
# plot partial dependence of features (from plotmo package in Earth -- plot regression surfaces)
plotmo(output, pmethod='partdep', caption='Figure 2. Partial dependence plots for continuous predictors using MARS')
## calculating partdep for Mnf Flow 
## calculating partdep for Temperature 
## calculating partdep for Usage cont 
## calculating partdep for BrandCode_C

To effectively visualize the model results, we can plot our observed PH values against the model’s PH predictions for both the training and test sets. These results are shown in Figure 3 and Table 6. Our errors are lowest around PH ≈ 8.5; on the test data, the model tends to overpredict at lower PH values and underpredict at higher values.

# adapted from https://stackoverflow.com/questions/68124804/tidymodels-get-predictions-and-metrics-on-training-data-using-workflow-recipe

# pull trained workflow

fit.preds <- mars.test$.workflow[[1]] 

# augment on test and training data 

model.predictions <- bind_rows(
  augment(fit.preds, new_data = test.split) %>% 
    mutate(type = "Test Data"),
  augment(fit.preds, new_data = train.split) %>% 
    mutate(type = "Training Data"))

# plot observed vs predicted PH for train and test sets

model.predictions %>%
  ggplot(aes(PH, .pred)) +
  geom_point(aes(color=type), alpha=.95, show.legend = FALSE) +
  scale_color_manual(values = c("Training Data" = "#4DB6D0", "Test Data" = "#BECA55"))+
  labs(x='PH (Observed)', y= 'PH (Predicted)', title='Figure 3. Observed vs Predicted PH Values: MARS Model')+
  theme_minimal() +
  theme(plot.title = element_text(vjust = 10))+ # applied after theme_minimal() so it isn't overridden
  facet_wrap(~type)

# print fit metrics for train and test 
  
model.predictions %>%
  group_by(type) %>%
  metrics(PH, .pred)%>%
  select(!.estimator)%>%
  rename(Metric=.metric, Estimate=.estimate, Data=type)%>%
  pivot_wider(names_from = Metric, values_from = Estimate)%>%
  select(!mae)%>%
  flextable()%>%
  set_caption('Table 6. Performance Metrics on Final Training and Test Datasets: MARS')

Finally, we can rank the relative importance of each covariate in terms of its contribution to predicting PH. From Figure 4, we see that Mnf Flow makes the greatest contribution to the model’s predictions. For this dataset, only four covariates were included in the final model.

# Plot variable importance for final model and test data


mars.test%>%
  extract_fit_parsnip()%>%
  vi_model(type='gcv')%>% # we can control importance type using vi_model instead of vip
  filter(Importance > 0)%>%
  arrange(desc(Importance))%>%
  mutate(Variable =  factor(Variable), Variable = fct_reorder(Variable, Importance))%>%
  ggplot(aes(x=Importance, y = Variable))+
  geom_col( fill='midnightblue', alpha=0.7)+
  labs(title='Figure 4. Variable Importance: MARS Model', subtitle='PH')+
  theme_light()

We conclude that any investment to manipulate PH as part of the manufacturing process should focus on adjustments to these four covariates, and we recommend that a cost-benefit analysis be conducted before committing company resources for this purpose.