Data


We start our analysis by importing the training and evaluation set. Manual one hot encoding is also applied to the Brand Code variable. This results in three new variables: brand_code_b, brand_code_c and brand_code_d. A fourth variable brand_code_ais implicit when the preceding three variable are equal to 0.

train <- read_excel("StudentData.xlsx")
train <- train %>% 
  mutate(`Brand Code B` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 1,
    `Brand Code` == "C" ~ 0,
    `Brand Code` == "D" ~ 0,
    TRUE ~ 0
  ),
  `Brand Code C` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 0,
    `Brand Code` == "C" ~ 1,
    `Brand Code` == "D" ~ 0,
    TRUE ~ 0
    ),
  `Brand Code D` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 0,
    `Brand Code` == "C" ~ 0,
    `Brand Code` == "D" ~ 1,
    TRUE ~ 0
  )) %>% 
  select(-`Brand Code`)

test <- read_excel("StudentEvaluation.xlsx")
test <- test %>% 
  mutate(`Brand Code B` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 1,
    `Brand Code` == "C" ~ 0,
    `Brand Code` == "D" ~ 0,
    TRUE ~ 0
  ),
  `Brand Code C` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 0,
    `Brand Code` == "C" ~ 1,
    `Brand Code` == "D" ~ 0,
    TRUE ~ 0
    ),
  `Brand Code D` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 0,
    `Brand Code` == "C" ~ 0,
    `Brand Code` == "D" ~ 1,
    TRUE ~ 0
  )) %>% 
  select(-`Brand Code`)

Remove NA Values


Next we drop observations with NA values on the training set. This step leaves with a sample set that is more than adequate and eliminate the uncertain surround selecting a correct imputation strategy. Additionally, modeling results improved after adoption of the drop NA approach. Initial iteration used KNN imputation.

train <- train %>% 
  drop_na()

skim(train)
Data summary
Name train
Number of rows 2129
Number of columns 35
_______________________
Column type frequency:
numeric 35
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Carb Volume 0 1 5.37 0.10 5.07 5.29 5.35 5.45 5.70 ▁▇▇▅▁
Fill Ounces 0 1 23.98 0.09 23.65 23.93 23.98 24.03 24.32 ▁▂▇▂▁
PC Volume 0 1 0.28 0.06 0.08 0.24 0.27 0.31 0.48 ▁▃▇▂▁
Carb Pressure 0 1 68.21 3.44 58.20 65.60 68.20 70.60 78.80 ▁▆▇▅▁
Carb Temp 0 1 141.12 3.97 128.60 138.40 140.80 143.80 154.00 ▁▃▇▃▁
PSC 0 1 0.08 0.05 0.00 0.05 0.08 0.11 0.27 ▆▇▃▁▁
PSC Fill 0 1 0.19 0.12 0.00 0.10 0.18 0.26 0.62 ▆▇▃▁▁
PSC CO2 0 1 0.06 0.04 0.00 0.02 0.04 0.08 0.24 ▇▅▂▁▁
Mnf Flow 0 1 29.58 122.19 -100.20 -100.00 105.60 143.60 229.40 ▇▁▁▇▂
Carb Pressure1 0 1 122.14 4.45 106.40 118.60 123.00 125.00 138.00 ▁▅▇▃▁
Fill Pressure 0 1 48.18 2.84 43.60 46.00 46.60 50.00 59.20 ▇▂▅▁▁
Hyd Pressure1 0 1 13.24 12.27 -0.60 0.00 12.40 21.00 58.00 ▇▆▂▁▁
Hyd Pressure2 0 1 22.92 16.09 0.00 0.00 30.80 35.20 59.40 ▆▁▇▅▁
Hyd Pressure3 0 1 22.29 15.61 -1.20 0.00 30.00 33.80 50.00 ▆▁▃▇▁
Hyd Pressure4 0 1 94.33 10.81 70.00 84.00 96.00 100.00 138.00 ▅▇▇▁▁
Filler Level 0 1 108.49 15.58 55.80 92.20 117.40 120.00 146.60 ▁▃▂▇▁
Filler Speed 0 1 3881.07 344.08 1004.00 3896.00 3986.00 3998.00 4028.00 ▁▁▁▁▇
Temperature 0 1 65.77 1.03 63.60 65.20 65.60 66.20 73.20 ▇▇▁▁▁
Usage cont 0 1 21.11 2.91 12.46 18.50 22.04 23.76 25.90 ▁▃▃▃▇
Carb Flow 0 1 2637.61 899.99 44.00 2728.00 3042.00 3204.00 5104.00 ▁▃▇▇▁
Density 0 1 1.17 0.37 0.44 0.92 0.98 1.62 1.92 ▁▇▁▁▃
MFR 0 1 708.17 64.36 95.40 708.20 724.80 731.00 868.60 ▁▁▁▅▇
Balling 0 1 2.20 0.92 0.35 1.50 1.65 3.34 3.98 ▁▇▁▁▃
Pressure Vacuum 0 1 -5.22 0.57 -6.60 -5.60 -5.40 -5.00 -3.60 ▂▇▆▂▁
PH 0 1 8.55 0.17 7.88 8.44 8.56 8.68 8.94 ▁▂▆▇▃
Oxygen Filler 0 1 0.04 0.04 0.00 0.02 0.03 0.06 0.32 ▇▂▁▁▁
Bowl Setpoint 0 1 109.02 15.28 70.00 90.00 120.00 120.00 140.00 ▁▃▃▇▁
Pressure Setpoint 0 1 47.62 2.05 44.00 46.00 46.00 50.00 52.00 ▁▇▁▆▁
Air Pressurer 0 1 142.84 1.22 140.80 142.20 142.60 143.00 147.60 ▂▇▁▁▁
Alch Rel 0 1 6.89 0.50 6.32 6.54 6.56 7.22 8.62 ▇▁▁▂▁
Carb Rel 0 1 5.43 0.12 5.02 5.34 5.40 5.54 5.78 ▁▃▇▅▁
Balling Lvl 0 1 2.04 0.86 1.18 1.38 1.46 3.14 3.66 ▇▁▁▂▃
Brand Code B 0 1 0.49 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▇
Brand Code C 0 1 0.12 0.32 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
Brand Code D 0 1 0.24 0.43 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂

Preprocessing


For the training set we employ the corr and nzv preprocessing functions. The corr, nzv and medianImpute functions are applied to the evaluation data set. Note only predictor variables are being imputed in the evaluation data set. Finally, the janitor::clean_names function is applied to all variables of each data set

# train data
temp_df <- data.matrix(train) 
preprocessing <- preProcess(temp_df, method = c("corr", "nzv"))
student_train_preprocess <-  predict(preprocessing, temp_df) 
train_df <- as_tibble(student_train_preprocess)
temp_df2 <- data.matrix(test)


# test data
preprocessing <- preProcess(temp_df2, method = c("medianImpute","corr", "nzv"))
student_test_preprocess <-  predict(preprocessing, temp_df2) 
test_df <- as_tibble(student_test_preprocess)


train_df <- train_df %>% 
  select(PH, everything()) %>%
  clean_names()

test_df <- test_df %>% 
  select(everything()) %>%
  clean_names()

Preprocessed and Cleaned Data Set


We use the skim function to present our preprocessed and cleaned data set.

Data summary
Name train_df
Number of rows 2129
Number of columns 35
_______________________
Column type frequency:
numeric 35
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ph 0 1 8.55 0.17 7.88 8.44 8.56 8.68 8.94 ▁▂▆▇▃
carb_volume 0 1 5.37 0.10 5.07 5.29 5.35 5.45 5.70 ▁▇▇▅▁
fill_ounces 0 1 23.98 0.09 23.65 23.93 23.98 24.03 24.32 ▁▂▇▂▁
pc_volume 0 1 0.28 0.06 0.08 0.24 0.27 0.31 0.48 ▁▃▇▂▁
carb_pressure 0 1 68.21 3.44 58.20 65.60 68.20 70.60 78.80 ▁▆▇▅▁
carb_temp 0 1 141.12 3.97 128.60 138.40 140.80 143.80 154.00 ▁▃▇▃▁
psc 0 1 0.08 0.05 0.00 0.05 0.08 0.11 0.27 ▆▇▃▁▁
psc_fill 0 1 0.19 0.12 0.00 0.10 0.18 0.26 0.62 ▆▇▃▁▁
psc_co2 0 1 0.06 0.04 0.00 0.02 0.04 0.08 0.24 ▇▅▂▁▁
mnf_flow 0 1 29.58 122.19 -100.20 -100.00 105.60 143.60 229.40 ▇▁▁▇▂
carb_pressure1 0 1 122.14 4.45 106.40 118.60 123.00 125.00 138.00 ▁▅▇▃▁
fill_pressure 0 1 48.18 2.84 43.60 46.00 46.60 50.00 59.20 ▇▂▅▁▁
hyd_pressure1 0 1 13.24 12.27 -0.60 0.00 12.40 21.00 58.00 ▇▆▂▁▁
hyd_pressure2 0 1 22.92 16.09 0.00 0.00 30.80 35.20 59.40 ▆▁▇▅▁
hyd_pressure3 0 1 22.29 15.61 -1.20 0.00 30.00 33.80 50.00 ▆▁▃▇▁
hyd_pressure4 0 1 94.33 10.81 70.00 84.00 96.00 100.00 138.00 ▅▇▇▁▁
filler_level 0 1 108.49 15.58 55.80 92.20 117.40 120.00 146.60 ▁▃▂▇▁
filler_speed 0 1 3881.07 344.08 1004.00 3896.00 3986.00 3998.00 4028.00 ▁▁▁▁▇
temperature 0 1 65.77 1.03 63.60 65.20 65.60 66.20 73.20 ▇▇▁▁▁
usage_cont 0 1 21.11 2.91 12.46 18.50 22.04 23.76 25.90 ▁▃▃▃▇
carb_flow 0 1 2637.61 899.99 44.00 2728.00 3042.00 3204.00 5104.00 ▁▃▇▇▁
density 0 1 1.17 0.37 0.44 0.92 0.98 1.62 1.92 ▁▇▁▁▃
mfr 0 1 708.17 64.36 95.40 708.20 724.80 731.00 868.60 ▁▁▁▅▇
balling 0 1 2.20 0.92 0.35 1.50 1.65 3.34 3.98 ▁▇▁▁▃
pressure_vacuum 0 1 -5.22 0.57 -6.60 -5.60 -5.40 -5.00 -3.60 ▂▇▆▂▁
oxygen_filler 0 1 0.04 0.04 0.00 0.02 0.03 0.06 0.32 ▇▂▁▁▁
bowl_setpoint 0 1 109.02 15.28 70.00 90.00 120.00 120.00 140.00 ▁▃▃▇▁
pressure_setpoint 0 1 47.62 2.05 44.00 46.00 46.00 50.00 52.00 ▁▇▁▆▁
air_pressurer 0 1 142.84 1.22 140.80 142.20 142.60 143.00 147.60 ▂▇▁▁▁
alch_rel 0 1 6.89 0.50 6.32 6.54 6.56 7.22 8.62 ▇▁▁▂▁
carb_rel 0 1 5.43 0.12 5.02 5.34 5.40 5.54 5.78 ▁▃▇▅▁
balling_lvl 0 1 2.04 0.86 1.18 1.38 1.46 3.14 3.66 ▇▁▁▂▃
brand_code_b 0 1 0.49 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▇
brand_code_c 0 1 0.12 0.32 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
brand_code_d 0 1 0.24 0.43 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂

Explore Data


Our data exploration leverages the skim report above as well as Figure 1 and Figure 2 below. Figure 1 provides insight into the distribution of our target variable pH. Figure looks at the relationship of the target variable with each of the predictor variables. It this plot that convinced the analysis to steer away from linear models.

Figure 1 . Histogram - ph Level

Figure 2. Scatter Plots pH vs Predictor Variables



Training / Test Split


Our next step is to take the training data and perform a training / test split. This enables us to train on some of the data and then use the test split to ensure we have not over fit the data. Please note, the test data is not the same as the evaluation data set. It is a subset of the training data set. Additionally, tuning/training folds are established for our model tuning.

set.seed(123)
ph_split <- initial_split(train_df, strata= ph, na.rm =TRUE)
ph_train <- training(ph_split)
ph_test  <- testing(ph_split)

set.seed(234)
ph_folds <- bootstraps(ph_train, strata = ph)
ph_folds
## # Bootstrap sampling using stratification 
## # A tibble: 25 x 2
##    splits             id         
##    <list>             <chr>      
##  1 <split [1.6K/597]> Bootstrap01
##  2 <split [1.6K/594]> Bootstrap02
##  3 <split [1.6K/571]> Bootstrap03
##  4 <split [1.6K/565]> Bootstrap04
##  5 <split [1.6K/597]> Bootstrap05
##  6 <split [1.6K/572]> Bootstrap06
##  7 <split [1.6K/596]> Bootstrap07
##  8 <split [1.6K/572]> Bootstrap08
##  9 <split [1.6K/591]> Bootstrap09
## 10 <split [1.6K/592]> Bootstrap10
## # … with 15 more rows

Build Models


In this time of Covid-19 our modeling team is operating remotely with each data scientist pursuing specific modeling approaches.

The first model is our best fit using a generalized linear model method, built by Calvin Cox. The next two models were built by Adam Douglas and are: Multivariate Adaptive Regression Splines (or MARS), and Neural Networks. The next three models, built by Jim Mundy, are: Random Forest, XGBoost and Cubist.

Generalized Linear Model


The GLM model specification, tuning parameters and modeling results are set forth below:

glmnet_recipe <- 
  recipe(formula = ph ~ ., data = ph_train) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors(), -all_nominal()) 

glmnet_spec <- 
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet") 

glmnet_workflow <- 
  workflow() %>% 
  add_recipe(glmnet_recipe) %>% 
  add_model(glmnet_spec) 

glmnet_grid <- tidyr::crossing(penalty = 10^seq(-6, -1, length.out = 20), mixture = c(0.05, 
    0.2, 0.4, 0.6, 0.8, 1)) 

glmnet_tune <- 
  tune_grid(glmnet_workflow, resamples = ph_folds, grid = 11) 


Top Performing Models with Tuning Results and RMSE

show_best(glmnet_tune, metric ="rmse")
## # A tibble: 5 x 8
##        penalty mixture .metric .estimator  mean     n  std_err .config          
##          <dbl>   <dbl> <chr>   <chr>      <dbl> <int>    <dbl> <chr>            
## 1      1.16e-4  0.252  rmse    standard   0.131    25 0.000785 Preprocessor1_Mo…
## 2      2.93e-6  0.0717 rmse    standard   0.131    25 0.000784 Preprocessor1_Mo…
## 3      1.27e-9  0.904  rmse    standard   0.131    25 0.000782 Preprocessor1_Mo…
## 4      1.36e-7  0.607  rmse    standard   0.131    25 0.000782 Preprocessor1_Mo…
## 5      1.81e-8  0.789  rmse    standard   0.131    25 0.000782 Preprocessor1_Mo…
autoplot(glmnet_tune)

final_glmnet <- glmnet_workflow %>% 
  finalize_workflow(select_best(glmnet_tune))
ph_fit_glmnet <- last_fit(final_glmnet, ph_split)

GLMnet Modeling Results


As suspected, the traditional linear modeling approach yielded a poor model, given the training data.

collect_metrics(ph_fit_glmnet) %>% gt()

.metric .estimator .estimate .config
rmse standard 0.1281839 Preprocessor1_Model1
rsq standard 0.4533451 Preprocessor1_Model1

collect_predictions(ph_fit_glmnet) %>% 
  ggplot(aes(ph, .pred)) +
  geom_abline(lty = 2, color="gray50") +
  geom_point(alpha = 0.5, color="steelblue") +
  coord_fixed() + theme_fivethirtyeight() + labs(title="linear regression ph vs predicted ph", subtitle = "An Analysis for ABC Beverage Co.")

Important Variables


Our next step is to use the VIP package to identify the important variables in the modeling process.


glmnet_spec <- 
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet") 

  workflow() %>% 
  add_recipe(glmnet_recipe) %>% 
  add_model(glmnet_spec) %>%
  fit(ph_train) %>% 
  pull_workflow_fit() %>% 
  vip(aesthetic = list(alpha = 0.8, fill = "steelblue"))

Prediction With Evaluation Data Set


Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.

Multivariate Adaptive Regression Splines (MARS) Model


The MARS model specification, tuning parameters and modeling results are set forth below:

# Recipe for MARS
mars_recipe <- 
  recipe(formula = ph ~ ., data = ph_train) %>%
  prep(retain=T)

mars_spec <- 
  mars(num_terms = tune(), prod_degree = 1,
       prune_method = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("earth")

mars_grid <- grid_regular(num_terms(range=c(1,34)),
                          prune_method(c("backward","forward","none")),
                          levels=20)

mars_workflow <- 
  workflow() %>% 
  add_recipe(mars_recipe) %>% 
  add_model(mars_spec)

set.seed(8577)
mars_tune <- tune_grid(mars_workflow,
                         resamples = ph_folds,
                         grid=mars_grid)


Top Performing Models with Tuning Results and RMSE

num_terms prune_method .metric .estimator mean n std_err .config
25 forward rmse standard 0.1288090 25 0.001133040 Preprocessor1_Model35
34 forward rmse standard 0.1288473 25 0.001123949 Preprocessor1_Model40
32 forward rmse standard 0.1288497 25 0.001123429 Preprocessor1_Model39
34 none rmse standard 0.1288659 25 0.001133304 Preprocessor1_Model60
30 forward rmse standard 0.1288672 25 0.001119719 Preprocessor1_Model38

MARS Modeling Results


Our fitted MARS Model Using the Test Data Set and a plot of observed vs predicted are set forth below:

.metric .estimator .estimate .config
rmse standard 0.1277871 Preprocessor1_Model1
rsq standard 0.4566604 Preprocessor1_Model1


Important Variables


Our next step is to use the VIP package to identify the important variables in the modeling process.


Prediction With Evaluation Data Set


Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.

Neural Network Model


The Neural Network model specification, tuning parameters and modeling results are set forth below:

# Recipe for NN
nn_recipe <- 
  recipe(formula = ph ~ ., data = ph_train) %>%
  prep(retain=T)

nn_spec <- 
  mlp(hidden_units = tune(), penalty = tune(),
       epochs = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("nnet")

nn_grid <- grid_regular(hidden_units(),
                        penalty(),
                        epochs(),
                        levels = 3)

nn_workflow <- 
  workflow() %>% 
  add_recipe(nn_recipe) %>% 
  add_model(nn_spec)

set.seed(8577)
nn_tune <- tune_grid(nn_workflow,
                         resamples = ph_folds,
                         grid=nn_grid)


Top Performing Models with Tuning Results and RMSE

hidden_units penalty epochs .metric .estimator mean n std_err .config
5 1 505 rmse standard 0.1280353 25 0.001332102 Preprocessor1_Model17
10 1 505 rmse standard 0.1286441 25 0.001590298 Preprocessor1_Model18
10 1 1000 rmse standard 0.1300482 25 0.001473111 Preprocessor1_Model27
5 1 1000 rmse standard 0.1306369 25 0.002118459 Preprocessor1_Model26
1 1 1000 rmse standard 0.1386982 25 0.001738515 Preprocessor1_Model25

Neural Network Modeling Results


Our fitted Neural Network Model Using the Test Data Set and a plot of observed vs predicted are set forth below:

.metric .estimator .estimate .config
rmse standard 0.1202430 Preprocessor1_Model1
rsq standard 0.5199355 Preprocessor1_Model1


Important Variables


Our next step is to use the VIP package to identify the important variables in the modeling process.


Prediction With Evaluation Data Set


Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.

Random Forest Model


The Random Forest model specification, tuning parameters and modeling results are set forth below:

ranger_recipe <- 
  recipe(formula = ph ~ ., data = ph_train) 

ranger_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_mode("regression") %>% 
  set_engine("ranger") 

ranger_workflow <- 
  workflow() %>% 
  add_recipe(ranger_recipe) %>% 
  add_model(ranger_spec) 

set.seed(8577)
ranger_tune <- tune_grid(ranger_workflow,
                         resamples = ph_folds,
                         grid=11)


Top Performing Models with Tuning Results and RMSE

mtry min_n .metric .estimator mean n std_err .config
13 4 rmse standard 0.1031171 25 0.0007668758 Preprocessor1_Model10
24 10 rmse standard 0.1043719 25 0.0007960392 Preprocessor1_Model05
18 18 rmse standard 0.1049345 25 0.0008165298 Preprocessor1_Model01
31 6 rmse standard 0.1051590 25 0.0008627660 Preprocessor1_Model06
22 24 rmse standard 0.1061753 25 0.0008183294 Preprocessor1_Model03

Random Forest Modeling Results


Our fitted Random Forest Model Using the Test Data Set and a plot of observed vs predicted are set forth below:

.metric .estimator .estimate .config
rmse standard 0.09432975 Preprocessor1_Model1
rsq standard 0.73432166 Preprocessor1_Model1


Important Variables


Our next step is to use the VIP package to identify the important variables in the modeling process.


Prediction With Evaluation Data Set


Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.

XGBOOST MODEL


The XGBOOST model specification, tuning parameters and modeling results are set forth below:

xgboost_recipe <- 
  recipe(formula = ph ~ ., data = ph_train) %>% 
  step_zv(all_predictors()) 

xgboost_spec <- 
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(), 
    loss_reduction = tune(), sample_size = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

xgboost_workflow <- 
  workflow() %>% 
  add_recipe(xgboost_recipe) %>% 
  add_model(xgboost_spec) 

set.seed(77648)
xgboost_tune <-
  tune_grid(xgboost_workflow, resamples = ph_folds, grid = 11)


Top Performing Models with Tuning Results and RMSE

show_best(xgboost_tune, metric ="rmse") %>% 
  gt()
trees min_n tree_depth learn_rate loss_reduction sample_size .metric .estimator mean n std_err .config
625 34 14 9.216943e-03 6.275024e-09 0.3514750 rmse standard 0.1189910 25 0.0008210379 Preprocessor1_Model10
1612 31 11 4.296299e-02 2.218003e+00 0.4552075 rmse standard 0.1491916 25 0.0008831361 Preprocessor1_Model09
859 14 9 5.447205e-05 2.081954e-10 0.1997397 rmse standard 7.6851221 25 0.0007059712 Preprocessor1_Model04
48 39 12 4.797775e-04 1.643376e-03 0.3425751 rmse standard 7.8689865 25 0.0007027639 Preprocessor1_Model11
1309 3 4 1.430741e-05 1.798794e+01 0.9061079 rmse standard 7.9027101 25 0.0007017228 Preprocessor1_Model01

XGBOOST Modeling Results


Results from our fitted XGBOOST Model Using the Test Data Set and a plot of observed vs. predicted are set forth below:

.metric .estimator .estimate .config
rmse standard 0.1182419 Preprocessor1_Model1
rsq standard 0.5817994 Preprocessor1_Model1


Important Variables


Our next step is to use the VIP package to identify the important variables in the modeling process.



Prediction With Evaluation Data Set


Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.

CUBIST MODEL


The CUBIST model specification, tuning parameters and modeling results are set forth below:

cubist_recipe <- 
  recipe(formula = ph ~ ., data = ph_train) %>% 
  step_zv(all_predictors()) 

cubist_spec <- 
  cubist_rules(committees = tune(), neighbors = tune()) %>% 
  set_engine("Cubist") 

cubist_workflow <- 
  workflow() %>% 
  add_recipe(cubist_recipe) %>% 
  add_model(cubist_spec) 

cubist_grid <- tidyr::crossing(committees = c(1:9, (1:5) * 10), neighbors = c(0, 
    3, 6, 9)) 

cubist_tune <- 
  tune_grid(cubist_workflow, resamples = ph_folds, grid = cubist_grid) 


Top Performing Models with Tuning Results and RMSE

committees neighbors .metric .estimator mean n std_err .config
50 6 rmse standard 0.09737828 25 0.0008200935 Preprocessor1_Model55
50 9 rmse standard 0.09738718 25 0.0008244416 Preprocessor1_Model56
40 6 rmse standard 0.09771429 25 0.0007820249 Preprocessor1_Model51
40 9 rmse standard 0.09772138 25 0.0007904497 Preprocessor1_Model52
50 3 rmse standard 0.09772571 25 0.0008850636 Preprocessor1_Model54


## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: cubist_rules()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## ● step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Cubist Model Specification (regression)
## 
## Main Arguments:
##   committees = 50
##   neighbors = 6
## 
## Computational engine: Cubist

CUBIST Modeling Results


Results from our fitted CUBIST Model Using the Test Data Set and a plot of observed vs. predicted are set forth below:

.metric .estimator .estimate .config
rmse standard 0.08530442 Preprocessor1_Model1
rsq standard 0.76176579 Preprocessor1_Model1
collect_predictions(ph_fit_cube) %>% 
  ggplot(aes(ph, .pred)) +
  geom_abline(lty = 2, color="gray50") +
  geom_point(alpha = 0.5, color="steelblue") +
  coord_fixed() + theme_fivethirtyeight() + labs(title="Cubist Model: ph vs predicted ph", subtitle = "An Analysis for ABC Beverage Co.")

Important Variables


Our next step is to use the VIP package to identify the important variables in the modeling process.



Prediction With Evaluation Data Set


Finally, we display a plot of predicted values from the evaluation set and write all predicted values to an Excel-readable format.

cube_ph_eval <- predict(ph_fit_cube$.workflow[[1]], test_df)
write.csv(cube_ph_eval, file = "cube_ph_eval.csv")

cube_ph_eval %>% 
  as_tibble() %>% 
  rownames_to_column(var="x") %>% 
  mutate(x = as.numeric(x)) %>% 
  rename(y = .pred) %>% 
  ggplot() +
  geom_line(aes(x=x, y=y), color="steelblue") +
  theme_fivethirtyeight() +
  labs(title="Best Model (Cubist) - Predicted pH Values", subtitle = "pH Ranges -Drinking Water: 6.5 to 8.5, Other Beverages 2.5 to 7.0")


Conclusion


We have performed a detail PH analysis for ABC Beverage Company. During the course of the analysis we utilized numerous predictive models to identify the model best suited to predict ABC Beverage’s PH levels.

Several models produce good predictions, however, we recommend use of the Cubist model to predict pH levels. We based this on the strong RMSE and R-Square scores produced by this model relative to other models as well as the models ability to identify important variables.

Our analysis indicates the following manufacturing processes are the most important to the company’s pH levels:

  • Mnf_Flow
  • Bailing_Lvl
  • Pressure_Vacuum
  • Bailing
  • Alch_Rel

We applied the cubist model to the evaluation data provided and predicted pH values that ranged between 8.2 and 8.7. It is our understanding that these levels present no health or safety concerns, however, pH vales at this level could result in a bitter taste. We have also provided a non-technical report for your review and consideration.

Please let us know if you have any questions. We appreciate the opportunity to work on this project.