Data

We start our analysis by importing the training and evaluation set. Manual one hot encoding is also applied to the Brand Code variable. This results in three new variables: brand_code_b, brand_code_c and brand_code_d. A fourth variable brand_code_ais implicit when the preceding three variable are equal to 0.

train <- read_excel("StudentData.xlsx")
train <- train %>% 
  mutate(`Brand Code B` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 1,
    `Brand Code` == "C" ~ 0,
    `Brand Code` == "D" ~ 0,
    TRUE ~ 0
  ),
  `Brand Code C` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 0,
    `Brand Code` == "C" ~ 1,
    `Brand Code` == "D" ~ 0,
    TRUE ~ 0
    ),
  `Brand Code D` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 0,
    `Brand Code` == "C" ~ 0,
    `Brand Code` == "D" ~ 1,
    TRUE ~ 0
  )) %>% 
  select(-`Brand Code`)

test <- read_excel("StudentEvaluation.xlsx")
test <- test %>% 
  mutate(`Brand Code B` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 1,
    `Brand Code` == "C" ~ 0,
    `Brand Code` == "D" ~ 0,
    TRUE ~ 0
  ),
  `Brand Code C` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 0,
    `Brand Code` == "C" ~ 1,
    `Brand Code` == "D" ~ 0,
    TRUE ~ 0
    ),
  `Brand Code D` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 0,
    `Brand Code` == "C" ~ 0,
    `Brand Code` == "D" ~ 1,
    TRUE ~ 0
  )) %>% 
  select(-`Brand Code`)

Remove NA Values

Next we drop observations with NA values on the training set. This step leaves with a sample set that is more than adequate and eliminate the uncertain surround selecting a correct imputation strategy. Additionally, modeling results improved after adoption of the drop NA approach. Initial iteration used KNN imputation.

train <- train %>% 
  drop_na()

skim(train)

Data summary
Name	train
Number of rows	2129
Number of columns	35
_______________________
Column type frequency:
numeric	35
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Carb Volume	1	5.37	0.10	5.07	5.29	5.35	5.45	5.70	▁▇▇▅▁
Fill Ounces	1	23.98	0.09	23.65	23.93	23.98	24.03	24.32	▁▂▇▂▁
PC Volume	1	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
Carb Pressure	1	68.21	3.44	58.20	65.60	68.20	70.60	78.80	▁▆▇▅▁
Carb Temp	1	141.12	3.97	128.60	138.40	140.80	143.80	154.00	▁▃▇▃▁
PSC	1	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
PSC Fill	1	0.19	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
PSC CO2	1	0.06	0.04	0.00	0.02	0.04	0.08	0.24	▇▅▂▁▁
Mnf Flow	1	29.58	122.19	-100.20	-100.00	105.60	143.60	229.40	▇▁▁▇▂
Carb Pressure1	1	122.14	4.45	106.40	118.60	123.00	125.00	138.00	▁▅▇▃▁
Fill Pressure	1	48.18	2.84	43.60	46.00	46.60	50.00	59.20	▇▂▅▁▁
Hyd Pressure1	1	13.24	12.27	-0.60	0.00	12.40	21.00	58.00	▇▆▂▁▁
Hyd Pressure2	1	22.92	16.09	0.00	0.00	30.80	35.20	59.40	▆▁▇▅▁
Hyd Pressure3	1	22.29	15.61	-1.20	0.00	30.00	33.80	50.00	▆▁▃▇▁
Hyd Pressure4	1	94.33	10.81	70.00	84.00	96.00	100.00	138.00	▅▇▇▁▁
Filler Level	1	108.49	15.58	55.80	92.20	117.40	120.00	146.60	▁▃▂▇▁
Filler Speed	1	3881.07	344.08	1004.00	3896.00	3986.00	3998.00	4028.00	▁▁▁▁▇
Temperature	1	65.77	1.03	63.60	65.20	65.60	66.20	73.20	▇▇▁▁▁
Usage cont	1	21.11	2.91	12.46	18.50	22.04	23.76	25.90	▁▃▃▃▇
Carb Flow	1	2637.61	899.99	44.00	2728.00	3042.00	3204.00	5104.00	▁▃▇▇▁
Density	1	1.17	0.37	0.44	0.92	0.98	1.62	1.92	▁▇▁▁▃
MFR	1	708.17	64.36	95.40	708.20	724.80	731.00	868.60	▁▁▁▅▇
Balling	1	2.20	0.92	0.35	1.50	1.65	3.34	3.98	▁▇▁▁▃
Pressure Vacuum	1	-5.22	0.57	-6.60	-5.60	-5.40	-5.00	-3.60	▂▇▆▂▁
PH	1	8.55	0.17	7.88	8.44	8.56	8.68	8.94	▁▂▆▇▃
Oxygen Filler	1	0.04	0.04	0.00	0.02	0.03	0.06	0.32	▇▂▁▁▁
Bowl Setpoint	1	109.02	15.28	70.00	90.00	120.00	120.00	140.00	▁▃▃▇▁
Pressure Setpoint	1	47.62	2.05	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
Air Pressurer	1	142.84	1.22	140.80	142.20	142.60	143.00	147.60	▂▇▁▁▁
Alch Rel	1	6.89	0.50	6.32	6.54	6.56	7.22	8.62	▇▁▁▂▁
Carb Rel	1	5.43	0.12	5.02	5.34	5.40	5.54	5.78	▁▃▇▅▁
Balling Lvl	1	2.04	0.86	1.18	1.38	1.46	3.14	3.66	▇▁▁▂▃
Brand Code B	1	0.49	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▇
Brand Code C	1	0.12	0.32	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
Brand Code D	1	0.24	0.43	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂

Preprocessing

For the training set we employ the corr and nzv preprocessing functions. The corr, nzv and medianImpute functions are applied to the evaluation data set. Note only predictor variables are being imputed in the evaluation data set. Finally, the janitor::clean_names function is applied to all variables of each data set

# train data
temp_df <- data.matrix(train) 
preprocessing <- preProcess(temp_df, method = c("corr", "nzv"))
student_train_preprocess <-  predict(preprocessing, temp_df) 
train_df <- as_tibble(student_train_preprocess)
temp_df2 <- data.matrix(test)


# test data
preprocessing <- preProcess(temp_df2, method = c("medianImpute","corr", "nzv"))
student_test_preprocess <-  predict(preprocessing, temp_df2) 
test_df <- as_tibble(student_test_preprocess)


train_df <- train_df %>% 
  select(PH, everything()) %>%
  clean_names()

test_df <- test_df %>% 
  select(everything()) %>%
  clean_names()

Preprocessed and Cleaned Data Set

We use the skim function to present our preprocessed and cleaned data set.

Data summary
Name	train_df
Number of rows	2129
Number of columns	35
_______________________
Column type frequency:
numeric	35
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ph	1	8.55	0.17	7.88	8.44	8.56	8.68	8.94	▁▂▆▇▃
carb_volume	1	5.37	0.10	5.07	5.29	5.35	5.45	5.70	▁▇▇▅▁
fill_ounces	1	23.98	0.09	23.65	23.93	23.98	24.03	24.32	▁▂▇▂▁
pc_volume	1	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
carb_pressure	1	68.21	3.44	58.20	65.60	68.20	70.60	78.80	▁▆▇▅▁
carb_temp	1	141.12	3.97	128.60	138.40	140.80	143.80	154.00	▁▃▇▃▁
psc	1	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
psc_fill	1	0.19	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
psc_co2	1	0.06	0.04	0.00	0.02	0.04	0.08	0.24	▇▅▂▁▁
mnf_flow	1	29.58	122.19	-100.20	-100.00	105.60	143.60	229.40	▇▁▁▇▂
carb_pressure1	1	122.14	4.45	106.40	118.60	123.00	125.00	138.00	▁▅▇▃▁
fill_pressure	1	48.18	2.84	43.60	46.00	46.60	50.00	59.20	▇▂▅▁▁
hyd_pressure1	1	13.24	12.27	-0.60	0.00	12.40	21.00	58.00	▇▆▂▁▁
hyd_pressure2	1	22.92	16.09	0.00	0.00	30.80	35.20	59.40	▆▁▇▅▁
hyd_pressure3	1	22.29	15.61	-1.20	0.00	30.00	33.80	50.00	▆▁▃▇▁
hyd_pressure4	1	94.33	10.81	70.00	84.00	96.00	100.00	138.00	▅▇▇▁▁
filler_level	1	108.49	15.58	55.80	92.20	117.40	120.00	146.60	▁▃▂▇▁
filler_speed	1	3881.07	344.08	1004.00	3896.00	3986.00	3998.00	4028.00	▁▁▁▁▇
temperature	1	65.77	1.03	63.60	65.20	65.60	66.20	73.20	▇▇▁▁▁
usage_cont	1	21.11	2.91	12.46	18.50	22.04	23.76	25.90	▁▃▃▃▇
carb_flow	1	2637.61	899.99	44.00	2728.00	3042.00	3204.00	5104.00	▁▃▇▇▁
density	1	1.17	0.37	0.44	0.92	0.98	1.62	1.92	▁▇▁▁▃
mfr	1	708.17	64.36	95.40	708.20	724.80	731.00	868.60	▁▁▁▅▇
balling	1	2.20	0.92	0.35	1.50	1.65	3.34	3.98	▁▇▁▁▃
pressure_vacuum	1	-5.22	0.57	-6.60	-5.60	-5.40	-5.00	-3.60	▂▇▆▂▁
oxygen_filler	1	0.04	0.04	0.00	0.02	0.03	0.06	0.32	▇▂▁▁▁
bowl_setpoint	1	109.02	15.28	70.00	90.00	120.00	120.00	140.00	▁▃▃▇▁
pressure_setpoint	1	47.62	2.05	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
air_pressurer	1	142.84	1.22	140.80	142.20	142.60	143.00	147.60	▂▇▁▁▁
alch_rel	1	6.89	0.50	6.32	6.54	6.56	7.22	8.62	▇▁▁▂▁
carb_rel	1	5.43	0.12	5.02	5.34	5.40	5.54	5.78	▁▃▇▅▁
balling_lvl	1	2.04	0.86	1.18	1.38	1.46	3.14	3.66	▇▁▁▂▃
brand_code_b	1	0.49	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▇
brand_code_c	1	0.12	0.32	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
brand_code_d	1	0.24	0.43	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂

Explore Data

Our data exploration leverages the skim report above as well as Figure 1 and Figure 2 below. Figure 1 provides insight into the distribution of our target variable pH. Figure looks at the relationship of the target variable with each of the predictor variables. It this plot that convinced the analysis to steer away from linear models.

Figure 1 . Histogram - ph Level

Figure 2. Scatter Plots pH vs Predictor Variables

Training / Test Split

Our next step is to take the training data and perform a training / test split. This enables us to train on some of the data and then use the test split to ensure we have not over fit the data. Please note, the test data is not the same as the evaluation data set. It is a subset of the training data set. Additionally, tuning/training folds are established for our model tuning.

set.seed(123)
ph_split <- initial_split(train_df, strata= ph, na.rm =TRUE)
ph_train <- training(ph_split)
ph_test  <- testing(ph_split)

set.seed(234)
ph_folds <- bootstraps(ph_train, strata = ph)
ph_folds

## # Bootstrap sampling using stratification 
## # A tibble: 25 x 2
##    splits             id         
##    <list>             <chr>      
##  1 <split [1.6K/597]> Bootstrap01
##  2 <split [1.6K/594]> Bootstrap02
##  3 <split [1.6K/571]> Bootstrap03
##  4 <split [1.6K/565]> Bootstrap04
##  5 <split [1.6K/597]> Bootstrap05
##  6 <split [1.6K/572]> Bootstrap06
##  7 <split [1.6K/596]> Bootstrap07
##  8 <split [1.6K/572]> Bootstrap08
##  9 <split [1.6K/591]> Bootstrap09
## 10 <split [1.6K/592]> Bootstrap10
## # … with 15 more rows

Build Models

In this time of Covid-19 our modeling team is operating remotely with each data scientist pursuing specific modeling approaches.

The first model is our best fit using a generalized linear model method, built by Calvin Cox. The next two models were built by Adam Douglas and are: Multivariate Adaptive Regression Splines (or MARS), and Neural Networks. The next three models, built by Jim Mundy, are: Random Forest, XGBoost and Cubist.

Generalized Linear Model

The GLM model specification, tuning parameters and modeling results are set forth below:

glmnet_recipe <- 
  recipe(formula = ph ~ ., data = ph_train) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors(), -all_nominal()) 

glmnet_spec <- 
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet") 

glmnet_workflow <- 
  workflow() %>% 
  add_recipe(glmnet_recipe) %>% 
  add_model(glmnet_spec) 

glmnet_grid <- tidyr::crossing(penalty = 10^seq(-6, -1, length.out = 20), mixture = c(0.05, 
    0.2, 0.4, 0.6, 0.8, 1)) 

glmnet_tune <- 
  tune_grid(glmnet_workflow, resamples = ph_folds, grid = 11)

Top Performing Models with Tuning Results and RMSE

show_best(glmnet_tune, metric ="rmse")

## # A tibble: 5 x 8
##        penalty mixture .metric .estimator  mean     n  std_err .config          
##          <dbl>   <dbl> <chr>   <chr>      <dbl> <int>    <dbl> <chr>            
## 1      1.16e-4  0.252  rmse    standard   0.131    25 0.000785 Preprocessor1_Mo…
## 2      2.93e-6  0.0717 rmse    standard   0.131    25 0.000784 Preprocessor1_Mo…
## 3      1.27e-9  0.904  rmse    standard   0.131    25 0.000782 Preprocessor1_Mo…
## 4      1.36e-7  0.607  rmse    standard   0.131    25 0.000782 Preprocessor1_Mo…
## 5      1.81e-8  0.789  rmse    standard   0.131    25 0.000782 Preprocessor1_Mo…

autoplot(glmnet_tune)

final_glmnet <- glmnet_workflow %>% 
  finalize_workflow(select_best(glmnet_tune))

ph_fit_glmnet <- last_fit(final_glmnet, ph_split)

GLMnet Modeling Results

As suspected, the traditional linear modeling approach yielded a poor model, given the training data.

collect_metrics(ph_fit_glmnet) %>% gt()

.metric	.estimator	.estimate	.config
rmse	standard	0.1281839	Preprocessor1_Model1
rsq	standard	0.4533451	Preprocessor1_Model1

collect_predictions(ph_fit_glmnet) %>% 
  ggplot(aes(ph, .pred)) +
  geom_abline(lty = 2, color="gray50") +
  geom_point(alpha = 0.5, color="steelblue") +
  coord_fixed() + theme_fivethirtyeight() + labs(title="linear regression ph vs predicted ph", subtitle = "An Analysis for ABC Beverage Co.")

Important Variables

Our next step is to use the VIP package to identify the important variables in the modeling process.

glmnet_spec <- 
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet") 

  workflow() %>% 
  add_recipe(glmnet_recipe) %>% 
  add_model(glmnet_spec) %>%
  fit(ph_train) %>% 
  pull_workflow_fit() %>% 
  vip(aesthetic = list(alpha = 0.8, fill = "steelblue"))

Prediction With Evaluation Data Set

Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.

Multivariate Adaptive Regression Splines (MARS) Model

The MARS model specification, tuning parameters and modeling results are set forth below:

# Recipe for MARS
mars_recipe <- 
  recipe(formula = ph ~ ., data = ph_train) %>%
  prep(retain=T)

mars_spec <- 
  mars(num_terms = tune(), prod_degree = 1,
       prune_method = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("earth")

mars_grid <- grid_regular(num_terms(range=c(1,34)),
                          prune_method(c("backward","forward","none")),
                          levels=20)

mars_workflow <- 
  workflow() %>% 
  add_recipe(mars_recipe) %>% 
  add_model(mars_spec)

set.seed(8577)
mars_tune <- tune_grid(mars_workflow,
                         resamples = ph_folds,
                         grid=mars_grid)

Top Performing Models with Tuning Results and RMSE

num_terms	prune_method	.metric	.estimator	mean	n	std_err	.config
25	forward	rmse	standard	0.1288090	25	0.001133040	Preprocessor1_Model35
34	forward	rmse	standard	0.1288473	25	0.001123949	Preprocessor1_Model40
32	forward	rmse	standard	0.1288497	25	0.001123429	Preprocessor1_Model39
34	none	rmse	standard	0.1288659	25	0.001133304	Preprocessor1_Model60
30	forward	rmse	standard	0.1288672	25	0.001119719	Preprocessor1_Model38

MARS Modeling Results

Our fitted MARS Model Using the Test Data Set and a plot of observed vs predicted are set forth below:

.metric	.estimator	.estimate	.config
rmse	standard	0.1277871	Preprocessor1_Model1
rsq	standard	0.4566604	Preprocessor1_Model1

Important Variables

Our next step is to use the VIP package to identify the important variables in the modeling process.

Prediction With Evaluation Data Set

Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.

Neural Network Model

The Neural Network model specification, tuning parameters and modeling results are set forth below:

# Recipe for NN
nn_recipe <- 
  recipe(formula = ph ~ ., data = ph_train) %>%
  prep(retain=T)

nn_spec <- 
  mlp(hidden_units = tune(), penalty = tune(),
       epochs = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("nnet")

nn_grid <- grid_regular(hidden_units(),
                        penalty(),
                        epochs(),
                        levels = 3)

nn_workflow <- 
  workflow() %>% 
  add_recipe(nn_recipe) %>% 
  add_model(nn_spec)

set.seed(8577)
nn_tune <- tune_grid(nn_workflow,
                         resamples = ph_folds,
                         grid=nn_grid)

Top Performing Models with Tuning Results and RMSE

hidden_units	penalty	epochs	.metric	.estimator	mean	n	std_err	.config
5	1	505	rmse	standard	0.1280353	25	0.001332102	Preprocessor1_Model17
10	1	505	rmse	standard	0.1286441	25	0.001590298	Preprocessor1_Model18
10	1	1000	rmse	standard	0.1300482	25	0.001473111	Preprocessor1_Model27
5	1	1000	rmse	standard	0.1306369	25	0.002118459	Preprocessor1_Model26
1	1	1000	rmse	standard	0.1386982	25	0.001738515	Preprocessor1_Model25

Neural Network Modeling Results

Our fitted Neural Network Model Using the Test Data Set and a plot of observed vs predicted are set forth below:

.metric	.estimator	.estimate	.config
rmse	standard	0.1202430	Preprocessor1_Model1
rsq	standard	0.5199355	Preprocessor1_Model1

Important Variables

Our next step is to use the VIP package to identify the important variables in the modeling process.

Prediction With Evaluation Data Set

Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.

Random Forest Model

The Random Forest model specification, tuning parameters and modeling results are set forth below:

ranger_recipe <- 
  recipe(formula = ph ~ ., data = ph_train) 

ranger_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_mode("regression") %>% 
  set_engine("ranger") 

ranger_workflow <- 
  workflow() %>% 
  add_recipe(ranger_recipe) %>% 
  add_model(ranger_spec) 

set.seed(8577)
ranger_tune <- tune_grid(ranger_workflow,
                         resamples = ph_folds,
                         grid=11)

Top Performing Models with Tuning Results and RMSE

mtry	min_n	.metric	.estimator	mean	n	std_err	.config
13	4	rmse	standard	0.1031171	25	0.0007668758	Preprocessor1_Model10
24	10	rmse	standard	0.1043719	25	0.0007960392	Preprocessor1_Model05
18	18	rmse	standard	0.1049345	25	0.0008165298	Preprocessor1_Model01
31	6	rmse	standard	0.1051590	25	0.0008627660	Preprocessor1_Model06
22	24	rmse	standard	0.1061753	25	0.0008183294	Preprocessor1_Model03

Random Forest Modeling Results

Our fitted Random Forest Model Using the Test Data Set and a plot of observed vs predicted are set forth below:

.metric	.estimator	.estimate	.config
rmse	standard	0.09432975	Preprocessor1_Model1
rsq	standard	0.73432166	Preprocessor1_Model1

Important Variables

Our next step is to use the VIP package to identify the important variables in the modeling process.

Prediction With Evaluation Data Set

Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.

XGBOOST MODEL

The XGBOOST model specification, tuning parameters and modeling results are set forth below:

xgboost_recipe <- 
  recipe(formula = ph ~ ., data = ph_train) %>% 
  step_zv(all_predictors()) 

xgboost_spec <- 
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(), 
    loss_reduction = tune(), sample_size = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

xgboost_workflow <- 
  workflow() %>% 
  add_recipe(xgboost_recipe) %>% 
  add_model(xgboost_spec) 

set.seed(77648)
xgboost_tune <-
  tune_grid(xgboost_workflow, resamples = ph_folds, grid = 11)

Top Performing Models with Tuning Results and RMSE

show_best(xgboost_tune, metric ="rmse") %>% 
  gt()

trees	min_n	tree_depth	learn_rate	loss_reduction	sample_size	.metric	.estimator	mean	n	std_err	.config
625	34	14	9.216943e-03	6.275024e-09	0.3514750	rmse	standard	0.1189910	25	0.0008210379	Preprocessor1_Model10
1612	31	11	4.296299e-02	2.218003e+00	0.4552075	rmse	standard	0.1491916	25	0.0008831361	Preprocessor1_Model09
859	14	9	5.447205e-05	2.081954e-10	0.1997397	rmse	standard	7.6851221	25	0.0007059712	Preprocessor1_Model04
48	39	12	4.797775e-04	1.643376e-03	0.3425751	rmse	standard	7.8689865	25	0.0007027639	Preprocessor1_Model11
1309	3	4	1.430741e-05	1.798794e+01	0.9061079	rmse	standard	7.9027101	25	0.0007017228	Preprocessor1_Model01

XGBOOST Modeling Results

Results from our fitted XGBOOST Model Using the Test Data Set and a plot of observed vs. predicted are set forth below:

.metric	.estimator	.estimate	.config
rmse	standard	0.1182419	Preprocessor1_Model1
rsq	standard	0.5817994	Preprocessor1_Model1

Important Variables

Our next step is to use the VIP package to identify the important variables in the modeling process.

Prediction With Evaluation Data Set

Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.

CUBIST MODEL

The CUBIST model specification, tuning parameters and modeling results are set forth below:

cubist_recipe <- 
  recipe(formula = ph ~ ., data = ph_train) %>% 
  step_zv(all_predictors()) 

cubist_spec <- 
  cubist_rules(committees = tune(), neighbors = tune()) %>% 
  set_engine("Cubist") 

cubist_workflow <- 
  workflow() %>% 
  add_recipe(cubist_recipe) %>% 
  add_model(cubist_spec) 

cubist_grid <- tidyr::crossing(committees = c(1:9, (1:5) * 10), neighbors = c(0, 
    3, 6, 9)) 

cubist_tune <- 
  tune_grid(cubist_workflow, resamples = ph_folds, grid = cubist_grid)

Top Performing Models with Tuning Results and RMSE

committees	neighbors	.metric	.estimator	mean	n	std_err	.config
50	6	rmse	standard	0.09737828	25	0.0008200935	Preprocessor1_Model55
50	9	rmse	standard	0.09738718	25	0.0008244416	Preprocessor1_Model56
40	6	rmse	standard	0.09771429	25	0.0007820249	Preprocessor1_Model51
40	9	rmse	standard	0.09772138	25	0.0007904497	Preprocessor1_Model52
50	3	rmse	standard	0.09772571	25	0.0008850636	Preprocessor1_Model54

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: cubist_rules()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## ● step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Cubist Model Specification (regression)
## 
## Main Arguments:
##   committees = 50
##   neighbors = 6
## 
## Computational engine: Cubist

CUBIST Modeling Results

Results from our fitted CUBIST Model Using the Test Data Set and a plot of observed vs. predicted are set forth below:

.metric	.estimator	.estimate	.config
rmse	standard	0.08530442	Preprocessor1_Model1
rsq	standard	0.76176579	Preprocessor1_Model1

collect_predictions(ph_fit_cube) %>% 
  ggplot(aes(ph, .pred)) +
  geom_abline(lty = 2, color="gray50") +
  geom_point(alpha = 0.5, color="steelblue") +
  coord_fixed() + theme_fivethirtyeight() + labs(title="Cubist Model: ph vs predicted ph", subtitle = "An Analysis for ABC Beverage Co.")

Important Variables

Our next step is to use the VIP package to identify the important variables in the modeling process.

Prediction With Evaluation Data Set

Finally, we display a plot of predicted values from the evaluation set and write all predicted values to an Excel-readable format.

cube_ph_eval <- predict(ph_fit_cube$.workflow[[1]], test_df)
write.csv(cube_ph_eval, file = "cube_ph_eval.csv")

cube_ph_eval %>% 
  as_tibble() %>% 
  rownames_to_column(var="x") %>% 
  mutate(x = as.numeric(x)) %>% 
  rename(y = .pred) %>% 
  ggplot() +
  geom_line(aes(x=x, y=y), color="steelblue") +
  theme_fivethirtyeight() +
  labs(title="Best Model (Cubist) - Predicted pH Values", subtitle = "pH Ranges -Drinking Water: 6.5 to 8.5, Other Beverages 2.5 to 7.0")

Conclusion

We have performed a detail PH analysis for ABC Beverage Company. During the course of the analysis we utilized numerous predictive models to identify the model best suited to predict ABC Beverage’s PH levels.

Several models produce good predictions, however, we recommend use of the Cubist model to predict pH levels. We based this on the strong RMSE and R-Square scores produced by this model relative to other models as well as the models ability to identify important variables.

Our analysis indicates the following manufacturing processes are the most important to the company’s pH levels:

Mnf_Flow
Bailing_Lvl
Pressure_Vacuum
Bailing
Alch_Rel

We applied the cubist model to the evaluation data provided and predicted pH values that ranged between 8.2 and 8.7. It is our understanding that these levels present no health or safety concerns, however, pH vales at this level could result in a bitter taste. We have also provided a non-technical report for your review and consideration.

Please let us know if you have any questions. We appreciate the opportunity to work on this project.

Analysis for ABC Beverage Company - PH Analysis

Data

Remove NA Values

Preprocessing

Preprocessed and Cleaned Data Set

Explore Data

Figure 1 . Histogram - ph Level

Figure 2. Scatter Plots pH vs Predictor Variables

Training / Test Split

Build Models

Generalized Linear Model

Top Performing Models with Tuning Results and RMSE

GLMnet Modeling Results

Important Variables

Prediction With Evaluation Data Set

Multivariate Adaptive Regression Splines (MARS) Model

Top Performing Models with Tuning Results and RMSE

MARS Modeling Results

Important Variables

Prediction With Evaluation Data Set

Neural Network Model

Top Performing Models with Tuning Results and RMSE

Neural Network Modeling Results

Important Variables

Prediction With Evaluation Data Set

Random Forest Model

Top Performing Models with Tuning Results and RMSE

Random Forest Modeling Results

Important Variables

Prediction With Evaluation Data Set

XGBOOST MODEL

Top Performing Models with Tuning Results and RMSE

XGBOOST Modeling Results

Important Variables

Prediction With Evaluation Data Set

CUBIST MODEL

Top Performing Models with Tuning Results and RMSE

CUBIST Modeling Results

Important Variables

Prediction With Evaluation Data Set

Conclusion