Analysis for ABC Beverage Company

DATA

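We start our analysis by importing the training and evaluation sets. Manual one-hot encoding is then applied to the Brand Code variable. This results in three new variables: Brand Code B, Brand Code C and Brand Code D, which become brand_code_b, brand_code_c and brand_code_d once the column names are cleaned. A fourth variable, brand_code_a, is implicit when the preceding three variables are all equal to 0. Note that observations with a missing Brand Code also receive 0 in all three variables and are therefore treated as brand A.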

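library(readxl)      # read_excel
library(tidyverse)   # %>%, mutate, case_when, select, drop_na, as_tibble, ggplot
library(skimr)       # skim
library(caret)       # preProcess
library(janitor)     # clean_names
library(tidymodels)  # rsample, recipes, parsnip, workflows, tune
library(rules)       # cubist_rules
library(ggthemes)    # theme_fivethirtyeight
library(vip)         # variable importance

# Read the training data and one-hot encode Brand Code (brand A is the reference level)
train <- read_excel("StudentData.xlsx")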
train <- train %>% 
  mutate(`Brand Code B` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 1,
    `Brand Code` == "C" ~ 0,
    `Brand Code` == "D" ~ 0,
    TRUE ~ 0
  ),
  `Brand Code C` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 0,
    `Brand Code` == "C" ~ 1,
    `Brand Code` == "D" ~ 0,
    TRUE ~ 0
    ),
  `Brand Code D` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 0,
    `Brand Code` == "C" ~ 0,
    `Brand Code` == "D" ~ 1,
    TRUE ~ 0
  )) %>% 
  select(-`Brand Code`)

test <- read_excel("StudentEvaluation.xlsx")
test <- test %>% 
  mutate(`Brand Code B` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 1,
    `Brand Code` == "C" ~ 0,
    `Brand Code` == "D" ~ 0,
    TRUE ~ 0
  ),
  `Brand Code C` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 0,
    `Brand Code` == "C" ~ 1,
    `Brand Code` == "D" ~ 0,
    TRUE ~ 0
    ),
  `Brand Code D` = case_when(
    `Brand Code` == "A" ~ 0,
    `Brand Code` == "B" ~ 0,
    `Brand Code` == "C" ~ 0,
    `Brand Code` == "D" ~ 1,
    TRUE ~ 0
  )) %>% 
  select(-`Brand Code`)
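
As a cross-check on the manual encoding above, the same reference-level scheme can be sketched with base R's model.matrix. This is only an illustration on a toy column; the data frame `toy` and its values are assumptions, not part of the project data.

```r
# Toy stand-in for the Brand Code column; the values are illustrative only.
toy <- data.frame(
  brand_code = factor(c("A", "B", "C", "D", "B"), levels = c("A", "B", "C", "D"))
)

# Treatment contrasts make the first level ("A") the reference, so it gets no column.
dummies <- model.matrix(~ brand_code, data = toy)[, -1, drop = FALSE]
colnames(dummies) <- c("brand_code_b", "brand_code_c", "brand_code_d")

# Brand A rows are all zeros; every other row has exactly one 1.
print(dummies)
```

One difference worth keeping in mind: under the default na.action, model.matrix drops rows whose factor value is missing, whereas the case_when default used in this report keeps them and codes them as all zeros.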

Remove NA Values

Next we drop observations with NA values from the training set. This step leaves us with a sample (2,129 rows) that is more than adequate and eliminates the uncertainty surrounding the selection of a correct imputation strategy. Additionally, modeling results improved after we adopted the drop-NA approach; the initial iteration used KNN imputation.

train <- train %>% 
  drop_na()

skim(train)

Data summary
Name	train
Number of rows	2129
Number of columns	35
_______________________
Column type frequency:
numeric	35
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Carb Volume	1	5.37	0.10	5.07	5.29	5.35	5.45	5.70	▁▇▇▅▁
Fill Ounces	1	23.98	0.09	23.65	23.93	23.98	24.03	24.32	▁▂▇▂▁
PC Volume	1	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
Carb Pressure	1	68.21	3.44	58.20	65.60	68.20	70.60	78.80	▁▆▇▅▁
Carb Temp	1	141.12	3.97	128.60	138.40	140.80	143.80	154.00	▁▃▇▃▁
PSC	1	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
PSC Fill	1	0.19	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
PSC CO2	1	0.06	0.04	0.00	0.02	0.04	0.08	0.24	▇▅▂▁▁
Mnf Flow	1	29.58	122.19	-100.20	-100.00	105.60	143.60	229.40	▇▁▁▇▂
Carb Pressure1	1	122.14	4.45	106.40	118.60	123.00	125.00	138.00	▁▅▇▃▁
Fill Pressure	1	48.18	2.84	43.60	46.00	46.60	50.00	59.20	▇▂▅▁▁
Hyd Pressure1	1	13.24	12.27	-0.60	0.00	12.40	21.00	58.00	▇▆▂▁▁
Hyd Pressure2	1	22.92	16.09	0.00	0.00	30.80	35.20	59.40	▆▁▇▅▁
Hyd Pressure3	1	22.29	15.61	-1.20	0.00	30.00	33.80	50.00	▆▁▃▇▁
Hyd Pressure4	1	94.33	10.81	70.00	84.00	96.00	100.00	138.00	▅▇▇▁▁
Filler Level	1	108.49	15.58	55.80	92.20	117.40	120.00	146.60	▁▃▂▇▁
Filler Speed	1	3881.07	344.08	1004.00	3896.00	3986.00	3998.00	4028.00	▁▁▁▁▇
Temperature	1	65.77	1.03	63.60	65.20	65.60	66.20	73.20	▇▇▁▁▁
Usage cont	1	21.11	2.91	12.46	18.50	22.04	23.76	25.90	▁▃▃▃▇
Carb Flow	1	2637.61	899.99	44.00	2728.00	3042.00	3204.00	5104.00	▁▃▇▇▁
Density	1	1.17	0.37	0.44	0.92	0.98	1.62	1.92	▁▇▁▁▃
MFR	1	708.17	64.36	95.40	708.20	724.80	731.00	868.60	▁▁▁▅▇
Balling	1	2.20	0.92	0.35	1.50	1.65	3.34	3.98	▁▇▁▁▃
Pressure Vacuum	1	-5.22	0.57	-6.60	-5.60	-5.40	-5.00	-3.60	▂▇▆▂▁
PH	1	8.55	0.17	7.88	8.44	8.56	8.68	8.94	▁▂▆▇▃
Oxygen Filler	1	0.04	0.04	0.00	0.02	0.03	0.06	0.32	▇▂▁▁▁
Bowl Setpoint	1	109.02	15.28	70.00	90.00	120.00	120.00	140.00	▁▃▃▇▁
Pressure Setpoint	1	47.62	2.05	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
Air Pressurer	1	142.84	1.22	140.80	142.20	142.60	143.00	147.60	▂▇▁▁▁
Alch Rel	1	6.89	0.50	6.32	6.54	6.56	7.22	8.62	▇▁▁▂▁
Carb Rel	1	5.43	0.12	5.02	5.34	5.40	5.54	5.78	▁▃▇▅▁
Balling Lvl	1	2.04	0.86	1.18	1.38	1.46	3.14	3.66	▇▁▁▂▃
Brand Code B	1	0.49	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▇
Brand Code C	1	0.12	0.32	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
Brand Code D	1	0.24	0.43	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
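
Before committing to the drop-NA approach it can help to see how many rows are lost and which variables are responsible. The following is a minimal base R sketch on a made-up data frame; `toy` and its values are assumptions used only for illustration.

```r
# Toy data frame with a few missing values (illustrative only).
toy <- data.frame(
  ph        = c(8.4, 8.6, NA, 8.5, 8.7),
  carb_flow = c(3000, NA, 3100, 2900, 3050),
  density   = c(0.9, 1.0, 1.6, 0.9, 1.7)
)

n_before      <- nrow(toy)
n_after       <- sum(complete.cases(toy))  # rows that drop_na() would keep
na_per_column <- colSums(is.na(toy))       # which variables drive the loss

cat("Rows kept:", n_after, "of", n_before, "\n")
print(na_per_column)
```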

Perform Preprocessing

For the training set we apply the corr and nzv preprocessing methods of caret's preProcess function, which filter out highly correlated and near-zero-variance variables. For the evaluation data set we apply the corr, nzv and medianImpute methods. Note that only predictor variables are imputed in the evaluation data set. Finally, the janitor::clean_names function is applied to the column names of each data set.

# train data: filter highly correlated and near-zero-variance columns
temp_df <- data.matrix(train)
preprocessing <- preProcess(temp_df, method = c("corr", "nzv"))
student_train_preprocess <- predict(preprocessing, temp_df)
train_df <- as_tibble(student_train_preprocess)

# test (evaluation) data: the filters are re-estimated on this set, with median imputation added
temp_df2 <- data.matrix(test)
preprocessing <- preProcess(temp_df2, method = c("medianImpute", "corr", "nzv"))
student_test_preprocess <- predict(preprocessing, temp_df2)
test_df <- as_tibble(student_test_preprocess)



train_df <- train_df %>% 
  select(PH, everything()) %>%
  clean_names()

test_df <- test_df %>% 
  select(everything()) %>%
  clean_names()
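
To make the idea behind the correlation filter concrete, here is a minimal base R sketch that flags pairs of variables whose absolute correlation exceeds a threshold. It is not caret's exact algorithm; the simulated data, the variable names and the 0.9 cutoff are all assumptions chosen for illustration.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),  # nearly a copy of x1
  x3 = rnorm(n)                  # unrelated to the others
)

cutoff <- 0.9  # assumed threshold for this sketch
cors <- abs(cor(toy))
cors[upper.tri(cors, diag = TRUE)] <- 0  # keep each pair only once
flagged <- which(cors > cutoff, arr.ind = TRUE)

# One row per highly correlated pair
high_pairs <- data.frame(
  var_a   = rownames(cors)[flagged[, "row"]],
  var_b   = colnames(cors)[flagged[, "col"]],
  abs_cor = cors[flagged]
)
print(high_pairs)
```

Running the same kind of check on the real predictors, and comparing the column names before and after preprocessing, is a quick way to confirm which variables the filters actually removed.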

Preprocessed and Cleaned Data Set

We use the skim function to present our preprocessed and cleaned data set.

Data summary
Name	train_df
Number of rows	2129
Number of columns	35
_______________________
Column type frequency:
numeric	35
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ph	1	8.55	0.17	7.88	8.44	8.56	8.68	8.94	▁▂▆▇▃
carb_volume	1	5.37	0.10	5.07	5.29	5.35	5.45	5.70	▁▇▇▅▁
fill_ounces	1	23.98	0.09	23.65	23.93	23.98	24.03	24.32	▁▂▇▂▁
pc_volume	1	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
carb_pressure	1	68.21	3.44	58.20	65.60	68.20	70.60	78.80	▁▆▇▅▁
carb_temp	1	141.12	3.97	128.60	138.40	140.80	143.80	154.00	▁▃▇▃▁
psc	1	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
psc_fill	1	0.19	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
psc_co2	1	0.06	0.04	0.00	0.02	0.04	0.08	0.24	▇▅▂▁▁
mnf_flow	1	29.58	122.19	-100.20	-100.00	105.60	143.60	229.40	▇▁▁▇▂
carb_pressure1	1	122.14	4.45	106.40	118.60	123.00	125.00	138.00	▁▅▇▃▁
fill_pressure	1	48.18	2.84	43.60	46.00	46.60	50.00	59.20	▇▂▅▁▁
hyd_pressure1	1	13.24	12.27	-0.60	0.00	12.40	21.00	58.00	▇▆▂▁▁
hyd_pressure2	1	22.92	16.09	0.00	0.00	30.80	35.20	59.40	▆▁▇▅▁
hyd_pressure3	1	22.29	15.61	-1.20	0.00	30.00	33.80	50.00	▆▁▃▇▁
hyd_pressure4	1	94.33	10.81	70.00	84.00	96.00	100.00	138.00	▅▇▇▁▁
filler_level	1	108.49	15.58	55.80	92.20	117.40	120.00	146.60	▁▃▂▇▁
filler_speed	1	3881.07	344.08	1004.00	3896.00	3986.00	3998.00	4028.00	▁▁▁▁▇
temperature	1	65.77	1.03	63.60	65.20	65.60	66.20	73.20	▇▇▁▁▁
usage_cont	1	21.11	2.91	12.46	18.50	22.04	23.76	25.90	▁▃▃▃▇
carb_flow	1	2637.61	899.99	44.00	2728.00	3042.00	3204.00	5104.00	▁▃▇▇▁
density	1	1.17	0.37	0.44	0.92	0.98	1.62	1.92	▁▇▁▁▃
mfr	1	708.17	64.36	95.40	708.20	724.80	731.00	868.60	▁▁▁▅▇
balling	1	2.20	0.92	0.35	1.50	1.65	3.34	3.98	▁▇▁▁▃
pressure_vacuum	1	-5.22	0.57	-6.60	-5.60	-5.40	-5.00	-3.60	▂▇▆▂▁
oxygen_filler	1	0.04	0.04	0.00	0.02	0.03	0.06	0.32	▇▂▁▁▁
bowl_setpoint	1	109.02	15.28	70.00	90.00	120.00	120.00	140.00	▁▃▃▇▁
pressure_setpoint	1	47.62	2.05	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
air_pressurer	1	142.84	1.22	140.80	142.20	142.60	143.00	147.60	▂▇▁▁▁
alch_rel	1	6.89	0.50	6.32	6.54	6.56	7.22	8.62	▇▁▁▂▁
carb_rel	1	5.43	0.12	5.02	5.34	5.40	5.54	5.78	▁▃▇▅▁
balling_lvl	1	2.04	0.86	1.18	1.38	1.46	3.14	3.66	▇▁▁▂▃
brand_code_b	1	0.49	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▇
brand_code_c	1	0.12	0.32	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
brand_code_d	1	0.24	0.43	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂

Explore Data

Our data exploration leverages the skim report above as well as Figure 1 and Figure 2 below. Figure 1 provides insight into the distribution of our target variable, pH. Figure 2 looks at the relationship of the target variable with each of the predictor variables. It was this plot that convinced us to steer away from linear models.

Figure 1. Histogram - pH Level

Figure 2. Scatter Plots pH vs Predictor Variables

Training / Test Split

Our next step is to take the training data and perform a training / test split. This enables us to train on some of the data and then use the test split to ensure we have not overfit the data. Please note, the test data is not the same as the evaluation data set. It is a subset of the training data set. Additionally, stratified bootstrap resamples of the training split are created for model tuning.

set.seed(123)
# Stratified split on ph using the default training proportion
ph_split <- initial_split(train_df, strata = ph)
ph_train <- training(ph_split)
ph_test  <- testing(ph_split)

set.seed(234)
ph_folds <- bootstraps(ph_train, strata = ph)
ph_folds

## # Bootstrap sampling using stratification 
## # A tibble: 25 x 2
##    splits             id         
##    <list>             <chr>      
##  1 <split [1.6K/597]> Bootstrap01
##  2 <split [1.6K/594]> Bootstrap02
##  3 <split [1.6K/571]> Bootstrap03
##  4 <split [1.6K/565]> Bootstrap04
##  5 <split [1.6K/597]> Bootstrap05
##  6 <split [1.6K/572]> Bootstrap06
##  7 <split [1.6K/596]> Bootstrap07
##  8 <split [1.6K/572]> Bootstrap08
##  9 <split [1.6K/591]> Bootstrap09
## 10 <split [1.6K/592]> Bootstrap10
## # ... with 15 more rows
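
The split sizes printed above (about 1.6K rows for fitting and roughly 570 to 600 held out) follow from how bootstrap sampling works: drawing n rows with replacement leaves about 37% of the rows undrawn. A minimal base R sketch, assuming a training split of about 1,596 rows (roughly 75% of 2,129) and ignoring stratification:

```r
set.seed(234)
n <- 1596  # assumed approximate size of ph_train

idx   <- sample.int(n, size = n, replace = TRUE)  # one bootstrap sample
n_oob <- n - length(unique(idx))                  # rows never drawn (held out)

cat("Held-out rows:", n_oob, "\n")
cat("Held-out share:", round(n_oob / n, 3), "\n")
cat("Theoretical share:", round((1 - 1 / n)^n, 3), "\n")
```

The theoretical share (1 - 1/n)^n approaches exp(-1), about 0.368, which is in line with the held-out counts shown in the resample listing.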

Build Models

In this time of Covid-19 our modeling team is operating remotely. This portion of the model building addresses three models: Random Forest, XGBoost and Cubist. Other data scientists on the team will be pursuing alternative modeling strategies.

Random Forest Model

The Random Forest model specification, tuning parameters and modeling results are set forth below:

ranger_recipe <- 
  recipe(formula = ph ~ ., data = ph_train) 

ranger_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_mode("regression") %>% 
  set_engine("ranger") 

ranger_workflow <- 
  workflow() %>% 
  add_recipe(ranger_recipe) %>% 
  add_model(ranger_spec) 

set.seed(8577)
doParallel::registerDoParallel()
ranger_tune <- tune_grid(ranger_workflow,
                         resamples = ph_folds,
                         grid=11)

Explore Results

Top Performing Models with Tuning Results and RMSE

mtry	min_n	.metric	.estimator	mean	n	std_err	.config
13	4	rmse	standard	0.1031171	25	0.0007668758	Preprocessor1_Model10
24	10	rmse	standard	0.1043719	25	0.0007960392	Preprocessor1_Model05
18	18	rmse	standard	0.1049345	25	0.0008165298	Preprocessor1_Model01
31	6	rmse	standard	0.1051590	25	0.0008627660	Preprocessor1_Model06
22	24	rmse	standard	0.1061753	25	0.0008183294	Preprocessor1_Model03

Random Forest Modeling Results

Results from our fitted Random Forest model on the test data set and a plot of observed vs. predicted values are set forth below:

.metric	.estimator	.estimate	.config
rmse	standard	0.09432975	Preprocessor1_Model1
rsq	standard	0.73432166	Preprocessor1_Model1
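
For readers who want to reproduce the two metrics by hand, the sketch below defines them in base R. It assumes that rsq is reported as the squared correlation between observed and predicted values, which is our understanding of the yardstick default; the observed and predicted vectors are invented for illustration.

```r
# Root mean squared error and squared-correlation R^2
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))
rsq  <- function(truth, estimate) cor(truth, estimate)^2

# Hypothetical observed and predicted pH values
observed  <- c(8.40, 8.56, 8.68, 8.30, 8.72)
predicted <- c(8.45, 8.50, 8.60, 8.40, 8.70)

cat("RMSE:", round(rmse(observed, predicted), 4), "\n")
cat("R-squared:", round(rsq(observed, predicted), 4), "\n")
```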

Important Variables

Our next step is to use the VIP package to identify the important variables in the modeling process.

Prediction With Evaluation Data Set

Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.
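
A minimal sketch of this export step in base R is shown below. The prediction values and the file name are hypothetical; a CSV file is used here as one Excel-readable format.

```r
# Hypothetical predictions standing in for the evaluation-set output
preds <- data.frame(.pred = c(8.51, 8.47, 8.62, 8.55, 8.40))

# Show the first rows (at most 10)
print(head(preds, 10))

# Write to a CSV file that Excel can open; the file name is an assumption
out_file <- file.path(tempdir(), "ph_predictions.csv")
write.csv(preds, out_file, row.names = FALSE)

# Read the file back to confirm the export round-trips
check <- read.csv(out_file)
```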

XGBOOST MODEL

The XGBOOST model specification, tuning parameters and modeling results are set forth below:

xgboost_recipe <- 
  recipe(formula = ph ~ ., data = ph_train) %>% 
  step_zv(all_predictors()) 

xgboost_spec <- 
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(), 
    loss_reduction = tune(), sample_size = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

xgboost_workflow <- 
  workflow() %>% 
  add_recipe(xgboost_recipe) %>% 
  add_model(xgboost_spec) 

set.seed(77648)
xgboost_tune <-
  tune_grid(xgboost_workflow, resamples = ph_folds, grid = 11)

Explore Results

Top Performing Models with Tuning Results and RMSE

trees	min_n	tree_depth	learn_rate	loss_reduction	sample_size	.metric	.estimator	mean	n	std_err	.config
625	34	14	9.216943e-03	6.275024e-09	0.3514750	rmse	standard	0.1189910	25	0.0008210379	Preprocessor1_Model10
1612	31	11	4.296299e-02	2.218003e+00	0.4552075	rmse	standard	0.1491916	25	0.0008831361	Preprocessor1_Model09
859	14	9	5.447205e-05	2.081954e-10	0.1997397	rmse	standard	7.6851221	25	0.0007059712	Preprocessor1_Model04
48	39	12	4.797775e-04	1.643376e-03	0.3425751	rmse	standard	7.8689865	25	0.0007027639	Preprocessor1_Model11
1309	3	4	1.430741e-05	1.798794e+01	0.9061079	rmse	standard	7.9027101	25	0.0007017228	Preprocessor1_Model01
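
The three candidates with an RMSE near 7.7 to 7.9 all combine a very small learning rate with too few trees to compensate. One plausible reading, sketched below, assumes that xgboost starts every prediction at 0.5 (its historical base_score default) and that each tree removes roughly a fraction learn_rate of the remaining error. Under those assumptions the error left after boosting is close to the tuned RMSE in the table.

```r
# Assumptions for this sketch: initial prediction of 0.5 and a residual that
# shrinks by a factor (1 - learn_rate) with every tree.
base_score <- 0.5
mean_ph    <- 8.55  # mean of PH from the training summary

runs <- data.frame(
  trees      = c(859, 48, 1309),
  learn_rate = c(5.447205e-05, 4.797775e-04, 1.430741e-05),
  tuned_rmse = c(7.6851221, 7.8689865, 7.9027101)
)

# Approximate error remaining after all trees have been added
runs$approx_rmse <- (mean_ph - base_score) * (1 - runs$learn_rate)^runs$trees
print(runs)
```

If this reading is right, these rows reflect models that never moved far from the starting value rather than genuinely poor fits, so a tuning grid with a higher floor on learn_rate may be worth trying.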

XGBOOST Modeling Results

Results from our fitted XGBOOST model on the test data set and a plot of observed vs. predicted values are set forth below:

.metric	.estimator	.estimate	.config
rmse	standard	0.1182419	Preprocessor1_Model1
rsq	standard	0.5817994	Preprocessor1_Model1

Important Variables

Our next step is to use the VIP package to identify the important variables in the modeling process.

Prediction With Evaluation Data Set

Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.

CUBIST MODEL

The CUBIST model specification, tuning parameters and modeling results are set forth below:

cubist_recipe <- 
  recipe(formula = ph ~ ., data = ph_train) %>% 
  step_zv(all_predictors()) 

cubist_spec <- 
  cubist_rules(committees = tune(), neighbors = tune()) %>% 
  set_engine("Cubist") 

cubist_workflow <- 
  workflow() %>% 
  add_recipe(cubist_recipe) %>% 
  add_model(cubist_spec) 

cubist_grid <- tidyr::crossing(committees = c(1:9, (1:5) * 10), neighbors = c(0, 
    3, 6, 9)) 

cubist_tune <- 
  tune_grid(cubist_workflow, resamples = ph_folds, grid = cubist_grid)
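
The size and ordering of the Cubist tuning grid can be checked with a short base R sketch. expand.grid is used here in place of tidyr::crossing only to keep the example dependency-free; the row ordering is arranged to mimic crossing's output.

```r
committees <- c(1:9, (1:5) * 10)  # 1 to 9, then 10, 20, 30, 40, 50
neighbors  <- c(0, 3, 6, 9)

# expand.grid varies its first argument fastest, so neighbors goes first to
# get rows ordered by committees and then by neighbors.
grid <- expand.grid(neighbors = neighbors, committees = committees)
grid <- grid[, c("committees", "neighbors")]

cat("Candidate settings:", nrow(grid), "\n")  # 14 committee values x 4 neighbor values
print(grid[55, ])
```

Row 55 is committees = 50 with neighbors = 6, which lines up with Preprocessor1_Model55 in the tuning results if configurations are numbered in grid order.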

Top Performing Models with Tuning Results and RMSE

committees	neighbors	.metric	.estimator	mean	n	std_err	.config
50	6	rmse	standard	0.09737828	25	0.0008200935	Preprocessor1_Model55
50	9	rmse	standard	0.09738718	25	0.0008244416	Preprocessor1_Model56
40	6	rmse	standard	0.09771429	25	0.0007820249	Preprocessor1_Model51
40	9	rmse	standard	0.09772138	25	0.0007904497	Preprocessor1_Model52
50	3	rmse	standard	0.09772571	25	0.0008850636	Preprocessor1_Model54

## == Workflow ======================================================================================================================================================
## Preprocessor: Recipe
## Model: cubist_rules()
## 
## -- Preprocessor --------------------------------------------------------------------------------------------------------------------------------------------------
## 1 Recipe Step
## 
## * step_zv()
## 
## -- Model ---------------------------------------------------------------------------------------------------------------------------------------------------------
## Cubist Model Specification (regression)
## 
## Main Arguments:
##   committees = 50
##   neighbors = 6
## 
## Computational engine: Cubist

CUBIST Modeling Results

Results from our fitted CUBIST model on the test data set and a plot of observed vs. predicted values are set forth below:

.metric	.estimator	.estimate	.config
rmse	standard	0.08530442	Preprocessor1_Model1
rsq	standard	0.76176579	Preprocessor1_Model1

collect_predictions(ph_fit_cube) %>% 
  ggplot(aes(ph, .pred)) +
  geom_abline(lty = 2, color = "gray50") +
  geom_point(alpha = 0.5, color = "steelblue") +
  coord_fixed() +
  theme_fivethirtyeight() +
  labs(title = "Cubist Model: ph vs predicted ph",
       subtitle = "An Analysis for ABC Beverage Co.")

Important Variables

Our next step is to use the VIP package to identify the important variables in the modeling process.

Prediction With Evaluation Data Set

Finally, we display a plot of predicted values from the evaluation set and write all predicted values to an Excel-readable format.

Summary and Findings
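
The test-set metrics reported in the three modeling sections can be collected into one table for comparison. The sketch below simply copies those reported values into a base R data frame and sorts by RMSE.

```r
# Test-set metrics as reported in the modeling sections of this report
results <- data.frame(
  model = c("Random Forest", "XGBoost", "Cubist"),
  rmse  = c(0.09432975, 0.1182419, 0.08530442),
  rsq   = c(0.73432166, 0.5817994, 0.76176579)
)

# Lowest RMSE first
results <- results[order(results$rmse), ]
print(results, row.names = FALSE)
```

On these numbers Cubist has the lowest test RMSE and the highest R-squared, followed by Random Forest and then XGBoost.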

# Random Thoughts

# Do we agree that the pH values seem a bit high? I believe values of 8.3 to 8.5 are safe but also bitter.

# Some pH / beverage info: https://www.sheltondentistry.com/patient-information/ph-values-common-drinks/

# For the technical write-up, what else needs to be discussed? Or do we just summarize all the modeling, state our findings and call it a day?
