We start our analysis by importing the training and evaluation set. Manual one hot encoding is also applied to the Brand Code variable. This results in three new variables: brand_code_b, brand_code_c and brand_code_d. A fourth variable brand_code_ais implicit when the preceding three variable are equal to 0.
train <- read_excel("StudentData.xlsx")
train <- train %>%
mutate(`Brand Code B` = case_when(
`Brand Code` == "A" ~ 0,
`Brand Code` == "B" ~ 1,
`Brand Code` == "C" ~ 0,
`Brand Code` == "D" ~ 0,
TRUE ~ 0
),
`Brand Code C` = case_when(
`Brand Code` == "A" ~ 0,
`Brand Code` == "B" ~ 0,
`Brand Code` == "C" ~ 1,
`Brand Code` == "D" ~ 0,
TRUE ~ 0
),
`Brand Code D` = case_when(
`Brand Code` == "A" ~ 0,
`Brand Code` == "B" ~ 0,
`Brand Code` == "C" ~ 0,
`Brand Code` == "D" ~ 1,
TRUE ~ 0
)) %>%
select(-`Brand Code`)
test <- read_excel("StudentEvaluation.xlsx")
test <- test %>%
mutate(`Brand Code B` = case_when(
`Brand Code` == "A" ~ 0,
`Brand Code` == "B" ~ 1,
`Brand Code` == "C" ~ 0,
`Brand Code` == "D" ~ 0,
TRUE ~ 0
),
`Brand Code C` = case_when(
`Brand Code` == "A" ~ 0,
`Brand Code` == "B" ~ 0,
`Brand Code` == "C" ~ 1,
`Brand Code` == "D" ~ 0,
TRUE ~ 0
),
`Brand Code D` = case_when(
`Brand Code` == "A" ~ 0,
`Brand Code` == "B" ~ 0,
`Brand Code` == "C" ~ 0,
`Brand Code` == "D" ~ 1,
TRUE ~ 0
)) %>%
select(-`Brand Code`)Next we drop observations with NA values on the training set. This step leaves with a sample set that is more than adequate and eliminate the uncertain surround selecting a correct imputation strategy. Additionally, modeling results improved after adoption of the drop NA approach. Initial iteration used KNN imputation.
| Name | train |
| Number of rows | 2129 |
| Number of columns | 35 |
| _______________________ | |
| Column type frequency: | |
| numeric | 35 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb Volume | 0 | 1 | 5.37 | 0.10 | 5.07 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▇▇▅▁ |
| Fill Ounces | 0 | 1 | 23.98 | 0.09 | 23.65 | 23.93 | 23.98 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC Volume | 0 | 1 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb Pressure | 0 | 1 | 68.21 | 3.44 | 58.20 | 65.60 | 68.20 | 70.60 | 78.80 | ▁▆▇▅▁ |
| Carb Temp | 0 | 1 | 141.12 | 3.97 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▃▇▃▁ |
| PSC | 0 | 1 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC Fill | 0 | 1 | 0.19 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC CO2 | 0 | 1 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf Flow | 0 | 1 | 29.58 | 122.19 | -100.20 | -100.00 | 105.60 | 143.60 | 229.40 | ▇▁▁▇▂ |
| Carb Pressure1 | 0 | 1 | 122.14 | 4.45 | 106.40 | 118.60 | 123.00 | 125.00 | 138.00 | ▁▅▇▃▁ |
| Fill Pressure | 0 | 1 | 48.18 | 2.84 | 43.60 | 46.00 | 46.60 | 50.00 | 59.20 | ▇▂▅▁▁ |
| Hyd Pressure1 | 0 | 1 | 13.24 | 12.27 | -0.60 | 0.00 | 12.40 | 21.00 | 58.00 | ▇▆▂▁▁ |
| Hyd Pressure2 | 0 | 1 | 22.92 | 16.09 | 0.00 | 0.00 | 30.80 | 35.20 | 59.40 | ▆▁▇▅▁ |
| Hyd Pressure3 | 0 | 1 | 22.29 | 15.61 | -1.20 | 0.00 | 30.00 | 33.80 | 50.00 | ▆▁▃▇▁ |
| Hyd Pressure4 | 0 | 1 | 94.33 | 10.81 | 70.00 | 84.00 | 96.00 | 100.00 | 138.00 | ▅▇▇▁▁ |
| Filler Level | 0 | 1 | 108.49 | 15.58 | 55.80 | 92.20 | 117.40 | 120.00 | 146.60 | ▁▃▂▇▁ |
| Filler Speed | 0 | 1 | 3881.07 | 344.08 | 1004.00 | 3896.00 | 3986.00 | 3998.00 | 4028.00 | ▁▁▁▁▇ |
| Temperature | 0 | 1 | 65.77 | 1.03 | 63.60 | 65.20 | 65.60 | 66.20 | 73.20 | ▇▇▁▁▁ |
| Usage cont | 0 | 1 | 21.11 | 2.91 | 12.46 | 18.50 | 22.04 | 23.76 | 25.90 | ▁▃▃▃▇ |
| Carb Flow | 0 | 1 | 2637.61 | 899.99 | 44.00 | 2728.00 | 3042.00 | 3204.00 | 5104.00 | ▁▃▇▇▁ |
| Density | 0 | 1 | 1.17 | 0.37 | 0.44 | 0.92 | 0.98 | 1.62 | 1.92 | ▁▇▁▁▃ |
| MFR | 0 | 1 | 708.17 | 64.36 | 95.40 | 708.20 | 724.80 | 731.00 | 868.60 | ▁▁▁▅▇ |
| Balling | 0 | 1 | 2.20 | 0.92 | 0.35 | 1.50 | 1.65 | 3.34 | 3.98 | ▁▇▁▁▃ |
| Pressure Vacuum | 0 | 1 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 0 | 1 | 8.55 | 0.17 | 7.88 | 8.44 | 8.56 | 8.68 | 8.94 | ▁▂▆▇▃ |
| Oxygen Filler | 0 | 1 | 0.04 | 0.04 | 0.00 | 0.02 | 0.03 | 0.06 | 0.32 | ▇▂▁▁▁ |
| Bowl Setpoint | 0 | 1 | 109.02 | 15.28 | 70.00 | 90.00 | 120.00 | 120.00 | 140.00 | ▁▃▃▇▁ |
| Pressure Setpoint | 0 | 1 | 47.62 | 2.05 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air Pressurer | 0 | 1 | 142.84 | 1.22 | 140.80 | 142.20 | 142.60 | 143.00 | 147.60 | ▂▇▁▁▁ |
| Alch Rel | 0 | 1 | 6.89 | 0.50 | 6.32 | 6.54 | 6.56 | 7.22 | 8.62 | ▇▁▁▂▁ |
| Carb Rel | 0 | 1 | 5.43 | 0.12 | 5.02 | 5.34 | 5.40 | 5.54 | 5.78 | ▁▃▇▅▁ |
| Balling Lvl | 0 | 1 | 2.04 | 0.86 | 1.18 | 1.38 | 1.46 | 3.14 | 3.66 | ▇▁▁▂▃ |
| Brand Code B | 0 | 1 | 0.49 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| Brand Code C | 0 | 1 | 0.12 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| Brand Code D | 0 | 1 | 0.24 | 0.43 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
For the training set we employ the corr and nzv preprocessing functions. The corr, nzv and medianImpute functions are applied to the evaluation data set. Note only predictor variables are being imputed in the evaluation data set. Finally, the janitor::clean_names function is applied to all variables of each data set
# train data
temp_df <- data.matrix(train)
preprocessing <- preProcess(temp_df, method = c("corr", "nzv"))
student_train_preprocess <- predict(preprocessing, temp_df)
train_df <- as_tibble(student_train_preprocess)
temp_df2 <- data.matrix(test)
# test data
preprocessing <- preProcess(temp_df2, method = c("medianImpute","corr", "nzv"))
student_test_preprocess <- predict(preprocessing, temp_df2)
test_df <- as_tibble(student_test_preprocess)
train_df <- train_df %>%
select(PH, everything()) %>%
clean_names()
test_df <- test_df %>%
select(everything()) %>%
clean_names()We use the skim function to present our preprocessed and cleaned data set.
| Name | train_df |
| Number of rows | 2129 |
| Number of columns | 35 |
| _______________________ | |
| Column type frequency: | |
| numeric | 35 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ph | 0 | 1 | 8.55 | 0.17 | 7.88 | 8.44 | 8.56 | 8.68 | 8.94 | ▁▂▆▇▃ |
| carb_volume | 0 | 1 | 5.37 | 0.10 | 5.07 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▇▇▅▁ |
| fill_ounces | 0 | 1 | 23.98 | 0.09 | 23.65 | 23.93 | 23.98 | 24.03 | 24.32 | ▁▂▇▂▁ |
| pc_volume | 0 | 1 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| carb_pressure | 0 | 1 | 68.21 | 3.44 | 58.20 | 65.60 | 68.20 | 70.60 | 78.80 | ▁▆▇▅▁ |
| carb_temp | 0 | 1 | 141.12 | 3.97 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▃▇▃▁ |
| psc | 0 | 1 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| psc_fill | 0 | 1 | 0.19 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| psc_co2 | 0 | 1 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| mnf_flow | 0 | 1 | 29.58 | 122.19 | -100.20 | -100.00 | 105.60 | 143.60 | 229.40 | ▇▁▁▇▂ |
| carb_pressure1 | 0 | 1 | 122.14 | 4.45 | 106.40 | 118.60 | 123.00 | 125.00 | 138.00 | ▁▅▇▃▁ |
| fill_pressure | 0 | 1 | 48.18 | 2.84 | 43.60 | 46.00 | 46.60 | 50.00 | 59.20 | ▇▂▅▁▁ |
| hyd_pressure1 | 0 | 1 | 13.24 | 12.27 | -0.60 | 0.00 | 12.40 | 21.00 | 58.00 | ▇▆▂▁▁ |
| hyd_pressure2 | 0 | 1 | 22.92 | 16.09 | 0.00 | 0.00 | 30.80 | 35.20 | 59.40 | ▆▁▇▅▁ |
| hyd_pressure3 | 0 | 1 | 22.29 | 15.61 | -1.20 | 0.00 | 30.00 | 33.80 | 50.00 | ▆▁▃▇▁ |
| hyd_pressure4 | 0 | 1 | 94.33 | 10.81 | 70.00 | 84.00 | 96.00 | 100.00 | 138.00 | ▅▇▇▁▁ |
| filler_level | 0 | 1 | 108.49 | 15.58 | 55.80 | 92.20 | 117.40 | 120.00 | 146.60 | ▁▃▂▇▁ |
| filler_speed | 0 | 1 | 3881.07 | 344.08 | 1004.00 | 3896.00 | 3986.00 | 3998.00 | 4028.00 | ▁▁▁▁▇ |
| temperature | 0 | 1 | 65.77 | 1.03 | 63.60 | 65.20 | 65.60 | 66.20 | 73.20 | ▇▇▁▁▁ |
| usage_cont | 0 | 1 | 21.11 | 2.91 | 12.46 | 18.50 | 22.04 | 23.76 | 25.90 | ▁▃▃▃▇ |
| carb_flow | 0 | 1 | 2637.61 | 899.99 | 44.00 | 2728.00 | 3042.00 | 3204.00 | 5104.00 | ▁▃▇▇▁ |
| density | 0 | 1 | 1.17 | 0.37 | 0.44 | 0.92 | 0.98 | 1.62 | 1.92 | ▁▇▁▁▃ |
| mfr | 0 | 1 | 708.17 | 64.36 | 95.40 | 708.20 | 724.80 | 731.00 | 868.60 | ▁▁▁▅▇ |
| balling | 0 | 1 | 2.20 | 0.92 | 0.35 | 1.50 | 1.65 | 3.34 | 3.98 | ▁▇▁▁▃ |
| pressure_vacuum | 0 | 1 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| oxygen_filler | 0 | 1 | 0.04 | 0.04 | 0.00 | 0.02 | 0.03 | 0.06 | 0.32 | ▇▂▁▁▁ |
| bowl_setpoint | 0 | 1 | 109.02 | 15.28 | 70.00 | 90.00 | 120.00 | 120.00 | 140.00 | ▁▃▃▇▁ |
| pressure_setpoint | 0 | 1 | 47.62 | 2.05 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| air_pressurer | 0 | 1 | 142.84 | 1.22 | 140.80 | 142.20 | 142.60 | 143.00 | 147.60 | ▂▇▁▁▁ |
| alch_rel | 0 | 1 | 6.89 | 0.50 | 6.32 | 6.54 | 6.56 | 7.22 | 8.62 | ▇▁▁▂▁ |
| carb_rel | 0 | 1 | 5.43 | 0.12 | 5.02 | 5.34 | 5.40 | 5.54 | 5.78 | ▁▃▇▅▁ |
| balling_lvl | 0 | 1 | 2.04 | 0.86 | 1.18 | 1.38 | 1.46 | 3.14 | 3.66 | ▇▁▁▂▃ |
| brand_code_b | 0 | 1 | 0.49 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| brand_code_c | 0 | 1 | 0.12 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| brand_code_d | 0 | 1 | 0.24 | 0.43 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
Our data exploration leverages the skim report above as well as Figure 1 and Figure 2 below. Figure 1 provides insight into the distribution of our target variable pH. Figure looks at the relationship of the target variable with each of the predictor variables. It this plot that convinced the analysis to steer away from linear models.
Our next step is to take the training data and perform a training / test split. This enables us to train on some of the data and then use the test split to ensure we have not over fit the data. Please note, the test data is not the same as the evaluation data set. It is a subset of the training data set. Additionally, tuning/training folds are established for our model tuning.
set.seed(123)
ph_split <- initial_split(train_df, strata= ph, na.rm =TRUE)
ph_train <- training(ph_split)
ph_test <- testing(ph_split)
set.seed(234)
ph_folds <- bootstraps(ph_train, strata = ph)
ph_folds## # Bootstrap sampling using stratification
## # A tibble: 25 x 2
## splits id
## <list> <chr>
## 1 <split [1.6K/597]> Bootstrap01
## 2 <split [1.6K/594]> Bootstrap02
## 3 <split [1.6K/571]> Bootstrap03
## 4 <split [1.6K/565]> Bootstrap04
## 5 <split [1.6K/597]> Bootstrap05
## 6 <split [1.6K/572]> Bootstrap06
## 7 <split [1.6K/596]> Bootstrap07
## 8 <split [1.6K/572]> Bootstrap08
## 9 <split [1.6K/591]> Bootstrap09
## 10 <split [1.6K/592]> Bootstrap10
## # … with 15 more rows
In this time of Covid-19 our modeling team is operating remotely with each data scientist pursuing specific modeling approaches.
The first model is our best fit using a generalized linear model method, built by Calvin Cox. The next two models were built by Adam Douglas and are: Multivariate Adaptive Regression Splines (or MARS), and Neural Networks. The next three models, built by Jim Mundy, are: Random Forest, XGBoost and Cubist.The GLM model specification, tuning parameters and modeling results are set forth below:
glmnet_recipe <-
recipe(formula = ph ~ ., data = ph_train) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors(), -all_nominal())
glmnet_spec <-
linear_reg(penalty = tune(), mixture = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
glmnet_workflow <-
workflow() %>%
add_recipe(glmnet_recipe) %>%
add_model(glmnet_spec)
glmnet_grid <- tidyr::crossing(penalty = 10^seq(-6, -1, length.out = 20), mixture = c(0.05,
0.2, 0.4, 0.6, 0.8, 1))
glmnet_tune <-
tune_grid(glmnet_workflow, resamples = ph_folds, grid = 11) ## # A tibble: 5 x 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1.16e-4 0.252 rmse standard 0.131 25 0.000785 Preprocessor1_Mo…
## 2 2.93e-6 0.0717 rmse standard 0.131 25 0.000784 Preprocessor1_Mo…
## 3 1.27e-9 0.904 rmse standard 0.131 25 0.000782 Preprocessor1_Mo…
## 4 1.36e-7 0.607 rmse standard 0.131 25 0.000782 Preprocessor1_Mo…
## 5 1.81e-8 0.789 rmse standard 0.131 25 0.000782 Preprocessor1_Mo…
As suspected, the traditional linear modeling approach yielded a poor model, given the training data.
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| rmse | standard | 0.1281839 | Preprocessor1_Model1 |
| rsq | standard | 0.4533451 | Preprocessor1_Model1 |
collect_predictions(ph_fit_glmnet) %>%
ggplot(aes(ph, .pred)) +
geom_abline(lty = 2, color="gray50") +
geom_point(alpha = 0.5, color="steelblue") +
coord_fixed() + theme_fivethirtyeight() + labs(title="linear regression ph vs predicted ph", subtitle = "An Analysis for ABC Beverage Co.")Our next step is to use the VIP package to identify the important variables in the modeling process.
glmnet_spec <-
linear_reg(penalty = tune(), mixture = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
workflow() %>%
add_recipe(glmnet_recipe) %>%
add_model(glmnet_spec) %>%
fit(ph_train) %>%
pull_workflow_fit() %>%
vip(aesthetic = list(alpha = 0.8, fill = "steelblue"))Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.
The MARS model specification, tuning parameters and modeling results are set forth below:
# Recipe for MARS
mars_recipe <-
recipe(formula = ph ~ ., data = ph_train) %>%
prep(retain=T)
mars_spec <-
mars(num_terms = tune(), prod_degree = 1,
prune_method = tune()) %>%
set_mode("regression") %>%
set_engine("earth")
mars_grid <- grid_regular(num_terms(range=c(1,34)),
prune_method(c("backward","forward","none")),
levels=20)
mars_workflow <-
workflow() %>%
add_recipe(mars_recipe) %>%
add_model(mars_spec)
set.seed(8577)
mars_tune <- tune_grid(mars_workflow,
resamples = ph_folds,
grid=mars_grid)| num_terms | prune_method | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 25 | forward | rmse | standard | 0.1288090 | 25 | 0.001133040 | Preprocessor1_Model35 |
| 34 | forward | rmse | standard | 0.1288473 | 25 | 0.001123949 | Preprocessor1_Model40 |
| 32 | forward | rmse | standard | 0.1288497 | 25 | 0.001123429 | Preprocessor1_Model39 |
| 34 | none | rmse | standard | 0.1288659 | 25 | 0.001133304 | Preprocessor1_Model60 |
| 30 | forward | rmse | standard | 0.1288672 | 25 | 0.001119719 | Preprocessor1_Model38 |
Our fitted MARS Model Using the Test Data Set and a plot of observed vs predicted are set forth below:
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| rmse | standard | 0.1277871 | Preprocessor1_Model1 |
| rsq | standard | 0.4566604 | Preprocessor1_Model1 |
Our next step is to use the VIP package to identify the important variables in the modeling process.
Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.
The Neural Network model specification, tuning parameters and modeling results are set forth below:
# Recipe for NN
nn_recipe <-
recipe(formula = ph ~ ., data = ph_train) %>%
prep(retain=T)
nn_spec <-
mlp(hidden_units = tune(), penalty = tune(),
epochs = tune()) %>%
set_mode("regression") %>%
set_engine("nnet")
nn_grid <- grid_regular(hidden_units(),
penalty(),
epochs(),
levels = 3)
nn_workflow <-
workflow() %>%
add_recipe(nn_recipe) %>%
add_model(nn_spec)
set.seed(8577)
nn_tune <- tune_grid(nn_workflow,
resamples = ph_folds,
grid=nn_grid)| hidden_units | penalty | epochs | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|
| 5 | 1 | 505 | rmse | standard | 0.1280353 | 25 | 0.001332102 | Preprocessor1_Model17 |
| 10 | 1 | 505 | rmse | standard | 0.1286441 | 25 | 0.001590298 | Preprocessor1_Model18 |
| 10 | 1 | 1000 | rmse | standard | 0.1300482 | 25 | 0.001473111 | Preprocessor1_Model27 |
| 5 | 1 | 1000 | rmse | standard | 0.1306369 | 25 | 0.002118459 | Preprocessor1_Model26 |
| 1 | 1 | 1000 | rmse | standard | 0.1386982 | 25 | 0.001738515 | Preprocessor1_Model25 |
Our fitted Neural Network Model Using the Test Data Set and a plot of observed vs predicted are set forth below:
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| rmse | standard | 0.1202430 | Preprocessor1_Model1 |
| rsq | standard | 0.5199355 | Preprocessor1_Model1 |
Our next step is to use the VIP package to identify the important variables in the modeling process.
Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.
The Random Forest model specification, tuning parameters and modeling results are set forth below:
ranger_recipe <-
recipe(formula = ph ~ ., data = ph_train)
ranger_spec <-
rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
set_mode("regression") %>%
set_engine("ranger")
ranger_workflow <-
workflow() %>%
add_recipe(ranger_recipe) %>%
add_model(ranger_spec)
set.seed(8577)
ranger_tune <- tune_grid(ranger_workflow,
resamples = ph_folds,
grid=11)| mtry | min_n | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 13 | 4 | rmse | standard | 0.1031171 | 25 | 0.0007668758 | Preprocessor1_Model10 |
| 24 | 10 | rmse | standard | 0.1043719 | 25 | 0.0007960392 | Preprocessor1_Model05 |
| 18 | 18 | rmse | standard | 0.1049345 | 25 | 0.0008165298 | Preprocessor1_Model01 |
| 31 | 6 | rmse | standard | 0.1051590 | 25 | 0.0008627660 | Preprocessor1_Model06 |
| 22 | 24 | rmse | standard | 0.1061753 | 25 | 0.0008183294 | Preprocessor1_Model03 |
Our fitted Random Forest Model Using the Test Data Set and a plot of observed vs predicted are set forth below:
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| rmse | standard | 0.09432975 | Preprocessor1_Model1 |
| rsq | standard | 0.73432166 | Preprocessor1_Model1 |
Our next step is to use the VIP package to identify the important variables in the modeling process.
Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.
The XGBOOST model specification, tuning parameters and modeling results are set forth below:
xgboost_recipe <-
recipe(formula = ph ~ ., data = ph_train) %>%
step_zv(all_predictors())
xgboost_spec <-
boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(),
loss_reduction = tune(), sample_size = tune()) %>%
set_mode("regression") %>%
set_engine("xgboost")
xgboost_workflow <-
workflow() %>%
add_recipe(xgboost_recipe) %>%
add_model(xgboost_spec)
set.seed(77648)
xgboost_tune <-
tune_grid(xgboost_workflow, resamples = ph_folds, grid = 11)| trees | min_n | tree_depth | learn_rate | loss_reduction | sample_size | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 625 | 34 | 14 | 9.216943e-03 | 6.275024e-09 | 0.3514750 | rmse | standard | 0.1189910 | 25 | 0.0008210379 | Preprocessor1_Model10 |
| 1612 | 31 | 11 | 4.296299e-02 | 2.218003e+00 | 0.4552075 | rmse | standard | 0.1491916 | 25 | 0.0008831361 | Preprocessor1_Model09 |
| 859 | 14 | 9 | 5.447205e-05 | 2.081954e-10 | 0.1997397 | rmse | standard | 7.6851221 | 25 | 0.0007059712 | Preprocessor1_Model04 |
| 48 | 39 | 12 | 4.797775e-04 | 1.643376e-03 | 0.3425751 | rmse | standard | 7.8689865 | 25 | 0.0007027639 | Preprocessor1_Model11 |
| 1309 | 3 | 4 | 1.430741e-05 | 1.798794e+01 | 0.9061079 | rmse | standard | 7.9027101 | 25 | 0.0007017228 | Preprocessor1_Model01 |
Results from our fitted XGBOOST Model Using the Test Data Set and a plot of observed vs. predicted are set forth below:
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| rmse | standard | 0.1182419 | Preprocessor1_Model1 |
| rsq | standard | 0.5817994 | Preprocessor1_Model1 |
Our next step is to use the VIP package to identify the important variables in the modeling process.
Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.
The CUBIST model specification, tuning parameters and modeling results are set forth below:
cubist_recipe <-
recipe(formula = ph ~ ., data = ph_train) %>%
step_zv(all_predictors())
cubist_spec <-
cubist_rules(committees = tune(), neighbors = tune()) %>%
set_engine("Cubist")
cubist_workflow <-
workflow() %>%
add_recipe(cubist_recipe) %>%
add_model(cubist_spec)
cubist_grid <- tidyr::crossing(committees = c(1:9, (1:5) * 10), neighbors = c(0,
3, 6, 9))
cubist_tune <-
tune_grid(cubist_workflow, resamples = ph_folds, grid = cubist_grid) | committees | neighbors | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 50 | 6 | rmse | standard | 0.09737828 | 25 | 0.0008200935 | Preprocessor1_Model55 |
| 50 | 9 | rmse | standard | 0.09738718 | 25 | 0.0008244416 | Preprocessor1_Model56 |
| 40 | 6 | rmse | standard | 0.09771429 | 25 | 0.0007820249 | Preprocessor1_Model51 |
| 40 | 9 | rmse | standard | 0.09772138 | 25 | 0.0007904497 | Preprocessor1_Model52 |
| 50 | 3 | rmse | standard | 0.09772571 | 25 | 0.0008850636 | Preprocessor1_Model54 |
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: cubist_rules()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## ● step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Cubist Model Specification (regression)
##
## Main Arguments:
## committees = 50
## neighbors = 6
##
## Computational engine: Cubist
Results from our fitted CUBIST Model Using the Test Data Set and a plot of observed vs. predicted are set forth below:
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| rmse | standard | 0.08530442 | Preprocessor1_Model1 |
| rsq | standard | 0.76176579 | Preprocessor1_Model1 |
collect_predictions(ph_fit_cube) %>%
ggplot(aes(ph, .pred)) +
geom_abline(lty = 2, color="gray50") +
geom_point(alpha = 0.5, color="steelblue") +
coord_fixed() + theme_fivethirtyeight() + labs(title="Cubist Model: ph vs predicted ph", subtitle = "An Analysis for ABC Beverage Co.")Our next step is to use the VIP package to identify the important variables in the modeling process.
Finally, we display a plot of predicted values from the evaluation set and write all predicted values to an Excel-readable format.
cube_ph_eval <- predict(ph_fit_cube$.workflow[[1]], test_df)
write.csv(cube_ph_eval, file = "cube_ph_eval.csv")
cube_ph_eval %>%
as_tibble() %>%
rownames_to_column(var="x") %>%
mutate(x = as.numeric(x)) %>%
rename(y = .pred) %>%
ggplot() +
geom_line(aes(x=x, y=y), color="steelblue") +
theme_fivethirtyeight() +
labs(title="Best Model (Cubist) - Predicted pH Values", subtitle = "pH Ranges -Drinking Water: 6.5 to 8.5, Other Beverages 2.5 to 7.0")We have performed a detail PH analysis for ABC Beverage Company. During the course of the analysis we utilized numerous predictive models to identify the model best suited to predict ABC Beverage’s PH levels.
Several models produce good predictions, however, we recommend use of the Cubist model to predict pH levels. We based this on the strong RMSE and R-Square scores produced by this model relative to other models as well as the models ability to identify important variables.
Our analysis indicates the following manufacturing processes are the most important to the company’s pH levels:
We applied the cubist model to the evaluation data provided and predicted pH values that ranged between 8.2 and 8.7. It is our understanding that these levels present no health or safety concerns, however, pH vales at this level could result in a bitter taste. We have also provided a non-technical report for your review and consideration.
Please let us know if you have any questions. We appreciate the opportunity to work on this project.