DATA


We start our annalysis by importing the training and evaluation set. Manual one hot encoding is also applied to the Brand Code variable. This results in three new variables: brand_code_b, brand_code_c and brand_code_d. A fourth variable brand_code_ais implicit when the preceding three variable are equal to 0.

Remove NA Values


Next we drop observatins with NA values on the training set. This step leaves with a sample set that is more than adequate and eliminate the uncertain surround selecting a correct imputation strategy. Additionally, modeling results improved after adoption of the drop NA approach. Initial interation used KNN imputation.

Data summary
Name train
Number of rows 2129
Number of columns 35
_______________________
Column type frequency:
numeric 35
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Carb Volume 0 1 5.37 0.10 5.07 5.29 5.35 5.45 5.70 ▁▇▇▅▁
Fill Ounces 0 1 23.98 0.09 23.65 23.93 23.98 24.03 24.32 ▁▂▇▂▁
PC Volume 0 1 0.28 0.06 0.08 0.24 0.27 0.31 0.48 ▁▃▇▂▁
Carb Pressure 0 1 68.21 3.44 58.20 65.60 68.20 70.60 78.80 ▁▆▇▅▁
Carb Temp 0 1 141.12 3.97 128.60 138.40 140.80 143.80 154.00 ▁▃▇▃▁
PSC 0 1 0.08 0.05 0.00 0.05 0.08 0.11 0.27 ▆▇▃▁▁
PSC Fill 0 1 0.19 0.12 0.00 0.10 0.18 0.26 0.62 ▆▇▃▁▁
PSC CO2 0 1 0.06 0.04 0.00 0.02 0.04 0.08 0.24 ▇▅▂▁▁
Mnf Flow 0 1 29.58 122.19 -100.20 -100.00 105.60 143.60 229.40 ▇▁▁▇▂
Carb Pressure1 0 1 122.14 4.45 106.40 118.60 123.00 125.00 138.00 ▁▅▇▃▁
Fill Pressure 0 1 48.18 2.84 43.60 46.00 46.60 50.00 59.20 ▇▂▅▁▁
Hyd Pressure1 0 1 13.24 12.27 -0.60 0.00 12.40 21.00 58.00 ▇▆▂▁▁
Hyd Pressure2 0 1 22.92 16.09 0.00 0.00 30.80 35.20 59.40 ▆▁▇▅▁
Hyd Pressure3 0 1 22.29 15.61 -1.20 0.00 30.00 33.80 50.00 ▆▁▃▇▁
Hyd Pressure4 0 1 94.33 10.81 70.00 84.00 96.00 100.00 138.00 ▅▇▇▁▁
Filler Level 0 1 108.49 15.58 55.80 92.20 117.40 120.00 146.60 ▁▃▂▇▁
Filler Speed 0 1 3881.07 344.08 1004.00 3896.00 3986.00 3998.00 4028.00 ▁▁▁▁▇
Temperature 0 1 65.77 1.03 63.60 65.20 65.60 66.20 73.20 ▇▇▁▁▁
Usage cont 0 1 21.11 2.91 12.46 18.50 22.04 23.76 25.90 ▁▃▃▃▇
Carb Flow 0 1 2637.61 899.99 44.00 2728.00 3042.00 3204.00 5104.00 ▁▃▇▇▁
Density 0 1 1.17 0.37 0.44 0.92 0.98 1.62 1.92 ▁▇▁▁▃
MFR 0 1 708.17 64.36 95.40 708.20 724.80 731.00 868.60 ▁▁▁▅▇
Balling 0 1 2.20 0.92 0.35 1.50 1.65 3.34 3.98 ▁▇▁▁▃
Pressure Vacuum 0 1 -5.22 0.57 -6.60 -5.60 -5.40 -5.00 -3.60 ▂▇▆▂▁
PH 0 1 8.55 0.17 7.88 8.44 8.56 8.68 8.94 ▁▂▆▇▃
Oxygen Filler 0 1 0.04 0.04 0.00 0.02 0.03 0.06 0.32 ▇▂▁▁▁
Bowl Setpoint 0 1 109.02 15.28 70.00 90.00 120.00 120.00 140.00 ▁▃▃▇▁
Pressure Setpoint 0 1 47.62 2.05 44.00 46.00 46.00 50.00 52.00 ▁▇▁▆▁
Air Pressurer 0 1 142.84 1.22 140.80 142.20 142.60 143.00 147.60 ▂▇▁▁▁
Alch Rel 0 1 6.89 0.50 6.32 6.54 6.56 7.22 8.62 ▇▁▁▂▁
Carb Rel 0 1 5.43 0.12 5.02 5.34 5.40 5.54 5.78 ▁▃▇▅▁
Balling Lvl 0 1 2.04 0.86 1.18 1.38 1.46 3.14 3.66 ▇▁▁▂▃
Brand Code B 0 1 0.49 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▇
Brand Code C 0 1 0.12 0.32 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
Brand Code D 0 1 0.24 0.43 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂

#Perform Preprocessing


For the training set we employ the corr and nzv proprocessing functions. The corr, nzv and medianImpute functions are applied to the evaluation data set. Note only predictor variables are being imputed in the evaluation data set. Finally, the janitor::clean_names function is applied to all variables of each data set

Preprocessed and Cleaned Data Set


We use the skim function to present our preprocessed and cleaned data set.

Data summary
Name train_df
Number of rows 2129
Number of columns 35
_______________________
Column type frequency:
numeric 35
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ph 0 1 8.55 0.17 7.88 8.44 8.56 8.68 8.94 ▁▂▆▇▃
carb_volume 0 1 5.37 0.10 5.07 5.29 5.35 5.45 5.70 ▁▇▇▅▁
fill_ounces 0 1 23.98 0.09 23.65 23.93 23.98 24.03 24.32 ▁▂▇▂▁
pc_volume 0 1 0.28 0.06 0.08 0.24 0.27 0.31 0.48 ▁▃▇▂▁
carb_pressure 0 1 68.21 3.44 58.20 65.60 68.20 70.60 78.80 ▁▆▇▅▁
carb_temp 0 1 141.12 3.97 128.60 138.40 140.80 143.80 154.00 ▁▃▇▃▁
psc 0 1 0.08 0.05 0.00 0.05 0.08 0.11 0.27 ▆▇▃▁▁
psc_fill 0 1 0.19 0.12 0.00 0.10 0.18 0.26 0.62 ▆▇▃▁▁
psc_co2 0 1 0.06 0.04 0.00 0.02 0.04 0.08 0.24 ▇▅▂▁▁
mnf_flow 0 1 29.58 122.19 -100.20 -100.00 105.60 143.60 229.40 ▇▁▁▇▂
carb_pressure1 0 1 122.14 4.45 106.40 118.60 123.00 125.00 138.00 ▁▅▇▃▁
fill_pressure 0 1 48.18 2.84 43.60 46.00 46.60 50.00 59.20 ▇▂▅▁▁
hyd_pressure1 0 1 13.24 12.27 -0.60 0.00 12.40 21.00 58.00 ▇▆▂▁▁
hyd_pressure2 0 1 22.92 16.09 0.00 0.00 30.80 35.20 59.40 ▆▁▇▅▁
hyd_pressure3 0 1 22.29 15.61 -1.20 0.00 30.00 33.80 50.00 ▆▁▃▇▁
hyd_pressure4 0 1 94.33 10.81 70.00 84.00 96.00 100.00 138.00 ▅▇▇▁▁
filler_level 0 1 108.49 15.58 55.80 92.20 117.40 120.00 146.60 ▁▃▂▇▁
filler_speed 0 1 3881.07 344.08 1004.00 3896.00 3986.00 3998.00 4028.00 ▁▁▁▁▇
temperature 0 1 65.77 1.03 63.60 65.20 65.60 66.20 73.20 ▇▇▁▁▁
usage_cont 0 1 21.11 2.91 12.46 18.50 22.04 23.76 25.90 ▁▃▃▃▇
carb_flow 0 1 2637.61 899.99 44.00 2728.00 3042.00 3204.00 5104.00 ▁▃▇▇▁
density 0 1 1.17 0.37 0.44 0.92 0.98 1.62 1.92 ▁▇▁▁▃
mfr 0 1 708.17 64.36 95.40 708.20 724.80 731.00 868.60 ▁▁▁▅▇
balling 0 1 2.20 0.92 0.35 1.50 1.65 3.34 3.98 ▁▇▁▁▃
pressure_vacuum 0 1 -5.22 0.57 -6.60 -5.60 -5.40 -5.00 -3.60 ▂▇▆▂▁
oxygen_filler 0 1 0.04 0.04 0.00 0.02 0.03 0.06 0.32 ▇▂▁▁▁
bowl_setpoint 0 1 109.02 15.28 70.00 90.00 120.00 120.00 140.00 ▁▃▃▇▁
pressure_setpoint 0 1 47.62 2.05 44.00 46.00 46.00 50.00 52.00 ▁▇▁▆▁
air_pressurer 0 1 142.84 1.22 140.80 142.20 142.60 143.00 147.60 ▂▇▁▁▁
alch_rel 0 1 6.89 0.50 6.32 6.54 6.56 7.22 8.62 ▇▁▁▂▁
carb_rel 0 1 5.43 0.12 5.02 5.34 5.40 5.54 5.78 ▁▃▇▅▁
balling_lvl 0 1 2.04 0.86 1.18 1.38 1.46 3.14 3.66 ▇▁▁▂▃
brand_code_b 0 1 0.49 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▇
brand_code_c 0 1 0.12 0.32 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
brand_code_d 0 1 0.24 0.43 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂

Explore Data


Our data exploration leverages the skim report above as well as Figure 1 and Figure 2 below. Figure 1 provides insight into the distribution of our target variable pH. Figure looks at the relationship of the target variable with each of the predictor variables. It this plot that convinced the analysis to steer away from linear models.

Figure 1 . Histogram - ph Level

Figure 2. Scatter Plots pH vs Predictor Variables



Training / Test Split


Our next step is to take the training data and perform a training / test split. This enables us to train on some of the data and then use the test split to ensure we have not overfit the data. Please note, the test data is not the same as the evaluation data set. It is a subset of the trainind data set. Additionally, tuning/training folds are established for our model tuning.

## # Bootstrap sampling using stratification 
## # A tibble: 25 x 2
##    splits             id         
##    <list>             <chr>      
##  1 <split [1.6K/597]> Bootstrap01
##  2 <split [1.6K/594]> Bootstrap02
##  3 <split [1.6K/571]> Bootstrap03
##  4 <split [1.6K/565]> Bootstrap04
##  5 <split [1.6K/597]> Bootstrap05
##  6 <split [1.6K/572]> Bootstrap06
##  7 <split [1.6K/596]> Bootstrap07
##  8 <split [1.6K/572]> Bootstrap08
##  9 <split [1.6K/591]> Bootstrap09
## 10 <split [1.6K/592]> Bootstrap10
## # ... with 15 more rows

Build Models


In this time of Covid-19 our modeling team is operating remotely. This portion of the model building addresses three models, Random Forest, XGBoost and Cubist. Other data scientist on the team will be pursuing alternative modeling startegies.

Random Forest Model


The Random Forest model specification, tuning parameters and modeling results are set forth below:


### Explore Results

Top Performing Models with Tuning Results and RMSE

mtry min_n .metric .estimator mean n std_err .config
13 4 rmse standard 0.1031171 25 0.0007668758 Preprocessor1_Model10
24 10 rmse standard 0.1043719 25 0.0007960392 Preprocessor1_Model05
18 18 rmse standard 0.1049345 25 0.0008165298 Preprocessor1_Model01
31 6 rmse standard 0.1051590 25 0.0008627660 Preprocessor1_Model06
22 24 rmse standard 0.1061753 25 0.0008183294 Preprocessor1_Model03

Random Forest Modeling Results


Our fitted Random Forest Model Using the Test Data Set and a plot of observed vs predicted are set forth below:

.metric .estimator .estimate .config
rmse standard 0.09432975 Preprocessor1_Model1
rsq standard 0.73432166 Preprocessor1_Model1


Important Variables


Our next step is to use the VIP package to identify the important variables in the modeling process.


Prediction With Evaluation Data Set


Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.

Explore Results

Top Performing Models with Tuning Results and RMSE

trees min_n tree_depth learn_rate loss_reduction sample_size .metric .estimator mean n std_err .config
625 34 14 9.216943e-03 6.275024e-09 0.3514750 rmse standard 0.1189910 25 0.0008210379 Preprocessor1_Model10
1612 31 11 4.296299e-02 2.218003e+00 0.4552075 rmse standard 0.1491916 25 0.0008831361 Preprocessor1_Model09
859 14 9 5.447205e-05 2.081954e-10 0.1997397 rmse standard 7.6851221 25 0.0007059712 Preprocessor1_Model04
48 39 12 4.797775e-04 1.643376e-03 0.3425751 rmse standard 7.8689865 25 0.0007027639 Preprocessor1_Model11
1309 3 4 1.430741e-05 1.798794e+01 0.9061079 rmse standard 7.9027101 25 0.0007017228 Preprocessor1_Model01

XGBOST Modeling Results


Results from our fitted XGBOOST Model Using the Test Data Set and a plot of observed vs. predicted are set forth below:

.metric .estimator .estimate .config
rmse standard 0.1182419 Preprocessor1_Model1
rsq standard 0.5817994 Preprocessor1_Model1


Important Variables


Our next step is to use the VIP package to identify the important variables in the modeling process.



Prediction With Evaluation Data Set


Finally, we display the first 10 predicted values from the evaluation set and write all predicted values to an Excel-readable format.

CUBIST MODEL


The CUBIST model specification, tuning parameters and modeling results are set forth below:


Top Performing Models with Tuning Results and RMSE

committees neighbors .metric .estimator mean n std_err .config
50 6 rmse standard 0.09737828 25 0.0008200935 Preprocessor1_Model55
50 9 rmse standard 0.09738718 25 0.0008244416 Preprocessor1_Model56
40 6 rmse standard 0.09771429 25 0.0007820249 Preprocessor1_Model51
40 9 rmse standard 0.09772138 25 0.0007904497 Preprocessor1_Model52
50 3 rmse standard 0.09772571 25 0.0008850636 Preprocessor1_Model54


## == Workflow ======================================================================================================================================================
## Preprocessor: Recipe
## Model: cubist_rules()
## 
## -- Preprocessor --------------------------------------------------------------------------------------------------------------------------------------------------
## 1 Recipe Step
## 
## * step_zv()
## 
## -- Model ---------------------------------------------------------------------------------------------------------------------------------------------------------
## Cubist Model Specification (regression)
## 
## Main Arguments:
##   committees = 50
##   neighbors = 6
## 
## Computational engine: Cubist

CUBIST Modeling Results


Results from our fitted CUBIST Model Using the Test Data Set and a plot of observed vs. predicted are set forth below:

.metric .estimator .estimate .config
rmse standard 0.08530442 Preprocessor1_Model1
rsq standard 0.76176579 Preprocessor1_Model1

Important Variables


Our next step is to use the VIP package to identify the important variables in the modeling process.



Prediction With Evaluation Data Set


Finally, we display a plot of predicted values from the evaluation set and write all predicted values to an Excel-readable format.


Summary and Findings