Gradient Boosted Trees

We consider the same dataset and variable importance problem using gradient boosted trees. The gbm library and variable importance measures are run within caret training procedure. The gradient boosting approach relies on a weak learner in an additive manner that minimizes the given loss function.

set.seed(1027)

gbmControlFull <- trainControl(method = "repeatedcv", 
                           number = 10,  # for debug mode, set this to a low number (1)
                           repeats = 10 ,
                           selectionFunction = "best", 
                           verboseIter = FALSE)

gbmControlLight <- trainControl(method = "repeatedcv", 
                           number = 5,  # for debug mode, set this to a low number (1)
                           repeats = 1 ,
                           selectionFunction = "best", 
                           verboseIter = FALSE)

if(tuningFull == TRUE)
{
   gbmControl = gbmControlFull
   gbmTune = expand.grid( n.trees = seq(250, 1000, by = 250),
                                                  shrinkage = c( 0.02, 0.05,  0.1) ,
                                                  interaction.depth = c(  3, 7) ,
                                                  n.minobsinnode = 15
                                                  )
} else  {
   gbmControl = gbmControlLight
   gbmTune =  expand.grid( n.trees = c( 2000), 
                                                  shrinkage = c( 0.02, 0.05 , 0.1 ) ,
                                                  interaction.depth = c(  3, 7) ,
                                                  n.minobsinnode = 4
                                                  )
}

(gbmTune = caret::train( x = sdata_trainX_pp, 
                          y = sdata_train_pp$PH , 
                          method = "gbm",
                          tuneGrid =  gbmTune ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = gbmControl ) )

## Stochastic Gradient Boosting 
## 
## 2055 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 1850, 1849, 1849, 1851, 1850, 1849, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE       Rsquared   MAE      
##   0.02       3                   250     0.1434388  0.3444707  0.1130682
##   0.02       3                   500     0.1399058  0.3781026  0.1093679
##   0.02       3                   750     0.1381834  0.3948353  0.1075680
##   0.02       3                  1000     0.1371452  0.4052631  0.1065308
##   0.02       7                   250     0.1375659  0.3998528  0.1072464
##   0.02       7                   500     0.1339780  0.4329578  0.1037270
##   0.02       7                   750     0.1326790  0.4454112  0.1025093
##   0.02       7                  1000     0.1320687  0.4515818  0.1019326
##   0.05       3                   250     0.1391815  0.3845878  0.1085746
##   0.05       3                   500     0.1370316  0.4073504  0.1063473
##   0.05       3                   750     0.1364003  0.4163621  0.1056836
##   0.05       3                  1000     0.1360908  0.4213742  0.1053774
##   0.05       7                   250     0.1337953  0.4330017  0.1032590
##   0.05       7                   500     0.1324980  0.4471243  0.1018901
##   0.05       7                   750     0.1319808  0.4530628  0.1013538
##   0.05       7                  1000     0.1317524  0.4561859  0.1010493
##   0.10       3                   250     0.1379920  0.3994445  0.1072680
##   0.10       3                   500     0.1371891  0.4134839  0.1064164
##   0.10       3                   750     0.1368470  0.4199454  0.1059928
##   0.10       3                  1000     0.1366708  0.4242737  0.1056828
##   0.10       7                   250     0.1343453  0.4341820  0.1034168
##   0.10       7                   500     0.1335705  0.4438374  0.1027183
##   0.10       7                   750     0.1332796  0.4472765  0.1023144
##   0.10       7                  1000     0.1332705  0.4482689  0.1022585
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 15
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
##  7, shrinkage = 0.05 and n.minobsinnode = 15.

As the model output indicates the greater tree depth outperforms the lower counts and the lower shrinkage value produces a lower (better) RMSE value. The shrinkage, or learning rate, works better with smaller values, yet require longer processing time.

## Saving 7 x 5 in image

The plot of the tuned model clearly shows the better performance with the increased max tree depth but also indicates the lower shrinkage value doesn’t guarantee a lower RMSE. For the max tree depth of 3, the lowest shrinkage value does not produce the lowest RMSE for that tree depth.

## Saving 7 x 5 in image

The variable importance plot shows the 100% importance of the MnfFlow followed by BrandCode as the only other variable with an importance above 50%. Usagecont, OxygenFiller, and Temperature round out the top five variables based on importance with scores greater than 25%.

Overall, the final Gradient Boosted model results in an RMSE of approximately 0.13 and an \(R^2\) less than 0.5.

Cubist Model

Now we train a Cubist tree using the best selectionFunction on RMSE. The Cubist library and variable importance measures are run within caret training procedure.
The Cubist model is a rules-based approach in which the terminal leaves contains linear regression models. The models are based on predictions defined in the previous nodes of the tree.

set.seed(1027)

cubistControlFull <- trainControl(method = "cv" ,  selectionFunction = "best")
tuneGridFull  <- expand.grid( committees = c( 10, 50, 100 ) ,
                              neighbors = c( 0, 1, 5, 9 )
                                                  ) 

cubistControlLight <- trainControl(method = "cv" ,  selectionFunction = "best")
tuneGridLight <- expand.grid( committees = c( 10, 20 ) , 
                              neighbors = c( 0, 5 )  )

if(tuningFull == TRUE)
{
   cubistControl = cubistControlFull
   cubistGrid = tuneGridFull
} else  {
   cubistControl = cubistControlLight
   cubistGrid = tuneGridLight
}

(cubistTune = caret::train( x = sdata_trainX_pp, 
                          y = sdata_train_pp$PH , 
                          method = "cubist",
                          tuneGrid = cubistGrid ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = cubistControl ) )

## Cubist 
## 
## 2055 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1850, 1849, 1849, 1851, 1850, 1849, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE        Rsquared   MAE       
##    10         0          0.10326126  0.6469501  0.07521436
##    10         1          0.10698108  0.6394174  0.07408043
##    10         5          0.09661241  0.6865169  0.06923965
##    10         9          0.09715807  0.6838488  0.06954861
##    50         0          0.10213076  0.6574062  0.07362497
##    50         1          0.10583349  0.6449193  0.07208654
##    50         5          0.09539318  0.6943587  0.06757449
##    50         9          0.09598916  0.6917813  0.06811918
##   100         0          0.10218813  0.6576706  0.07366847
##   100         1          0.10587327  0.6449503  0.07201655
##   100         5          0.09547621  0.6941947  0.06752479
##   100         9          0.09606335  0.6917983  0.06810842
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 50 and neighbors = 5.

The Cubist model results provide a good RMSE value and \(R^2\) value with a higher number of committees and neighbors.

The line plot confirms the better model per RMSE with more neighbors and increasing number of committees.

## Saving 7 x 5 in image

The variable importance plot identifies MnfFlow as the most valuable variable followed by BallingLvl, Balling, AlchRel, and PressureVacuum to round out the top 5 variables. Overall, 8 variables resulted in over 50% importance indicating those variables are contained in over 50% of the rules defining the Cubist model.

## Saving 7 x 5 in image

The final Cubist model confirms the high usage rate of the variable MnfFlow.

Cubist Predictor Usage

Conditions	Model	Variable
82	61	MnfFlow
18	81	Balling
22	71	BallingLvl
19	73	AlchRel
28	60	PressureVacuum
19	63	OxygenFiller
14	66	Density
21	51	BowlSetpoint
8	61	HydPressure3
7	62	Temperature
9	59	AirPressurer
18	47	CarbFlow
6	58	CarbRel
3	54	HydPressure2
3	52	CarbPressure1
11	40	Usagecont
16	31	FillerSpeed
6	39	HydPressure1
41	0	BrandCode
3	33	FillerLevel
0	27	CarbPressure
3	20	CarbVolume
2	21	PCVolume
0	20	CarbTemp
0	20	PressureSetpoint
0	17	FillPressure
0	16	HydPressure4
0	16	PSCFill
0	6	MFR
0	5	FillOunces
0	4	PSC
0	3	PSCCO2

The splits plot indicates the usage of each variable by committee/rule. The values are normalized and highlight the high usage rate of MnfFlow along with the use of greater than or less than inequalities for MnfFlow.

## quartz_off_screen 
##                 2

The dotplot allows us to see the distribution of the coefficient parameters for each linear model associated with a committee/rule.
When a predictor has more dots, it is used in more models which suggests it is influential. The color density is proportional to the variable importance. The sign of the coefficient tells us whether the marginal effect of the predictor on the response PH. However, since Cubist uses many models, the concept of the sign of a predictor coefficient is unclear.
The x-axis value of the dot indicates the true coefficient in the original units unlike the above plot which normalizes the x-axis.

## quartz_off_screen 
##                 2

The boxplot below which we devised constructs a distribution of coefficients for each predictor. We interpret the sign of the predictor \(X\) marginal effect to be the sign of the median of the distribution of coefficients pf \(X\) over those models which include \(X\). The predictors have been sorted by their median.

The plot provides clarity regarding the impact of a given variable as the median is a good estimate of the sign of the variable coefficient. The AlchRel variable box indicates a positive relationship with the dependent variable PH. The most important variable MnfFlow box results in a negative coefficient, and thus an inverse effect on PH. For all the rules in the final Cubist model, the understanding of the coefficient values for each variable provides a clear relationship to the dependent variable despite the obtuse nature of the Cubist model construction.

Top 10 importance: MnfFlow: Negative BallingLvl: Positive Balling: Negative AlchRel: Positive PressureVacuum: N Density: N OxygenFilter: N AirPressurer: N BowlSetpoint: P HydPressure3: P

LM, coeff pos or neg: MnfFlow: N-check BallingLvl: P-check Balling: N-check AlchRel: P-check PressureVacuum: N-check Density: N-check OxygenFilter: N-check AirPressure: N-check BowlSetpoint: P-check HydPressure3: P-check

## Saving 7 x 5 in image

Overall, the final Cubist model results in an RMSE under 0.1 and an \(R^2\) just short of 0.7.

CART Tree

Using the rpart library and its training methods, we consider the best CART based on the RMSE metric. The CART approach follows a classification and regression tree methodology.

set.seed(10393)

library(rpart)
cartGrid = expand.grid( maxdepth = seq( 1, 20, by = 1 )  )

cartControl <- trainControl(method = "boot", 
                            number = 30,  
                            selectionFunction = "best")

(rpartTune = train( x = sdata_trainX_pp, 
                    y = sdata_train_pp$PH ,
                   method = "rpart2" ,
                   metric = "RMSE" ,
                   tuneGrid = cartGrid ,
                   trControl = cartControl ) )

## CART 
## 
## 2055 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (30 reps) 
## Summary of sample sizes: 2055, 2055, 2055, 2055, 2055, 2055, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE       Rsquared   MAE      
##    1        0.1540934  0.2150028  0.1207013
##    2        0.1485972  0.2697600  0.1163125
##    3        0.1439126  0.3150026  0.1130389
##    4        0.1418976  0.3346403  0.1106321
##    5        0.1408007  0.3455111  0.1095038
##    6        0.1398046  0.3557469  0.1085050
##    7        0.1389994  0.3635874  0.1074293
##    8        0.1381955  0.3717156  0.1065335
##    9        0.1373228  0.3802125  0.1057161
##   10        0.1362384  0.3899605  0.1046780
##   11        0.1353910  0.3979319  0.1038336
##   12        0.1345705  0.4055253  0.1028965
##   13        0.1340562  0.4106769  0.1022712
##   14        0.1335638  0.4154690  0.1017284
##   15        0.1330891  0.4202569  0.1012114
##   16        0.1327996  0.4230057  0.1008957
##   17        0.1327006  0.4239743  0.1008136
##   18        0.1326532  0.4245372  0.1007342
##   19        0.1326119  0.4250877  0.1006977
##   20        0.1326119  0.4250877  0.1006977
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 19.

The model output shows increasing tree depth leads to improved RMSE and \(R^2\) values.

The below plot displays the improving RMSE value with the increasing tree depth with an optimal tree depth of 19.

## Saving 7 x 5 in image

Once again, the variable MnfFlow is the most important variable with an importance rating of 100%. Six variables achieve an importance score of greater than 50% with BrandCode, Temperature, and CarbRel each achieving greater than 75% importance.

## Saving 7 x 5 in image

The CART tree plot of the final model reaffirms the importance of the variable MnfFlow as the initial node splits on MnfFlow. One of the second level nodes is BrandCode, another highly important variable to the CART model. Eight ff the top 10 most important variables in the Cubist model are found in the top 15 variables of the CART model. This overlap of variable importance shows a consistency across the different model approaches.

CART top 20: MnfFlow :IN CUBIST TOP 10 BrandCode Temperature CarbRel BallingLvl :IN CUBIST TOP 10 FillerSpeed HyPressure3 :IN CUBIST TOP 10 Usagecont BowlSetpoint :IN CUBIST TOP 10 FillerLevel OxygenFiller :IN CUBIST TOP 10 AirPressurer :IN CUBIST TOP 10 PressureVacuum :IN CUBIST TOP 10 Balling :IN CUBIST TOP 10 FillPressure HydPressure1 PSCFill CarbPressure1 HydPressure2 PressureSetpoint

## Loading required package: libcoin

## 
## Attaching package: 'partykit'

## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp

Overall, the final CART model results in an RMSE of approximately 0.13 and an \(R^2\) less than 0.5.

MARS Model

Now we train a MARS model based on the implementation of earth from the caret library. The MARS approach is a non-linear regression model that doesn’t rely on the original variables but instead on surrogate features.

The tuned MARS model defines the derived features as linear combinations of the initial variables.

## Selected 27 of 45 terms, and 15 of 35 predictors (nprune=27)
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: MnfFlow, BrandCodeC, AlchRel, AirPressurer, Balling, ...
## Number of terms at each degree of interaction: 1 10 16
## GCV 0.0141598    RSS 27.25932    GRSq 0.5260713    RSq 0.5555922

## Call: earth(x=tbl_df[2055,32], y=c(8.36,8.26,8.9...), keepxy=TRUE, degree=2,
##             nprune=27)
## 
##                                                         coefficients
## (Intercept)                                                8.3958776
## h(-0.215889-MnfFlow)                                       0.1780017
## h(HydPressure3- -1.29159)                                  0.0391037
## h(-0.728732-Temperature)                                  -0.0805473
## h(Temperature- -0.728732)                                 -0.0477016
## h(-0.658787-Balling)                                       0.3588441
## h(-0.832915-OxygenFiller)                                 -0.0445311
## h(OxygenFiller- -0.832915)                                -0.0417267
## h(BowlSetpoint- -1.34947)                                  0.0724302
## h(0.640105-AlchRel)                                        0.0395901
## h(AlchRel-0.640105)                                        0.1293063
## h(1.05286-MnfFlow) * BrandCodeC                           -0.0902482
## h(MnfFlow-1.05286) * BrandCodeC                           -1.2433037
## h(-0.215889-MnfFlow) * h(PressureVacuum-0.386629)         -0.0625259
## h(-0.215889-MnfFlow) * h(0.386629-PressureVacuum)         -0.1069156
## h(-0.215889-MnfFlow) * h(AirPressurer-0.839882)           -0.1043474
## h(-0.215889-MnfFlow) * h(0.839882-AirPressurer)            0.0605090
## h(MnfFlow- -0.215889) * h(0.640105-AlchRel)               -0.0669114
## h(1.75812-CarbPressure1) * h(BowlSetpoint- -1.34947)      -0.0120925
## h(0.1987-HydPressure1) * h(-0.658787-Balling)             -0.2036569
## h(0.21892-FillerSpeed) * h(Balling- -0.658787)             0.0158849
## h(FillerSpeed-0.21892) * h(Balling- -0.658787)             0.2165359
## h(Temperature-1.27402) * h(BowlSetpoint- -1.34947)         0.0275573
## h(Usagecont-0.0816887) * h(Balling- -0.658787)            -0.0381250
## h(Density-0.482509) * h(BowlSetpoint- -1.34947)           -0.0578527
## h(0.386629-PressureVacuum) * h(0.640105-AlchRel)           0.0204706
## h(OxygenFiller- -0.832915) * h(AirPressurer- -0.700718)    0.0305074
## 
## Selected 27 of 45 terms, and 15 of 35 predictors (nprune=27)
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: MnfFlow, BrandCodeC, AlchRel, AirPressurer, Balling, ...
## Number of terms at each degree of interaction: 1 10 16
## GCV 0.0141598    RSS 27.25932    GRSq 0.5260713    RSq 0.5555922

The plot of the final MARS model indicates increasing the number of terms leads to a lower RMSE.

## Saving 7 x 5 in image

Once again, the variable MnfFlow achieves an importance score of 100%. Only 3 variables reach the score of 50%, the aforementioned MnfFlow, BrandCodeC, and AlchRel.
The variable MnfFlow is influential because the variable effects many other variables in the final model. Due to the numerous linear combinations including MnfFlow, the relationship to PH is difficult to ascertain. The MnfFlow variable requires more investigation into the variable interactions to determine the impact to PH.

## Saving 7 x 5 in image

Overall, the MARS model reaffirms the importance of the variable MnfFlow and also results in an RMSE and \(R^2\) in line with the previous models.

Linear Regression

We also run a simple OLS regression on PH in terms of the scaled and centered variables. While its predictive performance is weaker than the other models on the training set, the coefficients are much easier to interpret.

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51932 -0.07897  0.00917  0.08626  0.44987 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.586130   0.010428 823.364  < 2e-16 ***
## CarbVolume       -0.005488   0.008389  -0.654 0.513047    
## FillOunces       -0.004105   0.003163  -1.298 0.194451    
## PCVolume         -0.006448   0.003678  -1.753 0.079740 .  
## CarbPressure     -0.003088   0.011767  -0.262 0.793033    
## CarbTemp          0.008475   0.010712   0.791 0.428936    
## PSC              -0.003168   0.003147  -1.007 0.314160    
## PSCFill          -0.005762   0.003066  -1.879 0.060361 .  
## PSCCO2           -0.006170   0.003020  -2.043 0.041221 *  
## MnfFlow          -0.089502   0.006520 -13.728  < 2e-16 ***
## CarbPressure1     0.030465   0.003680   8.278 2.25e-16 ***
## FillPressure      0.005601   0.004501   1.244 0.213528    
## HydPressure1      0.004482   0.005177   0.866 0.386658    
## HydPressure2     -0.027126   0.009963  -2.723 0.006533 ** 
## HydPressure3      0.059090   0.010506   5.624 2.12e-08 ***
## HydPressure4      0.001523   0.004624   0.329 0.741982    
## FillerLevel      -0.021768   0.008733  -2.493 0.012760 *  
## FillerSpeed       0.003863   0.004923   0.785 0.432809    
## Temperature      -0.019234   0.003564  -5.397 7.59e-08 ***
## Usagecont        -0.020299   0.003911  -5.191 2.31e-07 ***
## CarbFlow          0.012499   0.004515   2.768 0.005685 ** 
## Density          -0.018774   0.009023  -2.081 0.037598 *  
## MFR               0.002030   0.003898   0.521 0.602651    
## Balling          -0.091337   0.014856  -6.148 9.43e-10 ***
## PressureVacuum   -0.016446   0.004588  -3.585 0.000345 ***
## OxygenFiller     -0.015182   0.004217  -3.600 0.000326 ***
## BowlSetpoint      0.053190   0.009135   5.823 6.72e-09 ***
## PressureSetpoint -0.013892   0.004500  -3.087 0.002047 ** 
## AirPressurer     -0.002649   0.003204  -0.827 0.408420    
## AlchRel           0.034578   0.011814   2.927 0.003463 ** 
## CarbRel           0.005015   0.006593   0.761 0.447003    
## BallingLvl        0.097856   0.017857   5.480 4.79e-08 ***
## BrandCodeA       -0.095705   0.025430  -3.764 0.000172 ***
## BrandCodeC       -0.139671   0.010233 -13.649  < 2e-16 ***
## BrandCodeD       -0.043954   0.029625  -1.484 0.138054    
## BrandCodeE       -0.057782   0.015435  -3.744 0.000186 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1325 on 2019 degrees of freedom
## Multiple R-squared:  0.4223, Adjusted R-squared:  0.4123 
## F-statistic: 42.16 on 35 and 2019 DF,  p-value: < 2.2e-16

The most significant variables are MnfFlow, CarbPressure, HydPressure3, Temperature, Usagecont, Balling, PressureVacuum, OxygenFiller, BowlSetpoint, BallingLvl, and BrandCode. Seven of these 11 variables are found in the top 10 most important variables of the Cubist model. Upon closer evaluation, the median value of the variable coefficients from the top 10 of the Cubist model match the sign of the coefficient value in the linear regression model. This finding further supports the assessment of the Cubist model variable impacts on the dependent variable PH.

We summarize the plot of coefficients below.

## NULL

## Saving 7 x 5 in image

The diagnostic plots of the OLS model suggest that the histogram of residuals is relatively symmetric and looks normal.
The QQ plot suggests the model fit does not fully capture the fat tails possibly due to a small number of outliers.

Overall, the linear regression model does not perform as well as the Cubist model but does provide a valuable sanity check and baseline to the results from the more advanced models.

Model Selection

Based on the above results, we summarize their training and test RMSE and \(R^2\) below.

The joint results are tabulated below.

# Make a list of model results collect them from caret for both training and test data and compare them jointly in a kable()

model_list = list( rpartTune , marsTuned , gbmTune, cubistTune, linearTune)

model_stats = data.frame()

for( modelObj in model_list )
{
    if( modelObj$method != 'lm' ){
        pred_data <- as.numeric( predict( modelObj, newdata=sdata_testX_pp ))[]
    }
    else
    {
        pred_data <- as.numeric( predict( modelObj, newdata=sdata_test_pp ))
    }
    output <- data.frame( modelName = modelObj$method ,
                     trainRMSE = getTrainPerf(modelObj)[1,1] ,
                     trainR2   = getTrainPerf(modelObj)[1,2] ,
                     testRMSE  = caret::RMSE( pred_data,  sdata_test_pp$PH) ,
                     testR2    = caret::R2(   pred_data,  sdata_test_pp$PH) 
                     )

    model_stats <- rbind(model_stats, output)
}

Model Results Comparison

modelName	trainRMSE	trainR2	testRMSE	testR2
cubist	0.095	0.694	0.090	0.722
rpart2	0.133	0.425	0.125	0.474
lm	0.134	0.396	0.127	0.449
gbm	0.132	0.456	0.151	0.342
earth	0.125	0.486	0.153	0.279

## save_kable will have the best result with magick installed.

We may also inspect the plot of the predicted versus the observed data for the 5 models. It is clear below that cubist outperforms all others on both the training set in the upper row and also the test data on the lower rows.

Discussion

We conclude that the best model is Cubist but the multilinear regression model, MARS and CART models provide useful insight on the main predictors.

We would recommend the use of the Cubist model for prediction
However, the OLS (ordinary least squares) linear regression help us to understand the sign and magnitude of the most important variables.
The MARS model helps to detect non-linearity and predictor interaction in prediction.

The top 5 variables that impact PH are:

MnfFlow - this appears negatively related to PH level based on OLS, MARS and Cubist median plot.
BallingLvl is positively related to PH level.
PreserveVacuum is negatively related to PH.
AlchRel is positively related to PH.
Balling is negatively related to PH.

Lastly, we observe that Brand is probably not a manufacturing predictor that can be controlled. Mostly likely Brand refers to a recipe for the final product. However, knowing the desired optimal PH for each Brand may help us control the most influential variables that control PH, thereby providing better quality control. While there are over 30 predictors in the dataset, we recommend looking at the top 10 most influential predictors. Most of the models agree substantially on which predictors to include in the top 10.

For example, the linear regression model coefficients agrees with the sign and variable importance of the top 10 predictors of the Cubist model. Importance predictors of the OLS model has the highest statistical significance - consistent with variable importance measure of caret.

Final Predictions

## Cubist 
## 
## 2567 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2310, 2309, 2310, 2310, 2310, 2311, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE        Rsquared   MAE       
##    10         0          0.09786899  0.6814835  0.07063050
##    10         1          0.10347326  0.6605109  0.07157433
##    10         5          0.09205621  0.7148932  0.06565672
##    10         9          0.09275054  0.7104184  0.06629558
##    50         0          0.09733927  0.6877932  0.07034005
##    50         1          0.10284267  0.6632483  0.07070141
##    50         5          0.09105650  0.7209833  0.06455347
##    50         9          0.09179098  0.7169196  0.06532125
##   100         0          0.09682465  0.6925441  0.07016957
##   100         1          0.10246532  0.6656340  0.07043498
##   100         5          0.09042487  0.7248834  0.06419195
##   100         9          0.09121584  0.7206745  0.06503557
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 100 and neighbors = 5.

## [1] 0.09042487

## [1] 0.7248834

##   [1] 8.553383 8.416589 8.484277 8.651452 8.430888 8.520683 8.464472 8.500473
##   [9] 8.533825 8.642326 8.481946 8.513762 8.484258 8.562369 8.253350 8.676709
##  [17] 8.508010 8.425731 8.456175 8.688740 8.687377 8.773614 8.658579 8.455132
##  [25] 8.580982 8.493421 8.357762 8.653392 8.730004 8.739744 8.614525 8.791067
##  [33] 8.657874 8.693810 8.684228 8.568902 8.508800 8.523384 8.651680 8.814646
##  [41] 8.789721 8.406401 8.174829 8.301746 8.767859 8.762051 8.762412 8.744424
##  [49] 8.757907 8.791381 8.694399 8.632868 8.693358 8.807720 8.808817 8.768996
##  [57] 8.478882 8.436927 8.740316 8.800134 8.814241 8.727329 8.821269 8.821293
##  [65] 8.787587 8.784393 8.537858 8.572862 8.522822 8.500213 8.414266 8.274856
##  [73] 8.532194 8.499533 8.497195 8.504463 8.522461 8.732538 8.709569 8.696074
##  [81] 8.828541 8.809374 8.774046 8.738761 8.834514 8.772323 8.752558 8.662767
##  [89] 8.576780 8.724874 8.646579 9.032428 8.443877 8.493906 8.582039 8.622660
##  [97] 8.631973 8.676289 8.704071 8.681482 8.686296 8.760990 8.683686 8.704911
## [105] 8.826912 8.782175 8.524895 8.464222 8.538521 8.513417 8.693749 8.678286
## [113] 8.727077 8.713489 8.642420 8.747110 8.738415 8.780863 8.731585 8.704364
## [121] 8.703690 8.771915 8.788529 8.689462 8.560857 8.477674 8.485120 8.301046
## [129] 8.441027 8.476292 8.508128 8.520318 8.475845 8.499913 8.433493 8.488183
## [137] 8.502113 8.435307 8.413801 8.441218 8.462484 8.477620 8.455779 8.522523
## [145] 8.388823 8.556230 8.488006 8.600226 8.546601 8.595560 8.623026 8.622278
## [153] 8.483624 8.716797 8.448185 8.452550 8.439169 8.525030 8.553894 8.560632
## [161] 8.601017 8.627051 8.631309 8.669895 8.650928 8.582525 8.481905 8.776589
## [169] 8.688186 8.830791 8.479057 8.565610 8.293248 8.450176 8.482644 8.568878
## [177] 8.418387 8.441730 8.507828 8.456392 8.425079 8.525384 8.504602 8.496831
## [185] 8.518225 8.313547 8.315780 8.503910 8.440440 8.322478 8.361710 8.519551
## [193] 8.500801 8.525855 8.471058 8.272171 8.304795 8.189250 8.225987 8.475965
## [201] 8.390318 8.437284 8.519439 8.446362 8.373441 8.497079 8.504481 8.430357
## [209] 8.425136 8.236720 8.303498 8.340604 8.454713 8.509512 8.484015 8.551454
## [217] 8.354124 8.372385 8.516407 8.484663 8.537519 8.516002 8.400567 8.338906
## [225] 8.606352 8.475992 8.436245 8.394827 8.532536 8.460215 8.510155 8.471123
## [233] 8.514993 8.584372 8.599546 8.494669 8.657253 8.696697 8.466996 8.509019
## [241] 8.478971 8.569809 8.481023 8.543774 8.392702 8.441716 8.388472 8.501712
## [249] 8.589886 8.488894 8.406670 8.469671 8.534724 8.541173 8.549125 8.438866
## [257] 8.483340 8.713450 8.524870 8.523258 8.640900 8.616673 8.491389 8.613351
## [265] 8.271590 8.253966 8.119421

Code

We summarize all the R code used in this project in this appendix for ease of reading.

knitr::opts_chunk$set(echo = FALSE)
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages( library(tidyverse))
suppressPackageStartupMessages(library(kableExtra))
library(cowplot)
library(skimr)

library(RColorBrewer)
library(caret)
library(Cubist)
library(rpart)
suppressPackageStartupMessages(library(gbm))
library(party)
library(tictoc)
tic()
.code-bg {
  background-color: lightgray;
}

sdata_train_pp = read_csv("sdata_train_pp.csv", 
                          col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% 
  as_tibble() %>% remove_rownames()

sdata_test_pp  = read_csv("sdata_test_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% 
  as_tibble() %>% remove_rownames()

sdata_trainX_pp  = sdata_train_pp %>% select( -PH, -id )
sdata_testX_pp   = sdata_test_pp %>% select(-PH , -id)
tuningFull = TRUE   # Set to TRUE if running the lengthy tuning process for all models
set.seed(1027)

gbmControlFull <- trainControl(method = "repeatedcv", 
                           number = 10,  # for debug mode, set this to a low number (1)
                           repeats = 10 ,
                           selectionFunction = "best", 
                           verboseIter = FALSE)

gbmControlLight <- trainControl(method = "repeatedcv", 
                           number = 5,  # for debug mode, set this to a low number (1)
                           repeats = 1 ,
                           selectionFunction = "best", 
                           verboseIter = FALSE)

if(tuningFull == TRUE)
{
   gbmControl = gbmControlFull
   gbmTune = expand.grid( n.trees = seq(250, 1000, by = 250),
                                                  shrinkage = c( 0.02, 0.05,  0.1) ,
                                                  interaction.depth = c(  3, 7) ,
                                                  n.minobsinnode = 15
                                                  )
} else  {
   gbmControl = gbmControlLight
   gbmTune =  expand.grid( n.trees = c( 2000), 
                                                  shrinkage = c( 0.02, 0.05 , 0.1 ) ,
                                                  interaction.depth = c(  3, 7) ,
                                                  n.minobsinnode = 4
                                                  )
}

(gbmTune = caret::train( x = sdata_trainX_pp, 
                          y = sdata_train_pp$PH , 
                          method = "gbm",
                          tuneGrid =  gbmTune ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = gbmControl ) )
gbm_plot <- ggplot(gbmTune) + labs(title = "Gradient Boosted Trees - Tuning")
gbm_plot
ggsave("GradientBoostedTreesTurning.jpg", gbm_plot)
gbm_vi_plot <- ggplot(varImp(gbmTune, numTrees = gbmTune$finalModel$n.trees) ) + labs(title = "Gradient Boosted Trees - Variance Importance")
gbm_vi_plot
ggsave("GbmVarImp.jpg", gbm_vi_plot)
set.seed(1027)

cubistControlFull <- trainControl(method = "cv" ,  selectionFunction = "best")
tuneGridFull  <- expand.grid( committees = c( 10, 50, 100 ) ,
                              neighbors = c( 0, 1, 5, 9 )
                                                  ) 

cubistControlLight <- trainControl(method = "cv" ,  selectionFunction = "best")
tuneGridLight <- expand.grid( committees = c( 10, 20 ) , 
                              neighbors = c( 0, 5 )  )

if(tuningFull == TRUE)
{
   cubistControl = cubistControlFull
   cubistGrid = tuneGridFull
} else  {
   cubistControl = cubistControlLight
   cubistGrid = tuneGridLight
}

(cubistTune = caret::train( x = sdata_trainX_pp, 
                          y = sdata_train_pp$PH , 
                          method = "cubist",
                          tuneGrid = cubistGrid ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = cubistControl ) )
cubist_plot <- ggplot(cubistTune) + labs(title="Cubist Model Tuning")
cubist_plot
ggsave("CubistTuning.jpg", cubist_plot)
cubist_vi_plot <- ggplot(varImp(cubistTune), top = 20) + labs(title = "Cubist Model Variable Importance")
cubist_vi_plot
ggsave("CubistVarImp.jpg", cubist_vi_plot)
cubistTune$finalModel
cubistTune$finalModel$usage %>% arrange( desc(Model+ Conditions)) %>% 
  kable(caption = "Cubist Predictor Usage") %>% kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
cubist_splitsplot <- dotplot( cubistTune$finalModel, what = "splits" )
cubist_splitsplot
# Output to file
jpeg(file="CubistSplitsPlot.jpg")
cubist_splitsplot
dev.off()
cubist_dotplot <- dotplot( cubistTune$finalModel, 
         what = "coefs", 
         between = list(x = 0.2, y = 0.5) , 
         scales = list(x = list(relation = "free"),  
                       y = list(cex = 0.25)  ) )
cubist_dotplot
jpeg(file="CubistDotPlot.jpg")
cubist_splitsplot
dev.off()
cubistTune$finalModel$coefficients %>% 
  rownames_to_column(var="id") %>% 
  select(-committee, -rule) %>% 
  pivot_longer(!id, names_to = "predictor", values_to = "coef" ) %>% 
  filter( !is.na(coef)) %>% 
  filter( predictor != '(Intercept)' )  %>%
  filter(abs(coef) < 20 ) -> coef_piv

cubist_boxplot <- coef_piv %>% 
  ggplot() + 
  geom_boxplot(
    aes(x=reorder(predictor, coef, FUN = median, na.rm = TRUE ),  # order the boxplots by median
        y = coef ), outlier.shape = NA ) +  # strip out the outliers
  coord_flip(ylim=c(-.5,.5)) +
  labs(title = "Distribution of Coefficients of Predictors", 
       subtitle = "from Cubist Committee/Rules",
       x = "Predictors" ,
       y = "Coefficient in Linear Model"
       )

cubist_boxplot
ggsave("CubistBoxPlot.jpg", cubist_boxplot)
set.seed(10393)

library(rpart)
cartGrid = expand.grid( maxdepth = seq( 1, 20, by = 1 )  )

cartControl <- trainControl(method = "boot", 
                            number = 30,  
                            selectionFunction = "best")

(rpartTune = train( x = sdata_trainX_pp, 
                    y = sdata_train_pp$PH ,
                   method = "rpart2" ,
                   metric = "RMSE" ,
                   tuneGrid = cartGrid ,
                   trControl = cartControl ) )
cart_tuning_plot <- ggplot(rpartTune) + labs(title="CART Tree Tuning")
cart_tuning_plot
ggsave("CartTuningPlot.jpg", cart_tuning_plot)
cart_vi_plot <- ggplot(varImp( rpartTune), top = 20 ) + labs(title = "CART Tree Variable Importance")
cart_vi_plot
ggsave("CartVarImp.jpg", cart_vi_plot)
library(rpart.plot)
library(partykit)

rpart.plot(rpartTune$finalModel, digits = 3)

if( tuningFull == TRUE )
{
    marsGrid = expand.grid( .degree = 1:2, .nprune = seq(2,27, by=5) )
    marsControl = trainControl(method = "cv", 
                     number = 10, 
                
                     selectionFunction = "best", 
                     verboseIter = FALSE)
} else {
    marsGrid = expand.grid( .degree = 1:2, .nprune = seq(2,25, by = 10 ) )
    
    marsControl = trainControl(method = "cv", 
                     number = 10, 
                     
                     selectionFunction = "best", 
                     verboseIter = FALSE)
}
    

set.seed(1000)

marsTuned = caret::train(x = sdata_trainX_pp , 
                         y = sdata_train_pp$PH, 
                         method = "earth" ,
                         metric = "RMSE" , 
                         tuneGrid = marsGrid ,
                         trControl = marsControl )
marsTuned$finalModel

summary(marsTuned)
mars_tuning_plot <- ggplot(marsTuned) + labs(title="MARS Model Tuning" )
mars_tuning_plot
ggsave("MarsTuningPlot.jpg", mars_tuning_plot)
mars_vi_plot <- ggplot( varImp(marsTuned), 20 ) + labs(title = "MARS Model Variable Importance")
mars_vi_plot
ggsave("MarsVarImp.jpg", mars_vi_plot)
set.seed(123)

train.control <- trainControl( method = "cv", number = 5)

linearTune = train( PH ~ . -id, 
                    data = sdata_train_pp, 
                    method = "lm", 
                    tuneGrid = data.frame(intercept = TRUE),
                    trControl = train.control)

summary(linearTune$finalModel)
par(mfrow=c(2,2))

lm_final_model_plot <- plot(linearTune$finalModel)
lm_final_model_plot
ggsave("LinearModelFinalPlot.jpg", lm_final_model_plot)
hist(linearTune$finalModel$residuals)

# Make a list of model results collect them from caret for both training and test data and compare them jointly in a kable()

model_list = list( rpartTune , marsTuned , gbmTune, cubistTune, linearTune)

model_stats = data.frame()

for( modelObj in model_list )
{
    if( modelObj$method != 'lm' ){
        pred_data <- as.numeric( predict( modelObj, newdata=sdata_testX_pp ))[]
    }
    else
    {
        pred_data <- as.numeric( predict( modelObj, newdata=sdata_test_pp ))
    }
    output <- data.frame( modelName = modelObj$method ,
                     trainRMSE = getTrainPerf(modelObj)[1,1] ,
                     trainR2   = getTrainPerf(modelObj)[1,2] ,
                     testRMSE  = caret::RMSE( pred_data,  sdata_test_pp$PH) ,
                     testR2    = caret::R2(   pred_data,  sdata_test_pp$PH) 
                     )

    model_stats <- rbind(model_stats, output)
}

results_table <- model_stats %>% as_tibble() %>% 
  arrange( testRMSE) %>%
  kable( digits = 3, caption = "Model Results Comparison") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

results_table

save_kable(results_table, "ModelResultsTable.jpg")
predVals = extractPrediction(list(marsTuned, rpartTune, cubistTune, gbmTune), testX = sdata_testX_pp , testY = sdata_test_pp$PH )

# Attempting to hack into the extract prediction object with the LM data
# Cols: obs pred model dataType object
pred_lm =  predict(linearTune, newdata = sdata_test_pp)
pred_lm_train <- linearTune$finalModel$fitted.values

# Create placeholder data for the training values
model_l <- rep("LM", length = 2055)
dt_l <- rep("Training", length = 2055)
obj_l <- rep("Object9", length = 2055)

# Create dataframe of the training data of LM
lm_train_preds <- as.data.frame(cbind(obs=sdata_train_pp$PH, pred=pred_lm_train, model=model_l, dataType=dt_l, object=obj_l), row.names = NULL)

# Create placeholder data for the test values
model_l <- rep("LM", length = 512)
dt_l <- rep("Test", length = 512)
obj_l <- rep("Object10", length = 512)

# Create dataframe of the test data of LM
lm_test_preds <- as.data.frame(cbind(obs=sdata_test_pp$PH, pred=pred_lm, model=model_l, dataType=dt_l, object=obj_l), row.names = NULL)

# Combine all data for plotting
all_preds <- rbind(predVals, lm_train_preds, lm_test_preds)

# Ensure obs and preds are numerics
all_preds$obs <- as.numeric(all_preds$obs)
all_preds$pred <- as.numeric(all_preds$pred)

# Plot using Caret built-in reporting.
# --------------------------------------
obsVsPredsPlot <- plotObsVsPred(all_preds)
obsVsPredsPlot
#ggsave("ObsVsPredsPlot.jpg", obsVsPredsPlot)
# Combine the initial training and test datasets into 1
sdata_trainfull_pp <- rbind(sdata_train_pp, sdata_test_pp)
sdata_trainfullX_pp <- sdata_trainfull_pp %>% select(-PH , -id)

edata_test_pp = read_csv("edata_test_pp.csv", 
                          col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% 
  as_tibble() %>% remove_rownames()

# Train the Cubist model
set.seed(1027)

if(tuningFull == TRUE)
{
   cubistControl = cubistControlFull
   cubistGrid = tuneGridFull
} else  {
   cubistControl = cubistControlLight
   cubistGrid = tuneGridLight
}



(cubistTune = caret::train( x = sdata_trainfullX_pp, 
                          y = sdata_trainfull_pp$PH , 
                          method = "cubist",
                          tuneGrid = cubistGrid ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = cubistControl ) )

# Make predictions on the provided dataset
final_pred_data <- as.numeric( predict( cubistTune, newdata=edata_test_pp ))

# Output prediction performance
full_cubist_train_rmse <- getTrainPerf(cubistTune)[1,1]
full_cubist_train_r2   <- getTrainPerf(cubistTune)[1,2]

full_cubist_train_rmse
full_cubist_train_r2

# Output the results to file
# TODO
final_pred_data

toc()

Tuning full is TRUE.

9am on 12/8/2021 Tuning full is FALSE. (589.958 sec elapsed)

## 4120.739 sec elapsed

Technical Report: Determinants of PH in Beverage Process

Alexander Ng, Philip Tanofsky

Due 12/13/2021

Overview

Introduction

Importing Data