This technical report tunes multiple models to predict beverage PH and identifies the key determinants of PH among the manufacturing-process predictors. We train and test five models for this purpose: Cubist, gradient boosted trees, CART, MARS, and linear regression.
The training and test performance were XXXPSTXXX.
We load the pre-processed datasets for training and testing in this section. We read the CSV file for the training dataset below and display the first few column characteristics from the file.
sdata_train_pp = read_csv("sdata_train_pp.csv",
                          col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>%
  as_tibble() %>% remove_rownames()

sdata_test_pp = read_csv("sdata_test_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>%
  as_tibble() %>% remove_rownames()

sdata_trainX_pp = sdata_train_pp %>% select( -PH, -id )
sdata_testX_pp  = sdata_test_pp %>% select( -PH , -id )

tuningFull = TRUE # Set to TRUE if running the lengthy tuning process for all models
We consider the same dataset and variable importance problem using gradient boosted trees. The gbm library and its variable importance measures are run within the caret training procedure. Gradient boosting builds an ensemble of weak learners in an additive, stage-wise manner, with each new learner fit to reduce the given loss function.
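To make the additive, stage-wise idea concrete, the following minimal sketch (illustrative only, not the tuned caret/gbm pipeline used in this report; it assumes the sdata_trainX_pp and sdata_train_pp objects loaded above) fits a sequence of shallow rpart trees to the residuals of the running prediction, which is gradient boosting under squared-error loss.

# Illustrative sketch of boosting with squared-error loss (not the caret/gbm pipeline)
library(rpart)

boost_sketch <- function(X, y, n_trees = 50, shrinkage = 0.05, depth = 3) {
  pred  <- rep(mean(y), length(y))          # start from the constant model
  trees <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    resid <- y - pred                        # negative gradient of squared-error loss
    fit <- rpart(resid ~ ., data = data.frame(X, resid = resid),
                 control = rpart.control(maxdepth = depth, cp = 0))
    trees[[m]] <- fit
    pred <- pred + shrinkage * predict(fit, data.frame(X))  # small additive step
  }
  list(trees = trees, train_rmse = sqrt(mean((y - pred)^2)))
}

# Example (assumes sdata_trainX_pp / sdata_train_pp exist as loaded above):
# boost_sketch(sdata_trainX_pp, sdata_train_pp$PH)$train_rmse

Each iteration adds only a shrunken fraction of the new tree's fit, which is why smaller shrinkage values typically need more trees.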
set.seed(1027)
gbmControlFull <- trainControl(method = "repeatedcv",
                               number = 10, # for debug mode, set this to a low number (1)
                               repeats = 10 ,
                               selectionFunction = "best",
                               verboseIter = FALSE)

gbmControlLight <- trainControl(method = "repeatedcv",
                                number = 5, # for debug mode, set this to a low number (1)
                                repeats = 1 ,
                                selectionFunction = "best",
                                verboseIter = FALSE)

if(tuningFull == TRUE)
{
  gbmControl = gbmControlFull
  gbmTune = expand.grid( n.trees = seq(250, 1000, by = 250),
                         shrinkage = c( 0.02, 0.05, 0.1) ,
                         interaction.depth = c( 3, 7) ,
                         n.minobsinnode = 15
                         )
} else {
  gbmControl = gbmControlLight
  gbmTune = expand.grid( n.trees = c( 2000),
                         shrinkage = c( 0.02, 0.05 , 0.1 ) ,
                         interaction.depth = c( 3, 7) ,
                         n.minobsinnode = 4
                         )
}

(gbmTune = caret::train( x = sdata_trainX_pp,
                         y = sdata_train_pp$PH ,
                         method = "gbm",
                         tuneGrid = gbmTune ,
                         verbose = FALSE,
                         metric = "RMSE" ,
                         trControl = gbmControl ) )
## Stochastic Gradient Boosting
##
## 2055 samples
## 32 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 1850, 1849, 1849, 1851, 1850, 1849, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.02 3 250 0.1434388 0.3444707 0.1130682
## 0.02 3 500 0.1399058 0.3781026 0.1093679
## 0.02 3 750 0.1381834 0.3948353 0.1075680
## 0.02 3 1000 0.1371452 0.4052631 0.1065308
## 0.02 7 250 0.1375659 0.3998528 0.1072464
## 0.02 7 500 0.1339780 0.4329578 0.1037270
## 0.02 7 750 0.1326790 0.4454112 0.1025093
## 0.02 7 1000 0.1320687 0.4515818 0.1019326
## 0.05 3 250 0.1391815 0.3845878 0.1085746
## 0.05 3 500 0.1370316 0.4073504 0.1063473
## 0.05 3 750 0.1364003 0.4163621 0.1056836
## 0.05 3 1000 0.1360908 0.4213742 0.1053774
## 0.05 7 250 0.1337953 0.4330017 0.1032590
## 0.05 7 500 0.1324980 0.4471243 0.1018901
## 0.05 7 750 0.1319808 0.4530628 0.1013538
## 0.05 7 1000 0.1317524 0.4561859 0.1010493
## 0.10 3 250 0.1379920 0.3994445 0.1072680
## 0.10 3 500 0.1371891 0.4134839 0.1064164
## 0.10 3 750 0.1368470 0.4199454 0.1059928
## 0.10 3 1000 0.1366708 0.4242737 0.1056828
## 0.10 7 250 0.1343453 0.4341820 0.1034168
## 0.10 7 500 0.1335705 0.4438374 0.1027183
## 0.10 7 750 0.1332796 0.4472765 0.1023144
## 0.10 7 1000 0.1332705 0.4482689 0.1022585
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 15
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
## 7, shrinkage = 0.05 and n.minobsinnode = 15.
As the model output indicates, the greater tree depth outperforms the shallower trees, and the lower shrinkage values generally produce lower (better) RMSE values. The shrinkage, or learning rate, works better at smaller values, but smaller values require more trees and longer processing time.
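For reference, the winning hyper-parameter combination and its resampling summary can be read directly from the caret object; a short sketch using standard caret accessors:

# Best hyper-parameters selected by RMSE, and the matching resampling row
gbmTune$bestTune                          # n.trees, interaction.depth, shrinkage, n.minobsinnode
merge(gbmTune$bestTune, gbmTune$results)  # RMSE / Rsquared / MAE for that combination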
The plot of the tuned model clearly shows the better performance with the increased maximum tree depth, but it also indicates that a lower shrinkage value does not guarantee a lower RMSE: for a maximum tree depth of 3, the lowest shrinkage value does not produce the lowest RMSE.
The variable importance plot shows the 100% importance of MnfFlow, followed by BrandCode as the only other variable with an importance above 50%. Usagecont, OxygenFiller, and Temperature round out the top five variables based on importance, each with a score greater than 25%.
Overall, the final Gradient Boosted model results in an RMSE of approximately 0.13 and an \(R^2\) less than 0.5.
Now we train a Cubist model using the "best" selectionFunction on RMSE. The Cubist library and its variable importance measures are run within the caret training procedure.
The Cubist model is a rules-based approach in which the terminal leaves contain linear regression models. Each leaf model is adjusted using the models fit at the previous nodes along the path of the tree.
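As a minimal illustration of this rule-plus-linear-model structure (a standalone sketch outside the caret tuning loop below; it assumes the training objects defined earlier and the Cubist package), the fitted rules and their terminal regressions can be inspected with summary():

library(Cubist)

# Small standalone Cubist fit, only to inspect the printed rules and leaf models
cb_demo <- cubist(x = sdata_trainX_pp, y = sdata_train_pp$PH, committees = 5)
summary(cb_demo)   # each rule's conditions followed by its terminal linear model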
set.seed(1027)
cubistControlFull <- trainControl(method = "cv" , selectionFunction = "best")
tuneGridFull <- expand.grid( committees = c( 10, 50, 100 ) ,
                             neighbors = c( 0, 1, 5, 9 )
                             )

cubistControlLight <- trainControl(method = "cv" , selectionFunction = "best")
tuneGridLight <- expand.grid( committees = c( 10, 20 ) ,
                              neighbors = c( 0, 5 ) )

if(tuningFull == TRUE)
{
  cubistControl = cubistControlFull
  cubistGrid = tuneGridFull
} else {
  cubistControl = cubistControlLight
  cubistGrid = tuneGridLight
}

(cubistTune = caret::train( x = sdata_trainX_pp,
                            y = sdata_train_pp$PH ,
                            method = "cubist",
                            tuneGrid = cubistGrid ,
                            verbose = FALSE,
                            metric = "RMSE" ,
                            trControl = cubistControl ) )
## Cubist
##
## 2055 samples
## 32 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1850, 1849, 1849, 1851, 1850, 1849, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 10 0 0.10326126 0.6469501 0.07521436
## 10 1 0.10698108 0.6394174 0.07408043
## 10 5 0.09661241 0.6865169 0.06923965
## 10 9 0.09715807 0.6838488 0.06954861
## 50 0 0.10213076 0.6574062 0.07362497
## 50 1 0.10583349 0.6449193 0.07208654
## 50 5 0.09539318 0.6943587 0.06757449
## 50 9 0.09598916 0.6917813 0.06811918
## 100 0 0.10218813 0.6576706 0.07366847
## 100 1 0.10587327 0.6449503 0.07201655
## 100 5 0.09547621 0.6941947 0.06752479
## 100 9 0.09606335 0.6917983 0.06810842
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 50 and neighbors = 5.
The Cubist model achieves a strong RMSE and \(R^2\), with the best result at 50 committees and 5 neighbors. The line plot confirms that RMSE improves as committees are added and when instance-based correction (neighbors) is used.
The variable importance plot identifies MnfFlow as the most valuable variable, followed by BallingLvl, Balling, AlchRel, and PressureVacuum to round out the top 5 variables. Overall, 8 variables achieve over 50% importance, indicating those variables are contained in over 50% of the rules defining the Cubist model.
The final Cubist model confirms the high usage rate of the variable MnfFlow.
Cubist Predictor Usage
Conditions | Model | Variable |
---|---|---|
82 | 61 | MnfFlow |
18 | 81 | Balling |
22 | 71 | BallingLvl |
19 | 73 | AlchRel |
28 | 60 | PressureVacuum |
19 | 63 | OxygenFiller |
14 | 66 | Density |
21 | 51 | BowlSetpoint |
8 | 61 | HydPressure3 |
7 | 62 | Temperature |
9 | 59 | AirPressurer |
18 | 47 | CarbFlow |
6 | 58 | CarbRel |
3 | 54 | HydPressure2 |
3 | 52 | CarbPressure1 |
11 | 40 | Usagecont |
16 | 31 | FillerSpeed |
6 | 39 | HydPressure1 |
41 | 0 | BrandCode |
3 | 33 | FillerLevel |
0 | 27 | CarbPressure |
3 | 20 | CarbVolume |
2 | 21 | PCVolume |
0 | 20 | CarbTemp |
0 | 20 | PressureSetpoint |
0 | 17 | FillPressure |
0 | 16 | HydPressure4 |
0 | 16 | PSCFill |
0 | 6 | MFR |
0 | 5 | FillOunces |
0 | 4 | PSC |
0 | 3 | PSCCO2 |
The splits plot indicates the usage of each variable by committee/rule. The values are normalized and highlight the high usage rate of MnfFlow, along with the use of both greater-than and less-than inequalities for MnfFlow.
The dotplot allows us to see the distribution of the coefficient parameters for each linear model associated with a committee/rule. When a predictor has more dots, it is used in more models, which suggests it is influential. The color density is proportional to the variable importance. The sign of a coefficient tells us whether the marginal effect of the predictor on the response PH is positive or negative. However, since Cubist combines many models, the concept of the sign of a predictor coefficient is unclear. The x-axis value of each dot indicates the true coefficient in the original units, unlike the splits plot above, which normalizes the x-axis.
The boxplot below, which we devised, constructs a distribution of coefficients for each predictor. We interpret the sign of the marginal effect of a predictor \(X\) to be the sign of the median of the distribution of coefficients of \(X\) over those models which include \(X\). The predictors have been sorted by their median.
The plot provides clarity regarding the impact of a given variable, as the median is a good estimate of the sign of the variable's coefficient. The AlchRel box indicates a positive relationship with the dependent variable PH. The most important variable, MnfFlow, has a negative median coefficient, and thus an inverse effect on PH. Across all the rules in the final Cubist model, this view of the coefficient values provides a clear relationship to the dependent variable despite the opaque nature of the Cubist model construction.
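The sign summary that follows can be reproduced from the final model's coefficient table; this sketch (assuming the tuned cubistTune object above and the tidyverse verbs already loaded) takes the median coefficient of each predictor across the committee/rule models:

# Median coefficient per predictor across all Cubist committee/rule models
cubistTune$finalModel$coefficients %>%
  select(-committee, -rule, -`(Intercept)`) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "coef") %>%
  filter(!is.na(coef)) %>%
  group_by(predictor) %>%
  summarise(median_coef = median(coef),
            sign = ifelse(median_coef > 0, "Positive", "Negative")) %>%
  arrange(desc(abs(median_coef)))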
Top 10 Cubist variables by importance, with the sign of their median coefficient: MnfFlow: negative; BallingLvl: positive; Balling: negative; AlchRel: positive; PressureVacuum: negative; Density: negative; OxygenFiller: negative; AirPressurer: negative; BowlSetpoint: positive; HydPressure3: positive.
The corresponding OLS coefficient signs agree for all ten: MnfFlow, Balling, PressureVacuum, Density, OxygenFiller, and AirPressurer are negative, while BallingLvl, AlchRel, BowlSetpoint, and HydPressure3 are positive.
Overall, the final Cubist model results in an RMSE under 0.1 and an \(R^2\) just short of 0.7.
Using the rpart library and its training methods, we consider the best CART model based on the RMSE metric. The CART approach follows a classification and regression tree methodology.
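Note that caret's "rpart2" method tunes the maximum tree depth rather than the complexity parameter. A direct rpart equivalent, shown only as a sketch under the objects defined above, would be:

library(rpart)

# Direct rpart fit with an explicit depth limit (cp = 0 so depth is the binding constraint)
cart_demo <- rpart(PH ~ . - id, data = sdata_train_pp,
                   control = rpart.control(maxdepth = 10, cp = 0))
printcp(cart_demo)   # complexity table for the grown tree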
set.seed(10393)
library(rpart)
cartGrid = expand.grid( maxdepth = seq( 1, 20, by = 1 ) )

cartControl <- trainControl(method = "boot",
                            number = 30,
                            selectionFunction = "best")

(rpartTune = train( x = sdata_trainX_pp,
                    y = sdata_train_pp$PH ,
                    method = "rpart2" ,
                    metric = "RMSE" ,
                    tuneGrid = cartGrid ,
                    trControl = cartControl ) )
## CART
##
## 2055 samples
## 32 predictor
##
## No pre-processing
## Resampling: Bootstrapped (30 reps)
## Summary of sample sizes: 2055, 2055, 2055, 2055, 2055, 2055, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 0.1540934 0.2150028 0.1207013
## 2 0.1485972 0.2697600 0.1163125
## 3 0.1439126 0.3150026 0.1130389
## 4 0.1418976 0.3346403 0.1106321
## 5 0.1408007 0.3455111 0.1095038
## 6 0.1398046 0.3557469 0.1085050
## 7 0.1389994 0.3635874 0.1074293
## 8 0.1381955 0.3717156 0.1065335
## 9 0.1373228 0.3802125 0.1057161
## 10 0.1362384 0.3899605 0.1046780
## 11 0.1353910 0.3979319 0.1038336
## 12 0.1345705 0.4055253 0.1028965
## 13 0.1340562 0.4106769 0.1022712
## 14 0.1335638 0.4154690 0.1017284
## 15 0.1330891 0.4202569 0.1012114
## 16 0.1327996 0.4230057 0.1008957
## 17 0.1327006 0.4239743 0.1008136
## 18 0.1326532 0.4245372 0.1007342
## 19 0.1326119 0.4250877 0.1006977
## 20 0.1326119 0.4250877 0.1006977
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 19.
The model output shows that increasing tree depth leads to improved RMSE and \(R^2\) values.
The plot below displays the improving RMSE with increasing tree depth, with an optimal depth of 19.
Once again, the variable MnfFlow is the most important variable, with an importance rating of 100%. Six variables achieve an importance score greater than 50%, with BrandCode, Temperature, and CarbRel each achieving greater than 75% importance.
The CART tree plot of the final model reaffirms the importance of the variable MnfFlow, as the initial node splits on MnfFlow. One of the second-level nodes is BrandCode, another highly important variable to the CART model. Eight of the top 10 most important variables in the Cubist model are found in the top 15 variables of the CART model. This overlap of variable importance shows a consistency across the different modeling approaches.
CART top 20 variables (those also in the Cubist top 10 are marked): MnfFlow (Cubist top 10), BrandCode, Temperature, CarbRel, BallingLvl (Cubist top 10), FillerSpeed, HydPressure3 (Cubist top 10), Usagecont, BowlSetpoint (Cubist top 10), FillerLevel, OxygenFiller (Cubist top 10), AirPressurer (Cubist top 10), PressureVacuum (Cubist top 10), Balling (Cubist top 10), FillPressure, HydPressure1, PSCFill, CarbPressure1, HydPressure2, PressureSetpoint.
Overall, the final CART model results in an RMSE of approximately 0.13 and an \(R^2\) less than 0.5.
Now we train a MARS model using the earth implementation through the caret library. The MARS approach is a non-linear regression model that does not rely on the original variables directly but instead on surrogate features. The tuned MARS model defines these derived features as hinge functions of the initial variables and products of such hinges for interaction terms.
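Each term in the fitted model reported below is a hinge function h(u) = max(0, u) of a single (centered and scaled) predictor, or a product of two hinges for the degree-2 interaction terms. As a small sketch using coefficients taken from the model output, the contribution of the first MnfFlow term can be evaluated directly:

# MARS hinge function: h(u) = max(0, u)
h <- function(u) pmax(0, u)

# Contribution of the term 0.1780017 * h(-0.215889 - MnfFlow)
# for a few (scaled) MnfFlow values
mnf_flow <- c(-1, -0.5, 0, 0.5)
0.1780017 * h(-0.215889 - mnf_flow)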
## Selected 27 of 45 terms, and 15 of 35 predictors (nprune=27)
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: MnfFlow, BrandCodeC, AlchRel, AirPressurer, Balling, ...
## Number of terms at each degree of interaction: 1 10 16
## GCV 0.0141598 RSS 27.25932 GRSq 0.5260713 RSq 0.5555922
## Call: earth(x=tbl_df[2055,32], y=c(8.36,8.26,8.9...), keepxy=TRUE, degree=2,
## nprune=27)
##
## coefficients
## (Intercept) 8.3958776
## h(-0.215889-MnfFlow) 0.1780017
## h(HydPressure3- -1.29159) 0.0391037
## h(-0.728732-Temperature) -0.0805473
## h(Temperature- -0.728732) -0.0477016
## h(-0.658787-Balling) 0.3588441
## h(-0.832915-OxygenFiller) -0.0445311
## h(OxygenFiller- -0.832915) -0.0417267
## h(BowlSetpoint- -1.34947) 0.0724302
## h(0.640105-AlchRel) 0.0395901
## h(AlchRel-0.640105) 0.1293063
## h(1.05286-MnfFlow) * BrandCodeC -0.0902482
## h(MnfFlow-1.05286) * BrandCodeC -1.2433037
## h(-0.215889-MnfFlow) * h(PressureVacuum-0.386629) -0.0625259
## h(-0.215889-MnfFlow) * h(0.386629-PressureVacuum) -0.1069156
## h(-0.215889-MnfFlow) * h(AirPressurer-0.839882) -0.1043474
## h(-0.215889-MnfFlow) * h(0.839882-AirPressurer) 0.0605090
## h(MnfFlow- -0.215889) * h(0.640105-AlchRel) -0.0669114
## h(1.75812-CarbPressure1) * h(BowlSetpoint- -1.34947) -0.0120925
## h(0.1987-HydPressure1) * h(-0.658787-Balling) -0.2036569
## h(0.21892-FillerSpeed) * h(Balling- -0.658787) 0.0158849
## h(FillerSpeed-0.21892) * h(Balling- -0.658787) 0.2165359
## h(Temperature-1.27402) * h(BowlSetpoint- -1.34947) 0.0275573
## h(Usagecont-0.0816887) * h(Balling- -0.658787) -0.0381250
## h(Density-0.482509) * h(BowlSetpoint- -1.34947) -0.0578527
## h(0.386629-PressureVacuum) * h(0.640105-AlchRel) 0.0204706
## h(OxygenFiller- -0.832915) * h(AirPressurer- -0.700718) 0.0305074
##
## Selected 27 of 45 terms, and 15 of 35 predictors (nprune=27)
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: MnfFlow, BrandCodeC, AlchRel, AirPressurer, Balling, ...
## Number of terms at each degree of interaction: 1 10 16
## GCV 0.0141598 RSS 27.25932 GRSq 0.5260713 RSq 0.5555922
The plot of the final MARS model indicates increasing the number of terms leads to a lower RMSE.
Once again, the variable MnfFlow achieves an importance score of 100%. Only three variables reach a score of 50%: the aforementioned MnfFlow, BrandCodeC, and AlchRel.
The variable MnfFlow is influential in part because it interacts with many other variables in the final model. Due to the numerous terms involving MnfFlow, its overall relationship to PH is difficult to ascertain from the coefficients alone. The MnfFlow variable requires further investigation of the variable interactions to determine its impact on PH.
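One way to carry out that investigation (not part of the original workflow; it assumes the pdp package is installed and uses the tuned marsTuned object) is a partial dependence profile of predicted PH against MnfFlow:

library(pdp)

# Partial dependence of predicted PH on MnfFlow, averaging over the other predictors
pd_mnfflow <- partial(marsTuned, pred.var = "MnfFlow", train = sdata_trainX_pp)
plotPartial(pd_mnfflow)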
Overall, the MARS model reaffirms the importance of the variable MnfFlow and results in an RMSE and \(R^2\) in line with the previous models.
We also run a simple OLS regression on PH in terms of the scaled and centered variables. While its predictive performance is weaker than the other models on the training set, the coefficients are much easier to interpret.
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51932 -0.07897 0.00917 0.08626 0.44987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.586130 0.010428 823.364 < 2e-16 ***
## CarbVolume -0.005488 0.008389 -0.654 0.513047
## FillOunces -0.004105 0.003163 -1.298 0.194451
## PCVolume -0.006448 0.003678 -1.753 0.079740 .
## CarbPressure -0.003088 0.011767 -0.262 0.793033
## CarbTemp 0.008475 0.010712 0.791 0.428936
## PSC -0.003168 0.003147 -1.007 0.314160
## PSCFill -0.005762 0.003066 -1.879 0.060361 .
## PSCCO2 -0.006170 0.003020 -2.043 0.041221 *
## MnfFlow -0.089502 0.006520 -13.728 < 2e-16 ***
## CarbPressure1 0.030465 0.003680 8.278 2.25e-16 ***
## FillPressure 0.005601 0.004501 1.244 0.213528
## HydPressure1 0.004482 0.005177 0.866 0.386658
## HydPressure2 -0.027126 0.009963 -2.723 0.006533 **
## HydPressure3 0.059090 0.010506 5.624 2.12e-08 ***
## HydPressure4 0.001523 0.004624 0.329 0.741982
## FillerLevel -0.021768 0.008733 -2.493 0.012760 *
## FillerSpeed 0.003863 0.004923 0.785 0.432809
## Temperature -0.019234 0.003564 -5.397 7.59e-08 ***
## Usagecont -0.020299 0.003911 -5.191 2.31e-07 ***
## CarbFlow 0.012499 0.004515 2.768 0.005685 **
## Density -0.018774 0.009023 -2.081 0.037598 *
## MFR 0.002030 0.003898 0.521 0.602651
## Balling -0.091337 0.014856 -6.148 9.43e-10 ***
## PressureVacuum -0.016446 0.004588 -3.585 0.000345 ***
## OxygenFiller -0.015182 0.004217 -3.600 0.000326 ***
## BowlSetpoint 0.053190 0.009135 5.823 6.72e-09 ***
## PressureSetpoint -0.013892 0.004500 -3.087 0.002047 **
## AirPressurer -0.002649 0.003204 -0.827 0.408420
## AlchRel 0.034578 0.011814 2.927 0.003463 **
## CarbRel 0.005015 0.006593 0.761 0.447003
## BallingLvl 0.097856 0.017857 5.480 4.79e-08 ***
## BrandCodeA -0.095705 0.025430 -3.764 0.000172 ***
## BrandCodeC -0.139671 0.010233 -13.649 < 2e-16 ***
## BrandCodeD -0.043954 0.029625 -1.484 0.138054
## BrandCodeE -0.057782 0.015435 -3.744 0.000186 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1325 on 2019 degrees of freedom
## Multiple R-squared: 0.4223, Adjusted R-squared: 0.4123
## F-statistic: 42.16 on 35 and 2019 DF, p-value: < 2.2e-16
The most significant variables are MnfFlow, CarbPressure1, HydPressure3, Temperature, Usagecont, Balling, PressureVacuum, OxygenFiller, BowlSetpoint, BallingLvl, and BrandCode. Seven of these 11 variables are found in the top 10 most important variables of the Cubist model. Upon closer evaluation, the median values of the variable coefficients from the Cubist top 10 match the signs of the corresponding coefficients in the linear regression model. This finding further supports the assessment of the Cubist model's variable impacts on the dependent variable PH.
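That agreement can also be checked programmatically; the sketch below (assuming the linearTune object and the coef_piv table built for the Cubist coefficient boxplot in the appendix code) compares the sign of each OLS coefficient with the sign of the median Cubist coefficient for the same predictor:

# Sign of each OLS coefficient vs. the median Cubist coefficient for the same predictor
ols_coefs <- coef(linearTune$finalModel)[-1]   # drop the intercept

cubist_signs <- coef_piv %>%
  group_by(predictor) %>%
  summarise(cubist_sign = sign(median(coef)))

tibble(predictor = names(ols_coefs), ols_sign = sign(ols_coefs)) %>%
  inner_join(cubist_signs, by = "predictor") %>%
  mutate(agree = ols_sign == cubist_sign)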
We summarize the plot of coefficients below.
The diagnostic plots of the OLS model suggest that the histogram of residuals is relatively symmetric and looks normal.
The QQ plot suggests the model fit does not fully capture the fat tails possibly due to a small number of outliers.
Overall, the linear regression model does not perform as well as the Cubist model but does provide a valuable sanity check and baseline to the results from the more advanced models.
Based on the above results, we summarize the training and test RMSE and \(R^2\) for all models in the table below.
# Make a list of model results, collect them from caret for both training and test data,
# and compare them jointly in a kable()
model_list = list( rpartTune , marsTuned , gbmTune, cubistTune, linearTune)

model_stats = data.frame()

for( modelObj in model_list )
{
  if( modelObj$method != 'lm' ){
    pred_data <- as.numeric( predict( modelObj, newdata=sdata_testX_pp ))
  } else {
    pred_data <- as.numeric( predict( modelObj, newdata=sdata_test_pp ))
  }

  output <- data.frame( modelName = modelObj$method ,
                        trainRMSE = getTrainPerf(modelObj)[1,1] ,
                        trainR2 = getTrainPerf(modelObj)[1,2] ,
                        testRMSE = caret::RMSE( pred_data, sdata_test_pp$PH) ,
                        testR2 = caret::R2( pred_data, sdata_test_pp$PH)
                        )

  model_stats <- rbind(model_stats, output)
}
Model Results Comparison
modelName | trainRMSE | trainR2 | testRMSE | testR2 |
---|---|---|---|---|
cubist | 0.095 | 0.694 | 0.090 | 0.722 |
rpart2 | 0.133 | 0.425 | 0.125 | 0.474 |
lm | 0.134 | 0.396 | 0.127 | 0.449 |
gbm | 0.132 | 0.456 | 0.151 | 0.342 |
earth | 0.125 | 0.486 | 0.153 | 0.279 |
We may also inspect the plot of the predicted versus the observed data for the 5 models. It is clear below that the Cubist model outperforms all others on both the training set (upper row) and the test set (lower row).
We conclude that the best model is the Cubist model, but the linear regression, MARS, and CART models provide useful insight into the main predictors.
The top 5 variables that impact PH are:
1. MnfFlow - negatively related to the PH level, based on the OLS model, MARS, and the Cubist median plot.
2. BallingLvl - positively related to the PH level.
3. PressureVacuum - negatively related to PH.
4. AlchRel - positively related to PH.
5. Balling - negatively related to PH.
Lastly, we observe that Brand is probably not a manufacturing predictor that can be controlled. Most likely, Brand refers to a recipe for the final product. However, knowing the desired optimal PH for each Brand may help us control the most influential variables that drive PH, thereby providing better quality control. While there are over 30 predictors in the dataset, we recommend focusing on the top 10 most influential predictors. Most of the models agree substantially on which predictors belong in the top 10.
For example, the linear regression coefficients agree with the signs and variable importance of the top 10 predictors of the Cubist model. The important predictors of the OLS model also have the highest statistical significance, consistent with the variable importance measures from caret.
## Cubist
##
## 2567 samples
## 32 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2310, 2309, 2310, 2310, 2310, 2311, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 10 0 0.09786899 0.6814835 0.07063050
## 10 1 0.10347326 0.6605109 0.07157433
## 10 5 0.09205621 0.7148932 0.06565672
## 10 9 0.09275054 0.7104184 0.06629558
## 50 0 0.09733927 0.6877932 0.07034005
## 50 1 0.10284267 0.6632483 0.07070141
## 50 5 0.09105650 0.7209833 0.06455347
## 50 9 0.09179098 0.7169196 0.06532125
## 100 0 0.09682465 0.6925441 0.07016957
## 100 1 0.10246532 0.6656340 0.07043498
## 100 5 0.09042487 0.7248834 0.06419195
## 100 9 0.09121584 0.7206745 0.06503557
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 100 and neighbors = 5.
## [1] 0.09042487
## [1] 0.7248834
## [1] 8.553383 8.416589 8.484277 8.651452 8.430888 8.520683 8.464472 8.500473
## [9] 8.533825 8.642326 8.481946 8.513762 8.484258 8.562369 8.253350 8.676709
## [17] 8.508010 8.425731 8.456175 8.688740 8.687377 8.773614 8.658579 8.455132
## [25] 8.580982 8.493421 8.357762 8.653392 8.730004 8.739744 8.614525 8.791067
## [33] 8.657874 8.693810 8.684228 8.568902 8.508800 8.523384 8.651680 8.814646
## [41] 8.789721 8.406401 8.174829 8.301746 8.767859 8.762051 8.762412 8.744424
## [49] 8.757907 8.791381 8.694399 8.632868 8.693358 8.807720 8.808817 8.768996
## [57] 8.478882 8.436927 8.740316 8.800134 8.814241 8.727329 8.821269 8.821293
## [65] 8.787587 8.784393 8.537858 8.572862 8.522822 8.500213 8.414266 8.274856
## [73] 8.532194 8.499533 8.497195 8.504463 8.522461 8.732538 8.709569 8.696074
## [81] 8.828541 8.809374 8.774046 8.738761 8.834514 8.772323 8.752558 8.662767
## [89] 8.576780 8.724874 8.646579 9.032428 8.443877 8.493906 8.582039 8.622660
## [97] 8.631973 8.676289 8.704071 8.681482 8.686296 8.760990 8.683686 8.704911
## [105] 8.826912 8.782175 8.524895 8.464222 8.538521 8.513417 8.693749 8.678286
## [113] 8.727077 8.713489 8.642420 8.747110 8.738415 8.780863 8.731585 8.704364
## [121] 8.703690 8.771915 8.788529 8.689462 8.560857 8.477674 8.485120 8.301046
## [129] 8.441027 8.476292 8.508128 8.520318 8.475845 8.499913 8.433493 8.488183
## [137] 8.502113 8.435307 8.413801 8.441218 8.462484 8.477620 8.455779 8.522523
## [145] 8.388823 8.556230 8.488006 8.600226 8.546601 8.595560 8.623026 8.622278
## [153] 8.483624 8.716797 8.448185 8.452550 8.439169 8.525030 8.553894 8.560632
## [161] 8.601017 8.627051 8.631309 8.669895 8.650928 8.582525 8.481905 8.776589
## [169] 8.688186 8.830791 8.479057 8.565610 8.293248 8.450176 8.482644 8.568878
## [177] 8.418387 8.441730 8.507828 8.456392 8.425079 8.525384 8.504602 8.496831
## [185] 8.518225 8.313547 8.315780 8.503910 8.440440 8.322478 8.361710 8.519551
## [193] 8.500801 8.525855 8.471058 8.272171 8.304795 8.189250 8.225987 8.475965
## [201] 8.390318 8.437284 8.519439 8.446362 8.373441 8.497079 8.504481 8.430357
## [209] 8.425136 8.236720 8.303498 8.340604 8.454713 8.509512 8.484015 8.551454
## [217] 8.354124 8.372385 8.516407 8.484663 8.537519 8.516002 8.400567 8.338906
## [225] 8.606352 8.475992 8.436245 8.394827 8.532536 8.460215 8.510155 8.471123
## [233] 8.514993 8.584372 8.599546 8.494669 8.657253 8.696697 8.466996 8.509019
## [241] 8.478971 8.569809 8.481023 8.543774 8.392702 8.441716 8.388472 8.501712
## [249] 8.589886 8.488894 8.406670 8.469671 8.534724 8.541173 8.549125 8.438866
## [257] 8.483340 8.713450 8.524870 8.523258 8.640900 8.616673 8.491389 8.613351
## [265] 8.271590 8.253966 8.119421
We summarize all the R code used in this project in this appendix for ease of reading.
knitr::opts_chunk$set(echo = FALSE)
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages( library(tidyverse))
suppressPackageStartupMessages(library(kableExtra))
library(cowplot)
library(skimr)
library(RColorBrewer)
library(caret)
library(Cubist)
library(rpart)
suppressPackageStartupMessages(library(gbm))
library(party)
library(tictoc)
tic()
.code-bg {
  background-color: lightgray;
}
sdata_train_pp = read_csv("sdata_train_pp.csv",
                          col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>%
  as_tibble() %>% remove_rownames()

sdata_test_pp = read_csv("sdata_test_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>%
  as_tibble() %>% remove_rownames()

sdata_trainX_pp = sdata_train_pp %>% select( -PH, -id )
sdata_testX_pp  = sdata_test_pp %>% select( -PH , -id )

tuningFull = TRUE # Set to TRUE if running the lengthy tuning process for all models

set.seed(1027)
gbmControlFull <- trainControl(method = "repeatedcv",
                               number = 10, # for debug mode, set this to a low number (1)
                               repeats = 10 ,
                               selectionFunction = "best",
                               verboseIter = FALSE)

gbmControlLight <- trainControl(method = "repeatedcv",
                                number = 5, # for debug mode, set this to a low number (1)
                                repeats = 1 ,
                                selectionFunction = "best",
                                verboseIter = FALSE)

if(tuningFull == TRUE)
{
  gbmControl = gbmControlFull
  gbmTune = expand.grid( n.trees = seq(250, 1000, by = 250),
                         shrinkage = c( 0.02, 0.05, 0.1) ,
                         interaction.depth = c( 3, 7) ,
                         n.minobsinnode = 15
                         )
} else {
  gbmControl = gbmControlLight
  gbmTune = expand.grid( n.trees = c( 2000),
                         shrinkage = c( 0.02, 0.05 , 0.1 ) ,
                         interaction.depth = c( 3, 7) ,
                         n.minobsinnode = 4
                         )
}

(gbmTune = caret::train( x = sdata_trainX_pp,
                         y = sdata_train_pp$PH ,
                         method = "gbm",
                         tuneGrid = gbmTune ,
                         verbose = FALSE,
                         metric = "RMSE" ,
                         trControl = gbmControl ) )
gbm_plot <- ggplot(gbmTune) + labs(title = "Gradient Boosted Trees - Tuning")
gbm_plot
ggsave("GradientBoostedTreesTurning.jpg", gbm_plot)

gbm_vi_plot <- ggplot(varImp(gbmTune, numTrees = gbmTune$finalModel$n.trees) ) + labs(title = "Gradient Boosted Trees - Variance Importance")
gbm_vi_plot
ggsave("GbmVarImp.jpg", gbm_vi_plot)
set.seed(1027)

cubistControlFull <- trainControl(method = "cv" , selectionFunction = "best")
tuneGridFull <- expand.grid( committees = c( 10, 50, 100 ) ,
                             neighbors = c( 0, 1, 5, 9 )
                             )

cubistControlLight <- trainControl(method = "cv" , selectionFunction = "best")
tuneGridLight <- expand.grid( committees = c( 10, 20 ) ,
                              neighbors = c( 0, 5 ) )

if(tuningFull == TRUE)
{
  cubistControl = cubistControlFull
  cubistGrid = tuneGridFull
} else {
  cubistControl = cubistControlLight
  cubistGrid = tuneGridLight
}

(cubistTune = caret::train( x = sdata_trainX_pp,
                            y = sdata_train_pp$PH ,
                            method = "cubist",
                            tuneGrid = cubistGrid ,
                            verbose = FALSE,
                            metric = "RMSE" ,
                            trControl = cubistControl ) )

cubist_plot <- ggplot(cubistTune) + labs(title="Cubist Model Tuning")
cubist_plot
ggsave("CubistTuning.jpg", cubist_plot)

cubist_vi_plot <- ggplot(varImp(cubistTune), top = 20) + labs(title = "Cubist Model Variable Importance")
cubist_vi_plot
ggsave("CubistVarImp.jpg", cubist_vi_plot)

cubistTune$finalModel
cubistTune$finalModel$usage %>% arrange( desc(Model + Conditions)) %>%
  kable(caption = "Cubist Predictor Usage") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
cubist_splitsplot <- dotplot( cubistTune$finalModel, what = "splits" )
cubist_splitsplot

# Output to file
jpeg(file="CubistSplitsPlot.jpg")
cubist_splitsplot
dev.off()

cubist_dotplot <- dotplot( cubistTune$finalModel,
                           what = "coefs",
                           between = list(x = 0.2, y = 0.5) ,
                           scales = list(x = list(relation = "free"),
                                         y = list(cex = 0.25) ) )
cubist_dotplot

jpeg(file="CubistDotPlot.jpg")
cubist_dotplot   # write the coefficient dotplot (not the splits plot) to this file
dev.off()
cubistTune$finalModel$coefficients %>%
  rownames_to_column(var="id") %>%
  select(-committee, -rule) %>%
  pivot_longer(!id, names_to = "predictor", values_to = "coef" ) %>%
  filter( !is.na(coef)) %>%
  filter( predictor != '(Intercept)' ) %>%
  filter( abs(coef) < 20 ) -> coef_piv

cubist_boxplot <- coef_piv %>%
  ggplot() +
  geom_boxplot(
    aes(x = reorder(predictor, coef, FUN = median, na.rm = TRUE ), # order the boxplots by median
        y = coef ), outlier.shape = NA ) + # strip out the outliers
  coord_flip(ylim=c(-.5,.5)) +
  labs(title = "Distribution of Coefficients of Predictors",
       subtitle = "from Cubist Committee/Rules",
       x = "Predictors" ,
       y = "Coefficient in Linear Model"
       )
cubist_boxplot
ggsave("CubistBoxPlot.jpg", cubist_boxplot)
set.seed(10393)
library(rpart)

cartGrid = expand.grid( maxdepth = seq( 1, 20, by = 1 ) )

cartControl <- trainControl(method = "boot",
                            number = 30,
                            selectionFunction = "best")

(rpartTune = train( x = sdata_trainX_pp,
                    y = sdata_train_pp$PH ,
                    method = "rpart2" ,
                    metric = "RMSE" ,
                    tuneGrid = cartGrid ,
                    trControl = cartControl ) )

cart_tuning_plot <- ggplot(rpartTune) + labs(title="CART Tree Tuning")
cart_tuning_plot
ggsave("CartTuningPlot.jpg", cart_tuning_plot)

cart_vi_plot <- ggplot(varImp( rpartTune), top = 20 ) + labs(title = "CART Tree Variable Importance")
cart_vi_plot
ggsave("CartVarImp.jpg", cart_vi_plot)

library(rpart.plot)
library(partykit)
rpart.plot(rpartTune$finalModel, digits = 3)
if( tuningFull == TRUE )
{
  marsGrid = expand.grid( .degree = 1:2, .nprune = seq(2,27, by=5) )
  marsControl = trainControl(method = "cv",
                             number = 10,
                             selectionFunction = "best",
                             verboseIter = FALSE)
} else {
  marsGrid = expand.grid( .degree = 1:2, .nprune = seq(2,25, by = 10 ) )
  marsControl = trainControl(method = "cv",
                             number = 10,
                             selectionFunction = "best",
                             verboseIter = FALSE)
}

set.seed(1000)
marsTuned = caret::train(x = sdata_trainX_pp ,
                         y = sdata_train_pp$PH,
                         method = "earth" ,
                         metric = "RMSE" ,
                         tuneGrid = marsGrid ,
                         trControl = marsControl )
marsTuned$finalModel

summary(marsTuned)

mars_tuning_plot <- ggplot(marsTuned) + labs(title="MARS Model Tuning" )
mars_tuning_plot
ggsave("MarsTuningPlot.jpg", mars_tuning_plot)

mars_vi_plot <- ggplot( varImp(marsTuned), 20 ) + labs(title = "MARS Model Variable Importance")
mars_vi_plot
ggsave("MarsVarImp.jpg", mars_vi_plot)
set.seed(123)

train.control <- trainControl( method = "cv", number = 5)

linearTune = train( PH ~ . -id,
                    data = sdata_train_pp,
                    method = "lm",
                    tuneGrid = data.frame(intercept = TRUE),
                    trControl = train.control)
summary(linearTune$finalModel)

par(mfrow=c(2,2))
lm_final_model_plot <- plot(linearTune$finalModel)
lm_final_model_plot
ggsave("LinearModelFinalPlot.jpg", lm_final_model_plot)
hist(linearTune$finalModel$residuals)
# Make a list of model results, collect them from caret for both training and test data,
# and compare them jointly in a kable()
model_list = list( rpartTune , marsTuned , gbmTune, cubistTune, linearTune)

model_stats = data.frame()

for( modelObj in model_list )
{
  if( modelObj$method != 'lm' ){
    pred_data <- as.numeric( predict( modelObj, newdata=sdata_testX_pp ))
  } else {
    pred_data <- as.numeric( predict( modelObj, newdata=sdata_test_pp ))
  }

  output <- data.frame( modelName = modelObj$method ,
                        trainRMSE = getTrainPerf(modelObj)[1,1] ,
                        trainR2 = getTrainPerf(modelObj)[1,2] ,
                        testRMSE = caret::RMSE( pred_data, sdata_test_pp$PH) ,
                        testR2 = caret::R2( pred_data, sdata_test_pp$PH)
                        )

  model_stats <- rbind(model_stats, output)
}

results_table <- model_stats %>% as_tibble() %>%
  arrange( testRMSE ) %>%
  kable( digits = 3, caption = "Model Results Comparison") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

results_table
save_kable(results_table, "ModelResultsTable.jpg")
predVals = extractPrediction(list(marsTuned, rpartTune, cubistTune, gbmTune), testX = sdata_testX_pp , testY = sdata_test_pp$PH )

# Attempting to hack into the extract prediction object with the LM data
# Cols: obs pred model dataType object
pred_lm = predict(linearTune, newdata = sdata_test_pp)
pred_lm_train <- linearTune$finalModel$fitted.values

# Create placeholder data for the training values
model_l <- rep("LM", length = 2055)
dt_l <- rep("Training", length = 2055)
obj_l <- rep("Object9", length = 2055)

# Create dataframe of the training data of LM
lm_train_preds <- as.data.frame(cbind(obs=sdata_train_pp$PH, pred=pred_lm_train, model=model_l, dataType=dt_l, object=obj_l), row.names = NULL)

# Create placeholder data for the test values
model_l <- rep("LM", length = 512)
dt_l <- rep("Test", length = 512)
obj_l <- rep("Object10", length = 512)

# Create dataframe of the test data of LM
lm_test_preds <- as.data.frame(cbind(obs=sdata_test_pp$PH, pred=pred_lm, model=model_l, dataType=dt_l, object=obj_l), row.names = NULL)

# Combine all data for plotting
all_preds <- rbind(predVals, lm_train_preds, lm_test_preds)

# Ensure obs and preds are numerics
all_preds$obs <- as.numeric(all_preds$obs)
all_preds$pred <- as.numeric(all_preds$pred)

# Plot using Caret built-in reporting.
# --------------------------------------
obsVsPredsPlot <- plotObsVsPred(all_preds)
obsVsPredsPlot
#ggsave("ObsVsPredsPlot.jpg", obsVsPredsPlot)
# Combine the initial training and test datasets into 1
sdata_trainfull_pp <- rbind(sdata_train_pp, sdata_test_pp)
sdata_trainfullX_pp <- sdata_trainfull_pp %>% select(-PH , -id)

edata_test_pp = read_csv("edata_test_pp.csv",
                         col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>%
  as_tibble() %>% remove_rownames()

# Train the Cubist model
set.seed(1027)
if(tuningFull == TRUE)
{
  cubistControl = cubistControlFull
  cubistGrid = tuneGridFull
} else {
  cubistControl = cubistControlLight
  cubistGrid = tuneGridLight
}

(cubistTune = caret::train( x = sdata_trainfullX_pp,
                            y = sdata_trainfull_pp$PH ,
                            method = "cubist",
                            tuneGrid = cubistGrid ,
                            verbose = FALSE,
                            metric = "RMSE" ,
                            trControl = cubistControl ) )

# Make predictions on the provided dataset
final_pred_data <- as.numeric( predict( cubistTune, newdata=edata_test_pp ))

# Output prediction performance
full_cubist_train_rmse <- getTrainPerf(cubistTune)[1,1]
full_cubist_train_r2 <- getTrainPerf(cubistTune)[1,2]

full_cubist_train_rmse
full_cubist_train_r2

# Output the results to file
# TODO
final_pred_data

toc()
Run notes: with tuningFull = TRUE, the full tuning run completed in 4120.739 seconds; an earlier run (9am on 12/8/2021) with tuningFull = FALSE completed in 589.958 seconds.