Executive Summary
How do production processes affect the pH levels in our soda brands? What is the best way to increase or decrease pH?
This analysis seeks to answer these questions. Production data are subjected to statistical learning models and evaluated. The results indicate that pH can be increased by keeping:
- mnf_flow < -50
- usage_cont < 20
- oxygen_filler < 0.1
- pressure_vacuum near -5
- filler_level > 115
- carb_rel > 5.60
Similarly, we can minimize pH by keeping:
- mnf_flow > -50
- usage_cont > 24
- oxygen_filler > 0.2
- pressure_vacuum less than -6 or greater than -4
- filler_level near 80
- carb_rel near 5
Introduction
This analysis begins with the usual first steps: examining the data for missing observations, exploring the data, and clustering the variables. The data are split into a train (80 percent) and test (20 percent) set. Models are trained on the train set, using 100 repeats of 10-fold cross-validation to provide an unbiased estimate of the accuracy of each model.
I develop three classes of models:
- Dummy models serve as benchmarks
- Models that accept all available information and algorithmically choose the optimal variable set
- Manually reduced variable models
These models are finally evaluated by their performance on the test set in terms of root mean squared error (\(RMSE\)).
Ultimately, a random forest model provides the best performance.
The Data
The first step is to split all the data into train and test sets. The latter will not be considered in any analysis until the very end, where it will provide an accurate assessment of model performance. I remove the full dataset df immediately after the split. There are also a handful of observations where the dependent variable ph is not recorded; these are removed:
df <- readxl::read_excel('./data/StudentData.xlsx')
colnames(df) <- sapply(colnames(df), function(x) tolower(str_replace_all(x, ' ', '_')))
df <- df %>% tidyr::drop_na(ph)
set.seed(1804)
train_ix <- createDataPartition(df$ph, p=0.8, list = FALSE)
train <- df %>% slice(train_ix)
test <- df %>% slice(-train_ix)
rm(df)
There are some missing values in the training data, though none too severe. The variable with the most missing observations, mfr, is only missing about 8 percent of its values.
## mnf_flow density balling pressure_vacuum
## 0 0 0 0
## ph air_pressurer bowl_setpoint balling_lvl
## 0 0 1 1
## carb_flow usage_cont alch_rel carb_volume
## 2 5 6 7
## temperature carb_rel hyd_pressure1 oxygen_filler
## 8 8 9 9
## pressure_setpoint filler_level hyd_pressure2 hyd_pressure3
## 10 12 13 13
## fill_pressure carb_temp carb_pressure psc_fill
## 14 17 19 19
## carb_pressure1 hyd_pressure4 psc fill_ounces
## 23 23 28 31
## pc_volume psc_co2 filler_speed brand_code
## 33 33 40 106
## mfr
## 157
To handle these, I use the newer package missRanger. Built on a fast implementation of the random forest algorithm, it imputes missing values with predictions based on the remaining variables. It iterates over all variables until the out-of-bag prediction error stops improving. Note that I do not use the dependent variable ph for the purposes of imputation:
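A minimal sketch of the imputation call is shown below; the pmm.k setting and the factor conversion of brand_code are assumptions, not taken from the original script:

# Impute every variable from all the others, excluding ph as a predictor
# (pmm.k and the factor conversion of brand_code are assumed settings)
library(missRanger)
train <- train %>% mutate(brand_code = as.factor(brand_code))
train_imp <- missRanger(train, formula = . ~ . - ph, pmm.k = 3, seed = 1804)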
I compared the summary statistics of the original train data set with the imputed data frame and found them satisfactory.
Data Exploration
The dependent variable ph is approximately normal. It is near-symmetric, with mean of 8.55 and median of 8.54. Almost all observations are between 8 and 9.
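A quick way to inspect this distribution (a sketch; the binwidth is illustrative):

ggplot(train_imp, aes(x = ph)) +
  geom_histogram(binwidth = 0.05) +
  theme_bw()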
There are two observations with large outliers. One has an ‘isolated’ ph level beyond three standard deviations from the mean. The other is an isolated carb_rel observation, over 4 standard deviations from the mean and one standard deviation larger than the next closest observation. These two outlying observations are removed.
ph_lim <- mean(train_imp$ph) + (3 * sd(train_imp$ph))
carb_rel_lim <- mean(train_imp$carb_rel) + (4 * sd(train_imp$carb_rel))
train_imp <- train_imp %>%
filter(ph <= ph_lim,
carb_rel <= carb_rel_lim)The approximately normal distribution of ph hides some complexity, however. Each of the brand codes have a different shape, although most of their central tendencies are close to the overall center. Several of the distributions have two peaks, indicating there may be two mechanisms of production, or possibly complex interactions across several variables.
Check how other variables vary across brand code:
train_imp %>%
tidyr::gather(-brand_code, key = "var", value = "value") %>%
ggplot(aes(x = brand_code, y = value, color=brand_code)) +
geom_boxplot() +
facet_wrap(~ var, scales = "free") +
  theme_bw()
A pattern emerges: Brands A and D have similar distributions, and B and C have similar distributions. This is most visible with alch_rel, balling, balling_lvl, and density. This suggests there may be interaction effects between these variables and brand code.
Next, examine how the brand codes vary with ph:
train_imp %>%
tidyr::gather(-ph, -brand_code, key = "var", value = "value") %>%
ggplot(aes(x = value, y = ph, color=factor(brand_code))) +
geom_point(alpha=0.33) +
geom_smooth() +
facet_wrap(~ var, scales = "free") +
  theme_bw()
The brand codes appear to be related to the IVs and ph in similar ways, though the intercepts of the loess lines differ. Most of the four potential interactions noted above do not appear particularly strong in this plot. However, I will retain density and carb_rel for potential interactions.
Although we have plenty of data, it could still be helpful to remove redundant variables. I use Hmisc::redun() to identify variables that can be explained by other variables with \(R^2 > .90\):
rd <- redun( ~ air_pressurer + alch_rel + balling + balling_lvl + carb_flow +
carb_pressure + carb_pressure1 + carb_rel + carb_temp +
carb_volume + density + fill_ounces + fill_pressure +
filler_level + filler_speed + hyd_pressure1 + hyd_pressure2 +
hyd_pressure3 + hyd_pressure4 + mfr + mnf_flow + oxygen_filler +
pc_volume + pressure_vacuum + psc + psc_co2 + psc_fill +
             temperature + usage_cont, train_imp, r2=0.9, pr=TRUE)
## 2053 observations used in analysis
## Step 1 of a maximum of 29
## Step 2 of a maximum of 29
## Step 3 of a maximum of 29
## Step 4 of a maximum of 29
## Step 5 of a maximum of 29
## Step 6 of a maximum of 29
## Step 7 of a maximum of 29
##
## Redundancy Analysis
##
## redun(formula = ~air_pressurer + alch_rel + balling + balling_lvl +
## carb_flow + carb_pressure + carb_pressure1 + carb_rel + carb_temp +
## carb_volume + density + fill_ounces + fill_pressure + filler_level +
## filler_speed + hyd_pressure1 + hyd_pressure2 + hyd_pressure3 +
## hyd_pressure4 + mfr + mnf_flow + oxygen_filler + pc_volume +
## pressure_vacuum + psc + psc_co2 + psc_fill + temperature +
## usage_cont, data = train_imp, r2 = 0.9, pr = TRUE)
##
## n: 2053 p: 29 nk: 3
##
## Number of NAs: 0
##
## Transformation of target variables forced to be linear
##
## R-squared cutoff: 0.9 Type: ordinary
##
## R^2 with which each variable can be predicted from all other variables:
##
## air_pressurer alch_rel balling balling_lvl
## 0.232 0.919 0.988 0.986
## carb_flow carb_pressure carb_pressure1 carb_rel
## 0.708 0.935 0.391 0.826
## carb_temp carb_volume density fill_ounces
## 0.923 0.875 0.952 0.199
## fill_pressure filler_level filler_speed hyd_pressure1
## 0.435 0.644 0.910 0.768
## hyd_pressure2 hyd_pressure3 hyd_pressure4 mfr
## 0.941 0.948 0.620 0.864
## mnf_flow oxygen_filler pc_volume pressure_vacuum
## 0.828 0.370 0.415 0.671
## psc psc_co2 psc_fill temperature
## 0.145 0.076 0.144 0.340
## usage_cont
## 0.431
##
## Rendundant variables:
##
## balling balling_lvl hyd_pressure3 carb_pressure filler_speed alch_rel
##
## Predicted from variables:
##
## air_pressurer carb_flow carb_pressure1 carb_rel carb_temp carb_volume density fill_ounces fill_pressure filler_level hyd_pressure1 hyd_pressure2 hyd_pressure4 mfr mnf_flow oxygen_filler pc_volume pressure_vacuum psc psc_co2 psc_fill temperature usage_cont
##
## Variable Deleted R^2 R^2 after later deletions
## 1 balling 0.988 0.969 0.967 0.967 0.967 0.955
## 2 balling_lvl 0.963 0.962 0.962 0.962 0.945
## 3 hyd_pressure3 0.945 0.944 0.943 0.943
## 4 carb_pressure 0.935 0.935 0.935
## 5 filler_speed 0.908 0.907
## 6 alch_rel 0.901
This analysis suggests we can drop the following variables: balling, balling_lvl, hyd_pressure3, carb_pressure, filler_speed, and alch_rel.
To get a better idea of the structure of the variables, we can use variable clustering:
var_tree <- hclustvar(train_imp %>%
select(-ph, -balling, -balling_lvl, -hyd_pressure3,
-carb_pressure, -filler_speed, -alch_rel) %>%
select_if(is.numeric) %>%
as.matrix)
plot(var_tree)
Without background knowledge, it's hard to say what concept each cluster represents. But theoretically, each cluster of similar variables represents a different aspect of the same underlying concept.
This clustering is used later in this report for variable reduction.
Modeling
The number of independent variables in this analysis is on the edge of what is plausible for manual analysis. A dedicated data scientist could conceivably spend a few weeks closely examining each variable's relationship with ph and its interrelationships with the other independent variables.
For the purposes of this project, I use a combination of data reduction, parameter shrinkage, and algorithmic variable selection methods (excluding stepwise variable selection). Thus data exploration and variable selection are more tightly integrated with the process of modeling itself.
First, I develop dummy models with which we can compare more sophisticated models. If the more complex models fail to perform much better than these simpler models, it indicates something is wrong with the models or with our understanding.
Models will be judged by how well they minimize \(RMSE\) on the test set. I will also consider \(R^2\) and performance on the train set for diagnostic purposes.
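For reference, for \(n\) observations \(y_i\) with predictions \(\hat{y}_i\),

\[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \]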
We will use the caret cross-validation scheme with 10 folds and 100 repeats:
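The control object ctrl referenced by the models below is assumed to be defined along these lines:

# Repeated k-fold cross-validation: 10 folds, 100 repeats (assumed definition)
ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 100)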
Dummy models
\(M_0\): Predict the mean for each brand
This is a very simple model that predicts the brand code mean for each observation. The cross-validated \(R^2 = 0.1080\) and \(RMSE = 0.1615\). We should expect more sophisticated models to do substantially better than this one. If not, this simple model is to be preferred.
set.seed(1804)
m0 <- train(ph ~ brand_code,
data=train_imp,
method='lm',
trControl=ctrl,
metric='RMSE')
m0$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD
## 1 TRUE 0.1615239 0.1080345 0.1313097 0.006006724 0.03687632
## MAESD
## 1 0.004516335
\(M_1\): Linear regression with all variables, interactions
The second simplest possible model assumes all independent variables are important, linearly related to ph, and uncorrelated with each other. The redundant variables identified above are omitted, and interactions are included.
set.seed(1804)
m1 <- train(ph ~ air_pressurer + carb_flow + carb_pressure1 + carb_rel +
carb_temp + carb_volume + density + fill_ounces +
fill_pressure + filler_level + hyd_pressure1 + hyd_pressure2 +
hyd_pressure4 + mfr + mnf_flow + oxygen_filler + pc_volume +
pressure_vacuum + psc + psc_co2 + psc_fill + temperature +
usage_cont + density*brand_code + carb_rel*brand_code,
data=train_imp,
method='lm',
trControl=ctrl,
metric='RMSE')
m1$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 0.135467 0.3732627 0.1055916 0.00710571 0.05108998 0.005313304
This model performs fairly well, with \(R^2 = 0.3732\) and \(RMSE = 0.1354\). The summary shows that many variables are significant, including some of our interaction terms. The dummy variables for brand_code are not significant, however.
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47607 -0.07980 0.00843 0.08649 0.51211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.026e+01 1.133e+00 9.051 < 2e-16 ***
## air_pressurer -1.558e-04 2.614e-03 -0.060 0.95249
## carb_flow 1.152e-05 3.918e-06 2.941 0.00331 **
## carb_pressure1 5.757e-03 7.735e-04 7.443 1.45e-13 ***
## carb_rel 5.314e-02 1.190e-01 0.447 0.65524
## carb_temp 8.460e-04 7.621e-04 1.110 0.26705
## carb_volume -7.089e-02 5.725e-02 -1.238 0.21580
## density -1.740e-02 5.865e-02 -0.297 0.76670
## fill_ounces -5.525e-02 3.636e-02 -1.519 0.12883
## fill_pressure -4.600e-05 1.096e-03 -0.042 0.96654
## filler_level 1.726e-03 2.832e-04 6.094 1.31e-09 ***
## hyd_pressure1 -3.942e-05 4.123e-04 -0.096 0.92384
## hyd_pressure2 1.136e-03 4.044e-04 2.810 0.00500 **
## hyd_pressure4 -1.428e-04 3.576e-04 -0.399 0.68963
## mfr -1.636e-05 3.642e-05 -0.449 0.65341
## mnf_flow -6.632e-04 4.846e-05 -13.685 < 2e-16 ***
## oxygen_filler -3.280e-01 8.014e-02 -4.093 4.42e-05 ***
## pc_volume -4.667e-02 6.137e-02 -0.760 0.44712
## pressure_vacuum -1.055e-02 7.221e-03 -1.460 0.14433
## psc -1.388e-01 6.512e-02 -2.132 0.03316 *
## psc_co2 -9.738e-02 7.106e-02 -1.370 0.17072
## psc_fill -6.349e-03 2.611e-02 -0.243 0.80791
## temperature -1.759e-02 2.490e-03 -7.062 2.25e-12 ***
## usage_cont -7.257e-03 1.293e-03 -5.612 2.27e-08 ***
## brand_codeB -1.483e-01 7.222e-01 -0.205 0.83728
## brand_codeC -1.460e+00 8.905e-01 -1.639 0.10128
## brand_codeD -1.607e+00 8.303e-01 -1.935 0.05308 .
## `density:brand_codeB` -1.746e-01 6.761e-02 -2.583 0.00987 **
## `density:brand_codeC` 1.285e-01 8.031e-02 1.600 0.10970
## `density:brand_codeD` -1.781e-01 7.195e-02 -2.475 0.01339 *
## `carb_rel:brand_codeB` 6.522e-02 1.333e-01 0.489 0.62462
## `carb_rel:brand_codeC` 2.358e-01 1.664e-01 1.417 0.15659
## `carb_rel:brand_codeD` 3.541e-01 1.535e-01 2.306 0.02120 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.134 on 2020 degrees of freedom
## Multiple R-squared: 0.3944, Adjusted R-squared: 0.3848
## F-statistic: 41.12 on 32 and 2020 DF, p-value: < 2.2e-16
I examine the residuals of this model in the hopes that they will suggest corrections to the model. Below are the standard residual plots. For the most part, this model appears acceptable: the fitted-versus-residuals plot shows no relationship. The Q-Q plot, however, shows a worrying amount of deviation from the normal distribution.
I also used the car package to examine partial residuals. Most of them showed the desired no-pattern pattern. However, the following four variables are especially troublesome: for each of them, as the variable diverges from its mean, the absolute value of the error increases. My guess is that this is due to complex interactions between the variables.
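A sketch of these diagnostics; the use of car::crPlots for the partial residuals is an assumption and may require refitting the model outside caret:

# Standard lm diagnostics: fitted vs. residuals, Q-Q, scale-location, leverage
lm_fit <- m1$finalModel
par(mfrow = c(2, 2))
plot(lm_fit)
par(mfrow = c(1, 1))
# Partial (component + residual) plots for each term
car::crPlots(lm_fit)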
Algorithmic variable selection
\(M_2\): GLMnet
This model is an attempt to use coefficient shrinkage for variable selection. Ridge regression and the lasso are two examples of this approach. The glmnet package allows us to test both in one grid via the \(\alpha\) parameter (\(\alpha = 0\) for ridge, \(\alpha = 1\) for the lasso). Again, I omit the redundant variables and include interactions:
m2_grid <- data.frame(alpha=0:1, lambda=seq(0, .1, length=100))
m2 <- train(ph ~ air_pressurer + carb_flow + carb_pressure1 +
carb_rel + carb_temp + carb_volume + density + fill_ounces +
fill_pressure + filler_level + hyd_pressure1 + hyd_pressure2 +
hyd_pressure4 + mfr + mnf_flow + oxygen_filler + pc_volume +
pressure_vacuum + psc + psc_co2 + psc_fill + temperature +
usage_cont + density*brand_code + carb_rel*brand_code,
data=train_imp,
method='glmnet',
preProc = c('center', 'scale'),
trControl=ctrl,
tuneGrid=m2_grid,
metric='RMSE')
m2$bestTune
## alpha lambda
## 4 0 0.006060606
The best tune is a ridge regression (\(\alpha = 0\)) with regularization parameter \(\lambda = 0.0061\). This indicates little regularization is required, so the resulting model will be close to \(M_1\), the ordinary regression with interactions. Arguably this could be expected, as the redundancy analysis already eliminated the variables that shrinkage would otherwise have suppressed.
This model generates a cross-validated \(RMSE = 0.1385\) and \(R^2 = 0.3518\)—slightly worse than vanilla regression, but about equal.
\(M_3\): Partial Least Squares
Partial least squares regression also performs a form of variable reduction. It projects the independent variables \(X\) and the dependent variable \(Y\) into a lower-dimensional space that is optimal for prediction. Its mechanics are similar to principal components analysis, which can also be used for variable reduction.
m3_grid <- data.frame(.ncomp=2:26)
m3 <- train(ph ~ air_pressurer + carb_flow + carb_pressure1 +
carb_rel + carb_temp + carb_volume + density + fill_ounces +
fill_pressure + filler_level + hyd_pressure1 + hyd_pressure2 +
hyd_pressure4 + mfr + mnf_flow + oxygen_filler + pc_volume +
pressure_vacuum + psc + psc_co2 + psc_fill + temperature +
usage_cont + density*brand_code + carb_rel*brand_code,
data=train_imp,
method='pls',
trControl=ctrl,
tuneGrid=m3_grid,
metric='RMSE')
print(m3$bestTune)
## ncomp
## 20 21
The optimization routine finds that 21 components best fit the data. This results in a cross-validated \(RMSE = 0.1353\) and \(R^2 = 0.3746\)—just slightly better than linear regression.
\(M_4\): MARS
None of the previous variable reduction techniques have proven superior to linear regression. However, there is good reason a priori to believe the multivariate adaptive regression splines (MARS) technique may substantially improve on previous methods. None of the previous algorithms are really designed to handle nonlinear relationships or substantial interactions between variables—which we know (or at least surmise) are contained in this dataset. MARS, however, is designed for exactly this.
The grid search covers the degree of interactions under consideration and the number of terms retained in the final model. (Due to computational constraints, I use 10-fold cross-validation without repeats here.) Since we know there are interactions in the data, I do not consider models without interactions:
m4_grid <- expand.grid(degree=2:3, nprune=5:26)
m4 <- train(ph ~ air_pressurer + carb_flow + carb_pressure1 +
carb_rel + carb_temp + carb_volume + density + fill_ounces +
fill_pressure + filler_level + hyd_pressure1 + hyd_pressure2 +
hyd_pressure4 + mfr + mnf_flow + oxygen_filler + pc_volume +
pressure_vacuum + psc + psc_co2 + psc_fill + temperature +
usage_cont + brand_code,
data=train_imp,
method='earth',
trControl=trainControl(method='cv', number=10),
tuneGrid=m4_grid,
metric='RMSE')
print(m4$bestTune)
## nprune degree
## 41 23 3
The best tune is \(nprune = 23\) and \(degree = 3\), with a cross-validated \(RMSE = 0.1219\) and \(R^2 = 0.4949\). Our intuition was correct, and the MARS model's ability to capture complex interactions and nonlinearities produced superior performance.
From coef, we see where the algorithm has picked up on these complexities. For instance, the output suggests a ‘kink’ in mnf_flow near 0.2, which also interacts with pressure_vacuum.
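The coefficients below can be pulled from the fitted earth object (assumed call):

coef(m4$finalModel)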
## (Intercept)
## 8.3966151719
## h(mnf_flow-0.2)
## 0.0006900883
## h(0.2-mnf_flow)
## 0.0022994971
## h(0.2-mnf_flow) * h(-5-pressure_vacuum)
## -0.0015102289
## h(mnf_flow-0.2) * brand_codeD
## 0.0010688054
## h(146.6-mnf_flow) * brand_codeC
## -0.0005251704
\(M_5\): Random forest
The random forest algorithm is also known to handle interactions well. It builds many decision trees, often hundreds, and combines their predictions, thus reducing the variance of individual decision trees without increasing the bias of the model.
# m5_grid <- expand.grid(mtry=8:26)
# m5 <- train(ph ~ air_pressurer + carb_flow + carb_pressure1 +
# carb_rel + carb_temp + carb_volume + density + fill_ounces +
# fill_pressure + filler_level + hyd_pressure1 + hyd_pressure2 +
# hyd_pressure4 + mfr + mnf_flow + oxygen_filler + pc_volume +
# pressure_vacuum + psc + psc_co2 + psc_fill + temperature +
# usage_cont + brand_code,
# data=train_imp,
# method='rf',
# trControl=trainControl(method='cv', number=10),
# tuneGrid=m5_grid,
# metric='RMSE')
#saveRDS(m5, 'm5.rds')
m5 <- readRDS('m5.rds')
plot(m5)
## Random Forest
##
## 2053 samples
## 24 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1848, 1847, 1848, 1847, 1848, 1848, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 8 0.10249342 0.6615939 0.07652869
## 9 0.10211819 0.6623477 0.07594921
## 10 0.10143424 0.6659025 0.07530823
## 11 0.10079834 0.6692978 0.07498000
## 12 0.10054612 0.6696493 0.07443819
## 13 0.10077385 0.6672601 0.07466721
## 14 0.10040009 0.6691800 0.07425781
## 15 0.09996865 0.6715474 0.07389853
## 16 0.09992275 0.6712791 0.07383834
## 17 0.09971171 0.6719506 0.07362730
## 18 0.09947903 0.6731124 0.07340238
## 19 0.09949184 0.6722239 0.07336658
## 20 0.09928817 0.6733215 0.07325301
## 21 0.09946314 0.6708454 0.07317070
## 22 0.09945228 0.6703219 0.07313559
## 23 0.09924319 0.6720486 0.07297231
## 24 0.09930293 0.6710184 0.07305435
## 25 0.09937296 0.6696641 0.07303984
## 26 0.09931272 0.6699625 0.07283086
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 23.
The best model has mtry = 23, with cross-validated \(RMSE = 0.0992\) and \(R^2 = 0.6720\). Both measures are substantially better than previous models!
Variable clustering
The previous section focused on algorithmic variable selection, where the full (near-full) data set was fed to the algorithms, which were left to their own devices to determine which variables were important.
In this section, I manually reduce the variables based on variable clustering, and then apply statistical learning algorithms.
Recall the variable clustering tree from before:
I parse out 7 clusters based on this tree. For each cluster, I find each variable's Spearman rank correlation with ph and select the one with the strongest correlation as the cluster's 'representative' (a sketch of this step follows the variable list below). This results in the following variable set:
- carb_rel
- temperature
- psc_co2
- pressure_vacuum
- bowl_setpoint
- mnf_flow
- pc_volume
and brand_code.
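A sketch of this step, producing the Spearman correlations printed below; the use of cutreevar and the exact selection loop are assumptions:

# Cut the clustering tree into 7 clusters, then rank variables by |Spearman correlation| with ph
clusters <- cutreevar(var_tree, k = 7)$cluster
cors <- sapply(train_imp %>% select_if(is.numeric),
               function(x) cor(x, train_imp$ph, method = 'spearman'))
# Within each cluster, the variable with the largest absolute correlation is kept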
## carb_volume fill_ounces pc_volume carb_pressure
## 0.06047926 -0.08173385 0.08718200 0.07054020
## carb_temp psc psc_fill psc_co2
## 0.02648804 -0.08654489 -0.02715838 -0.06877569
## mnf_flow carb_pressure1 fill_pressure hyd_pressure1
## -0.48094249 -0.10466227 -0.25257874 -0.11828725
## hyd_pressure2 hyd_pressure3 hyd_pressure4 filler_level
## -0.19610311 -0.23563152 -0.13371287 0.35631838
## filler_speed temperature usage_cont carb_flow
## 0.04804262 -0.19964251 -0.35677925 0.04608193
## density mfr balling pressure_vacuum
## -0.01072597 0.03801699 -0.03507307 0.24883819
## ph oxygen_filler bowl_setpoint pressure_setpoint
## 1.00000000 0.18585281 0.40494642 -0.30793401
## air_pressurer alch_rel carb_rel balling_lvl
## -0.01209651 0.10068454 0.19043588 0.07626197
\(M_6\): Linear model with reduced variables
set.seed(1804)
m6 <- train(ph ~ brand_code*carb_rel + temperature + psc_co2 + pressure_vacuum +
bowl_setpoint + mnf_flow + pc_volume,
data=train_imp,
method='lm',
trControl=ctrl,
metric='RMSE')
m6$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD
## 1 TRUE 0.1394137 0.3355636 0.110679 0.006599649 0.04752279
## MAESD
## 1 0.005190041
We see the reduced model performs only slightly worse than the full linear model \(M_1\) on both \(RMSE\) and \(R^2\). This indicates that a much smaller set of variables is actually informative.
\(M_7\): Generalized additive model with reduced variables
GAMs function like generalized linear models (GLM), except the estimated relationships are not necessarily linear. Thus GAMs are more flexible and appropriate for modeling nonlinear relationships.
The caret package's support for GAMs is limited, so I fit this model directly with mgcv:
m7 <- mgcv::gam(ph ~ brand_code + brand_code:carb_rel + s(carb_rel, bs='ps') + s(temperature, bs='ps') +
s(psc_co2, bs='ps') + s(pressure_vacuum, bs='ps') +
s(bowl_setpoint, bs='ps') + s(mnf_flow, bs='ps') + s(pc_volume, bs='ps'),
data=train_imp, method="REML")
performance::rmse(m7)
## [1] 0.1308967
## $R2_Nagelkerke
## Nagelkerke's R2
## 0.4159687
This results in a non-cross-validated \(RMSE = 0.1309\) and \(R^2 = 0.4160\). This is superior to both \(M_1\), linear regression with all variables and interactions, and \(M_6\), linear regression with the reduced variable set. However, it is nowhere near the random forest model. Arguably, this suggests that the difficulty in modeling this data is due to interactions rather than nonlinearities.
Evaluation
In this section, each of the eight models developed is tested on the held-out test set. The highest-performing model will be analyzed in the next section to determine the best way to manipulate ph.
First, we have to fill in missing data in the same way as above:
## brand_code carb_volume fill_ounces pc_volume
## 14 3 7 6
## carb_pressure carb_temp psc psc_fill
## 8 9 5 4
## psc_co2 mnf_flow carb_pressure1 fill_pressure
## 6 0 9 4
## hyd_pressure1 hyd_pressure2 hyd_pressure3 hyd_pressure4
## 2 2 2 5
## filler_level filler_speed temperature usage_cont
## 4 14 4 0
## carb_flow density mfr balling
## 0 0 51 0
## pressure_vacuum ph oxygen_filler bowl_setpoint
## 0 0 2 1
## pressure_setpoint air_pressurer alch_rel carb_rel
## 2 0 1 0
## balling_lvl
## 0
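A sketch of the test-set imputation, mirroring the training procedure (settings assumed):

# Impute the test set the same way as the training set, again excluding ph as a predictor
test <- test %>% mutate(brand_code = as.factor(brand_code))
test_imp <- missRanger(test, formula = . ~ . - ph, pmm.k = 3, seed = 1804)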
Now, calculate performance metrics on the imputed test set:
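For the caret models, the test-set metrics can be computed along these lines (a sketch, with the random forest as an example and test_imp assumed from the previous step):

# RMSE, R-squared, and MAE of the random forest's predictions on the held-out set
postResample(pred = predict(m5, newdata = test_imp), obs = test_imp$ph)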
| Model | Description | Test \(RMSE\) | Test \(R^2\) |
|---|---|---|---|
| \(M_0\) | Dummy: Predict brand’s mean | 0.1641 | 0.1300 |
| \(M_1\) | LR, all IVs + interactions | 0.1401 | 0.3651 |
| \(M_2\) | Ridge, all IVs + interactions | 0.1400 | 0.3669 |
| \(M_3\) | PLS, all IVs + interactions | 0.1388 | 0.3774 |
| \(M_4\) | MARS, all IVs + interactions | 0.1229 | 0.5127 |
| \(M_5\) | Random forest, all IVs | 0.0990 | 0.6908 |
| \(M_6\) | LR, reduced IVs | 0.1445 | 0.3250 |
| \(M_7\) | GAM, reduced IVs | 0.1363 | 0.3996 |
Thus the random forest model performs the best, substantially better than the next closest model, which used MARS. Other interesting observations:
- For the most part, the performance on the test set is very close to the cross-validated performance on the training set.
- There is little difference between the vanilla regression model and the more complicated regressions (PLS, ridge, GAM). Only when the algorithm starts to introduce complex interactions do we see improved performance (MARS).
- The reduced-variable LR performs nearly as well as the full-variable model with only a handful of predictors. This suggests variable clustering successfully captured the key driving phenomena of the data set.
Recommendations and Conclusion
This section analyzes the model to develop recommendations for manipulating the pH levels of soft drinks.
The plot below shows how important each independent variable is to the final outcome. The variable mnf_flow dominates the ranking. The dummy variable for brand C is recorded as being very important, which is an interesting result.
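Such a ranking can be produced directly from the caret object (a sketch; the number of variables displayed is illustrative):

# Variable importance for the random forest model
plot(varImp(m5), top = 15)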
Let’s ‘zoom in’ on the most important variables to understand how changes in them affect pH:
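One way to produce these views is with partial dependence plots; the pdp package is an assumption here rather than the method used for the original figures:

# Partial dependence of predicted ph on mnf_flow according to the random forest
library(pdp)
partial(m5, pred.var = 'mnf_flow', plot = TRUE)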
Thus the model suggests we can increase pH by keeping:
- mnf_flow < -50
- usage_cont < 20
- oxygen_filler < 0.1
- pressure_vacuum near -5
- filler_level > 115
- carb_rel > 5.60
Similarly, we can minimize pH by keeping:
- mnf_flow > -50
- usage_cont > 24
- oxygen_filler > 0.2
- pressure_vacuum less than -6 or greater than -4
- filler_level near 80
- carb_rel near 5